Troubleshooting cloud services and infrastructure is an ongoing challenge for organizations of all sizes. As organizations adopt more cloud services and their cloud environments grow more complex, they naturally produce more telemetry data – including application, system and security logs that document all types of events. All cloud services and infrastructure components generate their own, distinct logs.
Troubleshooting a problem, at times, can feel like searching for a needle in a haystack. Sifting through massive amounts of log data can be both unproductive and impractical. Finding the root cause of an issue often requires a lengthy investigation into raw log data, without knowing which specific event triggered the problem in the first place.
Even when real-time cloud observability systems are in place, teams often still have to sift through historical log data to determine what went wrong. What’s more, the data retention windows on systems like security information and event management (SIEM) and other monitoring and observability tools are often less than 30 days; this timespan is mostly adequate for day-to-day operations, but gaining meaningful insight into the source of a persistent or longstanding issue requires months or more of log data.
Applying log analytics can help reduce some of the headaches associated with troubleshooting common cloud infrastructure and services issues. Uncovering these issues faster can help improve incident management KPIs, which include mean time to know (MTTK), mean time to repair (MTTR), and mean time between failures (MTBF), among others.
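To make these KPIs concrete, here is a minimal sketch that computes them from a list of incident timestamps. The `Incident` fields and the `incident_kpis` function are hypothetical names used only for illustration. The definitions are assumptions as well: MTTK is taken as the time from the start of an incident until its cause is known, MTTR as the time from start to resolution, and MTBF as the average uptime between the end of one incident and the start of the next. Definitions vary between organizations, so adjust them to match your own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime    # when the failure began
    diagnosed: datetime  # when the cause was identified (assumed MTTK endpoint)
    resolved: datetime   # when service was restored


def mean(deltas):
    """Average a non-empty list of timedelta values."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_kpis(incidents):
    """Return MTTK, MTTR and MTBF under the assumptions stated above."""
    ordered = sorted(incidents, key=lambda i: i.started)
    mttk = mean([i.diagnosed - i.started for i in ordered])
    mttr = mean([i.resolved - i.started for i in ordered])
    # Uptime between the end of one incident and the start of the next.
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    mtbf = mean(gaps) if gaps else None  # undefined with fewer than two incidents
    return {"mttk": mttk, "mttr": mttr, "mtbf": mtbf}
```

Tracking these values over time, rather than reading a single number, is usually what shows whether faster log analysis is actually shortening investigations.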
In this article, we’ll dive into log analytics, why it’s important, types of log data, common cloud infrastructure issues, best practices for cloud troubleshooting, and how to effectively store and query your logs in the event of an issue.
What is Log Analytics?
The proliferation of cloud computing services and infrastructure has led to an explosion of log data. This log data is crucial to understanding cloud performance and security issues alike. DevOps teams are responsible for dealing with issues in code and the connection between code and the cloud production environment. Log analytics software solutions are used to collect, aggregate, analyze, and visualize computer-generated log data from sources throughout the IT environment.
Key capabilities of log analytics solutions include:
- Log Data Collection and Aggregation – Log analytics solutions collect, aggregate, and centralize log data for analysis. These solutions gather log data from a broad spectrum of sources that includes virtual machines (VMs), containers, storage, operating systems, network infrastructure, applications, and endpoint devices.
- Log Data Normalization – Various data sources tend to format their logs in different ways, so most log analytics solutions offer a means of normalizing log data such that a single unified index can be used for analysis.
- Log Indexing, Storage and Retention – Normalized log data must be indexed for rapid retrieval before it can be searched, queried, and analyzed. Many log analytics solutions also offer cost-effective long-term storage, so logs can be retained beyond the short retention windows of other monitoring tools.
- Querying and Analytics – Log analytics tools allow teams to run queries and perform log analysis on indexed data to investigate root causes or to discover potential issues before they impact production systems.
- Visualization and Dashboarding – Log analytics solutions offer visualization and dashboarding features that make it easier for DevOps teams to consume data or report on the results of log analytics operations.
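The normalize, index, and query steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular log analytics product works: the two input formats (a JSON line with `time`, `severity`, and `msg` keys, and a plain-text line of the form `<ISO timestamp> <LEVEL> <message>`) are invented for the example, and the function names `normalize`, `build_index`, and `search` are hypothetical.

```python
import json
import re
from collections import defaultdict
from datetime import datetime, timezone

# Assumed plain-text format: "<ISO-8601 timestamp> <LEVEL> <message>"
TEXT_LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Za-z]+) (?P<message>.*)$")


def normalize(raw, source):
    """Map a raw log line onto one schema: timestamp, source, level, message."""
    raw = raw.strip()
    if raw.startswith("{"):
        # Assumed JSON format with epoch seconds in "time".
        record = json.loads(raw)
        timestamp = datetime.fromtimestamp(record["time"], tz=timezone.utc)
        level, message = record["severity"], record["msg"]
    else:
        match = TEXT_LINE.match(raw)
        if match is None:
            raise ValueError(f"unrecognized log line: {raw!r}")
        timestamp = datetime.fromisoformat(match["ts"])
        level, message = match["level"], match["message"]
    return {"timestamp": timestamp, "source": source,
            "level": level.upper(), "message": message}


def build_index(records):
    """Build an inverted index from message tokens to record positions."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        for token in re.findall(r"\w+", record["message"].lower()):
            index[token].add(position)
    return index


def search(index, records, *terms):
    """Return records whose message contains every term, oldest first."""
    if not terms:
        return []
    hits = set.intersection(*[index.get(t.lower(), set()) for t in terms])
    return sorted((records[p] for p in hits), key=lambda r: r["timestamp"])
```

Because every record shares the same fields after normalization, a single index can answer a query across sources, which is the point of the unified index described above.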
DevOps and SecOps teams use log analytics for forensic analysis, and to monitor cloud environments that support systems and applications. Let’s learn more about the typical types of log data used for troubleshooting in AWS and Google Cloud.
Useful Log Data for Troubleshooting
Cloud monitoring services often capture metrics, metadata and events that can help inform DevOps teams about the status of cloud-based applications and infrastructure. There are many types of logs used for troubleshooting cloud services and infrastructure. They include:
- Event logs: These provide information about network traffic, usage and more. For example, event logs can capture login sessions or other activity on a network, or record application errors.
- Transaction logs: These log files list changes to a database or cloud storage environment, and are commonly associated with SQL Server transactions.
- Message logs: These logs document activity from messages, including email, chat and more.
- Audit logs: Audit logs may vary between applications and systems but typically capture events that show who did what, and how the system responded.
Within AWS specifically, there are several types of logs DevOps teams typically monitor for log analytics, including:
- CloudTrail: CloudTrail enables you to log, continuously monitor, and retain account activity related to actions across your AWS infrastructure. This includes the event history of your AWS account activity, such as actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services.
- Elastic Load Balancing (ELB): ELB logs capture detailed information about requests sent to your load balancer, such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. These access logs are useful for analyzing traffic patterns and troubleshooting issues.
- VPC Flow: VPC Flow Logs help you capture information about the IP traffic going to and from network interfaces in your virtual private cloud (VPC). Flow logs can help diagnose overly restrictive security group rules, monitor traffic reaching your instance, and determine the direction of traffic to and from the network.
- Route 53: These logs are focused on queries to the Domain Name System (DNS), such as the domain or subdomain that was requested, the date and time of the request, and the DNS record type.
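As one example of working with these logs, the sketch below parses VPC Flow Logs records and counts rejected traffic per destination, which can point to security group rules worth reviewing. It assumes the default (version 2) record format with 14 space-separated fields; custom formats use a different field order and would need a different `FIELDS` list. The function names are hypothetical. Note that a REJECT record alone does not say which rule blocked the traffic, so treat the output as a list of candidates to investigate.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_record(line):
    """Split one default-format flow log line into a dict of string fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected_destinations(lines):
    """Count REJECT records per (dstaddr, dstport, protocol), most frequent first."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and the header row that some exports include.
        if not line.strip() or line.startswith("version"):
            continue
        record = parse_flow_record(line)
        if record["action"] == "REJECT":
            key = (record["dstaddr"], record["dstport"], record["protocol"])
            counts[key] += 1
    return counts.most_common()
```

A destination port that shows up repeatedly with REJECT, for a service you expect to be reachable, is a reasonable first place to check for an overly restrictive rule.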
Beyond AWS, each cloud has its own services for which logs are generated. Applications, containers, microservices, compute nodes, and other components also generate logs, which add to the volume of data through which DevOps teams must sift to find meaningful insights.
Typical Cloud Services and Infrastructure Issues
While there are thousands of things that could go wrong within a cloud environment, the most typical cloud issues include the following:
- Cloud security and configuration management: Maintaining consistent configuration management across all of your cloud infrastructure can be a major challenge. In fact, attackers often exploit common cloud misconfigurations, such as default or accidentally exposed credentials, open ports, and poorly secured S3 buckets.
- Cloud availability and latency issues: Latency and availability problems with public cloud services can originate on the user’s side or on the server side.
- Cloud application performance issues: Cloud applications can be delayed or fail for a number of reasons, including how they’re built and configured, poor database performance, or the cloud computing services themselves.
- Cloud cost issues: Many organizations face out-of-control cloud costs because of how they use cloud resources. Usage surges or unattended projects can cause unexpected spikes in the next month’s cloud bill.
- Multicloud deployment issues: Organizations embracing a multicloud approach need to learn to do the same things differently across cloud platforms. Learning a new system can be a costly and error-prone process.
Troubleshooting these issues requires different approaches, yet there are some common best practices that can be applied across the board.
Cloud Troubleshooting Best Practices
Once you know there’s a problem, there are important steps you must take to keep it from persisting. Unlike on-premises troubleshooting, troubleshooting cloud infrastructure under the shared responsibility model of public cloud providers requires clear communication with your provider about what you’ve already done to fix the issue yourself. Google outlines typical cloud troubleshooting best practices for site reliability engineers (SREs), which include:
- Triage: Mitigate the impact if you can
- Examine: Gather observations and share them with your cloud provider or other team members
- Diagnose: Create a hypothesis that explains the observations
- Test and treat:
  - Identify tests that may prove or disprove the hypothesis
  - Execute the tests and agree on the meaning of the result
  - Move on to the next hypothesis; repeat until solved
Working with a cloud provider on an issue means giving up some control over the situation at hand, so it’s critical to maintain a time-stamped record of the troubleshooting steps you’ve taken so far, with screenshots and any relevant log snippets or other documentation attached.
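One lightweight way to keep such a record is sketched below. It is only an illustration: the `TroubleshootingRecord` and `Step` classes, the phase names, and the text layout are all hypothetical choices, loosely following the triage, examine, diagnose, and test steps listed above. A shared document or ticketing system can serve the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Phase names assumed for this sketch, based on the steps listed above.
PHASES = ("triage", "examine", "diagnose", "test")


@dataclass
class Step:
    timestamp: datetime
    phase: str
    note: str
    attachments: list = field(default_factory=list)


@dataclass
class TroubleshootingRecord:
    title: str
    steps: list = field(default_factory=list)

    def log(self, phase, note, attachments=(), now=None):
        """Append a time-stamped step; 'now' can be injected for testing."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase!r}")
        timestamp = now or datetime.now(timezone.utc)
        self.steps.append(Step(timestamp, phase, note, list(attachments)))

    def to_text(self):
        """Render the record as plain text, ready to share with a provider."""
        lines = [self.title]
        for step in self.steps:
            stamp = step.timestamp.isoformat(timespec="seconds")
            line = f"{stamp} [{step.phase}] {step.note}"
            if step.attachments:
                line += " (attached: " + ", ".join(step.attachments) + ")"
            lines.append(line)
        return "\n".join(lines)
```

Recording the hypothesis and the outcome of each test as separate steps makes it easier for a provider's support team to see what has already been ruled out.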