Any critical site, service, or application has Site Reliability Engineers (SREs), DevOps engineers, or support staff constantly reviewing operational and performance metrics to ensure users have a good experience. The work demands considerable attention to alerts and, in particular, to the numerous observability dashboards stretched across multiple screens used to keep track of SLAs and application responsiveness.
Given the increasing complexity of applications, the speed at which new versions and updates are delivered, and the variability of use cases, it is inevitable that things break in any software organization. The result could be an outage, the loss of important functionality, or some other major issue, and a crisis ensues. An already demanding job becomes even more so, with new levels of urgency and the stress of scrutiny. For many organizations, it is all hands on deck to identify and solve the issue as quickly as possible and minimize the impact on customers or users. Such scenarios are generally not a question of if, but how often.
While symptoms of the problem may figure prominently on these dashboards, determining the real cause is usually more complicated. One typically uses metrics dashboards to zoom in on the time frame when things went wrong, uses traces to narrow the scope of where the problem might lie (which services are involved), and inevitably gets stuck hunting deep within logs to figure out why. Symptoms with widespread impact, such as user-facing latency, a drop in overall traffic, or a spike in errors, are well captured by monitoring dashboards. But these are rarely enough to understand the cause of a problem, and they typically don't surface isolated issues, such as a bug affecting specific user actions or warnings caused by particular API calls failing. Software problems also often build up over time (seconds, minutes, or in some cases even hours). A problem might start with warnings, retries, or restarts, then spiral out of control and impact users as downstream services start to fail.
Such leading indicators are hard to catch using observability tools. Some organizations do build alert rules and signatures to identify and respond to particularly nasty problems, allowing quick diagnosis and resolution when they recur. But such signatures and rules take considerable work to build and maintain, so most organizations have a lengthy backlog of signature and rule updates for their problem identification systems. Besides being labor intensive, this approach is not terribly flexible: changes in software can change the failure modes, thresholds, or formats of the underlying metrics and logs, making it hard to reliably match future problems. In a modern cloud-native environment, it is impractical to build and maintain enough alert rules to cover the unbounded ways in which problems can start.
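For context, a hand-maintained rule of this kind might look like the following Prometheus-style alerting rule. This is an illustrative sketch only; the metric name `http_requests_total` and the thresholds are assumptions, not drawn from the original text, and every such rule must be re-tuned as the software changes:

```yaml
groups:
  - name: example-slo
    rules:
      - alert: HighErrorRate
        # Fires when more than 5% of requests fail, sustained for 10 minutes.
        # Both the 5% threshold and the metric name are hypothetical.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 10 minutes"
```

A production environment typically needs hundreds of rules like this, each tied to specific metric names and thresholds, which is exactly the maintenance burden described above.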
Regardless of investment in early warning alerts, troubleshooting modern software is not easy. The bottleneck is not the scalability of the observability platform or the speed of its queries; it is the eyes and brain of the SRE or support team. How quickly can a person scan and flip through dozens of dashboards and drill down into metrics and traces to narrow the scope? And then, how does one identify the events that explain the root cause? There are typically millions (if not billions) of events across thousands of log streams to look at. One might be able to identify clusters of errors near the problem, but then one has to look upstream in time, and horizontally across logs from other services, to figure out which unusual event (often not an error) triggered the whole sequence.
Just finding these events is hard enough; the next challenge is correctly correlating them, or "connecting the dots," to see that a particular combination of factors caused the problem or that a certain combination of events points to the true cause. This is a "forest for the trees" sort of challenge. Sometimes a problem can be diagnosed from a single log, but more often the understanding lies between logs, in assembling a true picture from details spread across several of them. All of these factors make troubleshooting difficult and time consuming.
Ironically, the things that make this task hard for humans (sifting through vast quantities of data, spotting outliers, correlating across hundreds of streams) are extremely well suited to machine learning. Specialized, automated systems can quickly comb through huge volumes of unstructured logs to find unusual events and clusters, and make meaningful correlations, all in the time it takes to fill a coffee mug and log in.
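As a minimal illustration of the idea (a toy sketch, not any particular vendor's implementation), rare log "templates" can be surfaced automatically by normalizing away variable tokens such as numbers and IDs, then counting how often each resulting shape occurs. Lines whose shape is rare are candidate leading indicators:

```python
import re
from collections import Counter

def template(line: str) -> str:
    """Collapse variable parts (hex ids, IPs, numbers) into placeholders
    so that log lines with the same shape share one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+\.\d+\.\d+\.\d+", "<IP>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

def rare_events(log_lines, max_count=1):
    """Return lines whose template occurs at most max_count times.
    Rare shapes are worth a human's attention; frequent ones are noise."""
    counts = Counter(template(line) for line in log_lines)
    return [line for line in log_lines if counts[template(line)] <= max_count]

logs = [
    "GET /api/v1/items 200 12ms",
    "GET /api/v1/items 200 9ms",
    "GET /api/v1/items 200 15ms",
    "worker 7 restarting after heartbeat timeout",
]
print(rare_events(logs))  # -> ['worker 7 restarting after heartbeat timeout']
```

Real systems add much more (clustering across streams, correlating anomalies in time), but even this crude normalize-and-count step shows how a machine can flag the one unusual event in millions without a hand-written rule for it.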
Now, imagine having this capability within one's observability dashboard rather than in an external application that requires a context switch to view and engage with. Further, having root cause analysis integrated means no separate configuring, tuning, and managing. When an oddity pops up on any dashboard, one only has to look at the corresponding root cause analysis summary, presented in plain language (using NLP) right below it. That takes root cause analysis from a reactive, stressful workflow requiring lots of windows and visual correlation and turns it into something automatically and seamlessly surfaced in familiar tools. In addition, problem solving can become more proactive, so issues can be addressed early, before customers or users experience them.
Marrying the traditional observability world with root cause analysis takes more than integrating systems or technology. Making it a natural part, or at least an extension, of daily work practices advances the way the team works and the quality and reliability it can provide. The combination brings new levels of effectiveness and efficiency at a time when growing application complexity, rising expectations, higher stakes, and understaffed, overworked teams are a growing reality.
About the author: Ajay Singh is a strong advocate for creating products that "just work" to address real-life customer needs. As Zebrium CEO, he is passionate about building a world class team focused on using machine learning to build a new kind of log monitoring platform. Ajay led Product at Nimble Storage from concept to annual revenue of $500M and over 10,000 enterprise customers. Ajay started his career as an engineer and has also held senior product management roles at NetApp and Logitech.
Edited by Erik Linask