Observability basics

“Observability provides you with an answer to a very basic question: what is going on inside my system?”

 

Observability is the extent to which you can understand the internal state or condition of a complex system based only on knowledge of its external outputs. The more observable a system, the more quickly and accurately you can navigate from an identified performance problem to its root cause, without additional testing or coding [1].

 

Author

 

Miroslav Lazovic is an experienced software engineer from Belgrade. Over the past 15 years, he has worked on many different projects of every size – from small applications to mission-critical services used by hundreds of thousands of users on a daily basis – both as a consultant and as part of a development team.

Since 2016, he has focused on building and managing high-performing teams and helping people do their best work. He likes discussing various important engineering and management topics – and that’s the main reason for this blog. Besides that, Miroslav has quite a few war stories to tell and probably far too many books and hobbies. There are also rumors of him being a proficient illustrator, but the origin of these rumors remains a mystery.

– Backend TL@Neon and Dev. Advocate @Holycode

 


 

Observability provides you with an answer to a very basic question: what is going on inside my system? I believe it is very clear how important this is. Without knowing what goes on inside your system, you will not be able to properly troubleshoot or debug your applications (or the infrastructure they run on), nor will you be able to meet various business requirements (SLAs, customer experience expectations, etc.). Trying to figure out why something does not work (or why it behaves in a certain way) will require tremendous effort and a considerable amount of time – which may, in the end, lead to dissatisfaction and burnout. Without at least basic observability tooling and practices in place, it can be hard to diagnose and resolve issues even in a standalone, monolithic application; trying the same thing in a microservice environment with many moving parts may be nearly impossible.

To achieve this, you need to collect various data about your system(s). There are many differing views on which data you need to collect, but in this article we are going to focus on four types of data: metrics, events, logs, and traces (the so-called MELT approach) [2].

 

Metrics

 

Metrics are numerical data that describe various aspects of application and system health over time (but they can describe a lot more as well). Here are some examples:

– CPU, disk, and memory usage
– Number of requests per second
– Number of errors per time interval
– Number of transactions per time interval

Some metrics are provided by the operating system or by a specific library or framework. However, there are tools that can help you create your own metrics describing the specific aspects of the application or system you are interested in. Metrics are useful because they can provide a lot of information about a specific measurement over time (and these intervals can range from the last few minutes to the last few months or even more). This can help you spot issues that are occurring right now (for example, disk space is very low, or CPU usage has been extremely high for the past 5 minutes), but it can also help you detect certain trends that you can then react to (for example, regular patterns in your HTTP traffic).
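
For illustration, here is a minimal sketch of a custom application metric using the Micrometer library. The metric names, the endpoint tag, and the in-memory SimpleMeterRegistry are just assumptions for the example – in a real setup you would register against your actual monitoring backend.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class CheckoutMetrics {
    public static void main(String[] args) {
        // In-memory registry for the sketch; in production this would be
        // a registry backed by your monitoring system.
        MeterRegistry registry = new SimpleMeterRegistry();

        // "Number of requests per time interval" style metric: a counter that
        // the monitoring backend can turn into a rate over time.
        Counter checkoutRequests = Counter.builder("checkout.requests")
                .tag("endpoint", "/checkout")   // hypothetical endpoint tag
                .register(registry);

        // "How long does this operation take" style metric.
        Timer checkoutTimer = Timer.builder("checkout.duration")
                .register(registry);

        checkoutRequests.increment();
        checkoutTimer.record(() -> processCheckout());

        System.out.println("requests so far: " + checkoutRequests.count());
    }

    private static void processCheckout() {
        // placeholder for the real business logic
    }
}
```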

 

Events

 

Events are specific actions that occurred in your application – for example, “user created”, “transaction executed”, “cart updated”, and many, many more. Events might describe not only user-related actions, but also many system-level actions happening in the background (like a scheduled job being triggered at a specific point in time). Events may also be related to actions like clicking a button or using a search bar. If you keep track of all these events, you can get a view of what is going on in your system while it is operating and while users are interacting with it.
Moreover, if you feed these events and related metadata (like timestamps or custom attributes, for example) into an observability tool, you can aggregate them and then run specific queries that tell you a lot about how your system operates or how it is used. For example, you could get answers to questions like these:

– How many new users are created daily?
– How many logins does a typical user make during a single day?
– How many users are using the new feature that we rolled out last month?

These are just some basic examples – depending on how you track events, you might be able to get very sophisticated information about the way your system operates. This, in turn, can lead to improved user experience and overall stability and performance, but it can also help you detect previously unknown issues.
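
As a minimal sketch (the event name, the attributes, and the Jackson-based serialization are assumptions for the example, and it assumes a Jackson version recent enough to understand Java records), an event can be as simple as a named record with a timestamp and custom attributes that you ship to your observability tool:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.time.Instant;
import java.util.Map;

// One discrete action that happened in the system, plus metadata
// that an observability tool can aggregate and query later.
record AppEvent(String name, String timestamp, Map<String, Object> attributes) {}

public class EventExample {
    public static void main(String[] args) throws Exception {
        AppEvent event = new AppEvent(
                "user.created",                                   // hypothetical event name
                Instant.now().toString(),
                Map.of("plan", "free", "source", "signup-form")); // hypothetical attributes

        // Serialize to JSON so the event can be sent to whatever backend you use.
        String json = new ObjectMapper().writeValueAsString(event);
        System.out.println(json);
    }
}
```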

 

Logs

 

For any type of application, logs are very important because they contain detailed information about various application events. However, they are much more granular than events, and a single event in an application can produce multiple log entries (for example, a specific event like creating a user can require the execution of multiple blocks of code, and each of these blocks may log relevant entries). Logs are one of the primary sources of information when you want to either understand how an application works or debug and fix issues. However, not all logs are equal, and there are some guidelines you can follow to get better (and more useful) logs. In general, there are two important things to consider when it comes to logs: the level of detail and the log format.

Good logs contain the right amount of detail. I have seen logs that contain no usable information at all – simply stating that a certain event has occurred but providing no additional details. Such logs are only slightly better than having no logs at all. You must make sure that all relevant events (and not only errors) are logged with enough information about the event itself – so you can establish the chain of effects that led to the event being investigated. For example, stating that an “error occurred” without any timestamp does not mean anything, but a timestamped log entry containing the exception stack trace is a completely different thing.
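
As a small sketch with SLF4J (the class, messages, and identifiers are made up for the example), a single “create user” event can produce several contextual log entries, and a failure is logged with the full stack trace:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class UserService {
    private static final Logger log = LoggerFactory.getLogger(UserService.class);

    public void createUser(String email) {
        // Multiple log entries for a single "user created" event,
        // each carrying enough context to reconstruct what happened.
        log.info("Creating user account for email={}", email);
        try {
            validate(email);
            log.debug("Validation passed for email={}", email);
            persist(email);
            log.info("User account created for email={}", email);
        } catch (RuntimeException e) {
            // Passing the exception as the last argument makes the logging
            // framework print the full stack trace with the entry.
            log.error("Failed to create user for email={}", email, e);
            throw e;
        }
    }

    private void validate(String email) { /* ... */ }
    private void persist(String email)  { /* ... */ }
}
```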

However, there is such a thing as “logging too much information” – so you should try to figure out what information must always be present in the logs. You should also consider the option of increasing the level of detail in the logs by changing the application configuration (many popular libraries and frameworks allow you to do that). For example, you might not always log information about the state of the database pool and the SQL queries being executed, but it is great if you can change a setting to see that information if/when you need it. The same goes for HTTP headers, request/response payloads, etc. – depending on the context and business case, some of this information might be critical, so you always log it, but in other cases you might just want to “turn it on” when needed.
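
How exactly you change the level depends on your framework and configuration. As one possible sketch with Logback as the SLF4J backend (the logger name org.hibernate.SQL is just an assumption to illustrate turning SQL logging on), the level can even be raised at runtime:

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class LogLevelToggle {
    // Temporarily enable detailed SQL logging, e.g. while investigating an issue.
    public static void enableSqlLogging() {
        Logger sqlLogger = (Logger) LoggerFactory.getLogger("org.hibernate.SQL");
        sqlLogger.setLevel(Level.DEBUG);
    }

    // Turn it back down once the investigation is over.
    public static void disableSqlLogging() {
        Logger sqlLogger = (Logger) LoggerFactory.getLogger("org.hibernate.SQL");
        sqlLogger.setLevel(Level.WARN);
    }
}
```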

The second thing we mentioned is the log format – in a microservice environment it commonly happens that every service logs the same thing in a different way. This is an issue for several reasons. First, it is not easy for an engineer to juggle multiple unstructured log entries on the fly. Different log formats between services might work if there are only a few of them or if the system is in the early stages of development – but for a production system, I do not believe this is a good idea at all (unless you have a very strong reason to do it). Setting up a common log format is a small thing that can save you a lot of time and effort when it comes to understanding what’s going on inside your system. Once you define this format, you can include it in every new service – so you can be sure that all relevant information will be presented in the logs in exactly the same way.
Second, not having a common log format can be an issue when it comes to aggregating logs. There are many popular services (some free, some paid) that you can use for log aggregation, and the major cloud providers also offer their own tools for this. Basically, log aggregation tools help you collect logs from different sources and manage them in a centralized location, where you can also analyze them. In a microservice environment, this becomes extremely useful, because all logs, from all your services, are stored in a single location where you can view them in a user-friendly UI and easily search them by running queries – and these things are much more efficient if all logs are streamlined and share a common format. Depending on the tool, you can even create metrics or add dashboards with the specific information you want to see. Just imagine the alternative – juggling dozens of log files scattered across dozens of different application nodes and trying to figure out how to connect the relevant pieces of information. It’s not impossible – but it’s far from efficient.
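
One common way to get a shared, query-friendly format is structured logging, where every service always emits the same set of fields. As a sketch using SLF4J’s MDC (the serviceName and requestId field names are assumptions – use whatever fields your aggregation tool expects, and configure the log encoder, e.g. a JSON encoder, to include MDC values):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderHandler {
    private static final Logger log = LoggerFactory.getLogger(OrderHandler.class);

    public void handle(String requestId, String orderId) {
        // Fields that every service puts into the MDC so the configured
        // encoder can emit them with each log entry in the same format.
        MDC.put("serviceName", "order-service");   // hypothetical service name
        MDC.put("requestId", requestId);
        try {
            log.info("Processing order {}", orderId);
            // ... business logic ...
            log.info("Order {} processed", orderId);
        } finally {
            MDC.clear();   // avoid leaking context into unrelated requests
        }
    }
}
```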

 

Traces

 

Traces (or distributed traces) are samples of chains of events between different components in a microservice environment. Like logs, traces are irregular in occurrence [2].

Traces are made of components called spans, each describing one of the events that occurred during the trace. For example, let’s say you have a REST API endpoint that needs to validate some query parameters and then retrieve some data from the database. Both operations are implemented as methods that will be called at some point during execution: the first method performs the validation and the second performs the data retrieval.

This means that the trace that describes the operation (calling the API endpoint and getting a response) will show two spans (validation and data retrieval).
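
As a sketch of what that might look like with the OpenTelemetry API (the tracer name, span names, and the GlobalOpenTelemetry lookup are assumptions – a real setup also needs an SDK and exporter configured), the endpoint creates one parent span with two child spans:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class UserEndpoint {
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("user-endpoint");

    public String getUser(String id) {
        // Parent span: the whole request/response cycle.
        Span request = tracer.spanBuilder("GET /users/{id}").startSpan();
        try (Scope ignored = request.makeCurrent()) {
            // Child span 1: parameter validation.
            Span validation = tracer.spanBuilder("validateParameters").startSpan();
            try (Scope scope = validation.makeCurrent()) {
                // ... validation logic ...
            } finally {
                validation.end();
            }

            // Child span 2: data retrieval.
            Span retrieval = tracer.spanBuilder("loadUserFromDatabase").startSpan();
            try (Scope scope = retrieval.makeCurrent()) {
                // ... database call ...
            } finally {
                retrieval.end();
            }
            return "user-" + id;
        } finally {
            request.end();
        }
    }
}
```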

 

Trace showing several operations and their durations

 

Now, this becomes much more complex in a microservice environment, because the initial request might require subsequent calls to other microservices – and for each of them, the spans have to be tracked. Each of the services might have its own validation, data retrieval, or other logic that needs to be executed. But tracking the spans and their order is just half of the problem – the other half is correlating those spans. By correlating them, you can tie them together in a meaningful way and provide all the required information about the anatomy of the trace. Usually, this is achieved by services passing correlation identifiers (or trace context) to each other – which allows many different spans to be viewed as a single chain of events (or trace). Observability tools can then collect this information and present it to the user, usually with a lot of additional detail (like HTTP headers and payloads, the SQL queries being executed, etc.).
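
As a bare-bones sketch of the idea (the X-Correlation-Id header name and the request stub are hypothetical – in practice a tracing library usually propagates a standard header such as the W3C traceparent for you), the caller attaches an identifier to outgoing requests and the callee reuses it or starts a new one:

```java
import java.util.UUID;

public class CorrelationIdPropagation {
    // Hypothetical header name; standards-based tracing uses "traceparent".
    static final String CORRELATION_HEADER = "X-Correlation-Id";

    // Caller side: attach the current correlation id to the outgoing request.
    static void addCorrelationHeader(HttpRequestStub outgoing, String correlationId) {
        outgoing.setHeader(CORRELATION_HEADER, correlationId);
    }

    // Callee side: reuse the incoming id, or create a new one if none was sent.
    static String resolveCorrelationId(HttpRequestStub incoming) {
        String id = incoming.getHeader(CORRELATION_HEADER);
        return (id != null) ? id : UUID.randomUUID().toString();
    }

    // Minimal stand-in for whatever HTTP client/server types you actually use.
    interface HttpRequestStub {
        void setHeader(String name, String value);
        String getHeader(String name);
    }
}
```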

 

Trace showing external and internal calls

 

It’s easy to see where this information would be extremely helpful – in troubleshooting issues and in getting more insight into how specific operations are executed. For example, traces may show that certain methods take too long to execute – so you might put some effort into investigating the root cause of the poor performance. There are powerful observability tools that can do a lot of the work for you and give you a single place where you can analyze all your traces, but there are also libraries that allow you to set everything up on your own (of course, this is much more time-consuming than the first option, but each approach has its own pros and cons).

 

So, what do I do with all this?

 

Now that we have covered the basic data types used to achieve observability, we can focus on how to utilize this data. As mentioned in the previous section, there are powerful observability tools (New Relic or Datadog, for example) that can collect all this data in a single place so you can analyze it any way you want. We have already listed numerous benefits of analyzing such data, but there is one important thing that wasn’t mentioned – detecting the “unknown unknowns”, which means finding out what you don’t know about your system. Maybe there is a specific sequence of calls that results in a particular error, or maybe the overall response time rapidly increases once the number of users goes past a certain point. This and much more can be learned from looking at the data collected by the observability tools, and some of these issues can be caught even before they fully manifest in the production environment.

 

“Many companies might work just fine with the right level of observability – but you must find out what the right level is for your system.”

 

However, this is just the beginning. There is another concept that usually goes hand in hand with observability – monitoring and alerting. If you are collecting data in real time, you can set up alerts that trigger once a specific event has occurred. The most common scenario is that alerts are raised when the overall stability of the system is compromised (meaning your SLAs are compromised as well): certain metrics being too high or too low, a rapid increase in error count, increased response time, a rapid decrease in available system resources, etc. If something like that happens, an alert is raised, and members of the engineering team receive notifications with a quick summary of the issue. These notifications are fully automated: once you set the rules or conditions for raising an alert, the notification will be sent by email, SMS, automated phone call, or some other channel (a dedicated Slack channel, for example) as soon as those conditions are met. Basically, this is your system telling you that something bad is happening right now and that you need to look at it – in short, you are notified about issues as they are happening. This allows the team not only to react in time, but to be proactive and contain the issue, limiting the consequences.
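
Conceptually, an alert rule is just a condition evaluated against incoming data plus a notification channel. Here is a toy sketch of the idea (the threshold, window size, and notifyOnCall method are all made up – real tools like Prometheus Alertmanager or the alerting built into New Relic or Datadog express this as configuration rather than application code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ErrorRateAlert {
    private static final double ERROR_RATE_THRESHOLD = 0.05; // hypothetical: 5% of requests
    private static final int WINDOW_SIZE = 100;              // last 100 requests

    private final Deque<Boolean> window = new ArrayDeque<>();

    // Called for every request outcome; raises an alert when the error
    // rate over the sliding window crosses the threshold.
    public void record(boolean failed) {
        window.addLast(failed);
        if (window.size() > WINDOW_SIZE) {
            window.removeFirst();
        }
        long errors = window.stream().filter(f -> f).count();
        double errorRate = (double) errors / window.size();
        if (window.size() == WINDOW_SIZE && errorRate > ERROR_RATE_THRESHOLD) {
            notifyOnCall("Error rate " + errorRate + " exceeded threshold");
        }
    }

    private void notifyOnCall(String message) {
        // Stand-in for the email/SMS/Slack integration an alerting tool provides.
        System.out.println("ALERT: " + message);
    }
}
```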

 

Full-stack observability overview [3]

 

It is obvious that your work might benefit in many ways from achieving observability, for both big and small projects. However, once you start working towards it, you will probably quickly realize that there is a lot to cover. Achieving full-stack observability (being able to see everything in your tech stack) is hard (and even harder for complex projects). Just think about it for a second: application servers, networks, databases, mobile applications – monitoring and tracking all of this can be quite an epic undertaking in some cases. According to the recent observability report published by New Relic [3], only 27% of the organizations that participated in the survey claimed to have achieved full-stack observability. Things get further complicated by the fact that there are many different approaches and tools (both free and paid). According to the same report [3], 82% of organizations use four or more observability tools. The reasons for this are numerous: all-in-one solutions might not perfectly suit their setup, they might use a lot of custom tooling, they might believe that specific tools are better for certain use cases, etc. So, while these companies did achieve some (potentially high) level of observability, using all these different tools can put much more strain on the engineers (because they now need to learn to operate different tools to find out what is happening inside the system).

Having full-stack observability is great. However, achieving it is hard, and it might not be exactly what you need. Many companies might work just fine with the right level of observability – but you must find out what the right level is for your system. On the other hand, many companies are still detecting outages manually or from customer complaints. This means there are a lot of engineers who don’t have the full picture of how their systems operate – and you definitely don’t want to be in their shoes. Observability is a complex topic, with lots of different approaches and many tools – so researching and deciding what works best for you will not be easy. However, in the long run it will pay off, because it will lead to an increased level of confidence in your system.

 


 

References:

  1. What is observability? – https://www.ibm.com/topics/observability
  2. MELT 101 – An introduction to four essential telemetry data types – https://newrelic.com/platform/telemetry-data-101
  3. New Relic observability report for 2022 – https://newrelic.com/observability-forecast/2022/about-this-report