Microservices Observability

An overview of observability. Source: APM Experts 2020.

Troubleshooting a traditional monolithic application was easier than it is with the more recent microservices architecture. Log files captured information for debugging and analysis. Monitoring highlighted problems via static dashboards and alerts. The application itself was generally well understood.

With microservices, there are more moving parts. The system is dynamic and heterogeneous. Its parts are loosely coupled and transient. The system can fail in many different and unpredictable ways. It's necessary to monitor the application, the network and the infrastructure. It's in this context that observability (aka o11y) becomes important.

Can we understand the inner workings of the system? Can we correlate the outputs with the inputs? Can we explain unexpected behaviour? Can data help us achieve business goals? A system that answers these questions effectively is said to be observable.

Discussion

  • How does observability differ from monitoring?
    Monitoring versus observability. Source: Usman et al. 2022, fig. 2.

    IT teams have been monitoring system performance since the early 2000s. Monitoring samples and aggregates important data points such as system downtime, response time and memory usage. These are compared against expectations. Anomalies and violations are raised as alerts for support teams. Monitoring focuses on overall system health and business KPIs.

    Observability goes deeper with the aim of capturing system behaviour in greater detail and context. An application composed of many interacting microservices can fail in unpredictable ways. So it's not sufficient to focus on specific failures or scenarios. While monitoring deals with known unknowns, observability deals with unknown unknowns. Data analysis is an essential aspect of observability. Failures are traced back to root causes. To put it simply,

    Monitoring tells you when something is wrong. Observability lets you ask why.

    Observability doesn't replace monitoring. Monitoring is essential to observability. We may even view observability as a superset of monitoring.

  • What are the main benefits of observability?

    Observability leads to better visibility into the system dynamics. We can identify bottlenecks and optimize workflows. It creates a culture of innovation, improves operational efficiency and enables data-driven business decisions. DevSecOps teams can use the insights from observability to build more secure applications.

    Because observability helps us understand the system better, engineers can more confidently update and release software. Release cycles can be shorter. Quality can be improved. Problems that we didn't know existed in the system ("unknown unknowns") become visible. We can find and fix issues earlier in the software development lifecycle. When combined with AIOps, issues can be resolved automatically without human intervention.

    For any organization moving from monoliths to microservices, observability helps in this transition. It can help them better manage their cloud-native applications.

  • What are the key pillars of observability?
    The three pillars of observability. Source: Fernandes 2019.

    Observability has three key pillars:

    • Logs: Capture individual events. Logs are granular, timestamped and immutable. When things go wrong, logs are invaluable in determining the root cause. A good practice is to log in structured formats such as JSON (see the sketch after this list). Structured logging enables easy visualization, search and analysis.
    • Metrics: Data aggregated over time. Monitoring solutions rely on metrics. Uptime, CPU utilization, system load, memory usage, throughput, response time and error rate are some examples of metrics.
    • Traces: Sequence of calls triggered by an event. In the world of microservices, a failure can be traced back to an offending microservice or an API call.
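
    To make the logging pillar concrete, here's a minimal sketch of structured logging in Python using only the standard library. The service name and field names (order_id, http_status, latency_ms) are illustrative assumptions, not part of any standard.

        import json
        import logging
        import sys
        import time

        # Emit each event as one JSON object per line (structured logging).
        logger = logging.getLogger("order-service")
        logger.addHandler(logging.StreamHandler(sys.stdout))
        logger.setLevel(logging.INFO)

        def log_event(level, message, **fields):
            record = {"timestamp": time.time(), "level": level, "message": message, **fields}
            logger.log(getattr(logging, level), json.dumps(record))

        log_event("INFO", "order placed", order_id="A-1042", http_status=201, latency_ms=87)
        log_event("ERROR", "payment gateway timeout", order_id="A-1043", http_status=502)

    Because every field is a queryable key, such logs can later be searched or aggregated without fragile text parsing.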

    Events and exceptions can be seen as special cases of logs. Some identify dependencies as another pillar. This captures how components depend on one another.

    Historically (in 2016), engineers at Twitter identified four pillars of their observability stack: monitoring, alerting/visualization, distributed systems tracing infrastructure, and log aggregation/analytics.

  • What's the typical pipeline for microservices observability?
    Microservices observability pipeline. Source: Li et al. 2022, fig. 1.

    A logger produces logs within each service instance. A centralized collector collects the logs. Data is then pre-processed and stored to ease later analysis. Pre-processing might include cleaning, formatting, sampling and aggregating (for metrics). Analysts may run ad hoc queries to explore and visualize data, and to solve problems.

    Each logger might be part of the main service or deployed as a sidecar container within the same Kubernetes pod as the main container. More generally, an agent pulls logs from services or log files and then pushes them to the collector. Alternatively, log collection is agent-less, that is, services themselves push logs to the collector. The agent-less approach is easier to deploy but requires developers to integrate logging functionality into the service code.
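
    As a rough sketch of the agent-less option, a service can push each structured log record straight to a collector over HTTP. The endpoint URL and payload shape below are assumptions for illustration, not any particular collector's API.

        import json
        import urllib.request

        COLLECTOR_URL = "http://log-collector.internal:8080/ingest"  # hypothetical endpoint

        def push_log(event: dict) -> None:
            # Serialize the event and POST it to the centralized collector.
            body = json.dumps(event).encode("utf-8")
            req = urllib.request.Request(
                COLLECTOR_URL, data=body,
                headers={"Content-Type": "application/json"}, method="POST")
            with urllib.request.urlopen(req, timeout=2) as resp:
                resp.read()  # collector acknowledges receipt

        push_log({"service": "cart", "level": "INFO", "message": "item added", "sku": "SKU-778"})

    In production, such calls would typically be buffered and batched so that logging never blocks the request path.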

    Different types of analysis are possible: timeline analysis, service dependency analysis, aggregation analysis, root cause analysis and anomaly detection. Analysis is enabled by the use of statistics, rules and visualization.

  • What design patterns are available for microservices observability?
    Illustrating distributed tracing. Source: Conran 2022.

    Among the design patterns are:

    • Application Metrics: Gives a complete picture of the application. Includes infrastructure, application and end user metrics.
    • Audit Logging: Record all user or service account actions. Needed for regulatory compliance. Event sourcing is one approach to capturing audit logs. Likewise, changes to deployment ought to be logged.
    • Distributed Tracing: Helps with performance profiling and root cause analysis. A sequence of API calls triggered by a user request is a trace. A trace is composed of spans. Each span records a unit of work within the trace. For example, user request, API gateway processing, service processing and database access are all spans within a trace. Distributed tracing requires context or correlation IDs passed from one service to the next (see the sketch after this list).
    • Health Check: An unhealthy service is one that's running but has problems. A health check API can assess current system health. It can trigger alerts, recovery, service restarts, etc.
    • Exception Tracking: Includes error messages, error codes and stack traces.
    • Log Aggregation: Logs from various microservices are sent to a centralized log server.
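
    To make the Distributed Tracing pattern concrete, here's a minimal sketch in Python using Flask and requests (both assumed to be installed). The downstream URL and the X-Correlation-ID header name are illustrative; a real tracer would also mint a new span ID per hop rather than merely forwarding one value.

        import uuid
        from flask import Flask, request
        import requests

        app = Flask(__name__)
        INVENTORY_URL = "http://inventory.internal/reserve"  # hypothetical downstream service
        CORRELATION_HEADER = "X-Correlation-ID"              # illustrative header name

        @app.route("/order", methods=["POST"])
        def place_order():
            # Reuse the caller's correlation ID if present, else start a new one.
            correlation_id = request.headers.get(CORRELATION_HEADER, str(uuid.uuid4()))
            # Forward the same ID downstream so the logs and spans of both
            # services can be stitched into one end-to-end trace.
            requests.post(INVENTORY_URL, headers={CORRELATION_HEADER: correlation_id}, timeout=2)
            return {"status": "accepted", "correlation_id": correlation_id}, 202
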
  • As a developer, should I care about observability?

    Observability is not just the concern of operations. Observability must be considered at the design stage itself. The system should be designed to be testable. It should be incrementally deployable with support for rollback. The system should collect detailed runtime data.

    Logs should be designed to support data with high cardinality and high dimensionality. This makes it possible to query the log data effectively to bring out patterns and root causes. A useful query such as "list all 502 errors in the last 20 minutes from host foobar.com" becomes possible.
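
    As a toy illustration of such a query, the sketch below filters structured log records in memory; a real deployment would run the equivalent query in a log analytics backend, and the records shown are made up.

        import json
        from datetime import datetime, timedelta, timezone

        raw_logs = [
            '{"ts": "2023-06-05T10:42:00+00:00", "host": "foobar.com", "status": 502, "path": "/checkout"}',
            '{"ts": "2023-06-05T10:05:00+00:00", "host": "foobar.com", "status": 200, "path": "/home"}',
            '{"ts": "2023-06-05T10:50:00+00:00", "host": "other.com", "status": 502, "path": "/cart"}',
        ]

        now = datetime(2023, 6, 5, 11, 0, tzinfo=timezone.utc)
        cutoff = now - timedelta(minutes=20)

        # "List all 502 errors in the last 20 minutes from host foobar.com"
        matches = [
            rec for rec in map(json.loads, raw_logs)
            if rec["status"] == 502
            and rec["host"] == "foobar.com"
            and datetime.fromisoformat(rec["ts"]) >= cutoff
        ]
        print(matches)  # only the 10:42 checkout error qualifies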

    Developers must acknowledge that complex systems are unpredictable and failures are inevitable. They should proactively instrument their code, rather than rely on automatic instrumentation. Even controlled testing in production may be permitted. Developers should consider the operational semantics and dependencies of their software. For example, developers should understand how services start/shutdown, their static/dynamic configurations, service discovery, concurrency, etc.

  • What are the best practices when implementing observability?

    Monitor essential metrics such as latency, traffic, error rate and saturation. Collect metrics along with contextual metadata. Configure and prioritize alerts correctly: alerting on every 500 status code, for example, is counterproductive. Create specialized dashboards rather than overwhelming everyone with all the data.
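
    One way to expose such signals from a Python service is the prometheus_client library, as sketched below. The metric names, port and simulated handler are assumptions for illustration; saturation (CPU, memory, queue depth) usually comes from infrastructure exporters instead.

        import random
        import time
        from prometheus_client import Counter, Histogram, start_http_server

        REQUESTS = Counter("http_requests_total", "Requests by status code", ["status"])
        LATENCY = Histogram("http_request_latency_seconds", "Request latency in seconds")

        def handle_request():
            with LATENCY.time():                       # latency (and, via count, traffic)
                time.sleep(random.uniform(0.01, 0.1))  # simulated work
                status = "500" if random.random() < 0.05 else "200"
                REQUESTS.labels(status=status).inc()   # traffic and error rate by status

        if __name__ == "__main__":
            start_http_server(9100)  # metrics scraped from http://localhost:9100/metrics
            while True:
                handle_request()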

    Rules that detect incidents should be simple, predictable, reliable and actionable. Monitoring systems can become complex over time. They could end up collecting lots of metrics that are never used or complex rules that are never triggered. Data should lead to actionable insights. Data should create effective feedback loops.

    Investing in observability tools matters, but tools alone will not solve problems. Teams must adopt a mindset of making their systems observable, from design to deployment.

    An observability framework can serve as a reference for implementations. To be effective, observability must be guided by business goals such as Service Level Objectives (SLOs). Look at observability from the perspective of end users (response times, failed purchases) and backend applications (slow queries, container restarts).

  • What tools are available for observability?
    Illustrating observability in AWS. Source: Dhamija et al. 2022, fig. 4.

    Many frameworks (e.g. ASP.NET and Spring Boot) have libraries or plugins that implement OpenTelemetry. Pick one that supports W3C Trace Context. Select tools that support automation and are easy to maintain.
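
    As a rough Python illustration (assuming the opentelemetry-api and opentelemetry-sdk packages), the snippet below creates a span and injects W3C Trace Context headers for an outgoing call. The console exporter and service name are placeholders; production setups would export to a collector instead.

        from opentelemetry import trace
        from opentelemetry.propagate import inject
        from opentelemetry.sdk.trace import TracerProvider
        from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

        # Configure a tracer that prints finished spans to the console.
        provider = TracerProvider()
        provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
        trace.set_tracer_provider(provider)

        tracer = trace.get_tracer("checkout-service")

        with tracer.start_as_current_span("process-order") as span:
            span.set_attribute("order.id", "A-1042")
            headers = {}
            inject(headers)  # adds a W3C traceparent header for the downstream request
            print(headers)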

    Some have proposed observability stacks that combine analysis, logging and visualization: ELK (Elasticsearch, Logstash, and Kibana), EFK (Elasticsearch, Fluentd, and Kibana), or PLG (Promtail, Loki, Grafana).

    For logging from microservices we have Splunk, Loggly, Sumologic, Logstash, Beats, Fluent Bit, and more.

    Time-series data can be stored in Atlas, InfluxDB and Prometheus. Databases (such as Oracle) provide exporters and make data accessible to the observability stack. Prometheus helps with monitoring containers. Jaeger offers a visual representation of traces and dependencies. Dashboards can be created with Grafana.

    A few others include Chronosphere, Cisco AppDynamics, Datadog, Dynatrace, Honeycomb, IBM Instana, Lumigo, New Relic, Sensu, Sentry, SolarWinds, Serverless360, and Zipkin.

    Cloud providers offer managed services for observability. For example, AWS has managed services for Grafana, Prometheus and OpenTelemetry (Java SDK + collector implementation). It also offers its proprietary AWS X-Ray and Amazon CloudWatch.

  • What are some challenges with observability?
    Observability in 2022 is far from mature. Source: Basteri and Brabham 2022.

    A New Relic survey from 2022 showed that observability is far from mature. Many problems are still detected manually. Organizations use too many tools. Some tools require developers to instrument their code with calls to tracing libraries. Some tools can monitor microservices but not monoliths. Some tools are SaaS only. Some tools sample data while others don't.

    Large volumes of data can be overwhelming. Many tools may not scale well when data has high cardinality. Reducing the cardinality will make the system less observable. Designers must therefore consider this tradeoff until better tools become available.

    Where service meshes are used, observability is harder due to the increased variety of services, data volume and complex request paths. Often systems collect all metrics and later determine the relevant ones to analyze. It would be better to determine upfront the most relevant metrics and collect only those. ViperProbe, an eBPF-based research tool, takes this approach.

Milestones

1960

Rudolf E. Kálmán coins the term observability in the domain of control theory. It's "a measure of how well internal states of a system can be inferred from knowledge of its external outputs." Five decades later, this term gets repurposed towards distributed software systems. In the latter case, observability is the ability to understand and explain any system state, no matter how unusual or unexpected it may be.
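
For the linear time-invariant systems Kálmán studied, this has a precise test. In standard textbook notation (not from the source), a system with n states is observable if and only if the observability matrix has full rank:

    \dot{x} = Ax, \quad y = Cx
    \qquad \text{is observable} \iff
    \operatorname{rank}\begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix} = n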

1988

At IETF, RFC 1067 titled A Simple Network Management Protocol is published. In subsequent years, this becomes the essential protocol for monitoring network infrastructure via metrics. Time-series databases and dashboards later grow out of this metrics-centric approach. But engineers still reach for low-level tools (strace, tcpdump) when metrics fall short for debugging complex systems.

1999

Engineers at Sun Microsystems use the term "observability" as something that enables application performance monitoring and capacity planning. At this time, there are no microservices. Hence this definition of observability is rather narrow and closer to the modern use of the term "monitoring".

2011

The term microservices gets discussed at a workshop of software architects near Venice, to describe a common architectural style that several of them were exploring at the time. By 2015, microservices become mainstream as per a survey by NGINX.

2013
The observability stack at Twitter. Source: Twitter 2013.

Twitter publishes on its engineering blog an article titled Observability at Twitter. The article describes the elements of observability as practised at Twitter as the company moved from a monolithic architecture to a microservices architecture. The article uses the terms "observability team" and "observability stack". Beyond just traditional monitoring, the stack includes a time-series database, dashboards and the ability to run ad hoc queries. This extension of traditional monitoring has been called whitebox monitoring.

2015

By the mid-2010s, the term "observability" becomes more common. By 2018, it's commonly used in conferences and blog posts.

2019

OpenTelemetry is formed by merging two earlier projects named OpenTracing and OpenCensus. It's an incubation project at CNCF. It collects logs, traces and metrics from services, and sends them to analysis tools. It integrates easily with popular frameworks and libraries.

Nov
2021
Trace contexts are passed across microservices to help correlate events. Source: Anand 2023.

W3C publishes Trace Context as a W3C Recommendation. This standardizes new HTTP header names and formats that allow context information to be propagated across HTTP clients and servers in any distributed system. It's worth noting that OpenTelemetry supports W3C Trace Context format.
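
As a small illustration, a traceparent header value carries four dash-separated, hex-encoded fields: version, a 16-byte trace ID, an 8-byte parent span ID and trace flags. The sketch below builds one with random IDs and the "sampled" flag set; it shows the format only, not a full tracer.

    import secrets

    version = "00"
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex characters
    flags = "01"                       # sampled

    traceparent = f"{version}-{trace_id}-{parent_id}-{flags}"
    print(traceparent)  # e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01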

2023
Vendors in the cloud observability space. Source: Williams and Clarke 2023, fig. 1.

GigaOm publishes research on the current state of the cloud observability market. The study considers supported deployment models: SaaS, on-premise or hybrid. It considers many features: dashboards, user interaction performance monitoring, predictive analysis, microservices detection, etc. It's likely that no single platform is the best fit for every use case.

References

  1. APM Experts. 2020. "What is Observability and How to Implement it." APM Experts, March 1. Accessed 2023-06-05.
  2. Anand, A. 2023. "What is Context Propagation in Distributed Tracing?" Blog, SigNoz, April 3. Accessed 2023-06-07.
  3. Basteri, A., and D. Brabham. 2022. "2022 Observability Forecast." Research paper, New Relic, September. Accessed 2023-06-07.
  4. Case, J., M. Fedor, M. Schoffstall, and J. Davin. 1988. "A Simple Network Management Protocol." RFC 1067, IETF, August. Accessed 2023-06-07.
  5. Chevre, S. 2019. "Distributed tracing with W3C Trace Context for improved end-to-end visibility." Blog, Dynatrace, May 17. Updated 2022-11-22. Accessed 2023-06-05.
  6. Conran, M. 2022. "Microservices Observability." The Visual Age, Blog, Network Insight, August 16. Accessed 2023-06-05.
  7. Dapr. 2022. "Observability." Documentation, Dapr, October 12. Accessed 2023-06-05.
  8. Darrington, J. 2022. "Why Observability is Important for IT Ops." Blog, Graylog, December 9. Accessed 2023-06-05.
  9. Dhaduk, H. 2023. "6 Observability Design Patterns for Microservices Every CTO Should Know." Blog, Simform, January 12. Accessed 2023-06-05.
  10. Dhamija, G., V. Mehto, and Y. Sethi. 2022. "Build an observability solution using managed AWS services and the OpenTelemetry standard." Blog, AWS Cloud Operations & Migrations, January 20. Accessed 2023-06-05.
  11. Fernandes, I. 2019. "Observability at La Redoute." Blog, La Redoute, September 18. Accessed 2023-06-05.
  12. Google Cloud. 2022. "Observability in Google Cloud." White paper, Google Cloud, May 4. Accessed 2023-06-05.
  13. Hamric, K. 2022. "Tracing the History of Distributed Tracing & OTel." Blog, Tracetest, May 26. Accessed 2023-06-07.
  14. IBM. 2023. "What is observability?" IBM. Accessed 2023-06-05.
  15. Jagannathan, I. K., and R. McDonough. 2022. "Full-stack observability and application monitoring with AWS." AWS Summit SF 2022, AWS Events, on YouTube, August 8. Accessed 2023-06-05.
  16. Kanjilal, J. 2021. "Microservices Observability and Monitoring." Developer.com, October 8. Accessed 2023-06-05.
  17. Kanzhelev, S., M. McLean, A. Reitbauer, B. Drutu, N. Molnar, and Y. Shkuro (eds). 2021. "Trace Context." W3C Recommendation, November 23. Accessed 2023-06-05.
  18. Kushwaha, N. 2022. "Microservices Observability Design Patterns." LEARNCSDESIGN, July 1. Accessed 2023-06-05.
  19. Levin, J. and T. A. Benson. 2020. "ViperProbe: Rethinking Microservice Observability with eBPF." 2020 IEEE 9th International Conference on Cloud Networking (CloudNet), Piscataway, NJ, USA, pp. 1-8. doi: 10.1109/CloudNet51028.2020.9335808. Accessed 2023-06-05.
  20. Lewis, J. and M. Fowler. 2014. "Microservices." March 10. Updated 2014-03-25. Accessed 2023-06-07.
  21. Li, B., X. Peng, Q. Xiang, H. Wang, T. Xie, J. Sun, and X. Liu. 2022. "Enjoy your observability: an industrial survey of microservice tracing and analysis." Empirical Software Engineering, vol. 27, article no. 25. Accessed 2023-06-05.
  22. Lumigo. 2022. "Microservices Observability: 3 Pillars and 6 Design Patterns." Lumigo, August 31. Updated 2022-11-30. Accessed 2023-06-05.
  23. Majors, C., L. Fong-Jones, and G. Miranda. 2022. "Observability Engineering." First Edition, O'Reilly Media, May. Accessed 2023-06-05.
  24. Microsoft Docs. 2022. "Observability patterns." .NET Documentation, Microsoft, April 7. Accessed 2023-06-05.
  25. Microsoft Docs. 2023. "Cloud monitoring guide: Observability." Documentation, Azure, Microsoft, March 21. Accessed 2023-06-05.
  26. NGINX. 2015. "The Future of Application Development and Delivery Is Now." NGINX. Accessed 2018-11-11.
  27. OpenTelemetry. 2023. "Homepage." OpenTelemetry. Accessed 2023-06-05.
  28. OpenTelemetry. 2023a. "Traces." Documentation, OpenTelemetry, May 12. Accessed 2023-06-07.
  29. Parkinson, P. 2022. "Unified Observability: Metrics, Logs, and Tracing of App and Database Tiers in a Single Grafana Console." DZone, February 9. Accessed 2023-06-05.
  30. Reinholds, A. 2021. "What is observability?" Blog, New Relic, November 30. Updated 2022-12-12. Accessed 2023-06-05.
  31. Richardson, C. 2023. "tagged with: observability." Microservices.io. Accessed 2023-06-05.
  32. Sindhu. 2021. "Observability Vs Monitoring: Key Differences You should know." Blog, Atatus, December 10. Accessed 2023-06-05.
  33. Sridharan, C. 2017. "Monitoring and Observability." On Medium, September 5. Accessed 2023-06-07.
  34. Sridharan, C. 2018. "Distributed Systems Observability." First Edition, O'Reilly Media, May. Accessed 2023-06-05.
  35. Tozzi, C. 2022. "From Kálmán to Kubernetes: A History of Observability in IT." Blog, Broadcom, January 31. Accessed 2023-06-05.
  36. Treat, T. 2019. "Microservice Observability, Part 1: Disambiguating Observability and Monitoring." Brave New Geek, October 3. Accessed 2023-06-05.
  37. Twitter. 2013. "Observability at Twitter." Blog, Twitter, September 9. Accessed 2023-06-05.
  38. Twitter. 2016. "Observability at Twitter: technical overview, part I." Blog, Twitter, March 18. Accessed 2023-06-05.
  39. Usman, M., S. Ferlin, A. Brunstrom, and J. Taheri. 2022. "A Survey on Observability of Distributed Edge & Container-Based Microservices." IEEE Access, vol. 10, pp. 86904-86919. doi: 10.1109/ACCESS.2022.3193102. Accessed 2023-06-05.
  40. Van der Linde, H. 2020. "Understanding Observability: A Cloud Observability Framework." Blog, VMware, September 17. Accessed 2023-06-05.
  41. Wiggers, S-J. 2022. "Top 5 Azure Observability Tools in 2023." Blog, November 28. Accessed 2023-06-05.
  42. Williams, R. and S. Clarke. 2023. "GigaOm Radar for Cloud Observability." White paper, v3.02, GigaOm, March 10. Accessed 2023-06-05.

Further Reading

  1. Li, B., X. Peng, Q. Xiang, H. Wang, T. Xie, J. Sun, and X. Liu. 2022. "Enjoy your observability: an industrial survey of microservice tracing and analysis." Empirical Software Engineering, vol. 27, article no. 25. Accessed 2023-06-05.
  2. Sridharan, C. 2017. "Monitoring and Observability." On Medium, September 5. Accessed 2023-06-07.
  3. Twitter. 2013. "Observability at Twitter." Blog, Twitter, September 9. Accessed 2023-06-05.
  4. Majors, C., L. Fong-Jones, and G. Miranda. 2022. "Observability Engineering." First Edition, O'Reilly Media, May. Accessed 2023-06-05.
  5. Anand, A. 2023. "What is Context Propagation in Distributed Tracing?" Blog, SigNoz, April 3. Accessed 2023-06-07.
  6. Treat, T. 2020. "Microservice Observability, Part 2: Evolutionary Patterns for Solving Observability Problems." Brave New Geek, January 3. Accessed 2023-06-05.
