Microservices Observability
- Summary
- Discussion
  - How does observability differ from monitoring?
  - What are the main benefits of observability?
  - What are the key pillars of observability?
  - What's the typical pipeline for microservices observability?
  - What design patterns are available for microservices observability?
  - As a developer, should I care about observability?
  - What are the best practices when implementing observability?
  - What tools are available for observability?
  - What are some challenges with observability?
- Milestones
- References
- Further Reading
With traditional monolithic applications, troubleshooting problems was easier than it is with the more recent microservices architecture. Log files captured information for debugging and analysis. Monitoring highlighted problems via static dashboards and alerts. The application itself was generally well understood.
With microservices, there are more moving parts. The system is dynamic and heterogeneous. Its parts are loosely coupled and transient. The system can fail in many different and unpredictable ways. It's necessary to monitor the application, the network and the infrastructure. It's in this context that observability (aka o11y) becomes important.
Can we understand the inner workings of the system? Can we correlate the outputs with the inputs? Can we explain unexpected behaviour? Can data help us achieve business goals? A system that answers these questions effectively is said to be observable.
Discussion
How does observability differ from monitoring?
IT teams have been monitoring system performance since the early 2000s. They sample and aggregate important data points such as system downtime, response time and memory usage. These are compared against expectations. Anomalies and violations are raised as alerts for support teams. Monitoring focuses on overall system health and business KPIs.
Observability goes deeper with the aim of capturing system behaviour in greater detail and context. An application composed of many interacting microservices can fail in unpredictable ways. So it's not sufficient to focus on specific failures or scenarios. While monitoring deals with known unknowns, observability deals with unknown unknowns. Data analysis is an essential aspect of observability. Failures are traced back to root causes. To put it simply,
Monitoring tells you when something is wrong. Observability lets you ask why.
Observability doesn't replace monitoring. Monitoring is essential to observability. We may even view observability as a superset of monitoring.
What are the main benefits of observability?
Observability leads to better visibility into system dynamics. We can identify bottlenecks and optimize workflows. It creates a culture of innovation, improves operational efficiency and enables data-driven business decisions. DevSecOps teams can use the insights from observability to build more secure applications.
Because observability helps us understand the system better, engineers can more confidently update and release software. Release cycles can be shorter. Quality can be improved. Problems that we didn't know existed in the system ("unknown unknowns") become visible. We can find and fix issues earlier in the development process. When combined with AIOps, issues can be solved automatically without human intervention.
For any organization moving from monoliths to microservices, observability helps in this transition. It can help them better manage their cloud-native applications.
What are the key pillars of observability?
Observability has three key pillars:
- Logs: Capture individual events. Logs are granular, timestamped and immutable. When things go wrong, logs are invaluable in determining the root cause. A good practice is to log in structured formats such as JSON. Structured logging enables easy visualization, search and analysis (see the sketch after this answer).
- Metrics: Data aggregated over time. Monitoring solutions rely on metrics. Uptime, CPU utilization, system load, memory usage, throughput, response time and error rate are some examples of metrics.
- Traces: Sequence of calls triggered by an event. In the world of microservices, a failure can be traced back to an offending microservice or an API call.
Events and exceptions can be seen as special cases of logs. Some identify dependencies as another pillar. This captures how components depend on one another.
Historically (in 2016), engineers at Twitter identified these four pillars: monitoring, alerting/visualization, distributed systems tracing infrastructure, and log aggregation/analytics.
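To make the logging pillar concrete, here's a minimal Python sketch of structured JSON logging; it's an illustration rather than a prescribed schema, and the service name and fields such as trace_id are arbitrary examples.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-service",                     # illustrative field
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),  # passed via extra=
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields make the event easy to search, filter and aggregate later.
logger.info("payment authorized", extra={"trace_id": "4bf92f3577b34da6"})
```

A collector such as Logstash or Fluent Bit can then parse each line as JSON without fragile regular expressions.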
What's the typical pipeline for microservices observability?
A logger produces logs within each service instance. A centralized collector gathers the logs. Data is then pre-processed and stored to ease later analysis. Pre-processing might include cleaning, formatting, sampling and aggregating (for metrics). Analysts may run ad hoc queries to explore and visualize data, and to solve problems.
Each logger might be part of the main service or deployed as a sidecar container within the same Kubernetes pod as the main container. More generally, an agent pulls logs from services or log files and then pushes them to the collector. Alternatively, log collection can be agent-less, that is, services themselves push logs to the collector. The latter is easier to deploy but requires developers to integrate logging functionality into the service code.
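As a rough sketch of the agent-less option, the service below pushes a structured log record directly to a collector over HTTP; the endpoint URL and payload fields are assumptions for illustration, not a standard ingestion API.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical collector endpoint; real deployments would use the
# collector's documented ingestion API (e.g. OTLP or Fluentd forward).
COLLECTOR_URL = "http://log-collector.internal:8080/ingest"

def push_log(service, level, message, **fields):
    """Send one structured log record straight to the collector."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
        **fields,
    }
    req = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(record).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=2)

push_log("checkout", "ERROR", "payment gateway timeout", order_id="o-1001")
```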
Different types of analysis are possible: timeline analysis, service dependency analysis, aggregation analysis, root cause analysis and anomaly detection. Analysis is enabled by the use of statistics, rules and visualization.
What design patterns are available for microservices observability?
Among the design patterns are:
- Application Metrics: Gives a complete picture of the application. Includes infrastructure, application and end user metrics.
- Audit Logging: Records all user or service account actions. Needed for regulatory compliance. Event sourcing is one approach to capturing audit logs. Likewise, changes to deployments ought to be logged.
- Distributed Tracing: Helps with performance profiling and root cause analysis. A sequence of API calls triggered by a user request is a trace. A trace is composed of spans. Each span records a unit of work within the trace. For example, user request, API gateway processing, service processing and database access are all spans within a trace. Distributed tracing requires context or correlation IDs passed from one service to the next (see the sketch after this list).
- Health Check: An unhealthy service is one that's running but has problems. A health check API can assess current system health. It can trigger alerts, recovery, service restarts, etc.
- Exception Tracking: Includes error messages, error codes and stack traces.
- Log Aggregation: Logs from various microservices are sent to a centralized log server.
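Below is a minimal sketch of context propagation for distributed tracing: the caller generates a W3C Trace Context traceparent header and forwards it to the next service. The header layout follows the W3C format, but the downstream URL and helper names are hypothetical.

```python
import secrets
import urllib.request

def new_traceparent():
    """Build a W3C Trace Context header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)     # 8 bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def call_downstream(url, traceparent):
    """Forward the trace context so both services contribute to one trace."""
    req = urllib.request.Request(url, headers={"traceparent": traceparent})
    return urllib.request.urlopen(req, timeout=2)

# The edge service starts the trace; downstream services keep the trace ID
# and record child spans with their own span IDs.
header = new_traceparent()
call_downstream("http://inventory-service.internal/stock/42", header)
```

The receiving service reads the incoming traceparent, reuses the trace ID and records its own span, so all spans line up under a single trace.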
As a developer, should I care about observability?
Observability is not just the concern of operations. It must be considered at the design stage itself. The system should be designed to be testable. It should be incrementally deployable with support for rollback. The system should collect detailed runtime data.
Logs should be designed to support data with high cardinality and high dimensionality. This makes it possible to query the log data effectively to bring out patterns and root causes. A useful query such as "list all 502 errors in the last 20 minutes from host foobar.com" becomes possible.
Developers must acknowledge that complex systems are unpredictable and failures are inevitable. They should proactively instrument their code, rather than rely on automatic instrumentation. Even controlled testing in production may be permitted. Developers should consider the operational semantics and dependencies of their software. For example, developers should understand how services start/shutdown, their static/dynamic configurations, service discovery, concurrency, etc.
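As an example of proactive instrumentation, here is a small sketch using the OpenTelemetry Python SDK to create a span by hand and attach attributes that auto-instrumentation would not know about; the function and attribute names are made up for illustration, and spans are exported to the console to keep the example self-contained.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Configure a tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def charge_card(order_id, user_id, amount):
    # Developer-added span with high-cardinality attributes.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", user_id)
        span.set_attribute("payment.amount", amount)
        # ... call the payment provider here ...
        return True

charge_card("o-1001", "u-42", 19.99)
```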
What are the best practices when implementing observability?
Monitor essential metrics such as latency, traffic, error rate and saturation. Collect metrics along with contextual metadata. Configure and prioritize alerts correctly. For example, an alert for every 500 status code is a bad idea. Create specialized dashboards rather than overwhelming everyone with all the data.
Rules that detect incidents should be simple, predictable, reliable and actionable. Monitoring systems can become complex over time. They could end up collecting lots of metrics that are never used or complex rules that are never triggered. Data should lead to actionable insights. Data should create effective feedback loops.
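For illustration, here is a minimal sketch with the prometheus_client Python library that records the four essential signals: latency (histogram), traffic and error rate (counter with a status label), and a saturation proxy (in-flight gauge). Metric and route names are arbitrary choices.

```python
import random
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Traffic and errors", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["route"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Saturation proxy: concurrent requests")

def handle_request(route):
    IN_FLIGHT.inc()
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.01, 0.05))              # simulated work
        status = "200" if random.random() > 0.02 else "500"
        REQUESTS.labels(route=route, status=status).inc()   # error rate = 500s / total
    finally:
        LATENCY.labels(route=route).observe(time.perf_counter() - start)
        IN_FLIGHT.dec()

start_http_server(8000)   # exposes /metrics for Prometheus to scrape
for _ in range(100):
    handle_request("/checkout")
```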
Investing in observability tools matters, but tools alone will not solve problems. Teams must adopt a mindset of making their systems observable, from design to deployment.
An observability framework can serve as a reference for implementations. To be effective, observability must be guided by business goals such as Service Level Objectives (SLOs). Look at observability from the perspective of end users (response times, failed purchases) and backend applications (slow queries, container restarts).
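As a concrete illustration, a 99.9% availability SLO over a 30-day window permits roughly 43 minutes of downtime (30 × 24 × 60 × 0.001 ≈ 43), a figure against which error budgets and alert thresholds can be set.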
What tools are available for observability?
Many frameworks (e.g. ASP.NET and Spring Boot) have libraries or plugins that implement OpenTelemetry. Pick one that supports W3C Trace Context. Select tools that support automation and are easy to maintain.
Some have proposed observability stacks that combine analysis, logging and visualization: ELK (Elasticsearch, Logstash, and Kibana), EFK (Elasticsearch, Fluentd, and Kibana), or PLG (Promtail, Loki, Grafana).
For logging from microservices we have Splunk, Loggly, Sumologic, Logstash, Beats, Fluent Bit, and more.
Time-series data can be stored in Atlas, InfluxDB and Prometheus. Databases (such as Oracle) provide exporters and make data accessible to the observability stack. Prometheus helps with monitoring containers. Jaeger offers a visual representation of traces and dependencies. Dashboards can be created with Grafana.
A few others include Chronosphere, Cisco AppDynamics, Datadog, Dynatrace, Honeycomb, IBM Instana, Lumigo, New Relic, Sensu, Sentry, SolarWinds, Serverless360, and Zipkin.
Cloud providers offer managed services for observability. For example, AWS has managed services for Grafana, Prometheus and OpenTelemetry (Java SDK + collector implementation). It also offers its proprietary AWS X-Ray and Amazon CloudWatch.
What are some challenges with observability?
A New Relic survey from 2022 showed that observability is far from mature. Many problems are still detected manually. Organizations use too many tools. Some tools require developers to instrument their code with calls to tracing libraries. Some tools can monitor microservices but not monoliths. Some tools are SaaS only. Some tools sample data while others don't.
Large volumes of data can be overwhelming. Many tools may not scale well when data has high cardinality. Reducing the cardinality will make the system less observable. Designers must therefore consider this tradeoff until better tools become available.
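As an illustrative calculation, a request counter labelled by 20 services, 50 endpoints and 10,000 distinct user IDs can produce up to 20 × 50 × 10,000 = 10 million time series, which is why high-cardinality labels strain many metrics backends.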
Where service meshes are used, observability is harder due to the increased variety of services, data volume and complex request paths. Often systems collect all metrics and later determine the relevant ones to analyze. It would be better to determine upfront the most relevant metrics and collect only those. ViperProbe, an eBPF-based research tool, takes this approach.
Milestones
1960
Rudolf E. Kálmán coins the term observability in the domain of control theory. It's "a measure of how well internal states of a system can be inferred from knowledge of its external outputs." Five decades later, this term gets repurposed towards distributed software systems. In the latter case, observability is the ability to understand and explain any system state, no matter how unusual or unexpected it may be.
1988
At IETF, RFC 1067 titled A Simple Network Management Protocol is published. In subsequent years, this becomes the essential protocol for monitoring network infrastructure via metrics. Time-series databases and dashboards are later born out of metrics. But engineers reach for low-level tools (strace, tcpdump) when metrics fall short for debugging complex systems.
2013
Twitter publishes on its engineering blog an article titled Observability at Twitter. The article describes the elements of observability as practised at Twitter as they moved from a monolithic architecture to a microservices architecture. The article uses the terms "observability team" and "observability stack". Beyond just traditional monitoring, the stack includes a time-series database, dashboards and the ability to run ad hoc queries. This extension of traditional monitoring has been called whitebox monitoring.
2021
GigaOm publishes a research finding on the current state of the market in the cloud observability space. The study considers supported deployment models: SaaS, on-premise or hybrid. It considers many features: dashboards, user interaction performance monitoring, predictive analysis, microservices detection, etc. It's likely that no single platform is the best fit for every use case.
References
- APM Experts. 2020. "What is Observability and How to Implement it." APM Experts, March 1. Accessed 2023-06-05.
- Anand, A. 2023. "What is Context Propagation in Distributed Tracing?" Blog, SigNoz, April 3. Accessed 2023-06-07.
- Basteri, A., and D. Brabham. 2022. "2022 Observability Forecast." Research paper, New Relic, September. Accessed 2023-06-07.
- Case, J., M. Fedor, M. Schoffstall, and J. Davin. 1988. "A Simple Network Management Protocol." RFC 1067, IETF, August. Accessed 2023-06-07.
- Chevre, S. 2019. "Distributed tracing with W3C Trace Context for improved end-to-end visibility." Blog, Dynatrace, May 17. Updated 2022-11-22. Accessed 2023-06-05.
- Conran, M. 2022. "Microservices Observability." The Visual Age, Blog, Network Insight, August 16. Accessed 2023-06-05.
- Dapr. 2022. "Observability." Documentation, Dapr, October 12. Accessed 2023-06-05.
- Darrington, J. 2022. "Why Observability is Important for IT Ops." Blog, Graylog, December 9. Accessed 2023-06-05.
- Dhaduk, H. 2023. "6 Observability Design Patterns for Microservices Every CTO Should Know." Blog, Simform, January 12. Accessed 2023-06-05.
- Dhamija, G., V. Mehto, and Y. Sethi. 2022. "Build an observability solution using managed AWS services and the OpenTelemetry standard." Blog, AWS Cloud Operations & Migrations, January 20. Accessed 2023-06-05.
- Fernandes, I. 2019. "Observability at La Redoute." Blog, La Redoute, September 18. Accessed 2023-06-05.
- Google Cloud. 2022. "Observability in Google Cloud." White paper, Google Cloud, May 4. Accessed 2023-06-05.
- Hamric, K. 2022. "Tracing the History of Distributed Tracing & OTel." Blog, Tracetest, May 26. Accessed 2023-06-07.
- IBM. 2023. "What is observability?" IBM. Accessed 2023-06-05.
- Jagannathan, I. K., and R. McDonough. 2022. "Full-stack observability and application monitoring with AWS." AWS Summit SF 2022, AWS Events, on YouTube, August 8. Accessed 2023-06-05.
- Kanjilal, J. 2021. "Microservices Observability and Monitoring." Developer.com, October 8. Accessed 2023-06-05.
- Kanzhelev, S., M. McLean, A. Reitbauer, B. Drutu, N. Molnar, and Y. Shkuro (eds). 2021. "Trace Context." W3C Recommendation, November 23. Accessed 2023-06-05.
- Kushwaha, N. 2022. "Microservices Observability Design Patterns." LEARNCSDESIGN, July 1. Accessed 2023-06-05.
- Levin, J. and T. A. Benson. 2020. "ViperProbe: Rethinking Microservice Observability with eBPF." 2020 IEEE 9th International Conference on Cloud Networking (CloudNet), Piscataway, NJ, USA, pp. 1-8. doi: 10.1109/CloudNet51028.2020.9335808. Accessed 2023-06-05.
- Lewis, J., and M. Fowler. 2014. "Microservices." March 10. Updated 2014-03-25. Accessed 2023-06-07.
- Li, B., X. Peng, Q. Xiang, H. Wang, T. Xie, J. Sun, and X. Liu. 2022. "Enjoy your observability: an industrial survey of microservice tracing and analysis." Empirical Software Engineering, vol. 27, article no. 25. Accessed 2023-06-05.
- Lumigo. 2022. "Microservices Observability: 3 Pillars and 6 Design Patterns." Lumigo, August 31. Updated 2022-11-30. Accessed 2023-06-05.
- Majors, C., L. Fong-Jones, and G. Miranda. 2022. "Observability Engineering." First Edition, O'Reilly Media, May. Accessed 2023-06-05.
- Microsoft Docs. 2022. "Observability patterns." .NET Documentation, Microsoft, April 7. Accessed 2023-06-05.
- Microsoft Docs. 2023. "Cloud monitoring guide: Observability." Documentation, Azure, Microsoft, March 21. Accessed 2023-06-05.
- NGINX. 2015. "The Future of Application Development and Delivery Is Now." NGINX. Accessed 2018-11-11.
- OpenTelemetry. 2023. "Homepage." OpenTelemetry. Accessed 2023-06-05.
- OpenTelemetry. 2023a. "Traces." Documentation, OpenTelemetry, May 12. Accessed 2023-06-07.
- Parkinson, P. 2022. "Unified Observability: Metrics, Logs, and Tracing of App and Database Tiers in a Single Grafana Console." DZone, February 9. Accessed 2023-06-05.
- Reinholds, A. 2021. "What is observability?" Blog, New Relic, November 30. Updated 2022-12-12. Accessed 2023-06-05.
- Richardson, C. 2023. "tagged with: observability." Microservices.io. Accessed 2023-06-05.
- Sindhu. 2021. "Observability Vs Monitoring: Key Differences You should know." Blog, Atatus, December 10. Accessed 2023-06-05.
- Sridharan, C. 2017. "Monitoring and Observability." On Medium, September 5. Accessed 2023-06-07.
- Sridharan, C. 2018. "Distributed Systems Observability." First Edition, O'Reilly Media, May. Accessed 2023-06-05.
- Tozzi, C. 2022. "From Kálmán to Kubernetes: A History of Observability in IT." Blog, Broadcom, January 31. Accessed 2023-06-05.
- Treat, T. 2019. "Microservice Observability, Part 1: Disambiguating Observability and Monitoring." Brave New Geek, October 3. Accessed 2023-06-05.
- Twitter. 2013. "Observability at Twitter." Blog, Twitter, September 9. Accessed 2023-06-05.
- Twitter. 2016. "Observability at Twitter: technical overview, part I." Blog, Twitter, March 18. Accessed 2023-06-05.
- Usman, M., S. Ferlin, A. Brunstrom, and J. Taheri. 2022. "A Survey on Observability of Distributed Edge & Container-Based Microservices." IEEE Access, vol. 10, pp. 86904-86919. doi: 10.1109/ACCESS.2022.3193102. Accessed 2023-06-05.
- Van der Linde, H. 2020. "Understanding Observability: A Cloud Observability Framework." Blog, VMware, September 17. Accessed 2023-06-05.
- Wiggers, S-J. 2022. "Top 5 Azure Observability Tools in 2023." Blog, November 28. Accessed 2023-06-05.
- Williams, R. and S. Clarke. 2023. "GigaOm Radar for Cloud Observability." White paper, v3.02, GigaOm, March 10. Accessed 2023-06-05.
Further Reading
- Li, B., X. Peng, Q. Xiang, H. Wang, T. Xie, J. Sun, and X. Liu. 2022. "Enjoy your observability: an industrial survey of microservice tracing and analysis." Empirical Software Engineering, vol. 27, article no. 25. Accessed 2023-06-05.
- Sridharan, C. 2017. "Monitoring and Observability." On Medium, September 5. Accessed 2023-06-07.
- Twitter. 2013. "Observability at Twitter." Blog, Twitter, September 9. Accessed 2023-06-05.
- Majors, C., L. Fong-Jones, and G. Miranda. 2022. "Observability Engineering." First Edition, O'Reilly Media, May. Accessed 2023-06-05.
- Anand, A. 2023. "What is Context Propagation in Distributed Tracing?" Blog, SigNoz, April 3. Accessed 2023-06-07.
- Treat, T. 2020. "Microservice Observability, Part 2: Evolutionary Patterns for Solving Observability Problems." Brave New Geek, January 3. Accessed 2023-06-05.
See Also
- Distributed Tracing
- Cloud Monitoring
- Log Analytics
- Log Aggregation
- Microservices Frameworks
- Data Observability