Chaos Engineering

Distributed software systems are an integral part of today's world. Many organizations such as Amazon and Netflix are serving billions of users through such distributed systems. These systems are inherently complex and chaotic. They have many interacting parts whose collective behaviour can be unpredictable. Moreover, frequent deployments add to the uncertainty. The most significant weaknesses should be addressed proactively, before they affect customers in production and lead to loss of revenue.

Chaos Engineering addresses these challenges by injecting failures into the system in a controlled manner. We then observe how resilient the system is. Where we find significant flaws, we seek solutions. Chaos Engineering leads to more failure-resistant systems.

It's been said that,

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Discussion

  • What's the context in which Chaos Engineering is relevant?

    In a distributed system, even when each component is well-tested, there could be problems when these components come together. Distributed systems are by nature vulnerable to network connectivity issues such as low bandwidth, high latency, packet loss or complete link failure. Each component can also fail in unexpected ways. It's hard to test for many real-world failures in a typical software development lifecycle.

    It's possible to design a system to be fault tolerant. Even when one component fails, the effect on the overall system should be minimal. The system as a whole is designed to be better than the sum of its parts. But design alone isn't enough. We need to be confident that the system will perform despite failures. This is where Chaos Engineering becomes relevant.

    Chaos Engineering starts by establishing normal steady state behaviour in terms of observable metrics. Then we inject failures in a controlled manner and record the same metrics. The hypothesis is that such failures don't affect the system. If the hypothesis fails, we know that we need to improve the system to handle those specific failures.
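
    To make this concrete, here's a minimal sketch (in Python) of such an experiment loop. The callables get_error_rate and inject_failure are hypothetical placeholders for a real monitoring query and a real fault-injection tool; the threshold and durations are illustrative, not prescriptive.

      # Minimal sketch of a chaos experiment loop. get_error_rate() reads a
      # steady-state metric from the monitoring system; inject_failure()
      # injects one fault and returns a cleanup callable. Both are hypothetical.
      import time

      def run_experiment(get_error_rate, inject_failure,
                         max_error_rate=0.01, duration_s=300, interval_s=30):
          """Return True if the steady-state hypothesis holds while a failure is injected."""
          # 1. Confirm the system is in steady state before experimenting.
          if get_error_rate() > max_error_rate:
              raise RuntimeError("System not in steady state; aborting experiment")
          # 2. Inject one real-world failure in a controlled manner.
          cleanup = inject_failure()
          try:
              # 3. Keep comparing the metric against the steady-state threshold.
              for _ in range(duration_s // interval_s):
                  time.sleep(interval_s)
                  if get_error_rate() > max_error_rate:
                      return False   # hypothesis disproved: a weakness to fix
              return True            # hypothesis holds for this failure
          finally:
              cleanup()              # 4. Always roll back, limiting the blast radius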

  • What are the principles of Chaos Engineering?
    Principles of Chaos Engineering. Source: Rosenthal 2021.

    The main principles of Chaos Engineering are as follows:

    • Build a Hypothesis around Steady State Behavior: Measure system output over a short period of time. This is the steady state. It indicates normal behavior. Hypothesize that the steady state will continue during experiments. Steady state should be based on measurable output, not internal attributes of the system. Throughput, error rates, latency percentiles, etc. could be metrics that define steady state behavior.
    • Vary Real-world Events: Inject real-world events such as server crashes, malformed responses, or traffic spikes. Observe what happens. Compare against the steady state. Any deviation disproves the hypothesis.
    • Run Experiments in Production: Prefer to experiment directly in production. This ensures authenticity and relevance of the currently deployed system.
    • Automate Experiments to Run Continuously: Running experiments manually is unsustainable. Automate the process.
    • Minimize Blast Radius: Ideally, experiments shouldn't affect customers. At least, minimize the negative impact on customers.
  • Could you share more details on implementing Chaos Engineering?
    Phases of Chaos Engineering. Source: Hornsby 2019.

    Amazon uses the number of orders as a metric since they found that page load times affect it. Netflix uses the number of times a user clicks play, which is affected by system failures. Given that metrics are central to Chaos Engineering, they should be easy to measure, report and visualize.

    When metrics show problems, measure how long it takes to notice the problem, notify engineers, and self-recover. Do a post-mortem of every problem: what happened, what was its impact, why it occurred, and how to prevent it. Prioritize fixing these problems over developing new features.

    Experiments could inject complete failures or affect performance in any component. Terminate the recommendation engine, the caching service or the load balancer. Increase inter-service latency or packet drops. Failures can even be at the UI layer. If a UI widget fails to load, do the other widgets expand to fill up the extra space? Get the whole team involved since each one is likely to propose a different type of experiment. Study the specifications to identify possible experiments.
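
    As an illustration of injecting latency and packet loss, the sketch below wraps the standard Linux tc/netem utility from Python. It assumes a Linux host, root privileges, and that "eth0" is the interface of interest; it isn't tied to any particular Chaos Engineering product.

      # Sketch: add artificial latency and packet loss to egress traffic on a
      # Linux host using tc/netem. Assumes root privileges and interface "eth0".
      import subprocess

      def add_network_chaos(interface="eth0", delay_ms=200, loss_pct=5):
          """Inject latency and packet loss on all egress traffic."""
          subprocess.run(
              ["tc", "qdisc", "add", "dev", interface, "root", "netem",
               "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"],
              check=True)

      def remove_network_chaos(interface="eth0"):
          """Restore normal networking by deleting the netem qdisc."""
          subprocess.run(
              ["tc", "qdisc", "del", "dev", interface, "root", "netem"],
              check=True)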

  • What are some myths about Chaos Engineering?

    The following describes and dispels some myths about Chaos Engineering:

    • Chaos: It's not about introducing chaos. Experiments are done in a controlled manner. Engineers select which systems should be affected and to what extent. They monitor continuously. If customers get affected, the experiment is stopped.
    • Reliability: Apart from adding new features, developers also care about reliability. Developers perform unit testing. Chaos Engineering complements their efforts by finding problems in real-world scenarios. But it's not about testing in production. We expect well-tested systems to come into production.
    • Tooling: Chaos Engineering doesn't require any specialized hardware/software tools. In most cases, access to the host OS or containers is adequate.
    • Observability: Basic metrics, often readily available in cloud platforms, can suffice. We don't need to wait for complex metrics collection to be in place.
    • Scale: Chaos Engineering is not just for large distributed systems. It could be applied to even monoliths that need to be understood better.
    • ROI: Teams need to invest time upfront but this is worthwhile. It saves time later by minimizing production outages and troubleshooting. Chaos Engineering gives insights into optimal use of resources, increases productivity, and helps grow the business.
  • What is Chaos Monkey?
    Chaos Monkey logo. Source: Netflix 2021.

    Chaos Monkey is a software tool developed at Netflix that randomly terminates production instances to simulate failures. In the world of microservices, it should be possible to lose an instance and replace it with another without loss of application functionality or consistency. Instances are meant to be stateless; that is, they don't store data. Depending on the traffic load, instances can be created and destroyed at will. Chaos Monkey validates this capability.
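
    The sketch below captures the core idea, not Chaos Monkey's actual implementation: pick one running instance at random from a group opted in to experiments and terminate it. It assumes boto3 with valid AWS credentials and a hypothetical "chaos=enabled" tag marking eligible instances.

      # Sketch of the Chaos Monkey idea: terminate one randomly chosen instance
      # from those tagged as eligible for chaos experiments. Hypothetical tag.
      import random
      import boto3

      def terminate_random_instance(region="us-east-1"):
          ec2 = boto3.client("ec2", region_name=region)
          # Find running instances opted in to chaos experiments.
          reservations = ec2.describe_instances(
              Filters=[{"Name": "tag:chaos", "Values": ["enabled"]},
                       {"Name": "instance-state-name", "Values": ["running"]}]
          )["Reservations"]
          instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
          if not instances:
              return None
          victim = random.choice(instances)   # keep the blast radius to one instance
          ec2.terminate_instances(InstanceIds=[victim])
          return victim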

    Imagine a wild weaponized monkey "randomly shooting down instances and chewing through cables." This is where we get the name Chaos Monkey. In the early years of Netflix, it was common practice to run Chaos Monkey during business hours, monitor for changes and identify weaknesses. This led to better auto-recovery mechanisms.

    The myth that Chaos Engineering breaks things in production comes from the early use of Chaos Monkey. Today, we have many more tools to exercise Chaos Engineering. The randomness of Chaos Monkey is a useful starting point when the system's a black box. With better knowledge of how the system works, more fine-grained and controlled experiments can be done.

  • What is the Simian Army and what are its different members?
    Chaos Kong simulating regional outage in US-West-2 region. Source: Basiri et al. 2015.

    The Simian Army is a suite of failure-inducing tools designed to add more functionalities beyond Chaos Monkey. Some tools inject failures while others take proactive actions to avoid future failures. Its members include:

    • Latency Monkey: It causes artificial delays or even service shutdowns in RESTful client-server communications.
    • Conformity Monkey: It finds instances that don't adhere to predefined rule sets and shuts them down.
    • Doctor Monkey: It performs health checks (CPU load, memory usage, etc.) to detect unhealthy instances and removes them.
    • Janitor Monkey: It ensures that the cloud environment is running free of clutter and waste.
    • Security Monkey: It finds security violations or vulnerabilities and terminates the violating instances.
    • 10–18 Monkey: It detects configuration and runtime problems in instances that are accessible across multiple geographic regions, involving multiple languages and character sets.
    • Chaos Gorilla: It simulates an outage of an entire AWS availability zone. Services should automatically rebalance to the other active availability zones.
    • Chaos Kong: It simulates region outages. An AWS Region has multiple availability zones.
  • What are the benefits of Chaos Engineering?

    Chaos Engineering has different benefits depending on the perspective:

    • Customer: Chaos Engineering predicts or prevents failures before they happen. This results in increased availability and durability of services.
    • Business: Chaos Engineering helps prevent large losses in revenue and maintenance costs that can result from failures during production hours.
    • Technical: The insights from experiments help reduce incidents, increase understanding of system failure modes, improve system design and detect high-severity incidents faster.

    A 2021 study reported that increased availability, lower mean time to resolution (MTTR), lower mean time to detection (MTTD), fewer bugs in production, and fewer outages are some of the benefits. Those who practise Chaos Engineering can achieve more than 99.9% service availability.

    In 2015, Amazon's DynamoDB service experienced an availability issue in their US-EAST-1 region. This issue caused many other dependent AWS services to fail. It resulted in the unavailability of some of the Internet's biggest sites and applications. However, Netflix services were affected only in a minor way, thanks to their Chaos Engineering practices.

  • Beyond Netflix, how has the industry adopted Chaos Engineering?
    Types of failures used at DBS Bank. Source: DBS 2021, 7:11.

    In 2019, a DevOps and Cloud InfoQ Trends report showed that Chaos Engineering was moving from the "innovator adoption" stage to the "early adoption" stage. The primary adopters of Chaos Engineering have been e-commerce and big tech. Downtime directly impacts revenue in e-commerce. For big tech, better reliability brings better returns.

    By 2020, many businesses, from large financial institutions to health care organizations, were adopting Chaos Engineering in their culture. Big organizations such as Uber, JP Morgan Chase and GrubHub have also adopted Chaos Engineering to maximize their service quality.

    Singapore's DBS Bank tried out the tools Pumba, Toxiproxy and Vizceral. Their approach was to do Chaos Engineering in non-production environments and collect only metrics in production. They still got many insights and made their applications more fault tolerant. They applied Chaos Engineering across the entire software development lifecycle.

  • What are some best practices in Chaos Engineering?
    Some essentials to have before practising Chaos Engineering. Source: Hornsby 2019.

    Before starting on Chaos Engineering, design the system to be resilient at many levels: infrastructure, networking, data, application, people and culture. Adopt architectural patterns to achieve this. Three techniques adopted at Netflix are timeouts, retries and fallbacks.
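
    To illustrate those three techniques, here's a minimal Python sketch combining a request timeout, a bounded retry with exponential backoff, and a static fallback when all retries fail. The service URL and fallback values are hypothetical, not Netflix's actual implementation.

      # Sketch of timeout, retry and fallback around a service call.
      # URL and fallback data are hypothetical.
      import time
      import requests

      FALLBACK_RECOMMENDATIONS = ["popular-title-1", "popular-title-2"]  # static fallback

      def get_recommendations(user_id, retries=3):
          for attempt in range(retries):
              try:
                  resp = requests.get(
                      f"https://recommendations.example.com/users/{user_id}",
                      timeout=2)                   # timeout: fail fast, don't hang
                  resp.raise_for_status()
                  return resp.json()
              except requests.RequestException:
                  time.sleep(2 ** attempt)         # retry with exponential backoff
          return FALLBACK_RECOMMENDATIONS          # fallback: degrade gracefully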

    When proposing a hypothesis, start with what you know and understand. Then move towards what you don't know and don't understand. A useful question to ask is, "What could go wrong?" Inject only one failure at a time.

    A company can have a dedicated Chaos Engineering team of 2-5 engineers. But Chaos Engineering is not the responsibility of this team alone. It's been observed that some teams are quick to embrace it: Traffic Team (e.g. Nginx, Apache, DNS), Streaming Team (e.g. Kafka), Storage Team (e.g. S3), Data Team (e.g. Hadoop/HDFS), and Database Team (e.g. MySQL, Amazon RDS, PostgreSQL).

Milestones

2000

In the early 2000s, Amazon creates a program named GameDay. Its purpose is to inject failures into critical systems, and then use monitoring tools and alerts to obtain greater insight into how well Amazon's retail website responds to these failures. GameDay brings out architectural flaws and other defects. While initially successful, the program is later halted due to its impact on customer-facing services.

2010

Netflix Engineering Team creates Chaos Monkey. It's created in response to Netflix's move from physical infrastructure to AWS cloud infrastructure. Chaos Monkey is meant to verify that the loss of an instance doesn't affect the Netflix streaming experience. Chaos Monkey has its limitations. It requires Spinnaker and MySQL. It lacks recovery capabilities and a user interface.

2011
Logo of the Simian Army. Source: Netflix 2014.

After the success of Chaos Monkey, the Netflix Engineering team develops the Simian Army. The Simian Army adds additional failure injection modes, allowing developers to test the system with different failures. In 2012, Netflix makes Chaos Monkey publicly available by sharing its source code on GitHub.

2014

Netflix decides to create a new role in the organization: the Chaos Engineer. Bruce Wong coins the term and Dan Woods shares it with the greater engineering community via Twitter. In October, Netflix shares in a blog post that they're using an internal solution called Failure Injection Testing (FIT). FIT allows engineers to break things while precisely controlling the impact.

2016

Kolton Andrus and Matthew Fornaciari establish Gremlin, the world's first managed enterprise Chaos Engineering solution. Gremlin becomes publicly available in late 2017 with multiple failure injection modes.

Oct
2016
A typical workflow with Netflix ChAP. Source: Dumiak 2021.

At the IEEE International Symposium on Software Reliability Engineering, Netflix engineers present their internal platform that automates chaos experiments. They call it the Chaos Automation Platform (ChAP). It improves on FIT by routing traffic between control and experimental clusters. These clusters are created with the same configuration as production clusters. Changes in metrics are more easily observed and compared. Complex experiments can be conducted without impacting customers.

2018

Gremlin launches Chaos Conf, the world's first large-scale conference on Chaos Engineering.

2020

Amazon Web Services (AWS) adds Chaos Engineering to the reliability pillar of the AWS Well-Architected Framework (WAF). AWS announces Fault Injection Simulator (FIS), a fully managed service for natively running chaos experiments on AWS services.

2021

Gremlin publishes the first ever State of Chaos Engineering report that shows how the practice of Chaos Engineering has grown among organizations, key benefits of Chaos Engineering, how often top performing teams run chaos experiments, and more.

References

  1. Andrus, Kolton. 2020. "The Future Of Chaos Engineering Across Industries." Forbes, November 23. Accessed 2021-12-27.
  2. Andrus, Kolton, Naresh Gopalani, and Ben Schmaus. 2014. "FIT: Failure Injection Testing." Blog, Netflix, October 23. Accessed 2022-01-09.
  3. Basiri, Ali, Lorin Hochstein, Abhijit Thosar, and Casey Rosenthal. 2015. "Chaos Engineering Upgraded." Blog, Netflix, September 25. Accessed 2021-12-22.
  4. Basiri, Ali, Aaron Blohowiak, Lorin Hochstein, Nora Jones, Casey Rosenthal, and Haley Tucker. 2017. "ChAP: Chaos Automation Platform." Blog, Netflix, July 26. Accessed 2021-12-27.
  5. Basiri, Ali, Lorin Hochstein, Nora Jones, and Haley Tucker. 2019. "Automating chaos experiments in production." arXiv, v1, May 12. Accessed 2022-01-09.
  6. Butow, Tammy. 2021. "Chaos Engineering: the history, principles, and practice." Tutorial, Gremlin, May 5. Accessed 2021-12-23.
  7. DBS. 2021. "How DBS dispelled the myths of Chaos Engineering." DBS, on YouTube, May 5. Accessed 2022-01-09.
  8. Dumiak, Michael. 2021. "Chaos Engineering Saved Your Netflix." IEEE Spectrum, March 03. Accessed 2021-12-27.
  9. Gremlin. 2018. "The Simian Army: Overview and Resources." Gremlin, October 17. Accessed 2021-12-23.
  10. Gremlin. 2021. "Chaos Monkey Guide for Engineers - Tips, Tutorials, and Training." Gremlin. Accessed 2021-12-26.
  11. Hornsby, Adrian. 2018. "Patterns for Resilient Architecture — Part 1." The Cloud Architect, on Medium, July 25. Accessed 2022-01-09.
  12. Hornsby, Adrian. 2019. "Chaos Engineering — Part 1: The art of breaking things purposefully." The Cloud Architect, on Medium, July 01. Accessed 2021-12-24.
  13. Izrailevsky, Yury, and Ariel Tseitlin. 2011. "The Netflix Simian Army." Blog, Netflix, July 19. Accessed 2021-12-24.
  14. Jones, Mark. 2020. "JPMC, Uber, GrubHub — tech giants talk chaos engineering." Techq, October 05. Accessed 2021-12-27.
  15. Netflix. 2014. "File:Netflix simianarmy-768x797.jpg." Wikimedia Commons, April 02. Accessed 2021-12-24.
  16. Netflix. 2021. "Chaos Monkey: Home." Docs, Netflix. Accessed 2021-12-24.
  17. Newman, Andre. 2020. "7 Important Truths About Chaos Engineering." DevOps, November 19. Accessed 2021-12-27.
  18. Pawlikowski, Mikolaj. 2021. "Breaking the top five myths around chaos engineering." CloudTech, Cloud Computing News, April 28. Accessed 2022-01-09.
  19. Rosenthal, Casey. 2021. "The Advanced Principles of Chaos Engineering." Blog, Verica, May 27. Accessed 2021-12-24.
  20. chaos-eng. 2019. "Principles of Chaos Engineering." chaos-eng, on GitHub, March. Accessed 2021-12-23.

Further Reading

  1. chaos-eng. 2019. "Principles of Chaos Engineering." chaos-eng, on GitHub, March. Accessed 2021-12-23.
  2. Hornsby, Adrian. 2019. "Chaos Engineering — Part 1: The art of breaking things purposefully." The Cloud Architect, on Medium, July 01. Accessed 2021-12-24.
  3. Basiri, Ali, Lorin Hochstein, Abhijit Thosar, and Casey Rosenthal. 2015. "Chaos Engineering Upgraded." Blog, Netflix, September 25. Accessed 2021-12-22.
  4. Butow, Tammy. 2021. "Chaos Engineering: the history, principles, and practice." Tutorial, Gremlin, May 5. Accessed 2021-12-23.
  5. Izrailevsky, Yury, and Ariel Tseitlin. 2011. "The Netflix Simian Army." Blog, Netflix, July 19. Accessed 2021-12-24.
