Chaos testing: the ultimate guide

Learn the fundamentals of Chaos Testing, the practice of testing systems under real-world pressure to assess resilience and adaptability.

Eddie Goikhman | April 4, 2024

Chaos testing probes the reliability of software systems by intentionally introducing errors in production. Far from being chaotic, it is a methodical approach to fault tolerance that effectively emulates the unpredictability of real problems.

Adopting chaos engineering helps you identify failure points, increase resilience, and rehearse disaster response strategies. It is a way to manage risk by provoking problems on your own terms. This article will introduce the principles behind chaos engineering and show you how to start implementing your own chaos experiments.

TL;DR: Chaos testing at a glance

Chaos testing involves intentionally introducing disturbances into systems to test resilience and discover vulnerabilities. By simulating real-world disruptions, such as network failures, it prepares systems for unforeseen issues, enhancing reliability. Originating from Netflix’s Chaos Monkey, this proactive approach is essential for building robust, fault-tolerant infrastructures.

Understanding chaos testing

Chaos engineering is a straightforward concept with a catchy name. Instead of waiting for problems to occur, you proactively introduce failure states so issues are uncovered earlier. The "chaos" comes from generating those failures randomly, which makes them as unpredictable as real incidents.

The technique originated at Netflix in the early 2010s. A seminal 2011 blog post explained how an internal tool called Chaos Monkey would periodically disable pieces of Netflix’s production infrastructure. This induced failures that didn’t show up in regular tests. Netflix open-sourced Chaos Monkey, sparking a new approach to reliability engineering.

Chaos engineering aims to help you build more resilient systems by uncovering the hidden interdependencies, flaky code paths, and lurking bugs that cause outages you wouldn’t normally anticipate. Many failures in production occur as a result of highly specific states that aren’t covered by any testing procedure.

Intentionally breaking production lets you discover these issues in a controlled environment. You can develop patches and look for similar problems before a real outage occurs.

Chaos testing methodology

Chaos tests are often referred to as experiments. You’re introducing an unknown into your system and observing how the environment reacts.

A good experiment should have a specific purpose, such as checking how the system responds to a missing component or measuring the impact of increased network latency. Using small and focused experiments will help limit the impact on customers if an outage is triggered.

Here’s a simple example of a chaos experiment procedure:

1. Hypothesis

Write a short statement of what should happen, e.g., “A failover occurs if the primary database server is inaccessible. The system continues to function and is served from the database replicas. The operations team is alerted.”

2. Design a safe experiment

This hypothesis could be tested by taking the primary database server offline. However, that could be dangerous: if the hypothesis is disproven and no failover occurs, the outage lasts for as long as the server takes to come back up. A safer option could be adding a networking rule that blocks connections to the server. You can then rapidly delete the rule if the test causes an incident in production.

3. Execute the experiment

Carry out the plan you’ve designed. Introduce the chaos into your system and measure its effects. If the hypothesis was disproven, use the rollback plan to restore service and then implement changes to improve your system’s resiliency. Chaos testing has just found a weakness for you.

4. Repeat until the hypothesis is proven

The system should reach a state where it remains stable while the fault is present. This proves the hypothesis and allows you to close the experiment. You should subsequently analyze your findings, because they might hint at similar problems in other parts of the system.

Adopting a procedure like this one helps ensure chaos engineering is conducted safely, under controlled conditions. Although you’re intentionally breaking things in production, chaos testing is never meant to cause an incident long enough that customers will notice and complain. To achieve this, you must carefully plan your experiments so a fast rollback is always available.
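
The four steps above can be sketched as a small experiment runner. This is only an illustration under stated assumptions: the `ChaosExperiment` class, `run_experiment` function, and the in-memory `state` dictionary are hypothetical stand-ins for a real environment, where the inject and rollback steps would add and delete an actual network rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    """One small, focused experiment with a built-in rollback."""
    hypothesis: str
    inject: Callable[[], None]        # introduce the fault
    rollback: Callable[[], None]      # undo the fault quickly
    steady_state: Callable[[], bool]  # True while the system is healthy


def run_experiment(exp: ChaosExperiment) -> bool:
    """Return True if the hypothesis held while the fault was active."""
    if not exp.steady_state():
        # Never add chaos to a system that is already unhealthy.
        raise RuntimeError("system is not in a steady state; aborting")
    exp.inject()
    try:
        return exp.steady_state()
    finally:
        # Always remove the fault, whether or not the hypothesis held.
        exp.rollback()


# Hypothetical stand-in for a real environment: a dict that tracks state.
state = {"primary_reachable": True, "replicas_serving": True}


def block_primary() -> None:
    state["primary_reachable"] = False  # stands in for adding a network rule


def unblock_primary() -> None:
    state["primary_reachable"] = True   # stands in for deleting the rule


def service_is_healthy() -> bool:
    # Healthy if either the primary or a replica can serve requests.
    return state["primary_reachable"] or state["replicas_serving"]


experiment = ChaosExperiment(
    hypothesis="The system keeps serving from replicas if the primary is unreachable",
    inject=block_primary,
    rollback=unblock_primary,
    steady_state=service_is_healthy,
)
held = run_experiment(experiment)
print("hypothesis held:", held)
```

Putting the rollback in a `finally` clause reflects the point made above: the fault is removed even when the hypothesis fails or the check itself raises an error.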

Make use of automation

Automated chaos testing is the purest form, because it guarantees the element of randomness that is missing from planned experiments. There are several tools available to break things for you in different programming frameworks and cloud environments.

Chaos Monkey

The original chaos testing tool, Chaos Monkey randomly terminates virtual machines and containers to simulate service failures. It requires Netflix’s Spinnaker continuous delivery platform.

Kube-Monkey

Kube-Monkey brings chaos testing to Kubernetes clusters using an approach inspired by Chaos Monkey. It randomly kills Pods within your cluster. The tool is highly configurable, letting you customize the maximum number of Pods to terminate, a blacklist of services that must not stop, and the time and duration that the monkey runs.
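
The selection logic described here (random victims, a cap on kills, and a protected list) can be illustrated in a few lines. This is a sketch of the general idea, not Kube-Monkey's actual code or configuration format; the `pick_victims` function and the Pod names are hypothetical.

```python
import random
from typing import Dict, List, Set


def pick_victims(
    pods_by_service: Dict[str, List[str]],
    protected_services: Set[str],
    max_kills: int,
    rng: random.Random,
) -> List[str]:
    """Randomly choose up to max_kills Pods, skipping protected services."""
    candidates = [
        pod
        for service, pods in sorted(pods_by_service.items())
        if service not in protected_services
        for pod in pods
    ]
    # Never ask for more Pods than there are candidates.
    return rng.sample(candidates, k=min(max_kills, len(candidates)))


# Hypothetical cluster contents, for illustration only.
pods = {
    "checkout": ["checkout-1", "checkout-2"],
    "search": ["search-1"],
    "payments": ["payments-1"],
}
victims = pick_victims(
    pods,
    protected_services={"payments"},
    max_kills=2,
    rng=random.Random(42),  # seeded so a run can be reproduced
)
print(victims)
```

Passing in a seeded random generator is a design choice for the sketch: it keeps the selection random while letting you replay the same run when investigating a failure.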

VMware Mangle

VMware Mangle can introduce faults to many different deployment environments including Kubernetes, Docker, and VMware’s vCenter. This is a more flexible tool that supports a wide range of different faults beyond simple service terminations. It includes infrastructure-level outages that affect multiple services at once.

Litmus

Litmus is a cloud-native chaos engineering platform that is now backed by the Cloud Native Computing Foundation (CNCF). It runs within Kubernetes, using microservices and custom resource definitions to let you define, execute, and analyze chaos experiments. Litmus is a great option for setting up complex chaos workflows at scale.

Chaos Toolkit

Chaos Toolkit is a tool for writing and running chaos experiments from your terminal. Hypotheses are defined in JSON files that state how the system should behave after a particular event occurs.
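
As a rough illustration, here is what such a file might contain, built in Python so the structure can be inspected before it is saved. The layout (a steady-state hypothesis with probes, a method, and rollbacks) follows the Chaos Toolkit experiment format as I understand it, so check it against the tool's own documentation. The health URL and the two shell scripts are hypothetical placeholders.

```python
import json

# A sketch of a Chaos Toolkit style experiment, expressed as a Python dict.
experiment = {
    "title": "Service survives loss of the primary database",
    "description": "Block the primary and confirm replicas keep serving.",
    "steady-state-hypothesis": {
        "title": "Health endpoint answers with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",  # hypothetical URL
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "block-primary-database",
            "provider": {
                "type": "process",
                "path": "./block-primary.sh",  # hypothetical script
            },
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "unblock-primary-database",
            "provider": {
                "type": "process",
                "path": "./unblock-primary.sh",  # hypothetical script
            },
        }
    ],
}

experiment_json = json.dumps(experiment, indent=2)
print(experiment_json)
```

Saved as `experiment.json`, a file like this would typically be run with `chaos run experiment.json`.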

Using one or more of these tools lets you add chaos to your system while maintaining safeguards in case problems occur. Combine random service terminations with your own purposeful experiments to get the most complete coverage.

Chaos testing, security, and vulnerabilities

Chaos testing is also an effective way to manage vulnerabilities. It helps to pinpoint weaknesses that could give an attacker leverage inside your architecture.

A successful exploit of a vulnerability can lead to a chaos scenario, as your system gets exposed to an unknown threat. Attack chains can spread throughout infrastructure, causing cascading failures in disparate areas.

As an example, a denial-of-service (DoS) attack against a low-priority service might force critical ones offline too—if they have hidden interdependencies. Engaging in chaos testing is a good way to anticipate and mitigate the damage that exploits can cause.
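
One way to reason about this kind of cascade before running an experiment is to walk a dependency map and list everything that sits downstream of a failed service. The sketch below assumes every dependency is a hard one with no fallback, which is the worst case; the `blast_radius` function and the service names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map: for each service, who depends on it?
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical architecture: checkout quietly depends on a thumbnail service.
depends_on = {
    "checkout": ["payments", "thumbnails"],
    "payments": ["database"],
    "search": ["database"],
    "thumbnails": [],
}
print(sorted(blast_radius(depends_on, "thumbnails")))  # ['checkout']
print(sorted(blast_radius(depends_on, "database")))    # ['checkout', 'payments', 'search']
```

A map like this only captures the dependencies you already know about. Chaos experiments are what reveal the ones that are missing from it.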

In February 2023, Cloudflare detected and mitigated what was then the largest distributed denial-of-service (DDoS) attack ever recorded. The attack, dubbed “hyper-volumetric,” peaked at 71 million requests per second (rps), 54 percent higher than the previous record of 46 million rps reported in June 2022.

Chaos testing also helps gauge your system’s susceptibility to attack techniques cataloged in knowledge bases such as the MITRE ATT&CK framework. By emulating common attack mechanisms such as request flooding, you can assess whether the compromise of one service is likely to adversely affect the others. Knowledge of failure points can even help you combat an attack in progress: intentionally disabling selected pieces of infrastructure can act as a kill switch that stops the attack from spreading.

Get into the right mindset

Chaos testing isn’t just about tools and experiments. Engineers should adopt a naturally analytical mindset so potential problems are resolved before they are introduced into code.

Many failure points can be anticipated early on during the design and development phases of the software lifecycle. Hard dependencies on specific services, reliance on outside providers, and assumptions that a stable network will be available are capable of causing outages in production. All three of these examples can be easily handled in code by implementing a retry and fallback system.
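
A minimal sketch of a retry and fallback helper, assuming the operation is safe to repeat and that a degraded answer (such as a cached value) is acceptable. The `call_with_retry` function and the example functions are hypothetical names used for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
) -> T:
    """Try `primary` up to `attempts` times, then use `fallback`."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Real code should catch only the errors it expects,
            # for example timeouts or connection failures.
            if attempt < attempts - 1:
                # Exponential backoff: wait longer after each failure.
                time.sleep(base_delay * (2 ** attempt))
    return fallback()


calls = {"count": 0}


def flaky_lookup() -> str:
    # Fails twice, then succeeds, like a brief network blip.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network blip")
    return "fresh value"


def always_down() -> str:
    raise ConnectionError("provider offline")


def cached_value() -> str:
    return "cached value"


recovered = call_with_retry(flaky_lookup, cached_value, base_delay=0.0)
degraded = call_with_retry(always_down, cached_value, base_delay=0.0)
print(recovered)  # fresh value
print(degraded)   # cached value
```

The retry absorbs short-lived failures, while the fallback keeps the caller working when a dependency or outside provider stays down.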

Developers can create more resilient systems by taking a “what if” approach to their work. This is the chaos mindset. Continually assessing possible failure modes encourages protections to be implemented at the time code is written, instead of after an outage in production.

Why chaos testing won’t prevent every outage

Chaos testing is an effective way to increase reliability, but it can’t prevent every production incident. It’s not realistic to anticipate every possible failure. Some will only occur under highly specific situations that even random chaos experiments can’t replicate.

What chaos testing does deliver is a deeper understanding of your system’s failure points. This information is invaluable when building resiliency, implementing security protections and addressing live incidents. The insights gleaned from your chaos experiments can inform the likely cause of an outage even when you haven’t seen the exact set of symptoms before.

Adopting a chaos mindset promotes institutional awareness of weaknesses by encouraging defensive coding practice. This culminates in a net improvement in software reliability over time. Chaos isn’t about preventing outages entirely; it’s meant to offer early mitigation of discoverable issues, while better equipping you to assess the probable causes of what remains.

Next steps

Chaos testing is a technique that enhances software reliability through the intentional introduction of failures. It sounds disruptive but is a proven way to find faults early on before they cause unanticipated incidents.

Chaos testing doesn’t mean a chaotic implementation. You should take a methodical approach so that chaos is added safely and with minimal impact on your users. Clear experiments with a planned rollback strategy are the key to successful testing. You can also pick from a growing selection of automated tools that will randomly terminate infrastructure components for you.

Chaos engineering is a proactive approach to fault tolerance where issues are discovered on your terms. This is more efficient and less stressful than dealing with incidents reactively, while customers are being affected. Viewing chaos engineering as a mindset delivers the greatest results by helping you ship code that’s resilient from the outset.

Drive resiliency with a holistic approach to cyber risk management. Correlate, prioritize, and manage vulnerabilities and risk at scale and across all your attack surfaces with Vulcan Cyber®. Schedule a demo today.

Chaos testing FAQs

What is the difference between stress testing and chaos testing?

Stress testing evaluates a system’s capacity under heavy load, aiming to identify and mitigate performance issues. Chaos testing, in contrast, introduces failures to test a system’s resilience and recovery from unexpected conditions. Stress testing examines load handling, while chaos testing assesses adaptability to disruptions. Both are crucial for improving system reliability and performance.

What is Chaos Monkey testing?

Netflix developed Chaos Monkey to test the resilience of its AWS infrastructure by simulating failures, such as shutting down instances in production. It’s part of the Simian Army, a suite of tools designed for various failure tests that also includes Latency Monkey and Security Monkey. Chaos Monkey helps harden applications against random failures.

What is chaos engineering for Kubernetes?

Chaos engineering for Kubernetes means injecting faults into a cluster and the workloads running on it. Chaos Mesh is a free, open-source chaos engineering platform built on Kubernetes (K8s) custom resource definitions (CRDs), which it uses to define chaos experiments across a variety of fault types. Gremlin is a chaos engineering platform that supports Kubernetes, and Chaos Toolkit provides probes and actions that can be called from an experiment.

Is chaos testing done in production?

Yes. DevOps and IT teams practicing chaos engineering set up a suite of monitoring tools and regularly run chaos tests in a live production environment. This lets them observe how their applications and services actually react to various forms of strain and disruption.

Where is chaos testing most useful?

Chaos testing is most valuable where it can pinpoint and mitigate weaknesses in a software system before they cause harm. Adopting it helps organizations uncover hidden flaws, anticipate and prevent possible downtime, and understand the various ways a system might fail.
