Chaos testing probes the reliability of software systems by intentionally introducing errors in production. Far from being chaotic, it is a methodical approach to fault tolerance that effectively emulates the unpredictability of real problems.
Adopting chaos engineering helps you identify failure points, increase resilience, and rehearse disaster response strategies. It is a way to manage risk by provoking problems on your own terms. This article will introduce the principles behind chaos engineering and show you how to start implementing your own chaos experiments.
Table of contents
- Understanding chaos testing
- How to use chaos testing
- Chaos testing, security, and vulnerabilities
- The chaos testing mindset
- Why chaos testing won’t prevent every outage
- Next steps
Understanding chaos testing
Chaos engineering is a straightforward concept with a catchy name. Instead of waiting for problems to occur, you proactively introduce failure states so issues are uncovered earlier on. The chaos arises because the problems are supposed to be generated randomly so that they’re unpredictable.
The technique originated at Netflix in the early 2010s. A seminal 2011 blog post explained how an internal tool called Chaos Monkey would periodically disable pieces of Netflix’s production infrastructure. This induced failures that didn’t show up in regular tests. Netflix open-sourced Chaos Monkey, sparking a new approach to reliability engineering.
Chaos engineering aims to help you build more resilient systems by uncovering the hidden interdependencies, flaky code paths, and lurking bugs that cause outages you wouldn’t normally anticipate. Many failures in production occur as a result of highly specific states that aren’t covered by any testing procedure.
How to use chaos testing
Chaos tests are often referred to as experiments. You’re introducing an unknown into your system and observing how the environment reacts. A good experiment should have a specific purpose, such as checking how the system responds to a missing component or measuring the impact of increased network latency. Using small and focused experiments will help limit the impact on customers if an outage is triggered.
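A focused experiment such as "measure the impact of increased network latency" can be approximated in code by wrapping a call so that some fraction of invocations are delayed. The following is a minimal sketch rather than the method used by any particular tool; `with_latency` and its parameters are hypothetical names chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_latency(
    func: Callable[[], T],
    delay_seconds: float,
    probability: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], T]:
    """Wrap func so that a fraction of calls are delayed before running."""

    def wrapper() -> T:
        # rng.random() returns a float in [0.0, 1.0), so a probability of
        # 1.0 delays every call and 0.0 delays none.
        if rng.random() < probability:
            sleep(delay_seconds)
        return func()

    return wrapper


if __name__ == "__main__":
    # Hypothetical stand-in for a call to a downstream service.
    def fetch_profile() -> str:
        return "profile-data"

    slow_fetch = with_latency(
        fetch_profile, delay_seconds=0.2, probability=0.5, rng=random.Random()
    )
    start = time.perf_counter()
    slow_fetch()
    print(f"call took {time.perf_counter() - start:.3f}s")
```

Keeping the delay and probability small is one way to limit the effect on customers while still learning how callers cope with a slow dependency.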
Here’s a simple example of a chaos experiment procedure:
- Hypothesis: Write a short statement of what should happen, e.g., “A failover occurs if the primary database server is inaccessible. The system continues to function and is served from the database replicas. The operations team is alerted.”
- Design a safe experiment: This hypothesis could be tested by taking the primary database server offline. However, if the server takes a while to come back up, this could be dangerous if the hypothesis is disproven and no failover occurs. A safer option could be adding a networking rule that blocks connections to the server. You can then rapidly delete the rule if the test causes an incident in production.
- Execute the experiment: Carry out the plan you’ve designed. Introduce the chaos into your system and measure its effects. If the hypothesis was disproved, use the rollback plan to restore service and then implement changes to improve your system’s resiliency. Chaos testing has just found a weakness for you.
- Repeat until the hypothesis is proven: The system should reach a state where it remains stable while the fault is present. This proves the hypothesis and allows you to close the experiment. You should subsequently analyze your findings, because they might hint at similar problems in other parts of the system.
Adopting a procedure like this one helps ensure chaos engineering is conducted safely, under controlled conditions. Although you’re intentionally breaking things in production, chaos testing is never meant to cause an incident long enough that customers will notice and complain. To achieve this, you must carefully plan your experiments so a fast rollback is always available.
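The procedure above can be sketched in a few lines of code. This is an illustrative outline under stated assumptions, not a real chaos framework: `Experiment`, `run_experiment`, and `FakeCluster` are hypothetical names, and the cluster is an in-memory stand-in where blocking the primary represents adding a network rule. The point it shows is that the rollback always runs, whether or not the hypothesis holds.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment: a hypothesis plus the means to test and undo it."""

    hypothesis: str
    inject: Callable[[], None]        # introduces the fault
    rollback: Callable[[], None]      # removes the fault quickly
    steady_state: Callable[[], bool]  # True while the hypothesis holds


def run_experiment(exp: Experiment) -> bool:
    """Run the experiment and return True if the hypothesis held."""
    if not exp.steady_state():
        # Do not add chaos to a system that is already unhealthy.
        raise RuntimeError("system is not in steady state; aborting")
    exp.inject()
    try:
        return exp.steady_state()
    finally:
        # The rollback runs whether the hypothesis held or not.
        exp.rollback()


class FakeCluster:
    """Toy in-memory stand-in for the database failover example."""

    def __init__(self, failover_enabled: bool) -> None:
        self.failover_enabled = failover_enabled
        self.primary_reachable = True

    def block_primary(self) -> None:
        self.primary_reachable = False  # stands in for adding a network rule

    def unblock_primary(self) -> None:
        self.primary_reachable = True   # stands in for deleting the rule

    def can_serve_reads(self) -> bool:
        return self.primary_reachable or self.failover_enabled


def failover_experiment(cluster: FakeCluster) -> Experiment:
    return Experiment(
        hypothesis="Reads keep working when the primary is unreachable",
        inject=cluster.block_primary,
        rollback=cluster.unblock_primary,
        steady_state=cluster.can_serve_reads,
    )


if __name__ == "__main__":
    for enabled in (False, True):
        cluster = FakeCluster(failover_enabled=enabled)
        held = run_experiment(failover_experiment(cluster))
        print(f"failover_enabled={enabled}: hypothesis held = {held}")
```

In a real system the `steady_state` check would query monitoring or health endpoints, and a disproven hypothesis would feed back into the "repeat" step once the weakness has been fixed.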
Make use of automation
Automated chaos testing is the purest form, because it guarantees the element of randomness that is missing from planned experiments. There are several tools available to break things for you in different programming frameworks and cloud environments.
- Chaos Monkey: The original chaos testing tool, it randomly terminates virtual machines and containers to simulate service failures. It requires Netflix’s Spinnaker continuous delivery platform.
- Kube-Monkey: Brings chaos testing to Kubernetes clusters using an approach inspired by Chaos Monkey. It randomly kills Pods within your cluster. The tool is highly configurable, letting you customize the maximum number of Pods to terminate, a blacklist of services that must not stop, and the time and duration that the monkey runs.
- VMware Mangle: Can introduce faults to many different deployment environments including Kubernetes, Docker, and VMware’s vCenter. This is a more flexible tool that supports a wide range of different faults beyond simple service terminations. It includes infrastructure-level outages that affect multiple services at once.
- Litmus: A cloud-native chaos engineering platform that is now backed by the Cloud Native Computing Foundation (CNCF). It runs within Kubernetes, using microservices and custom resource definitions to let you define, execute, and analyze chaos experiments. Litmus is a great option for setting up complex chaos workflows at scale.
- Chaos Toolkit: A tool for writing and running chaos experiments from your terminal. Hypotheses are defined in JSON files that state how the system should behave after a particular event occurs.
Using one or more of these tools lets you add chaos to your system while maintaining safeguards in case problems occur. Combine random service terminations with your own purposeful experiments to get the most complete coverage.
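The safeguards these tools offer, such as a cap on terminations and a list of protected services, boil down to constrained random selection. The sketch below illustrates that idea only; it is not how Chaos Monkey or Kube-Monkey is implemented, and `pick_victims` and the instance names are hypothetical.

```python
import random
from typing import Iterable, List, Optional, Set


def pick_victims(
    instances: Iterable[str],
    protected: Set[str],
    max_kills: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomly choose up to max_kills instances, never a protected one."""
    if rng is None:
        rng = random.Random()
    # Sort so the candidate order does not depend on set iteration order.
    candidates = sorted(set(instances) - protected)
    count = max(0, min(max_kills, len(candidates)))
    return rng.sample(candidates, count)


if __name__ == "__main__":
    pods = ["web-1", "web-2", "api-1", "api-2", "db-primary"]
    victims = pick_victims(pods, protected={"db-primary"}, max_kills=2)
    # A real tool would terminate these; the sketch only reports them.
    print("would terminate:", victims)
```

Passing a seeded `random.Random` makes a run reproducible, which can help when you need to replay the exact selection that exposed a weakness.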
Chaos testing, security, and vulnerabilities
Chaos testing is also an effective way to manage vulnerabilities. It helps to pinpoint weaknesses that could give an attacker leverage inside your architecture.
A successful exploit of a vulnerability can lead to a chaos scenario, as your system gets exposed to an unknown threat. Attack chains can spread throughout infrastructure, causing cascading failures in disparate areas. As an example, a denial-of-service (DoS) attack against a low-priority service might force critical ones offline too—if they have hidden interdependencies. Engaging in chaos testing is a good way to anticipate and mitigate the damage that exploits can cause.
Chaos testing also helps gauge your system’s susceptibility to attack techniques outlined by standards such as the MITRE ATT&CK framework. You can assess whether compromise of one service is likely to adversely affect the others by emulating common attack mechanisms such as request flooding. Knowledge of failure points can even help you combat an attack in progress: intentionally disabling selected pieces of infrastructure acts as a kill switch that stops the attack from spreading.
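One way to reason about cascading failures before running an experiment is to compute which services transitively depend on a given component. The sketch below assumes you can write down a map of hard dependencies; the function name `blast_radius` and the service names are hypothetical, and a real system will have dependencies the map does not capture, which is exactly what chaos experiments are meant to reveal.

```python
from collections import deque
from typing import Dict, Set


def blast_radius(depends_on: Dict[str, Set[str]], failed: str) -> Set[str]:
    """Return every service that transitively depends on `failed`."""
    # Invert the map: for each service, which services depend on it directly?
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, set()):
            # Skip services already seen so cycles cannot loop forever.
            if service not in affected and service != failed:
                affected.add(service)
                queue.append(service)
    return affected


if __name__ == "__main__":
    # Hypothetical dependency map: service -> services it cannot run without.
    depends_on = {
        "storefront": {"checkout"},
        "checkout": {"payments", "avatars"},
        "payments": {"db"},
        "avatars": {"image-cache"},
    }
    print(sorted(blast_radius(depends_on, "image-cache")))
```

In this made-up map, losing the low-priority `image-cache` takes out `avatars`, then `checkout`, then `storefront`, which mirrors the denial-of-service example above.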
Get into the right mindset
Chaos testing isn’t just about tools and experiments. Engineers should adopt a naturally analytical mindset so potential problems are resolved before they are introduced into code.
Many failure points can be anticipated early on during the design and development phases of the software lifecycle. Hard dependencies on specific services, reliance on outside providers, and assumptions that a stable network will be available are capable of causing outages in production. All three of these examples can be easily handled in code by implementing a retry and fallback system.
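A retry and fallback system can be as small as one helper function. The following is a minimal sketch under stated assumptions: `call_with_retry` and its parameters are hypothetical names, the backoff schedule is an arbitrary choice, and production code would normally catch only the specific exceptions it knows to be transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try primary up to `attempts` times, then return the fallback result."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Broad catch for brevity; narrow this in real code.
            # Back off exponentially between attempts, but not after the last.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return fallback()


if __name__ == "__main__":
    # Hypothetical outside provider that is currently unreachable.
    def fetch_live_price() -> float:
        raise ConnectionError("provider unavailable")

    # Hypothetical fallback: a previously cached value.
    def cached_price() -> float:
        return 9.99

    print(call_with_retry(fetch_live_price, cached_price))
```

The fallback is what turns a hard dependency into a soft one: the feature degrades to stale data instead of failing outright.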
Developers can create more resilient systems by taking a “what if” approach to their work. This is the chaos mindset. Continually assessing possible failure modes encourages protections to be implemented at the time code is written, instead of after an outage in production.
Why chaos testing won’t prevent every outage
Chaos testing is an effective way to increase reliability, but it can’t prevent every production incident. It’s not realistic to anticipate every possible failure. Some will only occur under highly specific situations that even random chaos experiments can’t replicate.
What chaos testing does deliver is a deeper understanding of your system’s failure points. This information is invaluable when building resiliency, implementing security protections and addressing live incidents. The insights gleaned from your chaos experiments can inform the likely cause of an outage even when you haven’t seen the exact set of symptoms before.
Adopting a chaos mindset promotes institutional awareness of weaknesses by encouraging defensive coding practice. This culminates in a net improvement in software reliability over time. Chaos isn’t about preventing outages entirely; it’s meant to offer early mitigation of discoverable issues, while better equipping you to assess the probable causes of what remains.
Next steps
Chaos testing is a technique that enhances software reliability through the intentional introduction of failures. It sounds disruptive but is a proven way to find faults early on before they cause unanticipated incidents.
Chaos testing doesn’t mean a chaotic implementation. You should take a methodical approach so that chaos is added safely and with minimal impact on your users. Clear experiments with a planned rollback strategy are the key to successful testing. You can also pick from a growing selection of automated tools that will randomly terminate infrastructure components for you.
Chaos engineering is a proactive approach to fault tolerance where issues are discovered on your terms. This is more efficient and less stressful than dealing with incidents reactively, while customers are being affected. Viewing chaos engineering as a mindset delivers the greatest results by helping you ship code that’s resilient from the outset.
What is the difference between stress testing and chaos testing?
Stress testing and chaos testing are both types of software testing that are used to improve the reliability and performance of a system, but they have different goals and approaches.
Stress testing involves putting a system under a heavy load to see how it performs under extreme conditions. The goal of stress testing is to identify the limits of a system’s capacity and to ensure that it can handle a high volume of traffic or activity without crashing or experiencing other issues. Stress testing can help identify performance bottlenecks, memory leaks, or other issues that can impact the stability of a system.
On the other hand, chaos testing involves intentionally injecting failures or errors into a system to see how it responds under unpredictable and chaotic conditions. The goal of chaos testing is to proactively identify weaknesses or vulnerabilities in a system, and to improve its resilience and ability to recover from failures. Chaos testing often involves randomly terminating services, changing network configurations, or introducing latency and other issues to see how the system adapts.
While stress testing focuses on the system’s ability to handle a heavy load, chaos testing focuses on the system’s ability to handle unexpected and unpredictable failures. Both types of testing are important for improving the reliability and performance of a system.
What is Chaos Monkey testing?
Chaos Monkey is a software tool that Netflix engineers created to evaluate the resilience and recoverability of their Amazon Web Services (AWS) infrastructure. The tool simulates system failures and edge cases by purposefully shutting down instances in the production network. Chaos Monkey is part of the Simian Army, a larger collection of tools created to test responses to different system failures.
Within that collection, Chaos Monkey’s role is to help make applications resilient to random instance failures. Latency Monkey, Janitor Monkey, Security Monkey, and Conformity Monkey are additional members of the Simian Army.
What is chaos engineering for Kubernetes?
Chaos engineering for Kubernetes applies chaos experiments to workloads running in a Kubernetes (K8s) cluster, and several tools support it. Chaos Mesh is a free, open-source chaos engineering platform built on Kubernetes custom resource definitions (CRDs), which it uses to define chaos experiments, and it offers a variety of fault types to simulate. Gremlin is a chaos engineering platform that supports Kubernetes, and the Chaos Toolkit provides probes and actions that can be called from an experiment.