Learn the fundamentals of Chaos Testing, the practice of testing systems under real-world pressure to assess resilience and adaptability.
Chaos testing probes the reliability of software systems by intentionally introducing errors in production. Far from being chaotic, it is a methodical approach to fault tolerance that effectively emulates the unpredictability of real problems.
Adopting chaos engineering helps you identify failure points, increase resilience, and rehearse disaster response strategies. It is a way to manage risk by provoking problems on your own terms. This article will introduce the principles behind chaos engineering and show you how to start implementing your own chaos experiments.
Chaos engineering is a straightforward concept with a catchy name. Instead of waiting for problems to occur, you proactively introduce failure states so issues are uncovered earlier. The "chaos" comes from generating those failures randomly, making them as unpredictable as real-world faults.
The technique originated at Netflix in the early 2010s. A seminal 2011 blog post explained how an internal tool called Chaos Monkey would periodically disable pieces of Netflix’s production infrastructure. This induced failures that didn’t show up in regular tests. Netflix open-sourced Chaos Monkey, sparking a new approach to reliability engineering.
Chaos engineering aims to help you build more resilient systems by uncovering the hidden interdependencies, flaky code paths, and lurking bugs that cause outages you wouldn’t normally anticipate. Many failures in production occur as a result of highly specific states that aren’t covered by any testing procedure.
Intentionally breaking production lets you surface these issues under controlled conditions, so you can develop patches and look for similar problems before a real outage occurs.
Chaos tests are often referred to as experiments. You’re introducing an unknown into your system and observing how the environment reacts.
A good experiment should have a specific purpose, such as checking how the system responds to a missing component or measuring the impact of increased network latency. Using small and focused experiments will help limit the impact on customers if an outage is triggered.
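To give a flavor of what a focused fault injection looks like, here is a minimal latency-injection sketch for Linux using `tc netem`, driven from Python. This is an illustrative sketch, not a production tool: the interface name and delay value are assumptions, and root privileges are required.

```python
import subprocess

INTERFACE = "eth0"  # assumption: adjust to your network interface
DELAY = "200ms"     # assumption: the extra latency to inject

def add_latency():
    # netem adds an artificial delay to all egress traffic on the interface
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem", "delay", DELAY],
        check=True,
    )

def remove_latency():
    # Rollback: delete the qdisc to restore normal latency
    subprocess.run(
        ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
        check=True,
    )

if __name__ == "__main__":
    add_latency()
    try:
        input(f"{DELAY} of latency added to {INTERFACE}; press Enter to roll back...")
    finally:
        remove_latency()
```

Note that the rollback runs in a `finally` block, so normal latency is restored even if the experiment is interrupted.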
Here’s a simple example of a chaos experiment procedure:
1. Form a hypothesis. Write a short statement of what should happen, e.g., "A failover occurs if the primary database server is inaccessible. The system continues to function, served from the database replicas. The operations team is alerted."

2. Design a safe experiment with a rollback plan. This hypothesis could be tested by taking the primary database server offline, but that is risky: if the hypothesis is disproven and no failover occurs, a server that takes a while to come back up prolongs the outage. A safer option is a networking rule that blocks connections to the server, which you can rapidly delete if the test causes an incident in production (see the sketch after this list).

3. Run the experiment. Carry out the plan you've designed: introduce the chaos into your system and measure its effects. If the hypothesis is disproved, use the rollback plan to restore service, then implement changes to improve your system's resiliency. Chaos testing has just found a weakness for you.

4. Verify and close. The system should reach a state where it remains stable in the presence of the chaos. This proves the hypothesis and lets you close the experiment. Analyze your findings afterward, because they might hint at similar problems in other parts of the system.
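To make the procedure concrete, here is a minimal sketch of automating the database-failover experiment described above. It assumes Linux iptables, a PostgreSQL-style primary on port 5432, and a hypothetical application health endpoint; all names are illustrative.

```python
import subprocess
import urllib.request

PRIMARY_DB_PORT = "5432"                        # assumption: PostgreSQL-style primary
HEALTH_URL = "https://example.internal/health"  # hypothetical app health endpoint

BLOCK_RULE = ["OUTPUT", "-p", "tcp", "--dport", PRIMARY_DB_PORT, "-j", "DROP"]

def inject_fault():
    # Block outbound connections to the primary database (the "chaos")
    subprocess.run(["iptables", "-A", *BLOCK_RULE], check=True)

def rollback():
    # Fast rollback: delete the rule so traffic flows again
    subprocess.run(["iptables", "-D", *BLOCK_RULE], check=True)

def hypothesis_holds():
    # The hypothesis: the app keeps serving requests from the replicas
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    inject_fault()
    try:
        print("Hypothesis holds" if hypothesis_holds()
              else "Weakness found: no failover")
    finally:
        rollback()  # always restore service, whatever the outcome
```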
Adopting a procedure like this one helps ensure chaos engineering is conducted safely, under controlled conditions. Although you're intentionally breaking things in production, chaos testing should never cause an incident that lasts long enough for customers to notice and complain. To achieve this, you must carefully plan your experiments so a fast rollback is always available.
Automated chaos testing is the purest form, because it guarantees the element of randomness that is missing from planned experiments. There are several tools available to break things for you in different programming frameworks and cloud environments.
The original chaos testing tool, Chaos Monkey, randomly terminates virtual machines and containers to simulate service failures. It requires Netflix's Spinnaker continuous delivery platform.
Kube-Monkey brings chaos testing to Kubernetes clusters using an approach inspired by Chaos Monkey. It randomly kills pods within your cluster. The tool is highly configurable, letting you customize the maximum number of pods to terminate, a blacklist of services that must not be stopped, and the time and duration that the monkey runs.
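To illustrate the general idea (not Kube-Monkey's actual implementation), here is a minimal random pod-killer sketch using the official `kubernetes` Python client. The namespace and the protected-service list are assumptions.

```python
import random
from kubernetes import client, config

NAMESPACE = "default"                       # assumption: target namespace
PROTECTED = {"kube-dns", "metrics-server"}  # hypothetical blacklist of critical services

def kill_random_pod():
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(NAMESPACE).items
    candidates = [p for p in pods
                  if not any(name in p.metadata.name for name in PROTECTED)]
    if not candidates:
        return
    victim = random.choice(candidates)
    # Deleting the pod simulates an unexpected service failure;
    # a healthy Deployment should replace it automatically.
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
    print(f"Terminated {victim.metadata.name}")

if __name__ == "__main__":
    kill_random_pod()
```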
VMware Mangle can introduce faults to many different deployment environments including Kubernetes, Docker, and VMware's vCenter. This is a more flexible tool that supports a wide range of faults beyond simple service terminations, including infrastructure-level outages that affect multiple services at once.
Litmus is a cloud-native chaos engineering platform now backed by the Cloud Native Computing Foundation (CNCF). It runs within Kubernetes, using microservices and custom resource definitions to let you define, execute, and analyze chaos experiments. Litmus is a great option for setting up complex chaos workflows at scale.
Chaos Toolkit is a tool for writing and running chaos experiments from your terminal. Hypotheses are defined in JSON files that state how the system should behave after a particular event occurs.
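As a sketch of the general shape of such an experiment file, the following Python snippet builds a Chaos Toolkit-style experiment as a dict and writes it to JSON. The URLs, service name, and tolerance values are placeholders, and the field names should be checked against the Chaos Toolkit documentation.

```python
import json

# A minimal Chaos Toolkit-style experiment: the steady-state hypothesis is
# probed before and after the method (the fault) runs.
experiment = {
    "title": "App survives loss of the recommendations service",
    "description": "Hypothesis: the home page still returns 200.",
    "steady-state-hypothesis": {
        "title": "Home page is healthy",
        "probes": [{
            "type": "probe",
            "name": "home-page-returns-200",
            "tolerance": 200,
            "provider": {"type": "http", "url": "https://example.internal/"},
        }],
    },
    "method": [{
        "type": "action",
        "name": "stop-recommendations-service",
        "provider": {"type": "process", "path": "systemctl",
                     "arguments": "stop recommendations"},  # hypothetical service
    }],
    "rollbacks": [{
        "type": "action",
        "name": "restart-recommendations-service",
        "provider": {"type": "process", "path": "systemctl",
                     "arguments": "start recommendations"},
    }],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)  # then run it with: chaos run experiment.json
```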
Using one or more of these tools lets you add chaos to your system while maintaining safeguards in case problems occur. Combine random service terminations with your own purposeful experiments to get the most complete coverage.
Chaos testing is also an effective way to manage vulnerabilities. It helps to pinpoint weaknesses that could give an attacker leverage inside your architecture.
A successful exploit of a vulnerability can lead to a chaos scenario, as your system gets exposed to an unknown threat. Attack chains can spread throughout infrastructure, causing cascading failures in disparate areas.
As an example, a denial-of-service (DoS) attack against a low-priority service might force critical ones offline too—if they have hidden interdependencies. Engaging in chaos testing is a good way to anticipate and mitigate the damage that exploits can cause.
In February 2023, Cloudflare detected and mitigated the largest distributed denial-of-service (DDoS) attack ever recorded. The 71 million request-per-second (rps) attack, dubbed "hyper-volumetric," was 54 percent larger than the previous record of 46 million rps, set in June 2022.
Chaos testing also helps gauge your system's susceptibility to attack techniques outlined by standards such as the MITRE ATT&CK framework. You can assess whether compromise of one service is likely to adversely affect the others by emulating common intrusion techniques such as request flooding. Knowledge of failure points can even help you combat an active attack: intentionally disabling pieces of infrastructure can break the attacker's kill chain.
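For instance, here is a hedged sketch of a small request-flood experiment against your own staging environment, watching whether a supposedly independent service degrades under the load. Both URLs are hypothetical, and this should only ever be pointed at infrastructure you own.

```python
import concurrent.futures
import urllib.request

FLOOD_URL = "https://staging.example.internal/low-priority"    # hypothetical target
PROBE_URL = "https://staging.example.internal/critical/health"  # supposedly independent

def hit(url):
    # Return the HTTP status, or None if the request failed outright
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status
    except OSError:
        return None

def flood_and_probe(requests=500, workers=50):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # Flood the low-priority service with concurrent requests...
        flood = [pool.submit(hit, FLOOD_URL) for _ in range(requests)]
        # ...while probing the critical one; failures here reveal hidden coupling.
        probes = [pool.submit(hit, PROBE_URL) for _ in range(20)]
        concurrent.futures.wait(flood + probes)
    failures = sum(1 for p in probes if p.result() != 200)
    print(f"Critical service failed {failures}/20 probes during the flood")

if __name__ == "__main__":
    flood_and_probe()
```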
Chaos testing isn’t just about tools and experiments. Engineers should adopt a naturally analytical mindset so potential problems are resolved before they are introduced into code.
Many failure points can be anticipated early, during the design and development phases of the software lifecycle. Hard dependencies on specific services, reliance on outside providers, and assumptions that a stable network will be available are all capable of causing outages in production. All three of these examples can be handled in code by implementing retry and fallback logic, as sketched below.
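As a minimal sketch of that pattern, the following function retries a flaky primary dependency with exponential backoff and then degrades to a fallback. The callables passed in are hypothetical.

```python
import time

def fetch_with_retry(fetch_primary, fetch_fallback, retries=3, backoff=0.5):
    """Try the primary dependency a few times, then degrade gracefully."""
    for attempt in range(retries):
        try:
            return fetch_primary()
        except (ConnectionError, TimeoutError):
            # Transient network failures are expected; wait and retry
            time.sleep(backoff * 2 ** attempt)
    # The dependency stayed down: fall back instead of crashing
    return fetch_fallback()

# Usage with hypothetical callables:
# prices = fetch_with_retry(lambda: pricing_api.get(),     # outside provider
#                           lambda: cache.get("prices"))   # stale-but-safe fallback
```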
Developers can create more resilient systems by taking a “what if” approach to their work. This is the chaos mindset. Continually assessing possible failure modes encourages protections to be implemented at the time code is written, instead of after an outage in production.
Chaos testing is an effective way to increase reliability, but it can't prevent every production incident. It's not realistic to anticipate every possible failure. Some only occur under highly specific conditions that even random chaos experiments can't replicate.
What chaos testing does deliver is a deeper understanding of your system's failure points. This information is invaluable when building resiliency, implementing security protections, and addressing live incidents. The insights gleaned from your chaos experiments can point you to the likely cause of an outage even when you haven't seen that exact set of symptoms before.
Adopting a chaos mindset promotes institutional awareness of weaknesses by encouraging defensive coding practices. This yields a net improvement in software reliability over time. Chaos engineering isn't about preventing outages entirely; it offers early mitigation of discoverable issues while better equipping you to assess the probable causes of those that remain.
Chaos testing is a technique that enhances software reliability through the intentional introduction of failures. It sounds disruptive but is a proven way to find faults early on before they cause unanticipated incidents.
Chaos testing doesn’t mean a chaotic implementation. You should take a methodical approach so that chaos is added safely and with minimal impact on your users. Clear experiments with a planned rollback strategy are the key to successful testing. You can also pick from a growing selection of automated tools that will randomly terminate infrastructure components for you.
Chaos engineering is a proactive approach to fault tolerance where issues are discovered on your terms. This is more efficient and less stressful than dealing with incidents reactively, while customers are being affected. Viewing chaos engineering as a mindset delivers the greatest results by helping you ship code that’s resilient from the outset.
Drive resiliency with a holistic approach to cyber risk management. Correlate, prioritize, and manage vulnerabilities and risk at scale and across all your attack surfaces with Vulcan Cyber®. Schedule a demo today.