Chaos testing: what you need to know

Chaos testing lets teams check how resilient their systems are and how well they adapt to real-world pressure. Here's what you need to know.

Eddie Goikhman | October 19, 2022

Chaos engineering probes the reliability of software systems by intentionally introducing errors in production. Far from being chaotic, it is a methodical approach to fault tolerance that effectively emulates the unpredictability of real problems.

Adopting chaos engineering helps you identify failure points, increase resilience, and rehearse disaster response strategies. It is a way to manage risk by provoking problems on your own terms. This article will introduce the principles behind chaos engineering and show you how to start implementing your own chaos experiments.

Table of contents

  1. Understanding chaos testing
  2. How to use chaos testing
  3. Chaos testing, security, and vulnerabilities
  4. The chaos testing mindset
  5. Why chaos testing won't prevent every outage
  6. Next steps

Understanding chaos testing

Chaos engineering is a straightforward concept with a catchy name. Instead of waiting for problems to occur, you proactively introduce failure states so that issues are uncovered earlier. The "chaos" comes from generating those failures randomly, which makes them as unpredictable as real ones.

The technique originated at Netflix in the early 2010s. A seminal 2011 blog post explained how an internal tool called Chaos Monkey would periodically disable pieces of Netflix's production infrastructure. This induced failures that didn't show up in regular tests. Netflix open-sourced Chaos Monkey, sparking a new approach to reliability engineering.

Chaos engineering aims to help you build more resilient systems by uncovering the hidden interdependencies, flaky code paths, and lurking bugs that cause outages you wouldn't normally anticipate. Many failures in production occur as a result of highly specific states that aren't covered by any testing procedure.

Intentionally breaking production lets you discover these issues under controlled conditions. You can develop patches and look for similar problems before a real outage occurs.

How to use chaos testing

Chaos tests are often referred to as experiments. You're introducing an unknown into your system and observing how the environment reacts. A good experiment should have a specific purpose, such as checking how the system responds to a missing component or measuring the impact of increased network latency. Using small and focused experiments will help limit the impact on customers if an outage is triggered.

Here's a simple example of a chaos experiment procedure:

  1. Hypothesis: Write a short statement of what should happen, e.g., "A failover occurs if the primary database server is inaccessible. The system continues to function and is served from the database replicas. The operations team is alerted."
  2. Design a safe experiment: This hypothesis could be tested by taking the primary database server offline. However, if the hypothesis is disproven and no failover occurs, the outage lasts for as long as the server takes to come back up. A safer option could be adding a networking rule that blocks connections to the server. You can then rapidly delete the rule if the test causes an incident in production.
  3. Execute the experiment: Carry out the plan you've designed. Introduce the chaos into your system and measure its effects. If the hypothesis is disproven, use the rollback plan to restore service and then implement changes to improve your system's resiliency. Chaos testing has just found a weakness for you.
  4. Repeat until the hypothesis holds: The system should reach a state where it remains stable while the fault is present. This confirms the hypothesis and allows you to close the experiment. You should subsequently analyze your findings, because they might hint at similar problems in other parts of the system.

Adopting a procedure like this one helps ensure chaos engineering is conducted safely, under controlled conditions. Although you're intentionally breaking things in production, chaos testing is never meant to cause an incident long enough that customers will notice and complain. To achieve this, you must carefully plan your experiments so a fast rollback is always available.
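
The four-step procedure can be sketched in code. The example below is a minimal illustration rather than a real tool: ChaosExperiment, run_experiment, and the in-memory cluster dictionary are hypothetical names invented for this sketch, and in practice the fault and rollback callables would add and delete something like the networking rule described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    """One small, focused experiment with a rollback that always runs."""
    hypothesis: str                   # plain-language statement of what should happen
    inject_fault: Callable[[], None]  # e.g. add a rule that blocks the primary database
    rollback: Callable[[], None]      # e.g. delete that rule again
    steady_state: Callable[[], bool]  # True while the system behaves as hypothesised


def run_experiment(exp: ChaosExperiment) -> bool:
    """Return True if the hypothesis held while the fault was active."""
    # Do not add chaos to a system that is already unhealthy.
    if not exp.steady_state():
        raise RuntimeError("system unhealthy before the experiment; aborting")
    try:
        exp.inject_fault()
        return exp.steady_state()
    finally:
        # The rollback runs whether the check passed, failed, or raised.
        exp.rollback()


# Toy stand-in for a database cluster (an assumption made for illustration).
cluster = {"primary_blocked": False, "replica_healthy": True}

experiment = ChaosExperiment(
    hypothesis="Reads keep succeeding while the primary is unreachable.",
    inject_fault=lambda: cluster.update(primary_blocked=True),
    rollback=lambda: cluster.update(primary_blocked=False),
    steady_state=lambda: (not cluster["primary_blocked"]) or cluster["replica_healthy"],
)

print("hypothesis held:", run_experiment(experiment))
```

Putting the rollback in a finally block is the design choice that matters here: service is restored even when the hypothesis is disproven or the health check itself fails.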

Automating chaos testing

Automated chaos testing is the purest form, because it guarantees the element of randomness that is missing from planned experiments. There are several tools available to break things for you in different programming frameworks and cloud environments.

  • Chaos Monkey: The original chaos testing tool, it randomly terminates virtual machines and containers to simulate service failures. It requires Netflix's Spinnaker continuous delivery platform.
  • Kube-Monkey: Brings chaos testing to Kubernetes clusters using an approach inspired by Chaos Monkey. It randomly kills Pods within your cluster. The tool is highly configurable, letting you customize the maximum number of Pods to terminate, a blacklist of services that must not stop, and the time and duration that the monkey runs.
  • VMware Mangle: Can introduce faults to many different deployment environments including Kubernetes, Docker, and VMware's vCenter. This is a more flexible tool that supports a wide range of different faults beyond simple service terminations. It includes infrastructure-level outages that affect multiple services at once.
  • Litmus: A cloud-native chaos engineering platform that is now backed by the Cloud Native Computing Foundation (CNCF). It runs within Kubernetes, using microservices and custom resource definitions to let you define, execute, and analyze chaos experiments. Litmus is a great option for setting up complex chaos workflows at scale.
  • Chaos Toolkit: A tool for writing and running chaos experiments from your terminal. Hypotheses are defined in JSON files that state how the system should behave after a particular event occurs.

Using one or more of these tools lets you add chaos to your system while maintaining safeguards in case problems occur. Combine random service terminations with your own purposeful experiments to get the most complete coverage.
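
To make the idea of random termination with safeguards concrete, here is a minimal sketch of the selection step only. It does not use the API of any tool listed above: pick_victims, the Pod names, and the protected set are hypothetical, and nothing is actually terminated.

```python
import random
from typing import Iterable, List, Optional, Set


def pick_victims(
    instances: Iterable[str],
    protected: Set[str],
    max_kills: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomly choose instances to terminate while honouring the safeguards."""
    if rng is None:
        rng = random.Random()
    # Safeguard 1: protected services are never candidates.
    candidates = sorted(name for name in instances if name not in protected)
    # Safeguard 2: never pick more than the configured maximum.
    count = max(0, min(max_kills, len(candidates)))
    return rng.sample(candidates, count)


# Hypothetical Pod names, used only for illustration.
pods = ["api-1", "api-2", "worker-1", "worker-2", "billing-1"]
victims = pick_victims(pods, protected={"billing-1"}, max_kills=2)
print("would terminate:", victims)
```

Passing a seeded random.Random makes a run reproducible, which helps when you need to replay the exact selection that exposed a weakness.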

Chaos testing, security, and vulnerabilities

Chaos testing is also an effective way to manage vulnerabilities. It helps to pinpoint weaknesses that could give an attacker leverage inside your architecture.

A successful exploit of a vulnerability can lead to a chaos scenario, as your system gets exposed to an unknown threat. Attack chains can spread throughout infrastructure, causing cascading failures in disparate areas. As an example, a denial-of-service (DoS) attack against a low-priority service might force critical ones offline too—if they have hidden interdependencies. Engaging in chaos testing is a good way to anticipate and mitigate the damage that exploits can cause.
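
One way to reason about cascading failures is to compute the predicted "blast radius" of a single service from the dependencies you already know about, then compare it with what a chaos experiment actually takes down. The sketch below assumes a hand-written dependency map with hypothetical service names, and treats every dependency as a hard one. Anything that fails in the experiment but is missing from the prediction points to a hidden interdependency.

```python
from collections import deque
from typing import Dict, Set


def blast_radius(depends_on: Dict[str, Set[str]], failed: str) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each service, which services depend on it?
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outwards from the failed service.
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical architecture: a low-priority cache sits underneath checkout.
known_dependencies = {
    "checkout": {"payments", "recommendations"},
    "payments": {"database"},
    "recommendations": {"thumbnail-cache"},
    "thumbnail-cache": set(),
    "database": set(),
}

print(sorted(blast_radius(known_dependencies, "thumbnail-cache")))
# ['checkout', 'recommendations']
```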

Chaos testing also helps gauge your system's susceptibility to attack techniques catalogued in knowledge bases such as the MITRE ATT&CK framework. By emulating common attack mechanisms such as request flooding, you can assess whether the compromise of one service is likely to adversely affect the others. Knowledge of failure points can even help you combat an attack: intentionally disabling specific pieces of infrastructure can act as a kill switch that breaks the attack chain.

The chaos testing mindset

Chaos testing isn't just about tools and experiments. Engineers should adopt an analytical mindset so that potential problems are caught before they make it into the code.

Many failure points can be anticipated early on during the design and development phases of the software lifecycle. Hard dependencies on specific services, reliance on outside providers, and assumptions that a stable network will always be available can all cause outages in production. All three can often be mitigated in code by implementing a retry and fallback system.
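
A retry and fallback system can be quite small. The following is a minimal sketch, not a production-ready library: call_with_retry and its parameters are hypothetical names, and the choice to retry only on ConnectionError and TimeoutError is an assumption you would adapt to the errors your own dependencies raise.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `primary` up to `attempts` times, then degrade to `fallback`."""
    for attempt in range(attempts):
        try:
            return primary()
        except retry_on:
            # Back off exponentially between attempts, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: serve a degraded answer instead of an error.
    return fallback()


def fetch_live_prices() -> dict:
    # Stand-in for a call to an outside provider that is currently down.
    raise ConnectionError("provider unreachable")


def cached_prices() -> dict:
    # Stand-in for stale but usable data, such as a local cache.
    return {"source": "cache", "price": 42}


print(call_with_retry(fetch_live_prices, cached_prices, base_delay=0.01))
```

The sleep parameter is injectable so that the backoff can be replaced in tests, and errors outside retry_on are deliberately allowed to propagate rather than being hidden by the fallback.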

Developers can create more resilient systems by taking a "what if" approach to their work. This is the chaos mindset. Continually assessing possible failure modes encourages protections to be implemented at the time code is written, instead of after an outage in production.

Why chaos testing won't prevent every outage

Chaos testing is an effective way to increase reliability, but it can't prevent every production incident. It's not realistic to anticipate every possible failure. Some will only occur under highly specific situations that even random chaos experiments can't replicate.

What chaos testing does deliver is a deeper understanding of your system's failure points. This information is invaluable when building resiliency, implementing security protections, and addressing live incidents. The insights gleaned from your chaos experiments can point to the likely cause of an outage even when you haven't seen the exact set of symptoms before.

Adopting a chaos mindset promotes institutional awareness of weaknesses by encouraging defensive coding practice. This culminates in a net improvement in software reliability over time. Chaos isn't about preventing outages entirely; it's meant to offer early mitigation of discoverable issues, while better equipping you to assess the probable causes of what remains.

Next steps

Chaos testing is a technique that enhances software reliability through the intentional introduction of failures. It sounds disruptive but is a proven way to find faults early on before they cause unanticipated incidents.

Chaos testing doesn't mean a chaotic implementation. You should take a methodical approach so that chaos is added safely and with minimal impact on your users. Clear experiments with a planned rollback strategy are the key to successful testing. You can also pick from a growing selection of automated tools that will randomly terminate infrastructure components for you.

Chaos engineering is a proactive approach to fault tolerance where issues are discovered on your terms. This is more efficient and less stressful than dealing with incidents reactively, while customers are being affected. Viewing chaos engineering as a mindset delivers the greatest results by helping you ship code that's resilient from the outset.
