Expect the unexpected. This adage is perhaps one of the best for testing distributed software systems. But how exactly do you test for the unexpected? Chaos engineering gets close to the answer.
Chaos engineering helps you design more resilient systems. This is achieved by forcing you to think about how these complex systems will respond to unexpected events. It gives you confidence that your system will be able to handle real-world conditions, not just the idealised conditions that are often assumed during development and testing.
What is chaos engineering?
Chaos engineering is a disciplined method of resilience testing that intentionally introduces “chaos” into a system to discover vulnerabilities and weaknesses that can be exploited by attackers. Essentially, it’s stress testing through anarchy: randomly terminating processes, injecting faults into networks, or causing other types of failures. This helps to build confidence in the system and test its capability to withstand turbulent conditions in production.
The origins of chaos engineering can be traced back to Netflix's "Simian Army" project, which was designed to test the streaming service’s ability to withstand outages and failures. Since then, many other companies have adopted similar chaos engineering experiments.
Chaos engineering as a testing strategy
Anyone who’s been involved in distributed software development understands that software testing identifies errors, gaps, or missing requirements.
Software testing can be executed manually (by running through the steps yourself), automatically (with specialised tools), or a combination of both approaches (manual exploratory testing supplemented by automated regression cheques). Automation is often leveraged in regression testing since this approach requires retesting software after implementing changes. This guarantees they haven’t inadvertently introduced new bugs or broken existing functionality. It also provides faster feedback and more comprehensive coverage than manual testing alone.
However, it also requires extra effort upfront to develop reliable test scripts that don't produce false positives/negatives. Generally, which approach to take depends on factors that include team size/skillset, application complexity, and time constraints.
The types of software tests depend on their purpose; for example:
- Unit tests - verify that individual software components work as expected.
- Integration tests - check whether various system components work together correctly.
- System tests - evaluate the end-to-end behaviour of the system to make sure it meets all functional requirements.
- Acceptance tests - confirm that the system works as required from the perspective of an external stakeholder such as a user or customer.
Chaos tests have their unique purpose as well, as it focuses on replicating real-world problems in production environments. This contrasts with other forms of testing, which often takes place in controlled environments where it’s easier to replicate specific test conditions.
The process behind the chaos
The design behind the implementation of chaos testing is straightforward.
Measuring progress against a standard is critical for improving system resilience. This baseline is the nominal state of the system during normal operation (without injected chaos). Measuring the steady state provides performance data that can be used to detect changes resulting from the chaos that is introduced later.
One common approach to measuring the nominal state of a system is called black-box testing. These measurements are taken from outside of the system under test (without knowledge of its internals). This type of measurement is especially useful when testing distributed systems, where observing internals at every node would be prohibitively expensive or outright infeasible. Black-box testing can take many forms, but some common approaches include load testing and monitoring external metrics such as response time and error rates.
Of course, establishing a baseline can be complicated. In some cases, it may be difficult to determine what "normal" behaviour looks like due to variability in user traffic or other factors. In these situations, it may be necessary to run multiple experiments before an accurate baseline can be established.
Once you’re armed with a clear understanding of how your system performs normally, you can break it. It’s time to develop a chaos test and make predictions about how the system will respond to chaotic variables.
Your prediction of how baseline systems react to chaos is your hypothesis. This proposition comes in the form of a question and an assumption. For example, if you introduce latency to database calls in a web app (question), page load time will slow down (assumption). Chaos testing specifically introduces uncertain elements to either prove or break your hypothesis. When this occurs (your system does something outside your assumption), the fallout is called a blast radius.
So, after establishing a system baseline, ask yourself, “what could go wrong?” Then make use of service-level indicators and objectives to form the basis for a realistic assumption. That’s your hypothesis. To develop the actual chaos test, choose variables that will be introduced one at a time (to manage the potential blast radius) that directly challenges your hypothesis.
The actual chaos is introduced by tools such as Chaos Monkey, Chaos Mesh, or Gremlin. These implementations directly tamper with different components of your system—such as CPU usage or networking conditions—to simulate issues that may occur in a real production environment.
The aforementioned Simian Army is an open-source (and thus still-growing) collection of chaos engineering tools that list the different problems you can introduce into your system:
- Chaos Monkey - randomly shuts down virtual machines (VMs) to create small disruptions that shouldn’t impact the overall service. Chaos Gorilla is a larger-scale version.
- Latency Monkey - simulates service degradation to see if upstream services react appropriately.
- Conformity Monkey - detects instances not coded to best-practise guidelines. It terminates them to give the service owner a chance to properly relaunch them.
- Security Monkey - searches out security weaknesses and terminates them. It also cheques SSL and DRM certificates if they are expired or near expiration.
- Doctor Monkey - performs health cheques on instances. It also monitors external signs of process health (CPU and memory usage).
- Janitor Monkey - searches for unused resources to discard.
When introducing chaos, alter one variable at a time to limit the potential blast radius. Doing so allows you to monitor the results and take appropriate action if necessary. Plan for how to abort the experiment if it threatens production software, as well as how to revert changes if something goes awry. When conducting the test, aim to disprove your hypothesis. This will reveal areas where system improvements can be made.
Once the chaos has been introduced, it’s time to observe the effects on your system. This is where service-level indicators and objectives come into play again. By monitoring these metrics before, during, and after the chaos test, you can determine whether your hypothesis was correct.
If everything goes according to plan, congratulations! Your system was able to withstand some simulated adversity and perform as well (or better) than before. However, if something goes wrong during the chaos test—if there’s an unexpected blast radius—that can actually be seen as a positive. It means you found a potential weakness in your system that can now be addressed (by analysing the blast radius itself, for example) before it becomes an actual problem in production.
Finally, the results of the test are compared to the predictions made when creating it to assess accuracy and identify areas for improvement. By engineering experiments with predicted outcomes (and preferably, mitigation strategies as well), you gain valuable insights into how your systems behave when faced with chaotic conditions.
The chaos leads to insight
Chaos testing allows developers and software reliability engineering (SRE) teams to gain a better understanding of their system's behaviour under duress and identify potential problems or develop mitigation strategies.
This allows you to spot hidden bugs and dependencies between subsystems that may otherwise behave normally when tested independently. You can now verify (and subsequently improve) your production-level performance bottlenecks and fault tolerance mechanisms. If your system can’t gracefully handle an injected fault, it’s also likely to fail in production when confronted with unexpected real-world conditions. This could result in data loss or corruption, service downtime, and other negative outcomes—placing a financial and time drain on your organisation.
Testing for the predictable is standard. Chaos testing provides you with a glimpse of the unexpected and, therefore, a way to prepare for it.