Chaos Engineering

Chaos Engineering, as a concept, is on the rise. And rightfully so. Systems are becoming more distributed and reliant on various technologies and architectures. With so many organizations depending on technology to reach customers, ensuring that technology can remain resilient under any circumstance is vital to success.

What Is Chaos Engineering?

Chaos Engineering can be defined as “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” (Principles of Chaos Engineering). However, it goes much deeper once you define the variables (attacks) needed for a realistic experiment and how much confidence you want to build through your test. Chaos Engineering is much like a science experiment. You’ll want to state your hypothesis (what you’re trying to prove) and then decide how you’re going to test it.

Should I Be Doing This?

More often than not, organizations are building powerful, distributed, complex and resource-hungry solutions. As an industry, we’ve spent a lot of time refining the product, development and deployment lifecycles, but do we spend enough time ensuring those applications remain resilient under any circumstance? Most organizations invest time in performing simple load tests. Being able to answer the question “can my application hold up under specified load” is important. However, load is not the only risk. There are a host of attacks that can crash an application. Waiting for those to happen in production to learn whether your application is resilient is a bet most organizations shouldn’t want to take. Disappointing customers with poor system performance is a sure way to send them directly to your competitors. Chaos Engineering is something every organization should consider.

However, I’d like to insert a caveat here. If I were assessing an organization’s readiness for Chaos Engineering I would ask these questions:

Is there a dedicated group of individuals that will own this? This function should be prioritized and have dedicated people to gain the most value. It’s crucial that multiple functions within your organization are having conversations about real-world attacks, system readiness, and resolving issues as they arise.
Will your performance team own this function? If so, is that team mature? If you’re just getting performance testing off the ground, I’d focus on maturing that practice before diving into another effort. Alternatively, perhaps you have a Site Reliability Engineering group that can dedicate time to this.

If the answer is NO to these questions, I would consider focusing time on those initiatives first. Like with most things at work, if you’re not completely engaged and prioritizing the work, you won’t get value from the exercise.

Steps to Perform Chaos Engineering

The high-level steps required to perform a Chaos Engineering experiment are as follows:

Define “normal” – What does steady state look like for your application?
Discuss what could go wrong – This is one of the most powerful questions you should focus on upfront. Work with your team to hypothesize which variables (attacks) could have a negative impact. How specifically would those attacks affect your customers, services or dependencies?
Define your blast radius – The blast radius is the scope of what your experiment can affect: which processes, hosts, services and users are exposed if something goes wrong. Your first experiment is a learning opportunity, so try to set your blast radius as small as possible. For example, if you are testing the application’s tolerance to process termination, try terminating a single process first. I also recommend executing your first test in a non-production environment. This allows you to get your feet wet without causing unwanted consequences.
Bring the CHAOS! – It’s time to execute the experiment. Introduce variables that reflect the real-world events your team discussed in step 2. Chaos Engineering tools will help make this step more structured and efficient.
Measure and Analyze – As with performance testing, you should have KPIs defined to measure success. If you see unwanted results during your test, halt the experiment immediately. In addition, measure the effect of the specific attack you’re testing, since this is what will help you prove or disprove your hypothesis. To keep a pulse on overall application stability, I recommend using the application health and performance tools your organization already has in place.
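
To make the steps above a little more concrete, here is a minimal sketch in Python of how one experiment might be structured. Everything in it is an assumption for illustration: run_experiment, ExperimentResult and FakeService are hypothetical names, the error-rate metric and thresholds are made up, and a real experiment would call your own monitoring and fault-injection tooling instead of the stand-in service.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExperimentResult:
    hypothesis_held: bool   # True if every sample stayed within steady state
    halted_early: bool      # True if the abort threshold was crossed
    observations: List[float] = field(default_factory=list)


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    steady_state_max: float,
    abort_threshold: float,
    samples: int = 5,
) -> ExperimentResult:
    """Hypothetical experiment loop: check steady state, inject, measure, roll back."""
    # Define "normal": refuse to start if the system is already unhealthy.
    if measure() > steady_state_max:
        raise RuntimeError("System is not in steady state; experiment not started.")

    observations: List[float] = []
    halted = False
    inject()  # Bring the chaos: introduce the fault.
    try:
        for _ in range(samples):
            value = measure()
            observations.append(value)
            # Measure and analyze: halt immediately on unwanted results.
            if value > abort_threshold:
                halted = True
                break
    finally:
        rollback()  # Always remove the fault, even if measuring fails.

    held = not halted and all(v <= steady_state_max for v in observations)
    return ExperimentResult(held, halted, observations)


class FakeService:
    """Stand-in for a real system so the sketch runs without infrastructure."""

    def __init__(self, healthy_error_rate: float = 0.01, degraded_error_rate: float = 0.08):
        self.healthy_error_rate = healthy_error_rate
        self.degraded_error_rate = degraded_error_rate
        self.fault_active = False

    def error_rate(self) -> float:
        return self.degraded_error_rate if self.fault_active else self.healthy_error_rate

    def kill_one_worker(self) -> None:
        # Small blast radius: the simulated fault affects a single worker.
        self.fault_active = True

    def restore(self) -> None:
        self.fault_active = False


if __name__ == "__main__":
    service = FakeService()
    result = run_experiment(
        measure=service.error_rate,
        inject=service.kill_one_worker,
        rollback=service.restore,
        steady_state_max=0.05,   # hypothesis: error rate stays at or below 5%
        abort_threshold=0.20,    # halt the experiment above 20%
    )
    print(result)
```

The design choice worth noting is the try/finally around the measurement loop: the rollback runs whether the experiment completes, is halted, or fails partway through, which keeps the blast radius from growing by accident.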

What Variables Should I Introduce?

The answer to this question will be unique to your application’s architecture. However, here are some considerations:

Crash a server
Fail a hard drive
Sever or corrupt network traffic
Inject a delay into outbound network traffic
Spike traffic
Generate load across CPU cores
Time changes – if your application has to account for Daylight Saving Time, simulate that time change
Prepare for DNS outages
Terminate a process or set of processes
Eat up a specific amount of memory or space on a storage device
Introduce packet loss
Add network latency
Create read/write pressure on I/O devices such as hard disks
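
Two of the variables above, latency and failures, can be simulated at the code level without any special tooling. The following is a minimal sketch, assuming a hypothetical chaos decorator and InjectedFault exception that I made up for this example; dedicated Chaos Engineering tools do this at the network or host level and with far more control.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Hypothetical exception raised in place of a real failure."""


def chaos(delay_seconds=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Hypothetical decorator that adds latency and random failures to a call.

    delay_seconds: fixed delay added before every call.
    failure_rate:  probability (0.0 to 1.0) that the call raises InjectedFault.
    rng, sleep:    injectable so the behavior can be made deterministic in tests.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if delay_seconds > 0:
                sleep(delay_seconds)  # simulate a slow dependency
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Example: a made-up dependency call that is slowed by 200 ms
# and fails roughly 10% of the time while the experiment runs.
@chaos(delay_seconds=0.2, failure_rate=0.1)
def fetch_profile(user_id):
    return {"id": user_id}
```

Wrapping a single call like this keeps the blast radius small, and it lets you see whether the calling code’s timeouts, retries and fallbacks behave the way your team expects.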

Best Practices

Chaos Engineering can get overly complex and become less valuable if not thought through and executed correctly. Here are some best practices:

Keep it realistic – Your attacks should closely align with probable attacks you may face.
Start off simple – Your first experiment should have a small blast radius.
Build in redundancies – If you’re executing in production, make sure you have a fallback plan or built-in redundancies in case the sky falls.
Collaborate – The best-laid plans are ones the whole team creates and executes. Incorporating multiple areas of your organization into this process improves your odds of success.
Use a real-world environment – This returns the most accurate results, and the effort and cost to duplicate a large, distributed system for testing purposes is sometimes unrealistic. However, if you are using a production environment, schedule your tests outside of peak hours.
Understand your baseline – Having a solid understanding of what a healthy system is will ensure you are measuring and diagnosing accurately.
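
As one way to put “understand your baseline” into practice, the sketch below summarizes a set of healthy measurements and flags values that drift too far from them. It is an illustration under assumptions: build_baseline and is_within_baseline are hypothetical helper names, the response times are made-up sample data, and the three-standard-deviation tolerance is an arbitrary starting point rather than a recommendation.

```python
import statistics


def build_baseline(samples):
    """Summarize healthy-system measurements (hypothetical helper)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a baseline")
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation
    }


def is_within_baseline(value, baseline, tolerance_stdevs=3.0):
    """Return True if value is within tolerance_stdevs of the baseline mean."""
    allowed = tolerance_stdevs * baseline["stdev"]
    return abs(value - baseline["mean"]) <= allowed


if __name__ == "__main__":
    # Made-up response times in milliseconds, collected while the system was healthy.
    healthy_response_ms = [100, 102, 98, 101, 99]
    baseline = build_baseline(healthy_response_ms)

    # Values observed during an experiment can then be checked against it.
    for observed in (103, 150):
        print(observed, is_within_baseline(observed, baseline))
```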

Chaos Engineering Tools

Chaos Engineering tools provide users the ability to inject “chaos” into an application. There are many tools on the market and some of them have very deliberate uses. I would recommend researching tools that fit your specific requirements. However, here are some tools to consider on that journey:

Chaos Monkey

Created by Netflix in the early 2010s, this open source tool has been around the longest. While it has been in circulation for quite some time, there are more robust (and more costly) tools on the market. Chaos Monkey has some limitations, such as supporting only one experiment type (shutdown), although you can schedule the attacks. It also appears to require Spinnaker and MySQL.

Gremlin

Gremlin is designed to improve the reliability of web-based systems. Gremlin will help you determine what type of attack would best meet your needs and also allows you to execute multiple tests simultaneously to build a higher level of confidence. If you’re in the CI/CD space, the tool can also be automated within your pipelines.

ChaosBlade

ChaosBlade is an open source, cloud-native tool. Its ideal application is testing resiliency at the code level by using application fault injection. It also offers various attack methods you can use to gain confidence. I will say that, when I looked at it, the documentation was lacking and the website was sluggish.

Litmus

Litmus is a Kubernetes-native tool that provides a large number of experiments for testing containers, Pods, and nodes, as well as specific platforms and tools. The tool also has a cool feature called Litmus Probes, which lets you monitor the health of your application before, during, and after an experiment. You can also assign weights to each experiment, and those weights are captured in the reporting outputs. However, ramp-up time, as well as individual test setup and clean-up times, are above average. I would take that into consideration if I were using this tool daily.

Conclusion

Today’s applications are required to be much more reliable than the applications of the past. The complexity and number of variables that could put your application’s resiliency at risk are growing every day. What I find valuable about Chaos Engineering is that it provides a direct level of confidence that automated or performance testing alone can’t offer. And what I find useful about Chaos Engineering tools is that someone else does the heavy lifting of staying on top of those variables so you’re always assessing real-world attacks. This type of testing has been around for many years now and is employed by big names like Netflix, Google and Facebook. The concept is rolling downhill and catching the ears of executives everywhere. Prepare yourself now, add this capability to your tool set, and you’ll win the day!

Resources

A curated list of awesome Chaos Engineering resources: https://github.com/dastergon/awesome-chaos-engineering
Chaos-Community Google Group: https://groups.google.com/g/chaos-community