Chaos Workshop
This section is based on the book “Learning Chaos Engineering” by Russ Miles (O’Reilly Media, 2019), which takes a hands-on approach to the topic.
Setting up the environment
The Chaos Toolkit was chosen because it is free and open source and has a large ecosystem of extensions that let you tailor chaos experiments to your own needs. It also defines a chaos experiment format that you write in YAML or JSON.
A simple Chaos Experiment
The target system consists of a single Python file, service.py, which contains a single runtime service. The source code for this experiment is in the directory community-playground/learning-chaos-engineering-book-samples/chapter5/
Every chaos experiment should have a meaningful title and a description that conveys how you believe the system will survive. In this case, you’ll be exploring how the service performs if, or more likely when, the exchange.dat file disappears for whatever reason. The title indicates that the service should tolerate this loss, but there is doubt. This chaos experiment will empirically prove whether your belief in the resiliency of the service is well founded.
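As a minimal sketch, the opening of such an experiment could look like the following. The wording of the title and description and the tags are assumptions made for illustration, not the book's exact listing. The experiment is built here as a Python dictionary and serialized to JSON, one of the two formats the Chaos Toolkit accepts.

```python
import json

# Hypothetical wording: the title states the belief under test,
# phrased so that doubt is still visible.
experiment_header = {
    "title": "Does our service tolerate the loss of its exchange file?",
    "description": (
        "Our service reads data from an exchange file; "
        "can it keep responding if that file disappears?"
    ),
    "tags": ["tutorial", "filesystem"],  # assumed, purely descriptive
}

# Serialize to the JSON text that would begin an experiment file.
header_json = json.dumps(experiment_header, indent=4)
print(header_json)
```

Phrasing the title as a question keeps the focus on the belief being tested rather than on the failure being injected.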
The next section of the experiment file captures the steady-state hypothesis.
Remember that the steady-state hypothesis expresses, within certain tolerances, what constitutes normal and healthy for the portion of the target system being subjected to the chaos experiment. With only one service in your target system, the Blast Radius of your experiment—that is, the area anticipated to be impacted by the experiment—is also limited to your single service.
A Chaos Toolkit experiment’s steady-state hypothesis comprises a collection of probes. Each probe inspects some property of the target system and judges whether the property’s value is within an expected tolerance.
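To make the idea of a probe with a tolerance concrete, here is a small sketch. The probe name, the URL, and the port are assumptions, and the within_tolerance helper is a simplified illustration of the concept rather than the Chaos Toolkit's own evaluation logic.

```python
# Hypothetical steady-state hypothesis containing a single HTTP probe.
steady_state_hypothesis = {
    "title": "The service responds on its root URL",
    "probes": [
        {
            "type": "probe",
            "name": "service-is-available",  # assumed name
            "tolerance": 200,  # expected HTTP status code
            "provider": {
                "type": "http",
                "url": "http://localhost:8080/",  # assumed address
            },
        }
    ],
}


def within_tolerance(observed, tolerance):
    """Simplified check: a list tolerance must contain the observed
    value; any other tolerance must equal it exactly."""
    if isinstance(tolerance, list):
        return observed in tolerance
    return observed == tolerance


probe = steady_state_hypothesis["probes"][0]
healthy = within_tolerance(200, probe["tolerance"])
unhealthy = within_tolerance(500, probe["tolerance"])
print(healthy, unhealthy)
```

Under these assumptions, a 200 response keeps the system inside its steady state, while a server error such as 500 counts as a deviation.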
A Chaos Toolkit experiment’s method defines actions that will affect the system and cause the turbulent conditions, the chaos, that should be applied to the target system. Here the experiment is exploring how resilient the service is to the sudden absence of the exchange.dat file, so all the experiment’s method needs to do is rename that file so that it cannot be found by the service.
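The method described above might be sketched as a single action that calls Python's os.rename. The action name and the file paths are assumptions; the second half of the block applies the same call in a throwaway directory to show its effect without touching a real service.

```python
import os
import tempfile

# Hypothetical method section: one action that renames exchange.dat
# through a Python provider (module, function, and arguments).
method = [
    {
        "type": "action",
        "name": "move-exchange-file",  # assumed name
        "provider": {
            "type": "python",
            "module": "os",
            "func": "rename",
            "arguments": {
                "src": "./exchange.dat",      # assumed path
                "dst": "./exchange.dat.old",  # assumed path
            },
        },
    }
]

# Apply the same call in a temporary directory to see what it does.
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "exchange.dat")
    dst = os.path.join(workdir, "exchange.dat.old")
    with open(src, "w") as f:
        f.write("sample exchange data\n")

    os.rename(src, dst)  # the turbulent condition: the file "disappears"

    file_missing = not os.path.exists(src)
    backup_present = os.path.exists(dst)

print(file_missing, backup_present)
```

Renaming rather than deleting keeps the original data available, so the file can be restored after the experiment.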
The Chaos Toolkit uses the steady-state hypothesis for two purposes.
- At the beginning of an experiment’s execution, the steady-state hypothesis is assessed to decide whether the target system is in a recognizably normal state. If the target system is deviating from the expectations of the steady-state hypothesis at this point, the experiment is aborted, as there is no value in executing an experiment’s method when the target system isn’t recognizably “normal” to begin with. In scientific terms, we have a “dirty petri dish” problem.
- The second use of the steady-state hypothesis is its main role in an experiment’s execution. After an experiment’s method, with its turbulent condition–inducing actions, has completed, the steady-state hypothesis is again compared against the target system. This is the critical point of an experiment’s execution, because this is when any deviation from the conditions expected by the steady-state hypothesis will indicate that there may be a weakness surfacing under the method’s actions.
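The two uses of the steady-state hypothesis can be sketched as a simplified control flow. This is an illustration only, not the Chaos Toolkit's implementation; the function names and the outcome labels ("aborted", "deviated", "completed") are chosen for this sketch.

```python
def run_experiment(check_steady_state, apply_method):
    """Simplified experiment flow built around two hypothesis checks."""
    # Use 1: confirm the system looks normal before doing anything.
    if not check_steady_state():
        return "aborted"  # the "dirty petri dish" case

    # Inject the turbulent conditions.
    apply_method()

    # Use 2: compare the system against the hypothesis again.
    if not check_steady_state():
        return "deviated"  # a possible weakness has surfaced
    return "completed"


# A toy target system whose only property is whether the file exists.
state = {"file_present": True}


def file_present():
    return state["file_present"]


def remove_file():
    state["file_present"] = False


first_outcome = run_experiment(file_present, remove_file)
# The file is still missing, so a second run fails the initial check.
second_outcome = run_experiment(file_present, remove_file)
# A method that does no harm leaves a healthy system in its steady state.
harmless_outcome = run_experiment(lambda: True, lambda: None)

print(first_outcome, second_outcome, harmless_outcome)
```

The second run shows why the initial check matters: with the system already unhealthy, the method is never applied and nothing could be learned from applying it.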
Improved Service verification
The original service assumes that the exchange.dat file is always present. If the file disappears for any reason, the service fails when its root URL is accessed and returns a server error. First make sure you have stopped the original service instance that contained the weakness, then run the new, more resilient version of the service. Rerun the chaos experiment to verify that the identified weakness is gone.
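One possible way to remove the weakness is sketched here using only the Python standard library. It is a stand-in for the service's request handler, not the book's code: the function name handle_root, the response text, and the strategy of recreating the file when it is missing are all assumptions.

```python
import os
import tempfile
from datetime import datetime


def handle_root(exchange_path):
    """Hypothetical root-URL handler that tolerates a missing file."""
    if not os.path.exists(exchange_path):
        # Assumed recovery strategy: recreate the file instead of failing.
        with open(exchange_path, "w") as f:
            f.write(datetime.now().isoformat())
    with open(exchange_path) as f:
        return f"Exchange file contents: {f.read()}"


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "exchange.dat")
    first_response = handle_root(path)   # file absent: created on demand
    os.rename(path, path + ".old")       # same action as the experiment
    second_response = handle_root(path)  # file recreated, no error raised
    recovered = os.path.exists(path)

print(recovered)
```

The check-then-open pattern leaves a small window in which the file could vanish between the two steps; catching FileNotFoundError around the read would close that window.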