Chaos Workshop

This section is based on the book “Learning Chaos Engineering” by Russ Miles (O’Reilly Media, 2019), which takes a hands-on approach to the topic.

Setting up the environment

The Chaos Toolkit was chosen because it is free and open source and has a large ecosystem of extensions that let you tailor chaos experiments to your own needs. It also uses a chaos experiment format that you write in YAML or JSON.

git clone https://github.com/chaostoolkit-incubator/community-playground.git
cd community-playground
python3 -m venv .venv
source .venv/bin/activate
pip install -U chaostoolkit

A simple Chaos Experiment

The target system is made up of a single Python file, service.py, which contains a single runtime service. The source code for this experiment is in the directory community-playground/learning-chaos-engineering-book-samples/chapter5/

from datetime import datetime
import io
import time
import threading
from wsgiref.simple_server import make_server

EXCHANGE_FILE = "./exchange.dat"


def update_exchange_file():
    """
    Writes the current date and time every 10 seconds into the exchange file.

    The file is created if it does not exist.
    """
    print("Will update to exchange file")
    while True:
        with io.open(EXCHANGE_FILE, "w") as f:
            f.write(datetime.now().isoformat())
        time.sleep(10)


def simple_app(environ, start_response):
    """
    Read the content of the exchange file and return it.
    """
    start_response('200 OK', [('Content-type', 'text/plain')])
    with io.open(EXCHANGE_FILE) as f:
        return [f.read().encode('utf-8')]


if __name__ == '__main__':
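    # Daemon thread, so Ctrl-C can end the process while the updater loop is still running.
    t = threading.Thread(target=update_exchange_file, daemon=True)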
    t.start()

    httpd = make_server('', 8080, simple_app)
    print("Listening on port 8080....")

    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        httpd.shutdown()
        t.join(timeout=1)

Every chaos experiment should have a meaningful title and a description that conveys how you believe the system will survive. The experiment for this service, experiment.json, is listed in full further down. In this case, you’ll be exploring how the service performs if, or more likely when, the exchange.dat file disappears for whatever reason. The title indicates that the service should tolerate this loss, but there is doubt. This chaos experiment will provide empirical evidence of whether your belief in the resiliency of the service is well founded.

The next section of the experiment file captures the steady-state hypothesis.

Remember that the steady-state hypothesis expresses, within certain tolerances, what constitutes normal and healthy for the portion of the target system being subjected to the chaos experiment. With only one service in your target system, the blast radius of your experiment (that is, the area anticipated to be impacted by the experiment) is also limited to your single service.

A Chaos Toolkit experiment’s steady-state hypothesis comprises a collection of probes. Each probe inspects some property of the target system and judges whether the property’s value is within an expected tolerance.

A Chaos Toolkit experiment’s method defines actions that will affect the system and cause the turbulent conditions, the chaos, that should be applied to the target system. Here the experiment is exploring how resilient the service is to the sudden absence of the exchange.dat file, so all the experiment’s method needs to do is rename that file so that it cannot be found by the service.
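To make the effect of this action concrete, the following minimal sketch reproduces the rename in a throwaway temporary directory rather than against the running service. The helper name `move_exchange_file` and the sample file content are hypothetical; the only call it relies on is the standard library's `os.rename`.

```python
import os
import tempfile


def move_exchange_file(directory):
    """Rename exchange.dat to exchange.dat.old inside `directory`.

    Hypothetical helper, for illustration only.
    """
    src = os.path.join(directory, "exchange.dat")
    dst = os.path.join(directory, "exchange.dat.old")
    os.rename(src, dst)
    return src, dst


# Work in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "exchange.dat"), "w") as f:
        f.write("2019-01-01T00:00:00")  # placeholder content

    src, dst = move_exchange_file(workdir)

    # After the rename, the original path is gone and the new one exists.
    original_gone = not os.path.exists(src)
    renamed_present = os.path.exists(dst)

print(original_gone, renamed_present)
```

From the service's point of view the file has simply vanished, which is exactly the turbulent condition the experiment wants to create.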

{
    "title": "Does our service tolerate the loss of its exchange file?",
    "description": "Our service reads data from an exchange file, can it support that file disappearing?",
    "tags": [
        "tutorial",
        "filesystem"
    ],
    "steady-state-hypothesis": {
        "title": "The exchange file must exist",
        "probes": [
            {
                "type": "probe",
                "name": "service-is-unavailable",
                "tolerance": [200, 503],
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/"
                }
            }
        ]
    },
    "method": [
        {
            "name": "move-exchange-file",
            "type": "action",
            "provider": {
                "type": "python",
                "module": "os",
                "func": "rename",
                "arguments": {
                    "src": "./exchange.dat",
                    "dst": "./exchange.dat.old"
                }
            }
        }
    ]
}
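
The probe in this experiment can be approximated by hand. The sketch below is not the Chaos Toolkit's implementation; it assumes that a list tolerance such as [200, 503] means the response status must be one of the listed values. The function names `probe_status` and `within_tolerance` are hypothetical.

```python
import urllib.error
import urllib.request


def within_tolerance(status, tolerance):
    """Return True when the HTTP status is one of the accepted values."""
    return status in tolerance


def probe_status(url, timeout=3):
    """Fetch `url` and return its HTTP status code, including error statuses.

    Connection failures (for example, the service is not running) are not
    handled here and surface as urllib.error.URLError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still the status.
        return err.code
```

With the service running, `within_tolerance(probe_status("http://localhost:8080/"), [200, 503])` evaluates the same condition as the probe. Under this reading, a generic server error such as 500 falls outside the tolerance, while both 200 and 503 are accepted.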

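# Terminal 1: start the service (it keeps running in the foreground)
python3 service.py
# Terminal 2: run the experiment against it
chaos run experiment.json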

The Chaos Toolkit uses the steady-state hypothesis for two purposes.

At the beginning of an experiment’s execution, the steady-state hypothesis is assessed to decide whether the target system is in a recognizably normal state. If the target system is deviating from the expectations of the steady-state hypothesis at this point, the experiment is aborted, as there is no value in executing an experiment’s method when the target system isn’t recognizably “normal” to begin with. In scientific terms, we have a “dirty petri dish” problem.

The second use of the steady-state hypothesis is its main role in an experiment’s execution. After an experiment’s method, with its turbulent condition–inducing actions, has completed, the steady-state hypothesis is again compared against the target system. This is the critical point of an experiment’s execution, because this is when any deviation from the conditions expected by the steady-state hypothesis will indicate that there may be a weakness surfacing under the method’s actions.
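
The two uses of the steady-state hypothesis can be sketched as a small control flow. This is a simplified illustration, not the Chaos Toolkit's code: `run_experiment`, `remove_file`, the dictionary standing in for the system, and the outcome labels "aborted", "deviated", and "completed" are all hypothetical and are not the toolkit's own journal values.

```python
def run_experiment(steady_state_ok, method):
    """Check the steady state, apply the method, then check it again.

    `steady_state_ok` is a callable returning True when the system looks
    normal; `method` is a callable that applies the turbulent conditions.
    """
    # First use: refuse to experiment on a system that is already abnormal.
    if not steady_state_ok():
        return "aborted"

    method()

    # Second use: a failed check now points at a possible weakness.
    if not steady_state_ok():
        return "deviated"
    return "completed"


def remove_file(system):
    """Stand-in for the experiment's method: the file disappears."""
    system["file_present"] = False


# A fragile system is only healthy while the file is present.
fragile = {"file_present": True}
result_fragile = run_experiment(
    lambda: fragile["file_present"], lambda: remove_file(fragile)
)

# A system that is unhealthy from the start is never experimented on.
broken = {"file_present": False}
result_broken = run_experiment(
    lambda: broken["file_present"], lambda: remove_file(broken)
)

# A resilient system stays within tolerance even without the file.
resilient = {"file_present": True}
result_resilient = run_experiment(lambda: True, lambda: remove_file(resilient))

print(result_fragile, result_broken, result_resilient)
```

The "aborted" case corresponds to the dirty petri dish, and the "deviated" case is the one that surfaces a weakness.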

Improved Service verification

The previous code assumes that the exchange.dat file is always there. If the file disappears for any reason, the service fails when its root URL is accessed, returning a server error. The improved version below checks for the file first and answers 503 Service Unavailable when it is missing. Make sure you have stopped the original service instance that contained the weakness, then start the improved, more resilient service and rerun the chaos experiment to verify that the weakness is gone.

from datetime import datetime
import io
import os.path
import time
import threading
from wsgiref.simple_server import make_server

EXCHANGE_FILE = "./exchange.dat"


def update_exchange_file():
    """
    Writes the current date and time every 10 seconds into the exchange file.
    The file is created if it does not exist.
    """
    print("Will update to exchange file")
    while True:
        with io.open(EXCHANGE_FILE, "w") as f:
            f.write(datetime.now().isoformat())
        time.sleep(10)


def simple_app(environ, start_response):
    """
    Read the content of the exchange file and return it.
    """
    if not os.path.exists(EXCHANGE_FILE):
        start_response(
            '503 Service Unavailable',
            [('Content-type', 'text/plain')]
        )
        return [b'Exchange file is not ready']

    start_response('200 OK', [('Content-type', 'text/plain')])
    with io.open(EXCHANGE_FILE) as f:
        return [f.read().encode('utf-8')]


if __name__ == '__main__':
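    # Daemon thread, so Ctrl-C can end the process while the updater loop is still running.
    t = threading.Thread(target=update_exchange_file, daemon=True)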
    t.start()

    httpd = make_server('', 8080, simple_app)
    print("Listening on port 8080....")

    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        httpd.shutdown()
        t.join(timeout=1)
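
Checking with `os.path.exists` and then opening the file leaves a small window in which the file can vanish between the two calls. One alternative, sketched here as an assumption rather than as the book's implementation, is to attempt the read and treat `FileNotFoundError` as the unavailable case. The `make_app` factory, its `exchange_file` parameter, and the `call_app` helper are hypothetical additions so the handler can be exercised without starting a server.

```python
import os
import tempfile


def make_app(exchange_file):
    """Build a WSGI app that serves the content of `exchange_file`."""
    def app(environ, start_response):
        try:
            with open(exchange_file) as f:
                body = f.read().encode("utf-8")
        except FileNotFoundError:
            # The file is missing: report unavailability instead of crashing.
            start_response("503 Service Unavailable",
                           [("Content-type", "text/plain")])
            return [b"Exchange file is not ready"]
        start_response("200 OK", [("Content-type", "text/plain")])
        return [body]
    return app


def call_app(app):
    """Invoke a WSGI app directly and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({}, start_response))
    return captured["status"], body


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "exchange.dat")
    app = make_app(path)

    missing = call_app(app)  # the file has not been created yet

    with open(path, "w") as f:
        f.write("2019-01-01T00:00:00")  # placeholder content
    present = call_app(app)

print(missing)
print(present)
```

Both variants answer 503 when the file is absent, so either should satisfy the experiment's tolerance of [200, 503].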