This section is from the book “Chaos Engineering” by Casey Rosenthal and Nora Jones, O’Reilly Media, 2020. The book provides a good overview of the methods and broad use cases (without code).
Behaviors of Complex Distributed Systems
In complex systems, Black Swan events may be more frequent than you think.
In this section we present manifestations of complex system behavior.
Two Tier App Behavior (Netflix Recommendation Service)
In the complex system shown below we have 4 microservices.
Figure: diagram of microservice components showing the flow of requests coming in to P and proceeding through storage.
- Service P: Stores personalized information. An ID represents a person and some metadata associated with that person. For simplicity, the metadata stored is never very large, and people are never removed from the system. P passes data to Q to be persisted. If Q times out, or returns an error, then P can degrade gracefully by returning a default response. For example, P could return un-personalized metadata for a given person if Q is failing.
- Service Q: A generic storage service used by several upstream services. It stores data in a persistent database for fault tolerance and recovery, and in a memory-based cache database for speed. Service Q writes data to both services, S and T. When retrieving data, it reads from Service T first, since that is quicker. If the cache fails for some reason, it reads from Service S. If both Service T and Service S fail, then it can send a default response back upstream (see the read-path sketch after this component list).
- Service S: A persistent storage database, perhaps a columnar storage system like Cassandra or DynamoDB.
- Service T: An in-memory cache, perhaps something like Redis or Memcached.
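To make Q's retrieval behavior concrete, here is a minimal Python sketch of the read path just described: try the in-memory cache (T), fall back to the persistent store (S), and return a default response if both fail. The book contains no code, so the client objects, exception types, and `DEFAULT_RESPONSE` value below are illustrative assumptions.

```python
# Hypothetical sketch of Service Q's read and write paths (the book describes
# the behavior without code). Client objects and exception types are assumed.

DEFAULT_RESPONSE = {"status": 404, "body": None}  # assumed default payload


class CacheTimeout(Exception):
    """Raised by the (assumed) cache client when Service T fails or times out."""


class StorageError(Exception):
    """Raised by the (assumed) storage client when Service S fails."""


def read_from_q(key, cache_t, store_s):
    """Read `key` the way Service Q is described: T first, then S, then default."""
    try:
        return cache_t.get(key)      # fast path: in-memory cache (Service T)
    except CacheTimeout:
        pass                         # cache is down; fall through to Service S
    try:
        return store_s.get(key)      # slow path: persistent store (Service S)
    except StorageError:
        return DEFAULT_RESPONSE      # both tiers failed: default response upstream


def write_to_q(key, value, cache_t, store_s):
    """Writes go to both tiers, as described above."""
    store_s.put(key, value)          # persistent database (Service S)
    cache_t.put(key, value)          # memory-based cache (Service T)
```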
- One day, T fails.
- Lookups to P start to slow down, because Q notices that T is no longer responding, and so it switches to reading from S. Unfortunately for this setup, it’s common for systems with large caches to have read-heavy workloads. In this case, T was handling the read load quite well because reading directly from memory is fast, but S is not provisioned to handle this sudden workload increase.
- S slows down and eventually fails. Those requests time out.
- Fortunately, Q was prepared for this as well, and so it returns a default response. For a particular version of Cassandra, the default response when looking up a data object while all three replicas are unavailable is a 404 [Not Found] response code, so Q emits a 404 to P.
- P knows that the person it is looking up exists because it has an ID. People are never removed from the service. The 404 [Not Found] response that P receives from Q is therefore an impossible condition by virtue of the business logic. P could have handled an error from Q, or even a lack of response, but it has no condition to catch this impossible response. P crashes, taking down the entire system with it.
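The crash step can be pictured as a response handler in P that covers the failures it expects (errors, timeouts) but has no branch for a "not found" answer that the business logic says is impossible. This is a hedged sketch only; the status handling and exception types are assumptions, not the book's code.

```python
# Hypothetical sketch of Service P's handling of responses from Q.
# Expected failures degrade gracefully; the "impossible" 404 does not.

DEFAULT_METADATA = {"personalized": False}   # assumed un-personalized fallback


def lookup_person(person_id, service_q):
    try:
        response = service_q.get(person_id)
    except (TimeoutError, ConnectionError):
        return DEFAULT_METADATA              # Q timed out or is unreachable: degrade

    if response["status"] == 200:
        return response["body"]              # normal case: personalized metadata
    if response["status"] >= 500:
        return DEFAULT_METADATA              # Q reported an error: degrade

    # People are never removed, so by the business logic a 404 for a known ID
    # "cannot happen." No branch catches it, so the impossible condition
    # escapes as an unhandled exception and takes P down.
    raise RuntimeError(
        f"unexpected response {response['status']} for existing id {person_id}"
    )
```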
Two-Tiered Application with Autoscaler
Autoscaling Storage
- System R: Stores a personalized user interface. Given an ID that represents a person, it will return a user interface customized to the movie preferences of that individual. R calls S for additional information about each person. If R can't retrieve information about a person from S, then it falls back to a default user interface.
- System S: Stores a variety of information about users, such as whether they have a valid account and what they are allowed to watch. This is too much data to fit on one instance or virtual machine, so S separates access (routing) from storage (reading and writing) using two subcomponents:
- S-L: A load balancer that uses a consistent hash algorithm to distribute the read-heavy load across the S-D components (sketched after this list).
- S-D: A storage unit that has a small sample of the full dataset. For example, one instance of S-D might have information about all of the users whose names start with the letter “m,” whereas another might store those whose names start with the letter “p.”
- Embedded autoscaler: Scaling policies keep the clusters appropriately sized. If disk I/O drops below a certain threshold on S-D, for example, S-D will hand off data from the least busy node and shut that node down, and S-L will redistribute that workload to the remaining nodes. S-D data is held in a redundant on-node cache, so if the disk is slow for some reason, a slightly stale result can be returned from the cache. Alerts are set to trigger on increased error ratios, outlier detection will restart instances behaving oddly, etc.
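As a rough illustration of how S-L's consistent-hash routing and the disk-I/O scale-down policy fit together, consider the Python sketch below. The ring construction, the threshold value, and the metric names are assumptions for illustration; the book describes this architecture without code.

```python
import bisect
import hashlib

# Hypothetical sketch of S-L's consistent-hash routing plus the autoscaler's
# scale-down rule. Hash ring, threshold, and metrics are illustrative only.


def _hash(value: str) -> int:
    """Stable hash used for both node placement and key routing."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Routes every request for a given user ID to the same S-D node."""

    def __init__(self, nodes):
        self._ring = sorted((_hash(node), node) for node in nodes)

    def node_for(self, user_id: str) -> str:
        hashes = [h for h, _ in self._ring]
        idx = bisect.bisect(hashes, _hash(user_id)) % len(self._ring)
        return self._ring[idx][1]

    def remove(self, node: str) -> None:
        """Called on scale-down; the removed node's keys move to its neighbors."""
        self._ring = [(h, n) for h, n in self._ring if n != node]


DISK_IO_SCALE_DOWN_THRESHOLD = 0.2   # assumed fraction of provisioned disk I/O


def maybe_scale_down(ring: ConsistentHashRing, disk_io_by_node: dict):
    """Shut down the least busy S-D node when its disk I/O is below threshold,
    letting S-L redistribute that node's users to the remaining nodes."""
    least_busy = min(disk_io_by_node, key=disk_io_by_node.get)
    if disk_io_by_node[least_busy] < DISK_IO_SCALE_DOWN_THRESHOLD:
        ring.remove(least_busy)
        return least_busy             # node drained and terminated
    return None                       # load is healthy; keep current size
```

Note that the consistent hash deterministically maps a given user ID to one S-D node, so every request for that user lands on the same instance.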
Can you figure out what could go wrong when a significant load of requests from the same user arrives at the same time?