Distributed Systems for Fun and Profit - Ch 2


Distributed Systems for Fun and Profit is a short, informal text that gives a broad overview of the what, why, and how of distributed systems. It's recommended as an introduction by a good chunk of people working in this space, from blog posts to the undergrad operating systems course I'm taking right now.

The text is already pretty concise, so I'll avoid repeating its contents. Instead, I'll draw connections between what the text says and background knowledge from other resources, and note down any lingering thoughts and questions I have.

Chapter 2: Up and down the level of abstraction

System models

The text's definition of a system model boils down to choosing which aspects of the system we consider important and which we can safely ignore, so that we can reason about the system as a whole more effectively. System models vary in the assumptions they make about the real world. The weaker the assumptions, the more robust the model, and the better our solutions perform in the real world. On the other hand, the stronger the assumptions, the easier the model is to reason about, at the cost of robustness.

This is similar to the method of thinking I've observed in control system design for robots. As a basic example, we can, with relative ease, design a theoretically optimal controller using a simple model of a given mechanical system, under the strong assumption that the system we're trying to control has no mechanical backlash. But if the actual physical system does in fact exhibit significant backlash, our controller has a high chance of not working well, if at all. Removing the no-backlash assumption from our system model means the controller we develop is necessarily more complex, but, if done correctly, it will perform well in the real world.

This also relates to something I've observed from doing Leetcode-style interviews for internships. Given very few assumptions in the problem statement, a big thing interviewers judge interviewees on is how well they consider malformed inputs and edge cases. Something I learned the hard way.

The text mentions common failure models for nodes, including crash-recovery models and Byzantine faults. Crash models, where nodes can only fail by crashing and are assumed to behave properly up to the time of the crash, make a strong assumption but are considered more practical to deal with. Byzantine faults, where nodes can be "dishonest" and pass arbitrary data to different observers in the system, are a more general problem and are considered difficult and expensive to address. This blog talks a lot about Byzantine fault tolerance, among other things. Interesting applications of BFT can be found in cryptocurrency and avionics.
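To make the difference concrete, here's a minimal Python sketch of my own (the class names and behavior are illustrations, not from the text): a crash-stop node either answers honestly or goes silent, while a Byzantine node may tell each observer something different.

```python
import random

class CrashStopNode:
    """Fails only by halting; behaves correctly up to the crash."""
    def __init__(self, value):
        self.value = value
        self.crashed = False

    def send(self, observer):
        if self.crashed:
            return None       # a crashed node sends nothing at all
        return self.value     # otherwise it always reports honestly

class ByzantineNode:
    """May report arbitrary, even conflicting, values to observers."""
    def __init__(self, value):
        self.value = value

    def send(self, observer):
        # A Byzantine node can tell each observer something different,
        # so no single reply reveals that the node is faulty.
        return random.choice([self.value, not self.value, None])

honest = CrashStopNode(True)
liar = ByzantineNode(True)
print([honest.send(o) for o in range(3)])  # [True, True, True]
print([liar.send(o) for o in range(3)])    # e.g. [True, None, False]
```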

Fault tolerance and availability are concerns prevalent across all types of systems engineering (see reliability engineering). It could be interesting to explore what methods other fields have developed and bring them into the fold of computing.

The model of nodes and communication links reminds me of graph theory. When learning graph theory, we made the strong assumption that nodes (vertices) and communication links (edges) never "failed", so we could focus on the higher-level algorithmic reasoning. This means the algorithms we studied generalize to many different application domains, but wouldn't work well off the shelf if the actual application presents faults we didn't account for. In this sense, graph theory can be considered a very high-level abstraction for many types of systems. I wouldn't be surprised if topics like strong connectivity and network flow show up later.
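As a small illustration (again my own toy example, not from the text), here's the failure-free graph abstraction in Python: a cluster as an adjacency list, with reachability computed by BFS. The moment we let a link fail, the clean "everything is reachable" picture breaks down.

```python
from collections import deque

def reachable(adjacency, start):
    """BFS over the nodes reachable from `start` via working links."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# A four-node cluster as an undirected graph (adjacency lists).
links = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(reachable(links, "a"))  # {'a', 'b', 'c', 'd'}: fully connected

# The classic graph-theory assumption breaks once a link fails:
links["b"].remove("c")
links["c"].remove("b")
print(reachable(links, "a"))  # {'a', 'b'}: the cluster is partitioned
```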

Timing / ordering assumptions

Synchronous system models make much stronger assumptions, so they are easier to reason about but less realistic. The text (and presumably most modern systems research) focuses on asynchronous system models.
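A sketch of why the synchronous model is so much easier to work with (the constant and function here are hypothetical, my own illustration): if message delay has a known upper bound, a simple timeout is a perfectly accurate failure detector; without that bound, the same timeout can wrongly suspect a node that's merely slow.

```python
MAX_DELAY = 2.0  # the synchronous model's assumed delay bound (seconds)

def suspect_crashed(last_heard_from, now, timeout=MAX_DELAY):
    """Timeout-based failure detection.

    Under the synchronous model (every message arrives within
    MAX_DELAY), this is perfectly accurate: silence longer than the
    bound can only mean a crash. Under the asynchronous model there
    is no such bound, so this can wrongly suspect a slow node.
    """
    return now - last_heard_from > timeout

print(suspect_crashed(last_heard_from=0.0, now=1.5))  # False: within bound
print(suspect_crashed(last_heard_from=0.0, now=3.0))  # True under synchrony;
                                                      # possibly wrong otherwise
```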

The consensus problem

The first time I heard the term "consensus" was in the context of multi-agent robotics. It's interesting to see the connection between UAV swarms and datacenters. While on the surface they seem like wildly different technologies, both are really just a bunch of computers working together.
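To pin down what consensus even asks for, here's a toy Python sketch of my own (not from the text): in a failure-free world where every message arrives, consensus is trivially solvable in one round. The entire difficulty, which the FLP result below addresses, comes from nodes failing partway through.

```python
def naive_consensus(inputs):
    """Trivial consensus when no node fails and all messages arrive:
    every node broadcasts its input, then applies the same
    deterministic rule (here: minimum) to the full set of values.

    Agreement:   all nodes compute the same value.
    Validity:    the decided value was proposed by some node.
    Termination: one round, always.
    """
    decided = min(inputs.values())
    return {node: decided for node in inputs}

proposals = {"n1": 3, "n2": 1, "n3": 2}
print(naive_consensus(proposals))  # {'n1': 1, 'n2': 1, 'n3': 1}
```

The sketch falls apart as soon as one node can crash mid-broadcast, leaving some nodes with a value set that others never saw; that gap is exactly where the hard part of consensus lives.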

The FLP impossibility result

The text summarizes the FLP result as follows:

the FLP result states that "there does not exist a (deterministic) algorithm for the consensus problem in an asynchronous system subject to failures, even if messages can never be lost, at most one process may fail, and it can only fail by crashing (stopping executing)".

According to the text, the FLP result is significant because it means that under the asynchronous model, there does not exist a consensus algorithm that can guarantee both safety and liveness.

Henry Robinson also has a more in-depth look at the FLP result here.

The CAP theorem

Conjectured by Berkeley Professor Eric Brewer and later formally proven by Gilbert and Lynch, the CAP theorem is one of the most popular mental models in system design, even making its way into interviews for general software engineers.

There are a lot of different interpretations of and arguments over the details of the CAP theorem floating around. Refer to the explanation in this text as well as this update by Brewer in 2012. The key is that in more modern systems, there's a lot of nuance beyond just "choose two of three".

With a more nuanced view, does this text's categorization of 2PC as CA, Paxos as CP, and Dynamo as AP still hold up? If not, it would be interesting to dive into the details of the tradeoffs for each of those three examples.
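As a rough illustration of the tradeoff itself (a toy sketch under my own simplifying assumptions, not how any of those three systems actually work): during a partition, a replica can either refuse writes it can't replicate, giving up availability, or accept them locally and risk divergence, giving up consistency.

```python
class Replica:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.partitioned = False  # can this replica reach its peers?

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency over availability: refuse writes that
                # can't be replicated to the other side.
                raise RuntimeError("unavailable during partition")
            # Availability over consistency: accept the write locally,
            # knowing replicas may diverge until the partition heals.
        self.value = value
        return "ok"

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
print(ap.write(42))   # ok (but replicas can now diverge)
try:
    cp.write(42)
except RuntimeError as e:
    print(e)          # unavailable during partition
```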

Update (3/29/21):

Today I had the pleasure of speaking with Professor Crooks, who teaches the operating systems class I'm taking right now, about some concepts presented in this chapter (among other cool topics like RDMA and DPDK)! I will attempt to summarize some of what I learned below. It's possible that I misinterpreted ideas that Professor Crooks was trying to convey to me; any erroneous conclusions below are my own.

On safety vs liveness

Definitions:

A safety property says that nothing bad ever happens: the system never reaches an incorrect state (for consensus, no two nodes ever decide on different values). A liveness property says that something good eventually happens: the system eventually makes progress (for consensus, every non-faulty node eventually decides).

Note that "safety" and "liveness" are fundamentally different types of guarantees. Before, I had the wrong intuition that if we have "safety", then we automatically have "liveness". This made me question why the textbook mentioned that FLP implies we need to choose between the two. I now see that I was mistaking "liveness" for absence of deadlock. This reminds me of a previous lecture in OS class which emphasized that deadlock avoidance strategies like the Banker's Algorithm only guarantee that we stay in a "safe state", but they don't guarantee forward progress.

The textbook's comment about choosing between safety and liveness makes more sense with these corrected definitions.

Note that "safety" and "liveness" are not singular concepts, but rather a type of property. Different systems can have different levels of "safety", depending on their architecture. In the past decade or so, a lot of databases have been created that offer varying levels of safety guarantees, which I assume is traded off with levels of liveness guarantees or measures of latency.

This blog post by Peter Bailis could be an interesting additional read related to the concept of safety vs liveness.

On developments in methods of reasoning about different consistency levels

Professor Crooks' PhD thesis happens to be relevant here! The first two chapters were recommended to build context around what this problem space looks like.