Distributed Systems for Fun and Profit - Ch 1
Distributed Systems for Fun and Profit is a short, informal text that provides a broad overview of the what, why, and how of distributed systems. It's recommended by a good chunk of people working in this space; I've seen it pointed to everywhere from blog posts to the undergrad operating systems course I'm taking right now.
The text is already pretty concise, so I'll avoid repeating its contents. Instead, I'll try to draw connections between what the text is saying and background knowledge from other resources. I'll also note down any lingering thoughts and questions I have.
Chapter 1: Distributed systems at a high level
Scalability
The fundamental definition of scalability makes the most sense to me in the context of algorithmic complexity in intro CS classes, with other costs substituted for runtime. For example, services would want their hardware expenses to resemble log(n) rather than n^2, where n could be a metric like number of customers.
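To make the analogy concrete, here's a toy comparison of the two growth curves. The cost functions are entirely made up; the point is just how differently they behave as n grows:

```python
import math

def cost_log(n):
    """Hypothetical cost model where expenses grow logarithmically with n."""
    return math.log2(n)

def cost_quadratic(n):
    """Hypothetical cost model where expenses grow quadratically with n."""
    return n ** 2

# As n (e.g. number of customers) grows 100x, log cost barely moves
# while quadratic cost explodes.
for n in (10, 100, 1000):
    print(f"n={n:>5}  log: {cost_log(n):8.2f}  quadratic: {cost_quadratic(n):>10}")
```

A "scalable" system is one whose cost curve looks more like the left column than the right one.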
Beyond the basic definition of scalability, the text also mentions geographic scalability and administrative scalability.
- Geographic scalability reminds me of the memory hierarchy in computer architecture. Storing relevant data closer to consumers can decrease latency / response time. This is extremely relevant for the giants in cloud computing, as evidenced by Werner Vogels' many posts about AWS global expansion.
- Modern orchestration frameworks and event monitoring solutions, such as the well-known Kubernetes project, can be viewed as examples of increasing administrative scalability through software systems.
Scalability has become one of those buzzwords everyone in the computing world likes to throw around / work towards. Here's an interesting paper on the notion of scalability, by some of the great minds behind timely dataflow.
Performance
The text defines "performance" as the following:
- Short response time/low latency for a given piece of work
- High throughput (rate of processing work)
- Low utilization of computing resource(s)
The first two are pretty familiar from classes on operating systems and computer architecture, but the last one is a bit curious. I'm used to high utilization being a good thing, because less idle time means more efficiency. For example, threads in operating systems block on expensive I/O operations so the CPU can run other threads in the meantime instead of just waiting there doing nothing. Perhaps here the author is referring to saving energy by keeping utilization low? Or the fact that, based on basic queueing theory, latencies grow to infinity as your system approaches 100% utilization? Maybe I'm missing something here.
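The queueing-theory intuition can be sketched with the standard M/M/1 mean response time formula, W = 1/(mu - lambda). This is a textbook model, not anything from the text itself, and the numbers below are made up:

```python
def mm1_response_time(service_rate, utilization):
    """Mean response time of an M/M/1 queue: W = 1 / (mu - lambda),
    where the arrival rate lambda = utilization * mu.
    As utilization approaches 1, the denominator approaches 0 and
    response time grows without bound."""
    arrival_rate = utilization * service_rate
    return 1.0 / (service_rate - arrival_rate)

# A server handling up to 100 requests/sec at various utilization levels.
for rho in (0.5, 0.9, 0.99):
    print(f"utilization {rho:.2f} -> mean response time {mm1_response_time(100.0, rho):.3f} s")
```

Going from 50% to 99% utilization here makes the mean response time 50x worse, which is one reason operators deliberately leave headroom rather than running hot.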
The text mentions batch processing as an example of a performance tradeoff, accepting worse response time for better throughput. This reminds me of all the discussion in OS class about schedulers. Engineering do be all about tradeoffs.
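The batching tradeoff can be modeled with a toy cost function: each batch pays a fixed overhead (think one disk seek or network round trip) plus a per-item cost. The constants are invented for illustration:

```python
def batch_stats(batch_size, overhead=0.010, per_item=0.001):
    """Toy model of batch processing. Each batch costs a fixed overhead
    plus a per-item cost (both in seconds; numbers are made up).
    Returns (throughput in items/sec, latency of a full batch in sec)."""
    batch_time = overhead + per_item * batch_size
    throughput = batch_size / batch_time
    latency = batch_time  # every item waits for the whole batch to finish
    return throughput, latency

for size in (1, 10, 100):
    tput, lat = batch_stats(size)
    print(f"batch of {size:>3}: {tput:7.1f} items/s, {lat * 1000:6.1f} ms latency")
```

Bigger batches amortize the fixed overhead across more items (throughput goes up), but every item now waits longer for its batch to complete (latency goes up) — the scheduler discussions from OS class in miniature.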
Availability (and fault-tolerance)
Distributed systems are not only a mechanism for computational efficiency at scale, but are also a mechanism for tolerance of partial failures. With a single node, the system is either available or it's not. With multiple nodes, we can increase availability by allowing individual nodes to fail without affecting the functioning of the overall system. Those individual nodes can be repaired / replaced asynchronously, and it will seem to consumers as if nothing ever happened. Note that "nodes" can be components, machines, or even entire datacenters.
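The redundancy argument has a nice back-of-the-envelope form: if each node is up with some probability and failures are independent (a big assumption in practice — correlated failures are common), the chance that at least one replica is up grows quickly with the number of replicas:

```python
def availability(node_uptime, replicas):
    """Probability that at least one of `replicas` nodes is up,
    assuming each is up with probability `node_uptime` and failures
    are independent (often not true in the real world)."""
    return 1.0 - (1.0 - node_uptime) ** replicas

for r in (1, 2, 3):
    print(f"{r} replica(s) at 99% uptime each -> {availability(0.99, r):.6f}")
```

Three independent 99%-available nodes get you roughly six nines — which is why "repair asynchronously while the rest of the system carries on" works at all.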
What prevents us from achieving good things?
The text continues on to introduce some of the real-world constraints we need to consider and the guarantees we might want to provide when designing systems. Presumably, reading papers published by major cloud players will provide a lot more depth here.
Abstractions and models
I really like the text's discussion of abstractions and models here. One of the coolest aspects of engineering and mathematics in general is how powerful correct abstraction and modeling can be. It's also interesting to think about the duality of performance and (human) understandability, as presented in the text.
Partitioning and replication
The text presents partitioning and replication as the two fundamental techniques for distributing data storage and computation. It's kind of weird to think that all techniques for distributing data and the work done on it boil down to these two simple concepts, but at the same time it makes sense.
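A minimal sketch of both ideas at once: hash-partition each key to a primary node, then replicate it onto the next few nodes. The node names and replica count are hypothetical, and real systems (consistent hashing, leader election, etc.) are far more involved:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICAS = 2  # how many copies of each key to keep

def owners(key, nodes=NODES, replicas=REPLICAS):
    """Partitioning: hash the key to choose a primary node.
    Replication: also place the key on the next `replicas - 1` nodes."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    primary = digest % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]

print(owners("user:42"))   # two distinct nodes responsible for this key
print(owners("order:17"))  # a (likely different) pair of nodes
```

Partitioning spreads distinct keys across nodes so no single machine holds everything; replication puts each key on multiple nodes so losing one doesn't lose the data. Every fancier scheme is some elaboration of this pair.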