Notes on Designing Data-Intensive Applications
2024 Feb 13
I just finished reading Kleppmann's Designing Data-Intensive Applications, which at this point seems to be something of a seminal text, and wanted to jot down some thoughts I had about the book.
Some context on my background: I spent 3.5 years at Google, most of it working on Cloud Firestore and in the guts of Bigtable, Megastore, and Spanner, so I know some things about distributed data systems.
Overall, I found the book to be really well done:
- It introduces the reader to a wide variety of design problems and the patterns used to solve them.
- It strikes a good balance between theory and practice by consistently providing real-life examples of the design patterns it discusses.
- It spends, appropriately, a lot of time discussing transactions and consistency semantics¹.
Criticisms
- I would've liked to see discussion of migration patterns (a sketch of one approach follows below), e.g.:
  - How do you do a live migration?
  - How do you do a migration with downtime?
  - How do you decide what acceptable downtime for a migration is?
  - Do you migrate all clients individually?
  - Do you migrate clients transparently using a middleware layer?
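  To make the middleware option concrete, here's a minimal sketch of the dual-write phase of a live migration, assuming a simple key-value store interface; all class and method names here are hypothetical, not from the book:

  ```python
  # Dual-write phase of a live migration: writes go to both stores while the
  # old store remains the source of truth; reads flip to the new store only
  # after backfill and verification complete.

  class InMemoryStore:
      def __init__(self):
          self._data = {}

      def put(self, key, value):
          self._data[key] = value

      def get(self, key):
          return self._data.get(key)

  class DualWriteStore:
      def __init__(self, old_store, new_store, read_from_new=False):
          self.old, self.new = old_store, new_store
          self.read_from_new = read_from_new  # flip once backfill is verified

      def put(self, key, value):
          self.old.put(key, value)      # old store is still authoritative
          try:
              self.new.put(key, value)  # best-effort during migration
          except Exception:
              pass  # log and repair later; don't fail the user's write

      def get(self, key):
          store = self.new if self.read_from_new else self.old
          return store.get(key)

  store = DualWriteStore(InMemoryStore(), InMemoryStore())
  store.put("user:1", "alice")
  print(store.get("user:1"))  # still served from the old store
  ```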
- I would've liked to see discussion of techniques for detecting and managing data consistency issues - specifically, patterns around cron jobs and automatic vs on-demand data repair.²
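  As an illustration of the kind of tooling I have in mind, here's a minimal sketch of a checker that a cron job might run; the row-digest scheme and repair policy are my own assumptions, not patterns from the book:

  ```python
  # Scheduled consistency check: digest corresponding rows on the primary and
  # a replica, report mismatches, and (on demand) repair from the primary.
  import hashlib

  def row_digest(row: dict) -> str:
      # Stable digest of a row; assumes values have a stable repr().
      return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

  def find_inconsistencies(primary_rows: dict, replica_rows: dict) -> list:
      # Inputs map primary key -> row dict.
      mismatched = []
      for key, row in primary_rows.items():
          other = replica_rows.get(key)
          if other is None or row_digest(row) != row_digest(other):
              mismatched.append(key)
      return mismatched

  def repair(keys, primary_rows: dict, replica_rows: dict) -> None:
      # On-demand repair: copy the primary's version over the replica's.
      for key in keys:
          replica_rows[key] = dict(primary_rows[key])

  primary = {1: {"name": "alice"}, 2: {"name": "bob"}}
  replica = {1: {"name": "alice"}, 2: {"name": "bbo"}}  # a bad write
  bad = find_inconsistencies(primary, replica)
  repair(bad, primary, replica)
  assert replica == primary
  ```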
- I do not like its presentation of exactly-once semantics. You can only guarantee either at-most-once semantics or at-least-once semantics. Yes, you can combine at-least-once semantics with idempotent operations to give the appearance of exactly-once semantics (sketch below), but you cannot actually provide exactly-once semantics (i.e. you can't guarantee that an operation will succeed).
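  Here's a minimal sketch of that at-least-once-plus-idempotency pattern; the in-memory dedupe set stands in for what would need to be a durable table, updated transactionally with the operation's effects:

  ```python
  # At-least-once delivery plus an idempotency key gives the *appearance* of
  # exactly-once processing: retries of the same operation become no-ops.

  processed_ids = set()  # stand-in for a durable, transactional dedupe table

  def apply_once(op_id: str, apply_fn) -> None:
      if op_id in processed_ids:
          return  # duplicate delivery; the retry is a no-op
      apply_fn()
      processed_ids.add(op_id)

  balance = 0

  def credit_10():
      global balance
      balance += 10

  for _ in range(3):  # the sender retries until it sees an ack
      apply_once("txn-42", credit_10)

  assert balance == 10  # the effect was applied exactly once
  ```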
- I wish it talked about tradeoffs at a very abstract (and somewhat reductionist) level, e.g. that OLTP vs OLAP is the difference between systems that support mixed read-write workloads and systems that prioritize read-heavy workloads.
- I would've liked to see some case studies - but the book is already incredibly long and very dense as it is.
There's always more
Some things that came to mind which the book did not discuss (appropriately so, I think, since Kleppmann had to draw the line somewhere for content to put in the book):
General distributed system patterns
- Caches are tricky, because cache invalidation is a hard problem. It sounds easy to do, but when everyone has a story about an outage they dealt with because a cache somewhere wasn't invalidated, well... Or, as I like to think of it: caches are an easy way to introduce eventual consistency into your system (sketch below).
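  A minimal sketch of what I mean, using a TTL cache: the TTL bounds staleness but doesn't eliminate it, so every cached read is potentially an eventually-consistent read. The cache interface here is hypothetical:

  ```python
  # A TTL cache bounds, but does not eliminate, staleness: between a write to
  # the backing store and the entry's expiry, readers see the old value.
  import time

  class TTLCache:
      def __init__(self, ttl_seconds: float):
          self.ttl = ttl_seconds
          self._entries = {}  # key -> (value, expiry time)

      def get(self, key, load_fn):
          entry = self._entries.get(key)
          if entry and entry[1] > time.monotonic():
              return entry[0]  # possibly stale!
          value = load_fn(key)
          self._entries[key] = (value, time.monotonic() + self.ttl)
          return value

  db = {"user:1": "alice"}
  cache = TTLCache(ttl_seconds=60)
  print(cache.get("user:1", db.get))  # "alice", now cached
  db["user:1"] = "alicia"             # write goes around the cache...
  print(cache.get("user:1", db.get))  # ...still "alice": eventual consistency
  ```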
- "Control plane and data plane" is super common - your data plane, which handles most user-facing work, needs to scale, but your control plane, which handles your partitioning/replication, usually doesn't (and often doesn't even have the same availability requirements!).
- When you start building distributed systems, you rapidly run into new classes of problems: long tail latencies, queries of death, feedback loops, and so on. You also have to design for these (one example follows below).
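  As one example, a standard design response to long tail latencies is the hedged request: if the first replica is slow to answer, fire a backup request and take whichever response arrives first. A minimal sketch, with the replica query simulated:

  ```python
  # Hedged request: send a backup request if the first replica hasn't answered
  # within a small delay, and take whichever response finishes first. This
  # trims tail latency at the cost of some extra load.
  import concurrent.futures
  import random
  import time

  def query_replica(name: str) -> str:
      time.sleep(random.uniform(0.01, 0.5))  # simulated variable latency
      return f"result from {name}"

  def hedged_query(replicas, hedge_delay=0.05):
      with concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
          futures = [pool.submit(query_replica, replicas[0])]
          done, _ = concurrent.futures.wait(futures, timeout=hedge_delay)
          if not done:  # primary is slow; hedge to the remaining replicas
              futures += [pool.submit(query_replica, r) for r in replicas[1:]]
          done, _ = concurrent.futures.wait(
              futures, return_when=concurrent.futures.FIRST_COMPLETED)
          return next(iter(done)).result()

  print(hedged_query(["replica-a", "replica-b"]))
  ```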
Operational Considerations
- How do you monitor a database? What telemetry is important? (e.g. lock contention as a proxy for user-facing errors, logging of per-query memory usage to identify OOM-causing queries)
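  For the per-query memory example, here's a minimal sketch using Python's tracemalloc; a real database would hook its own allocator, but the shape of the telemetry is the same:

  ```python
  # Log each query's peak memory so OOM-causing queries can be identified
  # after the fact. tracemalloc approximates an allocator hook here.
  import logging
  import tracemalloc

  logging.basicConfig(level=logging.INFO)

  def run_query_with_memory_log(query_id: str, query_fn):
      tracemalloc.start()
      try:
          return query_fn()
      finally:
          _, peak = tracemalloc.get_traced_memory()
          tracemalloc.stop()
          logging.info("query=%s peak_bytes=%d", query_id, peak)

  result = run_query_with_memory_log("q1", lambda: [i * i for i in range(100_000)])
  ```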
- How do you monitor data consistency or replication lag?
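  One common trick for the lag half of that question is a heartbeat row: stamp the current time into the primary on a schedule, and read it back from each replica. A minimal sketch, with dicts standing in for the primary and an asynchronously-updated replica:

  ```python
  # Measure replication lag with a heartbeat: a writer stamps the current time
  # into the primary; a replica's lag is (now - the heartbeat it has applied).
  import time

  def write_heartbeat(primary: dict) -> None:
      primary["heartbeat_ts"] = time.time()

  def replication_lag_seconds(replica: dict) -> float:
      ts = replica.get("heartbeat_ts")
      if ts is None:
          return float("inf")  # replica has never applied a heartbeat
      return time.time() - ts

  primary, replica = {}, {}
  write_heartbeat(primary)
  replica.update(primary)  # stand-in for asynchronous replication
  print(f"lag: {replication_lag_seconds(replica):.3f}s")
  ```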
- How do you handle a data replication bug in production? What custom telemetry do you need to debug it? Do you need tooling to, say, quarantine a replica or directly modify a replica's on-disk data?
- How do you release changes to the database? Schema updates, storage engine upgrades, replication topology changes, and so forth.
- Do you have backups? Do you know how you would restore from a backup? Do you know your RPO and RTO (recovery point and recovery time objectives)?
- What testing strategies should you employ to validate a major change? e.g. can you replay production traffic in a test environment? (Do you need to sanitize it, to respect users' privacy?)
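  On the sanitization point, here's a minimal sketch of pseudonymizing captured requests before replay. The field names are hypothetical, and a real pipeline should be driven by a schema/PII inventory rather than a blocklist like this:

  ```python
  # Pseudonymize PII in captured production queries before replaying them in a
  # test environment. Hashing (rather than dropping) keeps join keys and
  # cardinality realistic.
  import copy
  import hashlib

  PII_FIELDS = {"email", "name", "ssn"}  # hypothetical; derive from your schema

  def sanitize(request: dict) -> dict:
      clean = copy.deepcopy(request)
      params = clean.get("params", {})
      for field in PII_FIELDS.intersection(params):
          raw = str(params[field]).encode()
          params[field] = hashlib.sha256(raw).hexdigest()[:12]
      return clean

  captured = {"op": "get_user", "params": {"email": "a@example.com", "id": 7}}
  print(sanitize(captured))  # email replaced by a stable pseudonym
  ```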
1. I do wish the book had also warned readers that even if a system claims to have such-and-such semantics, such claims can be misleading if not blatantly wrong. For example, recent research has found that "MySQL Repeatable Read transactions cannot safely read a value and then write it".
2. There's a brief discussion in Ch. 11, "Stream Processing", about "Keeping Systems in Sync". This felt way too short to me - this is an extremely common data design problem, and there's a lot of complexity here. For example, what do you do if your change-data-capture processor has a bug (which it likely will!) - and how would you even notice it? (One detection idea is sketched below.)
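   On noticing it: one cheap detector is to periodically compare order-independent digests of the source table and the CDC-derived sink. This sketch and its names are my own, not from the book:

   ```python
   # Detect a buggy change-data-capture pipeline by comparing per-table digests
   # of the source and the derived sink. XOR of row hashes is order-independent.
   import hashlib

   def table_digest(rows) -> str:
       acc = 0
       for row in rows:
           h = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
           acc ^= int.from_bytes(h[:8], "big")
       return f"{acc:016x}"

   source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
   sink = [{"id": 1, "v": "a"}]  # the CDC processor dropped a change
   if table_digest(source) != table_digest(sink):
       print("divergence detected; page someone / trigger repair")
   ```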