
Notes on Designing Data-Intensive Applications

2024 Feb 13


I just finished reading Kleppmann's Designing Data-Intensive Applications, which at this point seems to be something of a seminal text, and wanted to jot down some thoughts I had about the book.

Some context on my background: I spent 3.5 years at Google, most of it working on Cloud Firestore and in the guts of Bigtable, Megastore, and Spanner, so I know some things about distributed data systems.

Overall, I found the book to be really well done:

Criticisms

There's always more

Some things that came to mind that the book did not discuss (appropriately so, I think, since Kleppmann had to draw the line somewhere on what to include):

General distributed system patterns

Operational Considerations


  1. I do wish the book also warned readers that even when a system claims to have such-and-such semantics, those claims can be misleading, if not blatantly wrong. For example, recent research has found that "MySQL Repeatable Read transactions cannot safely read a value and then write it".

  2. There's a brief discussion in Ch. 11, "Stream Processing", about "Keeping Systems in Sync". This felt way too short to me: it is an extremely common data design problem, and there's a lot of complexity here. For example, what do you do if your change-data-capture processor has a bug (which it likely will!), and how would you even notice it?
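
     On the "how would you even notice" question, one common answer is a periodic reconciliation job that compares the source of truth against the derived store. Here is a minimal sketch of that idea, not anything from the book. The names `row_digest` and `reconcile` and the sample data are hypothetical, and both stores are modeled as in-memory dicts so the example runs on its own.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """Stable digest of a row: serialize with sorted keys, then hash."""
    payload = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def reconcile(source: dict, derived: dict) -> dict:
    """Compare two key -> row mappings and report where they disagree.

    Hypothetical helper: a real job would page through both stores
    instead of holding them in memory.
    """
    # Keys the CDC pipeline never delivered to the derived store.
    missing = sorted(k for k in source if k not in derived)
    # Keys still in the derived store after the source deleted them.
    orphaned = sorted(k for k in derived if k not in source)
    # Keys present on both sides whose contents differ.
    stale = sorted(
        k for k in source
        if k in derived and row_digest(source[k]) != row_digest(derived[k])
    )
    return {"missing": missing, "stale": stale, "orphaned": orphaned}


# Made-up sample data standing in for a primary database and a search index.
source_of_truth = {
    "u1": {"name": "Ada", "plan": "pro"},
    "u2": {"name": "Grace", "plan": "free"},
    "u3": {"name": "Edsger", "plan": "pro"},
}
derived_index = {
    "u1": {"name": "Ada", "plan": "pro"},
    "u2": {"name": "Grace", "plan": "pro"},     # update dropped or misapplied
    "u4": {"name": "Barbara", "plan": "free"},  # delete never propagated
}

report = reconcile(source_of_truth, derived_index)
print(report)
```

     In a live system the two sides are never read at exactly the same instant, so ordinary replication lag would show up here as false positives. A real job would likely need to recheck flagged keys after a delay before alerting.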