Analyzing Metastable Failures in Distributed Systems

So it goes: your system is purring like a tiger, devouring requests, until, without warning, it slumps into existential dread. Not a crash. Not a bang. A quiet, self-sustaining collapse. The system doesn’t stop. It just refuses to get better. Metastable failure is what happens when the feedback loops in the system go feral. Retries pile up, queues overflow, recovery stalls. Everything runs but nothing improves. The system is busy and useless. In an earlier post, I reviewed the excellent OSDI ’22 paper on metastable failures, which dissected real-world incidents and laid the theoretical groundwork. If you haven’t read that one, start there. This HotOS ’25 paper picks up the thread. It introduces tooling and a simulation framework to help engineers identify potential metastable failure modes before disaster strikes. It’s early-stage work. A short paper. But a promising start. Let’s walk through it.

Introduction

Like most great tragedies, metastable failure doesn't begin with villain...
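The retry feedback loop described above can be seen in a toy model. This is not the paper's simulator; it is a minimal discrete-time queue sketch (all names and parameters are mine) in which a temporary load spike trips a retry storm that sustains overload long after the trigger is gone:

```python
def simulate(arrival, capacity, spike_extra, spike_at, spike_len, max_wait, steps):
    """Toy discrete-time queue with a retry feedback loop.

    arrival/capacity are requests per tick; spike_extra is added load
    during the spike window; max_wait (in ticks of capacity) models the
    client timeout after which requests are retried.
    """
    queue, history = 0, []
    for t in range(steps):
        load = arrival + (spike_extra if spike_at <= t < spike_at + spike_len else 0)
        # Retry storm: once queueing delay exceeds the client timeout,
        # every timed-out request is resent, doubling the offered load.
        if queue > max_wait * capacity:
            load *= 2
        queue = max(0, queue + load - capacity)
        history.append(queue)
    return history
```

With arrival below capacity the queue stays empty, but a ten-tick spike pushes the queue past the timeout threshold; from then on retries keep the effective load above capacity, so the backlog grows even after the spike ends. That is the metastable signature: the trigger is removed, yet the system never recovers.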

Chapter 6: Centralized Recovery (Concurrency Control Book)

With Chapter 6, the Concurrency Control and Recovery in Database Systems book shifts focus from concurrency control to recovery! This chapter addresses how to preserve the atomicity and durability of transactions in the presence of failures, and how to restore the system to a consistent state afterward. The book offers a remarkably complete foundation for transactional recovery, covering undo/redo logging, checkpointing, and crash recovery. While it doesn't use the phrase "write-ahead logging", the basic concepts are there, including log-before-data and dual-pass recovery. When the book was written, the full WAL abstraction of ARIES was still five years away, arriving in 1992 (see my review here). I revisit this discussion/comparison at the end of the post.

System Model and Architecture

In earlier chapters, we had reviewed the architecture: transactions pass through a transaction manager (TM), which sends operations to a scheduler and then to a data manager (DM...
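To make the undo/redo and dual-pass ideas concrete, here is a toy sketch (not the book's algorithms; the log record format and names are mine) of two-pass crash recovery over an in-memory key-value store. Log records carry both before-images (for undo) and after-images (for redo):

```python
def recover(log, db):
    """Restore consistency after a crash: undo losers, redo winners.

    log is an append-only list of records:
      ("write", txn, key, before_image, after_image) or ("commit", txn).
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Undo pass: scan backward, restoring before-images of uncommitted writes.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, txn, key, before, _after = rec
            if before is None:
                db.pop(key, None)   # the key did not exist before this txn
            else:
                db[key] = before
    # Redo pass: scan forward, reapplying after-images of committed writes.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, txn, key, _before, after = rec
            db[key] = after
    return db
```

Given a crash that left an uncommitted transaction's writes in the database, the backward undo pass erases them while the forward redo pass guarantees committed effects survive, which is exactly the atomicity-plus-durability contract the chapter is after.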

Chapter 5: Multiversion Concurrency Control (Concurrency Control Book)

Chapter 5 of Concurrency Control and Recovery in Database Systems (1987) introduces multiversion concurrency control (MVCC), a fundamental advance over single-version techniques. Instead of overwriting data, each write operation creates a new version of the data item. Readers can access older committed versions without blocking concurrent writes or being blocked by concurrent writes. MVCC removes read-write conflicts and increases concurrency significantly. Having multiple versions around gives the scheduler flexibility: if a read arrives "too late" to see the latest write, it can still proceed by accessing an older version. This avoids unnecessary aborts. Writes may still abort due to write-write conflicts, but reads are largely unimpeded. This is especially beneficial in read-heavy workloads. This chapter presents three broad classes of multiversion methods: Multiversion Timestamp Ordering (MVTO), Multiversion Two-Phase Locking (MV2PL), and Multiversion Mixed Methods. For ...
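The core MVCC read rule, pick the newest version committed at or before the reader's timestamp, can be sketched in a few lines. This toy version store (class and method names are mine, not from the book) shows how a "late" read proceeds against an older version instead of aborting:

```python
class MVStore:
    """Toy multiversion store: each write creates a new version; reads
    return the newest version committed at or before the read timestamp."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value)

    def write(self, key, commit_ts, value):
        # Append a new version; keep versions ordered by commit time.
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.versions[key].sort(key=lambda v: v[0])

    def read(self, key, read_ts):
        best = None
        for ts, val in self.versions.get(key, []):
            if ts <= read_ts:
                best = val  # newest version visible to this reader so far
        return best
```

A read at timestamp 15 that arrives after a write committed at 20 still succeeds by returning the version committed at 10. In a single-version scheme that read would have to abort; here it is simply served from history.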

Chapter 4: Non-Locking Schedulers (Concurrency Control Book)

Chapter 4 of the Concurrency Control and Recovery in Database Systems book (1987) opens with a sentence that doesn't quite pass the grammar test: "In this chapter we will examine two scheduling techniques that do not use locks, timestamp ordering (TO) and serialization graph testing (SGT)." That comma is trying to do the job of a colon and failing at it. Precision matters, more so in technical writing. The writing is otherwise clear and careful. And, as is par for the book, it is ahead of its time. The chapter presents a spectrum of non-locking schedulers, starting from Basic TO, expanding into certifiers (which is basically optimistic concurrency control), and ending with modular, composable scheduler designs that separate synchronization concerns cleanly between read-write and write-write synchronization. Let's dig into the details.

Timestamp Ordering (TO)

Timestamp Ordering (TO) uses transaction start timestamps to impose a serial order on conflicting operations...
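Basic TO's rejection rule, refuse any operation that arrives "too late" relative to conflicting timestamps already seen, can be sketched as follows. This is a toy illustration of the idea, not the book's pseudocode; per-item read and write timestamps decide whether each operation is accepted or the transaction must abort:

```python
class BasicTO:
    """Toy Basic Timestamp Ordering scheduler for single-item operations."""

    def __init__(self):
        self.rts = {}  # key -> largest timestamp that has read it
        self.wts = {}  # key -> largest timestamp that has written it

    def read(self, ts, key):
        if ts < self.wts.get(key, 0):
            return False  # a younger txn already wrote: reject (abort txn)
        self.rts[key] = max(self.rts.get(key, 0), ts)
        return True

    def write(self, ts, key):
        if ts < self.rts.get(key, 0) or ts < self.wts.get(key, 0):
            return False  # a younger txn already read or wrote: reject
        self.wts[key] = ts
        return True
```

A rejected operation means the whole transaction restarts with a fresh timestamp; that restart cost is exactly the aggressiveness trade-off the chapter explores against locking.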

Chapter 3: Two Phase Locking (Concurrency Control Book)

Chapter 3 presents two-phase locking (2PL). Remember I told you in Chapter 2: Serializability Theory that the discussion is very scheduler-centric? Well, this is a deeper dive into the scheduler, using 2PL as the concurrency control mechanism. The chapter examines the design trade-offs in scheduler behavior, proves the correctness of basic 2PL, dissects how deadlocks arise and are handled, and discusses many variations and implementation issues. Here are the section headings in Chapter 3.

Aggressive and Conservative Schedulers
Basic Two Phase Locking
Correctness of Basic Two Phase Locking
Deadlocks
Variations of Two Phase Locking
Implementation Issues
The Phantom Problem
Locking Additional Operators
Multigranularity Locking
Distributed Two Phase Locking
Distributed Deadlocks
Locking Performance
Tree Locking

Yep, this is a long chapter: 65 pages.

3.1 Aggressive and Conservative Schedulers

The chapter opens by asking: how aggressive or conservative should a scheduler be? An aggress...
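The defining rule of 2PL, no lock may be acquired after any lock has been released, fits in a small sketch. This is a toy exclusive-lock manager of my own (not the book's), which enforces the two-phase rule and surfaces conflicts to the caller:

```python
class TwoPhaseLocking:
    """Toy exclusive-lock manager enforcing the two-phase rule."""

    def __init__(self):
        self.locks = {}        # key -> txn holding the exclusive lock
        self.shrinking = set() # txns that have released a lock

    def acquire(self, txn, key):
        if txn in self.shrinking:
            raise RuntimeError("2PL violation: growing phase is over")
        holder = self.locks.get(key)
        if holder not in (None, txn):
            return False  # conflict: caller must wait or abort (deadlock risk)
        self.locks[key] = txn
        return True

    def release(self, txn, key):
        if self.locks.get(key) == txn:
            del self.locks[key]
            self.shrinking.add(txn)  # shrinking phase begins for this txn
```

The `return False` branch is where the chapter's deadlock discussion lives: two transactions each holding a lock the other wants will wait forever unless a detector or timeout steps in.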

Chapter 2: Serializability Theory (Concurrency Control Book)

Chapter 2 of Concurrency Control and Recovery in Database Systems (1987) by Bernstein, Hadzilacos, and Goodman is a foundational treatment of serializability theory. It is precise, formal, yet simple and elegant, a rare combination for foundational theory in a systems domain. Databases got lucky here: serializability theory is both powerful and clean. The chapter builds up the theory step by step, introducing:

Histories
Serializable histories
The Serializability Theorem
Recoverability and its variants
Generalized operations beyond reads/writes
View equivalence

Each section motivates definitions clearly, presents tight formalism, and illustrates ideas with well-chosen examples.

2.1 Histories

This section lays the groundwork. It starts slow, and doesn't do anything fancy. It first defines what it means for the operations within a transaction to form a well-founded partial order. This intra-transaction ordering extends naturally to inter-transaction operations, forming a superset rel...
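The Serializability Theorem, a history is serializable iff its serialization graph is acyclic, can be illustrated with a toy checker (my own sketch, not the book's formalism): build conflict edges between transactions, then test the graph for cycles:

```python
def serialization_graph(history):
    """Build conflict edges Ti -> Tj from an executed history.

    history: list of (txn, op, key) in execution order, op in {"r", "w"}.
    Two operations conflict if they touch the same key, come from
    different transactions, and at least one is a write.
    """
    edges = set()
    for i, (ti, oi, ki) in enumerate(history):
        for tj, oj, kj in history[i + 1:]:
            if ki == kj and ti != tj and (oi == "w" or oj == "w"):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in the serialization graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)

    def visit(node, path, done):
        if node in path:
            return True
        if node in done:
            return False
        path.add(node)
        cyclic = any(visit(n, path, done) for n in graph.get(node, []))
        path.discard(node)
        done.add(node)
        return cyclic

    return any(visit(n, set(), set()) for n in list(graph))
```

A history where T1 and T2 each read what the other later overwrites produces edges in both directions, hence a cycle, hence no equivalent serial order, which is precisely what the theorem forbids.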

Modular verification of MongoDB Transactions using TLA+

Joint work with Will Schultz. A transaction groups multiple operations into an all-or-nothing logical box to reduce the surface area exposed to concurrency control and fault recovery, simplifying the application programmer's job. Transactions support ACID guarantees: atomicity, consistency, isolation, durability. Popular isolation levels include Read Committed (RC), Snapshot Isolation (SI), and Serializability (SER), which offer increasing protection against concurrency anomalies.

MongoDB Transactions

MongoDB’s transaction model has evolved incrementally.

v3.2 (2015): Introduced single-document transactions using MVCC in the WiredTiger storage engine.
v4.0 (2018): Extended support to multi-document transactions within a replica set (aka shard).
v4.2 (2019): Enabled fully distributed transactions across shards.

Replica Set Transactions. All transaction operations are first performed on the primary using the WiredTiger transaction workflow/algorithm. Before co...

Concurrency Control and Recovery in Database Systems: Preface and Chapter 1

I'm catching up on Phil Eaton's book club and just finished the preface and Chapter 1 of Concurrency Control and Recovery in Database Systems by Bernstein, Hadzilacos, and Goodman. This book came out in 1987: two years before the fall of the Berlin Wall, five years before Windows 3.1, and before the Gray/Reuter book on Transaction Processing. I was at first surprised that "Recovery" featured so prominently in the book title, but then realized that in 1987 there was no solid solution for recovery. The ARIES paper on the write-ahead log came out in 1992. Once I realized that, the structure made sense: concurrency control (Chapters 1–5), recovery (Chapters 6–7), and a forward-looking chapter on replication (Chapter 8), where they candidly admit: "No database systems that we know of support general purpose access to replicated distributed data."

The Problem
Serializability Theory
Two Phase Locking
Non-Locking Schedulers
Multiversion Concurrency Control
Centrali...

Notes from the TLA+ Community Event

I attended the TLA+ Community Event in Hamilton, Ontario on Sunday. Several talks pushed the boundaries of formal methods in the real world by incorporating testing, conformance, model translation, and performance estimation. The common thread was that TLA+ isn't just for specs anymore. It's being integrated into tooling: fuzzers, trace validators, and compilers. The community is building bridges from models to reality, and it's a good time to be in the loop. Below is a summary of selected talks, followed by some miscellaneous observations. This is just a teaser; the recordings will be posted soon on the TLA+ webpage.

Model-Based Fuzzing for Distributed Systems — Srinidhi Nagendra

Traditional fuzzing relies on random inputs and coverage-guided mutation, and works well for single-process software. But it fails for distributed systems due to too many concurrent processes, too many interleavings, and no clear notion of global coverage. Srinidhi's work brings model-b...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Learning about distributed systems: where to start?

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

What I'd do as a College Freshman in 2025

Distributed Transactions at Scale in Amazon DynamoDB