Analyzing Metastable Failures in Distributed Systems

So it goes: your system is purring like a tiger, devouring requests, until, without warning, it slumps into existential dread. Not a crash. Not a bang. A quiet, self-sustaining collapse. The system doesn’t stop. It just refuses to get better.

Metastable failure is what happens when the feedback loops in the system go feral. Retries pile up, queues overflow, recovery stalls. Everything runs but nothing improves. The system is busy and useless.

In an earlier post, I reviewed the excellent OSDI ’22 paper on metastable failures, which dissected real-world incidents and laid the theoretical groundwork. If you haven’t read that one, start there.

This HotOS ’25 paper picks up the thread. It introduces tooling and a simulation framework to help engineers identify potential metastable failure modes before disaster strikes. It’s early-stage work. A short paper. But a promising start. Let’s walk through it.

Introduction

Like most great tragedies, metastable failure doesn't begin with villain...