> Disks and flash devices exhibit a subtle and complex failure model: a few blocks of data could become inaccessible or be silently corrupted [8, 9, 32, 59]. Although such storage faults are rare compared to whole-machine failures, in large-scale distributed systems, even rare failures become prevalent [60, 62]. Thus, it is critical to reliably detect and recover from storage faults.
It's not true in general that a node in a distributed system binds its persisted state to a local disk, or flash device, or any specific implementation of any specific kind of storage system. The storage layer is the responsibility of the node to manage, and irrelevant to the wider distributed system in which the node is a participant. Any of those kinds of storage faults need to be accommodated by the node that utilizes those storage layers, but their specific details don't need to be communicated beyond the specific node where they apply. And it's not at all critical for those nodes to detect and recover from any faults in their storage layer; those faults can easily be communicated to the broader distributed system, which necessarily must be able to handle node-specific failures like those without breaking everything down!
> how to think about local recovery actions in the context of the global consensus protocol!
Local recovery actions, or any other kinds of node-specific details, have no relevance or influence on the information communicated thru the the global consensus protocol. A node can persist its state to a local disk, or to RAM, or to an S3 object, or anything else, and none of these details matter at all to the details that node communicates to other nodes in its cluster.
tl;dr: nodes don't necessarily persist state to local disk
I’m not disagreeing that diskless crash recovery protocols exist.
In fact, an early version of TigerBeetle implemented one of these from Cowling and Liskov’s VSR’12 paper.
However, since then, we invested in TigerBeetle’s stable storage, for reasons which I won’t go into further here.
If you are curious to learn more about all the stable storage techniques in TigerBeetle in particular (again, this is not trying to suggest that VSR can’t also run without stable storage, or to deny other techniques such as object storage or tiering!), but if these things are interesting for you, and if you do want to learn more, then I’d recommend diving into “Durability and the Art of Consensus”.
I've read that paper, and all of the papers that TigerBeetle mentions in any and all of its docs/blog posts/etc. Stable storage doesn't matter to the points I'm trying to shine a light on here. It's not "diskless" that I'm talking about, in fact I'm not talking about "crash recovery" at all -- !
To be clear, “Durability and the Art of Consensus” is not a paper.
> tl;dr: nodes don't necessarily persist state to local disk
In the consensus literature, this is sometimes referred to as “diskless crash recovery”. For example, see work by Dan Ports.
> Local recovery actions, or any other kinds of node-specific details, have no relevance or influence on the information communicated thru the the global consensus protocol.
This goes directly against the central finding of PAR, which gives counter-examples where your statement does not universally hold true.
Again, it’s not intuitive. And that’s why it won FAST18—because it says “everything we know is wrong”. Complete red pill and mindbender.
PAR isn't any kind of panacea or golden rule for nodes in a distributed system, it describes properties of nodes that meet very narrowly-defined requirements, which are in no way universal, and which are in no way requirements for those nodes to participate in the distributed system.
More broadly, there's no concept of "crash recovery" at the system level, which has any meaningful utility. Nodes are either there or they're not there, exactly how they crash or recover from their crashes are irrelevant to the overall distributed system, insofar as if a crashed-and-recovered node comes back online, it's gonna need to re-sync with its peers before it can talk to anyone else, and that's not anything to do with "crash recovery" related to local disk or anything like that (which is all implementation details of the node itself) -- it's just normal node synchronization, orthogonal to any state storage stuff of the node.
PAR is something that your system can maybe implement, it's not any kind of rule or definition or requirement that all systems of some classification must satisfy..!
If I might add on to what you and Joran are both saying, after some time working with TigerBeetle, I found it useful to think of Protocol-Aware Recovery as similar to TAPIR (https://syslab.cs.washington.edu/papers/tapir-tr14.pdf). Normally we build distributed systems on top of clean abstraction layers, like "Nodes are pure state machines that do not corrupt or forget state", or "the transaction protocol assumes each key is backed by a sequentially-consistent system like a Paxos state machine". TAPIR and PAR show a path for building a more efficient, or more capable, system, by breaking the boundaries between those layers and coupling them together.
> Disks and flash devices exhibit a subtle and complex failure model: a few blocks of data could become inaccessible or be silently corrupted [8, 9, 32, 59]. Although such storage faults are rare compared to whole-machine failures, in large-scale distributed systems, even rare failures become prevalent [60, 62]. Thus, it is critical to reliably detect and recover from storage faults.
It's not true in general that a node in a distributed system binds its persisted state to a local disk, or flash device, or any specific implementation of any specific kind of storage system. The storage layer is the responsibility of the node to manage, and irrelevant to the wider distributed system in which the node is a participant. Any of those kinds of storage faults need to be accommodated by the node that utilizes those storage layers, but their specific details don't need to be communicated beyond the specific node where they apply. And it's not at all critical for those nodes to detect and recover from any faults in their storage layer; those faults can easily be communicated to the broader distributed system, which necessarily must be able to handle node-specific failures like those without breaking everything down!
> how to think about local recovery actions in the context of the global consensus protocol!
Local recovery actions, or any other kinds of node-specific details, have no relevance or influence on the information communicated thru the the global consensus protocol. A node can persist its state to a local disk, or to RAM, or to an S3 object, or anything else, and none of these details matter at all to the details that node communicates to other nodes in its cluster.
tl;dr: nodes don't necessarily persist state to local disk