
It takes 5-10 years to write, test and productionize a database or filesystem. If you're planning to invest that kind of time and effort, some suggestions are:

* if you're writing a distributed database, a novel and valuable feature would be to consider the network partition case foremost. For example, design the database from the standpoint of a node being down for a month.

* do adequate logging so that an operator can understand what is going right or wrong

* how can terminated nodes be rebuilt efficiently and automatically?

* all configuration settings should be dynamic

Source: experienced DBA; worked with Cassandra, InfluxDB and most SQL RDBMSs
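To illustrate the last point about dynamic configuration, here's a minimal sketch (the `DynamicConfig` class and the setting name are hypothetical, not from any particular database): the key idea is that workers read settings on every use instead of caching them at startup, so an operator's change takes effect without a restart.

```python
import threading

class DynamicConfig:
    """Hypothetical runtime-mutable settings holder (illustrative only)."""
    def __init__(self, initial):
        self._lock = threading.Lock()
        self._values = dict(initial)

    def get(self, key):
        # Callers read on every use rather than caching at startup,
        # so an operator's change takes effect immediately.
        with self._lock:
            return self._values[key]

    def update(self, key, value):
        # Invoked by e.g. an admin endpoint or a SIGHUP handler.
        with self._lock:
            self._values[key] = value

cfg = DynamicConfig({"compaction_throughput_mb": 16})
cfg.update("compaction_throughput_mb", 64)  # takes effect with no restart
```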



> if you're writing a distributed database

Is this really a requirement for all distributed databases, or only decentralized ones? If I'm using a distributed database and I control all the nodes, and one of them goes down, I don't plan on spinning it back up (assuming the data is replicated across other nodes). I guess what I'm saying is: if I need 20 nodes, I'm going to make sure I always have 20 nodes running.


Yes, this applies to any distributed system.

It's a basic truth about computer networks that if you send a message to a machine and don't hear back, you don't know whether it's actually dead or just unable to communicate with you. If a machine finds itself in that situation, one option is to simply wait until the peer eventually comes back to life, or retry enough times to eventually get through.
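The ambiguity can be sketched with a toy probe (a plain TCP connect here, purely illustrative): from the caller's side, a timeout or a connection error simply cannot distinguish "crashed" from "partitioned".

```python
import socket

def probe(host, port, timeout=2.0):
    """Try to reach a peer. Any failure is ambiguous from this side."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        # Crashed? Partitioned? Just slow? Indistinguishable from here.
        return "unknown"
    except OSError:
        # Even "connection refused" only means no process is listening
        # *right now*; the node may be rebooting and about to return.
        return "unknown"
```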

If it takes any other action, it has to make sure that action is "safe" (w.r.t whatever guarantees it's trying to provide) under either scenario. That's all "partition tolerance" really is: a statement that you, as the developer, have a sort of burden of proof to make sure you've considered all the possible failure scenarios. (If that seems like a tautology, well, that's why we usually don't bother talking about the "P" in the CAP theorem.)

As your comment alludes to, one can often simplify the problem a bit by reducing the number of possible scenarios, by assuming that crashed nodes never recover, they're only replaced by new nodes. But that still doesn't make the problem go away. Unless your network (including both the hardware and the OS network stack!) is infallible, you can't reliably know whether a remote machine has crashed in the first place.

Consider the common situation where you want to provide some kind of transactional guarantees; for instance, transactions should appear to complete in causal order, and their effects should not disappear once committed. That implies that even if a node "looks" dead, it's not safe for another one to take over its role unless it's really, truly dead, or you would risk returning stale (transactionally invalid) results.
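One standard mitigation for the "looks dead but isn't" leader is a fencing token: a monotonically increasing epoch attached to every write, so a stale leader that resurfaces gets rejected. A minimal sketch, with illustrative names not tied to any particular database:

```python
class Storage:
    """Accepts writes only from the highest epoch seen so far."""
    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            # A stale leader that merely *looked* dead is fenced off.
            raise PermissionError(f"stale epoch {epoch}")
        self.highest_epoch = epoch
        self.data[key] = value

store = Storage()
store.write(1, "k", "old-leader")   # epoch-1 leader writes
store.write(2, "k", "new-leader")   # failover: epoch-2 leader takes over
try:
    store.write(1, "k", "zombie")   # old leader resurfaces and is rejected
except PermissionError:
    pass
```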


> design the database from the standpoint of a node being down for a month.

It uses Raft, so this should be handled naturally. If you are referring to writes to said node (not that Raft would allow it), you are delving into CAP theorem territory, and what you are suggesting is impossible (unless you don't care about consistency).


Raft is a consensus protocol. It doesn't tell you what to actually do during long partitions (stop writes if quorum is lost? indicate that consistency may be compromised in some other way?), or how to handle long-lost nodes rejoining (do they rejoin no matter how long they've been gone and serve stale reads? does the cluster "forget" them after some period of time? if so, how do you synchronize the act of forgetting so you don't end up in split brain when only some healthy nodes have forgotten their long-lost brother by the time it comes back online?)
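To make that concrete, here's a toy sketch of one such policy decision layered on top of (not inside) a consensus library: refusing writes unless a strict majority of nodes has been heard from within a lease window. The function and its parameters are illustrative assumptions, not any real Raft library's API.

```python
import time

def have_quorum(last_heard, cluster_size, now, lease=10.0):
    """Policy built on top of consensus: count nodes heard from within
    the lease window and require a strict majority before accepting writes."""
    alive = sum(1 for t in last_heard.values() if now - t < lease)
    return alive > cluster_size // 2

# 5-node cluster; this node heard from peers a and b recently, c and d long ago.
now = time.time()
heard = {"self": now, "a": now - 1, "b": now - 2, "c": now - 60, "d": now - 60}
writable = have_quorum(heard, 5, now)  # 3 of 5 alive: writes allowed
```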

Raft is not "use this library and your distributed system Just Works". It's a low-level building block, like TLS, and nobody says "just use OpenSSL and your app is fully secure".



