Designing an Atomic Block File Format

geocar · on Feb 24, 2019

4-byte writes are not guaranteed to be atomic. Don't do this.

fsync() and error checking is also required at key places. Seriously, don't do this.

The correct strategy is a dancing tree pivoted by a single byte: Single-byte writes are guaranteed to be atomic.

Consider something like this:

    struct disk {
      unsigned char d;
      struct toc a[2];
      char data[0];
    };

The correct toc is disk.a[disk.d]

Writing involves:

1. Write data. any errors? abort: close fd, reopen and retry.

2. Update disk.a[disk.d^1]

3. Write disk.a to offset 1 of the disk. Any errors? abort: close fd, reopen and retry.

4. Fsync. error? abort: close fd, reopen and retry. NB use F_FULLFSYNC on OSX.

5. write the byte disk.d^1 at offset 0 of the disk. fsync(), but ignore the result. NB use F_FULLFSYNC on OSX.

6. Open the file with a second fd (holding the first one open). Read 1+sizeof(struct toc)*2 bytes and confirm they contain the same contents. If they don't, close both fd, and restart.

7. Close the second fd (but be careful if multithreaded†)

†: http://geocar.sdf1.org/close.html

pjc50 · on Feb 24, 2019

Hmm. While a 4-byte write may be split over disk blocks, is there any situation where you can actually get a sub-block short write on a disk or SSD device?

(I implemented a variant on this block system for a Windows CE device, but instead of having a TOC it had a rolling sequence of numbered files. Periodically it would "roll" undiscarded records from the oldest file onto the end and delete it. There were a couple of surprises including the discovery that "delete A; write to B" could result in A being resurrected after a crash

The underlying file system was Windows "TFAT", which uses a similar system to the one you describe to alternate metadata writes between two FAT structures)

geocar · on Feb 25, 2019

> is there any situation where you can actually get a sub-block short write on a disk or SSD device?

Notwithstanding an inopportune time for power loss, or certain other kinds of "hardware" failure, I'm not aware of any. And speaking O_DIRECT to the block device (as long as it's a real one, and not some network-mapped device, or loop) can indeed work well with a simpler protocol like the one you describe.

That being said, I've had a lot of digs into file-based databases over the years, and I've never built a lot of confidence that those guys have any idea what they're doing: O_DIRECT or not. Usually it is application level, like an enterprise user trying to mix some performance suggestions with the database vendor and a storage vendor's own set of layers. The recent Postgres fiasco rather confirms that thinking, and suggests it doesn't even take something that complex going on to go wrong.

However, more to the point: users of "XBlockFile" do not have any reason to believe the four-byte write is always going to be within the same sub-block, and there's absolutely nothing that can be done at the file-level to get that guarantee on UNIX (or to my knowledge: on Windows).

amelius · on Feb 24, 2019

How do you know the OS/hardware will perform the writes in the order that your program performs them?

boomlinde · on Feb 24, 2019

Without disabling any (potentially and probably out-of-order) disk cache or using some kind of barrier, this will just not provide any guarantees in exceptional error conditions like power failure, crashes etc.

Maybe it's better understood as being "atomic" only on the stream level, in the sense that you can kill the userland process at any time and always end with a working file. Even then, as some other user already mentioned, something like fwrite of four bytes for the TOC offset isn't atomic.

pjc50 · on Feb 24, 2019

The normal solution is to sync between each write, although (Postgres passim) this may not behave as you expect either.

sametmax · on Feb 24, 2019

What's the point of doing this instead of just a regular directory with a metadatafile and zipping the whole thing ? That's how most modern document softwares do it, including MS office and libre office.

eesmith · on Feb 24, 2019

Your "just" description doesn't support atomic in-place updates.

I don't know what MS Office/LibreOffice do. Do they create a new file then rename? See https://stackoverflow.com/questions/20634684/what-is-sublime... for disadvantages to that.

The zip file format allows in-place appends and deletes. This is non-atomic as it depends on a central directory located at the end of the file. An incomplete write of the directory can be recovered by looking through the file, but as the Wikipedia page notes:

> tools that attempt to recover data from damaged ZIP archives will most likely scan the archive for local file header signatures; this is made more difficult by the fact that the compressed size of a file chunk may be stored after the file chunk, making sequential processing difficult.

The file format described at this link assumes that modifying the 4 byte offset to the correct ToC is always atomic. Even if not, it's a much smaller window.

marci · on Feb 24, 2019

> The idea was to create a format not unlike an entire web site packed into a single file, containing text, images, code, and what connected them into a whole.

If it were a web site, it would contain every singe page of that web site and would be able to edit a single page/image/audio/js/css without the need to rewrite the entire file each time.

>Instead of changing data, we always write a new copy.

And it's a Copy-on-Write file format.