What you describe is very similar to how Icechunk[1] works. It works beautifully for transactional writes to "repos" containing PBs of scientific array data in object storage.
The generalized form of this range-request-based streaming approach looks something like my project VirtualiZarr [0].
Many of these scientific file formats (HDF5, netCDF, TIFF/COG, FITS, GRIB, JPEG and more) are essentially just contiguous multidimensional array(/"tensor") chunks embedded alongside metadata about what's in the chunks. Efficiently fetching these from object storage is just about efficiently fetching the metadata up front so you know where the chunks you want are [1].
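To make the "fetch the metadata up front" point concrete, here is a minimal sketch in Python. It is not how any of the real formats lay out their bytes: the toy container layout (an 8-byte length, a JSON index, then the raw chunks) and the names `build_blob`, `range_get`, `read_index` and `read_chunk` are all made up for illustration, and `range_get` slices an in-memory `bytes` object as a stand-in for an HTTP range request against object storage.

```python
import json
import struct


def build_blob(chunks):
    """Pack chunks into one toy 'file': 8-byte header length, JSON index, raw chunk bytes."""
    index = {}
    offset = 0
    for key, data in chunks.items():
        index[key] = {"offset": offset, "length": len(data)}
        offset += len(data)
    header = json.dumps(index).encode()
    return struct.pack("<Q", len(header)) + header + b"".join(chunks.values())


def range_get(blob, start, length):
    """Stand-in for a GET with 'Range: bytes=start-(start+length-1)' (assumption: no real network)."""
    return blob[start:start + length]


def read_index(blob):
    """Two small reads up front: the header length, then the index itself."""
    (header_len,) = struct.unpack("<Q", range_get(blob, 0, 8))
    index = json.loads(range_get(blob, 8, header_len))
    data_start = 8 + header_len
    return index, data_start


def read_chunk(blob, index, data_start, key):
    """One range read per chunk, using the byte location recorded in the index."""
    entry = index[key]
    return range_get(blob, data_start + entry["offset"], entry["length"])


chunks = {"0.0": b"aaaa", "0.1": b"bbbbbb", "1.0": b"cc"}
blob = build_blob(chunks)
index, data_start = read_index(blob)
chunk = read_chunk(blob, index, data_start, "0.1")
```

Once the index has been read, any chunk costs exactly one range read, and the rest of the file is never touched.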
The data model of Zarr [2] generalizes this pattern pretty well, so that when backed by Icechunk [3], you can store a "datacube" of "virtual chunk references" that point at chunks anywhere inside the original files on S3.
This allows you to stream data out as fast as the S3 network connection allows [4], and then you're free to pull that directly, or build tile servers on top of it [5].
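A rough sketch of the "virtual chunk reference" idea, in plain Python. This is not the Icechunk or VirtualiZarr API: `ChunkRef`, `manifest`, `fetch` and the `s3://bucket/...` paths are hypothetical names, and a dict of `bytes` stands in for S3. The point is only that the manifest lives outside the original files and records where each chunk of the logical array sits inside them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical reference to a byte range inside an existing file."""
    path: str
    offset: int
    length: int


# Fake object store (assumption): path -> file bytes, each with unrelated header/trailer bytes.
store = {
    "s3://bucket/day1.nc": b"HDR1" + bytes([1, 2, 3, 4]) + b"TAIL",
    "s3://bucket/day2.nc": b"HEADER22" + bytes([5, 6, 7, 8]),
}

# Manifest for a 1-D "datacube" made of two chunks, one per original file.
manifest = {
    (0,): ChunkRef("s3://bucket/day1.nc", offset=4, length=4),
    (1,): ChunkRef("s3://bucket/day2.nc", offset=8, length=4),
}


def fetch(ref):
    """Stand-in for a ranged GET on the referenced object."""
    return store[ref.path][ref.offset:ref.offset + ref.length]


def read_virtual_array(manifest):
    """Concatenate the chunks in chunk-grid order; the original files are never rewritten."""
    return b"".join(fetch(manifest[key]) for key in sorted(manifest))


data = list(read_virtual_array(manifest))
```

The original files stay exactly where they are; only the small manifest is new.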
In the Pangeo project and at Earthmover we do all this for weather and climate science data. But the underlying OSS stack is domain-agnostic, so it works for all sorts of multidimensional array data, and VirtualiZarr has a plugin system for parsing different scientific file formats.
I would love to see if someone could create a virtual Zarr store pointing at this WSI data!
IMO Zarr is that newer format. It abstracts over the features of all these other formats so neatly that it can literally subsume them.
I feel that we no longer really need TIFF etc. - for scientific use cases in the cloud Zarr is all that's needed going forwards. The other file formats become just archival blobs that either are converted to Zarr or pointed at by virtual Zarr stores.
Sounds like an approach that would also work for ML model weights files — just another kind of multidimensional array with metadata.
I wonder what exactly the big multi-model AI companies are doing to optimize model cold-start latency, and how much it just looks like Zarr on top of on-prem object storage.
People have literally used Zarr for this - at one point Gemini used Zarr for checkpointing model weights. Not sure what the current fashion in that space is though.
It's definitely one of many fields that see convergent evolution towards something that just looks like Zarr. In fact you can use VirtualiZarr to parse HuggingFace's "SafeTensors" format [0].
> Many of these scientific file formats (HDF5, netCDF, TIFF/COG, FITS, GRIB, JPEG and more) are essentially just contiguous multidimensional array(/"tensor") chunks
Yeah, a recurring thought is that these should condense into Apache Arrow queried by DuckDB but there must be some reason for this not to have already happened.
God this article is 10000% better than the posted one. This is great:
> Names should not describe what you currently think the thing you’re naming is for. Imagine naming your newborn child "Doctor", or "SupportsMeInMyOldAge". Poor kid.
The pitch for this sounds very similar to the pitch for Vortex (i.e. obviating the need to create a new format every time a shift occurs in data processing and computing by providing a data organization structure and a general-purpose API to allow developers to add new encoding schemes easily).
But I'm not totally clear what the relationship between F3 and Vortex is. It says their prototype uses the encoding implementation in Vortex, but does not use the Vortex type system?
The backstory is complicated. The plan was to establish a consortium between CMU, Tsinghua, Meta, CWI, VoltronData, Nvidia, and SpiralDB to unify behind a single file format. But CMU's lawyers freaked out over the NDA that Meta required for access to a preview of Velox Nimble. IANAL, but Meta's NDA seemed reasonable to me. So the plan fell through after about a year, and then everyone released their own format:
On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea came as a suggestion from Hannes@DuckDB to Wes McKinney (a co-author with us). We just used Vortex's implementations since they were in Rust and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.
I note that the Germans also released their own file format this year that also uses WASM. But they WASM-ify the entire file and not individual column groups:
Andrew, it’s always great to read the background from the author on how (and even why!) this all played out. This comment is incredibly helpful for understanding the context of why all these multiple formats were born.
If I could ask you to speculate for a second, how do you think we will go from here to a clear successor to Parquet?
Will one of the new formats absorb the others' features? Will there be a format war a la iceberg vs delta lake vs hudi? Will there be a new consortium now that everyone's formats are out in the wild?
... Are you saying that there are 5 competing "universal" file format projects? Each with different non-compatible approaches? Is this a laughing/crying thing, or a "lots of interesting paths to explore" thing?
Also, back on topic - is your file format encryptable via that WASM embedding?
I would love to bring these benefits to the multidimensional array world, via integration with the Zarr/Icechunk formats somehow (which I work on). But this fragmentation of formats makes it very hard to know where to start.
Presumably because everyone in MCF has been waiting for ITER for decades, and JET is being decommissioned after a last gasp. Every other tokamak is considerably smaller (or similar size like DIII-D or JT-60SA).
Many of the interesting tokamak engineering ideas were on small (so low-power) machines or just concepts using high-temperature superconducting magnets.
The really depressing part is if you plot rate of new delays against real time elapsed, the projected finishing date is even further away.
This is why much of the fusion research community feel disillusioned with ITER, and so are more interested in these smaller (and supposedly more "agile") machines with high-temperature superconductors instead.
I wrote the article I wish I could have read back when I first heard of Zarr and cloud-native science back in 2018.
This explains how object storage and conventional filesystems are different, and the key properties that make Zarr work so well in cloud object storage.
Yes, that assumption is called the Ergodic Hypothesis, and generally justified in undergraduate statistical mechanics courses by proving and appealing to Liouville's theorem.
It's worth noting that there's more than just ergodicity at play, although that's a fundamental requirement. For instance, applying the Pauli Exclusion Principle gives rise to Fermi-Dirac statistics.
Isn't that more about enumerating the microstates? The Pauli exclusion principle just ends up forbidding some of the microstates (forbidding a significant fraction of them if you're in the low-temperature regime).
It is about enumerating the microstates, but in a way that takes into account how the particles interact with each other (aka making assumptions about the dynamics).
If we didn't take into account any interactions, we'd be unable to do anything with statistical mechanics beyond rederiving the ideal gas law.
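A toy illustration of the counting point above, as a Python sketch: enumerate the microstates of a few identical particles spread over a few single-particle levels, with and without the exclusion rule. The function names are made up for this example, and energies and temperature are ignored entirely, so this shows only the counting step, not Fermi-Dirac statistics themselves.

```python
from itertools import combinations, combinations_with_replacement
from math import comb


def boson_microstates(n_particles, n_levels):
    """Identical particles, any occupancy allowed: multisets of level indices."""
    return list(combinations_with_replacement(range(n_levels), n_particles))


def fermion_microstates(n_particles, n_levels):
    """Identical particles, at most one per level (Pauli exclusion): sets of level indices."""
    return list(combinations(range(n_levels), n_particles))


bosons = boson_microstates(2, 3)
fermions = fermion_microstates(2, 3)

# Exclusion only removes states: every fermion microstate is also a boson microstate.
forbidden = [state for state in bosons if state not in fermions]
```

For 2 particles in 3 levels, the exclusion rule forbids exactly the doubly occupied states `(0, 0)`, `(1, 1)` and `(2, 2)`, leaving 3 of the 6 microstates.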
The scientific community works primarily with array (or "tensor") data, using tools like numpy, xarray, and zarr. People familiar with modern relational database tools such as DuckDB and Parquet often ask: why can't we just use those? This article explains why: it's massively inefficient to use tabular tools on array data. It demonstrates this with a benchmark showing a 10x difference in query speed.
This entire stack also now exists for arrays as well as for tabular data. It's still S3 for storage, but Zarr instead of Parquet, Icechunk instead of Iceberg, and Xarray for queries in Python.
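A toy sketch of where the inefficiency comes from, in plain Python. This is not the article's benchmark and it measures nothing: the names `array_row` and `table_row` are invented for illustration. It just contrasts the same 2-D data stored as a flat array, where position implies the coordinates, with a table that has one row per element and explicit coordinate columns.

```python
n_rows, n_cols = 4, 5

# Array layout: one value per element, coordinates implied by position (row-major).
values = [float(i * n_cols + j) for i in range(n_rows) for j in range(n_cols)]

# Tabular layout: one (i, j, value) record per element, coordinates stored explicitly.
table = [(i, j, values[i * n_cols + j]) for i in range(n_rows) for j in range(n_cols)]


def array_row(flat, row, n_cols):
    """Selecting a row is index arithmetic plus one contiguous slice."""
    start = row * n_cols
    return flat[start:start + n_cols]


def table_row(records, row):
    """Without an index, selecting a row means filtering every record on its coordinate column."""
    return [value for (i, _, value) in records if i == row]


from_array = array_row(values, 2, n_cols)
from_table = table_row(table, 2)

numbers_stored_array = len(values)                       # 20 values
numbers_stored_table = sum(len(rec) for rec in table)    # 60 numbers: 2 coordinates + 1 value each
```

Both layouts return the same answer, but the table stores three numbers per element and has to test every record to find one row, while the array reads one contiguous slice.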
[1]: https://icechunk.io/en/latest/