I am surprised there is no mention of HDF5 (via h5py). It's what I used to handle simulation data that was getting too big to fit in RAM.
Use case: CPU is running a simulation and constantly spitting out data, and eventually RAM is not enough to hold it all. Solution: every N simulation steps, write the data held in RAM to disk, then continue simulating. N must be chosen judiciously to balance the time cost of writing to disk (you don't want to do it too often).
I figure this is what is referred to as "chunking" in the article? Why not list some packages that can help one chunk?
Overall opinions on this method? Could it be done better?
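For reference, here is a minimal sketch of that flush-every-N-steps pattern with h5py, using a resizable dataset that gets extended each time the buffer fills. The step() function, the record width, and the flush interval are placeholders for whatever the real simulation does:

```python
import numpy as np
import h5py

N_FLUSH = 1000        # flush every N steps (tune against disk write cost)
STATE_DIM = 3         # width of each simulation record (illustrative)
TOTAL_STEPS = 10_000

def step(state):
    # stand-in for the real simulation update
    return state + np.random.normal(size=state.shape)

with h5py.File("simulation.h5", "w") as f:
    # resizable dataset: grows along axis 0 as the run progresses
    dset = f.create_dataset(
        "trajectory",
        shape=(0, STATE_DIM),
        maxshape=(None, STATE_DIM),
        chunks=(N_FLUSH, STATE_DIM),  # on-disk chunk matches the flush size
        dtype="f8",
    )

    state = np.zeros(STATE_DIM)
    buffer = []
    for i in range(TOTAL_STEPS):
        state = step(state)
        buffer.append(state.copy())

        if len(buffer) == N_FLUSH:
            block = np.asarray(buffer)
            start = dset.shape[0]
            dset.resize(start + len(block), axis=0)  # extend the dataset
            dset[start:] = block                     # write the buffered block
            f.flush()                                # push data out to disk
            buffer.clear()

    # write any remaining partial block at the end of the run
    if buffer:
        block = np.asarray(buffer)
        start = dset.shape[0]
        dset.resize(start + len(block), axis=0)
        dset[start:] = block
```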
This is exactly what HDF5 was built for. Figuring out how often to persist to disk is going to depend on a number of factors. If you want to get fancy you can dedicate a thread to I/O but that gets hairy quickly. You also might want to look into fast compression filters like blosc/lzf as a way of spending down some of your surplus CPU budget.
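For anyone curious, enabling those filters in h5py is just a keyword on create_dataset (compression only applies to chunked datasets). The chunk shape and data here are only illustrative, and blosc needs the third-party hdf5plugin package:

```python
import numpy as np
import h5py

data = np.random.normal(size=(100_000, 64))

with h5py.File("compressed.h5", "w") as f:
    # lzf ships with h5py: very fast, modest compression ratio
    f.create_dataset("fast", data=data, chunks=(4096, 64), compression="lzf")

    # gzip trades more CPU time for a better ratio (levels 0-9)
    f.create_dataset("small", data=data, chunks=(4096, 64),
                     compression="gzip", compression_opts=4)

    # blosc filters come from the third-party hdf5plugin package, e.g.
    #   import hdf5plugin
    #   f.create_dataset("blosc", data=data, chunks=(4096, 64),
    #                    **hdf5plugin.Blosc(cname="lz4", clevel=5))
```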
That depends: we use HDF5 to store data from huge simulations (~1 TB per snapshot), which is obviously too large to analyse on a single machine.
The benefit of HDF5 is that it makes slicing data very easy, so 'chunking', where I load e.g. 10% of one dataset at a time, is very simple.
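Something like this, assuming the 'trajectory' dataset from the sketch above; h5py only reads the chunks that intersect the requested slice, so a partial read never pulls the whole dataset into RAM:

```python
import h5py

with h5py.File("simulation.h5", "r") as f:
    dset = f["trajectory"]        # dataset written by the run above
    n = dset.shape[0]

    # read only the first 10% of the rows
    first_tenth = dset[: n // 10]

    # or stride through the whole dataset one slab at a time
    slab = n // 10 or 1
    for start in range(0, n, slab):
        block = dset[start : start + slab]
        print(start, block.mean())  # placeholder for the real analysis
```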