I am surprised there is no mention of HDF5 (via h5py). It's what I used to handle simulation data that was getting too big to fit in RAM.
Use case: CPU is running a simulation and constantly spitting out data, and eventually RAM is not enough to hold it all. Solution: every N simulation steps, write the data held in RAM to disk, then continue simulating. N must be chosen judiciously to balance the time cost of writing to disk (you don't want to do it too often).
I figure this is what is referred to as "chunking" in the article? Why not list some packages that can help one chunk?
Overall opinions on this method? Could it be done better?
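For reference, here is a minimal sketch of that flush-every-N-steps pattern with h5py, using a resizable dataset that gets extended each time the buffer fills. The step() function, the record width, and the flush interval are placeholders for whatever the real simulation does:

```python
import numpy as np
import h5py

N_FLUSH = 1000        # flush every N steps (tune against disk write cost)
STATE_DIM = 3         # width of each simulation record (illustrative)
TOTAL_STEPS = 10_000

def step(state):
    # stand-in for the real simulation update
    return state + np.random.normal(size=state.shape)

with h5py.File("simulation.h5", "w") as f:
    # resizable dataset: grows along axis 0 as the run progresses
    dset = f.create_dataset(
        "trajectory",
        shape=(0, STATE_DIM),
        maxshape=(None, STATE_DIM),
        chunks=(N_FLUSH, STATE_DIM),  # on-disk chunk matches the flush size
        dtype="f8",
    )

    state = np.zeros(STATE_DIM)
    buffer = []
    for i in range(TOTAL_STEPS):
        state = step(state)
        buffer.append(state.copy())

        if len(buffer) == N_FLUSH:
            block = np.asarray(buffer)
            start = dset.shape[0]
            dset.resize(start + len(block), axis=0)  # extend the dataset
            dset[start:] = block                     # write the buffered block
            f.flush()                                # push data out to disk
            buffer.clear()

    # write any remaining partial block at the end of the run
    if buffer:
        block = np.asarray(buffer)
        start = dset.shape[0]
        dset.resize(start + len(block), axis=0)
        dset[start:] = block
```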
This is exactly what HDF5 was built for. Figuring out how often to persist to disk is going to depend on a number of factors. If you want to get fancy you can dedicate a thread to I/O but that gets hairy quickly. You also might want to look into fast compression filters like blosc/lzf as a way of spending down some of your surplus CPU budget.
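For anyone curious, enabling those filters in h5py is just a keyword on create_dataset (compression only applies to chunked datasets). The chunk shape and data here are only illustrative, and blosc needs the third-party hdf5plugin package:

```python
import numpy as np
import h5py

data = np.random.normal(size=(100_000, 64))

with h5py.File("compressed.h5", "w") as f:
    # lzf ships with h5py: very fast, modest compression ratio
    f.create_dataset("fast", data=data, chunks=(4096, 64), compression="lzf")

    # gzip trades more CPU time for a better ratio (levels 0-9)
    f.create_dataset("small", data=data, chunks=(4096, 64),
                     compression="gzip", compression_opts=4)

    # blosc filters come from the third-party hdf5plugin package, e.g.
    #   import hdf5plugin
    #   f.create_dataset("blosc", data=data, chunks=(4096, 64),
    #                    **hdf5plugin.Blosc(cname="lz4", clevel=5))
```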
That depends: we use HDF5 to store data from huge simulations (~1 TB per snapshot), which is obviously too large to analyse on a single machine.
The benefit of HDF5 is that it makes slicing data very easy, so 'chunking', where I load e.g. 10% of one dataset at a time, is very simple.
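Something like this, assuming the 'trajectory' dataset from the sketch above; h5py only reads the chunks that intersect the requested slice, so a partial read never pulls the whole dataset into RAM:

```python
import h5py

with h5py.File("simulation.h5", "r") as f:
    dset = f["trajectory"]        # dataset written by the run above
    n = dset.shape[0]

    # read only the first 10% of the rows
    first_tenth = dset[: n // 10]

    # or stride through the whole dataset one slab at a time
    slab = n // 10 or 1
    for start in range(0, n, slab):
        block = dset[start : start + slab]
        print(start, block.mean())  # placeholder for the real analysis
```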