How is the data split into chunks? Just curious.

sesm · on Dec 13, 2022

They mention 'content-defined chunking', but as far as I understand it, that requires different chunking algorithms for different content types. Does it support plugins for chunking different file formats?

ylow · on Dec 13, 2022

Today we just have a variation of FastCDC in production, but we have alternate experimental chunkers for some file formats (e.g., a heuristic chunker for CSV files that will enable almost-free subsampling). We hope to have them in production in the next 6 months.

sesm · on Dec 13, 2022

That's interesting. Can a CSV chunker make adding a column not affect all of the chunks?

ylow · on Dec 13, 2022

The simplest approach really is to chunk row-wise, so adding a column will unfortunately rewrite all the chunks. If you have a Parquet file, adding columns will be cheap, since Parquet stores data column by column.
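
To make the row-wise vs. columnar point concrete, here is a toy sketch. It is not the actual CSV chunker or the Parquet format: `row_chunks`, `column_chunks`, and the sample rows are made up for illustration, and each "chunk" is just a SHA-256 of some serialized text.

```python
import hashlib

def digest(text: str) -> str:
    # Content hash standing in for a chunk identifier.
    return hashlib.sha256(text.encode()).hexdigest()

def row_chunks(rows, rows_per_chunk=2):
    """Hash fixed groups of rows, each row serialized as one CSV line."""
    lines = [",".join(r) for r in rows]
    return {digest("\n".join(lines[i:i + rows_per_chunk]))
            for i in range(0, len(lines), rows_per_chunk)}

def column_chunks(rows):
    """Hash each column separately (toy stand-in for a columnar layout)."""
    return {digest("\n".join(col)) for col in zip(*rows)}

# Hypothetical table, then the same table with an extra "age" column.
before = [["id", "name"], ["1", "ada"], ["2", "bob"], ["3", "cy"]]
after = [r + [extra] for r, extra in zip(before, ["age", "36", "41", "29"])]

shared_row = row_chunks(before) & row_chunks(after)
shared_col = column_chunks(before) & column_chunks(after)
print(len(shared_row), "row-wise chunks reused")
print(len(shared_col), "column-wise chunks reused")
```

Every CSV line changes when a column is appended, so no row-wise chunk survives, while the two original columns hash the same in the columnar layout and only the new column needs to be stored.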

ylow · on Dec 13, 2022

CEO/Cofounder here! Content-defined chunking, specifically a variation of FastCDC. We have a paper coming out soon with a lot more technical details.
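
For anyone curious what content-defined chunking looks like in code, below is a minimal sketch of a gear-hash chunker in the spirit of FastCDC. It is a simplification and not the production variation mentioned above: the constants `MIN_SIZE`, `MAX_SIZE`, and `MASK`, the seeded `GEAR` table, and the function names are all assumptions picked for readability.

```python
import hashlib
import random

# Hypothetical parameters, chosen only to keep the example small.
MIN_SIZE = 64            # never cut a chunk shorter than this
MAX_SIZE = 1024          # always cut once a chunk reaches this length
MASK = (1 << 8) - 1      # cut when the low 8 bits of the hash are zero

# Fixed table of 256 random 64-bit values, one per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Gear rolling hash: older bytes shift out of the low bits.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if length < MIN_SIZE:
            continue
        if (h & MASK) == 0 or length >= MAX_SIZE:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)  # trailing partial chunk

def chunk_hashes(data: bytes):
    """Return the SHA-256 hex digest of each chunk, in order."""
    return [hashlib.sha256(data[s:e]).hexdigest()
            for s, e in chunk_boundaries(data)]
```

Because a cut depends only on the last few bytes seen, inserting bytes near the start of a file shifts the boundaries for a short stretch and then the chunker falls back onto the same cut points, so most later chunks keep their hashes and can be deduplicated.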