
From a quick read of the SnowFS source code, it looks like it splits large files into 100 MB blocks and builds up a zip of blocks over time. A version of a file is an ordered list of hashes for the blocks in that version.

I like the simplicity of this! But is it at all problematic if something changes early in the file and all the subsequent block boundaries shift, causing many new blocks to be created?

rsync uses a sliding window to handle this situation. The implementation would be more complicated, but have you considered using librsync internally?
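
For reference, the core of rsync's trick is a weak checksum that can be updated in O(1) as the window slides by one byte, instead of rehashing the whole window. A rough sketch of that rolling (Adler-32-style) checksum, with illustrative names rather than librsync's actual API:

```ts
// Rolling weak checksum in the style of rsync's Adler-32 variant.
// Illustrative sketch only; librsync exposes a different API.
const MOD = 1 << 16;

// Checksum of an initial window, computed from scratch.
function weakChecksum(data: Uint8Array): { a: number; b: number } {
  let a = 0, b = 0;
  for (let i = 0; i < data.length; i++) {
    a = (a + data[i]) % MOD;
    b = (b + a) % MOD;
  }
  return { a, b };
}

// Slide the window by one byte in O(1): drop `outgoing`, add `incoming`.
function roll(
  sum: { a: number; b: number },
  outgoing: number,   // byte leaving the window
  incoming: number,   // byte entering the window
  windowLen: number
): { a: number; b: number } {
  const a = (sum.a - outgoing + incoming + MOD) % MOD;
  const b = (sum.b - windowLen * outgoing + a + MOD * windowLen) % MOD;
  return { a, b };
}
```

librsync packages this up together with the strong-hash verification pass, so the tricky parts wouldn't have to be reimplemented.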



I am currently working on the compression; it is not complete yet. The 100 MB window is indeed excessive, but it is dynamic and can differ from file to file, since it is written to a `*.hblock` file stored next to the object in the object database: https://github.com/Snowtrack/SnowFS/blob/03e5f839326e666c891...

Let me explain where the 100 MB window comes from, as it's not only related to the upcoming compression implementation. Some graphics applications touch the timestamps of their files for no reason, making it harder to detect whether a file changed, while some file formats always change their 'header' or 'footer'. That means comparing the hash of just the first or last 100 MB of an 8 GB file gives a big performance boost when detecting whether the file was modified.
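
As a minimal sketch of that idea (assuming Node's fs/crypto; the names are illustrative, not the actual SnowFS code):

```ts
import { createHash } from 'crypto';
import { closeSync, openSync, readSync, statSync } from 'fs';

const WINDOW = 100 * 1024 * 1024; // 100 MB

// Hash `length` bytes of `path` starting at byte offset `start`.
// (A real implementation would stream smaller reads instead of one big buffer.)
function hashRange(path: string, start: number, length: number): string {
  const fd = openSync(path, 'r');
  try {
    const buf = Buffer.alloc(length);
    const read = readSync(fd, buf, 0, length, start);
    return createHash('sha256').update(buf.subarray(0, read)).digest('hex');
  } finally {
    closeSync(fd);
  }
}

// Cheap "did this 8 GB file change?" check: hash only the head and tail
// windows instead of the whole file.
function quickFingerprint(path: string): { head: string; tail: string } {
  const size = statSync(path).size;
  const span = Math.min(WINDOW, size);
  return {
    head: hashRange(path, 0, span),
    tail: hashRange(path, Math.max(0, size - WINDOW), span),
  };
}
```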


There's a large family of sliding-window algorithms. Another interesting one is the Rabin fingerprint. This kind of content-defined chunking is often used in storage file systems with deduplication and snapshot features.

https://en.wikipedia.org/wiki/Rabin_fingerprint
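
To illustrate the idea, here is a toy content-defined chunker. It uses a simple gear-style rolling hash as a stand-in for a real Rabin fingerprint (which does polynomial arithmetic over GF(2)), but the control flow is the same: a boundary is declared wherever the hash of the recent bytes matches a mask, so an insertion only disturbs the chunks near it:

```ts
const MASK = (1 << 13) - 1;    // boundary condition -> ~8 KiB average chunks
const MIN_CHUNK = 2 * 1024;    // avoid degenerate tiny chunks
const MAX_CHUNK = 64 * 1024;   // force a boundary eventually

// Pseudo-random per-byte values; a real implementation would use fixed constants.
const GEAR = new Uint32Array(256);
for (let i = 0, x = 0x9e3779b1; i < 256; i++) {
  x = Math.imul(x ^ (x >>> 15), 0x85ebca6b) >>> 0;
  GEAR[i] = x;
}

// Returns the end offset of each chunk in `data`.
function chunkBoundaries(data: Uint8Array): number[] {
  const boundaries: number[] = [];
  let hash = 0;
  let chunkStart = 0;
  for (let i = 0; i < data.length; i++) {
    // Shifting left each step means bytes older than 32 positions fall out
    // of the 32-bit hash, which gives an implicit sliding window.
    hash = ((hash << 1) + GEAR[data[i]]) >>> 0;
    const chunkLen = i - chunkStart + 1;
    if (chunkLen >= MIN_CHUNK && ((hash & MASK) === MASK || chunkLen >= MAX_CHUNK)) {
      boundaries.push(i + 1);
      chunkStart = i + 1;
      hash = 0;
    }
  }
  if (chunkStart < data.length) boundaries.push(data.length);
  return boundaries;
}
```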


Cool, although I think a 4 MB window would be more efficient; 100 MB seems excessive. Then I assume you wouldn't need a sliding window (if it works well enough at 100 MB).


The problem happens with any fixed block spacing, regardless of the block size.

If you create a block every X MB, inserting a single byte at the beginning of the file will change every subsequent block.
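
A tiny demo of the effect with an absurdly small block size (hypothetical code, not how SnowFS stores anything):

```ts
import { createHash } from 'crypto';

const BLOCK = 4; // tiny block size, just for the demo

// Hash each fixed-size block of the buffer.
function blockHashes(data: Buffer): string[] {
  const hashes: string[] = [];
  for (let off = 0; off < data.length; off += BLOCK) {
    const block = data.subarray(off, off + BLOCK);
    hashes.push(createHash('sha256').update(block).digest('hex').slice(0, 8));
  }
  return hashes;
}

const v1 = Buffer.from('abcdefghijkl');
const v2 = Buffer.concat([Buffer.from('X'), v1]); // insert one byte at the front

console.log(blockHashes(v1)); // hashes of "abcd", "efgh", "ijkl"
console.log(blockHashes(v2)); // hashes of "Xabc", "defg", "hijk", "l" -- all different
```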


Technically speaking, you're wrong, but I'm sure the author doesn't want to reimplement block storage devices... so the spirit of the message is probably correct.


Oh I'm not talking about disks... this is based on how SnowFS (the library for this project) splits up big files into chunks:

https://github.com/Snowtrack/SnowFS/blob/main/src/common.ts#...

The intent is a simple form of delta encoding: the hope is that many chunks will be common between two versions.
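
In other words, a version is just an ordered list of block hashes, and only blocks whose hash hasn't been seen before need to be stored. Roughly (illustrative types, not SnowFS's actual data model):

```ts
type BlockHash = string;

interface FileVersion {
  blocks: BlockHash[]; // ordered hashes, one per fixed-size block
}

// Blocks in `next` that are not yet in the store, i.e. the only data that
// actually has to be written for the new version.
function newBlocks(store: Set<BlockHash>, next: FileVersion): BlockHash[] {
  return next.blocks.filter((h) => !store.has(h));
}
```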


I should clarify this. The 100 MB window in SnowFS is currently unrelated to compression; it is only used to check whether a block changed, and each block gets a hash. This is a fallback for file formats where the mtime timestamp cannot be trusted. Some files only change within the first block (e.g. the first 100 MB), and that is much faster to compare than an entire 8 GB file. But the window size is dynamic, and it can be changed and reused for compression in the future.


Ahh, this is my bad. For some reason I assumed the blocks were part of the storage scheme, but I see they are only used to compute hashes, and that the whole file is added to the zip. Sorry for the misunderstanding!



