I combine git-annex with the bup special remote[1], which still lets me externalize big files while benefiting from block-level deduplication. Or, depending on your needs, you can just use a tool like bup[2] or borg directly. Bup actually uses the git pack file format and git metadata.
I wrote a script, which I'm happy to share, that makes this much easier and even lets you mount your bup repo over .git/annex/objects for direct access.
Have you tested this with Unreal Engine blueprint files? If you can do block-based diffing on those and on the other binary assets used in games, it'd be huge for game development.
I have a couple of ~1TB repositories that I've had the misfortune of working with in Perforce in the past.
Last time I used Perforce in anger, it did pretty decently with an ~800GB repo (checkout + history).
I keep expecting someone to come along and dethrone it, but as far as I can tell it hasn't been done yet. The combination of specific file-tree views, drop-in proxies, a UI-forward design, and a checkout-based workflow that works well with unmergeable binary assets still leaves Git LFS and other solutions in the dust.
+1 on testing this against a moderate-size gamedev repo; those usually have some of the harder constraints, where code and assets can be coupled and the art portion of a sync can easily top a couple hundred GB.
1TB of checkout is the kind of repo I'm talking about; I have two such repos checked out on this box currently. I'm not sure I've ever checked out a repo of this scale locally with history, but I'd love to have the local history.
HashBackup author here. Your question is (I think) about how well block-based dedup works on a database, whether rows are changed or columns are changed. This answer describes how most block-based dedup software, including HashBackup, works.
Block-based dedup can be done either with fixed block sizes or variable block sizes. For a database with a fixed page size, a fixed dedup block size matching the page size is most efficient. For a database with variable page sizes, a variable block size works better, assuming the dedup "chunking" algorithm is fine-grained enough to detect the database page size. For example, if the db used a 4-6K variable page size and the dedup algo used a 1M variable block size, it could not save just the single modified db page; it would have to save the whole ~1MB chunk, i.e. a couple hundred db pages surrounding the modified page.
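If it helps to see that granularity effect concretely, here's a toy content-defined chunker (a Buzhash-style rolling hash). To be clear, this is not HashBackup's actual algorithm; the window size, hash table, and boundary rule are arbitrary choices for illustration. It overwrites one 4 KiB "page" of a random file in place and compares how much new data has to be stored with ~4 KiB versus ~1 MiB average chunks:

    # Toy content-defined chunker for illustration only (pure Python, takes a few seconds).
    import hashlib, os, random

    random.seed(0)
    TABLE = [random.getrandbits(32) for _ in range(256)]   # random byte -> 32-bit hash table

    def rol32(x, n):
        n %= 32
        return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

    def iter_chunks(data, mask_bits, window=48):
        """Yield chunks with an average size of roughly 2**mask_bits bytes.

        A boundary is declared where the low bits of a rolling hash over the last
        `window` bytes are zero, so boundaries depend only on local content and
        re-synchronize after an in-place change.
        """
        mask = (1 << mask_bits) - 1
        h, start = 0, 0
        for i, b in enumerate(data):
            h = rol32(h, 1) ^ TABLE[b]
            if i >= window:
                h ^= rol32(TABLE[data[i - window]], window)   # drop the byte leaving the window
            if (h & mask) == 0 and i + 1 - start >= window:   # enforce a minimum chunk size
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def block_ids(data, mask_bits):
        # chunk hash -> chunk length, as a stand-in for a dedup block store
        return {hashlib.sha1(c).digest(): len(c) for c in iter_chunks(data, mask_bits)}

    PAGE = 4096
    original = os.urandom(PAGE * 1000)                   # ~4 MB stand-in for a db file
    modified = bytearray(original)
    modified[PAGE * 500:PAGE * 501] = os.urandom(PAGE)   # overwrite one page in place
    modified = bytes(modified)

    for bits, label in [(12, "~4 KiB avg chunks"), (20, "~1 MiB avg chunks")]:
        old, new = block_ids(original, bits), block_ids(modified, bits)
        fresh = {h: n for h, n in new.items() if h not in old}
        print(f"{label}: {len(fresh)} new chunks, {sum(fresh.values())} bytes to re-store")

With the fine-grained setting, only the chunk or two touching the modified page shows up as new; with the coarse setting, the whole ~1 MiB chunk around it does, which is the same effect as the 4-6K page vs 1M block example above.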
Your column vs row question depends on how the db stores data, whether key fields are changed, etc. The main dedup efficiency criteria are whether the changes are physically clustered together in the file or whether they are dispersed throughout the file, and how fine-grained the dedup block detection algorithm is.
Imagine you have a 500MB file (lastmonth.csv) where every day 1MB is changed.
With file-based deduplication, the full 500MB gets uploaded every day, and every clone of the repo has to download 500MB.
With block-based deduplication, only roughly the 1MB that changed is uploaded and downloaded (a quick back-of-the-envelope simulation follows below).
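To put numbers on it, here's a toy simulation of that example; the 500 x 1MB blocks and the daily in-place edit are just the figures from above, not tied to git-annex, bup, or HashBackup specifically, and it assumes edits never shift block boundaries (if they did, you'd need content-defined chunking to re-synchronize):

    # lastmonth.csv modeled as 500 blocks of 1 MB, with ~1 MB changing in place each day.
    day0 = [f"block-{i}-v0" for i in range(500)]          # 500 x 1 MB = 500 MB

    stored = set(day0)                                    # blocks the remote already has
    uploaded_file_level = 0
    uploaded_block_level = 0

    current = list(day0)
    for day in range(1, 31):                              # a month of daily 1 MB edits
        current[day % 500] = f"block-{day % 500}-v{day}"  # overwrite one 1 MB block
        uploaded_file_level += len(current)               # whole file re-uploaded: 500 MB/day
        new_blocks = [b for b in current if b not in stored]
        uploaded_block_level += len(new_blocks)           # only the changed block: ~1 MB/day
        stored.update(new_blocks)

    print(f"file-level dedup uploads over a month:  {uploaded_file_level} MB")    # 15000 MB
    print(f"block-level dedup uploads over a month: {uploaded_block_level} MB")   # 30 MB

Over a month that works out to 15,000 MB uploaded with file-level dedup versus roughly 30 MB with block-level dedup.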