
Indeed, there is a lot of pain if you actually try to store large binary data in git. But we managed to make that work! So a question worth asking is: how might things change if you could store large binary data in git?


That is exactly what git-lfs is: a way to "version control" binary files by storing revisions (possibly separately), while the actual repo contains text files plus "pointer" files that reference the binary files.
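For anyone who hasn't seen one, an LFS pointer file is just a few lines of text checked into the repo in place of the binary (the hash and size here are the illustrative values from the LFS spec, not a real file):

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345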

It's not perfect, and it still feels like a bit of a hack compared to something like p4 for the context I use LFS in (game dev), but it works, and it doesn't require expensive custom licenses when teams grow beyond an arbitrary number like 3 or 5.


XetHub co-founder here. Yes, we use the same Git extension mechanism as Git LFS (clean/smudge filters) and we store pointer files in the git repository. Unlike Git LFS, we do block-level deduplication (Git LFS does file-level deduplication), and this can result in significant savings in storage and bandwidth.
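For reference, the clean/smudge mechanism is plain git config; this is what the standard Git LFS wiring looks like (a Xet-style filter would be hooked up analogously, just with a different filter command):

    # .gitattributes: route matching files through the filter
    *.bin filter=lfs diff=lfs merge=lfs -text

    # git config: the filter definition that `git lfs install` writes
    [filter "lfs"]
        clean = git-lfs clean -- %f
        smudge = git-lfs smudge -- %f
        process = git-lfs filter-process
        required = true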

As an example, a Unity game repo shrank by 41% with our block-level deduplication, where Git LFS barely helped: the raw repo was 48.9GB, Git LFS stored 48.2GB, and XetHub stored 28.7GB.
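To illustrate the difference: with file-level dedup, a one-byte edit re-uploads the whole file; with content-defined chunking, only the chunks around the edit change, because chunk boundaries are derived from the content itself. Here's a toy sketch of the general idea in Python (a simplified Gear-style rolling hash; the window size and boundary mask are arbitrary illustrative choices, not what XetHub actually uses):

    import hashlib
    import os

    WINDOW = 48           # minimum chunk size (illustrative)
    MASK = (1 << 13) - 1  # ~8 KiB average chunk size (illustrative)

    def chunks(data):
        # Gear-style rolling hash: cut wherever the low bits match MASK,
        # so boundaries depend on content, not on absolute file offsets.
        start, h = 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + b) & 0xFFFFFFFF
            if i - start >= WINDOW and (h & MASK) == MASK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def dedup_store(data, store):
        # Store each chunk once, keyed by its content hash; return the
        # "recipe" of keys needed to reassemble the file.
        keys = []
        for c in chunks(data):
            k = hashlib.sha256(c).hexdigest()
            store.setdefault(k, c)
            keys.append(k)
        return keys

    store = {}
    base = os.urandom(200_000)
    dedup_store(base, store)
    before = len(store)
    edited = base[:100_000] + b"patch" + base[100_000:]
    dedup_store(edited, store)
    print(f"{len(store) - before} new chunks out of {before}")  # only a few

The chunk boundaries resynchronize shortly after the insertion point, so everything downstream of the edit deduplicates against the original file; a file-level scheme would store both versions in full.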

Why do you think using a Git-based solution is a hack compared to p4? What part of the p4 workflow feels more natural to you?


The centralised model of Perforce is a more natural fit, for one thing, since by default it lets you clone subsets, and just the latest version of files. File locking is much more integrated into the p4 workflow as well; in git you can still modify locked files locally and commit them. The check only happens on push, and sometimes git fails to send the lock notification upstream. Oh, and it breaks down entirely if you use branches.
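For reference, the Git LFS locking workflow being described looks like this (paths made up); the "lockable" attribute makes matching files read-only on checkout, but as noted the server-side enforcement really does only happen at push time:

    $ git lfs track "*.fbx" --lockable         # mark the file type lockable
    $ git lfs lock Assets/Characters/hero.fbx  # take the server-side lock
    $ git lfs locks                            # list who holds what
    $ git lfs unlock Assets/Characters/hero.fbx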

Some of these have workarounds and hacks for more experienced users, but I'm not about to run around teaching people the intricacies of arcane git incantations while p4 functions, by default, how you'd want it to. The programming side is better on git, though, yeah.


(XetHub engineer here)

We're working on perforce-style locking on XetHub, and I believe git already supports things like cloning only the latest version of files. Cloning the full repo without "smudging" (pulling in binary file contents) is already possible, and cloning while smudging a subset is on our roadmap. We're definitely on a path to making the git UX for dealing with large binary files as easy as perforce, and there are lots of advantages to keeping a git-based workflow for teams that already work with git.
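For anyone following along, the stock git/LFS equivalents of "latest version only" cloning look like this today (repo URL and paths hypothetical):

    # Clone with pointer files only, no binary contents downloaded:
    $ GIT_LFS_SKIP_SMUDGE=1 git clone https://example.com/team/game.git

    # Or, with stock git, skip historical blobs and fetch them on demand:
    $ git clone --filter=blob:none https://example.com/team/game.git

    # Then materialize just the binaries you actually need:
    $ git lfs pull --include="Assets/Textures/**"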


I think this is a foot-gun: it's a bad idea even if it works great, and I doubt it works very well. You should manage your build artifacts explicitly, not just jam them into git alongside the code that generates them because you're already using it and haven't thought it through.


I don't think you've made your case here. The practices you describe are partly an artifact of computation, bandwidth, and storage costs, but not the current ones: the ones from when git was invented more than 15 years ago. In the short term, we have to conform to the computer's needs; in the long term, it has to be the other way around.


You're right! It makes way more sense, in the long run, to abuse a tool like git in a way it isn't designed for and can't actually support, and then, instead of actually using git, to use a proprietary service that may or may not be around in a week. Here I was, thinking short term.


You seem nice.


Thank you, I've worked very hard to become so. You seem nice, too!


Xet's initial focus appears to be on data files used to drive machine learning pipelines, not on any resulting binaries.



