
Modern MS document files are zipped XML. To do this comparison they would need to unzip each file, run it through a rendering engine and hold it in memory, and then do version comparison. For this to be feasible you would need to use a file format that supports this sort of comparison in a way that isn't very resource intensive.


It's not that; it's not as if 100% of your users will be diffing documents 100% of the time. The real reason is that office formats are super, super complex and diffing them is a hard problem, even more so for the proprietary Microsoft formats.

https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...

The "zipped XMLs" you mention are basically XML dumps of the former binary format that evolved organically from the 1980s, when resources were scarce and they had to hack together a working office solution.


This! If anyone thinks that it’s trivial to diff Microsoft’s XML formats then I urge you to please try this:

• Create a simple Excel document.

• Clone the document and change the text value of one cell.

• Unzip both .xlsx files into two different directories.

• Now launch Meld/WinMerge or similar and diff the directories.

Now tell me if you still think diffing this format is trivial.
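For anyone who wants to try the experiment above without a GUI diff tool, here is a minimal sketch that compares the two archives member by member. The function names are made up for illustration; it only reports which parts inside the .xlsx changed, not what changed inside them.

```python
import hashlib
import zipfile


def member_digests(path):
    """Map each archive member name to the SHA-256 of its uncompressed bytes."""
    with zipfile.ZipFile(path) as zf:
        return {
            info.filename: hashlib.sha256(zf.read(info)).hexdigest()
            for info in zf.infolist()
            if not info.is_dir()
        }


def changed_members(old_path, new_path):
    """Return sorted lists of (added, removed, modified) member names."""
    old = member_digests(old_path)
    new = member_digests(new_path)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    return added, removed, modified
```

Calling `changed_members("before.xlsx", "after.xlsx")` (hypothetical file names) lists every part that differs, which gives a rough idea of how far a one-cell edit spreads through the package.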


If you just want a content-aware diff (never mind formatting), it's not actually that difficult to diff; read the stylesheet so you can understand the style refs, then parse the workbook sheets and look up style refs on demand.

(Have written streaming XLSX parser in the past.)
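A rough sketch of what a content-only comparison could look like, using nothing but the standard library. Assumptions: the sheet lives at `xl/worksheets/sheet1.xml` (a common default, but the real path comes from the workbook relationships), only shared and inline strings are resolved, and the style lookup described above is left out entirely, so dates and formatted numbers show up as raw stored values. `read_cells` and `diff_cells` are illustrative names.

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by sheet and shared-string parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def read_cells(path, sheet="xl/worksheets/sheet1.xml"):
    """Return {cell reference: stored value} for one sheet (assumed path)."""
    with zipfile.ZipFile(path) as zf:
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in root.findall("m:si", NS):
                # Rich text splits a string over several <t> runs; join them.
                shared.append("".join(t.text or "" for t in si.iterfind(".//m:t", NS)))
        sheet_root = ET.fromstring(zf.read(sheet))

    cells = {}
    for c in sheet_root.iterfind(".//m:c", NS):
        kind = c.get("t")
        if kind == "inlineStr":
            value = "".join(t.text or "" for t in c.iterfind(".//m:t", NS))
        else:
            v = c.find("m:v", NS)
            value = v.text if v is not None else None
            if kind == "s" and value is not None:
                # t="s" means <v> holds an index into the shared strings table.
                value = shared[int(value)]
        cells[c.get("r")] = value
    return cells


def diff_cells(old_path, new_path, sheet="xl/worksheets/sheet1.xml"):
    """Return {cell reference: (old value, new value)} for cells that differ."""
    old = read_cells(old_path, sheet)
    new = read_cells(new_path, sheet)
    return {
        ref: (old.get(ref), new.get(ref))
        for ref in sorted(old.keys() | new.keys())
        if old.get(ref) != new.get(ref)
    }
```

This loads each sheet fully into memory rather than streaming it, which is fine for a sketch but not for very large workbooks.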


Cool. Someone should do that :) [1]

AFAIK there are no ready-made solutions for that so far. Would be very useful![2]

[1] It would be interesting to dive further into this subject, but personally I can’t currently find the time for that.

[2] Now that I think of it, this might be an interesting project for someone participating in Google Summer of Code. Not sure if the Git project will participate this year or not.


The proper way to diff .docx documents would be for Microsoft to release a diff tool for .docx documents. If they released a three-way merge tool as well, then it could be used in git too. git supports 3rd party diff and merge tools for specific file formats.


It might be a lot of work and the benefits are not super obvious for them (other than community goodwill :-) ).


They already have the functionality to diff between two documents in Word. I use it all the time to see if legal made any changes while "forgetting" track changes.


Maybe LibreOffice should work on it for .odt then, together with various VCS plugins (mainly git and maybe hg). It could be an interesting differentiator feature.

Sure, it's hard to diff and merge tree data structures, but it doesn't have to be perfect. Text diffing and merging is already imperfect anyway, yet it's very useful.


It’s not something you could use from a command line, but Microsoft Word already does very good doc comparisons. The feature is called “compare”.

Not sure why we would need this at a file system level. You’d need diff tools for all sorts of file types.


But for the main use case you probably don't have to diff the totality of the file, just the content.

If you would simply render each version to plain text and compare them (which is a solved problem), it would already be very useful.
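As a rough illustration of the render-to-text idea for .docx, the sketch below pulls the paragraph text out of `word/document.xml` and feeds it to `difflib`. It is only an approximation of "rendering": headers, footers, footnotes and all formatting are ignored, and text boxes nested inside a paragraph may appear twice. `docx_paragraphs` and `text_diff` are made-up names.

```python
import difflib
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the document body as a list of plain-text paragraphs."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    # Each <w:p> is a paragraph; its visible text sits in <w:t> runs.
    return [
        "".join(t.text or "" for t in p.iter(W + "t"))
        for p in root.iter(W + "p")
    ]


def text_diff(old_path, new_path):
    """Return a unified diff (list of lines) of the two documents' text."""
    return list(
        difflib.unified_diff(
            docx_paragraphs(old_path),
            docx_paragraphs(new_path),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

A script like this could probably also be wired into git as a `textconv` filter (a `diff.<driver>.textconv` setting plus a matching `.gitattributes` entry), so that `git diff` shows text changes for .docx files; the diff stays read-only, since nothing here can merge.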


That’s an interesting idea. Somehow it irks me that changes of style would be totally invisible in the diff. But it could still be useful.


Not all of them. I believe Microsoft uses a special format for Office documents in OneDrive. (These files are converted to XML when you access them with non-Microsoft software.)



