Modern MS document files are zipped XML. To do this comparison they would need ...

oblio · on Feb 3, 2020

It's not that. It's not as if 100% of your users will be diffing documents 100% of the time. The real reason is that office formats are super, super complex and diffing them is a hard problem, even more so for the proprietary Microsoft formats.

https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...

The "zipped XMLs" you mention are basically XML dumps of the former binary format that evolved organically from the 1980s, when resources were scarce and they had to hack together a working office solution.

cpach · on Feb 3, 2020

This! If anyone thinks that it’s trivial to diff Microsoft’s XML formats then I urge you to please try this:

• Create a simple Excel document.

• Clone the document and change the text value of one cell.

• Unzip both .xlsx files into two different directories.

• Now launch Meld/WinMerge or similar and diff the directories.

Now tell me if you still think diffing this format is trivial.
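For anyone who wants to try the experiment above without unzipping by hand, here is a rough sketch that compares two .xlsx files member by member. The function name `changed_members` is made up for illustration, and it only reports which archive members differ, not how they differ.

```python
import hashlib
import zipfile


def changed_members(path_a, path_b):
    """Return sorted names of archive members that differ between two zip files."""
    def digests(path):
        # Hash every member so archives can be compared without extracting them.
        with zipfile.ZipFile(path) as zf:
            return {name: hashlib.sha256(zf.read(name)).hexdigest()
                    for name in zf.namelist()}

    a, b = digests(path_a), digests(path_b)
    # A member counts as changed if it is missing on one side or its bytes differ.
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Calling `changed_members("before.xlsx", "after.xlsx")` (hypothetical file names) lists every part of the package that changed, which gives a feel for how far a one-cell edit spreads.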

barrkel · on Feb 3, 2020

If you just want a content-aware diff (never mind formatting), it's not actually that difficult to diff; read the stylesheet so you can understand the style refs, then parse the workbook sheets and look up style refs on demand.

(I have written a streaming XLSX parser in the past.)
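A minimal sketch of the content-only part of this idea, under some loud assumptions: it skips the stylesheet entirely, resolves only shared strings, and assumes the sheet lives at `xl/worksheets/sheet1.xml` (a real reader would look the path up in the workbook relationships). It is not streaming, and the helper names `read_cells` and `diff_cells` are invented for illustration.

```python
import xml.etree.ElementTree as ET
import zipfile

# SpreadsheetML main namespace used by sheet and shared-string parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def read_cells(path, sheet="xl/worksheets/sheet1.xml"):
    """Map cell references such as 'A1' to their stored text for one sheet."""
    with zipfile.ZipFile(path) as zf:
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in root.findall("m:si", NS):
                # Rich text splits one string over several <t> runs; join them.
                shared.append("".join(t.text or "" for t in si.iter("{%s}t" % NS["m"])))
        cells = {}
        sheet_root = ET.fromstring(zf.read(sheet))
        for c in sheet_root.iter("{%s}c" % NS["m"]):
            v = c.find("m:v", NS)
            if v is None or v.text is None:
                continue  # empty cell, or an inline string this sketch ignores
            # t="s" means the value is an index into the shared-string table.
            cells[c.get("r")] = shared[int(v.text)] if c.get("t") == "s" else v.text
        return cells


def diff_cells(old, new):
    """Return {ref: (old_value, new_value)} for every cell whose value differs."""
    return {ref: (old.get(ref), new.get(ref))
            for ref in sorted(old.keys() | new.keys())
            if old.get(ref) != new.get(ref)}
```

Something like `diff_cells(read_cells("before.xlsx"), read_cells("after.xlsx"))` would then report only the cells whose values changed, regardless of how the shared-string table was reshuffled.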

cpach · on Feb 3, 2020

Cool. Someone should do that :) [1]

AFAIK there are no ready-made solutions for that so far. Would be very useful! [2]

[1] It would be interesting to dive further into this subject but personally I can’t currently find the time for that.

[2] Now that I think of it, this might be an interesting project for someone participating in Google Summer of Code. Not sure if the Git project will participate this year or not.

steerablesafe · on Feb 3, 2020

The proper way to diff .docx documents would be for Microsoft to release a diff tool for .docx documents. If they released a three-way merge tool as well, then it could be used in git too. git supports third-party diff and merge tools for specific file formats.

oblio · on Feb 3, 2020

It might be a lot of work and the benefits are not super obvious for them (other than community goodwill :-) ).

markus92 · on Feb 3, 2020

They already have the functionality to diff two documents in Word. I use it all the time to see if legal made any changes while "forgetting" track changes.

steerablesafe · on Feb 3, 2020

Maybe LibreOffice should work on it for .odt then, together with various VCS plugins (mainly git and maybe hg). It could be an interesting differentiating feature.

Sure, it's hard to diff and merge tree data structures, but it doesn't have to be perfect. Text diffing and merging is already imperfect anyway, yet it's very useful.

Tagbert · on Feb 4, 2020

It’s not something you could use from a command line, but Microsoft Word already does very good document comparisons. The feature is called “Compare”.

Not sure why we would need this at a file system level. You’d need diff tools for all sorts of file types.

bambax · on Feb 3, 2020

But for the main use case you probably don't have to diff the totality of the file, just the content.

If you would simply render each version to plain text and compare them (which is a solved problem), it would already be very useful.
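As a rough sketch of the render-to-text approach for .docx: pull the paragraph text out of `word/document.xml` and hand it to an ordinary line diff. This assumes the main body is all you care about; it ignores headers, footnotes, tabs and every bit of styling, and the function names are hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in the main document part, in order."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    # A paragraph's text is spread over several <w:t> runs; join them per <w:p>.
    return ["".join(t.text or "" for t in p.iter(W + "t")) for p in root.iter(W + "p")]


def text_diff(path_a, path_b):
    """Unified diff of the two documents, one paragraph per line."""
    return list(difflib.unified_diff(docx_paragraphs(path_a), docx_paragraphs(path_b),
                                     fromfile=path_a, tofile=path_b, lineterm=""))
```

Printing `"\n".join(text_diff("v1.docx", "v2.docx"))` (hypothetical file names) gives a familiar unified diff. A script along these lines that prints the paragraphs to stdout could presumably also be plugged into git through a diff driver's `textconv` setting.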

cpach · on Feb 3, 2020

That’s an interesting idea. Somehow it irks me that changes of style would be totally invisible in the diff. But it could still be useful.

jannes · on Feb 3, 2020

Not all of them. I believe Microsoft uses a special format for Office documents in OneDrive. (These files are converted to XML when you access them with non-Microsoft software.)