
That was an impressive result, so I tried it on a huge email inbox.

    uncompressed:    1512662084
    xz --extreme -9:  508431572  12:47
    zstd --ultra -21: 508432560  12:44
(-22 ran out of memory.) So at least for me, zstd was identical to xz almost to the byte and the second.
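
To put the numbers above in terms of compression ratio, here is a small arithmetic sketch. The byte counts are copied from the results above; the variable and function names are just illustrative.

```python
# Sizes in bytes, copied from the results reported above.
uncompressed = 1512662084
sizes = {
    "xz --extreme -9": 508431572,
    "zstd --ultra -21": 508432560,
}


def ratio(compressed: int, original: int = uncompressed) -> float:
    """Compressed size as a fraction of the original size."""
    return compressed / original


for name, size in sizes.items():
    print(f"{name}: {ratio(size):.4f} of original")

# Absolute difference between the two outputs, in bytes.
gap = sizes["zstd --ultra -21"] - sizes["xz --extreme -9"]
print(f"gap: {gap} bytes")
```

Both tools land at roughly a third of the original size, and the two outputs differ by less than a kilobyte.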


It does really vary based on the data set.

If the email data is mostly text with markup (like HTML/XML), you might want to try bzip3 too.

It's also possible that a large part of your email is actually already-compressed binary data (like PDFs and images) possibly encoded in base-64. In that case it's likely that all tools are pretty good at compressing the text and headers, but can do little to compress the attachments, which would explain why the results you get are so close.
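
A rough way to see this effect, as a sketch only: the snippet below uses random bytes as a stand-in for an already-compressed attachment (an assumption, since real PDFs and images are not perfectly random) and Python's standard `lzma` module rather than the `xz` or `zstd` command-line tools. The sample header text and sizes are made up for illustration.

```python
import base64
import lzma
import os


def compressed_fraction(data: bytes) -> float:
    """Size after LZMA compression as a fraction of the input size."""
    return len(lzma.compress(data, preset=6)) / len(data)


# Assumption: random bytes approximate an already-compressed attachment.
attachment = os.urandom(200_000)
# MIME-style base64 (with line breaks), as attachments appear in an mbox.
encoded = base64.encodebytes(attachment)

# Hypothetical repetitive header text, standing in for the plain-text part.
text = b"Subject: weekly report\r\nFrom: someone@example.com\r\n" * 4000

frac_text = compressed_fraction(text)
frac_b64 = compressed_fraction(encoded)

print(f"text:   {frac_text:.3f} of original")
print(f"base64: {frac_b64:.3f} of original")
```

base64 stores 6 bits of payload per 8-bit character, so a compressor can at best claw back that encoding overhead, leaving the attachment at about three quarters of its encoded size. The text portion shrinks to a tiny fraction, so with attachment-heavy mail the final size is dominated by data no tool can shrink much further.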


    bzip3 -b511: 580771424  8:51
I suspect your theory about compressed attachments is correct, although bzip3 isn't doing very well compared to the rest.


Interesting--thanks for checking! I had good experiences with bzip3 compressing Wikipedia XML dumps, to the point it even outperformed xz, so I thought something similar might happen here. Compression does remain a bit of a black art, where it's hard to predict what works without trying it out.

Overall I'm still slightly biased towards using zstd as a default, in that I believe:

  1. zstd will almost always be among the fastest formats for decompression, which is obviously nice to have, everything else being equal.
  2. zstd can achieve a very high compression ratio, depending on tuning; rarely will zstd significantly underperform the next best option.
That makes a pretty good case for using zstd by default, even if in some cases it's not noticeably better than other formats. In your case, xz seems to be just as good.


I got -22 to run:

    zstd --ultra -22: 494517545  14:00
Pretty minor difference.



