Hacker News | yakovsh2's comments

Critical XXE in Apache Tika (tika-parser-pdf-module) in Apache Tika 1.13 through and including 3.2.1 on all platforms allows an attacker to carry out XML External Entity injection via a crafted XFA file inside of a PDF. An attacker may be able to read sensitive data or trigger malicious requests to internal resources or third-party servers. Note that the tika-parser-pdf-module is used as a dependency in several Tika packages including at least: tika-parsers-standard-modules, tika-parsers-standard-package, tika-app, tika-grpc and tika-server-standard.

Users are recommended to upgrade to version 3.2.2, which fixes this issue.


Default encoding is now UTF-8, due to the way the RFC process works it is not in the original document: https://www.iana.org/assignments/media-types/text/csv


(I'm the author of the RFC)

Not sure why this is trending now, but this RFC is currently being revised: https://datatracker.ietf.org/doc/html/draft-shafranovich-rfc...

Suggestions and comments are welcome here: https://github.com/nightwatchcybersecurity/rfc4180-bis


Is Microsoft somehow on board for this? In my experience there are CSV files, and then there are CSV files that Excel understands, and as long as your file format doesn't make it into Excel, that will seriously hinder adoption.


In my experience Excel doesn't even understand its own CSV files. If you save one using the Swedish locale, it uses semicolons as field separators, since the comma is used as the decimal point. Trying to open the resulting file using an English locale results in garbage.


When saving as CSV, Excel will use the regional "List separator" setting. You can override this in Windows 7 with Region and Language > Additional Settings > List separator.

If you are trying to generate a file that plays nice with Excel, there is a way to force a specific delimiter with the "sep" pragma:

    sep=|
    a|b|c
    1|2|3
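For instance, here's a quick Python sketch of writing and reading a file with that pragma (the helper names are made up; the "sep=" line itself is an Excel extension, not part of RFC 4180):

```python
import csv
import io

def write_with_sep_pragma(f, rows, delimiter="|"):
    # Excel-style pragma on the first line, then ordinary delimited rows.
    f.write(f"sep={delimiter}\r\n")
    csv.writer(f, delimiter=delimiter).writerows(rows)

def read_with_sep_pragma(f, default=","):
    # Honor a "sep=" first line if present; otherwise rewind and use the default.
    first = f.readline()
    if first.startswith("sep="):
        delimiter = first.strip()[4:5] or default
    else:
        delimiter = default
        f.seek(0)
    return list(csv.reader(f, delimiter=delimiter))

buf = io.StringIO()
write_with_sep_pragma(buf, [["a", "b", "c"], ["1", "2", "3"]])
buf.seek(0)
rows = read_with_sep_pragma(buf)
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```

Note that a standards-compliant parser that doesn't know about the pragma will see "sep=|" as a one-field data row, which is exactly the compatibility problem being discussed.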


Please give up on that and make CSV itself nice and simple. :) Let's have it, err, comma separated.

Simple is good.

If it's really simple even Microsoft will be able to implement it.


Would that work if you tried to use the ASCII record separator character with the sep pragma?


Just open it in LibreOffice and save it to .xlsx. It's the only way I've found that actually works reliably.


I think we should just stop using commas and newlines and start using the ASCII unit separator and record separator. It would alleviate most quoting and escaping issues because those two rarely appear in actual text.
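As a sketch of how that could look (plain Python; US is 0x1F, RS is 0x1E, and since neither appears in normal text, no quoting rules are needed):

```python
# "ASCII-delimited" table: unit separator between fields,
# record separator between records.
US, RS = "\x1f", "\x1e"

def encode(rows):
    return RS.join(US.join(fields) for fields in rows)

def decode(blob):
    return [record.split(US) for record in blob.split(RS)]

# Commas and newlines in the data survive round-tripping untouched.
rows = [["name", "note"], ["Ada", "likes, commas\nand newlines"]]
blob = encode(rows)
print(decode(blob) == rows)  # True
```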


Until I can type those ASCII separators on my keyboard, they’re not going to happen. That’s why CSV won out: the comma key already existed.


For terminal mode applications, unit separator is Control+Underscore and record separator is Control+Caret (Ctrl+Shift+6 on a US English keyboard).

However, many terminal mode applications intercept those keys and expect them to be commands rather than data. Often there is some kind of quote key which you can press first. In readline and vim, the quote key is Ctrl+V. In emacs, it is Ctrl+Q instead.

GUI applications are more variable. But vim and emacs still support their respective quote keys in GUI mode.

And if you have to type this a lot, you can modify your vim/emacs/readline/etc configuration to remove the requirement for the quote key to be pressed first.


Sure, maybe, but that won’t be CSV.


Maybe that would be an "SSV"? Separator Separated Values file.


On the other hand, if the characters can still technically exist in the content, their rarity just makes the problems harder to spot.

In this sense, sticking to relatively common separators is good, because they encourage you to do the right thing from the start.


That would just make it easier to implement the standard in a way that works most of the time but breaks on valid input. Why would that be desirable?


Significant whitespace is nasty; by definition, it's hard to see errors. It's not convenient if you can't edit the format with a normal keyboard.


Oh interesting, it includes a specification for comments. That is likely going to be the most controversial part.

Does any implementation support/generate comments that way? The most I've seen so far is an oversized multiline header.
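If the proposed comment lines do start with "#", one simple (if imperfect) treatment is to filter them out before the parser sees them. A sketch in Python; note it doesn't handle the edge case of a "#" at the start of a line inside a quoted multi-line field:

```python
import csv

def read_csv_skipping_comments(lines, comment_prefix="#"):
    # Drop full-line comments, then hand the rest to a normal CSV parser.
    data_lines = (line for line in lines if not line.startswith(comment_prefix))
    return list(csv.reader(data_lines))

sample = [
    "# a comment line, per the draft's proposed syntax",
    "a,b,c",
    "1,2,3",
]
rows = read_csv_skipping_comments(sample)
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```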


Wadler's Law strikes again https://wiki.c2.com/?WadlersLaw


Part of me thinks that a new CSV standard with technical breaking changes (lines starting with #) is not needed because we have JSON array-of-arrays files.

The other part thinks this is quite cool and wishes your efforts well.


The new RFC allows for names for the fields: "3. The first record in the file MAY be an optional header with the same format as normal records. This header will contain names corresponding to the fields in the file" Is there a reliable way to tell if this first record has names or actual data?
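The closest thing I know of is a heuristic guess, e.g. Python's csv.Sniffer.has_header, which compares the first row's value types and lengths against later rows:

```python
import csv

# No in-band marker exists in plain CSV; the MIME "header" parameter is
# out-of-band. Sniffer.has_header is a heuristic, so it can guess wrong
# on data where the header resembles the rows.
sample_with_header = "name,age\nAda,36\nGrace,45\n"
sample_without = "1,2\n3,4\n5,6\n"

sniffer = csv.Sniffer()
print(sniffer.has_header(sample_with_header))  # True: "age" isn't numeric like 36/45
print(sniffer.has_header(sample_without))      # False: first row looks like the rest
```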


iirc text/csv; header=present


Oh, that's new to me! But not available after downloading, of course.


'Tis a pity most platforms don't (in practice) support saving MIME types as a file attribute.

(Even when some platforms have the facility to store the MIME type in an extended attribute, few applications will actually support retrieving and acting on that attribute.)


I feel the simple recommendation for all CSV use cases should be "use JSON".


JSON is easier for people to consume, but formats like CSV can be much more efficient in terms of space used, and easier for an application to consume.


Space doesn't matter when most of your documents are a few KB. And when it does, zip it.


If you're dealing with files that small then JSON is a great solution.

In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.


> In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.

If you store tables as list of objects, sure, but that's comparing the JSON way of doing something CSV can’t handle at all (lists of heterogenous objects) to CSV doing a much simpler thing.

A compact, dedicated JSON table representation (list of lists with the first list as header, or a single flat list with the header length — number of columns — as the first item, then the headers, then the data) is pretty closely comparable to CSV in size.
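A rough back-of-the-envelope comparison (the numbers are illustrative, not a benchmark):

```python
import csv
import io
import json

rows = [["id", "name"]] + [[str(i), f"user{i}"] for i in range(1000)]

# CSV encoding
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_size = len(buf.getvalue())

# Compact JSON list-of-lists (first list is the header)
json_size = len(json.dumps(rows, separators=(",", ":")))

# JSON list-of-objects: the common layout, which repeats every key per row
header, data = rows[0], rows[1:]
objs = [dict(zip(header, r)) for r in data]
obj_size = len(json.dumps(objs, separators=(",", ":")))

print(csv_size, json_size, obj_size)
# csv_size < json_size < obj_size: list-of-lists stays in CSV's ballpark,
# while list-of-objects pays for the repeated keys on every row.
```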

