Hacker News | yakovsh2's comments

Critical XXE in Apache Tika (tika-parser-pdf-module) in Apache Tika 1.13 through and including 3.2.1 on all platforms allows an attacker to carry out XML External Entity injection via a crafted XFA file inside of a PDF. An attacker may be able to read sensitive data or trigger malicious requests to internal resources or third-party servers. Note that the tika-parser-pdf-module is used as a dependency in several Tika packages including at least: tika-parsers-standard-modules, tika-parsers-standard-package, tika-app, tika-grpc and tika-server-standard.

Users are recommended to upgrade to version 3.2.2, which fixes this issue.


Default encoding is now UTF-8, due to the way the RFC process works it is not in the original document: https://www.iana.org/assignments/media-types/text/csv


(I'm the author of the RFC)

Not sure why this is trending now, but this RFC is currently being revised: https://datatracker.ietf.org/doc/html/draft-shafranovich-rfc...

Suggestions and comments are welcome here: https://github.com/nightwatchcybersecurity/rfc4180-bis


Is Microsoft somehow on board for this? In my experience there are CSV files, and then there are CSV files that Excel understands, and as long as your file format doesn't make it into Excel, that will seriously hinder adoption.


In my experience Excel doesn't even understand its own CSV files. If you save one using the Swedish locale, it uses semicolons as field separators, since the comma is used as the decimal point. Trying to open the resulting file using an English locale results in garbage.


When saving as CSV, Excel will use the regional "List separator" setting. You can override this in Windows 7 with Region and Language > Additional Settings > List separator.

If you are trying to generate a file that plays nice with Excel, there is a way to force a specific delimiter with the "sep" pragma:

    sep=|
    a|b|c
    1|2|3
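For instance, here's a quick Python sketch of writing and reading a file with that pragma (the helper names are made up; the "sep=" line itself is an Excel extension, not part of RFC 4180):

```python
import csv
import io

def write_with_sep_pragma(f, rows, delimiter="|"):
    # Excel-style pragma on the first line, then ordinary delimited rows.
    f.write(f"sep={delimiter}\r\n")
    csv.writer(f, delimiter=delimiter).writerows(rows)

def read_with_sep_pragma(f, default=","):
    # Honor a "sep=" first line if present; otherwise rewind and use the default.
    first = f.readline()
    if first.startswith("sep="):
        delimiter = first.strip()[4:5] or default
    else:
        delimiter = default
        f.seek(0)
    return list(csv.reader(f, delimiter=delimiter))

buf = io.StringIO()
write_with_sep_pragma(buf, [["a", "b", "c"], ["1", "2", "3"]])
buf.seek(0)
rows = read_with_sep_pragma(buf)
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```

Note that a standards-compliant parser that doesn't know about the pragma will see "sep=|" as a one-field data row, which is exactly the compatibility problem being discussed.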


Please give up on that and make CSV itself nice and simple. :) Let's have it, err, comma separated.

Simple is good.

If it's really simple even Microsoft will be able to implement it.


Would that work if you tried to use the ASCII record separator character with the sep pragma?


Just open it in LibreOffice and save it to .xlsx. It's the only way I've found that actually works reliably.


I think we should just stop using commas and newlines and start using the ASCII unit separator and record separator. It would alleviate most quoting and escaping issues because those two rarely appear in actual text.
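As a sketch of how that could look (plain Python; US is 0x1F, RS is 0x1E, and since neither appears in normal text, no quoting rules are needed):

```python
# "ASCII-delimited" table: unit separator between fields,
# record separator between records.
US, RS = "\x1f", "\x1e"

def encode(rows):
    return RS.join(US.join(fields) for fields in rows)

def decode(blob):
    return [record.split(US) for record in blob.split(RS)]

# Commas and newlines in the data survive round-tripping untouched.
rows = [["name", "note"], ["Ada", "likes, commas\nand newlines"]]
blob = encode(rows)
print(decode(blob) == rows)  # True
```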


Until I can type those ASCII separators on my keyboard, they’re not going to happen. That’s why CSV won out: the comma key already existed.


For terminal mode applications, unit separator is Control+Underscore and record separator is Control+Caret (Ctrl+Shift+6 on a US English keyboard).

However, many terminal mode applications intercept those keys and expect them to be commands rather than data. Often there is some kind of quote key which you can press first. In readline and vim, the quote key is Ctrl+V. In emacs, it is Ctrl+Q instead.

GUI applications are more variable. But vim and emacs still support their respective quote keys in GUI mode.

And if you have to type this a lot, you can modify your vim/emacs/readline/etc configuration to remove the requirement for the quote key to be pressed first.


Sure, maybe, but that won’t be CSV.


Maybe that would be an "SSV"? Separator Separated Values file.


On the other hand, if the characters can still technically exist in the content, their rarity just makes the problems harder to spot.

In this sense, sticking to relatively common separators is good, because they encourage you to do the right thing from the start.


That would just make it easier to implement the standard in a way that works most of the time but breaks on valid input. Why would that be desirable?


Significant whitespace is nasty; by definition, it's hard to see errors. It's not convenient if you can't edit the format with a normal keyboard.


Oh interesting, it includes a specification for comments. That is likely going to be the most controversial part.

Does any implementation support/generate comments that way? The most I've seen so far is an oversized multiline header.
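If the proposed comment lines do start with "#", one simple (if imperfect) treatment is to filter them out before the parser sees them. A sketch in Python; note it doesn't handle the edge case of a "#" at the start of a line inside a quoted multi-line field:

```python
import csv

def read_csv_skipping_comments(lines, comment_prefix="#"):
    # Drop full-line comments, then hand the rest to a normal CSV parser.
    data_lines = (line for line in lines if not line.startswith(comment_prefix))
    return list(csv.reader(data_lines))

sample = [
    "# a comment line, per the draft's proposed syntax",
    "a,b,c",
    "1,2,3",
]
rows = read_csv_skipping_comments(sample)
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```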


Wadler's Law strikes again https://wiki.c2.com/?WadlersLaw


Part of me thinks that a new CSV standard with technical breaking changes (lines starting with #) is not needed because we have JSON array-of-arrays files.

The other part thinks this is quite cool and wishes your efforts well.


The new RFC allows for names for the fields: "3. The first record in the file MAY be an optional header with the same format as normal records. This header will contain names corresponding to the fields in the file" Is there a reliable way to tell if this first record has names or actual data?
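The closest thing I know of is a heuristic guess, e.g. Python's csv.Sniffer.has_header, which compares the first row's value types and lengths against later rows:

```python
import csv

# No in-band marker exists in plain CSV; the MIME "header" parameter is
# out-of-band. Sniffer.has_header is a heuristic, so it can guess wrong
# on data where the header resembles the rows.
sample_with_header = "name,age\nAda,36\nGrace,45\n"
sample_without = "1,2\n3,4\n5,6\n"

sniffer = csv.Sniffer()
print(sniffer.has_header(sample_with_header))  # True: "age" isn't numeric like 36/45
print(sniffer.has_header(sample_without))      # False: first row looks like the rest
```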


iirc text/csv; header=present


Oh, that's new to me! But not available after downloading, of course.


'Tis a pity most platforms don't (in practice) support saving MIME types as a file attribute.

(Even when some platforms have the facility to store the MIME type in an extended attribute, few applications will actually support retrieving and acting on that attribute.)


I feel the simple recommendation for all CSV use cases should be "use JSON".


JSON is easier for people to consume, but formats like CSV can be much more efficient in terms of space used, and easier for an application to consume.


Space doesn't matter when most of your documents are a few KB. And when it does, zip it.


If you're dealing with files that small then JSON is a great solution.

In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.


> In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.

If you store tables as list of objects, sure, but that's comparing the JSON way of doing something CSV can’t handle at all (lists of heterogenous objects) to CSV doing a much simpler thing.

A compact, dedicated JSON table representation (list of lists with the first list as header, or a single flat list with the header length — number of columns — as the first item, then the headers, then the data) is pretty closely comparable to CSV in size.
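A rough back-of-the-envelope comparison (the numbers are illustrative, not a benchmark):

```python
import csv
import io
import json

rows = [["id", "name"]] + [[str(i), f"user{i}"] for i in range(1000)]

# CSV encoding
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_size = len(buf.getvalue())

# Compact JSON list-of-lists (first list is the header)
json_size = len(json.dumps(rows, separators=(",", ":")))

# JSON list-of-objects: the common layout, which repeats every key per row
header, data = rows[0], rows[1:]
objs = [dict(zip(header, r)) for r in data]
obj_size = len(json.dumps(objs, separators=(",", ":")))

print(csv_size, json_size, obj_size)
# csv_size < json_size < obj_size: list-of-lists stays in CSV's ballpark,
# while list-of-objects pays for the repeated keys on every row.
```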

