A critical XXE vulnerability in Apache Tika's tika-parser-pdf-module, affecting Apache Tika 1.13 through and including 3.2.1 on all platforms, allows an attacker to carry out XML External Entity injection via a crafted XFA file inside a PDF. An attacker may be able to read sensitive data or trigger malicious requests to internal resources or third-party servers. Note that tika-parser-pdf-module is used as a dependency in several Tika packages, including at least: tika-parsers-standard-modules, tika-parsers-standard-package, tika-app, tika-grpc and tika-server-standard.
Users are recommended to upgrade to version 3.2.2, which fixes this issue.
Is Microsoft somehow on board with this? In my experience there are CSV files, and then there are CSV files that Excel understands, and as long as you don't get your file format into Excel, that will seriously hinder adoption.
In my experience Excel doesn't even understand its own CSV files. If you save one using the Swedish locale, it uses semicolons as field separators, since the comma is used as the decimal point. Trying to open the resulting file using an English locale results in garbage.
When saving as CSV, Excel will use the regional "List separator" setting. You can override this in Windows 7 with Region and Language > Additional Settings > List separator.
If you are trying to generate a file that plays nice with Excel, there is a way to force a specific delimiter with the "sep" pragma:
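For example (made-up data; Excel consumes the sep= line itself rather than treating it as a record):

    sep=;
    name;amount
    "Svensson, Anna";1,50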
I think we should just stop using commas and newlines and start using the ASCII unit separator and record separator. It would alleviate most quoting and escaping issues because those two rarely appear in actual text.
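A minimal Python sketch of the idea (field names and data are made up); note it still breaks if a field somehow contains one of the separator bytes:

    US = "\x1f"  # ASCII unit separator, between fields
    RS = "\x1e"  # ASCII record separator, between records

    rows = [["name", "note"], ["Alice", "likes commas, and newlines"]]
    blob = RS.join(US.join(fields) for fields in rows)

    # Parsing needs no quoting or escaping rules at all.
    parsed = [record.split(US) for record in blob.split(RS)]
    assert parsed == rows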
For terminal mode applications, unit separator is Control+Underscore and record separator is Control+Caret (Ctrl+Shift+6 on a US English keyboard).
However, many terminal mode applications intercept those keys and expect them to be commands rather than data. Often there is some kind of quote key which you can press first. In readline and vim, the quote key is Ctrl+V. In emacs, it is Ctrl+Q instead.
GUI applications are more variable. But vim and emacs still support their respective quote keys in GUI mode.
And if you have to type this a lot, you can modify your vim/emacs/readline/etc configuration to remove the requirement for the quote key to be pressed first.
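For example (untested sketches, so treat the exact syntax as an assumption): in vim, an insert-mode mapping can make the key self-insert the literal byte, and readline's inputrc accepts hex escapes in macros:

    " ~/.vimrc: Ctrl+_ inserts a literal unit separator in insert mode
    inoremap <C-_> <C-v><C-_>

    # ~/.inputrc: Ctrl+_ inserts the unit separator byte directly
    "\C-_": "\x1f"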
Part of me thinks that a new CSV standard with technical breaking changes (lines starting with #) is not needed because we have JSON array-of-arrays files.
The other part thinks this is quite cool and wishes your efforts well.
The new RFC allows for names for the fields:
"3. The first record in the file MAY be an optional header with the same format as normal records. This header will contain names corresponding to the fields in the file"
Is there a reliable way to tell if this first record has names or actual data?
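In general you can only guess. For what it's worth, Python's csv module ships a heuristic for exactly this (a minimal sketch; the sample data is made up):

    import csv

    sample = "name,age\nAlice,30\nBob,25\n"
    # has_header() guesses by comparing the first row against the rest
    # (e.g. non-numeric "age" above a numeric column). It is a heuristic,
    # not a reliable test.
    print(csv.Sniffer().has_header(sample))  # True for this sample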
'Tis a pity most platforms don't (in practice) support saving MIME types as a file attribute.
(Even when some platforms have the facility to store the MIME type in an extended attribute, few applications will actually support retrieving and acting on that attribute.)
JSON is easier for people to consume, but formats like CSV can be much more space-efficient and easier for an application to consume.
If you're dealing with files that small then JSON is a great solution.
In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.
> In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.
If you store tables as lists of objects, sure, but that's comparing the JSON way of doing something CSV can't handle at all (lists of heterogeneous objects) to CSV doing a much simpler thing.
A compact, dedicated JSON table representation (list of lists with the first list as header, or a single flat list with the header length — number of columns — as the first item, then the headers, then the data) is pretty closely comparable to CSV in size.
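For instance, the list-of-lists form might look like this (made-up data):

    [["id", "name", "amount"],
     [1, "Anna", 1.5],
     [2, "Björn", 2.75]]

There is no per-row repetition of the keys, which is where the usual list-of-objects encoding loses ground to CSV.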