
Most people don't directly query or otherwise operate on raw CSV, though. Large source datasets in CSV format still reign in many enterprises, but these are typically read into a dataframe, manipulated and stored as Parquet and the like, then operated upon by DuckDB, Polars, etc., or modeled (e.g. with dbt) and pushed to an OLAP target.


There are folks who still query CSV files directly in a data lake using a query engine like Athena, Spark or Redshift Spectrum, which ends up being much slower and consuming more resources than necessary because every query is a full table scan.

CSV is only good for append-only workloads.

But so is Parquet, and if you can write Parquet from the get-go, you save on storage as well as have a directly queryable column store from the start.

CSV still exists because of legacy data generating processes and a dearth of Parquet familiarity among many software engineers. CSV is simple to generate and easy to troubleshoot without specialized tools (compared to Parquet, which requires tools like VisiData). But you pay for it elsewhere.


How about using SQLite database files as an interchange format?


I haven't thought about sqlite as a data interchange format, but I was looking at deploying sqlite as a data lake format some time ago, and found it wanting.

1. Dynamically typed (with type affinity) [1]. This causes problems when there are multiple data generating processes. Newer versions of sqlite have a STRICT table type that enforces types, but only for the few basic types that it has.

2. Doesn't have a date/time type [1]. This is problematic because you can store dates as TEXT, REAL or INTEGER (it's up to the developer) and if you have sqlite files from > 1 source, date fields could be any of those types, and you have to convert between them.

3. Isn't columnar, so complex analytics at scale is not performant.
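To make points 1 and 2 concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (loose, tight, events) and the sample values are made up for illustration, and the STRICT part only runs if the bundled sqlite is 3.37 or newer. It is not meant as the one right way to handle this, just a demonstration of the behavior described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Point 1: type affinity. An INTEGER column still accepts arbitrary text.
conn.execute("CREATE TABLE loose (n INTEGER)")
conn.executemany("INSERT INTO loose VALUES (?)", [(1,), ("2",), ("abc",)])
# "2" is converted to an integer by affinity; "abc" is stored as text.
loose_types = [
    row[0] for row in conn.execute("SELECT typeof(n) FROM loose ORDER BY rowid")
]

# STRICT tables (sqlite >= 3.37) reject values that cannot be converted.
strict_rejected = None  # stays None if the bundled sqlite is too old
if sqlite3.sqlite_version_info >= (3, 37, 0):
    conn.execute("CREATE TABLE tight (n INTEGER) STRICT")
    try:
        conn.execute("INSERT INTO tight VALUES (?)", ("abc",))
        strict_rejected = False
    except sqlite3.Error:
        strict_rejected = True

# Point 2: no date/time type. Three hypothetical sources store the same
# instant (2024-01-01 00:00:00 UTC) as TEXT, REAL (Julian day) and INTEGER
# (Unix epoch seconds). The column has no declared type, so values are
# stored exactly as given.
conn.execute("CREATE TABLE events (source TEXT, ts)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("a", "2024-01-01 00:00:00"),  # ISO 8601 text
        ("b", 2460310.5),              # Julian day number
        ("c", 1704067200),             # Unix epoch seconds
    ],
)

# Normalize everything to Unix epoch seconds. strftime('%s', x) parses ISO
# text and treats a bare numeric argument as a Julian day number.
normalized = dict(
    conn.execute(
        """
        SELECT source,
               CASE typeof(ts)
                   WHEN 'integer' THEN ts
                   ELSE CAST(strftime('%s', ts) AS INTEGER)
               END
        FROM events
        """
    )
)

print(loose_types)
print(strict_rejected)
print(normalized)
```

Note that the consumer has to know (or guess from typeof) which convention each source used. An INTEGER could just as well be epoch milliseconds, and nothing in the file tells you that.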

I guess one can use sqlite as a data interchange format, but it's not ideal.

One area sqlite does excel in is as an application file format [2], and that's where it is mostly used [3].

[1] https://www.sqlite.org/datatype3.html

[2] https://www.sqlite.org/appfileformat.html

[3] https://en.wikipedia.org/wiki/SQLite#Notable_uses


exactly.. parquet is good for append only.. stream mods to parquet in new partitions.. compact, repeat.



