Hacker News | gunnarmorling's comments

Thanks! The heavy dependency footprint of parquet-java was the main driver for kicking off this project. Hardwood doesn't have any mandatory dependencies; any libraries for the compression algorithms used can be added by the user as needed (most of them are single JARs with no further transitive dependencies). The same goes for log bindings (Hardwood uses the System.Logger abstraction).
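The System.Logger part is a plain JDK API (since Java 9), so a library can log without any mandatory dependency. A minimal sketch of what that looks like (the class name is my own, not from Hardwood):

```java
// Sketch: logging via the JDK's built-in System.Logger abstraction.
// If no provider (e.g. an SLF4J bridge) is on the classpath, the JDK's
// default java.util.logging-backed console logger is used, so a library
// using this API ships with zero mandatory logging dependencies.
public class LoggerSketch {

    private static final System.Logger LOG =
            System.getLogger(LoggerSketch.class.getName());

    public static void main(String[] args) {
        // {0}-style placeholders are resolved via MessageFormat
        LOG.log(System.Logger.Level.INFO, "Parsed {0} row groups", 42);
    }
}
```

Applications pick the actual backend at deployment time by putting a `System.LoggerFinder` provider on the classpath; the library itself stays agnostic.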

Yes, absolutely, DuckDB is great. But I think there's a space and need for a pure Java library.

Thanks! See https://news.ycombinator.com/item?id=47206861 for some general comments on performance. I haven't measured bit unpacking specifically yet.

We have some first benchmarks here: https://github.com/hardwood-hq/hardwood/blob/main/performanc....

From the post:

> As an example, the values of three out of 20 columns of the NYC taxi ride data set (a subset of 119 files overall, ~9.2 GB total, ~650M rows) can be summed up in ~2.7 sec using the row reader API with indexed access on my MacBook Pro M3 Max with 16 CPU cores. With the column reader API, the same task takes ~1.2 sec.

In my measurements, this is significantly faster than parquet-java for the same task (which is not surprising, as Hardwood is multi-threaded); but I want to be sure I am setting up and configuring parquet-java correctly before publishing any comparisons. The test above also is hooked up to run parquet-java (and there's a set-up for PyArrow, too), so you could run it yourself on your machine if you wanted to.
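The multi-threading gain essentially comes down to this pattern: Parquet row groups are independent units, so their column chunks can be summed in parallel and the partial results combined. A rough plain-Java sketch of the idea (this is not Hardwood's actual API; the names and types here are made up for illustration):

```java
import java.util.List;

public class ParallelColumnSum {

    // Stand-in for the decoded chunks of one column, one array per
    // Parquet row group (hypothetical structure, not Hardwood's API).
    static double sumColumn(List<double[]> rowGroupChunks) {
        return rowGroupChunks.parallelStream()     // one task per row group
                .mapToDouble(chunk -> {
                    double s = 0;
                    for (double v : chunk) {       // sequential within a chunk
                        s += v;
                    }
                    return s;
                })
                .sum();                            // combine partial sums
    }

    public static void main(String[] args) {
        List<double[]> chunks = List.of(new double[]{1, 2}, new double[]{3, 4});
        System.out.println(sumColumn(chunks)); // prints 10.0
    }
}
```

Because row groups don't share state, this parallelizes with no coordination beyond the final reduction, which is why the column reader numbers above scale with core count.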

So far, we've spent most time optimizing for flat (non-nested) data sets which are fully parsed (either all columns, or with projections) and I think it's faring really well for those. There's no support for predicate push-down yet, so right now, Hardwood isn't optimal for use cases with high query selectivity; this is the next thing on the roadmap though.


I am working on a new Java parser for the Apache Parquet file format, with minimal dependencies and multi-threaded execution: https://github.com/hardwood-hq/hardwood.

I'm approaching the home stretch for a first 1.0 preview release, which will include support for parsing Parquet files with flat and nested schemas, all physical and logical column types, core and advanced encodings, projections, compression, multi-threading, etc., all with pretty decent performance.

Next on the roadmap are SIMD support, predicate push-down (bloom filters, statistics, etc.), and writer support.


Nice one, great to see this addition to the Rust ecosystem!

Reading through the README, this piqued my curiosity:

> Small or fast transactions may share the same WAL position.

I don't think that's true; each data change and each commit (whether explicit or not) has its own dedicated LSN.

> LSNs should be treated as monotonic but not dense.

That's not correct; commit LSNs are monotonically increasing, and within a transaction, event LSNs are monotonically increasing. I.e., the tuple (commit-LSN, event-LSN) is monotonically increasing, but LSNs per se are not. You can run multiple concurrent transactions to observe this.
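That ordering guarantee can be captured as a lexicographic comparator over the tuple: raw event LSNs of concurrent transactions may interleave, but (commit-LSN, event-LSN) pairs are totally ordered. A small sketch (the event type here is my own, just to illustrate the point; both LSNs are modeled as plain longs for simplicity):

```java
import java.util.Comparator;

public class LsnOrdering {

    // Hypothetical change event carrying its transaction's commit LSN
    // and its own per-event LSN.
    record ChangeEvent(long commitLsn, long eventLsn) {}

    // Events are totally ordered by (commit-LSN, event-LSN): first by the
    // commit LSN of the owning transaction, then by the event LSN within it.
    static final Comparator<ChangeEvent> ORDER =
            Comparator.comparingLong(ChangeEvent::commitLsn)
                      .thenComparingLong(ChangeEvent::eventLsn);
}
```

Sorting a change stream with this comparator yields commit order, with each transaction's events kept in their original sequence, even when a later-committing transaction contains events with lower raw LSNs.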


Good catch, you are correct. I did mix up a few things there, and the statements were incorrect, or at least very misleading.

To demonstrate your point, I created a gist for myself and others to see the (commit-LSN, event-LSN) ordering in action:

https://gist.github.com/vnvo/a8cf59fc3cd8719dbea56d3bb5201f9...

I'll update the README to reflect this more accurately. Appreciate you taking the time to point it out.


Oh nice, thanks for providing that data!

Made it to #369 in 2025 with morling.dev; let's see what's in store this year :)

  year  total_score  rank  days_mentioned
  2025  903          369   8
  2024  604          581   2
  2023  547          861   3
  2022  450          1165  4
  2021  188          2308  2


It's a non-issue with GraalVM native binaries. See https://news.ycombinator.com/item?id=46445989 for an example: this CLI tool starts in milliseconds, fast enough that you can launch it during tab completion and have it invoke a REST API without any noticeable delay whatsoever.

But even when running on the JVM, things have improved dramatically over the last few years, e.g. thanks to AOT class loading and linking. For instance, a single-node Kafka broker starts in ~300 ms.


Time comparisons are (or should be) relative. https://news.ycombinator.com/item?id=46447490

GraalVM has literally 500x more overhead than a statically linked dash script.

Maybe not an issue for terminal UIs, but the article mentions both TUIs and CLI tools. A lot of people use CLI tools with a shell. As soon as you do `for file in *.c; do tool "$file"; done` (as a simple example), pure overhead on the order of even 10s of ms becomes noticeable. This is not theoretical. I recently had this trouble with python3, but I didn't want to rewrite all my f-strings into python2. So, it does arise in practice. (At least in the practice of some.)


Assuming JVM installation is not required (to which I agree, it shouldn't be), why would you care which language a CLI tool is written in? I mean, do you even know whether a given binary is implemented in Go, Rust, etc.? I don't see how it makes any meaningful difference from a user perspective.

> Pkl, which is at least built using Graal Native Image, but (IMO) would _still_ have better adoption if it was written in something else.

Why do you think this is?


It makes a difference in size, in how arguments tend to be handled, and so forth.

As for why Pkl was in Java: it was originally built to configure apps written in Java, and it heavily uses Truffle. Pkl is a name chosen for open sourcing; it had a different name internally at Apple before that, which made the choices a little more obvious.


As a practical example of a Java-based CLI tool in the wild, here's kcctl, a command line client for Kafka Connect: https://github.com/kcctl/kcctl/. It's a native binary (via GraalVM), starting up in a few ms, so that it can actually be invoked during tab completion and do a round-trip to the Kafka Connect REST API without any noticeable delay whatsoever.

Installation is via brew, so it's the same experience as for all the other CLI tools you're using. The binary size is on the higher end (52 MB), but I don't think this makes any relevant difference for practical purposes. Build times with GraalVM are still not ideal (though getting better). Cross compilation is another sore point; I'm managing it via platform-specific GitHub Actions runners. From a user perspective, none of this matters; I'd bet most users don't know that kcctl is written in Java.

