Giving it a quick look, seems like they've addressed a lot of the shortcomings of Parquet which is very exciting. In no particular order:
- Parquet metadata is Thrift, but with comments saying "if this field exists, this other field must exist", and no code actually verifying the fact, so I'm pretty sure you could feed it bogus Thrift metadata and crash the reader.
- Parquet metadata must be parsed out, meaning you have to: allocate a buffer, read the metadata bytes, and then dynamically keep allocating a whole bunch of stuff as you parse the metadata bytes, since you don't know the size of the materialized metadata! Too many heap allocations! This file format's Flatbuffers approach seems to solve this as you can interpret Flatbuffer bytes directly.
- The encodings are much more powerful. I think a lot of people in the database community have been saying that we need composable/recursive lightweight encodings for a long time. BtrBlocks was the first open format like that I can remember, and then FastLanes followed up. Both of these were much better than Parquet by itself, so I'm glad ideas from those two formats are being taken up (see the toy sketch after this list).
- Parquet did the Dremel record-shredding thing which just made my brain explode and I'm glad they got rid of it. It seemed to needlessly complicate the format with no real benefit.
- Parquet data pages might contain different numbers of rows, so you have to scan the whole ColumnChunk to find the row you want. Here it seems like you can just jump to the DataPage (IOUnit) you want (sketched after this list).
- They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.
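To make the composable-encodings point concrete, here's a toy sketch in Python (not this format's actual implementation): dictionary-encode a string column, then run-length-encode the resulting codes, which is the kind of cascading that BtrBlocks/FastLanes-style schemes allow.

```python
def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Replace each value with a small integer code into a dictionary."""
    dictionary: dict[str, int] = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return list(dictionary), codes

def rle_encode(codes: list[int]) -> list[tuple[int, int]]:
    """Collapse runs of identical codes into (code, run_length) pairs."""
    runs: list[tuple[int, int]] = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs

# Low-cardinality, run-heavy column -> dictionary codes -> RLE over the codes.
column = ["US"] * 1000 + ["DE"] * 500 + ["US"] * 250
dictionary, codes = dictionary_encode(column)
print(dictionary, rle_encode(codes))  # ['US', 'DE'] [(0, 1000), (1, 500), (0, 250)]
```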
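And a tiny sketch of the random-access point: if every page holds a fixed number of rows (a hypothetical 8192 here), locating a row is simple arithmetic instead of a scan.

```python
ROWS_PER_PAGE = 8192  # hypothetical fixed page size

def locate_row(row_index: int) -> tuple[int, int]:
    """Return (page_number, offset_within_page) for a fixed-rows-per-page layout."""
    return divmod(row_index, ROWS_PER_PAGE)

print(locate_row(100_000))  # (12, 1696) -- jump straight to page 12
# With Parquet's variable-size data pages you instead have to walk the
# ColumnChunk's page headers (or its page index) until the cumulative
# row count passes row_index.
```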
Overall great improvement, I'm looking forward to this taking over the data analytics space.
> - They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.
"Heavyweight compression" as in zstd and brotli? That's very useful for columns of non-repeated strings. I get compression ratios in the order of 1% on some of those columns, because they are mostly ASCII and have lots of common substrings.
I think the wasm compiler is going to bring in more dependencies than the ‘heavy’ compression would have.
I think that more expensive compression may have made more of a difference 15 years ago when CPU was more plentiful compared to network or disk bandwidth.
I think zstd is a pretty common choice when you want compression in a new format or protocol. When you want to maintain some kind of index or random access, it works better to compress chunks of your data independently rather than the whole file as one stream. Similarly, if you have many chunks you can parallelise decompression (I'm not sure any kind of parallelism support should have been built into the zstd format itself, though it is useful for command-line uses).
A big problem for some people is that Java support for zstd is hard, as it isn't portable, so e.g. making a Java web server compress its responses with it isn't so easy.
Sure, I don't want to make a big deal about this but I have observed Java projects choosing to not support zstd for portability (or software packaging) reasons.
Depends on the use case. For transparent filesystem compression I would still recommend lz4 over zstd because speed matters more than compression ratio in that use case.
Typically requests are binned by context length so that they can be batched together. So you might have a 10k bin and a 50k bin and a 500k bin, and then you drop context past 500k. So the costs are fixed per-bin.
Makes sense, and each model has a max context length, so they could charge per token assuming the full context for that model if they wanted to price for the worst case.
I intuitively think about linear regression as attaching a spring between every point and your regression line (and constraining the spring to be vertical). When the line settles, that's your regression! Also gives a physical intuition about what happens to the line when you add a point. Adding a point at the very end will "tilt" the line, while adding a point towards the middle of your distribution will shift it up or down.
A while ago I think I even proved to myself that this hypothetical mechanical system is mathematically equivalent to doing a linear regression, since the system naturally tries to minimize the potential energy.
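A quick sketch of that equivalence, assuming every spring has the same constant k and is constrained to stay vertical: the elastic potential energy of the line y = ax + b is

```latex
E(a, b) = \sum_{i=1}^{n} \frac{k}{2}\,\bigl(y_i - (a x_i + b)\bigr)^2
        = \frac{k}{2} \sum_{i=1}^{n} r_i^2 ,
```

i.e. k/2 times the sum of squared residuals, so the position where the springs settle is exactly the ordinary least-squares line.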
Perfect analogy! The cool part is that your model also gives good intuition about the gradient descent part. The springs' forces are the gradients, and the act of the line "snapping" into place is the gradient descent process.
Technically, physical springs will also have momentum and overshoot/oscillate. But even that has an analogue that is used in practice: gradient descent with momentum.
Maybe I'm missing something, but why do people expect PoW to be effective against companies whose whole existence revolves around acquiring more compute?
I was under the impression that the bad crawlers exist because it's cheaper to reload the data all the time than to cache it somewhere. If this changes the cost balance, those companies might decide to download only once instead of over and over again, which would probably be satisfactory to everyone.
I've been beating the drum about this to everyone who will listen lately, but I'll beat it here too! Why don't we use seL4 for everything? People are talking about moving to a smart grid, having IoT devices everywhere, putting chips inside of people's brains (!!!), connecting cars to the internet, etc.
Anyway, it's insane that we have a mathematically proven secure kernel; we should use it! Surely there's a startup in this somewhere...
Almost all vulnerabilities are in apps and libraries which seL4 does little or nothing to solve. The only solution is secure coding across the entire stack which will reveal that much of the existing code is so low-quality that it just has to be thrown away and rewritten.
> Sure, we don’t produce anything, but we have companies with high revenues and we can raise money based on those revenues. We’ll both be rich!
I think this is the central hole in the argument that the US is stagnant. The money that investors give you has to come from somewhere! Particularly in venture capital, you only get returns if you produce value.
Nevertheless, I do agree with a lot of the points here.
In my view, the stagnant part begins when extractive industries grow uncontrollably (think financial services). Yes, money is sloshing around and the line goes up, but ultimately the value production behind it has remained the same.
Money is detached from real output. Especially true in a zero interest rate environment like we had until recently.
There is no reinvestment when rent-seeking activities and financialization take place. Wealth is accumulated but not reinvested for greater growth.
> in venture capital you only get returns if you produce value
I was pretty excited about Hare until Devault said that Hare wouldn't be doing multithreading as he preferred multiprocessing. That was a pretty big dealbreaker for me. The rest of the language looks quite clean though!
hare-ev [0] is using epoll under the covers, which means multithreading is already there. Especially as ev may be merged into the stdlib at some point.
Maybe I'm misunderstanding something, but it seems like ev is still multiprocessing? Reading the code, it looks like you can read/write to files, and if you want to kick off some other work it spawns a process. I don't see any instance of threads there.
epoll is orthogonal to threads. It _can_ be used in a multithreaded program, but it doesn't have to be. It may very well be implemented in terms of kernel threads, but that's not what I'm talking about. I'm talking about user-space threads.
Conceptually yes, but I suspect it's going to be a lot hairier in practice. For instance, I think there's some stuff that needs language support, such as thread-local storage. I'd guess it would be simpler to just re-implement threading from scratch using syscalls. But I also don't think the language provides any support for atomics, so you'd have to roll your own there.