Hacker News | jasim's comments

Parquet files include a field called key_value_metadata in the FileMetadata structure; it sits in the footer of the file. See: https://github.com/apache/parquet-format/blob/master/src/mai...

The technique described in the article seems to use these key-value pairs to store pointers to the additional metadata (in this case a distinct-value index) embedded in the file. Note that we can embed arbitrary binary data in a Parquet file between data pages. This is perfectly valid, since all Parquet readers rely on the exact offsets to the data pages specified in the footer.

This means that DataFusion does not need to specify how the metadata is interpreted. It is already well specified as part of the Parquet file format itself. DataFusion is an independent project -- it is a query execution engine for OLAP / columnar data, which can take in SQL statements, build query plans, optimize them, and execute them. It is an embeddable runtime with numerous ways for the host program to extend it. Parquet is a file format supported by DataFusion because it is one of the most popular ways of storing columnar data in object stores like S3.

Note that the readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes - as long as we're embedding only supplementary information like indices or bloom filters, a reader can still continue working with the columnar data in Parquet as it used to; it is just that it won't be able to take advantage of the additional metadata.
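For anyone who wants to poke at the footer metadata from Python, here is a minimal sketch with pyarrow (file name and key are made up; if I remember the behavior right, custom schema metadata ends up in the footer's key_value_metadata). It only shows the key-value pairs, not the embedded index bytes themselves, which need a lower-level writer:

  import pyarrow as pa
  import pyarrow.parquet as pq

  # Attach a custom key-value pair; pyarrow stores it in the footer's
  # key_value_metadata alongside its own entries.
  table = pa.table({"nation": ["Singapore", "India"]})
  table = table.replace_schema_metadata({"my_index_offset": "12345"})
  pq.write_table(table, "example.parquet")

  # Any reader can list the pairs without knowing what they mean.
  print(pq.read_metadata("example.parquet").metadata)
  # e.g. {b'my_index_offset': b'12345', b'ARROW:schema': ...}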


> Note that the readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes

The one downside of this approach, which is likely obvious but which I haven't seen mentioned, is that the resulting Parquet files are larger than they would otherwise be, and the increased size only benefits engines that know how to interpret the new index.

(I am an author)


So, can we take that as a "no"?


There is no spec. Personally I hope that the existing indexes (bloom filters, zone maps) get re-designed to fit into a paradigm where Parquet itself has more first-class support for multiple levels of indexes embedded in the file, and conventions for how those common types are represented. That is, start with Wild West and define specs as needed


> That is, start with Wild West and define specs as needed

Yes this is my personal hope as well -- if there are new index types that are widespread, they can be incorporated formally into the spec

However, changing the spec is a non-trivial process and requires significant consensus and engineering

Thus the methods used in the blog can be used to add indexes prior to any spec change, and potentially as a way to prototype / prove out new potential indexes

(note I am an author)


I think this post is a response to some new file format initiatives, based on the criticism that the Parquet file format is showing its age.

One of the arguments is that there is no standardized way to extend Parquet with new kinds of metadata (like statistical summaries, HyperLogLog etc.)

This post was written by the DataFusion folks, who have shown a clever way to do this without breaking backward compatibility with existing readers.

They have inserted arbitrary data between the data pages and the footer, which other readers will ignore but which query engines like DataFusion can exploit. They embed a new index into the .parquet file and use it to improve query performance.

In this specific instance, they add an index with all the distinct values of a column. Then they extend the DataFusion query engine to exploit it, so that queries like `WHERE nation = 'Singapore'` can use that index to figure out whether the value exists in that .parquet file without having to scan the data pages (which is already optimized, because there is a min-max filter to avoid scanning the entire dataset).
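The pruning check itself is tiny. A rough sketch (the index format and helper here are hypothetical, just to show the shape of the idea):

  # Hypothetical: the set of distinct values for a column, read back from
  # the custom index that the footer metadata points at.
  def can_skip_file(distinct_values: set, predicate_value: str) -> bool:
      # If the value isn't in the file's distinct set, no row can match,
      # so the data pages never need to be scanned.
      return predicate_value not in distinct_values

  # e.g. WHERE nation = 'Singapore'
  if can_skip_file({"India", "Malaysia"}, "Singapore"):
      print("skip this .parquet file entirely")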

Also in general this is a really good deep dive into columnar data storage.


One question that the article does not cover: compaction. Adding custom indexes means you have to have knowledge of the indexes to compact Parquet files, since you'll want to reindex each time compaction occurs. Otherwise the indexes will at best be discarded. At worst they would even be corrupted.

So it looks as if adopting custom indexes means you are adopting not just a particular engine for reading but also a particular engine for compaction. That in turn means you can't use generic mechanisms like the compaction mechanism in S3 table buckets. Am I missing something?


What are the new file format initiatives you're referencing here?

This solution seems clever overall, and finding a way to bolt on features of the latest-and-greatest new hotness without breaking backwards compatibility is a testament to the DataFusion team. Supporting legacy systems is crucial work, even if things need a ground-up rewrite periodically.


Off the top of my head:

- Vortex https://github.com/vortex-data/vortex

- Lance https://github.com/lancedb/lance

- Nimble https://github.com/facebookincubator/nimble

There are also a bunch of ideas coming out of academia, but I don't know how many of them have a sustained effort behind them and not just a couple of papers


Lance (from LanceDB folks), Nimble (from Meta folks, formerly known as Alpha); I think there are a few others

https://github.com/lancedb/lance

https://github.com/facebookincubator/nimble


I’ve been excited about LanceDB and its ability to support vector indexes and efficient row-level lookups. I wonder if this approach would work for their design goals while still allowing broader backwards compatibility with the Parquet ecosystem. I’ve also been intrigued by DuckLake, which has leaned into Parquet. Perhaps this approach will allow more flexible indexing while keeping compatibility with the broader Parquet ecosystem, which is significant.


Yeah, I'm happy to see this. We have been curious about this as part of figuring out cloud-native storage extensions for GFQL (a graph dataframe-native query language), and my intuition was that Parquet was pluggable here... this is the first cogent writeup I'm seeing.

This also means, afaict, that it's pretty straightforward to do novel indexing schemes within Iceberg as well, just by reusing this.

The other aspect I've been curious about is a happy path for pluggable types for custom columns. This shows one way, but I'm unclear if it's the same thing.


We are actively working on supporting extension types. The mechanism is likely to be using the Arrow extension type mechanism (a logical annotation on top of existing Arrow types https://arrow.apache.org/docs/format/Columnar.html#format-me...)

I expect this to be used to support Variant https://github.com/apache/datafusion/issues/16116 and geometry types

(note I am an author)


I'm not sure if this is what you're looking for, but there is a proposal in DataFusion to allow user defined types. https://github.com/apache/datafusion/issues/12644


Thank you, looking forward to reading!


My main problem with the Parquet format is that it depends on Facebook's Thrift (a competitor to gRPC).


nice summary!


Accounting, specifically book-keeping, really plays to the strengths of LLMs - pattern matching within a bounded context.

The primary task in book-keeping is to classify transactions (from expense vouchers, bank transactions, sales and purchase invoices and so on) and slot them into the Chart of Accounts of the business.

LLMs can already do this well without any domain- or business-specific context. For example, a fuel entry is so obvious that they can match it to a similar-sounding account in the CoA.

And for others where human discretion is required, we can add a line of instruction in the prompt, and that classification is permanently encoded. A large chunk of these kinds of entries are repetitive in nature, and so each such custom instruction is a long-term automation.
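As a rough sketch of what such a prompt can look like (the account names and rule below are made up, and the actual model call is left out):

  # Made-up chart of accounts and custom rules, just to show the shape.
  chart_of_accounts = ["expenses:fuel", "expenses:food", "expenses:vehicle", "assets:bank"]
  custom_rules = ["Anything mentioning 'toll' goes to expenses:vehicle"]

  def classification_prompt(narration: str) -> str:
      return (
          "Classify this transaction into exactly one of these accounts:\n"
          + "\n".join(chart_of_accounts)
          + "\n\nRules:\n" + "\n".join(custom_rules)
          + f"\n\nNarration: {narration}\nAnswer with the account name only."
      )

  print(classification_prompt("HPCL FUEL STATION 0042"))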

You might not have been speaking about simple book-keeping, though. If so, I'm curious to learn.


Audit is at the heart of accounting, and LLMs are the antithesis of an audit trail.


I'm sorry I don't follow. The fact that you use an LLM to classify a transaction does not mean there is no audit trail for the fact. There should also be a manual verifier who's ultimately responsible for the entries, so that we do not abdicate responsibility to black boxes.


If you mark data as "Processed by LLM", that in turn taints all inference from it.

Requirements for a human in the loop devolve to ticking a box by someone who doesn't realise the responsibility they have been burdened with.

Mark my words, some unfortunate soul will then be thrown under the bus once a major scandal arises from such use of LLMs.

As an example, companies aren't supposed to use AI for hiring; they are supposed to have all decisions made by a human in the loop. Inevitably this just means presenting a massive grid of outcomes to someone who never actually goes against the choices of the machine.

The more junior the employee, the "better". They won't challenge the system, and they won't realise the liability they're setting themselves up with, and the company will more easily shove them under the proverbial bus if there ever is an issue.

Hiring is too nebulous, too hard to get concrete data for, and its outcomes are too hard to inspect and properly check.

Financial auditing however is the opposite of that. It's hard numbers. Inevitably when discrepancies arise, people run around chasing other people to get all their numbers close enough to something that makes sense. There's enough human wiggle-room to get away with chaotic processes that still demand accountability.

This is possibly the worst place you could put LLMs, if you care about actual outcomes:

1. Mistakes aren't going to get noticed.

2. If they are noticed, people aren't going to be empowered to actually challenge them, especially once they're used to the LLM doing the work.

3. People will be held responsible for the LLM's mistakes, despite pressure to sign off (and the general sense of time pressure in audit is already immense).

4. It's a black box, so faults cannot be easily diagnosed; the best you can do is try to re-prompt in a way that avoids them.


Well put. It should always be "Created by <person>" rather than "Processed by LLM". We can already see it with Claude Code - its commit messages contain a "Generated by Claude Code" line, and it guarantees a pandemic of diffused responsibility in software engineering. But I think there is no point in railing against it - market forces, corporate incentives, and tragedy of the commons all together make it an inevitability.


Now instead of having accountants audit transactions you will have accountants audit LLM output for possible hallucinations. Seems counter productive.


I can think of two instances where the LLM embracing best practices for human thought leads to better results.

Claude Code breaks down large implementations to simpler TODOs, and produces far better code than single-shot prompts. There is something about problem decomposition that works well no matter whether it is in mathematics, LLMs, or software engineers.

The decomposition also shows a split between planning and execution. Doing them separately somehow provides the LLM more cognitive space to think.

Another example is CHASE-SQL, one of the top approaches on the BIRD text-to-SQL benchmark (bird-bench). They take a human textual data requirement and, instead of directly asking the LLM to generate a SQL query, run it through multiple passes: generating portions of the requirement as pseudo-SQL fragments using independent LLM calls, combining them, then using a separate ranking agent to find the best candidate. Additional agents, like a fixer for invalid SQL, are also used.

What could've been done with a single direct LLM query is instead broken down into multiple stages. What was implicit (find the best query) is made explicit. And from how well it performs, it is clear that articulating fuzzy thoughts and requirements into explicit smaller clearer steps works as well for LLMs as it does for humans.
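For the shape of it, here is a heavily simplified sketch of such a multi-pass pipeline (not the actual CHASE-SQL implementation; llm() is a stand-in for whatever model client you use):

  def llm(prompt: str) -> str:
      # Stand-in for a real model call; plug in your own client here.
      raise NotImplementedError

  def generate_candidates(requirement: str, n: int = 3) -> list:
      # Independent passes, each drafting a query on its own.
      return [llm(f"Write a SQL query for: {requirement}") for _ in range(n)]

  def fix(sql: str) -> str:
      # A separate 'fixer' pass repairs queries that don't parse or run.
      return llm(f"Fix this SQL if it is invalid, otherwise return it unchanged:\n{sql}")

  def rank(requirement: str, candidates: list) -> str:
      # A ranking pass makes the implicit 'find the best query' step explicit.
      numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
      choice = llm(f"Requirement: {requirement}\nCandidates:\n{numbered}\nReply with the index of the best one.")
      return candidates[int(choice)]

  def text_to_sql(requirement: str) -> str:
      return rank(requirement, [fix(c) for c in generate_candidates(requirement)])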


Excellent examples! The decomposition pattern is key. I've been exploring a related approach where different perspectives operate within the same session, sharing context.

The difference: instead of sequential passes, you engage multiple viewpoints simultaneously. They build on each other's insights in real-time. Try this experiment:

Copy this prompt: https://github.com/achamian/think-center-why-maybe/blob/main...

Start with: "Weaver, I need to reply to an important email. Here's the context: [email details, recipient biases, objectives]"

After Weaver provides narrative strategy, ask: "Council, what are we missing?" Watch different perspectives emerge - Maker suggests concrete language, Checker spots assumptions, O/G notes psychological dynamics

Critical discovery: The tone matters immensely. Treat perspectives as respected colleagues - joke with them, thank them, admit mistakes. This isn't anthropomorphism - it functionally improves outputs. Playful collaboration enables perspectives to expand beyond initial boundaries.

What makes this powerful: all perspectives share evolving context while the collaborative tone enables breakthrough insights that rigid commanding never achieves.


Follow-up observation: The XP interaction pattern is crucial here.

When onboarding a friend, I used this framing: "Treat Waver/Maker/Checker like three intelligent interns on your team." This immediately shifted his mental model from "prompt engineering" to team collaboration. His first reaction revealed everything: "I don't like Checker - keeps raising objections." I explained that's literally Checker's job - like a good QA engineer finding bugs.

The parallel to XP practices became clear:

- Waver explores the solution space (like brainstorming)

- Maker implements concrete solutions (like coding)

- Checker prevents mistakes (like code review/QA)

What makes this powerful: You're not optimizing prompts, you're managing a collaborative process. When you debate with Checker about which objections matter, Checker learns and adapts. Same context, same "prompt", totally different outcomes based on interaction quality.

When you ask Maker and Weaver to observe your conversation with Checker, they notice how feedback is given and received. It is important to create an environment where "Feedback is a judgement free zone".

The resistance points are where breakthroughs happen. If you find yourself annoyed with one perspective, that's usually the signal to engage more deeply with its purpose, not bypass it.

[Related observation on how collaborative tone enables evolution: https://github.com/achamian/think-center-why-maybe/blob/main...]


I've been exploring decomposition patterns where different perspectives operate within the same session, sharing context.

Key discovery: Treat perspectives like team members. I told a friend: "Think of Waver/Maker/Checker as three intelligent interns on your team." His first reaction: "I don't like Checker - too many objections." That's when it clicked - it's Checker's JOB to object, like QA finding bugs.

This is NOT anthropomorphizing - it's lens selection. The labels activate specific response patterns, not personalities. Like switching between grep, awk, and sed for different text processing.

Once I started debating with Checker about which objections mattered (rather than dismissing them), output quality jumped dramatically. The interaction pattern matters more than the prompt structure.

Try this: Copy the prompt from [0], then engage with genuine collaboration - thank good insights, push back on weak objections, ask for clarification.

Just a reminder - talking politely helps.

[0]: https://github.com/achamian/think-center-why-maybe/blob/main...


> Instead of using negative numbers, Accounts have normal balance: normal credit balance literally means that they are normal when its associated entries with type credit have a total amount that outweighs its associated entries with type debit. The reverse is true for normal debit balance.

But that is an interpretation made by the viewer. A customer account is typically an asset account, whose balance sits in the debit column. But if we somehow owe them money because, say, they paid us an advance, then their balance should be in the credit column. The accounting system need not bother with what the "right" place for each account is.

It is quite practical to have only a simple amount column rather than separate debit/credit columns in a database for journal entries. As long as we follow a consistent pattern in mapping user input (debit = positive, credit = negative) into the underlying tables, and the same when rendering accounting statements back, it would remain consistent and correct.
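A tiny sketch of that mapping (assuming each journal line is just an account and a signed amount):

  # Debit/credit at the input and presentation edges, one signed amount inside.
  def to_signed(side: str, amount: float) -> float:
      return amount if side == "debit" else -amount

  def to_side(amount: float):
      return ("debit", amount) if amount >= 0 else ("credit", -amount)

  # A customer advance stored as -500 renders back in the credit column:
  print(to_side(to_signed("credit", 500.0)))  # ('credit', 500.0)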


> It is quite practical to have only a simple amount column rather than separate debit/credit columns in a database for journal entries. As long as we follow a consistent pattern in mapping user input (debit = positive, credit = negative) into the underlying tables, and the same when rendering accounting statements back, it would remain consistent and correct.

Another benefit of having Credit / Debit sides in double-entry bookkeeping is that you need to balance both sides in a single transaction. Say the user account 2003201 is on the Credit side and gets an addition of 1000 in value; the same value needs to be added on the Debit side. If it's (1) a cash top-up, then 1000 needs to be added to the Cash account (let's say 1001001) on the Debit side. Otherwise, if it's (2) a transfer from another user account 203235, then that account needs to be debited 1000 as well.

It's Assets = Liabilities + Equity, where the left side is Debit (it increases in value when a debit transaction happens) and the right side is Credit (it increases when a credit transaction happens). In case (1), the cash account increases since it's a Debit account, while in case (2) the user account decreases because it's a debit transaction on a Credit account.


With negative/positive, the invariant would be sum(amount) = 0; with your approach, it would be sum(debit-credit)=0. Both are valid, it is just two ways of expressing the same thing.

I think it is useful to think about double-entry book-keeping in two layers. One is the base primitive of the journal - where each transaction has a set of debits and credits to different accounts, which all total to 0.

Then above that there is the chart of accounts, and how real-world transactions are modelled. For an engineer, to build the base primitive, we only need a simple schema for accounts and transactions. You can use either amount (+/-ve), or debit/credit for each line item.

Then if you're building the application layer which creates entries, like your top-up example, then you also need to know how to _structure_ those entries. If you have a transfer between two customer accounts, then you debit the one who's receiving the money (because assets are marked on the debit side) and credit the other (because liabilities are on the credit side). If you receive payment, then cash is debited (due to assets), and the income account is credited (because income balances are on the credit side).

However, all of this has nothing to do with how we structure the fundamental primitive of the journalling system. It is just a list of accounts, and then a list of transactions, where each transaction has a set of accounts that get either debited or credited, with the sum of the entire transaction coming to 0. That's it -- that constraint is all there is to double-entry book-keeping from a schema point of view.
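A minimal sketch of that primitive (field names here are made up; a real system would add dates, narration, currencies and so on):

  from dataclasses import dataclass

  @dataclass
  class Leg:
      account: str
      amount: float  # debit = positive, credit = negative

  def post(legs):
      # The one constraint: every transaction's legs sum to zero.
      assert abs(sum(l.amount for l in legs)) < 1e-9, "unbalanced transaction"
      return legs

  # Cash sale: debit cash (asset), credit income.
  post([Leg("assets:cash", 1000.0), Leg("income:sales", -1000.0)])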


That part of the article felt quite wrong to me as well. I've built accounting systems that worked well for a decade, where internally the values were a single amount column in the journal table. If it was a debit, it'd be positive, if a credit, it'd be negative.

In fact, we could call these values yin and yang, for all it mattered.

Also, I'm not able to really follow what he means by "money = assets in the future".

Money is money, but if you wanted to track the intermediate state until the customer takes receipt, you would use an In Transit account (Goods In Transit / Services In Transit, etc.).

Yet, it doesn't change the fundamental definition of the value in the accounting system. I think the author confuses an engineering concept (sagas, or thunks, or delayed but introspectable/cancellable actions in general) with accounting.


> Also, I'm not able to really follow what he means by "money = assets in the future".

I’m guessing it’s one of two things:

1. A transaction might fail. If you enter a transaction into your bank’s website or your credit card company’s website, you should probably record it in your ledger right away. But the transaction might get canceled for any number of reasons. And the money will not actually move instantly, at least in the US with some of the slower money moving mechanisms.

2. In stocks and other markets, settlement is not immediate. A trade is actually a promise by the parties to deliver the assets being traded at a specific time or range of times in the future. One probably could model this with “in transit” accounts, but that sounds quite unpleasant.

FWIW, I’ve never really been happy with any way that I’ve seen accounting systems model accruals and things in transit. I’ve seen actual professional accountants thoroughly lose track of balance sheet assets that are worth an exactly known amount of cash but are a little bit intangible in the sense that they’re not in a bank account with a nice monthly statement.


Money never moves instantly: light speed is a limit (and also, something can always happen to the message(s)).


This isn't really the issue, I think; the question is whether money always moves fast enough that you can model it differently: as an atomic object that either exists and is complete, or doesn't exist at all. Can you just have the caller wait some milliseconds and either get an error or a success meaning it's done? The answer is of course no, but there are plenty of things that can be modeled this way.


IMO it's entirely wrong, and it also makes it a lot more difficult to programmatically create transactions with 3+ legs (for example, a payment with a line item + sales tax).

I think the author is just wrong on that point, but the rest is sound. (Source: I've built bookkeeping software)


There are no 3-leg transactions in a ledger. You described an order. Ledger transactions are one layer deeper. An order creates 2 different transactions: the payment corresponds to payment accounts, the taxes to Taxes Payable. That is how classic bookkeeping works.


Sorry what? No, I'm describing a transaction. There is nothing preventing n-leg transactions in a ledger. Neither a physical one, nor a digital one.

They're complicated to balance, so it's not commonly done in physical ledgers for sure, but in digital ledgers it's completely fine. You just have to make sure they do balance.

Orders are not relevant for ledgers. The system I describe is relevant for individual transactions -- for example, a single bank payment that pays for two outstanding invoices at once absolutely SHOULD create a single transaction with three legs: One out of the bank account, two to payables.
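Sketched as data (account names made up), such a payment still balances with three legs:

  # One bank payment settling two outstanding invoices.
  legs = [
      {"account": "assets:bank",                  "debit": 0.0,   "credit": 1500.0},
      {"account": "liabilities:payable:invoice-a", "debit": 900.0, "credit": 0.0},
      {"account": "liabilities:payable:invoice-b", "debit": 600.0, "credit": 0.0},
  ]
  assert sum(l["debit"] for l in legs) == sum(l["credit"] for l in legs)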


A single bank payment for two outstanding invoices to a single entity, or to two different entities? In the latter case, different accounts should be credited. Technically it is possible to account for it, but a bank should follow accounting practices and generate multiple transactions on different accounts. The higher-level document could indeed be atomic, so it either registers as a batch or does not register at all.


In the former case, yes, not the latter. Real-life scenario: You have a supplier doing both staff outsourcing and material supplies, you have two outstanding invoices in each of those categories, your pending payments in those are tracked separately (for whatever reason), and you do a single bank payment for both.

Anyway, this is just a simple example, but an invoice with VAT on it is IMO the most common one. Or another one my software supports: a bank transaction with embedded banking fees. Some banks do separate fee transactions, but not all. Currency conversion fees are another example.


I'm curious to hear more about this. I've seen very little hallucination with mainstream LLMs where the conversation revolves around concepts that were well-represented in the training data. Most educational topics thus have been quite solid. Even asking for novel analogies between distant and unrelated topics seem to work well.


I haven't messed with it in a few months but something that used to consistently cause problems was asking specific questions about hypotheticals where there may be a non-matching real example in the dataset.

Kind of hard to explain but for example giving a number of at-bats and hits for a given year for a baseball player and asking it to calculate their batting average from that. If you used a real player's name it would pull some or all of their actual stats from that year, rather than using the hypothetical numbers you provided.

I think this specific case has been fixed, and with stats-based stuff like this it's easy to identify and check for. But I think this general type of error is still around.


Thanks, that makes sense. I avoid using LLMs for math because they are only text-token prediction systems (but magical ones at that) and can't do true numeric computation. But making them write code to compute works well.


Plain Text Accounting has become significantly easier to do for me on a regular basis, thanks to LLMs. Specifically: importing bank statements into hledger and avoiding manual entry.

I use a JSON file to map bank entries to my hledger accounts. For new transactions without mappings, I run a Python script that generates a prompt for Claude. It lists my hledger accounts and asks for mappings for the new entries.

Claude returns hledger journal entries based on these mappings, which I can quickly review.

Then another script prints out hledger journal entries for that month's bank transactions, all cleanly mapped. It takes me just a few minutes to tweak and finalize.

I can also specify these mapping instructions in plain-language which would've otherwise been a fragile hodgepodge of regexps and conditionals.


You don't consider an LLM fragile? Also bold to send your banking information to Anthropic.


Good question. LLMs are surprisingly less fragile than hand-coded parsers for unstructured data like the ones in a bank statement.

And to be clear - I'm not sending the entire statement to Claude; instead, only the account name/narration of those transactions for which I don't already have a mapping. Claude then returns a well-formatted JSON that maps "amzn0026765260@apl" to "expenses:amazon", and "Veena Fuels" to "expenses:vehicle", and so on.

I can also pass in general instructions saying that "Restaurants and food-related accounts are categorized under 'expenses:food'", and it does a good job of mapping most of my dining out expenses to the correct account head.

The actual generation of journal entries is done by a simple Python script. The mapping used to be the hardest part; what used to need custom classification models is now just a simple prompt to an LLM.
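The deterministic part is genuinely small. A rough sketch of it (the file name, column layout, and bank account name here are made up):

  import json

  # mappings.json: {"amzn0026765260@apl": "expenses:amazon", "Veena Fuels": "expenses:vehicle", ...}
  with open("mappings.json") as f:
      mappings = json.load(f)

  def to_hledger(date: str, narration: str, amount: str) -> str:
      # Narrations without a mapping are the ones that get sent to Claude.
      account = mappings.get(narration, "expenses:unmapped")
      return f"{date} {narration}\n    {account}    {amount}\n    assets:bank:checking\n"

  print(to_hledger("2024-03-01", "Veena Fuels", "2000 INR"))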


It is hard to believe that a company with the expertise of Adobe in graphics programming (Photoshop, Premiere, After Effects, Illustrator) cannot build a vector design program.

Adobe XD was almost there, and they made some great decisions in nooks and crannies usually forgotten - like bringing in Chrome rendering engineers to build a subset of HTML and CSS for the plugin ecosystem.

While there are hundreds of little decisions to be made - from the runtime, text rendering, gpu vs cpu rendering, to how shadows are rendered and line heights are determined - none of these are engineering problems beyond the ken of a good team of systems and graphics engineers and designers, the likes of which Adobe has in spades.

The attempt to acquire Figma for such an enormous sum itself felt like a serious decision making mistake, and the final nail in the coffin is the complete abandonment of vector design tools.

I would pay good money to read the insider account of the corporate politics inside Adobe that led to all this.


> It is hard to believe that a company with the expertise of Adobe in graphics programming (Photoshop, Premiere, After Effects, Illustrator) cannot build a vector design program.

Not for the workers using those tools. Adobe has been coasting for a decade if not more. Probably difficult to see from the outside if you're not directly using it daily.


Moving to a subscription model is largely a symptom of that coasting. They couldn’t consistently come up with compelling reasons to upgrade to new versions of Creative Suite, with users being perfectly happy to run what they had for as many years as possible.

They were always going to hit a ceiling on interesting features to implement, but they could’ve sold new CS versions on improvements in stability, performance, responsiveness, and efficiency (things that users care quite a lot about), but that kind of engineering work doesn’t work well with the cheaper model of continuously adding to the ball of mud.


There is another element at play: modern software product management. I think revenue is the main driver, but behind that is the modern PM philosophy of hypothesis-driven development. In my experience, the average outcome is incrementalism in the extreme.

PMs break every feature down as an experiment to try and measure the value. It often misses consideration of the whole product. Everyone is scared of taking a big bet because it's harder to measure and too risky because all of these small bets are seemingly safer.

What I'd argue for is better product leadership that recognizes these new modes of development are just a tool. However, many product teams have become cults of experimentation and I expect this is the case at Adobe. It pays off as long as people are renewing, but the product suffers.


This is a great comment and deserves its own write up :)


With the software professional industries really care about (like InDesign), instead of messing with it they took the approach of never changing it, so as not to upset anyone. And instead of paying $1200 one time, people accepted paying $60/month in rent. I know places that still use CS6, and InDesign CS6 vs. the current one is almost indistinguishable software. There are maybe 5 minor new features that anybody cares about in the current one. It's insane that in those 10 years somebody paid Adobe over $7k. Users should revolt.


What about long arrays? Is there a mechanism where Svelte knows which element is mutated, and do fine-grained recomputation/update only the corresponding view elements?

This is the primary place where we have to go outside the framework in React. Only a large linear list has this problem -- if it were a decently balanced component tree, then we could skip updating huge swathes of it by skipping at a top-level node. But that is not possible with an array. You have to iterate through each element and do a shallow comparison to see if the references have changed.

That said, it is fairly easy to escape to DOM from inside a React component with useRef and the portal pattern. Then we can write vanilla JS which makes updates to only the values it changes.

If Svelte solves this in an elegant manner, then it would be a very compelling feature over React. I'm asking here because last I checked, most of the examples in Svelte's documentation pointed to simple counters and regular objects, and I couldn't find an example of a large linear list.



Thank you!


Did you try setting the `key` property in React?


`key` informs React of the identity of an element. That helps it during the reconciliation phase -- if it knows only one `key` in a list of DOM elements has changed, then it will run the DOM updates only on that one. Similarly if the order has changed, it only needs to move its index in the parent DOM element.

But it doesn't help in the rendering phase - i.e. when the virtual DOM is constructed as `render` is called on the root component and all our JSX code is executed across the component hierarchy. Virtual DOM reconciliation cost is only part of the performance penalty; re-running the "view as a function of state" computation is another major chunk.

