> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
When models figure out how to exploit an effect that every clever college student does, that should count as a win. That’s a much more human-like reasoning ability than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.
I don’t think the fact that LLMs can handle small numbers more reliably has anything to do with their reasoning ability. To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
However:
> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.
Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.
> reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
Or given a calculator. Which it's running on. Which it in some sense is. There's something deeply ironic about the fact that we have an "AI" running on the most technologically advanced calculator in the history of mankind and...it can't do basic math.
This is like saying it's ironic that an alternator in a car cannot combust gasoline when the gasoline engine is right beside it, even though the alternator 'runs' on the gasoline engine.
Or similarly having a gasoline engine without an alternator and making the observation that there's an absurdity there in that you're generating large amounts of energy, yet aren't able to charge a relatively small 12V battery with any of it. It's a very practical and natural limitation, yet in some sense you have exactly what you want - energy - you just can't use it because of the form. If you step back there's an amusing irony buried in that. At least in my humble opinion :-)
Thing is, an LLM is nothing but a prediction algorithm based upon what it was trained on. So it missing basic calculator functionality is a given. This is why tool usage is more and more a thing for LLMs: the LLM can, by itself, use a calculator for the actual math parts it needs, thus increasing accuracy ...
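To make that concrete, here is a minimal sketch of what such a calculator tool can look like; the tool name, schema, and harness are made up for illustration and are not any particular vendor's API:

```python
import ast
import operator

# Hypothetical tool definition the model is told about; the exact schema
# depends on the provider, this is just the general shape.
CALCULATOR_TOOL = {
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression exactly.",
    "parameters": {"expression": "string, e.g. '123456 * 789012'"},
}

# Only a safe subset of operations, instead of a blind eval().
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculator(expression: str) -> str:
    """Exact arithmetic that the model itself cannot be trusted to do."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))

# When the model emits a tool call like
# {"name": "calculator", "arguments": {"expression": "123456 * 789012"}},
# the harness just runs:
print(calculator("123456 * 789012"))
```

The important part is that the arithmetic happens in ordinary deterministic code; the model only has to decide when to call it and what expression to pass.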
If they were selling LLMs as “LLMs” instead of magic code-writing, answer-giving PhD replacements, the lack of basic arithmetic capability would be a given… but they aren’t. Judging a paid service using their own implied claims is perfectly reasonable.
Why is it a given? The universal approximation theorem should apply, since addition is a continuous function. Now, whether the network is sufficiently trained for that is another question, but I don’t think it's a given that a trillion-parameter model can’t approximate the most basic math operations.
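For what it's worth, here is a toy PyTorch sketch (hypothetical layer sizes and hyperparameters) of a small net fit to addition on a bounded range, which is the setting the universal approximation theorem covers; whether it holds up outside that range is exactly the "sufficiently trained" caveat:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Tiny MLP fit to a + b for a, b sampled from [0, 100].
model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.rand(4096, 2) * 100
y = x.sum(dim=1, keepdim=True)

for _ in range(2000):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()

# In-range queries should come out close to the true sum;
# queries far outside [0, 100] carry no such guarantee.
print(model(torch.tensor([[13.0, 29.0]])).item())      # should land near 42
print(model(torch.tensor([[5000.0, 7000.0]])).item())   # may be way off
```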
I think the tokenization is a bigger problem than the model itself.
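One way to see the tokenization issue, a sketch assuming the open-source tiktoken package (other tokenizers split differently, but the effect is similar):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["7", "123", "1234567890123"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    # Long numbers get chopped into several multi-digit chunks, so the model
    # never sees aligned digit columns the way a pencil-and-paper algorithm does.
    print(s, "->", pieces)
```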
Easy to answer that one ... predictions are limited by numerical precision. So if you have an int4 vs a float16, the chance that the prediction goes off is higher with the int4. But even with a float16 you're still going to run into cases where your prediction model goes off. It will happen a lot less, but you're still going to get rounding issues, which may result in a 5 becoming an 8 (just an example).
So while it can look like an LLM calculates correctly, it's still restricted by this accuracy issue. And if you get a single number wrong in a calculation, everything is wrong.
While a calculator does not deal with predictions but with basic adding/multiplying/subtracting etc ... things that are 100% accurate (if we do not count issues like cosmic rays hitting, failures in the silicon, etc).
A trillion-parameter model is just that, a trillion parameters, but what matters is not the tokens but the precision, as in: do they use int, float16, float32, float64 ... The issue is, the higher we go, the more the memory usage explodes.
There is no point in spending terabytes of memory just to get a somewhat accurate predictive calculator, when we can just have the LLM call an actual calculator to ensure its results are accurate.
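A minimal illustration of the rounding being described, using NumPy's float16; this is about the number format itself, not any particular model:

```python
import numpy as np

a = np.float16(2048.0)
b = np.float16(1.0)

# float16 has only 11 significand bits, so 2049 is not representable;
# the sum rounds back down to 2048 (ties-to-even).
print(a + b)                   # 2048.0
print(float(np.float16(0.1)))  # 0.0999755859375, not exactly 0.1
```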
Think of an LLM more like somebody with dyslexia / dyscalculia... It does not matter how good you are, all it takes is switching one number in an algebraic calculation to get a 0/10 ... The reason I mention this is that I often think of an LLM as a person with dyslexia / dyscalculia: it can have insane knowledge and be smart, but be considered dumb by society because of that less-than-accurate prediction (or number-swapping issue).
Take it from somebody who wasted a few years in school thanks to that issue: it really does not matter if you're a good programmer later in life when you flunk a few years thanks to undiagnosed issues. And yet, just like an LLM, I simply rely on tool usage to fix my inaccuracy issues. No point in wasting good shoulder space trying to graft a dozen more heads/brains onto me when I can simply delegate the issue away. ;)
The fact that we can get computer models that can almost program, write texts, ... and do so much more, like a slightly malfunctioning human, amazes me. And at the same time, I curse at it like my teachers did, and also call it dumb at times hehehe ... I now understand how my teachers felt loool
That's confusing basic arithmetic as a user feature with basic arithmetic as an implementation requirement.
I guarantee that computer vision and email clients both use basic arithmetic in implementation. And it would be trivially easy to bolt a calculator into an email app, because the languages used to write email apps include math features.
That's not true of LLMs. There's math at the bottom of the stack. But LLMs run as a separate closed and opaque application of a unique and self-contained type, which isn't easily extensible.
They don't include hooks into math features on the GPUs, and there's no easy way to add hooks.
If you want math, you need a separate tool call to conventional code.
IMO testing LLMs as if they "should" be able to do arithmetic is bizarre. They can't. They're not designed to. And even if they did, they'd be ridiculously inefficient at it.
> Pretty sure the only thing computer vision does is math.
That is only marginally less pedantic than saying that the only thing computer vision does is run discrete electrical signals through billions of transistors.
Yes, everything that a computer does, it does using math. This does not imply that things running on the computer can do basic arithmetic tasks for the user.
Agreed. I don't like when the prompt sets up a good portion of how to go about finding the answer by saying which tools to use and how. The LLM needs to decide when and how to use them, not the prompt.
I don't think it should be completely open ended. I mean, you could have an "ask_hooman" tool that solves a ton of problems with current LLMs. But that doesn't mean the LLM is capable with respect to the benchmark.
Why not? One of the most intelligent things to do when stuck on a problem is to get outside help.
If allowing this behaviour raises a problem, you can always add constraints to the benchmark such as "final answer must come out under 15s" or something. The LLM can then make the decision to ask around in accordance with the time risk.
Because AIs are good at devolving to the highest score, regardless of test intent. For most problems "ask_hooman", or especially the plural, would be much more effective. So the degenerate case would dominate and tell you precisely zero about the intelligence of the AI. If a specific "tool" is more adept than the "AI", then "choose tool" will always be the correct answer. But I agree, a tight time constraint would help.
On some level this makes sense, but on the other hand LLMs already have perfect recall of thousands of symbols built into them, which is what pencil and paper gives to a human test taker.
If you're not doing clever hacks for very long windows, I thought the basic design just feeds in the entire window and it's up to the weights to use it properly.
I'm not addressing an argument, just stating that's already a form of LLM testing done today for people wanting to look at the difference in results the same as the human analogy.
> To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
People interested can see the results of giving LLMs pen and paper today by looking at benchmarks with tools enabled. It's an addition to what you said, not an attack on a portion of your comment :).
I see now. My focus was on the effect of LLMs’ (and by analogy, humans’) reasoning abilities argued by bee_rider. The fact that tool use can enable more reliable handling of large numbers has no bearing on that, hence I found the reply confusing.
Hmm, maybe it depends on the specific test and the reasoning in it? I certainly think reasoning about how and when to use allowed tools, and when not to, is a big part of the reasoning and verification process. E.g. most human math tests allow for a pen-and-paper calculation, or even a calculator, and that can be a great way to, say, spot-check a symbolic derivative and see it needs to be revisited, without relying on the calculator/paper to do the actual reasoning for the testee. Or to see that the equation of motion for a system can't possibly have been right with some test values (without which I'm not sure I'd have passed my mid-level physics course haha).
At the very least, the scores from benchmarking a human on such a test with and without tools would be different from those of an LLM tested without the analogous constraints. Which is (IMO) a useful note when comparing reasoning abilities, and why I thought it was interesting that this kind of testing is just called testing with tools on the LLM side (not sure there is an equally standard term on the human testing side? Guess the same could be used for both, though).
At the same time I'm sure other reasoning tests don't gain much from/expect use of tools at all. So it wouldn't be relevant for those reasoning tests.
> Since performance on large numbers is not what these exams are intended to test for,
How so? Isn't the point of these exams to test arithmetic skills? I would hope we'd like arithmetic skills to be at a constant level regardless of the size of the number?
No. AIME is a test for advanced high schoolers that mostly tests higher-level math concepts like algebra and combinatorics. The arithmetic required is basic. All the answers are integers from 0 to 999, so that judging is objective and automated while making guessing infeasible. You have 12 minutes on average for each question, so even if you are terribly slow at arithmetic, you should still be able to calculate the correct answer if you can perform all the other math.
That's probably a great test for high schoolers but it doesn't really test what we want from AI, no? I would expect AI to be limited by the far greater constraints of its computing ability, and not the working memory of a human high schooler.
College exam takers use those tricks because they are on a time limit and are gaming the system. It's clever and wink wink nudge nudge ok everyone does it. But it's one tiny signal in a huge spectrum of things we use to evaluate people.
Instead, these metrics are gamed and presented as the entire multi-spectral signal of competence for LLMs, because it is literally impossible to say that success in one domain would translate the way it might with a good hire.
What I want is something I don't have to guard against gaming. Something conscientious and capable like my co-workers. Until then it's Google version 2 married to IntelliSense, and I'm not letting it do anything by itself.
IMO the calculator problem goes away with tool use, or with NN architectures that basically add a calculator equivalent as one of the potential 'experts' or similar. It won't be much of a trope for much longer.
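A toy sketch of what "a calculator equivalent as one of the experts" could look like, purely hypothetical: a learned gate mixes a small neural expert with an exact-arithmetic path (real MoE routing happens per token inside the transformer, so this only conveys the flavor of the idea):

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Gate between a learned expert and an exact-arithmetic 'expert'."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(2, 2)  # routing scores for [neural, exact]
        self.neural = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):                         # x: (batch, 2) pairs to add
        w = torch.softmax(self.gate(x), dim=-1)   # soft routing weights
        exact = x.sum(dim=-1, keepdim=True)       # deterministic "calculator" path
        return w[:, :1] * self.neural(x) + w[:, 1:] * exact
```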
> The point of these LLMs is to do things that computers were bad at.
That's a good point imo but we achieved this stuff by at least 2022 when ChatGPT was released. The thing about these giant black boxes is that they also fail to do things that directly human-written software ("computers") does easily. The inability to print text onto generated images or do general arithmetic is important. And sure, some of these limits look like "limits of humans". But it is important to avoid jumping from "they do this human-thing" to "they're like humans".
I don't claim to know anything, but I thought tool usage was a major sign of intelligence. For example, floats are a wonderful technology, but people use them as if chainsaws were great for cutting bread and butter. We now have entire languages that can't do basic arithmetic. I thought it was alarming: people, it can't compute like this! Now we have language models, which are still computers, so why can't we just give them.. you know... calculators? Arguably the best thing their universe has to offer.
edit: I forgot my point: calculating big numbers is not a real world problem anyone has.
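To make the float complaint concrete, the classic example (this is plain IEEE 754 double behavior, nothing LLM-specific), plus the "calculator" you reach for instead:

```python
# Binary floats cannot represent 0.1 or 0.2 exactly,
# so the sum is not exactly 0.3.
print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False

# Exact decimal arithmetic is the tool you reach for when it matters.
from decimal import Decimal
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```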
LLMs can probably be taught or configured to use external tools like Excel or Mathematica when such calculations are needed. Just like humans. There are plenty of untapped optimization opportunities.
I tried making a spreadsheet application and found that LLMs are not that great at working with 2D data, especially if there’s a lot of it. It’s harder to search a large spreadsheet than a large text file: you might get a range of thousands of numbers, and how do you search that? And things like headers or important information may not be anywhere near where the model is focused, which means it needs to read a ton of irrelevant context. For small sheets it works perfectly though; it’ll have to be something I take another look at in the future.
>the point of these LLMs is to do things that computers were bad at.
The way they’re being deployed it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.
I’m a bit out on a limb here because this is not really my technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don’t really tell us much about how LLMs perform in the ways most people actually use them. Maybe I’m off base here, though.
Nobody really knows "the point" of LLMs yet. They weren't even "invented" as much as they emerged as a trick to get computers to better understand human language.
They're still brand spanking new and everyone's trying to figure out how to best use them. We don't even really know if they're ever going to be "really good at" any given task!
Are they "really good at" these things or are they merely "OK-ish"?
* Answering factual questions.
* Programming.
* Understanding what the user wants from natural language.
* Searching/recommending stuff.
Real world testing suggests that with billions and billions of dollars spent, you really can get an LLM to be "OK-ish" at all those things :D
Yet literally hundreds of billions of dollars are being invested in them. That’s what’s so concerning. And I can tell you not one of these startups would EVER acknowledge the truth of your statement.