> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
When models figure out how to exploit an effect that every clever college student does, that should count as a win. That’s a much more human-like reasoning ability than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.
I don’t think the fact that LLMs can handle small numbers more reliably has anything to do with their reasoning ability. To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
However:
> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.
Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.
> reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
Or given a calculator. Which it's running on. Which it in some sense is. There's something deeply ironic about the fact that we have an "AI" running on the most technologically advanced calculator in the history of mankind and...it can't do basic math.
This is like saying it's ironic that an alternator in a car cannot combust gasoline when the gasoline engine is right beside it, even though the alternator 'runs' on the gasoline engine.
Or similarly having a gasoline engine without an alternator and making the observation that there's an absurdity there in that you're generating large amounts of energy, yet aren't able to charge a relatively small 12V battery with any of it. It's a very practical and natural limitation, yet in some sense you have exactly what you want - energy - you just can't use it because of the form. If you step back there's an amusing irony buried in that. At least in my humble opinion :-)
Thing is, an LLM is nothing but a prediction algorithm based upon what it was trained on. So it missing basic calculator functionality is a given. This is why tool usage is more and more a thing for LLMs: the LLM can, by itself, use a calculator for the actual math parts it needs, thus increasing accuracy ...
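To make that concrete, here is a minimal sketch of what such a calculator tool can look like; the tool name, schema, and harness are made up for illustration and are not any particular vendor's API:

```python
import ast
import operator

# Hypothetical tool definition the model is told about; the exact schema
# depends on the provider, this is just the general shape.
CALCULATOR_TOOL = {
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression exactly.",
    "parameters": {"expression": "string, e.g. '123456 * 789012'"},
}

# Only a safe subset of operations, instead of a blind eval().
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculator(expression: str) -> str:
    """Exact arithmetic that the model itself cannot be trusted to do."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))

# When the model emits a tool call like
# {"name": "calculator", "arguments": {"expression": "123456 * 789012"}},
# the harness just runs:
print(calculator("123456 * 789012"))
```

The important part is that the arithmetic happens in ordinary deterministic code; the model only has to decide when to call it and what expression to pass.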
If they were selling LLMs as “LLMs” instead of magic code-writing, answer-giving PhD replacements, the lack of basic arithmetic capability would be a given… but they aren’t. Judging a paid service using their own implied claims is perfectly reasonable.
Why is it a given? The universal approximation theorem should apply, since addition is a continuous function. Now, whether the network is sufficiently trained for that is another question, but I don’t think it's a given that a trillion-parameter model can’t approximate the most basic math operations.
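For what it's worth, here is a toy PyTorch sketch (hypothetical layer sizes and hyperparameters) of a small net fit to addition on a bounded range, which is the setting the universal approximation theorem covers; whether it holds up outside that range is exactly the "sufficiently trained" caveat:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Tiny MLP fit to a + b for a, b sampled from [0, 100].
model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.rand(4096, 2) * 100
y = x.sum(dim=1, keepdim=True)

for _ in range(2000):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()

# In-range queries should come out close to the true sum;
# queries far outside [0, 100] carry no such guarantee.
print(model(torch.tensor([[13.0, 29.0]])).item())      # should land near 42
print(model(torch.tensor([[5000.0, 7000.0]])).item())   # may be way off
```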
I think the tokenization is a bigger problem than the model itself.
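One way to see the tokenization issue, a sketch assuming the open-source tiktoken package (other tokenizers split differently, but the effect is similar):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["7", "123", "1234567890123"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    # Long numbers get chopped into several multi-digit chunks, so the model
    # never sees aligned digit columns the way a pencil-and-paper algorithm does.
    print(s, "->", pieces)
```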
Easy to answer that one ... predictions are limited by numerical precision. So if you have an int4 vs a float16, the chance that the prediction goes off is higher with the int4. But even with a float16 you're still going to run into cases where your prediction model goes off. It will happen a lot less, but you're still going to get rounding issues, which may result in a 5 becoming an 8 (just an example).
So while it can look like an LLM calculates correctly, it's still restricted by this accuracy issue. And if you get a single number wrong in a calculation, everything is wrong.
While a calculator does not deal with predictions but with basic adding/multiplying/subtracting etc ... things that are 100% accurate (if we do not count issues like cosmic rays hitting, failures in the silicon, etc).
A trillion-parameter model is just that, a trillion parameters, but what matters is not the tokens but the precision, as in: do they use int, float16, float32, float64 ... The issue is, the higher we go, the more the memory usage explodes.
There is no point in spending terabytes of memory just to get a somewhat accurate predictive calculator, when we can just have the LLM call an actual calculator to ensure its results are accurate.
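A minimal illustration of the rounding being described, using NumPy's float16; this is about the number format itself, not any particular model:

```python
import numpy as np

a = np.float16(2048.0)
b = np.float16(1.0)

# float16 has only 11 significand bits, so 2049 is not representable;
# the sum rounds back down to 2048 (ties-to-even).
print(a + b)                   # 2048.0
print(float(np.float16(0.1)))  # 0.0999755859375, not exactly 0.1
```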
Think of an LLM more like somebody with dyslexia / dyscalculia... It does not matter how good you are, all it takes is switching one number in an algebraic calculation to get a 0/10 ... The reason I mention this is that I often think of an LLM as a person with dyslexia / dyscalculia: it can have insane knowledge and be smart, but be considered dumb by society because of that less-than-accurate prediction (or number-swapping issue).
Take it from somebody who wasted a few years in school thanks to that issue: it really does not matter if you're a good programmer later in life when you flunk a few years thanks to undiagnosed issues. And yet, just like an LLM, I simply rely on tool usage to fix my inaccuracy issues. No point in wasting good shoulder space trying to graft a dozen more heads/brains onto me when I can simply delegate the issue away. ;)
The fact that we can get computer models that can almost program, write texts, ... and do so much more, like a slightly malfunctioning human, amazes me. And at the same time, I curse at it like my teachers did, and also call it dumb at times hehehe ... I now understand how my teachers felt loool
That's confusing basic arithmetic as a user feature with basic arithmetic as an implementation requirement.
I guarantee that computer vision and email clients both use basic arithmetic in implementation. And it would be trivially easy to bolt a calculator into an email app, because the languages used to write email apps include math features.
That's not true of LLMs. There's math at the bottom of the stack. But LLMs run as a separate closed and opaque application of a unique and self-contained type, which isn't easily extensible.
They don't include hooks into math features on the GPUs, and there's no easy way to add hooks.
If you want math, you need a separate tool call to conventional code.
IMO testing LLMs as if they "should" be able to do arithmetic is bizarre. They can't. They're not designed to. And even if they did, they'd be ridiculously inefficient at it.
> Pretty sure the only thing computer vision does is math.
That is only marginally less pedantic than saying that the only thing computer vision does is run discrete electrical signals through billions of transistors.
Yes, everything that a computer does, it does using math. This does not imply that things running on the computer can do basic arithmetic tasks for the user.
Agreed. I don't like when the prompt sets up a good portion of how to go about finding the answer by saying which tools to use and how. The LLM needs to decide when and how to use them, not the prompt.
I don't think it should be completely open ended. I mean, you could have an "ask_hooman" tool that solves a ton of problems with current LLMs. But that doesn't mean the LLM is capable with respect to the benchmark.
Why not? One of the most intelligent things to do when stuck on a problem is to get outside help.
If allowing this behaviour raises a problem, you can always add constraints to the benchmark such as "final answer must come out under 15s" or something. The LLM can then make the decision to ask around in accordance with the time risk.
Because AIs are good at devolving to the highest score, regardless of test intent. For most problems "ask_hooman", or especially the plural, would be much more effective. So the degenerate case would dominate and tell you precisely zero about the intelligence of the AI. If a specific "tool" is more adept than the "AI", then "choose tool" will always be the correct answer. But I agree, a tight time constraint would help.
On some level this makes sense, but on the other hand LLMs already have perfect recall of thousands of symbols built into them, which is what pencil and paper gives to a human test taker.
If you're not doing clever hacks for very long windows, I thought the basic design just feeds in the entire window and it's up to the weights to use it properly.
I'm not addressing an argument, just stating that's already a form of LLM testing done today for people wanting to look at the difference in results the same as the human analogy.
> To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
People interested can see the results of giving LLMs pen and paper today by looking at benchmarks with tools enabled. It's an addition to what you said, not an attack on a portion of your comment :).
I see now. My focus was on the effect of LLMs’ (and by analogy, humans’) reasoning abilities argued by bee_rider. The fact that tool use can enable more reliable handling of large numbers has no bearing on that, hence I found the reply confusing.
Hmm, maybe it depends on the specific test and the reasoning in it? I certainly think reasoning about how and when to use allowed tools, and when not to, is a big part of the reasoning and verification process. E.g. most human math tests allow for a pen-and-paper calculation, or even a calculator, and that can be a great way to, say, spot-check a symbolic derivative and see it needs to be revisited, without relying on the calculator/paper to do the actual reasoning for the testee. Or to see that the equation of motion for a system can't possibly have been right with some test values (without which I'm not sure I'd have passed my mid-level physics course haha).
At the very least, the scores from benchmarking a human on such a test with and without tools would be different from those of an LLM tested without the analogous constraints. Which is (IMO) a useful note when comparing reasoning abilities, and why I thought it was interesting that this kind of testing is just called testing with tools on the LLM side (not sure there is an equally standard term on the human testing side? Guess the same could be used for both, though).
At the same time I'm sure other reasoning tests don't gain much from/expect use of tools at all. So it wouldn't be relevant for those reasoning tests.
> Since performance on large numbers is not what these exams are intended to test for,
How so? Isn't the point of these exams to test arithmetic skills? I would hope we'd like arithmetic skills to be at a constant level regardless of the size of the number?
No. AIME is a test for advanced high schoolers that mostly tests higher-level math concepts like algebra and combinatorics. The arithmetic required is basic. All the answers are integers from 0 to 999, so that judging is objective and automated while making guessing infeasible. You have 12 minutes on average for each question, so even if you are terribly slow at arithmetic, you should still be able to calculate the correct answer if you can perform all the other math.
That's probably a great test for high schoolers but it doesn't really test what we want from AI, no? I would expect AI to be limited by the far greater constraints of its computing ability, and not the working memory of a human high schooler.
College exam takers use those tricks because they are on a time limit and are gaming the system. It's clever and wink wink nudge nudge ok everyone does it. But it's one tiny signal in a huge spectrum of things we use to evaluate people.
Instead, these metrics are gamed and presented as the entire multi-spectral signal of competence for LLMs, because it is literally impossible to say that success in one domain would translate the way it might with a good hire.
What I want is something I don't have to guard against gaming. Something conscientious and capable like my co-workers. Until then it's Google version 2 married to IntelliSense, and I'm not letting it do anything by itself.
IMO the calculator problem goes away with tool use, or with NN architectures that basically add a calculator equivalent as one of the potential 'experts' or similar. It won't be much of a trope for much longer.
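A toy sketch of what "a calculator equivalent as one of the experts" could look like, purely hypothetical: a learned gate mixes a small neural expert with an exact-arithmetic path (real MoE routing happens per token inside the transformer, so this only conveys the flavor of the idea):

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Gate between a learned expert and an exact-arithmetic 'expert'."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(2, 2)  # routing scores for [neural, exact]
        self.neural = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):                         # x: (batch, 2) pairs to add
        w = torch.softmax(self.gate(x), dim=-1)   # soft routing weights
        exact = x.sum(dim=-1, keepdim=True)       # deterministic "calculator" path
        return w[:, :1] * self.neural(x) + w[:, 1:] * exact
```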
> The point of these LLMs is to do things that computers were bad at.
That's a good point imo but we achieved this stuff by at least 2022 when ChatGPT was released. The thing about these giant black boxes is that they also fail to do things that directly human-written software ("computers") does easily. The inability to print text onto generated images or do general arithmetic is important. And sure, some of these limits look like "limits of humans". But it is important to avoid jumping from "they do this human-thing" to "they're like humans".
I don't claim to know anything, but I thought tool usage was a major sign of intelligence. For example, floats are a wonderful technology, but people use them as if chainsaws were great for cutting bread and butter. We now have entire languages that can't do basic arithmetic. I thought it was alarming: people, it can't compute like this! Now we have language models, which are still computers, so why can't we just give them.. you know... calculators? Arguably the best thing their universe has to offer.
edit: I forgot my point: calculating big numbers is not a real world problem anyone has.
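To make the float complaint concrete, the classic example (this is plain IEEE 754 double behavior, nothing LLM-specific), plus the "calculator" you reach for instead:

```python
# Binary floats cannot represent 0.1 or 0.2 exactly,
# so the sum is not exactly 0.3.
print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False

# Exact decimal arithmetic is the tool you reach for when it matters.
from decimal import Decimal
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```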
LLMs can probably be taught or configured to use external tools like Excel or Mathematica when such calculations are needed. Just like humans. There are plenty of untapped optimization opportunities.
I tried making a spreadsheet application and found that LLMs are not that great at working with 2D data, especially if there’s a lot of it. It’s harder to search a large spreadsheet than a large text file: you might get a range of thousands of numbers, and how do you search that? And things like headers or important information may not be anywhere near where the model is focused, which means it needs to read a ton of irrelevant context. For small sheets it works perfectly though; it’ll have to be something I take another look at in the future.
>the point of these LLMs is to do things that computers were bad at.
The way they’re being deployed it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.
I’m a bit out on a limb here because this is not really my technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don’t really tell us much about how LLMs perform in the ways most people actually use them. Maybe I’m off base here, though.
Nobody really knows "the point" of LLMs yet. They weren't even "invented" as much as they emerged as a trick to get computers to better understand human language.
They're still brand spanking new and everyone's trying to figure out how to best use them. We don't even really know if they're ever going to be "really good at" any given task!
Are they "really good at" these things or are they merely "OK-ish"?
* Answering factual questions.
* Programming.
* Understanding what the user wants from natural language.
* Searching/recommending stuff.
Real world testing suggests that with billions and billions of dollars spent, you really can get an LLM to be "OK-ish" at all those things :D
Yet literally hundreds of billions of dollars are being invested in them. That’s what’s so concerning. And I can tell you not one of these startups would EVER acknowledge the truth of your statement.