I see now. My focus was on the effect of LLMs’ (and by analogy, humans’) reasoni...

zamadatix · 2025-11-08T16:40:10

Hmm, maybe it depends on the specific test and the reasoning in it? I certainly think reasoning about how and when to use allowed tools, and when not to, is a big part of the reasoning and verification process. E.g. most human math tests allow pen-and-paper calculation, or even a calculator, and that can be a great way to, say, spot-check a symbolic derivative and see that it needs to be revisited, without relying on the calculator/paper to do the actual reasoning for the testee. Or to see, by plugging in some test values, that the equation of motion for a system can't possibly be right (without which I'm not sure I'd have passed my mid-level physics course haha).
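To make the spot-check idea concrete, here's a rough sketch of using a "calculator" to test a hand-derived derivative at a few points. Everything in it is made up for illustration (the function names, the example function, the tolerance); the point is just that the tool flags a wrong answer without doing the derivation itself.

```python
import math


def central_difference(f, x, h=1e-5):
    """Numerically estimate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def derivative_looks_right(f, claimed_df, points, tol=1e-4):
    """Return True if the claimed derivative matches the numeric estimate
    at every test point (within tol). This only spot-checks; it proves nothing."""
    return all(
        math.isclose(central_difference(f, x), claimed_df(x),
                     rel_tol=tol, abs_tol=tol)
        for x in points
    )


# Hypothetical exam problem: differentiate f(x) = x * sin(x).
def f(x):
    return x * math.sin(x)


def correct_df(x):
    # Product rule applied properly.
    return math.sin(x) + x * math.cos(x)


def wrong_df(x):
    # A typical slip: forgetting the product rule.
    return math.cos(x)


points = [0.5, 1.0, 2.0]
print(derivative_looks_right(f, correct_df, points))
print(derivative_looks_right(f, wrong_df, points))
```

The check only tells you the second answer needs revisiting; figuring out why (the missed product rule) is still on the testee.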

At the very least, a human's scores on such a test with and without tools would differ, so comparing them against an LLM that isn't under the analogous constraints wouldn't be like for like. Which is (IMO) a useful note when comparing reasoning abilities, and why I thought it was interesting that this kind of testing is just called testing with tools on the LLM side (not sure there's an equally standard term on the human testing side? Guess the same could be used for both though).

At the same time, I'm sure other reasoning tests don't gain much from, or expect, the use of tools at all. So it wouldn't be relevant for those reasoning tests.