On the IMO paper: pointing out that it’s not a gold medal or that some proofs are flawed is irrelevant to the claim being discussed, and you know it. The claim is not “LLMs are perfect mathematicians.” The claim is that they can produce nontrivial formal reasoning that passes external verification at a rate far above chance and far above parroting. Even a single verified solution falsifies the “just regurgitation” hypothesis, because no retrieval-only or surface-pattern system can reliably construct valid proofs under novel compositions.
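To be concrete about what "external verification" means here, consider a minimal sketch in Lean (an illustrative example I'm supplying, not one from the paper itself): a proof assistant's kernel either accepts a proof term or it doesn't, with no partial credit and no way to talk it past a flawed step.

```lean
-- Minimal sketch of machine-checked verification (illustrative, not from the paper).
-- The Lean kernel accepts this proof because the term actually closes the goal.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Replacing the proof with a gap (`sorry`) is flagged as unverified, and a false
-- statement (say, `a + b = b + a + 1`) simply fails to check, however fluent the
-- surrounding prose is.
```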
Your fallback move here is rhetorical, not scientific: “maybe it doesn’t mean what you think it means.” Fine. Then name the mechanism. What specific process produces internally consistent multi-step proofs, respects formal constraints, generalizes across problem types, and fails in ways analogous to human reasoning errors, without representing the underlying structure? “People are impressed because they’re bad at math” is not a mechanism; it’s a tell.
Also, the “math is just a language” line cuts the wrong way. Yes, math is symbolic and code-like. That’s precisely why it’s such a strong test. Code-like domains have exact semantics. They are adversarial to bullshit. That’s why hallucinations show up so clearly there. The fact that LLMs sometimes succeed and sometimes fail is evidence of partial competence, not illusion. A parrot does not occasionally write correct code or proofs under distribution shift. It never does.
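If it helps to see why exact semantics leave no room for bluffing, here is a toy checker in Python (the function name and scenario are invented for the example): a claimed answer is either exactly right or rejected, and a plausible-sounding near-miss scores zero.

```python
# Toy illustration of exact semantics: a checker for a claimed factorization
# either confirms it or rejects it. There is no partial credit for fluency.
def verify_factorization(n: int, factors: list[int]) -> bool:
    """Return True only if every factor is > 1 and their product is exactly n."""
    product = 1
    for f in factors:
        if f <= 1:
            return False
        product *= f
    return product == n

assert verify_factorization(91, [7, 13])      # correct: accepted
assert not verify_factorization(91, [7, 12])  # plausible-looking but wrong: rejected
```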
You keep asserting that others are being fooled, but you haven’t produced what science actually requires: an alternative explanation that accounts for the full observed behavior and survives tighter controls. Clever Hans had one. Stage magic has one. LLMs, so far, do not.
Skepticism is healthy. But repeating “you’re the limiting factor” while refusing to specify a falsifiable counter-hypothesis is not adversarial engineering. It’s just armchair disbelief dressed up as rigor. And engineers, as you surely know, eventually have to ship something more concrete than that.