Hmm. I’m a hard disagree. The problems they show have a number of really nice properties for LLM assessment: they require broad, often integrated knowledge of diverse areas of mathematics; the answers reduce to a number, often a very large one, and are thus extremely difficult to guess; and they demand a significant amount of symbolic parsing and (I would say) reasoning skill. If we think about what makes a quality mathematician, I’d propose it’s the ability to come at a problem both from the top (conceptually) and from the bottom (applying various tools and transformations), with a sort of direction in mind that gets to a result.
I’d say these problems strongly encourage that sort of behavior.
I’m also someone who thinks building abilities like this into LLMs would broadly benefit both the LLMs and the world, because I think this stuff generalizes. But even if not, it would be hard to argue that an LLM scoring 80% on this benchmark would not be useful to a research mathematician. Terence Tao’s dream is something like this hooked up to Lean, leaving research mathematicians as editors and advisors who occasionally work on the really hard parts, while the rest is automated and provably correct. There’s no doubt in my mind that an LLM scoring high on this benchmark would be helpful toward that vision.