I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
What I would do, if I were in the position of a large company in this space, is assemble an internal team to create an ARC replica covering very similar puzzles, and use that as part of training.
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I also think it's fair to use any means to beat it.
I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
This isn’t gaming the benchmark though. If training on similar data generalizes, that's called learning. Training on the exact test set is memorization.
There are, in fact, teams creating puzzles to RL against as training environments. It's beneficial to RL training, and in particular compute-efficient, if you schedule the environment difficulty throughout training; there was a great recent paper on this. Creating environment data that generalizes outside the environment is a challenging engineering task, and super valuable whether it looks like ARC-AGI or not.
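For illustration, here's a minimal sketch of what "scheduling environment difficulty throughout training" could look like for a synthetic puzzle environment. The environment, the toy mirror rule, and the linear ramp are all my own assumptions, not anything from the paper or any lab's actual setup:

```python
import random

# Hypothetical sketch of curriculum scheduling for a puzzle RL environment.
# make_puzzle and the difficulty knob (grid size, toy mirror rule) are
# placeholder assumptions, not any real training pipeline.

def make_puzzle(difficulty: int):
    """Generate a toy grid puzzle whose size grows with difficulty."""
    size = 3 + difficulty
    grid = [[random.randint(0, 9) for _ in range(size)] for _ in range(size)]
    target = [row[::-1] for row in grid]  # toy rule: horizontal mirror
    return grid, target

def difficulty_schedule(step: int, total_steps: int, max_difficulty: int = 8) -> int:
    """Linearly ramp difficulty so early training stays cheap."""
    return min(max_difficulty, int(max_difficulty * step / total_steps))

total_steps = 10_000
for step in range(total_steps):
    d = difficulty_schedule(step, total_steps)
    grid, target = make_puzzle(d)
    # ...policy rollout, reward for matching `target`, and update would go here...
```

The scheduling is the compute-efficiency point: early steps burn cheap, easy puzzles, and the expensive large ones only show up once the policy can actually learn from them.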
Also, ARC-AGI is general enough that if you create similar data, you're just creating generic visual puzzle data. Should all visual puzzle data be off limits?
The 'private' set is just a pinkie promise not to store the logs, or not to use them, when the evaluator runs the test through the API, so yeah. It's trivially exploitable.
Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
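Roughly what I mean, as a sketch (here `run_model` is just a placeholder for whatever API call produces an answer, and the distractor strings are made up):

```python
import random
import statistics

# Sketch of the idea above: perturb each question with random red herrings,
# run several trials, and average correctness.

def add_red_herrings(question: str, rng: random.Random) -> str:
    """Append irrelevant but plausible-looking junk to the prompt."""
    distractors = [
        "Note: ignore the earlier reference to ticket #4471.",
        "The warehouse closes at 6pm on Fridays.",
        "Legacy field `colour_code` is deprecated and unused.",
    ]
    rng.shuffle(distractors)
    return question + "\n" + "\n".join(distractors[:2])

def noisy_eval(question: str, expected: str, run_model, trials: int = 20) -> float:
    """Score a single item as mean correctness over perturbed trials."""
    scores = []
    for seed in range(trials):
        rng = random.Random(seed)
        answer = run_model(add_red_herrings(question, rng))
        scores.append(1.0 if answer.strip() == expected else 0.0)
    return statistics.mean(scores)
```

Even if the underlying question leaked into training, the model still has to cope with prompts it has never seen verbatim, which is closer to how real context windows look anyway.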
My point is that it does not matter if the set is private or not.
If you want to train your model you'd need more data than the private set anyway.
So you have to build a very large training set on your own, using the same kind of puzzles.
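As a toy illustration of what that generation loop looks like (the transformations here are trivial stand-ins; real ARC tasks encode far richer rules):

```python
import random

# Toy sketch of "build a very large training set of the same kind of puzzles":
# sample a random grid, apply a hidden transformation, keep the input/output pair.

TRANSFORMS = {
    "flip_h":  lambda g: [row[::-1] for row in g],
    "flip_v":  lambda g: g[::-1],
    "recolor": lambda g: [[(c + 1) % 10 for c in row] for row in g],
}

def random_grid(rng: random.Random, size: int = 5):
    return [[rng.randint(0, 9) for _ in range(size)] for _ in range(size)]

def generate_dataset(n: int, seed: int = 0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        name, fn = rng.choice(list(TRANSFORMS.items()))
        grid = random_grid(rng)
        data.append({"rule": name, "input": grid, "output": fn(grid)})
    return data

dataset = generate_dataset(100_000)  # stand-in for a "very large" synthetic set
```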
Yes, you can build your own dataset of n puzzles, but it was still really hard for any system to achieve any score. This even beats systems specialized for just this one task, and these puzzles shouldn't really be possible to solve by memorization, given the number of variations that can be created.
Is "good at benchmarks instead of real world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed and then move on. That's not great for Google
If they're memory/reference-constrained systems that can't directly "store" every solution, then doing well on benchmarks should result in better real-world/reasoning performance, since the lack of a memorized answer requires understanding.
Like with humans [1], generalized reasoning ability lets you skip the direct storage of that solution, and many many others, completely! You can just synthesize a solution when a problem is presented.
Benchmarks are intended as a proxy for real usage, and they are often useful for incrementally improving a system, especially when the end goal is not well-defined.
The trick is not to read more into the score than it actually tells you.
Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but right now it's a race to lock users into your model and make switching costs high.
If there is something new that an LLM/AI model can't solve today, plenty of humans can't solve it either.
But tomorrow every LLM/AI model will be able to solve it, while plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train on new scenarios, it will start to squeeze the humans out of the loop.
It leads on ARC-AGI-1 with Gemini 3.0 Deep Think, which uses "tool calls" according to Google's post, whereas regular Gemini 3.0 Pro doesn't use "tool calls" for the same benchmark. I am unsure how significant this difference is.
Even if the prompts are technically leaked to the provider, how would they be identified as something worth optimizing for out of the millions of other prompts received?
It's almost certain that it was, but the point of this puzzle benchmark is that it shouldn't really be possible to solve by memorization, given the number of variations that can be created and the other criteria detailed in its design.
Sure, but the types of patterns in these problems do repeat, so I don't think it'd be too hard to RL-train on them, whether on public samples or a privately generated more-of-the-same dataset, and improve performance a lot.
Every company releasing new models leads with benchmark numbers, so it's hard to imagine they are not all putting a lot of effort into benchmark-maxxing.
Yes, everyone is doing that on benchmarks, but benchmarks are still somewhat useful, and the likes of ARC-AGI even more so. Even though we can't quantify exactly how much better models are getting, benchmarks are still necessary. For ARC-AGI these are big gains, whichever way they went about it, since everyone has also been trying to max it for the last 3 years. But we do need to come up with better benchmarks/evals, as ARC tried to do.
The ARC puzzles in question: https://arcprize.org/arc-agi/2/