I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
What I would do, if I were in the position of a large company in this space, is assemble an internal team to create an ARC replica covering very similar puzzles, and use that as part of training.
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I also think it's fair to use any means to beat it.
I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
This isn’t gaming the benchmark though. If training on similar data generalizes, that's called learning. Training on the exact test set is memorization.
There are, in fact, teams creating puzzles to RL against as training environments. It's beneficial to RL training, and in particular compute-efficient, if you schedule the environment difficulty throughout training; there was a great recent paper on this. Creating environment data that generalizes outside the environment is a challenging engineering task, and super valuable whether it looks like ARC-AGI or not.
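For illustration, here's a minimal sketch of what "scheduling environment difficulty throughout training" could look like for a synthetic puzzle environment. The environment, the toy mirror rule, and the linear ramp are all my own assumptions, not anything from the paper or any lab's actual setup:

```python
import random

# Hypothetical sketch of curriculum scheduling for a puzzle RL environment.
# make_puzzle and the difficulty knob (grid size, toy mirror rule) are
# placeholder assumptions, not any real training pipeline.

def make_puzzle(difficulty: int):
    """Generate a toy grid puzzle whose size grows with difficulty."""
    size = 3 + difficulty
    grid = [[random.randint(0, 9) for _ in range(size)] for _ in range(size)]
    target = [row[::-1] for row in grid]  # toy rule: horizontal mirror
    return grid, target

def difficulty_schedule(step: int, total_steps: int, max_difficulty: int = 8) -> int:
    """Linearly ramp difficulty so early training stays cheap."""
    return min(max_difficulty, int(max_difficulty * step / total_steps))

total_steps = 10_000
for step in range(total_steps):
    d = difficulty_schedule(step, total_steps)
    grid, target = make_puzzle(d)
    # ...policy rollout, reward for matching `target`, and update would go here...
```

The scheduling is the compute-efficiency point: early steps burn cheap, easy puzzles, and the expensive large ones only show up once the policy can actually learn from them.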
Also, ARC-AGI is general enough that if you create similar data, you're just creating generic visual puzzle data. Should all visual puzzle data be off limits?
The 'private' set is just a pinkie promise not to store the logs, or not to use them, when the evaluator runs the test through the API, so yeah. It's trivially exploitable.
Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
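Roughly what I mean, as a sketch (here `run_model` is just a placeholder for whatever API call produces an answer, and the distractor strings are made up):

```python
import random
import statistics

# Sketch of the idea above: perturb each question with random red herrings,
# run several trials, and average correctness.

def add_red_herrings(question: str, rng: random.Random) -> str:
    """Append irrelevant but plausible-looking junk to the prompt."""
    distractors = [
        "Note: ignore the earlier reference to ticket #4471.",
        "The warehouse closes at 6pm on Fridays.",
        "Legacy field `colour_code` is deprecated and unused.",
    ]
    rng.shuffle(distractors)
    return question + "\n" + "\n".join(distractors[:2])

def noisy_eval(question: str, expected: str, run_model, trials: int = 20) -> float:
    """Score a single item as mean correctness over perturbed trials."""
    scores = []
    for seed in range(trials):
        rng = random.Random(seed)
        answer = run_model(add_red_herrings(question, rng))
        scores.append(1.0 if answer.strip() == expected else 0.0)
    return statistics.mean(scores)
```

Even if the underlying question leaked into training, the model still has to cope with prompts it has never seen verbatim, which is closer to how real context windows look anyway.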
My point is that it does not matter if the set is private or not.
If you want to train your model you'd need more data than the private set anyway.
So you have to build a very large training set on your own, using the same kind of puzzles.
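As a toy illustration of what that generation loop looks like (the transformations here are trivial stand-ins; real ARC tasks encode far richer rules):

```python
import random

# Toy sketch of "build a very large training set of the same kind of puzzles":
# sample a random grid, apply a hidden transformation, keep the input/output pair.

TRANSFORMS = {
    "flip_h":  lambda g: [row[::-1] for row in g],
    "flip_v":  lambda g: g[::-1],
    "recolor": lambda g: [[(c + 1) % 10 for c in row] for row in g],
}

def random_grid(rng: random.Random, size: int = 5):
    return [[rng.randint(0, 9) for _ in range(size)] for _ in range(size)]

def generate_dataset(n: int, seed: int = 0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        name, fn = rng.choice(list(TRANSFORMS.items()))
        grid = random_grid(rng)
        data.append({"rule": name, "input": grid, "output": fn(grid)})
    return data

dataset = generate_dataset(100_000)  # stand-in for a "very large" synthetic set
```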
Yes, you can build your own dataset of n puzzles, but it was still really hard for any system to achieve any score. This even beats systems specialized for just this one task, and these puzzles shouldn't really be possible to solve by memorization, given the number of variations that can be created.
Is "good at benchmarks instead of real world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed and then move on. That's not great for Google
If they're memory/reference-constrained systems that can't directly "store" every solution, then doing well on benchmarks should result in better real-world/reasoning performance, since the lack of a memorized answer requires understanding.
Like with humans [1], generalized reasoning ability lets you skip the direct storage of that solution, and many many others, completely! You can just synthesize a solution when a problem is presented.
Benchmarks are intended as a proxy for real usage, and they are often useful for incrementally improving a system, especially when the end goal is not well-defined.
The trick is not to read more into the score than it actually tells you.
Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but right now it's a race to lock users into your model and make switching costs high.
If there is something new that an LLM/AI model can't solve today, plenty of humans can't solve it either.
But tomorrow every LLM/AI model will be able to solve it, while plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train on new scenarios, it will start to squeeze the humans out of the loop.
It leads on ARC-AGI-1 with Gemini 3.0 Deep Think, which uses "tool calls" according to Google's post, whereas regular Gemini 3.0 Pro doesn't use "tool calls" for the same benchmark. I am unsure how significant this difference is.
Even if the prompts are technically leaked to the provider, how would they be identified as something worth optimizing for out of the millions of other prompts received?
It's almost certain that it was, but the point of this puzzle benchmark is that it shouldn't really be possible to solve by memorization, given the number of variations that can be created and the other criteria detailed in its design.
Sure, but the types of patterns in these problems do repeat, so I don't think it'd be too hard to RL-train on them, whether on public samples or a privately generated more-of-the-same dataset, and improve performance a lot.
Every company releasing new models leads with benchmark numbers, so it's hard to imagine they are not all putting a lot of effort into benchmark-maxxing.
Yes, everyone is doing that on benchmarks, but benchmarks are still somewhat useful, and the likes of ARC-AGI even more so. Even though we can't quantify exactly how much better models are getting, benchmarks are still necessary. For ARC-AGI these are big gains, whichever way they went about it, since everyone has also been trying to max it for the last 3 years. But we do need to come up with better benchmarks/evals, as ARC tried to do.
The ARC puzzles in question: https://arcprize.org/arc-agi/2/