
I am personally impressed by the continued improvement on ARC-AGI-2, where Gemini 3 got 31.1% (vs GPT-5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs: many of the puzzles test the kind of thing humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).

The fact that these models can keep getting better at this task, given how they are trained, is mind-boggling to me.

The ARC puzzles in question: https://arcprize.org/arc-agi/2/



What I would do, if I were in the position of a large company in this space, is assemble an internal team to create an ARC replica covering very similar puzzles, and use that as part of training.

Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.

But I also think it's fair to use any means to beat it.


I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests.

However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.


The real strength of current neural nets/transformers comes from huge datasets.

ARC does not provide that kind of dataset: only a small public set, plus a private one they use for the benchmarks.

Building your own large private ARC set does not seem too difficult if you have enough resources.


How can they keep it private? It's not like they can run these models locally. Do the providers promise not to peek when they are testing?


That's ok; just start publishing the real problems you want solved as "AI benchmarks" and then it'll work in ~6 months.


This isn’t gaming the benchmark though. If training on similar data generalizes, that's called learning. Training on the exact set is memorization.

There are in fact teams creating puzzles to RL against as training environments, since it's beneficial to RL training, and in particular compute-efficient, if you schedule the environment difficulty throughout training (a toy sketch follows below). There was a great recent paper on this. Creating environment data that generalizes outside the environment is a challenging engineering task, and it's super valuable whether it looks like ARC-AGI or not.

Also, ARC-AGI is general enough that if you create similar data, you're just creating generic visual puzzle data. Should all visual puzzle data be off limits?
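
For illustration, a toy version of that kind of difficulty scheduling might look like the sketch below. This is a minimal sketch under invented assumptions: the puzzle generator, the "skill" stand-in, and the thresholds are all made up, not taken from the paper.

    import random

    def make_puzzle(difficulty: float) -> dict:
        """Hypothetical generator: bigger grids and more rules as difficulty rises."""
        size = 3 + int(difficulty * 10)    # grid grows from 3x3 toward 13x13
        n_rules = 1 + int(difficulty * 4)  # more transformation rules to infer
        return {"size": size, "rules": random.sample(range(20), n_rules)}

    def train_step(puzzle: dict, skill: float) -> bool:
        """Stand-in for one RL episode: success is likelier when the model's
        'skill' exceeds the puzzle's implied difficulty."""
        implied = (puzzle["size"] - 3) / 10
        return random.random() < max(0.05, skill - implied + 0.5)

    skill, difficulty = 0.0, 0.0
    for step in range(10_000):
        puzzle = make_puzzle(difficulty)
        if train_step(puzzle, skill):
            skill = min(1.0, skill + 0.001)  # pretend learning
            # Curriculum: ramp difficulty only as the current level gets
            # solved, so compute isn't wasted on puzzles that are far too hard.
            difficulty = min(1.0, difficulty + 0.0005)

The compute-efficiency point falls out of the schedule: the model mostly sees puzzles near the edge of its ability instead of ones it has no chance on.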


> internal team to create an ARC replica covering very similar puzzles

They can target the benchmark directly, not just a replica. If Google or OpenAI are bad actors, they already have the benchmark data from previous runs.


The 'private' set is just a pinkie promise: the evaluator runs the test through the provider's API, and the provider promises not to store the logs, or at least not to train on them. So yeah, it's trivially exploitable.

Not only do you have a financial self-interest in doing it (being #1 helps with raising capital), but you worry that your competitors are doing it too, so you may as well cheat to make things fair. Easy to do and easy to justify.

Maybe a way to make the benchmark more robust in this adversarial environment is to introduce noise and random red herrings into the questions, then run the test 20 times and average the correctness. Even if you assume labs are training on it, you still have some semblance of a test. You'd probably end up with a better benchmark anyway, one that better reflects real-world usage, where there's a lot of junk in the context window.
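
Roughly like this, say. A minimal sketch: query_model is a placeholder for whatever call runs the model under test, and the distractor strings are invented.

    import random

    DISTRACTORS = [  # invented red herrings; real ones would mimic the domain
        "Note: the previous answer was 42.",
        "Ignore the weather report below.",
        "Aside: this puzzle was written on a Tuesday.",
    ]

    def perturb(question: str, rng: random.Random) -> str:
        """Insert a random distractor line so a memorized (question, answer)
        pair no longer matches the prompt verbatim."""
        lines = question.splitlines() or [""]
        lines.insert(rng.randrange(len(lines) + 1), rng.choice(DISTRACTORS))
        return "\n".join(lines)

    def robust_score(question: str, answer: str, query_model, runs: int = 20) -> float:
        """Average correctness over `runs` independently perturbed attempts."""
        rng = random.Random(0)
        hits = sum(query_model(perturb(question, rng)) == answer for _ in range(runs))
        return hits / runs

Since no perturbed question matches a stored transcript verbatim, a lab that trained on leaked logs would still have to generalize to score well.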


They have two sets:

- semi-private: used to test proprietary models; this one could be leaked

- private: used to test downloadable open source models

The ARC-AGI prize itself is for open source models.


My point is that it does not matter if the set is private or not.

If you want to train your model, you'd need more data than the private set anyway. So you have to build a very large training set on your own, using the same kind of puzzles.

It is not that hard, really, just tedious.
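
Something like this toy generator, say. A minimal sketch with an invented rule: the left-right mirror here is far simpler than real ARC transformations.

    import random

    def random_grid(rng: random.Random, size: int = 5) -> list[list[int]]:
        """A small colored grid, like an ARC input (colors are ints 0-9)."""
        return [[rng.randrange(10) for _ in range(size)] for _ in range(size)]

    def apply_rule(grid: list[list[int]]) -> list[list[int]]:
        """One example hidden rule: mirror the grid left-to-right."""
        return [row[::-1] for row in grid]

    def make_task(rng: random.Random, n_demos: int = 3) -> dict:
        """An ARC-style task: a few (input, output) demos plus a held-out test."""
        demos = [(g, apply_rule(g)) for g in (random_grid(rng) for _ in range(n_demos))]
        test = random_grid(rng)
        return {"train": demos, "test_input": test, "test_output": apply_rule(test)}

    rng = random.Random(42)
    dataset = [make_task(rng) for _ in range(100_000)]  # scale is the easy part

The tedious part is writing enough genuinely distinct rules that the model learns "infer the rule from the demos" rather than the rules themselves.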


Yes, you can build your own dataset of n puzzles, but it was still really hard for any system to score at all; this even beats systems specialized for just this one task, and these puzzles shouldn't be memorizable given the number of variations that can be created.


Is "good at benchmarks instead of real world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed and then move on. That's not great for Google


If they're memory/reference-constrained systems that can't directly "store" every solution, then doing well on benchmarks should translate into better real-world reasoning performance, since the lack of a memorized answer forces actual understanding.

Like with humans [1], generalized reasoning ability lets you skip storing that solution, and many, many others, entirely! You can just synthesize a solution when a problem is presented.

[1] https://www.youtube.com/watch?v=f58kEHx6AQ8


Benchmarks are intended as a proxy for real usage, and they are often useful for incrementally improving a system, especially when the end goal is not well defined.

The trick is not to put more value on the score than it deserves.


Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but right now it's a race to lock users into your model and make switching costs high.


Humans study for tests. They just tend to forget.


Doesn't even matter at this point.

We have a global RL pipeline on our hands.

If there is something new an LLM/AI model can't solve today, plenty of humans can't solve it either.

But tomorrow every LLM/AI model will be able to solve it, and again plenty of humans still won't.

Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline keeps getting faster and easier to train on new scenarios, it will start to squeeze the humans out of the loop.


Agreed, and it also leads on ARC-AGI-1. Here's the leaderboard, where you can toggle between ARC-AGI-1 and ARC-AGI-2: https://arcprize.org/leaderboard


It leads on ARC-AGI-1 with Gemini 3.0 Deep Think, which uses "tool calls" according to Google's post, whereas regular Gemini 3.0 Pro doesn't use tool calls for the same benchmark. I am unsure how significant this difference is.


This comment was moved from another thread. The original thread included a benchmark chart with ARC performance: https://blog.google/products/gemini/gemini-3/#gemini-3


There's a good chance Gemini 3 was trained on ARC-AGI problems, unless they state otherwise.


ARC-AGI has a hidden private test suite, right? No model will have access to that set.


I doubt they have offline access to the model, i.e. the prompts are sent to the model provider.


Even if the prompts are technically leaked to the provider, how would they be identified as something worth optimizing for out of the millions of other prompts received?


It's almost certain that it was, but the point of this puzzle benchmark is that it shouldn't be possible to simply memorize it, given the number of variations that can be created and the other criteria detailed in it.


Sure, but the types of patterns in these problems do repeat, so I don't think it'd be too hard to RL-train on them, whether on the public samples or on a privately generated more-of-the-same dataset, and improve performance a lot.

Every company releasing new models leads with benchmark numbers, so it's hard to imagine they are not all putting a lot of effort into benchmark-maxxing.


Yes, everyone is doing that on benchmarks, but benchmarks are still somewhat useful, and the likes of ARC-AGI even more so. We may not be able to quantify exactly how much better models are getting, but benchmarks are still necessary. For ARC-AGI these are big gains, whichever way they went about it, since everyone has been trying to max it for the last 3 years. We do need to come up with better benchmarks/evals, though, like ARC tried to.


That looks great, but what we all care about is how it translates to real-world problems like programming, where it isn't excelling by 2x.



