hypoxia's comments | Hacker News

Did you try it with high reasoning effort?


Sorry, not directed at you specifically. But every time I see questions like this I can’t help but rephrase in my head:

“Did you try running it over and over until you got the results you wanted?”


This is not a good analogy because reasoning models are not choosing the best from a set of attempts based on knowledge of the correct answer. It really is more like what it sounds like: “did you think about it longer until you ruled out various doubts and became more confident?” Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!


> Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!

One thing that's hard to wrap my head around is that we're giving more and more trust to something we don't understand, on the assumption (often unchecked) that it just works. Basically your refrain is used to justify all sorts of odd setups of AIs, agents, etc.


Trusting things to work based on practical experience and without formal verification is the norm rather than the exception. In formal contexts like software development, people have the means to evaluate and use good judgment.

I am much more worried about the problem where LLMs are actively misleading low-info users into thinking they’re people, especially children and old people.


Bad news: it doesn't seem to work as well as you might think: https://arxiv.org/pdf/2508.01191

As one might expect: the AI isn't actually thinking, it's just spending more tokens on the problem. This sometimes leads to the desired outcome, but the phenomenon is very brittle and disappears when the AI is pushed outside the bounds of its training.

To quote their discussion, "CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces."


I keep wondering whether people have actually examined how this work draws its conclusions before citing it.

This is science at its worst, where you start at an inflammatory conclusion and work backwards. There is nothing particularly novel presented here, especially not in the mathematics; obviously performance will degrade on out-of-distribution tasks (and will do so for humans under the same formulation), but the real question is how out-of-distribution a lot of tasks actually are if they can still be solved with CoT. Yes, if you restrict the dataset, then it will perform poorly. But humans already have a pretty large visual dataset to pull from, so what are we comparing to here? How do tiny language models trained on small amounts of data demonstrate fundamental limitations?

I'm eager to see more work showing the limitations of LLM reasoning, both at small and large scale, but this ain't it. Others have already supplied similar critiques, so let's please stop sharing this one around without the requisite grain of salt.


"This is science at its worst, where you start at an inflammatory conclusion and work backwards"

Science starts with a guess, and you run experiments to test it.


True, but the experiments are engineered to give the results they want. It's a mathematical certainty that performance will drop off here, but that is not an accurate assessment of what is going on at scale. If you present an appropriately large and well-trained model with in-context patterns, it often does a decent job, even when it isn't trained on them. With the model nerfed to 4 layers, the conclusion is foregone.

I honestly wish this paper actually showed what it claims, since it is a significant open problem to understand CoT reasoning relative to the underlying training set.


Without a provable holdout, the claim that "large models do fine on unseen patterns" is unfalsifiable. In controlled from-scratch training, CoT performance collapses under modest distribution shift, even with plausible chains. If you have results where the transformation family is provably excluded from training and a large model still shows robust CoT, please share them. Otherwise this paper's claim stands for the regime it tests.


I don't buy this, for the simple reason that benchmarks show much better performance on thinking than on non-thinking models. Benchmarks already consider the generalisation and "unseen patterns" aspect.

What would be your argument against:

1. CoT models performing way better on benchmarks than normal models

2. people choosing to use CoT models in day-to-day life because they actually find they give better performance


This paper's claim holds - for 4-layer models. Models improve dramatically on out-of-distribution examples at larger scales.


> claim that "large models do fine on unseen patterns" is unfalsifiable

I know what you're saying here, and I know it is primarily a critique of my phrasing, but establishing something like this is the objective of in-context learning theory and mathematical applications of deep learning. It is possible to prove that sufficiently well-trained models will generalize for certain unseen classes of patterns, e.g. transformer acting like gradient descent. There is still a long way to go in the theory---it is difficult research!

> performance collapses under modest distribution shift

The problem is that the notion of "modest" depends on the scale here. With enough varied data and/or enough parameters, what was once out-of-distribution can become in-distribution. The paper deliberately ignores this. Yes, the claims hold for tiny models, but I don't think anyone ever doubted that.


A viable consideration is that the models will home in on and reinforce an incorrect answer - a natural side effect of the LLM technology wanting to push certain answers higher in probability and to repeat anything in context.

Whether it's in a conversation or in a thinking context, this doesn't prevent the model from producing the wrong answer, so the paper on the illusion of thinking makes sense.

What actually seems to be happening is a form of conversational prompting. Of course, with the right back-and-forth conversation with an LLM you can inject knowledge in a way that shifts the natural distribution (again, a side effect of the LLM tech), but by itself it won't naturally get the answer right every time.

If this extended thinking were actually working, you would expect the LLM to be able to logically conclude the answer with very high accuracy, which it does not.


The other commenter is more articulate, but you simply cannot draw the conclusion from this paper that reasoning models don't work well. They trained tiny little models and showed they don't work. Big surprise! Meanwhile, every other piece of evidence available shows that reasoning models are more reliable on sophisticated problems. Just a few examples:

- https://arcprize.org/leaderboard

- https://aider.chat/docs/leaderboards/

- https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...

Surely the IMO problems weren't "within the bounds" of Gemini's training data.


The Gemini IMO result used a model specifically fine-tuned for math.

Certainly they weren't training on the unreleased problems. Defining "out of distribution" gets tricky.


> The Gemini IMO result used a model specifically fine-tuned for math.

This is false.

https://x.com/YiTayML/status/1947350087941951596

This is false even for the OpenAI model:

https://x.com/polynoamial/status/1946478250974200272

"Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques."


Every human taking that exam has fine-tuned for math, specifically on IMO problems.


This is not the slam dunk you think it is. Thinking longer genuinely provides better accuracy. Sure, there are diminishing returns to increasing thinking tokens.

GPT-5 fast gets many things wrong, but switching to the thinking model very often fixes the issues.


They experimented with GPT-2-scale models. It's hard to draw any meaningful conclusions from that in the GPT-5 era.


What you describe is a person selecting the best results, but if you can get better results one-shot with that option enabled, it’s worth testing and reporting results.


I get that. But then if that option doesn't help, what I've seen is that the next follow-up is inevitably "have you tried doing/prompting x instead of y?"


It can be summarized as "Did you RTFM?". One shouldn't expect optimal results if the time and effort weren't invested in learning the tool, any tool. LLMs are no different. GPT-5 isn't one model, it's a family: gpt-5, gpt-5-mini, and gpt-5-nano, each taking high|medium|low reasoning-effort settings. Anyone who is serious about measuring model capability would go for the best configuration, especially in medicine.

I skimmed through the paper and I didn't see any mention of what parameters they used, other than that they use gpt-5 via the API.

What was the reasoning_effort? verbosity? temperature?

These things matter.
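
For reference, pinning these down looks something like this with the openai Python SDK (a rough sketch: the parameter names are the Responses API ones as I understand them, so double-check the current docs, and the prompt is obviously made up):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Explicitly pin the knobs that papers often leave unreported:
    # which model tier, how much reasoning effort, how verbose the output.
    response = client.responses.create(
        model="gpt-5",                       # vs. gpt-5-mini / gpt-5-nano
        reasoning={"effort": "high"},        # low | medium | high
        text={"verbosity": "low"},           # low | medium | high
        input="A 54-year-old presents with ...",  # hypothetical prompt
    )
    print(response.output_text)

If a paper just says "we evaluated gpt-5 via the API", the result is underspecified: low-effort gpt-5-nano and high-effort gpt-5 are very different systems.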


Something I've experienced with multiple new model releases is that plugging them into my app makes my app worse. Then I do a bunch of work on prompts and now my app is better than ever. And it's not like the prompts are just better and make the old model work better too - usually the new prompts make the old model worse, or there isn't any change.

So it makes sense to me that you should try until you get the results you want (or fail to do so). And it makes sense to ask people what they've tried. I haven't done the work yet to try this for GPT-5 and am not that optimistic, but it is possible it will turn out this way again.
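
When I do that work, it basically amounts to a little 2x2 harness like the sketch below (the model names, cases, and grader are stand-ins for illustration, not a real client):

    # Toy harness: does the new prompt only help the new model, or both?
    CASES = [
        ("summarize this support ticket ...", "refund"),  # (input, expected keyword)
        ("summarize this bug report ...", "crash"),
    ]

    def call_model(model: str, prompt_template: str, case_input: str) -> str:
        """Stand-in for a real API call; swap in your actual client here."""
        return "stubbed output mentioning refund and crash"

    def score(output: str, expected: str) -> float:
        """Stand-in grader; in practice an exact check, rubric, or LLM judge."""
        return float(expected.lower() in output.lower())

    def evaluate(model: str, prompt_template: str) -> float:
        results = [score(call_model(model, prompt_template, x), y) for x, y in CASES]
        return sum(results) / len(results)

    for model in ("old-model", "new-model"):              # hypothetical names
        for prompt in ("old_prompt_v3", "new_prompt_v1"):
            print(f"{model} + {prompt}: {evaluate(model, prompt):.2f}")

Running all four cells is what tells you whether the improvement came from the model, the prompt, or only the combination.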


> I get that. But then if that option doesn't help, what I've seen is that the next follow-up is inevitably "have you tried doing/prompting x instead of y?"

Maybe I’m misunderstanding, but it sounds like you’re framing a completely normal process (try, fail, adjust) as if it’s unreasonable?

In reality, when something doesn’t work, it would seem to me that the obvious next step is to adapt and try again. That doesn’t seem like a radical approach; it seems to be, more or less, how problem solving works.

For example, when I was a kid trying to push start my motorcycle, it wouldn’t fire no matter what I did. Someone suggested a simple tweak, try a different gear. I did, and instantly the bike roared to life. What I was doing wasn’t wrong, it just needed a slight adjustment to get the result I was after.


I get trying and improving until you get it right. But I just can't bridge the gap in my head between

1. this is magic and will one-shot your questions, and 2. if it goes wrong, keep trying until it works

Plus, knowing it's all probabilistic, how do you know, without knowing ahead of time already, that the result is correct? Is that not the classic halting problem?


> I get trying and improving until you get it right. But I just can't bridge the gap in my head between

> 1. this is magic and will one-shot your questions, and 2. if it goes wrong, keep trying until it works

Ah that makes sense. I forgot the "magic" part, and was looking at it more practically.


To clarify on the “learn and improve” part, I mean I get it in the context of a human doing it. When a person learns, that lesson sticks so errors and retries are valuable.

For LLMs none of it sticks. You keep “teaching” it and the next time it forgets everything.

So again you keep trying until you get the results you want, which you need to know ahead of time.


Or...

"Did you try a room full of chimpanzees with typewriters?"


I think the defining story of 2025 will be AI agents getting very good with computer use, largely enabled by RL fine tuning.


Let's hope so; computer use with AI is currently absolutely terrible. It's something I expected to see far more progress on this year, but it's no better than last year.


Yeah, +1. Looking back to the WebVoyager [1] and GPT4V generalist agent [2] papers from last January, it feels like we haven't come that far.

But there are now several major technical unlocks: fine-tuning for cursor locations (in Claude), better reasoning with o3, and RL fine-tuning so we can learn based on task success.

That gives me significant hope.

[1] https://arxiv.org/abs/2401.13919

[2] https://arxiv.org/abs/2401.01614


Could you help me understand the importance of RL fine-tuning? What can it accomplish that regular fine-tuning can't? What's a use case for it?


In my experience, there are three key issues with agents today:

1. They usually don't end up completing the right set of steps required to finish tasks when using our human-defined frameworks (ReAct, ReWOO, supervisor-worker, teams of multi-agents, etc.)

2. They get lost easily, forgetting what they were doing or completing the same tasks over and over in a loop (bad planning)

3. They exit early, thinking they have completed the task when they have not (bad evaluation)

The jump in reasoning ability from 4o to o3 will enable a drastic improvement in planning and execution within our human-defined frameworks.

But, more importantly, I believe RL fine-tuning will enable the model to learn better general approaches to planning and executing steps to complete work. This is Sutton's bitter lesson at work.
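
To make the contrast with regular supervised fine-tuning concrete, here's a deliberately toy sketch (not OpenAI's actual RFT machinery; the plans, grader, and update rule are all invented for illustration). SFT pushes the model toward reproducing a reference demonstration; RL fine-tuning samples whole attempts, grades the outcome, and reinforces whatever actually worked:

    import random

    # Toy "policy": a distribution over candidate plans for a desktop task.
    # A real LLM samples these token by token; here they're canned strings.
    policy = {
        "open_app -> click_export -> save": 0.2,
        "open_app -> screenshot -> give_up": 0.5,                  # current bad habit
        "open_app -> search_menu -> click_export -> save": 0.3,
    }

    def grader(plan: str) -> float:
        """Outcome-based reward: did the task actually get done? (hypothetical)"""
        return 1.0 if plan.endswith("save") else 0.0

    def normalize(p):
        z = sum(p.values())
        return {k: v / z for k, v in p.items()}

    def sft_step(p, demo, lr=0.5):
        """Supervised fine-tuning: imitate one reference demo, success or not."""
        p = dict(p)
        p[demo] += lr
        return normalize(p)

    def rft_step(p, lr=0.5):
        """RL fine-tuning (reward-weighted): sample, grade, reinforce success."""
        p = dict(p)
        plan = random.choices(list(p), weights=list(p.values()))[0]
        p[plan] += lr * grader(plan)   # only attempts that finish get boosted
        return normalize(p)

    p_rft = normalize(policy)
    for _ in range(200):
        p_rft = rft_step(p_rft)
    print("after RFT:", p_rft)  # mass shifts onto plans that actually complete the task

    # SFT, by contrast, only ever moves toward the single demonstrated plan:
    print("after SFT:", sft_step(normalize(policy), demo="open_app -> click_export -> save"))

The point is the training signal: SFT needs someone to write down the right steps, while RL fine-tuning only needs a grader that can tell whether the task got done, which is much easier to supply for messy desktop workflows.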

For me, desktop automation is the killer app of RL fine tuning, rather than better reasoning in chatbot apps and APIs.

When OpenAI releases their desktop agent capabilities built on this, hopefully in Jan, I think we're going to see another ChatGPT moment.

Even if not, the ability to easily train the system to complete your tasks successfully with full desktop usage is going to be a major unlock for enterprises.

More on RL fine tuning here: https://openai.com/form/rft-research-program/


Many are incorrectly citing 85% as human-level performance.

85% is just the (semi-arbitrary) threshold for winning the prize.

o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.

...

Here's the full breakdown by dataset, since none of the articles make it clear:

Private Eval:

- 85%: threshold for winning the prize [1]

Semi-Private Eval:

- 87.5%: o3 (unlimited compute) [2]

- 75.7%: o3 (limited compute) [2]

Public Eval:

- 91.5%: o3 (unlimited compute) [2]

- 82.8%: o3 (limited compute) [2]

- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:

- 76.2%: human average (Mechanical Turk) [1] [3]

...

References:

[1] https://arcprize.org/guide

[2] https://arcprize.org/blog/oai-o3-pub-breakthrough

[3] https://arxiv.org/abs/2409.01374


If my life depended on the average rando solving 8/10 arc-prize puzzles, I'd consider myself dead.


It actually beats the human average by a wide margin:

- 64.2% for humans vs. 82.8%+ for o3.

(Full breakdown by dataset and references in my comment above.)


Superhuman isn't beating rando Mechanical Turkers.

Their post has STEM grads at nearly 100%.


This is correct. It's easy to get arbitrarily bad results on Mechanical Turk, since without any quality control people will just click as fast as they can to get paid (or bot it and get paid even faster).

So in practice, there's always some kind of quality control. Stricter quality control will improve your results, and the right amount of quality control is subjective. This makes any assessment of human performance meaningless without an explanation of how those humans were selected and incentivized. Chollet is careful to provide that, but many posters here are not.

In any case, the ensemble of task-specific, low-compute Kaggle solutions is reportedly also super-Turk, at 81%. I don't think anyone would call that AGI, since it's not general; but if the "(tuned)" in the figure means o3 was tuned specifically for these tasks, that's not obviously general either.


I did, and then promptly used it for 2 hours straight. It's excellent. Going to save me so much time.


My $0.02: it's too hard to build and iterate on complex workflows.

Every agent uses a meta-workflow (e.g. ReAct is plan -> act -> observe, with some added steps to check for completion, etc.).
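
Roughly, that meta-workflow is just a loop like this (a minimal sketch; llm() and the tool registry are stand-ins, not any particular framework's API):

    # Minimal plan -> act -> observe loop, illustrative only.
    def llm(prompt: str) -> str:
        """Stand-in for a model call that returns the next action."""
        return "FINISH: 4"  # placeholder so the sketch runs end to end

    TOOLS = {"calculator": lambda expr: str(eval(expr))}  # toy tools; don't eval untrusted input

    def run_agent(task: str, max_steps: int = 8) -> str:
        scratchpad = f"Task: {task}\n"
        for _ in range(max_steps):
            decision = llm(scratchpad + "Decide the next step.")          # plan
            if decision.startswith("FINISH:"):                            # self-evaluate completion
                return decision.removeprefix("FINISH:").strip()
            tool, _, arg = decision.partition(" ")                        # act
            observation = TOOLS.get(tool, lambda a: "unknown tool")(arg)
            scratchpad += f"Action: {decision}\nObservation: {observation}\n"  # observe
        return "gave up"                                                  # step budget exhausted

    print(run_agent("What is 2 + 2?"))

Everything interesting (which step gets planned next, how completion is checked, how the loop recovers from a bad observation) lives in a handful of lines like these, and that's exactly the part that's hard to build and iterate on with today's tooling.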

The teams that have been successful with agents have done so by building better, but more complex, workflows.

Most notably, AlphaCodium's "From Prompt Engineering to Flow Engineering" https://github.com/Codium-ai/AlphaCodium

Our current tools don't do a great job of making it simple to build and iterate on these workflows.

For example, here's an HN post from yesterday where a user created their own workflow management platform because of their frustration with the leading tooling providers: https://news.ycombinator.com/item?id=42299098

I think once we get this tooling right and start to build more expertise in the process of flow engineering, we'll start to see faster improvement in agent quality.


It does seem like everyone is trying to figure it out as we go. It feels like we're still in the very early days.


Thank you for building this! It looks excellent and geared at exactly the same problems I've been facing. In fact, I've been working on a very similar package and this may have just saved me a ton of time. Excited to give it a try!


Thank you for the kind words! Feel free to join our discord https://discord.gg/nNFUUDAKub and discuss stuff there. Also ping me there any time, always happy to help you onboard!


Yes, they are overblown, with some caveats.

In terms of API usage, OpenAI has never used the prompts for training, but this is very poorly understood among enterprise CEOs and CIOs. Executives heard about the Samsung incident early on (confidential information submitted by employees via the ChatGPT interface, which was training on the data by default at the time), and their trust was shaken in a fundamental way.

The email analogy is very apt: companies send all of their secrets to other people's computers for processing (cloud compute, email, etc.) without any issue. BUT there's a big caveat: abuse moderation. Prompts, including API calls, are normally stored by OpenAI/MS/etc. for a certain period and may be viewed by a human to check for abuse (e.g. using the system to craft phishing requests). This causes significant issues when it comes to certain types of data. Worth noting that the moderation-by-default approach is in the process of being dialed down, and there are now top-tier enterprise plans that are no longer moderated by 3rd parties by default.

TL;DR: The concern stems from an early loss of trust (Samsung); there is a valid issue for certain types of data (abuse moderation), but there are ways around it if you have enough money (enterprise plans).


At my Uni we have an enterprise contract with Microsoft for Copilot that constrains what can be done with our data.

I’d imagine keeping track of the prompts would be highly valuable for evaluation, if not training, because you do want to know what people are doing with it, where it succeeds, where it fails, etc.


Open auctions will help.

In the last year, we've ended up #2 in 6 bidding wars (as disclosed by the listing agents) in one particular area of the GTA. In each case we reached our absolute max and wouldn't have paid any more.

Several times we lost by $100-200k, and once by $250k. These overpayments set new price benchmarks for the area which became sticky. To continue to be competitive, we had to make hard sacrifices to increase our budget throughout the year.

The fact that houses continued to move at the prices determined by these over-payments indicates there are some buyers at the higher prices. However, the market is very thin and the pace of price growth in this speculative market would've been slowed with open bidding.

Beyond the blind bidding issue, I don't know why people focus on foreign and large corporate buyers. Yes, they're scary because they represent potentially large sources of demand. But are they actually buying a large percentage of the homes? No. It's the smaller investors who speculatively bought 40% of homes last fall.

But we should definitely protect mom-and-pop investors, such as our housing minister (wink).

Real solution? Reduce the incentive for speculation among these groups by treating all gains on non-primary residences as income.

