
I think my favorite of the bunch is the "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model" paper. Easy to read, gets the point across very intuitively and quickly, and the point is very interesting and relevant to a lot of people.

About the Superposition paper - this is close to what I've been thinking about over the past week. My hunch is that concepts or choices held in a "superposition" are harder for a fully differentiable neural net to reason about. For example, if there's a "green" vs "purple" choice to be made, the net can't fully commit to either (especially if they're 50-50) and has to reason about both simultaneously, which is hard in a nonlinear feature space. Discretizing to tokens (a non-differentiable argmax) forces a choice, and that lets it reason about a single concept separately and more easily.
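A rough sketch of the distinction I mean, with made-up logits (nothing from the paper):

  import numpy as np

  # Hypothetical logits for a near 50-50 "green" vs "purple" choice.
  logits = np.array([0.10, 0.05])            # [green, purple]

  # Soft, differentiable readout: both concepts stay active in superposition,
  # so downstream layers have to reason about the blend.
  soft_state = np.exp(logits) / np.exp(logits).sum()   # ~[0.51, 0.49]

  # Hard, non-differentiable readout at the token boundary: commits to one
  # concept, so downstream reasoning sees a single clean direction.
  hard_state = np.eye(2)[np.argmax(logits)]            # [1., 0.]

  print(soft_state, hard_state)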





I am not sure how to interpret the first paper's results.

If we use a random number generator, then we will converge to 100% correct answers under pass@n in the limit.

A random number generator will eventually match or outperform every model (for large enough n) whenever top-p is less than 1: the other models will most likely have some bias that makes certain correct CoTs mathematically impossible, because the required tokens are too improbable and get filtered out by top-p. Those models asymptote below 100%, while the RNG reaches 100% in an almost-sure sense.
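Back-of-the-envelope version of that limit argument, with made-up numbers (say the final answer is an integer in 0-999):

  # Coverage after k independent attempts with per-attempt success probability p.
  def pass_at_k(p: float, k: int) -> float:
      return 1.0 - (1.0 - p) ** k

  # Uniform random guesser over 1000 candidate answers:
  for k in (1, 256, 1024, 4096):
      print(k, round(pass_at_k(1 / 1000, k), 3))   # 0.001, 0.226, 0.641, 0.983 -> tends to 1

  # A model whose top-p filter gives the correct CoT probability 0 stays at 0 forever:
  print(pass_at_k(0.0, 10**9))                     # 0.0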

Under this paper's logic, doesn't that mean the random number generator is a superior reasoner?


It’s quite a deceptive paper. On the main headline benchmarks (math500, aime24/25), the final answer is just a number from 0-1000, so what is the takeaway supposed to be for pass@k of 512/1024?

On the unstructured outputs, where you can’t just ratchet up the pass@k until it’s almost random, it switches the base model out for an instruct model, and in the worst case, on LiveCodeBench, it uses a qwen-r1-distill as a _base_ model (!?), which is an instruct model further fine-tuned on R1’s reasoning traces. I assume that was because no matter how high the pass@k, a base model won’t output correct Python.


I'm not sure how likely it is that an answer would fall outside the top-p of 0.95 used in the paper. A random number generator would also need an unreasonably high number of samples to get a correct answer. I think figures 17 and 18 are interesting for this discussion too: they show performance at various sampling temperatures. I think the point of the paper is that RL "sharpens" the distribution of non-RL nets, but it does not uncover any new reasoning paths - non-RL nets already had multiple decently high-probability paths of answering questions to begin with, and RL reuses a subset of those.
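A toy way to picture that sharpening story (made-up per-problem success probabilities, not numbers from the paper): sharpening wins at small k, but caps coverage at large k if a needed path gets pruned.

  # Probability that one sample solves the problem, for two hypothetical problems.
  base = [0.20, 0.05]   # base model keeps a low-probability path alive for problem B
  rl   = [0.90, 0.00]   # RL sharpens onto problem A's path but has pruned B's entirely

  def pass_at_k(p, k):
      return 1 - (1 - p) ** k

  for k in (1, 8, 256):
      avg_base = sum(pass_at_k(p, k) for p in base) / len(base)
      avg_rl   = sum(pass_at_k(p, k) for p in rl) / len(rl)
      print(k, round(avg_base, 3), round(avg_rl, 3))
  # k=1:   base 0.125, RL 0.45  -> RL looks better at small k
  # k=256: base ~1.0,  RL 0.5   -> base overtakes at large k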

  > I think the point of the paper is that RL "sharpens" the distribution of non-RL nets, but it does not uncover any new reasoning paths

This is an implication of the results that's intuitive and likely to be correct, but it isn't guaranteed to be correct. The results do show that the RL model has worse answer correctness at large k, but answers and the reasoning strategies used to arrive at them are different things. It's impractical to inspect the CoTs of both the RL and base models to show that all the reasoning strategies used by the former are a subset of the latter. For all we know, the Venn diagram might not be fully overlapping: the RL could have uncovered some novel and subtle reasoning strategies not present in the base model while also introducing separate handicaps for some unknown reason, which nerfed answer correctness at large k. We'd need some theory to bridge that gap, which seems to be lacking in the paper. Not that I fault the authors for the absence of such a theory, because it seems intractable - but then I'm doubtful one could reach as neat a conclusion as they have tried to, beyond the appeal to strong intuition (which I also share).

Ah, I think I agree. There could be an unrelated handicap, so there's no guarantee or proof.

I agree that pass@k feels a bit weird for large k. But for LLMs it's a decent proxy for "is the knowledge/skill/circuitry needed to solve the problem somewhere in the model". Note that "large k" here is on the order of 256, and the range of valid answers is much larger than that. So your infinite-monkeys critique, while true in the limit, doesn't bite here: a random guesser wouldn't actually outperform the models in the tested regime.
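For reference, the pass@k numbers in papers like this are usually computed with the unbiased estimator from the Codex/HumanEval paper (I'm assuming this one does the same): draw n samples, count c correct, and estimate pass@k as 1 - C(n-c, k) / C(n, k).

  import numpy as np

  def pass_at_k(n: int, c: int, k: int) -> float:
      # Unbiased estimator of pass@k from n samples with c correct
      # (Chen et al. 2021, "Evaluating Large Language Models Trained on Code").
      if n - c < k:
          return 1.0
      return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

  print(pass_at_k(n=1024, c=3, k=256))  # ~0.58 with only 3 correct samples out of 1024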

Also, in practice, models don't have that much semantic entropy for a given prompt. With temperature-based sampling, models tend to generate very similar but not identical responses.


To me, intellect has two parts: "creativity" and "correctness". From this perspective a random sampler is infinitely "creative" - over (infinite) time it can come up with an answer to any given problem. And it does feel natural that base models are more "creative" (because that's what's being measured in the paper), while RL models are more "correct" (that's the slope of the curve in the paper).

> "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model"

I believe NVidia’s ProRL showed otherwise, right?



