
Can someone explain the bit counting argument in the reinforcement learning part?

I don’t get why a trajectory would provide only one bit of information.

Each step of the trajectory at least gives information about which state transitions are possible.

An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.
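To make that concrete, here's a toy sketch (Python; the chain MDP, its size, and the step count are all invented for illustration) where the number of distinct transitions you learn about grows with trajectory length rather than staying constant:

    import random
    from collections import defaultdict

    # Toy random walk on a small chain MDP (everything here is made up).
    # Every step reveals which (state, action) -> next_state transition is
    # possible, so what you learn grows with the trajectory length.
    N_STATES = 10
    observed = defaultdict(set)

    state = 0
    for _ in range(10_000):
        action = random.choice([-1, +1])
        next_state = max(0, min(N_STATES - 1, state + action))
        observed[(state, action)].add(next_state)
        state = next_state

    print(sum(len(v) for v in observed.values()), "distinct transitions observed")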



I believe it's because of the way you measure things in RL: each episode only tells you whether it was good (say reward +1) or bad (say 0 or a negative reward); it does not tell you anything about the trace that was produced to get that outcome. This reward is the only thing used to produce your gradients, hence the amount of information in it is O(1).

This is in contrast to more "supervised" forms of learning, where you get a loss for each token produced (e.g. a cross-entropy loss) and where, as a consequence, you get O(number of tokens) bits of information into your gradients.
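A rough sketch of the contrast (PyTorch assumed; the shapes, the +1/0 reward, and all names are made up for illustration): the RL-style loss funnels the whole trajectory through one scalar, while the supervised loss has a target for every position.

    import torch
    import torch.nn.functional as F

    # Hypothetical shapes: a policy emits logits for T tokens over a vocab of size V.
    T, V = 128, 50_000
    logits = torch.randn(T, V, requires_grad=True)   # stand-in for model outputs
    actions = torch.randint(0, V, (T,))              # tokens actually sampled

    # RL-style (REINFORCE) update: the only measurement is one scalar reward,
    # so the gradient signal carries on the order of O(1) bits per episode.
    reward = 1.0  # e.g. +1 if the episode "succeeded", 0 otherwise
    log_probs = F.log_softmax(logits, dim=-1)[torch.arange(T), actions]
    rl_loss = -reward * log_probs.sum()

    # Supervised-style update: a target exists for every token, so each of the
    # T positions contributes its own error term, i.e. O(T) bits per sequence.
    targets = torch.randint(0, V, (T,))              # ground-truth tokens
    sl_loss = F.cross_entropy(logits, targets)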


A fair amount of research has shown that RL doesn't add knowledge to the base model; it just optimizes paths that already exist. Now ProRL from Nvidia has shown there are ways of adding knowledge, mostly through progressive merging.

I'm still not fully convinced of the 1-bit claim; they made other mistakes in the blog post.



