I believe it's because of the way you measure things in RL: each episode only tells you whether the outcome was good (say reward +1) or bad (say 0 or a negative reward); it tells you nothing about which parts of the trace that produced the outcome were responsible for it. That single scalar reward is the only quantity used to produce your gradients, hence the amount of information in it is O(1).
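A minimal REINFORCE-style sketch of what that looks like (PyTorch, with toy shapes and a made-up reward purely for illustration): the whole sampled trace shares one scalar reward, so every token's log-probability is scaled by the same number.

  import torch

  # toy policy: independent logits over a small vocabulary, one step per token
  vocab, T = 8, 5
  logits = torch.randn(T, vocab, requires_grad=True)
  dist = torch.distributions.Categorical(logits=logits)
  tokens = dist.sample()              # sampled trace of T tokens
  log_probs = dist.log_prob(tokens)   # log pi(token_t), shape (T,)

  reward = 1.0                        # one scalar for the entire episode
  loss = -reward * log_probs.sum()    # every token gets the same O(1) signal
  loss.backward()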

This is in contrast to more "supervised" forms of learning, where you get a loss for each token produced (e.g. a cross-entropy loss), and consequently O(number of tokens) worth of information flowing into your gradients.
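The supervised counterpart, same toy shapes assumed: there is a target label at every position, so the gradient carries a separate error signal per token.

  import torch
  import torch.nn.functional as F

  vocab, T = 8, 5
  logits = torch.randn(T, vocab, requires_grad=True)
  targets = torch.randint(vocab, (T,))   # a label for every position

  # one loss term per token: T independent error signals reach the gradient
  loss = F.cross_entropy(logits, targets)
  loss.backward()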


