
Nit: the author says that supervised fine tuning is a type of RL, but it is not. RL is about delayed reward. Supervised fine tuning is not in any way about delayed reward.


RL is about getting numerical feedback on outputs, in contrast to supervised learning, where you're given examples of what the output should be. There are many RL problems with no delayed rewards, e.g. multi-armed bandits.

Admittedly, most interesting cases do have delays. (A minimal bandit sketch below illustrates the no-delay case.)
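
A minimal sketch of that bandit point, assuming a made-up epsilon-greedy setup in Python (run_bandit and its parameters are invented for illustration, not from the thread): the reward for each pull arrives immediately, so there is numerical feedback but no delay and no credit assignment over time.

  # Epsilon-greedy multi-armed bandit: reward is immediate after each pull,
  # so there is no delayed reward, yet it is still an RL problem.
  import random

  def run_bandit(true_means, steps=10_000, epsilon=0.1):
      n_arms = len(true_means)
      counts = [0] * n_arms          # pulls per arm
      estimates = [0.0] * n_arms     # running average reward per arm

      total_reward = 0.0
      for _ in range(steps):
          # Explore with probability epsilon, otherwise exploit the best estimate.
          if random.random() < epsilon:
              arm = random.randrange(n_arms)
          else:
              arm = max(range(n_arms), key=lambda a: estimates[a])

          # Immediate, numerical feedback: a noisy reward from the chosen arm.
          reward = random.gauss(true_means[arm], 1.0)
          total_reward += reward

          # Incremental update of the running mean for this arm.
          counts[arm] += 1
          estimates[arm] += (reward - estimates[arm]) / counts[arm]

      return estimates, total_reward

  if __name__ == "__main__":
      estimates, total = run_bandit([0.1, 0.5, 0.9])
      print("estimated arm values:", [round(e, 2) for e in estimates])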


Well, they can be used together in some contexts, so while they are different, you could also say RL can build on supervised fine-tuning for further optimization.


SFT is part of the classic RLHF process, though.


RL is not about delayed reward. Multi-armed bandit problems have no credit assignment component, yet they're often the first RL problem taught.

In its most general form, RL is about learning a policy (a state -> action mapping), which often requires inferring value, etc.

But copying a strong reference policy ... is still learning a policy, whether by SFT or not. (A toy sketch below makes the parallel concrete.)
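
A toy, hypothetical Python sketch of that framing (the names sft_step, reinforce_step, and the expert/reward functions are invented for illustration): both updates shape the same tabular state -> action policy. The SFT-style step imitates a reference ("expert") policy via the cross-entropy gradient, while the RL-style step learns from an immediate numerical reward (REINFORCE).

  import math
  import random

  N_STATES, N_ACTIONS = 4, 3
  logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # the learned policy

  def softmax(xs):
      m = max(xs)
      exps = [math.exp(x - m) for x in xs]
      z = sum(exps)
      return [e / z for e in exps]

  def sft_step(state, expert_action, lr=0.5):
      # Supervised step: push probability toward the expert's action
      # (gradient of cross-entropy with respect to the logits).
      probs = softmax(logits[state])
      for a in range(N_ACTIONS):
          target = 1.0 if a == expert_action else 0.0
          logits[state][a] += lr * (target - probs[a])

  def reinforce_step(state, reward_fn, lr=0.5):
      # RL step: sample an action, observe an immediate numerical reward,
      # and scale the log-prob gradient by that reward (REINFORCE).
      probs = softmax(logits[state])
      action = random.choices(range(N_ACTIONS), weights=probs)[0]
      reward = reward_fn(state, action)
      for a in range(N_ACTIONS):
          indicator = 1.0 if a == action else 0.0
          logits[state][a] += lr * reward * (indicator - probs[a])

  # Toy setup: an "expert" that picks action (state % N_ACTIONS), and a
  # reward of 1.0 for matching it. Either update learns the same behavior.
  expert = lambda s: s % N_ACTIONS
  reward_fn = lambda s, a: 1.0 if a == expert(s) else 0.0

  for _ in range(500):
      s = random.randrange(N_STATES)
      sft_step(s, expert(s))        # copy the reference policy directly
      reinforce_step(s, reward_fn)  # or learn the same behavior from reward

  print([max(range(N_ACTIONS), key=lambda a: logits[s][a]) for s in range(N_STATES)])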



