
Nit: the author says that supervised fine tuning is a type of RL, but it is not. RL is about delayed reward. Supervised fine tuning is not in any way about delayed reward.


RL is about getting numerical feedback on outputs, in contrast to supervised learning, where you're given examples of what the output should be. There are many RL problems with no delayed rewards, e.g. multi-armed bandits.

Admittedly, most interesting cases do have delays. (A minimal bandit sketch below illustrates the no-delay case.)
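
A minimal sketch of that bandit point, assuming a made-up epsilon-greedy setup in Python (run_bandit and its parameters are invented for illustration, not from the thread): the reward for each pull arrives immediately, so there is numerical feedback but no delay and no credit assignment over time.

  # Epsilon-greedy multi-armed bandit: reward is immediate after each pull,
  # so there is no delayed reward, yet it is still an RL problem.
  import random

  def run_bandit(true_means, steps=10_000, epsilon=0.1):
      n_arms = len(true_means)
      counts = [0] * n_arms          # pulls per arm
      estimates = [0.0] * n_arms     # running average reward per arm

      total_reward = 0.0
      for _ in range(steps):
          # Explore with probability epsilon, otherwise exploit the best estimate.
          if random.random() < epsilon:
              arm = random.randrange(n_arms)
          else:
              arm = max(range(n_arms), key=lambda a: estimates[a])

          # Immediate, numerical feedback: a noisy reward from the chosen arm.
          reward = random.gauss(true_means[arm], 1.0)
          total_reward += reward

          # Incremental update of the running mean for this arm.
          counts[arm] += 1
          estimates[arm] += (reward - estimates[arm]) / counts[arm]

      return estimates, total_reward

  if __name__ == "__main__":
      estimates, total = run_bandit([0.1, 0.5, 0.9])
      print("estimated arm values:", [round(e, 2) for e in estimates])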


Well, they can be used together in some contexts, so while they are different, you could also say RL can build on supervised fine-tuning for further optimization.


SFT is part of the classic RLHF process, though.


RL is not about delayed reward. Multi-armed bandit problems have no credit assignment component, yet they're often the first RL problem taught.

In its most general form, RL is about learning a policy (a state -> action mapping), which often requires inferring value, etc.

But copying a strong reference policy ... is still learning a policy, whether by SFT or not. (A toy sketch below makes the parallel concrete.)
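
A toy, hypothetical Python sketch of that framing (the names sft_step, reinforce_step, and the expert/reward functions are invented for illustration): both updates shape the same tabular state -> action policy. The SFT-style step imitates a reference ("expert") policy via the cross-entropy gradient, while the RL-style step learns from an immediate numerical reward (REINFORCE).

  import math
  import random

  N_STATES, N_ACTIONS = 4, 3
  logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # the learned policy

  def softmax(xs):
      m = max(xs)
      exps = [math.exp(x - m) for x in xs]
      z = sum(exps)
      return [e / z for e in exps]

  def sft_step(state, expert_action, lr=0.5):
      # Supervised step: push probability toward the expert's action
      # (gradient of cross-entropy with respect to the logits).
      probs = softmax(logits[state])
      for a in range(N_ACTIONS):
          target = 1.0 if a == expert_action else 0.0
          logits[state][a] += lr * (target - probs[a])

  def reinforce_step(state, reward_fn, lr=0.5):
      # RL step: sample an action, observe an immediate numerical reward,
      # and scale the log-prob gradient by that reward (REINFORCE).
      probs = softmax(logits[state])
      action = random.choices(range(N_ACTIONS), weights=probs)[0]
      reward = reward_fn(state, action)
      for a in range(N_ACTIONS):
          indicator = 1.0 if a == action else 0.0
          logits[state][a] += lr * reward * (indicator - probs[a])

  # Toy setup: an "expert" that picks action (state % N_ACTIONS), and a
  # reward of 1.0 for matching it. Either update learns the same behavior.
  expert = lambda s: s % N_ACTIONS
  reward_fn = lambda s, a: 1.0 if a == expert(s) else 0.0

  for _ in range(500):
      s = random.randrange(N_STATES)
      sft_step(s, expert(s))        # copy the reference policy directly
      reinforce_step(s, reward_fn)  # or learn the same behavior from reward

  print([max(range(N_ACTIONS), key=lambda a: logits[s][a]) for s in range(N_STATES)])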



