RL constrains the space of possible output token sequences to those likely to lead to a correct answer, so we are inherently trading diversity for reliability. A non-RL model has higher variance, so given enough attempts it will come up with some correct answers that an RL-tuned model can't reach.
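This trade-off shows up cleanly in a toy pass@k simulation. A minimal sketch (every number here is invented purely for illustration, not measured from any real model): an RL-tuned policy that is very reliable on the fraction of problems its narrowed distribution still covers, versus a base policy with a small but nonzero per-attempt success probability on everything.

```python
import random

# All parameters below are hypothetical, chosen only to illustrate the argument.
N_PROBLEMS = 1000
K = 100  # attempts per problem (pass@k)

# RL-tuned policy: highly reliable on the 70% of problems its narrowed
# distribution still covers, effectively hopeless on the rest.
RL_COVERAGE, RL_P = 0.70, 0.90

# Base policy: only a 10% chance per attempt, but that chance exists everywhere.
BASE_P = 0.10

def pass_at_k(per_attempt_p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - per_attempt_p) ** k

rl_solved = base_solved = 0
for _ in range(N_PROBLEMS):
    covered = random.random() < RL_COVERAGE
    rl_p = RL_P if covered else 0.0  # outside its coverage, the RL policy never samples a solution
    if random.random() < pass_at_k(rl_p, K):
        rl_solved += 1
    if random.random() < pass_at_k(BASE_P, K):
        base_solved += 1

print(f"RL policy   pass@{K}: {rl_solved / N_PROBLEMS:.2%}")
print(f"Base policy pass@{K}: {base_solved / N_PROBLEMS:.2%}")
```

With these made-up parameters the RL policy caps out near its 70% coverage no matter how large k gets, while the high-variance base policy approaches 100% at k=100, which is exactly the variance argument above.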
I always thought the point of instruction tuning, and of using prompts to get the model to do zero-shot tasks, was that you don't have to collect tons of example data. The method proposed here requires tons of data. If you have that, why not just fine-tune the underlying model?
I've been scraping YC data week over week to track things like founder changes, pivots in the idea, companies shutting down, etc. You can check it out here: https://pivots.fyi/
It tracks 1,000+ startups founded in the last three years and shows how their product, mission, team size, founders, etc. evolve week over week. It's interesting to see how quickly early-stage startups pivot.
Looking for feedback/suggestions on how I can make this more useful.
I'm building this tool to make it easier for educators to create programming videos. It can also be useful for people new to programming who want to play around with basic data structures and algorithms like trees, linked lists, etc.
I don't have any problem working with people who are different from me. But I think showcasing the benefits of diversity will make everyone actually embrace it, rather than grudgingly support it just to be politically correct.