From the article: he also created an offline/private set of questions to avoid answer leakage into o1's training set, and o1 scored around 97, which is not 120 but is still a significant leap in LLM performance.
It's pretty dishonest for the headline to quote the 120 number, when that's the score on a test whose questions and answers were all available to the LLM in its training data.