Humans are much better at out-of-sample prediction than LLMs. And benchmarks inherently cannot be out of sample. I believe that explains the disconnect: LLMs keep getting better at in-sample prediction (benchmarks) while not improving nearly as much at out-of-sample prediction (actual work).