Humans are much better at out-of-sample prediction than LLMs, and benchmarks inherently cannot be out of sample. I believe that explains the disconnect: LLMs keep getting better at in-sample prediction (benchmarks) while not improving nearly as much at out-of-sample prediction (actual work).