*Have you actually read the paper, or are you just waving it around?*

I've spent a lot of time feeding similar problems to various models to understand what they can and cannot do well at various stages of development. Reading papers is great, but by the time a paper comes out in this field, it's often obsolete. Witness how much mileage the ludds still get out of the METR study, which was conducted with a now-ancient Claude 3.x model that wasn't at the top of the field when it was new.

Here, let me call a shot -- I bet this paper says LLMs fuck up on proofs like they fuck up on code. It will sometimes generate things that are fine, but it'll frequently generate things that are just irrational garbage.

And the goalposts have now been moved to a dark corner of the parking garage down the street from the stadium. "This brand-new technology doesn't deliver infallible, godlike results out of the box, so it must just be fooling people." Or in equestrian parlance, "This talking horse told me to short NVDA. What a scam."