
> Benchmarks are useful, but they often completely miss out on a lot of real-world factors (e.g., long horizon, multiple agents interacting, interfacing with real-world systems in all their complexity, non-nicely-scoped goals, computer use, etc). They also generally don’t give us any understanding of agent proclivities (what they decide to do) when pursuing goals, or when given the freedom to choose their own goal to pursue.

I'd like to see Rob Pike address this; however, based on what he has said about LLMs, he might reject it before getting that far (getting off the usefulness train early, much like getting off the "doom train" with regard to AI safety).


