
The partner for these projects has a benchmark that the frontier LLM labs seem to be running on their new model releases. I think there's _some_ value to these numbers in helping people compare model performance.

https://andonlabs.com/evals/vending-bench


