On our benchmarks, Claude v2 scores worse than v1 in the “code”, “docs”, “integrate” and “marketing” categories.

It is also chattier than v1 (or GPT-3/4), even when asked to just pick one of three options.

These benchmarks are product-oriented: they contain tests and evals from our LLM-driven products, so they aren’t exhaustive or representative.

We just want to know when local LLMs are good enough to start migrating some pipelines away from OpenAI.


