On our benchmarks, Claude v2 scores worse than v1 in the categories “code”, “docs”, “integrate”, and “marketing”.

It is also chattier than v1 (or GPT-3/4), even when asked to just pick one of three options.
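To make "chatty" concrete, here is a minimal sketch of how a pick-one-of-three response could be scored. It is not our actual eval code: the function name `score_choice`, the example options, and the scoring rules are all assumptions for illustration.

```python
import re

def score_choice(response: str, options: list[str]) -> dict:
    """Hypothetical scorer: which option was picked, and how much extra text came with it."""
    text = response.strip()
    # Options that appear in the response as whole words (case-insensitive).
    mentioned = [
        o for o in options
        if re.search(rf"\b{re.escape(o)}\b", text, re.IGNORECASE)
    ]
    # Only count a choice when exactly one option is mentioned.
    choice = mentioned[0] if len(mentioned) == 1 else None
    # Strict match: the response is the option itself, ignoring case and a trailing "." or "!".
    exact = text.rstrip(".!").lower() in [o.lower() for o in options]
    # Words beyond the chosen option; a rough proxy for chattiness.
    extra_words = len(text.split()) - (len(choice.split()) if choice else 0)
    return {"choice": choice, "exact": exact, "extra_words": extra_words}

# Assumed example options; a terse answer and a chatty one.
options = ["keep", "rewrite", "delete"]
terse = score_choice("Rewrite.", options)
chatty = score_choice("I would go with rewrite, since the text is unclear.", options)
```

Both responses pick the same option, but only the terse one passes the strict `exact` check, and `extra_words` separates the two. A pipeline that parses the answer programmatically would care about exactly that difference.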

These benchmarks are product-oriented: they contain tests and evals from our LLM-driven products, so they aren’t exhaustive or representative.

We just want to know when local LLMs are good enough to start migrating some pipelines away from OpenAI.