I gave 179 LLMs the same prompt and evaluated the results with claude-3.5-sonnet. Here's how they did:
View full Detailed Rankings.
Compare different evaluators.