LLM Survey

I gave 179 LLM models the same prompt and evaluated the results with claude-3.5-sonnet; a rough sketch of how such a run can be set up is at the end of this post. Here's how they did:

Model Cost Score
human n/a 10.0
claude-3-opus 6.21¢ 10.0
claude-3-opus:beta 6.87¢ 9.3
llama-3-8b-instruct:free 0.00¢ 9.0
gpt-4-vision-preview 1.65¢ 9.0
gpt-4-turbo-preview 2.17¢ 9.0
gpt-4-turbo 2.25¢ 9.0
gpt-4-1106-preview 2.04¢ 8.7
gpt-4-0314 3.41¢ 8.5
claude-instant-1.1 0.10¢ 8.3
phind-codellama-34b 0.06¢ 8.0
mistral-large 1.51¢ 8.0
gpt-4-32k-0314 6.07¢ 8.0
wizardlm-2-7b 0.01¢ 7.8
llama-3-70b-instruct:nitro 0.06¢ 7.7
wizardlm-2-8x22b 0.07¢ 7.7
claude-instant-1.2 0.09¢ 7.7
claude-3-haiku:beta 0.10¢ 7.7
gemini-pro-1.5 2.21¢ 7.7
gpt-4-32k 7.18¢ 7.7
claude-3.5-sonnet:beta 1.04¢ 7.5
deepseek-coder 0.02¢ 7.3
nous-hermes-2-mixtral-8x7b-dpo 0.04¢ 7.3
llama-3-70b-instruct 0.05¢ 7.3
sonar-medium-chat 0.05¢ 7.3
jamba-instruct 0.08¢ 7.3
claude-instant-1:beta 0.08¢ 7.3
claude-3-haiku 0.09¢ 7.3
gpt-3.5-turbo-0301 0.10¢ 7.3
snowflake-arctic-instruct 0.16¢ 7.3
gpt-4o-2024-05-13 1.01¢ 7.3
claude-3.5-sonnet 1.09¢ 7.3
gpt-4 3.25¢ 7.3
qwen-72b-chat 0.06¢ 7.2
claude-3-sonnet:beta 1.11¢ 7.2
openchat-7b:free 0.00¢ 7.0
hermes-2-pro-llama-3-8b 0.01¢ 7.0
qwen-32b-chat 0.05¢ 7.0
codellama-70b-instruct 0.06¢ 7.0
gpt-3.5-turbo-0613 0.11¢ 7.0
palm-2-codechat-bison 0.12¢ 7.0
claude-2.1 0.95¢ 7.0
llama-3-8b-instruct:nitro 0.01¢ 6.8
llama-3-lumimaid-70b 0.48¢ 6.8
dbrx-instruct:nitro 0.04¢ 6.8
openchat-7b 0.01¢ 6.7
llama-3-sonar-small-32k-online 0.01¢ 6.7
lzlv-70b-fp16-hf 0.08¢ 6.7
gpt-3.5-turbo 0.08¢ 6.7
claude-instant-1 0.09¢ 6.7
llama-3-lumimaid-8b 0.14¢ 6.7
claude-2 0.88¢ 6.7
claude-2:beta 0.96¢ 6.7
claude-3-sonnet 1.06¢ 6.7
nous-hermes-2-mistral-7b-dpo 0.02¢ 6.5
llama-3-8b-instruct 0.00¢ 6.3
openchat-8b 0.00¢ 6.3
phi-3-medium-4k-instruct 0.01¢ 6.3
mixtral-8x22b-instruct 0.05¢ 6.3
gpt-3.5-turbo-0125 0.07¢ 6.3
gemini-flash-1.5 0.07¢ 6.3
command-r-plus 0.97¢ 6.3
llama-3-sonar-large-32k-chat 0.06¢ 6.2
qwen-14b-chat 0.02¢ 6.0
dbrx-instruct 0.02¢ 6.0
deepseek-chat 0.02¢ 6.0
zephyr-orpo-141b-a35b 0.06¢ 6.0
gpt-3.5-turbo-16k 0.20¢ 6.0
nemotron-4-340b-instruct 0.28¢ 6.0
mistral-medium 0.52¢ 6.0
midnight-rose-70b 0.85¢ 6.0
gpt-4o 1.05¢ 6.0
llama-3-sonar-small-32k-chat 0.01¢ 5.7
mixtral-8x7b-instruct 0.02¢ 5.7
gemini-pro-vision 0.04¢ 5.7
llama-3-sonar-large-32k-online 0.06¢ 5.7
gpt-3.5-turbo-1106 0.08¢ 5.7
pplx-70b-chat 0.09¢ 5.7
llama-3-8b-instruct:extended 0.11¢ 5.7
pplx-70b-online 0.59¢ 5.7
claude-2.0:beta 1.00¢ 5.7
mistral-tiny 0.02¢ 5.5
phi-3-medium-128k-instruct:free 0.00¢ 5.3
mistral-7b-instruct:nitro 0.02¢ 5.3
palm-2-chat-bison-32k 0.06¢ 5.3
qwen-110b-chat 0.10¢ 5.3
llama-3-lumimaid-8b:extended 0.12¢ 5.3
claude-1.2 0.77¢ 5.3
claude-2.0 0.92¢ 5.3
claude-1 1.05¢ 5.3
mistral-7b-instruct-v0.3 0.01¢ 5.0
toppy-m-7b 0.01¢ 5.0
mixtral-8x7b-instruct:nitro 0.05¢ 5.0
mixtral-8x22b-instruct-preview 0.08¢ 5.0
command-r 0.08¢ 5.0
claude-instant-1.0 0.09¢ 5.0
nous-hermes-2-mixtral-8x7b-sft 0.04¢ 4.7
nous-capybara-34b 0.06¢ 4.7
gemini-pro 0.07¢ 4.7
airoboros-l2-70b 0.07¢ 4.7
xwin-lm-70b 0.22¢ 4.7
fimbulvetr-11b-v2 0.26¢ 4.7
toppy-m-7b:free 0.00¢ 4.5
zephyr-7b-beta:free 0.00¢ 4.3
mistral-7b-instruct 0.01¢ 4.3
phi-3-mini-128k-instruct 0.01¢ 4.3
mythomax-l2-13b:nitro 0.01¢ 4.3
mythomax-l2-13b 0.01¢ 4.3
openhermes-2.5-mistral-7b 0.02¢ 4.3
phi-3-medium-128k-instruct 0.07¢ 4.3
llava-yi-34b 0.07¢ 4.3
nous-hermes-yi-34b 0.11¢ 4.3
sonar-small-online 0.51¢ 4.3
pplx-7b-online 0.52¢ 4.3
noromaid-mixtral-8x7b-instruct 0.75¢ 4.3
claude-2.1:beta 0.81¢ 4.3
cinematika-7b:free 0.00¢ 4.0
phi-3-mini-128k-instruct:free 0.00¢ 4.0
toppy-m-7b:nitro 0.01¢ 4.0
mistral-7b-instruct-v0.2 0.01¢ 4.0
qwen-7b-chat 0.01¢ 4.0
mythomist-7b 0.03¢ 4.0
dolphin-mixtral-8x7b 0.03¢ 4.0
yi-34b-chat 0.06¢ 4.0
palm-2-codechat-bison-32k 0.07¢ 4.0
psyfighter-13b 0.07¢ 4.0
llama-2-70b-chat 0.16¢ 4.0
neural-chat-7b 0.31¢ 4.0
bagel-34b 0.40¢ 4.0
nous-capybara-7b 0.01¢ 3.7
chronos-hermes-13b 0.02¢ 3.7
palm-2-chat-bison 0.05¢ 3.7
codellama-34b-instruct 0.06¢ 3.7
remm-slerp-l2-13b:extended 0.07¢ 3.7
gpt-3.5-turbo-instruct 0.14¢ 3.7
sonar-medium-online 0.55¢ 3.7
gemma-7b-it 0.00¢ 3.3
mistral-7b-openorca 0.01¢ 3.3
stripedhyena-nous-7b 0.01¢ 3.3
pplx-7b-chat 0.02¢ 3.3
mythomax-l2-13b:extended 0.02¢ 3.3
nous-hermes-llama2-13b 0.02¢ 3.3
remm-slerp-l2-13b 0.02¢ 3.3
codellama-70b-instruct 0.06¢ 3.3
llama-2-70b-chat:nitro 0.07¢ 3.3
psyfighter-13b-2 0.07¢ 3.3
synthia-70b 0.22¢ 3.3
rwkv-5-world-3b 0.00¢ 3.0
qwen-4b-chat 0.00¢ 3.0
cinematika-7b 0.01¢ 3.0
openhermes-2-mistral-7b 0.01¢ 3.0
mixtral-8x22b 0.04¢ 3.0
nous-capybara-7b:free 0.00¢ 2.7
gemma-7b-it:nitro 0.01¢ 2.7
firellava-13b 0.01¢ 2.7
mistral-7b-instruct-v0.1 0.01¢ 2.7
sonar-small-chat 0.01¢ 2.7
mythalion-13b 0.06¢ 2.7
dolphin-mixtral-8x22b 0.08¢ 2.7
mythomist-7b:free 0.00¢ 2.3
llama-2-13b-chat 0.02¢ 2.3
noromaid-20b 0.13¢ 2.3
goliath-120b 0.56¢ 2.3
olmo-7b-instruct 0.02¢ 2.2
llama-3-70b 0.05¢ 2.2
eagle-7b 0.00¢ 2.0
zephyr-7b-beta 0.01¢ 2.0
command 0.11¢ 2.0
gemma-7b-it:free 0.00¢ 1.7
mistral-7b-instruct:free 0.00¢ 1.7
stripedhyena-hessian-7b 0.01¢ 1.5
mistral-small 0.11¢ 1.3
weaver 0.22¢ 1.0
yi-6b 0.01¢ 0.7
mixtral-8x7b 0.04¢ 0.7
rwkv-5-3b-ai-town 0.00¢ 0.0
soliloquy-l3 0.00¢ 0.0
llama-guard-2-8b 0.00¢ 0.0
yi-34b 0.03¢ 0.0

View the full Detailed Rankings.

Compare different evaluators.
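
For reference, here is a minimal sketch of how a run like this can be set up, assuming the models were queried through the OpenRouter chat completions API (the :free, :nitro, and :beta suffixes in the table match its model slugs). The prompt text, rubric wording, and example model slugs below are placeholders of my own, not the ones actually used for the survey, and per-response cost would still have to be pulled from the provider's usage accounting.

```python
import os
import re
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

# Placeholders: the actual survey prompt and judging rubric are not shown in the post.
PROMPT = "Write a short story about a robot learning to paint."  # hypothetical prompt
JUDGE = "anthropic/claude-3.5-sonnet"


def chat(model: str, prompt: str) -> str:
    """Send one chat completion through OpenRouter and return the reply text."""
    resp = requests.post(
        API_URL,
        headers=HEADERS,
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def judge(answer: str) -> float:
    """Ask the judge model for a 0-10 score (hypothetical rubric wording)."""
    rubric = (
        "Score the following answer to the prompt from 0 to 10. "
        "Reply with the number only.\n\n"
        f"Prompt:\n{PROMPT}\n\nAnswer:\n{answer}"
    )
    reply = chat(JUDGE, rubric)
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else 0.0


if __name__ == "__main__":
    # Two example model slugs; the survey covered 179 of them.
    for model in ["anthropic/claude-3-opus", "meta-llama/llama-3-8b-instruct:free"]:
        answer = chat(model, PROMPT)
        print(f"{model}: {judge(answer):.1f}")
```

Using a single judge model for every answer keeps the scores comparable across models, though it can favor answers written in the judge's own style.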