LLM Survey

I gave 179 LLMs the same prompt and had claude-3.5-sonnet score the results. Here's how they did:

Model	Cost	Score
human		10.0
claude-3-opus	6.21¢	10.0
claude-3-opus:beta	6.87¢	9.3
llama-3-8b-instruct:free	0.00¢	9.0
gpt-4-vision-preview	1.65¢	9.0
gpt-4-turbo-preview	2.17¢	9.0
gpt-4-turbo	2.25¢	9.0
gpt-4-1106-preview	2.04¢	8.7
gpt-4-0314	3.41¢	8.5
claude-instant-1.1	0.10¢	8.3
phind-codellama-34b	0.06¢	8.0
mistral-large	1.51¢	8.0
gpt-4-32k-0314	6.07¢	8.0
wizardlm-2-7b	0.01¢	7.8
llama-3-70b-instruct:nitro	0.06¢	7.7
wizardlm-2-8x22b	0.07¢	7.7
claude-instant-1.2	0.09¢	7.7
claude-3-haiku:beta	0.10¢	7.7
gemini-pro-1.5	2.21¢	7.7
gpt-4-32k	7.18¢	7.7
claude-3.5-sonnet:beta	1.04¢	7.5
deepseek-coder	0.02¢	7.3
nous-hermes-2-mixtral-8x7b-dpo	0.04¢	7.3
llama-3-70b-instruct	0.05¢	7.3
sonar-medium-chat	0.05¢	7.3
jamba-instruct	0.08¢	7.3
claude-instant-1:beta	0.08¢	7.3
claude-3-haiku	0.09¢	7.3
gpt-3.5-turbo-0301	0.10¢	7.3
snowflake-arctic-instruct	0.16¢	7.3
gpt-4o-2024-05-13	1.01¢	7.3
claude-3.5-sonnet	1.09¢	7.3
gpt-4	3.25¢	7.3
qwen-72b-chat	0.06¢	7.2
claude-3-sonnet:beta	1.11¢	7.2
openchat-7b:free	0.00¢	7.0
hermes-2-pro-llama-3-8b	0.01¢	7.0
qwen-32b-chat	0.05¢	7.0
codellama-70b-instruct	0.06¢	7.0
gpt-3.5-turbo-0613	0.11¢	7.0
palm-2-codechat-bison	0.12¢	7.0
claude-2.1	0.95¢	7.0
llama-3-8b-instruct:nitro	0.01¢	6.8
llama-3-lumimaid-70b	0.48¢	6.8
dbrx-instruct:nitro	0.04¢	6.8
openchat-7b	0.01¢	6.7
llama-3-sonar-small-32k-online	0.01¢	6.7
lzlv-70b-fp16-hf	0.08¢	6.7
gpt-3.5-turbo	0.08¢	6.7
claude-instant-1	0.09¢	6.7
llama-3-lumimaid-8b	0.14¢	6.7
claude-2	0.88¢	6.7
claude-2:beta	0.96¢	6.7
claude-3-sonnet	1.06¢	6.7
nous-hermes-2-mistral-7b-dpo	0.02¢	6.5
llama-3-8b-instruct	0.00¢	6.3
openchat-8b	0.00¢	6.3
phi-3-medium-4k-instruct	0.01¢	6.3
mixtral-8x22b-instruct	0.05¢	6.3
gpt-3.5-turbo-0125	0.07¢	6.3
gemini-flash-1.5	0.07¢	6.3
command-r-plus	0.97¢	6.3
llama-3-sonar-large-32k-chat	0.06¢	6.2
qwen-14b-chat	0.02¢	6.0
dbrx-instruct	0.02¢	6.0
deepseek-chat	0.02¢	6.0
zephyr-orpo-141b-a35b	0.06¢	6.0
gpt-3.5-turbo-16k	0.20¢	6.0
nemotron-4-340b-instruct	0.28¢	6.0
mistral-medium	0.52¢	6.0
midnight-rose-70b	0.85¢	6.0
gpt-4o	1.05¢	6.0
llama-3-sonar-small-32k-chat	0.01¢	5.7
mixtral-8x7b-instruct	0.02¢	5.7
gemini-pro-vision	0.04¢	5.7
llama-3-sonar-large-32k-online	0.06¢	5.7
gpt-3.5-turbo-1106	0.08¢	5.7
pplx-70b-chat	0.09¢	5.7
llama-3-8b-instruct:extended	0.11¢	5.7
pplx-70b-online	0.59¢	5.7
claude-2.0:beta	1.00¢	5.7
mistral-tiny	0.02¢	5.5
phi-3-medium-128k-instruct:free	0.00¢	5.3
mistral-7b-instruct:nitro	0.02¢	5.3
palm-2-chat-bison-32k	0.06¢	5.3
qwen-110b-chat	0.10¢	5.3
llama-3-lumimaid-8b:extended	0.12¢	5.3
claude-1.2	0.77¢	5.3
claude-2.0	0.92¢	5.3
claude-1	1.05¢	5.3
mistral-7b-instruct-v0.3	0.01¢	5.0
toppy-m-7b	0.01¢	5.0
mixtral-8x7b-instruct:nitro	0.05¢	5.0
mixtral-8x22b-instruct-preview	0.08¢	5.0
command-r	0.08¢	5.0
claude-instant-1.0	0.09¢	5.0
nous-hermes-2-mixtral-8x7b-sft	0.04¢	4.7
nous-capybara-34b	0.06¢	4.7
gemini-pro	0.07¢	4.7
airoboros-l2-70b	0.07¢	4.7
xwin-lm-70b	0.22¢	4.7
fimbulvetr-11b-v2	0.26¢	4.7
toppy-m-7b:free	0.00¢	4.5
zephyr-7b-beta:free	0.00¢	4.3
mistral-7b-instruct	0.01¢	4.3
phi-3-mini-128k-instruct	0.01¢	4.3
mythomax-l2-13b:nitro	0.01¢	4.3
mythomax-l2-13b	0.01¢	4.3
openhermes-2.5-mistral-7b	0.02¢	4.3
phi-3-medium-128k-instruct	0.07¢	4.3
llava-yi-34b	0.07¢	4.3
nous-hermes-yi-34b	0.11¢	4.3
sonar-small-online	0.51¢	4.3
pplx-7b-online	0.52¢	4.3
noromaid-mixtral-8x7b-instruct	0.75¢	4.3
claude-2.1:beta	0.81¢	4.3
cinematika-7b:free	0.00¢	4.0
phi-3-mini-128k-instruct:free	0.00¢	4.0
toppy-m-7b:nitro	0.01¢	4.0
mistral-7b-instruct-v0.2	0.01¢	4.0
qwen-7b-chat	0.01¢	4.0
mythomist-7b	0.03¢	4.0
dolphin-mixtral-8x7b	0.03¢	4.0
yi-34b-chat	0.06¢	4.0
palm-2-codechat-bison-32k	0.07¢	4.0
psyfighter-13b	0.07¢	4.0
llama-2-70b-chat	0.16¢	4.0
neural-chat-7b	0.31¢	4.0
bagel-34b	0.40¢	4.0
nous-capybara-7b	0.01¢	3.7
chronos-hermes-13b	0.02¢	3.7
palm-2-chat-bison	0.05¢	3.7
codellama-34b-instruct	0.06¢	3.7
remm-slerp-l2-13b:extended	0.07¢	3.7
gpt-3.5-turbo-instruct	0.14¢	3.7
sonar-medium-online	0.55¢	3.7
gemma-7b-it	0.00¢	3.3
mistral-7b-openorca	0.01¢	3.3
stripedhyena-nous-7b	0.01¢	3.3
pplx-7b-chat	0.02¢	3.3
mythomax-l2-13b:extended	0.02¢	3.3
nous-hermes-llama2-13b	0.02¢	3.3
remm-slerp-l2-13b	0.02¢	3.3
codellama-70b-instruct	0.06¢	3.3
llama-2-70b-chat:nitro	0.07¢	3.3
psyfighter-13b-2	0.07¢	3.3
synthia-70b	0.22¢	3.3
rwkv-5-world-3b	0.00¢	3.0
qwen-4b-chat	0.00¢	3.0
cinematika-7b	0.01¢	3.0
openhermes-2-mistral-7b	0.01¢	3.0
mixtral-8x22b	0.04¢	3.0
nous-capybara-7b:free	0.00¢	2.7
gemma-7b-it:nitro	0.01¢	2.7
firellava-13b	0.01¢	2.7
mistral-7b-instruct-v0.1	0.01¢	2.7
sonar-small-chat	0.01¢	2.7
mythalion-13b	0.06¢	2.7
dolphin-mixtral-8x22b	0.08¢	2.7
mythomist-7b:free	0.00¢	2.3
llama-2-13b-chat	0.02¢	2.3
noromaid-20b	0.13¢	2.3
goliath-120b	0.56¢	2.3
olmo-7b-instruct	0.02¢	2.2
llama-3-70b	0.05¢	2.2
eagle-7b	0.00¢	2.0
zephyr-7b-beta	0.01¢	2.0
command	0.11¢	2.0
gemma-7b-it:free	0.00¢	1.7
mistral-7b-instruct:free	0.00¢	1.7
stripedhyena-hessian-7b	0.01¢	1.5
mistral-small	0.11¢	1.3
weaver	0.22¢	1.0
yi-6b	0.01¢	0.7
mixtral-8x7b	0.04¢	0.7
rwkv-5-3b-ai-town	0.00¢	0.0
soliloquy-l3	0.00¢	0.0
llama-guard-2-8b	0.00¢	0.0
yi-34b	0.03¢	0.0
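
The table is sorted by score, but the Cost column also makes a rough "score per cent" comparison possible. Here is a minimal Python sketch of that arithmetic, using a handful of rows copied from the table. The names `Row`, `parse_rows`, `score_per_cent`, and `rank_by_value` are hypothetical helpers made up for this illustration, not part of the survey's tooling. Skipping rows with a blank or 0.00¢ cost is an assumption made here to avoid dividing by zero.

```python
from typing import List, NamedTuple, Optional

# A few rows copied from the table above (tab-separated: model, cost, score).
SAMPLE = (
    "human\t\t10.0\n"
    "claude-3-opus\t6.21¢\t10.0\n"
    "llama-3-8b-instruct:free\t0.00¢\t9.0\n"
    "gpt-4-turbo\t2.25¢\t9.0\n"
    "claude-instant-1.1\t0.10¢\t8.3\n"
    "wizardlm-2-7b\t0.01¢\t7.8\n"
)


class Row(NamedTuple):
    model: str
    cost_cents: Optional[float]  # None when the table leaves the cost blank
    score: float


def parse_rows(text: str) -> List[Row]:
    """Parse tab-separated 'model, cost, score' lines into Row records."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        model, cost, score = line.split("\t")
        cost_cents = float(cost.rstrip("¢")) if cost else None
        rows.append(Row(model, cost_cents, float(score)))
    return rows


def score_per_cent(row: Row) -> Optional[float]:
    """Score divided by cost in cents; None if the cost is blank or zero."""
    if not row.cost_cents:
        return None
    return row.score / row.cost_cents


def rank_by_value(rows: List[Row]) -> List[Row]:
    """Rows with a usable cost, best score-per-cent first."""
    priced = [r for r in rows if score_per_cent(r) is not None]
    return sorted(priced, key=score_per_cent, reverse=True)


if __name__ == "__main__":
    for row in rank_by_value(parse_rows(SAMPLE)):
        print(f"{row.model}\t{score_per_cent(row):.1f}")
```

Costs in the table are rounded to 0.01¢, so the ratio is very coarse for the cheapest models and should be read as a rough ordering at best.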

View full Detailed Rankings.

Compare different evaluators.