LLM Survey

One challenge is choosing an evaluator to mark the LLM outputs. The evaluator needs to be fairly consistent over time, produce good results, and not cost too much. I've chosen anthropic/claude-3.5-sonnet, but here are some scores from different evaluator models.

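| Model | claude-3-haiku Cost | claude-3-haiku Score | claude-3.5-sonnet Cost | claude-3.5-sonnet Score | gpt-3.5-turbo-0125 Cost | gpt-3.5-turbo-0125 Score | gpt-4-0125-preview Cost | gpt-4-0125-preview Score | gpt-4-1106-preview Cost | gpt-4-1106-preview Score |
|---|---|---|---|---|---|---|---|---|---|---|
| human | | | 0.88¢ | 10.0 | | | | | | |
| claude-3-opus | 0.07¢ | 10.0 | 0.99¢ | 10.0 | 0.07¢ | 10.0 | | 10.0 | 2.09¢ | 10.0 |
| claude-3-opus:beta | 0.07¢ | 10.0 | 1.04¢ | 9.3 | 0.09¢ | 9.3 | | 8.7 | 2.32¢ | 8.7 |
| llama-3-8b-instruct:free | | | 0.94¢ | 9.0 | | | | | | |
| gpt-4-vision-preview | | | 0.98¢ | 9.0 | | | | 8.2 | | |
| gpt-4-turbo-preview | 0.08¢ | 10.0 | 1.01¢ | 9.0 | 0.09¢ | 9.3 | | 8.8 | 2.02¢ | 7.7 |
| gpt-4-turbo | 0.10¢ | 10.0 | 1.07¢ | 9.0 | 0.09¢ | 7.0 | | 8.2 | 2.25¢ | 7.0 |
| gpt-4-1106-preview | 0.08¢ | 10.0 | 1.00¢ | 8.7 | 0.08¢ | 8.3 | | 9.0 | 2.26¢ | 6.7 |
| gpt-4-0314 | | | 1.00¢ | 8.5 | | | | 7.8 | | |
| claude-instant-1.1 | | | 0.93¢ | 8.3 | | | | 7.7 | | |
| phind-codellama-34b | | | 1.02¢ | 8.0 | | | | 6.7 | | |
| mistral-large | 0.08¢ | 10.0 | 0.98¢ | 8.0 | 0.07¢ | 10.0 | | 7.8 | 2.04¢ | 5.7 |
| gpt-4-32k-0314 | | | 0.97¢ | 8.0 | | | | 7.5 | | |
| wizardlm-2-7b | | | 1.13¢ | 7.8 | | | | 5.7 | | |
| llama-3-70b-instruct:nitro | | | 0.99¢ | 7.7 | | | | 7.2 | | |
| wizardlm-2-8x22b | | | 1.08¢ | 7.7 | | | | 7.0 | | |
| claude-instant-1.2 | | | 0.97¢ | 7.7 | | | | 7.7 | | |
| claude-3-haiku:beta | | | 1.08¢ | 7.7 | | | | 7.8 | | |
| gemini-pro-1.5 | | | 1.09¢ | 7.7 | | | | 7.0 | | |
| gpt-4-32k | | | 1.01¢ | 7.7 | | | | 6.8 | | |
| claude-3.5-sonnet:beta | 0.08¢ | 10.0 | 1.05¢ | 7.5 | 0.08¢ | 8.7 | | | 2.17¢ | 6.5 |
| deepseek-coder | | | 0.99¢ | 7.3 | | | | 6.8 | | |
| nous-hermes-2-mixtral-8x7b-dpo | | | 0.98¢ | 7.3 | | | | 6.5 | | |
| llama-3-70b-instruct | | | 1.03¢ | 7.3 | | | | 6.2 | | |
| sonar-medium-chat | | | 0.99¢ | 7.3 | | | | 5.0 | | |
| jamba-instruct | | | 1.04¢ | 7.3 | | | | | | |
| claude-instant-1:beta | | | 0.86¢ | 7.3 | | | | 7.2 | | |
| claude-3-haiku | | | 1.01¢ | 7.3 | | | | 6.0 | | |
| gpt-3.5-turbo-0301 | 0.06¢ | 10.0 | 0.96¢ | 7.3 | 0.07¢ | 8.0 | | 6.3 | 1.98¢ | 6.0 |
| snowflake-arctic-instruct | | | 1.00¢ | 7.3 | | | | 6.0 | | |
| gpt-4o-2024-05-13 | | | 1.07¢ | 7.3 | | | | 7.0 | | |
| claude-3.5-sonnet | 0.07¢ | 10.0 | 1.13¢ | 7.3 | 0.08¢ | 9.7 | | | 2.41¢ | 6.0 |
| gpt-4 | | | 0.98¢ | 7.3 | | | | 5.0 | | |
| qwen-72b-chat | | | 1.01¢ | 7.2 | | | | 4.0 | | |
| claude-3-sonnet:beta | | | 1.04¢ | 7.2 | | | | 6.8 | | |
| openchat-7b:free | | | 1.03¢ | 7.0 | | | | 5.0 | | |
| hermes-2-pro-llama-3-8b | | | 0.99¢ | 7.0 | | | | 6.0 | | |
| qwen-32b-chat | | | 0.96¢ | 7.0 | | | | 5.0 | | |
| codellama-70b-instruct | | | 0.91¢ | 7.0 | | | | | | |
| gpt-3.5-turbo-0613 | 0.07¢ | 10.0 | 1.00¢ | 7.0 | 0.08¢ | 9.3 | | 7.3 | 2.41¢ | 7.7 |
| palm-2-codechat-bison | | | 1.07¢ | 7.0 | | | | 6.0 | | |
| claude-2.1 | 0.06¢ | 10.0 | 0.90¢ | 7.0 | 0.07¢ | 8.5 | | 5.7 | 1.85¢ | 5.0 |
| llama-3-8b-instruct:nitro | | | 0.98¢ | 6.8 | | | | 6.0 | | |
| llama-3-lumimaid-70b | | | 1.00¢ | 6.8 | | | | 6.5 | | |
| dbrx-instruct:nitro | | | 0.74¢ | 6.8 | | | | 3.7 | | |
| openchat-7b | | | 0.98¢ | 6.7 | | | | 4.5 | | |
| llama-3-sonar-small-32k-online | | | 0.95¢ | 6.7 | | | | 4.5 | | |
| lzlv-70b-fp16-hf | | | 1.05¢ | 6.7 | | | | 4.8 | | |
| gpt-3.5-turbo | 0.07¢ | 10.0 | 1.01¢ | 6.7 | 0.08¢ | 8.0 | | 6.0 | 2.23¢ | 5.7 |
| claude-instant-1 | | | 0.99¢ | 6.7 | | | | 5.7 | | |
| llama-3-lumimaid-8b | | | 1.02¢ | 6.7 | | | | 6.2 | | |
| claude-2 | 0.09¢ | 9.0 | 0.93¢ | 6.7 | 0.06¢ | 10.0 | | 7.2 | 1.86¢ | 5.0 |
| claude-2:beta | 0.07¢ | 9.3 | 0.95¢ | 6.7 | 0.06¢ | 10.0 | | 6.0 | 1.91¢ | 5.0 |
| claude-3-sonnet | | | 1.04¢ | 6.7 | | | | 4.8 | | |
| nous-hermes-2-mistral-7b-dpo | | | 1.03¢ | 6.5 | | | | 3.3 | | |
| llama-3-8b-instruct | | | 0.91¢ | 6.3 | | | | 3.5 | | |
| openchat-8b | | | 1.03¢ | 6.3 | | | | | | |
| phi-3-medium-4k-instruct | | | 0.99¢ | 6.3 | | | | | | |
| mixtral-8x22b-instruct | | | 0.99¢ | 6.3 | | | | 5.3 | | |
| gpt-3.5-turbo-0125 | 0.07¢ | 10.0 | 1.01¢ | 6.3 | 0.07¢ | 9.0 | | 5.2 | 1.85¢ | 3.0 |
| gemini-flash-1.5 | | | 1.17¢ | 6.3 | | | | 5.8 | | |
| command-r-plus | 0.07¢ | 10.0 | 1.01¢ | 6.3 | 0.09¢ | 8.7 | | 5.5 | 2.00¢ | 4.0 |
| llama-3-sonar-large-32k-chat | | | 1.06¢ | 6.2 | | | | 6.5 | | |
| qwen-14b-chat | | | 1.00¢ | 6.0 | | | | 5.7 | | |
| dbrx-instruct | | | 0.52¢ | 6.0 | | | | 2.0 | | |
| deepseek-chat | | | 1.01¢ | 6.0 | | | | 5.0 | | |
| zephyr-orpo-141b-a35b | 0.07¢ | 9.7 | 1.04¢ | 6.0 | 0.08¢ | 7.3 | | 5.8 | 2.18¢ | 3.3 |
| gpt-3.5-turbo-16k | 0.07¢ | 10.0 | 0.96¢ | 6.0 | 0.07¢ | 10.0 | | 4.7 | 2.06¢ | 4.3 |
| nemotron-4-340b-instruct | | | 1.01¢ | 6.0 | | | | | | |
| mistral-medium | | | 1.01¢ | 6.0 | | | | 6.2 | | |
| midnight-rose-70b | 0.07¢ | 10.0 | 1.09¢ | 6.0 | 0.08¢ | 10.0 | | 3.2 | 2.32¢ | 2.7 |
| gpt-4o | | | 1.10¢ | 6.0 | | | | 5.3 | | |
| llama-3-sonar-small-32k-chat | | | 1.03¢ | 5.7 | | | | 4.3 | | |
| mixtral-8x7b-instruct | | | 0.96¢ | 5.7 | | | | 2.8 | | |
| gemini-pro-vision | | | 1.01¢ | 5.7 | | | | 4.0 | | |
| llama-3-sonar-large-32k-online | | | 0.98¢ | 5.7 | | | | 5.3 | | |
| gpt-3.5-turbo-1106 | 0.06¢ | 10.0 | 0.94¢ | 5.7 | 0.07¢ | 8.3 | | 5.3 | 1.86¢ | 2.3 |
| pplx-70b-chat | | | 0.96¢ | 5.7 | | | | 4.0 | | |
| llama-3-8b-instruct:extended | | | 0.98¢ | 5.7 | | | | 4.7 | | |
| pplx-70b-online | | | 1.03¢ | 5.7 | | | | 3.8 | | |
| claude-2.0:beta | 0.06¢ | 10.0 | 0.97¢ | 5.7 | 0.07¢ | 9.3 | | 5.5 | 1.76¢ | 2.7 |
| mistral-tiny | | | 1.02¢ | 5.5 | | | | 4.0 | | |
| phi-3-medium-128k-instruct:free | | | 0.97¢ | 5.3 | | | | 3.7 | | |
| mistral-7b-instruct:nitro | | | 1.03¢ | 5.3 | | | | 3.0 | | |
| palm-2-chat-bison-32k | | | 0.94¢ | 5.3 | | | | 3.8 | | |
| qwen-110b-chat | | | 0.99¢ | 5.3 | | | | 5.8 | | |
| llama-3-lumimaid-8b:extended | | | 1.02¢ | 5.3 | | | | 3.3 | | |
| claude-1.2 | | | 0.95¢ | 5.3 | | | | 5.3 | | |
| claude-2.0 | 0.07¢ | 10.0 | 0.91¢ | 5.3 | 0.07¢ | 7.3 | | 4.2 | 1.92¢ | 2.3 |
| claude-1 | | | 0.96¢ | 5.3 | | | | 4.2 | | |
| mistral-7b-instruct-v0.3 | | | 1.00¢ | 5.0 | | | | 4.5 | | |
| toppy-m-7b | | | 0.99¢ | 5.0 | | | | 3.0 | | |
| mixtral-8x7b-instruct:nitro | | | 0.97¢ | 5.0 | | | | 4.3 | | |
| mixtral-8x22b-instruct-preview | | | 0.99¢ | 5.0 | | | | 5.0 | | |
| command-r | 0.08¢ | 9.7 | 0.94¢ | 5.0 | 0.07¢ | 8.7 | | 3.2 | 2.15¢ | 2.0 |
| claude-instant-1.0 | | | 0.95¢ | 5.0 | | | | 4.3 | | |
| nous-hermes-2-mixtral-8x7b-sft | | | 0.92¢ | 4.7 | | | | 4.7 | | |
| nous-capybara-34b | | | 0.92¢ | 4.7 | | | | 4.8 | | |
| gemini-pro | | | 0.96¢ | 4.7 | | | | 3.7 | | |
| airoboros-l2-70b | | | 1.01¢ | 4.7 | | | | 3.8 | | |
| xwin-lm-70b | | | 0.88¢ | 4.7 | | | | 2.0 | | |
| fimbulvetr-11b-v2 | | | 1.06¢ | 4.7 | | | | 3.2 | | |
| toppy-m-7b:free | | | 1.01¢ | 4.5 | | | | 2.3 | | |
| zephyr-7b-beta:free | | | 1.02¢ | 4.3 | | | | 1.7 | | |
| mistral-7b-instruct | | | 1.04¢ | 4.3 | | | | 1.3 | | |
| phi-3-mini-128k-instruct | | | 0.99¢ | 4.3 | | | | 3.3 | | |
| mythomax-l2-13b:nitro | | | 0.96¢ | 4.3 | | | | 2.0 | | |
| mythomax-l2-13b | | | 0.95¢ | 4.3 | | | | 2.7 | | |
| openhermes-2.5-mistral-7b | | | 0.94¢ | 4.3 | | | | 2.7 | | |
| phi-3-medium-128k-instruct | | | 0.99¢ | 4.3 | | | | 3.0 | | |
| llava-yi-34b | | | 1.02¢ | 4.3 | | | | 2.2 | | |
| nous-hermes-yi-34b | | | 1.22¢ | 4.3 | | | | 2.2 | | |
| sonar-small-online | | | 0.96¢ | 4.3 | | | | 2.0 | | |
| pplx-7b-online | | | 0.97¢ | 4.3 | | | | 2.7 | | |
| noromaid-mixtral-8x7b-instruct | | | 1.03¢ | 4.3 | | | | 2.2 | | |
| claude-2.1:beta | 0.06¢ | 9.7 | 0.94¢ | 4.3 | 0.07¢ | 6.0 | | 4.3 | 1.93¢ | 3.0 |
| cinematika-7b:free | | | 0.90¢ | 4.0 | | | | 2.0 | | |
| phi-3-mini-128k-instruct:free | | | 0.92¢ | 4.0 | | | | 2.3 | | |
| toppy-m-7b:nitro | | | 1.04¢ | 4.0 | | | | 2.0 | | |
| mistral-7b-instruct-v0.2 | | | 1.08¢ | 4.0 | | | | 0.8 | | |
| qwen-7b-chat | | | 0.98¢ | 4.0 | | | | 3.3 | | |
| mythomist-7b | | | 0.99¢ | 4.0 | | | | 2.0 | | |
| dolphin-mixtral-8x7b | | | 0.94¢ | 4.0 | | | | 2.7 | | |
| yi-34b-chat | | | 0.95¢ | 4.0 | | | | 2.5 | | |
| palm-2-codechat-bison-32k | | | 0.81¢ | 4.0 | | | | 3.0 | | |
| psyfighter-13b | | | 0.95¢ | 4.0 | | | | 2.0 | | |
| llama-2-70b-chat | | | 1.00¢ | 4.0 | | | | 3.0 | | |
| neural-chat-7b | | | 1.03¢ | 4.0 | | | | 3.0 | | |
| bagel-34b | 0.07¢ | 9.3 | 0.94¢ | 4.0 | 0.07¢ | 8.7 | | 2.2 | 1.98¢ | 1.0 |
| nous-capybara-7b | 0.07¢ | 6.3 | 0.94¢ | 3.7 | 0.07¢ | 7.5 | | 1.7 | 1.97¢ | 0.3 |
| chronos-hermes-13b | 0.08¢ | 9.3 | 0.99¢ | 3.7 | 0.08¢ | 9.3 | | 2.3 | 2.08¢ | 1.3 |
| palm-2-chat-bison | | | 0.88¢ | 3.7 | | | | 2.3 | | |
| codellama-34b-instruct | | | 0.97¢ | 3.7 | | | | 3.3 | | |
| remm-slerp-l2-13b:extended | | | 0.93¢ | 3.7 | | | | 2.8 | | |
| gpt-3.5-turbo-instruct | 0.07¢ | 10.0 | 1.00¢ | 3.7 | 0.09¢ | 8.7 | | 2.7 | 2.28¢ | 3.0 |
| sonar-medium-online | | | 1.03¢ | 3.7 | | | | 2.7 | | |
| gemma-7b-it | | | 0.89¢ | 3.3 | | | | 3.7 | | |
| mistral-7b-openorca | | | 0.94¢ | 3.3 | | | | 1.5 | | |
| stripedhyena-nous-7b | | | 0.99¢ | 3.3 | | | | 2.2 | | |
| pplx-7b-chat | | | 1.00¢ | 3.3 | | | | 0.8 | | |
| mythomax-l2-13b:extended | | | 0.95¢ | 3.3 | | | | 0.7 | | |
| nous-hermes-llama2-13b | | | 0.92¢ | 3.3 | | | | 2.7 | | |
| remm-slerp-l2-13b | | | 0.93¢ | 3.3 | | | | 2.2 | | |
| codellama-70b-instruct | | | 0.93¢ | 3.3 | | | | 2.3 | | |
| llama-2-70b-chat:nitro | | | 0.89¢ | 3.3 | | | | 2.5 | | |
| psyfighter-13b-2 | | | 0.98¢ | 3.3 | | | | 0.7 | | |
| synthia-70b | | | 0.95¢ | 3.3 | | | | 1.0 | | |
| rwkv-5-world-3b | | | 0.88¢ | 3.0 | | | | 0.8 | | |
| qwen-4b-chat | | | 0.91¢ | 3.0 | | | | 1.7 | | |
| cinematika-7b | | | 0.80¢ | 3.0 | | | | 4.0 | | |
| openhermes-2-mistral-7b | | | 0.91¢ | 3.0 | | | | 1.5 | | |
| mixtral-8x22b | | | 0.80¢ | 3.0 | | | | 2.7 | | |
| nous-capybara-7b:free | 0.08¢ | 8.7 | 0.95¢ | 2.7 | 0.07¢ | 5.0 | | 0.7 | 1.89¢ | 0.0 |
| gemma-7b-it:nitro | | | 0.87¢ | 2.7 | | | | 1.0 | | |
| firellava-13b | | | 0.92¢ | 2.7 | | | | 2.3 | | |
| mistral-7b-instruct-v0.1 | | | 0.92¢ | 2.7 | | | | 1.7 | | |
| sonar-small-chat | | | 0.98¢ | 2.7 | | | | 0.3 | | |
| mythalion-13b | | | 0.94¢ | 2.7 | | | | 2.5 | | |
| dolphin-mixtral-8x22b | | | 1.00¢ | 2.7 | | | | | | |
| mythomist-7b:free | | | 0.99¢ | 2.3 | | | | 2.0 | | |
| llama-2-13b-chat | | | 1.01¢ | 2.3 | | | | 0.2 | | |
| noromaid-20b | | | 0.96¢ | 2.3 | | | | 1.0 | | |
| goliath-120b | | | 1.01¢ | 2.3 | | | | 0.3 | | |
| olmo-7b-instruct | | | 1.15¢ | 2.2 | | | | 0.0 | | |
| llama-3-70b | | | 0.90¢ | 2.2 | | | | 1.7 | | |
| eagle-7b | | | 1.28¢ | 2.0 | | | | 0.2 | | |
| zephyr-7b-beta | | | 0.93¢ | 2.0 | | | | 1.2 | | |
| command | | | 0.92¢ | 2.0 | | | | 0.3 | | |
| gemma-7b-it:free | | | 0.83¢ | 1.7 | | | | 1.0 | | |
| mistral-7b-instruct:free | | | 0.90¢ | 1.7 | | | | 0.2 | | |
| stripedhyena-hessian-7b | | | 0.91¢ | 1.5 | | | | 0.7 | | |
| mistral-small | | | 0.78¢ | 1.3 | | | | 3.7 | | |
| weaver | 0.05¢ | 3.3 | 0.74¢ | 1.0 | 0.06¢ | 4.0 | | 0.7 | 1.60¢ | 0.0 |
| yi-6b | | | 0.75¢ | 0.7 | | | | 0.0 | | |
| mixtral-8x7b | | | 0.66¢ | 0.7 | | | | 0.0 | | |
| rwkv-5-3b-ai-town | | | 0.80¢ | 0.0 | | | | 0.0 | | |
| soliloquy-l3 | | | 0.59¢ | 0.0 | | | | 0.0 | | |
| llama-guard-2-8b | | | 0.53¢ | 0.0 | | | | 3.5 | | |
| yi-34b | | | 0.62¢ | 0.0 | | | | 0.0 | | |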
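
One rough way to compare candidate evaluators on the criteria above is to see how lenient each one is and how far it drifts from the chosen evaluator. The sketch below is only an illustration, not how the table was produced. It assumes the score columns run in the same order as the evaluator names in the table header, and it uses just six rows that have a score from every evaluator. The names `EVALUATORS`, `SCORES`, `column` and `summarize` are made up for this example.

```python
from statistics import mean, pstdev

# Evaluator order ASSUMED to match the header of the table above.
EVALUATORS = [
    "claude-3-haiku",
    "claude-3.5-sonnet",
    "gpt-3.5-turbo-0125",
    "gpt-4-0125-preview",
    "gpt-4-1106-preview",
]

# Scores copied from a few table rows that have a score from every evaluator.
SCORES = {
    "claude-3-opus": [10.0, 10.0, 10.0, 10.0, 10.0],
    "gpt-4-turbo":   [10.0, 9.0, 7.0, 8.2, 7.0],
    "mistral-large": [10.0, 8.0, 10.0, 7.8, 5.7],
    "gpt-3.5-turbo": [10.0, 6.7, 8.0, 6.0, 5.7],
    "command-r":     [9.7, 5.0, 8.7, 3.2, 2.0],
    "weaver":        [3.3, 1.0, 4.0, 0.7, 0.0],
}


def column(index):
    """Return every sampled model's score from the evaluator at `index`."""
    return [row[index] for row in SCORES.values()]


def summarize(reference="claude-3.5-sonnet"):
    """Per evaluator: mean score, spread, and mean gap to the reference."""
    ref_scores = column(EVALUATORS.index(reference))
    summary = {}
    for i, name in enumerate(EVALUATORS):
        scores = column(i)
        summary[name] = {
            # A high mean suggests a lenient marker.
            "mean": mean(scores),
            # A low spread suggests the marker barely separates models.
            "spread": pstdev(scores),
            # Average absolute disagreement with the reference evaluator.
            "mean_abs_diff": mean(abs(a - b) for a, b in zip(scores, ref_scores)),
        }
    return summary


for name, stats in summarize().items():
    print(f"{name:20s} mean={stats['mean']:.2f} "
          f"spread={stats['spread']:.2f} diff={stats['mean_abs_diff']:.2f}")
```

On this small sample the cheaper evaluators hand out noticeably higher average scores than claude-3.5-sonnet, while gpt-4-1106-preview marks harder. Six rows are far too few to draw firm conclusions, so treat the output as a sanity check rather than a ranking of evaluators.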