LLM Survey
One challenge is choosing an evaluator to mark the LLM outputs. The evaluator needs to be fairly consistent over time, produce good results, and not cost too much. I've chosen anthropic/claude-3.5-sonnet, but here are some scores from different evaluator models.
Model | claude-3-haiku cost | claude-3-haiku score | claude-3.5-sonnet cost | claude-3.5-sonnet score | gpt-3.5-turbo-0125 cost | gpt-3.5-turbo-0125 score | gpt-4-0125-preview cost | gpt-4-0125-preview score | gpt-4-1106-preview cost | gpt-4-1106-preview score |
---|---|---|---|---|---|---|---|---|---|---|
human | | | 0.88¢ | 10.0 | | | | | | |
claude-3-opus | 0.07¢ | 10.0 | 0.99¢ | 10.0 | 0.07¢ | 10.0 | | 10.0 | 2.09¢ | 10.0 |
claude-3-opus:beta | 0.07¢ | 10.0 | 1.04¢ | 9.3 | 0.09¢ | 9.3 | | 8.7 | 2.32¢ | 8.7 |
llama-3-8b-instruct:free | | | 0.94¢ | 9.0 | | | | | | |
gpt-4-vision-preview | | | 0.98¢ | 9.0 | | | | 8.2 | | |
gpt-4-turbo-preview | 0.08¢ | 10.0 | 1.01¢ | 9.0 | 0.09¢ | 9.3 | | 8.8 | 2.02¢ | 7.7 |
gpt-4-turbo | 0.10¢ | 10.0 | 1.07¢ | 9.0 | 0.09¢ | 7.0 | | 8.2 | 2.25¢ | 7.0 |
gpt-4-1106-preview | 0.08¢ | 10.0 | 1.00¢ | 8.7 | 0.08¢ | 8.3 | | 9.0 | 2.26¢ | 6.7 |
gpt-4-0314 | | | 1.00¢ | 8.5 | | | | 7.8 | | |
claude-instant-1.1 | | | 0.93¢ | 8.3 | | | | 7.7 | | |
phind-codellama-34b | | | 1.02¢ | 8.0 | | | | 6.7 | | |
mistral-large | 0.08¢ | 10.0 | 0.98¢ | 8.0 | 0.07¢ | 10.0 | | 7.8 | 2.04¢ | 5.7 |
gpt-4-32k-0314 | | | 0.97¢ | 8.0 | | | | 7.5 | | |
wizardlm-2-7b | | | 1.13¢ | 7.8 | | | | 5.7 | | |
llama-3-70b-instruct:nitro | | | 0.99¢ | 7.7 | | | | 7.2 | | |
wizardlm-2-8x22b | | | 1.08¢ | 7.7 | | | | 7.0 | | |
claude-instant-1.2 | | | 0.97¢ | 7.7 | | | | 7.7 | | |
claude-3-haiku:beta | | | 1.08¢ | 7.7 | | | | 7.8 | | |
gemini-pro-1.5 | | | 1.09¢ | 7.7 | | | | 7.0 | | |
gpt-4-32k | | | 1.01¢ | 7.7 | | | | 6.8 | | |
claude-3.5-sonnet:beta | 0.08¢ | 10.0 | 1.05¢ | 7.5 | 0.08¢ | 8.7 | | | 2.17¢ | 6.5 |
deepseek-coder | | | 0.99¢ | 7.3 | | | | 6.8 | | |
nous-hermes-2-mixtral-8x7b-dpo | | | 0.98¢ | 7.3 | | | | 6.5 | | |
llama-3-70b-instruct | | | 1.03¢ | 7.3 | | | | 6.2 | | |
sonar-medium-chat | | | 0.99¢ | 7.3 | | | | 5.0 | | |
jamba-instruct | | | 1.04¢ | 7.3 | | | | | | |
claude-instant-1:beta | | | 0.86¢ | 7.3 | | | | 7.2 | | |
claude-3-haiku | | | 1.01¢ | 7.3 | | | | 6.0 | | |
gpt-3.5-turbo-0301 | 0.06¢ | 10.0 | 0.96¢ | 7.3 | 0.07¢ | 8.0 | | 6.3 | 1.98¢ | 6.0 |
snowflake-arctic-instruct | | | 1.00¢ | 7.3 | | | | 6.0 | | |
gpt-4o-2024-05-13 | | | 1.07¢ | 7.3 | | | | 7.0 | | |
claude-3.5-sonnet | 0.07¢ | 10.0 | 1.13¢ | 7.3 | 0.08¢ | 9.7 | | | 2.41¢ | 6.0 |
gpt-4 | | | 0.98¢ | 7.3 | | | | 5.0 | | |
qwen-72b-chat | | | 1.01¢ | 7.2 | | | | 4.0 | | |
claude-3-sonnet:beta | | | 1.04¢ | 7.2 | | | | 6.8 | | |
openchat-7b:free | | | 1.03¢ | 7.0 | | | | 5.0 | | |
hermes-2-pro-llama-3-8b | | | 0.99¢ | 7.0 | | | | 6.0 | | |
qwen-32b-chat | | | 0.96¢ | 7.0 | | | | 5.0 | | |
codellama-70b-instruct | | | 0.91¢ | 7.0 | | | | | | |
gpt-3.5-turbo-0613 | 0.07¢ | 10.0 | 1.00¢ | 7.0 | 0.08¢ | 9.3 | | 7.3 | 2.41¢ | 7.7 |
palm-2-codechat-bison | | | 1.07¢ | 7.0 | | | | 6.0 | | |
claude-2.1 | 0.06¢ | 10.0 | 0.90¢ | 7.0 | 0.07¢ | 8.5 | | 5.7 | 1.85¢ | 5.0 |
llama-3-8b-instruct:nitro | | | 0.98¢ | 6.8 | | | | 6.0 | | |
llama-3-lumimaid-70b | | | 1.00¢ | 6.8 | | | | 6.5 | | |
dbrx-instruct:nitro | | | 0.74¢ | 6.8 | | | | 3.7 | | |
openchat-7b | | | 0.98¢ | 6.7 | | | | 4.5 | | |
llama-3-sonar-small-32k-online | | | 0.95¢ | 6.7 | | | | 4.5 | | |
lzlv-70b-fp16-hf | | | 1.05¢ | 6.7 | | | | 4.8 | | |
gpt-3.5-turbo | 0.07¢ | 10.0 | 1.01¢ | 6.7 | 0.08¢ | 8.0 | | 6.0 | 2.23¢ | 5.7 |
claude-instant-1 | | | 0.99¢ | 6.7 | | | | 5.7 | | |
llama-3-lumimaid-8b | | | 1.02¢ | 6.7 | | | | 6.2 | | |
claude-2 | 0.09¢ | 9.0 | 0.93¢ | 6.7 | 0.06¢ | 10.0 | | 7.2 | 1.86¢ | 5.0 |
claude-2:beta | 0.07¢ | 9.3 | 0.95¢ | 6.7 | 0.06¢ | 10.0 | | 6.0 | 1.91¢ | 5.0 |
claude-3-sonnet | | | 1.04¢ | 6.7 | | | | 4.8 | | |
nous-hermes-2-mistral-7b-dpo | | | 1.03¢ | 6.5 | | | | 3.3 | | |
llama-3-8b-instruct | | | 0.91¢ | 6.3 | | | | 3.5 | | |
openchat-8b | | | 1.03¢ | 6.3 | | | | | | |
phi-3-medium-4k-instruct | | | 0.99¢ | 6.3 | | | | | | |
mixtral-8x22b-instruct | | | 0.99¢ | 6.3 | | | | 5.3 | | |
gpt-3.5-turbo-0125 | 0.07¢ | 10.0 | 1.01¢ | 6.3 | 0.07¢ | 9.0 | | 5.2 | 1.85¢ | 3.0 |
gemini-flash-1.5 | | | 1.17¢ | 6.3 | | | | 5.8 | | |
command-r-plus | 0.07¢ | 10.0 | 1.01¢ | 6.3 | 0.09¢ | 8.7 | | 5.5 | 2.00¢ | 4.0 |
llama-3-sonar-large-32k-chat | | | 1.06¢ | 6.2 | | | | 6.5 | | |
qwen-14b-chat | | | 1.00¢ | 6.0 | | | | 5.7 | | |
dbrx-instruct | | | 0.52¢ | 6.0 | | | | 2.0 | | |
deepseek-chat | | | 1.01¢ | 6.0 | | | | 5.0 | | |
zephyr-orpo-141b-a35b | 0.07¢ | 9.7 | 1.04¢ | 6.0 | 0.08¢ | 7.3 | | 5.8 | 2.18¢ | 3.3 |
gpt-3.5-turbo-16k | 0.07¢ | 10.0 | 0.96¢ | 6.0 | 0.07¢ | 10.0 | | 4.7 | 2.06¢ | 4.3 |
nemotron-4-340b-instruct | | | 1.01¢ | 6.0 | | | | | | |
mistral-medium | | | 1.01¢ | 6.0 | | | | 6.2 | | |
midnight-rose-70b | 0.07¢ | 10.0 | 1.09¢ | 6.0 | 0.08¢ | 10.0 | | 3.2 | 2.32¢ | 2.7 |
gpt-4o | | | 1.10¢ | 6.0 | | | | 5.3 | | |
llama-3-sonar-small-32k-chat | | | 1.03¢ | 5.7 | | | | 4.3 | | |
mixtral-8x7b-instruct | | | 0.96¢ | 5.7 | | | | 2.8 | | |
gemini-pro-vision | | | 1.01¢ | 5.7 | | | | 4.0 | | |
llama-3-sonar-large-32k-online | | | 0.98¢ | 5.7 | | | | 5.3 | | |
gpt-3.5-turbo-1106 | 0.06¢ | 10.0 | 0.94¢ | 5.7 | 0.07¢ | 8.3 | | 5.3 | 1.86¢ | 2.3 |
pplx-70b-chat | | | 0.96¢ | 5.7 | | | | 4.0 | | |
llama-3-8b-instruct:extended | | | 0.98¢ | 5.7 | | | | 4.7 | | |
pplx-70b-online | | | 1.03¢ | 5.7 | | | | 3.8 | | |
claude-2.0:beta | 0.06¢ | 10.0 | 0.97¢ | 5.7 | 0.07¢ | 9.3 | | 5.5 | 1.76¢ | 2.7 |
mistral-tiny | | | 1.02¢ | 5.5 | | | | 4.0 | | |
phi-3-medium-128k-instruct:free | | | 0.97¢ | 5.3 | | | | 3.7 | | |
mistral-7b-instruct:nitro | | | 1.03¢ | 5.3 | | | | 3.0 | | |
palm-2-chat-bison-32k | | | 0.94¢ | 5.3 | | | | 3.8 | | |
qwen-110b-chat | | | 0.99¢ | 5.3 | | | | 5.8 | | |
llama-3-lumimaid-8b:extended | | | 1.02¢ | 5.3 | | | | 3.3 | | |
claude-1.2 | | | 0.95¢ | 5.3 | | | | 5.3 | | |
claude-2.0 | 0.07¢ | 10.0 | 0.91¢ | 5.3 | 0.07¢ | 7.3 | | 4.2 | 1.92¢ | 2.3 |
claude-1 | | | 0.96¢ | 5.3 | | | | 4.2 | | |
mistral-7b-instruct-v0.3 | | | 1.00¢ | 5.0 | | | | 4.5 | | |
toppy-m-7b | | | 0.99¢ | 5.0 | | | | 3.0 | | |
mixtral-8x7b-instruct:nitro | | | 0.97¢ | 5.0 | | | | 4.3 | | |
mixtral-8x22b-instruct-preview | | | 0.99¢ | 5.0 | | | | 5.0 | | |
command-r | 0.08¢ | 9.7 | 0.94¢ | 5.0 | 0.07¢ | 8.7 | | 3.2 | 2.15¢ | 2.0 |
claude-instant-1.0 | | | 0.95¢ | 5.0 | | | | 4.3 | | |
nous-hermes-2-mixtral-8x7b-sft | | | 0.92¢ | 4.7 | | | | 4.7 | | |
nous-capybara-34b | | | 0.92¢ | 4.7 | | | | 4.8 | | |
gemini-pro | | | 0.96¢ | 4.7 | | | | 3.7 | | |
airoboros-l2-70b | | | 1.01¢ | 4.7 | | | | 3.8 | | |
xwin-lm-70b | | | 0.88¢ | 4.7 | | | | 2.0 | | |
fimbulvetr-11b-v2 | | | 1.06¢ | 4.7 | | | | 3.2 | | |
toppy-m-7b:free | | | 1.01¢ | 4.5 | | | | 2.3 | | |
zephyr-7b-beta:free | | | 1.02¢ | 4.3 | | | | 1.7 | | |
mistral-7b-instruct | | | 1.04¢ | 4.3 | | | | 1.3 | | |
phi-3-mini-128k-instruct | | | 0.99¢ | 4.3 | | | | 3.3 | | |
mythomax-l2-13b:nitro | | | 0.96¢ | 4.3 | | | | 2.0 | | |
mythomax-l2-13b | | | 0.95¢ | 4.3 | | | | 2.7 | | |
openhermes-2.5-mistral-7b | | | 0.94¢ | 4.3 | | | | 2.7 | | |
phi-3-medium-128k-instruct | | | 0.99¢ | 4.3 | | | | 3.0 | | |
llava-yi-34b | | | 1.02¢ | 4.3 | | | | 2.2 | | |
nous-hermes-yi-34b | | | 1.22¢ | 4.3 | | | | 2.2 | | |
sonar-small-online | | | 0.96¢ | 4.3 | | | | 2.0 | | |
pplx-7b-online | | | 0.97¢ | 4.3 | | | | 2.7 | | |
noromaid-mixtral-8x7b-instruct | | | 1.03¢ | 4.3 | | | | 2.2 | | |
claude-2.1:beta | 0.06¢ | 9.7 | 0.94¢ | 4.3 | 0.07¢ | 6.0 | | 4.3 | 1.93¢ | 3.0 |
cinematika-7b:free | | | 0.90¢ | 4.0 | | | | 2.0 | | |
phi-3-mini-128k-instruct:free | | | 0.92¢ | 4.0 | | | | 2.3 | | |
toppy-m-7b:nitro | | | 1.04¢ | 4.0 | | | | 2.0 | | |
mistral-7b-instruct-v0.2 | | | 1.08¢ | 4.0 | | | | 0.8 | | |
qwen-7b-chat | | | 0.98¢ | 4.0 | | | | 3.3 | | |
mythomist-7b | | | 0.99¢ | 4.0 | | | | 2.0 | | |
dolphin-mixtral-8x7b | | | 0.94¢ | 4.0 | | | | 2.7 | | |
yi-34b-chat | | | 0.95¢ | 4.0 | | | | 2.5 | | |
palm-2-codechat-bison-32k | | | 0.81¢ | 4.0 | | | | 3.0 | | |
psyfighter-13b | | | 0.95¢ | 4.0 | | | | 2.0 | | |
llama-2-70b-chat | | | 1.00¢ | 4.0 | | | | 3.0 | | |
neural-chat-7b | | | 1.03¢ | 4.0 | | | | 3.0 | | |
bagel-34b | 0.07¢ | 9.3 | 0.94¢ | 4.0 | 0.07¢ | 8.7 | | 2.2 | 1.98¢ | 1.0 |
nous-capybara-7b | 0.07¢ | 6.3 | 0.94¢ | 3.7 | 0.07¢ | 7.5 | | 1.7 | 1.97¢ | 0.3 |
chronos-hermes-13b | 0.08¢ | 9.3 | 0.99¢ | 3.7 | 0.08¢ | 9.3 | | 2.3 | 2.08¢ | 1.3 |
palm-2-chat-bison | | | 0.88¢ | 3.7 | | | | 2.3 | | |
codellama-34b-instruct | | | 0.97¢ | 3.7 | | | | 3.3 | | |
remm-slerp-l2-13b:extended | | | 0.93¢ | 3.7 | | | | 2.8 | | |
gpt-3.5-turbo-instruct | 0.07¢ | 10.0 | 1.00¢ | 3.7 | 0.09¢ | 8.7 | | 2.7 | 2.28¢ | 3.0 |
sonar-medium-online | | | 1.03¢ | 3.7 | | | | 2.7 | | |
gemma-7b-it | | | 0.89¢ | 3.3 | | | | 3.7 | | |
mistral-7b-openorca | | | 0.94¢ | 3.3 | | | | 1.5 | | |
stripedhyena-nous-7b | | | 0.99¢ | 3.3 | | | | 2.2 | | |
pplx-7b-chat | | | 1.00¢ | 3.3 | | | | 0.8 | | |
mythomax-l2-13b:extended | | | 0.95¢ | 3.3 | | | | 0.7 | | |
nous-hermes-llama2-13b | | | 0.92¢ | 3.3 | | | | 2.7 | | |
remm-slerp-l2-13b | | | 0.93¢ | 3.3 | | | | 2.2 | | |
codellama-70b-instruct | | | 0.93¢ | 3.3 | | | | 2.3 | | |
llama-2-70b-chat:nitro | | | 0.89¢ | 3.3 | | | | 2.5 | | |
psyfighter-13b-2 | | | 0.98¢ | 3.3 | | | | 0.7 | | |
synthia-70b | | | 0.95¢ | 3.3 | | | | 1.0 | | |
rwkv-5-world-3b | | | 0.88¢ | 3.0 | | | | 0.8 | | |
qwen-4b-chat | | | 0.91¢ | 3.0 | | | | 1.7 | | |
cinematika-7b | | | 0.80¢ | 3.0 | | | | 4.0 | | |
openhermes-2-mistral-7b | | | 0.91¢ | 3.0 | | | | 1.5 | | |
mixtral-8x22b | | | 0.80¢ | 3.0 | | | | 2.7 | | |
nous-capybara-7b:free | 0.08¢ | 8.7 | 0.95¢ | 2.7 | 0.07¢ | 5.0 | | 0.7 | 1.89¢ | 0.0 |
gemma-7b-it:nitro | | | 0.87¢ | 2.7 | | | | 1.0 | | |
firellava-13b | | | 0.92¢ | 2.7 | | | | 2.3 | | |
mistral-7b-instruct-v0.1 | | | 0.92¢ | 2.7 | | | | 1.7 | | |
sonar-small-chat | | | 0.98¢ | 2.7 | | | | 0.3 | | |
mythalion-13b | | | 0.94¢ | 2.7 | | | | 2.5 | | |
dolphin-mixtral-8x22b | | | 1.00¢ | 2.7 | | | | | | |
mythomist-7b:free | | | 0.99¢ | 2.3 | | | | 2.0 | | |
llama-2-13b-chat | | | 1.01¢ | 2.3 | | | | 0.2 | | |
noromaid-20b | | | 0.96¢ | 2.3 | | | | 1.0 | | |
goliath-120b | | | 1.01¢ | 2.3 | | | | 0.3 | | |
olmo-7b-instruct | | | 1.15¢ | 2.2 | | | | 0.0 | | |
llama-3-70b | | | 0.90¢ | 2.2 | | | | 1.7 | | |
eagle-7b | | | 1.28¢ | 2.0 | | | | 0.2 | | |
zephyr-7b-beta | | | 0.93¢ | 2.0 | | | | 1.2 | | |
command | | | 0.92¢ | 2.0 | | | | 0.3 | | |
gemma-7b-it:free | | | 0.83¢ | 1.7 | | | | 1.0 | | |
mistral-7b-instruct:free | | | 0.90¢ | 1.7 | | | | 0.2 | | |
stripedhyena-hessian-7b | | | 0.91¢ | 1.5 | | | | 0.7 | | |
mistral-small | | | 0.78¢ | 1.3 | | | | 3.7 | | |
weaver | 0.05¢ | 3.3 | 0.74¢ | 1.0 | 0.06¢ | 4.0 | | 0.7 | 1.60¢ | 0.0 |
yi-6b | | | 0.75¢ | 0.7 | | | | 0.0 | | |
mixtral-8x7b | | | 0.66¢ | 0.7 | | | | 0.0 | | |
rwkv-5-3b-ai-town | | | 0.80¢ | 0.0 | | | | 0.0 | | |
soliloquy-l3 | | | 0.59¢ | 0.0 | | | | 0.0 | | |
llama-guard-2-8b | | | 0.53¢ | 0.0 | | | | 3.5 | | |
yi-34b | | | 0.62¢ | 0.0 | | | | 0.0 | | |
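
To get a rough feel for how lenient each evaluator is, here is a minimal sketch that averages the scores each evaluator gave to a handful of models. It is not the script used to build the table: the names `EVALUATORS`, `SCORES` and `summarise` are made up for illustration, the seven rows are copied by hand from the table above, and it assumes that the score listed without a cost belongs to gpt-4-0125-preview.

```python
from statistics import mean, pstdev

# Evaluator models, in the column order assumed for the table above.
EVALUATORS = [
    "claude-3-haiku",
    "claude-3.5-sonnet",
    "gpt-3.5-turbo-0125",
    "gpt-4-0125-preview",
    "gpt-4-1106-preview",
]

# Scores copied from the table, one list per evaluated model, in the same
# order as EVALUATORS. Only rows that have all five scores are included.
SCORES = {
    "claude-3-opus": [10.0, 10.0, 10.0, 10.0, 10.0],
    "gpt-4-turbo": [10.0, 9.0, 7.0, 8.2, 7.0],
    "mistral-large": [10.0, 8.0, 10.0, 7.8, 5.7],
    "gpt-3.5-turbo": [10.0, 6.7, 8.0, 6.0, 5.7],
    "command-r": [9.7, 5.0, 8.7, 3.2, 2.0],
    "bagel-34b": [9.3, 4.0, 8.7, 2.2, 1.0],
    "weaver": [3.3, 1.0, 4.0, 0.7, 0.0],
}


def summarise(scores):
    """Return {evaluator: (mean score, population standard deviation)}."""
    summary = {}
    for i, evaluator in enumerate(EVALUATORS):
        column = [row[i] for row in scores.values()]
        summary[evaluator] = (mean(column), pstdev(column))
    return summary


summary = summarise(SCORES)
for evaluator, (avg, spread) in summary.items():
    print(f"{evaluator:20s} mean={avg:4.1f} stdev={spread:4.1f}")
```

A higher mean on the same set of models points to a more lenient evaluator. With only seven rows this is illustrative rather than conclusive, so swap in more rows before drawing anything from it.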