It has the following characteristics:
100 tokens are generated from 50 input tokens with a batch size of 1:
Model(3) | Epyc 7313 (6) (tokens/s) | RTX A6000 (tokens/s) | RTX 4090 (tokens/s) |
---|---|---|---|
gptj_6B_q4 | 21.5 | 132 | 164 |
flan_t5_xxl_q4 | 25 | 130 | 158 |
llama2_7B_q4 | 23 | 115 | 144 |
llama2_13B_q4 | 12.0 | 69.3 | 88 |
gptneox_20B_q4 | 8.1 | 45.5 | 59 |
llama2_70B_q4 | 2.5 | 15.2 | - |
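For reference, these throughput figures translate directly into the end-to-end time of this benchmark. A minimal sketch (it assumes a constant decode rate and ignores the time spent processing the 50 input tokens):

```python
# Rough conversion of the measured decode throughput (tokens/s) into the
# time needed to generate the 100 tokens of this benchmark. This is only a
# sketch: it assumes a constant decode rate and ignores the time spent on
# the 50 input tokens.

def generation_time_s(tokens_per_s: float, new_tokens: int = 100) -> float:
    return new_tokens / tokens_per_s

# Figures taken from the batch-size-1 table above (RTX A6000 column).
for model, tps in [("llama2_7B_q4", 115), ("llama2_13B_q4", 69.3), ("llama2_70B_q4", 15.2)]:
    print(f"{model}: ~{generation_time_s(tps):.1f} s for 100 tokens")
```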
8 simultaneous requests, each generating 100 tokens from 50 input tokens (equivalent to a batch size of 8):
Model(3) | RTX A6000 (tokens/s) |
---|---|
llama2_7B_q4 | 783 |
llama2_13B_q4 | 492 |
llama2_70B_q4 | 118 |
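The batch-8 figures are much larger than the corresponding batch-1 numbers, so they are read here as the aggregate throughput over all 8 requests (an assumption, not stated explicitly above). A short sketch of the per-request rate and the speed-up this implies on the RTX A6000:

```python
# Sketch comparing the two RTX A6000 columns: the batch-8 figure is taken
# to be the total throughput across all 8 requests (assumption), so the
# per-request rate and the aggregate speed-up follow directly.

batch1 = {"llama2_7B_q4": 115, "llama2_13B_q4": 69.3, "llama2_70B_q4": 15.2}
batch8 = {"llama2_7B_q4": 783, "llama2_13B_q4": 492, "llama2_70B_q4": 118}

for model in batch1:
    per_request = batch8[model] / 8          # tokens/s seen by each request
    speedup = batch8[model] / batch1[model]  # aggregate gain from batching
    print(f"{model}: {per_request:.0f} tokens/s per request, "
          f"{speedup:.1f}x aggregate speed-up")
```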
A single 512x512 image is generated using 50 time steps:
Model(3) | RTX A6000 (seconds) | RTX 4090 (seconds) |
---|---|---|
stable diffusion 1.4 | 1.82 | 1.21 |
stable diffusion 2.1 | 1.67 | 1.19 |
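Since each image uses 50 time steps, an average per-step cost follows directly from the totals. A rough sketch (it lumps the text-encoder and VAE-decoder overhead into the per-step figure):

```python
# Average per-step cost derived from the 50-step totals above. This is
# approximate: the totals also include the text encoder and VAE decoder.

timings_s = {
    ("stable diffusion 1.4", "RTX A6000"): 1.82,
    ("stable diffusion 1.4", "RTX 4090"): 1.21,
    ("stable diffusion 2.1", "RTX A6000"): 1.67,
    ("stable diffusion 2.1", "RTX 4090"): 1.19,
}
for (model, gpu), total in timings_s.items():
    print(f"{model} on {gpu}: ~{total / 50 * 1000:.0f} ms per step")
```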
Language Models:
bloom_560M | 1.1 | 29.176 | 36.8% | 35.8% | 51.4% | 63.7% | 36.0% | 44.7% |
codegen_6B_mono_q4 | 4.4 | 69.409 | 28.0% | 35.7% | 51.1% | 60.2% | 38.0% | 42.6% |
codegen_6B_mono_q8 | 7.7 | 67.262 | 28.1% | 35.8% | 50.8% | 60.1% | 39.1% | 42.8% |
fairseq_gpt_13B | 26.2 | 3.567 | 71.9% | 72.7% | 67.5% | 77.6% | 70.1% | 71.9% |
fairseq_gpt_13B_q4 | 7.9 | 3.646 | 71.2% | 72.5% | 67.6% | 77.4% | 70.6% | 71.9% |
fairseq_gpt_13B_q8 | 14.2 | 3.565 | 71.8% | 72.7% | 67.2% | 77.7% | 70.0% | 71.9% |
flan_t5_base | 0.5 | 12.891 | 54.2% | 36.5% | 54.7% | 65.8% | 62.1% | 54.7% |
flan_t5_base_q8 | 0.3 | 13.098 | 54.2% | 36.4% | 54.2% | 65.7% | 61.8% | 54.5% |
flan_t5_small | 0.2 | 23.343 | 46.7% | 29.2% | 50.0% | 62.4% | 47.9% | 47.2% |
flan_t5_small_q8 | 0.1 | 23.449 | 46.7% | 29.2% | 49.7% | 62.4% | 48.2% | 47.2% |
flan_t5_xxl_q4 | 6.5 | 3.010 | 77.7% | 71.5% | 73.4% | 77.6% | 71.8% | 74.4% |
flan_t5_xxl_q8 | 12.0 | 3.049 | 77.8% | 72.1% | 75.1% | 77.8% | 73.1% | 75.2% |
flan_ul2_20B_q4 | 11.3 | - | 74.1% | 24.3% | 51.1% | 49.9% | 78.8% | 55.6% |
flan_ul2_20B_q8 | 20.9 | - | 74.4% | 24.4% | 52.0% | 50.6% | 77.3% | 55.7% |
gpt2_117M | 0.3 | 40.110 | 32.9% | 31.1% | 52.1% | 62.9% | 27.3% | 41.3% |
gpt2_345M | 0.7 | 18.272 | 43.5% | 39.4% | 53.3% | 67.7% | 43.1% | 49.4% |
gpt2_345M_q8 | 0.5 | 18.452 | 43.1% | 39.4% | 53.1% | 67.5% | 41.9% | 49.0% |
gpt2_774M | 1.6 | 12.966 | 47.8% | 45.4% | 55.6% | 70.4% | 48.5% | 53.5% |
gpt2_774M_q8 | 1.0 | 12.928 | 47.9% | 45.4% | 55.3% | 70.3% | 48.2% | 53.4% |
gpt2_1558M | 3.1 | 10.637 | 51.3% | 50.8% | 58.4% | 70.8% | 53.2% | 56.9% |
gpt2_1558M_q8 | 1.8 | 10.655 | 51.2% | 50.8% | 58.6% | 70.8% | 53.2% | 56.9% |
gptj_6B | 12.1 | 4.124 | 69.0% | 66.2% | 64.8% | 75.5% | 66.9% | 68.5% |
gptj_6B_q4 | 3.8 | 4.153 | 68.9% | 65.7% | 63.9% | 74.4% | 67.0% | 68.0% |
gptj_6B_q8 | 6.6 | 4.122 | 69.1% | 66.2% | 64.4% | 75.4% | 66.4% | 68.3% |
gptneox_20B | 41.1 | 3.657 | 72.6% | 71.4% | 65.5% | 77.5% | 73.3% | 72.0% |
gptneox_20B_q4 | 12.2 | 3.711 | 72.0% | 69.3% | 64.8% | 76.7% | 70.8% | 70.7% |
gptneox_20B_q8 | 22.1 | 3.659 | 72.6% | 71.3% | 65.8% | 77.3% | 72.9% | 72.0% |
llama_7B | 13.5 | 3.463 | 73.6% | 76.2% | 70.4% | 78.1% | 75.4% | 74.7% |
llama_7B_q4 | 4.0 | 3.549 | 73.2% | 75.5% | 70.4% | 78.0% | 74.7% | 74.4% |
llama_7B_q8 | 7.3 | 3.453 | 73.7% | 76.1% | 70.2% | 78.0% | 75.5% | 74.7% |
llama_13B_q4 | 7.6 | 3.130 | 77.1% | 78.6% | 72.2% | 78.3% | 77.8% | 76.8% |
llama_13B_q8 | 14.0 | 3.178 | 76.5% | 79.1% | 73.2% | 79.1% | 77.1% | 77.0% |
llama_30B_q4 | 18.7 | 2.877 | 77.5% | 82.4% | 75.7% | 80.2% | 80.2% | 79.2% |
llama_30B_q8 | 34.8 | 2.853 | 77.7% | 82.7% | 76.3% | 80.3% | 80.4% | 79.5% |
llama_65B_q4 | 37.2 | 2.760 | 78.5% | 83.9% | 76.6% | 81.4% | 83.2% | 80.7% |
opt_125M | 0.3 | 26.028 | 37.9% | 31.3% | 50.2% | 63.2% | 23.4% | 41.2% |
opt_30B_q4 | 17.8 | 3.656 | 71.5% | 72.1% | 68.0% | 77.4% | 69.9% | 71.8% |
opt_30B_q8 | 32.6 | 3.628 | 71.6% | 72.3% | 68.2% | 77.7% | 71.4% | 72.3% |
opt_66B_q4 | 38.2 | 3.308 | 73.4% | 74.4% | 68.4% | 78.5% | 75.0% | 73.9% |
pythia_deduped_70M | 0.1 | 96.126 | 25.6% | 28.3% | 54.4% | 60.4% | 13.1% | 36.3% |
pythia_deduped_160M | 0.3 | 26.380 | 36.9% | 32.3% | 51.4% | 63.8% | 23.2% | 41.5% |
pythia_deduped_410M | 0.8 | 10.827 | 51.7% | 40.8% | 54.0% | 67.2% | 43.0% | 51.4% |
pythia_deduped_410M_q8 | 0.5 | 10.729 | 51.8% | 40.7% | 53.8% | 67.1% | 42.7% | 51.2% |
pythia_deduped_1B | 2.0 | 7.273 | 58.5% | 49.0% | 54.5% | 71.0% | 49.9% | 56.6% |
pythia_deduped_1B_q8 | 1.2 | 7.286 | 58.4% | 49.0% | 54.9% | 70.9% | 49.0% | 56.5% |
pythia_deduped_1.4B | 2.8 | 6.546 | 63.1% | 52.2% | 57.1% | 72.7% | 52.6% | 59.5% |
pythia_deduped_1.4B_q8 | 1.6 | 6.577 | 63.3% | 52.1% | 55.7% | 73.1% | 53.0% | 59.4% |
pythia_deduped_2.8B | 5.6 | 4.787 | 67.1% | 61.6% | 60.9% | 74.4% | 65.5% | 65.9% |
pythia_deduped_2.8B_q8 | 3.1 | 4.778 | 66.9% | 61.5% | 61.2% | 74.5% | 65.6% | 66.0% |
pythia_deduped_6.9B | 13.7 | 4.195 | 69.1% | 65.7% | 63.9% | 75.1% | 66.1% | 68.0% |
pythia_deduped_6.9B_q4 | 4.3 | 4.344 | 68.3% | 65.0% | 62.5% | 75.3% | 66.3% | 67.5% |
pythia_deduped_6.9B_q8 | 7.5 | 4.187 | 69.4% | 65.7% | 63.6% | 75.5% | 66.8% | 68.2% |
pythia_deduped_12B | 23.7 | 3.854 | 70.9% | 69.2% | 63.9% | 76.3% | 70.8% | 70.2% |
pythia_deduped_12B_q4 | 7.2 | 4.187 | 69.2% | 68.5% | 63.1% | 76.4% | 69.6% | 69.4% |
pythia_deduped_12B_q8 | 12.8 | 3.857 | 70.9% | 69.2% | 64.2% | 76.1% | 70.9% | 70.3% |
rwkv_14B | 28.3 | 3.819 | 71.6% | 70.2% | 63.1% | 77.5% | 47.2% | 65.9% |
rwkv_14B_q4 | 8.5 | 4.076 | 68.3% | 69.8% | 63.1% | 77.1% | 45.0% | 64.7% |
rwkv_14B_q8 | 15.3 | 3.806 | 71.9% | 70.2% | 63.0% | 77.5% | 47.1% | 65.9% |
rwkv_7B | 16 | 4.396 | 67.5% | 65.6% | 61.9% | 75.6% | 39.7% | 62.1% |
rwkv_7B_q4 | 4.6 | 4.939 | 64.7% | 64.8% | 61.2% | 75.4% | 38.4% | 60.9% |
rwkv_7B_q8 | 8.0 | 4.395 | 67.5% | 65.6% | 61.6% | 75.9% | 40.2% | 62.2% |
RedPajama-INCITE-7B_q4 | 4.3 | 4.006 | 71.0% | 69.7% | 64.6% | 76.3% | 71.7% | 70.7% |
RedPajama-INCITE-7B_q8 | 7.5 | 3.910 | 71.4% | 70.4% | 64.3% | 77.0% | 71.9% | 71.0% |
falcon_40B_q4 | 24.6 | 2.844 | 77.6% | 82.5% | 76.2% | 82.2% | 78.8% | 79.5% |
falcon_40B_q8 | 45.0 | 2.799 | 77.9% | 82.7% | 76.7% | 82.2% | 80.4% | 80.0% |
falcon_7B | 14.4 | 3.359 | 75.0% | 76.2% | 67.3% | 79.4% | 72.1% | 74.0% |
falcon_7B_q4 | 4.6 | 3.444 | 73.9% | 75.8% | 67.5% | 79.7% | 71.6% | 73.7% |
falcon_7B_q8 | 7.9 | 3.368 | 75.0% | 76.2% | 66.9% | 79.5% | 71.9% | 73.9% |
mpt_30B_q4 | 17.8 | 3.219 | 78.9% | 79.4% | 70.1% | 79.8% | 79.8% | 77.6% |
mpt_30B_q8 | 32.6 | 3.062 | 80.7% | 79.8% | 70.7% | 80.0% | 79.9% | 78.2% |
mpt_7B_q4 | 4.3 | 3.949 | 73.1% | 75.7% | 67.4% | 79.0% | 75.9% | 74.2% |
mpt_7B_q8 | 7.5 | 3.850 | 73.2% | 76.2% | 68.5% | 79.1% | 76.4% | 74.7% |
llama2_7B | 13.5 | 3.428 | 74.5% | 76.2% | 69.7% | 78.4% | 77.2% | 75.2% |
llama2_7B_q4 | 4.0 | 3.487 | 73.5% | 75.5% | 69.9% | 77.6% | 77.8% | 74.9% |
llama2_13B | 26.0 | 3.051 | 77.2% | 79.6% | 72.1% | 78.9% | 79.3% | 77.4% |
llama2_13B_q4 | 7.6 | 3.109 | 77.0% | 79.0% | 72.6% | 79.5% | 78.9% | 77.4% |
llama2_70B_q4 | 39.3 | 2.646 | 80.6% | 84.0% | 78.7% | 82.0% | 83.4% | 81.7% |
llama2_7B_q3 | 3.2 | 3.566 | 72.7% | 74.1% | 68.0% | 77.6% | 77.5% | 74.0% |
llama2_13B_q3 | 6.1 | 3.148 | 76.5% | 77.9% | 71.4% | 78.4% | 77.8% | 76.4% |
llama2_70B_q3 | 30.8 | 2.638 | 79.9% | 82.9% | 77.7% | 81.7% | 82.6% | 80.9% |
mistral_7B | 14.5 | 3.178 | 76.2% | 81.0% | 74.2% | 80.4% | 80.9% | 78.5% |
mistral_7B_q4 | 4.3 | 3.412 | 74.9% | 80.1% | 73.9% | 80.7% | 80.3% | 78.0% |
mistral_7B_q8 | 7.8 | 3.174 | 76.0% | 81.0% | 73.6% | 80.4% | 80.7% | 78.3% |
mixtral_47B_q3 | 19.3 | 2.851 | 76.8% | 82.2% | 75.6% | 81.3% | 79.8% | 79.1% |
mixtral_47B_q4 | 26.5 | 2.811 | 78.6% | 83.3% | 76.0% | 82.6% | 80.4% | 80.2% |
mixtral_47B_q8 | 49.7 | 2.790 | 79.3% | 83.9% | 78.1% | 82.0% | 80.7% | 80.8% |
llama3_8B | 16.1 | 3.107 | 76.8% | 79.1% | 73.1% | 79.7% | 80.7% | 77.9% |
llama3_8B_q4 | 5.5 | 3.291 | 75.2% | 78.2% | 73.5% | 78.8% | 80.4% | 77.2% |
llama3_70B | 141.1 | 2.597 | 80.6% | 84.9% | 80.1% | 82.3% | 84.0% | 82.4% |
llama3_70B_q4 | 41.7 | 2.619 | 80.4% | 84.4% | 80.3% | 82.1% | 83.1% | 82.1% |
llama3.1_8B | 16.1 | 3.150 | 76.6% | 78.8% | 73.9% | 79.9% | 80.8% | 78.0% |
llama3.1_70B | 141.1 | 2.670 | 80.1% | 84.9% | 79.4% | 83.0% | 83.7% | 82.2% |
llama3.1_70B_q4 | 41.8 | 2.713 | 79.9% | 84.4% | 79.4% | 82.6% | 83.4% | 81.9% |
llama3.1_70B_q3 | 31.1 | 2.865 | 78.0% | 83.0% | 78.4% | 82.0% | 83.6% | 81.0% |
llama3.1_405B_q4 | 232.4 | 2.454 | 81.6% | 87.0% | 82.4% | 83.8% | 83.8% | 83.7% |
qwen2_7B | 15.2 | 3.647 | 72.3% | 78.3% | 72.3% | 79.9% | 80.9% | 76.8% |
qwen2_7B_q4 | 5.3 | 3.712 | 72.0% | 77.8% | 71.3% | 79.7% | 81.7% | 76.5% |
llama3.2_1B | 2.5 | 5.765 | 63.7% | 63.6% | 60.4% | 74.6% | 69.3% | 66.3% |
llama3.2_1B_q8 | 1.9 | 5.780 | 63.9% | 63.5% | 59.9% | 74.4% | 68.5% | 66.0% |
llama3.2_3B | 6.5 | 3.963 | 71.4% | 73.5% | 69.1% | 76.7% | 78.9% | 73.9% |
llama3.2_3B_q4 | 2.8 | 4.167 | 71.0% | 72.2% | 68.4% | 76.7% | 78.6% | 73.4% |
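In the table above, the rightmost percentage column is the arithmetic mean of the five task scores that precede it. A quick check against two rows:

```python
# Check that the last percentage column of the language-model table is the
# arithmetic mean of the five task scores that precede it.

rows = {
    "llama2_70B_q4": ([80.6, 84.0, 78.7, 82.0, 83.4], 81.7),
    "flan_t5_xxl_q8": ([77.8, 72.1, 75.1, 77.8, 73.1], 75.2),
}
for model, (scores, reported_avg) in rows.items():
    mean = sum(scores) / len(scores)
    print(f"{model}: computed {mean:.1f}%, reported {reported_avg}%")
```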
Chat Models:
llama3_8B_instruct | 16.1 | 67.3% |
llama3_8B_instruct_q4 | 5.5 | 65.7% |
llama2_7B_chat_q4 | 3.9 | 45.3% |
llama2_13B_chat_q4 | 7.6 | 51.2% |
llama2_70B_chat_q4 | 39.3 | 61.1% |
mistral_7B_instruct_q4 | 3.9 | 53.0% |
mixtral_47B_instruct_q4 | 26.5 | 67.6% |
llama3.1_8B_instruct | 16.1 | 68.6% |
llama3.1_8B_instruct_q4 | 5.6 | 67.1% |
llama3.1_70B_instruct_q4 | 41.8 | 82.4% |
phi3_mini_4k_instruct | 7.6 | 70.1% |
phi3_mini_4k_instruct_q4 | 2.3 | 67.8% |
phi3.5_mini_instruct | 7.7 | 67.7% |
phi3.5_mini_instruct_q4 | 2.4 | 65.9% |
qwen2_7B_instruct | 15.2 | 70.3% |
qwen2_7B_instruct_q4 | 5.3 | 68.7% |
llama3.2_1B_instruct | 2.5 | 45.6% |
llama3.2_1B_instruct_q4 | 1.4 | 43.5% |
llama3.2_3B_instruct | 6.5 | 62.1% |
llama3.2_3B_instruct_q4 | 2.8 | 59.5% |
Additional Models:
Model | Size (GB) | Description |
---|---|---|
m2m100_1_2B_q8 | 1.6 | Translation between 100 languages |
nllb200_1.3B_q8 | 2.0 | Translation between 200 languages |
nllb200_3.3B_q8 | 4.6 | Translation between 200 languages |
madlad400_7B_q4 | 5.7 | Translation between 400 languages |
sd_v1.4 | 2.1 | Stable Diffusion text-to-image version 1.4 |
sd_v2.1 | 2.6 | Stable Diffusion text-to-image version 2.1 |
whisper_large_v3_q8.bin | 1.8 | Whisper large v3 speech to text transcription |
gte_qwen2_1.5B_instruct_q8.bin | 2.1 | Qwen2 GTE embeddings |
The SHA256 checksums of all the model files are listed in sha256.txt.
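A minimal sketch of how downloaded model files could be checked against sha256.txt, assuming the conventional `<hexdigest>  <filename>` layout written by sha256sum (the actual format of the file is not reproduced here):

```python
# Minimal sketch: verify downloaded model files against sha256.txt.
# Assumes the conventional "<hexdigest>  <filename>" layout of sha256sum
# output; adjust the parsing if the file uses a different format.
import hashlib
from pathlib import Path

def verify(checksum_file: str = "sha256.txt") -> None:
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        path = Path(name.strip().lstrip("*"))  # "*" marks binary mode in sha256sum output
        if not path.exists():
            print(f"{path}: missing")
            continue
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
                h.update(chunk)
        print(f"{path}: {'OK' if h.hexdigest() == digest else 'MISMATCH'}")

if __name__ == "__main__":
    verify()
```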
Notes: