TextSynth Server

News

2025-03-08: Added Parler-TTS text-to-speech model. Added real time WebSocket API for audio transcription and voice chat. Added BERT embeddings.
2024-09-30: Added support for the Llama 3.2, Phi3, Qwen2 and gte_qwen2 models. Added an embeddings endpoint. Added CPU offloading.
2024-08-03: Added Llama 3.1 model support.
2024-05-21: Added Llama 3 model support.
2024-01-20: Added Mixtral model support. Added fast Whisper-based speech-to-text transcription.
2023-10-21: Added CUDA support in the Windows version and Mistral model support. Added speculative sampling. Added BNF grammar and JSON schema sampling.
2023-08-07: The GPU version and model conversion utilities are now freely available.
2023-07-21: The MPT and Llama 2 models are supported.
2023-06-10: New ts_chat utility to chat with language models. The Falcon and RedPajama-INCITE models are supported.
2023-03-26: The NLLB200 and flan UL2 models have been added. An HTML GUI is now available in ts_server.

Introduction

ts_server is a web server providing a REST API to large language models. The models can be used, for example, for text completion, question answering, classification, chat, translation, image generation, audio transcription and speech synthesis.

It has the following characteristics:

Everything is included in a single binary. There are very few external dependencies (Python is not needed), so installation is easy.
Supports many Transformer variants (GPT-J, GPT-NeoX, GPT-Neo, OPT, Fairseq GPT, M2M100, CodeGen, GPT2, T5, RWKV, LLAMA, Falcon, MPT, Llama 3.2, Mistral, Mixtral, Qwen2, Phi3, Whisper, Parler-TTS) and Stable Diffusion.
Integrated REST JSON API for text completion, translation, image generation, audio transcription and speech synthesis. It is used by textsynth.com.
Integrated WebSocket API for real time audio transcription and voice chat (experimental).
Integrated HTML GUI for testing.
Very high performance for small and large batches on CPU and GPU. Support of dynamic batching to handle a large number of simultaneous requests.
Efficient custom 8, 4 and 3 bit quantization. Our quantized models are thoroughly evaluated on several standard tasks to ensure good performance.
Larger models work optimally on lower cost GPUs (e.g. RTX 3090, RTX A6000) thanks to efficient quantization.
Support of speculative sampling for even faster inference.
Support of grammar-based sampling to constrain the model output according to a BNF grammar or a JSON schema.
Uses the LibNC library for simple tensor manipulation using the C language.
Simple command line tools (ts_test, ts_sd, ts_chat, ts_audiototext) are provided to test the various models.
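
As a rough illustration of the REST JSON API mentioned in the list above, the sketch below builds (but does not send) a text completion request. The port 8080, the /v1/engines/<model>/completions path and the prompt and max_tokens fields are assumptions modelled on the public textsynth.com API, and build_completion_request is a hypothetical helper name; the ts_server documentation is the authority on the actual values.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=100):
    """Build (without sending) a POST request for a text completion.

    The URL layout and the JSON field names are assumptions, not taken
    from this page.
    """
    url = f"{base_url}/v1/engines/{model}/completions"
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("http://localhost:8080", "llama2_7B_q4", "Hello")

# To send it to a running server (not executed here):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

Building the request separately from sending it makes the payload easy to inspect before pointing it at a real server.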

The free version is released as binary code for non-commercial use only. Please contact us for commercial use.

Download

Linux version ts_server_free-2025-03-09.tar.gz (Changelog).
Windows version ts_server_free-2024-09-30-win64.zip (Changelog).

Documentation

Benchmarks

Text generation

100 tokens are generated with a batch size of 1 and 50 input tokens:

Model⁽³⁾	Epyc 7313⁽⁶⁾ (tokens/s)	RTX A6000 (tokens/s)	RTX 4090 (tokens/s)
gptj_6B_q4	21.5	132	164
flan_t5_xxl_q4	25	130	158
llama2_7B_q4	23	115	144
llama2_13B_q4	12.0	69.3	88
gptneox_20B_q4	8.1	45.5	59
llama2_70B_q4	2.5	15.2	-

8 simultaneous requests generating 100 tokens with 50 input tokens (equivalent to a batch size of 8):

Model⁽³⁾	RTX A6000 (tokens/s)
llama2_7B_q4	783
llama2_13B_q4	492
llama2_70B_q4	118
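
To make the effect of batching concrete, the following sketch divides the 8-request throughput by the single-request throughput for the RTX A6000, using only figures from the two tables above. The variable and function names are illustrative.

```python
# Tokens/s on an RTX A6000, copied from the two tables above.
single_stream = {"llama2_7B_q4": 115, "llama2_13B_q4": 69.3, "llama2_70B_q4": 15.2}
batch_of_8 = {"llama2_7B_q4": 783, "llama2_13B_q4": 492, "llama2_70B_q4": 118}


def batching_gain(model):
    """Aggregate throughput with 8 requests relative to a single request."""
    return batch_of_8[model] / single_stream[model]


gains = {model: batching_gain(model) for model in single_stream}

for model, gain in gains.items():
    per_request = batch_of_8[model] / 8
    print(f"{model}: {gain:.2f}x aggregate, {per_request:.1f} tokens/s per request")
```

With these figures the aggregate throughput rises by roughly 7x for 8 simultaneous requests, while each individual request runs somewhat slower than it would alone.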

Text to image

A single 512x512 image is generated using 50 time steps.

Model⁽³⁾	RTX A6000 (seconds)	RTX 4090 (seconds)
stable diffusion 1.4	1.82	1.21
stable diffusion 2.1	1.67	1.19

Available Models

We provide here model files that can be used with the TextSynth Server. Each model was evaluated with the lm-evaluation-harness using the TextSynth Server on a single RTX A6000 GPU.

Language Models:

bloom_560M	1.1	29.176	36.8%	35.8%	51.4%	63.7%	36.0%	44.7%
codegen_6B_mono_q4	4.4	69.409	28.0%	35.7%	51.1%	60.2%	38.0%	42.6%
codegen_6B_mono_q8	7.7	67.262	28.1%	35.8%	50.8%	60.1%	39.1%	42.8%
fairseq_gpt_13B	26.2	3.567	71.9%	72.7%	67.5%	77.6%	70.1%	71.9%
fairseq_gpt_13B_q4	7.9	3.646	71.2%	72.5%	67.6%	77.4%	70.6%	71.9%
fairseq_gpt_13B_q8	14.2	3.565	71.8%	72.7%	67.2%	77.7%	70.0%	71.9%
flan_t5_base	0.5	12.891	54.2%	36.5%	54.7%	65.8%	62.1%	54.7%
flan_t5_base_q8	0.3	13.098	54.2%	36.4%	54.2%	65.7%	61.8%	54.5%
flan_t5_small	0.2	23.343	46.7%	29.2%	50.0%	62.4%	47.9%	47.2%
flan_t5_small_q8	0.1	23.449	46.7%	29.2%	49.7%	62.4%	48.2%	47.2%
flan_t5_xxl_q4	6.5	3.010	77.7%	71.5%	73.4%	77.6%	71.8%	74.4%
flan_t5_xxl_q8	12.0	3.049	77.8%	72.1%	75.1%	77.8%	73.1%	75.2%
flan_ul2_20B_q4	11.3	-	74.1%	24.3%	51.1%	49.9%	78.8%	55.6%
flan_ul2_20B_q8	20.9	-	74.4%	24.4%	52.0%	50.6%	77.3%	55.7%
gpt2_117M	0.3	40.110	32.9%	31.1%	52.1%	62.9%	27.3%	41.3%
gpt2_345M	0.7	18.272	43.5%	39.4%	53.3%	67.7%	43.1%	49.4%
gpt2_345M_q8	0.5	18.452	43.1%	39.4%	53.1%	67.5%	41.9%	49.0%
gpt2_774M	1.6	12.966	47.8%	45.4%	55.6%	70.4%	48.5%	53.5%
gpt2_774M_q8	1.0	12.928	47.9%	45.4%	55.3%	70.3%	48.2%	53.4%
gpt2_1558M	3.1	10.637	51.3%	50.8%	58.4%	70.8%	53.2%	56.9%
gpt2_1558M_q8	1.8	10.655	51.2%	50.8%	58.6%	70.8%	53.2%	56.9%
gptj_6B	12.1	4.124	69.0%	66.2%	64.8%	75.5%	66.9%	68.5%
gptj_6B_q4	3.8	4.153	68.9%	65.7%	63.9%	74.4%	67.0%	68.0%
gptj_6B_q8	6.6	4.122	69.1%	66.2%	64.4%	75.4%	66.4%	68.3%
gptneox_20B	41.1	3.657	72.6%	71.4%	65.5%	77.5%	73.3%	72.0%
gptneox_20B_q4	12.2	3.711	72.0%	69.3%	64.8%	76.7%	70.8%	70.7%
gptneox_20B_q8	22.1	3.659	72.6%	71.3%	65.8%	77.3%	72.9%	72.0%
llama_7B	13.5	3.463	73.6%	76.2%	70.4%	78.1%	75.4%	74.7%
llama_7B_q4	4.0	3.549	73.2%	75.5%	70.4%	78.0%	74.7%	74.4%
llama_7B_q8	7.3	3.453	73.7%	76.1%	70.2%	78.0%	75.5%	74.7%
llama_13B_q4	7.6	3.130	77.1%	78.6%	72.2%	78.3%	77.8%	76.8%
llama_13B_q8	14.0	3.178	76.5%	79.1%	73.2%	79.1%	77.1%	77.0%
llama_30B_q4	18.7	2.877	77.5%	82.4%	75.7%	80.2%	80.2%	79.2%
llama_30B_q8	34.8	2.853	77.7%	82.7%	76.3%	80.3%	80.4%	79.5%
llama_65B_q4	37.2	2.760	78.5%	83.9%	76.6%	81.4%	83.2%	80.7%
opt_125M	0.3	26.028	37.9%	31.3%	50.2%	63.2%	23.4%	41.2%
opt_30B_q4	17.8	3.656	71.5%	72.1%	68.0%	77.4%	69.9%	71.8%
opt_30B_q8	32.6	3.628	71.6%	72.3%	68.2%	77.7%	71.4%	72.3%
opt_66B_q4	38.2	3.308	73.4%	74.4%	68.4%	78.5%	75.0%	73.9%
pythia_deduped_70M	0.1	96.126	25.6%	28.3%	54.4%	60.4%	13.1%	36.3%
pythia_deduped_160M	0.3	26.380	36.9%	32.3%	51.4%	63.8%	23.2%	41.5%
pythia_deduped_410M	0.8	10.827	51.7%	40.8%	54.0%	67.2%	43.0%	51.4%
pythia_deduped_410M_q8	0.5	10.729	51.8%	40.7%	53.8%	67.1%	42.7%	51.2%
pythia_deduped_1B	2.0	7.273	58.5%	49.0%	54.5%	71.0%	49.9%	56.6%
pythia_deduped_1B_q8	1.2	7.286	58.4%	49.0%	54.9%	70.9%	49.0%	56.5%
pythia_deduped_1.4B	2.8	6.546	63.1%	52.2%	57.1%	72.7%	52.6%	59.5%
pythia_deduped_1.4B_q8	1.6	6.577	63.3%	52.1%	55.7%	73.1%	53.0%	59.4%
pythia_deduped_2.8B	5.6	4.787	67.1%	61.6%	60.9%	74.4%	65.5%	65.9%
pythia_deduped_2.8B_q8	3.1	4.778	66.9%	61.5%	61.2%	74.5%	65.6%	66.0%
pythia_deduped_6.9B	13.7	4.195	69.1%	65.7%	63.9%	75.1%	66.1%	68.0%
pythia_deduped_6.9B_q4	4.3	4.344	68.3%	65.0%	62.5%	75.3%	66.3%	67.5%
pythia_deduped_6.9B_q8	7.5	4.187	69.4%	65.7%	63.6%	75.5%	66.8%	68.2%
pythia_deduped_12B	23.7	3.854	70.9%	69.2%	63.9%	76.3%	70.8%	70.2%
pythia_deduped_12B_q4	7.2	4.187	69.2%	68.5%	63.1%	76.4%	69.6%	69.4%
pythia_deduped_12B_q8	12.8	3.857	70.9%	69.2%	64.2%	76.1%	70.9%	70.3%
rwkv_14B	28.3	3.819	71.6%	70.2%	63.1%	77.5%	47.2%	65.9%
rwkv_14B_q4	8.5	4.076	68.3%	69.8%	63.1%	77.1%	45.0%	64.7%
rwkv_14B_q8	15.3	3.806	71.9%	70.2%	63.0%	77.5%	47.1%	65.9%
rwkv_7B	16	4.396	67.5%	65.6%	61.9%	75.6%	39.7%	62.1%
rwkv_7B_q4	4.6	4.939	64.7%	64.8%	61.2%	75.4%	38.4%	60.9%
rwkv_7B_q8	8.0	4.395	67.5%	65.6%	61.6%	75.9%	40.2%	62.2%
RedPajama-INCITE-7B_q4	4.3	4.006	71.0%	69.7%	64.6%	76.3%	71.7%	70.7%
RedPajama-INCITE-7B_q8	7.5	3.910	71.4%	70.4%	64.3%	77.0%	71.9%	71.0%
falcon_40B_q4	24.6	2.844	77.6%	82.5%	76.2%	82.2%	78.8%	79.5%
falcon_40B_q8	45.0	2.799	77.9%	82.7%	76.7%	82.2%	80.4%	80.0%
falcon_7B	14.4	3.359	75.0%	76.2%	67.3%	79.4%	72.1%	74.0%
falcon_7B_q4	4.6	3.444	73.9%	75.8%	67.5%	79.7%	71.6%	73.7%
falcon_7B_q8	7.9	3.368	75.0%	76.2%	66.9%	79.5%	71.9%	73.9%
mpt_30B_q4	17.8	3.219	78.9%	79.4%	70.1%	79.8%	79.8%	77.6%
mpt_30B_q8	32.6	3.062	80.7%	79.8%	70.7%	80.0%	79.9%	78.2%
mpt_7B_q4	4.3	3.949	73.1%	75.7%	67.4%	79.0%	75.9%	74.2%
mpt_7B_q8	7.5	3.850	73.2%	76.2%	68.5%	79.1%	76.4%	74.7%
llama2_7B	13.5	3.428	74.5%	76.2%	69.7%	78.4%	77.2%	75.2%
llama2_7B_q4	4.0	3.487	73.5%	75.5%	69.9%	77.6%	77.8%	74.9%
llama2_13B	26.0	3.051	77.2%	79.6%	72.1%	78.9%	79.3%	77.4%
llama2_13B_q4	7.6	3.109	77.0%	79.0%	72.6%	79.5%	78.9%	77.4%
llama2_70B_q4	39.3	2.646	80.6%	84.0%	78.7%	82.0%	83.4%	81.7%
llama2_7B_q3	3.2	3.566	72.7%	74.1%	68.0%	77.6%	77.5%	74.0%
llama2_13B_q3	6.1	3.148	76.5%	77.9%	71.4%	78.4%	77.8%	76.4%
llama2_70B_q3	30.8	2.638	79.9%	82.9%	77.7%	81.7%	82.6%	80.9%
mistral_7B	14.5	3.178	76.2%	81.0%	74.2%	80.4%	80.9%	78.5%
mistral_7B_q4	4.3	3.412	74.9%	80.1%	73.9%	80.7%	80.3%	78.0%
mistral_7B_q8	7.8	3.174	76.0%	81.0%	73.6%	80.4%	80.7%	78.3%
mixtral_47B_q3	19.3	2.851	76.8%	82.2%	75.6%	81.3%	79.8%	79.1%
mixtral_47B_q4	26.5	2.811	78.6%	83.3%	76.0%	82.6%	80.4%	80.2%
mixtral_47B_q8	49.7	2.790	79.3%	83.9%	78.1%	82.0%	80.7%	80.8%
llama3_8B	16.1	3.107	76.8%	79.1%	73.1%	79.7%	80.7%	77.9%
llama3_8B_q4	5.5	3.291	75.2%	78.2%	73.5%	78.8%	80.4%	77.2%
llama3_70B	141.1	2.597	80.6%	84.9%	80.1%	82.3%	84.0%	82.4%
llama3_70B_q4	41.7	2.619	80.4%	84.4%	80.3%	82.1%	83.1%	82.1%
llama3.1_8B	16.1	3.150	76.6%	78.8%	73.9%	79.9%	80.8%	78.0%
llama3.1_70B	141.1	2.670	80.1%	84.9%	79.4%	83.0%	83.7%	82.2%
llama3.1_70B_q4	41.8	2.713	79.9%	84.4%	79.4%	82.6%	83.4%	81.9%
llama3.1_70B_q3	31.1	2.865	78.0%	83.0%	78.4%	82.0%	83.6%	81.0%
llama3.1_405B_q4	232.4	2.454	81.6%	87.0%	82.4%	83.8%	83.8%	83.7%
qwen2_7B	15.2	3.647	72.3%	78.3%	72.3%	79.9%	80.9%	76.8%
qwen2_7B_q4	5.3	3.712	72.0%	77.8%	71.3%	79.7%	81.7%	76.5%
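
The final column of the table above appears to be the arithmetic mean of the five preceding accuracy columns. This is an inference from the numbers, not something stated on this page; the sketch below checks it for a few rows copied from the table.

```python
# (model, five accuracy columns in %, reported last column in %), from the table above.
rows = [
    ("bloom_560M", [36.8, 35.8, 51.4, 63.7, 36.0], 44.7),
    ("llama2_7B", [74.5, 76.2, 69.7, 78.4, 77.2], 75.2),
    ("mistral_7B", [76.2, 81.0, 74.2, 80.4, 80.9], 78.5),
    ("llama3.1_405B_q4", [81.6, 87.0, 82.4, 83.8, 83.8], 83.7),
]


def mean(values):
    return sum(values) / len(values)


def matches_reported(scores, reported, tolerance=0.051):
    """True if the mean agrees with the reported value up to one-decimal rounding."""
    return abs(mean(scores) - reported) <= tolerance


results = {name: matches_reported(scores, reported) for name, scores, reported in rows}
print(results)
```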

Chat Models:

Model	Size (GB)	MMLU (5-shot)
llama3_8B_instruct	16.1	67.3%
llama3_8B_instruct_q4	5.5	65.7%
llama2_7B_chat_q4	3.9	45.3%
llama2_13B_chat_q4	7.6	51.2%
llama2_70B_chat_q4	39.3	61.1%
mistral_7B_instruct_q4	3.9	53.0%
mixtral_47B_instruct_q4	26.5	67.6%
llama3.1_8B_instruct	16.1	68.6%
llama3.1_8B_instruct_q4	5.6	67.1%
llama3.1_70B_instruct_q4	41.8	82.4%
phi3_mini_4k_instruct	7.6	70.1%
phi3_mini_4k_instruct_q4	2.3	67.8%
phi3.5_mini_instruct	7.7	67.7%
phi3.5_mini_instruct_q4	2.4	65.9%
qwen2_7B_instruct	15.2	70.3%
qwen2_7B_instruct_q4	5.3	68.7%
llama3.3_70B_instruct_q4	41.8	81.9%

Translation Models:

Model	Size (GB)	Description
m2m100_1_2B_q8	1.6	Translation between 100 languages
nllb200_1.3B_q8	2.0	Translation between 200 languages
nllb200_3.3B_q8	4.6	Translation between 200 languages
madlad400_7B_q4	5.7	Translation between 400 languages
madlad400_3B_q4	2.2	Translation between 400 languages

Embeddings Models:

Model	Size (GB)	Description
gte_qwen2_1.5B_instruct_q8	1.9	Qwen2 GTE embeddings
bge_large_en_v1.5_q8	0.4	BGE-Large EN v1.5 embeddings

Text-to-Image Models:

Model	Size (GB)	Description
sd_v1.4	2.1	Stable Diffusion text-to-image version 1.4
sd_v2.1	2.6	Stable Diffusion text-to-image version 2.1

Audio Models:

Model	Size (GB)	Description
whisper_large_v3_q8	1.8	Whisper large v3 speech-to-text transcription
parler_tts_large_v1_q8	1.1	Parler-TTS text-to-speech model
dac_mono	0.3	Descript Audio Codec (used with Parler-TTS)

SHA256 of all the models: sha256.txt.

Notes:

⁽¹⁾ Some models have restrictive licenses. In particular, OPT, Vicuna and NLLB200 cannot be used commercially. BLOOM, Stable Diffusion, Llama 2, Llama 3 and Llama 3.1 can be used commercially but have usage restrictions.
⁽²⁾ For the larger models, we don't provide the unquantized version when it is too large for consumer GPUs or when the quantized version gives the same performance as the unquantized version.
⁽³⁾ The q8 suffix indicates that the model was 8 bit quantized. The q4 suffix indicates that the model was 4 bit quantized. The q3 suffix indicates that the model was 3 bit quantized. Unquantized models use either float16 or bfloat16 parameters.
⁽⁴⁾ File size on disk (1 GB = 10⁹ bytes). The amount of CPU or GPU RAM needed to run the model is close to this value.
⁽⁵⁾ The lambada perplexity (ppl) values are comparable only between models using the same tokenizer, so the lambada accuracy (acc) should be used when comparing all models.
⁽⁶⁾ The speed is measured on an AMD Epyc 7313 CPU using 16 threads (ts_test -T 16).
⁽⁷⁾ MMLU was evaluated using 5 shots.
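
Since the memory needed to run a model is close to its file size, the sizes in the tables can be used for a rough fit check. The sketch below is only an estimate: the 2 GB headroom is a hypothetical allowance for runtime buffers, GPU memory is taken at its nominal size (24 GB for an RTX 3090 or RTX 4090, 48 GB for an RTX A6000), and the helper names are illustrative.

```python
# File sizes in GB (1 GB = 10^9 bytes), copied from the Language Models table.
size_gb = {
    "llama2_7B": 13.5,
    "llama2_7B_q4": 4.0,
    "llama2_13B_q4": 7.6,
    "llama2_70B_q4": 39.3,
}


def fits(model, memory_gb, headroom_gb=2.0):
    """Rough check: file size plus an assumed headroom must fit in memory."""
    return size_gb[model] + headroom_gb <= memory_gb


def compression_ratio(unquantized, quantized):
    """How many times smaller the quantized file is than the unquantized one."""
    return size_gb[unquantized] / size_gb[quantized]


print(fits("llama2_70B_q4", 24))   # nominal RTX 3090 / RTX 4090 memory
print(fits("llama2_70B_q4", 48))   # nominal RTX A6000 memory
print(compression_ratio("llama2_7B", "llama2_7B_q4"))
```

The result for llama2_70B_q4 on 24 GB is consistent with the missing RTX 4090 entry for that model in the text generation benchmark.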

Fabrice Bellard - https://bellard.org/