TextSynth Server

News

Introduction

ts_server is a web server exposing a REST API to large language models. They can be used, for example, for text completion, question answering, classification, chat, translation and image generation.
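As an illustration of what a REST call to such a server might look like, here is a minimal Python sketch that builds a text completion request without sending it. The endpoint path `/v1/engines/<model>/completions`, the port 8080 and the JSON field names are assumptions modeled on the public TextSynth API and are not confirmed by this page; the server documentation has the exact interface.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=100):
    """Build (but do not send) a POST request asking for a text completion."""
    # Assumption: the completion endpoint follows this path layout.
    url = f"{base_url}/v1/engines/{model}/completions"
    # Assumption: the request body is JSON with "prompt" and "max_tokens".
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local server address and model name.
req = build_completion_request(
    "http://localhost:8080", "gptj_6B_q4", "The Linux kernel is"
)
print(req.get_method(), req.full_url)

# To send it to a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before pointing the code at a real server.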

It has the following characteristics:

The free version is released as binary code for non-commercial use only. It has some limitations compared to the commercial version. Please contact the author for the exact terms.

Download

Documentation

Benchmarks

Text generation

100 tokens are generated with a batch size of 1 and 50 input tokens:

Model (3)         Epyc 7313 (6)   RTX A6000     RTX 4090
                  (tokens/s)      (tokens/s)    (tokens/s)
gptj_6B_q4        21.5            132           164
flan_t5_xxl_q4    25              130           158
llama2_7B_q4      23              115           144
llama2_13B_q4     12.0            69.3          88
gptneox_20B_q4    8.1             45.5          59
llama2_70B_q4     2.5             15.2          -
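To make these figures concrete, the following Python sketch converts a few of the single-stream throughputs into the approximate time needed to generate 100 tokens, and into a speedup over the CPU. It is a rough reading of the numbers only: prompt processing time is ignored, and the dictionary layout and function names are illustrative.

```python
# Single-stream throughput in tokens/s, copied from the benchmark figures.
TOKENS_PER_S = {
    "gptj_6B_q4": {"Epyc 7313": 21.5, "RTX A6000": 132, "RTX 4090": 164},
    "llama2_7B_q4": {"Epyc 7313": 23, "RTX A6000": 115, "RTX 4090": 144},
    "llama2_13B_q4": {"Epyc 7313": 12.0, "RTX A6000": 69.3, "RTX 4090": 88},
}


def generation_time(tokens_per_s, n_tokens=100):
    """Approximate seconds needed to generate n_tokens at a given rate."""
    return n_tokens / tokens_per_s


def speedup(model, device, baseline="Epyc 7313"):
    """Throughput of `device` relative to `baseline` for one model."""
    row = TOKENS_PER_S[model]
    return row[device] / row[baseline]


for model, row in TOKENS_PER_S.items():
    for device, tps in row.items():
        print(
            f"{model:15s} {device:10s} "
            f"{generation_time(tps):5.2f} s  x{speedup(model, device):.1f}"
        )
```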

8 simultaneous requests generating 100 tokens with 50 input tokens (equivalent to a batch size of 8):

Model (3)         RTX A6000
                  (tokens/s)
llama2_7B_q4      783
llama2_13B_q4     492
llama2_70B_q4     118
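The gain from batching can be estimated by comparing these figures with the single-request RTX A6000 figures from the previous benchmark. The sketch below assumes that the batched numbers are the aggregate throughput over all 8 requests, which the page does not state explicitly; the names are illustrative.

```python
# RTX A6000 throughput in tokens/s, copied from the two benchmarks.
SINGLE = {"llama2_7B_q4": 115, "llama2_13B_q4": 69.3, "llama2_70B_q4": 15.2}
BATCH8 = {"llama2_7B_q4": 783, "llama2_13B_q4": 492, "llama2_70B_q4": 118}


def batching_gain(model, batch_size=8):
    """Return (aggregate gain over batch size 1, tokens/s seen per request).

    Assumption: the batched figure is the total over all requests.
    """
    total_gain = BATCH8[model] / SINGLE[model]
    per_request = BATCH8[model] / batch_size
    return total_gain, per_request


for model in SINGLE:
    gain, per_request = batching_gain(model)
    print(f"{model:15s} x{gain:.2f} total, {per_request:.1f} tokens/s per request")
```

Under that assumption, each request runs somewhat slower than it would alone, while the total throughput is several times higher.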

Text to image

A single 512x512 image is generated using 50 time steps.

Model (3)               RTX A6000    RTX 4090
                        (seconds)    (seconds)
stable diffusion 1.4    1.82         1.21
stable diffusion 2.1    1.67         1.19
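As a rough reading of these timings, the sketch below derives the time per step and the number of images per minute. It attributes the whole time to the 50 steps, which is a simplification (text encoding and image decoding are also included in the total); the names are illustrative.

```python
# Seconds per 512x512 image at 50 steps, copied from the benchmark figures.
SECONDS_PER_IMAGE = {
    "stable diffusion 1.4": {"RTX A6000": 1.82, "RTX 4090": 1.21},
    "stable diffusion 2.1": {"RTX A6000": 1.67, "RTX 4090": 1.19},
}


def ms_per_step(seconds, steps=50):
    """Average milliseconds per step, attributing all time to the steps."""
    return 1000.0 * seconds / steps


def images_per_minute(seconds):
    """Number of images generated per minute when run back to back."""
    return 60.0 / seconds


for model, row in SECONDS_PER_IMAGE.items():
    for device, seconds in row.items():
        print(
            f"{model} on {device}: {ms_per_step(seconds):.1f} ms/step, "
            f"{images_per_minute(seconds):.1f} images/min"
        )
```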

Available Models

The model files below can be used with the TextSynth Server. Each model was evaluated with the lm-evaluation-harness running against the TextSynth Server on an RTX A6000 GPU.

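Language Models:
Columns: Model, Size in GB (4), lambada (ppl) (5), lambada (acc), hellaswag (acc_norm), winogrande (acc), piqa (acc), coqa (f1), average.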
bloom_560M 1.1 29.176 36.8% 35.8% 51.4% 63.7% 36.0% 44.7%
codegen_6B_mono_q4 4.4 69.409 28.0% 35.7% 51.1% 60.2% 38.0% 42.6%
codegen_6B_mono_q8 7.7 67.262 28.1% 35.8% 50.8% 60.1% 39.1% 42.8%
fairseq_gpt_13B 26.2 3.567 71.9% 72.7% 67.5% 77.6% 70.1% 71.9%
fairseq_gpt_13B_q4 7.9 3.646 71.2% 72.5% 67.6% 77.4% 70.6% 71.9%
fairseq_gpt_13B_q8 14.2 3.565 71.8% 72.7% 67.2% 77.7% 70.0% 71.9%
flan_t5_base 0.5 12.891 54.2% 36.5% 54.7% 65.8% 62.1% 54.7%
flan_t5_base_q8 0.3 13.098 54.2% 36.4% 54.2% 65.7% 61.8% 54.5%
flan_t5_small 0.2 23.343 46.7% 29.2% 50.0% 62.4% 47.9% 47.2%
flan_t5_small_q8 0.1 23.449 46.7% 29.2% 49.7% 62.4% 48.2% 47.2%
flan_t5_xxl_q4 6.5 3.010 77.7% 71.5% 73.4% 77.6% 71.8% 74.4%
flan_t5_xxl_q8 12.0 3.049 77.8% 72.1% 75.1% 77.8% 73.1% 75.2%
flan_ul2_20B_q4 11.3 - 74.1% 24.3% 51.1% 49.9% 78.8% 55.6%
flan_ul2_20B_q8 20.9 - 74.4% 24.4% 52.0% 50.6% 77.3% 55.7%
gpt2_117M 0.3 40.110 32.9% 31.1% 52.1% 62.9% 27.3% 41.3%
gpt2_345M 0.7 18.272 43.5% 39.4% 53.3% 67.7% 43.1% 49.4%
gpt2_345M_q8 0.5 18.452 43.1% 39.4% 53.1% 67.5% 41.9% 49.0%
gpt2_774M 1.6 12.966 47.8% 45.4% 55.6% 70.4% 48.5% 53.5%
gpt2_774M_q8 1.0 12.928 47.9% 45.4% 55.3% 70.3% 48.2% 53.4%
gpt2_1558M 3.1 10.637 51.3% 50.8% 58.4% 70.8% 53.2% 56.9%
gpt2_1558M_q8 1.8 10.655 51.2% 50.8% 58.6% 70.8% 53.2% 56.9%
gptj_6B 12.1 4.124 69.0% 66.2% 64.8% 75.5% 66.9% 68.5%
gptj_6B_q4 3.8 4.153 68.9% 65.7% 63.9% 74.4% 67.0% 68.0%
gptj_6B_q8 6.6 4.122 69.1% 66.2% 64.4% 75.4% 66.4% 68.3%
gptneox_20B 41.1 3.657 72.6% 71.4% 65.5% 77.5% 73.3% 72.0%
gptneox_20B_q4 12.2 3.711 72.0% 69.3% 64.8% 76.7% 70.8% 70.7%
gptneox_20B_q8 22.1 3.659 72.6% 71.3% 65.8% 77.3% 72.9% 72.0%
llama_7B 13.5 3.463 73.6% 76.2% 70.4% 78.1% 75.4% 74.7%
llama_7B_q4 4.0 3.549 73.2% 75.5% 70.4% 78.0% 74.7% 74.4%
llama_7B_q8 7.3 3.453 73.7% 76.1% 70.2% 78.0% 75.5% 74.7%
llama_13B_q4 7.6 3.130 77.1% 78.6% 72.2% 78.3% 77.8% 76.8%
llama_13B_q8 14.0 3.178 76.5% 79.1% 73.2% 79.1% 77.1% 77.0%
llama_30B_q4 18.7 2.877 77.5% 82.4% 75.7% 80.2% 80.2% 79.2%
llama_30B_q8 34.8 2.853 77.7% 82.7% 76.3% 80.3% 80.4% 79.5%
llama_65B_q4 37.2 2.760 78.5% 83.9% 76.6% 81.4% 83.2% 80.7%
opt_125M 0.3 26.028 37.9% 31.3% 50.2% 63.2% 23.4% 41.2%
opt_30B_q4 17.8 3.656 71.5% 72.1% 68.0% 77.4% 69.9% 71.8%
opt_30B_q8 32.6 3.628 71.6% 72.3% 68.2% 77.7% 71.4% 72.3%
opt_66B_q4 38.2 3.308 73.4% 74.4% 68.4% 78.5% 75.0% 73.9%
pythia_deduped_70M 0.1 96.126 25.6% 28.3% 54.4% 60.4% 13.1% 36.3%
pythia_deduped_160M 0.3 26.380 36.9% 32.3% 51.4% 63.8% 23.2% 41.5%
pythia_deduped_410M 0.8 10.827 51.7% 40.8% 54.0% 67.2% 43.0% 51.4%
pythia_deduped_410M_q8 0.5 10.729 51.8% 40.7% 53.8% 67.1% 42.7% 51.2%
pythia_deduped_1B 2.0 7.273 58.5% 49.0% 54.5% 71.0% 49.9% 56.6%
pythia_deduped_1B_q8 1.2 7.286 58.4% 49.0% 54.9% 70.9% 49.0% 56.5%
pythia_deduped_1.4B 2.8 6.546 63.1% 52.2% 57.1% 72.7% 52.6% 59.5%
pythia_deduped_1.4B_q8 1.6 6.577 63.3% 52.1% 55.7% 73.1% 53.0% 59.4%
pythia_deduped_2.8B 5.6 4.787 67.1% 61.6% 60.9% 74.4% 65.5% 65.9%
pythia_deduped_2.8B_q8 3.1 4.778 66.9% 61.5% 61.2% 74.5% 65.6% 66.0%
pythia_deduped_6.9B 13.7 4.195 69.1% 65.7% 63.9% 75.1% 66.1% 68.0%
pythia_deduped_6.9B_q4 4.3 4.344 68.3% 65.0% 62.5% 75.3% 66.3% 67.5%
pythia_deduped_6.9B_q8 7.5 4.187 69.4% 65.7% 63.6% 75.5% 66.8% 68.2%
pythia_deduped_12B 23.7 3.854 70.9% 69.2% 63.9% 76.3% 70.8% 70.2%
pythia_deduped_12B_q4 7.2 4.187 69.2% 68.5% 63.1% 76.4% 69.6% 69.4%
pythia_deduped_12B_q8 12.8 3.857 70.9% 69.2% 64.2% 76.1% 70.9% 70.3%
rwkv_14B 28.3 3.819 71.6% 70.2% 63.1% 77.5% 47.2% 65.9%
rwkv_14B_q4 8.5 4.076 68.3% 69.8% 63.1% 77.1% 45.0% 64.7%
rwkv_14B_q8 15.3 3.806 71.9% 70.2% 63.0% 77.5% 47.1% 65.9%
rwkv_7B 16 4.396 67.5% 65.6% 61.9% 75.6% 39.7% 62.1%
rwkv_7B_q4 4.6 4.939 64.7% 64.8% 61.2% 75.4% 38.4% 60.9%
rwkv_7B_q8 8.0 4.395 67.5% 65.6% 61.6% 75.9% 40.2% 62.2%
RedPajama-INCITE-7B_q4 4.3 4.006 71.0% 69.7% 64.6% 76.3% 71.7% 70.7%
RedPajama-INCITE-7B_q8 7.5 3.910 71.4% 70.4% 64.3% 77.0% 71.9% 71.0%
falcon_40B_q4 24.6 2.844 77.6% 82.5% 76.2% 82.2% 78.8% 79.5%
falcon_40B_q8 45.0 2.799 77.9% 82.7% 76.7% 82.2% 80.4% 80.0%
falcon_7B 14.4 3.359 75.0% 76.2% 67.3% 79.4% 72.1% 74.0%
falcon_7B_q4 4.6 3.444 73.9% 75.8% 67.5% 79.7% 71.6% 73.7%
falcon_7B_q8 7.9 3.368 75.0% 76.2% 66.9% 79.5% 71.9% 73.9%
mpt_30B_q4 17.8 3.219 78.9% 79.4% 70.1% 79.8% 79.8% 77.6%
mpt_30B_q8 32.6 3.062 80.7% 79.8% 70.7% 80.0% 79.9% 78.2%
mpt_7B_q4 4.3 3.949 73.1% 75.7% 67.4% 79.0% 75.9% 74.2%
mpt_7B_q8 7.5 3.850 73.2% 76.2% 68.5% 79.1% 76.4% 74.7%
llama2_7B 13.5 3.428 74.5% 76.2% 69.7% 78.4% 77.2% 75.2%
llama2_7B_q4 4.0 3.535 73.5% 75.4% 69.5% 77.6% 74.5% 74.1%
llama2_13B 26.0 3.051 77.2% 79.6% 72.1% 78.9% 79.3% 77.4%
llama2_13B_q4 7.6 3.134 76.9% 79.3% 73.2% 78.8% 79.1% 77.5%
llama2_70B_q4 39.3 2.646 80.6% 84.0% 78.7% 82.0% 83.4% 81.7%
llama2_7B_q3 3.2 3.632 72.7% 74.1% 68.4% 77.1% 73.2% 73.1%
llama2_13B_q3 6.1 3.173 76.2% 78.0% 72.1% 78.2% 78.0% 76.5%
llama2_70B_q3 30.8 2.638 79.9% 82.9% 77.7% 81.7% 82.6% 80.9%
mistral_7B 14.5 3.178 76.2% 81.0% 74.2% 80.4% 80.9% 78.5%
mistral_7B_q4 4.3 3.412 74.9% 80.1% 73.9% 80.7% 80.3% 78.0%
mistral_7B_q8 7.8 3.174 76.0% 81.0% 73.6% 80.4% 80.7% 78.3%
mixtral_47B_q3 19.3 2.851 76.8% 82.2% 75.6% 81.3% 79.8% 79.1%
mixtral_47B_q4 26.5 2.811 78.6% 83.3% 76.0% 82.6% 80.4% 80.2%
mixtral_47B_q8 49.7 2.790 79.3% 83.9% 78.1% 82.0% 80.7% 80.8%
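The last percentage in each row appears to be the plain mean of the five preceding scores. A quick Python check on three rows copied from the table (the tolerance allows for the rounding of the published values):

```python
# (five benchmark scores in %, listed final column in %), copied from the table.
ROWS = {
    "bloom_560M": ([36.8, 35.8, 51.4, 63.7, 36.0], 44.7),
    "llama2_7B": ([74.5, 76.2, 69.7, 78.4, 77.2], 75.2),
    "mistral_7B": ([76.2, 81.0, 74.2, 80.4, 80.9], 78.5),
}


def mean_score(scores):
    """Unweighted mean of a list of scores."""
    return sum(scores) / len(scores)


for name, (scores, listed) in ROWS.items():
    computed = mean_score(scores)
    print(f"{name:12s} computed {computed:.2f}%  listed {listed:.1f}%")
```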

Chat Models:
Model                          Size (GB) (4)   Description
llama2_7B_chat_q4              3.9             Llama 2 7B chat model
llama2_13B_chat_q4             7.6             Llama 2 13B chat model
llama2_70B_chat_q4             39.3            Llama 2 70B chat model
mistral_7B_instruct_q4         3.9             Mistral 7B chat model
vicuna_13B_v1.1_q4             7.6             Vicuna chat model
rwkv_raven_v12_14B_q4          8.5             RWKV Raven version 12
RedPajama-INCITE-7B-Chat_q4    4.3             RedPajama INCITE Chat model

Additional Models:
Model                Size (GB) (4)   Description
m2m100_1_2B_q8       1.6             Translation between 100 languages
nllb200_1.3B_q8      2.0             Translation between 200 languages
nllb200_3.3B_q8      4.6             Translation between 200 languages
madlad400_7B_q4      5.7             Translation between 400 languages
sd_v1.4              2.1             Stable Diffusion text-to-image version 1.4
sd_v2.1              2.6             Stable Diffusion text-to-image version 2.1

SHA256 of all the models: sha256.txt.
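A downloaded model file can be checked against the published checksums with a few lines of Python. This is a minimal sketch that assumes sha256.txt uses the usual `sha256sum` layout (hex digest, whitespace, file name); the actual format of the file is not described here, and the function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large model files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse 'HEX  filename' lines (assumed sha256sum format) into a dict."""
    table = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sha256sum marks binary mode with a leading '*' on the name.
            table[parts[1].strip().lstrip("*")] = parts[0].lower()
    return table


def verify(path, name, checksums):
    """True if the file at `path` matches the digest listed for `name`."""
    expected = checksums.get(name)
    return expected is not None and expected == sha256_of_file(path)
```

Typical use would be `verify("gptj_6B_q4.bin", "gptj_6B_q4.bin", parse_checksums(open("sha256.txt").read()))`, where the model file name is hypothetical.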

Notes:

  1. Some models have restrictive licenses. In particular, OPT, Vicuna and NLLB200 cannot be used commercially. BLOOM, Stable Diffusion and Llama 2 can be used commercially but have use limitations.
  2. For the larger models we don't provide the unquantized version when it is too large for consumer GPUs or when the quantized version gives the same performance as the unquantized version.
  3. The q8 suffix indicates that the model was 8 bit quantized. The q4 suffix indicates that the model was 4 bit quantized. The q3 suffix indicates that the model was 3 bit quantized. Unquantized models use either float16 or bfloat16 parameters.
  4. File size on disk (1 GB = 10^9 bytes). The amount of CPU or GPU RAM needed to run the model is close to this value.
  5. The lambada perplexity (ppl) is comparable only between models using the same tokenizer, so the lambada accuracy (acc) should be used when comparing all models.
  6. The speed is measured on an AMD Epyc 7313 CPU using 16 threads (ts_test -T 16).
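Notes 3 and 4 can be combined into two quick estimates: the effective number of bits per parameter of a quantized model, and whether a model is likely to fit in a given GPU. The Python sketch below uses file sizes from the tables; the GPU memory sizes (48 GB for the RTX A6000, 24 GB for the RTX 4090) are nominal figures added here as an assumption, and the fit test ignores the extra memory needed at run time.

```python
# File sizes in GB, copied from the language model table.
SIZES_GB = {
    "llama2_7B": 13.5,
    "llama2_7B_q4": 4.0,
    "llama2_7B_q3": 3.2,
    "llama2_70B_q4": 39.3,
}

# Assumption: nominal GPU memory sizes, not taken from this page.
GPU_MEMORY_GB = {"RTX A6000": 48, "RTX 4090": 24}


def effective_bits(quantized_gb, unquantized_gb, unquantized_bits=16):
    """Average bits per parameter, relative to a 16-bit unquantized file."""
    return unquantized_bits * quantized_gb / unquantized_gb


def fits(model, gpu):
    """Rough test: the file size must not exceed the GPU memory."""
    return SIZES_GB[model] <= GPU_MEMORY_GB[gpu]


print(f"q4: {effective_bits(4.0, 13.5):.2f} bits per parameter")
print(f"q3: {effective_bits(3.2, 13.5):.2f} bits per parameter")
print("llama2_70B_q4 on RTX 4090:", fits("llama2_70B_q4", "RTX 4090"))
```

The effective figures come out above the nominal 4 and 3 bits, presumably because quantized files also store scaling data and may keep some tensors at higher precision. The fit test is consistent with the missing RTX 4090 entry for llama2_70B_q4 in the text generation benchmark.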


Fabrice Bellard - https://bellard.org/