TextSynth Server is a web server providing a REST API to large language models. The models can be used, for example, for text completion, question answering, classification, chat, translation, image generation, audio transcription, speech synthesis, and more.
It has the following characteristics:
Command line tools (ts_test, ts_sd, ts_chat, ts_audiototext) are provided to test the various models.
The free version is available only for non-commercial use. Commercial organizations must buy the commercial version.
The TextSynth Server works only on x86 CPUs supporting AVX2 (all Intel
CPUs since 2013 support it). The installation was tested on Fedora and
CentOS/RockyLinux 8 distributions. Other distributions should work
provided the libmicrohttpd library is installed.
The ts_server program requires the libmicrohttpd library. If you use Fedora, RHEL, CentOS or RockyLinux, you can install it as root with:
dnf install libmicrohttpd
ts_test can be used without this library, but ts_server needs libmicrohttpd. Audio transcription requires the FFmpeg
executable in order to convert the input audio file.
tar xzf ts_server-##version##.tar.gz
cd ts_server-##version##
where ##version## is the version of the program.
Download the model gpt2_117M.bin from the ts_server web page.
./ts_test -m gpt2_117M.bin g "The Linux kernel is"
The -T option can be used to select the number of CPU cores to use (the default is the number of physical cores).
Then start the server:
./ts_server ts_server.cfg
You can edit the ts_server.cfg JSON configuration file if you want to use another model.
curl http://localhost:8080/v1/engines/gpt2_117M/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
The full request syntax is documented at https://textsynth.com/documentation.html.
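The same request can also be sent from a program; for example, a minimal Python sketch using the requests library, equivalent to the curl command above:

import requests

url = "http://localhost:8080/v1/engines/gpt2_117M/completions"
r = requests.post(url, json={"prompt": "The Linux kernel is", "max_tokens": 100})
print(r.json())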
If the GUI is enabled in ts_server.cfg, you can also open http://localhost:8080 in a browser.
Now you are ready to load a larger model and to use it from your application.
To use the server with a GPU, you need an Nvidia Ampere, Ada or Hopper GPU (e.g. RTX 3090, RTX 4090, RTX A6000, A100 or H100) with cuda 11.x or 12.x installed. Enough GPU memory must be available to load the model.
ts_server needs the cuBLASLt library, which is provided in the cuda toolkit.
You can check that the GPU is working with the ts_test utility:
./ts_test --cuda -m gpt2_117M.bin g "The Linux kernel is"
If you get an error such as:
Could not load: ./libnc_cuda.so
it means that cuda is not properly installed or that there is a
mismatch between the installed cuda version and the one
ts_server was compiled with. You can use:
ldd ./libnc_cuda.so
to check that all the required cuda libraries are present on your system.
Edit the ts_server.cfg configuration file to enable GPU support by uncommenting
cuda: true
and run the server:
./ts_server ts_server.cfg
curl http://localhost:8080/v1/engines/gpt2_117M/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
If the GUI is enabled in ts_server.cfg, you can also open http://localhost:8080 in a browser.
If needed, use the max_memory parameter in ts_server.cfg to limit the amount of GPU memory used by the server. It is usually necessary to set it a few gigabytes lower than the total available amount of GPU memory.
The TextSynth Server works only on x86 CPUs supporting AVX2 (all Intel CPUs since 2013 support it).
cd ts_server-##version##
where ##version## is the version of the program.
Download the model gpt2_117M.bin from the ts_server web page.
ts_test -m gpt2_117M.bin g "The Linux kernel is"
The -T option can be used to select the number of CPU cores to use (the default is the number of physical cores).
Then start the server:
ts_server ts_server.cfg
You can edit the ts_server.cfg JSON configuration file if you want to use another model.
curl http://localhost:8080/v1/engines/gpt2_117M/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
The full request syntax is documented at https://textsynth.com/documentation.html.
If the GUI is enabled in ts_server.cfg, you can also open http://localhost:8080 in a browser.
Now you are ready to load a larger model and to use it from your application.
To use the server with a GPU, you need an Nvidia Ampere, Ada or Hopper GPU (e.g. RTX 3090, RTX 4090, RTX A6000, A100 or H100) with cuda 11.x or 12.x installed. Enough GPU memory must be available to load the model.
ts_server needs the cuBLASLt library, which is provided in the cuda toolkit.
You can check that the GPU is working with the ts_test utility:
./ts_test --cuda -m gpt2_117M.bin g "The Linux kernel is"
If you get an error such as:
Could not load: libnc_cuda-12.dll (error=126)
it means that cuda is not properly installed.
Edit the ts_server.cfg configuration file to enable GPU support by uncommenting
cuda: true
and run the server:
./ts_server ts_server.cfg
curl http://localhost:8080/v1/engines/gpt2_117M/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
If the GUI is enabled in ts_server.cfg, you can also open http://localhost:8080 in a browser.
If needed, use the max_memory parameter in ts_server.cfg to limit the amount of GPU memory used by the server. It is usually necessary to set it a few gigabytes lower than the total available amount of GPU memory.
ts_test:
./ts_test --cuda -m gpt2_117M.bin g "Hello, my name is"
When using a CPU, remove the --cuda option.
./ts_test --cuda -m m2m100_1_2B_q8.bin translate en fr "The dispute \
focuses on the width of seats provided on long-haul flights for \
economy passengers."
assuming you downloaded the m2m100_1_2B_q8.bin model.
The perplexity over a text file can be used to evaluate models. The
text file is first tokenized, then cut into sequences of tokens. The
default sequence length is the maximum context length of the model;
use the -l option to change it. Then the log probabilities are
averaged over a range of context positions and displayed as
perplexity.
./ts_test --cuda -m mistral_7B.bin perplexity wiki.test.raw
ctx_len=8192, n_seq=40
START END PERPLEXITY
0 256 9.746
256 512 5.758
512 1024 5.072
1024 2048 4.984
2048 4096 4.934
4096 8192 4.689
0 8192 4.952
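Concretely, the reported perplexity is the exponential of the negative mean log probability of the tokens in the selected position range. A minimal Python sketch of that computation (the log probability values below are illustrative only):

import math

def perplexity(token_logprobs):
    # token_logprobs: log probabilities of each token given its preceding context
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-2.3, -1.1, -0.7]))  # illustrative values only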
The llama_perplexity command evaluates the perplexity using the
same algorithm as the perplexity utility in llama.cpp so
that comparisons can be made. The default context length is 512.
./ts_test --cuda -m mistral_7B.bin llama_perplexity wiki.test.raw
ctx_len=512, start=256, n_seq=642
#SEQ PERPLEXITY
 641 5.6946
ts_sd:
./ts_sd --cuda -m sd_v1.4.bin -o out.jpg "an astronaut riding a horse"
assuming you downloaded sd_v1.4.bin.
When using a CPU, remove the --cuda option.
ts_chat:
./ts_chat --cuda -m llama2_7B_chat_q4.bin
assuming you downloaded llama2_7B_chat_q4.bin.
When using a CPU, remove the --cuda option.
During the chat, some commands are available. Type /h during the
chat to get help. Type Ctrl-C once to stop the output and
twice to quit.
ts_audiototext:
./ts_audiototext --cuda -m whisper_large_v3_q8.bin -o out.json audiofile.mp3
assuming you downloaded whisper_large_v3_q8.bin and that
audiofile.mp3 is the audio file to be
transcribed. out.json contains the transcribed text.
When using a CPU, remove the --cuda option.
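As a quick check, the JSON output can be inspected with a few lines of Python. This is a sketch assuming out.json follows the transcription response format described in the remote API section below:

import json

with open("out.json") as f:
    result = json.load(f)

print(result["text"])            # full transcribed text
for seg in result.get("segments", []):
    print(seg["start"], seg["end"], seg["text"])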
TextSynth Server uses a specific file format to store the weights of the models. Python scripts are provided in scripts/ to convert model checkpoints to the TextSynth format. The tokenizer is now included in the model file. For backward compatibility, tokenizer files are provided in the tokenizer/ directory.
The script hf_model_convert.py should be used when converting from a Hugging Face model.
Example to convert Llama2 weights from Hugging Face to TextSynth:
python hf_model_convert.py --tokenizer llama_vocab.txt model_dir llama2.bin
where:
model_dir is the directory containing the
config.json and pytorch_model*.bin files.
llama_vocab.txt is a ts_server tokenizer file from the tokenizer/ directory. The tokenizer file can also be extracted from an existing model with:
./ncdump -o llama3_vocab.txt llama3_8B_q4.bin tokenizer
--chat_template can be used to provide the chat template. The chat template is used to generate the prompt from the conversation history in the chat remote API and in the ts_chat utility. The following chat templates are currently available:
rwkv: Using Bob: and Alice: prompts.
vicuna: Using USER: and ASSISTANT: prompts.
redpajama_incite: Using <human>: and <bot>: prompts.
llama2: Using [INST] and [/INST] prompts.
llama3: Using <|start_header_id|>user<|end_header_id|> and <|start_header_id|>assistant<|end_header_id|> prompts.
phi3
qwen2
With the ncconvert utility, it is possible to quantize the
model weights to 8, 4 or 3 bits. Quantization reduces the GPU memory
usage and increases the inference speed. 8 bit quantization yields a
negligible loss. 4 bit quantization yields a very small loss. 3 bit
quantization currently only works on a GPU.
Examples:
8 bit quantization:
./ncconvert -q bf8 pythia_deduped_160M.bin pythia_deduped_160M_q8.bin
4 bit quantization:
./ncconvert -q bf4 pythia_deduped_160M.bin pythia_deduped_160M_q4.bin
The file ts_server.cfg provides an example of configuration.
The syntax is similar to JSON with a few modifications; in particular, property names do not need to be quoted:
{ property: 1 }
cuda: Optional boolean (default = false). If true, CUDA (Nvidia GPU support) is enabled.
device_index: Optional integer (default = 0). Select the GPU device when using several GPUs. Use the nvidia-smi utility to list the available devices.
n_threads: Optional integer. When using a CPU, select the number of threads. It is set by default to the number of physical cores.
full_memory: Optional boolean (default = true). When using a GPU, ts_server reserves all the GPU memory by default for better efficiency. Setting this parameter to false disables this behavior so that GPU memory is allocated on demand.
max_memory: Optional integer (default = 0). If non-zero, limit the consumed GPU memory to this value by pausing the HTTP requests until there is enough memory. Since there is some overhead when handling the requests, it is better to set a value a few GB lower than the total amount of GPU memory.
kv_cache_max_count: Optional integer (default = 0). See the kv_cache_size parameter.
kv_cache_size: Optional integer (default = 0). The KV cache is used by the chat endpoint to store the context of the conversation to accelerate the inference. kv_cache_size sets the maximum KV cache memory in bytes. kv_cache_max_count sets the maximum number of cached entries. The cache is enabled if kv_cache_max_count or kv_cache_size is not zero. When the cache is enabled, a zero value for either parameter means infinite.
kv_cache_cpu: Optional boolean (default = true). Select whether the KV cache is kept in CPU memory (slower) or on the cuda device.
streaming_timeout: Optional integer (default = 100). When streaming completions, specifies the minimum time in ms between partial outputs.
models: Array of objects. Each element defines a model that is served. The following parameters are defined:
name: String. Name (ID) of the model in the HTTP requests.
filename: String. Filename of the model. You can use the conversion scripts to create one from Pytorch checkpoints if necessary.
draft_model: Optional string. Filename of a smaller model used to accelerate inference (speculative decoding). The draft model must use the same tokenizer as the large model.
sps_k_max: Optional integer. When using speculative decoding, specify the maximum number of tokens predicted by the draft model. The optimal value needs to be determined by experimentation; it is usually 3 or 4.
cpu_offload: Optional boolean (default = false). If true, the model is loaded in CPU memory and run on the GPU. This mode is interesting when the model does not fit in the GPU memory. Performance is still good for very large batch sizes.
n_ctx: Optional integer. If present, limit the maximum context length of the model.
Note: the free version only accepts one model definition.
local_port: Integer. TCP port on which the HTTP server listens.
bind_addr: Optional string (default = "0.0.0.0"). Set the IP address on which the server listens. Use "127.0.0.1" if you want to accept local connections only.
tls: Optional boolean (default = false). If true, HTTPS (TLS) connections are accepted instead of HTTP ones.
tls_cert_file: Optional string. If TLS is enabled, the certificate (PEM format) must be provided with this parameter.
tls_key_file: Optional string. If TLS is enabled, the private key of the certificate (PEM format) must be provided with this parameter.
log_start: Optional boolean (default = false). Print "Started." on the console when the server has loaded all the models and is ready to accept connections.
gui: Optional boolean (default = false). If true, enable a Graphical User Interface in addition to the remote API. It is available at the root URL, e.g. http://127.0.0.1:8080. The server just serves the files present in the gui/ directory. You can modify or add new files if needed.
log_filename: String. Set the filename where the logs are written. There is one line per connection.
from_proxy: Optional boolean (default = true). If true, use the X-Forwarded-For header if available to determine the source IP address in the logs. It is useful to have the real IP address of a client when a proxy is used.
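To illustrate, here is a minimal configuration sketch using the parameters described above. The model name and filename are placeholders and must be adapted to a model you actually downloaded; remove the cuda line when running on a CPU:

{
  cuda: true,
  local_port: 8080,
  bind_addr: "127.0.0.1",
  models: [
    { name: "gpt2_117M", filename: "gpt2_117M.bin" }
  ]
}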
The server provides the following endpoints:
v1/engines/{model_id}/completions: Text completion. Complete documentation at https://textsynth.com/documentation.html. See api_examples/completion.py for an example in Python.
v1/engines/{model_id}/chat: Chat-based completion. Complete documentation at https://textsynth.com/documentation.html.
v1/engines/{model_id}/translate: Translation. Complete documentation at https://textsynth.com/documentation.html. See api_examples/translate.py for an example in Python.
v1/engines/{model_id}/logprob: Log probability computation. Complete documentation at https://textsynth.com/documentation.html.
v1/engines/{model_id}/tokenize: Tokenization. Complete documentation at https://textsynth.com/documentation.html.
v1/engines/{model_id}/text_to_image: Text to image. Complete documentation at https://textsynth.com/documentation.html. See api_examples/sd.py for an example in Python.
v1/engines/{model_id}/transcript: Speech to text transcription. See api_examples/transcript.py for an example in Python.
The content type of the posted data should be
multipart/form-data and should contain two files with the
following names:
json: contains the JSON request.
file: contains the audio file to transcribe. FFmpeg is invoked by ts_server to convert the audio file to raw samples.
The JSON request contains the following properties:
language: String. The input ISO language code. The following languages are available: af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, yue, zh.
Additional parameters are available for testing or tuning:
num_beams: Optional integer, range: 2 to 5 (default = 5). Number of beams used for decoding.
condition_on_previous_text: Optional boolean (default = false). Condition the current frame on the previous text.
logprob_threshold: Optional float (default = -1.0).
no_speech_threshold: Optional float (default = 0.6). Probability threshold of the no_speech token for no speech detection. The average log-probability of the generated tokens must also be below logprob_threshold.
A JSON object is returned containing the transcription. It contains the following properties:
text: String. Transcribed text.
segments: Array of objects containing the transcribed text segments with timestamps. Each segment has the following properties:
  id: Integer. Segment ID.
  start: Float. Start time in seconds.
  end: Float. End time in seconds.
  text: String. Transcribed text for this segment.
language: String. ISO language code.
duration: Float. Transcription duration in seconds.
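For example, the endpoint can be called from Python with the requests library. This is a sketch: the model name whisper_large_v3_q8 and the audio filename are placeholders for your own setup:

import json
import requests

url = "http://localhost:8080/v1/engines/whisper_large_v3_q8/transcript"
with open("audiofile.mp3", "rb") as f:
    # the request is sent as multipart/form-data with "json" and "file" parts
    r = requests.post(url, files={
        "json": ("req.json", json.dumps({"language": "en"}), "application/json"),
        "file": ("audiofile.mp3", f),
    })
result = r.json()
print(result["text"], result["duration"])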
v1/engines/{model_id}/embeddings: Compute the embeddings of a text. The JSON request contains the following properties:
input: String or array of strings. Several input texts can be provided.
The returned JSON object contains the following properties:
object: String. value = "list".
data: Array of objects. Each object has the following properties:
  object: String. value = "embedding".
  index: Integer. Index in the array.
  embedding: Array of floats. The embedding vector computed for the corresponding input text.
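For example, a minimal Python sketch; the model name my_embedding_model is a placeholder for an embedding model declared in ts_server.cfg:

import requests

url = "http://localhost:8080/v1/engines/my_embedding_model/embeddings"
r = requests.post(url, json={"input": ["first text", "second text"]})
for item in r.json()["data"]:
    # print the index and dimension of each returned embedding vector
    print(item["index"], len(item["embedding"]))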
v1/engines/{model_id}/speech: Text to speech output. The output is an MP3 stream containing the generated speech. Complete documentation at https://textsynth.com/documentation.html. See api_examples/speech.py for an example in Python.
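A hypothetical Python sketch of calling this endpoint and saving the MP3 stream to a file; the exact request fields are given in the online documentation, and the "text" field and model name used here are assumptions:

import requests

url = "http://localhost:8080/v1/engines/my_tts_model/speech"
# "text" is an assumed request field; check the online documentation for the exact schema
r = requests.post(url, json={"text": "Hello from TextSynth Server."}, stream=True)
with open("out.mp3", "wb") as f:
    for chunk in r.iter_content(chunk_size=4096):
        f.write(chunk)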
v1/memory_stats: Return a JSON object with the memory usage statistics. The following properties are available:
cur_memory: Integer. Current used memory in bytes (CPU or GPU memory).
max_memory: Integer. Maximum used memory in bytes since the last call (CPU or GPU memory).
kv_cache_count: Integer. Number of entries in the KV cache.
kv_cache_size: Integer. CPU memory in bytes used by the KV cache.
v1/models: Return the list of available models and their capabilities. It is used by the GUI.
The WebSocket endpoints are experimental and may change in the next releases. The server provides the following WebSocket endpoints:
v1/realtime_transcript: Realtime audio transcription. An audio transcription model (such as Whisper) must be loaded in ts_server. See gui/voice_chat.js for the exact protocol. The binary WebSocket messages contain audio compressed with the Opus codec, using self-delimiting framing (see RFC 6716 Annex B).
v1/realtime_chat?model={model_id}: Realtime voice chat with optional voice synthesis. At least an audio transcription model (such as Whisper) and a chat model must be loaded in ts_server. For voice synthesis, Parler-TTS must be loaded too. See gui/voice_chat.js for the exact protocol. model_id is the chat model to use. The binary WebSocket messages contain audio compressed with the Opus codec, using self-delimiting framing (see RFC 6716 Annex B).