TextSynth Server is a web server providing a REST API to large language models. The models can be used, for example, for text completion, question answering, classification, chat, translation, image generation, ...
It has the following characteristics:
Command line tools (ts_test, ts_sd, ts_chat, ts_zip) are provided to test the various models.
The free version is available only for non-commercial use. Commercial organizations must buy the commercial version. The commercial version adds the following features:
The TextSynth Server works only on x86 CPUs supporting AVX2 (all Intel CPUs since 2013 support it). The installation was tested on Fedora and CentOS/RockyLinux 8 distributions. Other distributions should work provided the libjpeg and libmicrohttpd libraries are installed.
Install the libjpeg and libmicrohttpd libraries. If you use Fedora, RHEL, CentOS or RockyLinux, you can type as root:
dnf install libjpeg libmicrohttpd
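On Debian or Ubuntu systems the same libraries are available through apt; the package names below are what recent Ubuntu releases use and may differ on other releases (for instance Debian ships libjpeg62-turbo), so check your package index:
apt install libjpeg-turbo8 libmicrohttpd12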
ts_test can be used without these libraries. ts_sd needs libjpeg. ts_server needs libjpeg and libmicrohttpd.
tar xvf ts_server-##version##.tar.gz
cd ts_server-##version##
where ##version## is the version of the program.
Download the gpt2_117M.bin model from the ts_server web page, then test it with ts_test:
./ts_test -m gpt2_117M.bin g "The Linux kernel is"
You can use more CPU cores with the -T option:
./ts_test -T 4 -m gpt2_117M.bin g "The Linux kernel is"
The optimal number of cores depends on the system configuration.
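As a starting point you can pass the number of cores reported by nproc and adjust from there; this is only a heuristic, not a value recommended by the ts_test documentation:
./ts_test -T $(nproc) -m gpt2_117M.bin g "The Linux kernel is"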
Launch the server:
./ts_server ts_server.cfg
You can edit the ts_server.cfg JSON configuration file if you want to use another model. Then query the server, for example with curl:
curl http://localhost:8080/v1/engines/gpt2_117M/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
The full request syntax is documented at https://textsynth.com/documentation.html.
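The request body may carry additional sampling parameters; the hosted TextSynth API documents fields such as temperature, top_k and top_p, and the sketch below assumes the local server accepts the same fields (the exact list is in the documentation linked above):
curl http://localhost:8080/v1/engines/gpt2_117M/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The Linux kernel is", "max_tokens": 100, "temperature": 0.8, "top_k": 40, "top_p": 0.9}'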
A web GUI is available at http://localhost:8080. You can edit the gui/index.html page to update the model list when downloading new models.
Now you are ready to load a larger model and to use it from your application.
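Because the server speaks plain HTTP and JSON, any HTTP client can be used from an application. A minimal sketch, assuming the completion response carries the generated continuation in a text field (check the exact response format in the documentation):
curl -s http://localhost:8080/v1/engines/gpt2_117M/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The Linux kernel is", "max_tokens": 100}' \
| python3 -c 'import json,sys; print(json.load(sys.stdin)["text"])'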
You need an Nvidia Ampere, Ada or Hopper GPU (e.g. RTX 3090, RTX 4090, RTX A6000, A100 or H100) with CUDA 11.x or 12.x installed in order to use the server on a GPU. Enough GPU memory must be available to load the model.
If you use CUDA 12.x, create the symbolic link:
ln -sf libnc_cuda-12.so libnc_cuda.so
You can then check that GPU support works with the ts_test utility:
./ts_test --cuda -m gpt2_117M.bin g "The Linux kernel is"
If you get an error such as:
Could not load: ./libnc_cuda.so
it means that CUDA is not properly installed or that there is a mismatch between the installed CUDA version and the one ts_server was compiled with. You can use:
ldd ./libnc_cuda.so
to check that all the required cuda libraries are present on your system.
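If ldd reports missing CUDA libraries even though CUDA is installed, they may simply live in a directory the dynamic loader does not search. A common workaround is to add the CUDA library directory to LD_LIBRARY_PATH; the path below is the typical toolkit location and is an assumption, adjust it to your installation:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ldd ./libnc_cuda.so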
Then edit the ts_server.cfg configuration file to enable GPU support by uncommenting cuda: true, and run the server:
./ts_server ts_server.cfg
curl http://localhost:8080/v1/engines/gpt2_117M/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
When you change the models in ts_server.cfg, you may have to update gui/index.html to change the model names in the GUI.
You can set the memory parameter in ts_server.cfg to limit the amount of GPU memory used by the server. It is usually necessary to use a few gigabytes less than the maximum available amount of GPU memory.
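To pick a sensible value, you can check how much GPU memory is present and already in use with nvidia-smi, then set memory a few gigabytes below the total:
nvidia-smi --query-gpu=memory.total,memory.used --format=csv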
The TextSynth Server works only on x86 CPUs supporting AVX2 (all Intel CPUs since 2013 support it). Windows support is experimental.
cd ts_server-##version##
where ##version## is the version of the program.
Download the gpt2_117M.bin model from the ts_server web page, then test it with ts_test:
ts_test -m gpt2_117M.bin g "The Linux kernel is"
You can use more CPU cores with the -T option:
ts_test -T 4 -m gpt2_117M.bin g "The Linux kernel is"
The optimal number of cores depends on the system configuration.
Launch the server:
ts_server ts_server.cfg
You can edit the ts_server.cfg JSON configuration file if you want to use another model. Then query the server, for example with curl:
curl http://localhost:8080/v1/engines/gpt2_117M/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
The full request syntax is documented at https://textsynth.com/documentation.html.
A web GUI is available at http://localhost:8080. You can edit the gui/index.html page to update the model list when downloading new models.
Now you are ready to load a larger model and to use it from your application.
Text completion and translation (ts_test)
./ts_test --cuda -m gpt2_117M.bin g "Hello, my name is"
To compress and then decompress a short message:
./ts_test --cuda -m gpt2_117M.bin cs "Hello, how are you ?"
./ts_test --cuda ds "##msg##"
where ##msg## is the compressed message.
To translate a sentence from English to French:
./ts_test --cuda -m m2m100_1_2B_q8.bin translate en fr "The dispute \
focuses on the width of seats provided on long-haul flights for \
economy passengers."
This assumes you downloaded the m2m100_1_2B_q8.bin model.
When using a CPU, remove the --cuda option and use the -T option to specify the number of threads.
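For example, the translation command above becomes the following on a CPU-only machine (the thread count of 4 is arbitrary):
./ts_test -T 4 -m m2m100_1_2B_q8.bin translate en fr "The dispute \
focuses on the width of seats provided on long-haul flights for \
economy passengers."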
Image generation (ts_sd)
./ts_sd --cuda -m sd_v1.4.bin -o out.jpg "an astronaut riding a horse"
This assumes you downloaded the sd_v1.4.bin model.
When using a CPU, remove the --cuda option and use the -T option to specify the number of threads.
Chat (ts_chat)
./ts_chat --cuda -m rwkv_raven_v12_14B_q4.bin
This assumes you downloaded the rwkv_raven_v12_14B_q4.bin model.
When using a CPU, remove the --cuda option and use the -T option to specify the number of threads.
During the chat, some commands are available. Type /h during the chat to get help. Type Ctrl-C once to stop the output.
Text compression (ts_zip)
To compress a text file (here alice29.txt), assuming you downloaded the rwkv_169M.bin model, use:
./ts_zip --cuda -m rwkv_169M.bin c alice29.txt /tmp/out.bin
To decompress it:
./ts_zip --cuda -m rwkv_169M.bin d /tmp/out.bin /tmp/out.txt
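You can verify that the round trip is lossless by comparing the decompressed output with the original file:
cmp alice29.txt /tmp/out.txt && echo "files are identical"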
A checksum is included in the compressed file and it is automatically checked. It is essential to use the same software version, language model and GPU model when compressing and decompressing a file.
Large compression gains occur only if the input file is in a language that the language model has already seen.
The compression ratio, speed and memory usage depend on the language model but also on the selected context length (-l option) and batch size (-b option). They are both chosen automatically but can be overridden on the command line.
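For example, to force a specific context length and batch size when compressing (the values are arbitrary and the option placement mirrors the -m example above; check the ts_zip usage message for the exact syntax):
./ts_zip --cuda -m rwkv_169M.bin -l 4096 -b 8 c alice29.txt /tmp/out.bin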
More information is available at https://bellard.org/ts_server/ts_zip.html.
TextSynth Server uses a specific file format to store the weights of the models. Python scripts are provided in scripts/ to convert model checkpoints to the TextSynth format.
For example, to convert the Pythia PyTorch weights to TextSynth:
python gptneox_hf_convert.py config.json pytorch_model.bin \
pythia_deduped_160M.bin
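The config.json and pytorch_model.bin inputs are the files of a Hugging Face checkpoint. One way to obtain them, assuming the EleutherAI/pythia-160m-deduped repository is the intended source (an assumption, not something stated here), is:
git lfs install
git clone https://huggingface.co/EleutherAI/pythia-160m-deduped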
With the ncconvert utility, it is possible to quantize the model weights to 8 or 4 bits. Quantization reduces the GPU memory usage and increases the inference speed. 8 bit quantization yields a negligible loss. 4 bit quantization yields a very small loss. Examples:
8 bit quantization:
./ncconvert -q bf8 pythia_deduped_160M.bin pythia_deduped_160M_q8.bin
4 bit quantization:
./ncconvert -q bf4 pythia_deduped_160M.bin pythia_deduped_160M_q4.bin
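After quantization, you can sanity check the quantized file with ts_test like any other model in TextSynth format (the prompt is arbitrary):
./ts_test --cuda -m pythia_deduped_160M_q4.bin g "The Linux kernel is"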