TextSynth Server

Table of Contents

1 Introduction

TextSynth Server is a web server exposing a REST API to large language models. The models can be used, for example, for text completion, question answering, classification, chat, translation, image generation, ...

It has the following characteristics:

The free version is available only for non-commercial use. Commercial organizations must buy the commercial version, which adds the following features:

2 Quick Start

2.1 Linux

2.1.1 First steps

TextSynth Server works only on x86 CPUs supporting AVX2 (available on most Intel CPUs since 2013). The installation was tested on the Fedora and CentOS/RockyLinux 8 distributions. Other distributions should work provided the libjpeg and libmicrohttpd libraries are installed.

  1. Install the libjpeg and libmicrohttpd libraries. If you use Fedora, RHEL, CentOS or RockyLinux, you can type as root:
      dnf install libjpeg libmicrohttpd
    

    ts_test can be used without these libraries. ts_sd needs libjpeg. ts_server needs libjpeg and libmicrohttpd.

  2. Extract the archive and go into its directory:
      tar xzf ts_server-##version##.tar.gz
    
      cd ts_server-##version##
    

    where ##version## is the version of the program.

  3. Download a small example model such as gpt2_117M.bin from the ts_server web page.
  4. Use it to generate text with the "ts_test" utility:
      ./ts_test -m gpt2_117M.bin g "The Linux kernel is"
    

    You can use more CPU cores with the -T option:

      ./ts_test -T 4 -m gpt2_117M.bin g "The Linux kernel is"
    

    The optimal number of cores depends on the system configuration.

  5. Start the server:
      ./ts_server ts_server.cfg
    

    You can edit the ts_server.cfg JSON configuration file if you want to use another model.

  6. Try one request:
      curl http://localhost:8080/v1/engines/gpt2_117M/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
    

    The full request syntax is documented at https://textsynth.com/documentation.html.

  7. You can use the integrated GUI by opening the following address in your browser:
    http://localhost:8080
    

    You can edit the gui/index.html page to update the model list when downloading new models.

Now you are ready to load a larger model and to use it from your application.
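
From your application, you call the same REST API shown above with curl. As a minimal sketch, assuming the jq utility is installed and that the completion response is a JSON object containing a "text" field (see the API documentation linked above), the generated text can be extracted directly from the shell:

  # sketch: query the server and print only the generated text;
  # assumes jq is installed and that the response contains a "text" field
  curl -s http://localhost:8080/v1/engines/gpt2_117M/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt": "The Linux kernel is", "max_tokens": 100}' | jq -r .text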

2.1.2 GPU usage

To use the server on a GPU, you need an Nvidia Ampere, Ada or Hopper GPU (e.g. RTX 3090, RTX 4090, RTX A6000, A100 or H100) with CUDA 11.x or 12.x installed. Enough GPU memory must be available to load the model.

  1. First ensure that everything works on the CPU (see First steps).
  2. Ensure that you have a compatible CUDA installation (11.x or 12.x). The software is preconfigured for CUDA 11.x. If you want to use CUDA 12.x, change the link to the libnc_cuda.so library:
      ln -sf libnc_cuda-12.so libnc_cuda.so
    
  3. Then try to use the GPU with the ts_test utility:
      ./ts_test --cuda -m gpt2_117M.bin g "The Linux kernel is"
    

    If you get an error such as:

      Could not load: ./libnc_cuda.so
    

    it means that CUDA is not properly installed or that there is a mismatch between the installed CUDA version and the one ts_server was compiled with. You can use:

      ldd ./libnc_cuda.so
    

    to check that all the required cuda libraries are present on your system.

  4. Then edit the ts_server.cfg configuration to enable GPU support by uncommenting
      cuda: true
    

    and run the server:

      ./ts_server ts_server.cfg
    
  5. Assuming you have curl installed, try one request:
      curl http://localhost:8080/v1/engines/gpt2_117M/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
    

    When you change the models in ts_server.cfg, you may have to update gui/index.html so that the GUI shows the new model names.

  6. Depending on the amount of memory available on your GPU, you can set the memory parameter in ts_server.cfg to limit the amount of memory used by the server. It is usually necessary to set it a few gigabytes below the maximum available amount of GPU memory.
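
    To choose a sensible value for this limit, you can check how much GPU memory is installed and already in use with the nvidia-smi tool that ships with the Nvidia driver, then set the limit a few gigabytes below the total:

      # show total and currently used GPU memory
      nvidia-smi --query-gpu=memory.total,memory.used --format=csv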

2.2 Windows

TextSynth Server works only on x86 CPUs supporting AVX2 (available on most Intel CPUs since 2013). Windows support is experimental.

  1. Extract the ZIP archive, open a command shell and go into the extracted directory:
      cd ts_server-##version##
    

    where ##version## is the version of the program.

  2. Download a small example model such as gpt2_117M.bin from the ts_server web page.
  3. Use it to generate text with the "ts_test" utility:
      ts_test -m gpt2_117M.bin g "The Linux kernel is"
    

    You can use more CPU cores with the -T option:

      ts_test -T 4 -m gpt2_117M.bin g "The Linux kernel is"
    

    The optimal number of cores depends on the system configuration.

  4. Start the server:
      ts_server ts_server.cfg
    

    You can edit the ts_server.cfg JSON configuration file if you want to use another model.

  5. Assuming you installed curl (you can download it from https://curl.se/windows/), try one request:
      curl http://localhost:8080/v1/engines/gpt2_117M/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "The Linux kernel is", "max_tokens": 100}'
    

    The full request syntax is documented at https://textsynth.com/documentation.html. If you are typing the request in the classic cmd.exe command prompt, see the quoting note after this list.

  6. You can use the integrated GUI by opening the following address in your browser:
    http://localhost:8080
    

    You can edit the gui/index.html page to update the model list when downloading new models.
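
Note on step 5: the single quotes and the backslash line continuations in the curl command are Unix shell syntax. When running the request from the classic cmd.exe command prompt, type it on a single line and escape the inner double quotes instead, for example:

  rem same request as in step 5, quoted for cmd.exe
  curl http://localhost:8080/v1/engines/gpt2_117M/completions -H "Content-Type: application/json" -d "{\"prompt\": \"The Linux kernel is\", \"max_tokens\": 100}"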

Now you are ready to load a larger model and to use it from your application.

3 Utilities

3.1 Text processing (ts_test)
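
A typical invocation generates a completion from a prompt. For example, with the gpt2_117M.bin model from the quick start (any model downloaded from the ts_server web page can be used instead):

# generate text from a prompt on the GPU ('g' is the generation command)
./ts_test --cuda -m gpt2_117M.bin g "The Linux kernel is"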

When using a CPU, remove the --cuda option and use the -T option to specify the number of threads.

3.2 Text to image (ts_sd)

./ts_sd --cuda -m sd_v1.4.bin -o out.jpg "an astronaut riding a horse"

assuming you downloaded sd_v1.4.bin.

When using a CPU, remove the --cuda option and use the -T option to specify the number of threads.
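
For example, a CPU-only run using 8 threads (adjust the thread count to your machine; image generation on the CPU is much slower than on a GPU):

# same image generation on the CPU with 8 threads
./ts_sd -T 8 -m sd_v1.4.bin -o out.jpg "an astronaut riding a horse"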

3.3 Chat (ts_chat)

./ts_chat --cuda -m rwkv_raven_v12_14B_q4.bin

assuming you downloaded rwkv_raven_v12_14B_q4.bin.

When using a CPU, remove the --cuda option and use the -T option to specify the number of threads.

During the chat, some commands are available; type /h to get help. Type Ctrl-C once to stop the output.

3.4 Text compression (ts_zip)

To compress a text file (here alice29.txt), assuming you downloaded the rwkv_169M.bin model, use:

./ts_zip --cuda -m rwkv_169M.bin c alice29.txt /tmp/out.bin

To decompress it:

./ts_zip --cuda -m rwkv_169M.bin d /tmp/out.bin /tmp/out.txt

A checksum is included in the compressed file and it is automatically checked. It is essential to use the same software version, language model and GPU model when compressing and decompressing a file.

Large compression gains occur only if the input file is in a language that the language model has already seen.

The compression ratio, speed and memory usage depend on the language model but also on the selected context length (-l option) and batch size (-b option). They are both chosen automatically but can be overridden:
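
For example (the values below are only placeholders for illustration; useful values depend on the model and on the available memory):

# override the context length (-l) and batch size (-b); 8192 and 16 are only example values
./ts_zip --cuda -m rwkv_169M.bin -l 8192 -b 16 c alice29.txt /tmp/out.bin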

More information is available at https://bellard.org/ts_server/ts_zip.html.

3.5 Model Weight Conversion

TextSynth Server uses a specific file format to store the weights of the models. Python scripts are provided in scripts/ to convert model checkpoints to the TextSynth format.

Example: converting the Pythia PyTorch weights to the TextSynth format:

python gptneox_hf_convert.py config.json pytorch_model.bin \
       pythia_deduped_160M.bin

3.6 Model Weight Quantization

With the ncconvert utility, it is possible to quantize the model weights to 8 or 4 bits. Quantization reduces GPU memory usage and increases inference speed. 8-bit quantization yields a negligible accuracy loss; 4-bit quantization yields a very small loss.

Examples:

8-bit quantization:

./ncconvert -q bf8 pythia_deduped_160M.bin pythia_deduped_160M_q8.bin

4-bit quantization:

./ncconvert -q bf4 pythia_deduped_160M.bin pythia_deduped_160M_q4.bin
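
The quantized file is then used like any other model file, for example with ts_test (a sketch, reusing the quantized Pythia model produced above):

# run a completion with the 4-bit quantized model
./ts_test --cuda -m pythia_deduped_160M_q4.bin g "The Linux kernel is"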