ts_zip: Text Compression using Large Language Models
The ts_zip utility can compress (and hopefully decompress) text
files using a Large Language Model. The compression ratio is much
higher than with other compression tools. There are some caveats,
of course:
- A GPU is necessary to get a reasonable speed. 4 GB of RAM is
required.
- It is slower than conventional compressors (compression and
decompression speed: up to 1 MB/s on an RTX 4090).
- Only text files are supported. Binary files won't be compressed
much. The currently used language model (RWKV 169M v4) was trained
mostly on English texts. Other languages are also supported, as is
source code.
- It is experimental, so no backward compatibility should be
expected between versions.
Compression Ratio
The compression ratio is given in bits per byte (bpb): the number
of compressed output bits divided by the number of input bytes, so
lower is better.
File             | Original size (bytes) | xz (bytes) (bpb)  | ts_zip (bytes) (bpb)
alice29.txt      |                152089 |     48492 (2.551) |      21713 (1.142)
book1            |                768771 |    261116 (2.717) |     137477 (1.431)
enwik8           |             100000000 |  24865244 (1.989) |   13825741 (1.106)
enwik9           |            1000000000 | 213370900 (1.707) |  135443237 (1.084)
linux-1.2.13.tar |               9379840 |   1689468 (1.441) |    1196859 (1.021)
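As a sanity check, the bpb columns can be recomputed from the byte
counts with a few lines of Python (the sizes are copied from the
table above):

    # Recompute the bpb values: 8 bits per compressed byte, divided
    # by the original size in bytes. Sizes taken from the table above.
    results = {
        "alice29.txt": (152089, 48492, 21713),
        "book1": (768771, 261116, 137477),
        "enwik8": (100000000, 24865244, 13825741),
        "enwik9": (1000000000, 213370900, 135443237),
        "linux-1.2.13.tar": (9379840, 1689468, 1196859),
    }
    for name, (orig, xz, tszip) in results.items():
        print(f"{name:17s} xz {8 * xz / orig:.3f} bpb"
              f"   ts_zip {8 * tszip / orig:.3f} bpb")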
Results and speed for other programs on enwik8 and enwik9 are
available at the Large
Text Compression Benchmark.
Download
Technical information
ts_zip uses the RWKV 169M v4 language model, which is a good
compromise between speed and compression ratio. The model is
quantized to 8 bits per parameter and evaluated using BF16 floating
point numbers.
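As a rough illustration of what 8-bit weight quantization looks
like, here is a generic symmetric per-tensor int8 scheme. This is
only a sketch, not necessarily the scheme ts_zip uses, and float32
stands in for BF16 since NumPy has no bfloat16 type:

    import numpy as np

    # Each weight tensor is stored as int8 values plus one float
    # scale, and dequantized on the fly before the matrix products.
    def quantize_int8(w):
        scale = float(np.abs(w).max()) / 127.0
        if scale == 0.0:          # all-zero tensor
            scale = 1.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # float32 stands in for BF16, which NumPy lacks
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print(np.max(np.abs(w - dequantize(q, s))))  # small error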
- The language model predicts the probabilities of the next
token. An arithmetic coder then encodes the token that actually
occurs according to those probabilities (see the sketch after this
list).
- The model is evaluated in a deterministic and reproducible way,
so the result depends neither on the exact GPU or CPU model nor on
the number of configured threads. This key property ensures that a
file compressed on one hardware or software configuration can be
decompressed on a different one.
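The sketch below shows the coding idea in miniature. It is not
ts_zip's actual code: the "model" is a trivial stand-in for RWKV,
the names model_freqs and encode are made up for the example, and
only the encoder side is shown (the decoder runs the same model and
reverses the interval narrowing). Reducing the model's output to
integer frequencies is one way to make encoder and decoder agree
bit-exactly, in the spirit of the determinism requirement above:

    # Minimal sketch of model-driven arithmetic coding.
    # The "model" must be deterministic so that the decoder can
    # reproduce exactly the same frequencies.

    def model_freqs(context, vocab_size):
        """Integer frequencies for the next token given the context.
        Crude adaptivity: favor a repeat of the last token."""
        freqs = [1] * vocab_size
        if context:
            freqs[context[-1]] += vocab_size
        return freqs

    def encode(tokens, vocab_size, precision=32):
        full = (1 << precision) - 1
        half = (full + 1) >> 1
        quarter = half >> 1
        low, high, pending, out = 0, full, 0, []

        def emit(bit):
            nonlocal pending
            out.append(bit)
            out.extend([bit ^ 1] * pending)  # flush underflow bits
            pending = 0

        context = []
        for tok in tokens:
            freqs = model_freqs(context, vocab_size)
            total = sum(freqs)
            cum = sum(freqs[:tok])
            span = high - low + 1
            # Narrow [low, high] to the slice the model assigns to
            # tok: a likely token gets a wide slice, so fewer bits.
            high = low + span * (cum + freqs[tok]) // total - 1
            low = low + span * cum // total
            while True:                      # renormalize
                if high < half:
                    emit(0)
                elif low >= half:
                    emit(1)
                    low -= half; high -= half
                elif low >= quarter and high < half + quarter:
                    pending += 1             # underflow
                    low -= quarter; high -= quarter
                else:
                    break
                low = 2 * low
                high = 2 * high + 1
            context.append(tok)
        pending += 1                         # final flush
        emit(0 if low < quarter else 1)
        return out

    bits = encode([0, 1, 1, 1, 1, 2], vocab_size=4)
    print(len(bits), bits)  # well-predicted tokens -> fewer bits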
Fabrice Bellard - https://bellard.org/