ts_zip: Text Compression using Large Language Models

The ts_zip utility provided with the ts_server software can compress (and hopefully decompress) text files using Large Language Models. The compression ratio is much higher than with other compression tools. There are some caveats, of course.

Compression Ratio

The compression ratio is given in bits per byte (bpb) for each model; lower values mean better compression. CMIX v19 is one of the best lossless data compression programs.

File              Original size  xz     CMIX v19  pythia_deduped_70M  rwkv_169M  rwkv_430M  falcon_7B_q4  rwkv_7B_q4
                  (bytes)        (bpb)  (bpb)     (bpb)               (bpb)      (bpb)      (bpb)         (bpb)
alice29.txt       152089         2.551  1.645     1.335               1.166      1.028      0.718         0.411
book1             768771         2.717  1.816     1.569               1.426      1.311      1.104         1.115
enwik8            100000000      1.989  1.187     -                   1.098      0.948      -             -
linux-1.2.13.tar  9379840        1.441  -         1.010               0.991      0.837      -             -
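
To make the bits-per-byte figures concrete, here is a minimal Python sketch of how such a number can be computed and converted back into a compressed size. The function names are hypothetical and are not part of ts_zip or ts_server. The xz measurement uses Python's standard lzma module at preset 9, which is an assumption: the xz options behind the figures above are not stated.

```python
import lzma


def bits_per_byte(original_size: int, compressed_size: int) -> float:
    """Return the number of compressed bits per original byte (lower is better)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 8.0 * compressed_size / original_size


def compressed_size_from_bpb(original_size: int, bpb: float) -> float:
    """Invert the metric: approximate compressed size in bytes for a given bpb."""
    return original_size * bpb / 8.0


def xz_bpb(data: bytes) -> float:
    """Measure bpb of xz on in-memory data (preset 9 is an assumed setting)."""
    compressed = lzma.compress(data, preset=9)
    return bits_per_byte(len(data), len(compressed))
```

For example, compressed_size_from_bpb(768771, 1.104) gives roughly 106,000 bytes for book1 with falcon_7B_q4, against roughly 261,000 bytes at the xz figure of 2.717 bpb.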

Compression Speed and Required Memory

Speed and memory usage were measured while compressing the book1 file on an RTX A6000 GPU. The decompression speed and memory requirements are similar.

Model               Compression speed  GPU memory
                    (kBytes/s)         (GB)
rwkv_169M           128                0.38
rwkv_430M           85                 0.94
pythia_deduped_70M  70                 6.61
rwkv_7B_q4          15                 4.76
falcon_7B_q4        6.7                8.44
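
As a rough illustration of what these speeds mean in practice, the sketch below estimates how long a file would take to compress at a given throughput. It rests on two assumptions that the measurements above do not confirm: that the speed measured on book1 carries over to other files, and that 1 kByte means 1000 bytes. The function name is hypothetical.

```python
def estimated_seconds(size_bytes: int, speed_kbytes_per_s: float,
                      bytes_per_kbyte: int = 1000) -> float:
    """Estimate compression time in seconds at a constant throughput.

    bytes_per_kbyte defaults to 1000, which is an assumption about the unit.
    """
    if speed_kbytes_per_s <= 0:
        raise ValueError("speed must be positive")
    return size_bytes / (speed_kbytes_per_s * bytes_per_kbyte)
```

Under these assumptions, a 100000000-byte file such as enwik8 at 128 kBytes/s (the rwkv_169M figure) gives estimated_seconds(100_000_000, 128) = 781.25 seconds, or about 13 minutes.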

Conclusion

The smaller RWKV models seem to be a good compromise for text compression: their RNN structure keeps memory usage small, and they run at a (relatively) high speed.


Fabrice Bellard - https://bellard.org/