The ts_zip utility, provided with the ts_server software, can compress (and hopefully decompress) text files with Large Language Models. The compression ratio is much higher than with other compression tools. There are some caveats, of course.
The compression ratio is given in bits per byte (bpb) for each model; lower is better. CMIX v19 is one of the best lossless data compression programs.
File | Original size (bytes) | xz (bpb) | CMIX v19 (bpb) | pythia_deduped_70M (bpb) | rwkv_169M (bpb) | rwkv_430M (bpb) | falcon_7B_q4 (bpb) | rwkv_7B_q4 (bpb) |
---|---|---|---|---|---|---|---|---|
alice29.txt | 152089 | 2.551 | 1.645 | 1.335 | 1.166 | 1.028 | 0.718 | 0.411 |
book1 | 768771 | 2.717 | 1.816 | 1.569 | 1.426 | 1.311 | 1.104 | 1.115 |
enwik8 | 100000000 | 1.989 | 1.187 | - | 1.098 | 0.948 | - | - |
linux-1.2.13.tar | 9379840 | 1.441 | - | 1.010 | 0.991 | 0.837 | - | - |
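To make the bits-per-byte figures in the table above more concrete, here is a minimal sketch of converting between bpb and compressed size. The helper names are hypothetical and are not part of ts_zip or ts_server. The `xz_bits_per_byte` helper uses Python's standard `lzma` module at its default preset, which may differ from the xz settings used for the table, so its results are not expected to match the xz column exactly.

```python
import lzma


def bits_per_byte(original_size: int, compressed_size: int) -> float:
    """Bits of compressed output per byte of original input (lower is better)."""
    return 8.0 * compressed_size / original_size


def compressed_bytes(original_size: int, bpb: float) -> float:
    """Approximate compressed size in bytes implied by a bpb figure."""
    return original_size * bpb / 8.0


def xz_bits_per_byte(data: bytes) -> float:
    """bpb obtained with Python's lzma module (xz container, default preset).

    Assumption: the default preset is only a stand-in for whatever
    xz options were used to produce the table.
    """
    return bits_per_byte(len(data), len(lzma.compress(data)))


# alice29.txt: 152089 bytes, 0.411 bpb with rwkv_7B_q4 (values from the table).
size = compressed_bytes(152089, 0.411)
print(f"alice29.txt with rwkv_7B_q4: about {size:.0f} bytes "
      f"(ratio {152089 / size:.1f}:1)")
```

An uncompressed file corresponds to 8 bpb, so a figure of 1 bpb means the output is one eighth of the input size.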
The compression speed and GPU memory usage below were measured when compressing the book1 file on an RTX A6000 GPU. The decompression speed and memory requirements are similar.
Model | Compression speed (kBytes/s) | GPU memory (GB) |
---|---|---|
rwkv_169M | 128 | 0.38 |
rwkv_430M | 85 | 0.94 |
pythia_deduped_70M | 70 | 6.61 |
rwkv_7B_q4 | 15 | 4.76 |
falcon_7B_q4 | 6.7 | 8.44 |
The smaller RWKV models seem to be a good compromise for text compression: they use little memory thanks to their RNN structure and run at a (relatively) high speed.
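The speed figures can be turned into a rough running-time estimate, sketched below. This assumes that 1 kByte means 1000 bytes and that the speed measured on book1 carries over to other files and hardware, neither of which is confirmed by the measurements above. The function name is hypothetical.

```python
def estimate_seconds(file_size_bytes: int, speed_kbytes_per_s: float) -> float:
    """Rough compression time, assuming 1 kByte = 1000 bytes (an assumption)."""
    return file_size_bytes / (speed_kbytes_per_s * 1000.0)


# Speeds in kBytes/s, copied from the table (measured on book1, RTX A6000).
speeds = {
    "rwkv_169M": 128,
    "rwkv_430M": 85,
    "pythia_deduped_70M": 70,
    "rwkv_7B_q4": 15,
    "falcon_7B_q4": 6.7,
}

enwik8_size = 100_000_000  # bytes, from the compression ratio table
for model, speed in speeds.items():
    minutes = estimate_seconds(enwik8_size, speed) / 60.0
    print(f"{model}: about {minutes:.0f} min for enwik8")
```

Since decompression speed is similar, the same estimate applies roughly in both directions.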