Lossless Compression of English Short Messages
This lossless compressor compresses English text much more tightly than general-purpose compressors. Its typical compression ratio is 15%, measured as the number of output bits divided by the number of input bits.
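As a quick illustration of that definition, the sketch below computes the ratio for a made-up message. The function name `compression_ratio` and the sizes (a 100-byte input, a 120-bit output) are assumptions chosen for the example, not measurements from the compressor.

```python
def compression_ratio(input_bits: int, output_bits: int) -> float:
    """Return output bits divided by input bits (lower is better)."""
    if input_bits <= 0:
        raise ValueError("input_bits must be positive")
    return output_bits / input_bits


# Hypothetical sizes: a 100-byte message (800 bits) compressed to 120 bits.
input_bits = 8 * 100
output_bits = 120
ratio = compression_ratio(input_bits, output_bits)
print(f"compression ratio: {ratio:.0%}")
```

With this definition a smaller percentage means better compression, so 15% corresponds to an output roughly one seventh the size of the input.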
The compression relies on the probability of the next word, as computed by the GPT-2 language model released by OpenAI. GPT-2 is a neural network with 345 million parameters based on the Transformer architecture. An arithmetic coder turns these probabilities into the output bit stream. For this demo, each compressed character holds 15 data bits by using the CJK and the Hangul Syllables Unicode ranges.
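To make the role of the arithmetic coder concrete, here is a minimal sketch of arithmetic coding driven by a probability model. It is not the compressor's actual code: `toy_model`, its four-symbol vocabulary and the `<eos>` end marker are hypothetical stand-ins for GPT-2, and exact fractions replace the fixed-precision arithmetic a practical coder would use.

```python
from fractions import Fraction
from math import ceil


def toy_model(context):
    """Hypothetical stand-in for GPT-2: next-symbol probabilities.

    A real model would depend on `context`; this one ignores it.
    """
    return {
        "the": Fraction(1, 2),
        "cat": Fraction(1, 4),
        "sat": Fraction(1, 8),
        "<eos>": Fraction(1, 8),
    }


def encode(symbols, model):
    """Narrow [low, low + width) once per symbol, then emit a short bit string."""
    low, width = Fraction(0), Fraction(1)
    context = []
    for s in symbols:
        cum = Fraction(0)
        for sym, p in model(context).items():
            if sym == s:
                low += width * cum
                width *= p
                break
            cum += p
        else:
            raise ValueError(f"symbol {s!r} not in model")
        context.append(s)
    # Find the shortest binary fraction k / 2**n inside the final interval.
    n = 0
    while True:
        k = ceil(low * 2**n)
        if Fraction(k, 2**n) < low + width:
            return format(k, "b").zfill(n) if n else ""
        n += 1


def decode(bits, model, eos="<eos>"):
    """Replay the same interval narrowing, guided by the encoded value."""
    value = Fraction(int(bits, 2), 2 ** len(bits)) if bits else Fraction(0)
    low, width = Fraction(0), Fraction(1)
    out = []
    while not out or out[-1] != eos:
        cum = Fraction(0)
        for sym, p in model(out).items():
            if low + width * cum <= value < low + width * (cum + p):
                low += width * cum
                width *= p
                out.append(sym)
                break
            cum += p
        else:
            raise ValueError("value outside every symbol interval")
    return out


message = ["the", "cat", "sat", "<eos>"]
bits = encode(message, toy_model)
print(len(bits), "bits; round trip ok:", decode(bits, toy_model) == message)
```

In this toy run the message has probability 1/2 × 1/4 × 1/8 × 1/8 = 1/512, and the encoder emits 9 bits, matching the ideal cost of -log2(1/512). The better the model predicts the text, the larger each probability and the fewer bits are needed, which is why a strong language model improves the compression ratio.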
It is implemented in C using the LibNC library. A standalone command line version (gpt2tc) is also available. The compression ratios on several text compression benchmarks are listed separately.
A similar model can be used to complete text messages.