Lossless Compression of English Short Messages

This lossless compressor compresses English text far more effectively than general-purpose compressors. Its typical compression ratio is 15% (number of output bits divided by number of input bits).

Compression is achieved by using the next-word probabilities computed by the GPT-2 language model released by OpenAI, a Transformer-based neural network with 345 million parameters (the largest GPT-2 model, with 1.5 billion parameters, brings only a marginal improvement when compressing short messages). An arithmetic coder converts these probabilities into the output bit stream. For this demo, each compressed character carries 15 data bits by using the CJK and Hangul Syllables Unicode ranges.
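To make the coupling between the language model and the arithmetic coder concrete, here is a minimal Python sketch. It is not the gpt2tc implementation (which is written in C with LibNC): the predict function is a hypothetical stand-in for GPT-2's next-word distribution, and exact rational arithmetic replaces the fixed-precision integer coder a real implementation would use.

```python
# Minimal sketch of predictive arithmetic coding (not the gpt2tc code).
# `predict` is a hypothetical stand-in for GPT-2: P(next token | context).
from fractions import Fraction
import math

def predict(context):
    # Placeholder distribution over a tiny vocabulary; the real compressor
    # queries the neural network here. Probabilities must sum to 1.
    return {"hello": Fraction(1, 2), "world": Fraction(1, 4), "<eos>": Fraction(1, 4)}

def encode(tokens):
    # Narrow the interval [low, high) once per token, in proportion to its probability.
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        cum = Fraction(0)
        for t, p in predict(tokens[:i]).items():
            if t == tok:
                low, high = low + cum * (high - low), low + (cum + p) * (high - low)
                break
            cum += p
    # Any number in [low, high) identifies the message. Its length is about
    # -log2 P(message) bits, so messages the model finds likely compress well.
    nbits = math.ceil(-math.log2(high - low)) + 1
    return math.ceil(low * 2**nbits), nbits

def decode(code, nbits, length):
    # Replay the same predictions and follow the sub-interval containing the code.
    x = Fraction(code, 2**nbits)
    low, high = Fraction(0), Fraction(1)
    tokens = []
    for _ in range(length):
        cum = Fraction(0)
        for t, p in predict(tokens).items():
            lo = low + cum * (high - low)
            hi = low + (cum + p) * (high - low)
            if lo <= x < hi:
                tokens.append(t)
                low, high = lo, hi
                break
            cum += p
    return tokens

msg = ["hello", "world", "hello", "<eos>"]
code, nbits = encode(msg)
print(nbits, "bits")                          # 7 bits for this toy example
print(decode(code, nbits, len(msg)) == msg)   # True: the coding is lossless
```

In practice the interval is maintained with fixed-precision integers and bits are emitted as soon as they are determined, so arbitrarily long inputs can be coded, but the principle is the same.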
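The final step packs the coder's output bits into displayable characters, 15 bits per character. The sketch below shows the principle; the code-point assignment is an illustrative assumption, not the exact mapping used by gpt2tc. In particular, the two blocks named above cover 32164 values, so this sketch borrows the remaining 604 of the 32768 possible 15-bit values from CJK Extension A.

```python
# Sketch of the packing stage: 15 bits of compressed data per character.
# The range table is an assumption for illustration, not gpt2tc's mapping.
RANGES = [
    (0x4E00, 20992),  # CJK Unified Ideographs
    (0xAC00, 11172),  # Hangul Syllables
    (0x3400, 604),    # CJK Extension A (assumed filler to reach 2**15 values)
]

def value_to_char(v):
    # Map a 15-bit value (0..32767) to one display character.
    for base, size in RANGES:
        if v < size:
            return chr(base + v)
        v -= size
    raise ValueError("value out of range")

def char_to_value(c):
    offset = 0
    for base, size in RANGES:
        if base <= ord(c) < base + size:
            return offset + ord(c) - base
        offset += size
    raise ValueError("character out of range")

def pack(bits):
    # Pad the bit string to a multiple of 15 and emit one character per group.
    bits += "0" * (-len(bits) % 15)
    return "".join(value_to_char(int(bits[i:i + 15], 2)) for i in range(0, len(bits), 15))

def unpack(text):
    return "".join(format(char_to_value(c), "015b") for c in text)

s = pack("101100111000011110101")   # 21 bits -> 2 characters
print(s, unpack(s)[:21])
```

Decompression reverses this mapping to recover the bit stream before running the arithmetic decoder.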

It is implemented using the LibNC library and runs on a standard PC. The standalone Linux command-line version (gpt2tc) can be downloaded here. Compression ratios on several text compression benchmarks are listed in the gpt2tc documentation.

A similar model can be used to complete text messages.