Making Compression Algorithms for Unicode Text

Abstract

The majority of online content is written in languages other than English, and is most commonly encoded in UTF-8, the world’s dominant Unicode character encoding. Traditional compression algorithms typically operate on individual bytes. While this approach works well for the single-byte ASCII encoding, it works poorly for UTF-8, where characters often span multiple bytes. Our paper introduces a technique to modify byte-by-byte compressors to operate directly on Unicode characters. We demonstrate this technique applied to LZW and PPM, finding that our variants substantially outperform the original, unmodified compressors.
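
As a rough illustration of the difference between byte-level and character-level symbols, the sketch below runs a textbook LZW encoder over the same text twice: once over its UTF-8 bytes and once over its Unicode code points. This is not the paper's implementation. The function name `lzw_encode`, the lazily populated phrase table, and the sample string are assumptions made for illustration only.

```python
def lzw_encode(symbols, alphabet_size):
    """Textbook LZW over a sequence of integer symbols.

    Single symbols are emitted as their own value, so the table only
    stores multi-symbol phrases. This keeps the sketch practical even
    for the full Unicode range (alphabet_size = 0x110000).
    """
    phrases = {}               # multi-symbol phrase -> code
    next_code = alphabet_size  # first code not used by a single symbol
    output = []
    current = ()

    def code_for(phrase):
        return phrase[0] if len(phrase) == 1 else phrases[phrase]

    for symbol in symbols:
        candidate = current + (symbol,)
        if len(candidate) == 1 or candidate in phrases:
            current = candidate          # keep extending the match
        else:
            output.append(code_for(current))
            phrases[candidate] = next_code
            next_code += 1
            current = (symbol,)
    if current:
        output.append(code_for(current))
    return output


# Hypothetical sample text: each Cyrillic letter takes two bytes in UTF-8.
text = "привет привет привет"
byte_symbols = list(text.encode("utf-8"))  # what a byte-oriented compressor sees
char_symbols = [ord(c) for c in text]      # what a character-oriented compressor sees

byte_codes = lzw_encode(byte_symbols, 256)
char_codes = lzw_encode(char_symbols, 0x110000)

print(len(byte_symbols), "byte symbols ->", len(byte_codes), "codes")
print(len(char_symbols), "character symbols ->", len(char_codes), "codes")
```

The number of emitted codes is only a proxy for compressed size: codes over the larger character alphabet need more bits each, so a real comparison has to account for how the codes are serialized.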

Publication
In Proceedings of the Data Compression Conference
Adam Gleave
Founder & CEO at FAR.AI

Founder of FAR.AI, an alignment research non-profit working to incubate and accelerate new alignment research agendas. Previously: PhD @ UC Berkeley; Google DeepMind. Research interests include adversarial robustness and interpretability.