
Research Article
On the Entropy of Written Afan Oromo
@INPROCEEDINGS{10.1007/978-3-031-06374-9_3,
  author    = {Dereje Hailemariam Woldegebreal and Tsegamlak Terefe Debella and Kalkidan Dejenie Molla},
  title     = {On the Entropy of Written Afan Oromo},
  booktitle = {e-Infrastructure and e-Services for Developing Countries. 13th EAI International Conference, AFRICOMM 2021, Zanzibar, Tanzania, December 1-3, 2021, Proceedings},
  year      = {2022},
  month     = {5},
  keywords  = {Compression; Entropy; Encoding; Written Afan Oromo},
  doi       = {10.1007/978-3-031-06374-9_3}
}
- Dereje Hailemariam Woldegebreal
- Tsegamlak Terefe Debella
- Kalkidan Dejenie Molla
Year: 2022
AFRICOMM
Springer
DOI: 10.1007/978-3-031-06374-9_3
Abstract
Afan Oromo is the language of the Oromo people, the largest ethnolinguistic group in Ethiopia. Written Afan Oromo uses the Latin alphabet. In electronic communication systems, letters of the alphabet are represented with the standard ASCII-8 code, which uses 8 bits/letter, or with fixed-length UTF-16 encoding, which uses 16 bits/letter. Moreover, the language uses gemination (i.e., doubling of a consonant), and long vowels are represented by double letters, e.g., “dammee”, meaning sweet potato. From an information-theoretic perspective, this doubling and these fixed-length encoding schemes add redundancy to written Afan Oromo. This redundancy, in turn, contributes to inefficient use of communication resources, such as bandwidth and energy, during transmission and storage of texts written in Afan Oromo. This paper uses information theory to estimate the entropy of written Afan Oromo. We use a higher-order Markov chain, also called an N-gram model, to compute the entropy of a sample text corpus (or written source) by capturing the dependencies among sequences of letters generated from the corpus. Entropy measures the average information in bits per letter, or per block of letters, depending on the N-gram considered. Entropy also indicates the achievable lower bound for compression when using lossless compression schemes such as Huffman coding. When the language is modeled as a first-order Markov chain (i.e., as a memoryless source whose letters occur independently of each other), its entropy is 4.31 bits/letter. Compared with ASCII-8, the achievable compression level is about 46%. When N = 19, the estimated entropy is as low as 0.85 bits/letter, which corresponds to about 89% compression. Huffman and arithmetic source coding algorithms are implemented to check the achievable compression level. For the collected sample corpora, the average compression achieved by the Huffman algorithm varies from 42.2% to 64.9% for N = 1 to 5. These compression levels are close to the theoretical entropy bound. With the increasing use of the language in telecom services and storage systems, the entropy results show the need to further investigate language-specific applications, such as compression algorithms.
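As a rough illustration of the figures quoted in the abstract, the Python sketch below estimates per-letter N-gram (block) entropy from a text file and converts it into a compression level relative to ASCII-8, e.g. 1 - 4.31/8 ≈ 46%. This is only a sketch under stated assumptions: the corpus filename and preprocessing are hypothetical, and the paper's exact estimator (block versus conditional N-gram entropy) may differ.

```python
from collections import Counter
from math import log2

def block_entropy(text: str, n: int) -> float:
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def per_letter_entropy(text: str, n: int) -> float:
    """Per-letter entropy estimate H_n / n from the n-gram model."""
    return block_entropy(text, n) / n

def compression_level(bits_per_letter: float, baseline: float = 8.0) -> float:
    """Achievable compression relative to a fixed-length code (ASCII-8 by default)."""
    return 1.0 - bits_per_letter / baseline

if __name__ == "__main__":
    # "afan_oromo_corpus.txt" is a hypothetical stand-in for the paper's corpora.
    corpus = open("afan_oromo_corpus.txt", encoding="utf-8").read().lower()
    for n in (1, 2, 3, 5, 19):
        h = per_letter_entropy(corpus, n)
        print(f"N={n}: {h:.2f} bits/letter, ~{100 * compression_level(h):.0f}% vs ASCII-8")
```

The Huffman check reported in the abstract can be sketched the same way; the version below handles only the N = 1 (single-letter) case, whereas the paper also applies Huffman coding to letter blocks up to N = 5.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Per-symbol code lengths (bits) from a standard Huffman tree."""
    counts = Counter(text)
    # Heap items: (weight, tiebreaker, {symbol: code length so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code beneath them.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_bits_per_letter(text: str) -> float:
    counts = Counter(text)
    lengths = huffman_code_lengths(text)
    total = sum(counts.values())
    return sum(counts[s] * lengths[s] for s in counts) / total

if __name__ == "__main__":
    corpus = open("afan_oromo_corpus.txt", encoding="utf-8").read().lower()  # hypothetical file
    bpl = huffman_bits_per_letter(corpus)
    print(f"Huffman (N=1): {bpl:.2f} bits/letter, ~{100 * (1 - bpl / 8):.0f}% vs ASCII-8")
```

Huffman's average code length is guaranteed to lie within one bit per block of the block entropy, which is consistent with the paper's observation that the measured 42.2% to 64.9% compression approaches the theoretical bound as N grows.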