Text Compression Using the Shannon-Fano, Huffman, and Half–Byte Algorithms

Keywords: compression ratio; text data; effectiveness in compressing

Authors

  • Eko Priyono, Informatics Engineering, Universitas Muhammadiyah Purwokerto, Indonesia
  • Hindayati Mustafidah, Informatics Engineering, Universitas Muhammadiyah Purwokerto, Indonesia
Vol. 12 No. 09 (2024)
Engineering and Computer Science
September 5, 2024


Background and Objectives: File sizes grow as technology advances, and large files demand more storage space and longer transfer times. Data compression transforms an input (original) data stream into a smaller output (compressed) data stream. Established compression techniques include the Huffman, Shannon-Fano, and Half-Byte algorithms. Like most algorithms in computer science, each has its own advantages and disadvantages, so testing is needed to determine which is most effective for compressing data, especially text data.
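To make "smaller in size" concrete, the compression ratio reported later in this abstract can be read as the percentage of space saved. The abstract does not spell out the formula, so the space-savings convention in the following Python sketch is an assumption:

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Space saved as a percentage of the original size.

    Assumed convention: ratio = (1 - compressed/original) * 100,
    so a higher value means more effective compression.
    """
    return (1 - compressed_size / original_size) * 100

# Example: a 10 000-byte text compressed to 5 395 bytes
print(f"{compression_ratio(10_000, 5_395):.2f}%")  # -> 46.05%
```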

Methods: The Huffman, Shannon-Fano, and Half-Byte algorithms were applied to text data to test their compression effectiveness. The sample consisted of text files containing abstracts of research articles randomly selected from 100 scientific journals. All abstracts used as data are in Indonesian.

Results: Based on the test findings, the Huffman algorithm outperforms the Shannon-Fano and Half-Byte algorithms in compression ratio, while Half-Byte yields the lowest ratio of the three. Half-Byte compresses by exploiting the similarity of the first four bits across runs of seven consecutive characters, whereas the Huffman and Shannon-Fano algorithms assign codes based on character frequencies, as sketched below. With an average compression ratio of 46.05%, versus 40.36% for Shannon-Fano and 5.04% for Half-Byte, the Huffman method is a strong candidate for compressing Indonesian-language text data.
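To illustrate the frequency-based approach shared by Huffman and Shannon-Fano, the sketch below derives Huffman code lengths from character counts and estimates the compressed payload. It is a minimal illustration of the general technique, not the authors' implementation, and it ignores the overhead of storing the code table.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Map each character to its Huffman codeword length in bits.

    Frequent characters end up with shorter codes; the lengths alone
    are enough to estimate compressed size without building bit strings.
    """
    if not text:
        return {}
    freq = Counter(text)
    if len(freq) == 1:                      # degenerate one-symbol alphabet
        return {next(iter(freq)): 1}
    # Heap entries: (subtree frequency, tie-breaker, {char: depth}).
    heap = [(f, i, {ch: 0}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code inside them.
        merged = {ch: d + 1 for ch, d in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_payload_bits(text: str) -> int:
    """Total compressed payload in bits (code-table overhead ignored)."""
    lengths = huffman_code_lengths(text)
    return sum(n * lengths[ch] for ch, n in Counter(text).items())

sample = "teks abstrak berbahasa Indonesia sebagai contoh data uji"
saved = 1 - huffman_payload_bits(sample) / (8 * len(sample))
print(f"estimated ratio: {saved * 100:.2f}%")
```

For contrast, a simplified sketch of the Half-Byte idea: runs of characters whose bytes share the same first four bits (the high nibble) are re-packed two low nibbles per byte between marker bytes. The abstract only states the principle, so the exact layout below (marker cost, packing of the remaining nibbles) is an assumption.

```python
def halfbyte_size(data: bytes, min_run: int = 7) -> int:
    """Estimated output size in bytes under a simplified Half-Byte scheme.

    A run of at least `min_run` bytes with identical high nibbles is
    encoded as: marker byte, first byte verbatim, the remaining low
    nibbles packed two per byte, closing marker byte. Shorter runs are
    copied through unchanged. (Assumed layout, for illustration only.)
    """
    size, i, n = 0, 0, len(data)
    while i < n:
        j = i
        while j < n and (data[j] >> 4) == (data[i] >> 4):
            j += 1
        run = j - i
        if run >= min_run:
            size += 2 + 1 + run // 2  # markers + first byte + ceil((run-1)/2) packed bytes
        else:
            size += run               # literal passthrough
        i = j
    return size

lowercase = b"kode"                  # 'a' through 'o' share high nibble 0x6
print(halfbyte_size(lowercase * 4))  # a long same-nibble run compresses well
```

Runs of seven or more same-nibble bytes are rare in ordinary mixed text, which is consistent with Half-Byte's much lower average ratio in the reported results.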