A Paradigm Shift for Improved Processing of Small Files in Hadoop

Authors

December 25, 2017

HDFS is designed to store and process very large files, with clusters routinely holding petabytes of data. However, many applications need to access and manipulate a large number of small files, and HDFS suffers a significant performance penalty in this situation. With the rapid growth of the Internet, users increasingly store their data and programs on cloud computing platforms, and personal data has an obvious characteristic: many files, each small in size. In such cases, HDFS struggles to meet performance requirements. In the Hadoop architecture, FileInputFormat generates one split per file, and a map task usually processes one block of input at a time. With a large number of small files, each map task therefore processes very little input, every map task imposes extra bookkeeping overhead on the NameNode, and the job consumes considerable time to process the files. We have devised a strategy in which a large number of small files are clubbed together to form a single split. This reduces the number of splits generated by FileInputFormat, resulting in less processing time. Clubbing of small files is achieved through a customized mapper. A practical setup has indicated a performance improvement of around 80%. This paper covers this paradigm shift in processing a large number of small files in HDFS for performance improvement.
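To make the idea of clubbing small files into one split concrete, the sketch below shows a minimal MapReduce driver that uses Hadoop's stock CombineTextInputFormat, which packs many small files into each split instead of generating one split per file. This is only an illustration of the general technique, not the paper's customized mapper; the class name SmallFilesDriver, the use of command-line arguments for input and output paths, and the 128 MB split cap are assumptions chosen for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-files-combined");
        job.setJarByClass(SmallFilesDriver.class);

        // Pack many small files into each input split rather than one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly one HDFS block (128 MB here, an illustrative value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // A real job would set its own mapper/reducer here; the default identity
        // mapper is enough to demonstrate the reduced number of map tasks.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this configuration, the number of map tasks is driven by the combined split size rather than by the file count, which is the same effect the paper pursues with its customized mapper.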