A Paradigm Shift for Improved Processing of Small Files in Hadoop

Authors

December 25, 2017

HDFS is designed to store and process very large files, with clusters routinely holding petabytes of data. However, many applications need to access and manipulate a large number of small files, and HDFS suffers a significant performance penalty in this situation. With the rapid growth of the Internet, users increasingly store their data and programs on cloud computing platforms, and personal data has an obvious characteristic: many files, each small in size. In such cases, HDFS struggles to meet performance requirements. In the Hadoop architecture, FileInputFormat generates one split per file, and a map task usually processes one block of input at a time. With a large number of small files, each map task therefore processes very little input, every map task imposes extra bookkeeping overhead on the NameNode, and the job consumes considerable time to process the files. We have devised a strategy in which a large number of small files are clubbed together to form a single split. This reduces the number of splits generated by FileInputFormat, resulting in less processing time. Clubbing of small files is achieved through a customized mapper. A practical setup has indicated a performance improvement of around 80%. This paper covers this paradigm shift in processing a large number of small files in HDFS for performance improvement.
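To make the idea of clubbing small files into one split concrete, the sketch below shows a minimal MapReduce driver that uses Hadoop's stock CombineTextInputFormat, which packs many small files into each split instead of generating one split per file. This is only an illustration of the general technique, not the paper's customized mapper; the class name SmallFilesDriver, the use of command-line arguments for input and output paths, and the 128 MB split cap are assumptions chosen for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-files-combined");
        job.setJarByClass(SmallFilesDriver.class);

        // Pack many small files into each input split rather than one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly one HDFS block (128 MB here, an illustrative value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // A real job would set its own mapper/reducer here; the default identity
        // mapper is enough to demonstrate the reduced number of map tasks.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this configuration, the number of map tasks is driven by the combined split size rather than by the file count, which is the same effect the paper pursues with its customized mapper.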