Efficient Customer Data Privacy Management in Hadoop Ecosystems: A Scalable Query Engine Approach

Hadoop Ecosystem, Customer Data Privacy, Query Engine Optimization, Bloom Filters, OLAP Systems, Big Data Scalability, ORC and Parquet Files

Vol. 8 No. 11 (2020)
Engineering and Computer Science
November 18, 2020

Ensuring customer data privacy in the Hadoop ecosystem poses significant challenges for large-scale processing of data requests. Traditional methods rely on resource-intensive full table scans that are neither cost-effective nor scalable. This paper proposes a new architecture for customer data retrieval in Hadoop that reduces computational overhead and cost to roughly one-tenth of those of conventional methods. The architecture uses Bloom filters, bucketing, and predicate pushdown to eliminate and fetch data at the file level, avoiding the inefficiencies of traditional OLAP systems. Benchmarking results demonstrate scalability and effectiveness across several orders of magnitude of data volume, from terabytes to petabytes. The proposed methodology therefore supports compliance with data privacy regulations without compromising performance or cost efficiency, making it well suited to enterprise-grade big data platforms.
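The file-level elimination described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes one in-memory Bloom filter per data file, built over the customer IDs that the file contains. The file names, customer IDs, filter size, and the helper names `build_file_index` and `candidate_files` are all hypothetical. In real ORC and Parquet files, comparable filters can be stored in the file metadata and consulted by the query engine.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # assumed size, chosen for illustration
        self.num_hashes = num_hashes  # assumed hash count
        self.bits = 0                 # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_file_index(files):
    """Build one Bloom filter per file from the customer IDs it holds."""
    index = {}
    for path, customer_ids in files.items():
        bloom = BloomFilter()
        for customer_id in customer_ids:
            bloom.add(customer_id)
        index[path] = bloom
    return index


def candidate_files(index, customer_id):
    """Return only the files whose filter says the ID might be present."""
    return [path for path, bloom in index.items()
            if bloom.might_contain(customer_id)]


# Hypothetical file layout: file name -> customer IDs stored in that file.
FILES = {
    "part-0000.orc": ["c-101", "c-102", "c-103"],
    "part-0001.orc": ["c-204", "c-205"],
    "part-0002.orc": ["c-306", "c-307", "c-308"],
}
INDEX = build_file_index(FILES)

# Only the returned files need to be opened and scanned for this customer.
print(candidate_files(INDEX, "c-205"))
```

A file that actually contains the requested ID is always returned, because a Bloom filter has no false negatives. A file that does not contain it is usually skipped, but may occasionally be returned as a false positive, which costs one unnecessary scan and never a missed record. That property is what makes the structure suitable for privacy requests, where missing a customer's record is not acceptable.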