Efficient Customer Data Privacy Management in Hadoop Ecosystems: A Scalable Query Engine Approach

Sai Kiran Reddy Malikireddy

doi:10.18535/ijsrm/v8i11.ec01

Abstract

Ensuring customer data privacy in the Hadoop ecosystem poses significant challenges for processing data requests at scale. Traditional methods rely on resource-intensive full table scans, which are neither cost-effective nor scalable. This paper proposes a new architecture for customer data retrieval in Hadoop that reduces computation overhead and cost to as little as one-tenth of those of conventional methods. The architecture uses Bloom filters, bucketing, and predicate pushdown to eliminate and fetch data at the file level, avoiding the inefficiencies of the table scans prevalent in traditional OLAP systems. Benchmarking results show that the approach remains scalable and effective across several orders of magnitude of data volume, from terabytes to petabytes. The proposed methodology therefore supports compliance with data privacy regulations without compromising performance or cost efficiency, making it well suited to enterprise-grade big data platforms.
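
To make the file-level elimination idea concrete, the following is a minimal in-memory sketch, not the paper's implementation. It assumes rows are bucketed by a hash of a hypothetical `customer_id` column and that each data file carries its own Bloom filter over that column. A lookup then skips every file outside the target bucket, skips files whose Bloom filter rules the key out, and scans only what remains. All class and function names (`BloomFilter`, `DataFile`, `write_rows`, `find_customer`) are illustrative, as are the filter size and hash count.

```python
import hashlib
from dataclasses import dataclass, field


class BloomFilter:
    """Fixed-size Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def bucket_of(customer_id: str, num_buckets: int) -> int:
    """Deterministic bucket assignment (assumed hash-mod scheme)."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


@dataclass
class DataFile:
    """Stand-in for one data file: its rows plus a per-file Bloom filter."""
    name: str
    bucket: int
    rows: list = field(default_factory=list)
    bloom: BloomFilter = field(default_factory=BloomFilter)

    def append(self, row: dict) -> None:
        self.rows.append(row)
        self.bloom.add(row["customer_id"])


def write_rows(rows, num_buckets: int, files_per_bucket: int = 2):
    """Route each row to a file inside the bucket chosen by its customer_id."""
    files = [DataFile(f"bucket{b}_part{p}", b)
             for b in range(num_buckets) for p in range(files_per_bucket)]
    for i, row in enumerate(rows):
        b = bucket_of(row["customer_id"], num_buckets)
        files[b * files_per_bucket + i % files_per_bucket].append(row)
    return files


def find_customer(files, customer_id: str, num_buckets: int):
    """Return (matching rows, names of files actually scanned)."""
    target = bucket_of(customer_id, num_buckets)
    matches, scanned = [], []
    for f in files:
        if f.bucket != target:
            continue  # bucket pruning: the key cannot be in this file
        if not f.bloom.might_contain(customer_id):
            continue  # Bloom pruning: the key is definitely absent
        scanned.append(f.name)
        # The equality predicate is applied only to the surviving files.
        matches.extend(r for r in f.rows if r["customer_id"] == customer_id)
    return matches, scanned
```

In this sketch a lookup never opens more than `files_per_bucket` files, regardless of how many buckets exist. A Bloom filter false positive can cause an unnecessary scan of a file but never a missed row, so the result always equals that of a full scan.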

Keywords

Hadoop Ecosystem, Customer Data Privacy, Query Engine Optimization, Bloom Filters, OLAP
