International Journal on Science and Technology

E-ISSN: 2229-7677     Impact Factor: 9.88

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal


Integrating Machine Learning into Big Data Pipelines: A Case Study with AWS SageMaker and EMR

Author(s) Naga Surya Teja Thallam
Country United States
Abstract Organizations that handle large volumes of structured and unstructured data increasingly find it essential to integrate machine learning (ML) into their big data pipelines. Cloud platforms such as Amazon Web Services (AWS) offer scalable solutions for both big data and ML workflows. This work studies the integration of AWS SageMaker, a fully managed ML service, with Amazon EMR (Elastic MapReduce), a distributed big data processing platform, to build an efficient, scalable, and automated ML pipeline. An experimental case study evaluates the proposed solution in terms of architectural design, performance benchmarks, automation strategies, and cost optimization for AWS-based big data ML workflows. Distributed data processing with Apache Spark on EMR improves preprocessing efficiency significantly, while SageMaker's managed training framework reduces model training time by 34%. The study also shows that using AWS Spot Instances reduces costs by 50%, making cloud-based ML solutions more affordable to operate. To address data transfer bottlenecks, auto-scaling inefficiencies, and inference latency, the study proposes AWS Data Wrangler for seamless integration, Bayesian hyperparameter tuning, and serverless inference with AWS Lambda. ML workflows are further automated with AWS Step Functions, with CloudWatch used for monitoring. The research concludes that combining AWS SageMaker and EMR yields scalable and cost-effective big data ML pipelines, and it explores future considerations including multi-cloud interoperability, federated learning, and real-time ML processing.
These findings are of value to enterprises, researchers, and cloud practitioners who aim to leverage ML-driven big data analytics to improve their operations.
Keywords Big Data Analytics, Machine Learning, AWS SageMaker, Amazon EMR, Distributed Computing, Cloud Computing, Apache Spark, Model Training, Automation, Cost Optimization, Scalability, MLOps, Serverless Inference.
Field Computer > Artificial Intelligence / Simulation / Virtual Reality
Published In Volume 13, Issue 4, October-December 2022
Published On 2022-12-03
