
International Journal on Science and Technology
E-ISSN: 2229-7677
Integrating Machine Learning into Big Data Pipelines: A Case Study with AWS SageMaker and EMR
Author(s) | Naga Surya Teja Thallam |
---|---|
Country | United States |
Abstract | Organizations that handle large volumes of structured and unstructured data increasingly need to integrate machine learning (ML) into their big data pipelines. Cloud platforms such as Amazon Web Services (AWS) offer scalable services for both big data and ML workflows. This work studies the integration of AWS SageMaker, a fully managed ML service, with Amazon EMR (Elastic MapReduce), a distributed big data processing platform, to build an efficient, scalable, and automated ML pipeline. An experimental case study evaluates the proposed solutions in terms of architectural design, performance benchmarks, automation strategies, and cost optimization of AWS-based big data ML workflows. Distributed data processing with Apache Spark on EMR significantly improves preprocessing efficiency, while SageMaker’s managed training framework reduces model training time by 34%. Using AWS Spot Instances reduces costs by 50%, which makes most cloud-based ML solutions more affordable to operate. To address data transfer bottlenecks, auto-scaling inefficiencies, and inference latency, the study proposes AWS Data Wrangler for seamless integration, Bayesian hyperparameter tuning, and serverless inference with AWS Lambda. ML workflows are further automated with AWS Step Functions, with CloudWatch used for monitoring. The research concludes that combining AWS SageMaker and EMR yields scalable and cost-effective big data ML pipelines, and it explores future directions including multi-cloud interoperability, federated learning, and real-time ML processing. 
These findings are of value to enterprises, researchers, and cloud practitioners who aim to use ML-driven big data analytics to improve their operations. |
Keywords | Big Data Analytics, Machine Learning, AWS SageMaker, Amazon EMR, Distributed Computing, Cloud Computing, Apache Spark, Model Training, Automation, Cost Optimization, Scalability, MLOps, Serverless Inference. |
Field | Computer > Artificial Intelligence / Simulation / Virtual Reality |
Published In | Volume 13, Issue 4, October-December 2022 |
Published On | 2022-12-03 |
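
The abstract describes automating the pipeline with AWS Step Functions, with preprocessing on EMR followed by training on SageMaker. A minimal sketch of what such a workflow definition might look like is given below, written as a Python dictionary in Amazon States Language. The state names, the two-step shape, and the omitted `Parameters` blocks are assumptions made for illustration and are not taken from the paper; a deployable definition would also need cluster, job, and IAM role details.

```python
import json

# Hypothetical skeleton of an Amazon States Language (ASL) definition.
# State names are illustrative assumptions; "Parameters" blocks (cluster ID,
# training job config, IAM roles) are deliberately omitted.
PIPELINE_DEFINITION = {
    "Comment": "Sketch: Spark preprocessing on EMR, then SageMaker training",
    "StartAt": "PreprocessOnEMR",
    "States": {
        "PreprocessOnEMR": {
            "Type": "Task",
            # Service integration that submits an EMR step and waits for it.
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Next": "TrainOnSageMaker",
        },
        "TrainOnSageMaker": {
            "Type": "Task",
            # Service integration that starts a training job and waits for it.
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "End": True,
        },
    },
}


def execution_order(definition):
    """Follow "Next" links from "StartAt" and return state names in order.

    Only handles linear workflows (no Choice or Parallel states).
    """
    order = []
    seen = set()
    name = definition["StartAt"]
    while True:
        if name in seen:
            raise ValueError(f"cycle detected at state {name!r}")
        seen.add(name)
        order.append(name)
        state = definition["States"][name]
        if state.get("End"):
            return order
        name = state["Next"]


if __name__ == "__main__":
    print(" -> ".join(execution_order(PIPELINE_DEFINITION)))
    # The JSON form is what would be supplied as a state machine definition.
    print(json.dumps(PIPELINE_DEFINITION, indent=2))
```

The `.sync` suffix on each resource asks Step Functions to wait for the EMR step or training job to finish before moving on, which is what lets a single state machine sequence preprocessing and training without custom polling code.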
CrossRef DOI is assigned to each research paper published in our journal. IJSAT DOI prefix is 10.71097/IJSAT.
All research papers published on this website are licensed under Creative Commons Attribution-ShareAlike 4.0 International License, and all rights belong to their respective authors/researchers.
