Integrating Machine Learning into Big Data Pipelines: A Case Study with AWS SageMaker and EMR

Naga Surya Teja Thallam

Author(s)	Naga Surya Teja Thallam
Country	United States
Abstract	Organizations that handle large volumes of structured and unstructured data increasingly need to integrate machine learning (ML) into their big data pipelines. Cloud platforms such as Amazon Web Services (AWS) offer scalable solutions for both big data and ML workflows. This work studies the integration of AWS SageMaker, a fully managed ML service, with Amazon EMR (Elastic MapReduce), a distributed big data processing platform, to build an efficient, scalable, and automated ML pipeline. An experimental case study evaluates the proposed approach in terms of architectural design, performance benchmarks, automation strategies, and cost optimization for AWS-based big data ML workflows. Distributed data processing with Apache Spark on EMR significantly improves preprocessing efficiency, while SageMaker’s managed training framework reduces model training time by 34%. Using AWS Spot Instances reduces costs by 50%, which makes cloud-based ML solutions more affordable to operate. To address data transfer bottlenecks, auto-scaling inefficiencies, and inference latency, the study proposes AWS Data Wrangler for seamless integration, Bayesian hyperparameter tuning, and serverless inference with AWS Lambda. ML workflows are further automated with AWS Step Functions, with CloudWatch used for monitoring. The research concludes that combining AWS SageMaker and EMR yields a scalable and cost-effective big data ML pipeline, and it outlines future directions including multi-cloud interoperability, federated learning, and real-time ML processing. These findings are of value to enterprises, researchers, and cloud practitioners who aim to use ML-driven big data analytics to improve their operations.
Keywords	Big Data Analytics, Machine Learning, AWS SageMaker, Amazon EMR, Distributed Computing, Cloud Computing, Apache Spark, Model Training, Automation, Cost Optimization, Scalability, MLOps, Serverless Inference.
Field	Computer > Artificial Intelligence / Simulation / Virtual Reality
Published In	Volume 13, Issue 4, October-December 2022
Published On	2022-12-03



International Journal on Science and Technology

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal
