
International Journal on Science and Technology
E-ISSN: 2229-7677

Enabling Scalable GPU Clusters for Distributed Deep Learning in the Cloud
Author(s) | Srikanth Jonnakuti |
---|---|
Country | United States |
Abstract | The rapid pace of deep learning demands efficient, scalable training frameworks that can handle large datasets and complex models. This paper presents design patterns for provisioning and managing multi-GPU clusters, specifically on platforms such as AWS EC2 P3 instances, to train large convolutional neural networks (CNNs) and recurrent neural networks (RNNs) at scale. Key strategies include multi-source streaming broadcast, GPU-specialized parameter servers, distributed training frameworks, and scalable scheduling systems that maximize resource utilization and performance. Emphasis is placed on efficient data sharding techniques that balance load and minimize communication overhead, thereby accelerating convergence and improving throughput. Fault tolerance techniques such as checkpointing and dynamic resource management are outlined to ensure training continuity in the event of hardware or network failure. A comparative analysis of frameworks such as GeePS, CNTK, Nexus, and DeCUVE demonstrates the practical trade-offs among latency, scalability, and energy efficiency across various cluster configurations. Cost-effectiveness strategies for using cross-region GPU spot instances in deep learning applications are also analyzed. Topology-aware scheduling and edge-cloud distributed training paradigms are explored to further improve system resilience and training effectiveness. The paper offers actionable insights and best practices for researchers and practitioners deploying resilient, scalable deep learning architectures in modern cloud environments. |
Field | Engineering |
Published In | Volume 9, Issue 3, July-September 2018 |
Published On | 2018-07-06 |
DOI | https://doi.org/10.71097/IJSAT.v9.i3.5273 |
Short DOI | https://doi.org/g9pgfs |

All research papers published on this website are licensed under Creative Commons Attribution-ShareAlike 4.0 International License, and all rights belong to their respective authors/researchers.
