International Journal on Science and Technology

E-ISSN: 2229-7677     Impact Factor: 9.88

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal

Enabling Scalable GPU Clusters for Distributed Deep Learning in the Cloud

Author(s) Srikanth Jonnakuti
Country United States
Abstract The rapid pace of deep learning demands efficient, scalable training frameworks that can handle large datasets and complex models. This paper presents design patterns for provisioning and managing multi-GPU clusters, specifically on platforms such as AWS EC2 P3 instances, to train large convolutional neural networks (CNNs) and recurrent neural networks (RNNs) at scale. Key strategies include multi-source streaming broadcast, GPU-specialized parameter servers, distributed training frameworks, and scalable scheduling systems that maximize resource utilization and performance. Emphasis is placed on efficient data sharding techniques that balance load and minimize communication overhead, thereby accelerating convergence and improving throughput. Fault tolerance techniques such as checkpointing and dynamic resource management are outlined to ensure training continuity in the event of hardware or network failure. A comparative analysis of frameworks such as GeePS, CNTK, Nexus, and DeCUVE demonstrates the practical trade-offs among latency, scalability, and energy efficiency across various cluster configurations. Cost-effectiveness strategies for using cross-region GPU spot instances are analyzed for deep learning applications, and topology-aware scheduling and edge-cloud distributed training paradigms are explored to further improve system resilience and training effectiveness. The paper offers actionable insights and best practices for researchers and practitioners deploying resilient, scalable deep learning architectures in modern cloud environments.
Field Engineering
Published In Volume 9, Issue 3, July-September 2018
Published On 2018-07-06
DOI https://doi.org/10.71097/IJSAT.v9.i3.5273
Short DOI https://doi.org/g9pgfs
