International Journal on Science and Technology

E-ISSN: 2229-7677     Impact Factor: 9.88

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal


Fault Recovery and Resilience in Containerized Distributed Systems

Author(s) Kalesha Khan Pattan
Country United States
Abstract The rapid adoption of containerized distributed systems in modern cloud infrastructures has significantly enhanced scalability and deployment efficiency, but it has also made it harder to maintain resilience and recover from faults. Traditional fault-tolerance mechanisms, such as manual failover and static replication, are inadequate in dynamic, large-scale environments where faults can arise unpredictably across multiple containers, nodes, or network layers. This research develops an intelligent, self-adaptive fault recovery and resilience framework designed specifically for containerized distributed architectures. The proposed model integrates continuous fault detection, real-time monitoring, and automated recovery processes to minimize service disruption and performance degradation during failures. By leveraging container orchestration capabilities and adaptive fault management, the system detects anomalies rapidly and initiates proactive recovery actions such as container rescheduling, service replication, and resource reallocation. The study evaluates the framework’s performance using multiple metrics, including Mean Time Between Failures (MTBF), Mean Time to Repair (MTTR), service availability, and request success rate. Experimental results demonstrate that the proposed approach significantly improves MTBF, indicating enhanced reliability and reduced fault frequency. Compared to static recovery methods, the model achieves faster recovery times, higher availability, and better resource efficiency through automated, coordinated fault handling. Moreover, the framework ensures data consistency and state synchronization across distributed nodes, preventing cascading failures and minimizing downtime. The proposed approach provides a scalable and platform-agnostic solution suitable for complex, multi-cluster container environments.
Overall, the findings confirm that intelligent fault recovery and resilience mechanisms can transform distributed containerized systems into more dependable, self-healing infrastructures capable of maintaining operational continuity under diverse fault scenarios. This work contributes to the advancement of fault-tolerant cloud-native systems, offering a foundation for future research in predictive recovery, anomaly detection, and resilience optimization using adaptive algorithms and decentralized coordination models.
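The abstract names four evaluation metrics without stating how they are computed. As a rough illustration only, the sketch below derives them from a list of outage intervals using the conventional definitions: MTBF as uptime per failure, MTTR as downtime per failure, and availability as uptime over the observation window. The `Outage` type, the function name `reliability_metrics`, and the sample numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One service outage (hypothetical record type, not from the paper)."""
    start: float  # seconds since the observation window began
    end: float    # seconds at which service was restored


def reliability_metrics(window_s, outages, total_requests, ok_requests):
    """Compute MTBF, MTTR, availability and request success rate.

    Uses the conventional textbook definitions; the paper's exact
    formulas are not given in the abstract, so this is an assumption.
    """
    if window_s <= 0 or total_requests <= 0:
        raise ValueError("window_s and total_requests must be positive")

    downtime = sum(o.end - o.start for o in outages)
    uptime = window_s - downtime
    failures = len(outages)

    # With no failures in the window, MTBF is unbounded and MTTR is zero.
    mtbf = uptime / failures if failures else float("inf")
    mttr = downtime / failures if failures else 0.0

    return {
        "mtbf_s": mtbf,
        "mttr_s": mttr,
        "availability": uptime / window_s,
        "success_rate": ok_requests / total_requests,
    }


# Illustrative numbers only: a 1000 s window with two outages.
sample = reliability_metrics(
    window_s=1000.0,
    outages=[Outage(100.0, 120.0), Outage(500.0, 530.0)],
    total_requests=1000,
    ok_requests=990,
)
print(sample)
```

Under these definitions availability also equals MTBF / (MTBF + MTTR), so shortening recovery time raises availability even when the failure rate stays the same.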
Keywords Fault, Recovery, Resilience, Containers, Distributed, Systems, Reliability, Availability, Automation, Orchestration, Monitoring, Detection, Restoration, Scalability, Efficiency, Continuity, Adaptation.
Published In Volume 14, Issue 3, July-September 2023
Published On 2023-08-05
DOI https://doi.org/10.71097/IJSAT.v14.i3.9076
Short DOI https://doi.org/g98f42
