International Journal on Science and Technology
E-ISSN: 2229-7677
Impact Factor: 9.88
A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal
Fault Recovery and Resilience in Containerized Distributed Systems
| Author(s) | Kalesha Khan Pattan |
|---|---|
| Country | United States |
| Abstract | The rapid adoption of containerized distributed systems in modern cloud infrastructures has significantly enhanced scalability and deployment efficiency, but it has also introduced new challenges in maintaining system resilience and fault recovery. Traditional fault-tolerance mechanisms, such as manual failover and static replication, are inadequate in dynamic, large-scale environments where faults can arise unpredictably across multiple containers, nodes, or network layers. This research focuses on developing an intelligent, self-adaptive fault recovery and resilience framework designed specifically for containerized distributed architectures. The proposed model integrates continuous fault detection, real-time monitoring, and automated recovery processes to minimize service disruption and performance degradation during failures. By leveraging container orchestration capabilities and adaptive fault management, the system achieves rapid detection of anomalies and initiates proactive recovery actions such as container rescheduling, service replication, and resource reallocation. The study evaluates the framework’s performance using multiple metrics, including Mean Time Between Failures (MTBF), Mean Time to Repair (MTTR), service availability, and request success rate. Experimental results demonstrate that the proposed approach significantly improves MTBF, indicating enhanced reliability and reduced fault frequency. Compared to static recovery methods, the model achieves faster recovery times, higher availability, and better resource efficiency through automated, coordinated fault handling. Moreover, the framework ensures data consistency and state synchronization across distributed nodes, preventing cascading failures and minimizing downtime. The proposed approach provides a scalable and platform-agnostic solution suitable for complex, multi-cluster container environments. Overall, the findings confirm that intelligent fault recovery and resilience mechanisms can transform distributed containerized systems into more dependable, self-healing infrastructures capable of maintaining operational continuity under diverse fault scenarios. This work contributes to the advancement of fault-tolerant cloud-native systems, offering a foundation for future research in predictive recovery, anomaly detection, and resilience optimization using adaptive algorithms and decentralized coordination models. |
| Keywords | Fault, Recovery, Resilience, Containers, Distributed, Systems, Reliability, Availability, Automation, Orchestration, Monitoring, Detection, Restoration, Scalability, Efficiency, Continuity, Adaptation. |
| Published In | Volume 14, Issue 3, July-September 2023 |
| Published On | 2023-08-05 |
| DOI | https://doi.org/10.71097/IJSAT.v14.i3.9076 |
| Short DOI | https://doi.org/g98f42 |
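
The abstract names four evaluation metrics: MTBF, MTTR, service availability, and request success rate. As a reading aid, the sketch below shows how these quantities are conventionally computed from an incident log. It is not the paper's evaluation code: the `Incident` class, the `reliability_metrics` function, and all input numbers are hypothetical and chosen only for illustration. It assumes MTBF is total uptime divided by the number of failures, MTTR is total downtime divided by the number of failures, and availability is MTBF / (MTBF + MTTR).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """One outage, in seconds since the start of the observation window (hypothetical type)."""
    failed_at: float
    restored_at: float


def reliability_metrics(incidents, window_seconds, total_requests, successful_requests):
    """Compute MTBF, MTTR, availability, and request success rate.

    Assumes non-overlapping incidents that lie fully inside the window.
    """
    if not incidents:
        raise ValueError("at least one incident is needed to compute MTBF and MTTR")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    downtime = sum(i.restored_at - i.failed_at for i in incidents)
    uptime = window_seconds - downtime
    failures = len(incidents)

    mtbf = uptime / failures    # mean operating time between failures
    mttr = downtime / failures  # mean time to restore service
    return {
        "mtbf_s": mtbf,
        "mttr_s": mttr,
        "availability": mtbf / (mtbf + mttr),  # equals uptime / window_seconds
        "success_rate": successful_requests / total_requests,
    }


# Illustrative inputs only (not results from the paper): a 24-hour window
# with two outages lasting 60 s and 180 s.
example = reliability_metrics(
    incidents=[Incident(3_600, 3_660), Incident(40_000, 40_180)],
    window_seconds=86_400,
    total_requests=100_000,
    successful_requests=99_500,
)
```

Under these definitions, shortening recovery time lowers MTTR and raises availability even when the number of failures is unchanged, whereas MTBF improves only when failures become less frequent.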
A CrossRef DOI is assigned to each research paper published in this journal. The IJSAT DOI prefix is 10.71097/IJSAT.
All research papers published on this website are licensed under Creative Commons Attribution-ShareAlike 4.0 International License, and all rights belong to their respective authors/researchers.