Fault Recovery and Resilience in Containerized Distributed Systems

Kalesha Khan Pattan

doi:10.71097/IJSAT.v14.i3.9076

Author(s)	Kalesha Khan Pattan
Country	United States
Abstract	The rapid adoption of containerized distributed systems in modern cloud infrastructures has improved scalability and deployment efficiency, but it has also introduced new challenges in maintaining system resilience and recovering from faults. Traditional fault-tolerance mechanisms, such as manual failover and static replication, are inadequate in dynamic, large-scale environments where faults can arise unpredictably across containers, nodes, and network layers. This research develops an intelligent, self-adaptive fault recovery and resilience framework designed specifically for containerized distributed architectures. The proposed model integrates continuous fault detection, real-time monitoring, and automated recovery to minimize service disruption and performance degradation during failures. By leveraging container orchestration capabilities and adaptive fault management, the system detects anomalies rapidly and initiates proactive recovery actions such as container rescheduling, service replication, and resource reallocation. The study evaluates the framework's performance using four metrics: Mean Time Between Failures (MTBF), Mean Time to Repair (MTTR), service availability, and request success rate. Experimental results show that the proposed approach increases MTBF relative to static recovery methods, indicating enhanced reliability and reduced fault frequency. The model also achieves faster recovery times, higher availability, and better resource efficiency through automated, coordinated fault handling. Moreover, the framework maintains data consistency and state synchronization across distributed nodes, which helps prevent cascading failures and minimizes downtime. The proposed approach provides a scalable and platform-agnostic solution suitable for complex, multi-cluster container environments.
Overall, the findings confirm that intelligent fault recovery and resilience mechanisms can transform distributed containerized systems into more dependable, self-healing infrastructures capable of maintaining operational continuity under diverse fault scenarios. This work contributes to the advancement of fault-tolerant cloud-native systems, offering a foundation for future research in predictive recovery, anomaly detection, and resilience optimization using adaptive algorithms and decentralized coordination models.
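The abstract names four evaluation metrics but does not define them. As a minimal sketch, assuming the conventional definitions (MTBF as total uptime divided by the number of failures, MTTR as total downtime divided by the number of failures, and availability as MTBF / (MTBF + MTTR)), they could be computed from an outage log as follows. The `Outage` record, the `reliability_metrics` function, and all sample figures are hypothetical illustrations, not the paper's tooling or results.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One failure: when it began and when service was restored (seconds)."""
    start: float
    end: float


def reliability_metrics(window_s, outages, requests_total, requests_ok):
    """Compute MTBF, MTTR, availability, and request success rate.

    Uses the conventional textbook definitions; the paper may measure
    these quantities differently.
    """
    if not outages:
        raise ValueError("at least one outage is needed to estimate MTBF and MTTR")
    downtime = sum(o.end - o.start for o in outages)
    uptime = window_s - downtime
    failures = len(outages)
    mtbf = uptime / failures      # mean operating time between failures
    mttr = downtime / failures    # mean time to restore service
    return {
        "mtbf_s": mtbf,
        "mttr_s": mttr,
        "availability": mtbf / (mtbf + mttr),
        "success_rate": requests_ok / requests_total,
    }


# Hypothetical 24-hour observation window with two outages (made-up numbers).
metrics = reliability_metrics(
    window_s=86_400,
    outages=[Outage(3_600, 3_720), Outage(50_000, 50_240)],
    requests_total=1_000_000,
    requests_ok=997_500,
)
```

Under these definitions, availability reduces to uptime divided by the observation window, so shortening MTTR raises availability even when the number of failures stays the same.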
Keywords	Fault, Recovery, Resilience, Containers, Distributed, Systems, Reliability, Availability, Automation, Orchestration, Monitoring, Detection, Restoration, Scalability, Efficiency, Continuity, Adaptation.
Published In	Volume 14, Issue 3, July-September 2023
Published On	2023-08-05
DOI	https://doi.org/10.71097/IJSAT.v14.i3.9076


International Journal on Science and Technology
