OTM 2013 - LNCS 8185-8186

Run-Time Root Cause Analysis in Adaptive Distributed Systems

Amit Raj, Stephen Barrett, and Siobhan Clarke

School of Computer Science and Statistics, Trinity College of Dublin, Ireland
araj@scss.tcd.ie
Stephen.Barrett@scss.tcd.ie
Siobhan.Clarke@scss.tcd.ie

Abstract. In a distributed environment, several components collaborate to deliver complex functionality. Adaptive distributed systems, an emerging trend, reconfigure themselves through the addition, removal, or update of components in order to cope with faults. Components are generally interdependent, so a fault propagates from one component to another. Existing root cause analysis techniques generally build a static fault dependency graph to identify the root fault. However, these dependencies keep changing as the system adapts, which makes design-time fault dependencies invalid at run-time. This paper describes the problem of deriving causal relationships between faults in adaptive distributed systems. It then presents a statechart-based solution that statically identifies the sequence of method executions in order to derive the causal relationships between faults at run-time. The evaluation found the approach to be highly scalable and time-efficient, so it can be used to reduce the Mean Time To Recover (MTTR) of a distributed system.

Keywords: distributed systems, root cause analysis, fault causal relationship, adaptive system, component-based system

LNCS 8186, p. 292 ff.
