ML Driven Root Cause Analysis (RCA) in Telco Microservices Continuum
Enz S., Nigg S., Kalinagac O., Ermis O., Gür G.
2024 IEEE International Conference on Communications Workshops, ICC Workshops 2024, pp. 371-377, 2024
As microservice architectures continue to gain popularity, effective monitoring and Root Cause Analysis (RCA) solutions become crucial. This paper presents a Machine Learning Based Root Cause Analysis Framework (MALEAF) for performing RCA of service failures in microservice environments. Our approach leverages metrics collected periodically from the microservices, including latency, availability, CPU load, and memory usage. The framework first employs an anomaly detection mechanism to identify abnormal behavior in the system. Upon detecting an anomaly, two ML algorithms, Support Vector Machine (SVM) and Random Forest, are used to perform RCA. These algorithms predict the probability that a specific service is the root cause of a traffic, reliability, or performance anomaly. Moreover, our framework extends the analysis to predict the fault type associated with the root cause, along with the corresponding probabilities. Through evaluation in an IoT-oriented testbed, we demonstrate the effectiveness and accuracy of our proposed framework. The results highlight MALEAF's ability to accurately identify root causes and classify fault types, thereby facilitating efficient troubleshooting and proactive system management.
doi:10.1109/ICCWorkshops59551.2024.10615702
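
The two-stage flow described in the abstract (detect an anomaly first, then assign each service a probability of being the root cause) can be illustrated with a minimal sketch. This is not the MALEAF implementation: the service names, metric values, z-score detector, and threshold are all hypothetical, and the softmax ranking is only a stand-in for the trained SVM and Random Forest classifiers used in the paper. Fault-type classification is omitted.

```python
import math
from statistics import mean, stdev

# The four metric families named in the abstract; key names are hypothetical.
METRICS = ("latency_ms", "availability", "cpu_load", "memory_mb")


def z_scores(baseline, current):
    """Absolute z-score of each current metric against its baseline samples."""
    scores = {}
    for name in METRICS:
        mu = mean(baseline[name])
        sigma = stdev(baseline[name]) or 1e-9  # guard against a constant baseline
        scores[name] = abs(current[name] - mu) / sigma
    return scores


def detect_anomaly(service_scores, threshold=3.0):
    """Stage 1: flag the system if any metric of any service exceeds the threshold."""
    return any(
        z > threshold
        for scores in service_scores.values()
        for z in scores.values()
    )


def root_cause_probabilities(service_scores):
    """Stage 2 stand-in: softmax over each service's largest z-score.

    The paper uses trained SVM and Random Forest models here; this heuristic
    only mimics the shape of their output (one probability per service).
    """
    peak = {svc: max(scores.values()) for svc, scores in service_scores.items()}
    top = max(peak.values())  # subtract the maximum for numerical stability
    weights = {svc: math.exp(value - top) for svc, value in peak.items()}
    total = sum(weights.values())
    return {svc: weight / total for svc, weight in weights.items()}


# Hypothetical baseline samples and one current observation per service.
baseline = {
    "gateway": {
        "latency_ms": [20, 22, 21, 19, 23],
        "availability": [0.99, 1.0, 0.98, 1.0, 0.99],
        "cpu_load": [0.30, 0.32, 0.28, 0.31, 0.29],
        "memory_mb": [200, 205, 198, 202, 201],
    },
    "sensor-db": {
        "latency_ms": [40, 42, 38, 41, 39],
        "availability": [0.99, 1.0, 0.98, 1.0, 0.99],
        "cpu_load": [0.50, 0.52, 0.48, 0.51, 0.49],
        "memory_mb": [800, 805, 798, 802, 801],
    },
}
current = {
    "gateway": {"latency_ms": 22, "availability": 0.99, "cpu_load": 0.31, "memory_mb": 203},
    "sensor-db": {"latency_ms": 95, "availability": 0.99, "cpu_load": 0.51, "memory_mb": 803},
}

scores = {svc: z_scores(baseline[svc], current[svc]) for svc in baseline}
anomalous = detect_anomaly(scores)
probs = root_cause_probabilities(scores) if anomalous else {}

print("anomaly detected:", anomalous)
for svc, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{svc}: {p:.3f}")
```

In this toy data only the latency of the hypothetical `sensor-db` service departs from its baseline, so stage 1 fires and stage 2 ranks that service first. A closer approximation of the paper's approach would replace `root_cause_probabilities` with classifiers trained on labelled failure data.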