Node Failure Prediction in Large-Scale Computing Systems via Machine Learning

Event Details

Wednesday, May 1, 2019
12-1 p.m.

Today's computing systems encounter faults on a daily basis. Predicting which node will fail, and how soon, remains a challenge that must be solved to pave the way for proactive remedies before jobs fail. Scale, complex system design, and diverse unstructured log sources have made data mining-based failure diagnosis non-trivial. Existing anomaly detection schemes fall short on lead time analysis, an indispensable requirement for real-time prediction. While high lead times are helpful, a low false positive rate is also a necessity for production systems. This talk will highlight some of the effective methods developed for achieving lead times ahead of node failures, enabling predictive localization of failing nodes in large-scale computing infrastructures.

Bio: Anwesha Das is a final-year computer science PhD student at NC State University. She has been working on proactive fault-tolerant solutions for large-scale computing systems that leverage machine learning techniques. Her interests include various aspects of the reliability and performance of computing systems, as well as quantum computing.