MIDS Capstone Project Summer 2024

GabbleGrid: Self Healing Clouds with AI Agents

Developed as a UC Berkeley graduate capstone initiative, GabbleGrid transforms cloud service management with autonomous AI agents that:

Analyze vast amounts of log data in real-time
Select and optimize machine learning models for precise anomaly detection
Execute proactive measures to prevent service disruptions

GabbleGrid addresses service outages in complex IT infrastructures by applying recent advances in AI and machine learning. Our autonomous agents perform real-time monitoring, predictive maintenance, and proactive anomaly detection to reduce the frequency and impact of service disruptions and keep operations continuous and reliable.

https://drive.google.com/file/d/1rZ9pGc_a786vEY9UJeyrRSvj4A3cG3Yq/view?usp=sharing

Key Learnings and Impact
We recognized the inherent challenges in dealing with log data, which is often unstructured, messy, and unreliable. Structuring and cleaning it effectively required advanced data preprocessing techniques and state-of-the-art NLP methods. A critical insight was the importance of precision in anomaly detection: fewer false positives make alerts more actionable and help maintain trust in the alerting system.
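As a rough illustration of what structuring messy log lines can involve, the sketch below masks variable tokens (IP addresses, hex values, numbers) so that lines describing the same event collapse to one template. This is a minimal sketch and not the project's actual preprocessing, which is not detailed here; the function name `normalize_log_line`, the masking rules, and the placeholder tokens are all assumptions made for the example.

```python
import re

# Hypothetical masking rules, chosen only for illustration.
# Order matters: more specific patterns run before the generic number rule.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),   # IPv4-like addresses
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),           # hex literals
    (re.compile(r"\b\d+\b"), "<NUM>"),                      # standalone integers
]


def normalize_log_line(line: str) -> str:
    """Collapse whitespace and replace variable tokens with placeholders."""
    text = " ".join(line.split())
    for pattern, placeholder in _MASKS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    print(normalize_log_line("Connection from 10.0.0.5   port 5432 failed"))
    print(normalize_log_line("Read error at 0x1f3a after 3 retries"))
```

Lines that differ only in addresses or counters map to the same string after masking, which makes them easier to group or tokenize before they reach a model.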

By focusing on precision, we ensured that raised alerts are likely to be true anomalies, minimizing disruption and wasted resources. The result is a more resilient and reliable service, with service disruptions that are less frequent and less severe.

GabbleGrid’s autonomous AI agents offer real-time monitoring, predictive maintenance, and proactive anomaly detection, providing a comprehensive solution to the service downtime issue. This scalable and effective solution meets the growing needs of complex IT infrastructures, positioning GabbleGrid not only as a technological innovation but also as a critical player in the market for enhancing service reliability and operational efficiency.

Evaluation
Our model evaluation focuses on balancing precision and recall in anomaly detection. This balance is critical to our system: false positives can lead to significant disruption and resource wastage, while false negatives mean missing actual issues that can cause service disruptions. By leveraging an encoder-only transformer model with a classification head, we achieve high accuracy in identifying true anomalies, minimizing false positives and enhancing the overall reliability of our system.
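To make the precision and recall trade-off concrete, here is a minimal sketch that computes both from binary labels (1 = anomaly, 0 = normal), along with an F-beta score as one common way to weight precision more heavily than recall. The function names, the toy labels, and the choice of beta = 0.5 are assumptions for illustration; the metric actually used to combine the two is not stated here.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 marks an anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data: three real anomalies, and a cautious detector that flags only one.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
    p, r = precision_recall(y_true, y_pred)
    print(f"precision={p:.2f} recall={r:.2f} f0.5={f_beta(p, r):.2f}")
```

In this toy case every alert is a true anomaly, so precision is perfect, but two of the three anomalies are missed. That is the cost a precision-focused detector has to weigh against alert fatigue from false positives.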

Acknowledgements
We extend our heartfelt gratitude to our advisors, Joyce Shen and Korin Reid, for their invaluable mentorship and guidance throughout this project. We also deeply appreciate the significant contributions of Mark Butler, whose efforts were instrumental to the success of this endeavor. Their combined expertise and support have been crucial in navigating challenges and achieving our goals. We are sincerely thankful for their dedication and encouragement.