
Aegis - Autonomous SRE Guardian is an AI-powered self-healing cloud operations system that monitors infrastructure, detects incidents, diagnoses likely root causes, executes safe remediation actions, and generates post-mortem reports automatically. The project uses a three-agent workflow: a Monitor Agent watches real-time logs and AMD cloud metrics, a Diagnosis Agent compares anomalies with historical incident memory, and a Remediation Agent applies predefined fixes over SSH. A React dashboard shows live logs, system health, agent status, remediation history, and incident timelines through WebSocket updates. The backend is built with FastAPI, Python, Paramiko, ReportLab, Pydantic, and vector-style incident storage. The frontend uses React, TypeScript, Vite, Tailwind CSS, and Recharts. Aegis fits the AI Agents & Agentic Workflows track because it demonstrates a real end-to-end agent loop for cloud reliability. For AMD, the system is designed to run against AMD cloud infrastructure today, with a clear path toward ROCm-powered open-source model inference and future fine-tuning on SRE runbooks and incident history.
10 May 2026