
AtlasOps is a production-style incident-response stack built for the AMD Developer Hackathon. A coordinator runs four complementary LLM agents (Triage, Diagnosis, Remediation, and Comms) that call real APIs instead of mocking “tickets.” Alerts come from live Prometheus / Alertmanager on Google Kubernetes Engine running Online Boutique, with faults injected through Chaos Mesh scenarios (single-fault, cascade, and named historical replays). The agents have paths into Jaeger, kubectl, Argo CD (including real rollback), and cloud logging. A human-in-the-loop approval gate pauses dangerous remediation until an operator approves it or the request times out; the gate is surfaced in the UI and over Discord webhooks for demos. Models are served with vLLM on AMD MI300X: co-hosted Qwen2.5-7B models for the agent roles, plus a 72B judge/designer lane for adversarial curricula. The training story pairs supervised fine-tuning (SFT) with online GRPO via TRL under ROCm, with BitsAndBytes-ROCm / AWQ and Hugging Face Optimum-AMD on the inference path. The runnable demo lives on Hugging Face Spaces, wired to MI300X inference secrets. Judges can follow JUDGES_START_HERE.md and the open MIT-licensed repo for architecture, manifests, and evidence.
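The approval gate described above can be illustrated with a minimal Python sketch. This is not taken from the AtlasOps repo: the names `ApprovalGate`, `Decision`, and `run_remediation` are hypothetical, and the sketch assumes that a timeout is treated as "do not execute," which the description does not state explicitly.

```python
import threading
from enum import Enum


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    TIMED_OUT = "timed_out"


class ApprovalGate:
    """Hypothetical gate: blocks until an operator decides or the timeout expires."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._event = threading.Event()
        self._decision = Decision.TIMED_OUT

    def resolve(self, approved: bool) -> None:
        # Called from the operator-facing side (e.g. a UI or webhook handler).
        self._decision = Decision.APPROVED if approved else Decision.REJECTED
        self._event.set()

    def wait(self) -> Decision:
        # Event.wait returns True if the event was set before the timeout.
        if self._event.wait(self.timeout_s):
            return self._decision
        return Decision.TIMED_OUT


def run_remediation(action, gate: ApprovalGate, dangerous: bool) -> str:
    """Run `action` directly if safe; otherwise only after explicit approval."""
    if dangerous:
        decision = gate.wait()
        if decision is not Decision.APPROVED:
            # Assumption: rejection and timeout both skip the action.
            return f"skipped ({decision.value})"
    return action()
```

In this sketch a remediation such as a rollback would be wrapped as `run_remediation(do_rollback, ApprovalGate(timeout_s=300), dangerous=True)`, with `resolve()` called from whatever channel the operator uses. Defaulting to "skip" on timeout keeps the failure mode conservative.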
10 May 2026