.png&w=828&q=75)
InfraAgent ReplayOps is an evidence-verified incident triage agent for SRE and platform teams. Instead of returning a generic AI summary, it turns a noisy production alert into a replayable War Room Packet that a judge or operator can audit. The public Hugging Face Space is a thin UI. It sends an alert to a FastAPI polling API exposed through Cloudflare Tunnel. The backend runs a bounded five-node LangGraph workflow: Alert Ingestor, Investigator, Root Cause Analyst, Runbook Generator, and Verification Gate. The Investigator gathers deterministic observability evidence across metrics, logs, deploy events, traces, and topology. The Root Cause Analyst ranks the most likely cause using stable evidence IDs, and the Runbook Generator creates safe, human-approved recovery steps. The AMD target runtime is Qwen2.5-72B-Instruct served locally with vLLM on AMD Developer Cloud MI300X and ROCm. The app includes a runtime truth contract, so it shows whether the run used live vLLM or an honest fallback. The demo also exposes node traces, evidence timeline, business impact, owner routing, a deterministic eval scorecard, and a War Room Packet with an audit seal.
10 May 2026