ROCmOps AI War Room

Created by team Adeel on May 10, 2026
AI Agents & Agentic Workflows (Best Track for Beginners)

ROCmOps AI War Room is an autonomous AI-powered incident response platform built on AMD MI300X GPUs using ROCm acceleration and AMD Developer Cloud infrastructure.

Modern AI infrastructure environments are increasingly complex and difficult to manage. Infrastructure incidents such as GPU memory failures, inference latency spikes, container instability, and workload bottlenecks often require manual investigation and remediation, resulting in increased downtime and operational overhead.

Our platform solves this problem using a multi-agent AI operations architecture capable of autonomous infrastructure monitoring, log analysis, root cause analysis, remediation generation, GPU optimization, and postmortem reporting. The system provides a real-time operational dashboard powered by Streamlit, displaying live ROCm GPU telemetry including GPU utilization, VRAM allocation, and thermal metrics streamed directly from AMD infrastructure.

The platform includes five specialized AI agents:
• Log Analysis Agent
• Root Cause Analysis Agent
• Remediation Agent
• GPU Optimization Agent
• Postmortem Report Agent

When an infrastructure incident is detected, the platform autonomously analyzes production logs, identifies root causes, generates remediation recommendations, applies ROCm optimization strategies, and creates enterprise-grade incident reports.

The solution demonstrates how autonomous AI operations can dramatically reduce infrastructure incident response time from hours to seconds while leveraging AMD GPU acceleration for scalable enterprise AI systems.

Technologies used include:
• AMD MI300X GPUs
• ROCm
• AMD Developer Cloud
• Python
• Streamlit
• Docker
• Autonomous AI Agent Architecture
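The five-agent flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Incident` class, the agent function names, and the rule-based bodies are all hypothetical placeholders standing in for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Incident:
    """Shared state passed from one agent to the next (hypothetical shape)."""
    logs: List[str]
    findings: Dict[str, str] = field(default_factory=dict)


def log_analysis(incident: Incident) -> str:
    # Placeholder: a real agent would hand the logs to a model.
    errors = [line for line in incident.logs if "ERROR" in line]
    return f"{len(errors)} error line(s)"


def root_cause(incident: Incident) -> str:
    # Placeholder rule standing in for model reasoning.
    if any("out of memory" in line.lower() for line in incident.logs):
        return "GPU memory exhaustion"
    return "unknown"


def remediation(incident: Incident) -> str:
    # Reads the finding recorded by the previous agent.
    if incident.findings["root_cause"] == "GPU memory exhaustion":
        return "reduce batch size and restart the inference container"
    return "escalate to an on-call engineer"


def gpu_optimization(incident: Incident) -> str:
    return "review VRAM allocation for the affected workload"


def postmortem(incident: Incident) -> str:
    # Summarizes everything the earlier agents recorded.
    return "; ".join(f"{k}: {v}" for k, v in incident.findings.items())


# Agents run in the order the description lists them.
PIPELINE: List[Tuple[str, Callable[[Incident], str]]] = [
    ("log_analysis", log_analysis),
    ("root_cause", root_cause),
    ("remediation", remediation),
    ("gpu_optimization", gpu_optimization),
    ("postmortem", postmortem),
]


def run_pipeline(logs: List[str]) -> Incident:
    incident = Incident(logs=list(logs))
    for name, agent in PIPELINE:
        incident.findings[name] = agent(incident)
    return incident
```

Calling `run_pipeline(["INFO start", "ERROR HIP out of memory on device 0"])` fills `findings` with one entry per agent, in pipeline order. Keeping the agents as plain functions over shared state is only one possible design; it makes each stage easy to swap for a real model call later.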

"The concept is good, but the implementation seems to be a simulation and uses fixed/demo telemetry rather than live system integration. The code seems to be mostly a scripted orchestration layer. The “agents” are thin wrappers, and run_agent() returns canned responses by matching prompt text rather than running a real model workflow. It would have been interesting to see how this performs on a live system. I'm also curious about the agent that can fix issues - that gives a lot of control to the agent, so I'm interested to know about guardrails to protect the system from errant agent behaviour. Your presentation PDF could use some images and a better template. I did enjoy the simplicity and the focus on the content rather than a very elaborate slide deck, but there is a balance. The video and presentation are an opportunity to pitch your idea. For your video, going straight to your demo page and hiding all the other tabs you have open would make it a little nicer (F11 for full-page browser). Great effort. I'd love to see this developed further."
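One possible shape for the guardrails the review asks about is sketched below, as an assumption rather than anything the project implements. The action names, the `ALLOWED_ACTIONS` and `NEEDS_APPROVAL` sets, and the `review_action` function are all hypothetical. The idea is that a remediation agent only proposes actions, and a separate check decides whether each one may run.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical allowlist: only these action names may ever be executed.
ALLOWED_ACTIONS = {"restart_container", "reduce_batch_size"}
# Hypothetical subset that additionally needs a human sign-off.
NEEDS_APPROVAL = {"restart_container"}


@dataclass(frozen=True)
class Action:
    name: str
    target: str


def review_action(
    action: Action,
    approver: Optional[Callable[[Action], bool]] = None,
    dry_run: bool = True,
) -> str:
    """Return a decision for an agent-proposed action without executing it."""
    if action.name not in ALLOWED_ACTIONS:
        return "rejected"
    if action.name in NEEDS_APPROVAL and (approver is None or not approver(action)):
        return "pending_approval"
    if dry_run:
        # Default mode: report what would happen, change nothing.
        return "dry_run"
    return "approved"
```

With these defaults, `review_action(Action("delete_volume", "gpu-node-1"))` is rejected outright, and a container restart stays pending until an approver callback returns `True`. Defaulting `dry_run` to `True` means an agent cannot change the system unless the caller opts in explicitly.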
