ReplayLab: GPU Experiment Flight Recorder

Created by team Latency Locksmith on May 05, 2026
AI Agents & Agentic Workflows (Best Track for Beginners), Hugging Face, Qwen

ReplayLab is an autonomous recovery agent for GPU experiments. When a vLLM serving job crashes on an AMD Instinct MI300X, whether from memory pressure, a context-length violation, or a timeout, ReplayLab detects the failure, diagnoses the root cause, generates a targeted fix, and replays the corrected experiment, with no human in the loop.

The agent runs an eight-step closed loop. Its steps include detecting the failure; capturing evidence (logs, config, and GPU telemetry via rocm-smi); classifying the failure against a 10-pattern vLLM/ROCm failure taxonomy; running an LLM-powered diagnosis with Qwen2.5-7B-Instruct, served by vLLM on the same MI300X; planning a minimal parameter change; replaying the experiment; and verifying recovery.

Benchmarks measured on a real MI300X: 227 tok/sec sustained throughput, 2,931 tok/sec aggregate at 16x concurrency, 604 ms LLM diagnosis latency, and sub-second time-to-first-token, all on a single MI300X with 192 GB HBM3, ROCm 7.2.0, and vLLM 0.17.1.

We chose a 7B model deliberately: diagnostic agents need speed, not scale. Qwen2.5-7B fits in 14 GB, runs at full float16 precision with no quantization, and leaves 155 GB free for the workloads being diagnosed. Each recovery cycle costs $0.14 in GPU time, compared to ~$150 and 2 hours of manual debugging. That's about 1,071x cheaper ($150 / $0.14) and 28x faster.

Every recovery produces a structured evidence trail (before/after metrics, GPU memory snapshots, the LLM diagnosis, and the full agent reasoning trace), giving engineers an auditable record they can trust and reproduce. Open source, MIT licensed, 38 tests passing. Built for Track 1: AI Agents & Agentic Workflows.
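
To make the classify and plan steps of the loop concrete, here is a minimal sketch in Python. It is not ReplayLab's actual code. The three regex patterns are an illustrative subset, not the project's 10-pattern taxonomy, and the names `FailurePattern`, `classify`, `plan_fix`, `changed_keys`, and the `request_timeout_s` key are hypothetical. `max_num_seqs` and `max_model_len` are named after vLLM engine arguments, and the 32768-token ceiling is an assumed model limit.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailurePattern:
    name: str
    regex: str


# Illustrative subset only; the real 10-pattern taxonomy is not shown in the source.
TAXONOMY = [
    FailurePattern("oom", r"out of memory"),
    FailurePattern("context_length", r"maximum context length"),
    FailurePattern("timeout", r"timed? ?out"),
]


def classify(log_text: str) -> Optional[str]:
    """Return the name of the first taxonomy pattern found in the log, or None."""
    for pattern in TAXONOMY:
        if re.search(pattern.regex, log_text, re.IGNORECASE):
            return pattern.name
    return None


def plan_fix(failure: str, config: dict, model_limit: int = 32768) -> dict:
    """Return a copy of config with exactly one parameter changed.

    The specific adjustments (halving or doubling) are assumptions chosen
    for illustration, not the project's documented policy.
    """
    fixed = dict(config)  # never mutate the caller's config
    if failure == "oom":
        # Fewer concurrent sequences means less KV-cache pressure.
        fixed["max_num_seqs"] = max(1, config["max_num_seqs"] // 2)
    elif failure == "context_length":
        # Allow longer prompts, capped at the assumed model limit.
        fixed["max_model_len"] = min(model_limit, config["max_model_len"] * 2)
    elif failure == "timeout":
        # Hypothetical client-side timeout, in seconds.
        fixed["request_timeout_s"] = config["request_timeout_s"] * 2
    else:
        raise ValueError(f"no fix rule for failure type: {failure!r}")
    return fixed


def changed_keys(before: dict, after: dict) -> list:
    """List the keys whose values differ, for the before/after evidence trail."""
    return sorted(k for k in after if before.get(k) != after[k])


if __name__ == "__main__":
    config = {"max_model_len": 4096, "max_num_seqs": 16, "request_timeout_s": 30}
    log = "torch.OutOfMemoryError: HIP out of memory. Tried to allocate 2.00 GiB"
    failure = classify(log)
    replay_config = plan_fix(failure, config)
    print(failure, changed_keys(config, replay_config), replay_config)
```

Changing one key per cycle keeps the before/after comparison easy to attribute: if the replay succeeds, the single changed parameter is the likely cause of recovery. A real agent would presumably pass the classification result and captured evidence to the LLM diagnosis step before committing to a fix.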
