ReplayLab: GPU Experiment Flight Recorder

Created by team Latency Locksmith on May 05, 2026

Hugging Face Spaces, AMD Developer Cloud, AMD ROCm, Claude Code

AI Agents & Agentic Workflows (Best Track for Beginners), Hugging Face, Qwen

ReplayLab is an autonomous recovery agent for GPU experiments. When a vLLM serving job crashes on AMD Instinct MI300X, whether from memory pressure, context length violations, or timeout failures, ReplayLab detects the failure, diagnoses the root cause, generates a targeted fix, and replays the corrected experiment. No human in the loop.

The agent runs an eight-step closed loop: detect failure, capture evidence (logs, config, GPU telemetry via rocm-smi), classify against a 10-pattern vLLM/ROCm failure taxonomy, run LLM-powered diagnosis using Qwen2.5-7B-Instruct served by vLLM on the same MI300X, plan a minimum parameter change, replay the experiment, and verify recovery.

Real MI300X benchmarks: 227 tok/sec sustained throughput, 2,931 tok/sec aggregate at 16x concurrency, 604 ms LLM diagnosis latency, and sub-second time-to-first-token, all on a single MI300X with 192 GB HBM3, ROCm 7.2.0, and vLLM 0.17.1.

We chose a 7B model deliberately: diagnostic agents need speed, not scale. Qwen 7B fits in 14 GB, runs at full float16 precision with no quantization, and leaves 155 GB free for the workloads being diagnosed.

Each recovery cycle costs $0.14 in GPU time, compared to ~$150 and 2 hours of manual debugging. That's 1,071x cheaper and 28x faster.

Every recovery produces a structured evidence trail (before/after metrics, GPU memory snapshots, LLM diagnosis, and full agent reasoning trace), giving engineers an auditable record they can trust and reproduce.

Open source, MIT licensed, 38 tests passing. Built for Track 1: AI Agents & Agentic Workflows.
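
To make the "classify" and "plan a minimum parameter change" steps of the loop concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not ReplayLab's actual code: the three taxonomy entries mirror the failure classes named above, but the regexes, the config keys (`batch_size`, `max_prompt_tokens`, `timeout_s`), and the adjustment rules are all hypothetical stand-ins for the real 10-pattern taxonomy.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FailurePattern:
    """One taxonomy entry: how to recognise a failure and which knob to turn."""
    name: str
    regex: str                       # hypothetical log signature
    param: str                       # hypothetical config key to adjust
    adjust: Callable[[int], int]     # hypothetical rule for the new value


# Illustrative three-entry taxonomy (assumption; the project describes 10 patterns).
TAXONOMY = [
    FailurePattern("memory_pressure", r"out of memory", "batch_size", lambda v: max(1, v // 2)),
    FailurePattern("context_length", r"context length", "max_prompt_tokens", lambda v: max(1, v // 2)),
    FailurePattern("timeout", r"timed out|timeout", "timeout_s", lambda v: v * 2),
]


def classify(log_text: str) -> Optional[FailurePattern]:
    """Return the first taxonomy entry whose signature appears in the log."""
    for pattern in TAXONOMY:
        if re.search(pattern.regex, log_text, re.IGNORECASE):
            return pattern
    return None


def plan_fix(config: dict, pattern: FailurePattern) -> dict:
    """Return a copy of the config with exactly one parameter changed."""
    fixed = dict(config)
    fixed[pattern.param] = pattern.adjust(config[pattern.param])
    return fixed


def changed_keys(before: dict, after: dict) -> set:
    """Keys whose values differ, used to check that the change is minimal."""
    return {k for k in before if before[k] != after[k]}


# Usage with a made-up config and log line.
config = {"batch_size": 64, "max_prompt_tokens": 8192, "timeout_s": 30}
match = classify("worker crashed: HIP out of memory")
if match is not None:
    replay_config = plan_fix(config, match)
    print(match.name, changed_keys(config, replay_config), replay_config)
```

Keeping each taxonomy entry tied to a single parameter is one simple way to enforce a "minimum change": the replayed run differs from the failed one in exactly one place, which makes the before/after comparison easier to audit.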

Category tags:

Developer Tools, Cloud Application, Coding excellence, Optimizing Resource Allocation


Roopal Guha Neogi
Student
