Building a gpu experiment 'locksmith' that records GPU failures, diagnose cause, auto-fix configs, replay + verify recovery on AMD MI300X (ROCm)
Autonomous agent that detects GPU experiment failures, diagnoses root causes using vLLM taxonomy + LLM reasoning on AMD MI300X, and replays corrected experiments — full recovery in under a minute for $0.14, replacing $150 manual debug cycles.