
ROCKET is a multi-agent system (Profiler → Planner → Implementer → Validator) that takes any PyTorch model, profiles it on AMD MI300X, hypothesizes which optimizations will help, applies them from a bounded toolbox of five transformations (dtype_cast, torch_compile, sdpa_attention, input_padding, kv_cache_config), validates output correctness, and measures the speedup — completely on its own. On a real AMD Instinct MI300X (ROCm 7) running Qwen2.5-7B-Instruct at batch=8, prompt 256 + new 512, ROCKET achieved a measured 2.93× throughput speedup: 62.59 → 183.47 tokens per second. The agent tried 5 tools and kept 1 (bf16 cast); correctly reverted the four that didn't beat the validator's 2% keep threshold — exactly the judgment a senior performance engineer would apply. How this is different from adjacent submissions: - ROCmPort AI translates CUDA → ROCm code. ROCKET makes the model FAST. - ReplayLab records GPU experiments. ROCKET acts on them. - Sakana's AI Scientist optimizes model accuracy. ROCKET optimizes throughput on AMD silicon; the question every AMD developer asks first that no one had automated.
10 May 2026