**GPU Goblin Auto-Tune** is an AI-driven optimizer that finds the fastest configuration for any HuggingFace fine-tuning workload on the AMD MI300X (ROCm 7.0, CDNA3, 192 GB HBM3). Point it at a model id and walk away: in roughly ten minutes it returns a measured speedup and a ready-to-ship modified training script.

**The problem.** Manually tuning LLM fine-tunes on the MI300X is brutal: dozens of interacting knobs (precision, batch size, attention implementation, dataloader, optimizer, environment variables), ROCm-specific gotchas distinct from CUDA, and "best practices" that sometimes hurt on your specific workload. Generic hyperparameter tuners don't know the MI300X.

**The solution.** A single CLI that profiles a baseline, iteratively tries curated MI300X-specific changes (bf16, larger batches, SDPA attention, hipBLASLt, TunableOp, MIOpen FAST), benchmarks each on real hardware, and keeps what wins. Three modes:

- **Hardcoded playbook:** deterministic, no API key required.
- **LLM-driven greedy:** Qwen picks one change per iteration based on live waste-budget signals.
- **LLM-explore:** Qwen proposes K candidates; the script benchmarks all of them, then tries a merged version of the positives to capture compound gains.

**Output.** A live per-iteration log plus a final report showing the tokens/sec delta, the MFU jump, a waste-budget reduction chart, accepted vs. rejected experiments, and a downloadable `best.py`.

**Real result on a Qwen2.5-7B LoRA fine-tune:** baseline 5,734 tok/sec → tuned 11,708 tok/sec (2.04×, +104%; MFU 20% → 41%) in about ten minutes, fully automated.

**Architecture.** A Streamlit UI deployed to a Hugging Face Space; a FastAPI server on the MI300X droplet streams NDJSON events back to the browser as SSE. Judges click the Space, enter a model id, and the GPU work happens on AMD hardware while progress streams live. Powered by Qwen as both the agent brain and the demo audit target.
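The profile-try-benchmark-keep loop described above can be sketched as a greedy accept/reject search. This is purely illustrative: the candidate names, the `benchmark` stub, and its toy speedup multipliers are assumptions for the sketch, not the tool's actual internals, which always measure on real hardware.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    name: str
    apply: Callable[[dict], dict]  # returns a modified training config

def benchmark(config: dict) -> float:
    """Stand-in for a real tokens/sec measurement on the GPU."""
    # Toy model: each enabled flag contributes a fixed multiplicative gain.
    gains = {"bf16": 1.4, "sdpa": 1.2, "batch_x2": 1.1}
    tok_s = 5734.0  # baseline from the example run above
    for flag, mult in gains.items():
        if config.get(flag):
            tok_s *= mult
    return tok_s

def greedy_tune(base: dict, candidates: list[Candidate]) -> tuple[dict, float]:
    best_cfg, best = dict(base), benchmark(base)
    for cand in candidates:           # one curated change per iteration
        trial = cand.apply(dict(best_cfg))
        score = benchmark(trial)
        if score > best:              # keep only measured wins
            best_cfg, best = trial, score
    return best_cfg, best

candidates = [
    Candidate("bf16", lambda c: {**c, "bf16": True}),
    Candidate("sdpa", lambda c: {**c, "sdpa": True}),
    Candidate("batch_x2", lambda c: {**c, "batch_x2": True}),
]
cfg, tok_s = greedy_tune({}, candidates)
```

The LLM-driven modes differ only in who proposes the next candidate (Qwen instead of a fixed playbook) and, in explore mode, in benchmarking K proposals per round plus a merged version of the winners.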
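The NDJSON-to-SSE bridge in the architecture can be illustrated with a small pure-Python translator. This helper is hypothetical, not the project's server code; in the real deployment a FastAPI endpoint would presumably wrap such a generator in a `StreamingResponse` with `media_type="text/event-stream"`.

```python
import json

def ndjson_to_sse(ndjson_lines):
    """Turn NDJSON lines (one JSON object per line) into SSE frames."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue                          # skip blank keep-alive lines
        obj = json.loads(line)                # validate before forwarding
        yield f"data: {json.dumps(obj)}\n\n"  # one SSE event per record

frames = list(ndjson_to_sse(['{"iteration": 1, "tok_per_sec": 7100}\n']))
```

Each `data: ...` frame terminated by a blank line is one event in the SSE wire format, so the browser's `EventSource` receives per-iteration progress as it happens.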
10 May 2026