
ROCmForge is a fine-tuned code LLM that targets a real and growing problem: the AMD MI300X is 2–3× cheaper per FLOP than the NVIDIA H100, but naive PyTorch on ROCm leaves 40–60% of that performance on the table. Writing hand-tuned HIP kernels requires deep knowledge of the wavefront-64 architecture, MFMA matrix intrinsics, LDS bank-conflict avoidance, and gfx942-specific occupancy tuning, expertise that almost no team has.

We fine-tuned Qwen2.5-Coder-7B with QLoRA on 25,000 curated instruction-output pairs sourced from real AMD open-source repositories, including ROCm, composable_kernel, MIOpen, and hipBLAS. Training ran on a single AMD MI300X (192 GB HBM3) for approximately 8 hours: 3,116 steps at a 4,096-token context, in bfloat16 precision with eager attention for numerical stability on ROCm.

The result is a model that correctly generates AMD-native HIP code where the base model silently fails. Compared with base-model output, ROCmForge produces (the idioms are sketched in the HIP examples below):

- wavefront-64 shuffle reductions, with the first offset at 32, not 16;
- the AMD __shfl_down API, without NVIDIA's 0xffffffff lane mask;
- __launch_bounds__ occupancy hints for gfx942;
- LDS padding to eliminate bank conflicts;
- MFMA intrinsics (__builtin_amdgcn_mfma_f32_32x32x8f16) for hardware-accelerated matrix operations.

On 50 held-out test prompts, the base model scores 0% on MI300X-specific optimizations. ROCmForge scores 6% on MI300X awareness and 18% on compilable output: a small but measurable improvement from fine-tuning alone.

The system ships as a full end-to-end product: a FastAPI + vLLM backend serving the fine-tuned model, a Vite + React frontend with live streaming generation, hipcc compilation validation, a four-way benchmark comparison (PyTorch eager / torch.compile / rocBLAS / ROCmForge), and a live side-by-side model comparison tab with automated AMD pattern detection and a 0–100 quality scoring system.
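The first four idioms fit in one file. What follows is a minimal HIP sketch written for this post, not ROCmForge output: the kernel names, the 256-thread launch bound, and the 16×16 tile size are illustrative assumptions rather than tuned gfx942 optima.

```cpp
// Wavefront-64 reduction and LDS-padding idioms on CDNA (e.g. gfx942).
// Build (assumption): hipcc --offload-arch=gfx942 sketch.cpp
#include <hip/hip_runtime.h>

// Block of 256 threads = 4 wavefronts of 64 lanes.
__global__ void __launch_bounds__(256)
block_sum(const float* __restrict__ in, float* __restrict__ out, int n) {
    __shared__ float partial[4];          // one partial sum per wavefront
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 63;          // lane within the 64-wide wavefront
    int wave = threadIdx.x >> 6;          // wavefront index within the block

    float v = (gid < n) ? in[gid] : 0.0f;

    // Wavefront shuffle reduction: the first offset is 32 (not 16, as it
    // would be for a 32-wide warp), and AMD's __shfl_down takes no
    // 0xffffffff mask argument (that is CUDA's __shfl_down_sync signature).
    for (int offset = 32; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);

    if (lane == 0) partial[wave] = v;     // lane 0 holds the wavefront sum
    __syncthreads();

    if (threadIdx.x == 0)                 // caller must zero *out first
        atomicAdd(out, partial[0] + partial[1] + partial[2] + partial[3]);
}

// LDS padding: the extra column keeps the column-wise reads in the second
// phase from piling into the same LDS banks. Launch with blockDim = (16, 16).
__global__ void __launch_bounds__(256)
transpose16(const float* __restrict__ in, float* __restrict__ out, int dim) {
    __shared__ float tile[16][16 + 1];    // +1 padding column
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    if (x < dim && y < dim) tile[threadIdx.y][threadIdx.x] = in[y * dim + x];
    __syncthreads();
    x = blockIdx.y * 16 + threadIdx.x;    // transposed block origin
    y = blockIdx.x * 16 + threadIdx.y;
    if (x < dim && y < dim) out[y * dim + x] = tile[threadIdx.x][threadIdx.y];
}
```

The padding costs one float per tile row; without it, lanes reading a tile column collide in the same LDS banks and the reads serialize.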
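The MFMA call is the piece the base model never emits, and its operand shapes are the easiest thing to get wrong. The sketch below is a compile-check only: the vector typedefs (half4, floatx16) are local naming assumptions, the fragments are zero-initialized, and the per-lane mapping of matrix elements into fragments, which a real kernel must get right, is deliberately omitted.

```cpp
// Operand types and call signature of the 32x32x8 f16 MFMA on gfx942.
// Build (assumption): hipcc --offload-arch=gfx942 mfma_sketch.cpp
#include <hip/hip_runtime.h>

typedef _Float16 half4    __attribute__((ext_vector_type(4)));  // A/B fragment: 4 x f16 per lane
typedef float    floatx16 __attribute__((ext_vector_type(16))); // C/D fragment: 16 x f32 per lane

// Launch with one 64-lane wavefront; together the lanes hold a 32x8 A tile,
// an 8x32 B tile, and a 32x32 f32 accumulator.
__global__ void mfma_demo(float* out) {
    half4 a = {};    // would hold this lane's 4 elements of A
    half4 b = {};    // would hold this lane's 4 elements of B
    floatx16 c = {}; // accumulator fragment

    // D = A*B + C on the matrix cores; the trailing 0, 0, 0 are the
    // cbsz/abid/blgp broadcast modifiers, left at their defaults.
    c = __builtin_amdgcn_mfma_f32_32x32x8f16(a, b, c, 0, 0, 0);

    out[threadIdx.x] = c[0];  // write something back so the call is kept
}
```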
10 May 2026