Theorem

Created by team Optimizer on May 10, 2026
Fine-Tuning on AMD GPUs (Advanced / GPU-Intensive)

Building Theorem: a hand-tuned set of GPU kernels for the AMD MI300X. Today: re-wrote the gated DeltaNet WY-transform from an O(C²) elementwise loop into two MFMA GEMMs — 2.4× speedup. Wavefront-aware num_warps + L2 reordering + persistent kernels make CDNA3 sing. Building Theorem: a hand-tuned set of GPU kernels for the AMD MI300X. Today: re-wrote the gated DeltaNet WY-transform from an O(C²) elementwise loop into two MFMA GEMMs — 2.4× speedup. Wavefront-aware num_warps + L2 reordering + persistent kernels make CDNA3 sing. Building Theorem: a hand-tuned set of GPU kernels for the AMD MI300X. Today: re-wrote the gated DeltaNet WY-transform from an O(C²) elementwise loop into two MFMA GEMMs — 2.4× speedup. Wavefront-aware num_warps + L2 reordering + persistent kernels make CDNA3 sing.

Category tags: