
Building Theorem: a hand-tuned set of GPU kernels for the AMD MI300X. Today: re-wrote the gated DeltaNet WY-transform from an O(C²) elementwise loop into two MFMA GEMMs — 2.4× speedup. Wavefront-aware num_warps + L2 reordering + persistent kernels make CDNA3 sing. Building Theorem: a hand-tuned set of GPU kernels for the AMD MI300X. Today: re-wrote the gated DeltaNet WY-transform from an O(C²) elementwise loop into two MFMA GEMMs — 2.4× speedup. Wavefront-aware num_warps + L2 reordering + persistent kernels make CDNA3 sing. Building Theorem: a hand-tuned set of GPU kernels for the AMD MI300X. Today: re-wrote the gated DeltaNet WY-transform from an O(C²) elementwise loop into two MFMA GEMMs — 2.4× speedup. Wavefront-aware num_warps + L2 reordering + persistent kernels make CDNA3 sing.
10 May 2026