
### The problem

Frontier mixture-of-experts (MoE) models grow faster than VRAM. Qwen3.5-397B (227 GiB at 4-bit) and DeepSeek-V3 671B (377 GiB) don't fit on any single GPU: not a $700 consumer card, not even a $15K MI300X with 192 GB of HBM3. Multi-GPU clusters or aggressive downcasting were the only options. Existing CPU-offload paths (`--cpu-moe`, DeepSpeed-MoE, Mixtral-Offloading) cap out at host RAM (~128 GB) and pay the PCIe transfer cost on every miss.

### The thesis

Stream the experts from NVMe. Each token activates only **K of N experts per layer** (10 of 512 for Qwen3.5-397B), so load just those K from SSD on demand. The OS page cache becomes a free third memory tier between VRAM and SSD: when the model fits in OS RAM, the page cache absorbs the streaming cost; when the model is bigger than RAM, you trade speed for capability (slower, but possible). Either way, the largest MoE models become reachable on a single GPU. A minimal sketch of the on-demand fetch path appears at the end of this section.

### Who this matters for

Streaming MoE unlocks frontier inference for users who can't afford multi-GPU H100/H200 clusters: independent developers, on-prem and air-gapped deployments (regulated industries such as healthcare, finance, and defense), and mid-tier providers serving more concurrent users per GPU. **The same code path runs from a $700 consumer GPU to a $15K datacenter card.**
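To make the thesis concrete, here is a minimal sketch of the on-demand fetch, not the project's actual code: the file name, expert shape, and helper names are hypothetical, and weights are stored as fp16 for simplicity rather than the packed 4-bit format described above. The point it illustrates is that `np.memmap` maps the weight file without reading it, so only the pages of the K routed experts are ever faulted in from NVMe, and the OS page cache keeps hot experts in RAM automatically.

```python
import numpy as np
import torch

def map_layer_experts(path, n_experts, expert_shape, dtype=np.float16):
    """Memory-map one layer's expert weights (hypothetical layout:
    N contiguous expert blocks in a single file). Nothing is read
    from disk yet; pages fault in only when touched."""
    return np.memmap(path, dtype=dtype, mode="r",
                     shape=(n_experts, *expert_shape))

def fetch_routed_experts(experts, topk_ids, device="cuda"):
    """Copy only the K experts the router selected to the GPU.
    First touch of an expert faults its pages in from NVMe; later
    touches of the same expert are served from the OS page cache."""
    fetched = {}
    for eid in topk_ids:
        host_block = np.array(experts[eid])  # page-in + copy to host RAM
        fetched[int(eid)] = torch.from_numpy(host_block).to(device)
    return fetched

# Illustrative numbers from the post: 512 experts per layer, 10 active per token.
# File name and expert_shape are assumptions for the sketch.
experts = map_layer_experts("layer_00_experts.bin", n_experts=512,
                            expert_shape=(4096, 1024))
active = fetch_routed_experts(experts, topk_ids=[3, 17, 42, 99, 128,
                                                 200, 311, 404, 477, 509])
```

When the model fits in RAM, repeated tokens touch the same expert blocks and the memmap reads come back at memory speed; when it doesn't, the same code still runs, just bounded by NVMe throughput, which is the speed-for-capability trade described above.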
10 May 2026