
StudioMI300 turns one English sentence into a 30-second cinematic reel, end to end, on a single AMD Instinct MI300X. The pipeline runs eight stages sequentially on the same GPU. A Qwen3.5-35B Director Agent plans six shots, along with character portraits, a music brief, and a per-shot voice-over script. FLUX.2 [klein] 4B paints character master keyframes, using reference editing to preserve identity. Wan2.2-I2V-A14B animates each shot, using First-Last-Frame conditioning for cut:false continuation arcs. The same Qwen3.5-35B is then reloaded as a vision critic: it scores four sampled frames per clip on the character_match, scene_match, composition, and artifact_free axes, and emits structured failure labels (STYLIZED_AI_LOOK, CHARACTER_DRIFT, EXTRAS_INVADE_FRAME, CAMERA_IGNORED) that drive a re-render loop with bumped seeds. ACE-Step v1 generates the instrumental music, Kokoro-82M narrates the per-shot voice-over with nine languages available (the Director picks the one that matches the setting), and ffmpeg concatenates the clips and mixes the final mp4.

The MI300X's 192 GB of HBM3 lets all four model architectures share the same card, with models loaded and unloaded between phases. ParaAttention FBCache (lossless 2x) and selective torch.compile on Wan2.2's transformer_2 give a cumulative 2.5x speedup over the unoptimised baseline.

Every model is licensed under Apache 2.0 or MIT, so outputs are commercially usable. The pipeline ships as a Python CLI plus a FastAPI server that streams stage events (plan, masters, keyframes, clip rendering, critic verdicts, music, VO chunks, final mp4) over Server-Sent Events for live demos. Multi-GPU routing scaffolding is in place via STUDIOMI_GPU_* environment variables.
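
The critic-driven re-render loop described above can be sketched roughly as follows. This is a minimal illustration, not StudioMI300's actual code: the axis names and failure labels come from the description, while `Verdict`, `render_with_critic`, the 0.7 pass threshold, the three-attempt cap, the assumed [0, 1] score range, and the "add one to the seed" bump are all hypothetical choices made for the example. The `render` and `critique` callables stand in for the Wan2.2 rendering step and the Qwen3.5-35B critic.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Axis names and failure labels are taken from the project description.
AXES = ("character_match", "scene_match", "composition", "artifact_free")
FAILURE_LABELS = {
    "STYLIZED_AI_LOOK",
    "CHARACTER_DRIFT",
    "EXTRAS_INVADE_FRAME",
    "CAMERA_IGNORED",
}


@dataclass
class Verdict:
    """Hypothetical container for one critic verdict on one clip."""
    scores: Dict[str, float]                 # one score per axis, assumed in [0, 1]
    failures: List[str] = field(default_factory=list)

    def passed(self, threshold: float) -> bool:
        # A clip passes only with no failure labels and every axis at or above threshold.
        return not self.failures and all(self.scores[a] >= threshold for a in AXES)


def render_with_critic(
    render: Callable[[int], Any],
    critique: Callable[[Any], Verdict],
    base_seed: int,
    max_attempts: int = 3,      # assumed cap, not from the source
    threshold: float = 0.7,     # assumed pass mark, not from the source
) -> Tuple[Any, int, Verdict]:
    """Render a shot, bumping the seed until the critic accepts or attempts run out.

    Returns (clip, seed, verdict) for the first passing attempt, or for the
    attempt with the highest worst-axis score if none passes.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    best = None
    for attempt in range(max_attempts):
        seed = base_seed + attempt           # assumed bump strategy: increment by one
        clip = render(seed)
        verdict = critique(clip)

        unknown = set(verdict.failures) - FAILURE_LABELS
        if unknown:
            raise ValueError(f"unexpected failure labels: {sorted(unknown)}")

        if verdict.passed(threshold):
            return clip, seed, verdict

        worst = min(verdict.scores[a] for a in AXES)
        if best is None or worst > best[0]:
            best = (worst, clip, seed, verdict)

    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Stub renderer and critic: the first seed drifts, the bumped seed passes.
    def fake_render(seed: int) -> str:
        return f"clip-seed-{seed}"

    def fake_critique(clip: str) -> Verdict:
        scores = {a: 0.9 for a in AXES}
        if clip.endswith("-42"):
            scores["character_match"] = 0.4
            return Verdict(scores, ["CHARACTER_DRIFT"])
        return Verdict(scores)

    clip, seed, verdict = render_with_critic(fake_render, fake_critique, base_seed=42)
    print(clip, seed, verdict.failures)
```

Falling back to the best-scoring attempt when every attempt fails is one possible design choice; it keeps the pipeline producing a complete reel, at the cost of occasionally shipping a clip the critic flagged.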
10 May 2026