
I realized that the full pipeline didn't exist yet, that you could run it locally, and that AMD machines were great at this. Some new features of the Qwen 3.6 series, such as their ASR models, also made this especially useful. Everything runs locally on open-weight models, so anyone with a sufficiently powerful Radeon or Nvidia card can do this. They can also modify any step in the pipeline and regenerate, so it's fully controllable. The pipeline is as follows: Qwen does the initial lyric creation, then ACE-Step does the music, Qwen ASR does timestamp extraction, images are either generated or captured through a web search agent, LTX makes the video clips with the right time constraints, and then ffmpeg tapes them together and overlays the initial audio. You can do it at 3 levels of complexity (low/medium/high), depending on how long you want to wait.
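To make the last two steps concrete, here is a rough sketch of how the timestamps could drive clip lengths and the final ffmpeg stitch. It is not the actual pipeline code. The `Segment` shape, the function names, and the file paths are all made up for illustration, and it assumes every clip shares the same codec, resolution, and frame rate so that ffmpeg's concat demuxer can stream-copy the video.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Segment:
    """One lyric line with times in seconds (hypothetical shape of the ASR output)."""
    start: float
    end: float
    text: str


def clip_durations(segments, track_length):
    """Return one duration per segment so the clips cover the whole track.

    Each clip runs until the next lyric line starts; the first clip starts
    at 0 to cover any intro, and the last one runs to the end of the track.
    """
    durations = []
    for i, seg in enumerate(segments):
        clip_start = 0.0 if i == 0 else seg.start
        is_last = i + 1 == len(segments)
        clip_end = track_length if is_last else segments[i + 1].start
        durations.append(round(clip_end - clip_start, 3))
    return durations


def concat_list(clip_paths):
    """Build the text of a list file for ffmpeg's concat demuxer."""
    return "".join(f"file '{Path(p).as_posix()}'\n" for p in clip_paths)


def ffmpeg_command(list_file, audio_file, output_file):
    """Build the ffmpeg call that joins the clips and lays the song on top."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_file,  # input 0: the clips
        "-i", audio_file,                               # input 1: the song
        "-map", "0:v:0", "-map", "1:a:0",               # video from clips, audio from song
        "-c:v", "copy", "-c:a", "aac",
        "-shortest",
        output_file,
    ]


if __name__ == "__main__":
    # Made-up timestamps, only to show the data flow.
    lines = [
        Segment(2.0, 5.5, "first line"),
        Segment(5.5, 9.0, "second line"),
        Segment(10.0, 14.0, "third line"),
    ]
    print(clip_durations(lines, track_length=16.0))
    print(concat_list(["clips/000.mp4", "clips/001.mp4", "clips/002.mp4"]))
    print(" ".join(ffmpeg_command("clips.txt", "song.wav", "video.mp4")))
```

The durations would be handed to the video step as the target length for each clip, and the command list can be passed to `subprocess.run` once the list file has been written. If the clips don't share encoding settings, swapping `-c:v copy` for a re-encode is the usual fallback.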
10 May 2026