I worked on improving Hindi language support in Qwen3-Omni-30B-A3B-Instruct, focusing on its ability to interpret Hindi audio. For training data I used open-source Hindi audio datasets, mainly ShrutiLipi, Rasa, and the Lahaja dataset by AI4Bharat, and I used Svara TTS to synthesize additional Hindi audio as data augmentation. Fine-tuning was LoRA-based, using a customized version of LLaMA Factory (modified to support Hugging Face streaming iterators), with Weights & Biases (W&B) for experiment tracking. I also fine-tuned Qwen 2.5 Omni for speech-to-speech (S2S) tasks after reverse-engineering its audio codec (the Mimi codec).

The two main challenges were model collapse and catastrophic forgetting. Model collapse turned out to be a training-stability problem: careful hyperparameter tuning was, somewhat surprisingly, enough to resolve it. For catastrophic forgetting, I created a synthetic replay dataset by having the model itself generate input–output pairs. This acts as a form of self-distillation: the model keeps being trained on its own original outputs while it learns new Hindi-specific knowledge, which helped it retain its original capabilities, including reasoning, alongside the domain adaptation. LLMs are trained on large-scale corpora, but the final model exhibits specific emergent behavioral patterns; sampling the model's own outputs therefore also helped me partially reverse-engineer how certain capabilities are represented internally.

The Qwen Omni architecture follows a thinker–talker design. Encoders process audio and images and pass their representations to the thinker, the main LLM, which generates text. The talker conditions on the thinker's output to generate neural codec tokens, which a codec decoder then converts into a continuous waveform. A major obstacle was that several implementation details of the model were not fully open-sourced, so adapting and training the system required significant reverse engineering.
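The synthetic replay idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the actual training pipeline: `build_replay_pairs`, `mix_streams`, the `replay_ratio` parameter, and the `input`/`output`/`source` record fields are all hypothetical names, and `generate` stands in for whatever call produces a response from the original (pre-fine-tuning) model. The mixer works on any iterable, so it would also fit a streaming setup.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, List

Example = Dict[str, str]


def build_replay_pairs(
    prompts: Iterable[str], generate: Callable[[str], str]
) -> List[Example]:
    """Let the model answer each prompt and keep its own answers as targets.

    `generate` is an assumed wrapper around the original model's inference call.
    """
    return [
        {"input": prompt, "output": generate(prompt), "source": "replay"}
        for prompt in prompts
    ]


def mix_streams(
    new_data: Iterable[Example],
    replay: List[Example],
    replay_ratio: float,
    seed: int = 0,
) -> Iterator[Example]:
    """Yield every new example in order, interleaving replay examples.

    After each new example, replay examples are inserted while a coin flip
    with probability `replay_ratio` succeeds, so replay makes up roughly
    `replay_ratio` of the resulting stream in expectation.
    """
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")
    rng = random.Random(seed)
    for example in new_data:
        yield example
        while replay and rng.random() < replay_ratio:
            yield rng.choice(replay)


if __name__ == "__main__":
    # Stand-in for the real model: echoes the prompt in upper case.
    def fake_generate(prompt: str) -> str:
        return prompt.upper()

    replay_set = build_replay_pairs(["explain lora", "what is 2 + 2"], fake_generate)
    hindi_set = [
        {"input": f"audio_{i}.wav", "output": f"transcript {i}", "source": "hindi"}
        for i in range(10)
    ]
    for record in mix_streams(hindi_set, replay_set, replay_ratio=0.3):
        print(record["source"], record["input"])
```

The replay fraction is the main knob in a setup like this: a higher value protects more of the original behavior but spends more training steps on data the model already fits, so it would need to be tuned against the Hindi task metrics.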
10 May 2026