Whispy is an accessibility tool for voice chat. By running multiple models concurrently, it can stand in completely for a user in a voice call. Whispy users keep their preferred input method, whether speech-to-text or text-to-speech, while everyone else in the voice chat continues to use the platform as usual. For our demo, Whispy integrates into Discord, letting users hold complete, real-time conversations by text or by voice, whichever they prefer. We use the ElevenLabs streaming API and an audio queue to speak any written text back to the voice call in a custom TTS voice. Text users can choose from all of the default voices, and their preferences are stored in the bot's files. Text is streamed back into the voice call quickly enough to keep the conversation fluid. In the other direction, OpenAI's Whisper large model transcribes audio from any number of users in the voice call, separated by speaker, and posts their speech as text in the same channel where the ElevenLabs user is typing. This effectively mirrors the voice call as a text conversation. For international users, both ElevenLabs and Whisper can handle other languages, limited mostly to the languages Whisper supports. Our demo showcases Spanish as a secondary language.
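The audio queue described above can be pictured with a short sketch. This is not Whispy's actual code: the names SpeechQueue, fake_tts_stream, and demo are hypothetical, and fake_tts_stream is a stand-in that only imitates a streaming TTS service by yielding byte chunks. No real ElevenLabs or Discord calls are made. The sketch assumes that messages are spoken strictly in the order they were typed and that each chunk is played as soon as it arrives.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List


async def fake_tts_stream(text: str, voice: str) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call: one chunk per word."""
    for word in text.split():
        await asyncio.sleep(0)  # simulate waiting on the network
        yield f"[{voice}]{word}".encode()


class SpeechQueue:
    """Plays queued messages one at a time so voices never overlap."""

    def __init__(
        self,
        synthesize: Callable[[str, str], AsyncIterator[bytes]],
        play: Callable[[bytes], Awaitable[None]],
    ) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._synthesize = synthesize
        self._play = play

    async def say(self, text: str, voice: str) -> None:
        """Enqueue a typed message together with the sender's chosen voice."""
        await self._queue.put((text, voice))

    async def run(self) -> None:
        """Worker loop: stream each message's audio chunks in FIFO order."""
        while True:
            text, voice = await self._queue.get()
            try:
                async for chunk in self._synthesize(text, voice):
                    await self._play(chunk)  # forward chunk as soon as it arrives
            finally:
                self._queue.task_done()

    async def drain(self) -> None:
        """Wait until every queued message has been fully played."""
        await self._queue.join()


async def demo() -> List[bytes]:
    played: List[bytes] = []

    async def play(chunk: bytes) -> None:
        played.append(chunk)  # a real bot would write to the voice connection

    speech = SpeechQueue(fake_tts_stream, play)
    worker = asyncio.create_task(speech.run())
    await speech.say("hello there", "voice_a")
    await speech.say("hola", "voice_b")
    await speech.drain()
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return played


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A single worker draining one queue is the simplest way to keep two users' messages from being spoken over each other. A real bot would likely swap fake_tts_stream for the actual streaming client and replace play with a write to the voice connection.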
"Wonderful, as within Discord, voice transcription is a major issue, which this solution can resolve easily. I tried it myself and it was very smooth. Moreover, the idea is very impactful, and I am looking forward to what the future holds for this project. Well done!"
Machine Learning Engineer