OpenAI Whisper

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

The Whisper models are trained for speech recognition and translation tasks, capable of transcribing speech audio into text in the language in which it is spoken (ASR) as well as translating it into English (speech translation). Whisper has been trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Whisper is an encoder-decoder model: input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
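As a quick illustration of these tasks, here is a minimal sketch using the open-source whisper Python package; the "base" model size and the audio filename are placeholders.

```python
# Minimal sketch using the open-source openai-whisper package
# (pip install openai-whisper); "base" and "audio.mp3" are placeholders.
import whisper

model = whisper.load_model("base")

# Multilingual speech transcription (ASR): text in the spoken language.
result = model.transcribe("audio.mp3")
print(result["language"], result["text"])

# To-English speech translation: the same model, steered by a task token.
translated = model.transcribe("audio.mp3", task="translate")
print(translated["text"])

# Language identification on the first 30-second chunk.
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))
```

Note how a single model handles all three tasks: the task is selected by special tokens fed to the decoder, not by separate models.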

OpenAI Whisper Libraries

Discover Whisper libraries to help you get started

Whisper Boilerplates

Boilerplates to help you get started

Whisper Hackathon projects

Solutions built with Whisper by members of our community during our hackathons

Phoenix Whisper

According to research by J. Birulés-Muntané and S. Soto-Faraco (10.1371/journal.pone.0158409), watching movies with subtitles can help us learn a new language more effectively. However, the way subtitles are traditionally shown on YouTube or Netflix gives us no good way to check the meaning of new vocabulary or to understand complex slang and abbreviations. We found that displaying dual subtitles (the video's original subtitles alongside a translation) markedly improves learning: in a study conducted in Japan, participants who viewed an episode with dual subtitles performed significantly better (http://callej.org/journal/22-3/Dizon-Thanyawatpokin2021.pdf).

After understanding both the problem and the solution, we decided to create a platform for learning new languages with dual active transcripts. When you enter a YouTube URL or upload an MP4 file in our web application, the app produces a web page where you can watch the video with a transcript running next to it in two different languages. We accomplished this goal by integrating OpenAI Whisper, GPT, and Facebook's language model in the app's backend.

At first we used Streamlit, but it neither provides a transcript that automatically scrolls with the audio timeline nor lets us design the user interface, so we built our own full-stack application with Bootstrap, Flask, HTML, CSS, and JavaScript. Our business model is subscription-based and/or a one-time purchase depending on usage.

Our app isn't just for language learners: it can also help writers, singers, YouTubers, or anyone who wants their content to reach more people by adding other languages to their videos and audio. Due to the limitations of the free hosting plan we could not yet deploy the app to the cloud, but a simple website gives a quick look at what we are creating (https://phoenixwhisper.onrender.com/success/BzKtI9OfEpk/en).
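For readers curious how a dual transcript like this might be produced, here is a minimal sketch combining the open-source whisper package with a Hugging Face translation pipeline. The model names and the dual_transcript helper are illustrative assumptions, not Phoenix Whisper's actual backend code.

```python
# Minimal sketch of a dual-transcript pipeline, assuming openai-whisper and
# Hugging Face transformers; the model names and this helper are illustrative.
import whisper
from transformers import pipeline

def dual_transcript(audio_path: str,
                    translation_model: str = "Helsinki-NLP/opus-mt-en-fr"):
    model = whisper.load_model("base")
    # Whisper returns timestamped segments in the original language.
    result = model.transcribe(audio_path)
    translator = pipeline("translation", model=translation_model)
    pairs = []
    for seg in result["segments"]:
        original = seg["text"].strip()
        translated = translator(original)[0]["translation_text"]
        pairs.append({
            "start": seg["start"],  # segment timing lets the UI scroll with playback
            "end": seg["end"],
            "original": original,
            "translated": translated,
        })
    return pairs

if __name__ == "__main__":
    for row in dual_transcript("lesson.mp4"):
        print(f"[{row['start']:6.1f}s] {row['original']} | {row['translated']}")
```

Keeping Whisper's per-segment timestamps is what makes the "active" part of the transcript possible: a frontend can highlight and scroll to the segment whose start/end interval contains the current playback time.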