OpenAI Whisper tutorial: How to use Whisper to transcribe a YouTube video
by Fabian Stehle on Oct 6, 2022
What is Whisper?
Whisper is a state-of-the-art automatic speech recognition system from OpenAI, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. This large and diverse dataset makes it more robust to accents, background noise and technical language. It also enables transcription in multiple languages, as well as translation from those languages into English. Unlike DALL-E 2 and GPT-3, Whisper is a free and open-source model. OpenAI released the models and code to serve as a foundation for building useful applications that leverage speech recognition.
How to transcribe a YouTube video
In this tutorial we will use Whisper to transcribe a YouTube video. We will use the Python package "Pytube" to download the video's audio track as an MP4 file. You can find the repo of Pytube here.
First, we need to install the Pytube library. You can do this by running the following command in a notebook cell (drop the leading "!" if you run it in a terminal):
!pip install --upgrade pytube
For this tutorial I'll be using this "Python in 100 Seconds" video.
Next, we need to import Pytube, provide the link to the YouTube video, and download its audio as an MP4 file:
# Importing Pytube library
import pytube

# Reading the YouTube link
video = "https://www.youtube.com/watch?v=x7X9w_GIm1s"
data = pytube.YouTube(video)

# Converting and downloading as 'MP4' file
audio = data.streams.get_audio_only()
audio.download()
The output is a file named after the video title, saved in your current directory. In our case, the file is named
Python in 100 Seconds.mp4
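Because the file name comes from the video title, the later steps depend on typing that title exactly. One way around this is to pass an explicit file name to the download call. The snippet below is a minimal sketch, assuming a recent Pytube release in which the `filename` argument of `download()` includes the extension; the helper `download_audio` and the default name `audio.mp4` are illustrative and not part of Pytube.

```python
def download_audio(url, filename="audio.mp4", output_path="."):
    """Download the audio-only stream of a YouTube video under a fixed name.

    `download_audio` and its defaults are hypothetical names for this sketch.
    Returns the path of the saved file as reported by Pytube.
    """
    # Imported lazily so the function can be defined even before Pytube is installed.
    import pytube

    stream = pytube.YouTube(url).streams.get_audio_only()
    if stream is None:
        raise RuntimeError(f"No audio-only stream found for {url}")
    return stream.download(output_path=output_path, filename=filename)


# Example call (needs network access):
# path = download_audio("https://www.youtube.com/watch?v=x7X9w_GIm1s")
```

With a fixed name, the transcription step can refer to "audio.mp4" regardless of how the video is titled.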
Now, the next step is to convert the audio into text. We can do this in three lines of code using Whisper. First, we install and import whisper. Then we load the model, and finally we transcribe the audio file.
Installing the Whisper library
!pip install git+https://github.com/openai/whisper.git -q
Load the model. We'll use the "base" model for this tutorial. You can find more information about the models here. Each one trades accuracy against speed and compute: larger models are generally more accurate but slower to run.
import whisper

# Loading the "base" model and transcribing the audio file
model = whisper.load_model("base")
text = model.transcribe("Python in 100 Seconds.mp4")
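The call above lets Whisper detect the spoken language and return a transcript in that language. As mentioned in the introduction, Whisper can also translate into English. The sketch below shows one way to pass these choices through the `task` and `language` decoding options of `transcribe()`; the helper names `build_options` and `transcribe_file` are made up for this example.

```python
def build_options(language=None, translate=False):
    """Build keyword arguments for model.transcribe().

    task="translate" asks Whisper for English output; task="transcribe"
    keeps the spoken language. `language` is a code such as "en" or "de";
    when it is omitted, Whisper detects the language itself.
    """
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        options["language"] = language
    return options


def transcribe_file(path, model_name="base", language=None, translate=False):
    """Load a Whisper model and transcribe (or translate) one audio file."""
    # Imported lazily so the helpers above work without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(path, **build_options(language, translate))


# Example call (needs the Whisper package and the downloaded file):
# result = transcribe_file("Python in 100 Seconds.mp4", translate=True)
```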
And now we can print out the output.
# Printing the transcribed text
print(text['text'])
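Besides the full text, the dictionary returned by `transcribe()` (called `text` in this tutorial) also contains a "segments" list, where each entry has "start" and "end" times in seconds and its own "text". The following sketch prints a simple timestamped transcript from that structure. The helper names are illustrative, and `sample` is hand-written data in the same shape, not real Whisper output.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an MM:SS string (fractions are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(result):
    """Return one '[start - end] text' line per segment of a Whisper result."""
    lines = []
    for segment in result.get("segments", []):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return "\n".join(lines)


# Hand-written stand-in for a Whisper result, used only to show the format.
sample = {
    "text": " Hello world. Bye.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello world."},
        {"start": 2.5, "end": 65.0, "text": " Bye."},
    ],
}
print(format_segments(sample))
```

In the tutorial's own code, the equivalent call would be `print(format_segments(text))`.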
You can find the full code as a Jupyter Notebook here.