ElevenLabs Tutorial: Building AI-Powered Auto Dubbing Service

Wednesday, July 12, 2023 by septian_adi_nugraha408

Introduction

The arrival of highly advanced text-to-speech technology in recent years has opened the floodgates to many innovative and cutting-edge AI-powered products. No longer are we limited to the awkward, robotic synthesized speech generated by the text-to-speech technology of old. Recently, a company called ElevenLabs has upped the ante by providing features centered around voice generation, from creating and designing custom voices to synthesizing speech using the voices we create or the pre-made voices provided by ElevenLabs.

In this tutorial, we will build an automatic dubbing service using text-to-speech technology by ElevenLabs. We will also identify the necessary steps, from retrieving the video from a Youtube link to combining the video with the generated dubs.

Introduction to ElevenLabs

ElevenLabs is a highly innovative company that offers a powerful but easy-to-use API for voice generation. Its cutting-edge voice generation API is trained on large swathes of audiobooks and podcasts, which allows it to generate natural-sounding and expressive speech. ElevenLabs' API is therefore an ideal choice for a wide range of voice-centric products, such as story and audiobook narration and voiceovers for videos.

Introduction to OpenAI's Whisper

Whisper is an audio transcription (speech-to-text) model developed by OpenAI. It is reported to be trained on 680,000 hours of multilingual and multitask supervised data collected from the web, meaning the model is trained on a large dataset with multiple labels and has to predict all of them simultaneously. This helps ensure that Whisper is more consistent at handling accents, background noise, and technical language. Whisper is also capable of transcribing speech in multiple languages as well as translating from non-English languages.

Introduction to Anthropic's Claude Model

Claude is an advanced AI model developed by Anthropic, based on their research into training helpful, honest, and harmless AI systems. It is designed to help with various language-centric use cases such as text summarization, collaborative writing, Q&A, and coding. Early reviews from various users report that Claude is much less likely to produce questionable or harmful responses, is easier and more intuitive to work with, and is more "steerable", meaning users can arrive at their desired responses without much effort. Bottom line, Claude produces human-like responses, which makes it ideal for building services that are expected to deliver a humane and excellent user experience. Because of Claude's renowned excellence in language-centric tasks, we will use it to translate our video transcript.

Prerequisites

  • Basic knowledge of Python, experience with Streamlit is a plus
  • Access to ElevenLabs' API
  • Access to Anthropic's API

Outline

  1. Identifying the Requirements
  2. Initializing the Project
  3. Adding Video Transcription Feature using OpenAI's Whisper
  4. Adding Translation Feature using Anthropic's Claude
  5. Adding Dubs Generation Feature using ElevenLabs' API
  6. Final Touch - Combining the Video with the Generated Dubs
  7. Testing the Auto Dubbing Service

Discussion

Before we get into the coding part, let's first sit back, breathe, and think. Begin with a hypothetical question: if we were to build an automatic dubbing service, what features should it have? Only then can we build around those requirements and decide whether the features provided by our service are sufficient to address the intended use case. With that in mind, let's begin!

Identifying the Requirements

Let's return to the question one last time: what features does our service need as an automatic dubs generator for Youtube videos? To answer it, let's trace the steps involved in generating dubs, from retrieving the video to combining the dubs with the video. A bird's-eye sketch of the whole pipeline follows the list below.

  1. Retrieve the Video from Youtube Link

For this purpose, there is a popular Python library we can use called pytube. By passing the link to its YouTube class, we can retrieve the video and audio streams as well as the metadata, such as the title, description, and thumbnail. For this step, we need a text input field for the Youtube link and a button to trigger the stream download process.

  2. Transcribe the Audio Stream

After we download the audio stream of the Youtube video, we can start transcribing the audio with OpenAI's Whisper via the whisper library. However, for performance reasons, we will first cut the audio down to one minute before transcribing it. At the end of the transcription process, we will show the transcript as a DataFrame, which lists the start time, end time, and text spoken at each point in the video.

  3. Translating the Transcript

The translation process should start immediately afterwards. For this purpose, we're going to use the anthropic library to access the Claude model. We will send Claude a prompt asking it to translate the transcript (originally in English).

  4. Generating the Dubs

Once the response from Claude is received, we begin generating the audio using ElevenLabs' API. By importing a few functions from the elevenlabs library, we can generate speech using a pre-made voice and the multilingual model. We specifically use the multilingual model to accommodate non-English translations.

  5. Combining the Dubs with the Video

Finally, in this step we will retrieve the video stream from the Youtube link from earlier and combine it with the generated audio. We will use a command-line tool called ffmpeg for this purpose. Once the video is combined with the dubs, we will show the result via a video player in our user interface!
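
Before we dive into the implementation, here is a bird's-eye sketch of the whole pipeline in comment form. It is only an outline; the real code for each step follows in the sections below.

# Pipeline overview (outline only; the actual functions are built later in this tutorial)
# 1. Download the audio stream from the Youtube link   -> pytube
# 2. Transcribe the first minute of the audio          -> openai-whisper
# 3. Translate the transcript                          -> anthropic (Claude)
# 4. Synthesize speech from the translation            -> elevenlabs
# 5. Mux the video stream with the generated dubs      -> ffmpeg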

Initializing the Project

We will use the streamlit library to build the user interface of our auto dubbing service, so rest assured: we will only have to write one Python file for this whole project! However, there are some extra steps we need to go through to ensure the project runs smoothly.

Create the project directory

First things first, we need to make our project directory. Let's open up a terminal, change into our coding projects directory, and then create the project directory and enter it:

mkdir autodubs
cd autodubs

Create and activate the virtual environment

Afterwards, let's create our virtual environment and activate it. It's important to create the virtual environment early on so that we can install our dependencies immediately afterwards and prevent this project's dependencies from leaking into the global environment.

python -m venv env


# Activate the env on Linux/Mac
source env/bin/activate

# Activate the env on Windows
.\env\Scripts\activate

Our terminal should show the name of the virtual environment once it's activated.

the terminal show the name of the virtual environment once activated

Installing the dependencies

Next, let's install all the necessary dependencies using the pip command:

pip install streamlit anthropic elevenlabs pytube pydub openai-whisper moviepy "pydantic==1.*" 

# Freeze the dependencies to record our installed libraries inside `requirements.txt`
pip freeze > requirements.txt

This process should take a while, so take the time to brew some more coffee or grab some popcorn! Meanwhile, let's break down the libraries we're installing for this project.

  1. Streamlit: This is the library we use to build our user interface. It provides simple, easy-to-use functions to define UI elements, and more importantly, it works in a Pythonic way!
  2. Anthropic: This library connects our project to Anthropic's API through a simple object-oriented interface and a few helper constants.
  3. ElevenLabs: This library serves as a wrapper for making requests to the ElevenLabs API. It also provides convenient functions such as play() to play the generated audio using ffmpeg.
  4. Pytube: We use this library to retrieve the metadata and the video/audio streams of Youtube videos based on the URL, as well as to download the streams.
  5. Pydub: To create the dubs audio file, we use Pydub. While in this tutorial we simply create an AudioSegment from the audio generated by ElevenLabs' API, Pydub can do much more than that: it makes it easy to cut, combine, and insert audio tracks by using position as an index (see the short pydub sketch after this list).
  6. Whisper: We use OpenAI's Whisper to transcribe the audio file we've downloaded.
  7. MoviePy: I originally intended to use MoviePy to combine the final audio and video tracks, but it turned out to consume too much memory and take too much time, so we use ffmpeg for combining the tracks instead. We still use MoviePy to cut our video track down to a manageable length (one minute).
  8. Pydantic: We install a version-locked Pydantic because the elevenlabs library depends on that version; otherwise it would raise an error.
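
To get a feel for what "position as index" means in Pydub, here is a tiny standalone sketch. It assumes ffmpeg is installed and that some_audio.mp3 is just a placeholder for any audio file you have on hand; it is not part of our app.

from pydub import AudioSegment

# Load any audio file (non-wav formats require ffmpeg)
audio = AudioSegment.from_file("some_audio.mp3")

# Slices are indexed in milliseconds: take the first minute...
first_minute = audio[:60 * 1000]

# ...and the + operator concatenates segments
with_outro = first_minute + audio[-5000:]

with_outro.export("edited.mp3", format="mp3")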

Once the installation is complete, let's address some potential issues that I encountered while building this project. The first is a bug in the pytube library, and the second is the installation of ffmpeg.

Fixing the Pytube bug

Let's navigate to the pytube library directory and open the cipher.py file. If you named your virtual environment env and installed the library after activating it, the file should be located at env/lib/python3.8/site-packages/pytube/cipher.py (the python3.8 part depends on your Python version). Once you've opened the file, go to lines 284-294, specifically line 287, and remove the semicolon (;) from the regex pattern.

...
if idx:
    idx = idx.strip("[]")
    array = re.search(
--        r'var {nfunc}\s*=\s*(\[.+?\]);'.format(
+         r'var {nfunc}\s*=\s*(\[.+?\])'.format(
            nfunc=re.escape(function_match.group(1))),
        js
    )
    if array:
        array = array.group(1).strip("[]").split(",")
        array = [x.strip() for x in array]
        return array[int(idx)]
...

This regex pattern is known to be problematic and to cause a RegexMatchError, as discussed in this Github issue.
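
To confirm the patch took effect, a quick sanity check from a Python shell inside our virtual environment should print a title and an audio-only stream instead of raising RegexMatchError. The video URL here is just the one we'll use later in the tutorial; any public video should do.

from pytube import YouTube

yt = YouTube("https://www.youtube.com/watch?v=PeMlggyqz0Y")
print(yt.title)                                     # video title from the metadata
print(yt.streams.filter(only_audio=True).first())   # first available audio-only stream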

Installing ffmpeg

Truth be told, installing ffmpeg is beyond the scope of this tutorial, so please refer to this link for installation instructions for various operating systems. The installation is successful if you can run ffmpeg in your terminal.
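
If you prefer to verify from Python rather than the terminal, a minimal check like this tells you whether ffmpeg is reachable on the PATH:

import shutil
import subprocess

# shutil.which returns None when the executable is not on the PATH
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"

# Print the version banner as a final confirmation
subprocess.run(["ffmpeg", "-version"], check=True)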

Creating our Streamlit secret file

Our auto dubbing app will make use of various AI services, and some of them require us to register on their platforms before they grant access to their APIs. Access is granted via keys that we send along with our requests. It is therefore good practice to keep our keys in a separate file, as they are considered sensitive information.

The common practice is to store keys as variables inside an .env file. However, Streamlit works differently: we manage sensitive information such as API keys, database connection details, and passwords in a file called secrets.toml. Let's create the file:

mkdir .streamlit
touch .streamlit/secrets.toml

Inside the secrets.toml file, let's write down our API keys.

xi_api_key = "xxxxxxxxxxxxxxxxxxxxxxx"
claude_key = "sk-ant-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Note that TOML files use different conventions from .env files. Mind the quotation marks around values and the variable names, which are usually lowercase.
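
For reference, these keys will later be read inside the app through st.secrets, which behaves like a dictionary keyed by the variable names we just wrote:

import streamlit as st

claude_key = st.secrets["claude_key"]  # Anthropic API key
xi_api_key = st.secrets["xi_api_key"]  # ElevenLabs API key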

Creating the autodubs.py file

If our current working directory is still our project directory, good! Let's create our app's main file.

touch autodubs.py

Next, let's open our favorite IDE/code editor and open our project directory. This time we'll write the first form of our app, which consists of a title, text input, select box and a button.

import streamlit as st

st.title("AutoDubs πŸ“ΊπŸŽ΅")

link = st.text_input("Link to Youtube Video", key="link")

language = st.selectbox("Translate to", ("French", "German", "Hindi", "Italian", "Polish", "Portuguese", "Spanish"))

if st.button("Transcribe!"):
    print(f"downloading from link: {link}")

As we can see from this code, the elements of our user interface are pretty much self-explanatory. We store the link entered by the user in the link variable, which we will use later when the "Transcribe!" button is clicked. We also store the user's selection of the language we're going to translate our dubs into. To test our app, let's use the streamlit run <app-name> command.

streamlit run autodubs.py

The app should open in our default browser once the command is executed.

the user interface of our AutoDubs app

Great! This is the beginning of our app. We will add more UI elements as we progress through the tutorial and add more features. For now, if we enter a Youtube link and click the button, it should print the link in our terminal output.

Adding Video Transcription Feature using OpenAI's Whisper

Next, we'll start adding a handler to our "Transcribe!" button. In this part, we will use the pytube library to download the audio stream and the whisper library to transcribe the audio. We also use AudioSegment from the pydub library to cut our audio down to a manageable length (one minute).

Let's add more import statements to bring the required libraries into our code.

import streamlit as st
++ import whisper
++ from pytube import YouTube
++ from pydub import AudioSegment
++ import pandas as pd

...

Meanwhile, let's add more code to our button click handler. We'll go through it afterwards.

...

if st.button("Transcribe!"):
    print(f"downloading from link: {link}")
++     model = whisper.load_model("base")
++     yt = YouTube(link)
++ 
++     if yt is not None:
++         st.subheader(yt.title)
++         st.image(yt.thumbnail_url)
++         audio_name = st.caption("Downloading audio stream...")
++         audio_streams = yt.streams.filter(only_audio=True)
++         filename = audio_streams.first().download()
++         if filename:
++             audio_name.caption(filename)
++             cut_audio = shorten_audio(filename)
++             transcription = model.transcribe(cut_audio)
++             print(transcription)
++             if transcription:
++                 df = pd.DataFrame(transcription['segments'], columns=['start', 'end', 'text'])
++                 st.dataframe(df)

...

This addition to our button click handler does many things before the transcription even begins: loading the Whisper model, retrieving the Youtube video's metadata, downloading the audio stream, and cutting the audio. Ultimately, it transcribes the downloaded audio stream and shows the result in a Pandas DataFrame.

Are we done? Not quite! If you look closely, there's a function we invoke to get the shortened audio filename before we transcribe it: the shorten_audio() function. Let's add its definition somewhere above the Streamlit function calls. Note that it writes the shortened audio to a separate file (cut_audio.mp4) rather than overwriting the file it receives.

...

++ def shorten_audio(filename):
++     # Export the first minute of the downloaded audio to a separate file
++     cut_filename = "cut_audio.mp4"
++     audio = AudioSegment.from_file(filename)
++     cut_audio = audio[:60 * 1000]
++     cut_audio.export(cut_filename, format="mp4")
++     return cut_filename

st.title("AutoDubs πŸ“ΊπŸŽ΅")

link = st.text_input("Link to Youtube Video", key="link")

...

Now, let's run the app again. If the app is already running, just click the upper-right menu and click "Rerun". This time, enter a Youtube link and click the button! Let's ignore the language selection for now. For this tutorial, we'll use Fireship's video, "Machine Learning Explained in 100 Seconds", which can be accessed at this URL: https://www.youtube.com/watch?v=PeMlggyqz0Y&t=1s.

The result of the 'Transcribe!' button click. The app shows the video title as well as the thumbnail.

After a few minutes, our app should show the video title and its thumbnail! Not only that, the transcription process should begin immediately after that. Let's scroll down.

The app should show the downloaded audio stream file and the dataframe containing the transcription result

Voila! After the transcription process is done, we should have a Pandas DataFrame with the columns "start", "end", and "text", which contain the result of the transcription. We should also notice that the transcription was a little faster because we cut the audio down to one minute, as opposed to the video's full 100 seconds. Well done!
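
For clarity, the dictionary returned by model.transcribe() looks roughly like the illustration below. The text and timestamps are invented, but the keys shown are the ones we rely on: "text" holds the full transcript and "segments" holds the per-segment timing.

# Illustrative shape only; the values here are made up
transcription = {
    "text": " Machine learning is ...",
    "segments": [
        {"start": 0.0, "end": 4.5, "text": " Machine learning is ..."},
        {"start": 4.5, "end": 9.1, "text": " ..."},
        # ...more segments follow
    ],
}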

Adding Translation Feature using Anthropic's Claude

If we look at our terminal output, we should see the text form (rather than the dictionary form) of our full transcription. This transcription text is what we will use to generate the translation. Before we begin, let's add another import statement for the anthropic library.

...
from pytube import YouTube
from pydub import AudioSegment
import pandas as pd
++ import anthropic
...

After that, let's add another function called generate_translation() below the function we made earlier.

...

def shorten_audio(filename):
    # Export the first minute of the downloaded audio to a separate file
    cut_filename = "cut_audio.mp4"
    audio = AudioSegment.from_file(filename)
    cut_audio = audio[:60 * 1000]
    cut_audio.export(cut_filename, format="mp4")
    return cut_filename

++ def generate_translation(original_text, destination_language):
++     prompt = (f"{anthropic.HUMAN_PROMPT} Please translate this video transcript into {destination_language}. You will get to the translation directly after I prompted 'the translation:'"
++               f"{anthropic.AI_PROMPT} Understood, I will get to the translation without any opening lines."
++               f"{anthropic.HUMAN_PROMPT} Great! this is the transcript: {original_text}; the translation:")
++ 
++     c = anthropic.Anthropic(api_key=st.secrets["claude_key"])
++     resp = c.completions.create(
++         prompt=f"{prompt} {anthropic.AI_PROMPT}",
++         stop_sequences=[anthropic.HUMAN_PROMPT],
++         model="claude-v1.3-100k",
++         max_tokens_to_sample=900,
++     )
++ 
++     print(resp.completion)
++     return resp.completion

...

In this function, we primarily design the prompt so that Claude returns the translation directly, without any opening lines. This is perfect for our use case, in which we generate speech straight from the translation. In the prompt, we specifically instruct Claude to "get to the translation directly" in the destination language, which is passed as a function parameter. We also add an example response in which Claude "understands our instruction" and "gets to the translation without any opening lines".
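
As a quick, hypothetical smoke test (not part of the final app), we could call the function directly with a short snippet before wiring it into the button handler. It assumes a valid claude_key in secrets.toml and will consume a small amount of API credit.

# Hypothetical one-off test; remove it once the handler is wired up
sample_text = "Machine learning is a subfield of artificial intelligence."
print(generate_translation(sample_text, "Polish"))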

At the end of this function, we print the translation to the terminal output before returning it. Finally, we invoke the function inside our button click handler from earlier. We also show a caption to indicate that our app is generating the translation.

...
    if transcription:
        df = pd.DataFrame(transcription['segments'], columns=['start', 'end', 'text'])
        st.dataframe(df)
++         dubbing_caption = st.caption("Generating translation...")
++         translation = generate_translation(transcription['text'], language)
...

After we're done with the translation, we're ready for the next step: generating dubs for the video. But first, let's test the app. Rerun it and click the "Transcribe!" button.

The translation result shown in the terminal output

This time, our terminal should show the translation from the original language (English) to the destination language we selected earlier in the select box (Polish).

Adding Dubs Generation Feature using ElevenLabs' API

Once our translation process is done, let's generate the dubs. This is where ElevenLabs' API shines! Using the multilingual model, generating natural-sounding speech in a non-English language with accurate accents should be a piece of cake. Let's add the import statements to bring in the libraries we need for this process.

from pydub import AudioSegment
import pandas as pd
import anthropic
++ import io
++ from elevenlabs import generate, set_api_key

Let's define our function to generate the dubs. Just like the last time, we can add the function just below the translation generation function we wrote earlier.

...
++ def generate_dubs(text):
++     filename = "output.mp3"
++     set_api_key(st.secrets['xi_api_key'])
++     audio = generate(
++                     text=text,
++                     voice="Bella",
++                     model='eleven_multilingual_v1'
++                 )
++ 
++     audio_io = io.BytesIO(audio)
++     insert_audio = AudioSegment.from_file(audio_io, format='mp3')
++     insert_audio.export(filename, format="mp3")
++     return filename
...

Again, the process is pretty straightforward at this point. We define the name of our dubs file, then generate the speech from the text variable, which is the translation result from the earlier process. We use "Bella", one of the pre-made voices provided freely by ElevenLabs, and the eleven_multilingual_v1 model. At the end of the function, since we need the generated speech written to an mp3 file, we use pydub's AudioSegment to export the speech as mp3.
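
As with the translation function, a hypothetical one-off test of generate_dubs() could look like the snippet below. It assumes xi_api_key is set in secrets.toml and, keep in mind, it consumes ElevenLabs characters.

# Hypothetical one-off test; not part of the final app
mp3_path = generate_dubs("Cześć! To jest krótki test naszego dubbingu.")
print(mp3_path)  # should print "output.mp3"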

We also invoke the function inside our "Transcribe!" button click handler. Like so.

...
    translation = generate_translation(transcription['text'], language)
++ 
++     dubbing_caption = st.caption("Begin dubbing...")
++     dubs_audio = generate_dubs(translation)
++     dubbing_caption.caption("Dubs generated! combining with the video...")
    ...

On the UI side, we simply show a caption when dubbing begins and another when the dubs are generated and the audio file is created. We can test this section by playing the generated audio (output.mp3); it should read the translated text in the "Bella" voice.
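
Any media player will do, but if you'd rather stay in Python, a minimal sketch using pydub's playback helper works too (it falls back to ffplay, which ships with ffmpeg, when no audio backend is installed):

from pydub import AudioSegment
from pydub.playback import play

# Load and play the generated dubs file
play(AudioSegment.from_file("output.mp3", format="mp3"))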

Final Touch - Combining the Video with the Generated Dubs

Alright, this is the final step of our automated dubs generation process. Since at this point we already have the dubs file to be combined, let's retrieve the video stream from the Youtube link this time! Before we start adding more code, let's finish up our import statements.

...
import anthropic
import io
from elevenlabs import generate, set_api_key
++ import os
++ import subprocess
++ from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip
...

Done? Great! Add this code just below the earlier dubbing caption code.

...
    dubbing_caption = st.caption("Begin dubbing...")
    dubs_audio = generate_dubs(translation)
    dubbing_caption.caption("Dubs generated! combining with the video...")

++ 
++     video_streams = yt.streams.filter(only_video=True)
++     video_filename = video_streams.first().download()
++ 
++     if video_filename:
++         dubbing_caption.caption("Video downloaded! combining the video and the dubs...")
++         output_filename = combine_video(video_filename, dubs_audio)
++         if os.path.exists(output_filename):
++             dubbing_caption.caption("Video  successfully dubbed! Enjoy! πŸ˜€")
++             st.video(output_filename)

In this part, we download the video stream from the Youtube link. After the video is successfully downloaded, we combine it with the dubs audio from the earlier process. Finally, when the combining process is done, the caption updates to indicate that the video was successfully dubbed.

Notice that in this part we invoke a function called combine_video(), which, as the name suggests, combines the video file with the dubs audio file. Let's write the function below our dubs generation function.

...
    insert_audio = AudioSegment.from_file(audio_io, format='mp3')
    insert_audio.export(filename, format="mp3")
    return filename
++ 
++ def combine_video(video_filename, audio_filename):
++     ffmpeg_extract_subclip(video_filename, 0, 60, targetname="cut_video.mp4")
++     output_filename = "output.mp4"
++     command = ["ffmpeg", "-y", "-i", "cut_video.mp4", "-i", audio_filename, "-c:v", "copy", "-c:a", "aac", output_filename]
++     subprocess.run(command)
++     return output_filename
...

This function is where we run ffmpeg using subprocess. First, ffmpeg_extract_subclip cuts the downloaded video down to its first 60 seconds, then ffmpeg muxes the cut video with the dubs: -c:v copy copies the video stream without re-encoding, while -c:a aac encodes our mp3 dubs into AAC so they fit in the mp4 container. Please make sure once again that ffmpeg is installed on your machine before proceeding.

Testing the Auto Dubbing Service

With all the processes and requirements covered, let's test run our app again! If for some reason the app was terminated, run it again with the streamlit run command; if the app is already running, we can rerun it by clicking the Rerun option in the upper-right menu.

When everything is ready, let's click our "Transcribe!" button again! This time, please wait until the whole process is finished, as it involves requests to several different APIs, with various files being written as well.

The video combination process is finished, resulting in a video player which we can run

Amazing! Our app should show a video player once the video combining process is finished, as well as update the caption to "Video successfully dubbed! Enjoy! πŸ˜€". We can try playing the video, which should be the same video, but with our dubs in Polish.

Conclusion

In this somewhat lengthy tutorial, we explored one of the many exciting use cases of ElevenLabs' API, namely generating translated dubs for Youtube videos! Although the process is longer than in my usual tutorials, we got the whole app done with all the required features, thanks to the Streamlit library we used to build our UI.

We were able to provide all the features and processes required to automatically generate dubs for Youtube videos by combining various AI services. It starts with OpenAI's Whisper transcribing the video. Anthropic's Claude comes into play when we need to generate a natural translation of the transcript. Finally, ElevenLabs' multilingual model provides the finishing touch, generating natural-sounding dubs complete with native accent and pronunciation.

Thank you for joining me in exploring this interesting use case of ElevenLabs' API. I sure had fun building the project and writing the tutorial around it, so I hope you, dear readers, have fun and perhaps learn a thing or two as well! Finally, you can find the finished project on my Github. See you in the next tutorial!