OpenAI Whisper tutorial: how to create speaker identification app

Thursday, October 13, 2022 by ezzcodeezzlife

Discovering OpenAI Whisper

Whisper, a revolutionary speech recognition system by OpenAI, has been fine-tuned with 680,000 hours of multilingual, multitask supervised data gathered from the web. This extensive dataset enhances resilience to accents, background noise, and specialized language. Whisper transcribes in numerous languages and even translates into English. OpenAI offers models and codes to form the backbone for valuable, speech recognition-based applications.

However, Whisper has difficulty identifying speakers in a conversation. Diarization, the process of determining speaker identity, is crucial for conversation analysis. In this OpenAI Whisper tutorial, learn to recognize speakers and align them with Whisper transcriptions using pyannote-audio. Let's dive in!

Let's dive in!

Preparing the audio

First, we need to prepare the audio file. We will use the first 20 minutes of Lex Fridmans podcast with Yann LeCun. To download the video and extract the audio, we will use yt-dlp package.

!pip install -U yt-dlp

We will also need ffmpeg installed.

!wget -O - -q  https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc| tar -x

Now we can do the actual download and audio extraction via the command line.

!yt-dlp -xv --ffmpeg-location ffmpeg-master-latest-linux64-gpl/bin --audio-format wav  -o download.wav -- https://youtu.be/SGzMElJ11Cc

Now we have the download.wav file in our working directory. Let's cut the first 20 minutes of the audio. We can use the pydub package for this with just a few lines of code.

!pip install pydub

from pydub import AudioSegment

t1 = 0 * 1000 # works in milliseconds
t2 = 20 * 60 * 1000

newAudio = AudioSegment.from_wav("download.wav")
a = newAudio[t1:t2]
a.export("audio.wav", format="wav")

audio.wav is now the first 20 minutes of the audio file.

Pyannote's Diarization

pyannote.audio is an open-source toolkit written in Python for speaker diarization. Based on PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pretrained models and pipelines covering a wide range of domains for voice activity detection, speaker segmentation, overlapped speech detection, speaker embedding reaching state-of-the-art performance for most of them.

Installing Pyannote and running it on the video audio to generate the diarizations.

!pip install pyannote.audio

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization')

DEMO_FILE = {'uri': 'blabal', 'audio': 'audio.wav'}
dz = pipeline(DEMO_FILE)  

with open("diarization.txt", "w") as text_file:
    text_file.write(str(dz))

Lets print this out to see what it looks like.

print(*list(dz.itertracks(yield_label = True))[:10], sep=&quot;\n&quot;)

The output:

(&lt;Segment(2.03344, 36.8128)&gt;, 0, &#39;SPEAKER_00&#39;)
(&lt;Segment(38.1122, 51.3759)&gt;, 0, &#39;SPEAKER_00&#39;)
(&lt;Segment(51.8653, 90.2053)&gt;, 1, &#39;SPEAKER_01&#39;)
(&lt;Segment(91.2853, 92.9391)&gt;, 1, &#39;SPEAKER_01&#39;)
(&lt;Segment(94.8628, 116.497)&gt;, 0, &#39;SPEAKER_00&#39;)
(&lt;Segment(116.497, 124.124)&gt;, 1, &#39;SPEAKER_01&#39;)
(&lt;Segment(124.192, 151.597)&gt;, 1, &#39;SPEAKER_01&#39;)
(&lt;Segment(152.018, 179.12)&gt;, 1, &#39;SPEAKER_01&#39;)
(&lt;Segment(180.318, 194.037)&gt;, 1, &#39;SPEAKER_01&#39;)
(&lt;Segment(195.016, 207.385)&gt;, 0, &#39;SPEAKER_00&#39;)

This looks pretty good already, but let's clean the data a little bit:

def millisec(timeStr):
  spl = timeStr.split(":")
  s = (int)((int(spl[0]) * 60 * 60 + int(spl[1]) * 60 + float(spl[2]) )* 1000)
  return s

import re
dz = open('diarization.txt').read().splitlines()
dzList = []
for l in dz:
  start, end =  tuple(re.findall('[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=l))
  start = millisec(start) - spacermilli
  end = millisec(end)  - spacermilli
  lex = not re.findall('SPEAKER_01', string=l)
  dzList.append([start, end, lex])

print(*dzList[:10], sep='\n')

[33, 34812, True]
[36112, 49375, True]
[49865, 88205, False]
[89285, 90939, False]
[92862, 114496, True]
[114496, 122124, False]
[122191, 149596, False]
[150018, 177119, False]
[178317, 192037, False]
[193015, 205385, True]

Now we have the diarization data in a list. The first two numbers are the start and end time of the speaker segment in milliseconds. The third number is a boolean that tells us if the speaker is Lex or not.

Preparing audio file from the diarization

Next, we will attach the audio segements according to the diarization, with a spacer as the delimiter.

from pydub import AudioSegment
import re 

sounds = spacer
segments = []

dz = open('diarization.txt').read().splitlines()
for l in dz:
  start, end =  tuple(re.findall('[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=l))
  start = int(millisec(start)) #milliseconds
  end = int(millisec(end))  #milliseconds
  
  segments.append(len(sounds))
  sounds = sounds.append(audio[start:end], crossfade=0)
  sounds = sounds.append(spacer, crossfade=0)

sounds.export("dz.wav", format="wav") #Exports to a wav file in the current path.

print(segments[:8])

[2000, 38779, 54042, 94382, 98036, 121670, 131297, 160702]

Transcription with Whisper

Next, we will use Whisper to transcribe the different segments of the audio file. Important: There is a version conflict with pyannote.audio resulting in an error. Our workaround is to first run Pyannote and then whisper. You can safely ignore the error.

Installing Open AI Whisper.

!pip install git+https://github.com/openai/whisper.git

Running Open AI whisper on the prepared audio file. It writes the transcription into a file. You can adjust the model size to your needs. You can find all models on the model card on Github.

!whisper dz.wav --language en --model base

[00:00.000 --&gt; 00:04.720]  The following is a conversation with Yann LeCun,
[00:04.720 --&gt; 00:06.560]  his second time on the podcast.
[00:06.560 --&gt; 00:11.160]  He is the chief AI scientist at Meta, formerly Facebook,
[00:11.160 --&gt; 00:15.040]  professor at NYU, touring award winner,
[00:15.040 --&gt; 00:17.600]  one of the seminal figures in the history
[00:17.600 --&gt; 00:20.460]  of machine learning and artificial intelligence,
...

In order to work with .vtt files, we need to install the webvtt-py library.

!pip install -U webvtt-py

Lets take a look at the data:

import webvtt

captions = [[(int)(millisec(caption.start)), (int)(millisec(caption.end)),  caption.text] for caption in webvtt.read('dz.wav.vtt')]
print(*captions[:8], sep='\n')

[0, 4720, &#39;The following is a conversation with Yann LeCun,&#39;]
[4720, 6560, &#39;his second time on the podcast.&#39;]
[6560, 11160, &#39;He is the chief AI scientist at Meta, formerly Facebook,&#39;]
[11160, 15040, &#39;professor at NYU, touring award winner,&#39;]
[15040, 17600, &#39;one of the seminal figures in the history&#39;]
[17600, 20460, &#39;of machine learning and artificial intelligence,&#39;]
[20460, 23940, &#39;and someone who is brilliant and opinionated&#39;]
[23940, 25400, &#39;in the best kind of way,&#39;]
...

Matching the Transcriptions and the Diarizations

Next, we will match each transcribtion line to some diarizations, and display everything by generating a HTML file. To get the correct timing, we should take care of the parts in original audio that were in no diarization segment. We append a new div for each segment in our audio.

# we need this fore our HTML file (basicly just some styling)
preS = '<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="UTF-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta http-equiv="X-UA-Compatible" content="ie=edge">\n    <title>Lexicap</title>\n    <style>\n        body {\n            font-family: sans-serif;\n            font-size: 18px;\n            color: #111;\n            padding: 0 0 1em 0;\n        }\n        .l {\n          color: #050;\n        }\n        .s {\n            display: inline-block;\n        }\n        .e {\n            display: inline-block;\n        }\n        .t {\n            display: inline-block;\n        }\n        #player {\n\t\tposition: sticky;\n\t\ttop: 20px;\n\t\tfloat: right;\n\t}\n    </style>\n  </head>\n  <body>\n    <h2>Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning | Lex Fridman Podcast #258</h2>\n  <div  id="player"></div>\n    <script>\n      var tag = document.createElement(\'script\');\n      tag.src = "https://www.youtube.com/iframe_api";\n      var firstScriptTag = document.getElementsByTagName(\'script\')[0];\n      firstScriptTag.parentNode.insertBefore(tag, firstScriptTag);\n      var player;\n      function onYouTubeIframeAPIReady() {\n        player = new YT.Player(\'player\', {\n          height: \'210\',\n          width: \'340\',\n          videoId: \'SGzMElJ11Cc\',\n        });\n      }\n      function setCurrentTime(timepoint) {\n        player.seekTo(timepoint);\n   player.playVideo();\n   }\n    </script><br>\n'
postS = '\t</body>\n</html>'

from datetime import timedelta

html = list(preS)

for i in range(len(segments)):
  idx = 0
  for idx in range(len(captions)):
    if captions[idx][0] >= (segments[i] - spacermilli):
      break;
  
  while (idx < (len(captions))) and ((i == len(segments) - 1) or (captions[idx][1] < segments[i+1])):
    c = captions[idx]  
    
    start = dzList[i][0] + (c[0] -segments[i])

    if start < 0: 
      start = 0
    idx += 1

    start = start / 1000.0
    startStr = '{0:02d}:{1:02d}:{2:02.2f}'.format((int)(start // 3600), 
                                            (int)(start % 3600 // 60), 
                                            start % 60)
    
    html.append('\t\t\t<div class="c">\n')
    html.append(f'\t\t\t\t<a class="l" href="#{startStr}" id="{startStr}">link</a> |\n')
    html.append(f'\t\t\t\t<div class="s"><a href="javascript:void(0);" onclick=setCurrentTime({int(start)})>{startStr}</a></div>\n')
    html.append(f'\t\t\t\t<div class="t">{"[Lex]" if dzList[i][2] else "[Yann]"} {c[2]}</div>\n')
    html.append('\t\t\t</div>\n\n')

html.append(postS)
s = "".join(html)

with open("lexicap.html", "w") as text_file:
    text_file.write(s)
print(s)

You can take a look at the results here or view the complete code as notebook

Where can I use this knowledge?

Well, the first thing which comes to my mind would be an Whisper app you could build during AI hackathon with other like-minded people from all around the world. Then apply to New Native's slingshot program to accelerate your project to a new level. And then... put it on the market and solve one of the world's problems with AI.

The second one would be to keep the skills you learned to yourself and put the project on the shelf and let others change the world and fix its problems. This one I don't recommend, but there is such an option.

During lablab.ai's AI Hackathons over 54 000 people from all around the world and with different fields of expertise created over 900 prototypes and this number grows with every week's AI hackathon. Join the biggest community of AI builders and make a difference!

Thank you for reading. If you enjoyed this tutorial you can find more dedicated
AI tutorials to help you grow in a field of AI- Fabian Stehle, Junior Data Scientist at New Native