OpenAI Whisper

The Whisper models are trained for speech recognition and translation: they can transcribe speech audio into text in the language in which it is spoken (ASR) or translate it into English (speech translation). Whisper was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Whisper is an encoder-decoder model. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
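The chunking step and the task-directing special tokens described above can be sketched in plain Python. This is an illustrative sketch, not Whisper's own code: the helper names `split_into_chunks` and `task_prefix` are hypothetical, the 16 kHz sample rate is the rate the reference implementation works at, and the tokens are shown as strings, whereas the real tokenizer maps them to integer ids.

```python
from typing import List

SAMPLE_RATE = 16_000                         # Hz; the reference implementation works at 16 kHz
CHUNK_SECONDS = 30                           # window length described in the text
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def split_into_chunks(samples: List[float]) -> List[List[float]]:
    """Split mono audio into 30-second windows, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        # Pad the final, shorter window with silence so every chunk has equal length.
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


def task_prefix(language: str, task: str = "transcribe",
                timestamps: bool = False) -> List[str]:
    """Build the special-token prefix that tells the decoder which task to run."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


if __name__ == "__main__":
    audio = [0.1] * (CHUNK_SAMPLES + 10)     # just over 30 seconds of dummy audio
    print(len(split_into_chunks(audio)))     # 2
    print(task_prefix("fr", "translate"))
    # ['<|startoftranscript|>', '<|fr|>', '<|translate|>', '<|notimestamps|>']
```

Because the task is selected by the prefix rather than by a separate model, switching from French transcription to French-to-English translation only changes one token.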

General
Release date: September 2022
Author: OpenAI
Repository: https://github.com/openai/whisper
Type: General-purpose speech recognition model

Start building with Whisper

We have collected the best Whisper libraries and resources to help you start building with Whisper today. To see what others are building with Whisper, check out the community-built Whisper Use Cases and Applications.

Tutorials

Boilerplates

Kickstart your development with a GPT-3 based boilerplate. Boilerplates are a great way to get a head start when building your next project with GPT-3.


Libraries

Whisper API libraries and connectors.


OpenAI Whisper AI technology Hackathon projects

Discover innovative solutions crafted with OpenAI Whisper AI technology, developed by our community members during our engaging hackathons.

HireBand: Autonomous Hiring Ecosystem

Technical recruiting is broken. Recruiters spend countless hours evaluating resumes, and live technical interviews are increasingly compromised by candidates secretly reading ChatGPT outputs. HireBand solves this by replacing static hiring pipelines with a self-improving ecosystem of 6 specialized AI agents. Built using the Band SDK, LangGraph, and the AIML API, HireBand orchestrates an end-to-end autonomous recruiting workflow:

1. Automated Sourcing & Screening: Our Outreach Agent runs semantic vector searches across a database of verified candidates, calculates eligibility matches based on your Job Description, and autonomously fires off real interview invitations via Gmail. When a candidate applies, the Candidate Overview Agent delegates deep analysis to the Resume Agent and GitHub Agent.

2. The Live Interview Copilot: We leverage LiveKit and OpenAI Whisper for real-time audio transcription during the interview. As the human interviewer speaks, the Interview Agent acts as a silent copilot, comparing the live transcript to the candidate's resume to generate hyper-specific follow-up questions in real time.

3. Zero-Tolerance Plagiarism Detection: To combat AI cheating, our Plagiarism Agent continuously scans the live transcript through a detection system (using a HuggingFace ML model). If a candidate gives a robotic, ChatGPT-scripted answer, the agent intercepts the interview and terminates the evaluation.

4. The Self-Learning Feedback Loop: When an interview ends, our Strategy Agent reads the transcript to figure out what technical skills the human interviewer prioritized. It autonomously updates the database rubrics of the Resume and GitHub agents. For all future candidates, the ecosystem will prioritize those exact skills from the initial screening.

The more you use HireBand, the smarter it gets.
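The continuous transcript scan in the plagiarism-detection step can be pictured roughly as follows. This is a hedged sketch, not HireBand's implementation: `monitor_transcript`, the `ai_likelihood` callable, the 0.9 threshold, and the three-segment window are all hypothetical, and the toy detector in the usage example merely stands in for whatever HuggingFace model the project actually uses.

```python
from typing import Callable, Iterable, List, Tuple


def monitor_transcript(
    segments: Iterable[str],
    ai_likelihood: Callable[[str], float],  # hypothetical detector: text -> score in [0, 1]
    threshold: float = 0.9,                 # assumed cut-off, not taken from the project
    window: int = 3,                        # assumed number of recent segments to score
) -> Tuple[bool, List[str]]:
    """Score the most recent transcript segments as they arrive.

    Returns (terminated, segments_seen). `terminated` is True when the
    detector score for the recent window reaches the threshold.
    """
    seen: List[str] = []
    for segment in segments:
        seen.append(segment)
        recent = " ".join(seen[-window:])
        if ai_likelihood(recent) >= threshold:
            return True, seen               # stop consuming the live transcript
    return False, seen


if __name__ == "__main__":
    # Toy stand-in detector: flags one tell-tale phrase.
    def toy_detector(text: str) -> float:
        return 1.0 if "as an ai language model" in text.lower() else 0.0

    live = [
        "I mostly worked on the billing service.",
        "As an AI language model, I would approach this by...",
        "This segment is never reached.",
    ]
    terminated, seen = monitor_transcript(live, toy_detector)
    print(terminated, len(seen))            # True 2
```

Scoring a short sliding window rather than the whole transcript keeps each check cheap and lets a single scripted answer trigger the stop even late in a long interview.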

Dasalt 360 Enterprise AI

Dasalt 360 Ltd is a sophisticated, multimodal multi-agent enterprise system engineered specifically to navigate the intricate complexities of the West African IT hardware market. Operating from its headquarters in Jimeta, Adamawa State, Nigeria, the system serves as an autonomous "System of Record" that bridges the strategic gap between global hardware procurement in the United States, national wholesale hubs in Lagos, and local retail distribution in Yola North. The architecture is meticulously built upon Vultr’s Dedicated Cloud Compute infrastructure, ensuring secure, low-latency orchestration of business logic. The "Intelligence Layer" is powered by Vultr Serverless Inference, utilizing Llama 3.1 70B for high-precision financial reasoning and Llama 3.2 Vision for the optical identification of hardware assets. A primary innovation of this project is its ability to handle the extreme currency volatility of the Nigerian Naira (₦). The AI acts as an autonomous Chief Financial Officer (CFO), performing real-time landing cost calculations based on a fixed 1,400 NGN/USD exchange rate, while simultaneously integrating localized logistics overheads—specifically the 7,000 Naira per-unit secure transit fee from Lagos to Yola. Beyond text-based automation, the system demonstrates the "Future of Work" through a multimodal interface. It incorporates a bespoke Voice-to-Voice engine and Vision-to-Text capabilities, allowing CEO Christopher Krim and his team to manage inventory hands-free in the warehouse or via photographic evidence of arrivals. To maintain the highest corporate standards, the system is instructed to curate every response in elegant, sophisticated natural language, providing polished executive briefs that eliminate technical jargon. By centralizing planning, coordination, and execution on Vultr, Dasalt 360 Ltd exemplifies a new era of autonomous enterprise operations, ensuring business sustainability and profitability in a challenging economic environment.
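The landing cost arithmetic mentioned above can be illustrated with a small sketch. It uses only the two figures stated in the description, the fixed 1,400 NGN/USD rate and the 7,000 Naira per-unit transit fee from Lagos to Yola; the function name `landed_cost_ngn` is hypothetical, and any other cost components (duties, international freight, margins) are deliberately left out because the description does not specify them.

```python
from decimal import Decimal

NGN_PER_USD = Decimal("1400")      # fixed exchange rate stated in the description
TRANSIT_FEE_NGN = Decimal("7000")  # per-unit Lagos-to-Yola transit fee


def landed_cost_ngn(unit_price_usd: str, quantity: int = 1) -> Decimal:
    """Convert a USD unit price to Naira and add the per-unit transit fee.

    Assumption: only the exchange rate and the transit fee contribute to the cost.
    """
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    per_unit = Decimal(unit_price_usd) * NGN_PER_USD + TRANSIT_FEE_NGN
    return per_unit * quantity


if __name__ == "__main__":
    print(landed_cost_ngn("250"))      # 357000
    print(landed_cost_ngn("250", 10))  # 3570000
```

For a unit bought at 250 USD, the conversion gives 350,000 Naira and the transit fee brings the per-unit landed cost to 357,000 Naira. `Decimal` is used instead of floats so that currency amounts are not subject to binary rounding.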

Meridian

Video is one of the most information-dense formats that exists, but almost none of that information is actually accessible after the recording ends. You can scrub through a timeline, you can search a transcript if one exists, but if the answer to your question was written on a whiteboard, shown on a slide, or explained through a gesture without being spoken aloud, you simply cannot find it. Meridian was built to fix that. When you upload a video, Meridian processes it through three completely independent channels simultaneously. It transcribes everything that was spoken with precise word-level timestamps. It reads every frame for on-screen text, picking up slides, diagrams, formulas, and annotations. And it generates a full natural language description of the visual scene in every frame, capturing what the speaker is doing, pointing at, or drawing. When you ask a question, Meridian searches across all three of those knowledge stores at once and uses Gemini 2.5 Pro as an AI reasoning agent to identify the single moment in the video that best answers what you asked. The video seeks to that timestamp automatically. You also see exactly which source fired the strongest signal for that answer, whether it was the spoken word, the on-screen text, or the visual scene, so you can trust the result. The target audience is anyone who works with recorded video as a knowledge source: engineering teams reviewing architecture discussions, legal and compliance teams indexing regulatory training libraries, researchers cataloguing interview recordings, or enterprise teams making internal video knowledge bases actually searchable. Meridian makes the full content of any video as accessible as a well-indexed document.
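The idea of searching three knowledge stores at once and reporting which source gave the strongest signal can be sketched as below. This is a simplified illustration under stated assumptions, not Meridian's pipeline: the store names, the `Hit` type, and the word-overlap score are hypothetical, and a plain highest-score rule stands in for the Gemini 2.5 Pro reasoning step described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    source: str       # which store matched: "speech", "ocr", or "scene"
    timestamp: float  # position in the video, in seconds
    score: float      # toy relevance score in [0, 1]
    text: str         # the indexed text that matched


def keyword_score(query: str, text: str) -> float:
    """Toy relevance: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def best_moment(query: str,
                stores: Dict[str, List[Tuple[float, str]]]) -> Optional[Hit]:
    """Search every store and return the single highest-scoring moment."""
    hits = [
        Hit(source, timestamp, keyword_score(query, text), text)
        for source, entries in stores.items()
        for timestamp, text in entries
    ]
    hits = [hit for hit in hits if hit.score > 0]
    return max(hits, key=lambda hit: hit.score, default=None)


if __name__ == "__main__":
    stores = {
        "speech": [(12.0, "we will discuss the cache layer")],
        "ocr": [(95.5, "retry budget = 3 attempts")],
        "scene": [(40.0, "speaker points at a diagram")],
    }
    hit = best_moment("retry budget", stores)
    print(hit.source, hit.timestamp)   # ocr 95.5
```

Here the answer was never spoken aloud: it only appears in the on-screen text store, so the player would seek to 95.5 seconds and attribute the match to that source.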
