Top Builders

Explore the top contributors with the highest number of app submissions in our community.

OpenAI Whisper

The Whisper models are trained for speech recognition and translation tasks: they can transcribe speech audio into text in the language it is spoken (ASR) as well as translate it into English (speech translation). Whisper was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Whisper is an encoder-decoder model. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
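The open-source whisper package from the repository below exposes this pipeline directly. The sketch that follows uses its documented lower-level API: it pads or trims the audio to a 30-second window, builds the log-Mel spectrogram, detects the spoken language, and decodes the caption. The file name audio.mp3 is a placeholder.

```python
import whisper

# Load a small checkpoint; other sizes trade speed for accuracy.
model = whisper.load_model("base")

# Load audio and pad/trim it to the 30-second window the model expects.
audio = whisper.load_audio("audio.mp3")  # placeholder file name
audio = whisper.pad_or_trim(audio)

# Convert to a log-Mel spectrogram on the same device as the model.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language identification is handled by the same model via special tokens.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the spectrogram into text; task="translate" would produce English instead.
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)
```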

General
Release date: September 2022
Author: OpenAI
Repository: https://github.com/openai/whisper
Type: general-purpose speech recognition model

Start building with Whisper

We have collected the best Whisper libraries and resources to help you get started building with Whisper today. To see what others are building with Whisper, check out the community-built Whisper Use Cases and Applications.

Tutorials

Boilerplates

Kickstart your development with a Whisper-based boilerplate. Boilerplates are a great way to get a head start when building your next project with Whisper.


Libraries

Whisper API libraries and connectors.
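As one illustration, OpenAI's official Python library exposes the hosted Whisper model through its audio transcription endpoint. This is a minimal sketch, assuming an OPENAI_API_KEY environment variable and a local file audio.mp3 as a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send a local audio file to the hosted Whisper model for transcription.
with open("audio.mp3", "rb") as audio_file:  # placeholder file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```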


OpenAI Whisper AI technology Hackathon projects

Discover innovative solutions crafted with OpenAI Whisper AI technology, developed by our community members during our engaging hackathons.

LIA — Autonomous AI with Evolving Identity

LIA (Laboratoire d'Intelligence Artificielle) is an autonomous AI agent built around a modular brain architecture inspired by the human nervous system. Unlike conventional AI assistants that simply respond to queries, LIA is designed to live, evolve, and grow — even between conversations.

The core innovation is a multi-brain architecture running on AMD Instinct MI300X via ROCm and vLLM. Each specialized LLM handles a dedicated cognitive function: a NeuralRouter (Qwen2.5-1.5B) dispatches every input to the right module; LangBrain (Qwen2.5-72B) handles natural conversation; CodeBrain (Qwen2.5-Coder-32B) generates and executes code; VisionBrain (Llama-3.2-Vision-11B) processes images; PromptBrain dynamically calibrates generation parameters; and QueryBrain uses function calling to intelligently retrieve only the relevant memories and identity artifacts from the database — replacing a rigid menu system with autonomous tool use.

What makes LIA truly unique is its Sims-inspired autonomy system. LIA has evolving personality traits (curiosity, empathy, creativity), internal gauges (exploration, growth, connection), desires (short-term goals), and dreams (long-term aspirations). These interact in a continuous cycle: traits generate desires, gauges create urgency, desires trigger actions, and accomplished actions evolve the traits. The system runs autonomously in the background, whether or not a user is present. Most critically, when LIA formulates a desire requiring a capability she does not yet have, CodeBrain steps in to build that capability from scratch — a sandboxed self-improvement loop with rollback protection and human approval gates.

The AMD MI300X with 192GB HBM3 VRAM was essential to this architecture. Running five specialized models simultaneously at full FP16 precision was simply impossible on CPU-based infrastructure. ROCm 7.2 and vLLM 0.17.1 provided the multi-model serving layer that makes the entire system work in real time.
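The router-plus-specialists pattern described above can be sketched with plain OpenAI-compatible clients talking to separate vLLM servers. Everything below is illustrative: the ports, route labels, and classify_route prompt are assumptions, not LIA's actual code.

```python
from openai import OpenAI

# Hypothetical setup: one vLLM server per "brain". Ports and model names are
# illustrative assumptions, not LIA's actual configuration.
BRAINS = {
    "chat":   ("http://localhost:8001/v1", "Qwen/Qwen2.5-72B-Instruct"),        # LangBrain
    "code":   ("http://localhost:8002/v1", "Qwen/Qwen2.5-Coder-32B-Instruct"),  # CodeBrain
    "vision": ("http://localhost:8003/v1", "meta-llama/Llama-3.2-11B-Vision-Instruct"),
}
ROUTER_URL, ROUTER_MODEL = "http://localhost:8000/v1", "Qwen/Qwen2.5-1.5B-Instruct"

def classify_route(user_input: str) -> str:
    """Ask the small router model which specialist should handle the input."""
    router = OpenAI(base_url=ROUTER_URL, api_key="EMPTY")
    resp = router.chat.completions.create(
        model=ROUTER_MODEL,
        messages=[
            {"role": "system",
             "content": "Reply with exactly one word: chat, code, or vision."},
            {"role": "user", "content": user_input},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in BRAINS else "chat"  # fall back to conversation

def dispatch(user_input: str) -> str:
    """Send the input to the routed specialist model and return its reply."""
    base_url, model_name = BRAINS[classify_route(user_input)]
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": user_input}],
    )
    return resp.choices[0].message.content

print(dispatch("Write a Python function that reverses a string."))
```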

TempoGraph: Local Multimodal Video Analysis

TempoGraph is a fully local, privacy-preserving multimodal video analysis system that turns raw video files into rich structured outputs — entities, behaviors, transcripts, timelines, and interactive knowledge graphs — without sending a single frame to the cloud.

Stage 1 — Frame Selection: Motion-aware sampling with static, moving, and auto camera modes. For moving cameras it estimates homography to separate object motion from camera movement, then identifies keyframes where motion peaks exceed a configurable sigma threshold.

Stage 1.5 — Audio Transcription: Whisper.cpp running on Vulkan transcribes the full audio track into millisecond-accurate segments.

Stage 2 — YOLO Detection: YOLO26 runs on a second GPU over every sampled frame, outputting normalized bounding boxes, class names, track IDs, and confidence scores.

Stage 3 — Depth Estimation: Depth Anything V2 via HuggingFace Transformers adds per-detection mean depth to every bounding box, giving 3D spatial context to 2D detections.

Stage 4 — Frame Scoring: Picks which frames the VLM actually sees. In keyframes mode, only motion-peak frames are forwarded. In scored mode, FrameScorer ranks all YOLO-scanned frames using a weighted combination of motion delta, new YOLO class appearances, tracked object churn, and IoU drop between frames, then fills the VLM budget with the highest-signal frames. Keyframes are always pinned first regardless of mode.

Stage 5 — VLM Captioning: Qwen3.5-VL-9B served by a custom llama.cpp build compiled for AMD ROCm/HIP, running on an AMD RX 9070 XT with a 100k-token context window. Frames are chunked and sent to the model alongside YOLO-derived annotations. Each chunk's summary seeds the next prompt for narrative continuity across the video.

Stage 6 — Aggregation: A final text-only LLM call synthesizes all per-chunk captions and the audio transcript into structured JSON with entities, visual events, audio events, and multimodal correlations linking what was said to what was seen.
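The scored mode in Stage 4 can be illustrated with a small ranking function. This is a sketch of the general idea only: the weights, field names, and normalization are assumptions, not TempoGraph's actual FrameScorer.

```python
from dataclasses import dataclass

@dataclass
class FrameStats:
    """Per-frame signals gathered during the YOLO pass (field names are illustrative)."""
    index: int
    motion_delta: float   # change versus the previous sampled frame, normalized to 0..1
    new_classes: int      # YOLO classes not seen in the previous frame
    track_churn: int      # track IDs that appeared or disappeared
    iou_drop: float       # 1 - mean IoU of matched boxes vs. previous frame, 0..1

# Assumed weights for combining the signals into one score.
WEIGHTS = {"motion_delta": 0.4, "new_classes": 0.25, "track_churn": 0.15, "iou_drop": 0.2}

def score(frame: FrameStats) -> float:
    """Weighted combination of motion, class novelty, track churn, and IoU drop."""
    return (
        WEIGHTS["motion_delta"] * frame.motion_delta
        + WEIGHTS["new_classes"] * min(frame.new_classes / 3.0, 1.0)  # cap novelty
        + WEIGHTS["track_churn"] * min(frame.track_churn / 5.0, 1.0)  # cap churn
        + WEIGHTS["iou_drop"] * frame.iou_drop
    )

def select_frames(frames: list[FrameStats], keyframes: set[int], budget: int) -> list[int]:
    """Pin motion-peak keyframes first, then fill the VLM budget with top-scoring frames."""
    selected = [f.index for f in frames if f.index in keyframes][:budget]
    remaining = sorted((f for f in frames if f.index not in keyframes), key=score, reverse=True)
    for f in remaining:
        if len(selected) >= budget:
            break
        selected.append(f.index)
    return sorted(selected)
```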

VesperGrid

VesperGrid helps industrial teams detect, understand, and respond to hazards before they escalate. It brings together evidence from cameras, drones, gas sensors, wind readings, voice reports, and operator notes, then turns that information into a clear incident view with affected zones, uncertainty, source evidence, and recommended response actions.

Because real industrial hazard data is sensitive and difficult to access, this project uses a fully synthetic LNG terminal scenario. A Gazebo and ROS2 simulation generates the operational environment, including CCTV views, a drone feed, a visible gas plume, gas concentration changes, and wind drift. These simulated signals are sent into a FastAPI backend through an evidence ingest pipeline, where each input is processed and linked to the incident state.

The system is designed for multimodal analysis on AMD MI300X. Visual evidence is parsed with Qwen2.5-VL served through vLLM, gas and wind traces are evaluated with deterministic safety logic for stable and auditable hazard scoring, and voice reports are transcribed with faster-whisper using a configured Whisper speech-to-text model. The processed evidence then flows into VesperGrid's main orchestration layer, which combines all inputs into one source-linked operational state.

From there, VesperGrid suggests possible response actions to the human operator, explains the evidence behind each action, highlights uncertainty, and shows the likely consequences of different choices before any action is approved. The final output is shown in a React command dashboard where operators can review live feeds, inspect evidence, understand risk zones, and initiate the next response. VesperGrid does not replace the human decision-maker. It gives operators a faster, clearer, and more accountable way to act when safety depends on minutes.
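The voice-report path is the part most directly tied to Whisper. Below is a minimal faster-whisper transcription sketch, assuming a local recording report.wav and a generic model size rather than VesperGrid's actual configuration.

```python
from faster_whisper import WhisperModel

# Model size, device, and compute type are assumptions; the project's configuration may differ.
model = WhisperModel("small", device="cpu", compute_type="int8")

# Transcribe an operator voice report; segments carry start/end timestamps.
segments, info = model.transcribe("report.wav")  # placeholder file name

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.2f}s -> {seg.end:6.2f}s] {seg.text.strip()}")
```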