OpenAI Whisper Transcription Guide
What is OpenAI Whisper?
Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It was trained on 680,000 hours of multilingual audio data, making it one of the most capable transcription systems available. Whisper can transcribe speech in 70+ languages with impressive accuracy.
Unlike older transcription systems that struggle with accents, background noise, or multiple speakers, Whisper handles real-world audio remarkably well. This makes it ideal for language learners working with podcasts, YouTube videos, TV shows, and other authentic content.
How Mimikaki Uses Whisper
When you upload audio to Mimikaki, we send it to OpenAI's Whisper API for transcription. The API returns timestamped text that we transform into interactive segments you can click to replay.
Key features of our Whisper integration:
- Timestamped segments: each sentence is linked to its exact position in the audio
- Language detection: Whisper auto-detects the language (or you can specify it)
- large-v3 model: we use Whisper's most accurate model for best results
- Editable transcripts: fix any mistakes directly in the app
Supported Languages
Whisper supports transcription for 70+ languages. For language learners, accuracy varies by language based on how much training data was available:
Excellent Accuracy (5% or lower word error rate)
English, Spanish, Italian, Portuguese, German, French, Japanese, Korean, Chinese, Dutch, Polish, Russian, Turkish, Vietnamese, Indonesian, Thai
Good Accuracy (5-10% word error rate)
Finnish, Swedish, Norwegian, Danish, Greek, Czech, Romanian, Hungarian, Hindi, Arabic, Hebrew
Moderate Accuracy (10-20% word error rate)
Less common languages may have higher error rates but are still useful for comprehension practice.
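For readers unfamiliar with the metric, word error rate (WER) is the number of word-level edits (substitutions, insertions, deletions) needed to turn the transcript into the reference text, divided by the number of reference words. The sketch below is a minimal illustration, not the method used to produce the tiers above. The function name is hypothetical, and it splits on whitespace, so it does not apply as written to languages such as Japanese or Chinese, where a character-level rate is commonly used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, one wrong word in a ten-word sentence gives a WER of 10%, which would sit at the boundary between the "Good" and "Moderate" tiers.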
Tip: Even with imperfect transcription, you can import your own subtitles for free. Many podcasts and videos already have subtitle files available.
Transcription Credits
Mimikaki uses a credit system for transcription:
| Audio Length | Credits Used |
|---|---|
| Up to 10 minutes | 1 credit |
| Up to 20 minutes | 2 credits |
| Up to 30 minutes | 3 credits |
| Up to 60 minutes | 6 credits |
New users get 3 free credits to try the service. See our pricing page for credit packs.
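The table is consistent with one credit per started 10-minute block. The sketch below encodes that reading so you can estimate cost before uploading. It is an inference from the four rows shown, not Mimikaki's billing code: the constant, the function name, and the behavior for lengths the table does not list (for example 45 minutes, or anything over 60) are assumptions.

```python
import math

# Assumption inferred from the table: one credit per started 10-minute block.
MINUTES_PER_CREDIT = 10


def estimate_credits(duration_minutes: float) -> int:
    """Estimate the credits needed for audio of the given length in minutes."""
    if duration_minutes <= 0:
        raise ValueError("duration must be positive")
    return math.ceil(duration_minutes / MINUTES_PER_CREDIT)
```

Under this reading, a 25-minute podcast episode would use 3 credits, the same as a 30-minute one.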
Tips for Better Transcription
1. Choose Clear Audio
Whisper works best with clear speech and minimal background noise. Podcasts and audiobooks typically produce excellent results. Music and sound effects can interfere with accuracy.
2. Single Speaker is Easier
While Whisper can handle multiple speakers, accuracy is highest for single-speaker content like language learning podcasts or audiobooks.
3. Specify the Language
When uploading, select the correct language. While auto-detection usually works, specifying the language skips the detection step and keeps Whisper from transcribing into the wrong language or script.
4. Fix Mistakes as You Go
No transcription is perfect. Click the edit button on any line to fix errors. Your corrections are saved and help you review more accurately later.
When to Use Whisper vs. Importing Subtitles
| Use Whisper Transcription | Import Existing Subtitles |
|---|---|
| Content without subtitles | YouTube videos with captions |
| Podcasts and audiobooks | TV shows with .srt files |
| Personal recordings | Netflix content (via tools) |
| Quick one-off transcription | When you want free/perfect text |
Remember: Importing subtitles is always free and doesn't use credits. Check if subtitles already exist before transcribing!
Running Whisper Locally (Free)
If you want to transcribe for free, you can run Whisper on your own computer. This requires some technical setup but costs nothing per transcription.
Option 1: FFmpeg (Recommended)
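FFmpeg 8.0+ includes a Whisper filter powered by whisper.cpp, so you can transcribe with a single command. No Python required. The filter is only available in builds configured with --enable-whisper, so check that your build includes it.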
Install FFmpeg 8.0+
brew install ffmpeg
Download a Whisper model
Grab a GGML model file (e.g. ggml-large-v3.bin for best accuracy, or ggml-base.bin for speed).
Transcribe to SRT
ffmpeg -i your-audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:destination=output.srt:format=srt" -f null -
This outputs an output.srt file you can import into Mimikaki for free.
GPU acceleration is supported if whisper.cpp was compiled with it.
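If you transcribe many files, it can help to assemble the filter string in a script instead of editing it by hand. The sketch below rebuilds the command shown above from its parts. The helper names are hypothetical, and it assumes option values contain no ":" characters (a Windows path such as C:\models would need escaping, which this sketch does not handle).

```python
def build_whisper_filter(model, language, destination, fmt="srt", **extra):
    """Return the -af argument for FFmpeg's whisper filter.

    Extra keyword arguments are passed through as additional filter options
    and are not validated here.
    """
    options = {
        "model": model,
        "language": language,
        **extra,
        "destination": destination,
        "format": fmt,
    }
    return "whisper=" + ":".join(f"{key}={value}" for key, value in options.items())


def build_ffmpeg_command(audio_path, filter_spec):
    """Return the argument list for transcribing one audio file."""
    # -vn ignores any video stream; "-f null -" discards the audio output,
    # because the transcript is written by the filter itself.
    return ["ffmpeg", "-i", audio_path, "-vn", "-af", filter_spec, "-f", "null", "-"]
```

The returned list can be passed to subprocess.run(command, check=True) from Python's standard library. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.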
Option 2: OpenAI Whisper (Python)
macOS Installation
brew install openai-whisper
Linux/Windows Installation
pip install openai-whisper
Basic Usage
whisper your-audio.mp3 --language ja --output_format srt
This creates a your-audio.srt file that you can import into Mimikaki for free.
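If you would rather call Whisper from a Python script than from the command line, the openai-whisper package returns a result dictionary whose "segments" list carries "start", "end", and "text" for each segment. The sketch below turns such a list into SRT text. The helper names are hypothetical, and the commented lines at the end show the package calls this is meant to pair with; they need the model downloaded and an audio file on disk, so they are left commented out.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Convert segments with 'start', 'end', and 'text' keys into SRT text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


# Intended pairing with the openai-whisper package:
# import whisper
# model = whisper.load_model("large-v3")
# result = model.transcribe("your-audio.mp3", language="ja")
# with open("your-audio.srt", "w", encoding="utf-8") as f:
#     f.write(segments_to_srt(result["segments"]))
```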
Language Tags
Always specify the language for better accuracy. Common language tags:
| Language | Tag | Language | Tag |
|---|---|---|---|
| Japanese | ja | Korean | ko |
| Chinese | zh | Finnish | fi |
| Spanish | es | French | fr |
| German | de | Portuguese | pt |
| Italian | it | Russian | ru |
Initial Prompts for Better Accuracy
Whisper's initial_prompt parameter conditions the model on text that supposedly appeared before the audio. This lets you influence output style, punctuation, and vocabulary without changing the model itself.
How it works
The prompt acts as fake "previous context" — Whisper treats it as if those words just appeared in the transcription. This nudges the model toward similar style, terminology, and formatting in its output.
FFmpeg
ffmpeg -i audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:initial_prompt=JLPT、日本語能力試験:destination=output.srt:format=srt" -f null -
Python
whisper audio.mp3 --language ja --initial_prompt "JLPT、日本語能力試験" --output_format srt
What to put in the prompt
Keep it short — list specific words rather than writing full sentences. Good uses:
- Names and terms: "Dr. Tanaka, JLPT, N1" (helps Whisper spell domain-specific words correctly)
- Punctuation style: Include properly punctuated text in the target language to encourage consistent formatting
- Script preference: For Japanese, use kanji in the prompt to discourage unnecessary katakana or romaji output
Gotchas:
- The prompt only influences the first ~30 seconds of audio. For longer files, Whisper replaces it with its own previous output as context.
- Long, descriptive prompts often cause hallucinations — the model starts repeating the prompt text instead of transcribing the audio. Keep it to a short list of words.
- The prompt doesn't act as an instruction. Writing "please transcribe accurately" won't help — instead, show the style you want by example.
Example prompts by language:
- Japanese: 東京、大阪、日本語能力試験、漢字 (key terms in kanji)
- Korean: 서울, 한국어, 맞춤법 (key terms in hangul)
- Chinese: 北京,普通话,简体中文 (simplified Chinese terms)
- English: Dr. Smith, MIT, JavaScript, React (names and technical terms)
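The advice above (a short list of terms, not sentences) can be turned into a small helper that builds the prompt from a word list. This is only a sketch: the function name is hypothetical, and the default cap of 10 terms is an arbitrary choice for illustration, not a limit documented by Whisper.

```python
def build_initial_prompt(terms, separator=", ", max_terms=10):
    """Join a short, de-duplicated list of terms into a prompt string."""
    seen = set()
    unique_terms = []
    for term in terms:
        cleaned = term.strip()
        # Skip blanks and repeats so the prompt stays short.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique_terms.append(cleaned)
    return separator.join(unique_terms[:max_terms])
```

For Japanese you might call build_initial_prompt(["JLPT", "日本語能力試験"], separator="、") and pass the result as the initial_prompt value, so the separator matches the punctuation of the target language.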
FFmpeg Whisper Tips
The FFmpeg whisper filter has several useful options beyond the basics shown above. Here are some worth knowing about:
Audio Queue Size
The queue parameter controls how many seconds of audio are batched before sending to Whisper. The default is 3 seconds. Larger values (e.g. 30) can improve accuracy for longer phrases but increase latency. For file transcription, queue=30 is a reasonable choice.
ffmpeg -i audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:queue=30:destination=output.srt:format=srt" -f null -
Output Formats
The format parameter supports three values:
- srt — Standard subtitle format with timestamps (best for importing into Mimikaki)
- json — JSON objects with start/end timestamps and text
- text — Plain text without timestamps
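To show what the srt format carries, the sketch below reads SRT text back into a list of segments with start and end times in seconds, which is the kind of structure a click-to-replay transcript needs. It is an illustration with hypothetical function names, not Mimikaki's importer, and it assumes conventional SRT cues (an optional cue number, a timing line containing "-->", then one or more text lines).

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text: str):
    """Parse SRT text into a list of {'start', 'end', 'text'} dictionaries."""
    segments = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        # The timing line is first when the cue number is missing, second otherwise.
        timing_index = 0 if lines and "-->" in lines[0] else 1
        if timing_index >= len(lines) or "-->" not in lines[timing_index]:
            continue  # skip anything that is not a cue
        start, end = lines[timing_index].split("-->")
        segments.append({
            "start": to_seconds(start),
            # Ignore any positioning hints that follow the end timestamp.
            "end": to_seconds(end.split()[0]),
            "text": "\n".join(lines[timing_index + 1:]).strip(),
        })
    return segments
```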
Voice Activity Detection (VAD)
If you encounter hallucinated text (Whisper generating words when nobody is speaking), the vad_model parameter can help. It uses Silero VAD to detect speech before passing audio to Whisper. Download the VAD model from the whisper.cpp models directory, then use it with a large queue value so VAD handles the chunking:
ffmpeg -i audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:queue=10000:vad_model=ggml-silero-v5.1.2.bin:destination=output.srt:format=srt" -f null -
GPU and Performance
FFmpeg uses GPU acceleration by default if whisper.cpp was compiled with it. You can disable it with use_gpu=false (much slower) or select a specific GPU with gpu_device=0. With GPU enabled, the base model can process audio faster than real-time; the large-v3 model is slower but still practical for file transcription.
Model Behavior Differences
English-specific models (ggml-base.en.bin, ggml-medium.en.bin) annotate non-speech sounds like [MUSIC PLAYING] and [screaming]. The multilingual large model does not produce these annotations. For language learning, this is rarely an issue since you typically want speech transcription only.
Further Reading
- FFmpeg whisper filter documentation — official reference for all filter parameters
- whisper.cpp — the C/C++ Whisper implementation that powers FFmpeg's whisper filter
- GGML model files — download Whisper models for local use