OpenAI Whisper Transcription Guide

Everything you need to know about AI-powered transcription for language learning

What is OpenAI Whisper?

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. The original release was trained on 680,000 hours of multilingual audio, which makes it one of the most capable transcription systems available. Whisper can transcribe speech in 70+ languages, with accuracy that varies by language.

Unlike older transcription systems that struggle with accents, background noise, or multiple speakers, Whisper handles real-world audio remarkably well. This makes it ideal for language learners working with podcasts, YouTube videos, TV shows, and other authentic content.

How Mimikaki Uses Whisper

When you upload audio to Mimikaki, we send it to OpenAI's Whisper API for transcription. The API returns timestamped text that we transform into interactive segments you can click to replay.

Key features of our Whisper integration:

Timestamped segments - Each segment is linked to its start and end position in the audio
Language detection - Whisper auto-detects the language (or you can specify it)
large-v3 model - We use Whisper's most accurate model for best results
Editable transcripts - Fix any mistakes directly in the app
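
To make the idea of "timestamped text turned into clickable segments" concrete, here is a minimal sketch. It assumes a response shaped like the verbose_json format of OpenAI's transcription API, where each entry in "segments" has "start", "end", and "text" fields. The names Segment, to_segments, and segment_at are hypothetical, the sample data is made up, and this is not a description of how Mimikaki is actually implemented.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float
    text: str


def to_segments(response: dict) -> list:
    """Convert a verbose_json-style response into Segments sorted by start time."""
    segments = [
        Segment(s["start"], s["end"], s["text"].strip())
        for s in response.get("segments", [])
    ]
    return sorted(segments, key=lambda seg: seg.start)


def segment_at(segments: list, position: float) -> Optional[Segment]:
    """Return the segment being spoken at `position` seconds, or None."""
    starts = [seg.start for seg in segments]
    index = bisect_right(starts, position) - 1
    if index >= 0 and position < segments[index].end:
        return segments[index]
    return None


# Made-up sample data for illustration, not real API output.
sample = {
    "segments": [
        {"start": 0.0, "end": 2.4, "text": " こんにちは。"},
        {"start": 2.4, "end": 5.1, "text": " 今日は天気がいいですね。"},
    ],
}
segments = to_segments(sample)
current = segment_at(segments, 3.0)
print(current.text if current else "(silence)")
```

Replaying a segment then amounts to seeking the audio player to the segment's start time and stopping at its end time.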

Supported Languages

Whisper supports transcription for 70+ languages. Accuracy varies by language, largely depending on how much training data was available:

Excellent Accuracy (5% or lower word error rate)

English, Spanish, Italian, Portuguese, German, French, Japanese, Korean, Chinese, Dutch, Polish, Russian, Turkish, Vietnamese, Indonesian, Thai

Good Accuracy (5-10% word error rate)

Finnish, Swedish, Norwegian, Danish, Greek, Czech, Romanian, Hungarian, Hindi, Arabic, Hebrew

Moderate Accuracy (10-20% word error rate)

Less common languages may have higher error rates but are still useful for comprehension practice.
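
If you are wondering what "word error rate" means in the tiers above, the sketch below computes it the usual way: the number of substituted, deleted, and inserted words divided by the number of words in a correct reference transcript. The function name word_error_rate is our own, and the example sentences are made up. Splitting on spaces is an assumption that only suits languages written with spaces; for Japanese or Chinese, a character-level error rate is normally used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, computed one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing (deletion)
                current[j - 1] + 1,      # extra word in output (insertion)
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
transcript = "the quick brown fox jumped over the lazy dog today"
wer = word_error_rate(reference, transcript)
print(f"WER: {wer:.0%}")
```

One wrong word out of ten gives a 10% error rate, so a 5% rate means roughly one mistake every twenty words.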

Tip: If a transcript isn't accurate enough, you can import your own subtitles for free instead. Many podcasts and videos already have subtitle files available.

Transcription Credits

Mimikaki uses a credit system for transcription:

Audio Length	Credits Used
Up to 10 minutes	1 credit
Up to 20 minutes	2 credits
Up to 30 minutes	3 credits
Up to 60 minutes	6 credits

New users get 3 free credits to try the service. See our pricing page for credit packs.
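
The table follows a pattern of one credit per started 10 minutes of audio. A rough way to estimate the cost of a file, assuming that pattern also holds for lengths the table does not list (such as 40 or 50 minutes, or anything over 60), might look like this. The constant and function names are our own, not part of any Mimikaki API.

```python
import math

# Inferred from the credit table; an assumption, not an official constant.
SECONDS_PER_CREDIT = 10 * 60


def credits_needed(duration_seconds: float) -> int:
    """Estimate credits for a file, rounding up to the next 10-minute block."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return math.ceil(duration_seconds / SECONDS_PER_CREDIT)


for minutes in (8, 10, 25, 60):
    print(minutes, "minutes ->", credits_needed(minutes * 60), "credit(s)")
```

Treat the result as an estimate and check the pricing page for the authoritative numbers.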

Tips for Better Transcription

1. Choose Clear Audio

Whisper works best with clear speech and minimal background noise. Podcasts and audiobooks typically produce excellent results. Music and sound effects can interfere with accuracy.

2. Single Speaker is Easier

Whisper can transcribe audio with multiple speakers, but it does not label who is speaking, and overlapping speech lowers accuracy. Results are best for single-speaker content like language learning podcasts or audiobooks.

3. Specify the Language

When uploading, select the correct language. While auto-detection usually works, specifying the language ensures Whisper uses the right vocabulary and character set.

4. Fix Mistakes as You Go

No transcription is perfect. Click the edit button on any line to fix errors. Your corrections are saved and help you review more accurately later.

When to Use Whisper vs. Importing Subtitles

Use Whisper Transcription	Import Existing Subtitles
Content without subtitles	YouTube videos with captions
Podcasts and audiobooks	TV shows with .srt files
Personal recordings	Netflix content (via tools)
Quick one-off transcription	When you want free/perfect text

Remember: Importing subtitles is always free and doesn't use credits. Check if subtitles already exist before transcribing!

Running Whisper Locally (Free)

If you want to transcribe for free, you can run Whisper on your own computer. This requires some technical setup but costs nothing per transcription.

Option 1: FFmpeg (Recommended)

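FFmpeg 8.0+ includes a Whisper audio filter powered by whisper.cpp, so you can transcribe with a single command. No Python required. The filter is only present in builds configured with --enable-whisper; run ffmpeg -filters and look for whisper in the list to confirm that your build has it.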

Install FFmpeg 8.0+

brew install ffmpeg

Download a Whisper model

Download a GGML model file from the whisper.cpp project (e.g. ggml-large-v3.bin for best accuracy, or ggml-base.bin for speed).

Transcribe to SRT

ffmpeg -i your-audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:destination=output.srt:format=srt" -f null -

This outputs an output.srt file you can import into Mimikaki for free. GPU acceleration is supported if whisper.cpp was compiled with it.

Option 2: OpenAI Whisper (Python)

macOS Installation

brew install openai-whisper

Linux/Windows Installation

pip install openai-whisper

The Python package uses the ffmpeg command-line tool to decode audio, so FFmpeg must also be installed and available on your PATH.

Basic Usage

whisper your-audio.mp3 --language ja --output_format srt

This creates a your-audio.srt file that you can import into Mimikaki for free.
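
Before importing, it can be useful to sanity-check the generated file. The sketch below is a minimal SRT reader, assuming the common layout of a cue number, a timing line such as 00:00:02,400 --> 00:00:05,100, and one or more text lines, with blank lines between cues. The names Cue and parse_srt are hypothetical and the sample text is made up. It is not a full SRT implementation (styling tags and positioning are ignored).

```python
import re
from dataclasses import dataclass

TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds
    end: float
    text: str


def parse_srt(content: str) -> list:
    """Parse SRT text into a list of Cue objects, skipping malformed blocks."""
    cues = []
    normalized = content.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):  # cues are blank-line separated
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                text = "\n".join(lines[index + 1:]).strip()
                cues.append(Cue(start, end, text))
                break
    return cues


SAMPLE = """1
00:00:00,000 --> 00:00:02,400
こんにちは。

2
00:00:02,400 --> 00:00:05,100
今日は天気がいいですね。
"""

cues = parse_srt(SAMPLE)
for cue in cues:
    print(f"{cue.start:7.2f} - {cue.end:7.2f}  {cue.text}")
```

To check a real file, read it with open("your-audio.srt", encoding="utf-8") and pass the contents to parse_srt.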

Language Tags

Always specify the language for better accuracy. Common language tags:

Language	Tag	Language	Tag
Japanese	`ja`	Korean	`ko`
Chinese	`zh`	Finnish	`fi`
Spanish	`es`	French	`fr`
German	`de`	Portuguese	`pt`
Italian	`it`	Russian	`ru`
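
If you transcribe files from a script, the table above can be turned into a small lookup. This sketch builds the argument list for the whisper command shown under Basic Usage. The dictionary only contains the tags from the table, and the helper name whisper_cli_args is our own.

```python
# Tags copied from the table above.
LANGUAGE_TAGS = {
    "japanese": "ja", "korean": "ko",
    "chinese": "zh", "finnish": "fi",
    "spanish": "es", "french": "fr",
    "german": "de", "portuguese": "pt",
    "italian": "it", "russian": "ru",
}


def whisper_cli_args(audio_path: str, language: str) -> list:
    """Build the argument list for the `whisper` CLI with SRT output."""
    tag = LANGUAGE_TAGS.get(language.strip().lower())
    if tag is None:
        raise ValueError(f"no tag in the table for {language!r}")
    return ["whisper", audio_path, "--language", tag, "--output_format", "srt"]


args = whisper_cli_args("your-audio.mp3", "Japanese")
print(" ".join(args))
# To actually run it (requires openai-whisper to be installed):
#   import subprocess; subprocess.run(args, check=True)
```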

Initial Prompts for Better Accuracy

Whisper's initial_prompt parameter conditions the model on text that supposedly appeared before the audio. This lets you influence output style, punctuation, and vocabulary without changing the model itself.

How it works

The prompt acts as fake "previous context" — Whisper treats it as if those words just appeared in the transcription. This nudges the model toward similar style, terminology, and formatting in its output.

FFmpeg

ffmpeg -i audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:initial_prompt=JLPT、日本語能力試験:destination=output.srt:format=srt" -f null -

This assumes the whisper filter in your FFmpeg build accepts an initial_prompt option; run ffmpeg -h filter=whisper to see the options your build supports. Colons, ASCII commas, and backslashes inside the prompt must be escaped, because they are separators in FFmpeg's filter syntax.

Python

whisper audio.mp3 --language ja --initial_prompt "JLPT、日本語能力試験" --output_format srt

What to put in the prompt

Keep it short — list specific words rather than writing full sentences. Good uses:

Names and terms: "Dr. Tanaka, JLPT, N1" — helps Whisper spell domain-specific words correctly
Punctuation style: Include properly punctuated text in the target language to encourage consistent formatting
Script preference: For Japanese, use kanji in the prompt to discourage unnecessary katakana or romaji output

Gotchas:

The prompt only influences the first ~30 seconds of audio. For longer files, Whisper replaces it with its own previous output as context.
Long, descriptive prompts often cause hallucinations — the model starts repeating the prompt text instead of transcribing the audio. Keep it to a short list of words.
The prompt doesn't act as an instruction. Writing "please transcribe accurately" won't help — instead, show the style you want by example.

Example prompts by language:

Japanese: 東京、大阪、日本語能力試験、漢字 — key terms in kanji
Korean: 서울, 한국어, 맞춤법 — key terms in hangul
Chinese: 北京，普通话，简体中文 — simplified Chinese terms
English: Dr. Smith, MIT, JavaScript, React — names and technical terms
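
Following the advice above, a prompt is best kept as a short, de-duplicated list of terms. Here is one possible way to build it from a vocabulary list. The helper name build_initial_prompt and the cap of 10 terms are arbitrary choices for illustration, not limits defined by Whisper.

```python
def build_initial_prompt(terms, separator=", ", max_terms=10):
    """Join unique, non-empty terms into a short prompt string."""
    seen = set()
    unique = []
    for term in terms:
        cleaned = term.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    # Keep only the first few terms so the prompt stays short.
    return separator.join(unique[:max_terms])


japanese_prompt = build_initial_prompt(
    ["JLPT", "日本語能力試験", "JLPT", " "], separator="、"
)
english_prompt = build_initial_prompt(["Dr. Smith", "MIT", "JavaScript", "React"])
print(japanese_prompt)
print(english_prompt)
```

The resulting string can be passed to --initial_prompt on the command line. Using the target language's own separator (such as 、 for Japanese) keeps the prompt consistent with the punctuation style you want in the output.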

FFmpeg Whisper Tips

The FFmpeg whisper filter has several useful options beyond the basics shown above. Here are some worth knowing about:

Audio Queue Size

The queue parameter controls how many seconds of audio are batched before sending to Whisper. The default is 3 seconds. Larger values (e.g. 30) can improve accuracy for longer phrases but increase latency. For file transcription, queue=30 is a reasonable choice.

ffmpeg -i audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:queue=30:destination=output.srt:format=srt" -f null -

Output Formats

The format parameter supports three values:

srt — Standard subtitle format with timestamps (best for importing into Mimikaki)
json — JSON objects with start/end timestamps and text
text — Plain text without timestamps

Voice Activity Detection (VAD)

If you encounter hallucinated text (Whisper generating words when nobody is speaking), the vad_model parameter can help. It uses Silero VAD to detect speech before passing audio to Whisper. Download the VAD model from the whisper.cpp models directory, then use it with a large queue value so VAD handles the chunking:

ffmpeg -i audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:queue=10000:vad_model=ggml-silero-v5.1.2.bin:destination=output.srt:format=srt" -f null -

GPU and Performance

FFmpeg uses GPU acceleration by default if whisper.cpp was compiled with it. You can disable it with use_gpu=false (much slower) or select a specific GPU with gpu_device=0. With GPU enabled, the base model can process audio faster than real-time; the large-v3 model is slower but still practical for file transcription.
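
Since the filter string gets long once several options are combined, a small helper can assemble it. This is a sketch under stated assumptions: the option names (language, queue, destination, format, use_gpu) are the ones used in this guide, the helper names whisper_filter and ffmpeg_args are our own, and no escaping is applied, so values containing a colon or comma (for example some Windows paths) would need to be escaped by hand.

```python
def whisper_filter(model, **options):
    """Build a `whisper=key=value:key=value` filter string for FFmpeg."""
    parts = [f"model={model}"]
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return "whisper=" + ":".join(parts)


def ffmpeg_args(audio_path, filter_spec):
    """Build the full FFmpeg argument list; no output file is produced (-f null)."""
    return ["ffmpeg", "-i", audio_path, "-vn", "-af", filter_spec, "-f", "null", "-"]


spec = whisper_filter(
    "ggml-large-v3.bin",
    language="ja",
    queue=30,
    destination="output.srt",
    format="srt",
)
args = ffmpeg_args("audio.mp3", spec)
print(" ".join(args))
# To run it (requires an FFmpeg build with the whisper filter):
#   import subprocess; subprocess.run(args, check=True)
```

Passing the arguments as a list to subprocess.run avoids shell quoting problems, since the filter string is handed to FFmpeg as a single argument.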

Model Behavior Differences

English-specific models (ggml-base.en.bin, ggml-medium.en.bin) annotate non-speech sounds like [MUSIC PLAYING] and [screaming]. The multilingual large model does not produce these annotations. For language learning, this is rarely an issue since you typically want speech transcription only.