OpenAI Whisper Transcription Guide

Everything you need to know about AI-powered transcription for language learning

What is OpenAI Whisper?

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It was trained on 680,000 hours of multilingual audio data, making it one of the most capable transcription systems available. Whisper can transcribe speech in 70+ languages with impressive accuracy.

Unlike older transcription systems that struggle with accents, background noise, or multiple speakers, Whisper handles real-world audio remarkably well. This makes it ideal for language learners working with podcasts, YouTube videos, TV shows, and other authentic content.

How Mimikaki Uses Whisper

When you upload audio to Mimikaki, we send it to OpenAI's Whisper API for transcription. The API returns timestamped text that we transform into interactive segments you can click to replay.
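
As a rough illustration of that transformation, here is a minimal Python sketch. It assumes a response shaped like the API's verbose JSON output, where each segment carries start, end, and text fields. The Segment class and to_segments function are hypothetical names used for illustration, not Mimikaki's actual code.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def to_segments(response: dict) -> list[Segment]:
    """Convert a timestamped transcription response into replayable segments."""
    segments = []
    for raw in response.get("segments", []):
        text = raw["text"].strip()
        if not text:
            continue  # skip segments that contain only whitespace
        segments.append(Segment(float(raw["start"]), float(raw["end"]), text))
    return segments


# Hypothetical sample data, shaped like a timestamped response.
sample = {
    "segments": [
        {"start": 0.0, "end": 1.8, "text": " こんにちは。"},
        {"start": 1.8, "end": 3.4, "text": " 元気ですか。"},
    ],
}

for seg in to_segments(sample):
    print(f"[{seg.start:.1f}s - {seg.end:.1f}s] {seg.text}")
```

Each segment keeps its start and end time, which is what a player needs in order to replay just that line.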

Key features of our Whisper integration:

Supported Languages

Whisper supports transcription for 70+ languages. For language learners, accuracy varies by language based on how much training data was available:

Excellent Accuracy (5% or lower word error rate)

English, Spanish, Italian, Portuguese, German, French, Japanese, Korean, Chinese, Dutch, Polish, Russian, Turkish, Vietnamese, Indonesian, Thai

Good Accuracy (5-10% word error rate)

Finnish, Swedish, Norwegian, Danish, Greek, Czech, Romanian, Hungarian, Hindi, Arabic, Hebrew

Moderate Accuracy (10-20% word error rate)

Less common languages may have higher error rates but are still useful for comprehension practice.
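
Word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the transcript into a reference text, divided by the number of words in the reference. The sketch below computes it with a standard edit-distance table. It assumes words are separated by whitespace, which does not hold for languages such as Japanese or Chinese, where character error rate is the usual substitute. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "the quick brown fox jumps",
    "the quick brown box jumps",
)
print(f"WER: {wer:.0%}")  # one substitution in five words -> 20%
```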

Tip: If transcription for your language is imperfect, you can import your own subtitles for free instead. Many podcasts and videos already have subtitle files available.

Transcription Credits

Mimikaki uses a credit system for transcription:

Audio Length        Credits Used
Up to 10 minutes    1 credit
Up to 20 minutes    2 credits
Up to 30 minutes    3 credits
Up to 60 minutes    6 credits

New users get 3 free credits to try the service. See our pricing page for credit packs.
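
The table above is consistent with one credit per started 10-minute block. The Python sketch below estimates the cost of a file under that assumption. The rule for lengths not listed in the table (for example 40 minutes, or anything over 60) is inferred, not confirmed, and credits_needed is a hypothetical helper name.

```python
import math

SECONDS_PER_CREDIT = 10 * 60  # assumption inferred from the credit table


def credits_needed(duration_seconds: float) -> int:
    """Estimate credits as one per started 10-minute block."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return math.ceil(duration_seconds / SECONDS_PER_CREDIT)


for minutes in (4, 10, 12, 30, 60):
    print(f"{minutes} min -> {credits_needed(minutes * 60)} credit(s)")
```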

Tips for Better Transcription

1. Choose Clear Audio

Whisper works best with clear speech and minimal background noise. Podcasts and audiobooks typically produce excellent results. Music and sound effects can interfere with accuracy.

2. Single Speaker is Easier

Whisper can transcribe audio with multiple speakers, but it does not label who is speaking, and accuracy is highest for single-speaker content like language learning podcasts or audiobooks.

3. Specify the Language

When uploading, select the correct language. While auto-detection usually works, specifying the language ensures Whisper uses the right vocabulary and character set.

4. Fix Mistakes as You Go

No transcription is perfect. Click the edit button on any line to fix errors. Your corrections are saved and help you review more accurately later.

When to Use Whisper vs. Importing Subtitles

Use Whisper Transcription      Import Existing Subtitles
Content without subtitles      YouTube videos with captions
Podcasts and audiobooks        TV shows with .srt files
Personal recordings            Netflix content (via tools)
Quick one-off transcription    When you want free/perfect text

Remember: Importing subtitles is always free and doesn't use credits. Check if subtitles already exist before transcribing!

Running Whisper Locally (Free)

If you want to transcribe for free, you can run Whisper on your own computer. This requires some technical setup but costs nothing per transcription.

Option 1: FFmpeg (Recommended)

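FFmpeg 8.0+ includes a Whisper filter powered by whisper.cpp, so you can transcribe with a single command. No Python required. The filter is only present in builds configured with --enable-whisper; run ffmpeg -filters and look for whisper in the list to check yours.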

Install FFmpeg 8.0+

brew install ffmpeg

Download a Whisper model

Grab a GGML model file from the whisper.cpp project (e.g. ggml-large-v3.bin for best accuracy, or ggml-base.bin for speed).

Transcribe to SRT

ffmpeg -i your-audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:destination=output.srt:format=srt" -f null -

This outputs an output.srt file you can import into Mimikaki for free. GPU acceleration is supported if whisper.cpp was compiled with it.
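
If you transcribe many files, it can help to assemble the command in code. The following is a minimal Python sketch. The helper names whisper_filter and ffmpeg_command are hypothetical, and the sketch assumes that option values contain no characters (such as colons or quotes) that would need escaping in FFmpeg's filter syntax.

```python
import shlex


def whisper_filter(**options: object) -> str:
    """Join key=value pairs into a single whisper filter argument."""
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return "whisper=" + ":".join(parts)


def ffmpeg_command(audio_path: str, **options: object) -> list[str]:
    """Build the argument list for an audio-only transcription run."""
    return [
        "ffmpeg", "-i", audio_path, "-vn",
        "-af", whisper_filter(**options),
        "-f", "null", "-",
    ]


cmd = ffmpeg_command(
    "your-audio.mp3",
    model="ggml-large-v3.bin",
    language="ja",
    destination="output.srt",
    format="srt",
)
print(shlex.join(cmd))  # shell-ready form of the same command
```

Passing the list to subprocess.run(cmd, check=True) would run the transcription without going through a shell, which avoids quoting problems with file names.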

Option 2: OpenAI Whisper (Python)

macOS Installation

brew install openai-whisper

Linux/Windows Installation

pip install openai-whisper

Basic Usage

whisper your-audio.mp3 --language ja --output_format srt

This creates a your-audio.srt file that you can import into Mimikaki for free.
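
To check or post-process the result before importing it, the SRT file can be read with a few lines of Python. This is a minimal sketch that assumes a well-formed file with blank lines between cues. Cue and parse_srt are illustrative names and have nothing to do with how Mimikaki imports subtitles internally.

```python
import re
from dataclasses import dataclass

# Matches "00:00:02,500 --> 00:00:05,000" (a dot is tolerated as well).
TIMESTAMP = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into a list of cues with times in seconds."""
    cues = []
    content = content.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", content):  # cues are separated by blank lines
        lines = block.strip().splitlines()
        for index, line in enumerate(lines):
            match = TIMESTAMP.search(line)
            if match:
                g = match.groups()
                text = "\n".join(lines[index + 1:]).strip()
                if text:
                    cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break  # only the first timestamp line in a block counts
    return cues


sample = """1
00:00:00,000 --> 00:00:02,500
こんにちは。

2
00:00:02,500 --> 00:00:05,000
今日はいい天気ですね。
"""

for cue in parse_srt(sample):
    print(f"{cue.start:.1f}-{cue.end:.1f}: {cue.text}")
```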

Language Tags

Always specify the language for better accuracy. Common language tags:

Language    Tag    Language      Tag
Japanese    ja     Korean        ko
Chinese     zh     Finnish       fi
Spanish     es     French        fr
German      de     Portuguese    pt
Italian     it     Russian       ru

Initial Prompts for Better Accuracy

Whisper's initial_prompt parameter conditions the model on text that supposedly appeared before the audio. This lets you influence output style, punctuation, and vocabulary without changing the model itself.

How it works

The prompt acts as fake "previous context" — Whisper treats it as if those words just appeared in the transcription. This nudges the model toward similar style, terminology, and formatting in its output.

FFmpeg

ffmpeg -i audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:initial_prompt=JLPT、日本語能力試験:destination=output.srt:format=srt" -f null -

Python

whisper audio.mp3 --language ja --initial_prompt "JLPT、日本語能力試験" --output_format srt
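
If you keep a vocabulary list per show or textbook, the prompt can be built from it. The sketch below assembles the same whisper command from a list of terms. whisper_command is a hypothetical helper, and joining the terms with a Japanese comma simply mirrors the example above.

```python
import shlex


def whisper_command(
    audio_path: str,
    language: str,
    prompt_terms: list[str],
    separator: str = "、",
) -> list[str]:
    """Build a whisper CLI call, adding --initial_prompt only when terms exist."""
    cmd = ["whisper", audio_path, "--language", language, "--output_format", "srt"]
    if prompt_terms:
        cmd += ["--initial_prompt", separator.join(prompt_terms)]
    return cmd


cmd = whisper_command("audio.mp3", "ja", ["JLPT", "日本語能力試験"])
print(shlex.join(cmd))  # shell-ready form of the command
```

OpenAI's speech-to-text guide notes that only the final 224 tokens of a prompt are considered, which is one more reason to keep the term list short.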

What to put in the prompt

Keep it short — list specific words rather than writing full sentences. Good uses:

Gotchas:

Example prompts by language:

FFmpeg Whisper Tips

The FFmpeg whisper filter has several useful options beyond the basics shown above. Here are some worth knowing about:

Audio Queue Size

The queue parameter controls how many seconds of audio are batched before sending to Whisper. The default is 3 seconds. Larger values (e.g. 30) can improve accuracy for longer phrases but increase latency. For file transcription, queue=30 is a reasonable choice.

ffmpeg -i audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:queue=30:destination=output.srt:format=srt" -f null -

Output Formats

The format parameter supports three values: text (plain transcript only, the default), srt (subtitles with timestamps), and json.

Voice Activity Detection (VAD)

If you encounter hallucinated text (Whisper generating words when nobody is speaking), the vad_model parameter can help. It uses Silero VAD to detect speech before passing audio to Whisper. Download the VAD model from the whisper.cpp models directory, then use it with a large queue value so VAD handles the chunking:

ffmpeg -i audio.mp3 -vn -af "whisper=model=ggml-large-v3.bin:language=ja:queue=10000:vad_model=ggml-silero-v5.1.2.bin:destination=output.srt:format=srt" -f null -

GPU and Performance

FFmpeg uses GPU acceleration by default if whisper.cpp was compiled with it. You can disable it with use_gpu=false (much slower) or select a specific GPU with gpu_device=0. With GPU enabled, the base model can process audio faster than real-time; the large-v3 model is slower but still practical for file transcription.

Model Behavior Differences

English-specific models (ggml-base.en.bin, ggml-medium.en.bin) annotate non-speech sounds like [MUSIC PLAYING] and [screaming]. The multilingual large model does not produce these annotations. For language learning, this is rarely an issue since you typically want speech transcription only.

Ready to Try It?

Get 3 free credits when you sign up. No credit card required.
