
Gemini may be > Whisper
Audio transcription has been getting a lot better recently. In 2022, OpenAI's Whisper shocked the
space with actually accurate transcription. From there, it's been getting cheaper and cheaper,
with distilled versions like whisper-large-v3-turbo and batching making it 10,000x cheaper than a human.
But surprisingly, the cheapest transcriber isn't a model designed for that task. It's the multimodal language model Gemini. Let me show you.
DeepInfra runs Whisper Turbo at $0.0002/minute, which works out to a rather respectable $0.012/hour.
Finding the Gemini price is not as easy. Let me break it down in a table, using the Flash version (other versions are too expensive or too crappy).
Type | Tokens per hour | Price (Flash) | Cost per hour |
---|---|---|---|
Input | 115,329 | $0.075/million | $0.0086 |
Output | ~1,000 | $0.30/million | $0.0003 |
Total | - | - | $0.0089 |
That's about 75% of the Whisper price. This could be a real business, especially if you use Gemini's batching feature. If you do make some money from this, let me know; my contact details are on the home page.
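If you want to check the arithmetic or plug in your own numbers, here is a minimal sketch in Python. The constants are just the prices and token counts quoted in this post (a snapshot, not live pricing), and the function names are made up for illustration.

```python
# All figures are the ones quoted in this post; treat them as a snapshot,
# not as current pricing.
WHISPER_USD_PER_MINUTE = 0.0002          # Whisper Turbo, per minute of audio

GEMINI_INPUT_TOKENS_PER_HOUR = 115_329   # audio tokens for one hour of input
GEMINI_OUTPUT_TOKENS_PER_HOUR = 1_000    # rough transcript size used in the table
GEMINI_INPUT_USD_PER_MILLION = 0.075     # Flash input price
GEMINI_OUTPUT_USD_PER_MILLION = 0.30     # Flash output price


def whisper_usd_per_hour(usd_per_minute: float = WHISPER_USD_PER_MINUTE) -> float:
    """Convert a per-minute price into a per-hour price."""
    return usd_per_minute * 60


def token_cost(tokens: float, usd_per_million: float) -> float:
    """Cost in USD of a number of tokens at a per-million-token price."""
    return tokens * usd_per_million / 1_000_000


def gemini_usd_per_hour() -> float:
    """Input plus output cost for one hour of audio, using the table's figures."""
    input_cost = token_cost(GEMINI_INPUT_TOKENS_PER_HOUR, GEMINI_INPUT_USD_PER_MILLION)
    output_cost = token_cost(GEMINI_OUTPUT_TOKENS_PER_HOUR, GEMINI_OUTPUT_USD_PER_MILLION)
    return input_cost + output_cost


whisper = whisper_usd_per_hour()
gemini = gemini_usd_per_hour()
print(f"Whisper: ${whisper:.4f}/hour")
print(f"Gemini:  ${gemini:.4f}/hour")
print(f"Gemini / Whisper: {gemini / whisper:.0%}")
```

The output token count is the shakiest input here, so it is worth swapping in a figure measured from your own transcripts before trusting the ratio.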
2025 update: Phi 4 Multimodal is out, and it's even cheaper, at $0.004/hour. (Some say that Whisper actually costs just $0.001/hour to run, but I haven't checked this.)