Gemini may be > Whisper

Audio transcription has been getting a lot better recently. In 2022, OpenAI's Whisper shocked the space with actually accurate transcription. Since then, it has kept getting cheaper, with distilled versions like whisper-large-v3-turbo and batching making it 10,000x cheaper than a human.

But surprisingly, the cheapest transcriber isn't a model designed for that task. It's the multimodal language model Gemini. Let me show you.


DeepInfra runs Whisper Turbo at $0.0002/minute, which works out to a rather respectable $0.012/hour.

Finding the Gemini price is not as easy. Let me break it down in a table, using the Flash version (other versions are too expensive or too crappy).

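| Type   | Tokens       | Pricing (with Flash) | Cost         |
|--------|--------------|----------------------|--------------|
| Input  | 115,329/hour | $0.075/million       | $0.0086/hour |
| Output | ~1,000/hour  | $0.3/million         | $0.0003/hour |
| Total  | -            | -                    | $0.0089/hour |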

That's about 75% of the Whisper price. This could be a real business, especially if you use Gemini's batching feature. If you do make some money from this, let me know; my contact details are on the home page.
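If you want to redo the arithmetic with your own numbers, here is a minimal sketch in Python. The prices and token counts are the ones quoted above; the function and constant names are made up for illustration, and the ~1,000 output tokens per hour is a rough estimate rather than a measurement.

```python
# Back-of-the-envelope cost comparison using the figures quoted in this post.
# All names are illustrative; swap in current prices before relying on the result.

WHISPER_TURBO_USD_PER_MINUTE = 0.0002    # Whisper Turbo price quoted above
GEMINI_INPUT_TOKENS_PER_HOUR = 115_329   # audio tokens for one hour of input
GEMINI_OUTPUT_TOKENS_PER_HOUR = 1_000    # rough transcript size (assumption)
GEMINI_INPUT_USD_PER_MILLION = 0.075     # Flash input price
GEMINI_OUTPUT_USD_PER_MILLION = 0.3      # Flash output price


def token_cost(tokens: float, usd_per_million: float) -> float:
    """Cost in USD for a number of tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


def whisper_cost_per_hour() -> float:
    """Convert the per-minute Whisper price to a per-hour price."""
    return WHISPER_TURBO_USD_PER_MINUTE * 60


def gemini_cost_per_hour() -> float:
    """Input plus output token cost for one hour of audio."""
    return (
        token_cost(GEMINI_INPUT_TOKENS_PER_HOUR, GEMINI_INPUT_USD_PER_MILLION)
        + token_cost(GEMINI_OUTPUT_TOKENS_PER_HOUR, GEMINI_OUTPUT_USD_PER_MILLION)
    )


if __name__ == "__main__":
    whisper = whisper_cost_per_hour()
    gemini = gemini_cost_per_hour()
    print(f"Whisper Turbo: ${whisper:.4f}/hour")
    print(f"Gemini Flash:  ${gemini:.4f}/hour")
    print(f"Gemini / Whisper: {gemini / whisper:.0%}")
```

The input side dominates: the output tokens add only about 3% to the Gemini bill, so the estimate is not very sensitive to how long the transcript turns out to be.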


2025 update: Phi 4 Multimodal is out, and it's even cheaper! $0.004/hour. (Some say that Whisper actually just costs $0.001/hour to run, but I haven't checked this.)