Gemini may be > Whisper

Audio transcription has been getting a lot better recently. In 2022, OpenAI's Whisper shocked the space with actually accurate transcription. Since then, it's been getting cheaper and cheaper, with slimmed-down versions like whisper-large-v3-turbo and batching making it roughly 10,000x cheaper than a human.

But surprisingly, the cheapest transcriber isn't a model designed for that task. It's the multimodal language model Gemini. Let me show you.


DeepInfra runs Whisper Turbo at $0.0002/minute, which works out to a rather respectable $0.012/hour.

Finding the Gemini price is not as easy. Let me break it down in a table, using the Flash version (other versions are too expensive or too crappy).

| Type   | Tokens       | Pricing (with Flash) | Cost         |
|--------|--------------|----------------------|--------------|
| Input  | 115,329/hour | $0.075/million       | $0.0086/hour |
| Output | ~1000/hour   | $0.3/million         | $0.0003/hour |
| Total  | -            | -                    | $0.0089/hour |
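
If you want to check the arithmetic yourself, here's a minimal sketch in Python. The numbers are the ones quoted in this post (treat them as a snapshot, since prices change), and the function and constant names are made up for illustration.

```python
# All figures are copied from this post; they are a snapshot, not live pricing.
WHISPER_USD_PER_MINUTE = 0.0002          # Whisper Turbo on DeepInfra
GEMINI_INPUT_TOKENS_PER_HOUR = 115_329   # audio tokens for one hour of input
GEMINI_OUTPUT_TOKENS_PER_HOUR = 1_000    # rough estimate of transcript length
GEMINI_INPUT_USD_PER_MILLION = 0.075     # Flash input price
GEMINI_OUTPUT_USD_PER_MILLION = 0.30     # Flash output price


def token_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD of a number of tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


def whisper_cost_per_hour() -> float:
    """Per-minute price scaled up to one hour of audio."""
    return WHISPER_USD_PER_MINUTE * 60


def gemini_cost_per_hour() -> float:
    """Input plus output token cost for one hour of audio."""
    return (
        token_cost(GEMINI_INPUT_TOKENS_PER_HOUR, GEMINI_INPUT_USD_PER_MILLION)
        + token_cost(GEMINI_OUTPUT_TOKENS_PER_HOUR, GEMINI_OUTPUT_USD_PER_MILLION)
    )


if __name__ == "__main__":
    whisper = whisper_cost_per_hour()
    gemini = gemini_cost_per_hour()
    print(f"Whisper: ${whisper:.4f}/hour")
    print(f"Gemini:  ${gemini:.4f}/hour")
    print(f"Gemini is {gemini / whisper:.0%} of the Whisper price")
```

The output token count is the shakiest input here, but since it contributes so little to the total, even doubling it barely moves the result.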

That's about 75% of the Whisper price. This could be a real business, especially if you use Gemini's batching feature. If you do make some money from this, let me know. My contact details are on the home page.