Gemini may be > Whisper

Audio transcription has gotten a lot better recently. In 2022, OpenAI's Whisper shocked the space with transcription that was actually accurate. Since then it has kept getting cheaper: smaller, faster versions like whisper-large-v3-turbo, combined with batching, make it around 10,000x cheaper than a human.

But surprisingly, the cheapest transcriber isn't a model designed for that task. It's the multimodal language model Gemini. Let me show you.

DeepInfra runs Whisper Turbo at $0.0002/minute, which works out to a rather respectable $0.012/hour.

Finding the Gemini price is not as easy. Let me break it down in a table, using the Flash version (other versions are too expensive or too crappy).

Type	Tokens	Pricing	Cost
Input	115,329/hour	$0.075/million (with Flash)	$0.0086/hour
Output	~1,000/hour	$0.30/million (with Flash)	$0.0003/hour
Total	-	-	$0.0089/hour
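
If you want to check the table yourself, the arithmetic is just tokens per hour times the price per million tokens. Here is a minimal sketch in Python, using only the figures quoted above. The function name `hourly_cost` is my own, and the prices are the ones in this post, so they may be out of date by the time you read this.

```python
def hourly_cost(tokens_per_hour: float, usd_per_million_tokens: float) -> float:
    """Cost in USD of processing one hour of audio."""
    return tokens_per_hour * usd_per_million_tokens / 1_000_000

# Figures from the table above (Gemini Flash pricing as quoted in this post).
input_cost = hourly_cost(115_329, 0.075)   # audio tokens going in
output_cost = hourly_cost(1_000, 0.30)     # rough estimate of transcript tokens coming out
total = input_cost + output_cost

print(f"input:  ${input_cost:.4f}/hour")
print(f"output: ${output_cost:.4f}/hour")
print(f"total:  ${total:.4f}/hour")
```

The output token count is only a rough estimate, but since output is a few percent of the total, being off by a factor of two barely moves the result.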

That's about 75% of the Whisper Turbo price. This could be a real business, especially if you use Gemini's batching feature. If you do make some money from this, let me know; my contact details are on the home page.

2025 update: Phi-4 Multimodal is out, and it's even cheaper at $0.004/hour. (Some say that Whisper actually costs just $0.001/hour to run, but I haven't checked this.)
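
To put all of these numbers side by side, here is a small sketch that expresses each per-hour price as a fraction of the Whisper Turbo price. The prices are simply the ones quoted in this post (the last one is the unverified claim), and the helper name `relative_to_baseline` is made up for illustration.

```python
# Per-hour prices in USD, as quoted in this post (not re-checked against current price lists).
WHISPER_TURBO = 0.0002 * 60  # $0.0002/minute, converted to per-hour

prices_per_hour = {
    "Whisper Turbo": WHISPER_TURBO,
    "Gemini Flash": 0.0089,
    "Phi-4 Multimodal": 0.004,
    "Whisper (claimed, unverified)": 0.001,
}

def relative_to_baseline(prices: dict[str, float], baseline: float) -> dict[str, float]:
    """Return each price as a fraction of the baseline price."""
    return {name: price / baseline for name, price in prices.items()}

ratios = relative_to_baseline(prices_per_hour, WHISPER_TURBO)

# Print cheapest first.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<32} ${prices_per_hour[name]:.4f}/hour  {ratio:.0%} of Whisper Turbo")
```

With these numbers, Gemini Flash lands at about 74% of the Whisper Turbo price and Phi-4 Multimodal at about a third of it.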