Why is Phi 4 Multimodal so cheap?

Phi 4 Multimodal is currently the cheapest transcription model, as this table shows:

| Model | Inputs | Outputs | Pricing* |
|---|---|---|---|
| GPT 4o transcribe | $6/mtok, ~750/min | $10/mtok | ~$0.36/hour |
| GPT 4o mini transcribe | $3/mtok, ~750/min | $5/mtok | ~$0.18/hour |
| Whisper Large (OpenAI) | 6000 frames/min pre-conv, 3000/min post-conv | - | $0.36/hour |
| Whisper Large (DeepInfra) | - | - | $0.027/hour |
| Whisper Large Turbo (DeepInfra) | - | - | $0.012/hour |
| Gemini 2.0 Flash Lite | $0.075/mtok, 1920/min | $0.3/mtok | $0.011/hour |
| Phi 4 Multimodal | $0.05/mtok, 750/min | $0.1/mtok | $0.003/hour |

* Assuming 150 output tokens/minute
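The per-hour figures in the table follow directly from each model's token rate and per-million-token prices. A quick sketch of the arithmetic (the `cost_per_hour` helper is mine, not from any pricing API; rates and prices are taken from the table above):

```python
def cost_per_hour(in_tok_per_min, in_price_per_mtok,
                  out_tok_per_min=150, out_price_per_mtok=0.0):
    """Hourly cost: (tokens/min * 60) tokens/hour, priced per million tokens."""
    input_cost = in_tok_per_min * 60 * in_price_per_mtok / 1_000_000
    output_cost = out_tok_per_min * 60 * out_price_per_mtok / 1_000_000
    return input_cost + output_cost

# Phi 4 Multimodal: 750 input tokens/min at $0.05/mtok, $0.1/mtok output
print(cost_per_hour(750, 0.05, 150, 0.1))  # ≈ $0.003/hour, matching the table

# GPT 4o transcribe: 750 input tokens/min at $6/mtok, $10/mtok output
print(cost_per_hour(750, 6, 150, 10))      # ≈ $0.36/hour
```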

Why?

If Phi 4 used the same token rate as Whisper (3000 tokens/min), their prices would be almost the same: 3000 × 60 × $0.05/1,000,000 + 150 × 60 × $0.1/1,000,000 ≈ $0.01/hour.

If GPT 4o mini transcribe used the same pricing as plain GPT 4o mini, its cost would be much closer to DeepInfra's: 750 × 60 × $0.3/1,000,000 + 150 × 60 × $0.6/1,000,000 ≈ $0.019/hour.
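Both counterfactuals above can be checked with the same arithmetic (hypothetical token rates and prices plugged into a small helper of my own; the names are illustrative):

```python
def cost_per_hour(in_tok_per_min, in_price, out_tok_per_min, out_price):
    # tokens/min * 60 gives tokens/hour; prices are per million tokens
    return (in_tok_per_min * in_price + out_tok_per_min * out_price) * 60 / 1_000_000

# Phi 4's actual prices, but at Whisper's 3000 tokens/min input rate
phi4_at_whisper_rate = cost_per_hour(3000, 0.05, 150, 0.1)  # ≈ $0.0099/hour

# GPT 4o mini transcribe at plain GPT 4o mini's per-token prices
mini_at_text_prices = cost_per_hour(750, 0.3, 150, 0.6)     # ≈ $0.0189/hour
```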

This means that the primary reason it's so cheap is its efficient tokenization, but that alone wouldn't be possible without a few other factors.