Why is Phi 4 Multimodal so cheap?

Phi 4 Multimodal is currently the cheapest transcription model, as this table shows:

| Model | Inputs | Outputs | Pricing* |
|---|---|---|---|
| GPT 4o transcribe | $6/mtok, ~750/min | $10/mtok | ~$0.36/hour |
| GPT 4o mini transcribe | $3/mtok, ~750/min | $5/mtok | ~$0.18/hour |
| Whisper Large (OpenAI) | 6000 frames/min pre-conv, 3000/min post-conv | - | $0.36/hour |
| Whisper Large (DeepInfra) | - | - | $0.027/hour |
| Whisper Large Turbo (DeepInfra) | - | - | $0.012/hour |
| Gemini 2.0 Flash Lite | $0.075/mtok, 1920/min | $0.3/mtok | $0.011/hour |
| Phi 4 Multimodal | $0.05/mtok, 750/min | $0.1/mtok | $0.003/hour |

* Assuming 150 output tokens/minute
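The per-hour figures in the table follow directly from each model's token rate and per-million-token prices. A quick sketch of the arithmetic (the `cost_per_hour` helper is mine, not from any pricing API; rates and prices are taken from the table above):

```python
def cost_per_hour(in_tok_per_min, in_price_per_mtok,
                  out_tok_per_min=150, out_price_per_mtok=0.0):
    """Hourly cost: (tokens/min * 60) tokens/hour, priced per million tokens."""
    input_cost = in_tok_per_min * 60 * in_price_per_mtok / 1_000_000
    output_cost = out_tok_per_min * 60 * out_price_per_mtok / 1_000_000
    return input_cost + output_cost

# Phi 4 Multimodal: 750 input tokens/min at $0.05/mtok, $0.1/mtok output
print(cost_per_hour(750, 0.05, 150, 0.1))  # ≈ $0.003/hour, matching the table

# GPT 4o transcribe: 750 input tokens/min at $6/mtok, $10/mtok output
print(cost_per_hour(750, 6, 150, 10))      # ≈ $0.36/hour
```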

Why?

If Phi 4 used the same token rate as Whisper (3000 tokens/min), their prices would be almost the same: 3000 × 60 × $0.05/1,000,000 + 150 × 60 × $0.1/1,000,000 ≈ $0.01/hour.

If GPT 4o mini transcribe used the same pricing as plain GPT 4o mini, its cost would be much closer to DeepInfra's: 750 × 60 × $0.3/1,000,000 + 150 × 60 × $0.6/1,000,000 ≈ $0.019/hour.
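Both counterfactuals above can be checked with the same arithmetic (hypothetical token rates and prices plugged into a small helper of my own; the names are illustrative):

```python
def cost_per_hour(in_tok_per_min, in_price, out_tok_per_min, out_price):
    # tokens/min * 60 gives tokens/hour; prices are per million tokens
    return (in_tok_per_min * in_price + out_tok_per_min * out_price) * 60 / 1_000_000

# Phi 4's actual prices, but at Whisper's 3000 tokens/min input rate
phi4_at_whisper_rate = cost_per_hour(3000, 0.05, 150, 0.1)  # ≈ $0.0099/hour

# GPT 4o mini transcribe at plain GPT 4o mini's per-token prices
mini_at_text_prices = cost_per_hour(750, 0.3, 150, 0.6)     # ≈ $0.0189/hour
```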

This means that the primary reason it's so cheap is its efficient tokenization, but that alone wouldn't be possible without a few other factors.