Why is Phi 4 Multimodal so cheap?

Phi 4 Multimodal is currently the cheapest transcription model, as this table shows:

| Model | Inputs | Outputs | Pricing* |
|---|---|---|---|
| GPT 4o transcribe | $6/mtok, ~750 tok/min | $10/mtok | ~$0.36/hour |
| GPT 4o mini transcribe | $3/mtok, ~750 tok/min | $5/mtok | ~$0.18/hour |
| Whisper Large (OpenAI) | 6000 frames/min pre-conv, 3000/min post-conv | - | $0.36/hour |
| Whisper Large (DeepInfra) | - | - | $0.027/hour |
| Whisper Large Turbo (DeepInfra) | - | - | $0.012/hour |
| Gemini 2.0 Flash Lite | $0.075/mtok, 1920 tok/min | $0.3/mtok | $0.011/hour |
| Phi 4 Multimodal | $0.05/mtok, 750 tok/min | $0.1/mtok | $0.003/hour |

\* Assuming 150 output tokens/minute

Why?

If Phi 4 used the same token rate as Whisper (3000 tokens/minute), its price would be almost the same as DeepInfra's Whisper: 3000 × 60 × $0.05/1,000,000 + 150 × 60 × $0.1/1,000,000 ≈ $0.01/hour.

If GPT 4o mini transcribe used the same pricing as plain GPT 4o mini, it would be much closer to DeepInfra's: 750 × 60 × $0.3/1,000,000 + 150 × 60 × $0.6/1,000,000 ≈ $0.019/hour.
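The two back-of-the-envelope calculations above use the same formula: 60 minutes of input tokens plus 60 minutes of output tokens, each at their per-million-token price. A minimal sketch (the function name and structure are mine, not from any API):

```python
def cost_per_hour(in_tok_per_min, in_price_per_mtok,
                  out_price_per_mtok, out_tok_per_min=150):
    """Dollars per hour of audio, assuming 150 output tokens/minute."""
    in_cost = in_tok_per_min * 60 * in_price_per_mtok / 1_000_000
    out_cost = out_tok_per_min * 60 * out_price_per_mtok / 1_000_000
    return in_cost + out_cost

print(cost_per_hour(750, 0.05, 0.1))   # Phi 4 Multimodal as priced: ~$0.003/hour
print(cost_per_hour(3000, 0.05, 0.1))  # Phi 4 at Whisper's token rate: ~$0.01/hour
print(cost_per_hour(750, 0.3, 0.6))    # at plain GPT 4o mini pricing: ~$0.019/hour
```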

This suggests that the primary reason Phi 4 is so cheap is its efficient tokenization (750 tokens per minute of audio), though it wouldn't be this cheap without a few other factors, such as its low per-token pricing.
