Why is Phi 4 Multimodal so cheap?
Phi 4 Multimodal is currently the cheapest transcription model, as this table shows:
Model | Inputs | Outputs | Pricing* |
---|---|---|---|
GPT 4o transcribe | $6/Mtok, ~750 tok/min | $10/Mtok | ~$0.36/hour |
GPT 4o mini transcribe | $3/Mtok, ~750 tok/min | $5/Mtok | ~$0.18/hour |
Whisper Large (OpenAI) | 6000 frames/min pre-conv, 3000 tok/min post-conv | – | $0.36/hour |
Whisper Large (DeepInfra) | – | – | $0.027/hour |
Whisper Large Turbo (DeepInfra) | – | – | $0.012/hour |
Gemini 2.0 Flash Lite | $0.075/Mtok, 1920 tok/min | $0.30/Mtok | $0.011/hour |
Phi 4 Multimodal | $0.05/Mtok, 750 tok/min | $0.10/Mtok | $0.003/hour |

\* Assuming 150 output tokens/minute.
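The per-hour figures for the token-priced models can be reproduced from the table's rates. A quick sketch (`hourly_cost` is just a helper name here; output is assumed at 150 tokens/minute, per the footnote):

```python
def hourly_cost(in_price_per_mtok, in_tok_per_min, out_price_per_mtok, out_tok_per_min=150):
    """Dollars per hour of audio: (tokens/min * 60) charged at $/Mtok, input plus output."""
    tokens_in = in_tok_per_min * 60
    tokens_out = out_tok_per_min * 60
    return (tokens_in * in_price_per_mtok + tokens_out * out_price_per_mtok) / 1_000_000

print(f"{hourly_cost(6, 750, 10):.3f}")        # GPT 4o transcribe -> 0.360
print(f"{hourly_cost(0.075, 1920, 0.3):.4f}")  # Gemini 2.0 Flash Lite -> 0.0113
print(f"{hourly_cost(0.05, 750, 0.1):.5f}")    # Phi 4 Multimodal -> 0.00315
```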
Why?
- Phi 4 uses few tokens to represent speech (750/min, versus Whisper's 3000/min).
- Phi 4 has a low price per token: because it's an open model, low-cost providers like DeepInfra can host it.
If Phi 4 used the same token rate as Whisper (3000 tokens/min), their prices would be nearly the same: 3000 * 60 * 0.05/1000000 + 150 * 60 * 0.1/1000000 ≈ $0.01/hour.
If GPT 4o mini transcribe used the same per-token pricing as plain GPT 4o mini, its price would be much closer to DeepInfra's: 750 * 60 * 0.3/1000000 + 150 * 60 * 0.6/1000000 ≈ $0.019/hour.
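These two counterfactuals check out arithmetically (rates in tokens/min, prices in $/Mtok, as in the text):

```python
# Phi 4 priced at Whisper's 3000 tokens/min input rate, plus 150 tokens/min output
phi4_at_whisper_rate = (3000 * 60 * 0.05 + 150 * 60 * 0.1) / 1_000_000
print(f"${phi4_at_whisper_rate:.4f}/hour")  # -> $0.0099/hour, i.e. ~$0.01

# GPT 4o mini transcribe at plain GPT 4o mini's per-token prices ($0.3 in, $0.6 out)
mini_at_text_prices = (750 * 60 * 0.3 + 150 * 60 * 0.6) / 1_000_000
print(f"${mini_at_text_prices:.4f}/hour")   # -> $0.0189/hour, i.e. ~$0.019
```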
This means the primary reason it's cheap is its efficient tokenization, but that alone wouldn't get the price this low without the other factor: the cheap per-token hosting its open weights make possible.