Multi-Modal Pricing
Multi-modal models charge different rates for different input types · images, audio, and video have their own per-unit prices alongside text tokens.
Basic
GPT-4o charges $2.50/M for text input tokens but $50/M for audio input tokens. Claude Sonnet 4 charges $3/M for text and $3.75/M for image tokens (with image-to-token conversion rules). Gemini 2.5 charges text and image tokens at the same rate but converts video at 258 tokens per frame. Multi-modal pricing therefore requires tracking consumption per input type to forecast costs accurately.
Deep
Image tokens: providers convert images to token equivalents via tile-based encoding. OpenAI counts ~170 tokens per 512×512 tile; Anthropic charges roughly 1,200 tokens per image. Audio: GPT-4o counts ~50-100 tokens per second; Gemini native audio is ~25 tokens/sec. Video: Gemini counts 258 tokens per frame sampled at 1fps, and the frame rate is adjustable. Output tokens are almost always text-only and priced as regular output. Tracking multi-modal consumption requires per-input-type metering.
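The tile and frame conversions above can be sketched as simple estimators. This is a simplified model, not any provider's exact formula (real tile encodings also involve resizing rules and base token overhead); the constants are the ones quoted in the text.

```python
import math

TILE = 512             # tile edge in pixels (OpenAI-style encoding)
TOKENS_PER_TILE = 170  # approximate tokens per 512x512 tile, per the text

def image_tokens(width: int, height: int) -> int:
    """Estimate image tokens by covering the image with 512x512 tiles."""
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
    return tiles * TOKENS_PER_TILE

def video_tokens(duration_sec: float, fps: float = 1.0,
                 tokens_per_frame: int = 258) -> int:
    """Gemini-style video estimate: tokens per sampled frame at a chosen fps."""
    return int(duration_sec * fps) * tokens_per_frame

print(image_tokens(1024, 768))  # 4 tiles -> 680 tokens
print(video_tokens(60))         # 60 frames at 1fps -> 15480 tokens
```

A one-minute video at the default 1fps already consumes ~15k input tokens, which is why the adjustable frame rate is a real cost lever.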
Expert
Pricing dynamics differ by modality. Text is commoditized · <$1/M is common now. Images are still premium · $2-5 per million token-equivalents. Audio input is expensive ($40-100/M token-equivalent) because it is still specialized capacity. Video is extremely expensive and limited · often only frontier models accept native video. The pricing gap between modalities drives architecture decisions: for high-volume image analysis, cheap models like Gemini Flash (~$0.30/M image tokens) beat premium models on total cost.
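The cheap-beats-premium claim is easy to make concrete. A hedged sketch: the rates are the illustrative figures from the text, and the monthly volume and tokens-per-image average are assumptions chosen for round numbers.

```python
def monthly_cost(images: int, tokens_per_image: int,
                 rate_per_million: float) -> float:
    """Total monthly image-token cost at a per-million-token rate."""
    return images * tokens_per_image / 1_000_000 * rate_per_million

IMAGES_PER_MONTH = 1_000_000  # hypothetical volume
TOKENS_PER_IMAGE = 1_000      # assumed average token equivalent

flash = monthly_cost(IMAGES_PER_MONTH, TOKENS_PER_IMAGE, 0.30)    # Flash-class
premium = monthly_cost(IMAGES_PER_MONTH, TOKENS_PER_IMAGE, 3.75)  # premium-class
print(flash, premium)  # 300.0 vs 3750.0 per month
```

An order-of-magnitude gap on identical volume: at scale, the per-token rate dominates any per-request quality premium unless the task truly needs the frontier model.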
Native video input shipped across Gemini 3 and GPT-5 in 2025 · multi-modal pricing is now a first-class finops concern.
Depending on why you're here
- ·Per-input-type pricing · text, image, audio, and video each have their own rate
- ·Image-to-token conversion via tile encoding
- ·Video typically 258 tokens/frame at 1fps
- ·Meter per input type for accurate forecasting
- ·Cheap multi-modal models (Gemini Flash) beat premium on high volume
- ·Audio input is 20-50× more expensive than text
- ·Multi-modal compute is still premium · expensive to train and serve
- ·Pricing gap reveals capacity constraints per modality
- ·Future: specialized providers for video (Runway, Pika) undercut generalist labs
- ·AI charges different prices for different media types
- ·Images cost more than text · videos cost even more
- ·Key for businesses processing photos or videos at scale
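The "meter per input type" advice above can be sketched as a minimal accumulator. This is an illustrative design, not any provider's SDK; the modality names and rates are assumptions.

```python
from collections import defaultdict

class UsageMeter:
    """Accumulate token consumption by modality and price it per type."""

    def __init__(self, rates_per_million: dict):
        self.rates = rates_per_million          # USD per million tokens
        self.tokens = defaultdict(int)          # modality -> token count

    def record(self, modality: str, tokens: int) -> None:
        self.tokens[modality] += tokens

    def total_cost(self) -> float:
        return sum(count / 1_000_000 * self.rates[modality]
                   for modality, count in self.tokens.items())

# Illustrative rates from the text (Claude-style text/image, GPT-4o-style audio)
meter = UsageMeter({"text": 3.00, "image": 3.75, "audio": 50.00})
meter.record("text", 500_000)
meter.record("image", 200_000)
print(meter.total_cost())  # 2.25
```

Keeping a single blended token count would hide exactly the per-modality skew this breakdown exposes.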
Multi-modal pricing is where model-choice decisions get counter-intuitive: at scale, a cheap multi-modal model often beats a premium one on total cost.