Multi-Modal Pricing

Multi-modal models charge different rates for different input types · images, audio, and video have their own per-unit prices alongside text tokens.

Level 1

GPT-4o charges $2.50/M text input tokens but $50/M audio input tokens. Claude Sonnet 4 charges $3/M text and $3.75/M image tokens (with image-token conversion rules). Gemini 2.5 charges text and image tokens at the same rate and counts video at 258 tokens per frame. Forecasting cost accurately therefore requires tracking consumption per input type.
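As a rough illustration of per-input-type tracking, the sketch below multiplies metered token counts by a per-modality rate. The rates are the GPT-4o figures quoted above; the function name `forecast_cost` and the usage numbers are hypothetical, and real provider prices should be checked before use.

```python
# Illustrative USD rates per million input tokens, taken from the GPT-4o
# figures quoted in the text. These are assumptions, not a live price list.
RATES_PER_MILLION = {"text": 2.50, "audio": 50.00}


def forecast_cost(usage_tokens: dict[str, int], rates: dict[str, float]) -> dict[str, float]:
    """Return cost in USD for each modality, plus a 'total' entry."""
    costs: dict[str, float] = {}
    for modality, tokens in usage_tokens.items():
        if modality not in rates:
            # Fail loudly rather than silently pricing an unmetered modality at zero.
            raise KeyError(f"no rate configured for modality: {modality}")
        costs[modality] = tokens / 1_000_000 * rates[modality]
    costs["total"] = sum(costs.values())
    return costs


# Hypothetical monthly usage: 1M text tokens and 100k audio tokens.
usage = {"text": 1_000_000, "audio": 100_000}
costs = forecast_cost(usage, RATES_PER_MILLION)
print(costs)
```

With these assumed numbers, the audio share (about $5.00) is twice the text share ($2.50) despite being one tenth of the token volume, which is why a single blended rate tends to misforecast.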

Level 2

Image tokens: providers convert images to token equivalents, typically via tile-based encoding. OpenAI: ~170 tokens per 512×512 tile. Anthropic: ~1,200 tokens per image (approximate). Audio: GPT-4o counts ~50-100 tokens per second; Gemini native audio ~25 tokens/sec. Video: Gemini counts 258 tokens per frame at 1fps; the frame rate is adjustable, and the token count scales with it. Output tokens are almost always text-only and priced as regular output. Tracking multi-modal consumption requires per-input metering.
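The conversions above can be sketched as simple estimators. This is a simplification that uses only the approximate figures quoted in this section (170 tokens per 512×512 tile, 258 tokens per video frame); the function names are hypothetical, and real providers apply extra rules such as resizing and base-token overheads that are not modeled here.

```python
import math

TILE_SIZE = 512          # tile edge in pixels, per the figure quoted above
TOKENS_PER_TILE = 170    # approximate tokens per 512x512 tile
TOKENS_PER_FRAME = 258   # approximate tokens per sampled video frame


def image_tokens_tiled(width: int, height: int) -> int:
    """Estimate image tokens by counting the tiles needed to cover the image."""
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return tiles * TOKENS_PER_TILE


def video_tokens(duration_s: float, fps: float = 1.0) -> int:
    """Estimate video tokens from duration and sampling rate (default 1 fps)."""
    frames = math.ceil(duration_s * fps)
    return frames * TOKENS_PER_FRAME


def audio_tokens(duration_s: float, tokens_per_second: float) -> int:
    """Estimate audio tokens; the per-second rate varies by provider."""
    return math.ceil(duration_s * tokens_per_second)
```

For example, under these assumptions a 1024×1024 image covers four tiles (680 tokens), and one minute of video sampled at 1 fps comes to 60 × 258 = 15,480 tokens. Halving the frame rate halves the video estimate.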

Level 3

Pricing dynamics differ by modality. Text is commoditized · <$1/M is common now. Images are still premium · $2-5/M token equivalent. Audio input is expensive ($40-100/M token-equivalent) because it's still specialized capacity. Video is extremely expensive and limited · often only frontier models accept native video. The pricing gap between modalities drives architecture decisions: for high-volume image analysis, cheap models like Gemini Flash (~$0.30/M image tokens) beat premium models on total cost.

Why this matters now

Native video input shipped across Gemini 3 and GPT-5 in 2025 · multi-modal pricing is now a first-class FinOps concern.

The takeaway for you
If you are a Researcher
  • Per-input-type pricing · text, image, audio, and video each have their own rate
  • Image-to-token conversion via tile encoding
  • Video typically 258 tokens/frame at 1fps
If you are a Builder
  • Meter per input type for accurate forecasting
  • Cheap multi-modal models (Gemini Flash) beat premium on high volume
  • Audio input is 20-50× more expensive than text
If you are an Investor
  • Multi-modal compute is still premium · expensive to train and serve
  • Pricing gap reveals capacity constraints per modality
  • Future: specialized providers for video (Runway, Pika) may undercut generalist labs
If you are a Curious · Normie
  • AI charges different prices for different media types
  • Images cost more than text · videos cost even more
  • Key for businesses processing photos or videos at scale
Gecko's take

Multi-modal pricing is where model-choice decisions get counter-intuitive. Cheap multi-modal often beats premium.

Per-image cost varies wildly. Gemini Flash: ~$0.0003/image. GPT-4o: ~$0.01/image. Claude Sonnet: ~$0.004/image. Premium models are roughly 10-30× more expensive.
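To see what that gap means at volume, the sketch below scales the approximate per-image prices quoted above to a monthly workload. The prices are this article's rough figures, not a live price list, and the helper names and the one-million-image volume are hypothetical.

```python
# Approximate USD per image, as quoted in the text (assumed, not authoritative).
PER_IMAGE_USD = {
    "Gemini Flash": 0.0003,
    "Claude Sonnet": 0.004,
    "GPT-4o": 0.01,
}


def monthly_cost(images_per_month: int, per_image_usd: float) -> float:
    """Total monthly spend for a given image volume."""
    return images_per_month * per_image_usd


def cost_multiple(model: str, baseline: str = "Gemini Flash") -> float:
    """How many times more expensive a model is than the baseline, per image."""
    return PER_IMAGE_USD[model] / PER_IMAGE_USD[baseline]


# Hypothetical workload: one million images per month.
for name, price in PER_IMAGE_USD.items():
    print(f"{name}: ${monthly_cost(1_000_000, price):,.2f}/month, "
          f"{cost_multiple(name):.1f}x baseline")
```

With these figures the multiples come out to roughly 13× for Claude Sonnet and 33× for GPT-4o relative to Gemini Flash, so a workload costing about $300 a month on the cheap model would cost about $4,000 to $10,000 on the premium ones.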