Multi-Modal Pricing
Multi-modal models charge different rates for different input types · images, audio, and video have their own per-unit prices alongside text tokens.
Basic
GPT-4o charges $2.50/M for text input tokens but $50/M for audio input tokens. Claude Sonnet 4 charges $3/M for text and $3.75/M for image tokens (with image-to-token conversion rules). Gemini 2.5 charges text and image tokens at the same rate but converts video at 258 tokens per frame. Multi-modal pricing therefore requires tracking consumption per input type to forecast costs accurately.
Deep
Image tokens: providers convert images to token equivalents via tile-based encoding. OpenAI counts ~170 tokens per 512×512 tile; Anthropic charges roughly 1,200 tokens per image. Audio: GPT-4o counts ~50-100 tokens per second; Gemini native audio is ~25 tokens/sec. Video: Gemini counts 258 tokens per frame sampled at 1fps, and the frame rate is adjustable. Output tokens are almost always text-only and priced as regular output. Tracking multi-modal consumption requires per-input-type metering.
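The tile and frame conversions above can be sketched as simple estimators. This is a simplified model, not any provider's exact formula (real tile encodings also involve resizing rules and base token overhead); the constants are the ones quoted in the text.

```python
import math

TILE = 512             # tile edge in pixels (OpenAI-style encoding)
TOKENS_PER_TILE = 170  # approximate tokens per 512x512 tile, per the text

def image_tokens(width: int, height: int) -> int:
    """Estimate image tokens by covering the image with 512x512 tiles."""
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
    return tiles * TOKENS_PER_TILE

def video_tokens(duration_sec: float, fps: float = 1.0,
                 tokens_per_frame: int = 258) -> int:
    """Gemini-style video estimate: tokens per sampled frame at a chosen fps."""
    return int(duration_sec * fps) * tokens_per_frame

print(image_tokens(1024, 768))  # 4 tiles -> 680 tokens
print(video_tokens(60))         # 60 frames at 1fps -> 15480 tokens
```

A one-minute video at the default 1fps already consumes ~15k input tokens, which is why the adjustable frame rate is a real cost lever.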
Expert
Pricing dynamics differ by modality. Text is commoditized · <$1/M is common now. Images are still premium · $2-5 per million token-equivalents. Audio input is expensive ($40-100/M token-equivalent) because it is still specialized capacity. Video is extremely expensive and limited · often only frontier models accept native video. The pricing gap between modalities drives architecture decisions: for high-volume image analysis, cheap models like Gemini Flash (~$0.30/M image tokens) beat premium models on total cost.
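The cheap-beats-premium claim is easy to make concrete. A hedged sketch: the rates are the illustrative figures from the text, and the monthly volume and tokens-per-image average are assumptions chosen for round numbers.

```python
def monthly_cost(images: int, tokens_per_image: int,
                 rate_per_million: float) -> float:
    """Total monthly image-token cost at a per-million-token rate."""
    return images * tokens_per_image / 1_000_000 * rate_per_million

IMAGES_PER_MONTH = 1_000_000  # hypothetical volume
TOKENS_PER_IMAGE = 1_000      # assumed average token equivalent

flash = monthly_cost(IMAGES_PER_MONTH, TOKENS_PER_IMAGE, 0.30)    # Flash-class
premium = monthly_cost(IMAGES_PER_MONTH, TOKENS_PER_IMAGE, 3.75)  # premium-class
print(flash, premium)  # 300.0 vs 3750.0 per month
```

An order-of-magnitude gap on identical volume: at scale, the per-token rate dominates any per-request quality premium unless the task truly needs the frontier model.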
Native video input shipped across Gemini 3 and GPT-5 in 2025 · multi-modal pricing is now a first-class finops concern.
Depending on why you're here
- ·Per-input-type pricing · text, image, audio, and video each have their own rate
- ·Image-to-token conversion via tile encoding
- ·Video typically 258 tokens/frame at 1fps
- ·Meter per input type for accurate forecasting
- ·Cheap multi-modal models (Gemini Flash) beat premium on high volume
- ·Audio input is 20-50× more expensive than text
- ·Multi-modal compute is still premium · expensive to train and serve
- ·Pricing gap reveals capacity constraints per modality
- ·Future: specialized providers for video (Runway, Pika) undercut generalist labs
- ·AI charges different prices for different media types
- ·Images cost more than text · videos cost even more
- ·Key for businesses processing photos or videos at scale
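The "meter per input type" advice above can be sketched as a minimal accumulator. This is an illustrative design, not any provider's SDK; the modality names and rates are assumptions.

```python
from collections import defaultdict

class UsageMeter:
    """Accumulate token consumption by modality and price it per type."""

    def __init__(self, rates_per_million: dict):
        self.rates = rates_per_million          # USD per million tokens
        self.tokens = defaultdict(int)          # modality -> token count

    def record(self, modality: str, tokens: int) -> None:
        self.tokens[modality] += tokens

    def total_cost(self) -> float:
        return sum(count / 1_000_000 * self.rates[modality]
                   for modality, count in self.tokens.items())

# Illustrative rates from the text (Claude-style text/image, GPT-4o-style audio)
meter = UsageMeter({"text": 3.00, "image": 3.75, "audio": 50.00})
meter.record("text", 500_000)
meter.record("image", 200_000)
print(meter.total_cost())  # 2.25
```

Keeping a single blended token count would hide exactly the per-modality skew this breakdown exposes.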
Multi-modal pricing is where model-choice decisions get counter-intuitive: at scale, a cheap multi-modal model often beats a premium one on total cost.