
Multimodal

A model that handles multiple input or output types · text, image, audio, video · not just text alone.


Level 1

Multimodal models can read images, listen to audio, watch video, and respond in text (or sometimes with image or audio output). Frontier examples: GPT-5 (text + image + audio), Claude 4.5 Opus (text + image), Gemini 3 (text + image + audio + video native). Use cases: image understanding, document extraction (charts, diagrams, handwriting), video summarization, audio transcription + reasoning, and robotic perception.

Level 2

Two main architectures: (1) separate encoders per modality fused into a shared transformer (CLIP-style, earlier Gemini), and (2) native multimodal pretraining where tokens from different modalities share a unified embedding space (Gemini 1.5+, Chameleon). Native multimodal is more expressive but harder to train. Vision tokens use ViT (Vision Transformer) or patch-based encoders. Audio uses Whisper-style mel-spectrograms. Video extends images with temporal attention. Context windows for multimodal are often shorter · 1 minute of video can consume 10K+ tokens.
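The context budget above can be made concrete with back-of-envelope arithmetic. A minimal sketch, assuming ViT-style non-overlapping patches and frame-level video encoding; the patch size (16), sampling rate (1 frame/sec), and tokens-per-frame figure (258) are illustrative assumptions, not any specific model's real numbers:

```python
def image_tokens(height, width, patch=16):
    """ViT-style patching: each non-overlapping patch becomes one token."""
    return (height // patch) * (width // patch)

def video_tokens(seconds, fps=1, tokens_per_frame=258):
    """Frame-level encoding: sample frames, encode each to a fixed token budget."""
    return int(seconds * fps) * tokens_per_frame

# A 224x224 image with 16x16 patches -> 14 * 14 = 196 tokens.
print(image_tokens(224, 224))   # 196
# One minute of video at 1 frame/sec -> 60 * 258 = 15,480 tokens,
# consistent with "1 minute of video can consume 10K+ tokens".
print(video_tokens(60))         # 15480
```

This is why long video eats context so fast: token count scales linearly with sampled frames, which is what temporal pooling (discussed below) tries to compress.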

Level 3

CLIP contrastive training aligns image and text embeddings. Flamingo (DeepMind) used cross-attention from frozen image encoder into LM. Native approaches (Gemini 1.5, Chameleon) tokenize images into discrete codes via VQ-VAE or similar, then treat them like text tokens. Audio tokenization via neural codecs (SoundStream, EnCodec) enables audio in and out. Video is the hardest · native frame-level tokenization blows up context, temporal downsampling loses detail. Most 2026 multimodal APIs use hybrid: frame-level encoding with temporal pooling.
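The CLIP-style contrastive objective can be sketched in a few lines of NumPy: a symmetric cross-entropy that pushes each matched image/text pair (the diagonal of a cosine-similarity matrix) above all mismatched pairs in the batch. The random embeddings below are stand-ins for real encoder outputs, and the temperature value is illustrative:

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of each matrix is a matched
    image/text pair; the batch's other rows act as negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) cosine sims
    n = len(logits)
    diag = np.arange(n)
    i2t = -log_softmax(logits, axis=1)[diag, diag]   # image -> text direction
    t2i = -log_softmax(logits, axis=0)[diag, diag]   # text -> image direction
    return (i2t.mean() + t2i.mean()) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 64))
# Aligned pairs (text ~ image) score far lower loss than mismatched pairs.
aligned = clip_loss(emb, emb + 0.01 * rng.normal(size=emb.shape))
mismatched = clip_loss(emb, np.roll(emb, 1, axis=0))
print(aligned < mismatched)   # True
```

Training on this objective is what aligns the two embedding spaces; native approaches skip the separate alignment step by tokenizing everything into one vocabulary from the start.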

Why this matters now

Multimodal is table stakes for 2026 · agents need to see screens, read documents, and watch video. Every frontier model ships multimodal by default.

The takeaway for you
If you are a Researcher
  • Native multimodal (shared embedding space) outperforms encoder fusion
  • ViT patches + text tokens for image understanding
  • Video: frame tokenization + temporal pooling is the dominant pattern
If you are a Builder
  • GPT-5 Vision and Claude 4.5 for most image-in workloads
  • Gemini for longest video understanding (2M context)
  • OCR still often cheaper than multimodal for simple document tasks
If you are an Investor
  • Multimodal commoditized fast · a feature of every frontier model
  • Differentiation shifts to native multimodal quality and long video
  • Robotics + agents drive multimodal demand beyond chat
If you are a Curious Normie
  • AI that understands pictures, sound, and video too · not just text
  • Can look at your screenshot and tell you what is in it
  • Standard feature in 2026 · not special
Gecko's take

Multimodal is a feature, not a product. The 2026 winners differentiate on quality, not capability presence.

Gemini 3 (native text + image + audio + video). GPT-5 adds image + audio. Claude 4.5 adds image. All frontier models have vision by default in 2026.