
Batched Inference

TL;DR

Batch APIs let you submit thousands of prompts for offline processing at 50% of synchronous pricing · turnaround up to 24 hours.

Level 1

OpenAI's Batch API, Anthropic's Message Batches, and Gemini's Batch Prediction all offer the same tradeoff: submit a JSONL file of prompts, get results back within 24 hours, pay 50% of real-time pricing. Use cases: bulk content generation, offline data labeling, evaluation runs, embedding generation. Not for interactive apps.
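The JSONL submission step can be sketched as follows. The line format mirrors OpenAI's documented batch input (one JSON object per request, with `custom_id`, `method`, `url`, and `body`), but the model name and prompts here are illustrative:

```python
import json

# Build a batch input file: one JSON object per line. Each line carries a
# custom_id (used to match results back later), the target endpoint, and
# the request body. The prompts and model name are placeholders.
prompts = ["Summarize: the sky is blue.", "Summarize: water is wet."]

with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        f.write(json.dumps(request) + "\n")
```

The file is then uploaded and referenced when creating the batch; each provider has its own wrapper around this same submit-a-file pattern.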

Level 2

Batch requests bypass synchronous rate limits · you can submit 50K prompts even if your sync rate limit is 1,000 RPM. The 50% discount applies to both input and output tokens. Providers run batches on spare capacity, which is why the SLA is "up to 24 hours"; typical real-world turnaround is 1-4 hours. Combined with prompt caching, batch pricing yields an effective 10-20× cost reduction vs naive synchronous calls.
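A rough sketch of how the savings stack. All prices and discounts below are hypothetical assumptions for illustration, not provider quotes; the workload is deliberately cache-heavy (a long shared prefix):

```python
# Hypothetical per-1M-token prices and discounts (illustrative only).
sync_input_price = 3.00      # $/1M input tokens, synchronous
sync_output_price = 15.00    # $/1M output tokens, synchronous
batch_discount = 0.5         # batch bills at 50% of sync pricing
cache_read_discount = 0.1    # assume cached input tokens bill at 10% of base

# Job: 1M requests, each with 10,000 input tokens (9,800 of them a shared
# cached prefix) and 200 output tokens.
n = 1_000_000
input_tok, cached_tok, output_tok = 10_000, 9_800, 200

naive = n * (input_tok / 1e6 * sync_input_price
             + output_tok / 1e6 * sync_output_price)

batched = n * batch_discount * (
    (input_tok - cached_tok) / 1e6 * sync_input_price
    + cached_tok / 1e6 * sync_input_price * cache_read_discount
    + output_tok / 1e6 * sync_output_price
)

print(f"naive: ${naive:,.0f}  batch+cache: ${batched:,.0f}  "
      f"reduction: {naive / batched:.1f}x")
```

Under these assumptions the stacked reduction lands around 10×; a longer cached prefix or cheaper cache-read rate pushes it toward the top of the 10-20× range.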

Level 3

Results map back to requests via custom_id (unique within a batch). Failed items are returned with error codes; you aren't charged for them. Anthropic batches support cache_control blocks, so prompt caching applies across the batch. Limits: OpenAI caps at 50K requests or 100MB per batch; Anthropic at 100K or 256MB. Cross-provider pattern: use batch for reranking, synthetic data generation, and large-scale evaluation. FinOps teams now mandate batch for any workflow that can tolerate 1-24h latency.
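The custom_id mapping and per-item error handling can be sketched like this. The two output lines are simplified stand-ins for a real provider's result file, not actual API responses:

```python
import json

# Parse a batch output file: match each result to its input by custom_id,
# and route failed items (returned with error codes) to a retry pile.
output_lines = [
    '{"custom_id": "task-0", "response": {"body": {"choices": '
    '[{"message": {"content": "ok"}}]}}, "error": null}',
    '{"custom_id": "task-1", "response": null, "error": '
    '{"code": "invalid_request", "message": "bad body"}}',
]

results, failures = {}, {}
for line in output_lines:
    rec = json.loads(line)
    if rec.get("error"):
        failures[rec["custom_id"]] = rec["error"]["code"]
    else:
        body = rec["response"]["body"]
        results[rec["custom_id"]] = body["choices"][0]["message"]["content"]

print(results)   # successes keyed by custom_id
print(failures)  # error codes keyed by custom_id, eligible for resubmission
```

Because failed items carry no charge, resubmitting just the `failures` keys in a fresh batch is the usual recovery path.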

The takeaway for you
If you are a
Researcher
  • 50% discount on input + output tokens
  • SLA up to 24 hours · typical 1-4 hours
  • Bypasses synchronous rate limits
If you are a
Builder
  • Use for eval runs, embeddings, bulk generation
  • JSONL submission · poll until complete
  • Combine with prompt caching for stacked savings
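The "poll until complete" step above can be sketched as a backoff loop. `get_batch_status` is a stand-in for a real SDK call (e.g. retrieving a batch by id and reading its status); the demo drives it with a fake status sequence instead of a live API:

```python
import time

def wait_for_batch(get_batch_status, batch_id, interval=60, max_interval=900):
    """Poll a batch's status with exponential backoff until it settles."""
    while True:
        status = get_batch_status(batch_id)
        if status in ("completed", "failed", "expired", "cancelled"):
            return status
        time.sleep(interval)  # batches take hours; poll sparingly
        interval = min(interval * 2, max_interval)

# Demo: a fake status sequence standing in for real API responses.
statuses = iter(["validating", "in_progress", "completed"])
final = wait_for_batch(lambda _id: next(statuses), "batch_123", interval=0)
print(final)
```

Backoff matters here: with multi-hour turnarounds, polling every few seconds just burns requests against your sync rate limit.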
If you are a
Investor
  • Batch utilizes spare provider capacity · win-win pricing
  • Enterprise workloads increasingly shift to batch
  • Catalyst for very large-scale AI training data generation
If you are a
Curious · Normie
  • A way to ask AI many questions at once for half price · but wait hours for answers
  • Good for big non-urgent jobs
  • Not for chat apps
Gecko's take

Batch API is the first cost-cutting lever for any non-chat workflow. 50% off is enormous · use it.

OpenAI (Batch API), Anthropic (Message Batches), Google (Batch Prediction on Gemini). All at 50% discount.