Autonomous agent stack
The complete autonomous agent stack. Model, framework, tools, and cost per 100 runs. Built for long-horizon task execution with tool use.
Tiers: 3
Type: Stack recipe
Updated: 2026-04
What this page is
Autonomous agents run long loops with tool calls, planning, memory, and retries. Cost scales with loop depth, context accumulation, and retry rate. The model choice dominates quality and cost. The framework drives how the loop is wired. Our estimates assume 100 runs with ~50K tokens each (25K in, 25K out) across 10 to 20 tool calls.
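The per-tier dollar figures below all follow from the same arithmetic. A minimal sketch, using one illustrative price pair that happens to reproduce the ~$75 frontier figure; actual per-million-token prices vary by provider and model and are not taken from this page:

```python
# Rough cost model for the "100 runs x ~50K tokens (25K in, 25K out)" assumption.
# The prices passed in are illustrative placeholders, not real list prices.
def cost_per_100_runs(price_in_per_m: float, price_out_per_m: float,
                      tokens_in: int = 25_000, tokens_out: int = 25_000,
                      runs: int = 100) -> float:
    """Total USD for `runs` runs at the given per-million-token prices."""
    per_run = (tokens_in / 1e6) * price_in_per_m + (tokens_out / 1e6) * price_out_per_m
    return per_run * runs

# Example: $5/M input, $25/M output reproduces the ~$75 frontier estimate.
print(round(cost_per_100_runs(5.0, 25.0), 2))
```

Swapping in a budget provider's prices (well under $1/M each way) drops the same workload to a few dollars, which is the whole spread between the tiers below.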
Tier-by-tier breakdown
Frontier, mainstream, and budget recipes. Pick the row that matches your workload.
Frontier
Frontier · reliability first
Provider: Anthropic (direct)
Estimate: 100 runs · 25K in + 25K out
~$75/100 runs
For agents that actually need to finish: the reliability gap over cheaper models shows up at run lengths of 20+ steps. Mythos handles tool schemas without hallucinating. Worth the premium for any run whose failure would break your product.
Mainstream
Mainstream · default
Provider: OpenAI
Estimate: 100 runs · 25K in + 25K out
~$31/100 runs
The default production agent stack: GPT-5 is reliable with tool calling, Smolagents is a lean framework that works with any OpenAI-compatible endpoint, and MCP servers provide the tool universe (search, files, databases, APIs).
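The loop such a stack wires up is simple at its core: the model proposes a tool call, the framework executes it and appends the result, and the cycle repeats until the model answers. A minimal sketch of that pattern with a stub model and hypothetical tool names ("search", "read_file"); it does not use the real Smolagents or MCP APIs:

```python
# Minimal agent loop: model proposes tool calls, loop executes and feeds back.
# The model is a stub and the tools are hypothetical stand-ins for MCP servers.
TOOLS = {
    "search": lambda query: f"3 results for {query!r}",
    "read_file": lambda path: f"<contents of {path}>",
}

def run_agent(model, task: str, max_turns: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        # Model returns either a tool call or a final answer.
        action = model(history)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": result})
    return "max turns exceeded"

# Stub model: one search call, then finish with the tool result.
def stub_model(history):
    if len(history) == 1:
        return {"tool": "search", "args": {"query": "agent frameworks"}}
    return {"final": history[-1]["content"]}

print(run_agent(stub_model, "compare agent frameworks"))
```

The `max_turns` cap is the part that matters in production: it bounds both runaway cost and the retry behavior the budget tier trades away.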
Budget
Budget · experimentation
Provider: DeepInfra
Estimate: 100 runs · 25K in + 25K out
~$3/100 runs
For experimentation and research-grade agents where reliability is negotiable. DeepSeek V3.2 handles tool calling acceptably. CrewAI is a good multi-agent framework. Expect higher retry rates than frontier stacks.
Alternative picks
If the defaults do not fit, try these.
Alternative
Claude Sonnet + LangGraph
Strong mid-tier blend. LangGraph is battle-tested for complex DAG agents.
Alternative
Gemini 2.5 Pro + Vertex Agent Builder
If you are on GCP. Tight integration with Vertex workflows and Google Search tools.
Alternative
Open source all the way: Llama 3.3 + LangGraph, self-hosted
Zero vendor lock-in. Requires more engineering, but there is no per-token cost once the GPU is provisioned.
Frequently asked questions
Why do long agent runs cost so much?
Loops compound. Every turn re-sends the system prompt, tool schemas, and the accumulated conversation. A 20-turn loop can burn 500K tokens on what started as a single question. Prompt caching is the biggest single lever for cutting that cost.
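The compounding is quadratic, which is easy to see in a back-of-the-envelope model. A sketch with assumed sizes (a 3K-token fixed prefix of system prompt plus tool schemas, and ~1.2K tokens added per turn); the numbers are illustrative, not measured:

```python
# Each turn re-sends the fixed prefix plus every prior turn, so cumulative
# input tokens grow quadratically with loop depth. Sizes are assumptions.
def total_input_tokens(turns: int, fixed: int = 3_000, per_turn: int = 1_200) -> int:
    """Cumulative input tokens across a loop where turn t re-sends the
    fixed prefix plus (t - 1) accumulated turns of conversation."""
    return sum(fixed + (t - 1) * per_turn for t in range(1, turns + 1))

# 20 turns: 60K of re-sent prefix + 228K of re-sent conversation = 288K input
# tokens, before counting any output. Doubling per-turn size roughly doubles
# the quadratic term, which is how a run reaches the 500K range.
print(total_input_tokens(20))
```

Prompt caching attacks exactly the re-sent portion: the fixed prefix and prior turns are byte-identical across turns, so a provider that caches them bills only the new tokens at full price, which is why it is the single biggest lever here.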