Inferentia 3
Inferentia 3 is AWS's third-gen inference ASIC · launched in late 2024 to serve LLMs such as Claude and Llama at scale on Bedrock.
Basic
Inferentia 3 targets LLM inference on AWS Bedrock. Specs: 5nm, 2 NeuronCores v3, 128GB HBM3, optimized for low-latency decode. Used primarily inside AWS Bedrock · customers rarely provision Inferentia instances directly · they consume model endpoints backed by Inferentia.
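Consuming a Bedrock model endpoint looks the same regardless of the chip behind it. A minimal sketch with boto3's `bedrock-runtime` client · the model ID and request-body schema are assumptions (they vary by model family; check the current Bedrock catalog):

```python
import json


def build_llama_body(prompt: str, max_gen_len: int = 128) -> str:
    # Hypothetical request body for a Llama-class model on Bedrock;
    # the exact schema differs per model family, so treat this as an assumption.
    return json.dumps({
        "prompt": prompt,
        "max_gen_len": max_gen_len,
        "temperature": 0.5,
    })


def invoke(prompt: str) -> dict:
    # boto3 is the standard AWS SDK; "bedrock-runtime" is the service
    # used for model invocation. Requires AWS credentials and region access.
    import boto3
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    resp = client.invoke_model(
        modelId="meta.llama3-70b-instruct-v1:0",  # example ID · an assumption
        body=build_llama_body(prompt),
    )
    return json.loads(resp["body"].read())
```

Nothing in this call exposes the underlying hardware · whether the endpoint is H100- or Inferentia-backed is invisible to the caller, which is exactly the abstraction Bedrock sells.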
Deep
Inferentia 3's design prioritizes low cost per token over peak throughput. AWS claims 60% lower cost per token vs equivalent H100 deployments for Llama 3 70B-class models. Shipping primarily inside Bedrock · end-users see it via cheaper Bedrock pricing for "optimized" model SKUs. Software: the Neuron SDK, with vLLM integration and Neuron-ecosystem analogues of TensorRT-LLM.
Expert
Inferentia 3's compute-to-memory ratio is tuned for autoregressive decode (memory-bandwidth bound, not compute bound). 128GB HBM3 per chip allows single-chip serving of 70B-class models with 8-bit quantization. NeuronCores v3 add better support for sparse attention and grouped-query attention · reflecting 2024 LLM architecture trends. Not available outside AWS · strategic lock-in.
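The bandwidth-bound claim can be made concrete with back-of-envelope arithmetic: each generated token requires streaming every weight from HBM once, so memory bandwidth caps single-stream decode throughput. The bandwidth figure below is an assumption (Inferentia 3's HBM3 bandwidth is not stated in this document):

```python
# Decode-throughput ceiling for a single autoregressive stream.
# Every generated token reads all model weights from HBM once,
# so tokens/s <= bandwidth / model_bytes.

HBM_BANDWIDTH_BYTES_S = 2.4e12   # ASSUMED HBM3 bandwidth, not a published spec
PARAMS = 70e9                    # 70B-class model (per the text)
BYTES_PER_PARAM = 1              # 8-bit quantization (per the text)

bytes_per_token = PARAMS * BYTES_PER_PARAM
tokens_per_s = HBM_BANDWIDTH_BYTES_S / bytes_per_token
print(round(tokens_per_s, 1))    # ceiling in tokens/s per decode stream
```

Under these assumptions the ceiling is roughly 34 tokens/s per stream · batching multiple streams amortizes the weight reads, which is why serving stacks push batch size as hard as latency targets allow.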
Depending on why you're here
- AWS 3rd-gen inference chip · 5nm
- 128GB HBM3 per chip
- Optimized for autoregressive decode
- Consumed via Bedrock · rarely directly provisioned
- AWS Bedrock uses Inferentia 3 for "optimized" SKUs
- Lower per-token price than H100-backed Bedrock equivalents
- AWS margin lever on Bedrock
- Hardware-level differentiation vs Azure/GCP equivalents of Bedrock
- Anthropic, Meta, Cohere all use Bedrock · indirect Inferentia 3 exposure
- Amazon's inference chip · runs AI models on AWS
- Makes AI serving cheaper on AWS Bedrock
- Customers don't see it directly
Inferentia 3 is the quiet margin lever on Bedrock · most customers don't know they're using it.