Inferentia 3
Inferentia 3 is AWS's third-gen inference ASIC · launched in late 2024 to serve LLMs such as Claude and Llama at scale on Bedrock.
Basic
Inferentia 3 targets LLM inference on AWS Bedrock. Specs: 5nm, 2 NeuronCores v3, 128GB HBM3, optimized for low-latency decode. Used primarily inside AWS Bedrock · customers rarely provision Inferentia instances directly · they consume model endpoints backed by Inferentia.
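Consuming a Bedrock model endpoint looks the same regardless of the chip behind it. A minimal sketch with boto3's `bedrock-runtime` client · the model ID and request-body schema are assumptions (they vary by model family; check the current Bedrock catalog):

```python
import json


def build_llama_body(prompt: str, max_gen_len: int = 128) -> str:
    # Hypothetical request body for a Llama-class model on Bedrock;
    # the exact schema differs per model family, so treat this as an assumption.
    return json.dumps({
        "prompt": prompt,
        "max_gen_len": max_gen_len,
        "temperature": 0.5,
    })


def invoke(prompt: str) -> dict:
    # boto3 is the standard AWS SDK; "bedrock-runtime" is the service
    # used for model invocation. Requires AWS credentials and region access.
    import boto3
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    resp = client.invoke_model(
        modelId="meta.llama3-70b-instruct-v1:0",  # example ID · an assumption
        body=build_llama_body(prompt),
    )
    return json.loads(resp["body"].read())
```

Nothing in this call exposes the underlying hardware · whether the endpoint is H100- or Inferentia-backed is invisible to the caller, which is exactly the abstraction Bedrock sells.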
Deep
Inferentia 3's design prioritizes low cost per token over peak throughput. AWS claims 60% lower cost per token vs equivalent H100 deployments for Llama 3 70B-class models. Shipping primarily inside Bedrock · end-users see it via cheaper Bedrock pricing for "optimized" model SKUs. Software: the Neuron SDK, with vLLM integration and Neuron-ecosystem analogues of TensorRT-LLM.
Expert
Inferentia 3's compute-to-memory ratio is tuned for autoregressive decode (memory-bandwidth bound, not compute bound). 128GB HBM3 per chip allows single-chip serving of 70B-class models with 8-bit quantization. NeuronCores v3 add better support for sparse attention and grouped-query attention · reflecting 2024 LLM architecture trends. Not available outside AWS · strategic lock-in.
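The bandwidth-bound claim can be made concrete with back-of-envelope arithmetic: each generated token requires streaming every weight from HBM once, so memory bandwidth caps single-stream decode throughput. The bandwidth figure below is an assumption (Inferentia 3's HBM3 bandwidth is not stated in this document):

```python
# Decode-throughput ceiling for a single autoregressive stream.
# Every generated token reads all model weights from HBM once,
# so tokens/s <= bandwidth / model_bytes.

HBM_BANDWIDTH_BYTES_S = 2.4e12   # ASSUMED HBM3 bandwidth, not a published spec
PARAMS = 70e9                    # 70B-class model (per the text)
BYTES_PER_PARAM = 1              # 8-bit quantization (per the text)

bytes_per_token = PARAMS * BYTES_PER_PARAM
tokens_per_s = HBM_BANDWIDTH_BYTES_S / bytes_per_token
print(round(tokens_per_s, 1))    # ceiling in tokens/s per decode stream
```

Under these assumptions the ceiling is roughly 34 tokens/s per stream · batching multiple streams amortizes the weight reads, which is why serving stacks push batch size as hard as latency targets allow.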
Depending on why you're here
- AWS 3rd-gen inference chip · 5nm
- 128GB HBM3 per chip
- Optimized for autoregressive decode
- Consumed via Bedrock · rarely directly provisioned
- AWS Bedrock uses Inferentia 3 for "optimized" SKUs
- Lower per-token price than H100-backed Bedrock equivalents
- AWS margin lever on Bedrock
- Hardware-level differentiation vs Azure/GCP equivalents of Bedrock
- Anthropic, Meta, Cohere all use Bedrock · indirect Inferentia 3 exposure
- Amazon's inference chip · runs AI models on AWS
- Makes AI serving cheaper on AWS Bedrock
- Customers don't see it directly
Inferentia 3 is the quiet margin lever on Bedrock · most customers don't know they're using it.