
Guardrails

Runtime safety filters around AI models · check inputs for attacks, check outputs for harms, block policy violations.

Level 1

Guardrails run outside the model. They check user inputs (prompt injection detection, PII scanning) and model outputs (toxicity, factual verification, policy compliance). Production-critical · you can't trust the model alone to catch every edge case. Popular tools: Guardrails AI, NVIDIA NeMo Guardrails, Lakera Guard, Protect AI.
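The input-side checks above can be sketched with a few lines of Python. This is an illustrative toy, not production-grade: the regex patterns and injection marker list are stand-ins for the trained classifiers a real deployment (e.g. Lakera Guard) would use.

```python
import re

# Illustrative patterns only · real systems use trained classifiers,
# not hand-written regexes and keyword lists.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
INJECTION_MARKERS = [
    "ignore previous instructions",
    "disregard your system prompt",
]

def check_input(text: str) -> list[str]:
    """Return a list of guardrail violations found in a user prompt."""
    violations = [
        f"pii:{name}" for name, pat in PII_PATTERNS.items() if pat.search(text)
    ]
    lowered = text.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        violations.append("prompt_injection")
    return violations
```

The caller blocks or redacts before the prompt ever reaches the model · that's the "outside the model" part.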

Level 2

Guardrail categories: (1) input filtering (prompt injection detection, jailbreak recognition, PII redaction), (2) output filtering (toxicity scoring, factual verification, format compliance), (3) rate/volume controls (anomaly detection, cost caps). Implementation: small classifier models, regex patterns, rule engines, or separate LLM calls dedicated to validation. Production systems stack multiple layers because no single approach catches everything. Trade-off: more guardrails = more latency and more false positives.
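The layer-stacking idea can be sketched as an ordered list of check functions, any one of which can block a request. The layer names and pass/block convention here are hypothetical, for illustration only:

```python
from typing import Callable, Optional

# Each layer returns None to pass, or a reason string to block.
GuardrailCheck = Callable[[str], Optional[str]]

def block_injection(text: str) -> Optional[str]:
    # Stand-in for a real injection classifier.
    return "prompt_injection" if "ignore previous instructions" in text.lower() else None

def block_pii(text: str) -> Optional[str]:
    # Stand-in for a real PII scanner.
    return "pii_detected" if "@" in text else None

def run_guardrails(text: str, layers: list[GuardrailCheck]) -> Optional[str]:
    """Run layers in order; return the first violation, or None if all pass."""
    for layer in layers:
        reason = layer(text)
        if reason is not None:
            return reason
    return None
```

Ordering matters in practice: put cheap checks (regex, keywords) before expensive ones (classifier or LLM calls) so most traffic short-circuits early.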

Level 3

Jailbreak detection classifiers (Lakera Guard, open-source variants) catch 80-95% of known patterns but struggle with novel attacks. Input/output LLM-as-judge approaches have 90%+ accuracy on safety eval sets at 10-50ms latency overhead. Enterprise deployments typically run 3-5 guardrail layers: input sanitization, PII detection, output toxicity, output fact-check, format validator. Open-source framework choice · Guardrails AI (Python-native), NeMo Guardrails (config-driven), LLM Guard (comprehensive), or custom.
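An output-side LLM-as-judge filter can be sketched as a wrapper around a pluggable scoring function. The judge here is a keyword stub for illustration; in a real system it would be a separate model or API call returning a harm score:

```python
from typing import Callable

def make_output_filter(judge: Callable[[str], float], threshold: float = 0.5):
    """Wrap a judge scoring harm in [0, 1] into a pass/block output filter."""
    def output_filter(model_output: str) -> str:
        if judge(model_output) >= threshold:
            return "[blocked by output guardrail]"
        return model_output
    return output_filter

# Stub judge for illustration · a real judge would be an LLM call.
def stub_judge(text: str) -> float:
    return 1.0 if "harmful" in text.lower() else 0.0
```

The closure pattern keeps the judge swappable, so the same filter code works whether the judge is a regex stub, a small classifier, or a dedicated LLM call.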

The takeaway for you
If you are a Researcher
  • Input filtering + output filtering + volume controls · three categories
  • LLM-as-judge is the dominant pattern · 90%+ accuracy at 10-50ms latency
  • Open-source tools: Guardrails AI, NeMo Guardrails, LLM Guard
If you are a Builder
  • Never deploy LLMs to end users without output guardrails
  • PII redaction + prompt injection detection are the two must-haves
  • Budget 10-50ms per guardrail check · plan latency accordingly
If you are an Investor
  • Guardrails market: Lakera, Protect AI, and Robust Intelligence have raised $50M+ rounds
  • Enterprise compliance drives adoption · GDPR, the EU AI Act, and HIPAA require audit trails
  • Commoditizing fast · differentiation shifts to integration quality
If you are a Curious · Normie
  • Safety filters around AI · catch the bad stuff the AI itself might miss
  • Why ChatGPT won't tell you how to build a bomb
  • Required for production AI, not optional
Gecko's take

If you're running production AI without guardrails, you're one prompt-injection away from a front-page incident.

Frontier models catch baseline harmful requests on their own, but they miss prompt injection, data leaks, format violations, and domain-specific policy rules.