Concepts
MoE, RAG, reasoning, quantization, tokens, agents.
Top 12 terms
Mixture of Experts (MoE) · a model architecture that routes each token to a small subset of specialized experts, so only a fraction of the parameters activate per forward pass.
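Top-k routing can be sketched in a few lines. Everything below is a toy: the gate vectors, expert functions, and two-dimensional token are made up for illustration, not any real model's weights.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, experts, gate_vectors, k=2):
    """Route one token to its top-k experts; only those experts run."""
    # Router scores: dot product of the token with each expert's gate vector.
    scores = [sum(t * g for t, g in zip(token, gate)) for gate in gate_vectors]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    # Output is the gate-weighted sum of just the selected experts' outputs.
    out = [0.0] * len(token)
    for w, i in zip(weights, top):
        for d, y in enumerate(experts[i](token)):
            out[d] += w * y
    return out, top
```

With three experts and k=2, only two expert functions ever execute per token; the idle parameters are where MoE's compute savings come from.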
Reasoning model · an LLM trained to spend extra compute at inference thinking before it answers, trading latency and cost for accuracy on hard tasks.
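One concrete way to spend extra inference compute is self-consistency: sample several reasoning chains and majority-vote the final answer. The solver below is a hypothetical noisy stub standing in for a model call; the voting logic is the point.

```python
import random
from collections import Counter

def sample_answer(question, rng):
    # Stand-in for one sampled reasoning chain from a model (hypothetical stub).
    return rng.choice(["4", "4", "4", "5"])  # right 75% of the time

def self_consistent_answer(question, n=16, seed=0):
    """Trade latency for accuracy: n samples, majority vote on the final answer."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(n))
    return votes.most_common(1)[0][0]
```

Sixteen samples cost roughly 16x the latency of one, which is exactly the trade the definition describes.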
Retrieval-Augmented Generation (RAG) · grounds model responses in external data retrieved at query time.
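A minimal RAG loop retrieves relevant documents and prepends them to the prompt. The word-overlap retriever below is a toy stand-in for a real embedding index.

```python
def retrieve(query, corpus, k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    qwords = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(qwords & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(query, corpus):
    """Ground the answer in retrieved context instead of parametric memory."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```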
Model Context Protocol (MCP) · an open standard for connecting AI models to external tools, data, and services.
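MCP messages ride on JSON-RPC 2.0; a tool invocation looks roughly like the request below. The tool name and arguments are made up for illustration.

```python
import json

# Sketch of an MCP-style JSON-RPC 2.0 request asking a server to run a tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",  # MCP method for invoking a server-exposed tool
    "params": {
        "name": "search_docs",                 # hypothetical tool name
        "arguments": {"query": "quantization"},
    },
}
wire = json.dumps(request)  # what crosses the transport (stdio, HTTP, ...)
```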
Subagent · a delegated sub-task agent spawned by a parent orchestrator · each runs in an isolated context with specialized tools.
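The delegation pattern, sketched: the orchestrator builds a fresh context per sub-task and never shares it between children. `model_call` is a placeholder for whatever LLM client is in use, and the tool routing is a toy.

```python
def run_subagent(task, tools, model_call):
    """Each subagent starts from an isolated context holding only its own brief."""
    context = [{"role": "system",
                "content": f"You handle one task: {task}. Tools available: {tools}"}]
    return model_call(context)

def orchestrate(tasks, model_call):
    """Parent agent delegates sub-tasks; child contexts never mix."""
    results = []
    for task in tasks:
        tools = ["web_search"] if "research" in task else []  # toy tool routing
        results.append(run_subagent(task, tools, model_call))
    return results
```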
Sliding window attention · each token attends only to a local window of nearby tokens · used in Mistral and Gemma for long-context efficiency.
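The attention mask makes the idea concrete: row i marks which positions token i may attend to, here a causal window of fixed size.

```python
def sliding_window_mask(seq_len, window):
    """1 where token i may attend to token j: causal, within the last `window` positions."""
    return [[1 if 0 <= i - j < window else 0 for j in range(seq_len)]
            for i in range(seq_len)]
```

Attention cost per token then scales with the window size rather than the full sequence length.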
Attention with Linear Biases (ALiBi) · position encoding via attention-score penalties instead of positional embeddings.
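A sketch of the bias matrix one head adds to its attention scores, using the ALiBi paper's geometric slope schedule for power-of-two head counts:

```python
def alibi_bias(seq_len, slope):
    """Penalty added to attention scores: -slope * distance, causal positions only."""
    return [[-slope * (i - j) if j <= i else 0.0 for j in range(seq_len)]
            for i in range(seq_len)]

def head_slopes(num_heads):
    """Slope schedule from the ALiBi paper: 2^(-8h/num_heads) per head."""
    return [2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
```

Because the penalty grows linearly with distance, no learned positional embedding is needed, which is what lets ALiBi extrapolate to longer sequences.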
State-space sequence model (SSM) · linear-time alternative to transformers · foundation of Mamba-2, Jamba, and Zamba.
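The core of an SSM is a linear recurrence scanned once over the sequence, so cost grows linearly with length. A scalar-state sketch with toy coefficients:

```python
def ssm_scan(inputs, a, b, c):
    """x[t] = a*x[t-1] + b*u[t];  y[t] = c*x[t]  -- one left-to-right pass."""
    x, outputs = 0.0, []
    for u in inputs:
        x = a * x + b * u      # state update (state matrix collapsed to a scalar)
        outputs.append(c * x)  # readout
    return outputs
```

Real models use vector states and input-dependent a, b, c (selective SSMs), but the linear-time scan has the same shape.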
Hyena · implicit-convolution alternative to attention · linear-time long-range modeling · explored by Stanford and Together.
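"Implicit convolution" means the long filter is generated by a small function of position rather than stored tap-by-tap. A toy exponential-decay generator (the decay form is hypothetical) plus a causal convolution:

```python
import math

def implicit_filter(length, decay=0.3):
    """Filter taps computed from position -- the generator holds the parameters, not the taps."""
    return [math.exp(-decay * t) for t in range(length)]

def causal_conv(u, h):
    """y[t] = sum_k h[k] * u[t-k]: long-range token mixing without attention."""
    return [sum(h[k] * u[t - k] for k in range(min(t + 1, len(h))))
            for t in range(len(u))]
```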
KV cache compression · techniques to shrink the key-value cache during inference · sliding windows, quantization, eviction, sparsity.
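One of the listed techniques, quantizing cached keys/values to int8, sketched on a single row with a symmetric per-row scale (illustrative only):

```python
def quantize_int8(row):
    """Symmetric int8 quantization: keep one float scale + small integer codes."""
    amax = max(abs(v) for v in row)
    scale = amax / 127 if amax else 1.0
    return [round(v / scale) for v in row], scale

def dequantize(codes, scale):
    """Approximate reconstruction used when the cached row is read back."""
    return [c * scale for c in codes]
```

This cuts KV-cache memory roughly 4x versus fp32 (2x versus fp16) at the cost of bounded rounding error.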
Fine-tuning · further training a pre-trained model on domain data to adapt its behavior without retraining from scratch.
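The mechanic, stripped to a single weight: start from a "pretrained" value and keep running gradient descent, but only on domain pairs (toy 1-D linear model, squared loss):

```python
def fine_tune(w, domain_data, lr=0.1, epochs=50):
    """Continue training pretrained weight w of y = w * x on new (x, y) pairs."""
    for _ in range(epochs):
        for x, y in domain_data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w
```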
Chain-of-thought (CoT) · a prompting technique (and trained behavior) where the model shows step-by-step reasoning before the final answer.
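On the prompting side, zero-shot CoT is just an instruction appended to the question; the wording below is one common variant, not canonical.

```python
def cot_prompt(question):
    """Ask the model to show its reasoning before committing to an answer."""
    return (f"Q: {question}\n"
            "A: Let's think step by step. "
            "End with a line starting with 'Answer:'.")
```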