In our previous deep dive, we explored the "magic" of collaborative editing. But today, the most frequent "collaborator" in a document isn't a human; it's an AI agent.
Serving a standard microservice is relatively straightforward: you scale CPU and RAM. Serving an LLM is a different beast entirely. It is a world where Memory Bandwidth is the primary bottleneck, where a single request can consume gigabytes of VRAM, and where speed is measured in Tokens Per Second (TPS) rather than requests per second.
Today, we deconstruct the architecture behind scalable LLM inference: from KV Caching to PagedAttention and the infrastructure that makes real-time AI agents possible.
1. The Bottleneck: Compute-Bound vs. Memory-Bound
To understand LLM architecture, you must understand why GPUs are necessary. A model like Llama-3 70B has 70 billion parameters. To generate a single token, the GPU must:
Load 70B parameters from VRAM into the compute cores.
Perform billions of matrix multiplications.
Write the result back to VRAM.
In the Prefill Phase (processing your prompt), the GPU is Compute-Bound (limited by TFLOPS). But in the Decoding Phase (generating tokens one by one), the GPU is almost always Memory-Bound. The time spent moving parameters from memory to the processor is far greater than the time spent actually calculating the next token.
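A back-of-the-envelope calculation makes the memory-bound claim concrete. The numbers below are illustrative assumptions (FP16 weights, roughly A100/H100-class bandwidth and compute), not measurements:

```python
# Why decoding is memory-bound: compare the time to *load* the weights
# against the time to *compute* with them, per generated token.
# Assumed figures (illustrative): 70B params in FP16, ~2 TB/s HBM
# bandwidth, ~300 TFLOPS of usable FP16 compute.
PARAMS = 70e9
BYTES_PER_PARAM = 2           # FP16
BANDWIDTH = 2e12              # bytes/s
COMPUTE = 300e12              # FLOP/s

# Each decoded token touches every weight once (~2 FLOPs per weight).
load_time = PARAMS * BYTES_PER_PARAM / BANDWIDTH   # seconds per token
compute_time = 2 * PARAMS / COMPUTE                # seconds per token

print(f"memory:  {load_time * 1000:.1f} ms/token")    # ~70 ms
print(f"compute: {compute_time * 1000:.2f} ms/token") # ~0.47 ms
```

Moving the weights takes two orders of magnitude longer than multiplying by them, which is exactly what "Memory-Bound" means.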
2. The KV Cache: Trading Memory for Speed
LLMs are "autoregressive." To predict token N, the model needs the context of tokens 1 through N-1. Without optimization, the model would re-calculate the mathematical "attention" for the entire prefix every time it generates a new token, making each generation step O(N^2) instead of O(N).
To solve this, we use the KV (Key-Value) Cache. We store the intermediate tensors for every token so we only have to calculate the new token’s values.
The Memory Tax
The KV cache grows linearly with the context length and the number of concurrent users. For a 70B model with a 32k context window, the KV cache can consume tens of gigabytes. If the cache runs out of memory, the system crashes (OOM).
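You can estimate that tax directly from the model's shape. A minimal sizing sketch, using Llama-3-70B-like dimensions (80 layers, 8 KV heads under grouped-query attention, head dimension 128) as assumed inputs:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """Size of the KV cache: Keys + Values (the leading 2x), one entry
    per layer, per KV head, per token, at `bytes_per` bytes (2 = FP16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per

# Llama-3-70B-like shape, 32k context, a single user:
gb = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                    seq_len=32_768, batch=1) / 1e9
print(f"{gb:.1f} GB per sequence")  # ~10.7 GB
```

Roughly 10 GB per full-length sequence: a handful of concurrent 32k-context users is enough to reach the "tens of gigabytes" that cause OOM crashes.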
3. PagedAttention: The vLLM Breakthrough
In early inference systems, VRAM for the KV cache was allocated contiguously. If you requested a 1024-token limit, the system reserved that memory upfront, even if you only generated 10 tokens. This led to 60-80% memory waste due to internal fragmentation.
Inspired by Virtual Memory in operating systems, the vLLM project introduced PagedAttention.
The Concept: Divide the KV cache into non-contiguous "blocks" (pages).
The Benefit: Memory is only allocated as tokens are generated. This allows the system to increase the "Batch Size" (number of concurrent requests) by 2x to 4x on the same hardware.
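The mechanics can be sketched in a few lines. This is a toy allocator, not vLLM's implementation; the 16-token block size is an assumption chosen for illustration:

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative granularity)

class BlockAllocator:
    """Toy paged allocator: hands out fixed-size KV blocks on demand."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the last one fills up,
        # instead of reserving the whole context window upfront.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(10):
    seq.append_token()
print(len(seq.block_table))  # 1 block for 10 tokens, not a 1024-token reservation
```

A request that stops after 10 tokens holds one block; under contiguous allocation it would have pinned the full reservation.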
4. Continuous Batching: Maximizing Throughput
Traditional "Static Batching" waits for 10 requests to arrive, processes them together, and returns them together. If Request A finishes in 50 tokens but Request B takes 500, the GPU sits idle waiting for B to finish.
Modern inference engines (Triton, vLLM, TGI) use Continuous Batching (or Iteration-level Scheduling). As soon as one request in a batch finishes, a new request is "inserted" into the pipeline mid-flight. This ensures the GPU cores are always saturated, significantly increasing total system throughput.
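A toy simulation shows why this matters. Requests are modeled as (id, tokens_needed) pairs; each loop iteration is one decode step across the whole batch:

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Toy iteration-level scheduler: after every decode step, finished
    sequences leave the batch and queued requests join immediately."""
    waiting = deque(requests)       # (request_id, tokens_still_needed)
    running = {}
    steps = 0
    while waiting or running:
        # Admit new requests as soon as slots free up (mid-flight).
        while waiting and len(running) < max_batch:
            rid, need = waiting.popleft()
            running[rid] = need
        # One iteration generates one token for every running request.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]    # finished: slot is reused next step
        steps += 1
    return steps

# A needs 50 tokens, B needs 500, C (100) waits in the queue; batch of 2.
# C slips into A's slot at step 51 instead of waiting for B.
print(continuous_batching([("A", 50), ("B", 500), ("C", 100)], max_batch=2))
# → 500 steps; static batching would take 500 + 100 = 600
```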
5. Scaling Out: Model Parallelism
A 70B model in 16-bit precision requires ~140GB of VRAM. A single A100 GPU only has 80GB. To serve large models, we must split them across multiple GPUs.
Tensor Parallelism (Intra-layer)
Splits the large matrices of a single layer across multiple GPUs. Each GPU calculates a part of the matrix multiplication, and they "All-Reduce" the results. This is extremely fast but requires high-bandwidth interconnects like NVLink.
Pipeline Parallelism (Inter-layer)
Different layers reside on different GPUs. GPU 1 handles layers 1-20, GPU 2 handles 21-40, and so on. This is easier to scale across nodes but introduces "bubbles" (idle time) as data moves between cards.
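The tensor-parallel math is easy to demonstrate with NumPy standing in for two GPUs. The matrix is split by rows, each "device" multiplies its shard against the matching slice of the activations, and the addition plays the role of the All-Reduce:

```python
import numpy as np

# Toy tensor parallelism: shard one projection matrix across two "GPUs".
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))       # activations (batch=1, hidden=512)
W = rng.standard_normal((512, 2048))    # one large weight matrix

W0, W1 = np.split(W, 2, axis=0)         # each device holds half the rows
x0, x1 = np.split(x, 2, axis=1)         # ...and the matching input slice

partial0 = x0 @ W0                      # computed on "GPU 0"
partial1 = x1 @ W1                      # computed on "GPU 1"
y = partial0 + partial1                 # the sum is the All-Reduce step

assert np.allclose(y, x @ W)            # identical to the single-GPU result
```

The result is exact, but the All-Reduce happens on every layer, which is why slow interconnects kill tensor parallelism while NVLink makes it practical.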
6. Quantization: Shrinking the Weights
To fit models on cheaper hardware, we use Quantization. We convert the model's weights from 16-bit floats (FP16) to 8-bit (INT8) or even 4-bit (AWQ/GPTQ).
4-bit Quantization: Reduces the memory footprint by 4x relative to FP16, typically with a small (often <1% on common benchmarks) loss in accuracy.
Impact: This allows a 70B model to run on two consumer-grade 3090/4090 GPUs instead of a $30,000 enterprise cluster.
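The core round-trip is simple to show. This is a minimal symmetric per-tensor INT8 sketch; production schemes like GPTQ and AWQ quantize per-group and calibrate far more carefully:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization (a minimal sketch)."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()

print(q.nbytes, w.nbytes)  # 1024 vs 4096 bytes: 4x smaller than FP32
```

The worst-case rounding error is about half a quantization step (scale / 2), which is why accuracy degrades so little despite the aggressive compression.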
7. Designing for Agents: Tool Use and Long Context
AI agents don't just "chat"; they use tools (API calls). This introduces a new architecture challenge: The Tool-Use Loop.
Model outputs a thought: "I need to check the weather."
System intercepts: Calls a Weather API.
Observation is fed back: "It is 70°F."
Model continues.
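The loop above can be sketched in a few lines. Everything here is a hypothetical stand-in: `call_model` fakes the LLM's responses and `get_weather` fakes the API, purely to show the control flow:

```python
# Toy tool-use loop. `call_model` and `get_weather` are hypothetical
# stand-ins for a real model endpoint and a real weather API.
def call_model(messages):
    # Pretend the model asks for a tool on the first turn, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather", "args": {"city": "SF"}}}
    return {"content": "It's 70°F in SF right now."}

TOOLS = {"get_weather": lambda city: "It is 70°F."}

def agent_loop(user_prompt, max_turns=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        reply = call_model(messages)
        if "tool_call" not in reply:
            return reply["content"]                        # model is done
        call = reply["tool_call"]
        observation = TOOLS[call["name"]](**call["args"])  # system intercepts
        messages.append({"role": "tool", "content": observation})  # fed back

print(agent_loop("What's the weather in SF?"))
```

Note that every iteration of this loop is another full inference request carrying the entire conversation so far, which is what makes the next point about caching so important.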
For a scalable agent architecture, the Context Manager must be extremely efficient. We use Prompt Caching to store the system instructions and tool definitions in VRAM so they aren't re-processed with every turn of the conversation.
Summary: The LLM Serving Stack
Building a production-grade inference engine requires a layered approach:
Hardware: H100/A100 clusters with NVLink.
Kernel Level: FlashAttention-2 for fast, memory-efficient attention.
Memory Management: PagedAttention to prevent VRAM waste.
Scheduling: Continuous Batching for high throughput.
Optimization: Quantization and Speculative Decoding (using a smaller "draft" model to predict tokens).
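Speculative decoding deserves a quick sketch of its own. Both "models" below are random stand-ins (the real acceptance rule compares the draft and target probability distributions), but the control flow is the real idea: propose cheaply, verify in one pass, keep the longest accepted prefix:

```python
import random

random.seed(0)

def draft_propose(prefix, k=4):
    """Cheap 'draft' model: guesses k tokens ahead (random stand-in)."""
    return [random.choice("ab") for _ in range(k)]

def target_accepts(prefix, token):
    """Stand-in for the target model's accept/reject check;
    here the big model simply 'prefers' the letter 'a'."""
    return token == "a"

def speculative_step(prefix):
    proposed = draft_propose(prefix)
    accepted = []
    for tok in proposed:
        if target_accepts(prefix + accepted, tok):
            accepted.append(tok)
        else:
            break
    # On rejection the target model supplies one corrected token,
    # so every step yields at least one token — and up to k+1.
    if len(accepted) < len(proposed):
        accepted.append("a")
    return accepted

print(speculative_step([]))
```

Because the target model verifies the whole draft in a single forward pass, each expensive pass can now emit several tokens instead of one.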
References & Further Reading
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - The seminal blog post that changed how we think about GPU memory.
NVIDIA Technical Blog: Optimizing LLM Queries - A deep dive into quantization, batching, and KV cache.
Anyscale: The Economics of LLM Serving - A business-centric view on how continuous batching reduces costs.
FlashAttention: Fast and Memory-Efficient Exact Attention - The paper behind the core algorithm that made 100k+ context windows possible.
Hugging Face: Methods and tools for efficient LLM inference - A practical guide to implementing quantization (bitsandbytes, GPTQ).