In our previous deep dive, we explored the "magic" of collaborative editing. But today, the most frequent "collaborator" in a document isn't a human; it's an AI agent.
Serving a standard microservice is relatively straightforward: you scale CPU and RAM. Serving an LLM is a different beast entirely. It is a world where Memory Bandwidth is the primary bottleneck, where a single request can consume gigabytes of VRAM, and where speed is measured in Tokens Per Second (TPS) rather than requests per second.
Today, we deconstruct the architecture behind scalable LLM inference: from KV Caching to PagedAttention and the infrastructure that makes real-time AI agents possible.
1. The Bottleneck: Compute-Bound vs. Memory-Bound
To understand LLM architecture, you must understand why GPUs are necessary. A model like Llama-3 70B has 70 billion parameters. To generate a single token, the GPU must:
Load 70B parameters from VRAM into the compute cores.
Perform billions of matrix multiplications.
Write the result back to VRAM.
In the Prefill Phase (processing your prompt), the GPU is Compute-Bound (limited by TFLOPS). But in the Decoding Phase (generating tokens one by one), the GPU is almost always Memory-Bound. The time spent moving parameters from memory to the processor is far greater than the time spent actually calculating the next token.
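A back-of-the-envelope calculation makes the memory-bound claim concrete. The numbers below are illustrative assumptions (FP16 weights, roughly A100/H100-class bandwidth and compute), not measurements:

```python
# Why decoding is memory-bound: compare the time to *load* the weights
# against the time to *compute* with them, per generated token.
# Assumed figures (illustrative): 70B params in FP16, ~2 TB/s HBM
# bandwidth, ~300 TFLOPS of usable FP16 compute.
PARAMS = 70e9
BYTES_PER_PARAM = 2           # FP16
BANDWIDTH = 2e12              # bytes/s
COMPUTE = 300e12              # FLOP/s

# Each decoded token touches every weight once (~2 FLOPs per weight).
load_time = PARAMS * BYTES_PER_PARAM / BANDWIDTH   # seconds per token
compute_time = 2 * PARAMS / COMPUTE                # seconds per token

print(f"memory:  {load_time * 1000:.1f} ms/token")    # ~70 ms
print(f"compute: {compute_time * 1000:.2f} ms/token") # ~0.47 ms
```

Moving the weights takes two orders of magnitude longer than multiplying by them, which is exactly what "Memory-Bound" means.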
2. The KV Cache: Trading Memory for Speed
LLMs are "autoregressive." To predict token N, the model needs the context of tokens 1 through N-1. Without optimization, the model would re-calculate the mathematical "attention" for the entire prefix every time it generates a new token, making each generation step O(N^2) instead of O(N).
To solve this, we use the KV (Key-Value) Cache. We store the intermediate tensors for every token so we only have to calculate the new token’s values.
The Memory Tax
The KV cache grows linearly with the context length and the number of concurrent users. For a 70B model with a 32k context window, the KV cache can consume tens of gigabytes. If the cache runs out of memory, the system crashes (OOM).
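You can estimate that tax directly from the model's shape. A minimal sizing sketch, using Llama-3-70B-like dimensions (80 layers, 8 KV heads under grouped-query attention, head dimension 128) as assumed inputs:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """Size of the KV cache: Keys + Values (the leading 2x), one entry
    per layer, per KV head, per token, at `bytes_per` bytes (2 = FP16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per

# Llama-3-70B-like shape, 32k context, a single user:
gb = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                    seq_len=32_768, batch=1) / 1e9
print(f"{gb:.1f} GB per sequence")  # ~10.7 GB
```

Roughly 10 GB per full-length sequence: a handful of concurrent 32k-context users is enough to reach the "tens of gigabytes" that cause OOM crashes.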
3. PagedAttention: The vLLM Breakthrough
In early inference systems, VRAM for the KV cache was allocated contiguously. If you requested a 1024-token limit, the system reserved that memory upfront, even if you only generated 10 tokens. This led to 60-80% memory waste due to internal fragmentation.
Inspired by Virtual Memory in operating systems, the vLLM project introduced PagedAttention.
The Concept: Divide the KV cache into non-contiguous "blocks" (pages).
The Benefit: Memory is only allocated as tokens are generated. This allows the system to increase the "Batch Size" (number of concurrent requests) by 2x to 4x on the same hardware.
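The mechanics can be sketched in a few lines. This is a toy allocator, not vLLM's implementation; the 16-token block size is an assumption chosen for illustration:

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative granularity)

class BlockAllocator:
    """Toy paged allocator: hands out fixed-size KV blocks on demand."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the last one fills up,
        # instead of reserving the whole context window upfront.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(10):
    seq.append_token()
print(len(seq.block_table))  # 1 block for 10 tokens, not a 1024-token reservation
```

A request that stops after 10 tokens holds one block; under contiguous allocation it would have pinned the full reservation.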
4. Continuous Batching: Maximizing Throughput
Traditional "Static Batching" waits for 10 requests to arrive, processes them together, and returns them together. If Request A finishes in 50 tokens but Request B takes 500, the GPU sits idle waiting for B to finish.
Modern inference engines (Triton, vLLM, TGI) use Continuous Batching (or Iteration-level Scheduling). As soon as one request in a batch finishes, a new request is "inserted" into the pipeline mid-flight. This ensures the GPU cores are always saturated, significantly increasing total system throughput.
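A toy simulation shows why this matters. Requests are modeled as (id, tokens_needed) pairs; each loop iteration is one decode step across the whole batch:

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Toy iteration-level scheduler: after every decode step, finished
    sequences leave the batch and queued requests join immediately."""
    waiting = deque(requests)       # (request_id, tokens_still_needed)
    running = {}
    steps = 0
    while waiting or running:
        # Admit new requests as soon as slots free up (mid-flight).
        while waiting and len(running) < max_batch:
            rid, need = waiting.popleft()
            running[rid] = need
        # One iteration generates one token for every running request.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]    # finished: slot is reused next step
        steps += 1
    return steps

# A needs 50 tokens, B needs 500, C (100) waits in the queue; batch of 2.
# C slips into A's slot at step 51 instead of waiting for B.
print(continuous_batching([("A", 50), ("B", 500), ("C", 100)], max_batch=2))
# → 500 steps; static batching would take 500 + 100 = 600
```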
5. Scaling Out: Model Parallelism
A 70B model in 16-bit precision requires ~140GB of VRAM. A single A100 GPU only has 80GB. To serve large models, we must split them across multiple GPUs.
Tensor Parallelism (Intra-layer)
Splits the large matrices of a single layer across multiple GPUs. Each GPU calculates a part of the matrix multiplication, and they "All-Reduce" the results. This is extremely fast but requires high-bandwidth interconnects like NVLink.
Pipeline Parallelism (Inter-layer)
Different layers reside on different GPUs. GPU 1 handles layers 1-20, GPU 2 handles 21-40, and so on. This is easier to scale across nodes but introduces "bubbles" (idle time) as data moves between cards.
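The tensor-parallel math is easy to demonstrate with NumPy standing in for two GPUs. The matrix is split by rows, each "device" multiplies its shard against the matching slice of the activations, and the addition plays the role of the All-Reduce:

```python
import numpy as np

# Toy tensor parallelism: shard one projection matrix across two "GPUs".
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))       # activations (batch=1, hidden=512)
W = rng.standard_normal((512, 2048))    # one large weight matrix

W0, W1 = np.split(W, 2, axis=0)         # each device holds half the rows
x0, x1 = np.split(x, 2, axis=1)         # ...and the matching input slice

partial0 = x0 @ W0                      # computed on "GPU 0"
partial1 = x1 @ W1                      # computed on "GPU 1"
y = partial0 + partial1                 # the sum is the All-Reduce step

assert np.allclose(y, x @ W)            # identical to the single-GPU result
```

The result is exact, but the All-Reduce happens on every layer, which is why slow interconnects kill tensor parallelism while NVLink makes it practical.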
6. Quantization: Shrinking the Weights
To fit models on cheaper hardware, we use Quantization. We convert the model's weights from 16-bit floats (FP16) to 8-bit (INT8) or even 4-bit (AWQ/GPTQ).
4-bit Quantization: Reduces the memory footprint by 4x relative to FP16, typically with a small (often <1% on common benchmarks) loss in accuracy.
Impact: This allows a 70B model to run on two consumer-grade 3090/4090 GPUs instead of a $30,000 enterprise cluster.
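The core round-trip is simple to show. This is a minimal symmetric per-tensor INT8 sketch; production schemes like GPTQ and AWQ quantize per-group and calibrate far more carefully:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization (a minimal sketch)."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()

print(q.nbytes, w.nbytes)  # 1024 vs 4096 bytes: 4x smaller than FP32
```

The worst-case rounding error is about half a quantization step (scale / 2), which is why accuracy degrades so little despite the aggressive compression.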
7. Designing for Agents: Tool Use and Long Context
AI agents don't just "chat"; they use tools (API calls). This introduces a new architecture challenge: The Tool-Use Loop.
Model outputs a thought: "I need to check the weather."
System intercepts: Calls a Weather API.
Observation is fed back: "It is 70°F."
Model continues.
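The loop above can be sketched in a few lines. Everything here is a hypothetical stand-in: `call_model` fakes the LLM's responses and `get_weather` fakes the API, purely to show the control flow:

```python
# Toy tool-use loop. `call_model` and `get_weather` are hypothetical
# stand-ins for a real model endpoint and a real weather API.
def call_model(messages):
    # Pretend the model asks for a tool on the first turn, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather", "args": {"city": "SF"}}}
    return {"content": "It's 70°F in SF right now."}

TOOLS = {"get_weather": lambda city: "It is 70°F."}

def agent_loop(user_prompt, max_turns=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        reply = call_model(messages)
        if "tool_call" not in reply:
            return reply["content"]                        # model is done
        call = reply["tool_call"]
        observation = TOOLS[call["name"]](**call["args"])  # system intercepts
        messages.append({"role": "tool", "content": observation})  # fed back

print(agent_loop("What's the weather in SF?"))
```

Note that every iteration of this loop is another full inference request carrying the entire conversation so far, which is what makes the next point about caching so important.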
For a scalable agent architecture, the Context Manager must be extremely efficient. We use Prompt Caching to store the system instructions and tool definitions in VRAM so they aren't re-processed with every turn of the conversation.
Summary: The LLM Serving Stack
Building a production-grade inference engine requires a layered approach:
Hardware: H100/A100 clusters with NVLink.
Kernel Level: FlashAttention-2 for fast, memory-efficient attention.
Memory Management: PagedAttention to prevent VRAM waste.
Scheduling: Continuous Batching for high throughput.
Optimization: Quantization and Speculative Decoding (using a smaller "draft" model to predict tokens).
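Speculative decoding deserves a quick sketch of its own. Both "models" below are random stand-ins (the real acceptance rule compares the draft and target probability distributions), but the control flow is the real idea: propose cheaply, verify in one pass, keep the longest accepted prefix:

```python
import random

random.seed(0)

def draft_propose(prefix, k=4):
    """Cheap 'draft' model: guesses k tokens ahead (random stand-in)."""
    return [random.choice("ab") for _ in range(k)]

def target_accepts(prefix, token):
    """Stand-in for the target model's accept/reject check;
    here the big model simply 'prefers' the letter 'a'."""
    return token == "a"

def speculative_step(prefix):
    proposed = draft_propose(prefix)
    accepted = []
    for tok in proposed:
        if target_accepts(prefix + accepted, tok):
            accepted.append(tok)
        else:
            break
    # On rejection the target model supplies one corrected token,
    # so every step yields at least one token — and up to k+1.
    if len(accepted) < len(proposed):
        accepted.append("a")
    return accepted

print(speculative_step([]))
```

Because the target model verifies the whole draft in a single forward pass, each expensive pass can now emit several tokens instead of one.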
References & Further Reading
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - The seminal blog post that changed how we think about GPU memory.
NVIDIA Technical Blog: Optimizing LLM Queries - A deep dive into quantization, batching, and KV cache.
Anyscale: The Economics of LLM Serving - A business-centric view on how continuous batching reduces costs.
FlashAttention: Fast and Memory-Efficient Exact Attention - The paper behind the core algorithm that made 100k+ context windows possible.
Hugging Face: Methods and tools for efficient LLM inference - A practical guide to implementing quantization (bitsandbytes, GPTQ).