In the world of distributed systems, real-time communication is often considered the "final boss." Unlike streaming a movie on Netflix, where you can buffer minutes of content to hide network hiccups, a video call must feel instantaneous. To achieve a "human-to-human" feel, end-to-end latency must ideally stay below 150ms. Once it crosses 250ms, the conversation breaks down: users start talking over one another, and the sense of "presence" is lost.
Zoom’s rise to dominance wasn't just about a simple UI; it was about a fundamental architectural shift in how video data is routed and processed across the global internet.
1. The Architecture of Routing: SFU vs. MCU
To understand Zoom’s scalability, we must first look at the evolution of video conferencing infrastructure. Legacy systems (like early Cisco or Polycom) utilized a Multipoint Control Unit (MCU) architecture.
The Legacy Mixer (MCU)
In an MCU setup, the central server acts as a "video mixer." It receives individual streams from every participant, decodes them, stitches them into a single grid layout, re-encodes that grid into a single video stream, and sends it back to everyone.
The Latency Penalty: Decoding and re-encoding video is computationally expensive. It introduces massive "processing latency" (often 100ms+) at the server level alone.
The Quality Bottleneck: MCUs are inflexible. If one user has a 3G connection, the server often has to downgrade the quality of the entire mixed stream for everyone to accommodate the weakest link.
The Modern Router (SFU)
Zoom built its platform around a proprietary Selective Forwarding Unit (SFU) approach, which it calls the Multimedia Router (MMR). Instead of mixing video, the MMR acts like an intelligent, high-speed post office. It receives encrypted packets and simply "forwards" them to other participants without ever decoding the media.
By avoiding the decode/re-encode cycle, Zoom reduced server-side processing latency to near zero. The heavy lifting of arranging the "gallery view" grid is shifted to the user's device (the client), which uses its local GPU to decode the multiple incoming streams simultaneously.
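The forward-without-decoding idea can be sketched in a few lines. This is an illustrative toy, not Zoom's implementation: the class and method names are invented, and a real SFU would deal with RTP sessions, congestion control, and sockets rather than in-memory queues.

```python
# Minimal sketch of SFU-style forwarding: the server copies opaque
# (still-encrypted) packets to every other participant's outbound
# queue, with no decode/re-encode step anywhere on the path.

class Forwarder:
    def __init__(self):
        self.participants = {}  # participant_id -> outbound packet queue

    def join(self, pid):
        self.participants[pid] = []

    def on_packet(self, sender_id, packet):
        # The payload stays opaque to the server; it only copies bytes,
        # which is why server-side processing latency stays near zero.
        for pid, queue in self.participants.items():
            if pid != sender_id:
                queue.append(packet)

fwd = Forwarder()
for pid in ("alice", "bob", "carol"):
    fwd.join(pid)

fwd.on_packet("alice", b"\x80\x01opaque-rtp-payload")
print(len(fwd.participants["bob"]))    # 1 -- bob received the packet
print(len(fwd.participants["alice"]))  # 0 -- the sender gets no echo
```

Note what is absent: there is no codec anywhere in the loop. That absence, not any clever algorithm, is the source of the MCU-to-SFU latency win.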
2. Dealing with the "Dirty" Internet: SVC and Simulcast
The public internet is not a smooth highway; it’s a series of unpredictable, congested paths. One participant might be on a 1Gbps fiber line, while another is on spotty coffee shop Wi-Fi. Zoom handles this through Scalable Video Coding (SVC) and Simulcast.
The Layered Stream
Instead of sending one static 720p video stream, the Zoom client sends a "layered" stream. This package contains multiple resolutions and frame rates bundled together:
Base Layer: Low resolution (180p), low frame rate (10 fps).
Enhancement Layer 1: Medium resolution (360p), standard frame rate.
Enhancement Layer 2: High resolution (720p/1080p), high frame rate.
Intelligent Adaptive Forwarding
The MMR (server) monitors the "health" of every participant's connection in real-time. If it detects that Participant A’s bandwidth is dropping, it doesn't ask the sender to change anything. Instead, the MMR simply stops forwarding the "Enhancement Layer" packets to Participant A, sending only the "Base Layer."
Because the sender is always providing all layers, the MMR can resume sending high-definition packets the millisecond the receiver's network clears up. This provides a "hitless" transition: no "reconnecting" spinners or dropped calls.
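The per-receiver decision reduces to a small selection problem. The sketch below is illustrative only: the layer names mirror the list above, but the bitrates are assumptions, and a real router would base the choice on continuous bandwidth estimation rather than a single number.

```python
# Sketch of selective layer forwarding: the sender always uploads every
# SVC layer; the router picks, per receiver, the tallest stack of
# layers that fits that receiver's estimated bandwidth.

LAYERS = [
    ("base",      180, 150),    # name, resolution (p), assumed kbps
    ("enhance-1", 360, 600),
    ("enhance-2", 720, 1800),
]

def layers_to_forward(available_kbps):
    chosen, budget = [], available_kbps
    # Layers are cumulative: each enhancement layer depends on the
    # layers below it, so we stop at the first one that doesn't fit.
    for name, _res, kbps in LAYERS:
        if budget < kbps:
            break
        chosen.append(name)
        budget -= kbps
    return chosen

print(layers_to_forward(3000))  # ['base', 'enhance-1', 'enhance-2']
print(layers_to_forward(800))   # ['base', 'enhance-1']
print(layers_to_forward(200))   # ['base']
```

The key property is that this decision is per receiver and per instant: dropping Participant A to the base layer changes nothing for anyone else, and restoring the enhancement layers is just a forwarding decision, not a renegotiation with the sender.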
3. The Transport Layer: Why UDP Wins
Standard web traffic uses TCP (Transmission Control Protocol). TCP is "lossless," meaning if a packet goes missing, it stops the entire data flow and waits for a retransmission. In real-time video, a late packet is a useless packet. If a frame arrives 500ms late, the conversation has already moved on.
Custom UDP Implementation
Zoom uses a proprietary version of UDP (User Datagram Protocol). UDP is "fire and forget," making it much faster but inherently "unreliable." Zoom builds "Application-Layer Reliability" on top of UDP using several sophisticated techniques:
Forward Error Correction (FEC): The sender includes a small amount of redundant "parity" data. If 5-10% of the packets are lost, the receiving client can use the parity data to mathematically reconstruct the missing packets without asking the sender to retransmit them.
Dynamic Jitter Buffers: Zoom’s software constantly calculates the "jitter" (the variance in packet arrival times). It adjusts a tiny buffer in memory (usually 20-50ms) to hold packets just long enough to re-order them if they arrive out of sequence.
Audio Prioritization: Audio and video are sent on different sub-channels. When bandwidth is extremely tight, Zoom will completely sacrifice video frames to ensure the audio packets (which require much less data) get through. This is why you can often still hear people clearly even when their video is frozen.
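The FEC idea is worth seeing concretely. The sketch below uses the simplest possible scheme, a single XOR parity packet per group; production video FEC (e.g. Reed-Solomon or RFC 5109-style schemes) is more elaborate, but the core property is the same: one extra packet lets the receiver rebuild any one lost packet with no retransmission round trip.

```python
# Single-parity XOR FEC over a group of equal-length packets.

def xor_parity(packets):
    # Byte-wise XOR of all packets in the group.
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    # XOR-ing every packet that DID arrive with the parity packet
    # cancels them out, leaving exactly the one missing packet.
    return xor_parity(list(received.values()) + [parity])

group = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_parity(group)

# Packet 2 is lost in transit; the receiver rebuilds it locally.
arrived = {0: group[0], 1: group[1], 3: group[3]}
print(recover(arrived, parity))  # b'CCCC'
```

The bandwidth cost is one parity packet per group (25% overhead here, tunable by group size), traded against the hundreds of milliseconds a retransmission would cost, which is exactly the trade a real-time system wants.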
4. The Signaling Plane vs. The Data Plane
Zoom splits its architecture into two distinct "planes" to ensure reliability:
The Signaling Plane (HTTPS/WebSockets): This handles the "logic" of the meeting: who is in the room, who is muted, screen-sharing requests, and chat messages. It uses TCP/TLS because these commands must arrive, and must arrive in order.
The Data Plane (UDP/SRTP): This handles the actual audio and video bits. By separating these, a "glitch" in the chat window doesn't cause your video to freeze, and a spike in video data doesn't delay a "Raise Hand" notification.
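A toy model makes the separation tangible. The two channel classes below are stand-ins for TCP/TLS and UDP/SRTP, with loss simulated deterministically; everything about them is invented for illustration.

```python
# Plane separation in miniature: meeting "logic" rides an ordered,
# reliable channel, while media rides a lossy one that never retries.

class ReliableChannel:            # stands in for signaling over TCP/TLS
    def __init__(self):
        self.log = []
    def send(self, msg):
        self.log.append(msg)      # always delivered, always in order

class LossyChannel:               # stands in for media over UDP/SRTP
    def __init__(self, drop_every=3):
        self.n, self.drop_every, self.log = 0, drop_every, []
    def send(self, pkt):
        self.n += 1
        if self.n % self.drop_every != 0:   # simulate periodic loss
            self.log.append(pkt)            # dropped packets are gone

signaling, media = ReliableChannel(), LossyChannel()
signaling.send({"type": "mute", "user": "alice"})
for seq in range(10):
    media.send({"seq": seq, "payload": b"frame"})

print(len(signaling.log))  # 1 -- the control message always arrives
print(len(media.log))      # 7 -- 3 of 10 frames lost, and not retried
```

Because the two logs never interact, loss on the media path cannot stall a "Raise Hand" command, and a flood of chat messages cannot delay a video frame, which is the whole point of the split.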
5. Global Edge Infrastructure and "Meeting Zones"
Data cannot travel faster than the speed of light. A round trip from London to a server in San Francisco takes roughly 140ms even under perfect conditions, already pushing the limits of "instant" communication.
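That 140ms figure survives a back-of-the-envelope check. Light in optical fiber travels at roughly two-thirds of c (about 200,000 km/s), and real fiber routes are longer than the great-circle path; the 1.5x detour factor below is an assumption for illustration.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    # Haversine distance on a sphere of Earth's mean radius (6371 km).
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

FIBER_KM_PER_MS = 200.0   # ~2/3 of c in glass, i.e. ~200,000 km/s
ROUTE_OVERHEAD = 1.5      # assumed detour factor vs. the great circle

london, san_francisco = (51.5, -0.13), (37.77, -122.42)
dist = great_circle_km(*london, *san_francisco)
rtt_ms = 2 * dist * ROUTE_OVERHEAD / FIBER_KM_PER_MS

print(round(dist))    # roughly 8,600 km great-circle
print(round(rtt_ms))  # roughly 130 ms, before any processing at all
```

With half of the ~150ms human-perception budget spent on pure propagation, there is essentially nothing left for buffering or processing, which is why routing users to a nearby MMR is not an optimization but a necessity.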
Zoom solves the physics problem through Geographic Distribution:
Meeting Zones: Zoom operates dozens of co-located data centers globally.
The On-Ramp: When you click "Join," a Global Cloud Controller looks at your IP address and directs you to the MMR physically closest to you.
The Backbone: If you are in London and your colleague is in Tokyo, you both connect to your local MMRs. Those MMRs then communicate with each other over Zoom's private fiber backbone, bypassing the chaotic public internet exchanges where most packet loss occurs.
6. Scaling with Hybrid Cloud "Bursting"
At the start of 2020, Zoom had roughly 10 million daily meeting participants. Within months, that number spiked to 300 million. No physical data center can rack servers that fast. Zoom’s secret is its Hybrid Cloud Architecture.
The Bare Metal Core: For "steady-state" traffic, Zoom uses its own optimized hardware in co-location facilities. This allows for maximum performance and lower costs.
The Cloud Burst: Zoom is integrated with AWS and Oracle Cloud. Using automated infrastructure-as-code, Zoom can spin up thousands of "Virtual MMRs" in public cloud regions within minutes.
Dynamic Balancing: As the sun rises over the US East Coast and millions of people log on at 9:00 AM, the system "bursts" into the cloud. As the workday ends, those instances are terminated, and traffic is migrated back to the bare-metal servers.
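The bursting decision itself is simple arithmetic layered on top of good telemetry. The capacities below are invented numbers purely for illustration; Zoom does not publish per-MMR capacity figures.

```python
# Toy model of the burst decision: steady-state load runs on owned
# bare-metal capacity, and only the overflow is served by short-lived
# virtual MMRs in the public cloud.

BARE_METAL_CAPACITY = 100_000   # concurrent participants (assumed)
CLOUD_MMR_CAPACITY = 2_000      # participants per virtual MMR (assumed)

def cloud_mmrs_needed(current_load):
    overflow = max(0, current_load - BARE_METAL_CAPACITY)
    # Ceiling division: a partially full MMR still has to exist.
    return -(-overflow // CLOUD_MMR_CAPACITY)

print(cloud_mmrs_needed(80_000))   # 0  -- bare metal absorbs it
print(cloud_mmrs_needed(150_000))  # 25 -- burst into the public cloud
print(cloud_mmrs_needed(101_000))  # 1  -- even a small overflow bursts
```

The economics follow directly: the owned hardware is sized for the predictable daily baseline (cheap per participant), while the cloud absorbs only the expensive-but-transient peaks, then scales back to zero.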
7. Security: AES-256-GCM and E2EE
Scalability is meaningless without security. Zoom's data plane uses AES-256-GCM encryption.
In Transit: All media is encrypted at the client before it ever hits the wire.
End-to-End Encryption (E2EE): For sensitive meetings, Zoom offers a mode where the encryption keys are generated solely by the participants' devices. In this mode, even Zoom's MMR servers (the "routers") cannot see the video data they are forwarding, because they do not possess the keys to decrypt the packets.
Conclusion: The Architecture of Presence
Zoom’s success isn't due to a single "magic" algorithm. It is the result of a holistic architectural philosophy that prioritizes latency over perfection. By shifting the computational load to the edge (the user's device), using a layered approach to video coding, and leveraging a global network of "routers" instead of "mixers," Zoom created a system that simulates human presence at a global scale.
As we move toward 4K video and augmented reality meetings, this SFU-based, cloud-bursting model remains the blueprint for the future of the real-time web.