It is Monday morning. You are the Lead Database Engineer at "ZimGram," the photo-sharing app that just went viral across Southern Africa.
For the last two years, you have been sleeping soundly because you followed the standard advice: "Just use Postgres." You have a single, massive primary database on AWS, an r7g.24xlarge instance with 768GB of RAM. It handles everything: users, photos, likes, and comments.
But today, the alarms are ringing.
The disk usage is at 92%. The IOPS (Input/Output Operations Per Second) are capped out. Your query latency has jumped from 20ms to 500ms.
You try to vertically scale (Scale Up), but you realize you are already running on the largest instance type available in the af-south-1 region. There is no bigger computer to buy.
You have hit the Vertical Limit.
If you don't find a way to split this data across multiple servers by Friday, the app goes down. You need to stop scaling up and start scaling out. You need Database Sharding.
1. The Ceiling of Vertical Scaling
In the early days of a startup, the answer to "My database is slow" is always "Buy more RAM." This is Vertical Scaling. It is simple, requires no code changes, and works great... until it doesn't.
Eventually, you hit the laws of physics and hardware limitations.
The RAM/Index Bottleneck
Databases are fast because they use Indexes (B-Trees) to find data. Ideally, these indexes live entirely in RAM (Random Access Memory). When you have 1 billion rows, your indexes might grow to 1TB in size. If your server only has 768GB of RAM, the database has to swap parts of the index to and from disk. Disk access is roughly 100,000 times slower than RAM access. This "Thrashing" causes performance to fall off a cliff.
Connection Limits & Context Switching
A single Postgres server acts like a traffic intersection. It can only handle a certain number of active cars (connections) at once. Even with connection pooling (like PgBouncer), if 50,000 users try to run queries simultaneously, the CPU spends more time switching between tasks (Context Switching) than actually executing queries.
The Write Bottleneck
You can add Read Replicas to help with reading data, but in a standard architecture, all writes (INSERTs, UPDATEs, DELETEs) must go to the single Primary node. If you have 10 million users liking photos simultaneously, a single disk (even an NVMe SSD) physically cannot write the data fast enough.
2. The Solution: Sharding (Horizontal Partitioning)
Sharding is the process of splitting a large dataset across multiple database instances (shards) so that each instance contains only a subset of the data.
Instead of one database holding 1 billion photos, you split the universe.
Shard 01: Holds Photos 1 to 100 million.
Shard 02: Holds Photos 101 to 200 million.
...and so on.
Now, your write throughput is multiplied by the number of shards. If you have 10 shards, you can handle 10 times the write traffic. Furthermore, if Shard 01 crashes, only 10% of your users (those on Shard 01) are affected. The other 90% can still use the app perfectly fine. This is massive for fault tolerance.
But Sharding introduces a massive engineering headache: How do you find data?
If User X wants to fetch Photo Y, which server is it on? You cannot ask every server ("Do you have Photo Y?") because that is inefficient; this is known as the Scatter-Gather problem. You need a deterministic map or routing strategy.
3. The Instagram Approach: Logical vs. Physical Shards
When Instagram faced this problem, they didn't just spin up 100 database servers immediately. That would be a maintenance nightmare. Instead, they adopted a brilliant hybrid approach: Logical Shards mapping to Physical Nodes.
The Setup
Instagram created thousands of "Logical Shards." A logical shard isn't a separate server; it is just a Postgres Schema (a namespace within a single database).
Schema_0001
Schema_0002
...
Schema_4096
In the beginning, all 4,096 schemas might live on just one physical server. The application code treats them as separate, but they share the same hardware.
The Physical Mapping
As the app grows, you add physical servers and move the schemas around.
Physical Server A: Holds Schemas 0001 to 2000.
Physical Server B: Holds Schemas 2001 to 4096.
Why is this genius?
It solves the "Rebalancing" problem.
Imagine Server A gets overwhelmed. If you sharded by Row ID directly, moving 50% of the rows to a new server is incredibly hard. You have to select rows, lock the table, copy them, and delete the old ones.
With Logical Shards, you simply pick up a directory. You take Schema_0001 through Schema_0100, run a standard backup (pg_dump), restore it on a new Server C, and update your configuration file.
"Schemas 1-100 are now on Server C (IP: 10.0.0.5)."
This allows you to scale out incrementally, adding servers only when you need them, without changing your application code.
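The configuration-file lookup described above can be sketched as a simple range map. This is a minimal illustration, not Instagram's actual code; the IP addresses and schema ranges are the placeholder values from the example above.

```python
# A logical-to-physical shard map, assuming 4,096 logical schemas.
# Each entry is (first_schema, last_schema, server_address).
SHARD_MAP = [
    (1, 100, "10.0.0.5"),     # Server C: schemas recently migrated here
    (101, 2000, "10.0.0.1"),  # Server A
    (2001, 4096, "10.0.0.2"), # Server B
]

def server_for_schema(schema_id: int) -> str:
    """Return the physical host that currently holds a logical schema."""
    for first, last, server in SHARD_MAP:
        if first <= schema_id <= last:
            return server
    raise ValueError(f"No server configured for schema {schema_id}")

print(server_for_schema(42))    # -> 10.0.0.5
print(server_for_schema(2500))  # -> 10.0.0.2
```

Rebalancing is now a data move plus a one-line change to `SHARD_MAP`; the application code never changes.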
4. The Hardest Problem: Generating IDs
In a monolithic database, you use AUTO_INCREMENT or SERIAL for primary keys.
Photo 1
Photo 2
Photo 3
In a sharded system, this breaks immediately.
If Shard 1 generates "Photo ID 1" and Shard 2 also generates "Photo ID 1", you have a Primary Key Collision. You cannot merge these databases later because the IDs conflict.
You need a Globally Unique ID that can be generated independently on each shard without talking to a central authority (which would be a bottleneck).
Why UUIDs Fail (The B-Tree Problem)
A common suggestion is: "Just use UUIDs!" (e.g., a0eebc99-9c0b...). They are globally unique.
However, UUIDs are random. Databases store data in B-Trees. For performance, B-Trees love sequential data (1, 2, 3, 4). When you insert sequential data, the database simply appends it to the end of the file.
When you insert random UUIDs, the database has to jump around the disk to find the "right" spot in the tree to insert the random number. This causes Page Splitting and Fragmentation. On a large table, inserting UUIDs can be 10x to 100x slower than inserting integers.
The Winning Approach: The "Snowflake" ID
Instagram (and Twitter/Discord) solved this by creating 64-bit integers that contain routing information inside them.
An Instagram ID is not just a number; it is a packed binary structure consisting of:
41 bits: Timestamp (Milliseconds since a custom epoch, e.g., Jan 1st, 2026).
13 bits: Shard ID (Which logical shard generated this? e.g., Shard 204).
10 bits: Auto-increment sequence (A local counter for concurrency within the shard).
Why this is elegant:
Time-Sortable: Because the ID starts with a timestamp, ORDER BY id is mathematically the same as ORDER BY created_at. You don't need a separate index for time, saving massive amounts of RAM.
Self-Routing: If I give you ID 123456789, you don't need to look up which database it lives on. You simply run a math operation (bit-shift) to extract the middle 13 bits, and you know instantly: "This photo lives on Shard 204."
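The bit layout above (41 bits of timestamp, 13 bits of shard ID, 10 bits of sequence) can be packed and unpacked with shifts and masks. This is a sketch of that layout only; the epoch value is the Jan 1st, 2026 example from the text, and the shard/sequence numbers are illustrative.

```python
# Assumed custom epoch: 2026-01-01T00:00:00Z, in milliseconds.
EPOCH_MS = 1_767_225_600_000

def make_id(shard_id: int, sequence: int, now_ms: int) -> int:
    """Pack a 64-bit ID: [41 bits time][13 bits shard][10 bits sequence]."""
    ms_since_epoch = now_ms - EPOCH_MS
    return (ms_since_epoch << 23) | ((shard_id & 0x1FFF) << 10) | (sequence % 1024)

def shard_for_id(photo_id: int) -> int:
    """Extract the 13-bit shard ID with a bit-shift and mask: self-routing."""
    return (photo_id >> 10) & 0x1FFF

photo_id = make_id(shard_id=204, sequence=7, now_ms=EPOCH_MS + 5000)
print(shard_for_id(photo_id))  # -> 204
```

Because the timestamp occupies the high bits, an ID generated later is always numerically larger, which is what makes ORDER BY id equivalent to ordering by creation time.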
5. Handling Relationships: The "User Affinity" Rule
How do you store "Likes"?
If User A (who lives on Shard 1) likes a Photo by User B (who lives on Shard 2), where does the "Like" row live?
If you scatter data randomly, fetching "All photos liked by User A" requires querying every single shard. This is the Cross-Shard Join nightmare.
To solve this, Instagram uses User Affinity (or Entity Groups).
The rule is simple: Data follows the User.
Everything related to User A (their profile, their media, their likes, their comments) is stored on the same shard. When User A posts a photo, the system forces that photo to be stored on User A's shard.
Pros: Rendering User A's profile requires a single, fast query to one specific shard. You never have to join data across the network.
Cons: The Celebrity Problem (or Hot Sharding). If User A is a massive celebrity like Justin Bieber or Davido, his shard becomes incredibly hot compared to others. He generates 100x more traffic than a normal user. To mitigate this, sophisticated systems might isolate celebrities onto their own dedicated hardware.
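The "data follows the user" rule amounts to one deterministic function from user ID to shard. A minimal sketch, assuming a simple modulo mapping onto the 4,096 logical shards from earlier (real systems may hash instead):

```python
NUM_LOGICAL_SHARDS = 4096  # matches the schema count used earlier

def shard_for_user(user_id: int) -> int:
    """Deterministically map a user to a logical shard."""
    return user_id % NUM_LOGICAL_SHARDS

def shard_for_like(liker_id: int, photo_owner_id: int) -> int:
    # Data follows the user performing the action: the Like row lives on
    # the liker's shard, so "all photos liked by User A" is a query
    # against one shard, never a cross-shard join.
    return shard_for_user(liker_id)

print(shard_for_user(5))          # User 5's home shard
print(shard_for_like(5, 9_999))   # same shard, regardless of photo owner
```

Note the trade-off baked into `shard_for_like`: answering "who liked Photo Y?" from the owner's side now requires either a duplicate row on the owner's shard or a scatter-gather query.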
6. The Cost of Sharding
Before you rush to shard "ZimGram," understand the trade-offs. Sharding is technically complex and expensive.
Loss of ACID Transactions
You cannot have a database transaction that updates Shard A and Shard B atomically. Standard databases do not support cross-server transactions efficiently. You must rely on Eventual Consistency or implement complex application-level patterns like Sagas to handle multi-shard updates.
The Reporting Nightmare
Running a query like "Count total users across the platform" becomes difficult. You can't just run SELECT COUNT(*) anymore. You have to run it on all 10 shards and sum the results in your application code. Analytics becomes a distinct engineering challenge, usually requiring a separate Data Warehouse (like Snowflake or BigQuery).
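The fan-out-and-sum pattern described above looks roughly like this. The per-shard function is a stand-in: in production it would run `SELECT COUNT(*) FROM users` against one shard's connection, and the hard-coded counts are fake illustrative data.

```python
from concurrent.futures import ThreadPoolExecutor

def count_users_on_shard(shard_id: int) -> int:
    # Stand-in for "SELECT COUNT(*) FROM users" on one shard.
    fake_counts = {1: 1_200_000, 2: 950_000, 3: 1_050_000}
    return fake_counts[shard_id]

def total_users(shard_ids: list[int]) -> int:
    # Query every shard in parallel, then aggregate in application code.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(count_users_on_shard, shard_ids))

print(total_users([1, 2, 3]))  # -> 3200000
```

Every shard you add makes this fan-out wider, which is why serious analytics usually moves to a separate warehouse instead.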
Operational Overhead
Backups, monitoring, and migrations become 10x harder. You are no longer managing one pet; you are managing a herd of cattle. If one shard falls behind on updates or runs out of disk space, it can cause partial outages where some users can log in and others cannot.
Conclusion
Sharding is the nuclear option of database scaling.
For 99% of startups, a single well-tuned Postgres instance with Read Replicas and proper caching (Redis) is enough. But when you reach the scale of Instagram or a national payment gateway, the "Infinite Ceiling" of horizontal scaling is the only way forward.
By moving from simple auto-increment IDs to Snowflake IDs, and by mapping logical schemas to physical nodes, you can build a system that swallows petabytes of data without blinking.
But remember: Don't shard until you absolutely have to. Complexity is the enemy of reliability.
References & Further Reading
Instagram Engineering Blog - "Sharding & IDs at Instagram" (The classic post that inspired this architecture).
High Scalability - "How Twitter generates unique IDs (Snowflake)."
PostgreSQL Wiki - Table Partitioning vs. Sharding strategies.