In our previous deep dive, we explored how TikTok engineers "telepathy" via real-time recommendation loops. But before you can recommend a video, you must be able to find it. In the world of unstructured data, traditional relational databases (RDBMS) fall off a cliff.
If you’ve ever tried to run a SELECT * FROM logs WHERE message LIKE '%error%' on a table with a billion rows, you know the pain. You aren't just looking for a needle in a haystack; you’re asking the database to examine every single piece of hay individually.
This is where Elasticsearch comes in. It doesn't just store data; it indexes the world. Today, we deconstruct the architecture that powers search for GitHub, Uber, and Slack.
1. The Core Innovation: The Inverted Index
The fundamental difference between a database and a search engine is how they "read."
An RDBMS is row-oriented. To find a word, it scans rows. Elasticsearch uses an Inverted Index. Imagine the index at the back of a massive textbook: instead of listing pages and what's on them, it lists words and which pages they appear on.
The Analysis Pipeline
When you index a document, Elasticsearch doesn't just "save" the text; it runs it through an Analyzer:
Character Filters: Strip HTML tags or convert & to and.
Tokenizer: Breaks the string into individual "tokens" (words).
Token Filters:
Lowercasing: So "Apple" and "apple" match.
Stemming: Reduces "running" and "runs" to the root "run."
Stopwords: Removes "the," "is," and "at" to save space.
The result is a highly optimized map: Token -> [Document IDs]. When you search for "run," Elasticsearch looks at one entry in the index and immediately has a list of every document containing that concept.
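The pipeline above can be sketched in a few lines of Python. This is a toy, not Elasticsearch's implementation: a regex tokenizer, a crude suffix-stripping stemmer (a real analyzer would use Porter or Snowball stemming), and a tiny hand-picked stopword list.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "is", "at", "a", "an", "and"}

def analyze(text):
    """Toy analyzer: character filter -> tokenizer -> token filters."""
    # Character filter: strip HTML tags, expand '&' into 'and'
    text = re.sub(r"<[^>]+>", " ", text).replace("&", " and ")
    # Tokenizer: split into alphabetic tokens
    tokens = re.findall(r"[A-Za-z]+", text)
    out = []
    for tok in tokens:
        tok = tok.lower()                     # lowercasing
        if tok in STOPWORDS:                  # stopword removal
            continue
        for suffix in ("ning", "ing", "s"):   # crude suffix stemmer, not Porter
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out

def build_inverted_index(docs):
    """The optimized map: token -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

docs = {1: "The runner is <b>running</b> fast", 2: "Apples & oranges"}
index = build_inverted_index(docs)
```

A search for "running" analyzes the query the same way, reducing it to "run", and then reads exactly one entry of the map.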
2. Lucene Segments: The "Write Once" Secret
Elasticsearch is built on top of Apache Lucene. A common misconception is that a Lucene index is a single file. In reality, it is composed of multiple Segments.
Segments are immutable. Once written to disk, they never change.
Why? Immutability allows for incredible caching. The OS can keep segment files in memory (the filesystem cache) without worrying about cache invalidation.
The Write Path: When data is indexed, it first goes into an in-memory buffer and a Translog (for durability). Periodically (every 1 second by default), the buffer is "refreshed" into a new segment. This is why Elasticsearch is called Near Real-Time (NRT): there is a slight delay before data becomes searchable.
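That visibility behavior can be modeled in a few lines. This is a sketch of the write path only; the real implementation lives inside Lucene and fsyncs the translog to disk, while here both are plain in-memory structures.

```python
class ToyIndexWriter:
    """Sketch of the NRT write path: buffer + translog -> refresh -> segment."""

    def __init__(self):
        self.buffer = {}     # in-memory indexing buffer (not yet searchable)
        self.translog = []   # append-only log for crash recovery
        self.segments = []   # immutable once created

    def index(self, doc_id, doc):
        self.translog.append((doc_id, doc))  # durability first
        self.buffer[doc_id] = doc

    def refresh(self):
        """In Elasticsearch this runs every ~1s; it turns the buffer
        into a new searchable (and immutable) segment."""
        if self.buffer:
            self.segments.append(dict(self.buffer))
            self.buffer = {}

    def search(self, doc_id):
        # Only refreshed segments are visible: the "near" in Near Real-Time.
        for seg in reversed(self.segments):
            if doc_id in seg:
                return seg[doc_id]
        return None
```

A document indexed between two refreshes exists in the translog (so it survives a crash) but is invisible to search until the next refresh fires.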
Segment Merging
If you index data for a week, you'll end up with thousands of tiny segments. Searching many files is slow. Lucene runs a background process called Merge that picks smaller segments and collapses them into larger ones, physically removing documents that were marked for deletion in the process.
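At the document level, a merge is just collapsing several immutable maps into one and dropping the tombstoned entries. The sketch below models only that effect; a real Lucene merge also rewrites the inverted index, norms, and doc values for the new segment.

```python
def merge_segments(segments, deleted_ids):
    """Collapse several immutable segments into one new segment,
    physically dropping documents that were marked as deleted."""
    merged = {}
    for seg in segments:  # oldest first, so later segments win on updates
        for doc_id, doc in seg.items():
            if doc_id not in deleted_ids:
                merged[doc_id] = doc
    return merged

segs = [{1: "a", 2: "b"}, {3: "c"}, {2: "b-updated"}]
big_segment = merge_segments(segs, deleted_ids={3})
```

This is also why deletes in Elasticsearch don't immediately reclaim disk space: the document stays in its segment, marked deleted, until a merge rewrites it away.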
3. Distributed Coordination: Sharding & Replicas
How do you search petabytes of data in milliseconds? You parallelize.
Elasticsearch breaks an Index (the logical container) into multiple Shards. Each shard is a fully functional, independent Lucene index.
Primary Shards: The "source of truth" for a slice of data.
Replica Shards: A copy of the primary. They provide high availability (if a node dies) and increased read throughput (search requests can hit replicas too).
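Which primary shard a document lands on is a simple modulo of a hash of its routing key (the _id by default). Elasticsearch uses murmur3 for this; the sketch below substitutes md5 to stay dependency-free. The formula is also why the primary shard count is fixed at index creation: changing the divisor would re-route every existing document.

```python
import hashlib

def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Deterministic routing: the same key always lands on the same shard.
    (Elasticsearch hashes the _routing value with murmur3; md5 here is
    just a dependency-free stand-in.)"""
    h = int(hashlib.md5(routing_key.encode("utf-8")).hexdigest(), 16)
    return h % num_primary_shards
```

Both indexing and GET-by-id requests run this same computation, so a single-document lookup touches exactly one shard.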
The "Scatter-Gather" Pattern
When a search query hits an Elasticsearch node (the Coordinating Node):
Scatter: The node broadcasts the query to a copy (primary or replica) of every shard in the index.
Local Search: Each shard executes the search locally, calculates relevance scores (usually using the BM25 algorithm), and returns a sorted list of the top N results.
Gather: The coordinating node collects the results from all shards, performs a global sort, and returns the final list to the user.
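The gather step is an n-way merge of per-shard result lists that are already sorted by score. A minimal sketch using the standard library:

```python
import heapq

def gather(shard_results, size):
    """Coordinating-node merge: each shard hands back its top hits already
    sorted by score (descending); merge them into one global top-`size` list."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return [hit for _, hit in zip(range(size), merged)]

shard_a = [(9.1, "doc-3"), (4.2, "doc-7")]   # (score, doc_id), sorted desc
shard_b = [(8.5, "doc-1"), (7.0, "doc-9")]
top2 = gather([shard_a, shard_b], size=2)
```

Note that each shard must return a full top-N list, not just its single best hit, because the global top N could in the worst case all live on one shard.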
4. Relevance Scoring: The BM25 Algorithm
In a search engine, "matching" is easy; "ranking" is hard. Elasticsearch defaults to Okapi BM25. It calculates scores based on three factors:
Term Frequency (TF): How often does the word appear in this document? (More is better).
Inverse Document Frequency (IDF): How rare is this word across the entire index? (Searching for "the" shouldn't rank as high as searching for "cryptography").
Field Length Norm: If a word appears in a 5-word title, it’s more relevant than if it appears in a 5,000-word essay.
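The per-term formula combining these three factors can be written out directly. This sketch follows the Lucene-style BM25 variant with the standard defaults k1 = 1.2 and b = 0.75, scoring a single term against a single document:

```python
import math

def bm25_score(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 for one term in one document.
    tf: term frequency in this document; df: number of docs containing the term."""
    # IDF: rare terms (small df) get a large boost
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # TF with saturation (k1) and field-length normalization (b):
    # long documents are penalized via doc_len / avg_doc_len
    norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

The k1 term is why the tenth occurrence of a word adds far less score than the first: tf saturates rather than growing linearly.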
5. Scaling Challenges: The "Deep Pagination" Problem
A common trap for system designers is Deep Pagination. If you ask for page 101 (results 1,001 to 1,010), the coordinating node must collect 1,010 results from every shard, merge them, and then throw away the first 1,000.
If you have 100 shards, you are effectively asking the system to sort 101,000 items just to show 10. This is why the search_after parameter or the Scroll API is used for massive data exports: they replace offsets with cursors (the sort values of the last hit, or a server-side scroll context), so memory usage stays flat no matter how deep you page.
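A sketch of the search_after pattern, with the client call abstracted away: search_fn(body) stands in for a real Elasticsearch client request, and the field names in the sort spec are illustrative (production code would typically add a tiebreaker like _shard_doc).

```python
def page_with_search_after(search_fn, size=10):
    """Cursor-style pagination: feed the sort values of the last hit back
    as `search_after`, so shards skip straight past earlier results
    instead of collecting and re-sorting them."""
    cursor = None
    while True:
        body = {"size": size, "sort": [{"timestamp": "asc"}, {"_id": "asc"}]}
        if cursor is not None:
            body["search_after"] = cursor  # resume after the last hit seen
        hits = search_fn(body)
        if not hits:
            return
        yield from hits
        cursor = hits[-1]["sort"]  # the sort values of the last hit = the cursor
```

Unlike a from/size offset, the cursor tells every shard exactly where to resume, so page 10,000 costs roughly the same as page 1.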
Summary: Designing for Search
Elasticsearch succeeds because it recognizes that search is a different beast than storage. By trading off immediate consistency (NRT) and accepting the overhead of sharding, it provides a level of query flexibility that SQL simply cannot match.
Inverted Index: Constant time lookups for terms.
Lucene Segments: Leveraging the OS filesystem cache through immutability.
Sharding: Scaling out through parallelization.
References & Further Reading
For those ready to dive into the configuration and internals, start with these:
Elasticsearch: The Definitive Guide - Though some parts are for older versions, the conceptual chapters on the Inverted Index and Relevance are timeless.
The Ultimate Guide to Scaling Elasticsearch - A modern look at hardware provisioning and shard density.
Serving Video at 800Gb/s (Netflix Tech Blog) - Relevant for understanding how the underlying infrastructure supports high-speed data retrieval.
Lucene Segment Merging (Advanced) - A deep dive into the background heavy-lifting that keeps indices performant.
BM25 Explained - A mathematical breakdown of how relevance is calculated in modern search.