Search Indexes and Inverted Indexes

Search engines need data structures that keep lookup times low while the number of indexed documents keeps growing. This article summarizes why indexes matter, how inverted indexes work, and why they remain the backbone of information retrieval systems.

1. Why a search index exists

An index is a data structure that reduces the cost of finding documents that match a query. Without it, a system would need to scan all records one by one, which quickly becomes infeasible at scale.

In practice, indexes trade storage and update cost for much faster reads. Instead of paying an O(N) scan per query, many operations move toward O(log N) or near-constant lookups depending on the structure.

Databases, search engines and analytics platforms all rely on this idea because the query workload is usually far larger than the cost of maintaining the index.

2. Inverted index

An inverted index maps each term to the documents where that term appears. This flips the direct document view into a term-first structure that is much better suited for full-text search.

Two common variants are:

  • Basic inverted file: stores the list of documents for every term.
  • Fully inverted index: also stores positions inside each document.

The positional version is more expensive to build and store, but it enables phrase queries and proximity search.

2.1 Example

Assume these three documents:

  • D0 : it is what it is
  • D1 : what is it
  • D2 : it is a banana

A basic inverted index would look like this:

  • a: {2}
  • banana: {2}
  • is: {0, 1, 2}
  • it: {0, 1, 2}
  • what: {0, 1}

Querying for what, is and it means intersecting posting lists: {0,1}\left\{{0,1 }\right\} \cap {0,1,2}\left\{{0,1,2 } \right\} \cap{0,1,2} \left\{{0,1,2 } \right\}=={0,1}\left\{{0,1}\right\}

If we also store token positions, phrase queries become possible. For example, both D0 and D1 contain the three terms, but only D1 contains them consecutively in the order needed for the phrase what is it.

2.2 Implementation notes

Search systems usually start from a direct index built during parsing and then transform it into the inverted representation. That inversion step is what turns a sequential document scan into random access by term identifier.

At web scale, the inverted index is split into shards and replicated across machines. This parallel design allows queries to stay fast even when the raw collection occupies many terabytes.

Replication also provides resilience: if one shard replica fails, the load balancer can route queries to another copy while keeping the service online.