Autocomplete / Typeahead — Designed in Stages
You don’t need to design for scale on day one.
This guide is a staged design playbook: it tells you what to build at MVP, Growth, and Advanced scale so you don’t over- or under-build.
Define what you need—suggest completions from a prefix (e.g. “new” → “new york”, “newton”), and update the index when the list of terms changes—then build the simplest thing that works and evolve as list size and query load grow.
When to use this: when you’re designing or evolving an autocomplete or typeahead system, from scratch or from an existing MVP, and you have a list of suggestable terms that needs low-latency prefix suggestions. Skip or adapt if you only need full-text search (see the Search example) or don’t need prefix-based completion.
Unlike designing for max scale on day one, this adds complexity only when triggers appear (e.g. large list of terms, high QPS, ranking needs). Unlike ad-hoc growth with no structure, you get a clear sequence: MVP → Growth → Advanced. If you over-build (e.g. distributed index and fuzzy match before you need them), you pay in ops and tuning. If you under-invest in triggers (e.g. no cache when QPS grows), you hit latency and load. The stages tie additions to triggers so you avoid both.
Here we use autocomplete or typeahead (search box suggestions, product name completion) as the running example: prefix, candidate list, and optional ranking by popularity or recency. The same staged thinking applies to any system that suggests completions as the user types: low latency (e.g. < 100 ms), relevance (prefix match, possibly fuzzy), and ranking (popularity, recency) are central. You can reference the Search example for full-text search and indexing patterns.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Suggest — given a prefix (string user has typed), return a list of completions (candidates that start with or match the prefix); limit to top-k (e.g. 5–10); optionally filter by context (category, locale).
- Update index — when the list of suggestable terms changes (new products, new queries, trending terms), update the index so new suggestions appear; batch or stream; define freshness.
Quality Requirements
- Low latency — suggest response should be very fast (e.g. p95 < 100 ms) so typing feels responsive; avoid heavy computation or remote calls on the hot path; cache when possible.
- Relevance — candidates should match the prefix (prefix match at minimum); optionally fuzzy (typo tolerance); optional semantic or synonym expansion later.
- Ranking — among matching candidates, order by usefulness: e.g. popularity (frequency), recency (recent searches or trends), or business rules (promoted terms); ranking drives perceived quality.
- Expected scale — list size (number of suggestable terms), query QPS, update frequency.
Key Entities
- Prefix — the string the user has typed so far; input to suggest; may be normalized (lowercase, trim).
- Candidate list — the set of completions that match the prefix; each candidate is a string (e.g. “New York”) and optional metadata (id, category, count); stored or derived from index.
- Ranking / Count (optional) — per candidate, a score for ranking: e.g. search frequency (count), last-seen timestamp, or static boost; used to sort candidates before returning top-k.
Primary Use Cases and Access Patterns
- Suggest — read path; input = prefix (and optional context); lookup or traverse index by prefix; get matching candidates; rank; return top-k; dominant access pattern; must be very fast.
- Update index — write path; list of terms changes (new terms, updated counts); rebuild or incrementally update index; can be async (batch job or stream); suggest reads from current index.
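The two access patterns above imply a small request/response contract. A minimal sketch in Python (field names like `prefix`, `limit`, and `context` are illustrative assumptions, not a fixed spec):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SuggestRequest:
    prefix: str                    # what the user has typed; normalize server-side
    limit: int = 10                # cap on returned candidates (top-k)
    context: Optional[str] = None  # optional filter, e.g. category or locale

@dataclass
class SuggestResponse:
    candidates: List[str] = field(default_factory=list)  # ordered, best first
```

Note that the update-index path is deliberately absent from the read contract: writes go through a separate batch or stream pipeline, not through the suggest API.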
Given this, start with the simplest MVP: one API, prefix tree or sorted structure in API server RAM, list of terms loaded at startup or from DB, suggest = traverse prefix tree or binary search—then add a dedicated search index (e.g. prefix suggester), cache per prefix, ranking (frequency, recency), and an index update pipeline (batch or stream) as the list of terms and QPS grow.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship working autocomplete: user types a prefix, API returns a short list of completions. One API, prefix tree or sorted structure in API server RAM; list of terms loaded at startup or from DB; suggest = traverse prefix tree or binary search; single server.
Components
- API — REST or similar; single endpoint: GET or POST suggest (prefix, optional limit, optional context); return list of candidate strings (and optional id/metadata). Auth optional (e.g. API key for rate limit). Single server so index is local.
- Prefix tree or sorted structure (in API server RAM) — Prefix tree: each node = character, path from root = prefix, leaves or nodes hold list of full terms (completions); suggest(prefix) = traverse to node for prefix, collect all terms in subtree, return up to k. Or sorted structure: store all terms sorted; suggest(prefix) = binary search for first term ≥ prefix, scan until term no longer starts with prefix, return up to k. Prefix tree is O(prefix length) for lookup; sorted + binary search is O(log n + k).
- List of terms — load at startup from DB or file (e.g. product names, city names, recent queries); build prefix tree or sort list; no ranking at MVP or simple static order (e.g. alphabetical); reload on deploy to pick up changes to the list, or optional admin “reload” endpoint (e.g. daily reload or on deploy).
- Single server — index in process memory; no distributed cache or search service yet; vertical scaling for more QPS.
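The sorted-structure variant of the MVP can be sketched in a few lines of Python (a minimal, single-server sketch; the term list and normalization rules are illustrative):

```python
import bisect

class SortedSuggester:
    """MVP suggest: sorted list + binary search, O(log n + k) per query."""

    def __init__(self, terms):
        # Normalize once at load time: trim and lowercase for case-insensitive match.
        self.terms = sorted({t.strip().lower() for t in terms})

    def suggest(self, prefix, k=10):
        p = prefix.strip().lower()
        # Binary search for the first term >= prefix, then scan while terms match.
        i = bisect.bisect_left(self.terms, p)
        out = []
        while i < len(self.terms) and len(out) < k and self.terms[i].startswith(p):
            out.append(self.terms[i])
            i += 1
        return out
```

For example, `SortedSuggester(["New York", "Newton", "Boston"]).suggest("new")` returns `["new york", "newton"]`. A prefix tree gives O(prefix length) lookup instead, at the cost of more code and memory; at MVP scale either is fine.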
Minimal Diagram
Stage 1: one path in (prefix), prefix tree in API server RAM, one path out (top-k completions).
+-----------------+
|   DB or file    |
| (list of terms) |
+-----------------+
         |
         | load at startup
         v
+---------------------+
|     Prefix tree     |  <-- Client → API (suggest) reads here
| (in API server RAM) |
+---------------------+
         |
         v
  Return top-k completions

Patterns and Concerns (don’t overbuild)
- Prefix normalization: lowercase and trim prefix (and list of terms) for case-insensitive match; document normalization rules.
- Limit: always cap returned candidates (e.g. 10); avoid large payloads and long response time.
- Basic monitoring: suggest latency (p50, p95), cache/index size, error rate.
Still Avoid (common over-engineering here)
- Dedicated search index or distributed prefix tree before you have list size or QPS that justify it.
- Cache per prefix, ranking (frequency/recency), and index update pipeline before you have hot prefixes or ranking requirements.
Why This Is a Correct MVP
- One API, prefix tree or sorted structure in API server RAM, list of terms at startup, suggest in a few ms → enough to ship autocomplete for a small-to-medium list of terms; easy to reason about.
- Vertical scaling and periodic reload buy you time before you need a dedicated index, cache, and ranking pipeline.
Stage 2 — Growth Phase (dedicated index, cache, ranking)
You have a working MVP (one API, prefix tree or sorted structure in API server RAM, list of terms at startup). Now one or more of the triggers below are true.
What Triggers the Growth Phase?
- Need dedicated index: when the list of terms is large or updates frequently and in-process reload is slow or doesn’t scale → add a dedicated search index (e.g. Elasticsearch completion suggester) and an update pipeline; index updated by pipeline, not in API server.
- Need cache for hot prefixes: when query QPS grows and the same prefix is requested often → cache suggestions per prefix (e.g. TTL 1–5 min) to reduce load on index and improve p99 latency.
- Need ranking: when order by popularity (search/click count) or recency matters → store and use counts/timestamps per candidate; index or pipeline must support ranking; pipeline updates counts from search/click logs.
Components to Add (incrementally)
- Dedicated search index — use a search engine with prefix/completion support (e.g. Elasticsearch completion suggester, which uses FST under the hood); or maintain a prefix tree in a dedicated service and replicate; index supports prefix lookup and optional metadata (for ranking). Index updated by pipeline, not in API process.
  - FST or prefix tree; prefix lookup + optional weight/count for ranking; pipeline owns updates.
- Cache per prefix — cache (prefix → list of completions) with short TTL (e.g. 1–5 min); hot prefixes hit cache; reduce load on index and improve p99 latency; invalidate or TTL when the list of terms updates.
  - Key by normalized prefix (and optional context); TTL 1–5 min; invalidate when list of terms updates.
- Ranking (frequency, recency) — store per candidate: count (how often selected or searched) and/or last_used timestamp; at suggest time, sort candidates by score (e.g. count desc, then recency); or use index’s built-in scoring (e.g. Elasticsearch suggest with weight). Pipeline updates counts from search/click logs.
  - Score = count and/or recency; pipeline ingests search/click logs and updates counts.
- Index update pipeline — batch: periodic job reads list of terms (and optionally logs for counts), builds or updates index, deploys; or stream: events (new term, click) go to queue, worker updates index or count store; suggest reads from updated index; define freshness (e.g. hourly batch vs near-real-time).
  - Batch (e.g. hourly) or stream; document “suggestions updated within X” for consistency expectations.
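The cache and ranking additions can be sketched together (an illustrative in-process sketch; in production the cache would typically be a shared store such as Redis, and ranking would often use the index’s built-in weights):

```python
import time

class PrefixCache:
    """Cache (normalized prefix -> completions) with a short TTL."""

    def __init__(self, ttl_seconds=120):
        self.ttl = ttl_seconds
        self.store = {}  # prefix -> (expires_at, completions)

    def get(self, prefix):
        entry = self.store.get(prefix)
        if entry and entry[0] > time.monotonic():
            return entry[1]          # hot prefix: serve without touching the index
        return None                  # miss or expired: caller queries the index

    def put(self, prefix, completions):
        self.store[prefix] = (time.monotonic() + self.ttl, completions)

def rank(candidates, counts, k=10):
    """Order matching candidates by popularity (count desc), then alphabetically."""
    return sorted(candidates, key=lambda t: (-counts.get(t, 0), t))[:k]
```

On a miss, the API queries the index, ranks the candidates, and writes the result back to the cache; the counts dict is what the update pipeline maintains from search/click logs.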
Growth Diagram
Stage 2: we add dedicated index, cache per prefix, ranking, and index update pipeline.
Client
   |
   v
+-----------------+
|       API       |
+-----------------+
   |
   v
Cache (prefix → completions, TTL e.g. 1–5 min)
   | on miss
   v
Dedicated index (prefix suggester / prefix tree)
  - candidates + optional weight/count
   |
   v
Rank by frequency / recency → top-k

Index update pipeline (batch or stream)
  - list of terms + logs → build/update index

Patterns and Concerns to Introduce (practical scaling)
- Cache key: key by normalized prefix (and optional context); short TTL to balance freshness and hit rate; consider cache size limit (evict least recently used).
- Pipeline consistency: index may be eventually consistent with the list of terms; document “suggestions updated within X” if needed.
- Monitoring: suggest latency (cache hit vs miss), index size, update pipeline lag, ranking quality (optional A/B).
Still Avoid (common over-engineering here)
- Distributed prefix tree and fuzzy match before product and scale justify them.
- Personalization and context-aware ranking before you have clear signals and ROI.
- Real-time stream updates to the index before the freshness requirement is strict.
Stage 3 — Advanced Scale (distributed, personalization, fuzzy)
You have a dedicated index, cache, and ranking pipeline. Now you need distributed scale, personalization, or fuzzy match.
What Triggers Advanced Scale?
- Need distributed scale: when the list of terms is very large or high QPS and single index or region doesn’t scale → replicate index across nodes for read scaling, or shard (e.g. by first character of prefix) and merge results; load balancer in front.
- Need personalization or context: when suggest order should differ per user (e.g. recent user searches first) or per context (e.g. category, location) → include user_id or context in request; ranking merges global score + user/context signal; may require separate store for user state.
- Need fuzzy match: when you must tolerate typos (e.g. “nwe york” → “New York”) → use edit distance or fuzzy index (e.g. Elasticsearch fuzzy suggester); tune to avoid noisy suggestions; adds complexity and cost.
Components to Add (incrementally)
- Distributed index — replicate index across nodes for read scaling; or shard (e.g. by first character of prefix) and merge results; load balancer in front; index build produces sharded or replicated data.
  - Replicas for read scale, or shard by prefix range and merge top-k; LB in front of API.
- Personalization or context-aware ranking — include user_id or context (category, locale) in request; ranking layer merges: global score (popularity) + user signal (recent searches, clicks) or context boost; may require separate store for user state and merge at suggest time.
  - Cache user state at edge or keep minimal; avoid blocking suggest on slow user store.
- Fuzzy match — configure suggester for typo tolerance (e.g. Levenshtein distance 1–2); or maintain fuzzy index (variants); trade recall and relevance for tolerance; tune to avoid noisy suggestions.
  - e.g. Levenshtein 1–2; tune to avoid noisy suggestions; consider cost vs benefit.
- Scale for large list of terms and QPS — index compression (e.g. FST); efficient merge of sharded results; rate limiting and backpressure; observability (latency by prefix length, cache hit rate).
  - FST compression; rate limiting; observability for latency (e.g. p95 < 100 ms), cache hit rate.
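The fuzzy-match trigger can be illustrated with plain Levenshtein distance (a sketch only; in practice you would use the index’s fuzzy suggester and tune thresholds against real typo data):

```python
def edit_distance(a, b):
    """Classic Levenshtein DP: number of insert/delete/substitute edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete from a
                           cur[j - 1] + 1,           # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def fuzzy_filter(prefix, candidates, max_dist=2):
    """Keep candidates whose own prefix is within max_dist edits of the typed prefix."""
    p = prefix.lower()
    return [c for c in candidates if edit_distance(p, c.lower()[:len(p)]) <= max_dist]
```

For example, the typo “nwe” is distance 2 from “new” (two substitutions), so with max_dist=2 both “New York” and “Newton” survive the filter. Note that distance 2 already admits a lot of noise; this is why the stage warns against over-tuned fuzzy.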
Advanced Diagram (conceptual)
Stage 3: distributed or replicated index, optional personalization and fuzzy, high QPS path.
Clients (high QPS)
   |
   v
Load balancer
   |
   v
API (per region or shard)
   |
   v
Cache --- on miss ---> Dedicated index (replicated or sharded)
   |
   v
Ranking (global + optional personalization/context)
   |
   v
Optional: fuzzy expansion
   |
   v
Return top-k

Index pipeline (batch/stream, multi-region if needed)

Patterns and Concerns at This Stage
- Merge strategy: if sharded by prefix, each shard returns top-k; merge and re-rank to global top-k; or use single logical index with replicas for read scale.
- Personalization latency: fetching user state (recent searches) adds latency; cache user state at edge or keep it minimal; avoid blocking suggest on slow user store.
- SLO-driven ops: suggest latency (p50, p95, p99), cache hit rate, index freshness; error budgets and on-call.
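The merge strategy above, where each shard returns its local top-k and the results are re-ranked into a global top-k, can be sketched as follows (scores and terms are illustrative):

```python
import heapq

def merge_topk(shard_results, k=10):
    """Merge per-shard top-k lists of (score, term) into a global top-k, best first."""
    # Flatten all shard results; each shard contributed at most k items,
    # so this merges at most (num_shards * k) candidates, not the full index.
    all_items = (item for shard in shard_results for item in shard)
    best = heapq.nlargest(k, all_items, key=lambda item: item[0])
    return [term for _score, term in best]
```

Because each shard already truncated to its local top-k, the merge is cheap; the correctness caveat is that every shard must return at least k items (or all it has) for the global top-k to be exact.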
Still Avoid (common over-engineering here)
- Multi-region active-active autocomplete before you have latency or DR requirements.
- Heavy semantic or synonym expansion before you have proven relevance gaps.
- Over-tuned fuzzy (e.g. distance > 2) before you have typo data and relevance impact.
Summarizing the Evolution
MVP delivers autocomplete with one API, a prefix tree or sorted structure in API server RAM, list of terms loaded at startup, and suggest = traverse or binary search. That’s enough to ship typeahead for a small-to-medium list of terms.
As you grow, you add a dedicated search index (prefix suggester), cache per prefix for hot queries, ranking by frequency and recency, and an index update pipeline (batch or stream). You keep latency low and relevance acceptable.
At advanced scale, you add distributed index, optional personalization or context-aware ranking, and fuzzy match. You scale the list of terms and QPS without over-building on day one.
This approach gives you:
- Start Simple — API + prefix tree or sorted list in API server RAM, suggest = prefix lookup; ship and learn.
- Scale Intentionally — add dedicated index and cache when list of terms and QPS demand it; add ranking when quality demands it.
- Add Complexity Only When Required — avoid distributed index, personalization, and fuzzy until product and scale justify them; keep latency and relevance first.
Example: Product search typeahead
Stage 1: Single API, prefix tree of product names in API server RAM, list of terms loaded at startup (e.g. daily); suggest returns top 10 by prefix.
Stage 2: When QPS and list size grow, add Elasticsearch completion suggester, cache per prefix (TTL 1–5 min), ranking by click count; index update pipeline (e.g. hourly) from product catalog and click logs.
Stage 3: When you need global read scale or personalization, add replicated index and optional “recent searches for this user” in ranking; add fuzzy only if typo tolerance is required.
Limits and confidence
This approach fits prefix-based autocomplete with low-latency requirements; adjust if you need full-text search only (see Search example) or very large fuzzy expansion. Use it as a heuristic, not a spec.
What do I do next?
- Capture your requirements using the sections above (functional, quality, entities, access patterns).
- Map your current system to Stage 1, 2, or 3.
- If you’re in Growth or Advanced, pick one trigger that applies and add the corresponding components first.