LLM-Powered / Semantic Search — Designed in Stages
You don’t need to design for scale on day one.
Define what you need—index documents (compute embeddings, store), search by query (query → embedding → similarity over vectors)—then build the simplest thing that works and evolve as corpus size, relevance requirements, and cost constraints grow.
Here we use LLM-powered or semantic search (enterprise search, RAG, “find by meaning”) as the running example: documents, embeddings, an index, and queries. The same staged thinking applies to any system that matches by semantic similarity rather than by keyword alone: relevance, latency, freshness, and cost (embedding API and index) are central. See the Search example for keyword indexing and hybrid patterns.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Index documents — ingest documents (text, or chunks); compute embedding vector per document or chunk (via embedding model); store (doc_id, embedding, optional metadata) for similarity search.
- Search — user submits query (text); compute query embedding; find documents whose embedding is most similar to query embedding (e.g. cosine similarity, or approximate nearest neighbor); return ranked list; optional metadata filter.
- Update index — when documents change, re-embed and update index; batch or incremental; define freshness (how soon new/updated docs appear in search).
Quality Requirements
- Relevance — results should match query intent (semantic match); embedding model quality and chunking strategy drive relevance; optional reranking for top-k.
- Latency — search latency (query to results) should be acceptable (e.g. p95 < 500 ms–2 s); embedding call + vector search + optional rerank; cache and index choice matter.
- Freshness — how soon after a document is added or updated it appears in search; batch reindex vs incremental; trade-off with cost.
- Cost — embedding API cost (per token or per call); index storage and compute; optimize model choice, caching, and index size.
- Expected scale — document count, query QPS, embedding dimension, update frequency.
Key Entities
- Document — unit to be searched; doc_id, raw content (text), optional metadata (source, date); may be chunked (multiple vectors per doc) for long docs.
- Embedding — vector representation of document or query; from embedding model (e.g. OpenAI, Cohere, or open-source); fixed dimension (e.g. 1536); used for similarity.
- Index — structure storing (doc_id, embedding, optional metadata); supports similarity search (exact scan or approximate nearest neighbor).
- Query — user search input (text); converted to query embedding; compared to document embeddings.
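The key entities above can be sketched as simple types. This is a minimal sketch only; the class and field names are illustrative assumptions, not a required schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical types for the key entities; names and fields are
# illustrative, not a fixed schema.

@dataclass
class Document:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)  # e.g. source, date

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str
    # Vector from the embedding model; fixed dimension (e.g. 1536).
    embedding: Optional[list] = None

@dataclass
class Query:
    text: str
    top_k: int = 10
    filter: dict = field(default_factory=dict)  # optional metadata filter
```

A long Document maps to several Chunk rows, each with its own embedding; the Query carries the knobs (top_k, filter) the search path needs.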
Primary Use Cases and Access Patterns
- Index — write path; receive document(s); optionally chunk; call embedding model per chunk/doc; store (doc_id, embedding, metadata) in index; batch or stream; idempotent by doc_id.
- Search — read path; receive query text; compute query embedding; similarity search (cosine or ANN); return top-k doc_ids (and metadata); optional filter by metadata before or after vector search.
- Update — write path; document updated or deleted; re-embed and upsert, or mark deleted; index reflects change after update run.
Given this, start with the simplest MVP: one API, one embedding model (internal or API), index as a store (doc_id, vector), search = query embedding + similarity over vectors (e.g. brute-force or simple vector DB)—then add vector index (HNSW, IVF), hybrid (keyword + vector), reranking, and batch/incremental index updates as corpus and quality demands grow.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship working semantic search: index documents (compute embedding, store), search by query (query → embedding → similarity search → results). One API, one embedding model, single store (vector DB or in-memory); small-to-medium corpus.
Components
- API — REST or similar; auth; index document(s) (POST doc or batch: text, optional doc_id, metadata); search (POST query text, top_k, optional filter); optional get doc by id. Single server or small cluster.
- Embedding model — call internal model (e.g. sentence-transformers on same host) or external API (e.g. OpenAI embeddings); input = text, output = vector; one call per doc/chunk and per query; rate limit and handle failures.
- Index — store (doc_id, embedding vector, optional metadata). Can be: (a) vector DB (Pinecone, Weaviate, pgvector, etc.), (b) in-memory (e.g. numpy array + doc_id list) for tiny corpus, or (c) relational table with vector column. No ANN yet if corpus is small (e.g. < 10k vectors); brute-force similarity (compute cosine to all) is acceptable.
- Search flow — query text → call embedding model → get query vector; query index: similarity (cosine or dot product) between query vector and each stored vector; sort by score; return top_k with doc_id and metadata.
- Indexing flow — receive doc; optional chunking (e.g. by paragraph or fixed size); for each chunk, call embedding model, store (doc_id/chunk_id, embedding). Idempotent by doc_id (overwrite on re-index).
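The MVP index and search flow can be sketched as a brute-force, in-memory store. This is a sketch under the MVP's assumptions (small corpus, exact cosine scan); in the real flow the vectors come from the embedding model call, which is omitted here:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors of the same dimension.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class BruteForceIndex:
    """In-memory (doc_id, vector, metadata) store with exact cosine search.

    Fine for a small corpus (e.g. < 10k vectors); swap in a vector DB or
    ANN index as the corpus grows.
    """

    def __init__(self):
        self._rows = {}

    def upsert(self, doc_id, vector, metadata=None):
        # Idempotent by doc_id: re-indexing a doc overwrites its vector.
        self._rows[doc_id] = (vector, metadata or {})

    def search(self, query_vector, top_k=5):
        # Exact scan: score every stored vector, sort, return top_k.
        scored = [
            (doc_id, cosine(query_vector, vec), meta)
            for doc_id, (vec, meta) in self._rows.items()
        ]
        scored.sort(key=lambda row: row[1], reverse=True)
        return scored[:top_k]
```

The O(n) scan per query is exactly the cost that later justifies an ANN index; until then, this is the simplest correct search.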
Minimal Diagram
Documents                      Query
    |                            |
    v                            v
+----------------------------------------+
|                  API                   |
+----------------------------------------+
    |                            |
    v                            v
Embedding model              Embedding model
    |                            |
    v                            v
Index (doc_id, vector)       Search: query vector → similarity → top_k
    |
    v
Store (vector DB or in-memory)

Patterns and Concerns (don’t overbuild)
- Chunking: long docs often chunked (e.g. 512 tokens) so embedding fits model limit and retrieval is at chunk level; aggregate or dedupe by doc when returning; keep chunk size and overlap consistent.
- Similarity: cosine similarity or dot product (if vectors normalized); same for query and index; document which you use.
- Basic monitoring: index size, search latency (embedding + search), embedding API errors, cost per query.
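The chunking pattern above can be sketched as a fixed-size splitter with overlap. This is a toy sketch that counts words; production chunkers typically measure model tokens (e.g. ~512) and may prefer paragraph boundaries:

```python
def chunk_text(text, chunk_size=128, overlap=16):
    """Split text into fixed-size word windows with overlap.

    chunk_size and overlap are in words here for simplicity; keep both
    consistent across the corpus so retrieval behaves uniformly.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each window starts `step` words later
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

The overlap means a sentence cut at a window boundary still appears whole in the next chunk, at the cost of some duplicated vectors.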
Why This Is a Correct MVP
- One API, embedding model, store (doc_id, vector), search = query embedding + similarity → enough to ship semantic search for a small corpus; easy to reason about.
- Vertical scaling and brute-force search buy you time before you need ANN index, hybrid, and reranking.
Stage 2 — Growth Phase (vector index, hybrid, reranking)
What Triggers the Growth Phase?
- Corpus grows; brute-force similarity scan is too slow; need approximate nearest neighbor (ANN) index (e.g. HNSW, IVF) for sub-second search over large vector sets.
- Keyword match still matters for some queries (exact names, codes); you need hybrid search (keyword + vector) and a way to combine scores (e.g. reciprocal rank fusion or weighted sum).
- Relevance: top-k from vector search can be improved with a second-stage reranker (cross-encoder or LLM); rerank top-20 to get top-5.
- Index updates: full reindex is expensive; need batch or incremental updates (add/update/delete vectors without full rebuild).
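The score-combination step mentioned above can be sketched with reciprocal rank fusion (RRF), which merges ranked lists by rank position rather than by comparing raw scores from different systems (k = 60 is the commonly used damping constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked doc_id lists into one fused ranking.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    k damps the weight of top ranks so no single list dominates.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive for hybrid search precisely because BM25 scores and cosine similarities live on incomparable scales; ranks sidestep that.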
Components to Add (incrementally)
- Vector index (ANN) — use index type that supports approximate nearest neighbor: HNSW (graph), IVF (inverted file), or similar; trade recall for speed; tune parameters (e.g. ef_search, nlist). Vector DB or library (e.g. Faiss, Milvus) provides this; index build from existing vectors and incremental add/delete.
- Hybrid (keyword + vector) — keyword search: use existing search index (e.g. Elasticsearch) or simple BM25/text match; vector search: as before; combine: run both, merge results by score (RRF or weighted combination); optional filter (e.g. date, source) before or after.
- Reranking — take top-k from vector (or hybrid) search (e.g. k=20–50); run reranker model (cross-encoder or LLM) on (query, candidate doc); re-sort by rerank score; return top-n (e.g. 5–10). Improves relevance at cost of latency and compute.
- Batch index updates — pipeline: new/updated docs → chunk → embed (batch API if available) → upsert to vector index; delete old vectors for updated doc; run on schedule or event; avoid blocking search during build.
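The batch update pipeline above can be sketched as delete-then-upsert per document, so an updated doc doesn't leave stale chunk vectors behind. A minimal sketch: `ChunkIndex` is a toy store, and `embed_batch` is a stand-in for a batch embeddings call:

```python
class ChunkIndex:
    """Toy vector store keyed by (doc_id, chunk_no), supporting the
    delete-then-upsert pattern used in batch reindex runs."""

    def __init__(self):
        self._vectors = {}

    def delete_doc(self, doc_id):
        # Drop every chunk vector belonging to this doc.
        for key in [k for k in self._vectors if k[0] == doc_id]:
            del self._vectors[key]

    def upsert(self, doc_id, chunk_no, vector):
        self._vectors[(doc_id, chunk_no)] = vector

def reindex_batch(index, docs, embed_batch):
    """docs maps doc_id -> list of chunk texts; embed_batch is a
    stand-in for a batched embedding call (texts -> vectors)."""
    for doc_id, chunks in docs.items():
        index.delete_doc(doc_id)       # remove stale chunks first
        vectors = embed_batch(chunks)  # one batched call per doc
        for chunk_no, vec in enumerate(vectors):
            index.upsert(doc_id, chunk_no, vec)
```

Deleting before upserting matters when an update shrinks a document: otherwise chunks beyond the new chunk count would survive and keep matching queries.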
Growth Diagram
Documents                    Query
    |                          |
    v                          v
+---------------------------------------+
|                  API                  |
+---------------------------------------+
    |                          |
    v                          v
Embedding (batch)            Embedding (query)
    |                          |
    v                          v
Vector index (ANN)           Vector search (ANN) + optional keyword
    |                          |
    v                          v
Metadata store               Reranker (top_k → top_n)
    |                          |
    v                          v
Incremental update           Return results

Patterns and Concerns to Introduce (practical scaling)
- Recall vs latency: ANN trades recall for speed; tune so recall@k is acceptable (e.g. 95%+) while keeping latency under SLO.
- Hybrid weights: tune weight or fusion method (vector vs keyword) per use case; A/B test if needed.
- Monitoring: index build time, search latency (embedding + ANN + rerank), reranker latency, embedding and rerank cost.
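The recall-vs-latency trade-off is usually tracked by measuring recall@k on a sample of queries: run each query through both the ANN index and an exact scan, then compare the result sets. A small sketch, assuming both inputs are ranked doc_id lists:

```python
def recall_at_k(ann_results, exact_results, k):
    """Fraction of the exact top-k that the ANN top-k also returned."""
    exact_top = set(exact_results[:k])
    ann_top = set(ann_results[:k])
    return len(ann_top & exact_top) / max(len(exact_top), 1)
```

Tuning then becomes a loop: raise the ANN's search effort (e.g. ef_search or probes) until recall@k clears the target, and check that latency still meets the SLO.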
Still Avoid (common over-engineering here)
- Distributed vector index and multi-region until corpus and QPS justify it.
- Multi-modal (image + text) and complex rerank pipelines until product requires them.
- Real-time incremental index (every doc change immediately) until freshness SLO demands it; batch hourly or daily often enough.
Stage 3 — Advanced Scale (distributed index, multi-modal, cost optimization)
What Triggers Advanced Scale?
- Very large corpus or high QPS; single index or single region doesn’t scale; need distributed or sharded vector index; optional multi-region for latency.
- Multi-modal: index both text and image (or other modality); unified embedding or separate indices with fusion.
- Cost and latency optimization: cache query embeddings for repeated/similar queries; use smaller or cheaper embedding model where acceptable; tiered index (hot vs cold).
Components (common advanced additions)
- Scale (distributed index) — shard vectors by doc_id range or by embedding (e.g. IVF partitions); query all shards and merge top-k, or route by metadata; replicate for read scaling; use managed vector DB that scales (e.g. Pinecone, Weaviate cluster).
- Multi-modal — index text and image (or other) embeddings; same or different models; query can be text or image; search in one or both indices and fuse results; optional unified embedding space (e.g. CLIP).
- Incremental indexing — stream or queue of doc changes; worker embeds and upserts/deletes; index reflects changes within minutes; consistency: eventual (search may miss very recent docs briefly).
- Cost and latency optimization — cache query embedding (and optionally results) for identical or similar queries; use smaller/faster embedding model for first stage and stronger model for rerank only; tier hot vs cold docs (e.g. recent in fast index, older in cheaper store).
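The query-embedding cache above can be sketched as an LRU keyed by normalized query text, so repeated queries skip the embedding call entirely. A sketch: `embed` stands in for the real embedding model call:

```python
from collections import OrderedDict

class QueryEmbeddingCache:
    """LRU cache of query text -> embedding vector.

    Saves an embedding API call (and its latency) for repeated queries;
    normalizing the key catches trivial variants (casing, whitespace).
    """

    def __init__(self, embed, max_size=10_000):
        self._embed = embed
        self._max_size = max_size
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query):
        key = " ".join(query.lower().split())  # normalize the cache key
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        vector = self._embed(query)
        self._cache[key] = vector
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict least recently used
        return vector
```

Caching results (not just embeddings) for identical queries is the next step, but it interacts with index freshness, so the embedding cache is the safer first win.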
Advanced Diagram (conceptual)
Documents (text + optional image)      Query (text or image)
        |                                      |
        v                                      v
Embedding (text + optional image)      Cache (query embedding)
        |                                      |
        v                                      v
Distributed / sharded vector index     ANN search (all shards)
        |                                      |
        v                                      v
Incremental upsert/delete              Merge + rerank
        |                                      |
        v                                      v
Multi-modal fusion (if needed)         Return results

Patterns and Concerns at This Stage
- Sharding strategy: by id range (simple) or by vector (e.g. IVF) for pruning; avoid hot shards.
- Consistency: eventual consistency for index updates; document “searchable after X minutes” if needed.
- SLO-driven ops: search latency (p50, p95), index freshness, embedding and rerank cost; error budgets and on-call.
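The query-all-shards-and-merge pattern can be sketched as scatter-gather: each shard returns its local top-k, and the coordinator merges them into a global top-k. A pure-Python sketch, assuming each shard result is a list of (score, doc_id) pairs:

```python
import heapq
from itertools import chain

def merge_shard_results(shard_results, k):
    """Merge per-shard top-k lists into a global top-k by score.

    Correct as long as every shard returns at least its own top-k,
    since the global top-k is a subset of the union of shard top-ks.
    """
    candidates = chain.from_iterable(shard_results)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])
```

Each shard doing a local top-k bounds the data crossing the network to shards × k rows, regardless of corpus size; the merge itself is cheap.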
Summarizing the Evolution
MVP delivers LLM-powered/semantic search with one API, one embedding model, a store of (doc_id, vector), and search = query embedding + similarity (brute-force or simple vector DB). That’s enough to ship semantic search for a small corpus.
As you grow, you add a vector index (ANN) for speed, hybrid (keyword + vector) for better coverage, reranking for relevance, and batch or incremental index updates. You keep relevance and latency in balance with cost.
At advanced scale, you add distributed/sharded index, optional multi-modal indexing, incremental indexing for freshness, and cost/latency optimization (caching, model choice, tiers). You scale corpus and QPS without over-building on day one.
This approach gives you:
- Start Simple — API + embedding model + vector store, search = embed query + similarity; ship and learn.
- Scale Intentionally — add ANN index when corpus size demands it; add hybrid and rerank when relevance demands it.
- Add Complexity Only When Required — avoid distributed index and multi-modal until scale and product justify them; keep relevance and cost under control first.