LLM-Powered / Semantic Search — Designed in Stages
You don’t need to design for scale on day one.
Define what you need—index documents (compute embeddings, store), search by query (query → embedding → similarity over vectors)—then build the simplest thing that works and evolve as corpus size, relevance requirements, and cost constraints grow.
Here we use LLM-powered or semantic search (enterprise search, RAG, “find by meaning”) as the running example: documents, embeddings, an index, and queries. The same staged thinking applies to any system that matches by semantic similarity rather than by keyword alone: relevance, latency, freshness, and cost (embedding API and index) are central. See the Search example for keyword indexing and hybrid patterns.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Index documents — ingest documents (text, or chunks); compute embedding vector per document or chunk (via embedding model); store (doc_id, embedding, optional metadata) for similarity search.
- Search — user submits query (text); compute query embedding; find documents whose embedding is most similar to query embedding (e.g. cosine similarity, or approximate nearest neighbor); return ranked list; optional metadata filter.
- Update index — when documents change, re-embed and update index; batch or incremental; define freshness (how soon new/updated docs appear in search).
Quality Requirements
- Relevance — results should match query intent (semantic match); embedding model quality and chunking strategy drive relevance; optional reranking for top-k.
- Latency — search latency (query to results) should be acceptable (e.g. p95 < 500 ms–2 s); embedding call + vector search + optional rerank; cache and index choice matter.
- Freshness — how soon after a document is added or updated it appears in search; batch reindex vs incremental; trade-off with cost.
- Cost — embedding API cost (per token or per call); index storage and compute; optimize model choice, caching, and index size.
- Expected scale — document count, query QPS, embedding dimension, update frequency.
Key Entities
- Document — unit to be searched; doc_id, raw content (text), optional metadata (source, date); may be chunked (multiple vectors per doc) for long docs.
- Embedding — vector representation of document or query; from embedding model (e.g. OpenAI, Cohere, or open-source); fixed dimension (e.g. 1536); used for similarity.
- Index — structure storing (doc_id, embedding, optional metadata); supports similarity search (exact scan or approximate nearest neighbor).
- Query — user search input (text); converted to query embedding; compared to document embeddings.
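The key entities above can be sketched as simple types. This is a minimal sketch only; the class and field names are illustrative assumptions, not a required schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical types for the key entities; names and fields are
# illustrative, not a fixed schema.

@dataclass
class Document:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)  # e.g. source, date

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str
    # Vector from the embedding model; fixed dimension (e.g. 1536).
    embedding: Optional[list] = None

@dataclass
class Query:
    text: str
    top_k: int = 10
    filter: dict = field(default_factory=dict)  # optional metadata filter
```

A long Document maps to several Chunk rows, each with its own embedding; the Query carries the knobs (top_k, filter) the search path needs.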
Primary Use Cases and Access Patterns
- Index — write path; receive document(s); optionally chunk; call embedding model per chunk/doc; store (doc_id, embedding, metadata) in index; batch or stream; idempotent by doc_id.
- Search — read path; receive query text; compute query embedding; similarity search (cosine or ANN); return top-k doc_ids (and metadata); optional filter by metadata before or after vector search.
- Update — write path; document updated or deleted; re-embed and upsert, or mark deleted; index reflects change after update run.
Given this, start with the simplest MVP: one API, one embedding model (internal or API), index as a store (doc_id, vector), search = query embedding + similarity over vectors (e.g. brute-force or simple vector DB)—then add vector index (HNSW, IVF), hybrid (keyword + vector), reranking, and batch/incremental index updates as corpus and quality demands grow.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship working semantic search: index documents (compute embedding, store), search by query (query → embedding → similarity search → results). One API, one embedding model, single store (vector DB or in-memory); small-to-medium corpus.
Components
- API — REST or similar; auth; index document(s) (POST doc or batch: text, optional doc_id, metadata); search (POST query text, top_k, optional filter); optional get doc by id. Single server or small cluster.
- Embedding model — call internal model (e.g. sentence-transformers on same host) or external API (e.g. OpenAI embeddings); input = text, output = vector; one call per doc/chunk and per query; rate limit and handle failures.
- Index — store (doc_id, embedding vector, optional metadata). Can be: (a) vector DB (Pinecone, Weaviate, pgvector, etc.), (b) in-memory (e.g. numpy array + doc_id list) for tiny corpus, or (c) relational table with vector column. No ANN yet if corpus is small (e.g. < 10k vectors); brute-force similarity (compute cosine to all) is acceptable.
- Search flow — query text → call embedding model → get query vector; query index: similarity (cosine or dot product) between query vector and each stored vector; sort by score; return top_k with doc_id and metadata.
- Indexing flow — receive doc; optional chunking (e.g. by paragraph or fixed size); for each chunk, call embedding model, store (doc_id/chunk_id, embedding). Idempotent by doc_id (overwrite on re-index).
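The MVP index and search flow can be sketched as a brute-force, in-memory store. This is a sketch under the MVP's assumptions (small corpus, exact cosine scan); in the real flow the vectors come from the embedding model call, which is omitted here:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors of the same dimension.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class BruteForceIndex:
    """In-memory (doc_id, vector, metadata) store with exact cosine search.

    Fine for a small corpus (e.g. < 10k vectors); swap in a vector DB or
    ANN index as the corpus grows.
    """

    def __init__(self):
        self._rows = {}

    def upsert(self, doc_id, vector, metadata=None):
        # Idempotent by doc_id: re-indexing a doc overwrites its vector.
        self._rows[doc_id] = (vector, metadata or {})

    def search(self, query_vector, top_k=5):
        # Exact scan: score every stored vector, sort, return top_k.
        scored = [
            (doc_id, cosine(query_vector, vec), meta)
            for doc_id, (vec, meta) in self._rows.items()
        ]
        scored.sort(key=lambda row: row[1], reverse=True)
        return scored[:top_k]
```

The O(n) scan per query is exactly the cost that later justifies an ANN index; until then, this is the simplest correct search.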
Minimal Diagram
Documents                      Query
    |                            |
    v                            v
+----------------------------------------+
|                  API                   |
+----------------------------------------+
    |                            |
    v                            v
Embedding model              Embedding model
    |                            |
    v                            v
Index (doc_id, vector)       Search: query vector → similarity → top_k
    |
    v
Store (vector DB or in-memory)

Patterns and Concerns (don’t overbuild)
- Chunking: long docs often chunked (e.g. 512 tokens) so embedding fits model limit and retrieval is at chunk level; aggregate or dedupe by doc when returning; keep chunk size and overlap consistent.
- Similarity: cosine similarity or dot product (if vectors normalized); same for query and index; document which you use.
- Basic monitoring: index size, search latency (embedding + search), embedding API errors, cost per query.
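The chunking pattern above can be sketched as a fixed-size splitter with overlap. This is a toy sketch that counts words; production chunkers typically measure model tokens (e.g. ~512) and may prefer paragraph boundaries:

```python
def chunk_text(text, chunk_size=128, overlap=16):
    """Split text into fixed-size word windows with overlap.

    chunk_size and overlap are in words here for simplicity; keep both
    consistent across the corpus so retrieval behaves uniformly.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each window starts `step` words later
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

The overlap means a sentence cut at a window boundary still appears whole in the next chunk, at the cost of some duplicated vectors.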
Why This Is a Correct MVP
- One API, embedding model, store (doc_id, vector), search = query embedding + similarity → enough to ship semantic search for a small corpus; easy to reason about.
- Vertical scaling and brute-force search buy you time before you need ANN index, hybrid, and reranking.
Stage 2 — Growth Phase (vector index, hybrid, reranking)
What Triggers the Growth Phase?
- Corpus grows; brute-force similarity scan is too slow; need approximate nearest neighbor (ANN) index (e.g. HNSW, IVF) for sub-second search over large vector sets.
- Keyword match still matters for some queries (exact names, codes); you need hybrid search (keyword + vector) and a way to combine scores (e.g. reciprocal rank fusion or weighted sum).
- Relevance: top-k from vector search can be improved with a second-stage reranker (cross-encoder or LLM); rerank top-20 to get top-5.
- Index updates: full reindex is expensive; need batch or incremental updates (add/update/delete vectors without full rebuild).
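The score-combination step mentioned above can be sketched with reciprocal rank fusion (RRF), which merges ranked lists by rank position rather than by comparing raw scores from different systems (k = 60 is the commonly used damping constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked doc_id lists into one fused ranking.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    k damps the weight of top ranks so no single list dominates.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive for hybrid search precisely because BM25 scores and cosine similarities live on incomparable scales; ranks sidestep that.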
Components to Add (incrementally)
- Vector index (ANN) — use index type that supports approximate nearest neighbor: HNSW (graph), IVF (inverted file), or similar; trade recall for speed; tune parameters (e.g. ef_search, nlist). Vector DB or library (e.g. Faiss, Milvus) provides this; index build from existing vectors and incremental add/delete.
- Hybrid (keyword + vector) — keyword search: use existing search index (e.g. Elasticsearch) or simple BM25/text match; vector search: as before; combine: run both, merge results by score (RRF or weighted combination); optional filter (e.g. date, source) before or after.
- Reranking — take top-k from vector (or hybrid) search (e.g. k=20–50); run reranker model (cross-encoder or LLM) on (query, candidate doc); re-sort by rerank score; return top-n (e.g. 5–10). Improves relevance at cost of latency and compute.
- Batch index updates — pipeline: new/updated docs → chunk → embed (batch API if available) → upsert to vector index; delete old vectors for updated doc; run on schedule or event; avoid blocking search during build.
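The batch update pipeline above can be sketched as delete-then-upsert per document, so an updated doc doesn't leave stale chunk vectors behind. A minimal sketch: `ChunkIndex` is a toy store, and `embed_batch` is a stand-in for a batch embeddings call:

```python
class ChunkIndex:
    """Toy vector store keyed by (doc_id, chunk_no), supporting the
    delete-then-upsert pattern used in batch reindex runs."""

    def __init__(self):
        self._vectors = {}

    def delete_doc(self, doc_id):
        # Drop every chunk vector belonging to this doc.
        for key in [k for k in self._vectors if k[0] == doc_id]:
            del self._vectors[key]

    def upsert(self, doc_id, chunk_no, vector):
        self._vectors[(doc_id, chunk_no)] = vector

def reindex_batch(index, docs, embed_batch):
    """docs maps doc_id -> list of chunk texts; embed_batch is a
    stand-in for a batched embedding call (texts -> vectors)."""
    for doc_id, chunks in docs.items():
        index.delete_doc(doc_id)       # remove stale chunks first
        vectors = embed_batch(chunks)  # one batched call per doc
        for chunk_no, vec in enumerate(vectors):
            index.upsert(doc_id, chunk_no, vec)
```

Deleting before upserting matters when an update shrinks a document: otherwise chunks beyond the new chunk count would survive and keep matching queries.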
Growth Diagram
Documents                    Query
    |                          |
    v                          v
+---------------------------------------+
|                  API                  |
+---------------------------------------+
    |                          |
    v                          v
Embedding (batch)            Embedding (query)
    |                          |
    v                          v
Vector index (ANN)           Vector search (ANN) + optional keyword
    |                          |
    v                          v
Metadata store               Reranker (top_k → top_n)
    |                          |
    v                          v
Incremental update           Return results

Patterns and Concerns to Introduce (practical scaling)
- Recall vs latency: ANN trades recall for speed; tune so recall@k is acceptable (e.g. 95%+) while keeping latency under SLO.
- Hybrid weights: tune weight or fusion method (vector vs keyword) per use case; A/B test if needed.
- Monitoring: index build time, search latency (embedding + ANN + rerank), reranker latency, embedding and rerank cost.
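The recall-vs-latency trade-off is usually tracked by measuring recall@k on a sample of queries: run each query through both the ANN index and an exact scan, then compare the result sets. A small sketch, assuming both inputs are ranked doc_id lists:

```python
def recall_at_k(ann_results, exact_results, k):
    """Fraction of the exact top-k that the ANN top-k also returned."""
    exact_top = set(exact_results[:k])
    ann_top = set(ann_results[:k])
    return len(ann_top & exact_top) / max(len(exact_top), 1)
```

Tuning then becomes a loop: raise the ANN's search effort (e.g. ef_search or probes) until recall@k clears the target, and check that latency still meets the SLO.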
Still Avoid (common over-engineering here)
- Distributed vector index and multi-region until corpus and QPS justify it.
- Multi-modal (image + text) and complex rerank pipelines until product requires them.
- Real-time incremental index (every doc change immediately) until freshness SLO demands it; batch hourly or daily often enough.
Stage 3 — Advanced Scale (distributed index, multi-modal, cost optimization)
What Triggers Advanced Scale?
- Very large corpus or high QPS; single index or single region doesn’t scale; need distributed or sharded vector index; optional multi-region for latency.
- Multi-modal: index both text and image (or other modality); unified embedding or separate indices with fusion.
- Cost and latency optimization: cache query embeddings for repeated/similar queries; use smaller or cheaper embedding model where acceptable; tiered index (hot vs cold).
Components (common advanced additions)
- Scale (distributed index) — shard vectors by doc_id range or by embedding (e.g. IVF partitions); query all shards and merge top-k, or route by metadata; replicate for read scaling; use managed vector DB that scales (e.g. Pinecone, Weaviate cluster).
- Multi-modal — index text and image (or other) embeddings; same or different models; query can be text or image; search in one or both indices and fuse results; optional unified embedding space (e.g. CLIP).
- Incremental indexing — stream or queue of doc changes; worker embeds and upserts/deletes; index reflects changes within minutes; consistency: eventual (search may miss very recent docs briefly).
- Cost and latency optimization — cache query embedding (and optionally results) for identical or similar queries; use smaller/faster embedding model for first stage and stronger model for rerank only; tier hot vs cold docs (e.g. recent in fast index, older in cheaper store).
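The query-embedding cache above can be sketched as an LRU keyed by normalized query text, so repeated queries skip the embedding call entirely. A sketch: `embed` stands in for the real embedding model call:

```python
from collections import OrderedDict

class QueryEmbeddingCache:
    """LRU cache of query text -> embedding vector.

    Saves an embedding API call (and its latency) for repeated queries;
    normalizing the key catches trivial variants (casing, whitespace).
    """

    def __init__(self, embed, max_size=10_000):
        self._embed = embed
        self._max_size = max_size
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query):
        key = " ".join(query.lower().split())  # normalize the cache key
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        vector = self._embed(query)
        self._cache[key] = vector
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict least recently used
        return vector
```

Caching results (not just embeddings) for identical queries is the next step, but it interacts with index freshness, so the embedding cache is the safer first win.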
Advanced Diagram (conceptual)
Documents (text + optional image)      Query (text or image)
        |                                      |
        v                                      v
Embedding (text + optional image)      Cache (query embedding)
        |                                      |
        v                                      v
Distributed / sharded vector index     ANN search (all shards)
        |                                      |
        v                                      v
Incremental upsert/delete              Merge + rerank
        |                                      |
        v                                      v
Multi-modal fusion (if needed)         Return results

Patterns and Concerns at This Stage
- Sharding strategy: by id range (simple) or by vector (e.g. IVF) for pruning; avoid hot shards.
- Consistency: eventual consistency for index updates; document “searchable after X minutes” if needed.
- SLO-driven ops: search latency (p50, p95), index freshness, embedding and rerank cost; error budgets and on-call.
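The query-all-shards-and-merge pattern can be sketched as scatter-gather: each shard returns its local top-k, and the coordinator merges them into a global top-k. A pure-Python sketch, assuming each shard result is a list of (score, doc_id) pairs:

```python
import heapq
from itertools import chain

def merge_shard_results(shard_results, k):
    """Merge per-shard top-k lists into a global top-k by score.

    Correct as long as every shard returns at least its own top-k,
    since the global top-k is a subset of the union of shard top-ks.
    """
    candidates = chain.from_iterable(shard_results)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])
```

Each shard doing a local top-k bounds the data crossing the network to shards × k rows, regardless of corpus size; the merge itself is cheap.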
Summarizing the Evolution
MVP delivers LLM-powered/semantic search with one API, one embedding model, a store of (doc_id, vector), and search = query embedding + similarity (brute-force or simple vector DB). That’s enough to ship semantic search for a small corpus.
As you grow, you add a vector index (ANN) for speed, hybrid (keyword + vector) for better coverage, reranking for relevance, and batch or incremental index updates. You keep relevance and latency in balance with cost.
At advanced scale, you add distributed/sharded index, optional multi-modal indexing, incremental indexing for freshness, and cost/latency optimization (caching, model choice, tiers). You scale corpus and QPS without over-building on day one.
This approach gives you:
- Start Simple — API + embedding model + vector store, search = embed query + similarity; ship and learn.
- Scale Intentionally — add ANN index when corpus size demands it; add hybrid and rerank when relevance demands it.
- Add Complexity Only When Required — avoid distributed index and multi-modal until scale and product justify them; keep relevance and cost under control first.