---
title: "How to Build Semantic Search for Your Codebase"
subtitle: "From Raw Code to Meaning-Based Retrieval"
author: "Kelly Price"
date: "2026-04-21"
description: "A hands-on guide to building and deploying semantic code search — covering embeddings, chunking strategies, vector stores, and production calibration."
tags: [ai, developer-tools, productivity]
---

# How to Build Semantic Search for Your Codebase
## From Raw Code to Meaning-Based Retrieval

*Kelly Price*

---

## About This Guide

This book exists because every sufficiently large codebase eventually becomes hostile to the people who work in it. Not because the code is bad — it might be excellent — but because the mechanisms we use to find things inside it are fundamentally mismatched to how developers think about problems.

You want to find "the function that validates JWT expiry." Your editor searches for substrings. You type `expiry`, get nothing useful, try `expire`, find seventeen unrelated hits, then fall back to asking a teammate. That teammate wastes five minutes of their own focus to tell you it's called `check_token_lifetime` in `auth/middleware.py`. The code was always there. The retrieval failed.

Semantic search solves this by indexing meaning instead of tokens. An embedding model converts each chunk of your codebase — function bodies, docstrings, class definitions — into a dense numerical vector that encodes what the code *does*, not just what it *says*. Querying "JWT expiry validation" produces a vector close in embedding space to the function you need, even if none of those exact words appear in the source.

This guide walks you through building that system end to end. Not a toy demo that works on one repository — a production-capable system with hybrid retrieval, incremental index updates, a queryable API, and calibrated relevance thresholds. Every code example runs. Every CLI command is tested. No scaffolding left for you to figure out.

The stack used throughout: Python 3.11+, ChromaDB for vector storage, `sentence-transformers` for embedding, and a thin FastAPI layer for the search API. Where choices matter, the trade-offs are explained so you can swap components. The principles carry to any embedding model, any vector store, and any language.

Who is this for: backend engineers, platform engineers, and tooling developers who want to understand the machinery, not just call a managed API. You should be comfortable reading Python, comfortable with the idea of vectors and cosine distance (no linear algebra required), and have a codebase you want to search.

What you will build: a search index over a real repository, a hybrid retrieval pipeline that outperforms keyword-only search, a REST API your team can query, a freshness mechanism that updates the index on commit, and a measurement harness for tracking retrieval quality over time.

Start with Chapter 1 if you want the conceptual foundation first. Jump to Chapter 3 if you already know why this matters and want to get your hands dirty with chunking. The chapters are sequential but each stands on its own.

---

## Table of Contents

1. Why Code Search Fails and What Semantic Search Fixes
2. Choosing an Embedding Model for Code
3. Chunking Strategies: Functions, Files, and Context Windows
4. Storing and Querying a Vector Index
5. Hybrid Search: Combining Semantic and Keyword (BM25)
6. Calibrating Relevance Thresholds for Your Codebase
7. Building a Search API on Top of Your Index
8. Keeping the Index Fresh: Incremental Updates
9. Measuring and Improving Retrieval Quality
- Conclusion
- Appendix A: Glossary
- Appendix B: Tools and Resources
- Appendix C: Further Reading

---

## Chapter 1: Why Code Search Fails and What Semantic Search Fixes

Every developer has a model of their codebase. It's imprecise, biased toward the parts they wrote, and degrades over time as the codebase evolves without them. When you need to find something outside that model — a function written by someone who left, a utility added six months ago, a piece of error handling buried three layers deep — you go looking. The tools you use for that search were not designed for code.

`grep` was written for Unix text processing in 1973. Full-text search in IDEs is essentially `grep` with a faster index and a nicer UI. Both operate on exact character matches with optional wildcard and regex support. They are fast and precise when you already know what you're looking for. They fail when you know what something *does* but not what it's *called*.

This failure mode is structural, not accidental. Keyword search requires lexical overlap between your query and the document. Code rarely provides that overlap. Function names are abbreviations (`calc_ttl` not `calculate time to live`). Variables are scoped mnemonics (`buf`, `idx`, `ctx`). The same concept appears under different names across languages, teams, and years. A function that "retries failed HTTP requests with exponential backoff" might be named `_retry_request`, `fetch_with_retry`, `resilient_get`, or `backoff_fetch`. Keyword search treats these as completely different strings. They are not.
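To make the mismatch concrete, the sketch below measures how many words of a natural language query literally appear in each of the four function names mentioned above. The `split_identifier` and `lexical_overlap` helpers are illustrative assumptions, not part of any search tool.

```python
import re

def split_identifier(name: str) -> set[str]:
    """Break a snake_case or camelCase identifier into lowercase words."""
    parts = re.findall(r"[A-Za-z][a-z]*|\d+", name)
    return {p.lower() for p in parts}

def lexical_overlap(query: str, identifier: str) -> float:
    """Fraction of query words that literally appear in the identifier."""
    query_words = set(query.lower().split())
    return len(query_words & split_identifier(identifier)) / len(query_words)

query = "retries failed HTTP requests with exponential backoff"
for name in ["_retry_request", "fetch_with_retry", "resilient_get", "backoff_fetch"]:
    print(f"{name:18s} overlap={lexical_overlap(query, name):.2f}")
```

Even `_retry_request` scores zero here, because `retry` and `retries` are different strings to an exact matcher.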

> **Key Insight:** The vocabulary mismatch problem — where the searcher's language and the author's language diverge — is the core reason keyword search underperforms on code. It's the same problem information retrieval researchers identified in natural language documents in the 1980s. The fix is the same: index meaning, not tokens.

Semantic search addresses this with embedding models — neural networks trained to map text into a high-dimensional vector space where semantically similar texts land close together. "Retry HTTP request with exponential backoff" and `def fetch_with_retry(url, max_attempts=3, base_delay=1.0)` end up near each other in that space because the model has learned the relationship between natural language descriptions and code that implements them. The Euclidean distance or cosine similarity between their vectors reflects semantic proximity, not string overlap.

The practical difference is significant. In a study of developer search behavior, developers reformulate their queries an average of 2.3 times before finding what they need with keyword search. With semantic search over the same corpus, first-query success rates improve substantially — not because the retrieval is magic, but because it's tolerant of the natural language developers actually use when thinking about problems.

What semantic search does not fix: precision retrieval of exact strings. If you need every occurrence of the string `SECRET_KEY` in your codebase, `grep` is the right tool. Semantic search is probabilistic and approximate. The two approaches are complementary, which is why Chapter 5 covers hybrid retrieval that combines them.

The components of a semantic code search system are:

**Chunking**: Split the codebase into retrievable units — functions, classes, files, or sub-function segments. The size and boundary choices significantly affect retrieval quality.

**Embedding**: Pass each chunk through an embedding model to produce a dense vector. This is a one-time cost per chunk; embeddings are stored and reused.

**Indexing**: Store vectors in a structure that supports fast approximate nearest-neighbor (ANN) search. For most codebases under one million chunks, in-process vector stores like ChromaDB or FAISS are sufficient.

**Retrieval**: Embed the user's query using the same model and return the *k* most similar chunks by vector distance.

**Re-ranking (optional)**: Apply a second, more expensive model to re-score the top candidates for higher precision.

> **Warning:** Semantic search does not inherently understand code *correctness* or *recency*. A highly similar function that was deprecated two years ago will score just as well as its modern replacement. Metadata filtering — by file path, last-modified date, language — is how you enforce freshness and scope constraints.

The index build process is offline and batched. Once built, queries are fast — embedding a query takes under 50ms on CPU for most models, and vector search over a million chunks takes under 10ms with HNSW indexing. The index needs to be updated as code changes, which Chapter 8 covers.

A realistic expectation: semantic search is not a replacement for understanding the codebase. It's an accelerant. The developer who builds a strong mental model still outperforms one who relies entirely on search. But semantic search dramatically reduces the tax paid by developers exploring unfamiliar territory, and it makes that exploration available to the whole team rather than just long-tenured engineers.

By the end of this book you will have built every component described above. The next chapter starts at the foundation: choosing the right embedding model.

**Key Takeaways**

- Keyword search fails on code because function names and variable names rarely match the natural language developers use to describe them.
- Embedding models convert text to vectors where semantic similarity maps to geometric proximity.
- Semantic search is approximate and probabilistic; keyword search is exact. They solve different problems.
- Metadata filtering is how you enforce constraints that the vector distance cannot capture.
- The index is built offline; queries are fast.

**Practical Exercise**

Pick a real function in your codebase. Write three different natural language descriptions of what it does, using different vocabulary each time. Then search for it using your current IDE or `grep`. Count how many of your three descriptions would have found it. This establishes your baseline before building anything.

---

## Chapter 2: Choosing an Embedding Model for Code

The embedding model is the single most consequential choice in your retrieval system. Every other component — chunking, indexing, hybrid fusion — operates downstream of the vectors the model produces. A poor model choice makes everything else harder to compensate for. A strong one gives you a foundation that tolerates imperfect chunk boundaries and coarse threshold calibration.

The model landscape divides into three categories: general-purpose text embedding models, code-specific models, and bimodal models trained on natural language and code jointly. Each has a different tradeoff between coverage, retrieval quality, and operational cost.

**General-purpose text models** like `all-MiniLM-L6-v2` from sentence-transformers are fast, small (22M parameters), and perform well on natural language queries against docstrings and comments. They underperform on code-heavy chunks because their training data skews toward prose. If your codebase is well-documented and your queries are always in English prose, these work. For most production codebases, they're a starting point, not the final choice.

**Code-specific models** are trained primarily on source code. `microsoft/codebert-base` was an early entry trained on code-comment pairs across six languages. `Salesforce/codet5-base` and its variants extend this with generation capabilities, though for retrieval-only use cases the base encoder is what matters. More recently, `jinaai/jina-embeddings-v2-base-code` has shown strong results on code retrieval benchmarks with an 8,192-token context window, which matters for whole-file embedding.

**Bimodal models** are the most useful for search, because search is inherently a cross-modal task: a natural language query matched against code documents. `microsoft/unixcoder-base` was trained with a unified encoder across natural language and code. `nomic-ai/nomic-embed-text-v1.5` and `voyage-code-2` (API-based) are strong bimodal options. The key test is whether a query like "parse JSON from HTTP response body" retrieves `response.json()` usages over pure text matches.

> **Key Insight:** The retrieval task is bimodal by nature. Your queries are natural language. Your index contains code. A model trained to embed both in the same space — so that the query vector and the code vector land near each other — will always outperform a model trained on one modality alone.

To evaluate a model before committing, build a small evaluation set: 20–30 query-answer pairs from your own codebase where you know the correct function to return. Then measure recall@5 (does the right answer appear in the top 5 results?). This takes about two hours and saves weeks of course-correcting later.
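A minimal sketch of such a harness follows. The `search_fn` callable, the chunk ids, and the `fake_search` stand-in are all hypothetical; in practice `search_fn` would embed the query with the candidate model and return chunk ids from your index, best first.

```python
def recall_at_k(ranked_ids: list[str], expected_id: str, k: int = 5) -> bool:
    """True if the expected chunk id appears in the top-k results."""
    return expected_id in ranked_ids[:k]

def evaluate(eval_set: list[dict], search_fn, k: int = 5) -> float:
    """Fraction of queries whose expected chunk appears in the top k.

    eval_set: [{"query": str, "expected_id": str}, ...]
    search_fn: callable(query) -> list of chunk ids, best first.
    """
    if not eval_set:
        raise ValueError("evaluation set is empty")
    hits = sum(
        recall_at_k(search_fn(case["query"]), case["expected_id"], k)
        for case in eval_set
    )
    return hits / len(eval_set)

# Stand-in retriever so the sketch runs without a model or an index.
def fake_search(query: str) -> list[str]:
    canned = {
        "validate JWT expiry": ["auth/middleware.py:check_token_lifetime", "auth/jwt.py:decode"],
        "retry HTTP request": ["utils/strings.py:slugify"],
    }
    return canned.get(query, [])

eval_set = [
    {"query": "validate JWT expiry", "expected_id": "auth/middleware.py:check_token_lifetime"},
    {"query": "retry HTTP request", "expected_id": "net/http.py:fetch_with_retry"},
]
print(f"recall@5 = {evaluate(eval_set, fake_search):.2f}")
```

Running the same evaluation set against each candidate model gives a single comparable number per model.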

Here is a concrete setup using `sentence-transformers`:

```bash
pip install sentence-transformers==3.0.1 torch==2.3.1
```

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("microsoft/unixcoder-base")

code_chunk = """
def fetch_with_retry(url: str, max_attempts: int = 3, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
"""

query = "retry HTTP request with exponential backoff"

code_vec = model.encode(code_chunk, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

similarity = np.dot(code_vec, query_vec)
print(f"Cosine similarity: {similarity:.4f}")
```

A well-calibrated bimodal model should produce a similarity above 0.6 here. If you're seeing 0.3–0.4, the model is not well-suited for cross-modal retrieval.

Operational considerations matter as much as quality. Embedding your codebase is a one-time cost, but re-embedding on every update is ongoing. Model throughput on CPU ranges from 50–500 chunks per second depending on model size and chunk length. A 200,000-function codebase embedded at 100 chunks/second takes about 33 minutes. On a single A100 GPU the same job takes under two minutes.

> **Warning:** Do not mix embedding models between the query and the index. If you embed your codebase with `unixcoder-base` and then upgrade to a different model for queries, the vector spaces are incompatible and similarity scores become meaningless. When you upgrade models, you must re-embed the entire index.
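One way to catch a mismatch early is to record which model built the index and check it before every query. The sketch below assumes a small manifest stored next to the index; `IndexManifest` and `check_compatible` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexManifest:
    """Describes which model produced the vectors in an index."""
    model_name: str
    dimension: int

def check_compatible(manifest: IndexManifest, query_model_name: str, query_dim: int) -> None:
    """Raise if the query-side model differs from the one used to build the index."""
    if manifest.model_name != query_model_name:
        raise ValueError(
            f"index built with {manifest.model_name!r}, "
            f"query uses {query_model_name!r}; re-embed the index first"
        )
    if manifest.dimension != query_dim:
        raise ValueError(
            f"dimension mismatch: index={manifest.dimension}, query={query_dim}"
        )

manifest = IndexManifest(model_name="microsoft/unixcoder-base", dimension=768)
check_compatible(manifest, "microsoft/unixcoder-base", 768)  # passes silently
```

A dimension check alone is not enough, since two different models can share a dimension while producing incompatible spaces; that is why the model name is compared first.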

Dimensionality affects both storage and search speed. Common embedding dimensions are 384 (MiniLM), 768 (most BERT-based models), and 1536 (larger models). Higher dimensions are not strictly better: a 768-dimension bimodal model for code will typically outperform a 1536-dimension general text model. Storage cost is linear in dimension: a 200,000-chunk index at 768 dimensions and float32 precision needs 200,000 × 768 × 4 bytes, about 614 MB (586 MiB), for the raw vectors alone.
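The time and storage estimates in this chapter are simple arithmetic, and it is worth rerunning them with your own numbers. The helper names below are illustrative, and the storage figure covers only the raw vectors, not index overhead, stored documents, or metadata.

```python
def embedding_time_seconds(n_chunks: int, chunks_per_second: float) -> float:
    """Wall-clock estimate for embedding a corpus at a steady throughput."""
    return n_chunks / chunks_per_second

def raw_vector_bytes(n_chunks: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Size of the vectors alone; float32 uses 4 bytes per value."""
    return n_chunks * dimension * bytes_per_value

n_chunks = 200_000
seconds = embedding_time_seconds(n_chunks, chunks_per_second=100)
size = raw_vector_bytes(n_chunks, dimension=768)

print(f"embedding time: {seconds / 60:.1f} minutes")
print(f"raw vectors:    {size / 1e6:.1f} MB ({size / 2**20:.1f} MiB)")
```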

For a first deployment, `microsoft/unixcoder-base` is a strong default: bimodal, 768-dimensional, runs on CPU at acceptable speed, no API costs, and well-tested on code retrieval benchmarks. If you need higher throughput or longer context windows, `jinaai/jina-embeddings-v2-base-code` (8K token context) or a hosted model like `voyage-code-2` are worth the step up.

The `normalize_embeddings=True` parameter in sentence-transformers scales each vector to unit length, so the inner product of two embeddings equals their cosine similarity. Always normalize for retrieval: it makes scores comparable regardless of the raw vector magnitudes the model happens to produce.
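The effect of normalization can be checked by hand on small vectors. This is a pure-Python sketch with made-up three-dimensional vectors; real embeddings behave the same way in 768 dimensions.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def normalize(u):
    n = norm(u)
    return [x / n for x in u]

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = normalize(a), normalize(b)

# On unit vectors the plain dot product is the cosine similarity.
print(math.isclose(dot(a_unit, b_unit), cosine(a, b)))

# Scaling a raw vector changes its dot product but not its cosine.
scaled = [10 * x for x in a]
print(dot(scaled, b) != dot(a, b), math.isclose(cosine(scaled, b), cosine(a, b)))

# Squared Euclidean distance between unit vectors equals 2 - 2 * cosine,
# so sorting by either measure yields the same ranking.
sq_l2 = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))
print(math.isclose(sq_l2, 2 - 2 * cosine(a, b)))
```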

**Key Takeaways**

- Bimodal models (trained on both natural language and code) are the right default for code search, because search queries are natural language matched against code documents.
- Evaluate candidate models on 20–30 query-answer pairs from your actual codebase before committing.
- Never mix embedding models between index and query time; they must be identical.
- Normalize embeddings to unit length to ensure cosine similarity comparisons are valid.
- Model throughput on CPU is a real constraint; plan your re-embedding budget before choosing a large model.

**Practical Exercise**

Install `sentence-transformers` and run the similarity test above against three functions in your own codebase, each paired with a natural language description you'd actually type into a search box. Record the similarity scores. If any are below 0.45, your model choice is weak for that pattern — that's your signal to evaluate alternatives before building the full index.

---

## Chapter 3: Chunking Strategies: Functions, Files, and Context Windows

Chunking is the art of deciding what a "document" is in your search index. The choice of chunk boundaries affects every downstream metric: recall (does the right answer appear at all?), precision (is the result focused or diluted?), and latency (larger chunks take longer to embed and return more text). There is no universally correct chunk size. There is a correct one for your codebase, your query patterns, and your embedding model's context window.

The three primary strategies are function-level chunking, file-level chunking, and sliding-window chunking. Each has a different character and fits different use cases.

**Function-level chunking** extracts each function, method, or class definition as a standalone chunk. This is the most semantically coherent unit for retrieval: a function is the minimal unit of reusable behavior. When a developer queries "parse authentication header", they want the specific function that does that — not the entire file it lives in. Function-level chunks produce focused, actionable results.

The implementation uses AST parsing rather than heuristics like indentation or line length. For Python:

```python
import ast
import textwrap
from pathlib import Path
from typing import Iterator

def extract_functions(file_path: str) -> Iterator[dict]:
    source = Path(file_path).read_text(encoding="utf-8")
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return

    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
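        # node.lineno points at the def/class line; decorators sit above it.
        first_line = min([node.lineno] + [d.lineno for d in node.decorator_list])
        start = first_line - 1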
        end = node.end_lineno
        chunk_source = textwrap.dedent(
            "\n".join(source.splitlines()[start:end])
        )
        yield {
            "id": f"{file_path}:{node.name}:{start}",
            "text": chunk_source,
            "metadata": {
                "file": file_path,
                "name": node.name,
                "type": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            }
        }
```

For TypeScript and JavaScript, use `tree-sitter` instead of language-specific AST parsers:

```bash
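pip install "tree-sitter>=0.22" tree-sitter-javascript tree-sitter-typescript
```

```python
from pathlib import Path
from typing import Iterator

from tree_sitter import Language, Parser
import tree_sitter_typescript as ts_typescript

# The TypeScript package ships two grammars: language_typescript() and language_tsx().
# Passing the language to Parser() requires py-tree-sitter 0.22 or newer.
TS_LANGUAGE = Language(ts_typescript.language_typescript())
parser = Parser(TS_LANGUAGE)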

def extract_ts_functions(file_path: str) -> Iterator[dict]:
    source = Path(file_path).read_bytes()
    tree = parser.parse(source)

    def walk(node):
        if node.type in ("function_declaration", "method_definition", "arrow_function"):
            chunk = source[node.start_byte:node.end_byte].decode("utf-8")
            yield {
                "id": f"{file_path}:{node.start_point[0]}",
                "text": chunk,
                "metadata": {
                    "file": file_path,
                    # tree-sitter rows are 0-based; store 1-based lines to match the Python extractor
                    "start_line": node.start_point[0] + 1,
                    "end_line": node.end_point[0] + 1,
                }
            }
        for child in node.children:
            yield from walk(child)

    yield from walk(tree.root_node)
```

> **Key Insight:** AST-based extraction is significantly more reliable than regex or line-based heuristics. It correctly handles nested functions, multiline signatures, decorators, and unusual formatting. The upfront complexity pays for itself immediately in chunk quality.

**File-level chunking** embeds entire files as single chunks. This works well for configuration files, small utility modules, and files where understanding requires the whole context — think a 40-line constants file or a dataclass definition. For large files, it dilutes the vector signal: a 1,200-line file's embedding reflects the average of all its functionality, not the specific function you're searching for. File-level chunks are also frequently truncated by the embedding model's context window (typically 512–8,192 tokens depending on model).

Use file-level chunking selectively: for files under ~100 lines, or as a supplementary layer alongside function-level chunks when you want to support "show me the whole authentication module" queries.

**Sliding-window chunking** divides files into fixed-size overlapping windows, regardless of syntax boundaries. This is the simplest approach and the weakest for code. It regularly cuts functions in half. Its main utility is for documentation files, READMEs, and inline comments where there are no meaningful syntax boundaries.

```python
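def sliding_window_chunks(text: str, window_size: int = 512, overlap: int = 64) -> Iterator[dict]:
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be smaller than window_size")
    # Whitespace "tokens" only approximate the model tokenizer's count.
    tokens = text.split()
    step = window_size - overlap
    for i in range(0, len(tokens), step):
        window = tokens[i:i + window_size]
        # Drop a tiny trailing window, but never the only window of a short file.
        if i > 0 and len(window) < 50:
            break
        yield {
            "text": " ".join(window),
            "metadata": {"token_start": i, "token_end": i + len(window)}
        }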
```

**Context enrichment** is the practice of prepending surrounding context to each chunk before embedding. A bare function body often lacks the module-level context that makes it retrievable. Prepend the file path, class name, and any immediately preceding docstring:

```python
def enrich_chunk(chunk: dict, file_path: str, class_name: str = None) -> dict:
    prefix_parts = [f"# File: {file_path}"]
    if class_name:
        prefix_parts.append(f"# Class: {class_name}")

    enriched = "\n".join(prefix_parts) + "\n" + chunk["text"]
    chunk["text"] = enriched
    return chunk
```

> **Warning:** Context enrichment inflates chunk length. Stay aware of your embedding model's token limit. `unixcoder-base` has a 512-token limit; `jina-embeddings-v2-base-code` supports 8,192 tokens. Truncation is silent in most embedding libraries — the model simply drops tokens past the limit without warning. Test with your longest chunks to verify nothing critical is being cut.
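A quick scan for oversized chunks might look like the sketch below. The default `approx_token_count` is a crude whitespace count used only so the example runs without downloading a model; it is an assumption, and you would pass in your model's real tokenizer as `count_tokens` for accurate numbers, since subword tokenizers usually produce more tokens for code than a whitespace split does.

```python
from typing import Callable, Iterable

def approx_token_count(text: str) -> int:
    """Crude whitespace count; a stand-in for the model's real tokenizer."""
    return len(text.split())

def find_oversized(
    chunks: Iterable[dict],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> list[tuple[str, int]]:
    """Return (chunk id, token count) for every chunk over the limit, longest first."""
    oversized = [
        (c["id"], n) for c in chunks if (n := count_tokens(c["text"])) > max_tokens
    ]
    return sorted(oversized, key=lambda pair: pair[1], reverse=True)

chunks = [
    {"id": "a.py:small:0", "text": "def small(): return 1"},
    {"id": "a.py:big:10", "text": "x = 1\n" * 400},
]
for chunk_id, n in find_oversized(chunks, max_tokens=512):
    print(f"{chunk_id}: ~{n} tokens exceeds the limit")
```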

The right strategy for most codebases is a hybrid: function-level extraction as the primary layer, enriched with file path and class context, supplemented with file-level chunks for small utility files. This typically produces 5–15x more chunks than file-level alone, with substantially better retrieval precision.
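The layering described above can be expressed as a small routing function. This is a sketch under stated assumptions: the 100-line cutoff comes from the guideline earlier in this chapter, while the `choose_strategy` name, the layer labels, and the list of documentation suffixes are illustrative choices.

```python
def choose_strategy(path: str, source: str, small_file_lines: int = 100) -> list[str]:
    """Decide which chunk layers to build for one file.

    Returns a list drawn from "function", "file", and "window".
    """
    doc_suffixes = (".md", ".rst", ".txt")
    if path.lower().endswith(doc_suffixes):
        return ["window"]  # prose has no syntax boundaries to respect

    layers = ["function"]  # primary layer for all source files
    if len(source.splitlines()) < small_file_lines:
        layers.append("file")  # small files also get a whole-file chunk
    return layers

print(choose_strategy("README.md", "# Title\nSome text"))
print(choose_strategy("src/constants.py", "MAX_RETRIES = 3\n" * 40))
print(choose_strategy("src/server.py", "pass\n" * 1200))
```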

Chunk metadata is as important as chunk text. Store file path, function name, start and end line numbers, language, and last-modified timestamp with every chunk. These fields enable metadata filtering at query time — searching only within `src/auth/`, only in Python files, only in code modified in the last 30 days. Without metadata, your search is always full-corpus.

**Key Takeaways**

- Function-level AST extraction produces the most semantically coherent chunks for retrieval.
- Sliding-window chunking should be reserved for unstructured text within the codebase (READMEs, docs).
- Enrich chunks with file path and class context before embedding to improve cross-modal retrieval.
- Store rich metadata per chunk to enable filtering at query time.
- Verify chunk lengths against your embedding model's token limit — truncation is silent.

**Practical Exercise**

Run the Python AST extractor against one real module in your codebase and print the chunk count, the shortest chunk (by token count), and the longest. Identify any chunks that would be truncated by your chosen model's context window. Decide whether those chunks need to be split further or whether the model's longer-context variant is justified.

---

## Chapter 4: Storing and Querying a Vector Index

Once you have embeddings, you need a place to store them that supports fast similarity search. This chapter covers the practical choices for vector storage at different scales, how to build and persist an index, and how to execute queries with metadata filters.

The vector storage landscape in 2026 runs from embedded libraries (FAISS, ChromaDB) to managed services (Pinecone, Weaviate Cloud, Qdrant Cloud) to self-hosted servers (Qdrant, Weaviate, Milvus). For a codebase index under one million chunks, an embedded solution is almost always the right choice: zero infrastructure to manage, no network latency, and no API costs. This chapter uses ChromaDB because it ships with persistence, metadata filtering, and a Python-native API out of the box.

```bash
pip install chromadb==0.5.3
```

```python
import chromadb
from chromadb.config import Settings

client = chromadb.PersistentClient(
    path="./codebase_index",
    settings=Settings(anonymized_telemetry=False)
)

collection = client.get_or_create_collection(
    name="codebase",
    metadata={"hnsw:space": "cosine"}
)
```

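The `hnsw:space` parameter tells ChromaDB to use cosine distance for the HNSW index; the default is `l2` (squared Euclidean distance). Setting it explicitly matters because the query code later in this chapter converts distances to similarities with `1 - distance`, which is only valid for cosine distance. With unit-length embeddings, `l2` and `cosine` produce the same ranking, but the raw distance values differ.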

HNSW (Hierarchical Navigable Small World) is the algorithm ChromaDB (and most production vector stores) use for approximate nearest-neighbor search. It trades a small amount of recall for dramatically faster query times compared to exact search. At one million vectors, HNSW returns top-k results in under 5ms; exact search takes seconds. For typical codebases, the recall loss is below 2% with default HNSW parameters.
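To measure recall loss on your own data rather than trusting a rule of thumb, compare the approximate results against a brute-force baseline on a sample of queries. The sketch below is pure Python with toy two-dimensional vectors; `exact_top_k` and `recall_vs_exact` are hypothetical helper names, and for a real corpus a NumPy matrix product would be much faster.

```python
import heapq

def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force top-k by dot product; assumes all vectors are unit length."""
    def score(item):
        _, vec = item
        return sum(q * v for q, v in zip(query, vec))
    best = heapq.nlargest(k, vectors.items(), key=score)
    return [chunk_id for chunk_id, _ in best]

def recall_vs_exact(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Share of the exact top-k that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
exact = exact_top_k([1.0, 0.0], vectors, k=2)
print(exact)
# Pretend the approximate index returned "a" and "c" for the same query.
print(recall_vs_exact(["a", "c"], exact))
```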

**Building the index** from your extracted chunks:

```python
from sentence_transformers import SentenceTransformer
from pathlib import Path
import json

model = SentenceTransformer("microsoft/unixcoder-base")

def index_chunks(chunks: list[dict], collection, batch_size: int = 256):
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        ids = [c["id"] for c in batch]
        metadatas = [c["metadata"] for c in batch]

        embeddings = model.encode(
            texts,
            normalize_embeddings=True,
            show_progress_bar=False,
            batch_size=32
        ).tolist()

        collection.add(
            ids=ids,
            embeddings=embeddings,
            documents=texts,
            metadatas=metadatas
        )
        print(f"Indexed {min(i + batch_size, len(chunks))}/{len(chunks)} chunks")
```

Batch encoding is substantially faster than encoding chunks one at a time. The outer loop batches ChromaDB writes (256 chunks per write); the inner `batch_size=32` in `model.encode` batches the GPU/CPU operations.

> **Key Insight:** Never store raw source code in the vector index as your only copy. The index is a search acceleration structure. Keep the source in version control. Store the chunk text in the index for result display, but treat the index as a derived artifact that can be rebuilt from source at any time.

**Querying** the index:

```python
def search(query: str, collection, model, n_results: int = 10, where: dict = None) -> list[dict]:
    query_vec = model.encode(query, normalize_embeddings=True).tolist()

    results = collection.query(
        query_embeddings=[query_vec],
        n_results=n_results,
        where=where,
        include=["documents", "metadatas", "distances"]
    )

    hits = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        hits.append({
            "text": doc,
            "metadata": meta,
            "score": 1 - dist  # ChromaDB returns cosine distance; convert to similarity
        })

    return hits
```

ChromaDB returns cosine *distance* (1 - similarity), so convert to similarity by subtracting from 1. A score of 0.85 means high similarity; 0.4 means weak.

**Metadata filtering** at query time:

```python
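# Only search Python files in the auth module.
# Chroma's metadata `where` filter has no substring operator (`$contains` is
# only valid in `where_document`), so this assumes each chunk was indexed with
# a `module` field (for example "auth") alongside `language`.
results = search(
    query="parse JWT token",
    collection=collection,
    model=model,
    where={
        "$and": [
            {"module": {"$eq": "auth"}},
            {"language": {"$eq": "python"}}
        ]
    }
)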
```

ChromaDB's `where` filter uses a MongoDB-like query syntax. Filters are applied before vector search, which means they reduce the search space and can improve both precision and speed.

> **Warning:** ChromaDB's `where` filter requires that the filtered field exists in the metadata for every chunk. If even one chunk is missing the `language` field, the filtered query will silently return fewer results than expected. Enforce metadata schema at indexing time, not query time.
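A lightweight way to enforce the schema is to validate every chunk just before it is written to the index. The field list and the `validate_metadata` name below are assumptions for illustration; adjust `REQUIRED_FIELDS` to whatever your filters actually depend on.

```python
REQUIRED_FIELDS = {
    "file": str,
    "name": str,
    "language": str,
    "start_line": int,
    "end_line": int,
}

def validate_metadata(chunk: dict, required: dict[str, type] = REQUIRED_FIELDS) -> None:
    """Raise ValueError listing every missing or wrongly typed metadata field."""
    meta = chunk.get("metadata", {})
    problems = []
    for field, expected_type in required.items():
        if field not in meta:
            problems.append(f"missing {field!r}")
        elif not isinstance(meta[field], expected_type):
            problems.append(
                f"{field!r} should be {expected_type.__name__}, "
                f"got {type(meta[field]).__name__}"
            )
    if problems:
        raise ValueError(f"chunk {chunk.get('id', '<no id>')}: " + "; ".join(problems))

good = {
    "id": "auth/jwt.py:decode:10",
    "metadata": {"file": "auth/jwt.py", "name": "decode", "language": "python",
                 "start_line": 11, "end_line": 30},
}
bad = {
    "id": "auth/jwt.py:encode:40",
    "metadata": {"file": "auth/jwt.py", "name": "encode", "start_line": 41, "end_line": 55},
}

validate_metadata(good)  # passes silently
try:
    validate_metadata(bad)
except ValueError as err:
    print(err)
```

Failing loudly at index time is cheaper than debugging a filtered query that quietly returns too few results.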

**Persistence and reload**: ChromaDB's `PersistentClient` writes the index to disk automatically. On restart, the same `PersistentClient` path reloads the index. There is no explicit save call needed. Index size on disk for a 200,000-chunk corpus at 768 dimensions is approximately 1.2 GB.

For larger codebases (above one million chunks), or for multi-user deployments where multiple processes need to query the same index simultaneously, consider Qdrant running as a local Docker service. Its HTTP API and Python client expose the same concepts:

```bash
docker run -p 6333:6333 qdrant/qdrant:v1.9.2
```

```python
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

qclient = QdrantClient(host="localhost", port=6333)
qclient.recreate_collection(
    collection_name="codebase",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)
```

The query and upsert APIs mirror ChromaDB's concepts closely enough that the rest of the pipeline transfers with minimal changes.

**Key Takeaways**

- ChromaDB with a `PersistentClient` is the right embedded choice for codebases under one million chunks.
- Use `hnsw:space=cosine` and normalize embeddings — these must be consistent.
- ChromaDB returns cosine distance; convert to similarity (1 - distance) before displaying scores.
- Enforce metadata schema at index time; missing fields cause silent query failures.
- For multi-process access or very large corpora, switch to Qdrant as a local Docker service.

**Practical Exercise**

Build a full index of one directory in your codebase — pick a directory with 50–200 Python or TypeScript files. Run five natural language queries against it and record the top-3 results and their scores. Identify the query that performed worst and inspect why: is the score low (model mismatch), or is the score high but the result irrelevant (chunking boundary issue)?

---

## Chapter 5: Hybrid Search: Combining Semantic and Keyword (BM25)

Semantic search has a blind spot: exact identifiers. If a developer queries `AuthorizationError`, they want results containing that exact class name. An embedding model might return semantically related authorization code that never uses that class. BM25, the ranking function behind most modern full-text search engines, handles this case well. It scores documents using term frequency and inverse document frequency, rewarding exact matches on rare terms.

The best retrieval systems use both. Semantic search handles conceptual queries. BM25 handles exact-match and rare-term queries. Reciprocal Rank Fusion (RRF) combines the two ranked lists into a single list that outperforms either alone.

**BM25 implementation** using `rank_bm25`:

```bash
pip install rank-bm25==0.2.2
```

```python
from rank_bm25 import BM25Okapi
import re
from typing import List

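def tokenize_code(text: str) -> List[str]:
    # Split on whitespace and code punctuation, lowercase
    tokens = re.split(r'[\s\.\(\)\{\}\[\],;:=<>!&|+\-\*/\^%@#~`"\'\\]+', text.lower())
    # Drop underscores so snake_case and CamelCase spellings collapse to one token
    tokens = [t.replace("_", "") for t in tokens]
    return [t for t in tokens if len(t) > 1]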

class BM25Index:
    def __init__(self, chunks: list[dict]):
        self.chunks = chunks
        corpus = [tokenize_code(c["text"]) for c in chunks]
        self.bm25 = BM25Okapi(corpus)

    def search(self, query: str, n_results: int = 50) -> list[dict]:
        tokens = tokenize_code(query)
        scores = self.bm25.get_scores(tokens)
        top_indices = scores.argsort()[-n_results:][::-1]

        return [
            {
                "chunk": self.chunks[i],
                "score": float(scores[i]),
                "rank": rank + 1
            }
            for rank, i in enumerate(top_indices)
            if scores[i] > 0
        ]
```

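The tokenizer lowercases its input and splits on whitespace and code punctuation, so `AuthorizationError` becomes the single token `authorizationerror` and matches regardless of case. The underscore is not a split character, so `authorization_error` stays `authorization_error`; the two naming conventions do not match each other unless you also strip or split on underscores.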
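
A tokenizer that matches across naming conventions can split identifiers into sub-words and also emit the joined form. The following is a minimal sketch, not part of the `BM25Index` above: `tokenize_identifiers` is a hypothetical name, and the case-boundary regex is an assumption about how your identifiers are written.

```python
import re
from typing import List

# Anything that is not a letter, digit, or underscore separates raw tokens.
_SEPARATORS = re.compile(r"[^A-Za-z0-9_]+")
# Zero-width boundaries: "aB" -> "a|B" and "HTTPServer" -> "HTTP|Server".
_CASE_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def tokenize_identifiers(text: str) -> List[str]:
    tokens: List[str] = []
    for raw in _SEPARATORS.split(text):
        # Split snake_case on underscores, then camelCase on case boundaries.
        parts = [
            part.lower()
            for piece in raw.split("_")
            for part in _CASE_BOUNDARY.split(piece)
            if part
        ]
        if not parts:
            continue
        joined = "".join(parts)
        # Emit the joined form plus the sub-words when there is more than one.
        candidates = [joined] + (parts if len(parts) > 1 else [])
        tokens.extend(t for t in candidates if len(t) > 1)
    return tokens
```

With this variant, `AuthorizationError` and `authorization_error` both yield `authorizationerror`, `authorization`, and `error`. The cost is a larger vocabulary and more partial matches on common sub-words, so measure it against your evaluation set before adopting it.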

> **Key Insight:** BM25 rewards rarity. A query for `AuthorizationError` scores highly against chunks that contain that exact term, because it's infrequent across the corpus. This makes BM25 particularly valuable for identifier-based queries that semantic search handles poorly.

**Reciprocal Rank Fusion** combines ranked lists from multiple retrievers. Each result's score is `1 / (k + rank)`, where k is a constant (typically 60) that controls how much benefit high-ranked items receive. The scores from all lists are summed to produce a final ranking.

```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[dict]],
    id_field: str = "id",
    k: int = 60
) -> list[dict]:
    scores = {}
    items = {}

    for ranked_list in ranked_lists:
        for rank, item in enumerate(ranked_list, start=1):
            # Honor id_field instead of hard-coding "id"
            item_id = item["chunk"][id_field] if "chunk" in item else item[id_field]
            score = 1.0 / (k + rank)
            scores[item_id] = scores.get(item_id, 0) + score
            items[item_id] = item

    sorted_ids = sorted(scores, key=scores.__getitem__, reverse=True)
    return [{"item": items[id_], "rrf_score": scores[id_]} for id_ in sorted_ids]
```
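
To make the fusion arithmetic concrete, here is a small worked example with k = 60. The helper `rrf_score` is a hypothetical name used only for this illustration; it computes the same per-item sum as the function above.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum of 1 / (k + rank) over every list in which the item appears."""
    return sum(1.0 / (k + rank) for rank in ranks)

top_of_one_list = rrf_score([1])       # 1/61
middling_in_both = rrf_score([3, 5])   # 1/63 + 1/65
top_of_both = rrf_score([1, 1])        # 2/61, the maximum with two lists

print(f"{top_of_one_list:.4f} {middling_in_both:.4f} {top_of_both:.4f}")
# prints: 0.0164 0.0313 0.0328
```

A chunk that both retrievers rank moderately well beats a chunk that only one retriever ranks first. With two lists and k = 60, no fused score can exceed 2/61, which is why RRF scores are not comparable to cosine similarities.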

**Full hybrid retrieval pipeline**:

```python
def hybrid_search(
    query: str,
    vector_collection,
    bm25_index: BM25Index,
    embed_model,
    n_candidates: int = 50,
    n_results: int = 10,
    where: dict = None
) -> list[dict]:
    # Semantic retrieval
    query_vec = embed_model.encode(query, normalize_embeddings=True).tolist()
    semantic_raw = vector_collection.query(
        query_embeddings=[query_vec],
        n_results=n_candidates,
        where=where,
        include=["documents", "metadatas", "distances"]
    )
    # Use the vector store's own IDs so fused entries line up with the BM25 chunks
    semantic_results = [
        {"chunk": {"id": chunk_id, "text": doc, "metadata": meta},
         "score": 1 - dist}  # cosine distance -> similarity
        for chunk_id, doc, meta, dist in zip(
            semantic_raw["ids"][0],
            semantic_raw["documents"][0],
            semantic_raw["metadatas"][0],
            semantic_raw["distances"][0]
        )
    ]

    # BM25 retrieval
    bm25_results = bm25_index.search(query, n_results=n_candidates)

    # Fuse
    fused = reciprocal_rank_fusion([semantic_results, bm25_results])

    return fused[:n_results]
```

> **Warning:** The BM25 index is built in memory and must contain the same chunks as the vector index. If you update the vector index incrementally (Chapter 8), you must also rebuild the BM25 index: `rank_bm25` does not support incremental updates, so the entire corpus must be re-indexed on change. Rebuilding is fast even for large codebases (typically under 60 seconds for 500,000 chunks), so a full rebuild on each deployment is practical. Note also that `hybrid_search` passes `where` only to the vector query; BM25 candidates are unfiltered, so apply the same filter to the fused results if you rely on it.

**When to weight the two retrievers differently**: RRF treats both ranked lists equally by default. You can bias toward semantic results for exploratory queries or toward BM25 results when the query contains identifiers. A simple heuristic: if the query contains a token that looks like a code identifier (camelCase, PascalCase, snake_case, or SCREAMING_SNAKE, detected via a regex check), fetch a larger BM25 candidate set so that more exact-match hits reach the fused list. Note that the pattern below also treats all-caps acronyms of three or more letters, such as `JWT`, as identifiers.

```python
import re

def is_identifier_heavy(query: str) -> bool:
    # Matches camelCase, snake_case, PascalCase, SCREAMING_SNAKE
    identifier_pattern = re.compile(
        r'\b([a-z]+[A-Z][a-zA-Z]*|[a-z]+_[a-z_]+|[A-Z][A-Z_]{2,}|[A-Z][a-z]+(?:[A-Z][a-zA-Z]*)+)\b'
    )
    return bool(identifier_pattern.search(query))
```
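
An alternative to enlarging the candidate set is to weight each retriever's reciprocal-rank contribution. This is a variant of RRF, not the `reciprocal_rank_fusion` function above. The function name, the retriever labels, and the weight of 2.0 are illustrative assumptions to tune against your evaluation set.

```python
def weighted_rrf(
    ranked_ids: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists, scaling each retriever's contribution by its weight."""
    scores: dict[str, float] = {}
    for retriever, ids in ranked_ids.items():
        weight = weights.get(retriever, 1.0)
        for rank, chunk_id in enumerate(ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Made-up chunk IDs: "a" leads the semantic list, "c" leads the BM25 list
ranked = {"semantic": ["a", "b", "c"], "bm25": ["c", "a", "d"]}
equal = weighted_rrf(ranked, {"semantic": 1.0, "bm25": 1.0})
bm25_biased = weighted_rrf(ranked, {"semantic": 1.0, "bm25": 2.0})
```

With equal weights `a` ranks first; doubling the BM25 weight moves `c` to the top. In a pipeline you would pick the weights per query, for example raising the BM25 weight when `is_identifier_heavy(query)` returns `True`.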

In practice, running hybrid search over a 200,000-chunk codebase adds 30–80ms over pure semantic search: most of that is BM25 scoring, which is CPU-bound. The quality improvement is typically a 15–25% increase in recall@10 on identifier-heavy queries, with no regression on conceptual queries.

**Key Takeaways**

- BM25 handles exact-match and identifier queries that semantic search misses.
- RRF combines ranked lists from multiple retrievers without requiring score normalization.
- Tokenize code thoughtfully: split on punctuation, lowercase, and filter short tokens.
- BM25 requires full re-indexing on corpus updates — this is fast and should be automated.
- Use identifier pattern detection to optionally bias the hybrid toward BM25 on exact-match queries.

**Practical Exercise**

Take the five queries from Chapter 4's exercise. For each one, run pure semantic search and pure BM25 and compare the top-5 results. Identify which queries benefit from BM25 and which benefit from semantic. Then run hybrid RRF and verify that it degrades on neither and improves on at least two.

---

## Chapter 6: Calibrating Relevance Thresholds for Your Codebase

A search system that returns results for every query, regardless of quality, trains users to distrust it. Returning "this function handles database connection pooling" in response to "how do I parse a CSV file" is worse than returning nothing — it consumes the user's attention and undermines confidence. Thresholds prevent this: queries with no good match should return a small number of high-confidence results or an honest "no good matches found."

Thresholds in vector search are cosine similarity cutoffs below which results are suppressed. Setting them well requires understanding your score distribution, which varies by model, codebase, and chunk type. There is no universal threshold value. Every deployment must be calibrated.

**Measuring your score distribution**:

```python
import numpy as np
from collections import defaultdict

def score_distribution_report(
    collection,
    embed_model,
    sample_queries: list[str],
    n_results: int = 20
) -> dict:
    all_scores = []

    for query in sample_queries:
        query_vec = embed_model.encode(query, normalize_embeddings=True).tolist()
        results = collection.query(
            query_embeddings=[query_vec],
            n_results=n_results,
            include=["distances"]
        )
        scores = [1 - d for d in results["distances"][0]]
        all_scores.extend(scores)

    arr = np.array(all_scores)
    return {
        "mean": float(arr.mean()),
        "median": float(np.median(arr)),
        "p25": float(np.percentile(arr, 25)),
        "p75": float(np.percentile(arr, 75)),
        "p90": float(np.percentile(arr, 90)),
        "p95": float(np.percentile(arr, 95)),
    }
```

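Run this with 30–50 representative queries from your team. A typical distribution for `unixcoder-base` over a well-maintained Python codebase: mean 0.52, p75 0.64, p90 0.72. The p75 is often a reasonable starting threshold: results below it tend to be marginal and results above it tend to be relevant. Treat it as a starting point to validate against known-good and known-bad queries, because these percentiles are computed over the top `n_results` hits of every query, not over hand-labelled relevance.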

> **Key Insight:** Threshold calibration is not a one-time operation. As the codebase evolves and the distribution of chunk content shifts, the optimal threshold shifts with it. Re-run calibration when the index grows by more than 20% or after major refactors.

**Applying thresholds** at query time:

```python
def filtered_search(
    query: str,
    collection,
    embed_model,
    threshold: float = 0.60,
    n_results: int = 10
) -> list[dict]:
    query_vec = embed_model.encode(query, normalize_embeddings=True).tolist()

    # Fetch extra candidates to account for threshold filtering
    raw = collection.query(
        query_embeddings=[query_vec],
        n_results=n_results * 3,
        include=["documents", "metadatas", "distances"]
    )

    hits = []
    for doc, meta, dist in zip(
        raw["documents"][0],
        raw["metadatas"][0],
        raw["distances"][0]
    ):
        score = 1 - dist
        if score >= threshold:
            hits.append({"text": doc, "metadata": meta, "score": score})
        if len(hits) >= n_results:
            break

    return hits
```

Fetch `n_results * 3` candidates before filtering to ensure you have enough results above the threshold after filtering. If the result list after filtering is empty, return a clear "no matches above threshold" rather than silently returning an empty list — the caller needs to distinguish "no good results" from "search failed."
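
One way to make that distinction explicit is to wrap the search call in a small result type. This is a minimal sketch under assumptions: `SearchOutcome`, `search_with_status`, and the three status strings are hypothetical names, and `search_fn` stands for any one-argument callable such as a partially applied `filtered_search`.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SearchOutcome:
    status: str                                   # "ok", "no_matches", or "error"
    hits: list[dict] = field(default_factory=list)
    message: str = ""

def search_with_status(
    search_fn: Callable[[str], list[dict]], query: str
) -> SearchOutcome:
    try:
        hits = search_fn(query)
    except Exception as exc:
        # The search itself failed (index offline, model not loaded, ...)
        return SearchOutcome(status="error", message=f"search failed: {exc}")
    if not hits:
        # The search ran, but nothing cleared the threshold
        return SearchOutcome(status="no_matches", message="no matches above threshold")
    return SearchOutcome(status="ok", hits=hits)
```

Callers can then branch on `status` instead of guessing what an empty list means.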

**Per-query type thresholds**: A single global threshold is a reasonable start, but score distributions differ by query type. Conceptual queries ("error handling strategy") produce score distributions centered around 0.55–0.65. Identifier queries ("AuthorizationError class") produce bimodal distributions: very high scores (0.85+) for exact matches and very low scores (0.3–0.4) for non-matches. Because the non-matches sit so far below the matches, identifier queries tolerate a lower cutoff without admitting noise, whereas a global threshold of 0.60 tuned for conceptual queries can suppress relevant identifier results that land just under it.

A practical two-threshold approach:

```python
def adaptive_threshold(query: str, default: float = 0.60) -> float:
    if is_identifier_heavy(query):  # from Chapter 5
        return 0.50  # lower bar; exact matches score very high, non-matches very low
    return default
```

> **Warning:** Setting your threshold too high (0.80+) produces very precise results but poor recall — users query for things that exist but fall below threshold and conclude the system doesn't have them. This is the most common threshold miscalibration mistake. Start lower and raise gradually as you measure false positive rates.

**Negative query testing** is the most important part of threshold calibration. Generate 20 queries that you know have no good answer in your codebase: queries about functionality that doesn't exist, or queries from the wrong domain. Run them through your search and inspect what comes back. If the top results for "GraphQL schema validation" in a codebase with no GraphQL score above 0.65, any threshold at or below 0.65 is too low.

```python
def calibrate_negative_threshold(
    negative_queries: list[str],
    collection,
    embed_model,
    target_false_positive_rate: float = 0.05
) -> float:
    all_scores = []
    for query in negative_queries:
        results = filtered_search(query, collection, embed_model, threshold=0.0, n_results=1)
        if results:
            all_scores.append(results[0]["score"])

    if not all_scores:
        return 0.60

    # Find threshold that keeps false positive rate below target
    arr = np.array(all_scores)
    return float(np.percentile(arr, (1 - target_false_positive_rate) * 100))
```
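
Before adopting a calibrated threshold, it helps to check both sides: how many negative queries still return a hit, and how many real queries lose their best result. The sketch below works on plain lists of top-1 scores; `threshold_report` is a hypothetical helper and the scores are made up for illustration.

```python
def threshold_report(
    negative_top_scores: list[float],
    positive_top_scores: list[float],
    threshold: float,
) -> dict[str, float]:
    """Share of negative queries that still return a hit, and of real queries left empty."""
    false_positives = sum(s >= threshold for s in negative_top_scores)
    suppressed = sum(s < threshold for s in positive_top_scores)
    return {
        "false_positive_rate": false_positives / len(negative_top_scores),
        "suppressed_rate": suppressed / len(positive_top_scores),
    }

# Made-up top-1 scores for illustration only
negatives = [0.41, 0.48, 0.52, 0.55, 0.66]
positives = [0.58, 0.71, 0.74, 0.80, 0.88]
report = threshold_report(negatives, positives, threshold=0.60)
```

Here one of five negative queries clears 0.60 and one of five real queries falls below it, so both rates are 0.2. Raising the threshold trades the first number against the second.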

**Key Takeaways**

- Thresholds must be calibrated per deployment; there is no universal value.
- Measure your score distribution with 30–50 representative queries before setting any threshold.
- Test with negative queries (things you know don't exist) to calibrate the false positive rate.
- Use adaptive thresholds for identifier-heavy queries, which have different score distributions.
- Fetch extra candidates before threshold filtering; return a clear "no match" signal when results are empty.

**Practical Exercise**

Generate 10 "negative" queries for your codebase — things your system definitely cannot answer. Run them at threshold 0.0 and record the top score for each. Find the score below which 90% of these negative queries fall. Set that as your initial threshold. Then run your 5 real queries from Chapter 4's exercise and verify none are suppressed.

---

## Chapter 7: Building a Search API on Top of Your Index

A local script that searches a vector index is useful for development. A deployed API is useful for your team. This chapter builds a FastAPI service that exposes the hybrid search pipeline, handles concurrent queries efficiently, and serves result payloads suitable for consumption by IDE extensions, CI tools, or chat interfaces.

```bash
pip install fastapi==0.111.0 uvicorn==0.30.1 pydantic==2.7.1
```

**API design**: The endpoint takes a query string plus optional filters, and returns ranked chunks with scores and metadata. Keep the response schema stable — downstream consumers depend on it.

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import Optional
import time

app = FastAPI(title="Code Search API", version="1.0.0")

class SearchRequest(BaseModel):
    query: str = Field(..., min_length=2, max_length=500)
    n_results: int = Field(default=10, ge=1, le=50)
    threshold: float = Field(default=0.01, ge=0.0, le=1.0)  # RRF scores range 0.008–0.033; not cosine similarity
    file_path_contains: Optional[str] = None
    language: Optional[str] = None

class SearchResult(BaseModel):
    text: str
    file: str
    start_line: int
    end_line: int
    score: float
    function_name: Optional[str] = None

class SearchResponse(BaseModel):
    query: str
    results: list[SearchResult]
    count: int
    latency_ms: float

@app.post("/search", response_model=SearchResponse)
async def search_endpoint(req: SearchRequest):
    start = time.perf_counter()

    # ChromaDB's `where` filters compare whole metadata values and have no
    # substring operator for string fields (`$contains` does substring matching
    # only in `where_document`), so the path substring is checked after retrieval.
    where = {"language": {"$eq": req.language}} if req.language else None

    raw_results = hybrid_search(
        query=req.query,
        vector_collection=app.state.collection,
        bm25_index=app.state.bm25_index,
        embed_model=app.state.embed_model,
        n_candidates=req.n_results * 5,
        n_results=req.n_results * 3,
        where=where
    )

    results = []
    for r in raw_results:
        if r["rrf_score"] < req.threshold:  # RRF score, not cosine similarity
            continue
        meta = r["item"]["chunk"]["metadata"]
        # BM25 candidates bypass the vector-store filter, so re-check both filters here
        if req.language and meta.get("language") != req.language:
            continue
        if req.file_path_contains and req.file_path_contains not in meta.get("file", ""):
            continue
        results.append(SearchResult(
            text=r["item"]["chunk"]["text"],
            file=meta.get("file", ""),
            start_line=meta.get("start_line", 0),
            end_line=meta.get("end_line", 0),
            score=r["rrf_score"],
            function_name=meta.get("name")
        ))
        if len(results) >= req.n_results:
            break

    latency_ms = (time.perf_counter() - start) * 1000
    return SearchResponse(
        query=req.query,
        results=results,
        count=len(results),
        latency_ms=round(latency_ms, 2)
    )
```

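**Startup and shutdown lifecycle**: Load the index and models once at startup, not per-request. Define `lifespan` before the routes and pass it to the one `FastAPI(...)` constructor in the module; if this `app = FastAPI(...)` line runs after the `/search` route has been registered on an earlier `app`, the new instance replaces it and the route is lost.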

```python
from contextlib import asynccontextmanager
import chromadb
from sentence_transformers import SentenceTransformer

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load on startup
    client = chromadb.PersistentClient(path="./codebase_index")
    app.state.collection = client.get_collection("codebase")
    app.state.embed_model = SentenceTransformer("microsoft/unixcoder-base")

    # Build BM25 index from stored documents
    all_docs = app.state.collection.get(include=["documents", "metadatas"])
    chunks = [
        {"id": id_, "text": doc, "metadata": meta}
        for id_, doc, meta in zip(
            all_docs["ids"], all_docs["documents"], all_docs["metadatas"]
        )
    ]
    app.state.bm25_index = BM25Index(chunks)

    print(f"Loaded {len(chunks)} chunks")
    yield
    # Cleanup on shutdown (nothing required for read-only state)

app = FastAPI(title="Code Search API", version="1.0.0", lifespan=lifespan)
```

> **Warning:** Building a large BM25 index (`rank_bm25` with 500,000 chunks) at startup takes 15–40 seconds due to tokenization and IDF computation. If your deployment requires fast restarts, pre-serialize the BM25 index to disk with `pickle` and load the serialized form. Rebuilding BM25 from scratch should be a scheduled offline operation, not a hot path. Only unpickle files that your own pipeline wrote, since loading a pickle can execute arbitrary code.

```python
import pickle
from pathlib import Path

BM25_CACHE = Path("./bm25_index.pkl")

def load_or_build_bm25(chunks: list[dict]) -> BM25Index:
    if BM25_CACHE.exists():
        with BM25_CACHE.open("rb") as f:
            return pickle.load(f)
    index = BM25Index(chunks)
    with BM25_CACHE.open("wb") as f:
        pickle.dump(index, f)
    return index
```
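
`load_or_build_bm25` trusts whatever cache file it finds. A cheap guard is to store a fingerprint of the indexed chunk IDs next to the pickle and rebuild when it no longer matches the collection. The sketch below is an assumption-laden illustration: `corpus_fingerprint` and `cache_is_current` are hypothetical helpers, and a fingerprint over IDs alone will not notice a chunk whose text changed under the same ID.

```python
import hashlib

def corpus_fingerprint(chunk_ids: list[str]) -> str:
    """Order-independent digest of the chunk IDs currently in the vector index."""
    digest = hashlib.sha256()
    for chunk_id in sorted(chunk_ids):
        digest.update(chunk_id.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()

def cache_is_current(cached_fingerprint: str, chunk_ids: list[str]) -> bool:
    """True when the cached BM25 index was built from exactly these chunk IDs."""
    return cached_fingerprint == corpus_fingerprint(chunk_ids)
```

At startup you would compute the fingerprint from `all_docs["ids"]`, compare it with the stored value, and fall back to a rebuild on mismatch.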

**Running the server**:

```bash
uvicorn search_api:app --host 0.0.0.0 --port 8080 --workers 1
```

Use `--workers 1` with the in-process ChromaDB client — it is not safe for multiple processes to write to the same persistence directory simultaneously. For read-only serving with high concurrency, multiple workers are safe if the index is never written during serving.

> **Key Insight:** Embedding and BM25 scoring are CPU-bound. Because `search_endpoint` is an `async def` that calls them directly, each query occupies the event loop until it finishes, so with `workers=1` all requests (including `/health`) are serialized through one process. For high-throughput needs, run the blocking search in a thread pool executor so the event loop stays free for I/O while it runs; the GIL still limits how much pure-Python work overlaps. For most team-internal deployments, a single worker handling 5–20 queries per minute is sufficient.
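
The thread pool approach can be sketched with `asyncio.to_thread` from the standard library. In this illustration `blocking_search` is a stand-in for `hybrid_search`, and the function names are hypothetical.

```python
import asyncio
import time

def blocking_search(query: str) -> list[str]:
    # Stand-in for hybrid_search: any synchronous, CPU-heavy call
    time.sleep(0.05)
    return [f"result for {query}"]

async def search_off_loop(query: str) -> list[str]:
    # asyncio.to_thread runs the callable in the default thread pool
    return await asyncio.to_thread(blocking_search, query)

async def demo() -> list[list[str]]:
    # Both searches are awaited together; the event loop stays responsive meanwhile
    return await asyncio.gather(
        search_off_loop("jwt expiry"),
        search_off_loop("csv parse"),
    )
```

Inside the endpoint, the equivalent change would be `raw_results = await asyncio.to_thread(hybrid_search, query=req.query, ...)`. Whether this raises throughput depends on how much of the work releases the GIL, so time it before and after.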

**Health and readiness endpoints** for deployment:

```python
@app.get("/health")
async def health():
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    try:
        count = app.state.collection.count()
        return {"status": "ready", "chunk_count": count}
    except Exception as e:
        raise HTTPException(status_code=503, detail=str(e))
```

**Key Takeaways**

- Load models and indexes at startup with FastAPI's lifespan context; never load per-request.
- Use `workers=1` with in-process ChromaDB unless you've verified your persistence backend is safe for concurrent access.
- Serialize the BM25 index to disk to eliminate slow startup rebuilds.
- Include `/health` and `/ready` endpoints from the start; they're required for any container orchestration.
- Return latency in every response — it's the cheapest observability signal you can add.

**Practical Exercise**

Deploy the search API locally and query it with `curl`. Then write a simple test script that fires 10 queries sequentially, records the latency for each, and prints the p50 and p95 latency. Identify the slowest query and determine whether the bottleneck is embedding or vector search by timing each step separately.

---

## Chapter 8: Keeping the Index Fresh: Incremental Updates

An index that was accurate when you built it and stale six months later is worse than no index at all — it creates false confidence and returns results for functions that no longer exist. Freshness is not a feature; it's a correctness requirement.

The core challenge is that full re-indexing a large codebase is expensive: embedding 200,000 chunks on CPU takes 30+ minutes. You cannot do this on every commit. The solution is incremental updates: detect which files changed, extract only their chunks, remove stale chunks from the index, and insert fresh ones.

**Change detection with git**:

```bash
# Files changed by the most recent commit
git diff --name-only HEAD~1 HEAD
```

```python
import subprocess
from pathlib import Path

def changed_files_since_commit(since_ref: str = "HEAD~1") -> list[str]:
    result = subprocess.run(
        ["git", "diff", "--name-only", since_ref, "HEAD"],
        capture_output=True,
        text=True,
        check=True
    )
    return [
        line.strip()
        for line in result.stdout.splitlines()
        if line.strip() and Path(line.strip()).suffix in {".py", ".ts", ".js", ".go", ".rs"}
    ]
```

For a post-commit hook or CI integration, replace `HEAD~1` with the stored last-indexed commit SHA:

```python
LAST_INDEXED_REF_FILE = Path("./.search_index_ref")

def get_last_indexed_ref() -> str:
    if LAST_INDEXED_REF_FILE.exists():
        return LAST_INDEXED_REF_FILE.read_text().strip()
    return "HEAD~1"  # fallback: re-index last commit's changes

def save_indexed_ref(ref: str):
    result = subprocess.run(["git", "rev-parse", ref], capture_output=True, text=True, check=True)
    LAST_INDEXED_REF_FILE.write_text(result.stdout.strip())
```

> **Key Insight:** Store the git commit SHA at which you last updated the index. On each update run, compute the diff from that SHA to HEAD, process only the changed files, and write the new SHA. This gives you exactly-correct incremental updates regardless of how much time has passed or how many commits were made since the last run.

**Incremental update routine**:

```python
def incremental_update(collection, embed_model: SentenceTransformer, bm25_cache_path: Path):
    last_ref = get_last_indexed_ref()
    # Pin the target commit up front so a commit that lands mid-run is not skipped
    head_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    changed = changed_files_since_commit(since_ref=last_ref)

    if not changed:
        print("No changed files. Index is current.")
        return

    print(f"Processing {len(changed)} changed files...")

    for file_path in changed:
        # Remove all existing chunks for this file
        existing = collection.get(where={"file": {"$eq": file_path}})
        if existing["ids"]:
            collection.delete(ids=existing["ids"])
            print(f"  Removed {len(existing['ids'])} stale chunks from {file_path}")

        # Re-extract and re-embed
        if not Path(file_path).exists():
            continue  # File was deleted; removal above is sufficient

        new_chunks = list(extract_functions(file_path))
        if not new_chunks:
            continue

        new_chunks = [enrich_chunk(c, file_path) for c in new_chunks]
        index_chunks(new_chunks, collection, embed_model)
        print(f"  Indexed {len(new_chunks)} chunks from {file_path}")

    # Invalidate BM25 cache so it gets rebuilt on next API start
    if bm25_cache_path.exists():
        bm25_cache_path.unlink()

    # Record the commit pinned at the start, not whatever HEAD is now
    save_indexed_ref(head_sha)
    print("Index update complete.")
```

**Git post-commit hook** for automated updates:

```bash
#!/bin/bash
# .git/hooks/post-commit
set -e
cd "$(git rev-parse --show-toplevel)"
python3 -m search_indexer.incremental_update >> .search_index_log 2>&1 &
```

The `&` runs the update in the background so it doesn't block the commit. For CI integration, run the same script as a pipeline step after the build.

> **Warning:** Parallel commits (from multiple developers pushing simultaneously) can produce race conditions in the index: two update processes might both read the same `last_ref`, process overlapping file sets, and write conflicting chunks. For team-wide deployments, run index updates from a single CI pipeline triggered by merge-to-main, not from individual developer machines.
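
On a single machine, a lock file is a simple extra guard against two update processes overlapping, for example when background hooks from quick successive commits pile up. This is a sketch under assumptions: `index_lock` and the lock file name are hypothetical, and a lock left behind by a killed process has to be removed by hand.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def index_lock(lock_path: Path):
    """Hold an exclusive lock file for the duration of an index update."""
    try:
        # O_EXCL makes creation fail if the file already exists
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another index update holds {lock_path}") from None
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        lock_path.unlink()

def demo() -> tuple[bool, bool]:
    """Try to take the lock twice; report (second_acquired, released_afterwards)."""
    with tempfile.TemporaryDirectory() as tmp:
        lock = Path(tmp) / ".search_index.lock"
        with index_lock(lock):
            try:
                with index_lock(lock):
                    second_acquired = True
            except RuntimeError:
                second_acquired = False
        released = not lock.exists()
    return second_acquired, released
```

Wrapping the body of the update in `with index_lock(...)` makes a second concurrent run fail fast instead of interleaving deletes and inserts. It does not replace the single-pipeline recommendation for team-wide indexes.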

**Handling deletions and renames**: When a file is deleted, its chunks must be removed from the index. When a file is renamed, the old file path's chunks become stale and must be removed, and the new path must be indexed fresh. `git diff --name-status` gives you this information:

```python
def changed_files_with_status(since_ref: str) -> dict[str, list[str]]:
    result = subprocess.run(
        ["git", "diff", "--name-status", since_ref, "HEAD"],
        capture_output=True, text=True, check=True
    )
    added, modified, deleted = [], [], []
    for line in result.stdout.splitlines():
        parts = line.split("\t")
        status = parts[0]
        if status == "A":
            added.append(parts[1])
        elif status == "M":
            modified.append(parts[1])
        elif status == "D":
            deleted.append(parts[1])
        elif status.startswith("R"):
            deleted.append(parts[1])   # old path
            added.append(parts[2])     # new path
    return {"added": added, "modified": modified, "deleted": deleted}
```
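
The status-aware diff can drive the update directly: every modified, deleted, or renamed-from path needs its chunks purged, and every added, modified, or renamed-to path needs indexing. The sketch below parses `--name-status` text into those two lists; `plan_index_changes` is a hypothetical helper and the sample paths are made up.

```python
def plan_index_changes(name_status_output: str) -> dict[str, list[str]]:
    """Turn `git diff --name-status` output into paths to purge and paths to (re)index."""
    remove: list[str] = []
    index: list[str] = []
    for line in name_status_output.splitlines():
        parts = line.split("\t")
        status = parts[0]
        if status == "A":
            index.append(parts[1])
        elif status == "M":
            remove.append(parts[1])   # drop stale chunks before re-indexing
            index.append(parts[1])
        elif status == "D":
            remove.append(parts[1])
        elif status.startswith("R"):  # e.g. "R100<TAB>old<TAB>new"
            remove.append(parts[1])
            index.append(parts[2])
    return {"remove": remove, "index": index}

sample = (
    "M\tauth/middleware.py\n"
    "D\tlegacy/tokens.py\n"
    "R100\tutils/jwt.py\tauth/jwt.py\n"
    "A\tauth/scopes.py"
)
plan = plan_index_changes(sample)
```

Feeding `plan["remove"]` to the delete step and `plan["index"]` to the extract-and-embed step covers renames, which a plain `--name-only` diff reports only under the new path.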

**Key Takeaways**

- Store the indexed git SHA and compute diffs from it; never rely on timestamps or heuristics.
- Delete all chunks for a changed file before re-inserting; stale chunks from old function names accumulate otherwise.
- Invalidate the BM25 cache after any index mutation so it gets rebuilt on next startup.
- Run team-wide index updates from CI on merge-to-main, not from individual developer hooks.
- Handle renames explicitly by processing the status field from `git diff --name-status`.

**Practical Exercise**

Set up the post-commit hook on your local repository. Make a change to a file that is already indexed — rename a function, add a new one, delete one. Run the incremental update manually and verify: old chunks are gone, new chunks are present, and the renamed function returns the new name in search results.

---

## Chapter 9: Measuring and Improving Retrieval Quality

A search system without measurement is a black box. You will make changes — new embedding model, different chunk strategy, adjusted threshold — and not know whether they helped or hurt. Measurement makes quality concrete, reveals which query types fail, and turns improvement into an engineering task rather than guesswork.

The standard metrics for retrieval evaluation are precision@k, recall@k, and MRR (Mean Reciprocal Rank). For a code search system, recall@5 is the most important: does the correct function appear in the top 5 results? Developers will scan 3–5 results; if the answer isn't there, they abandon the search.

**Building an evaluation dataset**: You need query-answer pairs where "answer" is the exact chunk that should be returned. For an existing codebase, the fastest way to build these is to ask your team: "Give me 10 search queries you've actually typed recently and the function you were looking for." Aim for 50–100 pairs spanning different query types (conceptual, identifier-based, cross-file).

```python
import json
from pathlib import Path

# eval_set.json format:
# [{"query": "parse JWT expiry", "expected_file": "auth/middleware.py", "expected_function": "check_token_lifetime"}, ...]

def load_eval_set(path: str) -> list[dict]:
    return json.loads(Path(path).read_text())

def evaluate_retrieval(
    eval_set: list[dict],
    search_fn,
    k_values: tuple[int, ...] = (1, 3, 5, 10)  # immutable default
) -> dict:
    results = {f"recall@{k}": 0 for k in k_values}
    results["mrr"] = 0.0

    for item in eval_set:
        hits = search_fn(item["query"])

        for rank, hit in enumerate(hits, start=1):
            meta = hit.get("metadata", hit.get("item", {}).get("chunk", {}).get("metadata", {}))
            is_match = (
                item["expected_function"] == meta.get("name", "") and
                item["expected_file"] in meta.get("file", "")
            )
            if is_match:
                for k in k_values:
                    if rank <= k:
                        results[f"recall@{k}"] += 1
                results["mrr"] += 1.0 / rank
                break

    n = len(eval_set)
    for key in results:
        results[key] = round(results[key] / n, 4)

    return results
```

> **Key Insight:** Recall@5 of 0.70 means 7 out of 10 queries return the correct answer in the top 5. That's a reasonable baseline. Recall@5 above 0.85 is strong. Below 0.50 means the system is not providing enough value to justify deploying.
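
A toy calculation shows how the numbers come together. The helpers and the three queries below are made up for illustration and work on plain lists of function names instead of search hits.

```python
def reciprocal_rank(ranked: list[str], expected: str) -> float:
    """1 / rank of the expected item, or 0.0 if it was not retrieved."""
    for rank, name in enumerate(ranked, start=1):
        if name == expected:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked: list[str], expected: str, k: int) -> int:
    """1 if the expected item is within the top k, else 0."""
    return int(expected in ranked[:k])

# Three made-up queries: (ranked function names, expected function)
runs = [
    (["check_token_lifetime", "decode_jwt"], "check_token_lifetime"),  # rank 1
    (["open_pool", "parse_csv", "read_rows"], "read_rows"),            # rank 3
    (["render_page", "load_theme"], "send_email"),                     # miss
]
mrr = sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)
recall_at_1 = sum(recall_at_k(r, e, 1) for r, e in runs) / len(runs)
recall_at_5 = sum(recall_at_k(r, e, 5) for r, e in runs) / len(runs)
```

The reciprocal ranks are 1, 1/3, and 0, so MRR is 4/9 (about 0.44). Recall@1 is 1/3 and recall@5 is 2/3: the rank-3 hit counts toward recall@5 but not recall@1.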

**Failure analysis**: When a query fails to retrieve the expected result, investigate why. Three common failure modes:

1. **Score too low**: The expected chunk exists but scores below threshold. Print its score directly.

```python
import numpy as np

def diagnose_miss(query: str, expected_chunk_id: str, collection, embed_model):
    query_vec = embed_model.encode(query, normalize_embeddings=True).tolist()
    result = collection.get(ids=[expected_chunk_id], include=["embeddings", "documents"])

    if not result["ids"]:
        print(f"Chunk {expected_chunk_id} not in index")
        return

    chunk_vec = result["embeddings"][0]
    # Dot product equals cosine similarity because both vectors are unit-normalized
    similarity = float(np.dot(query_vec, chunk_vec))
    print(f"Query: {query}")
    print(f"Expected chunk similarity: {similarity:.4f}")
    print(f"Chunk text preview: {result['documents'][0][:200]}")
```

2. **Chunk not in index**: The function exists in the codebase but was not extracted — likely due to syntax errors, unsupported language constructs, or chunking boundaries. Check whether the function appears in the index at all.

3. **Outranked by a competitor**: The expected chunk scores above threshold, but stronger competitor chunks push it out of the top k. Inspect what ranks #1 and why it scores higher; this often reveals a vocabulary mismatch that enrichment metadata can fix.
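
The three failure modes can be tallied mechanically once you have the expected chunk's similarity for each miss. In this sketch, `classify_miss` is a hypothetical helper, a score of `None` is an assumed convention for "chunk not found in the index", and the miss list is made up.

```python
from collections import Counter
from typing import Optional

def classify_miss(score: Optional[float], threshold: float) -> str:
    """Label a query whose expected chunk is missing from the top k."""
    if score is None:
        return "not in index"
    if score < threshold:
        return "score too low"
    # Above threshold yet absent from the top k: competitors scored higher
    return "outranked"

# Made-up misses: (query label, similarity of the expected chunk or None)
misses = [("q1", None), ("q2", 0.42), ("q3", 0.71), ("q4", 0.55)]
tally = Counter(classify_miss(score, threshold=0.60) for _, score in misses)
```

The most common label in `tally` tells you where to spend effort first: extraction, threshold or enrichment, or ranking.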

> **Warning:** Never evaluate exclusively on queries contributed by the team member who built the search system. They will unconsciously choose queries that work well with the implementation they built. Include queries from developers who are skeptical of or unfamiliar with the system.

**Regression tracking**: Every time you change the system, re-run the eval set and track the metric history.

```python
import datetime

def append_eval_result(result: dict, system_description: str, log_path: str = "./eval_log.jsonl"):
    entry = {
        # utcnow() is deprecated since Python 3.12; use an explicit UTC timezone
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "system": system_description,
        **result
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

```bash
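# Run evaluation and log. Assumes eval.py holds the functions above and that
# search_api exposes hybrid_search_wrapper(query), a one-argument wrapper you
# define around hybrid_search.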
python3 -c "
from eval import load_eval_set, evaluate_retrieval, append_eval_result
from search_api import hybrid_search_wrapper

eval_set = load_eval_set('eval_set.json')
result = evaluate_retrieval(eval_set, hybrid_search_wrapper)
append_eval_result(result, 'hybrid_unixcoder_threshold0.60')
print(result)
"
```

**Systematic improvement cycle**: Once you have baseline metrics, improvement follows a systematic pattern. Identify the 20% of queries with the lowest scores. Find the common failure mode. Address it specifically — through chunk enrichment, threshold adjustment, model upgrade, or BM25 weight tuning. Re-evaluate. Never ship a change that moves any metric more than 5% downward without understanding why.
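
The 5% rule is easy to automate against two evaluation results. The sketch below assumes the rule means a relative drop and that both inputs are plain metric dictionaries such as those returned by `evaluate_retrieval`; `find_regressions` is a hypothetical helper and the numbers are made up.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    max_relative_drop: float = 0.05,
) -> dict[str, float]:
    """Return metrics whose value fell by more than max_relative_drop, with the drop."""
    regressions: dict[str, float] = {}
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None or old <= 0:
            continue  # nothing to compare against
        drop = (old - new) / old
        if drop > max_relative_drop:
            regressions[metric] = round(drop, 4)
    return regressions

# Made-up results for illustration
baseline = {"recall@1": 0.50, "recall@5": 0.80, "mrr": 0.62}
candidate = {"recall@1": 0.52, "recall@5": 0.72, "mrr": 0.60}
flagged = find_regressions(baseline, candidate)
```

Here recall@5 fell from 0.80 to 0.72, a 10% relative drop, so it is flagged; MRR fell by about 3% and passes. Running this in CI after each evaluation turns the rule into a gate.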

**Key Takeaways**

- Recall@5 is the most important metric for code search; developers scan 3–5 results.
- Build your eval set from real queries your team has actually asked; synthetic queries produce optimistic results.
- Track metrics in a log file on every system change; regressions are invisible without history.
- Diagnose misses by directly computing the similarity between the query and the expected chunk.
- Include skeptical users' queries in your eval set to avoid confirmation bias.

**Practical Exercise**

Create an `eval_set.json` with 20 query-answer pairs from your codebase. Run `evaluate_retrieval` and print recall@1, recall@5, and MRR. For each miss at recall@5, run `diagnose_miss` and categorize the failure as: not in index, score too low, or outranked. Fix the most common failure type and re-evaluate.

---

## Conclusion

You now have a complete semantic code search system: chunks extracted with AST precision, embedded with a bimodal model, stored in a vector index with metadata, combined with BM25 for hybrid retrieval, served through a REST API, kept fresh through incremental git-driven updates, and measured against a real evaluation set.

What makes this system useful is not any single component — it's the combination. Pure semantic search returns the right concept but misses exact identifiers. BM25 catches identifiers but fails on conceptual queries. Thresholds prevent noise. Metadata filters restrict scope. Measurement prevents silent degradation. Each component is load-bearing.

The return on investment appears at scale. A solo developer on a small codebase does not need this. A team of ten on a three-year-old repository, where institutional knowledge is unevenly distributed and the tribal knowledge of who-knows-what is constantly eroding, gets significant value from it. Search becomes the equalizer between the engineer who has worked on the codebase for three years and the one who joined three months ago.

A few directions worth pursuing after this baseline is solid:

**Re-ranking**: Add a cross-encoder re-ranking step between retrieval and display. Cross-encoders score the query and document jointly, which is more accurate than embedding them independently. `cross-encoder/ms-marco-MiniLM-L-6-v2` runs fast enough to re-rank the top 20 candidates in under 100ms. This typically improves precision@1 by 10–15% with no change to recall over the candidate set, since re-ranking only reorders what retrieval already returned.

**Context expansion**: When displaying a result, show more than the chunk. If the retrieved chunk is a function, show the class it belongs to and the file-level imports. This gives the developer enough context to understand the result without clicking through to the file.
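For Python files, the standard-library `ast` module is enough to recover that context. The sketch below is one possible approach, not the guide's chunker: `expand_context` is a hypothetical helper, and it looks the chunk up by function name for brevity, where a real index would more likely match on stored line numbers.

```python
import ast


def expand_context(source: str, function_name: str) -> dict:
    """Return the enclosing class name and file-level imports for a function."""
    tree = ast.parse(source)

    # File-level imports are the import statements directly under the module.
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]

    # Find the first class whose body defines a method with this name.
    enclosing_class = None
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        for child in node.body:
            is_function = isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef))
            if is_function and child.name == function_name:
                enclosing_class = node.name
                break
        if enclosing_class is not None:
            break

    return {"class": enclosing_class, "imports": imports}
```

A top-level function yields `"class": None`, which the display layer can use to omit the class header.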

**Query logging**: Every search query is a signal. Queries that return no results (after threshold filtering) are gaps in your index or gaps in your chunking strategy. Queries that users follow up by clicking through to a file are successes. A simple log of `(query, result_count, first_result_file)` tuples gives you a continuous record of where search succeeds and where it fails, and the failing queries are natural candidates for your evaluation set.
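A sketch of such a log as a JSON Lines file. The function names, the file layout, the added `ts` timestamp field, and the assumption that each result is a dict with a `"file"` key are all illustrative choices rather than anything the API layer from earlier chapters requires.

```python
import json
import time
from collections import Counter
from pathlib import Path


def log_query(log_path, query, results):
    """Append one (query, result_count, first_result_file) record as a JSON line."""
    record = {
        "ts": time.time(),  # added for convenience; not part of the tuple in the text
        "query": query,
        "result_count": len(results),
        "first_result_file": results[0]["file"] if results else None,
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def zero_result_queries(log_path, top_n=10):
    """Return the most frequent queries that came back empty."""
    counts = Counter()
    with Path(log_path).open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record["result_count"] == 0:
                counts[record["query"]] += 1
    return counts.most_common(top_n)
```

Reviewing the output of `zero_result_queries` periodically is a cheap way to find missing chunks before users report them.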

**Personalization**: A developer who spends most of their time in `src/payments/` should see results from that module ranked higher. A simple recency-and-access-frequency boost on chunk metadata — updated as developers interact with search results — can deliver this without model changes.
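One possible shape for that boost, as a sketch. The function name, the 14-day half-life, and the 0.1 cap are assumptions chosen for illustration; `access_count` and `last_access_ts` stand for per-developer values you would maintain in chunk metadata.

```python
import math
import time


def personalized_score(
    base_score,
    access_count,
    last_access_ts,
    now=None,
    half_life_days=14.0,
    max_boost=0.1,
):
    """Add a small, bounded boost for chunks this developer uses often and recently."""
    now = time.time() if now is None else now
    age_days = max(0.0, (now - last_access_ts) / 86400.0)

    # Recency decays exponentially: 1.0 right now, 0.5 after one half-life.
    recency = 0.5 ** (age_days / half_life_days)

    # Frequency saturates: 0.0 for a never-accessed chunk, approaching 1.0.
    frequency = 1.0 - 1.0 / (1.0 + math.log1p(access_count))

    return base_score + max_boost * recency * frequency
```

Capping the boost keeps personalization as a tie-breaker among similarly relevant results, so a frequently visited but irrelevant chunk cannot outrank a strong match.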

**Semantic diff**: Use the index to understand what changed across a large pull request. Embed the PR description and retrieve the most semantically similar chunks that were modified. This surfaces the "interesting" changes faster than scrolling a diff.
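The retrieval half of this idea can be sketched without any vector store. The helper names and the chunk dictionary shape (`"id"`, `"file"`, `"vector"`) are assumptions, `pr_vector` is presumed to come from the same embedding model used for indexing, and filtering by file path is an approximation of "chunks that were modified".

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_modified_chunks(pr_vector, chunks, changed_files, top_n=5):
    """Rank chunks from files touched by the PR by similarity to its description.

    chunks:        iterable of dicts with "id", "file", and "vector" keys (assumed shape).
    changed_files: paths such as those printed by `git diff --name-only <sha> HEAD`.
    """
    changed = set(changed_files)
    scored = [
        (chunk["id"], cosine(pr_vector, chunk["vector"]))
        for chunk in chunks
        if chunk["file"] in changed
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

With a real vector store, the same effect usually comes from a metadata filter on file path combined with a normal similarity query.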

The underlying technology will improve. Embedding models will get better at code. Vector stores will get faster. Context windows will get longer, allowing larger chunks with less information loss. The architecture in this book is designed to accommodate those improvements: swap the model, re-embed, keep everything else. The index is a derived artifact. The measurement harness tells you whether the new model is better. The hybrid pipeline ensures that improvements in semantic retrieval don't regress exact-match queries.

The most important step is the one you haven't taken yet: building the evaluation set. Without it, you're navigating by instinct. With it, every change has a score, and improvement is measurable. Start there.

---

## Appendix A: Glossary

**ANN (Approximate Nearest Neighbor)**: A class of algorithms that find vectors close to a query vector without an exhaustive search. Trades small recall losses for dramatically faster query times. HNSW is the dominant algorithm in production vector stores.

**BM25 (Best Match 25)**: A probabilistic ranking function for full-text search. Scores documents based on term frequency, inverse document frequency, and document length normalization. The baseline ranking algorithm for keyword search.

**Bimodal model**: An embedding model trained to project both natural language and code into a shared vector space, enabling cross-modal similarity comparisons.

**Chunk**: A single unit of text extracted from the codebase and embedded as one vector. For code, typically a function, method, or class definition.

**Cosine similarity**: A measure of similarity between two vectors based on the angle between them. Ranges from -1 to 1; 1 means identical direction. Used as the primary relevance score in most vector search systems.

**ChromaDB**: An embedded Python-native vector database supporting persistent storage, metadata filtering, and HNSW indexing.

**Dense vector**: A high-dimensional numerical array produced by an embedding model, where most values are non-zero. Contrasted with sparse vectors (like TF-IDF), where most values are zero.

**Embedding**: The process of converting text into a dense numerical vector using a neural model. Also refers to the resulting vector.

**FAISS (Facebook AI Similarity Search)**: A library for efficient similarity search over dense vectors. Lower-level than ChromaDB; requires more manual management of metadata and persistence.

**HNSW (Hierarchical Navigable Small World)**: A graph-based ANN algorithm used by most production vector stores. Supports sub-linear query time with high recall.

**Hybrid search**: A retrieval approach combining dense vector search (semantic) with sparse keyword search (BM25), fused via a ranking algorithm like RRF.

**MRR (Mean Reciprocal Rank)**: The average, over all queries, of `1/rank` of the first correct answer; a query whose correct answer is not returned contributes 0. Rewards systems that return the correct answer near the top.

**Metadata filtering**: Restricting search results to chunks matching specific metadata criteria (file path, language, date) before or after vector search.

**Precision@k**: The fraction of the top-k results that are relevant. Measures result quality.

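**Recall@k**: The fraction of queries for which a relevant result appears in the top-k. Measures coverage. With one expected chunk per query, as in a set of query-answer pairs, this is the same as the hit rate.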

**RRF (Reciprocal Rank Fusion)**: An algorithm for combining ranked lists from multiple retrievers. Scores each item as `1/(k + rank)` and sums across lists.

**Re-ranking**: A second-stage scoring step that re-orders candidates from initial retrieval using a more expensive but more accurate model (cross-encoder).

**Semantic search**: Retrieval based on meaning rather than exact keyword matching, using embedding models to represent documents and queries as vectors.

**Sentence-transformers**: A Python library providing pre-trained embedding models and utilities for dense retrieval tasks.

**Threshold**: A minimum similarity score below which results are suppressed. Prevents low-quality matches from being surfaced.

**Tree-sitter**: A language-agnostic parser toolkit that produces ASTs for many programming languages from the same API.

**Vector store**: A database or library optimized for storing and querying dense vectors. Examples: ChromaDB, FAISS, Qdrant, Weaviate, Pinecone.
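Three of the definitions above (RRF, MRR, and Recall@k) reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are assumptions, each query is assumed to have a single expected item, and this is not the `evaluate_retrieval` harness referenced in the guide.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists: each item scores the sum of 1/(k + rank) across lists."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def mean_reciprocal_rank(results, expected):
    """results: one ranked list per query; expected: the correct item per query."""
    total = 0.0
    for ranking, answer in zip(results, expected):
        if answer in ranking:
            total += 1.0 / (ranking.index(answer) + 1)
        # A query whose answer is missing contributes 0.
    return total / len(results) if results else 0.0


def recall_at_k(results, expected, k):
    """Fraction of queries whose expected item appears in the top-k results."""
    hits = sum(
        1 for ranking, answer in zip(results, expected) if answer in ranking[:k]
    )
    return hits / len(results) if results else 0.0
```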

---

## Appendix B: Tools and Resources

### Embedding Models

| Model | Dimensions | Context | Notes |
|-------|-----------|---------|-------|
| `microsoft/unixcoder-base` | 768 | 512 tokens | Strong bimodal baseline |
| `microsoft/codebert-base` | 768 | 512 tokens | Code-focused, weaker on NL queries |
| `jinaai/jina-embeddings-v2-base-code` | 768 | 8,192 tokens | Best for long chunks |
| `nomic-ai/nomic-embed-text-v1.5` | 768 | 8,192 tokens | General text + code |
| `voyage-code-2` | 1,536 | 16,000 tokens | API-based, strong on code benchmarks |

### Vector Stores

| Tool | Deployment | Scale | Notes |
|------|-----------|-------|-------|
| ChromaDB | Embedded | <1M chunks | Best for single-process local use |
| FAISS | Embedded | Any | No metadata; need external store |
| Qdrant | Docker/Cloud | Any | Production-ready, multi-process safe |
| Weaviate | Docker/Cloud | Any | GraphQL query API |
| Pinecone | Managed API | Any | Serverless option |

### Python Libraries

```bash
# Core stack
pip install sentence-transformers==3.0.1
pip install chromadb==0.5.3
pip install rank-bm25==0.2.2
pip install fastapi==0.111.0 uvicorn==0.30.1

# Multi-language AST parsing
pip install tree-sitter==0.21.3
pip install tree-sitter-python tree-sitter-javascript tree-sitter-typescript

# Re-ranking (optional)
# No extra install needed: the pinned sentence-transformers above includes CrossEncoder

# Evaluation utilities
pip install numpy scipy
```

### CLI Tools

```bash
# Find all Python files in a repository
git ls-files "*.py"

# Get changed files since a commit
git diff --name-only <sha> HEAD

# Get changed files with status (added/modified/deleted/renamed)
git diff --name-status <sha> HEAD

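# Rough size estimate: counts whitespace-separated words, not model tokens
# (a model tokenizer usually produces more tokens than this for code)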
python3 -c "import sys; print(len(sys.stdin.read().split()))" < file.py
```

---

## Appendix C: Further Reading

### Papers

**"Dense Passage Retrieval for Open-Domain Question Answering"** — Karpukhin et al., 2020. The foundational paper for dense retrieval. Establishes the bi-encoder architecture that underlies most embedding-based retrieval. Read this before reading any newer work.

**"BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models"** — Thakur et al., 2021. Benchmark covering 18 retrieval datasets. Useful for comparing embedding models outside code-specific tasks.

**"UniXcoder: Unified Cross-Modal Pre-training for Code Representation"** — Guo et al., 2022. The paper behind `microsoft/unixcoder-base`. Explains the bimodal training approach and why it outperforms code-only models on retrieval tasks.

**"Improving Text Embeddings with Large Language Models"** — Wang et al., 2024. How synthetic data and instruction tuning improve embedding quality. Background for understanding why newer models outperform older BERT-based ones.

**"Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"** — Cormack et al., 2009. Original RRF paper. Short and worth reading if you want to understand why k=60 is the standard constant.

### Documentation

- ChromaDB documentation: `docs.trychroma.com`
- sentence-transformers documentation: `sbert.net`
- Qdrant documentation: `qdrant.tech/documentation`
- tree-sitter documentation: `tree-sitter.github.io/tree-sitter`
- FastAPI documentation: `fastapi.tiangolo.com`

### Related Work

**ColBERT**: A late-interaction retrieval model that produces per-token embeddings rather than a single document vector. Higher quality than bi-encoders but more expensive to store and query. Worth understanding if bi-encoder recall plateaus.

**HyDE (Hypothetical Document Embeddings)**: Generate a hypothetical answer to the query using a language model, then embed that answer and retrieve against it. Improves recall for queries where the natural language description is distant from the code style. Trade-off: adds LLM latency to every query.

**SPLADE**: A sparse learned retrieval model that combines the interpretability of BM25-style scoring with neural relevance. A strong alternative to BM25 in the hybrid pipeline for teams willing to run a second model.

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*
