---
title: "Production Code Search"
subtitle: "Reranking, Scaling, and Evaluating Code Retrieval Systems"
author: "David Kelly Price"
version: "1.0"
date: 2026-03-21
status: draft
type: ebook
target_audience: "Senior/staff engineers who have built (or understand) basic code retrieval and need to production-harden it — comfortable with Python, familiar with hybrid retrieval fundamentals (embeddings, BM25, RRF)"
estimated_pages: 90
chapters:
  - "Cross-Encoder Reranking"
  - "Adaptive Thresholds"
  - "Cold Start and Performance Engineering"
  - "Incremental Indexing at Scale"
  - "Evaluation and Benchmarking"
  - "Extending the Pipeline"
tags:
  - pyckle
  - ebook
  - reranking
  - indexing
  - evaluation
  - production
  - scaling
  - pipeline-architecture
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Color scheme:
- Pyckle brand palette
- Callout boxes use muted background tints, not heavy borders

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Code blocks:
- Syntax highlighted by language
- Numbered lines for reference in explanatory text
- Copy-pasteable (no line numbers in actual code)

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# Production Code Search

## Reranking, Scaling, and Evaluating Code Retrieval Systems

**By David Kelly Price**

Version 1.0 — March 2026

---

## Table of Contents

**Part I: Ranking**

1. Cross-Encoder Reranking
2. Adaptive Thresholds

**Part II: Production**

3. Cold Start and Performance Engineering
4. Incremental Indexing at Scale

**Part III: Measurement and Beyond**

5. Evaluation and Benchmarking
6. Extending the Pipeline

- Pipeline Architecture Summary
- Appendix A: Glossary
- Appendix B: Tools & Resources
- Appendix C: Series Cross-References
- Appendix D: Further Reading

---

## About This Guide

You have a retrieval layer. It finds candidate code chunks using embeddings, BM25, or both. The right answer is in the candidate list -- usually. But it is not always at the top, you do not know how many results to return, the system takes too long to start, the index goes stale, and you have no way to measure whether any of it is actually working.

This guide covers the engineering that takes a working retrieval system to production. Ranking candidates precisely with cross-encoder reranking. Filtering dynamically with adaptive thresholds. Starting fast with phased warm-start and memory-mapped indexes. Keeping indexes fresh with incremental updates. Measuring quality with rigorous evaluation. Extending the pipeline without breaking it.

If you have not built the retrieval layer yet -- embeddings, BM25, hybrid fusion, AST graph boosting -- the companion book *Code Retrieval from Scratch* covers that foundation end to end.

---

## How to Use This Guide

**Six chapters. Three parts.** Part I covers ranking: getting the right answer to rank 1 and knowing how many results to return. Part II covers production engineering: cold start, performance, and keeping the index current. Part III covers measurement and extension: proving the pipeline works and adding new capabilities.

**Modular reading.** Unlike a tutorial that builds sequentially, this guide can be read in any order depending on your immediate need:

- **Ranking chapters (1-2)** stand alone. If your retrieval system returns good candidates but in the wrong order, start here.
- **Production chapters (3-4)** stand alone. If your pipeline is accurate but too slow to start or the index goes stale, start here.
- **Evaluation (Chapter 5)** stands alone and arguably should be read first -- you cannot improve what you do not measure.
- **Extension (Chapter 6)** assumes familiarity with the pipeline architecture but does not depend on the other chapters.

**Prerequisites.** This guide assumes you understand the basics of code retrieval: embedding models, BM25, hybrid search with reciprocal rank fusion, and AST-based structural analysis. If these concepts are unfamiliar, read *Code Retrieval from Scratch* first -- it covers the retrieval layer from first principles.

---

# Part I: Ranking

---

## Chapter 1: Cross-Encoder Reranking

### Chapter Overview

The retrieval stages produced a candidate list. The right answer is in the list, but not necessarily at the top. Cross-encoder reranking uses a heavier model to precisely reorder candidates, pushing the correct result to rank 1. This material expands on Episode 10 and architecture article 38 of the Code Search, Decoded series.

---

### The Bi-Encoder Trade-Off

In the semantic search stage, the embedding model works as a bi-encoder. It encodes the query into a vector. Separately, it encodes each code chunk into a vector. Then it compares vectors with cosine similarity.

This is fast. You encode each chunk once at index time and store the vector. At query time, you encode the query (one forward pass), then compare it against stored vectors with simple math. For a codebase with 50,000 chunks, that is one neural network call plus 50,000 dot products. Sub-millisecond.

But there is a cost to that speed. The query vector and the document vector were produced independently. The model never saw them together. It never got to ask "does this specific query match this specific document?" It compressed each into 384 floats and hoped the compressed representations would be close enough to compare.

Usually they are. Sometimes they are not.

The distinction between "validates input before insert" and "validates input before delete" lives in a single token. After compression into 384 dimensions, that token's influence may not survive.
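To make the cost model concrete, here is a minimal sketch of bi-encoder scoring at query time. The random vectors are stand-ins for real embeddings (an assumption for illustration; a real system would load vectors produced by its embedding model), and `top_k_by_cosine` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def top_k_by_cosine(
    query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5
) -> list[tuple[int, float]]:
    """Return (chunk_index, cosine_similarity) for the k most similar chunks."""
    scores = chunk_vecs @ query_vec      # one dot product per stored chunk
    top = np.argsort(-scores)[:k]        # highest similarity first
    return [(int(i), float(scores[i])) for i in top]

# Stand-in for index time: 50,000 chunk vectors of 384 floats each
rng = np.random.default_rng(0)
chunk_vecs = normalize(rng.standard_normal((50_000, 384), dtype=np.float32))

# Stand-in for query time: a slightly perturbed copy of chunk 123
noise = 0.01 * rng.standard_normal(384, dtype=np.float32)
query_vec = normalize(chunk_vecs[123] + noise)

results = top_k_by_cosine(query_vec, chunk_vecs, k=5)
```

Note that nothing in this loop ever looks at the query text and the chunk text together. All the model's work happened earlier, separately for each side, which is exactly where the single-token distinction can get lost.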

### What a Cross-Encoder Does

A cross-encoder takes the query and a document, concatenates them, and feeds the pair through a transformer as a single input:

```
Input: [CLS] function that validates user input [SEP] def validate_input(data): schema.validate(data) [SEP]
```

The transformer's self-attention mechanism operates over both the query tokens and the document tokens simultaneously. Every query token attends to every document token. The model can learn fine-grained interactions: "validates" in the query aligns with `validate` in the code, "user input" aligns with the `data` parameter, and the `schema.validate` call confirms this is validation, not something that merely mentions validation in a comment.

The output is a single relevance score. Not a vector -- a scalar. "How relevant is this document to this query, on a scale from 0 to 1?"

This is fundamentally more powerful than comparing two independently-produced vectors. The cross-encoder sees the relationship, not two summaries of each side.

### Why Not Cross-Encode Everything?

Cost. A bi-encoder encodes the query once, then does vector comparisons. A cross-encoder needs a separate forward pass for every query-document pair. Searching 50,000 chunks means 50,000 forward passes through a transformer. Even on a GPU, that is seconds.

The solution is the standard two-stage retrieval pattern: bi-encoder for recall, cross-encoder for precision. The retrieval stages (as covered in *Code Retrieval from Scratch*) narrow 50,000 chunks down to a candidate set -- typically the top 20-50 results. The cross-encoder runs over only those candidates. Twenty forward passes instead of fifty thousand.

The bi-encoder is a fast, approximate filter. The cross-encoder is a slow, precise judge. Stacked together, you get the speed of the first and the accuracy of the second.
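The two-stage pattern can be sketched independently of any particular model. In the sketch below, `cheap_score` and `precise_score` are hypothetical callables standing in for the bi-encoder and cross-encoder; the toy scorers are assumptions chosen only to make the example runnable.

```python
from typing import Callable

def two_stage_search(
    query: str,
    chunks: list[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    candidate_count: int = 20,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Stage 1 narrows the corpus cheaply; stage 2 reorders only the survivors."""
    # Stage 1 (recall): score every chunk with the fast, approximate scorer
    by_cheap = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)
    candidates = by_cheap[:candidate_count]

    # Stage 2 (precision): run the expensive scorer on the candidates only
    rescored = [(c, precise_score(query, c)) for c in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]

# Toy stand-ins for the two models (illustration only)
def word_overlap(query: str, chunk: str) -> float:
    """Cheap scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))

def overlap_minus_test_penalty(query: str, chunk: str) -> float:
    """'Precise' scorer: same overlap, minus a penalty for test-like chunks."""
    lowered = chunk.lower()
    penalty = 1.5 if "mock" in lowered or "assert" in lowered else 0.0
    return word_overlap(query, chunk) - penalty

chunks = [
    "mock validate user input assert called",
    "validate user input against schema",
    "parse config file",
    "render html template",
]
results = two_stage_search(
    "validate user input", chunks, word_overlap, overlap_minus_test_penalty,
    candidate_count=2, top_k=2,
)
```

The expensive scorer runs `candidate_count` times regardless of corpus size, so the cost of stage 2 stays flat as the codebase grows.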

### Implementation

```python
from sentence_transformers import CrossEncoder

class CodeReranker:
    def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2'):
        self.model = CrossEncoder(model_name, max_length=512)

    def rerank(
        self,
        query: str,
        candidates: list[dict],
        top_k: int | None = None
    ) -> list[dict]:
        """Rerank candidates using cross-encoder scoring.

        Args:
            query: The search query.
            candidates: Candidate results from hybrid search.
            top_k: Maximum results to return (None = return all, reranked).

        Returns:
            Reranked candidate list.
        """
        if not candidates:
            return []

        # Prepare query-document pairs
        pairs = []
        for candidate in candidates:
            content = candidate.get('content', candidate.get('chunk', {}).get('content', ''))
            pairs.append([query, content])

        # Score all pairs
        scores = self.model.predict(pairs)

        # Attach scores and sort
        reranked = []
        for i, candidate in enumerate(candidates):
            result = candidate.copy()
            result['cross_encoder_score'] = float(scores[i])
            result['original_rank'] = i + 1
            reranked.append(result)

        reranked.sort(key=lambda x: x['cross_encoder_score'], reverse=True)

        # Track rank changes for debugging
        for new_rank, result in enumerate(reranked, start=1):
            result['new_rank'] = new_rank
            result['rank_delta'] = result['original_rank'] - new_rank

        if top_k is not None:
            reranked = reranked[:top_k]

        return reranked
```

### Hard Negatives: What Cross-Encoders See Through

The candidates that reach the reranking stage are all plausible. They all scored well on hybrid search. The easy cases -- completely irrelevant chunks -- were already filtered. What remains is a set of closely-scored results where the right answer sits next to convincing distractors.

In code search, the most common hard negatives are:

**Test mocks.** A test file that mocks `validate_user_input` matches a query about user input validation on both BM25 (exact name match) and semantic search (same concept). But it is a test mock, not the implementation. The cross-encoder sees the `mock.patch` decorator, the `assert_called_with` patterns, the test class inheritance -- and scores it lower than the actual implementation.

**Old implementations.** A function that was refactored but left in the codebase (commented out, moved to a `deprecated/` directory, or renamed with an `_old` suffix) matches queries about its functionality. BM25 sees the name. Semantic search sees the logic. The cross-encoder picks up on the deprecation signals in the surrounding context.

**Utility functions with matching names.** `utils/validation.py` might contain a generic `validate()` helper that checks data types. It matches a query about "input validation" on every surface metric. But the cross-encoder, processing query and document together, recognizes that this generic utility is not what the query asks about when more specific validation logic exists.

These are the cases where retrieval gets the right answer into the candidate list but not at rank 1. The cross-encoder fixes the ordering.

### The Mushy Middle

There is a pattern in retrieval that does not get discussed enough.

After hybrid search, the top 20 candidates often have a score distribution like this: one or two results clearly at the top, one or two clearly at the bottom, and a cluster of 10-15 results in the middle with scores within 0.05 of each other.

That is the mushy middle. The candidates that are all "pretty relevant" but not distinguishably so under the bi-encoder's compressed representations.

The right answer is often in that cluster. Not at rank 1. Not at rank 15. At rank 6, or rank 9, indistinguishable from its neighbors by score alone.

Cross-encoder reranking resolves the mushy middle. Where the bi-encoder saw a flat cluster, the cross-encoder produces a clear gradient. The candidate at rank 9 jumps to rank 2 because the cross-encoder, attending to both query and document simultaneously, noticed that this was the only chunk where the function actually performs the operation the query describes, rather than merely referencing it.
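One way to check whether your own retrieval has a mushy middle is to measure the longest run of results whose scores sit within a small band. The sketch below assumes scores are already sorted in descending order; `largest_flat_cluster` is a hypothetical helper and the scores are made-up illustration values, not measurements.

```python
def largest_flat_cluster(scores: list[float], width: float = 0.05) -> tuple[int, int]:
    """Return 1-based (start_rank, end_rank) of the longest run of descending
    scores whose highest and lowest members differ by at most `width`.

    Returns (0, 0) for an empty list.
    """
    if not scores:
        return (0, 0)

    best_start, best_len = 0, 0
    start = 0
    for end in range(len(scores)):
        # Shrink the window from the left until its spread fits within `width`
        while scores[start] - scores[end] > width:
            start += 1
        if end - start + 1 > best_len:
            best_start, best_len = start, end - start + 1
    return best_start + 1, best_start + best_len

# Made-up hybrid search scores: two clear winners, a flat cluster, two stragglers
hybrid_scores = [0.82, 0.79, 0.61, 0.61, 0.60, 0.60, 0.59,
                 0.59, 0.58, 0.58, 0.57, 0.57, 0.41, 0.38]
cluster = largest_flat_cluster(hybrid_scores)
```

If the reported cluster regularly spans ten or more ranks on your benchmark queries, the ordering inside it is close to arbitrary, and that is the range where reranking has the most room to help.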

### The Numbers

| Metric | Before Reranking | After Reranking |
|--------|-----------------|-----------------|
| Top-1 accuracy | 64% | 87% |
| Top-3 accuracy | 81% | 95% |
| Top-5 accuracy | 91% | 97% |

Top-5 was already strong from hybrid search. Cross-encoder reranking does not change what is in the candidate list much. It changes the order. For a developer waiting for a single best answer, the difference between 64% and 87% top-1 accuracy is the difference between trusting the tool and second-guessing it.
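Top-k accuracy as used in the table can be computed with a few lines. This is a minimal sketch assuming one known-correct chunk per query; the chunk IDs and rankings are hypothetical and do not reproduce the figures above.

```python
def top_k_accuracy(ranked_ids: list[list[str]], correct_ids: list[str], k: int) -> float:
    """Fraction of queries whose correct chunk appears in the first k results."""
    if not ranked_ids:
        return 0.0
    hits = sum(
        1 for ranking, correct in zip(ranked_ids, correct_ids)
        if correct in ranking[:k]
    )
    return hits / len(ranked_ids)

# Hypothetical rankings for four queries, before and after reranking
correct = ["a", "y", "o", "q"]
before = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"], ["p", "q", "r"]]
after = [["a", "b", "c"], ["y", "x", "z"], ["m", "o", "n"], ["q", "p", "r"]]

top1_before = top_k_accuracy(before, correct, k=1)
top1_after = top_k_accuracy(after, correct, k=1)
top3_before = top_k_accuracy(before, correct, k=3)
top3_after = top_k_accuracy(after, correct, k=3)
```

In this toy data the top-3 figure is identical before and after, because reranking only reorders the same candidates, while the top-1 figure improves. That mirrors the pattern in the table: the gain concentrates at the top of the list.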

### The Latency Budget

Cross-encoder reranking adds 1-3ms to the pipeline when processing 20-30 candidates. This works because the cross-encoder model is small and optimized for CPU inference. A distilled model specifically tuned for relevance judgment does one thing well: decide whether a query-document pair is a good match.

```python
# Latency benchmark for reranking
import time

# Assumes `hybrid_search` is the hybrid retriever already built in your pipeline
query = "validates user input before database insert"
reranker = CodeReranker()
candidates = hybrid_search.search(query)[:25]

# Warm up once so one-time initialization is not counted in the timing
reranker.rerank(query, candidates)

start = time.perf_counter()
reranked = reranker.rerank(query, candidates)
elapsed = (time.perf_counter() - start) * 1000

print(f"Reranked {len(candidates)} candidates in {elapsed:.1f}ms")
# Typical output: "Reranked 25 candidates in 1.8ms"
```

The full pipeline -- hybrid search plus cross-encoder reranking -- stays well under 10ms.

### Score Calibration

Cross-encoder scores are not probabilities. A score of 0.85 does not mean "85% chance this is relevant." The raw scores depend on the model's training distribution and may not be well-calibrated. Depending on the model and its configured output activation, they may also be unbounded logits rather than values between 0 and 1.

For ranking alone, raw scores work fine -- you are comparing scores within a single query's candidate set, and any monotonic rescaling leaves the order unchanged. But if you use scores for adaptive thresholds (Chapter 2) or for logging and analysis, calibration helps.

A simple calibration approach is to apply a sigmoid with a learned temperature and shift:

```python
import numpy as np

def calibrate_scores(
    raw_scores: list[float],
    temperature: float = 1.0,
    shift: float = 0.0
) -> list[float]:
    """Calibrate cross-encoder scores to approximate probabilities.

    Temperature and shift are learned from a held-out evaluation set.
    """
    scores = np.array(raw_scores)
    calibrated = 1 / (1 + np.exp(-(scores - shift) / temperature))
    return calibrated.tolist()
```

The temperature and shift parameters are fit on your evaluation set (Chapter 5) using a small held-out portion. A temperature of ~2.0 and shift of ~0.0 work as starting points for the MS MARCO cross-encoder.

### Batching for Throughput

When reranking 25 candidates, the cross-encoder needs 25 forward passes. These can be batched on modern hardware:

```python
def rerank_batched(
    self,
    query: str,
    candidates: list[dict],
    batch_size: int = 32
) -> list[dict]:
    """Rerank with batched inference for better throughput."""
    pairs = [[query, c.get('content', '')] for c in candidates]

    # The cross-encoder handles batching internally
    # but we can control the batch size for memory management
    all_scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        scores = self.model.predict(batch, show_progress_bar=False)
        all_scores.extend(scores)

    # Attach scores to copies so the caller's candidates are not mutated
    reranked = []
    for candidate, score in zip(candidates, all_scores):
        result = candidate.copy()
        result['cross_encoder_score'] = float(score)
        reranked.append(result)

    reranked.sort(key=lambda x: x['cross_encoder_score'], reverse=True)
    return reranked
```

On CPU, batching provides a modest speedup (1.2-1.5x) due to memory access patterns. On GPU, the speedup is dramatic (5-10x) because the GPU can process multiple pairs in parallel. For a local search tool running on developer hardware, CPU batching is the realistic scenario.

### The Attention Pattern

Why does the cross-encoder catch things the bi-encoder misses? The answer is in the attention pattern.

In a bi-encoder, the query tokens attend to other query tokens, and the document tokens attend to other document tokens. There is no cross-attention. The query embedding is the model's best summary of the query, and the document embedding is its best summary of the document. The comparison happens in compressed space.

In a cross-encoder, the query tokens attend to the document tokens directly. When the query says "validates user input" and the document contains `validate_input(data)`, the attention mechanism creates a strong alignment between "validates" and `validate`, between "input" and `input` (and `data`). But when the document contains `mock_validate_input()` or `test_validate_input()`, the attention also aligns with `mock` and `test` -- tokens that signal this is not the production implementation.

This cross-attention is what gives the cross-encoder its power to detect hard negatives. It sees the full context of the match, not just the compressed summary.
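The difference can be pictured as a visibility matrix: which tokens are allowed to influence which. The sketch below is a simplified illustration that ignores special tokens, padding, and multiple layers; `attention_visibility` is a hypothetical helper, not a model API.

```python
import numpy as np

def attention_visibility(query_len: int, doc_len: int, cross_encoder: bool) -> np.ndarray:
    """Return a boolean matrix where entry [i, j] is True if token i can attend to token j.

    Tokens 0..query_len-1 belong to the query; the rest belong to the document.
    """
    total = query_len + doc_len
    if cross_encoder:
        # One concatenated input: every token sees every other token
        return np.ones((total, total), dtype=bool)

    # Two independent passes: attention stays inside each block
    visible = np.zeros((total, total), dtype=bool)
    visible[:query_len, :query_len] = True
    visible[query_len:, query_len:] = True
    return visible

bi = attention_visibility(query_len=3, doc_len=4, cross_encoder=False)
cross = attention_visibility(query_len=3, doc_len=4, cross_encoder=True)

# Can the first query token see the first document token?
bi_sees_doc = bool(bi[0, 3])
cross_sees_doc = bool(cross[0, 3])
```

The off-diagonal blocks that are empty for the bi-encoder and filled for the cross-encoder are where a query token like "validates" gets to interact with a document token like `mock`.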

### Model Selection for Reranking

The default `cross-encoder/ms-marco-MiniLM-L-6-v2` is trained on the MS MARCO passage ranking dataset. It works reasonably for code but was not trained on code specifically.

For better code reranking:

1. **Fine-tune on code relevance data.** Collect (query, relevant code, irrelevant code) triples from your search logs. Fine-tune the cross-encoder to distinguish relevant code from hard negatives.

2. **Use a code-specific cross-encoder.** Models trained on code search datasets (like CodeSearchNet) understand code patterns better than general-purpose models.

3. **Distill a larger model.** Train a small, fast cross-encoder to mimic the judgments of a large, slow model. This gives you the accuracy of the large model with the latency of the small one.

For most teams, the default model provides significant accuracy gains with no training effort. Fine-tuning is the optimization you pursue after measuring (Chapter 5) and confirming that reranking accuracy is the bottleneck.
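If you do go down the fine-tuning route, the triples from option 1 usually need to be flattened into labeled pairs first. The sketch below shows one plausible shape for that step; `triples_to_labeled_pairs` is a hypothetical helper, and the exact input format your training code expects depends on the library and version you use.

```python
def triples_to_labeled_pairs(
    triples: list[tuple[str, str, str]]
) -> list[tuple[str, str, float]]:
    """Expand (query, relevant_code, irrelevant_code) triples into
    (query, code, label) pairs: label 1.0 for relevant, 0.0 for irrelevant."""
    pairs = []
    for query, relevant, irrelevant in triples:
        pairs.append((query, relevant, 1.0))
        pairs.append((query, irrelevant, 0.0))
    return pairs

# Hypothetical triple mined from search logs: the negative is a test mock
triples = [
    (
        "validates user input",
        "def validate_input(data): schema.validate(data)",
        "def test_validate_input(): mock_validate.assert_called_with(data)",
    ),
]
training_pairs = triples_to_labeled_pairs(triples)
```

The value of this data comes from the negatives. Pairing each query with a hard negative from your own codebase (a mock, a deprecated copy, a generic helper) teaches the model the distinctions that matter for your search traffic.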

### Reranking Edge Cases

Several edge cases in code reranking require specific handling:

**Very long chunks.** Cross-encoder models have a maximum input length (typically 512 tokens). If a chunk exceeds this limit after concatenation with the query, the model truncates it. The truncated portion -- often the end of a function -- might contain the most relevant code. Mitigation: if a chunk exceeds the max length, create two versions (beginning + query, end + query), score both, and take the maximum.

```python
def score_long_chunk(self, query: str, content: str, max_length: int = 450) -> float:
    """Score a potentially long chunk by evaluating multiple windows.

    Whitespace-separated words only approximate model tokens. Code usually
    splits into more subword tokens than words, so lower max_length if
    long chunks are still being truncated.
    """
    tokens = content.split()

    if len(tokens) <= max_length:
        return float(self.model.predict([[query, content]])[0])

    # Build windows over the beginning and end of the chunk
    # (joining on spaces drops the original line breaks and indentation)
    beginning = ' '.join(tokens[:max_length])
    ending = ' '.join(tokens[-max_length:])

    # Score both windows in one batch and keep the better score
    scores = self.model.predict([[query, beginning], [query, ending]])

    return float(max(scores))
```

**Empty or near-empty chunks.** A one-line import statement or a bare `pass` in a placeholder function produces a trivially low cross-encoder score regardless of the query. These chunks should be filtered before reranking to avoid wasting cross-encoder capacity.
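A pre-filter for such chunks can be a simple line-counting heuristic. The sketch below assumes Python source and candidates shaped as dicts with a `content` key; `is_trivial_chunk`, `drop_trivial_chunks`, and the `min_code_lines` default are hypothetical and should be tuned for your languages and chunking strategy.

```python
def is_trivial_chunk(content: str, min_code_lines: int = 2) -> bool:
    """Heuristic: a chunk is trivial if, after dropping blank lines, comments,
    imports, and bare `pass` / `...` lines, fewer than min_code_lines remain."""
    meaningful = []
    for line in content.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue
        if stripped.startswith(('import ', 'from ')) or stripped in ('pass', '...'):
            continue
        meaningful.append(stripped)
    return len(meaningful) < min_code_lines

def drop_trivial_chunks(candidates: list[dict]) -> list[dict]:
    """Remove trivial candidates before they reach the cross-encoder."""
    return [c for c in candidates if not is_trivial_chunk(c.get('content', ''))]

candidates = [
    {'id': 'imports', 'content': 'import os'},
    {'id': 'stub', 'content': 'def placeholder():\n    pass'},
    {'id': 'impl', 'content': 'def validate(data):\n    schema.validate(data)\n    return data'},
]
kept = drop_trivial_chunks(candidates)
```

Filtering before reranking also frees up slots in the candidate set, so a fixed budget of 20-30 pairs is spent on chunks that could plausibly be the answer.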

**Code in comments vs. code in functions.** A cross-encoder trained on natural language might score a well-written comment about authentication higher than the actual authentication function. This is because the comment uses the same vocabulary as the query (natural language), while the function uses code syntax. The solution is to train or fine-tune the cross-encoder on code-specific relevance data, or to weight cross-encoder scores against the chunk's type (code vs. comment).

### Candidate Set Size: How Many to Rerank?

The number of candidates sent to the cross-encoder is a tunable parameter with a direct speed/accuracy trade-off:

| Candidates | Reranking Latency | Top-1 Accuracy |
|------------|------------------|----------------|
| 10 | 0.7ms | 82% |
| 20 | 1.4ms | 86% |
| 30 | 2.1ms | 87% |
| 50 | 3.5ms | 87.5% |
| 100 | 7.0ms | 88% |

The accuracy curve flattens quickly. Going from 10 to 20 candidates gains 4 points. Going from 20 to 100 gains only 2 points. The diminishing returns make 20-30 candidates the optimal range for most use cases -- good accuracy without blowing the latency budget.

If your hybrid retrieval has high recall@20 (above 0.90), reranking 20 candidates is sufficient. If recall@20 is lower (0.80-0.85), consider reranking 30-50 to give the cross-encoder a better chance of seeing the correct result.
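That rule of thumb can be written down as a small selection function over measured recall values. This is a sketch under the assumption that you have already measured recall at a few candidate counts on your evaluation set; `pick_candidate_count` and the sample recall figures are hypothetical.

```python
def pick_candidate_count(recall_by_k: dict[int, float], target_recall: float = 0.90) -> int:
    """Return the smallest measured candidate count whose recall meets the target.

    Falls back to the largest measured count if none reaches the target.
    Expects a non-empty mapping of candidate count -> measured recall.
    """
    for k in sorted(recall_by_k):
        if recall_by_k[k] >= target_recall:
            return k
    return max(recall_by_k)

# Hypothetical measurements for two different retrieval setups
strong_retrieval = {20: 0.92, 30: 0.95, 50: 0.97}
weaker_retrieval = {20: 0.83, 30: 0.91, 50: 0.94}

count_strong = pick_candidate_count(strong_retrieval)
count_weaker = pick_candidate_count(weaker_retrieval)
```

The cross-encoder can only promote a result it is shown, so recall at the chosen candidate count is the ceiling on post-reranking accuracy.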

---

### Exercise

> **Try This**
>
> Add cross-encoder reranking to your pipeline:
>
> 1. Install `sentence-transformers` and load the `cross-encoder/ms-marco-MiniLM-L-6-v2` model.
> 2. Take the top 25 results from your hybrid search and rerank them.
> 3. For each of your benchmark queries, record the rank changes: which results moved up? Which moved down?
> 4. Look at the results that dropped the most. Are they test files, old implementations, or generic utilities? These are the hard negatives the cross-encoder caught.

---

### Key Takeaways

- Cross-encoders process query and document together, enabling fine-grained relevance judgments that bi-encoders cannot make.
- The two-stage pattern (bi-encoder for recall, cross-encoder for precision) is the industry standard for balancing speed and accuracy.
- Cross-encoder reranking resolves the "mushy middle" -- the cluster of similarly-scored candidates where the right answer hides.
- Top-1 accuracy jumps from 64% to 87% with reranking, while adding only 1-3ms of latency.
- Test mocks, old implementations, and generic utilities are the most common hard negatives that cross-encoders catch.

---

## Chapter 2: Adaptive Thresholds

### Chapter Overview

The pipeline now produces a precisely ordered list of results. But how many should it return? Static `top_k` is a guess that fails in both directions. Adaptive thresholds dynamically determine the relevance cutoff for each query, returning exactly as many results as are truly relevant. This material expands on Episode 11 and architecture article 39 of the Code Search, Decoded series.

---

### The top_k Trap

Most retrieval systems use a static `top_k`. Set it to 20 and move on. The logic feels reasonable: return enough results that you probably capture everything relevant, let the LLM sort it out.

In practice, static `top_k` fails in both directions.

**Over-retrieval.** Query: "rate limiter configuration." Two files define the rate limiter. The other 18 results are tangentially related imports, test helpers, and a README that mentions "rate" once. The LLM receives 40,000 tokens. It hallucinates connections between the rate limiter and an unrelated caching layer. You debug for twenty minutes before realizing the answer was in the first two results.

**Under-retrieval.** Query: "all error handling middleware." Fifteen files implement error handlers across the stack. Static `top_k=20` captures most of them -- but only because you got lucky with the number. Set it to 10 and you miss at least five.

Static `top_k` is a guess. Guesses do not scale.

### Reading the Score Distribution

After cross-encoder reranking, every result has a confidence score, assumed here to lie between 0 and 1 (apply the sigmoid calibration from Chapter 1 if your model emits raw logits). These scores are not uniformly distributed. They cluster. And the clustering pattern tells you where relevance ends.

```
Result 1:  0.93  ████████████████████████████████████████
Result 2:  0.91  ██████████████████████████████████████
Result 3:  0.89  ████████████████████████████████████
Result 4:  0.62  ████████████████████████████
Result 5:  0.58  █████████████████████████
Result 6:  0.54  ██████████████████████
Result 7:  0.51  ████████████████████
Result 8:  0.47  ██████████████████
```

There is a cliff between results 3 and 4. A 0.27-point drop. That cliff is the cross-encoder telling you, with high confidence, that results 1-3 are about your query and results 4-8 are about something adjacent.

An adaptive threshold catches that cliff automatically. Static `top_k` does not even look for it.

### The Math

The simplest effective approach is a relative margin off the top score:

```
threshold = top_score * (1 - relative_margin)
```

With a default `relative_margin` of 0.15 and a top score of 0.93:

```
threshold = 0.93 * (1 - 0.15) = 0.93 * 0.85 = 0.7905
```

Results 1-3 (0.93, 0.91, 0.89) pass. Results 4-8 (0.62 and below) do not. Three results instead of twenty.

The relative margin means the threshold scales with query confidence. A strong match (top score 0.95) produces a high threshold (0.808). A weaker exploratory query (top score 0.70) produces a lower threshold (0.595), allowing more results through. The system adapts to the query, not to a hardcoded number.

### Implementation

```python
from dataclasses import dataclass

@dataclass
class ThresholdConfig:
    relative_margin: float = 0.15    # How far below top score to cut
    min_results: int = 1             # Always return at least this many
    max_results: int = 25            # Hard cap even if all pass threshold
    absolute_floor: float = 0.3      # Never return results below this score

def apply_adaptive_threshold(
    results: list[dict],
    config: ThresholdConfig = ThresholdConfig()
) -> list[dict]:
    """Filter results using adaptive threshold based on score distribution.

    Args:
        results: Ranked results with 'cross_encoder_score' key, sorted descending.
        config: Threshold configuration.

    Returns:
        Filtered results that pass the threshold.
    """
    if not results:
        return []

    top_score = results[0]['cross_encoder_score']
    threshold = top_score * (1 - config.relative_margin)
    threshold = max(threshold, config.absolute_floor)

    filtered = []
    for i, result in enumerate(results):
        score = result['cross_encoder_score']

        if i < config.min_results:
            # Always include up to min_results
            result_copy = result.copy()
            result_copy['above_threshold'] = score >= threshold
            filtered.append(result_copy)
        elif score >= threshold and len(filtered) < config.max_results:
            result_copy = result.copy()
            result_copy['above_threshold'] = True
            filtered.append(result_copy)
        else:
            break  # Results are sorted, so no need to check further

    return filtered
```

### Alternative: Gap Detection

The relative margin approach is simple and effective. A more sophisticated alternative looks for the first significant gap between consecutive scores and cuts there:

```python
def detect_score_gap(
    results: list[dict],
    min_gap_ratio: float = 0.15,
    min_results: int = 1,
    max_results: int = 25
) -> list[dict]:
    """Find the natural breakpoint in the score distribution.

    Cuts at the first relative gap between consecutive scores that is
    at least min_gap_ratio, while keeping at least min_results.
    """
    if len(results) <= 1:
        return results

    scores = [r['cross_encoder_score'] for r in results]

    # Find gaps between consecutive scores
    gaps = []
    for i in range(len(scores) - 1):
        if scores[i] > 0:
            relative_gap = (scores[i] - scores[i + 1]) / scores[i]
            gaps.append((i + 1, relative_gap))  # (cutoff_position, gap_size)

    # Keep only gaps that meet the minimum ratio and respect min_results
    significant_gaps = [(pos, gap) for pos, gap in gaps
                        if gap >= min_gap_ratio and pos >= min_results]

    if significant_gaps:
        # Cut at the first significant gap
        cutoff = min(g[0] for g in significant_gaps)
        cutoff = min(cutoff, max_results)
    else:
        # No significant gap found; keep everything up to max_results
        cutoff = min(len(results), max_results)

    return results[:cutoff]
```

Gap detection works well when the score distribution has a clear cliff. The relative margin approach also behaves predictably when scores decline gradually and no cliff exists. In practice, the relative margin is the safer default.

### Tuning the Margin

The default 0.15 works well for most codebases. But you can tune it:

**Tighter margins (0.08-0.10):** For precise queries when you want only exact matches. Common in well-structured monorepos where naming conventions are strict.

**Wider margins (0.20-0.25):** For exploratory queries when you want to cast a broader net. Useful during onboarding, code archaeology, or when searching unfamiliar parts of the codebase.

The key advantage of the relative margin over an absolute threshold: a fixed cutoff of 0.80 works for focused queries but kills exploratory ones. A fixed cutoff of 0.60 lets exploratory queries through but floods focused queries with noise. The relative margin handles both because it anchors to the top score, not to an arbitrary number.
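The comparison is easy to reproduce. The sketch below counts how many results survive each cutoff for one focused and one exploratory query; the score lists are made-up illustration values and the helper names are hypothetical.

```python
def count_above_fixed(scores: list[float], cutoff: float) -> int:
    """Number of results at or above an absolute score cutoff."""
    return sum(1 for s in scores if s >= cutoff)

def count_above_relative(scores: list[float], relative_margin: float = 0.15) -> int:
    """Number of results at or above top_score * (1 - relative_margin)."""
    if not scores:
        return 0
    threshold = max(scores) * (1 - relative_margin)
    return sum(1 for s in scores if s >= threshold)

# Made-up reranked scores for two kinds of query
focused = [0.95, 0.93, 0.62, 0.58]            # two strong matches, then noise
exploratory = [0.70, 0.66, 0.63, 0.61, 0.40]  # four moderate matches, one weak

summary = {
    'fixed_0.80': (count_above_fixed(focused, 0.80), count_above_fixed(exploratory, 0.80)),
    'fixed_0.60': (count_above_fixed(focused, 0.60), count_above_fixed(exploratory, 0.60)),
    'relative_0.15': (count_above_relative(focused), count_above_relative(exploratory)),
}
```

With these scores, the 0.80 cutoff returns nothing for the exploratory query, the 0.60 cutoff lets a noise result into the focused query, and the relative margin keeps two results for the focused query and four for the exploratory one.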

### Edge Cases in Thresholding

Several edge cases require explicit handling:

**No strong results.** When the top score is 0.35 -- nothing in the codebase matches the query well -- the relative threshold would be 0.2975, so the `absolute_floor` of 0.30 takes over. The system should still return the best match (the `min_results` floor), but flag it as low-confidence. The downstream LLM should know it is working with weak context.

```python
# In the threshold function:
if top_score < 0.5:
    # Low-confidence query -- flag results
    for result in filtered:
        result['confidence'] = 'low'
        result['low_confidence_reason'] = (
            f'Top score {top_score:.2f} below confidence threshold 0.50'
        )
```

**All results pass.** When the query is broad ("error handling") and the codebase has error handling in many files, every result might pass the threshold. The `max_results` cap prevents the pipeline from returning 200 chunks. Set it to 25 or whatever your downstream token budget allows.

**Bimodal distributions.** Some queries produce a score distribution with two distinct clusters: a high cluster (the actual matches) and a low cluster (coincidental matches). The relative margin catches this cleanly because the gap between clusters is large. But if the margin is set too wide, results from the lower cluster leak through.

```
Result 1:  0.94
Result 2:  0.91
Result 3:  0.88
  -- gap --
Result 4:  0.72  <-- with margin 0.25, threshold = 0.94 * 0.75 = 0.705
Result 5:  0.71      Results 4 and 5 pass. Should they?
Result 6:  0.70      Just below 0.705, so filtered out
  -- gap --
Result 7:  0.42
```

With margin 0.25, results 4 and 5 pass the threshold, while result 6 (0.70) falls just short of 0.705, so the cutoff lands inside the middle cluster. Whether results 4 and 5 should pass depends on the query. For an exploratory query, yes -- they are above-average matches. For a precise query, no -- the gap between 0.88 and 0.72 is significant.

This is why the default margin of 0.15 is conservative. It catches the primary gap in most distributions. Teams that need broader recall can increase it.

**Identical scores.** If the top 5 results all score 0.91, the relative margin does not help -- there is no gap to detect. In this case, the threshold passes all of them (they are all within the margin), and the results are effectively unranked within the cluster. This is fine. If the cross-encoder could not distinguish them, they are likely equally relevant.

### Why This Matters More Than Better Embeddings

Teams spend weeks evaluating embedding models. Swapping MiniLM for a larger model might improve retrieval accuracy by 2-3%. Meaningful, but incremental.

Adaptive thresholds have a step-function impact on LLM output quality. The reason is the noise ratio.

An LLM given 3 relevant results and 0 irrelevant results produces accurate answers. An LLM given 3 relevant results and 17 irrelevant results produces hallucinations, false connections, and qualified hedging. The relevant results are identical in both cases. The noise changes the output.

This is not a linear relationship. Going from 0% noise to 10% noise barely matters. Going from 10% to 50% noise degrades output dramatically. It is a step function, and most static-`top_k` systems sit on the wrong side of it.

The token math makes the business case clear:

| Approach | Avg. tokens per query | Monthly cost (10 devs, 50 queries/day each) |
|----------|----------------------|---------------------------------------|
| Static top_k=20 | ~40,000 | ~$3,000 |
| Adaptive threshold (0.15) | ~6,000 | ~$450 |

An 85% reduction. Not from compression or summarization -- from not sending irrelevant code to the LLM. The results are better and they cost less. That combination is rare.
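
The table's figures can be reproduced with a few lines of arithmetic. The sketch below assumes 50 queries per developer per day, a 30-day month, and an input price of $5 per million tokens. The day count and the price are assumptions chosen because they reproduce the table, not figures stated in this chapter, so substitute your own provider's rates.

```python
def monthly_cost_usd(
    tokens_per_query: int,
    developers: int = 10,
    queries_per_dev_per_day: int = 50,
    days_per_month: int = 30,             # assumption
    usd_per_million_tokens: float = 5.0,  # assumption: substitute your rate
) -> float:
    """Estimate the monthly LLM input cost of retrieval context."""
    queries = developers * queries_per_dev_per_day * days_per_month
    total_tokens = queries * tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_tokens


static_cost = monthly_cost_usd(40_000)   # static top_k=20
adaptive_cost = monthly_cost_usd(6_000)  # adaptive threshold
reduction = 1 - adaptive_cost / static_cost
```

The reduction depends only on the ratio of tokens per query (6,000 / 40,000), so the 85% figure holds whatever price you plug in.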

### Implementing Threshold Monitoring

In production, track how the threshold behaves across real queries:

```python
from dataclasses import dataclass

@dataclass
class ThresholdEvent:
    query: str
    top_score: float
    threshold: float
    total_candidates: int
    passed_threshold: int
    filtered_out: int
    timestamp: float

class ThresholdMonitor:
    def __init__(self):
        self.events: list[ThresholdEvent] = []

    def record(self, event: ThresholdEvent):
        self.events.append(event)

    def summary(self) -> dict:
        if not self.events:
            return {}

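        # Events with zero candidates have no meaningful pass rate; skip them
        pass_rates = [
            e.passed_threshold / e.total_candidates
            for e in self.events
            if e.total_candidates > 0
        ]
        top_scores = [e.top_score for e in self.events]

        return {
            'total_queries': len(self.events),
            'avg_pass_rate': sum(pass_rates) / len(pass_rates) if pass_rates else 0.0,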
            'avg_top_score': sum(top_scores) / len(top_scores),
            'avg_results_returned': sum(e.passed_threshold for e in self.events) / len(self.events),
            'avg_results_filtered': sum(e.filtered_out for e in self.events) / len(self.events),
            'low_confidence_queries': sum(1 for e in self.events if e.top_score < 0.5),
            'zero_result_queries': sum(1 for e in self.events if e.passed_threshold == 0),
        }
```

Watch for these signals:

- **avg_pass_rate above 0.8**: the threshold is too permissive; tighten the margin.
- **avg_pass_rate below 0.1**: the threshold is too aggressive; widen the margin.
- **low_confidence_queries above 20% of total_queries**: the embedding model may need fine-tuning, or the queries are too far from the codebase vocabulary.
- **zero_result_queries above 5% of total_queries**: the `min_results` floor is not set, or the absolute floor threshold is too high.
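
These checks are easy to automate. The following is a minimal sketch, assuming the dictionary shape returned by `ThresholdMonitor.summary()` above; the function name `threshold_alerts` and the alert wording are hypothetical.

```python
def threshold_alerts(summary: dict) -> list[str]:
    """Translate a ThresholdMonitor summary into human-readable alerts."""
    alerts: list[str] = []
    total = summary.get('total_queries', 0)
    if total == 0:
        return alerts  # empty summary: nothing to report

    pass_rate = summary['avg_pass_rate']
    if pass_rate > 0.8:
        alerts.append("pass rate high: tighten the margin")
    elif pass_rate < 0.1:
        alerts.append("pass rate low: widen the margin")

    # The monitor reports counts, so convert them to fractions of all queries
    if summary['low_confidence_queries'] / total > 0.20:
        alerts.append("many low-confidence queries: check model fit and query vocabulary")
    if summary['zero_result_queries'] / total > 0.05:
        alerts.append("many zero-result queries: check min_results and the absolute floor")
    return alerts


example_alerts = threshold_alerts({
    'total_queries': 100,
    'avg_pass_rate': 0.9,
    'low_confidence_queries': 30,
    'zero_result_queries': 2,
})
```

Running this on a schedule (or on every N-th query) turns the monitor from a log into an early warning.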

### The Threshold as User Feedback

An underappreciated property of the adaptive threshold: it generates implicit feedback about query quality.

When a query produces a high top score (0.90+) with a clear gap, the system is confident. When a query produces a low top score (0.50-0.60) with no clear gap, the system is uncertain. This uncertainty signal can be surfaced to the user:

```python
def format_confidence_indicator(top_score: float, results_count: int) -> str:
    """Generate a human-readable confidence indicator."""
    if top_score >= 0.85 and results_count <= 5:
        return "High confidence — these results closely match your query."
    elif top_score >= 0.70:
        return "Moderate confidence — results are relevant but may not be exact matches."
    elif top_score >= 0.50:
        return "Low confidence — the codebase may not contain what you are looking for."
    else:
        return "Very low confidence — consider rephrasing your query or checking if this code exists."
```

This kind of transparency builds trust. A search tool that says "I am not confident about these results" is more trustworthy than one that presents weak results with no qualification.

---

### Exercise

> **Try This**
>
> Add adaptive thresholds to your pipeline:
>
> 1. Take your reranked results from Chapter 1 and apply the `apply_adaptive_threshold` function.
> 2. For each benchmark query, compare: how many results does static top_k=20 return vs. adaptive threshold?
> 3. Examine the results that the threshold filtered out. Are they genuinely irrelevant?
> 4. Experiment with the `relative_margin` parameter. At what value does the threshold become too aggressive (filtering out relevant results)? At what value does it become too permissive (including noise)?
> 5. Estimate the token savings: for each query, count the tokens in the filtered result set vs. the unfiltered set.

---

### Key Takeaways

- Static `top_k` fails in both directions: over-retrieves for focused queries, under-retrieves for broad queries.
- Adaptive thresholds use a relative margin off the top score, automatically adjusting to query confidence.
- Score distributions cluster with natural breakpoints. The threshold catches these breakpoints; static `top_k` ignores them.
- The impact on LLM output quality is step-function, not linear. Noise up to roughly 10% of the context barely matters; by 50% it causes hallucination and hedging.
- Token savings of 85% are typical, with better output quality. This is the rare optimization that improves results and reduces cost simultaneously.

---

# Part II: Production

---

## Chapter 3: Cold Start and Performance Engineering

### Chapter Overview

The pipeline runs in 6ms per query. But the server has to start first. Loading models, indexes, and graph structures sequentially takes 3-8 seconds. A developer tool that takes that long to start gets background-tabbed and forgotten. This chapter covers the engineering that makes semantic search feel instant. This material expands on Episode 12 and architecture article 40 of the Code Search, Decoded series.

---

### The One-Second Budget

Developer tools compete with tools that are already on the developer's screen. The editor starts in under a second. The terminal is instant. `grep` does not make you wait.

A code search tool that takes 5 seconds to start loses to grep. Not because grep is better -- it is worse by every quality metric. But grep is available now. A tool that makes you wait is not available now. Availability wins.

The performance bar is not set by other search tools. It is set by the tools already in the workflow. One second. That is the budget.

### The Inventory

Here is what has to be ready before a query can execute at full quality:

| Component | Size | Sequential Load Time |
|-----------|------|---------------------|
| Server socket | -- | ~20ms |
| BM25 index | 5-30MB | ~80ms |
| Embedding model (MiniLM) | ~100MB | ~400ms |
| Vector index (HNSW) | 50-200MB | ~300ms |
| AST graph | 10-50MB | ~200ms |
| Cross-encoder model | ~120MB | ~500ms |

Total sequential: 1.5s on a fast machine, 3-8s on a cold machine with nothing cached by the OS. That does not fit in one second.

But not everything is needed at the same time.

### Three-Phase Layered Warm-Start

The insight: the pipeline stages are not equally important for the first query, and they can load independently. A BM25-only result in 200ms is better than no result for 3 seconds.

**Phase 1 (<200ms): Socket + BM25**

The server socket opens and the BM25 index loads. At this point, you can answer keyword queries. Not semantic queries, not reranked queries -- but keyword queries that are often good enough for the first interaction.

```python
import asyncio
import time

class SearchServer:
    def __init__(self, index_path: str):
        self.index_path = index_path
        self.bm25 = None
        self.semantic = None
        self.reranker = None
        self.ast_graph = None
        self.phase = 0
        # Keep references to the loader tasks; the event loop only holds weak
        # references, so an unreferenced task can be garbage-collected mid-run
        self._loaders: list[asyncio.Task] = []

    async def start(self):
        """Start the server with phased loading."""
        start = time.perf_counter()

        # Phase 1: BM25 (immediate capability)
        self.bm25 = load_bm25_index(self.index_path)
        self.phase = 1
        phase1_time = (time.perf_counter() - start) * 1000
        print(f"Phase 1 ready: BM25 ({phase1_time:.0f}ms)")

        # Start Phase 2 and 3 loading concurrently
        self._loaders = [
            asyncio.create_task(self._load_phase2()),
            asyncio.create_task(self._load_phase3()),
        ]

    def _advance_phase(self):
        """Raise the phase only as far as the loaded components allow.

        The loaders can finish in either order, so Phase 3 is declared only
        once the semantic index, AST graph, and reranker are all present.
        """
        if self.semantic is not None:
            self.phase = max(self.phase, 2)
            if self.ast_graph is not None and self.reranker is not None:
                self.phase = 3

    async def _load_phase2(self):
        """Load embedding model and vector index."""
        start = time.perf_counter()
        self.semantic = await asyncio.to_thread(
            load_semantic_index, self.index_path
        )
        self._advance_phase()
        elapsed = (time.perf_counter() - start) * 1000
        print(f"Phase 2 loaded: Semantic search ({elapsed:.0f}ms)")

    async def _load_phase3(self):
        """Load AST graph and cross-encoder."""
        start = time.perf_counter()
        self.ast_graph = await asyncio.to_thread(
            load_ast_graph, self.index_path
        )
        self.reranker = await asyncio.to_thread(
            load_cross_encoder
        )
        self._advance_phase()
        elapsed = (time.perf_counter() - start) * 1000
        print(f"Phase 3 loaded: AST graph + cross-encoder ({elapsed:.0f}ms)")

    def search(self, query: str, top_k: int = 20) -> list[dict]:
        """Search using whatever stages are currently loaded."""
        if self.phase >= 3:
            # Full pipeline
            return self._full_pipeline_search(query, top_k)
        elif self.phase >= 2:
            # Hybrid search without reranking
            return self._hybrid_search(query, top_k)
        elif self.phase >= 1:
            # BM25 only
            return self.bm25.search(query, top_k)
        else:
            return []  # Server still starting
```

**Phase 2 (<600ms): Embeddings + Vector Index**

The embedding model loads (from cache if available) and the HNSW vector index memory-maps into place. Now the server can run hybrid BM25 + semantic search. Any query issued after Phase 2 completes gets keyword matching plus semantic similarity.

**Phase 3 (<1s): AST Graph + Cross-Encoder**

The AST graph loads and the cross-encoder model initializes. Now the full pipeline is active: AST boosting, hybrid search, cross-encoder reranking, adaptive thresholds.

Loading never blocks on the previous phase: the Phase 2 and Phase 3 loaders run concurrently. Queries issued while Phase 2 is loading get Phase 1 quality. Queries issued while Phase 3 is loading get Phase 2 quality. Nothing waits.

### Memory-Mapped Indexes

The vector index and AST graph use memory-mapped files. This matters more than it sounds.

A traditional approach loads the entire index into RAM before serving queries. For a 200MB HNSW index, that means reading 200MB from disk, allocating 200MB of heap memory, and copying the data. Even on NVMe, that is hundreds of milliseconds.

Memory-mapped files skip the copy. The operating system maps the file into the process's virtual address space. Pages load on demand when accessed. For the first query, only the pages containing the relevant portion of the graph get loaded -- typically 2-5MB out of 200MB.

```python
import os
import numpy as np

def load_vectors_mmap(index_path: str, dimension: int) -> np.ndarray:
    """Load vector index using memory-mapped file access.

    Only pages accessed during queries are loaded into physical RAM.
    """
    vector_file = os.path.join(index_path, "vectors.bin")

    # Validate the layout up front: a truncated or empty file would otherwise
    # fail later with a less helpful reshape or mmap error
    row_bytes = dimension * np.dtype(np.float32).itemsize  # 4 bytes per float32
    file_size = os.path.getsize(vector_file)
    if file_size == 0 or file_size % row_bytes != 0:
        raise ValueError(
            f"{vector_file}: size {file_size} is not a multiple of {row_bytes} bytes"
        )

    # np.memmap returns an ndarray backed by the memory-mapped file (read-only)
    return np.memmap(
        vector_file,
        dtype=np.float32,
        mode="r",
        shape=(file_size // row_bytes, dimension),
    )
```

The compounding benefit: the OS file cache keeps hot pages in RAM across server restarts. A fresh start after the first one is faster because the OS already has the index pages cached.
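
The on-demand behavior is easy to see with the standard library alone. The sketch below writes three small float32 vectors to a temporary file, maps it, and decodes a single row by byte offset without reading the rest of the file into the process. The 4-dimensional vectors and the `read_vector` helper are illustrative assumptions; the exact number of pages the OS brings in also depends on its read-ahead policy.

```python
import mmap
import os
import struct
import tempfile

DIMENSION = 4  # tiny for illustration; the chapter's model uses 384

def read_vector(mm: mmap.mmap, row: int, dimension: int) -> tuple[float, ...]:
    """Decode one little-endian float32 vector directly from the mapped file."""
    offset = row * dimension * 4  # 4 bytes per float32
    return struct.unpack_from(f"<{dimension}f", mm, offset)

# Write three vectors to a temporary file in row-major float32 layout
rows = [(0.0, 0.25, 0.5, 0.75), (1.0, 1.5, 2.0, 2.5), (-1.0, -0.5, 0.0, 0.5)]
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    for row in rows:
        f.write(struct.pack(f"<{DIMENSION}f", *row))
    path = f.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    num_vectors = len(mm) // (DIMENSION * 4)
    second = read_vector(mm, 1, DIMENSION)  # only row 1's bytes are decoded

os.remove(path)
```

The same offset arithmetic is what a memory-mapped NumPy array does internally when you index a row.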

### Model Caching

Model loading is the largest bottleneck in the inventory. For the embedding model, loading from original model files, parsing weights, and initializing the inference runtime takes ~800ms the very first time.

The optimization: compile the model into an optimized format and cache it locally. On subsequent starts, loading the compiled model takes ~150ms.

```python
from pathlib import Path

from sentence_transformers import SentenceTransformer

CACHE_DIR = Path('.search_index/cache/models')

def load_model_cached(model_name: str) -> SentenceTransformer:
    """Load embedding model with caching for fast subsequent starts."""
    cache_path = CACHE_DIR / model_name.replace('/', '_')

    if cache_path.exists():
        # Fast path: load the local copy, no hub lookup or download
        return SentenceTransformer(str(cache_path))

    # Slow path: resolve the model by name (downloads on first use), then
    # save a local copy. This sketch only saves the model files; exporting
    # to an optimized runtime format is a separate step and is not shown.
    model = SentenceTransformer(model_name)

    cache_path.mkdir(parents=True, exist_ok=True)
    model.save(str(cache_path))

    return model
```

Combined, model caching cuts 1+ second off every start after the first.

### Startup Across Hardware

Cold start performance varies with hardware, but the phase budget holds on typical developer machines:

| Machine | Phase 1 | Phase 2 | Phase 3 |
|---------|---------|---------|---------|
| M2 MacBook Pro | 95ms | 340ms | 620ms |
| Linux workstation (NVMe) | 110ms | 390ms | 710ms |
| Older laptop (SATA SSD) | 180ms | 560ms | 940ms |
| CI runner (shared, cold cache) | 190ms | 680ms | 1,180ms |

The SATA SSD case is instructive. Phase 1 still makes the 200ms budget because BM25 indexes are small. Phase 3 is tight but usually under 1 second. The CI runner with a cold cache is the worst case: it misses the Phase 2 and Phase 3 budgets (680ms and 1,180ms), but even there, first query capability arrives in under 200ms.

### Correctness During Transitions

Layered warm-start sounds straightforward. Load things in parallel, serve partial results. The hard part is correctness during transitions.

A query that starts during Phase 1 and completes during Phase 2 must not mix BM25-only scores with hybrid scores for the same result set. A query that arrives exactly as Phase 3 finishes must see the cross-encoder applied to all results, not just the ones that were reranked before the transition.

The solution is snapshot-based queries: each query captures the current phase at the moment it starts and uses exactly those components throughout. Even if Phase 3 completes mid-query, the query uses Phase 2 components because that was the snapshot when it began.

```python
def search_with_phase_snapshot(self, query: str, top_k: int = 20) -> list[dict]:
    """Search using a frozen snapshot of the current phase."""
    # Capture current phase components
    phase = self.phase
    bm25 = self.bm25
    semantic = self.semantic
    reranker = self.reranker
    ast_graph = self.ast_graph

    # Execute with frozen snapshot -- no phase transitions during the query
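    # Compare against None explicitly: an index object that defines __len__
    # could be falsy when empty, and Phase 3 also needs the semantic index
    if (phase >= 3 and semantic is not None
            and reranker is not None and ast_graph is not None):
        return full_pipeline(query, bm25, semantic, ast_graph, reranker, top_k)
    elif phase >= 2 and semantic is not None:
        return hybrid_search(query, bm25, semantic, top_k)
    elif phase >= 1 and bm25 is not None:
        return bm25.search(query, top_k)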
    return []
```

### HNSW: The Index That Makes Search Fast

The vector index is the most performance-critical data structure in the pipeline. Naive nearest-neighbor search compares the query vector against every vector in the index -- `O(n)` time. For 200,000 vectors at 384 dimensions, that is 200,000 dot products, each requiring 384 multiplications and 383 additions. On modern hardware, this takes ~20ms. Acceptable for small codebases. Not for large ones.

HNSW (Hierarchical Navigable Small World) is a graph-based approximate nearest-neighbor algorithm that achieves `O(log n)` query time. The idea:

1. Build a multi-layer graph where each layer is a subset of the vectors. The top layer has very few nodes (a coarse map). The bottom layer has all nodes (full resolution).

2. Start the search at the top layer. Find the closest node using the coarse map. Drop down to the next layer, starting from that node. Repeat until you reach the bottom layer.

3. At each layer, traverse the graph greedily: move to the neighbor closest to the query, repeat until no neighbor is closer. The graph structure ensures this greedy search finds a good (though not guaranteed optimal) result.

The trade-off is accuracy vs. speed. HNSW provides approximate nearest neighbors -- it might miss the true closest vector if the graph structure leads the search astray. In practice, with default parameters, HNSW finds the exact nearest neighbor 95%+ of the time. When it misses, the result is typically still very close to the true nearest neighbor.
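
The three steps above can be sketched in a few lines of pure Python. This is a toy illustration under simplifying assumptions, not how a production HNSW library works: the two layers are hand-built over ten 1-D points, and the descent keeps a single best candidate, whereas real implementations assign layers randomly and track a candidate list whose width is set by `ef`. All names here are hypothetical.

```python
import math

def brute_force(vectors: dict, query: tuple) -> int:
    """Exact O(n) baseline: compare the query against every vector."""
    return min(vectors, key=lambda node: math.dist(vectors[node], query))

def greedy_search(graph: dict, vectors: dict, query: tuple, entry: int) -> int:
    """Move to the closest neighbor until no neighbor is closer."""
    current = entry
    while True:
        closest = min(
            graph[current],
            key=lambda n: math.dist(vectors[n], query),
            default=current,
        )
        if math.dist(vectors[closest], query) >= math.dist(vectors[current], query):
            return current
        current = closest

def layered_search(layers: list, vectors: dict, query: tuple, entry: int) -> int:
    """Descend from the coarsest layer to the full-resolution layer."""
    node = entry
    for graph in layers:
        node = greedy_search(graph, vectors, query, node)
    return node

# Ten points on a line; the top layer links only a sparse subset of them
vectors = {i: (float(i),) for i in range(10)}
top_layer = {0: [5], 5: [0, 9], 9: [5]}
bottom_layer = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}

query = (6.2,)
approx = layered_search([top_layer, bottom_layer], vectors, query, entry=0)
exact = brute_force(vectors, query)
```

Starting from node 0, the top layer jumps straight to node 5, and the bottom layer needs one more hop to reach node 6, the same answer the brute-force scan finds after ten comparisons.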

The parameters that matter:

```python
# ChromaDB HNSW configuration
collection = client.create_collection(
    name="code_search",
    metadata={
        "hnsw:space": "cosine",       # Distance metric
        "hnsw:M": 16,                  # Number of connections per node (default 16)
        "hnsw:construction_ef": 200,   # Build-time search width (higher = slower build, better graph)
        "hnsw:search_ef": 100,         # Query-time search width (higher = slower query, better recall)
    }
)
```

`M` controls the graph's connectivity. Higher M means more connections per node, better recall, but larger index size and slower builds. M=16 is the standard default.

`construction_ef` controls how thorough the graph construction is. Higher values build a better graph but take longer. 200 is a good balance.

`search_ef` controls how thorough the query-time search is. Higher values find better results but take longer. 100 gives >99% recall for most distributions. Reduce to 50 for faster queries at slight accuracy cost.

### Warm-up Queries

A trick for reducing perceived latency: issue a few warm-up queries during Phase 3 loading. These queries touch the vector index and load the hot pages into the OS cache. Subsequent real queries benefit from the cached pages.

```python
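import asyncio

WARMUP_QUERIES = [
    "main entry point",
    "database connection",
    "error handling",
    "configuration",
    "authentication",
]

async def warmup_index(self):
    """Issue warm-up queries to pre-load hot index pages.

    Only touches the vector index once Phase 2 has finished loading.
    """
    for query in WARMUP_QUERIES:
        try:
            # Run in a worker thread so warm-up does not block the event loop
            await asyncio.to_thread(self.search, query, 5)
        except Exception:
            pass  # Warm-up failures are non-fatal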
```

Five warm-up queries load roughly 10-20MB of index pages into the OS cache. The latency of the sixth query (the first real query) drops by 30-50% compared to a completely cold index.

### Query-Level Performance

Once the server is running, individual query performance matters. The pipeline has six stages, each with its own latency characteristics:

| Stage | Typical Latency | What Determines It |
|-------|----------------|-------------------|
| AST boost | 0.3-0.5ms | Graph size, anchor count |
| BM25 | 0.5-1.5ms | Corpus size, query term count |
| Semantic search | 1.5-3.0ms | Vector index size, model encode time |
| RRF fusion | <0.1ms | Candidate count (sort operation) |
| Cross-encoder | 1.0-2.0ms | Candidate count, model size |
| Adaptive threshold | <0.1ms | Candidate count (linear scan) |
| **Total** | **3.5-7.0ms** | |

The semantic search stage dominates because it includes the query encoding step (one forward pass through MiniLM). Caching frequent queries eliminates this cost:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def encode_query_cached(query: str) -> tuple[float, ...]:
    """Cache query embeddings for repeated queries."""
    embedding = model.encode(query)
    return tuple(embedding.tolist())
```

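An LRU cache of 1,024 queries holds ~1.5MB of raw embedding data (384 floats * 4 bytes * 1024). Stored as tuples of Python floats, as in the function above, the real footprint on 64-bit CPython is closer to 12MB, because each value costs a 24-byte float object plus an 8-byte reference. Either way the cost is small, and the cache eliminates the encoding step for repeated queries. In practice, developers often repeat queries within a session. Because the cache is keyed on the exact query string, only exact repeats hit; reworded variations are encoded again.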
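
Tuples of Python floats carry per-object overhead well beyond 4 bytes per value. If the cache footprint matters, one option is to cache the embedding as packed float32 bytes instead. The sketch below uses only the standard library; `fake_encode` is a stand-in for `model.encode()` so the example runs on its own, and the function names are hypothetical.

```python
from array import array
from functools import lru_cache

EMBEDDING_DIM = 384

def fake_encode(query: str) -> list[float]:
    """Stand-in for model.encode(); returns a deterministic dummy vector."""
    return [float(len(query) + i) for i in range(EMBEDDING_DIM)]

@lru_cache(maxsize=1024)
def encode_query_packed(query: str) -> bytes:
    """Cache the embedding as raw float32 bytes (immutable, 4 bytes per value)."""
    return array('f', fake_encode(query)).tobytes()

def unpack_embedding(packed: bytes) -> list[float]:
    """Restore a list of floats from the packed representation."""
    values = array('f')
    values.frombytes(packed)
    return values.tolist()

packed = encode_query_packed("database connection")
restored = unpack_embedding(packed)
```

Each cached entry is 1,536 bytes of payload plus a small fixed object header, which keeps a full 1,024-entry cache close to the 1.5MB figure.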

### Profiling the Pipeline

Before optimizing, measure. A profiling wrapper reveals where time is actually spent:

```python
import time
from contextlib import contextmanager

class PipelineProfiler:
    def __init__(self):
        self.timings: dict[str, list[float]] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the timing even if the stage raises
            elapsed = (time.perf_counter() - start) * 1000
            self.timings.setdefault(name, []).append(elapsed)

    def report(self) -> str:
        lines = ["Pipeline Profile:"]
        total = 0
        for name, times in self.timings.items():
            avg = sum(times) / len(times)
            p95 = sorted(times)[int(len(times) * 0.95)] if len(times) >= 20 else max(times)
            total += avg
            lines.append(f"  {name:30s}  avg: {avg:6.2f}ms  p95: {p95:6.2f}ms")
        lines.append(f"  {'TOTAL':30s}  avg: {total:6.2f}ms")
        return '\n'.join(lines)

# Usage in the search pipeline:
profiler = PipelineProfiler()

def profiled_search(query, top_k=20):
    with profiler.stage("ast_boost"):
        boosts = compute_ast_boosts(query, graph, chunks)

    with profiler.stage("bm25"):
        bm25_results = bm25.search(query, top_k * 2)

    with profiler.stage("semantic_encode"):
        query_embedding = model.encode(query)

    with profiler.stage("semantic_search"):
        semantic_results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=top_k * 2)

    with profiler.stage("rrf_fusion"):
        fused = reciprocal_rank_fusion([bm25_results, semantic_results])

    with profiler.stage("apply_boosts"):
        boosted = apply_ast_boosts(fused, boosts)

    with profiler.stage("cross_encoder"):
        reranked = reranker.rerank(query, boosted[:25])

    with profiler.stage("threshold"):
        filtered = apply_adaptive_threshold(reranked)

    return filtered
```

After 100 queries, `profiler.report()` shows exactly where the time goes. Common findings:

- Semantic encoding dominates (40-60% of total time). Caching helps.
- Cross-encoder is the second largest (20-30%). Reducing candidate count helps.
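- RRF fusion and the adaptive threshold are negligible (<5% combined). No optimization needed.
- BM25 is a modest share (roughly 15-20%, going by the stage table above) and rarely worth optimizing before the two model stages.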
- AST boost depends on graph size. For small codebases, negligible. For large codebases, 10-15%.
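
To compare your own numbers against these findings, convert the profiler's averages into shares of the total. The sketch below is a small helper with a hypothetical name, `stage_shares`; the example input uses midpoints of the ranges in the stage latency table (with 0.1ms standing in for "<0.1ms"), so it is illustrative rather than measured.

```python
def stage_shares(avg_ms: dict[str, float]) -> dict[str, float]:
    """Return each stage's fraction of total average latency, largest first."""
    total = sum(avg_ms.values())
    if total == 0:
        return {name: 0.0 for name in avg_ms}
    ranked = sorted(avg_ms.items(), key=lambda item: item[1], reverse=True)
    return {name: ms / total for name, ms in ranked}

# Midpoints of the ranges in the stage latency table (illustrative, not measured)
example = {
    "ast_boost": 0.4,
    "bm25": 1.0,
    "semantic_search": 2.25,
    "rrf_fusion": 0.1,
    "cross_encoder": 1.5,
    "threshold": 0.1,
}
shares = stage_shares(example)
```

In real use, feed it `{name: sum(t) / len(t) for name, t in profiler.timings.items()}` after a representative batch of queries.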

### Caching Strategies Beyond Query Embedding

Query embedding caching is the highest-impact single optimization. But other caching strategies compound:

**Result caching.** Cache the full pipeline result for exact query strings. If the index has not changed since the cached result was computed, the cached result is still valid. Invalidate the cache on index updates.

```python
class CachedPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.index_version = 0
        # Unbounded for simplicity; cap the size in a long-running server
        self._cache: dict[tuple, list[dict]] = {}

    def search(self, query: str, top_k: int = 20) -> list[dict]:
        cache_key = (query, top_k, self.index_version)
        if cache_key in self._cache:
            # Return a copy so callers cannot mutate the cached list
            return list(self._cache[cache_key])

        results = self.pipeline.search(query, top_k)
        self._cache[cache_key] = list(results)
        return results

    def invalidate(self):
        """Call after index updates."""
        self.index_version += 1
        self._cache.clear()
```

**BM25 posting list caching.** The inverted index lookups for common terms can be cached. Since the inverted index changes only on index updates, the cache is valid for the entire session between updates.

**AST graph neighbor caching.** The BFS traversal in AST boosting visits the same subgraphs for related queries. Caching the neighbors of popular anchor nodes avoids redundant graph traversals.
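
A minimal sketch of neighbor caching, assuming the graph is available as a plain adjacency dictionary. The names `make_neighbor_lookup` and `edges`, the example symbols, and the 2-hop default are all illustrative; the real graph structure in your pipeline may differ.

```python
from collections import deque
from functools import lru_cache

def make_neighbor_lookup(edges: dict[str, set[str]], max_depth: int = 2):
    """Build a cached BFS lookup over a fixed snapshot of the graph."""

    @lru_cache(maxsize=4096)
    def neighbors(anchor: str) -> frozenset[str]:
        seen = {anchor}
        queue = deque([(anchor, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue  # do not expand beyond the depth limit
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
        return frozenset(seen - {anchor})

    return neighbors

edges = {
    "auth.login": {"auth.verify_token", "db.get_user"},
    "auth.verify_token": {"crypto.hmac_check"},
    "db.get_user": {"db.connect"},
    "db.connect": {"config.load"},
}
lookup = make_neighbor_lookup(edges)
first = lookup("auth.login")
second = lookup("auth.login")  # served from the cache, no traversal
```

Because the cache is bound to one graph snapshot, the simplest invalidation is to build a fresh lookup whenever the index is updated.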

---

### Exercise

> **Try This**
>
> Implement the phased startup pattern:
>
> 1. Time each component's load individually: BM25 index, embedding model, vector index, AST graph, cross-encoder.
> 2. Implement the three-phase loading with concurrent execution.
> 3. Measure the time to first query capability under each phase.
> 4. Add query embedding caching and measure the latency improvement for repeated queries.
> 5. Simulate a Phase 1 query arriving before Phase 2 completes. Does the result degrade gracefully?

---

### Key Takeaways

- The one-second budget is set by the tools already on the developer's screen, not by other search tools.
- Three-phase layered warm-start provides BM25 in <200ms, hybrid search in <600ms, and full pipeline in <1s.
- Memory-mapped indexes skip the full load, accessing pages on demand. The OS file cache makes subsequent starts faster.
- Model caching cuts 1+ second off every start after the first.
- Correctness during phase transitions requires snapshot-based queries that freeze the component set at query start time.

---

## Chapter 4: Incremental Indexing at Scale

### Chapter Overview

The pipeline assumes the index exists and is current. Building it is the bottleneck. Full reindexing takes minutes for large codebases. Incremental indexing takes seconds but has sharp edges around deletions, renames, and cross-file dependencies. This chapter covers how to build and maintain indexes that stay fresh without blocking the developer. This material expands on Episode 13 and architecture article 56 of the Code Search, Decoded series.

---

### The Full Reindex Problem

A 50,000-file repository produces 200,000+ chunks after language-aware parsing. Each chunk needs an embedding vector, a BM25 entry, and AST graph edges. Full reindexing means:

- Parsing 50K files into AST-aware chunks: 30-90 seconds
- Computing 200K+ embedding vectors: 3-8 minutes on CPU
- Building the HNSW vector index: 30-60 seconds
- Building the BM25 inverted index: 5-10 seconds
- Computing AST graph relationships: 20-40 seconds

Total: roughly 5-10 minutes on a modern machine, most of it spent computing embeddings. Acceptable for a one-time setup. Unacceptable every time you pull main.

### Git-Aware Change Detection

The foundation of incremental indexing is knowing what changed. For git-tracked repositories, this is a solved problem:

```python
import subprocess
from pathlib import Path

def _git(repo_path: str, *args: str) -> str:
    """Run a git command in the repo and return stdout; raises on failure."""
    result = subprocess.run(
        ['git', *args],
        cwd=repo_path, capture_output=True, text=True, check=True
    )
    return result.stdout

def detect_changes(repo_path: str, last_indexed_commit: str | None) -> dict:
    """Detect file changes since last index build using git.

    Returns:
        Dict with 'added', 'modified', 'deleted', 'renamed' file lists.
    """
    changes = {'added': [], 'modified': [], 'deleted': [], 'renamed': []}

    if last_indexed_commit is None:
        # First index -- everything is new. splitlines() yields [] for an
        # empty repository, where split('\n') would yield [''].
        changes['added'] = _git(repo_path, 'ls-files').splitlines()
        return changes

    # Get committed changes since the last indexed commit
    output = _git(
        repo_path, 'diff', '--name-status', '-M', last_indexed_commit, 'HEAD'
    )

    for line in output.splitlines():
        if not line:
            continue
        parts = line.split('\t')
        status = parts[0]

        if status == 'A':
            changes['added'].append(parts[1])
        elif status in ('M', 'T'):
            # 'T' is a type change (e.g. file to symlink); treat it as modified
            changes['modified'].append(parts[1])
        elif status == 'D':
            changes['deleted'].append(parts[1])
        elif status.startswith('R'):
            # Rename: old_path -> new_path; the status carries the similarity
            changes['renamed'].append({
                'old': parts[1],
                'new': parts[2],
                'similarity': int(status[1:]) if len(status) > 1 else 100,
            })

    return changes
```

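The `-M` flag enables rename detection. Git reports a rename when the content similarity exceeds a threshold (default 50%) and includes the similarity score in the status (for example `R100`). A file renamed with 100% similarity updates its path in the index without recomputing embeddings -- the content has not changed, so the vectors are still valid. A rename below 100% has also been edited, so its chunks need re-embedding; this is why the similarity score is carried along with each rename.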
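
One way to act on the similarity score is a small decision function. This is a sketch under the assumption that renames arrive in the shape produced by `detect_changes`; the function name `plan_rename` and the returned action labels are hypothetical.

```python
def plan_rename(rename: dict) -> str:
    """Decide how much work a detected rename needs.

    `rename` has the shape produced by detect_changes():
    {'old': str, 'new': str, 'similarity': int}.
    """
    if rename['similarity'] == 100:
        # Content is byte-identical, so existing vectors remain valid
        return 'update_path_only'
    # Content changed alongside the rename, so chunks must be rebuilt
    return 'update_path_and_reembed'


pure = plan_rename({'old': 'src/util.py', 'new': 'src/utils.py', 'similarity': 100})
edited = plan_rename({'old': 'src/a.py', 'new': 'src/b.py', 'similarity': 87})
```

A looser cutoff (say, treating 98%+ as unchanged) trades a little staleness for less re-embedding; that choice depends on how sensitive your chunks are to small edits.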

### The Incremental Index Pipeline

```python
class IncrementalIndexer:
    def __init__(
        self,
        repo_path: str,
        index_path: str,
        model: SentenceTransformer,
        chunker,
        bm25_index: CodeBM25,
        vector_collection: chromadb.Collection,
        ast_graph: ASTGraph
    ):
        self.repo_path = repo_path
        self.index_path = index_path
        self.model = model
        self.chunker = chunker
        self.bm25 = bm25_index
        self.vectors = vector_collection
        self.graph = ast_graph
        self.manifest = self._load_manifest()

    def update(self) -> dict:
        """Run incremental index update. Returns statistics."""
        changes = detect_changes(self.repo_path, self.manifest.get('last_commit'))
        stats = {'added': 0, 'modified': 0, 'deleted': 0, 'chunks_processed': 0}

        # Handle deletions first -- remove stale entries
        for file_path in changes['deleted']:
            self._remove_file(file_path)
            stats['deleted'] += 1

        # Handle renames -- update paths without re-embedding
        for rename in changes['renamed']:
            self._rename_file(rename['old'], rename['new'], rename['similarity'])

        # Handle additions and modifications -- re-chunk and re-embed
        files_to_process = changes['added'] + changes['modified']

        # Add cross-file dependents that need AST refresh
        dependents = self._find_ast_dependents(files_to_process)
        files_to_process.extend(dependents)
        files_to_process = list(set(files_to_process))  # Deduplicate

        for file_path in files_to_process:
            full_path = Path(self.repo_path) / file_path
            if not full_path.exists():
                continue

            content = full_path.read_text(encoding='utf-8', errors='ignore')
            chunks = self.chunker(content, file_path)

            # Remove old chunks for this file
            self._remove_file(file_path)

            # Add new chunks
            for chunk in chunks:
                self._add_chunk(chunk)
                stats['chunks_processed'] += 1

            stats['added' if file_path in changes['added'] else 'modified'] += 1

        # Update manifest
        self.manifest['last_commit'] = self._get_head_commit()
        self._save_manifest()

        return stats

    def _find_ast_dependents(
        self,
        changed_files: list[str],
        max_depth: int = 2
    ) -> list[str]:
        """Find files that depend on changed files via import graph."""
        dependents = set()
        frontier = set(changed_files)

        for depth in range(max_depth):
            next_frontier = set()
            for file_path in frontier:
                for importer in self.graph.imported_by.get(file_path, set()):
                    if importer not in changed_files and importer not in dependents:
                        dependents.add(importer)
                        next_frontier.add(importer)
            frontier = next_frontier

        return list(dependents)

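    def _remove_file(self, file_path: str):
        """Remove all chunks for a file from the vector store and AST graph.

        BM25 removal depends on the CodeBM25 implementation and is not shown.
        """
        # Remove from vector store
        existing = self.vectors.get(where={"file_path": file_path})
        if existing['ids']:
            self.vectors.delete(ids=existing['ids'])

        # Remove from AST graph. A bare prefix match also catches longer paths
        # (utils.py matches utils.pyi), so include the key's separator if your
        # key format has one.
        keys_to_remove = [k for k in self.graph.calls if k.startswith(file_path)]
        for key in keys_to_remove:
            del self.graph.calls[key]

    def _add_chunk(self, chunk: dict):
        """Add a chunk to the vector store (BM25 and AST updates not shown)."""
        # Use .get() for the name, matching the metadata below, so unnamed
        # chunks do not raise KeyError
        chunk_id = f"{chunk['file_path']}:{chunk['start_line']}:{chunk.get('name', '')}"
        embedding = self.model.encode(chunk['content']).tolist()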

        self.vectors.add(
            ids=[chunk_id],
            embeddings=[embedding],
            documents=[chunk['content']],
            metadatas=[{
                'file_path': chunk['file_path'],
                'chunk_type': chunk['chunk_type'],
                'name': chunk.get('name', ''),
                'start_line': chunk['start_line'],
                'end_line': chunk['end_line'],
            }]
        )
```

### Sharp Edges

**Deleted files.** A file removed from the repository must be removed from the index. Stale entries from deleted files are worse than missing entries -- they actively mislead the pipeline. The incremental indexer detects deletions from the git diff between the last indexed commit and `HEAD`. That diff only sees committed changes, so files deleted outside of git (`rm` without committing) stay invisible until the deletion is committed or the indexer also inspects the working tree, for example with `git status --porcelain`.
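
A working-tree check can be sketched by parsing `git status --porcelain`, whose version 1 format is two status characters, a space, and the path. The function names below are hypothetical, and the parser is deliberately simplified: it ignores renames and merge conflicts, and it does not unquote paths that git quotes because of unusual characters.

```python
import subprocess

def parse_porcelain(output: str) -> dict[str, list[str]]:
    """Classify `git status --porcelain` (v1) lines by working-tree state."""
    changes = {'deleted': [], 'modified': [], 'untracked': []}
    for line in output.splitlines():
        if len(line) < 4:
            continue
        # Do not strip the line: a leading space is part of the status code
        code, path = line[:2], line[3:]
        if code == '??':
            changes['untracked'].append(path)
        elif 'D' in code:
            changes['deleted'].append(path)
        elif 'M' in code or 'A' in code:
            # Staged additions are treated as modified for indexing purposes
            changes['modified'].append(path)
    return changes

def uncommitted_changes(repo_path: str) -> dict[str, list[str]]:
    """Run git status in the repository and classify the result."""
    result = subprocess.run(
        ['git', 'status', '--porcelain'],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return parse_porcelain(result.stdout)

sample = " D src/old.py\n M src/app.py\n?? notes.txt\nD  gone.py\n"
parsed = parse_porcelain(sample)
```

Merging `parsed['deleted']` into the deletion list before an update closes the gap for files removed with a plain `rm`.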

**Renamed files with modifications.** Git only detects renames when content similarity exceeds its threshold. A file that is renamed and heavily modified in the same commit looks like a deletion plus an addition. Both the old chunks and the new chunks get processed. This is correct behavior, but it doubles the work for that file.

**Cross-file AST dependencies.** When you change a function signature in `utils.py`, the AST graph edges for every file that imports that function need updating. The `_find_ast_dependents` method propagates these changes, but with a depth limit (default 2 hops) to prevent cascading through the entire codebase.

The depth limit is a trade-off. Without it, a change to a widely imported utility cascades: its 500 direct importers are re-indexed, then everything that imports those files, and so on until most of the codebase has been touched. With a 2-hop limit, only direct importers and their direct importers get refreshed. Empirically, this captures the relevant context updates without turning incremental indexing into a full reindex.

### Controlling What Gets Indexed

Not everything in a repository belongs in a search index. A `.pyckle-ignore` file (or equivalent for your system) excludes noise:

```python
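import fnmatch
from pathlib import Path

def load_ignore_patterns(repo_path: str) -> list[str]:
    """Load ignore patterns from .pyckle-ignore (a simplified gitignore subset)."""
    ignore_file = Path(repo_path) / '.pyckle-ignore'
    if not ignore_file.exists():
        return []

    patterns = []
    for line in ignore_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            patterns.append(line)
    return patterns

def _matches(file_path: str, pattern: str) -> bool:
    """Match one pattern against a repo-relative path with '/' separators."""
    if pattern.startswith('**/'):
        pattern = pattern[3:]
    if pattern.endswith('/'):
        # Directory pattern: excludes everything under that directory
        return f'/{pattern}' in f'/{file_path}'
    if '/' in pattern:
        return fnmatch.fnmatch(file_path, pattern)
    # Bare pattern such as *.csv: match the file name at any depth
    return fnmatch.fnmatch(Path(file_path).name, pattern)

def should_index(file_path: str, ignore_patterns: list[str]) -> bool:
    """Check if a file should be indexed. The last matching pattern wins."""
    indexed = True
    for pattern in ignore_patterns:
        if pattern.startswith('!'):
            # Re-include pattern: overrides an earlier exclusion
            if _matches(file_path, pattern[1:]):
                indexed = True
        elif _matches(file_path, pattern):
            indexed = False
    return indexed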
```

Typical ignore patterns:

```
# Build artifacts
dist/
build/
*.min.js

# Generated code
*_pb2.py
*.generated.ts

# Vendored dependencies
vendor/
third_party/
node_modules/

# Test fixtures (but keep test code)
tests/fixtures/
**/__snapshots__/

# Large data files
*.csv
*.json
!package.json
!tsconfig.json
```

For most repositories under 10K files, indexing everything git tracks is fine. For larger repositories, an ignore file cuts index size by 20-40% and removes an entire category of irrelevant results.

### CI/CD Integration

The index is an artifact that can be shared across environments. In a CI/CD pipeline, caching the index avoids redundant full reindexing:

```yaml
# GitHub Actions example
- name: Cache search index
  uses: actions/cache@v4
  with:
    path: .search_index/
    key: search-index-${{ hashFiles('**/*.py', '**/*.ts', '**/*.go') }}
    restore-keys: |
      search-index-

- name: Update search index
  run: |
    python -c "
    from pipeline import IncrementalIndexer
    indexer = IncrementalIndexer('.')
    stats = indexer.update()
    print(f'Updated: {stats}')
    "
```

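The cache key includes a hash of all source files. If no source files changed, the key matches exactly and the cached index is used as-is; the update step finds nothing to do. If source files changed, `restore-keys` restores the most recent cached index and the indexer updates it incrementally (2-5 seconds instead of 5-10 minutes for a full reindex). Because the incremental indexer diffs against git history, the checkout step needs more than the default shallow clone (for example, `fetch-depth: 0` on `actions/checkout`).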

For teams that run search-powered code review or quality checks in CI, this caching strategy makes search-based CI checks practical. A step that queries the search index for architectural violations, dead code, or naming convention mismatches can run in seconds rather than minutes.

### Index Integrity and Recovery

Incremental indexing has a failure mode: if the indexer crashes mid-update, the index can be left in an inconsistent state. Some chunks are updated, others are not. The AST graph references chunks that no longer exist.

The solution is transactional indexing: write all updates to a staging area, then swap the staging area into the live location with directory renames, so a crash never leaves a half-written index in place:

```python
import shutil
from pathlib import Path

class TransactionalIndex:
    def __init__(self, index_path: str):
        self.live_path = Path(index_path)
        self.staging_path = Path(f"{index_path}.staging")
        self.backup_path = Path(f"{index_path}.backup")

    def begin_update(self):
        """Start a transactional update by copying live to staging."""
        if self.staging_path.exists():
            shutil.rmtree(self.staging_path)
        shutil.copytree(self.live_path, self.staging_path)

    def commit(self):
        """Swap staging into live with two renames.

        Each rename is atomic on a single POSIX filesystem; recover()
        handles a crash between them.
        """
        if self.backup_path.exists():
            shutil.rmtree(self.backup_path)
        self.live_path.rename(self.backup_path)
        self.staging_path.rename(self.live_path)
        shutil.rmtree(self.backup_path)

    def rollback(self):
        """Discard the staging area."""
        if self.staging_path.exists():
            shutil.rmtree(self.staging_path)

    def recover(self):
        """Recover from a crashed update. Run once at startup."""
        if self.live_path.exists():
            # Crash while staging was being written -- discard the partial copy
            self.rollback()
            # Crash during commit cleanup -- remove the leftover backup
            if self.backup_path.exists():
                shutil.rmtree(self.backup_path)
        elif self.staging_path.exists():
            # Crash between the two renames -- staging is the new live
            self.staging_path.rename(self.live_path)
            if self.backup_path.exists():
                shutil.rmtree(self.backup_path)
        elif self.backup_path.exists():
            # No live and no staging -- restore the last known-good copy
            self.backup_path.rename(self.live_path)
```

This pattern adds the cost of copying the index to staging on each update: a few hundred milliseconds for small indexes, and more as the index grows. In exchange, a crash can no longer leave the live index half-updated. For production search systems, this reliability is essential.
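
One way to wire the begin, commit, and rollback steps together is a context manager that commits on success and discards staging on any exception. The sketch below is self-contained rather than built on the class above: `staged_update` is a hypothetical helper, and it assumes the index is a plain directory whose staging and backup siblings live on the same filesystem.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def staged_update(index_path: str):
    """Yield a staging copy of the index; swap it in only if the block succeeds.

    Hypothetical helper: assumes the index is a directory on one filesystem.
    """
    live = Path(index_path)
    staging = Path(f"{index_path}.staging")
    backup = Path(f"{index_path}.backup")

    # Start from a fresh copy of the live index
    if staging.exists():
        shutil.rmtree(staging)
    shutil.copytree(live, staging)

    try:
        yield staging  # the caller writes all updates into staging
    except BaseException:
        shutil.rmtree(staging)  # any failure leaves the live index untouched
        raise

    # Success: move live aside, move staging into place, drop the old copy
    if backup.exists():
        shutil.rmtree(backup)
    live.rename(backup)
    staging.rename(live)
    shutil.rmtree(backup)
```

Used as `with staged_update('.search_index') as staging:`, the indexer writes into `staging` and never touches the live directory directly. A hard crash (power loss, `kill -9`) still needs a startup recovery pass, because no `except` clause runs in that case.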

### File Watching for Real-Time Updates

For active development, polling git after each pull is the baseline. File watching is the next level:

```python
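import threading
from pathlib import Path

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class IndexUpdateHandler(FileSystemEventHandler):
    def __init__(self, indexer: IncrementalIndexer, debounce_ms: int = 500):
        self.indexer = indexer
        self.debounce_ms = debounce_ms
        self._pending: dict[str, threading.Timer] = {}
        self._lock = threading.Lock()

    def on_modified(self, event):
        if event.is_directory:
            return
        file_path = str(Path(event.src_path).relative_to(self.indexer.repo_path))
        if should_index(file_path, self.indexer.ignore_patterns):
            self._schedule_update(file_path)

    def _schedule_update(self, file_path: str):
        """Debounce rapid saves -- wait for the file to settle."""
        with self._lock:
            # A new event for the same file cancels its pending timer
            pending = self._pending.pop(file_path, None)
            if pending is not None:
                pending.cancel()
            timer = threading.Timer(
                self.debounce_ms / 1000, self._flush, args=(file_path,)
            )
            timer.daemon = True
            self._pending[file_path] = timer
            timer.start()

    def _flush(self, file_path: str):
        """Run after debounce_ms with no new events for this file."""
        with self._lock:
            self._pending.pop(file_path, None)
        # update() is the incremental entry point used in the CI example;
        # swap in a per-file method if your indexer exposes one
        self.indexer.update()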

def start_file_watcher(indexer: IncrementalIndexer):
    """Watch the repository for file changes and update the index."""
    handler = IndexUpdateHandler(indexer)
    observer = Observer()
    observer.schedule(handler, indexer.repo_path, recursive=True)
    observer.start()
    return observer
```

Every file save triggers an incremental update: parsing, embedding, BM25 update, AST edge update. Each update typically takes 50-200ms per file once the debounce window has elapsed. By the time the developer types a query, the index reflects the latest changes.

### Monorepo Strategies

Large monorepos (50K-100K+ files) present unique indexing challenges. The full index exceeds 1GB. Full reindexing takes 10+ minutes. Incremental updates touch more files because cross-package changes cascade further.

**Hierarchical indexing.** Split the monorepo into per-service or per-package sub-indexes. Each sub-index covers one service's code. A query first searches the relevant sub-index (identified by file path context or explicit scope). If the sub-index does not produce strong results, fall back to a global index that covers the entire repo.

```python
class HierarchicalIndex:
    def __init__(self, repo_path: str, service_dirs: list[str]):
        self.service_indexes = {}
        for service_dir in service_dirs:
            service_path = Path(repo_path) / service_dir
            if service_path.is_dir():
                chunks = chunk_directory(str(service_path))
                self.service_indexes[service_dir] = HybridSearch(chunks)

        # Global index covers everything
        all_chunks = chunk_directory(repo_path)
        self.global_index = HybridSearch(all_chunks)

    def search(
        self,
        query: str,
        scope: str | None = None,
        top_k: int = 20,
        fallback_threshold: float = 0.6
    ) -> list[dict]:
        """Search with optional service scope and global fallback."""
        if scope and scope in self.service_indexes:
            results = self.service_indexes[scope].search(query, top_k=top_k)
            # Assumes rrf_score is normalized to [0, 1]. Raw RRF scores with
            # k=60 and two retrievers top out near 0.03, so scale
            # fallback_threshold to match your scoring.
            if results and results[0].get('rrf_score', 0) >= fallback_threshold:
                return results
        # No scope, unknown scope, or weak scoped results -- fall back to global
        return self.global_index.search(query, top_k=top_k)
```

This approach reduces query latency for scoped searches (smaller index to search) and keeps incremental updates local: changing a service re-indexes only that service's sub-index, plus the affected entries in the global index.

**Branch-aware indexing.** In monorepos with multiple active branches, the index needs to handle branch switches. The simplest approach: maintain one index per branch. The more practical approach: maintain one index for the main branch and compute diffs for feature branches.

```python
import subprocess
from pathlib import Path

def index_for_branch(repo_path: str, branch: str, main_index_path: str):
    """Create a lightweight index for a feature branch by diffing against main.

    Assumes `branch` is currently checked out, since changed files are read
    from the working tree.
    """
    # Files changed on the branch since it diverged from main
    result = subprocess.run(
        ['git', 'diff', '--name-only', f'main...{branch}'],
        cwd=repo_path, capture_output=True, text=True, check=True
    )
    # splitlines() returns [] for empty output; split('\n') would return ['']
    changed_files = [f for f in result.stdout.splitlines() if f]

    # Load the main index and apply the diff
    main_index = load_index(main_index_path)
    branch_index = main_index.copy()

    for file_path in changed_files:
        branch_index.remove_file(file_path)
        if Path(repo_path, file_path).exists():
            content = Path(repo_path, file_path).read_text()
            chunks = chunk_file(file_path, content)
            branch_index.add_chunks(chunks)

    return branch_index
```

This avoids full reindexing on branch switch. The branch index is the main index with a small overlay of changed files. Branch indexes are cheap to create and discard.

### Content Hashing for Change Detection

Git-based change detection works for committed changes. But developers often want to search their working tree, including uncommitted changes. Content hashing catches these:

```python
import hashlib
import os
from pathlib import Path

def compute_file_hash(content: str) -> str:
    """Compute a content hash for change detection."""
    return hashlib.sha256(content.encode('utf-8')).hexdigest()[:16]

def detect_working_tree_changes(
    repo_path: str,
    manifest: dict[str, str]  # file_path -> content_hash
) -> dict:
    """Detect changes in the working tree, including uncommitted modifications."""
    changes = {'modified': [], 'added': [], 'deleted': []}
    current_files = set()

    for root, dirs, files in os.walk(repo_path):
        dirs[:] = [d for d in dirs if not d.startswith('.')]
        for fname in files:
            fpath = Path(root) / fname
            rel_path = str(fpath.relative_to(repo_path))
            current_files.add(rel_path)

            content = fpath.read_text(encoding='utf-8', errors='ignore')
            current_hash = compute_file_hash(content)

            if rel_path not in manifest:
                changes['added'].append(rel_path)
            elif manifest[rel_path] != current_hash:
                changes['modified'].append(rel_path)

    # Detect deletions
    for old_path in manifest:
        if old_path not in current_files:
            changes['deleted'].append(old_path)

    return changes
```
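
The manifest that `detect_working_tree_changes` compares against has to be built and persisted somewhere. A minimal sketch, assuming a JSON file as storage; `build_manifest`, `save_manifest`, and `load_manifest` are illustrative names, not part of the pipeline above:

```python
import hashlib
import json
import os
from pathlib import Path

def build_manifest(repo_path: str) -> dict[str, str]:
    """Map each repo-relative file path to a truncated SHA-256 of its content."""
    manifest = {}
    for root, dirs, files in os.walk(repo_path):
        # Skip hidden directories such as .git, matching the walker above
        dirs[:] = [d for d in dirs if not d.startswith('.')]
        for fname in files:
            fpath = Path(root) / fname
            content = fpath.read_text(encoding='utf-8', errors='ignore')
            digest = hashlib.sha256(content.encode('utf-8')).hexdigest()[:16]
            manifest[str(fpath.relative_to(repo_path))] = digest
    return manifest

def save_manifest(manifest: dict[str, str], manifest_path: str) -> None:
    """Write the manifest as JSON (the storage format is an assumption)."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))

def load_manifest(manifest_path: str) -> dict[str, str]:
    """Load a saved manifest; an empty dict means every file counts as added."""
    path = Path(manifest_path)
    return json.loads(path.read_text()) if path.exists() else {}
```

Rebuilding and saving the manifest after each successful index update keeps the next comparison aligned with what the index actually contains. Storing it inside a hidden directory (such as the index directory) keeps it out of its own walk.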

### Index Size Budgets

| Repository Size | Chunks | Index Size | Incremental Update |
|----------------|--------|------------|-------------------|
| 1,000 files | ~4,000 | ~15MB | <0.5s |
| 10,000 files | ~40,000 | ~150MB | ~1s |
| 50,000 files | ~200,000 | ~600MB | ~2s |
| 100,000 files | ~400,000 | ~1.5GB | ~4s |

Query latency scales sub-linearly because HNSW graph search is approximately `O(log n)` in the number of indexed vectors. Doubling the index size does not double the query time. A 100K-file monorepo is still under 15ms per query.
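
As a back-of-the-envelope check on that scaling, compare `log2(n)` at the smallest and largest chunk counts in the table. This treats graph-search cost as proportional to `log2(n)`, which is a simplification: real HNSW latency also depends on the search-time `ef` setting and on memory locality. The helper name is illustrative.

```python
import math

def relative_search_cost(small_n: int, large_n: int) -> float:
    """Ratio of log2(large_n) to log2(small_n), a crude proxy for O(log n) cost."""
    return math.log2(large_n) / math.log2(small_n)

# Chunk counts from the table: 1,000 files vs. 100,000 files
ratio = relative_search_cost(4_000, 400_000)
print(f"100x more chunks -> about {ratio:.2f}x the graph-search work")
```

Under this model, a hundredfold increase in chunks costs roughly one and a half times the search work, not a hundred times.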

---

### Exercise

> **Try This**
>
> Build an incremental indexer:
>
> 1. Index your test codebase fully. Record the time.
> 2. Modify three files. Run incremental update. Record the time and compare.
> 3. Delete a file. Run incremental update. Verify the deleted file's chunks are removed from the index.
> 4. Change a function signature in a utility file. Check if the AST dependency propagation re-indexes files that import the changed function.
> 5. Add a `.pyckle-ignore` file that excludes test fixtures and generated code. Rebuild and compare the index size.

---

### Key Takeaways

- Full reindexing takes minutes for large codebases. Incremental indexing takes seconds by processing only changed files.
- Git-aware change detection provides accurate file-level diffs. Rename detection prevents unnecessary re-embedding.
- Cross-file AST dependency propagation ensures structural accuracy after changes, with a depth limit to prevent cascading re-indexes.
- File watching provides real-time index updates on save, typically 50-200ms per file.
- A `.pyckle-ignore` file reduces index size by 20-40% and removes noise from search results.

---

# Part III: Measurement and Beyond

---

## Chapter 5: Evaluation and Benchmarking

### Chapter Overview

Building a search pipeline is engineering. Measuring whether it works is science. This chapter covers how to build an evaluation set, choose the right metrics, run benchmarks, detect regressions, and make data-driven decisions about pipeline changes.

---

### Why Measurement Matters

Every chapter in this book introduced a component that "improves search quality." Cross-encoder reranking "improves precision." Adaptive thresholds "reduce noise."

These claims are meaningless without numbers. How much does each component improve? On what queries? For which codebases? Is the improvement consistent or does it vary? Does it regress on any query type?

Without measurement, pipeline development is guesswork. You add a component, the results "feel better," and you ship it. Six months later, a different component change makes results "feel worse," but you do not know if the new change caused the regression or if the original "improvement" was never real.

Evaluation turns feelings into facts.

### Building an Evaluation Set

An evaluation set is a collection of (query, relevant_results) pairs that represent the queries your users actually ask.

```python
from dataclasses import dataclass

@dataclass
class EvalQuery:
    query: str                      # The search query
    relevant_chunks: list[str]      # IDs of chunks that are relevant
    relevance_grades: dict[str, int]  # chunk_id -> relevance grade (0-3)
    query_type: str                 # 'exact', 'conceptual', 'structural', 'cross_file'
    notes: str = ''                 # Why these judgments were made

# Example evaluation set
eval_set = [
    EvalQuery(
        query="POSTGRES_MAX_CONNECTIONS",
        relevant_chunks=["config/database.py:12", "config/database.py:34"],
        relevance_grades={
            "config/database.py:12": 3,    # Perfect match
            "config/database.py:34": 2,    # Relevant usage
            "docs/deployment.md:89": 1,    # Mentions it
        },
        query_type="exact",
    ),
    EvalQuery(
        query="how does the app handle authentication",
        relevant_chunks=[
            "src/middleware/auth.py:14",
            "src/auth/tokens.py:42",
            "src/auth/session.py:18",
            "config/auth.py:5",
        ],
        relevance_grades={
            "src/middleware/auth.py:14": 3,
            "src/auth/tokens.py:42": 3,
            "src/auth/session.py:18": 2,
            "config/auth.py:5": 2,
        },
        query_type="cross_file",
    ),
    # ... 50-200 more queries
]
```

**Sources for evaluation queries:**

1. **Search logs.** If you have an existing search tool, mine the query logs. Real queries are always more diverse and surprising than synthetic ones.

2. **Developer interviews.** Ask five developers: "What did you search for this week?" Record the queries and the code they were actually looking for.

3. **Code review comments.** Comments like "how does X work" or "where is Y defined" are implicit search queries with known answers.

4. **Onboarding questions.** New team members ask questions that test the search system's ability to surface non-obvious code.

**How many evaluation queries do you need?**

The answer depends on what you are measuring and how precisely you need to measure it.

| Purpose | Minimum Queries | Recommended |
|---------|----------------|-------------|
| Quick sanity check | 10-20 | 20 |
| Component comparison | 30-50 | 50 |
| Production readiness | 50-100 | 100 |
| Per-query-type analysis | 100-200 | 200 (50+ per type) |
| Academic benchmark | 200+ | 500+ |

For pipeline development, 50-100 queries is the sweet spot. Enough for stable metrics, not so many that maintaining the eval set becomes a project of its own.

**Avoiding bias in query selection:**

A common mistake is writing queries that test the system's strengths rather than its weaknesses. If you build the eval set by searching your codebase, finding results, and writing down the query that found them, your eval set is biased toward queries the system already handles well.

Better approach: collect queries from developers who are not involved in building the pipeline. Their queries reflect actual information needs, not system capabilities. The queries that stump the pipeline are the most valuable evaluation queries -- they reveal the ceiling you need to raise.

**Relevance grading:**

Use a 4-point scale:

| Grade | Meaning |
|-------|---------|
| 3 | Perfect match -- exactly what the query is asking for |
| 2 | Highly relevant -- directly related, provides useful context |
| 1 | Somewhat relevant -- tangentially related, might help |
| 0 | Not relevant -- noise |

Multiple grades are better than binary (relevant/not relevant) because they let you measure ranking quality, not just recall.
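
Graded judgments can also feed the binary metrics. The sketch below derives a relevant set from the grades with a cutoff. The `min_grade=2` default is an assumption that mirrors the example set above, where the grade-1 chunk is graded but not listed in `relevant_chunks`; the function name is illustrative.

```python
def relevant_from_grades(grades: dict[str, int], min_grade: int = 2) -> list[str]:
    """Return chunk IDs graded at or above min_grade, highest grade first."""
    ranked = sorted(grades.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, grade in ranked if grade >= min_grade]

grades = {
    "config/database.py:12": 3,
    "config/database.py:34": 2,
    "docs/deployment.md:89": 1,
}
relevant = relevant_from_grades(grades)
print(relevant)
```

Deriving the binary set from the grades keeps the two fields from drifting apart as judgments are updated.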

### Metrics

**Mean Reciprocal Rank (MRR)**

MRR measures where the first relevant result appears. For each query, the reciprocal rank is `1/rank` where `rank` is the position of the first relevant result. MRR is the average across all queries.

```python
def mean_reciprocal_rank(eval_set: list[EvalQuery], results: dict[str, list[dict]]) -> float:
    """Calculate MRR across the evaluation set.

    Args:
        eval_set: List of evaluation queries.
        results: Dict mapping query string to ranked result list.
    """
    reciprocal_ranks = []

    for eq in eval_set:
        query_results = results[eq.query]
        rr = 0.0
        for rank, result in enumerate(query_results, start=1):
            if result['id'] in eq.relevant_chunks:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)

    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

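MRR ranges from 0 to 1. Because it averages reciprocals, it corresponds to the harmonic mean of the first-relevant ranks: an MRR of 0.5 is what you get if every query finds its first relevant result at rank 2, and an MRR of 0.8 falls between rank 1 and rank 2. The same 0.5 also results when half the queries hit at rank 1 and the other half return nothing relevant, so read MRR alongside the per-query distribution.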
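
A small worked example makes the averaging concrete. The three queries and their first-relevant ranks are made up for illustration:

```python
# Rank of the first relevant result for three hypothetical queries.
# None means no relevant result was returned at all.
first_relevant_ranks = [1, 2, None]

reciprocal_ranks = [1.0 / r if r is not None else 0.0 for r in first_relevant_ranks]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

print(reciprocal_ranks)  # [1.0, 0.5, 0.0]
print(mrr)               # 0.5
```

Here one query hit at rank 1, one at rank 2, and one missed entirely, yet the average is the same as if all three had hit at rank 2.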

**Recall@k**

What fraction of relevant results appear in the top `k`?

```python
def recall_at_k(eval_set: list[EvalQuery], results: dict[str, list[dict]], k: int) -> float:
    """Calculate Recall@k across the evaluation set."""
    recalls = []

    for eq in eval_set:
        query_results = results[eq.query][:k]
        result_ids = {r['id'] for r in query_results}
        relevant_found = len(result_ids & set(eq.relevant_chunks))
        total_relevant = len(eq.relevant_chunks)

        recalls.append(relevant_found / total_relevant if total_relevant > 0 else 0)

    return sum(recalls) / len(recalls)
```

Recall@5 of 0.85 means 85% of relevant results are in the top 5. Recall@20 of 0.95 means 95% are in the top 20. High recall is necessary -- if the relevant code is not in the candidate set, no amount of reranking will find it.
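
To see how recall grows with `k` for a single query, consider a hypothetical query with four relevant chunks and six returned results. The chunk IDs below are invented for illustration:

```python
relevant = {"auth.py:14", "tokens.py:42", "session.py:18", "config.py:5"}
ranked = [
    "auth.py:14", "utils.py:3", "tokens.py:42",
    "routes.py:77", "config.py:5", "session.py:18",
]

def recall_at(k: int) -> float:
    """Fraction of the relevant set found in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

print(recall_at(3))   # 0.5
print(recall_at(5))   # 0.75
print(recall_at(10))  # 1.0
```

The fourth relevant chunk sits at rank 6, so it only counts once `k` reaches 6 or more.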

**Normalized Discounted Cumulative Gain (nDCG)**

nDCG measures ranking quality, accounting for the position and grade of each result. It rewards systems that put highly relevant results at the top.

```
DCG@k = sum( (2^relevance_grade - 1) / log2(rank + 1) )  for rank 1..k
nDCG@k = DCG@k / ideal_DCG@k
```

Where `ideal_DCG@k` is the DCG of the perfect ranking (all relevant results at the top, in order of relevance grade).

```python
import math

def ndcg_at_k(eval_set: list[EvalQuery], results: dict[str, list[dict]], k: int) -> float:
    """Calculate nDCG@k across the evaluation set."""
    ndcgs = []

    for eq in eval_set:
        query_results = results[eq.query][:k]

        # Compute DCG
        dcg = 0.0
        for rank, result in enumerate(query_results, start=1):
            grade = eq.relevance_grades.get(result['id'], 0)
            dcg += (2**grade - 1) / math.log2(rank + 1)

        # Compute ideal DCG
        ideal_grades = sorted(eq.relevance_grades.values(), reverse=True)[:k]
        idcg = sum(
            (2**g - 1) / math.log2(r + 1)
            for r, g in enumerate(ideal_grades, start=1)
        )

        ndcgs.append(dcg / idcg if idcg > 0 else 0)

    return sum(ndcgs) / len(ndcgs)
```

nDCG ranges from 0 to 1. An nDCG@5 of 0.90 means the ranking is 90% as good as the perfect ranking in the top 5 positions. It is the most informative single metric for search quality because it captures both what is returned and where it is ranked.
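
To see how position and grade interact, here is a hand-checkable calculation for one hypothetical query, using the same gain and discount as the formula above. The grades and the returned order are invented for illustration.

```python
import math

def dcg(grades_in_rank_order: list[int]) -> float:
    """DCG with gain 2^grade - 1 and discount log2(rank + 1)."""
    return sum(
        (2**g - 1) / math.log2(rank + 1)
        for rank, g in enumerate(grades_in_rank_order, start=1)
    )

# Judged grades for the query: one perfect (3), one highly relevant (2), one marginal (1)
judged = [3, 2, 1]

# The system returned the grade-2 chunk first, then noise, then the grade-3 chunk
returned = [2, 0, 3]

ndcg_at_3 = dcg(returned) / dcg(sorted(judged, reverse=True))
print(round(ndcg_at_3, 3))  # 0.692
```

The returned list earns 3 + 0 + 3.5 = 6.5, against an ideal of about 9.39. Most of the loss comes from the grade-3 chunk sitting at rank 3, where its gain of 7 is halved by the discount.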

### Running Benchmarks

A benchmark harness runs the evaluation set through the pipeline and computes all metrics:

```python
import time

def run_benchmark(
    pipeline,
    eval_set: list[EvalQuery],
    k_values: tuple[int, ...] = (1, 3, 5, 10, 20)  # tuple avoids a mutable default
) -> dict:
    """Run full benchmark suite against the pipeline."""
    # Collect results for all queries
    all_results = {}
    latencies = []

    for eq in eval_set:
        start = time.perf_counter()
        results = pipeline.search(eq.query)
        elapsed = (time.perf_counter() - start) * 1000

        all_results[eq.query] = results
        latencies.append(elapsed)

    # Compute metrics
    metrics = {
        'mrr': mean_reciprocal_rank(eval_set, all_results),
        'latency_p50': sorted(latencies)[len(latencies) // 2],
        'latency_p95': sorted(latencies)[int(len(latencies) * 0.95)],
        'latency_p99': sorted(latencies)[int(len(latencies) * 0.99)],
    }

    for k in k_values:
        metrics[f'recall@{k}'] = recall_at_k(eval_set, all_results, k)
        metrics[f'ndcg@{k}'] = ndcg_at_k(eval_set, all_results, k)

    # Per-query-type breakdown
    query_types = set(eq.query_type for eq in eval_set)
    for qt in query_types:
        subset = [eq for eq in eval_set if eq.query_type == qt]
        if subset:
            metrics[f'mrr_{qt}'] = mean_reciprocal_rank(subset, all_results)
            metrics[f'ndcg@5_{qt}'] = ndcg_at_k(subset, all_results, 5)

    return metrics
```

### Ablation Studies

The most valuable benchmarking technique is ablation: remove one pipeline component at a time and measure the impact.

```python
def run_ablation_study(pipeline, eval_set):
    """Measure the impact of each pipeline component."""
    configurations = {
        'full_pipeline': pipeline.search,
        'no_ast': lambda q: pipeline.search(q, skip_ast=True),
        'no_reranking': lambda q: pipeline.search(q, skip_reranker=True),
        'no_threshold': lambda q: pipeline.search(q, skip_threshold=True),
        'bm25_only': lambda q: pipeline.bm25.search(q),
        'semantic_only': lambda q: pipeline.semantic_search(q),
    }

    results = {}
    for name, search_fn in configurations.items():
        all_results = {eq.query: search_fn(eq.query) for eq in eval_set}
        results[name] = {
            'mrr': mean_reciprocal_rank(eval_set, all_results),
            'ndcg@5': ndcg_at_k(eval_set, all_results, 5),
            'recall@10': recall_at_k(eval_set, all_results, 10),
        }

    return results
```

Expected ablation results for a well-tuned pipeline:

| Configuration | MRR | nDCG@5 | Recall@10 |
|--------------|-----|--------|-----------|
| Full pipeline | 0.87 | 0.91 | 0.95 |
| No AST boosting | 0.79 | 0.84 | 0.93 |
| No reranking | 0.64 | 0.78 | 0.91 |
| No threshold | 0.87 | 0.91 | 0.95 |
| BM25 only | 0.52 | 0.58 | 0.72 |
| Semantic only | 0.61 | 0.68 | 0.80 |

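The threshold row matches the full pipeline because thresholding only filters low-scoring results and does not reorder the ones that remain; its benefit shows up in downstream LLM quality rather than in ranking metrics. (A threshold aggressive enough to drop relevant results would lower recall.) Reranking has the largest single impact on MRR. AST boosting helps most on structural and cross-file queries.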
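
Ablation output is easier to read as drops relative to the full pipeline. A minimal sketch, using four rows copied from the table above; `component_impact` is an illustrative helper, not part of the benchmark harness:

```python
# Numbers copied from the ablation table above
ablation = {
    'full_pipeline': {'mrr': 0.87, 'ndcg@5': 0.91, 'recall@10': 0.95},
    'no_ast':        {'mrr': 0.79, 'ndcg@5': 0.84, 'recall@10': 0.93},
    'no_reranking':  {'mrr': 0.64, 'ndcg@5': 0.78, 'recall@10': 0.91},
    'no_threshold':  {'mrr': 0.87, 'ndcg@5': 0.91, 'recall@10': 0.95},
}

def component_impact(results: dict, baseline: str = 'full_pipeline') -> dict:
    """Metric drop for each ablated configuration relative to the baseline."""
    base = results[baseline]
    return {
        name: {m: round(base[m] - metrics[m], 3) for m in base}
        for name, metrics in results.items()
        if name != baseline
    }

impact = component_impact(ablation)
# Largest MRR drop first: the component whose removal hurts most
for name, drops in sorted(impact.items(), key=lambda kv: -kv[1]['mrr']):
    print(f"{name:14s} MRR drop: {drops['mrr']:.2f}")
```

Sorting by drop puts the component whose removal hurts most at the top, which is a reasonable order for deciding where tuning effort goes.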

### Common Evaluation Mistakes

**Evaluating on the training distribution.** If your eval set queries are all exact identifiers and your pipeline is tuned for exact identifiers, you will see great metrics. But your users ask conceptual questions too. Ensure your eval set covers all four query types: exact, conceptual, structural, and cross-file.

**Too few queries.** An eval set of 10 queries produces noisy metrics. Random chance can swing MRR by 0.10 or more. You need at least 50 queries for stable MRR estimates and 100+ for per-query-type breakdowns.

**Stale relevance judgments.** Code changes. A chunk that was relevant six months ago may have been refactored, renamed, or deleted. Review your eval set quarterly and update relevance judgments to match the current codebase.

**Ignoring the tail.** The worst-performing 10% of queries are more revealing than the median. If your pipeline works well on 90% of queries but fails completely on 10%, the average metrics look fine but 10% of your users are frustrated. Report percentile metrics, not just averages:

```python
def compute_percentile_metrics(
    eval_set: list[EvalQuery],
    results: dict[str, list[dict]]
) -> dict:
    """Compute percentile breakdowns for MRR."""
    mrr_per_query = []

    for eq in eval_set:
        query_results = results[eq.query]
        rr = 0.0
        for rank, result in enumerate(query_results, start=1):
            if result['id'] in eq.relevant_chunks:
                rr = 1.0 / rank
                break
        mrr_per_query.append(rr)

    sorted_rr = sorted(mrr_per_query)
    n = len(sorted_rr)

    return {
        'mrr_p10': sorted_rr[int(n * 0.1)],    # Worst 10% of queries
        'mrr_p25': sorted_rr[int(n * 0.25)],   # Bottom quartile
        'mrr_p50': sorted_rr[int(n * 0.5)],    # Median
        'mrr_p75': sorted_rr[int(n * 0.75)],   # Upper quartile
        'mrr_p90': sorted_rr[int(n * 0.9)],    # Best 10%
        'mrr_mean': sum(sorted_rr) / n,         # Average
        'mrr_zero_count': sum(1 for r in sorted_rr if r == 0),  # Complete failures
    }
```

The `mrr_zero_count` is particularly important. These are queries where the relevant result was not in the returned set at all -- complete retrieval failures. Even one such failure per hundred queries can erode trust in the tool.

### Statistical Significance

When comparing two pipeline configurations, the difference in metrics might be noise. A 2% MRR improvement might be real, or it might be within the margin of error.

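For small eval sets (<100 queries), use a paired randomization test. The function below keeps the name `bootstrap_significance`, but it works by randomly swapping the A and B scores for each query (a sign-flip permutation test) rather than by resampling queries with replacement: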

```python
import random

def bootstrap_significance(
    mrr_a: list[float],
    mrr_b: list[float],
    n_bootstrap: int = 10000
) -> float:
    """Test whether pipeline B is significantly better than pipeline A.

    Returns p-value. If p < 0.05, the improvement is likely real.
    """
    n = len(mrr_a)
    observed_diff = sum(mrr_b) / n - sum(mrr_a) / n

    # Count how often a random re-sampling produces a diff >= observed
    count_greater = 0
    for _ in range(n_bootstrap):
        # Randomly swap A and B for each query
        sample_diff = 0.0
        for i in range(n):
            if random.random() < 0.5:
                sample_diff += (mrr_b[i] - mrr_a[i])
            else:
                sample_diff += (mrr_a[i] - mrr_b[i])
        sample_diff /= n

        if sample_diff >= observed_diff:
            count_greater += 1

    return count_greater / n_bootstrap
```

A p-value below 0.05 means the improvement is statistically significant -- unlikely to be due to random chance. For production decisions, require p < 0.05 and at least a 1% absolute improvement in the primary metric (usually MRR or nDCG@5).

### Regression Detection

Every pipeline change should be benchmarked against the evaluation set before deployment:

```python
def check_for_regressions(
    baseline_metrics: dict,
    new_metrics: dict,
    threshold: float = 0.02
) -> list[str]:
    """Check if any metric regressed beyond the threshold."""
    regressions = []

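    for metric, baseline_value in baseline_metrics.items():
        if metric.startswith('latency'):
            # Latency is lower-is-better and measured in ms; check it separately
            continue
        new_value = new_metrics.get(metric, 0)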
        if new_value < baseline_value - threshold:
            regressions.append(
                f"{metric}: {baseline_value:.3f} -> {new_value:.3f} "
                f"(delta: {new_value - baseline_value:+.3f})"
            )

    return regressions
```

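The default threshold is an absolute drop of 0.02 (two points on a 0-1 metric), which catches meaningful degradations while allowing for natural variance. Quality metrics are stable across runs when the pipeline is deterministic; latency is not, so run benchmarks three times and average to reduce noise.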
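
Latency needs its own check, because lower is better and the scale is milliseconds rather than a 0-1 score. A minimal sketch, assuming latency keys are prefixed `latency_` as in `run_benchmark` and using a relative tolerance. The function name is illustrative, and the 20% default is a placeholder to tune, not a measured value.

```python
def check_latency_regressions(
    baseline_metrics: dict,
    new_metrics: dict,
    max_relative_increase: float = 0.20,
) -> list[str]:
    """Flag latency metrics that grew by more than the allowed fraction."""
    regressions = []
    for metric, baseline_value in baseline_metrics.items():
        # Only latency metrics present in both runs are compared
        if not metric.startswith('latency_') or metric not in new_metrics:
            continue
        new_value = new_metrics[metric]
        limit = baseline_value * (1 + max_relative_increase)
        if baseline_value > 0 and new_value > limit:
            regressions.append(
                f"{metric}: {baseline_value:.1f}ms -> {new_value:.1f}ms "
                f"({new_value / baseline_value - 1:+.0%})"
            )
    return regressions
```

A relative tolerance keeps the same check usable for both a 10ms p50 and a 200ms p99, where a fixed millisecond threshold would be too loose for one and too tight for the other.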

### Continuous Evaluation

An evaluation set is a snapshot. Codebases change. Queries evolve. Models age. Continuous evaluation tracks search quality over time.

The simplest continuous evaluation: run the benchmark suite nightly on the current codebase and compare to the previous night's results.

```python
import json
from datetime import datetime

class ContinuousEvaluator:
    def __init__(self, history_path: str = '.search_index/eval_history.jsonl'):
        self.history_path = history_path

    def record(self, metrics: dict):
        """Record today's benchmark metrics."""
        entry = {
            'date': datetime.now().isoformat(),
            'metrics': metrics,
        }
        with open(self.history_path, 'a') as f:
            f.write(json.dumps(entry) + '\n')

    def trend(self, metric: str, window: int = 30) -> dict:
        """Compute trend for a metric over the last N days."""
        entries = []
        with open(self.history_path) as f:
            for line in f:
                entries.append(json.loads(line))

        recent = entries[-window:]
        values = [e['metrics'].get(metric, 0) for e in recent]

        if len(values) < 2:
            return {'trend': 'insufficient_data'}

        # Simple linear regression for trend direction
        n = len(values)
        x_mean = (n - 1) / 2
        y_mean = sum(values) / n
        slope = sum(
            (i - x_mean) * (v - y_mean) for i, v in enumerate(values)
        ) / sum(
            (i - x_mean) ** 2 for i in range(n)
        )

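        # Slope is per history entry (per day for nightly runs). The cutoff
        # must sit below the drift you want to catch: a drop of 0.005 per
        # week is only about 0.0007 per day.
        cutoff = 0.0005
        return {
            'current': values[-1],
            'mean': y_mean,
            'min': min(values),
            'max': max(values),
            'slope': slope,
            'direction': 'improving' if slope > cutoff else 'degrading' if slope < -cutoff else 'stable',
        }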
```

Watch for slow degradation. A metric that drops 0.5% per week is invisible in daily checks but represents a 25% degradation over a year. The trend analysis catches these slow drifts.
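
The arithmetic behind that yearly figure can be checked two ways, depending on whether the weekly drop is treated as a fixed step or as compounding on the remaining value:

```python
weekly_drop = 0.005   # 0.5% drop per week
weeks = 52

# Linear approximation: the same step every week
linear_loss = weekly_drop * weeks

# Compounding: each week loses 0.5% of whatever is left
compound_loss = 1 - (1 - weekly_drop) ** weeks

print(f"linear: {linear_loss:.1%}, compounding: {compound_loss:.1%}")
```

Both readings land in the low-to-mid twenties, consistent with the rough 25% figure, while the per-day change stays well under a tenth of a percent.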

Common causes of gradual degradation:

- **Codebase growth without eval set updates.** The eval set queries target code from six months ago. New code is not represented.
- **Style drift.** The team adopts new naming conventions, new frameworks, or new patterns. The embedding model was never trained on these patterns.
- **Index staleness.** Incremental indexing misses some edge cases (Chapter 4), and the index gradually diverges from reality.

The fix is quarterly eval set maintenance: review queries, update relevance judgments, add queries for new code areas, and remove queries for deleted code.

### Failure Analysis

When a query fails (the relevant result is not in the top 5), understanding why it failed guides pipeline improvements.

```python
def analyze_failure(
    query: str,
    expected_chunk_id: str,
    pipeline_results: list[dict],
    bm25_results: list[dict],
    semantic_results: list[dict]
) -> dict:
    """Diagnose why a query failed to return the expected result."""
    analysis = {'query': query, 'expected': expected_chunk_id}

    # Check each retriever independently
    bm25_ids = [r['id'] for r in bm25_results[:50]]
    semantic_ids = [r['id'] for r in semantic_results[:50]]
    pipeline_ids = [r['id'] for r in pipeline_results[:20]]

    analysis['in_bm25_top50'] = expected_chunk_id in bm25_ids
    analysis['in_semantic_top50'] = expected_chunk_id in semantic_ids
    analysis['in_pipeline_top20'] = expected_chunk_id in pipeline_ids

    if expected_chunk_id in bm25_ids:
        analysis['bm25_rank'] = bm25_ids.index(expected_chunk_id) + 1
    if expected_chunk_id in semantic_ids:
        analysis['semantic_rank'] = semantic_ids.index(expected_chunk_id) + 1
    if expected_chunk_id in pipeline_ids:
        analysis['pipeline_rank'] = pipeline_ids.index(expected_chunk_id) + 1

    # Determine failure type
    if not analysis['in_bm25_top50'] and not analysis['in_semantic_top50']:
        analysis['failure_type'] = 'retrieval_miss'
        analysis['recommendation'] = 'Check chunking -- the relevant code may be split across chunks or missing from the index.'
    elif analysis.get('in_bm25_top50') and not analysis.get('in_semantic_top50'):
        analysis['failure_type'] = 'semantic_miss'
        analysis['recommendation'] = 'The embedding model does not capture the relationship between this query and this code. Consider fine-tuning.'
    elif not analysis.get('in_bm25_top50') and analysis.get('in_semantic_top50'):
        analysis['failure_type'] = 'lexical_miss'
        analysis['recommendation'] = 'No keyword overlap between query and code. This is expected for conceptual queries.'
    elif analysis.get('in_pipeline_top20') and analysis.get('pipeline_rank', 99) > 5:
        analysis['failure_type'] = 'ranking_error'
        analysis['recommendation'] = 'The result was retrieved but ranked too low. Cross-encoder reranking may need tuning.'
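    else:
        # Found by both retrievers but absent from the final top 20.
        # The adaptive threshold is the usual cause; the reranker pushing the
        # result below rank 20 is the other possibility worth checking.
        analysis['failure_type'] = 'threshold_filtered'
        analysis['recommendation'] = 'The result was retrieved but is missing from the final list, most likely filtered by the adaptive threshold. Consider widening the margin.'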

    return analysis
```

This kind of failure analysis transforms evaluation from a metric dashboard into an actionable improvement plan. Each failure type points to a specific pipeline component that needs attention.

### A/B Testing Search Quality

For production systems, benchmarks on a fixed evaluation set are necessary but not sufficient. Real queries are more diverse than any evaluation set. A/B testing measures quality on actual usage.

The simplest approach: log the query, the results, and whether the user clicked on (or used) a result. Compute click-through rate (CTR) and mean position of clicked results for each pipeline variant.

```python
from dataclasses import dataclass

@dataclass
class SearchEvent:
    query: str
    results: list[str]       # Result IDs in order
    clicked_result: str | None  # Which result the user used
    pipeline_variant: str       # 'control' or 'experiment'
    timestamp: float

def compute_ab_metrics(events: list[SearchEvent]) -> dict:
    """Compute A/B test metrics from search events."""
    variants = {}

    for event in events:
        v = event.pipeline_variant
        if v not in variants:
            variants[v] = {'clicks': 0, 'total': 0, 'click_positions': []}

        variants[v]['total'] += 1
        if event.clicked_result:
            variants[v]['clicks'] += 1
            try:
                pos = event.results.index(event.clicked_result) + 1
                variants[v]['click_positions'].append(pos)
            except ValueError:
                pass

    metrics = {}
    for v, data in variants.items():
        metrics[v] = {
            'ctr': data['clicks'] / data['total'] if data['total'] > 0 else 0,
            'mean_click_position': (
                sum(data['click_positions']) / len(data['click_positions'])
                if data['click_positions'] else 0
            ),
            'total_queries': data['total'],
        }

    return metrics
```
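
A raw CTR difference between variants can be noise, especially with only a few hundred queries per arm. One common check is a two-proportion z-test. The sketch below is an assumption about how you might apply it to the click and query counts collected above; the function name `ctr_z_test` is hypothetical and not part of the book's pipeline.

```python
import math


def ctr_z_test(clicks_a: int, total_a: int, clicks_b: int, total_b: int) -> dict:
    """Two-sided two-proportion z-test on click-through rates."""
    if total_a == 0 or total_b == 0:
        return {'z': 0.0, 'p_value': 1.0}

    p_a = clicks_a / total_a
    p_b = clicks_b / total_b
    # Pooled CTR under the null hypothesis that both variants perform the same
    pooled = (clicks_a + clicks_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return {'z': 0.0, 'p_value': 1.0}

    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return {'z': z, 'p_value': p_value}


result = ctr_z_test(clicks_a=300, total_a=1000, clicks_b=400, total_b=1000)
print(f"z = {result['z']:.2f}, p = {result['p_value']:.4g}")
```

A p-value below your chosen significance level (0.05 is a common convention) suggests the CTR difference is unlikely to be chance. Mean click position is not a proportion, so it needs a different test.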

---

### Exercise

> **Try This**
>
> Build an evaluation set and benchmark your pipeline:
>
> 1. Create at least 20 evaluation queries with relevance grades (use the 4-point scale). Include at least 5 each of exact, conceptual, structural, and cross-file queries.
> 2. Run the full benchmark suite and record MRR, nDCG@5, and Recall@10.
> 3. Run an ablation study: measure the metrics without AST boosting, without reranking, and without thresholding.
> 4. Which component contributes the most to your pipeline's quality? Is it the same component for all query types?
> 5. Save your baseline metrics. Every pipeline change from here should be benchmarked against them.

---

### Key Takeaways

- Evaluation turns subjective "feels better" into objective measurements. Without it, pipeline development is guesswork.
- An evaluation set of 50-200 (query, relevant_results) pairs, sourced from real developer behavior, is the foundation of measurement.
- MRR measures where the first relevant result appears. Recall@k measures coverage. nDCG@k measures ranking quality. Use all three.
- Ablation studies reveal the contribution of each pipeline component. Remove one at a time and measure the impact.
- Regression detection with a 2% threshold catches meaningful degradations before deployment.

---

## Chapter 6: Extending the Pipeline

### Chapter Overview

The five-stage pipeline covers the common case. But every codebase has edges: domain-specific vocabulary, multi-language repositories, integration with LLMs, custom ranking signals, and features the default pipeline does not anticipate. This chapter covers how to extend the pipeline without breaking it. This material expands on Episode 21 of the Code Search, Decoded series.

---

### When to Extend vs. When to Configure

Before writing custom pipeline stages, try configuration:

**Configure** when:
- Results are mostly good but occasionally miss relevant files -- adjust thresholds or hybrid weights.
- Certain directories pollute results -- add them to your ignore file.
- Performance needs tuning -- adjust chunk sizes or disable stages for small codebases.

**Extend** when:
- Your domain vocabulary is specialized enough that general models miss key relationships.
- Your workflow needs metadata the default pipeline does not produce (license flags, ownership info, staleness scores).
- You have behavioral data (click history, query logs) that can improve ranking.
- You need to integrate non-code context (docs, issues, design specs) into the retrieval flow.

Configuration covers 90% of teams. The remaining 10% need the extension points described in the rest of this chapter.

### Custom Embedding Models

The default MiniLM model works for general codebases. If your domain vocabulary is specialized -- bioinformatics, financial instruments, game physics, hardware description languages -- the model may not capture the semantic relationships between your terms.

**Option 1: Fine-tune on your code.**

Take the base MiniLM model and fine-tune it on your codebase (as covered in *Code Retrieval from Scratch*). You need (query, relevant code chunk) pairs -- extractable from search logs or constructable from saved query patterns. A few hundred pairs is enough to shift the embedding space toward your domain.
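
As a rough illustration of the "extractable from search logs" step, the sketch below turns click events into deduplicated (query, chunk ID) pairs. The log schema (dicts with `query` and `clicked_result` keys) and the `min_clicks` filter are assumptions made for this example; adapt them to whatever your logging actually records.

```python
from collections import Counter


def build_training_pairs(
    events: list[dict],
    min_clicks: int = 2,
) -> list[tuple[str, str]]:
    """Turn click logs into (query, chunk_id) training pairs.

    A pair is kept only if the same result was clicked for the same
    normalized query at least `min_clicks` times, which filters out
    accidental clicks.
    """
    counts: Counter = Counter()
    for event in events:
        clicked = event.get('clicked_result')
        if clicked:
            normalized = event['query'].strip().lower()
            counts[(normalized, clicked)] += 1

    return sorted(pair for pair, n in counts.items() if n >= min_clicks)


events = [
    {'query': 'Payment Retry', 'clicked_result': 'pay.py:10'},
    {'query': 'payment retry ', 'clicked_result': 'pay.py:10'},
    {'query': 'payment retry', 'clicked_result': None},
    {'query': 'auth middleware', 'clicked_result': 'auth.py:5'},
]
print(build_training_pairs(events))
```

The chunk IDs still need to be resolved to chunk text before they can be used as training examples.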

```python
# After fine-tuning, swap the model in your pipeline
pipeline = HybridSearch(
    chunks=chunks,
    model_name='./models/custom-minilm-fintech'  # Local fine-tuned model
)

# Rebuild index with new model
pipeline.rebuild_index()
```

The rebuild is slower than incremental indexing because every chunk must be re-embedded. On a typical codebase it still finishes in under 30 seconds, though the time scales with chunk count and hardware.

**Option 2: Use a larger model.**

If fine-tuning is too much effort, a larger general-purpose model captures more nuance. The trade-off is speed:

```python
# Larger model: better quality, slower queries
pipeline = HybridSearch(
    chunks=chunks,
    model_name='sentence-transformers/all-mpnet-base-v2'  # 768 dimensions
)
```

A 110M parameter model is roughly 5x slower per query than MiniLM. Your 6ms queries become ~30ms queries. Still fast for most use cases, but it affects the cold start story (Chapter 3) and may matter at scale.

### Adding Custom Pipeline Stages

The pipeline is a sequence of stages, each narrowing or reranking the candidate set. You can insert additional stages at any point.

**Example: License-Awareness Filter**

Your legal team wants to know when search results draw from third-party code:

```python
class LicenseFilter:
    def __init__(self, third_party_dirs: list[str], action: str = 'flag'):
        self.third_party_dirs = third_party_dirs
        self.action = action  # 'flag' or 'exclude'

    def process(self, results: list[dict]) -> list[dict]:
        """Flag or exclude results from third-party directories."""
        processed = []
        for result in results:
            file_path = result.get('metadata', {}).get('file_path', '')
            is_third_party = any(
                file_path.startswith(d) for d in self.third_party_dirs
            )

            if is_third_party and self.action == 'exclude':
                continue

            result_copy = result.copy()
            if is_third_party:
                result_copy['flags'] = result_copy.get('flags', []) + ['third-party']
            processed.append(result_copy)

        return processed
```

**Example: Recency Boost**

Prefer recently modified code over stale code:

```python
import time
from pathlib import Path

class RecencyBoost:
    def __init__(self, repo_path: str, max_boost: float = 0.1, half_life_days: int = 30):
        self.repo_path = repo_path
        self.max_boost = max_boost
        self.half_life = half_life_days * 86400  # Convert to seconds

    def process(self, results: list[dict]) -> list[dict]:
        """Boost recently modified files."""
        now = time.time()
        boosted = []

        for result in results:
            file_path = result.get('metadata', {}).get('file_path', '')
            full_path = Path(self.repo_path) / file_path

            boost = 0.0
            if full_path.exists():
                mtime = full_path.stat().st_mtime
                age = now - mtime
                # Exponential decay based on file age
                boost = self.max_boost * (0.5 ** (age / self.half_life))

            result_copy = result.copy()
            result_copy['rrf_score'] = result.get('rrf_score', 0) + boost
            result_copy['recency_boost'] = boost
            boosted.append(result_copy)

        boosted.sort(key=lambda x: x['rrf_score'], reverse=True)
        return boosted
```

**Example: Code Ownership Signal**

In large organizations, knowing who owns the code is as important as finding it:

```python
import fnmatch
from pathlib import Path

class OwnershipAnnotator:
    def __init__(self, codeowners_path: str):
        self.owners = self._parse_codeowners(codeowners_path)

    def process(self, results: list[dict]) -> list[dict]:
        """Annotate results with code ownership information."""
        annotated = []
        for result in results:
            file_path = result.get('metadata', {}).get('file_path', '')
            owner = self._find_owner(file_path)

            result_copy = result.copy()
            result_copy['owner'] = owner
            annotated.append(result_copy)

        return annotated

    def _parse_codeowners(self, path: str) -> list[tuple[str, str]]:
        """Parse CODEOWNERS file into (pattern, owner) pairs."""
        owners = []
        try:
            for line in Path(path).read_text().splitlines():
                line = line.strip()
                if line and not line.startswith('#'):
                    parts = line.split()
                    if len(parts) >= 2:
                        owners.append((parts[0], ' '.join(parts[1:])))
        except FileNotFoundError:
            pass
        return owners

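    def _find_owner(self, file_path: str) -> str:
        """Find the owner for a given file path.

        CODEOWNERS uses gitignore-style patterns, which fnmatch only
        approximates (a directory pattern such as `/src/` will not match).
        The last matching entry takes precedence, so scan in reverse.
        """
        for pattern, owner in reversed(self.owners):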
            if fnmatch.fnmatch(file_path, pattern):
                return owner
        return 'unowned'
```

### Composing the Extended Pipeline

Custom stages plug into the pipeline sequence:

```python
class ExtendedPipeline:
    def __init__(self, base_pipeline, custom_stages: list | None = None):
        self.base = base_pipeline
        self.custom_stages = custom_stages or []

    def search(self, query: str, top_k: int = 20) -> list[dict]:
        """Run the full extended pipeline."""
        # Base pipeline: AST boost -> BM25 -> Semantic -> RRF -> Rerank -> Threshold
        results = self.base.search(query, top_k=top_k * 2)

        # Run custom stages
        for stage in self.custom_stages:
            results = stage.process(results)

        return results[:top_k]

# Usage
pipeline = ExtendedPipeline(
    base_pipeline=full_pipeline,
    custom_stages=[
        LicenseFilter(['vendor/', 'third_party/'], action='flag'),
        RecencyBoost('/path/to/repo', max_boost=0.05),
        OwnershipAnnotator('/path/to/repo/CODEOWNERS'),
    ]
)
```
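
The three example stages share an implicit contract: a `process` method that takes a list of result dicts and returns one. One way to make that contract explicit is a `typing.Protocol`. This is a sketch, not part of the pipeline shown above; `PipelineStage`, `MinScoreFilter`, and `run_stages` are hypothetical names.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class PipelineStage(Protocol):
    """Structural type for anything that can act as a custom stage."""

    def process(self, results: list[dict]) -> list[dict]: ...


class MinScoreFilter:
    """Hypothetical example stage: drop results below a score floor."""

    def __init__(self, min_score: float, score_key: str = 'rrf_score'):
        self.min_score = min_score
        self.score_key = score_key

    def process(self, results: list[dict]) -> list[dict]:
        return [r for r in results if r.get(self.score_key, 0.0) >= self.min_score]


def run_stages(results: list[dict], stages: list) -> list[dict]:
    """Run stages in order, failing fast on objects that lack process()."""
    for stage in stages:
        if not isinstance(stage, PipelineStage):
            raise TypeError(f"{type(stage).__name__} does not implement process()")
        results = stage.process(results)
    return results
```

Stage order matters. A filter that runs before a boost sees unboosted scores, so put annotators last and filters wherever their input scores are final.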

### Multi-Language Support

Most codebases use more than one language. The chunking and AST strategies (as covered in *Code Retrieval from Scratch*) are language-specific. Supporting multiple languages requires:

1. **Language detection.** Typically by file extension, with fallback to content analysis for ambiguous cases.

2. **Language-specific parsers.** Tree-sitter supports 100+ languages. Each language needs its own grammar, but the AST traversal patterns are similar.

3. **Cross-language relationships.** A Python service calling a Go microservice via HTTP does not create an AST edge. These relationships exist at the infrastructure level, not the syntax level.

```python
from pathlib import Path

import tree_sitter_python as tspython
import tree_sitter_javascript as tsjs
import tree_sitter_go as tsgo
from tree_sitter import Language, Parser

LANGUAGE_MAP = {
    '.py': ('python', Language(tspython.language())),
    '.js': ('javascript', Language(tsjs.language())),
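    # Approximation: the JS grammar reports ERROR nodes on TypeScript-only syntax
    # (type annotations, interfaces). The tree-sitter-typescript grammar is more accurate.
    '.ts': ('javascript', Language(tsjs.language())),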
    '.go': ('go', Language(tsgo.language())),
}

def chunk_file(file_path: str, content: str) -> list[dict]:
    """Chunk a file using language-appropriate strategy."""
    ext = Path(file_path).suffix.lower()

    if ext in LANGUAGE_MAP:
        lang_name, language = LANGUAGE_MAP[ext]
        parser = Parser(language)
        return chunk_with_parser(content, file_path, parser, lang_name)
    else:
        # Fallback: sliding window for unsupported languages
        return chunk_by_sliding_window(content, file_path)
```

### Integrating with LLMs

The search pipeline produces ranked code chunks. In most workflows, these chunks feed into an LLM as context for code generation, explanation, or debugging.

The quality of the LLM's output depends directly on the quality and quantity of context it receives. This is where every component converges: good chunking produces meaningful units. Good embeddings find semantically relevant ones. BM25 catches exact matches. Hybrid fusion combines them. AST boosting adds structural context. Reranking (Chapter 1) puts the right answer first. Adaptive thresholds (Chapter 2) filter noise. The LLM sees only what passes all these stages.

The integration point is formatting results into a prompt context block:

```python
def format_results_for_llm(
    results: list[dict],
    max_tokens: int = 8000
) -> str:
    """Format search results as LLM context."""
    context_parts = []
    total_tokens = 0

    for result in results:
        content = result.get('content', '')
        metadata = result.get('metadata', {})

        # Estimate tokens (rough: 1 token per 4 characters)
        estimated_tokens = len(content) // 4
        if total_tokens + estimated_tokens > max_tokens:
            break

        header = f"# {metadata.get('file_path', 'unknown')}:{metadata.get('start_line', '?')}"
        if metadata.get('name'):
            header += f" ({metadata['name']})"

        flags = result.get('flags', [])
        if flags:
            header += f"  [{', '.join(flags)}]"

        context_parts.append(f"{header}\n```\n{content}\n```")
        total_tokens += estimated_tokens

    return '\n\n'.join(context_parts)
```

The key insight from Chapter 2 applies here: fewer, more relevant results produce better LLM outputs than more, noisier results. The adaptive threshold ensures the LLM receives only code that passed the relevance bar.

**Context ordering matters.** LLMs exhibit a "lost in the middle" effect: they attend more strongly to the beginning and end of the context window than to the middle. Place the most relevant result first. If you have low-confidence results (flagged by the threshold), place them at the end with explicit qualification:

```python
def format_results_with_ordering(
    results: list[dict],
    max_tokens: int = 8000
) -> str:
    """Format results with attention to context ordering.

    High-confidence results go first. Low-confidence results
    go at the end with qualification.
    """
    high_confidence = [r for r in results if r.get('above_threshold', True)]
    low_confidence = [r for r in results if not r.get('above_threshold', True)]

    parts = []
    total_tokens = 0

    # High-confidence results first
    for result in high_confidence:
        block, tokens = format_single_result(result)
        if total_tokens + tokens > max_tokens:
            break
        parts.append(block)
        total_tokens += tokens

    # Low-confidence results with qualification
    if low_confidence and total_tokens < max_tokens * 0.8:
        parts.append("\n# The following results are lower-confidence matches:")
        for result in low_confidence:
            block, tokens = format_single_result(result)
            if total_tokens + tokens > max_tokens:
                break
            parts.append(block)
            total_tokens += tokens

    return '\n\n'.join(parts)
```
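
The function above calls `format_single_result`, which is not defined in this chapter. A minimal version, assuming the same result shape as `format_results_for_llm` and the same rough estimate of 4 characters per token, might look like this:

```python
def format_single_result(result: dict) -> tuple[str, int]:
    """Format one result as a fenced block and estimate its token cost.

    Returns (block, estimated_tokens). The estimate covers the whole
    block, including the header line, at roughly 4 characters per token.
    """
    content = result.get('content', '')
    metadata = result.get('metadata', {})

    header = f"# {metadata.get('file_path', 'unknown')}:{metadata.get('start_line', '?')}"
    if metadata.get('name'):
        header += f" ({metadata['name']})"

    flags = result.get('flags', [])
    if flags:
        header += f"  [{', '.join(flags)}]"

    # Build the fence programmatically so this listing nests cleanly in Markdown
    fence = '`' * 3
    block = f"{header}\n{fence}\n{content}\n{fence}"
    return block, len(block) // 4
```

Counting the header in the estimate makes the budget slightly more conservative than counting content alone.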

**Token budget allocation.** The LLM has a fixed context window. Your search results compete with the system prompt, the user's message, and any conversation history. A practical budget allocation:

- System prompt: ~500 tokens
- Conversation history: ~2,000 tokens
- User message: ~200 tokens
- **Search context: 4,000-8,000 tokens**
- Generation headroom: ~2,000 tokens

Within the search context budget, the adaptive threshold from Chapter 2 naturally controls how many results fit. If the threshold passes 3 chunks averaging 800 tokens each, the total is 2,400 tokens -- well within budget. If it passes 12 chunks, you may need to truncate or summarize the lower-ranked results.
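
The allocation above reduces to simple arithmetic. The sketch below derives a search-context budget from a model's context window using the figures listed above as defaults, then counts how many ranked chunks fit. Both function names and the example window sizes are assumptions made for illustration.

```python
def search_context_budget(
    context_window: int,
    system_prompt: int = 500,
    history: int = 2000,
    user_message: int = 200,
    generation_headroom: int = 2000,
    cap: int = 8000,
) -> int:
    """Tokens left for search context after the fixed allocations, capped."""
    reserved = system_prompt + history + user_message + generation_headroom
    return max(0, min(context_window - reserved, cap))


def chunks_that_fit(chunk_tokens: list[int], budget: int) -> int:
    """Count how many chunks, taken in rank order, fit within the budget."""
    used = 0
    count = 0
    for tokens in chunk_tokens:
        if used + tokens > budget:
            break
        used += tokens
        count += 1
    return count


budget = search_context_budget(context_window=16000)
print(budget, chunks_that_fit([800] * 12, budget))
```

With a 16,000-token window the budget hits the 8,000-token cap, so 10 of the 12 chunks in the example above fit and the last two would need truncating or summarizing.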

### Query Reformulation

Not every query works well on the first attempt. Query reformulation -- automatically rewriting a query to improve results -- can significantly improve search quality without requiring the user to manually iterate.

**Expansion.** If a query returns few results above the threshold, automatically expand it with related terms:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def expand_query(query: str, model: SentenceTransformer, top_k_terms: int = 3) -> str:
    """Expand a query with semantically related terms.

    Uses the embedding model to find terms that are close to the query
    in embedding space.
    """
    # Encode the query
    query_emb = model.encode(query)

    # Common code-related terms to consider as expansions
    candidate_terms = [
        "function", "class", "method", "variable", "config",
        "handler", "middleware", "controller", "service", "repository",
        "model", "schema", "validator", "serializer", "parser",
        "utility", "helper", "manager", "factory", "builder",
        "test", "mock", "fixture", "setup", "teardown",
        "error", "exception", "retry", "timeout",
        "database", "query", "connection", "pool", "transaction",
        "cache", "redis", "memcached", "session", "token",
        "api", "endpoint", "route", "request", "response",
    ]

    # Find the most relevant expansion terms
    term_embs = model.encode(candidate_terms)
    similarities = np.dot(term_embs, query_emb) / (
        np.linalg.norm(term_embs, axis=1) * np.linalg.norm(query_emb)
    )

    # Pick top-k terms that are relevant but not already in the query
    query_terms = set(query.lower().split())
    expansions = []
    for idx in similarities.argsort()[::-1]:
        term = candidate_terms[idx]
        if term not in query_terms and similarities[idx] > 0.3:
            expansions.append(term)
            if len(expansions) >= top_k_terms:
                break

    if expansions:
        return f"{query} {' '.join(expansions)}"
    return query
```

**Decomposition.** Complex queries benefit from being broken into sub-queries:

"How does the app validate user input, save it to the database, and return a response?" decomposes into three searches:
1. "validate user input"
2. "save to database"
3. "return response"

The results from all three searches are merged and deduplicated. This often finds code across the full request lifecycle that a single query would miss.
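
One possible way to merge and deduplicate is a round-robin interleave, so that each sub-query contributes its best hits before any sub-query contributes its weaker ones. This is a sketch under the assumption that every result dict carries an `id` key; `merge_subquery_results` is a hypothetical helper. Splitting the query into sub-queries is a separate problem, often delegated to an LLM.

```python
from itertools import zip_longest


def merge_subquery_results(
    result_lists: list[list[dict]],
    top_k: int = 20,
) -> list[dict]:
    """Interleave ranked lists round-robin, keeping the first copy of each ID."""
    merged = []
    seen = set()

    # Each tier holds the rank-1 results, then the rank-2 results, and so on
    for tier in zip_longest(*result_lists):
        for result in tier:
            if result is None or result['id'] in seen:
                continue
            seen.add(result['id'])
            merged.append(result)

    return merged[:top_k]


validate = [{'id': 'forms.py:12'}, {'id': 'models.py:40'}]
save = [{'id': 'models.py:40'}, {'id': 'db.py:7'}]
respond = [{'id': 'views.py:88'}]
print([r['id'] for r in merge_subquery_results([validate, save, respond])])
```

Weighted RRF over the sub-query lists is another reasonable choice if you want chunks that match several sub-queries to rank higher.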

**Fallback.** If the initial query returns zero results above the threshold, try progressively simpler versions:
1. Original: "how does the payment retry logic handle idempotency keys"
2. Simplified: "payment retry idempotency"
3. Core terms: "payment retry"

Each simplification broadens the search. Stop at the first level that produces results.
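
A crude version of this fallback ladder can be sketched as follows. It strips stopwords and then drops trailing terms one at a time, which is only a stand-in for real core-term selection (by IDF weight, for example). The stopword list, the `search_fn` callable, and the function name are all assumptions made for this example.

```python
from typing import Callable

# Small illustrative stopword list; a real one would be longer
STOPWORDS = {'how', 'does', 'do', 'the', 'a', 'an', 'is', 'are', 'what', 'where', 'to', 'of'}


def search_with_fallback(
    query: str,
    search_fn: Callable[[str], list[dict]],
    min_terms: int = 2,
) -> tuple[str, list[dict]]:
    """Try the query, then progressively simpler versions of it.

    Returns (query_used, results) for the first level that produces
    results, or (query, []) if every level comes back empty.
    """
    candidates = [query]
    terms = [t for t in query.lower().split() if t not in STOPWORDS]

    while len(terms) >= min_terms:
        candidate = ' '.join(terms)
        if candidate != candidates[-1]:
            candidates.append(candidate)
        terms = terms[:-1]  # Drop the last term to broaden the search

    for candidate in candidates:
        results = search_fn(candidate)
        if results:
            return candidate, results

    return query, []
```

Returning the query that actually matched lets the caller tell the user (or the LLM) that the results answer a broader question than the one asked.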

### MCP Server Composition

The pipeline runs as a server -- typically an MCP (Model Context Protocol) server that AI assistants query through a standard interface. An assistant is not limited to a single MCP server; it can talk to several simultaneously:

- **Code search server** -- the pipeline from this book
- **Documentation server** -- your team's design docs, API docs, READMEs
- **Issue tracker server** -- JIRA, Linear, GitHub Issues

When a developer asks "how does the payment retry logic work," the assistant queries all three. The code server returns the implementation. The docs server returns the design doc explaining why the retry limit is five. The issue tracker returns the bug that prompted the retry logic. The assistant synthesizes all three into a comprehensive answer.

This composition happens at the MCP client level, not in the pipeline. The pipeline focuses on code retrieval. Other servers handle other content types. The assistant orchestrates.

### Adding New Retrievers

The hybrid fusion architecture (as covered in *Code Retrieval from Scratch*) is not limited to two retrievers. RRF fuses an arbitrary number of ranked lists. You can add specialized retrievers alongside BM25 and semantic search:

**Regex retriever.** For queries that are clearly patterns (contain `*`, `?`, `.+`, etc.), a regex search over the raw code produces results that neither BM25 nor semantic search can match. A developer searching for `def test_.*payment` wants all test functions related to payment -- a regex handles this directly.

```python
import re

class RegexRetriever:
    def __init__(self, chunks: list[dict]):
        self.chunks = chunks

    def search(self, pattern: str, top_k: int = 20) -> list[dict]:
        """Search chunks using regex pattern matching."""
        try:
            compiled = re.compile(pattern, re.IGNORECASE)
        except re.error:
            return []  # Invalid regex -- skip this retriever

        results = []
        for chunk in self.chunks:
            matches = compiled.findall(chunk['content'])
            if matches:
                results.append({
                    'id': f"{chunk['file_path']}:{chunk['start_line']}",
                    'content': chunk['content'],
                    'metadata': chunk,
                    'match_count': len(matches),
                })

        # Rank by match count
        results.sort(key=lambda x: x['match_count'], reverse=True)
        return results[:top_k]
```
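
Deciding whether a query is "clearly a pattern" needs a heuristic, because natural-language queries also contain characters like `?`. The sketch below is one possible check, not the book's implementation: it looks for a few unambiguous regex constructs and then confirms the query compiles. The hint list and the name `looks_like_regex` are assumptions, and the check will produce some false positives (for example, `items[0]`).

```python
import re

# Constructs that rarely appear in natural-language queries:
# ".*", ".+", escapes like \w, character classes, and anchors
_PATTERN_HINTS = re.compile(r'\.\*|\.\+|\\[dwsb]|\[[^\]]+\]|^\^|\$$')


def looks_like_regex(query: str) -> bool:
    """Heuristically decide whether a query should go to the regex retriever."""
    if not _PATTERN_HINTS.search(query):
        return False
    try:
        re.compile(query)
    except re.error:
        return False
    return True


for q in ['def test_.*payment', 'how does the payment retry logic work?']:
    print(q, '->', looks_like_regex(q))
```

Queries that fail the check skip the regex retriever entirely, so it contributes nothing to fusion for ordinary questions.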

**Docstring retriever.** Extract docstrings from all chunks and build a separate embedding index over just the docstrings. Queries about "what does X do" often match docstrings better than they match code, because docstrings are written in natural language.

**Comment retriever.** Similarly, extract inline comments and build a BM25 index over them. Developers often document workarounds, TODOs, and explanations in comments. A query about "why does the rate limiter use 100 instead of 50" might match a comment that no other retriever would find.

All of these retrievers plug into RRF alongside BM25 and semantic search:

```python
hybrid_results = reciprocal_rank_fusion(
    rankings=[bm25_results, semantic_results, regex_results, docstring_results],
    weights=[1.0, 1.0, 0.5, 0.8]
)
```

The weights reflect each retriever's typical usefulness. Regex is weighted lower because it only helps for pattern queries. Docstring is weighted slightly lower than semantic because docstrings are often incomplete or absent. Tuning these weights against your evaluation set (Chapter 5) optimizes the fusion for your codebase.

### When to Build vs. When to Use

This book has walked through building every component from scratch. That is valuable for understanding. It is not always the right choice for production.

If your requirements are:
- Standard codebase (Python, TypeScript, Go, Java)
- Standard queries (identifier lookup, conceptual search, cross-file navigation)
- Standard scale (<100K files)

Then a production-ready tool that implements this architecture will serve you better than a custom build. The engineering effort to make a pipeline robust -- handling edge cases, supporting concurrent queries, maintaining index integrity, surviving crashes -- is substantial.

If your requirements include any of the extensions in this chapter -- domain-specific models, custom stages, non-standard languages, integration with proprietary systems -- then you need the understanding from this book to either extend an existing tool or build your own.

The architecture is the same either way. The decision is build vs. buy, not learn vs. skip.

---

### Exercise

> **Try This**
>
> Extend your pipeline with at least one custom stage:
>
> 1. Implement the `LicenseFilter` and add it to your pipeline. Create a fake `vendor/` directory with a few files and verify they get flagged.
> 2. Implement the `RecencyBoost` and measure its effect. Does it meaningfully change rankings for your codebase?
> 3. If your codebase has multiple languages, implement language detection and verify that each language's files get chunked with the appropriate strategy.
> 4. Format your search results for LLM consumption. Estimate the token count. How does the adaptive threshold from Chapter 2 affect the token budget?
> 5. Run your Chapter 5 benchmark suite with the extended pipeline. Did any metric change? Did any regression occur?

---

### Key Takeaways

- Configure first, extend second. Threshold tuning, ignore files, and hybrid weights cover 90% of customization needs.
- Custom embedding models help when domain vocabulary is specialized. Fine-tuning on 200-500 query-code pairs is usually sufficient.
- Custom pipeline stages (license filters, recency boosts, ownership annotations) plug into the pipeline sequence without modifying existing stages.
- Multi-language support requires language-specific parsers but follows the same chunking and graph patterns.
- MCP composition lets the search pipeline focus on code while other servers handle docs, issues, and other context types.
- The architecture from this book applies whether you build from scratch or extend an existing tool. Understanding the components lets you make informed decisions either way.

### The Complete Pipeline, End to End

As a final reference, here is the complete pipeline assembled from all the components covered in this book and its companion:

```python
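import os
from pathlib import Path

from sentence_transformers import SentenceTransformer

# CodeReranker, CodeBM25, ThresholdConfig, and the helper functions used below
# come from earlier chapters and the companion book.

class SemanticSearchPipeline: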
    """Complete semantic search pipeline for code.

    Implements the 5-stage architecture:
    1. AST Graph Boosting (covered in Code Retrieval from Scratch)
    2. BM25 Keyword Search (covered in Code Retrieval from Scratch)
    3. Semantic Embedding Search (covered in Code Retrieval from Scratch)
    4. Cross-Encoder Reranking (Chapter 1)
    5. Adaptive Threshold (Chapter 2)

    With hybrid fusion (covered in Code Retrieval from Scratch) combining the outputs of stages 2 and 3.
    """

    def __init__(
        self,
        repo_path: str,
        model_name: str = 'all-MiniLM-L6-v2',
        reranker_name: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2',
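        threshold_config: ThresholdConfig | None = None,
    ):
        self.repo_path = repo_path
        # Create the default per instance; a default argument would be shared across instances
        self.threshold_config = (
            threshold_config if threshold_config is not None else ThresholdConfig()
        )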

        # Build components (Chapter 3 covers phased loading)
        print("Loading pipeline components...")
        self.model = SentenceTransformer(model_name)
        self.reranker = CodeReranker(reranker_name)

        # Index the codebase (Chapter 4 covers incremental indexing)
        chunks = self._index_codebase()
        self.bm25 = CodeBM25(chunks)
        self.collection = build_embedding_index(chunks, model_name)
        self.ast_graph = build_ast_graph(chunks)
        self.chunks = chunks

        print(f"Pipeline ready: {len(chunks)} chunks indexed")

    def search(self, query: str, top_k: int = 20) -> list[dict]:
        """Execute the full 5-stage pipeline."""
        # Stage 1: AST Graph Boosting
        boosts = compute_ast_boosts(query, self.ast_graph, self.chunks)

        # Stage 2: BM25 Keyword Search
        bm25_results = self.bm25.search(query, top_k=top_k * 2)

        # Stage 3: Semantic Embedding Search
        semantic_results = semantic_search(
            self.collection, query, self.model, n_results=top_k * 2
        )

        # Hybrid Fusion (RRF)
        weights = classify_query(query)
        fused = reciprocal_rank_fusion(
            [bm25_results, semantic_results],
            weights=[weights['bm25'], weights['semantic']]
        )

        # Apply AST boosts
        fused = apply_ast_boosts(fused, boosts)

        # Stage 4: Cross-Encoder Reranking
        reranked = self.reranker.rerank(query, fused[:30])

        # Stage 5: Adaptive Threshold
        filtered = apply_adaptive_threshold(reranked, self.threshold_config)

        return filtered[:top_k]

    def _index_codebase(self) -> list[dict]:
        """Index the codebase using AST-aware chunking."""
        all_chunks = []
        for root, dirs, files in os.walk(self.repo_path):
            dirs[:] = [d for d in dirs if not d.startswith('.')]
            for fname in files:
                if fname.endswith('.py'):
                    fpath = Path(root) / fname
                    content = fpath.read_text(encoding='utf-8', errors='ignore')
                    rel_path = str(fpath.relative_to(self.repo_path))
                    chunks = chunk_with_ast_context(content, rel_path)
                    all_chunks.extend(chunks)
        return all_chunks
```

This is the simplest possible implementation of the architecture. A production implementation adds the phased loading from Chapter 3, the incremental indexing from Chapter 4, the evaluation framework from Chapter 5, and whatever extensions Chapter 6 demands.

The point of building it from scratch is not that you should deploy this code as-is. The point is that you now understand every decision in the pipeline: why the chunking works this way, why fusion beats single retrieval, why reranking resolves the mushy middle, why the threshold adapts instead of being fixed. When you evaluate production tools or build your own, that understanding translates directly into better decisions.

### Where to Go From Here

This book covered ranking, production engineering, evaluation, and extension. Several frontiers remain:

**Multi-modal search.** Searching not just code but also diagrams, screenshots, design documents, and API documentation alongside code. The fusion architecture extends naturally -- add retrievers for each modality and fuse with RRF.

**Conversational search.** Using the history of a search session to improve subsequent queries. If the developer asked "where is the authentication middleware" and then asks "what does it call," the second query should inherit context from the first.

**Personalized ranking.** Different developers search differently. A frontend developer and a backend developer asking the same query expect different results. Personalized ranking models (trained on individual search history) can customize the pipeline per user.

**Federated search.** Searching across multiple repositories simultaneously, maintaining per-repo indexes but fusing results across repos. This is the monorepo problem (Chapter 4) generalized to independent repositories.
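
As a rough illustration of the federated case, the sketch below merges per-repo rankings by rank position, in the spirit of RRF, because raw scores from independent indexes are generally not comparable. The function name, repo names, and paths are hypothetical, and a real system would still rerank the merged candidates.

```python
def merge_repo_rankings(per_repo_results, top_k=5, k=60):
    """Merge ranked results from independent per-repo indexes.

    per_repo_results maps a repo name to its ranked list of chunk paths.
    Each result is scored by rank only, 1 / (k + rank), so no repo's
    scoring scale can dominate the merged list.
    """
    scored = []
    for repo, ranking in per_repo_results.items():
        for rank, path in enumerate(ranking, start=1):
            scored.append((1.0 / (k + rank), repo, path))

    # Highest score first; ties broken by repo and path for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(repo, path) for _, repo, path in scored[:top_k]]


# Hypothetical repos and paths.
merged = merge_repo_rankings(
    {
        "api": ["auth/middleware.py", "db/session.py"],
        "web": ["src/login.ts"],
    },
    top_k=2,
)
```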

Each of these extensions is a book of its own. The pipeline architecture -- retrieve broadly, rank precisely, filter adaptively -- is the foundation that every extension builds on.

---


## Conclusion

You can now build a code search system that doesn't fall apart under real conditions. Not a prototype that demos well and collapses at scale. Not a search bar that returns results nobody trusts. A system with defined latency budgets, a query pipeline that handles the mess of real developer input, an indexing strategy that keeps pace with a codebase that never stops changing, and benchmarks that tell you when something breaks before your users do. That's the capability you've built a foundation for across these chapters.

Three threads run through everything here, and they're worth naming explicitly because they don't announce themselves in chapter headings.

The first is the gap between what you index and what people actually search. Production code is not clean. It's aliased, refactored mid-sprint, partially migrated, and full of naming conventions that made sense two engineers ago. The chapters on indexing and query processing both orbit this problem from different directions — one asks how you represent code accurately, the other asks how you interpret the query charitably. Getting both right is not optional. A system that indexes perfectly but can't handle "that thing that wraps the DB calls" is a system your team will stop using. A system that handles fuzzy queries brilliantly but indexes stale artifacts is worse than grep. The whole pipeline has to work.

The second thread is that latency and trust are the same thing wearing different clothes. Developers have near-zero tolerance for slow search. Not because they're impatient — because slow search breaks the flow of thought that made them open a search box in the first place. The latency chapter is technically about SLAs and percentiles, but it's actually about whether people come back. P99 latency of 800ms sounds fine until you realize that's one in a hundred searches that costs a developer their train of thought. The benchmarking chapter is where you find out if you've actually solved this or just convinced yourself you have. Together they form a discipline: measure honestly, set thresholds with teeth, and treat a regression as a regression even if it only shows up in the tail.

The third thread is operational permanence. Most search systems get built once and then quietly degrade. The codebase grows. New languages get added. Teams rename things. The query patterns that dominated launch shift as the product evolves. The chapter on extending the pipeline exists because a search system that can't absorb change will eventually be abandoned, and abandonment rarely gets announced. It just happens when someone falls back to `grep -r` and never returns. The extension patterns there are not advanced features. They're the difference between a system you maintain and a system you eventually rewrite.

Here's the Monday morning action: pick one endpoint — one — and run it through the evaluation framework from the benchmarking chapter against your current search infrastructure, whatever it is. Not a full audit. Not a migration plan. One endpoint, real queries from real users (pull them from logs if you have them), and score the results against a ground truth you define in the next two hours. You will find either that your current system is better than you thought, which is useful to know, or that it has a specific, measurable failure mode, which is more useful. Either outcome gives you a number. A number is something you can act on, report on, and improve. Right now you probably have an opinion. An opinion is much harder to work with.
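
A first pass at that scoring can be very small. The sketch below computes MRR and Recall@k for a handful of logged queries. The `evaluate` function, the query strings, and the file paths are hypothetical placeholders, and this is not the evaluation framework from Chapter 5, only enough to get a number.

```python
def evaluate(run, ground_truth, k=5):
    """Score ranked results against hand-labeled ground truth.

    run maps each query to its ranked list of result IDs.
    ground_truth maps each query to the set of IDs judged relevant.
    """
    reciprocal_ranks = []
    recalls = []
    for query, relevant in ground_truth.items():
        if not relevant:
            continue  # a query with no labeled answers cannot be scored
        ranked = run.get(query, [])
        # Rank of the first relevant result, or None if there is none.
        first_hit = next(
            (rank for rank, doc in enumerate(ranked, start=1) if doc in relevant),
            None,
        )
        reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)
        recalls.append(len(relevant & set(ranked[:k])) / len(relevant))

    if not reciprocal_ranks:
        raise ValueError("ground truth contains no labeled queries")
    n = len(reciprocal_ranks)
    return {"mrr": sum(reciprocal_ranks) / n, "recall_at_k": sum(recalls) / n}


# Hypothetical queries pulled from logs, with hand-labeled answers.
ground_truth = {
    "where is auth middleware": {"auth/middleware.py"},
    "retry logic for db calls": {"db/retry.py", "db/pool.py"},
}
run = {
    "where is auth middleware": ["auth/views.py", "auth/middleware.py"],
    "retry logic for db calls": ["db/retry.py", "utils/backoff.py"],
}
metrics = evaluate(run, ground_truth, k=5)
```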

The most common reason people don't apply what they've read in a book like this is that the gap between "I understand this" and "I have run the first command" feels larger than it is. There's a pull toward planning — toward designing the whole system before touching any of it — because planning feels like progress without the risk of being wrong. But production systems are not designed into existence. They're iterated into existence. The first version of your indexing pipeline will have the wrong chunking strategy. Your first latency budget will be too generous or too tight. Your first evaluation dataset will have gaps. This is not failure; this is the process. The chapters here give you a framework for iterating toward a system that works, not a blueprint for getting it right on the first try. The blueprint doesn't exist.

What changes if you act on this is concrete. Your team finds relevant code faster, which means debugging gets shorter, onboarding gets shorter, and the answer to "where does this get handled" stops being a Slack thread. Search becomes infrastructure — invisible when it works, deeply felt when it doesn't, and something you can actually reason about when it degrades. That's not a small thing. Code search touches every developer workflow, which means a system that works compounds in value the same way bad search compounds in friction.

What doesn't change if you don't act is also concrete. You keep the system you have, which means you keep the problems you have. The searches that return nothing useful. The results that are accurate but stale. The queries that work perfectly for the person who wrote the system and fail for everyone else. Those problems don't resolve on their own. They become the background hum of a codebase getting harder to navigate, and the cost accumulates in places that don't show up in any dashboard — in the minutes lost before someone finds the right file, in the questions never asked because the search wasn't worth trying.

Production code search is not a research problem. The techniques exist, the infrastructure exists, the evaluation methodology exists. What's left is building it and being honest about whether it's working. You now have the vocabulary, the mental models, and the operational framework to do that. The only remaining question is when you start.

# Back Matter

---

## Pipeline Architecture Summary

A visual summary of the complete pipeline. Stages 1-3 (retrieval) are covered in the companion book *Code Retrieval from Scratch*. Stages 4-5 (ranking and filtering) are covered in this book.

```
        Query: "how does authentication work"
                          |
         ┌────────────────┼────────────────┐
         |                |                |
  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
  | Stage 1     |  | Stage 2     |  | Stage 3     |
  | AST         |  | Semantic    |  | BM25        |
  | Boost       |  | Search      |  | Search      |
  | (companion) |  | (companion) |  | (companion) |
  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
         |                |                |
         |                ├────────────────┘
         |                |
         |         ┌─────────────┐
         |         | RRF Fusion  |
         |         | (companion) |
         |         └──────┬──────┘
         |                |
         └───────┬────────┘
                 |
          ┌──────┴───────┐
          | Fused +      |
          | Boosted      |
          | Candidates   |
          └──────┬───────┘
                 |
          ┌──────┴───────┐
          | Stage 4      |
          | Cross-Encoder|
          | Reranking    |
          | (Ch. 1)      |
          └──────┬───────┘
                 |
          ┌──────┴───────┐
          | Stage 5      |
          | Adaptive     |
          | Threshold    |
          | (Ch. 2)      |
          └──────┬───────┘
                 |
          ┌──────┴───────┐
          | Final        |
          | Results      |
          | (3-5 chunks) |
          └──────────────┘
```

**Stage timing breakdown (typical query, 50K-file codebase):**

| Stage | Time | What It Does |
|-------|------|-------------|
| AST Boost | 0.4ms | Walk graph from anchor nodes, assign structural boost scores |
| Semantic Search | 2.5ms | Encode query, search HNSW index for nearest vectors |
| BM25 Search | 0.8ms | Look up query terms in inverted index, compute BM25 scores |
| RRF Fusion | 0.1ms | Merge ranked lists using reciprocal rank fusion |
| Cross-Encoder | 1.5ms | Score top 25 query-document pairs for precise relevance |
| Adaptive Threshold | 0.1ms | Filter by relative margin off top score |
| **Total** | **5.4ms** | |
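
Numbers like these are only meaningful if you collect them the same way each time. One way to gather per-stage timings is a small context manager wrapped around each stage. `StageTimer` and the stage names below are hypothetical, and the `sum` calls merely stand in for real pipeline stages.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named pipeline stage, in milliseconds."""

    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self):
        return sum(self.timings_ms.values())


timer = StageTimer()
with timer.stage("bm25"):
    sum(range(10_000))  # placeholder for the real BM25 lookup
with timer.stage("rerank"):
    sum(range(10_000))  # placeholder for the real cross-encoder call

for name, ms in timer.timings_ms.items():
    print(f"{name}: {ms:.3f}ms")
print(f"total: {timer.total_ms():.3f}ms")
```

A single run tells you little. Repeat over a representative query set and report percentiles rather than one measurement.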

**Accuracy at each stage (cumulative):**

| After Stage | Top-1 Accuracy | Top-5 Accuracy |
|-------------|---------------|----------------|
| BM25 only | 42% | 66% |
| Semantic only | 55% | 75% |
| Hybrid (BM25 + Semantic) | 61% | 91% |
| + AST Boosting | 64% | 91% |
| + Cross-Encoder Reranking | 87% | 97% |
| + Adaptive Threshold | 87% (filtered) | 97% (filtered) |

The threshold does not improve accuracy metrics -- it improves downstream quality by removing noise.
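
A relative-margin filter of the kind summarized in the timing table can be sketched in a few lines. The function name `filter_by_relative_margin`, the `margin` value, and the `max_results` cap are illustrative assumptions, not the tuned configuration from Chapter 2. Results are assumed to be `(chunk_id, score)` pairs already sorted by descending reranker score.

```python
def filter_by_relative_margin(results, margin=0.15, max_results=5):
    """Keep results whose score is within a relative margin of the top score.

    results is a list of (chunk_id, score) pairs sorted best-first.
    The cutoff moves with the top score, so a confident query keeps few
    results and an ambiguous one keeps more, up to max_results.
    """
    if not results:
        return []
    top_score = results[0][1]
    # abs() keeps the cutoff below the top score even for negative scores.
    cutoff = top_score - margin * abs(top_score)
    kept = [pair for pair in results if pair[1] >= cutoff]
    return kept[:max_results]


# Hypothetical reranker output.
reranked = [("auth.py", 0.90), ("session.py", 0.85), ("README.md", 0.40)]
filtered = filter_by_relative_margin(reranked)
```

In this example the third result falls well outside the margin and is dropped, which is the noise removal the accuracy table cannot show.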

---

## Appendix A: Glossary

### Information Retrieval Terms

| Term | Definition |
|------|-----------|
| Cross-encoder | An architecture that processes query and document together as a single concatenated input. Slower but captures fine-grained query-document interactions. |
| DCG | Discounted Cumulative Gain. A ranking quality metric that sums relevance grades, each discounted by the logarithm of its rank position. |
| Hard negative | A result that appears relevant by surface metrics (keywords, similarity score) but is actually irrelevant or misleading. |
| HNSW | Hierarchical Navigable Small World. A graph-based algorithm for approximate nearest-neighbor search. Query time scales roughly logarithmically with index size. |
| MRR | Mean Reciprocal Rank. The average of 1/rank where rank is the position of the first relevant result, across all queries. |
| nDCG | Normalized Discounted Cumulative Gain. DCG divided by ideal DCG, normalizing to a 0-1 scale. Measures ranking quality accounting for position and relevance grade. |
| Precision | Of the results returned, the fraction that are relevant. |
| Ranking | The process of ordering retrieved candidates by relevance. Tightens the net. |
| Recall | Of all relevant documents in the corpus, the fraction that were returned. |
| Recall@k | Recall computed over only the top k results. |
| Top-k | Returning the k highest-ranked results. Static top-k uses a fixed k; adaptive thresholds adjust dynamically. |
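
The DCG and nDCG definitions above are short enough to write out. The sketch below takes a list of relevance grades in ranked order. It assumes the ideal ordering can be derived from that same list (that is, every relevant document was retrieved), which is a simplification, and the function names are illustrative.

```python
import math


def dcg(grades):
    """Sum of relevance grades, each divided by log2(position + 1)."""
    return sum(
        grade / math.log2(position + 1)
        for position, grade in enumerate(grades, start=1)
    )


def ndcg(grades):
    """DCG of the given order divided by DCG of the best possible order."""
    ideal_dcg = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg([3, 2, 1])   # best result first: already ideal
swapped = ndcg([1, 3])      # best result in second place: penalized
```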

### Code Analysis Terms

| Term | Definition |
|------|-----------|
| AST | Abstract Syntax Tree. A tree representation of source code produced by a parser, capturing the syntactic structure (functions, classes, expressions, statements). |
| Call graph | A directed graph where edges represent function calls. Node A has an edge to node B if function A calls function B. |
| Chunk | A piece of code extracted from a file for indexing. Typically a function, method, or class. The unit of search. |
| Import graph | A directed graph where edges represent module imports. Shows dependency relationships between files. |
| Memory-mapped file | A file mapped into a process's virtual address space. Pages load from disk on demand, avoiding the need to read the entire file upfront. |
| Tree-sitter | A parser generator and incremental parsing library used for syntax highlighting, code navigation, and AST extraction across many languages. |

---

## Appendix B: Tools & Resources

| Tool / Resource | URL | Purpose |
|----------------|-----|---------|
| sentence-transformers | https://www.sbert.net | Python library for embedding models. Supports bi-encoders and cross-encoders with a clean API. |
| ChromaDB | https://www.trychroma.com | Open-source embedding database with HNSW indexing. Good for prototyping and small-to-medium indexes. |
| FAISS | https://github.com/facebookresearch/faiss | Facebook's vector similarity search library. More performant than ChromaDB for large-scale indexes. |
| Qdrant | https://qdrant.tech | Production-grade vector database with filtering, payload storage, and horizontal scaling. |
| CodeSearchNet | https://github.com/github/CodeSearchNet | Benchmark dataset for code search evaluation. Six languages, natural language queries paired with relevant functions. |
| Pyckle CLI | https://pyckle.co | Semantic code search tool implementing the architecture described in this book. |

---

## Appendix C: Series Cross-References (Code Search, Decoded)

This book expands on material from the Code Search, Decoded blog and video series. For readers who want the shorter-form treatment:

| Chapter | Series Episode | Topic |
|---------|---------------|-------|
| Ch. 1 | Ep 10 | Cross-encoder reranking |
| Ch. 2 | Ep 11 | Adaptive thresholds |
| Ch. 3 | Ep 12 | Cold start and performance |
| Ch. 4 | Ep 13 | Incremental indexing |
| Ch. 6 | Ep 21 | Extending the pipeline |

Chapter 5 is original to this book with no direct series counterpart.

---

## Appendix D: Further Reading

- **"Introduction to Information Retrieval"** by Manning, Raghavan, and Schütze. The standard textbook on IR fundamentals. Free online. Covers evaluation metrics and ranking models in depth.

- **"Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"** by Malkov and Yashunin (IEEE TPAMI 2018). The HNSW paper. Explains the graph structure and search algorithm used by most vector databases.

- **"ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT"** by Khattab and Zaharia (SIGIR 2020). A hybrid between bi-encoders and cross-encoders that may represent the future of efficient precise retrieval.

- **Code Search, Decoded** (Episodes 8-13, 21) by David Kelly Price. The blog series that covers the same pipeline architecture in episodic format. Available at pyckle.co/blog.

---

## About the Author

David Kelly Price is the founder of Pyckle, building AI context optimization tools for development teams. Background in AI/ML tooling, retrieval systems, and context routing for codebases. MBA in Finance -- analytical rigor applied to technical problems.

---

## About Pyckle

Pyckle builds semantic code search tools for AI-assisted development. The core product is a local search server that indexes your codebase and serves relevant code to your AI assistant through the Model Context Protocol (MCP). The architecture described in this book -- AST graph boosting, hybrid BM25+semantic search, cross-encoder reranking, adaptive thresholds, and incremental indexing -- is the foundation of Pyckle's retrieval pipeline.

Pyckle runs locally on developer hardware. Code never leaves the machine. The search pipeline executes in under 10 milliseconds per query. The index stays fresh through git-aware incremental updates. The goal is straightforward: when an AI assistant needs code context, give it the right code, in the right order, with nothing extra.

More information at pyckle.co.

---

## Acknowledgments

This book draws on decades of information retrieval research, from the BM25 work of Robertson, Walker, and colleagues (1994) to the transformer architectures that power modern embeddings. The pipeline architecture described here is informed by production systems at Google (dense retrieval), Microsoft (CodeBERT), Meta (FAISS), and the broader open-source IR community.

The Code Search, Decoded series (Episodes 8-13, 21) covers many of the same topics in a blog/video format and served as the source material for several chapters. Readers who want a more concise treatment or prefer video content should start there.

Special thanks to the sentence-transformers, ChromaDB, tree-sitter, and rank-bm25 teams for building the open-source tools that make this pipeline practical to implement.

---

*Production Code Search — Version 1.0 — March 2026*
*Published by Pyckle (pyckle.co)*

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*


---

## Related Blog Posts

- [Why Naive Retrieval Breaks at Scale](https://pyckle.co/blog/why-naive-retrieval-breaks-at-scale-and-what-we-built-instead.html)
- [Semantic Chunking Will Not Save Your RAG System](https://pyckle.co/blog/semantic-chunking-will-not-save-your-rag-system.html)
- [Vector Databases Are Not Your RAG Bottleneck](https://pyckle.co/blog/vector-databases-are-not-your-rag-bottleneck.html)
