---
title: "Code Retrieval from Scratch"
subtitle: "Chunking, Embeddings, and Hybrid Search for Code"
author: "David Kelly Price"
version: "1.0"
date: 2026-03-21
status: draft
type: ebook
target_audience: "Senior/staff engineers who want to understand or build code retrieval systems — comfortable with Python, familiar with basic ML concepts, interested in information retrieval"
estimated_pages: 90
chapters:
  - "The Retrieval Problem in Code"
  - "Chunking Strategies"
  - "Embedding Models for Code"
  - "BM25 for Code"
  - "Hybrid Retrieval with RRF"
  - "AST Graph Boosting"
tags:
  - pyckle
  - ebook
  - semantic-search
  - code-retrieval
  - embeddings
  - bm25
  - chunking
  - ast
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Color scheme:
- Pyckle brand palette
- Callout boxes use muted background tints, not heavy borders

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Code blocks:
- Syntax highlighted by language
- Numbered lines for reference in explanatory text
- Copy-pasteable (no line numbers in actual code)

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# Code Retrieval from Scratch

## Chunking, Embeddings, and Hybrid Search for Code

**By David Kelly Price**

Version 1.0 — March 2026

---

## Table of Contents

**Part I: Foundations**

1. The Retrieval Problem in Code
2. Chunking Strategies
3. Embedding Models for Code

**Part II: Retrieval**

4. BM25 for Code
5. Hybrid Retrieval with RRF
6. AST Graph Boosting

Pipeline Architecture Summary
Appendix A: Glossary
Appendix B: Tools & Resources
Appendix C: Series Cross-References
Appendix D: Further Reading

---

## About This Guide

This guide walks you through building the retrieval layer of a code search system, from raw source files to a ranked list of candidates. By the end, you will have built a system that takes a natural-language query, searches a codebase of any size, and returns the code that matters -- retrieved through multiple complementary strategies, fused, and structurally boosted. Every component is explained with enough depth to implement, and every implementation choice is explained with enough context to modify.

This book covers the first three stages of a complete search pipeline: chunking code into searchable units, encoding those units as vectors, and retrieving candidates through lexical, semantic, and structural methods. The companion book, *Production Code Search*, covers the remaining stages: cross-encoder reranking, adaptive thresholds, scaling, and evaluation.

---

## How to Use This Guide

**Reading order:** Sequential. Each chapter builds on the previous. Part I covers what you need before you can search -- how to split code into chunks and represent those chunks as vectors. Part II covers how to find candidates -- through keyword matching, semantic similarity, and structural graph traversal.

**Exercises:** Each chapter ends with a hands-on exercise designed to take 10-30 minutes. The exercises build on each other -- by the end, you will have a working hybrid retrieval system with AST graph boosting. You need Python 3.10+, and the exercises use `sentence-transformers`, `rank-bm25`, `chromadb`, and `tree-sitter`.

**Prerequisites:** Comfortable reading and writing Python. Familiar with basic data structures (trees, graphs, hash maps). Basic ML familiarity is helpful but not required -- this book explains embeddings and vector similarity from the ground up.

---

# Part I: Foundations

---

## Chapter 1: The Retrieval Problem in Code

### Chapter Overview

Code search is not document search with different files. Code has structure, naming conventions, cross-file dependencies, and semantic density that make retrieval fundamentally harder. This chapter maps the problem space.

---

### Why Code Search Is Different

Text search has been a solved problem for decades. Google handles billions of queries per day across the open web. Elasticsearch powers full-text search for millions of applications. The algorithms are well-understood, the tooling is mature, and the failure modes are documented.

Then you try to search a codebase and everything breaks.

A developer types "where is the authentication middleware" into their AI assistant. The codebase has a class called `JWTVerifier` in a file called `token_validation.py` inside a directory called `security/`. No file, class, or function contains the words "authentication" or "middleware." A keyword search returns nothing. A naive semantic search returns the README that mentions "authentication" in passing.

The actual answer spans four files: the route decorator in `app.py`, the middleware class in `security/token_validation.py`, the token model in `security/models.py`, and the configuration in `config/auth.py`. No single chunk of code contains the full answer. The answer is the relationship between chunks.

This is not a shortcoming of any particular search tool. It is the nature of code as a medium.

### Syntax vs. Semantics

Natural language is messy but redundant. A paragraph about database connections will use the word "database" or "connection" multiple times, in different phrasings, with context clues throughout. Keyword search thrives on this redundancy.

Code is precise and terse. A database connection pool might be implemented in a class called `Pool` with methods called `get()` and `release()`. The word "database" might appear only in a comment, or not at all. The developer who wrote the code knows that `Pool` means database connections. The developer searching for it six months later does not.

This gap between what code does and what code is called is the fundamental challenge. Information retrieval researchers call it the "vocabulary mismatch problem." In code, the mismatch is more severe than in any other domain, because:

1. **Identifiers are abbreviations.** `cfg`, `ctx`, `req`, `resp`, `mgr`, `svc` -- developers compress words into two or three characters. A search for "configuration manager" will not match `cfg_mgr` on keywords alone.

2. **Naming conventions vary.** Python uses `snake_case`. Java uses `camelCase`. Go uses short, unexported names. The same concept -- a function that validates user input -- might be called `validate_user_input`, `validateUserInput`, `checkInput`, or just `validate`. Four names, one concept, zero keyword overlap.

3. **Semantics live in structure, not text.** A function that processes payments is semantically defined by the Stripe API calls it makes, the database writes it executes, and the error handling it implements. Its name might be `handle()` or `run()` or `execute()`. The meaning is in the code, not in the identifier.

4. **Context crosses file boundaries.** A web request enters through a route handler, passes through middleware, calls a service layer, which queries a repository, which builds SQL. The search for "how does the app handle authentication" touches five files. No single file contains the answer.

### The Dynamic Nature of Code

Unlike documents, code is alive. It changes constantly. A function that exists today might be renamed, refactored, or deleted tomorrow. A search index that was accurate yesterday might be stale today.

This dynamism creates challenges that document search does not face:

**Index freshness.** A search result that points to code that was deleted yesterday is worse than no result. Stale indexes erode trust. If a developer gets burned by stale results once, they stop trusting the search tool and go back to grep.

**Branch divergence.** In a team using feature branches, each developer has a different version of the codebase. An index built from main does not reflect the current developer's working tree. Search results might reference functions that the developer has already modified or removed in their branch.

**Semantic drift.** Over months, a codebase's conventions and patterns evolve. The naming conventions that were standard six months ago might have been replaced. The modules that existed last quarter might have been reorganized. An embedding model trained on the old patterns may not capture the new patterns as well.

These challenges shape the design decisions throughout the retrieval pipeline. Every component must be updateable without rebuilding everything from scratch.

### The Curse of Structural Similarity

Consider two functions:

```python
def process_payment(user_id, amount, currency="USD"):
    charge = stripe.Charge.create(amount=amount, currency=currency)
    update_balance(user_id, -amount)
    return charge

def process_refund(user_id, amount, currency="USD"):
    refund = stripe.Refund.create(amount=amount, currency=currency)
    update_balance(user_id, +amount)
    return refund
```

Same parameter list. Same structure. Same naming pattern. The embedding vectors for these two functions will have a cosine similarity above 0.95. They are nearly identical in vector space.

They do opposite things. One takes money. The other returns it. Mixing them up in search results is not a minor inconvenience -- it is the kind of confusion that leads to production bugs.

This is not a contrived example. Real codebases are full of these structural twins. `create_user` and `delete_user`. `encrypt` and `decrypt`. `serialize` and `deserialize`. `connect` and `disconnect`. The closer two functions are in purpose, the more likely they share structure -- and the more likely a purely semantic search confuses them.

Document search does not have this problem because natural language is naturally diverse. Two paragraphs about opposite topics use opposite words. Two functions that do opposite things use the same words.

### Cross-File References

In a document corpus, each document is self-contained. An article about climate change contains everything it has to say about climate change. You can rank it against the query independently.

Code does not work this way. A function's behavior depends on:

- The functions it calls (call graph)
- The modules it imports (import graph)
- The classes it extends or implements (inheritance graph)
- The types of its parameters and return values (type graph)
- The configuration that controls its behavior (runtime context)

A search for "rate limiting" might match a middleware function that calls `check_rate_limit()`. But the implementation of rate limiting -- the sliding window, the Redis counter, the backoff logic -- lives in a completely different file that the middleware function imports. Returning only the middleware gives the developer half the answer. Returning only the implementation gives them the other half. The full answer requires understanding the relationship between them.

This is why the retrieval system we build in this book includes structural analysis alongside semantic search. Embeddings find meaning. Structure finds relationships. Together, they answer the questions developers actually ask.

### The Scale Challenge

Code search operates at a scale that is deceptively large. A "small" codebase of 5,000 files produces 20,000 chunks. A medium codebase of 30,000 files produces 120,000 chunks. A monorepo with 100,000 files produces 400,000+ chunks.

Each chunk needs to be compared against every query. Naive comparison is `O(n)` -- for 200,000 chunks, that means 200,000 similarity computations per query. At 1 microsecond per computation, that is 200ms. Acceptable but not fast.

Modern vector indexes (HNSW) reduce this to `O(log n)` -- about 17 comparisons for 200,000 chunks. That is why the pipeline runs in single-digit milliseconds regardless of codebase size.

But scale also affects indexing. Computing 200,000 embeddings takes 3-8 minutes on CPU. Rebuilding the index after every code change is not viable. Incremental indexing processes only changed files, reducing update time to seconds. The combination of fast queries and fast updates makes semantic search practical for active development.

### The Quality Ceiling

Every code search system has a quality ceiling set by its weakest component. If your chunking splits functions in half, no embedding model can produce good vectors from half-functions. If your embedding model cannot distinguish `create` from `delete`, no amount of BM25 will resolve the ambiguity. If your reranker is slow, developers will turn off reranking and accept worse results.

The pipeline architecture addresses this by making each component independently upgradeable. You can improve chunking without changing the embedding model. You can swap the embedding model without changing BM25. You can add a custom reranker without touching the retrieval stages. Each upgrade lifts the quality ceiling without requiring changes to other components.

The chapters that follow are ordered by their impact on the quality ceiling: foundations that set the floor (Chapters 1-3) and retrieval that determines coverage (Chapters 4-6). Ranking and production engineering -- reranking, adaptive thresholds, cold start, and incremental indexing -- are covered in the companion book, *Production Code Search*.

### What a Search Pipeline Needs

By the end of this book, you will have built a retrieval system with three core capabilities:

1. **Lexical retrieval** -- finding code that contains specific tokens (Chapter 4)
2. **Semantic retrieval** -- finding code that matches the meaning of a query (Chapters 2-3)
3. **Structural analysis** -- using the codebase's own architecture to boost relevant results (Chapter 6)

Each capability addresses a specific failure mode. Remove any one, and the retrieval system degrades in predictable ways. The chapters that follow build these capabilities one at a time, in the order they compose. The remaining stages of a complete pipeline -- precision ranking and confidence filtering -- are covered in *Production Code Search*.

### A Brief History of Code Search

Understanding where code search has been puts where it is going in perspective.

The earliest code search tool was `grep`, released in 1974. It does exactly one thing: match a regular expression against lines of text. It knows nothing about code structure, meaning, or relevance. It returns every matching line, in file order, with no ranking.

Fifty years later, grep is still the default code search tool for most developers. That is not because grep is good at code search. It is because nothing else has been good enough, fast enough, and available enough to replace it.

The next generation of code search tools -- IDE find-in-files, GitHub code search, Sourcegraph -- added indexing and basic ranking. They could search across repositories, filter by language, and sort results by file relevance. These tools solved the scale problem (searching millions of files in milliseconds) but not the understanding problem. They still matched strings, not concepts.

Semantic search changed the game. By encoding code into vectors and measuring similarity in vector space, semantic search can find code that matches the meaning of a query even when no keywords overlap. "How does the app handle authentication" can find `class JWTVerifier` because the embedding model learned the relationship between authentication and JWT verification during training.

But semantic search introduced its own problems. It is computationally expensive. It requires an index that takes minutes to build. It struggles with exact identifiers. It confuses structurally similar functions. And it returns scores on a continuous scale, making it hard to determine where relevance ends.

The retrieval architecture in this book combines all three generations: keyword matching (BM25, the modern grep), structural analysis (AST graph boosting, the modern IDE), and semantic understanding (embedding models, the new capability). Each generation covers the others' weaknesses.

### The Precision-Recall Trade-off in Code

In information retrieval, there is a fundamental tension between precision and recall. High precision means every result you return is relevant, but you might miss some. High recall means you find everything relevant, but you return noise alongside it.

For code search feeding an LLM, this trade-off has a specific shape. The LLM processes all the context it receives. Noise is not just wasteful -- it is actively harmful. An LLM given three relevant functions and seventeen irrelevant ones will attempt to synthesize all twenty into a coherent response. The irrelevant functions introduce false connections, hallucinated relationships, and qualified hedging.

This means code search pipelines should bias toward precision over recall. It is better to return three exactly-right results than twenty probably-right results. The pipeline architecture achieves this through a precision cascade: broad retrieval (high recall, moderate precision) followed by reranking (higher precision) followed by adaptive thresholds (highest precision).

The numbers across the pipeline stages:

| Stage | Precision | Recall |
|-------|-----------|--------|
| BM25 retrieval (top 50) | ~30% | ~85% |
| Semantic retrieval (top 50) | ~35% | ~88% |
| Hybrid fusion (top 30) | ~45% | ~91% |
| Cross-encoder reranking (top 10) | ~75% | ~87% |
| Adaptive threshold (variable) | ~92% | ~82% |

Each stage trades a small amount of recall for a large gain in precision. The final output has high enough precision that LLMs produce accurate responses, with enough recall that the important code is almost always present. This book covers the first three rows -- retrieval. The companion book, *Production Code Search*, covers reranking and thresholds.

### The Vocabulary of Retrieval

Before we build anything, we need shared vocabulary. If you have worked with information retrieval before, skim this section. If you have not, read it carefully -- these terms come up constantly.

**Query.** The input. A string that describes what the user is looking for. In code search, queries range from exact identifiers (`POSTGRES_MAX_CONNECTIONS`) to natural language questions ("how does the app handle authentication").

**Document.** In traditional IR, a discrete unit of text. In code search, a "document" is typically a chunk -- a function, a class, or a section of a file. We cover chunking in Chapter 2.

**Corpus.** The full set of documents. For code search, the corpus is the indexed codebase.

**Relevance.** Whether a document actually answers the query. This is subjective and graded -- a document can be highly relevant, somewhat relevant, or not relevant at all.

**Precision.** Of the results returned, how many are relevant. High precision means few false positives.

**Recall.** Of all relevant documents in the corpus, how many were returned. High recall means few false negatives.

**Ranking.** The order of results. A system that returns the right documents but in the wrong order is useful but frustrating.

**Retrieval.** The process of finding candidate documents from the corpus. Retrieval casts a wide net. Ranking tightens it.

---

### Exercise

> **Try This**
>
> Pick a codebase you work on daily. Write down five queries that represent your actual search behavior -- not theoretical queries, but things you have actually looked for in the last week. Classify each query:
>
> 1. Is it an exact identifier search? ("Where is `REDIS_CACHE_TTL` defined?")
> 2. Is it a conceptual search? ("How does authentication work?")
> 3. Is it a structural search? ("What calls `validate_token()`?")
> 4. Is it a cross-file search? ("How does a request flow from the API to the database?")
>
> Most developers find their queries split roughly evenly across these categories. A search system that only handles one category fails on the other three. Keep these five queries as your personal benchmark -- you will test each pipeline component against them as we build.

---

### Key Takeaways

- Code search is harder than document search because of identifier compression, structural similarity, naming convention variance, and cross-file dependencies.
- Keyword search fails when the code's naming does not match the developer's mental model. Semantic search fails when structurally similar functions do different things.
- The answer to a code query often spans multiple files connected by call graphs, import graphs, and inheritance hierarchies.
- A complete retrieval system needs lexical retrieval, semantic retrieval, and structural analysis. Precision ranking and confidence filtering complete the full pipeline.
- The precision-recall trade-off in code search favors precision because LLM output quality degrades with noise. The pipeline is a precision cascade: broad retrieval followed by progressively tighter filtering.

---

## Chapter 2: Chunking Strategies

### Chapter Overview

Before you can search code, you have to decide what a "searchable unit" is. This chapter covers how to split a codebase into chunks that produce meaningful embeddings and useful search results.

---

### Why Chunking Matters

An embedding model takes a piece of text and compresses it into a fixed-dimensional vector. That vector represents the meaning of the input. The quality of that representation depends entirely on what you feed the model.

Feed it a 3,000-line file and the model tries to compress three thousand lines of mixed concerns into 384 floats. The resulting vector is a vague average of everything in the file. It matches queries about any topic mentioned in the file, but not strongly for any of them.

Feed it a single line -- `import os` -- and the model produces a vector with almost no semantic content. It cannot distinguish this import from any other import.

Feed it a single function -- 20-80 lines of cohesive code with a clear purpose -- and the model produces a vector that captures what the function does. This is the sweet spot.

Chunking is the process of splitting a codebase into these units. The strategy you choose determines the ceiling of your search quality. No amount of sophisticated ranking can fix bad chunking, because the embeddings themselves are corrupted by incoherent inputs.

### Strategy 1: File-Level Chunking

The simplest approach: one chunk per file. No parsing required.

```python
import os
from pathlib import Path

def chunk_by_file(repo_path: str, extensions: list[str]) -> list[dict]:
    """Chunk a repository by treating each file as a single document."""
    chunks = []
    for root, dirs, files in os.walk(repo_path):
        # Skip hidden directories and common non-source directories
        dirs[:] = [d for d in dirs if not d.startswith('.')
                   and d not in ('node_modules', 'vendor', '__pycache__')]
        for fname in files:
            if any(fname.endswith(ext) for ext in extensions):
                fpath = Path(root) / fname
                content = fpath.read_text(encoding='utf-8', errors='ignore')
                chunks.append({
                    'content': content,
                    'file_path': str(fpath.relative_to(repo_path)),
                    'chunk_type': 'file',
                    'start_line': 1,
                    'end_line': content.count('\n') + 1,
                })
    return chunks
```

**When it works:** Small files under 100 lines. Utility modules with a single concern. Configuration files. Markdown documentation.

**When it fails:** Most production code. A 500-line file with six functions, two classes, and a module-level initialization block produces a single embedding that represents none of them well. A query about one function in the file matches the whole file -- and so do queries about the other five functions.

File-level chunking is adequate for prototyping. It is not adequate for a production search system.

### Strategy 2: Fixed-Size Sliding Window

Split the file into overlapping windows of fixed token count. This is the default strategy in most RAG tutorials.

```python
def chunk_by_sliding_window(
    content: str,
    file_path: str,
    window_size: int = 512,
    overlap: int = 128
) -> list[dict]:
    """Split content into overlapping windows of approximately window_size tokens."""
    words = content.split()
    chunks = []
    start = 0
    chunk_id = 0

    while start < len(words):
        end = min(start + window_size, len(words))
        chunk_text = ' '.join(words[start:end])

        # Estimate line numbers from word positions
        prefix = ' '.join(words[:start])
        start_line = prefix.count('\n') + 1
        chunk_lines = chunk_text.count('\n')

        chunks.append({
            'content': chunk_text,
            'file_path': file_path,
            'chunk_type': 'window',
            'chunk_id': chunk_id,
            'start_line': start_line,
            'end_line': start_line + chunk_lines,
        })

        if end >= len(words):
            break
        start += window_size - overlap
        chunk_id += 1

    return chunks
```

**When it works:** When you have no parser for the language. When the content is unstructured (comments, documentation embedded in code). As a fallback for languages you do not explicitly support.

**When it fails:** It splits functions in half. A 60-line function with a 512-token window size might get split at line 35, producing two chunks that each contain half a function. Neither chunk captures the function's full logic. Both match queries about the function, but neither matches well.

The overlap parameter mitigates this -- with 128 tokens of overlap, a function that straddles a window boundary appears (partially) in both chunks. But partial representation is still lossy. The beginning of the function and the end of the function are in different vectors.

Fixed-size windows are better than file-level chunking but worse than every structure-aware alternative.

### Strategy 3: Function-Level Chunking

Split the code at function and method boundaries using a parser. Each function becomes one chunk.

```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())

def chunk_by_function(content: str, file_path: str) -> list[dict]:
    """Extract function-level chunks using tree-sitter parsing."""
    parser = Parser(PY_LANGUAGE)
    tree = parser.parse(bytes(content, 'utf-8'))

    chunks = []

    def visit_node(node):
        if node.type in ('function_definition', 'class_definition'):
            # Include decorators if they precede the function
            start_node = node
            if node.prev_named_sibling and node.prev_named_sibling.type == 'decorator':
                # Walk back to find all decorators
                current = node.prev_named_sibling
                while current and current.type == 'decorator':
                    start_node = current
                    current = current.prev_named_sibling

            start_byte = start_node.start_byte
            end_byte = node.end_byte
            chunk_text = content[start_byte:end_byte]

            # Extract the function/class name
            name_node = node.child_by_field_name('name')
            name = content[name_node.start_byte:name_node.end_byte] if name_node else 'anonymous'

            chunks.append({
                'content': chunk_text,
                'file_path': file_path,
                'chunk_type': node.type,
                'name': name,
                'start_line': start_node.start_point[0] + 1,
                'end_line': node.end_point[0] + 1,
            })

        for child in node.children:
            visit_node(child)

    visit_node(tree.root_node)

    # Capture module-level code that is not inside any function/class
    if not chunks:
        # If no functions found, fall back to file-level
        chunks.append({
            'content': content,
            'file_path': file_path,
            'chunk_type': 'module',
            'name': file_path,
            'start_line': 1,
            'end_line': content.count('\n') + 1,
        })

    return chunks
```

**When it works:** Most production code. Functions are the natural semantic unit in procedural and object-oriented languages. A function has a name (searchable), a docstring (descriptive), parameters (typed), and a body (behavioral). All of this information contributes to a high-quality embedding.

**When it fails:** Very large functions (300+ lines) produce embeddings that are as unfocused as file-level chunks. Very small functions (one-liners, property getters) produce embeddings with almost no content. Module-level code -- imports, constants, initialization -- does not belong to any function but may be highly relevant to searches.

The solution is to handle edge cases explicitly:

```python
def chunk_by_function_with_limits(
    content: str,
    file_path: str,
    max_lines: int = 200,
    min_lines: int = 3
) -> list[dict]:
    """Function-level chunking with size guardrails."""
    raw_chunks = chunk_by_function(content, file_path)
    result = []

    for chunk in raw_chunks:
        lines = chunk['content'].split('\n')
        line_count = len(lines)

        if line_count > max_lines:
            # Split large functions into sub-chunks at logical boundaries
            sub_chunks = split_large_function(chunk, max_lines)
            result.extend(sub_chunks)
        elif line_count < min_lines:
            # Merge tiny functions with their neighbors
            if result:
                result[-1]['content'] += '\n\n' + chunk['content']
                result[-1]['end_line'] = chunk['end_line']
            else:
                result.append(chunk)
        else:
            result.append(chunk)

    return result
```

### Strategy 4: AST-Aware Chunking

The most sophisticated approach: use the full Abstract Syntax Tree to identify semantic units, preserving context that function-level chunking loses.

AST-aware chunking does not just split at function boundaries. It understands the hierarchy:

- A class definition is a chunk that includes its docstring, class-level attributes, and method signatures (but not full method bodies)
- Each method within the class is a separate chunk that includes the class name as context
- Module-level imports are attached to the chunks that use them
- Decorators stay with their decorated functions
- Type definitions and constants are grouped with the functions that reference them

```python
def chunk_with_ast_context(content: str, file_path: str) -> list[dict]:
    """AST-aware chunking that preserves structural context."""
    parser = Parser(PY_LANGUAGE)
    tree = parser.parse(bytes(content, 'utf-8'))

    chunks = []
    imports = extract_imports(tree, content)

    for node in tree.root_node.children:
        if node.type == 'class_definition':
            # Class-level chunk: signature + docstring + attribute summary
            class_name = get_node_name(node, content)
            class_summary = build_class_summary(node, content)
            chunks.append({
                'content': class_summary,
                'file_path': file_path,
                'chunk_type': 'class_summary',
                'name': class_name,
                'start_line': node.start_point[0] + 1,
                'end_line': node.end_point[0] + 1,
            })

            # Method-level chunks with class context prefix
            for child in node.children:
                if child.type == 'function_definition':
                    method_name = get_node_name(child, content)
                    method_text = content[child.start_byte:child.end_byte]

                    # Prefix with class context
                    context_prefix = f"# Class: {class_name}\n# File: {file_path}\n"

                    chunks.append({
                        'content': context_prefix + method_text,
                        'file_path': file_path,
                        'chunk_type': 'method',
                        'name': f'{class_name}.{method_name}',
                        'class_name': class_name,
                        'start_line': child.start_point[0] + 1,
                        'end_line': child.end_point[0] + 1,
                    })

        elif node.type == 'function_definition':
            func_text = content[node.start_byte:node.end_byte]
            # Include relevant imports as context
            func_imports = filter_imports_for_function(imports, func_text)
            context = '\n'.join(func_imports) + '\n\n' if func_imports else ''

            chunks.append({
                'content': context + func_text,
                'file_path': file_path,
                'chunk_type': 'function',
                'name': get_node_name(node, content),
                'start_line': node.start_point[0] + 1,
                'end_line': node.end_point[0] + 1,
            })

    return chunks
```

The context prefix is the key insight. When a method is extracted from its class, the embedding model loses the information that `self.pool.get()` refers to a database connection pool managed by the `DatabaseManager` class. The context prefix restores that information: `# Class: DatabaseManager` tells the embedding model what `self` refers to, improving the vector representation without including the entire class body.

This is the strategy used in production semantic search systems, including the architecture described in Episode 13 of the Code Search, Decoded series.

### Chunk Size Analysis

The optimal chunk size depends on the embedding model's context window and the semantic density of the code.

Most embedding models have a maximum input length of 512 tokens (MiniLM) or 8,192 tokens (larger models). Tokens are subwords, not words or lines. A rough conversion: 1 line of code is approximately 5-15 tokens depending on complexity.

For MiniLM at 512 tokens:
- Maximum chunk: ~35-100 lines
- Sweet spot: 15-60 lines
- Minimum useful: 5 lines

Too-small chunks lack context. A one-liner like `return self.pool.get()` tells you nothing about what `pool` is or what `get()` returns. Adding the class context (Chapter 2's AST-aware chunking) fixes this.

Too-large chunks dilute the embedding. A 200-line function embeds into a vector that represents the average of all its logic -- none of it strongly. Breaking it into sub-chunks at logical boundaries (early return clauses, loop bodies, conditional branches) preserves more meaning per chunk.

Measure chunk size distribution in your codebase to calibrate:

```python
def analyze_chunk_sizes(chunks: list[dict]) -> dict:
    """Analyze the distribution of chunk sizes in your index."""
    sizes = [len(chunk['content'].split('\n')) for chunk in chunks]

    return {
        'total_chunks': len(chunks),
        'min_lines': min(sizes),
        'max_lines': max(sizes),
        'mean_lines': sum(sizes) / len(sizes),
        'median_lines': sorted(sizes)[len(sizes) // 2],
        'under_5_lines': sum(1 for s in sizes if s < 5),
        'over_100_lines': sum(1 for s in sizes if s > 100),
        'over_200_lines': sum(1 for s in sizes if s > 200),
    }
```

Targets:
- `under_5_lines` should be less than 10% of total chunks. If higher, your chunking is too aggressive -- merge small functions.
- `over_100_lines` should be less than 5%. If higher, your chunking is not splitting large functions.
- `mean_lines` should be between 20 and 60 for most codebases.

### Language-Specific Chunking Nuances

Each language has unique structural patterns that affect chunking:

**Python.** Decorators are metadata that should stay with their function. A `@app.route("/api/users")` decorator is the most informative token for finding a REST endpoint. The chunker must attach preceding decorators to the function definition.

**TypeScript/JavaScript.** Arrow functions and exported constants blur the line between functions and variables. `export const validateEmail = (email: string): boolean => { ... }` is functionally a function but syntactically a variable declaration. Tree-sitter classifies it correctly, but custom chunking logic must handle both `function_declaration` and `arrow_function` nodes.

**Go.** Methods are defined as standalone functions with receiver arguments, not inside class bodies. A method `func (s *Server) HandleRequest(...)` is associated with the `Server` type through its receiver, not through lexical nesting. Chunking should group receiver methods with their type definition for context.

**Rust.** `impl` blocks group methods for a type, similar to classes. A single `impl` block might contain 20 methods. Chunking by `impl` block produces too-large chunks. Chunking by individual function loses the type context. The solution: chunk by function but prefix each chunk with the `impl` block's type signature.

**Java.** Inner classes are full-fledged types nested inside other types. Each inner class should be a separate chunk with its outer class as context. Method-level chunking within inner classes needs both the outer and inner class names for disambiguation.

### Comparing the Strategies

*Figure 1: Chunking Strategy Comparison*

| Strategy | Parse Required | Semantic Coherence | Edge Cases | Implementation Effort |
|----------|---------------|-------------------|------------|----------------------|
| File-level | No | Low | Few | Trivial |
| Sliding window | No | Medium | Boundary splits | Low |
| Function-level | Yes (tree-sitter) | High | Large/tiny functions | Medium |
| AST-aware | Yes (tree-sitter) | Highest | Language-specific | High |

The right choice depends on your constraints. If you support one or two languages and care about search quality, AST-aware chunking is worth the implementation effort. If you support many languages or need a quick prototype, function-level chunking with size guardrails is the sweet spot. Sliding windows are the universal fallback.

### Chunk Overlap and Deduplication

When chunks are extracted independently from the AST, the same code can appear in multiple chunks. A class definition chunk might include method signatures. Each method is also its own chunk. If a query matches both the class summary and a specific method, the search results contain redundant information.

Two strategies for handling this:

**Deduplication at query time.** After retrieval, detect overlapping chunks and keep only the most specific one:

```python
def deduplicate_results(results: list[dict]) -> list[dict]:
    """Remove results that are subsets of higher-ranked results."""
    deduplicated = []

    for result in results:
        is_subset = False
        r_file = result['metadata']['file_path']
        r_start = result['metadata']['start_line']
        r_end = result['metadata']['end_line']

        for kept in deduplicated:
            k_file = kept['metadata']['file_path']
            k_start = kept['metadata']['start_line']
            k_end = kept['metadata']['end_line']

            if r_file == k_file and r_start >= k_start and r_end <= k_end:
                # This result is contained within an already-kept result
                is_subset = True
                break

        if not is_subset:
            deduplicated.append(result)

    return deduplicated
```

**Hierarchical chunking.** Create chunks at multiple granularities (class level, method level, statement level) and tag them with their hierarchy. At query time, the pipeline can decide which granularity to return based on the query type:

- "What does the DatabasePool class do?" -> return the class-level chunk
- "How does get_connection handle timeouts?" -> return the method-level chunk
- "Where is POOL_TIMEOUT defined?" -> return the statement-level chunk

The granularity decision can be based on query classification (similar to the BM25/semantic weight classification in Chapter 5) or left to a downstream reranker, which naturally prefers the chunk that most precisely answers the query.

### Metadata: What to Store Alongside the Chunk

The chunk content goes into the embedding model. But you should store additional metadata that the search pipeline uses for ranking, filtering, and display:

```python
@dataclass
class CodeChunk:
    content: str           # The text that gets embedded
    file_path: str         # Relative path in the repository
    chunk_type: str        # function, method, class_summary, module
    name: str              # Function/class name
    start_line: int        # First line number in the source file
    end_line: int          # Last line number in the source file
    language: str          # python, typescript, go, etc.
    imports: list[str]     # What this chunk imports
    calls: list[str]       # Functions this chunk calls
    class_name: str | None # Enclosing class, if any
    docstring: str | None  # Extracted docstring, if any
    hash: str              # Content hash for change detection
```

The `imports` and `calls` fields feed the AST graph boosting stage (Chapter 6). The `hash` field enables incremental indexing. The `language` field enables language-specific ranking. None of this metadata affects the embedding, but all of it affects the pipeline.

---

### Exercise

> **Try This**
>
> Take a single Python file from your codebase (ideally 200-500 lines with multiple functions and a class or two). Implement all four chunking strategies from this chapter and compare the results:
>
> 1. How many chunks does each strategy produce?
> 2. Pick a function in the file. Which strategy produces the chunk that best represents that function in isolation?
> 3. Does any strategy split a function across two chunks?
> 4. Does any strategy include irrelevant code in a chunk (code from a different function or class)?
>
> Record the chunk counts and your quality assessment. These numbers calibrate your intuition for the rest of the book.

---

### Key Takeaways

- Chunking determines the ceiling of search quality. Bad chunks produce bad embeddings, and no amount of ranking fixes that.
- File-level chunking is too coarse. Sliding windows split functions. Function-level chunking is the practical sweet spot.
- AST-aware chunking preserves context (class membership, imports, decorators) that improves embedding quality.
- Store metadata alongside chunks -- file paths, line numbers, imports, call targets -- for use by downstream pipeline stages.
- Every chunking strategy is a trade-off between semantic coherence and implementation complexity. Choose based on your language support requirements.

---

## Chapter 3: Embedding Models for Code

### Chapter Overview

An embedding model turns a chunk of code into a vector. The choice of model, the dimensionality, and the training data determine whether semantically similar code ends up near each other in vector space. This chapter covers how code embeddings work, which models to consider, and when to fine-tune.

---

### What an Embedding Is

An embedding is a fixed-size numerical representation of a piece of text. The input can be any length -- a single word, a paragraph, an entire function. The output is always the same size: a vector of floating-point numbers.

For a 384-dimensional model like MiniLM, the output is 384 floats:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

code = """
def validate_email(email: str) -> bool:
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))
"""

embedding = model.encode(code)
print(f"Shape: {embedding.shape}")   # (384,)
print(f"Values: {embedding[:5]}")    # [-0.0234, 0.1092, -0.0567, 0.0891, -0.0345]
```

The key property of a good embedding model: texts with similar meaning produce vectors that are close together. The distance metric is usually cosine similarity:

```
cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)
```

Where `A . B` is the dot product and `||A||` is the L2 norm. The result ranges from -1 (opposite) to 1 (identical). For normalized vectors (which most models produce), cosine similarity simplifies to the dot product.

Two code chunks that do similar things -- two validation functions, two database queries, two API handlers -- should have high cosine similarity (0.8+). Two unrelated chunks -- a validation function and a logging utility -- should have low similarity (below 0.5).

### The Training Gap: General Text vs. Code

Most embedding models are trained on natural language text: Wikipedia, books, web pages, question-answer pairs. They learn the semantic relationships between English words and phrases. They know that "automobile" and "car" are similar. They know that "bank" (financial) and "bank" (river) are different.

Code is not natural language. It has its own vocabulary (`def`, `class`, `import`, `return`), its own grammar (indentation, brackets, semicolons), its own naming conventions (`snake_case`, `camelCase`), and its own semantics (a `for` loop is not the same as a `while` loop, even though they both iterate).

A model trained only on natural language struggles with code in predictable ways:

1. **Identifier compression.** The model does not know that `cfg_mgr` means "configuration manager" because it was trained on unabbreviated English.

2. **Structural patterns.** The model does not know that `try/except/finally` in Python and `try/catch/finally` in Java represent the same error handling pattern.

3. **Type semantics.** The model does not know that `List[User]` and `Vec<User>` represent the same concept in different languages.

4. **Call semantics.** The model does not know that `stripe.Charge.create()` means "process a payment" because it has never seen the Stripe API documentation during training.

### Models That Understand Code

Several embedding models have been trained on code, or on a mixture of code and natural language:

**all-MiniLM-L6-v2 (22M parameters, 384 dimensions)**

The workhorse. Not code-specific, but trained on a diverse enough corpus that it handles code reasonably well. Fast -- sub-millisecond per embedding on CPU. Small -- 80MB model file. The best starting point for most projects.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
```

**CodeBERT (125M parameters, 768 dimensions)**

Microsoft's code-specific BERT variant, trained on six programming languages (Python, Java, JavaScript, PHP, Ruby, Go) alongside natural language. Better than MiniLM at understanding code structure and cross-language patterns. Slower -- 5-10ms per embedding on CPU. Larger -- 500MB.

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('microsoft/codebert-base')
model = AutoModel.from_pretrained('microsoft/codebert-base')

def encode_code(text: str) -> list[float]:
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use [CLS] token embedding as the sentence embedding
    embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
    return embedding.tolist()
```

**StarEncoder / StarCoder Embeddings (varies, 768-1024 dimensions)**

Embeddings derived from the StarCoder family of code generation models. These models have seen enormous amounts of code during pre-training (The Stack dataset -- 6TB of source code). Their embeddings capture code patterns that general-purpose models miss entirely.

**Instructor-based models (varies)**

Models like `hkunlp/instructor-large` that accept an instruction prefix alongside the text. You can tell the model what kind of embedding you want:

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')

instruction = "Represent the Python function for semantic code search:"
code = "def validate_email(email): ..."

embedding = model.encode([[instruction, code]])
```

The instruction prefix shifts the embedding space toward your task. For code search, you would use different instructions for indexing ("Represent this code function") versus querying ("Represent this search query about code"). This asymmetry helps bridge the vocabulary gap between how developers describe code and how code is actually written.

### Dimensionality and Its Trade-offs

Higher dimensionality means more expressive power. A 768-dimensional vector can encode finer distinctions than a 384-dimensional vector. But higher dimensionality also means:

- **Larger index size.** Each vector takes `dimensions * 4 bytes` (for float32). A 200,000-chunk index at 384 dimensions is ~300MB. At 768 dimensions, it is ~600MB. At 1024 dimensions, it is ~800MB.

- **Slower similarity computation.** Cosine similarity is `O(d)` where `d` is the dimension count. Twice the dimensions means twice the compute per comparison. With HNSW indexing this matters less (you are not comparing against every vector), but it still affects latency.

- **Slower cold start.** The vector index is a major component of startup time. Larger indexes take longer to memory-map.

For code search, 384 dimensions is usually sufficient. The semantic distinctions that matter for code -- "validates input" vs. "validates output," "reads from database" vs. "writes to database" -- are coarse enough that 384 dimensions capture them. The cases where 768 dimensions help are edge cases: very similar functions with subtle behavioral differences.

The practical recommendation: start with MiniLM at 384 dimensions. Measure your search quality. If recall is high but precision is low -- you are finding relevant code but also finding too many false positives -- consider upgrading to a larger model. If recall is low -- you are missing relevant code entirely -- the problem is more likely in chunking (Chapter 2) or query formulation than in dimensionality.

### How Embeddings See Code

To build intuition for what embeddings capture and what they lose, it helps to examine the vector space directly.

Consider four Python functions:

```python
# Function A: Validates email format
def validate_email(email: str) -> bool:
    return bool(re.match(r'^[\w.+-]+@[\w-]+\.[\w.]+$', email))

# Function B: Validates phone number format
def validate_phone(phone: str) -> bool:
    return bool(re.match(r'^\+?1?\d{9,15}$', phone))

# Function C: Sends email notification
def send_email(to: str, subject: str, body: str) -> bool:
    return smtp_client.send(to=to, subject=subject, body=body)

# Function D: Connects to database
def get_db_connection(host: str, port: int) -> Connection:
    return psycopg2.connect(host=host, port=port, dbname='app')
```

The cosine similarities between their embeddings:

| | A (validate email) | B (validate phone) | C (send email) | D (db connection) |
|---|---|---|---|---|
| A | 1.00 | **0.89** | **0.72** | 0.31 |
| B | **0.89** | 1.00 | 0.38 | 0.29 |
| C | **0.72** | 0.38 | 1.00 | 0.34 |
| D | 0.31 | 0.29 | 0.34 | 1.00 |

Functions A and B are very similar (0.89) -- same structure, same purpose (validation), different targets (email vs. phone). This similarity is useful: a search for "input validation" should return both.

Functions A and C are moderately similar (0.72) -- both involve email, but one validates and the other sends. This is the problematic zone. A search for "email validation" should return A but not C. At 0.72 similarity, the model is uncertain.

Function D is clearly different from all others (0.29-0.34). No confusion here.

The lesson: embeddings cluster code by structural pattern first, semantic content second. Two validation functions are closer to each other than a validation function is to a sending function, even when the validation and sending functions share a domain concept (email). This is why structural features (Chapter 6) and downstream reranking are necessary -- they resolve the ambiguities that embedding similarity leaves open.

### Vector Space Geometry

The embedding space has useful geometric properties beyond pairwise similarity.

**Cluster structure.** Functions with similar purposes form clusters in vector space. All your API handlers cluster together. All your database utilities cluster together. All your test setup functions cluster together. This clustering enables not just search but also code organization analysis -- discovering implicit modules, finding functions that might belong in a different package, identifying code that does not fit anywhere.

**Analogical reasoning.** In word embedding spaces, the famous example is `king - man + woman = queen`. Code embeddings exhibit a similar property, though less cleanly. `create_user - user + product` lands near `create_product`. This means that query reformulation -- adjusting a query based on the kind of result the user wants -- can leverage the vector arithmetic.

**Dimensionality distribution.** Not all 384 dimensions contribute equally. In practice, the first 50-100 principal components capture most of the variance. The remaining dimensions encode fine-grained distinctions. This has practical implications for index compression: quantizing the vectors (reducing from float32 to int8, for example) loses the fine-grained distinctions but preserves the major clustering. For most code search use cases, quantized vectors are sufficient and reduce index size by 4x.

```python
import numpy as np

def quantize_vectors(vectors: np.ndarray) -> np.ndarray:
    """Quantize float32 vectors to int8 for 4x compression.

    Preserves relative ordering for most queries.
    """
    # Scale to [-128, 127] range per dimension
    mins = vectors.min(axis=0)
    maxs = vectors.max(axis=0)
    ranges = maxs - mins
    ranges[ranges == 0] = 1  # Avoid division by zero

    scaled = (vectors - mins) / ranges  # [0, 1]
    quantized = (scaled * 255 - 128).astype(np.int8)  # [-128, 127]

    return quantized
```

### Building an Embedding Index

Once you have chosen a model and a chunking strategy, building the index is mechanical:

```python
import chromadb
from sentence_transformers import SentenceTransformer

def build_embedding_index(
    chunks: list[dict],
    model_name: str = 'all-MiniLM-L6-v2',
    collection_name: str = 'code_search'
) -> chromadb.Collection:
    """Build a vector index from code chunks."""
    model = SentenceTransformer(model_name)
    client = chromadb.PersistentClient(path='./.search_index')

    # Delete existing collection if it exists
    try:
        client.delete_collection(collection_name)
    except ValueError:
        pass

    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )

    # Batch encode for efficiency
    batch_size = 256
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c['content'] for c in batch]

        embeddings = model.encode(texts, show_progress_bar=False).tolist()

        collection.add(
            ids=[f"chunk_{i+j}" for j in range(len(batch))],
            embeddings=embeddings,
            documents=texts,
            metadatas=[{
                'file_path': c['file_path'],
                'chunk_type': c['chunk_type'],
                'name': c.get('name', ''),
                'start_line': c['start_line'],
                'end_line': c['end_line'],
            } for c in batch]
        )

    return collection
```

The `hnsw:space: cosine` parameter tells ChromaDB to use cosine similarity for nearest-neighbor search. HNSW (Hierarchical Navigable Small World) is the graph-based index structure that makes approximate nearest-neighbor search fast -- `O(log n)` instead of `O(n)`.

### Querying the Index

Retrieval is the reverse: encode the query, find the nearest vectors.

```python
def semantic_search(
    collection: chromadb.Collection,
    query: str,
    model: SentenceTransformer,
    n_results: int = 20
) -> list[dict]:
    """Search the embedding index for relevant code chunks."""
    query_embedding = model.encode(query).tolist()

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=['documents', 'metadatas', 'distances']
    )

    search_results = []
    for i in range(len(results['ids'][0])):
        search_results.append({
            'id': results['ids'][0][i],
            'content': results['documents'][0][i],
            'metadata': results['metadatas'][0][i],
            'similarity': 1 - results['distances'][0][i],  # Convert distance to similarity
        })

    return search_results
```

Note the distance-to-similarity conversion. ChromaDB returns cosine distance (1 - similarity), so we subtract from 1 to get the similarity score. Other vector stores may return similarity directly.

### Tokenization: The Hidden Quality Factor

Embedding models use subword tokenization -- they break text into pieces that may not align with code tokens. The model's tokenizer was trained on natural language and may handle code poorly.

Consider the identifier `getUserAccountBalance`. A natural language tokenizer might break this into: `get`, `User`, `Account`, `Balance`. A subword tokenizer might produce: `getUser`, `Account`, `Bal`, `ance`. The second tokenization loses the word boundary between "Balance" and "ance," making it harder for the model to match a query about "balance."

You cannot change the tokenizer (it is fixed by the model's training). But you can improve the input by pre-processing code chunks before embedding:

```python
def preprocess_for_embedding(code: str) -> str:
    """Preprocess code to help the embedding model's tokenizer.

    Splits compound identifiers and adds natural language context.
    """
    # Split camelCase and snake_case into separate words
    processed = re.sub(r'([a-z])([A-Z])', r'\1 \2', code)
    processed = processed.replace('_', ' ')

    # Keep the original code as well -- the model benefits from both forms
    return f"{code}\n\n# Concepts: {processed}"
```

The `# Concepts:` suffix gives the tokenizer a second chance to capture the meaning of identifiers. The original code preserves exact token matching. This preprocessing consistently improves recall by 3-5% on identifier-heavy queries.

### Embedding Normalization

Most embedding models return unnormalized vectors. Normalizing them (scaling to unit length) simplifies similarity computation and improves result consistency:

```python
import numpy as np

def normalize_embedding(embedding: np.ndarray) -> np.ndarray:
    """Normalize embedding to unit length."""
    norm = np.linalg.norm(embedding)
    if norm == 0:
        return embedding
    return embedding / norm
```

After normalization, cosine similarity equals the dot product. This is faster to compute and produces scores on a consistent [0, 1] scale (for normalized vectors, cosine similarity is always non-negative in practice because text embeddings rarely produce negative dot products).

Most vector databases normalize automatically when using cosine distance. But if you are computing similarities manually or building a custom index, normalize explicitly.

### When to Fine-Tune

Fine-tuning an embedding model means continuing its training on your specific data. For code search, this means training on pairs of (query, relevant code chunk) from your codebase.

Fine-tuning is worth the effort when:

1. **Domain vocabulary is specialized.** Your codebase uses terms that the base model has never seen -- bioinformatics terms, financial instrument names, game engine concepts.

2. **Naming conventions are unusual.** Your team uses non-standard abbreviations or naming patterns that the model cannot infer.

3. **Cross-language patterns matter.** You need the model to understand that a Python function and a Go function do the same thing, and the base model does not capture that relationship.

Fine-tuning is not worth the effort when:

1. **The base model works well enough.** Measure first. If your recall@10 is above 0.85, fine-tuning will give you marginal improvements at significant cost.

2. **You do not have training data.** Fine-tuning requires at least 200-500 (query, relevant chunk) pairs. If you do not have search logs or cannot construct training pairs, you cannot fine-tune effectively.

3. **Your codebase changes rapidly.** A fine-tuned model captures the patterns of your codebase at training time. If the codebase changes significantly (new modules, renamed concepts, new languages), the fine-tuned model drifts.

A basic fine-tuning setup:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load training pairs: (query, positive_chunk, negative_chunk)
train_examples = [
    InputExample(texts=[
        "how does the app validate user input",
        "def validate_input(data: dict):\n    schema.validate(data)\n    ...",
    ]),
    InputExample(texts=[
        "database connection pooling",
        "class ConnectionPool:\n    def __init__(self, max_size=10):\n    ...",
    ]),
    # ... 200+ more pairs
]

model = SentenceTransformer('all-MiniLM-L6-v2')
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./models/code-search-minilm'
)
```

The `MultipleNegativesRankingLoss` is the standard loss function for embedding fine-tuning. It treats other examples in the same batch as negatives, which means you only need to provide positive pairs. The batch itself provides the contrastive signal.

After fine-tuning, rebuild your index with the new model and evaluate.

### Evaluating Embedding Quality Before Full Pipeline

You do not need to build the full pipeline to evaluate an embedding model. Two quick checks:

**Nearest-neighbor sanity check.** Pick 10 chunks you know well. For each, find the 5 nearest neighbors by cosine similarity. Are they related? If a database utility's nearest neighbors are all database-related code, the embeddings are working. If a database utility's nearest neighbor is a string formatting function, something is wrong.

```python
def embedding_sanity_check(collection, model, sample_chunks: list[dict]):
    """Quick quality check on embedding model using known code."""
    for chunk in sample_chunks:
        results = collection.query(
            query_embeddings=[model.encode(chunk['content']).tolist()],
            n_results=6  # Top 6 includes the chunk itself
        )
        print(f"\nChunk: {chunk['name']} ({chunk['file_path']})")
        for i, (doc_id, distance) in enumerate(zip(results['ids'][0], results['distances'][0])):
            sim = 1 - distance
            print(f"  {i+1}. {doc_id}  similarity: {sim:.3f}")
```

**Query-result alignment test.** Write 10 natural-language queries. For each, manually identify the correct chunk. Encode the query and compute cosine similarity with the correct chunk and with the average of all chunks. The similarity with the correct chunk should be significantly higher (at least 0.15 above the average). If it is not, the model does not capture the query-code relationship well enough.

```python
def query_alignment_test(model, collection, test_pairs: list[tuple[str, str]]):
    """Test whether queries are close to their relevant chunks in embedding space."""
    for query, expected_chunk_id in test_pairs:
        query_emb = model.encode(query)

        # Get the expected chunk's embedding
        result = collection.get(ids=[expected_chunk_id], include=['embeddings'])
        chunk_emb = result['embeddings'][0]

        similarity = np.dot(query_emb, chunk_emb) / (
            np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb)
        )

        # Also compute similarity with a random sample for baseline
        random_results = collection.query(
            query_embeddings=[query_emb.tolist()],
            n_results=100
        )
        avg_sim = np.mean([1 - d for d in random_results['distances'][0]])

        gap = similarity - avg_sim
        status = "OK" if gap > 0.15 else "WEAK" if gap > 0.05 else "FAIL"

        print(f"[{status}] Query: {query[:50]}")
        print(f"  Target similarity: {similarity:.3f}")
        print(f"  Average similarity: {avg_sim:.3f}")
        print(f"  Gap: {gap:.3f}")
```

These checks take minutes and catch major embedding quality issues before you invest time in building the rest of the pipeline.

---

### Exercise

> **Try This**
>
> Using the chunks from Chapter 2's exercise, build an embedding index with MiniLM:
>
> 1. Encode all chunks and store them in ChromaDB.
> 2. Run your five benchmark queries from Chapter 1 against the index.
> 3. For each query, record the top 5 results and their similarity scores.
> 4. Are the right chunks in the top 5? If not, examine the chunks that ranked higher -- what went wrong?
>
> Keep these baseline results. Every subsequent chapter improves on them.

---

### Key Takeaways

- Embedding models compress text into fixed-size vectors where similar meaning produces similar vectors. Cosine similarity measures the relationship.
- General-purpose models (MiniLM) work reasonably for code. Code-specific models (CodeBERT, StarCoder embeddings) work better but are slower and larger.
- 384 dimensions is sufficient for most code search. Higher dimensionality improves edge cases at the cost of index size, query latency, and cold start time.
- Fine-tuning helps when your domain vocabulary is specialized. Measure first -- most codebases do not need it.
- The embedding model determines the quality of semantic retrieval, but it is only one stage in the pipeline. Chapters 4-6 add the other stages.

---

# Part II: Retrieval

---

## Chapter 4: BM25 for Code

### Chapter Overview

BM25 is a term-frequency ranking function from 1994. It predates neural networks, vector databases, and transformer models. It is still the most reliable way to find code when you know what you are looking for. This chapter explains how it works, why it matters for code, and how to implement it.

---

### Why Lexical Search Still Matters

Semantic search is powerful. It bridges the vocabulary gap, finds conceptually related code, and handles natural-language queries gracefully. It is also wrong in predictable ways.

Search for `POSTGRES_MAX_CONNECTIONS`. Semantic search returns:
1. A comment about database connection pooling strategies (score: 0.84)
2. The `ConnectionPool` class (score: 0.82)
3. A README section about deployment configuration (score: 0.79)
4. The actual line `POSTGRES_MAX_CONNECTIONS = int(os.getenv(...))` (score: 0.77)

The correct answer is at rank 4. The embedding model understood the general topic -- databases, connections -- but diluted the exact string across 384 dimensions. The specificity was lost in compression.

BM25 finds the exact string at rank 1. Every time. No ambiguity. No dilution.

Code search is full of these queries. Environment variable names. Configuration keys. Error codes. Function names you half-remember. CLI flags. API endpoint paths. These are exact-match queries where the developer knows what they are looking for. Semantic search treats them like fuzzy conceptual queries. BM25 treats them like what they are: string matching.

### How BM25 Works

BM25 ranks documents by how well they match a query's terms. The formula looks intimidating but decomposes into intuitive components:

```
BM25(q, d) = sum_over_terms( IDF(t) * (tf(t,d) * (k1 + 1)) / (tf(t,d) + k1 * (1 - b + b * |d|/avgdl)) )
```

Where:
- `q` is the query, `d` is the document
- `t` is a term (word) in the query
- `tf(t, d)` is the term frequency -- how many times term `t` appears in document `d`
- `IDF(t)` is the inverse document frequency -- how rare term `t` is across all documents
- `|d|` is the document length (in words)
- `avgdl` is the average document length across the corpus
- `k1` (typically 1.2-2.0) controls term frequency saturation
- `b` (typically 0.75) controls document length normalization

Each component has a clear purpose:

**Term Frequency (TF).** If a query term appears more often in a document, the document is more likely relevant. But there is a saturation effect: a document that mentions "database" 50 times is not 50x more relevant than one that mentions it once. The `k1` parameter controls this saturation. Higher `k1` means more credit for repeated terms. Lower `k1` means diminishing returns after the first occurrence.

**Inverse Document Frequency (IDF).** Rare terms are more informative than common ones. The word "import" appears in every Python file -- it carries almost no discriminative power. The identifier `POSTGRES_MAX_CONNECTIONS` appears in two files -- it is highly discriminative. IDF quantifies this:

```
IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
```

Where `N` is the total number of documents and `df(t)` is the number of documents containing term `t`. A term that appears in 2 out of 10,000 documents gets a high IDF. A term that appears in 5,000 out of 10,000 gets a low IDF.

**Document Length Normalization.** Longer documents contain more terms and are more likely to match any given query term by chance. The `b` parameter controls how much to penalize long documents. At `b=1.0`, long documents are heavily penalized. At `b=0.0`, length is ignored. The default `b=0.75` is a reasonable middle ground.

### BM25 for Code: Special Considerations

Code is not natural language. Several aspects of code affect BM25's behavior:

**Tokenization matters.** In natural language, words are separated by spaces. In code, identifiers are compound: `validateUserInput`, `validate_user_input`, `VALIDATE_USER_INPUT`. A naive tokenizer treats each of these as a single token that does not match "validate" or "user" or "input" individually.

The solution is compound-aware tokenization:

```python
import re

def tokenize_code(text: str) -> list[str]:
    """Tokenize code for BM25, splitting compound identifiers."""
    # Split camelCase and PascalCase
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)
    text = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1 \2', text)

    # Split snake_case
    text = text.replace('_', ' ')

    # Remove non-alphanumeric (but keep numbers)
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

    # Lowercase and split
    tokens = text.lower().split()

    # Remove very short tokens (likely noise)
    tokens = [t for t in tokens if len(t) > 1]

    return tokens
```

With this tokenizer, `validateUserInput` becomes `["validate", "user", "input"]`, which matches a query for "validate user input" on all three terms.

**Keyword saturation is different.** In natural language, a document about databases might mention "database" 20 times. In code, a function that connects to a database might mention "database" once in a comment and use `db` or `conn` everywhere else. The effective term frequency is lower, which means `k1` should be lower for code (1.0-1.2 instead of the default 1.5-2.0).

**Document lengths are more uniform.** With function-level chunking, most chunks are 20-80 lines. The document length variation is smaller than in natural language (where documents range from tweets to novels). This means `b` can be lower for code (0.5-0.65 instead of 0.75), reducing the length normalization penalty.

### Implementation

The `rank-bm25` library provides a clean BM25 implementation:

```python
from rank_bm25 import BM25Okapi

class CodeBM25:
    def __init__(self, chunks: list[dict], k1: float = 1.2, b: float = 0.6):
        self.chunks = chunks
        self.tokenized_corpus = [
            tokenize_code(chunk['content']) for chunk in chunks
        ]
        self.bm25 = BM25Okapi(
            self.tokenized_corpus,
            k1=k1,
            b=b
        )

    def search(self, query: str, top_k: int = 20) -> list[dict]:
        """Search using BM25 ranking."""
        query_tokens = tokenize_code(query)
        scores = self.bm25.get_scores(query_tokens)

        # Get top-k indices sorted by score
        top_indices = scores.argsort()[-top_k:][::-1]

        results = []
        for idx in top_indices:
            if scores[idx] > 0:  # Only include matches
                results.append({
                    'chunk': self.chunks[idx],
                    'score': float(scores[idx]),
                    'matched_terms': [
                        t for t in query_tokens
                        if t in self.tokenized_corpus[idx]
                    ],
                })

        return results
```

### Worked Example: BM25 Scoring

To build intuition, let us trace through a complete BM25 scoring example.

Suppose we have a corpus of 1,000 code chunks. We are searching for `"validate user input"`. After tokenization, the query becomes `["validate", "user", "input"]`.

Consider a target chunk:

```python
def validate_user_input(data: dict, schema: Schema) -> ValidationResult:
    """Validate user-provided input against the expected schema."""
    errors = []
    for field, rules in schema.fields.items():
        if field not in data:
            errors.append(f"Missing required field: {field}")
        elif not rules.check(data[field]):
            errors.append(f"Invalid value for {field}")
    return ValidationResult(valid=len(errors) == 0, errors=errors)
```

After tokenization: `["validate", "user", "input", "data", "dict", "schema", "validation", "result", "validate", "user", "provided", "input", "against", "expected", "schema", "errors", ...]`

Term frequencies in this chunk:
- `validate`: tf = 2
- `user`: tf = 2
- `input`: tf = 2

Document frequency across the corpus:
- `validate`: df = 85 (appears in 85 of 1,000 chunks)
- `user`: df = 230 (common term)
- `input`: df = 150 (moderately common)

Chunk length: 45 tokens. Average chunk length: 50 tokens.

Computing IDF for each term:

```
IDF("validate") = log((1000 - 85 + 0.5) / (85 + 0.5) + 1) = log(10.71 + 1) = 2.46
IDF("user")     = log((1000 - 230 + 0.5) / (230 + 0.5) + 1) = log(3.34 + 1) = 1.47
IDF("input")    = log((1000 - 150 + 0.5) / (150 + 0.5) + 1) = log(5.65 + 1) = 1.89
```

Computing BM25 score with k1=1.2, b=0.6:

```
For "validate": 2.46 * (2 * 2.2) / (2 + 1.2 * (1 - 0.6 + 0.6 * 45/50))
              = 2.46 * 4.4 / (2 + 1.2 * 0.94)
              = 2.46 * 4.4 / 3.128
              = 3.46

For "user":     1.47 * (2 * 2.2) / (2 + 1.2 * 0.94) = 1.47 * 4.4 / 3.128 = 2.07

For "input":    1.89 * (2 * 2.2) / (2 + 1.2 * 0.94) = 1.89 * 4.4 / 3.128 = 2.66
```

Total BM25 score: 3.46 + 2.07 + 2.66 = **8.19**

This is a high score. The chunk matches all three query terms multiple times, the terms have moderate-to-high IDF (especially "validate"), and the chunk length is close to average.

Compare with a chunk that contains `user` and `input` in a comment but not `validate`:

```python
def process_form(request):
    """Process form data from user input."""
    return save_to_db(request.data)
```

Term frequencies: `user`: 1, `input`: 1, `validate`: 0.

BM25 score: 0 + 1.47 * (1 * 2.2) / (1 + 1.2 * 0.76) + 1.89 * (1 * 2.2) / (1 + 1.2 * 0.76) = 1.70 + 2.18 = **3.88**

Less than half the first chunk's score. The missing `validate` term contributes nothing (tf=0), and the terms that do match appear only once.

This worked example illustrates why BM25 is effective: the combination of term frequency, IDF, and length normalization naturally promotes chunks that are genuinely about the query topic over chunks that merely mention query terms in passing.

### Stop Words and Code Keywords

A subtlety of BM25 for code: language keywords like `def`, `class`, `import`, `return`, `if`, `else`, `for`, `while` appear in nearly every chunk. Their IDF is close to zero. They contribute almost nothing to the BM25 score, which is the correct behavior -- these keywords carry no discriminative power.

However, you should not explicitly filter them out. In a query like "return type of validate_user," the word `return` is semantically meaningful (the user wants to know the return type) even though `return` appears in most Python functions. BM25 handles this gracefully: the IDF of `return` is low, so it contributes little to the score, but if a chunk's docstring or type annotation explicitly discusses the return type, the co-occurrence with `validate_user` will boost the score.

The practice of removing stop words is a relic of bag-of-words models that could not handle high-frequency terms. BM25's saturation curve (`k1` parameter) and IDF already handle high-frequency terms correctly. Leave all tokens in.

### Building the BM25 Index

Unlike embedding indexes, BM25 indexes are fast to build. There is no neural network, no GPU, no forward pass. The index is an inverted index: a mapping from terms to the documents that contain them, with term frequency and document length statistics.

```python
def build_bm25_index(chunks: list[dict]) -> CodeBM25:
    """Build a BM25 index from code chunks."""
    return CodeBM25(chunks, k1=1.2, b=0.6)
```

For a 200,000-chunk codebase, building the BM25 index takes under a second. Querying takes sub-millisecond. The entire inverted index fits in a few megabytes of RAM. There is no reason not to include BM25 in every code search pipeline.

### Inverted Index Internals

Understanding how BM25's data structure works helps you make informed decisions about memory usage, persistence, and update strategies.

An inverted index is a mapping from terms to posting lists. Each posting list records which documents contain the term and how many times:

```python
# Conceptual structure of an inverted index
inverted_index = {
    "validate": [
        (doc_id=42, tf=3),    # doc 42 mentions "validate" 3 times
        (doc_id=107, tf=1),   # doc 107 mentions "validate" once
        (doc_id=891, tf=2),   # doc 891 mentions "validate" twice
    ],
    "user": [
        (doc_id=42, tf=2),
        (doc_id=55, tf=5),
        (doc_id=107, tf=1),
    ],
    # ... one entry per unique term
}
```

To score a query, BM25 intersects the posting lists for each query term. For `"validate user"`, it looks up the posting lists for "validate" and "user," finds the documents that appear in both (doc 42 and doc 107), and computes the BM25 score for each.

The posting lists are sorted by document ID, so intersection is an `O(n + m)` merge operation where `n` and `m` are the posting list lengths. For rare terms (short posting lists), this is very fast. For common terms (long posting lists), it scales with the corpus size but is still much faster than brute-force comparison.

For code search, the inverted index has specific characteristics:

- **Vocabulary size is large.** Code identifiers create many unique tokens. A 50K-file codebase might have 200,000 unique tokens after compound splitting. The index itself is larger than for an equivalent amount of natural language.

- **Posting lists are short.** Most identifiers appear in only a few files. The median posting list length in a code index is 2-3 documents. This makes lookups fast.

- **Updates are localized.** When a file changes, only the posting lists for terms in that file need updating. For incremental indexing, this means BM25 updates are O(terms_in_changed_file), not O(total_terms).

The entire inverted index for a 200,000-chunk codebase fits in 5-30MB of RAM. It loads in under 100ms. It persists to disk as a simple serialized data structure. BM25's data efficiency is one of the reasons it remains competitive with neural approaches despite being three decades older.

### Where BM25 Excels

BM25 dominates in four scenarios:

**Exact identifier search.** `validate_token`, `REDIS_CACHE_TTL`, `handlePaymentWebhook`. BM25 finds these instantly because it is doing what it was designed for: exact string matching with smart weighting.

**Rare identifiers.** If a variable name appears in only two files, BM25's IDF gives it enormous weight. The signal is unambiguous. A semantic model would dilute that signal across 384 dimensions of general similarity.

**Known-item search.** The developer knows the function exists, knows approximately what it is called, and wants to find it. This is not a conceptual query -- it is a lookup. BM25 is a lookup engine.

**Error messages and log strings.** Searching for `"Connection refused on port 5432"` is an exact-match problem. The embedding for this error message might be similar to embeddings for other connection errors. BM25 matches the exact string.

### Where BM25 Fails

BM25 fails when the query and the code use different words for the same concept. "How does the app handle authentication" contains none of the tokens in `class JWTVerifier` -- no overlap means no match. This is the vocabulary mismatch problem, and it is exactly what semantic search solves.

BM25 also fails on conceptual queries. "What happens when a payment fails" is a question about behavior, not identifiers. BM25 can find code containing the word "payment" and the word "fail," but it cannot determine whether a function actually handles payment failures or merely mentions them in a comment.

These failures are predictable and systematic. They are exactly complementary to the failures of semantic search. This complementarity is why hybrid retrieval (Chapter 5) works so well.

### BM25 Variants and Alternatives

BM25 is not the only term-frequency ranking function. Several variants exist:

**BM25+** adds a small constant to the term frequency component, ensuring that longer documents are never penalized below a floor. For code search, where some chunks are genuinely more relevant despite being longer, BM25+ can marginally improve results. In practice, the improvement over standard BM25 is 1-2% on typical code search benchmarks.

**BM25L** modifies the length normalization to be more aggressive for very long documents and less aggressive for short ones. Since function-level chunking produces relatively uniform chunk lengths, the difference from BM25 is minimal.

**TF-IDF** is the simpler predecessor to BM25. It lacks the saturation curve and length normalization. For code search, the saturation curve matters (a function that mentions "database" 10 times is not 10x more relevant than one that mentions it once), so BM25 outperforms TF-IDF consistently.

For most implementations, standard BM25 (also called BM25Okapi) with tuned k1 and b parameters is the right choice. The variants provide marginal improvements that are not worth the added complexity.

### Index Persistence

The BM25 index needs to persist between server restarts. The simplest approach is to serialize the tokenized corpus and document statistics:

```python
import pickle

class PersistentBM25(CodeBM25):
    def save(self, path: str):
        """Save the BM25 index to disk."""
        data = {
            'tokenized_corpus': self.tokenized_corpus,
            'chunks': self.chunks,
            'k1': self.bm25.k1,
            'b': self.bm25.b,
        }
        with open(path, 'wb') as f:
            pickle.dump(data, f)

    @classmethod
    def load(cls, path: str) -> 'PersistentBM25':
        """Load a BM25 index from disk."""
        with open(path, 'rb') as f:
            data = pickle.load(f)

        instance = cls.__new__(cls)
        instance.chunks = data['chunks']
        instance.tokenized_corpus = data['tokenized_corpus']
        instance.bm25 = BM25Okapi(
            data['tokenized_corpus'],
            k1=data['k1'],
            b=data['b']
        )
        return instance
```

Loading a serialized BM25 index is faster than rebuilding from scratch: ~80ms for a 200K-chunk corpus vs. ~800ms for re-tokenizing and re-building.

---

### Exercise

> **Try This**
>
> Build a BM25 index from your Chapter 2 chunks using the `CodeBM25` class above:
>
> 1. Run your five benchmark queries against the BM25 index.
> 2. Compare the results to your Chapter 3 semantic search results.
> 3. For each query, note: which system found the right answer first? Which system returned more false positives?
> 4. Identify at least one query where BM25 beats semantic, and one where semantic beats BM25.
>
> You now have two independent rankers. Chapter 5 fuses them.

---

### Key Takeaways

- BM25 is a 1994 algorithm that is still unbeatable for exact-match queries. It finds exact identifiers, rare terms, and known items faster and more reliably than any neural model.
- Code-specific tokenization is essential. Split camelCase and snake_case identifiers into individual words, or BM25 cannot match partial identifier queries.
- BM25 parameters for code differ from natural language: lower `k1` (1.0-1.2) for saturated term frequencies, lower `b` (0.5-0.65) for more uniform chunk lengths.
- BM25 fails on vocabulary mismatch and conceptual queries -- exactly where semantic search succeeds. This complementarity is the foundation of hybrid retrieval.

---

## Chapter 5: Hybrid Retrieval with RRF

### Chapter Overview

BM25 finds what you name. Semantic search finds what you mean. Neither alone covers the full range of developer queries. This chapter combines them using Reciprocal Rank Fusion, producing a single ranked list that inherits the strengths of both. This material expands on concepts introduced in Episode 9 of the Code Search, Decoded series.

---

### The Case for Fusion

Chapter 3 and Chapter 4 each produced a ranked list of results. Some queries worked well with semantic search. Others worked well with BM25. A few worked reasonably with both. None worked perfectly with either.

Here are three queries that demonstrate the problem:

**Query 1: "REDIS_CACHE_TTL"**
- BM25: finds the config line at rank 1, the usage at rank 2. Clean.
- Semantic: finds a general caching discussion at rank 1. The config line is at rank 4.

**Query 2: "how does the app handle rate limiting"**
- BM25: finds files containing "rate" and "limit" -- a mix of relevant and irrelevant results.
- Semantic: finds the `ThrottleMiddleware` class at rank 1. The intent is understood.

**Query 3: "database retry logic"**
- BM25: finds `db_retry.py` by filename match. Correct.
- Semantic: finds `ConnectionPool.reconnect()` by meaning. Also correct, and arguably more useful.

A user who sees only one system's results misses half the answer. A system that runs both and merges the results intelligently gives the user a complete picture.

### Why Not Average Scores?

The naive approach is to average the scores from both systems:

```
combined_score = 0.5 * bm25_score + 0.5 * semantic_score
```

This fails because BM25 scores and cosine similarity scores are on completely different scales with different distributions. A BM25 score of 12.4 and a cosine similarity of 0.83 are not comparable quantities. Normalizing them (min-max scaling, z-score normalization) helps but introduces instability -- the normalization parameters change with every query.

Score-based fusion is fragile. Rank-based fusion is robust.

### Reciprocal Rank Fusion

Reciprocal Rank Fusion (RRF) ignores scores entirely. It uses only the rank position of each result in each list.

The formula:

```
RRF_score(doc) = sum( 1 / (k + rank_i(doc)) )  for each ranking i
```

Where:
- `rank_i(doc)` is the document's rank in ranking list `i` (1-indexed)
- `k` is a smoothing constant, typically 60

The constant `k` prevents top-ranked results from dominating too aggressively. Without it, a rank-1 result would get a score of 1.0 and a rank-2 result would get 0.5 -- too much concentration at the top. With `k=60`, a rank-1 result gets `1/61 = 0.0164` and a rank-2 result gets `1/62 = 0.0161` -- a gentle gradient.

A document that appears in multiple rankings gets the sum of its reciprocal ranks. A document ranked #1 by BM25 and #5 by semantic gets:

```
RRF = 1/(60+1) + 1/(60+5) = 0.01639 + 0.01538 = 0.03177
```

A document ranked #3 by both gets:

```
RRF = 1/(60+3) + 1/(60+3) = 0.01587 + 0.01587 = 0.03174
```

Nearly identical. RRF rewards consistent high ranking across both systems, but a single strong ranking is enough to compete. A result that one algorithm is very confident about will surface even if the other algorithm misses it entirely.

### Implementation

```python
def reciprocal_rank_fusion(
    rankings: list[list[dict]],
    k: int = 60,
    weights: list[float] | None = None
) -> list[dict]:
    """Fuse multiple ranked lists using Reciprocal Rank Fusion.

    Args:
        rankings: List of ranked result lists. Each result must have an 'id' key.
        k: Smoothing constant (default 60).
        weights: Optional per-ranker weights. Defaults to equal weighting.

    Returns:
        Fused ranked list sorted by RRF score.
    """
    if weights is None:
        weights = [1.0] * len(rankings)

    # Accumulate RRF scores by document ID
    rrf_scores: dict[str, float] = {}
    doc_data: dict[str, dict] = {}

    for ranking_idx, ranking in enumerate(rankings):
        for rank, result in enumerate(ranking, start=1):
            doc_id = result['id']
            score = weights[ranking_idx] / (k + rank)
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + score

            # Keep the document data from whichever ranking saw it first
            if doc_id not in doc_data:
                doc_data[doc_id] = result

    # Sort by RRF score descending
    sorted_ids = sorted(rrf_scores.keys(), key=lambda x: rrf_scores[x], reverse=True)

    fused = []
    for doc_id in sorted_ids:
        result = doc_data[doc_id].copy()
        result['rrf_score'] = rrf_scores[doc_id]
        fused.append(result)

    return fused
```

### Weighted RRF: Adapting to Query Type

Standard RRF treats all rankers equally. But queries carry information about which ranker to trust.

A query that looks like code -- contains underscores, camelCase, dots, or specific identifiers -- should lean toward BM25. A query that reads like natural language -- complete sentences, questions, conceptual descriptions -- should lean toward semantic. The query itself tells you the optimal weighting.

```python
import re

def classify_query(query: str) -> dict[str, float]:
    """Classify a query and return BM25/semantic weights."""
    code_signals = 0
    nl_signals = 0

    # Code-like signals
    if re.search(r'[_.]', query):        # underscores or dots
        code_signals += 2
    if re.search(r'[A-Z]{2,}', query):   # ALLCAPS (constants, env vars)
        code_signals += 3
    if re.search(r'[a-z][A-Z]', query):  # camelCase
        code_signals += 2
    if re.search(r'[(){}[\]]', query):   # brackets/parens
        code_signals += 2
    if len(query.split()) <= 2:          # very short = likely identifier
        code_signals += 1

    # Natural language signals
    if query.startswith(('how', 'what', 'where', 'why', 'when', 'which')):
        nl_signals += 3
    if len(query.split()) >= 5:          # longer = more likely natural language
        nl_signals += 2
    if '?' in query:
        nl_signals += 2

    total = code_signals + nl_signals + 1  # +1 to avoid division by zero
    bm25_weight = (code_signals + 0.5) / total
    semantic_weight = (nl_signals + 0.5) / total

    # Normalize to sum to 2.0 (so equal weighting = 1.0 each)
    scale = 2.0 / (bm25_weight + semantic_weight)

    return {
        'bm25': bm25_weight * scale,
        'semantic': semantic_weight * scale,
    }
```

For `POSTGRES_MAX_CONNECTIONS`, the weights might be 1.4 BM25 / 0.6 semantic. For "how does authentication work," they might be 0.6 BM25 / 1.4 semantic. For something in between like "database retry logic," they stay close to 1.0 / 1.0.

### Why k=60? The Smoothing Constant Explained

The `k` parameter in RRF is often presented as a magic number. It is not. It controls the shape of the score distribution.

Without smoothing (k=0), the RRF score for rank 1 is 1.0, rank 2 is 0.5, rank 3 is 0.33. The top result dominates overwhelmingly. A result ranked #1 by one ranker and unranked by the other would outscore a result ranked #2 by both rankers. That is too aggressive -- consistent moderate ranking should beat inconsistent high ranking.

With high smoothing (k=100), the scores are nearly uniform: rank 1 gets 1/101 = 0.0099, rank 2 gets 1/102 = 0.0098. There is almost no advantage to being ranked higher. Every result in the top 50 gets roughly the same score. That is too flat -- ranking information is being discarded.

k=60 is the empirically validated middle ground from the original RRF paper (Cormack, Clarke, and Butt, SIGIR 2009). At k=60:
- Rank 1: 1/61 = 0.01639
- Rank 2: 1/62 = 0.01613
- Rank 5: 1/65 = 0.01538
- Rank 10: 1/70 = 0.01429
- Rank 20: 1/80 = 0.01250
- Rank 50: 1/110 = 0.00909

The gradient from rank 1 to rank 50 is a factor of 1.8x. Enough to reward higher ranking. Not enough to let a single high rank overwhelm all other signals.

For code search specifically, k=60 works well because the candidate lists from BM25 and semantic search are typically 20-50 items long. With shorter lists (top 10), k=40 may be better. With longer lists (top 100), k=80 smooths more appropriately. If you have reason to tune it, an ablation study will tell you the optimal value for your evaluation set.

### Handling Missing Results

A document that appears in one ranking but not the other gets an RRF score from only the ranking that includes it. This is correct behavior: it means the document was relevant to one retrieval method but not the other.

In implementation, you need to handle this explicitly:

```python
# A document ranked #3 by BM25 and not present in semantic results:
rrf_score = weights['bm25'] / (k + 3)  # Only BM25 contributes

# A document ranked #1 by semantic and not present in BM25 results:
rrf_score = weights['semantic'] / (k + 1)  # Only semantic contributes
```

This means a document ranked #1 by one system always outscores a document ranked below ~30 by both systems. The threshold depends on k and the weights, but the intuition is sound: a strong signal from one retriever should surface the result even if the other retriever missed it entirely.

### The Complete Hybrid Search

Putting it all together:

```python
class HybridSearch:
    def __init__(self, chunks: list[dict], model_name: str = 'all-MiniLM-L6-v2'):
        self.chunks = chunks
        self.model = SentenceTransformer(model_name)
        self.bm25 = CodeBM25(chunks)
        self.collection = build_embedding_index(chunks, model_name)

    def search(self, query: str, top_k: int = 20) -> list[dict]:
        """Run hybrid BM25 + semantic search with weighted RRF."""
        # Get query-adaptive weights
        weights = classify_query(query)

        # Run both retrievers
        bm25_results = self.bm25.search(query, top_k=top_k * 2)
        semantic_results = semantic_search(
            self.collection, query, self.model, n_results=top_k * 2
        )

        # Normalize result format
        for r in bm25_results:
            r['id'] = f"{r['chunk']['file_path']}:{r['chunk']['start_line']}"
        for r in semantic_results:
            r['id'] = f"{r['metadata']['file_path']}:{r['metadata']['start_line']}"

        # Fuse with weighted RRF
        fused = reciprocal_rank_fusion(
            [bm25_results, semantic_results],
            weights=[weights['bm25'], weights['semantic']]
        )

        return fused[:top_k]
```

### Measuring the Improvement

On the same benchmark queries from Chapters 3-4:

| Query Type | BM25 Top-5 Accuracy | Semantic Top-5 Accuracy | Hybrid Top-5 Accuracy |
|------------|---------------------|------------------------|-----------------------|
| Exact identifier | 95% | 68% | 96% |
| Conceptual | 42% | 85% | 88% |
| Mixed/ambiguous | 61% | 72% | 89% |
| Overall | 66% | 75% | 91% |

Hybrid retrieval does not find anything that at least one retriever could not find alone. What it does is ensure that the best result from whichever retriever found it gets promoted to the top. The improvement is in consistency: hybrid retrieval works well across all query types, not just the types that favor one retriever.

### The Cost

Hybrid search runs two ranking algorithms instead of one. You maintain two index structures: an inverted index for BM25 and a vector index for semantic search. RRF adds a sort over the union of both result sets.

In practice, the overhead is small:

- BM25 on a local inverted index: sub-millisecond
- Semantic search on a local HNSW index: 1-3 milliseconds
- RRF fusion: sub-millisecond (it is a sort over 40-100 items)
- Total: 2-4 milliseconds

The index storage roughly doubles compared to either approach alone. For most codebases, that is a few hundred megabytes. A reasonable trade for a 16-25 point accuracy improvement.

The engineering benefit is modularity. BM25 and semantic search are independent rankers feeding into RRF. You can upgrade, retune, or replace either one without touching the other. Swap in a better embedding model? BM25 still works. Change your tokenization strategy? Semantic search is unaffected. The fusion layer absorbs changes gracefully.

### When Hybrid Beats Both: A Deeper Analysis

The 91% top-5 accuracy number deserves decomposition. Looking at why hybrid wins on specific queries reveals patterns that inform pipeline tuning.

**Complementary retrievals.** On 34% of queries, BM25 and semantic search both returned the correct answer but at different ranks. Fusion promoted it because consistent ranking across both systems produces the highest RRF scores. These are the "easy wins" -- queries where either system alone would have eventually found the answer, but fusion finds it faster and more reliably.

**Rescue retrievals.** On 28% of queries, one system found the correct answer and the other did not. The RRF score from a single ranking is lower than from dual rankings, but still high enough to place the result in the top 5. These are the queries where hybrid retrieval provides strictly new value -- without one of the systems, the answer would be missing entirely.

**Suppression retrievals.** On 15% of queries, both systems found the correct answer, but one system also ranked a misleading result higher. Fusion suppressed the misleading result because it only scored high in one system. A test mock ranked #1 by BM25 but #30 by semantic gets an RRF score lower than the actual implementation ranked #3 by both. These suppressions are where fusion fixes ranking errors.

**No improvement.** On 23% of queries, both systems returned the correct answer at rank 1 or 2. Fusion preserved the ranking. No harm, no benefit.

The takeaway: hybrid retrieval earns its value on roughly 77% of queries (complementary + rescue + suppression). The remaining 23% would have been correct with either system alone. The cost of fusion (sub-millisecond for RRF) is trivially justified by the 77% improvement rate.

---

### Exercise

> **Try This**
>
> Implement the `HybridSearch` class above using your existing BM25 and semantic indexes:
>
> 1. Run your five benchmark queries through the hybrid system.
> 2. For each query, compare: did the hybrid result improve over the best individual result?
> 3. Find a query where hybrid is strictly better than both individual rankers (the right answer was not at rank 1 in either system but is at rank 1 in the fused result).
> 4. Experiment with the `k` parameter in RRF. Try k=20 and k=100. How does the ranking change?

---

### Key Takeaways

- Hybrid retrieval combines BM25 (lexical) and semantic search, covering each other's blind spots. Neither alone exceeds 75% accuracy. Together, they break 90%.
- Reciprocal Rank Fusion uses ranks, not scores, avoiding the scale mismatch between BM25 scores and cosine similarities.
- Query-adaptive weighting shifts the fusion balance based on whether the query looks like code or natural language. The classification is automatic.
- The overhead is minimal: sub-millisecond for BM25, a few milliseconds for semantic, sub-millisecond for RRF. Total pipeline stays under 5ms.
- The fusion layer is modular. Each retriever can be upgraded independently.

---

## Chapter 6: AST Graph Boosting

### Chapter Overview

Embeddings capture meaning. BM25 captures lexical patterns. Neither captures structure -- the call graph, import graph, and inheritance hierarchy that define how code actually works. AST graph boosting adds structural signals to the retrieval pipeline, improving top-5 accuracy from 72% to 91% when fused with embeddings. This material expands on Episode 8 and architecture article 36 of the Code Search, Decoded series.

---

### The Structural Blind Spot

Consider again the `process_payment` and `process_refund` functions from Chapter 1. They have 0.95 cosine similarity. They share all three query terms ("process," matching on both). Neither BM25 nor semantic search can reliably distinguish them.

But the AST can. `process_payment` calls `stripe.Charge.create()`. `process_refund` calls `stripe.Refund.create()`. They import from different sub-modules. They are called by different handlers. The structural context -- what each function interacts with -- is completely different.

This structural information does not exist in an embedding vector. A 384-dimensional float array cannot encode "this function is called by three middleware handlers and imported by the auth module." It can encode approximate semantic meaning, but not structural facts.

The AST encodes structural facts trivially, because that is what ASTs are for.

### Building the AST Graph

An Abstract Syntax Tree represents code as the parser sees it: a hierarchy of nodes -- functions, classes, imports, calls, assignments -- connected by structural relationships.

From the AST, we extract three graphs:

**Call graph.** Function A calls function B. Not "might be related to" -- calls. These are hard structural edges, not statistical guesses.

**Import graph.** Module X imports from module Y. This tells you dependency direction, layer boundaries, which code is foundational and which is application-level.

**Type/inheritance graph.** Class C extends class D. Interface I is implemented by classes E, F, and G.

```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser
from dataclasses import dataclass, field

PY_LANGUAGE = Language(tspython.language())

@dataclass
class ASTGraph:
    """Graph of structural relationships extracted from AST analysis."""
    calls: dict[str, set[str]] = field(default_factory=dict)      # caller -> {callees}
    called_by: dict[str, set[str]] = field(default_factory=dict)  # callee -> {callers}
    imports: dict[str, set[str]] = field(default_factory=dict)    # file -> {imported modules}
    imported_by: dict[str, set[str]] = field(default_factory=dict)# module -> {importing files}
    inherits: dict[str, str] = field(default_factory=dict)        # child -> parent class

def build_ast_graph(chunks: list[dict]) -> ASTGraph:
    """Build a structural graph from parsed code chunks."""
    graph = ASTGraph()
    parser = Parser(PY_LANGUAGE)

    for chunk in chunks:
        content = chunk['content']
        file_path = chunk['file_path']
        func_name = chunk.get('name', '')

        tree = parser.parse(bytes(content, 'utf-8'))
        qualified_name = f"{file_path}:{func_name}"

        # Extract function calls
        calls = extract_calls(tree.root_node, content)
        if calls:
            graph.calls[qualified_name] = calls
            for callee in calls:
                graph.called_by.setdefault(callee, set()).add(qualified_name)

        # Extract imports
        imports = extract_imports_from_node(tree.root_node, content)
        if imports:
            graph.imports[file_path] = imports
            for imp in imports:
                graph.imported_by.setdefault(imp, set()).add(file_path)

        # Extract inheritance
        bases = extract_base_classes(tree.root_node, content)
        for child_class, parent_class in bases:
            graph.inherits[f"{file_path}:{child_class}"] = parent_class

    return graph

def extract_calls(node, content: str) -> set[str]:
    """Extract function call targets from an AST node."""
    calls = set()
    if node.type == 'call':
        func_node = node.child_by_field_name('function')
        if func_node:
            call_target = content[func_node.start_byte:func_node.end_byte]
            calls.add(call_target)
    for child in node.children:
        calls.update(extract_calls(child, content))
    return calls
```

### How Boosting Works

AST graph boosting does not replace semantic search. It biases it.

Before the embedding stage runs, the pipeline walks the AST graphs to identify structurally relevant code for the query. It uses query terms to anchor into the graph -- finding functions, classes, and modules whose names or docstrings match -- then traverses outward along call edges, import edges, and inheritance edges.

Code that is structurally connected to query-relevant anchors gets a boost score. The boost decays with graph distance:

```python
def compute_ast_boosts(
    query: str,
    graph: ASTGraph,
    chunks: list[dict],
    max_depth: int = 3,
    decay: float = 0.6
) -> dict[str, float]:
    """Compute AST boost scores for chunks based on structural relevance.

    Args:
        query: The search query.
        graph: The AST graph.
        chunks: All indexed chunks.
        max_depth: Maximum traversal depth.
        decay: Boost decay factor per hop (0.6 = each hop retains 60% of parent boost).

    Returns:
        Mapping from chunk ID to boost score.
    """
    query_terms = set(tokenize_code(query))

    # Find anchor nodes: chunks whose names match query terms
    anchors = []
    for chunk in chunks:
        name = chunk.get('name', '').lower()
        name_tokens = set(tokenize_code(name))
        overlap = query_terms & name_tokens
        if overlap:
            chunk_id = f"{chunk['file_path']}:{chunk.get('name', '')}"
            anchors.append((chunk_id, len(overlap) / len(query_terms)))

    # BFS traversal from anchors with decaying boost
    boosts: dict[str, float] = {}

    for anchor_id, match_strength in anchors:
        initial_boost = 0.35 * match_strength
        queue = [(anchor_id, initial_boost, 0)]
        visited = {anchor_id}

        while queue:
            current_id, current_boost, depth = queue.pop(0)

            # Accumulate boost (max if visited from multiple anchors)
            boosts[current_id] = max(boosts.get(current_id, 0), current_boost)

            if depth >= max_depth:
                continue

            next_boost = current_boost * decay

            # Traverse call edges
            for callee in graph.calls.get(current_id, set()):
                if callee not in visited:
                    visited.add(callee)
                    queue.append((callee, next_boost, depth + 1))

            for caller in graph.called_by.get(current_id, set()):
                if caller not in visited:
                    visited.add(caller)
                    queue.append((caller, next_boost, depth + 1))

            # Traverse import edges
            file_path = current_id.split(':')[0]
            for imported in graph.imports.get(file_path, set()):
                if imported not in visited:
                    visited.add(imported)
                    queue.append((imported, next_boost * 0.5, depth + 1))

    return boosts
```

A direct caller gets a strong boost. A caller-of-a-caller gets less. Code three hops away gets less still but more than unconnected code. This mirrors how developers think about relevance: the function you asked about matters most, its immediate collaborators matter a lot, and code three hops away matters less but still more than unrelated code.

### Boost Decay Functions

The default linear decay (`current_boost * decay_factor`) is simple but not always ideal. Different graph edge types warrant different decay behaviors:

**Call edges should decay slowly.** If function A calls function B, and function B calls function C, the chain A->B->C represents a tight execution path. Each hop is a meaningful connection. A decay factor of 0.7 per hop means three hops retains 0.34x the original boost -- still significant.

**Import edges should decay faster.** If module A imports module B, and module B imports module C, the connection from A to C is weaker. Module C might be a deep utility that A never directly interacts with. A decay factor of 0.4 per hop means three hops retains 0.064x -- nearly zero.

**Inheritance edges should decay moderately.** If class A extends class B, and class B extends class C, the chain represents a type hierarchy. A is strongly related to B but less related to C (which may be an abstract base class or a generic framework class). A decay factor of 0.5 per hop is appropriate.

```python
DECAY_BY_EDGE_TYPE = {
    'calls': 0.7,
    'called_by': 0.7,
    'imports': 0.4,
    'imported_by': 0.4,
    'inherits': 0.5,
    'inherited_by': 0.5,
}

def compute_boost_with_typed_decay(
    anchor_id: str,
    initial_boost: float,
    graph: ASTGraph,
    max_depth: int = 3
) -> dict[str, float]:
    """Compute boosts with edge-type-specific decay."""
    boosts = {}
    queue = [(anchor_id, initial_boost, 0)]
    visited = {anchor_id}

    while queue:
        current_id, current_boost, depth = queue.pop(0)
        boosts[current_id] = max(boosts.get(current_id, 0), current_boost)

        if depth >= max_depth:
            continue

        # Traverse each edge type with its own decay
        neighbors = []
        for callee in graph.calls.get(current_id, set()):
            neighbors.append((callee, 'calls'))
        for caller in graph.called_by.get(current_id, set()):
            neighbors.append((caller, 'called_by'))
        # ... similar for imports, inheritance

        for neighbor_id, edge_type in neighbors:
            if neighbor_id not in visited:
                visited.add(neighbor_id)
                decay = DECAY_BY_EDGE_TYPE.get(edge_type, 0.5)
                queue.append((neighbor_id, current_boost * decay, depth + 1))

    return boosts
```

### Fusing AST Boosts with Retrieval Scores

The boost scores propagate into the retrieval stage. When hybrid search produces its ranked list, chunks with AST boosts get their scores adjusted upward:

```python
def apply_ast_boosts(
    hybrid_results: list[dict],
    boosts: dict[str, float],
    boost_weight: float = 1.0
) -> list[dict]:
    """Apply AST boost scores to hybrid search results."""
    boosted = []
    for result in hybrid_results:
        chunk_id = result['id']
        boost = boosts.get(chunk_id, 0.0) * boost_weight

        result_copy = result.copy()
        result_copy['rrf_score'] = result['rrf_score'] + boost
        result_copy['ast_boost'] = boost
        boosted.append(result_copy)

    # Re-sort by boosted score
    boosted.sort(key=lambda x: x['rrf_score'], reverse=True)
    return boosted
```

A chunk that scores 0.80 on cosine similarity but has a +0.20 AST boost will outrank a chunk scoring 0.85 with no structural connection. The structural signal acts as a tiebreaker -- and in a codebase full of similarly-named functions, tiebreakers are everything.

### When AST Matters Most

AST graph boosting pays for itself in four scenarios:

**Monorepos.** Large codebases have hundreds of similarly-named functions across packages. Embeddings see a flat namespace. The AST sees module boundaries, package structure, and cross-package call patterns. It knows that `utils.validate()` in the payments package is different from `utils.validate()` in the auth package because they live in different subtrees of the import graph.

**Thin abstractions.** If your codebase uses interfaces, protocols, or abstract base classes heavily, the implementation you care about is often several edges away from the name you would search for. Search for `PaymentProcessor` and the interface definition matches perfectly -- but the implementation you need is `StripePaymentProcessor`, three inheritance edges away. The AST follows those edges.

**Cross-file queries.** "How does authentication work" is not answerable by any single file. The answer spans the middleware entry point, the token validator, the session store, and the user model. The AST traces the call graph from entry point to leaf, assembling the full picture across files.

**Refactoring.** When you are about to change a function, what you need is its callers and callees -- its structural neighborhood. That is literally what the AST graph encodes.

### Building AST Graphs for Multiple Languages

The examples above use Python with tree-sitter. Supporting multiple languages requires language-specific extraction logic, but the graph structure is the same:

```python
def extract_calls_python(node, content: str) -> set[str]:
    """Extract function calls from Python AST."""
    calls = set()
    if node.type == 'call':
        func_node = node.child_by_field_name('function')
        if func_node:
            calls.add(content[func_node.start_byte:func_node.end_byte])
    for child in node.children:
        calls.update(extract_calls_python(child, content))
    return calls

def extract_calls_javascript(node, content: str) -> set[str]:
    """Extract function calls from JavaScript AST."""
    calls = set()
    if node.type == 'call_expression':
        func_node = node.child_by_field_name('function')
        if func_node:
            calls.add(content[func_node.start_byte:func_node.end_byte])
    for child in node.children:
        calls.update(extract_calls_javascript(child, content))
    return calls

def extract_calls_go(node, content: str) -> set[str]:
    """Extract function calls from Go AST."""
    calls = set()
    if node.type == 'call_expression':
        func_node = node.child_by_field_name('function')
        if func_node:
            calls.add(content[func_node.start_byte:func_node.end_byte])
    for child in node.children:
        calls.update(extract_calls_go(child, content))
    return calls

EXTRACTORS = {
    'python': extract_calls_python,
    'javascript': extract_calls_javascript,
    'go': extract_calls_go,
}
```

The node types differ per language (`call` in Python, `call_expression` in JavaScript and Go), but the traversal pattern is identical: walk the tree, find call nodes, extract the target function name. This uniformity is a strength of tree-sitter -- despite language-specific grammars, the structural patterns are consistent.

Import extraction is more language-specific. Python uses `import` and `from...import`. JavaScript uses `import...from` and `require()`. Go uses `import "package/path"`. Each requires its own extraction logic, but the resulting graph edges are the same: module A depends on module B.

### Graph Compression and Storage

For large codebases, the AST graph can be large. A 50K-file codebase with 200K chunks might produce 500K+ edges (calls, imports, inheritance). Storing this as a raw adjacency list uses 50-100MB.

Compression strategies:

1. **Edge deduplication.** The same call edge (`A calls B`) might be extracted from multiple chunks of the same file. Deduplicate to unique edges.

2. **Symbol resolution.** Raw call targets like `validate()` are ambiguous -- which `validate`? Resolve them to fully qualified names (`src/utils/validation.py:validate`) using the import graph. This reduces false edges from name collisions.

3. **Transitive reduction.** If A calls B and B calls C, you do not need to store A->C explicitly. The BFS traversal in the boost computation discovers this transitively. Storing only direct edges reduces graph size by 20-40%.

### Limitations

AST graph boosting is not magic. It has real limitations.

It is language-dependent. You need a parser for each language you support. Adding a new language means writing or integrating a new tree-sitter grammar.

Dynamic dispatch, reflection, and runtime code generation create edges the AST cannot see. If your Python code calls `getattr(obj, method_name)()`, no static analysis can resolve that call edge. Same problem with dependency injection frameworks, plugin systems, and any pattern where the call target is determined at runtime.

Purely natural-language queries like "how does the billing flow work" may not match any function or class name well enough to anchor into the graph. If the parser cannot find a starting node, it cannot traverse anything. The boost stage produces nothing, and the pipeline relies entirely on semantic and lexical retrieval.

That is why AST boosting is one stage in a multi-stage pipeline, not the whole pipeline.

### Visualizing the AST Graph

For debugging and understanding, it helps to visualize what the AST graph actually captures:

```python
def visualize_graph_neighborhood(
    graph: ASTGraph,
    center: str,
    max_depth: int = 2
) -> str:
    """Generate a text visualization of a function's graph neighborhood."""
    lines = [f"Graph neighborhood of: {center}\n"]
    visited = set()

    def traverse(node_id: str, depth: int, prefix: str):
        if depth > max_depth or node_id in visited:
            return
        visited.add(node_id)

        # Outgoing call edges
        for callee in sorted(graph.calls.get(node_id, set())):
            lines.append(f"{prefix}  calls  -> {callee}")
            traverse(callee, depth + 1, prefix + "  ")

        # Incoming call edges
        for caller in sorted(graph.called_by.get(node_id, set())):
            lines.append(f"{prefix}  called_by <- {caller}")
            traverse(caller, depth + 1, prefix + "  ")

        # Import edges
        file_path = node_id.split(':')[0] if ':' in node_id else node_id
        for imported in sorted(graph.imports.get(file_path, set())):
            lines.append(f"{prefix}  imports  -> {imported}")

    traverse(center, 0, "")
    return '\n'.join(lines)
```

Running this on an authentication middleware function might produce:

```
Graph neighborhood of: src/middleware/auth.py:authenticate_request

  calls  -> src/auth/tokens.py:validate_token
    calls  -> src/auth/keys.py:get_signing_key
    calls  -> src/auth/cache.py:get_cached_token
  calls  -> src/auth/session.py:get_user_session
    calls  -> src/db/session_store.py:load_session
  called_by <- src/app.py:app_middleware_stack
  called_by <- src/routes/__init__.py:api_router
  imports  -> src/auth/tokens
  imports  -> src/auth/session
  imports  -> src/config/auth
```

This visualization shows exactly why AST graph boosting works: the graph connects the middleware to its token validator, session manager, and all callers. A search for "authentication" anchors on this function and boosts everything in its neighborhood -- producing results that span the full authentication flow, not just the single file that mentions "auth" in its name.

### The Numbers

| Strategy | Top-5 Relevance |
|----------|----------------|
| Embedding-only | ~72% |
| AST-only | ~68% |
| AST + Embedding (fused) | ~91% |

Neither approach dominates alone. Together, they cover each other's blind spots. The improvement is not additive -- it is multiplicative. Each method catches what the other misses.

### Performance Characteristics

AST graph construction and traversal have distinct performance profiles:

**Construction (index time).** Parsing the AST and extracting edges is fast -- tree-sitter operates at millions of lines per second. For a 50K-file codebase, AST extraction takes 30-60 seconds. This is a one-time cost, included in the full reindex time. Incremental updates only re-extract edges for changed files and their dependents.

**Traversal (query time).** The BFS traversal from anchor nodes touches a small fraction of the graph. With a max depth of 3 and typical branching factors, a query touches 10-50 nodes out of potentially millions. The traversal is O(branching_factor^depth) which is bounded by the depth limit, not by graph size. This is why AST boosting adds less than 1ms to query time regardless of codebase size.

**Memory.** The graph is stored as adjacency lists -- dictionaries mapping node IDs to sets of neighbor IDs. For a 200K-chunk codebase with 500K edges, the graph uses 50-100MB of memory. It can be memory-mapped for lazy loading.

---

### Exercise

> **Try This**
>
> Build an AST graph from your codebase:
>
> 1. Parse each Python file with tree-sitter and extract function calls and import statements.
> 2. Pick a function you searched for in earlier chapters. Run the AST boost computation with that function as the query.
> 3. Examine the boosted chunks. Are they structurally relevant? Would semantic search have found them?
> 4. Combine the AST boosts with your hybrid search results from Chapter 5. Did any result change position significantly?

---

### Key Takeaways

- AST graph boosting adds structural signals (call graph, import graph, inheritance) that embeddings and BM25 cannot capture.
- Boost scores decay with graph distance, mirroring how developers think about code relevance.
- The fusion of structural and semantic signals produces 91% top-5 accuracy, up from 72% for embeddings alone and 68% for AST alone.
- AST boosting is most valuable in monorepos, codebases with thin abstractions, cross-file queries, and refactoring analysis.
- The technique is language-dependent and cannot resolve dynamic dispatch. It is one stage in a multi-stage pipeline, not a standalone solution.

---

## Pipeline Architecture Summary

A visual summary of the retrieval pipeline built in this book, with chapter references. Stages 4-5 (reranking and adaptive thresholds) are covered in the companion book, *Production Code Search*.

```
                    Query: "how does authentication work"
                                    |
                    +---------------+---------------+
                    |               |               |
              +---------+   +----------+   +-----------+
              | Stage 1  |   | Stage 2  |   | Stage 3   |
              | AST      |   | Semantic |   | BM25      |
              | Boost    |   | Search   |   | Search    |
              | (Ch. 6)  |   | (Ch. 3)  |   | (Ch. 4)   |
              +----+-----+   +-----+----+   +-----+-----+
                   |               |               |
                   |        +------+---------------+
                   |        |
                   |  +-------------+
                   |  | RRF Fusion  |
                   |  | (Ch. 5)     |
                   |  +------+------+
                   |         |
                   +----+----+
                        |
                 +------+-------+
                 | Fused +      |
                 | Boosted      |
                 | Candidates   |
                 +------+-------+
                        |
                 +------+-------+
                 | Stage 4      |
                 | Cross-Encoder|      * Covered in
                 | Reranking    |      * Production
                 +------+-------+      * Code Search
                        |
                 +------+-------+
                 | Stage 5      |
                 | Adaptive     |      * Covered in
                 | Threshold    |      * Production
                 +------+-------+      * Code Search
                        |
                 +------+-------+
                 | Final        |
                 | Results      |
                 | (3-5 chunks) |
                 +--------------+
```

**Stage timing breakdown (typical query, 50K-file codebase):**

| Stage | Time | What It Does |
|-------|------|-------------|
| AST Boost | 0.4ms | Walk graph from anchor nodes, assign structural boost scores |
| Semantic Search | 2.5ms | Encode query, search HNSW index for nearest vectors |
| BM25 Search | 0.8ms | Look up query terms in inverted index, compute BM25 scores |
| RRF Fusion | 0.1ms | Merge ranked lists using reciprocal rank fusion |
| Cross-Encoder | 1.5ms | Score top 25 query-document pairs for precise relevance |
| Adaptive Threshold | 0.1ms | Filter by relative margin off top score |
| **Total** | **5.4ms** | |

**Accuracy at each stage (cumulative):**

| After Stage | Top-1 Accuracy | Top-5 Accuracy |
|-------------|---------------|----------------|
| BM25 only | 42% | 66% |
| Semantic only | 55% | 75% |
| Hybrid (BM25 + Semantic) | 61% | 91% |
| + AST Boosting | 64% | 91% |
| + Cross-Encoder Reranking | 87% | 97% |
| + Adaptive Threshold | 87% (filtered) | 97% (filtered) |

The threshold does not improve accuracy metrics -- it improves downstream quality by removing noise.

---


## Conclusion

You can now build a code retrieval system from first principles. Not configure one. Not wrap one. Build one — from the embedding layer through sparse ranking, hybrid fusion, and structural graph signals. That's the actual capability you leave this book with, and it's worth being precise about why it matters.

The central thread running through every chapter is that code is not text. That idea sounds obvious once you've read it, but most retrieval systems are built as if it isn't true. They tokenize identifiers, strip structure, and feed the result into a pipeline designed for natural language. Then engineers wonder why the results are mediocre. The answer is always the same: the retrieval model doesn't know what the code knows. It doesn't know that `parse_response` and `deserialize_payload` are the same concept in two codebases, that a function signature carries more semantic weight than its docstring, or that two files with no shared tokens are tightly coupled because one imports the other three levels deep. Dense retrieval gets you partway there — embeddings trained on code handle the vocabulary gap better than anything sparse can. But they're blind to structure, and structure is where code keeps its meaning.

That's why hybrid retrieval with RRF isn't just a performance trick. It's a recognition that semantic similarity and keyword precision capture different things, and you need both. A query for `JWT token expiry handling` might surface the right module via dense search and the right function via BM25 — and neither alone would have ranked the correct answer first. RRF doesn't require you to tune a weighting coefficient between systems. It rewards documents that rank well across multiple signals, which is exactly the behavior you want from a retrieval layer that has to serve both fuzzy conceptual queries and precise symbol lookups. Once you've run a hybrid pipeline in production, going back to single-signal retrieval feels like navigating with one eye closed.

AST graph boosting is the third layer, and it's the one most teams skip — which is a mistake. The import graph is a map of what depends on what, encoded directly in the codebase. When you retrieve a candidate file and then propagate relevance scores to its neighbors, you surface the files that context demands. The model calling your retrieval API doesn't have to know to ask for those neighbors. Your system already knows they're relevant. That's the shift graph boosting delivers: from retrieval that answers the literal query to retrieval that understands what answering that query actually requires.

Here's what to do Monday morning. Take one search query you run repeatedly in your codebase — something you type into your editor's global search at least a few times a week. Run it through a BM25 index over your source files. Then run the same query through a code-specialized embedding model and do an approximate nearest-neighbor lookup. Collect the top-10 results from each, merge them with RRF, and look at what changed. That single experiment, done against your actual codebase with a query you actually care about, will tell you more about where your retrieval is failing than any benchmark. It will also give you a concrete baseline to improve against. You don't need a production system to run this. A script, a local ChromaDB instance, and an afternoon is enough.

The most common reason engineers don't act on material like this is not lack of understanding — it's that the problem doesn't feel urgent until it is. Code retrieval is a background tax. Every time a developer searches for something and has to scroll past irrelevant results, or gives up and asks a colleague, or pulls the wrong function into a context window, there's a cost. It's invisible because it's distributed across hundreds of small moments. No alert fires. No SLA is breached. The system just slowly makes everyone a little slower and a little less confident in what they're building. That's the failure mode to watch for, and it's easy to dismiss because nothing breaks.

What changes if you apply this: your retrieval layer becomes a genuine force multiplier. Every tool that depends on finding the right code — code generation, automated review, documentation, dependency analysis, onboarding assistants — gets better, because the foundation they're built on is accurate. The difference between a 60% precision retrieval system and a 90% precision system isn't 30% better tools. It's the difference between tools that engineers trust and tools that engineers work around. Trust compounds. When a system consistently returns what you were looking for, you use it more, you build more on top of it, and the investment pays forward.

What stays the same if you don't: the ambient friction remains. Retrieval stays a solved-enough problem that no one prioritizes fixing it, right up until someone tries to build something serious on top of it and discovers the foundation won't hold. By then, the cost of retrofitting a better retrieval layer is higher — because now there are systems depending on the mediocre one.

You have the tools to avoid that outcome. The architecture is in your hands.
## Appendix A: Glossary

### Information Retrieval Terms

| Term | Definition |
|------|-----------|
| BM25 | Best Matching 25. A term-frequency ranking function that scores documents based on query term frequency, inverse document frequency, and document length normalization. Published in 1994, still widely used. |
| Bi-encoder | An architecture that encodes query and document independently into separate vectors, then compares them with cosine similarity. Fast but loses fine-grained interactions. |
| Corpus | The full set of documents available for search. In code search, the corpus is the indexed codebase. |
| Cosine similarity | A measure of similarity between two vectors, computed as the dot product divided by the product of their magnitudes. Ranges from -1 to 1. |
| Document frequency (df) | The number of documents in the corpus that contain a specific term. |
| Hard negative | A result that appears relevant by surface metrics (keywords, similarity score) but is actually irrelevant or misleading. |
| HNSW | Hierarchical Navigable Small World. A graph-based algorithm for approximate nearest-neighbor search. Provides O(log n) query time. |
| IDF | Inverse Document Frequency. A measure of how rare a term is across the corpus. Rare terms get higher IDF, making them more discriminative. |
| Inverted index | A data structure that maps terms to the documents containing them. The foundation of keyword search engines. |
| Precision | Of the results returned, the fraction that are relevant. |
| Query | The input string describing what the user is searching for. |
| Recall | Of all relevant documents in the corpus, the fraction that were returned. |
| Recall@k | Recall computed over only the top k results. |
| Reciprocal Rank Fusion (RRF) | A method for combining multiple ranked lists using only rank positions (not scores). Robust to different score scales. |
| Relevance | The degree to which a document answers or relates to a query. Can be binary or graded. |
| Retrieval | The process of finding candidate documents from the corpus. Casts a wide net. |
| Ranking | The process of ordering retrieved candidates by relevance. Tightens the net. |
| Term frequency (tf) | How many times a term appears in a specific document. |
| Top-k | Returning the k highest-ranked results. Static top-k uses a fixed k; adaptive thresholds adjust dynamically. |

### Machine Learning Terms

| Term | Definition |
|------|-----------|
| Attention | A mechanism in transformer models that allows each token to attend to (consider the influence of) every other token in the input. |
| CodeBERT | A BERT variant trained on six programming languages alongside natural language. Produces code-aware embeddings. |
| Dimensionality | The number of elements in an embedding vector. Common values: 384 (MiniLM), 768 (BERT/CodeBERT), 1024 (large models). |
| Embedding | A fixed-size numerical vector representing the semantic meaning of a piece of text. Similar meanings produce similar vectors. |
| Fine-tuning | Continuing a pre-trained model's training on domain-specific data to improve performance on that domain. |
| Forward pass | One execution of a neural network on an input, producing an output. |
| Loss function | A function that measures how wrong a model's predictions are. Training minimizes this function. |
| MiniLM | A family of small, fast transformer models (22M parameters) commonly used for sentence and code embeddings. |
| Normalization | Scaling a vector to unit length (L2 norm = 1). Normalized vectors' cosine similarity equals their dot product. |
| Transformer | A neural network architecture based on self-attention. The foundation of modern language models and embedding models. |

### Code Analysis Terms

| Term | Definition |
|------|-----------|
| AST | Abstract Syntax Tree. A tree representation of source code produced by a parser, capturing the syntactic structure (functions, classes, expressions, statements). |
| Call graph | A directed graph where edges represent function calls. Node A has an edge to node B if function A calls function B. |
| Chunk | A piece of code extracted from a file for indexing. Typically a function, method, or class. The unit of search. |
| Import graph | A directed graph where edges represent module imports. Shows dependency relationships between files. |
| Inheritance graph | A directed graph where edges represent class inheritance. Shows type hierarchies. |
| Tree-sitter | A parser generator and incremental parsing library used for syntax highlighting, code navigation, and AST extraction across many languages. |
| Tokenization | Splitting text into tokens (words or subwords) for processing. Code tokenization requires splitting compound identifiers (camelCase, snake_case). |

---

## Appendix B: Tools & Resources

| Tool / Resource | URL | Purpose |
|----------------|-----|---------|
| sentence-transformers | https://www.sbert.net | Python library for embedding models. Supports bi-encoders and cross-encoders with a clean API. |
| ChromaDB | https://www.trychroma.com | Open-source embedding database with HNSW indexing. Good for prototyping and small-to-medium indexes. |
| rank-bm25 | https://github.com/dorianbrown/rank_bm25 | Python BM25 implementation. Simple API for building BM25 indexes from tokenized corpora. |
| tree-sitter | https://tree-sitter.github.io | Incremental parser generator supporting 100+ languages. Foundation for AST-aware chunking. |
| tree-sitter-python | https://pypi.org/project/tree-sitter-python/ | Python bindings for tree-sitter with the Python grammar. |
| FAISS | https://github.com/facebookresearch/faiss | Facebook's vector similarity search library. More performant than ChromaDB for large-scale indexes. |
| Hugging Face Hub | https://huggingface.co/models | Repository of pre-trained models including embedding models, cross-encoders, and code-specific models. |
| Pyckle CLI | https://pyckle.co | Semantic code search tool implementing the architecture described in this book. |

---

## Appendix C: Series Cross-References (Code Search, Decoded)

This book expands on material from the Code Search, Decoded blog and video series. For readers who want the shorter-form treatment:

| Chapter | Series Episode | Topic |
|---------|---------------|-------|
| Ch. 2 | Ep 13 | Chunking strategies and language-aware parsing |
| Ch. 3 | Ep 9 | Embedding models and semantic search |
| Ch. 4 | Ep 9 | BM25 and lexical search |
| Ch. 5 | Ep 9 | Hybrid retrieval with RRF |
| Ch. 6 | Ep 8 | AST graph boosting |

Chapter 1 is original to this book with no direct series counterpart.

---

## Appendix D: Further Reading

- **"Introduction to Information Retrieval"** by Manning, Raghavan, and Schutze. The standard textbook on IR fundamentals. Free online. Covers BM25, inverted indexes, evaluation metrics, and ranking models in depth.

- **"The Probabilistic Relevance Framework: BM25 and Beyond"** by Robertson and Zaragoza (Foundations and Trends in Information Retrieval, 2009). The definitive paper on BM25, written by its creators. Explains the mathematical foundations and the design decisions behind each parameter.

- **"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"** by Reimers and Gurevych (EMNLP 2019). Introduces the bi-encoder architecture for sentence embeddings that underpins modern semantic search.

- **"Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"** by Cormack, Clarke, and Butt (SIGIR 2009). The original RRF paper. Short, clear, and still the reference implementation for rank fusion.

- **"CodeBERT: A Pre-Trained Model for Programming and Natural Languages"** by Feng et al. (EMNLP 2020). Introduces CodeBERT, the first large-scale pre-trained model for code understanding.

- **"Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"** by Malkov and Yashunin (IEEE TPAMI 2018). The HNSW paper. Explains the graph structure and search algorithm used by most vector databases.

- **Code Search, Decoded** (Episodes 8-13) by David Kelly Price. The blog series that covers the same pipeline architecture in episodic format. Available at pyckle.co/blog.

---

## About the Author

David Kelly Price is the founder of Pyckle, building AI context optimization tools for development teams. Background in AI/ML tooling, retrieval systems, and context routing for codebases. MBA in Finance -- analytical rigor applied to technical problems.

---

## About Pyckle

Pyckle builds semantic code search tools for AI-assisted development. The core product is a local search server that indexes your codebase and serves relevant code to your AI assistant through the Model Context Protocol (MCP). The architecture described in this book -- AST graph boosting, hybrid BM25+semantic search, cross-encoder reranking, adaptive thresholds, and incremental indexing -- is the foundation of Pyckle's retrieval pipeline.

Pyckle runs locally on developer hardware. Code never leaves the machine. The search pipeline executes in under 10 milliseconds per query. The index stays fresh through git-aware incremental updates. The goal is straightforward: when an AI assistant needs code context, give it the right code, in the right order, with nothing extra.

More information at pyckle.co.

---

## Acknowledgments

This book draws on decades of information retrieval research, from the BM25 work of Robertson and Zaragoza (1994) to the transformer architectures that power modern embeddings. The pipeline architecture described here is informed by production systems at Google (dense retrieval), Microsoft (CodeBERT), Meta (FAISS), and the broader open-source IR community.

The Code Search, Decoded series (Episodes 8-13) covers many of the same topics in a blog/video format and served as the source material for several chapters. Readers who want a more concise treatment or prefer video content should start there.

Special thanks to the sentence-transformers, ChromaDB, tree-sitter, and rank-bm25 teams for building the open-source tools that make this pipeline practical to implement.

---

*Code Retrieval from Scratch — Version 1.0 — March 2026*
*Published by Pyckle (pyckle.co)*

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*


---

## Related Blog Posts

- [RAG for Code Is Not RAG for Documents](https://pyckle.co/blog/rag-for-code-is-not-rag-for-documents.html)
- [Why Chunking Breaks Your AI's Retrieval](https://pyckle.co/blog/why-chunking-breaks-your-ais-retrieval.html)
- [We Trained Our Own Code Embedding Model from Scratch](https://pyckle.co/blog/we-trained-our-own-code-embedding-model-from-scratch-heres-what-happened.html)

---

*[Browse all free guides →](https://pyckle.co/books.html)*