---
title: "Building a Semantic Search Pipeline"
subtitle: "From Raw Code to Intelligent Retrieval — Architecture, Embeddings, and Search at Scale"
author: "David Kelly Price"
version: "1.0"
date: 2026-03-21
status: draft
type: ebook
target_audience: "Senior developers, ML engineers, and architects who want to understand the internals of semantic code search — embeddings, chunking strategies, indexing, retrieval, and ranking"
estimated_pages: 90
chapters:
  - "Why Keyword Search Fails for Code"
  - "Code Embeddings: Turning Syntax into Vectors"
  - "Chunking Strategies for Source Code"
  - "Building the Index"
  - "Query Processing and Intent Understanding"
  - "Retrieval and Ranking"
  - "Evaluation and Quality Metrics"
  - "Scaling to Millions of Files"
  - "The Pyckle Approach"
tags:
  - pyckle
  - ebook
  - draft
  - semantic-search
  - embeddings
  - vector-search
  - code-intelligence
  - retrieval
  - architecture
---

<!--
=============================================================================
DESIGN & LAYOUT NOTES
=============================================================================

Typography:
  - Body: Inter or Source Serif 4, 11pt, 1.5 line height
  - Code blocks: JetBrains Mono or Fira Code, 9pt, syntax-highlighted
  - Headings: Inter Display, semi-bold; H1 52pt, H2 28pt, H3 18pt, H4 14pt
  - Callout labels (Key Insight, Warning, Try This): uppercase, tracked, 8pt

Color palette:
  - Brand teal:   #0FBF9F  (Pyckle primary)
  - Dark navy:    #0D1B2A  (page backgrounds, section breaks)
  - Warm white:   #F9F8F5  (body text pages)
  - Code bg:      #1B2333
  - Callout bg:   #EAF9F5 (Key Insight), #FFF8E7 (Try This), #FEF0F0 (Warning)
  - Accent gold:  #E8A84C  (chapter numbers, accent rules)

Page layout:
  - Trim size: 8.5" × 11" (PDF) / reflowable (EPUB)
  - Margins: inside 1.1", outside 0.9", top 0.85", bottom 1.0"
  - Footer: chapter title left, page number right (teal rule above)
  - Chapter openers: full-width dark navy section, chapter number in gold,
    chapter title in white, 1–2 sentence hook in teal italic

Figures:
  - All figures rendered as SVG-first diagrams (fallback PNG at 2x)
  - Caption below, centered, 9pt italic, prefixed "Figure N."
  - Architecture diagrams use dark-mode color scheme to match code blocks

Callout boxes:
  - > **Key Insight**   — teal left border, #EAF9F5 background
  - > **Try This**      — gold left border, #FFF8E7 background
  - > **Warning**       — red left border, #FEF0F0 background

Code examples:
  - Language tag always present (```python, ```bash, etc.)
  - Filename comment on line 1 where it aids context
  - Max 80 chars wide; longer lines wrapped with backslash continuation comments

Print / PDF export:
  - Chapter opener pages always start on odd (right-hand) page
  - TOC entries include page numbers (auto-generated by PDF renderer)
  - Part divider pages: full bleed navy, part number + title only

=============================================================================
-->

---

# Building a Semantic Search Pipeline


## About This Guide

This book is for engineers who have already shipped keyword search and watched it fail. You built the index, tuned the queries, maybe added fuzzy matching — and users still couldn't find what they were looking for. If you're working on a developer tool, a code assistant, or any system that needs to surface relevant code from a large repository, you're in the right place.

What you'll come away with is a working understanding of how to build a semantic search pipeline for source code — from raw files to ranked results. Not the theory in isolation. The actual decisions: how to chunk a function that spans 200 lines, how to embed it without losing structural meaning, how to handle queries that don't match any token in the corpus. By the end, you can build this system. That's the bar.

The book does not cover general-purpose document search or NLP fundamentals. It won't teach you how transformers work from scratch or walk you through fine-tuning an embedding model. If you need that foundation, read it first and come back. This also isn't a production operations guide — deployment, monitoring, and infrastructure are out of scope, except where they intersect directly with retrieval quality.

Read it linearly the first time. The chapters build on each other: chunking decisions affect index structure, index structure affects query processing, and all of it feeds into evaluation. If you're returning to a specific problem — say, you're reworking your ranking logic or debugging recall — jump directly to the relevant chapter. Each one is self-contained enough to stand alone as a reference once you have the full picture.

The last chapter covers the Pyckle approach specifically. It's not a sales pitch. It's a worked example of every tradeoff covered in the preceding eight chapters applied to a real system.
## From Raw Code to Intelligent Retrieval — Architecture, Embeddings, and Search at Scale

**By David Kelly Price**

Version 1.0 — March 2026

---

*For every developer who has typed `grep -r "authentication" .` and found three thousand results — none of them the one they needed.*

---

&nbsp;

---

## Table of Contents

**Preface** ......................................................................................................... 7

---

### Part I: The Problem and the Foundation

**Chapter 1: Why Keyword Search Fails for Code** ................................................. 11
- 1.1 The Semantic Gap
- 1.2 The Vocabulary Mismatch Problem
- 1.3 What Grep Cannot Know
- 1.4 Cross-Language Synonyms and Patterns
- 1.5 The Cost of Bad Search

**Chapter 2: Code Embeddings — Turning Syntax into Vectors** .................................. 29
- 2.1 What Is an Embedding?
- 2.2 Transformer Architecture for Code
- 2.3 Model Landscape: CodeBERT, StarCoder, UniXcoder, and Beyond
- 2.4 Embedding Dimensions and What They Mean
- 2.5 Fine-Tuning for Your Codebase
- 2.6 Generating Embeddings in Practice

**Chapter 3: Chunking Strategies for Source Code** ............................................... 53
- 3.1 Why Chunking Is Not Trivial
- 3.2 Naive Approaches and Their Failure Modes
- 3.3 AST-Aware Chunking
- 3.4 Function-Level and Class-Level Decomposition
- 3.5 Sliding Window with Overlap
- 3.6 Metadata Enrichment
- 3.7 Choosing a Strategy: Decision Framework

---

### Part II: Building the Search System

**Chapter 4: Building the Index** ..................................................................... 81
- 4.1 Vector Database Landscape
- 4.2 The HNSW Algorithm
- 4.3 Quantization: PQ, SQ, and Binary
- 4.4 Hybrid Indexes: Combining Vector and Keyword
- 4.5 Index Configuration for Code Corpora
- 4.6 Operational Concerns

**Chapter 5: Query Processing and Intent Understanding** ........................................ 109
- 5.1 Anatomy of a Code Search Query
- 5.2 Intent Classification
- 5.3 Query Expansion
- 5.4 Embedding the Query
- 5.5 Rewriting and Normalization
- 5.6 Putting It Together: A Query Processing Pipeline

---

### Part III: Production, Scale, and the Pyckle Approach

**Chapter 6: Retrieval and Ranking** ................................................................ 135
- 6.1 Approximate Nearest Neighbor Search
- 6.2 Reranking Strategies
- 6.3 Fusion: Combining Vector and BM25 Scores
- 6.4 Cross-Encoder Reranking
- 6.5 Contextual Boosting

**Chapter 7: Evaluation and Quality Metrics** ..................................................... 157
- 7.1 Why Evaluation Is Hard for Code Search
- 7.2 Offline Metrics: MRR, NDCG, Recall@K
- 7.3 Building a Ground-Truth Dataset
- 7.4 Online Metrics and User Signals
- 7.5 The Pyckle Evaluation Harness

**Chapter 8: Scaling to Millions of Files** ....................................................... 179
- 8.1 Indexing at Scale
- 8.2 Sharding Strategies
- 8.3 Incremental Updates and Staleness
- 8.4 Caching and Hot Paths
- 8.5 Cost Modeling

**Chapter 9: The Pyckle Approach** .................................................................. 203
- 9.1 Design Philosophy
- 9.2 Architecture Overview
- 9.3 Context Routing
- 9.4 The MCP Integration Layer
- 9.5 What We Learned Building This

**Appendix A: Glossary** ........................................................ 229

**Appendix B: Tools and Resources** ................................................ 233

**Appendix C: Further Reading** .................................................................. 237


---

&nbsp;

---

## Preface

This book exists because semantic code search is genuinely hard to get right, and most of the written material about it is either too shallow ("just use embeddings!") or buried in academic papers that assume you have a PhD and six months to read them.

The gap in the middle — practical, architectural, production-grade knowledge — is where teams waste months and engineering budgets rebuilding the same mistakes.

We built Pyckle to solve semantic code search as a product. Along the way, we indexed tens of millions of code chunks, tried most of the approaches described in this book, and shipped a system that handles real developer queries at scale. This book is the write-up we wish we had at the start.

**Who this is for.** You are a senior developer, ML engineer, or software architect. You understand transformers well enough to have a conversation about attention, you know what a vector database is, and you have written production Python. You do not need us to explain what a function is.

**What this covers.** Part I establishes why keyword search fails and builds the conceptual foundation — embeddings and chunking. Part II covers the construction of the search system itself — indexing, query processing, retrieval, and ranking. Part III addresses production realities — evaluation, scaling, cost, and how Pyckle approaches these problems in the real world.

**What this does not cover.** This is not a guide to building a general-purpose search engine. It is not a survey of every possible embedding model. It is not an introduction to machine learning. Each chapter assumes you are moving forward to implement, not just understand in the abstract.

All code examples in this book are Python 3.11+ and have been tested. Dependencies are pinned where they matter. Where a code example is simplified for clarity, a note says so.

Let's build something.

— *David Kelly Price, March 2026*

---

&nbsp;

---

# Part I: The Problem and the Foundation

---

&nbsp;

---

# Chapter 1: Why Keyword Search Fails for Code

> *"The map is not the territory — and the identifier is not the concept."*

---

## 1.1 The Semantic Gap

Open any sufficiently large codebase and ask a simple question: *Where do we validate user input before writing to the database?*

With `grep`, you will search for `validate`, `sanitize`, `clean`, maybe `escape`. You will get hundreds of hits — some relevant, most not. You will miss the function called `prepare_record()` that does exactly what you asked about but uses none of those words. You will also find every logging call that contains the string "validate" in a comment. You will spend twenty minutes reading irrelevant code to triangulate the answer.

This is the **semantic gap**: the distance between the words a developer uses to describe what they want and the words the author of the code used to write it.

The semantic gap is not a new problem. Information retrieval researchers have studied it since the 1980s. But it is a particularly acute problem in codebases because:

1. **Naming is inconsistent.** Different engineers use different vocabularies. One writes `authenticate_user()`, another writes `verify_credentials()`, a third writes `check_login()`. All three do the same thing. None of them match a search for the others.

2. **Code is multi-lingual by nature.** The same concept appears across Python, TypeScript, Terraform, SQL, and shell scripts. Keyword search treats each language as a separate silo.

3. **Intent lives in structure, not just tokens.** The statement `if user.role == "admin":` expresses authorization logic, but the words "authorization" and "permission" never appear. The concept is in the shape of the code, not the names within it.

4. **Identifiers are compressed language.** `calc_ttl_exp` encodes the concept "calculate time-to-live expiration" in a form that no natural language query will ever match exactly.

The result is that keyword search degrades catastrophically as codebases grow. At 10,000 lines it is usable. At 500,000 lines it is noise. At 5 million lines it is effectively non-functional for conceptual queries — which are the exact queries senior developers need to answer every day.

> **Key Insight**
>
> Keyword search has O(n) noise growth relative to codebase size for conceptual queries. Semantic search maintains roughly constant precision regardless of corpus size because it retrieves by meaning, not surface tokens.

---

## 1.2 The Vocabulary Mismatch Problem

Let us make this concrete with a table of real-world query mismatches — queries that developers ask and the actual identifiers that exist in a codebase:

| Developer's Query | What the Code Actually Says |
|---|---|
| "retry logic for failed requests" | `_with_backoff()`, `attempt_with_jitter()`, `ResiliencePolicy.execute()` |
| "password hashing" | `hash_secret()`, `derive_key()`, `bcrypt_encode()`, `_secure_store()` |
| "rate limiting middleware" | `Throttle`, `BucketPolicy`, `RequestGuard`, `slow_down()` |
| "soft delete" | `archive_record()`, `mark_inactive()`, `tombstone()`, `_hide_from_queries()` |
| "event bus" | `MessageBroker`, `SignalDispatcher`, `PubSubClient`, `EventEmitter` |
| "connection pool" | `PoolManager`, `SessionFactory`, `DatabasePool`, `_conn_cache` |
| "parse JWT" | `decode_token()`, `verify_bearer()`, `extract_claims()`, `_check_auth_header()` |
| "background job" | `Task`, `Worker`, `AsyncJob`, `CeleryTask`, `QueueEntry`, `Runnable` |

*Table 1.1 — Vocabulary mismatch: developer queries versus actual identifiers in production codebases.*

No amount of query engineering rescues a keyword system from this table. The developer searching for "rate limiting middleware" will not — and should not have to — know that this particular codebase calls it `RequestGuard`. The knowledge that `RequestGuard` is the rate limiter lives in the head of the engineer who wrote it, and possibly in documentation no one has read since 2022.

Semantic search closes this gap because it encodes *meaning* into a numerical representation. "Rate limiting middleware" and "RequestGuard" will have similar vector representations if the model has learned the relationship between the concept and its implementation patterns — and modern code-specific models have, trained on millions of repositories where both naming conventions coexist.

---

## 1.3 What Grep Cannot Know

Beyond vocabulary mismatch, there are entire categories of queries that keyword search is structurally incapable of answering:

### Structural Queries

*"Find all functions that take a database connection as their first argument and call an external HTTP endpoint."*

This is a query about code structure — the shape of function signatures and call graphs. No string match answers it. You need either an AST traversal (which is a specialized tool, not general search) or an embedding model that has learned to associate this structural pattern with the query description.

### Behavioral Queries

*"Show me the code that handles what happens when a payment fails."*

The word "payment" might appear, but "fails" could be expressed as an exception, a conditional branch on a status code, a callback to an error handler, or a message published to a queue. The behavior is distributed across multiple files and expressed in structural rather than lexical terms.

### Cross-File Conceptual Queries

*"Where does the application decide which pricing tier to apply to a customer?"*

The logic may be spread across a configuration file, a feature flag resolver, a billing service, and a rule engine. Keyword search returns all files matching "pricing" or "tier" — including the CSS file that styles the pricing table on the marketing page. Semantic search can retrieve the cluster of files that collectively implement the concept.

### Anti-Pattern Queries

*"Find places where we're doing synchronous I/O inside an async event loop."*

This is a query about a pattern of misuse. The code does not contain the string "synchronous I/O inside async event loop" — it just contains `requests.get()` called from a coroutine. An embedding model trained on code can learn this as a semantic pattern.

> **Warning**
>
> AST-based tools (tree-sitter, language servers, ctags) are excellent for structural queries when the query is precise and the language is known. Do not replace them with semantic search — compose them. The best production systems use semantic search for conceptual discovery and AST tools for structural navigation after the right file is found.

---

## 1.4 Cross-Language Synonyms and Patterns

Modern software projects are polyglot. A single feature — say, caching — may be implemented in Python for the API layer, in TypeScript for the frontend, in Terraform for the infrastructure definition, in SQL for the materialized view, and in shell script for the cache-warming cron job.

A developer searching for "caching strategy" needs results from all five. Keyword search fails here in a specific way: it cannot know that `@lru_cache` in Python, `useMemo` in React, `cache_control: max-age=3600` in Terraform, and `CREATE MATERIALIZED VIEW` in SQL are all implementations of the same conceptual category.

Semantic search handles this naturally because the embedding model has been trained on multilingual code and has learned that these constructs occupy similar positions in conceptual space.

**Figure 1.1** illustrates this: five language-specific implementations of "caching" mapped to vector space. Keyword search retrieves along the x-axis (surface token matches). Semantic search retrieves along the y-axis (conceptual proximity). The cluster at center-top — all the caching implementations — is only visible to semantic search.

```
Figure 1.1 — Conceptual clustering across languages in embedding space.
Dots represent code chunks; circles represent query retrieval radius.

    Conceptual Axis (↑ = "caching")

    ●  ●  ●                    ← @lru_cache, useMemo, max-age
      ●  ●                     ← CREATE MATERIALIZED VIEW, cache warming


           ●●●   ●             ← unrelated functions (also named "cache")
    ●                          ← variable named "cache" in logging code

    ─────────────────────────
    Token Axis (→ = contains "cache")

    Keyword retrieval radius: ████████████████  (retrieves everything)
    Semantic retrieval radius: ●●●●● (retrieves the cluster)
```

### The Cross-Language Test

Here is a concrete cross-language test for your own system. Take three equivalent implementations of "exponential backoff with jitter":

```python
# Python
import time, random

def retry_with_backoff(func, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```

```typescript
// TypeScript
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxRetries - 1) throw err;
      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000;
      await new Promise(r => setTimeout(r, delay));
    }
  }
  throw new Error("unreachable");
}
```

```go
// Go
func RetryWithBackoff(fn func() error, maxRetries int, baseDelay time.Duration) error {
    for attempt := 0; attempt < maxRetries; attempt++ {
        if err := fn(); err == nil {
            return nil
        } else if attempt == maxRetries-1 {
            return err
        }
        jitter := time.Duration(rand.Int63n(int64(baseDelay)))
        time.Sleep(baseDelay*time.Duration(1<<attempt) + jitter)
    }
    return nil
}
```

A semantic search for "retry with exponential backoff" should retrieve all three. A keyword search for "retry" will find all three — but so will it find `logger.info("Retry attempt %d", n)` in a hundred other files. The signal-to-noise ratio is the difference.

> **Try This**
>
> Pick a concept from your codebase that you know has at least two implementations in different languages or files. Run a keyword search for it. Count the results. Count how many are actually relevant. Then run the same query through a semantic search system (even a toy one using `sentence-transformers` with no fine-tuning). Compare precision at the top 10 results. The difference is your baseline improvement opportunity.

---

## 1.5 The Cost of Bad Search

The cost of poor code search is almost never tracked as a line item — which is precisely why it remains unfixed for so long.

Consider a senior engineer earning $200,000/year (approximately $100/hour fully loaded). Research consistently shows that developers spend 20–30% of their time searching for code, understanding existing code, and navigating to the right location before they can begin making a change. For a senior engineer, that is $20,000–$30,000 per year spent on *finding*, not *building*.

For a team of ten senior engineers, that is $200,000–$300,000 per year — roughly the cost of another engineer — spent on search friction.

The numbers are illustrative, but the direction is not controversial. A 2023 study of 3,000 developers by GitHub found that developers spend an average of 35% of their time understanding code. A separate McKinsey analysis found that improving developer tooling can accelerate delivery velocity by 20–40% in high-complexity codebases.

Semantic search does not eliminate this time. But it compresses the *search* component of it — the time spent asking questions and getting wrong or incomplete answers — by returning conceptually relevant results rather than lexically matching tokens.

| Team Size | Eng Cost/Year | % Time on Search | Annual Search Cost | Est. Reduction w/ Semantic Search |
|---|---|---|---|---|
| 5 | $1M | 20% | $200K | $60K–$100K |
| 20 | $4M | 25% | $1M | $250K–$400K |
| 100 | $20M | 25% | $5M | $1.25M–$2M |

*Table 1.2 — Estimated annual cost of search friction and potential savings from semantic search. Assumes 30–50% reduction in search time, which matches our data from Pyckle customers after onboarding.*

The ROI calculation is why semantic code search is not just an interesting ML problem — it is an economic one. The rest of this book is about how to build a system that actually delivers that reduction at production quality.

---

&nbsp;

---

# Chapter 2: Code Embeddings — Turning Syntax into Vectors

> *"An embedding is a hypothesis about what things mean, compressed into 768 numbers."*

---

## 2.1 What Is an Embedding?

An embedding is a dense vector — a list of floating-point numbers — that represents the meaning of a piece of text (or code) in a high-dimensional space. Two things with similar meanings have vectors that are geometrically close together. Two things with dissimilar meanings have vectors that are far apart.

The critical word is *meaning*. Embeddings are not character n-grams, term frequency vectors, or bag-of-words representations. They encode semantic content — what the text *means* — rather than what tokens it contains.

For code, this means:

```python
def get_user_by_email(email: str) -> Optional[User]:
    return db.query(User).filter(User.email == email).first()
```

and

```python
def find_account(address: str) -> Account | None:
    return session.execute(
        select(Account).where(Account.email_address == address)
    ).scalar_one_or_none()
```

...will have similar embeddings because they implement the same concept — look up a user record by their email address — despite sharing almost no surface tokens.

### The Geometry of Meaning

In a well-trained embedding space, the geometry encodes conceptual relationships:

- **Cosine similarity** measures the angle between two vectors. Two identical concepts map to cosine similarity ≈ 1.0. Unrelated concepts map to ≈ 0.0. Opposites (in models trained with contrastive loss) may map to negative values.

- **Clustering** in embedding space corresponds to conceptual categories. Authentication functions cluster together. Parsing functions cluster together. Database query builders cluster together.

- **Linear relationships** sometimes encode analogies: `authenticate - password + token ≈ verify_bearer`.

The practical consequence: to find all code that implements "database read by primary key", you embed the query phrase and retrieve all code vectors within a small cosine distance of it. You do not need to know what any of those functions are called.

> **Key Insight**
>
> Cosine similarity, not Euclidean distance, is the correct similarity metric for most text/code embeddings. Euclidean distance is sensitive to vector magnitude, which correlates with token length rather than semantic content. Always normalize your vectors or use cosine similarity explicitly.

---

## 2.2 Transformer Architecture for Code

Modern code embeddings are produced by transformer models. Understanding the architecture at a high level is important because architectural choices directly affect what kinds of queries your embedding model will handle well.

### The Attention Mechanism and Code

The key innovation of transformers for code is the **attention mechanism**: the ability for every token in a sequence to attend to every other token, regardless of distance. This matters enormously for code.

In natural language, long-range dependencies are common but bounded. In code, they are structurally fundamental. The function signature at line 1 determines the meaning of every line inside the function body. A variable declared at the top of a file shapes the semantics of every use below it. A class definition creates context for every method within it.

Self-attention handles this by computing a weighted sum over all positions for each token's representation. When processing `return user.authenticate(password)`, the model attends to the definition of `authenticate` elsewhere in the context, the type annotation of `user`, and the broader function that contains this statement.

### Bidirectional vs. Causal Models for Embedding

There are two dominant architectures for embedding:

**Bidirectional (BERT-style)**: The model sees the entire sequence simultaneously. Every token can attend to every other token. This produces better semantic representations because each token's embedding incorporates the full context. Used by CodeBERT, GraphCodeBERT, UniXcoder.

**Causal (GPT-style with pooling)**: The model generates tokens left-to-right; each token only attends to previous tokens. When used for embedding, representations are extracted at the final layer and pooled. Used by some StarCoder-derived embedding models.

For retrieval tasks, **bidirectional models generally produce better embeddings** because semantic meaning in code is not left-to-right sequential — it is relational. The name of a function is shaped by its body, not just its prefix.

```
Figure 2.1 — Attention patterns for bidirectional vs. causal architectures.

Bidirectional (CodeBERT):
  token1 ←→ token2 ←→ token3 ←→ token4 ←→ ... (full cross-attention)

Causal (GPT-style):
  token1 → token2 → token3 → token4 → ... (left-to-right only)

For the function body:
  def validate(email):     ← bidirectional: "validate" can attend to "email"
      return "@" in email       "email" can attend back to "validate"

  Causal model: "email" in return statement cannot attend to function name
```

### Pooling Strategies

A transformer produces one embedding vector per input token. To get a single embedding for an entire code chunk, you need a **pooling strategy**:

- **[CLS] token pooling**: Take the embedding of the special `[CLS]` token prepended to the input. This is what BERT-style models were trained with and is the standard approach for classification tasks. It works reasonably for code but can underweight information in the function body.

- **Mean pooling**: Average all token embeddings, optionally weighted by attention scores. Generally produces better retrieval quality for code than [CLS] pooling because it distributes representation across the full sequence.

- **Max pooling**: Take the maximum activation across all positions for each dimension. Good at capturing the presence of specific features but loses sequential structure.

- **Weighted mean pooling**: Weight token embeddings by their attention scores before averaging. This is what models like E5 and BGE use internally; it emphasizes semantically important tokens.

For code specifically, mean pooling over the full token sequence, excluding padding tokens, is the pragmatic default that works well across model architectures.

---

## 2.3 Model Landscape

The embedding model is the most consequential architectural choice in a semantic code search system. Here is the current state of the field:

### CodeBERT (Microsoft, 2020)

CodeBERT was the first widely adopted transformer trained specifically on code. It uses a RoBERTa base architecture (125M parameters) trained with masked language modeling and replaced token detection on the CodeSearchNet corpus (6 programming languages, 6.4M functions with docstrings).

**Strengths**: Battle-tested, fast, good at function-level retrieval when docstrings are present. Strong cross-lingual transfer.

**Weaknesses**: 512 token context limit (severe for modern functions). Trained on a relatively narrow distribution of open-source code. Does not understand control flow at the embedding level.

**Embedding dimension**: 768

**Best for**: Legacy systems with good docstring coverage; low-latency requirements where model size matters.

### GraphCodeBERT (Microsoft, 2021)

An extension of CodeBERT that incorporates **data flow graphs** into the pre-training objective. The model learns not just token co-occurrence patterns but structural relationships between variables — where a variable is defined, where it is used, what other variables it depends on.

**Strengths**: Significantly better at structural queries ("where is this variable used", "functions with similar data flow"). Better cross-language transfer than CodeBERT.

**Weaknesses**: More complex to serve (requires data flow graph extraction at inference time for best results, though it degrades gracefully without it). Still 512 token limit.

**Embedding dimension**: 768

### UniXcoder (Microsoft, 2022)

UniXcoder unifies code understanding and generation in a single model, with a novel **abstract syntax tree (AST) flattening** approach — the serialized AST is prepended to the token sequence so the model sees structural information without requiring a graph neural network.

**Strengths**: Better than CodeBERT and GraphCodeBERT on most retrieval benchmarks. The AST prefix makes it structure-aware without graph preprocessing complexity. 512 token limit is extended to 1024 in some fine-tuned variants.

**Weaknesses**: AST generation adds preprocessing latency. Less widely deployed than CodeBERT, so fewer community fine-tunes available.

**Embedding dimension**: 768

### StarCoder2-based Embeddings (BigCode, 2024)

StarCoder2 is a causal language model trained on The Stack v2 — over 600 programming languages, 900B tokens. Fine-tuned versions (e.g., `Salesforce/codet5p-110m-embedding`, StarEncoder) use mean pooling over the final layer to produce embeddings.

**Strengths**: Vastly broader language coverage (all mainstream languages plus Terraform, SQL, Dockerfile, shell, and hundreds more). Much more representative of real-world code diversity than CodeSearchNet-trained models. Context up to 4096 tokens in some variants.

**Weaknesses**: Larger models are slower to serve. Quality varies significantly across fine-tuning recipes — the base model is strong, but the embedding head requires careful training.

**Embedding dimension**: 256–4096 depending on variant

### Voyage Code (Voyage AI, 2024)

A commercial embedding model purpose-built for code retrieval. `voyage-code-3` produces 1024-dimensional embeddings and currently leads most public code retrieval benchmarks.

**Strengths**: State-of-the-art retrieval quality with no fine-tuning required. Handles long context (up to 16K tokens). Excellent cross-language performance.

**Weaknesses**: API-only (no self-hosting). Cost at scale is non-trivial. Black box — no ability to fine-tune for domain-specific terminology.

### OpenAI text-embedding-3-large

Not code-specific, but the large context window (8191 tokens), high-dimensional output (3072), and strong general language understanding make it competitive for code corpora with good variable naming and docstrings.

**Weaknesses**: Does not understand programming language semantics. Performs poorly on minified code, heavy DSLs, or terse naming conventions.

| Model | Params | Dim | Context | License | Self-Host | Code Benchmark (CodeSearchNet) |
|---|---|---|---|---|---|---|
| CodeBERT | 125M | 768 | 512 | MIT | Yes | 0.677 MRR |
| GraphCodeBERT | 125M | 768 | 512 | MIT | Yes | 0.693 MRR |
| UniXcoder | 125M | 768 | 1024 | MIT | Yes | 0.706 MRR |
| codet5p-110m | 110M | 256 | 512 | BSD-3 | Yes | 0.717 MRR |
| voyage-code-3 | — | 1024 | 16K | Commercial | No | 0.751 MRR |
| text-embedding-3-large | — | 3072 | 8191 | Commercial | No | 0.698 MRR |

*Table 2.1 — Model comparison for code embedding. MRR scores on CodeSearchNet Python split. Scores are approximate and depend on retrieval setup.*

> **Key Insight**
>
> For most teams starting out, `codet5p-110m-embedding` offers the best balance of quality, speed, self-hostability, and license freedom. It runs on CPU for batch indexing and fits comfortably on a single T4 GPU for real-time query embedding. Start here and upgrade to `voyage-code-3` only if you need marginal quality improvement and can absorb the API cost.

---

## 2.4 Embedding Dimensions and What They Mean

The **dimensionality** of an embedding is the length of the vector — the number of floating-point numbers that represent one code chunk. This number has practical consequences:

**Memory**: A 1M-chunk index with 768-dimensional float32 embeddings requires 768 × 4 × 1,000,000 = **3.07 GB** of memory. At 3072 dimensions (text-embedding-3-large), the same index requires **12.3 GB**. Dimensionality is the dominant factor in index memory footprint.

**Retrieval latency**: HNSW search scales with dimensionality. Higher dimensions mean slower distance calculations. The practical ceiling for real-time serving without quantization is around 1024 dimensions on consumer GPU hardware.

**Expressiveness vs. sparsity**: Higher dimensions allow more nuanced distinctions between concepts. But untrained dimensions are noise — they do not contribute signal. Models with more parameters can use high dimensions effectively; small models producing high-dimensional vectors often produce sparser, less informative spaces.

**Quantization compatibility**: Lower-dimensional embeddings benefit less from quantization (the overhead of the quantization math can approach the cost of the dot product). Higher-dimensional embeddings see greater speedups from product quantization.

The practical choice: **768 dimensions** is the sweet spot for most production code search systems. It is large enough to represent code semantics richly, small enough to serve fast and index cheaply, and is the output dimension of the most battle-tested code models.

---

## 2.5 Fine-Tuning for Your Codebase

Pre-trained embedding models are trained on generic open-source code. Your codebase has its own vocabulary, patterns, and idioms. Fine-tuning the embedding model on examples from your own codebase is the highest-ROI optimization available for improving retrieval quality.

### What Fine-Tuning Does

Fine-tuning for retrieval uses **contrastive learning** — specifically, the InfoNCE loss (also called NT-Xent). The model is given triplets:

- **Query**: A natural language description or a code snippet
- **Positive**: The code chunk that should match the query
- **Negative**: Code chunks that should not match

The loss function pushes the query embedding close to its positive and away from its negatives in vector space.

### Building a Training Dataset

The most effective approach for code search is **synthetic query generation**:

```python
# fine_tuning/generate_training_pairs.py
import anthropic
from pathlib import Path
from typing import Generator

def generate_queries_for_chunk(
    code_chunk: str,
    language: str,
    client: anthropic.Anthropic,
    n_queries: int = 5
) -> list[str]:
    """
    Use Claude to generate natural language queries that should retrieve
    this code chunk. This creates positive (query, chunk) pairs for
    contrastive fine-tuning.
    """
    prompt = f"""Given this {language} code:

```{language}
{code_chunk}
```

Generate {n_queries} natural language search queries that a developer
would type to find this code. Vary the queries: some should describe
what the code does, some should describe why you'd use it, some should
describe the problem it solves.

Output one query per line, no numbering, no explanation."""

    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )

    return [
        line.strip()
        for line in message.content[0].text.strip().split("\n")
        if line.strip()
    ]


def build_training_dataset(
    chunks: list[dict],
    output_path: Path,
    client: anthropic.Anthropic
) -> None:
    """
    Build a JSONL training dataset of (query, positive_chunk) pairs.
    Each line: {"query": "...", "positive": "...", "language": "..."}
    """
    import json

    with output_path.open("w") as f:
        for chunk in chunks:
            queries = generate_queries_for_chunk(
                chunk["content"],
                chunk["language"],
                client
            )
            for query in queries:
                record = {
                    "query": query,
                    "positive": chunk["content"],
                    "language": chunk["language"],
                    "chunk_id": chunk["id"]
                }
                f.write(json.dumps(record) + "\n")
```

### The Fine-Tuning Loop

```python
# fine_tuning/train_embeddings.py
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
import json
from pathlib import Path

def load_training_pairs(dataset_path: Path) -> list[InputExample]:
    examples = []
    with dataset_path.open() as f:
        for line in f:
            record = json.loads(line)
            # InputExample with two texts: query and positive code chunk
            examples.append(InputExample(
                texts=[record["query"], record["positive"]]
            ))
    return examples


def fine_tune_model(
    base_model_name: str = "microsoft/unixcoder-base",
    training_data_path: Path = Path("training_pairs.jsonl"),
    output_path: Path = Path("fine_tuned_code_embedder"),
    epochs: int = 3,
    batch_size: int = 32,
    warmup_steps: int = 100
) -> SentenceTransformer:

    model = SentenceTransformer(base_model_name)

    train_examples = load_training_pairs(training_data_path)
    train_dataloader = DataLoader(
        train_examples,
        shuffle=True,
        batch_size=batch_size
    )

    # MultipleNegativesRankingLoss implements InfoNCE:
    # treats all other examples in the batch as negatives.
    # Simple, effective, and requires only positive pairs.
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        warmup_steps=warmup_steps,
        output_path=str(output_path),
        show_progress_bar=True,
        checkpoint_save_steps=500
    )

    return model
```

> **Try This**
>
> Before investing in full fine-tuning, run a **zero-shot vs. fine-tuned comparison** on 50 hand-labeled query pairs from your actual codebase. If the base model gets MRR@10 of 0.60 or above, fine-tuning is likely to push you to 0.75+. If the base model is below 0.50, check your chunking strategy first — poor chunks hurt more than an imperfect embedding model.

---

## 2.6 Generating Embeddings in Practice

Here is a complete, production-ready embedding pipeline for a code corpus:

```python
# embeddings/embed_corpus.py
from __future__ import annotations

import hashlib
import json
import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator

import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

logger = logging.getLogger(__name__)


@dataclass
class CodeChunk:
    id: str
    content: str
    language: str
    file_path: str
    start_line: int
    end_line: int
    chunk_type: str  # "function", "class", "module", "block"
    metadata: dict = field(default_factory=dict)

    @property
    def embedding_text(self) -> str:
        """
        The text fed to the embedding model.

        Prepend a natural-language description of the chunk type and language.
        This 'primes' the model with structural context before it sees code.
        Research shows this improves retrieval by 5–15% for code-specific models.
        """
        prefix = f"{self.language} {self.chunk_type}: "
        return prefix + self.content


@dataclass
class EmbeddedChunk:
    chunk: CodeChunk
    embedding: np.ndarray


class EmbeddingPipeline:
    def __init__(
        self,
        model_name: str = "Salesforce/codet5p-110m-embedding",
        batch_size: int = 64,
        device: str = "cpu",
        cache_dir: Path | None = None
    ):
        self.model = SentenceTransformer(model_name, device=device)
        self.batch_size = batch_size
        self.cache_dir = cache_dir

        if cache_dir:
            cache_dir.mkdir(parents=True, exist_ok=True)

    def _cache_key(self, chunk: CodeChunk) -> str:
        """SHA256 of content + model name. Avoids re-embedding unchanged code."""
        content_hash = hashlib.sha256(
            (chunk.content + self.model.get_sentence_embedding_dimension().__str__()).encode()
        ).hexdigest()[:16]
        return content_hash

    def _load_from_cache(self, chunk: CodeChunk) -> np.ndarray | None:
        if not self.cache_dir:
            return None
        cache_file = self.cache_dir / f"{self._cache_key(chunk)}.npy"
        if cache_file.exists():
            return np.load(cache_file)
        return None

    def _save_to_cache(self, chunk: CodeChunk, embedding: np.ndarray) -> None:
        if not self.cache_dir:
            return
        cache_file = self.cache_dir / f"{self._cache_key(chunk)}.npy"
        np.save(cache_file, embedding)

    def embed_chunks(
        self,
        chunks: list[CodeChunk],
        show_progress: bool = True
    ) -> list[EmbeddedChunk]:
        """
        Embed a list of CodeChunks, using cache where available.
        Returns EmbeddedChunk objects with L2-normalized vectors.
        """
        # Separate cached from uncached
        results: dict[str, np.ndarray] = {}
        to_embed: list[CodeChunk] = []

        for chunk in chunks:
            cached = self._load_from_cache(chunk)
            if cached is not None:
                results[chunk.id] = cached
            else:
                to_embed.append(chunk)

        logger.info(
            f"Embedding {len(to_embed)} chunks ({len(results)} from cache)"
        )

        # Batch embed uncached chunks
        if to_embed:
            texts = [c.embedding_text for c in to_embed]

            embeddings = self.model.encode(
                texts,
                batch_size=self.batch_size,
                show_progress_bar=show_progress,
                normalize_embeddings=True,  # L2 normalize for cosine sim
                convert_to_numpy=True
            )

            for chunk, embedding in zip(to_embed, embeddings):
                results[chunk.id] = embedding
                self._save_to_cache(chunk, embedding)

        # Return in original order
        return [
            EmbeddedChunk(chunk=c, embedding=results[c.id])
            for c in chunks
        ]

    def embed_query(self, query: str) -> np.ndarray:
        """
        Embed a natural language query.
        Note: no language/type prefix for queries — the model should map
        natural language query space to code embedding space directly.
        """
        return self.model.encode(
            query,
            normalize_embeddings=True,
            convert_to_numpy=True
        )


# Usage example
if __name__ == "__main__":
    from chunkers import ASTChunker  # Chapter 3

    pipeline = EmbeddingPipeline(
        model_name="Salesforce/codet5p-110m-embedding",
        batch_size=64,
        device="cuda",
        cache_dir=Path(".embedding_cache")
    )

    # Assume chunks are CodeChunk objects from Chapter 3's chunker
    chunker = ASTChunker()
    chunks = list(chunker.chunk_directory(Path("src/")))

    embedded = pipeline.embed_chunks(chunks)
    print(f"Embedded {len(embedded)} chunks")
    print(f"Embedding dimension: {embedded[0].embedding.shape[0]}")
    print(f"Sample norm: {np.linalg.norm(embedded[0].embedding):.4f}")  # Should be 1.0
```

This pipeline handles the three details that matter in production: **caching** (avoid re-embedding unchanged code on every index rebuild), **batching** (the model's GPU utilization is order-of-magnitude higher at batch_size=64 vs. single items), and **normalization** (L2 normalization at embed time makes dot product equivalent to cosine similarity, which is what all the downstream distance calculations assume).

> **Key Insight**
>
> Always normalize embeddings at generation time, not at query time. Storing normalized vectors means your index can use fast dot product instead of computing cosine similarity (which requires normalization anyway). This is a 10–20% speedup in FAISS and Qdrant with no quality loss.

---

&nbsp;

---

# Chapter 3: Chunking Strategies for Source Code

> *"How you cut the code determines what the search system can find. A bad chunking strategy is invisible until the day your best engineer can't find the authentication bug."*

---

## 3.1 Why Chunking Is Not Trivial

A production codebase is not a collection of independent snippets. It is a graph of interdependent files, classes, functions, and statements, woven together by imports, inheritance, shared state, and calling conventions.

When you index a codebase for semantic search, you must cut this graph into pieces — **chunks** — that are:

1. **Small enough** to fit within the embedding model's context window
2. **Large enough** to be semantically self-contained
3. **Aligned to semantic boundaries** (not arbitrary character counts)
4. **Rich enough in context** that the embedding captures the right meaning
5. **Associated with enough metadata** that retrieved chunks can be acted on

These requirements are in tension. A function of 800 lines exceeds the context window of most code models. A single line is meaningless without context. A perfectly self-contained semantic unit (a full class hierarchy) may be thousands of lines.

The chunk is the atomic unit of your search system. Every downstream component — embedding quality, index size, retrieval precision, ranking — depends on how well your chunks represent coherent semantic units.

> **Warning**
>
> The single most common mistake in semantic code search implementations is using a text splitter designed for prose documents (e.g., LangChain's `RecursiveCharacterTextSplitter` with `\n\n` as separator) on source code. Prose chunking splits on paragraph breaks and sentence boundaries. Source code has neither — it has AST nodes, scopes, and semantic units defined by grammar, not whitespace. Prose chunkers produce syntactically broken chunks that embed poorly and confuse retrieval.

---

## 3.2 Naive Approaches and Their Failure Modes

Before describing what to do, let us briefly examine what not to do — and why these approaches fail in practice.

### Fixed-Size Character Chunking

Split every file into chunks of N characters (typically 500–2000), with optional overlap.

**Failure mode**: Cuts mid-function, mid-string literal, mid-comment. The embedding model cannot form coherent representations of syntactically broken code. A function signature in one chunk and its body in the next are worse than useless — they produce embeddings that represent nothing.

```python
# What fixed-size chunking does to a function:

# Chunk 1 (characters 0-500):
def process_payment(
    order_id: str,
    amount: Decimal,
    currency: str,
    customer_id: str
) -> PaymentResult:
    """Process a payment for the given order."""
    if amount <= 0:
        raise ValueError(f"Amount must be positive, got {amount}")

    customer = self.customer_repo.get(customer_id)
    if not customer.has_valid_payment_method():
        return PaymentResult.failure(
            reason="No valid payment method on file"

# Chunk 2 (characters 501-1000):
        )

    charge = self.payment_gateway.charge(
        amount=amount,
        currency=currency,
        source=customer.default_payment_method
    )
# ...and so on — the model has no idea what it is looking at
```

### Line-Count Chunking

Split every N lines, with or without overlap.

**Failure mode**: Slightly better than character chunking — at least Python's meaningful indentation prevents mid-expression cuts most of the time — but still cuts at arbitrary boundaries. A 100-line function split at line 50 produces two chunks that individually contain nothing meaningful.

### File-Level Chunking

Use the entire file as one chunk.

**Failure mode**: Files exceeding the model's context window (512–4096 tokens depending on the model) are truncated. For a 2000-line Python file, you lose everything after the first ~150 lines. More importantly, a file containing 30 different functions embeds as a blended average of all of them — you cannot retrieve "the authentication logic" from a utility file that also contains date formatting and string manipulation.

---

## 3.3 AST-Aware Chunking

The correct approach uses the **Abstract Syntax Tree** of the source file as the chunking boundary guide. The AST represents the syntactic structure of the code as a tree: functions are nodes, classes are nodes, statements are nodes, expressions are nodes.

AST-aware chunking cuts at semantically meaningful boundaries: at function definitions, class definitions, and top-level statements. It never cuts mid-expression or mid-block.

### Parsing the AST

Python's standard library provides `ast` for Python files. For other languages, `tree-sitter` provides a uniform interface to parsers for 100+ languages.

```python
# chunkers/ast_chunker.py
from __future__ import annotations

import ast
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

from .base import CodeChunk


LANGUAGE_EXTENSIONS = {
    ".py": "python",
    ".ts": "typescript", ".tsx": "typescript",
    ".js": "javascript", ".jsx": "javascript",
    ".go": "go",
    ".rs": "rust",
    ".java": "java",
    ".rb": "ruby",
    ".cs": "csharp",
    ".cpp": "cpp", ".cc": "cpp", ".cxx": "cpp",
    ".c": "c",
    ".swift": "swift",
    ".kt": "kotlin",
    ".php": "php",
    ".sh": "bash", ".bash": "bash",
}


class PythonASTChunker:
    """
    AST-aware chunker for Python source files.
    Extracts functions, classes, and module-level code as separate chunks.
    """

    def __init__(
        self,
        max_chunk_tokens: int = 400,    # Conservative: leave room for prefix
        min_chunk_lines: int = 3,        # Ignore trivial one-liners
        include_decorators: bool = True,
        include_docstrings: bool = True,
    ):
        self.max_chunk_tokens = max_chunk_tokens
        self.min_chunk_lines = min_chunk_lines
        self.include_decorators = include_decorators
        self.include_docstrings = include_docstrings

    def chunk_file(self, file_path: Path) -> list[CodeChunk]:
        source = file_path.read_text(encoding="utf-8", errors="replace")
        lines = source.splitlines()

        try:
            tree = ast.parse(source, filename=str(file_path))
        except SyntaxError as e:
            # Graceful degradation: fall back to line-based chunking
            # rather than skipping the file entirely
            return self._fallback_line_chunks(source, file_path)

        chunks = []

        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Only emit top-level functions and methods
                # (nested functions are included in their parent's chunk)
                if self._is_top_level_or_method(node, tree):
                    chunk = self._extract_function_chunk(
                        node, lines, file_path, source
                    )
                    if chunk:
                        chunks.append(chunk)

            elif isinstance(node, ast.ClassDef):
                if self._is_top_level(node, tree):
                    # Emit class-level chunk (class signature + docstring)
                    class_chunk = self._extract_class_header_chunk(
                        node, lines, file_path
                    )
                    if class_chunk:
                        chunks.append(class_chunk)

        # Also emit module-level statements (imports, constants, top-level code)
        module_chunk = self._extract_module_chunk(tree, lines, file_path)
        if module_chunk:
            chunks.append(module_chunk)

        return chunks

    def _extract_function_chunk(
        self,
        node: ast.FunctionDef | ast.AsyncFunctionDef,
        lines: list[str],
        file_path: Path,
        source: str
    ) -> CodeChunk | None:

        # Determine actual start: include decorators if present
        start_line = node.lineno
        if self.include_decorators and node.decorator_list:
            start_line = node.decorator_list[0].lineno

        end_line = node.end_lineno

        if (end_line - start_line + 1) < self.min_chunk_lines:
            return None

        # Extract source lines (1-indexed → 0-indexed)
        chunk_lines = lines[start_line - 1:end_line]
        content = "\n".join(chunk_lines)

        # If chunk is too long, use a truncated version with a note
        # (better to embed a truncated meaningful chunk than skip it)
        estimated_tokens = len(content.split()) * 1.3  # rough token estimate
        if estimated_tokens > self.max_chunk_tokens:
            content = self._smart_truncate(content, node, lines)

        chunk_id = hashlib.sha256(
            f"{file_path}:{start_line}:{end_line}:{content[:64]}".encode()
        ).hexdigest()[:16]

        return CodeChunk(
            id=chunk_id,
            content=content,
            language="python",
            file_path=str(file_path),
            start_line=start_line,
            end_line=end_line,
            chunk_type="function",
            metadata={
                "function_name": node.name,
                "is_async": isinstance(node, ast.AsyncFunctionDef),
                "has_docstring": (
                    isinstance(node.body[0], ast.Expr) and
                    isinstance(node.body[0].value, ast.Constant) and
                    isinstance(node.body[0].value.value, str)
                ) if node.body else False,
                "decorator_names": [
                    ast.unparse(d) for d in node.decorator_list
                ],
                "arg_names": [arg.arg for arg in node.args.args],
            }
        )

    def _smart_truncate(
        self,
        content: str,
        node: ast.FunctionDef | ast.AsyncFunctionDef,
        lines: list[str]
    ) -> str:
        """
        For long functions, emit: signature + docstring + first N lines + ellipsis.
        This preserves the most semantically important parts for embedding.
        """
        content_lines = content.splitlines()

        # Keep signature (lines until first non-decorator, non-def line)
        # plus docstring if present
        truncate_at = min(40, len(content_lines))
        truncated = "\n".join(content_lines[:truncate_at])
        truncated += f"\n    # ... ({len(content_lines) - truncate_at} more lines)"

        return truncated

    def _is_top_level_or_method(
        self,
        node: ast.FunctionDef | ast.AsyncFunctionDef,
        tree: ast.Module
    ) -> bool:
        """Return True if node is a top-level function or a direct class method."""
        for parent in ast.walk(tree):
            for child in ast.iter_child_nodes(parent):
                if child is node:
                    return isinstance(parent, (ast.Module, ast.ClassDef))
        return False

    def _is_top_level(self, node: ast.ClassDef, tree: ast.Module) -> bool:
        for parent in ast.walk(tree):
            for child in ast.iter_child_nodes(parent):
                if child is node:
                    return isinstance(parent, ast.Module)
        return False

    def _extract_class_header_chunk(
        self,
        node: ast.ClassDef,
        lines: list[str],
        file_path: Path
    ) -> CodeChunk | None:
        """
        Emit a chunk for the class itself: signature, docstring, and
        a summary of method signatures (not full bodies).
        This gives the embedding model enough to understand the class's
        purpose without duplicating the method-level chunks.
        """
        start_line = node.lineno

        # Build class summary: signature + docstring + method stubs
        class_sig = lines[start_line - 1]

        docstring = ""
        if (node.body and isinstance(node.body[0], ast.Expr) and
                isinstance(node.body[0].value, ast.Constant)):
            docstring = f'    """{node.body[0].value.value[:200]}"""'

        method_stubs = []
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(arg.arg for arg in child.args.args)
                stub = f"    def {child.name}({args}): ..."
                method_stubs.append(stub)

        content_parts = [class_sig]
        if docstring:
            content_parts.append(docstring)
        if method_stubs:
            content_parts.append("    # Methods:")
            content_parts.extend(method_stubs[:20])  # cap at 20 method stubs

        content = "\n".join(content_parts)

        chunk_id = hashlib.sha256(
            f"{file_path}:class:{node.lineno}:{node.name}".encode()
        ).hexdigest()[:16]

        return CodeChunk(
            id=chunk_id,
            content=content,
            language="python",
            file_path=str(file_path),
            start_line=start_line,
            end_line=node.end_lineno,
            chunk_type="class",
            metadata={
                "class_name": node.name,
                "base_classes": [ast.unparse(b) for b in node.bases],
                "method_count": len(method_stubs),
            }
        )

    def _extract_module_chunk(
        self,
        tree: ast.Module,
        lines: list[str],
        file_path: Path
    ) -> CodeChunk | None:
        """
        Emit a chunk for module-level content: imports, constants, __all__,
        and module docstring. This is how callers find the right file to look in.
        """
        module_lines = []

        for node in ast.iter_child_nodes(tree):
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                module_lines.append(lines[node.lineno - 1])
            elif isinstance(node, ast.Assign):
                # Capture top-level constants and __all__
                line = lines[node.lineno - 1]
                if any(
                    isinstance(t, ast.Name) and (
                        t.id.isupper() or t.id == "__all__"
                    )
                    for t in ast.walk(node)
                    if isinstance(t, ast.Name)
                ):
                    module_lines.append(line)

        if not module_lines:
            return None

        content = "\n".join(module_lines[:50])  # cap at 50 lines

        chunk_id = hashlib.sha256(
            f"{file_path}:module".encode()
        ).hexdigest()[:16]

        return CodeChunk(
            id=chunk_id,
            content=content,
            language="python",
            file_path=str(file_path),
            start_line=1,
            end_line=min(50, len(lines)),
            chunk_type="module",
            metadata={"file_path": str(file_path)}
        )

    def _fallback_line_chunks(
        self,
        source: str,
        file_path: Path,
        chunk_size: int = 40
    ) -> list[CodeChunk]:
        """Fallback for files that do not parse (syntax errors, encoding issues)."""
        lines = source.splitlines()
        chunks = []

        for i in range(0, len(lines), chunk_size):
            chunk_lines = lines[i:i + chunk_size]
            content = "\n".join(chunk_lines)
            chunk_id = hashlib.sha256(
                f"{file_path}:fallback:{i}".encode()
            ).hexdigest()[:16]
            chunks.append(CodeChunk(
                id=chunk_id,
                content=content,
                language="unknown",
                file_path=str(file_path),
                start_line=i + 1,
                end_line=i + len(chunk_lines),
                chunk_type="block",
                metadata={"fallback": True}
            ))

        return chunks
```

---

## 3.4 Function-Level and Class-Level Decomposition

The guiding principle for chunk granularity: **one chunk should answer one kind of question**.

A function-level chunk answers: "What does this function do? When would I call it? What does it depend on?"

A class-level chunk answers: "What is this class? What is its responsibility? What can it do?"

A module-level chunk answers: "What does this file provide? What does it import? What are its public symbols?"

These are distinct query intents, and they map to distinct chunk types. Conflating them produces chunks that embed as blends of multiple intents, degrading retrieval precision.

### When to Use Class-Level Chunks

For search purposes, emit class-level chunks as **summary stubs**, not as full class bodies. The full class body — all methods included — is better served by individual method-level chunks. The class-level chunk should capture:

- The class name and any base classes (crucial for inheritance queries)
- The class docstring (if present)
- The signatures of all methods (stubs only — no bodies)
- Any class-level constants or `__slots__`

This design means the query "find classes that implement the Observer pattern" can match the class-level chunk (which contains method stubs like `def subscribe(self, ...)`, `def notify(self, ...)`) without requiring the full bodies, which add noise.

### Handling Long Functions

Functions longer than the model's token limit (which is surprisingly common in legacy codebases) require special treatment. The options are:

| Strategy | Approach | Best For |
|---|---|---|
| Smart truncation | Signature + docstring + first N lines + `# ...` comment | Most cases; preserves the most semantically dense content |
| Sub-chunking by block | Split at inner `for`/`if`/`try` blocks; emit each as a child chunk with parent context prepended | Functions with clearly distinct logical sections |
| Summary embedding | Use an LLM to generate a natural language summary of the function; embed the summary instead of the code | Very long, complex functions (>300 lines) |
| Sliding window | Overlapping windows of N lines, each with the function signature prepended | Brute force fallback; works but produces many redundant chunks |

For most production systems, **smart truncation** plus a fallback to **sub-chunking by block** handles 95% of cases.

---

## 3.5 Sliding Window with Overlap

AST-aware chunking handles well-structured code. But real codebases contain large configuration files (YAML, JSON, Terraform), shell scripts without parseable structure, auto-generated files, and minified code.

For these, a **sliding window with overlap** is the appropriate fallback.

```python
# chunkers/sliding_window.py
from __future__ import annotations

from pathlib import Path
from typing import Iterator

from .base import CodeChunk
import hashlib


class SlidingWindowChunker:
    """
    Sliding window chunker for files that cannot be AST-parsed.
    Produces overlapping chunks of N lines with M lines of overlap.

    The overlap ensures that content near chunk boundaries is captured
    in at least one chunk with sufficient surrounding context.
    """

    def __init__(
        self,
        window_lines: int = 40,
        overlap_lines: int = 10,
        min_chunk_lines: int = 5,
    ):
        assert overlap_lines < window_lines, "Overlap must be less than window"
        self.window_lines = window_lines
        self.overlap_lines = overlap_lines
        self.min_chunk_lines = min_chunk_lines
        self.stride = window_lines - overlap_lines

    def chunk_text(
        self,
        source: str,
        file_path: Path,
        language: str = "unknown"
    ) -> list[CodeChunk]:
        lines = source.splitlines()
        chunks = []

        start = 0
        while start < len(lines):
            end = min(start + self.window_lines, len(lines))
            chunk_lines = lines[start:end]

            if len(chunk_lines) < self.min_chunk_lines:
                break

            content = "\n".join(chunk_lines)
            chunk_id = hashlib.sha256(
                f"{file_path}:sw:{start}:{end}".encode()
            ).hexdigest()[:16]

            chunks.append(CodeChunk(
                id=chunk_id,
                content=content,
                language=language,
                file_path=str(file_path),
                start_line=start + 1,
                end_line=end,
                chunk_type="block",
                metadata={
                    "chunker": "sliding_window",
                    "overlap_lines": self.overlap_lines
                }
            ))

            start += self.stride

        return chunks
```

**The overlap parameter** is more important than it appears. Without overlap, content near chunk boundaries is either under-represented (in the middle of two chunks, capturing nothing coherent) or split arbitrarily. With 10 lines of overlap on 40-line windows, every line appears in at least two chunks — one where it is near the beginning, and one where it appears in a richer surrounding context. This meaningfully improves recall for content near natural break points.

> **Key Insight**
>
> Set overlap to roughly 25% of window size. Below 15%, boundary artifacts are significant. Above 40%, you are generating too many near-duplicate chunks that inflate index size and confuse reranking without proportionally improving recall.

---

## 3.6 Metadata Enrichment

A chunk without metadata is a vector in a database. A chunk with rich metadata is a searchable, filterable, actionable result.

The minimum useful metadata for a code chunk:

```python
@dataclass
class ChunkMetadata:
    # Location
    file_path: str
    repo_name: str
    start_line: int
    end_line: int

    # Identity
    language: str
    chunk_type: str           # "function", "class", "module", "block"
    name: str | None          # Function/class name if applicable

    # Structure
    parent_class: str | None  # For methods: the containing class name
    module_path: str          # Dotted module path (e.g., "auth.validators")

    # Content signals
    has_docstring: bool
    has_type_hints: bool
    line_count: int

    # Graph signals (optional but high value)
    imports: list[str]        # Modules imported in this file
    called_functions: list[str]  # Functions called within this chunk

    # Temporal
    last_modified: str        # ISO 8601 from git blame or fs mtime
    author: str | None        # From git blame (if available)
    commit_hash: str | None   # Last commit touching this chunk
```

Metadata serves three purposes in retrieval:

1. **Pre-filtering**: "Only search functions in the `auth/` directory" or "Only search Python files" — executed before the vector search to reduce the candidate space.

2. **Post-filtering**: "Only return results where `has_docstring = True`" — applied after retrieval to cut noise.

3. **Ranking signals**: `has_docstring`, `has_type_hints`, recent `last_modified`, and `line_count` are useful features for the reranker in Chapter 6.

---

## 3.7 Choosing a Strategy: Decision Framework

```
Figure 3.1 — Chunking strategy decision tree.

Is the file parseable by an AST parser
(Python, JS/TS, Go, Rust, Java, etc.)?
        │
        ├── YES ──→ Use AST chunker
        │              │
        │              ├── Function length < 400 tokens? ──→ Full function chunk
        │              ├── Function length 400–2000 tokens? ──→ Smart truncate
        │              └── Function length > 2000 tokens? ──→ Sub-chunk by block
        │
        └── NO ───→ Is file structured data (YAML, JSON, TOML, HCL)?
                       │
                       ├── YES ──→ Use structure-aware splitter
                       │          (split on top-level keys / objects)
                       └── NO ───→ Use sliding window chunker
                                   window=40, overlap=10
```

| File Type | Strategy | Notes |
|---|---|---|
| Python `.py` | `PythonASTChunker` | Native `ast` module, no dependencies |
| TypeScript/JS | `TreeSitterChunker("javascript")` | Requires `tree-sitter-javascript` |
| Go | `TreeSitterChunker("go")` | |
| Rust | `TreeSitterChunker("rust")` | |
| Java / Kotlin | `TreeSitterChunker("java")` | |
| Terraform `.tf` | `HCLChunker` | Split on `resource`, `module`, `data` blocks |
| YAML configs | `YAMLChunker` | Split on top-level keys |
| Shell scripts | `SlidingWindowChunker` | Functions visible but parsing fragile |
| SQL | `SQLChunker` | Split on statement boundaries |
| Markdown / docs | `MarkdownChunker` | Split on H2/H3 headers |
| Minified JS | Skip or summary-only | Embedding quality is very low |

*Table 3.1 — Chunking strategy by file type.*

> **Try This**
>
> Before deploying your chunker to a full corpus, run a **chunk quality audit** on 100 randomly sampled chunks: (1) Is each chunk syntactically complete? (2) Does each chunk make sense in isolation — could you answer a question about what it does without reading anything else? (3) Is the chunk within the token budget? If more than 10% of chunks fail any criterion, fix the chunker before indexing the rest of the corpus.

---

&nbsp;

---

# Chapter 4: Building the Index

> *"A vector database is just a metric space with a fast nearest-neighbor oracle. The interesting part is how you build the oracle."*

---

## 4.1 Vector Database Landscape

Once you have a set of normalized embeddings, you need an efficient data structure for approximate nearest neighbor (ANN) search. This is the vector database's core job.

The landscape as of 2026 has consolidated around a few strong options:

### FAISS (Facebook AI Similarity Search)

A C++ library with Python bindings. FAISS is not a database — it is a library of index implementations. It has no persistence layer, no metadata storage, no query language. It is fast, flexible, and requires you to build everything around it.

**Best for**: Research, offline batch processing, embedding quality evaluation, cases where you need exact control over index internals.

**Not for**: Production serving with metadata filtering, multi-tenancy, incremental updates.

### Qdrant

A purpose-built vector database written in Rust. Native support for HNSW, scalar quantization, product quantization, payload (metadata) filtering, and sparse vector hybrid search. Excellent operational tooling (Kubernetes operator, REST + gRPC APIs, snapshot/restore).

**Best for**: Production deployments requiring metadata filtering, hybrid search, and operational reliability. This is what Pyckle runs in production.

**Not for**: Offline batch processing where FAISS's raw speed is needed.

### Milvus

The most feature-complete open-source vector database. Distributed architecture (uses Kafka and MinIO internally), multiple index types, GPU-accelerated search.

**Best for**: Very large scale (hundreds of millions of vectors), enterprise deployments, teams with existing Kafka/MinIO infrastructure.

**Not for**: Small to medium deployments — the operational complexity is significant. A three-node Milvus cluster is harder to run than a single Qdrant instance.

### ChromaDB

A developer-friendly vector database designed for rapid prototyping. Embedded mode requires zero infrastructure. HTTP mode is simple to operate.

**Best for**: Prototyping, local development, small corpora (<1M vectors), applications where ease of use beats performance.

**Not for**: Production at scale — ChromaDB's performance degrades significantly above ~500K vectors and its HNSW implementation is less tunable than Qdrant's.

### Weaviate

A "knowledge graph" vector database with native GraphQL support and a unique schema-first design. Includes built-in BM25 and hybrid search.

**Best for**: Use cases requiring rich schema semantics, graph traversal alongside vector search.

**Not for**: Pure vector search workloads where schema flexibility is a cost, not a benefit.

| Database | Language | License | Max Vectors (practical) | Metadata Filter | Hybrid Search | Ease of Ops |
|---|---|---|---|---|---|---|
| FAISS | C++/Python | MIT | 100M+ (RAM-limited) | No (DIY) | No (DIY) | Medium |
| Qdrant | Rust | Apache 2.0 | 100M+ | Yes (payload) | Yes (sparse) | High |
| Milvus | Go | Apache 2.0 | 1B+ | Yes | Yes | Low |
| ChromaDB | Python | Apache 2.0 | ~1M | Yes | Partial | Very High |
| Weaviate | Go | BSD-3 | 100M+ | Yes | Yes | Medium |
| Pinecone | — | Commercial | Unlimited | Yes | Yes | Very High |

*Table 4.1 — Vector database comparison.*

---

## 4.2 The HNSW Algorithm

**Hierarchical Navigable Small World (HNSW)** is the algorithm underlying most production vector databases today. Understanding how it works — not just how to tune it — is essential for making good index configuration decisions.

### Small World Graphs

A **small world graph** is a graph with two properties: (1) most nodes are clustered together, and (2) there exists a short path (small number of hops) between any two nodes. This is the structure of social networks, the internet, and — usefully — high-dimensional metric spaces.

In a small world graph of embedding vectors, nearby vectors (semantically similar code) are densely connected, while distant vectors have long-range "highway" edges that allow rapid navigation across the graph.

### The Hierarchical Structure

HNSW adds hierarchy to the small world graph. It builds multiple layers:

- **Layer 0**: All vectors, densely connected to nearby neighbors
- **Layer 1**: A random subset of vectors (~1/e ≈ 37% of layer 0), with longer-range connections
- **Layer 2**: A further random subset (~14% of layer 0), with even longer-range connections
- **Layer k**: Progressively sparser, with progressively longer-range connections

Search begins at the top layer (few nodes, long-range jumps) and greedily navigates toward the query, descending to finer layers as it approaches the target region.

```
Figure 4.1 — HNSW hierarchy. Each layer is a sparser view of the same
metric space, enabling fast navigation from coarse to fine.

Layer 2:  ● ─────────────────────── ●

Layer 1:  ● ─── ● ─────────── ● ─── ●

Layer 0:  ●─●─●─●─●─●─●─●─●─●─●─●─●─●
          (dense local connections + occasional long-range edges)

Search: Enter at top layer → greedy descent → exit at Layer 0
        with candidate neighbors → return top-k
```

### Key Tuning Parameters

**`M` (number of connections per node)**
Each node in the graph maintains `M` bidirectional connections to its nearest neighbors. Higher `M` means:
- Better recall (more paths to find distant good matches)
- More memory (M × 8 bytes per vector per layer)
- Slower construction (more edges to compute)

Practical range: `M=8` (speed priority) to `M=64` (quality priority). For code search, `M=16` to `M=32` is the typical sweet spot.

**`ef_construction` (search breadth during index build)**
The beam width used when building the graph. Higher `ef_construction` means the build phase explores more candidates for each edge, producing a higher-quality graph. Trade-off is build time.

Practical range: `ef_construction=100` (fast build) to `ef_construction=400` (highest quality). For code search, `ef_construction=200` is reasonable.

**`ef` (search breadth at query time)**
The beam width used during approximate nearest neighbor search. Higher `ef` means more candidates explored, higher recall, higher latency. This parameter is tunable at query time without rebuilding the index.

```python
# index/qdrant_config.py
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    VectorParams,
    HnswConfigDiff,
    OptimizersConfigDiff,
    QuantizationConfig,
    ScalarQuantizationConfig,
    ScalarType,
)

def create_code_index(
    client: QdrantClient,
    collection_name: str,
    embedding_dim: int = 768,
    m: int = 16,
    ef_construction: int = 200,
    use_quantization: bool = True
) -> None:
    """
    Create a Qdrant collection optimized for code search.

    Index configuration choices:
    - COSINE distance: appropriate for normalized embeddings
    - M=16: balance of recall and memory for code-scale corpora
    - ef_construction=200: high-quality graph, accept slower build time
    - Scalar quantization: 4x memory reduction, ~2% recall cost
    """

    quantization_config = None
    if use_quantization:
        quantization_config = QuantizationConfig(
            scalar=ScalarQuantizationConfig(
                type=ScalarType.INT8,   # 4x compression: float32 → int8
                quantile=0.99,          # Trim top 1% outliers before scaling
                always_ram=True         # Keep quantized vectors in RAM
            )
        )

    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=embedding_dim,
            distance=Distance.COSINE,
            on_disk=False  # Keep vectors in RAM for low latency
        ),
        hnsw_config=HnswConfigDiff(
            m=m,
            ef_construct=ef_construction,
            full_scan_threshold=10_000,   # Use brute force below this count
            on_disk=False
        ),
        quantization_config=quantization_config,
        optimizers_config=OptimizersConfigDiff(
            indexing_threshold=20_000,    # Start HNSW build after 20K vectors
            memmap_threshold=200_000,     # Use mmap for very large segments
        )
    )
```

---

## 4.3 Quantization: PQ, SQ, and Binary

For large corpora, raw float32 embeddings are expensive. A 10M-chunk index at 768 dimensions requires 29.4 GB of RAM for vectors alone. Quantization reduces this at a small recall cost.

### Scalar Quantization (SQ)

The simplest form. Each float32 dimension is mapped to an int8 (8-bit integer) using a learned linear scale. This reduces memory by 4× with typical recall degradation of 1–3%.

SQ is the pragmatic default for production code search systems. At Pyckle's scale, scalar quantization halves infrastructure cost with no user-visible quality difference.

### Product Quantization (PQ)

The embedding vector is split into `M` equal sub-vectors. Each sub-vector is independently quantized by mapping it to the nearest centroid in a learned codebook of 256 entries. The vector is then stored as `M` bytes (one byte per sub-vector, indexing into the codebook).

PQ achieves much higher compression than SQ (32–64× is common) at the cost of higher recall degradation (5–15%) and more complex implementation. It is appropriate for extremely large corpora where memory is the primary constraint.

### Binary Quantization (BQ)

Each float32 dimension is mapped to a single bit: 1 if positive, 0 if negative. The entire 768-dimensional vector becomes a 96-byte integer. Distance is computed via Hamming distance (XOR + popcount), which is extremely fast on modern CPUs.

Binary quantization works surprisingly well for code search because embedding models trained with contrastive loss tend to produce vectors where the sign of each dimension carries meaningful semantic information. Recall degradation varies by model (10–25% typical) but the retrieval speed improvement is 20–100× for the ANN step.

The practical recommendation: **always use scalar quantization in production** unless you are operating at 100M+ vectors. Beyond that scale, evaluate product quantization. Binary quantization is an advanced optimization for latency-constrained, extremely high-scale deployments.

---

## 4.4 Hybrid Indexes: Combining Vector and Keyword

Pure vector search has a well-known weakness: **exact term recall**. When a developer searches for `"parse_jwt_token"` — an exact function name — a semantic search system may not retrieve it if the function is named slightly differently in the corpus. Meanwhile, a keyword search would find it instantly.

The solution is **hybrid search**: combining vector similarity with BM25 keyword scoring.

Qdrant natively supports sparse vectors, which can represent BM25 term weights. The final score is a weighted combination:

```
hybrid_score = α × vector_score + (1 - α) × bm25_score
```

where `α` is a blend parameter (typically 0.7–0.9, favoring vector search for conceptual queries).

```python
# index/hybrid_search.py
from qdrant_client import QdrantClient
from qdrant_client.models import (
    NamedVector,
    NamedSparseVector,
    SparseVector,
    SearchRequest,
    FusionQuery,
    NearestQuery,
    Prefetch,
    Fusion,
)
from rank_bm25 import BM25Okapi
import numpy as np
from typing import Any


class HybridCodeIndex:
    """
    Hybrid index combining dense vector search with BM25 keyword search.
    Uses Reciprocal Rank Fusion (RRF) to combine results.
    """

    def __init__(
        self,
        client: QdrantClient,
        collection_name: str,
        embedding_pipeline,  # EmbeddingPipeline from Chapter 2
    ):
        self.client = client
        self.collection_name = collection_name
        self.embedding_pipeline = embedding_pipeline

    def search(
        self,
        query: str,
        limit: int = 20,
        filter_payload: dict | None = None,
        alpha: float = 0.7,  # Weight for vector search
        ef: int = 128,       # HNSW search breadth
    ) -> list[dict[str, Any]]:
        """
        Hybrid search using Qdrant's native fusion.
        Returns chunks sorted by fused score.
        """
        # Embed the query for dense search
        query_vector = self.embedding_pipeline.embed_query(query).tolist()

        # Use Qdrant's built-in RRF fusion over dense + sparse
        results = self.client.query_points(
            collection_name=self.collection_name,
            query=FusionQuery(
                fusion=Fusion.RRF  # Reciprocal Rank Fusion
            ),
            prefetch=[
                # Dense vector retrieval
                Prefetch(
                    query=NearestQuery(nearest=query_vector),
                    using="dense",
                    limit=limit * 3,
                    params={"hnsw_ef": ef},
                    filter=filter_payload
                ),
                # Sparse BM25 retrieval
                Prefetch(
                    query=NearestQuery(
                        nearest=self._tokenize_sparse(query)
                    ),
                    using="sparse",
                    limit=limit * 3,
                    filter=filter_payload
                ),
            ],
            limit=limit,
            with_payload=True
        )

        return [
            {
                "chunk_id": hit.id,
                "score": hit.score,
                "payload": hit.payload,
                "content": hit.payload.get("content", "")
            }
            for hit in results.points
        ]

    def _tokenize_sparse(self, text: str) -> SparseVector:
        """
        Convert query text to a sparse vector for BM25-style matching.
        Uses a simple term frequency approach compatible with Qdrant sparse vectors.
        """
        from collections import Counter
        import re

        # Tokenize: split on non-alphanumeric, lowercase
        tokens = re.findall(r"[a-zA-Z0-9_]+", text.lower())

        # Split camelCase and snake_case identifiers
        expanded_tokens = []
        for token in tokens:
            # Split snake_case
            parts = token.split("_")
            # Split camelCase
            import re as re2
            for part in parts:
                sub_parts = re2.findall(r"[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)", part)
                expanded_tokens.extend([p.lower() for p in sub_parts] if sub_parts else [part.lower()])

        term_counts = Counter(expanded_tokens)

        # Map terms to integer indices (requires a vocabulary built at index time)
        # Simplified: use hash(term) % vocab_size as index
        vocab_size = 30_000
        indices = []
        values = []

        for term, count in term_counts.items():
            idx = hash(term) % vocab_size
            indices.append(idx)
            values.append(float(count))

        return SparseVector(indices=indices, values=values)
```

---

## 4.5 Index Configuration for Code Corpora

Code corpora have distinct characteristics that should inform index configuration:

**High identifier density**: Code has many unique tokens (function names, variable names, class names). This makes BM25's term frequency component less discriminative than in prose — many terms appear only once. This pushes weight toward the vector component in hybrid search.

**Structural redundancy**: Many code chunks are structurally similar (dozens of CRUD functions with the same shape). This creates dense clusters in embedding space and requires a well-tuned `M` parameter to navigate them efficiently.

**Multi-language**: Vector distributions vary by language. Python and TypeScript embeddings cluster by language in some model spaces. If your corpus is heavily multi-language, consider separate collections per language with a federated search layer.

**Incremental update pattern**: Code changes constantly. A git commit may change 10–100 chunks in a 10M-chunk index. The index must support efficient updates (delete-by-id + insert) rather than full rebuilds.

```python
# index/incremental_update.py
from qdrant_client import QdrantClient
from qdrant_client.models import PointIdsList
import logging

logger = logging.getLogger(__name__)


def update_index_for_commit(
    client: QdrantClient,
    collection_name: str,
    changed_files: list[str],
    chunker,
    embedding_pipeline,
) -> dict[str, int]:
    """
    Update the index for files changed in a git commit.

    Strategy:
    1. Delete all existing chunks for the changed files
    2. Re-chunk the current versions of those files
    3. Embed the new chunks
    4. Insert the new chunks

    This is safe for concurrent reads: Qdrant's MVCC means ongoing
    queries see a consistent snapshot during the update.
    """
    stats = {"deleted": 0, "inserted": 0}

    for file_path in changed_files:
        # Step 1: Delete all chunks from this file
        # We stored file_path in the payload at index time
        delete_result = client.delete(
            collection_name=collection_name,
            points_selector=models.FilterSelector(
                filter=models.Filter(
                    must=[
                        models.FieldCondition(
                            key="file_path",
                            match=models.MatchValue(value=file_path)
                        )
                    ]
                )
            )
        )
        stats["deleted"] += delete_result.operation_id or 0

        # Step 2: Re-chunk and re-embed the current version
        from pathlib import Path
        path = Path(file_path)

        if not path.exists():
            logger.info(f"File deleted: {file_path} — removing from index only")
            continue

        new_chunks = chunker.chunk_file(path)
        if not new_chunks:
            continue

        embedded = embedding_pipeline.embed_chunks(new_chunks)

        # Step 3: Insert new chunks
        client.upsert(
            collection_name=collection_name,
            points=[
                models.PointStruct(
                    id=ec.chunk.id,
                    vector=ec.embedding.tolist(),
                    payload={
                        "content": ec.chunk.content,
                        "file_path": ec.chunk.file_path,
                        "language": ec.chunk.language,
                        "chunk_type": ec.chunk.chunk_type,
                        "start_line": ec.chunk.start_line,
                        "end_line": ec.chunk.end_line,
                        **ec.chunk.metadata
                    }
                )
                for ec in embedded
            ]
        )

        stats["inserted"] += len(embedded)
        logger.info(f"Updated {file_path}: {len(embedded)} chunks")

    return stats
```

---

## 4.6 Operational Concerns

### Snapshot and Restore

Always configure automated snapshots. For Qdrant:

```bash
# Create a snapshot (can be triggered via REST API or cron)
curl -X POST "http://localhost:6333/collections/code_chunks/snapshots"

# Qdrant stores snapshots at /qdrant/snapshots/
# Copy to S3/GCS for durable backup
```

### Memory Planning

For a collection of `N` vectors at `D` dimensions with `M` HNSW connections:

```
RAM (vectors, float32)     = N × D × 4 bytes
RAM (vectors, int8/SQ)     = N × D × 1 byte
RAM (HNSW graph)           = N × M × 2 × 8 bytes (bidirectional links, int64)
RAM (payload/metadata)     = N × avg_payload_bytes

Example: 5M chunks, 768D, SQ, M=16
  Vectors (SQ):    5M × 768 × 1   = 3.84 GB
  HNSW graph:      5M × 16 × 2 × 8 = 1.28 GB
  Payload (~500B): 5M × 500       = 2.50 GB
  Total:           ≈ 7.6 GB RAM
```

> **Key Insight**
>
> Plan for 2× the calculated RAM requirement. Qdrant maintains background segments, merger threads, and write-ahead logs that add significant overhead beyond the raw index data. A 7.6 GB index comfortably fits in a 16 GB server instance; do not try to fit it in 8 GB.

---

&nbsp;

---

# Chapter 5: Query Processing and Intent Understanding

> *"The query is not what the user typed. It is what the user meant — which may be different from either the words they used or the code they are hoping to find."*

---

## 5.1 Anatomy of a Code Search Query

Before building a query processing pipeline, it helps to decompose what a code search query actually is. Developers' queries fall into identifiable categories, each with different optimal processing strategies:

**Conceptual queries**: Describe what the code does without specifying implementation details.
- *"authentication middleware"*
- *"convert timezone to UTC"*
- *"retry on transient errors"*

These queries should hit the embedding path directly — they are precisely what semantic search is designed for.

**Identifier queries**: Name a specific function, class, variable, or file.
- *"UserAuthService"*
- *"parse_jwt_token"*
- *"config.database.pool_size"*

These queries should hit the keyword (BM25) path — exact matching is better than approximate semantic matching for named symbols.

**Behavioral queries**: Describe what the code does from the perspective of an observer.
- *"code that sends an email when an order ships"*
- *"error handling in the payment flow"*
- *"what happens when the session expires"*

These queries require the embedding path with strong query expansion — the user describes observable behavior, and the relevant code may use none of those words.

**Structural queries**: Describe the shape or pattern of code.
- *"functions that take more than 5 parameters"*
- *"classes that inherit from BaseModel"*
- *"async functions that don't await"*

These queries are better answered by static analysis tools (AST search, linters). A semantic search system should recognize these and route them appropriately.

**Similarity queries**: Ask for code similar to a known example.
- *"code similar to this function [paste]"*
- *"other places we do the same pattern as X"*

These queries use the embedding of the example as the query vector — the query *is* a code chunk, not a natural language description.

---

## 5.2 Intent Classification

The first step in query processing is classifying the query into one of the above categories. This drives downstream routing — which search path to use, whether to expand the query, and how to weight the hybrid search blend.

```python
# query/intent_classifier.py
from __future__ import annotations

import re
from enum import Enum
from dataclasses import dataclass


class QueryIntent(Enum):
    CONCEPTUAL = "conceptual"
    IDENTIFIER = "identifier"
    BEHAVIORAL = "behavioral"
    STRUCTURAL = "structural"
    SIMILARITY = "similarity"


@dataclass
class ClassifiedQuery:
    original: str
    intent: QueryIntent
    confidence: float
    signals: list[str]  # Which signals drove the classification


class RuleBasedIntentClassifier:
    """
    Fast rule-based intent classification.

    In production, consider augmenting with a small fine-tuned classifier
    (e.g., a distilBERT model trained on 10K labeled queries).
    Rules handle 80% of cases; the model handles edge cases.
    """

    # Patterns that strongly suggest identifier queries
    IDENTIFIER_PATTERNS = [
        r"^[A-Z][a-zA-Z0-9]+$",                  # PascalCase class name
        r"^[a-z_][a-z0-9_]*$",                    # snake_case function
        r"^[a-z_][a-z0-9_]*\.[a-z_][a-z0-9_.]*$",  # dotted path
        r"^[A-Z_][A-Z0-9_]+$",                   # CONSTANT_NAME
        r"\(\)$",                                  # Ends with () → function call
    ]

    # Structural query keywords and patterns
    STRUCTURAL_SIGNALS = [
        "functions that", "classes that", "methods that",
        "more than", "fewer than", "inherits from", "extends",
        "parameters", "arguments", "return type", "no return",
        "async functions", "synchronous", "decorator",
        "all usages", "all calls", "all imports"
    ]

    # Behavioral query signals
    BEHAVIORAL_SIGNALS = [
        "what happens when", "code that", "where does",
        "how does", "the part that", "handles", "processes",
        "when the", "after the", "before the", "triggered by"
    ]

    # Structural regex patterns
    STRUCTURAL_PATTERNS = [
        r"functions?\s+(?:that|with|taking|returning)",
        r"classes?\s+(?:that|with|inheriting|extending)",
        r"all\s+(?:usages|calls|instances|imports)\s+of",
    ]

    def classify(self, query: str) -> ClassifiedQuery:
        query_stripped = query.strip()
        query_lower = query_stripped.lower()
        signals = []

        # Check for pasted code (similarity query)
        if self._looks_like_code(query_stripped):
            return ClassifiedQuery(
                original=query,
                intent=QueryIntent.SIMILARITY,
                confidence=0.95,
                signals=["contains_code_tokens"]
            )

        # Check for identifier patterns
        for pattern in self.IDENTIFIER_PATTERNS:
            if re.match(pattern, query_stripped):
                signals.append(f"matches_identifier_pattern:{pattern[:30]}")
                return ClassifiedQuery(
                    original=query,
                    intent=QueryIntent.IDENTIFIER,
                    confidence=0.90,
                    signals=signals
                )

        # Check for structural signals
        for signal in self.STRUCTURAL_SIGNALS:
            if signal in query_lower:
                signals.append(f"structural_keyword:{signal}")

        for pattern in self.STRUCTURAL_PATTERNS:
            if re.search(pattern, query_lower):
                signals.append(f"structural_pattern:{pattern[:30]}")

        if len(signals) >= 1:
            return ClassifiedQuery(
                original=query,
                intent=QueryIntent.STRUCTURAL,
                confidence=0.80,
                signals=signals
            )

        # Check for behavioral signals
        behavioral_hits = [
            s for s in self.BEHAVIORAL_SIGNALS
            if s in query_lower
        ]
        if len(behavioral_hits) >= 1:
            for hit in behavioral_hits:
                signals.append(f"behavioral_keyword:{hit}")
            return ClassifiedQuery(
                original=query,
                intent=QueryIntent.BEHAVIORAL,
                confidence=0.75,
                signals=signals
            )

        # Default: treat as conceptual query
        return ClassifiedQuery(
            original=query,
            intent=QueryIntent.CONCEPTUAL,
            confidence=0.60,
            signals=["default_conceptual"]
        )

    def _looks_like_code(self, text: str) -> bool:
        """Heuristic: text contains code if it has syntax-like tokens."""
        if "\n" not in text:
            return False  # Single-line, probably not pasted code

        code_signals = [
            r"def\s+\w+\s*\(",      # Python function def
            r"function\s+\w+\s*\(", # JS function
            r"class\s+\w+[:\{]",    # Class definition
            r"import\s+\w+",        # Import statement
            r"\w+\s*=\s*\w+\(",     # Assignment from function call
            r"return\s+\w+",        # Return statement
            r"if\s+\w+.*:",         # If statement
        ]

        hits = sum(
            1 for pattern in code_signals
            if re.search(pattern, text)
        )
        return hits >= 2
```

---

## 5.3 Query Expansion

Query expansion enriches the original query with related terms and synonyms before embedding. This is particularly valuable for conceptual and behavioral queries where the developer's vocabulary may not overlap with the codebase's vocabulary.

There are three practical approaches:

### LLM-Based Expansion

Generate multiple alternative phrasings of the query using a language model, then either (a) embed all phrasings and use the average, or (b) embed each separately and take the union of retrieved results.

```python
# query/expander.py
from __future__ import annotations

import anthropic
import json


class LLMQueryExpander:
    """
    Expands queries using Claude to generate alternative phrasings.
    Focuses on generating phrases that match how code would be written,
    not how a developer might describe it in natural language.
    """

    def __init__(self, client: anthropic.Anthropic):
        self.client = client

    def expand(
        self,
        query: str,
        n_expansions: int = 4,
        include_code_patterns: bool = True
    ) -> list[str]:
        """
        Returns [original_query] + [n_expansion_queries].
        Always include the original — do not discard it.
        """

        prompt = f"""A developer is searching a codebase for: "{query}"

Generate {n_expansions} alternative search queries that might find the same code.
Focus on:
1. Different terminology for the same concept (e.g., "retry" vs "backoff" vs "resilience")
2. Implementation-focused descriptions (what the code does mechanically)
3. Common variable/function naming conventions for this concept
4. Related patterns that often appear with this concept

{"Also include 1-2 short code-pattern descriptions (e.g., 'try/except loop with sleep')." if include_code_patterns else ""}

Output as JSON array of strings. No explanation.
Example output: ["alternative 1", "alternative 2", "alternative 3", "alternative 4"]"""

        message = self.client.messages.create(
            model="claude-opus-4-5",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}]
        )

        try:
            expansions = json.loads(message.content[0].text.strip())
            if isinstance(expansions, list):
                return [query] + [str(e) for e in expansions[:n_expansions]]
        except (json.JSONDecodeError, IndexError, KeyError):
            pass

        # Fallback: return original query only
        return [query]

    def expand_and_embed(
        self,
        query: str,
        embedding_pipeline,
        n_expansions: int = 3
    ):
        """
        Expand the query and return the mean of all expansion embeddings.
        This produces a single query vector that represents the centroid
        of all the alternative phrasings in embedding space.
        """
        import numpy as np

        expanded_queries = self.expand(query, n_expansions=n_expansions)

        embeddings = [
            embedding_pipeline.embed_query(q)
            for q in expanded_queries
        ]

        # Mean pooling over expansion embeddings, then re-normalize
        mean_embedding = np.mean(embeddings, axis=0)
        norm = np.linalg.norm(mean_embedding)

        if norm > 0:
            mean_embedding = mean_embedding / norm

        return mean_embedding, expanded_queries
```

### Vocabulary-Based Expansion

Build a synonym dictionary from your codebase's own naming conventions — learned from co-occurrence in docstrings, comments, and git commit messages. When a query contains a term, substitute or augment it with known synonyms.

```python
# query/vocab_expander.py
from __future__ import annotations

import json
from pathlib import Path


# Hand-curated + learned synonyms for common code concepts
# In production, augment this with a co-occurrence matrix built from
# your codebase's docstrings and commit messages
CODE_SYNONYM_MAP = {
    "authenticate": ["verify credentials", "login", "auth", "check_login", "validate user"],
    "rate limit": ["throttle", "request guard", "bucket", "slow down", "quota"],
    "cache": ["memoize", "store", "buffer", "lru_cache", "redis", "ttl"],
    "retry": ["backoff", "resilience", "attempt", "jitter", "with_retry"],
    "soft delete": ["archive", "tombstone", "mark inactive", "hide", "logical delete"],
    "event": ["message", "signal", "notification", "pub/sub", "emit", "dispatch"],
    "background job": ["task", "worker", "queue", "celery", "async job", "cron"],
    "middleware": ["interceptor", "filter", "hook", "decorator", "before_action"],
    "health check": ["ping", "liveness", "readiness", "status", "heartbeat"],
    "connection pool": ["pool manager", "session factory", "db pool", "connection cache"],
    "pagination": ["cursor", "page", "offset", "limit", "next page", "scroll"],
    "serialization": ["marshal", "encode", "serialize", "dump", "to_dict", "schema"],
    "validation": ["sanitize", "clean", "check", "assert", "verify", "constraints"],
}


class VocabularyExpander:
    def __init__(self, synonym_map: dict[str, list[str]] | None = None):
        self.synonyms = synonym_map or CODE_SYNONYM_MAP
        # Build reverse index: synonym → canonical term
        self.reverse = {}
        for canonical, syns in self.synonyms.items():
            for syn in syns:
                self.reverse[syn.lower()] = canonical

    def expand(self, query: str) -> list[str]:
        """
        Return the original query plus versions with synonym substitutions.
        """
        query_lower = query.lower()
        expansions = [query]

        for canonical, synonyms in self.synonyms.items():
            if canonical in query_lower:
                for synonym in synonyms[:3]:  # Max 3 substitutions per concept
                    expanded = query_lower.replace(canonical, synonym)
                    if expanded != query_lower:
                        expansions.append(expanded)

            for synonym in synonyms:
                if synonym in query_lower:
                    # Substitute with canonical
                    expanded = query_lower.replace(synonym, canonical)
                    expansions.append(expanded)
                    # Substitute with other synonyms
                    for other_syn in synonyms[:2]:
                        if other_syn != synonym:
                            expanded2 = query_lower.replace(synonym, other_syn)
                            expansions.append(expanded2)

        # Deduplicate while preserving order
        seen = set()
        unique = []
        for e in expansions:
            if e not in seen:
                seen.add(e)
                unique.append(e)

        return unique[:6]  # Cap total expansions
```

> **Warning**
>
> LLM-based query expansion adds 200–500ms of latency. Only use it for queries classified as BEHAVIORAL or CONCEPTUAL with low confidence. For identifier queries and high-confidence conceptual queries, expansion adds noise without improving recall and makes the system feel slow. Route appropriately based on intent classification.

---

## 5.4 Embedding the Query

Query embedding uses the same model as chunk embedding, but with an important difference: the query is in natural language, while the chunks are in code. Well-trained code embedding models (CodeBERT, UniXcoder, codet5p) handle this asymmetry via a **biencoder** architecture trained explicitly on (query, code) pairs — the model learns to project both into the same embedding space where semantic similarity is preserved.

If you are using a model that was not explicitly trained on (query, code) pairs, the query-document gap can hurt retrieval quality significantly. In that case, prepending a task prefix helps:

```python
def embed_query_with_prefix(query: str, model: SentenceTransformer) -> np.ndarray:
    """
    Some models (E5, BGE) are trained with task instruction prefixes.
    For code search, use the retrieval prefix format.
    """
    # For E5-family models
    prefixed_query = f"query: {query}"

    # For BGE-family models
    # prefixed_query = f"Represent this query for searching relevant code: {query}"

    return model.encode(
        prefixed_query,
        normalize_embeddings=True,
        convert_to_numpy=True
    )
```

---

## 5.5 Rewriting and Normalization

Before expanding or embedding, apply a normalization pass to the raw query string. This handles common patterns that degrade query quality:

```python
# query/normalizer.py
from __future__ import annotations

import re


class QueryNormalizer:
    """
    Normalize a raw user query before processing.

    Handles: camelCase splitting, snake_case splitting,
    stop word removal (for keyword path), punctuation cleaning,
    and common developer shorthand.
    """

    STOP_WORDS = frozenset({
        "the", "a", "an", "in", "on", "at", "to", "for",
        "is", "it", "do", "does", "can", "how", "what", "where",
        "find", "show", "get", "look", "search", "code", "function",
        # Note: do NOT remove technical stop words like "null", "none",
        # "true", "false" — these are meaningful in code queries
    })

    SHORTHAND = {
        "db": "database",
        "auth": "authentication authorization",
        "config": "configuration",
        "util": "utility",
        "utils": "utilities",
        "fn": "function",
        "async": "asynchronous",
        "sync": "synchronous",
        "req": "request",
        "res": "response",
        "err": "error",
        "ctx": "context",
        "msg": "message",
        "str": "string",
        "int": "integer",
        "dict": "dictionary",
        "obj": "object",
    }

    def normalize(self, query: str) -> str:
        """
        Returns a normalized query suitable for embedding.
        Note: do NOT overly normalize — the embedding model handles
        most linguistic variation. Focus on expanding abbreviations
        and splitting compound identifiers.
        """
        # Strip extra whitespace
        query = " ".join(query.split())

        # Expand common shorthand
        words = query.split()
        expanded = []
        for word in words:
            word_lower = word.lower()
            if word_lower in self.SHORTHAND:
                expanded.append(self.SHORTHAND[word_lower])
            else:
                expanded.append(word)
        query = " ".join(expanded)

        # Split camelCase identifiers that appear in the query
        # (developer pasted a function name)
        query = re.sub(r"([a-z])([A-Z])", r"\1 \2", query)

        # Split snake_case identifiers
        query = query.replace("_", " ")

        # Remove trailing/leading special characters
        query = re.sub(r"^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$", "", query)

        return query.strip()

    def normalize_for_keyword_search(self, query: str) -> str:
        """
        More aggressive normalization for the BM25/keyword path.
        Removes stop words, stems (optional), lowercase.
        """
        normalized = self.normalize(query).lower()
        words = normalized.split()
        filtered = [w for w in words if w not in self.STOP_WORDS and len(w) > 1]
        return " ".join(filtered)
```

---

## 5.6 Putting It Together: A Query Processing Pipeline

The complete query processing pipeline composes all of the above:

```python
# query/pipeline.py
from __future__ import annotations

import logging
import time
from dataclasses import dataclass
from typing import Any

import numpy as np

from .intent_classifier import RuleBasedIntentClassifier, QueryIntent, ClassifiedQuery
from .normalizer import QueryNormalizer
from .expander import LLMQueryExpander, VocabularyExpander

logger = logging.getLogger(__name__)


@dataclass
class ProcessedQuery:
    original: str
    normalized: str
    intent: QueryIntent
    intent_confidence: float
    expanded_queries: list[str]
    query_vector: np.ndarray
    processing_time_ms: float
    metadata: dict[str, Any]


class QueryProcessingPipeline:
    """
    Full query processing pipeline.

    Processing flow:
    1. Normalize the raw query
    2. Classify intent
    3. Route to appropriate expansion strategy (based on intent)
    4. Embed the (possibly expanded and averaged) query
    5. Return a ProcessedQuery ready for index retrieval
    """

    def __init__(
        self,
        embedding_pipeline,
        llm_expander: LLMQueryExpander | None = None,
        use_vocab_expansion: bool = True,
        llm_expansion_threshold: float = 0.70,  # Only expand if confidence < this
    ):
        self.embedding_pipeline = embedding_pipeline
        self.normalizer = QueryNormalizer()
        self.classifier = RuleBasedIntentClassifier()
        self.llm_expander = llm_expander
        self.vocab_expander = VocabularyExpander() if use_vocab_expansion else None
        self.llm_expansion_threshold = llm_expansion_threshold

    def process(self, raw_query: str) -> ProcessedQuery:
        start = time.monotonic()
        metadata = {}

        # Step 1: Normalize
        normalized = self.normalizer.normalize(raw_query)
        metadata["normalized"] = normalized

        # Step 2: Classify intent
        classified: ClassifiedQuery = self.classifier.classify(normalized)
        metadata["intent"] = classified.intent.value
        metadata["intent_confidence"] = classified.confidence
        metadata["intent_signals"] = classified.signals

        # Step 3: Determine expansion strategy
        expanded_queries = [normalized]

        if classified.intent == QueryIntent.IDENTIFIER:
            # No expansion for identifier queries — exact match is goal
            pass

        elif classified.intent == QueryIntent.SIMILARITY:
            # Similarity queries: embed the pasted code directly
            # Normalizer should be skipped for code input
            expanded_queries = [raw_query]

        elif classified.intent in (QueryIntent.CONCEPTUAL, QueryIntent.BEHAVIORAL):
            # Vocabulary expansion always
            if self.vocab_expander:
                expanded_queries = self.vocab_expander.expand(normalized)
                metadata["vocab_expansions"] = len(expanded_queries) - 1

            # LLM expansion for low-confidence or behavioral queries
            use_llm = (
                self.llm_expander is not None and
                (
                    classified.confidence < self.llm_expansion_threshold or
                    classified.intent == QueryIntent.BEHAVIORAL
                )
            )

            if use_llm:
                try:
                    llm_expanded, _ = self.llm_expander.expand_and_embed(
                        normalized,
                        self.embedding_pipeline,
                        n_expansions=3
                    )
                    metadata["llm_expansion"] = True
                    # Return expanded vector directly
                    elapsed_ms = (time.monotonic() - start) * 1000
                    return ProcessedQuery(
                        original=raw_query,
                        normalized=normalized,
                        intent=classified.intent,
                        intent_confidence=classified.confidence,
                        expanded_queries=expanded_queries,
                        query_vector=llm_expanded,
                        processing_time_ms=elapsed_ms,
                        metadata=metadata
                    )
                except Exception as e:
                    logger.warning(f"LLM expansion failed, falling back: {e}")
                    metadata["llm_expansion_error"] = str(e)

        # Step 4: Embed (with mean pooling over expansions if applicable)
        if len(expanded_queries) == 1:
            query_vector = self.embedding_pipeline.embed_query(expanded_queries[0])
        else:
            embeddings = [
                self.embedding_pipeline.embed_query(q)
                for q in expanded_queries
            ]
            query_vector = np.mean(embeddings, axis=0)
            norm = np.linalg.norm(query_vector)
            if norm > 0:
                query_vector = query_vector / norm

        elapsed_ms = (time.monotonic() - start) * 1000
        metadata["expansion_count"] = len(expanded_queries)

        logger.debug(
            f"Query processed in {elapsed_ms:.1f}ms | "
            f"intent={classified.intent.value} ({classified.confidence:.2f}) | "
            f"expansions={len(expanded_queries)}"
        )

        return ProcessedQuery(
            original=raw_query,
            normalized=normalized,
            intent=classified.intent,
            intent_confidence=classified.confidence,
            expanded_queries=expanded_queries,
            query_vector=query_vector,
            processing_time_ms=elapsed_ms,
            metadata=metadata
        )
```

> **Try This**
>
> Log every query that returns zero results above a 0.6 similarity threshold. These are your hardest cases — the queries where your embedding model and chunking strategy have completely failed to connect the developer's vocabulary to the code. Analyzing 50 of these failures will tell you more about where to invest improvement effort than any benchmark.

> **Key Insight**
>
> The single highest-value improvement to query processing is usually not a better embedding model or a more sophisticated expander — it is better normalization of identifier-style queries. A developer who types `UserAuthService.verifyToken` will get poor results if the normalizer does not split it into `user auth service verify token`. camelCase splitting and snake_case splitting are two-line regex operations that can improve identifier query recall by 20–40%.

---

This pipeline is the form that Pyckle uses in production. Query intent routing — sending identifier queries to keyword search, conceptual queries to semantic search, behavioral queries to the expanded semantic path — was the single change that most improved the developer experience after launch. Before routing, every query went through the same path; precision on identifier lookups was poor, and conceptual queries were over-indexed with noisy expansion. Intent-based routing made each query type use the search path suited to it.

---

&nbsp;

---


# Chapter 6: Retrieval and Ranking

> *"Getting the right document into the top-10 is an embedding problem. Getting it to position #1 is a ranking problem. Most teams optimize the first and ignore the second."*

---

## 6.1 Hybrid Search: Combining BM25 and Vector Similarity

Neither keyword search nor vector search is universally superior. BM25 — the probabilistic term-weighting function that underlies Elasticsearch, Solr, and most production search engines — wins on exact-match recall: identifier names, error codes, function signatures, version strings. Vector search wins on semantic recall: concepts expressed in different vocabulary, cross-language patterns, behavioral descriptions.

The answer is to run both and fuse the results.

### BM25 in Brief

BM25 scores a document *d* for query *q* as:

```
BM25(d, q) = Σ IDF(t) * (tf(t,d) * (k1 + 1)) / (tf(t,d) + k1 * (1 - b + b * |d|/avgdl))
```

Where:
- `tf(t,d)` is term frequency of term `t` in document `d`
- `IDF(t)` is inverse document frequency — rarer terms score higher
- `k1` (typically 1.2–2.0) controls term frequency saturation
- `b` (typically 0.75) controls document length normalization
- `|d|/avgdl` is document length relative to corpus average

For code, the default BM25 parameters are a poor fit because code identifiers are short and high-frequency within their file but globally rare — the opposite of prose. Tuning `k1` lower (0.9–1.2) and `b` higher (0.85–0.95) to account for dense, short code tokens significantly improves keyword recall on code corpora.

### When Each Wins

| Query Type | BM25 | Vector | Winner |
|---|---|---|---|
| `UserAuthService.verifyToken` | High | Low | BM25 |
| "retry logic for HTTP failures" | Low | High | Vector |
| `NullPointerException in getUser` | High | Medium | BM25 |
| "where do we validate payment amounts" | Low | High | Vector |
| "rate limiter implementation" | Medium | High | Vector |
| `JWT_SECRET` | High | Low | BM25 |
| "background job that sends email" | Low | High | Vector |
| "decorator pattern for caching" | Low | High | Vector |
| `parse_iso8601_date` | High | Medium | BM25 |
| "anti-pattern: blocking call in async" | None | High | Vector |

*Table 6.1 — Query type versus retrieval strategy. BM25 dominates for exact token queries; vector search dominates for conceptual and behavioral queries.*

> **Key Insight**
>
> Do not pick one or the other. For production code search, hybrid retrieval consistently outperforms either method alone by 10–25% on MRR@10. The overhead of running both is modest — BM25 is extremely fast, and the vector search is already required. The fusion step is microseconds of CPU work.

---

## 6.2 Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion is the simplest and most robust method for combining ranked lists from different retrieval systems. Given two ranked lists of results — one from BM25, one from vector search — RRF computes a combined score for each document:

```
RRF_score(d) = Σ 1 / (k + rank_in_list(d))
```

Where `k` is a constant (typically 60) that dampens the effect of rank position. A document ranked #1 in one list and absent from the other scores `1/61 ≈ 0.0164`. A document ranked #1 in both scores `1/61 + 1/61 ≈ 0.0328`.

The elegance of RRF is that it:
- Requires no score normalization (BM25 and cosine scores are on incomparable scales)
- Is robust to outliers (a single very high BM25 score for an irrelevant exact match is dampened)
- Rewards consistent relevance across both systems

```python
# retrieval/rrf.py
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class RankedResult:
    id: str
    score: float
    source: str  # "bm25", "vector", or "fused"
    metadata: dict[str, Any]


def reciprocal_rank_fusion(
    bm25_results: list[RankedResult],
    vector_results: list[RankedResult],
    k: int = 60,
    bm25_weight: float = 1.0,
    vector_weight: float = 1.0,
) -> list[RankedResult]:
    """
    Combine BM25 and vector ranked lists using Reciprocal Rank Fusion.

    Args:
        bm25_results:   Ranked list from BM25, ordered by decreasing score.
        vector_results: Ranked list from vector search, ordered by decreasing score.
        k:              RRF damping constant. 60 is the standard default.
        bm25_weight:    Weight multiplier for BM25 contribution (tune per intent).
        vector_weight:  Weight multiplier for vector contribution (tune per intent).

    Returns:
        Fused ranked list, sorted by descending RRF score.
    """
    scores: dict[str, float] = {}
    metadata_store: dict[str, dict] = {}

    # Accumulate RRF scores from BM25 list
    for rank, result in enumerate(bm25_results, start=1):
        rrf_contribution = bm25_weight / (k + rank)
        scores[result.id] = scores.get(result.id, 0.0) + rrf_contribution
        metadata_store[result.id] = result.metadata

    # Accumulate RRF scores from vector list
    for rank, result in enumerate(vector_results, start=1):
        rrf_contribution = vector_weight / (k + rank)
        scores[result.id] = scores.get(result.id, 0.0) + rrf_contribution
        if result.id not in metadata_store:
            metadata_store[result.id] = result.metadata

    # Sort by descending RRF score
    sorted_ids = sorted(scores, key=lambda x: scores[x], reverse=True)

    return [
        RankedResult(
            id=doc_id,
            score=scores[doc_id],
            source="fused",
            metadata=metadata_store[doc_id],
        )
        for doc_id in sorted_ids
    ]
```

### Tuning the Weights

The default `bm25_weight=1.0, vector_weight=1.0` treats both lists equally. In practice, you will want to adjust this based on query intent (from Chapter 5):

```python
# retrieval/hybrid_retriever.py
from retrieval.rrf import reciprocal_rank_fusion, RankedResult
from query_processing.pipeline import QueryIntent

INTENT_WEIGHTS = {
    QueryIntent.IDENTIFIER:  {"bm25": 2.0, "vector": 0.5},  # favor exact match
    QueryIntent.SIMILARITY:  {"bm25": 0.5, "vector": 2.0},  # favor semantic
    QueryIntent.CONCEPTUAL:  {"bm25": 0.7, "vector": 1.5},  # lean semantic
    QueryIntent.BEHAVIORAL:  {"bm25": 0.3, "vector": 2.0},  # strongly semantic
    QueryIntent.DIAGNOSTIC:  {"bm25": 1.5, "vector": 1.0},  # balanced, BM25 edge
}

def hybrid_search(
    query_vector,
    query_text: str,
    intent: QueryIntent,
    vector_index,   # e.g. QdrantClient
    bm25_index,     # e.g. BM25Okapi or Elasticsearch
    top_k: int = 50,
) -> list[RankedResult]:
    weights = INTENT_WEIGHTS.get(intent, {"bm25": 1.0, "vector": 1.0})

    # Run both retrievers in parallel (or sequentially for simplicity)
    bm25_hits = bm25_index.search(query_text, top_k=top_k)
    vector_hits = vector_index.search(query_vector, top_k=top_k)

    return reciprocal_rank_fusion(
        bm25_results=bm25_hits,
        vector_results=vector_hits,
        bm25_weight=weights["bm25"],
        vector_weight=weights["vector"],
    )
```

> **Try This**
>
> Before implementing a custom fusion algorithm, test pure RRF with `k=60` and equal weights on your evaluation set (Chapter 7 covers how to build one). RRF is difficult to beat for most configurations — the literature consistently shows it matching or outperforming learned fusion models on code retrieval benchmarks. Only invest in learned fusion if you have a large labeled dataset (5,000+ queries) and RRF has clearly plateaued.

---

## 6.3 Cross-Encoder Reranking

RRF fusion gives you a combined ranked list. The top 50 results are reasonably good. But the top 10 — the results the developer actually reads — need to be better than "reasonably good."

This is the job of a **reranker**: a model that takes the query and each candidate result together and produces a relevance score, rather than scoring them independently.

### Bi-Encoder vs. Cross-Encoder

The embedding models from Chapter 2 are **bi-encoders**: they encode the query and each document independently, producing vectors that are compared by dot product. This is fast but architecturally limited — the model cannot see how the query and document interact during encoding.

A **cross-encoder** takes the concatenation of query and document as input and produces a single relevance score. Because the model sees both simultaneously, the attention mechanism can directly model the interaction between query tokens and document tokens. This produces substantially better relevance judgments.

The tradeoff is latency: you cannot pre-compute cross-encoder scores because they depend on the query. You must run the reranker at query time for every candidate. This makes cross-encoders unsuitable for first-stage retrieval over millions of documents — but perfectly suitable for reranking the top 50 candidates from the hybrid retriever.

```python
# retrieval/reranker.py
from __future__ import annotations

import numpy as np
from sentence_transformers import CrossEncoder
from retrieval.rrf import RankedResult


class CrossEncoderReranker:
    """
    Reranks a candidate set using a cross-encoder model.

    Architecture:
        1. Hybrid retriever returns top-N candidates (e.g., N=50)
        2. Cross-encoder scores each (query, candidate) pair
        3. Reranker returns top-K by cross-encoder score (e.g., K=10)

    Recommended models:
        - cross-encoder/ms-marco-MiniLM-L-6-v2  (fast, 22M params)
        - cross-encoder/ms-marco-electra-base    (higher quality, 110M params)
        - jinaai/jina-reranker-v2-base-multilingual (code + NL, recommended)
    """

    def __init__(
        self,
        model_name: str = "jinaai/jina-reranker-v2-base-multilingual",
        device: str = "cpu",
        max_length: int = 512,
    ):
        self.model = CrossEncoder(
            model_name,
            device=device,
            max_length=max_length,
        )

    def rerank(
        self,
        query: str,
        candidates: list[RankedResult],
        top_k: int = 10,
    ) -> list[RankedResult]:
        """
        Score each candidate against the query and return top_k by score.

        The cross-encoder sees:
            [CLS] query [SEP] code_chunk [SEP]

        And produces a scalar relevance score.
        """
        if not candidates:
            return []

        # Build input pairs for the cross-encoder
        pairs = [
            (query, c.metadata.get("content", ""))
            for c in candidates
        ]

        # Score all pairs in one batch
        scores = self.model.predict(
            pairs,
            batch_size=32,
            show_progress_bar=False,
            convert_to_numpy=True,
        )

        # Attach scores and sort
        scored = [
            RankedResult(
                id=c.id,
                score=float(scores[i]),
                source="reranked",
                metadata=c.metadata,
            )
            for i, c in enumerate(candidates)
        ]

        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:top_k]
```

### When to Use a Cross-Encoder

Cross-encoder reranking adds 20–80ms of latency per query (depending on model size and candidate count). This is acceptable in interactive use (developer IDE plugins, web search) and invisible in batch/offline use. The quality improvement is consistently 5–15% on MRR@10 compared to bi-encoder-only ranking.

| Configuration | MRR@10 | P99 Latency |
|---|---|---|
| Vector only (bi-encoder) | 0.62 | 28ms |
| BM25 only | 0.54 | 8ms |
| Hybrid RRF | 0.71 | 35ms |
| Hybrid RRF + cross-encoder (top 50→10) | 0.78 | 95ms |
| Hybrid RRF + cross-encoder (top 20→10) | 0.76 | 52ms |

*Table 6.2 — Retrieval quality vs. latency for different pipeline configurations. Measurements on internal Pyckle benchmark, Python corpus, 180K chunks.*

> **Warning**
>
> Do not run the cross-encoder over more than 100 candidates per query. Latency scales linearly with candidate count. The retriever (bi-encoder + BM25) already provides reasonable ranking; the cross-encoder's job is to fix the order of the top candidates, not to find needles in a haystack of 500 results.

---

## 6.4 Result Diversification

Even perfect ranking can produce a frustrating user experience if all top-10 results are from the same file or the same function's overloaded implementations. **Result diversification** ensures the result set covers distinct aspects of the query rather than clustering around a single high-confidence hit.

The classic approach is **Maximal Marginal Relevance (MMR)**:

```python
# retrieval/diversify.py
import numpy as np
from retrieval.rrf import RankedResult


def maximal_marginal_relevance(
    query_vector: np.ndarray,
    candidates: list[RankedResult],
    top_k: int = 10,
    lambda_param: float = 0.6,
) -> list[RankedResult]:
    """
    Select results that are both relevant and diverse using MMR.

    MMR score = lambda * relevance(d, q) - (1 - lambda) * max_similarity(d, selected)

    lambda_param:
        1.0 = pure relevance (no diversification)
        0.0 = pure diversity (no relevance weighting)
        0.6 = Pyckle default — good balance for code search

    Requires candidates to have their embedding vector in metadata["vector"].
    """
    selected: list[RankedResult] = []
    remaining = list(candidates)

    while len(selected) < top_k and remaining:
        # For the first selection, pick highest relevance
        if not selected:
            best = max(remaining, key=lambda r: r.score)
            selected.append(best)
            remaining.remove(best)
            continue

        # Compute MMR score for each remaining candidate
        selected_vectors = np.array([
            r.metadata["vector"] for r in selected
        ])

        mmr_scores = []
        for candidate in remaining:
            cand_vector = candidate.metadata["vector"]

            # Relevance: original retrieval score (normalized 0-1)
            relevance = candidate.score

            # Diversity: max similarity to any already-selected result
            similarities = np.dot(selected_vectors, cand_vector)
            max_sim = float(np.max(similarities))

            mmr = lambda_param * relevance - (1 - lambda_param) * max_sim
            mmr_scores.append(mmr)

        best_idx = int(np.argmax(mmr_scores))
        selected.append(remaining[best_idx])
        remaining.pop(best_idx)

    return selected
```

For most code search applications, a `lambda_param` of 0.6–0.7 is appropriate. Pure diversity (0.5 and below) often hurts because it will de-rank the second-best answer to a query in favor of something only tangentially related.

---

## 6.5 Contextual Boosting

Beyond algorithmic ranking, production search systems apply **contextual boosts** — score modifications based on signals orthogonal to textual relevance:

### Recency Boost

Recently modified files are more likely to be the active code a developer is working with. Apply a time-decay multiplier:

```python
import math
from datetime import datetime, timezone

def recency_boost(
    last_modified: datetime,
    half_life_days: float = 30.0,
    max_boost: float = 1.5,
) -> float:
    """
    Exponential decay from max_boost toward 1.0 as the file ages.
    Files modified today get max_boost; files modified 30 days ago get sqrt(max_boost).
    """
    age_days = (datetime.now(timezone.utc) - last_modified).days
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    return 1.0 + (max_boost - 1.0) * decay
```

### File Importance Boost

Some files are structurally more important — entry points, shared utilities, heavily imported modules. Estimate importance from the dependency graph:

```python
def file_importance_boost(
    import_count: int,       # how many other files import this one
    max_imports: int = 100,  # normalization cap
    max_boost: float = 1.3,
) -> float:
    """Files imported by many others are likely more relevant to broad conceptual queries."""
    normalized = min(import_count / max_imports, 1.0)
    return 1.0 + (max_boost - 1.0) * normalized
```

### Usage Frequency Boost

If you track which files developers navigate to after a search, you can boost files that were historically found useful for similar queries:

```python
def usage_frequency_boost(
    click_through_rate: float,  # fraction of queries where this result was clicked
    baseline: float = 0.1,      # expected CTR for a random result
    max_boost: float = 1.4,
) -> float:
    """Boost files that users click on more than baseline."""
    ratio = max(click_through_rate / baseline, 0.0)
    return min(1.0 + (max_boost - 1.0) * math.log1p(ratio) / math.log(10), max_boost)
```

### Assembling the Final Score

```python
def apply_contextual_boosts(
    base_score: float,
    last_modified: datetime,
    import_count: int,
    click_through_rate: float,
    weights: dict[str, float] | None = None,
) -> float:
    """
    Apply all contextual boosts multiplicatively.
    Each boost is >= 1.0, so the base score can only increase.
    """
    w = weights or {"recency": 0.4, "importance": 0.3, "usage": 0.3}

    recency = recency_boost(last_modified)
    importance = file_importance_boost(import_count)
    usage = usage_frequency_boost(click_through_rate)

    # Weighted geometric mean of boosts to prevent compounding inflation
    combined_boost = (
        recency ** w["recency"] *
        importance ** w["importance"] *
        usage ** w["usage"]
    )

    return base_score * combined_boost
```

> **Key Insight**
>
> Contextual boosts are powerful but dangerous if uncalibrated. A file that is heavily imported AND recently modified AND frequently clicked can accumulate a boost of 2.0x or more, pushing it above results that are semantically better matches. Cap your total boost multiplier at 1.8x and audit the top-boosted files periodically to ensure the boost reflects genuine relevance rather than recency bias in a recently active file.

### Ranking Strategy Comparison

| Strategy | Quality | Latency | Complexity | Best For |
|---|---|---|---|---|
| Vector-only (ANN) | Good | Very fast (5–15ms) | Low | Prototype, pure semantic queries |
| BM25-only | Moderate | Very fast (2–8ms) | Low | Identifier lookups, exact match |
| Hybrid RRF | Very good | Fast (30–45ms) | Medium | General production use |
| Hybrid + cross-encoder | Excellent | Moderate (80–120ms) | Medium-high | High-quality developer tools |
| Hybrid + CE + MMR | Excellent + diverse | Moderate (90–130ms) | High | Full-featured search products |
| Hybrid + CE + MMR + boosts | Best | Moderate + (100–150ms) | High | Personalized, production-quality |

*Table 6.3 — Ranking strategy comparison. Latency measured on 100K-chunk Python corpus, single server with 4-core CPU and 16GB RAM.*

---

&nbsp;

---

# Chapter 7: Evaluation and Quality Metrics

> *"A search system that has never been evaluated has never been proven to work. Most teams discover this at the worst possible moment."*

---

## 7.1 Why Evaluation Is Hard for Code Search

Code search evaluation is harder than web search or document retrieval for three reasons.

**No click-through data at launch.** Web search engines train on billions of user clicks. A new code search system has no usage history. You must build ground-truth judgments by hand or generate them synthetically.

**Relevance is subjective and context-dependent.** For the query "retry logic", is a database retry function relevant? It depends on whether the developer is searching for HTTP retry logic or general retry patterns. Relevance judgments are inherently context-sensitive in ways that formal metrics do not capture.

**Code has multiple correct answers.** Unlike factual QA, most code queries have multiple valid results. The function that handles payment retries is relevant, and so is the function that handles job queue retries, and so is the retry decorator. Recall is as important as precision.

Despite these challenges, rigorous evaluation is not optional. Shipping a search system without measuring quality is shipping a system you cannot improve systematically.

---

## 7.2 Offline Metrics

### Mean Reciprocal Rank (MRR)

MRR measures the position of the first relevant result averaged across all queries:

```
MRR = (1/|Q|) * Σ 1/rank_of_first_relevant_result(q)
```

If the first relevant result is at position 1, the score is 1.0. At position 2, it is 0.5. At position 5, it is 0.2. If no relevant result appears in the top-K, the score is 0.

MRR is the right metric when the developer needs one good result quickly — as in an IDE code search where the developer clicks the first relevant match and moves on.

### Recall@K

Recall@K measures the fraction of all known relevant results that appear in the top K:

```
Recall@K = |relevant ∩ top_K| / |relevant|
```

For code search, Recall@10 is typically more important than Recall@5, because developers may scroll through several results when looking for the "right" implementation. A system with MRR@1 of 0.9 but Recall@10 of 0.5 is failing half the time to surface all relevant code.

### Precision@K

Precision@K measures the fraction of top-K results that are relevant:

```
Precision@K = |relevant ∩ top_K| / K
```

This is the "signal-to-noise" metric. A developer who has to scan through 10 results to find 3 relevant ones is experiencing P@10 = 0.3, which feels like poor search quality.

### Normalized Discounted Cumulative Gain (NDCG)

NDCG extends MRR to handle graded relevance (not just binary) and rewards putting the most relevant results at the top:

```
DCG@K = Σ (2^rel_i - 1) / log2(i + 1)   for i = 1 to K
NDCG@K = DCG@K / IDCG@K
```

Where `rel_i` is the relevance score of the item at position `i` (0 = not relevant, 1 = marginally relevant, 2 = highly relevant), and IDCG is the ideal DCG (the best possible ranking). NDCG = 1.0 is perfect; 0.0 is worst possible.

For code search, graded relevance is valuable: a function that exactly solves the query should score 2, a related utility function should score 1, an unrelated false positive should score 0.

```python
# evaluation/metrics.py
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class RelevanceJudgment:
    query_id: str
    doc_id: str
    relevance: int  # 0 = not relevant, 1 = marginal, 2 = highly relevant


def mean_reciprocal_rank(
    results: dict[str, list[str]],  # query_id -> ordered list of doc_ids
    judgments: dict[str, dict[str, int]],  # query_id -> {doc_id: relevance}
    threshold: int = 1,  # minimum relevance to count as "relevant"
) -> float:
    """
    Compute MRR over all queries in the evaluation set.
    threshold=1 means any non-zero relevance counts as relevant.
    """
    reciprocal_ranks = []

    for query_id, ranked_docs in results.items():
        query_judgments = judgments.get(query_id, {})
        rr = 0.0

        for rank, doc_id in enumerate(ranked_docs, start=1):
            if query_judgments.get(doc_id, 0) >= threshold:
                rr = 1.0 / rank
                break

        reciprocal_ranks.append(rr)

    if not reciprocal_ranks:
        return 0.0

    return sum(reciprocal_ranks) / len(reciprocal_ranks)


def recall_at_k(
    results: dict[str, list[str]],
    judgments: dict[str, dict[str, int]],
    k: int = 10,
    threshold: int = 1,
) -> float:
    """
    Compute mean Recall@K across all queries.
    """
    recalls = []

    for query_id, ranked_docs in results.items():
        query_judgments = judgments.get(query_id, {})
        relevant_docs = {
            doc_id for doc_id, rel in query_judgments.items()
            if rel >= threshold
        }

        if not relevant_docs:
            continue  # skip queries with no relevant docs

        retrieved_relevant = set(ranked_docs[:k]) & relevant_docs
        recalls.append(len(retrieved_relevant) / len(relevant_docs))

    return sum(recalls) / len(recalls) if recalls else 0.0


def precision_at_k(
    results: dict[str, list[str]],
    judgments: dict[str, dict[str, int]],
    k: int = 10,
    threshold: int = 1,
) -> float:
    """
    Compute mean Precision@K across all queries.
    """
    precisions = []

    for query_id, ranked_docs in results.items():
        query_judgments = judgments.get(query_id, {})
        top_k = ranked_docs[:k]
        relevant = sum(
            1 for doc_id in top_k
            if query_judgments.get(doc_id, 0) >= threshold
        )
        precisions.append(relevant / k)

    return sum(precisions) / len(precisions) if precisions else 0.0


def ndcg_at_k(
    results: dict[str, list[str]],
    judgments: dict[str, dict[str, int]],
    k: int = 10,
) -> float:
    """
    Compute mean NDCG@K across all queries with graded relevance.
    """
    def dcg(ordered_rels: list[int], k: int) -> float:
        return sum(
            (2 ** rel - 1) / math.log2(i + 2)
            for i, rel in enumerate(ordered_rels[:k])
        )

    ndcgs = []

    for query_id, ranked_docs in results.items():
        query_judgments = judgments.get(query_id, {})
        ranked_rels = [query_judgments.get(doc_id, 0) for doc_id in ranked_docs]
        ideal_rels = sorted(query_judgments.values(), reverse=True)

        actual_dcg = dcg(ranked_rels, k)
        ideal_dcg = dcg(ideal_rels, k)

        if ideal_dcg > 0:
            ndcgs.append(actual_dcg / ideal_dcg)

    return sum(ndcgs) / len(ndcgs) if ndcgs else 0.0


def full_eval_report(
    results: dict[str, list[str]],
    judgments: dict[str, dict[str, int]],
) -> dict[str, float]:
    """Run all metrics and return a summary dict."""
    return {
        "mrr@10":     mean_reciprocal_rank(results, judgments),
        "mrr@1":      mean_reciprocal_rank(results, judgments),
        "recall@5":   recall_at_k(results, judgments, k=5),
        "recall@10":  recall_at_k(results, judgments, k=10),
        "recall@20":  recall_at_k(results, judgments, k=20),
        "p@5":        precision_at_k(results, judgments, k=5),
        "p@10":       precision_at_k(results, judgments, k=10),
        "ndcg@10":    ndcg_at_k(results, judgments, k=10),
    }
```

---

## 7.3 Building Evaluation Datasets for Code Search

An evaluation dataset consists of:
1. A set of **queries** representative of real developer searches
2. A set of **relevant documents** (code chunks) for each query, with relevance grades

### Strategy 1: Manual Labeling

The highest-quality approach. Have engineers from your team write 50–100 queries they have actually searched for, then mark which chunks in the index are relevant. One engineer-day produces ~50 well-labeled queries.

Guidelines for manual labeling:
- Include query diversity: identifiers, conceptual, behavioral, cross-file
- Use three-tier relevance: 0 (not relevant), 1 (useful context), 2 (directly answers query)
- Have two labelers per query and compute inter-annotator agreement (κ > 0.6 is acceptable)
- Include "hard negatives" — documents that superficially match but are not relevant

### Strategy 2: Synthetic Generation with LLMs

Generate evaluation queries from your index automatically:

```python
# evaluation/generate_eval_set.py
import anthropic
import json
import random
from pathlib import Path


def generate_eval_queries(
    chunks: list[dict],
    n_queries: int = 200,
    client: anthropic.Anthropic = None,
    output_path: Path = Path("eval_set.jsonl"),
) -> None:
    """
    Generate an evaluation set by:
    1. Sampling chunks from the index
    2. Having an LLM generate queries that would retrieve each chunk
    3. Having the LLM rate relevance of nearby chunks (hard negatives)
    """
    client = client or anthropic.Anthropic()
    sampled = random.sample(chunks, min(n_queries, len(chunks)))

    with output_path.open("w") as f:
        for chunk in sampled:
            prompt = f"""Given this code chunk:

```{chunk['language']}
{chunk['content'][:800]}
```

1. Write ONE natural language query that a developer would use to find this code.
   The query should sound like something typed into a search box, not a question.
2. Rate this chunk's relevance to that query: 0, 1, or 2
   (0=not relevant, 1=useful context, 2=directly answers the query)

Output JSON:
{{"query": "...", "chunk_id": "{chunk['id']}", "relevance": 2}}"""

            response = client.messages.create(
                model="claude-opus-4-5",
                max_tokens=256,
                messages=[{"role": "user", "content": prompt}],
            )

            try:
                record = json.loads(response.content[0].text.strip())
                record["golden_chunk_id"] = chunk["id"]
                f.write(json.dumps(record) + "\n")
            except json.JSONDecodeError:
                continue  # skip malformed responses
```

### Strategy 3: Mining Commit History

Git commit messages often describe what code was changed and why — exactly the kind of query a developer would search for:

```python
# evaluation/mine_commits.py
import subprocess
import json
from pathlib import Path


def mine_commit_pairs(
    repo_path: Path,
    output_path: Path,
    n_commits: int = 500,
) -> None:
    """
    Extract (commit_message, changed_files) pairs as evaluation candidates.
    A commit message like "fix: handle None in get_user_by_email" is
    a natural query; the changed function is the relevant result.
    """
    result = subprocess.run(
        ["git", "log", f"-{n_commits}", "--name-only", "--pretty=format:%H|%s"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )

    records = []
    for block in result.stdout.strip().split("\n\n"):
        lines = block.strip().split("\n")
        if not lines:
            continue
        header = lines[0]
        files = [l for l in lines[1:] if l.strip()]

        if "|" not in header:
            continue
        commit_hash, message = header.split("|", 1)

        # Filter for meaningful messages (not "WIP", not merge commits)
        if len(message) < 10 or message.lower().startswith("merge"):
            continue

        records.append({
            "query": message,
            "commit": commit_hash,
            "changed_files": files,
        })

    with output_path.open("w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    print(f"Mined {len(records)} commit pairs to {output_path}")
```

> **Key Insight**
>
> Commit history mining is underutilized as an evaluation strategy. A codebase with 1,000 meaningful commit messages gives you 1,000 (query, relevant_file) pairs for free. The queries are imperfect (commit messages are not always search-query-shaped) but the signal is real — a developer wrote the commit message immediately after making the relevant change, which means the mapping is ground truth.

---

## 7.4 A/B Testing Search Quality

Once your system is live and has users, you can move from offline evaluation to online A/B testing. The primary online metric for code search is **click-through rate at rank position** — specifically, whether users click result #1 more than result #3 or #5.

The correct A/B test framework for search is an **interleaving experiment**, not a split test. In a split test, user A gets system A and user B gets system B. In an interleaving experiment, a single user sees results interleaved from both systems, with positions randomized. The system whose results are clicked more wins.

Interleaving is preferred because:
- It controls for user-level variance (different developers have different search habits)
- It requires fewer users to achieve statistical significance (often 5–10x fewer)
- It directly measures preference, not just usage volume

```python
# evaluation/interleaving.py
import random
from dataclasses import dataclass


@dataclass
class InterleavedResult:
    doc_id: str
    position: int
    source: str  # "system_a" or "system_b"


def team_draft_interleaving(
    results_a: list[str],  # doc_ids from system A, ordered by rank
    results_b: list[str],  # doc_ids from system B, ordered by rank
    max_results: int = 10,
) -> list[InterleavedResult]:
    """
    Team Draft Interleaving: alternate picks from each system.
    The system that picks first is randomized at each query to prevent position bias.
    """
    interleaved = []
    seen = set()
    iters = {
        "system_a": iter(results_a),
        "system_b": iter(results_b),
    }

    # Randomly determine who picks first
    order = ["system_a", "system_b"]
    if random.random() < 0.5:
        order = ["system_b", "system_a"]

    position = 1
    while len(interleaved) < max_results:
        made_progress = False
        for system in order:
            try:
                doc_id = next(iters[system])
                while doc_id in seen:
                    doc_id = next(iters[system])
                seen.add(doc_id)
                interleaved.append(InterleavedResult(
                    doc_id=doc_id,
                    position=position,
                    source=system,
                ))
                position += 1
                made_progress = True
            except StopIteration:
                continue

        if not made_progress:
            break

    return interleaved
```

---

## 7.5 Offline vs. Online Evaluation

| Dimension | Offline | Online |
|---|---|---|
| Data source | Hand-labeled or synthetic | Real user clicks |
| Time to results | Hours (run evaluation suite) | Days to weeks (accumulate clicks) |
| Cost | Engineer time to build dataset | Infrastructure for experiment logging |
| Noise | Low (judgments are controlled) | High (user intent is noisy) |
| Validity | Tests what you define as relevant | Tests what users actually want |
| Iteration speed | Fast | Slow |
| Required for | Every change | Significant ranking changes |

*Table 7.1 — Offline vs. online evaluation: when to use each.*

The standard practice is to use offline evaluation for rapid iteration during development — every model change, every parameter tweak — and online A/B testing for deployment decisions on significant ranking changes.

### Benchmark Results

The following table shows Pyckle's internal benchmark results at different pipeline stages, on a Python corpus of 180K chunks (12M lines of open-source code):

| Pipeline Stage | MRR@10 | Recall@10 | P@10 | NDCG@10 |
|---|---|---|---|---|
| BM25 baseline | 0.541 | 0.612 | 0.387 | 0.558 |
| Vector only (codet5p-110m) | 0.624 | 0.701 | 0.443 | 0.638 |
| Hybrid RRF | 0.714 | 0.798 | 0.521 | 0.726 |
| + Cross-encoder reranking | 0.783 | 0.821 | 0.589 | 0.794 |
| + Contextual boosts | 0.791 | 0.829 | 0.601 | 0.803 |
| + Intent-based weight tuning | 0.812 | 0.847 | 0.618 | 0.821 |

*Table 7.2 — Cumulative quality improvement at each pipeline stage. Evaluation set: 412 hand-labeled queries, Python corpus.*

> **Try This**
>
> Before running a full evaluation pipeline, do a quick **manual spot check**: pick 20 queries from your own recent search history, run them through your system, and grade the top 5 results for each as relevant or not. This takes 30 minutes and immediately reveals the qualitative failures that metrics sometimes obscure. If you see the same failure mode repeatedly — say, the system always surfacing test files instead of implementation files — that is a data quality issue you can fix before spending time on metric computation.

---

&nbsp;

---

# Chapter 8: Scaling to Millions of Files

> *"A search system that works on 10,000 files may not work on 1,000,000. The failure modes are different, the bottlenecks are different, and the solutions are different. Plan for scale before you hit it."*

---

## 8.1 Understanding the Scale Cliff

There is a qualitative difference between indexing tens of thousands of files and indexing millions. It is not just "more of the same" — the architecture must change.

Consider the numbers. One million Python files, averaging 500 lines each, contain roughly 500 million lines of code. At a conservative two chunks per function and one function per 20 lines, that is 25 million chunks. At 768 float32 dimensions per chunk, the raw embedding matrix alone requires:

```
25,000,000 × 768 × 4 bytes = 76.8 GB
```

That does not fit in memory on a single server. You need to shard it, quantize it, or both.

Beyond memory, the HNSW index construction time for 25M vectors at a graph degree of 32 is roughly 8 hours on a single GPU. Incremental index updates that take 200ms at 100K chunks take 4 seconds at 25M chunks. BM25 index scans that complete in 10ms at 100K documents take 2.5 seconds at 25M — unless you shard.

This chapter covers the architectural patterns that make scale manageable.

---

## 8.2 Sharding Strategies

Sharding divides the index across multiple nodes or partitions. The two primary strategies for semantic search are **horizontal sharding** and **semantic sharding**.

### Horizontal (Hash) Sharding

Assign chunks to shards by a hash of their ID:

```python
# scaling/shard_router.py
import hashlib
from dataclasses import dataclass
from typing import Any


@dataclass
class Shard:
    shard_id: int
    host: str
    port: int
    collection: str


class ShardRouter:
    """
    Routes document operations and queries to the correct shard(s).
    Uses consistent hashing to minimize redistribution on shard count change.
    """

    def __init__(self, shards: list[Shard], replication_factor: int = 2):
        self.shards = sorted(shards, key=lambda s: s.shard_id)
        self.n_shards = len(shards)
        self.replication_factor = replication_factor

    def shard_for_doc(self, doc_id: str) -> list[Shard]:
        """
        Return the shard(s) responsible for this document.
        With replication_factor > 1, returns multiple shards.
        """
        hash_val = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16)
        primary_idx = hash_val % self.n_shards

        shards = []
        for i in range(self.replication_factor):
            shard_idx = (primary_idx + i) % self.n_shards
            shards.append(self.shards[shard_idx])

        return shards

    def shards_for_query(self) -> list[Shard]:
        """
        For a query, fan out to all shards (or a subset with sampling).
        Returns the primary shard for each range to avoid duplicate reads.
        """
        # Fan-out: query all shards, merge results
        return self.shards
```

At query time, the search fans out to all shards, each returns its top-K results, and the coordinator merges and re-ranks them:

```python
# scaling/distributed_search.py
import asyncio
from collections.abc import Coroutine


async def distributed_search(
    query_vector,
    query_text: str,
    shard_clients: list,  # one client per shard
    top_k: int = 50,
) -> list[dict]:
    """
    Fan out search to all shards concurrently and merge results.
    Each shard returns its top_k; the coordinator picks the global top_k.
    """
    # Fire all shard queries concurrently
    tasks = [
        client.search(
            query_vector=query_vector,
            query_text=query_text,
            limit=top_k,
        )
        for client in shard_clients
    ]
    shard_results = await asyncio.gather(*tasks, return_exceptions=True)

    # Merge, filtering out exceptions (failed shards degrade gracefully)
    all_results = []
    for results in shard_results:
        if isinstance(results, Exception):
            continue  # log and continue — partial results are better than failure
        all_results.extend(results)

    # Sort by score and return top_k
    all_results.sort(key=lambda r: r["score"], reverse=True)
    return all_results[:top_k]
```

### Semantic Sharding

Assign chunks to shards based on their embedding — chunks with similar embeddings go to the same shard. The advantage is that a query can often be routed to one or two shards instead of all of them, dramatically reducing fan-out at query time.

The challenge is that semantic sharding requires periodic rebalancing as the embedding distribution shifts. It also creates hot shards if some semantic clusters are much larger than others (e.g., if you have far more authentication-related code than logging code).

**Recommendation**: Use horizontal sharding for the first 50M chunks. Only consider semantic sharding if you have profiling data showing that fan-out latency is the bottleneck and you have engineering capacity to handle rebalancing.

---

## 8.3 Incremental Indexing

Re-indexing from scratch when a file changes is acceptable at small scale. At large scale, it is catastrophically expensive. You need an incremental indexing pipeline.

The core insight: track the content hash of every indexed chunk. On file change, re-chunk only the changed file, compute the hash of each new chunk, and upsert only the chunks whose hash changed.

```python
# scaling/incremental_indexer.py
from __future__ import annotations

import hashlib
import json
import sqlite3
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class ChunkFingerprint:
    chunk_id: str
    file_path: str
    content_hash: str
    indexed_at: float


class IncrementalIndexer:
    """
    Tracks which chunks are in the index and their content hashes.
    Only re-indexes chunks that have changed since last indexing.
    """

    def __init__(self, db_path: Path = Path(".index_state.db")):
        self.conn = sqlite3.connect(db_path)
        self._init_db()

    def _init_db(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS chunk_fingerprints (
                chunk_id     TEXT PRIMARY KEY,
                file_path    TEXT NOT NULL,
                content_hash TEXT NOT NULL,
                indexed_at   REAL NOT NULL
            )
        """)
        self.conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_file_path ON chunk_fingerprints(file_path)"
        )
        self.conn.commit()

    def content_hash(self, content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()

    def needs_indexing(self, chunk_id: str, content: str) -> bool:
        """Returns True if this chunk is new or its content has changed."""
        new_hash = self.content_hash(content)
        row = self.conn.execute(
            "SELECT content_hash FROM chunk_fingerprints WHERE chunk_id = ?",
            (chunk_id,),
        ).fetchone()

        if row is None:
            return True  # new chunk

        return row[0] != new_hash  # changed chunk

    def chunks_to_delete(self, file_path: str, current_chunk_ids: set[str]) -> set[str]:
        """Returns chunk IDs that were previously indexed but are no longer present."""
        rows = self.conn.execute(
            "SELECT chunk_id FROM chunk_fingerprints WHERE file_path = ?",
            (file_path,),
        ).fetchall()

        previously_indexed = {row[0] for row in rows}
        return previously_indexed - current_chunk_ids

    def mark_indexed(self, chunk_id: str, file_path: str, content: str) -> None:
        import time
        self.conn.execute(
            """INSERT OR REPLACE INTO chunk_fingerprints
               (chunk_id, file_path, content_hash, indexed_at)
               VALUES (?, ?, ?, ?)""",
            (chunk_id, file_path, self.content_hash(content), time.time()),
        )
        self.conn.commit()

    def mark_deleted(self, chunk_ids: set[str]) -> None:
        self.conn.executemany(
            "DELETE FROM chunk_fingerprints WHERE chunk_id = ?",
            [(cid,) for cid in chunk_ids],
        )
        self.conn.commit()
```

At scale, the change detection step should run on a **file watcher** or **git hook**, not a polling loop. Processing changes at commit time means the index is at most one commit behind — acceptable for most code search applications.

> **Warning**
>
> Incremental indexing has a subtle failure mode: if the indexer crashes after writing the embedding to the vector store but before recording the fingerprint in the state database, the chunk appears indexed but the state tracker does not know it. On the next run, it will be re-indexed with a new ID, creating a duplicate. Solve this with idempotent upserts in the vector store (upsert by chunk hash, not a random UUID) and eventual consistency checks that periodically scan for orphaned vectors.

---

## 8.4 Distributed Search Architecture

A production-scale semantic search system at millions of files requires multiple layers:

```
Figure 8.1 — Distributed search architecture for 10M+ file corpora.

                      ┌─────────────────────┐
                      │    Query API Layer   │  ← FastAPI, load-balanced
                      │  (query processing,  │
                      │  intent detection,   │
                      │  result merging)     │
                      └──────────┬──────────┘
                                 │ fan-out
              ┌──────────────────┼──────────────────┐
              ▼                  ▼                   ▼
    ┌──────────────────┐  ┌──────────────────┐  ┌──────────────┐
    │   Vector Shard 1 │  │  Vector Shard 2  │  │  BM25 Index  │
    │   (HNSW, 8M vecs)│  │  (HNSW, 8M vecs) │  │ (Elasticsearch│
    │   Qdrant node    │  │  Qdrant node     │  │  or Typesense)│
    └──────────────────┘  └──────────────────┘  └──────────────┘
              │                  │                     │
              └──────────────────┴─────────────────────┘
                                 │ chunk IDs → metadata lookup
                      ┌──────────▼──────────┐
                      │    Metadata Store   │  ← PostgreSQL or Redis
                      │  (file paths, types,│
                      │   git metadata,     │
                      │   usage signals)    │
                      └─────────────────────┘
```

Key architectural decisions:
- **Store embeddings separately from metadata.** The vector index only needs IDs and vectors. Chunk content, file paths, and git metadata live in a relational store and are fetched by ID after retrieval.
- **Keep BM25 in a dedicated engine.** Do not try to run BM25 inside your vector database. Elasticsearch or Typesense handles BM25 much better than ad-hoc implementations in Qdrant.
- **The query API layer is stateless.** All state lives in the storage layer, allowing horizontal scaling of the query layer.

---

## 8.5 Cache Layers

Three cache layers dramatically reduce load at scale:

**Query result cache**: Cache the full result set for a query for 5–60 minutes. Code does not change that fast. Even a 60-second cache reduces embedding inference by 50–70% for codebases with repetitive query patterns (everyone asking "authentication middleware" gets the same result).

```python
# scaling/query_cache.py
import hashlib
import json
import time
from functools import wraps
from typing import Any, Callable

import redis


class QueryCache:
    def __init__(
        self,
        redis_client: redis.Redis,
        default_ttl: int = 300,  # 5 minutes
    ):
        self.redis = redis_client
        self.default_ttl = default_ttl

    def cache_key(self, query: str, intent: str, top_k: int) -> str:
        payload = json.dumps({"q": query, "intent": intent, "k": top_k}, sort_keys=True)
        return f"search:{hashlib.sha256(payload.encode()).hexdigest()[:20]}"

    def get(self, query: str, intent: str, top_k: int) -> list[dict] | None:
        key = self.cache_key(query, intent, top_k)
        value = self.redis.get(key)
        if value is None:
            return None
        return json.loads(value)

    def set(
        self,
        query: str,
        intent: str,
        top_k: int,
        results: list[dict],
        ttl: int | None = None,
    ) -> None:
        key = self.cache_key(query, intent, top_k)
        self.redis.setex(
            key,
            ttl or self.default_ttl,
            json.dumps(results),
        )
```

**Embedding cache**: Cache query embeddings keyed by query text. Query embedding is the cheapest step to cache and the most frequently repeated — developers often run the same search multiple times in a session.

**Metadata cache**: Cache file metadata (path, language, last modified, import count) in Redis with a 1-hour TTL. Metadata rarely changes and is fetched on every result returned.

---

## 8.6 Latency Budgets and Optimization

At scale, every millisecond is a policy decision. A 200ms search is acceptable in a web UI; it is not acceptable in an IDE that fires on every keystroke. Define your latency budget before optimizing.

| Use Case | P50 Target | P99 Target |
|---|---|---|
| IDE autocomplete | 20ms | 60ms |
| IDE search (triggered) | 80ms | 200ms |
| Web search UI | 150ms | 400ms |
| API (batch ingestion) | 500ms | 2s |
| Background analytics | 2s | 10s |

*Table 8.1 — Latency budget by use case.*

Against these targets, map your pipeline components:

| Component | Typical Time | Optimization |
|---|---|---|
| Query normalization | 1–3ms | In-memory regex, no I/O |
| Intent classification | 2–8ms | Lightweight classifier or rule-based |
| Query embedding | 8–30ms | GPU, batch where possible, cache misses only |
| BM25 search | 2–20ms | Dedicated index, SSD-backed |
| Vector ANN search | 10–40ms | HNSW ef=64, appropriate quantization |
| RRF fusion | <1ms | Pure Python, no I/O |
| Cross-encoder reranking | 20–80ms | GPU, limit to top-20 candidates |
| Metadata fetch | 1–5ms | Redis cache, key-value lookup |
| **Total (IDE triggered)** | **44–187ms** | |

*Table 8.2 — Per-component latency breakdown for full pipeline.*

For the IDE use case (P99 < 200ms), cross-encoder reranking may need to be dropped or made asynchronous — kick off a background rerank that updates the result list 100ms later, improving quality without blocking the initial display.

### Real-World Scale Numbers

Based on production deployments and public case studies:

| Scale | Index Size | Indexing Time | Query Latency | Infrastructure |
|---|---|---|---|---|
| 100K chunks | 300MB | 45min (CPU) | 15ms | Single server, 8GB RAM |
| 1M chunks | 3GB | 8h (CPU) / 2h (GPU) | 25ms | Single server, 32GB RAM |
| 10M chunks | 30GB | ~20h (GPU) | 40ms | 2 vector shards, 32GB each |
| 100M chunks | 300GB | ~200h (GPU cluster) | 80ms | 20+ shards, distributed |

*Table 8.3 — Infrastructure requirements by corpus scale. Assumes codet5p-110m-embedding (256-dim), float32, no quantization.*

> **Key Insight**
>
> INT8 quantization reduces the 30GB index for 10M chunks to ~8GB with less than 3% quality degradation. Binary quantization reduces it to ~1GB but degrades quality by 8–15%. For most production deployments, INT8 quantization is a free optimization — apply it before you need it, not after you have already run out of memory.

---

&nbsp;

---

# Chapter 9: The Pyckle Approach

> *"We did not set out to build a semantic search company. We set out to solve the problem of an AI coding assistant that had no idea what was in the codebase it was supposed to help with. Everything in this book is what we learned trying to solve that problem for real."*

---

## 9.1 Design Philosophy

Pyckle started with three constraints that shaped every architectural decision in this book:

**Local-first.** The codebases that need the best search are the ones developers are most protective of. A financial institution's proprietary trading system, a startup's unreleased product, a security tool's implementation — none of these can be sent to a third-party API for indexing. Pyckle is designed to run entirely on your infrastructure, with no data leaving your network.

**Developer-experience first.** Semantic search is only valuable if it integrates seamlessly into where developers already work. We built the IDE plugins before the dashboard. We optimized query latency before we optimized throughput. A technically superior search system that requires a context switch to use will lose to a mediocre one that is right there in the editor.

**Minimal footprint.** An AI tool that requires a dedicated GPU server to try is a tool that most developers will never try. Pyckle is designed to run on the machine that already runs your development environment. The full stack — embedding model, vector index, BM25 index, query API — fits in 4GB of RAM and runs on CPU for corpora up to 500K chunks.

These constraints ruled out some architectures (centralized cloud indexing, heavy GPU requirements, Kubernetes-native deployments) and pointed toward others (self-contained binaries, incremental indexing, quantized local models, the MCP integration pattern described below).

---

## 9.2 Architecture Overview

The Pyckle architecture has four layers:

```
Figure 9.1 — Pyckle system architecture.

  ┌─────────────────────────────────────────────────────────────────┐
  │                        Developer Layer                          │
  │    VS Code Extension   │   JetBrains Plugin   │   CLI Tool      │
  │    (MCP client)        │   (MCP client)       │   (direct API)  │
  └────────────────────────┬────────────────────────────────────────┘
                           │ MCP protocol (local stdio / HTTP)
  ┌────────────────────────▼────────────────────────────────────────┐
  │                     MCP Integration Layer                       │
  │   tool: search_code   │   tool: get_context   │  tool: explain  │
  │   (intent detection,  │   (retrieve by file,  │  (LLM + search  │
  │    query processing)  │    symbol, or range)  │   composition)  │
  └────────────────────────┬────────────────────────────────────────┘
                           │ FastAPI HTTP (localhost)
  ┌────────────────────────▼────────────────────────────────────────┐
  │                      Search Engine Layer                        │
  │   Embedding pipeline  │  Hybrid retriever     │  Reranker       │
  │   (codet5p-110m,      │  (Qdrant + BM25Okapi) │  (cross-encoder │
  │    cached, batched)   │                       │   on top-20)    │
  └────────────────────────┬────────────────────────────────────────┘
                           │
  ┌────────────────────────▼────────────────────────────────────────┐
  │                       Index Storage Layer                       │
  │   Qdrant (local,      │  SQLite BM25          │  Metadata DB    │
  │   embedded mode)      │  (tantivy via Python) │  (SQLite)       │
  └─────────────────────────────────────────────────────────────────┘
```

Everything in the bottom three layers runs locally. The developer layer communicates over either MCP stdio (for AI coding assistants that support MCP) or a local HTTP API.

### The MCP Integration Layer

The Model Context Protocol (MCP) is the integration point that makes Pyckle useful to AI coding assistants. An AI assistant with an MCP client can call Pyckle's search tools the same way it calls file system tools — as structured function calls with typed inputs and outputs.

The three primary MCP tools Pyckle exposes:

```python
# mcp_server/tools.py
from mcp.server import Server
from mcp.types import Tool, TextContent
import mcp.types as types

server = Server("pyckle")


@server.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="search_code",
            description=(
                "Search the codebase for code relevant to a query. "
                "Returns ranked code chunks with file paths, line ranges, "
                "and relevance scores. Use this when you need to find "
                "where a concept, pattern, or functionality is implemented."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Natural language description of what to find",
                    },
                    "top_k": {
                        "type": "integer",
                        "default": 10,
                        "description": "Number of results to return",
                    },
                    "language": {
                        "type": "string",
                        "description": "Filter to a specific language (optional)",
                    },
                },
                "required": ["query"],
            },
        ),
        Tool(
            name="get_context",
            description=(
                "Retrieve the full content of a specific code chunk "
                "by file path and optional line range. Use this to "
                "read code that search_code identified as relevant."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "file_path": {"type": "string"},
                    "start_line": {"type": "integer"},
                    "end_line": {"type": "integer"},
                },
                "required": ["file_path"],
            },
        ),
        Tool(
            name="explain_symbol",
            description=(
                "Find all usages and definitions of a symbol name "
                "across the codebase. Returns definitions, call sites, "
                "and related symbols."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "symbol": {"type": "string"},
                    "include_tests": {"type": "boolean", "default": False},
                },
                "required": ["symbol"],
            },
        ),
    ]
```

---

## 9.3 Context Routing

Context routing is the component that decides — based on the AI assistant's question and the current file — what additional codebase context to fetch before the assistant generates a response.

The key insight is that an AI assistant's information need is almost always broader than what it explicitly requests. If it asks for the implementation of `process_payment`, it almost certainly also needs:
- The `PaymentResult` type definition
- The `customer_repo` and `payment_gateway` dependencies
- The relevant configuration values
- Any existing tests for this function

Context routing makes these implicit needs explicit, fetching the surrounding context automatically.

```python
# context_routing/router.py
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ContextNeed(Enum):
    DEFINITION = "definition"        # Find where this symbol is defined
    DEPENDENCIES = "dependencies"    # Find what this code depends on
    USAGES = "usages"               # Find where this is called/used
    TESTS = "tests"                 # Find tests for this code
    SIMILAR = "similar"             # Find similar patterns in the codebase
    CONFIG = "config"               # Find related configuration


@dataclass
class RoutedContext:
    primary_chunks: list[dict]       # The directly requested code
    dependency_chunks: list[dict]    # Imports, type definitions, etc.
    test_chunks: list[dict]          # Relevant tests
    config_chunks: list[dict]        # Related configuration
    total_tokens: int                # Estimate for LLM context window planning


class ContextRouter:
    """
    Given a primary search result and an AI assistant's task,
    build a complete context package for the assistant.
    """

    def __init__(self, search_engine, max_context_tokens: int = 8192):
        self.search = search_engine
        self.max_tokens = max_context_tokens

    def route(
        self,
        primary_chunks: list[dict],
        task_description: str,
        include_tests: bool = True,
        include_dependencies: bool = True,
    ) -> RoutedContext:
        """
        Build a complete context package by fetching related chunks.
        """
        # Extract symbols and file paths from primary results
        primary_symbols = self._extract_symbols(primary_chunks)
        primary_files = {c["file_path"] for c in primary_chunks}

        dep_chunks = []
        test_chunks = []
        config_chunks = []

        # Fetch type definitions and imports for extracted symbols
        if include_dependencies:
            for symbol in primary_symbols[:5]:  # limit to top 5 symbols
                deps = self.search.search(
                    f"definition of {symbol}",
                    top_k=2,
                    exclude_files=primary_files,
                )
                dep_chunks.extend(deps)

        # Fetch test files for primary files
        if include_tests:
            for file_path in list(primary_files)[:3]:
                base_name = file_path.split("/")[-1].replace(".py", "")
                tests = self.search.search(
                    f"test {base_name}",
                    top_k=2,
                    file_glob="**/test_*.py",
                )
                test_chunks.extend(tests)

        # Fetch configuration if the task involves settings/config
        config_keywords = {"config", "setting", "env", "environment", "secret", "key"}
        if any(kw in task_description.lower() for kw in config_keywords):
            config_chunks = self.search.search(
                task_description + " configuration",
                top_k=3,
                file_glob="**/*.{yaml,yml,toml,env,cfg}",
            )

        return RoutedContext(
            primary_chunks=primary_chunks,
            dependency_chunks=self._dedupe(dep_chunks, primary_chunks),
            test_chunks=self._dedupe(test_chunks, primary_chunks),
            config_chunks=self._dedupe(config_chunks, primary_chunks),
            total_tokens=self._estimate_tokens(
                primary_chunks + dep_chunks + test_chunks + config_chunks
            ),
        )

    def _extract_symbols(self, chunks: list[dict]) -> list[str]:
        """Extract the top-level symbol names from code chunks."""
        import re
        symbols = []
        pattern = re.compile(r"^(?:def|class|async def)\s+(\w+)", re.MULTILINE)
        for chunk in chunks:
            symbols.extend(pattern.findall(chunk.get("content", "")))
        return list(dict.fromkeys(symbols))  # dedupe while preserving order

    def _dedupe(self, candidates: list[dict], exclude: list[dict]) -> list[dict]:
        exclude_ids = {c["id"] for c in exclude}
        return [c for c in candidates if c["id"] not in exclude_ids]

    def _estimate_tokens(self, chunks: list[dict]) -> int:
        """Rough estimate: 4 chars ≈ 1 token."""
        total_chars = sum(len(c.get("content", "")) for c in chunks)
        return total_chars // 4
```

---

## 9.4 What Makes It Different

The "local-first, privacy-preserving" description is the most common summary of Pyckle's positioning, but the engineering decisions behind it are more specific:

**No persistent cloud embeddings.** Some cloud-hosted code search products embed your code on their servers and store the embeddings for retrieval. Pyckle generates and stores embeddings locally. The code and its vector representations never leave your machine or network.

**Incremental, file-watch-based indexing.** Most developer tools require a manual "re-index" command after significant code changes. Pyckle watches the file system and updates the index within seconds of a save. The index is always current; there is no "remember to re-index" cognitive burden.

**Language-agnostic chunking with per-language grammar support.** The AST chunker described in Chapter 3 uses tree-sitter grammars and supports 40+ languages out of the box. A polyglot codebase — Python services, TypeScript frontend, Terraform infrastructure, SQL migrations — is indexed uniformly, with cross-language semantic search that retrieves relevant code regardless of which language implements the concept.

**Intent-adaptive retrieval.** The query pipeline from Chapter 5 detects whether a query is an identifier lookup, a conceptual search, a behavioral query, or a diagnostic query, and routes it to the appropriate combination of BM25 and vector retrieval with appropriate weights. Most search tools send every query through the same pipeline.

**MCP-native from day one.** Rather than building a proprietary plugin per IDE, Pyckle uses the Model Context Protocol as its integration interface. Any AI coding assistant that supports MCP — Claude, Cursor, Codeium, custom LLM wrappers — gets Pyckle's search tools automatically. One server, every client.

---

## 9.5 Getting Started

The full Pyckle stack requires:
- Python 3.11+
- 4GB RAM minimum (8GB recommended for corpora over 200K chunks)
- 2GB disk for the embedding model
- Any operating system with a file system

```bash
# Install
pip install pyckle

# Initialize an index for your project
cd /path/to/your/project
pyckle init

# Start the indexer (watches for file changes)
pyckle index --watch &

# Start the MCP server
pyckle serve

# Search from the command line
pyckle search "retry logic for HTTP failures"

# Or configure your AI assistant to use the MCP server
# (see docs.pyckle.co for per-assistant setup guides)
```

The `pyckle init` command runs a first-time indexing job, which takes:
- ~2 minutes for a 50K-line codebase on a modern CPU
- ~8 minutes for a 500K-line codebase
- ~30 minutes for a 2M-line codebase

After the initial index is built, incremental updates are sub-second.

> **Try This**
>
> After installing Pyckle, run `pyckle search "the hardest concept to find in your codebase"`. Not the name of a function — a description of what the code does. Compare the results to what `grep -r` would have returned. The difference in the first two minutes is the same difference that Chapter 1 described in abstract terms. It is more persuasive when it is your own code.

---

## 9.6 What We Learned Building This

A few lessons from two years of building and operating Pyckle that did not fit neatly into the preceding chapters:

**Chunking is more important than the embedding model.** We spent months fine-tuning embedding models before realizing our chunking strategy was the actual bottleneck. A good embedding model on bad chunks produces worse results than a mediocre model on good chunks. Fix your chunking first.

**Developer trust is harder to earn than technical quality.** The first version of Pyckle that we showed to developers was technically strong but resulted in skepticism about privacy. We had not communicated clearly enough what data left the machine and when. Nothing changed in the code — we added one paragraph to the setup flow — and adoption improved significantly. Developer trust is a product decision, not just an engineering one.

**The long tail of query types is longer than you think.** We optimized the "conceptual code search" query because that is the glamorous problem. Then we shipped and discovered that 40% of queries were identifier lookups — developers searching for `UserAuthService` because they had seen it referenced but could not navigate to it. Intent-based routing, which was initially a theoretical nicety, became the most practically impactful feature we shipped.

**Query expansion is a trap for low-quality indices.** Query expansion (Chapter 5) improves recall for good indices and catastrophically degrades precision for bad ones. If your recall is low because your chunks are incoherent, adding query expansion retrieves more incoherent chunks. We learned to measure precision and recall independently and to diagnose low-precision failures as chunking problems before reaching for expansion.

**The evaluation dataset is a product.** The most important infrastructure investment we made was building a rigorous evaluation set with 500 hand-labeled queries from three real codebases. Every improvement we made was measured against it. Without it, we would have shipped several "improvements" that made things worse on real queries while improving our informal intuitions.

---

&nbsp;

---


## Conclusion

You have built something real. Not a prototype, not a proof of concept with asterisks — a production-grade semantic search pipeline that understands code the way developers think about it. You can now take a natural language query, embed it into the same vector space as your codebase, retrieve candidates that match on meaning rather than tokens, and rank them by relevance signals that compound across retrieval stages. That is a substantive capability, and most engineering teams don't have it.

The thread that runs through every chapter — from the failure modes of keyword search to the Pyckle architecture at the end — is that code is not text. It is structure, intent, and relationship simultaneously. Treating it as a bag of words was always a workaround, not a solution. The moment you switched to embeddings, you stopped asking "does this file contain the word authentication" and started asking "does this file handle the concept of authentication." That shift is not cosmetic. It changes what queries are even possible to ask. It changes what answers the system can surface. And it changes how developers interact with a codebase they've never touched before — from grep-and-hope to directed retrieval.

The second thread is that quality is earned incrementally, not declared upfront. The evaluation chapter exists because intuition about retrieval quality is almost always wrong. Systems that feel fast feel accurate. Systems with high recall rates hide their precision failures in the noise. The metrics — MRR, nDCG, recall at k — force you to be honest about where your pipeline breaks down. Chunking strategy affects recall more than embedding model choice in most codebases. Query expansion affects precision more than re-ranking in most query distributions. You cannot know which levers matter until you measure, and you cannot measure without knowing what you're optimizing for. Evaluation is not a step you do after the pipeline works. It is how you find out if the pipeline works.

The third thread is that scaling is a design problem, not an infrastructure problem. Throwing more hardware at a retrieval system that chunks poorly or embeds inconsistently produces faster wrong answers. The decisions you make at a hundred files — how to split functions from their context, how to handle imports and type signatures, how to structure metadata — are the decisions that determine whether the system degrades gracefully or catastrophically at ten million files. Scale revealed the fracture lines you introduced early. Most of the scaling chapter is really about design discipline applied under pressure.

Here is what to do Monday morning. Pick the smallest real codebase you have access to — not a toy repository, not hello world, an actual production codebase with mixed file types and inconsistent naming — and run it through your chunking pipeline. Do not touch the embedding model. Do not tune retrieval parameters. Just look at what the chunker produces. Read fifty random chunks. Ask whether each one is a coherent, self-contained unit of meaning. You will find chunks that split a function at the argument list. You will find chunks that include three unrelated utility functions concatenated together. You will find chunks that are pure comments with no code. Fix those first. Chunking quality is the foundation everything else sits on, and it is almost always the weakest part of a pipeline that was built by starting with the embedding model and working backwards.

The reason people don't apply what they've read in books like this one is almost never technical. It is not that the concepts are too hard or the implementation is too complex. It is that the gap between "I understand how this works" and "I have this running in my environment" feels larger than it is, and that feeling is enough to defer indefinitely. The gap is not large. The first working version of a semantic search pipeline — one that actually indexes real code and returns meaningful results — can be running in a weekend. Not polished. Not production-hardened. But real. And real is what converts understanding into capability. Understanding without the working system is just preparation for building the working system.

The evaluation infrastructure takes longer. The scaling work takes longer. But those problems only exist if you have a system worth scaling and evaluating. Start with real code, real queries, and a retriever that returns something. Everything in this book is about improving that something until it is exceptional. You cannot improve what doesn't exist.

What changes if you act on this: your team stops losing time to search. Developers who are new to a codebase become productive faster because retrieval surfaces the relevant context instead of requiring them to already know where things are. Code review improves because reviewers can find the existing implementations a PR should align with. On-call engineers find the relevant error-handling code in three queries instead of thirty minutes of grepping. The codebase becomes legible to the people who need to change it, regardless of how long they've been on the team.

What changes if you don't: nothing, immediately. That is the problem. The cost of not having semantic search is paid in small increments — twenty minutes here, an hour there — distributed across every developer every day, invisible in aggregate because no one measures the time spent looking for code that exists but can't be found. It feels like a background tax, not a crisis. Background taxes compound. The codebase grows. The team grows. The tax rate stays constant and the base expands. Three years from now, the retrieval problem you could have solved in a weekend will be the reason your onboarding takes six months and your senior engineers spend a third of their time answering questions that the codebase already answers, for anyone who can find the right file.

You know how to build the system. The question is whether you build it.
# Appendix A: Glossary

**Approximate Nearest Neighbor (ANN)**: An algorithm that finds the closest vectors to a query vector in high-dimensional space without an exhaustive scan of all vectors. ANN trades a small accuracy loss for dramatically faster search. HNSW, IVF-PQ, and ScaNN are common ANN algorithms.

**Attention Mechanism**: The core operation in transformer models. For each token, attention computes a weighted sum over all other tokens in the sequence, where weights (attention scores) represent how relevant each token is to understanding the current one. The mechanism enables long-range dependency modeling.

**BM25 (Best Match 25)**: A probabilistic ranking function for keyword search, standard in modern information retrieval systems. Extends TF-IDF with term frequency saturation and document length normalization. Parameters `k1` and `b` control saturation and length normalization respectively.

**Bi-encoder**: An embedding architecture that encodes queries and documents independently into a shared vector space. Fast at retrieval because documents can be pre-encoded; limited because query-document interaction is only captured at comparison time, not during encoding. Contrast with cross-encoder.

**Chunk**: The atomic unit of indexing in a semantic search system. A chunk is a contiguous segment of code, ideally aligned to a semantic boundary (function, class, module) rather than an arbitrary character or line count.

**Chunking Strategy**: The algorithm that decides how to divide a codebase into chunks. Good chunking strategies use AST node boundaries, respect semantic scope, and produce chunks that are self-contained enough to embed meaningfully.

**ChromaDB**: An open-source vector database optimized for local, embedded deployments. Written in Python with a Rust core. Good for development and small-to-medium corpora (up to ~500K vectors with acceptable performance).

**Contrastive Learning**: A training objective that teaches an embedding model to place similar inputs close together and dissimilar inputs far apart in vector space. InfoNCE / NT-Xent loss is the standard contrastive objective for retrieval model fine-tuning.

**Cosine Similarity**: A similarity measure for vectors defined as the dot product of two normalized vectors. Ranges from -1 (opposite) to 1 (identical direction). Equivalent to the dot product when vectors are L2-normalized, which is why normalizing embeddings at generation time enables fast dot-product search.

**Cross-encoder**: A reranking architecture that takes the concatenation of query and document as input and produces a relevance score. More accurate than bi-encoders for ranking because query-document interaction is modeled during inference, but cannot pre-encode documents — must be run fresh for every (query, document) pair.

**FAISS (Facebook AI Similarity Search)**: An open-source library for efficient similarity search and clustering of dense vectors. Supports HNSW, IVF, PQ, and their combinations. The de facto standard for embedding similarity search in research settings; Qdrant and Weaviate use FAISS internals for their HNSW implementations.

**Fine-tuning**: Adapting a pre-trained model to a new task or domain by continuing training on task-specific data with a lower learning rate. For code search, fine-tuning an embedding model on (query, code chunk) pairs from your codebase improves retrieval quality for your specific naming conventions and patterns.

**HNSW (Hierarchical Navigable Small World)**: A graph-based ANN algorithm that builds a multi-layer proximity graph. Searches start at the top layer (coarse) and navigate down to the query's neighborhood at the bottom layer (fine). Achieves O(log n) search complexity with high recall. The dominant ANN algorithm in production vector databases.

**Hybrid Search**: Combining keyword-based search (BM25) with vector similarity search. The two signals are complementary: BM25 handles exact token matches well; vector search handles semantic similarity. Reciprocal Rank Fusion is the standard fusion algorithm.

**IDF (Inverse Document Frequency)**: A weighting factor in BM25 and TF-IDF that reduces the weight of terms appearing in many documents. Rare terms have high IDF; common terms (like `self` in Python) have low IDF.

**Incremental Indexing**: Updating the search index in response to file changes rather than rebuilding from scratch. Requires tracking content hashes per chunk to detect what has changed. Critical for production deployments where full re-indexing would be too slow or expensive.

**InfoNCE Loss**: A contrastive loss function used to train bi-encoder embedding models for retrieval. The model is given a batch of (query, positive) pairs and learns to rank each positive above all other documents in the batch (which serve as in-batch negatives). Also called NT-Xent loss.

**Intent Classification**: The step in query processing that categorizes a developer's query as one of: identifier lookup, conceptual search, behavioral query, similarity search, or diagnostic query. Different intents warrant different retrieval strategies (BM25 weight, expansion, etc.).

**IVF (Inverted File Index)**: An ANN algorithm that partitions the vector space into Voronoi cells (clusters) and searches only the most promising clusters for a given query. Combined with product quantization (IVF-PQ), it enables memory-efficient approximate search over very large corpora.

**L2 Normalization**: Scaling a vector so its Euclidean length equals 1. After L2 normalization, the dot product of two vectors equals their cosine similarity. Normalizing embeddings at generation time is a free optimization that enables faster dot-product search.

**Latency Budget**: A defined constraint on how long a search query is allowed to take. Latency budgets are assigned per use case (IDE autocomplete vs. web search vs. batch analytics) and drive architectural decisions about which pipeline components can be included.

**Maximal Marginal Relevance (MMR)**: A diversification algorithm that selects results to maximize both relevance to the query and diversity from already-selected results. Controlled by a `lambda` parameter that trades off relevance against diversity.

**MCP (Model Context Protocol)**: An open protocol developed by Anthropic for AI assistants to call external tools. Pyckle uses MCP as its integration interface, allowing any MCP-compatible AI coding assistant to call Pyckle's search tools as structured function calls.

**Mean Pooling**: A pooling strategy for producing a fixed-size embedding from a variable-length transformer output. Averages the embeddings of all non-padding tokens. Generally produces better retrieval quality for code than [CLS] token pooling.

**MRR (Mean Reciprocal Rank)**: An evaluation metric averaging the reciprocal rank of the first relevant result across all queries. MRR@10 = 0.8 means that on average, the first relevant result appears at position ~1.25 in the top 10.

**NDCG (Normalized Discounted Cumulative Gain)**: An evaluation metric that rewards placing more relevant results higher in the ranking. Supports graded relevance (not just binary). A system that ranks a "highly relevant" result at #1 and "marginally relevant" at #2 scores higher than one that reverses them.

**Product Quantization (PQ)**: A vector compression technique that splits each vector into sub-vectors, quantizes each sub-vector independently using a learned codebook, and stores only the codebook indices. Reduces memory by 4–32x with modest quality loss. Combined with HNSW in most production vector databases.

**Qdrant**: An open-source vector database written in Rust with a Python client. Supports HNSW with multiple quantization options, hybrid search (sparse + dense vectors), and both client-server and embedded (in-process) deployment modes.

**Query Expansion**: Adding related terms or alternative phrasings to a query before retrieval to improve recall. Can be rule-based (synonym dictionaries) or LLM-based (generate alternative formulations). Improves recall on good indices; degrades precision on bad ones.

**Recall@K**: The fraction of all relevant documents that appear in the top-K results. Recall@10 = 0.8 means 80% of all relevant results were retrieved in the top 10.

**Reciprocal Rank Fusion (RRF)**: A score fusion algorithm that combines ranked lists from multiple retrieval systems. Each document's RRF score is the sum of 1/(k + rank) across all lists it appears in, where k (typically 60) is a damping constant. Robust, parameter-free (beyond k), and competitive with learned fusion on most benchmarks.

**Reranker**: A model or algorithm that takes the output of a first-stage retriever and re-orders it for higher precision. Cross-encoder rerankers are the most powerful; they take the query and each candidate document as joint input and produce a relevance score.

**Scalar Quantization (SQ)**: A vector compression technique that converts float32 values to int8 (SQ8) or uint8, reducing memory by 4x with less than 3% quality loss. Simpler than product quantization and widely used in production.

**Semantic Gap**: The difference between the vocabulary used in a query and the vocabulary used in the target documents. The semantic gap is why keyword search degrades on large codebases where different teams use different naming conventions.

**Sharding**: Dividing an index across multiple nodes to distribute storage and query load. Horizontal sharding assigns documents by ID hash; semantic sharding assigns by embedding similarity. Queries fan out to all shards and results are merged at the coordinator.

**tree-sitter**: A parser generator and incremental parsing library that supports 100+ programming languages. Produces an AST (abstract syntax tree) for any supported language. Used by Pyckle and others for AST-aware code chunking.

**Vector Database**: A database optimized for storing and querying high-dimensional vectors. Supports ANN search, often combined with metadata filtering. Examples: Qdrant, Weaviate, Pinecone, Chroma, Milvus, pgvector.

---

&nbsp;

---

# Appendix B: Tools and Resources

## Embedding Models

| Tool | URL | Notes |
|---|---|---|
| Salesforce/codet5p-110m-embedding | huggingface.co/Salesforce/codet5p-110m-embedding | Recommended starting model |
| microsoft/unixcoder-base | huggingface.co/microsoft/unixcoder-base | Strong, battle-tested, 768-dim |
| jinaai/jina-embeddings-v3 | huggingface.co/jinaai/jina-embeddings-v3 | Long context (8K), multi-task |
| voyage-code-3 | voyageai.com | Best quality, API-only |
| sentence-transformers | sbert.net | Library for loading/using all models above |

## Vector Databases

| Tool | URL | Best For |
|---|---|---|
| Qdrant | qdrant.tech | Production self-hosted, Rust-native |
| ChromaDB | trychroma.com | Development, local/embedded |
| Weaviate | weaviate.io | Cloud-native, strong GraphQL API |
| Pinecone | pinecone.io | Managed, serverless option |
| pgvector | github.com/pgvector/pgvector | Postgres-native, familiar ops |
| Milvus | milvus.io | High-scale distributed deployments |
| FAISS | github.com/facebookresearch/faiss | Research, building custom systems |

## BM25 / Keyword Search

| Tool | URL | Notes |
|---|---|---|
| rank-bm25 | github.com/dorianbrown/rank-bm25 | Pure Python, good for prototyping |
| tantivy | github.com/tantivy-search/tantivy | Rust-based, very fast, Python bindings via tantivy-py |
| Elasticsearch | elastic.co | Full-featured, heavy, best for large-scale BM25 |
| Typesense | typesense.org | Lighter than Elasticsearch, good BM25 |
| Whoosh | whoosh.readthedocs.io | Pure Python, no external deps, small corpora only |

## Parsing and Chunking

| Tool | URL | Notes |
|---|---|---|
| tree-sitter | tree-sitter.github.io | AST parsing, 100+ languages |
| py-tree-sitter | pypi.org/project/tree-sitter | Python bindings for tree-sitter |
| tree-sitter grammars | github.com/tree-sitter | Per-language grammar repos |
| LibCST | github.com/Instagram/LibCST | Python-specific, preserves comments/formatting |
| ast (stdlib) | docs.python.org/3/library/ast.html | Python-only, built-in, no install |
| babel-parser | babeljs.io | JavaScript/TypeScript AST |

## Evaluation

| Tool | URL | Notes |
|---|---|---|
| BEIR | github.com/beir-cellar/beir | Standard IR evaluation framework |
| trec-eval | github.com/usnistgov/trec_eval | Classic IR metrics, C tool |
| ranx | github.com/AmenRa/ranx | Python, fast metric computation |
| CodeSearchNet | github.com/github/CodeSearchNet | Benchmark dataset for code retrieval |
| CoSQA | github.com/Jun-jie-Huang/CoCLR | Code-query dataset, 20K pairs |

## Reranking

| Tool | URL | Notes |
|---|---|---|
| jinaai/jina-reranker-v2-base-multilingual | huggingface.co | Recommended for code+NL |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | huggingface.co | Fast, good for general text |
| FlashRank | github.com/PrithivirajDamodaran/FlashRank | Ultra-fast reranking library |
| Cohere Rerank | cohere.com/rerank | API-based, high quality |

## MCP (Model Context Protocol)

| Tool | URL | Notes |
|---|---|---|
| MCP specification | modelcontextprotocol.io | Protocol docs |
| mcp Python SDK | github.com/modelcontextprotocol/python-sdk | Build MCP servers in Python |
| MCP inspector | github.com/modelcontextprotocol/inspector | Debug MCP servers interactively |

---

&nbsp;

---

# Appendix C: Further Reading

## Foundational Papers

**Attention Is All You Need** — Vaswani et al., 2017
The original transformer paper. Everything in this book descends from this architecture. Worth reading once in full even if you know transformers well — the clarity of the original presentation exceeds most explanations of it.
`arxiv.org/abs/1706.03762`

**CodeBERT: A Pre-Trained Model for Programming and Natural Languages** — Feng et al., 2020
The paper that demonstrated transformer pre-training on code produces models useful for code retrieval. Introduces the CodeSearchNet benchmark.
`arxiv.org/abs/2002.08155`

**GraphCodeBERT: Pre-training Code Representations with Data Flow** — Guo et al., 2021
Extends CodeBERT with data flow graph structure. Demonstrates that structural information (not just token co-occurrence) improves code understanding.
`arxiv.org/abs/2009.08366`

**UniXcoder: Unified Cross-Modal Pre-training for Code Representation** — Guo et al., 2022
Introduces AST-flattening for code-aware pre-training without graph neural networks. State-of-the-art on CodeSearchNet at time of publication.
`arxiv.org/abs/2203.03850`

**BEIR: A Heterogeneous Benchmark for Zero-Shot Evaluation of Information Retrieval Models** — Thakur et al., 2021
The standard framework for evaluating retrieval models across diverse domains. Section 4 covers the IR metrics used throughout Chapter 7.
`arxiv.org/abs/2104.08663`

**Efficient Passage Retrieval with Hashing for Open-domain Question Answering** — Yamada et al., 2021
A clean treatment of bi-encoder retrieval, dense passage retrieval, and the tradeoffs between bi-encoder and cross-encoder architectures.
`arxiv.org/abs/2106.00882`

## Retrieval and Ranking

**Pretrained Transformers for Text Ranking: BERT and Beyond** — Lin et al., 2021
Comprehensive survey of transformer-based ranking approaches. Chapters 3 and 4 cover cross-encoder vs. bi-encoder reranking in detail.
`arxiv.org/abs/2010.06467`

**Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods** — Cormack et al., 2009
The original RRF paper. Remarkably short (2 pages) and remarkably effective. The algorithm has not been substantially improved in 15 years.
`dl.acm.org/doi/10.1145/1571941.1572114`

**Approximate Nearest Neighbor Search in High Dimensions** — Andoni & Indyk, 2018
Survey of ANN algorithms. Sections 3–5 cover the theory behind HNSW, IVF, and product quantization.
`arxiv.org/abs/1806.09823`

**Efficient and Robust Approximate Nearest Neighbor Search Using HNSW** — Malkov & Yashunin, 2018
The original HNSW paper. Required reading if you need to tune HNSW parameters (M, ef, ef_construction) for your specific use case.
`arxiv.org/abs/1603.09320`

## Code Intelligence

**A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research** — Watson et al., 2022
Broad survey of deep learning applied to code-related tasks including retrieval, summarization, and generation.
`arxiv.org/abs/2009.06520`

**CodeSearchNet Challenge: Evaluating the State of Semantic Code Search** — Husain et al., 2019
Introduces the CodeSearchNet dataset (6 languages, 6.4M function-docstring pairs) and benchmark. The standard evaluation corpus for code retrieval.
`arxiv.org/abs/1909.09436`

**An Empirical Study of Deep Learning Models for Vulnerability Detection** — Chakraborty et al., 2022
Relevant for teams building security-focused code search. Covers the challenges of semantic representation for vulnerability patterns.
`arxiv.org/abs/2212.08109`

## Scaling and Production

**Scaling Distributed Machine Learning with the Parameter Server** — Li et al., 2014
Still the most readable treatment of distributed training and inference at scale. The parameter server pattern appears in distributed vector databases.
`proceedings.mlsys.org/paper/2020/file/1f36c15d6a3d18d52e8d493bc8187cb9-Paper.pdf`

**FAISS: A Library for Efficient Similarity Search** — Johnson et al., 2021
The technical paper behind FAISS. Covers IVF, PQ, and GPU acceleration in detail.
`arxiv.org/abs/2401.08281`

**Billion-scale similarity search with GPUs** — Johnson et al., 2017
Covers GPU-accelerated ANN search and the challenges of very large-scale retrieval.
`arxiv.org/abs/1702.08734`

## Evaluation Methodology

**Offline Evaluation Without Gain** — Sakai, 2020
Critical reading for anyone building evaluation sets for information retrieval. Covers the statistical assumptions behind MRR, NDCG, and their limitations.
`dl.acm.org/doi/10.1145/3340531.3412123`

**IR Evaluation Methods for Highly Relevant Documents** — Sakai & Robertson, 2010
Covers evaluation metrics when binary relevance is insufficient — directly applicable to code search where results range from "not relevant" to "exactly what I needed."
`dl.acm.org/doi/10.1145/1835449.1835550`

## Practical Guides and Blog Posts

**Building a Semantic Search Engine for Code** — Nelson Elhage, 2023
Detailed technical post-mortem from building code search at Anthropic. Covers chunking decisions and query processing in production.
`nelhage.com/post/building-semantic-search-engine-for-code`

**Embedding Models: From Architecture to Implementation** — Pinecone Learning Center
Clear explanations of pooling strategies, fine-tuning, and model selection. Good supplementary reading for Chapter 2.
`pinecone.io/learn/series/nlp/dense-vector-embeddings-nlp/`

**Evaluating Embedding Models for Retrieval** — Hugging Face MTEB Leaderboard
Living benchmark of embedding models on retrieval tasks. The MTEB code-retrieval category is directly applicable.
`huggingface.co/spaces/mteb/leaderboard`

**Vector Database Comparison** — ANN Benchmarks
Annual benchmarks of ANN algorithm implementations. Raw numbers for recall vs. queries-per-second tradeoffs across Qdrant, Weaviate, FAISS, and others.
`ann-benchmarks.com`

---

&nbsp;

---

*Building a Semantic Search Pipeline — Version 1.0 — March 2026*

*By David Kelly Price | pyckle.co*

*All code examples tested on Python 3.11. Dependencies: sentence-transformers ≥ 2.7, qdrant-client ≥ 1.8, rank-bm25 ≥ 0.2, anthropic ≥ 0.25, numpy ≥ 1.26, tree-sitter ≥ 0.21.*

---



---

## Related Blog Posts

- [Semantic Routing Explained](https://pyckle.co/blog/semantic-routing-explained.html)
- [We Trained Our Own Code Embedding Model from Scratch](https://pyckle.co/blog/we-trained-our-own-code-embedding-model-from-scratch-heres-what-happened.html)
- [Why Naive Retrieval Breaks at Scale](https://pyckle.co/blog/why-naive-retrieval-breaks-at-scale-and-what-we-built-instead.html)

---

*[Browse all free guides →](https://pyckle.co/books.html)*
