---
title: "How to Evaluate and Select an Embedding Model"
subtitle: "The Benchmark Results Are Lying to You — Here's How to Find the Truth"
author: "Kelly Price"
date: "2026-04-21"
description: "A practical guide to evaluating embedding models on your own data — covering eval methodology, benchmark interpretation, model comparison, cost-quality tradeoffs, and the fine-tuned vs. general decision."
tags: [embeddings, ai, developer-tools, infrastructure]
---

# How to Evaluate and Select an Embedding Model

**The Benchmark Results Are Lying to You — Here's How to Find the Truth**

*Kelly Price*

---

## About This Guide

You ran the benchmarks. The leaderboard said Model X was the best. You deployed it. Your retrieval quality was mediocre at best — and worse than the model you replaced.

This happens constantly. The disconnect between published benchmark numbers and real-world performance on your specific data is one of the most predictable traps in the embedding model selection process. It is not subtle. It is not a corner case. It is the default outcome when you let someone else's eval set stand in for yours.

This guide is about closing that gap.

I wrote this for software developers who are building systems that depend on vector search — code search tools, RAG pipelines, semantic routing, document retrieval, recommendation engines. You may be choosing between open-source models and API providers. You may be wondering whether to fine-tune or ride a general model. You may have a production system underperforming and no clear diagnosis.

By the end of this book, you will have a methodology for evaluating any embedding model on your own data, a clear understanding of what the public benchmarks actually measure and what they don't, a repeatable process for running model bake-offs, a framework for deciding when fine-tuning is worth the investment, and the vocabulary to read retrieval metrics without getting fooled by them.

This is not a survey of every embedding model on the market. The market moves too fast for any book to be definitive on that front. What does not change is the methodology for making a principled selection. The models will rotate. The evaluation process will not.

A few things to be clear about upfront: you will need to write code. The exercises at the end of each chapter are real. If you skip them, you will understand the concepts but not develop the judgment. The judgment comes from running the numbers on your own data and seeing the variance firsthand.

The voice here is direct. If something is a bad idea, I will say it is a bad idea. If a benchmark number is misleading, I will say why. There is no diplomatic way to tell you that the MTEB leaderboard score you are using to make infrastructure decisions is being computed on data that looks nothing like yours — so I will just say it plainly and show you what to do instead.

Let's get to work.

---

## Table of Contents

1. Why Vendor Benchmarks Don't Transfer to Your Data
2. Building a Domain-Specific Evaluation Set
3. Metrics That Matter: MRR@10, Recall@k, NDCG
4. Running a Model Bake-Off on Your Codebase
5. The Fine-Tuned vs. General Model Decision
6. Cost-Quality Tradeoffs Across Model Sizes
7. Interpreting Your Results: What Good Looks Like
8. When to Stop Optimizing
9. Maintaining Evaluation Quality as Your Data Changes

**Conclusion**
**Appendix A:** Glossary
**Appendix B:** Tools and Resources
**Appendix C:** Further Reading

---

## Chapter 1: Why Vendor Benchmarks Don't Transfer to Your Data

Every major embedding model provider publishes benchmark scores. OpenAI publishes numbers on MTEB. Cohere publishes numbers on BEIR. Hugging Face hosts a leaderboard that refreshes weekly with new contenders. These numbers are real. They were computed honestly. They are also largely useless for predicting how a model will perform on your data — and understanding why is the foundation of everything else in this book.

### What the Benchmarks Actually Measure

The Massive Text Embedding Benchmark (MTEB) is the most widely cited public leaderboard for embedding model evaluation. Its original release spans 58 datasets across eight task categories: classification, clustering, pair classification, reranking, retrieval, semantic textual similarity, summarization, and bitext mining. The widely quoted English leaderboard averages over 56 of those datasets. When a vendor publishes an MTEB score, they are reporting an average across some subset of these datasets.

The retrieval tasks on MTEB draw heavily on BEIR, a collection of retrieval datasets that includes MS MARCO alongside corpora from domains such as biomedical literature, financial question answering, and scientific papers. MS MARCO pairs real Bing search queries with passages from web documents. These are not bad datasets. They are carefully constructed and academically rigorous.

They are not your codebase. They are not your internal documentation. They are not the support tickets, Slack threads, or engineering RFCs that your team searches every day. The MTEB leaderboard tests on web crawls and academic datasets — useful for ranking models against each other, but not predictive of performance on YOUR data.

This is not a subtle point. The semantic structure of Python function signatures is genuinely different from the semantic structure of web search queries. When someone searches your codebase for "parse JWT expiry and return unix timestamp," the model needs to understand that `decode_token_expiration` returning `int(exp)` is the right answer, not a Stack Overflow post about JWT libraries. That relationship is poorly represented, if at all, in the training data of most general-purpose text embedding models.

> **Key Insight:** MTEB rank is a useful prior for narrowing the field — it tells you which models are at least competent across a broad range. But it tells you nothing about relative performance on your specific domain. A model ranked 15th overall might outperform the top-ranked model on your codebase by 30%.

### The Data Distribution Problem

Every embedding model was trained on a corpus. The model learns to place semantically similar text near each other in vector space based on what "similar" meant in that training corpus. When you query the model at inference time, you are asking it to apply those learned similarity relationships to your data.

If your data looks like the training corpus, the model's similarity judgments will transfer well. If your data is out-of-distribution — different vocabulary, different structure, different query patterns — the model's similarity judgments degrade, sometimes catastrophically.

Code is a high-severity case of this problem. Code has syntactic structure that natural language does not. Variable names, function signatures, docstrings, and implementation bodies are separate channels of meaning that all need to be weighted appropriately. A general text model treats `for i in range(len(arr)):` as a sequence of tokens and tries to embed it using similarity relationships learned from Wikipedia articles. A code-specialized model trained on function bodies paired with docstrings understands that `i` is an iteration variable, that `range(len(arr))` signals array traversal, and that this snippet should be near other array iteration patterns — regardless of what the variable is named.

This is not a hypothetical difference. PyckLM, trained on code structure rather than retrofitted from text, achieves 0.456 MRR@10 on CodeSearchNet — 62% better than GraphCodeBERT. That gap exists entirely because the training distribution matches the query distribution. The general-purpose models are not dumb; they are applying the right reasoning to the wrong domain.

### The Benchmark Gaming Problem

There is a second issue with public benchmarks that is more subtle: benchmark contamination. As MTEB has become the standard leaderboard, some model training pipelines have begun to include the benchmark datasets in their training data — or data from the same sources. This inflates scores without improving general capability. A model can achieve a high MTEB score by memorizing the distribution of the benchmark rather than learning generalizable embeddings.

You cannot distinguish between these cases from the leaderboard score alone. A model with genuine retrieval quality and a model that has been trained on benchmark-adjacent data can produce identical MTEB numbers. On your data — which was definitely not in anyone's training set — only the model with genuine capability will hold up.

> **Warning:** Treat dramatic jumps in MTEB scores from new model releases with skepticism. Genuine capability improvements are incremental. A model that appears at the top of the leaderboard overnight with no architectural explanation is worth investigating before you commit to it.

### The Baseline Failure Mode

The most common practical consequence of over-relying on benchmarks is deploying a model that looks strong on paper but performs at the level of a keyword search baseline on your actual queries. At that point, the vector search infrastructure is adding latency and cost without adding quality.

Before you evaluate any embedding model, you should have a keyword search baseline (BM25 or equivalent). If a candidate embedding model does not beat BM25 on your domain-specific eval set by a meaningful margin, it is not improving your system. Benchmarks will not tell you this. Only running the eval on your data will.

> **Try This:** Take the top three models from the current MTEB retrieval leaderboard. Look at the retrieval datasets used to compute their scores. Now ask: how similar is that data to your data? If the answer is "not very," you have just discovered exactly how much the leaderboard is telling you about your use case.

### What to Do Instead

The rest of this book is the answer to that question. The short version: build a small evaluation set from your own data, define a metric that captures the retrieval quality you actually care about, and run every candidate model through the same eval before making any infrastructure decisions.

This does not require a large dataset. It does not require a PhD in information retrieval. It requires about two hours of upfront work and a willingness to treat your own data as the ground truth.

**Key Takeaways:**
- MTEB scores are computed on web crawls and academic datasets that do not resemble most production workloads.
- Code, internal documentation, and domain-specific text are all out-of-distribution for general embedding models.
- Benchmark contamination can inflate leaderboard scores without improving real retrieval quality.
- A keyword search baseline (BM25) is your minimum bar — any embedding model you deploy should beat it on your data.
- Two hours of building a domain-specific eval set will give you more useful signal than any public benchmark.

**Practical Exercise:**
Run BM25 on your actual corpus using the top 20 queries your users submit (or the 20 most representative ones you can construct). Record the MRR@10. This is your baseline. Every model you evaluate in this book gets compared against this number.

---

## Chapter 2: Building a Domain-Specific Evaluation Set

The single most valuable thing you can do before selecting an embedding model is build a small, accurate evaluation set from your own data. Not a large one. Not a perfect one. A useful one — which means roughly 30 to 200 (query, expected_result) pairs drawn from your actual corpus.

Most developers skip this step or delay it indefinitely because it feels like a research task. It is not. It is an engineering task with a concrete deliverable. This chapter walks through building one that is statistically useful and practically maintainable.

### What an Evaluation Set Is

An evaluation set for retrieval is a collection of (query, relevant_document) pairs where you know, ahead of time, which document or documents in your corpus should be returned for each query. When you run an embedding model over your corpus and compute retrieval metrics, you are measuring how often the model surfaces those known-relevant documents in its top results.

The simplest form: a list of tuples. Each tuple contains a natural language query and the identifier of the document that should be returned. For a codebase, that looks like:

```python
eval_set = [
    {
        "query": "parse JWT expiry and return unix timestamp",
        "relevant_ids": ["src/auth/tokens.py:decode_token_expiration"]
    },
    {
        "query": "retry logic with exponential backoff",
        "relevant_ids": ["src/utils/retry.py:retry_with_backoff"]
    },
    {
        "query": "validate email address format",
        "relevant_ids": ["src/validators/email.py:validate_email"]
    },
]
```

This is the minimum viable format. Each query is something a real user of your system would actually search for. Each relevant ID points to the file, function, or document that correctly answers the query.

### How Many Pairs Do You Need?

The minimum useful eval set is 30 (query, expected_result) pairs. Below 30, the variance in your metrics is too high to distinguish between models reliably. At 30 pairs, a difference of 0.05 in MRR@10 is large enough to take seriously, provided both models are scored on the same queries. Above 100 pairs, you start getting diminishing returns on statistical reliability; the next 100 pairs are better spent on coverage of different query types than on raw count.

> **Key Insight:** The 30-pair floor is not a magic statistical threshold. It follows from MRR@10 being a bounded metric (0 to 1) averaged over discrete rankings. With n queries, a single query flipping between rank 1 and a miss moves the score by 1/n: about 0.03 at 30 queries and 0.1 at 10. Below 30, one lucky or unlucky placement is enough to make one model look better than another when the difference is noise.
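One way to see how much a score can wobble at a given eval-set size is a percentile bootstrap over per-query reciprocal ranks. The sketch below is illustrative only: `bootstrap_mrr_interval` and the sample values are hypothetical, and the resample count and seed are arbitrary choices.

```python
import random

def bootstrap_mrr_interval(reciprocal_ranks, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-query reciprocal ranks."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(reciprocal_ranks)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement and recompute the mean.
        sample = [reciprocal_ranks[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-query reciprocal ranks for a 30-query eval set.
rr_values = [1.0] * 10 + [0.5] * 6 + [0.25] * 4 + [0.0] * 10
mrr = sum(rr_values) / len(rr_values)
lo, hi = bootstrap_mrr_interval(rr_values)
print(f"MRR@10 = {mrr:.3f}, 95% bootstrap interval = [{lo:.3f}, {hi:.3f}]")

# Largest shift a single query can cause: rank 1 (1.0) versus a miss (0.0).
max_single_query_shift = 1.0 / len(rr_values)
print(f"One query can move the score by up to {max_single_query_shift:.3f}")
```

If the intervals for two candidate models overlap heavily, adding pairs is likely a better use of time than arguing about the third decimal place.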

For a codebase of 500–5,000 files, aim for 50–75 pairs covering diverse query types. For larger corpora or systems with multiple distinct use patterns, 100–150 pairs is a reasonable target before returns diminish.

### Sourcing Queries

The best queries come from real users. If your system is already in production, your query logs are a goldmine. Pull the 100 most common queries (after deduplication and cleaning), annotate which document in your corpus is the correct answer for each, and you have a high-quality eval set in an afternoon.

If you do not have query logs, you need to construct queries. There are three approaches:

**Method 1: Document-first annotation.** For each document in a random sample of your corpus, write one or two queries that should retrieve that document. This is the most common approach for new systems. It is somewhat artificial — you are writing queries while looking at the answer — but it captures the vocabulary and structure of your actual content.

**Method 2: Scenario-based construction.** Define the top 10 use cases for your retrieval system, then write 3–5 queries per use case. For a codebase search tool, use cases might include: finding a function by what it does, finding the error handler for a specific exception type, finding all callers of a particular API. This approach ensures coverage across different query patterns rather than clustering around what is easiest to annotate.

**Method 3: LLM-assisted generation.** Use a language model to generate queries for documents in your corpus. Feed it a document and ask for five natural language questions that would correctly be answered by that document. This scales well but requires human review: LLMs tend to generate queries that closely echo the document text, which inflates scores for models that perform simple lexical matching.

> **Warning:** Do not let your eval set be dominated by queries that look like paraphrases of the document text. Those queries are easy for any model to get right. You want queries that require semantic understanding — queries where the vocabulary of the question and the vocabulary of the answer are genuinely different. Include at least 30% "semantic gap" queries where no word in the query appears in the relevant document.

### Ground Truth Annotation

For each query, you need to mark which documents in your corpus are relevant. The minimum annotation is a single most-relevant document per query. If your retrieval task involves multiple relevant documents per query (e.g., "find all files related to authentication"), annotate all of them.

For code search, a pragmatic annotation process:

1. For each query, search your codebase manually (using your IDE's full-text search or grep).
2. Identify the single best-matching file or function.
3. Record its identifier.
4. Note whether there are secondary relevant files (partial matches, related utilities).

The annotation does not need to be exhaustive. For MRR@10 evaluation, you only need to identify the primary relevant document. If it shows up in the top 10 results, the model gets credit. If not, it does not.

```python
import json
from pathlib import Path

def build_eval_set(corpus_path: str, output_path: str) -> None:
    """Interactive CLI for building an evaluation set from a code corpus."""
    corpus_files = list(Path(corpus_path).rglob("*.py"))
    eval_pairs = []

    print(f"Found {len(corpus_files)} Python files in corpus.\n")

    while True:
        query = input("Enter a search query (or 'done' to finish): ").strip()
        if query.lower() == "done":
            break

        print("\nSearch your codebase for the best-matching file.")
        relevant_id = input("Enter the file path of the most relevant result: ").strip()

        secondary = input("Any secondary relevant files? (comma-separated paths, or Enter to skip): ").strip()
        # Drop empty entries left by trailing commas or stray whitespace.
        secondary_ids = [s.strip() for s in secondary.split(",") if s.strip()]

        eval_pairs.append({
            "query": query,
            "relevant_ids": [relevant_id] + secondary_ids,
        })

        print(f"Recorded. Total pairs: {len(eval_pairs)}\n")

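    # Wrap the pairs in the versioned schema described below so later
    # chapters can read them via eval_data["pairs"].
    with open(output_path, "w") as f:
        json.dump({"schema_version": "1.0", "corpus": corpus_path, "pairs": eval_pairs}, f, indent=2)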

    print(f"Eval set saved to {output_path} with {len(eval_pairs)} pairs.")
```

### Structuring for Reusability

Store your eval set in a format you can version control and diff. JSON is fine. Include a schema version field so that when you extend the format later (adding difficulty ratings, query categories, secondary relevance), you can distinguish old records from new ones.

```json
{
  "schema_version": "1.0",
  "corpus": "src/",
  "created": "2026-04-21",
  "pairs": [
    {
      "id": "q001",
      "query": "parse JWT expiry and return unix timestamp",
      "relevant_ids": ["src/auth/tokens.py"],
      "category": "auth",
      "difficulty": "semantic_gap"
    }
  ]
}
```

The `difficulty` field is optional but valuable: marking which queries are easy (lexical overlap with relevant document) versus hard (semantic gap between query and document vocabulary) lets you analyze model performance by query type — not just by overall score.
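A versioned file is easier to trust if it is checked on load. The following is a minimal sketch, assuming the JSON layout shown above; `validate_eval_set` and `load_eval_set` are hypothetical helper names, and the set of required fields is an assumption rather than a fixed schema.

```python
import json

def validate_eval_set(data):
    """Check the minimal structure of an eval-set dict and return its pairs."""
    if "schema_version" not in data:
        raise ValueError("eval set is missing 'schema_version'")
    pairs = data.get("pairs")
    if not isinstance(pairs, list) or not pairs:
        raise ValueError("eval set must contain a non-empty 'pairs' list")
    seen_ids = set()
    for index, pair in enumerate(pairs):
        if not pair.get("query"):
            raise ValueError(f"pair {index} has no query")
        if not pair.get("relevant_ids"):
            raise ValueError(f"pair {index} has no relevant_ids")
        pair_id = pair.get("id")  # ids are optional, but must be unique if present
        if pair_id is not None:
            if pair_id in seen_ids:
                raise ValueError(f"duplicate pair id: {pair_id}")
            seen_ids.add(pair_id)
    return pairs

def load_eval_set(path):
    """Read an eval-set JSON file and validate it before use."""
    with open(path) as f:
        return validate_eval_set(json.load(f))

# Hypothetical in-memory example mirroring the schema above.
example = {
    "schema_version": "1.0",
    "corpus": "src/",
    "pairs": [
        {
            "id": "q001",
            "query": "parse JWT expiry and return unix timestamp",
            "relevant_ids": ["src/auth/tokens.py"],
            "difficulty": "semantic_gap",
        }
    ],
}
pairs = validate_eval_set(example)
print(f"{len(pairs)} valid pair(s)")
```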

> **Try This:** After building your eval set, compute what percentage of your queries contain at least one word that also appears in the relevant document's top 200 tokens. If that number is above 70%, your eval set is lexically dominated and will not differentiate embedding models from BM25.
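The overlap check can be scripted in a few lines. This sketch makes two assumptions worth stating: "top 200 tokens" is read as the first 200 tokens of the document, and tokens are lowercase alphanumeric runs. The function names and the two-pair sample are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (so snake_case names split too)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has_lexical_overlap(query, document, max_doc_tokens=200):
    """True if any query token appears in the document's first max_doc_tokens tokens."""
    doc_tokens = set(tokenize(document)[:max_doc_tokens])
    return any(token in doc_tokens for token in tokenize(query))

def lexical_overlap_rate(pairs, corpus, max_doc_tokens=200):
    """Fraction of queries sharing at least one token with a relevant document."""
    if not pairs:
        return 0.0
    overlapping = 0
    for pair in pairs:
        texts = (corpus.get(doc_id, "") for doc_id in pair["relevant_ids"])
        if any(has_lexical_overlap(pair["query"], t, max_doc_tokens) for t in texts):
            overlapping += 1
    return overlapping / len(pairs)

# Hypothetical corpus and eval pairs for illustration.
corpus = {
    "src/auth/tokens.py": "def decode_token_expiration(token):\n    return int(claims['exp'])"
}
pairs = [
    {"query": "decode token expiration", "relevant_ids": ["src/auth/tokens.py"]},
    {"query": "when do login credentials stop being valid", "relevant_ids": ["src/auth/tokens.py"]},
]
rate = lexical_overlap_rate(pairs, corpus)
print(f"{rate:.0%} of queries share a token with their relevant document")
```

Common words such as "return" or "the" count as overlap here, so the rate is an upper bound; filtering stop words before comparing gives a stricter reading.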

**Key Takeaways:**
- 30–75 (query, relevant_document) pairs is sufficient for statistically reliable model comparison.
- Real user queries from logs are higher quality than manually constructed ones.
- Include at least 30% "semantic gap" queries where no query word appears in the relevant document.
- Store your eval set in version control with a schema version field.
- Annotating the single most-relevant document per query is enough for MRR@10 evaluation.

**Practical Exercise:**
Construct a 30-pair eval set for your codebase or document corpus. Use the script above or any method you prefer. Target: 10 queries with direct lexical overlap with the answer, 10 with partial overlap, 10 with no overlap (true semantic gap). Save the result as `eval_set.json`. You will use this file throughout the rest of the book.

---

## Chapter 3: Metrics That Matter: MRR@10, Recall@k, NDCG

Picking the right metric is not a formality. The metric you choose determines which model you select, and different metrics optimize for different user experiences. A model that maximizes NDCG might rank poorly on MRR@10 while being genuinely better for your use case — or the opposite. This chapter explains what each metric measures, when to use it, and how to compute it without introducing subtle bugs.

### Mean Reciprocal Rank (MRR@k)

MRR@k answers a specific question: when a user submits a query, how high in the results list does the first relevant document appear? It computes the reciprocal of the rank (1/rank) for each query and averages across queries.

If the relevant document appears first, the reciprocal rank is 1/1 = 1.0. If it appears second, 1/2 = 0.5. Third, 1/3 = 0.33. The @k suffix means you only count results within the top k — if the relevant document does not appear in the top k, its contribution is 0.

```python
def mrr_at_k(eval_set: list[dict], retrieve_fn, k: int = 10) -> float:
    """
    Compute MRR@k.

    eval_set: list of {"query": str, "relevant_ids": list[str]}
    retrieve_fn: callable(query: str, k: int) -> list[str] of document IDs
    """
    reciprocal_ranks = []

    for pair in eval_set:
        results = retrieve_fn(pair["query"], k)
        relevant = set(pair["relevant_ids"])

        rr = 0.0
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break

        reciprocal_ranks.append(rr)

    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

MRR@10 is the right primary metric for code search and most single-answer retrieval tasks. The user submits a query expecting to find one specific thing. If it is not in the first three results, they revise their query or give up. The fact that the model found 7 other useful files at positions 8–20 is irrelevant to the user experience.

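A score above 0.4 is strong for code search. Because MRR averages reciprocal ranks rather than ranks, 0.4 does not mean the relevant document sits near rank 2.5 on every query. It is the score you would get if every query landed at rank 2.5, but it is equally consistent with 40% of queries hitting rank 1 and the rest missing the top 10 entirely, so look at the per-query distribution as well as the mean. Below 0.3, users will notice the quality problems. Below 0.2, you have a retrieval system that is failing at its core job.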

> **Key Insight:** MRR@10 heavily penalizes cases where the relevant document appears at rank 2 vs. rank 1 (0.5 vs 1.0), but barely penalizes the difference between rank 9 and rank 10 (0.11 vs 0.10). This matches how humans actually use search results — position 1 matters enormously, positions 8–10 barely differ.
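Because the metric is a mean of reciprocals, very different rank distributions can produce the same score. The small worked example below uses made-up ranks (both lists are hypothetical) to show two systems that tie on MRR@10 while behaving quite differently.

```python
def mrr_from_ranks(ranks, k=10):
    """ranks: 1-based rank of the first relevant result per query, or None for a miss."""
    reciprocal = [1.0 / r if r is not None and r <= k else 0.0 for r in ranks]
    return sum(reciprocal) / len(reciprocal)

# System A: always finds the answer, at rank 2 or rank 4.
steady = [2] * 6 + [4] * 4

# System B: rank 1 on four queries, nothing in the top 10 on the other six.
bimodal = [1] * 4 + [None] * 6

print(f"steady:  {mrr_from_ranks(steady):.2f}")
print(f"bimodal: {mrr_from_ranks(bimodal):.2f}")
```

Both lists average to the same value, which is why it helps to inspect the miss rate alongside the headline number.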

### Recall@k

Recall@k answers a different question: of all the relevant documents in your corpus, what fraction appear in the top k results? This is the right metric when your use case involves finding multiple relevant documents — not just one.

```python
def recall_at_k(eval_set: list[dict], retrieve_fn, k: int = 20) -> float:
    recalls = []

    for pair in eval_set:
        results = retrieve_fn(pair["query"], k)
        relevant = set(pair["relevant_ids"])
        retrieved_relevant = len(relevant.intersection(set(results)))
        recall = retrieved_relevant / len(relevant) if relevant else 0.0
        recalls.append(recall)

    return sum(recalls) / len(recalls)
```

Recall@k is the primary metric for RAG pipelines where the LLM downstream needs to see all relevant chunks to produce a correct answer. If you are building a Q&A system over documentation, you want Recall@10 to be high — missing a relevant chunk means the language model may give an incomplete or incorrect answer, even if it has other relevant context.

For most RAG applications, Recall@10 above 0.75 is a practical target. Below 0.6, your answers will frequently miss important context.

### NDCG (Normalized Discounted Cumulative Gain)

NDCG is the most nuanced of the three metrics. It handles graded relevance — cases where some documents are "very relevant," some are "somewhat relevant," and some are irrelevant. It also discounts results by position, similar to MRR, but unlike MRR it gives partial credit to documents at lower positions rather than only counting the first hit.

```python
import math

def ndcg_at_k(eval_set: list[dict], retrieve_fn, k: int = 10) -> float:
    """
    NDCG@k with binary relevance (relevant=1, not relevant=0).
    For graded relevance, replace binary check with a relevance score dict.
    """
    ndcg_scores = []

    for pair in eval_set:
        results = retrieve_fn(pair["query"], k)
        relevant = set(pair["relevant_ids"])

        # DCG: sum of relevance/log2(rank+1) for top-k results
        dcg = sum(
            1.0 / math.log2(rank + 1)
            for rank, doc_id in enumerate(results[:k], start=1)
            if doc_id in relevant
        )

        # Ideal DCG: all relevant docs at top positions
        ideal_k = min(len(relevant), k)
        idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_k + 1))

        ndcg_scores.append(dcg / idcg if idcg > 0 else 0.0)

    return sum(ndcg_scores) / len(ndcg_scores)
```

NDCG is most useful when you have graded relevance in your eval set — when you have marked some documents as primary answers and others as secondary, related results. For binary relevance (one right answer per query), NDCG and MRR@k tell similar stories. For multi-document retrieval with quality gradations, NDCG captures the full picture.

> **Warning:** Do not report NDCG when you have binary relevance annotations unless you also report MRR@k or Recall@k alongside it. With a single relevant document, NDCG@10 reduces to 1/log2(rank + 1), which decays more slowly than MRR's 1/rank: a hit at rank 10 still earns about 0.29, versus 0.10 under MRR. That is not wrong, but it flattens the distinction between models that place the one relevant document first and models that scatter it throughout the results.
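For eval sets that do carry graded relevance, the per-query computation changes only slightly. This is a sketch under stated assumptions: grades are supplied as a hypothetical `grades` dict mapping document ID to an integer grade (0 for irrelevant), and gain is linear in the grade. Some formulations use an exponential gain of 2^grade - 1 instead, which rewards top-grade documents more heavily.

```python
import math

def graded_ndcg_at_k(results, grades, k=10):
    """NDCG@k for one query with linear gain.

    results: ranked list of document IDs returned by the retriever
    grades:  dict of doc_id -> relevance grade (missing IDs count as 0)
    """
    dcg = sum(
        grades.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(results[:k], start=1)
    )
    # Ideal ordering: highest grades first, truncated to k.
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical grades: "a" is the primary answer, "b" a secondary one.
grades = {"a": 2, "b": 1}
primary_first = graded_ndcg_at_k(["a", "x", "b"], grades)
secondary_first = graded_ndcg_at_k(["b", "x", "a"], grades)
print(f"primary first: {primary_first:.3f}, secondary first: {secondary_first:.3f}")
```

Averaging this value over all queries gives the corpus-level NDCG@k, in the same way the binary version above averages its per-query scores.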

### Which Metric to Make Primary

The decision tree is straightforward:
- Code search, Q&A with single answers, command lookup → **MRR@10**
- RAG pipelines, multi-document retrieval, research tools → **Recall@10 or Recall@20**
- Ranking systems with graded relevance, recommendation → **NDCG@10**

For most developer-facing code search and RAG systems, MRR@10 and Recall@10 together give a complete picture. Report both. If they disagree — one model wins on MRR, another wins on Recall — think about which failure mode is worse for your users and weight accordingly.

### Computing Your Baseline Now

With your eval set from Chapter 2 and these metric implementations, you can compute your BM25 baseline right now:

```python
from rank_bm25 import BM25Okapi
import json
import re
from pathlib import Path

def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics so snake_case identifiers
    and punctuation-adjacent names break into searchable terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def load_corpus(corpus_path: str) -> dict[str, str]:
    """Load all Python files as a dict of {file_path: content}."""
    corpus = {}
    for fp in Path(corpus_path).rglob("*.py"):
        try:
            # as_posix() keeps IDs in forward-slash form so they match
            # the relevant_ids in the eval set on every platform.
            corpus[fp.as_posix()] = fp.read_text(errors="replace")
        except OSError:
            continue  # skip unreadable files
    return corpus

def build_bm25_retriever(corpus: dict[str, str]):
    doc_ids = list(corpus.keys())
    tokenized = [tokenize(corpus[d]) for d in doc_ids]
    bm25 = BM25Okapi(tokenized)

    def retrieve(query: str, k: int = 10) -> list[str]:
        scores = bm25.get_scores(tokenize(query))
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [doc_ids[i] for i in top_k]

    return retrieve

with open("eval_set.json") as f:
    eval_data = json.load(f)

corpus = load_corpus("src/")
bm25_retrieve = build_bm25_retriever(corpus)
eval_pairs = eval_data["pairs"]

baseline_mrr = mrr_at_k(eval_pairs, bm25_retrieve, k=10)
baseline_recall = recall_at_k(eval_pairs, bm25_retrieve, k=10)

print(f"BM25 Baseline — MRR@10: {baseline_mrr:.3f}, Recall@10: {baseline_recall:.3f}")
```

This number is your floor. Write it down. Every model you evaluate gets compared against it.

> **Try This:** After computing your BM25 baseline, look at the queries where BM25 scores 0 (the relevant document is not in the top 10 at all). These are your highest-value queries for differentiating embedding models — lexical matching completely fails here, and semantic understanding is the only path to getting them right.
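Finding those zero-score queries requires per-query results rather than the averaged metric. A minimal sketch, assuming the same `retrieve_fn(query, k)` contract used by the metric functions above; `reciprocal_rank_per_query` and the toy retriever are hypothetical names for illustration.

```python
def reciprocal_rank_per_query(eval_pairs, retrieve_fn, k=10):
    """Return a list of (query, reciprocal_rank) tuples, one per eval pair."""
    rows = []
    for pair in eval_pairs:
        results = retrieve_fn(pair["query"], k)[:k]
        relevant = set(pair["relevant_ids"])
        # Reciprocal rank of the first relevant hit, or 0.0 if none is in the top k.
        rr = next(
            (1.0 / rank for rank, doc_id in enumerate(results, start=1) if doc_id in relevant),
            0.0,
        )
        rows.append((pair["query"], rr))
    return rows

def missed_queries(rows):
    """Queries whose relevant document never appeared in the top k."""
    return [query for query, rr in rows if rr == 0.0]

# Toy retriever standing in for BM25 so the example runs on its own.
canned_results = {
    "retry logic with exponential backoff": ["src/utils/retry.py", "src/http/client.py"],
    "stop hammering a flaky service": ["src/http/client.py", "src/config.py"],
}

def toy_retrieve(query, k):
    return canned_results.get(query, [])[:k]

sample_pairs = [
    {"query": "retry logic with exponential backoff", "relevant_ids": ["src/utils/retry.py"]},
    {"query": "stop hammering a flaky service", "relevant_ids": ["src/utils/retry.py"]},
]
rows = reciprocal_rank_per_query(sample_pairs, toy_retrieve)
print(missed_queries(rows))
```

Passing the BM25 retriever in place of `toy_retrieve` yields the list of queries where keyword matching fails outright.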

**Key Takeaways:**
- MRR@10 measures how high the first relevant result appears; ideal for single-answer retrieval.
- Recall@k measures coverage of all relevant documents; ideal for RAG and multi-document retrieval.
- NDCG handles graded relevance; most useful when you have primary vs. secondary annotation.
- A score above 0.4 MRR@10 is strong for code search.
- BM25 is your minimum bar — an embedding model that cannot beat keyword search on your data is not worth deploying.

**Practical Exercise:**
Run the BM25 baseline code above on your corpus and eval set. Record MRR@10 and Recall@10. Then look at the 5 queries with the lowest BM25 scores. Write down *why* BM25 fails on those queries — is it vocabulary mismatch? Abbreviated terms? Conceptual queries with no matching keywords? These failure modes will tell you what to look for when evaluating embedding models.

---

## Chapter 4: Running a Model Bake-Off on Your Codebase

A bake-off is a controlled comparison of candidate models on your eval set. The goal is to isolate the variable you care about — embedding model quality — by holding everything else constant: the same corpus, the same queries, the same evaluation code, the same index parameters.

This chapter gives you the complete framework: which models to include, how to structure the comparison, what implementation details will corrupt your results if you get them wrong, and how to read the output.

### Choosing Your Candidate Set

You do not need to evaluate every model on Hugging Face. You need three to five candidates that represent meaningfully different points in the design space. For code search, a reasonable starting set:

**Codestral Embed** — Mistral's code-focused embedding model, trained on code corpora. Strong baseline for retrieval over mixed code and documentation. API-only.

**bge-large-en-v1.5** — BAAI's general retrieval model, consistently strong across diverse BEIR tasks. 335M parameters, 1024-dim output, available as a local model. Excellent if your corpus includes prose documentation alongside code.

**e5-mistral-7b-instruct** — 7B parameter model with strong retrieval quality at the cost of inference speed and memory. Represents the high end of the quality/cost curve for general models.

**text-embedding-3-small / text-embedding-3-large** — OpenAI's embedding models. Wide deployment, well-understood behavior, variable dimensions (matryoshka embeddings allow trading quality for storage). Easy to integrate but API-only.

**Your domain-specific candidate** — If you are evaluating code search specifically, any model trained on code (CodeBERT, GraphCodeBERT, or PyckLM if you are using it) should be in your set. PyckLM's 0.456 MRR@10 on CodeSearchNet is a useful reference point — it outperforms GraphCodeBERT by 62% because the training distribution matches the query distribution.

> **Key Insight:** Include at least one general model and one domain-specialized model in your bake-off. The most important question you are answering is whether domain specialization is worth the operational complexity — and you cannot answer it without both data points.

### The Bake-Off Framework

Structure your bake-off as a single script that: loads each model, embeds the corpus, embeds the eval queries, retrieves top-k results for each query, and computes metrics. Keep the embedding and retrieval logic identical across models so that differences in scores reflect model quality, not implementation differences.

```python
import time
import numpy as np
import json
from pathlib import Path
from dataclasses import dataclass
from typing import Callable

# mrr_at_k and recall_at_k are the Chapter 3 implementations.

@dataclass
class ModelResult:
    name: str
    mrr_10: float
    recall_10: float
    recall_20: float
    embed_time_sec: float
    dims: int

def cosine_similarity_batch(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    query_norm = query_vec / (np.linalg.norm(query_vec) + 1e-10)
    doc_norms = doc_vecs / (np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-10)
    return doc_norms @ query_norm

def run_bake_off(
    eval_pairs: list[dict],
    corpus: dict[str, str],
    models: list[tuple[str, Callable]],
) -> list[ModelResult]:
    doc_ids = list(corpus.keys())
    doc_texts = [corpus[d] for d in doc_ids]
    results = []

    for model_name, embed_fn in models:
        print(f"\nEvaluating: {model_name}")

        # perf_counter is monotonic, so the timing is not skewed by clock changes.
        t0 = time.perf_counter()
        doc_vecs = embed_fn(doc_texts)  # shape: (n_docs, dims)
        embed_time = time.perf_counter() - t0

        doc_vecs = np.array(doc_vecs)
        dims = doc_vecs.shape[1]

        def retrieve(query: str, k: int) -> list[str]:
            query_vec = np.array(embed_fn([query])[0])
            sims = cosine_similarity_batch(query_vec, doc_vecs)
            top_k = np.argsort(sims)[::-1][:k]
            return [doc_ids[i] for i in top_k]

        mrr = mrr_at_k(eval_pairs, retrieve, k=10)
        r10 = recall_at_k(eval_pairs, retrieve, k=10)
        r20 = recall_at_k(eval_pairs, retrieve, k=20)

        results.append(ModelResult(
            name=model_name,
            mrr_10=mrr,
            recall_10=r10,
            recall_20=r20,
            embed_time_sec=embed_time,
            dims=dims,
        ))

        print(f"  MRR@10: {mrr:.3f} | Recall@10: {r10:.3f} | Recall@20: {r20:.3f} | Time: {embed_time:.1f}s | Dims: {dims}")

    return results
```

### The "Mushy Middle" Problem

One failure mode worth understanding before you run your bake-off: MSE-trained models tend to produce similarity scores clustered around 0.5. The model has learned to minimize mean squared error on relevance judgments, which pushes scores toward the middle of the range regardless of whether the document is actually relevant.

The consequence: when you retrieve top-k by cosine similarity, you are sorting a sea of 0.48s and 0.51s. The difference between a directly relevant document and a marginally related one is 0.03 similarity points — within the noise of the model's uncertainty. This destroys your ranking quality even when the model has technically encoded the right information.

Models trained with Margin Ranking Loss sidestep this problem. The loss function explicitly forces the model to place true positives significantly above hard negatives — not just slightly. The result is a much wider separation between relevant and irrelevant documents in the similarity score distribution, which makes your ranking substantially more reliable.
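
One way to check a candidate for this problem is to compare the similarity scores it assigns to known-relevant pairs against the scores it assigns to non-relevant pairs. The following is a minimal sketch, not a standard metric: `score_separation` is a hypothetical helper, and the example scores are made up to show the two regimes.

```python
import numpy as np

def score_separation(pos_sims, neg_sims) -> dict:
    """Summarize how far apart relevant and non-relevant similarity scores sit.

    pos_sims: similarities for (query, relevant document) pairs
    neg_sims: similarities for (query, non-relevant document) pairs
    """
    pos = np.asarray(pos_sims, dtype=float)
    neg = np.asarray(neg_sims, dtype=float)
    gap = float(pos.mean() - neg.mean())
    # Pooled standard deviation expresses the gap in units of score noise.
    pooled_std = float(np.sqrt((pos.var() + neg.var()) / 2.0))
    return {
        "mean_pos": float(pos.mean()),
        "mean_neg": float(neg.mean()),
        "gap": gap,
        "gap_in_stds": gap / (pooled_std + 1e-10),
    }

# Illustrative (made-up) scores: a "mushy" model vs. a well-separated one.
mushy = score_separation([0.51, 0.52, 0.50], [0.48, 0.49, 0.50])
wide = score_separation([0.82, 0.78, 0.85], [0.21, 0.30, 0.25])
print(f"mushy gap: {mushy['gap']:.3f}, wide gap: {wide['gap']:.3f}")
```

A small gap relative to the spread of the scores suggests the ranking is being decided by noise, whichever loss the model was trained with.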

> **Warning:** If you see your bake-off results all clustering in the 0.3–0.5 MRR@10 range across wildly different models, check whether you are comparing the similarity scores to a threshold. With an MSE-trained model whose scores sit between 0.48 and 0.51, a static threshold of 0.5 splits relevant from irrelevant almost arbitrarily and produces meaningless precision/recall numbers. Always rank by similarity and evaluate by rank, not by threshold.

### Controlling for Index Parameters

A common bake-off corruption: you test Model A with a flat (exact) index and Model B with an HNSW approximate index, then report that Model B is faster and close in quality. You have not tested the models — you have tested the indexes.

Keep index type identical across all models in your bake-off. During evaluation, use exact nearest neighbor search (flat cosine similarity over the full corpus). This eliminates approximation error as a variable. If your production corpus is large enough that exact search is too slow, address that as a post-selection optimization with the winning model.

Similarly, L2-normalize your vectors before scoring, so that a plain dot product is equal to cosine similarity. Embedding models produce vectors of varying magnitude, and an unnormalized dot product will favor documents with larger vector norms, which tends to correlate with document length, not relevance.
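
A small sketch of why this matters, using two hand-picked 2-D vectors rather than real embeddings. The helper name `l2_normalize` is illustrative. The second document points in a worse direction but has a much larger norm, so it wins on raw dot product and loses on cosine similarity.

```python
import numpy as np

def l2_normalize(vecs: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product equals cosine similarity."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / (norms + 1e-10)

query = np.array([[1.0, 0.0]])
docs = np.array([
    [0.9, 0.1],   # almost exactly along the query direction, small norm
    [3.0, 3.0],   # 45 degrees off the query direction, large norm
])

raw_scores = docs @ query[0]                               # unnormalized dot product
cos_scores = l2_normalize(docs) @ l2_normalize(query)[0]   # cosine similarity

top_by_dot = int(np.argmax(raw_scores))
top_by_cosine = int(np.argmax(cos_scores))
print(top_by_dot, top_by_cosine)
```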

### Reading Your Bake-Off Output

After running the bake-off, you will have a table like:

```
Model                   MRR@10  Recall@10  Recall@20  Time(s)  Dims
BM25 Baseline           0.312   0.481      0.612      -        -
bge-large-en-v1.5       0.378   0.554      0.701      12.4     1024
Codestral Embed         0.421   0.608      0.748      8.1      1024
e5-mistral-7b-instruct  0.447   0.631      0.774      94.2     4096
text-embedding-3-small  0.363   0.527      0.679      3.2      1536
PyckLM (domain)         0.456   0.641      0.786      6.8      384
```

The interesting comparisons are not just the top-line numbers. Look at:
- Which models beat BM25? (All should — if one doesn't, drop it.)
- What is the gap between MRR@10 and Recall@10? A large gap means the model finds relevant documents but not at the top.
- Does the domain-specialized model outperform general models with many more parameters?
- Where does inference time matter relative to quality gain?
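
The BM25 row in the table needs a retriever of its own, since BM25 does not go through an embedding function. Below is a minimal pure-Python sketch of Okapi BM25 that exposes the same `retrieve(query, k)` shape as the closure inside `run_bake_off`, so it can be passed straight to `mrr_at_k` and `recall_at_k`. The class name `BM25Retriever`, the `tokenize` helper, the toy corpus, and the `k1`/`b` defaults are illustrative assumptions; a maintained BM25 library is a reasonable substitute.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercased alphanumeric tokens. Underscores are kept, so an identifier
    # like validate_token is one token and will not match the query word "token".
    return re.findall(r"[a-z0-9_]+", text.lower())

class BM25Retriever:
    """Minimal Okapi BM25 over an in-memory corpus of {doc_id: text}."""

    def __init__(self, corpus: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_ids = list(corpus.keys())
        self.term_freqs = [Counter(tokenize(corpus[d])) for d in self.doc_ids]
        self.doc_lens = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.doc_lens) / max(len(self.doc_lens), 1)
        # Document frequency: number of documents containing each term.
        doc_freq = Counter(term for tf in self.term_freqs for term in tf)
        n = len(self.doc_ids)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {
            term: math.log(1.0 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def retrieve(self, query: str, k: int) -> list[str]:
        query_terms = tokenize(query)
        scores = []
        for tf, doc_len in zip(self.term_freqs, self.doc_lens):
            length_norm = 1.0 - self.b + self.b * doc_len / (self.avg_len or 1.0)
            score = 0.0
            for term in query_terms:
                freq = tf.get(term, 0)
                if freq:
                    score += self.idf[term] * freq * (self.k1 + 1.0) / (freq + self.k1 * length_norm)
            scores.append(score)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.doc_ids[i] for i in ranked[:k]]

corpus = {
    "auth.py": "def validate_token(token): check the signature and expiry of the token",
    "cache.py": "def get_cached(key): return the cached value for key",
    "db.py": "def connect(url): open a database connection",
}
bm25 = BM25Retriever(corpus)
top = bm25.retrieve("token expiry check", k=2)
print(top)
```

With this in place, `mrr_at_k(eval_pairs, bm25.retrieve, k=10)` gives the baseline number to put at the top of the table.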

> **Try This:** After your bake-off, take the 5 queries where MRR@10 is 0 for your best model (the relevant document is not in the top 10). Read those queries alongside the relevant documents. Are they genuinely semantically similar? If not, your annotation may be wrong. If yes, you have found the model's specific failure mode — which may be addressable with a different chunking strategy or query preprocessing step.

**Key Takeaways:**
- Evaluate three to five models that represent different design points: general large, general small, domain-specialized, API-hosted.
- Use exact nearest neighbor search (no approximation) during bake-off to isolate model quality from index type.
- MSE-trained models produce mushy similarity scores that corrupt ranking; prefer models trained with Margin Ranking Loss.
- Normalize vectors so that dot-product scoring equals cosine similarity and vector magnitude cannot bias the ranking.
- BM25 must be in your comparison table as the baseline to beat.

**Practical Exercise:**
Implement `run_bake_off` against your eval set with at least two embedding models (one API, one local HuggingFace model) and BM25. Report the results table. Note which model wins on MRR@10 vs. Recall@10 — if different models win different metrics, decide which metric is primary for your use case before selecting a model.

---

## Chapter 5: The Fine-Tuned vs. General Model Decision

Fine-tuning is the highest-leverage optimization available for embedding quality — when you have the data to support it. Without the data, fine-tuning is expensive, time-consuming, and frequently produces models that are marginally better on your eval set but more brittle in production. The decision point is specific enough that you should be able to resolve it in an afternoon.

### What Fine-Tuning Actually Does

Fine-tuning an embedding model means continuing training on your domain data. You are adjusting the model's weights to better encode the similarity relationships that appear in your corpus. If your training signal is accurate — if you have (query, positive, negative) triplets that faithfully represent which documents should be near which queries — you can substantially improve retrieval quality on your domain.

The key phrase is "faithfully represent." Fine-tuning on bad data does not produce a bad general model. It produces a model that has overfit to whatever noise or bias exists in your training data and generalizes poorly. This is the most common fine-tuning failure mode: developers collect whatever triplets are available (often synthetic, often noisy), train for a few epochs, see slight improvement on their eval set, and deploy — only to find that the model has learned surface patterns that do not generalize.

### The Data Threshold

The empirical threshold for fine-tuning to meaningfully outperform a calibrated general model on code search is 50,000 high-quality (query, positive, negative) triplets. Below that number, you are almost certainly better off investing the effort in better eval methodology, better query preprocessing, or a better off-the-shelf model.

That 50K number is not a theoretical bound — it is an observation about where the variance of fine-tuning results starts to narrow. With fewer triplets, the improvement you get is highly dependent on the specific triplets you chose. The confidence interval around your expected improvement is wide enough to include "makes things worse." Above 50K good triplets, fine-tuned models reliably outperform their base general models on in-domain retrieval by 10–25%.

> **Warning:** Synthetic triplets generated by an LLM count for significantly less than real (query, positive) pairs collected from user behavior. A language model generating queries for documents tends to produce queries that lexically overlap with the document — which any BM25 retriever handles well and does not require semantic understanding. If you are counting on LLM-generated triplets to reach your 50K threshold, double the threshold.

### What High-Quality Triplets Look Like

A training triplet for embedding fine-tuning consists of:
- **Query**: a natural language question or search string
- **Positive**: a document that correctly answers the query
- **Hard negative**: a document that looks superficially similar to the positive but is not the correct answer

The hard negative is critical. Easy negatives (randomly sampled documents) stop teaching the model much after the earliest stage of training. Hard negatives (documents that confuse the current model) force the model to learn fine-grained distinctions.

For code search, hard negatives can be generated by:
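1. Taking the top-5 results of your current model for a query and removing the true positive. If the positive was in the top 5, the remaining 4 are hard negatives: the model ranked them highly, but they are not the labeled answer. Spot-check them, because an unlabeled but genuinely relevant document would become a false negative.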
2. Selecting functions with similar names but different behavior (e.g., `validate_token` and `refresh_token` as negatives for a query about token validation).
3. Using BM25 top results as negatives for queries where BM25 fails — these are documents that match on keywords but not semantics.
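
The first strategy in the list above can be sketched in a few lines. This assumes a `retrieve_fn(query, k)` that returns ranked document IDs, as in the bake-off script; `mine_hard_negatives` and the toy `fake_retrieve` are hypothetical names used only for illustration.

```python
from typing import Callable

def mine_hard_negatives(
    query: str,
    positive_ids: set[str],
    retrieve_fn: Callable[[str, int], list[str]],
    top_n: int = 5,
) -> list[str]:
    """Return the current model's top-n results minus the known positives."""
    candidates = retrieve_fn(query, top_n)
    return [doc_id for doc_id in candidates if doc_id not in positive_ids]

# Toy stand-in for a real retriever, returning a fixed ranking.
def fake_retrieve(query: str, k: int) -> list[str]:
    ranked = ["validate_token", "refresh_token", "revoke_token", "parse_jwt", "login", "logout"]
    return ranked[:k]

negatives = mine_hard_negatives("how do we validate a token", {"validate_token"}, fake_retrieve)
print(negatives)
```

The same function covers the third strategy if `retrieve_fn` is a BM25 retriever instead of the embedding model.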

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def build_training_examples(triplets: list[dict]) -> list[InputExample]:
    """
    triplets: list of {"query": str, "positive": str, "negative": str}
    """
    return [
        InputExample(texts=[t["query"], t["positive"], t["negative"]])
        for t in triplets
    ]

def fine_tune_embedding_model(
    base_model_name: str,
    triplets: list[dict],
    output_path: str,
    epochs: int = 3,
    batch_size: int = 32,
    warmup_steps: int = 200,
) -> None:
    model = SentenceTransformer(base_model_name)
    train_examples = build_training_examples(triplets)
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)

    # TripletLoss is sentence-transformers' margin-based triplet objective: the
    # negative must sit farther from the query than the positive by at least a margin
    train_loss = losses.TripletLoss(model=model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        warmup_steps=warmup_steps,
        output_path=output_path,
        show_progress_bar=True,
    )
    print(f"Fine-tuned model saved to {output_path}")
```

### When to Fine-Tune vs. When Not To

**Fine-tune when:**
- You have 50K+ verified (query, positive, negative) triplets with real positives, not synthetic
- Your domain vocabulary is genuinely novel (new programming languages, proprietary frameworks, internal APIs)
- You have infrastructure to retrain regularly as your corpus changes
- Your best general model achieves MRR@10 < 0.35 and there is no better off-the-shelf option to try

**Do not fine-tune when:**
- Your eval set has fewer than 100 pairs (you cannot reliably measure improvement)
- Your triplet count is below 50K (use the effort to collect better data instead)
- Your best general model is already achieving MRR@10 > 0.42 (diminishing returns)
- You do not have infrastructure to maintain the fine-tuned model as your corpus evolves

> **Key Insight:** A well-calibrated general model — one where you have tuned the retrieval top-k, applied result scoring, and preprocessed queries appropriately — frequently outperforms a poorly fine-tuned domain model. Calibration is reversible and cheap. Fine-tuning is expensive and creates a maintenance burden. Exhaust calibration options before fine-tuning.

### The Maintenance Problem

Fine-tuned models create a dependency: your retrieval quality is now coupled to a specific model artifact that reflects the distribution of data at fine-tuning time. As your codebase evolves, new patterns emerge that the fine-tuned model has not seen. If you do not retrain periodically, your fine-tuned model will gradually fall behind a general model that you could simply upgrade to a newer version.

Plan for quarterly retraining if you fine-tune. Budget the compute. Budget the human annotation time to keep the triplet set current. If you cannot commit to that cadence, the maintenance burden of fine-tuning outweighs the quality gain for most teams.

**Key Takeaways:**
- Fine-tuning requires 50K+ high-quality triplets with real positives before it reliably outperforms calibrated general models.
- Hard negatives are required for effective fine-tuning; easy (random) negatives teach very little.
- LLM-generated synthetic triplets count for significantly less than real user-behavior data.
- Calibrated general models frequently outperform poorly fine-tuned domain models.
- Plan for quarterly retraining if you fine-tune; without it, your fine-tuned model will degrade as your corpus evolves.

**Practical Exercise:**
Audit your available training data. Count how many real (query, positive) pairs you can construct from user logs, annotations, or explicitly curated data. If the count is below 10K, document why and set a roadmap for collection. If it is above 50K, assess the quality of hard negatives you can construct from your current best model's top results.

---

## Chapter 6: Cost-Quality Tradeoffs Across Model Sizes

The best embedding model is not the one with the highest MRR@10 on your eval set. It is the one with the highest MRR@10 subject to your latency, throughput, and storage constraints. Those constraints are real, and ignoring them in model selection is how you end up with a system that works in benchmarks and fails in production.

### The Dimension Tradeoff

Embedding dimension is one of the most directly quantifiable cost parameters. A 384-dimensional embedding requires 1.5 KB of storage per document (384 floats × 4 bytes). A 1536-dimensional embedding requires 6 KB. That is a 4× storage multiplier for a quality gain that is often marginal — especially on in-domain data where the additional dimensions are encoding information the model already captures adequately in fewer dimensions.

At 100K documents: 384-dim costs 150 MB. 1536-dim costs 600 MB. At 10M documents (a mid-size codebase or large documentation set), that gap becomes 15 GB vs. 60 GB just for the vector store, before indexing overhead.
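
The arithmetic is simple enough to script for your own corpus size. A minimal sketch, assuming float32 storage and ignoring index overhead; `vector_store_bytes` is an illustrative helper name.

```python
def vector_store_bytes(n_docs: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for n_docs vectors of `dims` values (float32 by default)."""
    return n_docs * dims * bytes_per_value

for n_docs in (100_000, 10_000_000):
    for dims in (384, 1536):
        gb = vector_store_bytes(n_docs, dims) / 1e9
        print(f"{n_docs:>10,} docs x {dims:>4} dims: {gb:.2f} GB")
```

Setting `bytes_per_value` to 2 or 1 gives a quick estimate for float16 or int8 quantized storage, if your vector store supports it.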

The quality question: does the extra storage buy you meaningfully better retrieval? On out-of-domain data (web search, academic text), higher-dimensional models consistently outperform lower-dimensional ones. On in-domain data with a model trained for that domain, the relationship breaks down. A 384-dim model trained specifically on code can outperform a 1536-dim general model because the training distribution matters more than the representational capacity.

> **Key Insight:** Dimension reduction is often worth the investigation. OpenAI's text-embedding-3 series supports matryoshka representation — you can truncate embeddings to a lower dimension at inference time with modest quality loss. Test whether 512-dim truncated embeddings from text-embedding-3-large match the quality of the full 3072-dim representation on your specific eval set. The answer is frequently yes.
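
If you already have full-size vectors stored, the truncation test can be run offline. The sketch below keeps the leading components and re-normalizes; it is only meaningful for models trained with a matryoshka objective, and the random vectors are a stand-in to show shapes, not real embeddings. `truncate_and_renormalize` is an illustrative name. The OpenAI embeddings endpoint also accepts a `dimensions` parameter for the text-embedding-3 models, which performs the shortening server-side.

```python
import numpy as np

def truncate_and_renormalize(vecs: np.ndarray, target_dims: int) -> np.ndarray:
    """Keep the first target_dims components of each row, then rescale to unit length."""
    if target_dims > vecs.shape[1]:
        raise ValueError("target_dims exceeds the embedding dimension")
    truncated = vecs[:, :target_dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / (norms + 1e-10)

# Random stand-in for real 3072-dim embeddings, only to show the shapes involved.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 3072))
small = truncate_and_renormalize(full, 512)
print(small.shape)
```

Run your bake-off metrics once on the full vectors and once on the truncated ones; the comparison on your eval set is the only number that matters here.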

### Model Size vs. Inference Latency

Model parameter count is the main driver of inference latency for local models, alongside sequence length, batch size, and hardware. The practical tiers for local deployment:

**Up to ~100M parameters** (e5-small, all-MiniLM-L6-v2): 2–5ms per batch of 64 documents on GPU, 15–40ms on CPU. Acceptable for real-time retrieval over small corpora. Quality limited on complex semantic tasks.

**~300–400M parameters** (bge-large-en-v1.5, e5-large; BGE-M3 is somewhat larger, at roughly 570M): 8–20ms per batch on GPU. Strong quality across retrieval tasks. The practical default for most production systems.

**7B+ parameters** (e5-mistral-7b-instruct, Llama-based embedders): 50–200ms per batch on GPU. High quality but high cost. Justified only when quality cannot be achieved at smaller scale, or when you are running batch indexing rather than real-time retrieval.

```python
import time
import numpy as np
from sentence_transformers import SentenceTransformer

def benchmark_model_latency(
    model_name: str,
    sample_texts: list[str],
    batch_size: int = 64,
    n_warmup: int = 3,
    n_trials: int = 10,
) -> dict:
    model = SentenceTransformer(model_name)
    batch = sample_texts[:batch_size]

    # Pass batch_size explicitly: encode() otherwise splits the input into its
    # own default-sized mini-batches, so a "batch of 64" would be several passes.
    def encode_once() -> None:
        model.encode(batch, batch_size=batch_size, show_progress_bar=False)

    # Warmup
    for _ in range(n_warmup):
        encode_once()

    latencies = []
    for _ in range(n_trials):
        t0 = time.perf_counter()
        encode_once()
        latencies.append(time.perf_counter() - t0)

    arr = np.array(latencies)
    # With only n_trials samples, p95/p99 are close to the max; raise n_trials for stable tails.
    return {
        "model": model_name,
        "batch_size": len(batch),
        "p50_ms": float(np.percentile(arr, 50) * 1000),
        "p95_ms": float(np.percentile(arr, 95) * 1000),
        "p99_ms": float(np.percentile(arr, 99) * 1000),
        "docs_per_sec": len(batch) / float(np.percentile(arr, 50)),
    }
```

### API vs. Local Model Cost

For API-hosted models, the cost structure is per-token, not per-inference. OpenAI's text-embedding-3-small is $0.02 per million tokens; text-embedding-3-large is $0.13 per million tokens. At 100 tokens average per document:

- 100K documents indexed once: $0.20 (small) / $1.30 (large)
- 10M documents: $20 / $130
- 1M queries/day × 30 tokens per query: $0.60/day (small) / $3.90/day (large)

These numbers are small enough that for most teams, API cost is not the deciding factor. The relevant API vs. local decision points are: data privacy (can your documents leave your infrastructure?), latency requirements (network round-trip adds 50–150ms), and reproducibility (API model versions change without notice, which can shift your eval metrics).
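
To rerun the cost arithmetic above for your own corpus and query volume, a sketch like the following is enough. The prices are the ones quoted in this section and should be checked against current pricing; `embedding_cost_usd` is an illustrative helper name.

```python
def embedding_cost_usd(n_items: int, tokens_per_item: float, usd_per_million_tokens: float) -> float:
    """Cost of embedding n_items texts at a flat per-token price."""
    total_tokens = n_items * tokens_per_item
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Prices as quoted in the text; verify before budgeting.
PRICES = {"text-embedding-3-small": 0.02, "text-embedding-3-large": 0.13}

for model, price in PRICES.items():
    index_cost = embedding_cost_usd(10_000_000, 100, price)      # one-time corpus indexing
    daily_query_cost = embedding_cost_usd(1_000_000, 30, price)  # one day of queries
    print(f"{model}: index ${index_cost:.2f}, queries ${daily_query_cost:.2f}/day")
```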

> **Warning:** Hosted embedding models can be updated, deprecated, or replaced on the provider's schedule, not yours. If you build an index with one model version and queries are served by a different version, the query and document vectors no longer share a space and your similarity scores will be meaningfully wrong. Pin to explicit model versions where supported, and test any model upgrade against your eval set before re-indexing production.

### The Static Top-k Problem

One cost issue that has nothing to do with model size but profoundly affects retrieval quality: a static top-k retrieval count. If you always retrieve top_k=20, and for most queries only 1 document in your corpus is truly relevant, you are injecting up to 19 irrelevant results into whatever downstream system receives your output.

This matters for RAG: every irrelevant chunk in the context window costs tokens and introduces noise into the language model's reasoning. Score calibration — learning a threshold below which results are filtered out — can dramatically improve downstream quality without requiring a better embedding model.

```python
import numpy as np

# Reuses cosine_similarity_batch from the Chapter 4 bake-off script.
def calibrated_retrieve(
    query: str,
    embed_fn,
    doc_vecs: np.ndarray,
    doc_ids: list[str],
    max_k: int = 20,
    similarity_threshold: float = 0.72,
    min_results: int = 1,
) -> list[tuple[str, float]]:
    """
    Retrieve up to max_k results, filtered to similarity_threshold.
    Always returns at least min_results (even below threshold).
    """
    query_vec = np.array(embed_fn([query])[0])
    sims = cosine_similarity_batch(query_vec, doc_vecs)
    top_k_idx = np.argsort(sims)[::-1][:max_k]

    results = [(doc_ids[i], float(sims[i])) for i in top_k_idx]

    filtered = [(doc_id, score) for doc_id, score in results if score >= similarity_threshold]
    return filtered if len(filtered) >= min_results else results[:min_results]
```

The threshold value requires calibration against your eval set — there is no universal number. A threshold that works well for code search (typically 0.70–0.80 for well-trained models) will be different from one appropriate for prose documentation retrieval (0.60–0.75).
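
One simple way to do that calibration is to sweep candidate thresholds over labeled (score, relevant?) pairs from your eval set and keep the one that maximizes F1. This is a sketch under that assumption; F1 is one reasonable objective among several, and `calibrate_threshold`, the sweep range, and the example scores are all illustrative.

```python
import numpy as np

def calibrate_threshold(scores, is_relevant, candidates=None) -> tuple[float, float]:
    """Return (threshold, f1) maximizing F1 over labeled (score, relevance) pairs.

    scores: similarity of each retrieved (query, document) pair
    is_relevant: matching booleans from your eval annotations
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_relevant, dtype=bool)
    if candidates is None:
        candidates = np.round(np.arange(0.30, 0.96, 0.01), 2)

    best_threshold, best_f1 = float(candidates[0]), -1.0
    for threshold in candidates:
        kept = scores >= threshold
        true_pos = int(np.sum(kept & labels))
        if true_pos == 0:
            continue  # nothing relevant survives this threshold
        precision = true_pos / int(np.sum(kept))
        recall = true_pos / int(np.sum(labels))
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = float(threshold), f1
    return best_threshold, best_f1

# Made-up scores for illustration only.
example_scores = [0.91, 0.84, 0.79, 0.74, 0.70, 0.66, 0.61, 0.55]
example_labels = [True, True, True, False, True, False, False, False]
threshold, f1 = calibrate_threshold(example_scores, example_labels)
print(f"threshold={threshold:.2f}, f1={f1:.3f}")
```

If precision matters more than recall for your downstream consumer (as it often does for RAG context windows), swap F1 for a precision-weighted objective and recalibrate.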

**Key Takeaways:**
- 384-dim embeddings take a quarter of the storage of 1536-dim embeddings, often with only a marginal quality difference on in-domain data.
- For local models, 300–400M parameter models are the practical default; 7B+ only if smaller models genuinely cannot meet quality requirements.
- API model costs are usually not the deciding factor — privacy, latency, and version stability are.
- Static top_k=20 injects irrelevant results; calibrated similarity thresholds improve downstream quality without a better model.
- Measure p95 and p99 latency, not just median — tail latencies determine user experience.

**Practical Exercise:**
Benchmark the inference latency of the winning model from your Chapter 4 bake-off using the `benchmark_model_latency` function above. Run it with batch sizes of 1, 16, and 64. What is the documents-per-second throughput at each batch size? Does your production ingestion pipeline need to handle re-indexing within an acceptable time window? If not, which latency constraint are you actually optimizing for?

---

## Chapter 7: Interpreting Your Results: What Good Looks Like

You have run the bake-off. You have a table of numbers. Now you need to make a decision. This chapter is about translating metric values into concrete quality judgments — what does an MRR@10 of 0.38 actually feel like to use? When is a 0.04 improvement worth a model swap? What patterns in the results indicate a structural problem that no model swap will fix?

### Mapping Metrics to User Experience

Retrieval metrics are abstractions over user experience. The mapping is not exact, but these rough equivalences hold across a wide range of retrieval systems:

**MRR@10 > 0.50**: Users find what they are looking for on the first or second result more than half the time. This is excellent. Most users report the search as "usually works."

**MRR@10 0.35–0.50**: The relevant result is typically in the top 3–4. Users need to scan but usually find what they need. This range covers most well-deployed code search systems.

**MRR@10 0.20–0.35**: Relevant results appear in the top 5–8 on average. Users frequently miss what they need, refine queries, or abandon the tool. This is functional but frustrating.

**MRR@10 < 0.20**: The search is unreliable. Users will not trust it and will default to alternatives (grep, file-tree navigation, asking a colleague).

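A score above 0.4 is strong for code search. PyckLM's 0.456 on CodeSearchNet corresponds to a harmonic-mean rank of about 2.2 for the correct file (1 / 0.456 ≈ 2.19). Because MRR averages reciprocal ranks, this is not the arithmetic average position, but it does indicate the correct file is consistently within the first few results, which is the threshold where search feels natural rather than laborious.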

> **Key Insight:** The threshold that matters to your users is lower than you think. Most users will accept MRR@10 of 0.35 if the relevant result reliably lands in the first three or four positions. What they will not accept is variance: a system that gets some queries exactly right and fails completely on others is perceived as less trustworthy than one with consistently mediocre performance.

### The Improvement Threshold

When should a metric improvement justify a model swap in production? The answer depends on the cost of the swap (re-indexing time, integration work, validation effort) and the magnitude of the improvement.

A useful heuristic: a 0.03 improvement in MRR@10 is within the noise floor for a 50-pair eval set. Below that, you are likely measuring variance, not a real quality difference. A 0.05 improvement is large enough to warrant consideration, provided a paired significance test supports it. A 0.10+ improvement is a strong signal to switch.

These thresholds scale with your eval set size. With 200 pairs, you can detect 0.02 differences reliably. With 30 pairs, even 0.05 has significant noise.

```python
import scipy.stats as stats

def compare_models_significance(
    eval_pairs: list[dict],
    model_a_retrieve,
    model_b_retrieve,
    k: int = 10,
    alpha: float = 0.05,
) -> dict:
    """
    Paired t-test on per-query reciprocal ranks.
    Returns whether the difference is statistically significant.
    """
    rr_a, rr_b = [], []

    for pair in eval_pairs:
        relevant = set(pair["relevant_ids"])

        for rr_list, retrieve_fn in [(rr_a, model_a_retrieve), (rr_b, model_b_retrieve)]:
            results = retrieve_fn(pair["query"], k)
            rr = 0.0
            for rank, doc_id in enumerate(results, start=1):
                if doc_id in relevant:
                    rr = 1.0 / rank
                    break
            rr_list.append(rr)

    t_stat, p_value = stats.ttest_rel(rr_a, rr_b)
    mean_a = sum(rr_a) / len(rr_a)
    mean_b = sum(rr_b) / len(rr_b)

    return {
        "mrr_a": mean_a,
        "mrr_b": mean_b,
        "delta": mean_b - mean_a,
        "p_value": p_value,
        "significant": p_value < alpha,
        "better": "B" if mean_b > mean_a else "A",
    }
```
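
Reciprocal ranks are bounded and far from normally distributed, so it can be worth cross-checking the t-test with a paired bootstrap, which makes no distributional assumption. The sketch below takes the per-query reciprocal ranks for both models (the same `rr_a` and `rr_b` lists built inside the function above) and reports a 95% confidence interval on the MRR difference. The function name and the toy inputs are illustrative.

```python
import numpy as np

def bootstrap_mrr_delta(rr_a, rr_b, n_resamples: int = 10_000, seed: int = 0) -> dict:
    """Paired bootstrap over queries for the MRR difference (B minus A).

    rr_a, rr_b: per-query reciprocal ranks for the two models, in the same query order.
    """
    rr_a = np.asarray(rr_a, dtype=float)
    rr_b = np.asarray(rr_b, dtype=float)
    if rr_a.shape != rr_b.shape:
        raise ValueError("rr_a and rr_b must cover the same queries")
    diffs = rr_b - rr_a
    rng = np.random.default_rng(seed)
    # Resample query indices with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    deltas = diffs[idx].mean(axis=1)
    low, high = np.percentile(deltas, [2.5, 97.5])
    return {
        "delta": float(diffs.mean()),
        "ci_low": float(low),
        "ci_high": float(high),
        "excludes_zero": bool(low > 0 or high < 0),
    }

# Toy per-query reciprocal ranks, for illustration only.
toy_rr_a = [1.0, 0.5, 0.0, 0.33, 0.25, 0.0, 1.0, 0.5]
toy_rr_b = [1.0, 1.0, 0.5, 0.5, 0.25, 0.2, 1.0, 1.0]
result = bootstrap_mrr_delta(toy_rr_a, toy_rr_b)
print(result)
```

If the interval comfortably excludes zero and the t-test agrees, the difference is probably real. If the two disagree, treat the result as inconclusive and grow the eval set.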

### Query-Level Diagnosis

Aggregate metrics hide the most interesting information. After your bake-off, look at per-query performance across models. Specifically:

**Queries where all models fail**: These indicate a structural problem — either the annotation is wrong, the document is not in the corpus, or the query is genuinely ambiguous. Do not blame the model. Fix the corpus or the annotation.

**Queries where only the domain model succeeds**: This is the clearest evidence that domain specialization matters. Document these cases. They are the strongest argument for fine-tuning.

**Queries where BM25 beats all embedding models**: This indicates that the query is primarily a keyword lookup — there is no semantic gap to bridge. Reconsider whether your corpus actually needs semantic search for these query types, or whether a hybrid approach (BM25 + semantic, fused via RRF) would be more appropriate.

```python
from typing import Callable

def per_query_analysis(
    eval_pairs: list[dict],
    models: dict[str, Callable],
    k: int = 10,
) -> list[dict]:
    analysis = []
    for pair in eval_pairs:
        relevant = set(pair["relevant_ids"])
        row = {"query": pair["query"], "relevant": list(relevant)}
        for model_name, retrieve_fn in models.items():
            results = retrieve_fn(pair["query"], k)
            rr = 0.0
            for rank, doc_id in enumerate(results, start=1):
                if doc_id in relevant:
                    rr = 1.0 / rank
                    break
            row[f"rr_{model_name}"] = rr
        analysis.append(row)
    return analysis
```
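
For the hybrid approach mentioned above, Reciprocal Rank Fusion needs only the ranked ID lists from each retriever, not their scores. A minimal sketch: each list contributes `1 / (rrf_k + rank)` per document, with `rrf_k=60` as a commonly used default. The function name and the toy rankings are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 10, rrf_k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs; each list contributes 1 / (rrf_k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    fused = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return fused[:k]

bm25_top = ["auth.py", "tokens.py", "session.py"]
dense_top = ["tokens.py", "jwt_utils.py", "auth.py"]
fused = reciprocal_rank_fusion([bm25_top, dense_top], k=3)
print(fused)
```

Because the fused output is again a ranked list of IDs, a wrapper around it can be added to the `models` dict passed to `per_query_analysis` like any other retriever.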

> **Warning:** Do not optimize for the queries in your eval set. Your eval set is a sample. If you make infrastructure decisions specifically to address the failure cases in your eval set, you are at risk of overfitting to your evaluation rather than improving general retrieval quality. Diagnose patterns, not individual queries.

### When the Numbers Are Misleading

Three situations where your bake-off numbers are less trustworthy than they appear:

**Your eval set is too homogeneous.** If 80% of your queries are from one category (e.g., all authentication-related), a model that is good at authentication lookup will look excellent overall while being mediocre at everything else.

**Your corpus is too small.** In a 500-file codebase, every model performs well because there are only 500 candidates. The test of retrieval quality is precision at scale. If you have the option to test on a larger corpus, do so.

**Your annotation was done while looking at results.** If you annotated the eval set by querying the model you are testing against, you have built confirmation bias into your evaluation. The model will appear better than it is because the annotator's "correct answers" were influenced by what the model returned.
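
The homogeneity problem is easy to check if your eval pairs carry a category label. The sketch below assumes a hypothetical `category` field on each row, alongside the `rr_<model>` keys that `per_query_analysis` produces (that function does not copy a category by itself, so you would need to add one). It reports per-category query counts and MRR so a dominant category is visible at a glance.

```python
from collections import defaultdict

def mrr_by_category(per_query_rows: list[dict], rr_key: str, category_key: str = "category") -> dict[str, dict]:
    """Group per-query reciprocal ranks by category and report count and mean per group."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for row in per_query_rows:
        grouped[row.get(category_key, "uncategorized")].append(row[rr_key])
    return {
        category: {"n": len(values), "mrr": sum(values) / len(values)}
        for category, values in grouped.items()
    }

# Toy rows for illustration; "rr_model" stands in for a real rr_<model_name> key.
rows = [
    {"query": "validate jwt", "category": "auth", "rr_model": 1.0},
    {"query": "refresh session", "category": "auth", "rr_model": 0.5},
    {"query": "retry failed upload", "category": "storage", "rr_model": 0.0},
]
breakdown = mrr_by_category(rows, rr_key="rr_model")
print(breakdown)
```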

**Key Takeaways:**
- MRR@10 > 0.35 is functional; > 0.4 is strong; > 0.5 is excellent for code search.
- A 0.05 improvement in MRR@10 is the practical threshold for a deployment decision on a 50-pair eval set.
- Per-query analysis reveals more than aggregate metrics — look for patterns in failures, not just numbers.
- Queries where all models fail point to corpus or annotation problems, not model problems.
- Hybrid search (BM25 + semantic, fused via RRF) handles queries where keyword matching and semantic understanding complement each other.

**Practical Exercise:**
Run `per_query_analysis` over your bake-off results. Find the 5 queries with the highest variance across models (one model scores high, another scores 0). Read those queries carefully. What is the model that succeeds doing that the others are not? Document your hypothesis — this is the core of your model selection rationale.
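
A small helper for this exercise, sketched under the assumption that the input is the list of rows returned by `per_query_analysis`. It uses the spread (max minus min reciprocal rank across models) as a simple stand-in for variance; `most_divergent_queries` is an illustrative name.

```python
def most_divergent_queries(analysis: list[dict], n: int = 5) -> list[dict]:
    """Rank queries by the spread (max minus min) of reciprocal rank across models."""
    scored = []
    for row in analysis:
        rr_values = {key: value for key, value in row.items() if key.startswith("rr_")}
        if len(rr_values) < 2:
            continue  # need at least two models to compare
        spread = max(rr_values.values()) - min(rr_values.values())
        scored.append({"query": row["query"], "spread": spread, **rr_values})
    return sorted(scored, key=lambda item: item["spread"], reverse=True)[:n]

# Toy analysis rows for illustration.
toy_analysis = [
    {"query": "parse config file", "relevant": ["cfg.py"], "rr_general": 0.5, "rr_domain": 0.5},
    {"query": "rotate signing keys", "relevant": ["keys.py"], "rr_general": 0.0, "rr_domain": 1.0},
]
divergent = most_divergent_queries(toy_analysis)
print(divergent[0]["query"], divergent[0]["spread"])
```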

---

## Chapter 8: When to Stop Optimizing

Retrieval quality is not the only metric that matters in a real system. You will reach a point where additional optimization effort produces smaller and smaller improvements on your eval set, while the operational complexity of those optimizations continues to grow. Knowing when to stop is a skill, and it is learnable.

### The Diminishing Returns Curve

Embedding model optimization follows a predictable return curve. The first steps (building an eval set, replacing BM25 with a domain-appropriate embedding model, calibrating top-k) typically produce 20–40% improvements in MRR@10. The next steps (better query preprocessing, model fine-tuning, hybrid retrieval fusion) produce 5–15% improvements. Beyond that, you are in the territory of model architecture research, and each marginal improvement requires disproportionately more effort.

Most production systems need to get through the first phase (large gains, low effort) and into the second (moderate gains, moderate effort) before stopping. Very few production use cases justify the third phase.

> **Key Insight:** Your retrieval system serves downstream consumers — a language model generating answers, a developer reading search results, an algorithm making decisions. The quality threshold is determined by those consumers, not by the metric itself. If your downstream LLM produces correct answers 90% of the time with MRR@10 of 0.40, going from 0.40 to 0.45 may not improve end-to-end system quality at all.

### Measuring Downstream Impact

The best stopping criterion for retrieval optimization is downstream quality. For RAG systems, that means: does improving retrieval MRR@10 by X actually improve the accuracy of answers generated by the downstream language model?

The relationship is often nonlinear. Going from MRR@10 of 0.25 to 0.35 may produce a large improvement in answer quality because you are crossing the threshold where the language model has enough correct context. Going from 0.35 to 0.45 may produce marginal downstream improvement because the model was already getting correct context often enough.

Build a small end-to-end evaluation: take 20 questions that have known correct answers, run them through your full pipeline (retrieve → generate), and score the generated answers for correctness. With 20 questions, each answer moves accuracy by 5 percentage points, so grow the set if you need to resolve smaller changes. Track this metric alongside your retrieval metrics. If improving retrieval no longer improves end-to-end correctness, you have found your stopping point.

```python
def end_to_end_accuracy(
    qa_pairs: list[dict],
    retrieve_fn,
    generate_fn,
    judge_fn,
    k: int = 10,
) -> float:
    """
    qa_pairs: list of {"question": str, "correct_answer": str}
    retrieve_fn: (query, k) -> list of document strings
    generate_fn: (question, contexts) -> answer string
    judge_fn: (question, generated_answer, correct_answer) -> bool
    """
    correct = 0
    for pair in qa_pairs:
        contexts = retrieve_fn(pair["question"], k)
        answer = generate_fn(pair["question"], contexts)
        if judge_fn(pair["question"], answer, pair["correct_answer"]):
            correct += 1
    return correct / len(qa_pairs)
```
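
The `judge_fn` argument is deliberately left open. As a minimal sketch, assuming short factual answers where the reference string should appear verbatim in the generated output, a token-level containment check can serve as a first judge. The names `normalize_answer` and `containment_judge` are illustrative and not part of any library; free-form answers would likely need an LLM-based or human judge instead.

```python
import re


def normalize_answer(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(cleaned.split())


def containment_judge(question: str, generated_answer: str, correct_answer: str) -> bool:
    """
    Hypothetical judge: True when the reference answer appears as a run of
    whole tokens inside the generated answer. The question is unused here
    but kept so the signature matches judge_fn.
    """
    reference = normalize_answer(correct_answer)
    if not reference:
        return False
    # Pad with spaces so "5 minutes" does not match inside "15 minutes"
    return f" {reference} " in f" {normalize_answer(generated_answer)} "
```

Pass it as `judge_fn=containment_judge`. It is strict about wording, so it will undercount answers that are correct but phrased differently from the reference.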

### The Operational Complexity Budget

Every optimization adds operational complexity. Fine-tuning adds a retraining pipeline. Hybrid retrieval adds a BM25 component alongside the vector store. Query expansion adds a preprocessing step with its own latency and failure modes. Calibration thresholds need to be recalibrated when your data distribution shifts.

Before adding any optimization, ask: what is the expected maintenance cost over the next 12 months? A 0.03 MRR@10 improvement that requires quarterly retraining and a team member who understands the fine-tuning pipeline is often not worth it. The same improvement achieved by swapping to a better off-the-shelf model is usually worth it, because it costs a one-time re-indexing job rather than a recurring maintenance burden.

> **Warning:** Query expansion — generating multiple queries from a user's input and merging results — is one of the highest-effort, highest-risk optimizations. It adds latency (multiple embedding calls), can amplify retrieval errors (if the expanded queries are off-target), and creates complex failure modes. Implement it only after exhausting model selection, calibration, and chunking strategy.

### Signs You Should Stop

**Your downstream metric has plateaued.** Two consecutive optimization rounds produced less than 2% improvement in end-to-end accuracy. You have hit the point where retrieval quality is no longer the bottleneck.

**Your team cannot explain the current system.** If debugging a retrieval failure requires tracing through fine-tuning triplets, calibration thresholds, and hybrid fusion weights, you have traded correctability for incremental quality. Simplify.

**Your eval set has become stale.** If you built the eval set six months ago and the codebase has changed significantly, your MRR@10 numbers are measuring performance on old data patterns. A new eval set might show that your "optimized" system is actually regressing on new query types.

**You are within noise of the best available model.** If your MRR@10 is 0.44 and the best model on CodeSearchNet is 0.456, you are within 4% of state-of-the-art. The next optimization is unlikely to be a model selection choice — it is a corpus quality or query quality problem.
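
The plateau check in the first sign can be made mechanical. The sketch below assumes you log end-to-end accuracy after each optimization round, and it reads "2%" as two absolute percentage points (0.02). Both the function name and that interpretation are assumptions; adjust the threshold if you mean relative improvement.

```python
def has_plateaued(
    accuracy_history: list[float],
    min_gain: float = 0.02,
    rounds: int = 2,
) -> bool:
    """
    accuracy_history: end-to-end accuracy after each round, oldest first.
    Returns True when each of the last `rounds` rounds improved accuracy
    by less than `min_gain` (absolute).
    """
    # Need one baseline value plus `rounds` follow-up measurements
    if len(accuracy_history) < rounds + 1:
        return False
    recent = accuracy_history[-(rounds + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)
```

A regression counts as "less than `min_gain`" here, so a round that made things worse also contributes to the stop signal.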

> **Try This:** Before your next optimization round, estimate the expected MRR@10 improvement, the engineering time required, and the ongoing maintenance cost. If the improvement × (lifetime value of the system) does not exceed the total maintenance cost, do not implement it.

### Switching Costs and Lock-In

Embedding models create lock-in through the index. Your entire corpus is embedded with a specific model. Switching models requires re-embedding the full corpus, re-tuning any calibration thresholds, and re-validating the eval set metrics. For a 10M document corpus at 100ms per document (batched), that is a 12-day re-indexing job.
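
The arithmetic behind that estimate is worth keeping as a small helper so you can plug in your own corpus size and throughput. This sketch assumes throughput scales linearly with the number of workers, which is optimistic once an API rate limit or GPU memory becomes the bottleneck. `parallel_workers` is a hypothetical parameter; the figure above corresponds to a single worker.

```python
SECONDS_PER_DAY = 86_400


def reindex_days(
    num_documents: int,
    seconds_per_document: float,
    parallel_workers: int = 1,
) -> float:
    """Estimated wall-clock days to re-embed a corpus."""
    if parallel_workers < 1:
        raise ValueError("parallel_workers must be at least 1")
    total_seconds = num_documents * seconds_per_document / parallel_workers
    return total_seconds / SECONDS_PER_DAY


single = reindex_days(10_000_000, 0.1)      # one worker
quad = reindex_days(10_000_000, 0.1, 4)     # four workers, assuming linear scaling
print(f"{single:.1f} days with 1 worker, {quad:.1f} days with 4")
```

With the numbers from the text, one worker needs roughly 11.6 days, which rounds up to the 12-day figure.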

Design your embedding pipeline so that re-indexing is possible without a full system outage. Keep embedding and indexing separate from serving. Maintain your eval set in version control so that you can re-run the bake-off quickly when a new model warrants consideration. Treat model upgrades as a planned infrastructure change, not an emergency.

**Key Takeaways:**
- Optimization follows a diminishing returns curve; most production systems stop after the first two phases.
- The correct stopping criterion is downstream quality improvement, not retrieval metric improvement.
- Every optimization adds operational complexity; weigh the maintenance cost against the expected gain.
- Stale eval sets make optimization decisions unreliable; refresh them on a regular cadence and whenever corpus patterns change significantly.
- Design your pipeline for re-indexing as a planned operation; embedding model upgrades require full corpus re-indexing.

**Practical Exercise:**
Define your end-to-end stopping criterion now, before running any more optimizations. What downstream metric (answer accuracy, user satisfaction, code review pass rate) are you ultimately optimizing for? At what value of that metric are you satisfied? Write it down. This is your ship condition — when you hit it, you stop optimizing retrieval and move on.

---

## Chapter 9: Maintaining Evaluation Quality as Your Data Changes

An eval set is not a one-time artifact. Your codebase evolves, your users' query patterns shift, new features are added, old ones are removed. An eval set built against last year's codebase is measuring your model's performance on a corpus that no longer exists. Maintaining evaluation quality requires a systematic approach to keeping your eval set current without starting from scratch every six months.

### Eval Set Decay

Eval set decay is the gradual divergence between your evaluation data and your production data. It has three root causes:

**Document churn.** Files in your eval set's ground truth may be renamed, refactored, or deleted. A query pointing to `src/auth/tokens.py:decode_token_expiration` becomes invalid if that function is moved to `src/auth/jwt_utils.py`. Stale ground truth silently inflates your false-negative rate — the model correctly retrieves the new location but is scored as a miss.

**Query distribution shift.** Your users' search behavior changes as the codebase changes. Queries for deprecated patterns become less common. Queries for new architectural patterns appear. An eval set that does not include new patterns cannot detect regressions on them.

**Annotation bias accumulation.** Over time, annotators (including yourself) accumulate knowledge of what the model returns. This knowledge seeps into annotation decisions — annotators start accepting model-returned results as "good enough" rather than identifying the true optimal result.

> **Warning:** A stale eval set does not produce errors — it produces confident, wrong metric values. You can have a system with steadily degrading production quality while your eval metrics hold steady or improve, because the eval set no longer challenges the model on the queries your users are actually submitting.

### Automated Staleness Detection

Build a staleness check into your eval pipeline that flags records with broken ground truth references:

```python
from pathlib import Path
import json

def check_eval_set_staleness(
    eval_set_path: str,
    corpus_root: str,
) -> dict:
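    """
    Flags eval pairs whose ground-truth references no longer exist on disk.

    Assumes each entry in "relevant_ids" is a file path relative to
    corpus_root. IDs with a symbol suffix (such as "path.py:function")
    need the suffix stripped before comparison.
    """
    with open(eval_set_path) as f:
        eval_data = json.load(f)

    # Use POSIX-style separators so IDs match on any operating system
    corpus_files = {
        p.relative_to(corpus_root).as_posix()
        for p in Path(corpus_root).rglob("*")
        if p.is_file()
    }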

    stale_records = []
    for pair in eval_data["pairs"]:
        missing = [r for r in pair["relevant_ids"] if r not in corpus_files]
        if missing:
            stale_records.append({
                "query_id": pair.get("id"),
                "query": pair["query"],
                "missing_ids": missing,
            })

    # Guard against an empty eval set to avoid dividing by zero
    total = len(eval_data["pairs"])
    staleness_pct = len(stale_records) / total * 100 if total else 0.0
    return {
        "total_pairs": len(eval_data["pairs"]),
        "stale_pairs": len(stale_records),
        "staleness_pct": staleness_pct,
        "stale_records": stale_records,
    }
```

Run this check as part of your CI pipeline. If staleness exceeds 10%, block metric reporting until the eval set is updated. Surfacing this prominently prevents the silent degradation of evaluation quality.
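
One way to wire the threshold into CI is sketched below, under the assumption that your pipeline treats a non-zero exit code as a failed step. The function name `staleness_gate` and the file paths in the trailing comment are illustrative; the function consumes the report dictionary returned by `check_eval_set_staleness`.

```python
import sys


def staleness_gate(report: dict, max_staleness_pct: float = 10.0) -> int:
    """Return 0 when staleness is within the threshold, 1 when it exceeds it."""
    pct = report["staleness_pct"]
    if pct > max_staleness_pct:
        print(
            f"FAIL: {report['stale_pairs']} of {report['total_pairs']} eval pairs "
            f"are stale ({pct:.1f}% > {max_staleness_pct:.1f}%)",
            file=sys.stderr,
        )
        # Show a few offending records so the failure is actionable
        for record in report["stale_records"][:5]:
            print(
                f"  {record['query_id']}: missing {record['missing_ids']}",
                file=sys.stderr,
            )
        return 1
    print(f"OK: eval set staleness {pct:.1f}%")
    return 0


# In a CI entry point (hypothetical paths):
# report = check_eval_set_staleness("eval/eval_set.json", "src")
# sys.exit(staleness_gate(report))
```

Returning an exit code instead of raising keeps the gate easy to call from a shell step and easy to unit test.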

### Rolling Eval Set Maintenance

Rather than rebuilding your eval set from scratch periodically, maintain it as a rolling collection with explicit retirement of stale records and addition of new ones.

A sustainable cadence for a mid-size codebase (500–5,000 files):
- **Monthly**: Run staleness check, retire records with broken ground truth, annotate 5–10 new pairs from recent query logs
- **Quarterly**: Review distribution of query categories, ensure new architectural patterns are represented, run full bake-off if a new model candidate is available
- **Annually**: Audit the full eval set for annotation quality, check for semantic drift in query patterns, consider whether the primary metric still matches the use case

> **Key Insight:** Five new eval pairs per month is a sustainable contribution from the primary maintainer of a retrieval system. At that rate, you build 60 new pairs per year while the total stays manageable as you retire stale ones. The eval set stays current without requiring a dedicated annotation sprint.
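
The monthly retire-and-add step can be sketched as a pure function. It assumes the eval set layout used by the staleness check (a top-level `"pairs"` list whose records carry `"id"`, `"query"`, and `"relevant_ids"`); the name `roll_eval_set` is hypothetical.

```python
def roll_eval_set(
    eval_data: dict,
    stale_query_ids: set[str],
    new_pairs: list[dict],
) -> dict:
    """
    Return a new eval set with stale records retired and new pairs appended.
    The input dictionary is not modified.
    """
    # Retire records whose ground truth was flagged as broken
    kept = [p for p in eval_data["pairs"] if p.get("id") not in stale_query_ids]

    # Append new pairs, skipping any query text that is already present
    seen_queries = {p["query"] for p in kept}
    added = []
    for pair in new_pairs:
        if pair["query"] not in seen_queries:
            added.append(pair)
            seen_queries.add(pair["query"])

    return {**eval_data, "pairs": kept + added}
```

Feed `stale_query_ids` from the `query_id` values in the staleness report, then save the result under a new version so the previous set remains available for comparison.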

### Using Production Failures as Eval Data

The highest-quality source of new eval pairs is production failures — cases where your retrieval system returned results that users explicitly corrected or dismissed. If your system has any feedback mechanism (a "not helpful" button, query reformulation tracking, or explicit result annotation), mine it.

A user who reformulates their query is telling you that the first retrieval attempt failed. The original query, paired with the result the user eventually clicked, forms a hard eval case: the system got there eventually but failed on the first phrasing. Document these as difficult queries in your eval set.

```python
def ingest_query_reformulations(
    query_log: list[dict],
    max_reformulation_gap: float = 30.0,
) -> list[dict]:
    """
    Identifies reformulation chains from query logs.

    query_log: list of {"session_id", "query", "timestamp", "result_clicked"},
        where "timestamp" is a number of seconds (for example, Unix time)
    max_reformulation_gap: maximum seconds between two consecutive queries
        for the second to count as a reformulation of the first
    """
    from itertools import groupby

    new_eval_candidates = []
    # groupby only merges adjacent items, so sort by session, then by time
    sessions = sorted(query_log, key=lambda x: (x["session_id"], x["timestamp"]))

    for session_id, session_queries in groupby(sessions, key=lambda x: x["session_id"]):
        session = list(session_queries)
        for i in range(len(session) - 1):
            gap = session[i+1]["timestamp"] - session[i]["timestamp"]
            # A quick follow-up query that ends in a click suggests the
            # previous query failed to surface what the user needed
            if gap < max_reformulation_gap and session[i+1].get("result_clicked"):
                new_eval_candidates.append({
                    "original_query": session[i]["query"],
                    "successful_query": session[i+1]["query"],
                    "clicked_result": session[i+1]["result_clicked"],
                    "candidate_status": "needs_annotation",
                })

    return new_eval_candidates
```

### Version Control and Reproducibility

Your eval set is code. Treat it that way. Every change to the eval set should be committed with a message explaining what was changed and why. When you compute metrics for a model, record which version of the eval set was used alongside the metric values. This lets you distinguish improvements in model quality from improvements in eval quality.

```
eval_set_v1.0.json   — initial 30-pair set, April 2026
eval_set_v1.1.json   — retired 3 stale records, added 8 new from query logs, May 2026
eval_set_v1.2.json   — added 12 pairs covering new async patterns, June 2026
```

When you report metrics, report them as: `MRR@10 = 0.423 (model: PyckLM-1.2, eval: v1.2, date: 2026-06-15)`. This creates an audit trail that makes regressions detectable and root-causable.
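
A small helper keeps that report line consistent across runs. This is a sketch: `format_metric_report` reproduces the format shown above, while `eval_set_fingerprint` is an optional extra (an assumption, not something the workflow requires) that hashes the eval set contents so a mislabeled version number can be caught.

```python
import hashlib
import json
from datetime import date


def eval_set_fingerprint(eval_data: dict) -> str:
    """Short content hash of the eval set, stable across key ordering."""
    canonical = json.dumps(eval_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def format_metric_report(
    metric_name: str,
    value: float,
    model: str,
    eval_version: str,
    run_date: date,
) -> str:
    """Render one metric value with the context needed to audit it later."""
    return (
        f"{metric_name} = {value:.3f} "
        f"(model: {model}, eval: {eval_version}, date: {run_date.isoformat()})"
    )


line = format_metric_report("MRR@10", 0.423, "PyckLM-1.2", "v1.2", date(2026, 6, 15))
print(line)
```

Appending each line to a log file under version control gives you the audit trail with no extra tooling.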

> **Try This:** Check your current eval set right now with the staleness detection script. If more than 10% of your ground truth document references are missing or stale, stop here and fix those records before running any further model comparisons. Broken ground truth produces meaningless metrics.

**Key Takeaways:**
- Eval set decay silently inflates false-negative rates and produces confident, wrong metrics.
- Automated staleness detection in CI prevents invisible quality degradation.
- Five new eval pairs per month is a sustainable maintenance cadence for most teams.
- Production query reformulations are the highest-quality source of new difficult eval pairs.
- Treat the eval set as versioned code; record which eval set version was used alongside every metric value.

**Practical Exercise:**
Run the staleness check on your eval set. If any records are stale, update them now. Then set a calendar reminder for 30 days from today to add 5 new pairs from whatever query logs or user interactions you have accumulated. Document the cadence you will maintain.

---

## Conclusion

Embedding model selection is a measurement problem masquerading as a shopping problem. The mistake most teams make is treating it as a shopping problem — comparing benchmark numbers the way you compare spec sheets, picking the highest number, and moving on. That process produces models optimized for someone else's data.

The process in this book is the alternative. It is more work upfront. Building a domain-specific eval set, setting up a bake-off framework, computing your BM25 baseline — none of that is glamorous. But it produces decisions grounded in evidence about your actual system, and those decisions hold up under scrutiny.

### What the Process Gives You

A 30-pair eval set takes two hours to build and gives you something no benchmark can: a measurement of how each candidate model actually performs on the queries your users submit, against the documents your users need to find. That is the information you need to make a selection. Everything else is a prior that you are updating against that evidence.

The metrics — MRR@10, Recall@k, NDCG — are tools for distilling retrieval quality into numbers you can compare. They are not the goal. The goal is a retrieval system that serves your users well. The metrics are instrumentation on that goal. When you find that improving MRR@10 from 0.38 to 0.43 does not improve the downstream language model's accuracy, that is the metrics telling you something true: you have reached the retrieval quality ceiling for your use case, and the gains are elsewhere.

### The Fine-Tuning Decision in Context

The fine-tuning question is where most teams spend too much energy. The math is clear: below 50K high-quality triplets, fine-tuning does not reliably outperform a well-calibrated general model. Above 50K, it does — but only if you have infrastructure for retraining, monitoring, and model versioning.

Domain specialization is often the better bet. A model like PyckLM, achieving 0.456 MRR@10 on CodeSearchNet because it was trained on code structure rather than retrofitted from text, demonstrates what happens when training distribution matches query distribution. The 62% improvement over GraphCodeBERT is not a fine-tuning result — it is a selection result. The model was trained for the task from the beginning.

Before you commit to a fine-tuning pipeline, ask whether there is a model already trained for your domain. The answer is increasingly yes — code search, legal retrieval, biomedical literature search, and financial analysis all have specialized models with genuine quality advantages over general alternatives.

### Cost, Complexity, and the Stopping Condition

Every layer of optimization adds operational surface area. A hybrid BM25 + embedding retrieval system fused via RRF often beats either retriever alone, but it requires maintaining two retrieval paths, a fusion implementation, and calibration of fusion weights. That complexity is justified when the quality gain is meaningful. It is not justified for a 0.02 MRR@10 improvement in a system that already meets your quality bar.

Set your stopping condition before you start optimizing. What downstream quality does your system need to achieve? At what MRR@10 does user experience become acceptable? When is the retrieval quality no longer the bottleneck? These are not hard questions, but they require specific answers — not "as good as possible." The open-ended mandate to keep improving is the enemy of shipping.

### Maintenance as a First-Class Concern

Eval set maintenance is not optional. A stale eval set is worse than no eval set — it provides false confidence that your system is performing well while actual production quality drifts. Run your staleness check monthly. Add new pairs from production. Retire stale ground truth. Treat the eval set as a living artifact that needs tending, not a one-time deliverable.

The same applies to model selection. Embedding models improve rapidly. The model you selected today may be outperformed six months from now by something newer and cheaper. Your eval framework is the infrastructure that makes re-evaluation fast — with a solid bake-off script and a current eval set, evaluating a new model takes an afternoon, not a sprint. Build the infrastructure once. Run it often.

The ground truth is your data. Trust it over any leaderboard.

---

## Appendix A: Glossary

**BM25** — A probabilistic keyword retrieval algorithm based on term frequency and inverse document frequency. Used as the standard baseline for evaluating embedding model improvements. Implemented in libraries like `rank_bm25`.

**Calibration** — The process of learning a similarity threshold or score transformation that makes model confidence scores meaningful. A calibrated retrieval system returns fewer results when confidence is low, rather than always returning a fixed top-k.

**Cosine Similarity** — A measure of the angle between two vectors. Values range from -1 to 1; 1 means identical direction (highly similar), 0 means orthogonal (no similarity). The standard similarity function for dense retrieval.

**DCG / NDCG** (Discounted Cumulative Gain / Normalized Discounted Cumulative Gain) — A ranking metric that rewards relevant documents appearing at high positions, with graded relevance support. NDCG is DCG normalized against the ideal ranking, so scores fall between 0 and 1.

**Dense Retrieval** — Retrieval using dense vector representations (embeddings) and approximate nearest neighbor search. Contrasted with sparse retrieval (BM25, TF-IDF).

**Embedding** — A fixed-length numeric vector representing a piece of text in a high-dimensional space where similar texts are geometrically close.

**Eval Set** — A collection of (query, relevant_document) pairs used to measure retrieval quality. The ground truth for all metric computation.

**Fine-Tuning** — Continuing training of a pre-trained model on domain-specific data to improve performance on that domain.

**Hard Negative** — A document that is superficially similar to the relevant document but is not the correct answer for a query. Used in contrastive training to teach the model fine-grained distinctions.

**Hybrid Retrieval** — Combining sparse (BM25) and dense (embedding) retrieval, typically fused using Reciprocal Rank Fusion (RRF).

**Margin Ranking Loss** — A training objective that forces the model to place positive examples significantly above negative examples by a defined margin. Produces better-calibrated similarity scores than MSE loss.

**Matryoshka Representation Learning (MRL)** — A technique that trains a model to produce embeddings that remain meaningful when truncated to lower dimensions. Used by OpenAI's text-embedding-3 series.

**MTEB** (Massive Text Embedding Benchmark) — The primary public leaderboard for embedding models. Its original English version covers 56 datasets across 8 task categories. Computed primarily on web crawls and academic data.

**MRR@k** (Mean Reciprocal Rank) — The average of reciprocal ranks across queries. For each query, the reciprocal rank is 1/position of the first relevant result. Zero if the relevant result is not in the top k.

**MSE Loss** — Mean Squared Error loss. When applied to similarity scores, it tends to produce predictions clustered around the mean, the "mushy middle" failure mode.

**Recall@k** — The fraction of relevant documents that appear in the top k results. The primary metric for multi-document retrieval tasks.

**Reciprocal Rank Fusion (RRF)** — A rank fusion technique that combines rankings from multiple retrieval systems. Each document's fused score is the sum of 1/(k + rank) over the systems that returned it, where k is a smoothing constant (typically 60).

**Top-k** — The number of results returned by a retrieval system for each query. Static top-k returns a fixed number regardless of confidence; calibrated retrieval adjusts based on similarity scores.

**Triplet** — A training example consisting of (anchor, positive, negative) — typically (query, relevant_document, irrelevant_document).

**Vector Store** — A database optimized for storing and querying dense vectors. Examples: ChromaDB, Pinecone, Weaviate, pgvector, FAISS.
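
The definitions of MRR@k, Recall@k, and RRF above translate into a few lines of code. The functions below are a reference sketch in plain Python with illustrative names, not the API of any evaluation library. For production use, a tested library such as `ranx` is the safer choice.

```python
def mrr_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant result within the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for position, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(ranked: list[str], rel: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not rel:
        return 0.0
    return len(set(ranked[:k]) & rel) / len(rel)


def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Fuse several rankings by summing 1 / (rrf_k + rank) per document."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Note that `rrf_k` is the smoothing constant from the RRF definition and is unrelated to the cutoff `k` in MRR@k and Recall@k.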

---

## Appendix B: Tools and Resources

### Embedding Models (Local)

**sentence-transformers** — The primary Python library for working with local embedding models. Supports hundreds of HuggingFace models with a unified API.
```bash
pip install sentence-transformers
```

**bge-large-en-v1.5** — BAAI's general retrieval model. Strong across diverse tasks. Pull via:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
```

**e5-mistral-7b-instruct** — High-quality 7B model for demanding retrieval tasks.
```python
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
```

### Embedding Models (API)

**OpenAI text-embedding-3-small / text-embedding-3-large** — Supports shortened output dimensions through the `dimensions` parameter (Matryoshka-style truncation). Use via the `openai` Python package.

**Codestral Embed** — Mistral's code-focused embedding model. Available via the Mistral API.

**Cohere Embed v3** — Strong multilingual embedding with domain options. Available via `cohere` Python package.

### Retrieval and Indexing

**rank_bm25** — BM25 implementation for Python. Use for baseline computation.
```bash
pip install rank-bm25
```

**FAISS** — Facebook's library for efficient similarity search. Use for large-scale in-memory vector search.
```bash
pip install faiss-cpu  # or faiss-gpu
```

**ChromaDB** — Embedded vector database with Python-first API. Good for local development.
```bash
pip install chromadb
```

**pgvector** — PostgreSQL extension for vector similarity search. Production-ready for teams already using Postgres.

### Evaluation

**ranx** — Python library for information retrieval evaluation metrics (MRR, NDCG, MAP, Recall).
```bash
pip install ranx
```

**ir_measures** — Comprehensive IR evaluation library supporting TREC-style evaluation.
```bash
pip install ir-measures
```

**beir** — Heterogeneous retrieval benchmark toolkit. Use for benchmarking against standard datasets.
```bash
pip install beir
```

### Training

**sentence-transformers** — Also handles fine-tuning with triplet loss, contrastive loss, and margin ranking loss.

**Unstructured** — Document parsing and chunking for diverse file formats (PDF, Word, HTML).
```bash
pip install unstructured
```

---

## Appendix C: Further Reading

### Core Papers

**"BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models"** (Thakur et al., 2021) — The foundational paper for understanding retrieval generalization across domains. Essential reading for understanding why single-domain benchmarks do not predict cross-domain performance.

**"Text Embeddings Reveal (Almost) As Much As Text"** (Morris et al., 2023) — Shows that much of the original text can be reconstructed from its embedding, which makes it a useful analysis of how much information text embeddings retain.

**"Text Embeddings by Weakly-Supervised Contrastive Pre-training"** (Wang et al., 2022) — The paper behind the E5 model family. Explains the weakly supervised contrastive pre-training approach and its benefits for retrieval.

**"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"** (Reimers & Gurevych, 2019) — The foundational paper for sentence-level embeddings and the basis for most practical embedding fine-tuning.

**"Learning Dense Representations for Entity Retrieval"** (Gillick et al., 2019) — Introduces hard negative mining for dense retrieval, the technique behind most modern contrastive training pipelines.

**"CodeSearchNet Challenge: Evaluating the State of Semantic Code Search"** (Husain et al., 2019) — The paper that defined CodeSearchNet, the benchmark used throughout this book. Includes the dataset construction methodology and baseline results.

### Practical Guides

**The Illustrated Transformer** (Jay Alammar) — Visual explanation of the transformer architecture underlying all modern embedding models. Essential mental model for understanding what embedding models are doing.

**Pinecone's Vector Database Fundamentals** — Practical guide to vector storage, indexing strategies, and approximate nearest neighbor algorithms.

**Hugging Face MTEB Leaderboard documentation** — The methodology behind MTEB scoring, including which datasets are used and how overall scores are computed from task-specific results.

### Code and Tooling

**sentence-transformers training examples** — The official repository contains complete training examples for triplet loss, contrastive loss, and cross-encoder distillation. The highest-quality reference for fine-tuning implementation.

**BEIR GitHub repository** — Implementations of all BEIR benchmark datasets with standardized evaluation code. Use this to reproduce any BEIR result you encounter in a paper.

**ranx documentation** — Comprehensive documentation for computing IR metrics in Python, including NDCG, MRR, MAP, and statistical significance testing.

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*
