---
title: "How to Manage Context Windows Effectively"
subtitle: "What Goes In, What Stays Out, and Why It Matters"
author: "Kelly Price"
date: "2026-04-21"
description: "A developer's guide to context window management — understanding attention degradation, context selection strategies, and building systems that stay relevant under pressure."
tags: [ai, developer-tools, productivity]
---

# How to Manage Context Windows Effectively
## What Goes In, What Stays Out, and Why It Matters

*Kelly Price*

---

## About This Guide

Every senior developer eventually hits the same wall. You've been running a long session with an LLM — code review, refactor, architecture discussion. The responses start drifting. The model loses track of a constraint you established forty exchanges ago. It re-introduces a pattern you explicitly ruled out. You re-paste the same function three times because the model keeps forgetting it exists.

This is not a model failure. It is a context management failure. And it is almost always yours.

Context windows are the working memory of a language model. Unlike human working memory, they have hard capacity limits measured in tokens. Unlike a database, they are not indexed for retrieval — everything in a context window is processed together, with attention mechanisms that do not treat position zero and position 90,000 with equal weight. The way you fill that window determines the quality of every response you get.

This guide treats context management as an engineering discipline, not a prompt-writing art. The strategies here are concrete: how to select what enters context, how to compress what needs to stay, how to test whether your context is actually doing its job, and how to build pipelines that manage context systematically rather than by gut feel.

The audience is developers who are integrating LLMs into real applications — not casual users. You know what an API call looks like. You understand token budgets in broad strokes. What you may not have is a systematic mental model for thinking about context as a resource that requires the same care as memory allocation or database query optimization.

This guide covers nine core topics: how context windows work mechanically, where attention degrades, strategies for selecting what goes in, compression and distillation, retrieval-augmented patterns, session continuity, multi-turn conversation design, measuring context quality, and building context-aware pipelines. Each chapter ends with exercises that produce working artifacts — not hypothetical ones.

The examples use Python with the Anthropic and OpenAI SDKs, but the principles apply regardless of which model or framework you use. Token limits, attention mechanics, and degradation patterns are properties of transformer architectures, not vendor APIs.

If you have ever pasted the same thing twice in the same session because it "got lost," this guide is for you.

---

## Table of Contents

1. How Context Windows Actually Work
2. Attention Degradation: What Happens at the Edges
3. Selecting What Goes Into Context
4. Compression Techniques: Summarization and Distillation
5. Retrieval-Augmented Patterns vs. Full Context
6. Session Continuity: Picking Up Where You Left Off
7. Multi-Turn Conversations Without Context Bleed
8. Testing Context Quality and Measuring Degradation
9. Building Context-Aware Pipelines
- Conclusion
- Appendix A: Glossary
- Appendix B: Tools and Resources
- Appendix C: Further Reading

---

## Chapter 1: How Context Windows Actually Work

A context window is not a buffer that the model reads sequentially before answering. It is the complete input to a single forward pass through the transformer. Every token in your context — system prompt, conversation history, user message, tool outputs — is processed simultaneously, with each token attending to every other token through the attention mechanism.

This architecture has three practical consequences that shape everything else in this guide.

**First: context is stateless.** The model has no memory between API calls. When you call the API, you send the entire conversation every time. The model does not remember your previous call — it simply processes the messages array you provide. This means your application is responsible for maintaining and managing state. The model is a stateless function: input tokens in, output tokens out.

**Second: the context window has a hard limit, measured in tokens.** Tokens are not words. For most English text, one token is roughly four characters, or about 0.75 words. A 128,000-token context window holds roughly 96,000 words — about the length of a novel. But your context competes with itself for that space. A system prompt that runs 2,000 tokens, conversation history of 40 exchanges, and a large document paste can fill a 128k window faster than you expect.

**Third: position matters.** The transformer attention mechanism can theoretically attend to any position in the context with equal weight. In practice, studies have consistently shown that models perform better on information placed at the beginning or end of the context than on information buried in the middle. This is sometimes called the "lost in the middle" problem, and it has concrete implications for how you structure your context.

> **Key Insight:** The model does not read your context the way a person reads a document. It processes all tokens in parallel via attention. There is no sequential "reading order" — just positional embeddings that encode location. The practical effect is that the structural shape of your context matters as much as its content.
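
The token arithmetic above can be sketched in a few lines. This is a rough estimator built only on the rules of thumb quoted in this chapter (about four characters per token, about 0.75 words per token); the function names are illustrative, and a real tokenizer will give different numbers, especially for code or non-English text.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English prose heuristic)."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def words_that_fit(context_tokens: int, words_per_token: float = 0.75) -> int:
    """Approximate number of English words that fit in a window of this size."""
    return int(context_tokens * words_per_token)


print(words_that_fit(128_000))  # 96000
print(estimate_tokens("You are a Python code reviewer."))
```

Use an estimate like this for quick budgeting only; use the SDK's token counter whenever the exact number matters.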

Here is what actually happens when you call a model API with a conversation:

```python
import anthropic

client = anthropic.Anthropic()

# Every API call sends the FULL conversation history.
# The model has no memory outside of what you send here.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system="You are a Python code reviewer. Flag any use of mutable default arguments.",
    messages=[
        {"role": "user", "content": "Review this function: def foo(items=[]): items.append(1); return items"},
        {"role": "assistant", "content": "This function has a mutable default argument bug..."},
        {"role": "user", "content": "Fix it and explain the corrected version."},
    ]
)

print(response.usage)
# Usage(input_tokens=87, output_tokens=142)
# input_tokens includes EVERYTHING: system prompt + all messages
```

The `input_tokens` field tells you exactly what the model processed. This is your context cost per call. As the conversation grows, the input tokens for each call grow linearly, which means the cumulative cost of the whole session grows roughly quadratically. A 50-turn conversation where each turn is 200 tokens means your 51st call sends roughly 10,000 tokens of history before you even ask your question.
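
To make that growth concrete, here is a small arithmetic sketch. It assumes every turn adds a fixed 200 tokens, as in the example above, and ignores the system prompt and the new question; the function names are made up for illustration.

```python
def history_tokens_at_call(call_number: int, tokens_per_turn: int = 200) -> int:
    """Tokens of prior history sent along with the Nth call (1-indexed)."""
    return (call_number - 1) * tokens_per_turn


def cumulative_history_tokens(total_calls: int, tokens_per_turn: int = 200) -> int:
    """Total history tokens sent across every call in the session."""
    return sum(
        history_tokens_at_call(n, tokens_per_turn)
        for n in range(1, total_calls + 1)
    )


print(history_tokens_at_call(51))      # 10000
print(cumulative_history_tokens(51))   # 255000
```

The per-call figure is what determines whether you fit in the window; the cumulative figure is what you are billed for over the session.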

Token counting is not just a billing concern. It determines whether you stay within the context limit. Most SDKs expose a token counting utility:

```python
# Count tokens before sending — catch limit violations early
token_count = client.messages.count_tokens(
    model="claude-opus-4-7",
    system="You are a Python code reviewer.",
    messages=[
        {"role": "user", "content": "Review this function: def foo(items=[]): items.append(1); return items"},
    ]
)
print(f"Input tokens: {token_count.input_tokens}")
```

The context window limit for a given model is the ceiling, not a target. Filling 90% of a 200k context window is not good utilization — it likely means you have stopped being selective about what you include. Context quality degrades long before you hit the limit.

> **Warning:** Do not conflate context window size with context quality. A 200,000-token context window does not mean you should use 180,000 tokens. Larger contexts increase latency, cost, and the surface area for attention degradation. Treat the limit as a ceiling for emergencies, not a budget to fill.

Understanding the token budget also means understanding what consumes it. System prompts can be surprisingly large — a detailed system prompt with examples, formatting instructions, and constraints easily runs 1,000–3,000 tokens. Tool definitions, when you use function calling, add hundreds of tokens per tool. Inline documents, code pastes, and long tool outputs compound quickly.

A practical habit: before building any context-heavy pipeline, write a token audit function that breaks down where tokens are going by category (system, history, documents, current query). That breakdown will reveal the first obvious optimization targets.

```python
def token_audit(system: str, history: list, documents: list, query: str, model: str = "claude-opus-4-7") -> dict:
    client = anthropic.Anthropic()

    def count(messages, sys=None):
        kwargs = {"model": model, "messages": messages}
        if sys:
            kwargs["system"] = sys
        return client.messages.count_tokens(**kwargs).input_tokens

    dummy = [{"role": "user", "content": "x"}]
    system_tokens = count(dummy, system) - count(dummy)
    # The system prompt appears in both counts, so it cancels out of the difference
    history_tokens = count(history + dummy, system) - count(dummy, system) if history else 0
    doc_tokens = count(history + [{"role": "user", "content": "\n".join(documents) + "\n" + query}], system) \
                 - count(history + [{"role": "user", "content": query}], system) if documents else 0
    query_tokens = count([{"role": "user", "content": query}])

    return {
        "system": system_tokens,
        "history": history_tokens,
        "documents": doc_tokens,
        "query": query_tokens,
        "total": system_tokens + history_tokens + doc_tokens + query_tokens
    }
```

This is not a polished library function — it is a diagnostic tool. Run it against your actual contexts early in development. The numbers will surprise you.
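
Once you have an audit dictionary in that shape, a short helper can point at the categories worth attacking first. This is a sketch that assumes the `system` / `history` / `documents` / `query` / `total` keys returned by the audit function; the 40% cutoff and the sample numbers are illustrative, not measurements.

```python
def flag_heavy_categories(audit: dict, threshold: float = 0.40) -> list:
    """Return (category, share) pairs whose share of the total exceeds threshold."""
    total = audit.get("total", 0)
    if total <= 0:
        return []
    shares = {k: v / total for k, v in audit.items() if k != "total"}
    flagged = [(k, round(s, 2)) for k, s in shares.items() if s > threshold]
    # Largest share first, so the top entry is the first optimization target
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical audit numbers, for illustration only
sample = {"system": 1800, "history": 9200, "documents": 6400, "query": 150, "total": 17550}
print(flag_heavy_categories(sample))  # [('history', 0.52)]
```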

**Key Takeaways**

- The model is stateless; your application owns all context state
- Context is the complete input to a single forward pass — every token competes with every other token for attention
- Token limits are ceilings, not budgets; context quality degrades before you reach the limit
- Position within context matters — beginning and end receive stronger attention than the middle
- Token auditing by category is the first step to any context optimization

**Practical Exercise**

Build the `token_audit` function above, then run it against a real conversation from your application or a recent LLM session. Document where the tokens are going. If any single category exceeds 40% of your total budget, that is your first optimization target for Chapter 3.

---

## Chapter 2: Attention Degradation: What Happens at the Edges

The "lost in the middle" problem has a formal research basis. A 2023 paper led by Stanford researchers (Liu et al., "Lost in the Middle: How Language Models Use Long Contexts") demonstrated that across multiple model families, performance on tasks requiring retrieval from a provided document dropped significantly when the relevant information was placed in the middle of a long context, compared to placement at the beginning or end. The effect was not subtle: accuracy dropped by over 20 percentage points in some configurations.

What this means for you as a builder: the structural arrangement of your context is a first-class engineering decision, not an aesthetic one.

The mechanism is not fully settled, but a useful working model is this. Each token attends to other tokens via learned attention weights. In short contexts, the model can maintain strong attention to all positions. As context length grows, the attention available to any one position gets diluted. A commonly offered explanation for the stronger endpoints is that training sequences frequently carry relevant information at the start (instructions) and the end (the current query), so models learn to weight those positions more heavily. The middle is where conversation history, documents, and tool outputs accumulate, and it is where attention tends to weaken.

> **Key Insight:** The model's attention is not uniformly distributed across a long context. Think of it as having "anchor points" at the beginning and end, with the middle being comparatively murky. Critical instructions belong at the beginning. The current task belongs at the end. Everything else competes for attention in the middle.

This has a direct architectural implication: the order in which you assemble your context is not arbitrary. A system prompt that places the most critical constraints at the very beginning is structurally sound. A system prompt that buries the key constraint — "never suggest using eval()" — in paragraph six of a dense ten-paragraph system prompt is not.

Here is a concrete experiment you can run to observe this yourself:

```python
import anthropic

client = anthropic.Anthropic()

def needle_in_haystack(haystack_size_tokens: int, needle_position: str) -> str:
    """
    Place a needle (specific fact) in a haystack of filler text,
    then ask the model to retrieve it.
    """
    needle = "The secret deployment key is ZEPHYR-7743."
    filler = "This is general context about the system architecture. " * (haystack_size_tokens // 12)

    if needle_position == "start":
        content = needle + "\n\n" + filler
    elif needle_position == "middle":
        mid = len(filler) // 2
        content = filler[:mid] + "\n\n" + needle + "\n\n" + filler[mid:]
    else:  # end
        content = filler + "\n\n" + needle

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=100,
        messages=[
            {"role": "user", "content": f"{content}\n\nWhat is the secret deployment key?"}
        ]
    )
    return response.content[0].text

for position in ["start", "middle", "end"]:
    result = needle_in_haystack(5000, position)
    print(f"Position: {position}")
    print(f"Response: {result[:200]}\n")
```

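Run this at different haystack sizes: 5k, 20k, 50k tokens. Results vary by model, and a single repeated filler sentence makes for an easy haystack, so recent models may pass every position at these sizes. Where failures do appear, expect the middle placement to fail before the endpoints do.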

The practical response to attention degradation is not to use shorter contexts — it is to manage placement deliberately. Several patterns follow from this:

**Primacy placement for constraints.** Anything the model must not forget — hard constraints, critical context, identity definitions — goes first in the system prompt. Not second, not in a section labeled "Important" buried three screens down. First.

**Recency placement for the immediate task.** The user's current message is naturally at the end of the messages array, which is exactly right. But tool outputs, retrieved documents, and intermediate results should be placed as close to the final user message as possible. If you retrieved a document to help answer a question, inject it immediately before the question, not at the top of the conversation.

**Aggressive pruning for the middle.** Old conversation turns, superseded instructions, and resolved sub-tasks should be pruned or summarized before they accumulate in the middle of your context (more on this in Chapter 4). The middle is not storage — it is where attention goes to die.
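
The placement patterns above can be sketched as a small assembly function. This is one possible layout, not a prescribed one: the function name, the `<document>` wrapper tags, and the bullet formatting of constraints are all assumptions made for illustration. It returns keyword arguments in the `system` / `messages` shape used by the earlier examples.

```python
def assemble_context(constraints: list, background: str, history: list,
                     retrieved_docs: list, question: str) -> dict:
    """Build request kwargs with constraints first and documents next to the question."""
    # Primacy: hard constraints open the system prompt, ahead of any background
    system = "\n".join(f"- {c}" for c in constraints)
    if background:
        system += "\n\n" + background

    # Recency: retrieved material goes in the final user turn, directly before the question
    doc_block = "\n\n".join(f"<document>\n{d}\n</document>" for d in retrieved_docs)
    final_content = f"{doc_block}\n\n{question}" if doc_block else question

    # The middle holds only the (already pruned) history
    messages = list(history) + [{"role": "user", "content": final_content}]
    return {"system": system, "messages": messages}


request = assemble_context(
    constraints=["Never suggest using eval()."],
    background="The project targets Python 3.9.",
    history=[],
    retrieved_docs=["def load(cfg): ..."],
    question="Is this loader safe?",
)
print(request["system"])
```

Keeping assembly in one function also gives you a single place to change the ordering when you test alternatives.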

> **Warning:** Repetition does not fix degradation. A common instinct is to repeat important instructions at multiple points in the context to ensure they stick. This adds tokens without proportionally adding attention. Worse, if the repeated instructions differ even slightly (due to paraphrase), the model may produce inconsistent behavior. Place the instruction once, at the right position.

Attention degradation also manifests in multi-turn conversations in a specific way: early-turn constraints get violated as conversation length grows. If you told the model in turn one to always respond in JSON, you may start seeing prose responses by turn thirty of a long session. The instruction has not been removed. It is still in the context, but it now sits 30,000 tokens back, buried under subsequent conversation.

The mitigation is to reinforce critical constraints at the system level, not the conversation level. The system prompt always sits at the very start of the context, and in practice instructions placed there tend to hold up better than instructions given in an early user message. If a constraint must be persistent, put it in the system prompt. If it cannot be in the system prompt (because it is user-specific), consider re-injecting it as a reminder at regular intervals in your application logic:

```python
CRITICAL_CONSTRAINTS = """
REMINDER: You must respond only in valid JSON. Never use prose.
Never suggest solutions involving external API calls to third-party services.
"""

def inject_constraint_reminder(messages: list, interval: int = 10) -> list:
    """Re-inject critical constraints every `interval` turns."""
    result = []
    for i, msg in enumerate(messages):
        result.append(msg)
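        # Assumes strictly alternating user/assistant messages, so (i // 2 + 1)
        # is the number of completed exchanges at this assistant message
        if msg["role"] == "assistant" and (i // 2 + 1) % interval == 0: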
            result.append({
                "role": "user",
                "content": f"[SYSTEM REMINDER]\n{CRITICAL_CONSTRAINTS}"
            })
    return result
```

This is not elegant, but it is effective. You are working with the architecture's actual behavior, not the behavior you might prefer it to have.

**Key Takeaways**

- Attention is not uniform across a long context — beginning and end positions are stronger
- Critical constraints belong at the start of the system prompt; the current task at the end of the messages array
- Retrieved documents and tool outputs should be placed as close to the final query as possible
- Repetition does not compensate for positional degradation — placement is the fix
- Long multi-turn sessions cause early constraints to drift; re-injection at the system level is the reliable mitigation

**Practical Exercise**

Run the needle-in-haystack experiment above with your primary model at three context sizes: 5k, 20k, and 50k tokens. Record the accuracy at each position (start, middle, end). Document the crossover point — the context size at which middle placement starts failing. This is your degradation threshold, and it should inform your pipeline's pruning schedule.
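
For recording results, a small scoring helper is enough. The sketch below assumes you have collected `(context_size, position, response_text)` tuples from repeated runs; the function names and the 0.9 accuracy cutoff are illustrative choices, and the needle value matches the one used in the experiment above.

```python
from collections import defaultdict

NEEDLE_VALUE = "ZEPHYR-7743"


def accuracy_table(trials: list) -> dict:
    """Map (size, position) to the fraction of responses containing the needle."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for size, position, response in trials:
        counts[(size, position)] += 1
        if NEEDLE_VALUE in response:
            hits[(size, position)] += 1
    return {key: hits[key] / counts[key] for key in counts}


def degradation_threshold(table: dict, position: str = "middle",
                          min_accuracy: float = 0.9):
    """Smallest context size where accuracy at `position` drops below min_accuracy."""
    failing = [size for (size, pos), acc in table.items()
               if pos == position and acc < min_accuracy]
    return min(failing) if failing else None
```

Run several trials per cell before trusting the threshold; a single response per position tells you very little.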

---

## Chapter 3: Selecting What Goes Into Context

Selection is the highest-leverage act in context management. Every token you exclude is a token that cannot degrade attention, drive up cost, or crowd out more relevant content. The question "what should I include?" is the wrong starting point. Start with "what does the model actually need to produce the correct output?"

These are different questions. The first leads to over-inclusion — pasting the entire codebase, the full conversation history, every related document. The second requires you to model the task from the model's perspective: what information, if absent, would cause a wrong or incomplete answer?

This is the minimum-viable-context principle. Your context should contain everything necessary and nothing more. In practice, you will overshoot — that is fine. The discipline is to iterate toward sufficiency, not maximality.

> **Key Insight:** Context selection is a prediction problem. You are predicting what information the model will need to produce the correct response. The better you understand the task, the better your selection will be. If you cannot articulate what the model needs to know, you do not yet understand the task well enough to select good context.

Start with a classification of information types:

**Task-critical information** is anything without which the model cannot complete the task. For a code review request, this is the code under review, the language version, and any relevant constraints (e.g., "this must run on Python 3.9"). Without these, the model is guessing.

**Background information** is information that improves quality but whose absence would not cause a failure. For the same code review, this might be the project's style guide or examples of preferred patterns. Useful but not necessary.

**Tangentially related information** is information you included because it seemed relevant when you assembled the context, but that the model does not actually need. Old conversation turns discussing resolved issues, entire files when only three functions matter, documentation pages that cover adjacent features. This category is where most context bloat lives.

A practical filter: for each piece of content you are considering including, ask "what does the model do differently if this is absent?" If the answer is "nothing," cut it.
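
One way to make that classification operational is to tag each candidate item and select mechanically. The sketch below is a minimal illustration: `ContextItem`, the category strings, and the token figures in the usage example are hypothetical, and the labeling itself still has to come from you.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    label: str
    category: str  # "task_critical", "background", or "tangential"
    tokens: int


def select_items(items: list, budget: int) -> list:
    """Keep all task-critical items, add background while it fits, drop tangential."""
    critical = [i for i in items if i.category == "task_critical"]
    used = sum(i.tokens for i in critical)
    if used > budget:
        raise ValueError(
            f"task-critical content alone needs {used} tokens; budget is {budget}"
        )
    selected = list(critical)
    for item in items:
        if item.category == "background" and used + item.tokens <= budget:
            selected.append(item)
            used += item.tokens
    return selected


candidates = [
    ContextItem("code under review", "task_critical", 1200),
    ContextItem("style guide", "background", 900),
    ContextItem("resolved thread", "tangential", 5000),
]
print([i.label for i in select_items(candidates, budget=2500)])
```

Raising an error when the critical content alone exceeds the budget is deliberate: that situation calls for compression or a larger budget, not silent truncation.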

Here is a function that applies this filter to code context by extracting only the symbols relevant to a specific task:

```python
import ast
from typing import Optional

def extract_relevant_symbols(source: str, target_function: str, include_called: bool = True) -> str:
    """
    Extract only the functions relevant to a target function from a Python source file.
    Avoids including the entire file when only a slice is needed.
    """
    tree = ast.parse(source)

    # Find the target function
    target_node = None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == target_function:
            target_node = node
            break

    if not target_node:
        return source  # Fallback: return everything if target not found

    # Find all function calls within the target
    called_functions = set()
    if include_called:
        for node in ast.walk(target_node):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                called_functions.add(node.func.id)

    # Collect relevant function definitions
    lines = source.splitlines()
    relevant_sections = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name == target_function or node.name in called_functions:
                start = node.lineno - 1
                end = node.end_lineno
                relevant_sections.append("\n".join(lines[start:end]))

    return "\n\n".join(relevant_sections)


# Usage: instead of pasting the entire file, paste only what matters
source_code = open("my_module.py").read()
relevant_context = extract_relevant_symbols(source_code, "process_payment")
print(f"Full file: {len(source_code)} chars")
print(f"Relevant context: {len(relevant_context)} chars")
```

This pattern, extract before injecting, can be the difference between a 40,000-token context and a 3,000-token context for the same task. The model gets what it needs; you pay for what it needs.

> **Try This:** Before your next LLM call that includes source code, run a token-count comparison: how many tokens is the file you are about to paste, and how many of those are actually relevant to the question? If less than a third of the file is relevant, you are over-including. Write an extraction function specific to your domain.

Selection also applies to conversation history. Not every prior exchange is relevant to the current question. A conversation that started with architecture discussion and has moved to implementation details does not need the architecture discussion in context for questions about implementation — unless the implementation question directly depends on an architectural constraint.

A sliding window approach is the simplest history selection strategy:

```python
def sliding_window_history(messages: list, max_tokens: int, model: str) -> list:
    """
    Return the most recent messages that fit within max_tokens.
    Always preserves the most recent user message.
    """
    import anthropic
    client = anthropic.Anthropic()

    if not messages:
        return messages

    # Always include the last message
    window = [messages[-1]]

    for msg in reversed(messages[:-1]):
        candidate = [msg] + window
        token_count = client.messages.count_tokens(
            model=model,
            messages=candidate
        ).input_tokens

        if token_count <= max_tokens:
            window = candidate
        else:
            break

    return window
```

Sliding window is naive — it keeps recent messages regardless of relevance. Chapter 5 covers retrieval-augmented patterns that select by semantic relevance rather than recency. But for many applications, recency correlates well with relevance, making sliding window a practical default.
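
If one token-counting request per message is too slow or too costly, a local estimate can stand in. The variant below is a sketch under two assumptions: message content is a plain string, and the four-characters-per-token heuristic from Chapter 1 is close enough for budgeting. It also trims the window so it opens on a user turn, since chat APIs commonly expect a conversation to start that way. The function names are illustrative.

```python
def rough_tokens(text: str) -> int:
    """Rough estimate: about four characters per token for English text."""
    return max(1, len(text) // 4)


def sliding_window_estimated(messages: list, max_tokens: int) -> list:
    """Most recent messages that fit the estimated budget, starting on a user turn."""
    if not messages:
        return []

    # The latest message is always kept, even if it alone exceeds the budget
    window = [messages[-1]]
    used = rough_tokens(messages[-1]["content"])

    for msg in reversed(messages[:-1]):
        cost = rough_tokens(msg["content"])
        if used + cost > max_tokens:
            break
        window.insert(0, msg)
        used += cost

    # Drop leading assistant turns so the window opens with a user message
    while len(window) > 1 and window[0]["role"] != "user":
        window.pop(0)
    return window
```

Leave some headroom in `max_tokens` when using an estimate, and confirm the final request with the real token counter before sending.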

The third dimension of selection is format. The same information in a different format costs different numbers of tokens. A JSON object with verbose key names costs more than the same data in a compact representation. Markdown tables cost more than their CSV equivalents. When you control the format of content entering your context — tool outputs, retrieved documents, database results — prefer compact representations without sacrificing the information the model needs.

```python
import json

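# Verbose key names and full-precision values
verbose_record = {
    "customer_identifier": "cust_abc123",
    "account_creation_timestamp": "2024-01-15T10:30:00Z",
    "subscription_plan_name": "professional",
    "monthly_recurring_revenue_usd": 99.00,
    "payment_method_type": "credit_card"
}

# Compact keys; drops the time of day and the unit suffix on the revenue field
compact_record = {
    "id": "cust_abc123",
    "created": "2024-01-15",
    "plan": "professional",
    "mrr": 99,
    "payment": "credit_card"
}

print(json.dumps(compact_record))
```

Serialized with `json.dumps`, the compact record is roughly half the length of the verbose one (about 110 characters against about 210), and the token count should shrink by a broadly similar proportion; measure it with your model's tokenizer. The saving is not entirely free: the compact form drops the time of day and the explicit currency unit, so confirm the model does not need them. At scale, across a pipeline processing thousands of records, this matters.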
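
To check the saving on your own records, serialized length is a quick proxy for token cost. The helper below is a sketch using only the standard library; `serialized_size` is an illustrative name, and the two records repeat the example above so the block runs on its own. Passing `separators=(",", ":")` to `json.dumps` also removes the spaces it inserts by default.

```python
import json


def serialized_size(record: dict, tight: bool = False) -> int:
    """Character length of the JSON form; a rough proxy for token cost."""
    separators = (",", ":") if tight else None
    return len(json.dumps(record, separators=separators))


verbose = {
    "customer_identifier": "cust_abc123",
    "account_creation_timestamp": "2024-01-15T10:30:00Z",
    "subscription_plan_name": "professional",
    "monthly_recurring_revenue_usd": 99.00,
    "payment_method_type": "credit_card",
}
compact = {
    "id": "cust_abc123",
    "created": "2024-01-15",
    "plan": "professional",
    "mrr": 99,
    "payment": "credit_card",
}

for name, rec in [("verbose", verbose), ("compact", compact)]:
    print(name, serialized_size(rec), serialized_size(rec, tight=True))
```

Character counts are only an approximation; confirm the effect with the SDK's token counter before committing to a format.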

**Key Takeaways**

- Start with "what does the model need?" not "what might be relevant?"
- Classify context into task-critical, background, and tangential — eliminate the tangential
- Extract relevant code symbols rather than pasting entire files
- Sliding window history is a practical default; relevance-based retrieval is better for long sessions
- Compact data formats reduce token cost without reducing information content

**Practical Exercise**

Take a real prompt from your application and audit every piece of content included. Label each as task-critical, background, or tangential. Measure the token counts of each category. Remove all tangential content, re-run the prompt, and compare output quality. Document whether output quality changed, stayed the same, or improved.

---

## Chapter 4: Compression Techniques: Summarization and Distillation

When information must stay in your context but its full form is too expensive, compression is the answer. Compression is any technique that reduces token count while preserving the information the model needs. Two primary forms apply here: summarization (condensing the same information into fewer tokens) and distillation (extracting only the high-value signals from a larger body).

These are not the same thing. Summarization is lossy compression: you trade detail for brevity, accepting that some nuance is lost. Distillation is selective extraction: you keep the specific facts you need and discard the rest, ideally with no loss for the downstream task at hand.

The choice between them depends on how you will use the compressed content. If the model needs a general sense of what happened in a long conversation segment, summarize. If the model needs specific facts — a constraint, a decision, a numerical value — distill.

> **Key Insight:** Summarization and distillation serve different purposes. Use summarization when the model needs context to reason from. Use distillation when the model needs facts to act on. Confusing the two produces either too much loss (over-summarized facts) or too much bulk (distillation that keeps prose).

**Summarizing conversation history** is the most common application. When a conversation grows long enough that its full history would consume too much of your context budget, you summarize the older segments and keep the recent turns verbatim.

```python
import anthropic

client = anthropic.Anthropic()

SUMMARY_PROMPT = """Summarize the following conversation segment.
Focus on: decisions made, constraints established, information provided, and open questions.
Be specific and factual. Use past tense. Target 150-250 words.
Do not include greetings, pleasantries, or meta-commentary about the conversation.

Conversation:
{conversation}

Summary:"""

def summarize_segment(messages: list) -> str:
    """Compress a list of messages into a summary."""
    formatted = "\n".join(
        f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}"
        for m in messages
    )

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use the fast, cheap model for compression
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": SUMMARY_PROMPT.format(conversation=formatted)
        }]
    )
    return response.content[0].text


def compress_history(messages: list, keep_recent: int = 6, threshold: int = 20) -> list:
    """
    When history exceeds `threshold` messages, summarize all but the last `keep_recent`.
    Returns a messages list with a summary injected as a system-style user message.
    """
    if len(messages) <= threshold:
        return messages

    old_messages = messages[:-keep_recent]
    recent_messages = messages[-keep_recent:]

    summary = summarize_segment(old_messages)

    summary_message = {
        "role": "user",
        "content": f"[CONVERSATION SUMMARY — earlier context]\n{summary}"
    }
    # Inject a brief acknowledgment to maintain valid message alternation
    summary_ack = {
        "role": "assistant",
        "content": "Understood. I have the context from our earlier discussion."
    }

    return [summary_message, summary_ack] + recent_messages
```

Note the model choice: use a fast, inexpensive model for compression tasks. Haiku-class models are well-suited for summarization — the task is mechanical enough that you do not need the full capability of a frontier model. You will save money and reduce latency on every compression step.

**Distillation** is more precise. Rather than producing a prose summary, you extract specific structured data from a larger document or conversation. The output is compact and queryable.

```python
DISTILL_PROMPT = """Extract the following from the conversation below.
Return ONLY a JSON object with these exact keys. Use null for missing values.

Keys:
- constraints: list of hard requirements or prohibitions
- decisions: list of decisions made with their rationale
- open_questions: list of unresolved questions
- key_facts: list of specific facts (numbers, names, dates, identifiers)

Conversation:
{conversation}"""

def distill_to_facts(messages: list) -> dict:
    """Extract structured facts from a conversation segment."""
    formatted = "\n".join(
        f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}"
        for m in messages
    )

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": DISTILL_PROMPT.format(conversation=formatted)
        }]
    )

    import json
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Extract JSON block if wrapped in prose
        text = response.content[0].text
        start = text.find('{')
        end = text.rfind('}') + 1
        return json.loads(text[start:end]) if start != -1 else {}
```

The distilled output is dramatically smaller than the original. A 5,000-token conversation segment may distill to a 200-token JSON object. That object can be prepended to new sessions to restore the relevant state without dragging the full history along.
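
Prepending the distilled object to a new session can be as simple as rendering it to a short text block. The sketch below assumes the four keys requested in the distillation prompt; the `[RESTORED STATE]` header and the function name are illustrative, and missing or null keys are skipped rather than treated as errors.

```python
def render_state_preamble(facts: dict) -> str:
    """Turn a distilled facts dict into a compact text block for a new session."""
    sections = [
        ("constraints", "Constraints"),
        ("decisions", "Decisions"),
        ("open_questions", "Open questions"),
        ("key_facts", "Key facts"),
    ]
    lines = ["[RESTORED STATE]"]
    for key, title in sections:
        values = facts.get(key) or []  # tolerate null or missing keys
        if not values:
            continue
        lines.append(f"{title}:")
        lines.extend(f"- {v}" for v in values)
    return "\n".join(lines)


# Hypothetical distilled output, for illustration only
print(render_state_preamble({
    "constraints": ["Must run on Python 3.9"],
    "decisions": ["Use a sliding window for history"],
    "open_questions": None,
    "key_facts": [],
}))
```

Constraints are rendered first on purpose, so the items that must not be forgotten sit nearest the start of the restored block.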

> **Warning:** Compression is irreversible within a session. Once you summarize or distill a conversation segment, the original detail is gone from your context. Keep the original messages in your application's persistent storage (a database, file) for audit and debugging purposes. Only replace them in the context you send to the model.
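
A minimal way to keep the originals is an append-only table that your application writes to before any compression happens. The sketch below uses SQLite from the standard library; the `MessageArchive` class, the table layout, and the `session_id` field are assumptions made for illustration, not part of any SDK.

```python
import sqlite3


class MessageArchive:
    """Append-only store for original messages, separate from the model context."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "session_id TEXT, role TEXT, content TEXT)"
        )

    def append(self, session_id: str, role: str, content: str) -> None:
        self.conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )
        self.conn.commit()

    def full_history(self, session_id: str) -> list:
        """Return every original message for a session, in insertion order."""
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
            (session_id,),
        ).fetchall()
        return [{"role": role, "content": content} for role, content in rows]


archive = MessageArchive()
archive.append("demo", "user", "Always respond in JSON.")
archive.append("demo", "assistant", '{"ok": true}')
print(len(archive.full_history("demo")))  # 2
```

Write to the archive first and derive the model context from it, so a bad summary can always be regenerated from the source messages.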

**Document compression** follows the same principles but applies to injected documents. If a user uploads a 20-page PDF and asks a specific question about it, you do not need all 20 pages in context. Extract the relevant sections first, then compress those if they are still too long.

```bash
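# Extract PDF text with pdftotext (poppler-utils). -layout preserves the page's
# physical layout; the Python step then collapses the padding that introduces.
pdftotext -layout document.pdf - | python3 -c "
import sys, re

text = sys.stdin.read()
# Collapse runs of spaces and tabs first, so whitespace-only lines become empty
text = re.sub(r'[ \t]+', ' ', text)
text = re.sub(r' ?\n ?', '\n', text)
# Then collapse runs of empty lines
text = re.sub(r'\n{3,}', '\n\n', text)
print(text[:50000])  # Hard cap (in characters) before sending to chunker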
" > document_clean.txt
```

Then chunk and retrieve selectively (Chapter 5 covers this in depth). For inline document injection, target 20% of the original length as your compression goal — aggressive but achievable for most technical documents with focused extraction.
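
To make "extract the relevant sections first" concrete, here is a deliberately simple sketch that keeps the sections sharing the most terms with the question, within a budget of 20% of the document's words. The function name `select_sections` and the term-overlap scoring are assumptions for illustration; embedding-based retrieval (Chapter 5) is the more robust option.

```python
import re


def _terms(text: str) -> list:
    """Lowercased alphanumeric tokens; a crude stand-in for real tokenization."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_sections(sections: list, question: str, target_ratio: float = 0.2) -> list:
    """Keep the best-matching sections whose combined size fits the word budget."""
    query_terms = set(_terms(question))
    sizes = [len(_terms(s)) for s in sections]
    scores = [len(query_terms & set(_terms(s))) for s in sections]
    budget = int(sum(sizes) * target_ratio)

    kept, used = [], 0
    # Visit sections from highest to lowest term overlap
    for i in sorted(range(len(sections)), key=lambda i: scores[i], reverse=True):
        if scores[i] == 0 or used + sizes[i] > budget:
            continue  # irrelevant, or would exceed the budget
        kept.append(i)
        used += sizes[i]
    return [sections[i] for i in sorted(kept)]  # restore document order
```

Plain term overlap counts stop words like "the" as matches, so treat the scores as a rough filter rather than a relevance measure.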

**Rolling compression** is the architectural pattern that keeps all of this manageable at scale. As your conversation or pipeline accumulates tokens over time, you periodically compress the oldest segments. The result is a context that maintains a fixed maximum size regardless of session length.

```python
class RollingContext:
    def __init__(self, max_tokens: int = 40000, compression_threshold: float = 0.8):
        self.max_tokens = max_tokens
        self.compression_threshold = compression_threshold
        self.messages = []
        self.client = anthropic.Anthropic()

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._maybe_compress()

    def _token_count(self) -> int:
        if not self.messages:
            return 0
        return self.client.messages.count_tokens(
            model="claude-opus-4-7",
            messages=self.messages
        ).input_tokens

    def _maybe_compress(self):
        if self._token_count() > self.max_tokens * self.compression_threshold:
            # Compress the oldest half of the messages. Keep the cutoff even so the
            # remaining history still starts on a user turn.
            cutoff = len(self.messages) // 2
            cutoff -= cutoff % 2
            if cutoff == 0:
                return  # too little history to compress
            old = self.messages[:cutoff]
            self.messages = self.messages[cutoff:]

            summary = summarize_segment(old)
            self.messages.insert(0, {
                "role": "user",
                "content": f"[COMPRESSED HISTORY]\n{summary}"
            })
            self.messages.insert(1, {
                "role": "assistant",
                "content": "Acknowledged."
            })

    def get_messages(self) -> list:
        return self.messages
```

**Key Takeaways**

- Summarization preserves general meaning; distillation preserves specific facts — match the technique to the use case
- Use fast, inexpensive models for compression tasks; save frontier models for reasoning tasks
- Always persist original content in external storage before compressing in-context
- Rolling compression is the pattern for long-running sessions with bounded context size
- Target 20% of original length as an aggressive but achievable compression goal for documents

**Practical Exercise**

Implement the `compress_history` function above and add it to a real conversation handler. Run a 30-turn test conversation, logging token counts before and after each compression event. Measure the quality difference between a full-history response and a compressed-history response on the same question posed at turn 30.

---

## Chapter 5: Retrieval-Augmented Patterns vs. Full Context

Retrieval-Augmented Generation (RAG) is the pattern where you store content in a vector database, search it at query time, and inject only the retrieved results into your context. It is the alternative to including everything upfront. Understanding when to use each approach — and how to build the retrieval layer correctly — is essential for any application handling more information than fits in a context window.

The core tradeoff: full-context inclusion is simpler and eliminates retrieval latency, but it costs more tokens and degrades with size. RAG is more complex, adds infrastructure, and introduces retrieval quality as a failure mode — but it scales to arbitrarily large knowledge bases without context blowup.

**When full context beats RAG:** small, stable knowledge bases (under 50k tokens); tasks requiring cross-document synthesis (the model needs to reason across many documents simultaneously); situations where recall is more important than precision (you cannot afford to miss relevant content); quick prototypes.

**When RAG beats full context:** large knowledge bases (hundreds of documents, codebases, wikis); dynamic content that changes frequently; cost-sensitive applications where most queries touch only a fraction of the corpus; latency-constrained pipelines where pre-indexing amortizes retrieval time.

> **Key Insight:** RAG is not inherently better than full context — it shifts cost and risk. Full context has high token cost and attention degradation risk. RAG has indexing cost and retrieval quality risk. Choose based on your corpus size, query patterns, and tolerance for each type of failure.
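
The criteria above can be written down as a first-pass heuristic. This is a sketch, not a rule: the function name `choose_context_strategy` and the order in which the criteria are checked are assumptions, and the 50k-token limit is the rough figure from the list above.

```python
def choose_context_strategy(
    corpus_tokens: int,
    changes_frequently: bool,
    needs_cross_document_synthesis: bool,
    full_context_limit: int = 50_000,  # rough threshold from the text; tune for your model
) -> str:
    """Return 'full_context' or 'rag' as a starting point for design discussion."""
    if corpus_tokens > full_context_limit:
        return "rag"  # too large to include upfront without cost and attention problems
    if needs_cross_document_synthesis:
        return "full_context"  # the model must see all documents at once
    if changes_frequently:
        return "rag"  # re-indexing changed documents beats re-sending everything
    return "full_context"  # small and stable: the simpler option wins
```

A corpus that is both large and needs cross-document synthesis lands on `"rag"` here, which is the least-bad answer rather than a good one; that combination usually calls for a hybrid design.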

Building a minimal RAG pipeline requires three components: a chunker, an embedder, and a retriever. Here is a working implementation using ChromaDB and the Anthropic API:

```bash
pip install chromadb anthropic
```

```python
import chromadb
import anthropic
from chromadb.utils import embedding_functions

client = anthropic.Anthropic()
chroma_client = chromadb.Client()  # in-memory; the index is lost when the process exits

# Use a local embedding function (or a hosted model such as Voyage AI)
ef = embedding_functions.DefaultEmbeddingFunction()
# get_or_create_collection avoids an error if the collection already exists
collection = chroma_client.get_or_create_collection("knowledge_base", embedding_function=ef)


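def chunk_document(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """
    Split text into overlapping chunks for indexing (sizes are in words).
    Overlap preserves context across chunk boundaries.
    """
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would make the loop below never advance
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # last chunk reached; skip a tail that would only repeat the overlap
        start += chunk_size - overlap

    return chunks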


def index_document(doc_id: str, text: str, metadata: dict = None):
    chunks = chunk_document(text)
    ids = [f"{doc_id}_{i}" for i in range(len(chunks))]
    metadatas = [{**(metadata or {}), "doc_id": doc_id, "chunk": i} for i in range(len(chunks))]

    collection.add(documents=chunks, ids=ids, metadatas=metadatas)
    print(f"Indexed {len(chunks)} chunks for {doc_id}")


def retrieve(query: str, n_results: int = 5, filter_doc_id: str = None) -> list[str]:
    where = {"doc_id": filter_doc_id} if filter_doc_id else None
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=where
    )
    return results["documents"][0]


def rag_query(query: str, system: str = "You are a helpful assistant.") -> str:
    chunks = retrieve(query)
    context = "\n\n---\n\n".join(chunks)

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=system,
        messages=[{
            "role": "user",
            "content": f"Use the following context to answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```

Chunking strategy is the most impactful variable in RAG quality. The defaults above (512 words, 64-word overlap) are starting points, not gospel. For code, chunk by function or class — semantic boundaries matter more than word count. For documentation, chunk by section header. For conversation transcripts, chunk by speaker turn or time interval.
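
For the "chunk code by function or class" case, a minimal sketch using Python's standard `ast` module might look like the following. The function name `chunk_python_source` and the returned dict shape are assumptions; only top-level definitions are chunked, and module-level statements such as imports are skipped.

```python
import ast


def chunk_python_source(source: str) -> list:
    """Return one chunk per top-level function or class in a Python source string."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue  # imports, constants, and other module-level code are skipped
        start = node.lineno
        if node.decorator_list:
            # node.lineno points at the def/class line; include decorators above it
            start = min(d.lineno for d in node.decorator_list)
        chunks.append({
            "name": node.name,
            "kind": type(node).__name__,
            "text": "\n".join(lines[start - 1:node.end_lineno]),
        })
    return chunks
```

Each chunk carries its name, which makes useful metadata for filtering at query time. A very large class may still need to be split by method.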

Chunk size determines the granularity of your retrieval. Small chunks are precise but may lack context; large chunks preserve context but retrieve too much. The overlap parameter ensures that content near chunk boundaries is not lost: the words on either side of a boundary appear in both neighboring chunks, so a sentence shorter than the overlap survives intact in at least one of them.
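
A tiny worked example makes the overlap behavior visible. The helper `chunk_words` below is a stripped-down, illustrative version of a word-based chunker operating on a list of ten placeholder words; the numbers are chosen only to keep the output readable.

```python
def chunk_words(words: list, chunk_size: int, overlap: int) -> list:
    """Split a word list into chunks that share `overlap` words with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(words):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the text
        start += step
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_size=4, overlap=1)
# chunks[0] -> ['w0', 'w1', 'w2', 'w3']
# chunks[1] -> ['w3', 'w4', 'w5', 'w6']   (w3 is shared with the previous chunk)
# chunks[2] -> ['w6', 'w7', 'w8', 'w9']   (w6 is shared with the previous chunk)
```

With an overlap of one word, only that single boundary word is duplicated. In practice you would size the overlap to at least the length of a typical sentence.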

> **Warning:** RAG retrieval is a recall problem, not a precision problem. A missed chunk cannot be recovered — if the relevant information is not retrieved, the model will hallucinate or refuse to answer. Tune chunk size and overlap to maximize recall on your query distribution before worrying about precision. Too many retrieved chunks waste tokens; too few retrieved chunks produce wrong answers. Wrong is worse than expensive.

**Hybrid retrieval** combines dense (semantic embedding) search with sparse (keyword) search, using Reciprocal Rank Fusion (RRF) to merge results. This is typically more robust than either approach alone, especially for queries with domain-specific terms such as identifiers, error codes, or product names that embedding models may not represent well.

```python
from typing import Optional

from rank_bm25 import BM25Okapi  # pip install rank-bm25

class HybridRetriever:
    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.bm25 = BM25Okapi([c.split() for c in chunks])
        # In production, store embeddings in ChromaDB; here simplified

    def bm25_scores(self, query: str) -> list[float]:
        return list(self.bm25.get_scores(query.split()))

    def rrf_merge(self, bm25_ranks: list[int], semantic_ranks: list[int], k: int = 60) -> list[int]:
        """Reciprocal Rank Fusion: combine two best-first lists of chunk indices."""
        scores = {}
        for rank, idx in enumerate(bm25_ranks):
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)
        for rank, idx in enumerate(semantic_ranks):
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)
        return sorted(scores.keys(), key=lambda x: scores[x], reverse=True)

    def retrieve(self, query: str, top_k: int = 5,
                 semantic_ranks: Optional[list[int]] = None) -> list[str]:
        """
        semantic_ranks: chunk indices ordered best-first by embedding similarity
        (from ChromaDB in production). If omitted, falls back to BM25 alone
        rather than fusing with a made-up ranking.
        """
        scores = self.bm25_scores(query)
        bm25_ranks = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if semantic_ranks is None:
            return [self.chunks[i] for i in bm25_ranks[:top_k]]
        merged = self.rrf_merge(bm25_ranks, semantic_ranks)[:top_k]
        return [self.chunks[i] for i in merged]
```
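
To see what RRF does with concrete numbers, here is a standalone sketch that fuses two small rankings. The function name `reciprocal_rank_fusion` and the item labels are illustrative; the scoring rule is the standard `1 / (k + rank)` with 1-based ranks and `k = 60`.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several best-first rankings; returns (item, score) pairs, best first."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


keyword_ranking = ["a", "b", "c"]   # e.g. BM25 order
semantic_ranking = ["b", "d", "a"]  # e.g. embedding-similarity order
fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
order = [item for item, _ in fused]
# order -> ['b', 'a', 'd', 'c']
# b: 1/62 + 1/61, a: 1/61 + 1/63, d: 1/62, c: 1/63
```

Items that appear in both lists rise to the top even when neither list ranks them first, which is the property that makes the fusion robust to one retriever having a bad day.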

Finally, RAG introduces a new failure mode that full context does not have: retrieval errors. When the retrieval step returns the wrong chunks — either because the query embedding did not match the relevant content, or because the relevant content was poorly chunked — the model answers from the wrong context. It may answer confidently and incorrectly. Build retrieval evaluation into your testing pipeline (Chapter 8 covers this).

**Key Takeaways**

- Full context is simpler; RAG scales to larger corpora — choose based on corpus size and query patterns
- Chunking strategy is the highest-impact variable in RAG quality; chunk at semantic boundaries, not arbitrary word counts
- Overlap between chunks prevents information loss at chunk boundaries
- Hybrid retrieval (semantic + BM25 with RRF) is typically more robust than either approach alone, especially for domain-specific queries
- RAG failures are quiet — the model answers confidently from wrong context; retrieval evaluation is not optional

**Practical Exercise**

Index a document set you work with regularly (documentation, internal wiki, codebase README files) using the pipeline above. Write 10 test queries with known answers. Measure recall at k=3, k=5, and k=10. Identify the query type where recall is lowest, then adjust your chunking strategy to improve it.

---

## Chapter 6: Session Continuity: Picking Up Where You Left Off

A session ends. A new session starts. The model remembers nothing. This is the statelessness problem in LLM application design, and it is where most naive implementations fall apart.

The naive solution is to dump the entire previous session into the context of the new one. For short sessions, this is fine. For sessions longer than a few thousand tokens, it defeats the purpose of starting fresh and immediately fills your context with low-relevance history.

The right solution is session serialization: at the end of each session, extract the state that matters and store it in a format that can be efficiently injected at the start of the next session. "State that matters" is not the conversation transcript. It is the structured output of distillation: decisions, constraints, facts, open questions, and the current task status.

> **Key Insight:** Session continuity is an application design problem, not a model problem. The model will use whatever you give it. Your job is to give it the right slice of the previous session — dense, relevant, and compact — not the full transcript.

Here is a session serialization schema that works across most developer tasks:

```python
from dataclasses import dataclass, asdict
from datetime import datetime
import json

@dataclass
class SessionState:
    session_id: str
    created_at: str
    task_description: str
    constraints: list[str]          # Hard requirements that must persist
    decisions: list[dict]           # {"decision": str, "rationale": str}
    open_questions: list[str]       # Unresolved items to address next session
    artifacts: list[dict]           # {"name": str, "type": str, "location": str}
    last_known_state: str           # One-paragraph summary of where things stand

    def to_context_block(self) -> str:
        """Format as a context injection block for the next session."""
        lines = [
            f"## Resumed Session: {self.session_id}",
            f"**Task:** {self.task_description}",
            "",
            "**Active Constraints:**",
        ]
        for c in self.constraints:
            lines.append(f"- {c}")

        lines.extend(["", "**Decisions Made:**"])
        for d in self.decisions:
            lines.append(f"- {d['decision']} (rationale: {d['rationale']})")

        lines.extend(["", "**Open Questions:**"])
        for q in self.open_questions:
            lines.append(f"- {q}")

        lines.extend(["", "**Current State:**", self.last_known_state])

        return "\n".join(lines)

    def save(self, path: str):
        with open(path, 'w') as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, path: str) -> 'SessionState':
        with open(path) as f:
            return cls(**json.load(f))


def extract_session_state(session_id: str, messages: list, task_description: str) -> SessionState:
    """Use the model to extract structured state from a conversation."""
    formatted = "\n".join(
        f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}"
        for m in messages
    )

    extraction_prompt = f"""Analyze this conversation and extract the session state as JSON.

Return exactly this structure:
{{
  "constraints": ["list of hard requirements"],
  "decisions": [{{"decision": "what was decided", "rationale": "why"}}],
  "open_questions": ["unresolved items"],
  "last_known_state": "one paragraph on current state"
}}

Conversation:
{formatted}"""

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=800,
        messages=[{"role": "user", "content": extraction_prompt}]
    )

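    text = response.content[0].text
    start, end = text.find('{'), text.rfind('}') + 1
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in the extraction response")
    data = json.loads(text[start:end])

    # Read keys explicitly: a missing or unexpected key in the model output
    # would otherwise raise a TypeError in the constructor.
    return SessionState(
        session_id=session_id,
        created_at=datetime.now().isoformat(),
        task_description=task_description,
        constraints=data.get("constraints") or [],
        decisions=data.get("decisions") or [],
        open_questions=data.get("open_questions") or [],
        artifacts=[],
        last_known_state=data.get("last_known_state") or "",
    )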
```

The workflow is: at session end, call `extract_session_state`. Save the result to disk (or a database). At the start of the next session, load the state and inject `state.to_context_block()` as the first user message, with a brief assistant acknowledgment.

```python
def resume_session(state_path: str) -> list:
    """Build the opening messages for a resumed session."""
    state = SessionState.load(state_path)

    return [
        {"role": "user", "content": state.to_context_block()},
        {"role": "assistant", "content": (
            f"Resuming session {state.session_id}. I have the context from our previous work. "
            f"Current open questions: {', '.join(state.open_questions) if state.open_questions else 'none'}. "
            "What would you like to tackle next?"
        )}
    ]
```

This opening costs roughly 300–600 tokens depending on how much state was extracted — a fraction of what full history injection would cost, and it contains exactly the information the model needs to continue meaningfully.

> **Warning:** Session state extraction is only as good as the conversation it extracts from. If your conversation was vague about decisions and constraints, the extracted state will be vague. Make explicit, declarative statements during sessions: "We have decided to use PostgreSQL for this project." "The constraint is that all API responses must be under 200ms." The model extracts what is said, not what is implied.

For longer-lived projects where sessions accumulate over days or weeks, maintain a project-level state file that aggregates across sessions:

```python
class ProjectContext:
    def __init__(self, project_path: str):
        self.path = f"{project_path}/.llm_context.json"
        self._load()

    def _load(self):
        try:
            with open(self.path) as f:
                self.data = json.load(f)
        except FileNotFoundError:
            self.data = {
                "project": "",
                "persistent_constraints": [],
                "architecture_decisions": [],
                "sessions": []
            }

    def add_session(self, state: SessionState):
        self.data["sessions"].append({
            "id": state.session_id,
            "date": state.created_at,
            "summary": state.last_known_state
        })
        # Merge constraints, skipping duplicates while keeping insertion order
        # (round-tripping through a set would reshuffle the list on each save)
        for c in state.constraints:
            if c not in self.data["persistent_constraints"]:
                self.data["persistent_constraints"].append(c)

        with open(self.path, 'w') as f:
            json.dump(self.data, f, indent=2)
```

This file becomes the authoritative project memory. New sessions load from it, not from any individual session transcript.
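
A sketch of the "new sessions load from it" step: the function below turns the project data dict into an opening context block. The name `project_context_block` and the `recent_sessions` parameter are assumptions for illustration; the dict keys follow the `.llm_context.json` layout used above.

```python
def project_context_block(data: dict, recent_sessions: int = 3) -> str:
    """Format project-level memory as a context block for the start of a session."""
    lines = [f"## Project: {data.get('project') or 'unnamed'}", "", "**Persistent Constraints:**"]
    lines.extend(f"- {c}" for c in data.get("persistent_constraints", []))

    lines.extend(["", "**Architecture Decisions:**"])
    for d in data.get("architecture_decisions", []):
        # Accept either plain strings or {"decision": ...} dicts
        lines.append(f"- {d.get('decision', d) if isinstance(d, dict) else d}")

    # Only the most recent session summaries; older ones stay in the file
    sessions = data.get("sessions", [])
    recent = sessions[-recent_sessions:] if recent_sessions > 0 else []
    lines.extend(["", "**Recent Sessions:**"])
    lines.extend(f"- {s['date']}: {s['summary']}" for s in recent)
    return "\n".join(lines)
```

Capping the number of session summaries keeps the opening block bounded no matter how long the project runs, which is the same idea as rolling compression applied across sessions.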

**Key Takeaways**

- Session continuity is your responsibility, not the model's — design for it explicitly
- Serialize structured state (decisions, constraints, open questions) not conversation transcripts
- Session state injection costs a few hundred tokens; full history injection costs thousands
- Use fast models for state extraction — it is mechanical work
- Maintain project-level context files for multi-session work on the same project

**Practical Exercise**

Build the `SessionState` class and `extract_session_state` function. Run a real 20-turn conversation on a coding task, then extract the session state. Save it, start a fresh Python interpreter, load the state, build the resumption messages, and start a new conversation. Evaluate whether the resumed session has sufficient context to continue the work without re-explanation.

---

## Chapter 7: Multi-Turn Conversations Without Context Bleed

Context bleed is when information from one conversation topic influences responses to another, or when assumptions made during an earlier phase of a conversation persist inappropriately into later phases. It is the context management equivalent of a variable scope bug — state that should be local behaves as if it is global.

Context bleed manifests in several ways. The model applies a constraint from a previous task to a new, unrelated task. It uses variable names or terminology from an earlier code example when discussing a different problem. It maintains a "persona" or tone adopted for one user interaction when that interaction has clearly ended. In multi-tenant applications, context from one user's session bleeds into another's — a serious correctness and privacy issue.

> **Warning:** In multi-tenant applications, context bleed is not just a quality issue — it is a privacy and security issue. If user A's conversation influences user B's responses, you have a data isolation failure. Session isolation must be enforced at the application layer. Never share a conversation history object across user sessions.
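
As a minimal sketch of application-layer isolation, the store below keys every history by tenant and session and only ever hands out copies. The class name `SessionStore` and its in-memory dict are assumptions for illustration; a real deployment would back this with a database and authenticated tenant IDs.

```python
import copy
from collections import defaultdict


class SessionStore:
    """Keeps one message history per (tenant_id, session_id); never shares list objects."""

    def __init__(self):
        self._histories = defaultdict(list)

    def append(self, tenant_id: str, session_id: str, role: str, content: str) -> None:
        self._histories[(tenant_id, session_id)].append({"role": role, "content": content})

    def messages(self, tenant_id: str, session_id: str) -> list:
        # Return a deep copy so callers cannot mutate (or leak) the stored history.
        # .get() avoids creating an empty entry for unknown keys.
        return copy.deepcopy(self._histories.get((tenant_id, session_id), []))


store = SessionStore()
store.append("tenant-a", "s1", "user", "Our API key rotates weekly.")
store.append("tenant-b", "s1", "user", "Hello")
```

Because the tenant ID is part of the key, two tenants that happen to reuse the same session ID still get separate histories.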

The root cause is always the same: the wrong content is in the context. Fixing context bleed means being deliberate about what enters the context and when it leaves.

**Phase transitions** are one of the cleanest patterns for preventing bleed within a single session. When a conversation naturally transitions from one phase to another — from requirements gathering to implementation, from debugging to refactoring — explicitly mark the transition and prune context that was specific to the prior phase.

```python
class PhaseAwareConversation:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.phases: list[dict] = []  # {"name": str, "messages": list, "summary": str}
        self.current_phase: str = "init"
        self.current_messages: list = []

    def transition_to(self, new_phase: str, carry_forward: list[str] = None):
        """
        End the current phase, summarize it, start a new phase.
        carry_forward: list of facts to explicitly preserve from prior phase.
        """
        prior_phase = self.current_phase  # capture before it is overwritten below

        if self.current_messages:
            summary = summarize_segment(self.current_messages)
            self.phases.append({
                "name": prior_phase,
                "messages": self.current_messages,
                "summary": summary
            })

        # Build fresh context for new phase
        self.current_phase = new_phase
        self.current_messages = []

        if carry_forward:
            # Only bring forward what is explicitly needed
            carry_content = "\n".join(f"- {fact}" for fact in carry_forward)
            self.current_messages = [
                {
                    "role": "user",
                    # prior_phase is safe even when no phase was recorded yet;
                    # indexing self.phases[-1] would raise IndexError in that case
                    "content": f"[CONTEXT FROM PRIOR PHASE: {prior_phase}]\n{carry_content}"
                },
                {
                    "role": "assistant",
                    "content": "Understood. I will carry these facts forward."
                }
            ]

    def add_turn(self, role: str, content: str):
        self.current_messages.append({"role": role, "content": content})

    def get_messages(self) -> list:
        return self.current_messages
```

The key design decision here is `carry_forward`: an explicit list of facts that should survive the transition. This forces you to decide what is genuinely needed in the new phase, rather than dragging everything along by default.

**Scope isolation** is the pattern for preventing bleed between conceptually separate tasks within a session. Some multi-agent or pipeline architectures require the model to handle discrete subtasks sequentially. Each subtask should receive only its own context, not the accumulated context of all previous subtasks.

```python
def isolated_subtask(
    subtask_prompt: str,
    shared_context: str,  # Only the truly shared context (project constraints, etc.)
    model: str = "claude-opus-4-7"
) -> str:
    """
    Run a subtask in isolation — only shared context + task-specific prompt.
    No prior subtask messages.
    """
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        system=shared_context,
        messages=[{"role": "user", "content": subtask_prompt}]
    )
    return response.content[0].text


# In a pipeline:
SHARED_CONTEXT = """
Project: E-commerce checkout service
Language: Python 3.11
Constraints: No external HTTP libraries except httpx. All currency in cents (integer).
"""

subtask_results = []
for subtask in ["Write the cart validation function", "Write the payment processing function"]:
    result = isolated_subtask(subtask, SHARED_CONTEXT)
    subtask_results.append(result)
    # Each subtask starts clean — no bleed from the previous one
```

This is more expensive than a single continuous conversation because each subtask call sends the shared context again. But it is architecturally sound — each subtask is guaranteed to receive only relevant context.

> **Try This:** In your next multi-step pipeline, identify which context elements are truly "shared" (relevant to all steps) vs. "local" (relevant only to the current step). Move shared context into the system prompt. Keep local context only in the user message for that step. This partition eliminates a large class of bleed errors.
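
The shared/local partition can be sketched as a small request builder. The function name `build_step_request` and its parameters are assumptions for illustration; the returned dict uses the same keyword arguments passed to `client.messages.create` elsewhere in this guide.

```python
def build_step_request(
    shared_context: str,
    task: str,
    local_context: str = "",
    model: str = "claude-opus-4-7",
    max_tokens: int = 2048,
) -> dict:
    """Shared context goes in the system prompt; local context stays in this step's user message."""
    user_content = task if not local_context else f"{local_context}\n\nTask: {task}"
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": shared_context,
        "messages": [{"role": "user", "content": user_content}],
    }


request = build_step_request(
    shared_context="Project: checkout service. All currency in cents (integer).",
    task="Write the cart validation function.",
    local_context="Cart items are dicts with 'sku', 'qty', and 'unit_price_cents'.",
)
# Send with: client.messages.create(**request)
```

Because each request is built from scratch, nothing from a previous step can reach the model unless you pass it in explicitly.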

**Context expiration** is the pattern for content that is valid for a limited number of turns. Some instructions are only relevant for a specific sub-task: "For the next three responses, output only the modified lines without the surrounding function." After those three responses, the instruction should no longer apply, but without explicit management, it will persist in the context and potentially influence later turns.

```python
class ExpiringContext:
    def __init__(self):
        self.items: list[dict] = []
        self.turn_count: int = 0

    def add(self, content: str, expires_after_turns: int):
        self.items.append({
            "content": content,
            "added_at": self.turn_count,
            "expires_at": self.turn_count + expires_after_turns
        })

    def tick(self):
        self.turn_count += 1
        self.items = [i for i in self.items if i["expires_at"] > self.turn_count]

    def active_context(self) -> str:
        if not self.items:
            return ""
        return "\n".join(i["content"] for i in self.items)
```

Integrate this into your message assembly loop so that expired context is automatically omitted from subsequent calls.
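
One possible shape for that integration, as a sketch: rebuild the system prompt on every call from the base prompt plus whichever items are still active. The function name `assemble_system_prompt` and the "Temporary instructions" label are assumptions; items use the `content` and `expires_at` keys from the class above.

```python
def assemble_system_prompt(base_prompt: str, items: list, turn: int) -> str:
    """Append only the unexpired items to the base system prompt for this turn."""
    active = [item["content"] for item in items if item["expires_at"] > turn]
    if not active:
        return base_prompt
    bullet_list = "\n".join(f"- {content}" for content in active)
    return f"{base_prompt}\n\nTemporary instructions:\n{bullet_list}"


# An instruction added at turn 0 that should apply to three responses
items = [{"content": "Output only the modified lines.", "added_at": 0, "expires_at": 3}]
prompts = [
    assemble_system_prompt("You are a code reviewer.", items, turn)
    for turn in range(5)
]
# prompts[0], prompts[1], prompts[2] include the temporary instruction;
# prompts[3] and prompts[4] are just the base prompt.
```

Keeping temporary instructions out of the message history matters here: an instruction stored as a user message stays in context forever, while one assembled into the system prompt disappears as soon as it expires.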

**Key Takeaways**

- Context bleed is a scope management problem — the wrong content is in context for the wrong task
- Multi-tenant applications must enforce session isolation at the application layer, not the model layer
- Phase transitions with explicit carry-forward lists are the cleanest intra-session bleed prevention
- Isolated subtasks (each with fresh context) trade token cost for correctness guarantees
- Expiring context items handle time-limited instructions without manual removal

**Practical Exercise**

Audit a multi-turn pipeline or conversation handler you own. Identify any place where context from step N is present during step N+3 or later without explicit intent. Classify each instance as intentional carry-forward or accidental bleed. Fix the accidental bleed cases using one of the patterns above.

---

## Chapter 8: Testing Context Quality and Measuring Degradation

You cannot manage what you do not measure. Context quality is not observable by reading your code — you must test it empirically, with real inputs and real model outputs, against defined quality criteria. The absence of explicit context quality testing is why context management problems tend to be discovered in production, not in development.

There are three things worth measuring: retrieval quality (for RAG systems), instruction adherence (do constraints persist correctly), and factual consistency (does the model maintain accurate knowledge of session-established facts across turns).

**Retrieval quality** is measured by recall@k: given a set of queries with known-relevant chunks, what fraction of the relevant chunks appear in the top k retrieved results?

```python
def evaluate_retrieval(
    retriever,  # Callable: query -> list[str]
    test_cases: list[dict],  # [{"query": str, "expected_content": str}]
    k: int = 5
) -> dict:
    """
    Measure recall@k for a retriever.
    expected_content is a substring that should appear in one of the top-k chunks.
    """
    hits = 0
    results = []

    for case in test_cases:
        retrieved = retriever(case["query"])[:k]
        hit = any(case["expected_content"].lower() in chunk.lower() for chunk in retrieved)
        hits += int(hit)
        results.append({
            "query": case["query"],
            "hit": hit,
            "retrieved_count": len(retrieved)
        })

    recall = hits / len(test_cases)
    print(f"Recall@{k}: {recall:.2%} ({hits}/{len(test_cases)})")

    for r in results:
        status = "HIT" if r["hit"] else "MISS"
        print(f"  [{status}] {r['query'][:60]}")

    return {"recall_at_k": recall, "k": k, "results": results}


# Example test cases for a Python documentation retriever
test_cases = [
    {"query": "how to open a file safely", "expected_content": "context manager"},
    {"query": "list comprehension syntax", "expected_content": "for x in"},
    {"query": "exception handling", "expected_content": "try:"},
]
```

Run this against your RAG system before and after any chunking or indexing change. Treat recall drops as regressions.
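
When a query has several relevant chunks, recall@k in the strict sense needs chunk identifiers rather than a substring check. The sketch below assumes you have labeled relevant chunk IDs per query; the function names and the `retrieved_ids` / `relevant_ids` keys are illustrative.

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the relevant chunks that appear among the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("each test case needs at least one relevant chunk ID")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(cases: list, k: int) -> float:
    """Average recall@k over cases shaped like {'retrieved_ids': [...], 'relevant_ids': {...}}."""
    if not cases:
        raise ValueError("no test cases supplied")
    return sum(recall_at_k(c["retrieved_ids"], c["relevant_ids"], k) for c in cases) / len(cases)


cases = [
    {"retrieved_ids": ["c1", "c7", "c3", "c9"], "relevant_ids": {"c3", "c4"}},
    {"retrieved_ids": ["c2", "c5"], "relevant_ids": {"c2"}},
]
score = mean_recall_at_k(cases, k=3)
# First case finds 1 of 2 relevant chunks (0.5); second finds 1 of 1 (1.0) -> mean 0.75
```

Labeling chunk IDs is more work than writing substrings, but the IDs survive rewording of the documents and make partial retrieval visible.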

**Instruction adherence** testing checks whether model behavior matches your established constraints across turns. This is where context degradation causes the most user-visible failures.

```python
def test_instruction_adherence(
    conversation_builder,  # Callable: (turn, messages) -> next user message (str)
    constraints: list[dict],  # [{"instruction": str, "check": Callable}]
    turn_count: int = 20
) -> dict:
    """
    Build a synthetic conversation of `turn_count` turns.
    At each turn, verify that the response satisfies all constraints.
    """
    messages = []
    failures = {c["instruction"]: [] for c in constraints}

    for turn in range(turn_count):
        user_message = conversation_builder(turn, messages)
        messages.append({"role": "user", "content": user_message})

        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=512,
            system="Always respond in valid JSON with a 'response' key.",  # Example constraint
            messages=messages
        )

        assistant_content = response.content[0].text
        messages.append({"role": "assistant", "content": assistant_content})

        for constraint in constraints:
            if not constraint["check"](assistant_content):
                failures[constraint["instruction"]].append(turn)

    return {
        "turn_count": turn_count,
        "failures": failures,
        "first_failure": {k: min(v) if v else None for k, v in failures.items()}
    }


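import json as json_lib

def is_valid_json(text: str) -> bool:
    """True if the text is JSON, or contains a JSON object wrapped in prose or a code fence."""
    try:
        json_lib.loads(text)
        return True
    except json_lib.JSONDecodeError:
        # Fall back to the outermost {...} span (e.g. JSON inside a fenced block)
        start = text.find('{')
        if start == -1:
            return False
        try:
            json_lib.loads(text[start:text.rfind('}') + 1])
            return True
        except json_lib.JSONDecodeError:
            return False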

results = test_instruction_adherence(
    conversation_builder=lambda turn, _: f"Tell me something interesting about topic {turn}.",
    constraints=[{"instruction": "respond in JSON", "check": is_valid_json}],
    turn_count=15
)
print(f"First JSON failure at turn: {results['first_failure']}")
```

> **Key Insight:** The `first_failure` turn tells you your instruction durability window. If your constraint fails by turn 8, and your real sessions regularly run 20 turns, you have a production bug waiting to happen. The fix is either constraint re-injection (Chapter 2) or shorter sessions with explicit state serialization (Chapter 6).
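
To turn that comparison into something a script can act on, a small helper can flag constraints whose first failure falls inside your typical session length. The function name `durability_report` is an assumption; its input has the shape of the `first_failure` dict returned above (instruction mapped to a 0-based turn index, or `None`).

```python
def durability_report(first_failure: dict, typical_session_turns: int) -> dict:
    """Map each instruction to True if it is at risk within a typical session."""
    report = {}
    for instruction, turn in first_failure.items():
        # None means no failure was observed within the tested turns, which says
        # nothing about longer sessions than the test covered.
        report[instruction] = turn is not None and turn < typical_session_turns
    return report


report = durability_report(
    {"respond in JSON": 8, "no emojis": None, "cite sources": 25},
    typical_session_turns=20,
)
at_risk = [name for name, risky in report.items() if risky]
```

A `None` result is only as strong as the test was long, so run the adherence test for at least as many turns as your longest realistic session before trusting it.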

**Factual consistency** testing checks whether the model correctly recalls facts established earlier in the conversation, across increasing context distances.

```python
def test_fact_recall(fact: str, expected: str, query: str, filler_turns: int) -> bool:
    """
    Establish a fact in the first turn, pad with `filler_turns` filler turns,
    then query for the fact. `expected` is the key substring the answer must
    contain; matching the whole fact sentence would fail on any paraphrase.
    """
    messages = [
        {"role": "user", "content": f"Remember this: {fact}"},
        {"role": "assistant", "content": f"Noted: {fact}"}
    ]

    # Add padding turns
    for i in range(filler_turns):
        messages.append({"role": "user", "content": f"What is 2 + {i}?"})
        messages.append({"role": "assistant", "content": str(2 + i)})

    messages.append({"role": "user", "content": query})

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        messages=messages
    )

    answer = response.content[0].text.lower()
    return expected.lower() in answer


# Test at increasing distances
fact = "The database host is db-prod-01.internal"
expected = "db-prod-01.internal"
query = "What database host did I mention at the start?"

for padding_turns in [2, 5, 10, 20, 40]:
    recalled = test_fact_recall(fact, expected, query, padding_turns)
    print(f"Recall after {padding_turns} filler turns: {'YES' if recalled else 'NO'}")
```

> **Warning:** Model behavior on these tests will vary across model versions and releases. A context quality test suite that passes today is not guaranteed to pass after a model update. Pin your tests to specific model versions in CI and explicitly re-evaluate when you upgrade.

Combine these tests into a context quality CI suite that runs against your staging environment on every significant change to your context management code:

```bash
#!/bin/bash
# context_quality_check.sh
set -e

echo "Running RAG recall evaluation..."
python3 tests/eval_retrieval.py --min-recall 0.85

echo "Running instruction adherence test..."
python3 tests/eval_adherence.py --min-turns 15

echo "Running fact recall test..."
python3 tests/eval_fact_recall.py --max-padding 20

echo "All context quality checks passed."
```
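Because the script uses `set -e`, each Python check only has to exit non-zero when its threshold is missed. The following is a minimal sketch of how a script such as `tests/eval_fact_recall.py` might be structured. The `--max-padding` flag comes from the shell script above; `recall_depth`, `gate`, and the injected `recall_at` callable are assumptions made for illustration, and the pinned model name is simply the one used in this chapter.

```python
import argparse
from typing import Callable, Sequence

PINNED_MODEL = "claude-opus-4-7"  # pin explicitly; change only in a reviewed upgrade


def recall_depth(recall_at: Callable[[int], bool], paddings: Sequence[int]) -> int:
    """Largest padding (in filler turns) recalled before the first miss.

    `recall_at(n)` is a hypothetical callable that runs the fact recall
    test with n filler turns and returns True on a correct recall.
    Returns 0 if even the smallest padding fails.
    """
    depth = 0
    for n in sorted(paddings):
        if not recall_at(n):
            break
        depth = n
    return depth


def gate(depth: int, required: int) -> int:
    """Map a measured depth to a process exit code (0 = pass, 1 = fail)."""
    return 0 if depth >= required else 1


def main(argv: Sequence[str], recall_at: Callable[[int], bool]) -> int:
    parser = argparse.ArgumentParser(description="Fact recall depth gate")
    parser.add_argument("--max-padding", type=int, required=True,
                        help="filler-turn depth that must still be recalled")
    args = parser.parse_args(argv)

    # Probe the standard depths below the requirement, plus the requirement itself.
    paddings = sorted({p for p in (2, 5, 10, 20, 40) if p < args.max_padding}
                      | {args.max_padding})
    depth = recall_depth(recall_at, paddings)
    print(f"model={PINNED_MODEL} recall_depth={depth} required={args.max_padding}")
    return gate(depth, args.max_padding)
```

In a real script you would finish with something like `sys.exit(main(sys.argv[1:], recall_at=...))`, passing a function that calls the model. Keeping the model call behind `recall_at` also lets you unit test the gating logic with a fake, without spending tokens.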

**Key Takeaways**

- Measure recall@k for RAG systems; treat drops as regressions in CI
- Instruction adherence testing reveals the durability window of your constraints — the number of turns before they fail
- Fact recall testing reveals the effective context depth at which information degrades
- Pin tests to specific model versions; re-evaluate on every model upgrade
- Context quality testing belongs in CI, not just in ad-hoc manual review

**Practical Exercise**

Write and run the `test_instruction_adherence` test against your primary context management setup. Find the turn at which your most critical constraint first fails. If that number is less than your typical session length, implement one of the mitigation patterns from earlier chapters and re-run until the failure turn exceeds your session length by a comfortable margin.

---

## Chapter 9: Building Context-Aware Pipelines

All the techniques from the previous chapters converge in pipeline design. A context-aware pipeline is not a script that calls an LLM in a loop — it is a system with explicit context management as a first-class concern, with defined strategies for what enters context at each step, how context is compressed and carried forward, and how the pipeline fails gracefully when context limits are approached.

The anatomy of a context-aware pipeline has five layers: an input processor (normalizes and filters incoming content), a context assembler (builds the context for each model call), a model caller (handles the API call and token accounting), an output processor (parses and stores model output), and a state manager (maintains context state across steps and sessions).
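To make the separation concrete, here is a minimal sketch of the five layers as plain functions wired together. Everything in it is illustrative: the function names, the `State` dataclass, and the `fake_model` stand-in (which echoes its input instead of calling an API) are assumptions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class State:
    """State manager: what survives between steps."""
    history: list[dict] = field(default_factory=list)
    outputs: dict[str, str] = field(default_factory=dict)


def process_input(raw: str) -> str:
    """Input processor: trim each line and drop empty ones."""
    return "\n".join(line.strip() for line in raw.splitlines() if line.strip())


def assemble_context(state: State, user_text: str, max_messages: int = 6) -> list[dict]:
    """Context assembler: recent history plus the new user message."""
    return state.history[-max_messages:] + [{"role": "user", "content": user_text}]


def fake_model(messages: list[dict]) -> str:
    """Model caller stand-in: a real one would call the API and record usage."""
    return f"ANSWER: {messages[-1]['content'][:40]}"


def process_output(raw: str) -> str:
    """Output processor: strip the wrapper the model was asked to emit."""
    return raw.removeprefix("ANSWER:").strip()


def run_step(state: State, name: str, raw_input: str,
             call_model: Callable[[list[dict]], str] = fake_model) -> str:
    text = process_input(raw_input)
    messages = assemble_context(state, text)
    result = process_output(call_model(messages))
    # State manager: persist the exchange and the named step output.
    state.history += [messages[-1], {"role": "assistant", "content": result}]
    state.outputs[name] = result
    return result
```

Each layer can be tested and replaced independently. Swapping `fake_model` for a real API call, for instance, changes nothing in the assembler or the state manager.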

> **Key Insight:** A pipeline that does not explicitly manage context will implicitly manage it badly — either by including everything (expensive, degradation-prone) or by including nothing (cheap, useless). Explicit context management is not optional overhead; it is the difference between a pipeline that works at scale and one that fails at turn 10.

Here is a reference implementation of a context-aware pipeline for a multi-step document analysis task:

```python
import anthropic
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PipelineContext:
    system_prompt: str
    max_context_tokens: int = 50000
    compression_threshold: float = 0.75
    messages: list = field(default_factory=list)
    token_count: int = 0
    step_outputs: dict = field(default_factory=dict)

    def token_budget_remaining(self) -> int:
        return self.max_context_tokens - self.token_count

    def budget_pct(self) -> float:
        return self.token_count / self.max_context_tokens


class ContextAwarePipeline:
    def __init__(self, system_prompt: str, max_tokens: int = 50000):
        self.client = anthropic.Anthropic()
        self.ctx = PipelineContext(system_prompt=system_prompt, max_context_tokens=max_tokens)
        self.step_history: list[dict] = []

    def _count_tokens(self) -> int:
        if not self.ctx.messages:
            return 0
        return self.client.messages.count_tokens(
            model="claude-opus-4-7",
            system=self.ctx.system_prompt,
            messages=self.ctx.messages
        ).input_tokens

    def _maybe_compress(self):
        self.ctx.token_count = self._count_tokens()
        if self.ctx.budget_pct() > self.ctx.compression_threshold:
            print(f"[PIPELINE] Context at {self.ctx.budget_pct():.0%}, compressing...")
            # Cut on an even index so the retained history still starts with a
            # user message and roles keep alternating after the summary pair.
            cutoff = max(2, len(self.ctx.messages) // 2)
            cutoff -= cutoff % 2
            old_messages = self.ctx.messages[:cutoff]
            self.ctx.messages = self.ctx.messages[cutoff:]

            # summarize_segment() is assumed to be defined elsewhere (for
            # example, the summarization helper built earlier in this guide).
            summary = summarize_segment(old_messages)
            self.ctx.messages.insert(0, {
                "role": "user",
                "content": f"[COMPRESSED PIPELINE CONTEXT]\n{summary}"
            })
            self.ctx.messages.insert(1, {
                "role": "assistant",
                "content": "Context compressed and acknowledged."
            })
            self.ctx.token_count = self._count_tokens()
            print(f"[PIPELINE] After compression: {self.ctx.token_count} tokens")

    def step(
        self,
        step_name: str,
        user_message: str,
        inject_documents: Optional[list[str]] = None,
        isolated: bool = False,
        max_output_tokens: int = 1024
    ) -> str:
        """
        Execute one pipeline step.
        isolated=True runs without history (for independent subtasks).
        """
        content = user_message
        if inject_documents:
            doc_block = "\n\n---\n".join(inject_documents)
            content = f"[DOCUMENTS]\n{doc_block}\n\n[TASK]\n{user_message}"

        if isolated:
            messages_to_send = [{"role": "user", "content": content}]
        else:
            self._maybe_compress()
            self.ctx.messages.append({"role": "user", "content": content})
            messages_to_send = self.ctx.messages

        response = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=max_output_tokens,
            system=self.ctx.system_prompt,
            messages=messages_to_send
        )

        output = response.content[0].text

        if not isolated:
            self.ctx.messages.append({"role": "assistant", "content": output})
            self.ctx.token_count = response.usage.input_tokens + response.usage.output_tokens

        self.ctx.step_outputs[step_name] = output
        self.step_history.append({
            "step": step_name,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "isolated": isolated
        })

        print(f"[PIPELINE:{step_name}] {response.usage.input_tokens}in/{response.usage.output_tokens}out tokens")
        return output

    def summary_report(self) -> dict:
        total_in = sum(s["input_tokens"] for s in self.step_history)
        total_out = sum(s["output_tokens"] for s in self.step_history)
        return {
            "steps": len(self.step_history),
            "total_input_tokens": total_in,
            "total_output_tokens": total_out,
            "step_detail": self.step_history
        }
```

Using this pipeline for a document analysis task:

```python
from pathlib import Path

pipeline = ContextAwarePipeline(
    system_prompt="You are a technical document analyst. Be precise and factual.",
    max_tokens=60000
)

# Read the spec once and reuse it; Path.read_text() also closes the file
spec_text = Path("technical_spec.txt").read_text()

# Step 1: Extract structure (isolated: no prior context needed)
structure = pipeline.step(
    "extract_structure",
    "Extract the main sections and their purpose.",
    inject_documents=[spec_text],
    isolated=True
)

# Step 2: Identify requirements (builds on structure)
requirements = pipeline.step(
    "identify_requirements",
    f"Given this document structure:\n{structure}\n\nList all functional requirements.",
    inject_documents=[spec_text]
)

# Step 3: Flag ambiguities (builds on requirements)
ambiguities = pipeline.step(
    "flag_ambiguities",
    "Identify any requirements that are ambiguous or underspecified."
)

# Step 4: Generate clarifying questions (isolated — just needs the ambiguities output)
questions = pipeline.step(
    "generate_questions",
    f"Generate precise clarifying questions for:\n{ambiguities}",
    isolated=True,
    max_output_tokens=512
)

print(pipeline.summary_report())
```

> **Try This:** Add the `summary_report` call to every pipeline you build. Review it after each pipeline run during development. Watch for steps where input tokens are disproportionately high compared to the value of that step's output — that is where your context management is inefficient.

The pipeline above handles compression automatically, distinguishes between isolated and sequential steps, and produces a token audit at the end. This is the minimum viable context-aware pipeline. As you add steps, track the token count at each step — the growth curve tells you where you need to add compression or switch to isolated execution.
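One way to read that growth curve is to compute the step-to-step change in input tokens from the `step_detail` list returned by `summary_report()`. The helper below is a sketch: `flag_growth` and its `ratio` threshold are hypothetical, the sample numbers are made up, and the `cross_check` step is invented for illustration. It compares sequential steps only, since isolated steps do not carry history.

```python
def flag_growth(step_detail: list[dict], ratio: float = 1.5) -> list[str]:
    """Return names of sequential steps whose input tokens grew by more
    than `ratio` times relative to the previous sequential step."""
    flagged = []
    previous = None
    for step in step_detail:
        if step["isolated"]:
            continue  # isolated steps do not reflect accumulated history
        current = step["input_tokens"]
        if previous is not None and previous > 0 and current > previous * ratio:
            flagged.append(step["step"])
        previous = current
    return flagged


# Made-up data in the shape produced by summary_report()["step_detail"].
sample = [
    {"step": "extract_structure", "input_tokens": 9000, "output_tokens": 400, "isolated": True},
    {"step": "identify_requirements", "input_tokens": 9600, "output_tokens": 900, "isolated": False},
    {"step": "flag_ambiguities", "input_tokens": 10600, "output_tokens": 700, "isolated": False},
    {"step": "cross_check", "input_tokens": 21500, "output_tokens": 800, "isolated": False},
]
print(flag_growth(sample))  # prints: ['cross_check']
```

A flagged step is a candidate for compression beforehand or for isolated execution. The right `ratio` depends on how much new material each step is expected to inject.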

For production systems, add error handling for context limit violations:

```python
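# Intended as a method on ContextAwarePipeline
def safe_step(self, step_name: str, user_message: str, **kwargs) -> Optional[str]:
    """Pipeline step with context limit guard."""
    isolated = kwargs.pop("isolated", False)  # avoid passing `isolated` twice on retry

    projected = self._count_tokens() + len(user_message.split()) * 2  # rough estimate
    if not isolated and projected > self.ctx.max_context_tokens * 0.95:
        print(f"[PIPELINE:{step_name}] Context limit imminent, attempting compression")
        self._maybe_compress()

    try:
        return self.step(step_name, user_message, isolated=isolated, **kwargs)
    except anthropic.BadRequestError as e:
        # Error wording differs across providers and API versions; these
        # substrings are assumptions, so check them against real errors.
        message = str(e).lower()
        if isolated or not ("context_length" in message or "too long" in message):
            raise
        print(f"[PIPELINE:{step_name}] Context limit hit, retrying as isolated step")
        # Remove the user message the failed call left in the shared history
        if self.ctx.messages and self.ctx.messages[-1]["role"] == "user":
            self.ctx.messages.pop()
        return self.step(step_name, user_message, isolated=True, **kwargs)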
```

**Key Takeaways**

- A context-aware pipeline has five layers: input processing, context assembly, model calling, output processing, and state management
- Distinguish between isolated steps (no history) and sequential steps (accumulate history) — not every step needs history
- Automatic compression with a token budget threshold keeps pipeline context bounded
- Token audit by step is the primary diagnostic for pipeline context efficiency
- Handle context limit errors gracefully with a fallback to isolated execution

**Practical Exercise**

Identify a multi-step task you currently handle with separate, disconnected LLM calls. Rewrite it using the `ContextAwarePipeline` class above. Run the `summary_report` to see the token cost per step. Identify at least one step that should be `isolated=True` and one that benefits from accumulated history. Measure the quality difference.

---

## Conclusion

Context management is not a feature you add to an LLM application after it works — it is a property you design in from the start. Every architectural decision you make about what enters context, how it is compressed, when it is pruned, and how sessions are serialized will manifest directly in the quality, cost, and reliability of your application.

The mental model that makes this tractable is thinking of context as a resource with both a capacity limit and a quality gradient. Capacity is the token limit — hard and enforced by the API. Quality is softer: attention is not uniform, position matters, and the same information in different locations produces different model behavior. Managing context means managing both dimensions simultaneously.

The tools available to you are now concrete. You know how to audit token budgets by category. You know that attention degrades for content in the middle of long contexts, and that critical constraints belong at the beginning of the system prompt. You know how to select minimum-viable-context by asking what information the model would be missing without a given piece of content. You know how to compress conversation history into dense structured summaries using cheap, fast models. You know when RAG outperforms full context and how to build a hybrid retriever. You know how to serialize session state so work survives session boundaries. You know how to prevent context bleed with phase transitions and isolated subtask execution. You know how to measure retrieval quality, instruction adherence, and factual consistency empirically, and how to run those measurements in CI.

What you do with these tools depends on your specific application. A coding assistant has different context pressures than a document analysis pipeline, which has different pressures than a customer support chatbot. The principles are the same; the specific trade-offs differ. The recurring exercise throughout this guide — run the test, measure the failure turn, apply the mitigation, re-run — is a process, not a recipe. You will encounter edge cases these chapters did not cover. The process will handle them.

A few final principles to carry forward:

**Measure before optimizing.** The token audit, the needle-in-haystack test, the instruction adherence test — run these before you assume a problem exists and before you claim a solution works. Context bugs are subtle enough that they resist intuition.

**Prefer explicit over implicit.** State your constraints explicitly. Serialize your session state explicitly. Declare your context budget explicitly. Implicit context management — "it should just remember" — is the source of most context bugs.

**Match the model to the task.** Compression and extraction work do not need a frontier model. Use Haiku-class models for summarization, distillation, and state extraction. Reserve your most capable model for reasoning tasks. This is both faster and cheaper, and it directs your costs to where they generate the most value.

**Treat context quality testing as a first-class concern.** Your retrieval recall, your instruction adherence window, your fact recall depth — these are quality metrics for your application, as important as unit test coverage or response latency. They belong in CI.

The craft here is knowing which technique applies to which problem. That comes from running experiments, reading failures carefully, and building a feel for how these systems behave under pressure. The chapters gave you the foundation. The exercises gave you the methods. The rest is practice.

---

## Appendix A: Glossary

**Attention mechanism** — The core operation in transformer models that determines how much each token "attends to" every other token when computing a representation. Implemented as scaled dot-product attention over query, key, and value matrices.

**BM25** — Best Match 25. A sparse retrieval algorithm based on term frequency and inverse document frequency. Effective for keyword-based retrieval; used alongside dense (embedding) retrieval in hybrid search.

**Chunk** — A segment of a larger document, produced by splitting the document for indexing. Chunk size and overlap are primary variables in RAG system quality.

**Context bleed** — The unwanted persistence of context from one task, phase, or user session into another. A correctness and potential privacy issue in multi-tenant or multi-task applications.

**Context window** — The maximum number of tokens a language model can handle in a single request. System prompt, conversation history, user message, tool outputs, retrieved documents, and the generated response all compete for space within it.

**Distillation (context)** — The extraction of specific structured facts from a larger body of content for compact context injection. Distinct from summarization, which preserves general meaning rather than specific facts.

**Embedding** — A dense vector representation of text, produced by an embedding model. Semantically similar texts produce numerically similar vectors, enabling semantic search via cosine similarity or similar distance metrics.

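**Forward pass** — A single evaluation of the neural network over its input tokens, producing a probability distribution for the next token. Generating a response takes one forward pass per output token (with earlier computation typically cached), so a single API call involves many forward passes.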

**Lost in the middle** — The empirically documented phenomenon where transformer models perform worse on information located in the middle of a long context, compared to information at the beginning or end.

**RAG (Retrieval-Augmented Generation)** — A pattern where relevant content is retrieved from an external store at query time and injected into context, rather than including all content upfront. Scales to arbitrarily large knowledge bases.

**Recall@k** — A retrieval evaluation metric: the fraction of queries for which the relevant content appears within the top k retrieved results.

**RRF (Reciprocal Rank Fusion)** — An algorithm for merging multiple ranked lists into a single ranking. Used in hybrid retrieval to combine dense (semantic) and sparse (BM25) search results.

**Rolling compression** — A session management pattern where the oldest context segments are periodically summarized and replaced with their summaries, keeping total context size bounded regardless of session length.

**Session state** — A structured record of the decisions, constraints, open questions, and current status extracted from a conversation, used to resume work in a new session without injecting full conversation history.

**Sliding window** — A history selection strategy that retains only the most recent N messages (or the messages that fit within a token budget), discarding older history.

**Summarization (context)** — Lossy compression of a conversation segment or document into a shorter prose form, trading detail for reduced token count.

**System prompt** — The instruction block sent as the `system` parameter in an API call. It is placed ahead of the conversation messages in the model's input and is treated as standing instructions for the whole exchange. Where persistent, session-spanning constraints should live.

**Token** — The basic unit of text for language model processing. Approximately 4 characters or 0.75 words in English. Every operation in the API is measured in tokens: input, output, and context limits.

**Token budget** — The allocation of available context window tokens across categories (system, history, documents, query). Managing the budget explicitly is the foundation of context optimization.

---

## Appendix B: Tools and Resources

**Anthropic Python SDK**
```bash
pip install anthropic
```
Official SDK for Claude API access. Includes `messages.count_tokens()` for pre-call token counting and `messages.create()` for all model interactions. Documentation at the Anthropic developer portal.

**OpenAI Python SDK**
```bash
pip install openai
```
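SDK for OpenAI models such as GPT-4 and GPT-4o. `tiktoken` (companion library) provides accurate token counting: `pip install tiktoken`. Use `len(tiktoken.encoding_for_model("gpt-4o").encode(text))` to count tokens; `encode` returns the list of token IDs, so its length is the token count.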

**ChromaDB**
```bash
pip install chromadb
```
Embedded vector database. No external service required. Suitable for development and medium-scale production. Supports both in-memory and persistent storage modes.

**rank-bm25**
```bash
pip install rank-bm25
```
Pure Python BM25 implementation. Sufficient for hybrid retrieval when combined with ChromaDB for dense search. `BM25Okapi` is the standard variant.

**LlamaIndex**
```bash
pip install llama-index
```
Framework for building RAG pipelines with connectors for common document types (PDF, HTML, Markdown, code). Handles chunking, indexing, and retrieval with configurable strategies.

**LangChain**
```bash
pip install langchain langchain-anthropic
```
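Framework for composing LLM pipelines. Includes text splitters, document loaders, and memory management classes. `ConversationSummaryMemory` and `ConversationBufferWindowMemory` implement patterns close to this guide's techniques; both are legacy classes that recent LangChain releases have deprecated, so check the current documentation before adopting them.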

**tiktoken**
```bash
pip install tiktoken
```
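OpenAI's tokenizer library. Also usable for rough token estimation with non-OpenAI models, since most use BPE-style tokenizers, but counts can differ noticeably between vocabularies. When the budget is tight, use the provider's own counter (for example, `messages.count_tokens()` for Claude).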

**pdftotext (poppler-utils)**
```bash
# Ubuntu/Debian
sudo apt install poppler-utils

# macOS
brew install poppler
```
Command-line PDF text extraction. More reliable than pure-Python alternatives for structured PDFs.

**sentence-transformers**
```bash
pip install sentence-transformers
```
Local embedding model library. Useful for offline or privacy-sensitive applications where you cannot send content to an external embedding API. `all-MiniLM-L6-v2` is a good starting model.

---

## Appendix C: Further Reading

**"Lost in the Middle: How Language Models Use Long Contexts"**
Liu, Nelson F., et al. (2023). The empirical study that documented the attention degradation pattern discussed in Chapter 2. Available on arXiv (2307.03172). Read the experimental methodology: its multi-document question answering and key-value retrieval tasks are a template for running position-sensitivity tests, such as the needle-in-haystack test, against your own model.

**"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"**
Lewis, Patrick, et al. (2020). The original RAG paper. The architecture it describes has since been extended significantly, but the core insight (decouple knowledge storage from model parameters) remains foundational.

**"Dense Passage Retrieval for Open-Domain Question Answering"**
Karpukhin, Vladimir, et al. (2020). The paper behind dense retrieval using learned embeddings. Understanding DPR gives you the conceptual foundation for why embedding-based retrieval works and where it fails.

**Anthropic Prompt Engineering Guide**
The official documentation on structuring prompts for Claude. Pay particular attention to the sections on long-context usage and system prompt design. Available in the Anthropic developer documentation.

**"Efficient Transformers: A Survey"**
Tay, Yi, et al. (2022). Covers the attention complexity problem and approaches to extending context length — sparse attention, linear attention, and related techniques. Relevant background for understanding why context limits exist and how they are being extended.

**"RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval"**
Sarthi, Parth, et al. (2024). A RAG variant that builds hierarchical summaries of document sets, enabling retrieval at multiple levels of granularity. Directly applicable to large document corpora where flat chunking misses cross-document structure.

**"Generative Agents: Interactive Simulacra of Human Behavior"**
Park, Joon Sung, et al. (2023). Although framed around agent simulation, the memory stream and retrieval architecture in this paper is directly applicable to session continuity design. The pattern of extracting "importance scores" for memory selection maps to the distillation techniques in Chapter 4.

**Hugging Face Documentation: Text Generation**
The Hugging Face documentation on `transformers` generation covers KV cache, context extension techniques (RoPE scaling, YaRN), and quantization. Relevant if you are running local models and need to understand context window behavior at the implementation level.

**"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"**
Wei, Jason, et al. (2022). The chain-of-thought paper is relevant here not for reasoning, but because CoT responses are significantly longer than non-CoT responses — and that affects your token budget. Understanding the token cost of reasoning modes is practically important for context planning.

---

*How to Manage Context Windows Effectively — Kelly Price — 2026*

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*
