---
title: "How to Cut Your LLM Bill in Half"
subtitle: "Token Optimization, Caching, and Routing Strategies That Work"
author: "Kelly Price"
date: "2026-04-21"
description: "The practical guide to reducing LLM API costs without degrading output quality — covering context optimization, prompt efficiency, model routing, and caching."
tags: [ai, developer-tools, productivity]
---

# How to Cut Your LLM Bill in Half
## Token Optimization, Caching, and Routing Strategies That Work

*Kelly Price*

---

## About This Guide

Most teams discover their LLM costs are out of control the same way: the billing alert fires, someone opens the dashboard, and the number staring back at them is three to five times what anyone expected. The immediate instinct is to blame usage volume. That's almost always wrong.

The real problem is invisible waste — tokens spent on context that doesn't move the needle, prompts that over-explain what the model already knows, questions answered identically for the hundredth time at full price, and tasks sent to expensive frontier models when a cheap fast model would have done the job in half the time at a tenth of the cost.

This guide exists because the engineering literature on LLM cost optimization is scattered, often shallow, and frequently written by people trying to sell you a platform rather than teach you the underlying mechanics. What follows is a practitioner's handbook. Everything here has been applied in production systems — not toy demos.

The strategies in this book are not about degrading quality to save money. That trade is almost never worth making. Instead, they are about eliminating the structural waste that accumulates in every LLM integration that wasn't designed with cost in mind from day one. Teams that apply even a subset of these techniques routinely cut their monthly API spend by 40–60% without their end users noticing any difference.

You should have basic familiarity with calling LLM APIs programmatically — Python or TypeScript is assumed for code examples, though the concepts translate to any language. You don't need to understand model architecture, transformer internals, or ML theory. This is an engineering book, not a research paper.

Each chapter targets a specific cost driver and gives you a concrete, deployable strategy. By the end, you'll have a systematic approach to token budgeting, caching, routing, monitoring, and team culture — the full stack of what it actually takes to run LLM features economically at scale.

Start with Chapter 1 even if you think you already know where your tokens go. You probably don't — not precisely enough to fix it.

---

## Table of Contents

1. [Where Your Token Budget Actually Goes](#chapter-1-where-your-token-budget-actually-goes)
2. [Context Window Economics: What Costs What](#chapter-2-context-window-economics-what-costs-what)
3. [Prompt Engineering for Efficiency](#chapter-3-prompt-engineering-for-efficiency)
4. [Semantic Caching: Stop Paying for the Same Answer Twice](#chapter-4-semantic-caching-stop-paying-for-the-same-answer-twice)
5. [Model Routing: Match Task Complexity to Model Cost](#chapter-5-model-routing-match-task-complexity-to-model-cost)
6. [Chunking and Retrieval as a Cost-Control Strategy](#chapter-6-chunking-and-retrieval-as-a-cost-control-strategy)
7. [Batching and Async Patterns for Throughput](#chapter-7-batching-and-async-patterns-for-throughput)
8. [Monitoring and Alerting on Token Spend](#chapter-8-monitoring-and-alerting-on-token-spend)
9. [Building a Token Budget Culture on Your Team](#chapter-9-building-a-token-budget-culture-on-your-team)

---

## Chapter 1: Where Your Token Budget Actually Goes

Before you can cut costs, you need an accurate picture of where the money is going. The distribution is almost never what engineers expect when they first measure it.

Most teams assume their biggest cost is output tokens — the model's response. That assumption is wrong in the majority of production applications. The dominant cost driver is almost always input tokens, specifically the context that gets sent with every request. In applications with conversation history, system prompts, tool definitions, and retrieved documents, the input side of each request can easily be 5–20x larger than the output. At current pricing, input tokens are cheaper per unit than output tokens on most providers, but the raw volume overwhelms that advantage.

The second most common false assumption is that token usage scales linearly with user count. It doesn't. It scales with user count multiplied by context depth. A user thirty messages into a conversation sends thirty times the context tokens of a user on their first message, assuming naive history concatenation. This is the hidden growth curve that causes billing surprises.
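
A rough sketch of that growth curve, assuming every message is the same size (the 200-token figure is purely illustrative) and each turn is one user message plus one assistant reply. The function name and parameters are hypothetical, not part of any SDK:

```python
def naive_history_tokens(turns: int, tokens_per_message: int = 200) -> list[int]:
    """Input tokens sent on each user turn when the full history is resent.

    Assumes equal-sized messages and one assistant reply per user message.
    """
    per_request = []
    for turn in range(1, turns + 1):
        # Each earlier turn contributes two messages; the new user message adds one.
        messages_sent = 2 * (turn - 1) + 1
        per_request.append(messages_sent * tokens_per_message)
    return per_request


costs = naive_history_tokens(turns=15)
print("first request:", costs[0])
print("15th request:", costs[-1])
print("whole conversation:", sum(costs))
```

Under these assumptions the per-request size grows linearly with depth, but the total paid over the conversation grows with the square of the number of turns, which is why a handful of long conversations can outweigh many short ones.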

To get an accurate picture, you need to instrument your actual calls — not estimate from your codebase. Here's a minimal logging wrapper for the Anthropic Python SDK:

```python
import anthropic
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class TokenUsageRecord:
    timestamp: float
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    endpoint: str
    user_id: Optional[str]
    cost_usd: float

def calculate_cost(model: str, input_tokens: int, output_tokens: int,
                   cache_read: int, cache_write: int) -> float:
    # Prices as of April 2026 — verify current pricing at anthropic.com/pricing
    pricing = {
        "claude-opus-4-7": {"input": 15.0, "output": 75.0, "cache_read": 1.5, "cache_write": 18.75},
        "claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.30, "cache_write": 3.75},
        "claude-haiku-4-5": {"input": 0.80, "output": 4.0, "cache_read": 0.08, "cache_write": 1.0},
    }
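    # Match on prefix so dated IDs such as "claude-haiku-4-5-20251001" resolve;
    # unknown models fall back to Sonnet pricing.
    p = next((v for k, v in pricing.items() if model.startswith(k)),
             pricing["claude-sonnet-4-6"])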
    return (
        (input_tokens / 1_000_000) * p["input"] +
        (output_tokens / 1_000_000) * p["output"] +
        (cache_read / 1_000_000) * p["cache_read"] +
        (cache_write / 1_000_000) * p["cache_write"]
    )

class InstrumentedClient:
    def __init__(self, log_file: str = "token_usage.jsonl"):
        self.client = anthropic.Anthropic()
        self.log_file = log_file

    def create(self, endpoint: str, user_id: Optional[str] = None, **kwargs):
        response = self.client.messages.create(**kwargs)
        usage = response.usage
        model = kwargs.get("model", "unknown")
        # The cache fields can be missing or None, so coerce both cases to 0.
        cache_read = getattr(usage, "cache_read_input_tokens", None) or 0
        cache_write = getattr(usage, "cache_creation_input_tokens", None) or 0
        record = TokenUsageRecord(
            timestamp=time.time(),
            model=model,
            input_tokens=usage.input_tokens,
            output_tokens=usage.output_tokens,
            cache_read_tokens=cache_read,
            cache_write_tokens=cache_write,
            endpoint=endpoint,
            user_id=user_id,
            cost_usd=calculate_cost(
                model, usage.input_tokens, usage.output_tokens,
                cache_read, cache_write,
            ),
        )
        with open(self.log_file, "a") as f:
            f.write(json.dumps(asdict(record)) + "\n")
        return response
```

Run this for 48 hours across your production traffic, then analyze the log:

```python
import json
from collections import defaultdict

# Read the log with a context manager and skip any blank lines
with open("token_usage.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

by_endpoint = defaultdict(lambda: {"input": 0, "output": 0, "cost": 0.0, "calls": 0})
for r in records:
    e = by_endpoint[r["endpoint"]]
    e["input"] += r["input_tokens"]
    e["output"] += r["output_tokens"]
    e["cost"] += r["cost_usd"]
    e["calls"] += 1

for endpoint, stats in sorted(by_endpoint.items(), key=lambda x: -x[1]["cost"]):
    avg_input = stats["input"] / stats["calls"]
    print(f"{endpoint}: ${stats['cost']:.2f} | {stats['calls']} calls | avg input: {avg_input:.0f} tok")
```

This output will almost certainly surprise you. One or two endpoints will dominate spend, and the average input token count for those endpoints will be the number you need to attack first.

> **Key Insight:** Rank endpoints by total spend, then use the average input token count per call to see why the expensive ones are expensive. Call count alone is misleading: a thousand cheap calls can matter less than fifty expensive ones.

The third category of hidden cost is tool definitions. If you use function calling or tool use, every tool schema you include in the request counts as input tokens. A system with fifteen tools, each with a moderately detailed JSON schema, can add 2,000–4,000 tokens to every single request, before any user message or history is included. If those tools aren't all needed for every request, you're paying for them anyway.

> **Warning:** Tool schemas are charged at input token rates on every request, regardless of whether the model invokes any tool. Audit your tool definitions ruthlessly. Move tools that are needed in fewer than 30% of interactions out of the default set and inject them conditionally.
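
One way to inject tools conditionally is to keep a small core set and add the rest by request category. This is a minimal sketch: the tool names, the category labels, and the `select_tools` helper are all hypothetical, and it assumes you already have some cheap way to categorize the request (a route, a UI context, or a classifier).

```python
from typing import Optional

# Hypothetical tool names; substitute your own.
CORE_TOOLS = {"search_kb", "get_order_status"}

# Hypothetical mapping from a coarse request category to the extra tools it needs.
OPTIONAL_TOOLS_BY_CATEGORY = {
    "billing": {"issue_refund", "update_payment_method"},
    "account": {"reset_password", "close_account"},
}


def select_tools(all_tools: list[dict], category: Optional[str]) -> list[dict]:
    """Return the core tools plus any tools mapped to the request category."""
    wanted = CORE_TOOLS | OPTIONAL_TOOLS_BY_CATEGORY.get(category, set())
    # Preserve the original ordering so the tool block stays stable per category.
    return [tool for tool in all_tools if tool["name"] in wanted]


ALL_TOOLS = [
    {"name": "search_kb", "description": "Search the knowledge base", "input_schema": {"type": "object"}},
    {"name": "issue_refund", "description": "Refund an order", "input_schema": {"type": "object"}},
    {"name": "reset_password", "description": "Send a reset link", "input_schema": {"type": "object"}},
    {"name": "get_order_status", "description": "Look up an order", "input_schema": {"type": "object"}},
]

billing_tools = select_tools(ALL_TOOLS, "billing")
print([tool["name"] for tool in billing_tools])
```

The result would be passed as the `tools` argument of the request. One trade-off to keep in mind: if you also rely on provider prompt caching, varying the tool set per request changes the prompt prefix, so a few stable tool sets will likely cache better than a unique set per request.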

Finally, examine your system prompts. The average production system prompt, after months of iterative additions by different team members, is 800–2,000 tokens. Many contain redundant instructions, outdated rules, and explanations the model doesn't need. Chapter 3 covers prompt compression in detail, but right now, just measure it: count the tokens with your provider's tokenizer or token-counting endpoint (for example, `len(tokenizer.encode(system_prompt))` where a local tokenizer is available) and write that number down.

**Key Takeaways**

- Input tokens, not output tokens, are the dominant cost in most production LLM applications.
- Context depth multiplies per-user cost non-linearly — conversation history is the primary growth vector.
- Instrument every API call with token counts, cost estimates, and endpoint tags before optimizing anything.
- Tool definitions silently inflate input costs on every request, regardless of invocation rate.
- System prompt size compounds across every call — measure it now.

**Practical Exercise**

Deploy the `InstrumentedClient` wrapper to your production environment for 48 hours. Aggregate costs by endpoint and compute the average input token count per call for your top three most expensive endpoints. These three numbers are your optimization roadmap for the rest of this book.

---

## Chapter 2: Context Window Economics: What Costs What

The context window is not a free resource. Every token inside it at inference time costs money — whether it's your system prompt, the user's message, conversation history, retrieved documents, or tool definitions. Understanding how these components interact and what each actually costs is the foundation of every other optimization in this book.

Context windows have grown dramatically. Current frontier models support 200K tokens or more. This is operationally convenient, but it is also financially dangerous. Larger context availability makes it easy to adopt a "just include everything" approach that scales catastrophically with usage.

The cost structure is straightforward: you pay for input tokens at the input rate, and output tokens at the (higher) output rate. Input tokens include everything you send to the model. Output tokens are everything the model sends back. The ratio between the two varies by use case:

- **Chat applications**: typically 60–80% input, 20–40% output by token count
- **Summarization**: 85–95% input, 5–15% output
- **Code generation**: 40–60% input, 40–60% output
- **Classification/extraction**: 70–90% input, 10–30% output

Understanding your actual ratio is important because optimizations that reduce input tokens don't help equally across all use cases. A summarization pipeline should focus almost exclusively on compressing input. A code generation endpoint should balance input and output optimization.

> **Key Insight:** Calculate your input-to-output token ratio for each endpoint. This ratio tells you which type of optimization yields the most impact per engineering hour.
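
A sketch of that calculation, assuming records shaped like the `token_usage.jsonl` entries from Chapter 1. The helper name is hypothetical, and the default prices are the Sonnet figures from the Chapter 1 pricing table. It reports the input share by token count and by cost, because output tokens are priced higher and the two shares can differ a lot.

```python
from collections import defaultdict


def input_share_by_endpoint(
    records: list[dict],
    input_price: float = 3.0,    # USD per million input tokens (assumed)
    output_price: float = 15.0,  # USD per million output tokens (assumed)
) -> dict[str, tuple[float, float]]:
    """Return {endpoint: (input share of tokens, input share of cost)}."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for r in records:
        totals[r["endpoint"]]["input"] += r["input_tokens"]
        totals[r["endpoint"]]["output"] += r["output_tokens"]

    shares = {}
    for endpoint, t in totals.items():
        tokens = t["input"] + t["output"]
        if tokens == 0:
            continue  # nothing recorded for this endpoint
        input_cost = t["input"] * input_price
        total_cost = input_cost + t["output"] * output_price
        shares[endpoint] = (t["input"] / tokens, input_cost / total_cost)
    return shares


sample = [
    {"endpoint": "summarize", "input_tokens": 9000, "output_tokens": 1000},
    {"endpoint": "codegen", "input_tokens": 5000, "output_tokens": 5000},
]
for endpoint, (token_share, cost_share) in input_share_by_endpoint(sample).items():
    print(f"{endpoint}: {token_share:.0%} of tokens, {cost_share:.0%} of cost is input")
```

With a 5x output price, an endpoint that is 90% input by token count is noticeably less input-heavy by cost, so it is worth checking the cost share before deciding where to spend engineering time.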

Conversation history is the most insidious cost driver in chat applications because it compounds over time. Consider a naive implementation that concatenates all messages:

```python
# Naive — costs grow quadratically with conversation length
def build_messages(history: list[dict]) -> list[dict]:
    return history  # Every message, every time

# After 20 turns, you're sending 20 messages per request
# After 40 turns, 40 messages per request
# Token cost grows as O(n²) across a conversation
```

The fix is a sliding window with intelligent summarization:

```python
import anthropic

client = anthropic.Anthropic()

def summarize_old_messages(messages: list[dict], model: str = "claude-haiku-4-5-20251001") -> str:
    """Compress old messages into a summary using a cheap model."""
    formatted = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in messages
    )
    response = client.messages.create(
        model=model,
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation segment in 2-3 sentences, preserving key facts and decisions:\n\n{formatted}"
        }]
    )
    return response.content[0].text

def build_messages_windowed(
    history: list[dict],
    window_size: int = 10,
    summarize_threshold: int = 20
) -> list[dict]:
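    if len(history) <= window_size:
        return history

    # Split into the recent window and everything older
    old_messages = history[:-window_size]
    recent_messages = history[-window_size:]

    if len(old_messages) >= summarize_threshold // 2:
        # In production, cache the summary so it isn't regenerated on every turn
        summary = summarize_old_messages(old_messages)
        return [
            {"role": "user", "content": f"[Conversation summary: {summary}]"},
            {"role": "assistant", "content": "Understood. Continuing from that context."},
            *recent_messages
        ]

    # Too little old history to justify a summary call: drop it, and trim any
    # leading assistant turns because the API expects a user message first.
    while recent_messages and recent_messages[0]["role"] != "user":
        recent_messages = recent_messages[1:]
    return recent_messages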
```

This pattern keeps per-request conversation costs bounded. The summarization call uses a cheap model (Haiku) and produces a compressed digest that preserves the key facts and decisions. A 30-message conversation that would send 15,000 input tokens per request with naive history now sends 2,000–4,000 tokens with windowed summarization.

> **Warning:** Don't summarize too aggressively. Users who reference earlier parts of a conversation and get wrong answers because the context was discarded will lose trust in the product. Test your summarization strategy on real conversation samples before deploying.

Retrieved documents are another major input cost category, particularly in RAG (Retrieval-Augmented Generation) architectures. The common mistake is retrieving more chunks than necessary "just in case." Every retrieved chunk is tokens paid for, regardless of whether the model actually uses that information.

Chapter 6 covers retrieval strategy in depth, but the core principle belongs here: retrieved context should be bounded by a token budget, not by a document count. Counting retrieved chunks doesn't tell you what you're actually paying for.

```python
def budget_retrieved_context(
    chunks: list[str],
    token_budget: int = 4000,
    tokenizer=None
) -> list[str]:
    """Include chunks until token budget is exhausted."""
    selected = []
    used = 0
    for chunk in chunks:  # chunks should be pre-sorted by relevance score
        chunk_tokens = len(tokenizer.encode(chunk)) if tokenizer else len(chunk) // 4
        if used + chunk_tokens > token_budget:
            break
        selected.append(chunk)
        used += chunk_tokens
    return selected
```

Prompt caching, available on Anthropic's API and some other providers, changes the economics significantly for stable context. When you mark a portion of your prompt as cacheable, the provider stores a computed representation and charges a significantly reduced rate for subsequent requests that hit the cache. This is explored in detail in Chapter 4, but the structural implication belongs here: design your message layout to put stable content (system prompt, static instructions, tool definitions) at the top, and variable content (user message, dynamic data) at the bottom. Caching only works on a prefix — the stable prefix gets cached, the variable suffix does not.

```python
# Good layout for caching — stable content first
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_STABLE_DOCUMENT,  # cached after first call
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": user_query  # variable — not cached
            }
        ]
    }
]
```

**Key Takeaways**

- Every token in the context window costs money — system prompts, history, tools, and retrieved documents all compound.
- Cumulative conversation cost grows as O(n²) without windowing, so implement sliding window summarization early, before scale makes it painful.
- Retrieved context should be bounded by a token budget, not a document count.
- Input-to-output token ratio determines which optimization category yields the most return for your use case.
- Structure prompts with stable content first to maximize cache hit rates.

**Practical Exercise**

Pick your most expensive chat endpoint. Add token counting to measure the average breakdown: system prompt tokens, history tokens, retrieved context tokens, and user message tokens. Calculate what percentage of your input cost each category represents. The largest category is your next optimization target.

---

## Chapter 3: Prompt Engineering for Efficiency

Prompt engineering is usually discussed in the context of output quality. This chapter discusses it as a cost variable — because the two are not in tension as often as people assume. Tighter, more precise prompts frequently produce better outputs than verbose, over-specified ones, and they do it cheaper.

The average production system prompt has a word count problem. It accumulates over time as engineers and product managers add instructions in response to edge cases, failing outputs, and stakeholder requests. Nobody removes old instructions. The result is a 1,500-word prompt that contradicts itself in three places, repeats instructions four times "for emphasis," and contains paragraphs that made sense six months ago but don't apply to the current product.

Start with an audit. Paste your system prompt into a document and go through it line by line asking: does removing this change the model's behavior in a meaningful way? The answer is "no" far more often than you'll expect.

> **Try This:** Copy your system prompt. Delete one-third of it — the parts that feel like reminders of things the model already knows. Test the abbreviated prompt against your evaluation set. If quality holds, keep the deletion. Repeat until quality degrades. You'll typically find you can cut 30–50% without measurable output degradation.

Here's a before/after example from a real customer support system:

**Before (340 tokens):**
```
You are a helpful customer support assistant for Acme Corp. Your job is to help
customers with their questions and concerns. Always be polite and professional.
Never be rude. Always greet the customer warmly. Make sure to ask clarifying
questions when you need more information. Be concise in your responses but make
sure to fully address the customer's concern. If you don't know the answer to
something, say so and offer to escalate. Never make up information. Always
verify information before sharing it. If a customer is angry, remain calm and
empathetic. Try to resolve issues on the first contact when possible. Follow
all company policies. Do not discuss competitors. Do not share internal
company information. Always end the conversation by asking if there's anything
else you can help with...
```

**After (89 tokens):**
```
You are Acme Corp customer support. Be concise and accurate. If unsure, say so
and offer escalation. Don't discuss competitors or share internal information.
Resolve issues on first contact when possible.
```

The second prompt produces outputs that are indistinguishable in quality on a properly designed evaluation set. The difference is 251 tokens per request. At 10,000 requests per day on Sonnet ($3 per million input tokens), that's 2.51 million tokens and about $7.50/day saved, or roughly $2,750/year on system prompt bloat alone, per endpoint. And that's before accounting for the fact that shorter, cleaner prompts often produce more consistent outputs because there are fewer contradictions for the model to reconcile.
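
The arithmetic is worth scripting so you can rerun it for your own traffic. A minimal sketch, where the helper name is hypothetical and the $3 per million input tokens is the Sonnet input price from the Chapter 1 pricing table:

```python
def prompt_savings_usd(
    tokens_saved_per_request: int,
    requests_per_day: int,
    input_price_per_mtok: float,
) -> tuple[float, float]:
    """Return (daily, yearly) savings in USD from trimming input tokens."""
    tokens_per_day = tokens_saved_per_request * requests_per_day
    daily = tokens_per_day / 1_000_000 * input_price_per_mtok
    return daily, daily * 365


daily, yearly = prompt_savings_usd(
    tokens_saved_per_request=340 - 89,  # before and after prompt sizes
    requests_per_day=10_000,
    input_price_per_mtok=3.0,
)
print(f"${daily:.2f}/day, ${yearly:.2f}/year")
```

The same function works for any per-request reduction, such as trimmed tool schemas or smaller injected profiles. It does not account for prompt caching, which lowers the effective price of a stable prefix.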

> **Key Insight:** Instructions that tell the model to "be professional," "be helpful," and "be accurate" are nearly always redundant. These are default behaviors. You only need to specify behaviors that deviate from the default or that require specific domain context.

Few-shot examples are another area where efficiency matters. Including examples in a prompt is a well-established technique for improving output format and style. But examples are expensive — a prompt with five detailed examples might cost 800–1,500 tokens before the user's actual message is included.

The right number of examples depends on task complexity and output format specificity. For well-understood tasks (sentiment classification, JSON extraction from a defined schema), zero-shot with a clear output format specification usually matches few-shot performance:

```python
# Zero-shot JSON extraction — usually matches few-shot quality for structured tasks
EXTRACTION_PROMPT = """Extract the following fields from the customer message as JSON:
- intent: one of [refund_request, account_issue, product_question, complaint, other]
- urgency: one of [low, medium, high]
- product_mentioned: string or null
- account_number: string or null

Return ONLY valid JSON, no explanation."""

# vs. three-shot with examples — 900+ extra tokens, minimal quality gain for this task
```

For novel or complex output formats, one or two examples are usually sufficient. More than three examples rarely improves quality proportionally to cost.

Instruction format also matters. Numbered lists and headers in system prompts add tokens, and for short instruction sets they often don't improve adherence measurably. A paragraph with the same instructions is frequently just as effective and shorter, though you should confirm that on your evaluation set before restructuring. Use structure when the model needs to navigate and select from a large option space.

Variable interpolation is where many prompts bloat unnecessarily. If you're injecting large blocks of data into a prompt, ask whether all of it is necessary:

```python
# Before — injects full user profile (600+ tokens)
prompt = f"""
User profile:
{json.dumps(full_user_profile, indent=2)}

Answer the user's question using their profile for context.
"""

# After — inject only relevant fields (80 tokens)
relevant_fields = {
    "plan": user_profile["subscription_plan"],
    "account_age_days": user_profile["account_age_days"],
    "open_tickets": user_profile["open_support_tickets"]
}
prompt = f"User context: {json.dumps(relevant_fields)}\n\nAnswer the user's question."
```

Output format specification is the final lever. Asking for shorter outputs explicitly reduces output tokens. If you need a one-sentence answer, say so. If you need a JSON object with four fields, specify the schema. Don't let the model decide the length of its response when you know what you need.

```python
# Vague — model will produce 200-400 word response
"Explain what this error means."

# Specific — model produces 1-2 sentences
"Explain this error in one or two sentences for a developer audience."

# Structured — model produces exactly the fields you need
"Return a JSON object with keys: 'cause' (string), 'fix' (string), 'severity' (low|medium|high)"
```

**Key Takeaways**

- Audit your system prompts regularly and delete instructions that don't change behavior — expect to cut 30–50%.
- Default model behaviors (politeness, accuracy, helpfulness) don't need explicit instruction.
- For structured output tasks, zero-shot with a schema specification usually matches few-shot quality at much lower cost.
- Inject only the fields of dynamic data that are relevant to the task at hand.
- Specify output length and format explicitly to control output token spend.

**Practical Exercise**

Take your longest system prompt. Create a version with exactly half the word count by removing redundant instructions, pleasantries, and anything the model does by default. Run both against twenty representative test cases. If quality is equivalent, deploy the shorter version.

---

## Chapter 4: Semantic Caching: Stop Paying for the Same Answer Twice

Exact-match caching — returning a cached response when the same query string appears twice — works for a narrow class of LLM applications. Most real applications have enough query variation that exact-match hit rates stay below 5%. Semantic caching solves this by recognizing that "What's your return policy?" and "How do I return an item?" are the same question, semantically, and should produce the same cached response.

This is not a new idea in information retrieval. It's standard practice in search systems. Applying it to LLM calls is newer and requires a bit more infrastructure, but the payoff is substantial: hit rates of 20–40% are realistic for applications with a large user base asking questions within a defined domain. At that hit rate, you're eliminating 20–40% of your API calls entirely.

The architecture is simple: before making an API call, embed the incoming query and search a vector store for semantically similar past queries. If you find one above a similarity threshold, return the cached response. If not, call the API, cache the result, and return it.

```python
import numpy as np
import anthropic
import json
import time
from typing import Optional

client = anthropic.Anthropic()

# In production, use a persistent vector store such as Redis, Postgres with
# pgvector, Weaviate, or Chroma. This is an in-memory example for clarity.
class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92):
        self.threshold = similarity_threshold
        self.entries: list[dict] = []  # {embedding, query, response, timestamp}

    def _embed(self, text: str) -> list[float]:
        # Use a cheap embedding model — text-embedding-3-small or similar
        # For Anthropic-only stacks, use a third-party embedding service
        import openai
        oc = openai.OpenAI()
        result = oc.embeddings.create(input=text, model="text-embedding-3-small")
        return result.data[0].embedding

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> Optional[str]:
        if not self.entries:
            return None
        query_embedding = self._embed(query)
        best_score = 0.0
        best_response = None
        for entry in self.entries:
            score = self._cosine_similarity(query_embedding, entry["embedding"])
            if score > best_score:
                best_score = score
                best_response = entry["response"]
        if best_score >= self.threshold:
            return best_response
        return None

    def set(self, query: str, response: str) -> None:
        embedding = self._embed(query)
        self.entries.append({
            "embedding": embedding,
            "query": query,
            "response": response,
            "timestamp": time.time()
        })

cache = SemanticCache(similarity_threshold=0.92)

def cached_completion(
    user_message: str,
    system_prompt: str,
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 1024
) -> tuple[str, bool]:
    """Returns (response_text, cache_hit)."""
    cached = cache.get(user_message)
    if cached is not None:
        return cached, True

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}]
    )
    text = response.content[0].text
    cache.set(user_message, text)
    return text, False
```

> **Warning:** The similarity threshold is the critical parameter. Too low (below 0.85) and you'll return incorrect cached responses for questions that are superficially similar but semantically distinct. Too high (above 0.97) and hit rates approach zero. Start at 0.92 and tune based on manual review of cache hits and misses in your specific domain.

The threshold also depends on your application's tolerance for stale or approximate answers. A customer support bot answering factual questions about a product catalog has high sensitivity — "Does the Pro plan include SSO?" and "Does SSO come with the Business tier?" might embed similarly but have different correct answers. A creative writing assistant has much lower sensitivity — similar questions deserve similar answers.
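
One way to tune the threshold is to hand-label a sample of query pairs and see what each candidate threshold would have done. The sketch below assumes you have collected `(similarity, same_answer)` tuples from manual review. The function name and the sample data are illustrative, not measurements.

```python
def evaluate_thresholds(
    labeled_pairs: list[tuple[float, bool]],
    thresholds: list[float],
) -> list[tuple[float, float, float]]:
    """For each threshold, return (threshold, hit_rate, wrong_hit_rate).

    hit_rate: fraction of pairs that would be served from cache.
    wrong_hit_rate: fraction of those hits where the answers should differ.
    """
    results = []
    for threshold in thresholds:
        hits = [same for similarity, same in labeled_pairs if similarity >= threshold]
        hit_rate = len(hits) / len(labeled_pairs)
        wrong_hit_rate = hits.count(False) / len(hits) if hits else 0.0
        results.append((threshold, hit_rate, wrong_hit_rate))
    return results


# Illustrative labels only: (cosine similarity, do both queries share an answer?)
pairs = [
    (0.98, True), (0.95, True), (0.93, False),
    (0.91, True), (0.88, False), (0.80, False),
]
for threshold, hit_rate, wrong in evaluate_thresholds(pairs, [0.85, 0.92, 0.94]):
    print(f"threshold {threshold}: hit rate {hit_rate:.0%}, wrong hits {wrong:.0%}")
```

Pick the lowest threshold whose wrong-hit rate you can tolerate for the domain. A support bot quoting plan features probably needs that rate near zero, while a brainstorming tool can accept more.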

Provider-level prompt caching (distinct from semantic caching) is a complementary technique. Anthropic's API supports cache breakpoints in the prompt that instruct the service to cache the key-value attention states for a prefix. When subsequent requests share the same prefix, those tokens are read at a dramatically reduced rate (roughly 10% of the standard input price) instead of recomputed.

```python
# Using Anthropic prompt caching with cache_control breakpoints
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_STABLE_SYSTEM_PROMPT,  # 2000+ tokens, rarely changes
            "cache_control": {"type": "ephemeral"}  # Cache this prefix
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": LARGE_REFERENCE_DOCUMENT,  # Another stable block
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": user_question  # Variable — not cached
                }
            ]
        }
    ]
)

# Check cache usage in response
usage = response.usage
print(f"Cache write: {usage.cache_creation_input_tokens}")
print(f"Cache read: {usage.cache_read_input_tokens}")
print(f"Standard input: {usage.input_tokens}")
```

The default cache TTL is five minutes on Anthropic's API, and the timer resets each time the cached prefix is read. For high-traffic applications, this means the cache stays warm and pays for itself continuously. For low-traffic applications, you may see few cache hits during normal operation, and each miss pays the cache-write premium again, but burst traffic patterns still benefit substantially.
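
To see where caching breaks even, compare the cost of a stable prefix with and without caching. This sketch uses the Sonnet prices from the Chapter 1 pricing table as defaults and assumes every request after the first arrives while the cache entry is still live. The function name is hypothetical.

```python
def prefix_cost_usd(
    prefix_tokens: int,
    requests: int,
    cached: bool,
    input_price: float = 3.0,    # USD per million tokens, standard input (assumed)
    write_price: float = 3.75,   # USD per million tokens, cache write (assumed)
    read_price: float = 0.30,    # USD per million tokens, cache read (assumed)
) -> float:
    """Cost of sending the same prefix `requests` times."""
    mtok = prefix_tokens / 1_000_000
    if not cached or requests == 0:
        return mtok * input_price * requests
    # One cache write on the first request, then cache reads for the rest.
    return mtok * (write_price + read_price * (requests - 1))


for n in (1, 2, 10):
    plain = prefix_cost_usd(2000, n, cached=False)
    with_cache = prefix_cost_usd(2000, n, cached=True)
    print(f"{n} requests: ${plain:.4f} uncached vs ${with_cache:.4f} cached")
```

At these prices a single request costs more with caching because of the write premium, and caching comes out ahead from the second hit onward. Endpoints whose requests rarely arrive within the TTL of each other can therefore end up paying slightly more.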

Combine semantic caching with provider caching for maximum effect: semantic caching eliminates API calls for similar queries, while provider caching reduces the cost of API calls that do go through by reading the stable prefix from cache rather than reprocessing it.

> **Key Insight:** Semantic caching and provider prompt caching operate at different layers and are not mutually exclusive. Semantic caching reduces call volume. Provider caching reduces per-call cost. Deploy both.

Cache invalidation strategy matters. Responses can become stale if your underlying data or policy changes. At minimum, implement TTL-based expiry. For applications where correctness is critical, add a versioning mechanism: include a cache namespace that changes when your data or policy updates, which effectively flushes the cache without requiring explicit deletion of individual entries.
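
A minimal sketch of both mechanisms together, using an exact-key in-memory store for brevity. The class name and the namespace labels are hypothetical, and the same idea applies to the semantic cache above by storing the namespace and timestamp alongside each embedding.

```python
import time
from typing import Optional


class NamespacedTTLCache:
    """Cache whose entries expire after a TTL and are scoped to a namespace."""

    def __init__(self, ttl_seconds: float, namespace: str):
        self.ttl_seconds = ttl_seconds
        self.namespace = namespace
        self._store: dict[tuple[str, str], tuple[str, float]] = {}

    def set(self, key: str, value: str, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._store[(self.namespace, key)] = (value, stored_at)

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._store.get((self.namespace, key))
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at > self.ttl_seconds:
            # Expired: remove the entry so it is not served again.
            del self._store[(self.namespace, key)]
            return None
        return value

    def set_namespace(self, namespace: str) -> None:
        """Switch namespaces; entries under the old one become unreachable."""
        self.namespace = namespace


policy_cache = NamespacedTTLCache(ttl_seconds=3600, namespace="policy-v1")
policy_cache.set("return policy", "30 days", now=0)
print(policy_cache.get("return policy", now=10))    # still fresh
policy_cache.set_namespace("policy-v2")             # policy changed
print(policy_cache.get("return policy", now=20))    # miss under the new namespace
```

The `now` parameter exists only to make the behavior testable without sleeping. Entries under the old namespace stay in memory in this sketch, so a production store would still need background cleanup or store-level expiry.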

**Key Takeaways**

- Exact-match caching has low hit rates in most applications; semantic caching hits 20–40% for domain-specific applications.
- Similarity threshold of 0.92 is a reasonable starting point — tune based on manual review of hits and misses.
- Provider prompt caching reduces per-call cost for stable context prefixes; this is complementary to, not a substitute for, semantic caching.
- Cache invalidation requires explicit strategy — TTL at minimum, versioned namespaces for correctness-critical applications.
- Embed the query (not the full prompt) for semantic cache lookup; the query captures user intent, which is what you're matching.

**Practical Exercise**

Implement semantic caching for your highest-traffic endpoint using an in-memory store. After 24 hours, compute the hit rate. If it's above 15%, migrate to a persistent vector store (Redis with the RedisVL library, or any hosted vector DB). Track cost savings per day.

---

## Chapter 5: Model Routing: Match Task Complexity to Model Cost

The most immediately impactful cost reduction most teams can make is model routing — automatically sending simple tasks to cheap, fast models and only escalating to expensive frontier models when the task genuinely requires it. The price difference between model tiers is not incremental. Haiku is roughly 20x cheaper than Opus. Routing 60% of your traffic to Haiku while maintaining quality on those tasks cuts your model cost by more than 50%.
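
The arithmetic behind that claim is worth making explicit, because the savings depend heavily on the actual price ratio between tiers. The sketch below assumes every request uses the same number of tokens regardless of tier and ignores any routing overhead; `blended_cost_fraction` is an illustrative helper, and the ratios in the loop are examples rather than quoted prices.

```python
def blended_cost_fraction(cheap_share: float, price_ratio: float) -> float:
    """Cost after routing, as a fraction of sending everything to the expensive model.

    cheap_share: fraction of traffic routed to the cheap model (0..1)
    price_ratio: how many times cheaper the cheap model is (e.g. 20.0)
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return (1.0 - cheap_share) + cheap_share / price_ratio

for ratio in (5.0, 10.0, 20.0):
    fraction = blended_cost_fraction(cheap_share=0.6, price_ratio=ratio)
    print(f"ratio {ratio:>4.0f}x -> {1 - fraction:.0%} saved")
```

With a 20x ratio and 60% of traffic routed, the blended cost is 0.4 + 0.6/20 = 0.43 of the all-frontier cost, a 57% reduction. Rerun the calculation with your provider's current prices before committing to a savings target.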

The key challenge is correctly identifying task complexity. Get it wrong in one direction and you spend too much. Get it wrong in the other and quality degrades. The good news is that the distribution of task complexity in most applications is heavily skewed toward simple tasks. In a typical customer support application, 60–70% of queries are simple lookups or FAQ responses that a small model handles well. Only 15–25% require the nuanced reasoning that justifies a frontier model.

A simple routing architecture looks like this:

```python
import anthropic
from enum import Enum

class ModelTier(Enum):
    CHEAP = "claude-haiku-4-5-20251001"
    MID = "claude-sonnet-4-6"
    FRONTIER = "claude-opus-4-7"

client = anthropic.Anthropic()

ROUTER_PROMPT = """Classify the complexity of this user request for an AI assistant.

Return ONLY one of these labels:
- SIMPLE: factual lookup, yes/no question, simple format conversion, one-step task
- MODERATE: multi-step reasoning, comparison, summarization of provided text
- COMPLEX: open-ended analysis, code debugging, creative generation, ambiguous intent requiring inference

Request: {query}

Label:"""

def route_request(query: str) -> ModelTier:
    """Use a cheap model to classify request complexity."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Always use cheap model for routing
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": ROUTER_PROMPT.format(query=query)
        }]
    )
    label = response.content[0].text.strip().upper()
    return {
        "SIMPLE": ModelTier.CHEAP,
        "MODERATE": ModelTier.MID,
        "COMPLEX": ModelTier.FRONTIER
    }.get(label, ModelTier.MID)  # Default to mid on ambiguous classification

def routed_completion(query: str, system_prompt: str, max_tokens: int = 1024) -> dict:
    tier = route_request(query)
    response = client.messages.create(
        model=tier.value,
        max_tokens=max_tokens,
        system=system_prompt,
        messages=[{"role": "user", "content": query}]
    )
    return {
        "response": response.content[0].text,
        "model_used": tier.value,
        "tier": tier.name
    }
```

> **Warning:** The routing call itself costs tokens. For very high volume applications with cheap per-call costs, the routing overhead can approach or exceed the savings. Measure the overhead: if routing cost is more than 10% of the cost saved by downgrading calls, consider a cheaper classification method (keyword heuristics, a fine-tuned classifier, or length-based rules).

Rule-based routing is often underrated. Before building a classification model, check whether simple heuristics capture most of the signal:

```python
def rule_based_route(query: str, context: dict) -> ModelTier:
    query_lower = query.lower()
    word_count = len(query.split())
    task_type = context.get("task_type")

    # Explicit task types are the strongest signal, so check them before the
    # length heuristics (a short "fix this bug" request should not route to cheap)
    if task_type in ["code_debug", "creative_writing", "legal_analysis"]:
        return ModelTier.FRONTIER
    if task_type in ["classification", "extraction", "format_conversion"]:
        return ModelTier.CHEAP

    # Signals for cheap routing
    if word_count < 10:
        return ModelTier.CHEAP
    if any(kw in query_lower for kw in ["what is", "when did", "how much", "yes or no"]):
        return ModelTier.CHEAP

    # Signals for frontier routing
    if word_count > 100:  # Long, complex queries
        return ModelTier.FRONTIER
    if query.count("?") > 2:  # Multi-part questions
        return ModelTier.FRONTIER

    return ModelTier.MID
```

In practice, a hybrid approach works best: rule-based routing as a fast first pass, with LLM-based classification only for cases where rules don't give a clear signal. This keeps routing overhead minimal while maintaining accuracy.
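
One way to wire up the hybrid is to let the rules abstain when they have no clear signal and pay for a classifier call only in that case. This is a minimal sketch: `Tier`, `confident_rules`, `hybrid_route`, and `fake_classifier` are illustrative names, and the fake classifier stands in for an LLM-based router so the example runs without an API key.

```python
from enum import Enum
from typing import Callable, Optional

class Tier(Enum):  # Stand-in for the ModelTier enum used earlier
    CHEAP = "cheap"
    MID = "mid"
    FRONTIER = "frontier"

def confident_rules(query: str, context: dict) -> Optional[Tier]:
    """Return a tier only when a rule gives a clear signal, else None."""
    task_type = context.get("task_type")
    if task_type in {"classification", "extraction", "format_conversion"}:
        return Tier.CHEAP
    if task_type in {"code_debug", "creative_writing", "legal_analysis"}:
        return Tier.FRONTIER
    if len(query.split()) > 100:
        return Tier.FRONTIER
    return None  # No clear signal: let the classifier decide

def hybrid_route(
    query: str,
    context: dict,
    classify: Callable[[str], Tier],
) -> tuple[Tier, str]:
    """Rules first; call the (paid) classifier only when the rules abstain."""
    tier = confident_rules(query, context)
    if tier is not None:
        return tier, "rules"
    return classify(query), "classifier"

# Fake classifier that records how often it is called.
calls: list[str] = []

def fake_classifier(query: str) -> Tier:
    calls.append(query)
    return Tier.MID

print(hybrid_route("Extract the invoice number", {"task_type": "extraction"}, fake_classifier))
print(hybrid_route("Compare these two plans for a family of four", {}, fake_classifier))
print(f"classifier calls: {len(calls)}")
```

Logging the second element of the returned tuple shows what share of traffic the rules resolve on their own, which is the number that determines routing overhead.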

> **Key Insight:** Quality validation is essential when deploying routing. Run your routed responses through an LLM judge that compares cheap-model output to frontier-model output on a random sample. Any systematic quality gap in a specific category tells you the routing boundary for that category is wrong.

Cascade routing is a more conservative strategy: always try the cheap model first, and escalate if the response is below a quality threshold. This is more reliable than classification-based routing because the cheap model's actual output — not a prediction — determines escalation:

```python
def cascade_completion(
    query: str,
    system_prompt: str,
    quality_checker_prompt: str,
    max_tokens: int = 1024
) -> dict:
    # First attempt with cheap model
    cheap_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=max_tokens,
        system=system_prompt,
        messages=[{"role": "user", "content": query}]
    )
    cheap_text = cheap_response.content[0].text

    # Check quality with a lightweight judge
    judge_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": quality_checker_prompt.format(
                query=query, response=cheap_text
            )
        }]
    )
    quality_ok = "yes" in judge_response.content[0].text.lower()

    if quality_ok:
        return {"response": cheap_text, "model": "haiku", "escalated": False}

    # Escalate to frontier model
    frontier_response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=max_tokens,
        system=system_prompt,
        messages=[{"role": "user", "content": query}]
    )
    return {
        "response": frontier_response.content[0].text,
        "model": "opus",
        "escalated": True
    }
```

Track escalation rates by task type. If a particular category escalates 80% of the time, it should route directly to the frontier model — the cascade overhead isn't worth it. If a category escalates less than 10%, consider dropping it out of the cascade entirely and routing it directly to cheap.
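
A small in-memory tracker is enough to apply those two thresholds. The sketch below is one possible shape, not a prescribed design: the class name, the recommendation labels, and the `min_samples` guard are assumptions added for illustration.

```python
from collections import defaultdict

class EscalationTracker:
    """Counts cascade outcomes per task type and suggests a routing mode."""

    def __init__(
        self,
        direct_frontier_at: float = 0.80,
        direct_cheap_below: float = 0.10,
        min_samples: int = 200,  # assumption: avoid deciding on tiny samples
    ):
        self.direct_frontier_at = direct_frontier_at
        self.direct_cheap_below = direct_cheap_below
        self.min_samples = min_samples
        self._totals: dict[str, int] = defaultdict(int)
        self._escalated: dict[str, int] = defaultdict(int)

    def record(self, task_type: str, escalated: bool) -> None:
        self._totals[task_type] += 1
        if escalated:
            self._escalated[task_type] += 1

    def rate(self, task_type: str) -> float:
        total = self._totals[task_type]
        return self._escalated[task_type] / total if total else 0.0

    def recommendation(self, task_type: str) -> str:
        if self._totals[task_type] < self.min_samples:
            return "keep_cascade"  # Not enough data yet
        rate = self.rate(task_type)
        if rate >= self.direct_frontier_at:
            return "route_direct_to_frontier"
        if rate < self.direct_cheap_below:
            return "route_direct_to_cheap"
        return "keep_cascade"

tracker = EscalationTracker(min_samples=10)
for i in range(20):
    tracker.record("legal_analysis", escalated=(i % 10 != 0))  # 18 of 20 escalate
    tracker.record("faq", escalated=False)
print(tracker.recommendation("legal_analysis"), tracker.recommendation("faq"))
```

In production the counts would live in your metrics store rather than in process memory, but the decision logic stays the same.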

**Key Takeaways**

- The price gap between model tiers (20x between Haiku and Opus) makes routing one of the highest-ROI cost strategies available.
- Most application traffic is simpler than engineers assume — measure the distribution before assuming you need a frontier model for everything.
- Rule-based routing should precede LLM-based routing; heuristics often capture 60–70% of routing decisions at near-zero cost.
- Cascade routing (try cheap, escalate on quality failure) is more reliable than classification-based routing for new applications without historical data.
- Track escalation rates per task type and adjust routing thresholds based on observed data, not assumptions.

**Practical Exercise**

Pick a single high-volume endpoint and implement cascade routing. Log the escalation rate for one week. If escalation rate is under 20%, the cheap model is handling your traffic well — calculate the cost savings. If escalation rate is above 60%, revise your quality checker or route that endpoint directly to the mid-tier model.

---

## Chapter 6: Chunking and Retrieval as a Cost-Control Strategy

RAG architectures are now the default for applications that need to ground LLM responses in specific data — documentation, knowledge bases, product catalogs, legal contracts. The standard implementation retrieves the top-k most relevant chunks and stuffs them into the prompt. The problem: top-k is a document count, not a cost control. Three large chunks can cost more than twenty small ones.

The right mental model for retrieval is: you have a token budget for context, and you want to maximize the information value per token within that budget. Chunk size, chunk count, ranking quality, and context compression all affect how close you get to that optimum.

Chunk size strategy matters more than most RAG tutorials acknowledge. Very small chunks (100–200 tokens) have high precision but miss multi-sentence context that the model needs to answer correctly. Very large chunks (1,500–2,000 tokens) have high recall but waste tokens on irrelevant content surrounding the relevant passage. The sweet spot for most domains is 400–600 tokens with meaningful overlap (50–100 tokens) between adjacent chunks.

```python
from anthropic import Anthropic
import tiktoken

client = Anthropic()
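# cl100k_base is an OpenAI tokenizer; for Claude models its counts are only an
# approximation, so leave headroom in any budget computed from them.
enc = tiktoken.get_encoding("cl100k_base")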

def chunk_document(
    text: str,
    chunk_size: int = 500,
    overlap: int = 75
) -> list[str]:
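    # Guard against an infinite loop: the window must advance on every iteration
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = enc.encode(text)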
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        if end == len(tokens):
            break
        start = end - overlap
    return chunks

def token_count(text: str) -> int:
    return len(enc.encode(text))

def retrieve_within_budget(
    query: str,
    chunks_with_scores: list[tuple[str, float]],  # (chunk_text, relevance_score)
    token_budget: int = 3000
) -> list[str]:
    """Select highest-scoring chunks that fit within the token budget."""
    # Sort by relevance score descending
    ranked = sorted(chunks_with_scores, key=lambda x: x[1], reverse=True)

    selected = []
    used_tokens = 0
    for chunk, score in ranked:
        chunk_tokens = token_count(chunk)
        if used_tokens + chunk_tokens > token_budget:
            continue  # Skip chunks that don't fit; don't stop — a later smaller chunk might fit
        selected.append(chunk)
        used_tokens += chunk_tokens

    return selected
```

> **Key Insight:** After retrieving and ranking chunks, use a "continue, don't break" strategy when one chunk exceeds budget. Later chunks may be smaller and still relevant — breaking out of the loop at the first oversized chunk leaves usable budget on the table.

Reranking dramatically improves the signal-to-noise ratio of retrieved context. A bi-encoder (dense retrieval) is fast but not maximally accurate. Adding a cross-encoder reranker as a second pass — applied only to the top 20–30 candidates from initial retrieval — produces much higher-quality selections. Better quality selection means fewer chunks needed for the same answer quality, which means fewer tokens spent.

```python
# Install: pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_chunks(
    query: str,
    candidates: list[str],
    top_n: int = 5
) -> list[tuple[str, float]]:
    pairs = [(query, chunk) for chunk in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]
```

Context compression is the next lever. After selecting relevant chunks, you can further reduce token cost by compressing each chunk to the most relevant sentences before including it in the prompt:

```python
def compress_chunk(
    query: str,
    chunk: str,
    model: str = "claude-haiku-4-5-20251001"
) -> str:
    """Extract only the sentences from chunk that are relevant to the query."""
    response = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Query: {query}

Extract only the sentences from the following text that are directly relevant to answering the query. Return only the extracted sentences, nothing else.

Text:
{chunk}"""
        }]
    )
    return response.content[0].text.strip()
```

This compresses 500-token chunks to 100–200 tokens on average, at the cost of a cheap model call. The math works out when you're compressing chunks that will be sent to an expensive model: compress with Haiku, pass compressed context to Opus, and the net cost is lower than passing raw chunks to Opus.
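
Whether the math works out depends on the price gap between the compressor and the main model, so it is worth computing per chunk. The sketch below uses placeholder prices in dollars per million tokens; they are not real quotes, and the `prompt_overhead_tokens` figure for the compressor's instructions is an assumption.

```python
def compression_net_saving(
    chunk_tokens: int,
    compressed_tokens: int,
    cheap_in: float,    # $ per million input tokens, compressor model
    cheap_out: float,   # $ per million output tokens, compressor model
    main_in: float,     # $ per million input tokens, main model
    prompt_overhead_tokens: int = 60,  # assumption: query + instructions
) -> float:
    """Dollars saved per chunk by compressing first (negative means a loss)."""
    compress_cost = (
        (chunk_tokens + prompt_overhead_tokens) * cheap_in
        + compressed_tokens * cheap_out
    ) / 1_000_000
    main_saving = (chunk_tokens - compressed_tokens) * main_in / 1_000_000
    return main_saving - compress_cost

# Placeholder prices: substitute your provider's current rates.
frontier = compression_net_saving(500, 150, cheap_in=1.0, cheap_out=5.0, main_in=15.0)
mid_tier = compression_net_saving(500, 150, cheap_in=1.0, cheap_out=5.0, main_in=3.0)
print(f"per-chunk saving with a frontier main model: ${frontier:.6f}")
print(f"per-chunk saving with a mid-tier main model: ${mid_tier:.6f}")
```

With these placeholder prices, compression pays off in front of the expensive model and loses money in front of the mid-tier one, because the compressor's output tokens cost more than the input tokens they replace. Compression also adds a serial model call, so account for the extra latency as well.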

> **Warning:** Context compression can introduce inaccuracies if the compressor omits sentences that the main model would have found important. Always test compressed context against raw context on a representative evaluation set before deploying compression in production.

Hypothetical Document Embeddings (HyDE) is a retrieval technique that reduces the mismatch between query style and document style. Instead of embedding the raw query (which is often short and phrased as a question) and comparing it against document chunks (which are statements and explanations), you first generate a hypothetical answer to the query, then embed that hypothetical answer for retrieval:

```python
def hyde_embed(query: str, model: str = "claude-haiku-4-5-20251001") -> str:
    """Generate a hypothetical answer to improve embedding quality."""
    response = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Write a short, factual answer (2-4 sentences) to this question, as if you were an expert:\n\n{query}"
        }]
    )
    return response.content[0].text  # Embed this, not the original query
```

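HyDE improves retrieval precision, which means you need fewer chunks to get the same quality answer, which means fewer tokens in context. It also adds one cheap-model call per query, so confirm that the context savings outweigh that overhead for your workload.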

**Key Takeaways**

- Use token budgets, not chunk counts, to cap retrieved context cost.
- Chunk size of 400–600 tokens with 50–100 token overlap is a reasonable default for most document types.
- Reranking with a cross-encoder improves retrieval precision substantially; better precision means fewer chunks needed.
- Context compression with a cheap model can shrink retrieved chunks by 60–80% (500 tokens down to 100–200) before they reach an expensive model.
- HyDE improves retrieval quality by aligning query embeddings with document-style content.

**Practical Exercise**

Instrument your RAG pipeline to log: number of chunks retrieved, total tokens of retrieved context, and number of chunks cited in the response (if extractable). The gap between chunks retrieved and chunks used is your waste metric. Target reducing it by 50% through reranking or budget-based retrieval.

---

## Chapter 7: Batching and Async Patterns for Throughput

Batching is the most underused cost optimization for non-real-time LLM workloads. Most teams build request-response pipelines because that's the default for web applications. For workloads that don't need synchronous responses — document processing, content generation, data enrichment, nightly analysis — batching unlocks substantial discounts and throughput gains.

Anthropic's Message Batches API lets you send up to 100,000 requests (or 256 MB) in a single batch at a 50% cost reduction versus synchronous calls. The trade-off is latency: results aren't available immediately. Most batches finish in under an hour, but processing can take up to 24 hours, after which unfinished requests expire. For workloads that fit this model, this is a straightforward win.

```python
import anthropic
import json

client = anthropic.Anthropic()

def submit_batch(requests: list[dict]) -> str:
    """Submit a batch of requests and return the batch ID."""
    batch_requests = []
    for i, req in enumerate(requests):
        batch_requests.append({
            "custom_id": req.get("id", f"req_{i}"),
            "params": {
                "model": req.get("model", "claude-sonnet-4-6"),
                "max_tokens": req.get("max_tokens", 1024),
                "messages": req["messages"]
            }
        })

    batch = client.messages.batches.create(requests=batch_requests)
    print(f"Batch submitted: {batch.id} | {len(batch_requests)} requests")
    return batch.id

def poll_batch(batch_id: str) -> list[dict]:
    """Poll until batch completes and return results."""
    import time
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        print(f"Status: {batch.processing_status} | "
              f"Succeeded: {batch.request_counts.succeeded} | "
              f"Errored: {batch.request_counts.errored}")

        if batch.processing_status == "ended":
            break
        time.sleep(60)  # Poll every minute

    results = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            results.append({
                "id": result.custom_id,
                "text": result.result.message.content[0].text,
                "usage": {
                    "input": result.result.message.usage.input_tokens,
                    "output": result.result.message.usage.output_tokens,
                }
            })
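        else:
            # Non-success types are "errored", "canceled", or "expired";
            # only "errored" results carry an error payload, so record the type.
            results.append({
                "id": result.custom_id,
                "error": result.result.type
            })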
    return results

# Usage example
documents = [
    {"id": f"doc_{i}", "content": f"Document content {i}..."}
    for i in range(100)
]

batch_requests = [
    {
        "id": doc["id"],
        "messages": [{
            "role": "user",
            "content": f"Summarize this document in 2 sentences:\n\n{doc['content']}"
        }]
    }
    for doc in documents
]

batch_id = submit_batch(batch_requests)
# Store batch_id and poll later (for example, from a scheduled job)
```

For workloads that need faster turnaround but not real-time responses, async concurrency is the right tool. Python's `asyncio` with the Anthropic async client allows you to run many requests concurrently within rate limits:

```python
import asyncio
import anthropic
from asyncio import Semaphore

async_client = anthropic.AsyncAnthropic()

async def process_document(
    semaphore: Semaphore,
    document: dict,
    model: str = "claude-sonnet-4-6"
) -> dict:
    async with semaphore:
        response = await async_client.messages.create(
            model=model,
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": f"Extract key entities from:\n\n{document['content']}"
            }]
        )
        return {
            "id": document["id"],
            "entities": response.content[0].text,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        }

async def process_all_documents(documents: list[dict], concurrency: int = 20) -> list[dict]:
    semaphore = Semaphore(concurrency)
    tasks = [process_document(semaphore, doc) for doc in documents]
    return await asyncio.gather(*tasks, return_exceptions=True)

# Run 100 documents concurrently (up to 20 at a time)
results = asyncio.run(process_all_documents(documents, concurrency=20))
```

> **Key Insight:** The concurrency limit should be tuned against your rate limits, not set arbitrarily. Anthropic enforces limits on requests per minute and on input and output tokens per minute. At high concurrency with large prompts, you'll usually hit the token limits before the request limit. Monitor for 429 responses and implement exponential backoff.
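
A starting value for the semaphore can be estimated from the limits rather than guessed. The sketch below applies Little's law (requests in flight = arrival rate x time in flight) to a single combined token limit, which is a simplification when input and output tokens are limited separately. The limits and workload numbers are hypothetical, and the 20% headroom is an assumption.

```python
import math

def max_safe_concurrency(
    tokens_per_minute_limit: int,
    requests_per_minute_limit: int,
    avg_tokens_per_request: int,
    avg_latency_seconds: float,
    headroom: float = 0.8,  # assumption: stay 20% under the limit
) -> int:
    """Estimate how many in-flight requests the rate limits can sustain."""
    token_bound_rpm = tokens_per_minute_limit / avg_tokens_per_request
    sustainable_rpm = min(token_bound_rpm, requests_per_minute_limit) * headroom
    # Little's law: concurrency = arrival rate x time each request is in flight
    concurrency = sustainable_rpm / 60.0 * avg_latency_seconds
    return max(1, math.floor(concurrency))

# Hypothetical limits and workload: read yours from your account's limits page.
print(max_safe_concurrency(
    tokens_per_minute_limit=400_000,
    requests_per_minute_limit=1_000,
    avg_tokens_per_request=3_000,
    avg_latency_seconds=6.0,
))
```

Treat the result as an initial setting and adjust it based on the 429 rate you observe.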

Exponential backoff with jitter is non-negotiable for production async workloads:

```python
import random

async def resilient_request(
    semaphore: Semaphore,
    messages: list[dict],
    model: str,
    max_retries: int = 5
) -> anthropic.types.Message:
    async with semaphore:
        for attempt in range(max_retries):
            try:
                return await async_client.messages.create(
                    model=model,
                    max_tokens=1024,
                    messages=messages
                )
            except anthropic.RateLimitError:
                if attempt == max_retries - 1:
                    raise
                wait = (2 ** attempt) + random.uniform(0, 1)
                await asyncio.sleep(wait)
            except anthropic.APIStatusError as e:
                if e.status_code >= 500 and attempt < max_retries - 1:
                    # Add jitter here too, so retries after a server error don't synchronize
                    await asyncio.sleep((2 ** attempt) + random.uniform(0, 1))
                else:
                    raise
```

> **Warning:** Don't use `asyncio.gather` without a semaphore in production. Without concurrency limits, all requests fire simultaneously, saturate your rate limit immediately, and trigger cascading retries that make the situation worse.

Pipeline parallelism is a higher-order batching pattern. If your workload involves sequential LLM calls (e.g., extract → classify → summarize), you can pipeline stages so that Stage 2 starts processing the first batch of Stage 1 outputs while Stage 1 is still working on the remaining inputs. This improves throughput without changing per-request cost.
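
A minimal way to sketch this is with `asyncio.Queue` objects connecting one worker per stage. The stage functions below are stand-ins that sleep instead of calling a model, and the names `stage_worker`, `run_pipeline`, and `_DONE` are illustrative.

```python
import asyncio

_DONE = object()  # Sentinel marking the end of a stage's input

async def stage_worker(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Pull items from inbox, apply fn, and push results until the sentinel arrives."""
    while True:
        item = await inbox.get()
        if item is _DONE:
            await outbox.put(_DONE)  # Propagate shutdown downstream
            return
        await outbox.put(await fn(item))

# Stand-ins for LLM calls; replace with real async client calls.
async def extract(doc: str) -> str:
    await asyncio.sleep(0.01)
    return f"entities({doc})"

async def classify(extracted: str) -> str:
    await asyncio.sleep(0.01)
    return f"label({extracted})"

async def run_pipeline(docs: list[str]) -> list[str]:
    # The bounded middle queue applies backpressure if stage 2 falls behind.
    q_in, q_mid, q_out = asyncio.Queue(), asyncio.Queue(maxsize=8), asyncio.Queue()
    workers = [
        asyncio.create_task(stage_worker(extract, q_in, q_mid)),
        asyncio.create_task(stage_worker(classify, q_mid, q_out)),
    ]
    for doc in docs:
        await q_in.put(doc)
    await q_in.put(_DONE)

    results = []
    while (item := await q_out.get()) is not _DONE:
        results.append(item)
    await asyncio.gather(*workers)
    return results

results = asyncio.run(run_pipeline([f"doc_{i}" for i in range(5)]))
print(results[0], len(results))
```

One worker per stage keeps results in input order. Running several workers per stage raises throughput further, but then each worker needs its own sentinel and output order is no longer guaranteed.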

**Key Takeaways**

- The Batch API's 50% discount applies to any workload that can tolerate hours of latency; use it for all offline processing.
- Async concurrency enables high-throughput processing for workloads that need results within minutes.
- Rate limits cover both requests and tokens per minute; with large prompts the token limits bind first, so tune concurrency against them.
- Exponential backoff with jitter is required for production reliability under rate limiting.
- Pipeline parallelism improves throughput for multi-stage workloads without changing per-call cost.

**Practical Exercise**

Identify one non-real-time workload in your system — nightly processing, content generation, batch enrichment. Submit it as a Message Batches API job and compare total cost to your previous synchronous implementation. Record the cost savings as a percentage.

---

## Chapter 8: Monitoring and Alerting on Token Spend

You cannot optimize what you cannot measure, and you cannot control what you cannot monitor. LLM cost monitoring is operationally immature in most organizations — teams react to billing surprises rather than detecting drift in real time. This chapter builds the monitoring infrastructure that makes proactive cost management possible.

The minimum viable monitoring stack has three components: per-request logging, aggregated metrics, and alerting on anomalies. Everything else is optional but valuable.

Per-request logging was introduced in Chapter 1. Assuming that's in place, the next step is aggregating logs into queryable metrics. A lightweight approach using SQLite is sufficient for teams processing under 1M requests/day:

```python
import sqlite3
import json
from datetime import datetime, timedelta

def init_metrics_db(db_path: str = "llm_metrics.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS token_usage (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp TEXT NOT NULL,
            endpoint TEXT NOT NULL,
            model TEXT NOT NULL,
            user_id TEXT,
            input_tokens INTEGER NOT NULL,
            output_tokens INTEGER NOT NULL,
            cache_read_tokens INTEGER DEFAULT 0,
            cost_usd REAL NOT NULL
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_timestamp ON token_usage(timestamp)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_endpoint ON token_usage(endpoint)")
    conn.commit()
    return conn

def insert_usage(conn: sqlite3.Connection, record: dict):
    conn.execute("""
        INSERT INTO token_usage
        (timestamp, endpoint, model, user_id, input_tokens, output_tokens, cache_read_tokens, cost_usd)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        datetime.utcnow().isoformat(),
        record["endpoint"],
        record["model"],
        record.get("user_id"),
        record["input_tokens"],
        record["output_tokens"],
        record.get("cache_read_tokens", 0),
        record["cost_usd"]
    ))
    conn.commit()

def hourly_cost_by_endpoint(conn: sqlite3.Connection, hours: int = 24) -> list[dict]:
    since = (datetime.utcnow() - timedelta(hours=hours)).isoformat()
    cursor = conn.execute("""
        SELECT
            strftime('%Y-%m-%d %H:00', timestamp) as hour,
            endpoint,
            SUM(cost_usd) as total_cost,
            SUM(input_tokens) as total_input,
            SUM(output_tokens) as total_output,
            COUNT(*) as calls
        FROM token_usage
        WHERE timestamp >= ?
        GROUP BY hour, endpoint
        ORDER BY hour DESC, total_cost DESC
    """, (since,))
    return [dict(zip([d[0] for d in cursor.description], row)) for row in cursor.fetchall()]
```

> **Key Insight:** Aggregate costs hourly, not daily. Daily aggregation masks intraday spikes that indicate bugs, runaway loops, or abuse. Hourly data lets you detect and stop a cost anomaly within about an hour of it starting, rather than at the end of the day.

Anomaly detection for token spend doesn't require ML. A simple baseline-plus-threshold approach catches the vast majority of cost events:

```python
def detect_cost_anomalies(
    conn: sqlite3.Connection,
    endpoint: str,
    lookback_days: int = 14,
    threshold_multiplier: float = 3.0
) -> list[dict]:
    """Flag hours where cost exceeds 3x the 14-day hourly average."""
    baseline_since = (datetime.utcnow() - timedelta(days=lookback_days)).isoformat()
    recent_since = (datetime.utcnow() - timedelta(hours=24)).isoformat()

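    # Exclude the last 24 hours from the baseline so a current spike
    # does not inflate the average it is being compared against.
    baseline = conn.execute("""
        SELECT AVG(hourly_cost) as avg_hourly_cost
        FROM (
            SELECT strftime('%Y-%m-%d %H:00', timestamp) as hour,
                   SUM(cost_usd) as hourly_cost
            FROM token_usage
            WHERE endpoint = ? AND timestamp >= ? AND timestamp < ?
            GROUP BY hour
        )
    """, (endpoint, baseline_since, recent_since)).fetchone()[0] or 0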

    recent_hours = conn.execute("""
        SELECT strftime('%Y-%m-%d %H:00', timestamp) as hour,
               SUM(cost_usd) as hourly_cost,
               COUNT(*) as calls
        FROM token_usage
        WHERE endpoint = ? AND timestamp >= ?
        GROUP BY hour
        ORDER BY hour DESC
    """, (endpoint, recent_since)).fetchall()

    anomalies = []
    for hour, cost, calls in recent_hours:
        if baseline > 0 and cost > baseline * threshold_multiplier:
            anomalies.append({
                "hour": hour,
                "cost": cost,
                "baseline": baseline,
                "ratio": cost / baseline,
                "calls": calls,
                "endpoint": endpoint
            })
    return anomalies
```

Alerting should send to wherever your team actually pays attention — Slack, PagerDuty, email. A simple Slack webhook alert:

```python
import urllib.request

def send_slack_alert(webhook_url: str, message: str):
    payload = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)  # Don't let a slow webhook hang the check

def run_anomaly_check(conn: sqlite3.Connection, slack_webhook: str, endpoints: list[str]):
    for endpoint in endpoints:
        anomalies = detect_cost_anomalies(conn, endpoint)
        for a in anomalies:
            msg = (
                f":warning: *LLM Cost Anomaly* — `{a['endpoint']}`\n"
                f"Hour: {a['hour']}\n"
                f"Cost: ${a['cost']:.4f} ({a['ratio']:.1f}x baseline)\n"
                f"Calls: {a['calls']}"
            )
            send_slack_alert(slack_webhook, msg)
```
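
If the check runs every hour over a 24-hour window, the same anomalous hour gets reported on every run until it ages out. One way to avoid repeat alerts is to record which (endpoint, hour) pairs have already been reported. The table name `alerts_sent` and the helper names below are assumptions for illustration.

```python
import sqlite3

def init_alert_log(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS alerts_sent (
            endpoint TEXT NOT NULL,
            hour TEXT NOT NULL,
            PRIMARY KEY (endpoint, hour)
        )
    """)
    conn.commit()

def should_alert(conn: sqlite3.Connection, endpoint: str, hour: str) -> bool:
    """Return True the first time an (endpoint, hour) pair is seen, False afterwards."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO alerts_sent (endpoint, hour) VALUES (?, ?)",
        (endpoint, hour),
    )
    conn.commit()
    return cursor.rowcount == 1  # 0 means the row already existed

conn = sqlite3.connect(":memory:")
init_alert_log(conn)
print(should_alert(conn, "chat_reply", "2026-04-21 14:00"))  # First sighting
print(should_alert(conn, "chat_reply", "2026-04-21 14:00"))  # Repeat
```

Calling `should_alert` before sending each message keeps the channel to one alert per anomalous hour.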

> **Warning:** Don't set your anomaly threshold too low. A 1.5x threshold will generate constant noise and get ignored. A 3x threshold catches genuine anomalies — runaway loops, DDoS-style abuse, or newly deployed code with a context bug — without crying wolf. Tune based on your actual traffic variability.

Beyond anomaly detection, track these metrics as ongoing KPIs: cost per active user per day, cache hit rate, routing tier distribution (what percentage of calls went to each model tier), and average input token count per endpoint. These four metrics tell you whether your optimizations are working and whether new code changes are regressing cost behavior.

```bash
# Quick daily summary from command line
sqlite3 llm_metrics.db "
SELECT
  endpoint,
  ROUND(SUM(cost_usd), 4) as total_cost,
  COUNT(*) as calls,
  ROUND(AVG(input_tokens), 0) as avg_input,
  ROUND(AVG(output_tokens), 0) as avg_output
FROM token_usage
WHERE datetime(timestamp) >= datetime('now', '-1 day')
GROUP BY endpoint
ORDER BY total_cost DESC;
"
```
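
Two of the KPIs above can be read straight from the `token_usage` table. The sketch below builds a small in-memory copy of that table so it runs on its own; the model names and rows are made-up sample data. It assumes `input_tokens` excludes tokens read from the provider cache, so the two columns can be added to get total input, and it measures provider prompt-cache reads rather than semantic cache hits.

```python
import sqlite3

def model_distribution(conn: sqlite3.Connection, since_iso: str) -> dict[str, float]:
    """Share of calls per model since the given ISO timestamp."""
    rows = conn.execute(
        "SELECT model, COUNT(*) FROM token_usage WHERE timestamp >= ? GROUP BY model",
        (since_iso,),
    ).fetchall()
    total = sum(count for _, count in rows)
    return {model: count / total for model, count in rows} if total else {}

def cache_read_share(conn: sqlite3.Connection, since_iso: str) -> float:
    """Fraction of input-side tokens served from the provider prompt cache."""
    cached, uncached = conn.execute(
        "SELECT COALESCE(SUM(cache_read_tokens), 0), COALESCE(SUM(input_tokens), 0) "
        "FROM token_usage WHERE timestamp >= ?",
        (since_iso,),
    ).fetchone()
    total = cached + uncached
    return cached / total if total else 0.0

# Sample data only: column layout mirrors the token_usage schema without the id.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE token_usage (
    timestamp TEXT, endpoint TEXT, model TEXT, user_id TEXT,
    input_tokens INTEGER, output_tokens INTEGER,
    cache_read_tokens INTEGER DEFAULT 0, cost_usd REAL)""")
conn.executemany(
    "INSERT INTO token_usage VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("2026-04-21T10:00:00", "chat_reply", "cheap-model", "u1", 200, 50, 800, 0.001),
        ("2026-04-21T10:05:00", "chat_reply", "cheap-model", "u2", 1000, 60, 0, 0.002),
        ("2026-04-21T10:09:00", "chat_reply", "frontier-model", "u1", 800, 300, 1200, 0.040),
        ("2026-04-21T10:12:00", "chat_reply", "cheap-model", "u3", 1000, 40, 0, 0.002),
    ],
)
print(model_distribution(conn, "2026-04-21T00:00:00"))
print(round(cache_read_share(conn, "2026-04-21T00:00:00"), 2))
```

Mapping model names to tiers in application code turns the first result into the routing tier distribution.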

**Key Takeaways**

- Hourly aggregation, not daily, is the minimum resolution needed to catch cost anomalies before they compound.
- A 3x-above-baseline threshold for anomaly alerting catches genuine events without generating noise.
- Four KPIs cover most of cost monitoring: cost per user, cache hit rate, routing tier distribution, and average input tokens per endpoint.
- Alerts should go to channels the team actually monitors — a monitoring system nobody looks at doesn't exist.
- Track these metrics week-over-week to detect gradual drift, not just acute spikes.

**Practical Exercise**

Set up the SQLite logging and run the hourly anomaly detection script as a cron job every hour. Configure it to alert to a Slack channel. Let it run for two weeks. The anomalies it catches in those two weeks will tell you more about your actual cost risks than any static analysis.

---

## Chapter 9: Building a Token Budget Culture on Your Team

Individual optimizations are valuable, but they don't compound unless your team treats token spend as a first-class engineering concern — something designed for, reviewed against, and monitored continuously. This chapter covers the organizational and process changes that make cost efficiency sustainable rather than a one-time sprint.

The fundamental problem is incentive alignment. Engineers are evaluated on features shipped and latency maintained. Token costs are an invisible externality. Until someone is looking at the cost dashboard and connecting it to specific engineering decisions, the default behavior is to write whatever code produces a passing demo, not whatever code is economical at scale.

The first step is visibility. Every engineer who writes code that calls the LLM API should have access to the cost dashboard and know what their endpoints cost. Not an aggregated company total — their endpoints, specifically. The moment an engineer sees that their feature costs $0.40 per user interaction, the conversation about optimization begins naturally.

> **Key Insight:** Attribution is more powerful than reporting. "LLM API cost: $3,200 this month" produces no action. "Your chat endpoint costs $0.38/interaction, which projects to $18,000/month at current growth" produces engineering decisions.
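
The projection in that kind of statement is simple compounding, and it is easy to generate per endpoint. The sketch below is illustrative: the function name, the interaction volume, and the growth rate are hypothetical inputs, not figures from a real system.

```python
def project_monthly_cost(
    cost_per_interaction: float,
    interactions_per_month: int,
    monthly_growth_rate: float = 0.0,
    months_ahead: int = 0,
) -> float:
    """Project an endpoint's monthly cost, compounding growth in interaction volume."""
    projected_volume = interactions_per_month * (1 + monthly_growth_rate) ** months_ahead
    return cost_per_interaction * projected_volume

# Hypothetical volume and growth figures for illustration only.
now = project_monthly_cost(0.38, 30_000)
in_six_months = project_monthly_cost(
    0.38, 30_000, monthly_growth_rate=0.08, months_ahead=6
)
print(f"this month: ${now:,.0f} | in six months: ${in_six_months:,.0f}")
```

Feeding it the per-endpoint cost and call counts from your usage logs gives each owner a number tied to their own code.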

Define token budgets per endpoint before shipping, not after. This is the highest-leverage process change available. A token budget is a maximum input token count and maximum output token count for a given endpoint at p95. Define it during design, measure it during implementation, and treat exceeding it as a regression that requires justification.

A token budget specification might look like this:

```yaml
# token_budgets.yml
endpoints:
  chat_reply:
    max_input_tokens_p95: 4000
    max_output_tokens_p95: 600
    target_model: claude-sonnet-4-6
    allowed_escalation_rate: 0.15  # Max 15% of calls may use frontier model
    monthly_budget_usd: 8000

  document_summarize:
    max_input_tokens_p95: 8000
    max_output_tokens_p95: 400
    target_model: claude-haiku-4-5-20251001
    allowed_escalation_rate: 0.0  # No routing needed
    monthly_budget_usd: 2000

  code_review:
    max_input_tokens_p95: 12000
    max_output_tokens_p95: 2000
    target_model: claude-opus-4-7
    allowed_escalation_rate: 1.0  # Always frontier
    monthly_budget_usd: 5000
```

Include token budget checks in your CI/CD pipeline. For endpoints with defined budgets, run a cost regression test against a representative test dataset before merging:

```python
# tests/test_token_budgets.py
import math

import pytest
import yaml
from your_app.client import InstrumentedClient

with open("token_budgets.yml") as f:
    BUDGETS = yaml.safe_load(f)["endpoints"]

# One user message per test case, keyed by endpoint.
TEST_CASES = {
    "chat_reply": [
        {"role": "user", "content": "What are your support hours?"},
        {"role": "user", "content": "I need help with my account"},
        # ... 20 representative test cases
    ]
}


def p95(values):
    """Nearest-rank 95th percentile: the smallest value covering 95% of samples."""
    ordered = sorted(values)
    return ordered[math.ceil(len(ordered) * 0.95) - 1]


@pytest.mark.parametrize("endpoint", list(TEST_CASES.keys()))
def test_token_budget_not_exceeded(endpoint):
    budget = BUDGETS[endpoint]
    client = InstrumentedClient()

    input_counts = []
    output_counts = []

    for message in TEST_CASES[endpoint]:
        # Leave headroom above the budget. Capping max_tokens at the budget itself
        # would truncate long responses, so the output assertion could never fail.
        response = client.create(
            endpoint=endpoint,
            messages=[message],
            model=budget["target_model"],
            max_tokens=budget["max_output_tokens_p95"] * 2,
        )
        input_counts.append(response.usage.input_tokens)
        output_counts.append(response.usage.output_tokens)

    p95_input = p95(input_counts)
    p95_output = p95(output_counts)

    assert p95_input <= budget["max_input_tokens_p95"], (
        f"{endpoint}: p95 input {p95_input} exceeds budget {budget['max_input_tokens_p95']}"
    )
    assert p95_output <= budget["max_output_tokens_p95"], (
        f"{endpoint}: p95 output {p95_output} exceeds budget {budget['max_output_tokens_p95']}"
    )
```

> **Warning:** Cost regression tests only work if the test cases are representative of production traffic. A test suite of ten trivial queries will pass easily while production traffic with complex multi-turn histories blows the budget. Maintain test cases that reflect the actual distribution of production requests.
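
One way to keep the test set representative is to sample it from production logs, stratified by request size. The sketch below is a minimal version of that idea. It assumes each logged request is a dict with an `input_tokens` field, and the bucket edges are arbitrary values picked for illustration; tune both to your own traffic.

```python
import random
from collections import defaultdict


def bucket_for(input_tokens, edges=(500, 2000, 8000)):
    """Index of the first size bucket whose upper edge covers the token count."""
    for index, edge in enumerate(edges):
        if input_tokens <= edge:
            return index
    return len(edges)  # Overflow bucket for the largest requests.


def stratified_sample(logged_requests, sample_size, seed=0):
    """Sample requests so each size bucket keeps roughly its share of traffic."""
    rng = random.Random(seed)  # Fixed seed keeps the test set reproducible.
    buckets = defaultdict(list)
    for request in logged_requests:
        buckets[bucket_for(request["input_tokens"])].append(request)

    total = len(logged_requests)
    sample = []
    for _, requests in sorted(buckets.items()):
        # Proportional allocation, but always at least one request per bucket
        # so that rare, expensive requests are never dropped from the test set.
        k = max(1, round(sample_size * len(requests) / total))
        sample.extend(rng.sample(requests, min(k, len(requests))))
    return sample
```

Because every non-empty bucket contributes at least one request, the result can be slightly larger than `sample_size`. That trade seems preferable to silently losing the long-context tail that tends to drive p95.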

Code review for LLM features should include a token cost checkpoint. Add it to your PR template:

```markdown
## LLM Cost Checklist
- [ ] Endpoint has a defined token budget in token_budgets.yml
- [ ] System prompt length is under 500 tokens (or justified if over)
- [ ] All included tool definitions are necessary for this endpoint
- [ ] Conversation history uses windowed summarization, not full concatenation
- [ ] Retrieved context is bounded by token budget, not document count
- [ ] Cost regression test passes (see CI output)
```
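
The first checklist item can be checked mechanically rather than by eye. The sketch below assumes the budgets have already been loaded into a dict (for example from the `endpoints` key of `token_budgets.yml`) and that you can supply the set of endpoint names your code defines. The function name `find_budget_problems` and the way endpoints are discovered are illustrative assumptions.

```python
# Keys every budget entry is expected to define, mirroring the YAML example.
REQUIRED_KEYS = {
    "max_input_tokens_p95",
    "max_output_tokens_p95",
    "target_model",
    "allowed_escalation_rate",
    "monthly_budget_usd",
}


def find_budget_problems(budgets, endpoints_in_code):
    """Return human-readable problems; an empty list means the file is consistent."""
    problems = []
    for endpoint in sorted(endpoints_in_code):
        if endpoint not in budgets:
            problems.append(f"{endpoint}: no entry in token_budgets.yml")
            continue
        missing = REQUIRED_KEYS - set(budgets[endpoint])
        if missing:
            problems.append(f"{endpoint}: missing keys {sorted(missing)}")
            continue
        rate = budgets[endpoint]["allowed_escalation_rate"]
        if not 0.0 <= rate <= 1.0:
            problems.append(f"{endpoint}: allowed_escalation_rate {rate} outside [0, 1]")
    return problems
```

Run as a CI step, a non-empty result fails the build, so a new endpoint cannot ship without someone having written down its budget.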

Quarterly optimization sprints complement the ongoing process. Every three months, pull the past quarter's cost data, identify the five most expensive endpoints, and dedicate a sprint to reducing them. Track cost-per-user as the primary metric — it accounts for traffic growth and makes optimization wins visible even as usage scales.
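
The sprint's starting point can be a few lines of analysis over the quarter's cost log. The sketch below assumes the log can be reduced to `(endpoint, cost_usd)` pairs and that an active-user count is available from your product analytics; both are assumptions about your data, not a fixed schema.

```python
from collections import defaultdict


def top_endpoints_by_cost(cost_rows, n=5):
    """Sum cost per endpoint and return the n most expensive, highest first."""
    totals = defaultdict(float)
    for endpoint, cost_usd in cost_rows:
        totals[endpoint] += cost_usd
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


def cost_per_user(total_cost_usd, active_users):
    """Spend normalized by active users, so growth does not mask efficiency gains."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return total_cost_usd / active_users
```

For example, a quarter that goes from $12,000 across 4,000 users to $18,000 across 9,000 users shows rising total spend, yet cost per user falls from $3.00 to $2.00. The total alone would hide that improvement.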

**Key Takeaways**

- Visibility and attribution are prerequisites for cultural change — show engineers what their specific endpoints cost.
- Define token budgets per endpoint before shipping, and treat budget exceedance as a regression requiring justification.
- Add token budget checks to CI/CD pipelines to catch cost regressions before they reach production.
- Include a cost checkpoint in LLM feature PR templates to normalize cost review in the development process.
- Quarterly optimization sprints targeting the five most expensive endpoints sustain cost efficiency over time.

**Practical Exercise**

Create `token_budgets.yml` for your three most expensive endpoints. Define input and output p95 budgets based on current measured values, then set a target budget that is 30% lower. Schedule a two-week sprint to get actual costs below the target budget for each endpoint.
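
A small helper can do the arithmetic for this exercise. The sketch below computes a nearest-rank p95 from measured token counts and derives a target 30% lower. The function names are illustrative, and integer arithmetic is used throughout to avoid floating-point rounding surprises.

```python
def nearest_rank_p95(values):
    """Nearest-rank 95th percentile of a non-empty list of token counts."""
    if not values:
        raise ValueError("need at least one measurement")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # Integer ceil(0.95 * n).
    return ordered[rank - 1]


def target_budget(measured_p95, reduction_pct=30):
    """Budget set reduction_pct percent below the measured p95, rounded down."""
    return measured_p95 * (100 - reduction_pct) // 100


def propose_budgets(measured_inputs, measured_outputs):
    """Current and target p95 budgets for one endpoint."""
    current_in = nearest_rank_p95(measured_inputs)
    current_out = nearest_rank_p95(measured_outputs)
    return {
        "current": {"input": current_in, "output": current_out},
        "target": {"input": target_budget(current_in),
                   "output": target_budget(current_out)},
    }
```

Recording both the current and the target values makes the sprint's progress measurable: the gap between them is the work to be done.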

---

## Conclusion

The strategies in this book are not theoretical — each one has a production precedent, and most produce measurable results within the first week of deployment. But the larger lesson is structural: LLM API cost is an engineering problem, not a budget problem. Budget conversations happen after the bill arrives. Engineering conversations happen before the code is written.

The sequence that produces the most durable results is this: measure first, prioritize by impact, implement systematically, and monitor continuously. Skipping measurement and jumping to implementation is how teams spend three weeks optimizing an endpoint that accounts for 4% of their total spend while the real cost driver sits untouched.

If you've worked through the practical exercises in each chapter, you now have:

- Instrumented logging with per-call cost attribution
- A breakdown of where your token budget actually goes, by endpoint and component
- A compressed system prompt that does the same job in fewer tokens
- Semantic caching with a measured hit rate for your highest-traffic endpoint
- Routing logic that sends simple traffic to cheap models and escalates only when necessary
- A retrieval pipeline bounded by token budgets rather than document counts
- Async or batch processing for at least one non-real-time workload
- Hourly cost monitoring with anomaly alerting
- Token budgets per endpoint with CI enforcement

Applied together, these practices typically produce 40–60% cost reductions in the first quarter. The second quarter tends to produce smaller gains — you've captured the large obvious waste — but ongoing monitoring and optimization sprints continue to compound. Teams that institutionalize the process with defined budgets, cost-aware code review, and regular optimization cycles maintain cost efficiency as their LLM usage scales.

The final thing worth saying is this: the goal is not to minimize LLM spend in absolute terms. The goal is to maximize the value delivered per dollar spent. A team that cuts costs by 50% while their LLM features become less useful has failed at the actual optimization problem. Every technique in this book is designed to eliminate waste — tokens that don't move the quality needle, API calls that return identical responses, model capacity that exceeds what the task requires. The token that does real work is worth spending.

Spend on what matters. Cut everything else.

---

## Appendix A: Glossary

**Batch API** — A provider feature that allows submission of multiple LLM requests for asynchronous processing at reduced cost, accepting higher latency in exchange for pricing discounts (typically 50%).

**BM25** — A keyword-based ranking algorithm used in hybrid retrieval systems alongside semantic (vector) search.

**Cache hit rate** — The percentage of LLM requests served from a semantic or prompt cache rather than requiring a live API call.

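**Cache TTL (Time to Live)** — The duration for which a cached entry is considered valid. Provider-side prompt caches typically use short TTLs; Anthropic's default is 5 minutes, refreshed each time the cached prefix is read. Check your provider's documentation for current values and any longer-lived options.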

**Cascade routing** — A model routing strategy that attempts a cheaper model first and escalates to a more capable model if the response quality is insufficient.

**Chunk** — A segment of text produced by splitting a larger document for retrieval purposes.

**Cross-encoder reranker** — A model that takes a query-document pair as joint input and produces a relevance score; more accurate than bi-encoders but computationally heavier.

**Dense retrieval** — Retrieval based on vector embeddings and similarity search (cosine, dot product), as opposed to sparse/keyword retrieval.

**HyDE (Hypothetical Document Embedding)** — A retrieval technique that embeds a generated hypothetical answer rather than the raw query to improve retrieval precision.

**Input tokens** — Tokens consumed by the content sent to the model (system prompt, user message, conversation history, retrieved context, tool definitions).

**Output tokens** — Tokens produced by the model's response; typically priced higher per unit than input tokens.

**Prompt caching** — A provider feature (available on Anthropic and others) that stores computed attention states for a stable prompt prefix, charging a reduced rate on cache reads.

**RAG (Retrieval-Augmented Generation)** — An architecture that retrieves relevant documents or passages from an external store and includes them in the LLM prompt as context.

**Reranking** — A second-pass ranking step applied to initial retrieval candidates to improve relevance ordering.

**Semantic caching** — A caching strategy that matches incoming queries against cached queries by semantic similarity rather than exact string match.

**Sliding window** — A conversation history management strategy that retains only the most recent N messages, optionally summarizing older messages to preserve semantic content.

**Token budget** — A defined maximum number of input or output tokens for an endpoint, used to bound cost and enforce efficiency standards.

**Tokens per minute (TPM)** — A rate-limit unit used by most LLM providers, usually alongside requests per minute (RPM); it caps the number of tokens that can be processed per minute across all requests.

**Tool definition** — The JSON schema describing a function the model can invoke via tool use / function calling; counted as input tokens on every request regardless of invocation.

---

## Appendix B: Tools and Resources

### Token Counting

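- **tiktoken** (OpenAI) — `pip install tiktoken` — Fast BPE tokenizer for counting tokens before sending requests. Encodings are model-specific: `cl100k_base` matches GPT-4-era OpenAI models, and newer OpenAI models use `o200k_base`. For other providers' models, treat tiktoken counts as a rough estimate only.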
- **Anthropic token counting API** — The Anthropic client supports `client.messages.count_tokens()` for exact pre-call token counts without making a full inference call.

### Semantic Caching Infrastructure

- **Redis + RedisVL** — `pip install redisvl` — Production-grade vector store with TTL support built on Redis. Good default for teams already running Redis.
- **Chroma** — `pip install chromadb` — Lightweight embedded or hosted vector database. Good for development and small-scale production.
- **Weaviate** — Hosted vector database with built-in hybrid search. Strong option for teams that need managed infrastructure.
- **pgvector** — PostgreSQL extension for vector storage. Best for teams that want to keep everything in their existing Postgres instance.

### Reranking

- **sentence-transformers** — `pip install sentence-transformers` — Includes cross-encoder models for reranking. `cross-encoder/ms-marco-MiniLM-L-6-v2` is a fast, accurate default.
- **Cohere Rerank API** — Hosted reranking service. Convenient if you don't want to self-host a reranker model.

### Monitoring and Metrics

- **SQLite** — Built into Python. Sufficient for teams processing under 1M requests/day.
- **ClickHouse** — Columnar database designed for high-volume time-series event data. Best choice for teams processing tens of millions of requests per day.
- **Grafana + Prometheus** — Standard observability stack for dashboarding token metrics alongside other engineering metrics.

### Async and Batching

- **anthropic** Python SDK — Built-in async client via `anthropic.AsyncAnthropic()`. Supports all endpoints with async/await.
- **tenacity** — `pip install tenacity` — Retry library with exponential backoff and jitter. More ergonomic than manual retry loops for production code.

### Cost Estimation

- **LiteLLM** — `pip install litellm` — Multi-provider LLM client with built-in cost tracking across providers. Useful for teams working across OpenAI, Anthropic, and others.

---

## Appendix C: Further Reading

### Provider Documentation

- Anthropic Message Batches API — Official documentation covering batch submission, polling, result retrieval, and pricing.
- Anthropic Prompt Caching Guide — Documentation on `cache_control` breakpoints, TTL behavior, and cache hit metrics.
- Anthropic Rate Limits — Current TPM and RPM limits by tier and model.

### Academic and Technical Papers

- **"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"** (Lewis et al., 2020) — The foundational RAG paper. Provides context for why retrieval quality matters to output quality, not just cost.
- **"Precise Zero-Shot Dense Retrieval without Relevance Labels"** (Gao et al., 2022) — The HyDE paper. Clear and short.
- **"Lost in the Middle: How Language Models Use Long Contexts"** (Liu et al., 2023) — Explains why stuffing your context window with retrieved content doesn't always improve output quality — and sometimes degrades it.

### Books

- **"Designing Data-Intensive Applications"** (Kleppmann) — Not about LLMs, but the chapters on indexing, caching, and batch processing underpin the infrastructure patterns in this book.
- **"The Pragmatic Programmer"** (Thomas and Hunt) — The section on "Good-Enough Software" applies directly to model routing decisions: perfection is the enemy of shipping at reasonable cost.

### Community Resources

- **Anthropic Cookbook** (GitHub: anthropics/anthropic-cookbook) — Practical code examples for prompt caching, tool use, batch processing, and streaming. Updated regularly.
- **LlamaIndex documentation** — Comprehensive coverage of RAG pipeline patterns including chunking strategies, reranking, and retrieval evaluation.
- **RAGAS** — `pip install ragas` — Framework for evaluating RAG pipeline quality. Useful for measuring whether retrieval optimizations maintain answer quality.

---

*How to Cut Your LLM Bill in Half — Kelly Price — 2026*

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*
