---
title: "Code Search Patterns"
subtitle: "50 Query Recipes for Debugging, Reviews, Onboarding, and Architecture"
author: "David Kelly Price"
version: "1.0"
date: 2026-03-21
status: draft
type: ebook
target_audience: "All Pyckle users — developers who want a ready-made library of semantic search queries for everyday tasks across debugging, code review, onboarding, refactoring, architecture analysis, performance, and security"
estimated_pages: 80
chapters:
  - "Debugging Queries"
  - "Code Review Queries"
  - "Onboarding Queries"
  - "Refactoring Queries"
  - "Architecture Queries"
  - "Performance Queries"
  - "Security Queries"
  - "Building Your Own Patterns"
tags:
  - pyckle
  - ebook
  - query-patterns
  - semantic-search
  - cookbook
  - debugging
  - code-review
  - onboarding
  - refactoring
  - architecture
  - performance
  - security
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Color scheme:
- Pyckle brand palette
- Callout boxes use muted background tints, not heavy borders

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Code blocks:
- Syntax highlighted by language
- Numbered lines for reference in explanatory text
- Copy-pasteable (no line numbers in actual code)

Recipe format:
- Pattern Name (H3)
- When to Use
- The Query
- What to Look For
- Example (concrete scenario with code)
- Variations
-->

---

# Code Search Patterns

## 50 Query Recipes for Debugging, Reviews, Onboarding, and Architecture

**By David Kelly Price**

Version 1.0 — March 2026

---

## Table of Contents

1. Debugging Queries
2. Code Review Queries
3. Onboarding Queries
4. Refactoring Queries
5. Architecture Queries
6. Performance Queries
7. Security Queries
8. Building Your Own Patterns

- Appendix A: Glossary
- Appendix B: Tools & Resources
- Appendix C: Further Reading

---

## About This Guide

This is a cookbook. It contains 50 semantic search query recipes organized by the task you are trying to accomplish -- debugging, reviewing, onboarding, refactoring, understanding architecture, diagnosing performance, and auditing security. Each recipe gives you the exact query, explains when to use it, tells you what to look for in the results, walks through a concrete example, and offers variations for edge cases. You can read it cover to cover or jump to the recipe you need right now.

---

## How to Use This Guide

**Reading order:** This guide is modular. Jump to the chapter that matches what you are doing right now. Chapter 1 is the most universally useful starting point. Chapter 8 is worth reading after you have used a dozen or so recipes and are ready to create your own.

**Recipe format:** Every recipe follows the same structure:
- **Pattern Name** -- a descriptive name you can reference
- **When to Use** -- the situation that calls for this query
- **The Query** -- the exact text to type into Pyckle
- **What to Look For** -- how to interpret the results
- **Example** -- a concrete scenario with code snippets and search output
- **Variations** -- related queries for edge cases or narrower searches

**Exercises:** Each chapter ends with a hands-on exercise. They take 5-15 minutes and use your own codebase. The recipes are abstract until you see them return results from code you know.

**Prerequisites:** A codebase indexed with Pyckle. Familiarity with your project's structure. No particular language or framework required -- the recipes work across languages because semantic search operates on meaning, not syntax.

**Companion content:** This guide expands on concepts introduced in the *Code Search, Decoded* series. Where relevant, episodes are referenced by number so you can dig deeper.

---

# Part I: Everyday Queries

---

## Chapter 1: Debugging Queries

### Chapter Overview

Debugging is a search problem. You have a symptom -- an error, a wrong value, a crash -- and you need to find the code that explains it. These nine recipes cover the most common debugging search patterns, from tracing error origins to mapping data flows to understanding configuration impacts. They build on the debugging workflow covered in Episode 14 of *Code Search, Decoded*.

---

### Recipe 1: Error Origin Trace

**When to Use:** You have an error message or exception and need to find where it originates in the codebase -- not just where it surfaces, but where the condition that causes it first arises.

**The Query:**

```
where does this error originate: [error message or type]
```

**What to Look For:** The results should surface the throw/raise site, not just the catch site. Look for the function that first creates the error condition. Pay attention to results that show constructor calls for custom exception classes or the logic that decides to raise.

**Example:**

A user reports seeing "InsufficientFundsError" in production logs. You need to find where this error is thrown and what conditions trigger it.

```bash
pyckle search "where does InsufficientFundsError originate"
```

```
Results (5 hits, 9ms):
  src/services/wallet_service.py       WalletService.withdraw()           0.93
  src/exceptions/payment.py            class InsufficientFundsError       0.90
  src/services/transfer_service.py     TransferService.execute()          0.85
  src/validators/balance_checker.py    check_sufficient_balance()         0.81
  tests/test_wallet.py                 test_withdraw_insufficient()       0.74
```

The top result points at `WalletService.withdraw()` -- the function that actually raises the error. The second result is the exception class definition. The third reveals a second call path through `TransferService.execute()`. The validator in result four shows the balance check that precedes the error. Now you know there are two code paths that can produce this error, not one.

```python
# src/services/wallet_service.py
class WalletService:
    def withdraw(self, account_id: str, amount: Decimal) -> Transaction:
        balance = self.get_balance(account_id)
        if balance < amount:
            raise InsufficientFundsError(
                f"Account {account_id}: balance {balance}, requested {amount}"
            )
        return self._execute_withdrawal(account_id, amount)
```

The fix depends on which code path is producing the production error. The search narrowed it from "somewhere in the codebase" to two specific functions in under 10 seconds.
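
To cross-check the search results, you can list literal raise sites with a small script. The sketch below uses Python's standard `ast` module; `find_raise_sites` and the sample source are illustrative names, not part of Pyckle. It only sees explicit `raise` statements, so treat it as a sanity check rather than a replacement for the search.

```python
import ast


def find_raise_sites(source: str, exception_name: str) -> list[tuple[str, int]]:
    """Return (enclosing function, line number) for each `raise exception_name`."""
    sites = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Note: a raise inside a nested function is reported for each enclosing function
        for node in ast.walk(func):
            if not isinstance(node, ast.Raise) or node.exc is None:
                continue
            # `raise Foo(...)` is a Call; `raise Foo` is a bare Name or Attribute
            target = node.exc.func if isinstance(node.exc, ast.Call) else node.exc
            name = getattr(target, "id", None) or getattr(target, "attr", None)
            if name == exception_name:
                sites.append((func.name, node.lineno))
    return sites


SAMPLE = '''
class WalletService:
    def withdraw(self, account_id, amount):
        balance = self.get_balance(account_id)
        if balance < amount:
            raise InsufficientFundsError("insufficient")
        return amount

    def deposit(self, account_id, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
'''

for function_name, line in find_raise_sites(SAMPLE, "InsufficientFundsError"):
    print(f"{function_name}() raises at line {line}")
```

Run it over the files the search returned. A file that scores high but contains no raise site is probably a caller or a handler, not an origin.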

**Variations:**
- `"what raises [ExceptionType] and under what conditions"` -- when you need the triggering logic, not just the raise site
- `"where is [error message string] constructed"` -- when the error message is custom and unique
- `"all throw sites for [exception type]"` -- when you suspect multiple raise sites

---

### Recipe 2: State Mutation Tracker

**When to Use:** A value is wrong at runtime and you need to find every function that can modify it. The debugger shows the bad value. Semantic search shows who changed it.

**The Query:**

```
what functions modify [state/field/variable name]
```

**What to Look For:** Setter methods, assignment statements, database update calls, and event handlers that mutate the value. Watch for indirect mutations -- functions that modify a parent object which includes the field, or ORM save calls that persist changes made elsewhere.

**Example:**

A user's `subscription_status` is showing as "active" when it should be "cancelled." You need every place that can change this field.

```bash
pyckle search "what functions modify subscription status"
```

```
Results (6 hits, 11ms):
  src/services/subscription.py     SubscriptionService.cancel()         0.94
  src/services/subscription.py     SubscriptionService.activate()       0.91
  src/webhooks/stripe_handler.py   handle_subscription_updated()        0.88
  src/workers/renewal_worker.py    process_renewal()                    0.84
  src/admin/subscription_admin.py  admin_override_status()              0.79
  src/migrations/backfill_v2.py    backfill_subscription_statuses()     0.71
```

Six mutation points. The service layer has the expected `cancel()` and `activate()`. But there is also a Stripe webhook that can update the status independently, a renewal worker that reactivates expired subscriptions, an admin override, and a migration backfill. If the cancel call succeeds but the webhook fires afterward with stale data from Stripe, the status gets overwritten back to "active."

```python
# src/webhooks/stripe_handler.py
def handle_subscription_updated(event):
    sub_id = event['data']['object']['id']
    status = event['data']['object']['status']
    # This overwrites whatever the application set
    Subscription.objects.filter(stripe_id=sub_id).update(
        status=map_stripe_status(status)
    )
```

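The webhook issues a bulk `update()` on the queryset, which bypasses the service layer entirely, so nothing checks whether the incoming status is older than the one the application just set. That is your race condition.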
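
One possible guard is to reject events that are older than the last accepted status change. This is a minimal sketch, assuming each event carries a creation timestamp and the record stores when its status last changed. `SubscriptionRecord`, `status_changed_at`, and `apply_status_event` are hypothetical names, not taken from the codebase above.

```python
from dataclasses import dataclass


@dataclass
class SubscriptionRecord:
    status: str
    status_changed_at: int  # Unix seconds of the last accepted status change


def apply_status_event(record: SubscriptionRecord, new_status: str, event_created: int) -> bool:
    """Apply a status change only if the event is newer than the last accepted change."""
    if event_created <= record.status_changed_at:
        return False  # stale or duplicate event: keep the current status
    record.status = new_status
    record.status_changed_at = event_created
    return True


record = SubscriptionRecord(status="cancelled", status_changed_at=200)
stale_applied = apply_status_event(record, "active", event_created=150)
print(stale_applied, record.status)
```

Against a real database, the comparison and the write would need to happen in a single conditional update. Otherwise the guard has a race of its own.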

**Variations:**
- `"where is [field] written to or updated"` -- broader phrasing for less common field names
- `"what can change [object] state after it is created"` -- focuses on post-creation mutations
- `"event handlers that affect [field]"` -- targets async/event-driven mutations specifically

---

### Recipe 3: Data Flow Trace

**When to Use:** You need to understand how data moves from point A to point B -- from user input to database, from API request to response, from one service to another. The data arrives wrong at B, and you need to find where it gets transformed or corrupted along the way.

**The Query:**

```
trace the data flow from [source] to [destination]
```

**What to Look For:** The chain of functions that touch the data between source and destination. Look for transformation steps, validation layers, serialization/deserialization points, and any place where the data type or format changes.

**Example:**

The API returns prices with incorrect currency formatting. The database stores the correct value. Something between the database and the API response is transforming it wrong.

```bash
pyckle search "trace the data flow from price in database to API response"
```

```
Results (5 hits, 10ms):
  src/models/product.py              Product.price (Decimal field)       0.92
  src/repositories/product_repo.py   get_product_by_id()                 0.88
  src/services/catalog_service.py    format_product_for_display()        0.86
  src/serializers/product.py         ProductSerializer.to_dict()         0.83
  src/api/products.py                get_product() endpoint              0.78
```

Five steps in the chain. The model stores price as `Decimal`. The repository fetches it. The catalog service formats it for display. The serializer converts it to a dict. The API endpoint returns it. The formatting step in the catalog service is the likely culprit.

```python
# src/services/catalog_service.py
def format_product_for_display(product):
    return {
        "name": product.name,
        "price": f"${float(product.price):.2f}",  # Decimal -> float -> string
        "currency": product.currency
    }
```

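The `float()` conversion swaps exact decimal arithmetic for binary floating point, so rounding happens on an approximation of the stored price. `Decimal('19.99')` happens to survive the round trip, but `Decimal('2.675')` becomes the float `2.67499999999999982...` and formats as `2.67` instead of `2.68`. The fix is to format directly from the Decimal without the float intermediate. The hardcoded `$` also ignores `product.currency`, which is worth fixing in the same pass.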
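
Here is a minimal sketch of formatting without the float detour. `format_price` is an illustrative name, and `ROUND_HALF_UP` is an assumption about the rounding rule the business wants. The `$` prefix is kept only to mirror the snippet above.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_price(price: Decimal) -> str:
    """Format a Decimal price to two places without passing through float."""
    cents = price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"${cents}"


price = Decimal("2.675")
via_float = f"${float(price):.2f}"  # rounds the binary approximation of 2.675
direct = format_price(price)        # rounds the exact decimal value

print(via_float, direct)
```

The two strings differ by a cent for this input. That kind of discrepancy only shows up for specific values, which is why it slips through spot checks.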

**Variations:**
- `"how does [data type] move through the system"` -- broader, good for initial exploration
- `"what transforms [data] between [layer A] and [layer B]"` -- narrower, targets specific layers
- `"serialization and deserialization of [data type]"` -- focuses on format conversion points

---

### Recipe 4: Validation Chain Mapper

**When to Use:** Bad data is getting through to a layer that should be protected. You need to find every validation step between input and the failure point to identify which one is missing or broken.

**The Query:**

```
what validates [input/data] before [operation]
```

**What to Look For:** Middleware validators, schema checks, type assertions, guard clauses, and pre-condition checks. Pay attention to the order -- validation that runs after a side effect is a bug waiting to happen.

**Example:**

Users can submit negative quantities in orders, which should be impossible. You need the full validation chain.

```bash
pyckle search "what validates order quantity before saving"
```

```
Results (4 hits, 8ms):
  src/api/schemas.py                OrderCreateSchema                    0.93
  src/validators/order_validator.py validate_order_items()               0.90
  src/models/order_item.py          OrderItem constraints                0.84
  src/services/order_service.py     OrderService.create_order()          0.78
```

Four layers of validation. The API schema defines the input contract. The order validator checks business rules. The model has database constraints. The service orchestrates creation. Let us check what each one actually validates.

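```python
# src/api/schemas.py
from typing import List

from pydantic import BaseModel


class OrderItemSchema(BaseModel):
    product_id: int
    quantity: int  # No constraint -- accepts any integer including negatives


# Defined after OrderItemSchema so the annotation below resolves
class OrderCreateSchema(BaseModel):
    items: List[OrderItemSchema]
```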

```python
# src/validators/order_validator.py
def validate_order_items(items):
    for item in items:
        if item.product_id not in get_valid_product_ids():
            raise ValidationError("Invalid product")
    # Checks product validity but not quantity range
```

The schema accepts any integer. The validator checks product IDs but not quantities. The database model might have a `CHECK (quantity > 0)` constraint, but that is the last line of defense and produces a database error rather than a clean validation message. The fix goes in the schema: `quantity: int = Field(gt=0)`.
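
To show the same guard without third-party dependencies, here is a sketch using a standard-library dataclass in place of the Pydantic schema. `OrderItem` and `parse_items` are illustrative names. The point is that the boundary object refuses to exist with an invalid quantity, so later layers never see one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderItem:
    product_id: int
    quantity: int

    def __post_init__(self) -> None:
        # Reject bools explicitly: bool is a subclass of int in Python
        if isinstance(self.quantity, bool) or not isinstance(self.quantity, int):
            raise TypeError("quantity must be an integer")
        if self.quantity <= 0:
            raise ValueError(f"quantity must be greater than 0, got {self.quantity}")


def parse_items(raw_items: list[dict]) -> list[OrderItem]:
    """Build validated items; the first invalid entry raises before anything is saved."""
    return [OrderItem(product_id=r["product_id"], quantity=r["quantity"]) for r in raw_items]


items = parse_items([{"product_id": 7, "quantity": 2}])
print(items)
```

Because the check runs at construction time, it sits ahead of every side effect. That is the ordering this recipe tells you to verify.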

**Variations:**
- `"what checks run between API request and database write for [entity]"` -- maps the full validation pipeline
- `"where is [field] validated or constrained"` -- finds all constraint definitions for a specific field
- `"input validation for [endpoint or operation]"` -- scoped to a specific entry point

---

### Recipe 5: Error Boundary Mapper

**When to Use:** An error is propagating further than it should -- crashing the application instead of being caught, or being swallowed silently instead of surfacing. You need to see the error handling landscape between the throw site and the user.

**The Query:**

```
how do errors propagate between [source component] and [destination component]
```

**What to Look For:** Try/catch blocks, error middleware, exception handlers, and fallback logic. Look for gaps -- code paths where no handler exists between the error source and the user-facing layer.

**Example:**

A database connection timeout is causing the entire API to return 500 errors instead of a graceful "service unavailable" message. You need the error handling chain between the database layer and the API response.

```bash
pyckle search "how do errors propagate between database layer and API response"
```

```
Results (5 hits, 10ms):
  src/middleware/error_handler.py    global_error_handler()              0.92
  src/db/session.py                 get_db() context manager             0.89
  src/services/base_service.py      BaseService exception handling       0.85
  src/api/middleware.py             request_exception_middleware()        0.82
  src/config/error_codes.py        ERROR_CODE_MAP                       0.75
```

The global error handler catches generic exceptions and maps them to HTTP responses. The database session context manager handles connection lifecycle. The base service has its own exception handling. The request middleware adds another layer. But look at the actual implementation:

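```python
# src/db/session.py
from contextlib import contextmanager


@contextmanager
def get_db():
    session = SessionLocal()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise  # Re-raises everything -- no specific handling for connection errors
    finally:
        session.close()  # Release the connection whether or not the block succeeded
```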

```python
# src/middleware/error_handler.py
def global_error_handler(request, exc):
    if isinstance(exc, HTTPException):
        return JSONResponse(status_code=exc.status_code, content=exc.detail)
    # Falls through to generic 500 for anything else
    return JSONResponse(status_code=500, content={"error": "Internal server error"})
```

The database session re-raises all exceptions without distinguishing connection errors from query errors. The global handler catches `HTTPException` subclasses but treats everything else as a 500. A `ConnectionError` or `OperationalError` from the database falls straight through to the generic 500 handler. The fix: add a specific catch for database connection errors in the session manager or the error middleware, and return a 503.
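
The mapping this fix needs can be sketched without a web framework. `HTTPError` and `DatabaseUnavailableError` below are hypothetical stand-ins for the framework's `HTTPException` and for whatever the database driver raises on connection failure. `ConnectionError` and `TimeoutError` are Python built-ins.

```python
class HTTPError(Exception):
    """Hypothetical stand-in for the framework's HTTPException."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


class DatabaseUnavailableError(Exception):
    """Hypothetical error the session layer would raise for connection failures."""


def to_response(exc: Exception) -> tuple[int, dict]:
    """Map an exception to (status code, JSON body), most specific case first."""
    if isinstance(exc, HTTPError):
        return exc.status_code, {"error": exc.detail}
    if isinstance(exc, (DatabaseUnavailableError, ConnectionError, TimeoutError)):
        return 503, {"error": "Service temporarily unavailable"}
    return 500, {"error": "Internal server error"}


print(to_response(ConnectionError("database unreachable")))
```

The order of the checks matters. The generic 500 branch stays last so it only catches what nothing more specific claimed.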

**Variations:**
- `"what error handling exists between [component A] and [component B]"` -- direct mapping
- `"where are exceptions caught and re-raised in [module]"` -- finds catch-and-rethrow patterns
- `"unhandled exception paths in [service or layer]"` -- looks for gaps in error handling

---

### Recipe 6: Configuration Impact Finder

**When to Use:** A behavior changed and you suspect a configuration value is responsible. Or you are about to change a config value and want to know what it affects.

**The Query:**

```
what configuration affects [behavior or feature name]
```

**What to Look For:** Config files, environment variable reads, feature flags, constants, and runtime settings. Pay attention to defaults -- when a config value is missing, the default determines behavior, and defaults are often defined far from where the config is consumed.

**Example:**

The application's session timeout changed from 30 minutes to 24 hours after a recent deploy. Nobody changed the session code. Something in the configuration shifted.

```bash
pyckle search "what configuration affects session timeout"
```

```
Results (5 hits, 9ms):
  src/config/auth.py                SESSION_TIMEOUT_SECONDS              0.94
  src/middleware/session.py         SessionMiddleware timeout param       0.90
  src/config/defaults.py            DEFAULT_SESSION_TIMEOUT              0.86
  docker/docker-compose.yml         SESSION_TIMEOUT env var              0.79
  src/auth/token_manager.py         token_expiry calculation             0.74
```

The config file defines `SESSION_TIMEOUT_SECONDS`. The middleware consumes it. There is a default in `defaults.py`. Docker Compose sets it as an environment variable. The token manager also has a related expiry calculation.

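```python
# src/config/auth.py
import os

# Import path assumed from the file layout shown in the search results
from src.config.defaults import DEFAULT_SESSION_TIMEOUT

SESSION_TIMEOUT_SECONDS = int(os.getenv("SESSION_TIMEOUT", DEFAULT_SESSION_TIMEOUT))

# src/config/defaults.py
DEFAULT_SESSION_TIMEOUT = 86400  # 24 hours in seconds
```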

```yaml
# docker/docker-compose.yml (production)
environment:
  - SESSION_TIMEOUT=1800  # 30 minutes

# docker/docker-compose.dev.yml (development -- recently merged to prod accidentally)
environment:
  # SESSION_TIMEOUT intentionally not set -- uses default
```

The dev compose file does not set `SESSION_TIMEOUT`, so it falls back to `DEFAULT_SESSION_TIMEOUT` of 86400 (24 hours). If someone merged the dev compose config into production or removed the env var from the production deployment, the default takes over. The search surfaced both the config chain and the default value, which is the actual root cause.
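
A silent fallback like this is easier to catch when the code reports where the value came from. The sketch below is one way to do that. `resolve_session_timeout` is a hypothetical helper, and the default is copied from the `defaults.py` shown above.

```python
import os
from typing import Mapping

DEFAULT_SESSION_TIMEOUT = 86400  # 24 hours in seconds, as in defaults.py


def resolve_session_timeout(env: Mapping[str, str] = os.environ) -> tuple[int, str]:
    """Return (timeout in seconds, source), where source is 'env' or 'default'."""
    raw = env.get("SESSION_TIMEOUT")
    if raw is None or raw.strip() == "":
        return DEFAULT_SESSION_TIMEOUT, "default"
    return int(raw), "env"


timeout, source = resolve_session_timeout({"SESSION_TIMEOUT": "1800"})
print(f"session timeout: {timeout}s (from {source})")
```

If you log the source at startup, a deploy that drops the environment variable shows up as `from default` in the logs instead of as a surprise a week later.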

**Variations:**
- `"where is [config value] defined and consumed"` -- traces a specific config key
- `"what environment variables affect [feature]"` -- focuses on env vars specifically
- `"default values for [feature] configuration"` -- finds defaults that silently control behavior

---

### Recipe 7: Race Condition Detector

**When to Use:** A bug appears intermittently -- sometimes the code works, sometimes it does not. You suspect concurrent access to shared state. You need to find the contention points.

**The Query:**

```
concurrent access to [shared resource] and how it is synchronized
```

**What to Look For:** Lock acquisitions, mutex usage, atomic operations, database transactions with isolation levels, and queue-based processing. Also look for the *absence* of synchronization -- shared state accessed without any protection is often the bug.

**Example:**

An e-commerce site occasionally oversells inventory -- two users buy the last item and both succeed. Classic race condition territory.

```bash
pyckle search "concurrent access to inventory count and how it is synchronized"
```

```
Results (4 hits, 8ms):
  src/services/inventory_service.py   InventoryService.reserve_stock()    0.93
  src/models/product.py               Product.stock_count field           0.88
  src/workers/order_processor.py      decrement_stock()                   0.84
  src/api/cart.py                     add_to_cart() stock check           0.77
```

```python
# src/services/inventory_service.py
class InventoryService:
    def reserve_stock(self, product_id: int, quantity: int) -> bool:
        product = Product.objects.get(id=product_id)
        if product.stock_count >= quantity:  # Read
            product.stock_count -= quantity  # Modify
            product.save()                   # Write
            return True
        return False
```

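Read-modify-write without a lock or atomic operation. Two concurrent requests can both read `stock_count = 1`, both pass the check, both decrement, and both save. The fix is either a database-level atomic update that folds the check into the write (`UPDATE ... SET stock_count = stock_count - :quantity WHERE id = :product_id AND stock_count >= :quantity`) or a lock held across the whole read-modify-write. An application-level lock only helps if every writer runs in the same process.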
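
The atomic-update option can be sketched with the standard-library `sqlite3` module. The `products` table and this standalone `reserve_stock` function are illustrative stand-ins for the ORM model above. The check and the decrement happen in one statement, and the row count tells you whether the reservation succeeded.

```python
import sqlite3


def reserve_stock(conn: sqlite3.Connection, product_id: int, quantity: int) -> bool:
    """Decrement stock only if enough remains; check and write are one statement."""
    cur = conn.execute(
        "UPDATE products SET stock_count = stock_count - ? "
        "WHERE id = ? AND stock_count >= ?",
        (quantity, product_id, quantity),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means insufficient stock


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, stock_count INTEGER NOT NULL)")
conn.execute("INSERT INTO products (id, stock_count) VALUES (1, 1)")
conn.commit()

first = reserve_stock(conn, 1, 1)   # takes the last unit
second = reserve_stock(conn, 1, 1)  # finds nothing left to take
remaining = conn.execute("SELECT stock_count FROM products WHERE id = 1").fetchone()[0]
print(first, second, remaining)
```

In a Django codebase, the same idea is usually written as a filtered queryset `update()` with an `F("stock_count")` expression, which returns the number of rows changed.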

**Variations:**
- `"what shared state can be accessed by multiple threads or requests"` -- broader scan for contention
- `"database transactions and isolation levels for [operation]"` -- focuses on DB-level concurrency
- `"locking and synchronization in [module or service]"` -- finds existing sync mechanisms

---

### Recipe 8: Dependency Side Effect Finder

**When to Use:** You are updating a dependency or changing a function's behavior and need to know what downstream effects to expect: what else will break or change when you modify this piece?

**The Query:**

```
what depends on [function/module/class] and what side effects does it trigger
```

**What to Look For:** Direct callers, indirect callers (through interfaces or dependency injection), event emissions, callback registrations, and any code that relies on the return value or side effects of the target.

**Example:**

You are about to change the return type of `UserService.get_user()` from a dict to a User object. You need the blast radius.

```bash
pyckle search "what depends on UserService.get_user and what does it return"
```

```
Results (6 hits, 11ms):
  src/services/user_service.py       UserService.get_user()              0.95
  src/api/users.py                   get_user_endpoint()                 0.91
  src/services/order_service.py      create_order() user lookup          0.87
  src/services/notification.py       send_notification() user fetch      0.84
  src/middleware/auth.py             load_user_from_token()              0.80
  src/workers/report_generator.py   generate_user_report()               0.75
```

Six hits: the definition itself plus five consumers. The API endpoint, the order service, the notification service, the auth middleware, and a report generator all call `get_user()`. Every one of them accesses the return value with dict syntax like `user['email']`. Changing to a User object means every consumer needs to switch to `user.email`.

```python
# src/api/users.py
def get_user_endpoint(user_id: int):
    user = user_service.get_user(user_id)
    return {"name": user["name"], "email": user["email"]}  # Dict access

# src/middleware/auth.py
def load_user_from_token(token):
    payload = decode_token(token)
    user = user_service.get_user(payload["user_id"])
    request.user_role = user["role"]  # Dict access
```

The blast radius is six files: the definition plus five callers. If you miss any caller, you get a `TypeError: 'User' object is not subscriptable` at runtime. The search found all six in one query.
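
If you cannot migrate every caller in one PR, a transitional shim can keep dict-style access working while flagging each remaining use. This is a sketch under assumptions: the `User` fields below are guessed from the snippets above, and the `__getitem__` bridge is a suggestion, not something the codebase is known to have.

```python
import warnings
from dataclasses import dataclass, fields


@dataclass
class User:
    name: str
    email: str
    role: str

    def __getitem__(self, key: str):
        """Temporary dict-style access so unmigrated callers keep working."""
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        warnings.warn(
            f"user[{key!r}] is deprecated; use user.{key}",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(self, key)


user = User(name="Ada", email="ada@example.com", role="admin")
print(user.email)
```

Run the test suite with deprecation warnings turned into errors to get a list of the call sites that still use dict access. Remove the shim once that list is empty.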

**Variations:**
- `"all callers of [function] across the codebase"` -- pure call graph
- `"what breaks if [function] changes its return type"` -- focuses on return value consumers
- `"side effects triggered by [function] including events and callbacks"` -- targets indirect effects

---

### Recipe 9: Regression Scope Identifier

**When to Use:** A feature that used to work is now broken. You need to find what changed in the area that could have caused the regression -- not a git diff of everything, but a semantic search for changes related to the broken behavior.

**The Query:**

```
recent changes to [feature area] and what they affect
```

**What to Look For:** Modified logic, new conditional branches, changed defaults, updated dependencies, and altered data flows in the area surrounding the broken feature. Cross-reference with `git log` or `git blame` to see what changed recently.

**Example:**

Email notifications stopped working. They were fine last week. You need to find what changed in the notification system.

```bash
pyckle search "email notification sending logic and configuration"
```

```
Results (5 hits, 10ms):
  src/services/email_service.py      EmailService.send()                 0.94
  src/config/email.py                SMTP configuration                  0.90
  src/templates/email/               Email template directory             0.85
  src/workers/notification_worker.py process_email_queue()                0.82
  src/services/notification.py       NotificationService.notify()         0.77
```

Now combine with the git history of the top results:

```bash
git log --oneline --since="2 weeks ago" -- src/services/email_service.py src/config/email.py
```

```
a3f21c4 refactor: extract SMTP config to environment variables
```

One commit. The SMTP config was moved from a hardcoded config file to environment variables. If the deployment did not set those environment variables, `EmailService.send()` is using empty strings for the SMTP host and port, and failing silently.

**Variations:**
- `"what in [feature area] has changed or been refactored recently"` -- broader scope
- `"configuration and environment for [feature]"` -- targets the config layer specifically
- `"error handling and logging in [feature area]"` -- finds where failures might be swallowed

---

### Exercise

> **Try This**
>
> Pick a bug you fixed in the last month. Reconstruct the debugging session using the recipes from this chapter:
>
> 1. Start with Recipe 1 (Error Origin Trace) or Recipe 3 (Data Flow Trace) depending on whether the bug was an error or a wrong value
> 2. Use Recipe 4 (Validation Chain Mapper) to find where validation should have caught the issue
> 3. Use Recipe 8 (Dependency Side Effect Finder) to check if your fix has a blast radius
>
> Compare the search results to how you actually found the bug. How many files would the recipes have surfaced that you had to find manually?

---

### Key Takeaways

- Debugging queries work best when they describe intent, not identifiers -- "where does this error originate" surfaces the raise site and its call paths, while grep for the exception class name returns every mention with no ranking
- State mutation bugs require finding *every* writer, not just the obvious one -- webhooks, workers, and admin tools are common hidden mutators
- Data flow traces reveal transformation steps where corruption happens -- the intermediate conversion you did not know existed
- Validation chain mapping exposes gaps -- the layer that checks product IDs but not quantities
- Always check for race conditions when bugs are intermittent -- read-modify-write without synchronization is the classic pattern

---

## Chapter 2: Code Review Queries

### Chapter Overview

A diff shows you what changed. It does not show you what matters. These seven recipes build the context you need to review a PR thoroughly -- callers, tests, schemas, patterns, and the blast radius of every change. They expand on the four-query workflow from Episode 15 of *Code Search, Decoded*.

---

### Recipe 10: Blast Radius Scanner

**When to Use:** A PR modifies a function, class, or module. You need to know how many other parts of the system depend on the changed code before you can assess the risk.

**The Query:**

```
what calls [changed function] and depends on its behavior
```

**What to Look For:** Direct callers, indirect consumers through interfaces, and anything that relies on the function's return value, side effects, or contract. Count the callers -- a function with 20 callers needs a more careful review than one with 2.

**Example:**

A PR changes the signature of `format_address()` to accept an optional `locale` parameter. Looks harmless. But how many things call it?

```bash
pyckle search "what calls format_address and depends on its behavior"
```

```
Results (7 hits, 12ms):
  src/services/shipping.py          calculate_shipping_label()           0.93
  src/services/invoice.py           generate_invoice()                   0.90
  src/api/checkout.py               checkout_endpoint()                  0.87
  src/templates/email/order.py      render_order_confirmation()          0.84
  src/reports/address_report.py     export_addresses()                   0.80
  src/admin/customer_view.py        display_customer()                   0.76
  tests/test_address.py             test_format_address_*                0.72
```

Seven hits: six production callers plus the test suite. Adding an optional parameter is backwards-compatible for callers that do not pass it. But if the new `locale` parameter changes the default formatting behavior -- say, from US-style addresses to locale-aware formatting -- then all six callers might produce different output than before, even without code changes. That is a subtle regression risk that the diff alone would not reveal.
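
What you want to see in the PR is a default that preserves the old output. Here is a minimal sketch of that shape. The address fields, the locale codes, and the German-style branch are all hypothetical, since the real `format_address()` is not shown.

```python
from typing import Optional


def format_address(address: dict, locale: Optional[str] = None) -> str:
    """Format an address; with no locale, output matches the pre-change format."""
    street, city = address["street"], address["city"]
    region, postal = address["region"], address["postal_code"]
    if locale is None or locale == "en_US":
        # Legacy behaviour: existing callers that pass no locale see no change
        return f"{street}, {city}, {region} {postal}"
    if locale == "de_DE":
        # Hypothetical locale-aware branch: postal code precedes the city
        return f"{street}, {postal} {city}"
    raise ValueError(f"unsupported locale: {locale}")


address = {"street": "1 Main St", "city": "Springfield", "region": "IL", "postal_code": "62701"}
print(format_address(address))
print(format_address(address, locale="de_DE"))
```

A characterization test that pins the no-locale output is a reasonable thing to ask for in review. It turns the "existing callers are unaffected" claim into something the CI run checks.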

**Variations:**
- `"all callers of [function] including indirect references"` -- catches functions passed as callbacks or stored in data structures
- `"what would break if [function] signature changes"` -- focuses on contract violations
- `"downstream effects of changing [function] return value"` -- targets return value consumers

---

### Recipe 11: Test Coverage Checker

**When to Use:** A PR changes logic. You need to know whether existing tests cover the changed behavior, and whether the PR adds tests for the new behavior.

**The Query:**

```
tests that cover [changed function or feature]
```

**What to Look For:** Unit tests that call the changed function directly, integration tests that exercise the code path, and edge-case tests. If the PR changes a conditional, look for tests that exercise both branches.

**Example:**

A PR adds retry logic to `PaymentGateway.charge()`. The retry wraps the existing charge call with exponential backoff. You need to know what tests exist.

```bash
pyckle search "tests that cover PaymentGateway charge and retry behavior"
```

```
Results (4 hits, 9ms):
  tests/test_payment_gateway.py      test_charge_success()               0.91
  tests/test_payment_gateway.py      test_charge_failure()               0.87
  tests/integration/test_checkout.py test_full_checkout_flow()            0.81
  tests/test_payment_gateway.py      test_charge_timeout()               0.75
```

Three unit tests and one integration test. There is a timeout test, but no retry test. The PR adds retry logic but no tests for: retry succeeding on second attempt, retry exhausting all attempts, retry with non-retryable errors, and retry backoff timing. That is your review comment: "This needs tests for retry success, retry exhaustion, and non-retryable error handling."
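
To make that review comment concrete, here is a sketch of a retry helper and the three tests you are asking for. `charge_with_retry`, `RetryableError`, and the injectable `sleep` parameter are illustrative. The real `PaymentGateway.charge()` retry code is not shown, so treat this as the shape of the tests, not their exact content.

```python
import time
from typing import Callable


class RetryableError(Exception):
    """Hypothetical marker for transient failures such as timeouts."""


def charge_with_retry(
    charge: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `charge`, retrying RetryableError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return charge()
        except RetryableError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def test_succeeds_on_second_attempt():
    calls, delays = [], []

    def flaky():
        calls.append(1)
        if len(calls) == 1:
            raise RetryableError("timeout")
        return "charged"

    assert charge_with_retry(flaky, sleep=delays.append) == "charged"
    assert delays == [0.5]  # one backoff before the successful attempt


def test_exhausts_attempts():
    delays = []

    def always_fails():
        raise RetryableError("timeout")

    try:
        charge_with_retry(always_fails, max_attempts=3, sleep=delays.append)
    except RetryableError:
        pass
    else:
        raise AssertionError("expected RetryableError")
    assert delays == [0.5, 1.0]  # backoff doubles; no sleep after the final failure


def test_non_retryable_error_is_not_retried():
    calls = []

    def declined():
        calls.append(1)
        raise ValueError("card declined")

    try:
        charge_with_retry(declined, sleep=lambda seconds: None)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    assert len(calls) == 1
```

Injecting `sleep` lets the tests assert the backoff schedule without actually waiting.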

**Variations:**
- `"edge case tests for [function]"` -- finds tests that exercise boundary conditions
- `"what test coverage exists for [module]"` -- broader scope for module-level changes
- `"integration tests that exercise [feature]"` -- focuses on end-to-end coverage

---

### Recipe 12: Pattern Consistency Checker

**When to Use:** A PR introduces a new implementation of something that already exists elsewhere in the codebase. You want to verify it follows the established pattern.

**The Query:**

```
find similar patterns to [function or implementation] in the codebase
```

**What to Look For:** Other implementations of the same interface, similar function signatures, parallel service structures, and established conventions. If the new code deviates from the pattern, it should be intentional and commented.

**Example:**

A PR adds a new `NotificationService` class. The codebase already has `EmailService`, `SMSService`, and `PushService`. You want to check if the new one follows the same pattern.

```bash
pyckle search "find similar patterns to NotificationService in service implementations"
```

```
Results (4 hits, 10ms):
  src/services/email_service.py      EmailService                        0.92
  src/services/sms_service.py        SMSService                          0.89
  src/services/push_service.py       PushService                         0.86
  src/services/base_service.py       BaseService interface               0.83
```

All three existing services inherit from `BaseService` and implement `send()`, `validate_recipient()`, and `get_delivery_status()`. If the new `NotificationService` does not inherit from `BaseService` or is missing one of these methods, that is a pattern violation. Check the PR diff against the pattern:

```python
# Established pattern (all three existing services)
class EmailService(BaseService):
    def send(self, recipient, content): ...
    def validate_recipient(self, recipient): ...
    def get_delivery_status(self, message_id): ...

# PR introduces
class NotificationService:  # Missing BaseService inheritance
    def send_notification(self, user, message): ...  # Different method name
    # Missing validate_recipient and get_delivery_status
```

Three deviations. Your review comment: "The existing services follow a BaseService pattern with consistent method names. This should either follow that pattern or document why it diverges."
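
For comparison, a version that follows the pattern could be sketched as below. The `BaseService` shown here is an assumption reconstructed from the three method names in the search results, and the in-memory `NotificationService` body is purely illustrative.

```python
from abc import ABC, abstractmethod


class BaseService(ABC):
    """Assumed shape of the shared interface; the real BaseService may differ."""

    @abstractmethod
    def send(self, recipient, content): ...

    @abstractmethod
    def validate_recipient(self, recipient): ...

    @abstractmethod
    def get_delivery_status(self, message_id): ...


class NotificationService(BaseService):
    """Hypothetical in-memory implementation that follows the pattern."""

    def __init__(self):
        self._statuses = {}

    def validate_recipient(self, recipient):
        return bool(recipient)

    def send(self, recipient, content):
        if not self.validate_recipient(recipient):
            raise ValueError("invalid recipient")
        message_id = f"msg-{len(self._statuses) + 1}"
        self._statuses[message_id] = "sent"
        return message_id

    def get_delivery_status(self, message_id):
        return self._statuses.get(message_id, "unknown")
```

If `BaseService` really is an abstract base class, a subclass that omits one of the three methods fails at instantiation, which turns the pattern into something the interpreter enforces rather than something reviewers have to remember.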

**Variations:**
- `"how other services in the codebase implement [interface or pattern]"` -- finds the convention to compare against
- `"where else is this pattern used: [description]"` -- searches by behavior, not name
- `"naming conventions for [type of thing] in this codebase"` -- catches naming inconsistencies

---

### Recipe 13: Schema and Contract Checker

**When to Use:** A PR changes a data model, API endpoint, or internal interface. You need to find everything that depends on the current schema or contract.

**The Query:**

```
what depends on the schema or contract of [entity]
```

**What to Look For:** Serializers, API schemas, database migrations, validators, client code, documentation, and any hardcoded references to field names or types. Schema changes are high-risk because they affect multiple layers simultaneously.

**Example:**

A PR renames the `user.email_address` field to `user.email` in the User model. Clean refactor. But what else references the old name?

```bash
pyckle search "what depends on user email_address field schema"
```

```
Results (6 hits, 11ms):
  src/models/user.py                 User.email_address field            0.94
  src/api/schemas.py                 UserSchema serialization            0.91
  src/services/email_service.py      user.email_address access           0.88
  src/templates/email/welcome.html   {{ user.email_address }}            0.83
  src/exports/user_csv.py            CSV column: email_address           0.78
  docs/api/users.md                  API docs field reference            0.73
```

Six references to the field name: the model definition itself plus five dependents. The PR renames the model field. If it does not also update the API schema, the email service, the email template, the CSV export, and the API documentation, you have a broken system. Each one is a review comment.
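
Once the semantic search has told you which layers are involved, a literal scan is a cheap way to confirm nothing was left behind after the rename. The helper below is a minimal sketch, not part of Pyckle; `find_references` and its default suffix list are assumptions you would adapt to your repository.

```python
from pathlib import Path


def find_references(root, needle, suffixes=(".py", ".html", ".md", ".yaml")):
    """Return (path, line_number, line) for every line under root containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and file types we do not expect to hold field names.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), line_number, line.strip()))
    return hits
```

Running `find_references("src", "email_address")` after the PR is applied should come back empty. Any remaining hit, especially in a template or export, is a reference the rename missed.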

**Variations:**
- `"all references to [field name] across code and templates"` -- catches template references that IDE refactoring misses
- `"API contract for [endpoint] including request and response fields"` -- maps the full external contract
- `"database migrations related to [model or table]"` -- finds migration dependencies

---

### Recipe 14: Auth and Permission Layer Check

**When to Use:** A PR adds a new endpoint or modifies access control. You need to verify that authentication and authorization are correctly applied.

**The Query:**

```
how is authentication and authorization applied to [endpoint or resource]
```

**What to Look For:** Auth middleware, permission decorators, role checks, and the absence of these -- an endpoint without auth middleware is either intentionally public or a security gap.

**Example:**

A PR adds a new admin endpoint `/api/admin/export-users`. You need to verify it has proper access control.

```bash
pyckle search "how is authentication and authorization applied to admin endpoints"
```

```
Results (5 hits, 9ms):
  src/middleware/auth.py             require_auth() decorator            0.93
  src/middleware/roles.py            require_role('admin') decorator     0.90
  src/api/admin/users.py            existing admin endpoints             0.87
  src/config/permissions.py          ADMIN_PERMISSIONS config            0.82
  tests/test_admin_auth.py           admin auth test patterns            0.77
```

The existing admin endpoints use two decorators: `@require_auth` and `@require_role('admin')`. The test file shows the pattern for testing admin auth. Check the PR's new endpoint:

```python
# PR adds:
@router.get("/api/admin/export-users")
@require_auth
async def export_users():  # Missing @require_role('admin')
    return await user_service.export_all()
```

Missing the role check. Any authenticated user can export all user data. Your review comment: "This needs `@require_role('admin')` to match the other admin endpoints."
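
A gap like this can also be caught mechanically. The sketch below assumes the decorators leave a marker on the handler, which may not be how the real `require_auth` and `require_role` work; the stand-in decorators, the `ROUTES` table, and `unprotected_admin_routes` are all hypothetical.

```python
import functools


def require_auth(func):
    """Hypothetical stand-in: marks the handler as requiring authentication."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    wrapper.requires_auth = True
    return wrapper


def require_role(role):
    """Hypothetical stand-in: records the roles a handler requires."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        wrapper.required_roles = getattr(func, "required_roles", frozenset()) | {role}
        return wrapper
    return decorator


def unprotected_admin_routes(routes):
    """Return paths under /api/admin/ whose handler lacks auth or the admin role."""
    missing = []
    for path, handler in routes.items():
        if not path.startswith("/api/admin/"):
            continue
        has_auth = getattr(handler, "requires_auth", False)
        has_admin = "admin" in getattr(handler, "required_roles", ())
        if not (has_auth and has_admin):
            missing.append(path)
    return missing


@require_auth
@require_role("admin")
def list_users():
    return []


@require_auth
def export_users():  # mirrors the PR: role check missing
    return []


# Hypothetical route table; a real check would read the framework's router.
ROUTES = {
    "/api/admin/users": list_users,
    "/api/admin/export-users": export_users,
}
```

With this table, `unprotected_admin_routes(ROUTES)` flags only the export endpoint. Wired into the test suite, a check of this kind fails the build before the reviewer has to spot the missing decorator.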

**Variations:**
- `"what auth decorators or middleware protect [endpoint pattern]"` -- finds the security layer
- `"public endpoints that do not require authentication"` -- scans for intentionally unprotected routes
- `"permission model for [resource type]"` -- maps the full authorization structure

---

### Recipe 15: Related Change Finder

**When to Use:** A PR makes a change in one layer. You need to check whether corresponding changes are needed in other layers -- config, docs, tests, migrations, or related services.

**The Query:**

```
what else needs to change when [description of change]
```

**What to Look For:** Configuration files, environment variables, documentation, migration scripts, seed data, enum definitions, serializers, and any "satellite files" that must stay in sync with the code being changed.

**Example:**

A PR adds a new order status `"refunded"` to the order processing flow. The code change is in `OrderService`. What else needs updating?

```bash
pyckle search "what else needs to change when adding a new order status"
```

```
Results (6 hits, 11ms):
  src/models/order.py                OrderStatus enum                    0.93
  src/api/schemas.py                 OrderStatusSchema                   0.89
  src/services/order_service.py      status transition logic             0.86
  src/templates/admin/orders.html    status display mapping              0.82
  src/reports/order_report.py        status aggregation queries          0.78
  docs/api/orders.md                 API status documentation            0.73
```

Six places that are aware of order statuses. The enum, the API schema, the service logic, the admin template, the reporting queries, and the documentation. If the PR only adds "refunded" to the service logic, the enum will reject it, the API schema will not serialize it, the admin template will not display it, the reports will not aggregate it, and the docs will not mention it. Each missing update is a review comment.
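
One way to keep satellite files from drifting is a small consistency check against the enum. The sketch below is illustrative only: the `SHIPPED` member and the `STATUS_LABELS` mapping are assumptions, standing in for the real enum members and the admin template's display mapping.

```python
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"    # assumed existing status
    REFUNDED = "refunded"  # the new status from the PR


# Hypothetical satellite mapping, e.g. labels used by the admin template.
STATUS_LABELS = {
    OrderStatus.PENDING: "Pending",
    OrderStatus.SHIPPED: "Shipped",
}


def missing_statuses(mapping):
    """Return enum members that the given satellite mapping does not cover."""
    return [status for status in OrderStatus if status not in mapping]
```

Here `missing_statuses(STATUS_LABELS)` reports `OrderStatus.REFUNDED`, which is exactly the kind of omission this recipe is meant to surface. The same function can be pointed at the report aggregation map or any other per-status lookup.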

**Variations:**
- `"all places that reference [enum or constant]"` -- finds every consumer of a value set
- `"documentation that describes [feature]"` -- catches stale docs
- `"migration required for [schema change]"` -- identifies missing database migrations

---

### Recipe 16: Rollback Risk Assessor

**When to Use:** You are reviewing a PR that includes a database migration or data transformation. You need to assess whether the change can be safely rolled back if something goes wrong in production.

**The Query:**

```
how is [migration or data change] reversible and what data could be lost
```

**What to Look For:** Down migrations, data backup logic, destructive operations (column drops, data deletions), and one-way transformations (hashing, truncation). If the migration drops a column, rolling back will not restore the data.

**Example:**

A PR includes a migration that merges `first_name` and `last_name` into a single `full_name` column and drops the originals.

```bash
pyckle search "how are database column drops handled and reversed in migrations"
```

```
Results (4 hits, 8ms):
  src/migrations/base.py             Migration base class                0.91
  src/migrations/0042_merge_names.py the PR's migration                  0.88
  src/db/backup.py                   pre_migration_backup()              0.83
  docs/migrations.md                 Migration rollback policy           0.77
```

```python
# src/migrations/0042_merge_names.py
from alembic import op
import sqlalchemy as sa

def upgrade():
    op.add_column('users', sa.Column('full_name', sa.String))
    op.execute("UPDATE users SET full_name = first_name || ' ' || last_name")
    op.drop_column('users', 'first_name')
    op.drop_column('users', 'last_name')

def downgrade():
    op.add_column('users', sa.Column('first_name', sa.String))
    op.add_column('users', sa.Column('last_name', sa.String))
    # Cannot reliably split full_name back into first and last
    op.execute("UPDATE users SET first_name = full_name, last_name = ''")
    op.drop_column('users', 'full_name')
```

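The downgrade is lossy. It copies the entire `full_name` into `first_name` and sets `last_name` to an empty string for every user, so the original split is gone once the upgrade has dropped the columns. Your review comment: "This migration is not safely reversible. The downgrade cannot restore the original first and last names. Either keep the original columns until the next release confirms the merge is stable, or add a pre-migration backup step."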
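
To see why no smarter downgrade can fix this, it helps to model the merge and a best-effort split in plain Python. The helper names below are hypothetical, and splitting on the last space is just one plausible heuristic, not something the PR contains.

```python
def merge_names(first_name, last_name):
    """Mirror of the upgrade step: concatenate with a single space."""
    return f"{first_name} {last_name}"


def split_on_last_space(full_name):
    """A best-effort downgrade heuristic: treat the final word as the last name."""
    first, _, last = full_name.rpartition(" ")
    return first, last


def round_trips(first_name, last_name):
    """True if merging and then splitting recovers the original pair."""
    merged = merge_names(first_name, last_name)
    return split_on_last_space(merged) == (first_name, last_name)
```

`round_trips("Ada", "Lovelace")` holds, but `round_trips("Ludwig", "van Beethoven")` does not, because the merged string no longer records where the first name ended. Any heuristic has a counterexample of this kind, which is why the safe options are keeping the original columns or taking a backup.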

**Variations:**
- `"destructive operations in [migration file]"` -- finds DROP, DELETE, TRUNCATE operations
- `"data backup before [migration or transformation]"` -- checks for safety nets
- `"one-way data transformations in [module]"` -- finds irreversible changes

---

### Exercise

> **Try This**
>
> Pick an open PR in your project (or the last PR you reviewed). Run these four queries against the changed function or feature:
>
> 1. Recipe 10 (Blast Radius Scanner) -- how many callers does the changed code have?
> 2. Recipe 11 (Test Coverage Checker) -- are the changes tested?
> 3. Recipe 13 (Schema and Contract Checker) -- does anything else depend on the changed schema?
> 4. Recipe 15 (Related Change Finder) -- what satellite files need updating?
>
> Compare the results to the actual review comments (yours or others'). Did the recipes catch something that was missed?

---

### Key Takeaways

- The blast radius of a change determines the risk level of a review -- a function with 20 callers needs more scrutiny than one with 2
- Test coverage checks should happen before you read the diff -- missing tests are the most common review gap
- Pattern consistency matters more than individual style -- new code should follow established conventions unless it has a documented reason to diverge
- Schema changes have the widest blast radius -- they affect models, serializers, templates, exports, and documentation simultaneously
- Always check for missing satellite updates -- the enum, the docs, the config, the migration

---

## Chapter 3: Onboarding Queries

### Chapter Overview

A new developer's first two weeks are dominated by navigation questions. "How does auth work?" "Where is the entry point?" "What does the deployment pipeline do?" These seven recipes answer the questions every new hire asks, directly from the codebase. They expand on the self-service learning approach from Episode 17 of *Code Search, Decoded*.

---

### Recipe 17: System Entry Point Finder

**When to Use:** You have just cloned a new codebase and want to understand where execution starts. Where is `main()`? Where do HTTP requests enter? What is the bootstrapping sequence?

**The Query:**

```
where is the entry point and how does the application start
```

**What to Look For:** Main functions, application factory functions, server startup code, and bootstrapping sequences. Look for the order of initialization -- what gets set up first determines the dependency structure.

**Example:**

You just joined a team and cloned a Python FastAPI application. You need to know where it starts.

```bash
pyckle search "where is the entry point and how does the application start"
```

```
Results (5 hits, 8ms):
  src/main.py                        create_app() factory                0.94
  src/config/settings.py             Settings initialization             0.89
  src/db/init.py                     init_database()                     0.85
  src/middleware/setup.py            register_middleware()                0.81
  src/routes/__init__.py             register_routes()                   0.77
```

The application starts in `main.py` with a `create_app()` factory. It loads settings first, initializes the database, registers middleware, then registers routes. That is the bootstrapping sequence in five search results.

```python
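# src/main.py
from fastapi import FastAPI

# Settings, init_database, register_middleware, and register_routes are
# imported from the project modules listed in the search results above.
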
def create_app() -> FastAPI:
    settings = Settings()
    app = FastAPI(title=settings.app_name)
    init_database(settings.database_url)
    register_middleware(app, settings)
    register_routes(app)
    return app
```

Reading this one function tells you the initialization order and where each subsystem is configured. From here, you can drill into any of the four setup steps (`Settings`, `init_database`, `register_middleware`, `register_routes`) to understand that layer.

**Variations:**
- `"application bootstrapping and initialization sequence"` -- focuses on startup order
- `"where do HTTP requests first enter the application"` -- finds the request entry point specifically
- `"what runs on application startup before serving requests"` -- finds setup code, migrations, cache warming

---

### Recipe 18: Authentication Flow Explorer

**When to Use:** You need to understand how the application handles login, sessions, tokens, and permissions. This is one of the most common questions in the first week.

**The Query:**

```
how does user authentication work from login to session
```

**What to Look For:** Login endpoints, token generation, session management, middleware that validates tokens on every request, and the user model. Map the full flow from "user submits credentials" to "user is authenticated on subsequent requests."

**Example:**

```bash
pyckle search "how does user authentication work from login to session"
```

```
Results (6 hits, 10ms):
  src/routes/auth.py                 login(), refresh(), logout()        0.93
  src/auth/jwt_handler.py            create_token(), verify_token()      0.90
  src/middleware/auth.py             AuthMiddleware.authenticate()        0.87
  src/models/user.py                 User model, password hash           0.83
  src/config/security.py            TOKEN_SECRET, TOKEN_EXPIRY           0.79
  src/auth/password.py               hash_password(), verify()           0.74
```

The flow: user hits the login route, credentials are verified against the User model using the password hashing utility, a JWT is created with the secret and expiry from config, and on subsequent requests the auth middleware verifies the token. Six files. The entire auth system. A senior developer would have taken 10 minutes to walk you through this. The search took 10 milliseconds.
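
If it helps to see the flow end to end, here is a compressed, standard-library-only sketch. It is deliberately not a real JWT and not the project's code: the signed token is a simplified stand-in for what `jwt_handler.py` presumably does with a JWT library, and the secret, expiry, and iteration count are assumed values.

```python
import base64
import hashlib
import hmac
import json
import os
import time

TOKEN_SECRET = b"change-me"  # assumption: the real value lives in src/config/security.py
TOKEN_EXPIRY = 3600          # seconds; assumed value
PBKDF2_ROUNDS = 100_000      # assumed work factor


def hash_password(password, salt=None):
    """Return salt + PBKDF2 digest, suitable for storing on the user record."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, PBKDF2_ROUNDS)
    return salt + digest


def verify_password(password, stored):
    """Recompute the digest with the stored salt and compare in constant time."""
    salt, digest = stored[:16], stored[16:]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, PBKDF2_ROUNDS)
    return hmac.compare_digest(candidate, digest)


def create_token(user_id, now=None):
    """Issue a signed token carrying the user id and an expiry timestamp."""
    issued = int(now if now is not None else time.time())
    payload = json.dumps({"sub": user_id, "exp": issued + TOKEN_EXPIRY})
    body = base64.urlsafe_b64encode(payload.encode())
    signature = hmac.new(TOKEN_SECRET, body, hashlib.sha256).hexdigest().encode()
    return (body + b"." + signature).decode()


def verify_token(token, now=None):
    """Return the user id if the signature is valid and the token is unexpired."""
    body, _, signature = token.encode().rpartition(b".")
    expected = hmac.new(TOKEN_SECRET, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(signature, expected):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body))
    current = int(now if now is not None else time.time())
    if payload["exp"] < current:
        return None
    return payload["sub"]
```

The login route corresponds to `verify_password` followed by `create_token`; the middleware corresponds to `verify_token` on every later request. Treat this as a map of the moving parts, not as code to ship.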

**Variations:**
- `"how are user permissions and roles checked"` -- focuses on authorization, not authentication
- `"token refresh and session expiration logic"` -- focuses on the session lifecycle
- `"how does the API distinguish authenticated from unauthenticated requests"` -- targets the middleware

---

### Recipe 19: Database Schema Explorer

**When to Use:** You need to understand the data model -- what tables exist, how they relate to each other, and what fields matter.

**The Query:**

```
what are the main data models and how do they relate to each other
```

**What to Look For:** Model classes, foreign key relationships, join tables, and any documentation or comments that explain the data model. Focus on the core entities first -- User, Order, Product, or whatever the domain objects are.

**Example:**

```bash
pyckle search "what are the main data models and how do they relate to each other"
```

```
Results (6 hits, 11ms):
  src/models/user.py                 User model                          0.92
  src/models/order.py                Order model (FK to User)            0.89
  src/models/product.py              Product model                       0.86
  src/models/order_item.py           OrderItem (FK to Order, Product)    0.83
  src/models/category.py            Category (M2M with Product)          0.79
  src/db/relationships.py           relationship definitions             0.74
```

The core domain: Users place Orders. Orders contain OrderItems. OrderItems reference Products. Products belong to Categories. That is the entity-relationship diagram, extracted from code, in 11 milliseconds.

```python
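# src/models/order.py
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import relationship

# Base is the project's declarative base class.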
class Order(Base):
    __tablename__ = 'orders'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'))
    status = Column(String, default='pending')
    items = relationship('OrderItem', back_populates='order')
    user = relationship('User', back_populates='orders')
```

**Variations:**
- `"database schema and table relationships"` -- more technical phrasing
- `"what fields does the [model name] model have"` -- drills into a specific model
- `"many-to-many relationships in the data model"` -- focuses on complex relationships

---

### Recipe 20: Deployment Pipeline Explorer

**When to Use:** You need to understand how code gets from a developer's machine to production. CI/CD pipelines, Docker builds, deployment configs, and release processes.

**The Query:**

```
how does the deployment pipeline work from commit to production
```

**What to Look For:** CI configuration files (GitHub Actions, GitLab CI, Jenkins), Dockerfiles, deployment scripts, environment-specific configs, and any manual steps documented in the codebase.

**Example:**

```bash
pyckle search "how does the deployment pipeline work from commit to production"
```

```
Results (5 hits, 9ms):
  .github/workflows/deploy.yml      CI/CD pipeline definition           0.94
  Dockerfile                         Container build                     0.89
  scripts/deploy.sh                  Deployment script                   0.85
  src/config/environments/           Environment-specific configs        0.80
  docker-compose.prod.yml           Production compose config            0.75
```

The pipeline: push to main triggers a GitHub Actions workflow. The workflow builds a Docker image using the Dockerfile, runs tests, and if they pass, deploys to the production environment using `deploy.sh`. Environment-specific configuration lives in `config/environments/`. The production Docker Compose file defines the service topology.

**Variations:**
- `"what happens in CI when a pull request is opened"` -- focuses on PR checks
- `"how are environment variables managed across environments"` -- targets config management
- `"rollback procedure and deployment history"` -- focuses on recovery

---

### Recipe 21: Error Handling Strategy Explorer

**When to Use:** You need to understand how the application handles errors -- how exceptions are caught, logged, and reported to the user or monitoring systems.

**The Query:**

```
how does the application handle errors and report them
```

**What to Look For:** Global error handlers, error middleware, logging configuration, error tracking integrations (Sentry, Datadog), custom exception classes, and user-facing error responses.

**Example:**

```bash
pyckle search "how does the application handle errors and report them"
```

```
Results (5 hits, 10ms):
  src/middleware/error_handler.py    GlobalErrorHandler                   0.93
  src/exceptions/__init__.py         Custom exception hierarchy          0.89
  src/config/logging.py             Logging configuration                0.85
  src/integrations/sentry.py        Sentry error tracking                0.81
  src/api/responses.py              Error response formatting             0.77
```

The error handling strategy: custom exceptions inherit from a base exception class. The global error handler middleware catches every exception. Known application errors are logged as warnings and returned with their own status code; anything else is logged with a traceback, reported to Sentry, and returned as a generic 500. Logging is configured separately. Five files map the entire error handling architecture.

```python
# src/middleware/error_handler.py
class GlobalErrorHandler:
    async def __call__(self, request, call_next):
        try:
            return await call_next(request)
        except AppException as e:
            logger.warning(f"Application error: {e}")
            return error_response(e.status_code, e.message)
        except Exception as e:
            logger.error(f"Unhandled error: {e}", exc_info=True)
            sentry.capture_exception(e)
            return error_response(500, "Internal server error")
```
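
The handler relies on `AppException` carrying a `status_code` and a `message`. The hierarchy in `src/exceptions/__init__.py` is not shown, so the following is only a guess at its shape; `NotFoundError`, `PermissionDeniedError`, and `to_error_body` are hypothetical names.

```python
class AppException(Exception):
    """Assumed base class: carries an HTTP status code and a client-safe message."""

    status_code = 500
    message = "Internal server error"

    def __init__(self, message=None):
        if message is not None:
            self.message = message
        super().__init__(self.message)


class NotFoundError(AppException):  # hypothetical subclass
    status_code = 404
    message = "Resource not found"


class PermissionDeniedError(AppException):  # hypothetical subclass
    status_code = 403
    message = "Permission denied"


def to_error_body(exc):
    """Map any exception to the (status, body) pair a handler could return."""
    if isinstance(exc, AppException):
        return exc.status_code, {"error": exc.message}
    # Unknown exceptions never leak their details to the client.
    return 500, {"error": "Internal server error"}
```

When you read the real hierarchy, check which subclasses exist and which status codes they map to. That tells you which exception to raise in new code so the handler produces the right response.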

**Variations:**
- `"what gets logged and where do logs go"` -- focuses on observability
- `"custom exception classes and what triggers them"` -- maps the exception taxonomy
- `"how does the API return errors to the client"` -- focuses on the user-facing side

---

### Recipe 22: API Endpoint Catalog

**When to Use:** You need a quick overview of all API endpoints -- what they do, what methods they support, and how they are organized.

**The Query:**

```
all API endpoints and what each one does
```

**What to Look For:** Route definitions, handler functions, URL patterns, HTTP methods, and any grouping by router or blueprint. This gives you the public surface area of the application.

**Example:**

```bash
pyckle search "all API endpoints and what each one does"
```

```
Results (6 hits, 11ms):
  src/routes/auth.py                 POST /login, /refresh, /logout      0.92
  src/routes/users.py                GET/POST/PUT /users                  0.89
  src/routes/orders.py               GET/POST /orders, /orders/{id}      0.86
  src/routes/products.py            GET /products, /products/{id}        0.83
  src/routes/admin.py               GET /admin/dashboard, /admin/users   0.79
  src/routes/health.py              GET /health, /readiness               0.74
```

Six route modules. Auth (login/refresh/logout), Users (create, read, update), Orders (create and view), Products (read-only), Admin (dashboard and user management), and Health (monitoring). That is the entire API surface in one query. Now you know where to add a new endpoint or where to look when a request fails.

**Variations:**
- `"REST API routes organized by resource"` -- groups by entity
- `"which endpoints require authentication"` -- maps the auth boundary
- `"API endpoint for [specific action like creating an order]"` -- finds one specific endpoint

---

### Recipe 23: Background Job Explorer

**When to Use:** The application has workers, cron jobs, or task queues that run outside the request-response cycle. You need to know what they do and when they run.

**The Query:**

```
what background jobs and workers exist and what do they do
```

**What to Look For:** Task queue definitions (Celery, RQ, custom workers), cron job configurations, scheduled task registrations, and worker entry points. Background jobs are often the least documented part of a system.

**Example:**

```bash
pyckle search "what background jobs and workers exist and what do they do"
```

```
Results (5 hits, 9ms):
  src/workers/email_worker.py        process_email_queue()               0.93
  src/workers/report_worker.py       generate_daily_reports()            0.89
  src/workers/cleanup_worker.py      purge_expired_sessions()            0.85
  src/config/celery.py              Celery configuration and schedule    0.81
  src/tasks/__init__.py             task registration                    0.76
```

Three workers: email processing, daily report generation, and session cleanup. All three are Celery tasks. The Celery config defines the schedule for the periodic ones; the excerpt below shows the report and cleanup jobs. Now you know what runs in the background, when it runs, and where to look when a scheduled process fails.

```python
# src/config/celery.py
from celery.schedules import crontab

beat_schedule = {
    'daily-reports': {
        'task': 'workers.report_worker.generate_daily_reports',
        'schedule': crontab(hour=2, minute=0),  # 2 AM daily
    },
    'session-cleanup': {
        'task': 'workers.cleanup_worker.purge_expired_sessions',
        'schedule': crontab(hour=3, minute=0),  # 3 AM daily
    },
}
```

**Variations:**
- `"scheduled tasks and when they run"` -- focuses on timing
- `"task queue configuration and worker setup"` -- focuses on infrastructure
- `"what happens asynchronously outside the request cycle"` -- broader framing

---

### Exercise

> **Try This**
>
> Pretend you just joined your team today. Run these five recipes against your own codebase:
>
> 1. Recipe 17 (System Entry Point Finder)
> 2. Recipe 18 (Authentication Flow Explorer)
> 3. Recipe 19 (Database Schema Explorer)
> 4. Recipe 20 (Deployment Pipeline Explorer)
> 5. Recipe 22 (API Endpoint Catalog)
>
> Write down the answers in a scratch document. Compare what you learned from the search results to your existing knowledge of the codebase. Did the queries surface anything you had forgotten or never knew?

---

### Key Takeaways

- Onboarding questions are navigation questions -- "where is X" and "how does Y work" -- which are exactly what semantic search handles best
- Broad, conceptual queries outperform specific ones for exploration -- "how does auth work" is a better search than "JWTMiddleware class"
- Five queries can map the core architecture of most applications: entry point, auth, data model, deployment, and API surface
- Background jobs are the most common blind spot for new developers -- they run outside the visible request-response cycle
- Save these queries as team onboarding bookmarks (see Episode 17 for saved search setup)

---

## Chapter 4: Refactoring Queries

### Chapter Overview

Refactoring is surgery on a running system. You need to know every connection before you cut. These seven recipes help you map the dependency graph, find deprecated patterns, identify interface consumers, and assess the scope of a refactoring project before you write the first line of changed code.

---

### Recipe 24: Interface Consumer Finder

**When to Use:** You are changing an interface, abstract class, or protocol. You need to find every implementation and every consumer before modifying the contract.

**The Query:**

```
find all implementations and callers of [interface or abstract class]
```

**What to Look For:** Classes that implement the interface, functions that accept the interface type as a parameter, and factory or registry patterns that create instances. Do not just find the implementations -- find the code that *uses* the interface.

**Example:**

You are refactoring `PaymentProvider`, an abstract class with three implementations. You want to add a new required method. You need to know who implements it and who calls it.

```bash
pyckle search "find all implementations and callers of PaymentProvider"
```

```
Results (6 hits, 12ms):
  src/payments/base.py               PaymentProvider ABC                 0.95
  src/payments/stripe_provider.py   StripeProvider(PaymentProvider)      0.92
  src/payments/paypal_provider.py   PayPalProvider(PaymentProvider)      0.89
  src/payments/braintree_provider.py BraintreeProvider(PaymentProvider)  0.86
  src/services/checkout.py          CheckoutService uses PaymentProvider 0.82
  src/payments/factory.py           PaymentProviderFactory.create()      0.78
```

Three implementations, one consumer, one factory. Adding a new required method means updating all three implementations, verifying the consumer does not break, and checking whether the factory needs to be aware of the new method. Six files, zero guesswork.
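
If `PaymentProvider` is a standard Python ABC, the language itself tells you which implementations you have not updated yet. The sketch below uses hypothetical method names (`charge`, `refund`) and toy provider bodies to show the effect.

```python
from abc import ABC, abstractmethod


class PaymentProvider(ABC):
    @abstractmethod
    def charge(self, amount): ...

    @abstractmethod
    def refund(self, charge_id): ...  # hypothetical new required method


class StripeProvider(PaymentProvider):
    """Updated implementation: defines both methods."""

    def charge(self, amount):
        return f"stripe-charge-{amount}"

    def refund(self, charge_id):
        return f"refunded-{charge_id}"


class PayPalProvider(PaymentProvider):
    """Not yet updated: still lacks refund()."""

    def charge(self, amount):
        return f"paypal-charge-{amount}"


def can_instantiate(provider_cls):
    """Return False if the class still has unimplemented abstract methods."""
    try:
        provider_cls()
    except TypeError:
        return False
    return True
```

`can_instantiate(StripeProvider)` succeeds while `can_instantiate(PayPalProvider)` fails, because Python refuses to instantiate a subclass with unimplemented abstract methods. The failure only appears at instantiation time, though, which is why the factory from the search results is worth checking: it is where that error would surface in production.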

**Variations:**
- `"classes that inherit from [base class]"` -- finds subclasses specifically
- `"where is [interface] used as a type hint or parameter"` -- finds consumers, not implementers
- `"factory or registry for [interface] implementations"` -- finds the creation points

---

### Recipe 25: Deprecated Pattern Finder

**When to Use:** You have established a new pattern and need to find all remaining instances of the old pattern that need migrating.

**The Query:**

```
where is [deprecated pattern] still used instead of [new pattern]
```

**What to Look For:** Direct usage of the deprecated API, import statements for the old module, and any code that follows the old convention rather than the new one.

**Example:**

The team moved from raw SQL queries to an ORM six months ago. You need to find all remaining raw SQL.

```bash
pyckle search "where are raw SQL queries still used instead of ORM"
```

```
Results (5 hits, 10ms):
  src/reports/revenue_report.py     execute(f"SELECT ...")                0.93
  src/migrations/legacy_import.py   cursor.execute("INSERT ...")          0.89
  src/admin/data_export.py          raw SQL join query                    0.85
  src/workers/analytics.py          db.execute("SELECT COUNT...")         0.81
  src/utils/db_health.py            connection.execute("SELECT 1")       0.74
```

Five files still using raw SQL. The revenue report, a legacy import script, a data export, an analytics worker, and a health check. The health check is intentionally raw (it is testing the database connection, not querying data). The other four are candidates for migration.

**Variations:**
- `"code that uses [old library or API] instead of [new one]"` -- finds library migration leftovers
- `"functions that still follow the [old convention]"` -- finds convention violations
- `"import statements for [deprecated module]"` -- finds files that depend on old code

---

### Recipe 26: Module Dependency Mapper

**When to Use:** You want to extract a module into a separate package, service, or library. You need to know its inbound and outbound dependencies to determine the extraction boundary.

**The Query:**

```
what does [module] depend on and what depends on [module]
```

**What to Look For:** What the module imports from the rest of the codebase (outbound), what other code imports from the module (inbound), shared state, and circular dependencies. The extraction boundary is cleanest where there are few cross-cutting dependencies.

**Example:**

You want to extract the `billing` module into a standalone service. You need its dependency graph.

```bash
pyckle search "what does the billing module depend on and what depends on billing"
```

```
Results (7 hits, 13ms):
  src/billing/__init__.py            billing module exports               0.93
  src/billing/invoice.py             imports from models, config          0.90
  src/billing/tax.py                 imports from billing.invoice         0.87
  src/services/order_service.py      imports billing.create_invoice       0.84
  src/services/subscription.py       imports billing.calculate_proration  0.81
  src/api/billing_endpoints.py      imports billing.*                     0.78
  src/workers/billing_worker.py     imports billing.process_pending       0.74
```

Inbound: order_service, subscription, billing_endpoints, and billing_worker all depend on the billing module. Outbound: billing depends on models and config. The extraction needs to: (1) define an API that replaces the four inbound import paths, and (2) either bundle the model and config dependencies or abstract them behind an interface.
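
To double-check the inbound and outbound lists before committing to a boundary, you can classify imports mechanically with the standard `ast` module. This is a minimal sketch under stated assumptions: `imported_modules` and `classify` are hypothetical helpers, absolute imports only, and `files` is a mapping from dotted module name to source text that you would build yourself.

```python
import ast


def imported_modules(source):
    """Return the dotted module names imported (absolutely) by Python source text."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
    return modules


def classify(files, module="billing"):
    """Return (inbound, outbound) for `module`.

    inbound:  names of files outside the module that import from it.
    outbound: modules outside the module that its own files import.
    """
    inbound, outbound = set(), set()
    prefix = module + "."
    for name, source in files.items():
        inside = name == module or name.startswith(prefix)
        for target in imported_modules(source):
            target_inside = target == module or target.startswith(prefix)
            if inside and not target_inside:
                outbound.add(target)
            elif not inside and target_inside:
                inbound.add(name)
    return inbound, outbound
```

Imports that stay inside the module, such as `billing.tax` importing `billing.invoice`, are ignored on purpose: they move with the module and do not cross the extraction boundary.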

**Variations:**
- `"imports from [module] across the codebase"` -- finds all inbound dependencies
- `"what [module] imports from other modules"` -- finds all outbound dependencies
- `"circular dependencies involving [module]"` -- finds dependency cycles

---

### Recipe 27: Dead Code Detector

**When to Use:** You suspect a function, class, or module is no longer used anywhere. You want to confirm it is safe to delete before removing it.

**The Query:**

```
is [function or class] still used anywhere in the codebase
```

**What to Look For:** Callers, importers, configuration references, string references (for dynamically dispatched code), and test-only usage. Code that is only referenced in tests might be dead production code that still has live test code.

**Example:**

The codebase has a `LegacyExporter` class that you think is no longer used. You want to confirm before deleting it.

```bash
pyckle search "is LegacyExporter still used anywhere in the codebase"
```

```
Results (3 hits, 7ms):
  src/exporters/legacy_exporter.py   LegacyExporter class definition     0.94
  tests/test_legacy_exporter.py      test_legacy_export()                0.88
  src/config/exporters.yaml         legacy_exporter: enabled: false      0.79
```

Three results: the definition itself, a test, and a config entry that shows it is disabled. No production code calls it, and the config has `enabled: false`. This is dead code with live tests. Once you have ruled out string-based or dynamically dispatched references, it is safe to delete both the class and its tests and to clean up the config entry.
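Sorting the hits by kind makes the "no production callers" conclusion easier to defend. The following is a rough sketch, not a Pyckle feature: `classify_references` is a hypothetical helper, the `files` mapping stands in for the real files, and the path rules (`tests/` prefix, config file extensions) are assumptions about this example project.

```python
import re


def classify_references(name: str, files: dict[str, str]) -> dict[str, list[str]]:
    """Group files that mention `name` by the kind of reference they hold.

    Matching covers the class name and its snake_case form, because config
    files and dynamic dispatch often refer to a class by a string key.
    """
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
    mention = re.compile(rf"\b({re.escape(name)}|{re.escape(snake)})\b")
    definition = re.compile(rf"^\s*(class|def)\s+{re.escape(name)}\b", re.MULTILINE)

    groups = {"definition": [], "tests": [], "config": [], "production": []}
    for path, text in files.items():
        if not mention.search(text):
            continue
        if definition.search(text):
            groups["definition"].append(path)
        elif path.startswith("tests/"):
            groups["tests"].append(path)
        elif path.endswith((".yaml", ".yml", ".toml", ".json")):
            groups["config"].append(path)
        else:
            groups["production"].append(path)
    return groups


# Hypothetical file contents matching the three hits above, plus one unrelated file.
files = {
    "src/exporters/legacy_exporter.py": "class LegacyExporter:\n    pass\n",
    "tests/test_legacy_exporter.py": "from src.exporters.legacy_exporter import LegacyExporter\n",
    "src/config/exporters.yaml": "legacy_exporter:\n  enabled: false\n",
    "src/api/export.py": "def export_csv():\n    pass\n",
}
groups = classify_references("LegacyExporter", files)
```

An empty `production` group supports deletion; a non-empty `config` group is the reminder to check whether anything loads the class by name at runtime.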

**Variations:**
- `"all references to [class or function] including dynamic dispatch"` -- catches string-based references
- `"code that was commented out or disabled for [feature]"` -- finds disabled features
- `"unused imports and unreachable code in [module]"` -- broader dead code scan

---

### Recipe 28: Naming Convention Auditor

**When to Use:** You are standardizing naming conventions across the codebase. You need to find functions, variables, or files that do not follow the agreed-upon pattern.

**The Query:**

```
naming conventions used for [type of thing] in this codebase
```

**What to Look For:** Consistent patterns (snake_case, camelCase, PascalCase), prefix/suffix conventions (like `Service` suffix for service classes), and outliers that do not match.

**Example:**

You want to verify that all service classes follow the `XxxService` naming pattern.

```bash
pyckle search "service classes and their naming conventions"
```

```
Results (7 hits, 11ms):
  src/services/user_service.py       UserService                         0.93
  src/services/order_service.py      OrderService                        0.91
  src/services/payment_processor.py  PaymentProcessor                    0.88
  src/services/email_service.py      EmailService                        0.86
  src/services/notification.py       NotificationHandler                 0.82
  src/services/cache_manager.py      CacheManager                        0.78
  src/services/billing/tax.py       TaxCalculator                        0.74
```

`UserService`, `OrderService`, and `EmailService` follow the pattern. `PaymentProcessor`, `NotificationHandler`, `CacheManager`, and `TaxCalculator` do not. Four out of seven service-layer classes use non-standard names. A refactoring plan: rename them to `PaymentService`, `NotificationService`, `CacheService`, and `TaxService` to establish consistency.
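Once the convention is agreed, the same check is easy to automate so new outliers do not creep back in. A minimal sketch, assuming the class names have already been collected (for example from the search results); `split_by_suffix` is a hypothetical helper.

```python
def split_by_suffix(class_names: list[str], suffix: str = "Service"):
    """Partition class names into those ending with `suffix` and the outliers."""
    conforming = [n for n in class_names if n.endswith(suffix) and n != suffix]
    outliers = [n for n in class_names if n not in conforming]
    return conforming, outliers


names = [
    "UserService", "OrderService", "PaymentProcessor", "EmailService",
    "NotificationHandler", "CacheManager", "TaxCalculator",
]
conforming, outliers = split_by_suffix(names)
```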

**Variations:**
- `"how are [repositories / controllers / handlers] named in this codebase"` -- checks conventions for a specific layer
- `"inconsistent naming patterns in [directory or module]"` -- finds outliers directly
- `"file naming conventions in [layer]"` -- checks file names, not just class names

---

### Recipe 29: Duplication Finder

**When to Use:** You suspect similar logic exists in multiple places. Before refactoring into a shared utility, you need to find all the copies.

**The Query:**

```
similar implementations of [description of logic] across the codebase
```

**What to Look For:** Functions that do the same thing with slight variations, copy-pasted code blocks with minor modifications, and parallel implementations in different modules.

**Example:**

You noticed date formatting logic in three different files. You want to find all instances before extracting a shared utility.

```bash
pyckle search "date formatting and timestamp conversion logic across the codebase"
```

```
Results (6 hits, 10ms):
  src/api/serializers.py             format_datetime() for API           0.92
  src/reports/formatters.py          format_date_for_report()            0.89
  src/templates/filters.py          datetime_filter() for templates      0.86
  src/exports/csv_writer.py         format_timestamp() for CSV           0.83
  src/utils/date_utils.py           parse_date(), date_to_string()       0.79
  src/notifications/formatter.py    format_notification_date()           0.74
```

Six implementations of date formatting. There is even a `date_utils.py` that nobody is using -- the other five files each have their own version. The refactoring plan: audit all six, determine the superset of formatting needs, consolidate into `date_utils.py`, and update the five consumers.
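What the consolidated utility might look like depends on the audit, but one common shape is a single function with named styles. The sketch below is an assumption, not the real `date_utils.py`: the `FORMATS` table, the style names, and the `date_to_string` signature are all invented for illustration.

```python
from datetime import datetime, timezone

# Hypothetical named formats, one per kind of consumer found by the search.
FORMATS = {
    "api": "%Y-%m-%dT%H:%M:%SZ",
    "csv": "%Y-%m-%d %H:%M:%S",
    "date_only": "%Y-%m-%d",
}


def date_to_string(value: datetime, style: str = "api") -> str:
    """Format `value` in UTC using one of the named styles."""
    if style not in FORMATS:
        raise ValueError(f"unknown date style: {style!r}")
    if value.tzinfo is None:
        # Assumption for this sketch: naive datetimes are already in UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).strftime(FORMATS[style])


stamp = datetime(2026, 3, 21, 14, 5, 9, tzinfo=timezone.utc)
api_text = date_to_string(stamp)
csv_text = date_to_string(stamp, "csv")
```

Keeping the formats in one table means each of the five consumers passes a style name instead of carrying its own format string.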

**Variations:**
- `"duplicate implementations of [specific logic]"` -- targets a known duplication
- `"utility functions that do similar things across different modules"` -- broader scan
- `"copy-pasted code patterns for [operation]"` -- looks for near-duplicates

---

### Recipe 30: Extraction Boundary Finder

**When to Use:** You are extracting a piece of functionality into a new module, package, or service. You need to find the cleanest cut point -- the boundary with the fewest cross-cutting dependencies.

**The Query:**

```
boundaries and interfaces between [component A] and [the rest of the system]
```

**What to Look For:** Functions that are called across module boundaries, shared data types, and any state that is accessed from both sides of the proposed boundary. The best extraction point has few public functions and no shared mutable state.

**Example:**

You want to extract the search functionality into its own service. You need to find the boundary.

```bash
pyckle search "boundaries and interfaces between search functionality and the rest of the system"
```

```
Results (5 hits, 10ms):
  src/search/engine.py               SearchEngine.search()               0.93
  src/search/indexer.py              Indexer.index_document()             0.89
  src/api/search_endpoint.py        search_endpoint() calls SearchEngine 0.85
  src/workers/index_worker.py       index_worker() calls Indexer         0.82
  src/search/config.py              search-specific configuration        0.78
```

The search module has two public entry points: `SearchEngine.search()` used by the API, and `Indexer.index_document()` used by the worker. The config is self-contained. This is a clean extraction boundary -- two functions become the service's API, the config moves with it, and the consumers only need to change from direct function calls to service client calls.
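One way to prepare for that switch is to put the two entry points behind an interface before extracting anything. The sketch below is illustrative only: `SearchBackend` and `InProcessSearch` are hypothetical names, and the substring matching is a toy stand-in for the real search engine.

```python
from typing import Protocol


class SearchBackend(Protocol):
    """The two operations that cross the boundary in the example above."""

    def search(self, query: str) -> list[str]: ...
    def index_document(self, doc_id: str, text: str) -> None: ...


class InProcessSearch:
    """Stand-in for today's direct calls into the search module."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def index_document(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text

    def search(self, query: str) -> list[str]:
        needle = query.lower()
        return sorted(d for d, t in self._docs.items() if needle in t.lower())


def search_endpoint(backend: SearchBackend, query: str) -> dict:
    """Consumer code depends on the interface, not on the implementation."""
    return {"query": query, "results": backend.search(query)}


backend = InProcessSearch()
backend.index_document("a", "Refund policy")
backend.index_document("b", "Shipping rates")
result = search_endpoint(backend, "refund")
```

After extraction, a client class that makes network calls can implement the same two methods, and `search_endpoint` does not need to change.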

**Variations:**
- `"public API surface of [module]"` -- finds what is exported and used externally
- `"shared state between [module A] and [module B]"` -- finds coupling through state
- `"cross-module function calls involving [module]"` -- maps the dependency graph edges

---

### Exercise

> **Try This**
>
> Pick a module in your codebase that feels too large or does too many things:
>
> 1. Run Recipe 26 (Module Dependency Mapper) to see its inbound and outbound dependencies
> 2. Run Recipe 29 (Duplication Finder) to check for duplicated logic within it
> 3. Run Recipe 30 (Extraction Boundary Finder) to identify the cleanest cut point
>
> Draw the dependency graph on paper. How many inbound arrows does the module have? How many outbound? Where is the narrowest point for splitting it?

---

### Key Takeaways

- Refactoring without dependency mapping is surgery without imaging -- you will cut something you did not know was connected
- Interface changes have a multiplied blast radius -- every implementation and every consumer is affected
- Dead code detection requires checking for dynamic dispatch -- string-based references and config entries can keep dead code "alive"
- Duplication often exists alongside a utility that nobody uses -- check for existing shared code before creating new shared code
- The cleanest extraction boundary has few public functions and no shared mutable state

---

## Chapter 5: Architecture Queries

### Chapter Overview

Understanding a system's architecture means understanding how its parts communicate, where the boundaries are, and what the data structures look like. These seven recipes surface the architectural patterns, service boundaries, and design decisions embedded in the code. They work whether the architecture is documented or not -- because they read the implementation, not the wiki.

---

### Recipe 31: Service Communication Mapper

**When to Use:** You need to understand how services or modules communicate with each other -- HTTP calls, message queues, shared databases, event buses, or direct function calls.

**The Query:**

```
how do services communicate with each other
```

**What to Look For:** HTTP client calls, message queue producers and consumers, shared database access patterns, event emission and subscription, and any RPC or gRPC definitions.

**Example:**

You are joining a microservices project and need to understand the communication topology.

```bash
pyckle search "how do services communicate with each other"
```

```
Results (6 hits, 12ms):
  src/clients/order_client.py        HTTP client to order service        0.93
  src/events/publisher.py            RabbitMQ event publisher            0.90
  src/events/consumer.py            Event consumer registrations         0.87
  src/clients/payment_client.py     HTTP client to payment service       0.84
  src/config/services.py            Service URLs and endpoints           0.80
  src/shared/events.py              Event type definitions               0.75
```

Two communication patterns: synchronous HTTP calls (order and payment clients) and asynchronous events (RabbitMQ publisher and consumer). The service URLs are configured centrally. Event types are defined in a shared module. This tells you: order and payment are synchronous dependencies (latency matters), while other services communicate asynchronously through events (eventual consistency).

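```python
# src/clients/order_client.py
import httpx

class OrderClient:
    def __init__(self, base_url: str):
        self.base_url = base_url

    async def get_order(self, order_id: str) -> dict:
        # Synchronous dependency: the caller waits for the order service to answer
        async with httpx.AsyncClient() as client:
            response = await client.get(f"{self.base_url}/orders/{order_id}")
            return response.json()

# src/events/publisher.py
import json

class EventPublisher:
    def __init__(self, channel):
        self.channel = channel  # an already-open RabbitMQ channel

    def publish(self, event_type: str, payload: dict):
        # Asynchronous: fire the event and return without waiting for consumers
        self.channel.basic_publish(
            exchange='events',
            routing_key=event_type,
            body=json.dumps(payload)
        )
```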

**Variations:**
- `"HTTP client calls between services"` -- focuses on synchronous communication
- `"message queue producers and consumers"` -- focuses on asynchronous communication
- `"shared database access across services"` -- finds coupling through shared state (an anti-pattern in microservices)

---

### Recipe 32: API Boundary Finder

**When to Use:** You need to understand where the public API surface is -- what is exposed to external clients versus what is internal-only.

**The Query:**

```
where are the API boundaries and what is exposed externally
```

**What to Look For:** Public endpoint definitions, API gateway configurations, CORS settings, rate limiting on external routes, and any distinction between internal and external API versions.

**Example:**

```bash
pyckle search "where are the API boundaries and what is exposed externally"
```

```
Results (5 hits, 10ms):
  src/api/v1/public/routes.py        Public API routes                   0.94
  src/api/v1/internal/routes.py     Internal API routes                  0.91
  src/middleware/api_gateway.py     API gateway middleware                0.87
  src/config/cors.py                CORS allowed origins                  0.83
  src/middleware/rate_limiter.py    Rate limiting configuration           0.78
```

The API is split into public and internal routes. The public routes go through the API gateway middleware, have CORS configured for specific origins, and are rate-limited. The internal routes bypass rate limiting and CORS but require a service-to-service authentication token.

```python
# src/api/v1/public/routes.py
router = APIRouter(prefix="/api/v1", dependencies=[
    Depends(rate_limiter),
    Depends(cors_middleware),
])

# src/api/v1/internal/routes.py
router = APIRouter(prefix="/internal/v1", dependencies=[
    Depends(service_auth),  # Service-to-service token, no rate limit
])
```

**Variations:**
- `"public vs internal API routes and what distinguishes them"` -- maps the access boundary
- `"API versioning strategy and how versions coexist"` -- finds versioning patterns
- `"API gateway configuration and routing rules"` -- focuses on the gateway layer

---

### Recipe 33: Data Model Mapper

**When to Use:** You need a comprehensive view of the data structures -- not just the database models, but also DTOs, API schemas, and any transformation layers between them.

**The Query:**

```
what are all the data models, schemas, and DTOs and how do they relate
```

**What to Look For:** Database models, API request/response schemas, internal DTOs, serializers, and any mapper or transformer classes that convert between these representations.

**Example:**

```bash
pyckle search "what are all the data models schemas and DTOs and how do they relate"
```

```
Results (6 hits, 11ms):
  src/models/                        Database models (SQLAlchemy)         0.93
  src/api/schemas/                   API request/response schemas         0.90
  src/services/dto/                  Internal DTOs                        0.86
  src/serializers/                   Model-to-schema converters           0.83
  src/mappers/                       DTO transformation layer             0.79
  src/validators/                    Input validation schemas             0.74
```

Four layers of data representation: database models (SQLAlchemy), API schemas (Pydantic), internal DTOs (dataclasses), and validators (Pydantic, overlapping with schemas). The serializers convert database models to API schemas, and the mappers convert between DTOs. With four representations in play, a single record can be reshaped up to three times between the database and the API response.

If you find yourself asking "why are there so many representations of the same data?", that is a legitimate architecture question. The answer is usually separation of concerns -- the database model should not leak into the API contract, and the API contract should not drive the internal domain model. But four layers might be one too many.
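To make the separation concrete, here is a simplified two-step chain using only standard-library dataclasses. It is a sketch under assumptions: `UserRow`, `UserDTO`, and `UserResponse` are hypothetical stand-ins for the SQLAlchemy model, the internal DTO, and the Pydantic schema, and the field names are invented.

```python
from dataclasses import dataclass


@dataclass
class UserRow:
    """Stand-in for a database model."""
    id: int
    email: str
    password_hash: str


@dataclass
class UserDTO:
    """Internal representation passed between services."""
    id: int
    email: str


@dataclass
class UserResponse:
    """Stand-in for an API response schema."""
    id: str
    email: str


def row_to_dto(row: UserRow) -> UserDTO:
    # Drops persistence-only fields such as the password hash.
    return UserDTO(id=row.id, email=row.email)


def dto_to_response(dto: UserDTO) -> UserResponse:
    # The API contract exposes ids as strings, independent of the database type.
    return UserResponse(id=str(dto.id), email=dto.email)


response = dto_to_response(row_to_dto(UserRow(7, "a@example.com", "x1y2")))
```

Each conversion is a place where a field can be dropped or renamed deliberately, which is the benefit; each is also code to maintain, which is the cost the paragraph above questions.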

**Variations:**
- `"how does data transform between database and API response"` -- traces the transformation chain
- `"all Pydantic models and where they are used"` -- targets a specific schema framework
- `"where are data types defined vs where they are consumed"` -- finds definition/consumption mismatches

---

### Recipe 34: Event Flow Mapper

**When to Use:** The application uses an event-driven architecture. You need to understand what events exist, who emits them, and who handles them.

**The Query:**

```
what events are emitted and who handles them
```

**What to Look For:** Event class definitions, emit/publish calls, handler/subscriber registrations, and the mapping between event types and their handlers. In event-driven systems, the control flow is implicit -- events connect producers to consumers without direct function calls.

**Example:**

```bash
pyckle search "what events are emitted and who handles them"
```

```
Results (6 hits, 12ms):
  src/events/types.py                Event type definitions              0.94
  src/events/publisher.py            publish() call sites                0.91
  src/events/handlers/order.py      OrderCreated, OrderCancelled         0.88
  src/events/handlers/notification.py handle notifications on events     0.84
  src/events/handlers/analytics.py  track events for analytics           0.80
  src/events/registry.py            event-to-handler mapping             0.76
```

The event registry maps event types to handlers:

```python
# src/events/registry.py
EVENT_HANDLERS = {
    'order.created': [
        handlers.order.send_confirmation,
        handlers.notification.notify_team,
        handlers.analytics.track_order,
    ],
    'order.cancelled': [
        handlers.order.process_refund,
        handlers.notification.notify_customer,
        handlers.analytics.track_cancellation,
    ],
    'user.registered': [
        handlers.notification.send_welcome,
        handlers.analytics.track_signup,
    ],
}
```

Three events, eight handler registrations. When an order is created, three things happen: confirmation email, team notification, and analytics tracking. This map is the event-driven architecture, laid out explicitly.
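A registry like this is typically consumed by a small dispatcher. The sketch below shows one plausible shape, not the project's actual code: `dispatch` is a hypothetical function, and the lambda handlers only record what they receive.

```python
from typing import Callable

Handler = Callable[[dict], None]


def dispatch(registry: dict[str, list[Handler]], event_type: str, payload: dict) -> int:
    """Call every handler registered for `event_type`, in registration order.

    Returns the number of handlers invoked; unknown event types invoke none.
    """
    handlers = registry.get(event_type, [])
    for handler in handlers:
        handler(payload)
    return len(handlers)


# Hypothetical handlers that just record what they saw.
calls: list[str] = []
registry: dict[str, list[Handler]] = {
    "order.created": [
        lambda p: calls.append(f"confirm:{p['order_id']}"),
        lambda p: calls.append(f"notify:{p['order_id']}"),
        lambda p: calls.append(f"track:{p['order_id']}"),
    ],
}
count = dispatch(registry, "order.created", {"order_id": "A1"})
```

Note that in this sketch a handler that raises stops the remaining handlers for that event; whether that is acceptable is exactly the kind of ordering question the third variation below is meant to surface.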

**Variations:**
- `"all event publishers and what events they emit"` -- focuses on the producer side
- `"event handler registrations and what triggers them"` -- focuses on the consumer side
- `"event ordering and dependencies between handlers"` -- finds sequential constraints

---

### Recipe 35: Configuration Architecture Mapper

**When to Use:** You need to understand how configuration flows through the system -- where it is defined, how it is loaded, what defaults exist, and what can be overridden at runtime.

**The Query:**

```
how is application configuration structured and loaded
```

**What to Look For:** Config files (YAML, TOML, JSON), environment variable loading, config classes, default values, per-environment overrides, feature flags, and runtime-reloadable settings.

**Example:**

```bash
pyckle search "how is application configuration structured and loaded"
```

```
Results (5 hits, 9ms):
  src/config/settings.py             Settings class (pydantic)           0.94
  src/config/defaults.py             Default values                      0.90
  src/config/environments/           Per-environment overrides            0.86
  src/config/feature_flags.py       Feature flag management              0.82
  src/config/__init__.py            Config loading and validation         0.77
```

The configuration architecture: a Pydantic `Settings` class defines all config with type annotations and validation. Defaults are in a separate file. Environment-specific overrides live in a directory. Feature flags are managed separately. The config loader in `__init__.py` orchestrates the priority: environment variables override environment-specific files, which override defaults.

```python
# src/config/__init__.py
def load_config(env: str = "production") -> Settings:
    defaults = load_defaults()
    env_overrides = load_env_file(f"config/environments/{env}.yaml")
    env_vars = load_env_vars()
    merged = {**defaults, **env_overrides, **env_vars}
    return Settings(**merged)
```

Priority: env vars > environment file > defaults. If a value is wrong in production, check the env var first, then the production config file, then the default.
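A small worked example of that merge, with made-up keys and values, shows why the rightmost source wins and how to tell which layer supplied a value. The `source_of` helper is hypothetical and only for illustration.

```python
defaults = {"db_pool_size": 5, "debug": False, "log_level": "INFO"}
env_overrides = {"db_pool_size": 20, "log_level": "WARNING"}  # e.g. from production.yaml
env_vars = {"log_level": "DEBUG"}                              # e.g. set on the host

# Later dictionaries win, so the rightmost source has the highest priority.
merged = {**defaults, **env_overrides, **env_vars}

layers = [
    ("defaults", defaults),
    ("environment file", env_overrides),
    ("env vars", env_vars),
]


def source_of(key: str):
    """Return the name of the highest-priority layer that defines `key`."""
    for name, values in reversed(layers):
        if key in values:
            return name
    return None
```

Here `merged["log_level"]` ends up as `"DEBUG"` because the environment variable is applied last.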

**Variations:**
- `"configuration priority and override hierarchy"` -- maps the override chain
- `"feature flags and how they are toggled"` -- focuses on runtime control
- `"secrets management and how credentials are loaded"` -- focuses on sensitive config

---

### Recipe 36: Middleware and Pipeline Mapper

**When to Use:** You need to understand the request processing pipeline -- what middleware runs, in what order, and what each piece does.

**The Query:**

```
what middleware and pipeline stages process each request
```

**What to Look For:** Middleware registration order, request/response interceptors, authentication filters, logging hooks, CORS handlers, and any pipeline pattern (chain of responsibility, middleware stack).

**Example:**

```bash
pyckle search "what middleware and pipeline stages process each request"
```

```
Results (5 hits, 10ms):
  src/middleware/setup.py            register_middleware() order          0.94
  src/middleware/auth.py             Authentication middleware            0.91
  src/middleware/logging.py          Request logging middleware           0.87
  src/middleware/rate_limiter.py     Rate limiting middleware             0.83
  src/middleware/error_handler.py   Error handling middleware             0.79
```

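```python
# src/middleware/setup.py
def register_middleware(app):
    # Starlette/FastAPI wrap each newly added middleware around the existing
    # stack, so the LAST add_middleware() call ends up outermost.
    app.add_middleware(AuthMiddleware)             # Innermost - authenticates requests
    app.add_middleware(RateLimiterMiddleware)     # Rejects excess traffic
    app.add_middleware(LoggingMiddleware)          # Logs request/response
    app.add_middleware(ErrorHandlerMiddleware)    # Outermost - catches all errors
    # Routes handle the actual request (inside AuthMiddleware)
```

This assumes Starlette/FastAPI semantics, where the middleware registered last becomes the outermost layer; other frameworks run middleware in registration order, so check yours. On a request, the stack runs error handling, logging, rate limiting, then authentication, and the response unwinds in reverse. Error handling wraps everything. Logging sees every request. Rate limiting happens before authentication (so unauthenticated flood traffic is rejected early). Authentication runs last before the route handler.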
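The request/response nesting is easier to see without any framework in the way. The sketch below is a toy model, not the project's middleware: each layer is a plain function wrapping the next handler, and `make_middleware`, `route`, and `trace` are hypothetical names.

```python
from typing import Callable

Handler = Callable[[dict], dict]
trace: list[str] = []


def make_middleware(name: str) -> Callable[[Handler], Handler]:
    """Build a middleware that records when it sees the request and the response."""
    def wrap(next_handler: Handler) -> Handler:
        def handler(request: dict) -> dict:
            trace.append(f"{name}:request")
            response = next_handler(request)
            trace.append(f"{name}:response")
            return response
        return handler
    return wrap


def route(request: dict) -> dict:
    trace.append("route")
    return {"status": 200}


# Wrap from the inside out so that the first name listed ends up outermost.
app: Handler = route
for name in reversed(["error", "logging", "rate_limit", "auth"]):
    app = make_middleware(name)(app)

response = app({"path": "/orders"})
```

After the call, `trace` lists the four layers in order on the way in, then `route`, then the same layers in reverse on the way out.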

**Variations:**
- `"middleware execution order and dependencies"` -- focuses on ordering
- `"request lifecycle from receipt to response"` -- maps the full request processing
- `"what runs before and after each API request"` -- finds pre/post processing hooks

---

### Recipe 37: Cross-Cutting Concern Mapper

**When to Use:** You need to understand how a cross-cutting concern -- logging, caching, metrics, authorization -- is implemented across the application. These concerns touch many modules but are often not centralized.

**The Query:**

```
how is [cross-cutting concern] implemented across the application
```

**What to Look For:** Centralized implementations (middleware, decorators), distributed implementations (inline code in each module), configuration, and any inconsistencies in how different parts of the application handle the same concern.

**Example:**

You need to understand how caching works across the application.

```bash
pyckle search "how is caching implemented across the application"
```

```
Results (6 hits, 11ms):
  src/cache/redis_client.py          Redis cache client                  0.93
  src/cache/decorators.py           @cached() decorator                  0.90
  src/services/product_service.py   manual cache.get/set calls           0.86
  src/services/user_service.py      manual cache.get/set calls           0.83
  src/config/cache.py               Cache TTL and configuration          0.79
  src/middleware/cache_middleware.py response caching middleware          0.74
```

Three caching strategies in one application: a `@cached()` decorator, manual `cache.get/set` calls in two services, and response-level caching in middleware. The config defines TTLs. This is a common architectural inconsistency -- the decorator pattern exists but not all services use it. Two services implement caching manually with potentially different TTLs and invalidation logic. That is a refactoring opportunity.
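For reference, a decorator of this kind can be quite small, which is part of the argument for migrating the manual call sites to it. The version below is a sketch, not the code in `src/cache/decorators.py`: it keeps entries in a local dict rather than Redis, handles positional arguments only, and the `ttl_seconds`, `clock`, and `invalidate` names are assumptions.

```python
import functools
import time


def cached(ttl_seconds: float, clock=time.monotonic):
    """Cache a function's results per argument tuple for `ttl_seconds`."""
    def decorator(func):
        store: dict[tuple, tuple[float, object]] = {}

        @functools.wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]
            value = func(*args)
            store[args] = (now, value)
            return value

        wrapper.invalidate = store.clear  # explicit invalidation hook
        return wrapper
    return decorator


now = [0.0]            # fake clock so the example does not need to sleep
loads: list[int] = []


@cached(ttl_seconds=60, clock=lambda: now[0])
def get_product(product_id: int) -> dict:
    loads.append(product_id)  # stands in for the expensive lookup
    return {"id": product_id}


get_product(1)
get_product(1)         # served from the cache
now[0] = 61.0
get_product(1)         # TTL expired, loaded again
```

With one decorator, TTL and invalidation behavior are defined in a single place instead of being re-decided in each service.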

**Variations:**
- `"how is logging implemented across services"` -- audits logging consistency
- `"where are metrics and monitoring points in the codebase"` -- finds observability gaps
- `"authorization checks across different API endpoints"` -- audits permission consistency

---

### Exercise

> **Try This**
>
> Run three of these architecture recipes against your codebase:
>
> 1. Recipe 31 (Service Communication Mapper) -- what communication patterns does your application use?
> 2. Recipe 36 (Middleware and Pipeline Mapper) -- what is the request processing order?
> 3. Recipe 37 (Cross-Cutting Concern Mapper) -- pick one concern (caching, logging, or auth) and search for how it is implemented
>
> Draw a diagram of what you find. Does the actual architecture match what you expected? Are there inconsistencies in how cross-cutting concerns are handled?

---

### Key Takeaways

- Architecture lives in code, not in diagrams -- semantic search reads the implementation, which is always current
- Service communication patterns determine reliability characteristics -- synchronous calls create latency coupling, async events create eventual consistency
- Data model proliferation (models, DTOs, schemas, serializers) is common and sometimes excessive -- mapping it reveals simplification opportunities
- Cross-cutting concerns are often implemented inconsistently -- some modules use the centralized pattern, others do it manually
- The middleware stack order matters -- authentication before rate limiting has different security properties than rate limiting before authentication

---

## Chapter 6: Performance Queries

### Chapter Overview

Performance problems are usually architectural, not algorithmic. The N+1 query is not a clever algorithm gone wrong -- it is a data access pattern that nobody noticed. These six recipes help you find the most common performance anti-patterns: unnecessary database queries, unbounded loops, missing caches, and inefficient data loading.

---

### Recipe 38: N+1 Query Detector

**When to Use:** The application is slow when loading collections. You suspect the code is making one database query per item in a list instead of a single query for all items.

**The Query:**

```
where are database queries made inside loops or per-item in collections
```

**What to Look For:** Database calls inside `for` loops, ORM lazy loading triggered during iteration, and any pattern where a query runs once per item in a list rather than once for the entire list.

**Example:**

An endpoint that lists orders is taking 3 seconds. It should take 300 milliseconds.

```bash
pyckle search "where are database queries made inside loops or per-item in collections"
```

```
Results (4 hits, 9ms):
  src/services/order_service.py      get_orders_with_details()           0.93
  src/api/orders.py                  list_orders() endpoint              0.89
  src/repositories/order_repo.py    get_order_items() per order          0.85
  src/serializers/order.py          serialize_order() triggers lazy load 0.79
```

```python
# src/services/order_service.py
def get_orders_with_details(user_id):
    orders = Order.query.filter_by(user_id=user_id).all()  # 1 query
    for order in orders:
        order.items = get_order_items(order.id)  # N queries
        order.customer = get_customer(order.customer_id)  # N queries
    return orders
```

For a user with 50 orders, this executes 101 queries: 1 for orders, 50 for items, 50 for customers. The fix is eager loading with joins or a batch query:

```python
# Fixed version
from sqlalchemy.orm import joinedload

def get_orders_with_details(user_id):
    # Assumes Order.items and Order.customer are mapped ORM relationships
    orders = (Order.query
        .filter_by(user_id=user_id)
        .options(joinedload(Order.items), joinedload(Order.customer))
        .all())  # 1 query with joins
    return orders
```
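The batch-query alternative looks like this when written out by hand. This is a self-contained sketch using the standard-library `sqlite3` module with an invented two-table schema; it is not the ORM code from the example, and `items_by_order` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders (id, user_id) VALUES (1, 42), (2, 42), (3, 7);
    INSERT INTO order_items (order_id, sku) VALUES (1, 'A'), (1, 'B'), (2, 'C'), (3, 'D');
""")


def items_by_order(conn: sqlite3.Connection, user_id: int) -> dict[int, list[str]]:
    """Load a user's orders and all of their items with two queries in total."""
    order_ids = [row[0] for row in conn.execute(
        "SELECT id FROM orders WHERE user_id = ? ORDER BY id", (user_id,))]
    result: dict[int, list[str]] = {order_id: [] for order_id in order_ids}
    if not order_ids:
        return result
    # One IN (...) query replaces one query per order.
    placeholders = ",".join("?" * len(order_ids))
    rows = conn.execute(
        f"SELECT order_id, sku FROM order_items "
        f"WHERE order_id IN ({placeholders}) ORDER BY id", order_ids)
    for order_id, sku in rows:
        result[order_id].append(sku)
    return result


grouped = items_by_order(conn, 42)
```

The query count stays at two whether the user has 5 orders or 50. For very large id lists, databases limit the number of bound parameters, so the ids may need to be chunked.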

**Variations:**
- `"ORM lazy loading that triggers additional queries"` -- catches implicit N+1 from ORM relationships
- `"database calls inside iteration or list comprehension"` -- broader pattern match
- `"batch query vs per-item query patterns for [entity]"` -- finds both the problem and the solution pattern

---

### Recipe 39: Unbounded Collection Finder

**When to Use:** You suspect an operation is processing an unbounded collection -- loading all records, iterating all items, or returning all results without pagination.

**The Query:**

```
where are collections loaded without limits or pagination
```

**What to Look For:** Queries without `LIMIT`, `all()` calls on large tables, list operations without bounds, and API endpoints that return full collections.

**Example:**

```bash
pyckle search "where are collections loaded without limits or pagination"
```

```
Results (5 hits, 10ms):
  src/repositories/user_repo.py     get_all_users() no limit             0.93
  src/services/report_service.py    load_all_transactions()              0.90
  src/api/products.py               list_products() no pagination        0.86
  src/exports/data_export.py        export_all_records()                 0.82
  src/workers/batch_processor.py    process_all_pending()                0.77
```

Five places where "all" means "all." `get_all_users()` returns every user. On day one, that is 50 records. In a year, it is 50,000. Each of these five call sites loads its entire result set into memory, so each is a memory bomb waiting to go off at scale.

```python
# src/repositories/user_repo.py
def get_all_users():
    return User.query.all()  # Returns every row in the users table

# Fixed version
def get_users(page: int = 1, per_page: int = 50):
    # Order by a stable key so pages do not overlap or skip rows
    return User.query.order_by(User.id).paginate(page=page, per_page=per_page)
```
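Pagination fits API endpoints; exports and batch workers that genuinely need every row can stream them in bounded batches instead. The sketch below uses the standard-library `sqlite3` module and an invented `users` table; `iter_user_batches` is a hypothetical helper, and it assumes positive integer primary keys.

```python
import sqlite3
from typing import Iterator


def iter_user_batches(conn: sqlite3.Connection, batch_size: int = 2) -> Iterator[list[tuple]]:
    """Yield users in fixed-size batches, ordered by primary key.

    Each query resumes after the last id seen, so memory use is bounded by
    `batch_size` regardless of table size.
    """
    last_id = 0
    while True:
        batch = conn.execute(
            "SELECT id, email FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            return
        yield batch
        last_id = batch[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(5)])
batch_sizes = [len(batch) for batch in iter_user_batches(conn, batch_size=2)]
```

Resuming from the last seen id, rather than using a growing `OFFSET`, keeps each query cheap even deep into a large table.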

**Variations:**
- `"queries that return all rows without LIMIT"` -- database-specific
- `"API endpoints that return full collections without pagination"` -- API-specific
- `"memory-intensive operations that load entire datasets"` -- broader performance concern

---

### Recipe 40: Cache Usage Auditor

**When to Use:** You want to understand what is cached, what is not, and whether the caching strategy matches the access patterns.

**The Query:**

```
where is caching implemented and what is not cached that should be
```

**What to Look For:** Cache read/write calls, cache decorators, TTL configurations, and frequently accessed data that has no caching layer. Also look for cache invalidation logic -- caching without invalidation leads to stale data.

**Example:**

```bash
pyckle search "where is caching implemented and what data access is expensive"
```

```
Results (5 hits, 10ms):
  src/cache/decorators.py           @cached decorator                    0.93
  src/services/product_service.py   get_product() -- cached             0.90
  src/services/pricing_service.py   calculate_price() -- not cached     0.86
  src/services/config_service.py    get_feature_flags() -- not cached   0.82
  src/config/cache.py               Cache TTLs                           0.77
```

Products are cached. Pricing calculations are not -- but they call the product service and the tax service for every calculation. Feature flags are fetched from the database on every check. Two candidates for caching: pricing calculations (if they are deterministic for the same inputs) and feature flags (which change rarely but are checked on every request).

```python
# src/services/pricing_service.py
def calculate_price(product_id, quantity, country):
    product = get_product(product_id)  # Cached
    tax_rate = get_tax_rate(country)    # Not cached -- DB call every time
    return product.price * quantity * (1 + tax_rate)
```

The product lookup is cached, but the tax rate lookup is not. Tax rates change rarely -- maybe quarterly. Caching them with a 1-hour TTL would eliminate a database call per price calculation.
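One lightweight way to get an hourly TTL without a cache server is to make the current hour part of an `lru_cache` key. This is a sketch under assumptions: `load_tax_rate_from_db` is a hypothetical stand-in for the real lookup, the rates are invented, and `now` is injectable only so the example can run without waiting.

```python
import functools
import time
from typing import Optional

TTL_SECONDS = 3600
db_calls: list[str] = []


def load_tax_rate_from_db(country: str) -> float:
    """Hypothetical stand-in for the uncached database lookup."""
    db_calls.append(country)
    return {"DE": 0.19, "FR": 0.20}.get(country, 0.0)


@functools.lru_cache(maxsize=256)
def _tax_rate_for_window(country: str, window: int) -> float:
    # `window` is part of the cache key, so a new hour forces a fresh lookup.
    return load_tax_rate_from_db(country)


def get_tax_rate(country: str, now: Optional[float] = None) -> float:
    """Return the tax rate, hitting the database at most once per country per hour window."""
    current = time.time() if now is None else now
    return _tax_rate_for_window(country, int(current // TTL_SECONDS))


rates = [get_tax_rate("DE", now=t) for t in (0, 1800, 3600)]
```

The windows are aligned to the clock, so an entry loaded late in a window expires sooner than a full hour later. That is usually fine for data that changes quarterly, but it is a different guarantee from a per-entry TTL.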

**Variations:**
- `"cache invalidation logic and when caches are cleared"` -- finds stale data risks
- `"frequently accessed data that hits the database on every request"` -- finds caching opportunities
- `"cache hit/miss patterns and TTL configuration"` -- audits the caching strategy

---

### Recipe 41: Slow Query Pattern Finder

**When to Use:** The database is the bottleneck. You need to find query patterns that are likely to be slow -- missing indexes, full table scans, complex joins, or queries that compute aggregations without limits.

**The Query:**

```
database queries that could be slow due to missing indexes or full scans
```

**What to Look For:** Queries filtering on non-indexed columns, queries with multiple joins, text search without full-text indexes, sorting on non-indexed columns, and aggregation queries without date bounds.

**Example:**

```bash
pyckle search "database queries that could be slow due to missing indexes or full scans"
```

```
Results (4 hits, 9ms):
  src/repositories/order_repo.py     filter by status without index      0.92
  src/repositories/log_repo.py      text search on message column        0.88
  src/reports/analytics.py          aggregation across all records        0.84
  src/services/search_service.py    LIKE '%query%' search pattern        0.79
```

Four slow-query candidates. Filtering orders by status (likely needs an index on `status`). Text searching log messages (needs a full-text index). Aggregating analytics without date bounds (should be limited to a time range). Using `LIKE '%query%'` for search (a leading wildcard defeats an ordinary B-tree index; use full-text search instead).

```python
# src/repositories/order_repo.py
def get_orders_by_status(status):
    # If the orders table has 1M rows and no index on status,
    # this is a full table scan
    return Order.query.filter_by(status=status).all()
```
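One way to confirm a suspected full scan is to ask the database for its query plan. The sketch below uses SQLite from the standard library purely as a stand-in. The table layout and the index name `idx_orders_status` are assumptions, and your production database will have its own `EXPLAIN` syntax and output.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (status, total) VALUES (?, ?)",
    [("pending" if i % 10 else "shipped", float(i)) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE status = ?"
plan_before = query_plan(conn, sql, ("shipped",))  # expect a scan of the whole table

conn.execute("CREATE INDEX idx_orders_status ON orders (status)")
plan_after = query_plan(conn, sql, ("shipped",))   # expect a search via the new index

print(plan_before)
print(plan_after)
```

Running the same check before and after adding an index turns "likely needs an index" into something you can verify.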

**Variations:**
- `"queries that filter or sort on columns that might not be indexed"` -- targets index gaps
- `"complex joins involving three or more tables"` -- finds expensive join patterns
- `"aggregation queries that scan large tables"` -- finds unbounded aggregations

---

### Recipe 42: Memory Leak Pattern Finder

**When to Use:** The application's memory usage grows over time without an obvious cause. You need to find code that accumulates data without releasing it.

**The Query:**

```
where does data accumulate in memory without cleanup or bounds
```

**What to Look For:** Growing collections (lists, dicts) that are never cleared, caches without eviction, event listener registrations without removal, and class-level attributes that accumulate data across requests.

**Example:**

```bash
pyckle search "where does data accumulate in memory without cleanup or bounds"
```

```
Results (4 hits, 8ms):
  src/services/event_tracker.py     events list grows unbounded          0.93
  src/cache/local_cache.py          in-memory cache no eviction          0.89
  src/middleware/request_log.py     request history stored in memory     0.85
  src/utils/connection_pool.py      connections not released on error    0.80
```

Four accumulation patterns. The event tracker appends to a list that is never cleared. The local cache has no maximum size or eviction policy. The request logger stores all request data in memory. The connection pool leaks connections when errors occur during request processing.

```python
# src/services/event_tracker.py
class EventTracker:
    _events = []  # Class-level list -- survives across requests

    def track(self, event):
        self._events.append(event)  # Grows forever

    # No flush, no trim, no bounds
```
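A bounded variant, as a sketch: `BoundedEventTracker`, `max_events`, and `flush()` are hypothetical names, and the right bound depends on how many events you actually need to keep.

```python
from collections import deque

class BoundedEventTracker:
    """Keeps only the most recent max_events events (hypothetical variant)."""

    def __init__(self, max_events=10_000):
        # Instance-level and bounded: the oldest events are dropped automatically
        self._events = deque(maxlen=max_events)

    def track(self, event):
        self._events.append(event)

    def flush(self):
        """Hand the buffered events to the caller and empty the buffer."""
        drained = list(self._events)
        self._events.clear()
        return drained
```

Moving the collection from the class to the instance and giving it a `maxlen` addresses both problems in the original: state shared across requests and growth without a limit.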

**Variations:**
- `"class-level or global state that persists across requests"` -- finds request-spanning accumulation
- `"caches without size limits or eviction policies"` -- finds unbounded caches
- `"resource cleanup and connection management"` -- finds resource leak patterns

---

### Recipe 43: Unnecessary Computation Finder

**When to Use:** A function is slow and you suspect it is doing work it does not need to -- recomputing values that could be cached, re-fetching data that was already loaded, or performing operations on data that has not changed.

**The Query:**

```
where is the same computation or data fetch performed repeatedly
```

**What to Look For:** Functions called multiple times with the same arguments in a single request cycle, data that is fetched at the start and then fetched again later in the same flow, and transformations that run on every call when the result could be memoized.

**Example:**

```bash
pyckle search "where is the same computation or data fetch performed repeatedly"
```

```
Results (4 hits, 9ms):
  src/services/pricing_service.py   get_exchange_rate() called per item  0.93
  src/api/dashboard.py              user permissions checked 4 times     0.89
  src/services/shipping.py         address geocoding per order item      0.85
  src/templates/render.py          config reloaded per template render   0.80
```

Four redundancy patterns. The exchange rate is fetched from an external API for every line item in an order -- it should be fetched once per currency pair per request. User permissions are checked four times in the same endpoint handler. Address geocoding runs per item when all items ship to the same address. Config is reloaded on every template render.

```python
# src/services/pricing_service.py
def calculate_order_total(items, target_currency):
    total = 0
    for item in items:
        rate = get_exchange_rate(item.currency, target_currency)  # API call per item
        total += item.price * rate
    return total

# Fixed version
def calculate_order_total(items, target_currency):
    rates = {}
    total = 0
    for item in items:
        if item.currency not in rates:
            rates[item.currency] = get_exchange_rate(item.currency, target_currency)
        total += item.price * rates[item.currency]
    return total
```

**Variations:**
- `"external API calls that could be batched or cached"` -- finds remote call redundancy
- `"repeated database queries with the same parameters in one request"` -- finds DB-level duplication
- `"computation that could be memoized or precomputed"` -- finds memoization opportunities

---

### Exercise

> **Try This**
>
> Run Recipe 38 (N+1 Query Detector) and Recipe 39 (Unbounded Collection Finder) against your codebase:
>
> 1. How many N+1 query patterns did you find?
> 2. How many "load all" patterns exist?
> 3. For each one found, estimate the current impact (is the table small or large?) and the future impact (will it grow?)
>
> Pick the worst offender and sketch a fix. How many lines of code would it take to add pagination or eager loading?

---

### Key Takeaways

- N+1 queries are the single most common performance problem in ORM-based applications -- the fix is usually one line of eager loading
- "Load all" patterns are time bombs -- they work fine at small scale and fail catastrophically at large scale
- Caching decisions should match access patterns -- cache things that are read often and change rarely
- Memory leaks in long-running applications often come from class-level state that accumulates across requests
- Redundant computation is often invisible until you trace a single request path and count how many times the same data is fetched

---

## Chapter 7: Security Queries

### Chapter Overview

Security vulnerabilities are often hiding in plain sight -- unvalidated user input, SQL built from strings, hardcoded secrets, and missing permission checks. These seven recipes surface the most common security concerns. They do not replace a professional security audit, but they catch the patterns that auditors look for first.

---

### Recipe 44: Input Validation Auditor

**When to Use:** You want to verify that all user-facing inputs are validated before being used in database queries, file operations, or business logic.

**The Query:**

```
where is user input used without validation or sanitization
```

**What to Look For:** Request parameters used directly in queries, file paths constructed from user input, user-supplied data passed to `eval()` or `exec()`, and any place where data crosses a trust boundary without checking.

**Example:**

```bash
pyckle search "where is user input used without validation or sanitization"
```

```
Results (5 hits, 10ms):
  src/api/search.py                  query param used directly           0.93
  src/api/files.py                   filename from user in path join     0.90
  src/api/reports.py                 date range from query string        0.86
  src/admin/execute.py               user input in eval()                0.83
  src/api/export.py                  format param unchecked              0.78
```

Five input validation gaps. The search endpoint passes the query parameter directly to the database. The file endpoint constructs a path from user input (path traversal risk). The report endpoint takes date ranges without validation. An admin endpoint uses `eval()` on user input (code injection). The export endpoint accepts an unchecked format parameter.

```python
# src/api/files.py
@router.get("/files/{filename}")
def get_file(filename: str):
    path = os.path.join(UPLOAD_DIR, filename)  # Path traversal: ../../../etc/passwd
    return FileResponse(path)

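# Fixed version
@router.get("/files/{filename}")
def get_file(filename: str):
    safe_name = secure_filename(filename)  # e.g. werkzeug.utils.secure_filename
    base = os.path.realpath(UPLOAD_DIR)
    path = os.path.realpath(os.path.join(base, safe_name))
    # Compare resolved paths: startswith() on the unresolved join
    # would still accept "uploads/../secret"
    if not safe_name or os.path.commonpath([base, path]) != base:
        raise HTTPException(403, "Access denied")
    return FileResponse(path)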
```

**Variations:**
- `"where are request parameters used directly in database queries"` -- targets SQL injection risk
- `"file paths constructed from user-provided input"` -- targets path traversal
- `"eval, exec, or dynamic code execution with user input"` -- targets code injection

---

### Recipe 45: SQL Injection Pattern Finder

**When to Use:** You want to find places where SQL queries are built by string concatenation or string formatting instead of parameterized queries.

**The Query:**

```
where are SQL queries built from string concatenation or formatting
```

**What to Look For:** String concatenation with SQL keywords, f-strings or `.format()` in database queries, and any place where user input could be interpolated into a SQL string.

**Example:**

```bash
pyckle search "where are SQL queries built from string concatenation or formatting"
```

```
Results (4 hits, 8ms):
  src/reports/revenue_report.py     f"SELECT ... WHERE date = '{date}'"  0.94
  src/admin/data_export.py          "SELECT * FROM " + table_name        0.91
  src/search/full_text.py           f"... LIKE '%{query}%'"              0.87
  src/migrations/legacy_import.py   cursor.execute(f"INSERT ...")        0.82
```

Four SQL injection vectors. The revenue report interpolates a date into SQL. The admin export concatenates a table name. The full-text search interpolates a user query into a LIKE clause. The legacy import uses f-strings in INSERT statements.

```python
# src/reports/revenue_report.py
def get_revenue(date: str):
    query = f"SELECT SUM(amount) FROM orders WHERE date = '{date}'"
    # An attacker sends: date = "'; DROP TABLE orders; --"
    return db.execute(query)

# Fixed version
def get_revenue(date: str):
    query = "SELECT SUM(amount) FROM orders WHERE date = :date"
    return db.execute(query, {"date": date})
```

**Variations:**
- `"raw SQL with user-supplied parameters"` -- broader than just string concatenation
- `"database queries not using parameterized statements"` -- focuses on the fix pattern
- `"dynamic table or column names from user input"` -- catches indirect injection
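Bind parameters carry values, not identifiers, so the table-name concatenation in `data_export.py` cannot be fixed with a placeholder. A common approach is an allowlist. The sketch below assumes SQLite and a hypothetical `EXPORTABLE_TABLES` set; the same idea applies to dynamic column names.

```python
import sqlite3

# Hypothetical allowlist: the only tables the export is permitted to read
EXPORTABLE_TABLES = {"orders", "customers"}

def export_table(conn, table_name, min_id=0):
    """Select rows from an allowlisted table; values still go through placeholders."""
    if table_name not in EXPORTABLE_TABLES:
        raise ValueError(f"table not exportable: {table_name!r}")
    # Safe to interpolate: table_name is now one of our own literal strings
    sql = f'SELECT * FROM "{table_name}" WHERE id >= ?'
    return conn.execute(sql, (min_id,)).fetchall()
```

Anything not on the list is rejected before a query string is ever built, so there is nothing for an attacker to escape out of.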

---

### Recipe 46: Secrets and Credentials Scanner

**When to Use:** You want to find hardcoded secrets, API keys, passwords, or other credentials in the codebase. These should be in environment variables or a secrets manager, not in source code.

**The Query:**

```
where are secrets, API keys, or passwords hardcoded in the code
```

**What to Look For:** String literals that look like API keys, password assignments, connection strings with embedded credentials, and any file that contains tokens or secrets outside of a secrets management system.

**Example:**

```bash
pyckle search "where are secrets, API keys, or passwords hardcoded in the code"
```

```
Results (5 hits, 9ms):
  src/config/settings.py             API_KEY = "sk_live_..."             0.94
  src/services/email_service.py     SMTP_PASSWORD = "..."               0.91
  src/integrations/stripe.py        stripe.api_key = "sk_test_..."      0.87
  src/db/connection.py              DATABASE_URL with password           0.83
  tests/conftest.py                 test credentials                     0.76
```

Five hits, four of them in application source files. The settings file has a live API key. The email service has an SMTP password. The Stripe integration has a test key (which could be swapped for a live key the same way). The database connection has an embedded password. The test credentials are less concerning but should still come from a test-specific config.

```python
# src/config/settings.py (BEFORE)
API_KEY = "sk_live_4eC39HqLyjWDarjtT1zdp7dc"  # Hardcoded secret  # pragma: allowlist secret

# src/config/settings.py (AFTER)
API_KEY = os.getenv("API_KEY")
if not API_KEY:
    raise ValueError("API_KEY environment variable not set")
```

**Variations:**
- `"connection strings with embedded passwords"` -- targets database credential leaks
- `"API keys or tokens in source code"` -- targets third-party service credentials
- `"environment variables that should contain secrets but have defaults"` -- finds fallback credentials

---

### Recipe 47: Authentication Bypass Finder

**When to Use:** You want to verify that all protected endpoints actually enforce authentication. A missing auth decorator or middleware check is an open door.

**The Query:**

```
endpoints or routes that might be missing authentication checks
```

**What to Look For:** Route handlers without auth decorators, endpoints that skip middleware, debug or test endpoints that remain in production code, and any conditional logic that allows unauthenticated access.

**Example:**

```bash
pyckle search "endpoints or routes that might be missing authentication checks"
```

```
Results (5 hits, 10ms):
  src/api/health.py                  /health — intentionally public      0.91
  src/api/webhooks.py               /webhooks/stripe — no user auth      0.88
  src/api/debug.py                  /debug/info — no auth decorator      0.85
  src/api/internal.py               /internal/metrics — no auth          0.81
  src/api/v2/users.py               /api/v2/users — missing decorator   0.77
```

Five endpoints without user authentication, but not five bugs. The health check is intentionally public. The webhook endpoint uses a different auth mechanism (signature verification, not user auth -- confirm it is implemented). The debug endpoint should not exist in production. The internal metrics endpoint needs service-to-service auth. The v2 users endpoint is missing its auth decorator entirely -- this is a security bug.

```python
# src/api/v2/users.py
@router.get("/api/v2/users")
# Missing: @require_auth
async def list_users():
    return await user_service.get_all()  # Returns all user data, no auth
```

**Variations:**
- `"routes without auth middleware or decorators"` -- direct auth gap scan
- `"debug or test endpoints in production code"` -- finds exposed debug tools
- `"conditional authentication bypass logic"` -- finds intentional but risky bypasses

---

### Recipe 48: Sensitive Data Exposure Finder

**When to Use:** You want to verify that sensitive data (passwords, SSNs, tokens) is not being logged, returned in API responses, or stored in plaintext.

**The Query:**

```
where is sensitive data logged, returned in responses, or stored in plaintext
```

**What to Look For:** Log statements that include passwords or tokens, API responses that return sensitive fields, database columns that store sensitive data without encryption, and any serialization that includes fields that should be redacted.

**Example:**

```bash
pyckle search "where is sensitive data logged, returned in responses, or stored in plaintext"
```

```
Results (5 hits, 10ms):
  src/middleware/logging.py          logs full request body                0.93
  src/api/users.py                   returns password_hash in response    0.89
  src/models/user.py                 ssn stored as plain String           0.85
  src/services/auth_service.py      logs token value on creation          0.81
  src/exports/user_csv.py           includes email and phone in export    0.76
```

Five exposure patterns. The logging middleware logs full request bodies -- which includes login passwords. The user API returns the password hash. The SSN is stored as a plain string (should be encrypted at rest). The auth service logs newly created tokens. The CSV export includes personally identifiable information.

```python
# src/middleware/logging.py
class LoggingMiddleware:
    async def __call__(self, request, call_next):
        body = await request.body()
        logger.info(f"Request: {request.method} {request.url} Body: {body}")
        # Logs: "Body: {"email": "user@example.com", "password": "hunter2"}"
```

```python
# Fixed version
REDACTED_FIELDS = {"password", "token", "secret", "ssn", "credit_card"}

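class LoggingMiddleware:
    async def __call__(self, request, call_next):
        body = await request.body()
        try:
            sanitized = redact_sensitive_fields(json.loads(body), REDACTED_FIELDS)
        except ValueError:
            # Empty bodies (most GETs) and non-JSON payloads cannot be inspected
            sanitized = "<non-JSON body omitted>"
        logger.info(f"Request: {request.method} {request.url} Body: {sanitized}")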
```
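`redact_sensitive_fields` is not defined in the excerpt above. One possible shape for it, as a sketch: the recursive walk, the case-insensitive key match, and the `[REDACTED]` mask are assumptions, and the field list should match whatever your API actually accepts.

```python
import json

REDACTED_FIELDS = {"password", "token", "secret", "ssn", "credit_card"}

def redact_sensitive_fields(data, fields, mask="[REDACTED]"):
    """Return a copy of data with the values of sensitive keys masked, at any depth."""
    if isinstance(data, dict):
        return {
            key: mask if str(key).lower() in fields
            else redact_sensitive_fields(value, fields, mask)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [redact_sensitive_fields(item, fields, mask) for item in data]
    return data  # scalars pass through unchanged

body = b'{"email": "user@example.com", "password": "hunter2"}'
print(redact_sensitive_fields(json.loads(body), REDACTED_FIELDS))
```

Matching on key names only catches secrets that arrive under predictable names. Free-text fields that happen to contain a token still need a separate check.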

**Variations:**
- `"what user data is included in log output"` -- audits logging for PII
- `"API responses that include fields that should be hidden"` -- audits response serialization
- `"plaintext storage of passwords, tokens, or personal data"` -- audits data-at-rest protection

---

### Recipe 49: Dependency Vulnerability Surface

**When to Use:** You want to understand what third-party code the application depends on and where it is used, so you can assess the impact of a vulnerability in any one dependency.

**The Query:**

```
where are third-party libraries used and what do they do
```

**What to Look For:** Import statements for external libraries, the operations performed with each dependency, and how deeply each dependency is integrated. A vulnerability in a library you use for one utility function has a smaller blast radius than one you use throughout the application.

**Example:**

A CVE is published for the `pyyaml` library. You need to know where it is used in the codebase.

```bash
pyckle search "where is yaml parsing used and what data is it parsing"
```

```
Results (4 hits, 8ms):
  src/config/loader.py               yaml.safe_load() config files       0.93
  src/api/import_endpoint.py        yaml.load() user-uploaded files      0.90
  src/migrations/data_import.py     yaml.safe_load() seed data           0.85
  tests/fixtures/setup.py           yaml.safe_load() test fixtures       0.77
```

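Four usage sites. Three use `safe_load()` (the secure method). One uses `yaml.load()` on user-uploaded files -- that is the dangerous one. `yaml.load()` with an unsafe loader can construct arbitrary Python objects, and so execute arbitrary code, during deserialization. On PyYAML 6.0 and later, `yaml.load()` without a `Loader` argument raises a `TypeError`, so on those versions the pattern to hunt for is an explicit `Loader=yaml.Loader` or a call to `yaml.unsafe_load()`. Even without the CVE, this is a critical vulnerability.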

```python
# src/api/import_endpoint.py (DANGEROUS)
def import_data(file):
    data = yaml.load(file.read())  # Arbitrary code execution risk

# Fixed version
def import_data(file):
    data = yaml.safe_load(file.read())  # Safe deserialization only
```

**Variations:**
- `"all usage of [specific library] in the codebase"` -- scopes to one dependency
- `"deserialization of untrusted data"` -- finds deserialization vulnerabilities broadly
- `"where does the application process user-uploaded files"` -- finds file upload attack surface

---

### Recipe 50: CORS and Cross-Origin Policy Auditor

**When to Use:** You want to verify that the application's cross-origin resource sharing (CORS) policy is correctly configured and not overly permissive. Misconfigured CORS can expose your API to unauthorized cross-origin requests from malicious websites.

**The Query:**

```
where is CORS configured and what origins are allowed
```

**What to Look For:** CORS middleware configuration, allowed origin lists, wildcard origins (`*`), credential sharing settings, and any per-route CORS overrides. A wildcard origin with `allow_credentials=True` is a critical vulnerability.

**Example:**

```bash
pyckle search "where is CORS configured and what origins are allowed"
```

```
Results (4 hits, 8ms):
  src/middleware/cors.py             CORSMiddleware configuration         0.94
  src/config/security.py            ALLOWED_ORIGINS list                  0.90
  src/api/public/routes.py          per-route CORS override              0.85
  src/api/webhooks.py               webhook CORS policy                   0.79
```

The main CORS middleware reads from a config list. But check the actual configuration:

```python
# src/middleware/cors.py
app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.ALLOWED_ORIGINS,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# src/config/security.py
ALLOWED_ORIGINS = os.getenv("ALLOWED_ORIGINS", "*").split(",")
```

The default value for `ALLOWED_ORIGINS` is `"*"` -- any origin is allowed. Combined with `allow_credentials=True`, this means any website can make authenticated requests to the API using the user's cookies. If the environment variable is not set in production (a common oversight), the default creates a wide-open CORS policy.

```python
# Fixed version
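ALLOWED_ORIGINS = [o.strip() for o in os.getenv("ALLOWED_ORIGINS", "").split(",") if o.strip()]
# Reject both "not set" and an explicit wildcard, since credentials are enabled
if not ALLOWED_ORIGINS or "*" in ALLOWED_ORIGINS:
    raise ValueError("ALLOWED_ORIGINS must list explicit origins")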
```

**Variations:**
- `"cross-origin policy and credential sharing settings"` -- focuses on the credential risk
- `"per-route CORS overrides that differ from the global policy"` -- finds inconsistent policies
- `"what headers and methods are exposed to cross-origin requests"` -- audits the response surface

---

### Exercise

> **Try This**
>
> Run these three security recipes against your codebase:
>
> 1. Recipe 45 (SQL Injection Pattern Finder) -- are there any string-built queries?
> 2. Recipe 46 (Secrets and Credentials Scanner) -- are any secrets hardcoded?
> 3. Recipe 47 (Authentication Bypass Finder) -- are any endpoints missing auth?
>
> For each finding, classify it as: Critical (actively exploitable), Warning (potentially exploitable), or Informational (not ideal but low risk). How many findings fall into each category?

---

### Key Takeaways

- Input validation gaps are the most common security vulnerability -- every trust boundary crossing needs a check
- SQL injection still exists in modern applications wherever raw SQL is used with string formatting
- Hardcoded secrets are more common than anyone admits -- search for them regularly
- Missing authentication on endpoints is easy to miss in code review but trivial to find with semantic search
- Sensitive data exposure in logs is the most overlooked category -- logging middleware that records full request bodies will log passwords

---

# Part II: Meta-Patterns

---

## Chapter 8: Building Your Own Patterns

### Chapter Overview

The 50 recipes in chapters 1-7 are a starting point. Every codebase has its own idioms, its own problem patterns, and its own questions that come up repeatedly. This chapter teaches you how to identify, build, validate, and share query patterns specific to your team and your code. It expands on the pattern development process from Episode 19 of *Code Search, Decoded*.

---

### Why Custom Patterns Matter

The recipes in this book are generic by design -- they work across any codebase because they describe universal development tasks. But your codebase is not generic. It has its own domain language, its own architectural conventions, and its own recurring problems.

When a developer on your team discovers that searching for "circuit breaker state transitions in the payment service" consistently returns better results than "payment error handling," that discovery should not stay in one person's head. Custom query patterns capture team-specific search knowledge and make it available to everyone.

The difference between a team that uses ad-hoc searches and a team that uses shared patterns is the same as the difference between a team where one person knows where everything is and a team where everyone knows where everything is.

---

### Recognizing Pattern Opportunities

Patterns emerge from repetition. Watch for these signals:

**The same question asked different ways.** If three developers on your team search for "how does the retry logic work," "retry configuration for HTTP calls," and "where is backoff implemented for API requests," they are asking the same question with different words. The best phrasing -- the one that consistently returns the most useful results -- should become a pattern.

**Questions that follow a task.** Every time someone starts a code review, they run the same four queries. Every time someone debugs a payment issue, they start with the same two searches. These task-triggered sequences are workflow patterns, not just individual queries.

**Questions that require domain knowledge to phrase well.** In a payments codebase, "where is PCI data handled" is a much better query than "where is credit card data processed" because the search model picks up on "PCI" as a specific compliance concept. But a new team member would not know to search for "PCI." A pattern captures that domain-specific phrasing.

**Questions that a new hire would ask but a veteran would not.** These are navigation patterns -- they describe how to find common things in your specific codebase. "How does the order fulfillment pipeline work" is a pattern opportunity if your codebase has a non-obvious fulfillment flow.

---

### The Pattern Development Process

Do not write patterns on a whiteboard during a planning meeting. Develop them from real usage over time. Here is the process:

**Step 1: Observe (2 weeks).** Pay attention to the queries you and your teammates write during normal work. Keep a scratch file. Note three things for each query: the query text, whether it returned good results, and the task that prompted it. Do not filter -- write down queries that failed too.

```
2026-03-01 debugging: "where does the timeout get set for downstream calls" -- good results, 4 hits
2026-03-01 debugging: "timeout config" -- too vague, 12 irrelevant hits
2026-03-03 review: "what else uses the BatchProcessor interface" -- great results, found 3 implementations
2026-03-04 onboarding: "how does the queue consumer handle poison messages" -- good, needed domain term "poison message"
```

**Step 2: Extract (1 session).** After two weeks, review your scratch file. Group queries by task type (debugging, review, onboarding, etc.). For each group, identify the queries that worked well and extract the reusable structure.
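If the scratch file follows the line format shown in Step 1, the grouping can be scripted. A minimal sketch, assuming exactly that format (date, task, quoted query, ` -- `, outcome); `group_by_task` is a hypothetical helper, and lines that do not match are skipped.

```python
import re
from collections import defaultdict

# Matches scratch-file lines: date, task, quoted query, free-form outcome
ENTRY = re.compile(r'^(\d{4}-\d{2}-\d{2}) (\w+): "(.+)" -- (.+)$')

def group_by_task(lines):
    """Group (query, outcome) pairs by the task that prompted them."""
    groups = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:  # skip blank or free-form lines
            _date, task, query, outcome = match.groups()
            groups[task].append((query, outcome))
    return dict(groups)

log = [
    '2026-03-01 debugging: "where does the timeout get set for downstream calls" -- good results, 4 hits',
    '2026-03-01 debugging: "timeout config" -- too vague, 12 irrelevant hits',
    '2026-03-03 review: "what else uses the BatchProcessor interface" -- great results, found 3 implementations',
]
for task, entries in group_by_task(log).items():
    print(task, len(entries))
```

Keeping the failed queries in each group is deliberate: seeing "timeout config" next to the phrasing that worked shows what the reusable structure needs to include.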

"Where does the timeout get set for downstream calls" becomes "where does the timeout get set for {feature_or_service}." The specific service is the parameter. Everything else is the pattern.

"What else uses the BatchProcessor interface" becomes "what else uses the {interface_name} interface." The interface name is the parameter.

**Step 3: Validate (1 session).** Test each candidate pattern against at least three different parameter values from your codebase. A pattern that works for `BatchProcessor` but fails for `EventPublisher` needs refinement.

Common refinements:
- Adding context words: "what else uses {interface_name} interface **and how**" returns richer results
- Removing specificity: "what else uses {interface_name}" works better than "what classes implement {interface_name}" when some implementations use composition instead of inheritance
- Adding domain terms: "how does {service} handle **errors and retries**" works better than "error handling in {service}" because "retries" surfaces the retry-specific code

If a pattern does not work consistently after two rounds of refinement, discard it. Five reliable patterns are worth more than twenty fragile ones.
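To make the three-value check in Step 3 routine, you can expand a candidate template into concrete queries before running them. A sketch, assuming single-parameter templates; `placeholders` and `expand` are hypothetical helpers, and `QueueConsumer` is a made-up third value.

```python
from string import Formatter

def placeholders(template):
    """Names of the {parameters} a template expects, in order of appearance."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]

def expand(template, values_by_param):
    """Render one query per parameter value so each can be run and judged by hand."""
    (param,) = placeholders(template)  # this sketch handles single-parameter templates
    return [template.format(**{param: value}) for value in values_by_param[param]]

queries = expand(
    "what else uses the {interface_name} interface and how",
    {"interface_name": ["BatchProcessor", "EventPublisher", "QueueConsumer"]},
)
for query in queries:
    print(query)
```

Each printed line is a query to paste into `pyckle search`. Judging whether the results are good remains a manual step.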

**Step 4: Share (1 PR).** Add validated patterns to your repository's `.pyckle/queries.toml`:

```toml
[queries.timeout-config]
description = "Find where timeouts are configured for a service"
template = "where does the timeout get set for {service}"

[queries.interface-users]
description = "Find all implementations and consumers of an interface"
template = "what uses the {interface_name} interface and how"

[queries.poison-messages]
description = "How a queue consumer handles bad messages"
template = "how does {consumer} handle poison messages or failed processing"
```

Write a one-line description for each pattern. The description should tell a developer *when* to use it, not *what* it does. "Find where timeouts are configured for a service" tells you this is for timeout-related investigations. The template tells you how.

Commit the file and open a PR. Let the team review the patterns just like they would review code.

---

### Pattern Categories for Your Team

Most teams develop patterns in these categories over the first few months:

**Navigation patterns** answer "where is this?" for your specific codebase. They encode knowledge about your project structure that is obvious to veterans but invisible to newcomers.

```toml
[queries.event-handler]
description = "Find the handler for a specific domain event"
template = "what handles the {event_name} event and what does it do"

[queries.config-for-feature]
description = "Find all configuration for a feature"
template = "configuration and environment variables for {feature}"
```

**Investigation patterns** answer "what is happening?" when debugging or reviewing. They encode the team's best search strategies for common problems.

```toml
[queries.state-mutation]
description = "Find everything that can change a piece of state"
template = "what can modify {state_name} and under what conditions"

[queries.data-flow]
description = "Trace data through the system"
template = "how does {data_type} flow from input to storage"
```

**Compliance patterns** answer "is this correct?" for regulatory, security, or architectural standards. They encode the checks that your team needs to run regularly.

```toml
[queries.pci-scope]
description = "Find code that handles PCI-scoped data"
template = "where is {data_type} handled in PCI scope including storage and transmission"

[queries.auth-check]
description = "Verify auth is applied to endpoints in a module"
template = "authentication and authorization for endpoints in {module}"
```

**Convention patterns** answer "how do we do this?" for your team's established patterns. They help enforce consistency without a style guide document.

```toml
[queries.service-pattern]
description = "Find the established pattern for services"
template = "how are service classes structured and what methods do they implement"

[queries.error-pattern]
description = "Find the established error handling pattern"
template = "how do similar services handle errors and what exceptions do they use"
```

Start with Navigation and Investigation. These cover 80% of the queries your team runs daily. Add Compliance and Convention patterns as your team's search habits mature.

---

### Patterns vs. Saved Searches

Episode 17 covered saved searches -- fixed bookmarks for specific queries. Patterns are different:

| | Saved Searches | Patterns |
|---|---|---|
| **Parameters** | None -- fixed query | Parameterized template |
| **Use case** | Same question every time | Same *type* of question, different target |
| **Example** | "How does authentication work" | "How does {feature} work" |
| **Best for** | Onboarding, stable systems | Investigation, debugging, review |

Use saved searches for questions that have one answer: "where is the auth flow" always returns the same results. Use patterns for questions that have many answers depending on the parameter: "what depends on {function}" returns different results every time.

In practice, many teams start with saved searches during onboarding setup (Episode 17) and migrate to patterns as they start using search for debugging and review (Episodes 14-15).

---

### Measuring Pattern Quality

A pattern is good if it meets three criteria:

**Consistency.** It returns useful results for at least three different parameter values. A pattern that only works for one input is a saved search pretending to be a pattern.

**Relevance.** The top 3-5 results are directly related to the query, with scores above 0.75. If the results require scrolling past irrelevant hits, the template wording needs refinement.

**Actionability.** The results help the user take a specific action -- fix a bug, write a review comment, understand a flow. Results that are technically relevant but do not help with the task are noise.

Track these informally. When a team member reports that a pattern "does not work well for X," investigate. The fix is usually adding a context word or removing an overly specific term from the template.
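If you prefer a slightly more structured check, the first two criteria can be expressed as a small script. This is a rough sketch, assuming you copy the result scores by hand from three runs of one pattern; the scores and the function names `send_invoice` and `apply_discount` are invented for illustration. Actionability is left out on purpose, since it needs a human judgment.

```python
SCORE_THRESHOLD = 0.75   # relevance bar from the text
TOP_K = 3                # lower end of "top 3-5 results"
MIN_GOOD_INPUTS = 3      # consistency bar from the text


def input_is_relevant(scores: list[float]) -> bool:
    """True when at least TOP_K results exist and the top TOP_K all beat the threshold."""
    top = sorted(scores, reverse=True)[:TOP_K]
    return len(top) == TOP_K and all(score > SCORE_THRESHOLD for score in top)


def evaluate_pattern(scores_by_input: dict[str, list[float]]) -> dict:
    """Report which parameter values produced relevant results."""
    good = [value for value, scores in scores_by_input.items() if input_is_relevant(scores)]
    return {"good_inputs": good, "consistent": len(good) >= MIN_GOOD_INPUTS}


# Hypothetical scores for one pattern run against three parameter values.
observed = {
    "calculate_tax": [0.91, 0.84, 0.79, 0.52],
    "send_invoice": [0.88, 0.81, 0.77],
    "apply_discount": [0.86, 0.71, 0.64],
}

report = evaluate_pattern(observed)
print(report)
```

In this made-up data only two of the three inputs clear the bar, so the pattern is not yet consistent and the template wording is worth another look.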

---

### The Feedback Loop

Patterns are living code. They evolve as the codebase evolves. Build a feedback loop:

**Weekly.** Notice which patterns you used this week. Did any of them miss? Jot it down in your scratch file so the monthly review has something to work from.

**Monthly.** Review the scratch file. Are there recurring queries that should be patterns? Are there patterns that consistently underperform? Add new patterns. Refine existing ones. Remove broken ones.

**Quarterly.** Do a pattern review during your team's codebase health check. Delete patterns that reference deprecated code structures. Add patterns for new architecture. Update descriptions to match current terminology.

A bad pattern is worse than no pattern because it teaches wrong search habits. If a pattern consistently returns poor results, remove it. The team will develop better ad-hoc queries than they will get from a bad template.
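The monthly review can be as simple as tallying hits and misses per pattern. Here is one possible sketch, assuming you record each use as a pattern name plus a yes/no on whether the results were useful. The log entries and both thresholds are assumptions chosen for illustration, not recommendations from Pyckle.

```python
from collections import Counter

# Hypothetical log entries: (pattern name, whether the results were useful).
usage_log = [
    ("blast-radius", True),
    ("blast-radius", True),
    ("blast-radius", False),
    ("test-coverage", False),
    ("test-coverage", False),
    ("test-coverage", False),
    ("schema-deps", True),
]

MIN_USES = 3         # assumption: ignore patterns with too little data
MAX_MISS_RATE = 0.5  # assumption: flag patterns that miss more than half the time


def review_candidates(log):
    """Return (pattern name, miss rate) for patterns that look underperforming."""
    uses = Counter(name for name, _ in log)
    misses = Counter(name for name, useful in log if not useful)
    flagged = []
    for name, count in uses.items():
        miss_rate = misses[name] / count
        if count >= MIN_USES and miss_rate > MAX_MISS_RATE:
            flagged.append((name, miss_rate))
    return flagged


print(review_candidates(usage_log))
```

A flagged pattern is a prompt to refine the template first and to remove it only if refinement does not help.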

---

### Workflow Composition

Individual patterns are useful. Composed workflows are powerful. A workflow is a sequence of patterns run together for a specific task:

```bash
# Code review workflow
review-check() {
    local func="$1"
    pyckle search --pattern blast-radius --function_name "$func"
    pyckle search --pattern test-coverage --feature "$func"
    pyckle search --pattern schema-deps --entity "$func"
    pyckle search --pattern related-changes --change "$func"
}
```

```bash
# Security audit workflow
security-check() {
    local module="$1"
    pyckle search --pattern input-validation --module "$module"
    pyckle search --pattern sql-construction --module "$module"
    pyckle search --pattern auth-check --module "$module"
    pyckle search --pattern secret-scan --module "$module"
}
```

Workflows compose patterns into task-oriented sequences. The individual patterns are reusable. The workflow encodes the order and combination that makes sense for a specific task. Document workflows in your team's `README` or onboarding guide -- they are the highest level of search knowledge codification.
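The same idea can be expressed as data rather than as a shell function, which makes workflows easier to list, review, and reorder. The sketch below only builds the command lines; it does not run them. The flag shape mirrors the `review-check` function above, and nothing beyond that is assumed about the Pyckle CLI. `build_commands` is a hypothetical helper.

```python
import shlex

# Each step is (pattern name, parameter name), taken from review-check above.
REVIEW_WORKFLOW = [
    ("blast-radius", "function_name"),
    ("test-coverage", "feature"),
    ("schema-deps", "entity"),
    ("related-changes", "change"),
]


def build_commands(workflow, target):
    """Return one argv list per step, in workflow order."""
    return [
        ["pyckle", "search", "--pattern", pattern, f"--{param}", target]
        for pattern, param in workflow
    ]


commands = build_commands(REVIEW_WORKFLOW, "calculate_tax")
for argv in commands:
    # shlex.join quotes arguments safely for display or copy-paste.
    print(shlex.join(argv))
```

Each argv list could be handed to `subprocess.run` if you want the script to execute the workflow instead of printing it.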

---

### Sharing Patterns Across Teams

If your organization has multiple teams with separate codebases, some patterns are universal and some are codebase-specific:

**Universal patterns** work across any codebase. The 50 recipes in this book are universal patterns. Keep a shared library of these in a common repository or internal documentation.

**Codebase-specific patterns** use domain language and conventions unique to one project. These live in the project's `.pyckle/queries.toml` and travel with the code.

**Organization-specific patterns** use domain language shared across the organization but not present in generic search. "PCI scope" patterns, "HIPAA compliance" patterns, or "our microservice naming convention" patterns fall here. Keep these in an organization-level config that individual projects can import.

The layering: universal patterns (this book) provide the foundation. Organization patterns add compliance and convention. Project patterns add domain specificity. Each layer builds on the one below.
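One way to picture the layering is as a merge in which more specific layers override more general ones. This is a sketch under stated assumptions: the text does not say how an organization-level config is imported or which layer wins on a name clash, so the "later layer wins" rule, the `resolve_patterns` helper, and every template except "what depends on {function}" are hypothetical.

```python
def resolve_patterns(layers):
    """Merge (layer name, patterns) pairs; later layers override earlier ones."""
    resolved = {}
    for layer_name, patterns in layers:
        for name, template in patterns.items():
            resolved[name] = {"template": template, "source": layer_name}
    return resolved


# Hypothetical pattern sets for each layer.
universal = {
    "blast-radius": "what depends on {function}",
    "test-coverage": "where is {feature} tested",
}
organization = {
    "pci-scope": "which code in {module} handles cardholder data",
}
project = {
    # The project rewords a universal pattern in its own domain language.
    "blast-radius": "which billing flows call {function}",
}

resolved = resolve_patterns([
    ("universal", universal),
    ("organization", organization),
    ("project", project),
])

for name, entry in sorted(resolved.items()):
    print(f"{name} ({entry['source']}): {entry['template']}")
```

Recording the source layer next to each template makes it easy to see which patterns a project has overridden and which it inherits unchanged.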

---

### Getting Started Checklist

If you are starting from zero, here is the shortest path to a minimum viable pattern library:

1. **Start with observation.** Use the recipes from this book for two weeks. Note which ones work best for your codebase and which need modification.

2. **Extract your first three patterns.** Take the three queries you use most often and parameterize them. These are your seed patterns.

3. **Validate against three inputs each.** Test each pattern with three different parameter values. Refine the templates.

4. **Commit and share.** Add the three patterns to `.pyckle/queries.toml`. Open a PR. Get team feedback.

5. **Build the feedback loop.** Check in monthly. Add one or two patterns per month based on real usage. Remove patterns that do not perform.

After three months, you will have 8-12 solid patterns. After six months, 15-20. That is enough to cover the vast majority of your team's daily search tasks. The goal is not to have the most patterns -- it is to have the right patterns.

---

### Exercise

> **Try This**
>
> Start your observation log today:
>
> 1. Create a scratch file -- `query-log.md` or a note in your preferred tool
> 2. For the next week, write down every semantic search query you run, along with: the task that prompted it, whether the results were useful, and any refinements you made
> 3. At the end of the week, review the log and identify one query that would make a good pattern
> 4. Parameterize it, test it against three inputs, and add it to your project's `.pyckle/queries.toml`
>
> You now have your first custom pattern. Repeat monthly.

---

### Key Takeaways

- Custom patterns capture team-specific search knowledge that generic recipes cannot provide
- The development process is observe, extract, validate, share -- not design by committee
- Five reliable patterns are worth more than twenty untested ones
- Patterns should be treated as code -- reviewed, tested, and maintained
- The feedback loop is essential -- bad patterns teach wrong habits and should be removed
- Workflow composition turns individual patterns into task-oriented search sequences

---


## Conclusion

You finished this book with something most developers never build deliberately: a searchable mental model of how code is structured, not just what it does. That's the capability you now have. Not "better grep skills." Not "familiarity with some tools." A systematic way to ask questions of a codebase and get answers worth acting on — whether you're debugging a production incident at 2am, onboarding to a new system in week one, or making an architectural call that will outlast your tenure on the team.

Three threads run through every chapter here, and they're worth naming explicitly.

The first is that precision in the query determines precision in the answer. This sounds obvious, but most developers search the way they talk — loosely, with synonyms, hoping the tool infers intent. The patterns in this book are specific because specificity is what separates a useful result from a list of noise you have to manually filter. "Authentication middleware" finds a lot of files. "JWT validation before route handler" finds the one you need. The discipline of narrowing the query before you run it is a habit that compounds. You get faster. You get more accurate. You stop settling for approximate answers.

The second thread is that code search is not a lookup operation — it's a reasoning process. Debugging queries don't find bugs; they surface the context that lets you reason about where bugs could exist. Security queries don't audit your codebase; they identify the surface area worth auditing. Architecture queries don't explain how a system works; they give you the evidence to build that explanation yourself. Every pattern in this book is a way of structuring your thinking into a form the tool can execute. The tool does the mechanical work. The reasoning is still yours.

The third thread is that the value compounds when patterns become habits, not when they're applied occasionally. A developer who runs a security query once before a big release gets some value. A developer who runs security queries as part of every code review gets a fundamentally different outcome — not because any single query is revelatory, but because the cumulative attention changes what problems ever reach production. The patterns in the onboarding chapter aren't just for new hires. The performance patterns aren't just for optimization sprints. The whole point of building your own patterns in the final chapter is that the library should grow with your specific domain, your specific codebase, your specific failure modes.

Monday morning: pick one query pattern from the chapter most relevant to what you're working on right now. Run it against your actual codebase. Don't do this as an exercise — do it to answer a real question you have about the code. See what the result tells you. Then modify the query based on what you found and run it again. That iteration — run, observe, refine — is the core loop. Do it once with real stakes and you'll understand what this book was actually teaching better than you understood it from reading.

Here's the most common reason people don't apply this: they think they need a better setup first. A fancier tool, a properly indexed codebase, a clean project where the patterns will work more elegantly. This is a delay tactic your brain generates to avoid the cognitive friction of changing how you work. The natural-language queries in this book are written for a semantic index, but the questions behind them are not tied to one. You can approximate them with keywords in grep, in ripgrep, in GitHub search, in whatever you have open right now, and that is enough to start. The setup is not the barrier. The barrier is the moment of switching from the habit you have to the habit you want, and the only way through that moment is through it.

The developers who get the most value from this material are not the ones who read it most carefully. They're the ones who used it to answer one real question before they finished the book. Then another. Then enough that the patterns stopped feeling like techniques and started feeling like instincts — the natural way they navigate code rather than a method they remember to apply.

What changes if you act on this is not subtle. You become faster at every phase of the development cycle — not because the queries save you minutes (they do), but because you stop accumulating the wrong kind of understanding. You stop building mental models based on what code you happened to read and start building them based on what questions you deliberately answered. Those models are more accurate, more durable, and more transferable when you move to the next codebase. You become the person on the team who actually knows where things are and why — not because you have more tenure, but because you've been asking better questions.

What stays the same if you don't is also not subtle. You keep navigating code the way most developers navigate it: by memory, by asking colleagues, by reading files sequentially until something looks familiar. That approach scales with seniority until it doesn't — until the codebase is too large, the team too distributed, the context too fragmented for institutional knowledge to carry the load. At that point the developers who built systematic search habits have a structural advantage that no amount of experience can fully compensate for.

The patterns exist. You have them now. The only question is whether you use them.

---

# Back Matter

---

## Appendix A: Glossary

| Term | Definition |
|------|-----------|
| Semantic search | A search technique that matches by meaning rather than exact text. "How does auth work" finds authentication code even if the word "auth" never appears in it. |
| Query | The text you type into a semantic search tool. Unlike grep, queries can be natural language phrases that describe intent. |
| Recipe | A reusable query template with context about when to use it, what to look for, and how to interpret results. |
| Pattern | A parameterized query template that can be reused with different inputs. "What depends on {function}" is a pattern; "what depends on calculate_tax" is an instance of that pattern. |
| Saved search | A fixed, non-parameterized query bookmarked for repeated use. Unlike patterns, saved searches always return the same results (for the same codebase state). |
| Blast radius | The set of code that is affected by a change. A function with many callers has a large blast radius. |
| N+1 query | A database access pattern where one query fetches a list, then N additional queries fetch related data for each item in the list. Usually fixable with eager loading or batch queries. |
| Trust boundary | A point in the code where data crosses from untrusted (user input, external API) to trusted (internal processing). Validation should happen at every trust boundary. |
| Cross-encoder | A neural model that scores the relevance of a query-document pair. Used in Pyckle's reranking stage to improve result quality after initial retrieval. |
| BM25 | A keyword-based ranking algorithm. Pyckle uses hybrid BM25 + semantic scoring for retrieval. |
| Eager loading | A database query technique that fetches related data up front, either in the same query (via joins) or in one batched follow-up query, instead of one query per item. The standard fix for N+1 patterns. |
| Dead code | Code that exists in the codebase but is never executed in production. May still have tests or config references. |
| Path traversal | A security vulnerability where user input is used to construct file paths, allowing access to files outside the intended directory. |
| Race condition | A bug that occurs when two concurrent operations access shared state without synchronization, producing unpredictable results depending on timing. |
| Middleware | Code that runs on every request before or after the main handler. Used for authentication, logging, error handling, and other cross-cutting concerns. |

---

## Appendix B: Tools & Resources

| Tool / Resource | URL | Purpose |
|----------------|-----|---------|
| Pyckle | https://pyckle.co | Semantic code search -- index your codebase and search by meaning, not text |
| Pyckle CLI | https://docs.pyckle.co/cli | Command-line interface for running queries, patterns, and saved searches |
| Pyckle API | https://api.pyckle.co | REST API for integrating semantic search into CI/CD, IDE plugins, and custom workflows |
| Code Search, Decoded (series) | https://pyckle.co/blog/series/code-search-decoded | 20-episode series covering the theory and practice of semantic code search |
| Vibe Coding, Real Debugging (ebook) | https://pyckle.co/ebooks/vibe-coding-debugging | Companion guide focused on debugging workflows for AI-generated code |

---

## Appendix C: Further Reading

- **Episode 14: Debugging with Semantic Context** -- The debugging workflow that recipes 1-9 expand upon. Covers the three-query debugging walkthrough and the three layers of debugging (where, what, why).
- **Episode 15: Code Review Prep** -- The four-query review workflow that recipes 10-16 expand upon. Covers the context gap in code review and how to close it before opening the diff.
- **Episode 17: Teaching a Junior Dev with Semantic Search** -- The self-service learning approach that recipes 17-23 expand upon. Covers saved searches for onboarding and the compound effect on team productivity.
- **Episode 19: Custom Query Patterns for Your Team** -- The pattern development process that Chapter 8 expands upon. Covers patterns vs. saved searches, the four-step development process, and pattern categories.
- **Episode 3: How Semantic Search Actually Works** -- The technical foundation. Covers embeddings, BM25, hybrid scoring, cross-encoder reranking, and AST boosting. Useful for understanding *why* certain query phrasings return better results.
- **Episode 7: Score Interpretation** -- How to read confidence scores in search results. Useful for understanding the 0.xx scores shown in recipe examples.

---

## About the Author

David Kelly Price is the founder of Pyckle, building AI context optimization tools for development teams. Background in AI/ML tooling, retrieval systems, and context routing for codebases. MBA in Finance -- analytical rigor applied to technical problems.

---

## About Pyckle

Pyckle is a semantic code search engine for development teams. It indexes your codebase and lets you search by meaning -- "how does authentication work" instead of grep for "auth." The pipeline uses hybrid BM25 and semantic scoring, AST-aware boosting, cross-encoder reranking, and adaptive thresholds to return the most relevant code for natural language queries.

Teams use Pyckle for debugging, code review, onboarding, refactoring, and architecture exploration. It integrates with CI/CD pipelines for automatic re-indexing, supports saved searches for onboarding, and enables parameterized query patterns for team-wide search knowledge sharing.

The API is available at api.pyckle.co. The CLI runs locally. The index stays current when CI is configured to rebuild it on every push.

---

*Code Search Patterns: 50 Query Recipes for Debugging, Reviews, Onboarding, and Architecture — Version 1.0 — March 2026*
*Published by Pyckle (pyckle.co)*

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*



