Append tool results to conversation history each iteration
DON'T
Anti-patterns
Parse natural language signals to detect completion
Use arbitrary iteration caps as primary stop mechanism
Check for assistant text content as completion indicator
Multi-Agent Orchestration (1.2 + 1.3)
Hub-and-spoke pattern
Coordinator manages ALL inter-subagent communication and error handling
Subagents have isolated context — they do not inherit coordinator history
Pass complete findings explicitly in each subagent's prompt
Use Task tool to spawn subagents; allowedTools must include
"Task"
Spawn parallel subagents by emitting multiple Task calls in a single coordinator response
(not across separate turns)
Coordinator prompts: specify goals + quality criteria, not step-by-step procedures
Risk
Narrow decomposition
If coordinator assigns subtasks that don't cover the full topic, downstream agents succeed but output is
incomplete. Design coordinator to map the full domain before assigning subtasks.
Session
Forking & resumption
--resume <name> continues a named session
fork_session branches from a shared baseline
If tool results are stale, start fresh with injected summary
When resuming, tell the agent which files changed
Enforcement vs Prompt-Based (1.4 + 1.5)
Programmatic (hooks)
Deterministic. Use when compliance is non-negotiable: identity verification, financial thresholds, policy
gates. Zero failure rate.
Prompt-based
Probabilistic. Non-zero failure rate. OK for guidance and suggestions. Never rely on this for critical
business logic.
SDK Hooks (PostToolUse)
Intercept tool results to normalize formats before model sees them (Unix timestamps, ISO 8601, numeric
codes)
Intercept outgoing tool calls to enforce rules (block refunds > $500, redirect to escalation)
Task Decomposition (1.6)
Prompt chaining
Fixed sequential steps. Best for predictable multi-aspect reviews. Example: analyze each
file, then run a cross-file integration pass.
Dynamic decomposition
Generate subtasks based on what is discovered at each step. Best for open-ended
investigation. Map structure first, then prioritize adaptively.
Tool Descriptions (2.1)
What a good tool description includes
Purpose (specific, not generic)
Expected input formats and example queries
Edge cases and what it does NOT handle
When to use it vs similar alternatives
DON'T
Overlapping descriptions
Two tools with near-identical descriptions cause misrouting. Fix: rename + rewrite to eliminate overlap.
System prompt wording can also create unintended tool associations.
DO
Split generic tools
Replace one vague tool with 2-3 purpose-specific tools with defined input/output contracts. E.g.,
analyze_document → extract_data_points, summarize_content,
verify_claim.
Structured Error Responses (2.2)
Error response must include:
errorCategory: transient / validation / permission / business
isRetryable: boolean
Human-readable description
For business violations: retriable: false + customer-friendly message
Access failure (timeout)
Needs retry logic. Propagate with structured context so coordinator can decide.
Valid empty result
Successful query, no matches. Do NOT treat as an error. Distinguish clearly.
Tool Distribution (2.3)
Too many tools = worse performance
18 tools vs 4-5 tools — decision complexity degrades reliability. Give each agent only the tools its role
needs. Agents with out-of-scope tools will misuse them.
tool_choice options
"auto" — model may return text or call a tool
"any" — model must call some tool (prevents conversational text)
{"type":"tool","name":"..."} — forces a specific named tool first
MCP Server Configuration (2.4)
Project-level (.mcp.json)
Shared via version control. Use for shared team tooling. Use ${ENV_VAR} expansion for auth
tokens — never commit secrets.
User-level (~/.claude.json)
Personal/experimental only. Not shared with teammates. Both levels can be active simultaneously.
MCP Resources
Expose content catalogs (issue summaries, DB schemas, doc hierarchies) as resources to reduce exploratory
tool calls. Use existing community MCP servers for standard integrations (e.g., Jira) — build custom only
for team-specific workflows.
Glob: find files by name/extension pattern (**/*.test.tsx)
Edit vs Read+Write
Edit: targeted modification using unique text match
If Edit fails (non-unique match), fall back to Read then Write
CLAUDE.md Hierarchy (3.1)
Three levels
~/.claude/CLAUDE.md — user-level: personal only, not version-controlled
.claude/CLAUDE.md or root CLAUDE.md — project-level: shared via git
Subdirectory CLAUDE.md — directory-level: scoped to that folder
@import syntax
Reference external files to keep CLAUDE.md modular. Import only the standards relevant to each package.
.claude/rules/
Topic-specific rule files (e.g., testing.md, api-conventions.md). Use YAML
frontmatter with paths: glob patterns for conditional loading only when editing matching
files.
Common mistake
User-level config doesn't share to teammates
If a new team member doesn't receive instructions, check whether they live in
~/.claude/CLAUDE.md instead of project-level. Use /memory to verify which files
are loaded in a session.
The same session that generated code is less effective at reviewing it (retains reasoning context, less
likely to question its own decisions). Use an independent review instance without the
generator's reasoning context.
Prompt Precision (4.1)
Specific criteria (works)
"Flag comments only when claimed behavior contradicts actual code behavior"
Vague instructions (don't work)
"Be conservative" / "Only report high-confidence findings" — no meaningful improvement in precision
False positive effect
High false-positive categories undermine trust in accurate ones. Temporarily disable noisy categories to
restore developer trust while improving those prompts separately. Define explicit severity criteria with
concrete code examples per severity level.
Few-Shot Prompting (4.2)
When few-shot is the most effective technique
Detailed instructions alone produce inconsistent output
Ambiguous scenarios where judgment must be demonstrated
Reducing hallucination in extraction tasks with varied formats
Showing model what to accept vs flag (false positive reduction)
Demonstrating correct handling of varied document structures
Structure of good few-shot examples
2-4 targeted examples covering ambiguous cases
Show reasoning: why this choice over plausible alternatives
Demonstrate desired output format (location, issue, severity, suggested fix)
Include varied document structures to enable generalization to novel patterns
Strict schemas prevent syntax errors but not semantic errors (line items not summing,
values placed in wrong fields)
Make fields optional/nullable when source may not contain that data — prevents
fabrication to satisfy required fields
Use "other" + detail field pattern for extensible enum categories
Add "unclear" enum value for genuinely ambiguous cases
Validation + Retry Loops (4.4)
Retry works for:
Format mismatches
Structural output errors
Schema compliance failures
Include on retry: original doc + failed extraction + specific validation error.
Retry won't fix:
Information absent from the source document
Data only in an external doc not provided
Detect these cases and route to human review. Don't loop indefinitely.
Key trap
Semantic error detection
Extract calculated_total alongside stated_total and flag mismatches for human
review. Never silently reconcile financial data or pick the "most likely correct" value automatically.
Batch Processing (4.5)
Use Message Batches API for:
Overnight reports, weekly audits
Nightly test generation
Any latency-tolerant workload
50% cost savings. Up to 24h processing window. No guaranteed latency SLA.
Do NOT use batch for:
Blocking pre-merge checks (devs wait for results)
Any workflow requiring fast response
Multi-turn tool calling (not supported)
Batch SLA formula
Worst-case total = max wait time + 24h batch window. Must be under SLA. Example: 30h SLA → submit every 4h
(4+24=28h, 2h buffer). 6h submission windows leave zero buffer.
Multi-Instance Review (4.6)
Independent review beats self-review
A model retains reasoning context from generation — less likely to question its own decisions. An
independent Claude instance (without prior context) catches more subtle issues than self-review instructions
or extended thinking. Split large reviews into per-file local passes + cross-file integration pass to avoid
attention dilution.
Context Window Management (5.1)
Lost in the middle
Models reliably process info at the beginning and end of long inputs. Middle sections
get attention gaps. Put key findings summaries first. Use explicit section headers for detailed middle
content.
Structured fact extraction
Extract transactional facts (amounts, dates, order numbers, statuses) into a persistent "case facts"
block included in each prompt — outside summarized history. Don't let these get condensed.
Trim verbose tool outputs
Tool results accumulate and consume tokens disproportionately to their relevance. Keep only relevant
fields (e.g., 5 return-relevant fields from an order lookup with 40+ fields).
Key trap
Long multi-issue sessions
Progressive summarization for resolved earlier issues — compress into narrative summaries, keep full
history only for the active issue. Simpler and more reliable than sidecar context layers or sliding windows.
Escalation Decision-Making (5.2)
Escalate when:
Customer explicitly requests a human — honor immediately, no investigation first
Policy is ambiguous or silent on the specific request (not just "complex")
Unable to make meaningful progress on the case
Key trap
Frustration without explicit escalation request
Acknowledge frustration + offer to resolve. Only escalate if the customer reiterates their
preference for a human. Never take unilateral action (e.g., process the refund anyway) without honoring
their stated preference first.
Unreliable escalation triggers — don't use:
Sentiment analysis / frustration detection (doesn't correlate with case complexity)
Self-reported LLM confidence scores (poorly calibrated on hard cases)
"Case complexity" heuristics without explicit criteria
Multiple customer matches
When tool returns multiple matches: ask for additional identifiers. Never select based on heuristics (e.g.,
"most recent" or "most likely").
Error Propagation in Multi-Agent Systems (5.3)
Structured error context to coordinator must include:
Failure type
What was attempted (query, parameters)
Any partial results obtained
Potential alternative approaches
Anti-pattern: silent suppression
Returning empty results as "success" prevents recovery and risks incomplete research outputs. Never mask
errors this way.
Anti-pattern: full workflow termination
Terminating the entire workflow on one subagent failure. Handle locally when possible; propagate with
context when not. Proceed with partial results.
Large Codebase Context (5.4)
Scratchpad files
Persist key findings across context boundaries. Agents reference scratchpad for subsequent questions to
counteract context degradation (model referencing "typical patterns" instead of actual classes found).
Crash recovery pattern
Each agent exports state to a known location. Coordinator loads a manifest on resume and injects state
into agent prompts. Don't require full re-exploration.
Key trap
Coordinator spawning subagents for simple follow-ups
If the coordinator already has findings in its context, do not spawn a subagent to re-ingest 80K tokens for
a simple summarization request. Subagent spawning is for complex analysis — not follow-up questions on
already-retrieved data.
/compact
Use to reduce context usage during extended exploration sessions when context fills with verbose discovery
output. Combine with Explore subagent to isolate verbose discovery phases.
Human Review Workflows (5.5)
Don't trust aggregate accuracy metrics
97% overall accuracy can mask poor performance on specific document types or fields. Validate accuracy by
document type and field segment before reducing human review. Use stratified random sampling to detect novel
error patterns.
Confidence-based routing
Output field-level confidence scores per extraction
Calibrate review thresholds using labeled validation sets
Route low-confidence or conflicting-source extractions to human review