As AI sessions extend into hundreds of turns—spanning days or even weeks—the 200,000 token context window becomes a finite, precious resource. Managing this window isn't just about truncation; it's about Recurrent Contextualization: the art of ensuring Claude retains mission-critical "Memory" while discarding the "Noise" that leads to performance degradation and hallucination.
In a complex judicial trial lasting months, no judge can remember every spoken word. If they tried to hold the entire "context" in their raw memory, they would become overwhelmed by administrative sidebars, irrelevant objections, and lunch breaks. Performance would degrade, and essential legal precedents would be "forgotten" in the noise.
1. The Charges (Pinned Context): The initial legal charges and court rules never change. They are the "System Prompt" that anchors every decision.
2. The Archive (Postgres/S3): Every word is recorded in a database for deep lookup (RAG), but isn't kept in the judge's active focus.
3. The Periodic Briefs (Summary Bridges): At the end of each session, a "Clerk" (a sub-agent like Haiku) summarizes the testimony into a 2-page brief for the judge to use the next day.
4. The Active Testimony (Sliding Window): The judge focuses intensely on the words spoken in the last 15 minutes to decide on immediate objections.
The Architect's Job: To build the "Automated Court Clerk" that decides which tokens are "Evidence" (Keep) and which are "Noise" (Prune). Without this, Claude will eventually suffer from Context Fatigue, leading to hallucinations and loss of instruction following.
Rather than sending raw history, elite architects implement a pre-processing pipeline. This ensures that every turn sent to the API is high-density and low-noise.
def prepare_context(raw_history, max_tokens=180000): # 1. Pinned Anchor: System instructions + Global State prompt = [get_system_block(), get_pinned_goals()] # 2. Semantic Pruning: Remove tool output blobs (e.g. 500 lines of logs) cleaned_history = [prune_technical_noise(turn) for turn in raw_history] # 3. Recurrent Summarization: If > 50 turns, collapse older turns if len(cleaned_history) > threshold: summary = call_haiku_to_summarize(cleaned_history[:-10]) prompt.append(summary) prompt.append(cleaned_history[-10:]) # Keep recent turns raw return prompt
Architects must choose a strategy based on the tradeoff between Token Cost, Latency, and Semantic Integrity.
| Method | Architectural Mechanism | Cognitive Impact | Best For |
|---|---|---|---|
| Context Pinning | Message 0 (System) + First 2 Turns are always injected. | Prevents "Instruction Drift." Claude remembers the goal forever. | Coding agents, complex workflows. |
| Sliding Window | Strictly keep Turns [N-15] to [N]. | "Goldfish memory." Loss of global context after ~15 turns. | Customer FAQs, one-off transactional queries. |
| Recursive Summary | Every 10 turns, the oldest 10 are replaced by a <context_brief>. |
Retains global context at lower resolution. | Creative writing, multi-day reasoning projects. |
| Vector-RAG History | Search history database for semantic matches to current turn. | Allows for "Infinite" effective context window. | Lifelong assistants, legal/medical research. |
For long-running engineering agents (like Claude Code), pure history is often a distraction. The most effective pattern is the Recurrent State Injection. This involves maintaining a structured "Mission Status" that is updated every time a major step is completed.
<running_brief> <current_objective>Migrating React 17 to React 18 in Auth Module.</current_objective> <completed_files> [App.js, index.js, Store.js] </completed_files> <detected_bugs> - Legacy CSS-in-JS library causing hydration mismatch. - StrictMode warning in UserProfile.js. </detected_bugs> <remaining_critical_tasks> 1. Update 'hydrate' to 'createRoot' in LegacyLogin.js. 2. Patch webpack config for SVG support. </remaining_critical_tasks> <session_metadata>Branch: 'feat/r18', Turns: 42, Last_Successful_Compile: '2025-03-29'</session_metadata> </running_brief>
By injecting this brief at the START of the context, you can evict 90% of the middle turns. Claude ignores the specific dialogue history and relies on the Consolidated Truth of the brief.
Claude's Prompt Caching is the architect's most powerful tool for multi-turn sessions. It allows you to keep 100,000+ tokens of context in memory for repeated hits at a fraction of the cost.
If you have a 100k token context and a user sends 10 messages:
- Without Caching: 1,000,000 input tokens paid at full price.
- With Caching: 100,000 input tokens + 900,000 "Cache Hits" (10% of cost).
Result: 90% Cost Reduction + 60% Latency Improvement (No reprocessing time).
In multi-agent systems (e.g. a "Planner" agent handing off to a "Coder" agent), the primary bottleneck is Context Leaking. You should never pass the "Planner's" raw history to the "Coder."
<context_brief>.Allowing "Chatty" sidebars to displace the core system objective in the sliding window. Fix: Use Context Pinning for system instructions.
Pruning a turn that contained a critical variable name or user requirement. Fix: Use Structured Summarization (keep a separate JSON of all mentioned entities).
Injecting a dynamic value (like Current Time or Session Duration) inside a cached block. Result: Every turn invalidates the cache, destroying ROI. Fix: Put dynamic metadata at the very END of the prompt.
Scenario: You are building a 24/7 coding assistant. After 100 messages, the agent starts forgetting the project's coding style rules (Snake_case vs CamelCase) established in the first message. Performance is also slowing down significantly.
Question: Which TWO architectural changes are most effective?
Correct Answers: A & B. Pinning maintains the "Golden Rules" (Style), while Summarization reduces the "Token Bloat" (Noise) that causes latency and semantic interference.