Context is the most precious resource in any LLM system. Learn to manage, compress, prioritise, and inject context strategically across your MCP server ecosystem — so every token counts and your agents stay sharp across long sessions.
The context window is the working memory of an LLM. For Claude 3.5 models, that is 200K tokens. Every message, tool result, and system prompt consumes tokens. Once the window fills, you must either truncate history (losing information) or summarise it (preserving meaning at lower cost). Understanding where tokens go is the first step to engineering them deliberately.
| Context Component | Typical Token Cost | Control Strategy |
|---|---|---|
| System prompt | 500–5,000 | Cache with cache_control; keep stable |
| Conversation history | Grows unbounded | Sliding window or rolling summary |
| Tool result (small) | 100–500 | Return concise summaries, not raw output |
| Tool result (large) | 10,000+ | Truncate + pointer pattern; paginate |
| Retrieved documents | 1,000–20,000 | Semantic ranking; top-K only |
| Structured memory | 200–1,000 | Inject as JSON; compress over time |
Budget = window_size − system_prompt − reserved_output. For example, 200K − 4K − 8K = 188K tokens available for history and tool results. Monitor this with usage.input_tokens in the Anthropic Messages API response (the Bedrock Converse API reports the same figure as usage.inputTokens).
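The budget formula can be wrapped in a small helper so the arithmetic lives in one place. This is a minimal sketch: the class name `ContextBudget` and its methods are hypothetical, and the default figures are simply the ones from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Hypothetical helper: tracks how much of the window is left to spend."""

    window_size: int = 200_000      # total context window, in tokens
    system_prompt: int = 4_000      # tokens used by the system prompt
    reserved_output: int = 8_000    # tokens held back for the model's reply

    @property
    def available(self) -> int:
        """Tokens left for conversation history plus tool results."""
        return self.window_size - self.system_prompt - self.reserved_output

    def utilisation(self, input_tokens: int) -> float:
        """Fraction of the whole window consumed by the current request."""
        return input_tokens / self.window_size

    def should_compact(self, input_tokens: int, threshold: float = 0.7) -> bool:
        """True once usage reaches the chosen fraction of the window."""
        return self.utilisation(input_tokens) >= threshold


budget = ContextBudget()
print(budget.available)                # 188000
print(budget.should_compact(150_000))  # True (75% of the window)
```

Feeding the helper the input-token count from each response gives a single place to decide when history needs trimming.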
When conversation history grows large, compaction replaces old turns with a compact summary injected at the start of the remaining history. Claude Code does this automatically. In your own MCP clients, implement a rolling compaction strategy that triggers when token usage crosses a threshold (e.g., 70% of the window, which is 140K tokens for a 200K window).
```python
import anthropic


class CompactingClient:
    def __init__(self, system: str = "You are a helpful assistant.",
                 threshold: int = 140_000):
        # Async client, so the async methods below do not block the event loop
        self.client = anthropic.AsyncAnthropic()
        self.messages: list[dict] = []
        self.system = system
        self.threshold = threshold  # input-token count that triggers compaction

    async def _count_tokens(self) -> int:
        # Use the Anthropic token-counting endpoint
        resp = await self.client.messages.count_tokens(
            model="claude-sonnet-4-5",
            system=self.system,
            messages=self.messages,
        )
        return resp.input_tokens

    async def compact(self) -> None:
        # Summarise roughly the first 70% of history; keep the rest verbatim
        cutoff = int(len(self.messages) * 0.7)
        # Move the cut back to a user turn so the kept tail starts with "user"
        while cutoff > 0 and self.messages[cutoff]["role"] != "user":
            cutoff -= 1
        if cutoff == 0:
            return  # nothing old enough to compact
        to_compact, keep = self.messages[:cutoff], self.messages[cutoff:]

        summary_resp = await self.client.messages.create(
            model="claude-haiku-4-5",  # use a cheap model to summarise
            max_tokens=2048,
            system="Summarise the conversation concisely, preserving all key "
                   "decisions, facts, and tool results.",
            # End on a user turn so the model writes a summary instead of
            # continuing the last assistant message
            messages=[*to_compact,
                      {"role": "user", "content": "Summarise the conversation so far."}],
        )
        summary_text = summary_resp.content[0].text

        # Rebuild: summary injection + recent history
        self.messages = [
            {"role": "user", "content": f"[CONTEXT SUMMARY]\n{summary_text}"},
            {"role": "assistant", "content": "Understood. Continuing with full context."},
            *keep,
        ]

    async def chat(self, user_msg: str) -> str:
        self.messages.append({"role": "user", "content": user_msg})
        if await self._count_tokens() > self.threshold:
            await self.compact()
            print("[Compacted context]")
        resp = await self.client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=self.system,
            messages=self.messages,
        )
        reply = resp.content[0].text
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```
Use a Haiku-class model for compaction: it costs a fraction of Sonnet's per-token price and is perfectly capable of summarising a conversation. Reserve Sonnet or Opus for the actual task turns.
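The context table lists a sliding window as the simpler alternative to a rolling summary. The sketch below shows one way to do it: keep the newest messages that fit a token budget and drop the rest. The function names are hypothetical, and `estimate_tokens` is a crude stand-in (about four characters per token) for a real token count, so treat the numbers as approximate.

```python
from typing import Callable

Message = dict  # {"role": "user" | "assistant", "content": str}


def estimate_tokens(message: Message) -> int:
    """Rough assumption: about 4 characters per token. Not a real tokenizer."""
    return max(1, len(message["content"]) // 4)


def sliding_window(
    messages: list[Message],
    max_tokens: int,
    count: Callable[[Message], int] = estimate_tokens,
) -> list[Message]:
    """Keep the newest messages that fit in max_tokens; older ones are dropped."""
    kept: list[Message] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    # Drop leading assistant turns so the window opens on a user message.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

Unlike compaction, this loses the dropped turns entirely, so it suits sessions where old turns carry little lasting state.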
Instead of injecting all available data into context, selective retrieval fetches only what's relevant for the current turn. This is the difference between a 200-token context injection and a 20,000-token one. Combine embedding-based similarity with metadata filtering for surgical precision.
```python
import asyncio
import json
from typing import Optional

import boto3
from fastmcp import FastMCP

mcp = FastMCP("context-retriever")
bedrock = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
KB_ID = "YOUR_KB_ID"


@mcp.tool()
async def retrieve_context(
    query: str,
    max_results: int = 5,
    filter_tag: Optional[str] = None,
) -> str:
    """Retrieve only the most relevant context chunks for the current query."""
    kwargs = {
        "knowledgeBaseId": KB_ID,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": max_results,
                "overrideSearchType": "HYBRID",  # semantic + keyword
            }
        },
    }
    if filter_tag:
        # Metadata filter: only chunks whose "tag" attribute matches
        kwargs["retrievalConfiguration"]["vectorSearchConfiguration"]["filter"] = {
            "equals": {"key": "tag", "value": filter_tag}
        }

    # boto3 is synchronous; run it in a thread so the event loop stays free
    resp = await asyncio.to_thread(bedrock.retrieve, **kwargs)
    chunks = resp["retrievalResults"]

    # Format compactly: score + excerpt only (no full document)
    lines = []
    for i, c in enumerate(chunks, 1):
        score = round(c.get("score", 0.0), 3)
        text = c["content"]["text"][:400]  # truncate each chunk to 400 chars
        src = c.get("location", {}).get("s3Location", {}).get("uri", "unknown")
        lines.append(f"[{i}] score={score} src={src}\n{text}")
    return "\n\n".join(lines) or "No relevant context found."
```
Retrieve only the highest-scoring K chunks. Typical sweet spot: K=3–5 for focused queries, K=8–10 for broad research tasks.
Combine vector similarity (semantic meaning) with keyword matching such as BM25. Bedrock Knowledge Bases supports this natively through the HYBRID search type, provided the underlying vector store supports it.
Filter by document tag, date, or author before ranking. Prevents retrieving irrelevant but semantically similar content from unrelated projects.
Return the first 400 chars per chunk, not the full document. The model can always call the tool again with a more specific query if needed.
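The four techniques above compose into a short pipeline: filter by metadata, rank by similarity, keep the top K, truncate each excerpt. The following is a minimal in-memory sketch of that pipeline, not how Bedrock implements it. The `Chunk` class, `select_context` function, and the toy two-dimensional embeddings are all hypothetical and used only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    tag: str                 # metadata used for pre-filtering
    embedding: list[float]   # toy vector; real embeddings have many dimensions


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_context(query_embedding, chunks, k=3, tag=None, max_chars=400):
    """Return up to k (score, excerpt) pairs, best match first."""
    # 1. Metadata filter: discard chunks from unrelated projects before ranking
    candidates = [c for c in chunks if tag is None or c.tag == tag]
    # 2. Rank the survivors by semantic similarity to the query
    scored = [(cosine(query_embedding, c.embedding), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # 3. Keep the top K and truncate each excerpt
    return [(round(score, 3), c.text[:max_chars]) for score, c in scored[:k]]
```

Filtering before ranking matters: a chunk from another project can score highly on similarity alone, and the filter is what keeps it out of the top K.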
Context injection is how and where you add information to the prompt. Position matters: information buried in the middle of a long context window is subject to "lost in the middle" degradation. Strategic injection keeps the most important facts close to where the model needs them.
```python
import json


def build_system_prompt(user_profile: dict, session_memory: dict) -> str:
    """
    Injection order (top-to-bottom = most to least important for Claude):
      1. Role + behaviour instructions (stable → cache-friendly)
      2. Session memory snapshot (changes per session)
      3. User profile facts (changes rarely)
      4. Tool usage guidelines (stable → cache-friendly)
    """
    return f"""You are an expert AWS assistant. Be concise and precise.

## Session Memory
Current task: {session_memory.get('current_task', 'none')}
Completed steps: {json.dumps(session_memory.get('steps_done', []))}
Key facts discovered: {json.dumps(session_memory.get('facts', {}))}

## User Profile
Name: {user_profile['name']}
AWS account: {user_profile['account_id']}
Preferred region: {user_profile['region']}
Expertise: {user_profile['level']}

## Tool Guidelines
- Always use retrieve_context before answering factual questions.
- Truncate large outputs; offer to paginate if user needs more.
- For destructive actions, confirm intent before calling the tool."""


def inject_tool_result(result: str, max_chars: int = 2000) -> str:
    """Truncate large tool results and append a pointer."""
    if len(result) <= max_chars:
        return result
    truncated = result[:max_chars]
    remaining = len(result) - max_chars
    return (
        f"{truncated}\n\n[... {remaining} chars truncated. "
        "Call tool again with offset param to see more.]"
    )
```
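The truncation pointer tells the model to call the tool again with an offset, so the tool side needs a matching pagination contract. Below is one possible shape for it. The function name `paginate` and the returned field names are hypothetical, chosen for illustration rather than taken from any MCP convention.

```python
from typing import Optional, TypedDict


class Page(TypedDict):
    content: str                 # the slice of the result for this call
    next_offset: Optional[int]   # offset for the next call, or None at the end
    remaining_chars: int         # how much is still unread after this page


def paginate(result: str, offset: int = 0, max_chars: int = 2000) -> Page:
    """Return one page of a large tool result, starting at `offset`."""
    if offset < 0 or max_chars <= 0:
        raise ValueError("offset must be >= 0 and max_chars must be > 0")
    content = result[offset : offset + max_chars]
    end = offset + len(content)
    has_more = end < len(result)
    return {
        "content": content,
        "next_offset": end if has_more else None,
        "remaining_chars": len(result) - end if has_more else 0,
    }
```

Returning `next_offset` explicitly saves the model from doing arithmetic on character counts, and a `None` value gives it an unambiguous stop signal.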
LLMs pay most attention to the beginning and end of context. Place your most critical instructions and recent facts at the top of the system prompt and the end of the conversation history. Avoid burying key data in the middle of a long tool result.
Long-running agents need memory that persists beyond the context window. There are four memory types, each with a different cost/recall trade-off. Most production agents combine all four.
| Memory Type | Storage | Scope | Best For |
|---|---|---|---|
| In-context | Conversation messages | Current session | Recent tool results, active task state |
| External episodic | DynamoDB / S3 | Cross-session | User preferences, past interactions |
| Semantic | Bedrock Knowledge Base | Global | Documents, FAQs, domain knowledge |
| Procedural | System prompt / tools | Global | How-to patterns, workflows, policies |
```python
import json
import time

import boto3

# `mcp` is the FastMCP server instance created earlier
table = boto3.resource("dynamodb").Table("agent-memory")
THIRTY_DAYS = 86400 * 30


@mcp.tool()
async def save_memory(key: str, value: str, session_id: str) -> str:
    """Persist a key-value fact to episodic memory (DynamoDB)."""
    table.put_item(Item={
        "pk": f"session#{session_id}",
        "sk": key,
        "value": value,
        "ttl": int(time.time()) + THIRTY_DAYS,  # 30-day TTL
    })
    return f"Saved {key} to episodic memory."


@mcp.tool()
async def recall_memory(session_id: str) -> str:
    """Retrieve all stored facts for this session."""
    resp = table.query(
        KeyConditionExpression="pk = :pk",
        ExpressionAttributeValues={":pk": f"session#{session_id}"},
    )
    facts = {item["sk"]: item["value"] for item in resp["Items"]}
    return json.dumps(facts, indent=2) if facts else "No memories stored."
```
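Recalled facts still cost tokens once injected, so it helps to cap how much memory goes back into the prompt. The sketch below renders facts as compact JSON, newest first, until a character budget is reached. The function `render_memory` and the `updated_at` field are assumptions for illustration; the DynamoDB items above do not store a timestamp other than the TTL.

```python
import json


def render_memory(facts: list[dict], max_chars: int = 800) -> str:
    """Render the newest facts as compact JSON without exceeding max_chars.

    Each fact is a dict with hypothetical keys: "key", "value", "updated_at".
    """
    selected: dict[str, str] = {}
    # Newest facts first, so stale ones are the first to fall out of budget
    for fact in sorted(facts, key=lambda f: f["updated_at"], reverse=True):
        candidate = {**selected, fact["key"]: fact["value"]}
        # Compact separators drop the whitespace json.dumps adds by default
        if len(json.dumps(candidate, separators=(",", ":"))) > max_chars:
            break
        selected = candidate
    return json.dumps(selected, separators=(",", ":"))
```

Dropping the oldest facts is only one policy. Facts that must never be lost are better kept in a separate, always-injected block.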
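Anthropic's prompt caching lets you mark up to 4 cache breakpoints per request (across tools, system prompt, and messages). Tokens read from the cache are billed at 10% of the normal input price, while the first write of a block costs somewhat more than normal input. For a system prompt that is reused across thousands of agent turns, this alone can cut the input cost of that prompt by 80–90%.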
```python
import anthropic

client = anthropic.Anthropic()

# stable_system_prompt, dynamic_session_context and messages are built elsewhere

# Enable prompt caching on the stable system prompt
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": stable_system_prompt,            # large, rarely changes
            "cache_control": {"type": "ephemeral"},  # cache this block
        },
        {
            "type": "text",
            "text": dynamic_session_context,  # not cached: changes each turn
        },
    ],
    messages=messages,
)

# Check cache performance
usage = response.usage
cache_hit = usage.cache_read_input_tokens or 0
cache_miss = usage.cache_creation_input_tokens or 0
cached_total = cache_hit + cache_miss
hit_rate = 100 * cache_hit / cached_total if cached_total else 0.0
print(f"Cache hit rate: {hit_rate:.1f}%, saving about "
      f"{cache_hit * 0.9:.0f} input tokens' worth of cost")
```
Keep cached blocks stable. Caching matches on the exact prefix, so editing or reordering a cached block, or anything that precedes it, invalidates the cache. Put user-specific data below cached blocks, never inside them.
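The savings claim can be checked with simple arithmetic. The sketch below compares the cost of a shared prefix with and without caching, measured in units of "one uncached input token". The function name is hypothetical, the 0.10 read multiplier is the figure quoted above, and the 1.25 write multiplier is an assumption to verify against current pricing. It also assumes the cache stays warm between turns.

```python
def prefix_cost(
    prefix_tokens: int,
    turns: int,
    read_multiplier: float = 0.10,   # cached reads: 10% of base input price
    write_multiplier: float = 1.25,  # assumed premium for the first cache write
) -> tuple[float, float, float]:
    """Return (uncached_cost, cached_cost, fractional_saving) for the prefix."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    uncached = float(prefix_tokens * turns)
    # One write on the first turn, then a cheap read on every later turn
    cached = prefix_tokens * write_multiplier + prefix_tokens * read_multiplier * (turns - 1)
    return uncached, cached, 1 - cached / uncached


uncached, cached, saving = prefix_cost(prefix_tokens=4_000, turns=1_000)
print(f"uncached={uncached:,.0f} cached={cached:,.0f} saving={saving:.1%}")
```

With a 4K-token prompt over 1,000 turns the saving on that prefix approaches 90%. With a single turn, caching costs more than it saves, because only the write premium is paid.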
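Log usage.input_tokens per turn. With prompt caching enabled, add cache_read_input_tokens and cache_creation_input_tokens to get the full prompt size, because input_tokens counts only the uncached part. Alert when a session exceeds 100K tokens: it indicates runaway tool results or missing compaction logic.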
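A per-session monitor for this can be very small. The class below is a hypothetical sketch: it takes the usage figures from each response, records the total prompt size, and reports whether the 100K alert threshold was crossed. Wiring it to a logger or an alarm is left out.

```python
from dataclasses import dataclass, field


@dataclass
class SessionTokenMonitor:
    """Hypothetical helper that tracks prompt size across the turns of a session."""

    alert_threshold: int = 100_000
    history: list[int] = field(default_factory=list)

    def record(self, input_tokens: int, cache_read: int = 0, cache_creation: int = 0) -> bool:
        """Store this turn's total prompt size; return True if it needs an alert."""
        total = input_tokens + cache_read + cache_creation
        self.history.append(total)
        return total > self.alert_threshold

    def growth_per_turn(self) -> float:
        """Average increase in prompt size per turn (0.0 with fewer than 2 turns)."""
        if len(self.history) < 2:
            return 0.0
        return (self.history[-1] - self.history[0]) / (len(self.history) - 1)


monitor = SessionTokenMonitor()
monitor.record(input_tokens=20_000)
monitor.record(input_tokens=30_000, cache_read=50_000)
alert = monitor.record(input_tokens=40_000, cache_read=70_000)
print(alert, monitor.growth_per_turn())
```

A steep growth rate is often a more useful early signal than the absolute threshold, since it flags a runaway tool result several turns before the window fills.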