📅 Day 29 · ⏱ 60 min · 🔥 Ascend · 🧠 Context

Advanced Context Engineering Patterns

Context is the most precious resource in any LLM system. Learn to manage, compress, prioritise, and inject context strategically across your MCP server ecosystem — so every token counts and your agents stay sharp across long sessions.

Every token in the context window is a decision. A naive agent that dumps full tool results into the conversation will hit the limit in minutes. An expert engineer curates context deliberately: summarising old turns, selectively retrieving relevant chunks, compressing large tool outputs, and injecting structured memory — keeping the model focused and cost-efficient.
🧠 Fundamentals

Context Window Fundamentals

The context window is the working memory of an LLM. For Claude 3.5 models, that's 200K tokens. Every message, tool result, and system prompt consumes tokens. Once the window fills, you must either truncate history (losing information) or summarise it (preserving meaning at lower cost). Understanding where tokens go is the first step to engineering them deliberately.

| Context Component | Typical Token Cost | Control Strategy |
|---|---|---|
| System prompt | 500–5,000 | Cache with cache_control; keep stable |
| Conversation history | Grows unbounded | Sliding window or rolling summary |
| Tool result (small) | 100–500 | Return concise summaries, not raw output |
| Tool result (large) | 10,000+ | Truncate + pointer pattern; paginate |
| Retrieved documents | 1,000–20,000 | Semantic ranking; top-K only |
| Structured memory | 200–1,000 | Inject as JSON; compress over time |
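The "sliding window" strategy for conversation history can be sketched as follows. This is a minimal illustration, not a production trimmer: `estimate_tokens` is a hypothetical helper that assumes roughly 4 characters per token, and a real client would use the provider's token-counting endpoint instead.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (an assumption, not a real tokenizer): ~4 chars per token."""
    return max(1, len(text) // 4)


def sliding_window(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose estimated token cost fits in `budget`."""
    kept: list[dict] = []
    used = 0
    # Walk backwards from the newest message and stop once the budget is spent.
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "user", "content": "c" * 400},       # ~100 tokens
]
recent = sliding_window(history, budget=250)
print([m["role"] for m in recent])  # ['assistant', 'user']
```

Note that a window cut purely by size can start on an assistant turn, as it does here; a real client may want to drop leading assistant messages so the history begins with a user turn.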
📐
Token Budget Formula

Budget = window_size − system_prompt − reserved_output. E.g. 200K − 4K − 8K = 188K available for history + tool results. Monitor this with usage.input_tokens in the Messages API response.
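The budget formula is simple enough to encode directly. The sketch below uses hypothetical function names (`token_budget`, `budget_used_fraction`) and reproduces the 200K example; the 150,000-token usage figure is made up for illustration.

```python
def token_budget(window_size: int, system_prompt: int, reserved_output: int) -> int:
    """Tokens left for history + tool results after fixed costs are subtracted."""
    budget = window_size - system_prompt - reserved_output
    if budget < 0:
        raise ValueError("system prompt and reserved output exceed the window")
    return budget


def budget_used_fraction(history_tokens: int, budget: int) -> float:
    """Share of the budget consumed by history + tool results so far."""
    return history_tokens / budget


available = token_budget(window_size=200_000, system_prompt=4_000, reserved_output=8_000)
print(available)  # 188000

# Illustrative only: 150,000 tokens of history and tool results so far.
print(f"{budget_used_fraction(150_000, available):.0%} of budget used")  # 80% of budget used
```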

✂️ Compaction

Compaction & Summarisation

When conversation history grows large, compaction replaces old turns with a compact summary that is injected back into the conversation as its first message. Claude Code uses this automatically. In your own MCP clients, implement a rolling compaction strategy triggered when token usage crosses a threshold (e.g., 70% of window).

import anthropic

class CompactingClient:
    def __init__(self, threshold=140_000):
        # Async client, so the API calls below can be awaited
        self.client = anthropic.AsyncAnthropic()
        self.messages = []
        self.system = ""  # set your system prompt here
        self.threshold = threshold  # token count that triggers compaction (70% of 200K)

    async def _count_tokens(self) -> int:
        # Use the Anthropic token counting endpoint
        resp = await self.client.messages.count_tokens(
            model="claude-sonnet-4-5",
            system=self.system,
            messages=self.messages
        )
        return resp.input_tokens

    async def compact(self):
        # Summarise roughly the first 70% of history; keep the rest verbatim
        cutoff = int(len(self.messages) * 0.7)
        # Move the cutoff forward so the kept tail starts on a user turn
        while cutoff < len(self.messages) and self.messages[cutoff]["role"] != "user":
            cutoff += 1
        to_compact = self.messages[:cutoff]
        keep = self.messages[cutoff:]
        if not to_compact or not keep:
            return  # nothing worth compacting yet

        # Send the old turns as one transcript so the summariser summarises
        # them instead of continuing the conversation
        transcript = "\n\n".join(
            f"{m['role'].upper()}: {m['content']}" for m in to_compact
        )
        summary_resp = await self.client.messages.create(
            model="claude-haiku-4-5",  # use cheap model to summarise
            max_tokens=2048,
            system="Summarise the conversation concisely, preserving all key decisions, facts, and tool results.",
            messages=[{"role": "user", "content": transcript}]
        )
        summary_text = summary_resp.content[0].text

        # Rebuild: summary injection + recent history
        self.messages = [
            {"role": "user", "content": f"[CONTEXT SUMMARY]\n{summary_text}"},
            {"role": "assistant", "content": "Understood. Continuing with full context."},
            *keep
        ]

    async def chat(self, user_msg: str):
        self.messages.append({"role": "user", "content": user_msg})
        if await self._count_tokens() > self.threshold:
            await self.compact()
            print("[Compacted context]")
        resp = await self.client.messages.create(
            model="claude-sonnet-4-5", max_tokens=4096,
            system=self.system, messages=self.messages
        )
        reply = resp.content[0].text
        self.messages.append({"role": "assistant", "content": reply})
        return reply
💡
Haiku for Summarisation

Use claude-haiku as the compaction model. It is several times cheaper per token than Sonnet (the exact ratio depends on the model versions, so check current pricing) and perfectly capable of summarising a conversation. Only invoke Sonnet/Opus for the actual task turns.

🔍 Retrieval

Selective Retrieval Patterns

Instead of injecting all available data into context, selective retrieval fetches only what's relevant for the current turn. This is the difference between a 200-token context injection and a 20,000-token one. Combine embedding-based similarity with metadata filtering for surgical precision.

from fastmcp import FastMCP
import boto3, json
from typing import Optional

mcp = FastMCP("context-retriever")
bedrock = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

KB_ID = "YOUR_KB_ID"

@mcp.tool()
async def retrieve_context(
    query: str,
    max_results: int = 5,
    filter_tag: Optional[str] = None
) -> str:
    """Retrieve only the most relevant context chunks for the current query."""
    kwargs = {
        "knowledgeBaseId": KB_ID,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": max_results,
                "overrideSearchType": "HYBRID"  # semantic + keyword
            }
        }
    }
    if filter_tag:
        kwargs["retrievalConfiguration"]["vectorSearchConfiguration"]["filter"] = {
            "equals": {"key": "tag", "value": filter_tag}
        }

    resp = bedrock.retrieve(**kwargs)
    chunks = resp["retrievalResults"]

    # Format compactly: score + excerpt only (no full document)
    lines = []
    for i, c in enumerate(chunks, 1):
        score = round(c["score"], 3)
        text = c["content"]["text"][:400]  # truncate each chunk to 400 chars
        src = c.get("location", {}).get("s3Location", {}).get("uri", "unknown")
        lines.append(f"[{i}] score={score} src={src}\n{text}")

    return "\n\n".join(lines) or "No relevant context found."

🎯 Top-K Retrieval

Retrieve only the highest-scoring K chunks. Typical sweet spot: K=3–5 for focused queries, K=8–10 for broad research tasks.

🔗 Hybrid Search

Combine vector similarity (semantic meaning) with BM25 keyword matching. Bedrock Knowledge Bases supports this natively with HYBRID mode.

🏷️ Metadata Filtering

Filter by document tag, date, or author before ranking. Prevents retrieving irrelevant but semantically similar content from unrelated projects.

✂️ Chunk Truncation

Return the first 400 chars per chunk, not the full document. The model can always call the tool again with a more specific query if needed.
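To see how filtering, top-K ranking, and truncation compose, here is a small in-memory sketch that needs no AWS access. The `Chunk` class, the `select_chunks` function, and the sample corpus are all hypothetical stand-ins for what a knowledge base would return; the point is only the order of operations (filter first, then rank, then truncate).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    score: float  # similarity score; higher means more relevant
    tag: str      # metadata used for filtering


def select_chunks(chunks: list[Chunk], k: int = 3,
                  tag: Optional[str] = None, max_chars: int = 400) -> list[Chunk]:
    # 1. Metadata filter: drop chunks from unrelated projects before ranking.
    candidates = [c for c in chunks if tag is None or c.tag == tag]
    # 2. Top-K: keep only the K highest-scoring candidates.
    top = sorted(candidates, key=lambda c: c.score, reverse=True)[:k]
    # 3. Truncation: return an excerpt of each chunk, not the full text.
    return [Chunk(c.text[:max_chars], c.score, c.tag) for c in top]


corpus = [
    Chunk("alpha" * 200, 0.90, "billing"),  # 1,000 chars, will be truncated
    Chunk("beta", 0.95, "hr"),              # high score but wrong tag
    Chunk("gamma", 0.50, "billing"),
    Chunk("delta", 0.70, "billing"),
]
for c in select_chunks(corpus, k=2, tag="billing"):
    print(c.score, len(c.text))
```

Without the tag filter, the highest-scoring "hr" chunk would take a top-K slot even though it belongs to an unrelated project.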

💉 Injection

Context Injection Strategies

Context injection is how and where you add information into the prompt. The position matters: information buried in the middle of a long context window is subject to "lost in the middle" degradation. Strategic injection keeps the most important facts close to where the model needs them.

def build_system_prompt(user_profile: dict, session_memory: dict) -> str:
    """
    Injection order (top-to-bottom = most to least important for Claude):
    1. Role + behaviour instructions  (stable → cache-friendly)
    2. Session memory snapshot        (changes per session)
    3. User profile facts             (changes rarely)
    4. Tool usage guidelines          (stable → cache-friendly)
    """
    return f"""You are an expert AWS assistant. Be concise and precise.

## Session Memory
Current task: {session_memory.get('current_task', 'none')}
Completed steps: {json.dumps(session_memory.get('steps_done', []))}
Key facts discovered: {json.dumps(session_memory.get('facts', {}))}

## User Profile
Name: {user_profile['name']}
AWS account: {user_profile['account_id']}
Preferred region: {user_profile['region']}
Expertise: {user_profile['level']}

## Tool Guidelines
- Always use retrieve_context before answering factual questions.
- Truncate large outputs; offer to paginate if user needs more.
- For destructive actions, confirm intent before calling the tool."""

def inject_tool_result(result: str, max_chars=2000) -> str:
    """Truncate large tool results and append a pointer."""
    if len(result) <= max_chars:
        return result
    truncated = result[:max_chars]
    remaining = len(result) - max_chars
    return f"{truncated}\n\n[... {remaining} chars truncated. Call tool again with offset param to see more.]"
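The truncation message tells the model to call the tool again with an offset. One way a tool might honour that is sketched below; `page_tool_result` and its return shape are assumptions for illustration, not part of any MCP API.

```python
def page_tool_result(result: str, offset: int = 0, max_chars: int = 2000) -> dict:
    """Return one page of a large result plus a pointer to the next page."""
    page = result[offset:offset + max_chars]
    next_offset = offset + len(page)
    has_more = next_offset < len(result)
    return {
        "text": page,
        "next_offset": next_offset if has_more else None,  # None means last page
        "remaining_chars": len(result) - next_offset,
    }


big_output = "x" * 4500
first = page_tool_result(big_output)
print(first["next_offset"], first["remaining_chars"])  # 2000 2500

last = page_tool_result(big_output, offset=4000)
print(len(last["text"]), last["next_offset"])  # 500 None
```

Returning the next offset explicitly saves the model from computing it and gives it a clear signal when there is nothing left to fetch.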
📍
Lost-in-the-Middle Effect

LLMs pay most attention to the beginning and end of context. Place your most critical instructions and recent facts at the top of the system prompt and the end of the conversation history. Avoid burying key data in the middle of a long tool result.
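One way to act on this when injecting several retrieved items is to reorder them so the strongest ones sit at the edges. The function below is a hypothetical sketch: it assumes the input is already sorted from most to least relevant and alternates items between the front and the back of the output.

```python
def order_for_edges(ranked: list) -> list:
    """Place the most relevant items at the start and end, the least in the middle.

    `ranked` must be sorted from most to least relevant.
    """
    front = ranked[0::2]  # ranks 1, 3, 5, ... fill from the start
    back = ranked[1::2]   # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]


# Items labelled by rank: 1 is the most relevant, 5 the least.
print(order_for_edges([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```

Rank 1 lands first, rank 2 lands last, and the weakest item ends up in the middle where reduced attention costs the least.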

🧩 Memory

Agent Memory Architectures

Long-running agents need memory that persists beyond the context window. There are four memory types, each with a different cost/recall trade-off. Most production agents combine all four.

| Memory Type | Storage | Scope | Best For |
|---|---|---|---|
| In-context | Conversation messages | Current session | Recent tool results, active task state |
| External episodic | DynamoDB / S3 | Cross-session | User preferences, past interactions |
| Semantic | Bedrock Knowledge Base | Global | Documents, FAQs, domain knowledge |
| Procedural | System prompt / tools | Global | How-to patterns, workflows, policies |

import time

@mcp.tool()
async def save_memory(key: str, value: str, session_id: str) -> str:
    """Persist a key-value fact to episodic memory (DynamoDB)."""
    dynamo = boto3.resource("dynamodb")
    table = dynamo.Table("agent-memory")
    table.put_item(Item={
        "pk": f"session#{session_id}",
        "sk": key,
        "value": value,
        "ttl": int(time.time()) + 86400 * 30  # expire after 30 days
    })
    return f"Saved {key} to episodic memory."

@mcp.tool()
async def recall_memory(session_id: str) -> str:
    """Retrieve all stored facts for this session."""
    dynamo = boto3.resource("dynamodb")
    table = dynamo.Table("agent-memory")
    resp = table.query(
        KeyConditionExpression="pk = :pk",
        ExpressionAttributeValues={":pk": f"session#{session_id}"}
    )
    facts = {item["sk"]: item["value"] for item in resp["Items"]}
    return json.dumps(facts, indent=2) if facts else "No memories stored."
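To show how the four memory types can come together on a single turn, here is a hedged sketch with plain Python values standing in for each store. `assemble_request` and its section headings are hypothetical; in a real agent the episodic facts would come from a store such as DynamoDB and the semantic snippets from a retrieval tool.

```python
import json


def assemble_request(procedural: str, episodic: dict,
                     semantic: list[str], recent_turns: list[dict]) -> dict:
    """Combine the four memory types into one system prompt + message list."""
    system_parts = [procedural]  # procedural memory: stable instructions
    if episodic:
        # Episodic memory: facts persisted from earlier sessions.
        system_parts.append("## Remembered facts\n" + json.dumps(episodic, sort_keys=True))
    if semantic:
        # Semantic memory: snippets retrieved for the current query.
        system_parts.append("## Retrieved knowledge\n" + "\n".join(f"- {s}" for s in semantic))
    return {
        "system": "\n\n".join(system_parts),
        "messages": list(recent_turns),  # in-context memory: recent turns only
    }


request = assemble_request(
    procedural="You are an expert AWS assistant.",
    episodic={"preferred_region": "eu-west-1"},
    semantic=["S3 bucket names are globally unique."],
    recent_turns=[{"role": "user", "content": "Create a bucket for logs."}],
)
print(request["system"])
```

The stable procedural text comes first so it can sit in a cached prefix, with the more volatile episodic and semantic sections after it.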
💰 Cost

Prompt Cache & Cost Control

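Anthropic's prompt caching lets you mark up to 4 cache breakpoints in a request (across the system prompt and messages). Cache reads are billed at 10% of the normal input price, while writing a block to the cache costs 25% more than normal input for the default 5-minute cache. For a large system prompt that's reused across thousands of agent turns, this alone can reduce the cost of that prompt by 80–90%.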

# Enable prompt caching on the stable system prompt
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": stable_system_prompt,  # large, rarely changes
            "cache_control": {"type": "ephemeral"}  # cache this block
        },
        {
            "type": "text",
            "text": dynamic_session_context  # not cached — changes each turn
        }
    ],
    messages=messages
)

# Check cache performance
usage = response.usage
cache_hit = usage.cache_read_input_tokens or 0       # tokens served from the cache
cache_miss = usage.cache_creation_input_tokens or 0  # tokens written to the cache on this call
total = cache_hit + cache_miss
hit_rate = cache_hit / total * 100 if total else 0.0
print(f"Cache hit rate: {hit_rate:.1f}% (cost of roughly {cache_hit * 0.9:.0f} input tokens saved)")
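The savings figure quoted above can be checked with a little arithmetic. The sketch below rests on stated assumptions: cache writes billed at 1.25× and cache reads at 0.10× the base input price, every turn after the first hitting the cache (no expiry in between), and a hypothetical price of $3 per million input tokens. It covers only the cached system prompt, not the rest of the request; substitute current pricing before relying on the numbers.

```python
def prompt_cost(prompt_tokens: int, turns: int, price_per_mtok: float,
                write_mult: float = 1.25, read_mult: float = 0.10) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in dollars for a reused system prompt."""
    per_token = price_per_mtok / 1_000_000
    # Without caching, the full prompt is billed at the normal rate every turn.
    uncached = prompt_tokens * turns * per_token
    # With caching: one cache write on the first turn, cache reads afterwards.
    cached = prompt_tokens * per_token * (write_mult + read_mult * (turns - 1))
    return uncached, cached


# Assumed inputs: 4K-token system prompt, 1,000 turns, $3 per million tokens.
uncached, cached = prompt_cost(prompt_tokens=4_000, turns=1_000, price_per_mtok=3.0)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}")  # uncached $12.00, cached $1.21
print(f"saving {1 - cached / uncached:.1%}")              # saving 89.9%
```

With only a handful of turns the saving is much smaller, because the one-off write premium is spread over fewer cache reads.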

⚡ Cache Best Practices

Keep cached blocks stable. Moving them (reordering or editing) invalidates the cache. Put user-specific data below cached blocks, never inside them.

📊 Token Monitoring

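Log usage.input_tokens per turn (with prompt caching enabled, add cache_read_input_tokens and cache_creation_input_tokens to get the full input size). Alert when a session exceeds 100K tokens, which usually indicates runaway tool results or missing compaction logic.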
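A per-turn monitor with the 100K alert can be sketched in a few lines. `TokenMonitor` is a hypothetical class, and the token counts fed to it here are made-up values; in a real client you would pass the input size reported in each response's usage data.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("token_monitor")


class TokenMonitor:
    """Track input tokens per turn and flag turns that cross a threshold."""

    def __init__(self, alert_threshold: int = 100_000):
        self.alert_threshold = alert_threshold
        self.per_turn: list[int] = []

    def record(self, input_tokens: int) -> bool:
        """Log one turn; return True if its input size exceeds the threshold."""
        self.per_turn.append(input_tokens)
        turn = len(self.per_turn)
        logger.info("turn %d: %d input tokens", turn, input_tokens)
        over = input_tokens > self.alert_threshold
        if over:
            logger.warning("turn %d exceeded %d tokens: check tool output size "
                           "and compaction", turn, self.alert_threshold)
        return over


monitor = TokenMonitor()
# Illustrative token counts for three consecutive turns of one session.
alerts = [monitor.record(n) for n in (12_000, 58_000, 131_000)]
print(alerts)  # [False, False, True]
```

Because each request resends the whole history, the input size of the latest turn is a direct measure of how full the context window currently is.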

🧠 Check Your Understanding

QUESTION 1 OF 3
Why should you use a cheap model like claude-haiku for compaction instead of the main model?
A. Haiku has a larger context window
B. Haiku is substantially cheaper, and summarisation is a simple task that doesn't need top intelligence
C. Haiku supports prompt caching but Sonnet does not
D. Haiku produces shorter summaries automatically
Answer: B. Summarisation is a straightforward task: condense and preserve key facts. Using Haiku for this step cuts costs while preserving quality. Reserve Sonnet/Opus for reasoning-heavy tasks.
QUESTION 2 OF 3
What does the "lost-in-the-middle" effect mean for context injection?
A. Tool results are lost when the context window fills up
B. LLMs pay less attention to information in the middle of a long context window
C. The summary model ignores the middle 30% of conversations
D. Metadata filters remove middle-ranked retrieval results
Answer: B. Research shows LLMs attend most strongly to the start and end of the context window. Critical instructions and recent facts should be placed there, not buried in the middle of a long tool result.
QUESTION 3 OF 3
Which memory type is best for storing user preferences that should persist across multiple sessions?
A. In-context memory
B. Procedural memory
C. External episodic memory (DynamoDB/S3)
D. Semantic memory (Knowledge Base)
Answer: C. External episodic memory (e.g., DynamoDB with a session PK) persists across sessions and holds user-specific facts like preferences and past decisions. In-context memory is wiped when the session ends.