📚 Domain 2 · Task Statement 2.2

Implement Structured Error Responses for MCP Tools

📊 Domain Weight: 18% 🎯 Difficulty: Core Concept 🔗 Scenarios: Customer Support & Multi-Agent Research

When an MCP tool fails, what Claude does next depends entirely on the quality of the error response it receives. A generic "Operation failed" message leaves Claude guessing — should it retry? Escalate? Try a different tool? This guide covers the complete error classification taxonomy, the isError flag pattern, retryable vs non-retryable error design, and how subagents propagate failures up to coordinators without losing context.

📋 Contents

Real-World Analogy: The Hospital Diagnostic Chain
The isError Flag Pattern
The Four Error Categories
Retryable vs Non-Retryable Error Design
Workflow Diagram: Error Recovery Decision Flow
Sequence Diagram: Subagent Error Propagation
Distinguishing Empty Results from Access Failures
Code Patterns: Structured Error Implementation
Anti-Patterns to Avoid
Exam Readiness & Key Takeaways

🏥 Real-World Analogy: The Hospital Diagnostic Chain

🩹 Analogy — Doctor, Lab, and Structured Communication

A doctor orders a blood test. The lab experiences a machine failure. Compare two lab responses:

Bad response: "Test failed." — The doctor cannot decide what to do. Was the sample corrupted? Is the machine down temporarily? A billing issue?

Good response: "Analyzer unit B is offline (hardware fault, est. repair 4 hours). Sample intact and preserved. Retry after 16:00 using unit A. No new sample needed. Pre-auth ID: LAB-9921."

The structured error gives: the error category (hardware/transient), retryability (yes, after 4 hours), what was preserved (sample), and what action to take next. This is exactly what Claude needs from an MCP tool failure.

In agentic systems, Claude is the doctor, your MCP tool is the lab, and the error response is the lab's communication back. Claude cannot make intelligent recovery decisions from generic failure messages.

🚩 The isError Flag Pattern

MCP tools communicate failures to Claude using the isError flag in the tool result. This is the primary signal Claude uses to detect a tool failure, distinct from a successful response with an empty or negative result.

🎯 Exam Focus — isError vs Empty Result

isError: true = the tool failed to work properly. Claude enters error recovery mode.

isError: false + empty payload = the tool worked correctly but found no matching data. Claude informs the user and stops. These are completely different scenarios requiring completely different handling.

JSON — MCP Tool Result: Error vs Empty Result

// STRUCTURED ERROR (isError: true) - triggers recovery logic
{
  "type": "tool_result",
  "isError": true,
  "content": [{"type": "text", "text": "{\"errorCategory\": \"transient\", \"isRetryable\": true, \"retryAfterSeconds\": 30}"}]
}

// VALID EMPTY RESULT (isError: false) - tool succeeded, nothing found
{
  "type": "tool_result",
  "isError": false,
  "content": [{"type": "text", "text": "{\"found\": false, \"message\": \"No orders found for CUST-00123\"}"}]
}
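One way to guarantee that the text payload is valid JSON is to build it with a serializer rather than by hand. The following is a minimal sketch, assuming the structured payload travels as a JSON string inside a single text content item; make_tool_result is a hypothetical helper name, not part of any SDK.

```python
import json


def make_tool_result(payload: dict, is_error: bool) -> dict:
    """Wrap a structured payload in the tool-result shape shown above.

    Hypothetical helper: the payload is serialized with json.dumps so the
    receiver can parse it back instead of scraping free text.
    """
    return {
        "type": "tool_result",
        "isError": is_error,
        "content": [{"type": "text", "text": json.dumps(payload)}],
    }


# Failure: the tool could not do its job.
error_result = make_tool_result(
    {"errorCategory": "transient", "isRetryable": True, "retryAfterSeconds": 30},
    is_error=True,
)

# Success with nothing found: not an error.
empty_result = make_tool_result(
    {"found": False, "message": "No orders found for CUST-00123"},
    is_error=False,
)
```

Because both results share one shape, the receiver only needs to check isError first and then parse the text payload.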

📈 The Four Error Categories

The exam guide defines four distinct error categories. Each requires a different recovery strategy. You must know all four precisely — including which are retryable and which are not.

Category	Trigger	isRetryable	Claude Action
transient	Temporary infrastructure failures: timeouts, service unavailability, network blips	true	Retry after delay using `retryAfterSeconds`. Apply exponential backoff.
validation	Invalid input format or value from Claude (malformed ID, wrong type)	false	Do not retry with the same input. Re-examine how the input was constructed; resubmit only if the correction is obvious.
permission	Authorization or access control failure (caller lacks rights)	false	Do not retry. Escalate to human agent with explanation of access restriction.
business	Policy or domain rule violation (not an infrastructure failure)	false	Do not retry. Return customer-friendly explanation of the business rule violated.

💡 Mental Model: Transient vs Structural

Transient errors are like a traffic jam — temporary, will clear, retry is appropriate. Validation, permission, and business errors are like driving the wrong way on a one-way street — the road itself prohibits the action; retrying does not help. The isRetryable boolean encodes this distinction machine-readably.
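The transient row combines two ideas: honor the retryAfterSeconds hint, and back off exponentially on repeated failures. A minimal sketch of one way to combine them follows. The doubling factor, the 300-second cap, and the function name retry_delay_seconds are illustrative assumptions, not values from the exam guide.

```python
def retry_delay_seconds(retry_after_seconds: int, attempt: int, max_delay: int = 300) -> int:
    """Delay before retry number `attempt` (0-based) of a transient error.

    Starts at the tool's retryAfterSeconds hint and doubles on each
    further attempt, capped at max_delay. Doubling and the cap are
    assumptions made for this sketch.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    return min(retry_after_seconds * (2 ** attempt), max_delay)


# With a 30-second hint, successive delays grow until they hit the cap.
delays = [retry_delay_seconds(30, attempt) for attempt in range(5)]
# delays == [30, 60, 120, 240, 300]
```

The first retry never fires sooner than the tool asked for, which is the point of the hint.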

🔄 Retryable vs Non-Retryable Error Design

The isRetryable boolean is the field that most directly drives recovery behavior. It determines whether Claude retries the call or moves straight to correction or escalation, instead of spending compute on an operation that cannot succeed.

Why Uniform Errors Break Recovery

If all errors return "Operation failed", Claude cannot determine whether to: retry a transient error, give up on a non-retryable error, correct input for a validation error, or escalate a permission error. Structured classification enables all four intelligent paths.

⚠️ Anti-Pattern: Generic Error Swallowing

The most common anti-pattern: except Exception: return "Operation failed". This destroys all differentiated error information Claude needs. Always classify the exception type before constructing the error response.

Required Fields in a Structured Error Response

Field	Type	Required	Purpose
`errorCategory`	string enum	✅ Always	transient / validation / permission / business
`isRetryable`	boolean	✅ Always	Whether Claude should attempt the operation again
`description`	string	✅ Always	Technical explanation for Claude (not the user)
`retryAfterSeconds`	integer	🔶 When retryable	Minimum wait before retry to avoid hammering the service
`customerFriendlyMessage`	string	🔶 Recommended	Safe message Claude can relay to the end user
`partialResults`	object	🔶 If available	Any data obtained before the failure point
`attemptedActions`	string[]	🔶 If relevant	Steps completed; prevents duplication on coordinator retry
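The field rules in the table can be checked mechanically before a tool returns an error. The following is a minimal sketch, assuming error payloads are plain dicts keyed by the field names above; validate_structured_error is a hypothetical helper, and it checks only the always-required fields plus the retryAfterSeconds rule.

```python
VALID_CATEGORIES = {"transient", "validation", "permission", "business"}


def validate_structured_error(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload passes."""
    problems = []
    if payload.get("errorCategory") not in VALID_CATEGORIES:
        problems.append("errorCategory must be one of: " + ", ".join(sorted(VALID_CATEGORIES)))
    if not isinstance(payload.get("isRetryable"), bool):
        problems.append("isRetryable must be a boolean")
    description = payload.get("description")
    if not isinstance(description, str) or not description:
        problems.append("description must be a non-empty string")
    retry_after = payload.get("retryAfterSeconds")
    # bool is a subclass of int in Python, so exclude it explicitly.
    retry_after_ok = isinstance(retry_after, int) and not isinstance(retry_after, bool)
    if payload.get("isRetryable") is True and not retry_after_ok:
        problems.append("retryAfterSeconds must be an integer when isRetryable is true")
    return problems
```

Running a check like this in tests for each tool is one way to catch a retryable error that ships without a delay hint.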

🕐 Workflow Diagram: Error Recovery Decision Flow

When Claude receives a tool result with isError: true, it follows this decision tree to determine the appropriate recovery action:

Figure 1 — Claude Error Recovery Decision Flow
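The decision flow can also be read as code. The sketch below is one interpretation derived from the category table, not a transcription of the figure. The action names, the max_retries default, and the function name decide_recovery are assumptions made for illustration.

```python
def decide_recovery(error: dict, attempt: int, max_retries: int = 2) -> str:
    """Map a structured error payload to a recovery action name.

    `attempt` counts retries already made for this call (0 on first failure).
    """
    category = error.get("errorCategory")
    if category is None or "isRetryable" not in error:
        # Unclassified error: there is no safe basis for retrying.
        return "escalate_unclassified"
    if error["isRetryable"] and attempt < max_retries:
        return "retry_after_delay"
    if category == "transient":
        # Retryable in principle, but the retry budget is spent.
        return "escalate_retries_exhausted"
    if category == "validation":
        return "correct_input"
    if category == "permission":
        return "escalate_to_human"
    if category == "business":
        return "explain_policy"
    return "escalate_unclassified"
```

The retry check comes first so that isRetryable, not the category name, is the gate; the category only selects among the non-retry paths.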

🔨 Sequence Diagram: Subagent Error Propagation

In multi-agent systems, errors in a subagent must be propagated with enough context for the coordinator to make intelligent recovery decisions.

Figure 2 — Multi-Agent Error Propagation Sequence

🎯 Exam Focus — Propagate Only What Cannot Be Resolved Locally

Subagents handle local recovery for transient failures (retry 1-2 times). They propagate to the coordinator ONLY errors that cannot be resolved locally, always including: errorCategory, partialResults, and attemptedActions. The coordinator should never receive raw infrastructure exceptions — only structured error payloads.
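The local-retry-then-propagate rule can be sketched as a small wrapper around a tool call. This is a minimal illustration, assuming the tool is an async callable that returns a result dict in the shape used throughout this guide. The name call_with_local_retry, the "failure" status value, and the default of two local retries are assumptions.

```python
import asyncio


async def call_with_local_retry(tool, action_name: str, max_local_retries: int = 2) -> dict:
    """Call `tool`, retrying transient errors locally before propagating.

    Returns the tool's result on success, or a structured summary for the
    coordinator if the error cannot be resolved locally.
    """
    attempted = []
    result = {}
    for attempt in range(max_local_retries + 1):
        attempted.append(f"{action_name} (attempt {attempt + 1})")
        result = await tool()
        if not result.get("isError"):
            return result  # success, including valid empty results
        if not result.get("isRetryable"):
            break  # validation, permission, business: retrying cannot help
        if attempt < max_local_retries:
            await asyncio.sleep(result.get("retryAfterSeconds", 1))
    # Unresolved locally: hand the coordinator a structured payload,
    # never a raw exception.
    return {
        "status": "failure",
        "errorCategory": result.get("errorCategory"),
        "isRetryable": result.get("isRetryable", False),
        "description": result.get("description", ""),
        "partialResults": result.get("partialResults", {}),
        "attemptedActions": attempted,
    }
```

A non-retryable error leaves the loop after a single attempt, so the coordinator sees exactly one entry in attemptedActions for it.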

📋 Distinguishing Empty Results from Access Failures

One of the most frequently tested distinctions in Task 2.2: an access failure (tool could not execute) vs a valid empty result (tool executed successfully but found no data).

Figure 3 — Empty Result vs Access Failure
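On the receiving side, the distinction reduces to checking isError before looking at the payload. The following is a minimal sketch, assuming the text payload is valid JSON with a found field as in the earlier examples; triage_lookup_result and its three return labels are hypothetical names.

```python
import json


def triage_lookup_result(tool_result: dict) -> str:
    """Classify a tool result as 'access_failure', 'empty', or 'found'."""
    if tool_result.get("isError"):
        # The tool could not execute: enter error recovery.
        return "access_failure"
    payload = json.loads(tool_result["content"][0]["text"])
    # The tool ran; an empty answer is still a successful answer.
    return "found" if payload.get("found") else "empty"
```

Only the first branch leads to recovery logic. The other two are normal outcomes that Claude reports to the user.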

💻 Code Patterns: Structured Error Implementation

Python — Anti-Pattern: Generic Error Swallowing

async def lookup_order(order_id: str) -> dict:
    try:
        result = await order_db.get(order_id)
        return result
    except Exception as e:
        # All exceptions: same generic message - Claude cannot route recovery
        return {"isError": True, "error": "Operation failed"}

Python — Production: Classified Structured Errors

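# order_db, PermissionDeniedError, and RefundPolicyViolation are assumed to be
# defined elsewhere in the application; they are not standard-library names.
async def lookup_order(order_id: str) -> dict:
    # 1. Validate input format FIRST (prevent hitting DB with bad input)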
    if not order_id.startswith("ORD-"):
        return {
            "isError": True,
            "errorCategory": "validation",
            "isRetryable": False,
            "description": f"Invalid order_id format. Expected ORD-XXXXX, got: {order_id}",
            "customerFriendlyMessage": "I wasn't able to look up that order. Could you double-check the order number?"
        }
    try:
        result = await order_db.get(order_id)
        # 2. Valid empty result (NOT an error): say so explicitly with isError: False
        if result is None:
            return {"isError": False, "found": False, "message": f"No order found with ID {order_id}."}
        return {"isError": False, "found": True, "order": result}

    except TimeoutError:
        # 3. Transient: temporary, retry after delay
        return {
            "isError": True, "errorCategory": "transient",
            "isRetryable": True, "retryAfterSeconds": 30,
            "description": "Order DB timed out. Service under load.",
            "customerFriendlyMessage": "I'm having trouble reaching the order system. Let me try again shortly."
        }
    except PermissionDeniedError:
        # 4. Permission: escalate immediately, no retry
        return {
            "isError": True, "errorCategory": "permission",
            "isRetryable": False,
            "description": "Agent lacks authorization to access order records.",
            "customerFriendlyMessage": "I need to transfer you to a specialist who can access your order details."
        }
    except RefundPolicyViolation as e:
        # 5. Business rule: explain policy, no retry
        return {
            "isError": True, "errorCategory": "business",
            "isRetryable": False,
            "description": f"Refund rejected: {e.reason}",
            "customerFriendlyMessage": f"This order isn't eligible for a refund because {e.customer_reason}."
        }

Python — Subagent Error Propagation Payload

import json


def build_propagation_payload(error, partial_results, attempted):
    """Structured payload for coordinator - all context without subagent history."""
    return json.dumps({
        "status": "partial_failure",
        "errorCategory": error["errorCategory"],
        "isRetryable": error["isRetryable"],
        "description": error["description"],
        "partialResults": partial_results,  # What was obtained before failure
        "attemptedActions": attempted,      # Prevents coordinator duplication
        "recommendation": "Continue with partial results; flag gap in report."
    })
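On the coordinator side, attemptedActions is what prevents duplicated work. The following is a minimal sketch, assuming the coordinator keeps its plan as a list of action-name strings that match the names the subagent reports; remaining_actions is a hypothetical helper.

```python
import json


def remaining_actions(payload_json: str, planned_actions: list[str]) -> list[str]:
    """Return planned actions the subagent has not already attempted, in order."""
    payload = json.loads(payload_json)
    already_attempted = set(payload.get("attemptedActions", []))
    return [action for action in planned_actions if action not in already_attempted]
```

This sketch skips every attempted action, including ones that failed. A coordinator that wants to re-run a failed retryable step would also need to consult isRetryable before filtering.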

⛔ Anti-Patterns to Avoid

⛔ Uniform "Operation Failed"

Catching all exceptions and returning the same generic error string. Destroys all classification information Claude needs to route recovery.

⛔ Retrying Non-Retryable

Retrying permission or business errors wastes compute, adds noise to security logs, and cannot succeed until the underlying access or policy changes. isRetryable: false must be respected.

⛔ Treating Empty as Error

Returning isError: true when a query has zero results. An empty DB response is a valid successful outcome. Claude should inform, not recover.

⛔ Propagating Without Context

Subagents sending bare error messages without partial results or attempted actions. Coordinator cannot make recovery decisions without this context.

⛔ Missing customerFriendlyMessage

Returning only technical error details (stack traces, DB errors), which Claude may then relay to end users. Always include a sanitized, customer-appropriate message.

⛔ No retryAfterSeconds on Transient

Marking an error isRetryable: true without a delay hint. Claude may then retry immediately, hammering an already struggling service.

✓ Classify Before Constructing

Validate input first, then catch specific exception types (transient, permission, business) in separate branches rather than a bare Exception. Each branch constructs a fully structured error with all required fields.

✓ Local Retry Then Propagate

Subagents retry transient errors locally (1-2 attempts). If unresolved, propagate a structured summary: partial results + attempted actions.

✓ Separate Empty from Failure

Return isError: false, found: false for valid empty results. Reserve isError: true exclusively for actual tool execution failures.

✅ Exam Readiness & Key Takeaways

🎓 Exam Scenario — Customer Support Error Recovery

Scenario: The process_refund tool returns an error for an item past the return window. Questions test: (1) Which error category? (Answer: business) (2) Should Claude retry? (Answer: No) (3) What should Claude tell the customer? (Answer: customerFriendlyMessage explaining policy).

Common distractor: Choosing "retry the operation" for a business rule error (wrong). Or "inform the user and stop" for a transient error (wrong — transient errors should retry first). The category determines the action.
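The refund scenario can be made concrete with a small sketch. It assumes a 30-day return window measured from the delivery date; the window length, the parameter names, and the success payload are all illustrative assumptions, and a real process_refund tool would take its policy from the business system.

```python
from datetime import date


def process_refund(order_id: str, delivered_on: date, today: date,
                   return_window_days: int = 30) -> dict:
    """Return a business error when the order is past the return window."""
    days_since_delivery = (today - delivered_on).days
    if days_since_delivery > return_window_days:
        return {
            "isError": True,
            "errorCategory": "business",  # policy violation, not infrastructure
            "isRetryable": False,         # retrying cannot change the policy
            "description": (
                f"Order {order_id} was delivered {days_since_delivery} days ago; "
                f"the return window is {return_window_days} days."
            ),
            "customerFriendlyMessage": (
                f"This item is outside our {return_window_days}-day return window, "
                "so I can't process a refund for it."
            ),
        }
    return {"isError": False, "refunded": True, "orderId": order_id}
```

All three exam answers are visible in the payload: the category is business, isRetryable is false, and customerFriendlyMessage carries the policy explanation for the customer.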

isError: true is the primary MCP failure signal. When set, Claude enters error recovery mode. The content carries the structured error payload. Without isError, content is treated as a successful response.

Four error categories to memorize and differentiate: transient (retry after delay), validation (bad input; correct the input instead of repeating the same call), permission (escalate immediately), business (explain policy). Only transient is marked isRetryable: true.

isRetryable is the machine-readable gate. Always include retryAfterSeconds when isRetryable is true. Missing this causes immediate hammering of struggling services.

Generic errors prevent all intelligent recovery. "Operation failed" gives Claude zero routing information. Always classify the exception before constructing the error response. Catch specific types, not bare Exception.

Subagents handle local retry, then propagate structured summaries. Transient failures: retry 1-2 times locally. Still failing: propagate to coordinator with errorCategory + partialResults + attemptedActions.

Empty result is NOT an error. isError: false, found: false = valid empty query. isError: true, errorCategory: permission = access failure requiring escalation. Never conflate these two scenarios.

Always include customerFriendlyMessage. Claude may relay technical error details verbatim if no sanitized message is provided. Business and permission errors require human-appropriate explanations that don't expose internal system state.
