📚 Domain 2 · Task Statement 2.2

Implement Structured Error Responses for MCP Tools

📊 Domain Weight: 18% · 🎯 Difficulty: Core Concept · 🔗 Scenarios: Customer Support & Multi-Agent Research

When an MCP tool fails, what Claude does next depends entirely on the quality of the error response it receives. A generic "Operation failed" message leaves Claude guessing — should it retry? Escalate? Try a different tool? This guide covers the complete error classification taxonomy, the isError flag pattern, retryable vs non-retryable error design, and how subagents propagate failures up to coordinators without losing context.

📋 Contents

  1. Real-World Analogy: The Hospital Diagnostic Chain
  2. The isError Flag Pattern
  3. The Four Error Categories
  4. Retryable vs Non-Retryable Error Design
  5. Workflow Diagram: Error Recovery Decision Flow
  6. Sequence Diagram: Subagent Error Propagation
  7. Distinguishing Empty Results from Access Failures
  8. Code Patterns: Structured Error Implementation
  9. Anti-Patterns to Avoid
  10. Exam Readiness & Key Takeaways

🏥 Real-World Analogy: The Hospital Diagnostic Chain

🩹 Analogy — Doctor, Lab, and Structured Communication

A doctor orders a blood test. The lab experiences a machine failure. Compare two lab responses:

Bad response: "Test failed." — The doctor cannot decide what to do. Was the sample corrupted? Is the machine down temporarily? A billing issue?

Good response: "Analyzer unit B is offline (hardware fault, est. repair 4 hours). Sample intact and preserved. Retry after 16:00 using unit A. No new sample needed. Pre-auth ID: LAB-9921."

The structured error gives: the error category (hardware/transient), retryability (yes, after 4 hours), what was preserved (sample), and what action to take next. This is exactly what Claude needs from an MCP tool failure.

In agentic systems, Claude is the doctor, your MCP tool is the lab, and the error response is the lab's communication back. Claude cannot make intelligent recovery decisions from generic failure messages.

🚩 The isError Flag Pattern

MCP tools communicate failures to Claude using the isError flag in the tool result. This is the primary signal Claude uses to detect a tool failure, distinct from a successful response with an empty or negative result.

🎯 Exam Focus — isError vs Empty Result

isError: true = the tool failed to work properly. Claude enters error recovery mode.

isError: false + empty payload = the tool worked correctly but found no matching data. Claude informs the user and stops. These are completely different scenarios requiring completely different handling.

JSON — MCP Tool Result: Error vs Empty Result
// STRUCTURED ERROR (isError: true) - triggers recovery logic
// The text field carries the structured payload as an escaped JSON string.
{
  "type": "tool_result",
  "isError": true,
  "content": [{"type": "text", "text": "{\"errorCategory\": \"transient\", \"isRetryable\": true, \"retryAfterSeconds\": 30}"}]
}

// VALID EMPTY RESULT (isError: false) - tool succeeded, nothing found
{
  "type": "tool_result",
  "isError": false,
  "content": [{"type": "text", "text": "{\"found\": false, \"message\": \"No orders found for CUST-00123\"}"}]
}
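
The distinction between the two results above could be checked on the receiving side roughly as follows. This is a minimal sketch, assuming the result has the shape shown in the JSON example and that the first text item holds a JSON-encoded object; `interpret_tool_result` and its three return labels are hypothetical names used only for illustration.

```python
import json

def interpret_tool_result(result: dict) -> tuple[str, dict]:
    """Classify a tool result as 'error', 'empty', or 'data'.

    Assumption: the first content item is a text block holding a JSON object.
    """
    payload = json.loads(result["content"][0]["text"])
    if result.get("isError", False):
        return "error", payload   # tool failed: go to recovery logic
    if payload.get("found") is False:
        return "empty", payload   # tool succeeded, nothing matched
    return "data", payload        # tool succeeded and returned data

failed = {
    "isError": True,
    "content": [{"type": "text", "text": json.dumps(
        {"errorCategory": "transient", "isRetryable": True, "retryAfterSeconds": 30})}],
}
empty = {
    "isError": False,
    "content": [{"type": "text", "text": json.dumps(
        {"found": False, "message": "No orders found for CUST-00123"})}],
}

print(interpret_tool_result(failed)[0])  # error
print(interpret_tool_result(empty)[0])   # empty
```

The check on isError comes first on purpose: a payload is only read as an empty result once the tool is known to have executed successfully.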

📈 The Four Error Categories

The exam guide defines four distinct error categories. Each requires a different recovery strategy. You must know all four precisely — including which are retryable and which are not.

| Category | Trigger | isRetryable | Claude Action |
|---|---|---|---|
| transient | Temporary infrastructure failures: timeouts, service unavailability, network blips | true | Retry after a delay using retryAfterSeconds. Apply exponential backoff. |
| validation | Invalid input format or value from Claude (malformed ID, wrong type) | false | Do not retry the same input. Re-examine how the input was constructed; if the fix is obvious, correct it and make a new call. |
| permission | Authorization or access control failure (caller lacks rights) | false | Do not retry. Escalate to a human agent with an explanation of the access restriction. |
| business | Policy or domain rule violation (not an infrastructure failure) | false | Do not retry. Return a customer-friendly explanation of the business rule that was violated. |

💡 Mental Model: Transient vs Structural

Transient errors are like a traffic jam — temporary, will clear, retry is appropriate. Validation, permission, and business errors are like driving the wrong way on a one-way street — the road itself prohibits the action; retrying does not help. The isRetryable boolean encodes this distinction machine-readably.
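
One way to make the category-to-strategy mapping explicit is a small lookup table. The sketch below is illustrative only: `ErrorCategory`, `RECOVERY`, and `recovery_for` are hypothetical names, and the action strings paraphrase the table above.

```python
from enum import Enum

class ErrorCategory(Enum):
    TRANSIENT = "transient"
    VALIDATION = "validation"
    PERMISSION = "permission"
    BUSINESS = "business"

# (isRetryable, recovery action) per category, paraphrasing the table above.
RECOVERY = {
    ErrorCategory.TRANSIENT: (True, "retry after delay with backoff"),
    ErrorCategory.VALIDATION: (False, "correct the input before any new call"),
    ErrorCategory.PERMISSION: (False, "escalate to a human agent"),
    ErrorCategory.BUSINESS: (False, "explain the business rule to the customer"),
}

def recovery_for(category: str) -> tuple[bool, str]:
    """Return (isRetryable, action); raises ValueError for an unknown category."""
    return RECOVERY[ErrorCategory(category)]

print(recovery_for("transient"))  # (True, 'retry after delay with backoff')
```

Rejecting unknown category strings early keeps a misspelled category from silently falling into a default branch.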

🔄 Retryable vs Non-Retryable Error Design

The isRetryable boolean is the most operationally significant field in a structured error. It determines whether Claude retries or moves straight to correction, escalation, or explanation, instead of spending compute on an operation that cannot succeed.

Why Uniform Errors Break Recovery

If all errors return "Operation failed", Claude cannot determine whether to: retry a transient error, give up on a non-retryable error, correct input for a validation error, or escalate a permission error. Structured classification enables all four intelligent paths.

⚠️ Anti-Pattern: Generic Error Swallowing

The most common anti-pattern: except Exception: return "Operation failed". This destroys all differentiated error information Claude needs. Always classify the exception type before constructing the error response.

Required Fields in a Structured Error Response

| Field | Type | Required | Purpose |
|---|---|---|---|
| errorCategory | string enum | ✅ Always | transient / validation / permission / business |
| isRetryable | boolean | ✅ Always | Whether Claude should attempt the operation again |
| description | string | ✅ Always | Technical explanation for Claude (not the user) |
| retryAfterSeconds | integer | 🔶 When retryable | Minimum wait before retry to avoid hammering the service |
| customerFriendlyMessage | string | 🔶 Recommended | Safe message Claude can relay to the end user |
| partialResults | object | 🔶 If available | Any data obtained before the failure point |
| attemptedActions | string[] | 🔶 If relevant | Steps completed; prevents duplication on coordinator retry |
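
The field requirements above lend themselves to a quick completeness check before a tool returns an error. The following is a sketch under stated assumptions: `validate_error_payload` is a hypothetical helper, and it checks only the always-required fields plus the retryAfterSeconds rule.

```python
REQUIRED_FIELDS = {"errorCategory": str, "isRetryable": bool, "description": str}
CATEGORIES = {"transient", "validation", "permission", "business"}

def validate_error_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks complete."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), expected_type):
            problems.append(f"missing or mistyped field: {field}")
    category = payload.get("errorCategory")
    if isinstance(category, str) and category not in CATEGORIES:
        problems.append("errorCategory is not one of the four categories")
    if payload.get("isRetryable") is True and "retryAfterSeconds" not in payload:
        problems.append("retryable error without retryAfterSeconds")
    return problems

print(validate_error_payload(
    {"errorCategory": "transient", "isRetryable": True, "description": "DB timeout"}
))  # ['retryable error without retryAfterSeconds']
```

A check like this could run in unit tests for each tool, so that an incomplete error shape is caught before Claude ever sees it.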

🕐 Workflow Diagram: Error Recovery Decision Flow

When Claude receives a tool result with isError: true, it follows this decision tree to determine the appropriate recovery action:

Figure 1 — Claude Error Recovery Decision Flow
Tool result with isError: true (structured error received) → check errorCategory:
  - transient: check isRetryable. If true, retry after retryAfterSeconds with exponential backoff.
  - validation: re-examine the input format, correct it, then retry.
  - permission: escalate to a human. No retry. Log the access attempt.
  - business: explain the policy using customerFriendlyMessage, plus alternatives.
Can the subagent resolve the error locally? If not, propagate to the coordinator: category + partialResults + attemptedActions.
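
The decision flow in Figure 1 can be sketched as a single routing function. This is an illustration rather than a description of how Claude works internally: `next_step`, the action labels, and the `max_retries` default are assumptions, and the backoff simply doubles the retryAfterSeconds hint on each attempt.

```python
def next_step(error: dict, attempt: int, max_retries: int = 2) -> dict:
    """Map a structured error to a recovery step.

    attempt is the number of retries already made (0 on the first failure).
    """
    category = error["errorCategory"]
    if category == "transient" and error.get("isRetryable"):
        if attempt >= max_retries:
            return {"action": "propagate"}
        base = error.get("retryAfterSeconds", 1)
        # Exponential backoff: base, 2*base, 4*base, ...
        return {"action": "retry", "waitSeconds": base * 2 ** attempt}
    if category == "validation":
        return {"action": "fix_input"}
    if category == "permission":
        return {"action": "escalate"}
    if category == "business":
        return {"action": "explain_policy"}
    # Unknown category: do not guess, hand it to the coordinator.
    return {"action": "propagate"}
```

With a retryAfterSeconds of 30 and the default limit, the first two failures lead to waits of 30 and 60 seconds, and the third failure is propagated.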

🔨 Sequence Diagram: Subagent Error Propagation

In multi-agent systems, errors in a subagent must be propagated with enough context for the coordinator to make intelligent recovery decisions.

Figure 2 — Multi-Agent Error Propagation Sequence
Participants: Coordinator, Search Subagent, MCP: web_search, External API

  1. Coordinator → Search Subagent: delegate "research AI in healthcare"
  2. Search Subagent → MCP web_search: web_search(query="AI healthcare 2025")
  3. MCP web_search → External API: HTTP GET search API
  4. External API → MCP web_search: 503 Service Down
  5. MCP web_search → Search Subagent: isError: true, transient, retryAfter: 30s
  6. Search Subagent: wait 30s, then retry
  7. Search Subagent → MCP web_search: web_search (retry #1)
  8. MCP web_search → Search Subagent: isError: true, still transient
  9. Search Subagent: max retries hit
  10. Search Subagent → Coordinator: propagate category + partialResults + attemptedActions
  11. Coordinator: continues with available data
  12. Coordinator: synthesis with cached data (note: search step incomplete)

Key: Local retry for transient failures. After max retries: propagate structured error + partial results. Coordinator continues without blocking on sub-failure.

🎯 Exam Focus — Propagate Only What Cannot Be Resolved Locally

Subagents handle local recovery for transient failures (retry 1-2 times). They propagate to the coordinator ONLY errors that cannot be resolved locally, always including: errorCategory, partialResults, and attemptedActions. The coordinator should never receive raw infrastructure exceptions — only structured error payloads.
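
The local-retry-then-propagate rule might look like the loop below inside a subagent. It is a sketch under assumptions: `call_with_local_retry` is a hypothetical helper, `tool` stands for any zero-argument callable that returns either data or a structured error dict, and `sleep` is injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable

def call_with_local_retry(
    tool: Callable[[], dict],
    max_retries: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call a tool, retrying retryable errors locally before propagating."""
    attempted = []
    for attempt in range(max_retries + 1):
        result = tool()
        attempted.append(f"call #{attempt + 1}")
        if not result.get("isError"):
            return result  # success (including a valid empty result)
        retryable = result.get("isRetryable", False)
        if not retryable or attempt == max_retries:
            # Cannot be resolved locally: hand a structured summary upward.
            return {
                "status": "partial_failure",
                "errorCategory": result["errorCategory"],
                "isRetryable": retryable,
                "partialResults": result.get("partialResults"),
                "attemptedActions": attempted,
            }
        sleep(result.get("retryAfterSeconds", 1))
    raise ValueError("max_retries must be non-negative")
```

Non-retryable errors leave the loop after the first call, so a permission or business error reaches the coordinator immediately with a single entry in attemptedActions.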

📋 Distinguishing Empty Results from Access Failures

One of the most frequently tested distinctions in Task 2.2: an access failure (tool could not execute) vs a valid empty result (tool executed successfully but found no data).

Figure 3 — Empty Result vs Access Failure
Valid Empty Result
  "isError": false, "found": false, "message": "No orders in 90 days"
  Tool executed successfully. DB returned zero rows.
  Action: Inform user. Do NOT retry. Do NOT escalate.

Access Failure
  "isError": true, "errorCategory": "permission", "isRetryable": false
  Tool could not execute. Caller lacks authorization.
  Action: Escalate to human. Do NOT retry. Log access.

💻 Code Patterns: Structured Error Implementation

Python — Anti-Pattern: Generic Error Swallowing
async def lookup_order(order_id: str) -> dict:
    try:
        result = await order_db.get(order_id)
        return result
    except Exception as e:
        # All exceptions: same generic message - Claude cannot route recovery
        return {"isError": True, "error": "Operation failed"}
Python — Production: Classified Structured Errors
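# Assumes order_db, PermissionDeniedError, and RefundPolicyViolation are defined elsewhere in the application.
async def lookup_order(order_id: str) -> dict:
    # 1. Validate input format FIRST (prevent hitting DB with bad input)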
    if not order_id.startswith("ORD-"):
        return {
            "isError": True,
            "errorCategory": "validation",
            "isRetryable": False,
            "description": f"Invalid order_id format. Expected ORD-XXXXX, got: {order_id}",
            "customerFriendlyMessage": "I wasn't able to look up that order. Could you double-check the order number?"
        }
    try:
        result = await order_db.get(order_id)
        # 2. Valid empty result (NOT an error): the query ran and matched nothing
        if result is None:
            return {"isError": False, "found": False, "message": f"No order found with ID {order_id}."}
        return {"isError": False, "found": True, "order": result}

    except TimeoutError:
        # 3. Transient: temporary, retry after delay
        return {
            "isError": True, "errorCategory": "transient",
            "isRetryable": True, "retryAfterSeconds": 30,
            "description": "Order DB timed out. Service under load.",
            "customerFriendlyMessage": "I'm having trouble reaching the order system. Let me try again shortly."
        }
    except PermissionDeniedError:
        # 4. Permission: escalate immediately, no retry
        return {
            "isError": True, "errorCategory": "permission",
            "isRetryable": False,
            "description": "Agent lacks authorization to access order records.",
            "customerFriendlyMessage": "I need to transfer you to a specialist who can access your order details."
        }
    except RefundPolicyViolation as e:
        # 5. Business rule: explain policy, no retry
        return {
            "isError": True, "errorCategory": "business",
            "isRetryable": False,
            "description": f"Refund rejected: {e.reason}",
            "customerFriendlyMessage": f"This order isn't eligible for a refund because {e.customer_reason}."
        }
Python — Subagent Error Propagation Payload
import json

def build_propagation_payload(error, partial_results, attempted):
    """Structured payload for coordinator - all context without subagent history."""
    return json.dumps({
        "status": "partial_failure",
        "errorCategory": error["errorCategory"],
        "isRetryable": error["isRetryable"],
        "description": error["description"],
        "partialResults": partial_results, # What was obtained before failure
        "attemptedActions": attempted,   # Prevents coordinator duplication
        "recommendation": "Continue with partial results; flag gap in report."
    })
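
On the receiving side, a coordinator might fold such payloads into its synthesis as sketched below. This is an assumption-laden illustration: `merge_subagent_reports` is a hypothetical function, and the `results` key on successful reports is invented for the example; only the partial_failure fields come from the payload above.

```python
import json

def merge_subagent_reports(reports: list[str]) -> dict:
    """Combine JSON reports from subagents, keeping partial results and noting gaps."""
    findings, gaps = [], []
    for raw in reports:
        report = json.loads(raw)
        if report.get("status") == "partial_failure":
            if report.get("partialResults"):
                findings.append(report["partialResults"])
            # Record what failed and what was already tried, so the final
            # report can flag the gap and nothing is re-run blindly.
            gaps.append({
                "errorCategory": report["errorCategory"],
                "attemptedActions": report.get("attemptedActions", []),
            })
        else:
            findings.append(report.get("results"))  # hypothetical success key
    return {"findings": findings, "gaps": gaps}
```

The coordinator keeps going with whatever findings exist and carries the gaps forward, which matches the "continue with partial results; flag gap in report" recommendation.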

Anti-Patterns to Avoid

⛔ Uniform "Operation Failed"

Catching all exceptions and returning the same generic error string. Destroys all classification information Claude needs to route recovery.

⛔ Retrying Non-Retryable

Retrying permission or business errors. Wastes compute cycles, generates security logs, and never succeeds. isRetryable: false must be respected.

⛔ Treating Empty as Error

Returning isError: true when a query has zero results. An empty DB response is a valid successful outcome. Claude should inform, not recover.

⛔ Propagating Without Context

Subagents sending bare error messages without partial results or attempted actions. Coordinator cannot make recovery decisions without this context.

⛔ Missing customerFriendlyMessage

Returning technical error details (stack traces, DB errors) that Claude may relay to end users. Always include a sanitized, customer-appropriate message.

⛔ No retryAfterSeconds on Transient

Marking an error isRetryable: true without a delay hint. Claude may retry immediately, hammering an already struggling service.

✓ Classify Before Constructing

Catch specific exception types in order: validation → permission → business → transient. Each branch constructs a fully structured error with all required fields.

✓ Local Retry Then Propagate

Subagents retry transient errors locally (1-2 attempts). If unresolved, propagate a structured summary: partial results + attempted actions.

✓ Separate Empty from Failure

Return isError: false, found: false for valid empty results. Reserve isError: true exclusively for actual tool execution failures.
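
The customerFriendlyMessage anti-pattern above can also be guarded against on the consuming side. A minimal sketch, assuming a hypothetical `user_facing_message` helper and an invented fallback string: it never falls back to description, which is written for Claude and may contain internal details.

```python
FALLBACK_MESSAGE = "Something went wrong on our side. Let me look into other options for you."

def user_facing_message(error: dict) -> str:
    """Pick text that is safe to show the end user."""
    message = error.get("customerFriendlyMessage")
    if isinstance(message, str) and message.strip():
        return message
    # Deliberately ignore error["description"]: it is technical and unsanitized.
    return FALLBACK_MESSAGE

print(user_facing_message({"description": "psycopg2.OperationalError: timeout"}))
```

The printed text is the generic fallback, not the database error, because the payload carried no sanitized message.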

Exam Readiness & Key Takeaways

🎓 Exam Scenario — Customer Support Error Recovery

Scenario: The process_refund tool returns an error for an item past the return window. Questions test: (1) Which error category? (Answer: business) (2) Should Claude retry? (Answer: No) (3) What should Claude tell the customer? (Answer: customerFriendlyMessage explaining policy).

Common distractors: choosing "retry the operation" for a business rule error, or choosing "inform the user and stop" for a transient error. Both are wrong: business errors are never retried, and transient errors should be retried first. The category determines the action.

  1. isError: true is the primary MCP failure signal. When set, Claude enters error recovery mode. The content carries the structured error payload. Without isError, content is treated as a successful response.
  2. Four error categories to memorize and differentiate: transient (retry after delay), validation (bad input, fix then maybe retry), permission (escalate immediately), business (explain policy). Only transient is retryable.
  3. isRetryable is the machine-readable gate. Always include retryAfterSeconds when isRetryable is true. Missing this causes immediate hammering of struggling services.
  4. Generic errors prevent all intelligent recovery. "Operation failed" gives Claude zero routing information. Always classify the exception before constructing the error response. Catch specific types, not bare Exception.
  5. Subagents handle local retry, then propagate structured summaries. Transient failures: retry 1-2 times locally. Still failing: propagate to coordinator with errorCategory + partialResults + attemptedActions.
  6. Empty result is NOT an error. isError: false, found: false = valid empty query. isError: true, errorCategory: permission = access failure requiring escalation. Never conflate these two scenarios.
  7. Always include customerFriendlyMessage. Claude may relay technical error details verbatim if no sanitized message is provided. Business and permission errors require human-appropriate explanations that don't expose internal system state.