When an MCP tool fails, what Claude does next depends entirely on the quality of the error response it receives. A generic "Operation failed" message leaves Claude guessing — should it retry? Escalate? Try a different tool? This guide covers the complete error classification taxonomy, the isError flag pattern, retryable vs non-retryable error design, and how subagents propagate failures up to coordinators without losing context.
A doctor orders a blood test. The lab experiences a machine failure. Compare two lab responses:
Bad response: "Test failed." — The doctor cannot decide what to do. Was the sample corrupted? Is the machine down temporarily? A billing issue?
Good response: "Analyzer unit B is offline (hardware fault, est. repair 4 hours). Sample intact and preserved. Retry after 16:00 using unit A. No new sample needed. Pre-auth ID: LAB-9921."
The structured error gives: the error category (hardware/transient), retryability (yes, after 4 hours), what was preserved (sample), and what action to take next. This is exactly what Claude needs from an MCP tool failure.
In agentic systems, Claude is the doctor, your MCP tool is the lab, and the error response is the lab's communication back. Claude cannot make intelligent recovery decisions from generic failure messages.
MCP tools communicate failures to Claude using the isError flag in the tool result. This is the primary signal Claude uses to detect a tool failure, distinct from a successful response with an empty or negative result.
isError: true = the tool failed to work properly. Claude enters error recovery mode.
isError: false + empty payload = the tool worked correctly but found no matching data. Claude informs the user and stops. These are completely different scenarios requiring completely different handling.

    // STRUCTURED ERROR (isError: true): triggers recovery logic
    {
      "type": "tool_result",
      "isError": true,
      "content": [
        {"type": "text", "text": "{\"errorCategory\": \"transient\", \"isRetryable\": true, \"retryAfterSeconds\": 30}"}
      ]
    }

    // VALID EMPTY RESULT (isError: false): tool succeeded, nothing found
    {
      "type": "tool_result",
      "isError": false,
      "content": [
        {"type": "text", "text": "{\"found\": false, \"message\": \"No orders found for CUST-00123\"}"}
      ]
    }

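A consumer of these results can branch on the flag before it looks at the payload. The following is a minimal sketch, assuming the result arrives as a dict shaped like the two examples above and that the text payload is valid JSON. `classify_result` is a hypothetical helper name, not part of any MCP SDK.

```python
import json


def classify_result(tool_result: dict) -> str:
    """Return "error", "empty", or "data" for a tool result dict."""
    # The flag is checked first: a failure is never inspected as data.
    if tool_result.get("isError", False):
        return "error"
    payload = json.loads(tool_result["content"][0]["text"])
    # The tool ran correctly but matched nothing: a valid outcome.
    if payload.get("found") is False:
        return "empty"
    return "data"


failure = {
    "type": "tool_result",
    "isError": True,
    "content": [{"type": "text", "text": json.dumps(
        {"errorCategory": "transient", "isRetryable": True, "retryAfterSeconds": 30})}],
}
empty = {
    "type": "tool_result",
    "isError": False,
    "content": [{"type": "text", "text": json.dumps(
        {"found": False, "message": "No orders found for CUST-00123"})}],
}

print(classify_result(failure))  # error
print(classify_result(empty))    # empty
```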
The exam guide defines four distinct error categories. Each requires a different recovery strategy. You must know all four precisely — including which are retryable and which are not.
| Category | Trigger | isRetryable | Claude Action |
|---|---|---|---|
| transient | Temporary infrastructure failures: timeouts, service unavailability, network blips | true | Retry after delay using retryAfterSeconds. Apply exponential backoff. |
| validation | Invalid input format or value from Claude (malformed ID, wrong type) | false | Do not retry with the same input. Re-examine how the input was constructed; correct it and call again only if the fix is obvious. |
| permission | Authorization or access control failure (caller lacks rights) | false | Do not retry. Escalate to human agent with explanation of access restriction. |
| business | Policy or domain rule violation (not an infrastructure failure) | false | Do not retry. Return customer-friendly explanation of the business rule violated. |
Transient errors are like a traffic jam — temporary, will clear, retry is appropriate. Validation, permission, and business errors are like driving the wrong way on a one-way street — the road itself prohibits the action; retrying does not help. The isRetryable boolean encodes this distinction machine-readably.
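The category-to-retryability mapping from the table can be written down directly. This is a small sketch; the `ErrorCategory` enum and the `is_retryable` helper are illustrative names, not identifiers from the exam guide or any SDK.

```python
from enum import Enum


class ErrorCategory(str, Enum):
    TRANSIENT = "transient"
    VALIDATION = "validation"
    PERMISSION = "permission"
    BUSINESS = "business"


# Only transient failures are expected to clear on their own.
RETRYABLE_CATEGORIES = {ErrorCategory.TRANSIENT}


def is_retryable(category: str) -> bool:
    """Map a category string to its retryability; unknown strings raise ValueError."""
    return ErrorCategory(category) in RETRYABLE_CATEGORIES


for name in ("transient", "validation", "permission", "business"):
    print(name, is_retryable(name))
```

Deriving `isRetryable` from the category in one place keeps individual tools from disagreeing about which failures are worth a second attempt.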
The isRetryable boolean is the single most operationally significant field in a structured error. It controls whether Claude wastes compute retrying a doomed operation vs. immediately escalating.
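To illustrate how a caller might act on `isRetryable` together with `retryAfterSeconds`, here is a hedged sketch of a delay calculation that combines the server's hint with exponential backoff. The function name, the base of 1 second, and the 300 second cap are assumptions chosen for the example.

```python
from typing import Optional


def next_delay(error: dict, attempt: int,
               base: float = 1.0, cap: float = 300.0) -> Optional[float]:
    """Return seconds to wait before retry number `attempt` (0-based), or None."""
    if not error.get("isRetryable", False):
        return None  # non-retryable: escalate or explain instead of waiting
    backoff = min(base * (2 ** attempt), cap)  # 1, 2, 4, ... capped
    hint = error.get("retryAfterSeconds", 0)
    # Never retry sooner than the tool asked for.
    return max(backoff, hint)


timeout = {"errorCategory": "transient", "isRetryable": True, "retryAfterSeconds": 30}
denied = {"errorCategory": "permission", "isRetryable": False}

print(next_delay(timeout, attempt=0))  # 30
print(next_delay(timeout, attempt=6))  # 64.0
print(next_delay(denied, attempt=0))   # None
```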
If all errors return "Operation failed", Claude cannot determine whether to: retry a transient error, give up on a non-retryable error, correct input for a validation error, or escalate a permission error. Structured classification enables all four intelligent paths.
The most common anti-pattern: except Exception: return "Operation failed". This destroys all differentiated error information Claude needs. Always classify the exception type before constructing the error response.
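One way to classify before responding is a single function that maps exception types to categories. The sketch below is an assumption-laden illustration: `PermissionDeniedError` and `BusinessRuleError` are hypothetical application exceptions, the built-in types chosen for each category are a judgment call, and the 30 second delay is an arbitrary default. Unrecognized exceptions are re-raised so that genuine bugs are not mislabeled.

```python
class PermissionDeniedError(Exception):
    """Hypothetical: raised when the caller lacks access rights."""


class BusinessRuleError(Exception):
    """Hypothetical: raised when a domain policy forbids the operation."""


def classify_exception(exc: Exception) -> dict:
    """Turn a caught exception into a structured error payload."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        category, retryable = "transient", True
    elif isinstance(exc, (ValueError, TypeError)):
        category, retryable = "validation", False
    elif isinstance(exc, (PermissionDeniedError, PermissionError)):
        category, retryable = "permission", False
    elif isinstance(exc, BusinessRuleError):
        category, retryable = "business", False
    else:
        raise exc  # unknown failure: surface it rather than guess a category
    error = {
        "isError": True,
        "errorCategory": category,
        "isRetryable": retryable,
        "description": f"{type(exc).__name__}: {exc}",
    }
    if retryable:
        error["retryAfterSeconds"] = 30  # assumed default delay hint
    return error


print(classify_exception(TimeoutError("order DB timed out"))["errorCategory"])
```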
| Field | Type | Required | Purpose |
|---|---|---|---|
| errorCategory | string enum | ✅ Always | transient / validation / permission / business |
| isRetryable | boolean | ✅ Always | Whether Claude should attempt the operation again |
| description | string | ✅ Always | Technical explanation for Claude (not the user) |
| retryAfterSeconds | integer | 🔶 When retryable | Minimum wait before retry to avoid hammering the service |
| customerFriendlyMessage | string | 🔶 Recommended | Safe message Claude can relay to the end user |
| partialResults | object | 🔶 If available | Any data obtained before the failure point |
| attemptedActions | string[] | 🔶 If relevant | Steps completed; prevents duplication on coordinator retry |
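The field rules in the table can be enforced at construction time. The builder below is a sketch under assumptions: `build_error` and its parameter names are hypothetical, and treating a missing `retryAfterSeconds` on a retryable error as a hard failure is one possible policy.

```python
from typing import Any, Optional

VALID_CATEGORIES = {"transient", "validation", "permission", "business"}


def build_error(category: str, is_retryable: bool, description: str, *,
                retry_after_seconds: Optional[int] = None,
                customer_friendly_message: Optional[str] = None,
                partial_results: Optional[Any] = None,
                attempted_actions: Optional[list] = None) -> dict:
    """Build a structured error dict, rejecting payloads that break the field rules."""
    if category not in VALID_CATEGORIES:
        raise ValueError(f"unknown errorCategory: {category!r}")
    if is_retryable and retry_after_seconds is None:
        raise ValueError("retryable errors need retryAfterSeconds")
    error = {
        "isError": True,
        "errorCategory": category,
        "isRetryable": is_retryable,
        "description": description,
    }
    optional = {
        "retryAfterSeconds": retry_after_seconds,
        "customerFriendlyMessage": customer_friendly_message,
        "partialResults": partial_results,
        "attemptedActions": attempted_actions,
    }
    # Only include optional fields that were actually supplied.
    error.update({k: v for k, v in optional.items() if v is not None})
    return error


print(build_error("transient", True, "Order DB timed out.", retry_after_seconds=30))
```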
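When Claude receives a tool result with isError: true, the appropriate recovery action follows from errorCategory and isRetryable: retry a transient error after the indicated delay, correct the input for a validation error, escalate a permission error to a human, and explain the policy to the customer for a business error.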
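The routing can be pictured as a small dispatch function. This is only an illustration of the decision logic, not a description of how Claude is implemented; the action labels and the choice to escalate unrecognized categories are assumptions.

```python
def next_action(error: dict) -> str:
    """Pick a recovery action from a structured error payload."""
    category = error.get("errorCategory")
    if category == "transient" and error.get("isRetryable"):
        return "retry_after_delay"
    if category == "validation":
        return "correct_input"
    if category == "permission":
        return "escalate_to_human"
    if category == "business":
        return "explain_policy"
    # Assumption: anything unclassified is safer to hand to a person.
    return "escalate_to_human"


print(next_action({"errorCategory": "transient", "isRetryable": True}))   # retry_after_delay
print(next_action({"errorCategory": "business", "isRetryable": False}))   # explain_policy
```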
In multi-agent systems, errors in a subagent must be propagated with enough context for the coordinator to make intelligent recovery decisions.
Subagents handle local recovery for transient failures (retry 1-2 times). They propagate to the coordinator ONLY errors that cannot be resolved locally, always including: errorCategory, partialResults, and attemptedActions. The coordinator should never receive raw infrastructure exceptions — only structured error payloads.
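The retry-locally-then-propagate pattern might look like the following sketch. It assumes the tool is exposed as a zero-argument callable returning a dict in the structured format; `run_with_local_retry`, the `status` value, and the injectable `sleep` parameter are hypothetical choices made for the example.

```python
import time
from typing import Callable


def run_with_local_retry(call: Callable[[], dict], max_retries: int = 2,
                         sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry transient failures locally; otherwise return a payload for the coordinator."""
    attempted = []
    result: dict = {}
    for attempt in range(max_retries + 1):
        result = call()
        attempted.append(f"attempt {attempt + 1}")
        if not result.get("isError"):
            return result  # success, including valid empty results
        if not result.get("isRetryable") or attempt == max_retries:
            break  # non-retryable, or local retries exhausted
        sleep(result.get("retryAfterSeconds", 1))
    # Propagate structured context only, never a raw exception.
    return {
        "status": "failure",
        "errorCategory": result.get("errorCategory"),
        "isRetryable": result.get("isRetryable", False),
        "description": result.get("description", ""),
        "partialResults": result.get("partialResults"),
        "attemptedActions": attempted,
    }


def always_down() -> dict:
    return {"isError": True, "errorCategory": "transient", "isRetryable": True,
            "retryAfterSeconds": 0, "description": "Order DB timed out."}


report = run_with_local_retry(always_down, sleep=lambda seconds: None)
print(report["attemptedActions"])  # ['attempt 1', 'attempt 2', 'attempt 3']
```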
One of the most frequently tested distinctions in Task 2.2: an access failure (tool could not execute) vs a valid empty result (tool executed successfully but found no data).

    async def lookup_order(order_id: str) -> dict:
        try:
            result = await order_db.get(order_id)
            return result
        except Exception:
            # All exceptions: same generic message - Claude cannot route recovery
            return {"isError": True, "error": "Operation failed"}


    # order_db, PermissionDeniedError, and RefundPolicyViolation are assumed
    # to be defined elsewhere in the application.
    async def lookup_order(order_id: str) -> dict:
        # 1. Validate input format FIRST (prevent hitting DB with bad input)
        if not order_id.startswith("ORD-"):
            return {
                "isError": True,
                "errorCategory": "validation",
                "isRetryable": False,
                "description": f"Invalid order_id format. Expected ORD-XXXXX, got: {order_id}",
                "customerFriendlyMessage": "I wasn't able to look up that order. Could you double-check the order number?",
            }
        try:
            result = await order_db.get(order_id)
            # 2. Valid empty result (NOT an error)
            if result is None:
                return {"found": False, "message": f"No order found with ID {order_id}."}
            return {"found": True, "order": result}
        except TimeoutError:
            # 3. Transient: temporary, retry after delay
            return {
                "isError": True,
                "errorCategory": "transient",
                "isRetryable": True,
                "retryAfterSeconds": 30,
                "description": "Order DB timed out. Service under load.",
                "customerFriendlyMessage": "I'm having trouble reaching the order system. Let me try again shortly.",
            }
        except PermissionDeniedError:
            # 4. Permission: escalate immediately, no retry
            return {
                "isError": True,
                "errorCategory": "permission",
                "isRetryable": False,
                "description": "Agent lacks authorization to access order records.",
                "customerFriendlyMessage": "I need to transfer you to a specialist who can access your order details.",
            }
        except RefundPolicyViolation as e:
            # 5. Business rule: explain policy, no retry
            return {
                "isError": True,
                "errorCategory": "business",
                "isRetryable": False,
                "description": f"Refund rejected: {e.reason}",
                "customerFriendlyMessage": f"This order isn't eligible for a refund because {e.customer_reason}.",
            }


    import json

    def build_propagation_payload(error, partial_results, attempted):
        """Structured payload for coordinator - all context without subagent history."""
        return json.dumps({
            "status": "partial_failure",
            "errorCategory": error["errorCategory"],
            "isRetryable": error["isRetryable"],
            "description": error["description"],
            "partialResults": partial_results,  # What was obtained before failure
            "attemptedActions": attempted,      # Prevents coordinator duplication
            "recommendation": "Continue with partial results; flag gap in report.",
        })

Catching all exceptions and returning the same generic error string. Destroys all classification information Claude needs to route recovery.
Retrying permission or business errors. Wastes compute cycles, generates security logs, and never succeeds. isRetryable: false must be respected.
Returning isError: true when a query has zero results. An empty DB response is a valid successful outcome. Claude should inform, not recover.
Subagents sending bare error messages without partial results or attempted actions. Coordinator cannot make recovery decisions without this context.
Returning technical error details (stack traces, DB errors) that Claude may relay to end users. Always include a sanitized, customer-appropriate message.
Marking an error isRetryable: true without a delay hint. Claude retries immediately, potentially hammering an already struggling service.
Catch specific exception types in order: validation → permission → business → transient. Each branch constructs a fully structured error with all required fields.
Subagents retry transient errors locally (1-2 attempts). If unresolved, propagate a structured summary: partial results + attempted actions.
Return isError: false, found: false for valid empty results. Reserve isError: true exclusively for actual tool execution failures.
Scenario: The process_refund tool returns an error for an item past the return window. Questions test: (1) Which error category? (Answer: business) (2) Should Claude retry? (Answer: No) (3) What should Claude tell the customer? (Answer: customerFriendlyMessage explaining policy).
Common distractor: Choosing "retry the operation" for a business rule error (wrong). Or "inform the user and stop" for a transient error (wrong — transient errors should retry first). The category determines the action.
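The refund scenario can be made concrete with a short sketch. Everything specific here is an assumption for illustration: the 30-day window, the `process_refund` signature, and the message wording are not taken from the exam guide.

```python
from datetime import date

RETURN_WINDOW_DAYS = 30  # hypothetical policy value


def process_refund(order_id: str, purchased_on: date, today: date) -> dict:
    """Reject refunds past the return window with a business-category error."""
    days_since_purchase = (today - purchased_on).days
    if days_since_purchase > RETURN_WINDOW_DAYS:
        return {
            "isError": True,
            "errorCategory": "business",
            "isRetryable": False,  # the policy will not change on a second call
            "description": (f"Order {order_id} is {days_since_purchase} days old; "
                            f"return window is {RETURN_WINDOW_DAYS} days."),
            "customerFriendlyMessage": (f"This item is outside our {RETURN_WINDOW_DAYS}-day "
                                        "return window, so it isn't eligible for a refund."),
        }
    return {"isError": False, "refunded": True, "orderId": order_id}


late = process_refund("ORD-00042", date(2024, 1, 1), date(2024, 3, 1))
print(late["errorCategory"], late["isRetryable"])  # business False
```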
Always include retryAfterSeconds when isRetryable is true. Missing this causes immediate hammering of struggling services.
Classify exceptions by type rather than catching a bare Exception.
isError: false, found: false = valid empty query. isError: true, errorCategory: permission = access failure requiring escalation. Never conflate these two scenarios.