Bulletproof your MCP server against every failure mode — from protocol errors to transport drops, with retries, circuit breakers, and graceful degradation.
📅 Day 10 of 120 · 🛡 Error Patterns · 🚀 Production Ready
Section 1
Why MCP Error Handling is Different
MCP has two completely separate error channels — protocol errors and tool errors — and confusing them is the #1 mistake beginners make. One channel speaks JSON-RPC; the other lives inside the tool result object. You must handle each differently.
📡
Protocol Error
throws → JSON-RPC error response
Thrown as an exception from your handler (or by the SDK). The SDK converts it into a JSON-RPC error object with a numeric code in the -32xxx range. From the client's perspective, callTool() throws an McpError.
Used for: invalid params, method not found, server bugs, parse failures.
🔧
Tool Error
isError: true in result
Returned as a normal result with isError: true set on the result object, alongside the content array. The protocol itself succeeds: callTool() resolves normally. The caller must check result.isError to detect the failure.
Used for: 404s, rate limits, upstream API failures, business-logic errors.
🍽️
Analogy time: A protocol error is like the waiter dropping your order slip before reaching the kitchen — the order never made it there at all. A tool error is like the kitchen receiving the order, trying to make your dish, but discovering the ingredient is out of stock. The request was processed; it just couldn't succeed.
⚠️
The most common mistake: throwing new Error('GitHub 404') from inside a tool handler. This produces a protocol error (code -32603 InternalError) instead of a tool error. The client sees an exception, not an isError result — breaking graceful handling entirely.
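To make the two channels concrete, here is a minimal sketch of what each failure looks like as a JSON-RPC response. The types and the `describeResponse` helper are hand-written for illustration and are assumptions of this example, not SDK exports.

```typescript
// Illustrative wire shapes; these type names are assumptions, not SDK exports.
interface JsonRpcErrorResponse {
  jsonrpc: '2.0';
  id: number;
  error: { code: number; message: string };
}

interface ToolResultResponse {
  jsonrpc: '2.0';
  id: number;
  result: {
    content: Array<{ type: 'text'; text: string }>;
    isError?: boolean;
  };
}

type ToolCallResponse = JsonRpcErrorResponse | ToolResultResponse;

// Protocol error: the response carries `error` and no `result`.
const protocolError: ToolCallResponse = {
  jsonrpc: '2.0',
  id: 1,
  error: { code: -32602, message: 'Invalid params: query is required' },
};

// Tool error: the response carries a normal `result`, flagged with isError.
const toolError: ToolCallResponse = {
  jsonrpc: '2.0',
  id: 2,
  result: {
    content: [{ type: 'text', text: 'Repository acme/missing not found' }],
    isError: true,
  },
};

// Decide which channel a response used.
function describeResponse(
  res: ToolCallResponse
): 'protocol-error' | 'tool-error' | 'success' {
  if ('error' in res) return 'protocol-error';
  return res.result.isError === true ? 'tool-error' : 'success';
}
```

The key point is structural: a protocol error never has a `result`, while a tool error is a perfectly valid `result` that happens to carry `isError: true`.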
Section 2
The Complete Error Taxonomy
Every possible outcome of a client tool call fits into one of four buckets. Understanding where each error type originates — and what the client sees — is the foundation of resilient MCP engineering.
```mermaid
flowchart TD
    A([Client calls tool]) --> B{Transport OK?}
    B -- No --> C[Transport Error\nconnection reset / timeout\nclient throws Error]
    B -- Yes --> D{Valid JSON-RPC?}
    D -- No --> E[Parse Error -32700\nclient throws McpError]
    D -- Yes --> F{Method exists?\nParams valid?}
    F -- No --> G[Protocol Error -32601/-32602\nclient throws McpError]
    F -- Yes --> H{Tool logic runs}
    H -- Exception thrown --> I[InternalError -32603\nclient throws McpError]
    H -- isError: true --> J[Tool Domain Error\nclient.callTool resolves\ncheck result.isError]
    H -- success --> K[Tool Success\nresult.content array\nresult.isError not set]
    style C fill:#1a1a24,stroke:#ef4444,color:#f87171
    style E fill:#1a1a24,stroke:#ef4444,color:#f87171
    style G fill:#1a1a24,stroke:#ef4444,color:#f87171
    style I fill:#1a1a24,stroke:#ef4444,color:#f87171
    style J fill:#1a1a24,stroke:#e11d48,color:#fb7185
    style K fill:#1a1a24,stroke:#10b981,color:#34d399
```
📶
Transport Error
Connection reset, socket timeout, process crash. Client throws a plain Error — not even an McpError. Usually requires reconnect.
📡
Protocol Error (JSON-RPC)
Bad JSON, unknown method, invalid params, or an unhandled exception in your tool. Client throws McpError with a -32xxx code.
🔧
Tool Domain Error (isError)
Your tool ran, but business logic failed. API 404, rate limit, validation error. callTool() resolves — caller checks result.isError.
⚠️
Success with Warnings
Tool succeeded but has partial results or advisory messages. Return isError: false with warning text mixed into the content array.
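A client can sort every call into these buckets with one wrapper. The sketch below is a minimal illustration: `McpErrorLike` is a local stand-in for the SDK's `McpError` so the example runs without dependencies, and `classifyCall` and the `Outcome` type are hypothetical names. Success with warnings is treated as plain success, since the warnings live in the content.

```typescript
// Local stand-in for the SDK's McpError (an assumption of this sketch).
class McpErrorLike extends Error {
  readonly code: number;
  constructor(code: number, message: string) {
    super(message);
    this.code = code;
    this.name = 'McpError';
  }
}

interface ToolResult {
  content: Array<{ type: 'text'; text: string }>;
  isError?: boolean;
}

type Outcome =
  | { kind: 'transport-error'; error: unknown }
  | { kind: 'protocol-error'; code: number; message: string }
  | { kind: 'tool-error'; result: ToolResult }
  | { kind: 'success'; result: ToolResult };

// A thrown error with a numeric `code` is treated as a protocol error.
// Node system errors such as ECONNRESET carry string codes, so they fall through.
function hasNumericCode(err: unknown): err is Error & { code: number } {
  return err instanceof Error && typeof (err as { code?: unknown }).code === 'number';
}

async function classifyCall(call: () => Promise<ToolResult>): Promise<Outcome> {
  try {
    const result = await call();
    return result.isError === true
      ? { kind: 'tool-error', result }
      : { kind: 'success', result };
  } catch (err) {
    if (hasNumericCode(err)) {
      return { kind: 'protocol-error', code: err.code, message: err.message };
    }
    return { kind: 'transport-error', error: err };
  }
}
```

In real client code you would pass something like `() => client.callTool(...)` as `call` and check `err instanceof McpError` instead of the duck-typed guard.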
Section 3
JSON-RPC Error Code Reference
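Seven error codes cover almost every protocol failure you will see: the five standard JSON-RPC 2.0 codes, plus two from the implementation-defined range (-32001 RequestTimeout and -32002 ResourceNotFound). Knowing which ones to retry, and which to never retry, is critical for building robust clients.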
| Code | Name | Meaning | Retry? | Typical Cause |
|---|---|---|---|---|
| -32700 | ParseError | Invalid JSON was received; the server couldn't parse the message at all | Never | SDK bug, corrupted transport stream |
| -32600 | InvalidRequest | The JSON structure is valid but not a valid JSON-RPC request object | Never | Missing jsonrpc or method field |
| -32601 | MethodNotFound | The requested method does not exist on this server | Never | Wrong tool name, typo, server version mismatch |
| -32602 | InvalidParams | Valid method but the params fail schema validation | Never | Missing required field, wrong type (Zod rejects it) |
| -32603 | InternalError | Uncaught exception inside the server handler | Maybe ×1 | Unhandled thrown Error, server-side transient bug |
| -32001 | RequestTimeout | Server did not respond within its configured timeout window | Yes w/ backoff | Slow upstream API, overloaded server, long computation |
| -32002 | ResourceNotFound | The requested resource URI does not exist | Never | Wrong URI, resource deleted, stale cached reference |
ℹ️
The golden rule: Only retry errors caused by transient conditions (timeouts, server overload). Never retry errors caused by bad inputs — -32602 InvalidParams won't fix itself no matter how many times you try.
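The table translates directly into a lookup that a client can consult before retrying. This is a minimal sketch: the `RetryAdvice` type and `retryAdviceFor` function are hypothetical names, and treating unknown codes as non-retriable is an assumption chosen to fail fast.

```typescript
type RetryAdvice = 'never' | 'once' | 'backoff';

// Retry advice per JSON-RPC error code, mirroring the reference table.
const RETRY_ADVICE = new Map<number, RetryAdvice>([
  [-32700, 'never'],   // ParseError
  [-32600, 'never'],   // InvalidRequest
  [-32601, 'never'],   // MethodNotFound
  [-32602, 'never'],   // InvalidParams
  [-32603, 'once'],    // InternalError: possibly transient
  [-32001, 'backoff'], // RequestTimeout
  [-32002, 'never'],   // ResourceNotFound
]);

// Unknown codes default to 'never' (an assumption of this sketch).
function retryAdviceFor(code: number): RetryAdvice {
  return RETRY_ADVICE.get(code) ?? 'never';
}
```

Keeping the policy in data rather than scattered `if` statements makes it easy to review against the table and to adjust per deployment.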
Section 4
Tool Error Patterns (isError)
The rule is simple: never throw from a tool handler unless you want a protocol error. For anything that is an expected operational failure — API 404, rate limit, bad user input — return isError: true with structured content.
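The later examples call a `createToolError` helper whose definition is not shown in this lesson. The following is one possible sketch of it, not the original implementation: the set of codes is taken from the categories used elsewhere in the lesson, while the `retryAfterSeconds` field and the exact text layout are assumptions.

```typescript
type ToolErrorCode =
  | 'NOT_FOUND'
  | 'RATE_LIMITED'
  | 'UPSTREAM_ERROR'
  | 'PERMISSION_DENIED';

interface ToolErrorInput {
  code: ToolErrorCode;
  message: string;
  retryAfterSeconds?: number; // hypothetical field, useful for RATE_LIMITED
}

interface ToolResult {
  content: Array<{ type: 'text'; text: string }>;
  isError?: boolean;
}

// Build a tool-level error result: callTool() resolves, isError is true.
function createToolError(input: ToolErrorInput): ToolResult {
  return {
    content: [
      // Item 1: human-readable summary for the LLM.
      { type: 'text', text: `[${input.code}] ${input.message}` },
      // Item 2: structured JSON for programmatic clients.
      { type: 'text', text: JSON.stringify({ error: input }) },
    ],
    isError: true,
  };
}

// Example: a rate-limited upstream call.
const rateLimited = createToolError({
  code: 'RATE_LIMITED',
  message: 'GitHub API rate limit reached',
  retryAfterSeconds: 30,
});
```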
Why two content items? The first is a human-readable summary for the LLM to include in its response. The second is structured JSON for any client that wants to programmatically parse the error category and act on it (e.g., show a "Rate limited — retry in Xs" message to the user).
Section 5
Protocol Error Handling
Throwing from a tool handler should be rare — reserved for truly unexpected bugs and cases where the SDK itself validates inputs. Understand exactly when a throw is correct vs. when it creates a worse experience.
🤔
When SHOULD you throw? Only two legitimate cases:
(1) Zod input validation — but the SDK does this for you automatically when you declare a schema.
(2) A truly unexpected server-side bug that you have no way to recover from (e.g., config missing at startup).
Everything else should be isError: true.
```typescript
// src/errors/protocol-errors.ts
import { z } from 'zod';
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';
import { createToolError } from './tool-errors.js';
// `server` (the MCP server instance) and `github` (an API client) are
// assumed to be created elsewhere in the project.

// ── The McpError class ────────────────────────────────────────────────────
// Throwing McpError produces a proper JSON-RPC error response.
// The SDK serialises it as: { code: -32xxx, message: '...' }

// ✅ Correct: throw McpError for protocol-level issues
server.tool('search_repos', { query: z.string().min(1) }, async (args) => {
  // Zod .min(1) already rejects an empty string with InvalidParams (-32602),
  // so you don't need to check it again.
  // But if you discover a logic invariant is violated:
  if (!process.env.GITHUB_TOKEN) {
    throw new McpError(
      ErrorCode.InternalError,
      'Server misconfiguration: GITHUB_TOKEN not set'
    );
  }
  // ... normal handler logic
});

// ── Throwing vs returning isError ─────────────────────────────────────────
// The two get_user registrations below are alternatives; register only one.

// ❌ Wrong: manual throw for expected business failure
server.tool('get_user', { username: z.string() }, async (args) => {
  const user = await github.getUser(args.username);
  if (!user) throw new Error('User not found'); // becomes -32603 InternalError!
  return { content: [{ type: 'text', text: JSON.stringify(user) }] };
});

// ✅ Right: return isError for expected business failure
server.tool('get_user', { username: z.string() }, async (args) => {
  const user = await github.getUser(args.username);
  if (!user) {
    return createToolError({ code: 'NOT_FOUND', message: `User ${args.username} not found` });
  }
  return { content: [{ type: 'text', text: JSON.stringify(user) }] };
});
```
📌
Zod input validation errors are protocol errors, not tool errors. When you pass a Zod schema to server.tool(), the SDK runs validation before your handler is ever called. If it fails, the SDK automatically throws McpError(ErrorCode.InvalidParams, ...) — your code never runs and there is no isError result.
Section 6
Retry Strategies
Not every failure deserves a retry. Retrying non-retriable errors wastes time and amplifies load on already-struggling services. Use exponential backoff with jitter, and prefer a time budget over a fixed retry count.
| Error Type | Retry? | Strategy |
|---|---|---|
| Transport error (socket reset) | Yes | Reconnect transport, then retry with backoff |
| -32603 InternalError | Once | Single retry after 500ms; could be transient |
| -32001 RequestTimeout | Yes | Exponential backoff up to budget |
| isError: RATE_LIMITED | Yes | Respect retryAfter header, then linear backoff |
| isError: UPSTREAM_ERROR (5xx) | Maybe | Backoff; check if upstream is intermittent |
| -32602 InvalidParams | Never | Wrong args; fix the call, don't retry it |
| -32601 MethodNotFound | Never | Wrong method name; a code bug |
| isError: NOT_FOUND | Never | Resource doesn't exist; won't change on retry |
| isError: PERMISSION_DENIED | Never | Auth issue; credentials must change first |
```typescript
// src/retry/backoff.ts
import { z } from 'zod';
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';
// `server` is assumed to be created elsewhere in the project.

interface RetryOptions {
  maxBudgetMs: number;    // total time we're willing to spend
  initialDelayMs: number; // first retry delay
  maxDelayMs: number;     // cap on any single delay
  jitterFactor: number;   // 0–1, how much randomness to add
}

const DEFAULT_OPTS: RetryOptions = {
  maxBudgetMs: 30_000,
  initialDelayMs: 300,
  maxDelayMs: 8_000,
  jitterFactor: 0.25,
};

function withJitter(delay: number, factor: number): number {
  const jitter = delay * factor * (Math.random() * 2 - 1); // ±(factor × delay)
  return Math.round(delay + jitter);
}

// ── Retry budget pattern ──────────────────────────────────────────────────
export async function retryWithBudget<T>(
  fn: (attempt: number) => Promise<T>,
  isRetriable: (err: unknown) => boolean,
  opts: Partial<RetryOptions> = {}
): Promise<T> {
  const o = { ...DEFAULT_OPTS, ...opts };
  const deadline = Date.now() + o.maxBudgetMs;
  let delay = o.initialDelayMs;
  let attempt = 0;
  while (true) {
    attempt++;
    try {
      return await fn(attempt);
    } catch (err) {
      const remaining = deadline - Date.now();
      // Give up on non-retriable errors or once the time budget is spent.
      if (!isRetriable(err) || remaining <= 0) throw err;
      const waitMs = Math.min(withJitter(delay, o.jitterFactor), remaining, o.maxDelayMs);
      console.warn(`[retry] attempt ${attempt} failed; waiting ${waitMs}ms (${remaining}ms budget left)`);
      await new Promise((r) => setTimeout(r, waitMs));
      delay = Math.min(delay * 2, o.maxDelayMs); // exponential growth, capped
    }
  }
}

// ── Decorator: wrap any tool handler with retry logic ─────────────────────
type ToolHandler<A, R> = (args: A) => Promise<R>;

// Note: every code listed here is retried until the budget runs out. To follow
// the table strictly, cap InternalError at a single retry in isRetriable.
const RETRIABLE_CODES = new Set<number>([
  ErrorCode.InternalError,  // -32603
  ErrorCode.RequestTimeout, // -32001 (if your SDK version defines it)
]);

export function decorateWithRetry<A, R>(
  handler: ToolHandler<A, R>,
  opts?: Partial<RetryOptions>
): ToolHandler<A, R> {
  return (args) =>
    retryWithBudget(
      () => handler(args),
      (err) => err instanceof McpError && RETRIABLE_CODES.has(err.code),
      opts
    );
}

// Usage:
server.tool('search_repos', { query: z.string() },
  decorateWithRetry(async (args: { query: string }) => {
    // ... handler logic
    return { content: [{ type: 'text' as const, text: `Results for ${args.query}` }] };
  }, { maxBudgetMs: 15_000 })
);
```
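The retry table also lists `isError: RATE_LIMITED`, which never throws and so needs a result-based retry rather than an exception-based one. The sketch below is one way to do that under stated assumptions: it expects the second content item to hold JSON shaped like `{ "error": { "code": "RATE_LIMITED", "retryAfterSeconds": 2 } }`, which is a convention of this example and not part of MCP. The names `rateLimitDelayMs` and `callWithRateLimitRetry` are hypothetical, and the budget counts only time spent sleeping.

```typescript
interface ToolResult {
  content: Array<{ type: 'text'; text: string }>;
  isError?: boolean;
}

// Returns the wait in ms if the result is a RATE_LIMITED tool error, else null.
function rateLimitDelayMs(result: ToolResult): number | null {
  if (result.isError !== true) return null;
  const structured = result.content[1];
  if (!structured) return null;
  try {
    const parsed = JSON.parse(structured.text) as {
      error?: { code?: string; retryAfterSeconds?: number };
    };
    if (parsed.error?.code !== 'RATE_LIMITED') return null;
    // Fall back to 1s when no hint is given; never wait less than 100ms.
    return Math.max(100, (parsed.error?.retryAfterSeconds ?? 1) * 1000);
  } catch {
    return null; // second item was not JSON: treat as non-retriable
  }
}

// Re-invoke `call` while it reports RATE_LIMITED and the sleep budget allows.
async function callWithRateLimitRetry(
  call: () => Promise<ToolResult>,
  maxBudgetMs = 30_000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<ToolResult> {
  let waitedMs = 0;
  while (true) {
    const result = await call();
    const delay = rateLimitDelayMs(result);
    if (delay === null || waitedMs + delay > maxBudgetMs) return result;
    await sleep(delay);
    waitedMs += delay;
  }
}
```

The injectable `sleep` parameter exists so tests can run instantly; production callers can omit it.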
Section 7
Circuit Breaker Pattern
Retrying is fine for occasional failures. But if a downstream service is truly down, hammering it with retries makes things worse. A circuit breaker stops trying after N consecutive failures, gives the service time to recover, then probes cautiously before resuming full traffic.
CLOSED
Normal operation. Requests pass through. Failures are counted. When threshold is hit → OPEN.
OPEN
All requests immediately rejected without calling downstream. After cooldown period → HALF-OPEN.
HALF-OPEN
One trial request is allowed. If it succeeds → CLOSED. If it fails → back to OPEN.
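The three states map onto a small class. The version below is a minimal sketch, not the lesson's original `CircuitBreaker`: its constructor options (`failureThreshold`, `cooldownMs`), `getState()`, `call()`, and the `'Circuit breaker OPEN'` message are chosen to match how the tests later in this lesson use the class, and the injectable `now` clock is an added assumption that makes the cooldown testable.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before opening
  cooldownMs: number;       // how long to stay OPEN before probing
  now?: () => number;       // injectable clock (assumption, for tests)
}

class CircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(opts: BreakerOptions) {
    this.failureThreshold = opts.failureThreshold;
    this.cooldownMs = opts.cooldownMs;
    this.now = opts.now ?? (() => Date.now());
  }

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // A trial request is already in flight: reject everything else.
    if (this.state === 'HALF_OPEN') throw new Error('Circuit breaker OPEN');
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error('Circuit breaker OPEN'); // fail fast, no downstream call
      }
      this.state = 'HALF_OPEN'; // cooldown elapsed: allow one trial request
    }
    try {
      const value = await fn();
      this.failures = 0;
      this.state = 'CLOSED';
      return value;
    } catch (err) {
      this.failures++;
      // A failed trial, or too many consecutive failures, (re)opens the circuit.
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

Inside a tool handler you would typically catch the `'Circuit breaker OPEN'` error and return an `isError: true` result with a friendly message instead of letting it propagate.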
Section 8
Timeout Handling
MCP tool calls can hang indefinitely if an upstream service never responds. Always apply two timeout layers: a per-tool timeout via AbortController, and a global request timeout for the overall server. And always clean up in a finally block.
```typescript
// src/timeout/with-timeout.ts
import { z } from 'zod';
import { createToolError } from '../errors/tool-errors.js';
// `server` is assumed to be created elsewhere in the project.

// ── Layer 1: Per-tool timeout via AbortController ─────────────────────────
export async function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  toolName: string
): Promise<T> {
  const controller = new AbortController();
  const handle = setTimeout(() => {
    controller.abort(new Error(`${toolName} timed out after ${timeoutMs}ms`));
  }, timeoutMs);
  try {
    return await fn(controller.signal);
  } finally {
    // ✅ Always clear the timer; prevents handle leaks even on success
    clearTimeout(handle);
  }
}

// ── Layer 2: pass the signal to fetch so the request is really cancelled ──
async function fetchWithTimeout(url: string, signal: AbortSignal): Promise<Response> {
  const response = await fetch(url, { signal });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response;
}

// ── Complete example: search tool with 10s timeout ────────────────────────
server.tool('search_repos', { query: z.string() }, async (args) => {
  return withTimeout(
    async (signal) => {
      try {
        const res = await fetchWithTimeout(
          `https://api.github.com/search/repositories?q=${encodeURIComponent(args.query)}`,
          signal
        );
        const data = await res.json();
        return { content: [{ type: 'text', text: JSON.stringify(data, null, 2) }] };
      } catch (err) {
        // abort() was given a custom reason, so fetch rejects with that Error
        // rather than an AbortError. Checking signal.aborted covers both cases.
        if (signal.aborted) {
          return createToolError({
            code: 'UPSTREAM_ERROR',
            message: 'Search request timed out. GitHub API is slow; try a more specific query.',
          });
        }
        return createToolError({ code: 'UPSTREAM_ERROR', message: String(err) });
      }
    },
    10_000,
    'search_repos'
  );
});
```
🚨
Always clean up timers in finally blocks. If you call setTimeout to abort a fetch, but the fetch succeeds first, the timer is still pending. Without clearTimeout in a finally, Node.js keeps that timer and the closure it references alive until it fires. Multiplied across thousands of requests, this wastes memory and can delay graceful shutdown, because pending timers keep the event loop running.
Section 9
Graceful Degradation
When the primary data source fails, don't just return an error — try to return something useful. Stale cache, partial results, and fallback chains turn hard failures into soft degradations that the LLM can work with.
```typescript
// src/degradation/graceful.ts
import { z } from 'zod';
import { createToolError } from '../errors/tool-errors.js';
// `server`, `githubClient`, and `mirrorClient` are assumed to be created
// elsewhere in the project.

// ── Pattern 1: Stale cache fallback ──────────────────────────────────────
const cache = new Map<string, { data: unknown; ts: number }>();

async function withStaleFallback<T>(
  key: string,
  freshFn: () => Promise<T>,
  maxStaleMs = 5 * 60_000 // serve stale data up to 5 min old
): Promise<{ data: T; stale: boolean }> {
  try {
    const data = await freshFn();
    cache.set(key, { data, ts: Date.now() });
    return { data, stale: false };
  } catch {
    const entry = cache.get(key);
    if (entry && Date.now() - entry.ts < maxStaleMs) {
      return { data: entry.data as T, stale: true };
    }
    throw new Error('No fresh or cached data available');
  }
}

// ── Pattern 2: Partial results ────────────────────────────────────────────
async function fetchMultipleRepos(repos: string[]) {
  const results = await Promise.allSettled(
    repos.map(async (r) => {
      const [owner, name] = r.split('/');
      return { repo: r, data: await githubClient.getRepo(owner, name) };
    })
  );

  // Split by outcome while the index still lines up with `repos`.
  // (Indexing after a filter would attribute failures to the wrong repo.)
  const successes: unknown[] = [];
  const failures: Array<{ repo: string; reason: string }> = [];
  results.forEach((r, i) => {
    if (r.status === 'fulfilled') {
      successes.push(r.value);
    } else {
      const reason = r.reason instanceof Error ? r.reason.message : String(r.reason);
      failures.push({ repo: repos[i], reason });
    }
  });

  return {
    content: [
      {
        type: 'text',
        text: JSON.stringify({
          results: successes,
          warnings: failures.length
            ? `${failures.length} repos failed: ${failures.map((f) => f.repo).join(', ')}`
            : undefined,
        }, null, 2),
      },
    ],
    isError: false, // partial success is still success
  };
}

// ── Pattern 3: Fallback chain ─────────────────────────────────────────────
server.tool('get_repo_stats', { owner: z.string(), repo: z.string() }, async (args) => {
  const cacheKey = `${args.owner}/${args.repo}`;
  // Try 1: fresh primary API
  // Try 2: stale cache (up to 10 min old)
  // Try 3: secondary data source (e.g., a read-replica or mirror)
  // Try 4: hard error
  try {
    const { data, stale } = await withStaleFallback(
      cacheKey,
      () => githubClient.getRepoStats(args.owner, args.repo),
      10 * 60_000
    );
    return {
      content: [
        {
          type: 'text',
          text: JSON.stringify({
            ...(data as object),
            _meta: stale ? { warning: 'Showing cached data; live API unavailable' } : undefined,
          }, null, 2),
        },
      ],
    };
  } catch {
    // Try secondary source
    try {
      const mirror = await mirrorClient.getRepoStats(args.owner, args.repo);
      return {
        content: [{
          type: 'text',
          text: JSON.stringify({ ...mirror, _meta: { source: 'mirror' } }, null, 2),
        }],
      };
    } catch {
      return createToolError({
        code: 'UPSTREAM_ERROR',
        message: `All data sources unavailable for ${cacheKey}. Please try again later.`,
      });
    }
  }
});
```
Section 10
Structured Error Logging
Random console.error() calls are untrackable in production. Use correlation IDs to link a client request to every log line it generates, and emit structured JSON to stderr so log aggregators can index, filter, and alert on it.
Two logging channels in MCP: process.stderr for server-side logs (visible in your terminal / log aggregator), and server.notification({ method: 'notifications/message' }) to send log events to the connected client over the MCP protocol. Use both for full observability.
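For the stderr channel, a structured logger can be very small. The sketch below is illustrative only: the field names, `formatErrorLog`, `logError`, and `newCorrelationId` are all hypothetical, and the ID generator is a simple stand-in for whatever request ID scheme your system already uses. It relies on `console.error`, which writes to stderr in Node.js.

```typescript
interface ErrorLogFields {
  correlationId: string; // links every log line of one client request
  tool: string;
  errorCode: string;
  durationMs: number;
  attempt: number;
}

// Build one JSON log line; `now` is injectable so output is deterministic in tests.
function formatErrorLog(
  fields: ErrorLogFields,
  message: string,
  now: Date = new Date()
): string {
  return JSON.stringify({ level: 'error', ts: now.toISOString(), message, ...fields });
}

// console.error writes to stderr, so stdout stays free for protocol traffic.
function logError(fields: ErrorLogFields, message: string): void {
  console.error(formatErrorLog(fields, message));
}

// Simple process-local ID generator (an assumption of this sketch).
let requestCounter = 0;
function newCorrelationId(): string {
  requestCounter += 1;
  return `req-${Date.now().toString(36)}-${requestCounter}`;
}

// Usage inside a tool handler:
const correlationId = newCorrelationId();
logError(
  { correlationId, tool: 'search_repos', errorCode: 'UPSTREAM_ERROR', durationMs: 1240, attempt: 2 },
  'GitHub search failed'
);
```

Because each line is a single JSON object, log aggregators can index on `correlationId` or `errorCode` without custom parsing.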
Section 11
Testing Error Paths (Vitest)
Happy-path tests are easy. Error-path tests are where production reliability is built. Every retry strategy, every circuit breaker transition, and every isError branch needs its own test.
```typescript
// src/__tests__/error-handling.test.ts
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';
import { CircuitBreaker } from '../circuit-breaker/CircuitBreaker.js';
// `callGetRepoHandler` and `createTestServerClient` are assumed to be
// helpers defined in your own test utilities.

// ── Mock GitHub client ────────────────────────────────────────────────────
const mockGithub = {
  getRepo: vi.fn(),
  searchRepos: vi.fn(),
};

// ── Test 1: Tool returns isError on upstream 404 ──────────────────────────
describe('get_repo tool', () => {
  it('returns isError: true when GitHub returns 404', async () => {
    mockGithub.getRepo.mockResolvedValueOnce({ status: 404, ok: false });
    const result = await callGetRepoHandler({ owner: 'acme', repo: 'missing' }, mockGithub);
    expect(result.isError).toBe(true);
    expect(result.content[0].text).toContain('NOT_FOUND');
    expect(result.content[0].text).toContain('acme/missing');
  });

  it('does NOT throw for GitHub 404, so the protocol stays healthy', async () => {
    mockGithub.getRepo.mockResolvedValueOnce({ status: 404, ok: false });
    // The handler should resolve, not reject
    await expect(
      callGetRepoHandler({ owner: 'acme', repo: 'missing' }, mockGithub)
    ).resolves.toBeDefined();
  });
});

// ── Test 2: Circuit breaker opens after threshold ─────────────────────────
describe('CircuitBreaker', () => {
  let cb: CircuitBreaker;

  beforeEach(() => {
    cb = new CircuitBreaker({ failureThreshold: 3, cooldownMs: 5_000 });
  });

  it('starts in CLOSED state', () => {
    expect(cb.getState()).toBe('CLOSED');
  });

  it('opens after failureThreshold consecutive failures', async () => {
    const fail = () => Promise.reject(new Error('upstream down'));
    await expect(cb.call(fail)).rejects.toThrow();
    await expect(cb.call(fail)).rejects.toThrow();
    await expect(cb.call(fail)).rejects.toThrow(); // 3rd failure → OPEN
    expect(cb.getState()).toBe('OPEN');
  });

  it('rejects immediately in OPEN state without calling downstream', async () => {
    // Force open
    for (let i = 0; i < 3; i++) {
      await cb.call(() => Promise.reject(new Error('fail'))).catch(() => {});
    }
    const spy = vi.fn().mockResolvedValue('ok');
    await expect(cb.call(spy)).rejects.toThrow('Circuit breaker OPEN');
    expect(spy).not.toHaveBeenCalled(); // downstream never called
  });
});

// ── Test 3: Zod validation returns JSON-RPC InvalidParams error ───────────
describe('Zod validation (protocol error)', () => {
  it('returns McpError with code -32602 for missing required param', async () => {
    // Use in-process MCP client/server for integration test
    const { client, cleanup } = await createTestServerClient();
    try {
      // Asserting on the rejection makes the test fail if callTool resolves;
      // a bare try/catch would pass silently in that case.
      const call = client.callTool({ name: 'search_repos', arguments: {} }); // missing query
      await expect(call).rejects.toBeInstanceOf(McpError);
      await expect(call).rejects.toMatchObject({ code: ErrorCode.InvalidParams });
    } finally {
      await cleanup();
    }
  });
});
```
🧪
Mock injection pattern: Pass your GitHub client as a parameter to the tool handler factory (dependency injection) rather than importing it at the module level. This makes it trivial to swap in mockGithub in tests without any module mocking magic — just a different argument.
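A handler factory following this pattern might look like the sketch below. The `RepoClient` interface, `makeGetRepoHandler`, and the response shape (`ok`, `status`, `data`) are hypothetical names chosen to line up with the mocks used in the tests, not part of any SDK.

```typescript
interface RepoClient {
  getRepo(
    owner: string,
    repo: string
  ): Promise<{ ok: boolean; status: number; data?: unknown }>;
}

interface ToolResult {
  content: Array<{ type: 'text'; text: string }>;
  isError?: boolean;
}

// The client is a parameter, so tests can pass a fake instead of the real one.
function makeGetRepoHandler(client: RepoClient) {
  return async (args: { owner: string; repo: string }): Promise<ToolResult> => {
    const res = await client.getRepo(args.owner, args.repo);
    if (res.status === 404) {
      return {
        content: [{ type: 'text', text: `[NOT_FOUND] Repository ${args.owner}/${args.repo} not found` }],
        isError: true,
      };
    }
    if (!res.ok) {
      return {
        content: [{ type: 'text', text: `[UPSTREAM_ERROR] GitHub returned HTTP ${res.status}` }],
        isError: true,
      };
    }
    return { content: [{ type: 'text', text: JSON.stringify(res.data ?? null) }] };
  };
}

// In a test, a hand-written fake is enough; no module mocking required.
const fakeClient: RepoClient = {
  getRepo: async () => ({ ok: false, status: 404 }),
};
const getRepoHandler = makeGetRepoHandler(fakeClient);
```

In production you would call `makeGetRepoHandler(realGithubClient)` once at startup and register the returned function as the tool handler.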
Section 12
Production Error Checklist
Before you ship any MCP server to production, run through this checklist. Each item maps directly to a pattern covered in this lesson.
1. Never throw plain errors from tool handlers. All expected business failures (404, rate limit, validation, permission) use return createToolError({ code, message }) with isError: true.
2. Throw McpError (not Error) for protocol-level failures. Use new McpError(ErrorCode.InvalidParams, msg) only when the request itself is malformed beyond what Zod already catches.
3. Implement retry with exponential backoff and jitter for timeout and transient InternalError codes. Cap retries with a time budget (e.g. 30s), not a fixed count.
4. Never retry -32602 InvalidParams, -32601 MethodNotFound, NOT_FOUND, or PERMISSION_DENIED. These are deterministic failures; repeating them wastes resources and delays user feedback.
5. Wrap every external API call with a CircuitBreaker. Use a threshold of 5 consecutive failures and a 30-second cooldown minimum. Return a user-friendly isError result when the breaker is open.
6. Apply per-tool timeouts via AbortController for all outbound HTTP calls. Always clearTimeout in a finally block to prevent handle leaks. Recommended default: 10s per tool, 30s global.
7. Implement at least one graceful degradation strategy for every tool that calls an external service: stale cache, partial results, or a fallback source. Never let an upstream outage become a complete tool blackout.
8. Use structured JSON logging to stderr with a correlation ID, tool name, error code, duration, and attempt number on every error. Never use console.log on a stdio transport: it writes to stdout, which breaks the JSON-RPC channel.
9. Emit MCP log notifications via server.notification({ method: 'notifications/message' }) for errors the client host should surface to the user. Keep stderr logs for server-side observability.
10. Write tests for every error path: isError on upstream failure, circuit breaker state transitions, Zod validation → InvalidParams, timeout → isError result. Use dependency injection to inject mock clients without module mocking.
🎓
Level 1 Complete!
You've mastered all 10 foundational MCP concepts. From protocol basics to production error handling — you're now a solid MCP practitioner.
Day 1: What is MCP? · Day 2: Protocol Internals · Day 3: Your First Server · Day 4: Tools Deep Dive · Day 5: Resources · Day 6: Prompts · Day 7: Week 1 Recap · Day 8: Transport Layer · Day 9: Client SDK · Day 10: Error Handling ✓
Quiz · Day 10
Error Handling Check
5 questions covering the two error channels, retry strategies, circuit breakers, timeouts, and isError semantics. Score 5/5 and you're production-ready.
Q1. A tool handler calls the GitHub API which returns a 404. The correct pattern is to...
A. throw new McpError(ErrorCode.InternalError, '404'): map the HTTP error to a protocol error
B. return a result with isError: true and a structured NOT_FOUND message, so callTool() resolves and the caller can check result.isError
C. throw new Error('GitHub 404'): let the SDK convert it to a JSON-RPC error automatically
D. return an empty content array: the client infers failure from the missing content
Q2. Which JSON-RPC error code should you never retry automatically?
A. -32603 InternalError: could be transient, worth one retry
B. -32001 RequestTimeout: the server was just slow, retry with backoff
C. -32602 InvalidParams: wrong arguments won't fix themselves on retry
D. -32603 and -32001 both should never be retried
Q3. A circuit breaker is in OPEN state. A request arrives. What happens?
A. The request is immediately rejected without calling the downstream service
B. The request is queued and retried automatically when the breaker closes
C. The circuit breaker switches to HALF_OPEN and tries the request as a trial
D. The request falls through to the next retry attempt in the backoff queue
Q4. You have a tool handler that uses AbortController for a 10-second timeout. The fetch completes in 3 seconds. What must you do to avoid a memory/handle leak?
A. Call process.nextTick to flush the event loop and release the timer
B. Call clearTimeout on the timeout handle and abort the controller in a finally block
C. Nothing: the AbortController garbage-collects automatically when it goes out of scope
D. Restart the server: leaked handles can't be cleaned up without a process restart
Q5. Your MCP server's tool handler returns { isError: true, content: [...] }. From the client's perspective, what does client.callTool() do?
A. Throws an McpError with code -32603 InternalError: isError maps to a protocol error
B. Rejects the promise: isError: true always means a rejected promise on the client
C. Resolves normally: the caller must check result.isError to detect the error condition
D. Returns undefined: an isError result has no content for the client to inspect