Day 10 · Level 1 Complete · Spark Phase

Error Handling & Resilience

Bulletproof your MCP server against every failure mode — from protocol errors to transport drops, with retries, circuit breakers, and graceful degradation.

📅 Day 10 of 120 🛡 Error Patterns 🚀 Production Ready
Why MCP Error Handling is Different
MCP has two completely separate error channels — protocol errors and tool errors — and confusing them is the #1 mistake beginners make. One channel speaks JSON-RPC; the other lives inside the tool result object. You must handle each differently.
📡
Protocol Error
throws → JSON-RPC error response
Thrown as an exception from your handler (or by the SDK). The SDK converts it into a JSON-RPC error object with a numeric code in the -32xxx range. From the client's perspective, callTool() throws an McpError.

Used for: invalid params, method not found, server bugs, parse failures.
🔧
Tool Error
isError: true in result
Returned as a normal result with isError: true set at the top level of the result object, alongside the content array. The protocol itself succeeds: callTool() resolves normally. The caller must check result.isError to detect the failure.

Used for: 404s, rate limits, upstream API failures, business-logic errors.
🍽️
Analogy time: A protocol error is like the waiter dropping your order slip before reaching the kitchen — the order never made it there at all. A tool error is like the kitchen receiving the order, trying to make your dish, but discovering the ingredient is out of stock. The request was processed; it just couldn't succeed.
⚠️
The most common mistake: throwing new Error('GitHub 404') from inside a tool handler. This produces a protocol error (code -32603 InternalError) instead of a tool error. The client sees an exception, not an isError result — breaking graceful handling entirely.
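To make the two channels concrete, here is a minimal sketch of what each looks like as a JSON-RPC response. The interfaces and the channelOf helper are hand-written for illustration and are not SDK types; the messages and ids are made up.

```typescript
// Minimal JSON-RPC response shapes, hand-written for illustration.
// The SDK's real types carry more fields.
type TextContent = { type: 'text'; text: string };

interface ErrorResponse {
  jsonrpc: '2.0';
  id: number;
  error: { code: number; message: string };
}

interface ResultResponse {
  jsonrpc: '2.0';
  id: number;
  result: { content: TextContent[]; isError?: boolean };
}

type ToolCallResponse = ErrorResponse | ResultResponse;

// Channel 1: protocol error. The response has an `error` member and no `result`.
const protocolError: ToolCallResponse = {
  jsonrpc: '2.0',
  id: 1,
  error: { code: -32602, message: 'Invalid params: "repo" is required' },
};

// Channel 2: tool error. The JSON-RPC call succeeded; the failure lives in the result.
const toolError: ToolCallResponse = {
  jsonrpc: '2.0',
  id: 2,
  result: {
    isError: true,
    content: [{ type: 'text', text: 'Repository acme/missing not found' }],
  },
};

// Hypothetical helper: decide which channel a raw response used.
function channelOf(res: ToolCallResponse): 'protocol' | 'tool' | 'success' {
  if ('error' in res) return 'protocol';
  return res.result.isError ? 'tool' : 'success';
}

console.log(channelOf(protocolError), channelOf(toolError));
```

In practice the client SDK hides the envelope: the first shape surfaces as a thrown error, the second as a resolved result.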
The Complete Error Taxonomy
Every possible outcome of a client tool call fits into one of four buckets. Understanding where each error type originates — and what the client sees — is the foundation of resilient MCP engineering.
flowchart TD
  A([Client calls tool]) --> B{Transport OK?}
  B -- No --> C[Transport Error\nconnection reset / timeout\nclient throws Error]
  B -- Yes --> D{Valid JSON-RPC?}
  D -- No --> E[Parse Error -32700\nclient throws McpError]
  D -- Yes --> F{Method exists?\nParams valid?}
  F -- No --> G[Protocol Error -32601/-32602\nclient throws McpError]
  F -- Yes --> H{Tool logic runs}
  H -- Exception thrown --> I[InternalError -32603\nclient throws McpError]
  H -- isError: true --> J[Tool Domain Error\nclient.callTool resolves\ncheck result.isError]
  H -- success --> K[Tool Success\nresult.content array\nresult.isError is not true]
  style C fill:#1a1a24,stroke:#ef4444,color:#f87171
  style E fill:#1a1a24,stroke:#ef4444,color:#f87171
  style G fill:#1a1a24,stroke:#ef4444,color:#f87171
  style I fill:#1a1a24,stroke:#ef4444,color:#f87171
  style J fill:#1a1a24,stroke:#e11d48,color:#fb7185
  style K fill:#1a1a24,stroke:#10b981,color:#34d399
📶
Transport Error
Connection reset, socket timeout, process crash. Client throws a plain Error — not even an McpError. Usually requires reconnect.
📡
Protocol Error (JSON-RPC)
Bad JSON, unknown method, invalid params, or an unhandled exception in your tool. Client throws McpError with a -32xxx code.
🔧
Tool Domain Error (isError)
Your tool ran, but business logic failed. API 404, rate limit, validation error. callTool() resolves — caller checks result.isError.
⚠️
Success with Warnings
Tool succeeded but has partial results or advisory messages. Return isError: false with warning text mixed into the content array.
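On the client side, the four buckets can be told apart with one wrapper around the call. This is a sketch under stated assumptions: ToolResult and Outcome are simplified local types, and protocol errors are recognised by a numeric code property on the thrown object (as McpError carries), with everything else thrown treated as a transport failure.

```typescript
// Simplified local shape; the SDK's result type allows more content kinds.
interface ToolResult {
  isError?: boolean;
  content: { type: 'text'; text: string }[];
}

type Outcome =
  | { bucket: 'transport_error'; message: string }
  | { bucket: 'protocol_error'; code: number; message: string }
  | { bucket: 'tool_error'; text: string }
  | { bucket: 'success'; text: string };

// Assumption: protocol errors carry a numeric `code`. Node system errors such as
// ECONNRESET carry a string code, so they fall through to the transport bucket.
function classifyThrown(err: unknown): Outcome {
  const message = err instanceof Error ? err.message : String(err);
  const code = (err as { code?: unknown } | null)?.code;
  if (typeof code === 'number') return { bucket: 'protocol_error', code, message };
  return { bucket: 'transport_error', message };
}

// A resolved call is either a tool error or a success (possibly with warnings in the text).
function classifyResult(result: ToolResult): Outcome {
  const text = result.content.map((c) => c.text).join('\n');
  return result.isError ? { bucket: 'tool_error', text } : { bucket: 'success', text };
}

// Wrap any call that returns a tool result, e.g. () => client.callTool(...).
async function callAndClassify(call: () => Promise<ToolResult>): Promise<Outcome> {
  try {
    return classifyResult(await call());
  } catch (err) {
    return classifyThrown(err);
  }
}
```

"Success with warnings" lands in the success bucket here; a caller that cares can inspect the text for advisory messages.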
JSON-RPC Error Code Reference
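Seven error codes cover most of what you will see. The first five are defined by JSON-RPC 2.0 itself; -32001 and -32002 sit in the range JSON-RPC reserves for implementation-defined server errors. Knowing which ones to retry, and which to never retry, is critical for building robust clients.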
Code | Name | Meaning | Retry? | Typical Cause
-32700 | ParseError | Invalid JSON was received; the server couldn't parse the message at all | Never | SDK bug, corrupted transport stream
-32600 | InvalidRequest | The JSON structure is valid but not a valid JSON-RPC request object | Never | Missing jsonrpc or method field
-32601 | MethodNotFound | The requested method does not exist on this server | Never | Wrong tool name, typo, server version mismatch
-32602 | InvalidParams | Valid method but the params fail schema validation | Never | Missing required field or wrong type (Zod rejects it)
-32603 | InternalError | Uncaught exception inside the server handler | Maybe ×1 | Unhandled thrown Error, server-side transient bug
-32001 | RequestTimeout | Server did not respond within its configured timeout window | Yes w/ backoff | Slow upstream API, overloaded server, long computation
-32002 | ResourceNotFound | The requested resource URI does not exist | Never | Wrong URI, resource deleted, stale cached reference
ℹ️
The golden rule: Only retry errors caused by transient conditions (timeouts, server overload). Never retry errors caused by bad inputs: -32602 InvalidParams won't fix itself no matter how many times you try.
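The reference table translates directly into a lookup. The sketch below copies the codes and retry guidance from the table; the RetryPolicy names and the choice to treat unknown codes as non-retriable are assumptions made for illustration.

```typescript
type RetryPolicy = 'never' | 'once' | 'backoff';

// Codes and policies taken from the reference table above.
const RETRY_POLICY = new Map<number, RetryPolicy>([
  [-32700, 'never'],   // ParseError
  [-32600, 'never'],   // InvalidRequest
  [-32601, 'never'],   // MethodNotFound
  [-32602, 'never'],   // InvalidParams
  [-32603, 'once'],    // InternalError: might be transient
  [-32001, 'backoff'], // RequestTimeout
  [-32002, 'never'],   // ResourceNotFound
]);

// Assumption: a code we do not recognise is treated as non-retriable,
// which is the conservative default.
function retryPolicyFor(code: number): RetryPolicy {
  return RETRY_POLICY.get(code) ?? 'never';
}

console.log(retryPolicyFor(-32602)); // never
console.log(retryPolicyFor(-32001)); // backoff
```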
Tool Error Patterns (isError)
The rule is simple: never throw from a tool handler unless you want a protocol error. For anything that is an expected operational failure — API 404, rate limit, bad user input — return isError: true with structured content.
// src/errors/tool-errors.ts
// ── Typed error categories ────────────────────────────────────────────────
export type ToolErrorCode =
  | 'NOT_FOUND'
  | 'RATE_LIMITED'
  | 'VALIDATION_FAILED'
  | 'PERMISSION_DENIED'
  | 'UPSTREAM_ERROR';

export interface ToolErrorPayload {
  code: ToolErrorCode;
  message: string;
  details?: Record<string, unknown>;
}

// ── Helper: build a well-formed isError result ────────────────────────────
export function createToolError(payload: ToolErrorPayload) {
  return {
    isError: true,
    content: [
      {
        type: 'text' as const,
        text: `[${payload.code}] ${payload.message}`,
      },
      {
        type: 'text' as const,
        text: JSON.stringify({ error: payload }, null, 2),
      },
    ],
  };
}

// ── Usage inside a tool handler ───────────────────────────────────────────
server.tool('get_repo', { owner: z.string(), repo: z.string() }, async (args) => {
  const res = await githubClient.getRepo(args.owner, args.repo);

  // ✅ Correct: operational failure → isError: true
  if (res.status === 404) {
    return createToolError({
      code: 'NOT_FOUND',
      message: `Repository ${args.owner}/${args.repo} not found`,
      details: { httpStatus: 404 },
    });
  }

  if (res.status === 429) {
    return createToolError({
      code: 'RATE_LIMITED',
      message: 'GitHub API rate limit exceeded — try again in 60 seconds',
      // res.ok and res.json() imply a fetch-style Response, whose headers are read with .get()
      details: { retryAfter: res.headers.get('retry-after') },
    });
  }

  if (res.status === 403) {
    return createToolError({
      code: 'PERMISSION_DENIED',
      message: 'Insufficient permissions to access this repository',
    });
  }

  // ✅ Correct: unexpected upstream error → still isError, not throw
  if (!res.ok) {
    return createToolError({
      code: 'UPSTREAM_ERROR',
      message: `GitHub API error: HTTP ${res.status}`,
      details: { httpStatus: res.status },
    });
  }

  const data = await res.json();
  return {
    content: [{ type: 'text', text: JSON.stringify(data, null, 2) }],
  };
});
💡
Why two content items? The first is a human-readable summary for the LLM to include in its response. The second is structured JSON for any client that wants to programmatically parse the error category and act on it (e.g., show a "Rate limited — retry in Xs" message to the user).
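A client that wants the structured half can scan the content items for the JSON payload. This is a sketch that assumes the result was built by a helper shaped like createToolError above; parseToolError and the local ToolResult type are hypothetical names introduced here.

```typescript
type ToolErrorCode =
  | 'NOT_FOUND'
  | 'RATE_LIMITED'
  | 'VALIDATION_FAILED'
  | 'PERMISSION_DENIED'
  | 'UPSTREAM_ERROR';

interface ToolErrorPayload {
  code: ToolErrorCode;
  message: string;
  details?: Record<string, unknown>;
}

// Simplified local shape of a tool result.
interface ToolResult {
  isError?: boolean;
  content: { type: string; text?: string }[];
}

const KNOWN_CODES = new Set<string>([
  'NOT_FOUND', 'RATE_LIMITED', 'VALIDATION_FAILED', 'PERMISSION_DENIED', 'UPSTREAM_ERROR',
]);

// Returns the structured payload if one of the text items holds { error: {...} },
// or null when the result is not an error or carries no parseable payload.
function parseToolError(result: ToolResult): ToolErrorPayload | null {
  if (!result.isError) return null;
  for (const item of result.content) {
    if (item.type !== 'text' || typeof item.text !== 'string') continue;
    try {
      const parsed: unknown = JSON.parse(item.text);
      const payload = (parsed as { error?: Partial<ToolErrorPayload> } | null)?.error;
      if (payload && typeof payload.message === 'string' && KNOWN_CODES.has(payload.code as string)) {
        return payload as ToolErrorPayload;
      }
    } catch {
      // Not JSON (for example the human-readable summary line); keep scanning.
    }
  }
  return null;
}
```

Scanning every item rather than hard-coding index 1 keeps the client working if the server later reorders or adds content items.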
Protocol Error Handling
Throwing from a tool handler should be rare — reserved for truly unexpected bugs and cases where the SDK itself validates inputs. Understand exactly when a throw is correct vs. when it creates a worse experience.
🤔
When SHOULD you throw? Only two legitimate cases: (1) Zod input validation — but the SDK does this for you automatically when you declare a schema. (2) A truly unexpected server-side bug that you have no way to recover from (e.g., config missing at startup). Everything else should be isError: true.
// src/errors/protocol-errors.ts
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';

// ── The McpError class ────────────────────────────────────────────────────
// Throwing McpError produces a proper JSON-RPC error response.
// The SDK serialises it as: { code: -32xxx, message: '...' }

// ✅ Correct: throw McpError for protocol-level issues
server.tool('search_repos', { query: z.string().min(1) }, async (args) => {
  // Zod .min(1) already rejects empty string with InvalidParams (-32602)
  // so you don't need to check it again.

  // But if you discover a logic invariant is violated:
  if (!process.env.GITHUB_TOKEN) {
    throw new McpError(
      ErrorCode.InternalError,
      'Server misconfiguration: GITHUB_TOKEN not set'
    );
  }

  // ... normal handler logic
});

// ── Zod validation vs returning isError comparison ────────────────────────

// ❌ Wrong: manual throw for expected business failure
server.tool('get_user', { username: z.string() }, async (args) => {
  const user = await github.getUser(args.username);
  if (!user) throw new Error('User not found'); // becomes -32603 InternalError!
});

// ✅ Right: return isError for expected business failure
server.tool('get_user', { username: z.string() }, async (args) => {
  const user = await github.getUser(args.username);
  if (!user) return createToolError({ code: 'NOT_FOUND', message: `User ${args.username} not found` });
  return { content: [{ type: 'text', text: JSON.stringify(user) }] };
});
📌
Zod input validation errors are protocol errors, not tool errors. When you pass a Zod schema to server.tool(), the SDK runs validation before your handler is ever called. If it fails, the SDK automatically throws McpError(ErrorCode.InvalidParams, ...) — your code never runs and there is no isError result.
Retry Strategies
Not every failure deserves a retry. Retrying non-retriable errors wastes time and amplifies load on already-struggling services. Use exponential backoff with jitter, and prefer a time budget over a fixed retry count.
Error Type | Retry? | Strategy
Transport error (socket reset) | Yes | Reconnect transport, then retry with backoff
-32603 InternalError | Once | Single retry after 500ms; could be transient
-32001 RequestTimeout | Yes | Exponential backoff up to budget
isError: RATE_LIMITED | Yes | Respect retryAfter header, then linear backoff
isError: UPSTREAM_ERROR (5xx) | Maybe | Backoff; check if upstream is intermittent
-32602 InvalidParams | Never | Wrong args; fix the call, don't retry it
-32601 MethodNotFound | Never | Wrong method name; a code bug
isError: NOT_FOUND | Never | Resource doesn't exist; won't change on retry
isError: PERMISSION_DENIED | Never | Auth issue; credentials must change first
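For the isError rows of the table, the decision can be sketched as a function that returns a wait time or null. The function names, the 1-second linear step, and the 300 ms / 8 s exponential bounds are illustrative assumptions; VALIDATION_FAILED is not in the table and is treated as non-retriable here because it is an input problem.

```typescript
type ToolErrorCode =
  | 'NOT_FOUND'
  | 'RATE_LIMITED'
  | 'VALIDATION_FAILED'
  | 'PERMISSION_DENIED'
  | 'UPSTREAM_ERROR';

// Retry-After may be a number of seconds or an HTTP date.
function parseRetryAfter(value: string | undefined, now: number): number | null {
  if (!value) return null;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(value);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

// Returns how long to wait before the next attempt, or null when retrying is pointless.
// `attempt` is 1 for the first retry.
function retryDelayMs(
  code: ToolErrorCode,
  retryAfter: string | undefined,
  attempt: number,
  now: number = Date.now()
): number | null {
  switch (code) {
    case 'RATE_LIMITED':
      // Prefer the server's hint; otherwise fall back to linear backoff.
      return parseRetryAfter(retryAfter, now) ?? 1_000 * attempt;
    case 'UPSTREAM_ERROR':
      // Exponential backoff, capped.
      return Math.min(300 * 2 ** (attempt - 1), 8_000);
    default:
      // NOT_FOUND, PERMISSION_DENIED, VALIDATION_FAILED: deterministic failures.
      return null;
  }
}
```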
// src/retry/backoff.ts
interface RetryOptions {
  maxBudgetMs: number;      // total time we're willing to spend
  initialDelayMs: number;   // first retry delay
  maxDelayMs: number;       // cap on any single delay
  jitterFactor: number;     // 0–1, how much randomness to add
}

const DEFAULT_OPTS: RetryOptions = {
  maxBudgetMs: 30_000,
  initialDelayMs: 300,
  maxDelayMs: 8_000,
  jitterFactor: 0.25,
};

function withJitter(delay: number, factor: number): number {
  const jitter = delay * factor * (Math.random() * 2 - 1); // uniform in ±(delay × factor)
  return Math.round(delay + jitter);
}

// ── Retry budget pattern ──────────────────────────────────────────────────
export async function retryWithBudget<T>(
  fn: (attempt: number) => Promise<T>,
  isRetriable: (err: unknown) => boolean,
  opts: Partial<RetryOptions> = {}
): Promise<T> {
  const o = { ...DEFAULT_OPTS, ...opts };
  const deadline = Date.now() + o.maxBudgetMs;
  let delay = o.initialDelayMs;
  let attempt = 0;

  while (true) {
    attempt++;
    try {
      return await fn(attempt);
    } catch (err) {
      const remaining = deadline - Date.now();
      if (!isRetriable(err) || remaining <= 0) throw err;

      const waitMs = Math.min(withJitter(delay, o.jitterFactor), remaining, o.maxDelayMs);
      console.warn(`[retry] attempt ${attempt} failed — waiting ${waitMs}ms (${remaining}ms budget left)`);
      await new Promise(r => setTimeout(r, waitMs));
      delay = Math.min(delay * 2, o.maxDelayMs);
    }
  }
}

// ── Decorator: wrap any tool handler with retry logic ─────────────────────
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';

type ToolHandler<T> = (args: T) => Promise<unknown>;

const RETRIABLE_CODES = new Set([
  ErrorCode.InternalError,   // -32603
  ErrorCode.RequestTimeout,  // -32001 (if your SDK version defines it)
]);

export function decorateWithRetry<T>(
  handler: ToolHandler<T>,
  opts?: Partial<RetryOptions>
): ToolHandler<T> {
  return (args) =>
    retryWithBudget(
      () => handler(args),
      (err) => err instanceof McpError && RETRIABLE_CODES.has(err.code),
      opts
    );
}

// Usage:
server.tool('search_repos', { query: z.string() },
  decorateWithRetry(async (args) => {
    // ... handler logic
  }, { maxBudgetMs: 15_000 })
);
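To see what the defaults above imply, it helps to print the delay schedule with jitter switched off. backoffSchedule is a hypothetical helper that mirrors the doubling-and-cap logic of retryWithBudget without the randomness or the budget check.

```typescript
// Deterministic view of the schedule: start at initialDelayMs, double each time, cap at maxDelayMs.
function backoffSchedule(initialDelayMs: number, maxDelayMs: number, attempts: number): number[] {
  const delays: number[] = [];
  let delay = initialDelayMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(delay, maxDelayMs));
    delay = Math.min(delay * 2, maxDelayMs);
  }
  return delays;
}

const schedule = backoffSchedule(300, 8_000, 7);
console.log(schedule); // [300, 600, 1200, 2400, 4800, 8000, 8000]
console.log(schedule.reduce((a, b) => a + b, 0)); // 25300
```

The first seven waits add up to 25.3 s, so the 30 s default budget is mostly consumed once the delay reaches the cap. Jitter shifts each individual wait by up to 25% in either direction with the default factor.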
Circuit Breaker Pattern
Retrying is fine for occasional failures. But if a downstream service is truly down, hammering it with retries makes things worse. A circuit breaker stops trying after N consecutive failures, gives the service time to recover, then probes cautiously before resuming full traffic.
CLOSED
Normal operation. Requests pass through. Failures are counted. When threshold is hit → OPEN.
OPEN
All requests immediately rejected without calling downstream. After cooldown period → HALF-OPEN.
HALF-OPEN
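Trial requests are allowed through. Once enough of them succeed in a row (successThreshold in the implementation below, which defaults to 2) → CLOSED. If any trial fails → back to OPEN.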
stateDiagram-v2
  [*] --> CLOSED
  CLOSED --> OPEN : failures >= threshold
  OPEN --> HALF_OPEN : cooldown elapsed
  HALF_OPEN --> CLOSED : trial request succeeds
  HALF_OPEN --> OPEN : trial request fails
  CLOSED --> CLOSED : request succeeds (reset counter)
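The transitions in the diagram can also be written as a pure function, which is handy for reasoning about the state machine before adding timers and async calls. This is a simplified sketch: the Snapshot type, the event names and step are made up for illustration, and it closes after a single successful trial exactly as the diagram shows.

```typescript
type CBState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
type CBEvent = 'success' | 'failure' | 'cooldown_elapsed';

interface Snapshot {
  state: CBState;
  failures: number; // consecutive failures seen while CLOSED
}

// Pure transition function: given the current snapshot and an event, return the next snapshot.
function step(s: Snapshot, event: CBEvent, threshold: number): Snapshot {
  switch (s.state) {
    case 'CLOSED': {
      if (event === 'success') return { state: 'CLOSED', failures: 0 };
      if (event === 'failure') {
        const failures = s.failures + 1;
        return { state: failures >= threshold ? 'OPEN' : 'CLOSED', failures };
      }
      return s; // cooldown events mean nothing while CLOSED
    }
    case 'OPEN':
      // Requests are rejected upstream of this function; only the cooldown moves us on.
      return event === 'cooldown_elapsed' ? { state: 'HALF_OPEN', failures: s.failures } : s;
    case 'HALF_OPEN': {
      if (event === 'success') return { state: 'CLOSED', failures: 0 };
      if (event === 'failure') return { state: 'OPEN', failures: s.failures };
      return s;
    }
  }
}

let snap: Snapshot = { state: 'CLOSED', failures: 0 };
for (const e of ['failure', 'failure', 'failure', 'cooldown_elapsed', 'success'] as CBEvent[]) {
  snap = step(snap, e, 3);
  console.log(e, '->', snap.state);
}
```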
// src/circuit-breaker/CircuitBreaker.ts
type CBState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface CBOptions {
  failureThreshold: number;   // consecutive failures to open
  cooldownMs: number;         // how long to stay open
  successThreshold: number;   // consecutive successes to close from half-open
}

export class CircuitBreaker {
  private state: CBState = 'CLOSED';
  private failureCount = 0;
  private successCount = 0;
  private openedAt = 0;
  private readonly opts: CBOptions;

  constructor(opts: Partial<CBOptions> = {}) {
    this.opts = {
      failureThreshold: opts.failureThreshold ?? 5,
      cooldownMs: opts.cooldownMs ?? 30_000,
      successThreshold: opts.successThreshold ?? 2,
    };
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt >= this.opts.cooldownMs) {
        this.state = 'HALF_OPEN';
        this.successCount = 0;
      } else {
        throw new Error(
          `Circuit breaker OPEN — service unavailable (cooldown: ${
            Math.ceil((this.opts.cooldownMs - (Date.now() - this.openedAt)) / 1000)
          }s remaining)`
        );
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      if (this.successCount >= this.opts.successThreshold) {
        this.state = 'CLOSED';
        // console.info writes to stdout, which is the JSON-RPC channel on stdio; log to stderr instead
        console.error('[circuit-breaker] CLOSED — service recovered');
      }
    }
  }

  private onFailure(): void {
    this.failureCount++;
    if (
      this.state === 'HALF_OPEN' ||
      this.failureCount >= this.opts.failureThreshold
    ) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
      console.warn(
        `[circuit-breaker] OPEN — ${this.failureCount} consecutive failures`
      );
    }
  }

  getState(): CBState { return this.state; }
}

// ── Integration with MCP tool handler ────────────────────────────────────
const githubBreaker = new CircuitBreaker({ failureThreshold: 5, cooldownMs: 30_000 });

server.tool('search_repos', { query: z.string() }, async (args) => {
  try {
    const data = await githubBreaker.call(() => githubClient.searchRepos(args.query));
    return { content: [{ type: 'text', text: JSON.stringify(data) }] };
  } catch (err) {
    const msg = err instanceof Error ? err.message : 'Unknown error';
    const isOpen = msg.startsWith('Circuit breaker OPEN');
    return createToolError({
      code: 'UPSTREAM_ERROR', // same category either way; only the message differs
      message: isOpen
        ? 'GitHub service is currently unavailable — please try again shortly'
        : `GitHub API error: ${msg}`,
    });
  }
});
Timeout Management
MCP tool calls can hang indefinitely if an upstream service never responds. Always apply two timeout layers: a per-tool timeout via AbortController, and a global request timeout for the overall server. And always clean up in a finally block.
// src/timeout/with-timeout.ts

// ── Layer 1: Per-tool timeout via AbortController ─────────────────────────
export async function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  toolName: string
): Promise<T> {
  const controller = new AbortController();
  const handle = setTimeout(() => {
    controller.abort(new Error(`${toolName} timed out after ${timeoutMs}ms`));
  }, timeoutMs);

  try {
    return await fn(controller.signal);
  } finally {
    // ✅ Always clear the timer — prevents handle leaks even on success
    clearTimeout(handle);
  }
}

// ── Layer 2: fetch helper that honors the abort signal ───────────────────
async function fetchWithTimeout(url: string, signal: AbortSignal): Promise<Response> {
  const response = await fetch(url, { signal });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response;
}

// ── Complete example: search tool with 10s timeout ────────────────────────
server.tool('search_repos', { query: z.string() }, async (args) => {
  return withTimeout(
    async (signal) => {
      try {
        const res = await fetchWithTimeout(
          `https://api.github.com/search/repositories?q=${encodeURIComponent(args.query)}`,
          signal
        );
        const data = await res.json();
        return { content: [{ type: 'text', text: JSON.stringify(data, null, 2) }] };
      } catch (err) {
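        // abort() was called with a custom Error, so fetch rejects with that reason
        // and err.name is not 'AbortError'; check the signal itself instead.
        if (signal.aborted) {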
          return createToolError({
            code: 'UPSTREAM_ERROR',
            message: 'Search request timed out — GitHub API is slow. Try a more specific query.',
          });
        }
        return createToolError({ code: 'UPSTREAM_ERROR', message: String(err) });
      }
    },
    10_000,
    'search_repos'
  );
});
🚨
Always clean up timers in finally blocks. If you call setTimeout to abort a fetch, but the fetch succeeds first, the timer is still pending. Without clearTimeout in a finally, Node.js keeps that timer (and everything its callback references) alive until it fires. Multiplied across thousands of requests, this wastes memory, triggers needless aborts, and keeps the event loop busy so graceful shutdown is delayed.
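For work that cannot accept an AbortSignal, a Promise.race wrapper is a common fallback, with the same cleanup rule applied. The sketch below is illustrative: raceWithTimeout and the local TimeoutError class are names invented here, not SDK exports.

```typescript
// Local error type so callers can tell a timeout from other failures.
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms`.
// Note: this stops waiting, it does not cancel the underlying work.
async function raceWithTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let handle: ReturnType<typeof setTimeout> | undefined;
  const timer = new Promise<never>((_, reject) => {
    handle = setTimeout(() => reject(new TimeoutError(label, ms)), ms);
  });
  try {
    return await Promise.race([work, timer]);
  } finally {
    // Runs on success, failure, and timeout alike, so no timer outlives the call.
    clearTimeout(handle);
  }
}

// Usage: a fast promise wins the race and the timer is cleared immediately.
raceWithTimeout(Promise.resolve('ok'), 1_000, 'demo').then((v) => console.log(v));
```

Because the losing promise keeps running, this suits calls whose side effects are harmless when abandoned; for HTTP requests the AbortController approach above is the better fit since it actually cancels the request.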
Graceful Degradation
When the primary data source fails, don't just return an error — try to return something useful. Stale cache, partial results, and fallback chains turn hard failures into soft degradations that the LLM can work with.
// src/degradation/graceful.ts
import { createToolError } from '../errors/tool-errors.js';

// ── Pattern 1: Stale cache fallback ──────────────────────────────────────
const cache = new Map<string, { data: unknown; ts: number }>();

async function withStaleFallback<T>(
  key: string,
  freshFn: () => Promise<T>,
  maxStaleMs = 5 * 60_000 // serve stale data up to 5 min old
): Promise<{ data: T; stale: boolean }> {
  try {
    const data = await freshFn();
    cache.set(key, { data, ts: Date.now() });
    return { data, stale: false };
  } catch {
    const entry = cache.get(key);
    if (entry && Date.now() - entry.ts < maxStaleMs) {
      return { data: entry.data as T, stale: true };
    }
    throw new Error('No fresh or cached data available');
  }
}

// ── Pattern 2: Partial results ────────────────────────────────────────────
async function fetchMultipleRepos(repos: string[]) {
  const results = await Promise.allSettled(
    repos.map(async (r) => {
      const [owner, name] = r.split('/');
      return { repo: r, data: await githubClient.getRepo(owner, name) };
    })
  );

  const successes = results
    .filter((r): r is PromiseFulfilledResult<unknown> => r.status === 'fulfilled')
    .map((r) => r.value);
  // Pair each rejection with its original index before filtering, so repos[i] stays aligned
  const failures = results.flatMap((r, i) =>
    r.status === 'rejected' ? [{ repo: repos[i], reason: r.reason?.message }] : []
  );

  return {
    content: [
      {
        type: 'text',
        text: JSON.stringify({
          results: successes,
          warnings: failures.length
            ? `${failures.length} repos failed: ${failures.map(f => f.repo).join(', ')}`
            : undefined,
        }, null, 2),
      },
    ],
    isError: false, // partial success is still success
  };
}

// ── Pattern 3: Fallback chain (all three patterns together) ──────────────
server.tool('get_repo_stats', { owner: z.string(), repo: z.string() }, async (args) => {
  const cacheKey = `${args.owner}/${args.repo}`;

  // Try 1: fresh primary API
  // Try 2: stale cache (up to 10 min old)
  // Try 3: secondary data source (e.g., a read-replica or mirror)
  // Try 4: hard error
  try {
    const { data, stale } = await withStaleFallback(
      cacheKey,
      () => githubClient.getRepoStats(args.owner, args.repo),
      10 * 60_000
    );

    return {
      content: [
        {
          type: 'text',
          text: JSON.stringify({
            ...data as object,
            _meta: stale ? { warning: 'Showing cached data — live API unavailable' } : undefined,
          }, null, 2),
        },
      ],
    };
  } catch {
    // Try secondary source
    try {
      const mirror = await mirrorClient.getRepoStats(args.owner, args.repo);
      return {
        content: [{
          type: 'text',
          text: JSON.stringify({ ...mirror, _meta: { source: 'mirror' } }, null, 2),
        }],
      };
    } catch {
      return createToolError({
        code: 'UPSTREAM_ERROR',
        message: `All data sources unavailable for ${cacheKey} — please try again later`,
      });
    }
  }
});
Structured Error Logging
Random console.error() calls are untrackable in production. Use correlation IDs to link a client request to every log line it generates, and emit structured JSON to stderr so log aggregators can index, filter, and alert on it.
// src/logging/structured.ts
import { randomUUID } from 'crypto';
import { Server } from '@modelcontextprotocol/sdk/server/index.js';

// ── Log entry structure ───────────────────────────────────────────────────
interface LogEntry {
  timestamp: string;
  correlationId: string;
  toolName: string;
  errorCode?: string;
  errorMessage?: string;
  duration?: number;
  attempt?: number;
  level: 'info' | 'warn' | 'error';
}

function structuredLog(entry: LogEntry): void {
  // MCP servers MUST use stderr for logs — stdout is the JSON-RPC channel
  process.stderr.write(JSON.stringify(entry) + '\n');
}

// ── MCP logging primitive: send to client via server notification ─────────
export function createLogMiddleware(server: Server) {
  return {
    async logInfo(correlationId: string, toolName: string, message: string) {
      const entry: LogEntry = {
        timestamp: new Date().toISOString(),
        correlationId,
        toolName,
        level: 'info',
        errorMessage: message,
      };
      structuredLog(entry);
      // Also send to client via MCP logging notification
      await server.notification({
        method: 'notifications/message',
        params: { level: 'info', logger: toolName, data: message },
      });
    },

    async logError(correlationId: string, toolName: string, err: unknown, duration: number, attempt?: number) {
      const entry: LogEntry = {
        timestamp: new Date().toISOString(),
        correlationId,
        toolName,
        errorCode: err instanceof Error ? err.constructor.name : 'UnknownError',
        errorMessage: err instanceof Error ? err.message : String(err),
        duration,
        attempt,
        level: 'error',
      };
      structuredLog(entry);
    },
  };
}

// ── Usage in a tool handler with correlation ID ───────────────────────────
server.tool('search_repos', { query: z.string() }, async (args) => {
  const correlationId = randomUUID();
  const startTime = Date.now();

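  // server.tool() is the high-level McpServer API; its underlying Server is exposed as .server
  const logger = createLogMiddleware(server.server);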
  await logger.logInfo(correlationId, 'search_repos', `Starting search: "${args.query}"`);

  try {
    const data = await githubClient.searchRepos(args.query);
    return { content: [{ type: 'text', text: JSON.stringify(data) }] };
  } catch (err) {
    const duration = Date.now() - startTime;
    await logger.logError(correlationId, 'search_repos', err, duration);
    return createToolError({ code: 'UPSTREAM_ERROR', message: String(err) });
  }
});
📋
Two logging channels in MCP: process.stderr for server-side logs (visible in your terminal / log aggregator), and server.notification({ method: 'notifications/message' }) to send log events to the connected client over the MCP protocol. Use both for full observability.
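One way to avoid threading the correlation ID through every log call is to bind it once per request. The sketch below covers the stderr channel only; createRequestLogger and the Sink type are hypothetical names, and the sink is injectable so the logger can be exercised in tests without touching the real stream.

```typescript
type Level = 'info' | 'warn' | 'error';
type Sink = (line: string) => void;

// Default sink writes to stderr, keeping stdout free for JSON-RPC traffic.
const stderrSink: Sink = (line) => {
  process.stderr.write(line + '\n');
};

// Returns a log function with the correlation ID and tool name already bound.
// Each entry also records how long the request has been running.
function createRequestLogger(correlationId: string, toolName: string, sink: Sink = stderrSink) {
  const startedAt = Date.now();
  return (level: Level, message: string, extra: Record<string, unknown> = {}): void => {
    sink(
      JSON.stringify({
        timestamp: new Date().toISOString(),
        level,
        correlationId,
        toolName,
        message,
        duration: Date.now() - startedAt,
        ...extra,
      })
    );
  };
}

// Usage inside a handler: create once, then log freely.
const log = createRequestLogger('req-demo-1', 'search_repos');
log('info', 'starting search');
log('error', 'upstream failed', { errorCode: 'UPSTREAM_ERROR', attempt: 2 });
```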
Testing Error Paths (Vitest)
Happy-path tests are easy. Error-path tests are where production reliability is built. Every retry strategy, every circuit breaker transition, and every isError branch needs its own test.
// src/__tests__/error-handling.test.ts
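import { describe, it, expect, vi, beforeEach } from 'vitest';
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js'; // used by the Zod validation test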
import { CircuitBreaker } from '../circuit-breaker/CircuitBreaker.js';
import { createToolError } from '../errors/tool-errors.js';

// ── Mock GitHub client ────────────────────────────────────────────────────
const mockGithub = {
  getRepo: vi.fn(),
  searchRepos: vi.fn(),
};

// ── Test 1: Tool returns isError on upstream 404 ──────────────────────────
describe('get_repo tool', () => {
  it('returns isError: true when GitHub returns 404', async () => {
    mockGithub.getRepo.mockResolvedValueOnce({ status: 404, ok: false });

    const result = await callGetRepoHandler({ owner: 'acme', repo: 'missing' }, mockGithub);

    expect(result.isError).toBe(true);
    expect(result.content[0].text).toContain('NOT_FOUND');
    expect(result.content[0].text).toContain('acme/missing');
  });

  it('does NOT throw for GitHub 404 — protocol stays healthy', async () => {
    mockGithub.getRepo.mockResolvedValueOnce({ status: 404, ok: false });

    // callTool should resolve, not reject
    await expect(
      callGetRepoHandler({ owner: 'acme', repo: 'missing' }, mockGithub)
    ).resolves.toBeDefined();
  });
});

// ── Test 2: Circuit breaker opens after threshold ─────────────────────────
describe('CircuitBreaker', () => {
  let cb: CircuitBreaker;

  beforeEach(() => {
    cb = new CircuitBreaker({ failureThreshold: 3, cooldownMs: 5_000 });
  });

  it('starts in CLOSED state', () => {
    expect(cb.getState()).toBe('CLOSED');
  });

  it('opens after failureThreshold consecutive failures', async () => {
    const fail = () => Promise.reject(new Error('upstream down'));

    await expect(cb.call(fail)).rejects.toThrow();
    await expect(cb.call(fail)).rejects.toThrow();
    await expect(cb.call(fail)).rejects.toThrow(); // 3rd failure → OPEN

    expect(cb.getState()).toBe('OPEN');
  });

  it('rejects immediately in OPEN state without calling downstream', async () => {
    // Force open
    for (let i = 0; i < 3; i++) {
      await cb.call(() => Promise.reject(new Error('fail'))).catch(() => {});
    }

    const spy = vi.fn().mockResolvedValue('ok');
    await expect(cb.call(spy)).rejects.toThrow('Circuit breaker OPEN');
    expect(spy).not.toHaveBeenCalled(); // downstream never called
  });
});

// ── Test 3: Zod validation returns JSON-RPC InvalidParams error ───────────
describe('Zod validation (protocol error)', () => {
  it('returns McpError with code -32602 for missing required param', async () => {
    // Use in-process MCP client/server for integration test
    const { client, cleanup } = await createTestServerClient();

    try {
      // Assert on the rejection itself, so the test fails if callTool unexpectedly resolves
      const call = client.callTool({ name: 'search_repos', arguments: {} }); // missing query
      await expect(call).rejects.toBeInstanceOf(McpError);
      await expect(call).rejects.toMatchObject({ code: ErrorCode.InvalidParams });
    } finally {
      await cleanup();
    }
  });
});
🧪
Mock injection pattern: Pass your GitHub client as a parameter to the tool handler factory (dependency injection) rather than importing it at the module level. This makes it trivial to swap in mockGithub in tests without any module mocking magic — just a different argument.
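A handler factory along these lines is one way to get that injection point. Everything here is a sketch: GithubLike, makeGetRepoHandler and fakeGithub are hypothetical names, and the response shape is reduced to the three members the handler actually reads.

```typescript
// The slice of the GitHub client the handler depends on.
interface RepoResponse {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

interface GithubLike {
  getRepo(owner: string, repo: string): Promise<RepoResponse>;
}

interface ToolResult {
  isError?: boolean;
  content: { type: 'text'; text: string }[];
}

// Factory: the client is a parameter, so tests can pass a fake without module mocking.
function makeGetRepoHandler(github: GithubLike) {
  return async (args: { owner: string; repo: string }): Promise<ToolResult> => {
    const res = await github.getRepo(args.owner, args.repo);
    if (res.status === 404) {
      return {
        isError: true,
        content: [{ type: 'text', text: `[NOT_FOUND] Repository ${args.owner}/${args.repo} not found` }],
      };
    }
    if (!res.ok) {
      return {
        isError: true,
        content: [{ type: 'text', text: `[UPSTREAM_ERROR] GitHub API error: HTTP ${res.status}` }],
      };
    }
    return { content: [{ type: 'text', text: JSON.stringify(await res.json()) }] };
  };
}

// In a test, a plain object stands in for the real client.
const fakeGithub: GithubLike = {
  getRepo: async () => ({ status: 404, ok: false, json: async () => ({}) }),
};

makeGetRepoHandler(fakeGithub)({ owner: 'acme', repo: 'missing' }).then((r) =>
  console.log(r.isError, r.content[0].text)
);
```

In production the same factory would receive the real client, and its return value would be passed to server.tool() as the handler.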
Production Error Checklist
Before you ship any MCP server to production, run through this checklist. Each item maps directly to a pattern covered in this lesson.
  1. Never throw plain errors from tool handlers. All expected business failures (404, rate limit, validation, permission) use return createToolError({ code, message }) with isError: true.
  2. Throw McpError (not Error) for protocol-level failures. Use new McpError(ErrorCode.InvalidParams, msg) only when the request itself is malformed beyond what Zod already catches.
  3. Implement retry with exponential backoff and jitter for timeout and transient InternalError codes. Cap retries with a time budget (e.g. 30s), not a fixed count.
  4. Never retry -32602 InvalidParams, -32601 MethodNotFound, NOT_FOUND, or PERMISSION_DENIED. These are deterministic failures; repeating them wastes resources and delays user feedback.
  5. Wrap every external API call with a CircuitBreaker. Use a threshold of 5 consecutive failures and a 30-second cooldown minimum. Return a user-friendly isError result when the breaker is open.
  6. Apply per-tool timeouts via AbortController for all outbound HTTP calls. Always clearTimeout in a finally block to prevent handle leaks. Recommended default: 10s per tool, 30s global.
  7. Implement at least one graceful degradation strategy for every tool that calls an external service: stale cache, partial results, or a fallback source. Never let an upstream outage become a complete tool blackout.
  8. Use structured JSON logging to stderr with a correlation ID, tool name, error code, duration, and attempt number on every error. Never use console.log on stdout, because that breaks the JSON-RPC channel.
  9. Emit MCP log notifications via server.notification({ method: 'notifications/message' }) for errors the client host should surface to the user. Keep stderr logs for server-side observability.
  10. Write tests for every error path: isError on upstream failure, circuit breaker state transitions, Zod validation → InvalidParams, timeout → isError result. Use dependency injection to inject mock clients without module mocking.
🎓
Level 1 Complete!
You've mastered all 10 foundational MCP concepts. From protocol basics to production error handling — you're now a solid MCP practitioner.
Error Handling Check
5 questions covering the two error channels, retry strategies, circuit breakers, timeouts, and isError semantics. Score 5/5 and you're production-ready.
Q1. A tool handler calls the GitHub API which returns a 404. The correct pattern is to...
A. throw new McpError(ErrorCode.InternalError, '404'): map the HTTP error to a protocol error
B. return { isError: true, content: [{ type: 'text', text: 'Repo not found: 404' }] }: it's an expected operational failure
C. throw new Error('GitHub 404'): let the SDK convert it to a JSON-RPC error automatically
D. return an empty content array: the client infers failure from the missing content

Q2. Which JSON-RPC error code should you never retry automatically?
A. -32603 InternalError: could be transient, worth one retry
B. -32001 RequestTimeout: the server was just slow, retry with backoff
C. -32602 InvalidParams: wrong arguments won't fix themselves on retry
D. -32603 and -32001 both should never be retried

Q3. A circuit breaker is in OPEN state and its cooldown has not yet elapsed. A request arrives. What happens?
A. The request is immediately rejected without calling the downstream service
B. The request is queued and retried automatically when the breaker closes
C. The circuit breaker switches to HALF_OPEN and tries the request as a trial
D. The request falls through to the next retry attempt in the backoff queue

Q4. You have a tool handler that uses AbortController for a 10-second timeout. The fetch completes in 3 seconds. What must you do to avoid a memory/handle leak?
A. Call process.nextTick to flush the event loop and release the timer
B. Call clearTimeout on the timeout handle in a finally block
C. Nothing: the AbortController garbage-collects automatically when it goes out of scope
D. Restart the server: leaked handles can't be cleaned up without a process restart

Q5. Your MCP server's tool handler returns { isError: true, content: [...] }. From the client's perspective, what does client.callTool() do?
A. Throws an McpError with code -32603 InternalError: isError maps to a protocol error
B. Rejects the promise: isError: true always means a rejected promise on the client
C. Resolves normally: the caller must check result.isError to detect the error condition
D. Returns undefined: an isError result has no content for the client to inspect