Day 10 — Error Handling & Resilience

Section 1

Why MCP Error Handling is Different

MCP has two completely separate error channels — protocol errors and tool errors — and confusing them is the #1 mistake beginners make. One channel speaks JSON-RPC; the other lives inside the tool result object. You must handle each differently.

📡

Protocol Error

throws → JSON-RPC error response

Thrown as an exception from your handler (or by the SDK). The SDK converts it into a JSON-RPC error object with a numeric code in the -32xxx range. From the client's perspective, callTool() throws an McpError.

Used for: invalid params, method not found, server bugs, parse failures.

🔧

Tool Error

isError: true in result

Returned as a normal result with isError: true set at the top level of the result object, alongside the content array. The protocol itself succeeds: callTool() resolves normally. The caller must check result.isError to detect the failure.

Used for: 404s, rate limits, upstream API failures, business-logic errors.

🍽️

Analogy time: A protocol error is like the waiter dropping your order slip before reaching the kitchen — the order never made it there at all. A tool error is like the kitchen receiving the order, trying to make your dish, but discovering the ingredient is out of stock. The request was processed; it just couldn't succeed.

⚠️

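The most common mistake: throwing new Error('GitHub 404') from inside a tool handler. An uncaught exception surfaces as a protocol error (code -32603 InternalError) instead of a structured tool error, so the client sees an exception rather than an isError result it can handle gracefully. Some SDK versions catch handler exceptions and wrap the raw message in an isError result instead; either way you lose control over the error's category and wording.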
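A minimal sketch of how a client might keep the two channels apart. It assumes a hypothetical RpcError class standing in for the SDK's McpError and a simplified, text-only ToolResult shape; callAndClassify and Outcome are illustrative names, not SDK APIs.

```typescript
// Hypothetical stand-in for the SDK's McpError: an Error with a numeric JSON-RPC code.
class RpcError extends Error {
  readonly code: number;
  constructor(code: number, message: string) {
    super(message);
    this.name = 'RpcError';
    this.code = code;
  }
}

// Simplified tool result: text-only content plus the optional isError flag.
interface ToolResult {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
}

type Outcome =
  | { kind: 'success'; text: string }
  | { kind: 'tool_error'; text: string }
  | { kind: 'protocol_error'; code: number; message: string };

// Runs one tool call and reports which channel (if any) carried a failure.
async function callAndClassify(call: () => Promise<ToolResult>): Promise<Outcome> {
  try {
    const result = await call();
    const text = result.content.map((c) => c.text).join('\n');
    // Tool channel: the call resolved, so the flag is the only failure signal.
    return result.isError ? { kind: 'tool_error', text } : { kind: 'success', text };
  } catch (err) {
    // Protocol channel: the call rejected with a coded error.
    if (err instanceof RpcError) {
      return { kind: 'protocol_error', code: err.code, message: err.message };
    }
    throw err; // transport-level or unknown failure: leave it to the caller
  }
}
```

The discriminated union forces callers to handle all three outcomes explicitly instead of assuming that a resolved promise means success.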

Section 2

The Complete Error Taxonomy

Every possible outcome of a client tool call fits into one of four buckets. Understanding where each error type originates — and what the client sees — is the foundation of resilient MCP engineering.

flowchart TD
  A([Client calls tool]) --> B{Transport OK?}
  B -- No --> C[Transport Error\nconnection reset / timeout\nclient throws Error]
  B -- Yes --> D{Valid JSON-RPC?}
  D -- No --> E[Parse Error -32700\nclient throws McpError]
  D -- Yes --> F{Method exists?\nParams valid?}
  F -- No --> G[Protocol Error -32601/-32602\nclient throws McpError]
  F -- Yes --> H{Tool logic runs}
  H -- Exception thrown --> I[InternalError -32603\nclient throws McpError]
  H -- isError: true --> J[Tool Domain Error\nclient.callTool resolves\ncheck result.isError]
  H -- success --> K[Tool Success\nresult.content array\nresult.isError === false]
  style C fill:#1a1a24,stroke:#ef4444,color:#f87171
  style E fill:#1a1a24,stroke:#ef4444,color:#f87171
  style G fill:#1a1a24,stroke:#ef4444,color:#f87171
  style I fill:#1a1a24,stroke:#ef4444,color:#f87171
  style J fill:#1a1a24,stroke:#e11d48,color:#fb7185
  style K fill:#1a1a24,stroke:#10b981,color:#34d399

📶

Transport Error

Connection reset, socket timeout, process crash. Client throws a plain Error — not even an McpError. Usually requires reconnect.

📡

Protocol Error (JSON-RPC)

Bad JSON, unknown method, invalid params, or an unhandled exception in your tool. Client throws McpError with a -32xxx code.

🔧

Tool Domain Error (isError)

Your tool ran, but business logic failed. API 404, rate limit, validation error. callTool() resolves — caller checks result.isError.

⚠️

Success with Warnings

Tool succeeded but has partial results or advisory messages. Return isError: false with warning text mixed into the content array.
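As a rough illustration of the "Success with Warnings" bucket, the sketch below builds a result that stays on the success path while carrying advisory text. The helper name successWithWarnings and the "Warning:" prefix are assumptions made for this example, not an SDK convention.

```typescript
interface TextContent {
  type: 'text';
  text: string;
}

interface ToolResult {
  content: TextContent[];
  isError?: boolean;
}

// Builds a successful result whose trailing content items carry advisory warnings.
function successWithWarnings(data: unknown, warnings: string[]): ToolResult {
  const content: TextContent[] = [
    { type: 'text', text: JSON.stringify(data, null, 2) },
  ];
  for (const warning of warnings) {
    content.push({ type: 'text', text: `Warning: ${warning}` });
  }
  // isError stays false: the tool did its job, the warnings are informational.
  return { content, isError: false };
}
```

Keeping the data in the first item and the warnings in later items lets a caller that only wants the payload read content[0] and ignore the rest.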

Section 3

JSON-RPC Error Code Reference

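The table below lists seven error codes you will meet in practice. The five codes from -32700 to -32603 are defined by JSON-RPC 2.0 itself; -32001 and -32002 sit in the range JSON-RPC reserves for implementation-defined server errors and are used by MCP and its SDKs. Knowing which ones to retry, and which to never retry, is critical for building robust clients.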

Code	Name	Meaning	Retry?	Typical Cause
-32700	ParseError	Invalid JSON was received — the server couldn't parse the message at all	Never	SDK bug, corrupted transport stream
-32600	InvalidRequest	The JSON structure is valid but not a valid JSON-RPC request object	Never	Missing `jsonrpc` or `method` field
-32601	MethodNotFound	The requested method does not exist on this server	Never	Wrong tool name, typo, server version mismatch
-32602	InvalidParams	Valid method but the params fail schema validation	Never	Missing required field, wrong type — Zod rejects it
-32603	InternalError	Uncaught exception inside the server handler	Maybe ×1	Unhandled thrown Error, server-side transient bug
-32001	RequestTimeout	Server did not respond within its configured timeout window	Yes w/ backoff	Slow upstream API, overloaded server, long computation
-32002	ResourceNotFound	The requested resource URI does not exist	Never	Wrong URI, resource deleted, stale cached reference

ℹ️

The golden rule: Only retry errors caused by transient conditions (timeouts, server overload). Never retry errors caused by bad inputs — -32602 InvalidParams won't fix itself no matter how many times you try.
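The retry column of the table can be captured as a small lookup. This is a sketch under the assumption that the codes and retry advice are exactly those listed above; retryPolicyFor and the RetryPolicy labels are hypothetical names.

```typescript
type RetryPolicy = 'never' | 'once' | 'backoff';

// Codes and advice copied from the table above.
const RETRY_POLICY = new Map<number, RetryPolicy>([
  [-32700, 'never'],   // ParseError
  [-32600, 'never'],   // InvalidRequest
  [-32601, 'never'],   // MethodNotFound
  [-32602, 'never'],   // InvalidParams
  [-32603, 'once'],    // InternalError: possibly transient
  [-32001, 'backoff'], // RequestTimeout
  [-32002, 'never'],   // ResourceNotFound
]);

// Unknown codes default to 'never': retrying blindly is the riskier choice.
function retryPolicyFor(code: number): RetryPolicy {
  return RETRY_POLICY.get(code) ?? 'never';
}
```

Defaulting unlisted codes to 'never' is a deliberate choice here: it keeps a client from amplifying load when it meets an error it does not understand.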

Section 4

Tool Error Patterns (isError)

The rule is simple: never throw from a tool handler unless you want a protocol error. For anything that is an expected operational failure — API 404, rate limit, bad user input — return isError: true with structured content.

// src/errors/tool-errors.ts
// ── Typed error categories ────────────────────────────────────────────────
export type ToolErrorCode =
  | 'NOT_FOUND'
  | 'RATE_LIMITED'
  | 'VALIDATION_FAILED'
  | 'PERMISSION_DENIED'
  | 'UPSTREAM_ERROR';

export interface ToolErrorPayload {
  code: ToolErrorCode;
  message: string;
  details?: Record<string, unknown>;
}

// ── Helper: build a well-formed isError result ────────────────────────────
export function createToolError(payload: ToolErrorPayload) {
  return {
    isError: true,
    content: [
      {
        type: 'text' as const,
        text: `[${payload.code}] ${payload.message}`,
      },
      {
        type: 'text' as const,
        text: JSON.stringify({ error: payload }, null, 2),
      },
    ],
  };
}

// ── Usage inside a tool handler ───────────────────────────────────────────
server.tool('get_repo', { owner: z.string(), repo: z.string() }, async (args) => {
  const res = await githubClient.getRepo(args.owner, args.repo);

  // ✅ Correct: operational failure → isError: true
  if (res.status === 404) {
    return createToolError({
      code: 'NOT_FOUND',
      message: `Repository ${args.owner}/${args.repo} not found`,
      details: { httpStatus: 404 },
    });
  }

  if (res.status === 429) {
    return createToolError({
      code: 'RATE_LIMITED',
      message: 'GitHub API rate limit exceeded — try again in 60 seconds',
      // Assuming a fetch-style Response: Headers is not a plain object, so use .get()
      details: { retryAfter: res.headers.get('retry-after') },
    });
  }

  if (res.status === 403) {
    return createToolError({
      code: 'PERMISSION_DENIED',
      message: 'Insufficient permissions to access this repository',
    });
  }

  // ✅ Correct: unexpected upstream error → still isError, not throw
  if (!res.ok) {
    return createToolError({
      code: 'UPSTREAM_ERROR',
      message: `GitHub API error: HTTP ${res.status}`,
      details: { httpStatus: res.status },
    });
  }

  const data = await res.json();
  return {
    content: [{ type: 'text', text: JSON.stringify(data, null, 2) }],
  };
});

💡

Why two content items? The first is a human-readable summary for the LLM to include in its response. The second is structured JSON for any client that wants to programmatically parse the error category and act on it (e.g., show a "Rate limited — retry in Xs" message to the user).
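On the client side, the structured second item can be recovered roughly like this. The sketch assumes the two-item convention produced by createToolError above; parseToolError and the simplified ToolResult type are hypothetical and exist only for this example.

```typescript
type ToolErrorCode =
  | 'NOT_FOUND'
  | 'RATE_LIMITED'
  | 'VALIDATION_FAILED'
  | 'PERMISSION_DENIED'
  | 'UPSTREAM_ERROR';

interface ToolErrorPayload {
  code: ToolErrorCode;
  message: string;
  details?: Record<string, unknown>;
}

interface ToolResult {
  content: { type: string; text?: string }[];
  isError?: boolean;
}

const KNOWN_CODES: readonly string[] = [
  'NOT_FOUND', 'RATE_LIMITED', 'VALIDATION_FAILED', 'PERMISSION_DENIED', 'UPSTREAM_ERROR',
];

// Returns the structured payload if the result follows the two-item convention, else null.
function parseToolError(result: ToolResult): ToolErrorPayload | null {
  if (!result.isError) return null;
  for (const item of result.content) {
    if (item.type !== 'text' || typeof item.text !== 'string') continue;
    try {
      const parsed: unknown = JSON.parse(item.text);
      if (typeof parsed !== 'object' || parsed === null) continue;
      const error = (parsed as { error?: unknown }).error;
      if (typeof error !== 'object' || error === null) continue;
      const { code, message } = error as { code?: unknown; message?: unknown };
      if (typeof code === 'string' && KNOWN_CODES.includes(code) && typeof message === 'string') {
        return error as ToolErrorPayload;
      }
    } catch {
      // Not JSON (for example the human-readable summary line): try the next item.
    }
  }
  return null;
}
```

Scanning every item rather than hard-coding index 1 keeps the parser working if a server adds or reorders content items.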

Section 5

Protocol Error Handling

Throwing from a tool handler should be rare — reserved for truly unexpected bugs and cases where the SDK itself validates inputs. Understand exactly when a throw is correct vs. when it creates a worse experience.

🤔

When SHOULD you throw? Only two legitimate cases: (1) Zod input validation — but the SDK does this for you automatically when you declare a schema. (2) A truly unexpected server-side bug that you have no way to recover from (e.g., config missing at startup). Everything else should be isError: true.

// src/errors/protocol-errors.ts
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';

// ── The McpError class ────────────────────────────────────────────────────
// Throwing McpError produces a proper JSON-RPC error response.
// The SDK serialises it as: { code: -32xxx, message: '...' }

// ✅ Correct: throw McpError for protocol-level issues
server.tool('search_repos', { query: z.string().min(1) }, async (args) => {
  // Zod .min(1) already rejects empty string with InvalidParams (-32602)
  // so you don't need to check it again.

  // But if you discover a logic invariant is violated:
  if (!process.env.GITHUB_TOKEN) {
    throw new McpError(
      ErrorCode.InternalError,
      'Server misconfiguration: GITHUB_TOKEN not set'
    );
  }

  // ... normal handler logic
});

// ── Zod validation vs returning isError comparison ────────────────────────

// ❌ Wrong: manual throw for expected business failure
server.tool('get_user', { username: z.string() }, async (args) => {
  const user = await github.getUser(args.username);
  if (!user) throw new Error('User not found'); // becomes -32603 InternalError!
});

// ✅ Right: return isError for expected business failure
server.tool('get_user', { username: z.string() }, async (args) => {
  const user = await github.getUser(args.username);
  if (!user) return createToolError({ code: 'NOT_FOUND', message: `User ${args.username} not found` });
  return { content: [{ type: 'text', text: JSON.stringify(user) }] };
});

📌

Zod input validation errors are protocol errors, not tool errors. When you pass a Zod schema to server.tool(), the SDK runs validation before your handler is ever called. If it fails, the SDK automatically throws McpError(ErrorCode.InvalidParams, ...) — your code never runs and there is no isError result.

Section 6

Retry Strategies

Not every failure deserves a retry. Retrying non-retriable errors wastes time and amplifies load on already-struggling services. Use exponential backoff with jitter, and prefer a time budget over a fixed retry count.

Error Type	Retry?	Strategy
Transport error (socket reset)	Yes	Reconnect transport, then retry with backoff
-32603 InternalError	Once	Single retry after 500ms — could be transient
-32001 RequestTimeout	Yes	Exponential backoff up to budget
isError: RATE_LIMITED	Yes	Respect `retryAfter` header, then linear backoff
isError: UPSTREAM_ERROR (5xx)	Maybe	Backoff; check if upstream is intermittent
-32602 InvalidParams	Never	Wrong args — fix the call, not retry it
-32601 MethodNotFound	Never	Wrong method name — a code bug
isError: NOT_FOUND	Never	Resource doesn't exist — won't change on retry
isError: PERMISSION_DENIED	Never	Auth issue — credentials must change first

// src/retry/backoff.ts
interface RetryOptions {
  maxBudgetMs: number;      // total time we're willing to spend
  initialDelayMs: number;   // first retry delay
  maxDelayMs: number;       // cap on any single delay
  jitterFactor: number;     // 0–1, how much randomness to add
}

const DEFAULT_OPTS: RetryOptions = {
  maxBudgetMs: 30_000,
  initialDelayMs: 300,
  maxDelayMs: 8_000,
  jitterFactor: 0.25,
};

function withJitter(delay: number, factor: number): number {
  const jitter = delay * factor * (Math.random() * 2 - 1); // ±factor%
  return Math.round(delay + jitter);
}

// ── Retry budget pattern ──────────────────────────────────────────────────
export async function retryWithBudget<T>(
  fn: (attempt: number) => Promise<T>,
  isRetriable: (err: unknown) => boolean,
  opts: Partial<RetryOptions> = {}
): Promise<T> {
  const o = { ...DEFAULT_OPTS, ...opts };
  const deadline = Date.now() + o.maxBudgetMs;
  let delay = o.initialDelayMs;
  let attempt = 0;

  while (true) {
    attempt++;
    try {
      return await fn(attempt);
    } catch (err) {
      const remaining = deadline - Date.now();
      if (!isRetriable(err) || remaining <= 0) throw err;

      const waitMs = Math.min(withJitter(delay, o.jitterFactor), remaining, o.maxDelayMs);
      console.warn(`[retry] attempt ${attempt} failed — waiting ${waitMs}ms (${remaining}ms budget left)`);
      await new Promise(r => setTimeout(r, waitMs));
      delay = Math.min(delay * 2, o.maxDelayMs);
    }
  }
}

// ── Decorator: wrap any tool handler with retry logic ─────────────────────
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';

type ToolHandler<T> = (args: T) => Promise<unknown>;

const RETRIABLE_CODES = new Set([
  ErrorCode.InternalError,   // -32603 (retried until the budget runs out; add an attempt cap for the table's "once")
  ErrorCode.RequestTimeout,  // -32001 (if your SDK version defines it)
]);

export function decorateWithRetry<T>(
  handler: ToolHandler<T>,
  opts?: Partial<RetryOptions>
): ToolHandler<T> {
  return (args) =>
    retryWithBudget(
      () => handler(args),
      (err) => err instanceof McpError && RETRIABLE_CODES.has(err.code),
      opts
    );
}

// Usage:
server.tool('search_repos', { query: z.string() },
  decorateWithRetry(async (args) => {
    // ... handler logic
  }, { maxBudgetMs: 15_000 })
);
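The decorator above only reacts to thrown errors. For the isError rows of the retry table, a caller needs to derive a wait time from the error payload instead. The following is a sketch only: retryDelayMs is a hypothetical helper, details.retryAfter is assumed to hold a number of seconds (the HTTP-date form of Retry-After is not handled), and the 300ms/8000ms values simply reuse the defaults from the listing above.

```typescript
interface ToolErrorPayload {
  code: string;
  message: string;
  details?: Record<string, unknown>;
}

// Returns how long to wait before retrying, or null when a retry cannot help.
function retryDelayMs(error: ToolErrorPayload, attempt: number): number | null {
  switch (error.code) {
    case 'RATE_LIMITED': {
      // Prefer the server's hint; Number() yields NaN for missing or non-numeric values.
      const seconds = Number(error.details?.retryAfter);
      if (Number.isFinite(seconds) && seconds > 0) return seconds * 1000;
      return attempt * 1000; // no usable hint: linear backoff
    }
    case 'UPSTREAM_ERROR':
      // Exponential backoff starting at 300ms, capped at 8s.
      return Math.min(300 * 2 ** (attempt - 1), 8000);
    default:
      // NOT_FOUND, PERMISSION_DENIED, VALIDATION_FAILED: deterministic failures.
      return null;
  }
}
```

Returning null instead of throwing keeps the "never retry" rows of the table as an ordinary branch in the caller's loop.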

Section 7

Circuit Breaker Pattern

Retrying is fine for occasional failures. But if a downstream service is truly down, hammering it with retries makes things worse. A circuit breaker stops trying after N consecutive failures, gives the service time to recover, then probes cautiously before resuming full traffic.

CLOSED

Normal operation. Requests pass through. Failures are counted. When threshold is hit → OPEN.

OPEN

All requests immediately rejected without calling downstream. After cooldown period → HALF-OPEN.

HALF-OPEN

Trial requests are let through. After successThreshold consecutive successes → CLOSED. If any trial fails → back to OPEN.

stateDiagram-v2
  [*] --> CLOSED
  CLOSED --> OPEN : failures >= threshold
  OPEN --> HALF_OPEN : cooldown elapsed
  HALF_OPEN --> CLOSED : trial request succeeds
  HALF_OPEN --> OPEN : trial request fails
  CLOSED --> CLOSED : request succeeds (reset counter)
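The diagram's transitions can be written down as a pure function, which is handy for reasoning about the breaker before adding timers and counters. This sketch mirrors the diagram, where a single successful trial closes the breaker; the event names and nextState are hypothetical, and the class in this section generalizes the closing rule with a configurable successThreshold.

```typescript
type CBState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

type CBEvent =
  | 'request_succeeded'
  | 'threshold_reached'
  | 'cooldown_elapsed'
  | 'trial_succeeded'
  | 'trial_failed';

// Pure transition table: events that do not apply to a state leave it unchanged.
function nextState(state: CBState, event: CBEvent): CBState {
  switch (state) {
    case 'CLOSED':
      return event === 'threshold_reached' ? 'OPEN' : 'CLOSED';
    case 'OPEN':
      return event === 'cooldown_elapsed' ? 'HALF_OPEN' : 'OPEN';
    case 'HALF_OPEN':
      if (event === 'trial_succeeded') return 'CLOSED';
      if (event === 'trial_failed') return 'OPEN';
      return 'HALF_OPEN';
  }
}
```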

// src/circuit-breaker/CircuitBreaker.ts
type CBState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface CBOptions {
  failureThreshold: number;   // consecutive failures to open
  cooldownMs: number;         // how long to stay open
  successThreshold: number;   // consecutive successes to close from half-open
}

export class CircuitBreaker {
  private state: CBState = 'CLOSED';
  private failureCount = 0;
  private successCount = 0;
  private openedAt = 0;
  private readonly opts: CBOptions;

  constructor(opts: Partial<CBOptions> = {}) {
    this.opts = {
      failureThreshold: opts.failureThreshold ?? 5,
      cooldownMs: opts.cooldownMs ?? 30_000,
      successThreshold: opts.successThreshold ?? 2,
    };
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt >= this.opts.cooldownMs) {
        this.state = 'HALF_OPEN';
        this.successCount = 0;
      } else {
        throw new Error(
          `Circuit breaker OPEN — service unavailable (cooldown: ${
            Math.ceil((this.opts.cooldownMs - (Date.now() - this.openedAt)) / 1000)
          }s remaining)`
        );
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      if (this.successCount >= this.opts.successThreshold) {
        this.state = 'CLOSED';
        // console.info writes to stdout, which is the JSON-RPC channel on stdio; log to stderr
        console.error('[circuit-breaker] CLOSED: service recovered');
      }
    }
  }

  private onFailure(): void {
    this.failureCount++;
    if (
      this.state === 'HALF_OPEN' ||
      this.failureCount >= this.opts.failureThreshold
    ) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
      console.warn(
        `[circuit-breaker] OPEN — ${this.failureCount} consecutive failures`
      );
    }
  }

  getState(): CBState { return this.state; }
}

// ── Integration with MCP tool handler ────────────────────────────────────
const githubBreaker = new CircuitBreaker({ failureThreshold: 5, cooldownMs: 30_000 });

server.tool('search_repos', { query: z.string() }, async (args) => {
  try {
    const data = await githubBreaker.call(() => githubClient.searchRepos(args.query));
    return { content: [{ type: 'text', text: JSON.stringify(data) }] };
  } catch (err) {
    const msg = err instanceof Error ? err.message : 'Unknown error';
    const isOpen = msg.startsWith('Circuit breaker OPEN');
    return createToolError({
      code: 'UPSTREAM_ERROR', // same category either way; only the message differs
      message: isOpen
        ? 'GitHub service is currently unavailable — please try again shortly'
        : `GitHub API error: ${msg}`,
    });
  }
});

Section 8

Timeout Management

MCP tool calls can hang indefinitely if an upstream service never responds. Always apply two timeout layers: a per-tool timeout via AbortController, and a global request timeout for the overall server. And always clean up in a finally block.

// src/timeout/with-timeout.ts

// ── Layer 1: Per-tool timeout via AbortController ─────────────────────────
export async function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  toolName: string
): Promise<T> {
  const controller = new AbortController();
  const handle = setTimeout(() => {
    controller.abort(new Error(`${toolName} timed out after ${timeoutMs}ms`));
  }, timeoutMs);

  try {
    return await fn(controller.signal);
  } finally {
    // ✅ Always clear the timer — prevents handle leaks even on success
    clearTimeout(handle);
  }
}

// ── Helper: fetch that honours the abort signal and rejects on non-2xx ────
async function fetchWithTimeout(url: string, signal: AbortSignal): Promise<Response> {
  const response = await fetch(url, { signal });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response;
}

// ── Complete example: search tool with 10s timeout ────────────────────────
server.tool('search_repos', { query: z.string() }, async (args) => {
  return withTimeout(
    async (signal) => {
      try {
        const res = await fetchWithTimeout(
          `https://api.github.com/search/repositories?q=${encodeURIComponent(args.query)}`,
          signal
        );
        const data = await res.json();
        return { content: [{ type: 'text', text: JSON.stringify(data, null, 2) }] };
      } catch (err) {
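        // abort(reason) makes fetch reject with that reason (a plain Error here),
        // so check the signal rather than looking for err.name === 'AbortError'.
        if (signal.aborted) {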
          return createToolError({
            code: 'UPSTREAM_ERROR',
            message: 'Search request timed out — GitHub API is slow. Try a more specific query.',
          });
        }
        return createToolError({ code: 'UPSTREAM_ERROR', message: String(err) });
      }
    },
    10_000,
    'search_repos'
  );
});

🚨

Always clean up timers in finally blocks. If you call setTimeout to abort a fetch, but the fetch succeeds first, the timer is still pending. Without clearTimeout in a finally, Node.js keeps that timer (and everything its callback references) alive until it fires. Multiplied across thousands of requests, this wastes memory and can delay graceful shutdown, because pending timers keep the process running.
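The global timeout layer mentioned at the start of this section can be sketched with Promise.race. This is an illustration under stated assumptions: withDeadline and DeadlineExceededError are hypothetical names, and the race only stops waiting for the work, it does not cancel it, so it is best combined with the AbortController pattern shown earlier.

```typescript
// Hypothetical error type so callers can tell a deadline from other failures.
class DeadlineExceededError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} exceeded its ${ms}ms deadline`);
    this.name = 'DeadlineExceededError';
  }
}

// Resolves with the work's value, or rejects once the deadline passes first.
async function withDeadline<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let handle: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    handle = setTimeout(() => reject(new DeadlineExceededError(label, ms)), ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    // Same cleanup rule as above: clear the timer whichever side wins.
    clearTimeout(handle);
  }
}
```

A server could wrap each handler invocation in withDeadline with the global budget, while individual tools keep their shorter per-call timeouts.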

Section 9

Graceful Degradation

When the primary data source fails, don't just return an error — try to return something useful. Stale cache, partial results, and fallback chains turn hard failures into soft degradations that the LLM can work with.

// src/degradation/graceful.ts
import { createToolError } from '../errors/tool-errors.js';

// ── Pattern 1: Stale cache fallback ──────────────────────────────────────
const cache = new Map<string, { data: unknown; ts: number }>();

async function withStaleFallback<T>(
  key: string,
  freshFn: () => Promise<T>,
  maxStaleMs = 5 * 60_000 // serve stale data up to 5 min old
): Promise<{ data: T; stale: boolean }> {
  try {
    const data = await freshFn();
    cache.set(key, { data, ts: Date.now() });
    return { data, stale: false };
  } catch {
    const entry = cache.get(key);
    if (entry && Date.now() - entry.ts < maxStaleMs) {
      return { data: entry.data as T, stale: true };
    }
    throw new Error('No fresh or cached data available');
  }
}

// ── Pattern 2: Partial results ────────────────────────────────────────────
async function fetchMultipleRepos(repos: string[]) {
  const results = await Promise.allSettled(
    repos.map(async (r) => {
      const [owner, name] = r.split('/');
      return { repo: r, data: await githubClient.getRepo(owner, name) };
    })
  );

  const successes = results
    .filter((r): r is PromiseFulfilledResult<{ repo: string; data: unknown }> => r.status === 'fulfilled')
    .map((r) => r.value);
  // Pair each rejection with its ORIGINAL index; after filter() the index
  // would no longer line up with the repos array.
  const failures = results.flatMap((r, i) =>
    r.status === 'rejected' ? [{ repo: repos[i], reason: r.reason?.message }] : []
  );

  return {
    content: [
      {
        type: 'text' as const, // keep the literal type so the result matches the tool result shape
        text: JSON.stringify({
          results: successes,
          warnings: failures.length
            ? `${failures.length} repos failed: ${failures.map(f => f.repo).join(', ')}`
            : undefined,
        }, null, 2),
      },
    ],
    isError: false, // partial success is still success
  };
}

// ── Pattern 3: Fallback chain (fresh → stale cache → mirror → error) ─────
server.tool('get_repo_stats', { owner: z.string(), repo: z.string() }, async (args) => {
  const cacheKey = `${args.owner}/${args.repo}`;

  // Try 1: fresh primary API
  // Try 2: stale cache (up to 10 min old)
  // Try 3: secondary data source (e.g., a read-replica or mirror)
  // Try 4: hard error
  try {
    const { data, stale } = await withStaleFallback(
      cacheKey,
      () => githubClient.getRepoStats(args.owner, args.repo),
      10 * 60_000
    );

    return {
      content: [
        {
          type: 'text',
          text: JSON.stringify({
            ...data as object,
            _meta: stale ? { warning: 'Showing cached data — live API unavailable' } : undefined,
          }, null, 2),
        },
      ],
    };
  } catch {
    // Try secondary source
    try {
      const mirror = await mirrorClient.getRepoStats(args.owner, args.repo);
      return {
        content: [{
          type: 'text',
          text: JSON.stringify({ ...mirror, _meta: { source: 'mirror' } }, null, 2),
        }],
      };
    } catch {
      return createToolError({
        code: 'UPSTREAM_ERROR',
        message: `All data sources unavailable for ${cacheKey} — please try again later`,
      });
    }
  }
});

Section 10

Structured Error Logging

Random console.error() calls are untrackable in production. Use correlation IDs to link a client request to every log line it generates, and emit structured JSON to stderr so log aggregators can index, filter, and alert on it.

// src/logging/structured.ts
import { randomUUID } from 'crypto';
import { Server } from '@modelcontextprotocol/sdk/server/index.js';

// ── Log entry structure ───────────────────────────────────────────────────
interface LogEntry {
  timestamp: string;
  correlationId: string;
  toolName: string;
  errorCode?: string;
  errorMessage?: string;
  duration?: number;
  attempt?: number;
  level: 'info' | 'warn' | 'error';
}

function structuredLog(entry: LogEntry): void {
  // MCP servers MUST use stderr for logs — stdout is the JSON-RPC channel
  process.stderr.write(JSON.stringify(entry) + '\n');
}

// ── MCP logging primitive: send to client via server notification ─────────
export function createLogMiddleware(server: Server) {
  return {
    async logInfo(correlationId: string, toolName: string, message: string) {
      const entry: LogEntry = {
        timestamp: new Date().toISOString(),
        correlationId,
        toolName,
        level: 'info',
        errorMessage: message,
      };
      structuredLog(entry);
      // Also send to client via MCP logging notification
      await server.notification({
        method: 'notifications/message',
        params: { level: 'info', logger: toolName, data: message },
      });
    },

    async logError(correlationId: string, toolName: string, err: unknown, duration: number, attempt?: number) {
      const entry: LogEntry = {
        timestamp: new Date().toISOString(),
        correlationId,
        toolName,
        errorCode: err instanceof Error ? err.constructor.name : 'UnknownError',
        errorMessage: err instanceof Error ? err.message : String(err),
        duration,
        attempt,
        level: 'error',
      };
      structuredLog(entry);
    },
  };
}

// ── Usage in a tool handler with correlation ID ───────────────────────────
server.tool('search_repos', { query: z.string() }, async (args) => {
  const correlationId = randomUUID();
  const startTime = Date.now();

  const logger = createLogMiddleware(server);
  await logger.logInfo(correlationId, 'search_repos', `Starting search: "${args.query}"`);

  try {
    const data = await githubClient.searchRepos(args.query);
    return { content: [{ type: 'text', text: JSON.stringify(data) }] };
  } catch (err) {
    const duration = Date.now() - startTime;
    await logger.logError(correlationId, 'search_repos', err, duration);
    return createToolError({ code: 'UPSTREAM_ERROR', message: String(err) });
  }
});

📋

Two logging channels in MCP: process.stderr for server-side logs (visible in your terminal / log aggregator), and server.notification({ method: 'notifications/message' }) to send log events to the connected client over the MCP protocol. Use both for full observability.

Section 11

Testing Error Paths (Vitest)

Happy-path tests are easy. Error-path tests are where production reliability is built. Every retry strategy, every circuit breaker transition, and every isError branch needs its own test.

// src/__tests__/error-handling.test.ts
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { CircuitBreaker } from '../circuit-breaker/CircuitBreaker.js';
import { createToolError } from '../errors/tool-errors.js';
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';

// ── Mock GitHub client ────────────────────────────────────────────────────
const mockGithub = {
  getRepo: vi.fn(),
  searchRepos: vi.fn(),
};

// ── Test 1: Tool returns isError on upstream 404 ──────────────────────────
describe('get_repo tool', () => {
  it('returns isError: true when GitHub returns 404', async () => {
    mockGithub.getRepo.mockResolvedValueOnce({ status: 404, ok: false });

    const result = await callGetRepoHandler({ owner: 'acme', repo: 'missing' }, mockGithub);

    expect(result.isError).toBe(true);
    expect(result.content[0].text).toContain('NOT_FOUND');
    expect(result.content[0].text).toContain('acme/missing');
  });

  it('does NOT throw for GitHub 404 — protocol stays healthy', async () => {
    mockGithub.getRepo.mockResolvedValueOnce({ status: 404, ok: false });

    // callTool should resolve, not reject
    await expect(
      callGetRepoHandler({ owner: 'acme', repo: 'missing' }, mockGithub)
    ).resolves.toBeDefined();
  });
});

// ── Test 2: Circuit breaker opens after threshold ─────────────────────────
describe('CircuitBreaker', () => {
  let cb: CircuitBreaker;

  beforeEach(() => {
    cb = new CircuitBreaker({ failureThreshold: 3, cooldownMs: 5_000 });
  });

  it('starts in CLOSED state', () => {
    expect(cb.getState()).toBe('CLOSED');
  });

  it('opens after failureThreshold consecutive failures', async () => {
    const fail = () => Promise.reject(new Error('upstream down'));

    await expect(cb.call(fail)).rejects.toThrow();
    await expect(cb.call(fail)).rejects.toThrow();
    await expect(cb.call(fail)).rejects.toThrow(); // 3rd failure → OPEN

    expect(cb.getState()).toBe('OPEN');
  });

  it('rejects immediately in OPEN state without calling downstream', async () => {
    // Force open
    for (let i = 0; i < 3; i++) {
      await cb.call(() => Promise.reject(new Error('fail'))).catch(() => {});
    }

    const spy = vi.fn().mockResolvedValue('ok');
    await expect(cb.call(spy)).rejects.toThrow('Circuit breaker OPEN');
    expect(spy).not.toHaveBeenCalled(); // downstream never called
  });
});

// ── Test 3: Zod validation returns JSON-RPC InvalidParams error ───────────
describe('Zod validation (protocol error)', () => {
  it('returns McpError with code -32602 for missing required param', async () => {
    // Use in-process MCP client/server for integration test
    const { client, cleanup } = await createTestServerClient();

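    try {
      // Missing `query`: the call must reject. A bare try/catch would also pass
      // if nothing were thrown, so assert on the rejection itself.
      const call = client.callTool({ name: 'search_repos', arguments: {} });
      await expect(call).rejects.toBeInstanceOf(McpError);
      await expect(call).rejects.toMatchObject({ code: ErrorCode.InvalidParams });
    } finally {
      await cleanup();
    }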
  });
});

🧪

Mock injection pattern: Pass your GitHub client as a parameter to the tool handler factory (dependency injection) rather than importing it at the module level. This makes it trivial to swap in mockGithub in tests without any module mocking magic — just a different argument.
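A handler factory along these lines is one way to realize the injection pattern. It is a sketch, not the lesson's actual helper: makeGetRepoHandler, GithubLike, and RepoResponse are hypothetical names, and the error text is inlined so the example stays self-contained.

```typescript
// Minimal shapes the handler depends on; a real client or a mock can satisfy them.
interface RepoResponse {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

interface GithubLike {
  getRepo(owner: string, repo: string): Promise<RepoResponse>;
}

interface ToolResult {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
}

// The client arrives as an argument, so tests can pass a fake without module mocking.
function makeGetRepoHandler(github: GithubLike) {
  return async (args: { owner: string; repo: string }): Promise<ToolResult> => {
    const res = await github.getRepo(args.owner, args.repo);
    if (res.status === 404) {
      return {
        isError: true,
        content: [{ type: 'text', text: `[NOT_FOUND] Repository ${args.owner}/${args.repo} not found` }],
      };
    }
    if (!res.ok) {
      return {
        isError: true,
        content: [{ type: 'text', text: `[UPSTREAM_ERROR] GitHub API error: HTTP ${res.status}` }],
      };
    }
    return { content: [{ type: 'text', text: JSON.stringify(await res.json(), null, 2) }] };
  };
}
```

In production code the factory would be called once with the real client and its return value registered as the tool handler; in tests it is called with a stub per test case.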

Section 12

Production Error Checklist

Before you ship any MCP server to production, run through this checklist. Each item maps directly to a pattern covered in this lesson.

1 Never throw plain errors from tool handlers. All expected business failures (404, rate limit, validation, permission) use return createToolError({ code, message }) with isError: true.
2 Throw McpError (not Error) for protocol-level failures. Use new McpError(ErrorCode.InvalidParams, msg) only when the request itself is malformed beyond what Zod already catches.
3 Implement retry with exponential backoff and jitter for timeout and transient InternalError codes. Cap retries with a time budget (e.g. 30s), not a fixed count.
4 Never retry -32602 InvalidParams, -32601 MethodNotFound, NOT_FOUND, or PERMISSION_DENIED. These are deterministic failures — repeating them wastes resources and delays user feedback.
5 Wrap every external API call with a CircuitBreaker. Use a threshold of 5 consecutive failures and a 30-second cooldown minimum. Return a user-friendly isError result when the breaker is open.
6 Apply per-tool timeouts via AbortController for all outbound HTTP calls. Always clearTimeout in a finally block to prevent handle leaks. Recommended default: 10s per tool, 30s global.
7 Implement at least one graceful degradation strategy for every tool that calls an external service: stale cache, partial results, or a fallback source. Never let an upstream outage become a complete tool blackout.
8 Use structured JSON logging to stderr with a correlation ID, tool name, error code, duration, and attempt number on every error. Never use console.log on stdout — that breaks the JSON-RPC channel.
9 Emit MCP log notifications via server.notification({ method: 'notifications/message' }) for errors the client host should surface to the user. Keep stderr logs for server-side observability.
10 Write tests for every error path: isError on upstream failure, circuit breaker state transitions, Zod validation → InvalidParams, timeout → isError result. Use dependency injection to inject mock clients without module mocking.

Quiz · Day 10

Error Handling Check

5 questions covering the two error channels, retry strategies, circuit breakers, timeouts, and isError semantics. Score 5/5 and you're production-ready.

Q1. A tool handler calls the GitHub API, which returns a 404. The correct pattern is to...

A. throw new McpError(ErrorCode.InternalError, '404'), mapping the HTTP error to a protocol error

B. return { isError: true, content: [{ type: 'text', text: 'Repo not found: 404' }] }, because it's an expected operational failure

C. throw new Error('GitHub 404') and let the SDK convert it to a JSON-RPC error automatically

D. return an empty content array and let the client infer failure from the missing content

Q2. Which JSON-RPC error code should you never retry automatically?

A. -32603 InternalError: could be transient, worth one retry

B. -32001 RequestTimeout: the server was just slow, retry with backoff

C. -32602 InvalidParams: wrong arguments won't fix themselves on retry

D. -32603 and -32001 both should never be retried

Q3. A circuit breaker is in OPEN state. A request arrives. What happens?

A. The request is immediately rejected without calling the downstream service

B. The request is queued and retried automatically when the breaker closes

C. The circuit breaker switches to HALF_OPEN and tries the request as a trial

D. The request falls through to the next retry attempt in the backoff queue

Q4. You have a tool handler that uses AbortController for a 10-second timeout. The fetch completes in 3 seconds. What must you do to avoid a memory/handle leak?

A. Call process.nextTick to flush the event loop and release the timer

B. Call clearTimeout on the timeout handle in a finally block

C. Nothing: the AbortController garbage-collects automatically when it goes out of scope

D. Restart the server: leaked handles can't be cleaned up without a process restart

Q5. Your MCP server's tool handler returns { isError: true, content: [...] }. From the client's perspective, what does client.callTool() do?

A. Throws an McpError with code -32603 InternalError, because isError maps to a protocol error

B. Rejects the promise, because isError: true always means a rejected promise on the client

C. Resolves normally; the caller must check result.isError to detect the error condition

D. Returns undefined, because an isError result has no content for the client to inspect
