Production AI systems are "probabilistic" not only in their output but in their uptime. An architect's job is to design for the inevitability of 429 (rate limit) responses, 529 (overloaded) responses, and network timeouts. A resilient system doesn't crash when Claude is busy; it fails over gracefully across models and regions.
Imagine a global shipping company. If a major canal is blocked (a 529 server overload), the company doesn't stop all shipping for the year. It has a pre-planned **Resilience Hierarchy**.
1. Local Wait (429 Retries): If a dock is busy, the ship waits outside for 30 minutes. If still busy, it waits 60, then 120 (Exponential Backoff).
2. Route Diversion (Regional Fail-over): If the port of New York is closed, the ship is rerouted to New Jersey or Philadelphia (Switching from us-east-1 to us-east-2).
3. Alternative Transport (Model Cascade): If the heavy cargo ship (Sonnet) is too slow to get through, they fly the critical parts via a smaller, faster jet (Haiku).
4. Deferred Delivery (Batch API): If the customer doesn't need it today, they put it on a slow train that arrives within 24 hours (Message Batches API).
Building for Resilience is about ensuring the system remains "Available" even when the primary service is "Degraded." An Architect's goal is to turn a "Critical Failure" into a "Slight Latency Increase."
Architects must map specific HTTP status codes to custom recovery strategies. Misidentifying a 400 as a 429 leads to infinite retry loops and wasted costs.
| Code | Meaning | Action | Automatic Retry? |
|---|---|---|---|
| 400 | Bad Request (Validation) | DROP and LOG. Usually a prompt escape or invalid tool schema. | NO |
| 429 | Rate Limit Hit | Trigger Exponential Backoff + Jitter. | YES (Aggressive) |
| 529 | Overloaded / Busy | Trigger Regional Fail-over or Model Downgrade. | YES (Tiered) |
| 401/403 | Auth / Permissions | CRITICAL ALERT. API keys may be expired or rotated. | NO |
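
The table above can be expressed as a small lookup so that retry logic never has to guess. This is a minimal sketch: the names `RECOVERY` and `recoveryFor` and the action strings are illustrative, not part of any SDK. The no-retry default for unlisted codes is an assumption.

```javascript
// Hypothetical mapping from HTTP status to a recovery strategy,
// mirroring the table above. All names here are illustrative.
const RECOVERY = {
  400: { action: "drop_and_log", retry: false },
  401: { action: "critical_alert", retry: false },
  403: { action: "critical_alert", retry: false },
  429: { action: "backoff_with_jitter", retry: true },
  529: { action: "failover_or_downgrade", retry: true },
};

function recoveryFor(status) {
  // Assumption: codes not listed in the table are not retried automatically,
  // so an unknown error surfaces instead of looping.
  return RECOVERY[status] ?? { action: "drop_and_log", retry: false };
}

console.log(recoveryFor(429).action);
```

Keeping the mapping in one place makes it harder for a 400 to slip into the retry path by accident.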
In high-availability applications, the "best" model (Claude 3.5 Sonnet) might occasionally hit capacity limits. The **Model Cascade** ensures the user gets *an* answer, even if it comes from a less capable model.
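```javascript
// SONNET_US, SONNET_EU, HAIKU_US, call_api, wait, and
// exponential_backoff_with_jitter are assumed to be defined elsewhere.
async function call_claude_resilient(prompt) {
  let last_error;
  for (const model of [SONNET_US, SONNET_EU, HAIKU_US]) {
    try {
      return await call_api(model, prompt);
    } catch (e) {
      // Don't retry auth errors or bad prompts
      if (e.code !== 429 && e.code !== 529) throw e;
      last_error = e;
      await wait(exponential_backoff_with_jitter());
    }
  }
  throw last_error; // every model in the cascade failed
}
```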
When an API returns a 429, thousands of clients might all wait exactly 1 second and then retry. This is the **Thundering Herd** problem. To solve this, architects use **Full Jitter**.
```javascript
// Standard backoff: every client computes the same delay, so retries often collide
const delay = base * Math.pow(2, attempt);

// Full Jitter: pick a random delay in [0, delay) to spread retries across the window
const jittered_delay = Math.random() * (base * Math.pow(2, attempt));
```
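In practice the exponential window is usually capped so that late attempts do not wait indefinitely. The sketch below wraps the Full Jitter formula in a reusable function. The cap (`capMs`), the default values, and the injectable `rng` parameter are assumptions added for illustration and testability.

```javascript
// Full Jitter with a cap: returns a uniformly random delay in
// [0, min(capMs, baseMs * 2^attempt)).
// `rng` defaults to Math.random and can be replaced in tests.
function fullJitterDelay(attempt, baseMs = 1000, capMs = 30000, rng = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return rng() * ceiling;
}

// Example: the delay for the fourth retry (attempt = 3) falls in [0, 8000) ms.
const delayMs = fullJitterDelay(3);
console.log(delayMs >= 0 && delayMs < 8000);
```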
If a model endpoint returns more than 20 errors in 10 seconds, the **Circuit Opens**. For the next 60 seconds, all calls to that endpoint are rejected *locally* without hitting the API. This protects your thread pool and stops you from spending retries and rate-limit budget on calls that are almost certain to fail during a major outage.
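A minimal circuit breaker along these lines might look like the following. The thresholds come from the paragraph above. The class name, method names, and the injectable `now` clock are illustrative assumptions. This sketch simply closes the circuit once the cool-down elapses, whereas fuller implementations often add a half-open state that lets a single probe request through first.

```javascript
// Minimal circuit breaker sketch (illustrative names, not a library API).
class CircuitBreaker {
  constructor({ maxErrors = 20, windowMs = 10000, openMs = 60000, now = Date.now } = {}) {
    this.maxErrors = maxErrors;
    this.windowMs = windowMs;
    this.openMs = openMs;
    this.now = now;
    this.errorTimes = []; // timestamps of recent failures
    this.openedAt = null; // null means the circuit is closed
  }

  // Returns true if a call may be sent to the API.
  allowRequest() {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.openMs) {
      // Cool-down elapsed: close the circuit and start counting afresh.
      this.openedAt = null;
      this.errorTimes = [];
      return true;
    }
    return false; // rejected locally, no network call
  }

  recordFailure() {
    const t = this.now();
    // Keep only failures inside the sliding window, then add this one.
    this.errorTimes = this.errorTimes.filter((ts) => t - ts < this.windowMs);
    this.errorTimes.push(t);
    if (this.errorTimes.length > this.maxErrors) this.openedAt = t;
  }
}
```

Callers check `allowRequest()` before each API call and call `recordFailure()` whenever the endpoint returns an error.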
For workloads that can tolerate a delay of up to 24 hours (data processing, summary generation, evaluations), avoid the interactive API. The **Message Batches API** provides a 50% cost reduction and has its own rate limits, separate from interactive traffic.
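As a rough illustration, the request body for a batch is a list of entries, each with a `custom_id` and the usual message `params`. The helper below only builds that body and makes no network call. The helper name, the `summary-` prefix, the prompt text, and the document shape (`id`, `text`) are assumptions, and the field names should be checked against the current Message Batches API reference before use.

```javascript
// Builds a JSON body for a batch request (sent to POST /v1/messages/batches).
// Hypothetical helper: `model` is supplied by the caller, and each document
// is assumed to have an `id` that is safe to use inside `custom_id`.
function buildBatchBody(documents, model, maxTokens = 1024) {
  return {
    requests: documents.map((doc) => ({
      custom_id: `summary-${doc.id}`, // used to match results back to inputs
      params: {
        model,
        max_tokens: maxTokens,
        messages: [{ role: "user", content: `Summarize:\n\n${doc.text}` }],
      },
    })),
  };
}

const exampleBody = buildBatchBody([{ id: "doc1", text: "Quarterly report text" }], "your-model-id");
console.log(JSON.stringify(exampleBody, null, 2));
```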
If your real-time API is hitting 429s during a traffic surge:
Retrying a request that failed because the prompt was too long or invalid. Result: You hit your rate limit on requests that will never succeed. Fix: Catch 400 errors and block them from retry logic.
Switching regions without checking if the secondary region is also overloaded. Fix: Implement a Global Load Balancer that tracks health metrics across all endpoints.
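One way to avoid failing over blindly is to track per-endpoint health and choose only among endpoints under an error threshold. The sketch below is a simplified in-process stand-in for the load balancer described above. The class name, the 0.5 threshold, and the cumulative counters are assumptions, and a production version would use a rolling time window.

```javascript
// Tracks cumulative success/failure counts per endpoint and picks the
// endpoint with the lowest error rate. Illustrative sketch only.
class EndpointHealth {
  constructor(endpoints, maxErrorRate = 0.5) {
    this.maxErrorRate = maxErrorRate;
    this.stats = new Map(endpoints.map((name) => [name, { ok: 0, failed: 0 }]));
  }

  record(name, succeeded) {
    const s = this.stats.get(name);
    if (succeeded) s.ok++;
    else s.failed++;
  }

  errorRate(name) {
    const { ok, failed } = this.stats.get(name);
    const total = ok + failed;
    return total === 0 ? 0 : failed / total;
  }

  // Returns the healthiest endpoint, or null if all exceed the threshold.
  // Ties go to the endpoint listed first (the primary).
  pick() {
    let best = null;
    for (const name of this.stats.keys()) {
      const rate = this.errorRate(name);
      if (rate > this.maxErrorRate) continue;
      if (best === null || rate < this.errorRate(best)) best = name;
    }
    return best;
  }
}
```

A `null` result from `pick()` is the signal to stop switching regions and fall back to a model downgrade or deferred processing.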
Retrying for 30 seconds while the user's screen is frozen. Fix: Update UI state after 2 seconds to show "System high load, attempting retry..."
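The UI fix can be sketched as a wrapper that fires a callback when a request is taking too long, without cancelling the request itself. The function name `withSlowNotice` and its parameters are hypothetical. The 2-second default comes from the text above.

```javascript
// Runs `task` and calls `onSlow` once if the task has not settled
// within `slowAfterMs`. The task itself is not cancelled or retried here.
async function withSlowNotice(task, onSlow, slowAfterMs = 2000) {
  const timer = setTimeout(onSlow, slowAfterMs);
  try {
    return await task();
  } finally {
    clearTimeout(timer); // no notice if the task finished in time
  }
}
```

A caller would pass its retrying API call as `task` and a UI update such as showing "System high load, attempting retry..." as `onSlow`.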
Scenario: Your application processes 2,000 document summaries every night at 12:00 AM. During this window, interactive users experience frequent "Service Overloaded" errors.
Question: What is the most architecturally sound solution?
Correct Answer: B. The Batch API uses a different rate limit pool and priority, preventing background tasks from "starving" real-time users.