"The best prompt is the one that isn't sent." In a production AI ecosystem, success is defined by **Unit Economics**: the ability to deliver mission-critical value at a sustainable token-to-revenue ratio. An Architect must master Asymmetric Model Routing, Context Eviction, and Throughput Pipeline design to transform a "Cool Prototype" into a "Profitable Business Line."
If you need to deliver an envelope across town, you don't send a 40-ton semi-truck (Claude 3 Opus). You use a bicycle courier (Claude 3 Haiku). If you need to deliver 50 heavy pianos, you use the truck but organize them in a single route on a weekend (Message Batching). A logistics manager's job is to ensure the **Revenue per Delivery** is always higher than the **Fuel per Mile**.
1. Vehicle Selection (Model Routing): Matching the intelligence (cost) of the model to the difficulty of the task. Don't use a "Brain Surgeon" to flip burgers.
2. Cargo Pruning (Token Management): Taking only the 5 critical items instead of 100 non-essential ones to save space and weight (Tokens).
3. Route Planning (Pipelines): Using the highway (Message Batch API) for non-urgent deliveries to get a 50% discount on fuel.
Mastering Unit Economics is about ensuring the system remains "Profitable" at scale. In production, a 10% reduction in tokens can mean $100k in annual savings for an enterprise app.
Architects must treat models as "Price Tiers." The gap between Haiku and Opus is 100x. If you can move 80% of your tasks to a lower tier, you save 80% of your budget instantly.
| Tier | Intelligence | Cost (Relative) | Best Use Case |
|---|---|---|---|
| Claude 3 Opus | Supreme reasoning, deep synthesis. | $$$$$ (100x) | Legal analysis, medical summaries, PhD-level coding. |
| Claude 3.5 Sonnet | High intelligence, gold-standard speed. | $$$ (10x) | General purpose agent, refactoring, agent routing. |
| Claude 3 Haiku | Fast, concise, structurally accurate. | $ (1x) | Classification, extraction, simple unit tests, routing filters. |
Prompt Caching is the "Game Changer" for Domain 5. It turns **O(N^2)** token costs (paying for history again and again) into **O(N)** costs.
// NO CACHING Turn 1-10: 100k + 105k + 110k... = 1,225,000 Tokens Paid. // WITH CACHING Turn 1: 100k (Write fee) Turn 2-10: 100k (90% discount) + usage RESULT: ~250,000 Tokens Paid. (~80% SAVINGS)
For workloads like analytics, historical audits, or document indexers, real-time responses are "Wasteful." The **Message Batch API** provides a **50% flat discount** on all token usage with a 24-hour SLA.
To prevent a "Rogue Agent" from spending $5,000 in an hour due to a tool loop, Architects must implement **State-Based Hard Limits**.
Using Opus for simple JSON extraction because "it's the best." Result: You pay 100x the required price. Fix: Route simple extraction to Haiku.
Sending the entire 200-turn chat history for every new message. Result: Exponential cost growth. Fix: Implement Context Eviction or Summarization (Task 5.1).
Redundantly sending the same 2k-token system prompt every time. Fix: Set a cache breakpoint at the end of the system block.
Scenario: A startup has 1,000 free users. The token cost is outpacing their funding. Most users use the tool for simple text summarization of documents.
Question: What change will provide the HIGHEST immediate cost reduction without damaging the core user experience?
Correct Answer: B. Model routing (Opus -> Haiku) is a 100x cost reduction. Prompt caching ensures the fixed costs (System Prompt) are only paid once per session.
cache_control headers to maximize ROI in multi-turn chat.