🎯 Domain 5 · Task Statement 5.6

Unit Economics & Performance Mastery

⏳ 📊 Domain Weight: 15% 💵 Focus: Profitability & Efficiency

"The best prompt is the one that isn't sent." In a production AI ecosystem, success is defined by **Unit Economics**: the ability to deliver mission-critical value at a sustainable token-to-revenue ratio. An Architect must master Asymmetric Model Routing, Context Eviction, and Throughput Pipeline design to transform a "Cool Prototype" into a "Profitable Business Line."

🏭 Real-World Analogy: The Logistics Fleet Economy

If you need to deliver an envelope across town, you don't send a 40-ton semi-truck (Claude 3 Opus). You use a bicycle courier (Claude 3 Haiku). If you need to deliver 50 heavy pianos, you use the truck but organize them in a single route on a weekend (Message Batching). A logistics manager's job is to ensure the **Revenue per Delivery** is always higher than the **Fuel per Mile**.

🩹 The Efficient Hub

1. Vehicle Selection (Model Routing): Matching the intelligence (cost) of the model to the difficulty of the task. Don't use a "Brain Surgeon" to flip burgers.

2. Cargo Pruning (Token Management): Taking only the 5 critical items instead of 100 non-essential ones to save space and weight (Tokens).

3. Route Planning (Pipelines): Using the highway (Message Batch API) for non-urgent deliveries to get a 50% discount on fuel.

Mastering Unit Economics is about ensuring the system remains "Profitable" at scale. In production, a 10% reduction in tokens can mean $100k in annual savings for an enterprise app.

📄 The Cost Pyramid: Model Selection

Architects must treat models as "Price Tiers." The gap between Haiku and Opus is 100x. If you can move 80% of your tasks to a lower tier, you save 80% of your budget instantly.

Tier	Intelligence	Cost (Relative)	Best Use Case
Claude 3 Opus	Supreme reasoning, deep synthesis.	$$$$$ (100x)	Legal analysis, medical summaries, PhD-level coding.
Claude 3.5 Sonnet	High intelligence, gold-standard speed.	$$$ (10x)	General purpose agent, refactoring, agent routing.
Claude 3 Haiku	Fast, concise, structurally accurate.	$ (1x)	Classification, extraction, simple unit tests, routing filters.

🚀 Phase 1: Prompt Caching Economics

Prompt Caching is the "Game Changer" for Domain 5. It turns **O(N^2)** token costs (paying for history again and again) into **O(N)** costs.

💡 The "Caching Breakpoint" Strategy

Cache Hit: Pay ~10% of the normal input cost (depending on model/tier).
Cache Miss (Write): Pay a ~25% premium on the initial write.
Architect's Rule: Use a breakpoint for any content shared by > 5 concurrent users or used in > 3 sequential turns.

Example ROI Calculation (100k Context, 10 Turns)

// NO CACHING
Turn 1-10: 100k + 105k + 110k... = 1,225,000 Tokens Paid.

// WITH CACHING
Turn 1: 100k (Write fee)
Turn 2-10: 100k (90% discount) + usage
RESULT: ~250,000 Tokens Paid. (~80% SAVINGS)

💵 Phase 2: The "Overnight" Discount (Batch API)

For workloads like analytics, historical audits, or document indexers, real-time responses are "Wasteful." The **Message Batch API** provides a **50% flat discount** on all token usage with a 24-hour SLA.

Batch Architecture Decision Matrix

Is the user waiting at a screen? -> Real-time API.
Is it a background process? -> **Batch API**.
Is it high volume (1,000+ docs)? -> **Batch API**.

🔒 Phase 3: Architectural Budget Enforcers

To prevent a "Rogue Agent" from spending $5,000 in an hour due to a tool loop, Architects must implement **State-Based Hard Limits**.

🎓 Enforcer Logic Checklist

Session Quota: Max 500k tokens per user session.
Turn Quota: Max 15 turns per task before requiring a human "checkpoint."
Rate Throttling: Leaky-bucket algorithm at the API Gateway to prevent burst costs.
Recursive Loop Detection: Alert if the exact same tool call is made 3 times with the same input.

⛔ Anti-Patterns: The Money Pit

"The Golden Hammer"

Using Opus for simple JSON extraction because "it's the best." Result: You pay 100x the required price. Fix: Route simple extraction to Haiku.

"Ignoring History Bloat"

Sending the entire 200-turn chat history for every new message. Result: Exponential cost growth. Fix: Implement Context Eviction or Summarization (Task 5.1).

"Zero Caching"

Redundantly sending the same 2k-token system prompt every time. Fix: Set a cache breakpoint at the end of the system block.

✅ Exam Readiness & Key Takeaways

🎓 Exam Scenario — The Startup Scaleup

Scenario: A startup has 1,000 free users. The token cost is outpacing their funding. Most users use the tool for simple text summarization of documents.

Question: What change will provide the HIGHEST immediate cost reduction without damaging the core user experience?

A) Implement a 5-message limit per day.
B) Route 90% of traffic (summarization) to Claude 3 Haiku and use Prompt Caching for the shared system instructions.
C) Move all summaries to the Batch API.

Correct Answer: B. Model routing (Opus -> Haiku) is a 100x cost reduction. Prompt caching ensures the fixed costs (System Prompt) are only paid once per session.

Right Model for the Right Task. Don't pay for intelligence your task doesn't require.

Breakpoints are Profits. Master the placement of cache_control headers to maximize ROI in multi-turn chat.

Enforce and Alert. Never allow "Open Ended" token consumption; every architect needs a kill-switch.

Previous Task ← Task 5.5: Benchmarks

Domain Summary Summary & Cheat Sheet →