🎯 Domain 5 · Task Statement 5.6

Unit Economics & Performance Mastery

📊 Domain Weight: 15% 💵 Focus: Profitability & Efficiency

"The best prompt is the one that isn't sent." In a production AI ecosystem, success is defined by **Unit Economics**: the ability to deliver mission-critical value at a sustainable token-to-revenue ratio. An Architect must master Asymmetric Model Routing, Context Eviction, and Throughput Pipeline design to transform a "Cool Prototype" into a "Profitable Business Line."

🏭 Real-World Analogy: The Logistics Fleet Economy

If you need to deliver an envelope across town, you don't send a 40-ton semi-truck (Claude 3 Opus). You use a bicycle courier (Claude 3 Haiku). If you need to deliver 50 heavy pianos, you use the truck but organize them in a single route on a weekend (Message Batching). A logistics manager's job is to ensure the **Revenue per Delivery** is always higher than the **Fuel per Mile**.

🩹 The Efficient Hub

1. Vehicle Selection (Model Routing): Matching the intelligence (cost) of the model to the difficulty of the task. Don't use a "Brain Surgeon" to flip burgers.

2. Cargo Pruning (Token Management): Taking only the 5 critical items instead of 100 non-essential ones to save space and weight (Tokens).

3. Route Planning (Pipelines): Using the highway (Message Batch API) for non-urgent deliveries to get a 50% discount on fuel.

Mastering Unit Economics is about ensuring the system remains "Profitable" at scale. In production, a 10% reduction in tokens can mean $100k in annual savings for an enterprise app.

📄 The Cost Pyramid: Model Selection

Architects must treat models as "Price Tiers." The gap between Haiku and Opus is 100x. If you can move 80% of your tasks to a lower tier, you save 80% of your budget instantly.

Tier Intelligence Cost (Relative) Best Use Case
Claude 3 Opus Supreme reasoning, deep synthesis. $$$$$ (100x) Legal analysis, medical summaries, PhD-level coding.
Claude 3.5 Sonnet High intelligence, gold-standard speed. $$$ (10x) General purpose agent, refactoring, agent routing.
Claude 3 Haiku Fast, concise, structurally accurate. $ (1x) Classification, extraction, simple unit tests, routing filters.

🚀 Phase 1: Prompt Caching Economics

Prompt Caching is the "Game Changer" for Domain 5. It turns **O(N^2)** token costs (paying for history again and again) into **O(N)** costs.

💡 The "Caching Breakpoint" Strategy
  • Cache Hit: Pay ~10% of the normal input cost (depending on model/tier).
  • Cache Miss (Write): Pay a ~25% premium on the initial write.
  • Architect's Rule: Use a breakpoint for any content shared by > 5 concurrent users or used in > 3 sequential turns.
Example ROI Calculation (100k Context, 10 Turns)
// NO CACHING
Turn 1-10: 100k + 105k + 110k... = 1,225,000 Tokens Paid.

// WITH CACHING
Turn 1: 100k (Write fee)
Turn 2-10: 100k (90% discount) + usage
RESULT: ~250,000 Tokens Paid. (~80% SAVINGS)

💵 Phase 2: The "Overnight" Discount (Batch API)

For workloads like analytics, historical audits, or document indexers, real-time responses are "Wasteful." The **Message Batch API** provides a **50% flat discount** on all token usage with a 24-hour SLA.

Batch Architecture Decision Matrix

  1. Is the user waiting at a screen? -> Real-time API.
  2. Is it a background process? -> **Batch API**.
  3. Is it high volume (1,000+ docs)? -> **Batch API**.

🔒 Phase 3: Architectural Budget Enforcers

To prevent a "Rogue Agent" from spending $5,000 in an hour due to a tool loop, Architects must implement **State-Based Hard Limits**.

🎓 Enforcer Logic Checklist
  • Session Quota: Max 500k tokens per user session.
  • Turn Quota: Max 15 turns per task before requiring a human "checkpoint."
  • Rate Throttling: Leaky-bucket algorithm at the API Gateway to prevent burst costs.
  • Recursive Loop Detection: Alert if the exact same tool call is made 3 times with the same input.

Anti-Patterns: The Money Pit

"The Golden Hammer"

Using Opus for simple JSON extraction because "it's the best." Result: You pay 100x the required price. Fix: Route simple extraction to Haiku.

"Ignoring History Bloat"

Sending the entire 200-turn chat history for every new message. Result: Exponential cost growth. Fix: Implement Context Eviction or Summarization (Task 5.1).

"Zero Caching"

Redundantly sending the same 2k-token system prompt every time. Fix: Set a cache breakpoint at the end of the system block.

Exam Readiness & Key Takeaways

🎓 Exam Scenario — The Startup Scaleup

Scenario: A startup has 1,000 free users. The token cost is outpacing their funding. Most users use the tool for simple text summarization of documents.

Question: What change will provide the HIGHEST immediate cost reduction without damaging the core user experience?

  • A) Implement a 5-message limit per day.
  • B) Route 90% of traffic (summarization) to Claude 3 Haiku and use Prompt Caching for the shared system instructions.
  • C) Move all summaries to the Batch API.

Correct Answer: B. Model routing (Opus -> Haiku) is a 100x cost reduction. Prompt caching ensures the fixed costs (System Prompt) are only paid once per session.

1
Right Model for the Right Task. Don't pay for intelligence your task doesn't require.
2
Breakpoints are Profits. Master the placement of cache_control headers to maximize ROI in multi-turn chat.
3
Enforce and Alert. Never allow "Open Ended" token consumption; every architect needs a kill-switch.