🎯 Domain 5 · Task Statement 5.4

Architect Observability for Generative AI Systems

📊 Domain Weight: 15% · 📈 Focus: Telemetry & Monitoring

"What you can't measure, you can't optimize." In the world of non-deterministic LLMs, the traditional "System Health" check (200 OK) is insufficient. An Architect must build an observability pipeline that tracks The 3 Pillars of AI Telemetry: Latency (TTFT), Unit Economics (Token Cost), and Perceived Quality (Grounding).

🏭 Real-World Analogy: Formula 1 Racing Telemetry

A Formula 1 driver doesn't just listen to the engine and glance at a speedometer. The car is covered in 300+ sensors tracking fuel consumption (Cost), tire temperature, and micro-vibrations in the chassis (Quality) at 1,000 samples per second. In the pits, 50 engineers analyze this data to decide on the next pit stop (Prompt adjustment or Model switch).

🩹 The AI Command Dash

1. Fuel Gauge (Cost): Tracking exactly how many tokens (money) were burned per lap. This is **Token Velocity**.

2. Speedometer (Latency): Measuring how long it took the car to accelerate out of a curve. This is **TTFT (Time to First Token)**.

3. Vibration Sensor (Quality/Reliability): Detecting when the engine is sputtering even if the car is still moving. This is **Token Logits** and **Probability Drift** where the provider exposes them, or proxy signals such as tool pass rate where it does not.

Designing Observability is about building this dashboard so you can find out *why* a prompt is slow or *how* a specific model change impacted your monthly bill. Without telemetry, you are "Flying Blind" in a non-deterministic sky.

📄 The 4 Golden Pillars of AI Observability

Traditional monitoring focuses on CPU and RAM. AI monitoring focuses on **Tokens** and **Semantics**. Alongside latency, cost, and quality, Architects should also track usage for capacity planning; the table below lists the key metric for each pillar.

Pillar	Critical Metric	Business Impact	Architect's Target
Latency	TTFT (Time to First Token)	User retention. TTFT above 2s is perceived as lag.	< 800ms for Claude 3.5 Sonnet.
Cost	Token Input Ratio	Economics. Bloated bills from excessive tool logs.	> 80% Cache Hit Rate for heavy apps.
Quality	Tool Pass Rate	Functionality. Detecting when Claude misses a tool schema.	100% Success on "Golden Paths."
Usage	TPM (Tokens Per Minute)	Capacity planning. Predicting when to increase limits.	Zero 429 errors from capacity walls.
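As a rough sketch of how the Cost row's target could be checked, the helper below derives a cache hit rate from per-request usage counters. It assumes the counters follow the Anthropic usage fields `input_tokens`, `cache_read_input_tokens`, and `cache_creation_input_tokens`; the `Usage` class, the function name, and the definition of hit rate (cached reads over all input tokens) are illustrative choices, not an official formula.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Per-request token counters (field names mirror the API usage object)."""
    input_tokens: int                 # uncached input tokens
    cache_read_input_tokens: int      # input tokens served from the prompt cache
    cache_creation_input_tokens: int  # input tokens written to the cache


def cache_hit_rate(requests: list[Usage]) -> float:
    """Share of all input tokens that were read from cache, in [0, 1]."""
    cached = sum(u.cache_read_input_tokens for u in requests)
    total = sum(
        u.input_tokens + u.cache_read_input_tokens + u.cache_creation_input_tokens
        for u in requests
    )
    return cached / total if total else 0.0


# Made-up sample: the first call writes the cache, the next two read from it.
sample = [
    Usage(input_tokens=200, cache_read_input_tokens=0, cache_creation_input_tokens=9_800),
    Usage(input_tokens=200, cache_read_input_tokens=9_800, cache_creation_input_tokens=0),
    Usage(input_tokens=200, cache_read_input_tokens=9_800, cache_creation_input_tokens=0),
]
print(f"cache hit rate: {cache_hit_rate(sample):.1%}")
```

Over a longer window the first cache-writing call is amortized, so the rate climbs toward the 80% target only when most requests reuse the cached prefix.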

TTFT vs. TBT (Time Between Tokens)

In streaming architectures, Total Response Time (e.g., 10 seconds for a full page) matters far less to perceived responsiveness than TTFT. If the user sees the first character in 400ms, the system feels "instant," even if the full generation takes much longer. TBT, the gap between subsequent tokens, governs how smooth the stream feels once it has started. **Architecture Goal:** For interactive use, prioritize TTFT over total throughput by using streaming endpoints.
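To make the distinction concrete, here is a minimal sketch of measuring TTFT separately from total response time. It works on any iterable of text chunks; `fake_stream` is a stand-in for a real streaming endpoint, and both function names are hypothetical.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream of text chunks and record TTFT and total time in seconds."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(parts)}


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming endpoint: fast first token, slower remainder."""
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.2)
    yield ", world"


stats = measure_stream(fake_stream())
print(f"TTFT={stats['ttft_s'] * 1000:.0f}ms total={stats['total_s'] * 1000:.0f}ms")
```

Recording both numbers per request lets a dashboard show that TTFT stays low even when total time grows with answer length.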

🚀 Phase 1: Distributed Tracing & Correlation

In a modern multi-agent system, a single user prompt might trigger a chain of events: User -> Router Agent -> Research Agent -> Tool Execution (Search) -> Tool Execution (DB) -> Final Summary.

📝 The Trace ID Header Strategy

X-Trace-ID: A unique UUID generated at the User Entry point. It must be passed to EVERY sub-agent.
X-Span-ID: Identifies an individual unit of work (e.g., one Tool call).
Metadata Mapping: Link the request-id header from every Claude API response to your trace. This allows you to open an "Anthropic Ticket" that references a specific trace if an error occurs.
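The header strategy above can be sketched with the standard library alone. In this illustration, `X-Parent-Span-ID` is a hypothetical extra header used to express nesting, the helper names are invented, and `req_example` is a made-up placeholder for the value of the `request-id` response header.

```python
import uuid


def new_trace_headers() -> dict:
    """Create headers at the user entry point: one trace ID, one root span ID."""
    return {"X-Trace-ID": str(uuid.uuid4()), "X-Span-ID": uuid.uuid4().hex[:16]}


def child_headers(parent: dict) -> dict:
    """Headers for a sub-agent or tool call: same trace ID, fresh span ID."""
    return {
        "X-Trace-ID": parent["X-Trace-ID"],
        "X-Span-ID": uuid.uuid4().hex[:16],
        "X-Parent-Span-ID": parent["X-Span-ID"],  # hypothetical header for span nesting
    }


def link_request_id(trace: dict, response_headers: dict) -> dict:
    """Attach the API's request-id response header to a trace record."""
    return {**trace, "anthropic_request_id": response_headers.get("request-id")}


root = new_trace_headers()           # User entry point
router = child_headers(root)         # Router Agent
search_tool = child_headers(router)  # Tool Execution (Search)
record = link_request_id(search_tool, {"request-id": "req_example"})
print(record)
```

Every hop shares the same `X-Trace-ID`, so a log query on that one value returns the whole chain.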

OpenTelemetry Instrumentation (Pseudocode)

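from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("claude_call") as span:
    res = client.messages.create(**request)  # client: anthropic.Anthropic(); request: your call kwargs
    span.set_attribute("gen_ai.usage.input_tokens", res.usage.input_tokens)    # OTel GenAI convention name
    span.set_attribute("gen_ai.usage.output_tokens", res.usage.output_tokens)
    span.set_attribute("app.user_id", "usr_99")
    # Spans opened inside this block share its trace ID, so the chain is correlated in Grafana/Jaeger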

🔒 Phase 2: The PII Scrubber Pipeline

Logging the full prompt is critical for debugging hallucinations, but storing sensitive data (SSNs, emails, API keys) in a central logging stack (ELK/Grafana) can be a major compliance violation, for example under GDPR or HIPAA.

The Scrubbing Proxy Architecture

Architects must implement a "Middleman" between the Claude API and the Log Drain:

Entity Extraction: Use a fast model or Regex to find Names, Locations, and Finance data.
Tokenization: Replace "John Smith" with "[USER_1]".
Asymmetric Keys: Encrypt the raw mappings ([USER_1] -> John Smith) in a "Vault" accessible only to authorized auditors.
Result: Engineers can debug the *Reasoning* of the prompt without ever seeing the *Private Data*.
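The tokenization step can be sketched with regular expressions. The patterns below are deliberately narrow illustrations (emails and US SSNs only); names such as "John Smith" would need an entity-extraction model, and the returned mapping would be encrypted before storage, which this sketch leaves out. The function and variable names are hypothetical.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scrub(text: str) -> tuple[str, dict]:
    """Replace matches with stable placeholders; return scrubbed text and the mapping."""
    vault: dict[str, str] = {}  # placeholder -> raw value (encrypt before storing)
    seen: dict[str, str] = {}   # raw value -> placeholder, so repeats reuse one token

    def substitute(kind: str):
        def repl(match: re.Match) -> str:
            raw = match.group(0)
            if raw not in seen:
                count = sum(1 for p in vault if p.startswith(f"[{kind}_")) + 1
                seen[raw] = f"[{kind}_{count}]"
                vault[seen[raw]] = raw
            return seen[raw]
        return repl

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(substitute(kind), text)
    return text, vault


clean, vault = scrub("Mail jo@example.com, SSN 123-45-6789, cc jo@example.com")
print(clean)  # Mail [EMAIL_1], SSN [SSN_1], cc [EMAIL_1]
```

Reusing one placeholder per distinct value keeps the scrubbed log readable: an engineer can still see that the same address appears twice without learning what it is.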

📈 Phase 3: Token Efficiency Audit

Are you spending tokens wisely? Monitor the Input:Output Ratio.

The "Chatty" Agent: High output tokens relative to input. Potential for rambling or "hallucination loops."
The "Context Heavy" Agent: High input tokens (e.g. 50k) for short output (e.g. 50 words). **Optimization Opportunity:** Implement Prompt Caching or better summarization (Task 5.1).
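A small sketch of how the two profiles might be detected from usage counters. The thresholds (`chatty_below`, `heavy_above`) and the label strings are illustrative assumptions, not recommended values; tune them against your own traffic.

```python
def classify_agent(input_tokens: int, output_tokens: int,
                   chatty_below: float = 0.5, heavy_above: float = 100.0) -> str:
    """Label a call by its input:output token ratio (thresholds are illustrative)."""
    if output_tokens == 0:
        return "context_heavy" if input_tokens else "empty"
    ratio = input_tokens / output_tokens
    if ratio < chatty_below:
        return "chatty"
    if ratio > heavy_above:
        return "context_heavy"
    return "balanced"


print(classify_agent(50_000, 70))  # large context, short answer
print(classify_agent(300, 4_000))  # short prompt, long answer
```

Aggregating these labels per agent over a day shows which agents are candidates for prompt caching and which may be rambling.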

💡 Alert Strategy: The "Token Spike"

Set up an alert if a single session's consumption exceeds 2,000,000 tokens in 1 hour. This usually indicates a Recursive Tool Loop where Claude and a tool are shouting at each other infinitely.
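One way to implement this alert is a per-session sliding window, sketched below. The class name and in-memory storage are assumptions for illustration; a production system would more likely express the same rule in its metrics backend.

```python
from collections import defaultdict, deque


class TokenSpikeDetector:
    """Tracks per-session token use over a sliding window and flags spikes."""

    def __init__(self, limit: int = 2_000_000, window_s: float = 3600.0):
        self.limit = limit
        self.window_s = window_s
        self._events = defaultdict(deque)  # session_id -> deque of (timestamp, tokens)
        self._totals = defaultdict(int)    # session_id -> tokens inside the window

    def record(self, session_id: str, tokens: int, now: float) -> bool:
        """Add one call's token count; return True if the session exceeds the limit."""
        events = self._events[session_id]
        events.append((now, tokens))
        self._totals[session_id] += tokens
        # Drop events that have aged out of the window.
        while events and events[0][0] <= now - self.window_s:
            _, old = events.popleft()
            self._totals[session_id] -= old
        return self._totals[session_id] > self.limit


detector = TokenSpikeDetector()
# Four calls of 600k tokens, one minute apart: the fourth crosses 2,000,000.
alerts = [detector.record("sess_1", 600_000, now=t * 60.0) for t in range(4)]
print(alerts)  # [False, False, False, True]
```

Passing `now` explicitly keeps the detector testable; in a live service it would come from `time.time()`.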

⛔ Anti-Patterns: Blind Architectures

"The Log Firehose"

Logging every single prompt turn for 1,000,000 users. Result: Your CloudWatch bill can grow larger than your Anthropic bill. Fix: Use Statistical Sampling (log only 1-5% of healthy turns; 100% of errors).
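The sampling fix can be sketched in a few lines. The function name and the 2% default are assumptions chosen from the 1-5% range above.

```python
import random
from typing import Optional


def should_log(is_error: bool, sample_rate: float = 0.02,
               rng: Optional[random.Random] = None) -> bool:
    """Always keep error turns; keep healthy turns with probability sample_rate."""
    if is_error:
        return True
    return (rng or random).random() < sample_rate


rng = random.Random(7)  # seeded only so the demo is repeatable
kept = sum(should_log(False, 0.02, rng) for _ in range(100_000))
print(f"healthy turns logged: {kept / 100_000:.2%}")
```

A variant worth considering is to decide once per trace rather than once per turn, so that a sampled conversation is logged in full instead of as scattered fragments.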

"Manual Error Hunting"

Waiting for users to complain about bad answers. Result: Most failures go unnoticed, because few users report a bad answer. Fix: Implement LLM-as-a-Judge triggers to flag toxic or low-quality responses automatically.
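A judge trigger can be wired up independently of which model does the judging. In the sketch below the judge is any callable that scores a (prompt, response) pair; `stub_judge` is a hypothetical placeholder standing in for a real model call, and the 0.6 threshold is an arbitrary assumption.

```python
from typing import Callable

# A judge takes (prompt, response) and returns a quality score in [0, 1].
Judge = Callable[[str, str], float]


def flag_low_quality(turns: list[tuple[str, str]], judge: Judge,
                     threshold: float = 0.6) -> list[int]:
    """Return indexes of turns whose judge score falls below the threshold."""
    return [i for i, (prompt, response) in enumerate(turns)
            if judge(prompt, response) < threshold]


def stub_judge(prompt: str, response: str) -> float:
    """Placeholder for a model-backed judge: penalizes empty or refusal-like answers."""
    if not response.strip():
        return 0.0
    return 0.2 if "i don't know" in response.lower() else 0.9


turns = [
    ("What is our refund window?", "Refunds are accepted within 30 days."),
    ("What is our refund window?", "I don't know."),
    ("Summarize the ticket.", ""),
]
print(flag_low_quality(turns, stub_judge))  # [1, 2]
```

Flagged indexes can feed the same alerting path as errors, so low-quality turns are logged at 100% even when healthy turns are sampled.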

"Ignoring Response Headers"

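Discarding the anthropic-ratelimit-*-remaining response headers (for example, anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining). Fix: Use these headers to adjust your Throttle Rates in real time, before you hit a Hard 429 Wall.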
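As an illustration of header-driven throttling, the function below turns rate-limit headers into a suggested pause before the next request. It reads `anthropic-ratelimit-requests-remaining`, `anthropic-ratelimit-requests-limit`, and `retry-after` (assumed here to be a number of seconds); the function name, the 20% low-water mark, and the linear back-off are assumptions, not a prescribed policy.

```python
def throttle_delay(headers: dict, base_delay_s: float = 0.0,
                   low_water: float = 0.2, max_delay_s: float = 5.0) -> float:
    """Suggest a pause before the next request from rate-limit response headers."""
    if "retry-after" in headers:  # server asked us to wait
        return float(headers["retry-after"])
    remaining = float(headers.get("anthropic-ratelimit-requests-remaining", "inf"))
    limit = float(headers.get("anthropic-ratelimit-requests-limit", "inf"))
    if limit in (0.0, float("inf")) or remaining == float("inf"):
        return base_delay_s  # headers missing: no adjustment
    headroom = remaining / limit
    if headroom >= low_water:
        return base_delay_s
    # Scale the delay up linearly as headroom shrinks toward zero.
    return base_delay_s + (1 - headroom / low_water) * (max_delay_s - base_delay_s)


healthy = {"anthropic-ratelimit-requests-remaining": "900",
           "anthropic-ratelimit-requests-limit": "1000"}
tight = {"anthropic-ratelimit-requests-remaining": "100",
         "anthropic-ratelimit-requests-limit": "1000"}
print(throttle_delay(healthy), throttle_delay(tight))
```

The same shape applies to the token-based headers; a client could compute both delays and sleep for the larger one.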

✅ Exam Readiness & Key Takeaways

🎓 Exam Scenario — The Latency Audit

Scenario: You are the Architect for a high-volume chat app. Management reports a decrease in user engagement even though the system always returns "200 OK". You suspect AI response time is the cause, yet total generation time is a consistent 8 seconds for every user.

Question: Which specific metric is most critical to investigate for user satisfaction?

A) Total Response Latency.
B) **TTFT (Time to First Token)** in a streaming architecture.
C) The Outbound Token Count.

Correct Answer: B. TTFT defines the "Perceived Latency." Even if the full answer takes 8s, the user feels the system is responsive when the first token appears in 400ms. Optimization should target TTFT.

Measure Perceived Speed. Focus on TTFT as your North Star metric for interactive AI apps.

Traceability is Debuggability. Without Trace IDs, you cannot reconstruct the reasoning chain of a hallucinating agent.

Privacy by Design. Implement PII scrubbing *before* the log drain to maintain enterprise security compliance.
