🎯 Domain 5 · Task Statement 5.5

Continuous Benchmarking & Evaluators for CI/CD

⏳ 📊 Domain Weight: 15% 🚀 Focus: Quality Assurance & Automation

In traditional software, tests are binary (Pass/Fail). In the non-deterministic world of Claude, an Architect must build **Probabilistic Evaluators**. This task explores the creation of Golden Datasets and LLM-as-a-Judge pipelines to ensure that a 1-line prompt change today doesn't cause a 10% quality regression tomorrow.

🏭 Real-World Analogy: The Automated Coffee Roastery

A master coffee roaster doesn't just "guess" if the beans are good by smelling one bean. They use a **Master Sample** (The Golden Dataset) and a **Digital Colorimeter** (The Evaluator) to compare every new batch against a scientific standard. If the color is off by 2%, the entire batch is rejected before it reaches a single customer.

🩹 The Quality Control Lab

1. The Master Batch: A perfect roast that everyone agrees is 10/10. (The Golden Dataset).

2. The Testing Suite: Samples from different parts of the roaster are compared to the master batch to ensure consistency (CI/CD Performance).

3. The Chief Taster: A secondary expert who determines if the "profile" matches the brand. (The LLM-as-a-Judge).

Designing Benchmarks is about building this "Roastery Lab" for your AI, so you never ship a "burnt" prompt to production. In the non-deterministic world of Claude, an Architect's goal is to turn "Vibe Checks" into "Data-Driven Decisions."

📄 The 3 Pillars of Evaluative Integrity

Testing "Accuracy" is too vague. Architects must break down evaluation into three distinct, measurable metrics (often called the RAGAS framework for RAG systems).

Pillar	What it Measures	Fail Condition
Faithfulness	Is the answer derived only from the provided context? (Prevents Hallucinations).	The answer mentions a price not in the document.
Answer Relevancy	Does the response actually address the user's specific question?	The user asked for "Price" but got a "Feature List."
Correctness	Does the answer match the "Ground Truth" provided in the test set?	The ground truth says "True" but Claude says "False."

🕐 Phase 1: LLM-as-a-Judge Pipeline

Since human review doesn't scale, we use a "Judge Model" (usually Claude 3 Opus) to grade the performance of our "Application Model" (usually Claude 3.5 Sonnet).

💡 The Judge's Secret: Chain of Thought Grading

Never ask a judge for a simple "Pass/Fail." Ask it to **Reason first**, then score. This prevents the judge from being biased towards long or "polite" answers.

System Prompt: The Technical Evaluator

"You are a strict technical auditor. 
Analyze the [AGENT_RESPONSE] against the [GOLDEN_TRUTH].
1. Identify every claim made in the response.
2. For each claim, check if it exists in the Golden Truth.
3. Assign a score of 0 (Fail) to 5 (Perfect).
4. OUTPUT FORMAT: JSON only with fields {reasoning, score}."

🚀 Phase 2: Similarity Metrics

Traditional tests use string matching. AI tests use **Vector Embeddings**. If the Ground Truth is "The sky is azure" and Claude says "The sky is blue", a string test fails, but a Cosine Similarity test passes at 98%.

🎓 Exam Focus: When to use which?

Exact Match: For Tool Calls, JSON keys, and API status codes.
Semantic Similarity: For creative writing, summaries, and chat tone.
Levenshtein Distance: For code refactoring and translation tasks.

⚡ Phase 3: The "Zero-Regression" Deployment

Every time an engineer changes a prompt in GitHub, the CI/CD pipeline triggers an **Evaluation Burst**.

Commit Tag: feat: update system instruction
Eval Burst: CI/CD spins up 50 parallel containers. Each runs one test case from the Golden Set.
Judge Collection: opus-judge grades the 50 results.
The Threshold Check:
- If Average Score < 90% -> **BUILD FAIL** (Deployment Blocked).
- If Latency > 2s -> **BUILD FAIL**.
- Else -> **DEPLOY TO PROD**.

⛔ Anti-Patterns: The Vibe-Check Trap

"The 3-Turn Vibe Check"

Testing a prompt by chatting with it 3 times personally. Result: You miss edge cases that only appear at scale. Fix: Run min 100 batch Evals before every merge.

"Self-Evaluation"

Using Sonnet to judge Sonnet's own output. Result: The model will be biased toward its own style. Fix: Use a **Stronger Model** (Opus) as the judge.

"Static Golden Sets"

Never updating the test cases. Result: The model "overfits" to the tests while failing in the real world. Fix: Add 2-3 real user failures to the Golden Set every week.

✅ Exam Readiness & Key Takeaways

🎓 Exam Scenario — The Refinement Regression

Scenario: You update your system prompt to include a new "Tone" requirement (be more supportive). After deployment, the agent becomes very supportive but its coding accuracy drops by 10% on complex logic tasks.

Question: What was the architectural failure in the delivery pipeline?

A) The developer didn't review the logs.
B) Lack of an **Automated Regression Eval** suite comparing the new prompt against a coding logic "Golden Set."
C) The model's context window was too small.

Correct Answer: B. Logic regressions are common when adding "Tone" instructions. Only automated evels can detect these trade-offs before they reach production.

No prompt without an Eval. If you can't measure the impact of a change, don't ship it.

Probabilistic Pass/Fail. Your build system should fail on "semantic failure," not just code errors.

Opus is the Auditor. Always use your most capable model for evaluations to ensure subtle errors are caught.

Previous Task ← Task 5.4: AI Observability

Next Task Task 5.6: Unit Economics →