In traditional software, tests are binary (Pass/Fail). In the non-deterministic world of Claude, an Architect must build **Probabilistic Evaluators**. This task explores the creation of Golden Datasets and LLM-as-a-Judge pipelines to ensure that a 1-line prompt change today doesn't cause a 10% quality regression tomorrow.
A master coffee roaster doesn't just "guess" if the beans are good by smelling one bean. They use a **Master Sample** (The Golden Dataset) and a **Digital Colorimeter** (The Evaluator) to compare every new batch against a scientific standard. If the color is off by 2%, the entire batch is rejected before it reaches a single customer.
1. The Master Batch: A perfect roast that everyone agrees is 10/10. (The Golden Dataset).
2. The Testing Suite: Samples from different parts of the roaster are compared to the master batch to ensure consistency (CI/CD Performance).
3. The Chief Taster: A secondary expert who determines if the "profile" matches the brand. (The LLM-as-a-Judge).
Designing Benchmarks is about building this "Roastery Lab" for your AI, so you never ship a "burnt" prompt to production. In the non-deterministic world of Claude, an Architect's goal is to turn "Vibe Checks" into "Data-Driven Decisions."
Testing "Accuracy" is too vague. Architects must break down evaluation into three distinct, measurable metrics (often called the RAGAS framework for RAG systems).
| Pillar | What it Measures | Fail Condition |
|---|---|---|
| Faithfulness | Is the answer derived *only* from the provided context? (Prevents Hallucinations). | The answer mentions a price not in the document. |
| Answer Relevancy | Does the response actually address the user's specific question? | The user asked for "Price" but got a "Feature List." |
| Correctness | Does the answer match the "Ground Truth" provided in the test set? | The ground truth says "True" but Claude says "False." |
Since human review doesn't scale, we use a "Judge Model" (usually Claude 3 Opus) to grade the performance of our "Application Model" (usually Claude 3.5 Sonnet).
Never ask a judge for a simple "Pass/Fail." Ask it to **Reason first**, then score. This prevents the judge from being biased towards long or "polite" answers.
"You are a strict technical auditor.
Analyze the [AGENT_RESPONSE] against the [GOLDEN_TRUTH].
1. Identify every claim made in the response.
2. For each claim, check if it exists in the Golden Truth.
3. Assign a score of 0 (Fail) to 5 (Perfect).
4. OUTPUT FORMAT: JSON only with fields {reasoning, score}."
Traditional tests use string matching. AI tests use **Vector Embeddings**. If the Ground Truth is "The sky is azure" and Claude says "The sky is blue", a string test fails, but a Cosine Similarity test passes at 98%.
Every time an engineer changes a prompt in GitHub, the CI/CD pipeline triggers an **Evaluation Burst**.
feat: update system instructionTesting a prompt by chatting with it 3 times personally. Result: You miss edge cases that only appear at scale. Fix: Run min 100 batch Evals before every merge.
Using Sonnet to judge Sonnet's own output. Result: The model will be biased toward its own style. Fix: Use a **Stronger Model** (Opus) as the judge.
Never updating the test cases. Result: The model "overfits" to the tests while failing in the real world. Fix: Add 2-3 real user failures to the Golden Set every week.
Scenario: You update your system prompt to include a new "Tone" requirement (be more supportive). After deployment, the agent becomes very supportive but its coding accuracy drops by 10% on complex logic tasks.
Question: What was the architectural failure in the delivery pipeline?
Correct Answer: B. Logic regressions are common when adding "Tone" instructions. Only automated evels can detect these trade-offs before they reach production.