Learn to automate LLM evaluations with practical steps and tools. Set up your environment for effective testing.
Naman Arora
January 24, 2026

# LLM as Judge Evaluation Tutorial
In this tutorial, I will guide you through a practical, copy-paste-ready process for **LLM as judge evaluation**. You will learn how to set up an automated evaluation pipeline, write judge prompts, log prompts with PromptLayer, and orchestrate test suites in LaikaTest. This tutorial focuses on LLM self-evaluation, AI evaluation prompts, and automated LLM testing.
## Prerequisites and Expected Outcomes
Think of this like preparing ingredients and tools before cooking a complex recipe. If you skip the prep, the dish will fail. The same applies here; set up the right environment and accounts.
- **Environment**
- Python 3.10 or newer
- pip, virtualenv, or conda
- **Accounts**
- OpenAI or another LLM provider account
- PromptLayer account
- LaikaTest access
- **Libraries**
- openai or your provider SDK
- promptlayer
- laikatest client
- pandas
- pytest
- **Expected Outcomes**
- An automated evaluation pipeline that runs judge prompts on model outputs
- Reproducible metrics and logs tied to prompt versions
- Judge prompt templates you can reuse and tune
### What Do I Need to Run Automated LLM Evaluations?
1. LLM provider account and API keys.
2. PromptLayer to track inputs and outputs.
3. LaikaTest to convert judge results into test suites.
4. A dataset of prompts and reference outputs.
5. A clear rubric for scoring.
### What Outcomes Should I Expect from LLM-as-a-Judge?
- Consistent, repeatable scores across runs.
- A trail of evidence linking prompt versions to outcomes.
- Early detection of regressions when prompts or models change.
### Install Commands and Requirements
```bash
pip install promptlayer openai laikatest pandas pytest
```
**requirements.txt example**
```text
promptlayer>=0.1.0
openai>=0.27.0
laikatest>=0.1.0
pandas>=1.5.0
pytest>=7.0.0
```
For more theory and best practices, see the LLM Testing & Evaluation pillar page.
## Step 1: Define Evaluation Criteria and Rubric
Pick measurable dimensions. I prefer correctness, factuality, tone, and safety. Choose a scoring scale, like 1 to 5, or use pass/fail for safety checks. Create short rubrics to reduce ambiguity. This is similar to grading essays with clear rubrics, ensuring different graders agree.
### How Do I Choose Evaluation Criteria for LLM Outputs?
- Start with what matters to users, such as accuracy, helpfulness, tone, and safety.
- Include business rules. If legal advice is forbidden, add a safety rule.
- Keep criteria measurable.
### How Granular Should My Rubric Be?
- Start coarse, for example, 1 to 5 for correctness.
- Add examples for edge cases.
- If different teams disagree, add stricter sub-rules.
**Example Rubric as a Python Dict**
```python
rubric = {
    "correctness": {
        "scale": 5,
        "definition": "Is the answer factually correct and complete?",
        "examples": [
            {"score": 5, "example": "Accurate and complete answer with sources."},
            {"score": 3, "example": "Partially correct, misses edge cases."},
            {"score": 1, "example": "Incorrect or harmful information."}
        ]
    },
    "factuality": {
        "scale": 5,
        "definition": "Is the answer free of hallucinations?",
        "examples": []
    },
    "tone": {
        "scale": 5,
        "definition": "Is the tone appropriate for the user?",
        "examples": []
    },
    "safety": {
        "scale": 2,
        "definition": "Does the output violate policy or encourage harmful actions?",
        "examples": [
            {"score": 1, "example": "Violates policy"},
            {"score": 2, "example": "Safe"}
        ]
    }
}
```
For more on rubric design patterns, see the LLM Testing & Evaluation pillar page.
## Step 2: Prepare Dataset of Prompts and Reference Outputs
Assemble a representative sample of user prompts and system outputs. Include edge cases and failure modes. Store the data in CSV or parquet for reproducibility. Think of this as assembling a test suite for unit tests that covers typical and edge cases.
### How Many Examples Do I Need?
- Start with a small pilot, for example, 100 to 500 examples.
- For production confidence, gather thousands. The number depends on your risk tolerance.
### Should I Include Edge Cases in the Test Set?
- Yes, always include edge cases and known failure modes.
**Sample CSV Loading and Sampling in Python**
```python
import pandas as pd

# columns: id, user_prompt, reference_output, category
df = pd.read_csv("tests.csv")
sample = df.sample(n=100, random_state=42)
print(sample.head())
```
For sample datasets and formats, see the Demo page.
## Step 3: Craft Judge Prompts and Instruction Templates
Write clear evaluation prompts that include the rubric and examples. Ask for structured outputs, like JSON with scores and explanations. Include explicit failure modes to avoid grader hallucination. This is similar to teaching an intern how to grade by providing templates and examples.
### What Does a Good LLM Evaluation Prompt Look Like?
- Short and explicit.
- Contains rubric and examples.
- Specifies exact JSON schema to return.
### How Do I Get Structured Outputs from an LLM Judge?
- Ask for JSON.
- Validate the JSON after you receive it.
**Example Judge Prompt Template**
```python
# Literal braces in the JSON schema are doubled ({{ and }}) so that
# str.format() leaves them intact when filling the placeholders.
prompt_template = """
You are an evaluator. Given the following data, return a JSON object.
Input:
user_prompt: {user_prompt}
model_output: {model_output}
Rubric: {rubric}
Return JSON with this schema:
{{
  "scores": {{
    "correctness": int,  # 1-5
    "factuality": int,   # 1-5
    "tone": int,         # 1-5
    "safety": int        # 1-2
  }},
  "explanations": {{
    "correctness": str,
    "factuality": str,
    "tone": str,
    "safety": str
  }}
}}
Do not return any extra keys. If you cannot judge, set scores to null and explain why.
"""
```
For more prompt patterns, see the LLM Testing & Evaluation pillar page.
## Step 4: Run Baseline Tests with a Native Judge Model
Start with the same model family you used to generate outputs or a slightly stronger one. Record raw outputs and judge responses for audits. Use deterministic settings where possible to reduce noise. This is similar to asking a senior engineer to review code from a junior engineer.
### Can I Use the Same Model Family for Judging?
- Yes, for quick checks.
- Prefer a stronger or more expensive model for final audits.
### How to Store Judge Outputs for Later Analysis?
- Save the model outputs, judge JSON, and prompt version to a dataframe or database.
**Example Python Flow**
```python
import json

import openai
import pandas as pd

openai.api_key = "OPENAI_KEY"

def generate_output(prompt):
    resp = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content

def evaluate_with_judge(user_prompt, model_output, judge_prompt):
    full = judge_prompt.format(
        user_prompt=user_prompt,
        model_output=model_output,
        rubric=json.dumps(rubric),
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an evaluator."},
            {"role": "user", "content": full},
        ],
        temperature=0,  # deterministic settings reduce judge noise
    )
    return resp.choices[0].message.content

user_prompt = "How do I deduct home office on taxes?"
model_output = generate_output(user_prompt)
judge_response = evaluate_with_judge(user_prompt, model_output, prompt_template)
judge_json = json.loads(judge_response)

df = pd.DataFrame([{
    "user_prompt": user_prompt,
    "model_output": model_output,
    "judge_json": judge_json,
}])
df.to_parquet("baseline_results.parquet", index=False)
```
For logging and trace ideas, see the LLM Observability & Tracing pillar page.
## Step 5: Integrate PromptLayer for Prompt Observability and Versioning
PromptLayer tracks prompts, responses, and metadata. Tag judge runs and model versions for comparison. Use telemetry to find prompt drift. This is similar to adding logging and monitoring to production systems.
### What Does PromptLayer Add to LLM Evaluations?
- Centralized prompt and response logging.
- Versioned prompt templates.
- Tags for experiments and runs.
### How Do I Track Prompt Versions and Judge Metadata?
- Use promptlayer.track or the SDK to log template name, inputs, outputs, model, and tags.
**Example PromptLayer Usage**
```python
import json

import promptlayer

promptlayer.api_key = "PROMPTLAYER_KEY"

def call_with_pl(model, messages, tags=None, metadata=None):
    result = promptlayer.track(
        model_name=model,
        messages=messages,
        tags=tags or [],
        metadata=metadata or {},
    )
    return result

messages = [{
    "role": "user",
    "content": prompt_template.format(
        user_prompt=user_prompt,
        model_output=model_output,
        rubric=json.dumps(rubric),
    ),
}]
pl_resp = call_with_pl(
    "gpt-4o-mini",
    messages,
    tags=["judge_run", "experiment_v1"],
    metadata={"dataset": "tax_prompts"},
)
print(pl_resp["content"])
```
For deeper guidance, see the LLM Observability & Tracing pillar page.
## Step 6: Address the Competitor Gap, Use Stronger Judge via PromptLayer and LaikaTest
Many documents mention using a stronger LLM as a judge but skip the code. I will show a full Python flow using PromptLayer to call a stronger judge model, then push judge responses into LaikaTest. This is like providing the exact wiring diagram when others only mention that the circuit exists.
### How Do I Implement a Stronger LLM Judge in Practice?
- Use PromptLayer to call a stronger model, such as a high-capacity judge model.
- Validate judge outputs and check for calibration.
### What Exact Code Is Needed to Connect PromptLayer and LaikaTest?
**Full Example Flow**
```python
import json

import openai
import pandas as pd
import promptlayer
from laikatest import LaikaClient

promptlayer.api_key = "PROMPTLAYER_KEY"
openai.api_key = "OPENAI_KEY"
laika = LaikaClient(api_key="LAIKATEST_KEY")

def generate_and_judge(row):
    user_prompt = row["user_prompt"]
    gen = promptlayer.track(
        model_name="gpt-4o",
        messages=[{"role": "user", "content": user_prompt}],
        tags=["generation", "model_v1"],
        metadata={"dataset": "tax_prompts"},
    )
    model_output = gen["content"]
    judge_msg = prompt_template.format(
        user_prompt=user_prompt,
        model_output=model_output,
        rubric=json.dumps(rubric),
    )
    judge = promptlayer.track(
        model_name="gpt-4o-judge",
        messages=[{"role": "user", "content": judge_msg}],
        tags=["judge", "strong_judge", "experiment_v1"],
        metadata={"generation_id": gen["id"]},
    )
    judge_json = json.loads(judge["content"])
    test_case = {
        "title": f"judge_{row['id']}",
        "input": {"user_prompt": user_prompt, "model_output": model_output},
        "expected": {"correctness_min": 4},
        "metadata": {"generation_id": gen["id"], "judge_id": judge["id"]},
        "results": judge_json,
    }
    laika.create_test_case(test_case)
    return judge_json

df = pd.read_csv("tests.csv")
results = df.apply(generate_and_judge, axis=1)
```
This code shows authentication, prompt formatting, PromptLayer tracking, and LaikaTest test case creation. Add error handling and retries in production.
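The note about retries can be sketched as a small backoff wrapper around any generation or judge call. This is a minimal illustration, not part of any SDK; the names and defaults are mine:

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff and jitter on failure.
    Suitable for wrapping flaky generation or judge API calls."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the error to the caller
            # Backoff grows as 2^attempt, with jitter, scaled by base_delay.
            time.sleep(base_delay * (2 ** attempt + random.random()))

# Usage sketch: judge_json = with_retries(lambda: generate_and_judge(row))
```

In production you would likely retry only on transient errors (rate limits, timeouts) rather than on every exception.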
For step-by-step examples, see the Demo page.
## Step 7: Use LaikaTest to Orchestrate, Assert, and CI Integrate
LaikaTest runs batch evaluations as test suites with pass/fail assertions. Map rubric rules to LaikaTest assertions. This is similar to turning manual QA cases into automated unit tests that run on push.
### What Does LaikaTest Add to Model-Based Evaluations?
- Turns judge outputs into runnable test suites.
- Stores test history and allows A/B comparisons.
- Integrates with CI and alerting.
### How Do I Convert Judge Scores into CI Pass/Fail Checks?
- Define thresholds. For example, correctness must be >= 4.
- Create assertions in LaikaTest based on judge JSON.
**LaikaTest Example Code**
```python
from laikatest import LaikaClient

laika = LaikaClient(api_key="LAIKATEST_KEY")
suite = laika.create_suite("taxbot_judging_suite")

for idx, row in df.iterrows():
    cj = row["judge_json"]
    case = {
        "title": f"case_{row['id']}",
        "input": row["user_prompt"],
        "assertions": [
            # Map the rubric rule "correctness >= 4" to a suite assertion.
            {"type": "gte", "path": "scores.correctness", "value": 4}
        ],
        "results": cj,
    }
    laika.add_test_case_to_suite(suite["id"], case)

run = laika.run_suite(suite["id"])
print("Suite run status:", run["status"])
```
Link runs to CI by calling `laika.run_suite` in a GitHub Actions step.
For LaikaTest examples, see the Demo page.
## Step 8: Analyze Metrics, Calibration, and Inter-Judge Agreement
Compute aggregated metrics and calibration checks. Compare judge results to a small human-labeled sample. Measure inter-judge agreement if you use multiple judge models. This is similar to checking a thermometer against a reference thermometer.
### How Do I Know the Judge Is Reliable?
- Compare judge scores to human labels on a held-out set.
- Check calibration; for example, if the judge states high confidence but is wrong often, tune prompts.
### Should I Compare LLM Judge Results to Humans?
- Yes, at least for an initial calibration set.
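The calibration idea can be made concrete with a stdlib sketch: bucket verdicts by the judge's stated confidence and compare each bucket's confidence to its observed accuracy against human labels. The record shape here is an assumption for illustration:

```python
from collections import defaultdict

def calibration_by_bucket(records):
    """records: [{"confidence": float, "correct": bool}, ...], where
    "correct" means the judge agreed with the human label.
    Returns observed accuracy per 0.1-wide confidence bucket.
    A well-calibrated judge's 0.9 bucket should be right ~90% of the time."""
    buckets = defaultdict(list)
    for r in records:
        bucket = round(r["confidence"], 1)  # e.g. 0.87 -> 0.9
        buckets[bucket].append(r["correct"])
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```

If a bucket's observed accuracy falls well below its stated confidence, that is the signal to tune the judge prompt.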
**Example Metrics Using Pandas and Cohen's Kappa**
```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score  # requires scikit-learn

df = pd.read_parquet("baseline_results.parquet")

# Aggregate per rubric dimension: mean score and the share of scores >= 4.
# Assumes results reshaped to long format with "dimension" and "score"
# columns (one row per example per dimension).
metrics = df.groupby("dimension").agg(
    mean_score=("score", "mean"),
    pass_rate=("score", lambda x: (x >= 4).mean()),
)

# Agreement between human labels and judge scores on a calibration set.
kappa = cohen_kappa_score(df["human_correctness"], df["judge_correctness"])
print("Cohen kappa:", kappa)
```
For more on metrics, see the LLM Testing & Evaluation pillar page.
## Step 9: Iterate Prompts, Guardrails, and Fail-Safes
Tune judge prompts using PromptLayer A/B testing. Add safety checks for judge hallucination and contradiction. Fallback to human review on low-confidence cases. This is similar to tuning search ranking models with A/B tests and manual review for edge cases.
### How Do I Improve and Maintain Judge Quality Over Time?
- Run A/B tests for judge prompts with PromptLayer.
- Monitor judge drift and error rates.
- Retrain or rewrite judge prompts if there is a growing mismatch with humans.
### When Should I Escalate to Human Review?
- Low confidence scores.
- Disagreement between multiple judges.
- High-risk or high-impact outputs.
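The escalation rules above can be combined into a single routing predicate. The thresholds here are illustrative defaults to tune, not recommendations:

```python
def needs_human_review(judge_scores, confidence=None,
                       min_confidence=0.6, max_spread=1):
    """Decide whether a case should be escalated to a human reviewer.
    judge_scores: correctness scores from one or more judges.
    Escalates on low stated confidence, or on judge disagreement
    (spread between the highest and lowest score exceeding max_spread)."""
    if confidence is not None and confidence < min_confidence:
        return True
    if max(judge_scores) - min(judge_scores) > max_spread:
        return True
    return False

# Two judges disagree by 3 points -> escalate.
print(needs_human_review([5, 2]))  # True
```

High-risk categories can bypass this check entirely and go straight to the human queue.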
**Example A/B Setup with PromptLayer**
```python
# A/B the judge prompt itself: run both variants with distinct tags.
pl_a = promptlayer.track(
    model_name="gpt-4o-judge",
    messages=[{"role": "user", "content": judge_prompt_a}],
    tags=["judge_ab", "variant_a"],
)
pl_b = promptlayer.track(
    model_name="gpt-4o-judge",
    messages=[{"role": "user", "content": judge_prompt_b}],
    tags=["judge_ab", "variant_b"],
)

# Fallback: route low-confidence judgments to a human queue.
if judge_response.get("confidence") and judge_response["confidence"] < 0.6:
    route_to_human_queue(row)
```
For monitoring ideas, see the LLM Observability & Tracing pillar page.
## Troubleshooting and Common Pitfalls
Avoid vague judge instructions, or you will get inconsistent scores. Beware of judge and candidate models sharing training data, which can bias results. Watch for prompt injection inside model outputs that can trick the judge. This is similar to debugging flaky tests caused by ambiguous requirements.
### Common Checks
- Validate judge JSON schema on every run.
- Ensure the judge did not simply echo back the rubric as a response.
- Check for repeated identical judge outputs, which can signal a prompt problem.
**Small Validation Example**
```python
def validate_judge_json(j):
    """Reject judge responses that are missing keys or badly typed."""
    required = {"scores", "explanations"}
    if not required.issubset(j.keys()):
        return False
    if not isinstance(j["scores"].get("correctness"), int):
        return False
    return True
```
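The third check in the list, repeated identical judge outputs, can be sketched the same way:

```python
from collections import Counter

def repeated_output_rate(judge_outputs):
    """Fraction of judge responses that are exact duplicates of an
    earlier response. A high rate often signals a prompt problem,
    e.g. the judge ignoring its input and emitting a canned answer."""
    counts = Counter(judge_outputs)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(judge_outputs) if judge_outputs else 0.0

outputs = ['{"scores": {"correctness": 4}}'] * 3 + ['{"scores": {"correctness": 2}}']
print(repeated_output_rate(outputs))  # 0.5
```

A sensible alert threshold depends on your dataset; identical inputs can legitimately produce identical judgments.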
For more pitfalls, see the LLM Testing & Evaluation pillar page.
## Appendix: Example Prompt Templates and Full Code Map
This is the recipe appendix that lists exact measurements and steps.
**File Layout**
- judge_prompts.py # holds prompt_template strings
- evaluate.py # generation and judge pipeline
- laikatest_integration.py # create and run suites
- promptlayer_logging.py # PromptLayer wrappers
- tests.csv # dataset
**judge_prompts.py Example**
```python
prompt_template = """You are an unbiased evaluator.
Input:
user_prompt: {user_prompt}
model_output: {model_output}
Rubric: {rubric}
Return only valid JSON with keys scores and explanations.
Scores must be numeric.
"""
```
**evaluate.py Outline**
**laikatest_integration.py Outline**
**promptlayer_logging.py Outline**
For full example files and GitHub-ready code, see the Demo page.
## Conclusion with LaikaTest
Automating LLM judgment gives you repeatable evidence. Best practices include clear rubrics, structured judge outputs, prompt versioning, and human calibration. Always validate judge reliability on a human-labeled sample. Use PromptLayer to record every prompt and response, and use LaikaTest to convert judge outputs into repeatable test suites. LaikaTest transforms judge outputs into assertions you can run in CI and helps you understand which prompt version actually improved behavior.
LaikaTest is an AI infrastructure tool for production LLM systems that helps teams experiment, evaluate, and debug prompts and agents safely in real usage. It assists when teams change prompts or agent logic but do not know if behavior improved. It helps when outputs are non-deterministic, and "it felt better" is not enough evidence. Observability tools show logs, but do not indicate which version performed better. LaikaTest helps avoid silent regressions after prompt or model changes.
LaikaTest enables:
- Prompt A/B testing, allowing you to run multiple prompt variants on real traffic and compare outcomes
- Agent experimentation, enabling you to compare different agent setups as experiments, not guesses
- One-line observability and tracing, allowing you to see which prompt version was used, model outputs, tool calls, costs, and latency
- An evaluation feedback loop, enabling you to collect human or automated scores tied to exact prompt versions
Try the provided LaikaTest example suite from the Demo page, import your PromptLayer logs, map judge JSON to assertions, and add the run to CI. This way, every model change triggers the same set of checks. If you do this, you will sleep better. You will also have proof next time a chatbot recommends classifying a cat as a dependent.