Learn how to implement LLM CI/CD for better model reliability and user experience. Follow this checklist for effective practices.
Naman Arora
January 24, 2026

# Practical LLM CI/CD Guide
[ANECDOTE_PLACEHOLDER]
*I tell that story because it shows how small prompt changes can cause weird user-facing issues. After that day, I treated prompts like code and tests like basic hygiene.*
## Overview and Prerequisites
What is LLM CI/CD? I will answer that first. LLM CI/CD is continuous integration and continuous delivery for systems that use large language models. It covers code, prompt templates, data, and model artifacts. It checks that changes do not break the pipeline or the user experience. LLM CI/CD is different from regular CI/CD for code because models and prompts add a new layer of non-deterministic behavior. Tests must cover prompts, model outputs, and cost or latency. We need reproducible checks and gates that prevent silent regressions.
Think of LLM CI/CD like a restaurant kitchen. The cooks are your code. The recipes are your prompts. The ingredients are data and model files. Before serving a dish, you check the recipe formatting, taste a sample portion, and watch how it looks. If any step fails, you stop the service. The same applies here. We test format, run a small evaluation, and only then deploy.
### Prerequisites
- Repo access and branch protections.
- Model hosting or API keys for your model provider.
- Test datasets, a golden set with examples and expected behaviors.
- CI runner with GPU or a fast CPU for small evaluations.
- Artifact store or model registry for versioned models.
- Secrets stored in the CI secret store, not in code.
### Expected Outcome
- A reproducible pipeline that runs unit tests, prompt evaluations, and safe deploy steps.
- Faster PR feedback for prompt and model changes.
- Fewer surprises in production.
### Required Files You Should Have in Your Repo
- .github/workflows/ci.yml
- tests/test_prompts.py
- eval/evaluate_prompts.py
- dvc.yaml or model registry config
Link: [LLMOps & Production AI pillar page](/llmops)
### Answer: What is LLM CI/CD?
LLM CI/CD is the set of automated checks and deployment steps for LLM systems. It covers code, prompt templates, data, model artifacts, and runtime checks. It aims to prevent regressions in behavior, cost, and latency.
---
## Step 0: Detailed Prerequisites and Repo Layout
### Repository Layout
```
src/      # application code and inference wrapper
prompts/  # prompt templates and schema
data/     # raw and processed data
models/   # small examples and version metadata
tests/    # unit and integration tests
eval/     # evaluation harness and golden datasets
k8s/      # deployment manifests
```
Treat code, prompts, and data as first-class citizens. Put prompt templates in a folder and use schema files to validate them. Store large artifacts with Git LFS or DVC. Keep model metadata in a registry.
### Use Versioning
- Git LFS for large binary files like small model checkpoints.
- DVC or a model registry for datasets and large models.
- Use consistent model version tags. Keep metadata in models/version.json.
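As a sketch, models/version.json can record which artifact, data, and evaluation produced the current model. The exact fields are up to you; these are illustrative, not any registry's standard:

```json
{
  "model_name": "support-summarizer",
  "version": "v123",
  "base_model": "example-base-1.3b",
  "training_data": "data/processed@2026-01-10",
  "eval_exact_match": 0.93,
  "created_at": "2026-01-12T09:30:00Z"
}
```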
### Secrets and API Keys
- Store secrets in the CI secret store.
- Never commit API keys to source.
- Use least privilege. Create keys that only allow inference for CI accounts.
### Analogy
Organize your repo like a kitchen. Put ingredients in an ingredients folder, keep recipe templates where cooks can find them, and keep the big items in the fridge with labels.
### Example .gitattributes Entry for Git LFS
```
*.pt filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
```
### Example dvc.yaml Snippet
```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw
    outs:
      - data/processed
```
### Answer: Which Tools Are Best for LLM CI/CD?
- Git and GitHub for code and PR workflows.
- DVC or a model registry for large data and model artifacts.
- Git LFS for medium-sized binaries.
- CI systems that support GPUs or powerful runners, like GitHub Actions, CircleCI, or self-hosted runners.
- Secret stores in the CI provider.
- Observability tools that support traces, logs, and metrics.
---
## Step 1: Unit Tests and Static Checks for LLM Pipelines
Unit tests and static checks are the first line of defense. They catch tokenization bugs, broken prompt placeholders, and format errors. They run fast and should fail quickly.
### What to Test
- Tokenizers and preprocessors.
- Prompt template formatting.
- JSON and schema validations for prompts.
- Small smoke tests for inference wrappers.
### Analogy
Unit tests are like tasting a sauce before you serve it. If it tastes off, you stop and fix it.
### Include Code
tests/test_prompts.py
```python
import json

import jsonschema

from prompts.template import PROMPT, PROMPT_SCHEMA


def test_prompt_format():
    # A fully formatted prompt should contain no leftover placeholders.
    formatted = PROMPT.format(customer_name='A')
    assert '{' not in formatted and '}' not in formatted


def test_prompt_schema_valid():
    with open('prompts/template.json') as f:
        data = json.load(f)
    jsonschema.validate(instance=data, schema=PROMPT_SCHEMA)
```
Fail fast on broken templates, malformed JSON, or missing fields. Add linters for formatting. Use simple scripts to validate placeholder names.
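A placeholder check can be a few lines of Python. This sketch assumes `str.format`-style `{name}` placeholders; `REQUIRED_FIELDS` is an illustrative list you would replace with your template's real placeholder names:

```python
import string

REQUIRED_FIELDS = {"customer_name"}  # illustrative; list your real placeholders


def placeholders(template: str) -> set:
    """Extract {name} placeholders from a str.format-style template."""
    return {field for _, field, _, _ in string.Formatter().parse(template) if field}


def validate_template(template: str) -> None:
    """Raise ValueError if the template's placeholders do not match expectations."""
    found = placeholders(template)
    missing = REQUIRED_FIELDS - found
    unknown = found - REQUIRED_FIELDS
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")


validate_template("Hello {customer_name}, how can we help?")  # passes silently
```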
### Answer: How Do You Run Tests for LLMs in CI?
Run standard pytest in your CI job. Use separate jobs for unit tests and heavier evaluations. Run tests that only touch tokenizers or small I/O on default runners, and reserve GPU runners for model evaluations.
---
## Step 2: Add Prompt and Model Evaluations to CI
We need a small golden dataset. It should be fast to run in CI. The dataset covers typical user requests and edge cases. Each example should include input, expected behavior, and optional scoring rules.
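A couple of golden examples might look like this (the field names match the harness below; `tags` is an optional illustrative field for grouping):

```jsonl
{"prompt": "Summarize: order #123 arrived damaged", "expected": "Order #123 arrived damaged.", "tags": ["summarization"]}
{"prompt": "Greet the customer by name: Priya", "expected": "Hello Priya!", "tags": ["greeting", "edge-case"]}
```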
### Evaluation Harness
- Load golden dataset from eval/golden.jsonl.
- Call the model via API client or a local server.
- Compute metrics like exact match, BLEU, or custom hallucination checks.
- Print a JSON summary and exit with a non-zero code if metrics fall below thresholds.
### Analogy
Prompt evaluations are like the quality control table where a sample meal is verified against the recipe. You check one plate from the line before sending more.
### Include Code
eval/evaluate_prompts.py
```python
import json
import os
import sys

import requests

THRESHOLDS = {
    "exact_match": 0.9,
}


def load_golden(path='eval/golden.jsonl'):
    items = []
    with open(path) as f:
        for line in f:
            items.append(json.loads(line))
    return items


def call_model(prompt, api_key, url):
    resp = requests.post(
        url,
        json={"prompt": prompt},
        headers={"Authorization": f"Bearer {api_key}"},
    )
    resp.raise_for_status()
    return resp.json()["text"]


def compute_metrics(results):
    total = len(results)
    exact = sum(1 for r in results if r["expected"].strip() == r["actual"].strip())
    return {"exact_match": exact / total}


def main():
    items = load_golden()
    api_key = os.environ.get("MODEL_API_KEY")
    url = os.environ.get("MODEL_API_URL", "https://api.example.com/generate")
    results = []
    for it in items:
        out = call_model(it["prompt"], api_key, url)
        results.append({"expected": it["expected"], "actual": out})
    metrics = compute_metrics(results)
    with open('eval/results.json', 'w') as f:
        json.dump(metrics, f)
    for k, v in THRESHOLDS.items():
        if metrics.get(k, 0) < v:
            print("Evaluation failed", metrics)
            return 2
    print("Evaluation passed", metrics)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
### Pytest Wrapper
tests/test_eval_gate.py
```python
import eval.evaluate_prompts as eval_mod


def test_eval_passes():
    rc = eval_mod.main()
    assert rc == 0
```
### Answer: How Do You Add Prompt Evaluations in a CI Pipeline?
Add a CI job that runs eval/evaluate_prompts.py. Make the job use secrets for MODEL_API_KEY. Fail the job when metrics fall below thresholds. Upload results as artifacts and post a summary to PRs.
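One way to post the summary, sketched as a GitHub Actions step using the `gh` CLI (available on hosted runners; the comment body format is up to you):

```yaml
- name: Post eval summary to PR
  if: github.event_name == 'pull_request'
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    gh pr comment ${{ github.event.pull_request.number }} \
      --body "Eval results: $(cat eval/results.json)"
```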
---
## Step 3: GitHub Actions CI Pipeline Example
Think of the workflow as the kitchen line. Each station checks one thing and moves the plate on.
### Full .github/workflows/ci.yml Example
```yaml
name: CI
on:
  pull_request:
    branches: [ main ]
  push:
    branches: [ main ]
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install deps
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest -q
  eval:
    needs: unit-tests
    runs-on: [ self-hosted, gpu ]  # use your GPU runner label here
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install deps
        run: pip install -r requirements.txt
      - name: Run eval
        env:
          MODEL_API_KEY: ${{ secrets.MODEL_API_KEY }}
          MODEL_API_URL: ${{ secrets.MODEL_API_URL }}
        run: |
          python eval/evaluate_prompts.py
      - name: Upload eval artifact
        uses: actions/upload-artifact@v3
        with:
          name: eval-results
          path: eval/results.json
  deploy:
    needs: [unit-tests, eval]
    if: github.ref == 'refs/heads/main' && needs.eval.result == 'success'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy Canary
        env:
          KUBE_CONFIG_DATA: ${{ secrets.KUBE_CONFIG_DATA }}
        run: |
          echo "$KUBE_CONFIG_DATA" | base64 --decode > kubeconfig
          kubectl --kubeconfig=kubeconfig apply -f k8s/canary.yaml
```
### Notes
- Use secrets for MODEL_API_KEY.
- Use a GPU runner for evaluation jobs when needed.
- Upload eval results as artifacts and parse them to decide deploy. You can make a step that reads eval/results.json and fails on thresholds.
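A deploy gate over the uploaded artifact can be a short script, run as an extra step before the deploy job. This is a sketch; the file path and thresholds mirror the harness above and should stay in sync with it:

```python
import json
import sys

THRESHOLDS = {"exact_match": 0.9}  # keep in sync with eval/evaluate_prompts.py


def check_results(path="eval/results.json", thresholds=THRESHOLDS):
    """Return True if every metric in the results file meets its threshold."""
    with open(path) as f:
        metrics = json.load(f)
    failures = {k: metrics.get(k, 0) for k, v in thresholds.items() if metrics.get(k, 0) < v}
    if failures:
        print(f"Gate failed: {failures}")
        return False
    print(f"Gate passed: {metrics}")
    return True


if __name__ == "__main__":
    sys.exit(0 if check_results() else 1)
```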
Link: [LLMOps & Production AI pillar page](/llmops)
### Answer: Which Tools Are Best for LLM CI/CD?
See earlier. For GitHub, use self-hosted GPU runners for heavy evaluations. For artifact stores, use DVC or a model registry.
### Answer: How Do You Add Prompt Evaluations in a CI Pipeline?
Add a job that runs the evaluation harness, use secrets for the model key, and fail on threshold breaches. Upload artifacts and post summaries to PRs.
---
## Step 4: Safe Deployment Strategies for LLM Systems
Use canary and shadow deployments. Test new model or prompt changes on a small percentage of traffic. Use feature flags to route traffic to the new version. Keep rollback steps ready.
### Analogy
Deploy like a soft opening. Invite a few customers first and watch reactions before full launch.
### Deployment Snippet
k8s/canary.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-canary
  labels:
    app: inference
    version: canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference
      version: canary
  template:
    metadata:
      labels:
        app: inference
        version: canary
    spec:
      containers:
        - name: server
          image: myregistry/inference:canary
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
```
### GitHub Actions Step to Toggle Feature Flag
```yaml
- name: Toggle feature flag
  run: |
    curl -X POST -H "Authorization: Bearer ${{ secrets.FEATURE_FLAG_KEY }}" \
      -d '{"flag":"use_canary","enabled":true,"percent":10}' \
      https://flags.example.com/api/toggle
```
### Answer: How Do You Deploy LLM Systems Safely?
Use canary deployments, shadow testing, and feature flags. Version everything and keep a clear rollback plan. Store artifacts and model versions in a model registry or artifact store.
Link: [LLMOps & Production AI pillar page](/llmops)
---
## Step 5: Observability and Tracing for LLM Inference
Log inputs and outputs in structured form. Do not log PII. Redact or hash sensitive fields. Use OpenTelemetry or vendor tracing to collect latency and errors.
### Analogy
Observability is like CCTV and sensors in the kitchen. You want to see which station took too long or made a mistake.
### Python Logging Snippet and Tracing
```python
import hashlib
import json

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider; swap ConsoleSpanExporter for your backend's exporter.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference")


def redact_input(inp):
    # Hash sensitive fields instead of logging them in the clear.
    if "email" in inp:
        inp["email"] = hashlib.sha256(inp["email"].encode()).hexdigest()
    return inp


def log_call(prompt, response, extra):
    record = {"prompt": prompt[:500], "response_len": len(response), "extra": extra}
    print(json.dumps(record))


def call_model_and_trace(client, prompt):
    with tracer.start_as_current_span("call_model") as span:
        resp = client.generate(prompt)
        span.set_attribute("prompt_length", len(prompt))
        span.set_attribute("tokens_used", resp.get("tokens", 0))
        return resp
```
Track model-specific signals like prompt length, token usage, latency, and hallucination indicators. Emit metrics to your metrics backend.
Link: [LLM Observability & Tracing pillar page](/observability)
### Answer: How Do You Monitor and Trace LLM Production Issues?
Log structured inputs and outputs with sensitive data redacted. Use tracing to connect frontend requests to model calls. Track model signals and set alerts on regressions.
---
## Step 6: Retraining Triggers and Automation
Define metric thresholds and drift detectors that trigger retraining. Keep a human in the loop to validate candidate models. Use scheduled pipelines to collect labeled examples. Store candidate artifacts with metadata and evaluations.
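As a rough sketch of a drift trigger, you can compare a rolling window of a production signal (say, response length in tokens) against a baseline. The z-score threshold here is arbitrary; a real system would use a proper statistical test:

```python
from statistics import mean, stdev


def drifted(baseline, recent, z_threshold=3.0):
    """Flag drift when the recent mean sits more than z_threshold baseline
    standard deviations away from the baseline mean."""
    base_mean = mean(baseline)
    base_std = stdev(baseline) or 1e-9  # avoid division by zero
    z = abs(mean(recent) - base_mean) / base_std
    return z > z_threshold


# Example: response lengths in tokens
baseline = [110, 120, 115, 118, 112, 119, 116]
print(drifted(baseline, [114, 117, 113]))  # False: within normal variation
print(drifted(baseline, [260, 250, 270]))  # True: large shift, trigger retrain review
```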
### Analogy
Retraining is like planning a menu update when customers start asking for new dishes.
### Example Workflow to Trigger Retrain
```yaml
on:
  workflow_dispatch:
  schedule:
    - cron: '0 3 * * 1'  # weekly
jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Start retrain
        run: python src/retrain.py --output models/candidate-v${{ github.run_id }}
```
Or an Airflow DAG can trigger the pipeline when an alert hits. When a candidate passes evaluation, automate a PR creation with results for human review.
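The PR-creation step can also be sketched with the `gh` CLI inside the retrain workflow; branch naming and commit contents here are illustrative:

```yaml
- name: Open PR with candidate results
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    git checkout -b retrain/candidate-${{ github.run_id }}
    git add models/ eval/results.json
    git commit -m "Retrain candidate ${{ github.run_id }}"
    git push -u origin HEAD
    gh pr create --title "Retrain candidate ${{ github.run_id }}" \
      --body "Automated retrain. Eval: $(cat eval/results.json)"
```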
### Answer: How Do You Deploy LLM Systems Safely?
Use gated retraining and human approval. Automate candidate creation, evaluation, and PR generation for review.
### Answer: How Do You Monitor and Trace LLM Production Issues?
See the observability section. Use logs, traces, and model metrics. Set alerts for drift or metric drops.
Link: [LLMOps & Production AI pillar page](/llmops)
---
## Step 7: Troubleshooting Checklist and Expected Outcomes
### Common Failures and Quick Fixes
- Tokenization mismatch: fix by aligning tokenizer version and add a unit test.
- Prompt template breakage: revert or patch template and add schema tests.
- API rate limits: add retries and backoff in client.
- Metric regressions: rollback and run eval locally to reproduce.
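For the rate-limit case, a minimal retry-with-backoff wrapper might look like this. The sleep schedule and retryable exception types are illustrative; in a real client you would likely reach for a library such as tenacity:

```python
import time


def with_retries(fn, max_attempts=4, base_delay=0.5, retryable=(ConnectionError,)):
    """Call fn(), retrying on retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...


# Usage with a flaky call that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("rate limited")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # prints "ok" after two retries
```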
### Expected Outcomes After This Guide
- Faster PR feedback for prompts and models.
- Reproducible evaluations and fewer surprises in production.
- Actionable monitoring tied to exact prompt and model versions.
### Operational Playbooks
- Rollback: how to switch traffic to the previous model version.
- Alerting runbook: who to call and what to collect.
- Incident analysis: tag incidents with model and prompt versions.
### Commands to Fetch Artifacts and Replay
1. Download eval/results.json from the workflow artifacts in the GitHub UI.
2. Reproduce locally:
   `python eval/evaluate_prompts.py --model-version v123 --input tests/golden.jsonl`
### Analogy
Troubleshooting is like the kitchen closing checklist. You follow a short sequence to find the root cause.
### Answer: What is LLM CI/CD?
LLM CI/CD is CI/CD tuned for LLM systems. It combines code checks, prompt validation, evaluation, and safe deployment. It helps you avoid regressions with model or prompt changes.
### Answer: How Do You Run Tests for LLMs in CI?
Run quick unit tests and schema checks on normal runners. Run small evaluations on stronger runners. Fail CI on metric drops. Upload artifacts for debugging.
---
## Appendix: Files and Full Snippet List to Copy
### Copy Paste List
- .github/workflows/ci.yml
- eval/evaluate_prompts.py
- tests/test_prompts.py
- dvc.yaml
- k8s/canary.yaml
### Notes on Runners, Caching, and Secrets
- Use self-hosted GPU runners for heavy evaluations.
- Cache pip packages and model downloads to speed CI.
- Store MODEL_API_KEY and KUBE_CONFIG_DATA in CI secrets.
### Checklist to Add to Each PR
1. All unit tests pass.
2. Eval metrics unchanged or improved.
3. Infra review if deploy or k8s changes are included.
4. Add evaluation artifact link in PR.
### Analogy
This appendix is your mise en place list. It makes sure you do not forget any ingredient before service.
### Include Full Example Files
- See earlier code blocks for ci.yml, evaluate_prompts.py, and test examples.
- k8s/canary.yaml included above.
### Answer: Which Tools Are Best for LLM CI/CD?
Use GitHub Actions with self-hosted GPU runners, DVC or a model registry, Git LFS for binaries, and monitoring tools with tracing.
### Answer: How Do You Add Prompt Evaluations in a CI Pipeline?
Add an eval job that runs a short golden dataset against the model. Fail the job when metrics fall below thresholds. Upload eval results and post highlights to the PR.
---
## Conclusion with LaikaTest
After following this guide, you will get faster feedback, reproducible checks, and safer rollouts. You will catch prompt regressions before they reach users. For teams that change prompts often, a dedicated evaluation tool helps a lot.
I recommend adding a dedicated evaluation tool like LaikaTest to the pipeline. LaikaTest helps teams experiment with, evaluate, and debug prompts and agents safely in real usage. It addresses the core problems teams face: teams change prompts or agent logic without knowing whether behavior actually improved; AI outputs are non-deterministic, so "it felt better" is not evidence; observability tools show logs but do not tell you which version performed better; and silent regressions slip in after prompt or model changes.
LaikaTest enables prompt A/B testing, agent experimentation, one-line observability and tracing, and an evaluation feedback loop. Use LaikaTest to author and run prompt test suites, store historical evaluation artifacts, and gate merges automatically. Make LaikaTest jobs a required CI check so prompt regressions are caught before deployment.
If you take one thing from this guide, make it this: treat prompts like code, add fast evaluations to CI, and keep a small golden dataset for every important behavior. Then add a tool like LaikaTest to scale evaluations against real traffic and to keep a clear history of what changed and why.
Good luck. Keep the chai coming, and may your prompts stop inventing coffee shops for real customers.