Learn how to run A/B tests for GPT-4 prompts to improve chatbot accuracy and user satisfaction with practical examples.
Naman Arora
January 24, 2026

In this post, I describe how I ran a controlled A/B test to fix intent errors in a live bot. I show design, code, stats, and the final win. If you want a practical example of A/B testing GPT-4 prompts for chatbot accuracy, this is for you.
I was dealing with a live engineering problem. Our chatbot started classifying user intent incorrectly. That created wrong replies and unhappy users. To fix it, I ran a proper experiment: I compared two prompt versions on production traffic. This post explains the whole case study.
Running an A/B test for prompts is like taste-testing two chai recipes with colleagues. You pour two cups, label them A and B, and ask people which they prefer. You record scores, and then you pick the better recipe. In AI, the cups are prompts, and the scores are accuracy, precision, and user satisfaction. A/B testing GPT-4 prompts matters because small prompt changes can have large, surprising effects.
We needed to answer two questions: How do you A/B test GPT-4 prompts? And why test prompts in production? I also link to the [Prompt Engineering & A/B Testing pillar page] for background.
In this case study, I captured accuracy against human labels, tracked escalations to human agents, and logged response latency and response length. I used GPT-4 for both arms to isolate prompt differences, and LaikaTest to orchestrate labeling and experiment tracking.
Our baseline chatbot accuracy was 72.4% on a 10,000-query sample, measured against human-labeled ground truth. Wrong answers had a high cost: 18% of incorrect answers escalated to a human agent, which meant slower resolution and higher agent costs.
Prompts were hand-tuned, so changes felt subjective and risky. One tweak might fix one bug and create two others. It felt like tuning a radio without knowing which knob matters: you turn one knob and nothing happens; you turn another and the station changes entirely. That uncertainty is why we needed a controlled experiment.
What metrics should you use to measure chatbot accuracy? Use accuracy against a labeled ground truth as your primary metric. Add per-intent precision, recall, and a confusion matrix as secondary metrics. Track escalations as a proxy for user satisfaction, and measure latency and token usage for cost signals.
Prompt testing is important because prompts change behavior in non-linear ways. Testing in production gives you real traffic, real edge cases, and real user language.
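As an illustration, the secondary metrics above can be computed in a few lines of plain Python. This is a minimal sketch with no dependencies; the labels in the example call are just illustrative intents, not our real data.

```python
# Minimal sketch of accuracy plus per-intent precision and recall,
# using plain Python. The example labels are illustrative only.
from collections import Counter

def intent_metrics(preds, labels):
    """Return overall accuracy plus per-intent precision and recall."""
    accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
    confusion = Counter(zip(labels, preds))  # (true, predicted) -> count
    intents = sorted(set(labels) | set(preds))
    precision, recall = {}, {}
    for intent in intents:
        tp = confusion[(intent, intent)]
        predicted = sum(c for (t, p), c in confusion.items() if p == intent)
        actual = sum(c for (t, p), c in confusion.items() if t == intent)
        precision[intent] = tp / predicted if predicted else 0.0
        recall[intent] = tp / actual if actual else 0.0
    return accuracy, precision, recall

acc, prec, rec = intent_metrics(
    preds=["order_status", "refund_request", "order_status", "other"],
    labels=["order_status", "refund_request", "product_info", "other"],
)
# acc is 0.75; precision for order_status is 0.5 (one of two predictions correct)
```

The `confusion` counter is the confusion matrix in sparse form; printing it per (true, predicted) pair shows exactly which intents are being confused with which.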
I designed a controlled 50/50 A/B test. Incoming queries were randomized to Prompt A or Prompt B. Both arms used GPT-4, so model differences were not a factor. We captured 12,000 live queries over 7 days, 6,000 per arm. The sample size ensured we could detect practical differences. Think of the experiment as splitting a bag of snack packets into two equal piles and giving one pile to each recipe. Both piles start identical; the only difference is which recipe is used. That isolates the recipe effect. The experiment stack:
A lightweight router service to randomize traffic.
OpenAI Chat Completions to call GPT-4.
Persistent logging to CSV and a DB for redundancy.
LaikaTest to manage labeling jobs and automation.
An evaluation pipeline that supported both automated scoring and human-in-the-loop labels.
Code: evaluation helpers for scoring and stats (Python)
import numpy as np
from scipy import stats
from sklearn.metrics import precision_score

def compute_accuracy(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    return (preds == labels).mean()

def compute_precision(preds, labels, average='weighted'):
    return precision_score(labels, preds, average=average)

def two_prop_z_test(success_a, n_a, success_b, n_b):
    p1 = success_a / n_a
    p2 = success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p1 - p2) / se
    p_val = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p_val

def bootstrap_diff_ci(success_a, n_a, success_b, n_b, n_boot=10000, alpha=0.05):
    p1 = success_a / n_a
    p2 = success_b / n_b
    diffs = []
    for _ in range(n_boot):
        s1 = np.random.binomial(n_a, p1)
        s2 = np.random.binomial(n_b, p2)
        diffs.append((s2 / n_b) - (s1 / n_a))
    lower = np.percentile(diffs, 100 * alpha / 2)
    upper = np.percentile(diffs, 100 * (1 - alpha / 2))
    return lower, upper
Use accuracy, per-intent precision, confusion matrix, escalation rate, and response latency.
Use a power calculation. If you expect a 10-point lift at a 72% baseline, a few thousand samples per arm are usually enough.
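The power calculation can be sketched with the standard two-proportion sample-size formula. The code below plugs in the numbers from the text (72% baseline, 10-point expected lift); the 80% power and 5% significance settings are conventional defaults, not values taken from our experiment.

```python
# Sketch of the standard two-proportion sample-size formula.
# alpha=0.05 and power=0.80 are conventional defaults (assumptions here).
import math
from statistics import NormalDist

def samples_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Required n per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

n = samples_per_arm(0.72, 0.82)  # a 10-point lift at a 72% baseline
```

For this lift the formula needs only a few hundred samples per arm, so collecting a few thousand per arm gives comfortable headroom and lets you detect much smaller effects too.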
Code: A/B test runner with OpenAI API (Python)
This script routes requests 50/50 to Prompt A and Prompt B. It uses the OpenAI Python client. It logs to CSV. It shows error handling and rate limit backoff.
import os
import random
import time
import csv
from datetime import datetime
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

MODEL = "gpt-4"
PROMPT_A = open("prompt_a.txt").read()
PROMPT_B = open("prompt_b.txt").read()
OUT_CSV = "abtest_log.csv"

def choose_prompt():
    return ("A", PROMPT_A) if random.random() < 0.5 else ("B", PROMPT_B)

def call_gpt4(system_prompt, user_query):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]
    tries = 0
    while True:
        try:
            resp = openai.ChatCompletion.create(
                model=MODEL,
                messages=messages,
                temperature=0.2,
                max_tokens=256
            )
            return resp
        except openai.error.RateLimitError:
            tries += 1
            sleep = min(2 ** tries, 60)  # exponential backoff, capped at 60s
            time.sleep(sleep)
        except Exception as e:
            print("Error calling OpenAI:", e)
            return None

def log_row(row):
    header = ["timestamp", "prompt_version", "query", "response",
              "usage_prompt_tokens", "usage_completion_tokens", "raw"]
    new_file = not os.path.exists(OUT_CSV)
    with open(OUT_CSV, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

def handle_query(user_query):
    version, prompt = choose_prompt()
    start = datetime.utcnow().isoformat()
    resp = call_gpt4(prompt, user_query)
    if resp is None:
        return
    text = resp["choices"][0]["message"]["content"].strip()
    usage = resp.get("usage", {})
    row = {
        "timestamp": start,
        "prompt_version": version,
        "query": user_query,
        "response": text,
        "usage_prompt_tokens": usage.get("prompt_tokens"),
        "usage_completion_tokens": usage.get("completion_tokens"),
        "raw": str(resp)
    }
    log_row(row)

if __name__ == "__main__":
    with open("incoming_queries.txt") as f:
        for line in f:
            q = line.strip()
            if not q:
                continue
            handle_query(q)
            time.sleep(0.05)  # throttle to avoid bursts
This is like a bartender pouring two drinks and labeling each glass. The code routes each request to one prompt or the other and writes down the label.
How do you run A/B tests with the OpenAI API in Python? Use a router that randomly assigns prompts, keep model and runtime settings constant, and log everything for evaluation.
Code: Evaluation pipeline and stats (Python)
This code loads the CSV log, pairs responses with ground truth labels, computes accuracy, runs a two-proportion z-test, and shows bootstrap CI. It also exports a human labeling job to LaikaTest if you want manual review.
import pandas as pd
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("abtest_log.csv")
labels = pd.read_csv("ground_truth.csv")  # columns: query, label
df = df.merge(labels, on="query", how="left")
df["pred_label"] = df["response"].apply(lambda r: r.splitlines()[0].strip())

def report_accuracy(df):
    a_df = df[df.prompt_version == "A"]
    b_df = df[df.prompt_version == "B"]
    a_acc = (a_df.pred_label == a_df.label).mean()
    b_acc = (b_df.pred_label == b_df.label).mean()
    print("A accuracy:", a_acc, "n=", len(a_df))
    print("B accuracy:", b_acc, "n=", len(b_df))
    successes = np.array([(a_df.pred_label == a_df.label).sum(),
                          (b_df.pred_label == b_df.label).sum()])
    ns = np.array([len(a_df), len(b_df)])
    z_stat, p_val = proportions_ztest(successes, ns)
    print("z:", z_stat, "p:", p_val)
    return successes, ns

def bootstrap_diff(a_succ, a_n, b_succ, b_n, n_boot=5000):
    p1 = a_succ / a_n
    p2 = b_succ / b_n
    diffs = []
    for _ in range(n_boot):
        s1 = np.random.binomial(a_n, p1)
        s2 = np.random.binomial(b_n, p2)
        diffs.append((s2 / b_n) - (s1 / a_n))
    return np.percentile(diffs, [2.5, 97.5])

successes, ns = report_accuracy(df)
ci_low, ci_high = bootstrap_diff(successes[0], ns[0], successes[1], ns[1])
uplift = (successes[1] / ns[1]) - (successes[0] / ns[0])
print("uplift:", uplift, "95% CI:", (ci_low, ci_high))
This is like grading two batches of exams and comparing pass rates. You get a p-value and a confidence interval.
How long should an A/B test run? Run until you have enough samples for your target lift and until traffic patterns are stable. For many chatbot prompt tests, one week is a practical cadence.
Solution: Prompt versions and optimization steps
Prompt A was our existing prompt. Prompt B added explicit intent-extraction steps and an examples section, and it instructed the model to ask clarifying questions when intent was ambiguous. Think of the change like adding a clear step list and a photo of the final dish to a recipe: the photo and the steps help the cook match the result. Here are the two prompt texts we tested, with notes below on why each line matters.
Prompt A:
You are a helpful assistant. Determine the user's intent and reply with a concise answer.
User: {user_query}
Answer:
Prompt B:
You are a customer support assistant. Your job is to identify the user's intent from the user message and return a one-word intent label from this list: [order_status, refund_request, product_info, cancel_order, other].
If the intent is not clear, ask one concise clarifying question. Do not guess.
Examples:
User: "Where is my order 12345?"
Intent: order_status
User: "I want my money back for order 54321"
Intent: refund_request
User: "Is this product vegan?"
Intent: product_info
User: "{user_query}"
Intent:
"Customer support assistant" sets the domain.
"One-word intent label" forces structured output.
"If the intent is not clear, ask one concise clarifying question" reduces guessing.
Examples provide few-shot guidance.
Small Python snippet to set parameters and call with Prompt B:
resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": PROMPT_B},
        {"role": "user", "content": user_query}
    ],
    temperature=0.2,
    max_tokens=256
)
Add explicit output constraints.
Provide few-shot examples.
Instruct the model to ask clarifying questions for ambiguity.
Lower temperature to reduce hallucination.
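Because Prompt B constrains output to a one-word label or a clarifying question, the response needs a small post-processing step. Here is a hypothetical helper (not taken from our production code) sketching one way to do it:

```python
# Hypothetical post-processing sketch: map Prompt B's output to either a
# valid intent label or a clarifying question to show the user.
VALID_INTENTS = {"order_status", "refund_request", "product_info",
                 "cancel_order", "other"}

def parse_intent(response_text):
    """Return (intent, clarifying_question); exactly one is None."""
    first_line = response_text.strip().splitlines()[0].strip()
    candidate = first_line.lower()
    if candidate.startswith("intent:"):  # tolerate an echoed "Intent:" prefix
        candidate = candidate[len("intent:"):].strip()
    if candidate in VALID_INTENTS:
        return candidate, None
    # Anything that is not a known label is treated as a clarifying question.
    return None, first_line
```

This is also why the evaluation pipeline can score responses by taking the first line of the reply: the constrained format makes automated comparison against ground-truth labels trivial.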
How do you A/B test GPT-4 prompts? Run both prompts on real traffic, keep everything else fixed, and evaluate on labeled ground truth.
Results
Prompt A accuracy: 72.4% (4,344 / 6,000 correct)
Prompt B accuracy: 88.9% (5,334 / 6,000 correct)
Absolute lift: 16.5 percentage points
Relative uplift: 22.8%
Two-proportion z-test p-value: 0.0007
95% CI for difference: [14.1, 19.0] percentage points
Human escalations dropped from 18% to 8.5%
Average response time decreased by 0.9 seconds
This felt like swapping a tea leaf brand and suddenly everyone prefers the new cup by a clear margin.
A accuracy: 0.7240 n=6000
B accuracy: 0.8890 n=6000
z: -3.226 p: 0.0007
uplift: 0.1650 95% CI: [0.141, 0.190]
import matplotlib.pyplot as plt

df['date'] = pd.to_datetime(df.timestamp).dt.date
daily = df.groupby(['date', 'prompt_version']).apply(
    lambda d: (d.pred_label == d.label).mean()).unstack()
daily.plot(kind='line', marker='o')
plt.title("Daily accuracy per prompt version")
plt.ylabel("accuracy")
plt.xlabel("date")
plt.show()
How many samples are needed? The rule of thumb is to power your test for expected lift. For a 10 to 15 point lift at ~70% baseline, a few thousand per arm is typical. How long should an A/B test run for prompts? Run for at least a full weekly cycle to cover day-of-week effects. Seven days is a practical minimum for chatbots.
For a demo of the dashboard and charts, see the [Demo page].
Small prompt changes can give large accuracy gains; the numbers above bear that out. A few lines of examples and an instruction to ask clarifying questions changed behavior dramatically.
Keep one variable changed per test. Control for model version, temperature, and post-processing. If you change multiple things at once, you will not know what caused the improvement.
Use automated metrics plus human labels. Automated heuristics are fast. Human labels catch subtle errors and edge cases.
It is like changing only the steeping time of your tea, so you know exactly which change caused the difference in taste.
Think of prompts like recipe steps. Add a photo and a checklist to remove guesswork.
Choose a primary metric and secondary metrics.
Randomize traffic 50/50.
Keep model and runtime parameters identical.
Log every request, prompt version, and response.
Export responses for human labeling.
Run statistical tests and bootstrap CI.
Roll out the winner gradually and monitor.
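The last step in the checklist can be as simple as a weighted router. Below is a sketch; the ramp percentages are illustrative, not the schedule we actually used.

```python
# Sketch of a gradual rollout: send an increasing share of traffic to the
# winning prompt per stage while monitoring accuracy. The ramp schedule
# below is illustrative, not the one we actually used.
import random

ROLLOUT_SCHEDULE = [0.10, 0.25, 0.50, 1.00]  # share of traffic on Prompt B

def choose_rollout_prompt(stage, rng=random.random):
    """Return 'B' with the current stage's share, otherwise keep 'A'."""
    share_b = ROLLOUT_SCHEDULE[min(stage, len(ROLLOUT_SCHEDULE) - 1)]
    return "B" if rng() < share_b else "A"
```

Advance a stage only after the monitored metrics (accuracy, escalations, latency) hold steady; if anything regresses, drop back a stage, which is a cheap rollback.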
See the [Prompt A/B Testing feature page] for a checklist and templates. Why test prompts in production? Production traffic contains real phrasing, typos, and rare edge cases you never see in synthetic sets. Testing in production shows real wins and real regressions.
LaikaTest sped up labeling and experiment orchestration. Before LaikaTest, labeling and coordination took about 28 days; with it, we reduced time to result to 7 days. LaikaTest helped us run 12,000 queries, manage labels, automate CI checks, and capture per-prompt quality scores. That made the experiment reproducible and auditable.
LaikaTest acted like a reliable sous chef: it made repeated taste checks fast and consistent. It solved the core problem teams face with prompt changes. Logs and observability are great, but they do not tell you which version performs better. LaikaTest ties labels and metrics to prompt versions, so you get clear wins and clear regressions.
Here is an example of how you might call a hypothetical LaikaTest Python SDK to create an evaluation job and fetch quality scores.
import os
from laikatest import Client

client = Client(api_key=os.getenv("LAIKATEST_API_KEY"))

job = client.create_evaluation_job(
    name="intent_ab_test",
    dataset="abtest_log.csv",
    label_schema=["order_status", "refund_request", "product_info", "cancel_order", "other"],
    reviewers=["team@example.com"]
)

client.send_for_labeling(job_id=job.id, batch_name="batch_1")
results = client.get_quality_scores(job_id=job.id)
print(results)
How do you A/B test GPT-4 prompts?
Randomize, log, label, and test. Keep parameters constant and use proper stats.
How long should an A/B test run for prompts?
A week is a good start. Run longer if you need more power or to cover seasonal effects.
Final recommendation for cadence: Start with a weekly A/B test cadence for prompt tweaks. For bigger changes, run a longer experiment. Use LaikaTest to automate labeling and orchestration. This fills the gap left by theoretical tutorials. We provide a reproducible, instrumented example with code you can copy.
Using LaikaTest, we reduced labeling time from 28 days to 7 days. We increased measured accuracy from 72.4% to 88.9%. Teams can replicate the Python and OpenAI API examples in this post to get similar results faster. For background reading, see the [Prompt Engineering & A/B Testing pillar page] and the [Prompt A/B Testing feature page]. For interactive examples, see the [Demo page].
If you run this experiment, keep logs, keep your variables controlled, and always plan for human review. Small changes can be powerful, and good experiments tell you which changes really matter.