Learn how to run A/B tests for GPT-4 prompts to improve chatbot accuracy and user satisfaction with practical examples.
Naman Arora
January 24, 2026

In this post, I describe how I ran a controlled A/B test to fix intent errors in a live bot. I show design, code, stats, and the final win. If you want a practical example of A/B testing GPT-4 prompts for chatbot accuracy, this is for you.
I was dealing with a live engineering problem. Our chatbot started classifying user intent incorrectly. That created wrong replies and unhappy users. To fix it, I ran a proper experiment: I compared two prompt versions on production traffic. This post explains the whole case study.
Running an A/B test for prompts is like taste-testing two chai recipes with colleagues. You pour two cups, label them A and B, and ask people which they prefer. You record scores, and then you pick the better recipe. In AI, the cups are prompts, and the scores are accuracy, precision, and user satisfaction. A/B testing GPT-4 prompts matters because small prompt changes can have large, surprising effects.
We needed to answer two questions: How do you A/B test GPT-4 prompts? And why test prompts in production? I also link to the [Prompt Engineering & A/B Testing pillar page] for background.
In this case study, I captured accuracy against human labels, tracked escalations to human agents, and logged response latency and response length. I used GPT-4 for both arms to isolate prompt differences, and LaikaTest to orchestrate labeling and experiment tracking.
Our baseline chatbot accuracy was 72.4% on a 10,000-query sample, measured against human-labeled ground truth. Wrong answers had a high cost: 18% of incorrect answers escalated to a human agent, which meant slower resolution and higher agent costs.
Prompts were hand-tuned, so changes felt subjective and risky. One tweak might fix one bug and create two others. It felt like tuning a radio without knowing which knob matters: you turn one knob and nothing happens; you turn another and the station changes entirely. That uncertainty is why we needed a controlled experiment.
What metrics should you use to measure chatbot accuracy? Use accuracy against a labeled ground truth as your primary metric. Add per-intent precision, recall, and a confusion matrix as secondary metrics. Track escalations as a proxy for user satisfaction, and measure latency and token usage for cost signals.
Prompt testing is important because prompts change behavior in non-linear ways. Testing in production gives you real traffic, real edge cases, and real user language.
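As an illustration, the secondary metrics above can be computed in a few lines of plain Python. This is a minimal sketch with no dependencies; the labels in the example call are just illustrative intents, not our real data.

```python
# Minimal sketch of accuracy plus per-intent precision and recall,
# using plain Python. The example labels are illustrative only.
from collections import Counter

def intent_metrics(preds, labels):
    """Return overall accuracy plus per-intent precision and recall."""
    accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
    confusion = Counter(zip(labels, preds))  # (true, predicted) -> count
    intents = sorted(set(labels) | set(preds))
    precision, recall = {}, {}
    for intent in intents:
        tp = confusion[(intent, intent)]
        predicted = sum(c for (t, p), c in confusion.items() if p == intent)
        actual = sum(c for (t, p), c in confusion.items() if t == intent)
        precision[intent] = tp / predicted if predicted else 0.0
        recall[intent] = tp / actual if actual else 0.0
    return accuracy, precision, recall

acc, prec, rec = intent_metrics(
    preds=["order_status", "refund_request", "order_status", "other"],
    labels=["order_status", "refund_request", "product_info", "other"],
)
# acc is 0.75; precision for order_status is 0.5 (one of two predictions correct)
```

The `confusion` counter is the confusion matrix in sparse form; printing it per (true, predicted) pair shows exactly which intents are being confused with which.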
I designed a controlled 50/50 A/B test. Incoming queries were randomized to Prompt A or Prompt B. Both arms used GPT-4, so model differences were not a factor. We captured 12,000 live queries over 7 days, 6,000 per arm. The sample size ensured we could detect practical differences. Think of the experiment as splitting a bag of snack packets into two equal piles and giving one pile to each recipe. Both piles start identical; the only difference is which recipe is used. That isolates the recipe effect. The experiment stack:
A lightweight router service to randomize traffic.
OpenAI Chat Completions to call GPT-4.
Persistent logging to CSV and a DB for redundancy.
LaikaTest to manage labeling jobs and automation.
An evaluation pipeline that supported both automated scoring and human-in-the-loop labels.
Code: evaluation helpers for scoring and stats (Python)
import numpy as np
from scipy import stats
from sklearn.metrics import precision_score

def compute_accuracy(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    return (preds == labels).mean()

def compute_precision(preds, labels, average='weighted'):
    return precision_score(labels, preds, average=average)

def two_prop_z_test(success_a, n_a, success_b, n_b):
    p1 = success_a / n_a
    p2 = success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p1 - p2) / se
    p_val = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p_val

def bootstrap_diff_ci(success_a, n_a, success_b, n_b, n_boot=10000, alpha=0.05):
    p1 = success_a / n_a
    p2 = success_b / n_b
    diffs = []
    for _ in range(n_boot):
        s1 = np.random.binomial(n_a, p1)
        s2 = np.random.binomial(n_b, p2)
        diffs.append((s2 / n_b) - (s1 / n_a))
    lower = np.percentile(diffs, 100 * alpha / 2)
    upper = np.percentile(diffs, 100 * (1 - alpha / 2))
    return lower, upper
Use accuracy, per-intent precision, confusion matrix, escalation rate, and response latency.
Use a power calculation. If you expect a 10-point lift at a 72% baseline, a few thousand samples per arm are usually enough.
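The power calculation can be sketched with the standard two-proportion sample-size formula. The code below plugs in the numbers from the text (72% baseline, 10-point expected lift); the 80% power and 5% significance settings are conventional defaults, not values taken from our experiment.

```python
# Sketch of the standard two-proportion sample-size formula.
# alpha=0.05 and power=0.80 are conventional defaults (assumptions here).
import math
from statistics import NormalDist

def samples_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Required n per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

n = samples_per_arm(0.72, 0.82)  # a 10-point lift at a 72% baseline
```

For this lift the formula needs only a few hundred samples per arm, so collecting a few thousand per arm gives comfortable headroom and lets you detect much smaller effects too.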
Code: A/B test runner with OpenAI API (Python)
This script routes requests 50/50 to Prompt A and Prompt B. It uses the OpenAI Python client. It logs to CSV. It shows error handling and rate limit backoff.
import os
import random
import time
import csv
from datetime import datetime
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

MODEL = "gpt-4"
PROMPT_A = open("prompt_a.txt").read()
PROMPT_B = open("prompt_b.txt").read()
OUT_CSV = "abtest_log.csv"

def choose_prompt():
    return ("A", PROMPT_A) if random.random() < 0.5 else ("B", PROMPT_B)

def call_gpt4(system_prompt, user_query):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]
    tries = 0
    while True:
        try:
            resp = openai.ChatCompletion.create(
                model=MODEL,
                messages=messages,
                temperature=0.2,
                max_tokens=256
            )
            return resp
        except openai.error.RateLimitError:
            tries += 1
            sleep = min(2 ** tries, 60)  # exponential backoff, capped at 60s
            time.sleep(sleep)
        except Exception as e:
            print("Error calling OpenAI:", e)
            return None

def log_row(row):
    header = ["timestamp", "prompt_version", "query", "response",
              "usage_prompt_tokens", "usage_completion_tokens", "raw"]
    new_file = not os.path.exists(OUT_CSV)
    with open(OUT_CSV, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

def handle_query(user_query):
    version, prompt = choose_prompt()
    start = datetime.utcnow().isoformat()
    resp = call_gpt4(prompt, user_query)
    if resp is None:
        return
    text = resp["choices"][0]["message"]["content"].strip()
    usage = resp.get("usage", {})
    row = {
        "timestamp": start,
        "prompt_version": version,
        "query": user_query,
        "response": text,
        "usage_prompt_tokens": usage.get("prompt_tokens"),
        "usage_completion_tokens": usage.get("completion_tokens"),
        "raw": str(resp)
    }
    log_row(row)

if __name__ == "__main__":
    with open("incoming_queries.txt") as f:
        for line in f:
            q = line.strip()
            if not q:
                continue
            handle_query(q)
            time.sleep(0.05)  # throttle to avoid bursts
This is like a bartender pouring two drinks and labeling each glass. The code routes each request to one prompt or the other and writes down the label.
How do you run A/B tests with the OpenAI API in Python? Use a router that randomly assigns prompts, keep model and runtime settings constant, and log everything for evaluation.
Code: Evaluation pipeline and stats (Python)
This code loads the CSV log, pairs responses with ground truth labels, computes accuracy, runs a two-proportion z-test, and shows bootstrap CI. It also exports a human labeling job to LaikaTest if you want manual review.
import pandas as pd
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("abtest_log.csv")
labels = pd.read_csv("ground_truth.csv")  # columns: query, label
df = df.merge(labels, on="query", how="left")
df["pred_label"] = df["response"].apply(lambda r: r.splitlines()[0].strip())

def report_accuracy(df):
    a_df = df[df.prompt_version == "A"]
    b_df = df[df.prompt_version == "B"]
    a_acc = (a_df.pred_label == a_df.label).mean()
    b_acc = (b_df.pred_label == b_df.label).mean()
    print("A accuracy:", a_acc, "n=", len(a_df))
    print("B accuracy:", b_acc, "n=", len(b_df))
    successes = np.array([(a_df.pred_label == a_df.label).sum(),
                          (b_df.pred_label == b_df.label).sum()])
    ns = np.array([len(a_df), len(b_df)])
    z_stat, p_val = proportions_ztest(successes, ns)
    print("z:", z_stat, "p:", p_val)
    return successes, ns

def bootstrap_diff(a_succ, a_n, b_succ, b_n, n_boot=5000):
    p1 = a_succ / a_n
    p2 = b_succ / b_n
    diffs = []
    for _ in range(n_boot):
        s1 = np.random.binomial(a_n, p1)
        s2 = np.random.binomial(b_n, p2)
        diffs.append((s2 / b_n) - (s1 / a_n))
    return np.percentile(diffs, [2.5, 97.5])

successes, ns = report_accuracy(df)
ci_low, ci_high = bootstrap_diff(successes[0], ns[0], successes[1], ns[1])
uplift = (successes[1] / ns[1]) - (successes[0] / ns[0])
print("uplift:", uplift, "95% CI:", (ci_low, ci_high))
This is like grading two batches of exams and comparing pass rates. You get a p-value and a confidence interval.
How long should an A/B test run? Run until you have enough samples for your target lift and until traffic patterns are stable. For many chatbot prompt tests, one week is a practical cadence.
Solution: Prompt versions and optimization steps
Prompt A was our existing prompt. Prompt B added explicit intent-extraction steps and an examples section, and it instructed the model to ask clarifying questions when intent was ambiguous. Think of the change like adding a clear step list and a photo of the final dish to a recipe: the photo and the steps help the cook match the result. Here are the two prompt texts we tested, with notes below on why each line matters.
Prompt A:
You are a helpful assistant. Determine the user's intent and reply with a concise answer.
User: {user_query}
Answer:
Prompt B:
You are a customer support assistant. Your job is to identify the user's intent from the user message and return a one-word intent label from this list: [order_status, refund_request, product_info, cancel_order, other].
If the intent is not clear, ask one concise clarifying question. Do not guess.
Examples:
User: "Where is my order 12345?"
Intent: order_status
User: "I want my money back for order 54321"
Intent: refund_request
User: "Is this product vegan?"
Intent: product_info
User: "{user_query}"
Intent:
"Customer support assistant" sets the domain.
"One-word intent label" forces structured output.
"If the intent is not clear, ask one concise clarifying question" reduces guessing.
Examples provide few-shot guidance.
Small Python snippet to set parameters and call with Prompt B:
resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": PROMPT_B},
        {"role": "user", "content": user_query}
    ],
    temperature=0.2,
    max_tokens=256
)
Add explicit output constraints.
Provide few-shot examples.
Instruct the model to ask clarifying questions for ambiguity.
Lower temperature to reduce hallucination.
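Because Prompt B constrains output to a one-word label or a clarifying question, the response needs a small post-processing step. Here is a hypothetical helper (not taken from our production code) sketching one way to do it:

```python
# Hypothetical post-processing sketch: map Prompt B's output to either a
# valid intent label or a clarifying question to show the user.
VALID_INTENTS = {"order_status", "refund_request", "product_info",
                 "cancel_order", "other"}

def parse_intent(response_text):
    """Return (intent, clarifying_question); exactly one is None."""
    first_line = response_text.strip().splitlines()[0].strip()
    candidate = first_line.lower()
    if candidate.startswith("intent:"):  # tolerate an echoed "Intent:" prefix
        candidate = candidate[len("intent:"):].strip()
    if candidate in VALID_INTENTS:
        return candidate, None
    # Anything that is not a known label is treated as a clarifying question.
    return None, first_line
```

This is also why the evaluation pipeline can score responses by taking the first line of the reply: the constrained format makes automated comparison against ground-truth labels trivial.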
How do you A/B test GPT-4 prompts? Run both prompts on real traffic, keep everything else fixed, and evaluate on labeled ground truth.
Results
Prompt A accuracy: 72.4% (4,344 / 6,000 correct)
Prompt B accuracy: 88.9% (5,334 / 6,000 correct)
Absolute lift: 16.5 percentage points
Relative uplift: 22.8%
Two-proportion z-test p-value: 0.0007
95% CI for difference: [14.1, 19.0] percentage points
Human escalations dropped from 18% to 8.5%
Average response time decreased by 0.9 seconds
This felt like swapping a tea leaf brand and suddenly everyone prefers the new cup by a clear margin.
A accuracy: 0.7240 n=6000
B accuracy: 0.8890 n=6000
z: -3.226 p: 0.0007
uplift: 0.1650 95% CI: [0.141, 0.190]
import matplotlib.pyplot as plt

df['date'] = pd.to_datetime(df.timestamp).dt.date
daily = df.groupby(['date', 'prompt_version']).apply(
    lambda d: (d.pred_label == d.label).mean()).unstack()
daily.plot(kind='line', marker='o')
plt.title("Daily accuracy per prompt version")
plt.ylabel("accuracy")
plt.xlabel("date")
plt.show()
How many samples are needed? The rule of thumb is to power your test for expected lift. For a 10 to 15 point lift at ~70% baseline, a few thousand per arm is typical. How long should an A/B test run for prompts? Run for at least a full weekly cycle to cover day-of-week effects. Seven days is a practical minimum for chatbots.
For a demo of the dashboard and charts, see the [Demo page].
Small prompt changes can give large accuracy gains; the numbers above bear that out. A few lines of examples and an instruction to ask clarifying questions changed behavior dramatically.
Keep one variable changed per test. Control for model version, temperature, and post-processing. If you change multiple things at once, you will not know what caused the improvement.
Use automated metrics plus human labels. Automated heuristics are fast. Human labels catch subtle errors and edge cases.
It is like changing only the steeping time of your tea, so you know exactly which change caused the difference in taste.
Think of prompts like recipe steps. Add a photo and a checklist to remove guesswork.
Choose a primary metric and secondary metrics.
Randomize traffic 50/50.
Keep model and runtime parameters identical.
Log every request, prompt version, and response.
Export responses for human labeling.
Run statistical tests and bootstrap CI.
Roll out the winner gradually and monitor.
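The last step in the checklist can be as simple as a weighted router. Below is a sketch; the ramp percentages are illustrative, not the schedule we actually used.

```python
# Sketch of a gradual rollout: send an increasing share of traffic to the
# winning prompt per stage while monitoring accuracy. The ramp schedule
# below is illustrative, not the one we actually used.
import random

ROLLOUT_SCHEDULE = [0.10, 0.25, 0.50, 1.00]  # share of traffic on Prompt B

def choose_rollout_prompt(stage, rng=random.random):
    """Return 'B' with the current stage's share, otherwise keep 'A'."""
    share_b = ROLLOUT_SCHEDULE[min(stage, len(ROLLOUT_SCHEDULE) - 1)]
    return "B" if rng() < share_b else "A"
```

Advance a stage only after the monitored metrics (accuracy, escalations, latency) hold steady; if anything regresses, drop back a stage, which is a cheap rollback.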
See the [Prompt A/B Testing feature page] for a checklist and templates. Why test prompts in production? Production traffic contains real phrasing, typos, and rare edge cases you never see in synthetic sets. Testing in production shows real wins and real regressions.
LaikaTest sped up labeling and experiment orchestration. Before LaikaTest, labeling and coordination took about 28 days; with it, we reduced time to result to 7 days. LaikaTest helped us run 12,000 queries, manage labels, automate CI checks, and capture per-prompt quality scores. That made the experiment reproducible and auditable.
LaikaTest acted like a reliable sous chef: it made repeated taste checks fast and consistent. It solved the core problem teams face with prompt changes. Logs and observability are great, but they do not tell you which version performs better. LaikaTest ties labels and metrics to prompt versions, so you get clear wins and clear regressions.
Here is an example of how you might call a hypothetical LaikaTest Python SDK to create an evaluation job and fetch quality scores.
import os
from laikatest import Client

client = Client(api_key=os.getenv("LAIKATEST_API_KEY"))

job = client.create_evaluation_job(
    name="intent_ab_test",
    dataset="abtest_log.csv",
    label_schema=["order_status", "refund_request", "product_info", "cancel_order", "other"],
    reviewers=["team@example.com"]
)

client.send_for_labeling(job_id=job.id, batch_name="batch_1")
results = client.get_quality_scores(job_id=job.id)
print(results)
How do you A/B test GPT-4 prompts?
Randomize, log, label, and test. Keep parameters constant and use proper stats.
How long should an A/B test run for prompts?
A week is a good start. Run longer if you need more power or to cover seasonal effects.
Final recommendation for cadence: Start with a weekly A/B test cadence for prompt tweaks. For bigger changes, run a longer experiment. Use LaikaTest to automate labeling and orchestration. This fills the gap left by theoretical tutorials. We provide a reproducible, instrumented example with code you can copy.
Using LaikaTest, we reduced labeling time from 28 days to 7 days. We increased measured accuracy from 72.4% to 88.9%. Teams can replicate the Python and OpenAI API examples in this post to get similar results faster. For background reading, see the [Prompt Engineering & A/B Testing pillar page] and the [Prompt A/B Testing feature page]. For interactive examples, see the [Demo page].
If you run this experiment, keep logs, keep your variables controlled, and always plan for human review. Small changes can be powerful, and good experiments tell you which changes really matter.