Learn how to enhance the reliability of chatbots through iterative testing and measurable improvements in performance.
Naman Arora
January 24, 2026

I remember a late-night demo where the chatbot confidently booked a flight to Mars, not Mumbai. I was sipping chai and scrambling to explain why the user was getting a trip to a planet. I promised the client we would fix it with repeatable tests. The demo was quirky and humbling. That night, I sketched a plan to stop surprises, and this blog follows that plan.
Conversational AI reliability was the problem I needed to solve after that demo. I will walk through the challenge, the plan we ran, the code we used, and the real numbers we improved. This is a chatbot reliability case study, and it is about iterative LLM testing and measurable AI performance improvement.
We had a strange gap. Unit tests showed 90 percent accuracy, but live user flows were only 68 percent reliable. Multi-step dialogs made things worse. Some intents dropped to 51 percent accuracy when the dialog had three or more turns. In the first month of production, we saw 8 percent task aborts and 5 percent incorrect transactions. The vendor documentation listed static accuracy numbers, but it did not show how to maintain that accuracy over time or across long flows.
Think of it like a kitchen robot. The robot makes a perfect omelet alone on a test bench. Put it in a busy kitchen where it must cook rice, salad, and sauce at the same time, and it burns something. The robot is fine in isolation, but it fails in the real rush. That is our chatbot.
Two key questions run through this post: how reliable are AI chatbots, and what are the disadvantages of conversational AI? I will answer both as we go. First, I will show a quick log parser to measure per-turn and per-session accuracy. This is the raw work that reveals where the bot fails.
Sample log lines. Each line is a JSON event. We store the prompt, model response, ground truth, intent, session ID, and turn index.
{"session":"s1","turn":1,"user":"Book a flight to Mumbai","intent":"book_flight","response":"Sure, when do you want to travel?","correct":true}
{"session":"s1","turn":2,"user":"Next week, morning","intent":"book_flight","response":"Booking a flight to Mars on 10 Sep","correct":false,"failure":"wrong_destination"}
{"session":"s2","turn":1,"user":"Cancel my order","intent":"cancel_order","response":"I have cancelled your order","correct":true}
Here is a Python snippet to parse logs, compute per-turn and per-session accuracy, and print the top 10 failure reasons.
import json
from collections import Counter, defaultdict

def analyze_log(path):
    per_turn = Counter()
    per_session = defaultdict(lambda: {"total": 0, "correct": True})
    failures = Counter()
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            per_turn['total'] += 1
            if e.get('correct'):
                per_turn['correct'] += 1
            else:
                failures[e.get('failure', 'unknown')] += 1
                per_turn['incorrect'] += 1
                per_session[e['session']]['correct'] = False
            per_session[e['session']]['total'] += 1
    total_turns = per_turn['total']
    turn_acc = per_turn.get('correct', 0) / total_turns if total_turns else 0
    total_sessions = len(per_session)
    session_success = sum(1 for s in per_session.values() if s['correct'])
    session_acc = session_success / total_sessions if total_sessions else 0
    print(f"Turns: {total_turns}, Turn accuracy: {turn_acc:.2%}")
    print(f"Sessions: {total_sessions}, Session success: {session_acc:.2%}")
    print("Top failure reasons:")
    for k, v in failures.most_common(10):
        print(k, v)

analyze_log("logs.jsonl")
# Expected output with the sample log above:
# Turns: 3, Turn accuracy: 66.67%
# Sessions: 2, Session success: 50.00%
# Top failure reasons:
# wrong_destination 1
For more on debugging and building reliability, see the AI Debugging & Reliability pillar page for frameworks and patterns.
Answering the two questions now, briefly:
How reliable are AI chatbots? It depends. In our case, raw unit tests said 90 percent, but real flows were 68 percent. The gap is real. Measure in production, not just in tests.
What are the disadvantages of conversational AI? The main disadvantages are non-deterministic outputs, context drift in long dialogs, and silent regressions after prompt or model changes. These are fixable with continuous tests.
I designed an iterative plan. We ran 12 iterations over 8 weeks. Each iteration tested 40 prompt variations across 6 critical flows. We used controlled A/B experiments. We logged everything and versioned prompts. This let us isolate changes.
We followed a 10-20-70 rule for engineering effort. That meant 10 percent of time for exploration, 20 percent for validation, and 70 percent on production monitoring and continuous tests. This split made sure the work did not stop at a single good number. It made reliability a running process.
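To make the split concrete, here is a tiny sketch of how the 10-20-70 rule maps onto hours. The 40-hour week is an assumed figure for illustration, not taken from our actual sprint plans.

```python
# Illustrative allocation of a 40-hour engineering week under the 10-20-70 rule.
# The 40-hour figure is an assumption for the example.
WEEKLY_HOURS = 40
split = {"exploration": 0.10, "validation": 0.20, "monitoring_and_tests": 0.70}

hours = {bucket: WEEKLY_HOURS * share for bucket, share in split.items()}
for bucket, h in hours.items():
    print(f"{bucket}: {h:.0f}h")
```

The point of writing it down, even this simply, is that the 70 percent bucket is the largest by design: monitoring and continuous tests get the majority of the effort.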
Analogy time. Tuning the prompts is like tuning a radio knob. You test small frequency changes, listen, and then automate the tuner. You do not guess which frequency is better. You run a few short checks, and then the tuner runs by itself.
Here is pseudocode to run A/B experiments. The code sends batched prompts to two model configurations. It captures structured JSON logs. It computes per-variant success rate and writes results to CSV. I also show a small LaikaTest API pseudo-call to schedule iterations.
import csv
import requests
from typing import List

def call_model(config, prompt):
    # Replace with a real model call
    return {"text": "response text", "score": 0.9}

def run_ab_batch(prompts: List[str], config_a, config_b, out_csv="ab_results.csv"):
    rows = []
    for p in prompts:
        a = call_model(config_a, p)
        b = call_model(config_b, p)
        # Placeholder evaluation: success if the response contains an expected token
        success_a = "ok" in a['text']
        success_b = "ok" in b['text']
        rows.append({"prompt": p, "variant": "A", "success": int(success_a)})
        rows.append({"prompt": p, "variant": "B", "success": int(success_b)})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "variant", "success"])
        writer.writeheader()
        for r in rows:
            writer.writerow(r)

# LaikaTest pseudo-call to schedule iterations
def schedule_laikatest_iteration(project, prompt_set_id, start_time):
    requests.post("https://api.laikatest.example/schedule", json={
        "project": project,
        "prompt_set": prompt_set_id,
        "start_time": start_time,
        "duration_hours": 24
    })
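The batch runner above only writes raw rows. A small follow-up step aggregates them into per-variant success rates; this is a minimal sketch that assumes the `ab_results.csv` layout shown above (it tolerates both boolean and 0/1 encodings of the success column).

```python
import csv
from collections import defaultdict

def summarize_variants(path="ab_results.csv"):
    # Tally successes per variant from the raw A/B rows.
    counts = defaultdict(lambda: {"total": 0, "success": 0})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            v = counts[row["variant"]]
            v["total"] += 1
            # csv stores every field as a string, so accept common truthy encodings
            v["success"] += row["success"] in ("True", "true", "1")
    return {variant: c["success"] / c["total"] for variant, c in counts.items()}
```

With the results in hand, the variant comparison becomes a single dictionary lookup, e.g. `summarize_variants()["A"]`, which is what feeds the iteration charts later in this post.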
For more on prompts and controlled experiments, see our Prompt Engineering & A/B Testing feature page.
Two quick questions answered:
What is the 10-20-70 rule for AI? It is a split of engineering effort. 10 percent goes to exploration, 20 percent to validation, and 70 percent to production monitoring and continuous tests.
Why can't you trust ChatGPT? ChatGPT is powerful, but it is non-deterministic. It can give different answers to similar prompts. It can also hallucinate. Do not trust it without tests and checks. Use experiments and monitoring.
We ran 480 prompt variants and kept the top 24. We rolled 6 prompt families to production. We then added 18 automated acceptance checks. These checks covered slot validation, intent correctness, hallucination thresholds, and safety flags.
After six iterations, we reduced the ambiguous intent rate from 22 percent to 6 percent. Continuous tests caught regressions quickly. That is the vendor gap I mentioned earlier. Vendors show a final number, but they do not show how to keep that number.
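To show what one of those acceptance checks looks like, here is a sketch of a slot-validation check in the spirit of the ones we ran. The event shape follows the log lines shown earlier; the `slots` field, the allowed-city set, and the function name are illustrative, not our production schema.

```python
# Sketch of one acceptance check: destination slot validation for the
# booking flow. ALLOWED_DESTINATIONS and the slots field are illustrative.
ALLOWED_DESTINATIONS = {"mumbai", "delhi", "bengaluru", "chennai"}

def check_destination_slot(event: dict) -> bool:
    """Fail if the bot's response drops or swaps the requested destination,
    or books somewhere outside the supported set."""
    response = event.get("response", "").lower()
    requested = event.get("slots", {}).get("destination", "").lower()
    if not requested:
        return True  # nothing to validate on this turn
    if requested not in response:
        return False  # bot dropped or swapped the destination (our Mars bug)
    return requested in ALLOWED_DESTINATIONS
```

A check like this would have caught the Mars booking from the opening story: the user asked for Mumbai, the response mentioned Mars, and the check fails the turn before it ships.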
Analogy. This is like A/B testing recipes and then writing down exact steps. Chefs test different salt levels, then they write the final recipe so any cook can repeat the dish.
Here is Python code to evaluate prompt variants and plot reliability by iteration. It includes:
1) a line chart of overall reliability across iterations
2) a per-intent bar chart
3) a bootstrap test to show significance between iteration A and B
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load evaluator logs
df = pd.read_csv("ab_results.csv")  # columns: prompt, variant, success
df['success'] = df['success'].astype(int)

# 1. Overall reliability per iteration (iteration tag encoded in the prompt ID)
df['iteration'] = df['prompt'].str.extract(r'iter(\d+)', expand=False).astype(int)
summary = df.groupby('iteration')['success'].mean().reset_index()
plt.plot(summary['iteration'], summary['success'])
plt.xlabel("Iteration")
plt.ylabel("Reliability")
plt.title("Reliability by Iteration")
plt.grid(True)
plt.savefig("reliability_by_iteration.png")
plt.close()

# 2. Per-intent bar chart (intent tag encoded in the prompt ID)
df['intent'] = df['prompt'].str.extract(r'intent_([a-z_]+?)_v\d', expand=False)
intent_summary = df.groupby('intent')['success'].mean().sort_values()
intent_summary.plot(kind='barh')
plt.xlabel("Success rate")
plt.title("Per-intent success rate")
plt.savefig("per_intent.png")
plt.close()

# 3. Bootstrap test for A vs B on a chosen iteration (here, iteration 8)
a = df[(df['variant'] == 'A') & (df['iteration'] == 8)]['success'].values
b = df[(df['variant'] == 'B') & (df['iteration'] == 8)]['success'].values

def bootstrap_diff(a, b, iterations=1000):
    diffs = []
    n = len(a)
    m = len(b)
    for _ in range(iterations):
        sa = np.random.choice(a, n, replace=True).mean()
        sb = np.random.choice(b, m, replace=True).mean()
        diffs.append(sa - sb)
    diffs = np.array(diffs)
    return np.mean(diffs), np.percentile(diffs, [2.5, 97.5])

mean_diff, ci = bootstrap_diff(a, b)
print("Mean diff A-B", mean_diff, "95% CI", ci)
Sample JSONL records that feed the evaluator:
{"prompt":"iter1_intent_book_flight_v1_user1", "variant":"A", "success":1}
{"prompt":"iter1_intent_book_flight_v2_user2", "variant":"B", "success":0}
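Because the evaluator above reads `ab_results.csv`, the JSONL records need a one-time conversion. This is a minimal sketch assuming the record shape shown in the two sample lines:

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path="ab_results.csv"):
    # Flatten JSONL evaluator records into the CSV layout the
    # analysis code expects: prompt, variant, success.
    with open(jsonl_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["prompt", "variant", "success"])
        writer.writeheader()
        for line in src:
            rec = json.loads(line)
            writer.writerow({"prompt": rec["prompt"],
                             "variant": rec["variant"],
                             "success": int(rec["success"])})
```

Keeping the raw JSONL and deriving the CSV from it means the analysis can always be regenerated from the original logs, which matters for the reproducibility artifacts listed later.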
See our Prompt Engineering & A/B Testing feature page for more integration ideas.
Quick answers:
Why can't you trust ChatGPT? It is not deterministic, and it can hallucinate or change behavior with hidden context. You need repeatable tests and monitoring.
How reliable are AI chatbots? They can be reliable when you measure and run continuous tests. In our case, we improved from 68 percent to over 90 percent.
After eight weeks, our numbers changed significantly. Overall conversational AI reliability rose from 68 percent to 92 percent. That is a 24 percentage point increase. Task completion rate improved from 82 percent to 95 percent. Aborts dropped from 8 percent to 2.5 percent. False recommendation incidents fell by 63 percent. Refund requests related to bad answers fell by 47 percent. Our regression rate between releases dropped from 4 percent to 0.8 percent. That last number came from the continuous tests and the rollback rules we added.
Think of it like tracking fuel efficiency before and after a tune-up. Before the tune-up, the car gave you 12 km per liter. After, it gave you 16 km per liter. You can show the numbers and measure the cost savings.
Here is Python plotting code that reads iteration CSV and generates:
1) trendline of reliability with confidence intervals
2) heatmap of intent-by-turn failure rates
3) sample console logs for monitoring alerts
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
iters = pd.read_csv("iteration_metrics.csv") # iteration, reliability, lower_ci, upper_ci
plt.fill_between(iters['iteration'], iters['lower_ci'], iters['upper_ci'], alpha=0.2)
plt.plot(iters['iteration'], iters['reliability'], marker='o')
plt.title("Reliability Trend with CI")
plt.xlabel("Iteration")
plt.ylabel("Reliability")
plt.savefig("trend_ci.png")
plt.close()
# Heatmap of intent-by-turn failure rates
heat = pd.read_csv("intent_turn_failures.csv", index_col=0) # intents rows, turns cols
sns.heatmap(heat, annot=True, fmt=".2f", cmap="Reds")
plt.title("Intent by Turn Failure Rates")
plt.savefig("intent_turn_heatmap.png")
plt.close()
# Simple significance test
before = pd.read_csv("before_sample.csv")['success']
after = pd.read_csv("after_sample.csv")['success']
tstat, pval = stats.ttest_ind(before, after, equal_var=False)
print("t-stat", tstat, "p-value", pval)
# Console monitoring alerts sample
print("[ALERT] iteration 5: abort_rate=4.8% > threshold 3.0%")
print("[INFO] rollback triggered for prompt_family booking_v2 due to regression")
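The alert lines above came from a simple threshold rule. Here is a minimal sketch of that check; the threshold values and metric names are illustrative, not our production configuration.

```python
# Minimal threshold-based monitoring check behind the console alerts above.
# The 3.0% abort threshold and 1.0% regression threshold are illustrative.
THRESHOLDS = {"abort_rate": 0.03, "regression_rate": 0.01}

def check_metrics(iteration: int, metrics: dict) -> list:
    """Return alert strings for any metric above its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"[ALERT] iteration {iteration}: "
                          f"{name}={value:.1%} > threshold {limit:.1%}")
    return alerts
```

In our pipeline, a non-empty alert list is what triggered the rollback rule for the offending prompt family, so the check doubles as the gate between an experiment and production.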
All of these artifacts came from scripts that ran daily. The stats snippet shows whether the improvement is significant. The p-value was below 0.01 in our runs.
Link back to the AI Debugging & Reliability pillar page for the architectures we used to collect and store metrics.
Answering the repeated questions:
How reliable are AI chatbots? With a continuous testing pipeline, they can reach and stay above 90 percent reliability for many flows. Without that pipeline, the numbers can drop fast.
What are the disadvantages of conversational AI? The same disadvantages remain. Non-determinism, hallucination, context drift, and hidden regressions. The solution is not to ignore them. You must test continuously.
I include a minimal list of artifacts you should keep. This makes the work reproducible.
Sample log schema
Prompt variant CSV
Evaluation scripts
A reproducible notebook that generates the charts
CI config to run tests on each model and prompt change
Analogy. This is like a recipe card. Include all measurements, times, and oven settings so others can replicate the dish exactly.
Below is a Jupyter notebook outline plus a minimal CI YAML.
Notebook outline in Python:
# notebook: reliability_notebook.ipynb
# 1. Data ingestion
import pandas as pd
logs = pd.read_json("logs.jsonl", lines=True)
# 2. Metrics computation
per_turn = logs.groupby('turn')['correct'].mean()
per_session = logs.groupby('session')['correct'].all().mean()
# 3. Plotting
import matplotlib.pyplot as plt
per_turn.plot()
plt.savefig("turns.png")
# 4. Bootstrap test
# code as above
# 5. Export results
per_intent = logs.groupby('intent')['correct'].mean()
per_intent.to_csv("per_intent.csv")
Sample CI YAML that triggers LaikaTest style runs. This is minimal and copy friendly.
name: laikatest-ci
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  run-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install deps
        run: pip install -r requirements.txt
      - name: Run LaikaTest experiments
        env:
          LAIKATEST_API_KEY: ${{ secrets.LAIKATEST_API_KEY }}
        run: |
          python scripts/schedule_laikatest.py --project mybot --promptset prompts/v2.csv
      - name: Run evaluation
        run: python scripts/evaluate_results.py
Sample log lines and expected plots were shown earlier. The notebook will generate the same PNG files we displayed.
See the Prompt Engineering & A/B Testing feature page for more CI patterns.
To wrap up, the numbers mattered. We increased conversational AI reliability from 68 percent to 92 percent. Task completion rose from 82 percent to 95 percent. Aborts dropped from 8 percent to 2.5 percent. False recommendations fell by 63 percent. Refunds tied to bad answers fell by 47 percent. Regression rate dropped from 4 percent to 0.8 percent.
The practical step that turned our late-night demo into consistent reliability was one repeatable thing. We ran continuous, automated, iterative testing on real traffic. We scheduled prompt A/B tests. We collected structured logs. We ran the evaluation scripts I showed. We used rollback rules and monitoring alerts.
LaikaTest fits this workflow. It is an AI infrastructure tool for production LLM systems that helps teams experiment, evaluate, and debug prompts and agents safely in real usage. It solves real problems teams have. Teams change prompts or agent logic and do not know if behavior really improved. AI outputs are non-deterministic, so "it felt better" is not evidence. Observability tools show logs, but they do not tell which version performed better. Silent regressions happen after prompt or model changes.
What LaikaTest enables is the practical work we did. Prompt A/B testing runs multiple prompt variants on real traffic and compares outcomes. Agent experimentation compares different agent setups as experiments, not guesses. One-line observability and tracing shows which prompt version was used, the model outputs, tool calls, costs, and latency. The evaluation feedback loop collects human or automated scores tied to the exact prompt version.
I am not being salesy. I am saying this worked. If you adopt a LaikaTest style pipeline, you can run scheduled prompt experiments, capture structured logs, and feed the evaluation scripts in this case study. That is the repeatable practice that fixed my Mars booking demo. If you want to reproduce this, try an 8-week LaikaTest pilot and run the 12 iterations I described. You will get numbers, not guesses.
If you want the code and notebook, everything in this blog is copy-paste ready. Start with the log parser, then run the A/B scheduler. Keep a CI job that schedules LaikaTest runs on each prompt change. That is how you move from a quirky late-night demo to a reliable conversational system that your users trust.
For more on the debugging tools and reliability frameworks we used, see the AI Debugging & Reliability pillar page. For specifics on prompts and A/B tooling, see the Prompt Engineering & A/B Testing feature page.