Learn how to enhance the reliability of chatbots through iterative testing and measurable improvements in performance.
Naman Arora
January 24, 2026

I remember a late-night demo where the chatbot confidently booked a flight to Mars, not Mumbai. I was sipping chai and scrambling to explain why the user was getting a trip to a planet. I promised the client we would fix it with repeatable tests. The demo was quirky and humbling. That night, I sketched a plan to stop surprises, and this blog follows that plan.
Conversational AI reliability was the problem I needed to solve after that demo. I will walk through the challenge, the plan we ran, the code we used, and the real numbers we improved. This is a chatbot reliability case study, and it is about iterative LLM testing and measurable AI performance improvement.
We had a strange gap. Unit tests showed 90 percent accuracy, but live user flows were only 68 percent reliable. Multi-step dialogs made things worse. Some intents dropped to 51 percent accuracy when the dialog had three or more turns. In the first month of production, we saw 8 percent task aborts and 5 percent incorrect transactions. The vendor documentation listed static accuracy numbers, but it did not show how to maintain that accuracy over time or across long flows.
Think of it like a kitchen robot. The robot makes a perfect omelet alone on a test bench. Put it in a busy kitchen where it must cook rice, salad, and sauce at the same time, and it burns something. The robot is fine in isolation, but it fails in the real rush. That is our chatbot.
Two key questions run through this post: how reliable are AI chatbots, and what are the disadvantages of conversational AI? I will answer both as we go. First, I will show a quick log parser to measure per-turn and per-session accuracy. This is the raw work that reveals where the bot fails.
Sample log lines. Each line is a JSON event. We store the prompt, model response, ground truth, intent, session ID, and turn index.
{"session":"s1","turn":1,"user":"Book a flight to Mumbai","intent":"book_flight","response":"Sure, when do you want to travel?","correct":true}
{"session":"s1","turn":2,"user":"Next week, morning","intent":"book_flight","response":"Booking a flight to Mars on 10 Sep","correct":false,"failure":"wrong_destination"}
{"session":"s2","turn":1,"user":"Cancel my order","intent":"cancel_order","response":"I have cancelled your order","correct":true}
Here is a Python snippet to parse logs, compute per-turn and per-session accuracy, and print the top 10 failure reasons.
import json
from collections import Counter, defaultdict

def analyze_log(path):
    per_turn = Counter()
    per_session = defaultdict(lambda: {"total": 0, "correct": True})
    failures = Counter()
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            per_turn['total'] += 1
            if e.get('correct'):
                per_turn['correct'] += 1
            else:
                failures[e.get('failure', 'unknown')] += 1
                per_turn['incorrect'] += 1
                per_session[e['session']]['correct'] = False
            per_session[e['session']]['total'] += 1
    total_turns = per_turn['total']
    turn_acc = per_turn.get('correct', 0) / total_turns if total_turns else 0
    total_sessions = len(per_session)
    session_success = sum(1 for s in per_session.values() if s['correct'])
    session_acc = session_success / total_sessions if total_sessions else 0
    print(f"Turns: {total_turns}, Turn accuracy: {turn_acc:.2%}")
    print(f"Sessions: {total_sessions}, Session success: {session_acc:.2%}")
    print("Top failure reasons:")
    for k, v in failures.most_common(10):
        print(k, v)

analyze_log("logs.jsonl")
# Expected output with the sample log above:
# Turns: 3, Turn accuracy: 66.67%
# Sessions: 2, Session success: 50.00%
# Top failure reasons:
# wrong_destination 1
For more on debugging and building reliability, see the AI Debugging & Reliability pillar page for frameworks and patterns.
Answering the two questions now, briefly:
How reliable are AI chatbots? It depends. In our case, raw unit tests said 90 percent, but real flows were 68 percent. The gap is real. Measure in production, not just in tests.
What are the disadvantages of conversational AI? The main disadvantages are non-deterministic outputs, context drift in long dialogs, and silent regressions after prompt or model changes. These are fixable with continuous tests.
I designed an iterative plan. We ran 12 iterations over 8 weeks. Each iteration tested 40 prompt variations across 6 critical flows. We used controlled A/B experiments. We logged everything and versioned prompts. This let us isolate changes.
We followed a 10-20-70 rule for engineering effort. That meant 10 percent of time for exploration, 20 percent for validation, and 70 percent on production monitoring and continuous tests. This split made sure the work did not stop at a single good number. It made reliability a running process.
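To make the split concrete, here is a tiny sketch of how the 10-20-70 rule maps onto hours. The 40-hour week is an assumed figure for illustration, not taken from our actual sprint plans.

```python
# Illustrative allocation of a 40-hour engineering week under the 10-20-70 rule.
# The 40-hour figure is an assumption for the example.
WEEKLY_HOURS = 40
split = {"exploration": 0.10, "validation": 0.20, "monitoring_and_tests": 0.70}

hours = {bucket: WEEKLY_HOURS * share for bucket, share in split.items()}
for bucket, h in hours.items():
    print(f"{bucket}: {h:.0f}h")
```

The point of writing it down, even this simply, is that the 70 percent bucket is the largest by design: monitoring and continuous tests get the majority of the effort.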
Analogy time. Tuning the prompts is like tuning a radio knob. You test small frequency changes, listen, and then automate the tuner. You do not guess which frequency is better. You run a few short checks, and then the tuner runs by itself.
Here is pseudocode to run A/B experiments. The code sends batched prompts to two model configurations. It captures structured JSON logs. It computes per-variant success rate and writes results to CSV. I also show a small LaikaTest API pseudo-call to schedule iterations.
import csv
import requests
from typing import List

def call_model(config, prompt):
    # Replace with a real model call
    return {"text": "response text", "score": 0.9}

def run_ab_batch(prompts: List[str], config_a, config_b, out_csv="ab_results.csv"):
    rows = []
    for p in prompts:
        a = call_model(config_a, p)
        b = call_model(config_b, p)
        # Placeholder evaluation: success if the response contains an expected token
        success_a = "ok" in a['text']
        success_b = "ok" in b['text']
        rows.append({"prompt": p, "variant": "A", "success": int(success_a)})
        rows.append({"prompt": p, "variant": "B", "success": int(success_b)})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "variant", "success"])
        writer.writeheader()
        for r in rows:
            writer.writerow(r)

# LaikaTest pseudo-call to schedule iterations
def schedule_laikatest_iteration(project, prompt_set_id, start_time):
    requests.post("https://api.laikatest.example/schedule", json={
        "project": project,
        "prompt_set": prompt_set_id,
        "start_time": start_time,
        "duration_hours": 24
    })
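The batch runner above only writes raw rows. A small follow-up step aggregates them into per-variant success rates; this is a minimal sketch that assumes the `ab_results.csv` layout shown above (it tolerates both boolean and 0/1 encodings of the success column).

```python
import csv
from collections import defaultdict

def summarize_variants(path="ab_results.csv"):
    # Tally successes per variant from the raw A/B rows.
    counts = defaultdict(lambda: {"total": 0, "success": 0})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            v = counts[row["variant"]]
            v["total"] += 1
            # csv stores every field as a string, so accept common truthy encodings
            v["success"] += row["success"] in ("True", "true", "1")
    return {variant: c["success"] / c["total"] for variant, c in counts.items()}
```

With the results in hand, the variant comparison becomes a single dictionary lookup, e.g. `summarize_variants()["A"]`, which is what feeds the iteration charts later in this post.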
For more on prompts and controlled experiments, see our Prompt Engineering & A/B Testing feature page.
Two quick questions answered:
What is the 10-20-70 rule for AI? It is a split of engineering effort. 10 percent goes to exploration, 20 percent to validation, and 70 percent to production monitoring and continuous tests.
Why can't you trust ChatGPT? ChatGPT is powerful, but it is non-deterministic. It can give different answers to similar prompts. It can also hallucinate. Do not trust it without tests and checks. Use experiments and monitoring.
We ran 480 prompt variants and kept the top 24. We rolled 6 prompt families to production. We then added 18 automated acceptance checks. These checks covered slot validation, intent correctness, hallucination thresholds, and safety flags.
After six iterations, we reduced the ambiguous intent rate from 22 percent to 6 percent. Continuous tests caught regressions quickly. That is the vendor gap I mentioned earlier. Vendors show a final number, but they do not show how to keep that number.
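To show what one of those acceptance checks looks like, here is a sketch of a slot-validation check in the spirit of the ones we ran. The event shape follows the log lines shown earlier; the `slots` field, the allowed-city set, and the function name are illustrative, not our production schema.

```python
# Sketch of one acceptance check: destination slot validation for the
# booking flow. ALLOWED_DESTINATIONS and the slots field are illustrative.
ALLOWED_DESTINATIONS = {"mumbai", "delhi", "bengaluru", "chennai"}

def check_destination_slot(event: dict) -> bool:
    """Fail if the bot's response drops or swaps the requested destination,
    or books somewhere outside the supported set."""
    response = event.get("response", "").lower()
    requested = event.get("slots", {}).get("destination", "").lower()
    if not requested:
        return True  # nothing to validate on this turn
    if requested not in response:
        return False  # bot dropped or swapped the destination (our Mars bug)
    return requested in ALLOWED_DESTINATIONS
```

A check like this would have caught the Mars booking from the opening story: the user asked for Mumbai, the response mentioned Mars, and the check fails the turn before it ships.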
Analogy. This is like A/B testing recipes and then writing down exact steps. Chefs test different salt levels, then they write the final recipe so any cook can repeat the dish.
Here is Python code to evaluate prompt variants and plot reliability by iteration. It includes:
1) a line chart of overall reliability across iterations
2) a per-intent bar chart
3) a bootstrap test to show significance between iteration A and B
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load evaluator logs
df = pd.read_csv("ab_results.csv")  # columns: prompt, variant, success
df['success'] = df['success'].astype(int)

# 1. Overall reliability per iteration (iteration tag encoded in the prompt ID)
df['iteration'] = df['prompt'].str.extract(r'iter(\d+)', expand=False).astype(int)
summary = df.groupby('iteration')['success'].mean().reset_index()
plt.plot(summary['iteration'], summary['success'])
plt.xlabel("Iteration")
plt.ylabel("Reliability")
plt.title("Reliability by Iteration")
plt.grid(True)
plt.savefig("reliability_by_iteration.png")
plt.close()

# 2. Per-intent bar chart (intent tag encoded in the prompt ID)
df['intent'] = df['prompt'].str.extract(r'intent_([a-z_]+?)_v\d', expand=False)
intent_summary = df.groupby('intent')['success'].mean().sort_values()
intent_summary.plot(kind='barh')
plt.xlabel("Success rate")
plt.title("Per-intent success rate")
plt.savefig("per_intent.png")
plt.close()

# 3. Bootstrap test for A vs B on a chosen iteration (here, iteration 8)
a = df[(df['variant'] == 'A') & (df['iteration'] == 8)]['success'].values
b = df[(df['variant'] == 'B') & (df['iteration'] == 8)]['success'].values

def bootstrap_diff(a, b, iterations=1000):
    diffs = []
    n = len(a)
    m = len(b)
    for _ in range(iterations):
        sa = np.random.choice(a, n, replace=True).mean()
        sb = np.random.choice(b, m, replace=True).mean()
        diffs.append(sa - sb)
    diffs = np.array(diffs)
    return np.mean(diffs), np.percentile(diffs, [2.5, 97.5])

mean_diff, ci = bootstrap_diff(a, b)
print("Mean diff A-B", mean_diff, "95% CI", ci)
Sample JSONL records that feed the evaluator:
{"prompt":"iter1_intent_book_flight_v1_user1", "variant":"A", "success":1}
{"prompt":"iter1_intent_book_flight_v2_user2", "variant":"B", "success":0}
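Because the evaluator above reads `ab_results.csv`, the JSONL records need a one-time conversion. This is a minimal sketch assuming the record shape shown in the two sample lines:

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path="ab_results.csv"):
    # Flatten JSONL evaluator records into the CSV layout the
    # analysis code expects: prompt, variant, success.
    with open(jsonl_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["prompt", "variant", "success"])
        writer.writeheader()
        for line in src:
            rec = json.loads(line)
            writer.writerow({"prompt": rec["prompt"],
                             "variant": rec["variant"],
                             "success": int(rec["success"])})
```

Keeping the raw JSONL and deriving the CSV from it means the analysis can always be regenerated from the original logs, which matters for the reproducibility artifacts listed later.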
See our Prompt Engineering & A/B Testing feature page for more integration ideas.
Quick answers:
Why can't you trust ChatGPT? It is not deterministic, and it can hallucinate or change behavior with hidden context. You need repeatable tests and monitoring.
How reliable are AI chatbots? They can be reliable when you measure and run continuous tests. In our case, we improved from 68 percent to over 90 percent.
After eight weeks, our numbers changed significantly. Overall conversational AI reliability rose from 68 percent to 92 percent. That is a 24 percentage point increase. Task completion rate improved from 82 percent to 95 percent. Aborts dropped from 8 percent to 2.5 percent. False recommendation incidents fell by 63 percent. Refund requests related to bad answers fell by 47 percent. Our regression rate between releases dropped from 4 percent to 0.8 percent. That last number came from the continuous tests and the rollback rules we added.
Think of it like tracking fuel efficiency before and after a tune-up. Before the tune-up, the car gave you 12 km per liter. After, it gave you 16 km per liter. You can show the numbers and measure the cost savings.
Here is Python plotting code that reads iteration CSV and generates:
1) trendline of reliability with confidence intervals
2) heatmap of intent-by-turn failure rates
3) sample console logs for monitoring alerts
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
iters = pd.read_csv("iteration_metrics.csv") # iteration, reliability, lower_ci, upper_ci
plt.fill_between(iters['iteration'], iters['lower_ci'], iters['upper_ci'], alpha=0.2)
plt.plot(iters['iteration'], iters['reliability'], marker='o')
plt.title("Reliability Trend with CI")
plt.xlabel("Iteration")
plt.ylabel("Reliability")
plt.savefig("trend_ci.png")
plt.close()
# Heatmap of intent-by-turn failure rates
heat = pd.read_csv("intent_turn_failures.csv", index_col=0) # intents rows, turns cols
sns.heatmap(heat, annot=True, fmt=".2f", cmap="Reds")
plt.title("Intent by Turn Failure Rates")
plt.savefig("intent_turn_heatmap.png")
plt.close()
# Simple significance test
before = pd.read_csv("before_sample.csv")['success']
after = pd.read_csv("after_sample.csv")['success']
tstat, pval = stats.ttest_ind(before, after, equal_var=False)
print("t-stat", tstat, "p-value", pval)
# Console monitoring alerts sample
print("[ALERT] iteration 5: abort_rate=4.8% > threshold 3.0%")
print("[INFO] rollback triggered for prompt_family booking_v2 due to regression")
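The alert lines above came from a simple threshold rule. Here is a minimal sketch of that check; the threshold values and metric names are illustrative, not our production configuration.

```python
# Minimal threshold-based monitoring check behind the console alerts above.
# The 3.0% abort threshold and 1.0% regression threshold are illustrative.
THRESHOLDS = {"abort_rate": 0.03, "regression_rate": 0.01}

def check_metrics(iteration: int, metrics: dict) -> list:
    """Return alert strings for any metric above its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"[ALERT] iteration {iteration}: "
                          f"{name}={value:.1%} > threshold {limit:.1%}")
    return alerts
```

In our pipeline, a non-empty alert list is what triggered the rollback rule for the offending prompt family, so the check doubles as the gate between an experiment and production.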
All of these artifacts came from scripts that ran daily. The stats snippet shows whether the improvement is significant. The p-value was below 0.01 in our runs.
Link back to the AI Debugging & Reliability pillar page for the architectures we used to collect and store metrics.
Answering the repeated questions:
How reliable are AI chatbots? With a continuous testing pipeline, they can reach and stay above 90 percent reliability for many flows. Without that pipeline, the numbers can drop fast.
What are the disadvantages of conversational AI? The same disadvantages remain. Non-determinism, hallucination, context drift, and hidden regressions. The solution is not to ignore them. You must test continuously.
I include a minimal list of artifacts you should keep. This makes the work reproducible.
Sample log schema
Prompt variant CSV
Evaluation scripts
A reproducible notebook that generates the charts
CI config to run tests on each model and prompt change
Analogy. This is like a recipe card. Include all measurements, times, and oven settings so others can replicate the dish exactly.
Below is a Jupyter notebook outline plus a minimal CI YAML.
Notebook outline in Python:
# notebook: reliability_notebook.ipynb
# 1. Data ingestion
import pandas as pd
logs = pd.read_json("logs.jsonl", lines=True)
# 2. Metrics computation
per_turn = logs.groupby('turn')['correct'].mean()
per_session = logs.groupby('session')['correct'].all().mean()
# 3. Plotting
import matplotlib.pyplot as plt
per_turn.plot()
plt.savefig("turns.png")
# 4. Bootstrap test
# code as above
# 5. Export results
per_intent = logs.groupby('intent')['correct'].mean()
per_intent.to_csv("per_intent.csv")
Sample CI YAML that triggers LaikaTest style runs. This is minimal and copy friendly.
name: laikatest-ci
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  run-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install deps
        run: pip install -r requirements.txt
      - name: Run LaikaTest experiments
        env:
          LAIKATEST_API_KEY: ${{ secrets.LAIKATEST_API_KEY }}
        run: |
          python scripts/schedule_laikatest.py --project mybot --promptset prompts/v2.csv
      - name: Run evaluation
        run: python scripts/evaluate_results.py
Sample log lines and expected plots were shown earlier. The notebook will generate the same PNG files we displayed.
See the Prompt Engineering & A/B Testing feature page for more CI patterns.
To wrap up, the numbers mattered. We increased conversational AI reliability from 68 percent to 92 percent. Task completion rose from 82 percent to 95 percent. Aborts dropped from 8 percent to 2.5 percent. False recommendations fell by 63 percent. Refunds tied to bad answers fell by 47 percent. Regression rate dropped from 4 percent to 0.8 percent.
The practical step that turned our late-night demo into consistent reliability was one repeatable thing. We ran continuous, automated, iterative testing on real traffic. We scheduled prompt A/B tests. We collected structured logs. We ran the evaluation scripts I showed. We used rollback rules and monitoring alerts.
LaikaTest fits this workflow. It is an AI infrastructure tool for production LLM systems that helps teams experiment, evaluate, and debug prompts and agents safely in real usage. It solves real problems teams have. Teams change prompts or agent logic and do not know if behavior really improved. AI outputs are non-deterministic, so "it felt better" is not evidence. Observability tools show logs, but they do not tell which version performed better. Silent regressions happen after prompt or model changes.
What LaikaTest enables is the practical work we did. Prompt A/B testing runs multiple prompt variants on real traffic and compares outcomes. Agent experimentation compares different agent setups as experiments, not guesses. One-line observability and tracing shows which prompt version was used, the model outputs, tool calls, costs, and latency. The evaluation feedback loop collects human or automated scores tied to the exact prompt version.
I am not being salesy. I am saying this worked. If you adopt a LaikaTest style pipeline, you can run scheduled prompt experiments, capture structured logs, and feed the evaluation scripts in this case study. That is the repeatable practice that fixed my Mars booking demo. If you want to reproduce this, try an 8-week LaikaTest pilot and run the 12 iterations I described. You will get numbers, not guesses.
If you want the code and notebook, everything in this blog is copy-paste ready. Start with the log parser, then run the A/B scheduler. Keep a CI job that schedules LaikaTest runs on each prompt change. That is how you move from a quirky late-night demo to a reliable conversational system that your users trust.
For more on the debugging tools and reliability frameworks we used, see the AI Debugging & Reliability pillar page. For specifics on prompts and A/B tooling, see the Prompt Engineering & A/B Testing feature page.