Learn how to containerize LLM agents for better deployment and management. Step-by-step instructions provided.
Naman Arora
January 24, 2026

# Containerizing LLM Agents Guide
[ANECDOTE_PLACEHOLDER]
Containerizing LLM agents is a practical path from messy local scripts to a reproducible production service. In this guide I walk through my approach: building Docker images for LLM agents, deploying them on Kubernetes, and making the containerized services observable with LaikaTest tracing. My aim is procedural help you can copy and adapt.
## Prerequisites and Expected Outcomes
Before you start, make sure your workstation and cluster are ready.
- **Prerequisites**
- Docker installed and working. Verify with `docker --version`.
- kubectl configured for your cluster. Verify with `kubectl version --client`.
- Access to a Kubernetes cluster and permissions to create workloads.
- Basic LLM agent code that can respond on an HTTP port.
- Container registry credentials. Example commands: `gcloud auth configure-docker` or `docker login registry.example.com`.
- Optional GPU nodes, with drivers installed. Verify with `nvidia-smi` on the node or from a machine that can see the GPU.
**Commands to verify environment**
- `docker --version`
- `kubectl version --client`
- `gcloud auth configure-docker`
- `aws ecr get-login-password | docker login --username AWS --password-stdin <account>.dkr.ecr.us-west-2.amazonaws.com`
- `nvidia-smi`
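If you want to automate these checks, a minimal stdlib-only sketch is below. The tool names mirror the commands above; note that `kubectl` reports its version via `kubectl version --client` rather than a plain `--version` flag.

```python
import shutil
import subprocess

# Tools this guide relies on; nvidia-smi is only needed for GPU workloads.
REQUIRED = ["docker", "kubectl"]
OPTIONAL = ["nvidia-smi"]

def tool_available(name: str) -> bool:
    """Return True if the executable is on PATH."""
    return shutil.which(name) is not None

def tool_version(name: str, flag: str = "--version") -> str:
    """Return the tool's version output, or an empty string on failure."""
    try:
        out = subprocess.run([name, flag], capture_output=True, text=True, timeout=10)
        return out.stdout.strip() or out.stderr.strip()
    except (OSError, subprocess.TimeoutExpired):
        return ""

if __name__ == "__main__":
    for name in REQUIRED:
        print(f"{name}: {'ok' if tool_available(name) else 'MISSING'}")
    for name in OPTIONAL:
        if tool_available(name):
            print(f"{name}: ok ({tool_version(name)[:40]})")
```

Run it once on the workstation and once on a node that should see the GPU.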
**Outcomes you will get**
- A container image of the LLM agent pushed to a registry.
- A working Kubernetes deployment with probes and resource requests.
- Autoscaling rules that respond to load.
- Basic tracing and observability integrated with LaikaTest.
**Link:** [LLMOps & Production AI pillar page](#)
## Checklist Before You Deploy
- Model license verified for production use.
- Environment variables and secrets planned.
- Resource estimates for CPU, memory, and GPUs.
- Small testing dataset for smoke tests.
**Analogy: Packing Infra**
Think of prepping infra like packing a lunch box. You want insulated compartments for hot and cold, utensils that fit, and food that keeps. Each item matters, and missing one breaks the meal. Likewise, a missing GPU flag, secret, or tracing hook can break your deployment.
**Common Quick Answers**
- **How do I containerize an LLM agent?**
- Make your app run from a single command and read config from the environment. Then write a Dockerfile and push the image.
- **Do LLM agents need GPUs in Docker?**
- Not always. Smaller models can run on CPU. Large models and inference at scale often need GPUs. You can run GPUs inside Docker with proper drivers.
- **Is Docker enough for production LLM agents?**
- Docker is required for packaging. For production, you need orchestration, autoscaling, security, and observability. Docker alone is not enough.
## Step 1: Prepare the LLM Agent App for Containerizing
Make your app start from a single command. Use environment variables for config. Add health and readiness endpoints. Support graceful shutdown.
**Key Points**
- Start via `python app.py` or `./run-agent`.
- Pin Python or runtime version.
- Use a multistage build to reduce image size.
- Add health and readiness endpoints for Kubernetes probes.
**Example Analogy**
Think of the app as a shopkeeper who needs a fixed entrance, working lights, and a sign that tells customers the shop is ready. If the shop has no sign, customers will leave.
**Small Code Examples**
**Python Flask Health Endpoints**
```python
from flask import Flask, jsonify
import os
import signal
import sys
import threading

app = Flask(__name__)
ready = False

@app.route("/healthz")
def healthz():
    return jsonify(status="ok"), 200

@app.route("/readyz")
def readyz():
    return (jsonify(ready=True), 200) if ready else (jsonify(ready=False), 503)

def load_model():
    global ready
    model_path = os.getenv("MODEL_PATH", "/models/default")
    # Load weights from model_path here; flip the flag once loading succeeds.
    ready = True

def shutdown_handler(signum, frame):
    # Let in-flight requests finish, then exit cleanly on SIGTERM.
    sys.exit(0)

if __name__ == "__main__":
    threading.Thread(target=load_model, daemon=True).start()
    signal.signal(signal.SIGTERM, shutdown_handler)
    app.run(host="0.0.0.0", port=int(os.getenv("PORT", "8080")))
```
**Sample Config Loader Reading Env Vars**
```python
import os

MODEL_PATH = os.getenv("MODEL_PATH", "/models/model.pt")
API_KEY = os.getenv("API_KEY", "")
PORT = int(os.getenv("PORT", "8080"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "info")
```
**Expected Outcome**
- App responds on a port.
- Supports readiness and graceful shutdown.
- Easy to run in a container with env vars.
**Answer: How do I containerize an LLM agent?**
- Make the app configurable by env vars, add probes, pin runtimes, and keep startup deterministic.
## Step 2: Write the Dockerfile and Build a Docker LLM Image
Use a reproducible base image. Use multistage builds. Keep the final image small. Label your image and include a HEALTHCHECK.
**Analogy**
A multistage build is like packing only the clothes you need for a trip and leaving the heavy boxes behind. Traveling light helps start fast.
**Full Dockerfile for a Python LLM Agent**
```dockerfile
FROM python:3.11-slim AS builder
WORKDIR /app
ENV VIRTUAL_ENV=/opt/venv
RUN python -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
COPY requirements.txt ./
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
COPY . /app

FROM python:3.11-slim
ENV VIRTUAL_ENV=/opt/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
WORKDIR /app
COPY --from=builder /opt/venv /opt/venv
COPY --from=builder /app /app
EXPOSE 8080
LABEL org.opencontainers.image.title="llm-agent"
LABEL org.opencontainers.image.version="v1"
# Note: python:3.11-slim does not ship curl; install it, or use a Python-based check.
HEALTHCHECK --interval=15s --timeout=3s CMD curl -f http://localhost:8080/healthz || exit 1
CMD ["gunicorn", "app:app", "-b", "0.0.0.0:8080", "-w", "4", "--timeout", "120"]
```
**Build and Push Commands**
- `docker build -t registry.example.com/llm-agent:v1 .`
- `docker push registry.example.com/llm-agent:v1`
**GPU Notes**
- For GPU images, start from an NVIDIA base such as `nvidia/cuda:12.1.1-runtime-ubuntu22.04`.
- To run with GPUs, use `docker run --gpus all ...`.
- Make sure drivers match the host.
**Link:** [Demo page](#)
**Answers**
- **How do I containerize an LLM agent?**
- Build a small, labeled image, add HEALTHCHECK, and push to a registry.
- **Do LLM agents need GPUs in Docker?**
- They can. Use GPU base images and run with `--gpus`.
- **Is Docker enough for production LLM agents?**
- No. Combine Docker images with orchestration and observability.
## Step 3: Run and Test Locally with Docker
Run the container locally with env vars and volume mounts. Test health endpoints and a sample inference.
**Analogy**
This is the dress rehearsal before the live show. Fix issues now, not during the performance.
**Run Commands**
- `docker run -e PORT=8080 -p 8080:8080 registry.example.com/llm-agent:v1`
- `docker run --gpus all -e MODEL_PATH=/models/my.pt -v /models:/models -p 8080:8080 registry.example.com/llm-agent:v1`
**Health and Inference Checks**
- `curl -f http://localhost:8080/healthz`
- `curl -X POST http://localhost:8080/infer -H "Content-Type: application/json" -d '{"prompt":"Hello"}'`
**If Using GPU, Verify Inside Container**
- `docker exec -it <container> nvidia-smi`
**Expected Outcome**
- Service responds to inference requests.
- Logs include trace IDs or request IDs to link to LaikaTest.
**Link:** [Demo page](#)
**Answers**
- **How do I containerize an LLM agent?**
- Test locally with Docker run and verify endpoints.
- **Do LLM agents need GPUs in Docker?**
- If you need performance, test GPU runs with `--gpus all`.
## Step 4: Kubernetes Deployment Basics
Create a Deployment, Service, and ConfigMap. Add probes and use nodeSelectors or tolerations for GPUs.
**Analogy**
Kubernetes is like an office building. Each floor has rules for weight and power. The receptionist decides where people sit.
**Deployment YAML**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-agent
  labels:
    app: llm-agent
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-agent
  template:
    metadata:
      labels:
        app: llm-agent
    spec:
      containers:
        - name: agent
          image: registry.example.com/llm-agent:v1
          ports:
            - containerPort: 8080
          envFrom:
            - configMapRef:
                name: llm-agent-config
          resources:
            requests:
              cpu: "500m"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "8Gi"
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 20
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
```
**Service YAML**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-agent
spec:
  selector:
    app: llm-agent
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: LoadBalancer
```
**ConfigMap YAML**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-agent-config
data:
  PORT: "8080"
  LOG_LEVEL: "info"
```
Node selector and tolerations for GPUs help schedule pods on GPU nodes.
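Probe settings translate directly into failure-detection latency. As a rough sketch (Kubernetes defaults `failureThreshold` to 3), the worst-case time before a container is marked unhealthy is the initial delay plus one period per allowed consecutive failure:

```python
def max_detection_seconds(initial_delay: int, period: int, failure_threshold: int = 3) -> int:
    """Approximate worst-case seconds before Kubernetes marks a container
    unhealthy: initial delay plus one full period per allowed failure."""
    return initial_delay + period * failure_threshold

# Liveness probe from the Deployment above: initialDelaySeconds=10, periodSeconds=20.
print(max_detection_seconds(10, 20))  # 70 seconds worst case
```

If 70 seconds of undetected failure is too long, shorten `periodSeconds` rather than the threshold, so one slow response does not restart a healthy pod.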
**Expected Outcome**
- Pods running and reachable via LoadBalancer.
**Answers**
- **Can I run LLMs on Kubernetes?**
- Yes. Use resource requests, probes, and GPU scheduling.
- **How do I monitor containerized LLM services?**
- Add probes, metrics, and tracing. Integrate with observability tools like LaikaTest.
## Step 5: Autoscaling and Resource Planning
Use HPA for CPU or custom metrics. For GPU, use cluster autoscaler and node pools.
**Analogy**
Autoscaling is like calling helpers when the tea shop gets crowded and sending them home when it quiets down.
**HPA YAML Using CPU**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
For custom metrics, use the Prometheus adapter or the Kubernetes custom metrics API. Then run:
- `kubectl apply -f hpa.yaml`
- `kubectl get hpa`
Plan memory and storage carefully. Large models need local SSD and high memory.
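When sizing nodes, it helps to convert Kubernetes resource quantities into plain numbers. A minimal sketch that parses the CPU and memory strings used above (it handles only the common suffixes, not the full Kubernetes quantity grammar):

```python
MEM_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_cpu(q: str) -> float:
    """'500m' -> 0.5 cores, '2' -> 2.0 cores."""
    return int(q[:-1]) / 1000 if q.endswith("m") else float(q)

def parse_memory(q: str) -> int:
    """'2Gi' -> bytes; handles Ki/Mi/Gi/Ti binary suffixes only."""
    for suffix, factor in MEM_SUFFIXES.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)

def pods_per_node(node_cpu: str, node_mem: str, req_cpu: str, req_mem: str) -> int:
    """How many replicas fit on one node, given the pod's resource requests."""
    return int(min(parse_cpu(node_cpu) / parse_cpu(req_cpu),
                   parse_memory(node_mem) / parse_memory(req_mem)))

# An 8-core, 32Gi node with the requests from the Deployment above (500m, 2Gi):
print(pods_per_node("8", "32Gi", "500m", "2Gi"))  # 16
```

In practice, subtract system and kubelet reservations from node capacity before dividing; the allocatable figure is always lower than the raw one.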
**Expected Outcome**
- Autoscaling triggers under load.
- Safe scaling for startup and shutdown.
**Answer:**
- **Can I run LLMs on Kubernetes?**
- Yes, with careful planning for memory and storage.
- **Do LLM agents need GPUs in Docker?**
- GPUs help. Use node pools for GPU workloads.
**Link:** [LLMOps & Production AI pillar page](#)
## Step 6: Observability, Logging, and Tracing with LaikaTest
Instrument your agent for logs, metrics, and traces. Use OpenTelemetry and send traces to LaikaTest.
**Analogy**
Tracing is like adding a GPS tracker to every chai delivery so you can see where it got delayed.
**OpenTelemetry Python Snippet**
```python
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv("LAIKATEST_OTLP", "https://collector.laikatest.example.com"),
    headers={"x-api-key": os.getenv("LAIKATEST_API_KEY", "")},
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
tracer = trace.get_tracer(__name__)
```
**Kubernetes Config for LaikaTest Endpoint Example**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: laikatest-config
data:
  LAIKATEST_OTLP: "https://collector.laikatest.example.com"
---
apiVersion: v1
kind: Secret
metadata:
  name: laikatest-secret
stringData:
  LAIKATEST_API_KEY: "REPLACE_WITH_KEY"
```
**DaemonSet or Sidecar**
- Deploy a DaemonSet to forward logs and traces to LaikaTest, or use a sidecar per pod.
- Configure the agent to add trace IDs in logs so you can join logs and traces.
**Fetch a Trace from LaikaTest API**
- `curl -H "Authorization: Bearer $LAIKATEST_API_KEY" "https://api.laikatest.example.com/traces?trace_id=<id>"`
**Expected Outcome**
- Traces that show request flow, latency breakdown, and failures.
**Link:** [LLMOps & Production AI pillar page](#)
**Answer:**
- **How do I monitor containerized LLM services?**
- Use OpenTelemetry and forward traces to LaikaTest.
- **Is Docker enough for production LLM agents?**
- No. Observability is a must for production.
## Step 7: CI/CD Pipeline to Build, Test, and Deploy
Automate image build, tests, and deploys. Use canary or blue-green for safety. Scan images.
**Analogy**
CI/CD is like an assembly line that checks each cup of chai before it goes out.
**GitHub Actions Snippet**
```yaml
name: CI
on:
  push:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Build image
        run: docker build -t registry.example.com/llm-agent:${{ github.sha }} .
      - name: Push image
        run: |
          echo "${{ secrets.REGISTRY_PASSWORD }}" | docker login registry.example.com -u ${{ secrets.REGISTRY_USER }} --password-stdin
          docker push registry.example.com/llm-agent:${{ github.sha }}
      - name: Deploy
        run: kubectl set image deployment/llm-agent agent=registry.example.com/llm-agent:${{ github.sha }}
```
Use GitOps with Flux or Argo CD for declarative deploys.
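A canary or blue-green step needs an objective promote/rollback rule rather than eyeballing dashboards. A sketch of such a gate is below; the 2% error budget is an arbitrary example, not a standard, so tune it to your SLOs.

```python
def canary_verdict(status_codes: list[int], max_error_rate: float = 0.02) -> str:
    """Return 'promote' if the observed 5xx rate is within budget, else 'rollback'."""
    if not status_codes:
        return "rollback"  # no traffic reached the canary; treat as a failure
    errors = sum(1 for code in status_codes if code >= 500)
    rate = errors / len(status_codes)
    return "promote" if rate <= max_error_rate else "rollback"

# Exactly at the 2% budget: 2 server errors in 100 requests.
print(canary_verdict([200] * 98 + [500, 503]))  # promote
```

Run a gate like this in the pipeline between deploying the canary and shifting full traffic; on "rollback", `kubectl rollout undo` restores the previous image.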
**Expected Outcome**
- Repeatable releases with rollback.
**Answer:**
- **Is Docker enough for production LLM agents?**
- No. CI/CD, testing, and secure release are required.
- **How do I monitor containerized LLM services?**
- Integrate tracing in CI tests and production.
**Link:** [Demo page](#)
## Step 8: Security, Secrets, and Compliance
Use Kubernetes Secrets for credentials. Apply RBAC. Scan images and run non-root.
**Analogy**
Treat model access like the lock to the tea recipe. Share only with those who need it.
**Kubernetes Secret Example**
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: llm-agent-secret
type: Opaque
stringData:
  API_KEY: "REPLACE_WITH_KEY"
```
**ServiceAccount and Minimal Role**
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: llm-agent-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: llm-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: llm-agent-rolebinding
subjects:
  - kind: ServiceAccount
    name: llm-agent-sa
roleRef:
  kind: Role
  name: llm-agent-role
  apiGroup: rbac.authorization.k8s.io
```
**Mount Secret as Env**
```yaml
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-agent-secret
        key: API_KEY
```
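Inside the container, treat a missing secret as a startup error rather than discovering it at request time. A small fail-fast sketch:

```python
import os
import sys

def require_env(name: str) -> str:
    """Return the env var's value, or exit with a clear error if it is unset or empty."""
    value = os.getenv(name, "")
    if not value:
        sys.exit(f"fatal: required environment variable {name} is not set")
    return value

# Example: validate at import/startup time, before serving any traffic.
# api_key = require_env("API_KEY")
```

Failing at startup surfaces the problem in `kubectl describe pod` as a crash loop with a readable message, instead of as sporadic 500s under load.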
**Expected Outcome**
- Reduced blast radius and compliance.
**Answer:**
- **Is Docker enough for production LLM agents?**
- Docker is part of the stack. Security and secrets are separate needs.
## Step 9: Troubleshooting Common Issues
**Common Issues**
- OOMKilled
- Startup timeouts
- Missing GPU drivers
- Cold start latency
- Misconfigured probes
**Diagnostics Commands**
- `kubectl describe pod <pod>`
- `kubectl logs -c agent <pod>`
- `kubectl exec -it <pod> -- /bin/bash`
- `docker run --rm -it --entrypoint /bin/bash registry.example.com/llm-agent:v1`
- `grep trace_id logs | tail -n 100`
**Analogy**
Troubleshooting is like checking the tea kettle, gas, and cup when a customer says their chai is cold.
**Where to Look and Fixes**
1. OOMKilled: increase the memory limit or use a smaller model.
2. Startup timeouts: increase the readiness probe's `initialDelaySeconds`.
3. Missing GPU drivers: check `nvidia-smi` on the node.
4. Cold starts: preload models or keep a warm pool.
5. Misconfigured probes: test the endpoints locally with curl.
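The `grep trace_id` step can also be scripted to pull the IDs of failing requests for lookup in LaikaTest. A sketch assuming log lines of the form `... trace_id=<hex> ...` with an `ERROR` level or a 5xx status; the format is an assumption, so match the regex to your own logs.

```python
import re

TRACE_ID_RE = re.compile(r"trace_id=([0-9a-f]+)")

def failing_trace_ids(log_lines: list[str]) -> list[str]:
    """Extract trace IDs from lines that mention an error, preserving order."""
    ids = []
    for line in log_lines:
        if "ERROR" in line or "status=5" in line:
            match = TRACE_ID_RE.search(line)
            if match:
                ids.append(match.group(1))
    return ids

logs = [
    "INFO trace_id=aa11 status=200 ok",
    "ERROR trace_id=bb22 status=500 model timeout",
]
print(failing_trace_ids(logs))  # ['bb22']
```

Pipe `kubectl logs` into this and feed the IDs to the LaikaTest trace API shown earlier.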
**Link:** [Demo page](#)
**Answers**
- **How do I monitor containerized LLM services?**
- Use logs, metrics, and traces. Correlate trace IDs across systems.
- **Can I run LLMs on Kubernetes?**
- Yes, with careful resource planning.
## Final Checklist and Next Steps
**Checklist Before Production**
- Monitoring and alerting in place.
- Autoscaling tested.
- Backups and model versioning done.
- Runbooks for on-call.
- Security and scanning in place.
**Commands to Verify**
- `kubectl get all --selector app=llm-agent`
**Example Prometheus Alert Rules**
```yaml
groups:
  - name: llm-agent.rules
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le)) > 1.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "High p95 latency for llm-agent"
```
**Analogy**
The final checklist is like tasting each cup before opening the stall for the day.
**Link:** [LLMOps & Production AI pillar page](#)
**Answers**
- **How do I containerize an LLM agent?**
- Follow the steps here, start with Docker and add K8s.
- **How do I monitor containerized LLM services?**
- Add OpenTelemetry, export traces to LaikaTest, and add metrics and alerts.
## Conclusion with LaikaTest
I started with a broken container and no tracing. I ended with reproducible images, Kubernetes deploys, autoscaling, and traces that made failures obvious. The path is local Docker testing, building a small image, deploying to Kubernetes with probes and autoscaling, and adding tracing with OpenTelemetry. Integrate LaikaTest early. It gives one-line observability and tracing. You can see which prompt version ran, the model outputs, tool calls, costs, and latency. Do this before users complain.
**Next Steps**
- Enable OpenTelemetry in your agent.
- Point it at the LaikaTest collector.
- Run a smoke test.
- Inspect traces on the LaikaTest dashboard.
LaikaTest helps teams compare prompt versions, run agent experiments, and fix silent regressions. It is not a silver bullet. It is a practical part of the stack that closes a common gap. After you add tracing and logs, you will catch many issues before users do.
**Link:** [Demo page](#)
**Link:** [LLMOps & Production AI pillar page](#)
If you want, I can share a trimmed example repo with Dockerfile, Kubernetes YAML, and OpenTelemetry snippets that match this guide.