Learn how to containerize LLM agents for better deployment and management. Step-by-step instructions provided.
Naman Arora
January 24, 2026

# Containerizing LLM Agents Guide
[ANECDOTE_PLACEHOLDER]
Containerizing LLM agents is a practical path from messy local scripts to a reproducible production service. In this guide I walk through my approach: building Docker images for LLM agents, deploying them on Kubernetes, and making the containerized services observable with LaikaTest tracing. My aim is procedural help you can copy and adapt.
## Prerequisites and Expected Outcomes
Before you start, make sure your workstation and cluster are ready.
- **Prerequisites**
- Docker installed and working. Verify with `docker --version`.
- kubectl configured for your cluster. Verify with `kubectl version --client`.
- Access to a Kubernetes cluster and permissions to create workloads.
- Basic LLM agent code that can respond on an HTTP port.
- Container registry credentials. Example commands: `gcloud auth configure-docker` or `docker login registry.example.com`.
- Optional GPU nodes, with drivers installed. Verify with `nvidia-smi` on the node or from a machine that can see the GPU.
**Commands to verify environment**
- `docker --version`
- `kubectl version --client`
- `gcloud auth configure-docker`
- `aws ecr get-login-password | docker login --username AWS --password-stdin <account>.dkr.ecr.us-west-2.amazonaws.com`
- `nvidia-smi`
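If you want to automate these checks, a minimal stdlib-only sketch is below. The tool names mirror the commands above; note that `kubectl` reports its version via `kubectl version --client` rather than a plain `--version` flag.

```python
import shutil
import subprocess

# Tools this guide relies on; nvidia-smi is only needed for GPU workloads.
REQUIRED = ["docker", "kubectl"]
OPTIONAL = ["nvidia-smi"]

def tool_available(name: str) -> bool:
    """Return True if the executable is on PATH."""
    return shutil.which(name) is not None

def tool_version(name: str, flag: str = "--version") -> str:
    """Return the tool's version output, or an empty string on failure."""
    try:
        out = subprocess.run([name, flag], capture_output=True, text=True, timeout=10)
        return out.stdout.strip() or out.stderr.strip()
    except (OSError, subprocess.TimeoutExpired):
        return ""

if __name__ == "__main__":
    for name in REQUIRED:
        print(f"{name}: {'ok' if tool_available(name) else 'MISSING'}")
    for name in OPTIONAL:
        if tool_available(name):
            print(f"{name}: ok ({tool_version(name)[:40]})")
```

Run it once on the workstation and once on a node that should see the GPU.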
**Outcomes you will get**
- A container image of the LLM agent pushed to a registry.
- A working Kubernetes deployment with probes and resource requests.
- Autoscaling rules that respond to load.
- Basic tracing and observability integrated with LaikaTest.
**Link:** [LLMOps & Production AI pillar page](#)
## Checklist Before You Deploy
- Model license verified for production use.
- Environment variables and secrets planned.
- Resource estimates for CPU, memory, and GPUs.
- Small testing dataset for smoke tests.
**Analogy: Packing Infra**
Think of prepping infra like packing a lunch box. You want insulated compartments for hot and cold, utensils that fit, and food that keeps. Each item matters, and missing one breaks the meal. Likewise, a missing GPU flag, secret, or tracing hook can break your deployment.
**Common Quick Answers**
- **How do I containerize an LLM agent?**
- Make your app run from a single command and read config from the environment. Then write a Dockerfile and push the image.
- **Do LLM agents need GPUs in Docker?**
- Not always. Smaller models can run on CPU. Large models and inference at scale often need GPUs. You can run GPUs inside Docker with proper drivers.
- **Is Docker enough for production LLM agents?**
- Docker is required for packaging. For production, you need orchestration, autoscaling, security, and observability. Docker alone is not enough.
## Step 1: Prepare the LLM Agent App for Containerizing
Make your app start from a single command. Use environment variables for config. Add health and readiness endpoints. Support graceful shutdown.
**Key Points**
- Start via `python app.py` or `./run-agent`.
- Pin Python or runtime version.
- Use a multistage build to reduce image size.
- Add health and readiness endpoints for Kubernetes probes.
**Example Analogy**
Think of the app as a shopkeeper who needs a fixed entrance, working lights, and a sign that tells customers the shop is ready. If the shop has no sign, customers will leave.
**Small Code Examples**
**Python Flask Health Endpoints**
```python
from flask import Flask, jsonify
import os
import signal
import sys
import threading

app = Flask(__name__)
ready = False

@app.route("/healthz")
def healthz():
    return jsonify(status="ok"), 200

@app.route("/readyz")
def readyz():
    return (jsonify(ready=True), 200) if ready else (jsonify(ready=False), 503)

def load_model():
    global ready
    model_path = os.getenv("MODEL_PATH", "/models/default")
    # Load weights from model_path here; flip the flag once loading succeeds.
    ready = True

def shutdown_handler(signum, frame):
    # Let in-flight requests finish, then exit cleanly on SIGTERM.
    sys.exit(0)

if __name__ == "__main__":
    threading.Thread(target=load_model, daemon=True).start()
    signal.signal(signal.SIGTERM, shutdown_handler)
    app.run(host="0.0.0.0", port=int(os.getenv("PORT", "8080")))
```
**Sample Config Loader Reading Env Vars**
```python
import os

MODEL_PATH = os.getenv("MODEL_PATH", "/models/model.pt")
API_KEY = os.getenv("API_KEY", "")
PORT = int(os.getenv("PORT", "8080"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "info")
```
**Expected Outcome**
- App responds on a port.
- Supports readiness and graceful shutdown.
- Easy to run in a container with env vars.
**Answer: How do I containerize an LLM agent?**
- Make the app configurable by env vars, add probes, pin runtimes, and keep startup deterministic.
## Step 2: Write the Dockerfile and Build a Docker LLM Image
Use a reproducible base image. Use multistage builds. Keep the final image small. Label your image and include a HEALTHCHECK.
**Analogy**
A multistage build is like packing only the clothes you need for a trip and leaving the heavy boxes behind. Traveling light helps start fast.
**Full Dockerfile for a Python LLM Agent**
```dockerfile
FROM python:3.11-slim AS builder
WORKDIR /app
ENV VIRTUAL_ENV=/opt/venv
RUN python -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
COPY requirements.txt ./
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
COPY . /app

FROM python:3.11-slim
ENV VIRTUAL_ENV=/opt/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
WORKDIR /app
COPY --from=builder /opt/venv /opt/venv
COPY --from=builder /app /app
EXPOSE 8080
LABEL org.opencontainers.image.title="llm-agent"
LABEL org.opencontainers.image.version="v1"
# Note: python:3.11-slim does not ship curl; install it, or use a Python-based check.
HEALTHCHECK --interval=15s --timeout=3s CMD curl -f http://localhost:8080/healthz || exit 1
CMD ["gunicorn", "app:app", "-b", "0.0.0.0:8080", "-w", "4", "--timeout", "120"]
```
**Build and Push Commands**
- `docker build -t registry.example.com/llm-agent:v1 .`
- `docker push registry.example.com/llm-agent:v1`
**GPU Notes**
- For GPU images, start from an NVIDIA base such as `nvidia/cuda:12.1.1-runtime-ubuntu22.04`.
- To run with GPUs, use `docker run --gpus all ...`.
- Make sure drivers match the host.
**Link:** [Demo page](#)
**Answers**
- **How do I containerize an LLM agent?**
- Build a small, labeled image, add HEALTHCHECK, and push to a registry.
- **Do LLM agents need GPUs in Docker?**
- They can. Use GPU base images and run with `--gpus`.
- **Is Docker enough for production LLM agents?**
- No. Combine Docker images with orchestration and observability.
## Step 3: Run and Test Locally with Docker
Run the container locally with env vars and volume mounts. Test health endpoints and a sample inference.
**Analogy**
This is the dress rehearsal before the live show. Fix issues now, not during the performance.
**Run Commands**
- `docker run -e PORT=8080 -p 8080:8080 registry.example.com/llm-agent:v1`
- `docker run --gpus all -e MODEL_PATH=/models/my.pt -v /models:/models -p 8080:8080 registry.example.com/llm-agent:v1`
**Health and Inference Checks**
- `curl -f http://localhost:8080/healthz`
- `curl -X POST http://localhost:8080/infer -H "Content-Type: application/json" -d '{"prompt":"Hello"}'`
**If Using GPU, Verify Inside Container**
- `docker exec -it <container> nvidia-smi`
**Expected Outcome**
- Service responds to inference requests.
- Logs include trace IDs or request IDs to link to LaikaTest.
**Link:** [Demo page](#)
**Answers**
- **How do I containerize an LLM agent?**
- Test locally with Docker run and verify endpoints.
- **Do LLM agents need GPUs in Docker?**
- If you need performance, test GPU runs with `--gpus all`.
## Step 4: Kubernetes Deployment Basics
Create a Deployment, Service, and ConfigMap. Add probes and use nodeSelectors or tolerations for GPUs.
**Analogy**
Kubernetes is like an office building. Each floor has rules for weight and power. The receptionist decides where people sit.
**Deployment YAML**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-agent
  labels:
    app: llm-agent
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-agent
  template:
    metadata:
      labels:
        app: llm-agent
    spec:
      containers:
        - name: agent
          image: registry.example.com/llm-agent:v1
          ports:
            - containerPort: 8080
          envFrom:
            - configMapRef:
                name: llm-agent-config
          resources:
            requests:
              cpu: "500m"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "8Gi"
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 20
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
```
**Service YAML**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-agent
spec:
  selector:
    app: llm-agent
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: LoadBalancer
```
**ConfigMap YAML**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-agent-config
data:
  PORT: "8080"
  LOG_LEVEL: "info"
```
Node selector and tolerations for GPUs help schedule pods on GPU nodes.
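Probe settings translate directly into failure-detection latency. As a rough sketch (Kubernetes defaults `failureThreshold` to 3), the worst-case time before a container is marked unhealthy is the initial delay plus one period per allowed consecutive failure:

```python
def max_detection_seconds(initial_delay: int, period: int, failure_threshold: int = 3) -> int:
    """Approximate worst-case seconds before Kubernetes marks a container
    unhealthy: initial delay plus one full period per allowed failure."""
    return initial_delay + period * failure_threshold

# Liveness probe from the Deployment above: initialDelaySeconds=10, periodSeconds=20.
print(max_detection_seconds(10, 20))  # 70 seconds worst case
```

If 70 seconds of undetected failure is too long, shorten `periodSeconds` rather than the threshold, so one slow response does not restart a healthy pod.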
**Expected Outcome**
- Pods running and reachable via LoadBalancer.
**Answers**
- **Can I run LLMs on Kubernetes?**
- Yes. Use resource requests, probes, and GPU scheduling.
- **How do I monitor containerized LLM services?**
- Add probes, metrics, and tracing. Integrate with observability tools like LaikaTest.
## Step 5: Autoscaling and Resource Planning
Use HPA for CPU or custom metrics. For GPU, use cluster autoscaler and node pools.
**Analogy**
Autoscaling is like calling helpers when the tea shop gets crowded and sending them home when it quiets down.
**HPA YAML Using CPU**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
For custom metrics, use the Prometheus adapter or the Kubernetes custom metrics API. Then run:
- `kubectl apply -f hpa.yaml`
- `kubectl get hpa`
Plan memory and storage carefully. Large models need local SSD and high memory.
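When sizing nodes, it helps to convert Kubernetes resource quantities into plain numbers. A minimal sketch that parses the CPU and memory strings used above (it handles only the common suffixes, not the full Kubernetes quantity grammar):

```python
MEM_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_cpu(q: str) -> float:
    """'500m' -> 0.5 cores, '2' -> 2.0 cores."""
    return int(q[:-1]) / 1000 if q.endswith("m") else float(q)

def parse_memory(q: str) -> int:
    """'2Gi' -> bytes; handles Ki/Mi/Gi/Ti binary suffixes only."""
    for suffix, factor in MEM_SUFFIXES.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)

def pods_per_node(node_cpu: str, node_mem: str, req_cpu: str, req_mem: str) -> int:
    """How many replicas fit on one node, given the pod's resource requests."""
    return int(min(parse_cpu(node_cpu) / parse_cpu(req_cpu),
                   parse_memory(node_mem) / parse_memory(req_mem)))

# An 8-core, 32Gi node with the requests from the Deployment above (500m, 2Gi):
print(pods_per_node("8", "32Gi", "500m", "2Gi"))  # 16
```

In practice, subtract system and kubelet reservations from node capacity before dividing; the allocatable figure is always lower than the raw one.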
**Expected Outcome**
- Autoscaling triggers under load.
- Safe scaling for startup and shutdown.
**Answer:**
- **Can I run LLMs on Kubernetes?**
- Yes, with careful planning for memory and storage.
- **Do LLM agents need GPUs in Docker?**
- GPUs help. Use node pools for GPU workloads.
**Link:** [LLMOps & Production AI pillar page](#)
## Step 6: Observability, Logging, and Tracing with LaikaTest
Instrument your agent for logs, metrics, and traces. Use OpenTelemetry and send traces to LaikaTest.
**Analogy**
Tracing is like adding a GPS tracker to every chai delivery so you can see where it got delayed.
**OpenTelemetry Python Snippet**
```python
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv("LAIKATEST_OTLP", "https://collector.laikatest.example.com"),
    headers={"x-api-key": os.getenv("LAIKATEST_API_KEY", "")},
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
tracer = trace.get_tracer(__name__)
```
**Kubernetes Config for LaikaTest Endpoint Example**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: laikatest-config
data:
  LAIKATEST_OTLP: "https://collector.laikatest.example.com"
---
apiVersion: v1
kind: Secret
metadata:
  name: laikatest-secret
stringData:
  LAIKATEST_API_KEY: "REPLACE_WITH_KEY"
```
**DaemonSet or Sidecar**
- Deploy a DaemonSet to forward logs and traces to LaikaTest, or use a sidecar per pod.
- Configure the agent to add trace IDs in logs so you can join logs and traces.
**Fetch a Trace from LaikaTest API**
- `curl -H "Authorization: Bearer $LAIKATEST_API_KEY" "https://api.laikatest.example.com/traces?trace_id=<id>"`
**Expected Outcome**
- Traces that show request flow, latency breakdown, and failures.
**Link:** [LLMOps & Production AI pillar page](#)
**Answer:**
- **How do I monitor containerized LLM services?**
- Use OpenTelemetry and forward traces to LaikaTest.
- **Is Docker enough for production LLM agents?**
- No. Observability is a must for production.
## Step 7: CI/CD Pipeline to Build, Test, and Deploy
Automate image build, tests, and deploys. Use canary or blue-green for safety. Scan images.
**Analogy**
CI/CD is like an assembly line that checks each cup of chai before it goes out.
**GitHub Actions Snippet**
```yaml
name: CI
on:
  push:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Build image
        run: docker build -t registry.example.com/llm-agent:${{ github.sha }} .
      - name: Push image
        run: |
          echo "${{ secrets.REGISTRY_PASSWORD }}" | docker login registry.example.com -u ${{ secrets.REGISTRY_USER }} --password-stdin
          docker push registry.example.com/llm-agent:${{ github.sha }}
      - name: Deploy
        run: kubectl set image deployment/llm-agent agent=registry.example.com/llm-agent:${{ github.sha }}
```
Use GitOps with Flux or Argo CD for declarative deploys.
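A canary or blue-green step needs an objective promote/rollback rule rather than eyeballing dashboards. A sketch of such a gate is below; the 2% error budget is an arbitrary example, not a standard, so tune it to your SLOs.

```python
def canary_verdict(status_codes: list[int], max_error_rate: float = 0.02) -> str:
    """Return 'promote' if the observed 5xx rate is within budget, else 'rollback'."""
    if not status_codes:
        return "rollback"  # no traffic reached the canary; treat as a failure
    errors = sum(1 for code in status_codes if code >= 500)
    rate = errors / len(status_codes)
    return "promote" if rate <= max_error_rate else "rollback"

# Exactly at the 2% budget: 2 server errors in 100 requests.
print(canary_verdict([200] * 98 + [500, 503]))  # promote
```

Run a gate like this in the pipeline between deploying the canary and shifting full traffic; on "rollback", `kubectl rollout undo` restores the previous image.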
**Expected Outcome**
- Repeatable releases with rollback.
**Answer:**
- **Is Docker enough for production LLM agents?**
- No. CI/CD, testing, and secure release are required.
- **How do I monitor containerized LLM services?**
- Integrate tracing in CI tests and production.
**Link:** [Demo page](#)
## Step 8: Security, Secrets, and Compliance
Use Kubernetes Secrets for credentials. Apply RBAC. Scan images and run non-root.
**Analogy**
Treat model access like the lock to the tea recipe. Share only with those who need it.
**Kubernetes Secret Example**
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: llm-agent-secret
type: Opaque
stringData:
  API_KEY: "REPLACE_WITH_KEY"
```
**ServiceAccount and Minimal Role**
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: llm-agent-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: llm-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: llm-agent-rolebinding
subjects:
  - kind: ServiceAccount
    name: llm-agent-sa
roleRef:
  kind: Role
  name: llm-agent-role
  apiGroup: rbac.authorization.k8s.io
```
**Mount Secret as Env**
```yaml
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-agent-secret
        key: API_KEY
```
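Inside the container, treat a missing secret as a startup error rather than discovering it at request time. A small fail-fast sketch:

```python
import os
import sys

def require_env(name: str) -> str:
    """Return the env var's value, or exit with a clear error if it is unset or empty."""
    value = os.getenv(name, "")
    if not value:
        sys.exit(f"fatal: required environment variable {name} is not set")
    return value

# Example: validate at import/startup time, before serving any traffic.
# api_key = require_env("API_KEY")
```

Failing at startup surfaces the problem in `kubectl describe pod` as a crash loop with a readable message, instead of as sporadic 500s under load.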
**Expected Outcome**
- Reduced blast radius and compliance.
**Answer:**
- **Is Docker enough for production LLM agents?**
- Docker is part of the stack. Security and secrets are separate needs.
## Step 9: Troubleshooting Common Issues
**Common Issues**
- OOMKilled
- Startup timeouts
- Missing GPU drivers
- Cold start latency
- Misconfigured probes
**Diagnostics Commands**
- `kubectl describe pod <pod>`
- `kubectl logs -c agent <pod>`
- `kubectl exec -it <pod> -- /bin/bash`
- `docker run --rm -it --entrypoint /bin/bash registry.example.com/llm-agent:v1`
- `grep trace_id logs | tail -n 100`
**Analogy**
Troubleshooting is like checking the tea kettle, gas, and cup when a customer says their chai is cold.
**Where to Look and Fixes**
1. OOMKilled: increase the memory limit or use a smaller model.
2. Startup timeouts: increase the readiness probe's `initialDelaySeconds`.
3. Missing GPU drivers: check `nvidia-smi` on the node.
4. Cold starts: preload models or keep a warm pool.
5. Misconfigured probes: test the endpoints locally with curl.
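The `grep trace_id` step can also be scripted to pull the IDs of failing requests for lookup in LaikaTest. A sketch assuming log lines of the form `... trace_id=<hex> ...` with an `ERROR` level or a 5xx status; the format is an assumption, so match the regex to your own logs.

```python
import re

TRACE_ID_RE = re.compile(r"trace_id=([0-9a-f]+)")

def failing_trace_ids(log_lines: list[str]) -> list[str]:
    """Extract trace IDs from lines that mention an error, preserving order."""
    ids = []
    for line in log_lines:
        if "ERROR" in line or "status=5" in line:
            match = TRACE_ID_RE.search(line)
            if match:
                ids.append(match.group(1))
    return ids

logs = [
    "INFO trace_id=aa11 status=200 ok",
    "ERROR trace_id=bb22 status=500 model timeout",
]
print(failing_trace_ids(logs))  # ['bb22']
```

Pipe `kubectl logs` into this and feed the IDs to the LaikaTest trace API shown earlier.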
**Link:** [Demo page](#)
**Answers**
- **How do I monitor containerized LLM services?**
- Use logs, metrics, and traces. Correlate trace IDs across systems.
- **Can I run LLMs on Kubernetes?**
- Yes, with careful resource planning.
## Final Checklist and Next Steps
**Checklist Before Production**
- Monitoring and alerting in place.
- Autoscaling tested.
- Backups and model versioning done.
- Runbooks for on-call.
- Security and scanning in place.
**Commands to Verify**
- `kubectl get all --selector app=llm-agent`
**Example Prometheus Alert Rules**
```yaml
groups:
  - name: llm-agent.rules
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le)) > 1.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "High p95 latency for llm-agent"
```
**Analogy**
The final checklist is like tasting each cup before opening the stall for the day.
**Link:** [LLMOps & Production AI pillar page](#)
**Answers**
- **How do I containerize an LLM agent?**
- Follow the steps here, start with Docker and add K8s.
- **How do I monitor containerized LLM services?**
- Add OpenTelemetry, export traces to LaikaTest, and add metrics and alerts.
## Conclusion with LaikaTest
I started with a broken container and no tracing. I ended with reproducible images, Kubernetes deploys, autoscaling, and traces that made failures obvious. The path is local Docker testing, building a small image, deploying to Kubernetes with probes and autoscaling, and adding tracing with OpenTelemetry. Integrate LaikaTest early. It gives one-line observability and tracing. You can see which prompt version ran, the model outputs, tool calls, costs, and latency. Do this before users complain.
**Next Steps**
- Enable OpenTelemetry in your agent.
- Point it at the LaikaTest collector.
- Run a smoke test.
- Inspect traces on the LaikaTest dashboard.
LaikaTest helps teams compare prompt versions, run agent experiments, and fix silent regressions. It is not a silver bullet. It is a practical part of the stack that closes a common gap. After you add tracing and logs, you will catch many issues before users do.
**Link:** [Demo page](#)
**Link:** [LLMOps & Production AI pillar page](#)
If you want, I can share a trimmed example repo with Dockerfile, Kubernetes YAML, and OpenTelemetry snippets that match this guide.