Troubleshooting Guide

Common issues, diagnostic procedures, and solutions for Paladin deployments.

Diagnostic Tools

Check Application Status

# Check health endpoint
curl http://localhost:8080/health

# Check metrics
curl http://localhost:8081/metrics

# View logs
kubectl logs -f deployment/paladin -n paladin

# Check pod status
kubectl describe pod <pod-name> -n paladin

Enable Debug Logging

# Set environment variable
export RUST_LOG=debug,paladin=trace

# Or in config.yml
logging:
  level: "debug"
  modules:
    paladin: "trace"

Collect Diagnostic Information

# System information
uname -a
rustc --version
cargo --version

# Application logs
kubectl logs deployment/paladin -n paladin --tail=1000 > paladin.log

# Metrics snapshot
curl http://localhost:8081/metrics > metrics.txt

# Configuration
kubectl get cm paladin-config -n paladin -o yaml > config.yaml
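
Once metrics.txt has been saved, a single value is often easier to read than the full dump. The helper below is a minimal sketch that assumes the endpoint serves the Prometheus text format without trailing timestamps; `metric_value` is a hypothetical name, and `garrison_entries` is used only because it appears as a grep pattern later in this guide.

```shell
# metric_value NAME < metrics.txt
# Prints the value of every sample whose metric name is exactly NAME.
# Comment lines ("# HELP", "# TYPE") are skipped and labels are ignored.
# Assumption: samples have no trailing timestamp, so the value is the last field.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }
    {
      metric = $1
      sub(/\{.*/, "", metric)   # drop the {label="..."} part, if any
      if (metric == name) print $NF
    }'
}
```

Usage: `metric_value garrison_entries < metrics.txt`. Matching on the exact name avoids the false positives that a plain grep produces for metrics sharing a prefix.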

Common Issues

1. Paladin Execution Fails

Symptoms:

  • PaladinError::ExecutionError
  • Empty or truncated responses
  • Timeout errors

Diagnosis:

# Check logs for error details
kubectl logs deployment/paladin -n paladin | grep ERROR

# Verify LLM configuration
curl -s http://localhost:8080/health | jq '.components.llm'

Solutions:

A. Invalid API Key

# Fix: Update secret with valid key
kubectl create secret generic paladin-secrets -n paladin \
  --from-literal=openai-api-key="sk-..." \
  --dry-run=client -o yaml | kubectl apply -f -

B. Model Not Found

// Fix: Use a valid model name
let paladin = PaladinBuilder::new(llm_port)
    .model("gpt-4")  // Not "gpt-4-invalid"
    .build()?;

C. Rate Limiting

# Fix: Add retry logic and backoff
llm:
  max_retries: 3
  retry_delay: 2s
  timeout: 60s
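
The settings above configure retries inside Paladin. When a script or health probe calls a rate-limited endpoint directly, the same idea can be applied on the client side. The following is a minimal sketch of retry with exponential backoff; `retry_with_backoff` is a hypothetical helper, not part of Paladin, and the delay is in whole seconds.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

Usage: `retry_with_backoff 3 2 curl -sf http://localhost:8080/health` mirrors the `max_retries: 3` and `retry_delay: 2s` values, waiting 2s and then 4s between attempts.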

2. High Memory Usage

Symptoms:

  • OOMKilled pods
  • Memory usage above 80% of the container memory limit
  • Slow performance

Diagnosis:

# Check memory usage
kubectl top pods -n paladin

# Check Garrison size
curl http://localhost:8081/metrics | grep garrison_entries
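
To compare the `kubectl top pods` figures against the 80% symptom threshold, the output can be piped through a small filter. This is a sketch under two assumptions: memory is reported in Mi (the usual `kubectl top` unit) and the header row is present. `pods_over_memory` is a hypothetical helper name.

```shell
# pods_over_memory LIMIT_MI THRESHOLD_PCT < "kubectl top pods" output
# Prints "<pod> <percent>%" for every pod whose memory usage exceeds
# THRESHOLD_PCT percent of LIMIT_MI.
pods_over_memory() {
  awk -v limit="$1" -v pct="$2" '
    NR == 1 { next }              # skip the header row
    $3 ~ /Mi$/ {
      used = $3
      sub(/Mi$/, "", used)        # "3500Mi" -> "3500"
      if (used * 100 > limit * pct)
        printf "%s %d%%\n", $1, used * 100 / limit
    }'
}
```

Usage: `kubectl top pods -n paladin | pods_over_memory 4096 80` lists pods above 80% of a 4Gi limit.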

Solutions:

A. Garrison Too Large

# Fix: Reduce garrison limits
garrison:
  max_entries: 500  # Reduce from 1000
  max_tokens: 4000  # Reduce from 8000

B. Memory Leak

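# Fix: Update to latest version
# (a restart only picks up a new image when the deployment uses a mutable
# tag such as :latest with imagePullPolicy: Always)
docker pull ghcr.io/your-org/paladin:latest
kubectl rollout restart deployment/paladin -n paladin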

C. Insufficient Resources

# Fix: Increase resource limits
resources:
  limits:
    memory: 8Gi  # Increase from 4Gi

3. Connection Refused

Symptoms:

  • Cannot connect to external services
  • ConnectionRefused errors
  • Network timeout

Diagnosis:

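# Test connectivity from pod
# (Redis does not speak HTTP, so open a raw TCP connection and look for "Connected to")
kubectl exec -it <pod-name> -n paladin -- curl -v --max-time 5 telnet://redis:6379
kubectl exec -it <pod-name> -n paladin -- nslookup redis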

# Check network policies
kubectl get networkpolicy -n paladin

Solutions:

A. Service Not Running

# Fix: Confirm the service exists, then start Redis
kubectl get svc redis -n paladin
kubectl scale statefulset redis -n paladin --replicas=1

B. Wrong Hostname

# Fix: Use correct service DNS
queue:
  url: "redis://redis.paladin.svc.cluster.local:6379"

C. Network Policy Blocking

# Fix: Allow egress to Redis
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redis
spec:
  podSelector:
    matchLabels:
      app: paladin
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379

4. Battalion Execution Hangs

Symptoms:

  • Battalion never completes
  • High CPU usage
  • No error messages

Diagnosis:

# Check active Paladins
curl http://localhost:8081/metrics | grep paladin_active

# Look for deadlocks
kubectl logs deployment/paladin -n paladin | grep -i "deadlock\|timeout"

Solutions:

A. Circular Dependencies (Campaign)

// Fix: Ensure DAG has no cycles
campaign.validate()?;  // Will error if cyclic
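
How `campaign.validate()` detects cycles is not described here. As a rough illustration of the underlying check, the sketch below runs Kahn's algorithm over a plain edge list: nodes with no unmet dependencies are removed repeatedly, and anything left over must sit on a cycle. The "dependency dependent" line format and the name `has_cycle` are assumptions made for this example only.

```shell
# has_cycle < edges.txt
# Each input line is "FROM TO", meaning TO depends on FROM.
# Exit status 0 if the graph contains a cycle, 1 if it is a valid DAG.
has_cycle() {
  awk '
    NF >= 2 {
      nodes[$1]; nodes[$2]
      adj[$1] = adj[$1] " " $2
      indeg[$2]++
    }
    END {
      n = 0; head = 0; total = 0; seen = 0
      for (v in nodes) {
        total++
        if (indeg[v] + 0 == 0) queue[n++] = v   # no dependencies: ready to run
      }
      while (head < n) {
        v = queue[head++]
        seen++
        m = split(adj[v], out, " ")
        for (i = 1; i <= m; i++)
          if (--indeg[out[i]] == 0) queue[n++] = out[i]
      }
      exit (seen == total) ? 1 : 0              # unprocessed nodes imply a cycle
    }'
}
```

Usage: `printf 'fetch parse\nparse report\n' | has_cycle || echo "no cycles"`.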

B. Infinite Loop

// Fix: Set reasonable max_loops
let paladin = PaladinBuilder::new(llm_port)
    .max_loops(10)  // Prevent infinite loops
    .build()?;

C. Timeout Not Set

# Fix: Add execution timeout
paladin:
  timeout_seconds: 300  # 5 minutes

Performance Issues

Slow Response Times

Symptoms:

  • P95 latency > 2s
  • High request duration

Diagnosis:

# Check latency metrics
curl http://localhost:8081/metrics | grep duration

# Profile with flamegraph
cargo flamegraph --bin paladin-server
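
If raw per-request durations are available, for example extracted from logs into a file with one value per line (a hypothetical format, not something Paladin is known to emit), the P95 can be checked directly against the 2s threshold. The sketch below uses the nearest-rank method; `p95` is a hypothetical helper name.

```shell
# p95 < durations.txt   (one numeric duration per line)
# Prints the 95th percentile using the nearest-rank method.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) in integer arithmetic
      print v[rank]
    }'
}
```

Nearest-rank always returns a value that actually occurred, which makes it easy to trace back to a specific slow request.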

Solutions:

A. Slow LLM Responses

# Fix: Use a faster model or adjust the timeout
llm:
  default_model: "gpt-3.5-turbo"  # Faster than gpt-4
  timeout: 30s

B. Garrison Query Slow

-- Fix: Add index to Garrison database
CREATE INDEX idx_garrison_timestamp ON garrison_entries(timestamp);
CREATE INDEX idx_garrison_session ON garrison_entries(session_id);

C. Too Many Tool Calls

# Fix: Limit concurrent tool executions
arsenal:
  max_concurrent_tools: 5

High CPU Usage

Symptoms:

  • CPU throttling
  • Slow processing
  • Increased costs

Diagnosis:

# Check CPU usage
kubectl top pods -n paladin

# Profile CPU
cargo build --release
perf record -F 99 -g ./target/release/paladin-server
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

Solutions:

A. Too Many Replicas

# Fix: Reduce replica count
spec:
  replicas: 3  # Reduce from 10

B. Inefficient Code

# Fix: Update to optimized version
git pull origin main
cargo build --release

Configuration Issues

Invalid Configuration

Symptoms:

  • Application won't start
  • Configuration validation errors

Diagnosis:

# Validate configuration
paladin config validate config.yml

# Check for syntax errors
yamllint config.yml

Solutions:

# Fix: Correct YAML syntax
paladin:
  default_temperature: 0.7  # Must be number
  max_loops: 3              # Must be integer

Missing Environment Variables

Symptoms:

  • environment variable not set errors
  • API calls fail

Diagnosis:

# Check environment
# (prints secret values to the terminal; avoid pasting the output into issues)
kubectl exec deployment/paladin -n paladin -- env | grep -i key

Solutions:

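# Fix: Set missing variables (the apply step also works if the secret already exists)
kubectl create secret generic paladin-secrets -n paladin \
  --from-literal=openai-api-key="$OPENAI_API_KEY" \
  --dry-run=client -o yaml | kubectl apply -f -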
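
Before creating the secret it helps to confirm that the variables it is built from are set in the current shell, since an unset variable would silently produce an empty key. A minimal sketch, assuming POSIX sh; `missing_env` is a hypothetical helper, and `OPENAI_API_KEY` is the only variable name taken from this guide.

```shell
# missing_env NAME [NAME...]
# Prints each variable name that is unset or empty.
# Exit status 1 if anything is missing, 0 otherwise.
# Only pass trusted identifiers: the names are expanded with eval.
missing_env() {
  status=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}
```

Usage: `missing_env OPENAI_API_KEY || echo "set the variables listed above first"`.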

Deployment Issues

Pod CrashLoopBackOff

Symptoms:

  • Pods constantly restarting
  • CrashLoopBackOff status

Diagnosis:

# Check pod events
kubectl describe pod <pod-name> -n paladin

# View crash logs
kubectl logs <pod-name> -n paladin --previous

Solutions:

A. Missing Dependencies

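# Fix: Add runtime dependencies
# (the libssl package name depends on the base image, e.g. libssl3 on newer Debian releases)
RUN apt-get update \
 && apt-get install -y --no-install-recommends libssl1.1 ca-certificates \
 && rm -rf /var/lib/apt/lists/*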

B. Health Check Failing

# Fix: Adjust health check timing
livenessProbe:
  initialDelaySeconds: 60  # Increase from 30
  periodSeconds: 30        # Increase from 10

Image Pull Errors

Symptoms:

  • ImagePullBackOff or ErrImagePull
  • Pods stuck in pending

Diagnosis:

# Check image pull status
kubectl describe pod <pod-name> -n paladin | grep -A5 Events

Solutions:

# Fix: Authenticate with registry
kubectl create secret docker-registry ghcr-secret -n paladin \
  --docker-server=ghcr.io \
  --docker-username="$GITHUB_USER" \
  --docker-password="$GITHUB_TOKEN"

# Update deployment to use secret
spec:
  imagePullSecrets:
  - name: ghcr-secret

Integration Issues

Redis Connection Failed

Symptoms:

  • Queue operations fail
  • ConnectionRefused errors

Diagnosis:

# Test Redis connectivity
# (requires redis-cli to be present in the Paladin image)
kubectl exec deployment/paladin -n paladin -- redis-cli -h redis ping

Solutions:

# Fix: Restart Redis
kubectl rollout restart statefulset redis -n paladin

# Or check authentication
kubectl get secret redis-auth -n paladin -o jsonpath='{.data.password}' | base64 -d

MinIO/S3 Errors

Symptoms:

  • File storage operations fail
  • AccessDenied errors

Diagnosis:

# Test MinIO connectivity
kubectl exec deployment/paladin -n paladin -- \
  curl -v http://minio:9000/minio/health/live

Solutions:

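# Fix: Update credentials
# ("minioadmin" is the MinIO default; substitute the credentials your instance uses)
kubectl create secret generic minio-credentials -n paladin \
  --from-literal=access-key="minioadmin" \
  --from-literal=secret-key="minioadmin" \
  --dry-run=client -o yaml | kubectl apply -f -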

LLM Provider Issues

Symptoms:

  • API rate limiting
  • Invalid credentials
  • Model unavailable

Solutions:

A. Rate Limit Exceeded

# Fix: Add rate limiting
llm:
  rate_limit:
    requests_per_minute: 60
    tokens_per_minute: 90000

B. Switch Provider

# Fix: Use fallback provider
llm:
  providers:
    - openai
    - deepseek  # Fallback
    - anthropic # Fallback

Getting Help

Collect Debug Bundle

#!/bin/bash
# debug-bundle.sh
set -euo pipefail

NAMESPACE="paladin"
OUTPUT="debug-bundle-$(date +%Y%m%d-%H%M%S).tar.gz"

mkdir -p debug-bundle
cd debug-bundle

# Logs
kubectl logs deployment/paladin -n "$NAMESPACE" > paladin.log

# Configuration (Secrets are excluded so credentials do not end up in the bundle)
kubectl get all,cm -n "$NAMESPACE" -o yaml > resources.yaml

# Metrics (do not abort the bundle if the metrics port is unreachable)
curl -sf http://localhost:8081/metrics > metrics.txt \
  || echo "warning: could not fetch metrics" >&2

# Events
kubectl get events -n "$NAMESPACE" > events.txt

cd ..
tar czf "$OUTPUT" debug-bundle/
echo "Debug bundle created: $OUTPUT"

Open an Issue

Include:

  1. Paladin version
  2. Deployment environment (Docker/K8s)
  3. Error messages and logs
  4. Steps to reproduce
  5. Expected vs actual behavior

Community Support

  • GitHub Issues: Bug reports and feature requests
  • Discussions: Questions and community help
  • Discord: Real-time chat support

Next Steps