Troubleshooting Guide

Common issues, diagnostic procedures, and solutions for Paladin deployments.

Diagnostic Tools
Common Issues
Performance Issues
Configuration Issues
Deployment Issues
Integration Issues
Getting Help

Diagnostic Tools

Check Application Status

# Check health endpoint
curl http://localhost:8080/health

# Check metrics
curl http://localhost:8081/metrics

# View logs
kubectl logs -f deployment/paladin -n paladin

# Check pod status
kubectl describe pod <pod-name> -n paladin

Enable Debug Logging

# Set environment variable
export RUST_LOG=debug,paladin=trace

# Or in config.yml
logging:
  level: "debug"
  modules:
    paladin: "trace"

Collect Diagnostic Information

# System information
uname -a
rustc --version
cargo --version

# Application logs
kubectl logs deployment/paladin -n paladin --tail=1000 > paladin.log

# Metrics snapshot
curl http://localhost:8081/metrics > metrics.txt

# Configuration
kubectl get cm paladin-config -n paladin -o yaml > config.yaml

Common Issues

1. Paladin Execution Fails

Symptoms:

PaladinError::ExecutionError
Empty or truncated responses
Timeout errors

Diagnosis:

# Check logs for error details
kubectl logs deployment/paladin -n paladin | grep ERROR

# Verify LLM configuration
curl -s http://localhost:8080/health | jq '.components.llm'

Solutions:

A. Invalid API Key

# Fix: Update secret with valid key
kubectl create secret generic paladin-secrets -n paladin \
  --from-literal=openai-api-key="sk-..." \
  --dry-run=client -o yaml | kubectl apply -n paladin -f -

B. Model Not Found

// Fix: Use valid model name
let paladin = PaladinBuilder::new(llm_port)
    .model("gpt-4")  // Not "gpt-4-invalid"
    .build()?;

C. Rate Limiting

# Fix: Add retry logic and backoff
llm:
  max_retries: 3
  retry_delay: 2s
  timeout: 60s
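
The settings above describe retry-with-backoff behaviour. As a rough illustration of what that means, the sketch below retries an arbitrary command from a shell script and doubles the wait after each failure. It is not Paladin's internal retry loop: the function name `retry_with_backoff` and the doubling strategy are assumptions made for this example.

```shell
# retry_with_backoff MAX_RETRIES BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND once, then retries up to MAX_RETRIES times on failure,
# sleeping BASE_DELAY, 2x BASE_DELAY, 4x BASE_DELAY, ... between attempts.
# (Hypothetical helper; the doubling schedule is an assumption.)
retry_with_backoff() {
  local max_retries delay attempt
  max_retries=$1
  delay=$2
  shift 2
  attempt=0
  while :; do
    "$@" && return 0
    attempt=$((attempt + 1))
    if [ "$attempt" -gt "$max_retries" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Example: three retries, starting with a 2 second wait
#   retry_with_backoff 3 2 curl -sf http://localhost:8080/health
```

With `max_retries` set to 3 the command runs at most four times: the first attempt plus three retries.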

2. High Memory Usage

Symptoms:

OOMKilled pods
Memory usage > 80%
Slow performance

Diagnosis:

# Check memory usage
kubectl top pods -n paladin

# Check Garrison size
curl http://localhost:8081/metrics | grep garrison_entries
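
The grep above prints every matching line, one per label set. If a single number is easier to compare against the configured limit, a small filter can total the samples. This is a minimal sketch that assumes the endpoint serves the Prometheus text format and that `garrison_entries` is the full metric name (the source only shows it as a grep pattern); `metric_sum` is a hypothetical helper.

```shell
# metric_sum NAME
# Reads Prometheus text-format metrics on stdin and prints the sum of all
# samples whose metric name is exactly NAME. Exits 1 if NAME never appears.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)           # strip the {label="..."} part
      if (metric != name) next
      rest = $0
      sub(/^[^{ ]+/, "", rest)          # remove the metric name
      sub(/^\{.*\}/, "", rest)          # remove the label set, if any
      split(rest, f, " ")               # f[1] = value, f[2] = optional timestamp
      total += f[1]
      found = 1
    }
    END { if (!found) exit 1; print total }
  '
}

# Example:
#   curl -s http://localhost:8081/metrics | metric_sum garrison_entries
```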

Solutions:

A. Garrison Too Large

# Fix: Reduce garrison limits
garrison:
  max_entries: 500  # Reduce from 1000
  max_tokens: 4000  # Reduce from 8000

B. Memory Leak

# Fix: Update to latest version
docker pull ghcr.io/your-org/paladin:latest

# On Kubernetes, restart the deployment; pods re-pull :latest tags by default
kubectl rollout restart deployment/paladin -n paladin

C. Insufficient Resources

# Fix: Increase resource limits
resources:
  limits:
    memory: 8Gi  # Increase from 4Gi
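
Before raising the limit, it helps to know how close a pod really is to it. The sketch below turns a usage figure (as printed by `kubectl top pods`) and a limit into a percentage, so it can be compared with the 80% symptom threshold. `to_mib` and `mem_percent` are hypothetical helpers, and only whole-number `Ki`, `Mi`, and `Gi` quantities are handled.

```shell
# to_mib QUANTITY
# Converts a Kubernetes memory quantity such as 512Mi or 4Gi to whole MiB.
# Fractional values (for example 1.5Gi) are not supported in this sketch.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# mem_percent USED LIMIT
# Prints USED as an integer percentage of LIMIT (rounded down).
mem_percent() {
  local used limit
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

# Example: a pod using 3400Mi against a 4Gi limit
#   mem_percent 3400Mi 4Gi
```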

3. Connection Refused

Symptoms:

Cannot connect to external services
ConnectionRefused errors
Network timeout

Diagnosis:

# Test connectivity from pod (Redis does not speak HTTP, so use redis-cli rather than curl)
kubectl exec -it <pod-name> -n paladin -- redis-cli -h redis -p 6379 ping
kubectl exec -it <pod-name> -n paladin -- nslookup redis

# Check network policies
kubectl get networkpolicy -n paladin

Solutions:

A. Service Not Running

# Fix: Start the service
kubectl get svc redis -n paladin
kubectl scale statefulset redis --replicas=1 -n paladin

B. Wrong Hostname

# Fix: Use correct service DNS
queue:
  url: "redis://redis.paladin.svc.cluster.local:6379"
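
The hostname above follows the standard Kubernetes pattern `<service>.<namespace>.svc.<cluster-domain>`. A tiny helper makes the pattern explicit; `service_fqdn` is a hypothetical name, and `cluster.local` is assumed as the cluster domain because it is the Kubernetes default.

```shell
# service_fqdn SERVICE NAMESPACE [CLUSTER_DOMAIN]
# Prints the in-cluster DNS name of a Service.
# CLUSTER_DOMAIN defaults to cluster.local (an assumption; clusters can override it).
service_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}

# Example: build the queue URL for the redis Service in the paladin namespace
#   echo "redis://$(service_fqdn redis paladin):6379"
```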

C. Network Policy Blocking

# Fix: Allow egress to Redis
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
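metadata:
  name: allow-redis
  namespace: paladin
spec:
  policyTypes:
  - Egress  # Declare explicitly; otherwise Ingress is implied and inbound traffic is isolated too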
  podSelector:
    matchLabels:
      app: paladin
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379

4. Battalion Execution Hangs

Symptoms:

Battalion never completes
High CPU usage
No error messages

Diagnosis:

# Check active Paladins
curl http://localhost:8081/metrics | grep paladin_active

# Look for deadlocks
kubectl logs deployment/paladin -n paladin | grep -iE "deadlock|timeout"

Solutions:

A. Circular Dependencies (Campaign)

// Fix: Ensure DAG has no cycles
campaign.validate()?;  // Will error if cyclic
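
To see which dependency list trips the validation, the edges can also be checked outside the application. The sketch below is not Paladin's validator; it is a generic cycle check using Kahn's algorithm, and both the `has_cycle` name and the "FROM TO" edge format are assumptions made for illustration.

```shell
# has_cycle
# Reads dependency edges on stdin, one "FROM TO" pair per line.
# Exits 0 if the graph contains a cycle, 1 if it is a valid DAG.
has_cycle() {
  awk '
    NF == 2 {
      if (!($1 in indeg)) indeg[$1] = 0
      indeg[$2]++                        # count incoming edges per node
      succ[$1] = succ[$1] " " $2         # remember outgoing edges
    }
    END {
      # Kahn: repeatedly remove nodes that have no remaining incoming edges.
      n = 0
      for (v in indeg) { total++; if (indeg[v] == 0) queue[n++] = v }
      for (i = 0; i < n; i++) {
        removed++
        m = split(succ[queue[i]], out, " ")
        for (j = 1; j <= m; j++)
          if (--indeg[out[j]] == 0) queue[n++] = out[j]
      }
      # Nodes that could never be removed sit on, or behind, a cycle.
      exit (removed < total ? 0 : 1)
    }
  '
}

# Example:
#   printf 'fetch parse\nparse report\nreport fetch\n' | has_cycle && echo "cycle found"
```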

B. Infinite Loop

// Fix: Set reasonable max_loops
let paladin = PaladinBuilder::new(llm_port)
    .max_loops(10)  // Prevent infinite loops
    .build()?;

C. Timeout Not Set

# Fix: Add execution timeout
paladin:
  timeout_seconds: 300  # 5 minutes

Performance Issues

Slow Response Times

Symptoms:

P95 latency > 2s
High request duration

Diagnosis:

# Check latency metrics
curl http://localhost:8081/metrics | grep duration

# Profile with flamegraph
cargo flamegraph --bin paladin-server
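
To confirm the P95 symptom from the client side, request timings can be sampled with curl and reduced to a percentile. This is a minimal sketch using the nearest-rank method; `percentile` is a hypothetical helper, and probing the health endpoint is only a stand-in for whichever request is actually slow.

```shell
# percentile P
# Reads one number per line on stdin and prints the P-th percentile
# using the nearest-rank method. Exits 1 on empty input.
percentile() {
  LC_ALL=C sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = p * NR / 100
      idx = int(rank)
      if (idx < rank) idx++              # round up to the next whole rank
      if (idx < 1) idx = 1
      print v[idx]
    }
  '
}

# Example: time 20 requests (seconds) and report the P95
#   for i in $(seq 20); do
#     curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8080/health
#   done | percentile 95
```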

Solutions:

A. Slow LLM Responses

# Fix: Use a faster model, or raise the timeout if requests are cut off mid-response
llm:
  default_model: "gpt-3.5-turbo"  # Faster than gpt-4
  timeout: 30s

B. Garrison Query Slow

-- Fix: Add index to Garrison database
CREATE INDEX idx_garrison_timestamp ON garrison_entries(timestamp);
CREATE INDEX idx_garrison_session ON garrison_entries(session_id);

C. Too Many Tool Calls

# Fix: Limit concurrent tool executions
arsenal:
  max_concurrent_tools: 5

High CPU Usage

Symptoms:

CPU throttling
Slow processing
Increased costs

Diagnosis:

# Check CPU usage
kubectl top pods -n paladin

# Profile CPU
cargo build --release
perf record -F 99 -g ./target/release/paladin-server
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

Solutions:

A. Too Many Replicas

# Fix: Reduce replica count
spec:
  replicas: 3  # Reduce from 10

B. Inefficient Code

# Fix: Update to optimized version
git pull origin main
cargo build --release

Configuration Issues

Invalid Configuration

Symptoms:

Application won't start
Configuration validation errors

Diagnosis:

# Validate configuration
paladin config validate config.yml

# Check for syntax errors
yamllint config.yml

Solutions:

# Fix: Correct YAML syntax
paladin:
  default_temperature: 0.7  # Must be number
  max_loops: 3              # Must be integer

Missing Environment Variables

Symptoms:

environment variable not set errors
API calls fail

Diagnosis:

# Check environment (names only, so secret values stay out of the terminal)
kubectl exec deployment/paladin -n paladin -- env | grep -i key | cut -d= -f1

Solutions:

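# Fix: Set missing variables, then restart so pods pick up the new values
kubectl create secret generic paladin-secrets -n paladin \
  --from-literal=openai-api-key="$OPENAI_API_KEY" \
  --dry-run=client -o yaml | kubectl apply -n paladin -f -
kubectl rollout restart deployment/paladin -n paladin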
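
A preflight check in the container entrypoint can report missing variables before the application starts, which turns a vague API failure into an explicit message. The sketch below assumes the variable is exposed to the process as `OPENAI_API_KEY`; that name, and the helper `missing_env`, are illustrative rather than taken from Paladin's configuration.

```shell
# missing_env NAME [NAME...]
# Prints each listed variable that is unset or empty in the environment.
# Returns 1 if any are missing, 0 otherwise.
missing_env() {
  local name status
  status=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what a child process gets
    if [ -z "$(printenv "$name")" ]; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Example entrypoint guard (variable name is an assumption):
#   missing_env OPENAI_API_KEY || { echo "set the variables listed above" >&2; exit 1; }
```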

Deployment Issues

Pod CrashLoopBackOff

Symptoms:

Pods constantly restarting
CrashLoopBackOff status

Diagnosis:

# Check pod events
kubectl describe pod <pod-name> -n paladin

# View crash logs
kubectl logs <pod-name> -n paladin --previous

Solutions:

A. Missing Dependencies

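# Fix: Add runtime dependencies (refresh the package index in the same layer;
# newer Debian/Ubuntu bases ship libssl3 instead of libssl1.1)
RUN apt-get update \
 && apt-get install -y --no-install-recommends libssl1.1 ca-certificates \
 && rm -rf /var/lib/apt/lists/*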

B. Health Check Failing

# Fix: Adjust health check timing
livenessProbe:
  initialDelaySeconds: 60  # Increase from 30
  periodSeconds: 30        # Increase from 10

Image Pull Errors

Symptoms:

ImagePullBackOff or ErrImagePull
Pods stuck in pending

Diagnosis:

# Check image pull status
kubectl describe pod <pod-name> -n paladin | grep -A5 Events

Solutions:

# Fix: Authenticate with registry
kubectl create secret docker-registry ghcr-secret -n paladin \
  --docker-server=ghcr.io \
  --docker-username="$GITHUB_USER" \
  --docker-password="$GITHUB_TOKEN"

# Update deployment to use secret (in the pod template, not the Deployment's top-level spec)
spec:
  template:
    spec:
      imagePullSecrets:
      - name: ghcr-secret

Integration Issues

Redis Connection Failed

Symptoms:

Queue operations fail
ConnectionRefused errors

Diagnosis:

# Test Redis connectivity
kubectl exec deployment/paladin -n paladin -- redis-cli -h redis ping

Solutions:

# Fix: Restart Redis
kubectl rollout restart statefulset redis -n paladin

# Or check authentication
kubectl get secret redis-auth -n paladin -o jsonpath='{.data.password}' | base64 -d

MinIO/S3 Errors

Symptoms:

File storage operations fail
AccessDenied errors

Diagnosis:

# Test MinIO connectivity
kubectl exec deployment/paladin -n paladin -- \
  curl -v http://minio:9000/minio/health/live

Solutions:

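# Fix: Update credentials (substitute your real keys for the MinIO defaults shown here)
kubectl create secret generic minio-credentials -n paladin \
  --from-literal=access-key="minioadmin" \
  --from-literal=secret-key="minioadmin" \
  --dry-run=client -o yaml | kubectl apply -n paladin -f -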

LLM Provider Issues

Symptoms:

API rate limiting
Invalid credentials
Model unavailable

Solutions:

A. Rate Limit Exceeded

# Fix: Add rate limiting
llm:
  rate_limit:
    requests_per_minute: 60
    tokens_per_minute: 90000
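
Which of the two limits binds depends on how many tokens a typical request consumes. As a worked example with the numbers above: at an average of 3000 tokens per request, 90000 tokens per minute allows only 30 requests per minute, so the token limit is reached before the request limit. The sketch below performs that arithmetic; `effective_rpm` is a hypothetical helper and the average token count is something you would measure yourself.

```shell
# effective_rpm REQUESTS_PER_MINUTE TOKENS_PER_MINUTE AVG_TOKENS_PER_REQUEST
# Prints the request rate actually achievable under both limits
# (integer arithmetic, rounded down).
effective_rpm() {
  local rpm tpm avg_tokens by_tokens
  rpm=$1
  tpm=$2
  avg_tokens=$3
  by_tokens=$(( tpm / avg_tokens ))      # requests the token budget allows
  if [ "$by_tokens" -lt "$rpm" ]; then
    echo "$by_tokens"
  else
    echo "$rpm"
  fi
}

# Example:
#   effective_rpm 60 90000 3000
```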

B. Switch Provider

# Fix: Use fallback provider
llm:
  providers:
    - openai
    - deepseek  # Fallback
    - anthropic # Fallback

Getting Help

Collect Debug Bundle

#!/bin/bash
# debug-bundle.sh
set -euo pipefail

NAMESPACE="paladin"
BUNDLE_DIR="debug-bundle"
OUTPUT="debug-bundle-$(date +%Y%m%d-%H%M%S).tar.gz"

mkdir -p "$BUNDLE_DIR"

# Logs
kubectl logs deployment/paladin -n "$NAMESPACE" > "$BUNDLE_DIR/paladin.log"

# Configuration (Secrets are left out on purpose: the bundle is meant to be shared)
kubectl get all,cm -n "$NAMESPACE" -o yaml > "$BUNDLE_DIR/resources.yaml"

# Metrics (requires the metrics port to be reachable on localhost)
curl -s http://localhost:8081/metrics > "$BUNDLE_DIR/metrics.txt"

# Events
kubectl get events -n "$NAMESPACE" > "$BUNDLE_DIR/events.txt"

tar czf "$OUTPUT" "$BUNDLE_DIR/"
echo "Debug bundle created: $OUTPUT"
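
Logs can contain credentials, so it is worth scrubbing them before a bundle leaves your machine. The filter below is a minimal sketch that masks strings shaped like the `sk-...` API keys used earlier in this guide, plus HTTP bearer tokens. `redact_secrets` is a hypothetical helper, and the patterns are assumptions that will not catch every secret format, so review the output before sharing.

```shell
# redact_secrets
# Copies stdin to stdout, masking "sk-" style API keys and Bearer tokens.
redact_secrets() {
  sed -E \
    -e 's/(^|[^A-Za-z0-9])sk-[A-Za-z0-9_-]+/\1sk-REDACTED/g' \
    -e 's|(Bearer )[A-Za-z0-9._~+/=-]+|\1REDACTED|g'
}

# Example: scrub the log file in place before creating the archive
#   redact_secrets < debug-bundle/paladin.log > paladin.clean.log
#   mv paladin.clean.log debug-bundle/paladin.log
```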

Open an Issue

Include:

Paladin version
Deployment environment (Docker/K8s)
Error messages and logs
Steps to reproduce
Expected vs actual behavior

Community Support

GitHub Issues: Bug reports and feature requests
Discussions: Questions and community help
Discord: Real-time chat support

Next Steps

Monitoring - Set up monitoring
Performance Tuning - Optimize performance
Logging - Configure logging
