Troubleshooting Guide
Common issues, diagnostic procedures, and solutions for Paladin deployments.
Table of Contents
- Diagnostic Tools
- Common Issues
- Performance Issues
- Configuration Issues
- Deployment Issues
- Integration Issues
- Getting Help
Diagnostic Tools
Check Application Status
# Check health endpoint
curl http://localhost:8080/health
# Check metrics
curl http://localhost:8081/metrics
# View logs
kubectl logs -f deployment/paladin -n paladin
# Check pod status
kubectl describe pod <pod-name> -n paladin
Enable Debug Logging
# Set environment variable
export RUST_LOG=debug,paladin=trace
# Or in config.yml
logging:
  level: "debug"
  modules:
    paladin: "trace"
Collect Diagnostic Information
# System information
uname -a
rustc --version
cargo --version
# Application logs
kubectl logs deployment/paladin -n paladin --tail=1000 > paladin.log
# Metrics snapshot
curl http://localhost:8081/metrics > metrics.txt
# Configuration
kubectl get cm paladin-config -n paladin -o yaml > config.yaml
Common Issues
1. Paladin Execution Fails
Symptoms:
- PaladinError::ExecutionError
- Empty or truncated responses
- Timeout errors
Diagnosis:
# Check logs for error details
kubectl logs deployment/paladin -n paladin | grep ERROR
# Verify LLM configuration
curl http://localhost:8080/health | jq .components.llm
Solutions:
A. Invalid API Key
# Fix: Update secret with valid key
kubectl create secret generic paladin-secrets \
  --from-literal=openai-api-key="sk-..." \
  -n paladin --dry-run=client -o yaml | kubectl apply -f -
B. Model Not Found
// Fix: Use a valid model name
let paladin = PaladinBuilder::new(llm_port)
    .model("gpt-4") // Not "gpt-4-invalid"
    .build()?;
C. Rate Limiting
# Fix: Add retry logic and backoff
llm:
  max_retries: 3
  retry_delay: 2s
  timeout: 60s
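For callers that wrap LLM requests themselves, the retry settings above can be sketched in Rust as follows. This is a minimal illustration, not Paladin's implementation: `backoff_delay` and `retry_with_backoff` are hypothetical names, and both the doubling schedule and the 60-second cap are assumptions (the config only states a base `retry_delay`).

```rust
use std::thread;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
fn backoff_delay(base: Duration, attempt: u32, max: Duration) -> Duration {
    // checked_shl and checked_mul avoid overflow for very large attempt counts.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Calls `op` until it succeeds or `max_retries` retries have been used up.
/// Assumption: every error is treated as retryable.
fn retry_with_backoff<T, E, F>(max_retries: u32, base: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                thread::sleep(backoff_delay(base, attempt, Duration::from_secs(60)));
                attempt += 1;
            }
        }
    }
}
```

With `max_retries = 3` and a 2-second base, the waits between attempts are 2s, 4s, and 8s before the final error is returned.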
2. High Memory Usage
Symptoms:
- OOMKilled pods
- Memory usage > 80%
- Slow performance
Diagnosis:
# Check memory usage
kubectl top pods -n paladin
# Check Garrison size
curl http://localhost:8081/metrics | grep garrison_entries
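When the `grep` output has one line per label set, it helps to total the samples. The sketch below sums a metric from Prometheus text exposition output. `metric_total` is a hypothetical helper, and the label names used in the usage note are assumptions; only the metric name `garrison_entries` comes from the command above.

```rust
/// Sums the values of all samples named `metric` in Prometheus text output.
/// Returns None if the metric does not appear at all.
fn metric_total(body: &str, metric: &str) -> Option<f64> {
    let mut total = None;
    for line in body.lines().map(str::trim) {
        // Skip blank lines and HELP/TYPE comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // A sample looks like `name{labels} value` or `name value`.
        let name_end = line
            .find(|c: char| c == '{' || c.is_whitespace())
            .unwrap_or(line.len());
        if &line[..name_end] != metric {
            continue;
        }
        // Skip past the label set, if any, then read the first token as the value.
        let rest = match line.rfind('}') {
            Some(i) => &line[i + 1..],
            None => &line[name_end..],
        };
        if let Some(value) = rest.split_whitespace().next().and_then(|v| v.parse::<f64>().ok()) {
            *total.get_or_insert(0.0) += value;
        }
    }
    total
}
```

Given two samples such as `garrison_entries{session="a"} 120` and `garrison_entries{session="b"} 30`, the function returns `Some(150.0)`; a metric like `garrison_entries_bytes` is not counted because the name must match exactly.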
Solutions:
A. Garrison Too Large
# Fix: Reduce garrison limits
garrison:
  max_entries: 500 # Reduce from 1000
  max_tokens: 4000 # Reduce from 8000
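To see why lowering `max_entries` bounds memory, consider a store that evicts its oldest entry whenever the cap is exceeded. This is a sketch under that assumption only: `BoundedMemory` is a hypothetical type, and the Garrison's real eviction policy may differ.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded store: keeps at most `max_entries` items, oldest evicted first.
struct BoundedMemory {
    max_entries: usize,
    entries: VecDeque<String>,
}

impl BoundedMemory {
    fn new(max_entries: usize) -> Self {
        Self { max_entries, entries: VecDeque::with_capacity(max_entries) }
    }

    /// Appends an entry, then drops the oldest entries until the cap is respected.
    fn push(&mut self, entry: String) {
        self.entries.push_back(entry);
        while self.entries.len() > self.max_entries {
            self.entries.pop_front();
        }
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    /// The oldest entry still retained, if any.
    fn oldest(&self) -> Option<&str> {
        self.entries.front().map(String::as_str)
    }
}
```

Memory then grows with `max_entries` times the average entry size rather than with session length, which is the effect the lower limits aim for.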
B. Memory Leak
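# Fix: Update to latest version
docker pull ghcr.io/your-org/paladin:latest
# The restart only picks up a new image if imagePullPolicy is Always (the default for :latest)
kubectl rollout restart deployment/paladin -n paladin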
C. Insufficient Resources
# Fix: Increase resource limits
resources:
  limits:
    memory: 8Gi # Increase from 4Gi
3. Connection Refused
Symptoms:
- Cannot connect to external services
- ConnectionRefused errors
- Network timeout
Diagnosis:
# Test connectivity from pod
kubectl exec -it <pod-name> -n paladin -- curl http://redis:6379
kubectl exec -it <pod-name> -n paladin -- nslookup redis
# Check network policies
kubectl get networkpolicy -n paladin
Solutions:
A. Service Not Running
# Fix: Start the service
kubectl get svc redis -n paladin
kubectl scale statefulset redis --replicas=1 -n paladin
B. Wrong Hostname
# Fix: Use correct service DNS
queue:
  url: "redis://redis.paladin.svc.cluster.local:6379"
C. Network Policy Blocking
# Fix: Allow egress to Redis
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redis
spec:
  podSelector:
    matchLabels:
      app: paladin
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379
4. Battalion Execution Hangs
Symptoms:
- Battalion never completes
- High CPU usage
- No error messages
Diagnosis:
# Check active Paladins
curl http://localhost:8081/metrics | grep paladin_active
# Look for deadlocks
kubectl logs deployment/paladin -n paladin | grep -i "deadlock\|timeout"
Solutions:
A. Circular Dependencies (Campaign)
// Fix: Ensure the DAG has no cycles
campaign.validate()?; // Will error if cyclic
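To check a dependency graph by hand before submitting it, a cycle test can be sketched with Kahn's algorithm. This is an illustration of the general technique, not the code behind `validate()`: `has_cycle` and the map-of-dependencies input shape are assumptions.

```rust
use std::collections::{HashMap, VecDeque};

/// Returns true if the dependency graph contains a cycle.
/// `deps` maps each task to the tasks it depends on.
fn has_cycle(deps: &HashMap<&str, Vec<&str>>) -> bool {
    // in_degree[task] = number of dependencies not yet satisfied.
    let mut in_degree: HashMap<&str, usize> = HashMap::new();
    // dependents[dep] = tasks that wait on `dep`.
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();
    for (&task, task_deps) in deps {
        in_degree.entry(task).or_insert(0);
        for &dep in task_deps {
            in_degree.entry(dep).or_insert(0);
            *in_degree.entry(task).or_insert(0) += 1;
            dependents.entry(dep).or_default().push(task);
        }
    }
    // Start from tasks with no dependencies and peel the graph layer by layer.
    let mut ready: VecDeque<&str> = in_degree
        .iter()
        .filter(|(_, d)| **d == 0)
        .map(|(&t, _)| t)
        .collect();
    let mut visited = 0;
    while let Some(task) = ready.pop_front() {
        visited += 1;
        for &next in dependents.get(task).into_iter().flatten() {
            let d = in_degree.get_mut(next).expect("every node was registered above");
            *d -= 1;
            if *d == 0 {
                ready.push_back(next);
            }
        }
    }
    // Any task never reached sits on a cycle or depends on one.
    visited != in_degree.len()
}
```

A graph where `a` depends on `b` and `b` depends on `a` never yields a ready task, so the function returns `true`.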
B. Infinite Loop
// Fix: Set a reasonable max_loops
let paladin = PaladinBuilder::new(llm_port)
    .max_loops(10) // Prevent infinite loops
    .build()?;
C. Timeout Not Set
# Fix: Add execution timeout
paladin:
  timeout_seconds: 300 # 5 minutes
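If a caller needs its own guard in addition to the configured timeout, the idea can be sketched with a worker thread and a channel. `run_with_timeout` is a hypothetical helper that uses only the standard library; it stops waiting after the deadline but does not cancel the work.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `task` on a worker thread and stops waiting after `timeout`.
/// Note: the worker is not cancelled; it keeps running in the background.
fn run_with_timeout<T, F>(timeout: Duration, task: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(task());
    });
    rx.recv_timeout(timeout).ok()
}
```

A `None` result means the deadline passed first, which is the point at which a hung run would be reported instead of blocking forever.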
Performance Issues
Slow Response Times
Symptoms:
- P95 latency > 2s
- High request duration
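To make the P95 threshold concrete, the sketch below computes a nearest-rank percentile from raw latency samples in seconds. `percentile` is a hypothetical helper; in practice the value usually comes from the metrics backend rather than from hand-collected samples.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `pct` percent
/// of all samples are less than or equal to it. Returns None for an empty slice.
fn percentile(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Rank is 1-based; clamp keeps it inside the slice for pct = 0 or pct > 100.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

For the samples `[0.5, 3.0, 1.0, 0.2]`, `percentile(&samples, 95.0)` is `Some(3.0)`, which would exceed the 2s symptom threshold.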
Diagnosis:
# Check latency metrics
curl http://localhost:8081/metrics | grep duration
# Profile with flamegraph
cargo flamegraph --bin paladin-server
Solutions:
A. Slow LLM Responses
# Fix: Use faster model or increase timeout
llm:
  default_model: "gpt-3.5-turbo" # Faster than gpt-4
  timeout: 30s
B. Garrison Query Slow
-- Fix: Add index to Garrison database
CREATE INDEX idx_garrison_timestamp ON garrison_entries(timestamp);
CREATE INDEX idx_garrison_session ON garrison_entries(session_id);
C. Too Many Tool Calls
# Fix: Limit concurrent tool executions
arsenal:
  max_concurrent_tools: 5
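The effect of a concurrency cap can be sketched with a fixed set of worker threads pulling from a shared queue. This is a standard-library illustration only: `run_limited` is a hypothetical function, and the Arsenal's actual scheduler is not shown in this guide.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Runs all `jobs` using at most `max_concurrent` worker threads.
/// Results come back in completion order, not submission order.
fn run_limited<T, F>(jobs: Vec<F>, max_concurrent: usize) -> Vec<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let queue = Arc::new(Mutex::new(VecDeque::from(jobs)));
    let results = Arc::new(Mutex::new(Vec::new()));
    let workers: Vec<_> = (0..max_concurrent.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let results = Arc::clone(&results);
            thread::spawn(move || loop {
                // Hold the queue lock only long enough to take one job.
                let job = queue.lock().unwrap().pop_front();
                match job {
                    Some(job) => {
                        let out = job();
                        results.lock().unwrap().push(out);
                    }
                    None => break,
                }
            })
        })
        .collect();
    for worker in workers {
        worker.join().expect("worker thread panicked");
    }
    let mut guard = results.lock().unwrap();
    std::mem::take(&mut *guard)
}
```

Because only `max_concurrent` threads exist, no more than that many jobs run at once, however many are queued.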
High CPU Usage
Symptoms:
- CPU throttling
- Slow processing
- Increased costs
Diagnosis:
# Check CPU usage
kubectl top pods -n paladin
# Profile CPU
cargo build --release
perf record -F 99 -g ./target/release/paladin-server
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg
Solutions:
A. Too Many Replicas
# Fix: Reduce replica count
spec:
  replicas: 3 # Reduce from 10
B. Inefficient Code
# Fix: Update to optimized version
git pull origin main
cargo build --release
Configuration Issues
Invalid Configuration
Symptoms:
- Application won't start
- Configuration validation errors
Diagnosis:
# Validate configuration
paladin config validate config.yml
# Check for syntax errors
yamllint config.yml
Solutions:
# Fix: Correct YAML syntax
paladin:
  default_temperature: 0.7 # Must be a number
  max_loops: 3 # Must be an integer
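The two type rules above can be expressed as a small validation step. The sketch below is hypothetical: `PaladinSettings` and `parse_settings` are illustrative names, and the real `paladin config validate` command may check more than this.

```rust
/// Hypothetical validated settings mirroring the two keys shown above.
#[derive(Debug, PartialEq)]
struct PaladinSettings {
    default_temperature: f64,
    max_loops: u32,
}

/// Parses raw string values and reports which key is malformed.
fn parse_settings(temperature: &str, max_loops: &str) -> Result<PaladinSettings, String> {
    let default_temperature: f64 = temperature
        .trim()
        .parse()
        .map_err(|_| format!("default_temperature must be a number, got {temperature:?}"))?;
    let max_loops: u32 = max_loops
        .trim()
        .parse()
        .map_err(|_| format!("max_loops must be a non-negative integer, got {max_loops:?}"))?;
    Ok(PaladinSettings { default_temperature, max_loops })
}
```

Passing `"3.5"` for `max_loops` fails with an error that names the key, which is the kind of message to look for when the application refuses to start.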
Missing Environment Variables
Symptoms:
- "environment variable not set" errors
- API calls fail
Diagnosis:
# Check environment
kubectl exec deployment/paladin -n paladin -- env | grep -i key
Solutions:
# Fix: Set missing variables
kubectl create secret generic paladin-secrets \
  --from-literal=openai-api-key="$OPENAI_API_KEY" \
  -n paladin --dry-run=client -o yaml | kubectl apply -f -
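A startup check that lists every missing variable at once saves repeated restart cycles. The sketch below is an assumption about how such a check could look: `missing_vars` is a hypothetical helper, and the lookup is passed in as a closure so the logic can be exercised without touching the real environment.

```rust
/// Returns the names in `required` for which `lookup` yields no non-empty value.
fn missing_vars<F>(required: &[&str], lookup: F) -> Vec<String>
where
    F: Fn(&str) -> Option<String>,
{
    required
        .iter()
        .copied()
        // A variable set to an empty or whitespace-only string counts as missing.
        .filter(|name| lookup(*name).map_or(true, |v| v.trim().is_empty()))
        .map(String::from)
        .collect()
}
```

Against the real environment the call would be `missing_vars(&["OPENAI_API_KEY"], |k| std::env::var(k).ok())`; an empty result means every required variable is set.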
Deployment Issues
Pod CrashLoopBackOff
Symptoms:
- Pods constantly restarting
- CrashLoopBackOff status
Diagnosis:
# Check pod events
kubectl describe pod <pod-name> -n paladin
# View crash logs
kubectl logs <pod-name> -n paladin --previous
Solutions:
A. Missing Dependencies
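# Fix: Add runtime dependencies (libssl1.1 exists on older bases such as Debian bullseye; newer ones ship libssl3)
RUN apt-get update && apt-get install -y libssl1.1 ca-certificates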
B. Health Check Failing
# Fix: Adjust health check timing
livenessProbe:
  initialDelaySeconds: 60 # Increase from 30
  periodSeconds: 30 # Increase from 10
Image Pull Errors
Symptoms:
- ImagePullBackOff or ErrImagePull
- Pods stuck in pending
Diagnosis:
# Check image pull status
kubectl describe pod <pod-name> -n paladin | grep -A5 Events
Solutions:
# Fix: Authenticate with registry
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username="$GITHUB_USER" \
  --docker-password="$GITHUB_TOKEN" \
  -n paladin
# Update deployment to use secret
spec:
  imagePullSecrets:
    - name: ghcr-secret
Integration Issues
Redis Connection Failed
Symptoms:
- Queue operations fail
- ConnectionRefused errors
Diagnosis:
# Test Redis connectivity
kubectl exec deployment/paladin -n paladin -- redis-cli -h redis ping
Solutions:
# Fix: Restart Redis
kubectl rollout restart statefulset redis -n paladin
# Or check authentication
kubectl get secret redis-auth -n paladin -o jsonpath='{.data.password}' | base64 -d
MinIO/S3 Errors
Symptoms:
- File storage operations fail
- AccessDenied errors
Diagnosis:
# Test MinIO connectivity
kubectl exec deployment/paladin -n paladin -- \
  curl -v http://minio:9000/minio/health/live
Solutions:
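# Fix: Update credentials (minioadmin is MinIO's default; substitute your real values)
kubectl create secret generic minio-credentials \
  --from-literal=access-key="minioadmin" \
  --from-literal=secret-key="minioadmin" \
  -n paladin --dry-run=client -o yaml | kubectl apply -f -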
LLM Provider Issues
Symptoms:
- API rate limiting
- Invalid credentials
- Model unavailable
Solutions:
A. Rate Limit Exceeded
# Fix: Add rate limiting
llm:
  rate_limit:
    requests_per_minute: 60
    tokens_per_minute: 90000
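A client-side limit such as `requests_per_minute` is commonly enforced with a token bucket. The sketch below shows that technique under stated assumptions: `TokenBucket` is a hypothetical type, the bucket starts full, and time is passed in explicitly so the behavior is easy to reason about. Whether Paladin uses a token bucket internally is not stated in this guide.

```rust
use std::time::Instant;

/// Token bucket: holds up to `capacity` tokens, refilled at `refill_per_sec`.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// A bucket allowing `per_minute` requests per minute, starting full.
    fn per_minute(per_minute: u32, now: Instant) -> Self {
        let capacity = f64::from(per_minute);
        Self { capacity, tokens: capacity, refill_per_sec: capacity / 60.0, last_refill: now }
    }

    /// Tries to take one token at time `now`; false means the caller should wait.
    fn try_acquire(&mut self, now: Instant) -> bool {
        // Refill in proportion to elapsed time, never beyond capacity.
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With 60 requests per minute, a burst of 60 calls succeeds immediately, the next call is refused, and one more token becomes available after each further second.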
B. Switch Provider
# Fix: Use fallback provider
llm:
  providers:
    - openai
    - deepseek # Fallback
    - anthropic # Fallback
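The ordered fallback above amounts to trying each provider in turn and keeping the first success. The sketch below illustrates that logic: `first_available` is a hypothetical function and the provider call is a caller-supplied closure, so nothing here reflects Paladin's actual provider API.

```rust
/// Tries each provider in order and returns the first success with its provider name.
/// If every provider fails, returns all errors in the order they occurred.
fn first_available<T, E, F>(
    providers: &[&str],
    mut call: F,
) -> Result<(String, T), Vec<(String, E)>>
where
    F: FnMut(&str) -> Result<T, E>,
{
    let mut errors = Vec::new();
    for &provider in providers {
        match call(provider) {
            Ok(value) => return Ok((provider.to_string(), value)),
            // Remember why this provider failed, then move on to the next one.
            Err(err) => errors.push((provider.to_string(), err)),
        }
    }
    Err(errors)
}
```

Keeping the per-provider errors makes it clear in logs whether the primary was rate limited or every provider was down.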
Getting Help
Collect Debug Bundle
#!/bin/bash
# debug-bundle.sh
NAMESPACE="paladin"
OUTPUT="debug-bundle-$(date +%Y%m%d-%H%M%S).tar.gz"
mkdir -p debug-bundle
cd debug-bundle
# Logs
kubectl logs deployment/paladin -n "$NAMESPACE" > paladin.log
# Configuration (Secrets are left out so credentials never end up in the bundle)
kubectl get all,cm -n "$NAMESPACE" -o yaml > resources.yaml
# Metrics
curl -s http://localhost:8081/metrics > metrics.txt
# Events
kubectl get events -n "$NAMESPACE" > events.txt
cd ..
tar czf "$OUTPUT" debug-bundle/
echo "Debug bundle created: $OUTPUT"
Open an Issue
Include:
- Paladin version
- Deployment environment (Docker/K8s)
- Error messages and logs
- Steps to reproduce
- Expected vs actual behavior
Community Support
- GitHub Issues: Bug reports and feature requests
- Discussions: Questions and community help
- Discord: Real-time chat support
Next Steps
- Monitoring - Set up monitoring
- Performance Tuning - Optimize performance
- Logging - Configure logging