Performance Tuning Guide

A comprehensive guide to optimizing Paladin performance across different workloads and deployment scenarios.

Performance Baselines

Expected Performance

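| Metric             | Target    | Acceptable | Action Required |
|--------------------|-----------|------------|-----------------|
| Throughput         | ≥10 req/s | ≥5 req/s   | <5 req/s        |
| P95 Latency        | <2s       | <5s        | >5s             |
| Memory per Paladin | <50MB     | <100MB     | >100MB          |
| CPU per Paladin    | <100m     | <200m      | >200m           |
| Error Rate         | <0.1%     | <1%        | >1%             |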
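
The latency row of the table can be checked mechanically against recorded samples. The following is a minimal sketch, not Paladin code: the `Rating` enum and the `p95` and `rate_latency` helpers are hypothetical names introduced for illustration, and the nearest-rank percentile method is an assumption (the source does not say how P95 is computed).

```rust
/// Rating bands from the baseline table (hypothetical type, not part of Paladin).
#[derive(Debug, PartialEq)]
pub enum Rating {
    Target,
    Acceptable,
    ActionRequired,
}

/// Nearest-rank P95 of latency samples in seconds; `None` for an empty slice.
pub fn p95(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank: rank = ceil(0.95 * n), computed in integers to avoid
    // floating-point rounding surprises.
    let rank = (sorted.len() * 95 + 99) / 100;
    Some(sorted[rank - 1])
}

/// Classifies a P95 latency: below 2s is the target, below 5s is acceptable.
/// The table leaves exactly 5s unspecified; this sketch treats it as
/// requiring action.
pub fn rate_latency(p95_secs: f64) -> Rating {
    if p95_secs < 2.0 {
        Rating::Target
    } else if p95_secs < 5.0 {
        Rating::Acceptable
    } else {
        Rating::ActionRequired
    }
}
```

The same three-band comparison applies to the other rows once their units are fixed (requests per second, megabytes, millicores, percent).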

Benchmark Results

Garrison Memory Operations (Measured - January 2026):

Single Entry Operations:

  • Add entry (10 chars): ~170 ns
  • Add entry (100 chars): ~210 ns
  • Add entry (1000 chars): ~225 ns
  • Add entry (10000 chars): ~380 ns

Batch Operations:

  • Add 10 entries: ~1.05 µs (105 ns/entry)
  • Add 50 entries: ~4.2 µs (84 ns/entry)
  • Add 100 entries: ~8.0 µs (80 ns/entry)
  • Add 500 entries: ~37.5 µs (75 ns/entry)

Retrieval Operations:

  • Get last 10 entries: ~33 ns
  • Get last 50 entries: ~46 ns
  • Get all (100 entries): ~55 ns

Eviction Strategies:

  • FIFO eviction: ~280 ns/eviction
  • SlidingWindow eviction: ~295 ns/eviction

Realistic Conversation (10 turns, 20 messages): ~3.35 µs

Battalion Orchestration (Measured - January 2026):

Formation (Sequential):

  • 3 Paladins (10ms latency): ~30 ms total
  • 5 Paladins (10ms latency): ~50 ms total
  • 10 Paladins (10ms latency): ~100 ms total

Phalanx (Concurrent):

  • 3-20 Paladins (10ms latency): ~10 ms total (parallel)

Orchestration Overhead (Zero Latency):

  • Formation (5 Paladins): ~1.8 µs pure overhead
  • Phalanx (5 Paladins): ~25 µs pure overhead

Aggregation Strategies:

  • CollectAll: ~25 µs
  • FirstSuccess: ~2.6 µs
  • Majority: ~25 µs

Herald Output Formatting (Measured - January 2026):

  • JSON (1KB): ~2.3 µs
  • Markdown (1KB): ~570 ns (fastest)
  • Table (1KB): ~5.5 µs
  • JSON (10KB): ~10 µs
  • Markdown (10KB): ~2.3 µs
  • Table (10KB): ~23 µs

Key Insights:

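  • Garrison single-entry, retrieval, and eviction operations are all sub-microsecond
  • Batching lowers per-entry cost from ~105 ns (10 entries) to ~75 ns (500 entries), roughly a 29% improvement
  • Battalion orchestration overhead is negligible vs LLM latency
  • Markdown formatting is ~4x faster than JSON and ~10x faster than Table output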
  • All orchestration overhead < 100µs (LLM calls dominate at 1-5s)
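
The Formation and Phalanx figures above follow a simple latency model: sequential execution sums per-Paladin latency, concurrent execution is bounded by the slowest Paladin. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the model deliberately ignores the microsecond-level orchestration overhead measured above.

```rust
use std::time::Duration;

/// Formation runs Paladins one after another: total is the sum of latencies.
pub fn formation_total(latencies: &[Duration]) -> Duration {
    latencies.iter().sum()
}

/// Phalanx runs Paladins concurrently: total is the slowest Paladin.
pub fn phalanx_total(latencies: &[Duration]) -> Duration {
    latencies.iter().copied().max().unwrap_or(Duration::ZERO)
}

/// Fraction of end-to-end time spent in orchestration overhead.
pub fn overhead_fraction(overhead: Duration, llm_latency: Duration) -> f64 {
    overhead.as_secs_f64() / (overhead + llm_latency).as_secs_f64()
}
```

With five Paladins at 10 ms each, this gives 50 ms for Formation and 10 ms for Phalanx, matching the measurements. A 25 µs Phalanx overhead against a 1 s LLM call works out to about 0.0025% of total time.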

Benchmarking

Running Benchmarks

# All benchmarks
cargo bench

# Specific benchmark
cargo bench paladin_execution

# With baseline comparison
cargo bench --bench paladin_benchmarks -- --save-baseline v0.1.0
cargo bench --bench paladin_benchmarks -- --baseline v0.1.0

# Render report plots with gnuplot (reports are written under target/criterion/)
cargo bench --bench paladin_benchmarks -- --plotting-backend gnuplot

Custom Benchmarks

// Requires criterion's `async_tokio` feature for `to_async` with a Tokio runtime
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn paladin_benchmark(c: &mut Criterion) {
    // One runtime shared across all iterations
    let rt = tokio::runtime::Runtime::new().unwrap();
    let paladin = create_test_paladin();

    c.bench_function("paladin execution", |b| {
        b.to_async(&rt).iter(|| async {
            // black_box keeps the compiler from optimizing the call away
            let result = paladin.execute(black_box("test input")).await;
            black_box(result)
        })
    });
}

criterion_group!(benches, paladin_benchmark);
criterion_main!(benches);

Load Testing

# Using Apache Bench
ab -n 1000 -c 10 -T 'application/json' \
  -p request.json \
  http://localhost:8080/api/paladin/execute

# Using k6
k6 run --vus 10 --duration 30s load-test.js

LLM Optimization

Model Selection

# Use appropriate model for task complexity
llm:
  model_routing:
    simple_tasks:
      model: "gpt-3.5-turbo"  # 5-10x faster than GPT-4
      max_tokens: 500

    complex_tasks:
      model: "gpt-4"
      max_tokens: 2000

    classification:
      model: "gpt-3.5-turbo"  # Sufficient for most classification
      temperature: 0.1

Request Batching

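use std::sync::Arc;
use std::time::Duration;

// Batch similar requests
pub struct LlmBatcher {
    llm_port: Arc<dyn LlmPort>,
    pending: Vec<LlmRequest>,
    max_batch_size: usize,
    // Interval at which the owner should call `flush` for partial batches
    max_wait_time: Duration,
}

impl LlmBatcher {
    // Queues a request; returns the batch's responses once the batch is full
    pub async fn add_request(
        &mut self,
        request: LlmRequest,
    ) -> Result<Option<Vec<LlmResponse>>> {
        self.pending.push(request);

        if self.pending.len() >= self.max_batch_size {
            return self.flush().await.map(Some);
        }

        // Not full yet. Sleeping here would hold `&mut self` and block other
        // callers, so a timer owned by the caller (e.g. `tokio::time::interval`
        // with `max_wait_time`) should call `flush` for partial batches.
        Ok(None)
    }

    // Sends everything queued so far as one batch
    pub async fn flush(&mut self) -> Result<Vec<LlmResponse>> {
        let batch = std::mem::take(&mut self.pending);
        self.llm_port.generate_batch(batch).await
    }
}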

Caching Responses

use std::sync::Arc;
use std::time::Duration;

use moka::future::Cache;

pub struct CachedLlmPort {
    inner: Arc<dyn LlmPort>,
    cache: Cache<String, LlmResponse>,
}

impl CachedLlmPort {
    pub fn new(port: Arc<dyn LlmPort>, max_capacity: u64) -> Self {
        Self {
            inner: port,
            cache: Cache::builder()
                .max_capacity(max_capacity)
                // Cached responses expire after one hour
                .time_to_live(Duration::from_secs(3600))
                .build(),
        }
    }

    async fn generate_cached(&self, messages: &[Message]) -> Result<LlmResponse> {
        let key = compute_cache_key(messages);

        // Cache hit: skip the LLM call entirely
        if let Some(cached) = self.cache.get(&key).await {
            return Ok(cached);
        }

        // Cache miss: call the LLM and store the response for later requests
        let response = self.inner.generate(messages).await?;
        self.cache.insert(key, response.clone()).await;
        Ok(response)
    }
}
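
The caching code calls `compute_cache_key` without defining it. One possible shape is sketched below under stated assumptions: the `Message` struct with `role` and `content` fields is a stand-in for Paladin's real message type, and hashing with the standard library's `DefaultHasher` is an illustrative choice, not the project's documented approach.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the real message type (assumed fields).
#[derive(Hash)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Derives a cache key from the full message sequence, so any change in
/// role, content, or ordering yields a different key.
pub fn compute_cache_key(messages: &[Message]) -> String {
    let mut hasher = DefaultHasher::new();
    // Hashing a slice covers its length and every element in order.
    messages.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}
```

A 64-bit hash is adequate for an in-process cache, but a collision would return the wrong response, and `DefaultHasher` output is not guaranteed to be stable across Rust releases. For a persistent or shared cache, a cryptographic digest is the safer choice. Model name and sampling parameters should also feed into the key if they vary per request.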

Streaming for Long Responses

use futures::{Stream, StreamExt};

// Use streaming to reduce perceived latency
pub async fn execute_with_streaming(
    paladin: &Paladin,
    input: &str,
) -> Result<impl Stream<Item = String>> {
    let stream = paladin.execute_stream(input).await?;

    // `map` comes from `StreamExt` and transforms each chunk as it arrives
    Ok(stream.map(|chunk| {
        // Process chunk immediately
        format!("Received: {}\n", chunk.content)
    }))
}

Memory Optimization

Garrison Configuration

# Optimize memory usage
garrison:
  type: "sqlite"
  max_entries: 500        # Reduce from default 1000
  max_tokens: 4000        # Reduce from default 8000

  # Use sliding window for active conversations
  windowing:
    strategy: "sliding"
    window_size: 10       # Keep last 10 messages

  # Aggressive cleanup
  cleanup:
    enabled: true
    interval: "5m"
    max_age: "1h"
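
The `sliding` strategy with `window_size: 10` keeps only the most recent entries and drops the oldest as new ones arrive. A minimal sketch of that behavior follows. `SlidingWindow` is a hypothetical type for illustration and does not reflect how Garrison implements windowing internally.

```rust
use std::collections::VecDeque;

/// Keeps at most `window_size` entries, evicting the oldest first.
pub struct SlidingWindow<T> {
    entries: VecDeque<T>,
    window_size: usize,
}

impl<T> SlidingWindow<T> {
    pub fn new(window_size: usize) -> Self {
        Self {
            entries: VecDeque::with_capacity(window_size),
            window_size,
        }
    }

    /// Adds an entry and returns the evicted one, if any.
    pub fn push(&mut self, entry: T) -> Option<T> {
        if self.window_size == 0 {
            // A zero-sized window retains nothing.
            return Some(entry);
        }
        let evicted = if self.entries.len() >= self.window_size {
            self.entries.pop_front()
        } else {
            None
        };
        self.entries.push_back(entry);
        evicted
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Iterates from oldest to newest retained entry.
    pub fn iter(&self) -> impl Iterator<Item = &T> {
        self.entries.iter()
    }
}
```

Both `pop_front` and `push_back` on a `VecDeque` are constant-time, so eviction cost does not grow with the window size.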

Memory Pooling

use tokio::sync::Mutex;

// Every access mutates the pool, so a Mutex fits better than an RwLock
pub struct MemoryPool<T> {
    pool: Mutex<Vec<T>>,
    factory: Box<dyn Fn() -> T + Send + Sync>,
}

impl<T> MemoryPool<T> {
    // Reuses a pooled item if one is available, otherwise builds a new one
    pub async fn acquire(&self) -> T {
        let mut pool = self.pool.lock().await;
        pool.pop().unwrap_or_else(|| (self.factory)())
    }

    // Returns an item to the pool; items beyond the cap are simply dropped
    pub async fn release(&self, item: T) {
        let mut pool = self.pool.lock().await;
        if pool.len() < 100 {  // Max pool size
            pool.push(item);
        }
    }
}

Lazy Loading

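use std::sync::Arc;

use tokio::sync::RwLock;
use uuid::Uuid;

// Load garrison entries on-demand
pub struct LazyGarrison {
    session_id: Uuid,
    cache: RwLock<Option<Vec<GarrisonEntry>>>,
    repository: Arc<dyn GarrisonRepository>,
}

impl LazyGarrison {
    pub async fn get_entries(&self) -> Result<Vec<GarrisonEntry>> {
        // Fast path: entries are already loaded
        let cache = self.cache.read().await;
        if let Some(entries) = cache.as_ref() {
            return Ok(entries.clone());
        }
        drop(cache);

        // Slow path: take the write lock and re-check, so concurrent callers
        // that raced past the read check do not load the entries twice
        let mut cache = self.cache.write().await;
        if let Some(entries) = cache.as_ref() {
            return Ok(entries.clone());
        }
        let entries = self.repository.load(self.session_id).await?;
        *cache = Some(entries.clone());
        Ok(entries)
    }
}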

Concurrency Tuning

Thread Pool Configuration

use tokio::runtime::{Builder, Runtime};

pub fn create_runtime() -> Runtime {
    Builder::new_multi_thread()
        .worker_threads(8)              // Set to the number of CPU cores
        .max_blocking_threads(16)       // For blocking operations
        .thread_name("paladin-worker")
        .thread_stack_size(3 * 1024 * 1024)  // 3MB stack
        .enable_all()                   // Enable the I/O and time drivers
        .build()
        .expect("failed to build Tokio runtime")
}
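
Rather than hard-coding 8 worker threads, the count can be derived from the host at startup. The sketch below uses only the standard library. The helper names are hypothetical, and the 2:1 ratio of blocking to worker threads is an assumption read off the example values above (16 and 8), not a documented rule.

```rust
use std::thread;

/// Number of worker threads to request: one per available CPU, with a
/// fallback of 1 if the platform cannot report parallelism.
pub fn worker_threads() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

/// Blocking-thread cap, assuming the 2:1 ratio used in the example above.
pub fn max_blocking_threads(workers: usize) -> usize {
    workers * 2
}
```

These values can be passed to `worker_threads(...)` and `max_blocking_threads(...)` on the runtime builder. In containers, `available_parallelism` may account for CPU limits on some platforms, so it is worth verifying the reported value against the pod's CPU limit.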

Concurrency Limits

# Control concurrent operations
paladin:
  max_concurrent_executions: 100

arsenal:
  max_concurrent_tools: 10
  tool_timeout: 30s

battalion:
  phalanx:
    max_concurrent_paladins: 5

Backpressure Handling

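use std::sync::Arc;

use tokio::sync::{OwnedSemaphorePermit, Semaphore};

pub struct RateLimiter {
    semaphore: Arc<Semaphore>,
}

impl RateLimiter {
    pub fn new(max_concurrent: usize) -> Self {
        Self {
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    // Waits until a slot is free. The slot returns to the semaphore when the
    // permit is dropped, so callers must hold it for the duration of the work.
    // (Calling `forget()` on a permit would leak the slot permanently.)
    pub async fn acquire(&self) -> Result<OwnedSemaphorePermit> {
        self.semaphore
            .clone()
            .acquire_owned()
            .await
            // Only fails if the semaphore has been closed
            .map_err(|_| Error::RateLimitExceeded)
    }
}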

Database Optimization

SQLite Configuration

-- Optimize SQLite for performance
PRAGMA page_size = 4096;             -- Set before the database is created; cannot change in WAL mode
PRAGMA journal_mode = WAL;           -- Write-Ahead Logging
PRAGMA synchronous = NORMAL;         -- Balance safety/speed
PRAGMA cache_size = -64000;          -- ~64MB cache (negative values are in KiB)
PRAGMA temp_store = MEMORY;          -- In-memory temp tables
PRAGMA mmap_size = 268435456;        -- 256MB memory-mapped I/O

-- Add indexes for common queries
CREATE INDEX IF NOT EXISTS idx_garrison_session
  ON garrison_entries(session_id, timestamp);

-- Full-text search: GIN indexes and to_tsvector are PostgreSQL features and
-- do not exist in SQLite. Use an FTS5 virtual table instead and keep it in
-- sync with garrison_entries.
CREATE VIRTUAL TABLE IF NOT EXISTS garrison_search
  USING fts5(content);
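
The sign convention of `cache_size` is easy to misread: SQLite interprets a negative value as a size in KiB and a positive value as a page count. The sketch below works through the arithmetic for the values used above. `cache_size_bytes` is a hypothetical helper for illustration, not part of any SQLite binding.

```rust
/// Approximate cache memory in bytes for a `PRAGMA cache_size` value.
/// Negative N means abs(N) KiB; positive N means N pages of `page_size` bytes.
pub fn cache_size_bytes(pragma_value: i64, page_size: u64) -> u64 {
    if pragma_value < 0 {
        pragma_value.unsigned_abs() * 1024
    } else {
        pragma_value as u64 * page_size
    }
}

/// Converts mebibytes to bytes, e.g. for `PRAGMA mmap_size`.
pub fn mib_to_bytes(mib: u64) -> u64 {
    mib * 1024 * 1024
}
```

With the settings above, `cache_size = -64000` is 65,536,000 bytes (about 62.5 MiB), and `mmap_size = 268435456` is exactly 256 MiB.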

Connection Pooling

use std::time::Duration;

use sqlx::sqlite::{SqlitePool, SqlitePoolOptions};

pub async fn create_pool(database_url: &str) -> Result<SqlitePool> {
    let pool = SqlitePoolOptions::new()
        .max_connections(10)
        .min_connections(2)
        .acquire_timeout(Duration::from_secs(5))    // Fail fast when the pool is exhausted
        .idle_timeout(Duration::from_secs(600))     // Close connections idle for 10 minutes
        .max_lifetime(Duration::from_secs(1800))    // Recycle connections after 30 minutes
        .connect(database_url)
        .await?;
    Ok(pool)
}

Query Optimization

// Use prepared statements (`query!` checks the SQL at compile time and
// binds parameters instead of interpolating them)
let rows = sqlx::query!(
    "SELECT * FROM garrison_entries
     WHERE session_id = ? AND timestamp > ?
     ORDER BY timestamp DESC
     LIMIT ?",
    session_id,
    cutoff_time,
    limit
)
.fetch_all(&pool)
.await?;

// Batch inserts: one transaction means one commit instead of one per row
let mut tx = pool.begin().await?;
for entry in entries {
    sqlx::query!(
        "INSERT INTO garrison_entries (session_id, content, timestamp)
         VALUES (?, ?, ?)",
        entry.session_id, entry.content, entry.timestamp
    )
    .execute(&mut *tx)
    .await?;
}
tx.commit().await?;

Network Optimization

Connection Reuse

use std::time::Duration;

use lazy_static::lazy_static;
use reqwest::Client;

// Reuse HTTP client: a single Client shares one connection pool across requests
lazy_static! {
    static ref HTTP_CLIENT: Client = Client::builder()
        .pool_max_idle_per_host(10)
        .pool_idle_timeout(Duration::from_secs(90))
        .timeout(Duration::from_secs(30))
        .build()
        .unwrap();
}

Compression

# Enable response compression
server:
  compression:
    enabled: true
    level: 6              # Balance between size and CPU
    min_size: 1024        # Only compress responses > 1KB

HTTP/2 and Keep-Alive

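use std::time::Duration;

let client = reqwest::Client::builder()
    // Forces HTTP/2 without negotiation; use only when every upstream speaks
    // HTTP/2. Over TLS, HTTP/2 can be negotiated via ALPN without this call.
    .http2_prior_knowledge()
    .tcp_keepalive(Duration::from_secs(60))
    .pool_max_idle_per_host(10)
    .build()?;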

Resource Allocation

Kubernetes Resource Tuning

resources:
  requests:
    cpu: "1000m"        # Guaranteed
    memory: "2Gi"
  limits:
    cpu: "4000m"        # Allow bursting
    memory: "4Gi"       # Hard limit

# Horizontal Pod Autoscaler
autoscaling:
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
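
The replica count the autoscaler aims for follows the formula in the Kubernetes documentation: `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. A minimal sketch of that calculation follows. The function name is hypothetical, and the sketch omits the controller's tolerance band and stabilization windows.

```rust
/// Replica count the HPA would target for a utilization metric, clamped to
/// the configured minimum and maximum.
pub fn desired_replicas(
    current_replicas: u32,
    current_utilization: f64,
    target_utilization: f64,
    min_replicas: u32,
    max_replicas: u32,
) -> u32 {
    let raw = (current_replicas as f64 * current_utilization / target_utilization).ceil();
    (raw as u32).clamp(min_replicas, max_replicas)
}
```

With the settings above (target 70%, 3 to 10 replicas), three pods averaging 90% CPU scale to four, while a spike to 280% is capped at ten.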

JVM-Style Tuning (for context)

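# Rust doesn't need JVM tuning, but consider:

# 1. Release build optimizations
cargo build --release

# 2. Profile-guided optimization (PGO): build instrumented, run a
#    representative workload, merge the profiles, then rebuild with them
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release

# 3. Custom build profile (requires a [profile.production] section in Cargo.toml;
#    selecting a profile does not enable PGO by itself)
cargo build --profile production

# 4. Link-time optimization (in Cargo.toml)
[profile.release]
lto = "fat"
codegen-units = 1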

Monitoring Resource Usage

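// Uses the trait-based API of sysinfo 0.29 and earlier; later releases
// removed the `*Ext` traits
use sysinfo::{CpuExt, System, SystemExt};
use tracing::info;

pub fn log_resource_usage() {
    // CPU usage is computed between refreshes, so in a long-running service
    // keep one `System` alive and refresh it periodically for accurate values
    let mut system = System::new_all();
    system.refresh_all();

    info!(
        cpu_usage = system.global_cpu_info().cpu_usage(),
        memory_used = system.used_memory(),
        memory_total = system.total_memory(),
        "Resource usage"
    );
}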

Performance Checklist

Before production deployment:

  • Run benchmarks and verify targets met
  • Profile CPU and memory usage under load
  • Test with expected concurrency levels
  • Verify database indexes exist
  • Enable connection pooling
  • Configure resource limits
  • Set up monitoring and alerts
  • Test auto-scaling behavior
  • Optimize LLM model selection
  • Enable response caching where appropriate

Next Steps