Production Multi-LoRA at Scale

What Problem Does This Solve?

You’ve understood the mechanics of multi-LoRA serving (Blog A3-A5). Now the question is: how do you run this at scale? Serving 16 adapters on one GPU is straightforward. Serving 1,000+ adapters across a GPU cluster with tensor parallelism, data parallelism, dynamic adapter loading, and SLA guarantees — that’s a different problem.

This blog covers the techniques and patterns for production-grade multi-LoRA deployments.


S-LoRA: Unified Paging for Adapter Weights

The Adapter Memory Problem

In Blog A4, we sized --max-loras to pre-allocate GPU memory for adapter slots. But with thousands of adapters, we can’t pre-allocate slots for all of them:

1,000 adapters × 109 MB each (r=16, Llama-8B) = 109 GB
→ doesn't fit on any single GPU

But at any given time, only 10-20 adapters are "hot" (receiving requests).
The other 980+ are idle.

S-LoRA’s insight: page adapter weights the same way we page KV cache — in fact, S-LoRA allocates them from the same unified memory pool the KV cache uses.

How Adapter Weight Paging Works

From the inference engine series (Blog 3), you know how paged attention manages the KV cache: a pool of fixed-size blocks, allocated on demand, freed when done, scattered across non-contiguous memory locations.

S-LoRA applies the same idea to LoRA weights:

KV Cache Paging (Blog 3):
  Block pool: [0][1][2][3][4][5]...
  Sequence A's KV: blocks [3, 7, 1]    ← scattered, not contiguous
  Sequence B's KV: blocks [0, 5]

Adapter Weight Paging (S-LoRA):
  Block pool: [0][1][2][3][4][5]...
  Adapter "medical" weights: blocks [2, 8, 4]
  Adapter "legal" weights: blocks [1, 6, 9]
  Adapter "code" weights: blocks [3, 7]

Benefits:

  • No pre-allocation: adapters are loaded on-demand, one block at a time
  • No fragmentation: blocks are fixed-size, allocated from a shared pool
  • Graceful degradation: if GPU memory is full, page adapter blocks to CPU
  • Memory sharing: shared blocks between adapters (if they have identical layers)
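
To make the bookkeeping concrete, here is a minimal sketch of a shared block pool for adapter weights. The class, block size, and numbers are illustrative assumptions, not S-LoRA’s actual implementation:

BLOCK_SIZE = 4 * 1024 * 1024                 # 4 MB blocks (assumed size)

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # indices into one big GPU tensor
        self.owner = {}                      # adapter name -> [block ids]

    def load_adapter(self, name, nbytes):
        needed = -(-nbytes // BLOCK_SIZE)    # ceil division
        if len(self.free) < needed:
            raise MemoryError("evict an adapter or page blocks to CPU")
        # Blocks come from wherever the pool has them — scattered is fine
        self.owner[name] = [self.free.pop() for _ in range(needed)]
        return self.owner[name]

    def unload_adapter(self, name):
        self.free.extend(self.owner.pop(name))  # blocks return to the pool

pool = BlockPool(num_blocks=256)
print(pool.load_adapter("medical", nbytes=109 * 1024 * 1024))  # 28 blocks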

GPU-CPU Tiered Storage

S-LoRA uses a two-tier memory hierarchy:

           ┌────────────────────┐
           │    GPU Memory      │  ← Hot adapters (actively serving)
           │  (fast, limited)   │
           │                    │
           │  Adapter blocks    │
           │  for active LoRAs  │
           └────────┬───────────┘
                    │ swap in / swap out
           ┌────────▼───────────┐
           │    CPU Memory      │  ← Warm adapters (recently used)
           │  (slower, larger)  │
           │                    │
           │  Adapter blocks    │
           │  for cached LoRAs  │
           └────────┬───────────┘
                    │ load / evict
           ┌────────▼───────────┐
           │    Disk Storage    │  ← Cold adapters (not recently used)
           │  (slowest, huge)   │
           └────────────────────┘

When a request arrives for a “warm” adapter (in CPU memory), its weights are copied to the GPU in ~5-20 ms via host-to-device DMA — much faster than loading from disk (~50-200 ms).

Connection to vLLM

vLLM’s implementation draws from S-LoRA but uses a simpler approach: pre-allocated adapter slots (Blog A4) with LRU eviction. The --max-cpu-loras flag enables the CPU tier:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --max-cpu-loras 64

This keeps 8 adapters on GPU and 64 in CPU memory. The remaining adapters are on disk.
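
The eviction policy behaves like two nested LRU caches. Here is a sketch of the idea — not vLLM’s actual code, and load_from_disk is a stand-in for real weight loading:

from collections import OrderedDict

def load_from_disk(name):
    return f"<weights:{name}>"      # placeholder for a real safetensors load

class TieredAdapterCache:
    def __init__(self, gpu_slots=8, cpu_slots=64):  # mirrors the flags above
        self.gpu = OrderedDict()    # name -> weights resident on GPU
        self.cpu = OrderedDict()    # name -> weights in host memory
        self.gpu_slots, self.cpu_slots = gpu_slots, cpu_slots

    def get(self, name):
        if name in self.gpu:
            self.gpu.move_to_end(name)                 # refresh LRU order
            return self.gpu[name]
        weights = self.cpu.pop(name, None) or load_from_disk(name)
        if len(self.gpu) >= self.gpu_slots:
            old, old_w = self.gpu.popitem(last=False)  # evict LRU GPU slot
            self.cpu[old] = old_w                      # demote to CPU tier
            if len(self.cpu) > self.cpu_slots:
                self.cpu.popitem(last=False)           # fall back to disk
        self.gpu[name] = weights
        return weights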


LoRA + Tensor Parallelism

The Challenge

When the base model is sharded across GPUs with tensor parallelism (TP), how do the LoRA weights get distributed?

TP=2, Llama 3.1-70B:
  GPU 0: half of every linear layer's weights
  GPU 1: the other half
  (which axis is split depends on the layer type — see below)

Where do the LoRA A and B matrices go?

Sharding Strategy

The LoRA matrices must be sharded consistently with the base model layers they attach to:

Column-parallel layers (Q, K, V, Gate, Up projections):

Base:  W₀ is split along the output dim → GPU 0 gets W₀[:dim/2, :]
                                        → GPU 1 gets W₀[dim/2:, :]

LoRA A (rank × hidden_dim):  Replicated on both GPUs
  → A is the "input" side — every GPU needs the full input projection
  → A is small (rank × hidden_dim), so replication is cheap

LoRA B (hidden_dim × rank):  Output-dim split, matching the base
  → GPU 0 gets B[:dim/2, :]
  → GPU 1 gets B[dim/2:, :]

Computation on GPU 0:
  base_out = W₀[:dim/2, :] @ x        (column-parallel base)
  lora_out = B[:dim/2, :] @ (A @ x)   (LoRA, sharded B)
  local_out = base_out + lora_out

  → AllReduce to combine across GPUs? No — column-parallel output
    shards are concatenated, not summed

Row-parallel layers (O, Down projections):

Base:  W₀ is split along the input dim → GPU 0 gets W₀[:, :dim/2]
                                       → GPU 1 gets W₀[:, dim/2:]

LoRA A (rank × hidden_dim):  Input-dim split, matching the base
  → GPU 0 gets A[:, :dim/2]
  → GPU 1 gets A[:, dim/2:]

LoRA B (hidden_dim × rank):  Replicated on both GPUs
  → B is the "output" side — every GPU produces a full-size partial output

Computation on GPU 0:
  base_out = W₀[:, :dim/2] @ x_local       (row-parallel base, partial sum)
  lora_out = B @ (A[:, :dim/2] @ x_local)  (LoRA, sharded A, partial sum)
  local_out = base_out + lora_out

  → AllReduce sums the partial outputs across GPUs (same as the base model)

Summary of LoRA Sharding

Layer Type       Base Sharding     LoRA A           LoRA B
──────────────────────────────────────────────────────────────
Column-parallel  Output-dim split  Replicated       Output-dim split
Row-parallel     Input-dim split   Input-dim split  Replicated

The key insight: LoRA adds zero additional communication. The AllReduce pattern is identical to the base model’s. The only overhead is the LoRA matmuls, which are tiny.
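
This is easy to verify numerically. Here is a toy NumPy check (dimensions made up) that the sharded base + LoRA computation reproduces the unsharded result for both layer types:

import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                         # toy hidden dim and LoRA rank
W0 = rng.standard_normal((d, d))    # base weight, y = W0 @ x convention
A  = rng.standard_normal((r, d))    # LoRA A (rank × hidden)
B  = rng.standard_normal((d, r))    # LoRA B (hidden × rank)
x  = rng.standard_normal(d)

ref = W0 @ x + B @ (A @ x)          # unsharded reference
h = d // 2

# Column-parallel: output dim split, A replicated, B output-dim split
out0 = W0[:h, :] @ x + B[:h, :] @ (A @ x)    # "GPU 0"
out1 = W0[h:, :] @ x + B[h:, :] @ (A @ x)    # "GPU 1"
assert np.allclose(np.concatenate([out0, out1]), ref)  # concat, no AllReduce

# Row-parallel: input dim split, A input-dim split, B replicated
p0 = W0[:, :h] @ x[:h] + B @ (A[:, :h] @ x[:h])   # "GPU 0" partial sum
p1 = W0[:, h:] @ x[h:] + B @ (A[:, h:] @ x[h:])   # "GPU 1" partial sum
assert np.allclose(p0 + p1, ref)                  # AllReduce = sum of partials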

Memory Impact

TP=4, Llama 3.1-70B, rank=16, max_loras=8:

Per GPU (base model):       140 GB (FP16) / 4 = 35 GB
Per GPU (LoRA, replicated): A matrices ≈ 13 MB (replicated)
Per GPU (LoRA, sharded):    B matrices ≈ 100 MB / 4 = 25 MB
Total LoRA per GPU:         ~38 MB × 8 adapters = 304 MB

LoRA memory: 304 MB / 35,000 MB base ≈ 0.9% overhead

LoRA adds negligible memory compared to the sharded base model.


LoRA + Data Parallelism

Independent Adapters per Replica

With data parallelism (DP), each replica is an independent inference engine:

DP=2:
  Replica 0: base model + adapters [medical, legal, code]
  Replica 1: base model + adapters [medical, french, spanish]

Each replica manages its own adapter set independently.

Cache-Aware Routing

The routing strategy matters significantly for multi-LoRA + DP:

Round-robin routing (default):

Request (adapter="medical") → Replica 0
Request (adapter="medical") → Replica 1
Request (adapter="medical") → Replica 0
...

Problem: both replicas load "medical" → duplicate memory usage

Adapter-aware routing (recommended):

Request (adapter="medical") → always Replica 0  (medical is cached there)
Request (adapter="french")  → always Replica 1  (french is cached there)

Benefit: each replica caches a disjoint set of adapters
         → 2× more adapters in GPU memory across the cluster

This is the same principle as cache-aware routing for prefix caching (Inference Blog 10), applied to adapter caching instead of KV cache.

Implementation

Adapter-aware routing uses consistent hashing on the adapter name:

import hashlib

def stable_hash(key):
    # Python's built-in hash() is salted per process; use a stable digest
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def route_request(adapter_name, replica_ids):
    if adapter_name is None:
        return round_robin()  # base model requests can go anywhere
    # Rendezvous hashing: removing a replica only remaps its own adapters
    return max(replica_ids, key=lambda r: stable_hash(f"{adapter_name}:{r}"))

This ensures:

  • Same adapter always goes to the same replica → maximum cache hits
  • Load is balanced if adapter usage is roughly uniform
  • Adding/removing a replica only remaps the adapters that were pinned to it (the rendezvous/consistent-hashing property)
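
A quick check of that last property, reusing route_request above (replica names are hypothetical):

replicas = ["replica-0", "replica-1", "replica-2"]
adapters = [f"customer-{i}" for i in range(1000)]

before = {a: route_request(a, replicas) for a in adapters}
after  = {a: route_request(a, replicas[:-1]) for a in adapters}  # drop replica-2

moved = sum(before[a] != after[a] for a in adapters)
print(f"{moved} of {len(adapters)} adapters remapped")  # ≈ only replica-2's share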

Long-Context LoRA (LongLoRA)

Extending Context Length with LoRA

Standard LoRA doesn’t change the model’s context length — if the base model supports 8K tokens, the LoRA-adapted model also supports 8K tokens. LongLoRA extends the context length using a LoRA adapter:

Base model:     Llama 3-8B with 8K context
LongLoRA:       Same model with 32K context (4× extension)

The recipe combines:
  1. RoPE (Rotary Position Embedding) rescaling — position interpolation
     to cover the longer range
  2. LoRA fine-tuning with shifted sparse attention, so the weights adapt
     to the extended context cheaply

Shifted Sparse Attention (S²-Attn)

During LongLoRA training, attention is computed differently:

Standard attention (8K context):
  Every token attends to all previous tokens
  → O(n²) complexity, limited by context length

Shifted sparse attention (32K context):
  Split the sequence into groups of 2048 tokens
  Within each group: standard full attention
  Across groups: shifted attention pattern
  
  Group 1: tokens [0:2048]     ← full attention within group
  Group 2: tokens [2048:4096]  ← full attention within group
  Shift by half: tokens [1024:3072]  ← crosses group boundary
  ...

At inference time, the model uses standard full attention (not shifted sparse), but the LoRA weights learned during shifted sparse training generalize to full attention.
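
The grouping is easy to picture as an attention mask. A toy NumPy sketch of the two patterns (causal masking omitted; sizes illustrative):

import numpy as np

def group_mask(seq_len, group_size, shift=0):
    # Token i may attend to token j iff both fall in the same (shifted) group
    idx = np.arange(seq_len)
    group = (idx + shift) // group_size
    return group[:, None] == group[None, :]

# In LongLoRA, half the attention heads use the plain pattern and half the
# half-group shift, so information crosses group boundaries during training
mask_plain   = group_mask(8192, 2048)              # groups [0:2048], [2048:4096], ...
mask_shifted = group_mask(8192, 2048, shift=1024)  # groups [0:1024], [1024:3072], ...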

Serving LongLoRA in vLLM

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-lora \
  --max-model-len 32768 \
  --long-lora-scaling-factors 4.0

The --long-lora-scaling-factors flag tells vLLM to apply RoPE scaling when the LongLoRA adapter is active. Different adapters can have different scaling factors.


Adapter Management in Production

Versioning Strategy

Adapter naming convention:
  {customer}-{task}-{version}
  
  medical-qa-v1
  medical-qa-v2
  medical-qa-v3-candidate

Deployment:
  Production:  medical-qa-v2 (stable)
  Canary:      medical-qa-v3-candidate (5% traffic)
  Rollback:    medical-qa-v1 (available on disk)

Canary Deployment Pattern

import random

def select_adapter(customer, task, stable="v2",
                   candidate="v3-candidate", canary_pct=0.05):
    base = f"{customer}-{task}"          # e.g. "medical-qa"

    if random.random() < canary_pct:
        return f"{base}-{candidate}"     # medical-qa-v3-candidate → 5% of traffic
    return f"{base}-{stable}"            # medical-qa-v2 → 95% of traffic

Both the stable and candidate adapters are registered in vLLM. The routing logic is in the application layer, not in vLLM itself.

Adapter Validation Pipeline

Before deploying a new adapter to production:

1. Format validation:
   - adapter_config.json exists and is valid
   - adapter_model.safetensors loads without errors
   - Target modules match the base model

2. Compatibility check:
   - Rank <= max_lora_rank on the serving cluster
   - Vocab size <= lora_extra_vocab_size
   - Context length requirement is met

3. Quality check:
   - Run adapter on a validation dataset
   - Compare against the previous version
   - Check for regressions (accuracy, safety, latency)

4. Load test:
   - Register adapter on a staging vLLM instance
   - Run traffic at expected QPS
   - Verify latency P99 is within SLA
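
Steps 1-2 are cheap enough to run in CI. A minimal sketch — the limits and module names here are illustrative, and the adapter_config.json keys follow the PEFT format:

import json
from pathlib import Path

MAX_LORA_RANK = 16                  # match --max-lora-rank on the cluster
BASE_MODULES = {"q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"}

def validate_adapter(adapter_dir):
    errors = []
    cfg_path = Path(adapter_dir) / "adapter_config.json"
    if not cfg_path.exists():
        return ["adapter_config.json missing"]
    cfg = json.loads(cfg_path.read_text())

    if cfg.get("r", 0) > MAX_LORA_RANK:
        errors.append(f"rank {cfg['r']} exceeds max_lora_rank={MAX_LORA_RANK}")
    unknown = set(cfg.get("target_modules", [])) - BASE_MODULES
    if unknown:
        errors.append(f"target modules not in base model: {sorted(unknown)}")
    if not (Path(adapter_dir) / "adapter_model.safetensors").exists():
        errors.append("adapter_model.safetensors missing")
    return errors                   # empty list = passed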

Rollback

# Instant rollback: unload the bad adapter; routing falls back to stable
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "medical-qa-v3-candidate"}'

# Or hot-swap: re-register the candidate name, now backed by the
# previous version's weights
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "medical-qa-v3-candidate",
    "lora_path": "/data/adapters/medical-qa-v2"
  }'

Rollback is near-instant because it’s just loading different weight files — no model restart needed.


Performance Tuning Checklist

1. Right-size --max-loras

Metric: adapter_evict_rate (evictions per minute)

evict_rate > 10/min:  max_loras too low → increase
evict_rate < 0.1/min: max_loras higher than needed → could reclaim memory for KV cache
evict_rate ≈ 1/min:   healthy — occasional evictions are fine

2. Use CPU Caching

If eviction rate is high but you can't increase max_loras (KV cache needs the memory):
  --max-cpu-loras 64

CPU→GPU reload: 5-20ms (vs. disk→GPU: 50-200ms)

3. Merge Always-Hot Adapters

If one adapter gets >80% of traffic:
  → Merge it into the base model (zero overhead)
  → Serve remaining adapters via LoRA on the merged model
  → Caveat: the merged adapter becomes the "base" — can't un-merge without reload

4. Match Rank to Need

Don't set --max-lora-rank higher than your highest-rank adapter.
Pre-allocation waste:

  max_lora_rank=128, actual max rank=16:
  Wasted per slot: (128-16) × 6.8 MB = 762 MB
  8 slots: 6 GB wasted!

  → Set max_lora_rank=16 if no adapter exceeds rank 16

5. Monitor Adapter Distribution

If adapter traffic is heavily skewed (power law):
  Top 3 adapters: 70% of traffic    → consider merging top adapter
  Next 10:        25% of traffic    → keep in max_loras
  Remaining 987:   5% of traffic    → serve from CPU cache / disk
  
  Optimal: max_loras=10-15, max_cpu_loras=100

6. Benchmark with Realistic Traffic

Common mistake: benchmarking with uniform adapter distribution
Real traffic:   power-law distribution + time-of-day patterns

Benchmark with:
  - Actual adapter distribution from production logs
  - Realistic request interarrival times
  - Mixed prompt lengths
  - Time-varying adapter popularity
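
Here is a sketch of the traffic-generation side, sampling adapter IDs from a Zipf (power-law) distribution — the exponent is a made-up example; fit it to your production logs:

import numpy as np

rng = np.random.default_rng(0)
num_adapters = 1000
weights = 1.0 / np.arange(1, num_adapters + 1) ** 1.1  # Zipf exponent (assumed)
probs = weights / weights.sum()

request_adapters = rng.choice(num_adapters, size=10_000, p=probs)

# Sanity-check the skew against what you observe in production
top3_share = np.isin(request_adapters, [0, 1, 2]).mean()
print(f"top-3 adapters receive {top3_share:.0%} of requests")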

Composing Everything: TP + DP + Multi-LoRA

Example: 8 GPUs, 1000 Adapters

Configuration:
  Model: Llama 3.1-70B (needs TP=4 to fit)
  GPUs:  8× A100-80GB
  Setup: DP=2, TP=4 (2 replicas, each spanning 4 GPUs)

Per replica:
Base model: 140 GB (FP16) / 4 GPUs = 35 GB per GPU
  KV cache:   ~40 GB per GPU
  LoRA slots: 16 adapters × 38 MB per GPU = 608 MB per GPU
  CPU cache:  128 adapters per replica

Across the cluster:
  GPU-cached adapters: 2 replicas × 16 = 32 (with adapter-aware routing)
  CPU-cached adapters: 2 replicas × 128 = 256
  Disk adapters:       remaining 744
  
  Steady-state: top 32 adapters served from GPU (fast)
                next 224 served from CPU (5-20ms load)
                remaining 744 served from disk (50-200ms load, rare)

Traffic Flow

Request (adapter="customer-42"):
  1. Router hashes "customer-42" → Replica 1
  2. Replica 1 checks: adapter in GPU? Yes → proceed
  3. Forward pass: base (TP=4, AllReduce) + LoRA (SGMV, per-GPU)
  4. Return response

Request (adapter="customer-999"):
  1. Router hashes "customer-999" → Replica 0
  2. Replica 0 checks: adapter in GPU? No. In CPU? Yes → swap to GPU
  3. Evict LRU adapter from GPU → CPU (if at capacity)
  4. Load "customer-999" from CPU → GPU (~10ms)
  5. Forward pass: base (TP=4) + LoRA (SGMV)
  6. Return response
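
In code, the per-request control flow reduces to a few lines. This sketch reuses the hypothetical route_request and TieredAdapterCache from earlier sections; forward_pass is a stand-in for the actual TP + SGMV execution:

REPLICAS = ["replica-0", "replica-1"]
caches = {r: TieredAdapterCache(gpu_slots=16, cpu_slots=128) for r in REPLICAS}

def forward_pass(replica, prompt, weights):
    return f"[{replica}] generated with {weights}"   # placeholder

def handle(adapter_name, prompt):
    replica = route_request(adapter_name, REPLICAS)  # 1. hash → replica
    weights = caches[replica].get(adapter_name)      # 2. ensure GPU-resident
    return forward_pass(replica, prompt, weights)    # 3. base TP + LoRA SGMV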

Key Takeaways

  1. S-LoRA paging treats adapter weights like KV cache blocks — allocated on demand, paged between GPU and CPU
  2. LoRA + TP: A is replicated (or input-dim split), B is output-dim split (or replicated), always matching the base layer’s sharding. Zero additional communication.
  3. LoRA + DP: use adapter-aware routing (hash adapter name → replica) to maximize cache hits across replicas
  4. LongLoRA extends context length via LoRA, served with --long-lora-scaling-factors
  5. Production ops: version adapters, canary deploy, validate before production, instant rollback by swapping adapter files
  6. Tune for your traffic: profile adapter distribution, right-size max_loras, merge hot adapters, use CPU caching for the warm tier

Series Summary

This completes the LoRA & QLoRA series:

A1: LoRA math (B×A decomposition, rank, alpha)
A2: QLoRA (NF4 base + FP16 LoRA for memory efficiency)
A3: Serving one LoRA adapter in vLLM (--enable-lora, on-the-fly computation)
A4: Multi-LoRA serving (one base, many adapters, adapter caching)
A5: Kernel internals (SGMV/BGMV for batched multi-adapter computation)
A6: Production scale (paging, TP/DP composition, ops patterns)

Further Reading