Production Multi-LoRA at Scale
What Problem Does This Solve?
You’ve understood the mechanics of multi-LoRA serving (Blog A3-A5). Now the question is: how do you run this at scale? Serving 16 adapters on one GPU is straightforward. Serving 1,000+ adapters across a GPU cluster with tensor parallelism, data parallelism, dynamic adapter loading, and SLA guarantees — that’s a different problem.
This blog covers the techniques and patterns for production-grade multi-LoRA deployments.
S-LoRA: Unified Paging for Adapter Weights
The Adapter Memory Problem
In Blog A4, we sized --max-loras to pre-allocate GPU memory for adapter slots. But with thousands of adapters, we can’t pre-allocate slots for all of them:
1,000 adapters × 109 MB each (r=16, Llama-8B) = 109 GB
→ doesn't fit on any single GPU
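The per-adapter figure can be sanity-checked with a quick sizing sketch. This is a back-of-envelope calculation using the public Llama-3.1-8B layer dimensions; it counts only the A/B matrices, so it lands somewhat under the 109 MB figure, which may additionally include per-slot padding and extra-vocab embeddings in the serving engine.

```python
# Back-of-envelope LoRA adapter sizing (Llama-3.1-8B dimensions).
RANK = 16
DTYPE_BYTES = 2   # FP16
LAYERS = 32

# (in_features, out_features) per targeted projection in one decoder layer
PROJECTIONS = {
    "q_proj":    (4096, 4096),
    "k_proj":    (4096, 1024),   # 8 KV heads x 128 head dim
    "v_proj":    (4096, 1024),
    "o_proj":    (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj":   (4096, 14336),
    "down_proj": (14336, 4096),
}

def adapter_bytes(rank=RANK, layers=LAYERS, dtype_bytes=DTYPE_BYTES):
    # Each projection carries A (rank x in) and B (out x rank),
    # i.e. rank * (in + out) parameters.
    per_layer = sum(rank * (fin + fout) for fin, fout in PROJECTIONS.values())
    return per_layer * layers * dtype_bytes

print(f"~{adapter_bytes() / 1e6:.0f} MB of A/B weights per adapter")  # ~84 MB
```

Multiply by a thousand adapters and the conclusion is the same: the full set cannot live on one GPU.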
But at any given time, only 10-20 adapters are "hot" (receiving requests).
The other 980+ are idle.
S-LoRA’s insight: page adapter weights the same way we page KV cache.
How Adapter Weight Paging Works
From the inference engine series (Blog 3), you know how paged attention manages the KV cache: a pool of fixed-size blocks, allocated on demand, freed when done, scattered across non-contiguous memory locations.
S-LoRA applies the same idea to LoRA weights:
KV Cache Paging (Blog 3):
Block pool: [0][1][2][3][4][5]...
Sequence A's KV: blocks [3, 7, 1] ← scattered, not contiguous
Sequence B's KV: blocks [0, 5]
Adapter Weight Paging (S-LoRA):
Block pool: [0][1][2][3][4][5]...
Adapter "medical" weights: blocks [2, 8, 4]
Adapter "legal" weights: blocks [1, 6, 9]
Adapter "code" weights: blocks [3, 7]
Benefits:
- No pre-allocation: adapters are loaded on-demand, one block at a time
- No fragmentation: blocks are fixed-size, allocated from a shared pool
- Graceful degradation: if GPU memory is full, page adapter blocks to CPU
- Unified pool: in S-LoRA, adapter blocks and KV-cache blocks are drawn from the same memory pool, so free memory shifts between the two as the workload changes
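The block-pool idea can be sketched in a few lines. This is a minimal illustration of on-demand, non-contiguous allocation with LRU eviction; the class name, block granularity, and eviction policy are illustrative, not S-LoRA's actual implementation.

```python
from collections import OrderedDict

class AdapterBlockPool:
    """Sketch of S-LoRA-style paging: adapter weights occupy fixed-size
    blocks from one shared pool, so loading an adapter never needs a
    large contiguous allocation."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.owner = {}            # block id -> adapter name
        self.lru = OrderedDict()   # adapter name -> list of block ids

    def load(self, name, blocks_needed):
        while len(self.free) < blocks_needed:
            self._evict_lru()
        blocks = [self.free.pop() for _ in range(blocks_needed)]
        for b in blocks:
            self.owner[b] = name
        self.lru[name] = blocks    # most recently used
        return blocks              # may be scattered, non-contiguous

    def touch(self, name):
        self.lru.move_to_end(name)

    def _evict_lru(self):
        victim, blocks = self.lru.popitem(last=False)
        for b in blocks:
            del self.owner[b]
            self.free.append(b)    # in practice: page blocks to CPU first
```

For example, with a 10-block pool, loading "medical" (3 blocks), "legal" (3), then "code" (6) evicts "medical", the least recently used adapter, to make room.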
GPU-CPU Tiered Storage
S-LoRA uses a two-tier memory hierarchy:
┌────────────────────┐
│ GPU Memory │ ← Hot adapters (actively serving)
│ (fast, limited) │
│ │
│ Adapter blocks │
│ for active LoRAs │
└────────┬───────────┘
│ swap in / swap out
┌────────▼───────────┐
│ CPU Memory │ ← Warm adapters (recently used)
│ (slower, larger) │
│ │
│ Adapter blocks │
│ for cached LoRAs │
└────────┬───────────┘
│ load / evict
┌────────▼───────────┐
│ Disk Storage │ ← Cold adapters (not recently used)
│ (slowest, huge) │
└────────────────────┘
When a request arrives for a “warm” adapter (in CPU), it’s swapped to GPU in ~5-20ms (GPU DMA). Much faster than loading from disk (~50-200ms).
Connection to vLLM
vLLM’s implementation draws from S-LoRA but uses a simpler approach: pre-allocated adapter slots (Blog A4) with LRU eviction. The --max-cpu-loras flag enables the CPU tier:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--max-loras 8 \
--max-cpu-loras 64
This keeps 8 adapters on GPU and 64 in CPU memory. The remaining adapters are on disk.
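The hierarchy behaves like a two-level LRU cache with disk as the backstop. A minimal sketch, assuming nothing about vLLM's internals; the class and the latency annotations just restate the blog's figures:

```python
from collections import OrderedDict

class TieredAdapterCache:
    """Sketch of the GPU/CPU/disk hierarchy behind --max-loras and
    --max-cpu-loras. Latencies in comments are the blog's estimates."""

    def __init__(self, max_gpu=8, max_cpu=64):
        self.gpu = OrderedDict()   # name -> weights handle, LRU order
        self.cpu = OrderedDict()
        self.max_gpu, self.max_cpu = max_gpu, max_cpu

    def get(self, name):
        if name in self.gpu:
            self.gpu.move_to_end(name)
            return "gpu-hit"                 # no load cost
        if name in self.cpu:
            self._promote(name, self.cpu.pop(name))
            return "cpu-hit"                 # ~5-20 ms DMA copy
        self._promote(name, f"loaded:{name}")
        return "disk-miss"                   # ~50-200 ms from disk

    def _promote(self, name, weights):
        if len(self.gpu) >= self.max_gpu:    # spill LRU GPU adapter to CPU
            victim, w = self.gpu.popitem(last=False)
            if len(self.cpu) >= self.max_cpu:
                self.cpu.popitem(last=False) # falls back to the disk tier
            self.cpu[victim] = w
        self.gpu[name] = weights
```

A request stream that revisits a recently evicted adapter pays the ~10 ms CPU-tier cost rather than a full disk load, which is the whole point of `--max-cpu-loras`.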
LoRA + Tensor Parallelism
The Challenge
When the base model is sharded across GPUs with tensor parallelism (TP), how do the LoRA weights get distributed?
TP=2, Llama 3.1-70B:
GPU 0: left half of each linear layer (W₀[:, :dim/2])
GPU 1: right half of each linear layer (W₀[:, dim/2:])
Where do the LoRA A and B matrices go?
Sharding Strategy
The LoRA matrices must be sharded consistently with the base model layers they attach to:
Column-parallel layers (Q, K, V, Gate, Up projections):
Base: W₀ is column-sharded → GPU 0 gets columns [0:dim/2]
→ GPU 1 gets columns [dim/2:dim]
LoRA A (rank × hidden_dim): Replicated on both GPUs
→ A is the "input" side — all GPUs need the full input projection
→ A is small (rank × hidden_dim), so replication is cheap
LoRA B (hidden_dim × rank): Column-sharded, matching the base
→ GPU 0 gets B[:dim/2, :]
→ GPU 1 gets B[dim/2:, :]
Computation on GPU 0:
base_out = W₀[:, :dim/2] @ x (column-parallel base)
lora_out = B[:dim/2, :] @ (A @ x) (LoRA, sharded B)
local_out = base_out + lora_out
→ AllReduce to combine across GPUs? No — column-parallel doesn't
need AllReduce (the results are concatenated, not summed)
Row-parallel layers (O, Down projections):
Base: W₀ is row-sharded → GPU 0 gets rows [0:dim/2]
→ GPU 1 gets rows [dim/2:dim]
LoRA A (rank × hidden_dim): Row-sharded, matching the base
→ GPU 0 gets A[:, :dim/2]
→ GPU 1 gets A[:, dim/2:]
LoRA B (hidden_dim × rank): Replicated on both GPUs
→ B is the "output" side — all GPUs produce the full output
Computation on GPU 0:
base_out = W₀[:dim/2, :] @ x_local (row-parallel base)
lora_out = B @ (A[:, :dim/2] @ x_local) (LoRA, sharded A)
local_out = base_out + lora_out
→ AllReduce combines across GPUs (same as base model)
Summary of LoRA Sharding
Layer Type Base Sharding LoRA A LoRA B
─────────────────────────────────────────────────────────
Column-parallel Column-split Replicated Column-split
Row-parallel Row-split Row-split Replicated
The key insight: LoRA adds zero additional communication. The AllReduce pattern is identical to the base model’s. The only overhead is the LoRA matmuls, which are tiny.
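The sharding table can be verified numerically. The sketch below uses numpy as a stand-in for the per-GPU matmuls, with toy dimensions; note it uses the (out × in) weight convention, so the "column-parallel" output split slices rows of the arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                       # toy hidden dim and LoRA rank
x = rng.normal(size=d)
W = rng.normal(size=(d, d))       # base weight (out x in convention)
A = rng.normal(size=(r, d))       # LoRA A
B = rng.normal(size=(d, r))       # LoRA B
full = W @ x + B @ (A @ x)        # unsharded reference output

# Column-parallel: output dim split across GPUs -> shard W and B on the
# output axis, replicate A. Results are concatenated, no AllReduce.
out0 = W[:d//2] @ x + B[:d//2] @ (A @ x)     # "GPU 0"
out1 = W[d//2:] @ x + B[d//2:] @ (A @ x)     # "GPU 1"
assert np.allclose(np.concatenate([out0, out1]), full)

# Row-parallel: input dim split across GPUs -> shard W and A on the
# input axis, replicate B. Partial outputs are summed by the AllReduce.
p0 = W[:, :d//2] @ x[:d//2] + B @ (A[:, :d//2] @ x[:d//2])
p1 = W[:, d//2:] @ x[d//2:] + B @ (A[:, d//2:] @ x[d//2:])
assert np.allclose(p0 + p1, full)
print("sharded == unsharded")
```

Because the LoRA partial sums ride along in the same AllReduce the base model already performs, no extra communication is introduced.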
Memory Impact
TP=4, Llama 3.1-70B, rank=16, max_loras=8:
Per GPU (base model, FP16): 140 GB / 4 = 35 GB
Per GPU (LoRA, replicated): A matrices ≈ 13 MB (replicated)
Per GPU (LoRA, sharded): B matrices ≈ 100 MB / 4 = 25 MB
Total LoRA per GPU: ~38 MB × 8 adapters = 304 MB
LoRA memory: 304 MB / 35,000 MB base ≈ 0.9% overhead
LoRA adds negligible memory compared to the sharded base model.
LoRA + Data Parallelism
Independent Adapters per Replica
With data parallelism (DP), each replica is an independent inference engine:
DP=2:
Replica 0: base model + adapters [medical, legal, code]
Replica 1: base model + adapters [medical, french, spanish]
Each replica manages its own adapter set independently.
Cache-Aware Routing
The routing strategy matters significantly for multi-LoRA + DP:
Round-robin routing (default):
Request (adapter="medical") → Replica 0
Request (adapter="medical") → Replica 1
Request (adapter="medical") → Replica 0
...
Problem: both replicas load "medical" → duplicate memory usage
Adapter-aware routing (recommended):
Request (adapter="medical") → always Replica 0 (medical is cached there)
Request (adapter="french") → always Replica 1 (french is cached there)
Benefit: each replica caches a disjoint set of adapters
→ 2× more adapters in GPU memory across the cluster
This is the same principle as cache-aware routing for prefix caching (Inference Blog 10), applied to adapter caching instead of KV cache.
Implementation
In its simplest form, adapter-aware routing hashes the adapter name to a replica:
import hashlib

def route_request(adapter_name, num_replicas):
    if adapter_name is None:
        return round_robin()  # base-model requests can go anywhere
    # Hash the adapter name to a replica. Use a stable hash: Python's
    # built-in hash() is salted per process, so it would route
    # differently after every router restart.
    digest = hashlib.md5(adapter_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas
This ensures:
- Same adapter always goes to the same replica → maximum cache hits
- Load is balanced if adapter usage is roughly uniform
- With simple modulo hashing, adding or removing a replica remaps most adapters; true consistent hashing (a hash ring) limits the churn to a fraction of them
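A minimal consistent-hashing ring shows the limited-redistribution property. This is an illustrative sketch, not a vLLM API; the class name and virtual-node count are arbitrary choices.

```python
import bisect
import hashlib

class HashRing:
    """Each replica is placed at several points on a ring; an adapter maps
    to the next replica clockwise. Adding a replica only claims the arcs
    between its new points and their predecessors."""

    def __init__(self, replicas, vnodes=64):
        self.ring = sorted(
            (self._h(f"{rep}#{i}"), rep)
            for rep in replicas for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def route(self, adapter_name):
        i = bisect.bisect(self.keys, self._h(adapter_name)) % len(self.ring)
        return self.ring[i][1]

# Adding a third replica moves only a fraction of the adapters:
adapters = [f"adapter-{i}" for i in range(1000)]
ring2 = HashRing(["replica-0", "replica-1"])
ring3 = HashRing(["replica-0", "replica-1", "replica-2"])
moved = sum(ring2.route(a) != ring3.route(a) for a in adapters)
print(f"{moved}/1000 adapters moved")  # roughly a third, not all 1000
```

With plain modulo hashing, adding the third replica would remap about two-thirds of the adapters and blow away most replica caches at once.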
Long-Context LoRA (LongLoRA)
Extending Context Length with LoRA
Standard LoRA doesn’t change the model’s context length — if the base model supports 8K tokens, the LoRA-adapted model also supports 8K tokens. LongLoRA extends the context length using a LoRA adapter:
Base model: trained with an 8K context window
LongLoRA: same model extended to 32K context (4× extension)
The LoRA adapter learns:
1. New RoPE (Rotary Position Embedding) scaling factors
2. Modified attention patterns via shifted sparse attention
Shifted Sparse Attention (S²-Attn)
During LongLoRA training, attention is computed differently:
Standard attention (8K context):
Every token attends to all previous tokens
→ O(n²) complexity, limited by context length
Shifted sparse attention (32K context):
Split the sequence into groups of 2048 tokens
Within each group: standard full attention
Across groups: shifted attention pattern
Group 1: tokens [0:2048] ← full attention within group
Group 2: tokens [2048:4096] ← full attention within group
Shift by half: tokens [1024:3072] ← crosses group boundary
...
At inference time, the model uses standard full attention (not shifted sparse), but the LoRA weights learned during shifted sparse training generalize to full attention.
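The grouping pattern can be illustrated with indices alone. The sketch below uses toy values (sequence length 32, group size 8 rather than the paper-style 2048) and only shows the group-membership pattern, not a real attention computation; in LongLoRA, half the heads use the plain groups and the other half the shifted ones.

```python
# S^2-Attn grouping sketch: a token may only attend to tokens sharing
# its group id. The shifted variant rolls groups by half a group size
# (with wrap-around), so they straddle the plain group boundaries.

def group_ids(seq_len, group, shifted):
    offset = group // 2 if shifted else 0
    return [((t + offset) % seq_len) // group for t in range(seq_len)]

plain = group_ids(32, 8, shifted=False)
shift = group_ids(32, 8, shifted=True)

# Tokens 4..11 straddle the plain boundary at token 8, but share one
# shifted group, so information flows across that boundary:
print(plain[4:12])  # [0, 0, 0, 0, 1, 1, 1, 1]
print(shift[4:12])  # [1, 1, 1, 1, 1, 1, 1, 1]
```

Within each group, attention is quadratic only in the group size, which is what makes long-context training affordable.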
Serving LongLoRA in vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--max-model-len 32768 \
--long-lora-scaling-factors 4.0
The --long-lora-scaling-factors flag tells vLLM to apply RoPE scaling when the LongLoRA adapter is active. Different adapters can have different scaling factors.
Adapter Management in Production
Versioning Strategy
Adapter naming convention:
{customer}-{task}-{version}
medical-qa-v1
medical-qa-v2
medical-qa-v3-candidate
Deployment:
Production: medical-qa-v2 (stable)
Canary: medical-qa-v3-candidate (5% traffic)
Rollback: medical-qa-v1 (available on disk)
Canary Deployment Pattern
import random
def select_adapter(customer, task, canary_percentage=0.05):
adapter_base = f"{customer}-{task}"
if random.random() < canary_percentage:
return f"{adapter_base}-candidate"
else:
return f"{adapter_base}-stable"
Both the stable and candidate adapters are registered in vLLM. The routing logic is in the application layer, not in vLLM itself.
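One caveat with the random selector above: the same user can flip between stable and candidate across requests, which muddies quality comparisons. A sticky variant hashes a stable request key instead; the `user_id` field is a hypothetical example of such a key.

```python
import hashlib

def select_adapter(customer, task, user_id, canary_percentage=0.05):
    """Sticky canary: hash a stable per-user key into [0, 1) so each
    user is consistently pinned to stable or candidate."""
    adapter_base = f"{customer}-{task}"
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") / 0xFFFF  # stable [0, 1]
    if bucket < canary_percentage:
        return f"{adapter_base}-candidate"
    return f"{adapter_base}-stable"

# The same user always lands in the same bucket:
assert select_adapter("medical", "qa", "user-123") == \
       select_adapter("medical", "qa", "user-123")
```

Sticky bucketing also means that raising `canary_percentage` only adds users to the canary; nobody already on the candidate gets moved back.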
Adapter Validation Pipeline
Before deploying a new adapter to production:
1. Format validation:
- adapter_config.json exists and is valid
- adapter_model.safetensors loads without errors
- Target modules match the base model
2. Compatibility check:
- Rank <= max_lora_rank on the serving cluster
- Vocab size <= lora_extra_vocab_size
- Context length requirement is met
3. Quality check:
- Run adapter on a validation dataset
- Compare against the previous version
- Check for regressions (accuracy, safety, latency)
4. Load test:
- Register adapter on a staging vLLM instance
- Run traffic at expected QPS
- Verify latency P99 is within SLA
Rollback
# Instant rollback: unload the bad adapter, routing falls back to stable
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "medical-qa-v3-candidate"}'
# Or swap it: load the previous version's weights under the same name
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "medical-qa-v3-candidate",
    "lora_path": "/data/adapters/medical-qa-v2"
  }'
Rollback is near-instant because it’s just loading different weight files — no model restart needed.
Performance Tuning Checklist
1. Right-size --max-loras
Metric: adapter_evict_rate (evictions per minute)
evict_rate > 10/min: max_loras too low → increase
evict_rate < 0.1/min: max_loras higher than needed → could reclaim memory for KV cache
evict_rate ≈ 1/min: healthy — occasional evictions are fine
2. Use CPU Caching
If eviction rate is high but you can't increase max_loras (KV cache needs the memory):
--max-cpu-loras 64
CPU→GPU reload: 5-20ms (vs. disk→GPU: 50-200ms)
3. Merge Always-Hot Adapters
If one adapter gets >80% of traffic:
→ Merge it into the base model (zero overhead)
→ Serve remaining adapters via LoRA on the merged model
→ Caveat: the merged adapter becomes the "base" — can't un-merge without reload
4. Match Rank to Need
Don't set --max-lora-rank higher than your highest-rank adapter.
Pre-allocation waste:
max_lora_rank=128, actual max rank=16:
Wasted per slot: (128-16) × 6.8 MB = 762 MB
8 slots: 6 GB wasted!
→ Set max_lora_rank=16 if no adapter exceeds rank 16
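The waste arithmetic above, as a reusable check. The 6.8 MB-per-rank constant is the blog's Llama-8B estimate (109 MB at rank 16, so ~6.8 MB per rank unit); substitute your own model's figure.

```python
PER_RANK_MB = 6.8  # approx. adapter MB per unit of rank (Llama-8B, FP16)

def slot_waste_mb(max_lora_rank, actual_max_rank, num_slots):
    """Memory pre-allocated for rank headroom no adapter ever uses."""
    per_slot = (max_lora_rank - actual_max_rank) * PER_RANK_MB
    return per_slot * num_slots

waste = slot_waste_mb(128, 16, 8)
print(f"{waste / 1024:.1f} GB wasted")  # ~6 GB, matching the figure above
```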
5. Monitor Adapter Distribution
If adapter traffic is heavily skewed (power law):
Top 3 adapters: 70% of traffic → consider merging top adapter
Next 10: 25% of traffic → keep in max_loras
Remaining 987: 5% of traffic → serve from CPU cache / disk
Optimal: max_loras=10-15, max_cpu_loras=100
6. Benchmark with Realistic Traffic
Common mistake: benchmarking with uniform adapter distribution
Real traffic: power-law distribution + time-of-day patterns
Benchmark with:
- Actual adapter distribution from production logs
- Realistic request interarrival times
- Mixed prompt lengths
- Time-varying adapter popularity
Composing Everything: TP + DP + Multi-LoRA
Example: 8 GPUs, 1000 Adapters
Configuration:
Model: Llama 3.1-70B (needs TP=4 to fit)
GPUs: 8× A100-80GB
Setup: DP=2, TP=4 (2 replicas, each spanning 4 GPUs)
Per replica:
Base model: 140 GB (FP16) / 4 GPUs = 35 GB per GPU
KV cache: ~40 GB per GPU
LoRA slots: 16 adapters × 38 MB per GPU = 608 MB per GPU
CPU cache: 128 adapters per replica
Across the cluster:
GPU-cached adapters: 2 replicas × 16 = 32 (with adapter-aware routing)
CPU-cached adapters: 2 replicas × 128 = 256 (the CPU tier also holds copies of the 32 GPU-resident adapters, so 224 additional)
Disk adapters: remaining 744
Steady-state: top 32 adapters served from GPU (fast)
next 224 served from CPU (5-20ms load)
remaining 744 served from disk (50-200ms load, rare)
Traffic Flow
Request (adapter="customer-42"):
1. Router hashes "customer-42" → Replica 1
2. Replica 1 checks: adapter in GPU? Yes → proceed
3. Forward pass: base (TP=4, AllReduce) + LoRA (SGMV, per-GPU)
4. Return response
Request (adapter="customer-999"):
1. Router hashes "customer-999" → Replica 0
2. Replica 0 checks: adapter in GPU? No. In CPU? Yes → swap to GPU
3. Evict LRU adapter from GPU → CPU (if at capacity)
4. Load "customer-999" from CPU → GPU (~10ms)
5. Forward pass: base (TP=4) + LoRA (SGMV)
6. Return response
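The two flows collapse into one dispatch routine. This is a sketch tying together the pieces above; `route`, the cache arguments, and the tier labels are illustrative stand-ins for the router, the per-replica adapter caches, and the forward pass.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1"]

def route(adapter_name):
    # Stable hash of the adapter name -> replica (adapter-aware routing)
    digest = hashlib.md5(adapter_name.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]

def handle(request, gpu_cache, cpu_cache):
    adapter = request["adapter"]
    replica = route(adapter)
    if adapter in gpu_cache:
        tier = "gpu"           # proceed directly
    elif adapter in cpu_cache:
        tier = "cpu"           # ~10 ms swap to GPU, evicting LRU if full
        gpu_cache.add(adapter)
    else:
        tier = "disk"          # ~100 ms cold load; rare for warm traffic
        gpu_cache.add(adapter)
    # forward pass would run here: base (TP=4, AllReduce) + LoRA (SGMV)
    return replica, tier
```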
Key Takeaways
- S-LoRA paging treats adapter weights like KV cache blocks — allocated on demand, paged between GPU and CPU
- LoRA + TP: A is replicated (or row-sharded), B is column-sharded (or replicated), matching the base layer’s sharding. Zero additional communication.
- LoRA + DP: use adapter-aware routing (hash adapter name → replica) to maximize cache hits across replicas
- LongLoRA extends context length via LoRA, served with --long-lora-scaling-factors
- Production ops: version adapters, canary deploy, validate before production, instant rollback by swapping adapter files
- Tune for your traffic: profile adapter distribution, right-size max_loras, merge hot adapters, use CPU caching for the warm tier
Series Summary
This completes the LoRA & QLoRA series:
A1: LoRA math (B×A decomposition, rank, alpha)
A2: QLoRA (NF4 base + FP16 LoRA for memory efficiency)
A3: Serving one LoRA adapter in vLLM (--enable-lora, on-the-fly computation)
A4: Multi-LoRA serving (one base, many adapters, adapter caching)
A5: Kernel internals (SGMV/BGMV for batched multi-adapter computation)
A6: Production scale (paging, TP/DP composition, ops patterns)
Further Reading
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters — unified paging for adapter weights
- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models — context extension via LoRA
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving — dynamic adapter scheduling
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving — rank-aware scheduling optimizations