All Posts
32 posts across 4 series
Building an LLM Inference Engine from Scratch
Learn the internals of vLLM and SGLang by building one from scratch · 15 parts
- 01 Part 1: The Simplest LLM Server · Build a working LLM inference server in ~130 lines of Python. Covers autoregressive generation, the KV cache, prefill vs. decode phases, and basic sampling (see the sketch after this list).
- 02 Part 2: Async Streaming with FastAPI · Why Flask blocks on long generation, how FastAPI enables true async, and Server-Sent Events for token-by-token streaming.
- 03 Part 3: Paged Attention · How paged attention borrows the OS virtual memory concept to store the KV cache in fixed-size blocks, enabling 2-4x more concurrent requests.
- 04 Part 4: Continuous Batching · Why static batching wastes GPU cycles and how continuous batching lets requests join and leave the batch at every iteration.
- 05 Part 5: Async Scheduling · How async scheduling overlaps CPU output processing with GPU execution, eliminating 5-10ms of dead GPU time per step.
- 06 Part 6: Chunked Prefill · How long prompts block decode requests and how chunked prefill splits them to keep inter-token latency stable.
- 07 Part 7: Prefix Caching · Content-addressable KV cache using chained hashing, LRU eviction, and reference counting to skip recomputing shared prefixes.
- 08 Part 8: Speculative Decoding · The draft-verify paradigm: guess K tokens cheaply, verify them all at once, with rejection sampling to preserve the exact output distribution.
- 09 Part 9: Tensor Parallelism · Column-parallel and row-parallel linear layers (the Megatron pattern), AllReduce for combining results, and weight sharding across GPUs.
- 10 Part 10: Data Parallelism · Independent model replicas with request routing strategies: round-robin, least-pending, and cache-aware routing for throughput scaling.
- 11 Part 11: Expert Parallelism · Mixture-of-Experts architecture with AllToAll communication to dispatch tokens across GPUs holding different experts.
- 12 Part 12: KV Cache CPU Offloading · Two-tier GPU/CPU memory with swap-out and swap-in over pinned memory to handle burst traffic beyond GPU capacity.
- 13 Part 13: Disaggregated Prefill-Decode · Separate prefill and decode worker pools with KV transfer via the KVConnector abstraction for hardware-optimized serving.
- 14 Part 14: Quantization · Number formats from FP32 to INT4, symmetric quantization, weight-only dequantize-on-the-fly, and nibble packing for 2-4x memory reduction.
- 15 Part 15: The Full Architecture · The complete architecture diagram with all 14 techniques, the request lifecycle, performance attribution, and deployment configurations.
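A minimal sketch of the decode loop Part 1 builds up to: prefill the full prompt once, then feed back one token at a time against the KV cache. The Hugging Face model and greedy sampling here are illustrative stand-ins, not the series' actual server code:

```python
# Hypothetical sketch: autoregressive generation with an explicit KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The KV cache speeds up decoding because", return_tensors="pt").input_ids
past = None  # empty KV cache; the first step below is the prefill phase
with torch.no_grad():
    for _ in range(20):
        # Prefill passes the whole prompt; each decode step passes only the
        # newest token, reusing cached keys/values for everything before it.
        step_ids = ids if past is None else ids[:, -1:]
        out = model(step_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy "sampling"
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```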
LoRA & QLoRA in vLLM
From LoRA math to production multi-adapter serving at scale · 6 parts
- 01 LoRA Fundamentals: The Math Behind Low-Rank Adaptation · The B×A matrix decomposition, rank and alpha tradeoffs, target module selection, PEFT training, and adapter merging (see the sketch after this list).
- 02 QLoRA: 4-bit Base + Full-Precision Adapters · NF4 quantization, double quantization, paged optimizers, and how QLoRA preserves quality while cutting memory 4×.
- 03 Serving a LoRA Adapter in vLLM · Enable LoRA serving, load adapters on the fly, and understand how the forward pass applies low-rank updates.
- 04 Multi-LoRA Serving: Many Adapters, One Base Model · Adapter caching, LRU eviction, memory budgeting, dynamic registration, and per-request routing patterns.
- 05 LoRA Kernel Internals: SGMV and BGMV · How the Punica SGMV/BGMV kernels batch multiple LoRA adapters in a single GPU operation.
- 06 Production Multi-LoRA at Scale · S-LoRA paging, LoRA with tensor and data parallelism, LongLoRA, adapter lifecycle management, and a tuning checklist.
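The core equation the first post covers fits in a few lines of PyTorch: the frozen base projection plus a scaled low-rank update, y = Wx + (α/r)·BAx. A minimal sketch with illustrative shapes, not PEFT's or vLLM's actual implementation:

```python
# Hypothetical sketch: a LoRA-wrapped linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts as a no-op
        self.scale = alpha / r               # the rank/alpha tradeoff from the post

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)      # torch.Size([2, 512])
```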
Embeddings, Pooling & Rerankers in vLLM
Embedding fundamentals through high-throughput optimization · 5 parts
- 01 Embedding Models 101: Turning Text into Vectors · Bi-encoders, contrastive learning, similarity metrics, the model landscape, and decoder-only embedding trends.
- 02 Pooling Strategies: From Token Embeddings to Sentence Vectors · CLS, mean, last-token, and weighted mean pooling; normalization, the pitfalls of mismatched pooling, and how to choose (see the sketch after this list).
- 03 Serving Embedding Models in vLLM · Launch with --task embed, use the /v1/embeddings API, configure PoolingParams, and understand the internal pipeline.
- 04 Rerankers and Cross-Encoders in vLLM · Cross-encoder scoring, --task score, the /v1/score API, the retrieve-then-rerank pattern, and reward models.
- 05 Embedding Throughput Optimization · Saturation curves, prefix caching for embeddings, benchmarking methodology, and deployment patterns at scale.
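Mask-aware mean pooling, the workhorse strategy from the pooling post, is short enough to show inline: average only over real tokens, then L2-normalize so dot product equals cosine similarity. Function and variable names are illustrative:

```python
# Hypothetical sketch: mean pooling over token embeddings with an attention mask.
import torch
import torch.nn.functional as F

def mean_pool(token_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """token_embs: (batch, seq, dim); mask: (batch, seq), 1 = real token, 0 = padding."""
    mask = mask.unsqueeze(-1).float()            # (batch, seq, 1)
    summed = (token_embs * mask).sum(dim=1)      # padding tokens contribute zero
    counts = mask.sum(dim=1).clamp(min=1)        # per-sequence real-token counts
    return F.normalize(summed / counts, dim=-1)  # unit-length sentence vectors

pooled = mean_pool(torch.randn(2, 7, 384),
                   torch.tensor([[1] * 7, [1] * 4 + [0] * 3]))
print(pooled.shape, pooled.norm(dim=-1))         # (2, 384), norms ≈ 1.0
```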
Vision, Language & Audio Models in vLLM
VLM architecture through multi-modal serving optimization · 6 parts
- 01 VLM Architecture Primer: How Vision Meets Language · Vision encoders (ViT, SigLIP), projection strategies, token interleaving, and the LLaVA/Qwen2-VL/InternVL model families.
- 02 Serving VLMs in vLLM: Your First Image + Text Request · Launch a VLM, send image requests via the OpenAI vision API (see the sketch after this list), and understand chat templates and the preprocessing pipeline.
- 03 Multi-Image and Video Input · Token-count explosion with multiple images, video frame sampling, dynamic vs. fixed resolution, and memory management.
- 04 Audio and Speech Models in vLLM · The Whisper encoder, mel spectrograms, Qwen2-Audio, Ultravox, and strategies for handling long audio.
- 05 Multi-Modal Internals: How vLLM Routes Modalities · The MultiModalRegistry, plugins, the processor pipeline, placeholder token replacement, and KV cache token estimation.
- 06 Optimizing Multi-Modal Serving · Prefix caching for repeated images, chunked prefill with visual tokens, TP for VLMs, resolution tuning, and a production checklist.
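The first-request post boils down to a standard OpenAI-style chat call with an image content part. A minimal sketch against a locally running vLLM server; the model name, port, and image URL are illustrative placeholders:

```python
# Hypothetical sketch: image + text request to a vLLM OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",   # whatever VLM the server was launched with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```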