All Posts
32 posts across 4 series
Building an LLM Inference Engine from Scratch
Learn the internals of vLLM and SGLang by building one from scratch · 15 parts
- 01 Part 1: The Simplest LLM Server · Build a working LLM inference server in ~130 lines of Python. Covers autoregressive generation, the KV cache, prefill vs. decode phases, and basic sampling (see the sketch after this list).
- 02 Part 2: Async Streaming with FastAPI · Why Flask blocks on long generation, how FastAPI enables true async, and Server-Sent Events for token-by-token streaming.
- 03 Part 3: Paged Attention · How paged attention borrows the OS virtual memory concept to store the KV cache in fixed-size blocks, enabling 2-4x more concurrent requests.
- 04 Part 4: Continuous Batching · Why static batching wastes GPU cycles and how continuous batching lets requests join and leave the batch at every iteration.
- 05 Part 5: Async Scheduling · How async scheduling overlaps CPU output processing with GPU execution, eliminating 5-10ms of dead GPU time per step.
- 06 Part 6: Chunked Prefill · How long prompts block decode requests and how chunked prefill splits them to keep inter-token latency stable.
- 07 Part 7: Prefix Caching · Content-addressable KV cache using chained hashing, LRU eviction, and reference counting to skip recomputing shared prefixes.
- 08 Part 8: Speculative Decoding · The draft-verify paradigm: guess K tokens cheaply, verify them all at once, with rejection sampling to preserve the exact output distribution.
- 09 Part 9: Tensor Parallelism · Column-parallel and row-parallel linear layers (the Megatron pattern), AllReduce for combining results, and weight sharding across GPUs.
- 10 Part 10: Data Parallelism · Independent model replicas with request routing strategies: round-robin, least-pending, and cache-aware routing for throughput scaling.
- 11 Part 11: Expert Parallelism · Mixture-of-Experts architecture with AllToAll communication to dispatch tokens across GPUs holding different experts.
- 12 Part 12: KV Cache CPU Offloading · Two-tier GPU/CPU memory with swap-out and swap-in over pinned memory to handle burst traffic beyond GPU capacity.
- 13 Part 13: Disaggregated Prefill-Decode · Separate prefill and decode worker pools with KV transfer via the KVConnector abstraction for hardware-optimized serving.
- 14 Part 14: Quantization · Number formats from FP32 to INT4, symmetric quantization, weight-only dequantize-on-the-fly, and nibble packing for 2-4x memory reduction.
- 15 Part 15: The Full Architecture · The complete architecture diagram with all 14 techniques, the request lifecycle, performance attribution, and deployment configurations.
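A minimal sketch of the decode loop Part 1 builds up to: prefill the full prompt once, then feed back one token at a time against the KV cache. The Hugging Face model and greedy sampling here are illustrative stand-ins, not the series' actual server code:

```python
# Hypothetical sketch: autoregressive generation with an explicit KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The KV cache speeds up decoding because", return_tensors="pt").input_ids
past = None  # empty KV cache; the first step below is the prefill phase
with torch.no_grad():
    for _ in range(20):
        # Prefill passes the whole prompt; each decode step passes only the
        # newest token, reusing cached keys/values for everything before it.
        step_ids = ids if past is None else ids[:, -1:]
        out = model(step_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy "sampling"
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```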
LoRA & QLoRA in vLLM
From LoRA math to production multi-adapter serving at scale · 6 parts
- 01 LoRA Fundamentals: The Math Behind Low-Rank Adaptation · The B×A matrix decomposition, rank and alpha tradeoffs, target module selection, PEFT training, and adapter merging (see the sketch after this list).
- 02 QLoRA: 4-bit Base + Full-Precision Adapters · NF4 quantization, double quantization, paged optimizers, and how QLoRA preserves quality while cutting memory 4×.
- 03 Serving a LoRA Adapter in vLLM · Enable LoRA serving, load adapters on the fly, and understand how the forward pass applies low-rank updates.
- 04 Multi-LoRA Serving: Many Adapters, One Base Model · Adapter caching, LRU eviction, memory budgeting, dynamic registration, and per-request routing patterns.
- 05 LoRA Kernel Internals: SGMV and BGMV · How the Punica SGMV/BGMV kernels batch multiple LoRA adapters in a single GPU operation.
- 06 Production Multi-LoRA at Scale · S-LoRA paging, LoRA with tensor and data parallelism, LongLoRA, adapter lifecycle management, and a tuning checklist.
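The core equation the first post covers fits in a few lines of PyTorch: the frozen base projection plus a scaled low-rank update, y = Wx + (α/r)·BAx. A minimal sketch with illustrative shapes, not PEFT's or vLLM's actual implementation:

```python
# Hypothetical sketch: a LoRA-wrapped linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts as a no-op
        self.scale = alpha / r               # the rank/alpha tradeoff from the post

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)      # torch.Size([2, 512])
```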
Embeddings, Pooling & Rerankers in vLLM
Embedding fundamentals through high-throughput optimization · 5 parts
- 01 Embedding Models 101: Turning Text into Vectors · Bi-encoders, contrastive learning, similarity metrics, the model landscape, and decoder-only embedding trends.
- 02 Pooling Strategies: From Token Embeddings to Sentence Vectors · CLS, mean, last-token, and weighted mean pooling; normalization, the pitfalls of mismatched pooling, and how to choose (see the sketch after this list).
- 03 Serving Embedding Models in vLLM · Launch with --task embed, use the /v1/embeddings API, configure PoolingParams, and understand the internal pipeline.
- 04 Rerankers and Cross-Encoders in vLLM · Cross-encoder scoring, --task score, the /v1/score API, the retrieve-then-rerank pattern, and reward models.
- 05 Embedding Throughput Optimization · Saturation curves, prefix caching for embeddings, benchmarking methodology, and deployment patterns at scale.
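Mask-aware mean pooling, the workhorse strategy from the pooling post, is short enough to show inline: average only over real tokens, then L2-normalize so dot product equals cosine similarity. Function and variable names are illustrative:

```python
# Hypothetical sketch: mean pooling over token embeddings with an attention mask.
import torch
import torch.nn.functional as F

def mean_pool(token_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """token_embs: (batch, seq, dim); mask: (batch, seq), 1 = real token, 0 = padding."""
    mask = mask.unsqueeze(-1).float()            # (batch, seq, 1)
    summed = (token_embs * mask).sum(dim=1)      # padding tokens contribute zero
    counts = mask.sum(dim=1).clamp(min=1)        # per-sequence real-token counts
    return F.normalize(summed / counts, dim=-1)  # unit-length sentence vectors

pooled = mean_pool(torch.randn(2, 7, 384),
                   torch.tensor([[1] * 7, [1] * 4 + [0] * 3]))
print(pooled.shape, pooled.norm(dim=-1))         # (2, 384), norms ≈ 1.0
```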
Vision, Language & Audio Models in vLLM
VLM architecture through multi-modal serving optimization · 6 parts
- 01 VLM Architecture Primer: How Vision Meets Language · Vision encoders (ViT, SigLIP), projection strategies, token interleaving, and the LLaVA/Qwen2-VL/InternVL model families.
- 02 Serving VLMs in vLLM: Your First Image + Text Request · Launch a VLM, send image requests via the OpenAI vision API (see the sketch after this list), and understand chat templates and the preprocessing pipeline.
- 03 Multi-Image and Video Input · Token-count explosion with multiple images, video frame sampling, dynamic vs. fixed resolution, and memory management.
- 04 Audio and Speech Models in vLLM · The Whisper encoder, mel spectrograms, Qwen2-Audio, Ultravox, and strategies for handling long audio.
- 05 Multi-Modal Internals: How vLLM Routes Modalities · The MultiModalRegistry, plugins, the processor pipeline, placeholder token replacement, and KV cache token estimation.
- 06 Optimizing Multi-Modal Serving · Prefix caching for repeated images, chunked prefill with visual tokens, TP for VLMs, resolution tuning, and a production checklist.
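The first-request post boils down to a standard OpenAI-style chat call with an image content part. A minimal sketch against a locally running vLLM server; the model name, port, and image URL are illustrative placeholders:

```python
# Hypothetical sketch: image + text request to a vLLM OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",   # whatever VLM the server was launched with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```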