<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>Ashraf Bhuiyan</title><description>Technical blog series on LLM inference engines, LoRA serving, embeddings, and multi-modal models in vLLM.</description><link>https://ashraf-bhuiyan.com/</link>
<item><title>Part 1: The Simplest LLM Server</title><link>https://ashraf-bhuiyan.com/blog/01-naive-inference/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/01-naive-inference/</guid><description>Build a working LLM inference server in ~130 lines of Python. Covers autoregressive generation, the KV cache, prefill vs. decode phases, and basic sampling.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 2: Async Streaming with FastAPI</title><link>https://ashraf-bhuiyan.com/blog/02-async-streaming/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/02-async-streaming/</guid><description>Why Flask blocks on long generation, how FastAPI enables true async, and Server-Sent Events for token-by-token streaming.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 3: Paged Attention</title><link>https://ashraf-bhuiyan.com/blog/03-paged-attention/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/03-paged-attention/</guid><description>How paged attention borrows the OS virtual memory concept to store KV cache in fixed-size blocks, enabling 2-4x more concurrent requests.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 4: Continuous Batching</title><link>https://ashraf-bhuiyan.com/blog/04-continuous-batching/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/04-continuous-batching/</guid><description>Why static batching wastes GPU cycles and how continuous batching lets requests join and leave the batch at every iteration.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 5: Async Scheduling</title><link>https://ashraf-bhuiyan.com/blog/05-async-scheduling/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/05-async-scheduling/</guid><description>How async scheduling overlaps CPU output processing with GPU execution, eliminating 5-10ms of dead GPU time per step.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 6: Chunked Prefill</title><link>https://ashraf-bhuiyan.com/blog/06-chunked-prefill/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/06-chunked-prefill/</guid><description>How long prompts block decode requests and chunked prefill splits them to keep inter-token latency stable.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 7: Prefix Caching</title><link>https://ashraf-bhuiyan.com/blog/07-prefix-caching/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/07-prefix-caching/</guid><description>Content-addressable KV cache using chained hashing, LRU eviction, and reference counting to skip recomputing shared prefixes.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 8: Speculative Decoding</title><link>https://ashraf-bhuiyan.com/blog/08-speculative-decoding/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/08-speculative-decoding/</guid><description>The draft-verify paradigm: guess K tokens cheaply, verify all at once, with rejection sampling to preserve exact output distribution.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 9: Tensor Parallelism</title><link>https://ashraf-bhuiyan.com/blog/09-tensor-parallelism/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/09-tensor-parallelism/</guid><description>Column-parallel and row-parallel linear layers (the Megatron pattern), AllReduce for combining results, and weight sharding across GPUs.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 10: Data Parallelism</title><link>https://ashraf-bhuiyan.com/blog/10-data-parallelism/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/10-data-parallelism/</guid><description>Independent model replicas with request routing strategies: round-robin, least-pending, and cache-aware routing for throughput scaling.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 11: Expert Parallelism</title><link>https://ashraf-bhuiyan.com/blog/11-expert-parallelism/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/11-expert-parallelism/</guid><description>Mixture-of-Experts architecture with AllToAll communication to dispatch tokens across GPUs holding different experts.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 12: KV Cache CPU Offloading</title><link>https://ashraf-bhuiyan.com/blog/12-kv-cpu-offloading/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/12-kv-cpu-offloading/</guid><description>Two-tier GPU/CPU memory with swap-out and swap-in using pinned memory to handle burst traffic beyond GPU capacity.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 13: Disaggregated Prefill-Decode</title><link>https://ashraf-bhuiyan.com/blog/13-disaggregated-serving/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/13-disaggregated-serving/</guid><description>Separate prefill and decode worker pools with KV transfer via the KVConnector abstraction for hardware-optimized serving.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 14: Quantization</title><link>https://ashraf-bhuiyan.com/blog/14-quantization/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/14-quantization/</guid><description>Number formats from FP32 to INT4, symmetric quantization, weight-only dequantize-on-the-fly, and nibble packing for 2-4x memory reduction.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 15: The Full Architecture</title><link>https://ashraf-bhuiyan.com/blog/15-full-architecture/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/15-full-architecture/</guid><description>Complete architecture diagram with all 14 techniques, request lifecycle, performance attribution, and deployment configurations.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Embedding Models 101: Turning Text into Vectors</title><link>https://ashraf-bhuiyan.com/blog/embed-01-fundamentals/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-01-fundamentals/</guid><description>Bi-encoders, contrastive learning, similarity metrics, the model landscape, and decoder-only embedding trends.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Pooling Strategies: From Token Embeddings to Sentence Vectors</title><link>https://ashraf-bhuiyan.com/blog/embed-02-pooling/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-02-pooling/</guid><description>CLS, mean, last-token, and weighted mean pooling. Normalization, wrong-pooling pitfalls, and how to choose.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Serving Embedding Models in vLLM</title><link>https://ashraf-bhuiyan.com/blog/embed-03-serving/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-03-serving/</guid><description>Launch with --task embed, use the /v1/embeddings API, configure PoolingParams, and understand the internal pipeline.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Rerankers and Cross-Encoders in vLLM</title><link>https://ashraf-bhuiyan.com/blog/embed-04-rerankers/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-04-rerankers/</guid><description>Cross-encoder scoring, --task score, the /v1/score API, retrieve-then-rerank pattern, and reward models.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Embedding Throughput Optimization</title><link>https://ashraf-bhuiyan.com/blog/embed-05-optimization/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-05-optimization/</guid><description>Saturation curves, prefix caching for embeddings, benchmarking methodology, and deployment patterns at scale.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>LoRA Fundamentals: The Math Behind Low-Rank Adaptation</title><link>https://ashraf-bhuiyan.com/blog/lora-01-fundamentals/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-01-fundamentals/</guid><description>B×A matrix decomposition, rank and alpha tradeoffs, target module selection, PEFT training, and adapter merging.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>QLoRA: 4-bit Base + Full-Precision Adapters</title><link>https://ashraf-bhuiyan.com/blog/lora-02-qlora/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-02-qlora/</guid><description>NF4 quantization, double quantization, paged optimizers, and how QLoRA preserves quality while cutting memory 4×.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Serving a LoRA Adapter in vLLM</title><link>https://ashraf-bhuiyan.com/blog/lora-03-serving/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-03-serving/</guid><description>Enable LoRA serving, load adapters on the fly, and understand how the forward pass applies low-rank updates.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Multi-LoRA Serving: Many Adapters, One Base Model</title><link>https://ashraf-bhuiyan.com/blog/lora-04-multi-lora/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-04-multi-lora/</guid><description>Adapter caching, LRU eviction, memory budgeting, dynamic registration, and per-request routing patterns.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>LoRA Kernel Internals: SGMV and BGMV</title><link>https://ashraf-bhuiyan.com/blog/lora-05-kernels/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-05-kernels/</guid><description>How Punica SGMV/BGMV kernels batch multiple LoRA adapters in a single GPU operation.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Production Multi-LoRA at Scale</title><link>https://ashraf-bhuiyan.com/blog/lora-06-production/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-06-production/</guid><description>S-LoRA paging, LoRA with tensor and data parallelism, LongLoRA, adapter lifecycle management, and tuning checklist.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>VLM Architecture Primer: How Vision Meets Language</title><link>https://ashraf-bhuiyan.com/blog/mm-01-vlm-architecture/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-01-vlm-architecture/</guid><description>Vision encoders (ViT, SigLIP), projection strategies, token interleaving, and the LLaVA/Qwen2-VL/InternVL model families.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Serving VLMs in vLLM: Your First Image + Text Request</title><link>https://ashraf-bhuiyan.com/blog/mm-02-serving-vlms/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-02-serving-vlms/</guid><description>Launch a VLM, send image requests via the OpenAI vision API, understand chat templates and the preprocessing pipeline.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Multi-Image and Video Input</title><link>https://ashraf-bhuiyan.com/blog/mm-03-multi-image-video/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-03-multi-image-video/</guid><description>Token count explosion with multiple images, video frame sampling, dynamic vs fixed resolution, and memory management.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Audio and Speech Models in vLLM</title><link>https://ashraf-bhuiyan.com/blog/mm-04-audio-models/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-04-audio-models/</guid><description>Whisper encoder, mel spectrograms, Qwen2-Audio, Ultravox, and strategies for handling long audio.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Multi-Modal Internals: How vLLM Routes Modalities</title><link>https://ashraf-bhuiyan.com/blog/mm-05-internals/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-05-internals/</guid><description>MultiModalRegistry, plugins, processor pipeline, placeholder token replacement, and KV cache token estimation.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Optimizing Multi-Modal Serving</title><link>https://ashraf-bhuiyan.com/blog/mm-06-optimization/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-06-optimization/</guid><description>Prefix caching for repeated images, chunked prefill with visual tokens, TP for VLMs, resolution tuning, and production checklist.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
</channel></rss>