<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>Ashraf Bhuiyan</title><description>Technical blog series on LLM inference engines, LoRA serving, embeddings, and multi-modal models in vLLM.</description><link>https://ashraf-bhuiyan.com/</link>
<item><title>Part 1: The Simplest LLM Server</title><link>https://ashraf-bhuiyan.com/blog/01-naive-inference/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/01-naive-inference/</guid><description>Build a working LLM inference server in ~130 lines of Python. Covers autoregressive generation, the KV cache, prefill vs. decode phases, and basic sampling.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 2: Async Streaming with FastAPI</title><link>https://ashraf-bhuiyan.com/blog/02-async-streaming/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/02-async-streaming/</guid><description>Why Flask blocks on long generation, how FastAPI enables true async, and Server-Sent Events for token-by-token streaming.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 3: Paged Attention</title><link>https://ashraf-bhuiyan.com/blog/03-paged-attention/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/03-paged-attention/</guid><description>How paged attention borrows the OS virtual memory concept to store KV cache in fixed-size blocks, enabling 2-4x more concurrent requests.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 4: Continuous Batching</title><link>https://ashraf-bhuiyan.com/blog/04-continuous-batching/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/04-continuous-batching/</guid><description>Why static batching wastes GPU cycles and how continuous batching lets requests join and leave the batch at every iteration.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 5: Async Scheduling</title><link>https://ashraf-bhuiyan.com/blog/05-async-scheduling/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/05-async-scheduling/</guid><description>How async scheduling overlaps CPU output processing with GPU execution, eliminating 5-10ms of dead GPU time per step.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 6: Chunked Prefill</title><link>https://ashraf-bhuiyan.com/blog/06-chunked-prefill/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/06-chunked-prefill/</guid><description>How long prompts block decode requests and chunked prefill splits them to keep inter-token latency stable.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 7: Prefix Caching</title><link>https://ashraf-bhuiyan.com/blog/07-prefix-caching/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/07-prefix-caching/</guid><description>Content-addressable KV cache using chained hashing, LRU eviction, and reference counting to skip recomputing shared prefixes.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 8: Speculative Decoding</title><link>https://ashraf-bhuiyan.com/blog/08-speculative-decoding/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/08-speculative-decoding/</guid><description>The draft-verify paradigm: guess K tokens cheaply, verify all at once, with rejection sampling to preserve exact output distribution.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 9: Tensor Parallelism</title><link>https://ashraf-bhuiyan.com/blog/09-tensor-parallelism/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/09-tensor-parallelism/</guid><description>Column-parallel and row-parallel linear layers (the Megatron pattern), AllReduce for combining results, and weight sharding across GPUs.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 10: Data Parallelism</title><link>https://ashraf-bhuiyan.com/blog/10-data-parallelism/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/10-data-parallelism/</guid><description>Independent model replicas with request routing strategies: round-robin, least-pending, and cache-aware routing for throughput scaling.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 11: Expert Parallelism</title><link>https://ashraf-bhuiyan.com/blog/11-expert-parallelism/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/11-expert-parallelism/</guid><description>Mixture-of-Experts architecture with AllToAll communication to dispatch tokens across GPUs holding different experts.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 12: KV Cache CPU Offloading</title><link>https://ashraf-bhuiyan.com/blog/12-kv-cpu-offloading/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/12-kv-cpu-offloading/</guid><description>Two-tier GPU/CPU memory with swap-out and swap-in using pinned memory to handle burst traffic beyond GPU capacity.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 13: Disaggregated Prefill-Decode</title><link>https://ashraf-bhuiyan.com/blog/13-disaggregated-serving/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/13-disaggregated-serving/</guid><description>Separate prefill and decode worker pools with KV transfer via the KVConnector abstraction for hardware-optimized serving.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 14: Quantization</title><link>https://ashraf-bhuiyan.com/blog/14-quantization/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/14-quantization/</guid><description>Number formats from FP32 to INT4, symmetric quantization, weight-only dequantize-on-the-fly, and nibble packing for 2-4x memory reduction.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Part 15: The Full Architecture</title><link>https://ashraf-bhuiyan.com/blog/15-full-architecture/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/15-full-architecture/</guid><description>Complete architecture diagram with all 14 techniques, request lifecycle, performance attribution, and deployment configurations.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Embedding Models 101: Turning Text into Vectors</title><link>https://ashraf-bhuiyan.com/blog/embed-01-fundamentals/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-01-fundamentals/</guid><description>Bi-encoders, contrastive learning, similarity metrics, the model landscape, and decoder-only embedding trends.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Pooling Strategies: From Token Embeddings to Sentence Vectors</title><link>https://ashraf-bhuiyan.com/blog/embed-02-pooling/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-02-pooling/</guid><description>CLS, mean, last-token, and weighted mean pooling. Normalization, wrong-pooling pitfalls, and how to choose.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Serving Embedding Models in vLLM</title><link>https://ashraf-bhuiyan.com/blog/embed-03-serving/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-03-serving/</guid><description>Launch with --task embed, use the /v1/embeddings API, configure PoolingParams, and understand the internal pipeline.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Rerankers and Cross-Encoders in vLLM</title><link>https://ashraf-bhuiyan.com/blog/embed-04-rerankers/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-04-rerankers/</guid><description>Cross-encoder scoring, --task score, the /v1/score API, retrieve-then-rerank pattern, and reward models.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Embedding Throughput Optimization</title><link>https://ashraf-bhuiyan.com/blog/embed-05-optimization/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/embed-05-optimization/</guid><description>Saturation curves, prefix caching for embeddings, benchmarking methodology, and deployment patterns at scale.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>LoRA Fundamentals: The Math Behind Low-Rank Adaptation</title><link>https://ashraf-bhuiyan.com/blog/lora-01-fundamentals/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-01-fundamentals/</guid><description>B×A matrix decomposition, rank and alpha tradeoffs, target module selection, PEFT training, and adapter merging.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>QLoRA: 4-bit Base + Full-Precision Adapters</title><link>https://ashraf-bhuiyan.com/blog/lora-02-qlora/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-02-qlora/</guid><description>NF4 quantization, double quantization, paged optimizers, and how QLoRA preserves quality while cutting memory 4×.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Serving a LoRA Adapter in vLLM</title><link>https://ashraf-bhuiyan.com/blog/lora-03-serving/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-03-serving/</guid><description>Enable LoRA serving, load adapters on the fly, and understand how the forward pass applies low-rank updates.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Multi-LoRA Serving: Many Adapters, One Base Model</title><link>https://ashraf-bhuiyan.com/blog/lora-04-multi-lora/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-04-multi-lora/</guid><description>Adapter caching, LRU eviction, memory budgeting, dynamic registration, and per-request routing patterns.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>LoRA Kernel Internals: SGMV and BGMV</title><link>https://ashraf-bhuiyan.com/blog/lora-05-kernels/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-05-kernels/</guid><description>How Punica SGMV/BGMV kernels batch multiple LoRA adapters in a single GPU operation.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Production Multi-LoRA at Scale</title><link>https://ashraf-bhuiyan.com/blog/lora-06-production/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/lora-06-production/</guid><description>S-LoRA paging, LoRA with tensor and data parallelism, LongLoRA, adapter lifecycle management, and tuning checklist.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>VLM Architecture Primer: How Vision Meets Language</title><link>https://ashraf-bhuiyan.com/blog/mm-01-vlm-architecture/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-01-vlm-architecture/</guid><description>Vision encoders (ViT, SigLIP), projection strategies, token interleaving, and the LLaVA/Qwen2-VL/InternVL model families.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Serving VLMs in vLLM: Your First Image + Text Request</title><link>https://ashraf-bhuiyan.com/blog/mm-02-serving-vlms/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-02-serving-vlms/</guid><description>Launch a VLM, send image requests via the OpenAI vision API, understand chat templates and the preprocessing pipeline.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Multi-Image and Video Input</title><link>https://ashraf-bhuiyan.com/blog/mm-03-multi-image-video/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-03-multi-image-video/</guid><description>Token count explosion with multiple images, video frame sampling, dynamic vs fixed resolution, and memory management.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Audio and Speech Models in vLLM</title><link>https://ashraf-bhuiyan.com/blog/mm-04-audio-models/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-04-audio-models/</guid><description>Whisper encoder, mel spectrograms, Qwen2-Audio, Ultravox, and strategies for handling long audio.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Multi-Modal Internals: How vLLM Routes Modalities</title><link>https://ashraf-bhuiyan.com/blog/mm-05-internals/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-05-internals/</guid><description>MultiModalRegistry, plugins, processor pipeline, placeholder token replacement, and KV cache token estimation.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
<item><title>Optimizing Multi-Modal Serving</title><link>https://ashraf-bhuiyan.com/blog/mm-06-optimization/</link><guid isPermaLink="true">https://ashraf-bhuiyan.com/blog/mm-06-optimization/</guid><description>Prefix caching for repeated images, chunked prefill with visual tokens, TP for VLMs, resolution tuning, and production checklist.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item>
</channel></rss>