Ashraf Bhuiyan

Engineering Leader in AI Inference. Currently leading the vLLM team at Red Hat.
Expert in AI inference systems, architecture, GPU programming, and building large-scale production LLM infrastructure.

Writing

Building an LLM Inference Engine from Scratch

32 parts · Learn the internals of vLLM and SGLang

  1 — Naive Inference (foundation)
  Independent, toggle on/off: 2 Async Streaming · 7 Prefix Caching · 8 Spec. Decoding · 12 CPU Offloading · 14 Quantization
  Dependency chain: 3 Paged Attention → 4 Cont. Batching → 5 Async Sched → 6 Chunked Prefill
  Parallelism, pick per hardware: 9 Tensor ∥ · 10 Data ∥ · 11 Expert ∥ (DP × TP × EP = total GPUs)
  13 — Disaggregated P/D (system architecture)
  15 — Full Architecture (all combined)
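The parallelism parts compose multiplicatively: each data-parallel replica is itself sharded by tensor and expert parallelism, so the degrees multiply into the world size. A tiny sketch (function name is illustrative, not vLLM's API):

```python
# Hypothetical sketch of how parallelism degrees compose into a GPU
# count (DP x TP x EP = total GPUs). Illustrative only.

def total_gpus(dp: int, tp: int, ep: int) -> int:
    """Each data-parallel replica spans tp * ep GPUs, so the world
    size is the product of all three degrees."""
    return dp * tp * ep

# Example: 2 replicas, each 4-way tensor-parallel with 2-way expert
# parallelism -> 16 GPUs total.
print(total_gpus(dp=2, tp=4, ep=2))  # 16
```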

LoRA & QLoRA in vLLM

6 parts · From LoRA math to production multi-adapter serving

  1. LoRA Fundamentals — B×A decomposition, rank, alpha, target modules
  2. QLoRA — NF4 quantization, 4-bit base + full-precision adapters
  3. Serving a LoRA Adapter — --enable-lora, LoRARequest, on-the-fly forward pass
  4. Multi-LoRA Serving — adapter cache, LRU eviction, routing patterns
  5. Kernel Internals — SGMV/BGMV, PunicaWrapper, weight stacking
  6. Production at Scale — S-LoRA, TP/DP composition, tuning checklist
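The B×A decomposition, rank, and alpha from part 1 fit in a few lines. A minimal pure-Python sketch (matrices and values are made up for illustration): the adapter adds a low-rank update scaled by alpha/r to the frozen base projection.

```python
# Minimal LoRA forward sketch: y = W @ x + (alpha / r) * B @ (A @ x),
# where A is (r x d_in) and B is (d_out x r). Illustrative values only.

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)               # frozen base-weight projection
    delta = matvec(B, matvec(A, x))   # low-rank update B @ A @ x
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # identity base weight (d_out x d_in)
A = [[1.0, 1.0]]               # r x d_in, rank r = 1
B = [[0.5], [0.5]]             # d_out x r
x = [2.0, 4.0]

print(lora_forward(W, A, B, x, alpha=2.0, r=1))  # [8.0, 10.0]
```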

Embeddings, Pooling & Rerankers in vLLM

5 parts · Embedding fundamentals through high-throughput optimization

  1. Embedding Fundamentals — bi-encoders, contrastive learning, similarity metrics
  2. Pooling Strategies — CLS, mean, last-token, normalization pitfalls
  3. Serving Embeddings — --task embed, /v1/embeddings API, PoolingParams
  4. Rerankers & Cross-Encoders — --task score, retrieve-then-rerank, reward models
  5. Throughput Optimization — saturation curves, prefix caching, benchmarking
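Two of the ideas from parts 1 and 2 can be sketched together: mean pooling over per-token embeddings, then cosine similarity with explicit normalization (the step part 2 flags as easy to get wrong). Vectors here are made up for illustration.

```python
# Sketch: mean pooling + cosine similarity, with explicit L2 norms.
import math

def mean_pool(token_embeddings):
    n, dim = len(token_embeddings), len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / n for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)  # divide by both norms, not just one

doc = mean_pool([[1.0, 0.0], [0.0, 1.0]])  # -> [0.5, 0.5]
query = [1.0, 1.0]
print(round(cosine(doc, query), 4))        # 1.0 (same direction)
```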

Vision, Language & Audio Models in vLLM

6 parts · VLM architecture through multi-modal serving optimization

  1. VLM Architecture Primer — vision encoders, projection layers, model families
  2. Serving VLMs — OpenAI vision API, image input, preprocessing
  3. Multi-Image & Video — token explosion, frame sampling, memory management
  4. Audio & Speech Models — Whisper encoder, mel spectrograms, Qwen2-Audio
  5. Multi-Modal Internals — registry, plugins, processor, placeholder replacement
  6. Optimization — prefix caching, chunked prefill, TP, resolution tuning
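The "token explosion" in part 3 is simple arithmetic: a ViT-style vision encoder emits roughly one token per patch, so token count scales with resolution and frame count. A back-of-envelope sketch (patch size 14 is typical of CLIP-style encoders; exact numbers vary by model family):

```python
# Assumed ViT-style patch tokenization; numbers are illustrative.
def image_tokens(height: int, width: int, patch: int = 14) -> int:
    return (height // patch) * (width // patch)

per_image = image_tokens(336, 336)   # 24 * 24 = 576 tokens
frames = 16                          # a short video clip
print(per_image, per_image * frames) # 576 9216
```

This is why part 3 pairs frame sampling with memory management: video input multiplies an already-large per-image token budget.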