Ashraf Bhuiyan
Engineering Leader in AI Inference. Currently leading the vLLM team at Red Hat.
Expert in AI inference systems, architecture, GPU programming, and building large-scale production LLM infrastructure.
Writing
Building an LLM Inference Engine from Scratch
32 parts · Learn the internals of vLLM and SGLang
LoRA & QLoRA in vLLM
6 parts · From LoRA math to production multi-adapter serving
- LoRA Fundamentals — B×A decomposition, rank, alpha, target modules
- QLoRA — NF4 quantization, 4-bit base + full-precision adapters
- Serving a LoRA Adapter — --enable-lora, LoRARequest, on-the-fly forward pass
- Multi-LoRA Serving — adapter cache, LRU eviction, routing patterns
- Kernel Internals — SGMV/BGMV, PunicaWrapper, weight stacking
- Production at Scale — S-LoRA, TP/DP composition, tuning checklist
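As a taste of what the fundamentals part covers, here is a minimal NumPy sketch of the LoRA update y = Wx + (α/r)·B(Ax). Shapes and values are illustrative, not vLLM code; B starts at zero, so the adapter initially contributes nothing:

```python
import numpy as np

# Frozen base weight of a hypothetical projection layer.
d_out, d_in, r, alpha = 4096, 4096, 16, 32
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in)).astype(np.float32) * 0.02

# LoRA factors: A projects down to rank r, B projects back up.
A = rng.normal(size=(r, d_in)).astype(np.float32) * 0.01
B = np.zeros((d_out, r), dtype=np.float32)  # B is zero-initialized

x = rng.normal(size=d_in).astype(np.float32)

# Adapted forward pass: y = W @ x + (alpha / r) * B @ (A @ x)
y = W @ x + (alpha / r) * (B @ (A @ x))

# Parameter savings: full delta-W vs. the two low-rank factors.
full = d_out * d_in              # 16,777,216 params
lora = r * (d_in + d_out)        # 131,072 params (~0.78%)
print(f"LoRA params: {lora:,} vs full: {full:,}")
```

With B at zero the adapted output equals the base output, which is why LoRA training can start from the frozen model without a quality cliff.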
Embeddings, Pooling & Rerankers in vLLM
5 parts · Embedding fundamentals through high-throughput optimization
- Embedding Fundamentals — bi-encoders, contrastive learning, similarity metrics
- Pooling Strategies — CLS, mean, last-token, normalization pitfalls
- Serving Embeddings — --task embed, /v1/embeddings API, PoolingParams
- Rerankers & Cross-Encoders — --task score, retrieve-then-rerank, reward models
- Throughput Optimization — saturation curves, prefix caching, benchmarking
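The pooling part of this series can be previewed in a few lines of NumPy: mean pooling over non-padded tokens, CLS pooling, and the L2 normalization that makes a dot product equal cosine similarity. The hidden states here are random placeholders, not real model outputs:

```python
import numpy as np

# Hypothetical last-hidden-state for a 5-token sequence, hidden size 8.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8)).astype(np.float32)
mask = np.array([1, 1, 1, 1, 0], dtype=np.float32)  # last position is padding

# Mean pooling over non-padded tokens (a common bi-encoder default).
mean_vec = (hidden * mask[:, None]).sum(axis=0) / mask.sum()

# CLS pooling: take only the first token's state.
cls_vec = hidden[0]

# L2-normalize so that a plain dot product is cosine similarity.
def l2norm(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

a, b = l2norm(mean_vec), l2norm(cls_vec)
cos = float(a @ b)  # cosine similarity, in [-1, 1]
print(f"cosine(mean, cls) = {cos:.3f}")
```

Forgetting the padding mask or skipping normalization are exactly the "normalization pitfalls" the pooling part digs into.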
Vision, Language & Audio Models in vLLM
6 parts · VLM architecture through multi-modal serving optimization
- VLM Architecture Primer — vision encoders, projection layers, model families
- Serving VLMs — OpenAI vision API, image input, preprocessing
- Multi-Image & Video — token explosion, frame sampling, memory management
- Audio & Speech Models — Whisper encoder, mel spectrograms, Qwen2-Audio
- Multi-Modal Internals — registry, plugins, processor, placeholder replacement
- Optimization — prefix caching, chunked prefill, TP, resolution tuning
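To make the "token explosion" point concrete, a back-of-the-envelope calculation for a ViT-style encoder that emits one token per patch. The patch size, resolution, and frame count are illustrative assumptions, not tied to any specific model:

```python
# One token per patch in a ViT-style vision encoder (illustrative).
def image_tokens(height: int, width: int, patch: int = 14) -> int:
    return (height // patch) * (width // patch)

# A single 336x336 image at patch size 14: 24 * 24 patches.
single = image_tokens(336, 336)

# A 16-frame video clip at the same resolution, before any text tokens.
video = 16 * single

print(f"1 image: {single} tokens, 16 frames: {video} tokens")
```

A short clip can consume thousands of context tokens before the prompt text even starts, which is why frame sampling and resolution tuning show up as optimization levers in this series.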