Examples
vLLM's examples are organized into the following categories:
- `basic/` – Minimal examples for offline inference and online serving.
- `generate/` – Text generation examples, including multimodal models.
- `pooling/` – Examples for embedding, classification, scoring, reward, etc.
- `speech_to_text/` – Speech transcription, translation, and real-time audio examples.
- `features/` – Demonstrations of individual vLLM features: automatic prefix caching, speculative decoding, LoRA, structured outputs, prompt embedding, pause/resume, batch invariance, KV events, data parallelism, and more.
- `reasoning/` – Examples for reasoning with vLLM.
- `tool_calling/` – Examples for function/tool calling with vLLM.
- `applications/` – Application examples such as chatbots and RAG (Retrieval-Augmented Generation).
- `rl/` – Reinforcement learning examples.
- `deployment/` – Examples for deploying vLLM in production.
- `ray_serving/` – Scalable serving using Ray.
- `disaggregated/` – Examples for disaggregated serving (separate prefill and decode), including various KV cache connectors (LMCache, Mooncake, FlexKV, P2P NCCL) and failure recovery.
- `observability/` – Metrics, logging, tracing (OpenTelemetry), and dashboards (Grafana, Perses).