Examples
vLLM's examples are organized into the following categories:
- `basic/` – Minimal examples for offline inference and online serving.
- `generate/` – Text generation examples, including multimodal models.
- `pooling/` – Examples for embedding, classification, scoring, reward, etc.
- `speech_to_text/` – Speech transcription, translation, and real-time audio examples.
- `features/` – Demonstrations of individual vLLM features: automatic prefix caching, speculative decoding, LoRA, structured outputs, prompt embedding, pause/resume, batch invariance, KV events, data parallelism, and more.
- `reasoning/` – Examples for reasoning with vLLM.
- `tool_calling/` – Examples for function/tool calling with vLLM.
- `applications/` – Application examples such as chatbots and RAG (Retrieval-Augmented Generation).
- `rl/` – Reinforcement learning examples.
- `deployment/` – Examples for deploying vLLM in production.
- `ray_serving/` – Scalable serving using Ray.
- `disaggregated/` – Examples for disaggregated serving (separate prefill and decode), including various KV cache connectors (LMCache, Mooncake, FlexKV, P2P NCCL) and failure recovery.
- `observability/` – Metrics, logging, tracing (OpenTelemetry), and dashboards (Grafana, Perses).