vllm.v1.simple_kv_offload.cuda_mem_ops ¶
Low-level CUDA/HIP memory helpers: pinning and batch DMA transfers.
_resolve_batch_memcpy ¶
Resolve the platform batch-memcpy entry point (one-time).
- CUDA: `cuMemcpyBatchAsync` via `cuGetProcAddress` (uses `srcAccessOrder=STREAM` via one attributes entry).
- ROCm: `hipMemcpyBatchAsync` from libamdhip64 (ROCm 7.1+). ROCm 7.2.1 and 7.2.2 reject any call with `numAttrs > 0` (see ROCm/clr @ rocm-7.2.1 hipamd/src/hip_memory.cpp:2819-2822), so we call with `numAttrs=0`.
Raises RuntimeError if the symbol is unavailable (older CUDA driver, ROCm < 7.1, unusual install). The connector requires the batch API.
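A minimal sketch of the one-time resolution via `ctypes`, assuming the driver symbols described above; `resolve_batch_memcpy` and its library-name argument are illustrative, not vLLM's actual implementation:

```python
import ctypes


def resolve_batch_memcpy(libname: str):
    """Resolve the platform batch-memcpy entry point once.

    Raises RuntimeError when the library or symbol is unavailable,
    mirroring the behavior documented above. Sketch only.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError as e:
        raise RuntimeError(f"driver library {libname!r} unavailable") from e

    if "amdhip" in libname:
        # ROCm 7.1+ exports hipMemcpyBatchAsync directly from libamdhip64.
        fn = getattr(lib, "hipMemcpyBatchAsync", None)
        if fn is None:
            raise RuntimeError("hipMemcpyBatchAsync missing (ROCm < 7.1?)")
        return fn

    # CUDA: query the entry point through cuGetProcAddress_v2
    # (symbol, pfn out, driver version, flags, status out).
    pfn = ctypes.c_void_p()
    status = ctypes.c_int()
    rc = lib.cuGetProcAddress_v2(
        b"cuMemcpyBatchAsync",
        ctypes.byref(pfn),
        ctypes.c_int(12080),   # assumed minimum driver version; illustrative
        ctypes.c_uint64(0),    # CU_GET_PROC_ADDRESS_DEFAULT
        ctypes.byref(status),
    )
    if rc != 0 or not pfn.value:
        raise RuntimeError("cuMemcpyBatchAsync unavailable (driver too old?)")
    return pfn
```

On a machine without the driver installed, the `CDLL` load itself fails, which is why the connector surfaces a `RuntimeError` rather than deferring the failure to the first copy.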
Source code in vllm/v1/simple_kv_offload/cuda_mem_ops.py
copy_blocks ¶
copy_blocks(
src_block_ids: list[int],
dst_block_ids: list[int],
params: BatchMemcpyParams,
) -> None
Copy blocks via cuMemcpyBatchAsync / hipMemcpyBatchAsync.
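The batch APIs take parallel arrays of source pointers, destination pointers, and sizes. A sketch of how those arrays could be derived from block ids, assuming fixed-size blocks laid out contiguously from a base pointer (`batch_copy_args` and the integer base addresses are hypothetical; the real `BatchMemcpyParams` carries the actual pointers):

```python
def batch_copy_args(
    src_block_ids: list[int],
    dst_block_ids: list[int],
    src_base: int,
    dst_base: int,
    block_nbytes: int,
) -> tuple[list[int], list[int], list[int]]:
    """Build parallel (src, dst, size) arrays for a batch-memcpy call.

    Each block id is turned into a byte offset from its pool's base
    address; every entry copies exactly one block.
    """
    assert len(src_block_ids) == len(dst_block_ids)
    srcs = [src_base + b * block_nbytes for b in src_block_ids]
    dsts = [dst_base + b * block_nbytes for b in dst_block_ids]
    sizes = [block_nbytes] * len(src_block_ids)
    return srcs, dsts, sizes
```

Submitting all blocks in one driver call avoids the per-copy launch overhead of issuing many small `cudaMemcpyAsync` operations.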
Source code in vllm/v1/simple_kv_offload/cuda_mem_ops.py
pin_tensor ¶
pin_tensor(tensor: Tensor) -> None
Pin a CPU tensor via cudaHostRegister.
This bypasses PyTorch's CUDACachingHostAllocator which rounds every pin_memory=True allocation up to the next power of 2 (e.g. 100 GB becomes 128 GB).
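The power-of-two rounding described above can be reproduced in a few lines (`next_pow2` is an illustrative helper, not PyTorch's code):

```python
def next_pow2(n: int) -> int:
    """Round n up to the next power of two, as the caching
    host allocator does for pinned allocations."""
    return 1 << (n - 1).bit_length()


GiB = 1024 ** 3
# A 100 GiB pin_memory=True request lands in a 128 GiB bucket:
assert next_pow2(100 * GiB) == 128 * GiB
```

Registering the already-allocated CPU tensor in place avoids that ~28% overhead entirely, since `cudaHostRegister` pins the exact byte range of the existing buffer.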