vllm.config.attention ¶
AttentionConfig ¶
Configuration for attention mechanisms in vLLM.
Source code in vllm/config/attention.py
backend class-attribute instance-attribute ¶
backend: AttentionBackendEnum | None = None
Attention backend to use. Use "auto" or None for automatic selection.
disable_flashinfer_prefill class-attribute instance-attribute ¶
disable_flashinfer_prefill: bool | None = None
Whether to disable FlashInfer prefill. Deprecated in favor of mla_prefill_backend.
disable_flashinfer_q_quantization class-attribute instance-attribute ¶
disable_flashinfer_q_quantization: bool = False
If set, do not quantize Q to fp8 when using an fp8 KV cache.
flash_attn_max_num_splits_for_cuda_graph class-attribute instance-attribute ¶
flash_attn_max_num_splits_for_cuda_graph: int = 32
Maximum number of splits Flash Attention uses for CUDA graph decode.
flash_attn_version class-attribute instance-attribute ¶
flash_attn_version: Literal[2, 3, 4] | None = None
Force vLLM to use a specific flash-attention version (2, 3, or 4). Only valid when using the flash-attention backend.
flex_attn_block_m class-attribute instance-attribute ¶
flex_attn_block_m: int | None = None
Triton kernel BLOCK_M tile size for flex attention. Must be a power of 2 >= 16. If None and VLLM_BATCH_INVARIANT=1, defaults to 16.
flex_attn_block_n class-attribute instance-attribute ¶
flex_attn_block_n: int | None = None
Triton kernel BLOCK_N tile size for flex attention. Must be a power of 2 >= 16. If None and VLLM_BATCH_INVARIANT=1, defaults to 16.
flex_attn_kv_block_size class-attribute instance-attribute ¶
flex_attn_kv_block_size: int | None = None
Logical KV block size for the flex attention block mask. Must be a power of 2 and divisible by flex_attn_block_n. If None, uses the default (kv_cache_block_size on PyTorch >= 2.9, 128 otherwise).
flex_attn_q_block_size class-attribute instance-attribute ¶
flex_attn_q_block_size: int | None = None
Logical Q block size for the flex attention block mask. Must be a power of 2 and divisible by flex_attn_block_m. If None, uses the default (16 on PyTorch >= 2.9, 128 otherwise).
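The constraints stated for the four flex attention options can be checked together. The following is a minimal sketch, not vLLM's own validation code: `is_power_of_two` and `check_flex_attn_sizes` are hypothetical helper names, and the sketch only covers values that are explicitly set (it does not model the PyTorch-version or VLLM_BATCH_INVARIANT defaults).

```python
from typing import Optional


def is_power_of_two(n: int) -> bool:
    """Return True for 1, 2, 4, 8, ... and False for everything else."""
    return n > 0 and (n & (n - 1)) == 0


def check_flex_attn_sizes(
    block_m: Optional[int] = None,
    block_n: Optional[int] = None,
    q_block_size: Optional[int] = None,
    kv_block_size: Optional[int] = None,
) -> list:
    """Hypothetical helper: collect violations of the documented constraints.

    None means "not set" and is always accepted here.
    """
    errors = []

    # Tile sizes: power of 2 and at least 16.
    for name, tile in (("flex_attn_block_m", block_m),
                       ("flex_attn_block_n", block_n)):
        if tile is not None and not (is_power_of_two(tile) and tile >= 16):
            errors.append(f"{name} must be a power of 2 >= 16, got {tile}")

    # Logical block sizes: power of 2 and a multiple of the matching tile.
    pairs = (
        ("flex_attn_q_block_size", q_block_size, "flex_attn_block_m", block_m),
        ("flex_attn_kv_block_size", kv_block_size, "flex_attn_block_n", block_n),
    )
    for name, size, tile_name, tile in pairs:
        if size is None:
            continue
        if not is_power_of_two(size):
            errors.append(f"{name} must be a power of 2, got {size}")
        elif tile is not None and size % tile != 0:
            errors.append(f"{name}={size} is not divisible by {tile_name}={tile}")

    return errors
```

An empty list means the combination satisfies every documented rule; for example, `check_flex_attn_sizes(block_m=64, q_block_size=32)` reports one violation because 32 is not a multiple of 64.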
mla_prefill_backend class-attribute instance-attribute ¶
mla_prefill_backend: MLAPrefillBackendEnum | None = None
MLA prefill backend to use. If None, the backend is selected automatically. Valid options: FLASH_ATTN (FA3/FA4), FLASHINFER, TRTLLM_RAGGED. This option supersedes use_trtllm_ragged_deepseek_prefill and disable_flashinfer_prefill, both of which are deprecated.
tq_max_kv_splits_for_cuda_graph class-attribute instance-attribute ¶
tq_max_kv_splits_for_cuda_graph: int = 32
TurboQuant max NUM_KV_SPLITS for cuda graph decode. Fixes the split count so grid dimensions are constant across captures, and buffers can be pre-allocated to avoid inflating the memory estimate.
use_cudnn_prefill class-attribute instance-attribute ¶
use_cudnn_prefill: bool = False
Deprecated: cuDNN prefill backend has been removed.
use_fp4_indexer_cache class-attribute instance-attribute ¶
use_fp4_indexer_cache: bool = False
If set, use the fp4 indexer cache for dsv32-family models (not supported yet).
use_non_causal class-attribute instance-attribute ¶
use_non_causal: bool = False
Whether to use non-causal (bidirectional) attention.
use_prefill_decode_attention class-attribute instance-attribute ¶
use_prefill_decode_attention: bool = False
Use separate prefill and decode kernels for attention instead of the unified triton kernel.
use_prefill_query_quantization class-attribute instance-attribute ¶
use_prefill_query_quantization: bool = False
If set, quantize query for attention in prefill.
use_trtllm_attention class-attribute instance-attribute ¶
use_trtllm_attention: bool | None = None
If set to True/False, use or don't use the TRTLLM attention backend in flashinfer. If None, auto-detect the attention backend in flashinfer.
use_trtllm_ragged_deepseek_prefill class-attribute instance-attribute ¶
use_trtllm_ragged_deepseek_prefill: bool = False
Whether to use TRTLLM ragged DeepSeek prefill. Deprecated in favor of mla_prefill_backend.
_migrate_deprecated_mla_prefill_flags ¶
Migrate deprecated MLA prefill flags to mla_prefill_backend.
compute_hash ¶
compute_hash() -> str
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
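The idea of hashing only the graph-affecting settings can be illustrated with a toy dataclass. This is a sketch under stated assumptions, not the vLLM implementation: `ToyAttentionConfig`, its three fields, and the choice of SHA-256 over `repr` of the field values are all hypothetical stand-ins.

```python
import hashlib
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyAttentionConfig:
    """Hypothetical stand-in for a config class; not the real AttentionConfig."""

    backend: Optional[str] = None
    flash_attn_version: Optional[int] = None
    use_non_causal: bool = False

    def compute_hash(self) -> str:
        # Assumption: every field of this toy class affects the computation
        # graph, so every field is folded into the hash in declaration order.
        factors = [(f.name, getattr(self, f.name)) for f in fields(self)]
        return hashlib.sha256(repr(factors).encode("utf-8")).hexdigest()
```

Two instances with identical field values produce the same digest, while changing any single field produces a different one, which is the property a cache keyed on this hash relies on.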
validate_backend_before classmethod ¶
Enable parsing of the backend enum type from string.
The special value "auto" is treated as None, which triggers automatic backend selection.
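The normalization described above can be sketched in plain Python. This is an illustration only: `ToyBackend` and `parse_backend` are hypothetical names, the two enum members are placeholders rather than the real AttentionBackendEnum members, and the case-insensitive matching is an assumption not confirmed by this page.

```python
from enum import Enum
from typing import Optional, Union


class ToyBackend(Enum):
    """Hypothetical placeholder enum; member names are for illustration."""

    FLASH_ATTN = "FLASH_ATTN"
    FLASHINFER = "FLASHINFER"


def parse_backend(value: Union[str, ToyBackend, None]) -> Optional[ToyBackend]:
    """Convert a user-supplied value into an enum member or None."""
    # None and already-parsed members pass through unchanged.
    if value is None or isinstance(value, ToyBackend):
        return value
    if isinstance(value, str):
        # "auto" maps to None, which stands for automatic selection.
        if value.lower() == "auto":
            return None
        # Assumption: names are matched case-insensitively.
        # An unknown name raises KeyError.
        return ToyBackend[value.upper()]
    raise TypeError(f"unsupported backend value: {value!r}")
```

Running the conversion before field validation lets callers pass either a string (for example from a command line) or an enum member and end up with the same typed value.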
validate_mla_prefill_backend_before classmethod ¶
Enable parsing of the mla_prefill_backend enum type from string.