vllm.model_executor.layers.sparse_attn_indexer ¶
Custom Sparse Attention Indexer layers.
SparseAttnIndexer ¶
Bases: CustomOp
Sparse Attention Indexer Custom Op Layer. This layer is extracted as a separate custom op because it relies on heavy custom kernels such as mqa_logits, paged_mqa_logits and top_k_per_row. Those kernels may require a specific memory layout or implementation on each hardware backend to achieve optimal performance.
For now, the default native path uses the CUDA backend path. Other platforms may need to add the corresponding custom op name, sparse_attn_indexer, to custom_ops in CompilationConfig to enable the platform-specific path.
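As a minimal sketch of the opt-in described above, the snippet below builds a custom_ops list that enables the op and serializes it as JSON. It assumes vLLM's usual custom_ops convention, where a leading "+" enables an op and a leading "-" disables it. The helper enable_custom_op is hypothetical and not part of vLLM.

```python
import json

# Op name taken from the description above. The "+" prefix (enable) is an
# assumption based on vLLM's custom_ops convention.
SPARSE_ATTN_INDEXER_OP = "sparse_attn_indexer"


def enable_custom_op(custom_ops: list[str], op_name: str) -> list[str]:
    """Return a copy of custom_ops with op_name enabled (hypothetical helper)."""
    enabled = f"+{op_name}"
    disabled = f"-{op_name}"
    # Drop a conflicting "-op" entry, then add "+op" exactly once.
    result = [op for op in custom_ops if op != disabled]
    if enabled not in result:
        result.append(enabled)
    return result


custom_ops = enable_custom_op([], SPARSE_ATTN_INDEXER_OP)
compilation_config_json = json.dumps({"custom_ops": custom_ops})
print(compilation_config_json)
```

The resulting list could then be passed as the custom_ops field of CompilationConfig, or the JSON string supplied wherever a compilation config is accepted in serialized form. Whether a given platform needs this depends on its backend.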
Source code in vllm/model_executor/layers/sparse_attn_indexer.py
_gather_workspace_shapes ¶
_gather_workspace_shapes(
total_seq_lens: int,
head_dim: int,
fp8_dtype: dtype,
use_fp4_cache: bool,
) -> tuple[
tuple[tuple[int, int], dtype],
tuple[tuple[int, int], dtype],
]
Return ((values_shape, values_dtype), (scales_shape, scales_dtype)) for the K-gather workspace, where T is the total number of gathered rows. FP8 path: (T, head_dim) fp8 values + (T, 4) uint8 scales (each row's fp32 scale stored as 4 raw bytes). MXFP4 path: (T, head_dim // 2) uint8 packed mxfp4 values + (T, head_dim // MXFP4_BLOCK_SIZE) uint8 ue8m0 scales.
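The shape rules above can be sketched in plain Python. This is an illustration, not the vLLM implementation: dtypes are represented as strings so that no tensor library is needed, T is assumed to equal total_seq_lens, and MXFP4_BLOCK_SIZE is assumed to be 32 (the block size commonly used for MX formats; the real constant may differ).

```python
# Assumption: 32 elements share one ue8m0 scale. The actual value of
# MXFP4_BLOCK_SIZE in vLLM is not stated in the text above.
MXFP4_BLOCK_SIZE = 32

Shape = tuple[int, int]


def gather_workspace_shapes_sketch(
    total_seq_lens: int,
    head_dim: int,
    fp8_dtype: str,
    use_fp4_cache: bool,
) -> tuple[tuple[Shape, str], tuple[Shape, str]]:
    """Mirror the documented shape rules, with dtypes given as strings."""
    t = total_seq_lens
    if use_fp4_cache:
        # Two 4-bit values are packed into each uint8 byte.
        values = ((t, head_dim // 2), "uint8")
        # One ue8m0 scale byte per block of MXFP4_BLOCK_SIZE elements.
        scales = ((t, head_dim // MXFP4_BLOCK_SIZE), "uint8")
    else:
        # One fp8 value per element of the head dimension.
        values = ((t, head_dim), fp8_dtype)
        # One fp32 scale per row, stored as 4 raw bytes.
        scales = ((t, 4), "uint8")
    return values, scales


fp8_shapes = gather_workspace_shapes_sketch(10, 128, "float8_e4m3fn", False)
fp4_shapes = gather_workspace_shapes_sketch(10, 128, "float8_e4m3fn", True)
print(fp8_shapes)
print(fp4_shapes)
```

With head_dim = 128 and the assumed block size of 32, both paths happen to use a (T, 4) uint8 scale buffer, but the bytes mean different things: one fp32 scale in the FP8 path, four ue8m0 block scales in the MXFP4 path.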
Source code in vllm/model_executor/layers/sparse_attn_indexer.py
kv_cache_as_quant_view ¶
Return the 4D [num_blocks, block_size, 1, head_width] view expected by DeepGEMM, built from the 3D indexer KV-cache allocation.
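The following shape-only sketch shows the relationship between the two layouts. It assumes the 3D allocation is laid out as [num_blocks, block_size, head_width], which the text above does not state explicitly. The function name quant_view_shape is hypothetical.

```python
def quant_view_shape(alloc_shape: tuple[int, ...]) -> tuple[int, int, int, int]:
    """Map an assumed 3D [num_blocks, block_size, head_width] shape to the
    4D [num_blocks, block_size, 1, head_width] shape (hypothetical helper)."""
    if len(alloc_shape) != 3:
        raise ValueError(f"expected a 3D shape, got {len(alloc_shape)} dims")
    num_blocks, block_size, head_width = alloc_shape
    # Insert a singleton axis before the last dimension; no data is moved.
    return (num_blocks, block_size, 1, head_width)


view_shape = quant_view_shape((1024, 64, 132))
print(view_shape)
```

On a PyTorch tensor with that 3D layout, the same reshaping could be obtained with unsqueeze(-2), which returns a view without copying.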