Skip to content

Instantly share code, notes, and snippets.

@solatticus
Last active May 23, 2026 09:15
Show Gist options
  • Select an option

  • Save solatticus/aab6ec3a0436748b021cbbdd12e8c739 to your computer and use it in GitHub Desktop.

Select an option

Save solatticus/aab6ec3a0436748b021cbbdd12e8c739 to your computer and use it in GitHub Desktop.
FlashAttention-4 Cannot Run on RTX 5090 (SM120) — A Deep Investigation

FlashAttention-4 Cannot Run on RTX 5090 (SM120) — A Deep Investigation

TL;DR

FlashAttention-4 will not run on the NVIDIA RTX 5090 (SM120, "desktop Blackwell") and no amount of software patching can fix it. Despite sharing the "Blackwell" brand with data center GPUs like the B200 (SM100), the RTX 5090 uses a fundamentally different tensor core architecture. SM100 has a dedicated tensor memory (TMEM) subsystem with its own instruction family (UTCHMMA, UTMALDG, etc.) that FA4's warp-specialized kernel design requires. SM120 uses the older HMMA instruction family (the same register-to-register MMA approach used since Volta/Ampere) and the TMEM hardware is physically absent from the GB202 die. This is not a software lock, not a fuse bit, and not a toolchain oversight — it is a silicon-level architectural difference. FA2 via Triton remains the best available attention kernel for the RTX 5090.


Goal

PyTorch announced the integration of FlashAttention-4 into PyTorch's flex_attention API via a new BACKEND="FLASH" option. FA4 promises 1.2×–3.2× gains over Triton-based FlexAttention on Blackwell GPUs through warp-specialized kernels built with NVIDIA's CuTeDSL. The question: can we get this running on an RTX 5090?

System:

  • NVIDIA GeForce RTX 5090 (GB202, SM120, compute capability 12.0)
  • CUDA 13.1, driver 590.48.01
  • PyTorch 2.12.0a0 (from source), Flash Attention repo (latest main), nvidia-cutlass-dsl 4.4.1

Investigation Process

Step 1: Updating Dependencies

Pulled latest code for both PyTorch and flash-attention repos:

  • PyTorch main now contains the BACKEND="FLASH" integration in torch.nn.attention.flex_attention (merged via #161118 and #168017)
  • Flash-attention has 99 commits since Feb 2026, actively developing FA4 in flash_attn/cute/, and is now publishing as a separate pip package flash-attn-4
  • nvidia-cutlass-dsl 4.4.1 is the latest version on PyPI

Step 2: Identifying Python-Level Gatekeepers

Multiple Python-level checks block SM120 from reaching the FA4 compilation path:

flash-attention interface.py:

assert arch // 10 in [9, 10, 11]  # SM120 → 120 // 10 == 12, rejected

flash-attention flash_fwd_sm100.py:

assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f  # SM120 out of range

cutlass-dsl mma.py and copy.py:

admissible_archs = Arch.filter(
    lambda arch: arch.is_family_of(Arch.sm_100f) or arch.is_family_of(Arch.sm_110f)
)
# SM120f.is_family_of(SM100f) → False (major 12 ≠ 10)
# SM120f.is_family_of(SM110f) → False (major 12 ≠ 11)

Interestingly, cutlass-dsl's arch.py already has full SM120 definitions (Arch.sm_120, sm_120a, sm_120f) in its BlackwellArchs() tuple, and its env_manager.py auto-detects the GPU and correctly generates sm_120a as the target. The Python gatekeepers in the op-level code just don't include SM120 in their allow lists.

We patched all of these to allow SM120 through.

Step 3: NVVM MLIR Compiler Rejection

With the Python gates patched, FA4 reaches the CuTeDSL compilation stage. The first error:

error: 2CTA Mode for CpAsyncBulkTensorG2S not supported on this architecture
chip = "sm_120a"

After disabling 2CTA mode in the FA4 code, a second error:

error: tcgen05.alloc supported only on arch-conditional or family-conditional
variants from SM100 onwards.

This error comes from inside _cutlass_ir.cpython-313-x86_64-linux-gnu.so (138MB, stripped ELF). Binary analysis with strings revealed that the library does know about SM120 — it contains profile_sm_120, -mcpu=sm_120a, -mcpu=sm_120f, and even SM120-specific MLIR ops for block-scaled MMA. However, the tcgen05 instruction verifier uses a 32-bit bitmask that only covers the SM100 family.

Disassembly of the tcgen05 verifier:

sub    $0x3e9, %ecx          ; ecx = arch_enum - 1001
cmp    $0x1f, %ecx            ; if result > 31, reject
shl    %cl, %rdx              ; rdx = 1 << (arch_enum - 1001)
test   $0xc0000c03, %edx      ; check against allowed bitmask
jne    <allowed>               ; if bit set, allow
lea    <error_string>, %rdi   ; otherwise, error

The NVVM internal arch enum uses the scheme major*100 + minor*10 + suffix_offset. SM100a = 1001 (bit 0), SM100f = 1002 (bit 1), SM101a = 1011 (bit 10), SM103a = 1031 (bit 30). SM120a = 1201 — 200 positions past the bitmask's 32-bit range. It falls through the cmp $0x1f check unconditionally.

This bitmask pattern appears 7 times in the binary, once per tcgen05 instruction type (alloc, commit, cp, dealloc, fence, mma, relinquish, wait).

Step 4: Bypassing NVVM — Compile as SM100a

We successfully compiled FA4 by setting CUTE_DSL_ARCH=sm_100a, proving the codegen pipeline works. The kernel compiled to PTX and then to a cubin. However:

RuntimeError: cudaErrorNoKernelImageForDevice

The SM100a cubin cannot run on SM120 hardware. Expected.

Step 5: PTX Retargeting Attempt

We extracted the generated PTX (5,465 lines), replaced .target sm_100a with .target sm_120a, and fed it to ptxas:

ptxas error: Instruction 'tcgen05.alloc' not supported on .target 'sm_120a'
ptxas error: Feature '.cta_group::2' not supported on .target 'sm_120a'
ptxas error: Instruction 'tcgen05.mma' not supported on .target 'sm_120a'

NVIDIA's own assembler (ptxas, CUDA 13.1) explicitly rejects tcgen05 instructions for SM120. This is not an NVVM MLIR issue — the instruction is not in ptxas's SM120 target definition.

Step 6: Cubin Patching — Testing for Fuse Bits

To determine whether this is a software restriction or a physical hardware limitation, we patched an SM100a cubin's ELF headers to claim SM120a identity (modified e_flags from 0x06006402 to 0x06007802, patched arch strings and cuinfo sections). The CUDA driver rejected the patched cubin during cuModuleLoadData (error 500: named symbol not found), likely due to the EIATTR_TCGEN05_1CTA_USED attribute triggering a capability check.

Step 7: SASS Disassembly — The Definitive Answer

We disassembled both SM100a and SM120a cubins to compare the actual hardware instruction sets at the SASS (native GPU assembly) level.

SM100a (B200/B300 data center Blackwell) tensor core instructions:

UTCHMMA.2CTA              — tensor core MMA via tensor memory
UTMALDG.4D.2CTA           — tensor memory async global load (4D TMA)
UTMASTG.4D                — tensor memory async global store
UTMACCTL.PF               — tensor memory cache control prefetch
UTMACMDFLUSH              — tensor memory command flush
UTCATOMSWS.2CTA           — tensor core atomic shared-memory ops
UTCBAR.2CTA.MULTICAST     — tensor core barrier with multicast

SM120a (RTX 5090 desktop Blackwell) tensor core instructions:

HMMA.16816.F32.BF16       — standard half-precision MMA (register-to-register)

SM100 operates its tensor cores through a dedicated tensor memory (TMEM) subsystem — an entire UTMA*/UTC* instruction family that moves data between global memory, shared memory, and tensor memory, with the MMA units reading directly from TMEM. This is what enables FA4's deeply pipelined, warp-specialized kernel design.

SM120 uses HMMA — the same register-to-register MMA approach that has been in NVIDIA GPUs since Volta (SM70). There are zero UTC* or UTMA* opcodes in any SM120a binary. The tensor memory interconnect does not exist on the GB202 die.

Determination

FlashAttention-4 cannot and will not run on the RTX 5090. This is a confirmed physical silicon difference between data center Blackwell (SM100/SM103) and desktop Blackwell (SM120):

Feature SM100 (B200/B300) SM120 (RTX 5090)
Tensor Memory (TMEM) ✅ Full UTMA*/UTC* ISA ❌ Not present
MMA Instruction UTCHMMA.2CTA (TMEM-backed) HMMA.16816 (register-backed)
2CTA Clusters
Warp-Specialized Kernels ✅ (FA4 design target) ❌ Cannot express
FlashAttention-4
FlashAttention-2

The "Blackwell" branding shared between the RTX 5090 and B200/B300 obscures what are functionally different GPU architectures for compute workloads. SM120's tensor cores are powerful but architecturally closer to Hopper than to data center Blackwell.

For RTX 5090 users: FlashAttention-2 via Triton (the default flex_attention backend) is your best available attention kernel. It is still fast — the 5090 has excellent bandwidth and clock speeds — but it will not benefit from FA4's tensor-memory-based optimizations.


Investigation performed 2026-03-06 on an RTX 5090 (GB202), CUDA 13.1, driver 590.48.01, PyTorch 2.12.0a0, flash-attention main (v2.8.3+311 commits), nvidia-cutlass-dsl 4.4.1.

@BHTANK
Copy link
Copy Markdown

BHTANK commented Mar 7, 2026

Hi

@0z5a
Copy link
Copy Markdown

0z5a commented Mar 19, 2026

Good research

@tayek333
Copy link
Copy Markdown

tayek333 commented Apr 1, 2026

Same issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment