GPU NVPTX Target

Summary

This is the GPU backend for the same tile DSL. Kernels lower through gpu and nvgpu dialects with explicit global-shared-register movement, cp.async staging, and mma.sync tensor-core ops. Same frontend contract, different optimization problem.

Stack

What it's built with.

GPU & Codegen

NVPTX / NVVM
Tensor Cores (mma.sync)
Shared Memory + cp.async
ldmatrix

MLIR & IR

gpu / nvgpu Dialects
Pattern Rewrites
Memory-Space Attributes
Custom Dialect Authoring

Build & Tooling

C++17
CMake
ptxas (validation)
lit / FileCheck

Details

How it works.

Why GPU is a separate story

The CPU compiler was about frontend, IR design, and lowering pipeline. GPU codegen on top of it is a fundamentally different body of work: optimizations shift from "vectorize and fuse" to "manage memory hierarchy and target tensor cores," with different tooling and mental model. The shared piece is the `tile` dialect itself; once a kernel is in `tile` form, CPU and GPU are sibling lowerings, which is the whole point of an MLIR-style design and what made adding a GPU target tractable instead of a rewrite.

Memory movement and tensor cores

On a GPU, where your data lives matters more than what you do with it. The `tile` type carries a memory-space attribute (global / shared / register), and the lowering pass emits the right primitive per transition. Global-to-shared uses `nvgpu.device_async_copy` (lowering to `cp.async`) with matching `cp.async.commit_group` / `cp.async.wait_group` ops, the standard Ampere pipeline shape. Shared-to-register uses `nvgpu.ldmatrix`, which loads a 16×8 fragment in the layout the tensor-core mma op expects. Register-resident accumulators stay in registers.

The matmul inner body lowers to `nvgpu.mma.sync` (mmaShape=[16, 8, 16], fp16 in, fp32 accum), the canonical Ampere tensor-core form. The pattern validates tile dimensions against the mma-shape constraints, then rewrites into the cooperative op. From there mainline MLIR carries it: `gpu`/`nvgpu` → NVVM → LLVM IR → PTX via `mlir-translate`. The validation harness (`ptxas --gpu-name=sm_80`) checks that the emitted PTX is well-formed; it runs locally when the CUDA toolkit is installed and in CI on the GPU runner.

Not yet built: the XOR-swizzle pass for a bank-conflict-free shared-memory layout (the `swizzle` directive parses; the layout transform is next), and a coalesce-validation pass that would reject schedules with non-coalesced loads.

Current endpoint

The pipeline terminates at `gpu`/`nvgpu` IR plus, end-to-end, NVVM/LLVM IR and PTX text. Deliberately deferred: a CUDA runtime to load the compiled module onto a real GPU, the PTX cache the spec calls for, and a bench harness vs cuBLAS. The honest framing: every claim above is at the IR level today. The architectural argument (that the shared `tile` dialect made adding a tensor-core target tractable instead of a rewrite) holds whether or not the kernel has been on a device yet.

Highlights

The things I'm proudest of.

▹Extended the `tile` MLIR dialect with a GPU lowering path targeting NVIDIA sm_80: `tile` → `gpu` + `nvgpu`, ready for NVVM and PTX. Architecture is read off the schedule's `target` directive (`gpu_sm80` default).
▹Memory-space-aware lowering: tiles carry a memory-space attribute (global / shared / register), and the pass emits the right primitive per transition. Global-to-shared becomes `nvgpu.device_async_copy` plus the `cp.async.commit_group` / `cp.async.wait_group` pipeline shape; shared-to-register becomes `nvgpu.ldmatrix`. That's the structure that makes shared-memory matmul fast on Ampere.
▹Tensor-core matmul: the inner body lowers to `nvgpu.mma.sync` (mmaShape=[16, 8, 16], fp16 in, fp32 accum), the canonical Ampere form. The pattern validates tile dimensions against the mma-shape constraints before rewriting.
▹PTX validation harness wired in (`ptxas --gpu-name=sm_80 --opt-level=3` on the emitted PTX). Skips locally where the CUDA toolkit isn't installed; runs in CI when it is.
▹Hardware-backed validation (run-correctness on an NVIDIA GPU, perf comparison vs cuBLAS) is the next milestone. The IR is correct on disk; the CUDA runtime + PTX cache that would dispatch onto a real device is a scoped engineering session paired with a friend's NVIDIA box or rented cloud GPU.