Summary

I built this compiler to learn modern ML compiler architecture by actually building one end to end. A single tile-level source language lowers into a custom MLIR dialect, then branches into CPU and GPU targets without changing frontend code. One language, two backends, working pipeline.

Stack

What it's built with.

Compiler Frontend

  • ANTLR4
  • AST Design
  • Type Systems
  • DSL Design

MLIR & IR

  • MLIR Dialect Authoring
  • Pass Infrastructure
  • linalg / vector / affine
  • ODS / TableGen

Build & Tooling

  • C++17
  • CMake
  • lit / FileCheck

Details

How it works.

What this project is

This is a Triton/Halide-style DSL for tile-level tensor kernels with a real compiler behind it. You express the algorithm and the schedule separately, and the compiler lowers through MLIR to generate machine code for CPU or GPU.
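To make the algorithm/schedule split concrete, here is a minimal Python sketch rather than the project's actual DSL syntax. The function names and the `BM`/`BN`/`BK` schedule keys are assumptions chosen for illustration. The point is only that the schedule changes loop structure, not results.

```python
from itertools import product

def matmul_reference(a, b):
    """The 'algorithm': the plain definition of C = A @ B on nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def matmul_scheduled(a, b, schedule):
    """Same math, with loop tiling chosen by a separate (hypothetical) schedule dict."""
    m, k, n = len(a), len(b), len(b[0])
    bm, bn, bk = schedule["BM"], schedule["BN"], schedule["BK"]
    c = [[0] * n for _ in range(m)]
    # Outer loops walk tile origins; inner loops stay inside one tile.
    for i0, j0, p0 in product(range(0, m, bm), range(0, n, bn), range(0, k, bk)):
        for i in range(i0, min(i0 + bm, m)):
            for j in range(j0, min(j0 + bn, n)):
                acc = c[i][j]
                for p in range(p0, min(p0 + bk, k)):
                    acc += a[i][p] * b[p][j]
                c[i][j] = acc
    return c
```

Swapping in a different schedule dict changes the traversal order but leaves the output identical, which is the property that lets a compiler tune schedules without touching the algorithm.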

I built it because I wanted to learn this layer by doing, not just reading papers. Modern ML compilers share the same core staged-lowering ideas, and building a full pipeline end to end is the fastest way to internalize how they really work.

Why this matters in practice

When you send a message to a service like ChatGPT, the serving framework (PyTorch, for example) dispatches thousands of tensor operations (matmul, softmax, layer_norm) to the GPU per query, and each one needs actual GPU code. There are two ways to get it. Vendor libraries (cuBLAS, cuDNN) are fast but cover only the shapes, dtypes, and fusions the vendor shipped. A compiler (Triton, XLA, IREE, this project) takes the high-level math and generates optimized code on the fly for the shape, dtype, and fusion at hand.

This compiler sits in that second slot, between the framework and the silicon. The framework doesn't care which kernel fired; it just gets a result tensor back. If compiler-generated kernels cut GPU time by 20% and kernel time dominates latency, the user sees the response in 1 second instead of 1.25, and across millions of queries that difference adds up.

DSL and the `tile` dialect

Tile types are first-class: `Tile<f32, [BM, BK]>` describes a tile of fp32 values with a compile-time shape and an optional memory-space attribute. `tile.load`/`tile.store` move data; `tile.matmul`, `tile.elementwise`, and friends compute on it. The type system enforces shape and layout consistency at compile time, so a kernel that mismatches block sizes fails to type-check.
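The shape rule for `tile.matmul` can be sketched in a few lines. This is a Python illustration, not the project's checker: the `Tile` dataclass and `matmul_result_type` are hypothetical names, and leaving the result's memory space unset is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Tile:
    dtype: str                      # element type, e.g. "f32"
    shape: Tuple[str, ...]          # symbolic compile-time dims, e.g. ("BM", "BK")
    memspace: Optional[str] = None  # optional memory-space attribute

def matmul_result_type(lhs: Tile, rhs: Tile) -> Tile:
    """Return the result tile type, or raise TypeError on a mismatch."""
    if lhs.dtype != rhs.dtype:
        raise TypeError(f"tile.matmul: dtype mismatch: {lhs.dtype} vs {rhs.dtype}")
    if len(lhs.shape) != 2 or len(rhs.shape) != 2:
        raise TypeError("tile.matmul: both operands must be rank-2 tiles")
    if lhs.shape[1] != rhs.shape[0]:
        raise TypeError(
            f"tile.matmul: inner dims differ: {lhs.shape[1]} vs {rhs.shape[0]}")
    # Assumption: the result's memory space is left unset here.
    return Tile(lhs.dtype, (lhs.shape[0], rhs.shape[1]))
```

Multiplying a `[BM, BK]` tile by a `[BK, BN]` tile yields `[BM, BN]`; multiplying it by a `[BN, BK]` tile raises before any lowering happens.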

The `tile` dialect is the project's center of gravity. Source parses into an AST; the AST lowers into `tile` ops with explicit shape and memory-space attributes; from there each backend takes over. Because both backends consume the same dialect, getting the dialect right meant being honest up front about what tensor compilers need to encode: memory hierarchy, tiling, vectorization, schedule directives. Dialect authoring uses ODS/TableGen and the standard pattern-rewrite framework; PassManager composes pipelines so any intermediate stage is inspectable.
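The "every stage is inspectable" property is easy to picture with a toy pipeline runner. This Python sketch is not MLIR's PassManager API: the IR is just a list of op names, and the pass names and the `tile.fused_matmul` op are hypothetical.

```python
def run_pipeline(ir, passes):
    """Apply passes in order, keeping a snapshot after each stage for inspection."""
    snapshots = [("input", ir)]
    for name, fn in passes:
        ir = fn(ir)  # each pass returns a new IR value
        snapshots.append((name, ir))
    return ir, snapshots

def lower_ast_to_tile(ir):
    """Toy lowering: prefix every AST-level op name with the dialect name."""
    return [f"tile.{op}" for op in ir]

def fuse_matmul_epilogue(ir):
    """Toy rewrite: fold a matmul followed by an elementwise op into one op."""
    out = []
    for op in ir:
        if out and out[-1] == "tile.matmul" and op == "tile.elementwise":
            out[-1] = "tile.fused_matmul"  # hypothetical fused op name
        else:
            out.append(op)
    return out
```

Running `run_pipeline(["load", "matmul", "elementwise", "store"], [("ast-to-tile", lower_ast_to_tile), ("fuse", fuse_matmul_epilogue)])` returns the final IR plus one snapshot per stage, so a bad rewrite can be traced to the pass that introduced it.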

One source language, two targets

CPU path (see the CPU backend writeup): `tile` → `linalg` + `memref` + `scf` → LLVM dialect → LLVM IR → x86-64 AVX2 assembly. Schedule application clamps vectorize widths against the target ISA (AVX2: 8-wide fp32; AVX-512: 16-wide). Vertical fusion folds a matmul plus elementwise epilogue into one op.
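The width clamp follows directly from register size divided by element size. A minimal Python sketch, assuming a hypothetical `clamp_vector_width` helper and lookup tables that are not taken from the project:

```python
# Vector register widths in bits for the two ISAs named above.
VECTOR_REGISTER_BITS = {"avx2": 256, "avx512": 512}
DTYPE_BITS = {"f32": 32, "f64": 64}

def clamp_vector_width(requested, isa, dtype="f32"):
    """Clamp a schedule's requested lane count to what one register holds."""
    max_lanes = VECTOR_REGISTER_BITS[isa] // DTYPE_BITS[dtype]
    return max(1, min(requested, max_lanes))
```

A schedule that asks for `vectorize(16)` on fp32 would be clamped to 8 lanes on AVX2 (256 / 32) and kept at 16 on AVX-512 (512 / 32).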

GPU path (see the GPU backend writeup): `tile` → `gpu` + `nvgpu`, targeting NVIDIA sm_80 tensor cores. The matmul body becomes `nvgpu.mma.sync` (mmaShape=[16,8,16], fp16 in, fp32 accum); global-to-shared is `nvgpu.device_async_copy` (`cp.async`); shared-to-register is `nvgpu.ldmatrix`. The pipeline currently ends at `gpu`/`nvgpu` IR; from there the upstream NVVM conversion passes and LLVM's NVPTX backend carry it to NVVM and PTX.
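To see what the mmaShape implies for tiling, the number of `mma.sync` operations needed to cover one block tile can be counted directly. This is a Python sketch under assumptions: the function name and the block tile sizes in the usage note are hypothetical, and only the [16,8,16] shape comes from the text above.

```python
MMA_SHAPE = (16, 8, 16)  # (M, N, K), from mmaShape=[16,8,16]

def mma_ops_per_block_tile(bm, bn, bk, mma_shape=MMA_SHAPE):
    """Count the mma.sync ops needed to cover a BM x BN x BK block tile."""
    m, n, k = mma_shape
    if bm % m or bn % n or bk % k:
        raise ValueError(
            f"block tile ({bm}, {bn}, {bk}) is not a multiple of mma shape {mma_shape}")
    return (bm // m) * (bn // n) * (bk // k)
```

For an illustrative 128x128x32 block tile this gives 8 * 16 * 2 = 256 operations per K-step of the block, which is the work that then has to be split across warps.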

The architectural argument: "vectorize and fuse" on CPU vs. "manage memory hierarchy and target tensor cores" on GPU are very different problems, and the shared `tile` dialect is what makes adding the second backend tractable instead of a rewrite.

Why MLIR

Tensor compilation is fundamentally staged lowering: you start with high-level intent ("matmul a tile") and need to land on hardware-specific instructions (vector FMAs, PTX `mma.sync` ops). Each stage wants its own IR shape. Hand-rolling that in LLVM IR alone is possible but masochistic; reusing `linalg`, `vector`, `gpu`, and `nvgpu` saves months. The cost is a real learning curve (dialect authoring, pattern rewrites, ODS/TableGen, the conversion framework, research-grade docs). That cost is worth it for a project of this shape but overkill for a single-target compiler.

Highlights

The things I'm proudest of.

  • Tile-typed surface language with explicit shape, layout, and memory-space attributes (`Tile<f32, [BM, BK]>`, `tile.load`/`tile.store`), plus a separate schedule block with Halide/Triton-style directives (tile, bind, vectorize, unroll, parallel, pipeline, prefetch, swizzle, fuse). Schedule is separate from algorithm.
  • Five-pass static type checker with clang-style diagnostics (`file:line:col: error:`, notes, hints, source-line carets). Catches shape mismatches, dangling iter-var references, invalid bind dimensions, and out-of-range pipeline depths at compile time.
  • Custom `tile` MLIR dialect as the connective tissue between backends. ODS/TableGen for op declarations and verifiers; standard pattern-rewrite for transformations; ANTLR4 → AST → tile dialect → backend lowering, composed via PassManager so every stage is inspectable.
  • Two backends, one source. CPU: tile → linalg + memref + scf → LLVM dialect → LLVM IR → x86-64 AVX2 assembly via `llc`. GPU: tile → gpu + nvgpu, targeting sm_80 tensor cores via `nvgpu.mma.sync` (mmaShape=[16,8,16], fp16 in, fp32 accum) with `cp.async` for global-to-shared staging.
  • End-to-end at the IR level on both paths; CPU path also produces real x86 assembly on disk. Lit suite covers the parser, type checker, dialect verifiers, each lowering pass, and one end-to-end CPU run. Hardware-backed validation is the next milestone (see per-backend writeups).
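
The clang-style diagnostic layout mentioned above (`file:line:col: error:` plus a source line and caret) can be sketched as follows. This is a Python illustration with a hypothetical `format_diagnostic` helper and file name, not the project's C++ implementation. It assumes 1-based columns and source lines without tabs.

```python
def format_diagnostic(path, line, col, message, source_line, severity="error"):
    """Render a diagnostic as a header, the source line, and a caret under `col`."""
    header = f"{path}:{line}:{col}: {severity}: {message}"
    caret = " " * (col - 1) + "^"  # columns are 1-based
    return "\n".join([header, source_line, caret])
```

Calling `format_diagnostic("kernel.tile", 3, 5, "shape mismatch in tile.matmul", "c = matmul(a, b)")` produces a three-line message whose caret sits under the fifth character of the source line.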