ML DSL Compiler

Summary

I built this compiler to learn modern ML compiler architecture by actually building one end to end. A single tile-level source language lowers into a custom MLIR dialect, then branches into CPU and GPU targets without changing frontend code. One language, two backends, working pipeline.

Stack

What it's built with.

Compiler Frontend

ANTLR4
AST Design
Type Systems
DSL Design

MLIR & IR

MLIR Dialect Authoring
Pass Infrastructure
linalg / vector / affine
ODS / TableGen

Build & Tooling

C++17
CMake
lit / FileCheck

Details

How it works.

What this project is

This is a Triton/Halide-style DSL for tile-level tensor kernels with a real compiler behind it. You express the algorithm and the schedule separately, and the compiler lowers through MLIR to generate machine code for CPU or GPU.

I built it because I wanted to learn this layer by doing, not just reading papers. Modern ML compilers share the same core staged-lowering ideas, and building a full pipeline end to end is the fastest way to internalize how they really work.

Why this matters in practice

When you send a message to a chatbot such as ChatGPT, the serving framework (PyTorch, for example) dispatches thousands of tensor operations (matmul, softmax, layer_norm) to the GPU per query. Each one needs actual GPU code, and there are two ways to get it. Vendor libraries (cuBLAS, cuDNN) are fast but cover only the shapes, dtypes, and fusions the vendor shipped. A compiler (Triton, XLA, IREE, this project) takes the high-level math and generates optimized code on the fly for the shape, dtype, and fusion at hand.

This compiler sits in that second slot, between the framework and the silicon. The framework doesn't care which kernel fired; it just gets a result tensor back. If a compiler-generated kernel takes 20% less time than the vendor option, and kernel time dominates response latency, the user sees the response in 1 second instead of 1.25. Across millions of queries, that difference adds up.

DSL and the `tile` dialect

Tile types are first-class: `Tile<f32, [BM, BK]>` describes a tile of fp32 values with a compile-time shape and an optional memory-space attribute. `tile.load`/`tile.store` move data; `tile.matmul`, `tile.elementwise`, and friends compute on it. The type system enforces shape and layout consistency at compile time, so a kernel that mismatches block sizes fails to type-check.
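As a rough illustration of the shape rule described above, here is a minimal C++17 sketch of a tile type and a matmul type check. `TileType`, `MemSpace`, `CheckResult`, and `checkMatmul` are hypothetical names invented for this example, not the project's actual type checker, and shapes are plain integers here rather than symbolic block sizes such as `BM`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-ins for illustration; not the project's real classes.
enum class Elem { F16, F32 };
enum class MemSpace { Global, Shared, Register };

struct TileType {
  Elem elem;
  std::array<std::int64_t, 2> shape;  // compile-time [rows, cols]
  std::optional<MemSpace> space;      // the memory space is optional
};

struct CheckResult {
  std::optional<TileType> type;  // result type, set on success
  std::string error;             // message, set on failure
};

// Type-check C = A x B: element types and inner dimensions must agree.
inline CheckResult checkMatmul(const TileType &a, const TileType &b) {
  assert(a.shape[0] > 0 && a.shape[1] > 0 && b.shape[0] > 0 && b.shape[1] > 0);
  if (a.elem != b.elem)
    return {std::nullopt, "element types differ"};
  if (a.shape[1] != b.shape[0])
    return {std::nullopt, "inner dimensions differ: " +
                              std::to_string(a.shape[1]) + " vs " +
                              std::to_string(b.shape[0])};
  // The result tile is [rows of A, cols of B]; memory space left unset.
  return {TileType{a.elem, {a.shape[0], b.shape[1]}, std::nullopt}, ""};
}
```

With a 64x32 tile and a 32x128 tile the check yields a 64x128 result type; swapping the second tile for a 16x128 one produces the "inner dimensions differ" error instead of a type.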

The `tile` dialect is the project's center of gravity. Source parses into an AST; the AST lowers into `tile` ops with explicit shape and memory-space attributes; from there each backend takes over. Because both backends consume the same dialect, getting the dialect right meant being honest up front about what tensor compilers need to encode: memory hierarchy, tiling, vectorization, schedule directives. Dialect authoring uses ODS/TableGen and the standard pattern-rewrite framework; PassManager composes pipelines so any intermediate stage is inspectable.
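The inspectability point can be illustrated without MLIR itself. The sketch below is a toy model in plain C++17 and deliberately does not use MLIR's `PassManager` API: `Stage`, `runPipeline`, and `replaceAll` are hypothetical helpers, the "IR" is just a string, and each stage is a text rewrite standing in for a real lowering.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model: the "IR" is a string and a stage is a named rewrite of it.
struct Stage {
  std::string name;
  std::function<std::string(const std::string &)> run;
};

// Runs the stages in order and records the IR after each one, so any
// intermediate form can be inspected afterwards.
inline std::vector<std::pair<std::string, std::string>>
runPipeline(std::string ir, const std::vector<Stage> &stages) {
  std::vector<std::pair<std::string, std::string>> snapshots;
  snapshots.emplace_back("input", ir);
  for (const Stage &stage : stages) {
    ir = stage.run(ir);
    snapshots.emplace_back(stage.name, ir);
  }
  return snapshots;
}

// Replaces every occurrence of `from` with `to`; stands in for a lowering.
inline std::string replaceAll(std::string text, const std::string &from,
                              const std::string &to) {
  assert(!from.empty());
  for (std::size_t pos = text.find(from); pos != std::string::npos;
       pos = text.find(from, pos + to.size()))
    text.replace(pos, from.size(), to);
  return text;
}
```

Registering two stages that rewrite a placeholder op name and calling `runPipeline` returns three snapshots (input plus one per stage), which is the property that matters here: every intermediate form is available for inspection, not only the final output.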

One source language, two targets

CPU path (see the CPU backend writeup): `tile` → `linalg` + `memref` + `scf` → LLVM dialect → LLVM IR → x86-64 AVX2 assembly. Schedule application clamps vectorize widths against the target ISA (AVX2: 8-wide fp32; AVX-512: 16-wide). Vertical fusion folds a matmul plus elementwise epilogue into one op.
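The width clamp can be sketched in a few lines of C++17. This is a minimal illustration, assuming the clamp is a simple minimum against the number of lanes that fit in one vector register; `Isa`, `maxLanes`, and `clampVectorWidth` are hypothetical names, and the project's actual schedule pass may apply further rules (rounding to a power of two, for example).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical ISA enum; the register widths are the architectural ones
// (AVX2: 256-bit vectors, AVX-512: 512-bit vectors).
enum class Isa { AVX2, AVX512 };

constexpr int vectorBits(Isa isa) { return isa == Isa::AVX2 ? 256 : 512; }

// Number of lanes of `elemBits`-wide elements in one vector register.
constexpr int maxLanes(Isa isa, int elemBits) {
  return vectorBits(isa) / elemBits;
}

// Clamp a requested vectorize width to what the target register can hold.
inline int clampVectorWidth(int requested, Isa isa, int elemBits = 32) {
  assert(requested >= 1 && elemBits > 0);
  return std::min(requested, maxLanes(isa, elemBits));
}
```

Under these assumptions a schedule asking for `vectorize(16)` on fp32 is clamped to 8 lanes on AVX2 and kept at 16 on AVX-512, matching the widths quoted above.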

GPU path (see the GPU backend writeup): `tile` → `gpu` + `nvgpu`, targeting NVIDIA sm_80 tensor cores. The matmul body becomes `nvgpu.mma.sync` (mmaShape=[16,8,16], fp16 in, fp32 accum); global-to-shared copies use `nvgpu.device_async_copy` (`cp.async`); shared-to-register loads use `nvgpu.ldmatrix`. The project's own pipeline ends at `gpu`/`nvgpu` IR; from there the upstream conversion passes lower to the NVVM dialect, `mlir-translate` emits LLVM IR, and LLVM's NVPTX backend produces PTX.
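To make the instruction shape concrete, the following C++17 sketch works out how many `mma.sync` instructions of shape [16,8,16] one warp needs to cover a warp-level tile, and how many fp32 accumulator values each thread holds. The helper names and the example tile sizes are assumptions for illustration; the source does not state which warp tile sizes the GPU backend uses.

```cpp
#include <cassert>
#include <cstdint>

// Instruction shape quoted in the text: m16 n8 k16.
struct MmaShape { std::int64_t m, n, k; };
constexpr MmaShape kMmaShape{16, 8, 16};
constexpr std::int64_t kWarpSize = 32;

// True when an M x N x K warp tile is an exact multiple of the
// instruction shape, so no partial instructions are needed.
constexpr bool tilesEvenly(std::int64_t M, std::int64_t N, std::int64_t K,
                           MmaShape s = kMmaShape) {
  return M % s.m == 0 && N % s.n == 0 && K % s.k == 0;
}

// Number of mma.sync instructions one warp issues to cover the tile.
inline std::int64_t mmaOpsPerWarpTile(std::int64_t M, std::int64_t N,
                                      std::int64_t K, MmaShape s = kMmaShape) {
  assert(tilesEvenly(M, N, K, s));
  return (M / s.m) * (N / s.n) * (K / s.k);
}

// fp32 accumulator values held per thread: the M x N outputs are spread
// evenly across the 32 lanes of the warp.
constexpr std::int64_t accumulatorsPerThread(std::int64_t M, std::int64_t N) {
  return (M * N) / kWarpSize;
}
```

For a hypothetical 64x64 warp tile with a K step of 32, that is 4 x 8 x 2 = 64 instructions per step and 128 accumulator values per thread, which shows why tile-size choices on this path are bounded by register pressure as much as by shared memory.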

The architectural argument: "vectorize and fuse" on CPU vs. "manage memory hierarchy and target tensor cores" on GPU are very different problems, and the shared `tile` dialect is what makes adding the second backend tractable instead of a rewrite.

Why MLIR

Tensor compilation is fundamentally staged lowering: you start with high-level intent ("matmul a tile") and need to land on hardware-specific instructions (vector FMAs, PTX `mma.sync` ops). Each stage wants its own IR shape. Hand-rolling that in LLVM IR alone is possible but masochistic; reusing `linalg`, `vector`, `gpu`, and `nvgpu` saves months. The cost is a real learning curve (dialect authoring, pattern rewrites, ODS/TableGen, the conversion framework, research-grade docs), worth it for a project of this shape but overkill for a single-target compiler.

Highlights

The things I'm proudest of.

▹Tile-typed surface language with explicit shape, layout, and memory-space attributes (`Tile<f32, [BM, BK]>`, `tile.load`/`tile.store`), plus a schedule block kept separate from the algorithm, with Halide/Triton-style directives (tile, bind, vectorize, unroll, parallel, pipeline, prefetch, swizzle, fuse).
▹Five-pass static type checker with clang-style diagnostics (`file:line:col: error:`, notes, hints, source-line carets). Catches shape mismatches, dangling iter-var references, invalid bind dimensions, and out-of-range pipeline depths at compile time.
▹Custom `tile` MLIR dialect as the connective tissue between backends. ODS/TableGen for op declarations and verifiers; standard pattern-rewrite for transformations; ANTLR4 → AST → tile dialect → backend lowering, composed via PassManager so every stage is inspectable.
▹Two backends, one source. CPU: tile → linalg + memref + scf → LLVM dialect → LLVM IR → x86-64 AVX2 assembly via `llc`. GPU: tile → gpu + nvgpu, targeting sm_80 tensor cores via `nvgpu.mma.sync` (mmaShape=[16,8,16], fp16 in, fp32 accum) with `cp.async` for global-to-shared staging.
▹End-to-end at the IR level on both paths; CPU path also produces real x86 assembly on disk. Lit suite covers the parser, type checker, dialect verifiers, each lowering pass, and one end-to-end CPU run. Hardware-backed validation is the next milestone (see per-backend writeups).
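The clang-style diagnostic layout mentioned in the type-checker highlight (`file:line:col: error:` followed by the source line and a caret) can be sketched as below. This is a minimal C++17 illustration, not the project's diagnostic engine: `SourceLoc` and `formatError` are hypothetical names, and notes and hints are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// 1-based line and column, as in clang-style diagnostics.
struct SourceLoc {
  std::string file;
  int line;
  int col;
};

// Formats "file:line:col: error: message", then echoes the offending
// source line and places a caret under the reported column.
inline std::string formatError(const SourceLoc &loc,
                               const std::string &message,
                               const std::string &sourceLine) {
  assert(loc.line >= 1 && loc.col >= 1);
  std::string out = loc.file + ":" + std::to_string(loc.line) + ":" +
                    std::to_string(loc.col) + ": error: " + message + "\n";
  out += sourceLine + "\n";
  out += std::string(static_cast<std::size_t>(loc.col - 1), ' ') + "^\n";
  return out;
}
```

Calling `formatError` with a made-up location such as `{"kernel.dsl", 3, 9}` produces a header line, the echoed source line, and a caret preceded by eight spaces so that it sits under column 9.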