CPU AVX2 Target

Summary

This page covers the compiler's CPU backend. The same source kernel is lowered through the `linalg`, `memref`, and `scf` dialects into LLVM IR and then x86-64 AVX2 assembly. Most of the work is in schedule-aware optimizations: fusion, vector-width clamping, and loop parallelization.

Stack

What it's built with.

Codegen & Optimization

LLVM IR
Loop Tiling & Unrolling
Operator Fusion
SIMD / AVX2

MLIR Dialects (consumed)

linalg
memref
scf
Pattern Rewrites

Pipeline & Tooling

MLIR PassManager
mlir-translate (MLIR → LLVM IR)
llc (LLVM IR → x86 asm)
lit / FileCheck

Details

How it works.

Why CPU gets its own writeup

The umbrella writeup covers the DSL, the `tile` dialect, and the "two backends, one source" story. CPU and GPU are sibling lowerings of that dialect: same frontend, very different optimization and tooling concerns. On CPU the question is "vectorize and fuse": turn nested tiled loops into FMA-rich vector code that LLVM can schedule against AVX2. On GPU the question shifts to memory hierarchy and tensor cores.

Lowering pipeline

Stage 1: `tile` → `linalg` + `memref` + `scf`. A pattern-rewrite pass turns `tile.matmul` into `linalg.fill 0` + `linalg.matmul`, `tile.load` into `memref.subview` (plus a `memref.copy` when crossing memory spaces), `tile.store` into `memref.copy`, and `tile.tile_grid` into nested `scf.for` (or `scf.forall` under the parallel directive). `!tile.tile<...>` becomes `memref<shape x elem, memspace>`.
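The Stage 1 mapping can be summarized as a small lookup, sketched here in Python rather than as the real MLIR rewrite patterns. The function names `lower_tile_op` and `tile_type_to_memref` and their parameters are hypothetical and exist only for illustration; only the op names come from the description above.

```python
def lower_tile_op(op, *, crosses_memory_space=False, parallel=False):
    """Return the names of the ops a tile op is rewritten to in Stage 1.

    Hypothetical helper: it mirrors the mapping described in the text,
    not the actual pattern-rewrite implementation.
    """
    if op == "tile.matmul":
        # Zero-fill the accumulator, then accumulate the product into it.
        return ["linalg.fill", "linalg.matmul"]
    if op == "tile.load":
        lowered = ["memref.subview"]
        if crosses_memory_space:
            # A copy is only needed when source and destination spaces differ.
            lowered.append("memref.copy")
        return lowered
    if op == "tile.store":
        return ["memref.copy"]
    if op == "tile.tile_grid":
        # The parallel directive selects scf.forall instead of scf.for.
        return ["scf.forall"] if parallel else ["scf.for"]
    raise ValueError(f"no Stage 1 lowering sketched for {op!r}")


def tile_type_to_memref(shape, elem, memspace):
    """Spell the memref type that a !tile.tile<...> type converts to."""
    dims = "x".join(str(d) for d in shape)
    return f"memref<{dims}x{elem}, {memspace}>"


print(lower_tile_op("tile.load", crosses_memory_space=True))
print(tile_type_to_memref((64, 64), "f32", 0))
```

The keyword-only flags stand in for information the real patterns would read off the op and the schedule.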

Stage 2: standard MLIR passes lower to LLVM dialect (linalg bufferized and lowered to loops, scf to branch-based control flow in `cf`, memref to pointers and offsets, arith/func to their LLVM-dialect equivalents).

Stage 3: `mlir-translate -mlir-to-llvmir` produces LLVM IR; `llc -mtriple=x86_64 -mattr=+avx2` produces x86 assembly. The named pipeline (`--tile-cpu-pipeline`) can run end-to-end or stop at any intermediate stage for inspection.
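For reference, the two Stage 3 tool invocations can be assembled as argument lists like this. This is a minimal sketch: the helper name `build_stage3_commands` and the file layout are assumptions, and the project's own `--tile-cpu-pipeline` driver is not modeled. Only the tools and flags quoted above are used, plus the standard `-o` output option.

```python
from pathlib import PurePosixPath


def build_stage3_commands(llvm_dialect_mlir, out_dir):
    """Build (but do not run) the mlir-translate and llc command lines.

    Hypothetical helper; paths are treated as POSIX paths for simplicity.
    """
    src = PurePosixPath(llvm_dialect_mlir)
    out = PurePosixPath(out_dir)
    ll_file = out / f"{src.stem}.ll"   # LLVM IR produced by mlir-translate
    asm_file = out / f"{src.stem}.s"   # x86-64 assembly produced by llc

    translate = ["mlir-translate", "-mlir-to-llvmir", str(src), "-o", str(ll_file)]
    llc = ["llc", "-mtriple=x86_64", "-mattr=+avx2", str(ll_file), "-o", str(asm_file)]
    return translate, llc


translate_cmd, llc_cmd = build_stage3_commands("build/matmul.mlir", "out")
print(" ".join(translate_cmd))
print(" ".join(llc_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine with the LLVM and MLIR tools installed; keeping the intermediate `.ll` file on disk is what makes each stage inspectable.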

Key optimizations

Vertical fusion: a `tile.elementwise` whose first operand is a `tile.matmul` result folds into the matmul as a `fused_epilogue` attribute. When the matmul result has more than one use, fusion is blocked, because the other consumers still need the unfused value. This is the canonical "matmul + bias + activation" optimization that matters most for transformer-shaped workloads.
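The fusion rule and its multi-use guard can be illustrated on a toy IR. This is a sketch under stated assumptions, not the actual pass: the `Op` class, the `kind` attribute, and the restriction to a single epilogue per matmul are all hypothetical choices made to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Toy op: a name, operand ops, and an attribute dictionary."""
    name: str
    operands: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)


def count_uses(ops, target):
    """Count how many operand slots across all ops refer to `target`."""
    return sum(1 for op in ops for operand in op.operands if operand is target)


def fuse_epilogues(ops):
    """Fold an elementwise op into its producing matmul when that is safe."""
    kept = []
    for op in ops:
        producer = op.operands[0] if op.operands else None
        fusable = (
            op.name == "tile.elementwise"
            and producer is not None
            and producer.name == "tile.matmul"
            and "fused_epilogue" not in producer.attrs  # assumption: one epilogue
            and count_uses(ops, producer) == 1          # multi-use blocks fusion
        )
        if fusable:
            producer.attrs["fused_epilogue"] = op.attrs["kind"]
            # Users of the elementwise result now read the fused matmul result.
            for user in ops:
                user.operands = [producer if o is op else o for o in user.operands]
        else:
            kept.append(op)
    return kept


a, b = Op("tile.load"), Op("tile.load")
mm = Op("tile.matmul", [a, b])
relu = Op("tile.elementwise", [mm], {"kind": "relu"})
fused = fuse_epilogues([a, b, mm, relu])
print([op.name for op in fused], mm.attrs)
```

If a `tile.store` also consumed `mm` directly, `count_uses` would return 2 and the elementwise op would stay separate, which is the behaviour-preserving case described above.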

Target-aware vector-width clamping: apply-schedule clamps the user's vectorize-width hint against the ISA and records the result as `tile.vector_width`. The caps follow from register width: a 256-bit AVX2 register holds 8 fp32 lanes, a 512-bit AVX-512 register holds 16.

Tile-specific canonicalization: an algebraic canonicalization pass handles `add(x, zeros) → x` and `mul(x, ones) → x`, which the standard canonicalize pass misses because it doesn't know `tile.elementwise "zeros"`.
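Both rules are small enough to sketch directly. The Python below is illustrative only: `clamp_vector_width` and `canonicalize` are hypothetical names, the lower bound of 1 on the clamped width is an assumption, and the expression encoding (nested tuples with the string markers `"zeros"` and `"ones"`) is invented for the example.

```python
# Register widths in bits for the two ISAs mentioned in the text.
VECTOR_BITS = {"avx2": 256, "avx512": 512}
ELEM_BITS = {"f32": 32}


def clamp_vector_width(hint, isa, elem="f32"):
    """Clamp a user vectorize-width hint to what the ISA can hold."""
    max_lanes = VECTOR_BITS[isa] // ELEM_BITS[elem]  # 8 for AVX2, 16 for AVX-512
    # Assumption: widths below 1 are raised to 1 (scalar).
    return max(1, min(hint, max_lanes))


def canonicalize(expr):
    """Apply add(x, zeros) -> x and mul(x, ones) -> x, bottom-up.

    Expressions are (op, lhs, rhs) tuples or leaf strings. Only the
    right-operand identities named in the text are handled.
    """
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr
    lhs, rhs = canonicalize(lhs), canonicalize(rhs)
    if op == "add" and rhs == "zeros":
        return lhs
    if op == "mul" and rhs == "ones":
        return lhs
    return (op, lhs, rhs)


print(clamp_vector_width(16, "avx2"))
print(canonicalize(("add", ("mul", "x", "ones"), "zeros")))
```

Treating the hint as an upper-bounded request rather than a hard requirement means one schedule can be reused across targets without edits.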

Parallel-loop promotion: `parallel tile_grid` flips the outer loop nest from sequential `scf.for` to parallel `scf.forall`. Tile-size selection is schedule-driven today; autotuning is on the umbrella roadmap. OpenMP runtime hookup is the next milestone for this directive.
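What makes the promotion legal is that each tile of the grid writes a disjoint block of the output. The Python analogy below shows the same tile grid run sequentially and in parallel with identical results. It is not the generated code: the function names are hypothetical, and a thread pool stands in for the runtime that has not been hooked up yet.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def matmul_tile(a, b, c, i0, j0, tile):
    """Compute one output tile of c = a @ b; tiles never overlap."""
    depth = len(b)
    for i in range(i0, min(i0 + tile, len(a))):
        for j in range(j0, min(j0 + tile, len(b[0]))):
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(depth))


def run_tile_grid(a, b, tile, parallel):
    """Walk the tile grid either sequentially or with a thread pool."""
    rows, cols = len(a), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    grid = list(product(range(0, rows, tile), range(0, cols, tile)))
    if parallel:
        # Analogue of scf.forall: iterations may run in any order.
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(matmul_tile, a, b, c, i0, j0, tile)
                       for i0, j0 in grid]
            for future in futures:
                future.result()  # surface any exception from a worker
    else:
        # Analogue of nested scf.for: fixed sequential order.
        for i0, j0 in grid:
            matmul_tile(a, b, c, i0, j0, tile)
    return c


a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 0, 2], [0, 1, 3]]
assert run_tile_grid(a, b, 2, parallel=False) == run_tile_grid(a, b, 2, parallel=True)
```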

Current endpoint

The pipeline currently ends by writing x86-64 AVX2 assembly to disk via `llc`. Lit covers each stage, and one end-to-end test pins the full chain. Deliberately deferred: an LLVM ORC JIT to load and execute the code in-process, plus a Python ctypes harness for the NumPy equivalence check. Both are scoped on the roadmap. The dev machine is arm64, so the perf comparison vs OpenBLAS sgemm needs an actual x86 box.

Highlights

The things I'm proudest of.

▹Lowering pipeline: `tile` → `linalg` + `memref` + `scf` → LLVM dialect → LLVM IR → x86-64 assembly via `llc -mtriple=x86_64 -mattr=+avx2`. Five inspectable artifacts (tile-dialect MLIR, linalg form, LLVM dialect, LLVM IR, assembly text).
▹Schedule-driven vector-width clamping: the user's `vectorize <op> width=N` directive is a hint; apply-schedule clamps it against the target ISA (AVX2 fp32 caps at 8, AVX-512 at 16) and records the resolved width as an op attribute.
▹Vertical fusion: a `tile.elementwise` whose first operand is a `tile.matmul` result folds into the matmul as a `fused_epilogue` attribute, the canonical "matmul + bias + activation" optimization. Multi-use cases correctly block fusion to preserve behaviour.
▹Parallel directive: `parallel tile_grid` promotes the outer loop nest from sequential `scf.for` to parallel `scf.forall`. OpenMP runtime hookup is the next milestone for this directive.
▹Hardware-backed validation (numerical equivalence vs NumPy on x86, perf comparison vs OpenBLAS) is the next milestone. The IR and assembly on disk pass the lit checks; getting them onto an x86 box and into a JIT runtime is what's deferred.