Summary

Gazprea was a full-compiler build in CMPUT 415, and we treated it like real engineering, not just coursework. Our team owned parsing, typing, lowering, runtime support, and CI-backed regression testing across a suite of 2,101 integration tests. We finished first in the course and were invited to present the work in an academic exchange with HSBI in Germany.

Stack

What it's built with.

Languages

  • C++17
  • CMake

Compiler Frontend

  • ANTLR4
  • AST Visitors
  • Lexer/Parser Design
  • Scope Analysis

Type System

  • Polymorphic TypeInfo
  • Type Resolution
  • Cycle Detection
  • Type Casting

IR & Codegen

  • MLIR
  • LLVM IR
  • Dialect Lowering
  • MemRef Descriptors
  • SCF / Arith / Func

Architecture

  • Compiler Theory
  • Service-Based Codegen
  • Memory Layout Design
  • IR Lowering Pipelines

Testing

  • Integration Test Harness
  • Diff-Based Testing
Skills exercised

The non-code parts.

  • Team Leadership
  • Technical Writing
  • Academic Collaboration
Details

How it works.

Project context

Gazprea is a research language used as the capstone compiler project in CMPUT 415 at UAlberta. It forces you to do the whole thing properly: frontend, typing, IR, lowering, and runtime representation for vectors and matrices.

We built this in a four-person team over a term, finished first in the course, and were invited to present the implementation in an academic exchange with HSBI in Germany.

Compiler pipeline

Nine passes in order: ASTBuilder → ScopeAnalyzer → RoutineRecorder → TypeResolver → QualifierEnforcer → TypeChecker → TypeCastValidator → FinalJudgmentVisitor → CodeGenVisitor. Each owns a specific set of checks, so when something blows up the blast radius is contained. RoutineRecorder is an early collection pass that populates a `RoutineRegistry` so later passes can resolve forward references without rescanning. FinalJudgmentVisitor catches the oddball cases (break outside a loop, I/O inside a pure function, prototypes that never got a definition, function/procedure namespace conflicts).
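To make the forward-reference idea concrete, here is a minimal sketch of what an early collection pass could populate. Only the name `RoutineRegistry` comes from the project; `RoutineSignature`, its fields, and the method names are assumptions made for illustration, not the real interface.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical signature record; the real registry's fields are not shown here.
struct RoutineSignature {
    std::string returnType;
    std::vector<std::string> paramTypes;
    bool hasBody = false;  // false while only a prototype has been seen
};

class RoutineRegistry {
public:
    // Record a prototype or a definition. A definition that follows a
    // prototype just marks the existing entry as defined.
    void record(const std::string& name, const RoutineSignature& sig) {
        auto [it, inserted] = routines_.try_emplace(name, sig);
        if (!inserted && sig.hasBody) it->second.hasBody = true;
    }

    // Later passes look routines up by name, regardless of source order.
    std::optional<RoutineSignature> lookup(const std::string& name) const {
        auto it = routines_.find(name);
        if (it == routines_.end()) return std::nullopt;
        return it->second;
    }

    // The kind of check a final pass could run: prototypes with no definition.
    std::vector<std::string> undefinedPrototypes() const {
        std::vector<std::string> names;
        for (const auto& [name, sig] : routines_)
            if (!sig.hasBody) names.push_back(name);
        return names;
    }

private:
    std::map<std::string, RoutineSignature> routines_;
};
```

Because the registry is filled before type checking starts, a call to a routine defined further down the file resolves with a plain `lookup`, and the "prototype without a definition" case reduces to one scan at the end.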

The frontend is a single 342-line ANTLR4 grammar covering scalars, vectors, matrices, tuples, structs, intervals, type aliases, generators, streams, slicing, and stride syntax. Everything downstream of the parse tree is C++17 visitors over a 49-class AST hierarchy with a `Node` base, `Expr`/`Stmt` splits, and shared bases (`ArrayBase`, `ArrayAccess`) that let multiple node types share behaviour.
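The visitor-over-AST shape can be sketched with two node types. `Node` and `Expr` are names from the hierarchy described above; `IntLiteral`, `BinaryExpr`, `Visitor`, and `Printer` are hypothetical stand-ins chosen to keep the example small, and the real classes may look quite different.

```cpp
#include <memory>
#include <string>
#include <utility>

struct IntLiteral;
struct BinaryExpr;

// Hypothetical visitor interface: one overload per concrete node type.
struct Visitor {
    virtual ~Visitor() = default;
    virtual void visit(const IntLiteral& node) = 0;
    virtual void visit(const BinaryExpr& node) = 0;
};

// Common base: every node can dispatch itself to a visitor.
struct Node {
    virtual ~Node() = default;
    virtual void accept(Visitor& v) const = 0;
};
struct Expr : Node {};

struct IntLiteral : Expr {
    int value;
    explicit IntLiteral(int v) : value(v) {}
    void accept(Visitor& v) const override { v.visit(*this); }
};

struct BinaryExpr : Expr {
    char op;
    std::unique_ptr<Expr> lhs, rhs;
    BinaryExpr(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    void accept(Visitor& v) const override { v.visit(*this); }
};

// Example pass: render the tree as a parenthesised string.
struct Printer : Visitor {
    std::string out;
    void visit(const IntLiteral& n) override { out += std::to_string(n.value); }
    void visit(const BinaryExpr& n) override {
        out += "(";
        n.lhs->accept(*this);
        out += " ";
        out += n.op;
        out += " ";
        n.rhs->accept(*this);
        out += ")";
    }
};
```

Each semantic pass is then another `Visitor` subclass over the same tree, which is what lets a pass own its checks without touching the others.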

Type system design

The most interesting part of the project. Scalars (boolean, character, integer, real, string) are easy; the work is in intervals, tuples, structs, vectors, and matrices, all first-class with their own layout and promotion rules.

Everything is modelled as a polymorphic TypeInfo hierarchy rather than an enum, and that decision paid off constantly. TypeResolver runs in explicit phases: alias registration, type resolution, expression-type computation, type inference for omitted variable types. Aliases resolve with DFS-based cycle detection, so `typealias my_int my_int;` throws a clean error instead of looping forever. An `ArrayBase` abstraction unifies indexing, slicing, shape queries, and element-type compatibility across arrays, vectors, and strings, so adding behaviour to all three is a single change.
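The alias cycle check is easiest to see in isolation. Below is a minimal sketch of DFS-based resolution with an "on the current path" set; `AliasTable` and `resolveAlias` are illustrative names, and the real TypeResolver works on TypeInfo objects rather than strings.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical alias table: alias name -> the type name it was declared as.
// Names that are not keys (e.g. "integer") count as already-resolved types.
using AliasTable = std::map<std::string, std::string>;

// Follow the alias chain depth-first, tracking names on the current path.
// Meeting a name that is already on the path means the chain loops.
inline std::string resolveAlias(const AliasTable& aliases, const std::string& name,
                                std::set<std::string>& onPath) {
    auto it = aliases.find(name);
    if (it == aliases.end()) return name;  // base type: nothing left to resolve
    if (!onPath.insert(name).second)
        throw std::runtime_error("cyclic type alias: " + name);
    std::string resolved = resolveAlias(aliases, it->second, onPath);
    onPath.erase(name);
    return resolved;
}

// Convenience overload that starts with an empty path.
inline std::string resolveAlias(const AliasTable& aliases, const std::string& name) {
    std::set<std::string> onPath;
    return resolveAlias(aliases, name, onPath);
}
```

With a table containing `my_int -> my_int`, resolving `my_int` throws on the second visit instead of recursing forever, and a chain such as `a -> b -> integer` resolves to `integer`.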

Mutability is enforced by a dedicated QualifierEnforcer pass: const vs var, immutable iterator loop variables, const-by-default globals, procedure-argument qualifier validation. Implicit promotion (integer to real) and explicit casts (`as<type>(expr)`) are split across PromotionService, CastingService, and TypeConversionService, and validated by TypeCastValidator.
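As a rough illustration of the implicit-promotion rule, here is a sketch limited to scalars. It models only the integer-to-real promotion named above; the `Scalar` enum and both function names are assumptions, and the real services also cover compound types.

```cpp
#include <optional>

enum class Scalar { Boolean, Character, Integer, Real };

// Implicit promotion: identity, or integer widening to real. Nothing else.
inline bool canPromote(Scalar from, Scalar to) {
    return from == to || (from == Scalar::Integer && to == Scalar::Real);
}

// The type both operands can be implicitly promoted to, if one exists.
inline std::optional<Scalar> commonType(Scalar lhs, Scalar rhs) {
    if (canPromote(lhs, rhs)) return rhs;
    if (canPromote(rhs, lhs)) return lhs;
    return std::nullopt;  // needs an explicit cast, or is a type error
}
```

Anything that returns `std::nullopt` here is the territory of explicit casts, which is why promotion and casting are worth keeping as separate concerns.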

Codegen architecture

A monolithic codegen visitor would have been horrifying. Instead, codegen is split into nine services in `src/services/codegen/`: BinaryOperationEmitter, LoopEmitter, CallEmitter, SubroutineEmitter, MatrixService, CastingGeneratorService, AggregateCodegenService, CompoundCodegenService, and CodeGenUtils. Each owns a specific concern; they compose through a `BuilderScope` RAII helper that swaps the active MLIR builder for nested codegen contexts (function bodies, block scopes) without leaking state.
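The RAII idea behind `BuilderScope` can be shown without pulling in MLIR. In this sketch `Builder` and `CodegenContext` are hypothetical stand-ins for the MLIR builder and whatever state the services share; only the `BuilderScope` name is taken from the project, and its real constructor may differ.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an MLIR builder: it only remembers where it emits.
struct Builder {
    std::string insertionPoint;
    std::vector<std::string> log;
    void emit(const std::string& op) { log.push_back(insertionPoint + ": " + op); }
};

// Hypothetical shared codegen state holding the currently active builder.
struct CodegenContext {
    Builder* active = nullptr;
};

// RAII guard: install a nested builder on construction and restore the
// previous one on destruction, including when an exception unwinds the scope.
class BuilderScope {
public:
    BuilderScope(CodegenContext& ctx, Builder& nested)
        : ctx_(ctx), previous_(ctx.active) {
        ctx_.active = &nested;
    }
    ~BuilderScope() { ctx_.active = previous_; }

    BuilderScope(const BuilderScope&) = delete;
    BuilderScope& operator=(const BuilderScope&) = delete;

private:
    CodegenContext& ctx_;
    Builder* previous_;
};
```

MLIR ships a similar guard for insertion points, `OpBuilder::InsertionGuard`; the sketch above swaps a whole builder instead, which matches the description of nested codegen contexts.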

MatrixService is the largest piece at ~1,565 lines. It handles descriptor-based memory layout for 2D data, row/column queries, slice and stride arithmetic, and the compile-time-vs-runtime dimension tracking that lets `rows()` and `columns()` return constants when they can. CompoundCodegenService (~1,680 lines) is the other heavy hitter, covering slicing, striding, concatenation, and generator expressions.
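One way to picture the compile-time-vs-runtime dimension tracking is an optional per dimension. This is a sketch under that assumption: `Dim`, `MatrixShape`, and `lowerRowsQuery` are hypothetical names, and the returned strings are descriptive placeholders, not real compiler output.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// A dimension is either known at compile time or only available at runtime.
struct Dim {
    std::optional<std::int64_t> known;
};

struct MatrixShape {
    Dim rows;
    Dim cols;
};

// Describe what a rows() query would lower to: a constant when the size is
// statically known, otherwise a read of the size stored in the descriptor.
inline std::string lowerRowsQuery(const MatrixShape& shape) {
    if (shape.rows.known)
        return "constant " + std::to_string(*shape.rows.known);
    return "load descriptor.sizes[0]";
}
```

The payoff is that a statically sized matrix never pays for a runtime load when its shape is queried.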

Vectors and matrices live as `memref<D1 x ... x Dn x ElementType>` descriptors at runtime, with a bounds check emitted for every indexing operation. Structs are LLVM struct types with named fields. Scalars stay as native MLIR types.
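For readers who have not met memref descriptors, the sketch below mirrors the layout MLIR's standard LLVM lowering uses for a ranked memref (allocated pointer, aligned pointer, offset, sizes, strides), specialised to rank 2. The `at` helper, its 0-based indexing, and the thrown exception are assumptions for illustration; the compiler emits its checks as IR and reports failures through its runtime.

```cpp
#include <cstdint>
#include <stdexcept>

// Shape of a rank-2 memref descriptor after lowering to LLVM:
// two base pointers, an element offset, then per-dimension sizes and strides.
template <typename T>
struct MemRef2D {
    T* allocated;
    T* aligned;
    std::int64_t offset;
    std::int64_t sizes[2];
    std::int64_t strides[2];
};

// Bounds-checked element access (0-based indices in this sketch).
template <typename T>
T& at(MemRef2D<T>& m, std::int64_t row, std::int64_t col) {
    if (row < 0 || row >= m.sizes[0] || col < 0 || col >= m.sizes[1])
        throw std::out_of_range("matrix index out of bounds");
    return m.aligned[m.offset + row * m.strides[0] + col * m.strides[1]];
}
```

Because the address is computed from strides rather than from the sizes, a slice or strided view can reuse the same storage with a different descriptor.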

Lowering to LLVM IR

Codegen emits MLIR across six standard dialects (Arith, SCF, Func, MemRef, Math, ControlFlow). From there, seven sequential lowering passes carry it to LLVM IR: FuncToLLVM, SCFToCF, MathToLLVM, ArithToLLVM, MemRefToLLVM, ControlFlowToLLVM, then ReconcileUnrealizedCasts to clear the `unrealized_conversion_cast` ops that the partial conversions leave behind. Order matters: SCF has to be lowered to ControlFlow before ControlFlow is lowered to LLVM, MemRef lowering has to happen after the descriptor types are fully resolved, and the reconcile pass at the end is what makes the final IR valid LLVM. The compiler driver exposes a `-d 0..7` flag that dumps MLIR after each stage, which made debugging dialect mismatches tractable.
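The relationship between the pass order and the dump flag can be modelled in a few lines. This sketch assumes that level 0 means "MLIR as emitted by codegen" and level N means "after the first N lowering passes"; that mapping, along with `kLoweringPasses`, `passesBeforeDump`, and `runsBefore`, is an assumption and not the driver's actual code.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Lowering order as listed in the text above.
inline const std::array<std::string, 7> kLoweringPasses = {
    "FuncToLLVM", "SCFToCF", "MathToLLVM", "ArithToLLVM",
    "MemRefToLLVM", "ControlFlowToLLVM", "ReconcileUnrealizedCasts"};

// Assumed meaning of the dump level: the passes that have already run
// by the time the IR is printed.
inline std::vector<std::string> passesBeforeDump(std::size_t level) {
    std::vector<std::string> applied;
    for (std::size_t i = 0; i < level && i < kLoweringPasses.size(); ++i)
        applied.push_back(kLoweringPasses[i]);
    return applied;
}

// Ordering constraint check: does `first` run strictly before `second`?
inline bool runsBefore(const std::string& first, const std::string& second) {
    auto a = std::find(kLoweringPasses.begin(), kLoweringPasses.end(), first);
    auto b = std::find(kLoweringPasses.begin(), kLoweringPasses.end(), second);
    return a != kLoweringPasses.end() && b != kLoweringPasses.end() && a < b;
}
```

Under this reading, a dialect mismatch that first shows up at level 2 points at the SCF-to-ControlFlow step, since that is the only pass added since level 1.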

Final IR is handed to the standard LLVM toolchain (`llc`, `clang` as the linker driver) and linked against a small C runtime (`libgazrt.so`) for stream I/O and runtime error reporting. This was the first compiler the team wrote that produced real executables.

Testing and CI discipline

2,101 integration tests, each a pair of `.in` (Gazprea source with a `// CHECK_FILE:` marker) and `.out` (expected stdout, or expected error message for compile-fail tests). Organised by feature: 664 Part 1 tests across ~23 categories (basics, conditionals, loops, functions, procedures, tuples, type-casting, scoping, qualifier-pass, math, error cases, and so on), 1,437 Part 2 tests across ~22 categories (arrays, vectors, matrices, generators, structs, iterator loops, forward declarations, type-promotion across compound types, stream state, globals), plus an opt-in icebox for not-yet-implemented features.

CI runs in GitHub Actions on a self-hosted container (`ghcr.io/cmput415/gaz-utils:latest`). It triggers on any push that touches `src/`, `tests/`, `grammar/`, `CMakeLists.txt`, or `cmake/`. The pipeline caches CMake artifacts (hashed against headers, sources, and CMake config), builds with `make -j$(nproc)`, and then runs the test suites under `dragon-runner memcheck` against `p1-config.json` and `p2-config.json`. Per test: `gazc input.gaz` emits LLVM IR, `llc` produces an object file, `clang` links against the runtime, the binary runs, and stdout is diffed against the expected output. The icebox suite runs with continue-on-error so partial work doesn't block the build.
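The final diff step of each test is simple enough to sketch. The real comparison is done by `dragon-runner`; the `Mismatch` struct and `diffOutput` function below are hypothetical and only show the idea of an exact comparison that reports the first differing line.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Where the actual output first departs from the expected output.
struct Mismatch {
    std::size_t line;  // 1-based line number
    std::string expected;
    std::string actual;
};

// Exact comparison first; on failure, walk both outputs line by line to
// find the first difference so the report is readable.
inline std::optional<Mismatch> diffOutput(const std::string& expected,
                                          const std::string& actual) {
    if (expected == actual) return std::nullopt;
    std::istringstream exp(expected), act(actual);
    std::string e, a;
    std::size_t line = 0;
    while (true) {
        bool hasE = static_cast<bool>(std::getline(exp, e));
        bool hasA = static_cast<bool>(std::getline(act, a));
        ++line;
        if (!hasE) e.clear();
        if (!hasA) a.clear();
        if (!hasE && !hasA) break;  // outputs differ only in trailing newline bytes
        if (hasE != hasA || e != a) return Mismatch{line, e, a};
    }
    return Mismatch{line, "", ""};
}
```

An exact match is deliberately strict: a stray trailing newline or extra space counts as a failure, which keeps the `.out` files authoritative.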

On a compiler, this volume earns its keep. Small changes anywhere in the type system or codegen ripple in surprising ways, and catching every regression on the next push is the only thing that keeps forward velocity from collapsing.

Deck

HSBI presentation: our Gazprea compiler

Roadmap

Where it's headed.

Road to Gazprea

CMPUT 415 builds up to Gazprea across four assignments. Each one layers new language machinery onto the previous — lexing, then scalars, then vectors, then the full compiler. Here's the progression.

  1. 01

    Generator

    Done

    A code generator warm-up. Walk an input file and emit target output — no real language semantics yet. Gets you comfortable with ANTLR4, the build system, and the shape of the pipeline before you have to reason about anything semantic.

  2. 02

    Scalc

    Done

    A scalar calculator. Introduces real parsing, an AST, a type system with integers and reals, and MLIR code generation for arithmetic and control flow. First time the whole pipeline — source → AST → IR → executable — actually runs end-to-end.

  3. 03

    Vcalc

    Done

    A vector calculator. Adds first-class vectors with dynamic sizing, filtering, generators, and range operators on top of Scalc. Forces you to confront descriptor-based memory layout and the promotion rules that will matter even more in Gazprea.

  4. 04

    Gazprea

    Done

    The full language: matrices, tuples, structs, intervals, type aliases, functions, procedures, the lot. Everything Scalc and Vcalc taught you, plus a real type system with inference, cycle-detecting alias resolution, and service-based codegen. This is the compiler that ranked #1 in the course.

Highlights

The things I'm proudest of.

  • Nine-pass pipeline: ASTBuilder → ScopeAnalyzer → RoutineRecorder → TypeResolver → QualifierEnforcer → TypeChecker → TypeCastValidator → FinalJudgmentVisitor → CodeGenVisitor. Built on a 342-line ANTLR4 grammar covering functions, procedures, prototypes, generators, streams, type aliases, intervals, structs, tuples, vectors, and 2D matrix literals with slicing and stride syntax.
  • Polymorphic TypeInfo hierarchy (BasicType, ArrayType, VectorType, TupleType, StructType, IntervalType, TypeAliasRef) with an ArrayBase abstraction that unifies indexing, slicing, concatenation, and shape queries across arrays, vectors, and strings. Multi-phase TypeResolver handles chained aliases with cycle detection, expression-type computation, and inference for omitted variable types.
  • MLIR codegen using six standard dialects (Arith, SCF, Func, MemRef, Math, ControlFlow), lowered through a seven-stage pipeline (FuncToLLVM → SCFToCF → MathToLLVM → ArithToLLVM → MemRefToLLVM → ControlFlowToLLVM → ReconcileUnrealizedCasts) to LLVM IR. The `-d 0..7` flag dumps MLIR at every stage for debugging.
  • Codegen split into nine modular services (BinaryOperationEmitter, LoopEmitter, CallEmitter, SubroutineEmitter, MatrixService, CastingGeneratorService, AggregateCodegenService, CompoundCodegenService, CodeGenUtils) instead of a monolithic visitor. Each owns one concern, composes with the others through a `BuilderScope` RAII helper, and is independently testable.
  • 2,101 integration tests (664 Part 1, 1,437 Part 2), plus ~50 opt-in icebox tests, organised by feature and run in GitHub Actions on a self-hosted container via `dragon-runner memcheck`. The pipeline compiles each `.gaz`, links against the C runtime (`libgazrt.so`), executes, and diffs stdout byte-for-byte against the `.out`. Finished #1 in the course and presented in an academic exchange with HSBI in Germany.