Introduction¶

What is tg2hdl?¶

tg2hdl is a compiler from tinygrad’s IR to synthesizable FPGA hardware. You describe a neural network in tinygrad; tg2hdl compiles it to an Amaranth HDL module that simulates cycle-accurately and can be synthesized to an FPGA.

The compiler operates on tinygrad’s linearized UOps — the same IR tinygrad uses to emit GPU kernels — and maps each op to hardware: memories, combinational arithmetic, and an FSM sequencer.

Components¶

Path	Role
`compiler/backend.py`	`HDLRenderer`, `compile_kernel`, `compile_model`, `compile_top_module`, `simulate_kernel`, `count_cycles_from_schedule`
`compiler/hdl_module.py`	`CompiledKernel` — three-pass Amaranth Elaboratable (KernelIR → hardware)
`compiler/ir.py`	Typed IR: `DType`, `IRConst/Counter/BufLoad/RegLoad/Op`, `IRBufStore/RegStore`, `LoopIR`, `BufferMeta`, `KernelIR`
`compiler/uop_to_ir.py`	`uop_to_ir()` — single-pass UOp list → `KernelIR` conversion
`compiler/lowering/arithmetic.py`	`ArithmeticLowering`, `create_counters()` — combinational signal emission
`compiler/lowering/control.py`	`build_control()` — FSM construction from typed loop tree
`compiler/fp32.py`	`FP32Add`, `FP32Mul`, `FP32Cmp` — IEEE 754 combinational hardware modules
`compiler/top_module.py`	`TopModule`, `simulate_top` — multi-kernel sequencer with copy FSM
`compiler/utils.py`	`pretty_print_uops` — UOp inspection helper
`benchmarks/harness.py`	`run_bench`, `BenchResult` — compare any tinygrad graph vs HDL simulation
`benchmarks/test_suite.py`	Correctness suite: Tier 1–3 (elementwise, GEMV, multi-kernel MLP)
`benchmarks/test_perf_suite.py`	Performance suite: 10 workloads, scalar → MNIST-scale
`utils/quantization.py`	`quantize_int8`, `dequantize` — user-level quantization helpers
`tests/test_compiler.py`	Compiler unit and simulation tests
`tests/test_top_module.py`	TopModule hardware simulation tests
`tests/test_fp32.py`	IEEE 754 FP32 unit and integration tests
`compare_inference.py`	End-to-end MNIST: CPU float32 vs compiler INT8

Workflow¶

tinygrad model
    │ .schedule()
    ▼
list[ExecItem]
    │ compile_top_module()  ← auto-detects inter-kernel connections
    ▼
TopModule + list[KernelSpec]     (Amaranth Elaboratables)
    │ simulate_kernel() per kernel   — or —   simulate_top()
    ▼
numpy outputs + cycle counts

Or via the benchmark harness (handles single- and multi-kernel automatically):

from benchmarks.harness import run_bench
result = run_bench("my_kernel", build_fn, input_arrays)
assert result.correct

Status¶

Capability	Status
Generic kernel compilation	✅
Scalar / elementwise / GEMV patterns	✅
Fused multi-op kernels (matmul + bias + relu)	✅
Multi-kernel hardware sequencing (`TopModule`)	✅ linear chains, skip connections, fan-out DAGs
Float32 — IEEE 754 hardware simulation	✅ `FP32Add`, `FP32Mul`, `FP32Cmp`
Float16 / BFloat16 arithmetic	❌ No dedicated units — compile error in practice
Multi-MAC parallelism (UNROLL)	Planned
FPGA synthesis	Planned

Supported ops¶

The compiler handles: ADD, MUL, CAST, CMPLT, WHERE, MAX, LOAD, STORE, RANGE, INDEX, DEFINE_GLOBAL, DEFINE_REG, CONST, AFTER.

All other UOps raise NotImplementedError at compile time (fail-loud policy).