API Reference

compiler

tinygrad UOps → Amaranth HDL compiler.

Public API:

HDLRenderer — tinygrad Renderer for sequential hardware
uops_to_kernel_ir — list[UOp] → (KernelIR, buf_infos)
compile_kernel — list[UOp] → KernelIR → CompiledKernel (Amaranth Elaboratable)
compile_model — tinygrad schedule → list[KernelSpec]
compile_top_module — tinygrad schedule → TopModule (auto-connects kernels)
simulate_kernel — run a CompiledKernel on the Amaranth simulator
simulate_top — run a TopModule on the Amaranth simulator
count_cycles_from_schedule — analytical cycle estimator (no simulation)

Quantization utilities are in the utils package, not here:

from utils import quantize_int8, quantize_int16, dequantize

compiler.backend

HDL backend for tinygrad: UOps → Amaranth hardware.

Provides HDLRenderer (tells tinygrad we’re a sequential device), compile_kernel (UOps → KernelIR → CompiledKernel), compile_model (schedule → list of KernelSpecs), compile_top_module (schedule → TopModule), and simulate_kernel (run on the Amaranth simulator).

class compiler.backend.BufferInfo(idx: int, depth: int, elem_width: int, is_signed: bool, is_output: bool)

Bases: object

Descriptor for a DEFINE_GLOBAL buffer.

depth: int
elem_width: int
idx: int
is_output: bool
is_signed: bool
class compiler.backend.HDLRenderer

Bases: Renderer

Tells tinygrad to generate sequential UOps (no GPU features).

device = 'HDL'
global_max = None
has_local = False
has_shared = False
local_max = None
render(uops)
supports_float4 = False
class compiler.backend.KernelSpec(kernel: CompiledKernel, uops: list, buf_infos: list, buf_map: dict)

Bases: object

Info about a compiled kernel and its buffer mapping.

buf_infos: list
buf_map: dict
kernel: CompiledKernel
uops: list
compiler.backend.analyze_buffers(uops)

Extract BufferInfo for each DEFINE_GLOBAL in the UOp list.

compiler.backend.compile_kernel(uops, *, unroll_factor=1, reduce_unroll_factor=1)

Compile a linearized UOp list into a CompiledKernel (Amaranth Elaboratable).

Pipeline: UOps → KernelIR → (optional unroll) → CompiledKernel (Amaranth Elaboratable).

Parameters:
  • uops (list[UOp]) – Linearized UOps from tinygrad (via _get_uops).

  • unroll_factor (int) – LOOP-axis unroll factor (default 1 = no unrolling).

  • reduce_unroll_factor (int) – REDUCE-axis unroll factor (default 1 = no unrolling).

Returns:

Amaranth Elaboratable ready for simulation or synthesis.

Return type:

CompiledKernel

compiler.backend.compile_model(schedule, *, unroll_factor=1, reduce_unroll_factor=1)

Compile a tinygrad schedule into a list of KernelSpecs.

Parameters:
  • schedule (list[ExecItem]) – From Tensor.schedule().

  • unroll_factor (int) – LOOP-axis unroll factor applied to every kernel (default 1).

  • reduce_unroll_factor (int) – REDUCE-axis unroll factor applied to every kernel (default 1).

Returns:

One KernelSpec per compute kernel in the schedule.

Return type:

list[KernelSpec]

compiler.backend.compile_top_module(schedule, *, unroll_factor=1, reduce_unroll_factor=1)

Compile a tinygrad schedule into a TopModule.

Detects inter-kernel buffer connections automatically by checking Buffer object identity: if si_prev.bufs[0] is the same Python object as si_curr.bufs[j] for j ≥ 1, the two kernels are connected.

Parameters:
  • schedule (list[ExecItem]) – From Tensor.schedule().

  • unroll_factor (int) – LOOP-axis unroll factor applied to every kernel (default 1).

  • reduce_unroll_factor (int) – REDUCE-axis unroll factor applied to every kernel (default 1).

Returns:

  • top (TopModule) – Assembled top module with all kernels and copy FSM.

  • connections (list[tuple[int,int,int,int]]) – Detected connections as (src_k, src_buf, dst_k, dst_buf).

  • kernel_specs (list[KernelSpec]) – One KernelSpec per compute kernel (same order as in TopModule).
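The identity check described above can be sketched in plain Python. Buffer below is a hypothetical stand-in for tinygrad’s Buffer class, since only object identity matters for connection detection:

```python
# Hypothetical stand-in for tinygrad's Buffer; only object identity matters here.
class Buffer:
    pass

shared = Buffer()
k0_bufs = [shared, Buffer()]        # kernel 0: bufs[0] is its output buffer
k1_bufs = [Buffer(), shared]        # kernel 1: reads kernel 0's output as bufs[1]

connections = []
for j in range(1, len(k1_bufs)):
    if k0_bufs[0] is k1_bufs[j]:    # same Python object -> kernels are connected
        connections.append((0, 0, 1, j))   # (src_k, src_buf, dst_k, dst_buf)

print(connections)   # [(0, 0, 1, 1)]
```

A fresh Buffer with equal contents would not match: the check deliberately uses `is`, not `==`.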

compiler.backend.count_cycles_from_schedule(schedule)

Analytically count total FSM cycles for all kernels in schedule.

Returns the sum of per-kernel cycle counts using the same model as the Amaranth FSM without running any simulation. Useful for float models where Amaranth simulation is not bit-accurate.

Parameters:

schedule (list[ExecItem]) – From Tensor.schedule().

Returns:

Total cycle count across all compute kernels.

Return type:

int

compiler.backend.simulate_kernel(kernel, input_data, clock_period=1e-08)

Simulate a single CompiledKernel.

Parameters:
  • kernel (CompiledKernel) – The compiled kernel module.

  • input_data (dict[int, np.ndarray]) – Maps buffer index → numpy array of data to load. Buffer 0 (output) is not loaded.

  • clock_period (float) – Simulation clock period in seconds (default 10ns = 100MHz).

Returns:

  • output (np.ndarray) – Output buffer contents after computation.

  • cycles (int) – Number of compute cycles.

  • wall_time (float) – Wall-clock seconds for simulation.

compiler.backend.uops_to_kernel_ir(uops)

Convert linearized UOps to typed KernelIR.

This is the first stage of the compiler pipeline, usable standalone for IR inspection without building any Amaranth hardware.

Parameters:

uops (list[UOp]) – Linearized UOps from tinygrad (via _get_uops).

Returns:

  • kernel_ir (KernelIR) – Typed intermediate representation.

  • buf_infos (list[dict]) – Buffer descriptors needed by CompiledKernel.

compiler.hdl_module

Generated Amaranth module from KernelIR.

CompiledKernel takes a fully-built KernelIR and lowers it to an FSM-based Amaranth module: memories for buffers, counters for loops, combinational datapath, and FSM for sequencing.

Three-pass architecture:

Pass 0: Create memories (one per DEFINE_GLOBAL buffer)
Pass 1: Create counters + arithmetic datapath via ArithmeticLowering
Pass 2: Wire default write ports + build FSM via build_control()

class compiler.hdl_module.CompiledKernel(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Hardware module generated from KernelIR.

Parameters:
  • kernel_ir (KernelIR) – Typed kernel IR produced by uop_to_ir().

  • buf_infos (list[dict]) – Buffer descriptors: {idx, depth, elem_width, is_signed, is_output}.

elaborate(platform)

compiler.ir

Typed Kernel IR for the tg2hdl compiler.

Defines the intermediate representation that sits between tinygrad UOps and Amaranth RTL generation. All dtype information is explicit and self-contained so that downstream lowering passes do not need to import tinygrad.

Types

DType — scalar data type enum (INT8, INT32, FP32, …)
IRConst — compile-time constant value
IRCounter — loop induction variable
IRBufLoad — value read from a global buffer
IRRegLoad — value read from the accumulator register
IROp — arithmetic/logical operation result
IRValue — union of the above (type alias)
IRBufStore — write a value to a global buffer (side effect)
IRRegStore — write a value to the accumulator register (side effect)
LoopIR — one level of a loop nest, with prologue/epilogue stores
BufferMeta — descriptor for a DEFINE_GLOBAL buffer
KernelIR — top-level typed kernel representation

class compiler.ir.BufferMeta(idx: int, depth: int, dtype: DType, is_output: bool)

Bases: object

Typed descriptor for a DEFINE_GLOBAL buffer.

depth: int
dtype: DType
idx: int
is_output: bool
class compiler.ir.DType(*values)

Bases: Enum

Scalar data types supported by the compiler.

Each member stores (name, bit_width, is_signed, is_float).

BF16 = ('bf16', 16, False, True)
FP16 = ('fp16', 16, False, True)
FP32 = ('fp32', 32, False, True)
INT16 = ('int16', 16, True, False)
INT32 = ('int32', 32, True, False)
INT8 = ('int8', 8, True, False)
UINT16 = ('uint16', 16, False, False)
UINT32 = ('uint32', 32, False, False)
UINT8 = ('uint8', 8, False, False)
amaranth_shape()

Return an Amaranth Shape (signed/unsigned) for this dtype.

const_to_bits(value) → int

Convert a Python numeric constant to an integer bit pattern.

For float dtypes, produces the IEEE 754 bit pattern. For integer dtypes, masks to the appropriate width.
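The documented behaviour can be illustrated with standard-library packing (a sketch of the semantics, not the method itself):

```python
import struct

def fp32_bits(value: float) -> int:
    # IEEE 754 single-precision bit pattern, as described for float dtypes
    return struct.unpack(">I", struct.pack(">f", value))[0]

def int_bits(value: int, width: int) -> int:
    # mask to the dtype width (two's-complement wrap), as for integer dtypes
    return value & ((1 << width) - 1)

print(hex(fp32_bits(1.0)))   # 0x3f800000
print(hex(int_bits(-1, 8)))  # 0xff
```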

classmethod from_tinygrad(tg_dtype) → DType

Map a tinygrad dtype to DType.

Raises ValueError for unsupported dtypes (fail-loud policy).

classmethod from_width(bit_width: int, is_signed: bool) → DType

Reconstruct DType from (bit_width, is_signed) — for BufferInfo compat.

class compiler.ir.IRBufLoad(buf_idx: int, addr: object)

Bases: object

A value loaded from a global buffer at a given address.

addr: object
buf_idx: int
class compiler.ir.IRBufStore(buf_idx: int, addr: object, value: object, dtype: DType)

Bases: object

Write a value to a global buffer at a given address.

addr: object
buf_idx: int
dtype: DType
value: object
class compiler.ir.IRConst(value: object, dtype: DType)

Bases: object

A compile-time constant scalar.

dtype: DType
value: object
class compiler.ir.IRCounter(bound: int, depth: int)

Bases: object

A loop induction variable (the RANGE UOp).

bound: int
depth: int
class compiler.ir.IROp(op: str, dtype: DType, srcs: tuple)

Bases: object

An arithmetic or logical operation on one or more source IRValues.

dtype: DType
op: str
srcs: tuple
class compiler.ir.IRRegLoad(dtype: DType)

Bases: object

A value read from the accumulator register.

dtype: DType
class compiler.ir.IRRegStore(value: object)

Bases: object

Write a value into the accumulator register.

value: object
class compiler.ir.KernelIR(buffers: list, acc_dtype: DType | None, loop_tree: LoopIR, scalar_stores: list = <factory>)

Bases: object

Typed intermediate representation for one compiled kernel.

acc_dtype: DType | None
buffers: list
format(kir: KernelIR, *, show_buffers: bool = True) → str

Tinygrad-ish dump of KernelIR.

Produces something visually close to tinygrad’s UOp table:
  • numbered statements

  • %value ids for loads/ops/counters

  • explicit RANGE/END

  • STORE lines in the loop prologue/epilogue order

loop_tree: LoopIR
pretty() → str
scalar_stores: list
class compiler.ir.LoopIR(axis_type: object, bound: int, depth: int, prologue: list = <factory>, body: LoopIR | None = None, epilogue: list = <factory>)

Bases: object

One level of the loop tree.

The root has axis_type=None, bound=0, depth=-1; its prologue/epilogue hold scalar-kernel stores. Each nested level represents one RANGE/END pair.

axis_type: object
body: LoopIR | None = None
bound: int
depth: int
epilogue: list
prologue: list

compiler.uop_to_ir

UOp → KernelIR conversion.

Converts a linearized tinygrad UOp list into a typed KernelIR, merging the roles of the old _parse_loop_structure and the value-analysis portion of _build_datapath into one sequential pass.

No Amaranth imports — this module is purely structural/analytical.

compiler.uop_to_ir.uop_to_ir(uops, buf_metas: list) → KernelIR

Convert a linearized UOp list to a typed KernelIR.

Parameters:
  • uops (list[UOp]) – Linearized UOps from tinygrad (via linearize()).

  • buf_metas (list[BufferMeta]) – Buffer descriptors pre-built from DEFINE_GLOBAL analysis.

Return type:

KernelIR

compiler.lowering.arithmetic

Arithmetic lowering: KernelIR → Amaranth Signals (combinational datapath).

Converts typed IRValue nodes to Amaranth Signal objects, emitting combinational assignments into the provided Module. Float operations dispatch to the IEEE 754 FP32 hardware modules in compiler/fp32.py.

No FSM logic — that is the responsibility of control.py.

class compiler.lowering.arithmetic.ArithResult(signals: dict = <factory>, acc: Any = None, counter_sigs: dict = <factory>)

Bases: object

Holds the Amaranth Signal (or Const) for each IRValue in the kernel.

Keyed by id(IRValue). A few special entries:
  • result.acc : the accumulator Signal (or None)

  • result.counter_sigs[depth] : Signal for loop counter at depth d

acc: Any = None
counter_sigs: dict
signals: dict
class compiler.lowering.arithmetic.ArithmeticLowering(kernel: KernelIR, m: Module, int_rports: dict, counter_sigs: dict, acc)

Bases: object

Convert typed IRValues to Amaranth combinational logic.

Usage

counter_sigs = create_counters(kernel_ir, m)
acc = Signal(kernel_ir.acc_dtype.amaranth_shape(), name="acc") if kernel_ir.acc_dtype else None
lowering = ArithmeticLowering(kernel_ir, m, int_rports, counter_sigs, acc)
result = lowering.run()

run() → ArithResult

Walk all IRValues referenced in the kernel and emit combinational logic.

Traversal order: collect all IRValues reachable from stores in the loop tree, then emit in topological order (leaves first).
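The traversal order can be sketched as a post-order DFS; Op below is a hypothetical stand-in for IROp, not the real class:

```python
from dataclasses import dataclass

@dataclass(eq=False)
class Op:
    name: str
    srcs: tuple = ()   # leaves (loads, consts, counters) have no sources

def emit_order(roots):
    # post-order DFS: every value is emitted after all of its sources,
    # so leaves come first, matching the documented topological order
    order, seen = [], set()
    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for s in v.srcs:
            visit(s)
        order.append(v)
    for r in roots:
        visit(r)
    return order

a, b = Op("load_a"), Op("load_b")
mac = Op("mul_add", (a, b))
names = [v.name for v in emit_order([mac])]
print(names)   # ['load_a', 'load_b', 'mul_add']
```

Keying by id() mirrors ArithResult, which is also keyed by id(IRValue).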

compiler.lowering.arithmetic.create_counters(kernel: KernelIR, m: Module) → dict

Walk the LoopIR tree and create an Amaranth Signal for each loop level.

Returns a dict mapping loop depth (int) → Signal.

compiler.lowering.control

Control lowering: KernelIR + ArithResult → Amaranth FSM.

Builds the FSM that sequences stores (register updates and memory writes) according to the loop structure in KernelIR. Mirrors the behavior of the old CompiledKernel._build_fsm() and _build_scalar_fsm() exactly, but accepts typed IR nodes instead of raw UOps and an untyped sig dict.

State naming (unchanged from old design):

IDLE — wait for start
L{d}_PRO — non-innermost level prologue (e.g. acc reset)
L{d}_BODY — innermost level body (e.g. MAC)
L{d}_EPI — non-innermost level epilogue (e.g. output write)
SCALAR — single compute cycle for no-loop kernels

compiler.lowering.control.build_control(m: Module, kernel: KernelIR, result: ArithResult, int_wports: dict, start: Signal, done: Signal, busy: Signal) → None

Build the FSM into Module m.

Parameters:
  • m (Module) – The Amaranth module under construction.

  • kernel (KernelIR) – Typed kernel IR (loop tree + stores).

  • result (ArithResult) – Signals produced by ArithmeticLowering.run().

  • int_wports (dict) – Maps buf_idx → Amaranth write port (from _create_memories).

  • start (Signal) – External start control signal on the CompiledKernel.

  • done (Signal) – External done control signal on the CompiledKernel.

  • busy (Signal) – External busy control signal on the CompiledKernel.

compiler.fp32

IEEE 754 float32 combinational arithmetic modules for Amaranth HDL.

Each module has single-cycle combinational latency and is synthesizable to FPGA:

FP32Add — float32 + float32 → float32
FP32Mul — float32 × float32 → float32
FP32Cmp — float32 < float32 → 1-bit (for CMPLT / relu)
FP32Exp2 — 2^x → float32
FP32Log2 — log2(x) → float32
FP32Reciprocal — 1/x → float32
FP32Sqrt — sqrt(x) → float32
FP32FDiv — a/b → float32

Supported:
  • Normal numbers (IEEE 754 biased-exponent format)

  • Signed zero: +0.0 and -0.0 handled consistently

  • Infinities: passed through / generated on overflow

  • NaN: detected and passed through (quiet NaN)

Limitations:
  • Subnormal (denormal) numbers are flushed to zero on input.

  • Rounding mode: truncation (round toward zero).

  • float16 / bfloat16: not supported by these modules (use raw bit-pattern semantics from _build_datapath for those widths).

class compiler.fp32.FP32Add(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Combinational IEEE 754 float32 adder.

Ports

a, b : in Signal(unsigned(32)) — IEEE 754 float32 bit patterns
result : out Signal(unsigned(32)) — float32 bit pattern of a + b

elaborate(platform)
class compiler.fp32.FP32Cmp(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Combinational float32 less-than: result = 1 iff a < b.

Ports

a, b : in Signal(unsigned(32)) — IEEE 754 float32 bit patterns
result : out Signal() — 1 if a < b, 0 otherwise

elaborate(platform)
class compiler.fp32.FP32Exp2(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Combinational 2^x for IEEE 754 float32.

Algorithm

  1. Convert x to signed Q8.23 fixed-point (covers ±256 with 2^-23 resolution).

  2. Split into integer part n = x[23:32] and fraction f_fp = x[0:23].

  3. Evaluate 2^f - 1 via 5th-order Horner polynomial in Q0.23 fixed-point.

  4. Pack result: sign=0, exp=(n+127), mant=poly.

Accuracy: ≤2 ULP for normal inputs; subnormals flush to zero.

elaborate(platform)
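A software model of the Q8.23 split in steps 1–2 (illustrative assumption: the polynomial of step 3 is replaced by an exact 2**f for clarity, so this shows the decomposition only, not the RTL):

```python
import math

def exp2_q8_23(x: float) -> float:
    # 1. convert x to signed Q8.23 fixed point
    fx = int(round(x * (1 << 23)))
    n = fx >> 23                             # integer part (floor via arithmetic shift)
    f = (fx & ((1 << 23) - 1)) / (1 << 23)   # fractional part in [0, 1)
    # 2^x = 2^f * 2^n; the hardware approximates 2^f - 1 by a polynomial
    return math.ldexp(2.0 ** f, n)

print(exp2_q8_23(3.5))   # ~ 11.3137
```

The arithmetic shift makes the floor/fraction split work for negative x as well.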
class compiler.fp32.FP32FDiv(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Combinational a/b for IEEE 754 float32.

Computes a × (1/b) using FP32Reciprocal and FP32Mul.

elaborate(platform)
class compiler.fp32.FP32Log2(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Combinational log2(x) for IEEE 754 float32.

Algorithm

log2(x) = (e - 127) + log2(1.m), where x = 2^(e-127) × 1.m.
log2(1 + f) for f ∈ [0,1) is approximated by a 5th-order Horner polynomial using the substitution f = m / 2^23.

Special cases: x≤0 → NaN, x=+inf → +inf, x=0 → -inf. Accuracy: ≤2 ULP for normal positive inputs.

elaborate(platform)
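The decomposition can be modelled in software (illustrative only: the hardware’s polynomial is replaced here by math.log2):

```python
import math
import struct

def log2_decompose(x: float) -> float:
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    e = (bits >> 23) & 0xFF             # biased exponent
    f = (bits & 0x7FFFFF) / (1 << 23)   # mantissa fraction: 1.m = 1 + f
    return (e - 127) + math.log2(1.0 + f)

print(log2_decompose(10.0))   # ~ 3.3219
```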
class compiler.fp32.FP32Mul(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Combinational IEEE 754 float32 multiplier.

Ports

a, b : in Signal(unsigned(32)) — IEEE 754 float32 bit patterns
result : out Signal(unsigned(32)) — float32 bit pattern of a * b

elaborate(platform)
class compiler.fp32.FP32Reciprocal(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Combinational 1/x for IEEE 754 float32.

Uses 2 Newton-Raphson iterations: y_{n+1} = y_n × (2 - x × y_n).
Initial guess: negate the biased exponent (gives a 1-ULP seed for powers of 2).

Accuracy: ≤2 ULP for normal inputs.

elaborate(platform)
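The iteration can be modelled in software. The seed below implements “negate the biased exponent” literally (exact for powers of two); how the RTL handles the mantissa in the seed is an assumption, so accuracy away from powers of two will differ:

```python
import struct

def _bits(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]

def _float(b: int) -> float:
    return struct.unpack(">f", struct.pack(">I", b))[0]

def recip_nr(x: float) -> float:
    b = _bits(x)
    # seed: keep the sign, negate the biased exponent (0xFE - e), zero mantissa
    seed = (b & 0x80000000) | ((0xFE - ((b >> 23) & 0xFF)) << 23)
    y = _float(seed)
    for _ in range(2):              # two Newton-Raphson refinements
        y = y * (2.0 - x * y)
    return y

print(recip_nr(4.0))   # 0.25
```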
class compiler.fp32.FP32Sqrt(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Combinational sqrt(x) for IEEE 754 float32.

Uses the identity: sqrt(x) = x × rsqrt(x) where rsqrt is approximated by 2 Newton-Raphson iterations:

y_{n+1} = y_n × (1.5 - 0.5×x×y_n^2)

Seed: classic Quake III fast inverse sqrt initial guess. Accuracy: ≤2 ULP for normal positive inputs.

elaborate(platform)
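The identity and seed can be modelled in software (illustrative only; the hardware uses combinational fixed-point datapaths rather than Python floats):

```python
import struct

def rsqrt_quake(x: float) -> float:
    # classic Quake III initial guess, then two Newton-Raphson refinements
    b = struct.unpack(">I", struct.pack(">f", x))[0]
    y = struct.unpack(">f", struct.pack(">I", 0x5F3759DF - (b >> 1)))[0]
    for _ in range(2):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

def sqrt_via_rsqrt(x: float) -> float:
    return x * rsqrt_quake(x)       # sqrt(x) = x * rsqrt(x)

print(sqrt_via_rsqrt(9.0))   # ~ 3.0
```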

compiler.top_module

TopModule: sequences compiled kernels with dependency-driven copy states.

TopModule wires multiple CompiledKernel instances into a single Amaranth Elaboratable. A copy FSM transfers a producer kernel’s output buffer into the input buffers of whichever later kernels depend on it.

Ports

start : Signal, in
  Pulse high for one cycle to begin execution.

done : Signal, out
  Pulses high for one cycle when all kernels have finished.

ext_write_ports : dict[(k_idx, buf_idx), {wen, waddr, wdata}]
  External write ports for all non-intermediate input buffers.

output_rport : {raddr, rdata}
  Read port for the final kernel’s output buffer.

FSM sequence (example)

IDLE → K0_RUN → K0_WAIT → COPY_K0_G0 → K1_RUN → K1_WAIT → K2_RUN → …

class compiler.top_module.TopModule(*args, src_loc_at=0, **kwargs)

Bases: Elaboratable

Hardware sequencer for N compiled kernels.

Parameters:
  • kernels (list[CompiledKernel]) – Kernels to run in order.

  • connections (list[tuple[int, int, int, int]]) – Each entry is (src_k, src_buf, dst_k, dst_buf): the source buffer of kernel src_k is DMA-copied into input buffer dst_buf of kernel dst_k after src_k completes.

  • buf_depths (dict[tuple[int,int], int]) – Maps (k_idx, buf_idx) → element count for each buffer involved in a copy (used to know when the copy counter overflows).

elaborate(platform)
compiler.top_module.simulate_top(top, input_data, clock_period=1e-08)

Simulate a TopModule end-to-end.

Parameters:
  • top (TopModule) – The assembled top module.

  • input_data (dict[tuple[int,int], np.ndarray]) – Maps (k_idx, buf_idx) → numpy array for each external input buffer.

  • clock_period (float) – Simulation clock period in seconds.

Returns:

  • output (np.ndarray) – Contents of the final kernel’s output buffer (int32).

  • cycle_counts (dict) –

    Breakdown of clock cycles:
    • “load”: cycles spent writing input data into BRAM

    • “compute”: cycles from start-pulse to done-pulse

    • “readback”: cycles spent reading output data from BRAM

    • “total”: sum of all three

    • “states”: cycles spent in each top-level FSM state during compute

  • wall_s (float) – Wall-clock seconds for the simulation.

compiler.utils

Utilities for inspecting tinygrad UOps and generated hardware.

compiler.utils.format_uops(uops: List[UOp], full_width: bool = False) → str

Format a linearized UOp list as a readable table.

Each row: index, op name, dtype, arg, sources (by index). If full_width is True, columns expand to fit content without truncation.

compiler.utils.pretty_print_uops(uops: List[UOp], full_width: bool = False) → None

Print a linearized UOp list in a readable table format.

compiler.utils.show_hardware(kernel, out_dir: str, *, stage: str = 'opt', fmt: str = 'svg') → str | None

Generate a Yosys schematic of a CompiledKernel.

Parameters:
  • kernel (CompiledKernel) – The compiled kernel module.

  • out_dir (str) – Directory to write the output file.

  • stage (str) –

    Level of Yosys lowering before drawing:

    “rtl” — raw RTLIL structure (closest to source)
    “opt” — after proc + opt + clean (recommended default)
    “generic” — after technology mapping to generic cells
    “mapped” — after full ECP5 synthesis (very detailed)

  • fmt (str) – Output format: “svg” (default), “pdf”, or “png”.

Returns:

Path to the generated file, or None if yosys/dot not available.

Return type:

str or None

compiler.utils.synthesis_stats(kernel, device: str = '45k', package: str = 'CABGA381')

Run Yosys + nextpnr-ecp5 and return resource/timing data.

Returns a dict with:

mem_bits – total on-chip storage in bits (from buf_infos)
fp32_units – FP32 submodule count (from RTLIL)
fmax_mhz – achieved Fmax in MHz (float), or None if unavailable
comb – TRELLIS_COMB cells used (LUT equivalent)
ff – TRELLIS_FF flip-flops used
dp16kd – DP16KD block RAM tiles used
mult18 – MULT18X18D DSP multiplier tiles used
from_synth – True when Yosys+nextpnr ran successfully

Falls back gracefully when Yosys or nextpnr-ecp5 is not on PATH.