Verification

Amaranth cycle-accurate simulation against NumPy ground truth. Integer results are compared at the bit level — the numpy references model INT8 truncation at every intermediate step to match UOp semantics exactly. Float32 results are compared with rtol=1e-5 against tinygrad CPU reference.

Test suite

uv run pytest tests/ benchmarks/ -k "not slow" -v   # ~160 tests, ~10 s
uv run pytest tests/ benchmarks/ -v                  # all tests incl. slow

FP32 unit tests (tests/test_fp32.py) — 38 tests

Class

What it validates

TestFP32Add

Positive/negative/mixed add, cancellation, zero, large exponent diff, infinity, 5 parametrized cases

TestFP32Mul

All sign combinations, mul-by-zero, mul-by-one, infinity, 6 parametrized cases

TestFP32Cmp

Less-than for positive, negative, mixed sign, ±0, 5 parametrized cases

Integration

test_fp32_relu_harness, test_fp32_add_harness — float32 through full run_bench

Compiler — structure (tests/test_compiler.py)

Test

Validates

test_renderer_produces_clean_uops

No GPU-specific ops in UOp output

test_renderer_attributes

has_local=False, has_shared=False, etc.

test_matmul_buffers

Buffer depth/width/index analysis

test_scalar_returns_root_no_body

uop_to_ir handles no-RANGE kernels

test_elementwise_one_loop

IR: single LOOP level, prologue stores present

test_gemv_two_levels

IR: LOOP + REDUCE nesting, correct bounds

test_gemv_prologue_epilogue_split

IRRegStore in prologue, IRBufStore in epilogue

test_mnist_kernel_shapes (slow)

MNIST schedule produces 2 kernels with correct buffer shapes

Compiler — simulation (tests/test_compiler.py)

Test

Validates

test_compile_small_matmul

Kernel object has expected ports

test_simulate_identity_matmul

3×3 identity: output == input

test_simulate_small_matmul_4x3

4×3 matmul matches numpy with INT8 truncation

test_simulate_matmul_with_bias_relu

Fused matmul + bias + ReLU

test_cycle_count

M×(K+2) cycle model

test_two_layer_mlp_simulation

Two kernels chained, both outputs match numpy

test_two_layer_mlp_prediction

Larger random MLP, argmax matches

Elementwise fusion (tests/test_relu.py, tests/test_combined.py)

Test

Validates

test_relu_*

relu over all-positive, all-negative, mixed int32

test_relu_add_bias_* (4 tests)

relu(a+b+const), N-cycle throughput

TopModule (tests/test_top_module.py) — 11 tests

Test

Validates

test_connections_detected

Auto-detection of buffer identity connections

test_ext_write_ports_exposed

Non-connected inputs exposed correctly

test_output_rport_wired

Final kernel output accessible

test_simulate_top_*

End-to-end TopModule simulation with 2-layer MLP

test_manual_non_adjacent_dependency

Skip connection: K0 output copied to K2 (not K1)

test_manual_fanout_dependency

Fan-out: K0 output broadcast to both K1 and K2 in one copy pass

Benchmark suites

# Correctness suite — 8 tests: Tier 1 elementwise, Tier 2 GEMV, Tier 3 MLP
uv run pytest benchmarks/test_suite.py -v

# Performance suite — 10 workloads with cycle-count validation
uv run pytest benchmarks/test_perf_suite.py -v -s -k "not slow"

Performance suite workloads:

Test

Shape

Expected cycles

scalar add

(1,)

≤ 5

elementwise relu

N=32

~33

elementwise add+relu

N=128

~129

tiny GEMV int8

(1,4)@(4,8)

49

small GEMV+bias int8

(1,8)@(8,16)

161

linear+bias+relu int8

(1,8)@(8,16)

161

2-layer MLP small

(1,4)→(1,4)→(1,2)

~36

2-layer MLP medium

(1,16)→(1,16)→(1,8)

~432

relu fp32

N=16

16

add fp32

N=32

32

gemv fp32

(1,4)@(4,8)

48

MNIST layer 1 (slow)

(1,784)@(784,128)

100,609