# Verification

Correctness is verified by Amaranth cycle-accurate simulation against a NumPy ground truth. Integer results are compared at the bit level: the NumPy references model INT8 truncation at every intermediate step to match the UOp semantics exactly. Float32 results are compared with `rtol=1e-5` against the tinygrad CPU reference.
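For illustration, here is a minimal sketch of what such a bit-exact INT8 reference can look like (the helper names are hypothetical; the real references live in the test files):

```python
import numpy as np

def wrap8(x):
    """Truncate an integer to the INT8 two's-complement range, as an 8-bit ALU would."""
    return ((int(x) + 128) % 256) - 128

def matmul_int8_ref(a, b):
    """Matmul ground truth that wraps to INT8 after every multiply and every
    accumulate, so overflow behaviour matches the hardware bit-for-bit."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=np.int8)
    for i in range(m):
        for j in range(n):
            acc = 0
            for l in range(k):
                acc = wrap8(acc + wrap8(int(a[i, l]) * int(b[l, j])))
            out[i, j] = acc
    return out

# 100 * 2 = 200 overflows INT8 and wraps to -56; a plain int32 numpy matmul
# would report 200 and diverge from the hardware.
print(matmul_int8_ref(np.array([[100]]), np.array([[2]]))[0, 0])  # -56
```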
## Test suite

```shell
uv run pytest tests/ benchmarks/ -k "not slow" -v   # ~160 tests, ~10 s
uv run pytest tests/ benchmarks/ -v                 # all tests, incl. slow
```
## FP32 unit tests (`tests/test_fp32.py`) — 38 tests

- Add: positive/negative/mixed add, cancellation, zero, large exponent difference, infinity, 5 parametrized cases
- Mul: all sign combinations, mul-by-zero, mul-by-one, infinity, 6 parametrized cases
- Less-than: positive, negative, mixed sign, ±0, 5 parametrized cases
- Integration
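The float32 comparison reduces to a relative-tolerance check. A minimal sketch (the variable names are illustrative, not from the test suite):

```python
import numpy as np

# dut_out: values read back from the simulated FP32 datapath
# ref_out: tinygrad CPU reference values
dut_out = np.array([0.1 + 0.2, 1.0 / 3.0], dtype=np.float32)
ref_out = np.array([0.3, 0.33333334], dtype=np.float32)

# raises AssertionError if any element differs by more than rtol * |ref|
np.testing.assert_allclose(dut_out, ref_out, rtol=1e-5)
```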
## Compiler — structure (`tests/test_compiler.py`)

- No GPU-specific ops in UOp output
- Buffer depth/width/index analysis
- IR: single LOOP level, prologue stores present
- IR: LOOP + REDUCE nesting, correct bounds
- MNIST schedule produces 2 kernels with correct buffer shapes
## Compiler — simulation (`tests/test_compiler.py`)

- Kernel object has expected ports
- 3×3 identity: output == input
- 4×3 matmul matches NumPy with INT8 truncation
- Fused matmul + bias + ReLU
- M×(K+2) cycle model
- Two kernels chained, both outputs match NumPy
- Larger random MLP, argmax matches
## Elementwise fusion (`tests/test_relu.py`, `tests/test_combined.py`)

- relu over all-positive, all-negative, and mixed int32 inputs
## TopModule (`tests/test_top_module.py`) — 11 tests

- Auto-detection of buffer identity connections
- Non-connected inputs exposed correctly
- Final kernel output accessible
- End-to-end TopModule simulation with a 2-layer MLP
- Skip connection: K0 output copied to K2 (not K1)
- Fan-out: K0 output broadcast to both K1 and K2 in one copy pass
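The buffer-identity auto-detection these tests exercise can be sketched as follows (a hypothetical data layout; the real TopModule operates on Amaranth elaboratables, not dicts):

```python
def detect_connections(kernels):
    """Return (producer, consumer) kernel index pairs. An input counts as
    connected when it is the very same buffer object an earlier kernel wrote
    (identity, not equality)."""
    conns = []
    for i, kern in enumerate(kernels):
        for inp in kern["inputs"]:
            for j in range(i):
                if inp is kernels[j]["output"]:
                    conns.append((j, i))
    return conns

# Fan-out case from the tests: K0's output feeds both K1 and K2
buf0, buf1, buf2, w0 = (object() for _ in range(4))
kernels = [
    {"inputs": [w0], "output": buf0},    # K0
    {"inputs": [buf0], "output": buf1},  # K1 reads K0's buffer
    {"inputs": [buf0], "output": buf2},  # K2 reads K0's buffer
]
print(detect_connections(kernels))  # [(0, 1), (0, 2)]
```

Inputs that match no earlier output (like `w0` above) are the ones a TopModule must expose as external ports.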
## Benchmark suites

```shell
# Correctness suite — 8 tests: Tier 1 elementwise, Tier 2 GEMV, Tier 3 MLP
uv run pytest benchmarks/test_suite.py -v

# Performance suite — 10 workloads with cycle-count validation
uv run pytest benchmarks/test_perf_suite.py -v -s -k "not slow"
```
Performance suite workloads:

| Test | Shape | Expected cycles |
|---|---|---|
| scalar add | (1,) | ≤ 5 |
| elementwise relu | N=32 | ~33 |
| elementwise add+relu | N=128 | ~129 |
| tiny GEMV int8 | (1,4)@(4,8) | 49 |
| small GEMV+bias int8 | (1,8)@(8,16) | 161 |
| linear+bias+relu int8 | (1,8)@(8,16) | 161 |
| 2-layer MLP small | (1,4)→(1,4)→(1,2) | ~36 |
| 2-layer MLP medium | (1,16)→(1,16)→(1,8) | ~432 |
| relu fp32 | N=16 | 16 |
| add fp32 | N=32 | 32 |
| gemv fp32 | (1,4)@(4,8) | 48 |
| MNIST layer 1 (slow) | (1,784)@(784,128) | 100,609 |
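The GEMV rows are consistent with the M×(K+2) cycle model named above, read as: M output elements, each taking K accumulate cycles plus 2 cycles of per-element overhead (the int8 rows measure one extra cycle, presumably startup; this reading is an assumption, not stated in the source). A quick sanity check:

```python
def gemv_cycles(m_out, k):
    # M*(K+2): K MAC cycles plus 2 overhead cycles per output element (assumed reading)
    return m_out * (k + 2)

print(gemv_cycles(8, 4))      # 48     -> table: 48 (fp32), 49 (int8)
print(gemv_cycles(16, 8))     # 160    -> table: 161 (int8)
print(gemv_cycles(128, 784))  # 100608 -> table: 100,609 (MNIST layer 1)
```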