Benchmarks

Model: MNIST MLP, 2 layers (784→128 ReLU, 128→10), 101,632 total MACs.
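The MAC total follows directly from the layer shapes. A minimal sketch of the arithmetic, assuming each dense layer with K inputs and M outputs costs K×M multiply-accumulates (the names `LAYERS` and `total_macs` are illustrative, not part of the project):

```python
# Layer shapes as (inputs, outputs), taken from the model description above.
LAYERS = [(784, 128), (128, 10)]


def total_macs(layers):
    """Sum K*M multiply-accumulates over all dense layers."""
    return sum(k * m for k, m in layers)


print(total_macs(LAYERS))  # 784*128 + 128*10 = 101632
```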

Compiler-generated hardware (single-MAC, simulated)

A compiled GEMV kernel with M output rows and K inputs per row takes M×(K+2) cycles: for each of the M rows, one cycle to reset the accumulator, K cycles for the MACs, and one cycle to write the output.

| Kernel | Dimensions | Cycles | Time @100 MHz |
|---|---|---|---|
| Layer 1 | 128×784 | 101,504 | ~1.0 ms |
| Layer 2 | 10×128 | 1,300 | ~13 μs |
| Total | | ~102,800 | ~1.0 ms |
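The M×(K+2) formula can be checked against a small behavioral model. The sketch below is a plain-Python stand-in, not the generated hardware or its Amaranth simulation; the function names are hypothetical. It counts one cycle for the accumulator reset, one per MAC, and one for the output write, then converts cycles to time at a given clock.

```python
def gemv_single_mac(weights, x):
    """Behavioral model of a single-MAC GEMV; returns (outputs, cycles)."""
    outputs, cycles = [], 0
    for row in weights:              # M output rows
        acc = 0
        cycles += 1                  # one cycle: accumulator reset
        for w, xi in zip(row, x):    # K multiply-accumulates
            acc += w * xi
            cycles += 1              # one cycle per MAC
        outputs.append(acc)
        cycles += 1                  # one cycle: output write
    return outputs, cycles


def gemv_cycles(m, k):
    """Closed form of the loop above: M * (K + 2)."""
    return m * (k + 2)


def cycles_to_us(cycles, clock_hz):
    """Convert a cycle count to microseconds at the given clock."""
    return cycles / clock_hz * 1e6


# Tiny 2x3 example: the counted cycles match the closed form.
y, cycles = gemv_single_mac([[1, 2, 3], [4, 5, 6]], [1, 1, 1])
assert y == [6, 15] and cycles == gemv_cycles(2, 3)

for name, (m, k) in {"Layer 1": (128, 784), "Layer 2": (10, 128)}.items():
    c = gemv_cycles(m, k)
    print(f"{name}: {c} cycles, {cycles_to_us(c, 100e6):.1f} us at 100 MHz")
```

For Layer 2 the closed form gives 1,300 cycles, matching the table. For Layer 1 it gives 100,608, slightly below the tabulated 101,504, so the simulated figure may include overhead that the formula does not capture.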

Parallelism potential (not yet implemented)

Enabling tinygrad’s UNROLL optimization would expose N-wide SIMD in the UOps, allowing N parallel MACs. The cycle count scales as M×(⌈K/N⌉+2).

| MACs | Layer 1 cycles | Layer 2 cycles | Total @200 MHz |
|---|---|---|---|
| 1 | 101,504 | 1,300 | ~0.51 ms |
| 8 | 12,928 | 170 | ~65 μs |
| 64 | 1,664 | 30 | ~8.5 μs |
| 128 | 896 | 20 | ~4.6 μs |
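The scaling rule M×(⌈K/N⌉+2) can be evaluated for any MAC width. A minimal sketch, assuming the two fixed overhead cycles per row stay unchanged as N grows (`gemv_cycles_parallel` is an illustrative name, not an existing function in the project):

```python
def gemv_cycles_parallel(m, k, n):
    """Cycles for an M x K GEMV with n MACs per cycle: M * (ceil(K/n) + 2)."""
    mac_cycles = (k + n - 1) // n    # integer ceiling of K / n
    return m * (mac_cycles + 2)


LAYERS = [(128, 784), (10, 128)]     # (M, K) for Layer 1 and Layer 2
CLOCK_HZ = 200e6

for n in (1, 8, 64, 128):
    per_layer = [gemv_cycles_parallel(m, k, n) for m, k in LAYERS]
    total_us = sum(per_layer) / CLOCK_HZ * 1e6
    print(f"N={n}: layers={per_layer}, total={total_us:.1f} us")
```

Evaluated literally, the formula gives figures that differ somewhat from the tabulated projections (for example 12,800 rather than 12,928 cycles for Layer 1 at N=8), so the table may reflect a slightly different overhead model.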

End-to-end comparison

uv run python compare_inference.py

Runs a single MNIST test image through tinygrad float32 (CPU reference) and through the two compiled kernels (Amaranth simulation, INT8 quantized). Prints predictions, cycle counts, and wall-clock simulation time.
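To illustrate the kind of check such a comparison performs, here is a self-contained sketch that runs one GEMV in floating point and again with INT8 operands, then compares the predicted class. It is not the contents of compare_inference.py: the symmetric per-tensor quantization scheme, the random weights, and every function name are assumptions made for illustration.

```python
import random


def quantize(values, bits=8):
    """Symmetric per-tensor quantization (assumed scheme); returns (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    peak = max(abs(v) for v in values) or 1.0       # avoid a zero scale
    scale = peak / qmax
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale


def gemv(weights, x):
    """Plain matrix-vector product, one dot product per row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


random.seed(0)
K, M = 128, 10                                      # shape of the second layer
x = [random.random() for _ in range(K)]             # stand-in activations
W = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(M)]

y_float = gemv(W, x)                                # floating-point reference

xq, sx = quantize(x)
flat, sw = quantize([w for row in W for w in row])
Wq = [flat[i * K:(i + 1) * K] for i in range(M)]
y_int = gemv(Wq, xq)                                # integer accumulators
y_deq = [v * sx * sw for v in y_int]                # rescale to real units

print("float prediction:", argmax(y_float))
print("int8 prediction: ", argmax(y_deq))
```

Comparing the argmax rather than raw outputs keeps the check robust to small quantization error, since only the ranking of the logits affects the prediction.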