Benchmarks
Model: MNIST MLP, 2 layers (784→128 ReLU, 128→10), 101,632 total MACs.
Compiler-generated hardware (single-MAC, simulated)
The cycle count for a compiled GEMV kernel with M output rows and K inputs per row is M×(K+2): each row takes one cycle to reset the accumulator, K cycles for the MACs, and one cycle to write the output, and this sequence repeats M times.
| Kernel | Dimensions | Cycles | @100 MHz |
|---|---|---|---|
| Layer 1 | 128×784 | 101,504 | ~1.0 ms |
| Layer 2 | 10×128 | 1,300 | ~13 μs |
| Total | | ~102,800 | ~1.0 ms |
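The per-kernel formula and the conversion from cycles to latency can be sketched in a few lines of Python. This is a minimal illustration, not part of the project: the helper names `gemv_cycles` and `cycles_to_seconds` are hypothetical. It evaluates M×(K+2) exactly as written, so its results are a formula-based estimate and are not guaranteed to match simulated cycle counts.

```python
def gemv_cycles(m: int, k: int) -> int:
    """Single-MAC GEMV estimate: per output row, 1 reset + K MACs + 1 write."""
    return m * (k + 2)


def cycles_to_seconds(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to latency at the given clock frequency."""
    return cycles / clock_hz


# Layer shapes from the model description: (output rows M, inputs per row K).
LAYERS = {"Layer 1": (128, 784), "Layer 2": (10, 128)}
CLOCK_HZ = 100e6  # 100 MHz

# Total multiply-accumulates is the sum of M*K over all layers.
total_macs = sum(m * k for m, k in LAYERS.values())
print(f"total MACs: {total_macs:,}")

for name, (m, k) in LAYERS.items():
    cycles = gemv_cycles(m, k)
    latency_us = cycles_to_seconds(cycles, CLOCK_HZ) * 1e6
    print(f"{name}: {cycles:,} cycles, {latency_us:.1f} us at 100 MHz")
```

Keeping the cycle model and the clock conversion in separate functions makes it easy to re-evaluate the same kernel at a different target frequency.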
Parallelism potential (not yet implemented)
Enabling tinygrad’s UNROLL optimization would expose N-wide SIMD in the UOps, allowing N parallel MACs. The cycle count scales as M×(⌈K/N⌉+2).
| MACs | Layer 1 cycles | Layer 2 cycles | Total @200 MHz |
|---|---|---|---|
| 1 | 101,504 | 1,300 | ~0.51 ms |
| 8 | 12,928 | 170 | ~65 μs |
| 64 | 1,664 | 30 | ~8.5 μs |
| 128 | 896 | 20 | ~4.6 μs |
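The scaling rule M×(⌈K/N⌉+2) can be explored with a short Python sketch. The function name `parallel_gemv_cycles` is hypothetical, and the code applies the formula literally, keeping the two overhead cycles per row fixed regardless of N. Its output is therefore an estimate derived from the formula and is not guaranteed to match the tabulated projections.

```python
def parallel_gemv_cycles(m: int, k: int, n_macs: int) -> int:
    """Estimate for an M x K GEMV with n_macs parallel MACs.

    Each row needs ceil(K / N) MAC cycles plus 2 overhead cycles
    (accumulator reset and output write).
    """
    mac_cycles = -(-k // n_macs)  # integer ceiling division
    return m * (mac_cycles + 2)


# Layer shapes from the model description: (output rows M, inputs per row K).
LAYERS = [(128, 784), (10, 128)]
CLOCK_HZ = 200e6  # 200 MHz

for n in (1, 8, 64, 128):
    total = sum(parallel_gemv_cycles(m, k, n) for m, k in LAYERS)
    print(f"N={n:>3}: {total:>7,} cycles, {total / CLOCK_HZ * 1e6:.2f} us")
```

Because the overhead term does not shrink with N, the speedup flattens once ⌈K/N⌉ becomes comparable to the two fixed cycles per row.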
End-to-end comparison
uv run python compare_inference.py
Runs a single MNIST test image through tinygrad float32 (CPU reference) and through the two compiled kernels (Amaranth simulation, INT8 quantized). Prints predictions, cycle counts, and wall-clock simulation time.
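To illustrate what comparing a float reference against an INT8 kernel involves, here is a small self-contained Python sketch. It is not the contents of `compare_inference.py`: the quantization scheme (symmetric, one scale per tensor), the toy 3×4 weights, and all helper names are assumptions made for illustration.

```python
def quantize_int8(values, scale):
    """Symmetric quantization (assumed scheme): round(v / scale), clamped to INT8."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def gemv(weights, x):
    """Matrix-vector product; works for float or integer operands."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Toy layer with made-up values (not taken from the MNIST model).
weights = [[0.5, -0.25, 0.1, 0.0],
           [-0.3, 0.8, -0.6, 0.2],
           [0.05, 0.1, 0.3, -0.7]]
x = [0.2, 0.9, 0.4, 0.1]

# One scale per tensor, chosen so the largest magnitude maps to 127.
w_scale = max(abs(w) for row in weights for w in row) / 127
x_scale = max(abs(v) for v in x) / 127

weights_q = [quantize_int8(row, w_scale) for row in weights]
x_q = quantize_int8(x, x_scale)

ref = gemv(weights, x)          # float reference
acc = gemv(weights_q, x_q)      # integer accumulation, as a MAC unit would do
deq = [a * w_scale * x_scale for a in acc]  # rescale back to real units

print("float prediction:", argmax(ref))
print("int8 prediction: ", argmax(deq))
print("max abs error:   ", max(abs(d - r) for d, r in zip(deq, ref)))
```

The two paths agree on the prediction whenever the quantization error is smaller than the margin between the top two outputs, which is the property an end-to-end comparison of this kind checks.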