Architecture¶
Dataflow¶
The current network's forward pass is:
h = (x @ w1 + b1).relu()
logits = h @ w2 + b2
Because the input has batch size 1, both products are matrix-vector products. The hardware primitive is therefore GEMV (general matrix-vector multiply).
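A minimal sketch of this reduction in plain Python. The sizes and values are made up for illustration, and the `gemv`, `relu`, and `transpose` helpers are hypothetical, not the code in hdl/gemv.py. It assumes the hardware weight matrix is the transpose of `w1` (or `w2`), so that each output element is one row dot product.

```python
def gemv(W, x, b):
    """Return W @ x + b for a row-major M x K matrix W and a length-K vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0, e) for e in v]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical tiny network: 3 inputs -> 2 hidden units -> 2 logits.
x = [1, -2, 3]
w1 = [[1, 0], [0, 1], [2, -1]]   # shape [3, 2], as used in x @ w1
b1 = [0, 1]
w2 = [[1, -1], [2, 0]]           # shape [2, 2], as used in h @ w2
b2 = [0, 0]

# With batch size 1, x @ w1 is the same as a GEMV with W = transpose(w1).
h = relu(gemv(transpose(w1), x, b1))
logits = gemv(transpose(w2), h, b2)
print(h, logits)
```

Both layers reuse the same `gemv` routine, which is why a single hardware primitive is enough for the whole path.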
GEMV unit (hdl/gemv.py)¶
The unit stores:
Vector memory: x[K] (INT8)
Weight memory: W[M×K] (INT8, row-major)
Accumulator: acc (INT32)
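The storage layout can be sketched as follows. This is an illustration under stated assumptions, not the actual memory map: `weight_addr` and `worst_case_acc` are hypothetical helpers, INT8 is taken to be two's complement (-128 to 127), and any bias term is ignored.

```python
def weight_addr(m, k, K):
    """Flat address of W[m][k] in a row-major M x K weight memory."""
    return m * K + k

def worst_case_acc(K):
    """Largest magnitude a K-term INT8 x INT8 dot product can reach."""
    return K * (-128) * (-128)

INT32_MAX = 2**31 - 1

# Largest K for which the worst-case sum still fits in a signed 32-bit accumulator.
max_safe_K = INT32_MAX // (128 * 128)

print(weight_addr(2, 1, K=4))   # row 2, column 1 of a matrix with K = 4 columns
print(worst_case_acc(4), max_safe_K)
```

Row-major order means the `K` weights of one output row sit at consecutive addresses, which matches a unit that walks one row at a time.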
FSM¶
IDLE: waits for start.
COMPUTE: one MAC per cycle for K columns.
EMIT: publishes result_data for the active row.
DONE: raises done and returns to idle.
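The state sequence above can be modelled behaviourally in Python. This is a sketch, not the HDL: the `State` enum and `run_gemv` are hypothetical names, and it assumes that EMIT loops back to COMPUTE for the next row until all `M` rows have been emitted, which the text does not state explicitly.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    COMPUTE = auto()
    EMIT = auto()
    DONE = auto()

def run_gemv(W, x):
    """Step a behavioural model of the FSM; return (results, busy_cycles)."""
    M, K = len(W), len(x)
    state, row, col, acc = State.IDLE, 0, 0, 0
    results, cycles = [], 0
    start = True  # start is asserted on the first cycle
    while True:
        if state is State.IDLE:
            if start:
                state, row, col, acc = State.COMPUTE, 0, 0, 0
                start = False
        elif state is State.COMPUTE:
            acc += W[row][col] * x[col]  # one MAC per cycle
            col += 1
            cycles += 1
            if col == K:
                state = State.EMIT
        elif state is State.EMIT:
            results.append((row, acc))   # (result_idx, result_data)
            cycles += 1
            row, col, acc = row + 1, 0, 0
            # Assumption: continue with the next row until all M rows are emitted.
            state = State.COMPUTE if row < M else State.DONE
        else:
            # DONE: done is raised and the unit returns to IDLE.
            return results, cycles

results, cycles = run_gemv([[1, 2], [3, 4], [5, 6]], [1, 1])
print(results, cycles)
```

Only COMPUTE and EMIT cycles are counted here, so the model spends `K` compute cycles plus one emit cycle on each row.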
Interface highlights¶
Write ports for loading vector and weight memories.
Streaming-like result channel via result_valid, result_idx, result_data.
Control signals: start, busy, done.
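One way a testbench might consume the result channel is sketched below. The `collect_results` helper and the trace are hypothetical, and the sketch assumes result_valid is high for exactly one cycle per row with no backpressure, since no ready signal is listed.

```python
def collect_results(samples, M):
    """Build the output vector from per-cycle (result_valid, result_idx, result_data) samples."""
    out = [None] * M
    for valid, idx, data in samples:
        if valid:            # ignore cycles where the channel carries no result
            out[idx] = data  # result_idx selects the output row
    return out

# Hypothetical trace for M = 2, K = 2: two compute cycles, then one emit cycle per row.
trace = [(0, 0, 0), (0, 0, 0), (1, 0, 3),
         (0, 0, 0), (0, 0, 0), (1, 1, 7)]
y = collect_results(trace, M=2)
print(y)
```

Because each result carries its own index, the consumer does not need to count cycles to know which row a value belongs to.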
Timing model¶
For a single-MAC design:
Per row: K compute cycles + 1 emit cycle
Total: M × (K + 1)
At 100 MHz one cycle takes 10 ns, so kernel latency can be estimated directly from the cycle count: roughly M × (K + 1) × 10 ns.
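The estimate can be written as a small calculator. The function names are hypothetical, and the layer size (M = 128, K = 784) is chosen only for illustration and is not taken from this project. The model ignores any IDLE or DONE overhead cycles.

```python
def gemv_cycles(M, K):
    """Cycle count for the single-MAC design: K compute cycles + 1 emit cycle per row."""
    return M * (K + 1)

def gemv_latency_us(M, K, clock_hz=100e6):
    """Estimated kernel latency in microseconds at the given clock frequency."""
    return gemv_cycles(M, K) / clock_hz * 1e6

# Hypothetical layer: 128 output rows, 784 input columns.
cycles = gemv_cycles(128, 784)
latency = gemv_latency_us(128, 784)
print(cycles, latency)  # 100480 cycles, about 1004.8 microseconds
```

Since the cycle count scales with M × K, halving either dimension roughly halves the latency for this design.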