Architecture

Dataflow

The present network path follows:

  1. h = (x @ w1 + b1).relu()

  2. logits = h @ w2 + b2

Because input is batch size 1, these are matrix-vector products. The hardware primitive is therefore GEMV.

GEMV unit (hdl/gemv.py)

The unit stores:

  • Vector memory x[K] (INT8)

  • Weight memory W[M×K] (INT8, row-major)

  • Accumulator acc (INT32)

FSM

  • IDLE: waits for start.

  • COMPUTE: one MAC per cycle for K columns.

  • EMIT: publishes result_data for the active row.

  • DONE: raises done and returns to idle.

Interface highlights

  • Write ports for loading vector and weight memories.

  • Streaming-like result channel via result_valid, result_idx, result_data.

  • Control signals: start, busy, done.

Timing model

For a single-MAC design:

  • Per row: K compute cycles + 1 emit cycle

  • Total: M × (K + 1)

At 100 MHz, kernel timing is straightforward to estimate from cycle count.