# Architecture

## Dataflow

The current network computes:

1. `h = (x @ w1 + b1).relu()`
2. `logits = h @ w2 + b2`

Because the batch size is 1, `x` is a single vector and both layers are matrix-vector products. The hardware primitive is therefore GEMV (general matrix-vector multiply).
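As a plain-Python illustration, the sketch below evaluates both layers with one `gemv` helper. The names `gemv`, `relu`, `add` and `forward` and the toy weights are hypothetical. The weights are written as `W[M][K]` (one row per output), which is assumed to be the transpose of the `w1`/`w2` orientation used in `x @ w1`. Bias and ReLU are kept outside the helper because this document does not say where they are applied.

```python
def gemv(W, x):
    """Return W @ x for an M x K matrix W (list of rows) and a length-K vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(v, b):
    """Element-wise vector addition (bias add)."""
    return [vi + bi for vi, bi in zip(v, b)]

def relu(v):
    """Element-wise max(0, v)."""
    return [max(0, e) for e in v]

def forward(x, W1, b1, W2, b2):
    """Two-layer path: h = relu(W1 x + b1), logits = W2 h + b2."""
    h = relu(add(gemv(W1, x), b1))
    return add(gemv(W2, h), b2)

# Hypothetical toy sizes: 3 inputs -> 2 hidden units -> 2 logits.
W1 = [[1, -2, 0], [0, 1, 1]]
b1 = [0, -1]
W2 = [[1, 1], [2, -1]]
b2 = [0, 1]
logits = forward([1, 2, 3], W1, b1, W2, b2)
```

Both layers go through the same `gemv` call with different matrix shapes, which is the property that lets a single hardware unit serve the whole network.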

## GEMV unit (`hdl/gemv.py`)

The unit stores:

- Vector memory `x[K]` (INT8)
- Weight memory `W[M×K]` (INT8, row-major)
- Accumulator `acc` (INT32)
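A minimal software reference for this storage layout, assuming the weight memory is a flat array addressed as `row * K + col` (the usual meaning of row-major; the actual address mapping in `hdl/gemv.py` is not shown in this document). The names `weight_addr` and `gemv_ref` are hypothetical, and clearing the accumulator at the start of each row is an assumption.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def weight_addr(row, col, K):
    """Flat address of W[row][col] in a row-major M x K memory."""
    return row * K + col

def gemv_ref(w_flat, x, M, K):
    """Reference GEMV: INT8 operands, one INT32 accumulator per row."""
    assert len(w_flat) == M * K and len(x) == K
    assert all(INT8_MIN <= v <= INT8_MAX for v in list(w_flat) + list(x))
    out = []
    for row in range(M):
        acc = 0  # assumption: accumulator is cleared at the start of each row
        for col in range(K):
            acc += w_flat[weight_addr(row, col, K)] * x[col]
            assert INT32_MIN <= acc <= INT32_MAX  # would flag INT32 overflow
        out.append(acc)
    return out

# Toy example: M = 2, K = 3, W = [[1, 2, 3], [4, 5, 6]] stored row-major.
result = gemv_ref([1, 2, 3, 4, 5, 6], [1, -1, 2], M=2, K=3)
```

The largest product magnitude of two INT8 values is `(-128) × (-128) = 2**14`, so an INT32 accumulator cannot overflow for any `K < 2**17`.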

### FSM

- `IDLE`: waits for `start`.
- `COMPUTE`: performs one MAC per cycle across the `K` columns of the active row.
- `EMIT`: publishes `result_data` for the active row.
- `DONE`: raises `done` and returns to idle.
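The state sequence above can be sketched as a cycle-by-cycle software model. This is not the code in `hdl/gemv.py`. It assumes that `EMIT` and `DONE` each last exactly one cycle, that the accumulator is cleared after each `EMIT`, and that `M` and `K` are at least 1. The names `State` and `simulate` are hypothetical.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    COMPUTE = auto()
    EMIT = auto()
    DONE = auto()

def simulate(W, x):
    """Step the FSM one cycle at a time; return (results, cycles spent per state)."""
    M, K = len(W), len(x)
    # The model begins on the first COMPUTE cycle, i.e. after `start` was seen in IDLE.
    state, row, col, acc = State.COMPUTE, 0, 0, 0
    results, cycles = [], {s: 0 for s in State}
    while state is not State.IDLE:
        cycles[state] += 1
        if state is State.COMPUTE:
            acc += W[row][col] * x[col]  # one MAC per cycle
            col += 1
            if col == K:
                state = State.EMIT
        elif state is State.EMIT:
            results.append((row, acc))  # models (result_idx, result_data)
            row, col, acc = row + 1, 0, 0  # assumption: acc cleared here
            state = State.COMPUTE if row < M else State.DONE
        elif state is State.DONE:
            state = State.IDLE  # `done` raised for one cycle, then back to idle
    return results, cycles

results, cycles = simulate([[1, 2, 3], [4, 5, 6]], [1, -1, 2])
```

In this model the `COMPUTE` and `EMIT` cycle counts sum to `M × (K + 1)`; the single `DONE` cycle is counted separately.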

### Interface highlights

- Write ports for loading vector and weight memories.
- Streaming-like result channel via `result_valid`, `result_idx`, `result_data`.
- Control signals: `start`, `busy`, `done`.
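One way a testbench might consume the result channel is to sample the three signals every cycle and store `result_data` at position `result_idx` whenever `result_valid` is high. The sketch below works on a list of per-cycle samples rather than a real simulator. `collect_results` and the trace values are hypothetical and are not part of the `hdl/gemv.py` interface.

```python
def collect_results(samples, M):
    """Gather outputs from per-cycle (result_valid, result_idx, result_data) samples."""
    out = [None] * M
    for valid, idx, data in samples:
        if valid:  # ignore idx/data on cycles where the channel is not valid
            out[idx] = data
    return out

# Hypothetical trace for M = 2, K = 3: three COMPUTE cycles, then one EMIT cycle, per row.
trace = [(0, 0, 0)] * 3 + [(1, 0, 5)] + [(0, 0, 0)] * 3 + [(1, 1, 11)]
outputs = collect_results(trace, M=2)
```

Indexing by `result_idx` rather than by arrival order keeps the collector correct even if a row is never emitted; the missing entry stays `None`.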

## Timing model

For a single-MAC design:

- Per row: `K` compute cycles + 1 emit cycle
- Total for `M` rows: `M × (K + 1)` cycles

At 100 MHz one cycle takes 10 ns, so the kernel time is approximately `M × (K + 1) × 10 ns`.
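The estimate can be written as a small helper. The function names and the example shape (`M = 10`, `K = 64`) are hypothetical and do not describe the actual network. The formula covers only the compute and emit cycles listed above, not any start or `DONE` overhead.

```python
def gemv_cycles(M, K):
    """Cycle count for a single-MAC GEMV: K compute cycles + 1 emit cycle per row."""
    return M * (K + 1)

def gemv_latency_s(M, K, clock_hz=100_000_000):
    """Latency in seconds at the given clock frequency (default 100 MHz)."""
    return gemv_cycles(M, K) / clock_hz

# Hypothetical layer shape: 10 output rows, 64 input columns.
cycles = gemv_cycles(10, 64)                # 10 * 65 = 650 cycles
latency_us = gemv_latency_s(10, 64) * 1e6   # about 6.5 microseconds
```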