TALOS - Tensor Accelerated Logic for On-Chip Systems

hardware accelerator for convolutional neural networks

overview

what if we could make something that is faster than pytorch at inference? it's not really something people normally think about…right? everyone just assumes pytorch is the fastest because it's the standard for everything deep learning related. BUT it's built for flexibility, not with hardware speed in mind. if you design for speed, you can actually go past it.

so that's what we did. we built TALOS. this project first started as a question of how far we could push speed if we stripped away all the overhead that comes with frameworks like pytorch. we didn't need dynamic graphs, fancy APIs, or layers of abstraction. we just needed something that could take a trained model and run it as fast as possible on the hardware we had.

we designed TALOS with that in mind. every part of the stack is optimized for inference, from how the weights are stored to how operations are scheduled and executed. it's not about being general purpose, it's about being efficient. that's why TALOS can run models faster, with less memory, and in some cases even less power.

the idea isn’t to replace pytorch or tensorflow. those are amazing for training and research. TALOS is about the other side of the story - deployment, where latency and throughput matter the most.

before we get into the technical details, we want to give a quick shoutout to the Tiny TPU and the entire team behind it for their inspirational project and stories - be sure to check them out here. also shoutout to AMD's Vitis AI (DPU infrastructure) and SamsungLabs' Butterfly Accelerator for inspiration.

going through this documentation you will find details on the architecture, design decisions, performance benchmarks, and how to get started with TALOS. we hope you enjoy reading this. we'll keep iterating and improving it over time.

content

inference
FPGA architecture evolution

inference

what's behind the first inference pipeline

TALOS' first inference pipeline was built around one radical simplification: *do only the math that matters, nothing else*. no runtime, no scheduler, no abstraction. every operation is grounded in fixed-point arithmetic, every cycle is deterministic, every path through the hardware is known.

the q16.16 backbone

the q16.16 backbone is built entirely on fixed-point arithmetic, where every number is stored as a 32-bit signed integer split into 16 integer bits (including the sign) and 16 fractional bits.

converting between floating-point and q16.16 is straightforward: a float is mapped into q16.16 by computing:

q = round(x · 2^16)

and the reverse is obtained by dividing by the same scaling factor:

x = q / 2^16

the smallest representable step in this format is:

2^-16 ≈ 0.0000153

which provides fine-grained precision for most computations. arithmetic in q16.16 follows simple integer rules: addition and subtraction are just standard 32-bit integer operations; multiplication is performed exactly in 64 bits, then rescaled back to q16.16 with a right shift by 16:

c = (a · b) >> 16

and division is implemented by scaling the numerator before integer division:

c = (a << 16) / b

overall, the representable range is:

[-32768, 32768 - 2^-16] ≈ [-32768.0, +32767.99998]

which is wide enough for typical neural network inference while keeping the math exact, deterministic, and efficient to execute on hardware.
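
as a concrete sketch, the two non-trivial operations can be written as plain verilog functions, declared inside whichever module uses them (the names and coding style here are ours, not necessarily what the TALOS source uses):

function signed [31:0] q16_mul;
    input signed [31:0] a, b;
    reg signed [63:0] prod;
    begin
        prod    = a * b;         // 64-bit context keeps the q32.32 product exact
        q16_mul = prod >>> 16;   // arithmetic right shift rescales to q16.16
    end
endfunction

function signed [31:0] q16_div;
    input signed [31:0] a, b;
    reg signed [63:0] num;
    begin
        num     = a;                  // widen the numerator to 64 bits
        q16_div = (num <<< 16) / b;   // pre-scale by 2^16, then integer divide
    end
endfunction

addition and subtraction need no helpers at all; they are ordinary 32-bit signed adds.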

the convolution operation

convolution is the fundamental operation in cnn models (cnn literally stands for convolutional neural network), implemented in hardware as a multiply-accumulate (MAC) loop. mathematically, the output at position (i,j) is computed as:

y[i][j] = Σ_m Σ_n x[i+m][j+n] · w[m][n]

here, x is the input feature map and w is the convolution kernel. in hardware, both x and w are represented in the q16.16 fixed-point format (32-bit signed integers). when multiplied, the result is a q32.32 value (64-bit signed integer), which is then scaled back to q16.16 by a right shift of 16 bits. each product is accumulated into a 32-bit register, with overflow handled by two's complement wraparound. for example, applying a 3×3 kernel to a 28×28 input produces a 26×26 output, which requires:

26 × 26 × 9 = 6084

multiply-accumulate operations per kernel.
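
a behavioral sketch of one output pixel (systemverilog, with illustrative port names; the real hardware computes the same sum one MAC per cycle rather than in a combinational loop):

module conv3x3_pixel (
    input  logic signed [31:0] x [3][3],   // input window around (i, j), q16.16
    input  logic signed [31:0] w [3][3],   // kernel weights, q16.16
    output logic signed [31:0] y           // output pixel y[i][j], q16.16
);
    logic signed [31:0] acc;
    always_comb begin
        acc = 32'sd0;
        for (int m = 0; m < 3; m++)
            for (int n = 0; n < 3; n++)
                // q32.32 product in 64 bits, rescaled with >>> 16, accumulated
                // in 32 bits (wraps on overflow, as described above)
                acc += 32'((64'(x[m][n]) * 64'(w[m][n])) >>> 16);
        y = acc;
    end
endmodule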

below is an illustration of the convolution operation on a dummy input image. for demonstration purposes it is simplified from the 28×28 MNIST input images to an 8×8 input. an edge detection kernel is then convolved with this image with a stride of 1 and zero padding to return a feature map of size 6×6.

[interactive demo: input (8×8) * edge detection kernel (3×3, [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]) = output feature map (6×6)]

maxpool: just comparisons

maxpool is a simple operation that reduces resolution by taking local maxima. mathematically, for a 2×2 window with stride 2, the output at position (i,j) is:

y[i][j] = max( x[2i][2j], x[2i][2j+1], x[2i+1][2j], x[2i+1][2j+1] )

in hardware, the path is straightforward: start with the first value, then compare against the next three, updating the maximum at each step. this requires three comparators and just three multiplexers. each 26×26 feature map is downsampled to 13×13 (169 outputs); across the four kernels that is 676 outputs, one per cycle, for a total of ~676 cycles.
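
the datapath itself is just a chain of compare-and-select stages; a combinational sketch (module and port names are ours):

module max4 (
    input  signed [31:0] a, b, c, d,   // the four values of one 2×2 window, q16.16
    output signed [31:0] y             // local maximum
);
    wire signed [31:0] m0 = (a > b) ? a : b;     // comparator + mux 1
    wire signed [31:0] m1 = (m0 > c) ? m0 : c;   // comparator + mux 2
    assign y = (m1 > d) ? m1 : d;                // comparator + mux 3
endmodule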

flatten: reindex

flatten is not computation, it's reindexing: converting a 2d feature map into a 1d vector. mathematically, the mapping is:

v[i · W + j] = x[i][j]

where W is the feature-map width (13 after pooling), so the four 13×13 pooled maps become the 676-element vector fed to the fully connected layer.
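
since it's only an address calculation, a sketch of the index generation is a single multiply-add (widths here are illustrative):

module flatten_addr #(
    parameter W = 13                 // feature-map width after pooling
) (
    input  [4:0]  row,               // i
    input  [4:0]  col,               // j
    output [15:0] flat_idx           // i*W + j
);
    assign flat_idx = row * W + col;
endmodule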

the fully connected layer

the fully connected layer computes:

y[k] = Σ_i w[k][i] · x[i] + b[k],   k = 0 … 9

using the same q16.16 recipe as the convolution (a behavioral sketch follows the list):

  • inputs in q16.16, product promoted to q32.32 (64-bit)
  • scale back with >>16 to q16.16
  • accumulate in 32-bit register (wrap on overflow)
  • add bias once
  • optional ReLU: y[k] = max(0, y[k])

10 neurons in parallel × 676 MACs = 676 cycles total.
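
putting those steps together, a behavioral sketch of a single neuron (systemverilog array ports; names are illustrative, and the real streaming module consumes one activation per cycle instead of looping combinationally):

module fc_neuron_behav #(
    parameter N = 676
) (
    input  logic signed [31:0] x [N],   // flattened activations, q16.16
    input  logic signed [31:0] w [N],   // this neuron's weights, q16.16
    input  logic signed [31:0] bias,    // q16.16
    output logic signed [31:0] y        // q16.16, after optional ReLU
);
    logic signed [31:0] acc;
    always_comb begin
        acc = 32'sd0;
        for (int k = 0; k < N; k++)
            acc += 32'((64'(x[k]) * 64'(w[k])) >>> 16);  // MAC with >>16 rescale
        acc += bias;                                      // add bias once
        y = acc[31] ? 32'sd0 : acc;                       // ReLU: clamp negatives
    end
endmodule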

FPGA architecture evolution

from serial to parallel: design iterations

TALOS evolved through multiple architecture iterations, each addressing specific performance and resource trade-offs. the journey from our initial time-multiplexed design to the current streaming architecture reveals key insights about hardware design for CNN acceleration.

generation 1: time-multiplexed cnn

our first approach used a single CNN instance reused across 4 kernels via ker_sel and ker_bus multiplexing. this design minimized area but required complex control:

  • single CNN processes kernels sequentially (ker_sel = 0,1,2,3)
  • pass controller FSM (P_IDLE → P_CONV → P_POOL → ...) manages serialization
  • maxpool writes stored pooled maps to memory
  • large parallel matrix multiply for fully connected layer

generation 2: parallel cnn + streaming

recognizing the area-for-latency trade-off, we developed two alternative approaches that fundamentally changed the dataflow:

approach a: 4× parallel cnn instances

  • instantiate four separate cnn modules (cnn_ins0..3)
  • each generates x1..x4 concurrently
  • simplified FSM waits for all mp_complete, then streams mp0→mp1→mp2→mp3

approach b: hybrid serialized with pass controller

  • single CNN reused with enhanced pass controller
  • more intricate control but lower resource usage
  • serializes kernel processing while maintaining streaming interface

maxpool streaming

we added a STREAM_ONLY parameter to eliminate stored pooled maps. when enabled, convimg[] is tied to zero (no RAM inference) and data flows via the out_valid/out_ready handshake.
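
a minimal sketch of the producing side of that handshake (register-slice style; names other than out_valid/out_ready are ours, and a real implementation also needs to backpressure the upstream maxpool, which is omitted here):

module stream_src (
    input  wire               clk,
    input  wire               reset,
    input  wire               pool_valid,   // a new pooled value was just computed
    input  wire signed [31:0] pool_value,   // q16.16 maxpool result
    output reg                out_valid,
    input  wire               out_ready,
    output reg  signed [31:0] out_data
);
    always @(posedge clk) begin
        if (reset) begin
            out_valid <= 1'b0;
        end else if (pool_valid && (!out_valid || out_ready)) begin
            out_data  <= pool_value;     // load the next word into the output register
            out_valid <= 1'b1;
        end else if (out_valid && out_ready) begin
            out_valid <= 1'b0;           // word accepted, nothing new to send
        end
    end
endmodule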

neuron streaming

fully connected neurons now accept inputs one activation at a time via a streaming interface. weights moved from large port arrays to ROM MIFs, with each neuron reading from a dedicated M10K memory.

early: complex pass controller

P_IDLE → P_CONV → P_POOL → ...

current: simple streaming FSM

S_IDLE → S_MP0 → S_MP1 → S_MP2 → S_MP3
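
a sketch of that controller (state encoding and the exact handoff conditions are illustrative; per the text above, it waits on mp_complete from all four instances, then drains the pooled maps in order):

module stream_fsm (
    input  wire       clk,
    input  wire       reset,
    input  wire       all_mp_complete,   // all four maxpools have finished
    input  wire [3:0] mp_done,           // pooled map k fully streamed out
    output reg  [2:0] state
);
    localparam S_IDLE = 3'd0, S_MP0 = 3'd1, S_MP1 = 3'd2,
               S_MP2  = 3'd3, S_MP3 = 3'd4;

    always @(posedge clk) begin
        if (reset)
            state <= S_IDLE;
        else case (state)
            S_IDLE:  if (all_mp_complete) state <= S_MP0;
            S_MP0:   if (mp_done[0])      state <= S_MP1;
            S_MP1:   if (mp_done[1])      state <= S_MP2;
            S_MP2:   if (mp_done[2])      state <= S_MP3;
            S_MP3:   if (mp_done[3])      state <= S_IDLE;
            default:                      state <= S_IDLE;
        endcase
    end
endmodule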

neuron accumulation timing

similar issues plague the streaming neuron's final accumulation step. the original pattern:

sum <= sum + prod_q16;
if (mac_count + 32'd1 == PREV_NEURONS) begin
    outputneuron <= sum + prod_q16; // double-count risk
end

creates subtle double-counting or missing-count bugs depending on when prod_q16 is available. the fix uses the same explicit next-value pattern, sketched below.
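
the idea: compute the updated sum once, in one place, and feed that same value to both the accumulator register and the final output, so prod_q16 is added exactly once (the module wrapper and widths are illustrative; valid gating and clearing between passes are omitted):

module neuron_acc_fix #(
    parameter PREV_NEURONS = 676
) (
    input  wire               clk,
    input  wire signed [31:0] prod_q16,
    input  wire [31:0]        mac_count,
    output reg  signed [31:0] outputneuron
);
    reg  signed [31:0] sum = 32'sd0;
    wire signed [31:0] sum_next = sum + prod_q16;   // single add, reused below

    always @(posedge clk) begin
        sum <= sum_next;
        if (mac_count + 32'd1 == PREV_NEURONS)
            outputneuron <= sum_next;               // same value, no double count
    end
endmodule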

rom synchronous read alignment

streaming neurons access weights from synchronous ROM, requiring careful alignment between address issue and data usage:

w_addr <= operation[AW-1:0]; // cycle N
x_cur <= in_data; // cycle N
// w_cur appears cycle N+1 (ROM latency)
// prod_q16 must use cycle N+1 data

we handle ROM latency with a prime signal that gates computation for one cycle, ensuring w_cur and x_cur are valid before computing products.
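
a sketch of that gating (the real module drives the ROM address from its operation counter and flushes the final product at the end of the vector; both are simplified away here, and the names are illustrative):

module rom_mac_prime #(
    parameter AW = 10
) (
    input  wire               clk,
    input  wire               start,      // clears state at the start of a pass
    input  wire               in_valid,
    input  wire signed [31:0] in_data,
    output reg  [AW-1:0]      w_addr,
    input  wire signed [31:0] w_cur,      // ROM data, valid one cycle after w_addr
    output reg  signed [31:0] sum
);
    reg               prime;              // high once a ROM read is in flight
    reg signed [31:0] x_cur;
    reg signed [63:0] prod;

    always @(posedge clk) begin
        if (start) begin
            prime  <= 1'b0;
            w_addr <= {AW{1'b0}};
            sum    <= 32'sd0;
        end else if (in_valid) begin
            w_addr <= w_addr + 1'b1;      // cycle N: issue the next weight address
            x_cur  <= in_data;            // cycle N: capture the activation
            prime  <= 1'b1;
            if (prime) begin              // skipped on the very first cycle
                prod = x_cur * w_cur;     // cycle N+1 pair, exact 64-bit product
                sum <= sum + (prod >>> 16);
            end
        end
    end
endmodule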

critical resets required:
  • operation <= 0
  • w_addr <= 0
  • mac_count <= 0
  • convolutions <= 0
  • hor_align <= 0
  • complete <= 1'b0

resource & performance tradeoffs

area vs latency decisions

each architectural choice in TALOS represents a specific tradeoff between hardware resources and inference latency. understanding these tradeoffs is crucial for optimizing the design for different deployment scenarios.

single CNN (time-multiplexed)

area cost: lowest
latency: higher
control complexity: high
DSP usage: 1× multiplier

ideal when logic/DSP resources are constrained. requires sophisticated control FSM for kernel serialization.

4× parallel CNN instances

area cost: 4× higher
latency: lowest
control complexity: simple
DSP usage: 4× multipliers

trades DSP/M10K resources for parallel execution. simpler control but requires careful memory management.

streaming vs stored intermediates

the decision to stream activations versus store intermediate results fundamentally impacts memory requirements:

STREAM_ONLY maxpool benefits

  • eliminates convimg[] RAM requirements
  • enables memory-light inference pipeline
  • requires robust backpressure management
  • increases timing sensitivity in downstream modules

weight storage strategies

ROM MIF approach (current)

  • weights stored in quartus-friendly M10K ROMs
  • eliminates large external weight buses
  • requires power-of-two MIF depths (pad with zeros)
  • one ROM per neuron for parallel access
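
a sketch of the per-neuron ROM as we'd write it for inference (the MIF file name, the depth, and the ram_init_file attribute hook are assumptions about the flow, not lifted from the TALOS source):

module weight_rom #(
    parameter AW = 10                      // power-of-two depth: 1024 words, zero-padded
) (
    input  wire               clk,
    input  wire [AW-1:0]      addr,
    output reg  signed [31:0] q            // q16.16 weight, one-cycle read latency
);
    // synchronous read lets quartus map this to an M10K block and initialize
    // it from the MIF at compile time
    (* ram_init_file = "neuron0_weights.mif" *) reg signed [31:0] rom [0:(1<<AW)-1];

    always @(posedge clk)
        q <= rom[addr];
endmodule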

port array approach (legacy)

  • large weight arrays passed as module ports
  • synthesizes to distributed logic
  • high routing overhead
  • harder for quartus to optimize

timing closure considerations

streaming neurons create longer combinational paths that require careful timing analysis:

// critical path: ROM → multiplier → accumulator
w_cur (cycle N+1) → fxp_mul → prod_q16 → sum_next

this path must complete within one clock period. for high-frequency operation, consider pipelining the multiply-accumulate chain.
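
one way to do that, sketched below: register the full-width product before rescaling and accumulating, so the ROM output, the multiplier, and the adder each get their own stage (control, flush, and valid tracking are simplified; names are illustrative):

module mac_pipelined (
    input  wire               clk,
    input  wire               clear,      // start of a new dot product
    input  wire               in_valid,   // w_cur / x_cur are aligned this cycle
    input  wire signed [31:0] w_cur,      // q16.16 weight from ROM
    input  wire signed [31:0] x_cur,      // q16.16 activation
    output reg  signed [31:0] sum         // running q16.16 accumulator
);
    reg signed [63:0] prod_full;          // stage 1: exact q32.32 product
    reg               prod_valid;

    always @(posedge clk) begin
        if (clear) begin
            prod_valid <= 1'b0;
            sum        <= 32'sd0;
        end else begin
            prod_full  <= w_cur * x_cur;          // stage 1: multiply only
            prod_valid <= in_valid;
            if (prod_valid)
                sum <= sum + (prod_full >>> 16);  // stage 2: rescale + accumulate
        end
    end
endmodule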