hardware accelerator for convolutional neural networks
what if we could make something that is faster than pytorch at inference? it's not really something people normally think about…right? everyone just assumes pytorch is the fastest because it's the standard for everything deep learning related. BUT it's built for flexibility, not with hardware speed in mind. if you design for speed from the start, you can actually go past it.

so that's what we did. we built TALOS. this project first started as a question of how far we could push speed if we stripped away all the overhead that comes with frameworks like pytorch. we didn't need dynamic graphs, fancy APIs, or layers of abstraction. we just needed something that could take a trained model and run it as fast as possible on the hardware we had.
we designed TALOS with that in mind. every part of the stack is optimized for inference, from how the weights are stored to how operations are scheduled and executed. it’s not about being general purpose, it’s about being efficient. that’s why TALOS can run models faster, using less memory, and in some cases even less power.
the idea isn’t to replace pytorch or tensorflow. those are amazing for training and research. TALOS is about the other side of the story - deployment, where latency and throughput matter the most.
before we get into the technical details of it all, we want to give a quick shoutout to the Tiny TPU and the entire team for their inspirational project and stories - be sure to check them out here. also shoutout to AMD's Vitis AI (DPU infra) and SamsungLabs's Butterfly Accelerator for inspiration.
going through this documentation you will find details on the architecture, design decisions, performance benchmarks, and how to get started with TALOS. we hope you enjoy reading this. we'll keep iterating and improving it over time.
TALOS' first inference pipeline was built around one radical simplification: *do only the math that matters, nothing else*. no runtime, no scheduler, no abstraction. every operation is grounded in fixed-point arithmetic, every cycle is deterministic, every path through the hardware is known.
the q16.16 backbone is built entirely on fixed-point arithmetic, where every number is stored as a 32-bit signed integer split into 16 integer bits (including the sign) and 16 fractional bits.
converting between floating-point and q16.16 is straightforward: a float is mapped into q16.16 by computing:

$$Q = \operatorname{round}(f \times 2^{16})$$

and the reverse is obtained by dividing by the same scaling factor:

$$f \approx \frac{Q}{2^{16}}$$

the smallest representable step in this format is:

$$\varepsilon = 2^{-16} \approx 1.53 \times 10^{-5}$$

which provides fine-grained precision for most computations. arithmetic in q16.16 follows simple integer rules: addition and subtraction are just standard 32-bit integer operations; multiplication is performed exactly in 64 bits, then rescaled back to q16.16 with a right shift by 16:

$$c = \frac{a \cdot b}{2^{16}} = (a \cdot b) \gg 16$$

and division is implemented by scaling the numerator before integer division:

$$c = \frac{a \cdot 2^{16}}{b} = (a \ll 16) \,/\, b$$

overall, the representable range is:

$$\left[-2^{15},\; 2^{15} - 2^{-16}\right] = [-32768,\; \approx 32767.99998]$$

which is wide enough for typical neural network inference while keeping the math exact, deterministic, and efficient to execute on hardware.
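to make those rules concrete, here's a minimal verilog sketch of the two non-trivial cases. the function names, the container module, and the coding style are ours for illustration - this is not lifted from the TALOS source:

```verilog
// illustrative q16.16 helpers, not the TALOS source. values are 32-bit
// signed integers with an implicit scaling factor of 2^16 = 65536
// (e.g. 1.5 is stored as 98304).
module q16_ops;  // container module so the functions compile standalone

    function automatic signed [31:0] q16_mul(input signed [31:0] a,
                                             input signed [31:0] b);
        reg signed [63:0] prod;
        begin
            prod    = a * b;        // exact 64-bit product (q32.32)
            q16_mul = prod >>> 16;  // rescale back to q16.16
        end
    endfunction

    function automatic signed [31:0] q16_div(input signed [31:0] a,
                                             input signed [31:0] b);
        reg signed [63:0] num;
        begin
            num     = a;            // sign-extend the numerator to 64 bits
            num     = num <<< 16;   // pre-scale it by 2^16...
            q16_div = num / b;      // ...then do plain integer division
        end
    endfunction

endmodule
```

addition and subtraction need no helpers at all - they are ordinary 32-bit signed adds, which is a big part of why this format maps so cleanly onto hardware.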
convolution is the fundamental operation in cnn models (it literally stands for convolutional neural networks), implemented in hardware as a multiply–accumulate (MAC) loop. mathematically, the output at position (i,j) is computed as:

$$y(i,j) = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} x(i+m,\, j+n) \cdot w(m,n)$$

here, x is the input feature map and w is the K×K convolution kernel. in hardware, both x and w are represented in the q16.16 fixed-point format (32-bit signed integers). when multiplied, the result is a q32.32 value (64-bit signed integer), which is then scaled back to q16.16 by a right shift of 16 bits. each product is accumulated into a 32-bit register, with overflow handled by two's complement wraparound. for example, applying a 3×3 kernel to a 28×28 input produces a 26×26 output, which requires:

$$26 \times 26 \times 9 = 6084$$

multiply-accumulate operations per kernel.
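as a rough sketch of that datapath (hypothetical module and signal names, not the actual TALOS RTL), the per-kernel MAC loop boils down to something like this, fed one input/weight pair per cycle:

```verilog
// illustrative sketch, not the TALOS RTL: one q16.16 multiply-accumulate per
// cycle. the 32-bit accumulator wraps on overflow, exactly as described above.
module conv_mac (
    input  wire               clk,
    input  wire               clear,     // start of a new output pixel
    input  wire               valid,     // x/w pair is valid this cycle
    input  wire signed [31:0] x_q16,     // input activation, q16.16
    input  wire signed [31:0] w_q16,     // kernel weight, q16.16
    output reg  signed [31:0] acc_q16    // running sum, q16.16
);
    wire signed [63:0] prod_full = x_q16 * w_q16;      // q32.32 product
    wire signed [31:0] prod_q16  = prod_full >>> 16;   // back to q16.16

    always @(posedge clk) begin
        if (clear)
            acc_q16 <= 32'sd0;
        else if (valid)
            acc_q16 <= acc_q16 + prod_q16;             // wraps on overflow
    end
endmodule
```

clear marks the start of a new output pixel; after nine valid cycles, acc_q16 holds y(i,j) for a 3×3 kernel.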
below is an illustration of the convolution operation on a dummy input image. for demonstration purposes it is simplified from the 28x28 MNIST input images to an 8x8 input. an edge detection kernel is then convolved with this image with a stride of 1 and 0 padding to return a feature map of size 6x6.
maxpool is a simple operation that reduces resolution by taking local maxima. for a 2×2 window with stride 2, the output at position (i,j) is:

$$y(i,j) = \max_{0 \le m,n < 2} \; x(2i+m,\, 2j+n)$$
in hardware, the path is straightforward: start with the first value, then compare against the next three, updating the maximum at each step. this requires three comparators and just three multiplexers. each 26×26 feature map downsamples to 13×13 (169 outputs), so across the four kernels we compute 4 × 169 = 676 outputs, one per cycle, for a total of ~676 cycles.
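one way to write that compare-and-select chain (a sketch, not the actual TALOS RTL):

```verilog
// illustrative sketch, not the TALOS RTL: max of a 2x2 window using three
// compare-and-select steps - the three comparators / three muxes mentioned
// above. q16.16 values compare correctly as plain signed integers.
module maxpool2x2 (
    input  wire signed [31:0] a, b, c, d,   // the four q16.16 values in the window
    output wire signed [31:0] y             // their maximum
);
    wire signed [31:0] m0 = (a > b)  ? a  : b;   // comparator + mux 1
    wire signed [31:0] m1 = (m0 > c) ? m0 : c;   // comparator + mux 2
    assign y              = (m1 > d) ? m1 : d;   // comparator + mux 3
endmodule
```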
flatten is not computation, it's just reindexing, converting a 2d feature map into a 1d vector. mathematically, the mapping is:

$$\text{idx} = i \cdot W + j$$

for a row-major map of width W; concatenating the four 13×13 pooled maps gives the 676-element vector that feeds the fully connected layer.
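as a small sketch of that address arithmetic (the kernel-major, row-major ordering here is our assumption for illustration, not necessarily the layout TALOS uses):

```verilog
module flatten_addr;  // container module so the function compiles standalone
    // maps (kernel, row, col) of the four 13x13 pooled maps onto 0..675
    function automatic [9:0] flat_idx(input [1:0] k,    // which kernel's map, 0..3
                                      input [3:0] row,  // 0..12
                                      input [3:0] col); // 0..12
        flat_idx = k * 10'd169 + row * 10'd13 + col;    // idx = k*13*13 + i*13 + j
    endfunction
endmodule
```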
the fully connected layer computes, for each output neuron k:

$$y_k = \sum_{j=0}^{675} w_{k,j} \cdot x_j$$

where x is the 676-element flattened vector and w_k is that neuron's q16.16 weight vector, with each multiply rescaled exactly like the convolution MACs. 10 neurons in parallel × 676 MACs each = ~676 cycles total.
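a minimal sketch of that parallel bank (hypothetical names and a flattened weight bus for brevity, not the actual TALOS RTL): ten accumulators update in lockstep from the same streamed activation, so the layer finishes in ~676 cycles.

```verilog
// illustrative sketch, not the TALOS RTL: N neuron accumulators sharing one
// streamed q16.16 activation per cycle.
module fc_bank #(
    parameter N = 10
)(
    input  wire                   clk,
    input  wire                   clear,          // start of a new inference
    input  wire                   valid,          // x_q16 is a fresh activation
    input  wire signed [31:0]     x_q16,          // streamed activation, q16.16
    input  wire signed [N*32-1:0] w_q16_flat,     // this cycle's weight for each neuron
    output wire signed [N*32-1:0] acc_q16_flat    // running sums, q16.16
);
    genvar n;
    generate
        for (n = 0; n < N; n = n + 1) begin : neurons
            wire signed [31:0] w_n    = w_q16_flat[n*32 +: 32];
            wire signed [63:0] prod   = x_q16 * w_n;     // q32.32 product
            wire signed [31:0] prod16 = prod >>> 16;     // back to q16.16
            reg  signed [31:0] acc;

            always @(posedge clk) begin
                if (clear)      acc <= 32'sd0;
                else if (valid) acc <= acc + prod16;     // wraps on overflow
            end

            assign acc_q16_flat[n*32 +: 32] = acc;
        end
    endgenerate
endmodule
```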
TALOS evolved through multiple architecture iterations, each addressing specific performance and resource trade-offs. the journey from our initial time-multiplexed design to the current streaming architecture reveals key insights about hardware design for CNN acceleration.
our first approach used a single CNN instance reused across 4 kernels via ker_sel and ker_bus multiplexing. this design minimized area but required complex control:
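the control FSM itself isn't shown here, but a minimal sketch of the kernel-multiplexing side looks something like the following. only ker_sel and ker_bus come from the design; the flattened weight bus and its widths are our assumptions:

```verilog
// illustrative sketch, not the TALOS RTL: the single datapath sees only one
// kernel's 3x3 weights at a time, selected by ker_sel and driven onto ker_bus.
module kernel_mux (
    input  wire        [1:0]        ker_sel,       // which of the 4 kernels is active
    input  wire signed [4*9*32-1:0] all_weights,   // 4 kernels x 9 q16.16 weights
    output wire signed [9*32-1:0]   ker_bus        // the selected kernel's weights
);
    assign ker_bus = all_weights[ker_sel*9*32 +: 9*32];
endmodule
```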
recognizing area-for-latency trade-offs, we developed two alternate approaches that fundamentally changed the dataflow:
added a STREAM_ONLY parameter to eliminate stored pooled maps. when enabled, convimg[] is tied to zero (no RAM inference) and data flows via an out_valid/out_ready handshake.
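a minimal sketch of how that parameter switch can look - STREAM_ONLY, convimg and the out_valid/out_ready names come from the design, everything else here is our illustration:

```verilog
// illustrative sketch, not the TALOS RTL: STREAM_ONLY swaps a stored pooled
// map for a valid/ready handshake.
module pool_out #(
    parameter STREAM_ONLY = 1
)(
    input  wire               clk,
    input  wire               pool_done,   // a pooled value is ready this cycle
    input  wire        [7:0]  pool_addr,   // where it would be stored
    input  wire signed [31:0] pool_max,    // the pooled q16.16 value
    input  wire               out_ready,   // downstream can accept a value
    output wire               out_valid,
    output wire signed [31:0] out_data
);
    generate
        if (STREAM_ONLY) begin : g_stream
            // no RAM is inferred in this branch; data leaves via the handshake.
            // a real producer would also stall while out_valid && !out_ready (not shown).
            assign out_valid = pool_done;
            assign out_data  = pool_max;
        end else begin : g_store
            reg signed [31:0] convimg [0:168];        // 13x13 stored pooled map
            always @(posedge clk)
                if (pool_done)
                    convimg[pool_addr] <= pool_max;   // classic stored-map path
            assign out_valid = 1'b0;                  // no streaming in this branch
            assign out_data  = 32'sd0;                // (the map is read elsewhere, not shown)
        end
    endgenerate
endmodule
```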
fully connected neurons now accept inputs one activation at a time via a streaming interface. weights moved from large port arrays to ROM MIFs, with each neuron reading from dedicated M10K memory.
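a minimal sketch of that per-neuron ROM. the ram_init_file attribute is one quartus way to attach a MIF to an inferred memory; the filename and module shape are our illustration, not the TALOS source:

```verilog
// illustrative sketch, not the TALOS RTL: each streaming neuron keeps its 676
// q16.16 weights in its own inferred M10K ROM and reads one per cycle.
module neuron_rom (
    input  wire               clk,
    input  wire        [9:0]  w_addr,   // which of the 676 weights to fetch
    output reg  signed [31:0] w_cur     // weight available one cycle later
);
    (* ram_init_file = "neuron0_weights.mif" *)   // hypothetical filename
    reg signed [31:0] rom [0:675];

    always @(posedge clk)
        w_cur <= rom[w_addr];            // synchronous read => 1-cycle latency
endmodule
```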
P_IDLE → P_CONV → P_POOL → ...

S_IDLE → S_MP0 → S_MP1 → S_MP2 → S_MP3

similar issues plague the streaming neuron's final accumulation step. the original accumulation pattern creates subtle double-counting or missing-count bugs depending on when prod_q16 is available. the fix uses the same explicit next-value pattern.
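a minimal sketch of that fix (only prod_q16 is a name from the design; the rest is our illustration): compute the next accumulator value once, combinationally, and use it for both the register update and the final result, so the last product is counted exactly once.

```verilog
// illustrative sketch of the explicit next-value pattern, not the TALOS RTL.
module acc_fix (
    input  wire               clk,
    input  wire               clear,       // start of a new dot product
    input  wire               prod_valid,  // prod_q16 is a fresh product this cycle
    input  wire               last_prod,   // ...and it is the final one
    input  wire signed [31:0] prod_q16,    // rescaled q16.16 product
    output reg  signed [31:0] acc,
    output reg  signed [31:0] result,
    output reg                result_valid
);
    wire signed [31:0] acc_next = acc + prod_q16;   // the single source of truth

    always @(posedge clk) begin
        if (clear) begin
            acc          <= 32'sd0;
            result_valid <= 1'b0;
        end else if (prod_valid) begin
            acc <= acc_next;                        // never add prod_q16 twice
            if (last_prod) begin
                result       <= acc_next;           // includes the last product exactly once
                result_valid <= 1'b1;
            end
        end
    end
endmodule
```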
streaming neurons access weights from synchronous ROM, requiring careful alignment between address issue and data usage:
we handle ROM latency with a prime signal that gates computation for one cycle, ensuring w_cur and x_cur are valid before computing products.
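a minimal sketch of that priming scheme - prime, w_cur, x_cur and w_addr match the names above, while the surrounding handshake is our illustration:

```verilog
// illustrative sketch, not the TALOS RTL: a one-cycle prime flag covers the
// synchronous ROM's read latency so the MAC only fires once w_cur and the
// matching x_cur are actually valid.
module primed_mac (
    input  wire               clk,
    input  wire               start,      // begin a new dot product
    input  wire               in_valid,   // a new activation arrived
    input  wire signed [31:0] x_in,       // streamed activation, q16.16
    input  wire signed [31:0] w_cur,      // ROM output, valid one cycle after w_addr
    output reg         [9:0]  w_addr,
    output reg  signed [31:0] acc
);
    reg                prime;             // high for the first cycle after start
    reg signed [31:0]  x_cur;             // activation delayed to line up with w_cur

    wire signed [63:0] prod_full = x_cur * w_cur;
    wire signed [31:0] prod_q16  = prod_full >>> 16;

    always @(posedge clk) begin
        if (start) begin
            w_addr <= 10'd0;
            acc    <= 32'sd0;
            prime  <= 1'b1;               // ROM data not valid yet
        end else if (in_valid) begin
            x_cur  <= x_in;               // capture activation alongside the ROM read
            w_addr <= w_addr + 10'd1;     // issue the next address
            prime  <= 1'b0;
            if (!prime)
                acc <= acc + prod_q16;    // only accumulate once data is aligned
        end
    end
    // note: the very last product still needs one extra cycle to land in acc -
    // that is the final-accumulation step handled with the next-value pattern above.
endmodule
```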
```verilog
operation    <= 0;
w_addr       <= 0;
mac_count    <= 0;
convolutions <= 0;
hor_align    <= 0;
complete     <= 1'b0;
```

each architectural choice in TALOS represents a specific tradeoff between hardware resources and inference latency. understanding these tradeoffs is crucial for optimizing the design for different deployment scenarios.
ideal when logic/DSP resources are constrained. requires sophisticated control FSM for kernel serialization.
trades DSP/M10K resources for parallel execution. simpler control but requires careful memory management.
the decision to stream activations versus store intermediate results fundamentally impacts memory requirements:
streaming eliminates the convimg[] RAM requirements entirely, while the stored-map path keeps that memory in the design.

streaming neurons create longer combinational paths that require careful timing analysis: the 32×32 multiply, the right shift by 16, and the accumulator add all sit between registers. this path must complete within one clock period. for high-frequency operation, consider pipelining the multiply-accumulate chain.
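a minimal sketch of that pipelining (hypothetical names, not the actual TALOS RTL): register the full-precision product first, then rescale and accumulate one cycle later, splitting the long multiply → rescale → add path in two.

```verilog
// illustrative sketch, not the TALOS RTL: a two-stage multiply-accumulate.
module mac_pipelined (
    input  wire               clk,
    input  wire               clear,      // start of a new dot product
    input  wire               in_valid,   // x/w pair is valid this cycle
    input  wire signed [31:0] x_q16,
    input  wire signed [31:0] w_q16,
    output reg  signed [31:0] acc
);
    reg signed [63:0] prod_full;   // stage-1 register: full-precision q32.32 product
    reg               prod_valid;

    always @(posedge clk) begin
        // stage 1: multiply only
        prod_full  <= x_q16 * w_q16;
        prod_valid <= in_valid && !clear;

        // stage 2: rescale and accumulate one cycle later
        if (clear)
            acc <= 32'sd0;
        else if (prod_valid)
            acc <= acc + $signed(prod_full[47:16]);   // q16.16 slice of the product
    end
endmodule
```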