From CPU Instructions to Hardware States: Architecture of a Crypto Coprocessor

Last semester, I took an Information Security course that went from basics like Caesar ciphers all the way to modern algorithms like AES and SHA-256. Naturally, being a dev, I got curious and started implementing these algorithms in Python just to see how they tick under the hood.
It worked, but it hit me pretty quickly: even with perfect math, software is vulnerable. If the operating system is compromised, an adversary doesn't need to "break" the encryption; they just need to scrape the keys from memory. Plus, running round after round of bit-level math on a general-purpose CPU is like using a luxury car to haul bricks: it works, but most of the machine is overhead. I started wondering: what if we pulled this logic out of the OS entirely? What if we had a dedicated peripheral, a coprocessor, that handles the heavy lifting and just hands the system the results?
Why It Matters
When we write software, we’re essentially asking a CPU to play a game of "Simon Says" with a billion instructions. It’s sequential, it’s interrupted by background tasks, and it relies on caches that are inherently unpredictable. This introduces jitter—the time it takes to encrypt the same data varies every single time.
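If you want to see that jitter for yourself, here is a minimal sketch using only Python's standard library. It hashes the same 64-byte message repeatedly with hashlib and records how long each call takes. The helper name time_sha256 and the run count are my own choices for illustration, and the actual numbers will depend entirely on your machine and whatever else it is doing.

```python
import hashlib
import statistics
import time


def time_sha256(message, runs=2000):
    """Return the latency of each hashlib.sha256 call, in nanoseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        hashlib.sha256(message).digest()
        samples.append(time.perf_counter_ns() - start)
    return samples


if __name__ == "__main__":
    # Same input every time, so any spread comes from the platform, not the data.
    samples = time_sha256(b"\x00" * 64)
    print("min   :", min(samples), "ns")
    print("median:", statistics.median(samples), "ns")
    print("max   :", max(samples), "ns")
```

On a typical desktop the gap between the minimum and the maximum is usually far larger than the median itself, even though the input never changes.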
Hardware acceleration flips this. By moving from Python to RTL (Register Transfer Level), we aren't writing instructions; we're describing a physical structure. In my coprocessor, the algorithm is the architecture. There's no context switching and no waiting for a core to be free. You get deterministic, clock-cycle-accurate execution: every SHA-256 block takes exactly the same number of cycles, regardless of the data.
This predictability isn't just a performance boost; it's a security feature, because latency that never depends on the key or the data gives a timing side-channel attack nothing to measure. It's also the gateway to ultra-low-latency systems, the kind used in High-Frequency Trading (HFT), where every nanosecond is a competitive edge. This project is my first step toward that world.
Design Decision: Building the Brain
I knew I wanted this to be a standalone peripheral, so the big challenge was handling the "handshake" between the system and the engines. I went with a modular hierarchy: a top-level controller that talks to a 64-bit AXI4-Stream interface, and independent "worker" engines for AES and SHA.
A key choice I made was using a dedicated Finite State Machine (FSM) for control rather than something like microcode. In crypto, predictability is security, and with a hardwired FSM every operation takes a fixed number of cycles. Treating the SHA-256 and AES engines as "black boxes" behind the controller also keeps the design extensible: an engine can be swapped out or a new one added with little change to the control logic. I used the AXI4-Stream TDEST signal as an operation selector (opcode), so the system can switch between loading keys, running SHA-256, and executing AES-128 without reconfiguring the bus.
Implementation
The heart of the system is an FSM driven by a 3-bit opcode. When the system asserts s_axis_tvalid, the controller latches the TDEST value to decide which task to run.
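To make the opcode latch concrete, here is a small behavioral sketch in Python (not the project's Verilog). The three opcode encodings are hypothetical, since this post doesn't list the real values, and the function name latch_opcode is mine. It also assumes the standard AXI4-Stream rule that a beat only transfers on a cycle where both TVALID and TREADY are high.

```python
from enum import IntEnum
from typing import Optional


class Opcode(IntEnum):
    # Hypothetical encodings for illustration; the real values may differ.
    LOAD_KEY = 0b001
    SHA256 = 0b010
    AES128_ENCRYPT = 0b011


def latch_opcode(tvalid: bool, tready: bool, tdest: int) -> Optional[Opcode]:
    """Return the opcode latched on this cycle, or None if nothing is latched."""
    if not (tvalid and tready):
        return None  # no handshake, so no transfer happens this cycle
    try:
        return Opcode(tdest & 0b111)  # only the low 3 bits carry the opcode
    except ValueError:
        return None  # unassigned encoding: this model simply ignores it
```

For example, latch_opcode(True, True, 0b010) selects the SHA-256 path, while the same TDEST with tready low is ignored.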
The architecture relies on a 512-bit internal data buffer constructed from eight 64-bit "beats." This allows the coprocessor to ingest a full SHA-256 message block or an AES plaintext/key pair efficiently. The FSM moves through four stages:
IDLE: Waiting for the system handshake.
LOAD_DATA: Burst-loading the 64-bit segments into the 512-bit wide buffer.
EXECUTE: Triggering the sub-modules. The SHA engine takes ~64 cycles, while the iterative AES core steps through its 10 rounds.
WRITE_OUT: Streaming the 256-bit hash or 128-bit ciphertext back to the master.
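The four stages above can be modeled cycle by cycle in Python. Treat this as a behavioral sketch under stated assumptions, not the actual RTL. The class and function names are mine. I assume the first beat is captured on the cycle that leaves IDLE, that beats are packed most-significant first, and that the SHA-256 path spends 64 cycles in EXECUTE (the "~64" figure above) and 4 beats in WRITE_OUT (256 bits over a 64-bit bus).

```python
from enum import Enum, auto

BEAT_BITS = 64
BEATS_PER_BLOCK = 8                  # 8 x 64 bits fills the 512-bit buffer
BEAT_MASK = (1 << BEAT_BITS) - 1


class State(Enum):
    IDLE = auto()
    LOAD_DATA = auto()
    EXECUTE = auto()
    WRITE_OUT = auto()


class Controller:
    """Cycle-level model of the control FSM (engines are not simulated)."""

    def __init__(self, exec_cycles, out_beats):
        self.state = State.IDLE
        self.buffer = 0              # models the 512-bit data buffer
        self.beats = 0
        self.counter = 0
        self.exec_cycles = exec_cycles
        self.out_beats = out_beats

    def tick(self, tvalid=False, tdata=0):
        """Advance one clock cycle and return the resulting state."""
        if self.state is State.IDLE:
            if tvalid:               # assumption: first beat is captured here
                self.buffer = tdata & BEAT_MASK
                self.beats = 1
                self.state = State.LOAD_DATA
        elif self.state is State.LOAD_DATA:
            if tvalid:               # shift the next beat in below the others
                self.buffer = (self.buffer << BEAT_BITS) | (tdata & BEAT_MASK)
                self.beats += 1
                if self.beats == BEATS_PER_BLOCK:
                    self.counter = self.exec_cycles
                    self.state = State.EXECUTE
        elif self.state is State.EXECUTE:
            self.counter -= 1        # stand-in for the engine doing its rounds
            if self.counter == 0:
                self.counter = self.out_beats
                self.state = State.WRITE_OUT
        elif self.state is State.WRITE_OUT:
            self.counter -= 1        # one output beat per cycle
            if self.counter == 0:
                self.state = State.IDLE
        return self.state


def run_block(ctrl, beats):
    """Feed one block, then count cycles until the FSM is back in IDLE."""
    cycles = 0
    for beat in beats:
        ctrl.tick(tvalid=True, tdata=beat)
        cycles += 1
    while ctrl.state is not State.IDLE:
        ctrl.tick()
        cycles += 1
    return cycles
```

Under these assumptions a full block costs 8 + 64 + 4 = 76 cycles, and run_block(Controller(64, 4), beats) returns that same count whether the eight beats are all zeros or all ones. That data-independent cycle count is exactly the property the FSM is there to guarantee.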
Source Code
You can find the full Verilog implementation, constraints, and testbenches in the GitHub repository:
👉 GitHub: Hardware-Accelerated-Cryptography
Closing Insight
At the end of the day, good hardware design is about choosing your constraints. Moving from software abstraction to hardware gates isn't just a speed boost; it's about keeping your keys and your math out of reach of the layers above. Offloading to a dedicated data path means that even if the software is compromised, the silicon still behaves deterministically.
What's Next: The Roadmap
This is the first post in a five-part series where I'll guide you through every detail of this design. Here are the four posts still to come:
The SHA-256 Core – Discover why simple implementations struggle with timing and how I optimized the T1/T2 combinational paths.
The AES-128 Core – Balancing area and throughput by choosing an iterative FSM over an unrolled architecture.
The AXI4-Stream Interface – Managing backpressure and ensuring smooth data flow without bottlenecks.
Verification – Using NIST test vectors to confirm the hardware output matches the reference results bit for bit.
Stay tuned for the next post where we explore the math and timing of the SHA-256 engine.


