**Reverse-Engineering the RK3588 NPU: Hacking Limits to Run Vision Transformers**

**A Journey of Discovery and Innovation**

As a student specializing in Edge AI & Embedded Systems at CU Boulder, I was determined to push the boundaries of what's possible on the Rockchip RK3588. This processor promises 6 TOPS of NPU performance, making it an attractive option for modern AI workloads, computer vision included. However, when I tried to run the SigLIP vision encoder from SmolVLM on this chip, Rockchip's official NPU SDK (rknn-toolkit2) failed to deliver.

**The Problem: Unoptimized and Failing**

When I fed the SigLIP Vision Transformer used by SmolVLM into the rknn-toolkit2 converter, it choked on the model's massive attention matrices: compilation aborted with a cryptic hex error. Even though the model is "smol," the toolchain couldn't handle its memory demands.

**Accepting the Challenge**

I didn't accept this limitation. I decided to reverse-engineer the NPU to understand why it was failing and how to force it to run at full speed. The journey took me through the NPU's instruction set architecture (ISA), its memory hierarchy, and eventually custom code injection.

**The First Step: Understanding Error 0xe010**

When I tried to compile the attention layers, the driver kept spitting out an undocumented error: REGTASK Overflow (0xe010). Hypothesizing that some internal memory limit was overflowing, I wrote a script that generated synthetic ONNX graphs of increasing size to probe the hardware's limits.
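Below is a minimal sketch of what such a probe can look like, assuming the standard `onnx` Python package; the shapes, file names, and sweep range are illustrative. Each generated file would then be fed through rknn-toolkit2 to see where compilation first breaks.

```python
# Hypothetical probe: generate single-MatMul ONNX graphs of growing sequence
# length, mimicking the Q @ K^T score matrix inside one attention head.
# Feed each file to rknn-toolkit2 and note where 0xe010 first appears.
import onnx
from onnx import TensorProto, helper

def make_probe(seq_len: int, head_dim: int = 64) -> onnx.ModelProto:
    q = helper.make_tensor_value_info("Q", TensorProto.FLOAT, [1, seq_len, head_dim])
    kt = helper.make_tensor_value_info("KT", TensorProto.FLOAT, [1, head_dim, seq_len])
    scores = helper.make_tensor_value_info("scores", TensorProto.FLOAT, [1, seq_len, seq_len])
    node = helper.make_node("MatMul", ["Q", "KT"], ["scores"])
    graph = helper.make_graph([node], f"probe_{seq_len}", [q, kt], [scores])
    return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])

for seq in (64, 128, 256, 512, 1024, 2048):
    onnx.save(make_probe(seq), f"probe_seq{seq}.onnx")
    print(f"probe_seq{seq}.onnx -> {seq}x{seq} fp32 scores = {seq * seq * 4 / 1024:.0f} KB")
```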

**Discovery: The NPU's Hardware Limitations**

The probing revealed that the RK3588 has a hardware-enforced 32 KB L1 SRAM scratchpad for vector operations. The standard compiler was trying to shove a 25 MB attention matrix into that tiny slot, roughly 800 times over budget, which is exactly what the REGTASK overflow was complaining about.

**The Fix: Nano-Tiling & The "Poison Pill"**

To solve this, I wrote a "Nano-Tiling" algorithm in PyTorch that manually slices the huge attention matmuls into tiny 32x32 tiles, each small enough to fit in the 32 KB scratchpad. However, the rknn compiler is "smart": it fused the tiled operators right back together, recreating the oversized matmul and crashing again.
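Here's a minimal PyTorch sketch of the tiling idea. The shapes and tile size are illustrative, and the real version also has to tile the softmax-times-V stage and run per attention head:

```python
import torch

def tiled_scores(q: torch.Tensor, k: torch.Tensor, tile: int = 32) -> torch.Tensor:
    """Compute Q @ K^T as a grid of tile x tile output blocks, so each MatMul
    the compiler sees is small enough to live in the 32 KB scratchpad.
    Mathematically identical to q @ k.T, just sliced differently."""
    seq, _ = q.shape
    scores = torch.empty(seq, seq, dtype=q.dtype)
    for i in range(0, seq, tile):
        for j in range(0, seq, tile):
            # Each slice-matmul becomes its own small node in the exported graph.
            scores[i:i+tile, j:j+tile] = q[i:i+tile] @ k[j:j+tile].T
    return scores

# Sanity check against the untiled reference
q, k = torch.randn(1024, 64), torch.randn(1024, 64)
assert torch.allclose(tiled_scores(q, k), q @ k.T, atol=1e-5)
```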

**Introducing the "Poison Pill"**

To trick the compiler, I injected a dummy operation I call the "Poison Pill": it looks significant to the compiler's dependency analysis but is mathematically irrelevant to the model output. With the pill in place, the compiler finally respected my tiling logic.
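One way to sketch the idea in PyTorch (illustrative, not necessarily the exact op from the repo): insert an Add whose second operand arrives as a runtime input rather than a constant, so constant folding can't remove it and the fusion pattern matcher no longer sees two adjacent MatMuls to merge.

```python
import torch

class PoisonPill(torch.nn.Module):
    """Illustrative fusion blocker: adds a tensor that is exported as a graph
    input rather than a constant, so the compiler cannot prove the Add is a
    no-op and won't fuse the MatMuls on either side of it."""
    def forward(self, x: torch.Tensor, pill: torch.Tensor) -> torch.Tensor:
        # pill is all zeros at runtime: numerically x, structurally a new node
        return x + pill

x = torch.randn(32, 32)
out = PoisonPill()(x, torch.zeros(1))  # broadcast add of zero: out == x
assert torch.equal(out, x)
```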

**The "SigLIP Cliff": Solving Accuracy Collapse**

Getting it to run was step one; getting it right was step two. When I first got the NPU running end-to-end, the output was garbage. The culprit was SigLIP's activation profile: massive outlier spikes sitting right next to tiny visual signals, a dynamic range the NPU's reduced-precision arithmetic couldn't handle.

**Solving the Problem: A "Sandwich" Domain Shift**

I implemented a simple trick called the "Sandwich" Domain Shift, which restored signal fidelity from 0.02 to 0.999, effectively matching the reference output.
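A minimal sketch of the pattern, assuming the "sandwich" works the way its name suggests: divide into a small-magnitude domain before the overflow-prone matmul, then multiply back out afterwards, folding the correction into the attention's existing 1/√d scale. The `alpha` factor and shapes here are illustrative:

```python
import torch

def sandwiched_scores(q: torch.Tensor, k: torch.Tensor,
                      head_dim: int, alpha: float = 64.0) -> torch.Tensor:
    """Scale down -> matmul on the NPU -> scale back up. Exact because the
    matmul is linear in q: ((q / alpha) @ k.T) * alpha == q @ k.T, but the
    intermediate values never reach the spiky magnitudes that wreck fidelity."""
    scale = head_dim ** -0.5                    # standard attention scaling
    small = (q / alpha) @ k.transpose(-1, -2)   # stays in a safe numeric range
    return small * (alpha * scale)              # shift back, folded with 1/sqrt(d)
```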

**The Final Piece of the Puzzle: Graph Sharding and Core Scheduling**

To bypass driver timeouts caused by the sheer number of tiles, I physically cut the model graph into 26 separate binary files (shards). I wrote a custom User-Space Runtime in Python that acts as an orchestrator, manually loading these shards onto the RK3588's 3 separate NPU cores and firing them in a synchronized round-robin schedule.
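Here is a stripped-down sketch of that runtime, assuming rknn-toolkit-lite2 on the board; the shard file names and the assumption that each shard's outputs feed directly into the next shard's inputs are illustrative:

```python
# Sketch of the user-space shard orchestrator using rknn-toolkit-lite2.
from rknnlite.api import RKNNLite

CORES = [RKNNLite.NPU_CORE_0, RKNNLite.NPU_CORE_1, RKNNLite.NPU_CORE_2]

def load_shards(n: int = 26) -> list:
    runtimes = []
    for i in range(n):
        rt = RKNNLite()
        assert rt.load_rknn(f"shards/shard_{i:02d}.rknn") == 0
        # Round-robin: pin shard i to NPU core (i mod 3).
        assert rt.init_runtime(core_mask=CORES[i % 3]) == 0
        runtimes.append(rt)
    return runtimes

def run_encoder(runtimes: list, pixel_values) -> list:
    outputs = [pixel_values]
    for rt in runtimes:
        # Each shard consumes the previous shard's outputs; each call blocks
        # until its core finishes, keeping the chain synchronized.
        outputs = rt.inference(inputs=outputs)
    return outputs
```

With a single image the shards still run one after another; the round-robin pinning pays off once multiple frames are pipelined, so up to three shards execute concurrently on different cores.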

**Conclusion: Breaking the Binary Notion of "Supported Hardware"**

This project challenged the binary notion of "Supported Hardware." The RK3588 didn't support the SigLIP encoder out-of-the-box on the standard SDK, but the silicon was always capable of it. It just needed an engineer to dig into the register overflow codes and manage memory manually.

**The Full Code: A Repository for the Curious**

If you want to see the full code, including the tiling logic and the runtime orchestrator, check out the repository below.

View on GitHub