**Reproducing DeepSeek's mHC: When Residual Connections Explode**

Modern transformers, from GPT-5 to Claude and Llama, all rely on the same residual connection design introduced in 2016. This design, which simply adds the layer's output to the input flow, has been stable and reliable – but is it optimal? DeepSeek asked: what if we made the residual connections wider?

The answer lies in Hyper-Connections (HC), a more expressive approach that expands the residual stream into multiple parallel streams combined through learnable mixing matrices. This design can amplify signals, not just route them, but the extra expressivity comes with a significant caveat: left unconstrained, HC lets signal magnitudes grow exponentially with depth.
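
To make the contrast with plain residuals concrete, here is a minimal sketch in PyTorch, not DeepSeek's actual implementation: the class name, the static `read`/`mix` parameters, and the shapes are my own simplifications, and the real Hyper-Connections design also includes dynamic, input-dependent weights.

```python
import torch
import torch.nn as nn


class HyperConnectionSketch(nn.Module):
    """Illustrative hyper-connection wrapper (simplified).

    A standard residual block updates a single stream: x = x + layer(x).
    Here the residual is carried as n parallel streams of width d: a learnable
    read vector collapses the streams into the layer input, and a learnable
    (n+1) x n mixing matrix writes the layer output back while remixing the
    streams among themselves.
    """

    def __init__(self, layer: nn.Module, n_streams: int = 4):
        super().__init__()
        self.layer = layer
        n = n_streams
        # How much of each stream feeds the wrapped layer.
        self.read = nn.Parameter(torch.full((n,), 1.0 / n))
        # Rows index [n streams, layer output]; columns index the n output streams.
        # Initialized so stream-to-stream mixing starts as the identity and the
        # layer output is spread evenly across the streams.
        self.mix = nn.Parameter(torch.cat([torch.eye(n), torch.full((1, n), 1.0 / n)]))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n, batch, seq, d)
        x = torch.einsum("n,nbtd->btd", self.read, streams)   # collapse to layer input
        y = self.layer(x)                                      # (batch, seq, d)
        stacked = torch.cat([streams, y[None]], dim=0)         # (n+1, batch, seq, d)
        # Nothing constrains self.mix: it can scale signals up, not just average them.
        return torch.einsum("kn,kbtd->nbtd", self.mix, stacked)
```

Stacking many of these blocks composes their learned mixing matrices, and that composition is where the trouble starts.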

**The Problem with Unconstrained Mixing Matrices**

When left unconstrained, the mixing matrices in HC can amplify signals exponentially with depth. At small scale this looks manageable (my 10M-parameter reproduction peaked at 9.2x amplification), but the problem compounds as the model grows.

DeepSeek observed this firsthand when they scaled their HC implementation to 27B parameters: the Amax metric (the maximum over the row and column absolute sums of the mixing matrix) soared to 3000, an unsustainable level for any training run. That is not a typo: three thousand times amplification at scale.
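
In code, the metric boils down to something like the helper below; the name `amax_gain` is mine, and this is a sketch of the definition quoted above rather than a quote of anyone's actual tooling.

```python
import torch


def amax_gain(mix: torch.Tensor) -> float:
    """Amax of a mixing matrix: the larger of the maximum absolute row sum
    and the maximum absolute column sum. An identity mixing matrix (a plain
    residual connection) scores exactly 1; values above 1 mean the matrix
    can grow some signals instead of merely averaging them."""
    row_sums = mix.abs().sum(dim=1)   # one value per row
    col_sums = mix.abs().sum(dim=0)   # one value per column
    return float(torch.maximum(row_sums.max(), col_sums.max()))
```

Tracked per layer during training, this is the number that peaked at 9.2x in my 10M-parameter runs and, per DeepSeek, at 3000 for 27B.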

**The Fix: Constrained Mixing Matrices**

DeepSeek's solution is elegant in its simplicity: constrain the mixing matrices to be doubly stochastic, i.e. nonnegative with every row and every column summing to 1. A matrix like that can only take weighted averages of the streams, which rules out arbitrary transformations and signal amplification by construction. The network still learns unconstrained raw weights H; a Sinkhorn-Knopp projection then turns them into the actual mixing matrix, so the matrix that touches the streams is always doubly stochastic. This constrained variant is the mHC of the title.

Twenty iterations of the projection are enough in practice, and because every step is differentiable, gradients flow back through all twenty iterations to the raw weights. The constraint may seem like a limitation, but it is really a guarantee, a principled choice that makes the architecture work at scale.
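
For concreteness, here is a minimal sketch of that projection, assuming the textbook Sinkhorn-Knopp iteration: exponentiate so every entry is positive, then alternately normalize rows and columns. The function name, the exp parameterization, and the epsilon guard are my assumptions, not necessarily DeepSeek's exact implementation.

```python
import torch


def sinkhorn(raw_weights: torch.Tensor, n_iters: int = 20, eps: float = 1e-8) -> torch.Tensor:
    """Map raw weights H onto an (approximately) doubly stochastic matrix.

    Every operation below is differentiable, so gradients reach the raw
    weights through all n_iters iterations of the projection.
    """
    mix = torch.exp(raw_weights)                           # strictly positive entries
    for _ in range(n_iters):
        mix = mix / (mix.sum(dim=1, keepdim=True) + eps)   # rows sum to ~1
        mix = mix / (mix.sum(dim=0, keepdim=True) + eps)   # columns sum to ~1
    return mix
```

The exponential keeps every entry nonnegative, and the alternating normalization drives row and column sums to 1, which is exactly the doubly stochastic property the constraint relies on.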

**Results**

At 10M parameters, my reproduction shows HC winning on raw performance: a validation loss of 0.88 versus 1.12 for standard residual connections. On variance and stability, however, mHC shines: it has zero variance across seeds and runs, while HC's loss varies by up to 3x.

At 27B parameters, the story changes dramatically. HC may still outperform standard residuals on raw performance, but its instability becomes a liability: Amax amplification peaks at an unsustainable 3000x. This is why mHC is essential for large-scale models: because every row and column of a doubly stochastic mixing matrix sums to exactly 1, the Amax gain is pinned at 1, so stability is guaranteed rather than hoped for.
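
The guarantee is easy to check numerically with the `sinkhorn` and `amax_gain` helpers sketched above (applied here to a made-up 4x4 mixing matrix): no matter how badly scaled the raw weights are, the projected matrix cannot amplify.

```python
import torch

torch.manual_seed(0)
raw = torch.randn(4, 4) * 5    # deliberately large, badly scaled raw weights
mix = sinkhorn(raw)            # project onto the doubly stochastic set
print(amax_gain(mix))          # ~1.0: weighted averaging only, no amplification
print(amax_gain(raw))          # >> 1: this is the quantity that compounds with depth
```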

**Conclusion**

Residual connections are more than just a trick to help gradients flow; they're a conservation law that constrains what's possible but enables prediction. HC breaks this conservation, while mHC restores it by enforcing doubly stochastic mixing matrices. This constraint may seem restrictive, but it's actually a guarantee – and one that makes the architecture work at scale.

This is Part 1 of a two-part series. Part 2 will explore the instability of HC at larger scales and dig into exactly what goes wrong with unconstrained mixing matrices. Stay tuned for Thursday's post to see where things break.