Before we talk about the mHC paper, we need to understand the residual connection paradigm and Hyper-Connections (HC).

ResNets And Hyper-Connections
The structure of a single residual layer can be formulated as x_{l+1} = x_l + F(x_l), where F is the layer's transformation (e.g., Attention or MLP).
The term identity mapping refers to the x_l term itself, emphasizing that the signal from the shallower layer maps directly to the deeper layer without any modification.
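As a minimal sketch (the names and the toy F below are hypothetical), the residual update x_{l+1} = x_l + F(x_l) looks like this in code:

```python
import numpy as np

def residual_layer(x, F):
    """Standard residual update: the identity term x is added back to
    the layer's transformation F(x) without any modification."""
    return x + F(x)

# Toy usage: F stands in for an Attention/MLP block of width C.
C = 8
x = np.random.randn(C)
W = np.random.randn(C, C) * 0.1
x_next = residual_layer(x, lambda h: np.tanh(W @ h))
```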
Hyper-Connections introduce a new dimension to the residual connection. The improvements are as follows:
- Decoupled Information Capacity. HC decouples the width of the residual stream from the layer's input dimension, allowing the model to carry much more information between layers than a standard residual connection.
- Expanding the width of the residual stream, which enhances connection complexity and diversifies connectivity patterns. HC achieves this by replacing the simple addition of the standard residual connection with a more complex system involving expanded dimensions and learnable matrices.
Instead of a single vector of dimension C, HC maintains a hidden matrix of dimension n×C, where n is the expansion rate. This effectively creates an n-stream residual. Meanwhile, HC introduces three learnable linear mappings: one that aggregates the n streams into the layer's input, one that writes the layer's output back onto the streams, and a residual mapping H_res that mixes the streams directly (a sketch follows below).
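Here is a minimal sketch of one HC block. The names H_pre, H_post, and H_res are assumptions for illustration, and the paper's exact parameterization (e.g., dynamic, input-dependent weights) is omitted:

```python
import numpy as np

def hc_layer(H, layer_fn, H_pre, H_post, H_res):
    """One Hyper-Connections block (illustrative sketch).

    H       : (n, C) hidden matrix -- the n-stream residual state
    layer_fn: the layer's transformation (stand-in for Attention/MLP)
    H_pre   : (1, n) aggregates the n streams into the layer's input
    H_post  : (n, 1) writes the layer's output back onto the streams
    H_res   : (n, n) mixes the streams directly (the residual mapping)
    """
    x = H_pre @ H                    # (1, C): layer input
    y = layer_fn(x)                  # (1, C): layer output
    return H_res @ H + H_post @ y    # (n, C): residual mix + write-back

# Toy usage with expansion rate n=4 and width C=8.
n, C = 4, 8
H = np.random.randn(n, C)
W = np.random.randn(C, C) * 0.1
H_pre = np.full((1, n), 1.0 / n)
H_post = np.ones((n, 1))
H_res = np.eye(n)   # unconstrained in HC; mHC constrains it (see below)
H_next = hc_layer(H, lambda x: np.tanh(x @ W), H_pre, H_post, H_res)
```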
However, the unconstrained nature of the learnable H_res mapping in HC leads to two primary issues: numerical instability and excessive system overhead.
- Since H_res is unconstrained, the composite product of these mappings accumulated across layers can deviate significantly from the identity matrix, so signals can be arbitrarily amplified or attenuated as they propagate through depth.
- While HC maintains high FLOPs efficiency, it hits a massive "memory wall" in practice. On the I/O front, it scales memory-access costs by a factor of n; even the fastest GPU ends up idling while waiting for data to load, which tanks overall throughput. The pressure extends to the memory footprint: HC generates massive intermediate activations that must be kept for backpropagation. To avoid Out-Of-Memory (OOM) errors, engineers are often forced to use gradient checkpointing, paying extra compute to re-calculate activations just to save VRAM. Finally, in distributed setups using pipeline parallelism, HC multiplies communication volume by n; the resulting lag between stages creates large "pipeline bubbles", idle gaps where GPUs sit unproductive, wasting expensive hardware resources. A rough sense of the n-fold I/O scaling is sketched below.
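To make the I/O scaling concrete, here is a back-of-the-envelope calculation. The width, expansion rate, and dtype are hypothetical and not taken from the paper:

```python
# Illustrative arithmetic only -- hypothetical model configuration.
C = 4096            # hidden width
n = 4               # HC expansion rate
bytes_per_elem = 2  # bf16

residual_bytes = C * bytes_per_elem   # standard residual stream, per token
hc_bytes = n * C * bytes_per_elem     # n-stream residual state, per token

# 8192 vs 32768 bytes read/written around each layer, per token.
print(residual_bytes, hc_bytes)
```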
The mHC Framework
The central premise of mHC is to constrain the residual mapping H_res onto a specific manifold: the set of doubly stochastic matrices.
A doubly stochastic matrix is a square matrix with non-negative entries where both rows and columns sum to 1. This constraint confers several powerful theoretical properties:
- The set of doubly stochastic matrices is closed under matrix multiplication. This ensures that the composite mapping across multiple layers remains doubly stochastic, preserving stability throughout the network's depth.
- The spectral norm of a doubly stochastic matrix is bounded by 1. This effectively mitigates the risk of gradient explosion (see the numerical check after this list).
- Applying a doubly stochastic matrix never concentrates the streams: information mixing across streams can only be preserved or increased, so mixing grows monotonically with depth.
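A quick numerical check of the first two properties. This is just a sketch that builds doubly stochastic matrices as convex combinations of permutation matrices (Birkhoff's theorem); it is not code from the paper:

```python
import numpy as np

def random_doubly_stochastic(n, k=10, seed=0):
    """Build a doubly stochastic matrix as a convex combination of
    random permutation matrices."""
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(np.ones(k))                          # convex weights
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(k)]
    return sum(wi * P for wi, P in zip(w, perms))

A = random_doubly_stochastic(4, seed=0)
B = random_doubly_stochastic(4, seed=1)

# Closure: the product is still doubly stochastic (rows and columns sum to 1).
P = A @ B
print(np.allclose(P.sum(axis=0), 1.0), np.allclose(P.sum(axis=1), 1.0))

# Bounded spectral norm: the largest singular value never exceeds 1.
print(np.linalg.norm(A, 2) <= 1.0 + 1e-9)
```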
To be practical for large-scale training, mHC incorporates a suite of rigorous infrastructure optimizations. These optimizations enable mHC (with an expansion rate of n=4) to operate with a marginal training overhead of only 6.7%.
- Kernel fusion: implemented with the TileLang framework, fusing operations including RMSNorm, the linear mappings, and the Sinkhorn iterations into a single GPU kernel.
- Selective recomputation: only the outputs of the computation-intensive modules (Attention/MLP) are stored; the lightweight n-fold residual-stream intermediate states are discarded and recomputed immediately when needed.
- Pipeline scheduling: based on DualPipe scheduling. High-priority compute streams are assigned to the final step (responsible for data packing) so that data is delivered to the communication stream at the earliest possible time, and persistent kernels are foregone to leave "preemption gaps" that allow high-priority tasks to preempt.
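The Sinkhorn iterations mentioned in the kernel-fusion item can be sketched as alternating row and column normalization of a positive matrix (Sinkhorn-Knopp). This is a generic illustration of how an unconstrained parameter matrix can be projected toward the doubly stochastic manifold; the paper's exact parameterization and iteration count may differ:

```python
import numpy as np

def sinkhorn_project(logits, n_iters=20):
    """Map an unconstrained matrix onto (approximately) the doubly
    stochastic manifold: exponentiate for positivity, then alternately
    normalize rows and columns (Sinkhorn-Knopp)."""
    M = np.exp(logits - logits.max())   # positive entries, numerically stable
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M

H_res_raw = np.random.randn(4, 4)       # unconstrained learnable parameters
H_res = sinkhorn_project(H_res_raw)
print(np.round(H_res.sum(axis=0), 3), np.round(H_res.sum(axis=1), 3))
```

In mHC this kind of projection runs inside the fused TileLang kernel together with RMSNorm and the linear mappings, which is what keeps the training overhead marginal.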
