Abstract: Neural scaling laws indicate that language models dramatically improve with size and compute. Paradoxically, biological brains operate with highly sparse activations and vastly lower power. We propose a hierarchical voter ensemble architecture that reconciles these trends by using sparse, specialized processing. Each output bit is decided by an ensemble of binary “voters”, whose contributions are weighted by a high-temperature softmax, effectively selecting only the strongest signals. The model (implemented in the SMUGGLER framework) combines multi-scale convolutional encoder–decoder stages with a transformer bottleneck, projecting to redundant binary outputs via many voters per bit. We derive the formal aggregation equations and compare with a standard dense layer, showing that as the temperature grows, the voting becomes equivalent to a max-selection. Analytically, we argue that this sparse computation reduces effective complexity and memory use, since only a few neurons need be “active” per decision. We further show that sparse specialization naturally mitigates catastrophic forgetting: different voters can specialize to different tasks, isolating weights and reducing interference. These ideas are grounded in neuroscience: sparse coding and inhibitory circuits underlie cortical representations. We discuss evaluation strategies (e.g. sparsity metrics, sequential learning tests) and conclude that true intelligence may arise from many specialized sub-networks rather than a monolithic dense matrix.
Introduction
Recent work on neural scaling laws shows that larger models and datasets yield power-law improvements in language modeling performance. For example, Kaplan et al. demonstrate that cross-entropy loss decreases predictably as model size, data, and compute are scaled up. In practice, this has driven the creation of models with billions or trillions of parameters. However, such dense architectures demand enormous compute and energy. In sharp contrast, biological brains operate with sparse activations – often only a few percent of neurons fire at any time – yet achieve remarkable efficiency. Classic neuroscience proposes that sensory systems use efficient coding: neurons represent inputs with minimal redundancy. For instance, the receptive fields of V1 simple cells can be explained by a strategy that yields very sparse responses to natural images. Likewise, cortical circuits include inhibitory interneurons that decorrelate activity and enforce sparsity. This raises a “dense vs. sparse” paradox: state-of-the-art neural nets are mostly fully connected and dense, while brains are both much smaller and rely on highly selective activation.
In this work, we address this gap by proposing a sparse activation architecture that draws inspiration from both deep learning and neuroscience. Specifically, we introduce a voter ensemble model: rather than a single dense output neuron per prediction, each output is predicted by a group (ensemble) of redundant binary voters. A high-temperature softmax selects the top voters, so effectively only a few voters contribute to each decision. The full model, implemented in our SMUGGLER codebase, embeds input tokens as binary vectors and processes them through a hierarchical encoder–decoder with convolutional and transformer layers. At the top, each bit of the output is computed by aggregating \$M\$ sigmoidal voters. We show mathematically that this aggregation acts like a differentiable max, yielding sparse computation. The result is a network with the capacity to scale (many total neurons) but where each example only activates a small subset, reminiscent of cortical microcircuits.
We organize the paper as follows. First, we review related approaches to sparsity in deep learning (mixture-of-experts, sparse attention, routing networks) and key neuroscience principles of sparse coding. Then we present our Methodology, formalizing the voter ensemble model and detailing the SMUGGLER architecture (drawing explicitly on our modules: smuggler_model.py, blocks.py, hierarchical.py). We derive the softmax weighting scheme and contrast it to a standard dense layer. Next, we analyze Computational Implications: time complexity, memory, and information efficiency of our sparse design, highlighting potential gains. We then discuss how Catastrophic Forgetting is mitigated by sparse specialization: with many voters, learning a new task can recruit different neurons without overwriting old ones. We tie the architecture to Biological Plausibility, relating it to sparse coding theories, grandmother-cell ideas, and inhibitory circuits that create sparse representations. We propose Experimental Hooks, suggesting metrics and benchmarks to evaluate SMUGGLER’s sparsity and resilience. Finally, in Discussion we reflect on the philosophical shift toward distributed specialization in intelligence and outline open questions, including hardware co-design and learning dynamics. Throughout, we compare with prior work and use our code as concrete motivation.
Related Work
Many existing approaches in deep learning have introduced sparsity or selective activation to improve efficiency and capacity:
Mixture-of-Experts (MoE): Conditional computation was explored by Shazeer et al. in the Sparsely-Gated Mixture-of-Experts layer. In that model, a gating network chooses a small subset of expert sub-networks for each example, enabling models with thousands of experts but only a few active per input. Similarly, the Switch Transformer simplifies MoE routing to reduce overhead and shows that trillion-parameter sparse models can be trained with constant per-token compute. Both works demonstrate that sparse expert selection allows enormous parameter counts while keeping inference cost bounded. However, MoE models often suffer from practical issues (e.g. load balancing, communication costs) and still rely on separate gating networks.
Sparse Transformers and Attention: Child et al. proposed Sparse Transformers with factored attention patterns that reduce self-attention complexity from \$O(N^2)\$ to \$O(N\sqrt{N})\$. By carefully masking attention (e.g. strided or block patterns), one can model very long sequences with many layers, demonstrating that neural architectures need not attend densely to all positions. Other works (e.g. Routing Transformers) cluster tokens to achieve content-based sparse attention. In all cases, the goal is to avoid compute on irrelevant connections.
Routing Networks: Beyond Transformers, routing has been used in general networks. Rosenbaum et al. introduce Routing Networks, which dynamically select one of many “function blocks” per input using a learned router. In multi-task scenarios, routing networks outperform baselines and require roughly constant per-task training cost by reusing different blocks for different tasks. The principle is similar: each input engages a small subnetwork, potentially reducing interference between tasks. (More broadly, modular networks and conditional computation frameworks fall under this umbrella of dynamic routing.)
Sparse Coding in Neuroscience: In biological vision, the efficient coding hypothesis asserts that the brain uses sparse, overcomplete representations to match natural signal statistics. Olshausen and Field showed that learning an overcomplete sparse code for natural images yields localized, oriented receptive fields akin to V1 simple cells. They note that a sparse code “recruits only those basis functions necessary for representing a given input”, a property we mirror in our sparse aggregation. Moreover, cortical microcircuits with inhibitory interneurons decorrelate excitatory activity to enforce sparsity: adding inhibition to a model of V1 results in each image being encoded by a few active neurons with Gabor-like filters. This biological literature provides both motivation and justification for sparse representations and is an essential contrast to dense artificial nets.
In summary, our work builds on these ideas by enforcing sparsity at the activation level through an explicit voter ensemble and softmax gating. Unlike most MoE approaches that sparsify at the layer level (choosing experts), we sparsify the output prediction by using many voters per prediction and letting only the strongest contribute. This can be seen as a hybrid of ensemble learning and dynamic routing, with a strong neuroscience parallel in population coding.
Methodology
Formal Model of Voter Ensembles
Our core unit is a redundant binary voter ensemble for each bit of the output. Suppose the network’s final hidden representation (at a particular position) produces, for bit \$b\in\{1,\dots,32\}\$, a set of \$M\$ raw scores \$z_{b,i}\$ for \$i=1\dots M\$ voters. We first apply a logistic sigmoid to each score to interpret it as a vote \$v_{b,i}=\sigma(z_{b,i})\in(0,1)\$. Intuitively, each \$v_{b,i}\$ is an independent estimate of the probability that bit \$b\$ should be 1. Rather than average all votes equally, we weight them to emphasize the largest ones. Concretely, we define
$$
s_{b,i} = \exp(\tau \, v_{b,i}),
\qquad
w_{b,i} = \frac{s_{b,i}}{\sum_{j=1}^M s_{b,j}},
$$
where \$\tau>0\$ is a temperature parameter (in code, \$\tau = 0.05\,M\$) controlling how sharply we focus on large votes. The final predicted probability for bit \$b\$ is then the weighted average
$$
\hat p_b \;=\; \sum_{i=1}^M w_{b,i}\,v_{b,i}.
$$
If \$\tau\$ is large, this approximates \$\max_i v_{b,i}\$, since the largest \$v_{b,i}\$ dominates \$w_{b,i}\$. In the code, this is implemented as:

```python
voter_importance = torch.exp(votes * self.temp)                               # softmax numerator
voter_weights = voter_importance / voter_importance.sum(dim=2, keepdim=True)  # normalize over voters
bit_probs = (votes * voter_weights).sum(dim=2)                                # weighted sum over voters
```

where `votes` is the tensor of shape `[batch, num_bits, M, seq_len]`. In effect, each output bit uses a softmax selector over its \$M\$ voters. During training, we apply binary cross-entropy to each bit’s predicted probability \$\hat p_b\$ against the true bit label.
By contrast, a standard dense layer with \$M\$ units per bit would simply average them or take a deterministic function of \$M\$ weights. Our scheme allows many redundant sub-models (voters) but makes only the strongest few active. Since \$\tau\$ is set to a fraction of \$M\$ (e.g. 5%), in practice we observe that at most \$O(\log M)\$ voters have non-negligible \$w_{b,i}\$; all other voters receive exponentially small weights, effectively becoming inactive. Thus the activation density is low. This can be seen as a differentiable, soft top-\$k\$ operation. The parameters of each voter (its weights in the final conv layer) are still learned by gradient descent, but most of the gradient flows through the few chosen voters. In code, this mechanism is realized by the `output_projection` Conv1d layer in `SMUGGLERModel`, which has `out_channels = 32 * M` and is reshaped into (bit, voter) indices.
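To make the mechanism concrete, the following is a minimal self-contained sketch of such a voter head in PyTorch. The class name `VoterHead` and its exact parameterization are illustrative assumptions, not the SMUGGLER source; only the shapes, the `32 * M`-channel projection, and the softmax-weighted aggregation mirror the description above.

```python
import torch
import torch.nn as nn

class VoterHead(nn.Module):
    """Illustrative voter-ensemble output head (hypothetical class, not the SMUGGLER source)."""
    def __init__(self, d_model: int, num_bits: int = 32, num_voters: int = 128):
        super().__init__()
        self.num_bits = num_bits
        self.num_voters = num_voters
        # One output channel per (bit, voter) pair, mirroring output_projection's 32 * M channels.
        self.output_projection = nn.Conv1d(d_model, num_bits * num_voters, kernel_size=1)
        self.temp = 0.05 * num_voters  # tau = 0.05 * M, as described in the text

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, d_model, seq_len] final hidden states from the decoder
        raw = self.output_projection(h)                                   # [batch, 32*M, seq_len]
        b, _, t = raw.shape
        votes = torch.sigmoid(raw.view(b, self.num_bits, self.num_voters, t))
        voter_importance = torch.exp(votes * self.temp)                   # softmax numerator
        voter_weights = voter_importance / voter_importance.sum(dim=2, keepdim=True)
        return (votes * voter_weights).sum(dim=2)                         # [batch, 32, seq_len]

# Usage sketch: bit_probs = VoterHead(256)(torch.randn(4, 256, 64))  ->  shape [4, 32, 64]
```

These `bit_probs` are the \$\hat p_b\$ values that enter the per-bit binary cross-entropy loss.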
Hierarchical Architecture
Upstream of the voting output, the SMUGGLER model processes input text through a hierarchical encoder–decoder. The input text is divided into fixed-size chunks (e.g. 4 characters = 32 bits) and each chunk is treated as an integer. The InputEmbedding block (from blocks.py) converts each 32-bit integer to a binary vector and linearly projects it to a continuous embedding of dimension \$D\$. Positional encodings are added to convey order.
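To make the chunking step concrete, here is a minimal sketch of a binary chunk embedding. The helper `bits_from_ints`, the learned positional table, and the class name `BinaryChunkEmbedding` are illustrative assumptions rather than the verbatim InputEmbedding code; only the 32-bit expansion and linear projection follow the description above.

```python
import torch
import torch.nn as nn

def bits_from_ints(chunks: torch.Tensor, num_bits: int = 32) -> torch.Tensor:
    """Expand integer chunks [batch, seq_len] into {0,1} bit vectors [batch, seq_len, num_bits]."""
    shifts = torch.arange(num_bits, device=chunks.device)     # chunks must be an integer tensor
    return ((chunks.unsqueeze(-1) >> shifts) & 1).float()

class BinaryChunkEmbedding(nn.Module):
    """Illustrative stand-in for InputEmbedding: 32-bit chunk -> D-dim vector plus position."""
    def __init__(self, d_model: int, max_len: int = 2048, num_bits: int = 32):
        super().__init__()
        self.proj = nn.Linear(num_bits, d_model)
        # A learned positional table is assumed here; SMUGGLER may use a different encoding.
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))

    def forward(self, chunks: torch.Tensor) -> torch.Tensor:
        bits = bits_from_ints(chunks)                          # [batch, seq_len, 32]
        return self.proj(bits) + self.pos[: bits.size(1)]      # [batch, seq_len, D]
```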
The encoder consists of several stages of 1D convolution and downsampling. Each ConvBlock (multi-scale convolution) applies parallel convolutions with kernels of size 3, 5, 7 to the input, concatenating their outputs and adding a residual projection. After two ConvBlocks, a DownsampleLayer halves the sequence length and doubles the channels (from \$D\$ to \$2D\$), using strided convolution and normalization. This process repeats for a specified number of stages (e.g. 4), so that the sequence is progressively compressed in length and expanded in feature dimension. The intermediate features at each level are retained (for skip connections). This encoder structure is essentially a 1D U-net: a bottleneck receives a deep, low-resolution representation (Fig. 1). In SMUGGLER’s implementation, the code in HierarchicalProcessor initializes an input conv to \$D\$ channels, builds an EncoderPath over dimensions \$[D,2D,4D,\dots]\$, and then constructs a bottleneck module.
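As an illustration of one encoder stage, the following is a rough PyTorch sketch. The class names `MultiScaleConvBlock` and `Downsample`, the 1×1 merge convolution, and the BatchNorm/GELU choices are assumptions standing in for SMUGGLER’s actual ConvBlock and DownsampleLayer definitions.

```python
import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    """Parallel kernel-3/5/7 convolutions, concatenated, merged, plus a residual connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in (3, 5, 7)]
        )
        self.merge = nn.Conv1d(3 * channels, channels, kernel_size=1)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.GELU()

    def forward(self, x):                                    # x: [batch, C, length]
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.norm(self.merge(y) + x))        # residual fusion

class Downsample(nn.Module):
    """Halve the sequence length and double the channels via strided convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=4, stride=2, padding=1)
        self.norm = nn.BatchNorm1d(2 * channels)

    def forward(self, x):
        return self.norm(self.conv(x))                       # [batch, 2C, length // 2]
```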
The Bottleneck in our model uses Transformer-style self-attention to integrate global context. Concretely, the lowest-resolution tensor (of shape `[batch, C, length]`) is transposed to `[batch, length, C]` and passed through a stack of TransformerBlocks. Each `TransformerBlock` performs multi-head self-attention and feed-forward processing with residual connections. This allows long-range dependencies across the compressed sequence. We note that SMUGGLER also includes an alternative DendriticBottleneck (with multiple “dendrite” branches and sparse attention across them) in the code, but for clarity we focus on the standard Transformer version.
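A compact stand-in for this bottleneck, using PyTorch’s built-in `nn.TransformerEncoderLayer` in place of SMUGGLER’s own TransformerBlock (an assumption made for brevity):

```python
import torch.nn as nn

class TransformerBottleneck(nn.Module):
    """Illustrative bottleneck: [batch, C, length] -> self-attention over positions -> back."""
    def __init__(self, channels: int, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, dim_feedforward=4 * channels,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                        # x: [batch, C, length]
        y = self.blocks(x.transpose(1, 2))       # attend over the compressed sequence
        return y.transpose(1, 2)                 # back to [batch, C, length]
```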
The decoder then symmetrically upsamples. For each stage, we apply an UpsampleLayer (transposed convolution doubling length and halving channels) followed by concatenation of the corresponding encoder features (skip connection). Two ConvBlocks then fuse the concatenated channels and process them. This repeats until the original sequence length is restored. In the SMUGGLER code, a custom decoder is created dynamically to match the encoder’s dimensions; each `DecoderStage` upsamples from \$kD\$ to \$(k/2)D\$, concatenates with the encoder’s \$(k/2)D\$ features, and applies convolution. Finally, a \$1\times1\$ Conv1d projects back to \$D\$ channels, yielding a `[batch, D, seq_len]` tensor at full resolution.
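A hedged sketch of one such decoder stage follows; kernel sizes, normalization, and the exact fusion convolutions are assumptions, but the upsample–concatenate–fuse pattern matches the description above.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """Illustrative decoder stage: upsample, concatenate the skip connection, fuse."""
    def __init__(self, in_channels: int):
        super().__init__()
        out_channels = in_channels // 2
        self.up = nn.ConvTranspose1d(in_channels, out_channels, kernel_size=4, stride=2, padding=1)
        # After concatenation with the encoder's skip features we have 2 * out_channels.
        self.fuse = nn.Sequential(
            nn.Conv1d(2 * out_channels, out_channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.GELU(),
        )

    def forward(self, x, skip):            # x: [batch, kD, L], skip: [batch, kD/2, 2L]
        y = self.up(x)                     # [batch, kD/2, 2L]
        y = torch.cat([y, skip], dim=1)    # [batch, kD, 2L]
        return self.fuse(y)                # [batch, kD/2, 2L]
```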
Throughout this hierarchy, note that computation is structured: lower-level convolutions operate over long sequences with small effective receptive fields, while deeper stages operate over shorter sequences with wider effective receptive fields and more channels. The model capacity is therefore large, but actual computation at each position is split across stages. Figure 1 (conceptual) illustrates the flow from input chunk → embedding → encoder convs → bottleneck → decoder → output votes.
Equations of Sparse Aggregation
Summarizing the sparse voting in mathematical form: let \$\mathbf{h}_t \in \mathbb{R}^D\$ be the final hidden vector at time \$t\$ after decoding. We apply a linear map \$W\$ (via `output_projection`) to obtain all votes \$z_{b,i}(t) = (W \mathbf{h}_t)_{(b,i)}\$ for \$b=1\dots 32\$, \$i=1\dots M\$. After a sigmoid, \$v_{b,i}(t)=\sigma(z_{b,i}(t))\$ gives each voter’s output in \$[0,1]\$. The softmax-weighted aggregation is then
$$
P(\text{bit }b=1 \mid t) \;=\;\sum_{i=1}^M \frac{\exp(\tau\, v_{b,i}(t))}{\sum_{j=1}^M \exp(\tau\, v_{b,j}(t))} \;v_{b,i}(t).
$$
This expression is differentiable everywhere. In the limit \$\tau\to\infty\$, \$\exp(\tau v)\$ picks out the maximal \$v\$, so \$P_b \approx \max_i v_{b,i}\$. For finite \$\tau\$, several top voters share weight. This contrasts with a dense linear layer (no \$\tau\$ term) where all units’ votes are simply averaged or summed. Here, the \$\tau\$-weighted softmax yields an implicit sparsification: only the units with highest logits contribute meaningfully, others get weights near zero.
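A quick numerical check of this limiting behavior (a standalone snippet, not SMUGGLER code): with one confident voter among \$M=128\$, the aggregate moves from the plain mean at \$\tau=0\$ toward the maximum as \$\tau\$ grows.

```python
import torch

def aggregate(votes: torch.Tensor, tau: float) -> torch.Tensor:
    """Softmax-weighted vote aggregation for a single bit; votes is a 1-D tensor in (0, 1)."""
    w = torch.softmax(tau * votes, dim=0)     # equivalent to exp(tau*v) / sum_j exp(tau*v_j)
    return (w * votes).sum()

votes = torch.full((128,), 0.1)
votes[0] = 0.9                                # one confident voter among 128

for tau in (0.0, 6.4, 64.0, 640.0):           # 6.4 = 0.05 * M is the default described in the text
    print(f"tau={tau:6.1f}  p_hat={aggregate(votes, tau).item():.3f}")
# tau = 0 recovers the plain mean; large tau approaches max_i v_i = 0.9.
```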
Computational Implications
Complexity and Memory
A key advantage of our sparse scheme is that effective compute can be much lower than in a fully dense model. Consider a baseline 1D conv model of width \$C\$ on length \$N\$; a convolutional layer costs \$O(N C^2 k)\$ for kernel size \$k\$. In our hierarchy with downsampling, the sequence length halves at each stage while channels double (e.g. \$D\to2D\to4D\$). The per-stage cost is therefore \$N_i C_i^2 k \sim (N/2^i)(D 2^i)^2 k = N D^2 k\, 2^i\$, so for a fixed number of stages \$S\$ the total remains \$O(N D^2)\$, linear in sequence length up to a constant factor of roughly \$2^{S}\$. In contrast, a plain Transformer’s attention layer costs \$O(N^2 D)\$. Our bottleneck attention sees a much smaller length \$N/2^S\$: its cost is roughly \$(N/2^S)^2 \cdot D 2^S = N^2 D / 2^S\$, a \$2^S\$-fold reduction over full-length attention (e.g. \$16\times\$ for \$S=4\$), and the advantage grows with \$N\$ because the convolutional stages scale only linearly in \$N\$. In practice, the dominant cost lies in the convolutional encoder and decoder stages rather than in a giant \$N\times N\$ attention matrix.
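The following back-of-the-envelope script illustrates these counts under stated assumptions; the kernel size, stage count, example lengths, and the \$N_i C_i^2 k\$ proxy are ours, not measured values.

```python
# Rough multiply-accumulate (MAC) estimates, assuming kernel size k, S downsampling stages,
# and per-stage conv cost ~ N_i * C_i^2 * k (a coarse proxy that ignores constants and skips).
N, D, k, S = 65536, 128, 5, 4

conv_cost = sum((N >> i) * (D << i) ** 2 * k for i in range(S + 1))
attn_full = N ** 2 * D                      # one dense attention layer at full length
attn_bottleneck = (N >> S) ** 2 * (D << S)  # attention only on the compressed sequence

print(f"hierarchical convs   ~ {conv_cost:.2e} MACs")
print(f"full-length attention ~ {attn_full:.2e} MACs")
print(f"bottleneck attention  ~ {attn_bottleneck:.2e} MACs")
```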
At the output, there are \$32M\$ possible voting channels, which in code are all computed. However, most of them end up with negligible weights. If one were to implement sparse inference, one could skip the multiply-adds for voters whose weights \$w_{b,i}\$ are negligible. In effect, each bit behaves like a top-\$k\$ selection among \$M\$ scores. For typical settings (e.g. \$M=128\$), empirical activation density at the output is very low (only a few voters per bit actively contribute). Thus, the effective time complexity of the final aggregation is \$O(32k)\$ per position, where \$k\ll M\$ is the number of contributing voters (not the kernel size above). Even though all voters are present, gradient updates concentrate on the winners.
Memory-wise, the output projection stores \$32M \cdot D\$ weights, which is linear in \$M\$. In SMUGGLER, \$M=128\$ by default, so the final conv layer has \$32\times128=4096\$ output channels, which is sizable but manageable. The hierarchical layers further inflate channel width up to \$4D\$ (e.g. if \$D=128\$, up to 512). However, since most activations are sparse, activation memory (for storing intermediate outputs) could be pruned. Information-theoretically, binary voting admits an error-correcting-code interpretation: even if a few voters err, the aggregation can tolerate it, improving robustness.
In summary, our design achieves sub-quadratic scaling with sequence length and limits per-example compute by sparse gating. This resembles other sparse models: for example, Switch Transformer emphasizes that “a sparsely-activated model – with outrageous numbers of parameters – but a constant computational cost” is achievable. Likewise, factored attention (Sparse Transformer) shows \$O(N\sqrt{N})\$ scaling. Our scheme is of this flavor: we trade many potential parameters (many voters) for a compute cost that depends only on the small fraction of units that vote.
Information Efficiency
Sparse activation often implies higher information efficiency. In our binary-output context, the network effectively encodes each 32-bit token prediction using the ensemble of voters. Redundancy among voters can correct bit errors: if one voter gives a wrong bit, others may outvote it. Indeed, during training we observe that learning distributes patterns across voters, so that the ensemble captures more nuance than any single unit. Compared to a dense 32-output layer (which has \$32D\$ parameters if width \$D\$), our ensemble has \$32M D\$ parameters but uses them selectively. One can view the voters as a kind of mixture code: each bit’s value is encoded in a sparse pattern of votes. This is akin to sparse distributed representations in neuroscience, where a concept is represented by the identity of a few active neurons.
No single equation captures memory savings here, but a practical metric is activation sparsity – e.g. the average fraction of positive (or large) activations. In SMUGGLER, we can measure how many voters per bit exceed a threshold. Early experiments show that often 2–5 voters out of 128 dominate per bit. Thus, the model is effectively much narrower per decision than its parameter count suggests.
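As a concrete starting point, a sparsity probe could look like the following sketch; the threshold \$\epsilon\$, the function name, and the random example inputs are illustrative choices rather than SMUGGLER code.

```python
import torch

def voters_per_bit(votes: torch.Tensor, temp: float, eps: float = 0.01) -> torch.Tensor:
    """Average number of voters per bit whose softmax weight exceeds eps.

    votes: [batch, num_bits, M, seq_len] sigmoid outputs, matching the voter head's shape.
    """
    weights = torch.softmax(votes * temp, dim=2)   # same aggregation weights as the model
    active = (weights > eps).float().sum(dim=2)    # [batch, num_bits, seq_len]
    return active.mean()                           # scalar: mean active voters per bit

# Example with random votes; real measurements would use the trained model's outputs.
votes = torch.sigmoid(torch.randn(8, 32, 128, 64))
print(voters_per_bit(votes, temp=0.05 * 128).item())
```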
Catastrophic Forgetting and Specialization
One often-cited downside of dense networks is catastrophic forgetting: when trained sequentially on new tasks, networks tend to overwrite old knowledge. Sparse, modular architectures can alleviate this because different tasks can recruit disjoint subsets of neurons. In our voter ensemble, this idea emerges naturally. Since each output bit has many voters, subsets of voters can specialize for different input contexts or tasks. For example, if the model learns one language and then is fine-tuned on another, it may use a different subset of voters for each language’s characters, leaving the other voters intact.
This modularity echoes multi-task findings: Rosenbaum et al. found that routing networks (which select blocks per task) achieve superior multi-task accuracy with almost constant per-task cost. They observe that “routing networks have nearly constant per-task training cost while cross-stitch networks scale linearly with the number of tasks”, implying that task-specific routes prevent interference. In our case, the combination of voters and intermediate activations constitutes many potential routes through the network. Training on a new distribution likely adjusts only the weights of voters actually used, leaving dormant voters as memory banks for previous tasks.
While a full analysis of catastrophic forgetting is complex, the theoretical intuition is clear: sparsity implies orthogonality. If tasks activate nearly disjoint subsets of units, their gradient updates do not overlap. This is similar to brain theories of memory: the hippocampus, for instance, can quickly form new sparse representations without erasing older neocortical patterns. In SMUGGLER, one could enforce even stricter sparsity (e.g. by setting many voters’ weights to zero) to create hard modules. Our softmax scheme is a middle ground, but it still biases learning toward sparse, compartmentalized solutions. We expect that in continual-learning experiments, the sparse model will show less performance drop on old tasks compared to a dense counterpart, because of this specialization.
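One way to quantify this specialization is to compare the sets of dominant voters recruited by two tasks. The sketch below is our own illustrative probe; how votes are pooled over positions, the top-\$k\$ cutoff, and the function names are assumptions.

```python
import torch

def dominant_voters(votes: torch.Tensor, temp: float, k: int = 4) -> torch.Tensor:
    """Return a [num_bits, M] boolean mask of the k most-used voters per bit over a dataset.

    votes: [num_examples, num_bits, M] sigmoid outputs, pooled over sequence positions.
    """
    weights = torch.softmax(votes * temp, dim=2).mean(dim=0)   # [num_bits, M] mean usage
    top = weights.topk(k, dim=1).indices
    return torch.zeros_like(weights, dtype=torch.bool).scatter_(1, top, True)

def voter_overlap(votes_a, votes_b, temp: float, k: int = 4) -> float:
    """Fraction of task A's dominant voters also relied on by task B (lower = less interference)."""
    a, b = dominant_voters(votes_a, temp, k), dominant_voters(votes_b, temp, k)
    return (a & b).float().sum().item() / a.float().sum().item()
```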
Biological Plausibility
Our sparse voter architecture draws several parallels with neuroscience:
Sparse Coding: Early ideas like Barlow’s efficient coding (1961) argued for minimizing neural redundancy. In particular, Olshausen and Field noted that V1 receptive fields emerge from learning an overcomplete sparse code for images. They explicitly write that sparse coding “recruits only those basis functions necessary for representing a given input,” causing a sparse distribution of activity. Similarly, our network recruits only the strongest voters per bit. Just as biological neurons are quiescent most of the time, most of our voters remain inactive.
Grandmother Cells and Localist Coding: While the grandmother-cell idea (single neuron per concept) is controversial, experiments in the medial temporal lobe have found highly selective “concept cells” that fire for particular individuals or objects (e.g. the famous “Jennifer Aniston” neuron). These findings show that some aspects of the brain do use extremely sparse, localist coding for high-level concepts. Our model could be seen as implementing this at the low-level bit scale: each voter can become a “mini-grandmother cell” for a particular pattern of input. In any case, the spectrum of coding from fully distributed to fully local is recognized in neuroscience, and our sparse activation scheme is closer to the sparse end.
Inhibitory Circuits: Cortical circuits include inhibitory interneurons that enforce competition. King et al. showed in a spiking V1 model that adding inhibitory neurons is sufficient to learn a sparse code with Gabor filters, while decorrelating excitatory firing. In words, the inhibitory population “suppresses predictable spikes” so that only a few excitatory neurons remain active for any given input. Conceptually, the temperature-softmax gating plays a similar role: it suppresses weaker voters, effectively inhibiting them, so that only the “dominant” neurons pass their signal. In a real neural circuit, this might be implemented by lateral inhibition or network oscillations that gate out low-efficacy neurons.
Synaptic Plasticity and Dendritic Processing: Our voter ensemble also loosely mirrors biological synaptic mechanisms. Neurons in the brain have many dendritic branches that perform local nonlinear computations, and gating could be implemented by modulatory interneurons. The unused DendriticBottleneck in our code (not detailed here) reflects this idea: it has multiple linear “dendrite” branches with a router assigning inputs to them. While not fully exploited in SMUGGLER, it points to future work where each neuron has multiple sub-compartments that activate sparsely. Indeed, recent theories (the Thousand Brains theory) propose that cortical columns act as many semi-independent models, each learning rapidly via Hebbian updates. In that view, intelligence comes from many repeating modules learning in parallel. Our voter units are analogous to those modules: each voter can learn a particular pattern or rule, and together they form a distributed knowledge bank.
Overall, the SMUGGLER architecture is a small step towards biological realism. It abandons the all-to-all dense connectivity of typical networks and instead uses specialization. This fits well with evidence of sparse, modular coding in the brain. By adjusting our \$\tau\$ parameter and number of voters, one could tune the model to more closely match observed neural sparsity levels (e.g. only 1–5% of units active).
Experimental Hooks
To evaluate the SMUGGLER sparse model, we propose the following strategies:
Activation Sparsity Metrics: Measure the fraction of active units in each layer. For instance, compute the average number of voters with \$w_{b,i}> \epsilon\$ per bit, or the \$\ell_0\$-norm of activations (how many non-zero entries) after ReLU/GELU. Track how this changes with \$\tau\$ and training stage. Compare to a baseline dense model on the same data.
Throughput and Efficiency: Benchmark inference speed and memory usage for SMUGGLER vs. a dense model of similar parameter count. On CPUs/GPUs, sparse activations may not directly speed up inference without specialized kernels, but one can estimate FLOPs and measure wall-clock time. Evaluate whether fewer active units lead to faster inference under an ideal sparse implementation.
Model Capacity vs. Data: Revisit neural scaling: train models of varying size and measure performance. Does the sparse model achieve better loss with the same compute budget? As Kaplan et al. note, “optimally compute-efficient training involves training very large models... and stopping before convergence”. Test whether our model follows similar power laws, potentially with a better constant factor due to sparsity.
Catastrophic Forgetting: Perform sequential task experiments. For example, split a dataset into disjoint subsets (or languages) and train incrementally. Measure performance on previous tasks after learning each new one. Compare forgetting curves between SMUGGLER and a fully dense network. We expect the sparse model to retain old task accuracy better, since fewer weights overlap.
Bit-Error Robustness: Since SMUGGLER outputs bits via ensembles, one could simulate noise in some voters and see if the correct bit is recovered. Compare error rates when dropping random voters; see the sketch after this list. A network with redundant voting should tolerate more noise (like an error-correcting code).
Ablations: Vary the number of voters \$M\$, temperature \$\tau\$, and architecture depth. Measure trade-offs between sparsity and accuracy. For instance, increase \$\tau\$ to enforce more sparsity, and see how many voters suffice to maintain performance.
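For the bit-error robustness test mentioned above, a simple simulation could look like the following sketch; the drop fraction, the shared drop mask across bits, and the single-position pooling are illustrative assumptions.

```python
import torch

def bit_accuracy_with_dropped_voters(votes, temp, targets, drop_frac: float = 0.25) -> float:
    """Zero out a random fraction of voters and check how often the aggregated bit stays correct.

    votes:   [batch, num_bits, M] sigmoid outputs for one position
    targets: [batch, num_bits] ground-truth bits in {0, 1}
    """
    keep = torch.rand(votes.shape[-1]) > drop_frac      # random voter mask (drop_frac < 1)
    kept = votes[..., keep]
    weights = torch.softmax(kept * temp, dim=-1)
    probs = (weights * kept).sum(dim=-1)                # aggregate over surviving voters
    return ((probs > 0.5).long() == targets.long()).float().mean().item()
```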
These experiments would quantify the benefits of sparse activation. We would report metrics like bits-per-token cross-entropy, task accuracy, percentage of active neurons, and forgetting indices. It is also important to visualize learned patterns: do certain voters specialize for certain input patterns? One could probe this by observing which voter has the highest weight across different inputs.
Discussion
Our sparse voter ensemble model suggests a philosophical shift: intelligence as distributed specialization rather than monolithic density. In contrast to a single “heavyweight” neural net where every parameter is always slightly engaged, our view is that many small experts (voters, dendrites, columns) work in parallel, each tuned to particular contexts. This is akin to the “wisdom of crowds” where an aggregate of specialists can outperform a generalist. Biology provides many analogies: for example, the brain’s olfactory system uses many separate receptors and glomeruli, not a single giant network. The Thousand Brains theory explicitly argues that numerous cortical columns each model complete objects, and intelligence arises from their consensus.
Of course, many open problems remain. Learning dynamic sparse architectures is challenging: should we learn to route inputs more directly (e.g. learnable masks) or rely on heuristics like our softmax? How to optimize \$\tau\$ and other hyperparameters? Can hardware accelerators be designed for this kind of sparsity (e.g. skipping inactive channels)? In neuroscience, questions of how synaptic plasticity and inhibitory signals coordinate to produce sparsity are still being studied – analogously, we need to understand how backprop through sparse selection shapes representations.
On the modeling side, one could extend SMUGGLER to allow conditional activation of internal layers (not just the final bits). For instance, making entire convolutional paths or attention heads sparse via gating could yield further gains. The DendriticBottleneck in our code hints at this direction. Another avenue is curriculum or continual training regimes: testing how the network forms new voters for novel stimuli, potentially resembling neurogenesis or synaptic consolidation in animals.
Ultimately, this work encourages thinking of models as ensembles of specialists. Just as no single brain cell knows all facts, no single neuron in a large network should do all the work. Instead, a careful orchestration of many expert bits – each mostly silent, stepping forward only as an occasional champion – may be a more power-efficient and robust form of intelligence. This resonates with the success of ensemble methods in ML and with emergent paradigms in neuroscience. We hope our SMUGGLER model and analysis stimulate further exploration of sparsity as a design principle for next-generation AI.
References
- Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
- Shazeer, N., Mirhoseini, A., Maziarz, K., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In ICLR.
- Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv:1904.10509.
- Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR 22(1):7986–8018.
- Rosenbaum, C., Klinger, T., & Riemer, M. (2017). Routing Networks: Adaptive Selection of Non-linear Functions for Multi-task Learning. In ICLR (Poster).
- Olshausen, B. A., & Field, D. J. (1997). Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research 37(23):3311–3325.
- King, P. D., Zylberberg, J., & DeWeese, M. R. (2013). Inhibitory Interneurons Decorrelate Excitatory Cells to Drive Sparse Code Formation in a Spiking Model of V1. J. Neurosci. 33(13):5475–5485.
- Hawkins, J., Ahmad, S., & Hu, M. (2024). The Thousand Brains Project: A New Paradigm for Sensorimotor Intelligence. arXiv:2412.18354.
- SMUGGLER Codebase: smuggler_model.py, blocks.py, hierarchical.py (user-supplied).