<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gayan Sanjeewa</title>
    <description>The latest articles on DEV Community by Gayan Sanjeewa (@gsanjeewa77).</description>
    <link>https://dev.to/gsanjeewa77</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F745933%2Fbe6ac738-0a09-4369-9925-1225df9614ac.jpeg</url>
      <title>DEV Community: Gayan Sanjeewa</title>
      <link>https://dev.to/gsanjeewa77</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gsanjeewa77"/>
    <language>en</language>
    <item>
      <title>Daily LeetCode Problems: 779: K-th Symbol in Grammar</title>
      <dc:creator>Gayan Sanjeewa</dc:creator>
      <pubDate>Wed, 08 May 2024 01:45:15 +0000</pubDate>
      <link>https://dev.to/gsanjeewa77/daily-leetcode-problems-779-k-th-symbol-in-grammar-5he6</link>
      <guid>https://dev.to/gsanjeewa77/daily-leetcode-problems-779-k-th-symbol-in-grammar-5he6</guid>
      <description>&lt;h2&gt;
  
  
  Daily LeetCode Problems: 779: K-th Symbol in Grammar
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3840%2F1%2Ait5JFbM4HXB-HFDmXSAs9w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3840%2F1%2Ait5JFbM4HXB-HFDmXSAs9w.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Today’s problem is “K-th Symbol in Grammar.” It revolves around constructing a table of rows using a unique pattern of 0s and 1s. We’ll explore the problem statement, decipher the pattern, and provide an efficient solution to determine the k-th symbol in the nth row of the table.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Statement:
&lt;/h2&gt;

&lt;p&gt;In this problem, we are tasked with constructing a table of rows in the following manner:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Start with the first row containing a single ‘0’.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In every subsequent row, replace each ‘0’ in the previous row with ‘01’, and each ‘1’ with ‘10’.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our goal is to find the k-th symbol in the nth row (both 1-indexed) of the table.&lt;/p&gt;
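&lt;p&gt;&lt;em&gt;As a quick sanity check, the first few rows can be generated directly from the replacement rule above. This brute-force sketch is exponential in n and only practical for small rows, but it makes the pattern concrete:&lt;/em&gt;&lt;/p&gt;

```python
def build_rows(n):
    """Generate rows 1..n by literally applying the replacement rule."""
    rows = ["0"]
    for _ in range(n - 1):
        prev = rows[-1]
        # Replace each '0' with '01' and each '1' with '10'.
        rows.append("".join("01" if c == "0" else "10" for c in prev))
    return rows

print(build_rows(4))  # ['0', '01', '0110', '01101001']
```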

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AyWk2NkSD3GFoTAGWJT_1WA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AyWk2NkSD3GFoTAGWJT_1WA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach:
&lt;/h2&gt;

&lt;p&gt;To solve this problem, we need to understand the pattern and recurrence that emerges as we construct the rows.&lt;/p&gt;

&lt;p&gt;The key insight is that the construction process can be thought of as a binary tree. The initial ‘0’ is the root, and each ‘0’ in a row has two children (‘0’ and ‘1’) in the row below, while each ‘1’ in a row has two children (‘1’ and ‘0’) in the row below. The tree branches out in a balanced manner.&lt;/p&gt;

&lt;p&gt;With this insight, we can use a recursive approach to traverse the tree and find the k-th symbol in the nth row. We start with the root and move down the tree based on whether k is on the left or right side of the current node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudocode:
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def kthGrammar(n, k):
    # Base case: if n is 1, the first row contains only '0'.
    if n == 1:
        return 0

    # Calculate the midpoint of the row, where the tree branches.
    mid = 2**(n-1) // 2

    # If k is in the left subtree, recurse on the left child.
    if k &amp;lt;= mid:
        return kthGrammar(n - 1, k)
    else:
        # If k is in the right subtree, recurse on the right child and flip the result.
        return 1 - kthGrammar(n - 1, k - mid)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Complexity Analysis:
&lt;/h2&gt;

&lt;p&gt;Let’s analyze the complexity of our solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Time complexity: O(n), because each recursive call reduces the row number n by one until it reaches the base case. Since row n contains 2^(n-1) symbols, this is logarithmic in the length of the row.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Space complexity: O(n), due to the recursive call stack, which grows at most n - 1 frames deep.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Full Solution
&lt;/h2&gt;
&lt;h2&gt;
  
  
  Python
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Solution:
    def kthGrammar(self, n: int, k: int) -&amp;gt; int:
        if n == 1:
            return 0
        length = 2 ** (n - 2)
        if k &amp;lt;= length:
            return self.kthGrammar(n - 1, k)
        else:
            return 1 - self.kthGrammar(n - 1, k - length)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Java
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Solution {
  public int kthGrammar(int n, int k) {
    if (n == 1)
      return 0;
    if (k % 2 == 1)
      return kthGrammar(n - 1, (k + 1) / 2) == 0 ? 0 : 1; // Left node
    return kthGrammar(n - 1, k / 2) == 0 ? 1 : 0;         // Right node
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
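&lt;p&gt;&lt;em&gt;Not covered in the article, but worth noting: unrolling the recursion gives a known closed form. Each step flips the symbol exactly when k lands in a second half, so the answer is simply the parity of the 1-bits in k - 1:&lt;/em&gt;&lt;/p&gt;

```python
def kth_grammar(n, k):
    # The answer depends only on k: it is the parity of the number of
    # 1-bits in k - 1 (n just bounds how large k can be).
    return bin(k - 1).count("1") % 2

print([kth_grammar(3, k) for k in [1, 2, 3, 4]])  # [0, 1, 1, 0]
```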

</description>
    </item>
    <item>
      <title>Curious Engineering Facts (Multi-token Prediction, Kolmogorov-Arnold Networks (KANs)): May Release 2:24</title>
      <dc:creator>Gayan Sanjeewa</dc:creator>
      <pubDate>Wed, 08 May 2024 01:42:35 +0000</pubDate>
      <link>https://dev.to/gsanjeewa77/curious-engineering-facts-multi-token-prediction-kolmogorov-arnold-networks-kans-may-release-224-332d</link>
      <guid>https://dev.to/gsanjeewa77/curious-engineering-facts-multi-token-prediction-kolmogorov-arnold-networks-kans-may-release-224-332d</guid>
      <description>&lt;h2&gt;
  
  
  Curious Engineering Facts (&lt;strong&gt;Multi-token Prediction&lt;/strong&gt;, Kolmogorov-Arnold Networks (KANs)): May Release 2:24
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F12000%2F1%2AjeNYLlL3mFFGdarSOOCErA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F12000%2F1%2AjeNYLlL3mFFGdarSOOCErA.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Meta’s New Groundbreaking Paper on Multi-Token Prediction for Better and Faster LLMs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApM4oc9yZj0VFktKG6wkJlA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApM4oc9yZj0VFktKG6wkJlA.png" alt="Credit: Original author"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most current large language models are trained with a &lt;strong&gt;next-token prediction loss&lt;/strong&gt;. However, they require large amounts of data and often fail to capture longer-term dependencies effectively.&lt;/p&gt;

&lt;p&gt;Meta’s new groundbreaking paper &lt;strong&gt;“Better &amp;amp; Faster Large Language Models via Multi-token Prediction”&lt;/strong&gt; suggests that training language models to predict &lt;strong&gt;multiple future tokens at once results in higher sample efficiency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enhanced Efficiency:&lt;/strong&gt; Multi-token prediction improves sample efficiency and &lt;strong&gt;speeds up inference times by up to 3x&lt;/strong&gt;, particularly with larger models and batch sizes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better Performance on Benchmarks:&lt;/strong&gt; This technique shows substantial improvement over traditional next-token prediction models on coding tasks and generative benchmarks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability Benefits:&lt;/strong&gt; The benefits of multi-token prediction become more significant as model size increases, which implies greater improvements for larger models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Robustness Across Epochs:&lt;/strong&gt; Using Multi-token prediction maintains performance advantages even when models are trained for multiple epochs, which demonstrates robustness and durability of training gains.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Overall architecture:&lt;/strong&gt; It consists of a common trunk that processes the input sequence &lt;strong&gt;to generate a latent representation of the observed context&lt;/strong&gt;. On top of that, &lt;strong&gt;multiple output heads&lt;/strong&gt; are each responsible for predicting a &lt;strong&gt;different future token&lt;/strong&gt; simultaneously and independently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-token Prediction Task:&lt;/strong&gt; Instead of predicting &lt;strong&gt;just the next token,&lt;/strong&gt; the model predicts &lt;strong&gt;several future tokens from each position in the input&lt;/strong&gt; sequence. Each output head makes its prediction &lt;strong&gt;independently&lt;/strong&gt; based on the &lt;strong&gt;shared context&lt;/strong&gt; provided by the trunk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training Process:&lt;/strong&gt; During training, the model is optimized to predict each of the future tokens independently. This approach trains the model to improve its predictions by considering multiple future outcomes at each step. The predictions are &lt;strong&gt;generated in parallel&lt;/strong&gt; across the multiple heads, so this doesn’t add any computational overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficient Inference:&lt;/strong&gt; At inference time, the model can use the trained output heads to generate multiple tokens at once, &lt;strong&gt;&lt;em&gt;speeding up the process&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
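&lt;p&gt;&lt;em&gt;The trunk-plus-heads layout above can be sketched in a few lines of dependency-free Python. This is a toy: the sizes, the tanh trunk, and the plain linear heads are illustrative assumptions, not the architecture from the paper.&lt;/em&gt;&lt;/p&gt;

```python
import math
import random

random.seed(0)
d, vocab, n_heads = 8, 20, 4   # toy sizes (assumptions, not from the paper)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(v, m):
    # v has len(m) entries; returns v @ m with m stored row-major.
    return [sum(v[i] * m[i][j] for i in range(len(v)))
            for j in range(len(m[0]))]

def softmax(z):
    top = max(z)
    e = [math.exp(x - top) for x in z]
    s = sum(e)
    return [x / s for x in e]

W_trunk = rand_matrix(d, d)                                 # shared trunk
W_heads = [rand_matrix(d, vocab) for _ in range(n_heads)]   # one head per future offset

def predict_future_tokens(x):
    """One probability distribution per future token offset, all computed
    from the same latent context produced by the shared trunk."""
    h = [math.tanh(t) for t in matvec(x, W_trunk)]    # latent context
    return [softmax(matvec(h, W)) for W in W_heads]   # heads run independently

dists = predict_future_tokens([random.gauss(0, 1) for _ in range(d)])
```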

&lt;h2&gt;
  
  
  2. A promising alternative to Multi-Layer Perceptrons (MLPs) is taking over the industry: KANs
&lt;/h2&gt;

&lt;p&gt;A recent paper introduces a novel neural network architecture, Kolmogorov-Arnold Networks (KANs), which replaces MLPs’ &lt;strong&gt;fixed activation functions&lt;/strong&gt; on nodes with learnable activation functions on edges, eliminating linear weights entirely.&lt;/p&gt;

&lt;p&gt;KANs enhance accuracy, interpretability, use significantly fewer parameters (200 vs. 300,000 in some MLPs), and effectively &lt;strong&gt;prevent catastrophic forgetting. However, their complex activation functions demand more computational resources.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding Kolmogorov–Arnold Networks (KANs)&lt;/strong&gt;&lt;br&gt;
The genesis of Kolmogorov–Arnold Networks (KANs) is deeply rooted in the Kolmogorov-Arnold representation theorem, a seminal concept in mathematical theory that profoundly influences their design and functionality. This theorem provides a method to express &lt;strong&gt;any multivariate continuous function as a superposition of continuous functions of one variable.&lt;/strong&gt; Inspired by this theorem, KANs are crafted to leverage this foundational mathematical insight, thereby reimagining the structure and capabilities of neural networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Theoretical Foundation&lt;/strong&gt;&lt;br&gt;
Unlike Multi-Layer Perceptrons (&lt;strong&gt;MLPs&lt;/strong&gt;), which are primarily inspired by the &lt;strong&gt;Universal Approximation Theorem&lt;/strong&gt;, KANs draw from the Kolmogorov-Arnold representation theorem. This theorem asserts that &lt;strong&gt;any function of several variables can be represented as a composition of functions of one variable and the addition operation&lt;/strong&gt;. KANs operationalize this theorem by implementing a neural architecture where the traditional linear weight matrices and fixed &lt;strong&gt;activation functions&lt;/strong&gt; are replaced with &lt;strong&gt;dynamic, learnable univariate functions along each connection, or “edge”, between nodes in the network.&lt;/strong&gt;&lt;/p&gt;
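&lt;p&gt;&lt;em&gt;A minimal sketch of one such learnable edge function, parametrized here as piecewise-linear values at fixed knots. This is a simplified stand-in for the B-spline parametrization used in the paper; the grid range and knot count are illustrative assumptions.&lt;/em&gt;&lt;/p&gt;

```python
import bisect

class EdgeFunction:
    """A learnable univariate function on one KAN edge: piecewise-linear
    interpolation of trainable values at fixed knots (a simplified
    stand-in for the paper's B-spline parametrization)."""

    def __init__(self, lo=-1.0, hi=1.0, n_knots=8):
        step = (hi - lo) / (n_knots - 1)
        self.xs = [lo + i * step for i in range(n_knots)]
        self.ys = [0.0] * n_knots   # these are the trainable parameters

    def __call__(self, x):
        x = min(max(x, self.xs[0]), self.xs[-1])   # clamp to the grid
        i = max(1, bisect.bisect_left(self.xs, x))
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (x - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

def kan_output_node(edges, inputs):
    """One KAN output node: sum the edge functions applied to each input,
    as in the Kolmogorov-Arnold representation."""
    return sum(phi(x) for phi, x in zip(edges, inputs))

# Setting an edge's knot values equal to the knot positions makes it the
# identity function, so this node simply sums its two inputs.
edges = [EdgeFunction(), EdgeFunction()]
for e in edges:
    e.ys = list(e.xs)
print(kan_output_node(edges, [0.25, 0.5]))
```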

&lt;h2&gt;
  
  
  3. OpenBio-LLM 8B and 70B
&lt;/h2&gt;

&lt;p&gt;OpenBioLLM-8B is an advanced open source language model designed specifically for the &lt;strong&gt;biomedical&lt;/strong&gt; domain. Developed by Saama AI Labs, this model was fine-tuned on a vast corpus of high-quality biomedical data from the powerful foundation of the &lt;strong&gt;Meta-Llama-3–8B&lt;/strong&gt; model.&lt;/p&gt;

&lt;p&gt;The 70B parameter model outperforms GPT-4, Gemini, Meditron-70B, and Med-PaLM 1.&lt;/p&gt;

&lt;p&gt;The Open Medical LLM Leaderboard aims to track, rank and evaluate the performance of large language models (LLMs) on medical question answering tasks.&lt;/p&gt;

&lt;p&gt;It evaluates LLMs across a diverse array of medical datasets, including MedQA (USMLE), PubMedQA, MedMCQA, and subsets of MMLU related to medicine and biology. These datasets contain multiple-choice and open-ended questions that require medical reasoning and understanding.&lt;/p&gt;

&lt;p&gt;OpenBio-LLM-70B is the leading model according to this benchmark, though the recent Med-Gemini model is not yet included in the leaderboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Hyper-SD
&lt;/h2&gt;

&lt;p&gt;Hyper-SD is a new state-of-the-art diffusion model acceleration technique: a framework that achieves high fidelity under step compression and mitigates the performance losses of &lt;strong&gt;diffusion model distillation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the accompanying HF Space, the models distilled from SDXL Base 1.0 and Stable-Diffusion v1–5 are released.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>LLM Research Paper Fact: 17</title>
      <dc:creator>Gayan Sanjeewa</dc:creator>
      <pubDate>Wed, 08 May 2024 01:41:25 +0000</pubDate>
      <link>https://dev.to/gsanjeewa77/llm-research-paper-fact-17-2dc2</link>
      <guid>https://dev.to/gsanjeewa77/llm-research-paper-fact-17-2dc2</guid>
      <description>&lt;h2&gt;
  
  
  LLM Research Paper Fact: 17
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7012%2F1%2AJRKLz5D6kOysTerzchd2HA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7012%2F1%2AJRKLz5D6kOysTerzchd2HA.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. KAN: Kolmogorov-Arnold Networks
&lt;/h2&gt;

&lt;p&gt;Watching: KAN (&lt;a href="https://substack.com/redirect/7b72b768-a8a9-4d54-9f26-265d945c53d6?j=eyJ1IjoiYWd0OGUifQ.KuS--w16fPnuR3ms20HX4Emo75EWmIvMjGnfh6Gf_Y8" rel="noopener noreferrer"&gt;paper&lt;/a&gt;/&lt;a href="https://substack.com/redirect/1f8940f6-549a-43d5-a65a-0d11e23df273?j=eyJ1IjoiYWd0OGUifQ.KuS--w16fPnuR3ms20HX4Emo75EWmIvMjGnfh6Gf_Y8" rel="noopener noreferrer"&gt;code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2166%2F0%2AMDorNwz8xdlwXRe9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2166%2F0%2AMDorNwz8xdlwXRe9.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What problem does it solve?&lt;/strong&gt; Multi-layer perceptrons (MLPs) have been a fundamental building block in deep learning architectures for decades. However, despite their widespread use, MLPs have limitations in terms of &lt;strong&gt;accuracy and interpretability.&lt;/strong&gt; The fixed activation functions on nodes in MLPs can restrict their ability to capture complex patterns and relationships in data. Additionally, the lack of interpretability in MLPs makes it challenging to understand how they arrive at their predictions, which is crucial in many domains such as healthcare and finance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does it solve the problem?&lt;/strong&gt; Kolmogorov-Arnold Networks (&lt;strong&gt;KANs&lt;/strong&gt;) address the limitations of MLPs by introducing &lt;strong&gt;learnable activation functions&lt;/strong&gt; on edges instead of fixed activation functions on nodes. By replacing linear weights with &lt;strong&gt;univariate functions parametrized as splines&lt;/strong&gt;, KANs achieve better accuracy with smaller network sizes compared to MLPs. The learnable activation functions allow KANs to capture more complex patterns and relationships in data. Moreover, KANs &lt;strong&gt;exhibit faster neural scaling laws&lt;/strong&gt;, meaning they can achieve &lt;strong&gt;higher accuracy with fewer parameters&lt;/strong&gt; compared to MLPs. The interpretability of KANs is enhanced by their ability to be intuitively visualized and to interact easily with human users, enabling them to assist scientists in discovering mathematical and physical laws.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt; Further research can explore the application of KANs in various domains, such as computer vision, natural language processing, and reinforcement learning. The interpretability aspect of KANs can be leveraged to develop more transparent and explainable AI systems, which is crucial for building trust and accountability.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. SPAFIT: Stratified Progressive Adaptation Fine-tuning for Pre-trained Large Language Models
&lt;/h2&gt;

&lt;p&gt;Watching: SPAFIT (&lt;a href="https://substack.com/redirect/07e729d2-4e08-4f4b-a809-4e7b43662f70?j=eyJ1IjoiYWd0OGUifQ.KuS--w16fPnuR3ms20HX4Emo75EWmIvMjGnfh6Gf_Y8" rel="noopener noreferrer"&gt;paper&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2046%2F0%2AFcU7C0J0WkKanJNY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2046%2F0%2AFcU7C0J0WkKanJNY.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What problem does it solve?&lt;/strong&gt; Fine-tuning large language models can be a challenging task due to the significant computational resources and storage required. Additionally, the Transformer architecture used in these models has been shown to suffer from catastrophic forgetting and overparameterization. Parameter-efficient fine-tuning (PEFT) methods have been developed to address these issues, but they typically apply adjustments across all layers of the model, which may not be the most efficient approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does it solve the problem?&lt;/strong&gt; Stratified Progressive Adaptation Fine-tuning (&lt;strong&gt;SPAFIT&lt;/strong&gt;) is a novel PEFT method that takes advantage of the &lt;strong&gt;localization of different types of linguistic knowledge&lt;/strong&gt; within specific layers of the model. By strategically focusing on these layers, SPAFIT can achieve better performance while &lt;strong&gt;fine-tuning only a fraction of the parameters&lt;/strong&gt; compared to other PEFT methods. This targeted approach not only reduces the computational resources needed but also helps to mitigate the issues of catastrophic forgetting and overparameterization.&lt;/p&gt;
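&lt;p&gt;&lt;em&gt;The stratified idea can be sketched as a per-layer plan: shallow layers stay frozen, middle layers get a light method, deep layers the heaviest one. The three-way split and the methods named below are illustrative assumptions, not the exact configuration from the paper.&lt;/em&gt;&lt;/p&gt;

```python
def spafit_plan(n_layers=12):
    """Toy illustration of stratified PEFT: assign progressively heavier
    fine-tuning methods to deeper layer groups. The grouping and method
    names are assumptions for illustration only."""
    third = n_layers // 3
    plan = {}
    for layer in range(n_layers):
        group = min(layer // third, 2)
        plan[layer] = ("frozen", "bias-only", "lora")[group]
    return plan

print(spafit_plan(6))
# {0: 'frozen', 1: 'frozen', 2: 'bias-only', 3: 'bias-only', 4: 'lora', 5: 'lora'}
```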

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt; The success of SPAFIT on the GLUE benchmark tasks demonstrates the potential of this method for efficient fine-tuning of large language models.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Better &amp;amp; Faster Large Language Models via Multi-token Prediction
&lt;/h2&gt;

&lt;p&gt;Watching: Multi-token Prediction (&lt;a href="https://substack.com/redirect/bfd18405-d21f-493c-a7a9-212bbeb9d442?j=eyJ1IjoiYWd0OGUifQ.KuS--w16fPnuR3ms20HX4Emo75EWmIvMjGnfh6Gf_Y8" rel="noopener noreferrer"&gt;paper&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AI3hvNtb-NT53edtO.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AI3hvNtb-NT53edtO.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What problem does it solve?&lt;/strong&gt; Language Models (LMs) are usually trained using a next-token prediction objective, meaning that at each step, the model tries to predict the next token in the sequence. This approach, while effective, may not be the most efficient way to learn from the training data. By only predicting &lt;strong&gt;one token at a time&lt;/strong&gt;, the model might miss out on learning longer-term dependencies and patterns in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does it solve the problem?&lt;/strong&gt; The proposed method trains the model to &lt;strong&gt;predict multiple future tokens at once&lt;/strong&gt;, using &lt;strong&gt;independent output heads&lt;/strong&gt; that operate on top of a &lt;strong&gt;shared model trunk&lt;/strong&gt;. By considering multi-token prediction as an auxiliary training task, the model can learn more efficiently from the same amount of data. This approach leads to improved downstream capabilities &lt;strong&gt;without increasing the training time&lt;/strong&gt;. The benefits are more pronounced for larger model sizes and generative tasks like coding, where the multi-token prediction models consistently outperform strong baselines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt; The success of multi-token prediction in improving sample efficiency and &lt;strong&gt;generative capabilities&lt;/strong&gt; opens up new avenues for further research. It would be interesting to explore how this approach scales with even larger model sizes and different types of data, such as multilingual corpora or domain-specific datasets. Additionally, the improved inference speed of models trained with multi-token prediction could have significant implications for real-world applications where low latency is crucial, such as &lt;strong&gt;chatbots or real-time translation systems.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Small Language Models Need Strong Verifiers to Self-Correct Reasoning
&lt;/h2&gt;

&lt;p&gt;Self-correction has emerged as a promising way to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether smaller (&amp;lt;= 13B) language models (LMs) can self-correct on reasoning tasks with minimal input from stronger LMs. The paper proposes a &lt;strong&gt;novel pipeline&lt;/strong&gt; that prompts &lt;strong&gt;smaller LMs to collect self-correction data that supports the training of self-refinement abilities&lt;/strong&gt;. First, the pipeline leverages correct solutions to guide the model in critiquing its incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. The experimental results show improved self-correction abilities for two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations remain when a weak self-verifier decides when to correct.&lt;/p&gt;
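&lt;p&gt;A hypothetical sketch of the data-collection stage described above: the small LM critiques its own wrong answer while the known correct solution guides the critique, and only critiques whose refined answer matches the reference survive the filter. &lt;code&gt;small_lm&lt;/code&gt; and &lt;code&gt;is_correct&lt;/code&gt; are stand-in callables, not the paper’s actual prompts or APIs:&lt;/p&gt;

```python
def collect_self_correction_data(problems, small_lm, is_correct):
    """Gather (question, wrong answer, critique, refined answer) tuples,
    keeping only critiques that lead to a correct refinement."""
    dataset = []
    for prob in problems:
        first_try = small_lm(f"Solve: {prob.question}")
        if is_correct(first_try, prob.answer):
            continue  # only incorrect answers need critiques
        critique = small_lm(
            f"Solution {first_try} is wrong; the correct answer is "
            f"{prob.answer}. Explain the mistake in: {prob.question}"
        )
        refined = small_lm(f"Revise using the critique {critique}: {prob.question}")
        if is_correct(refined, prob.answer):  # the filtering step
            dataset.append((prob.question, first_try, critique, refined))
    return dataset
```

&lt;p&gt;The surviving tuples would then feed the supervised fine-tuning stage of the pipeline.&lt;/p&gt;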

</description>
    </item>
    <item>
      <title>Kolmogorov–Arnold Networks (KANs)</title>
      <dc:creator>Gayan Sanjeewa</dc:creator>
      <pubDate>Tue, 07 May 2024 01:46:19 +0000</pubDate>
      <link>https://dev.to/gsanjeewa77/kolmogorov-arnold-networks-kans-33pi</link>
      <guid>https://dev.to/gsanjeewa77/kolmogorov-arnold-networks-kans-33pi</guid>
      <description>&lt;h2&gt;
  
  
  Kolmogorov–Arnold Networks (KANs)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11520%2F1%2AzzxE_dZiqSkaK6cIHEUSSg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11520%2F1%2AzzxE_dZiqSkaK6cIHEUSSg.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MLPs are celebrated for their expressive power, primarily attributed to the Universal Approximation Theorem which suggests that &lt;strong&gt;they can model any continuous function under certain conditions.&lt;/strong&gt; However, despite their widespread adoption, MLPs come with inherent limitations, particularly in terms of parameter efficiency and interpretability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2404.19756" rel="noopener noreferrer"&gt;Enter Kolmogorov–Arnold Networks (KANs)&lt;/a&gt;, a groundbreaking alternative inspired by the Kolmogorov-Arnold representation theorem. This new class of neural networks proposes a shift &lt;strong&gt;from **the **fixed activation **functions of **MLPs **to **adaptable activation **functions on the **connections between nodes&lt;/strong&gt;, offering a fresh perspective on network design. Unlike traditional MLPs tha*&lt;em&gt;t utilize a static architecture of weights and biases&lt;/em&gt;&lt;em&gt;, KANs introduce a dynamic framework where **each connection weight **is replaced by a **learnable univariate function&lt;/em&gt;&lt;em&gt;, typically parameterized as a **spline&lt;/em&gt;*. This subtle yet profound modification enhances the model’s flexibility and significantly reduces the complexity and number of parameters required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Kolmogorov–Arnold Networks (KANs)
&lt;/h3&gt;

&lt;p&gt;The genesis of Kolmogorov–Arnold Networks (KANs) is deeply rooted in the &lt;strong&gt;Kolmogorov&lt;/strong&gt;-&lt;strong&gt;Arnold representation theorem&lt;/strong&gt;, a seminal concept in mathematical theory that profoundly influences their design and functionality. This theorem provides a method to &lt;strong&gt;express &lt;em&gt;any multivariate continuous function&lt;/em&gt; as a superposition of continuous functions of one variable&lt;/strong&gt;. Inspired by this theorem, KANs are crafted to leverage this foundational mathematical insight, thereby reimagining the structure and capabilities of neural networks.&lt;br&gt;
Give this short video a watch:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/shorts/Yu1zsGhanh8" rel="noopener noreferrer"&gt;https://www.youtube.com/shorts/Yu1zsGhanh8&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Theoretical Foundation
&lt;/h3&gt;

&lt;p&gt;Unlike Multi-Layer Perceptrons (&lt;strong&gt;MLPs&lt;/strong&gt;) that are primarily inspired by the &lt;strong&gt;Universal Approximation Theorem&lt;/strong&gt;, KANs draw from the Kolmogorov-Arnold representation theorem. &lt;strong&gt;This theorem asserts that any function of several variables can be represented as a composition of functions of one variable and the addition operation&lt;/strong&gt;. KANs operationalize this theorem by implementing a neural architecture where the traditional linear weight matrices and fixed activation functions are &lt;strong&gt;replaced with dynamic, learnable univariate functions along each connection, or “edge”,&lt;/strong&gt; between nodes in the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architectural Shifts
&lt;/h3&gt;

&lt;p&gt;The most distinctive feature of KANs compared to traditional MLPs is the placement of activation functions. While MLPs apply fixed activation functions at the nodes (neurons) of the network:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AU-li_uaZIazND5amrXDR1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AU-li_uaZIazND5amrXDR1g.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;KANs instead place learnable activation functions on the edges (weights), eliminating linear weights entirely:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AMPvBpIL81z1ITTuzbNInlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AMPvBpIL81z1ITTuzbNInlg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, each Φ represents a &lt;strong&gt;learnable function&lt;/strong&gt;, typically parameterized as a &lt;strong&gt;spline&lt;/strong&gt;, that directly &lt;strong&gt;modifies the signal&lt;/strong&gt; &lt;em&gt;transmitted between layers&lt;/em&gt;. This architecture not only simplifies the computation graph but also enhances the network’s ability to model complex patterns through more direct manipulation of data flow.&lt;/p&gt;
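&lt;p&gt;A minimal sketch of one such edge function, assuming a piecewise-linear spline in place of the B-splines used in the paper; the control values &lt;code&gt;coef&lt;/code&gt; are the parameters a KAN would train:&lt;/p&gt;

```python
import numpy as np

class EdgeSpline:
    """One KAN edge: a learnable univariate function phi(x).

    phi is simplified here to a piecewise-linear spline; the control
    values `coef` are the trainable parameters.
    """
    def __init__(self, n_knots=8, lo=-1.0, hi=1.0):
        self.grid = np.linspace(lo, hi, n_knots)  # fixed knot positions
        self.coef = np.zeros(n_knots)             # learnable heights

    def __call__(self, x):
        # Evaluate the spline; np.interp clamps outside the grid.
        return np.interp(x, self.grid, self.coef)

phi = EdgeSpline()
phi.coef[:] = phi.grid ** 2   # hand-set the edge to approximate x**2
```

&lt;p&gt;Training a KAN amounts to adjusting the &lt;code&gt;coef&lt;/code&gt; arrays of every edge by gradient descent, exactly where an MLP would adjust scalar weights.&lt;/p&gt;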

&lt;h3&gt;
  
  
  Advantages Over Traditional MLPs
&lt;/h3&gt;

&lt;p&gt;The reconfiguration of activation functions and the elimination of linear weight matrices result in several key advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parameter Efficiency:&lt;/strong&gt; Each weight in an MLP is replaced by a &lt;strong&gt;spline&lt;/strong&gt; function in KANs, which can adapt its shape based on the learning process. This adaptability often allows KANs to achieve high accuracy with significantly fewer parameters compared to MLPs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility and Adaptability:&lt;/strong&gt; By employing splines, KANs can more &lt;strong&gt;finely tune their responses to the input data&lt;/strong&gt;, offering a more nuanced adaptation to complex data patterns than the relatively rigid structure of MLPs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interpretability:&lt;/strong&gt; The structure of KANs facilitates a clearer understanding of how inputs are transformed through the network. Each spline function’s effect on the data is more observable and understandable than the often opaque transformations in deep MLPs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Visual Comparison
&lt;/h3&gt;

&lt;p&gt;Illustratively, while MLPs rely on a combination of weight matrices and non-linear activation functions applied in a fixed sequence, KANs create a &lt;strong&gt;fluid network of functions&lt;/strong&gt; that dynamically adjust based on the data. This difference is not just architectural but conceptual, pushing forward the boundaries of what neural networks can learn and represent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages of KANs Over Traditional MLPs
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Enhanced Accuracy and Efficiency
&lt;/h3&gt;

&lt;p&gt;KANs achieve high accuracy with fewer parameters than MLPs. This advantage is underpinned by the unique architectural elements of KANs, which allow a more direct and flexible manipulation of input data through learnable activation functions on each edge of the network.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduced Model Complexity:&lt;/strong&gt; By replacing the typical weight matrices in MLPs with &lt;strong&gt;spline&lt;/strong&gt;-based functions that act on edges, KANs dramatically &lt;strong&gt;reduce the number of parameters&lt;/strong&gt;. This reduction in complexity often leads to more efficient training processes and faster convergence rates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Precision in Data Fitting and PDE Solving:&lt;/strong&gt; KANs have demonstrated superior performance in complex tasks such as data fitting and solving partial differential equations (PDEs). For instance, in applications requiring high precision, such as numerical simulation and predictive modeling, KANs have outperformed MLPs by orders of magnitude in both accuracy and computational efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Improved Interpretability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visual Clarity of Function Transformations:&lt;/strong&gt; The use of spline functions allows for a clear visual interpretation of how inputs are transformed through the network. Unlike MLPs, where the transformation through layers can be opaque, KANs provide a more transparent view of the data flow and transformation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ease of Modification and Interaction:&lt;/strong&gt; The functional approach of KANs not only simplifies the understanding of each layer’s impact but also allows easier modifications to meet specific needs or constraints, facilitating user interaction and customization.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Theoretical and Empirical Validation
&lt;/h3&gt;

&lt;p&gt;The theoretical foundations of KANs provide robustness to their design, which is empirically validated through extensive testing and application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Neural Scaling Laws:&lt;/strong&gt; Theoretically, KANs exhibit more favorable neural scaling laws than MLPs. This implies that as the network scales, KANs maintain or improve performance more effectively than MLPs, particularly in environments with large-scale data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Empirical Studies:&lt;/strong&gt; Across various studies, KANs have shown to not only perform better in standard tasks but also in &lt;strong&gt;discovering underlying patterns and laws in scientific data,&lt;/strong&gt; demonstrating their utility as tools for scientific discovery.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Case Studies
&lt;/h3&gt;

&lt;p&gt;Case studies illustrate the practical benefits of KANs over MLPs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In mathematical applications, such as symbolic regression or complex function approximation, KANs have successfully identified and modeled intricate patterns that were challenging for traditional MLPs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In physics and engineering, KANs have been applied to model and solve intricate problems, from fluid dynamics simulations to structural optimization, &lt;strong&gt;with greater accuracy and fewer computational resources&lt;/strong&gt; than equivalent MLP models.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Empirical Performance and Theoretical Insights
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Demonstrated Superiority in Diverse Applications
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Fitting:&lt;/strong&gt; KANs have shown the ability to fit complex data sets with high accuracy and fewer parameters. For example, in tasks involving the fitting of non-linear functions, KANs have outperformed MLPs by achieving &lt;strong&gt;lower mean squared errors&lt;/strong&gt; with significantly reduced model complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solving Partial Differential Equations (PDEs):&lt;/strong&gt; KANs have solved PDEs with greater precision and efficiency, often requiring smaller computational graphs than MLPs, which translates into faster computation and lower resource consumption.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Empirical Validation through Case Studies
&lt;/h3&gt;

&lt;p&gt;Specific case studies underscore the practical advantages of KANs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scientific Discovery:&lt;/strong&gt; In fields like physics and chemistry, KANs have helped researchers uncover underlying physical laws and chemical properties from experimental data, acting almost as collaborative tools in the scientific discovery process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Machine Learning and AI:&lt;/strong&gt; In more traditional machine learning tasks, such as image and speech recognition, KANs have demonstrated their ability to learn more effective representations with fewer training iterations, facilitating faster and more scalable AI solutions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Theoretical Advancements
&lt;/h3&gt;

&lt;p&gt;The theoretical framework of KANs offers insights into why these networks perform effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Neural Scaling Laws:&lt;/strong&gt; KANs benefit from favorable neural scaling laws, which suggest that their performance &lt;strong&gt;improves consistently as network size increases&lt;/strong&gt;, without the diminishing returns often observed in MLPs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Function Approximation Capabilities:&lt;/strong&gt; The structure of KANs inherently supports a more flexible function approximation capability, which can be attributed to their use of spline-based activation functions. This flexibility allows KANs to model a wider range of functions directly compared to the layered linear transformations in MLPs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Improvements in Training Dynamics
&lt;/h3&gt;

&lt;p&gt;The training process of KANs also exhibits several improvements over traditional approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficiency in Learning:&lt;/strong&gt; KANs typically require &lt;strong&gt;fewer epochs&lt;/strong&gt; to converge to optimal solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stability and Generalization:&lt;/strong&gt; KANs have shown greater stability during training and superior generalization capabilities on unseen data, likely due to their inherent regularization effects from spline functions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Potential Applications and Impact on Science
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Advancing Machine Learning and Artificial Intelligence
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deep Learning Enhancements:&lt;/strong&gt; By integrating KANs into existing deep learning architectures, researchers can create more efficient and interpretable models for tasks like image recognition, natural language processing, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Robust AI Systems:&lt;/strong&gt; The inherent interpretability and efficient data handling of KANs contribute to building more robust and reliable AI systems, particularly in critical applications such as autonomous driving and medical diagnosis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  In Summary
&lt;/h2&gt;

&lt;p&gt;While MLPs have fixed activation functions on nodes (or “neurons”), KANs have learnable activation functions on edges (or “weights”).&lt;/p&gt;

&lt;p&gt;In a KAN, &lt;strong&gt;each weight parameter&lt;/strong&gt; is replaced by a &lt;strong&gt;univariate function&lt;/strong&gt;, typically parameterized as a &lt;strong&gt;spline&lt;/strong&gt;. As a result, KANs have &lt;strong&gt;no linear weights&lt;/strong&gt; at all. The nodes in a KAN simply sum the incoming signals without applying any non-linearities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A4n-2kpdLo_vnBW0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A4n-2kpdLo_vnBW0v.png" alt="Credit : Original author"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How do they work?
&lt;/h3&gt;

&lt;p&gt;At its core, a KAN learns both the &lt;strong&gt;compositional structure&lt;/strong&gt; (external degrees of freedom) and the &lt;strong&gt;univariate functions&lt;/strong&gt; (internal degrees of freedom) of a given problem. This allows KANs to not only learn features, like MLPs, &lt;strong&gt;but also to optimize these learned features&lt;/strong&gt; to great accuracy.&lt;/p&gt;

&lt;p&gt;KANs leverage the strengths of both splines and MLPs while avoiding their weaknesses. Splines are accurate for low-dimensional functions and can easily adjust locally, but suffer from the curse of dimensionality. &lt;strong&gt;MLPs, on the other hand, are better at exploiting compositional structures,&lt;/strong&gt; but struggle to optimize univariate functions. By combining the two approaches, KANs can learn and accurately represent complex functions more effectively than either splines or MLPs alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Expanded
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Compositional Structure Learning (External Degrees of Freedom)&lt;/strong&gt;&lt;br&gt;
KANs, like MLPs, can learn the compositional structure of a problem. In other words, they can identify and learn the relationships between &lt;strong&gt;different input features&lt;/strong&gt; and how they &lt;strong&gt;contribute to the output&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a KAN, the nodes are &lt;strong&gt;responsible for summing the incoming signals&lt;/strong&gt; without applying any non-linearities. The edges, on the other hand, contain &lt;strong&gt;learnable activation functions, which are typically parameterized as splines&lt;/strong&gt;. This architecture allows the network to learn the optimal composition of these activation functions to model the underlying structure of the problem.&lt;/p&gt;
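&lt;p&gt;Putting the two ideas together, a single KAN layer can be sketched as follows. This is a simplified illustration, with piecewise-linear interpolation tables standing in for splines: every input–output pair gets its own univariate function, and each output node merely sums the edge outputs:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_OUT, N_KNOTS = 3, 2, 8
grid = np.linspace(-1.0, 1.0, N_KNOTS)
coef = rng.normal(size=(N_OUT, N_IN, N_KNOTS))  # one spline table per edge

def kan_layer(x):
    """Output node i sums phi_ij(x_j) over inputs j; no other non-linearity."""
    out = np.zeros(N_OUT)
    for i in range(N_OUT):
        for j in range(N_IN):
            out[i] += np.interp(x[j], grid, coef[i, j])  # edge function phi_ij
    return out

y = kan_layer(np.array([0.2, -0.5, 0.9]))
```

&lt;p&gt;Stacking several such layers gives the compositional depth of an MLP while keeping every non-linearity on an edge.&lt;/p&gt;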

&lt;p&gt;By learning the compositional structure, KANs can effectively handle high-dimensional problems and exploit the inherent relationships between input features. This capability is similar to that of MLPs, which can also learn complex feature interactions through their layered architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Univariate Function Optimization (Internal Degrees of Freedom)&lt;/strong&gt;&lt;br&gt;
What sets KANs apart from MLPs is their ability to optimize univariate functions to a high degree of accuracy. In a KAN, each edge contains a learnable activation function, which is a &lt;strong&gt;univariate function&lt;/strong&gt; parameterized as a spline. Splines are piecewise polynomial functions that can closely approximate complex univariate functions.&lt;br&gt;
During training, KANs optimize these spline activation functions to best fit the target function. The spline parameterization allows for local adjustments, meaning that the network can fine-tune the activation functions in specific regions of the input space without affecting other regions. This local adaptability is a key advantage of splines over global activation functions like sigmoids or ReLUs, which are commonly used in MLPs.&lt;br&gt;
By optimizing the univariate functions, KANs can achieve high accuracy in modeling complex, non-linear relationships between inputs and outputs. This is particularly useful for problems with low-dimensional input spaces, where splines can excel.&lt;/p&gt;
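&lt;p&gt;The local-adjustment property is easy to demonstrate: nudging one control point of a piecewise-linear spline changes the function only in that knot’s neighbourhood, whereas every parameter of a global activation like tanh affects all inputs. A small sketch:&lt;/p&gt;

```python
import numpy as np

grid = np.linspace(-1.0, 1.0, 9)
coef = np.sin(np.pi * grid)         # spline roughly tracing sin(pi * x)

xs = np.linspace(-1.0, 1.0, 201)
before = np.interp(xs, grid, coef)

coef2 = coef.copy()
coef2[4] += 0.5                     # perturb only the middle knot (x = 0)
after = np.interp(xs, grid, coef2)

# The perturbation is felt only between the neighbouring knots at x = ±0.25;
# everywhere else the spline is untouched.
changed = np.abs(after - before) > 1e-12
```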

&lt;p&gt;&lt;strong&gt;Combining Strengths of Splines and MLPs&lt;/strong&gt;&lt;br&gt;
KANs leverage the strengths of both splines and MLPs while avoiding their weaknesses. Splines are highly accurate for low-dimensional functions and can easily adapt locally, but they suffer from the curse of dimensionality. As the number of input dimensions increases, the number of spline parameters required to maintain accuracy grows exponentially, making splines impractical for high-dimensional problems.&lt;/p&gt;

&lt;p&gt;On the other hand, MLPs are better suited for high-dimensional problems due to their ability to learn compositional structures. However, MLPs struggle to optimize univariate functions effectively, as their activation functions are typically fixed and global.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KANs overcome these limitations by combining the compositional structure learning of MLPs with the univariate function optimization of splines. The network’s architecture allows it to learn complex feature interactions like an MLP, while the spline activation functions enable accurate modeling of univariate relationships.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Kolmogorov-Arnold Networks Could Revolutionize Large Language Models
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Enhancing interpretability: One of the main criticisms of LLMs is their lack of interpretability. It can be difficult to understand how these &lt;strong&gt;models arrive at their outputs&lt;/strong&gt;, which raises concerns about bias, fairness, and &lt;strong&gt;trustworthiness&lt;/strong&gt;. While some architectures like decision trees and rule-based systems are more interpretable, they often lack the performance of deep learning models. KANs, with their learnable activation functions and more interpretable structure, could help address this issue. By integrating KANs into LLMs, &lt;strong&gt;researchers could gain more insights into how the models process and generate language,&lt;/strong&gt; potentially leading to more transparent and &lt;strong&gt;explainable AI&lt;/strong&gt; systems that outperform other interpretable architectures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Few-shot learning: While LLMs have shown impressive few-shot learning capabilities, they still require substantial amounts of data and compute to achieve optimal performance. Other architectures like Siamese networks and metric learning approaches have been used for few-shot learning, but they may not scale as well to complex language tasks. KANs’ ability to learn both compositional structure and univariate functions more efficiently could help LLMs learn from fewer examples, potentially outperforming existing few-shot learning approaches in the language domain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Knowledge representation and reasoning: LLMs have demonstrated some ability to store and retrieve knowledge, as well as perform basic reasoning tasks. However, their ability to represent and manipulate complex, structured knowledge is still limited. Graph neural networks (GNNs) and knowledge graphs have been used to represent structured knowledge, but integrating them with language models remains challenging. KANs’ more interpretable and modular structure could potentially help LLMs better represent and reason over structured knowledge, offering a more seamless integration of knowledge representation and language modeling compared to existing approaches.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What’s The Catch?
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Currently, the biggest bottleneck of KANs lies in their slow training. KANs are usually 10x slower than MLPs, given the same number of parameters. If one wants to train a model fast, one should use MLPs. In other cases, however, KANs should be comparable to or better than MLPs, which makes them worth trying. If you care about interpretability and/or accuracy, and slow training is not a major concern, try KANs.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  When to use them?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AhJOdJDXVqDabxmCS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AhJOdJDXVqDabxmCS.png" alt="Credit: Original Author"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/GayanSanjeewaGitHub/KANs" rel="noopener noreferrer"&gt;https://github.com/GayanSanjeewaGitHub/KANs&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
