<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: huimin liao</title>
    <description>The latest articles on DEV Community by huimin liao (@huimin_liao_bb8519708c5bd).</description>
    <link>https://dev.to/huimin_liao_bb8519708c5bd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2046086%2Fc43cca4f-4688-4cbe-a8e1-4662eb9322bf.png</url>
      <title>DEV Community: huimin liao</title>
      <link>https://dev.to/huimin_liao_bb8519708c5bd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/huimin_liao_bb8519708c5bd"/>
    <language>en</language>
    <item>
      <title>"The Godfather of AI" Wins the Nobel Prize in Physics | What Magic Do Neural Networks Actually Have?</title>
      <dc:creator>huimin liao</dc:creator>
      <pubDate>Mon, 21 Oct 2024 03:41:26 +0000</pubDate>
      <link>https://dev.to/huimin_liao_bb8519708c5bd/the-godfather-of-ai-wins-the-nobel-prize-in-physics-what-magic-does-the-neural-network-exactly-have-4nho</link>
      <guid>https://dev.to/huimin_liao_bb8519708c5bd/the-godfather-of-ai-wins-the-nobel-prize-in-physics-what-magic-does-the-neural-network-exactly-have-4nho</guid>
      <description>&lt;p&gt;The 2024 Nobel Prize in Physics was awarded to two scientists in the field of artificial intelligence&lt;br&gt;
On October 8th, Beijing time, the 2024 Nobel Prize in Physics was announced at the Royal Swedish Academy of Sciences. This year the award went to two scientists, Professor John J. Hopfield of Princeton University in the United States and Professor Geoffrey E. Hinton of the University of Toronto in Canada, in recognition of their "foundational discoveries and inventions that enable machine learning with artificial neural networks". The official Nobel website notes that the two laureates used physics to train artificial neural networks: John J. Hopfield created an associative memory that can store and reconstruct images and other types of patterns in data, and Geoffrey E. Hinton invented a method that can autonomously discover properties in data and thus perform tasks such as identifying specific elements in pictures [1]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom3jjk4u34imw958h003.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom3jjk4u34imw958h003.png" alt="Image description" width="481" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 1 The 2024 Nobel laureates in physics [2]&lt;/p&gt;

&lt;p&gt;(1) John J. Hopfield [1]&lt;br&gt;
John J. Hopfield was born in Chicago, Illinois, in 1933 and received his Ph.D. from Cornell University in the United States in 1958. He is currently a professor at Princeton University in the United States.&lt;br&gt;
In 1982, John J. Hopfield proposed the associative memory neural network now usually called the "Hopfield network", and in 1986 he co-founded the doctoral program in Computation and Neural Systems at the California Institute of Technology. The Hopfield network draws on principles from physics that describe the properties of matter: the whole network is described by a quantity equivalent to the energy of a spin system, and it is trained by finding values for the connections between nodes so that the stored images have low energy. When a distorted or incomplete image is fed into the Hopfield network, it systematically traverses the nodes and updates their values, steadily lowering the network's energy. The network therefore gradually settles on the stored image most similar to the imperfect input.&lt;/p&gt;
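The energy-minimizing recall described above can be sketched in a few lines of Python. This is an illustrative toy, not Hopfield's original formulation, and all function names are ours:

```python
import numpy as np

# Toy Hopfield network sketch. Patterns are stored with a Hebbian rule;
# recall repeatedly updates nodes, each update lowering the network
# energy E = -1/2 * s^T W s, until the state settles on the stored
# pattern closest to the (possibly corrupted) input.

def train_hopfield(patterns):
    """Build the weight matrix from rows of +/-1 patterns."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)              # no self-connections
    return W / patterns.shape[0]

def recall(W, state, steps=100, seed=0):
    """Asynchronous node updates; each flip can only lower the energy."""
    s = state.copy()
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

pattern = np.array([1, 1, -1, -1, 1, -1, 1, -1])
W = train_hopfield(pattern[None, :])
noisy = pattern.copy()
noisy[0] = -noisy[0]                    # corrupt one bit
restored = recall(W, noisy)             # settles back to `pattern`
```

With a single stored pattern, every update either leaves a correct node alone or flips the corrupted one back, so the recall converges to the stored image.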

&lt;p&gt;(2) Geoffrey E. Hinton [1]&lt;br&gt;
Geoffrey E. Hinton was born in London, England, in 1947 and received his Ph.D. from the University of Edinburgh in England in 1978. He is currently a professor at the University of Toronto in Canada.&lt;br&gt;
Building on the Hopfield network, in the mid-1980s Hinton and his colleague Terrence Sejnowski used tools from statistical physics to create a new kind of network, the Boltzmann machine, which can learn to recognize characteristic elements in a given type of data. Hinton trained the machine by feeding it examples that are very likely to occur when it runs. A trained Boltzmann machine can classify images or generate new examples of the pattern types it was trained on. Hinton built further on this work, helping to drive the explosive development of today's machine learning.&lt;/p&gt;

&lt;p&gt;From the perceptron to the MLLM (Multimodal Large Language Model): the evolution of artificial neural networks&lt;br&gt;
The history of artificial neural networks can be traced back to 1957, when the American scientist Frank Rosenblatt proposed the concept of the perceptron [3]. The perceptron is a binary classification algorithm that mimics the function of a biological neuron: it linearly classifies input data and learns by adjusting its weights. However, the perceptron can only solve linearly separable problems; Marvin Minsky and Seymour Papert proved in 1969 that it cannot handle nonlinear problems, and research in the field stagnated for years [4].&lt;/p&gt;
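The mistake-driven weight update described above can be shown in a minimal sketch. This is an illustrative toy on the linearly separable AND function, not Rosenblatt's original hardware formulation; the function names are ours:

```python
import numpy as np

# Rosenblatt-style perceptron sketch: learn the linearly separable
# AND function by adjusting weights only when a prediction is wrong.

def train_perceptron(X, y, epochs=10, lr=1.0):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi + b > 0 else 0
            err = yi - pred            # nonzero only on a mistake
            w += lr * err * xi
            b += lr * err
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])             # AND: linearly separable
w, b = train_perceptron(X, y)
preds = [1 if w @ xi + b > 0 else 0 for xi in X]   # -> [0, 0, 0, 1]
```

Swapping the labels to XOR makes the loop cycle forever without converging, which is exactly the limitation Minsky and Papert identified.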

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieyv6y1iu7v8773cen4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieyv6y1iu7v8773cen4c.png" alt="Image description" width="729" height="389"&gt;&lt;/a&gt;&lt;br&gt;
Figure 2 Schematic diagram of the perceptron structure (Source: Synced)&lt;/p&gt;

&lt;p&gt;Not until the 1980s did the backpropagation algorithm inject new vitality into artificial neural networks. Developed by David Rumelhart, Geoffrey Hinton (the 2024 Nobel laureate in Physics), and Ronald Williams, backpropagation solved the gradient-computation problem for multilayer neural networks, making it possible to train deep networks [5]. This breakthrough drove the rise of deep learning and gradually established artificial neural networks at the core of fields such as computer vision and natural language processing.&lt;/p&gt;
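The layer-by-layer gradient computation that backpropagation enables can be sketched on XOR, the nonlinear task a single-layer perceptron cannot solve. The layer width, learning rate, and iteration count below are illustrative choices, not values from the original paper:

```python
import numpy as np

# Minimal backpropagation sketch: a two-layer sigmoid network trained
# on XOR with full-batch gradient descent on the squared error.

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: push the error gradient back one layer at a time
    d_out = (out - y) * out * (1 - out)    # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)     # gradient at the hidden layer
    W2 -= h.T @ d_out; b2 -= d_out.sum(0)
    W1 -= X.T @ d_h;   b1 -= d_h.sum(0)

preds = (out > 0.5).astype(float)
```

The chain rule is visible in `d_h`: the output error is multiplied through the transposed weights and the local sigmoid derivative, which is precisely the step the perceptron lacked.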

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1jr29nesfdgabiqfbfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1jr29nesfdgabiqfbfn.png" alt="Image description" width="678" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 3 Schematic diagram of backpropagation (Source: deephub)&lt;/p&gt;

&lt;p&gt;After the turn of the 21st century, the convolutional neural network (CNN) became the mainstream model in computer vision. In 2012, AlexNet, developed by Hinton's student Alex Krizhevsky, achieved a breakthrough result in the ImageNet competition. Its deep convolutional structure significantly improved image classification accuracy and became an important milestone in the deep learning wave [6]. Deep learning then expanded rapidly into fields such as speech recognition and machine translation. Supported by larger-scale data and stronger computing power, neural network architectures grew ever deeper, and models such as VGG and ResNet emerged in quick succession.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwwouaxfocxf7uvic417.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwwouaxfocxf7uvic417.png" alt="Image description" width="701" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 4 Schematic diagram of the AlexNet network structure (Source: CSDN@不如语冰)&lt;/p&gt;

&lt;p&gt;In recent years, the rise of multimodal large language models (MLLMs) has further advanced artificial neural networks. An MLLM is not limited to a single modality such as text; it can also combine images, audio, and other modalities for unified understanding and generation, enabling machines to handle complex and diverse data environments [7]. For example, GPT-2, launched by OpenAI in 2019, demonstrated excellent language generation capabilities, and the subsequent GPT-3 increased the parameter count to 175 billion, significantly enhancing the model's dialogue and reasoning capabilities [8]. With the release of GPT-4 in 2023, the multimodal capabilities of MLLMs reached a new level, processing text and image inputs simultaneously and opening up more application scenarios for intelligent assistants and generative AI [9].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fardibukrxplv6crrb2pn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fardibukrxplv6crrb2pn.png" alt="Image description" width="639" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 5 Schematic diagram of MLLM generated by GPT-4&lt;/p&gt;

&lt;p&gt;However, the surge in model size has also brought heavy demands on computing resources and energy. Against this backdrop, researchers have begun to explore new hardware architectures and model optimization techniques. Model compression methods such as quantization and distillation help reduce the computational burden of training and inference, while emerging hardware architectures such as in-memory computing offer new solutions for large-scale neural network computation [10]. As multimodal large language models continue to evolve, we can expect artificial neural networks to demonstrate still stronger capabilities on more complex tasks.&lt;/p&gt;
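Quantization, one of the compression techniques mentioned above, can be illustrated with a minimal post-training INT8 sketch (a symmetric per-tensor scheme of our own choosing, not any particular library's implementation):

```python
import numpy as np

# Post-training INT8 quantization sketch: map float32 weights to 8-bit
# integers with one per-tensor scale, cutting memory 4x versus float32.

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0        # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(0, 0.1, 1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()          # rounding error <= scale / 2
```

The round-to-nearest step bounds the per-weight error by half the scale, which is why quantized inference usually loses little accuracy while storing and moving a quarter of the bytes.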

&lt;p&gt;1. Model sizes and parameter counts have surged; in-memory computing empowers neural network accelerators&lt;br&gt;
In the second part, we briefly traced the evolution of artificial neural networks from the perceptron to the MLLM (Multimodal Large Language Model). With the continuous development of machine learning and large-scale neural networks, and especially the breakthroughs of MLLMs across tasks, model sizes and parameter counts have grown explosively. GPT-2 (2019), OpenAI's second-generation generative pre-trained Transformer model, has 1.5 billion parameters; although relatively small, it already demonstrated strong language generation capabilities. OpenAI then launched GPT-3 (2020), which has 175 billion parameters and became one of the most complex natural language processing models of its time. Subsequently, Google launched LaMDA (2021), focused on dialogue applications, with 137 billion parameters, and PaLM (2022), whose 540 billion parameters made it one of the foundation models for multimodal and multi-task learning. In 2023, OpenAI launched GPT-4, a further breakthrough in multimodal large language models, with a parameter count reported to be around 1.76 trillion. Compared with GPT-3, GPT-4 demonstrates stronger multimodal processing and can handle multiple data forms such as text and images [11]. In just a few years, parameter counts have leapt from hundreds of millions to trillions, bringing enormous computing and hardware demands. As MLLMs continue to scale, developing efficient hardware accelerators has become a top priority.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkiseinmbxr5ut1llcz1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkiseinmbxr5ut1llcz1s.png" alt="Image description" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 1 The development trend of MLLM [12]&lt;/p&gt;

&lt;p&gt;As a new computing architecture, computing-in-memory (CIM) is considered a potentially revolutionary technology. The key idea is to integrate storage and computation, overcoming the bottleneck of the von Neumann architecture, and to combine post-Moore advances such as advanced packaging and new memory devices to achieve order-of-magnitude improvements in compute energy efficiency. In the MLLM field, in-memory computing can significantly accelerate both training and inference: since the core of neural network computation is large-scale matrix multiplication and convolution, CIM performs multiply-accumulate operations directly in the storage units and excels at massively parallel workloads. WTMemory Technology, a leading company in the domestic in-memory computing chip field, has mass-produced the WTM-8 in-memory computing chip, which supports complex functions such as AI image super-resolution, frame interpolation, and HDR recognition and detection, and the WTM-2101 in-memory computing chip, which already handles functions such as voice recognition within edge-side compute budgets. In the future, in-memory computing chips will bring more possibilities for accelerating MLLM training and inference, helping MLLM development reach a new level.&lt;/p&gt;
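The matrix-vector multiply that CIM keeps inside the memory array can be modeled functionally as below. This is a conceptual sketch only; the class and method names are ours and do not describe any WTM hardware:

```python
import numpy as np

# Functional model of what a CIM macro accelerates: a matrix-vector
# multiply where the weight matrix stays resident in the (simulated)
# storage array and each output column accumulates its partial
# products in place, rather than shuttling weights to a separate ALU.

class CIMMacro:
    def __init__(self, weights):
        self.W = np.asarray(weights)   # weights "stored" in the array

    def mac(self, x):
        # Each output is a multiply-accumulate down one stored column;
        # the hardware evaluates all columns in parallel in the memory.
        return self.W.T @ x

W = np.array([[1, -2], [3, 0], [-1, 4]])   # 3 inputs x 2 outputs
macro = CIMMacro(W)
x = np.array([2, 1, -1])
y = macro.mac(x)                            # -> [6, -8]
```

In a von Neumann design the same product costs a weight fetch per multiply; keeping `W` resident is the data-movement saving the paragraph describes.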

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx332x5ch7zfocf2bmnid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx332x5ch7zfocf2bmnid.png" alt="Image description" width="443" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fay1mlfnh7z2ra0p4dnp8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fay1mlfnh7z2ra0p4dnp8.png" alt="Image description" width="254" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;p&gt;[1]The Nobel Prize in Physics 2024 - NobelPrize.org&lt;/p&gt;

&lt;p&gt;[2]The 2024 Nobel Prize in Physics was awarded to two "Godfathers of AI", which is an important moment in the AI academic field _ Tencent News (qq.com)&lt;/p&gt;

&lt;p&gt;[3]Rosenblatt, F. (1958). The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.&lt;/p&gt;

&lt;p&gt;[4]Minsky, M., &amp;amp; Papert, S. (1969). Perceptrons: An introduction to computational geometry. MIT Press.&lt;/p&gt;

&lt;p&gt;[5]Rumelhart, D. E., Hinton, G. E., &amp;amp; Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.&lt;/p&gt;

&lt;p&gt;[6]Krizhevsky, A., Sutskever, I., &amp;amp; Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 1097-1105.&lt;/p&gt;

&lt;p&gt;[7]Bommasani, R., Hudson, D. A., Adeli, E., &amp;amp; others. (2021). On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258.&lt;/p&gt;

&lt;p&gt;[8]Brown, T. B., Mann, B., Ryder, N., &amp;amp; others. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.&lt;/p&gt;

&lt;p&gt;[9]OpenAI. (2023). GPT-4 Technical Report. OpenAI.&lt;/p&gt;

&lt;p&gt;[10]Cheng, J., &amp;amp; Wang, L. (2023). Computing-In-Memory for Efficient Large-Scale AI Models. IEEE Transactions on Neural Networks and Learning Systems.&lt;/p&gt;

&lt;p&gt;[11]OpenAI GPT-4 Documentation.(&lt;a href="https://openai.com/index/gpt-4-research/" rel="noopener noreferrer"&gt;https://openai.com/index/gpt-4-research/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;[12]A Survey on Multimodal Large Language Models.&lt;br&gt;
(&lt;a href="https://arxiv.org/abs/2306.13549" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2306.13549&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>neural</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Computing-in-memory technology utilized in multimodal applications: ISSCC 2023 16.1 MulTCIM in detail</title>
      <dc:creator>huimin liao</dc:creator>
      <pubDate>Wed, 16 Oct 2024 07:12:43 +0000</pubDate>
      <link>https://dev.to/huimin_liao_bb8519708c5bd/computing-in-memory-technology-ultilized-in-multimodel-application-article-isscc-2023-161-multcim-in-detail-4mb5</link>
      <guid>https://dev.to/huimin_liao_bb8519708c5bd/computing-in-memory-technology-ultilized-in-multimodel-application-article-isscc-2023-161-multcim-in-detail-4mb5</guid>
      <description>&lt;p&gt;Multimodal models, which are neural network models with the ability to understand mixed signals from different modalities (e.g., vision, natural language, speech, etc.), are one of the most important directions in the development of AI models today. The paper to be presented is entitled "16.1 MulTCIM: A 28nm 2.24μJ/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers" by Dr. Fengbin Tu from the School of Integrated Circuits at Tsinghua University and the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology (HKUST), proposes an integrated digital-computing core design that can support the computation of multimodal Transformer models.&lt;br&gt;
一．Article Basic Information[1]&lt;br&gt;
The ultimate goal of neural network models is to have human-like perception and processing capabilities, and multimodal models have been proposed for this purpose, the best of which is the multimodal Transformer model. However, current multimodal Transformer models face the following three sparsity challenges when executed on hardware:&lt;br&gt;
(1) In terms of attention sparsity, the attention matrix, which is an important part of the Transformer model, has an irregular sparsity, which may lead to a longer reuse distance. For example, 78.6% to 81.7% of the number of tokens can be covered in the ViLBERT-base model. To support such operations, a large number of weights need to be stored in the storage kernel for a long period of time, and the utilization of these weights is extremely low;&lt;br&gt;
(2) In terms of token sparsity, although the computation can be reduced by token pruning, tokens with different lengths for different modalities can lead to computational idleness or pipeline delays across modal attention layers;&lt;br&gt;
(3) In terms of bit sparsity, activation functions such as Softmax, GERU, etc. generate a lot of data close to 0, which enhances the sparsity of the data to be processed, and the effective bitwidth of the same set of inputs to the CIM core will change repeatedly. The serial multiply-accumulate computation scheme in traditional CIM makes the computation time limited by the longest bit width.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffof5vokvg6wix8f575sn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffof5vokvg6wix8f575sn.png" alt="Image description" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig. 1 Challenges presented in the article and the corresponding solutions[2]&lt;br&gt;
In response to the above problems, this paper proposes three targeted solutions:&lt;br&gt;
(1) Aiming at the longer reuse distance caused by the irregular sparsity of the attention matrix, this paper proposes a Long Reuse Elimination Scheduler (LRES, Long Reuse Elimination Scheduler).The LRES splits the attention matrix into global+local sparsity patterns, where the globally similar attention weight vectors will be stored in the CIM for a longer time, and the locally similar weight vectors are consumed and updated more frequently to reduce unnecessary long reuse distances, instead of generating tokens of Q, K, and V sequentially as in the traditional Transformer, which can improve the utilization of the storage-computing kernel;&lt;br&gt;
(2) To address the problem of computational idleness or pipeline delay due to different token lengths in different modalities, this paper proposes a runtime token pruner (RTP, Runtime Token Pruner) and a modal-adaptive CIM Network (MACN, Modal-Adaptive CIM Network) to optimize this process.The RTP is capable of removing unimportant tokens, while MACN is able to dynamically switch between different modalities in the attention layer, reducing the idle time of the CIM and decreasing the latency of generating Q and K tokens;&lt;br&gt;
(3) To address the problem of the longest bitwidth variation due to the sparsity of the activation function, this paper introduces the Effective Bitwidth Balanced CIM (EBB-CIM) macro-architecture to solve the problem.EBB-CIM solves the problem by detecting the effective bitwidth of each element in the input vector and performs a bit balancing process to balance the input bits in the memory MAC to reduce the computation time by balancing the input bits in the MAC. This is accomplished by reallocating bits in longer effective bitwidth elements for shorter effective bitwidth elements, which in turn results in a more balanced overall input bitwidth.&lt;br&gt;
II. Analysis of the paper's content [1]&lt;br&gt;
Below, we detail the paper's innovations with respect to the three sparsity challenges identified by the authors:&lt;br&gt;
(1) LRES&lt;br&gt;
LRES contains three parts that work in sequence:&lt;br&gt;
1) Attention sparsity manager: stores the initial sparse attention pattern and updates it at runtime based on token pruning information. In this step the manager identifies the Q and K vectors that generate extensive attention, since these vectors need to stay in the CIM core longer to improve CIM utilization;&lt;br&gt;
2) Local attention sequencer: reorders the remaining attention matrix and the Q and K vectors, with K used as weights and Q as input vectors that are frequently consumed and switched in the CIM. K vectors are frequently replaced by newly generated ones, reducing CIM idleness;&lt;br&gt;
3) Reshaped attention generator: generates configuration information from the outputs of the first two steps, which is used to optimize the workflow of the CIM core.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhay2qwbf38q5vdmoro2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhay2qwbf38q5vdmoro2t.png" alt="Image description" width="371" height="567"&gt;&lt;/a&gt;&lt;br&gt;
Fig. 2 Schematic structure of LRES&lt;br&gt;
(2) RTP and MACN&lt;br&gt;
The RTP and MACN modules, which target token sparsity, are shown in Fig. 3. The RTP module is mainly responsible for removing irrelevant tokens, while the MACN dynamically divides all CIM cores into two pipeline stages: StageS for static matrix multiplication (MM) in Q, K, and V token generation, and StageD for dynamic MM in attention computation. The two modules are analyzed in detail below.&lt;br&gt;
First, since the class (CLS) token characterizes the importance of the other tokens, RTP receives the CLS scores of the previous layer and selects the top n most important tokens of the current layer. MACN comprises a Modal Workload Allocator (MWA), 16 CIM cores, and a pipeline bus. At runtime, the MWA divides the CIM cores into StageS and StageD and pre-assigns the StageS weights according to an allocation table. As for cross-modal switching, the traditional method computes modalities in turn, and differing modal parameters leave many CIM macros idle during the switch; MACN instead exploits modal symmetry to overlap the generation of multimodal Q and K tokens and reduce latency. Concretely, the 4:1 activation structure of the CIM stores the multimodal weights in one macro and switches modes by time multiplexing: at times 1 to NX, MACN is in Phase 1 and, in the example, Core1 stores WQX and WQY; at times NX to NY, MACN switches to Phase 2 and Core1 activates WQY to generate QY; at times NY to NX+NY, MACN switches to Phase 3 and Core1 activates WQY and WKX to generate QY and KX. Modal symmetry allows the generation of QY and KX to complete simultaneously with better CIM utilization.&lt;br&gt;
The final results show that RTP reduces the latency of unimodal and cross-modal attention by factors of 2.13 and 1.58 respectively, and modal symmetry provides an additional 1.69x speedup for cross-modal attention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4hitq0vxa2tw6zwyc4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4hitq0vxa2tw6zwyc4d.png" alt="Image description" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig. 3 Schematic diagram of the structure of RTP and MACN&lt;br&gt;
(3) EBB-CIM&lt;br&gt;
The EBB-CIM macro, which targets bit sparsity, is shown in Fig. 4. It consists of 32 EBB-CIM arrays, an effective-bitwidth detector, a bit equalizer, and a bit-balanced feeder. Each EBB-CIM array has 4 × 64 6T-SRAM bit cells (8 groups) and a cross-shift multiply-accumulate tree (Cross-Shift MAC Tree). EBB-CIM uses an all-digital CIM architecture with 4:1 activation, which achieves high computational accuracy while maintaining memory density; the detector receives the inputs at runtime and detects their effective bitwidth (EB); the bit equalizer calculates the average EB, allocates bits from long-EB data to short-EB data, and generates a bit-balanced input sequence; and the bit-balanced feeder takes that sequence and generates a cross-shift configuration. In addition, EBB-CIM can be reconfigured for INT16 by fusing every two INT8 operations.&lt;br&gt;
The final results show that, compared with conventional bit-serial CIM, EBB-CIM reduces the latency of Softmax-MM, GELU-MM, and the whole encoder by factors of 2.38, 2.20, and 1.58 respectively, with a power overhead of only 5.1% and an area overhead of only 4.6%.&lt;/p&gt;
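The effective-bitwidth idea behind EBB-CIM can be illustrated with a small model of bit-serial timing. The function names are ours and this does not reproduce the paper's circuit; it only shows why one wide operand dominates a bit-serial pass:

```python
# Illustrative model of effective bitwidth in bit-serial CIM: the MAC
# time is set by the widest operand in a group, so when sparse
# activations leave most inputs narrow, a single wide element still
# forces a long bit-serial pass. Balancing bitwidths shortens it.

def effective_bitwidth(x):
    """Number of bits needed for the magnitude of x (minimum 1)."""
    return max(1, abs(x).bit_length())

def serial_cycles(inputs):
    """Bit-serial MAC time is dominated by the widest input."""
    return max(effective_bitwidth(v) for v in inputs)

# Post-Softmax/GELU activations: many near-zero values, one outlier.
acts = [1, 0, 2, 0, 1, 97, 0, 3]
worst = serial_cycles(acts)                          # 7 bits, set by 97
avg = sum(effective_bitwidth(v) for v in acts) / len(acts)   # 2.0 bits
```

The gap between `worst` (7 cycles) and `avg` (2 cycles) is the headroom EBB-CIM recovers by redistributing bits from long-EB elements to short-EB ones.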

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffm7ljlu18eceuktlw3c3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffm7ljlu18eceuktlw3c3.png" alt="Image description" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig. 4 Schematic structure of EBB-CIM[1]&lt;br&gt;
III. Multimodal models&lt;br&gt;
(1) Concepts and principles&lt;br&gt;
Multimodal models refer to models that are capable of processing and understanding multiple types of data, such as text, images, audio, and video. Compared to single-modal models, multimodal models are able to fuse information from different modalities, thus improving the accuracy and comprehensiveness of information understanding and task processing.&lt;br&gt;
The core principle of multimodal modeling lies in cross-modal information fusion and collaborative processing, the main processes of which include:&lt;br&gt;
1) Data representation: converting data from different modalities into a form that can be processed by the model. Usually a specific encoder is used to represent the data of each modality as vectors or embeddings;&lt;br&gt;
2) Feature extraction: extracting meaningful features from data of different modalities. For example, use Convolutional Neural Networks (CNN) for images and Recurrent Neural Networks (RNN) or Transformer architecture for text;&lt;br&gt;
3) Cross-modal alignment: establishing associations between different modalities, e.g., by aligning timestamps or utilizing shared attention mechanisms to ensure that information from different modalities can be effectively fused;&lt;br&gt;
4) Information fusion: the aligned multimodal features are fused, and commonly used methods include simple splicing, weighted summation, and the use of more complex fusion networks;&lt;br&gt;
5) Decision making and output: task processing and decision making output, such as classification, generation or retrieval, through the fused features.&lt;br&gt;
(2) Applications and prospects&lt;br&gt;
Multimodal models have a wide range of applications across many domains; the most typical are Visual Question Answering and Image Captioning, i.e., giving ChatGPT a picture to interpret (see Fig. 5) or a passage of text from which to generate an image (see Fig. 6).&lt;/p&gt;
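The representation, extraction, and fusion steps listed above can be sketched as follows. The encoders here are random projections standing in for a CNN and a text Transformer, and all names and dimensions are illustrative:

```python
import numpy as np

# Minimal multimodal fusion sketch: encode each modality into a shared
# embedding space, then fuse by concatenation or weighted summation,
# the two simple fusion methods named in the text.

rng = np.random.default_rng(0)

def encode(x, proj):
    """Toy modality encoder: a fixed projection into a 16-d space."""
    return np.tanh(proj @ x)

img_proj = rng.normal(0, 0.1, (16, 64))    # stands in for a CNN
txt_proj = rng.normal(0, 0.1, (16, 32))    # stands in for a text encoder

image = rng.normal(size=64)                 # raw image features
text = rng.normal(size=32)                  # raw token features

img_emb = encode(image, img_proj)
txt_emb = encode(text, txt_proj)

fused_concat = np.concatenate([img_emb, txt_emb])   # simple splicing
fused_sum = 0.6 * img_emb + 0.4 * txt_emb           # weighted summation
```

A downstream classifier or decoder would then consume `fused_concat` or `fused_sum`; real systems replace both the projections and the fixed weights with learned modules and attention.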

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfkkvouv12x09ylctjze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfkkvouv12x09ylctjze.png" alt="Image description" width="674" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig. 5 ChatGPT's visual quiz function&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua4ordz2d5bu4hz61lat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua4ordz2d5bu4hz61lat.png" alt="Image description" width="674" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig. 6 Image description generation function of ChatGPT&lt;br&gt;
In addition, video generation models such as Sora (released in February) and Vidu (released in April) support Video Captioning, and the recently released GPT-4o supports Multimodal Sentiment Analysis, Cross-modal Retrieval, Multimodal Translation, and more. All of these capabilities rely on multimodal models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxd1cxlc8whhti0vibpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxd1cxlc8whhti0vibpl.png" alt="Image description" width="674" height="378"&gt;&lt;/a&gt;&lt;br&gt;
Figure 7 Sentiment analysis function demonstrated at the launch of ChatGPT-4o&lt;/p&gt;

&lt;p&gt;Multimodal models bring larger networks, sharply increased parameter counts, and higher training costs, all of which challenge traditional chip architectures; in-memory computing technology is well suited to these problems. It offers higher energy efficiency, computational efficiency, and data-processing parallelism, along with lower transmission latency and compute power consumption. These properties give in-memory computing chips an advantage in scenarios such as multimodal model training and inference, and the architecture is expected to displace the traditional von Neumann architecture as the architecture of choice for a new generation of AI chips. ZhiCun Technology has worked in the in-memory computing chip field for many years: since releasing WTM1001, the first in-memory computing chip product, in November 2019, it has within five years brought WTM1001 to mass production, completed validation and small-batch trial production of WTM2101, the first in-memory computing SoC chip, and reached mass production with the new-generation WTM-8 series of computer-vision chips. In the future, in-memory computing chips will play a greater role in the multimodal model field and provide strong support for its wide application.&lt;/p&gt;

&lt;p&gt;References:&lt;br&gt;
[1]Tu F, Wu Z, Wang Y, et al. 16.1 MuITCIM: A 28nm 2.24μJ/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers[C]//2023 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2023: 248-250.&lt;br&gt;
[2]Tu F, Wu Z, Wang Y, et al. MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity[J]. IEEE Journal of Solid-State Circuits, 2023.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>CIM Accelerates LLMs: Top GitHub Repositories for Learning CIM</title>
      <dc:creator>huimin liao</dc:creator>
      <pubDate>Mon, 09 Sep 2024 11:04:13 +0000</pubDate>
      <link>https://dev.to/huimin_liao_bb8519708c5bd/cim-accelate-llmtop-github-resposity-for-learning-cim-4edl</link>
      <guid>https://dev.to/huimin_liao_bb8519708c5bd/cim-accelate-llmtop-github-resposity-for-learning-cim-4edl</guid>
      <description>&lt;p&gt;This article focuses on the potential of Computing in memory(CIM) technology in accelerating the inference of large language models (LLMs). Starting with the background knowledge of LLMs, it explores the current challenges they face and then analyzes two classic papers to highlight how CIM could address existing issues in the inference acceleration of LLMs. Finally, it discusses the integration of LLMs with CIM and its future prospects.&lt;/p&gt;

&lt;p&gt;There is an existing repository of machine learning algorithms deployed on a computing-in-memory board: [&lt;a href="https://github.com/witmem/Algorithm-Deployed-in-WTM2101" rel="noopener noreferrer"&gt;https://github.com/witmem/Algorithm-Deployed-in-WTM2101&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background and Challenges of Large Language Models&lt;/strong&gt;&lt;br&gt;
(I) Basic Concepts of Large Language Models&lt;/p&gt;

&lt;p&gt;In layman's terms, Large Language Models (LLMs) are deep learning models trained on massive text data, with parameters in the billions (or more). They can not only generate natural language text but also understand its meaning, handling various natural language tasks such as text summarization, question answering, and translation.&lt;/p&gt;

&lt;p&gt;The performance of LLMs often follows the scaling law, but certain abilities appear only once a model reaches sufficient scale. These are called "emergent abilities," and three are representative. First, in-context learning: the model produces the expected output for test examples simply by completing the input word sequence, without additional training or gradient updates. Second, instruction following: LLMs fine-tuned on mixtures of multi-task datasets formatted as instructions perform well on unseen tasks that are likewise described in instruction form. Third, step-by-step reasoning: LLMs can solve tasks requiring intermediate steps by using prompting mechanisms, such as chain-of-thought, that derive the final answer through intermediate reasoning steps.&lt;/p&gt;
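&lt;p&gt;In-context learning is easiest to see with a concrete few-shot prompt. The sketch below builds one; the task, examples, and template are made up for illustration, and no particular model API is assumed — the point is only that the demonstrations live entirely in the input text, with no gradient updates.&lt;/p&gt;

```python
# Build a few-shot prompt for in-context learning. The model is expected to
# continue the pattern established by the demonstrations in the prompt itself.

examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]

def build_prompt(demos, query):
    """Concatenate a task description, demonstrations, and the query stub."""
    lines = ["Translate English to French:"]
    for en, fr in demos:
        lines.append(f"{en} => {fr}")
    lines.append(f"{query} =>")  # the model completes this line
    return "\n".join(lines)

prompt = build_prompt(examples, "peppermint")
print(prompt)
```

&lt;p&gt;A chain-of-thought prompt works the same way, except each demonstration also spells out the intermediate reasoning steps before its answer.&lt;/p&gt;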

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F139jelfywd2uv2158k8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F139jelfywd2uv2158k8f.png" alt="Image description" width="737" height="421"&gt;&lt;/a&gt;&lt;br&gt;
Figure 1: The development timeline of LLMs&lt;/p&gt;

&lt;p&gt;Currently, representative LLMs include models such as GPT-4, LLaMA, and PaLM (as shown in Figure 1). They have demonstrated strong capabilities in various natural language processing tasks, such as machine translation, text summarization, dialog systems, and question-answering systems. Moreover, LLMs have played a significant role in driving innovation in various fields, including healthcare, finance, and education, by providing enhanced data analysis, pattern recognition, and predictive modeling capabilities. This transformative impact underscores the importance of exploring and understanding the foundations of these models and their widespread application in different domains.&lt;/p&gt;

&lt;p&gt;(II) Challenges Faced by Large Language Models&lt;/p&gt;

&lt;p&gt;Executing LLM inference in traditional computing architectures mainly faces challenges such as computational latency, energy consumption, and data transfer bottlenecks. As models like GPT and BERT reach billions or more parameters, they require a massive amount of computation during inference, especially matrix operations and activation function processing. This high-density computational demand leads to significant latency, particularly in applications that require rapid response, such as real-time language translation or interactive dialog systems.&lt;/p&gt;

&lt;p&gt;Furthermore, as model sizes increase, the required computational resources also grow, leading to a substantial increase in energy consumption and higher operating costs and environmental impact. Running these large models in data centers or cloud environments consumes a significant amount of electricity, and high energy consumption limits their application on edge devices. Data transfer bottlenecks are another critical issue. Due to the enormous parameter volume of LLMs, they cannot all fit into the processor's high-speed cache, necessitating frequent data retrieval from slower main memory, increasing inference latency and further increasing energy consumption.&lt;/p&gt;

&lt;p&gt;Compute-in-memory (CIM) technology performs data processing directly within the memory chip, largely eliminating the transfers between memory and processor that traditional computing architectures require. This reduction in data movement significantly lowers energy consumption and inference latency, making model responses faster and more efficient. Given the challenges LLMs currently face, CIM may be an effective solution.&lt;/p&gt;
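&lt;p&gt;A back-of-envelope calculation shows why data movement, not arithmetic, dominates the energy budget that CIM attacks. The per-operation energy figures below are rough, commonly cited approximations in the style of Horowitz's ISSCC 2014 survey, and the one-fetch-per-weight access pattern is a simplifying assumption, not a measurement of any real system.&lt;/p&gt;

```python
# Rough model: energy per token of streaming every weight from off-chip DRAM
# versus performing the multiply-accumulate itself. Figures are approximate.

DRAM_READ_PJ_PER_32B = 640.0   # ~pJ to read one 32-bit word from DRAM
MAC_PJ = 4.6                   # ~pJ for one 32-bit float multiply-accumulate

def inference_energy_joules(n_params, reuse=1):
    """Energy per token: fetch each weight from DRAM (amortized over `reuse`
    uses per fetch) plus one MAC per weight on-chip."""
    fetch_pj = n_params * DRAM_READ_PJ_PER_32B / reuse
    compute_pj = n_params * MAC_PJ
    return (fetch_pj + compute_pj) * 1e-12

n = 7e9                                   # a hypothetical 7B-parameter model
total = inference_energy_joules(n)
movement_frac = (n * DRAM_READ_PJ_PER_32B * 1e-12) / total
print(f"~{total:.2f} J/token, {movement_frac:.0%} of it spent moving data")
```

&lt;p&gt;Under these assumptions, over 99% of the energy goes to fetching weights rather than computing with them, which is exactly the term CIM removes by keeping the weights stationary inside the memory array.&lt;/p&gt;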

&lt;p&gt;Case Study I - X-Former: In-Memory Acceleration of Transformers&lt;br&gt;
X-Former is a CIM hardware platform designed to accelerate Transformer-based LLMs. By computing directly within the storage units, optimizing parameter management, and improving computational efficiency and hardware utilization, it overcomes the bottlenecks traditional hardware hits on natural language tasks: high energy consumption, high latency, difficult parameter management, and limited scalability.&lt;/p&gt;

&lt;p&gt;The uniqueness of X-Former lies in its integration of specific hardware units, such as the projection engine and attention engine, which are optimized for different computational needs of the model (Figure 2). &lt;br&gt;
The projection engine uses NVM to handle static, compute-intensive matrix operations, while the attention engine utilizes CMOS technology to optimize frequent and dynamically changing computational tasks. This design allows X-Former to almost eliminate the dependency on external memory during execution since most data processing is completed within the memory array.&lt;/p&gt;

&lt;p&gt;Additionally, by adopting an intra-layer sequential blocking data flow method, X-Former further optimizes the data processing procedure, reducing memory occupancy and improving overall computational speed through parallel processing. This approach is particularly suitable for handling self-attention operations as it allows simultaneous processing of multiple data blocks instead of processing the entire dataset sequentially. Such design makes X-Former perform excellently in terms of hardware utilization and memory requirements, especially when dealing with models with a large number of parameters and complex computational demands.&lt;/p&gt;
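&lt;p&gt;The effect of sequential blocking can be illustrated functionally: attention output rows can be produced one query block at a time, so only a small block of queries must be resident at once, and the result is identical to computing over the whole sequence. This pure-Python sketch captures only that dataflow idea, not X-Former's actual hardware pipeline; the block size and dimensions are arbitrary assumptions.&lt;/p&gt;

```python
# Blocked self-attention: emit softmax(Q K^T) V row by row, one query block
# at a time, instead of materializing the full attention matrix at once.
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_blocked(Q, K, V, block=2):
    """Compute softmax(Q K^T) V, keeping only `block` queries resident."""
    out = []
    for start in range(0, len(Q), block):
        for q in Q[start:start + block]:
            scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
            probs = softmax(scores)
            out.append([sum(p * v[d] for p, v in zip(probs, V))
                        for d in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(attention_blocked(Q, K, V, block=2))
```

&lt;p&gt;Because each query row is independent, blocking changes only scheduling and memory residency, never the mathematical result — which is what lets a hardware dataflow reorder the work freely.&lt;/p&gt;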

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp2w9b7j8huzdn5qkg03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp2w9b7j8huzdn5qkg03.png" alt="Image description" width="800" height="425"&gt;&lt;/a&gt;&lt;br&gt;
Actual performance evaluation shows that X-Former has significant advantages on Transformer models. Compared to an NVIDIA GeForce GTX 1060 GPU, it improves latency by 69.8x and energy consumption by 13x; compared to a state-of-the-art NVM-based accelerator, it improves them by 24.1x and 7.95x, respectively. These gains demonstrate X-Former's potential in natural language processing, especially for complex, parameter-rich models such as GPT-3 and BERT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hdhb99fgfggwyy7q5d8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hdhb99fgfggwyy7q5d8.png" alt="Image description" width="422" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Case Study II - iMCAT: In-Memory Computing based Accelerator for Transformer Networks for Long Sequences&lt;br&gt;
The iMCAT architecture, proposed by Laguna et al., is an in-memory-computing-based accelerator for Transformer networks that targets long input sequences.&lt;/p&gt;

&lt;p&gt;In conclusion, the computing-in-memory architecture is a potentially important, next-generation foundation for LLMs and multimodal LLMs. The witmem organization on GitHub, the first open-source organization for CIM arrays, contains the newest CIM work, frequently used tools, and a paper collection for study:&lt;br&gt;
&lt;a href="https://github.com/witmem" rel="noopener noreferrer"&gt;https://github.com/witmem&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/witmem/Algorithm-Deployed-in-WTM2101" rel="noopener noreferrer"&gt;https://github.com/witmem/Algorithm-Deployed-in-WTM2101&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
