<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sandeep Salwan</title>
    <description>The latest articles on DEV Community by Sandeep Salwan (@sandeep_salwan).</description>
    <link>https://dev.to/sandeep_salwan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3504722%2F7f2bf869-381d-40f3-90ef-400e832ec35f.png</url>
      <title>DEV Community: Sandeep Salwan</title>
      <link>https://dev.to/sandeep_salwan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sandeep_salwan"/>
    <language>en</language>
    <item>
      <title>Why your AI assistant lies to you (and how to fix it)</title>
      <dc:creator>Sandeep Salwan</dc:creator>
      <pubDate>Thu, 11 Dec 2025 03:16:36 +0000</pubDate>
      <link>https://dev.to/sandeep_salwan/why-your-ai-assistant-lies-to-you-and-how-to-fix-it-322a</link>
      <guid>https://dev.to/sandeep_salwan/why-your-ai-assistant-lies-to-you-and-how-to-fix-it-322a</guid>
      <description>&lt;p&gt;You ask your AI assistant a simple history question about the 184th president of the United States. The model does not hesitate or pause to consider that there have only been 47 presidents in history. Instead, it generates a credible name and a fake inauguration ceremony. This behavior is called hallucination, and it is the single biggest hurdle stopping artificial intelligence from being truly reliable in extremely high-stakes fields such as healthcare and law. You will learn why this hallucination happens, but more importantly, we need to examine the new methods we use to prevent it.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix4ski6f6olbzjbd01u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix4ski6f6olbzjbd01u8.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;br&gt;
The Problem’s Scale&lt;br&gt;
You might think these errors are rare and that technology companies have fixed them by now. The data show otherwise: recent studies tested six major AI models on tricky medical questions, and the models provided false information in 50% to 82% of their answers. Even when researchers used specific prompts to guide the AI, nearly half of the responses still contained fabricated details.&lt;/p&gt;

&lt;p&gt;This creates a massive hidden cost for businesses, as a 2024 survey found that 47% of enterprise users made business decisions based on hallucinated AI-generated content. This is dangerous! It forces companies to treat AI errors as an unavoidable operational expense. Employees now spend approximately 4.3 hours every week just fact-checking AI outputs, and they must act as babysitters for software that was supposed to automate their work.&lt;/p&gt;

&lt;p&gt;Why The Machine Lies&lt;br&gt;
To fix the problem, you must understand the mechanism behind it. Large language models do not know facts. They have no database of truth inside them; they are prediction engines.&lt;/p&gt;

&lt;p&gt;Illustration from: &lt;a href="https://www.ssw.com.au/" rel="noopener noreferrer"&gt;https://www.ssw.com.au/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you ask a question, the model examines your words and estimates the probability of the next word. It repeats this process over and over, like a very advanced version of your phone’s autocomplete.&lt;/p&gt;

&lt;p&gt;If you ask about the 184th president, the model does not check a history book. Instead, it identifies the pattern of a presidential biography, predicts words that sound like a biography, and prioritizes the language’s flow over accuracy.&lt;/p&gt;
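&lt;p&gt;The prediction loop described above can be sketched with a toy bigram "autocomplete": count which word most often follows each word, then always emit the top candidate. The corpus and function names here are invented for illustration; a real LLM works with probabilities over an entire vocabulary, not raw counts.&lt;/p&gt;

```python
from collections import Counter, defaultdict

# Count which word follows each word in a tiny corpus.
corpus = "the president was inaugurated . the president signed the bill".split()
successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def predict_next(word: str) -> str:
    # Always pick the most frequent successor, the way a model picks a high-probability token.
    return successors[word].most_common(1)[0][0]

print(predict_next("the"))  # "president" follows "the" most often in this corpus
```

&lt;p&gt;Notice the model never checks whether its output is true; it only checks what usually comes next. That is the whole mechanism behind a confident, fluent, wrong answer.&lt;/p&gt;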

&lt;p&gt;This happens because of “long-tail knowledge deficits.” If a fact appears rarely in the training data, the model struggles to recall it accurately. Researchers have found that facts appearing only once in the training data are hallucinated at high rates, on the order of 20% of the time or more. But because the model is trained to be helpful, it guesses and fills in the gaps with plausible-sounding noise.&lt;/p&gt;

&lt;p&gt;The New Way&lt;br&gt;
For a long time, the only solution was to build bigger models. The theory was that a larger brain would make fewer mistakes. That theory was wrong. Recent benchmarks show that larger, more “reasoning-heavy” models can actually hallucinate more. OpenAI’s o3 model showed a hallucination rate of 33% on specific tests. The smaller o4-mini model reached 48%. Intelligence does not equal honesty. Engineers are now moving away from brute force and are using three specific architectural changes to force the AI to stick to the truth.&lt;/p&gt;

&lt;p&gt;Solution 1: The Open Book Test (RAG)&lt;br&gt;
The most effective current technique is called Retrieval-Augmented Generation (RAG).&lt;/p&gt;

&lt;p&gt;Illustration from: &lt;a href="https://allganize.ai" rel="noopener noreferrer"&gt;https://allganize.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RAG gives the AI an open-book test instead of a closed-book one. Instead of guessing, the AI pauses, searches a trusted set of documents (like your company’s files or a verified database) for the answer, and then writes a response based only on that evidence. It cannot simply make things up, because it must stick to the facts it just read. However, RAG has limits: if the documents it retrieves are outdated, the AI will confidently repeat that old information (garbage in, garbage out), because the technique is only as good as the data you let it access.&lt;/p&gt;

&lt;p&gt;Solution 2: Multi-Agent Verification&lt;br&gt;
Another promising method uses multiple AI models at once. The industry is adopting multi-agent systems in which different AI models (though many behave similarly, since they share much of the same pretraining data) argue with each other. One agent acts as the writer while a second acts as a ruthless critic. The writer generates a draft; the critic hunts for logical errors and hallucinations. If the critic finds a mistake, it rejects the draft, and the models debate until they reach consensus. This adversarial debate mechanism mimics human peer review. Recent studies by Yang and colleagues show that it significantly improves accuracy on complex reasoning tasks compared to single models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz5lwkiqzkqxw4v1t8nf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz5lwkiqzkqxw4v1t8nf.png" alt=" " width="800" height="535"&gt;&lt;/a&gt;&lt;br&gt;
Solution 3: Hybrid Approach (Calibration)&lt;br&gt;
The most exciting solution changes how we teach the model to behave. We currently train models using Reinforcement Learning from Human Feedback (RLHF). This standard method rewards the AI for sounding confident, effectively teaching the system to lie to you.&lt;br&gt;
Engineers are fixing this by changing the scoring system. We now add a severe mathematical penalty when the model guesses wrong. We give the model a small reward when it admits it does not know the answer, creating an incentive for honesty. This approach requires massive human infrastructure. &lt;/p&gt;
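&lt;p&gt;The incentive change can be sketched as a scoring rule. The numbers below are invented for illustration, not taken from any real training run; the point is that under an asymmetric penalty, guessing only pays off when the model is genuinely confident:&lt;/p&gt;

```python
def reward(answer: str, truth: str) -> float:
    # Calibration scoring sketch: small reward for honest abstention,
    # severe penalty for a confident wrong guess.
    if answer == "I don't know":
        return 0.2
    return 1.0 if answer == truth else -4.0

def expected_reward(p_correct: float) -> float:
    """Expected score of guessing when the model is right with probability p_correct."""
    return p_correct * 1.0 + (1 - p_correct) * -4.0

# Under this scheme, guessing beats abstaining only when
# expected_reward(p) exceeds 0.2, i.e. when p exceeds 0.84.
```

&lt;p&gt;Solving expected_reward(p) = 0.2 gives p = 0.84: below 84% confidence, the rational move under this (illustrative) scoring rule is to abstain.&lt;/p&gt;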

&lt;p&gt;Companies like Scale AI employ over 240,000 human annotators to review model output, explicitly labeling instances where the model should have refused to answer. This calibration aligns the model’s internal confidence with its actual accuracy.&lt;/p&gt;

&lt;p&gt;What You Can Do Now&lt;br&gt;
You must rigorously verify every claim: treat AI output as a rough draft rather than a final product. Use tools like Perplexity that provide direct links to sources, so you can validate the citations yourself. If you rely on these tools for work, you need to adapt your professional workflow to account for these risks. The goal is not to eliminate hallucinations entirely, as that is mathematically impossible with current model architectures. The goal is to build systems that catch the lies before they reach you. We are building safety nets, verifiers, and calibration tools to teach the machine that it is okay to say “I don’t know.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9f8ejkphcnlzht3ktqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9f8ejkphcnlzht3ktqp.png" alt=" " width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dlisl3eixa30ih07lfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dlisl3eixa30ih07lfb.png" alt=" " width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Analysis: "Attention Is All You Need"</title>
      <dc:creator>Sandeep Salwan</dc:creator>
      <pubDate>Fri, 10 Oct 2025 15:10:33 +0000</pubDate>
      <link>https://dev.to/sandeep_salwan/analysis-attention-is-all-you-need-i9i</link>
      <guid>https://dev.to/sandeep_salwan/analysis-attention-is-all-you-need-i9i</guid>
      <description>&lt;p&gt;"Attention Is All You Need" introduced the Transformer architecture which is the foundation for modern language models. Its communication style shows the values of the AI research community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building Ethos&lt;/strong&gt;&lt;br&gt;
The paper lists eight authors who work at Google Brain, Google Research, and the University of Toronto. A note states that the author ordering is random, highlighting the researchers' focus on teamwork rather than one-upping each other. They establish authority through their significant affiliations and through the contributions of well-known researchers, so the paper opens without needing to argue for its own authority. A footnote on the first page details each author's contribution; for example, it credits Noam Shazeer with proposing scaled dot-product attention. This remarkable footnote reinforces authority with transparency, closely detailing each person's role, from designing the first models to accelerating research with a new codebase. It builds trust in a community that values open, transparent collaboration. The authors do not need to boast about their credentials; their affiliations and the paper's venue do that work for them. The paper was presented at the prestigious NIPS 2017 conference, and publication at NIPS signals that the work passed a rigorous peer-review process, giving it an immediate stamp of approval.&lt;br&gt;
&lt;strong&gt;Purpose, Audience, and Content Level&lt;/strong&gt;&lt;br&gt;
The text informs and persuades. It presents the new Transformer architecture while concurrently arguing that this model is better than the older, previously state-of-the-art (SOTA) methods like recurrent neural networks. The audience is machine learning experts: the paper uses technical terms (almost on par with Jameson's postmodernism) like "sequence transduction" and "auto-regressive," and it is a challenging read without a solid understanding of linear algebra and neural networks. This specialized language allowed efficient communication between researchers, even if it was not yet clear how beneficial the model would become to the AI community.&lt;br&gt;
Additionally, the paper is written for an audience with limited time, allowing readers to skip directly to the results. The introduction follows a straightforward narrative: it opens with the problem the community faces, noting that the sequential nature of RNNs "precludes parallelization," signaling the bottleneck the dominant technology imposed. This helped readers see why the new architecture is vital. Math is the primary tool of explanation because it reads as more credible and shows that the work has been tested.&lt;br&gt;
&lt;strong&gt;Context and Sources&lt;/strong&gt;&lt;br&gt;
The authors cite many sources, like the paper on the Adam optimizer used for training. There are no ads surrounding the text. The paper's persuasive power comes from its problem-solution structure. The introduction establishes a clear problem, highlighting the "inherently sequential nature" of RNNs as a "fundamental constraint." This language frames the old method as a barrier to progress and situates the work within existing research. They treat sources as a foundation for their own ideas, citing "residual dropout" and the "Adam optimizer," as well as competing and alternative approaches. The end of the paper provides a solution to the problem RNNs have, and it focuses heavily on preventing ambiguity by being clear. Citing both foundational work and competing models like ByteNet and ConvS2S lends the paper additional ethos. The conclusion is also unique: rather than a typical summary, it ends with an agenda for future research, stating, "We are excited about the future of attention-based models and plan to apply them to other tasks." The authors presented the paper so other researchers could figure out how to use it.&lt;br&gt;
&lt;strong&gt;Format and Language&lt;/strong&gt;&lt;br&gt;
The paper follows a typical scientific structure, moving in a clear order through sections on the model, training, and results. Each part is labeled and easy to follow. The tone stays formal and focused; the writing is tight and exact. The authors use the active voice and write as "we," keeping the focus on their methods and results. The style feels deliberate, confident, and built around precision. A sample sentence: "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." The text avoids figurative prose like metaphors or similes because the authors want the results to be reproducible. The abstract is essential, acting as a high-density executive summary. They provide proof, such as new SOTA scores of "28.4" BLEU on English-to-German and "41.8" BLEU on English-to-French translation. Section headings like "3.1 Encoder and Decoder Stacks" let readers go directly to the information they need. This reliance on quantitative benchmarks is a key rhetorical strategy because AI research establishes authority through measurable, reproducible progress. The researchers persuade by presenting hard numbers as proof of success, which lands harder than any descriptive language. The title "Attention Is All You Need" is atypical of academic paper titles, making the paper more accessible and signaling that the researchers are providing a comprehensive solution.&lt;br&gt;
&lt;strong&gt;Visuals and Mathematics&lt;/strong&gt;&lt;br&gt;
Visuals are critical to the paper's argument. Figure 1 provides the famous schematic of the Transformer architecture, referenced and discussed in virtually every AI course. Figure 2 illustrates Scaled Dot-Product Attention and Multi-Head Attention, the paper's core mathematical machinery. Table 2 compares the Transformer's performance to previous SOTA models on BLEU scores and training costs. Figure 3 makes tricky concepts easier to grasp, providing visible evidence of how the model learns linguistic structure. The LaTeX equations, like Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, also function rhetorically, signaling to readers that the proposed mechanism is a fundamental truth.&lt;br&gt;
&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
"Attention Is All You Need" shows the communication style of the AI research community: empirical proof grounded in prior work. The authors inform their audience about a new architecture and persuade readers with performance data. They even published a public code repo, displaying confidence in their work, an extra gesture that helped make this paper so foundational. The paper's dense writing prioritizes extreme precision. In the field of CS and AI, arguments are won with better models and superior results, as the current LLM race demonstrates. This paper presented both.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>In-Depth Analysis: "Attention Is All You Need"</title>
      <dc:creator>Sandeep Salwan</dc:creator>
      <pubDate>Fri, 10 Oct 2025 15:07:36 +0000</pubDate>
      <link>https://dev.to/sandeep_salwan/in-depth-analysis-attention-is-all-you-need-267n</link>
      <guid>https://dev.to/sandeep_salwan/in-depth-analysis-attention-is-all-you-need-267n</guid>
      <description>&lt;p&gt;"Attention Is All You Need" introduced the Transformer architecture which is the foundation for modern language models. Its communication style shows the values of the AI research community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building Ethos&lt;/strong&gt;&lt;br&gt;
The paper lists eight authors who work at Google Brain, Google Research, and the University of Toronto. A note states that the author ordering is random, highlighting the researchers' focus on teamwork rather than one-upping each other. They establish authority through their significant affiliations and through the contributions of well-known researchers, so the paper opens without needing to argue for its own authority. A footnote on the first page details each author's contribution; for example, it credits Noam Shazeer with proposing scaled dot-product attention. This remarkable footnote reinforces authority with transparency, closely detailing each person's role, from designing the first models to accelerating research with a new codebase. It builds trust in a community that values open, transparent collaboration. The authors do not need to boast about their credentials; their affiliations and the paper's venue do that work for them. The paper was presented at the prestigious NIPS 2017 conference, and publication at NIPS signals that the work passed a rigorous peer-review process, giving it an immediate stamp of approval.&lt;br&gt;
&lt;strong&gt;Purpose, Audience, and Content Level&lt;/strong&gt;&lt;br&gt;
The text informs and persuades. It presents the new Transformer architecture while concurrently arguing that this model is better than the older, previously state-of-the-art (SOTA) methods like recurrent neural networks. The audience is machine learning experts: the paper uses technical terms (almost on par with Jameson's postmodernism) like "sequence transduction" and "auto-regressive," and it is a challenging read without a solid understanding of linear algebra and neural networks. This specialized language allowed efficient communication between researchers, even if it was not yet clear how beneficial the model would become to the AI community.&lt;br&gt;
Additionally, the paper is written for an audience with limited time, allowing readers to skip directly to the results. The introduction follows a straightforward narrative: it opens with the problem the community faces, noting that the sequential nature of RNNs "precludes parallelization," signaling the bottleneck the dominant technology imposed. This helped readers see why the new architecture is vital. Math is the primary tool of explanation because it reads as more credible and shows that the work has been tested.&lt;br&gt;
&lt;strong&gt;Context and Sources&lt;/strong&gt;&lt;br&gt;
The authors cite many sources, like the paper on the Adam optimizer used for training. There are no ads surrounding the text. The paper's persuasive power comes from its problem-solution structure. The introduction establishes a clear problem, highlighting the "inherently sequential nature" of RNNs as a "fundamental constraint." This language frames the old method as a barrier to progress and situates the work within existing research. They treat sources as a foundation for their own ideas, citing "residual dropout" and the "Adam optimizer," as well as competing and alternative approaches. The end of the paper provides a solution to the problem RNNs have, and it focuses heavily on preventing ambiguity by being clear. Citing both foundational work and competing models like ByteNet and ConvS2S lends the paper additional ethos. The conclusion is also unique: rather than a typical summary, it ends with an agenda for future research, stating, "We are excited about the future of attention-based models and plan to apply them to other tasks." The authors presented the paper so other researchers could figure out how to use it.&lt;br&gt;
&lt;strong&gt;Format and Language&lt;/strong&gt;&lt;br&gt;
The paper follows a typical scientific structure, moving in a clear order through sections on the model, training, and results. Each part is labeled and easy to follow. The tone stays formal and focused; the writing is tight and exact. The authors use the active voice and write as "we," keeping the focus on their methods and results. The style feels deliberate, confident, and built around precision. A sample sentence: "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." The text avoids figurative prose like metaphors or similes because the authors want the results to be reproducible. The abstract is essential, acting as a high-density executive summary. They provide proof, such as new SOTA scores of "28.4" BLEU on English-to-German and "41.8" BLEU on English-to-French translation. Section headings like "3.1 Encoder and Decoder Stacks" let readers go directly to the information they need. This reliance on quantitative benchmarks is a key rhetorical strategy because AI research establishes authority through measurable, reproducible progress. The researchers persuade by presenting hard numbers as proof of success, which lands harder than any descriptive language. The title "Attention Is All You Need" is atypical of academic paper titles, making the paper more accessible and signaling that the researchers are providing a comprehensive solution.&lt;br&gt;
&lt;strong&gt;Visuals and Mathematics&lt;/strong&gt;&lt;br&gt;
Visuals are critical to the paper's argument. Figure 1 provides the famous schematic of the Transformer architecture, referenced and discussed in virtually every AI course. Figure 2 illustrates Scaled Dot-Product Attention and Multi-Head Attention, the paper's core mathematical machinery. Table 2 compares the Transformer's performance to previous SOTA models on BLEU scores and training costs. Figure 3 makes tricky concepts easier to grasp, providing visible evidence of how the model learns linguistic structure. The LaTeX equations, like Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, also function rhetorically, signaling to readers that the proposed mechanism is a fundamental truth.&lt;br&gt;
&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
"Attention Is All You Need" shows the communication style of the AI research community: empirical proof grounded in prior work. The authors inform their audience about a new architecture and persuade readers with performance data. They even published a public code repo, displaying confidence in their work, an extra gesture that helped make this paper so foundational. The paper's dense writing prioritizes extreme precision. In the field of CS and AI, arguments are won with better models and superior results, as the current LLM race demonstrates. This paper presented both.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>architecture</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Transformer Architecture</title>
      <dc:creator>Sandeep Salwan</dc:creator>
      <pubDate>Mon, 15 Sep 2025 19:37:28 +0000</pubDate>
      <link>https://dev.to/sandeep_salwan/transformer-architecture-4eo1</link>
      <guid>https://dev.to/sandeep_salwan/transformer-architecture-4eo1</guid>
      <description>&lt;p&gt;Before Transformers, models called RNNs were used, but Transformers are better because they solve issues like being difficult to parallelize and the exploding gradient problem.&lt;/p&gt;

&lt;p&gt;Line 1: “The person executed the swap because it was trained to do so.”&lt;br&gt;
Line 2: “The person executed the swap because it was an effective hedge.”&lt;/p&gt;

&lt;p&gt;Look carefully at those two lines. Notice how in line 1, “it” refers to the "person".&lt;br&gt;
In line 2, “it” refers to the "swap".&lt;/p&gt;

&lt;p&gt;Transformers figure out what “it” refers to entirely through numbers by discovering how related the word pairs are.&lt;/p&gt;

&lt;p&gt;These numbers are stored in tensors: a vector is a 1D tensor, a matrix is a 2D tensor, and higher-dimensional arrays are ND tensors. The input embeddings are learned vectors whose values come to reflect the frequency and co-occurrence patterns of words.&lt;/p&gt;

&lt;p&gt;This architecture relies on three key inputs: the Query matrix, the Key matrix, and the Value matrix.&lt;/p&gt;

&lt;p&gt;Imagine you are a detective. The Query is like your list of questions (Who or what is “it”?). The Key is the evidence each word carries (what every word offers as a clue). When you multiply Query by Key, you get a set of attention scores (numbers showing which clues are most relevant).&lt;/p&gt;

&lt;p&gt;A lot of math occurs here: the scores are scaled (divided by the square root of the key dimension to keep them stable), normalized with softmax (so they become probabilities that sum to 1), and then used as weights.&lt;/p&gt;

&lt;p&gt;Finally, the Value is the actual content of the evidence (the meaning of each word e.g. person is living and swap is an action). Multiplying the attention weights by the Value matrix gives the final information the model carries forward to make the right decision about “it.”&lt;/p&gt;
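&lt;p&gt;Putting the detective story into numbers, here is scaled dot-product attention on tiny made-up matrices. The values are invented for illustration; a real model learns Q, K, and V through training:&lt;/p&gt;

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # how relevant each Key is to the Query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax: each row sums to 1
    return weights @ V                                    # weighted blend of the Values

Q = np.array([[1.0, 0.0, 1.0, 0.0]])   # the detective's question ("what is 'it'?")
K = np.array([[1.0, 0.0, 1.0, 0.0],    # clue carried by "person"
              [0.0, 1.0, 0.0, 1.0]])   # clue carried by "swap"
V = np.array([[0.9, 0.1],              # meaning of "person"
              [0.1, 0.9]])             # meaning of "swap"

out = attention(Q, K, V)
```

&lt;p&gt;The query matches the "person" key far more strongly, so the output vector is pulled toward the "person" value row. That pull is how the model resolves what "it" refers to.&lt;/p&gt;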

&lt;p&gt;All of these abstract (Q, K, V) matrix numbers are trained through backpropagation. Training works by predicting an output, comparing it to the true label, measuring the loss (the bigger the gap between predicted and actual output, the higher the loss), calculating gradients (slopes showing how much each weight contributed to that error), and then updating each weight in the direction opposite its slope (e.g., if the gradient of the loss with respect to a weight is 2, the weight is nudged in the −2 direction, scaled by the learning rate).&lt;/p&gt;
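&lt;p&gt;One pass of that loop, shrunk to a single weight. The toy model y = w * x and its numbers are invented for illustration:&lt;/p&gt;

```python
def loss(w, x, y_true):
    # Squared error: the bigger the gap between prediction and truth, the higher the loss.
    return (w * x - y_true) ** 2

def gradient(w, x, y_true):
    # d(loss)/dw by the chain rule: how much this weight contributed to the error.
    return 2 * (w * x - y_true) * x

w, lr = 0.0, 0.1          # start from a bad weight, with a small learning rate
x, y_true = 1.0, 3.0      # one training example; the "right" weight is 3.0
for _ in range(50):
    w -= lr * gradient(w, x, y_true)   # move opposite the slope
# w converges toward 3.0, and loss(w, x, y_true) shrinks toward 0
```

&lt;p&gt;Real training does exactly this, just simultaneously over billions of weights and batches of examples.&lt;/p&gt;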

&lt;p&gt;Now you know at a high level how Transformers (used by today's top LLMs) work: they’re just predicting the next word in a sequence.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>fullstack</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
