<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chaitany</title>
    <description>The latest articles on DEV Community by Chaitany (@patelchaitany).</description>
    <link>https://dev.to/patelchaitany</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3486408%2F4e9e4009-5f97-4a0f-9ee9-5c6c1fa2351b.png</url>
      <title>DEV Community: Chaitany</title>
      <link>https://dev.to/patelchaitany</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/patelchaitany"/>
    <language>en</language>
    <item>
      <title>Sound Is All You Need: From Good UX to Great UX</title>
      <dc:creator>Chaitany</dc:creator>
      <pubDate>Sun, 28 Jun 2026 11:01:40 +0000</pubDate>
      <link>https://dev.to/patelchaitany/sound-is-all-you-need-from-good-ux-to-great-ux-5ej9</link>
      <guid>https://dev.to/patelchaitany/sound-is-all-you-need-from-good-ux-to-great-ux-5ej9</guid>
      <description>&lt;p&gt;So when we think about great design what comes to our mind is great visuals, sleek UI, some futuristic interface like holograms or some fancy visual graphics. That's what we all picture right? But here's the thing nobody really talks about, Sound.&lt;/p&gt;

&lt;p&gt;If we trace back to history, way back to the primitive age, when there is no UI or laptop or any kind of tech stuff, back then if we want to convey information we used to do it using sound and signs. That was it. That was the whole communication system. And as technology evolved from the punching card to write code, to the first CLI, to the GUI, the way we interact with computers kept changing but sound was always there somewhere in the background doing its job quietly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sound Was Always There, We Just Stopped Noticing
&lt;/h2&gt;

&lt;p&gt;Before GUI is introduced, there is only way for computer to pass down information to user, either use the text or use simpler sound effects. Back then sound was used but not that widely. When the new GUI interface is introduced to the user, the user found new way to interact with the computers. Today what we consider obvious like mouse clicking, dragging mouse, clicking button, controlling bunch of things, we today take for granted, all of this is first invented by someone and back then all of this new interface is brought to end user.&lt;/p&gt;

&lt;p&gt;And in some or other way the sound came for rescue. The clicking button sound, closing and opening tab sound, increasing volume to decreasing volume, from the booting computer to shutting down computer. Sound convey information to user in much more cleaner way than UI for simpler or minute details. As we all know the devil lies in the details, and sound is that detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vision Is Directional, Sound Is Omnidirectional
&lt;/h2&gt;

&lt;p&gt;Now here's something really interesting that we don't really think about. Vision is directional, it demands your full attention. You have to look at the screen, you have to focus your eyes, you have to read. But sound? Sound is omnidirectional. It works in the background. You don't need to look anywhere, you just hear it and you know what's happening.&lt;/p&gt;

&lt;p&gt;Think about it. When Google Maps needs to tell you "hey take a right turn" while you're driving, no pop-up could do that job safely. You can't be reading your screen while driving at 80 km/h. So what do they do? They play a reroute chime. Sound comes for the rescue. Again.&lt;/p&gt;

&lt;h2&gt;
  
  
  People Remember Sound
&lt;/h2&gt;

&lt;p&gt;Now here's something for the business folks too. People might forget your brand name, brand colour, but if you create a brand theme tune? That sticks. Think about the Nokia iconic handshaking tone. The Windows XP opening sound. Or take example of Indian brands like Titan iconic tune, Airtel tune, there are many more examples but we get the point. People remember sound. It gets embedded into our memory in a way that visuals sometimes just can't.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Physical Becomes Digital, Sound Fills the Gap
&lt;/h2&gt;

&lt;p&gt;Now when the first iPhone is introduced it replaced the physical keyboard with the touch screen. But many were skeptical right? So what Apple did is that they introduced haptic and the sound over there as replacement for the physical touch. We have seen this so many times, every time a new company replaces something physical with the digital, then it started feeling just enough like physical.&lt;/p&gt;

&lt;p&gt;Like with Apple Pencil, it started emitting the sound of pen rubbing against the paper. You're not writing on paper but your brain thinks you are because the sound is telling you so. And there are studies which show how different sounds and pitches could convey different meanings to us, as those are embedded into our memory from survival instincts to some natural actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Research Actually Backs This Up
&lt;/h2&gt;

&lt;p&gt;Now I'm not just saying all of this from my gut feeling. There is proper HCI research that backs this up.&lt;/p&gt;

&lt;p&gt;Ma et al. did a study in 2014 on touch typing performance with keyclick feedback, published in IEEE. What they found is that when you add that clicking sound to a touchscreen keyboard, people actually type faster and make fewer errors. Microsoft Research also found the same thing in 2015, haptic keyclick feedback improves typing speed and reduces typing errors. So it's not just about feeling nice, it actually improves performance.&lt;/p&gt;

&lt;p&gt;Sodnik et al. in 2008 studied auditory versus visual interfaces for use while driving and found that auditory feedback is significantly safer and more effective when the user's eyes need to be somewhere else. Which makes total sense if you think about it.&lt;/p&gt;

&lt;p&gt;And there's even research from Nova Southeastern University on using auditory feedback in IDEs to help novice developers. Imagine your code editor giving you subtle sound cues when something compiles successfully versus when it fails, instead of making you hunt for that tiny red squiggly line somewhere in your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now If You See, Smart Devices Already Get This
&lt;/h2&gt;

&lt;p&gt;Your smart watch, your earphone, your headphone, all of this convey some kind of small or big information in form of sound. The little beep when your earbuds connect. The subtle ding when your Apple Watch detects you've completed your standing goal. These are tiny moments but they add up to make the whole experience feel alive and responsive.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI and VR, Sound Becomes Even More Essential
&lt;/h2&gt;

&lt;p&gt;Now the same thing is happening with AI. As AI is evolving, AI will find way to communicate with human which is going to be the path of least resistance, which could be using sound as medium. AI also started from text, same as the computers started from the punching machine to the CRT to GUI. Same way the AI and human interaction is going to evolve from text box to voice. We're already seeing this with voice assistants getting smarter every day.&lt;/p&gt;

&lt;p&gt;And this is going to happen in the VR space too. When there is no physical surface, when you can't touch anything real, we need to give the user something to hold onto. We need to convey information to user and we need to use sound over there. Because in VR you remove all physical surfaces, and if there's no feedback, the user feels lost. Sound becomes the anchor.&lt;/p&gt;

&lt;p&gt;But here's the catch, more sound isn't better sound. There's a point where every ping becomes noise and users stop hearing anything, including the alerts that actually matter. It's like that boy who cried wolf. If everything makes a sound, nothing is important anymore.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Open Source Tools Nobody Is Using
&lt;/h2&gt;

&lt;p&gt;Now here's what most developers don't realize, the open-source ecosystem already has solid infrastructure for sound design. We just aren't using it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;freedesktop.org Sound Theme Specification&lt;/strong&gt;, there's an actual open standard that defines how event sounds should work across Linux desktops. It's well-designed, it exists, and almost nobody outside of desktop environment maintainers knows about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;libcanberra&lt;/strong&gt;, this is the library that implements that spec. It's LGPL-2.1 licensed, thread-safe, supports ALSA and PulseAudio backends. This is the actual plumbing behind GNOME's event sounds, and any open-source project can use it.&lt;/p&gt;

&lt;p&gt;And then look at how &lt;strong&gt;GNOME and KDE&lt;/strong&gt; handle sound completely differently. GNOME goes minimalist, few sounds, subtle, almost silent. KDE's community goes the other way with themes like "Ocean" and "Borealis" that are designed so that a dozen workstations in the same office wouldn't distract each other. Two completely different philosophies, both open source, both have something to teach us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SND (snd-lib)&lt;/strong&gt;, this is an MIT licensed UI sound library available via npm. It has categorized sounds for taps, notifications, cautions, loading states, and even "celebration" sounds for when users complete a major goal like sending an email or finishing the last task. This toolkit exists and most developers have never heard of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Material Design Sound Guidelines&lt;/strong&gt;, Google's framework categorizes UI sounds into primary sounds for system feedback on every interaction, secondary sounds for less frequent state changes, and ambient sounds for personality and emotion. A practical way to think about sound that any project can adopt.&lt;/p&gt;

&lt;h2&gt;
  
  
  So What Should You Take Away From This?
&lt;/h2&gt;

&lt;p&gt;Sound isn't decoration. It's a functional design layer. It confirms actions, communicates system state, and reduces cognitive load on the user's eyes. Every other design dimension, colour, typography, layout, motion, has established open-source tooling, community discussions, and conference talks. Sound is the blind spot.&lt;/p&gt;

&lt;p&gt;We ship projects with carefully considered visual design and completely absent audio design. That needs to change. And the tools to do it? They already exist. We just need to start using them.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Academic Research
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ma et al. (2014), &lt;a href="https://ieeexplore.ieee.org/document/6775459/" rel="noopener noreferrer"&gt;A Study of Touch Typing Performance with Keyclick Feedback&lt;/a&gt; (IEEE)&lt;/li&gt;
&lt;li&gt;Balters &amp;amp; Steinert (2015), &lt;a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129056" rel="noopener noreferrer"&gt;The Influence of Emotion on Keyboard Typing: An Experimental Study Using Auditory Stimuli&lt;/a&gt; (PLOS ONE)&lt;/li&gt;
&lt;li&gt;Microsoft Research (2015), &lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2015/06/Ma_etal_WHC2015.pdf" rel="noopener noreferrer"&gt;Haptic Keyclick Feedback Improves Typing Speed and Reduces Typing Errors&lt;/a&gt; (World Haptics Conference)&lt;/li&gt;
&lt;li&gt;Sodnik et al. (2008), &lt;a href="https://www.sciencedirect.com/science/article/abs/pii/S1071581907001553" rel="noopener noreferrer"&gt;A User Study of Auditory Versus Visual Interfaces for Use While Driving&lt;/a&gt; (International Journal of Human-Computer Studies)&lt;/li&gt;
&lt;li&gt;Jacob &amp;amp; Mack (2018), &lt;a href="https://www.yorku.ca/mack/hcii2018a.html" rel="noopener noreferrer"&gt;Comparison of Feedback Modes for the Visually Impaired: Vibration vs. Audio&lt;/a&gt; (HCII 2018)&lt;/li&gt;
&lt;li&gt;Nova Southeastern University, &lt;a href="https://nsuworks.nova.edu/gscis_etd/1148/" rel="noopener noreferrer"&gt;Auditory Feedback in IDEs to Provide Cognitive Support for Novice Developers&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Apple (2020), &lt;a href="https://www.appleworld.today/blog/2020/7/28/apple-granted-patent-for-haptic-and-audio-feedback-on-an-apple-pencil" rel="noopener noreferrer"&gt;US Patent No. 10,691,209: Haptic and Audio Feedback for Apple Pencil&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Open Source Tools &amp;amp; Specifications
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.freedesktop.org/wiki/Specifications/sound-theme-spec/" rel="noopener noreferrer"&gt;freedesktop.org Sound Theme Specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer-old.gnome.org/libcanberra/" rel="noopener noreferrer"&gt;libcanberra, GNOME Developer Docs&lt;/a&gt; (LGPL-2.1)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/KDE/ocean-sound-theme" rel="noopener noreferrer"&gt;KDE Ocean Sound Theme&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://snd.dev/" rel="noopener noreferrer"&gt;SND, UI Sound Assets for Developers&lt;/a&gt; / &lt;a href="https://github.com/snd-lib/snd-lib" rel="noopener noreferrer"&gt;snd-lib on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m2.material.io/design/sound/about-sound.html" rel="noopener noreferrer"&gt;Google Material Design, About Sound&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m2.material.io/design/sound/applying-sound-to-ui.html" rel="noopener noreferrer"&gt;Google Material Design, Applying Sound to UI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://opensource.com/article/18/6/sound-themes-linux" rel="noopener noreferrer"&gt;Sound Themes in Linux: What Every User Should Know&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wiki.archlinux.org/title/Libcanberra" rel="noopener noreferrer"&gt;libcanberra, ArchWiki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://store.kde.org/p/1149503/" rel="noopener noreferrer"&gt;KDE Borealis Sound Theme&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>uxdesign</category>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>PolarQuant: Quantizing KV Caches with Polar Transformation</title>
      <dc:creator>Chaitany</dc:creator>
      <pubDate>Sat, 28 Mar 2026 21:18:13 +0000</pubDate>
      <link>https://dev.to/patelchaitany/polarquant-quantizing-kv-caches-with-polar-transformation-255</link>
      <guid>https://dev.to/patelchaitany/polarquant-quantizing-kv-caches-with-polar-transformation-255</guid>
      <description>&lt;p&gt;&lt;em&gt;A deep dive into how PolarQuant compresses LLM key caches by 4x using polar coordinates, and why it works so well.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;If you have ever tried running a large language model on long contexts (32K, 64K, or 128K tokens), you have hit the wall: the &lt;strong&gt;KV cache&lt;/strong&gt;. It grows linearly with sequence length, eating up GPU memory and becoming the dominant bottleneck during inference.&lt;/p&gt;

&lt;p&gt;PolarQuant, introduced by researchers from KAIST, Google Research, and Yale (&lt;a href="https://arxiv.org/abs/2502.02617" rel="noopener noreferrer"&gt;arXiv:2502.02617&lt;/a&gt;), offers an elegant solution. Instead of quantizing key embeddings the usual way (in Cartesian space), it converts them to &lt;strong&gt;polar coordinates&lt;/strong&gt; (angle and radius) and quantizes &lt;em&gt;those&lt;/em&gt; instead. The result is a &lt;strong&gt;~4x compression&lt;/strong&gt; of the key cache with near-lossless quality on long-context benchmarks.&lt;/p&gt;

&lt;p&gt;Let's break down exactly how it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is PolarQuant?
&lt;/h2&gt;

&lt;p&gt;Every time an LLM generates a token, it needs to attend to all previous tokens. To avoid recomputing everything from scratch, the model stores &lt;strong&gt;Key&lt;/strong&gt; and &lt;strong&gt;Value&lt;/strong&gt; embeddings in a cache. This cache grows with every token generated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PolarQuant&lt;/strong&gt; compresses this cache by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Converting key embeddings from &lt;strong&gt;Cartesian coordinates&lt;/strong&gt; into &lt;strong&gt;polar coordinates&lt;/strong&gt; (angle + radius).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantizing&lt;/strong&gt; the angles and radii to low-bit integers (e.g., 4 bits each).&lt;/li&gt;
&lt;li&gt;Computing attention &lt;strong&gt;directly from the quantized representation&lt;/strong&gt;, without ever reconstructing the full keys, using a custom GPU kernel.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why Not Just Quantize Normally?
&lt;/h3&gt;

&lt;p&gt;Traditional methods like KIVI or GPTQ-style quantization work in Cartesian space. They need per-block normalization constants (zero-points, scales) stored in full precision, which adds significant memory overhead, often over 1 extra bit per quantized value.&lt;/p&gt;

&lt;p&gt;PolarQuant sidesteps this problem with a mathematical insight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RoPE&lt;/strong&gt; (Rotary Position Embeddings), used by most modern LLMs, already operates on &lt;strong&gt;pairs of dimensions&lt;/strong&gt; as 2D rotations. This makes polar coordinates a natural fit.&lt;/li&gt;
&lt;li&gt;After applying a random preconditioning matrix, the angles in polar coordinates follow a &lt;strong&gt;predictable, concentrated distribution&lt;/strong&gt;. This means we can quantize them with pre-computed codebooks, and no per-block normalization is needed.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Cartesian to Polar Conversion
&lt;/h3&gt;

&lt;p&gt;Every key vector of dimension &lt;code&gt;d&lt;/code&gt; (e.g., &lt;code&gt;d = 128&lt;/code&gt;) is split into &lt;code&gt;d/2&lt;/code&gt; &lt;strong&gt;pairs of adjacent dimensions&lt;/strong&gt;. Each pair &lt;code&gt;(x, y)&lt;/code&gt; forms a point in a 2D plane.&lt;/p&gt;

&lt;p&gt;The conversion to polar coordinates is straightforward. For each pair, we compute the &lt;strong&gt;angle&lt;/strong&gt; as &lt;code&gt;phi = atan2(y, x)&lt;/code&gt;, which gives the direction of the 2D vector. We also compute the &lt;strong&gt;radius&lt;/strong&gt; as &lt;code&gt;r = sqrt(x^2 + y^2)&lt;/code&gt;, which gives the magnitude. If the angle comes out negative, we shift it by adding &lt;code&gt;2 * pi&lt;/code&gt; so that all angles fall in the range &lt;code&gt;[0, 2*pi)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This conversion is &lt;strong&gt;lossless&lt;/strong&gt;. The same information is represented, just in a different coordinate system. You can always go back: &lt;code&gt;x = r * cos(phi)&lt;/code&gt;, &lt;code&gt;y = r * sin(phi)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Each pair of FP16 values &lt;code&gt;(x, y)&lt;/code&gt; becomes an angle &lt;code&gt;phi&lt;/code&gt; and a radius &lt;code&gt;r&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can see this in the source code at &lt;a href="https://github.com/ericshwu/PolarQuant/blob/main/models/modeling_llama_polar.py" rel="noopener noreferrer"&gt;modeling_llama_polar.py&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;atan2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_states&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;key_states&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt;
&lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;radii&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Paper's Recursive Polar Transformation
&lt;/h3&gt;

&lt;p&gt;The paper goes deeper with a &lt;strong&gt;recursive&lt;/strong&gt; version. Instead of stopping at one level of pairing, it recursively applies the polar transform to the radii themselves.&lt;/p&gt;

&lt;p&gt;First, at &lt;strong&gt;Level 1&lt;/strong&gt;, we pair adjacent dimensions &lt;code&gt;(x1, x2), (x3, x4), ...&lt;/code&gt; into &lt;code&gt;d/2&lt;/code&gt; polar pairs. Each pair gives an angle and a radius.&lt;/p&gt;

&lt;p&gt;Then, at &lt;strong&gt;Level 2&lt;/strong&gt;, we take the &lt;code&gt;d/2&lt;/code&gt; radii from Level 1 and pair them again into &lt;code&gt;d/4&lt;/code&gt; pairs. Each pair of radii gives a new angle and a new radius.&lt;/p&gt;

&lt;p&gt;We keep repeating this process at &lt;strong&gt;Level 3, Level 4&lt;/strong&gt;, and so on, until a single scalar radius remains.&lt;/p&gt;

&lt;p&gt;The final output is &lt;strong&gt;one radius&lt;/strong&gt; plus &lt;strong&gt;(d - 1) angles&lt;/strong&gt; organized across &lt;code&gt;log2(d)&lt;/code&gt; levels. In practice, the implementation only recurses for &lt;strong&gt;4 levels&lt;/strong&gt; (not all &lt;code&gt;log2(d)&lt;/code&gt;), leaving &lt;code&gt;d/16&lt;/code&gt;-dimensional radii stored in full precision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Does This Help?
&lt;/h3&gt;

&lt;p&gt;Here is the key mathematical insight from the paper. After applying a random preconditioning (rotation) matrix to the key vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level 1 angles&lt;/strong&gt; are approximately uniformly distributed over &lt;code&gt;[0, 2*pi)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2+ angles&lt;/strong&gt; follow a distribution proportional to &lt;code&gt;sin^(d-1)(2*theta)&lt;/code&gt;, which is tightly concentrated around &lt;code&gt;pi/4&lt;/code&gt;. The higher the level, the more concentrated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A concentrated distribution means fewer quantization bins are needed to represent it accurately. The paper uses &lt;strong&gt;4 bits&lt;/strong&gt; for Level 1 (wide range) and just &lt;strong&gt;2 bits&lt;/strong&gt; for higher levels (concentrated range).&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: How the Quantization Happens
&lt;/h3&gt;

&lt;p&gt;Once we have polar coordinates, we quantize angles and radii to low-bit integers. This is where the lossy compression happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grouping the tokens.&lt;/strong&gt; First, we organize the keys along the sequence dimension into groups of &lt;code&gt;G&lt;/code&gt; tokens (default &lt;code&gt;G = 128&lt;/code&gt;). All the quantization statistics are computed independently within each group and each 2D sub-dimension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantizing the angle.&lt;/strong&gt; Within each group, we find the minimum and maximum angle values across the &lt;code&gt;G&lt;/code&gt; tokens. The range between them is divided into &lt;code&gt;2^tbits&lt;/code&gt; equal-width bins (with &lt;code&gt;tbits = 4&lt;/code&gt;, that is 16 bins). We compute the step size as &lt;code&gt;delta_phi = (phi_max - phi_min) / 16&lt;/code&gt;. Then, for each token's angle, we figure out which bin it falls into by computing &lt;code&gt;floor((phi - phi_min) / delta_phi)&lt;/code&gt; and clamping the result to the range 0 through 15. This gives us an integer index &lt;code&gt;theta&lt;/code&gt; between 0 and 15.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantizing the radius.&lt;/strong&gt; We do the exact same thing for the radius values. Find the min and max radius within the group, divide the range into &lt;code&gt;2^rbits&lt;/code&gt; bins (again 16 bins with &lt;code&gt;rbits = 4&lt;/code&gt;), and map each radius to an integer index &lt;code&gt;rho&lt;/code&gt; between 0 and 15.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Packing into one byte.&lt;/strong&gt; Since both &lt;code&gt;theta&lt;/code&gt; and &lt;code&gt;rho&lt;/code&gt; are 4-bit values, we combine them into a single &lt;code&gt;uint8&lt;/code&gt; byte by shifting the radius index left by 4 bits and adding the angle index: &lt;code&gt;indices = (rho &amp;lt;&amp;lt; 4) + theta&lt;/code&gt;. This means every 2D sub-dimension per token is stored in just &lt;strong&gt;one byte&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can see the full implementation in the &lt;code&gt;quantize_and_pack_nbit&lt;/code&gt; method at &lt;a href="https://github.com/ericshwu/PolarQuant/blob/main/models/modeling_llama_polar.py" rel="noopener noreferrer"&gt;modeling_llama_polar.py&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;quantize_and_pack_nbit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key_states&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key_states&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_size&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rbits&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbits&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
    &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_size&lt;/span&gt;

    &lt;span class="n"&gt;key_states&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key_states&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;atan2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_states&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;key_states&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt;
    &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tmx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;tscale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tmn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tmn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tscale&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbits&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;radii&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rmx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rmn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;radii&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;radii&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;rscale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rmx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;rmn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rbits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;radii&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;rmn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;rscale&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rbits&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rscale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rmn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tscale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What gets stored per group:&lt;/strong&gt; the packed &lt;code&gt;indices&lt;/code&gt; (uint8), the angle step size and minimum (&lt;code&gt;tscale&lt;/code&gt;, &lt;code&gt;tmn&lt;/code&gt; in float16), and the radius step size and minimum (&lt;code&gt;rscale&lt;/code&gt;, &lt;code&gt;rmn&lt;/code&gt; in float16).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression math:&lt;/strong&gt; Each 2D pair originally takes 4 bytes (two FP16 floats). After quantization, it takes 1 byte (packed uint8) plus a small share of per-group scales. That is roughly a &lt;strong&gt;4x compression&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Paper's Codebook Approach
&lt;/h3&gt;

&lt;p&gt;The paper goes further than simple min/max quantization. Since random preconditioning makes the angle distribution &lt;em&gt;predictable&lt;/em&gt; and &lt;em&gt;analytically known&lt;/em&gt;, they use &lt;strong&gt;1-D k-means clustering&lt;/strong&gt; on the distribution to find optimal bin boundaries and centroids that minimize the mean squared quantization error.&lt;/p&gt;

&lt;p&gt;They also allocate bits differently across levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Level 1 angles (range &lt;code&gt;[0, 2*pi)&lt;/code&gt;): &lt;strong&gt;4 bits&lt;/strong&gt; = 16 bins&lt;/li&gt;
&lt;li&gt;Level 2+ angles (range &lt;code&gt;[0, pi/2]&lt;/code&gt;, concentrated): &lt;strong&gt;2 bits&lt;/strong&gt; = 4 bins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives: &lt;code&gt;16 bits (FP radius) + 32 bits (8 level-1 angles * 4 bits) + 14 bits (7 higher angles * 2 bits) = 62 bits&lt;/code&gt; per block of 16 coordinates = &lt;strong&gt;3.875 bits per coordinate&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 3: Polar to Cartesian (Dequantization)
&lt;/h3&gt;

&lt;p&gt;To reconstruct a key from its quantized form, we reverse the process. There are four steps to this.&lt;/p&gt;

&lt;p&gt;First, we &lt;strong&gt;unpack the byte&lt;/strong&gt; by separating it back into the angle index and the radius index. The lower 4 bits give us the angle index &lt;code&gt;theta&lt;/code&gt; (extracted using a bitwise AND with &lt;code&gt;0xF&lt;/code&gt;), and the upper 4 bits give us the radius index &lt;code&gt;rho&lt;/code&gt; (extracted by right-shifting by 4).&lt;/p&gt;

&lt;p&gt;Second, we &lt;strong&gt;reconstruct the angle&lt;/strong&gt; by converting the bin index back to an actual angle value. Instead of using the left edge of the bin, we use the &lt;strong&gt;midpoint&lt;/strong&gt; by adding 0.5 to the index before multiplying by the step size and adding the minimum: &lt;code&gt;phi_hat = delta_phi * (theta + 0.5) + phi_min&lt;/code&gt;. Using the midpoint halves the maximum quantization error compared to using the bin edge.&lt;/p&gt;

&lt;p&gt;Third, we &lt;strong&gt;reconstruct the radius&lt;/strong&gt; the same way: &lt;code&gt;r_hat = delta_r * (rho + 0.5) + r_min&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Finally, we &lt;strong&gt;convert back to Cartesian coordinates&lt;/strong&gt; using the standard polar-to-Cartesian formulas: &lt;code&gt;x_hat = r_hat * cos(phi_hat)&lt;/code&gt; and &lt;code&gt;y_hat = r_hat * sin(phi_hat)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can see this reconstruction in the PyTorch reference code at &lt;a href="https://github.com/ericshwu/PolarQuant/blob/main/models/kernel4group.py" rel="noopener noreferrer"&gt;kernel4group.py&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tscale&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tscale&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tmn&lt;/span&gt;
&lt;span class="n"&gt;radii&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rscale&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rscale&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rmn&lt;/span&gt;
&lt;span class="n"&gt;key_states_reconstruct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;radii&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;radii&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Paper's Recursive Dequantization
&lt;/h3&gt;

&lt;p&gt;For the multi-level recursive transformation, dequantization works &lt;strong&gt;top-down&lt;/strong&gt;, starting from the single radius at the highest level and splitting downward.&lt;/p&gt;

&lt;p&gt;At each level, we take each radius value and split it into two values by looking up the stored angle centroid for that position. The first value is the radius multiplied by the cosine of the centroid angle, and the second is the radius multiplied by the sine. This doubles the number of coordinates at each level. We repeat this from the top level all the way down to level 1, at which point we have the full reconstructed vector. If preconditioning was applied, we multiply by the transpose of the preconditioning matrix to undo it.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 4: The Efficient Decode Kernel (Skip Reconstruction Entirely)
&lt;/h3&gt;

&lt;p&gt;Here is the cleverest part of the implementation. During decoding, we &lt;strong&gt;never actually reconstruct the keys&lt;/strong&gt;. Instead, we compute the attention scores (the dot product &lt;code&gt;Q * K^T&lt;/code&gt;) directly from the packed byte codes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mathematical trick:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dot product between a query &lt;code&gt;q&lt;/code&gt; and a reconstructed key, for each 2D sub-dimension, can be written as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;q_x * r_hat * cos(phi_hat) + q_y * r_hat * sin(phi_hat)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If we factor out the radius, this becomes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;r_hat * (q_x * cos(phi_hat) + q_y * sin(phi_hat))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The expression inside the parentheses depends &lt;strong&gt;only on the angle bin index&lt;/strong&gt;. Since the angle &lt;code&gt;phi_hat&lt;/code&gt; can only take &lt;code&gt;2^tbits = 16&lt;/code&gt; possible values (one per bin), we can &lt;strong&gt;precompute the dot product for all 16 possibilities&lt;/strong&gt; and then just &lt;strong&gt;look up&lt;/strong&gt; the right one using the stored index.&lt;/p&gt;

&lt;p&gt;Here is how the kernel works, step by step:&lt;/p&gt;

&lt;p&gt;First, for each group of quantized keys, the kernel &lt;strong&gt;precomputes the query-angle dot product for all 16 possible angle bins&lt;/strong&gt;. For each bin, it computes the midpoint angle and then calculates &lt;code&gt;q_x * cos(angle) + q_y * sin(angle)&lt;/code&gt;. This creates a small lookup table of 16 values per sub-dimension.&lt;/p&gt;

&lt;p&gt;Next, for each token in the group, the kernel &lt;strong&gt;extracts the angle index&lt;/strong&gt; from the lower 4 bits of the packed byte and &lt;strong&gt;looks up&lt;/strong&gt; the corresponding precomputed dot product from the table. No cos/sin computation is needed per token, just a table lookup.&lt;/p&gt;

&lt;p&gt;Then, the kernel &lt;strong&gt;extracts the radius index&lt;/strong&gt; from the upper 4 bits, computes the radius midpoint value, and &lt;strong&gt;multiplies&lt;/strong&gt; it with the looked-up dot product. This gives the contribution of that sub-dimension to the total attention score.&lt;/p&gt;

&lt;p&gt;Finally, the kernel &lt;strong&gt;sums&lt;/strong&gt; these contributions across all sub-dimensions to produce the final attention logit for that token.&lt;/p&gt;

&lt;p&gt;You can see the Triton kernel implementation at &lt;a href="https://github.com/ericshwu/PolarQuant/blob/main/models/kernel4group.py" rel="noopener noreferrer"&gt;kernel4group.py&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tscale&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;tbits&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tmn&lt;/span&gt;

&lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;interleave&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;broadcast_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;tbits&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                 &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;Nk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;radii&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rscale&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;rbits&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rmn&lt;/span&gt;

&lt;span class="n"&gt;attn&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;radii&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tbits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why is this fast?&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional (reconstruct + matmul)&lt;/th&gt;
&lt;th&gt;PolarQuant kernel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory reads/token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;d&lt;/code&gt; floats = &lt;code&gt;2d&lt;/code&gt; bytes&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;d/2&lt;/code&gt; bytes (packed uint8)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full matmul&lt;/td&gt;
&lt;td&gt;16 cos/sin (shared) + 2 lookups + 1 multiply per sub-dim&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bottleneck on long seqs&lt;/td&gt;
&lt;td&gt;~4x less traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The kernel reads 4x less data from memory and replaces expensive per-token math with cheap table lookups.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cache Architecture: Hybrid Prefill + Decode
&lt;/h3&gt;

&lt;p&gt;The implementation does not quantize everything. It uses a &lt;strong&gt;hybrid strategy&lt;/strong&gt;, keeping a short full-precision tail alongside the quantized prefix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;During prefill&lt;/strong&gt; (when processing the full prompt, where multiple tokens are processed at once):&lt;/p&gt;

&lt;p&gt;The model first runs full-precision &lt;strong&gt;FlashAttention&lt;/strong&gt; on the entire prompt to compute the attention output. After that, it splits the key cache into two parts. The portion of the sequence whose length is divisible by &lt;code&gt;residual_length&lt;/code&gt; (default 128) gets polar-quantized into packed indices and scales. The remaining tail (the last few tokens that don't fill a complete group) stays in full precision. If the entire sequence is shorter than 128 tokens, nothing gets quantized at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;During decode&lt;/strong&gt; (when generating one token at a time):&lt;/p&gt;

&lt;p&gt;Each new token's key is appended to the full-precision tail. The model then computes attention in two parts. For the quantized prefix, it uses the Triton kernel to compute approximate attention scores directly from the packed bytes. For the full-precision tail, it uses standard matrix multiplication. The two sets of scores are concatenated, divided by &lt;code&gt;sqrt(head_dim)&lt;/code&gt;, and passed through softmax together. Whenever the full-precision tail grows to a length that is a multiple of &lt;code&gt;residual_length&lt;/code&gt;, the entire tail gets quantized and merged into the packed prefix, and the tail is reset to empty.&lt;/p&gt;

&lt;p&gt;This logic lives in the &lt;code&gt;forward&lt;/code&gt; method of &lt;code&gt;LlamaPolarGroupAttention&lt;/code&gt; at &lt;a href="https://github.com/ericshwu/PolarQuant/blob/main/models/modeling_llama_polar.py" rel="noopener noreferrer"&gt;modeling_llama_polar.py&lt;/a&gt;. The hybrid approach ensures that the most recent tokens (which often matter most for generation quality) are always at full precision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Default Hyperparameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;What it controls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rbits&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Bits for radius quantization (16 bins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tbits&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Bits for angle quantization (16 bins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;group_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;Tokens per quantization group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;residual_length&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;Size of the full-precision tail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Constraint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rbits + tbits &amp;lt;= 8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Must fit in one byte&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are set in the &lt;code&gt;__init__&lt;/code&gt; method of &lt;code&gt;LlamaPolarGroupAttention&lt;/code&gt; at &lt;a href="https://github.com/ericshwu/PolarQuant/blob/main/models/modeling_llama_polar.py" rel="noopener noreferrer"&gt;modeling_llama_polar.py&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Results
&lt;/h3&gt;

&lt;p&gt;On &lt;strong&gt;LongBench&lt;/strong&gt; with Llama-3.1-8B-Instruct at ~4x compression:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Average Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exact (16-bit, no compression)&lt;/td&gt;
&lt;td&gt;48.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PolarQuant-R (online codebook)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48.37&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PolarQuant-R (offline codebook)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48.29&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PolarQuant (no preconditioning)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48.11&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KIVI&lt;/td&gt;
&lt;td&gt;46.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HeadKV&lt;/td&gt;
&lt;td&gt;45.34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SnapKV&lt;/td&gt;
&lt;td&gt;44.57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PyramidKV&lt;/td&gt;
&lt;td&gt;44.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StreamingLLM&lt;/td&gt;
&lt;td&gt;38.36&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PolarQuant achieves the &lt;strong&gt;best quality&lt;/strong&gt; among all compression methods, nearly matching the uncompressed baseline, while generating tokens &lt;strong&gt;14% faster&lt;/strong&gt; than KIVI.&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Needle-in-a-Haystack&lt;/strong&gt; test (finding a specific sentence buried in a 104K token document), PolarQuant scores &lt;strong&gt;0.991&lt;/strong&gt; vs the uncompressed &lt;strong&gt;0.995&lt;/strong&gt;, virtually indistinguishable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here is the full pipeline of what happens to a key vector:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start&lt;/strong&gt; with the original key in FP16: &lt;code&gt;[x1, y1, x2, y2, ..., x_{d/2}, y_{d/2}]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pair&lt;/strong&gt; adjacent dimensions into 2D pairs: &lt;code&gt;[(x1,y1), (x2,y2), ..., (x_{d/2}, y_{d/2})]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert&lt;/strong&gt; each pair to polar form using atan2 and L2 norm: &lt;code&gt;[(phi1, r1), (phi2, r2), ...]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantize&lt;/strong&gt; each angle and radius to 4-bit integers: &lt;code&gt;[(theta1, rho1), (theta2, rho2), ...]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pack&lt;/strong&gt; both into one byte per pair: &lt;code&gt;(rho &amp;lt;&amp;lt; 4) + theta&lt;/code&gt;, giving &lt;code&gt;[byte1, byte2, ..., byte_{d/2}]&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Storage:&lt;/strong&gt; &lt;code&gt;d&lt;/code&gt; values at 2 bytes each = &lt;code&gt;2d&lt;/code&gt; bytes becomes &lt;code&gt;d/2&lt;/code&gt; bytes + small per-group scales. That is roughly a &lt;strong&gt;4x compression&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;PolarQuant works because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Polar coordinates are natural&lt;/strong&gt; for RoPE-transformed key vectors, since RoPE already operates on 2D rotations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random preconditioning&lt;/strong&gt; makes angle distributions predictable and concentrated, enabling efficient quantization without per-block normalization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The dot-product factorization trick&lt;/strong&gt; lets us compute attention directly from packed bytes using table lookups instead of reconstructing full keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The hybrid cache&lt;/strong&gt; keeps recent tokens at full precision while progressively compressing older ones.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The end result is &lt;strong&gt;4x less memory&lt;/strong&gt;, &lt;strong&gt;14% faster generation&lt;/strong&gt;, and &lt;strong&gt;near-lossless quality&lt;/strong&gt;, a compelling combination for deploying LLMs on long contexts.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Paper: &lt;a href="https://arxiv.org/abs/2502.02617" rel="noopener noreferrer"&gt;PolarQuant: Quantizing KV Caches with Polar Transformation&lt;/a&gt; by Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, and Amir Zandieh.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code: &lt;a href="https://github.com/ericshwu/PolarQuant" rel="noopener noreferrer"&gt;github.com/ericshwu/PolarQuant&lt;/a&gt;. The implementation supports Llama and Qwen2 models, with a custom Triton GPU kernel for efficient decode-time attention over quantized keys.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
