<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Govind.S.B</title>
    <description>The latest articles on DEV Community by Govind.S.B (@govindsb).</description>
    <link>https://dev.to/govindsb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1220853%2F88c2e719-3df9-4162-b22c-0a07bf8e9366.jpeg</url>
      <title>DEV Community: Govind.S.B</title>
      <link>https://dev.to/govindsb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/govindsb"/>
    <language>en</language>
    <item>
      <title>My take on the Memory Layer Paper by Meta (noob friendly)</title>
      <dc:creator>Govind.S.B</dc:creator>
      <pubDate>Fri, 03 Jan 2025 18:47:20 +0000</pubDate>
      <link>https://dev.to/govindsb/my-take-on-the-memory-layer-paper-by-meta-noob-friendly-3hgo</link>
      <guid>https://dev.to/govindsb/my-take-on-the-memory-layer-paper-by-meta-noob-friendly-3hgo</guid>
      <description>&lt;p&gt;Ref : &lt;a href="https://arxiv.org/abs/2412.09764" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2412.09764&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So Meta FAIR's new paper is a banger, as always from them lol. This one is about increasing model capability while spending fewer FLOPs for the same number of parameters or more. What they changed is where the FLOPs (computation, in GPU terms; we AI folks like to call it that) get spent.&lt;/p&gt;




&lt;p&gt;So in the traditional architecture, the input is converted to a vector representation, the attention heads transform those vectors to imbue them with context from the rest of the input, and this transformed input is passed to the MLP. This is the actual neural net that encodes both reasoning and memory in its parameters. As the input vector passes through each layer, it is transformed with a bit of the reasoning and knowledge the model picked up over the course of its training. Repeat this a bunch of times and, at the end, the vector represents the next token that should be appended to the input. Pretty standard. Neat.&lt;/p&gt;

&lt;p&gt;Now the important part: the MLP here handles both the knowledge and the reasoning-based transformations, in an abstract sense. What these folks attempted is a separation of concerns.&lt;/p&gt;




&lt;p&gt;Here is what they did: they took a few MLP layers out and put in what they call memory layers. Essentially these are key-value dictionaries. During input processing, after the prompt is converted to vectors, we take the dot product of the query (the input vector) against these keys to find the best-matching pairs. The values from the selected key-value pairs are then allowed to transform the input vector, imbuing it with new information.&lt;/p&gt;

&lt;p&gt;So what is this new information imbued into the vectors? It is the stuff the LLM learnt during training: the memory, plus whatever reasoning steps can in a sense be rote-learnt, is captured in this key-value representation. It is learnt and stored in this dictionary, and the dot-product operation we just did finds the relevant bits of that memory for the given prompt and feeds them as part of the input to the MLP.&lt;/p&gt;

&lt;p&gt;The MLP now does the reasoning part. Of course, since this is a neural network, I assume it still retains a bit of knowledge, but much of the factual information should have been picked up by the memory layers, leaving the MLP the harder reasoning work: mixing, bending, and processing that information over and over again, which is exactly what the MLP, or dense layers, are really good at.&lt;/p&gt;

&lt;p&gt;So in a gist: we let the MLP, the compute-intensive part, do the hard reasoning, while the memory layers, which are predominantly just dot products (much less compute), handle the rote-learnable parts.&lt;/p&gt;

&lt;p&gt;What exactly are the keys and values in the memory layer? We don't know for sure; they are learnt by the network as this layer trains alongside the MLP. The network just figures out the arrangement that minimizes the loss, like all AI systems end up doing (or at least the ones we remember as successful, lol).&lt;/p&gt;
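&lt;p&gt;To make the lookup concrete, here is a minimal sketch in plain NumPy (my own illustration, not the paper's code; the function name and sizes are made up): score every key against the query, keep only the top-k matches, and mix the corresponding values.&lt;/p&gt;

```python
import numpy as np

def memory_layer(query, keys, values, k=4):
    """Sparse key-value lookup: score all keys, mix only the top-k values."""
    scores = keys @ query                        # one dot product per key
    idx = np.argsort(scores)[-k:]                # indices of the k best matches
    w = np.exp(scores[idx] - scores[idx].max())  # numerically stable softmax
    w = w / w.sum()                              # weights over the k matches
    return w @ values[idx]                       # weighted sum of k value rows

rng = np.random.default_rng(0)
d, n = 16, 1024
out = memory_layer(rng.standard_normal(d),
                   rng.standard_normal((n, d)),
                   rng.standard_normal((n, d)))
```

The output is a d-dimensional vector, ready to be folded back into the residual stream much like an MLP block's output would be.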




&lt;p&gt;This separation of concerns is kind of inspired by what folks did with MoME (Mixture of a Million Experts) and its PEER router setup. In that approach, instead of one MLP, the router had a huge pool of single-neuron experts learn a bunch of stuff, and it used a key-query dot product to find the neurons that matter and sort of merge them on the fly into a makeshift neural net. I can see a strong influence of that in here.&lt;/p&gt;

&lt;p&gt;I was also reminiscing about Extended Mind Transformers while reading through this, though it's a different approach.&lt;/p&gt;

&lt;p&gt;I feel like the two papers I mentioned above focused on memory and on citations respectively, while this approach prioritizes compute cost.&lt;/p&gt;




&lt;p&gt;Okay, now there is a slight chance some folks are wondering why this is less compute intensive, and what "dense" and "sparse" even mean.&lt;/p&gt;

&lt;p&gt;When the input goes through the MLP, each neuron (or param, or node, or weight, whatever you call it) affects the input vector. Every single one of them. That is why these layers are called dense.&lt;/p&gt;

&lt;p&gt;The memory layer does not require every value to be processed against the input vector: only the best-matching values, found by taking the dot product of the keys with the input vector, need to go through an MLP layer or something of that sort that can then transform the input again. It is therefore a sparse operation.&lt;/p&gt;
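&lt;p&gt;A back-of-envelope FLOP count per token makes the gap concrete. The numbers below are illustrative, my own, not from the paper; the sketch also assumes the paper's "product keys" trick, where keys are factorized so scoring scales with the square root of the number of memory slots rather than with all of them.&lt;/p&gt;

```python
import math

# illustrative sizes of my own choosing, not the paper's configurations
d      = 4096           # hidden size
d_ff   = 4 * d          # MLP expansion
n_keys = 1_000_000      # memory slots
k      = 32             # values actually read per token

# dense MLP: every weight touches the activation (up- and down-projection)
dense_flops = 2 * d * d_ff + 2 * d_ff * d

# memory layer: score two half-queries against sqrt(n_keys) half-keys each
# (product-key factorization), then mix only the k selected d-dim values
scoring_flops = 2 * (2 * math.isqrt(n_keys)) * (d // 2)
mixing_flops  = 2 * k * d
sparse_flops  = scoring_flops + mixing_flops

print(f"dense : {dense_flops:,}")    # hundreds of millions of FLOPs
print(f"sparse: {sparse_flops:,}")   # roughly 30x fewer in this setup
```

With these made-up sizes the dense block costs on the order of 268M FLOPs per token while the memory layer costs under 10M, which is the whole point of the design.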




&lt;p&gt;The performance looks really good: it scales better than both the MoE approaches and the dense approaches (check the paper and its graphs). But I would love to see more folks test this idea out, and I'd like to hear from them. I have a feeling people are going to sweep it under the rug even though it is such a fun idea. Maybe Meta is cooking something that combines this with its byte representation work and LCMs.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Set up SSH for WSL to use windsurf IDE before official WSL support</title>
      <dc:creator>Govind.S.B</dc:creator>
      <pubDate>Wed, 20 Nov 2024 03:18:47 +0000</pubDate>
      <link>https://dev.to/govindsb/set-up-ssh-for-wsl-to-use-windsurf-ide-before-official-wsl-support-aj8</link>
      <guid>https://dev.to/govindsb/set-up-ssh-for-wsl-to-use-windsurf-ide-before-official-wsl-support-aj8</guid>
      <description>&lt;p&gt;This is to setup ssh for wsl so that I can connect windsurf to wsl before their official support&lt;/p&gt;

&lt;p&gt;First, set up and start the SSH server on WSL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install ssh
sudo systemctl start ssh
sudo systemctl enable ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows, set up port forwarding to your WSL distro by running the following in PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$EXT_PORT=2222
$WSL_PORT=22
netsh interface portproxy add v4tov4 listenport=$EXT_PORT listenaddress=0.0.0.0 connectport=$WSL_PORT connectaddress=127.0.0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now just connect to your WSL machine from another device like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh user@&amp;lt;windowsmachineIP&amp;gt; -p 2222
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For me, connecting from the Windows machine itself, that is:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh vio@localhost -p 2222
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, to make this a passwordless login, we need to set up key-based authentication.&lt;/p&gt;

&lt;p&gt;On Windows, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh-keygen -t rsa -b 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;When asked for file location, press Enter for default (usually &lt;code&gt;C:\Users\YourUsername\.ssh\id_rsa&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Leave passphrase empty for passwordless login (just press Enter twice)&lt;/li&gt;
&lt;li&gt;This creates two files:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id_rsa&lt;/code&gt; (private key)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;id_rsa.pub&lt;/code&gt; (public key)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Create SSH Config on Windows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create .ssh directory if it doesn't exist&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Force&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Create/edit config file using Notepad&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;notepad&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add these lines to the config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host myserver
    HostName your-server-ip
    User your-linux-username
    IdentityFile C:\Users\YourWindowsUsername\.ssh\id_rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;myserver&lt;/code&gt; with whatever name you want&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;your-server-ip&lt;/code&gt; with your server's IP address&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;your-linux-username&lt;/code&gt; with your Linux server username&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;YourWindowsUsername&lt;/code&gt; with your Windows username&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is mine :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host local_wsl
    HostName localhost
    User vio
    Port 2222
    IdentityFile C:\Users\vio\.ssh\id_rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now copy over the public key and append it to the appropriate file on the WSL side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Get-Content "$env:USERPROFILE\.ssh\id_rsa.pub"
PS C:\Users\vio&amp;gt; ssh your-linux-username@your-server-ip "mkdir -p ~/.ssh &amp;amp;&amp;amp; echo '$PUBKEY' &amp;gt;&amp;gt; ~/.ssh/authorized_keys &amp;amp;&amp;amp; chmod 700 ~/.ssh &amp;amp;&amp;amp; chmod 600 ~/.ssh/authorized_keys"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For me, that would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh vio@localhost -p 2222 "mkdir -p ~/.ssh &amp;amp;&amp;amp; echo '$PUBKEY' &amp;gt;&amp;gt; ~/.ssh/authorized_keys &amp;amp;&amp;amp; chmod 700 ~/.ssh &amp;amp;&amp;amp; chmod 600 ~/.ssh/authorized_keys"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Things that might fail or differ for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Opening up the correct ports&lt;/li&gt;
&lt;li&gt;Enabling key-based auth and disabling the password requirement on the SSH server (in this case, our WSL instance)&lt;/li&gt;
&lt;li&gt;File permissions&lt;/li&gt;
&lt;li&gt;SSH client not installed on Windows&lt;/li&gt;
&lt;/ul&gt;
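&lt;p&gt;For the second point, these are the standard OpenSSH options to check in &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; on the WSL side (restart with &lt;code&gt;sudo systemctl restart ssh&lt;/code&gt; after editing); a typical working pair looks like:&lt;/p&gt;

```
PubkeyAuthentication yes
PasswordAuthentication no
```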

&lt;p&gt;Now, to connect to WSL from the Windows installation of Windsurf, simply click the Connect to SSH Host button at the bottom left of the editor and pick the Remote SSH option. Your config host should show up there, and clicking it should work.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Mixtral, OpenAI and the race to bottom</title>
      <dc:creator>Govind.S.B</dc:creator>
      <pubDate>Tue, 19 Dec 2023 10:34:05 +0000</pubDate>
      <link>https://dev.to/govindsb/mixtral-openai-and-the-race-to-bottom-1l38</link>
      <guid>https://dev.to/govindsb/mixtral-openai-and-the-race-to-bottom-1l38</guid>
      <description>&lt;h2&gt;
  
  
  Competition is good
&lt;/h2&gt;

&lt;p&gt;Healthy competition in a market always benefits the end consumer; this has been proven time and time again.&lt;br&gt;
And right now in the AI/LLM space we are seeing exactly that: a race to the bottom.&lt;/p&gt;

&lt;p&gt;See, before this open-source roar in the space there was only OpenAI, with their models GPT-3.5 and GPT-4, and their pricing blew us away. GPT-3.5 gives you a good-enough AI for most general-purpose applications we build (with the current tech). But the thing to consider is that they were the only ones leading it: if you wanted to build an app cost-effectively, it was only them, and they really enjoyed their time up on the ladder.&lt;/p&gt;

&lt;p&gt;But as the leaked Google memo put it:&lt;br&gt;
"We Have No Moat, And Neither Does OpenAI"&lt;/p&gt;
&lt;h2&gt;
  
  
  Open source LLMs
&lt;/h2&gt;

&lt;p&gt;Open-source models were lagging behind GPT-3.5 in both performance and cost: they were expensive to run and returned dumb answers, lol. But then gigachad Mistral dropped their 7B model, the one they call "tiny", and it made everyone lose their minds, it was so good. Soon the community was flooded with Mistral finetunes.&lt;/p&gt;

&lt;p&gt;Mistral then recently dropped Mixtral, a new MoE model that uses their Mistral models in a creative way. A mixture of experts: basically, instead of training a single new model, they combined 8 expert models of 7B each, specialized in various tasks. The neat thing is that, thanks to weight sharing among the experts, the combined model is smaller than 8x7B, and at inference it only uses about 13B parameters (2 experts are consulted for each generated token). So this thing can run on consumer hardware that people actually have... and it performs as well as GPT-3.5 or better... This is important: an open-source model that beats the cheapest feasible closed-source AI.&lt;/p&gt;

&lt;p&gt;OpenAI can charge their pricing because it is their model and they are the only provider, so the price covers infra cost, their R&amp;amp;D, and profit. With an open-source model there is no moat: the model is free and out in the open. So here is what happened next: services appeared that provide inference APIs. They host the model and give you an API just like OpenAI does, but charge only for the infra. It is obvious they were going to undercut the pricing, but by how much is just mind-blowing.&lt;/p&gt;
&lt;h2&gt;
  
  
  The price drop
&lt;/h2&gt;

&lt;p&gt;GPT-3.5 costs $1 per million input tokens and $2 per million output tokens, so on average $1.50 per million tokens.&lt;br&gt;
Together AI, the leading AI infra provider, put out their pricing at $0.60 per million tokens.&lt;br&gt;
Seeing this, Anyscale, another such provider, went one better: $0.50 per million tokens.&lt;br&gt;
It doesn't stop there: DeepInfra dropped their pricing to $0.27 per million tokens.&lt;/p&gt;

&lt;p&gt;That is a staggering 82% cost drop.&lt;/p&gt;
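&lt;p&gt;For anyone who wants to check the arithmetic, here is the 82% figure (and the free-credit token count mentioned further down) worked out in a few lines:&lt;/p&gt;

```python
# quick sanity check on the pricing numbers above
gpt35_avg = (1.00 + 2.00) / 2      # $/M tokens, input and output averaged
deepinfra = 0.27                   # $/M tokens
drop = 1 - deepinfra / gpt35_avg   # fraction saved vs GPT-3.5

together_credits = 25 / 0.60       # $25 free credits at $0.60 per M tokens
print(f"drop: {drop:.0%}, free tokens: about {int(together_credits)}M")
# drop: 82%, free tokens: about 41M
```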

&lt;p&gt;With all this happening, OpenRouter came up. They are a service that auto-routes your request to the cheapest available infra provider, so we benefit from the race to the bottom, and they decided to host Mixtral for free. Yep, free! I talked with their team on Discord, and they said they want to support a lot of models and keep them in beta for people to test (or, as I would put it, to get people used to their ecosystem).&lt;/p&gt;

&lt;p&gt;One thing to note here: don't forget that a lot of providers are also heavily funded with VC money to burn, and they WILL burn it. You can see this in the form of heavily subsidized prices and free credits to capture the market.&lt;/p&gt;

&lt;p&gt;Together AI gives you $25 in free credits once you sign up; at their pricing that is about 41 million tokens, which you are not running out of on personal pet projects.&lt;/p&gt;

&lt;p&gt;The race to the bottom is here, and IMO it is here to stay for a while. So profit while you can and build cool stuff.&lt;br&gt;
I have this &lt;a href="https://github.com/BulletLaunch/Mixtral-Inference-APIs"&gt;repo&lt;/a&gt; where I wrote a general-purpose function to interact with all these providers to use Mixtral; check it out if you want to jump in fast. Star it if you find it useful, and that's it from me, thanks for reading.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A9-wwsHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/BulletLaunch"&gt;
        BulletLaunch
      &lt;/a&gt; / &lt;a href="https://github.com/BulletLaunch/Mixtral-Inference-APIs"&gt;
        Mixtral-Inference-APIs
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      a convenience script used internally having a collection of inference API providers with cheap infra cost
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
Mixtral Inference APIs&lt;/h1&gt;
&lt;p&gt;make the best out of the race to bottom&lt;/p&gt;
&lt;p&gt;a convenience script we use internally having a collection of providers with cheap infra cost for LLM inference&lt;/p&gt;
&lt;p&gt;if you like what we are doing&lt;/p&gt;
&lt;p&gt;Please leave a star on the repo&lt;/p&gt;
&lt;p&gt;Support us on buymeacoffee&lt;br&gt;
&lt;a href="https://www.buymeacoffee.com/bulletlaunch" rel="nofollow"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_OOqhCiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://private-user-images.githubusercontent.com/62943847/291451790-9e97ec08-c4ab-4baa-8485-f3f543f247bb.png%3Fjwt%3DeyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI5ODIzNDUsIm5iZiI6MTcwMjk4MjA0NSwicGF0aCI6Ii82Mjk0Mzg0Ny8yOTE0NTE3OTAtOWU5N2VjMDgtYzRhYi00YmFhLTg0ODUtZjNmNTQzZjI0N2JiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjE5VDEwMzQwNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTA0Y2NlNWViMTViYjljZjNkZjRiN2YyNjFkMWYyN2VmMTBiMmE1ODViNWU5MzczM2ZjOGVmODEwMTY4NzRhZWUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.GFYLKlPRofDHFjQ2jBYSBD4mk5j8bXHxwpytfVmpDxA" alt="Buy Me A Coffee"&gt;&lt;/a&gt;&lt;br&gt;
Checkout our socials and follow us there&lt;br&gt;
&lt;a href="https://twitter.com/bulletlaunchhq" rel="nofollow"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cmAwZZo2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://private-user-images.githubusercontent.com/62943847/291450533-58075057-2502-4fe8-b2f8-c6e121194dd4.png%3Fjwt%3DeyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI5ODIzNDUsIm5iZiI6MTcwMjk4MjA0NSwicGF0aCI6Ii82Mjk0Mzg0Ny8yOTE0NTA1MzMtNTgwNzUwNTctMjUwMi00ZmU4LWIyZjgtYzZlMTIxMTk0ZGQ0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjE5VDEwMzQwNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVkODkyMTFjYjYwNTM2NmYzNzc5ODVhOTYxOWQ3YjlmMjdjN2EyZDE1MjI4ZTdiMWQ5NDI4YmFlNjQyYWE2ODUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.W_ERPNDBLv7DIa379BeoBJmhyyPIrsg_oDL2qa8Vn5Y" alt="Twitter" height="41" width="41"&gt;&lt;/a&gt;
&lt;a href="https://www.linkedin.com/company/bulletlaunch" rel="nofollow"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lUOsoLTs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://private-user-images.githubusercontent.com/62943847/291452448-e71c1e79-a287-4cfc-bad2-79ad41cd445b.png%3Fjwt%3DeyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI5ODIzNDUsIm5iZiI6MTcwMjk4MjA0NSwicGF0aCI6Ii82Mjk0Mzg0Ny8yOTE0NTI0NDgtZTcxYzFlNzktYTI4Ny00Y2ZjLWJhZDItNzlhZDQxY2Q0NDViLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjE5VDEwMzQwNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM5MWM5OTYzMGMzNTJiNmUwMDE5NGRlNzgzNjUyMTY0NmMwZDUxNWY4Y2QwMTk5MWNmNDFlM2Q4ODA1YjYxNDUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.o2s_8TPooUELqyt5k8qgZ5TiU72Y6mgqlzkmILFzqDo" alt="Linkedin" height="41" width="41"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
Usage&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Rename the &lt;code&gt;.env.template&lt;/code&gt; file to &lt;code&gt;.env&lt;/code&gt; and add the corresponding credentials for the providers you want to use (check pricing and performance comparison below)&lt;/li&gt;
&lt;li&gt;You can either run the script directly for testing the endpoints or use the inference function in your program logic&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To directly use the inference function copy the &lt;code&gt;.env&lt;/code&gt; file and &lt;code&gt;llm_inference_script&lt;/code&gt; to your project and import the function&lt;/p&gt;
&lt;p&gt;example :&lt;/p&gt;
&lt;div class="highlight highlight-source-python notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;llm_inference_script&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm_inference&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;dotenv&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;load_dotenv&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-en"&gt;load_dotenv&lt;/span&gt;()
&lt;span class="pl-v"&gt;KEY&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;getenv&lt;/span&gt;(&lt;span class="pl-s"&gt;"PROVIDER_API_KEY"&lt;/span&gt;) &lt;span class="pl-c"&gt;# put correct provider name here&lt;/span&gt;
&lt;span class="pl-s1"&gt;output&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;llm_inference&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/BulletLaunch/Mixtral-Inference-APIs"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;TL;DR: GPT-3.5 is dead and Mixtral killed it; you can run it for free at this point. Check out the &lt;a href="https://github.com/BulletLaunch/Mixtral-Inference-APIs"&gt;repo&lt;/a&gt; for my API endpoint collection to get into the hype fast.&lt;/p&gt;

&lt;p&gt;UPDATE : Openrouter is not free anymore&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SDXL Turbo Optimization Experiments</title>
      <dc:creator>Govind.S.B</dc:creator>
      <pubDate>Sun, 03 Dec 2023 04:15:53 +0000</pubDate>
      <link>https://dev.to/govindsb/sdxl-turbo-optimization-experiments-fgg</link>
      <guid>https://dev.to/govindsb/sdxl-turbo-optimization-experiments-fgg</guid>
      <description>&lt;p&gt;ComfyUI and other UI powered full backend systems were super slow and I wanted to optimize it for better efficiency and real time performane for a local running LLM based ppt generator application I was building  &lt;/p&gt;

&lt;p&gt;I wrote a custom script to benchmark the different stages of image generation while trying out different configurations; this is primarily me writing down my findings for future reference.&lt;/p&gt;

&lt;p&gt;My Specs and Configuration:&lt;br&gt;&lt;br&gt;
Ryzen 7 5800X | 32 GB RAM | RTX 3060 Ti&lt;br&gt;&lt;br&gt;
Windows 11 WSL Ubuntu, python 3.10  &lt;/p&gt;

&lt;p&gt;ComfyUI SDXL Turbo generation speed averages 2.5 seconds per image (without prompt caching).&lt;/p&gt;

&lt;p&gt;My test criteria distinguish Total Time, Load Time for any modules, Init Gen Time, and Avg Gen Time.&lt;br&gt;&lt;br&gt;
For the Avg Gen Time calculation I test with 1 + 4 prompts (one warm-up generation, four measured).&lt;br&gt;&lt;br&gt;
I am not optimizing for memory footprint, just performance.&lt;/p&gt;
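&lt;p&gt;A minimal sketch of how the Init Gen vs Avg Gen split can be measured (my reconstruction of the harness idea, not the actual script; &lt;code&gt;fake_pipe&lt;/code&gt; is a stand-in for the real diffusers pipeline call):&lt;/p&gt;

```python
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# stand-in for the diffusers pipe(prompt) call the real script times
def fake_pipe(prompt):
    return f"image for {prompt!r}"

prompts = ["warm-up"] + [f"prompt {i}" for i in range(4)]  # 1 + 4, as above
_, init_gen_time = timed(fake_pipe, prompts[0])            # first-gen cost
gen_times = [timed(fake_pipe, p)[1] for p in prompts[1:]]
avg_gen_time = sum(gen_times) / len(gen_times)             # steady-state cost
```

Keeping the first generation separate matters because CUDA kernel warm-up and caching make it systematically slower than the steady-state average.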

&lt;p&gt;Here is my Github for reference : &lt;a href="https://github.com/Govind-S-B/sdxl-turbo-optimization-experiments"&gt;https://github.com/Govind-S-B/sdxl-turbo-optimization-experiments&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also post some experiments I do over at my twitter : &lt;a href="https://twitter.com/violetto96"&gt;https://twitter.com/violetto96&lt;/a&gt;&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1730886441595748742-752" src="https://platform.twitter.com/embed/Tweet.html?id=1730886441595748742"&gt;
&lt;/iframe&gt;

&lt;/p&gt;
&lt;h2&gt;
  
  
  Basic Diffuser Performance:
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;19.3036789894104&lt;/td&gt;
&lt;td&gt;17.443750381469727&lt;/td&gt;
&lt;td&gt;17.60261082649231&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;6.236544132232666&lt;/td&gt;
&lt;td&gt;4.532968759536743&lt;/td&gt;
&lt;td&gt;4.691980600357056&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;2.950467348098755&lt;/td&gt;
&lt;td&gt;2.9440150260925293&lt;/td&gt;
&lt;td&gt;3.0313751697540283&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;2.529166877269745&lt;/td&gt;
&lt;td&gt;2.4916916489601135&lt;/td&gt;
&lt;td&gt;2.4698137640953064&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The default example script from the SDXL Turbo repo was used here, iterating over multiple prompts.&lt;/p&gt;
&lt;h2&gt;
  
  
  Batch Prompt Processing:
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;24.790883779525757&lt;/td&gt;
&lt;td&gt;23.44549536705017&lt;/td&gt;
&lt;td&gt;28.178327798843384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;4.713327169418335&lt;/td&gt;
&lt;td&gt;4.722451210021973&lt;/td&gt;
&lt;td&gt;4.594852685928345&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;2.940718412399292&lt;/td&gt;
&lt;td&gt;2.895763397216797&lt;/td&gt;
&lt;td&gt;3.0071566104888916&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;4.2842095494270325&lt;/td&gt;
&lt;td&gt;3.9568201899528503&lt;/td&gt;
&lt;td&gt;5.144079625606537&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Instead of iterating through each prompt, I passed the entire list to the pipeline for batch processing.&lt;br&gt;&lt;br&gt;
It is surprising to see performance decline with the built-in batch-processing method compared to iterating the pipe.&lt;/p&gt;
&lt;h2&gt;
  
  
  Upcast VAE Precision
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;17.625147342681885&lt;/td&gt;
&lt;td&gt;23.426799774169922&lt;/td&gt;
&lt;td&gt;17.705934762954712&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.011475086212158&lt;/td&gt;
&lt;td&gt;4.612784147262573&lt;/td&gt;
&lt;td&gt;4.5227577686309814&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;2.9415533542633057&lt;/td&gt;
&lt;td&gt;4.21022629737854&lt;/td&gt;
&lt;td&gt;3.078413963317871&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;2.4180297255516052&lt;/td&gt;
&lt;td&gt;3.650947332382202&lt;/td&gt;
&lt;td&gt;2.526190757751465&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Performance decreased compared to the base diffuser numbers.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more"&gt;https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more&lt;/a&gt;  &lt;/p&gt;
&lt;h2&gt;
  
  
  VAE fp16 Optimization:
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;11.680602312088013&lt;/td&gt;
&lt;td&gt;13.030585050582886&lt;/td&gt;
&lt;td&gt;11.378620147705078&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.208057641983032&lt;/td&gt;
&lt;td&gt;5.137163877487183&lt;/td&gt;
&lt;td&gt;4.957686901092529&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;1.538787603378296&lt;/td&gt;
&lt;td&gt;1.9612927436828613&lt;/td&gt;
&lt;td&gt;1.539346694946289&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;1.2334392666816711&lt;/td&gt;
&lt;td&gt;1.4830321073532104&lt;/td&gt;
&lt;td&gt;1.220396637916565&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gives a promising speedup (about 2x) with no noticeable quality loss.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more"&gt;https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipe.vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  VAE: Tiny Autoencoder
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;6.502416133880615&lt;/td&gt;
&lt;td&gt;5.839867115020752&lt;/td&gt;
&lt;td&gt;6.068450450897217&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.318273067474365&lt;/td&gt;
&lt;td&gt;4.63173770904541&lt;/td&gt;
&lt;td&gt;4.82595682144165&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;0.6518356800079346&lt;/td&gt;
&lt;td&gt;0.664395809173584&lt;/td&gt;
&lt;td&gt;0.704963207244873&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;0.13307684659957886&lt;/td&gt;
&lt;td&gt;0.13593339920043945&lt;/td&gt;
&lt;td&gt;0.13438260555267334&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Incredible speed. It's said to reduce quality, but I couldn't notice any difference.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#tiny-autoencoder"&gt;https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#tiny-autoencoder&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
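&lt;p&gt;Averaging the three runs from the fp16-VAE and Tiny AutoEncoder tables above, the Tiny AutoEncoder cuts avg gen time by roughly 9-10x. A quick check of that arithmetic (numbers copied from the tables):&lt;/p&gt;

```python
# "Avg Gen Time" runs (seconds) copied from the two tables above
fp16_vae_avg_gen = [1.233, 1.483, 1.220]
tiny_vae_avg_gen = [0.133, 0.136, 0.134]

def mean(xs):
    return sum(xs) / len(xs)

# Ratio of mean per-image times: fp16 VAE fix vs Tiny AutoEncoder
speedup = mean(fp16_vae_avg_gen) / mean(tiny_vae_avg_gen)
print(f"avg gen speedup: {speedup:.1f}x")  # roughly 9-10x
```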



&lt;blockquote&gt;
&lt;p&gt;All subsequent tests use the Tiny AutoEncoder, since the VAE is now essentially optimized and won't need to be swapped out. For all following performance comparisons, use the Tiny AutoEncoder benchmark above as the reference point.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  Compile UNet
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;75.41857695579529&lt;/td&gt;
&lt;td&gt;65.76266884803772&lt;/td&gt;
&lt;td&gt;64.25783467292786&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.129103422164917&lt;/td&gt;
&lt;td&gt;7.217745304107666&lt;/td&gt;
&lt;td&gt;5.239185810089111&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;68.40448021888733&lt;/td&gt;
&lt;td&gt;57.36990666389465&lt;/td&gt;
&lt;td&gt;57.89517068862915&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;0.4712483286857605&lt;/td&gt;
&lt;td&gt;0.2937542200088501&lt;/td&gt;
&lt;td&gt;0.2808695435523987&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Massive increase in init gen time (the first generation pays the one-time compilation cost), and avg gen time ends up roughly 2-3x slower than the Tiny AutoEncoder baseline.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more"&gt;https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
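&lt;p&gt;Given these numbers, compilation never pays for itself in this setup: the first generation absorbs roughly a minute of compile time, and since the steady-state per-image time is still slower than the Tiny AutoEncoder baseline, there is no break-even image count. The arithmetic, using the best compile run and the baseline table above:&lt;/p&gt;

```python
# Seconds, copied from the tables above (best compile run vs Tiny AE baseline)
compile_init, compile_avg = 57.90, 0.281
baseline_init, baseline_avg = 0.65, 0.134

def extra_cost(n_images):
    """Extra total cost of generating n images with the compiled UNet.
    Positive means compiling loses; the gap grows with n because the
    compiled per-image time is also slower in this benchmark."""
    with_compile = compile_init + n_images * compile_avg
    without = baseline_init + n_images * baseline_avg
    return with_compile - without

print(extra_cost(1), extra_cost(1000))  # both positive; gap grows with n
```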



&lt;h2&gt;
  CPU Offloading
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (s)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;17.64&lt;/td&gt;
&lt;td&gt;16.96&lt;/td&gt;
&lt;td&gt;16.98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;7.49&lt;/td&gt;
&lt;td&gt;9.04&lt;/td&gt;
&lt;td&gt;7.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;2.28&lt;/td&gt;
&lt;td&gt;2.09&lt;/td&gt;
&lt;td&gt;2.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;1.97&lt;/td&gt;
&lt;td&gt;1.46&lt;/td&gt;
&lt;td&gt;1.81&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Clear decrease in performance across the board.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#model-cpu-offloading"&gt;https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#model-cpu-offloading&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;General rule of thumb: if the model fits in GPU memory, offloading will only cost you performance.  &lt;/p&gt;

&lt;h2&gt;
  VAE Slicing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (s)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;7.27&lt;/td&gt;
&lt;td&gt;6.00&lt;/td&gt;
&lt;td&gt;6.34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.83&lt;/td&gt;
&lt;td&gt;4.82&lt;/td&gt;
&lt;td&gt;4.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;0.63&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;0.202&lt;/td&gt;
&lt;td&gt;0.133&lt;/td&gt;
&lt;td&gt;0.211&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No impact, or a slight decrease in performance; hard to gauge from these runs.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#slicing"&gt;https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#slicing&lt;/a&gt;  &lt;/p&gt;

&lt;h2&gt;
  Conclusions from Further Testing
&lt;/h2&gt;

&lt;p&gt;I tried all the optimizations listed in &lt;a href="https://huggingface.co/docs/diffusers/optimization/opt_overview"&gt;https://huggingface.co/docs/diffusers/optimization/opt_overview&lt;/a&gt;, including xFormers, token merging, and offloading, but none of them beat the Tiny AutoEncoder benchmarks above.  &lt;/p&gt;

&lt;p&gt;This suggests the remaining bottleneck sits somewhere deep in the Python stack rather than in the model itself, and I have hit the limit of what I can optimize here. If anyone wants to try these optimizations in a different language like Rust, I think that would be the way forward.  &lt;/p&gt;

&lt;h2&gt;
  Additional Notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I couldn't try out &lt;a href="https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#caching-computations"&gt;https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#caching-computations&lt;/a&gt; since I couldn't figure out how to get the tokenizers and encoders for the model&lt;/li&gt;
&lt;li&gt;Couldn't try out tracing the UNet: &lt;a href="https://huggingface.co/docs/diffusers/optimization/memory#tracing"&gt;https://huggingface.co/docs/diffusers/optimization/memory#tracing&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sdxl</category>
      <category>turbo</category>
      <category>optimization</category>
    </item>
  </channel>
</rss>
