<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Govind.S.B</title>
    <description>The latest articles on DEV Community by Govind.S.B (@govindsb).</description>
    <link>https://dev.to/govindsb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1220853%2F88c2e719-3df9-4162-b22c-0a07bf8e9366.jpeg</url>
      <title>DEV Community: Govind.S.B</title>
      <link>https://dev.to/govindsb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/govindsb"/>
    <language>en</language>
    <item>
      <title>My take on the Memory Layer Paper by Meta (noob friendly)</title>
      <dc:creator>Govind.S.B</dc:creator>
      <pubDate>Fri, 03 Jan 2025 18:47:20 +0000</pubDate>
      <link>https://dev.to/govindsb/my-take-on-the-memory-layer-paper-by-meta-noob-friendly-3hgo</link>
      <guid>https://dev.to/govindsb/my-take-on-the-memory-layer-paper-by-meta-noob-friendly-3hgo</guid>
      <description>&lt;p&gt;Ref : &lt;a href="https://arxiv.org/abs/2412.09764" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2412.09764&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So Meta FAIR's new paper is a banger, as always from them lol. This one is about increasing model capability while spending fewer FLOPs for the same number of parameters or more. What they changed is where the FLOPs (computation, in GPU terms; we AI folks like to call it that) get spent.&lt;/p&gt;




&lt;p&gt;So in the traditional architecture, the input is converted to a vector representation, the attention heads transform those vectors to imbue them with context from the rest of the input, and this transformed input is passed to the MLP. This is the actual neural net that encodes both reasoning and memory in its parameters. As the input vector passes through each layer, it is transformed with a bit of the reasoning and knowledge the model picked up over the course of its training. Repeat this a bunch of times and, at the end, the vector represents the next token that should be appended to the input. Pretty standard. Neat.&lt;/p&gt;

&lt;p&gt;Now the important part: the MLP here handles both the knowledge and the reasoning-based transformations, in an abstract sense. What these folks attempted is a separation of concerns.&lt;/p&gt;




&lt;p&gt;Here is what they did: they took a few MLP layers out and put in what they call memory layers. Essentially these are key-value dictionaries. During input processing, after the prompt is converted to vectors, we take the dot product of the query (the input vector) against these keys to find the best-matching pairs. The values from the selected key-value pairs are then allowed to transform the input vector, imbuing it with new information.&lt;/p&gt;

&lt;p&gt;So what is this new information imbued into the vectors? It is the stuff the LLM learnt during training: the memory, plus whatever reasoning steps can in a sense be rote-learnt, is captured in this key-value representation. It is learnt and stored in this dictionary, and the dot-product operation we just did finds the relevant bits of that memory for the given prompt and feeds them as part of the input to the MLP.&lt;/p&gt;

&lt;p&gt;The MLP now does the reasoning part. Of course, since this is a neural network, I assume it still retains a bit of knowledge, but much of the factual information should have been picked up by the memory layers, leaving the MLP the harder reasoning work: mixing, bending, and processing that information over and over again, which is exactly what the MLP, or dense layers, are really good at.&lt;/p&gt;

&lt;p&gt;So in a gist: we let the MLP, the compute-intensive part, do the hard reasoning, while the memory layers, which are predominantly just dot products (much less compute), handle the rote-learnable parts.&lt;/p&gt;

&lt;p&gt;What exactly are the keys and values in the memory layer? We don't know for sure; they are learnt by the network as this layer trains alongside the MLP. The network just figures out the arrangement that minimizes the loss, like all AI systems end up doing (or at least the ones we remember as successful, lol).&lt;/p&gt;
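&lt;p&gt;To make the lookup concrete, here is a minimal sketch in plain NumPy (my own illustration, not the paper's code; the function name and sizes are made up): score every key against the query, keep only the top-k matches, and mix the corresponding values.&lt;/p&gt;

```python
import numpy as np

def memory_layer(query, keys, values, k=4):
    """Sparse key-value lookup: score all keys, mix only the top-k values."""
    scores = keys @ query                        # one dot product per key
    idx = np.argsort(scores)[-k:]                # indices of the k best matches
    w = np.exp(scores[idx] - scores[idx].max())  # numerically stable softmax
    w = w / w.sum()                              # weights over the k matches
    return w @ values[idx]                       # weighted sum of k value rows

rng = np.random.default_rng(0)
d, n = 16, 1024
out = memory_layer(rng.standard_normal(d),
                   rng.standard_normal((n, d)),
                   rng.standard_normal((n, d)))
```

The output is a d-dimensional vector, ready to be folded back into the residual stream much like an MLP block's output would be.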




&lt;p&gt;This separation of concerns is kind of inspired by what folks did with MoME (Mixture of a Million Experts) and its PEER router setup. In that approach, instead of one MLP, the router had a huge pool of single-neuron experts learn a bunch of stuff, and it used a key-query dot product to find the neurons that matter and sort of merge them on the fly into a makeshift neural net. I can see a strong influence of that in here.&lt;/p&gt;

&lt;p&gt;I was also reminiscing about Extended Mind Transformers while reading through this, though it's a different approach.&lt;/p&gt;

&lt;p&gt;I feel like the two papers I mentioned above focused on memory and on citations respectively, while this approach prioritizes compute cost.&lt;/p&gt;




&lt;p&gt;Okay, now there is a slight chance some folks are wondering why this is less compute intensive, and what "dense" and "sparse" even mean.&lt;/p&gt;

&lt;p&gt;When the input goes through the MLP, each neuron (or param, or node, or weight, whatever you call it) affects the input vector. Every single one of them. That is why these layers are called dense.&lt;/p&gt;

&lt;p&gt;The memory layer does not require every value to be processed against the input vector: only the best-matching values, found by taking the dot product of the keys with the input vector, need to go through an MLP layer or something of that sort that can then transform the input again. It is therefore a sparse operation.&lt;/p&gt;
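&lt;p&gt;A back-of-envelope FLOP count per token makes the gap concrete. The numbers below are illustrative, my own, not from the paper; the sketch also assumes the paper's "product keys" trick, where keys are factorized so scoring scales with the square root of the number of memory slots rather than with all of them.&lt;/p&gt;

```python
import math

# illustrative sizes of my own choosing, not the paper's configurations
d      = 4096           # hidden size
d_ff   = 4 * d          # MLP expansion
n_keys = 1_000_000      # memory slots
k      = 32             # values actually read per token

# dense MLP: every weight touches the activation (up- and down-projection)
dense_flops = 2 * d * d_ff + 2 * d_ff * d

# memory layer: score two half-queries against sqrt(n_keys) half-keys each
# (product-key factorization), then mix only the k selected d-dim values
scoring_flops = 2 * (2 * math.isqrt(n_keys)) * (d // 2)
mixing_flops  = 2 * k * d
sparse_flops  = scoring_flops + mixing_flops

print(f"dense : {dense_flops:,}")    # hundreds of millions of FLOPs
print(f"sparse: {sparse_flops:,}")   # roughly 30x fewer in this setup
```

With these made-up sizes the dense block costs on the order of 268M FLOPs per token while the memory layer costs under 10M, which is the whole point of the design.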




&lt;p&gt;The performance looks really good: it scales better than both the MoE approaches and the dense approaches (check the paper and its graphs). But I would love to see more folks test this idea out, and I'd like to hear from them. I have a feeling people are going to sweep it under the rug even though it is such a fun idea. Maybe Meta is cooking something that combines this with its byte representation work and LCMs.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Set up SSH for WSL to use windsurf IDE before official WSL support</title>
      <dc:creator>Govind.S.B</dc:creator>
      <pubDate>Wed, 20 Nov 2024 03:18:47 +0000</pubDate>
      <link>https://dev.to/govindsb/set-up-ssh-for-wsl-to-use-windsurf-ide-before-official-wsl-support-aj8</link>
      <guid>https://dev.to/govindsb/set-up-ssh-for-wsl-to-use-windsurf-ide-before-official-wsl-support-aj8</guid>
      <description>&lt;p&gt;This is to setup ssh for wsl so that I can connect windsurf to wsl before their official support&lt;/p&gt;

&lt;p&gt;First, set up and start the SSH server on WSL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install ssh
sudo systemctl start ssh
sudo systemctl enable ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows, set up port forwarding to your WSL distro by running the following in PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$EXT_PORT=2222
$WSL_PORT=22
netsh interface portproxy add v4tov4 listenport=$EXT_PORT listenaddress=0.0.0.0 connectport=$WSL_PORT connectaddress=127.0.0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now just connect to your WSL machine from another device like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh user@&amp;lt;windowsmachineIP&amp;gt; -p 2222
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For me, connecting from the Windows machine itself, that is:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh vio@localhost -p 2222
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, to make this a passwordless login, we need to set up key-based authentication.&lt;/p&gt;

&lt;p&gt;On Windows, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh-keygen -t rsa -b 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;When asked for file location, press Enter for default (usually &lt;code&gt;C:\Users\YourUsername\.ssh\id_rsa&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Leave passphrase empty for passwordless login (just press Enter twice)&lt;/li&gt;
&lt;li&gt;This creates two files:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id_rsa&lt;/code&gt; (private key)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;id_rsa.pub&lt;/code&gt; (public key)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Create SSH Config on Windows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create .ssh directory if it doesn't exist&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Force&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Create/edit config file using Notepad&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;notepad&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add these lines to the config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host myserver
    HostName your-server-ip
    User your-linux-username
    IdentityFile C:\Users\YourWindowsUsername\.ssh\id_rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;myserver&lt;/code&gt; with whatever name you want&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;your-server-ip&lt;/code&gt; with your server's IP address&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;your-linux-username&lt;/code&gt; with your Linux server username&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;YourWindowsUsername&lt;/code&gt; with your Windows username&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is mine :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host local_wsl
    HostName localhost
    User vio
    Port 2222
    IdentityFile C:\Users\vio\.ssh\id_rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now copy over the public key and append it to the appropriate file on the WSL side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Get-Content "$env:USERPROFILE\.ssh\id_rsa.pub"
PS C:\Users\vio&amp;gt; ssh your-linux-username@your-server-ip "mkdir -p ~/.ssh &amp;amp;&amp;amp; echo '$PUBKEY' &amp;gt;&amp;gt; ~/.ssh/authorized_keys &amp;amp;&amp;amp; chmod 700 ~/.ssh &amp;amp;&amp;amp; chmod 600 ~/.ssh/authorized_keys"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For me, that would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh vio@localhost -p 2222 "mkdir -p ~/.ssh &amp;amp;&amp;amp; echo '$PUBKEY' &amp;gt;&amp;gt; ~/.ssh/authorized_keys &amp;amp;&amp;amp; chmod 700 ~/.ssh &amp;amp;&amp;amp; chmod 600 ~/.ssh/authorized_keys"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Things that might fail or differ for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Opening up the correct ports&lt;/li&gt;
&lt;li&gt;Enabling key-based auth and disabling the password requirement on the SSH server (in this case, our WSL instance)&lt;/li&gt;
&lt;li&gt;File permissions&lt;/li&gt;
&lt;li&gt;SSH client not installed on Windows&lt;/li&gt;
&lt;/ul&gt;
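&lt;p&gt;For the second point, these are the standard OpenSSH options to check in &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; on the WSL side (restart with &lt;code&gt;sudo systemctl restart ssh&lt;/code&gt; after editing); a typical working pair looks like:&lt;/p&gt;

```
PubkeyAuthentication yes
PasswordAuthentication no
```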

&lt;p&gt;Now, to connect to WSL from the Windows installation of Windsurf, simply click the Connect to SSH Host button at the bottom left of the editor and pick the Remote SSH option. Your config host should show up there, and clicking it should work.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Mixtral, OpenAI and the race to bottom</title>
      <dc:creator>Govind.S.B</dc:creator>
      <pubDate>Tue, 19 Dec 2023 10:34:05 +0000</pubDate>
      <link>https://dev.to/govindsb/mixtral-openai-and-the-race-to-bottom-1l38</link>
      <guid>https://dev.to/govindsb/mixtral-openai-and-the-race-to-bottom-1l38</guid>
      <description>&lt;h2&gt;
  
  
  Competition is good
&lt;/h2&gt;

&lt;p&gt;Healthy competition in a market always benefits the end consumer; this has been proven time and time again.&lt;br&gt;
And right now in the AI/LLM space we are seeing exactly that: a race to the bottom.&lt;/p&gt;

&lt;p&gt;See, before this open-source roar in the space there was only OpenAI, with their models GPT-3.5 and GPT-4, and their pricing blew us away. GPT-3.5 gives you a good-enough AI for most general-purpose applications we build (with the current tech). But the thing to consider is that they were the only ones leading it: if you wanted to build an app cost-effectively, it was only them, and they really enjoyed their time up on the ladder.&lt;/p&gt;

&lt;p&gt;But as the leaked Google memo put it:&lt;br&gt;
"We Have No Moat, And Neither Does OpenAI"&lt;/p&gt;
&lt;h2&gt;
  
  
  Open source LLMs
&lt;/h2&gt;

&lt;p&gt;Open-source models were lagging behind GPT-3.5 in both performance and cost: they were expensive to run and returned dumb answers, lol. But then gigachad Mistral dropped their 7B model, the one they call "tiny", and it made everyone lose their minds, it was so good. Soon the community was flooded with Mistral finetunes.&lt;/p&gt;

&lt;p&gt;Mistral then recently dropped Mixtral, a new MoE model that uses their Mistral models in a creative way. A mixture of experts: basically, instead of training a single new model, they combined 8 expert models of 7B each, specialized in various tasks. The neat thing is that, thanks to weight sharing among the experts, the combined model is smaller than 8x7B, and at inference it only uses about 13B parameters (2 experts are consulted for each generated token). So this thing can run on consumer hardware that people actually have... and it performs as well as GPT-3.5 or better... This is important: an open-source model that beats the cheapest feasible closed-source AI.&lt;/p&gt;

&lt;p&gt;OpenAI can charge their pricing because it is their model and they are the only provider, so the price covers infra cost, their R&amp;amp;D, and profit. With an open-source model there is no moat: the model is free and out in the open. So here is what happened next: services appeared that provide inference APIs. They host the model and give you an API just like OpenAI does, but charge only for the infra. It is obvious they were going to undercut the pricing, but by how much is just mind-blowing.&lt;/p&gt;
&lt;h2&gt;
  
  
  The price drop
&lt;/h2&gt;

&lt;p&gt;GPT-3.5 costs $1 per million input tokens and $2 per million output tokens, so on average $1.50 per million tokens.&lt;br&gt;
Together AI, the leading AI infra provider, put out their pricing at $0.60 per million tokens.&lt;br&gt;
Seeing this, Anyscale, another such provider, went one better: $0.50 per million tokens.&lt;br&gt;
It doesn't stop there: DeepInfra dropped their pricing to $0.27 per million tokens.&lt;/p&gt;

&lt;p&gt;That is a staggering 82% cost drop.&lt;/p&gt;
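&lt;p&gt;For anyone who wants to check the arithmetic, here is the 82% figure (and the free-credit token count mentioned further down) worked out in a few lines:&lt;/p&gt;

```python
# quick sanity check on the pricing numbers above
gpt35_avg = (1.00 + 2.00) / 2      # $/M tokens, input and output averaged
deepinfra = 0.27                   # $/M tokens
drop = 1 - deepinfra / gpt35_avg   # fraction saved vs GPT-3.5

together_credits = 25 / 0.60       # $25 free credits at $0.60 per M tokens
print(f"drop: {drop:.0%}, free tokens: about {int(together_credits)}M")
# drop: 82%, free tokens: about 41M
```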

&lt;p&gt;With all this happening, OpenRouter came up. They are a service that auto-routes your request to the cheapest available infra provider, so we benefit from the race to the bottom, and they decided to host Mixtral for free. Yep, free! I talked with their team on Discord, and they said they want to support a lot of models and keep them in beta for people to test (or, as I would put it, to get people used to their ecosystem).&lt;/p&gt;

&lt;p&gt;One thing to note here: don't forget that a lot of providers are also heavily funded with VC money to burn, and they WILL burn it. You can see this in the form of heavily subsidized prices and free credits to capture the market.&lt;/p&gt;

&lt;p&gt;Together AI gives you $25 in free credits once you sign up; at their pricing that is about 41 million tokens, which you are not running out of on personal pet projects.&lt;/p&gt;

&lt;p&gt;The race to the bottom is here, and IMO it is here to stay for a while. So profit while you can and build cool stuff.&lt;br&gt;
I have this &lt;a href="https://github.com/BulletLaunch/Mixtral-Inference-APIs"&gt;repo&lt;/a&gt; where I wrote a general-purpose function to interact with all these providers to use Mixtral; check it out if you want to jump in fast. Star it if you find it useful, and that's it from me, thanks for reading.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A9-wwsHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/BulletLaunch"&gt;
        BulletLaunch
      &lt;/a&gt; / &lt;a href="https://github.com/BulletLaunch/Mixtral-Inference-APIs"&gt;
        Mixtral-Inference-APIs
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      a convenience script used internally having a collection of inference API providers with cheap infra cost
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
Mixtral Inference APIs&lt;/h1&gt;
&lt;p&gt;make the best out of the race to bottom&lt;/p&gt;
&lt;p&gt;a convenience script we use internally having a collection of providers with cheap infra cost for LLM inference&lt;/p&gt;
&lt;p&gt;if you like what we are doing&lt;/p&gt;
&lt;p&gt;Please leave a star on the repo&lt;/p&gt;
&lt;p&gt;Support us on buymeacoffee&lt;br&gt;
&lt;a href="https://www.buymeacoffee.com/bulletlaunch" rel="nofollow"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_OOqhCiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://private-user-images.githubusercontent.com/62943847/291451790-9e97ec08-c4ab-4baa-8485-f3f543f247bb.png%3Fjwt%3DeyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI5ODIzNDUsIm5iZiI6MTcwMjk4MjA0NSwicGF0aCI6Ii82Mjk0Mzg0Ny8yOTE0NTE3OTAtOWU5N2VjMDgtYzRhYi00YmFhLTg0ODUtZjNmNTQzZjI0N2JiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjE5VDEwMzQwNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTA0Y2NlNWViMTViYjljZjNkZjRiN2YyNjFkMWYyN2VmMTBiMmE1ODViNWU5MzczM2ZjOGVmODEwMTY4NzRhZWUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.GFYLKlPRofDHFjQ2jBYSBD4mk5j8bXHxwpytfVmpDxA" alt="Buy Me A Coffee"&gt;&lt;/a&gt;&lt;br&gt;
Checkout our socials and follow us there&lt;br&gt;
&lt;a href="https://twitter.com/bulletlaunchhq" rel="nofollow"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cmAwZZo2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://private-user-images.githubusercontent.com/62943847/291450533-58075057-2502-4fe8-b2f8-c6e121194dd4.png%3Fjwt%3DeyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI5ODIzNDUsIm5iZiI6MTcwMjk4MjA0NSwicGF0aCI6Ii82Mjk0Mzg0Ny8yOTE0NTA1MzMtNTgwNzUwNTctMjUwMi00ZmU4LWIyZjgtYzZlMTIxMTk0ZGQ0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjE5VDEwMzQwNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVkODkyMTFjYjYwNTM2NmYzNzc5ODVhOTYxOWQ3YjlmMjdjN2EyZDE1MjI4ZTdiMWQ5NDI4YmFlNjQyYWE2ODUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.W_ERPNDBLv7DIa379BeoBJmhyyPIrsg_oDL2qa8Vn5Y" alt="Twitter" height="41" width="41"&gt;&lt;/a&gt;
&lt;a href="https://www.linkedin.com/company/bulletlaunch" rel="nofollow"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lUOsoLTs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://private-user-images.githubusercontent.com/62943847/291452448-e71c1e79-a287-4cfc-bad2-79ad41cd445b.png%3Fjwt%3DeyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI5ODIzNDUsIm5iZiI6MTcwMjk4MjA0NSwicGF0aCI6Ii82Mjk0Mzg0Ny8yOTE0NTI0NDgtZTcxYzFlNzktYTI4Ny00Y2ZjLWJhZDItNzlhZDQxY2Q0NDViLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjE5VDEwMzQwNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM5MWM5OTYzMGMzNTJiNmUwMDE5NGRlNzgzNjUyMTY0NmMwZDUxNWY4Y2QwMTk5MWNmNDFlM2Q4ODA1YjYxNDUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.o2s_8TPooUELqyt5k8qgZ5TiU72Y6mgqlzkmILFzqDo" alt="Linkedin" height="41" width="41"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
Usage&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Rename the &lt;code&gt;.env.template&lt;/code&gt; file to &lt;code&gt;.env&lt;/code&gt; and add the corresponding credentials for the providers you want to use (check pricing and performance comparison below)&lt;/li&gt;
&lt;li&gt;You can either run the script directly for testing the endpoints or use the inference function in your program logic&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To directly use the inference function copy the &lt;code&gt;.env&lt;/code&gt; file and &lt;code&gt;llm_inference_script&lt;/code&gt; to your project and import the function&lt;/p&gt;
&lt;p&gt;example :&lt;/p&gt;
&lt;div class="highlight highlight-source-python notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;llm_inference_script&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm_inference&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;dotenv&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;load_dotenv&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-en"&gt;load_dotenv&lt;/span&gt;()
&lt;span class="pl-v"&gt;KEY&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;getenv&lt;/span&gt;(&lt;span class="pl-s"&gt;"PROVIDER_API_KEY"&lt;/span&gt;) &lt;span class="pl-c"&gt;# put correct provider name here&lt;/span&gt;
&lt;span class="pl-s1"&gt;output&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;llm_inference&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/BulletLaunch/Mixtral-Inference-APIs"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;TL;DR: GPT-3.5 is dead and Mixtral killed it; you can run it for free at this point. Check out the &lt;a href="https://github.com/BulletLaunch/Mixtral-Inference-APIs"&gt;repo&lt;/a&gt; for my API endpoint collection to get into the hype fast.&lt;/p&gt;

&lt;p&gt;UPDATE : Openrouter is not free anymore&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SDXL Turbo Optimization Experiments</title>
      <dc:creator>Govind.S.B</dc:creator>
      <pubDate>Sun, 03 Dec 2023 04:15:53 +0000</pubDate>
      <link>https://dev.to/govindsb/sdxl-turbo-optimization-experiments-fgg</link>
      <guid>https://dev.to/govindsb/sdxl-turbo-optimization-experiments-fgg</guid>
      <description>&lt;p&gt;ComfyUI and other UI powered full backend systems were super slow and I wanted to optimize it for better efficiency and real time performane for a local running LLM based ppt generator application I was building  &lt;/p&gt;

&lt;p&gt;I wrote a custom script to benchmark the different stages of image generation while trying out different configurations; this is primarily me writing down my findings for future reference.&lt;/p&gt;

&lt;p&gt;My Specs and Configuration:&lt;br&gt;&lt;br&gt;
Ryzen 7 5800X | 32 GB RAM | RTX 3060 Ti&lt;br&gt;&lt;br&gt;
Windows 11 WSL Ubuntu, python 3.10  &lt;/p&gt;

&lt;p&gt;ComfyUI SDXL Turbo generation speed averages 2.5 seconds per image (without prompt caching).&lt;/p&gt;

&lt;p&gt;My test criteria distinguish Total Time, Load Time for any modules, Init Gen Time, and Avg Gen Time.&lt;br&gt;&lt;br&gt;
For the Avg Gen Time calculation I test with 1 + 4 prompts (one warm-up generation, four measured).&lt;br&gt;&lt;br&gt;
I am not optimizing for memory footprint, just performance.&lt;/p&gt;
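&lt;p&gt;A minimal sketch of how the Init Gen vs Avg Gen split can be measured (my reconstruction of the harness idea, not the actual script; &lt;code&gt;fake_pipe&lt;/code&gt; is a stand-in for the real diffusers pipeline call):&lt;/p&gt;

```python
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# stand-in for the diffusers pipe(prompt) call the real script times
def fake_pipe(prompt):
    return f"image for {prompt!r}"

prompts = ["warm-up"] + [f"prompt {i}" for i in range(4)]  # 1 + 4, as above
_, init_gen_time = timed(fake_pipe, prompts[0])            # first-gen cost
gen_times = [timed(fake_pipe, p)[1] for p in prompts[1:]]
avg_gen_time = sum(gen_times) / len(gen_times)             # steady-state cost
```

Keeping the first generation separate matters because CUDA kernel warm-up and caching make it systematically slower than the steady-state average.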

&lt;p&gt;Here is my Github for reference : &lt;a href="https://github.com/Govind-S-B/sdxl-turbo-optimization-experiments"&gt;https://github.com/Govind-S-B/sdxl-turbo-optimization-experiments&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also post some experiments I do over at my twitter : &lt;a href="https://twitter.com/violetto96"&gt;https://twitter.com/violetto96&lt;/a&gt;&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1730886441595748742-752" src="https://platform.twitter.com/embed/Tweet.html?id=1730886441595748742"&gt;
&lt;/iframe&gt;

&lt;/p&gt;
&lt;h2&gt;
  
  
  Basic Diffuser Performance:
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;19.3036789894104&lt;/td&gt;
&lt;td&gt;17.443750381469727&lt;/td&gt;
&lt;td&gt;17.60261082649231&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;6.236544132232666&lt;/td&gt;
&lt;td&gt;4.532968759536743&lt;/td&gt;
&lt;td&gt;4.691980600357056&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;2.950467348098755&lt;/td&gt;
&lt;td&gt;2.9440150260925293&lt;/td&gt;
&lt;td&gt;3.0313751697540283&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;2.529166877269745&lt;/td&gt;
&lt;td&gt;2.4916916489601135&lt;/td&gt;
&lt;td&gt;2.4698137640953064&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The default example script from the SDXL Turbo repo was used here, iterating over multiple prompts.&lt;/p&gt;
&lt;h2&gt;
  
  
  Batch Prompt Processing:
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;24.790883779525757&lt;/td&gt;
&lt;td&gt;23.44549536705017&lt;/td&gt;
&lt;td&gt;28.178327798843384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;4.713327169418335&lt;/td&gt;
&lt;td&gt;4.722451210021973&lt;/td&gt;
&lt;td&gt;4.594852685928345&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;2.940718412399292&lt;/td&gt;
&lt;td&gt;2.895763397216797&lt;/td&gt;
&lt;td&gt;3.0071566104888916&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;4.2842095494270325&lt;/td&gt;
&lt;td&gt;3.9568201899528503&lt;/td&gt;
&lt;td&gt;5.144079625606537&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Instead of iterating through each prompt, I passed the entire list to the pipeline for batch processing.&lt;br&gt;&lt;br&gt;
It is surprising to see performance decline with the built-in batch-processing method compared to iterating the pipe.&lt;/p&gt;
&lt;h2&gt;
  
  
  Upcast VAE Precision
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;17.625147342681885&lt;/td&gt;
&lt;td&gt;23.426799774169922&lt;/td&gt;
&lt;td&gt;17.705934762954712&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.011475086212158&lt;/td&gt;
&lt;td&gt;4.612784147262573&lt;/td&gt;
&lt;td&gt;4.5227577686309814&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;2.9415533542633057&lt;/td&gt;
&lt;td&gt;4.21022629737854&lt;/td&gt;
&lt;td&gt;3.078413963317871&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;2.4180297255516052&lt;/td&gt;
&lt;td&gt;3.650947332382202&lt;/td&gt;
&lt;td&gt;2.526190757751465&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Performance decreased compared to the base diffuser numbers.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more"&gt;https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more&lt;/a&gt;  &lt;/p&gt;
&lt;h2&gt;
  
  
  VAE fp16 Optimization:
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;11.680602312088013&lt;/td&gt;
&lt;td&gt;13.030585050582886&lt;/td&gt;
&lt;td&gt;11.378620147705078&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.208057641983032&lt;/td&gt;
&lt;td&gt;5.137163877487183&lt;/td&gt;
&lt;td&gt;4.957686901092529&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;1.538787603378296&lt;/td&gt;
&lt;td&gt;1.9612927436828613&lt;/td&gt;
&lt;td&gt;1.539346694946289&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;1.2334392666816711&lt;/td&gt;
&lt;td&gt;1.4830321073532104&lt;/td&gt;
&lt;td&gt;1.220396637916565&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gives a promising speedup (about 2x) with no noticeable quality loss.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more"&gt;https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipe.vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  VAE: Tiny Autoencoder
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (seconds)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;6.502416133880615&lt;/td&gt;
&lt;td&gt;5.839867115020752&lt;/td&gt;
&lt;td&gt;6.068450450897217&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.318273067474365&lt;/td&gt;
&lt;td&gt;4.63173770904541&lt;/td&gt;
&lt;td&gt;4.82595682144165&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;0.6518356800079346&lt;/td&gt;
&lt;td&gt;0.664395809173584&lt;/td&gt;
&lt;td&gt;0.704963207244873&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;0.13307684659957886&lt;/td&gt;
&lt;td&gt;0.13593339920043945&lt;/td&gt;
&lt;td&gt;0.13438260555267334&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Incredible speed. It's said to reduce quality, but I couldn't notice any difference.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#tiny-autoencoder"&gt;https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#tiny-autoencoder&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
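&lt;p&gt;Averaging the three runs from the fp16-VAE and Tiny AutoEncoder tables above, the Tiny AutoEncoder cuts avg gen time by roughly 9-10x. A quick check of that arithmetic (numbers copied from the tables):&lt;/p&gt;

```python
# "Avg Gen Time" runs (seconds) copied from the two tables above
fp16_vae_avg_gen = [1.233, 1.483, 1.220]
tiny_vae_avg_gen = [0.133, 0.136, 0.134]

def mean(xs):
    return sum(xs) / len(xs)

# Ratio of mean per-image times: fp16 VAE fix vs Tiny AutoEncoder
speedup = mean(fp16_vae_avg_gen) / mean(tiny_vae_avg_gen)
print(f"avg gen speedup: {speedup:.1f}x")  # roughly 9-10x
```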



&lt;blockquote&gt;
&lt;p&gt;All subsequent tests use the Tiny AutoEncoder, since the VAE is now essentially optimized and won't need to be swapped out. For all following performance comparisons, use the Tiny AutoEncoder benchmark above as the reference point.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  Compile UNet
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;75.41857695579529&lt;/td&gt;
&lt;td&gt;65.76266884803772&lt;/td&gt;
&lt;td&gt;64.25783467292786&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.129103422164917&lt;/td&gt;
&lt;td&gt;7.217745304107666&lt;/td&gt;
&lt;td&gt;5.239185810089111&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;68.40448021888733&lt;/td&gt;
&lt;td&gt;57.36990666389465&lt;/td&gt;
&lt;td&gt;57.89517068862915&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;0.4712483286857605&lt;/td&gt;
&lt;td&gt;0.2937542200088501&lt;/td&gt;
&lt;td&gt;0.2808695435523987&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Massive increase in init gen time (the first generation pays the one-time compilation cost), and avg gen time ends up roughly 2-3x slower than the Tiny AutoEncoder baseline.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more"&gt;https://huggingface.co/docs/diffusers/using-diffusers/sdxl_turbo#speed-up-sdxl-turbo-even-more&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
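&lt;p&gt;Given these numbers, compilation never pays for itself in this setup: the first generation absorbs roughly a minute of compile time, and since the steady-state per-image time is still slower than the Tiny AutoEncoder baseline, there is no break-even image count. The arithmetic, using the best compile run and the baseline table above:&lt;/p&gt;

```python
# Seconds, copied from the tables above (best compile run vs Tiny AE baseline)
compile_init, compile_avg = 57.90, 0.281
baseline_init, baseline_avg = 0.65, 0.134

def extra_cost(n_images):
    """Extra total cost of generating n images with the compiled UNet.
    Positive means compiling loses; the gap grows with n because the
    compiled per-image time is also slower in this benchmark."""
    with_compile = compile_init + n_images * compile_avg
    without = baseline_init + n_images * baseline_avg
    return with_compile - without

print(extra_cost(1), extra_cost(1000))  # both positive; gap grows with n
```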



&lt;h2&gt;
  CPU Offloading
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (s)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;17.64&lt;/td&gt;
&lt;td&gt;16.96&lt;/td&gt;
&lt;td&gt;16.98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;7.49&lt;/td&gt;
&lt;td&gt;9.04&lt;/td&gt;
&lt;td&gt;7.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;2.28&lt;/td&gt;
&lt;td&gt;2.09&lt;/td&gt;
&lt;td&gt;2.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;1.97&lt;/td&gt;
&lt;td&gt;1.46&lt;/td&gt;
&lt;td&gt;1.81&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Clear decrease in performance across the board.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#model-cpu-offloading"&gt;https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#model-cpu-offloading&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;General rule of thumb: if the model fits in GPU memory, offloading will only cost you performance.  &lt;/p&gt;

&lt;h2&gt;
  VAE Slicing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (s)&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;7.27&lt;/td&gt;
&lt;td&gt;6.00&lt;/td&gt;
&lt;td&gt;6.34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Time&lt;/td&gt;
&lt;td&gt;5.83&lt;/td&gt;
&lt;td&gt;4.82&lt;/td&gt;
&lt;td&gt;4.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Init Gen Time&lt;/td&gt;
&lt;td&gt;0.63&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Gen Time&lt;/td&gt;
&lt;td&gt;0.202&lt;/td&gt;
&lt;td&gt;0.133&lt;/td&gt;
&lt;td&gt;0.211&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No impact, or a slight decrease in performance; hard to gauge from these runs.&lt;br&gt;&lt;br&gt;
source : &lt;a href="https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#slicing"&gt;https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#slicing&lt;/a&gt;  &lt;/p&gt;

&lt;h2&gt;
  Conclusions from Further Testing
&lt;/h2&gt;

&lt;p&gt;I tried all the optimizations listed in &lt;a href="https://huggingface.co/docs/diffusers/optimization/opt_overview"&gt;https://huggingface.co/docs/diffusers/optimization/opt_overview&lt;/a&gt;, including xFormers, token merging, and offloading, but none of them beat the Tiny AutoEncoder benchmarks above.  &lt;/p&gt;

&lt;p&gt;This suggests the remaining bottleneck sits somewhere deep in the Python stack rather than in the model itself, and I have hit the limit of what I can optimize here. If anyone wants to try these optimizations in a different language like Rust, I think that would be the way forward.  &lt;/p&gt;

&lt;h2&gt;
  Additional Notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I couldn't try out &lt;a href="https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#caching-computations"&gt;https://github.com/huggingface/blog/blob/main/simple_sdxl_optimizations.md#caching-computations&lt;/a&gt; since I couldn't figure out how to get the tokenizers and encoders for the model&lt;/li&gt;
&lt;li&gt;Couldn't try out tracing the UNet: &lt;a href="https://huggingface.co/docs/diffusers/optimization/memory#tracing"&gt;https://huggingface.co/docs/diffusers/optimization/memory#tracing&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sdxl</category>
      <category>turbo</category>
      <category>optimization</category>
    </item>
  </channel>
</rss>
