<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alessandro Marrella</title>
    <description>The latest articles on DEV Community by Alessandro Marrella (@amarrella).</description>
    <link>https://dev.to/amarrella</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F250540%2F2ce29663-cc00-460e-b1ae-b82171e88d17.jpeg</url>
      <title>DEV Community: Alessandro Marrella</title>
      <link>https://dev.to/amarrella</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amarrella"/>
    <language>en</language>
    <item>
      <title>Generate a podcast about anything you want</title>
      <dc:creator>Alessandro Marrella</dc:creator>
      <pubDate>Tue, 01 Oct 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/amarrella/generate-a-podcast-about-anything-you-want-48el</link>
      <guid>https://dev.to/amarrella/generate-a-podcast-about-anything-you-want-48el</guid>
      <description>&lt;p&gt;Google's &lt;a href="https://notebooklm.google" rel="noopener noreferrer"&gt;NotebookLM&lt;/a&gt; is making the rounds on the internet (at least in my bubble). It's a new AI tool that Google pitches as a "personalized research assistant" but has evolved to be much more than that, including... a podcast generator.&lt;/p&gt;

&lt;p&gt;Before diving deeper into what NotebookLM does, have a listen to a few of the podcasts I generated and judge for yourself how good (or bad) they are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Podcast about my blog&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://alessandromarrella.com/audio/alessandromarrella-com-podcast.mp3" rel="noopener noreferrer"&gt;https://alessandromarrella.com/audio/alessandromarrella-com-podcast.mp3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Podcast about LLAMA3.1's paper&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://alessandromarrella.com/audio/llama3-paper.mp3" rel="noopener noreferrer"&gt;https://alessandromarrella.com/audio/llama3-paper.mp3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Podcast about chess.com's privacy policy&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://alessandromarrella.com/audio/chess-privacy.mp3" rel="noopener noreferrer"&gt;https://alessandromarrella.com/audio/chess-privacy.mp3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my opinion these podcasts could fool anyone into thinking they are "real". Even if the structure is similar and the hosts are always excited no matter the topic, that's not very far from what happens in the average podcast (especially American ones, sorry :) ).&lt;/p&gt;

&lt;p&gt;Generating a podcast using NotebookLM is very simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign in to &lt;a href="https://notebooklm.google.com/" rel="noopener noreferrer"&gt;NotebookLM&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Accept the terms and conditions&lt;/li&gt;
&lt;li&gt;Click "Upload Source"; you can upload a file, paste a link directly, or read from your Google Drive. You can add multiple documents as well.&lt;/li&gt;
&lt;li&gt;In the "Audio Overview" section, click "Generate" and wait a few minutes...&lt;/li&gt;
&lt;li&gt;Done! You should be able to play or download the audio directly in the UI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As mentioned previously, NotebookLM is a research assistant, so you can also use it to ask questions about your documents, generate FAQs and study guides, and more.&lt;/p&gt;

&lt;p&gt;It's a pretty cool tool that Google created, it's free (for now), and I'm really happy we are moving beyond the "chatbot-style" interface with AI and exploring something new. It turns out LLMs are not only good at generating content, but also at generating questions about the content you provide.&lt;/p&gt;

&lt;p&gt;To learn more I recommend the always amazing &lt;a href="https://simonwillison.net/2024/Sep/29/notebooklm-audio-overview/" rel="noopener noreferrer"&gt;Simon Willison's blog&lt;/a&gt;, and to try it yourself. It's really impressive and could be a really useful tool for learning new things.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Deepfaking myself was scarily easy</title>
      <dc:creator>Alessandro Marrella</dc:creator>
      <pubDate>Tue, 03 Sep 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/amarrella/deepfaking-myself-was-scarily-easy-2l7p</link>
      <guid>https://dev.to/amarrella/deepfaking-myself-was-scarily-easy-2l7p</guid>
<description>&lt;p&gt;Moved by my curiosity about everything AI related, I decided to give it a try by creating a LoRA adapter on the FLUX.1 model by Black Forest Labs to generate pictures of myself that never happened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22s83mi52edq8rjzofvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22s83mi52edq8rjzofvq.png" alt="Deepfake of myself eating pineapple pizza" width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
FLUX.1 is a model by &lt;a href="https://blackforestlabs.ai/" rel="noopener noreferrer"&gt;Black Forest Labs&lt;/a&gt; and as of today represents the state of the art in image generation. It currently comes in three versions: Schnell (open weights, Apache 2.0 licensed), Dev (open weights, non-commercial) and Pro (commercial, closed source).&lt;/p&gt;

&lt;p&gt;In this experiment, I decided to try training FLUX.1 Dev, using the very convenient &lt;code&gt;ostris/flux-dev-lora-trainer&lt;/code&gt; hosted on &lt;a href="https://replicate.com/ostris/flux-dev-lora-trainer/train" rel="noopener noreferrer"&gt;Replicate&lt;/a&gt;. The code for the trainer can be found on &lt;a href="https://github.com/ostris/ai-toolkit" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I didn't play around with the parameters much (training cost me around $2.44); the only things I did were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Uploading a zip file with 12 photos of myself (without any labels; I relied on the auto-labeling that the trainer provides, which leverages &lt;a href="https://llava-vl.github.io/" rel="noopener noreferrer"&gt;LLaVA-1.5&lt;/a&gt;, an image captioning model)&lt;/li&gt;
&lt;li&gt;Setting the &lt;code&gt;autocaption_prefix&lt;/code&gt; to "A photo of TOK, ", TOK being a trigger word that should help the model identify me. I'm not sure whether this helped (I decided to limit my budget to a simple POC), but I did it anyway 🤷‍♂️.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I then kicked off the job, which took about 34 minutes to complete. Once the model was trained, clicking "Run trained model" in the UI brought me to a form where I could configure a prompt like "A photo of TOK, eating pineapple pizza" and generate an image that would hopefully resemble me. You can also provide an image and a "mask" to guide the model on what to generate (see the "Lord Commander of the Night's Watch" example below).&lt;/p&gt;
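
&lt;p&gt;If you prefer to script the generation instead of using the web form, the trained model can also be run through the Replicate Python client. This is a rough sketch rather than the exact flow I used: the model name is a placeholder and the default parameters may differ.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import replicate  # pip install replicate, and set the REPLICATE_API_TOKEN env var

# Placeholder model name: use the destination you chose when creating the training.
# Depending on the client version you may need to append ":{version_id}" to the name.
output = replicate.run(
    "your-username/flux-dev-lora-selfie",
    input={"prompt": "A photo of TOK, eating pineapple pizza"},
)
print(output)  # typically a list of URLs pointing to the generated images

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;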

&lt;p&gt;And that's really it. Generating the images took about a minute; some of them missed the target, but the adapter learned my face reasonably well with no tweaking, and produced absurd results like the ones in the &lt;a href="https://photos.app.goo.gl/Rbn3sa41tzt67D9u6" rel="noopener noreferrer"&gt;examples gallery&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The next step I'm going to try is videos, though I am already really creeped out by the photo ones.&lt;/p&gt;

&lt;p&gt;In a world where everything can be realistically faked (again, my example took no effort, and I'm sure it could be tweaked to be even more realistic), how will we be able to distinguish fiction from reality?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: if you ever see a picture of me eating pineapple pizza, that's a fake for sure.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>generativeai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>BigQuery Editions vs On Demand</title>
      <dc:creator>Alessandro Marrella</dc:creator>
      <pubDate>Sat, 31 Aug 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/amarrella/bigquery-editions-vs-on-demand-1c1</link>
      <guid>https://dev.to/amarrella/bigquery-editions-vs-on-demand-1c1</guid>
      <description>&lt;p&gt;These are some notes on the tradeoffs and best practices between On Demand pricing vs Editions pricing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing models
&lt;/h2&gt;

&lt;p&gt;BigQuery currently offers two very different pricing models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Editions
&lt;/h3&gt;

&lt;p&gt;With &lt;a href="https://cloud.google.com/bigquery/docs/editions-intro" rel="noopener noreferrer"&gt;Editions&lt;/a&gt; you are charged for "compute time" by slot-hour. A slot is a virtual CPU that BigQuery uses to execute queries.&lt;/p&gt;

&lt;p&gt;Within Editions, you can purchase a "commitment" for a lower price if you have workloads running all the time, with the caveat that you are charged for the entire period (it doesn't scale up or down at will, so for a yearly commitment of 100 slots you end up paying for 100 * 365 * 24 slot hours, even if you don't use them).&lt;/p&gt;

&lt;p&gt;Outside the committed capacity, you can use an autoscaling reservation, which sets up the minimum (could be 0) and the maximum slots that BigQuery can use. Slots are scaled up and down based on compute requirements for the queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  On Demand
&lt;/h3&gt;

&lt;p&gt;With &lt;a href="https://cloud.google.com/bigquery/pricing#on_demand_pricing" rel="noopener noreferrer"&gt;On Demand&lt;/a&gt; you are charged for "bytes processed". The compute capacity that GCP gives you is about 2000 slots, but you are not being charged for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices based on the pricing models
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Minimize bytes scanned (especially with On Demand)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Do not &lt;code&gt;SELECT *&lt;/code&gt;; select only the columns you need.&lt;/li&gt;
&lt;li&gt;Partition/Cluster the tables so that you only scan the minimum amount of rows needed (though this increases compute on write, so it's not always helpful with editions, see note on compute used).&lt;/li&gt;
&lt;li&gt;Follow the best practices in the &lt;a href="https://cloud.google.com/bigquery/docs/best-practices-performance-compute#reduce-data-processed" rel="noopener noreferrer"&gt;bigquery docs (reduce data processed)&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Minimize compute used (especially for editions)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Optimize your joins, see &lt;a href="https://alessandromarrella.com/posts/semi-hash-join/" rel="noopener noreferrer"&gt;my note about semi hash joins&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Pay attention to when clustering and partitioning are more harmful than helpful. Take a look at the execution plan of the jobs that produce/update tables with partitioning and clustering. If the table is large, a lot of slot time is going to be spent on sorting the data to match the clustering.&lt;/li&gt;
&lt;li&gt;Follow the best practices in the &lt;a href="https://cloud.google.com/bigquery/docs/best-practices-performance-compute#optimize-query-operations" rel="noopener noreferrer"&gt;bigquery docs (optimize query ops)&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Right-size your slots autoscaling (editions only)
&lt;/h3&gt;

&lt;p&gt;Pay particular attention to the jobs where you have a "contention" warning, and dig through the steps in the execution plan.&lt;/p&gt;

&lt;p&gt;Check the "Wait ms" stat and compare it with "Read", "Write" and "Compute". "Wait" is time that BigQuery spends waiting for slots to become available.&lt;/p&gt;

&lt;p&gt;Given that with autoscaling you are charged by the time allocated, and slots are allocated even when they are not used, wait time still counts towards the cost.&lt;/p&gt;

&lt;p&gt;If wait time is too high, this might mean that you need to either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;serialize the queries so that they are no longer in contention for slots&lt;/li&gt;
&lt;li&gt;increase the max in the autoscaling reservation to reduce contention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't increase the autoscaling reservation maximum too much: from experience, the BigQuery autoscaler is very eager to use as many slots as it can to run the query as quickly as possible, but scaling down takes time (that you are billed for), and the minimum billing interval of 1 minute adds up quickly if you are using many thousands of slots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run queries with the model that's cheapest (if possible)
&lt;/h3&gt;

&lt;p&gt;This sounds obvious, but Google doesn't make it exactly easy. You'll need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;estimate the price of a query in both modes.&lt;/li&gt;
&lt;li&gt;run the query in a dedicated project based on which model you choose (you cannot mix modes within a GCP project)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For 1 (estimate the price of a query in both modes), I like to use variations of the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  job_id, -- the bigquery job id that identifies a run
  query, -- the bigquery query text
  destination_table.table_id, -- the name of the table (if the query writes to a table)
  start_time,
  SUM(total_slot_ms) / (1000 * 60 * 60) * 0.06 AS editions_cost, -- unit cost (here $0.06/slot hour) may differ based on the edition chosen and gcp discounts
  SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS on_demand_cost -- unit cost (here $6.25/TiB) may differ based on region and gcp discounts
FROM `{your_project}`.`{your_region}`.INFORMATION_SCHEMA.JOBS_BY_PROJECT -- or JOBS_BY_ORGANIZATION to see the whole company, but then you'll need to remove the query field
GROUP BY ALL

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note that the prices here are estimates, because Google bills by slots "assigned", not slots used, so the actual slot cost is often slightly higher than the estimate (it takes time to scale up and down, and Google bills a minimum of 1 minute even if a query runs for 3 seconds).&lt;/p&gt;
&lt;/blockquote&gt;
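
&lt;p&gt;For the on-demand half of the estimate, you can also check how many bytes a query would scan &lt;em&gt;before&lt;/em&gt; running it with a dry run. A minimal sketch using the google-cloud-bigquery Python client (the $6.25/TiB rate is an assumption, adjust it for your region and discounts):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# A dry run validates the query and reports bytes processed without executing it (free).
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query = "SELECT pickup_location_id FROM `my_project.my_dataset.my_table`"  # placeholder query
job = client.query(query, job_config=job_config)

tib = job.total_bytes_processed / 1024 ** 4
print(f"Estimated scan: {tib:.4f} TiB, on-demand cost ~${tib * 6.25:.2f}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;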

&lt;p&gt;For 2 (run the query in a dedicated project), you'll need to set up two separate GCP projects, one with the reservation (or the reservation assigned from another project) and the other without.&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>bigquery</category>
      <category>sql</category>
    </item>
    <item>
      <title>How to quickly profile python imports and runtime</title>
      <dc:creator>Alessandro Marrella</dc:creator>
      <pubDate>Sat, 20 Jul 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/amarrella/how-to-quickly-profile-python-imports-and-runtime-28le</link>
      <guid>https://dev.to/amarrella/how-to-quickly-profile-python-imports-and-runtime-28le</guid>
      <description>&lt;p&gt;A small &lt;a href="https://alessandromarrella.com/tags/til/" rel="noopener noreferrer"&gt;TIL&lt;/a&gt; about Python profiling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nschloe/tuna" rel="noopener noreferrer"&gt;tuna&lt;/a&gt; is a really handy tool that renders the output of &lt;a href="https://docs.python.org/3/library/profile.html" rel="noopener noreferrer"&gt;cProfile&lt;/a&gt; and &lt;code&gt;python -X importtime&lt;/code&gt; logs into an easy to navigate tree.&lt;/p&gt;
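
&lt;p&gt;A minimal example, assuming &lt;code&gt;yourscript.py&lt;/code&gt; is the entry point you want to profile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Profile import time and open the result in tuna
python -X importtime yourscript.py 2&gt; import.log
tuna import.log

# Profile runtime with cProfile and open the result in tuna
python -m cProfile -o program.prof yourscript.py
tuna program.prof

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;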

</description>
      <category>python</category>
      <category>todayilearned</category>
    </item>
    <item>
      <title>The AI/ML concepts behind Apple Intelligence</title>
      <dc:creator>Alessandro Marrella</dc:creator>
      <pubDate>Wed, 19 Jun 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/amarrella/the-aiml-concepts-behind-apple-intelligence-adm</link>
      <guid>https://dev.to/amarrella/the-aiml-concepts-behind-apple-intelligence-adm</guid>
      <description>&lt;p&gt;On Monday, June 10, 2024 at their annual World Wide Developer Conference (&lt;a href="https://developer.apple.com/wwdc24/" rel="noopener noreferrer"&gt;WWDC&lt;/a&gt;) Apple announced "&lt;a href="https://www.apple.com/apple-intelligence/" rel="noopener noreferrer"&gt;Apple Intelligence&lt;/a&gt;", their own flavour of AI integration into their operating systems.&lt;/p&gt;

&lt;p&gt;Emphasis should probably go on &lt;em&gt;integration&lt;/em&gt;, as the experience demoed at WWDC felt like every feature we are used to on our devices was augmented in one way or another by AI. The AI is also able to orchestrate those features, using multiple apps and functionalities to achieve what the user wants.&lt;/p&gt;

&lt;p&gt;As someone who has a love/hate relationship with Siri, this is only good news: Siri will become way more powerful and better able to understand what you want to do.&lt;/p&gt;

&lt;p&gt;This post is not going to be an overview of Apple Intelligence from the user's perspective though (for that I suggest Simon Willison's excellent blog post &lt;a href="https://simonwillison.net/2024/Jun/10/apple-intelligence/" rel="noopener noreferrer"&gt;Thoughts on the WWDC 2024 keynote on Apple Intelligence&lt;/a&gt;), but will explore key concepts such as the Semantic Index, the App Intents Toolbox, and the foundation model adaptations Apple uses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;From what was presented in the &lt;a href="https://developer.apple.com/videos/play/wwdc2024/101/" rel="noopener noreferrer"&gt;keynote&lt;/a&gt;, the &lt;a href="https://developer.apple.com/videos/play/wwdc2024/102/?time=95" rel="noopener noreferrer"&gt;platform state of the union&lt;/a&gt;, and the content in &lt;a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models" rel="noopener noreferrer"&gt;introducing apple foundation models&lt;/a&gt;, we can derive a bit of the architecture that powers Apple Intelligence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltmkqpyp4q1kbqjmqqpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltmkqpyp4q1kbqjmqqpd.png" alt="Architecture diagram for Apple Intelligence, from the platform state of the union." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram above, we can see that the architecture has 3 layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Apps and Experiences&lt;/strong&gt; : features that the user sees and interacts with&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal Intelligence System&lt;/strong&gt; : what runs on device and on Apple's servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple Silicon&lt;/strong&gt; : the specialised Apple hardware that powers the intelligence layer and the security between on-device and cloud communication.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Apps and system-wide experiences include, for example, writing tools (summarization, rewriting, etc.), Image Playground (to generate images and "genmojis") and Siri.&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal Intelligence System
&lt;/h2&gt;

&lt;p&gt;Apps and system-wide experiences interact with the Personal Intelligence System by reading from and writing to the "Semantic Index" and the "App Intents Toolbox". Then there is an orchestration layer on device that decides whether to use on-device or server models.&lt;/p&gt;

&lt;p&gt;Let's break down these components.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Index
&lt;/h3&gt;

&lt;p&gt;What Apple calls a Semantic Index is probably a &lt;em&gt;vector database&lt;/em&gt; storing &lt;em&gt;embeddings&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;An "embedding" is a vector of numbers representing objects like text, images, audio and video in a multidimensional space. By computing the distance between two embeddings, you can see whether two concepts are related. You can read more about embeddings in this &lt;a href="https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/" rel="noopener noreferrer"&gt;Stack Overflow blog post&lt;/a&gt;.&lt;/p&gt;
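
&lt;p&gt;As a toy illustration of "distance between embeddings" (not how Apple's Semantic Index works internally), here is a small NumPy sketch that ranks items by cosine similarity to a query embedding:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Pretend embeddings: 4 items and 1 query, each a vector in a 5-dimensional space.
items = np.random.randn(4, 5)
query = np.random.randn(5)

def cosine_similarity(a, b):
    # Higher means "closer" in meaning; 1.0 is an identical direction.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine_similarity(item, query) for item in items])
print("Most similar item index:", int(scores.argmax()))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;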

&lt;p&gt;Computing the distance between embeddings is usually compute intensive, especially if you are trying to surface "the closest embedding" in a collection of many of those.&lt;/p&gt;

&lt;p&gt;To alleviate the problem, we tend to look for the &lt;a href="https://en.wikipedia.org/wiki/(1%2B%CE%B5)-approximate_nearest_neighbor_search" rel="noopener noreferrer"&gt;Approximate Nearest Neighbour (ANN)&lt;/a&gt; instead of the exact one.&lt;/p&gt;

&lt;p&gt;Vector databases are databases specialized in storing vectors in a way that makes this search as efficient as possible.&lt;/p&gt;

&lt;p&gt;Open-source examples of these databases are &lt;a href="https://www.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt; and &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  App Intents Toolbox
&lt;/h3&gt;

&lt;p&gt;The "App Intents Toolbox" is a way for apps to declare their capabilities so that they can be leveraged by the model. This leverages a technique called "tool use", which comes in two variants: single step and multi step. Single step tool use (think "write an email") is usually referred to as "function calling", while the multi step variant is commonly referred to as "agents".&lt;/p&gt;

&lt;p&gt;With &lt;em&gt;function calling&lt;/em&gt;, you make available to the model a list of tools that the model can "call". Based on the prompt, the model chooses whether to use a tool, and if it does, it returns which tool to use and with what parameters. The system then calls the tool, gets the result and returns it to the user.&lt;/p&gt;

&lt;p&gt;In Apple's terms, the user interacts with the "Personal Intelligence System" via Siri or an app; the model that Apple uses within the system checks the "App Intents Toolbox" to see if there is any app or system feature that can satisfy the request, calls it with the parameters exposed in the App Intents API, and surfaces the result to the user. For example, a query like "show me the pictures of my dog in Dublin" will interact with the Images app via the intents toolbox, filter the pictures by selecting only those tagged (by another model at a separate time) as my dog, with the geo location set to Dublin, and return the app view with those pictures displayed.&lt;/p&gt;
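
&lt;p&gt;As a rough sketch of what function calling looks like from the application side (the names below are made up for illustration, this is not Apple's App Intents API): the app registers tools, the model returns which tool to call and with which arguments, and the system executes it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical, simplified function-calling loop; not Apple's actual App Intents API.
tools = {
    "search_photos": lambda tags, location: f"photos tagged '{tags}' taken in {location}",
    "send_email": lambda to, body: f"draft email to {to}: {body}",
}

def handle_model_response(response):
    # In a real system the model returns structured output (a tool name plus JSON arguments).
    tool_name = response["tool"]
    arguments = response["arguments"]
    return tools[tool_name](**arguments)

# Pretend the model decided to call a tool for "show me the pictures of my dog in Dublin".
model_response = {"tool": "search_photos", "arguments": {"tags": "dog", "location": "Dublin"}}
print(handle_model_response(model_response))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;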

&lt;p&gt;&lt;em&gt;Agents&lt;/em&gt; capture more nuanced use cases, by taking multiple steps of "function calling" on behalf of the user, so queries like "send the pictures of my dog in Dublin to my mom via email" will first call the images app, and then the email app, and the user is presented with the email pre-populated with the pictures.&lt;/p&gt;

&lt;p&gt;Both function calling and agents are available in the APIs of the most popular models (e.g. OpenAI or Google Gemini) and in the open-source world as well, with models such as Cohere's &lt;a href="https://docs.cohere.com/docs/tools" rel="noopener noreferrer"&gt;Command-R&lt;/a&gt; specialized in tool use, and frameworks such as &lt;a href="https://www.langchain.com/langchain" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; that help build complex agent applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Orchestration
&lt;/h3&gt;

&lt;p&gt;The orchestration layer is an on-device model whose only task is to decide whether to use one of the many on device or server models based on what the user or the system is trying to do. This is similar to the agents described above, but in this case the "tools" to use are other models!&lt;/p&gt;

&lt;h3&gt;
  
  
  Models
&lt;/h3&gt;

&lt;p&gt;Individually, the on-device and server models are all multimodal models, with the key difference being the size of the model and the compute power required to run them.&lt;/p&gt;

&lt;p&gt;Apple &lt;a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models" rel="noopener noreferrer"&gt;describes&lt;/a&gt; their model architecture as having foundation models + adapters.&lt;/p&gt;

&lt;p&gt;The modeling happens in different phases: &lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1t0qzo3tcylavc02r32o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1t0qzo3tcylavc02r32o.png" alt="Modeling phases: data collection, preprocessing, pre-training, post-training, optimization. This creates apple models which are specialized with adapters." width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Data collection and pre-processing
&lt;/h4&gt;

&lt;p&gt;Apple collects licensed data and data scraped by their own web crawler, for which they provide an opt-out by configuring the robots.txt file on the website. They do some pre-processing and feed it into pre-training.&lt;/p&gt;
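
&lt;p&gt;For example, a site that wants to opt out could add something like the following to its robots.txt (Applebot is Apple's crawler; check Apple's documentation for the exact user agent token that controls the AI-training opt-out):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Block Apple's crawler for the whole site
User-agent: Applebot
Disallow: /

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;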

&lt;h4&gt;
  
  
  Pre-training
&lt;/h4&gt;

&lt;p&gt;Pre-training happens using &lt;a href="https://github.com/apple/axlearn" rel="noopener noreferrer"&gt;AXLearn&lt;/a&gt;, a framework built on top of Google's JAX and Tensorflow's XLA. This allows them to run the training on GPUs and TPUs. Apple lists a bunch of techniques that they use to achieve efficient pre-training, by splitting the work across multiple machines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Parallelism: Splitting data across multiple machines to train faster.&lt;/li&gt;
&lt;li&gt;Tensor Parallelism: Breaking down the model itself across multiple machines.&lt;/li&gt;
&lt;li&gt;Sequence Parallelism: Dividing long sequences of data to process them in parts.&lt;/li&gt;
&lt;li&gt;Fully Sharded Data Parallel (FSDP): Distributing both data and model pieces in a way that uses resources most efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These techniques are well explained in the HuggingFace docs, which cover &lt;a href="https://huggingface.co/docs/transformers/main/en/perf_train_gpu_many" rel="noopener noreferrer"&gt;Multiple GPUs and Parallelism&lt;/a&gt; and the &lt;a href="https://huggingface.co/docs/transformers/main/en/fsdp" rel="noopener noreferrer"&gt;Fully Sharded Data Parallel (FSDP)&lt;/a&gt; method.&lt;/p&gt;

&lt;h4&gt;
  
  
  Post-training
&lt;/h4&gt;

&lt;p&gt;After pre-training, Apple refines the model with "post-training", using a mix of reinforcement learning with human feedback (RLHF) and training on synthetic data.&lt;/p&gt;

&lt;p&gt;RLHF is a technique that became really popular with the release of ChatGPT, and consists of incorporating human feedback to align the model with human preferences. Once the model is pre-trained, humans interact with it and provide feedback on its performance, usually by ranking different responses to the same prompt, or by assigning a score. Based on the feedback, the model receives a reward or a penalty, and these rewards or penalties are used to update the model's behaviour (this is typical of reinforcement learning: the model will try to optimize to receive as many rewards as possible over time). There are different ways to build this; there is a good overview on &lt;a href="https://huggingface.co/blog/rlhf" rel="noopener noreferrer"&gt;HuggingFace's blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Training on synthetic data is also a way to improve the model output, by feeding the model data generated by a machine. The advantage of this is that the generated data is deterministic and can be produced in large quantities. I suggest reading this paper from &lt;a href="https://arxiv.org/html/2404.07503v1" rel="noopener noreferrer"&gt;Google Deepmind&lt;/a&gt; to learn more about the current state of the art.&lt;/p&gt;

&lt;p&gt;There is also the possibility of mixing the two, by using Reinforcement Learning with AI Feedback, where instead of humans we can use other AIs, specialized in ranking and scoring.&lt;/p&gt;

&lt;h4&gt;
  
  
  Optimization
&lt;/h4&gt;

&lt;p&gt;This stage is mainly focused on the speed and efficiency of the models. Especially for those running on device, we want them to run fast and draw little power, so they don't excessively drain the battery. On their &lt;a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models" rel="noopener noreferrer"&gt;models page&lt;/a&gt; Apple lists a few techniques with which they perform these optimizations.&lt;/p&gt;

&lt;h5&gt;
  
  
  Grouped Query Attention
&lt;/h5&gt;

&lt;p&gt;First, they mention Grouped Query Attention (GQA). There are a few concepts to unpack to understand this technique. "Attention" is a mechanism that weighs the importance of different tokens (words, images, etc) in a sentence to predict the next token to generate, and involves three key components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Queries: the current token or sentence that the model is currently focusing on&lt;/li&gt;
&lt;li&gt;Keys: the tokens that the model can potentially focus on&lt;/li&gt;
&lt;li&gt;Values: the actual values (could be the same as the keys) that can be used in the output&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The simplest form of attention has a single "head" (a head in this context is an independent attention mechanism), so the model can only focus on one set of queries, keys and values. The output can then (for example) be computed by taking the dot product of queries and keys to get the attention scores.&lt;/p&gt;

&lt;p&gt;Many large language models are based on the Transformer architecture, which commonly uses "multi-head" attention (&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Attention is all you need&lt;/a&gt;). In this case there are multiple attention "heads", each focusing on and prioritizing different tokens or sequences, and contributing to the final prediction. Multi-head attention is a very powerful technique because the model can focus on multiple parts of the data at the same time, leading to more nuanced outputs. The issue with "multi-head" is that it's much more expensive to compute, slowing down inference considerably.&lt;/p&gt;

&lt;p&gt;To speed up the process, Shazeer from Google published &lt;a href="https://arxiv.org/pdf/1911.02150" rel="noopener noreferrer"&gt;Fast Transformer Decoding: One Write-Head is All You Need&lt;/a&gt; in 2019, introducing "multi-query attention", which shares keys and values across all the attention heads, only changing the "query". This leads to a much more memory-efficient structure, making inference much faster.&lt;/p&gt;

&lt;p&gt;The issue with the "multi-query attention" technique is a degradation in quality, due to the fact that the selected keys and values are the same for all heads. Grouped Query Attention is a compromise between multi-head attention (where we have distinct keys/values for each head) and multi query attention (where we share them for all heads). The compromise is reached by creating "groups" that share the same key value pairs.&lt;/p&gt;
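
&lt;p&gt;A toy NumPy sketch of the idea (a single query token, no masking or learned projections, so not Apple's implementation): with grouped-query attention, several query heads share one key/value head.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

n_q_heads, n_kv_heads, d = 8, 2, 16   # 8 query heads share 2 key/value heads (groups of 4)
seq_len = 10
group = n_q_heads // n_kv_heads

q = np.random.randn(n_q_heads, d)            # queries for the current token, one per head
k = np.random.randn(n_kv_heads, seq_len, d)  # keys: only n_kv_heads of them, not n_q_heads
v = np.random.randn(n_kv_heads, seq_len, d)  # values: only n_kv_heads of them

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

outputs = []
for h in range(n_q_heads):
    kv = h // group                                 # the shared key/value head for this query head
    scores = softmax(q[h] @ k[kv].T / np.sqrt(d))   # attention weights over the sequence
    outputs.append(scores @ v[kv])                  # weighted sum of the values
print(np.stack(outputs).shape)                      # (8, 16): one output vector per query head

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;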

&lt;h5&gt;
  
  
  Shared embedding tables
&lt;/h5&gt;

&lt;p&gt;The models leverage tables that map tokens (the "vocabulary") to vectors and vice versa. Sharing these tables across models allows for lower memory usage and greater inference speed.&lt;/p&gt;

&lt;p&gt;Apple mentions that they use a 49K vocabulary for on-device models, and 100K for server models (which includes additional language and technical tokens). For comparison, &lt;a href="https://ai.meta.com/blog/meta-llama-3/" rel="noopener noreferrer"&gt;Meta's Llama 3&lt;/a&gt; model uses a vocabulary of 128K tokens.&lt;/p&gt;

&lt;h5&gt;
  
  
  Low-bit palettization
&lt;/h5&gt;

&lt;p&gt;Palettization is a quantization technique to reduce memory usage (and as a consequence reduce power usage and improve performance) by "compressing" the weight vectors using a fixed lookup table.&lt;/p&gt;

&lt;p&gt;For example, if we had a vector of floating point weights such as &lt;code&gt;[0.1, 0.1, 0.2, 0.2]&lt;/code&gt; and a 1-bit lookup table &lt;code&gt;{0.1: 0, 0.2: 1}&lt;/code&gt;, we could compress the weights into the 1-bit vector &lt;code&gt;[0, 0, 1, 1]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In particular, Apple says that they are mixing 2-bit and 4-bit palettes to achieve an average size of 3.5 bits per weight (which is much cheaper to use than the original 16 bits the model was trained on). This usually comes at an output quality cost, but it seems from Apple's own benchmarks that the results are good enough (they mention that they measure the impact of these optimizations with &lt;a href="https://arxiv.org/html/2404.03085v1" rel="noopener noreferrer"&gt;Talaria&lt;/a&gt;, a custom-developed tool that analyses model latency and power usage to select the best bit rate).&lt;/p&gt;
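
&lt;p&gt;A toy NumPy sketch of the idea (a 2-bit palette chosen by hand, not Apple's actual scheme): each weight is stored as an index into a small lookup table instead of a 16-bit float.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

weights = np.array([0.11, 0.12, 0.48, 0.52, -0.29, -0.31, 0.49, 0.10], dtype=np.float16)

# A 2-bit "palette": 4 representative values (in practice chosen by clustering the weights).
palette = np.array([-0.30, 0.10, 0.50, 0.00], dtype=np.float16)

# Replace each weight with the index of the closest palette entry (2 bits instead of 16).
indices = np.abs(weights[:, None] - palette[None, :]).argmin(axis=1)
reconstructed = palette[indices]

print(indices)        # [1 1 2 2 0 0 2 1]
print(reconstructed)  # an approximation of the original weights

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;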

&lt;h5&gt;
  
  
  Activation quantization
&lt;/h5&gt;

&lt;p&gt;Activations are the outputs of the neurons after applying an activation function, common activation functions are ReLU, Tanh, or Sigmoid. The goal of these activation functions is to introduce a non-linearity in the neural network and allow the model to learn more complex patterns (there is a really good &lt;a href="https://www.youtube.com/watch?v=P6sfmUTpUmc" rel="noopener noreferrer"&gt;video on activations&lt;/a&gt; in Karpathy's series NN Zero to Hero).&lt;/p&gt;

&lt;p&gt;Quantizing the activations reduces their precision, again reducing the memory and compute footprint. You can see how to quantize activations in Lei Mao's "Quantization for Neural Networks" article, in the &lt;a href="https://leimao.github.io/article/Neural-Networks-Quantization/#Quantized-Deep-Learning-Layers" rel="noopener noreferrer"&gt;Quantized Deep Learning Layers&lt;/a&gt; section.&lt;/p&gt;
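
&lt;p&gt;As a toy illustration (symmetric int8 quantization, not necessarily the scheme Apple uses): activations are scaled, rounded to 8-bit integers, and scaled back when they are used.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

activations = np.random.randn(6).astype(np.float32) * 3.0

# Symmetric quantization to int8: pick a scale so the largest value maps close to 127.
scale = np.abs(activations).max() / 127.0
quantized = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print(activations)
print(dequantized)  # close to the original, but stored in 8 bits instead of 32

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;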

&lt;h5&gt;
  
  
  Embedding quantization
&lt;/h5&gt;

&lt;p&gt;As mentioned before, embeddings are vector representations of tokens. They can also be quantized to reduce the memory footprint and improve performance!&lt;/p&gt;

&lt;p&gt;In particular with embeddings, there is a now-popular technique called "binary quantization", which converts these embeddings from float32 values into 1-bit values, significantly reducing the embedding size. Quantizing embeddings here literally means setting a threshold of 0: any value below 0 gets mapped to 0, and any value above it is mapped to 1. This seems like a very large loss of information, but in practice, for information retrieval, it yields really good results (see &lt;a href="https://arxiv.org/abs/2106.00882" rel="noopener noreferrer"&gt;this paper&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Another big advantage of binary embedding quantization is that retrieval is super fast using the Hamming distance, which can be computed with just two CPU instructions (an XOR followed by a popcount).&lt;/p&gt;
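
&lt;p&gt;A small NumPy sketch of binary quantization and Hamming-distance retrieval (thresholding at 0, then XOR plus a count of the differing bits):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

docs = np.random.randn(1000, 256)   # float32 document embeddings
query = np.random.randn(256)        # a query embedding

# Binary quantization: everything above 0 becomes 1, everything else becomes 0.
docs_bits = (docs &gt; 0).astype(np.uint8)
query_bits = (query &gt; 0).astype(np.uint8)

# Hamming distance: XOR the bit vectors and count the ones (popcount).
hamming = np.bitwise_xor(docs_bits, query_bits).sum(axis=1)
print("Closest document index:", int(hamming.argmin()))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;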

&lt;p&gt;For more details on embedding quantization, I suggest reading this &lt;a href="https://huggingface.co/blog/embedding-quantization" rel="noopener noreferrer"&gt;HuggingFace blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  Efficient key-value cache update on neural engines
&lt;/h5&gt;

&lt;p&gt;The Neural Engine is the part of Apple silicon optimized to run computations for neural networks. Apple doesn't give a lot of information here, but certainly having a way to quickly update the key-value cache on a GPU-like device is critical for performance.&lt;/p&gt;

&lt;h5&gt;
  
  
  Token speculation
&lt;/h5&gt;

&lt;p&gt;Apple mentions they also employ token speculation techniques, which means predicting multiple likely tokens at the same time, allowing the model to explore multiple paths simultaneously. This is useful, for example, to provide a more "real time" experience for autocompletion, by speculating on a few paths the user could take in composing the text.&lt;/p&gt;

&lt;h4&gt;
  
  
  Model adaptation
&lt;/h4&gt;

&lt;p&gt;Apple created foundation models for language and image generation. These models are fine-tuned for the different activities that the user might do on their device and specialize themselves just in time for the task at hand. To achieve a variety of specializations personalised for the user quickly, Apple uses &lt;em&gt;adapters&lt;/em&gt;: small neural network modules that can be attached to the foundation models to make them specialized on a specific task. These adapter models can be dynamically loaded into memory, cached and swapped.&lt;/p&gt;

&lt;h5&gt;
  
  
  Types of adaptation
&lt;/h5&gt;

&lt;p&gt;Apple &lt;a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models#model-adaptation:~:text=token%20generation%20rate.-,Model%20Adaptation,-Our%20foundation%20models" rel="noopener noreferrer"&gt;mentions&lt;/a&gt; that they adapt the attention matrices, the attention projection matrix and the fully connected layers in the point-wise feedforward networks.&lt;/p&gt;

&lt;p&gt;Let's break down the adaptation techniques that Apple mentions.&lt;/p&gt;

&lt;h6&gt;
  
  
  Attention matrix adaptation
&lt;/h6&gt;

&lt;p&gt;As explained &lt;a href="https://alessandromarrella.com/posts/apple-intelligence/#grouped-query-attention" rel="noopener noreferrer"&gt;above&lt;/a&gt;, the attention matrix is what makes the model "focus" on specific tokens or sequences to predict the next one. Adapting the attention matrix means, for example, updating how the attention scores are computed or modifying the scaling factors, to ultimately change how attention is distributed. This can be achieved by further training on a specific domain, by training that updates only the attention scores, or by techniques such as knowledge distillation, with a "teacher" model and a "student" model, where we train the student so that its attention matrices stay close to the teacher's.&lt;/p&gt;

&lt;h6&gt;
  
  
  Attention projection matrices adaptation
&lt;/h6&gt;

&lt;p&gt;Attention projection transforms the input sequence into (usually) three sets of matrices, the Queries, Keys and Values, which are the main components of the attention mechanism.&lt;/p&gt;

&lt;p&gt;By inserting an adapter at the attention projection matrices layer, we add a small neural network specialized for a task that changes the matrices that are produced, influencing the final output to be more in line with the task we have in mind. Low-Rank Adaptation (LoRA) is a way of achieving this: the original matrices stay frozen, and small low-rank update matrices trained on the task-specific dataset are added on top. You can read more about LoRA &lt;a href="https://huggingface.co/docs/peft/main/en/conceptual_guides/lora" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
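
&lt;p&gt;A minimal NumPy sketch of the LoRA idea (not Apple's exact adapter architecture): the original projection matrix stays frozen, and a small low-rank update is learned for the task, so only a few parameters need to be stored and swapped per adapter.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

d_model, rank = 512, 8

W = np.random.randn(d_model, d_model)       # frozen pre-trained projection matrix
A = np.random.randn(rank, d_model) * 0.01   # trainable low-rank factor (rank x d_model)
B = np.zeros((d_model, rank))               # trainable low-rank factor (d_model x rank)

x = np.random.randn(d_model)                # an input activation

# Adapted projection: the frozen weights plus the low-rank, task-specific update.
y = W @ x + B @ (A @ x)

# The adapter stores only 2 * rank * d_model parameters instead of d_model squared.
print(A.size + B.size, "adapter parameters vs", W.size, "frozen parameters")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;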

&lt;h6&gt;
  
  
  Fully connected layers in the point-wise feedforward networks adaptation
&lt;/h6&gt;

&lt;p&gt;Point-wise feed-forward networks are a component of transformer models (introduced in &lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Attention is all you need&lt;/a&gt;). They consist of two fully connected layers with an activation function in the middle, and are generally used to learn more complex transformations.&lt;/p&gt;

&lt;p&gt;Inserting an adapter here means injecting it between those layers, again to make the model specialise in completing specific tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Evaluation and fine-tuning improvements
&lt;/h4&gt;

&lt;p&gt;Given that the focus of Apple Intelligence is to augment the user experience, the main focus of Apple's evaluation pipeline is Human Evaluation. In their &lt;a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models" rel="noopener noreferrer"&gt;foundation models&lt;/a&gt; document, Apple provides a useful example on how they conduct the study (the example is based on the summarization task for emails vs notifications):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the local model summarization adapter is trained on synthetic summaries generated by the more powerful server models (this is known as knowledge distillation, also mentioned previously in the model adaptation section)&lt;/li&gt;
&lt;li&gt;to evaluate the product-specific summarization (email vs notification), they sample responses for each use case, with diverse inputs, with datasets that resemble real use cases&lt;/li&gt;
&lt;li&gt;they run explicit tests to reduce risks related to the task (e.g. for summarization, omitting important information), and also conduct adversarial tests to identify unknown harms&lt;/li&gt;
&lt;li&gt;the foundation models themselves are also tested and evaluated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I suggest looking through the Evaluation section of their doc to read more about the benchmarks they ran and how they are evaluating their model performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apple Intelligence showcases the cutting-edge AI and ML techniques that drive the next generation of user experiences, which we can all use to improve our products. By integrating a sophisticated architecture with the Semantic Index (vector store of embeddings) and App Intents Toolbox (tools for function calling and agents), Apple has created a seamless interaction environment to augment the user capabilities while using their devices.&lt;/p&gt;

&lt;p&gt;With their optimization techniques, including Grouped Query Attention, shared embedding tables, and various quantization methods, they enhance performance and efficiency.&lt;/p&gt;

&lt;p&gt;Apple’s use of adapters for model specialization ensures tailored user interactions while maintaining high-speed processing and low power consumption.&lt;/p&gt;

&lt;p&gt;Finally, their evaluation loop, based on a combination of human feedback and fine-tuning, allows them to further improve the user experience, which is a key focus of Apple's differentiation.&lt;/p&gt;

&lt;p&gt;These advancements not only enhance the user experience but also pave the way for more intuitive and intelligent applications. As developers, exploring these techniques can inspire innovative solutions in our own projects. I'm excited about the way forward, and cannot wait to see the finished product that will ship with iOS 18 and the new macOS. In the meantime, I'll try to apply some of these techniques and optimizations to my projects to understand them further.&lt;/p&gt;

&lt;p&gt;UPDATE: Apple released more details in their &lt;a href="https://arxiv.org/pdf/2407.21075" rel="noopener noreferrer"&gt;Apple Intelligence Foundation Language Models paper&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The images in this blog post are © 2024 Apple and are used under the doctrine of fair use for purposes of commentary.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>generativeai</category>
    </item>
    <item>
      <title>BigQuery performance best practice: use semi joins when possible</title>
      <dc:creator>Alessandro Marrella</dc:creator>
      <pubDate>Sun, 02 Jun 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/amarrella/bigquery-performance-best-practice-use-semi-joins-when-possible-2m0j</link>
      <guid>https://dev.to/amarrella/bigquery-performance-best-practice-use-semi-joins-when-possible-2m0j</guid>
<description>&lt;p&gt;SQL is an amazing language: it lets you declaratively say what you want, and the engine figures out the best way to return it to you. Or should I say, it figures out the best way to return it to you &lt;em&gt;given the information it has and the capabilities of the engine itself&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this post, we’ll discuss a performance optimization technique for BigQuery: using semi joins. (Other sufficiently advanced enterprise data warehouses and databases support SEMI JOINs too, but I'll focus on BigQuery since it's the one I use the most these days.)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A SEMI JOIN returns rows from the first table (left table) where one or more matches are found in the second table (right table), but it does not return rows from the second table. This can significantly improve query performance in certain scenarios.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's look at an example query using one of BigQuery's public datasets, the NYC taxi dataset, which contains a log of taxi trips in NYC.&lt;/p&gt;

&lt;p&gt;Suppose that we want to know the distinct pickup_location_id values where both yellow and green taxis picked up clients in 2022.&lt;/p&gt;

&lt;p&gt;One way to express this query might be the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  join `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
  on yellow.pickup_location_id = green.pickup_location_id

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we look at the performance on my project at the time of running, this query's performance was the following (results might change a bit based on your Google Cloud project, slot availability, time of day, etc.):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total elapsed time&lt;/strong&gt; : 4min 45sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slot time consumed&lt;/strong&gt; : 3h40min (A &lt;a href="https://cloud.google.com/bigquery/docs/slots" rel="noopener noreferrer"&gt;slot&lt;/a&gt; in BigQuery is a unit of computational capacity required to execute SQL queries)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we look at the execution graph, we see the following &lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94vu7fvobrr67miyjcy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94vu7fvobrr67miyjcy2.png" alt="Execution Graph, showing that the join took more than 4 minutes" width="800" height="783"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The join is definitely the step that takes the longest! If we click the join step in the graph, BigQuery really helpfully shows us more information:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgaid897uhpne664dfr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgaid897uhpne664dfr2.png" alt="Join Detail, showing the join stats with a lot more rows produced than consumed" width="591" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The UI shows that the join produced a lot more rows than it consumed, due to how the join was applied. You can also see that it uses an &lt;code&gt;INNER HASH JOIN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In theory, this is a very efficient kind of join, as it builds a hash table with the join keys of one table (the smaller one), and then probes the other table (the larger one) to find the rows whose keys have a match in the hash table.&lt;/p&gt;

&lt;p&gt;The problem in this case is not so much how the join happens, but what it produces. As you can see from the screenshot above, the number of rows produced is 83 million! This is due to the many-to-many relationship that we have in this join, where a &lt;code&gt;pickup_location_id&lt;/code&gt; can appear multiple times in either table.&lt;/p&gt;

&lt;p&gt;In a sense, BigQuery here does way more than we need to, as it would be enough to find one match in the "green" table to consider the row in yellow valid. In other words, we need a &lt;code&gt;SEMI JOIN&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;code&gt;SEMI JOIN&lt;/code&gt; returns rows from the first table (left table) where one or more matches are found in the second table (right table), but it does not return rows from the second table.&lt;/p&gt;
&lt;/blockquote&gt;
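
&lt;p&gt;To see the blow-up on a toy example you can run standalone (made-up ids, not the taxi data), compare the inner join, which materialises every matching pair before the &lt;code&gt;DISTINCT&lt;/code&gt;, with the &lt;code&gt;EXISTS&lt;/code&gt; form, which only has to check that a match exists:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Toy data: yellow has id 1 three times, green has id 1 twice.
WITH yellow AS (SELECT id FROM UNNEST([1, 1, 1, 2]) AS id),
     green  AS (SELECT id FROM UNNEST([1, 1, 2, 3]) AS id)

-- Inner join version: produces 3*2 + 1*1 = 7 joined rows before the DISTINCT.
SELECT DISTINCT yellow.id
FROM yellow
JOIN green ON yellow.id = green.id

-- Semi join version (same result, no row multiplication); reuse the WITH clause above:
-- SELECT DISTINCT yellow.id
-- FROM yellow
-- WHERE EXISTS (SELECT 1 FROM green WHERE green.id = yellow.id)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;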

&lt;p&gt;How do we rewrite the previous query to make BigQuery use a semi join?&lt;/p&gt;

&lt;p&gt;For this specific query we have (at least) three options, which all make use of the &lt;code&gt;SEMI HASH JOIN&lt;/code&gt; in the query plan:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: use &lt;code&gt;EXISTS&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE EXISTS (
    SELECT 1
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
    WHERE green.pickup_location_id = yellow.pickup_location_id
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2: use &lt;code&gt;IN&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE yellow.pickup_location_id IN (
    SELECT DISTINCT green.pickup_location_id
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 3: use &lt;code&gt;INTERSECT DISTINCT&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
INTERSECT DISTINCT 
SELECT DISTINCT green.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Comparison
&lt;/h3&gt;

&lt;p&gt;The three options all return the same result as the original query, and they all produce a query plan with a semi hash join, with much better performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total elapsed time&lt;/strong&gt; : 1sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slot time consumed&lt;/strong&gt; : 40sec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While they generate a similarly shaped plan, and in this case produce the same result, they are not the same from a logical point of view.&lt;/p&gt;

&lt;h4&gt;
  
  
  EXISTS
&lt;/h4&gt;

&lt;p&gt;Option 1 (&lt;code&gt;EXISTS&lt;/code&gt;) is the most flexible, because it lets you write multiple predicates in the WHERE clause. So you can for example write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE EXISTS (
    SELECT 1
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
    WHERE green.pickup_location_id = yellow.pickup_location_id
    -- add another predicate
    AND green.dropoff_location_id = yellow.dropoff_location_id
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This would still do a &lt;code&gt;SEMI HASH JOIN&lt;/code&gt;, but now we also filter the rows so that the result returns locations where both the pickup and dropoff was the same.&lt;/p&gt;

&lt;p&gt;From a performance point of view, even if this is fast, this still scans &lt;code&gt;83,869,625&lt;/code&gt; rows in the join phase.&lt;/p&gt;

&lt;p&gt;If we want to reduce the number of rows scanned, in this case we can do it with a WITH statement, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH green AS (
  SELECT DISTINCT pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022`
)
SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE EXISTS (
    SELECT 1
    FROM green
    WHERE green.pickup_location_id = yellow.pickup_location_id
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query usually performs a bit faster (for me it's around 900ms), uses slightly fewer slot seconds (for me about 30s), and scans fewer rows in the join phase (now &lt;code&gt;36,272,535&lt;/code&gt;). The main difference is that we do a &lt;code&gt;DISTINCT&lt;/code&gt; before joining.&lt;/p&gt;

&lt;p&gt;To reduce the join to a minimum, we can also do one more step and do a distinct on yellow too, and we get&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH green AS (
  SELECT DISTINCT pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022`
),
yellow AS (
  SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
)
SELECT DISTINCT yellow.pickup_location_id
  FROM yellow
  WHERE EXISTS (
    SELECT 1
    FROM green
    WHERE green.pickup_location_id = yellow.pickup_location_id
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the performance is even better (for me around 800ms), it uses even fewer slot seconds (about 20s), and more deterministically we can say that it only scans &lt;code&gt;16,813&lt;/code&gt; rows in the join phase.&lt;/p&gt;

&lt;h4&gt;
  
  
  IN
&lt;/h4&gt;

&lt;p&gt;With &lt;code&gt;IN&lt;/code&gt; we are a bit more constrained (unless we do ugly string concatenation things), as we can really only compare one element per statement.&lt;/p&gt;

&lt;p&gt;So to express a query where we want &lt;code&gt;pickup_location_id&lt;/code&gt; and &lt;code&gt;dropoff_location_id&lt;/code&gt; to be the same you'd have to write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE yellow.pickup_location_id IN (
    SELECT DISTINCT green.pickup_location_id
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
  )
  AND yellow.dropoff_location_id IN (
    SELECT DISTINCT green.dropoff_location_id
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this uses two &lt;code&gt;IN&lt;/code&gt; subqueries, the &lt;code&gt;JOIN&lt;/code&gt; step in the query plan does two joins!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uwxtxc2l1b4d75s5jrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uwxtxc2l1b4d75s5jrh.png" alt="Join Detail, showing that it does two joins" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;IN&lt;/code&gt; version is more performant than the original &lt;code&gt;EXISTS&lt;/code&gt; query written above, and is more comparable to the &lt;code&gt;EXISTS&lt;/code&gt; query where we use &lt;code&gt;WITH green AS...&lt;/code&gt; to do the select distinct. In the &lt;code&gt;IN&lt;/code&gt; case we also scan &lt;code&gt;36,272,535&lt;/code&gt; rows (like in the first improved exists).&lt;/p&gt;

&lt;p&gt;We can reach a comparable performance to the second improved exists (the one with &lt;code&gt;WITH green AS..., yellow AS...&lt;/code&gt;) if we do&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH yellow AS (
  SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
)
SELECT DISTINCT yellow.pickup_location_id
FROM yellow
WHERE yellow.pickup_location_id IN (
  SELECT DISTINCT green.dropoff_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here too, we scan &lt;code&gt;16,819&lt;/code&gt; rows in the JOIN.&lt;/p&gt;

&lt;p&gt;There is one more caveat with &lt;code&gt;IN&lt;/code&gt;: when negated, &lt;code&gt;NOT IN&lt;/code&gt; and &lt;code&gt;NOT EXISTS&lt;/code&gt; don't always produce the same result! See &lt;a href="https://alessandromarrella.com/posts/in-vs-exists/" rel="noopener noreferrer"&gt;NOT IN and NOT EXISTS don't always produce the same result&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  INTERSECT DISTINCT
&lt;/h4&gt;

&lt;p&gt;I'll admit it: I crafted the query so that &lt;code&gt;INTERSECT DISTINCT&lt;/code&gt; would make the cut as well, as I particularly like its syntax.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;INTERSECT DISTINCT&lt;/code&gt; in our case produces exactly the same result as &lt;code&gt;IN&lt;/code&gt; and &lt;code&gt;EXISTS&lt;/code&gt;, but has a different limitation: the columns selected need to be the same in both tables AND the result needs to be distinct (BigQuery has no &lt;code&gt;INTERSECT ALL&lt;/code&gt;).&lt;/p&gt;
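&lt;p&gt;For reference, a minimal sketch of the pickup-location check written with &lt;code&gt;INTERSECT DISTINCT&lt;/code&gt; (mirroring the &lt;code&gt;EXISTS&lt;/code&gt; examples above) looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Sketch: the same pickup-location semi-join as the EXISTS examples above
SELECT pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
INTERSECT DISTINCT
SELECT pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022`

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;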

&lt;p&gt;From a performance profile, this immediately produces the most efficient result, scanning only &lt;code&gt;16,819&lt;/code&gt; rows in the JOIN (which again, is a &lt;code&gt;SEMI HASH JOIN&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In conclusion, we saw that &lt;code&gt;SEMI HASH JOIN&lt;/code&gt; can be a powerful optimization, especially when dealing with many-to-many relationships, and can be achieved in several ways. Each method has its strengths and limitations, but in the end the main optimization is the impact on the query plan of moving from an &lt;code&gt;INNER&lt;/code&gt; to a &lt;code&gt;SEMI&lt;/code&gt; join.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>bigquery</category>
      <category>gcp</category>
    </item>
    <item>
      <title>NOT IN and NOT EXISTS don't always produce the same result</title>
      <dc:creator>Alessandro Marrella</dc:creator>
      <pubDate>Sat, 01 Jun 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/amarrella/not-in-and-not-exists-dont-always-produce-the-same-result-3ih2</link>
      <guid>https://dev.to/amarrella/not-in-and-not-exists-dont-always-produce-the-same-result-3ih2</guid>
      <description>&lt;p&gt;&lt;code&gt;IN&lt;/code&gt; and &lt;code&gt;EXISTS&lt;/code&gt; often produce the same result, but when negated and dealing with &lt;code&gt;NULL&lt;/code&gt; values, they behave differently.&lt;/p&gt;

&lt;p&gt;Let's see an example, assume we have these two tables:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;orders&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;50.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;75.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;30.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;20.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;customers&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;Martha&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;John&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Ignore the fact that a well-formed customers table would always have the id specified, ideally enforced by a constraint. This is just an example.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s say we want to find orders where the customer_id is not present in the customers table.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;NOT EXISTS&lt;/code&gt; we would write something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.customer_id = o.customer_id);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output would probably be what we expect:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;20.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Using &lt;code&gt;NOT IN&lt;/code&gt; we would write something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM orders
WHERE customer_id NOT IN (SELECT customer_id FROM customers);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output here will be empty!&lt;/p&gt;

&lt;p&gt;When we use &lt;code&gt;NOT IN&lt;/code&gt;, SQL checks each value in the orders table against the list of values returned by the subquery. If any value in the subquery result is &lt;code&gt;NULL&lt;/code&gt;, the &lt;code&gt;NOT IN&lt;/code&gt; condition can never be &lt;code&gt;TRUE&lt;/code&gt;: rows whose value matches an entry in the list evaluate to &lt;code&gt;FALSE&lt;/code&gt;, and every other row evaluates to &lt;code&gt;NULL&lt;/code&gt;. This is because any comparison with &lt;code&gt;NULL&lt;/code&gt; yields &lt;code&gt;NULL&lt;/code&gt;, and &lt;code&gt;NOT IN&lt;/code&gt; needs &lt;strong&gt;all comparisons&lt;/strong&gt; to be &lt;code&gt;TRUE&lt;/code&gt; for a row to be included in the result.&lt;/p&gt;
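&lt;p&gt;You can see this in isolation with a tiny standalone query (independent of the tables above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- 1 is not equal to 2, but comparing 1 with NULL yields NULL,
-- so the whole NOT IN expression is NULL rather than TRUE and the row is filtered out
SELECT 1 NOT IN (2, NULL) AS is_missing;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;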

&lt;p&gt;This is not a problem with &lt;code&gt;NOT EXISTS&lt;/code&gt;, because the &lt;code&gt;NOT EXISTS&lt;/code&gt; clause checks for the non-existence of rows that meet the criteria specified in the subquery. It does not perform direct comparisons with &lt;code&gt;NULL&lt;/code&gt; in the same way &lt;code&gt;NOT IN&lt;/code&gt; does. Instead, it simply checks if there are any rows that match the condition. If no such rows exist, the condition is true.&lt;/p&gt;
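&lt;p&gt;For example, take order 4, where &lt;code&gt;customer_id&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;: the correlated subquery effectively becomes the query below, returns no rows, and &lt;code&gt;NOT EXISTS&lt;/code&gt; is therefore &lt;code&gt;TRUE&lt;/code&gt; for that row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- What the subquery boils down to for order 4 (o.customer_id is NULL)
SELECT 1 FROM customers c WHERE c.customer_id = NULL;
-- c.customer_id = NULL is never TRUE, so no rows are returned
-- and NOT EXISTS evaluates to TRUE, which is why order 4 appears in the result

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;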

</description>
      <category>sql</category>
      <category>bigquery</category>
      <category>gcp</category>
    </item>
    <item>
      <title>How to run and serve a webserver in Google Colab without ngrok</title>
      <dc:creator>Alessandro Marrella</dc:creator>
      <pubDate>Mon, 20 May 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/amarrella/how-to-run-and-serve-a-webserver-in-google-colab-without-ngrok-5a8a</link>
      <guid>https://dev.to/amarrella/how-to-run-and-serve-a-webserver-in-google-colab-without-ngrok-5a8a</guid>
      <description>&lt;p&gt;Today I learned how to run a webserver in Google Colab, without needing external services like ngrok. I'm using Dagster here as an example but any webserver should work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you just want to create a public url for a webserver running in colab you can jump to step 4&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install the Required Packages
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install dagster dagster-webserver

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Scaffold a New Dagster Project
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!dagster project scaffold --name test

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Run Dagster in the Background
&lt;/h2&gt;

&lt;p&gt;To run Dagster's webserver, you'll need to start it in the background. This can be done using Python's &lt;code&gt;subprocess&lt;/code&gt; module. The following code navigates to the project directory and starts the webserver on port 3000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import subprocess

subprocess.Popen(
    [
        "bash",
        "-c",
        "cd /content/test &amp;amp;&amp;amp; dagster-webserver -h 0.0.0.0 -p 3000 &amp;amp;"
    ]
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Create a Public URL for the Webserver
&lt;/h2&gt;

&lt;p&gt;Google Colab provides a way to create a public URL for your webserver. Use the &lt;code&gt;output.serve_kernel_port_as_window&lt;/code&gt; function to expose the webserver running on port 3000.&lt;/p&gt;

&lt;p&gt;This does the real magic. Note that the URL is authenticated, and only the user running the notebook has access to the webserver.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.colab import output
output.serve_kernel_port_as_window(3000, path='/')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also run it in an iframe with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output.serve_kernel_port_as_iframe(3000)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;You can check out a full example in this &lt;a href="https://colab.research.google.com/drive/1SpHIlpWCeZ9nstRPUAvXLpOcTwt-P-Q9?usp=sharing" rel="noopener noreferrer"&gt;Colab Notebook&lt;/a&gt;&lt;/p&gt;

</description>
      <category>todayilearned</category>
    </item>
  </channel>
</rss>
