DEV Community: Chirag (Srce Cde)

Reciprocal Rank Fusion (RRF): how it works and when to skip it

Chirag (Srce Cde) — Sun, 12 Jul 2026 00:12:51 +0000

Hybrid search is commonly used in the search systems which runs several retrievers for a given same query (For ex: a keyword retriever, a dense embedding retriever) and each retriever returns its own top-k ranked list of documents. User cares about a single list and not lists of list from different retrievers, so you have to merge those lists into a single list. The most common way to merge the ranked results from different retrievers is to use Reciprocal Rank Fusion.

In this post I'll walk through three things -

RRF equation and what it actually computes
Why RRF works smoothly across different retrievers but averaging the scores across retrievers fail
Tradeoff when RRF is the wrong tool

Why averaging scores across retrievers do not work

Suppose there are two retrievers. Each retriever returns the top-k matching document list along with the score. So to merge the results into a single list, why not average their two scores and sort by average?

Picture the retrievers as judges scoring the contestants.

Judge 1: BM25
Judge 2: Dense retriever

BM25 scores can range from 0 upto tens with no fixed ceiling and the exact number depends on term frequency and corpus statistics. Dense retriever scores using the cosine similarity which is bounded between -1 and 1. Both the retrievers has very different range in terms of scores. For example - For a single document say BM25 assigned the score of 32 and Dense retriever assigned it 0.9 (quality match) and the average becomes (32 + 0.9)/2 = 16.45. 16.45 is almost half of 32 and it means that the judge 2 (dense retriever) score barely moved the result. It just shifted the average by less than half a point 0.45. The larger score (BM25 - 32) decided and dominated the result.

Point to note here is that whenever you average two quantities whose ranges differ by an order of magnitude, the one with the wider range dominates the sum, so naive averaging doesn't blend two retrievers at all. It quitely picks whichever retriever emits the bigger raw numbers and lets it win not because that retriever is right more often, but because of the scale its scores happen to live on.

The fix here is to put the two judges on equal scale and RRF does it with a very simple move. It throws the scores away entirely and keeps only each retrievers ranking.

RRF equation explained

We have three documents (A, B, C) and two retrievers that ranked them like this (deliberately skipping the score (distance, similarity) here):

BM25: A, B, C
Dense: B, C, A

Each retriever here represents one of our judges and each document is a contestant. Every judge has handed in a ranking placing the contestants first, second, third and our job is to turn two rankings into one final standing using RRF. RRF says for each document, go to every judge, look at the place/rank it gave that document and turn that place into 1 / (60 + rank/place). Then add those numbers up across both judges. The 60 is a constant I'll explain shortly.

Let's calculate for document B first. We find B on each retrievers ranking and read off its place/rank:

BM25 placed B second - 1 / (60 + 2) = 1/62 = 0.0161
Dense placed B first - 1 / (60 + 1) = 1/61 = 0.0164

Add them: 0.0161 + 0.0164 = 0.0325. That is B's fused score. Now the same for A and C:

A was placed first, third - 1/61 + 1/63 = 0.0164 + 0.0159 = 0.0323
C was placed third, second - 1/63 + 1/62 = 0.0159 + 0.0161 = 0.0320

So the final standing is B (0.0325), then A (0.0323), then C (0.0320). And B wins here.

Why B wins? B was ranked near the top by both judges, a first and a second. A was ranked first by BM25 but last by the dense retriever. C never did better than second. RRF rewards the contestant both judges rank near the top over the one a single judge loves and the other ranks low. So the broad agreement beats one strong opinion.

Now that we've worked out the numbers, here is the same rule written as a formula:

RRF_score(d) = Σ_r  1 / (k + rank_r(d))

Reading it, term by term: d is the document we're scoring. Σ_r means we sum over every retriever r (our two judges). rank_r(d) is the rank retriever r gave document d, counting from 1 at the top. And k is that constant, set to 60 in our example. So the formula says exactly what we just did by hand: turn each retriever's rank into 1 / (k + rank), then add across retrievers.

Two choices in that formula are deliberate, and both are worth understanding.

First is the reciprocal, the 1 / rank shape. We want better rank to worth more, so you take one over the rank, which makes 1/1 > 1/2 > 1/3. Rank 1 beats rank 2 beats rank 3, and the reward keeps shrinking the further down the list you go. That is the "reciprocal" in reciprocal rank.

Second is the + k. Its job is to soften how much the very top of each list dominates. Consider what happens without it: rank 1 would score 1/1 = 1.0 and rank 2 would score 1/2 = 0.5, a two-fold drop between first and second. A single retriever putting a document first would then almost decide the whole result. With k = 60, rank 1 scores 1/61 ≈ 0.0164 and rank 2 scores 1/62 ≈ 0.0161, nearly the same. A large k makes the judges generous: being in the top few matters, but the exact place barely does. A small k makes them strict: first place towers over everything below. k is the knob for that behaviour, and 60 is the value from the original 2009 paper, a reasonable default and not a law. If you want RRF to trust a confident #1 more, you lower k.

Why RRF is scale-invariant

Let's look at the example where BM25 assigned 32 as score & dense retriever assigned 0.91 along with the RRF equation. Within RRF equation there is no variable which takes input as retrievers score. It only looks at the rank of each document within the retrievers output. Rank doesn't care what scale the scores were on. If you multiply every one of BM25's scores by a thousand, take the log or add a constant, the documents will still come out in the same order, so their ranks don't change, so their RRF contributions don't change. Any transformation that preserves the ordering leaves RRF untouched. The scale is discarded the moment we read off the places/ranks and that's what makes RRF robust in the ways people like.

A retriever whose scores live on a wild scale can't overpower the others, because its magnitudes never enter the sum. The scale-mismatch problem from earlier (averaging) simply doesn't arise.
A retriever that returns broken scores (miscalibrated, all negative, or NaN where it failed) can't poison the result through those numbers as long as it produces a sensible ordering, RRF can use it, because the scores are ignored.
You don't need to normalize anything, tune relative weights, or know each retriever's score distribution. The ranking is already a common currency across all of them.
Averaging fails because it trusts scores that aren't comparable; RRF sidesteps the problem by not trusting any score at all, and reading only the order.

But look at the last point - Not trusting any score also means discarding everything the scores were telling you beyond their order and that is where RRF starts to cost you.

The tradeoff: what scale invariance throws away

RRF gets its scale invariance by discarding the scores and the margins between scores and it can be a fine trade but sometimes the margin was the whole point.

Each judge scores the contestants and ranks them by those scores. The margin lives there: how far ahead the winner actually was. RRF never looks at the scorecards. It keeps only the placing the scores produced, 1st, 2nd, 3rd. A dominant win and a near-tie come out as the same placing, so the margin on the scorecards is gone. If that margin was what set a strong match apart from a weak one, RRF has already discarded it. Let's try to understand it with an example.

Suppose the dense retriever returns first document with 0.95 as cosine similarity and every other match is scored at 0.30. That gap is the retriever telling you, that the first document is the answer and the rest of the matches are noise and RRF hears none of it. RRF only records rank 1, rank 2, rank 3 evenly spaced, and the retriever's confidence (which lived entirely in the size of that gap) is gone. So RRF is blind to by how much one document beat another.

This blindness shows up in three practical places.

When one retriever is much stronger than the other - Plain RRF gives every retriever an equal vote. If your dense retriever is genuinely good and BM25 is just along for the ride, a weak ranker still gets an equal say and can pull the result down. A 2025 study, Balancing the Blend, documents exactly this and calls it the weakest-link effect: a single weak retrieval path measurably drags the fused result down, so it pays to check each path's quality before fusing rather than tipping everything into one equal-weight pot. There is a partial fix worth knowing: weighted RRF attaches a weight to each retriever's term, so you can make the dense ranker count more than BM25. But that only tunes how much each retriever counts. It still can't express how much one document beat another inside a ranker, which is the margin we lost above and can't get back.

When you need to know whether anything is actually good - Recall the three fused scores: 0.0325, 0.0323, 0.0320. You can put a threshold on numbers like these, but it won't mean what you want: the score reflects how highly the retrievers ranked a document, not how relevant it is, so there's no value you can use to define "this match is good." The moment your system needs to decide whether any document is good enough (to abstain and return nothing, to gate a RAG answer, to drop weak matches before they reach an LLM), RRF has already discarded the quantity that decision depends on.

When you have very few candidates or many ties - With a short list or lots of tied ranks, the 1 / (k + rank) curve is coarse, k dominates the arithmetic, and the fused ordering turns mushy, while k quietly becomes a number you're tuning with nothing to guide you.

⚠️ Important Note

An RRF score is not a relevance score. Its size is set by k and the number of retrievers, not by how good any match is. With two retrievers and k = 60, a document ranked near the top by both scores close to the 2/60 ≈ 0.033 ceiling whether it is a perfect match or a mediocre one. So the score encodes consensus rank, not relevance and RRF cannot support abstention, confidence gating, or any decision that depends on how good the top result actually is. If your system's job includes saying "I don't have a good answer for this," RRF has thrown away the only thing that answer depends on. This is the single most important limitation of the method.

What research says

There's a common piece of advice that RRF is the sophisticated choice and score-based fusion is a naive trap and this framing is too simple.

The research backs this up, and it cuts against the usual advice that RRF is the sophisticated choice. In the most careful comparison (Bruch et al.), a tuned convex combination of scores beats RRF on the standard BM25-plus-dense benchmark, across all nine datasets they test. Two things fall out of that work worth carrying: RRF's k is a real parameter, not a free lunch, and the value that helps in-domain doesn't transfer to new data, so "no tuning needed" mostly means the parameter was left at 60. And the usual objection to score fusion, that picking a normalizer is fragile, mostly isn't true: reasonable choices come out equivalent once you tune the weight.

It's worth knowing the origin of RRF because it tells us when RRF is still right. RRF was built for metasearch, combining ranked lists from systems whose scores you genuinely can't see, trust or compare. That's the setting it was made for and it's a good default there.

When to use RRF, and when to skip it

Putting it together:

Averaging the raw scores: broken by the scale problem we started with. Don't.
RRF: puts every retriever on equal footing by keeping only ranks. Its robust and needs no configuration, and clearly better than averaging.
A tuned, normalized convex combination: puts retrievers on a common scale while keeping the magnitudes, and beat RRF on all nine datasets in Bruch's main BM25-plus-dense comparison (though not in every one of their other model pairings), as long as you can tune the weight. For two retrievers that weight is a single number α; for more it is one weight per retriever, summing to one.

The choice between the second and third isn't about which is more advanced. It's about your situation.

Reach for RRF when you can't see or trust the scores, the retrievers sit on genuinely incomparable scales, or you have no labeled queries to tune with. Probably a cold start, a zero-shot setup, a grab bag of black-box retrievers that only hand you ranks. It's also a fine first baseline.

There is a standing operational argument for RRF even when you could tune. RRF has nothing to maintain. A convex combination's weight is fit to a snapshot of your data, and it drifts as the corpus, the embedding model, or the score distributions change, so it needs re-tuning and an evaluation harness to catch when it has gone stale. RRF has no such knob to go stale. That robustness is a real feature, not a consolation prize for the cold-start case.

Skip RRF when any of these is true:

You have even a few dozen labeled queries to tune on. With two retrievers, tuning α is a one-dimensional search over a single number between 0 and 1, so it takes very little data (Bruch found under 5% of a training set was enough) and with more retrievers it is a few more weights, still cheap. Once the weights are roughly right, the convex combination is ahead.
You have a retriever whose scores are calibrated, so the margins carry real signal. This is the real crux of the keep the magnitudes argument: it only helps if the magnitudes mean something down the line. An uncalibrated dense model whose scores are a scale artifact gives you margins that are noise, and discarding them, which is exactly what RRF does, is then the right move rather than a loss.
Anything downstream needs absolute relevance (abstention, thresholding, confidence gating), which, as we saw, RRF structurally cannot provide.

The short version: RRF when you can't tune or trust the scores; a normalized, tuned convex combination the moment you can.

Two approaches in code

To make the difference concrete, here are both, small enough to read in full.

RRF first. Notice that the retrievers scores never appear and the function only ever looks at each document's position in each list:

from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists into one. Each list is doc ids, best first.
    Only rank position is used; scores never enter."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

For each ranked list, it walks the documents in order, turns each document's position into 1 / (k + rank), and adds that onto the document's running total. That is the whole method.

Now the alternative that keeps the magnitudes. First a small helper to min-max normalize a retriever's scores into [0, 1], then a weighted blend:

def minmax(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def convex_fuse(dense: dict[str, float], bm25: dict[str, float],
                alpha: float = 0.8) -> list[tuple[str, float]]:
    """alpha weights the dense retriever; 1 - alpha weights BM25.
    alpha is the single knob, tunable on a few dozen labeled queries."""
    d, b = minmax(dense), minmax(bm25)
    docs = set(d) | set(b)
    fused = {x: alpha * d.get(x, 0.0) + (1 - alpha) * b.get(x, 0.0) for x in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

The alpha here is the single knob (how much to trust the dense retriever versus BM25), and it's the one number you'd tune on a labeled set.

The thing to notice is what survives to the output. In rrf, the input scores are never read, so the result can tell you an order but nothing about how good any match is. In convex_fuse, the normalized scores flow through into the final number, so that number still carries a notion of relevance you can threshold on. That difference (magnitude discarded versus magnitude kept) is the entire tradeoff, in two functions.

Every number here is from published benchmarks. Which one wins on your data is a different question, so run both on your own queries and measure before you commit.

Final takeaway: RRF works by keeping ranks and throwing away magnitudes. Everything else about it, good and bad, comes from that.

Thank you for reading!

References & Further Reading

What is a Vector Space in Machine Learning? (With Math and Intuition)

Chirag (Srce Cde) — Thu, 08 Jan 2026 01:19:51 +0000

In machine learning, data is often represented in vector spaces so that mathematical operations such as combination and scaling are possible.

Field: the numbers to compute with

Let’s first understand what a field is because vector space is defined over a field.

Field is defined as a set of elements where it guarantees 4 basic arithmetic operations: addition, subtraction, multiplication and division (by non-zero elements). The result of performing these operations over the elements of the set must remain within the same set — this property is called a closure.

In addition, a field must satisfy certain rules (called axioms) such as commutativity, associativity, distributivity, and the existence of additive and multiplicative identities and inverses (for all non-zero elements). Intuitively, a field allows us to perform “normal arithmetic”.

The sets of Real numbers (R), rational numbers (Q), complex numbers (C) are fields because they satisfy all the above mentioned properties. Integers (Z), however, are not a field because division does not remain within the set. For example 2÷3 is not an integer as it results in a fraction. As we know fractions are not part of integers set and for that reason integers are not closed under division. Integers form a ring which guarantees addition, subtraction, and multiplication, but not division.

You can imagine a field as a kitchen where you have all the tools needed to cook anything (except dividing by zero).

A field F is the set with two operations (+,⋅) such that (F,+) is an abelian group, (F∖{0},⋅) is an abelian group and ∀a,b,c∈F, a⋅(b+c)=a⋅b+a⋅c

Space: a set with a structure

Next, let’s understand what space is.

Space is defined as a set of elements together with specific structure. Structure is nothing but the set of rules which depicts what operations are allowed on the elements and what properties those operations must hold. Different structures give rise to different types of spaces.

Intuitively, we think of a vector as coordinates, arrows or data representation. However, formally a vector is an element of the vector space. Therefore, without defining a vector space first, a vector cannot be precisely defined.

Vector space: environment for vector

Vector space is a structure defined over the field (usually the real or complex numbers). It consists of a set of elements together with two operations:

vector addition
scalar multiplication These operations must be closed within the set and must satisfy specific addition and scalar axioms. A vector space must also contain a zero vector, which represents the absence of any contribution.

You can imagine a vector space as a spice rack in your kitchen, and the structure as the rules that define which spices can be mixed, how they can be combined, and what counts as a valid mixture.

A vector space V over a field F is the set equipped with operations + : V×V→V and ⋅ : F×V→V

For all u,v∈V and α∈F , we have u+v∈V , αv∈V , u+0=u , and α(u+v)=αu+αv

Vector: an element of the vector space

So what is vector? Vector is an element of the vector space. In many common cases (like R^n), they can be represented as ordered lists of field elements. Vectors have many interpretations like

In physics, they represent direction and magnitude
In mathematics, they represent coordinates
In machine learning, they represent data or information

You can imagine a vector as a particular combination of ingredients chosen from the spice rack.

Let v∈R^n be a vector, where v=(v1,v2,v3,…,vn) - This is just a representation and not the definition.

Why vector space matters in ML?

It matters because vector spaces allow data to be

added (combine information)
scaled by normalization or weighting, distances and angles can be defined (we will cover this in a future blog post)
Many machine learning algorithms rely on these properties. If data does not form a vector space, some ML techniques fail or require special handling.

Vector space is a mathematical environment that makes ML possible.

What is NOT a vector space?

Not everything that looks like a collection of elements forms a vector space. For example -

Categorical values
Example: {black, white, blue}

These do not form the vector space because adding and scaling the categorical values are not meaningful and does not make sense. These values require encoding before being used in ML.

Probability distribution
Example: [0.2, 0.5, 0.3]

List of probabilities that sums to 1.0. Adding two probability vectors does not preserve the sum to 1 and scaling by a negative scalar will result in negative probability which does not make sense.

Sets

Example: { black, white, blue}

Sets do not support vector addition, scalar multiplication and operations. Operations like union, intersections are not equivalent to vector operations.

So far the vector space that we have discussed only supports the addition and scalar multiplication operations. The length of a vector cannot be measured because there is no notion of norm.

Bringing it all together: the kitchen analogy

Think of the entire setup like a kitchen.

Field is the kitchen with all the basic tools. It guarantees that you can perform the fundamental operations — add, subtract, multiply, and divide (except by zero) — and that everything behaves consistently. Without a field, you don’t even have reliable arithmetic to work with.

Space is the organization and rules of the kitchen. It defines what kind of kitchen this is and what operations are allowed. Different rules give you different kitchens — just as different structures give you different kinds of spaces.

Vector space is a specific kitchen setup where mixing and scaling ingredients is allowed and always makes sense. It guarantees that you can combine ingredients (vector addition), scale recipes up or down (scalar multiplication), and always stay within the same kitchen. There is also a “zero recipe” — doing nothing at all.

Vector is a particular recipe or dish made using the ingredients and rules of that kitchen. It’s a concrete instance — a specific combination of ingredients that follows all the rules of the vector space.

In machine learning, raw data is rarely a ready-made recipe. Instead, we transform data so it can live in a well-defined kitchen — a vector space — where combining, scaling, and learning all make sense.

A note on scope

So far, the vector spaces discussed here only support addition and scalar multiplication. There is no notion of vector length, distance, or angle yet. Those concepts require additional structure (such as norms and inner products), which will be introduced in a future post.

The animation below builds intuition for vector spaces using a simple kitchen analogy, complementing the math discussed in this article.

Thanks for reading. Vector spaces are foundational in machine learning, and intuition goes a long way in understanding them.

Amazon S3 Vector Buckets Hands on - Similar Product Search Using: Image and Text Retrieval with RRF Fusion

Chirag (Srce Cde) — Sun, 03 Aug 2025 14:24:08 +0000

In today’s data-driven landscape, business increasingly rely on the power of similarity, contextual search functionalities to enable features like visual search, product recommendations, semantic text retrieval and more. However, setting up the infrastructure to support scalable, low-latency vector search often requires specialized tools, dedicated vector databases, complex indexing strategy and deployments. This might create barriers for teams that want to experiment or integrate vector search search into their existing pipelines quickly. Recently, AWS announced Amazon S3 Vector Buckets (in preview) which address this challenge by bringing vector storage and similarity search directly into Amazon S3 — turning a familiar, scalable object store into a lightweight vector database.

Amazon S3 Vector Buckets enable scalable storage and efficient retrieval of vector data. Key features include no provision of additional infrastructure, scalable (grow indexes with data), cost effective, simplified API integration, and low latency queries.

In this blog, we’ll walk through a lightweight implementation of a product search workflow using Amazon S3 Vector Buckets — all within a Jupyter notebook. We’ll explore how to enable both text-based and image-based search, and demonstrate a simple approach to multimodal search by combining results using Reciprocal Rank Fusion (RRF).

The core of this workflow are vectors; numerical representations of data like text or images. We’ll use Amazon S3 Vector Buckets to create vector indexes, add vectors to those indexes, and perform similarity searches to find the most relevant matches. But before diving into implementation, let’s take a moment to understand what a vector is at a high level.

What is a Vector?

Vector is a numerical representation of the something — in our case, data like images, text, or audio. These numbers are mapped into a high-dimensional space so that similar data ends up close together. This numeric representation enables similarity measurements through mathematical metrics like cosine similarity or Euclidean distance, facilitating accurate and efficient data retrieval.

Simple Image Example: Pixels as Vectors

Consider below 3x3 grayscale image (left) and its pixel values (right). Each pixel value range from 0 to 255, representing brightness. These values can be flattened into a vector.

Just like this (1 x 9 dimensions), modern AI models take larger and more complex images and convert them into much higher-dimensional vectors (like 512 or 1024 dimensions). These vectors capture not just color or brightness — but complex patterns, shapes, and even semantic meaning.

Personalization use case: Product Search Demonstrated via Jupyter Notebook

The example is presented as a practical tutorial in a Jupyter notebook format, demonstrating individual text-based and image-based searches, and implements multimodal search by combining text and image results using Reciprocal Rank Fusion (RRF).

S3 Vector bucket creation

via AWS CLI (Optionally you can also specify the encryption configuration). Ensure to update the AWS CLI to latest version. Currently, I have 2.27.60 installed.

aws s3vectors create-vector-bucket --vector-bucket-name "media-vector-bucket"

via SDK (Boto3)

import boto3
s3vectors = boto3.client("s3vectors")
s3vclient = boto3.client('s3vectors')
response = s3vclient.create_vector_bucket(
    vectorBucketName='media-vector-bucket'
)

Vector index creation

In this example, we’ll create two separate vector indexes — one for images and another for text. This allows us to store and query image and text embeddings independently, while still leveraging the power of Amazon S3 Vectors for high-speed similarity search within each modality.

A vector index is the structure that organizes the added vector data which enables the fast similarity search.

Image index

via AWS CLI

aws s3vectors create-index \
  --vector-bucket-name "media-vector-bucket" \
  --index-name "img-index" \
  --data-type "float32" \
  --dimension 768 \
  --distance-metric "cosine"

via SDK (Boto3)

response = client.create_index(
    vectorBucketName='media-vector-bucket',
    indexName='img-index',
    dataType='float32',
    dimension=768,
    distanceMetric='cosine',
)

The command above creates an image index named img-index within the vector bucket media-vector-bucket. We set the dimension to 768, the data type to float32 (currently the only supported type), and the distance metric to cosine for similarity search (currently it supports Cosine & Euclidean).

The reason we specify 768 as the dimension is because the model we’ll use to generate image embeddings — google/vit-base-patch16–224-in21k — outputs vectors of 768 dimensions. This ensures that the vectors we generate will be compatible with the index we’ve defined.

Text index

via AWS CLI

aws s3vectors create-index \
  --vector-bucket-name "media-vector-bucket" \
  --index-name "txt-index" \
  --data-type "float32" \
  --dimension 384 \
  --distance-metric "cosine"

via SDK (Boto3)

response = client.create_index(
    vectorBucketName='media-vector-bucket',
    indexName='txt-index',
    dataType='float32',
    dimension=384,
    distanceMetric='cosine',
)

The reason we specify 384 as the dimension is because the model we’ll use to generate image embeddings — sentence-transformers/all-MiniLM-L6-v2 — outputs vectors of 384 dimensions.

Important Note:

Amazon S3 Vector Buckets currently support dimension values between 1 and 4096. Higher-dimensional vectors require more storage space, which may impact both performance and cost.

Data preparation

For this demo, we’ll use the Fashion Product Images Dataset available on Kaggle. To keep things lightweight and manageable within a notebook setup, we’ll randomly select 10,000 product images from the dataset to build our image search experience.

images.csv contains the filename and the image link. styles.csv contains the metadata of the image. Hence, merging both the dataframe.

import os
from pathlib import Path
import pandas as pd

base_data_path = f"{Path().resolve().parent}/data/"
storage_bucket_name = os.environ.get("STORAGE_BUCKET_NAME")

images_df = pd.read_csv(f"{base_data_path}/fashion-dataset/images.csv")
style_df = pd.read_csv(f"{base_data_path}/fashion-dataset/styles.csv", on_bad_lines='skip')

style_df["id"] = style_df["id"].astype(str)
images_df["id"] = images_df.filename.str.replace(".jpg", "")

merged_df = images_df.merge(style_df, left_on=["id"], right_on=["id"], how="left")
merged_df = merged_df[merged_df.link != "undefined"]
sampled_df = merged_df[~merged_df["productDisplayName"].isna()].sample(n=10000)

sampled_df.reset_index(drop=True, inplace=True)
sampled_df.fillna("", inplace=True)

Optionally, if the data is on s3 then add the s3_uri to sampled_df.

sampled_df["s3_uri"] = sampled_df["filename"].apply(lambda x: f"s3://{storage_bucket_name}/fashion-dataset/images/{x}")

From Embeddings to Index: Creating and Uploading Vectors

The next step is to define a helper function that takes in raw input (either an image or a product name) and returns its corresponding vector embedding. You’re free to use any model that suits your use case (as far as it satisfy the dimension & data type limitations) for generating these vectors — whether it’s a proprietary model or an open-source one.

For this example, we’ll use open-source models to generate embeddings:

For images, we’ll use a vision model like google/vit-base-patch16–224-in21k
For text, we’ll use a lightweight model like sentence-transformers/all-MiniLM-L6-v2

These models will help us convert product data into vector representations that we can store and search in Amazon S3 Vector Buckets.

from transformers import CLIPProcessor, CLIPModel, AutoImageProcessor, AutoModel
from sentence_transformers.SentenceTransformer import SentenceTransformer

iprocessor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
imodel = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")
tmodel = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def get_embeddings(inputs, mode):
    if mode == "image":
        img_inputs = iprocessor(images=inputs, return_tensors="pt")
        with torch.no_grad():
            outputs = imodel(**img_inputs)
            last_hidden_state = outputs.last_hidden_state[:, 0, :]  # shape: [1, 2048, 1, 1]
            features = last_hidden_state / last_hidden_state.norm(dim=1, keepdim=True)  # L2 normalize each row
            features = features.cpu().numpy().astype(np.float32)
            return [f.flatten().tolist() for f in features]
    if mode == "text":
        embeddings = tmodel.encode(inputs)
        return [f.tolist() for f in embeddings]

The get_embeddings is the helper method will accept the batch inputs along with the mode as image or text.

Important Note:

Ensure that the returned vector embedding is in the form of list and each element in the list is of type float32 or lower precision. The length of the vector embedding list should match the defined dimension during the vector creation. In case, if you pass the data with higher precision, S3 Vectors will converts the values to 32-bit floating point before storing them. And each vector should match the defined dimension.

Creating and Uploading Vectors

To efficiently process and upload large volumes of data, we define a function called create_and_upload_vectors. This function processes the input data in batches of 500 records.

For each batch, it:

Invokes the get_embeddings function to generate vector embeddings for both image and text data.
Attaches metadata such as product IDs or categories to each vector for future filtering.
Uploads the vectors to their respective indexes:
- Image embeddings → img-index
- Text embeddings → text-index

Batching ensures the upload process remains scalable and performant, especially when dealing with thousands of vectors.

def create_and_upload_vectors(df, vector_bucket_name, image_index_name, text_index_name, batch_size=500):
    for i in range(0, len(df), batch_size):
        image_vectors, text_vectors = [], []

        batch = df.iloc[i:i+batch_size]

        images_batch = [Image.open(f"{base_path}{f}").convert("RGB") for f in batch["filename"]]
        text_batch = batch["productDisplayName"].tolist()

        image_embeddings = get_embeddings(images_batch, mode="image")
        text_embeddings = get_embeddings(text_batch, mode="text")

        for ind, v in enumerate(batch.itertuples()):
            metadata = {"gender": v.gender, "category": v.masterCategory, "type": v.usage, "season": v.season, "productName": v.productDisplayName, "s3_uri": v.s3_uri}
            text_vectors.extend([{
                "key": f"{v.id}",
                "data": {"float32" : text_embeddings[ind]},
                "metadata": metadata}])
            image_vectors.extend([{
                "key": f"{v.id}",
                "data": {"float32" : image_embeddings[ind]},
                "metadata": metadata}])

        s3vectors.put_vectors(
            vectorBucketName=vector_bucket_name,   
            indexName=image_index_name,
            vectors=image_vectors
        )
        s3vectors.put_vectors(
            vectorBucketName=vector_bucket_name,   
            indexName=text_index_name,
            vectors=text_vectors
        )

# invocation
create_and_upload_vectors(sampled_df, vector_bucket_name, image_index_name, text_index_name)

Important Note:

Amazon S3 Vector Buckets currently allow a maximum of 500 vectors per PutVector invocation. This is why we process and upload the data in batches of 500 to stay within the API limits.

Querying & retrieving similar matches

To query the index, we define a helper function query_s3_vectors to query the index and retrieve top k results.

Important Note:

query_vectors can only retrieve and return up to a maximum of 30 matching results.

def query_s3_vectors(bucket_name, index_name, vector, k=5):
    response = s3vectors.query_vectors(
            vectorBucketName=bucket_name,
            indexName=index_name,
            queryVector={"float32": vector}, 
            topK=k, 
            returnDistance=True,
            returnMetadata=True
        )
    return response["vectors"]

Also, create the helper function to display the query and its respective results.

import textwrap
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def display_results(matches, query_text=None, query_image=None, is_rrf=False):
    max_cols = 7
    total = len(matches) + 1
    n_rows = math.ceil(total / max_cols)

    fig, axes = plt.subplots(n_rows, max_cols, figsize=(3 * max_cols, 5 * n_rows))
    axes = axes.flatten()

    query_path = os.path.join(f"{base_data_path}/fashion-dataset/test-images/{query_image}")
    if os.path.exists(query_path):
        img = mpimg.imread(query_path)
        axes[0].imshow(img)
        axes[0].axis("off")
        axes[0].set_title(f"Query Text:\n{textwrap.fill(str(query_text), width=25)}", fontsize=9)
    else:
        axes[0].text(0.5, 0.5, f"Query Text:\n{textwrap.fill(str(query_text), width=25)}", ha='center', va='center')
        axes[0].axis("off")

    for i, match in enumerate(matches):
        idx = i + 1
        match_path = os.path.join(f"{base_data_path}/fashion-dataset/images/{match['key']}.jpg")
        if os.path.exists(match_path):
            img = mpimg.imread(match_path)
            axes[idx].imshow(img)
            axes[idx].axis("off")
            product_name = match["metadata"].get("productName", "N/A")
            wrapped_name = textwrap.fill(product_name, width=20)
            if is_rrf:
                title = f"{match["key"]}\n{wrapped_name}\nrrf_score: {match["rrf_score"]:.4f}"
            else:
                title = f"{match["key"]}\n{wrapped_name}\nDist: {match["distance"]:.4f}"
            axes[idx].set_title(title, fontsize=8)
        else:
            axes[idx].text(0.5, 0.5, "Image not found", ha="center", va="center")
            axes[idx].axis("off")

    for i in range(total, len(axes)):
        axes[i].axis("off")

    plt.tight_layout()

Querying the index using an image vector.

query_image = "t-shirt.webp"
input_image = [Image.open(f"{base_data_path}/fashion-dataset/test-images/{query_image}").convert("RGB")]
i_emb = get_embeddings(input_image, mode="image")
image_match = query_s3_vectors(bucket_name=vector_bucket_name, vector=i_emb[0], k=10, index_name="img-index")
display_results(image_match, query_image=query_image)

Querying the index using the text.

input_text = ["mens green polo tshirt"]
t_emb = get_embeddings(input_text, mode="text")
text_match = query_s3_vectors(bucket_name=vector_bucket_name, vector=t_emb[0], k=10, index_name="txt-index")
display_results(text_match, query_text=input_text, query_image=None)

Multimodal Search with Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion (RRF) is a simple rank aggregation technique that combines results from multiple search systems by rewarding items that consistently appear near the top.

Combine individual search results using Reciprocal Rank Fusion:

Obtain separate ranked lists for text and image searches.
Apply RRF to merge rankings, enhancing overall search accuracy.

def rrf_merge(image_results, text_results, k=60, top_k=20):
    scores = defaultdict(float)
    rank_sources = {"image": image_results, "text": text_results}

    for source_name, result_list in rank_sources.items():
        for rank, item in enumerate(result_list):
            doc_id = item["key"]
            scores[doc_id] += 1 / (k + rank + 1)
    metadata = {item["key"]: item for item in image_results + text_results}

    fused = [
        {
            "key": doc_id,
            "rrf_score": round(score, 6),
            **metadata[doc_id]
        }
        for doc_id, score in sorted(scores.items(), key=lambda x: x[1], reverse=True)
    ]

    return fused[:top_k]

fused_results = rrf_merge(image_results=image_match, text_results=text_match, top_k=20)
display_results(fused_results, query_text=input_text, query_image=query_image, is_rrf=True)

Performance Considerations

Vector query results are returned under a second (<1000 ms), enabling fast lookups.
Dimension size and the number of vectors within the index may affect latency — larger datasets or high-dimensional vectors may increase query time.

Storage and Cost Considerations

Larger dimensions require more storage space, which can lead to increased cost.
A single index can store up to 50 million vectors, making it scalable for large workloads.
Vector dimensionality must be carefully managed to optimize overall costs.

Recall and Accuracy Considerations

When performing vector similarity queries, average recall performance is influenced by:

The quality and type of embedding model used to generate the vectors.
The size of the dataset, in both vector count and dimensionality.
The distribution and diversity of incoming queries.

Design Considerations

S3 Vectors is ideal for workloads that don’t require real-time results, such as batch processing or periodic search.
Partial updates to a vector or its associated metadata are not supported. Any update requires re-uploading the full vector entry.
No direct support for inherently multimodal search (e.g., combining image and text vectors). You’ll need to implement external fusion techniques like Reciprocal Rank Fusion (RRF).
No infrastructure setup is required — S3 Vectors is fully managed, so you can start storing and querying vectors immediately using API calls, without provisioning or managing servers.

Wrapping Up

Amazon S3 Vector Buckets make it easier than ever to implement vector search without managing a dedicated vector database. In this tutorial, we built a multimodal product search system combining image and text similarity.

By generating embeddings, creating separate indexes for image and text, and retrieving results using simple API calls, we demonstrated how to deliver a powerful and scalable search experience. We also used Reciprocal Rank Fusion (RRF) to combine results across modalities, showing how multimodal search can be layered on top with minimal effort.

S3 Vector Buckets are a great fit for:

Teams wanting to experiment with vector search quickly
Applications where query latency under 1 second is acceptable
Use cases like product recommendations, visual search, or content discovery
They may not suit real-time or high-frequency update scenarios, but they shine when ease of integration, scalability, and cost-efficiency matter most.

If you like to follow along with me step by step then you can refer to this video.

Want to try it yourself?

Download the notebook to get started.

Thank you for reading!

[Hands-On] AWS Lambda function URL with AWS IAM Authentication type

Chirag (Srce Cde) — Tue, 31 Oct 2023 23:11:41 +0000

In this article, I am going to cover how to secure the AWS lambda function URL using AWS_IAM auth followed by how authenticated IAM users can access the lambda function via function URL.

If you are not aware of what the AWS Lambda function URL is then please refer to my video on How to configure the AWS Lambda function URL.

What does AWS_IAM Authentication type mean when enabled for AWS Lambda function URL?

It means that only authenticated IAM users or roles can invoke the lambda function via the function URL. If they are not authenticated or do not have the necessary permissions then they will not be able to invoke or access the lambda function via URL and they will be greeted with an error message like Forbidden with status code 403.

Hands-On

Lambda Function

Navigate to Lambda Management Console and create the lambda function with the configuration as shown below.

Create lambda function

As a part of the Execution role, the first option will create a role with basic permissions which will allow the lambda function to create and write the logs to cloudwatch.

Expand the Advance Settings — to enable function URL along with AWS_IAM as auth type as shown in below screenshot and click on Create function. You can also enable the function URL after creating the function.

Enable function URL

IAM User

The next step would be to create an IAM user. I am creating the new IAM user just to demonstrate things end-to-end but you can also experiment with an existing IAM user.

Navigate to IAM Management Console → Click Users from left panel → Create User. Follow through the on-screen steps. Do not add/attach any permissions to that IAM user.

Now, if we were to access the lambda function via the function URL when the AWS_IAM Authentication type is enabled, we would require AWS security credentials. So let’s generate the access key & secret access key for the given IAM user.

Open the IAM user → Security credentials → Scroll down to Access keys → Create access key. As the next step, let’s try to invoke the lambda function via the function URL using the generated security credentials with Postman.

Configure security credentials

Open Postman → Copy & paste the lambda function URL. Under Authorization → Select AWS Signature → Fill the Access Key & Secret key values with IAM user credentials (Generated in the previous step). Under Advanced configuration, enter the appropriate region (in my case it’s us-east-1) and the service name will be lambda because we are accessing/invoking the lambda service. Finally, click on Send to invoke and it will greet you with 403 forbidden because the IAM user does not have permission to access the said lambda function via the function URL.

The next step would be to provide the permission. There are two ways to provide permission which is either via Identity-based policy or resource-based policy and I will show you both. The basic difference between Identity-based policy and resource-based policy is that identity-based policies are directly attached to IAM users, groups, or roles, and in this case, we will attach it to IAM user that we have created, whereas resource-based policies are directly attached to resources which defines who can access that resource and in our case the resource is Lambda function where we will define who can access this lambda function via function URL.

Ideally, to successfully invoke the function via URL, the said entity must have InvokeFunctionUrl permission.

Permission via Identity-based policy

Navigate to IAM Management Console → Policies (from left panel) → Create policy → Select JSON view. Copy & paste the below policy and create it. Make sure to replace the ARN with the ARN of your lambda function. Post policy creation, attach the policy to the IAM user.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunctionUrl"
            ],
            "Resource": [
                "arn:aws:lambda:your-region:your-acc-id:function:your-lambda-fun"
            ]
        }
    ]
}

The above policy says - to allow the InvokeFunctionUrl action on the particular lambda function that is defined as a part of the Resource to the IAM user or identity to which this policy will be attached.

As a next step, open Postman and invoke the function URL again. This time, it will return status code 200 along with “Hello from lambda!” as a response.

Permission via Resource-based policy

In this section, we will configure the resource-based policy. It is something that is attached to the resource (i.e. lambda function). As a first step, please remove the policy that you have attached to the IAM user.

Open the lambda function → Configurations → Permissions → Scroll down to Resource-based policy statements → Add permissions. Configure the policy as shown below.

Resource-based policy

Replace the Principal with the IAM user ARN → Save and test it again. You should be able to successfully invoke the lambda function via the function URL.

Generating & using temporary security credentials

We were able to invoke the lambda function via the function URL successfully using the IAM user security credentials (i.e. access and secret key), but the keys might get exposed and misused, which is a risk. So, the more promising way is to use temporary credentials which basically expire after a certain time. To do that, we are going to use AWS Security Token Service (AWS STS) to create and use temporary credentials and it is very simple to generate. Follow the below steps (Assuming AWS CLI is already installed).

Open terminal
Execute aws configure & configure the access key, access secret key & region
To generate temporary credentials, execute aws sts get-session-token --duration-seconds 900

The above command will generate the temporary credentials (looks like below), which will be valid for 900 seconds.

AWS temporary credentials

As a next step, open Postman. Replace the AccessKey & SecretKey with the new values. Also, paste the sessionToken in the relevant field under Advanced configuration.

Postman temporary credentials configuration

If you invoke the URL now, then you will be able to access the lambda function successfully with status code 200. After 15 minutes, these credentials will no longer be valid and need to be regenerated.

I hope you learned something new today. If you like to follow along with me step by step then you can refer to this video.

You might also like reading about how to Whitelist IP addresses for Lambda function URLs.

If you have any questions, comments, or feedback please leave them below. A reaction & a follow is appreciated :) Subscribe to my channel for more.

Pixel vs Path: Comparing Raster and Vector Images

Chirag (Srce Cde) — Thu, 26 Oct 2023 21:52:39 +0000

In Essence of Digital Images: Building blocks article, I covered the various aspects of digital images like building blocks of an image, color spaces, color encoding & its representation, compression, and more. Those topics mainly involved pixel-based images. However, in this article, I want to cover two primary categories of images.

Raster images
Vector images

Raster images

Raster images are also known as bitmap images. When we talk about images in general, implicitly we are referring to raster images most of the time. Raster images are made of small squares known as pixels - The smallest building block, arranged/represented in the setup of rows and columns resulting in a grid expressed as width x height (For ex. 1920x1080). In math terms, it is a matrix. Each pixel is assigned a numerical value that represents the intensity of the color.

[Left] 8 x 8 grayscale image | [Right] 8 x 8 grayscale image with intensity values

[Grayscale] The size of the above grid is 8 (rows) x 8 (columns); hence the dimension of the image is 8x8 pixels and each square block is the pixel with the associated intensity or color value between 0 and 256. It has one channel.

[Left] 8 x 8 RGB image | [Right] 8 x 8 RGB image with intensity values

[RGB] The size of the above grid is 8 (rows) x 8 (columns); hence the dimension of the image is 8x8 pixels and each square block is the pixel with the associated intensity or color value between 0 and 256 for each channel in vertical order (R, G, B). It has three channels.

The code to generate the above images can be found in this GitHub repository. Please refer raster.py

Common file types that fall under raster images: JPEG, PNG, GIF, BMP, TIFF, and more.

In general, the quality of the raster image is directly tied to the dimensions of an image; i.e. how large the grid is. 1080x1080 pixels image is better and will have more visual details/information than a 512x512 pixels image when it is viewed at the same size. However, other aspects like bit depth, compression, color mode, file format also affect the quality of the image in terms of visual details and file size. Higher image dimensions (pixels) typically result in larger file sizes.

Raster images are widely used in photography, web graphics, scanning, medical imaging, and more. It is also used where capturing real, critical details and color gradients are important.

Software like Adobe Photoshop, GIMP but not limited to these are generally used while working with raster images.

Vector images

Vector images are made up of lines, curves, paths, angles which has start and end point. These lines, paths, shapes can be any geometric representation. And mainly they are driven by mathematical expressions. Pixels, grids, dimensions are not applicable to vector images. One can scale vector images indefinitely to any size and the quality of the image will not deteriorate.

Common file types that fall under vector images: SVG, AI, EPS, and more.

Vector images are widely used for designing logos, icons, illustrations, computer-aided drawings, engineering/architectural plans or layouts, and other graphics designs where image scalability, clarity, and precision/accuracy are important.

[Left] Vector vs [Right] Raster image — when scaled at 1000%

In the above example, I have attempted to show the scalability comparison between vector and raster images. The quality of the vector image of “@” on the left is not impacted when scaled and on the other hand, the quality of the raster image of “@” on the right is deteriorating/pixelated when scaled. Hence, when scalability & small file size is important, vector images can be considered.

The code to generate the above images can be found in this GitHub repository. Please refer raster_vs_vector_img.py

Software like CoreDRAW, Adobe Illustrator but not limited to these are generally used while working with vector images.

Comparison

The primary advantage of vector images over raster image is, it can be resized/scaled without losing the quality whereas, raster image gets pixelated when scaled
Raster to vector image conversion is complex and often manual adjustment by an experienced user is required. Simple graphics conversion would be straightforward but it is much more challenging to translate the detailed photograph accurately into vector format. In contrast, vector-to-raster translation is simple depending on factors like resolution, output size, and potential loss of clarity.
Raster images are well-suited to represent intricate details and complex color gradations. Vector images are more suited for illustrations with solid colors and sharp edges.
In terms of file size, a raster image is larger in size since it holds more visual information and more (thousands/millions) pixels, whereas a vector image file size is comparatively smaller since it only holds the mathematical expressions that would determine the designs. However, raster images can be compressed which I will cover in an upcoming article.
Raster images are easily accessible and can be opened using various everyday applications but to access/open vector images often requires specialized software.

The choice between using raster vs vector depends on the use cases. For photographs where capturing real details is important, raster is the way to go. In the scenario where scalability, precision, accuracy are important, vectors are more suitable. The well-known ImageNet dataset contains raster images.

Note: In the upcoming articles, I will be mainly talking about the raster images. At the same time, it is important to understand the difference between raster and vector images.

If you have any questions, comments, or feedback please leave them below. Subscribe to my channel for more.

Essence of Digital Images: Building blocks

Chirag (Srce Cde) — Mon, 18 Sep 2023 16:38:52 +0000

In the era of digital transformation, images emerge as more than mere visual artifacts from the graphics/visuals perspective. Images/Visuals are not just pictures, but they are technological wonders which is a blend of data structures, mathematics, and algorithms. This blend gave rise to these visual experiences that we encounter daily in our lives in the form of digital images.

Pixels - The smallest building block

At the core of every digital image lies the pixel. An image is made of tiny pixels sitting together in the form of the grid which creates the matrix representing the image dimensions and resolution. Consider a 3x3 8-bit pixel image in grayscale color space (Grayscale color space is considered for simplicity). Each pixel carries a numerical value representing color intensity from 0 (black) to 255 (white) corresponding to 8 bits of data.

(Left) 3x3 8-bit grayscale image | (Right) Color intensity representation

Color intensity numbers in red are just for representation purposes.

In RGB color space, the pixel intensity is represented as (255, 0, 0) (255 - represents Red channel color intensity, 0 - represents Green channel color intensity, 0 — represents Blue channel color intensity), indicating the pixel color is Red and no green or blue. Each channel color intensity is represented on a scale from 0 to 255 with 256 possible values for each color channel. Consider the 3x3 8-bit scale pixel image in RGB color space as shown below.

(Left) 3x3 8-bit RGB image | (Right) Color intensity representation

Note: Each channel of RGB is 8-bit, which can be called 24-bit (8R x 8G x 8B = 24 bit per pixel) color depth data (True color). The 8-bit scale provides 2⁸ = 256 possible values ranging from 0 to 255 for each channel per pixel. Hence 256 x 256 x 256 = over 16 million possible color combinations.

The alpha channel is not included for simplicity.

Color encoding & representation

The previous section touched upon how single & multi-channel (RGB) colors can be represented for a given pixel. This section will cover the two different approaches to representing and encoding colors in digital images.

Indexed color — Indexed color also known as palette-based color, uses a lookup table with a limited amount of colors. Each pixel in the image is associated with a predefined index value in the color lookup table. It is memory, storage and transmission efficient.

(Left) Color Lookup table | (Right) Grid image with indexed color

Direct color — Direct color is known as true color or 24-bit color which is one of the methods of representing colors in digital images. Unlike indexed color, direct color does not have a color lookup table instead each pixel has its own color value based on a combination of primary colors (RGB). Each channel in RGB is represented using 8 bits each, resulting in 24 bits per pixel.

Each pixel in a grid with its respective color information

File formats like JPEG, PNG, TIFF support direct color images whereas GIF, PNG-8 file format supports indexed color.

(Left) Direct color example with 24 bits per pixel & Indexed color example (Right) with max 8 colors

The color palette (above image — Right) used in the indexed color example is the same as the lookup table mentioned above as an example.

Note: Bit depth (8, 16, 32) quantifies the number of distinct numerical color values each pixel can have. The higher the bit depth, the finer the color gradations (quality). This collectively influences the image richness and accuracy.

Color spaces

Imagine you want to tell someone about the colors of the breathtaking sunrise image below. While you can use words like red, blue, orange, grey but these words cannot capture all the details. Hence, there is a need for different language which can capture the details of colors; i.e. Color spaces.

Photo by Vika Jordanov on Unsplash

Color spaces is a system/mathematical model that defines how visual colors can be represented as numerical values. It is the standardized and structured way to describe the colors across various applications. There are multiple color spaces each with its own properties. However, the most common color spaces includes but are not limited to—

RGB (Red, Green, Blue) — RGB represents colors as a combination of red, green, and blue as primary colors, each color channel intensity is represented from 0 to 255 in 8-bit systems. In 16-bit systems (16-bit image)(2¹⁶), each channel intensity is represented by the numerical value ranging from 0 to 65,535. RGB is an additive color model, where primary colors are mixed at different intensities to create a wide range of colors.

HSV (Hue, Saturation, Value) — HSV represents colors in Hue (which color), Saturation (range of grey or intensity of the color), Value (brightness of the color). Hue is represented in degree from 0⁰ to 360⁰ whereas Saturation & Value is represented in percentage. HSV color space is similar to how humans perceive color.

Lab — Lab color space is used where precision of color is important. L channel represents brightness/lightness, a channel represents the color position on green to red axis (horizontal), b channel represents the color position on blue to yellow axis (vertical).

CMY/CYMK (Cyan, Magenta, Yellow, Key/Black)- CMY is the combination of cyan, magenta, and yellow as primary colors. CYMK is an extension of CMY which includes the fourth channel as black. This color space is commonly used in printing. It is a subtractive color model where colors are layered to create a wide spectrum of colors.

There are other color spaces like YUV/YCbCr, YIQ, YDbDr, sRGB, and more.

Image formats & compression

Image formats; which are likely to be identified by the extension of the file like jpg/jpeg, png, tiff, bmp, raw, and more carry a varying representation of the same visuals (for the given image). Different file formats use compression algorithms that fall under lossless and lossy compressions. Compression algorithms help to optimize storage and transmission by reducing the file size while the intricate details can be retained.

JPEG is the lossy compression that selectively discards some data to reduce size while PNG format uses a lossless data compression algorithm. It’s a trade-off between compression ratios and image quality. The balancing ratio between size and quality varies across different applications.

This section mainly touched upon raster formats like JPEG, PNG which deal with pixel data, however, there are vector formats as well which include SVG, EPS, AI which are defined by points, curves, and lines rather than pixels and they are also resolution independent.

Digital image manipulation

Consider the photo that you want to crop, rotate, apply a filter, or apply any sort of transformation and this is where digital image processing techniques like filtering, image enhancement, noise reduction, and blurring come in handy. For example, if you want to smoothen the noise from the photograph — blurring techniques like Gaussian blur, median blur can be applied which will in general smoothen the image by averaging the pixel values based on the kernel. This technical computational process fine-tunes images for aesthetic, clarity, and analysis.

Images are something that we interact with daily for a variety of purposes and the visual data is growing at an exponential rate. Hence, it becomes crucial that a new dimension of visual analysis evolve, interpret and understand images. With the advancement in AI/ML, it has already revolutionized the computer vision area which has enabled machines to understand images.

Conclusion

Digital images are a combination of mathematics, algorithms, and engineering starting from pixels coming together to form grids, color spaces lightning up pixels with colors, striking a balance between size and quality for better storage & transmission, extending further to filters/manipulation which gives a different perspective and AI/ML opening up a whole new space of visual analysis.

In this article, I have barely scratched the surface in terms of images but I hope it gives you a holistic view of images at a high level.

If you have any questions, comments, or feedback please leave them below. Subscribe to my channel for more.

AWS Glue Custom Classifier | Grok | Tutorial

Chirag (Srce Cde) — Tue, 07 Feb 2023 15:28:56 +0000

AWS Glue custom classifier enables you to catalog the data in the way you want when AWS Glue built-in classifiers cannot. It is important to catalog the data correctly and the classifier plays an important role in identifying the structure of underlying data.

If the built-in classifiers do not catalog the data as you need, then there are a couple of options to go for but it is not limited to those. The first option is to populate or prepare the source data in the format that is supported by AWS Glue built-in classifiers. The second option could be to catalog the data manually if it is feasible. As a third option, create a custom classifier to determine or parse the structure of data, the way we want. At the same time, sometimes the custom classifier also fails, and in that case, consider changing the format of data via some transformations.

Understanding the role of classifiers

Let’s consider S3 as the data source, or it could be anything. The ideal step to catalog the data is to create the crawler, which will scan the underlying followed by determining the potential columns and their data type. Post-identification of the structure of data, it will populate the table definition as a part of the data catalog.

How the crawler will determine the schema of the data?

The answer is, it will determine the schema using the classifiers. So, the role of the classifier is to read the data and identify its structure of it and help to catalog the data correctly.

A short definition of a classifier is, A classifier is a configuration, either built-in or custom, that is leveraged as a part of the crawler to infer the source data and parse the structure or schema of the underlying data.

How crawler choose the classifier?

While creating the crawler, one can choose one or more custom classifiers to infer the schema and for a built-in classifier, there is no additional configuration is required. Assuming that, the crawler that is created has one or more custom classifiers. This is how it will choose the classifier

The crawler will start the match using the custom classifiers first and subsequently leverage other classifiers if required
The crawler determines if the said classifier is a good fit or not based on the certainty score. Match via each classifier returns the certainty score. If any classifier returns a certainty score of 1.0, then the crawler will make the decision that the particular classifier is a good fit and it can create the correct table definition. The crawler will stop further matching with other classifiers if any classifier returns a certainty score of 1.0
If none of the custom classifiers returns a certainty score of 1.0, then the crawler will initiate the match using the AWS Glue built-in classifiers. The match via built-in classifiers will either return 1.0 (If there is a match) or 0.0 (If no match is found)
If no classifier (custom or built-in) returns a certainty score of 1.0, then it will pick the classifier with the highest certainty score
Let’s say there is no classifier that returned a certainty greater than 0.0, then AWS Glue will return the default classification string of UNKNOWN

Types of classifiers

There are two types of classifiers; AWS Glue Built-in classifiers and Custom classifiers. Here, is the list of built-in and custom classifiers that AWS Glue supports as of today. With respect to built-in classifiers, there is no additional configuration required to parse the structure of data because AWS Glue will internally figure out which built-in classifier to use based on the certainty score.

For custom classifiers, there are 4 types available which are grok, JSON, XML, and CSV custom classifiers.

It can classify the files which are in ZIP, BZIP, GZIP, LZ4, and Snappy compression formats.

When to use a custom classifier?

When the AWS Glue built-in classifier is unable to create the expected or required table definition, then one should consider creating & using the custom classifier.

Grok custom classifier

Grok is a tool to parse textual data using a grok pattern. Grok pattern is the named set of a regular expression. For example, defining a regular expression pattern to match email addresses and then the name is given to that pattern like an EMAILPARSER, and this is a named set of the regular expression. (EMAILPARSER regular-expression).

Syntax of grok pattern: %{PATTERNNAME:field-name:data-type}

PATTERNNAME is the name of the pattern that will match the text. (PATTERNNAME could be EMAILPARSER from the previous example)
field-name is the alias name and can be anything
data-type is optional. By default, the data type will be a string if not mentioned. The supported data types are byte, boolean, double, short, int, long, string, and float

For example, if you want to match the month number from the text, then

Define the regular expression to match the same and then we will provide the name to that pattern
Named pattern: **MONTHNUM** (?:0?[1–9]|1[0–2])
In this case, the name of the pattern that we have given is MONTHNUM, which is followed by a regular expression.
To define the grok pattern using the defined syntax, which will be the reference to the defined pattern_name followed by the alias name. And the pattern name is MONTHNUM and the alias name can be anything. Finally, casting it to int datatype
Grok pattern: %{MONTHNUM:month:int}

AWS Glue provides built-in patterns and if required the custom pattern can be defined. For example, if AWS Glue has MONTHNUM defined as a part of the built-in patterns, then we can directly use the name of that pattern to define grok pattern. But if there is something that you want to match or parse that is not defined as a part of the AWS Glue’s built-in pattern then you have to define the custom pattern like MONTHNUM followed by the regular expression. Hence, the grok pattern can be defined using AWS Glue’s built-in patterns and custom patterns.

Grok custom classifier example

Consider a log file (as shown in the Log data image). In this log file, there is UUID, WindowsMAC address, and Employee Id. The requirement is to catalog this structure and create the table definition via grok custom classifier.

Below is a snapshot that shows a couple of AWS Glue built-in patterns, that we will use. But there is a long list of AWS Glue built-in patterns to which you can refer here. The keyword you see initially is the pattern name (highlighted in the blue box), and it is followed by the regular expression (highlighted in the purple box).

Starting with the UUID, the pattern to parse UUID is already defined in AWS Glue built-in pattern (highlighted in the red box), then it is the same for Windows MAC as well. But for Employee Id, which is a 7-digit number, there is no built-in pattern, so in that case, the custom pattern is required. In simple words, if we have to define the pattern name and regular expression to match then it is the custom pattern.

If we put together all the patterns, then it will look like the grok pattern defined as a part of the Final grok pattern in the above table.

Hands-On

As a part of hands-on, we will parse the log file looks like below. I have modified this log file to include the random dummy email addresses.

Breaking down the structure of a log. Starting with a Timestamp in square braces, then the Log type which could be an error or notice, followed by the Email, and finally the Message. Now, let’s see which patterns will be used among built-in and custom to define grok patterns.

Timestamp: A custom pattern is defined using the AWS Glue built-in patterns to infer Day, Month, Monthday, Time & Year as a single entity. And using the custom pattern the grok pattern is defined.

Log type: AWS Glue built-in pattern WORD is used. WORD pattern will match any alphanumeric characters including underscore.

Email: There is no built-in pattern to parse email. Hence, the custom pattern is defined with the name GETEMAIL.

Message: GRREDYDATA built-in pattern is used. GREEDATA will match 0 or more of the preceding tokens except for the new line.

The overall pattern is defined as a part of the Final grok pattern.

AWS Console

Goto AWS Management Console → S3 Management Console → Create a new bucket or use an existing bucket to upload the log file.

Create database

As a next step, go to the AWS Glue console and create the database. The table definition will be created under this database.

Create custom classifier

I would encourage you to create the crawler first to run it without a custom classifier and check the results to get an idea if the AWS Glue built-in classifiers are able to populate the correct structure or not. Post-that go to Classifier (Under Crawlers) → Add classifier and configure the details as shown below and create.

Classifier name: The name of the custom classifier
Classifier type: Grok
Classification: A classification string that will be used as a label
Grok pattern: A grok pattern to match the structure (From Grok pattern table above). All the fields will have string data type but if you want to cast it to another data type then add the optional data type block (seperated by a colon as per the syntax which is discussed earlier)
Custom patterns: In the above table, two custom patterns (GETTIMESTAMP & GETEMAIL) are created which are not available as a part of AWS Glue built-in patterns, so enter that

Create crawler

Go to crawlers → Create crawler → Configure crawler name (Step 1) → Configure data source & add custom classifier(s) as shown below (Step 2) → Select IAM role (Step 3) → Select the database created earlier & provide the optional table prefix. In my case, it is custom- (Step 4) → Review the configuration (Step 5) → Create crawler.

For detailed steps on creating a crawler, please refer to this video

As a next step, run the crawler, and if the configuration and grok pattern is correct then it will populate the table under the said database.

Query the table via Athena

Navigate to Amazon Athena → Select Query Editor → Select AwsDataCatalog as Data Source → Select the database (In my case it’s apache-demo-db) → That should list all the tables under that database → Query the table

I hope, you gained good insights and hands-on knowledge about the custom classifier and especially the grok custom classifier. If you like to follow along with me step by step in detail then you can refer to this video.

If you have any questions, comments, or feedback then please leave them below. Subscribe to my channel for more.

Reference: https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

Allow access to REST API Gateway from specific IP addresses | Whitelist IPs

Chirag (Srce Cde) — Wed, 01 Feb 2023 11:03:35 +0000

How to allow specific IP or range of IP addresses to access our REST API endpoints?

In this article, I will share how to whitelist an IP address to allow access to the REST API endpoint and deny/block all the requests originating from different source IPs. This article is purely for the APIs with REST protocol within API Gateway. The method/approach that we are going to use to control the whitelisting of IPs is via Resource Policy.

Here, I am going to allow/whitelist my IP address to access/invoke the API Endpoint and block the rest of the requests originating from sources other than my IP address.

Getting started

To get started, create a lambda function (requestService) which will be our back-end integration for our REST API Gateway (which we will create in a while). The lambda function will simply return the hard-coded response whenever the endpoint (GET method) will be invoked, without any business logic.

Post creation of the Lambda function, go ahead to API Management Console and create the REST API from scratch or you can also open any existing REST API. As a next step create the resource (/processrequest) along with the GET method. In the end, integrate the lambda function (requestService) with the GET method. Please refer to the below screenshot for integration.

For similar detailed step by step setup of the resources you can refer to my tutorial on Resources, method integration with lambda

Whitelisting IP address via Resource Policy

With the help of resource policy, we can restrict the API Endpoint invocation to specific requests originating from defined IP addresses and block/deny the rest of the requests.

After setting up the API Gateway and lambda function, open the API Gateway (which is created in the above step) and click on Resource Policy from the left panel, and copy & paste the below policy in the editor and click on Save.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "execute-api:Invoke",
      "Resource": "execute-api:/*/*/*"
    },
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "execute-api:Invoke",
      "Resource": "execute-api:/*/*/*",
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": ["YOUR IP ADDRESS", "IP CIDR BLOCK"]
        }
      }
    }
  ]
}

Here, within policy, we have two statement blocks (i.e. Allow & Deny block). The first statement which allows statement states that we are going to allow all the API Endpoint invocations originating from any source to all the resources within our REST API.

In the second statement, we have defined explicit denial. The deny statement states that block all the requests from all sources to all resources but with a condition. The condition states that block all the requests except the request coming from the IP address mentioned in the NotIpAddress block.

As a next step, replace the YOUR IP ADDRESS placeholder with your IP address (you can simply google, whatmyip to fetch your IP address) for which you want to allow the API Endpoint invocation. Additionally, you can also define the IP range with the CIDR block. After modification, Click on Save

Finally, re-deploy the API for the changes to be reflected and get the Invocation URL.

Testing

Post-deployment, copy the invocation URL and paste it into a new tab in your browser and make sure to add /processrequest and hit Enter. As a result, you should be able to see the response coming from the lambda function.

To make sure, that the resource policy approach is working fine, go ahead and replace your IP address with localhost IP and click on Save. And re-deploy it.

Now if you re-hit the API endpoint again then it will return an error message as shown in the below reference image.

Finally, we made out endpoint secure in a way.

For a detailed step-by-step setup, you can refer to the video below.

If you have any questions, comments, or feedback then please leave them below. Subscribe to my channel for more.

Handle SQS message failure in batch with partial batch response feature

Chirag (Srce Cde) — Mon, 30 Jan 2023 12:28:26 +0000

AWS has announced, “AWS Lambda now supports partial batch response for SQS as an event source”. Before we go through, how it works let understand that how SQS messages were handled before, and then we will go through how the partial batch response feature will add value.

1st scenario

Assumption: The SQS trigger is configured for the lambda function. The exception is not handled in case of any message processing failure for a given batch.

The lambda function is triggered with the batch of 5 messages and if the lambda function fails to process any of the messages and throws an error then all the 5 messages would be kept on the queue to be reprocessed after a visibility timeout period. In this case, either the batch processing would be completely successful and the messages would be deleted from the SQS queue or it would completely fail and put the whole batch in the queue for reprocessing.

import json
import boto3

def lambda_handler(event, context):
    region_name = os.environ['AWS_REGION']
    if event:
        for record in event['Records']:
            body = record['body']
            # process message
    return {
        'statusCode': 200,
        'body': json.dumps('Message processed successfully!')
    }

Re-processing already processed message is not feasible, so to avoid it at some level let’s delete each message from the batch once it is processed.

2nd scenario

Assumption: The SQS trigger is configured for the lambda function. The exception is not handled in case of any message processing failure. But the delete functionality is added to delete each message after it is processed successfully.

In this case, let’s say the first 2 messages in the batch are processed successfully and deleted. The 3rd message failed and lambda returns an error, so for this failure, the 3rd, 4th & 5th messages will be set for a retry since the 1st & 2nd messages are processed successfully and deleted from the queue. Hence, the processing of already processed messages will not happen.

import os
import json
import boto3

def lambda_handler(event, context):
    region_name = os.environ['AWS_REGION']
    if event:
        sqs = boto3.client('sqs', region_name=region_name)
        queue_name = event['Records'][0]['eventSourceARN'].split(':')[-1]
        queue_url = sqs.get_queue_url(
                    QueueName=queue_name,
                )

        for record in event['Records']:
            body = record['body']
            print(body)
            # process message
            response = sqs.delete_message(
                        QueueUrl=queue_url['QueueUrl'],
                        ReceiptHandle=record['receiptHandle']
                    )

    return {
        'statusCode': 200,
        'body': json.dumps('Message processed successfully!')
    }
view raw

The lambda failed to process the 3rd message from the batch and due to that further processing of the rest of the messages is interrupted. But let’s say we want all the messages to be processed even if any message is failing and the successfully processed messages should be deleted and only the failed message should be retried.

3rd scenario

Assumption: The SQS trigger is configured for the lambda function. Exception handling for failed messages is configured on top of delete functionality.

Here, we will maintain a flag to determine if any message is failing. Let’s say again the 3rd message failed. Now, since we have error handling in place, it will handle the failed message and process the rest of the messages in the batch followed by the deletion. Finally, the manual exception will be raised based on the flag condition which will cause the failed message to retry since the rest of the messages are processed and deleted successfully. In this scenario, we cannot control which messages we want lambda to retry with and if we want to control the messages that lambda should retry then the partial batch response feature is the answer.

import os
import json
import boto3

def lambda_handler(event, context):
    region_name = os.environ['AWS_REGION']
    if event:
        sqs = boto3.resource('sqs', region_name=region_name)
        queue_name = event['Records'][0]['eventSourceARN'].split(':')[-1]
        queue = sqs.get_queue_by_name(QueueName=queue_name)
        failed_flag = False
        messages_to_delete = []
        for record in event['Records']:
            try:
                body = record['body']
                # process message
                messages_to_delete.append({
                    'Id': record['messageId'],
                    'ReceiptHandle': record['receiptHandle']
                })
            except RuntimeError as e:
                failed_flag =True

        if messages_to_delete:
            delete_response = queue.delete_messages(
                    Entries=messages_to_delete)
        if failed_flag:
            raise RuntimeError('Failed to process messages')

    return {
        'statusCode': 200,
        'body': json.dumps('Messages processed successfully!')
    }

All the 3 scenarios were about how lambda can handle the messages depending on the requirements before the partial batch response was introduced.

4th Scenario

With the partial batch response feature, a lambda can identify the failed messages from the batch and return the identified messages back to the queue which will allow reprocessing of only the failed or asked messages. This will make the SQS queue processing more efficient, kill the need for repetitive data transfer with increased throughput, improve processing performance, and on top of that it does not come with any additional cost beyond the standard price.

While using this feature, exception handling should be in place and the lambda function has to return the message ids of the messages that requires reprocessing in the particular format given below.

To enable and handle partial batch failure, check the Report batch item failures option under Additional settings while adding the SQS trigger.

After the SQS event source configuration, the response part in the code should be in a particular format that is given below for the partial batch response failure to work.

{    
    "batchItemFailures": [          
        {             
            "itemIdentifier": "id2"         
        },         
        {             
            "itemIdentifier": "id4"         
        }     
    ] 
}

import os
import json
import boto3

def lambda_handler(event, context):
    if event:
        messages_to_reprocess = []
        batch_failure_response = {}
        for record in event["Records"]:
            try:
                body = record["body"]
                # process message
            except Exception as e:
                messages_to_reprocess.append({"itemIdentifier": record['messageId']})

        batch_failure_response["batchItemFailures"] = messages_to_reprocess
        return batch_failure_response

For a detailed video please refer to the below video.

Recommendation

For most of the situations/scenarios adopting the implementation used in the 4th scenario would be beneficial and have added advantage of efficient & fast processing, reduced repetitive data transfer hence increased throughput.

If you have any questions, comments, or feedback then please leave them below. Subscribe to my channel for more.

How to map static outbound IP with the AWS Lambda function

Chirag (Srce Cde) — Sun, 29 Jan 2023 13:00:50 +0000

In this article, I will share how to map elastic IP and make sure that, all the outgoing requests from the lambda function have the same IP address/origin.

Scenario

There is an External API and it only allows or entertains the requests originating from specific IP addresses which are whitelisted as a part of the policy of the external API.

Here, we want to invoke/access the external API from the lambda function but there is a problem.

Problem

The problem here is that the AWS Lambda function does not guarantee the same outbound IP address with each invocation since the AWS Lambda function runs on containers within the AWS environment in AWS-managed VPC and at times the container could run in different execution environments. And different environments have a different IP addresses, hence whitelisting any outbound IP within external IP is not possible since it is bound to change.

So how can we assign the static IP to the lambda function and make sure that all the requests made from the Lambda function originate from the same IP address?

Solution

The solution is to map static/elastic IP to all the outgoing traffic from the lambda function and for that, place the lambda function in VPC. However, as soon as you place the lambda function in VPC, it will lose the internet connectivity and here we want to invoke the external API which is over the internet. Now, to enable internet connectivity, the NAT Gateway & Internet Gateway is also required.

High-level steps

Create VPC with Public & Private subnets
Create NAT Gateway under Public subnet with Elastic IP
Create Internet Gateway
Map NAT Gateway within private subnets route table (Source: 0.0.0.0/0 Destination: nat-gateway)
Map Internet Gateway within public subnet route table (Source: 0.0.0.0/0 Destination: internet-gateway)
Place lambda function in VPC under Private subnet

Hands-On

Create & Configure VPC

To get started, navigate to VPC Management Console to create the VPC and select VPC and more option to create & configure VPC, subnets, AZs, NAT Gateway & Internet Gateway on a single screen.

Configure the asked details as per your requirement. My configuration is as follows:

Name tag auto-generation: lambda-elastic-ip
IPv4 CIDR block: 10.100.0.0/16
IPv6 CIDR block: No IPv6 CIDR block
Tenancy: Default
Number of Availability Zones (AZs): 2
- First availability zone: us-east-1a
- Second availability zone: us-east-1b
Number of public subnets: 2
- Public subnet CIDR block in us-east-1a: 10.100.0.0/20
- Public subnet CIDR block in us-east-1b: 10.100.16.0/20
Number of private subnets: 2
- Private subnet CIDR block in us-east-1a: 10.100.128.0/20
- Private subnet CIDR block in us-east-1b: 10.100.144.0/20
NAT gateways: In 1 AZ
VPC endpoints: None
DNS options

✓ Enable DNS hostnames
✓ Enable DNS resolution

The above VPC setup will take care of adding necessary routes (NAT & IG mappings) in the respective route tables.

Create & configure lambda function

Navigate to Lambda Management Console and create the lambda function with Python3.9 as runtime. Post creation, open the relevant IAM role attached with the lambda function and add AWSLambdaVPCAccessExecutionRole permission to it.

As a next step, add the below snippet as a part of the function code for testing purposes. The below snippet will invoke (GET) the API Endpoint mentioned as a part of API_URL.

import json
import urllib3
def lambda_handler(event, context):
    http = urllib3.PoolManager()
    API_URL = "YOUR_API_URL"
    response = http.request("GET", API_URL)
    print(response.data)
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Srce Cde!')
    }

Note: Replace YOUR_API_URL with the actual API endpoint.

As a next step, place the lambda function in VPC under private subnets as shown below.

Testing

For testing, I have an external REST API (API Gateway) where I can control which requests to entertain based on the origin (IP address). If you also want to have similar setup then you can follow my tutorial video on How to whitelist IP address to access API Gateway | REST

Execution log of the lambda function before adding/whitelisting the Elastic IP address of the lambda function.

Execution log of the lambda function after adding/whitelisting the Elastic IP address of the lambda function.

Response “Successful Invocation!” is returned from the external API.

Finally, we successfully attached/mapped the Elastic IP address with the AWS Lambda function.

For a detailed step-by-step tutorial and implementation of the above setup, you can refer to the below video.

If you have any questions, comments, or feedback then please leave them below. Subscribe to my channel for more.

Send notification based on CloudWatch logs filter patterns

Chirag (Srce Cde) — Fri, 27 Jan 2023 09:44:52 +0000

Multiple resources are deployed and we want a mechanism to get notified when any unusual things start to appear in logs, which as a developer we do not want to see. To get notified, we need to parse the logs to track certain Keywords, Errors, and Critical information which we think are red flags. And in this article, I will share how to send an email notification when certain Keywords, Errors appear in the CloudWatch logs.

For the purpose of this tutorial, we will consider a lambda function that we want to track and get notified of any errors or keywords. To track the logs, we will use CloudWatch Log Subscription Filter. CloudWatch Subscription Filter will analyze logs for certain errors or keywords using filter patterns (we will define them in the latter part of this article) and based on that it will invoke the destination resource (in our case, it’s the Error Handler lambda function).

The above is the architecture that we will implement. Here, we have a number of lambda functions for which we want to get notified in case of any errors. The lambda functions will push logs to CloudWatch and on the log group, we will have a subscription filter along with filter patterns. It will look for keywords like ERROR, INFO in the logs and if any of the keywords exist then it will invoke the Error Handler lambda function. The Error Handler lambda function will decode, decompress & parse the log data and will publish a message to the SNS topic, and the SNS topic will further publish the message to its subscribers (In our case it will be an email subscriber)

Here, we will start with the creation of the SNS Topic along with the email subscription (To which the notification will be sent). After that, create a dummy lambda function, which will act as an error, keyword-producing function. As a next step, create the Error Handler lambda function followed by the configuration of the CloudWatch Event trigger (where we will define the filter patterns). Finally, update the code base and add the necessary permission to publish a message from the lambda function.

Hands-on

SNS

Create the standard SNS topic

Open the SNS topic that you have created and create Email subscription.

After creating the subscription, you will receive the confirmation email. So, make sure to click on confirm subscription to receive email notifications.

Lambda functions

Error/Keyword producer lambda function

Create the error/keyword producer lambda function with the appropriate name & Python3.9 as runtime. The purpose of this lambda function will be to publish certain keywords in the logs, which we want to track to validate the end-to-end flow. Post creation, update the code of the same with the below snippet.

import json
import logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
def lambda_handler(event, context):
    logger.info("Sample INFO log")
    logger.error("Sample ERROR log")
    logger.critical("High LATENCY")

Test it once to check if the lambda function is running without errors.

Error Handler lambda function

Create another lambda function with the relevant name & Python3.9 as runtime. The purpose of this lambda function is to parse the error logs and publish the message to SNS.

As a next step, update the source code of the function from here and deploy: https://github.com/srcecde/aws-tutorial-code/blob/master/lambda/lambda_proces_cw_error_notification.py

After deployment, add the environment variable SNS_TOPIC_ARN and the value as the ARN of the SNS topic that is created earlier.

Post adding the environment variable, we need to provide the permission to lambda function to publish a message to SNS. So, add the permission to the IAM role (Which can be found by clicking on Configuration → Permissions → Role name) of the lambda function to publish the message to the SNS topic. Create a new policy within IAM (IAM → Click Policies → Create policy → JSON) and paste the below policy.

Note: Please update the policy with region & account id.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "sns:Publish",
            "Resource": "arn:aws:sns:<region>:<account_id>:publish-notification"
        }
    ]
}

Attach the policy with the IAM role of the lambda function (IAM → Roles → Open Role → Add permissions → Attach policies → select the policy that you created above → Attach policies)

CloudWatch Trigger | Filter Pattern

Open the Error Handler lambda function, click on Add Trigger, and select CloudWatch Logs from the drop-down.

Under Log Group, select the CloudWatch log group of the error/keyword producer lambda function.
Enter the Filter name
Under Filter pattern, enter the filter value pattern that you want to track. For ex: I want to track the logs that contain keywords like ERROR & LATENCY, so I will enter ?ERROR ?LATENCY

Similarly, you can define different patterns as per your requirement and to learn more about how you can define patterns, please refer to this material

Post configuration, click on Add

Testing

We are all set to test this setup. Go ahead to the error/keyword producer lambda function and click on Test. And in a while, you will receive an email notification, which will look something like this because the logs contain the keyword ERROR & LATENCY

For a detailed step-by-step tutorial and implementation of the above setup, you can refer to the below video.

Finally, I also want to provide the CDK code for this use case and for that please refer to this repository: https://github.com/srcecde/aws-cdk-code/tree/main/lambda-cloudwatch-pattern-notification

If you have any questions, comments, or feedback then please leave them below. Subscribe to my channel for more.

Whitelist IP addresses for Lambda function URLs

Chirag (Srce Cde) — Wed, 25 Jan 2023 04:20:14 +0000

Lambda function URLs feature is the recent addition to the AWS Lambda service. With a lambda function URL, one can invoke the lambda function via a unique URL similar to the invocation of any API endpoint with respective methods.

In this article, we will configure/add the functionality to validate the IP address of the incoming requests via function URL which will enable us to only serve the requests originating from the whitelisted IP addresses and block the rest while Auth Type is selected as None.

As of now, we cannot leverage resource policy to whitelist IP addresses for lambda function URLs since that feature is not available. So, here we will write a simple python function to add that functionality as a part of the lambda function code base.

Hands-On

Create the lambda function and the function URL for the same.

As a next step, update the source code of the function from here and deploy: https://github.com/srcecde/aws-tutorial-code/blob/master/lambda/lambda_ip_val_func_url.py

Post deploying the code, add the environment variable IP_RANGE with the list of IP addresses, CIDR blocks (for IP range) that need to be whitelisted. If you do not add the environment variable, then by default it will return status code 500 with the message Unauthorized for all incoming requests.

Note: The status code and message can be modified within the code.

The updated lambda function code will check & validate the origin IP address of the request against the whitelisted IP addresses as a part of the IP_RANGE environment variables.

Now, we are all set to test it.

Testing

For testing, we will use Postman and the setup will look as below.

The endpoint will return 500 Forbidden if the IP address is not whitelisted as a part of an IP_RANGE environment variable.

After whitelisting the IP address as a part of an IP_RANGE environment variable, the endpoint will return status code 200 with an appropriate response.

However, here we have a few disadvantages when we decide to choose this methodology.

For all invalid calls (Invocation calls from the IP addresses which are not whitelisted), the lambda function will get triggered each time and that will add up the cost for each unwanted call
For all valid calls (Invocation calls from the IP addresses which are whitelisted), the validation of IP address logic will add up to the execution time with added relevant cost
Cannot whitelist private IP addresses (For ex: Private VPCs IP ranges)

For a detailed end-to-end, step-by-step setup, you can refer to the video below.

If you have any questions, comments, or feedback then please leave them below. Subscribe to my channel for more.