DEV Community: Ramya Perumal

RAG - Semantic Caching

Ramya Perumal — Sat, 18 Jul 2026 17:12:20 +0000

When a user submits a query, the query is converted into an embedding and searched against the vector database to retrieve the relevant documents.

But what happens if the user asks the same or a very similar query again?

This is where semantic caching comes into the picture.

With semantic caching, the system stores the result of a previous search in a cache. A cache is temporary storage for frequently accessed or recently queried results. When a user asks the same or a semantically similar query again, the system returns the result directly from the cache instead of querying the vector database again.

Benefits

Saves retrieval time
Reduces token consumption
Reduces the number of calls to the vector database
Reduces the number of calls to the LLM

How Do We Store Results in the Cache?

We can use Redis or Valkey for semantic caching.

These are in-memory data stores, which means they keep data in RAM rather than on disk. Because the data is in memory, reads are much faster than with disk-based databases.

Typically, we store:

User query
Related answer
Metadata
Embeddings

Example

Suppose a user asks:

"What is today's gold price?"

The query and its corresponding answer are stored in Redis.

Later, another user asks:

"Gold price today?"

Although both queries have the same meaning, Redis cannot directly retrieve the previous answer because it expects the key to match exactly.

This is one of the limitations of using Redis as a simple key-value store.

How Can We Solve This?

One approach is:

Retrieve the keys stored in Redis (for example, using SCAN; KEYS * also works but blocks the server while it runs, so it is best avoided on large datasets).
Generate or retrieve the embedding for each stored query.
Convert the current user query into an embedding.
Compare the current query embedding with the stored query embeddings using cosine similarity.
If the similarity score is above a predefined threshold, retrieve the corresponding answer from Redis.

This allows semantically similar queries to reuse cached results even when the text is different.
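The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the cache is a plain in-memory list rather than Redis, `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the threshold of 0.6 is an arbitrary assumption chosen for the example.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold=0.6):
        self.threshold = threshold  # assumed value; tune per application
        self.entries = []           # list of (query embedding, cached answer)

    def put(self, query, answer):
        self.entries.append((toy_embed(query), answer))

    def get(self, query):
        """Return the best cached answer if it clears the threshold, else None."""
        query_embedding = toy_embed(query)
        best_score, best_answer = 0.0, None
        for embedding, answer in self.entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("What is today's gold price?", "cached gold price answer")
hit = cache.get("Gold price today?")     # similar wording, reuses the cached answer
miss = cache.get("How do I make tea?")   # unrelated query, falls through to retrieval
```

With a real embedding model the comparison logic stays the same; only `toy_embed` and the threshold would change.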

Ways to Implement Semantic Caching

Semantic caching can be implemented in two ways:

Using frameworks such as LangChain
Using in-memory databases such as Redis, Valkey, or other similar databases

Cache Invalidation

One of the most important aspects of semantic caching is cache invalidation, which determines how long cached data should remain valid before it is automatically removed or refreshed.

For example, suppose a user asks:

"What is today's gold price?"

The answer should only be valid for a limited period. If the application returns yesterday's gold price, the information becomes incorrect.

There is no single solution for cache invalidation. The appropriate strategy depends on the application and the type of data being cached.

Different scenarios need to be considered before deciding when cached data should expire.
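One common strategy is time-based expiry (a TTL, or time to live). The sketch below is a hypothetical in-memory version written for illustration; the class name `TTLCache` and the 60-second lifetime are assumptions. Redis and Valkey provide per-key expiry natively (for example, the EX option of SET), so a real deployment would normally rely on that instead.

```python
import time


class TTLCache:
    """Illustrative cache where every entry carries its own expiry time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be demonstrated without sleeping
        self._store = {}      # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            # The entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value


# A fake clock makes the example deterministic.
fake_now = [0.0]
cache = TTLCache(clock=lambda: fake_now[0])
cache.set("gold price", "price quoted at t=0", ttl_seconds=60)

fresh = cache.get("gold price")   # still within 60 seconds
fake_now[0] = 61.0
stale = cache.get("gold price")   # expired, so the caller must fetch a new answer
```

A short TTL suits fast-changing data such as prices, while stable answers can live longer.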

When Should In-Memory Databases Be Used?

In-memory databases are well suited for:

Temporary queries
Frequently asked questions
Data that is accessed repeatedly

By understanding the meaning of the query, we can define guardrails to determine which queries should be cached and when the cache should be invalidated.

The main objective is to optimize the RAG pipeline by reducing unnecessary calls to both the vector database and the LLM.

Although it is not possible to eliminate duplicate requests completely, semantic caching can significantly reduce them.

Important Consideration

We should not store every query in an in-memory database.

Only queries that are valuable for caching should be stored because RAM has limited storage capacity. Therefore, an effective caching strategy should carefully decide which queries are worth storing and for how long.

RAG - Meta Filtering and Reranking

Ramya Perumal — Sun, 12 Jul 2026 21:43:55 +0000

Generally, when a user asks a query, the system searches for the relevant chunks stored in the vector database using cosine similarity. The better we can filter the data, the smaller the search space becomes, resulting in faster and more efficient retrieval.

Suppose we have a book with 10 chapters. If we want to search for a particular topic, all the points in the vector database are compared with the user query, and only the closest points are retrieved. This process is called KNN (K-Nearest Neighbors).

Another algorithm is ANN (Approximate Nearest Neighbors). Instead of checking all the points in the vector database, ANN searches only a smaller region that is likely to contain the nearest points. As the name suggests, it does not always return the exact nearest neighbors, but it returns close approximations much faster.
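The exact (KNN) search described above can be sketched as a brute-force scan. The chunk names and two-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions, which is why ANN indexes exist to avoid scanning every point.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, points, k):
    """Compare the query with every stored point and keep the k most similar."""
    ranked = sorted(points.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical stored chunks and their vectors.
chunks = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.9, 0.1],
    "chunk_c": [0.0, 1.0],
}
nearest = knn([1.0, 0.05], chunks, k=2)
```

The cost grows linearly with the number of stored points, since every point is scored for every query.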

Is there any other method we can use to make the search more effective?

Metadata Filtering

Metadata means data about the data.

Metadata is stored along with each chunk. It can contain information related to the chunk, such as the chapter name, topic description, author, or any other relevant details.

When the user query contains information related to the metadata (for example, a chapter name or topic), the system can directly filter the relevant chunks before performing vector similarity search. This technique is called metadata filtering.
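A minimal sketch of the idea, assuming each chunk is a plain dictionary with `text`, `vector`, and `metadata` fields (a made-up layout, not the API of any particular vector database): the filter runs first, and similarity is computed only for the chunks that survive it.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query_vector, chunks, metadata_filter, k=1):
    # Step 1: keep only chunks whose metadata matches every filter field.
    candidates = [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    # Step 2: rank the remaining chunks by similarity to the query.
    candidates.sort(key=lambda chunk: dot(query_vector, chunk["vector"]), reverse=True)
    return [chunk["text"] for chunk in candidates[:k]]


# Hypothetical chunks from a book with chapter metadata.
chunks = [
    {"text": "intro to vectors", "vector": [1.0, 0.0], "metadata": {"chapter": "Chapter 1"}},
    {"text": "vectors revisited", "vector": [1.0, 0.0], "metadata": {"chapter": "Chapter 2"}},
    {"text": "building indexes", "vector": [0.0, 1.0], "metadata": {"chapter": "Chapter 2"}},
]
results = filtered_search([1.0, 0.0], chunks, {"chapter": "Chapter 2"}, k=1)
```

Even though the Chapter 1 chunk is just as similar to the query, it is never scored because the filter removes it first.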

Metadata filtering is supported by:

Pinecone
ChromaDB
Qdrant

FAISS does not provide built-in support for metadata filtering.

Reranking

Documents are first split into chunks, and each chunk is converted into vectors and stored in the vector database.

When a user query arrives, it is converted into a vector and searched against the vector database to retrieve the closest chunks. However, we do not know whether the retrieved documents are actually the most relevant to the query. It is not always true that the closest vectors represent the most relevant documents.

How Reranking Works

The documents retrieved from the vector database are passed to a cross-encoder along with the user query.

The cross-encoder assigns a relevance score that indicates how closely each document matches the query. The documents are then sorted in descending order of these scores.

The results produced by the cross-encoder are called reranked results.

The retrieved documents remain the same as those returned by the vector database, but their order changes. Documents with higher relevance scores appear before those with lower scores.

A cross-encoder is a neural ranking model. Instead of encoding the query and documents separately, it takes both the query and the document together as input to a transformer model and generates a relevance score for each document.

There are transformer models specifically designed for reranking tasks. The encoder understands the meaning of both the query and the document and reranks the documents accordingly.

Why Use Reranking?

Reranking is an important step in the RAG pipeline.

It is especially useful when working with documents that contain images or other multimodal content.

Example

Suppose the user asks:

"Show me the front view of the truck."

The vector database may retrieve multiple images related to trucks because they are semantically similar.

The reranker analyzes both the query and the retrieved images (or their associated text descriptions) and assigns relevance scores.

As a result, the image showing the front view of the truck receives a higher score than the other truck images, making it appear first in the final results.
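The reordering step can be sketched as follows. Note the assumption: `toy_relevance` is a simple word-overlap score standing in for a real cross-encoder model, and the document descriptions are invented for this example. Only the sorting logic is the point here.

```python
import re


def toy_relevance(query, document):
    """Stand-in for a cross-encoder: fraction of query words that appear in the document."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    document_words = set(re.findall(r"[a-z]+", document.lower()))
    if not query_words:
        return 0.0
    return len(query_words & document_words) / len(query_words)


def rerank(query, documents, score_fn=toy_relevance):
    """Score each (query, document) pair together, then sort highest score first."""
    return sorted(documents, key=lambda doc: score_fn(query, doc), reverse=True)


# Text descriptions of images, in the order a vector search might return them.
retrieved = [
    "side view of a red truck",
    "truck engine close-up",
    "front view of the truck",
]
reranked = rerank("Show me the front view of the truck", retrieved)
```

The set of documents is unchanged; only their order differs, with the front-view description moving to the top.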

Docker - Networking and Best Practices

Ramya Perumal — Sat, 27 Jun 2026 20:30:19 +0000

Docker Networking

Containers are assigned an IP address when they are created. To check the IP address, we can use the following command:

docker inspect <container_id>

If we send a request from the host to the container's IP address, the container responds using its assigned IP address.

By default, Docker creates a bridge network. This bridge network allows:

Communication between the host and the container (through port mapping).
Communication between containers connected to the same bridge network.

With the default bridge network, containers generally communicate using IP addresses. To communicate using container names (hostnames), we can use a custom bridge network.

The main difference between the default bridge network and a custom bridge network is that containers on a custom bridge network can communicate using their container names (DNS resolution), making it suitable for production environments.

When to Use a Custom Bridge Network

A custom bridge network is useful when an application consists of multiple services running in separate containers. These containers can communicate with one another using their container names instead of IP addresses.

Create a Custom Bridge Network

docker network create mybridge

Create and Run Containers Inside the Network

docker run -it --network mybridge --name container1 busybox:1.36 sh

docker run -it --network mybridge --name container2 busybox:1.36 sh

Now, from container1, run:

ping container2

Similarly, from container2, run:

ping container1

If the ping is successful, it confirms that both containers can communicate because they are connected to the same custom bridge network.

Example Workflow for a Docker Bridge Network

Step 1: Build the Images

docker build -f Dockerfile -t flask_app:v1 .
docker build -f httpd.Dockerfile -t apache_container:v1 .

Step 2: Create the Network

docker network create bridge_app

Step 3: Run the Containers

docker run -d --name flask_new --network bridge_app flask_app:v1

docker run -d --name apache_new --network bridge_app apache_container:v1

Step 4: Verify the Network

docker network inspect bridge_app

Step 5: Test Communication

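Run the checks from inside the containers, for example with docker exec (this assumes the images include the ping utility):

docker exec flask_new ping -c 3 apache_new
docker exec apache_new ping -c 3 flask_new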

You can also use curl with the application's port to access another container's application.

Host Network

Suppose we run an application inside a Docker container.

If we do not publish the application port using -p, the application cannot be reached through the host machine's ports. It can only be reached at the container's IP address (if that address is reachable).

Example:

docker run flask_app:v1

If we publish the port:

docker run -p 5000:5000 flask_app:v1

Docker maps the host port to the container port, allowing the application to be accessed using the host machine's IP address.

Host Network Mode

docker run --network host flask_app:v1

In host network mode, the container shares the host's network stack. Therefore, the application can use the host's network directly without explicit port mapping.

Note: Host networking is supported on Linux. On Docker Desktop for Windows and macOS, host networking is limited and generally not recommended. Port mapping (-p) is the standard approach.

Docker Image Optimization

Why Is Image Optimization Needed?

Smaller images start containers faster.
Smaller images are easier to share.
Smaller images consume less storage.
Smaller images download faster.

1. Multi-Stage Builds

In a multi-stage build, the first stage builds the application, and the second stage copies only the required artifacts into the final image.

A single-stage build includes both build tools and runtime dependencies, making the image larger.

A multi-stage build keeps only the files required to run the application.

Single-Stage Build

FROM python:3.9-slim

COPY . /app

WORKDIR /app

CMD ["python", "main.py"]

Multi-Stage Build

FROM python:3.9-slim AS builder

WORKDIR /app

COPY main.py .

FROM python:3.9-slim AS runner

WORKDIR /app

COPY --from=builder /app/main.py .

CMD ["python", "main.py"]

Here:

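Builder stage prepares the application.
Runner stage copies only the required files, resulting in a smaller final image.

In this minimal example both stages use the same base image, so the saving is negligible. The benefit appears when the builder stage installs compilers or other build tools that the runner stage leaves behind.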

2. Choose a Minimal Base Image

Using a lightweight base image reduces the final image size.

Examples:

FROM python:3.9-slim

This image contains only the minimal packages needed to run Python.

FROM python:3.9-alpine

This image is even smaller and includes only minimal functionality. Additional packages must be installed separately.

It is not recommended to use:

FROM python

because Docker then pulls the image tagged latest, whose contents change over time and may introduce compatibility issues. Always specify a version tag.

3. Layer Caching

Docker caches image layers.

If no changes occur in a layer or any previous layer, Docker reuses the cached layer, making builds much faster.

The order of Dockerfile instructions is important.

If COPY . /app is placed near the beginning of the Dockerfile, any source code change invalidates all subsequent layers.

Instead, place frequently changing instructions lower in the Dockerfile whenever possible.

Less Efficient

FROM python:3.9-slim

COPY . /app

WORKDIR /app

CMD ["python", "main.py"]

Better

FROM python:3.9-slim

WORKDIR /app

COPY . /app

CMD ["python", "main.py"]

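This prevents Docker from rebuilding the WORKDIR layer unnecessarily. The saving grows when an expensive step, such as installing dependencies, sits above COPY . /app, because that step is then also served from the cache.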
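The effect of instruction order can be illustrated with a toy model in Python. This is a deliberate simplification and not Docker's actual implementation: it assumes each layer's cache key depends on the previous layer's key, the instruction text, and a fingerprint of the files the instruction reads. The function names and fingerprint strings are hypothetical.

```python
import hashlib


def layer_keys(instructions):
    """instructions: list of (instruction text, fingerprint of its input files)."""
    keys = []
    parent = ""
    for text, inputs in instructions:
        # A layer's key chains in its parent's key, so a change propagates downward.
        parent = hashlib.sha256(f"{parent}|{text}|{inputs}".encode()).hexdigest()
        keys.append(parent)
    return keys


def reusable_layers(old_build, new_build):
    """Count how many leading layers of the new build match the old build."""
    count = 0
    for old_key, new_key in zip(layer_keys(old_build), layer_keys(new_build)):
        if old_key != new_key:
            break
        count += 1
    return count


def copy_early(source):
    return [("FROM python:3.9-slim", ""), ("COPY . /app", source),
            ("WORKDIR /app", ""), ('CMD ["python", "main.py"]', "")]


def copy_late(source):
    return [("FROM python:3.9-slim", ""), ("WORKDIR /app", ""),
            ("COPY . /app", source), ('CMD ["python", "main.py"]', "")]


# Simulate a rebuild after the source code changed from v1 to v2.
early_reused = reusable_layers(copy_early("source-v1"), copy_early("source-v2"))
late_reused = reusable_layers(copy_late("source-v1"), copy_late("source-v2"))
```

In this model the copy-late ordering reuses one more layer than the copy-early ordering after a source change, which mirrors the two Dockerfiles above.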

4. Run Containers as a Non-Root User

By default, containers run as the root user.

A root user can create or modify files anywhere inside the container.

If the container is compromised, an attacker may gain elevated privileges.

Running the container as a non-root user improves security because that user has limited permissions.

Example (Windows Containers):

RUN icacls C:\app /grant ContainerUser:(OI)(CI)F

USER ContainerUser

Example (Linux Containers):

RUN useradd -m appuser

USER appuser

Running applications as a non-root user is considered a Docker best practice.

Interview Questions

Question 1

Which Docker network allows containers to communicate with each other without requiring port mapping on the host?

Answer: Bridge network.

Question 2

Which statement is true about Docker layer caching?

Answer:

Docker reuses a cached layer as long as its instruction, its inputs, and all previous layers are unchanged.

Question 3

Why is it considered best practice to run containers as a non-root user?

Answer:

It helps prevent attackers from gaining root privileges if the container is compromised.

Question 4

What is the purpose of using a multi-stage build in Docker?

Answer:

To reduce the size of the final Docker image.

Question 5

How do you specify a non-root user in Docker?

Answer:

Use the USER directive in the Dockerfile or the --user option when running the container.

Question 6

Which of the following can invalidate the Docker build cache for a layer?

Answer:

Changing the base image version.
Modifying files copied into that layer.
Adding or changing an environment variable (ENV) or build argument (ARG) used by that layer.
Changing the Dockerfile instruction itself.

Docker – ARG Directive, .dockerignore, and Docker Volumes

Ramya Perumal — Sat, 27 Jun 2026 20:30:07 +0000

ARG Directive

The ARG directive acts like a variable. We can define it inside the Dockerfile and change its value during the image build process.

ARG PYTHON_VERSION=3.8
FROM python:${PYTHON_VERSION}-slim

Here, the Python version in the Dockerfile is set to 3.8. However, during the build process, we can change it to 3.10.

docker build -f Dockerfile --build-arg PYTHON_VERSION=3.10 -t helloworld_flask:v1 .

-t means tag.
-f means Dockerfile path.

We can use ARG to make base image versions and other build-time values configurable.

Example

FROM node:20-alpine

# 1. Define the arguments
ARG APP_DIR=app
ARG INSTALL_ARGS="--omit=dev"

# 2. Use them in instructions
WORKDIR /${APP_DIR}
COPY . .
RUN npm install ${INSTALL_ARGS}

Note: ARG values can only be changed during the image build process. They cannot be changed during container creation.

Docker Ignore

A .dockerignore file is used to specify files and directories that should not be copied into the Docker build context.

Create a file named .dockerignore in the application's root directory.

Examples of files and folders that can be ignored:

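Dockerfile
.venv
__pycache__
*.pyc
.git
.gitignore

Do not ignore requirements.txt if the Dockerfile installs dependencies from it, because ignored files are not available to COPY during the build.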

Ignoring unnecessary files reduces the build context size and speeds up image builds.

Docker Volumes

Generally, when a container is created, a writable layer is also created.

If we create files inside the container, they are stored in the writable layer. However, when the container is deleted, all data in the writable layer is lost.

What if we need to store files permanently on the host machine?

This is where Docker volumes come into the picture.

Docker volumes allow data to persist independently of the container lifecycle.

When a volume is mounted between a host directory and a container directory:

Files created in the container appear on the host machine.
Files created on the host machine appear inside the container.
Changes are synchronized between both locations.

Types of Docker Volumes

Bind-Mounted Volumes
Docker Managed Volumes (Named Volumes)

1. Bind-Mounted Volumes

A bind mount creates a mapping between a host directory and a container directory.

docker run -it -v ./data:/data busybox:1.36 sh

Here:

./data = Host machine directory (older Docker versions accept only an absolute host path here)
/data = Container directory

Characteristics:

Tightly coupled with the host file system.
Multiple containers can share the same host directory.
Changes made in either location are reflected in the other.

Note: The host directory is not deleted when the container is removed.

2. Docker Managed Volumes (Named Volumes)

Create a Docker-managed volume:

docker volume create dockersession

This creates a volume outside the container lifecycle.

List Volumes

docker volume ls

Inspect a Volume

docker volume inspect dockersession

This displays information about the volume, including its mount location.

Example Linux location:

/var/lib/docker/volumes/dockersession/_data

Mount the Volume to a Container

docker run -it -v dockersession:/data123 busybox:1.36 sh

Here:

dockersession = Volume name
/data123 = Container directory

Multiple containers can use the same volume for data sharing.

Find Containers Using a Specific Volume

docker ps -a --filter volume=dockersession

One of the major benefits of Docker volumes is that they are completely decoupled from the container lifecycle.

When a container is deleted, the volume and all its data remain safely stored on the host machine.

Interview Questions

Question:

What is the primary use of the ARG instruction in Docker?

Answer: To pass build-time variables to the Dockerfile.

Question:

Which of the following is true about ARG variables in Docker?

Answer: They are used only during the image build process.

Question:

Can an ARG variable be used in a RUN instruction within a Dockerfile?

Answer: Yes, but only after it has been declared.

Question:

Which files can be ignored using .dockerignore?

Answer: Any file or directory within the build context.

Question:

What is the purpose of Docker volumes?

Answer: To store data that persists even after a container is destroyed.

Question:

What is the default location of Docker volumes on Linux systems?

Answer:

/var/lib/docker/volumes

Question:

Which command allows you to list all Docker volumes?

Answer:

docker volume ls

Question:

In which scenario would you use a bind-mounted volume?

Answer: When you need to share specific directories between the host machine and a container.

Docker – Port Mapping, Logs, Container Management, and Image Removal

Ramya Perumal — Sat, 27 Jun 2026 20:29:51 +0000

Docker – Logs, Remove, and Port Mapping

Port Mapping

When we run an application inside a container, we define the port on which the application will run. The container runs the application on that port.

An application running inside a container cannot be reached through the host machine's address unless port mapping is configured.

Port mapping is specified using the -p option.

Syntax:

docker run -p <host_port>:<container_port> <image>

Example:

docker run -p 8888:8080 <image>

Here:

8888 is the host (machine) port used to access the application from outside the container.
8080 is the port defined in the application and exposed inside the container.

If the host port is already in use, Docker displays an error message.

How to Check Which Ports Are Being Used by Docker Containers

docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Ports}}"

Detached Mode

To run a container in detached mode, use the -d option.

docker run -d -p 2000:2001 <image>

Detached mode means the container runs in the background.

Docker Logs

Logs contain information about the activities happening inside a container.

View Logs of a Specific Container

docker logs <container_id>

Follow Logs Continuously

docker logs -f <container_id>

-f stands for follow.

View Logs from a Specific Time Range

Seconds (s): docker logs --since 30s <container_id>
Minutes (m): docker logs --since 5m <container_id>
Hours (h): docker logs --since 2h <container_id>

Docker supports seconds (s), minutes (m), and hours (h) for relative time.

For days, weeks, months, or years, pass an absolute ISO 8601 date or timestamp instead. Taking 2026-06-21 as today:

One week back:  docker logs --since 2026-06-14 <container_id>
One month back: docker logs --since 2026-05-21 <container_id>
One year back:  docker logs --since 2025-06-21 <container_id>

Combined Time Units

docker logs --since 1h30m <container_id>

Displays logs generated in the last 1 hour and 30 minutes.

Exact Time

docker logs --since "2026-06-21T17:30:00" <container_id>

Displays logs generated since the specified date and time.

Docker Inspect

docker inspect is used to view detailed information about a container, image, network, or volume.

docker inspect <container_id>

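Access a Running Container

To open a shell inside a running container:

docker exec -it <container_id> sh

-i = Interactive mode
-t = Allocate a terminal

docker exec works only on running containers. An exited container must be started first with docker start <container_id>.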

Docker Remove

Delete an Exited Container

docker rm <container_id>

Delete a Running Container

docker rm -f <container_id>

-f forcefully stops and removes the container.

Alternative Method

First stop the container:

docker stop <container_id>

Then remove it:

docker rm <container_id>

Delete Multiple Containers

docker rm -f <container_id1> <container_id2> <container_id3>

Alternative Method (Windows)

FOR /F "tokens=*" %i IN ('docker ps -aq') DO docker rm -f %i

This command removes all containers.

Listing Containers

View Running Containers

docker ps

ps stands for Process Status.

This command lists only running containers.

View All Containers

docker ps -a

This command lists all containers, including exited containers.

-a stands for all.

View Only Container IDs

docker ps -aq

This command lists all container IDs.

Delete Images

docker rmi -f <image_id>

This command forcefully removes an image.

Naming a Container

To assign a name to a container:

docker run -d --name anyname busybox:1.36

Example:

docker run -d --name my-container busybox:1.36

This creates a container with the name my-container.

Docker – Image and Container Bonding & Client-Server Architecture

Ramya Perumal — Sat, 27 Jun 2026 20:29:32 +0000

Question:

A container is running from an image. If we try to delete the image while the container is running, why can't we delete the image?

Answer:

If we attempt to delete the image, Docker will display an error message stating that the image is being used by a container.

The reason lies in how images are built. Instructions in a Dockerfile that change the filesystem (such as RUN, COPY, and ADD) each create a layer.

Each layer is read-only. Once a layer is created, it cannot be modified. If we want to change the image, we need to update the Dockerfile and build a new image.

When we create a container, only a reference to the image layers is passed to the container. The container gets its own writable layer on top, meaning we can create files, modify files, or generate output inside the container. These changes reside only in the container and do not affect the image.

However, the container always depends on the image layers beneath its writable layer. That is why we cannot delete the image while the container is using it.

This strategy is called Copy-on-Write (CoW): a file from the image is copied into the writable layer only when the container modifies it.
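The relationship between the shared read-only layers and each container's writable layer can be modeled with Python's standard library. This is only an analogy under stated assumptions: `ChainMap` plays the role of the layered filesystem, a read-only mapping plays the image, and the file paths are invented.

```python
from collections import ChainMap
from types import MappingProxyType

# The image: a read-only view that cannot be modified after it is built.
image = MappingProxyType({
    "/app/main.py": "print('v1')",
    "/etc/os-release": "busybox",
})


def create_container(image_layers):
    """A container is an empty writable layer stacked on top of the shared image."""
    return ChainMap({}, image_layers)


container1 = create_container(image)
container2 = create_container(image)

# Writes always land in the first mapping, which is the container's writable layer.
container1["/app/main.py"] = "print('edited')"

writable_layer_1 = dict(container1.maps[0])   # holds only the modified file
writable_layer_2 = dict(container2.maps[0])   # still empty
```

Both containers read unmodified files straight from the shared image, which is why the image cannot be removed while either of them still references it.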

Question:

If an image size is 10 GB, what will be the container size?

Answer:

A container holds only references to the image layers, so creating it does not copy the 10 GB. The container's own size is just the files and changes stored in its writable layer, which is close to zero when the container starts.

Question:

What command is used to find the size of a container?

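Answer:

docker ps --size

The SIZE column shows the size of the writable layer, with the virtual size (image plus writable layer) in parentheses.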

The Dive application is used to analyze each layer in a Docker image.

Client-Server Architecture

Docker works based on the Client-Server Architecture.

The Docker client sends requests to the Docker daemon, and the Docker daemon responds to the client.

If a requested image is not found on the local system, Docker retrieves it from a registry (Docker Hub by default).

docker run hello-world

If the image is not already present locally, this command pulls it from the registry and then runs it on the local system.

To pull an image from an image repository

docker pull busybox:1.36

To create a container

docker run busybox:1.36

To create a container, execute the `ls` command inside it, and exit immediately

docker run busybox:1.36 ls

To create a container and enter interactive mode

docker run -it busybox:1.36

Interview Questions

Question 1:

In a Client-Server architecture, who sends requests to the server?

Answer: The client.

Question 2:

What happens when a container is created from an image?

Answer: A writable layer is created.

Question 3:

What is the purpose of the writable layer in a container?

Answer: To handle file modifications and store changes made inside the container.

Question 4:

The read-only layers in Docker come from what?

Answer: The base image and its image layers.

Question 5:

What happens to the writable layer when the container is deleted?

Answer: It is discarded.

Question 6:

In a Docker environment, the client interacts with what?

Answer: The Docker daemon.

Question 7:

How do you start a container in interactive mode?

Answer:

docker run -it <image-name>

Docker – Need for Docker and Docker Terminologies

Ramya Perumal — Sat, 27 Jun 2026 20:29:13 +0000

Need for Docker

When more than one application runs on a single physical machine, all the applications have to share the machine's memory, CPU, and computational resources.

Suppose one application consumes more computational power. In that case, the other applications may become slow or even stop responding.

One solution is to run each application on a separate physical server. Although this provides better performance and isolation, the infrastructure cost and maintenance cost become very high.

To overcome this problem, the concept of Virtual Machines (VMs) was introduced.

In a virtual machine environment, each application runs on its own operating system while sharing a single physical machine.

Benefits of Virtual Machines

Reduced hardware cost.
Lower maintenance cost.
Better isolation between applications.
Multiple operating systems can run on a single physical machine.

A software component called a Hypervisor is responsible for virtualizing the physical machine and allowing multiple virtual machines to run on it.

Types of Hypervisors

There are two types of Hypervisors.

Type 1 Hypervisor

A Type 1 Hypervisor is installed directly on the physical machine (bare metal).

Type 2 Hypervisor

A Type 2 Hypervisor is installed on top of a host operating system.

We generally use Type 2 Hypervisors on personal computers. They allocate virtual resources to each virtual machine either manually or dynamically.

However, virtual machines still require a complete operating system for every application, which consumes a significant amount of memory and storage.

This means a large share of the machine's resources goes to running duplicate operating systems rather than the applications themselves.

To overcome this limitation, Containers were introduced.

Containers include only the minimum libraries and dependencies required to run an application.

Benefits of Containers

Containers consume much less memory than virtual machines.
Containers start much faster.
Containers are lightweight.
Containers are portable.

An application is packaged as a Docker Image, which can be shared with any number of users and run consistently across different environments.

Docker Terminologies

For better understanding, let's compare Docker concepts with a kitchen.

Docker Concept	Kitchen Analogy
Docker Engine	The kitchen where everything happens.
Dockerfile	The recipe that contains the ingredients and preparation steps.
Docker Image	The finished dish prepared using the recipe.
Docker Container	A serving (portion) of the finished dish.
Docker Registry	A pantry that stores many dishes (images) with different tags.
Docker Daemon	The chef who prepares the dish (image) by following the recipe (Dockerfile).

Installing Docker

Download and install Docker Desktop.

Once the installation is complete, Docker is ready to use.

List Images Available on the Local Machine

docker images

This command lists all Docker images available on the local machine.

Pull an Image from Docker Hub

If the requested image is not available locally, Docker automatically downloads it from the Docker Registry.

docker pull hello-world

If no tag (version) is specified, Docker uses the tag latest by default. Note that latest is only a tag name and is not guaranteed to point to the newest release.

docker pull hello-world:latest

To download a specific tagged version:

docker pull hello-world:nanoserver-ltsc2025

Create a Container from an Image

docker run hello-world:latest

List Containers

docker ps -a

-a displays all containers, including exited containers.

Interview Questions

Question 1

What is a Virtual Machine (VM)?

Answer:

A Virtual Machine is a software emulation of a physical computer. It behaves like a separate computer with its own operating system.

Question 2

What does a Hypervisor do in virtualization?

Answer:

A Hypervisor allows multiple virtual machines to run on a single physical host by managing and allocating hardware resources.

Question 3

What are the advantages of using containers?

Answer:

Containers are lightweight.
Containers start much faster than virtual machines.
Containers are portable and can run consistently across different environments.
Containers package only the required dependencies.

Note:

Containers share the host operating system kernel. If the host kernel encounters a critical issue, all containers may be affected.

Virtual machines have separate operating systems. Therefore, if one virtual machine crashes, the others continue to run independently.

Question 4

Which type of Hypervisor runs directly on physical hardware?

Answer:

Type 1 Hypervisor.

Question 5

What is the difference between a Virtual Machine and a Container?

Answer:

Containers share the host operating system kernel.
Virtual machines run their own operating system on top of a Hypervisor.

Question 6

What is Docker primarily used for?

Answer:

Docker is primarily used to containerize applications, which makes them portable, keeps their behavior consistent across environments, and packages only the minimum dependencies required to run them.

RAG - Mastering Prompt Frameworks for Better AI Responses

Ramya Perumal — Tue, 09 Jun 2026 03:16:36 +0000

Prompting techniques such as zero-shot, one-shot, few-shot, system prompting, role prompting, contextual prompting, Chain of Thought, Tree of Thoughts, and self-consistency prompting serve mainly as sources of inspiration rather than fixed recipes.

There are also structured frameworks that can be followed to generate better outputs from LLMs.

How do we know these frameworks are effective?

They have been developed and refined through extensive trial-and-error by practitioners and researchers who have experimented with different prompting approaches.

CRISP Framework

C (Context/Capacity)
Define the expertise or capability the AI should assume.

R (Role/Request)
Clearly specify the task to be performed.

I (Instructions/Insight)
Provide relevant context and information needed to complete the task.

S (Style/Specification)
Define constraints, requirements, and formatting expectations.

This can be closely associated with system prompting.

P (Purpose/Presentation)
Control the output format and explain the intended purpose of the response.

Context vs Purpose

Context and Purpose may appear similar, but they are different.

Example

Context:
I have an exam tomorrow, so I am asking this question.

Purpose:
If you provide a good answer, I will be able to perform well in the exam.

The context explains the background, while the purpose explains the reason or intended outcome.

When to Use CRISP

Creating content
Generating documents
Writing blogs

RICE Framework

R (Role)
Assign a specific role to the AI.

I (Instructions)
Clearly define what needs to be done.

C (Context)
Provide background information and relevant details.

E (Expectations)
Specify the desired outcome, format, and quality expectations.

Example :
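As an illustration only (this is not the author's original example), a RICE prompt could be assembled as sketched below. The function name `build_rice_prompt`, the section labels, and the sample product are assumptions made for this sketch.

```python
def build_rice_prompt(role: str, instructions: str, context: str, expectations: str) -> str:
    """Assemble a prompt with one labeled section per RICE element."""
    sections = [
        ("Role", role),
        ("Instructions", instructions),
        ("Context", context),
        ("Expectations", expectations),
    ]
    # Each section becomes "Label:\ntext", separated by a blank line.
    return "\n\n".join(f"{label}:\n{text.strip()}" for label, text in sections)


prompt = build_rice_prompt(
    role="You are a product manager.",
    instructions="Draft a roadmap for the next two quarters.",
    context="The product is a note-taking app built by a team of four engineers.",
    expectations="Return a bulleted list grouped by quarter, with at most five items each.",
)
print(prompt)
```

Keeping the four elements as separate arguments makes it easy to see when one of them has been left empty.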

When to Use RICE

Planning
Requirement gathering
Product roadmap creation

RAG - Prompt Engineering

Ramya Perumal — Mon, 08 Jun 2026 00:56:15 +0000

Prompt engineering is the process of designing and structuring prompts to get better results from an LLM.

In a RAG application, a prompt template typically contains:

User query
Retrieved documents from the vector database
Additional context or instructions

The quality of the prompt plays a major role in determining the quality of the response generated by the LLM.

There are several prompting techniques that can be used depending on the use case.

Zero-Shot Prompting

In zero-shot prompting, only the query or instruction is provided to the LLM without any examples.

The model generates a response based on its pre-trained knowledge and the given prompt.

Example

Prompt:
How do I make tea?

No examples are provided.

The LLM generates the answer directly.

One-Shot and Few-Shot Prompting

Providing examples helps the LLM understand the expected format and style of the response.

One-Shot Prompting

In one-shot prompting, a single example is provided along with the query.

Example

Prompt:

How to make coffee?

Step 1: Boil water
Step 2: Add coffee powder
Step 3: Mix well
Step 4: Serve

How to make tea?

The LLM will likely generate the tea-making instructions using the same format.

Few-Shot Prompting

In few-shot prompting, multiple examples are provided before the actual query.

The model learns the expected structure, style, and pattern from the examples and generates responses accordingly.
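A minimal sketch of how a few-shot prompt might be built in code. The `Q:`/`A:` layout, the function name `build_few_shot_prompt`, and the lemonade example are assumptions for illustration; any consistent layout works.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Place worked question/answer pairs before the real query."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    # The final block leaves the answer empty so the model completes it.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


examples = [
    ("How to make coffee?",
     "Step 1: Boil water\nStep 2: Add coffee powder\nStep 3: Mix well\nStep 4: Serve"),
    ("How to make lemonade?",
     "Step 1: Squeeze lemons\nStep 2: Add water and sugar\nStep 3: Stir\nStep 4: Serve"),
]
few_shot_prompt = build_few_shot_prompt(examples, "How to make tea?")
print(few_shot_prompt)
```

Passing a single example to the same function gives a one-shot prompt.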

Advantages
Better formatting consistency
Improved accuracy
Better task understanding

Disadvantages
Higher token consumption
Increased cost and latency

System Prompting

System prompting is used to define rules, constraints, and behavior for the LLM.

The model is expected to operate within these boundaries.

Examples

You must return the output in JSON format.
Do not include floating-point values in the response.
Answer using only the information provided in the context.

System prompts are commonly used in production RAG applications to control model behavior.

Role Prompting

In role prompting, the LLM is instructed to behave as a specific role, profession, or expert.

Examples

Act as a Python developer.
Act as a cybersecurity expert.
Act as a technical interviewer.

Role prompting helps the model generate responses from a particular perspective and expertise level.

Contextual Prompting

Contextual prompting provides background information to help the LLM better understand the situation and generate a more relevant response.

Example

I have an exam tomorrow, and this is a difficult subject for me.
Please answer the following question in a simple and easy-to-understand manner.

The additional context helps the model tailor its response to the user's situation.

Chain of Thought Prompting

Chain of Thought (CoT) prompting is a technique where the model is instructed to analyze the input step by step before giving the final answer.

This helps the LLM break down complex problems into smaller logical steps, leading to better reasoning and more accurate results.

Self-Consistent Prompting

In this approach, the LLM is asked to:

Try solving the same problem using multiple reasoning paths
Generate multiple possible answers
Select the answer that appears most frequently or is the most consistent

This improves reliability because a single faulty reasoning path is less likely to decide the final answer.
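The selection step can be sketched as a simple majority vote. The sketch assumes the answers have already been sampled from the model (for example, by running the same prompt several times); the function name and the sample answers are hypothetical.

```python
from collections import Counter


def most_consistent_answer(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent answer and the share of samples that agree with it."""
    # Light normalization so "42" and "42 " count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


sampled_answers = ["42", "42 ", "41", "42", "40"]
answer, agreement = most_consistent_answer(sampled_answers)
print(answer, agreement)
```

The agreement ratio is a rough confidence signal: a low value suggests the reasoning paths disagree and the answer deserves a second look.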

Tree of Thoughts

Tree of Thoughts extends Chain of Thought prompting and goes a step further than self-consistent prompting.

Instead of following a single reasoning path, the LLM:

Explores multiple possible solution paths
Evaluates each path
Decides which path is most promising
Expands only the best or optimal paths further

This creates a tree-like structure of reasoning, where different branches represent different thought processes.

Tree of Thoughts is useful for complex problem-solving tasks that require exploration and decision-making.

Prompt Chaining

Prompt chaining is a technique where the output of one prompt is used as the input for another prompt.

In this approach:

A problem is broken into multiple stages
Each stage is handled by a separate prompt
The result of one prompt flows into the next

This creates a pipeline of prompts, allowing complex tasks to be solved step by step in a structured manner.

Prompt chaining is commonly used in workflows where tasks need decomposition and sequential processing.
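A minimal sketch of such a pipeline. `fake_llm` is a stand-in that only echoes its prompt so the example runs without a model; the `{input}` placeholder and the two stage templates are assumptions for illustration.

```python
from typing import Callable


def run_chain(llm: Callable[[str], str], templates: list[str], initial_input: str) -> str:
    """Feed each stage's output into the next stage's template."""
    current = initial_input
    for template in templates:
        prompt = template.format(input=current)
        current = llm(prompt)
    return current


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: it just tags the prompt it received.
    return f"<answer to: {prompt}>"


stages = [
    "Summarize this text: {input}",
    "List the key risks in this summary: {input}",
]
result = run_chain(fake_llm, stages, "raw incident report")
print(result)
```

In a real workflow, `fake_llm` would be replaced by a call to the model, and each stage's output could be validated before it is passed on.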

Combining Prompting Techniques

For better performance, multiple prompting techniques can be combined.

Examples

System Prompting + User Prompting
System Prompting + Few-Shot Prompting
Role Prompting + Contextual Prompting
Role Prompting + Few-Shot Prompting
System Prompting + Role Prompting + Contextual Prompting
Chain of Thought + Prompt Chaining
Self-Consistent Prompting + Few-Shot Prompting
Tree of Thoughts + Role Prompting + System Prompting

Example

You are a senior Python developer.

Answer only using the provided context.

Provide the response in JSON format.

Example:
{
"language": "Python",
"difficulty": "Easy"
}

Question:
How do Python dictionaries work?

This prompt combines:

System Prompting
Role Prompting
One-Shot Prompting

Prompt Template in RAG

A typical RAG prompt template consists of:

System Instructions
Retrieved Context/Documents
User Query

Example

You are a helpful assistant.

Context:

Question:
What is vector chunking?

Answer:

The LLM uses the retrieved documents, instructions, and user query together to generate an accurate and human-readable response.
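A minimal sketch of filling such a template in code. The `{context}` and `{question}` placeholder names, the chunk numbering, and the sample chunks are assumptions made for this illustration.

```python
RAG_TEMPLATE = """You are a helpful assistant.
Answer using only the information provided in the context.

Context:
{context}

Question:
{question}

Answer:"""


def build_rag_prompt(chunks: list[str], question: str) -> str:
    """Number the retrieved chunks and drop them into the template."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return RAG_TEMPLATE.format(context=context, question=question)


rag_prompt = build_rag_prompt(
    ["Chunking splits documents into smaller pieces.",
     "Each chunk is embedded separately."],
    "What is vector chunking?",
)
print(rag_prompt)
```

Numbering the chunks makes it possible to ask the model to cite which chunk supported its answer.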

Key Takeaway

There is no single prompting technique that works best for every scenario.

The choice depends on:

Application requirements
Cost constraints
Token limits
Desired output format
Accuracy requirements

In real-world applications, combining multiple prompting techniques often produces the best results.

ReAct (Reasoning + Acting)

ReAct (Reasoning + Acting) is a prompting pattern that improves the performance of LLMs by combining reasoning with external tool usage.

In this approach, the model not only thinks about the problem but also decides when to take action by calling external tools or functions.

Why ReAct is Needed

If we ask a question like:

“What is the current temperature?”

A standard LLM cannot directly know real-time information such as current weather or live data.

However, it can:

Understand the intent of the question
Identify that external information is required
Decide to use an available tool (e.g., weather API)
Use the tool output to generate the final response

How ReAct Works

ReAct follows a loop of:

1. Reasoning

The LLM analyzes the question and determines what is needed.

What is the user asking?
Do I already know the answer?
Do I need external data?

2. Action

If external data is required, the model selects an appropriate tool or function.

Examples of tools:

Weather API
Calculator
Search engine
Database query

3. Observation

The tool returns results, and the LLM observes the output.

4. Final Answer Generation

The LLM combines:

Reasoning
Tool output
Context

and generates the final human-readable response.

Example

User Query:

“What is the current temperature in Chennai?”

Step 1: Reasoning

The model understands that this requires real-time data.

Step 2: Action

It calls a weather API tool.

Step 3: Observation

Tool returns:
“32°C, partly cloudy”

Step 4: Final Answer

“The current temperature in Chennai is 32°C with partly cloudy conditions.”
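The loop above can be sketched in a few lines. Everything here is a stand-in: `scripted_model` imitates the LLM with fixed rules, `get_weather` returns a canned reading instead of calling a real API, and the `Action: tool[argument]` text format is an assumption chosen for this sketch.

```python
from typing import Callable


def get_weather(city: str) -> str:
    # Hypothetical tool with a canned reply; a real one would call a weather API.
    return "32°C, partly cloudy"


TOOLS: dict[str, Callable[[str], str]] = {"get_weather": get_weather}


def scripted_model(history: list[str]) -> str:
    # Stand-in for the LLM: request the tool first, then answer from the observation.
    observations = [h for h in history if h.startswith("Observation: ")]
    if not observations:
        return "Action: get_weather[Chennai]"
    reading = observations[-1].removeprefix("Observation: ")
    return f"Final Answer: The current weather in Chennai is {reading}."


def react_loop(question: str, model: Callable[[list[str]], str], max_steps: int = 5) -> str:
    history = [f"Question: {question}"]
    for _step in range(max_steps):
        step = model(history)  # Reasoning: the model decides what to do next
        history.append(step)
        if step.startswith("Final Answer: "):
            return step.removeprefix("Final Answer: ")
        if step.startswith("Action: "):
            name, _sep, arg = step.removeprefix("Action: ").partition("[")
            result = TOOLS[name](arg.rstrip("]"))  # Action: run the requested tool
            history.append(f"Observation: {result}")  # Observation: feed the result back
    return "No answer within the step limit."


final = react_loop("What is the current temperature in Chennai?", scripted_model)
print(final)
```

The `max_steps` limit keeps the loop from running forever if the model never produces a final answer.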

Key Idea of ReAct

ReAct allows LLMs to:

Think (Reason)
Act (Use tools)
Improve accuracy using real-world data

Benefits of ReAct

Reduces hallucination
Enables real-time information access
Improves reasoning accuracy
Makes LLMs more agent-like

Where to Use

Research activities
Troubleshooting in Kubernetes
DevOps support

RAG - Hybrid search and RAG pipeline using FAISS DB

Ramya Perumal — Sun, 31 May 2026 23:45:52 +0000

Hybrid Search

Hybrid search is a combination of dense embeddings and sparse embeddings.

Dense embeddings focus on semantic meaning, while sparse embeddings focus on exact keyword matching. By combining both approaches, hybrid search improves retrieval accuracy and relevance.

OpenSearch is commonly used as a search engine for:

Log analysis
Observability and monitoring

One of the key features of OpenSearch is hybrid search, which combines:

Vector search (dense retrieval)
BM25-based search (sparse retrieval)

BM25 internally uses concepts such as:

TF (Term Frequency)
IDF (Inverse Document Frequency)

This allows OpenSearch to retrieve documents based on both semantic meaning and exact keyword matches.
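One generic way to combine the two result sets is to rescale each retriever's scores and take a weighted sum. The sketch below illustrates that idea only; it does not show OpenSearch's own configuration or API, and the function names, the `alpha` weight, and the sample scores are assumptions.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the 0..1 range so the two retrievers are comparable."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def hybrid_rank(dense: dict[str, float], sparse: dict[str, float],
                alpha: float = 0.5) -> list[tuple[str, float]]:
    """Blend normalized dense and sparse scores; alpha is the weight on the dense side."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    docs = set(dense_n) | set(sparse_n)
    # A document missing from one retriever's results contributes 0 for that side.
    combined = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


dense_scores = {"doc_a": 0.82, "doc_b": 0.75, "doc_c": 0.40}  # e.g. cosine similarities
sparse_scores = {"doc_b": 12.0, "doc_c": 9.0, "doc_d": 3.0}   # e.g. BM25 scores
ranking = hybrid_rank(dense_scores, sparse_scores)
print(ranking)
```

Here `doc_b` ranks first because it scores well in both retrievers, while `doc_a` and `doc_d` each appear in only one list.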

RAG Cycle

A Retrieval-Augmented Generation (RAG) system consists of the following stages:

1. Document Ingestion

Documents are split into chunks using a chunking strategy.

2. Embedding Generation

Each chunk is converted into an embedding vector using an embedding model.

3. Storage

The generated vectors are stored in a vector database.

4. Retrieval

When a user submits a query:

The query is converted into an embedding vector
Similar documents are retrieved from the vector database

5. Augmentation

The Augmentor combines:

User query
Retrieved documents/chunks
Prompt instructions

This combined context is then sent to the LLM.

6. Generation

The LLM processes the augmented context and generates a human-readable response.

RAG Flow

Documents
↓
Chunking
↓
Embeddings
↓
Vector Database
↓
User Query
↓
Retrieval
↓
Augmentation
(Query + Retrieved Documents + Instructions)
↓
LLM
↓
Human Readable Response

FAISS

FAISS (Facebook AI Similarity Search) is an open-source library used for efficient vector similarity search.

FAISS is commonly used to:

Store vector indexes locally
Perform similarity search efficiently
Build small to medium-scale RAG applications

Advantages

Fast similarity search
Open source
Easy to set up
Works well for local development and prototyping

Limitations

FAISS primarily stores indexes in memory or local files. Because of this:

It is not a full-fledged vector database
Managing very large datasets becomes challenging
Continuous streaming and real-time updates are more difficult compared to dedicated vector databases

When to Use FAISS

FAISS is a good choice when:

Building proof-of-concept projects
Developing small to medium-sized RAG applications
Running local experiments

When to Consider a Vector Database

A dedicated vector database is worth considering for large-scale applications that require:

Billions of vectors
Real-time updates
Continuous data ingestion

RAG - Sparse Embedding

Ramya Perumal — Wed, 27 May 2026 02:09:56 +0000

Sparse means thinly spread, scattered, or not dense.

In sparse embeddings, each chunk is split into tokens, and the chunk is represented as a vector with one position per word in the vocabulary dictionary.

If a vocabulary word is present in the chunk, its position is assigned 1; otherwise, it is assigned 0.

Example

[0,0,0,1,0,0,1,0,...]

If the vocabulary dictionary contains 10,000 words, the vector representation will also contain 10,000 dimensions.

For a particular chunk:

Only a few positions may contain values like 1
Most other positions will contain 0

Unlike dense embeddings, sparse embeddings are mostly zeros, and their non-zero values depend on token occurrence and frequency rather than on learned semantic features.

Why Do We Use Sparse Embeddings?

Sparse embeddings are mainly used for direct text matching and keyword-based retrieval.

They are useful when:

Exact keyword matching is important
Semantic understanding is not the primary requirement
Traditional search behavior is needed

Basic Sparse Representation

In the basic sparse approach:

The chunk's tokens are compared with the vocabulary dictionary
If a vocabulary word occurs in the chunk, its position becomes 1
Otherwise, its position becomes 0

This is similar to one-hot encoding, except that several positions can be 1 at the same time.
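A minimal sketch of this basic representation, assuming whitespace tokenization and a tiny six-word vocabulary (both chosen only for illustration):

```python
def binary_vector(chunk: str, vocabulary: list[str]) -> list[int]:
    """Mark each vocabulary position 1 if that word occurs in the chunk, else 0."""
    tokens = set(chunk.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


vocabulary = ["cache", "database", "embedding", "query", "redis", "vector"]
vector = binary_vector("The query is matched against the vector database", vocabulary)
print(vector)  # [0, 1, 0, 1, 0, 1]
```

The vector length always equals the vocabulary size, which is why real sparse vectors have thousands of positions that are mostly 0.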

Drawback of Basic Sparse Representation

The main drawback is that it does not consider how many times a word appears in the document.

For example:

If the word “database” appears 20 times and another word appears only once, both still receive the same value of 1.

To solve this problem, the concept of token weighting was introduced.

Term Frequency (TF)

TF stands for Term Frequency.

It measures how frequently a term appears in a document.

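A common form of the formula is:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)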

TF gives higher importance to terms that appear more frequently in a document.

Issue with TF

The problem with TF is that commonly occurring words may receive very high importance even if they are not meaningful.

For example:

“the”
“is”
“and”

These words appear frequently in most documents but do not provide strong contextual meaning.

To solve this issue, IDF was introduced.

Inverse Document Frequency (IDF)

IDF stands for Inverse Document Frequency.

It measures how rare or important a word is across the entire document collection.

Common words receive lower scores
Rare and meaningful words receive higher scores

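A common form of the formula is:

IDF(t) = log(N / df(t))

where N is the total number of documents and df(t) is the number of documents that contain term t.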

Issue with IDF

IDF alone does not determine how relevant a document is to the user query.

It only measures the rarity of terms across documents.

To improve retrieval quality, TF and IDF are combined together.

TF-IDF

TF-IDF combines:

Term Frequency (TF)
Inverse Document Frequency (IDF)

The formula is:

TF-IDF(t, d) = TF(t, d) × IDF(t)

TF-IDF works well for many traditional search systems because it balances:

Word frequency within the document
Word importance across all documents

However, TF-IDF still does not fully capture semantic meaning.
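A minimal sketch of the calculation, assuming one common textbook variant: TF is the term count divided by the document length, and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries often use smoothed variants, so their exact numbers may differ. The three-document corpus is made up for illustration.

```python
import math


def tf(term: str, doc: list[str]) -> float:
    """Share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)


def idf(term: str, corpus: list[list[str]]) -> float:
    """log(N / df): higher when fewer documents contain the term."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, corpus)


corpus = [
    "the database stores the vectors".split(),
    "the cache stores the answers".split(),
    "the query hits the cache".split(),
]
print(tf_idf("the", corpus[0], corpus))       # 0.0, because "the" is in every document
print(tf_idf("database", corpus[0], corpus))  # positive, because only one document has it
```

The word "the" has the highest term frequency in the first document, yet its score drops to zero once IDF is applied.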

BM25 (Best Matching 25)

BM25 is an advanced ranking algorithm used in sparse retrieval systems.

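It improves upon TF-IDF by considering:

Term frequency, with saturation so that repeated occurrences of a term add less and less to the score
Document length, normalized against the average document length so that long documents are not favored unfairly
Query relevance, by summing the scores of the individual query terms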

BM25 is one of the most commonly used algorithms in traditional search engines and sparse retrieval systems.
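A minimal sketch of BM25 scoring for one document. It assumes one widely used variant of the formula (IDF with a +1 inside the logarithm so it never goes negative) and typical default parameters `k1 = 1.5` and `b = 0.75`; search engines may use slightly different variants and defaults. The corpus is made up for illustration.

```python
import math


def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Score one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)        # documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        freq = doc.count(term)                          # term frequency in this document
        length_norm = 1 - b + b * len(doc) / avg_len    # grows with document length
        # k1 controls how quickly repeated occurrences stop adding to the score.
        score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return score


corpus = [
    "redis stores cached answers in memory".split(),
    "the vector database stores embeddings for every document chunk in the index".split(),
    "bm25 ranks documents by keyword overlap".split(),
]
query = "vector database".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

Only the second document contains the query terms, so it is the only one with a non-zero score.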

Limitation of Sparse Embeddings

Sparse embeddings alone are usually not enough to retrieve highly relevant documents in modern RAG systems because they mainly focus on exact keyword matching rather than semantic meaning.

For example:

“car” and “automobile” may not match
“feline” and “cat” may not match

Even though the meanings are similar.

Hybrid Search

To improve retrieval quality, modern systems combine:

Dense embeddings
Sparse embeddings

This approach is called hybrid search.

Typical Combination

Dense retrieval → Sentence transformers or embedding models
Sparse retrieval → BM25

Dense embeddings help with semantic understanding, while sparse embeddings help with exact keyword matching.

Together, they provide better retrieval performance in RAG applications.

RAG - Dense Embedding

Ramya Perumal — Wed, 20 May 2026 03:03:30 +0000

Dense means closely packed: most positions in the vector hold a non-zero value.

When text is converted into a numerical representation called a vector (point) that contains continuous values, it is called a dense embedding.

Unlike sparse vectors, where many values are zero, dense vectors contain meaningful numerical values across most dimensions.

Example

A dense vector may look like:
[0.123, -0.456, 0.789, 0.245, ...]

Multi-Dimensional Representation

Each vector is represented in an n-dimensional space.
This means:

Every value in the vector represents one dimension
Each dimension typically holds a non-zero numerical value
Similar meanings are stored closer together in vector space

All vectors are stored in a mathematical space called latent space.

Words or sentences with similar meanings are usually positioned closer together inside this latent space.

How Dense Embeddings are Generated

To convert text into vectors, we can use:

Embedding Models
Examples:

nomic-embed-text
BGE (Beijing Academy of Artificial Intelligence General Embedding) models

Transformer Models
Examples:

all-MiniLM-L6-v2
Nomic Transformer

These models are commonly available through:

Hugging Face
Ollama

Relationship Between LLMs and Transformers

LLMs are built on the transformer architecture.

The original transformer contains two main parts:

Encoder
Decoder

Encoder
The encoder converts text into embeddings (vectors).

Decoder
The decoder processes embeddings and generates human-readable text.

Embedding models mainly use the encoder part to generate vector representations, while many modern LLMs use only the decoder part.

Methods to Generate Embeddings

Embeddings can be generated in two ways:

1. Using Dedicated Embedding Models

These models are specifically trained for embedding generation.

Examples

nomic-embed-text
BGE models

This is the most common and efficient approach in RAG systems.

2. Using General LLMs Through Prompting

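A general-purpose LLM can also be used to produce embeddings, for example by passing the text through the model and reading out its internal representations. Asking the model to write out a vector as plain text is generally unreliable.

Vectorless RAG systems take a different route: they skip embeddings and let the LLM locate the relevant content directly.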

Disadvantages
Higher computational cost
Slower performance
More token consumption

Measuring Embedding and Retrieval Accuracy

To measure retrieval accuracy effectively, unit tests should be written for the RAG pipeline.

The test cases should include:

Expected inputs
Expected outputs
Different query scenarios
Edge cases
Semantic similarity checks

This helps evaluate how accurately the embedding model retrieves relevant information.
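One way such a test could look is a hit-rate check: for each test query, does the expected chunk appear in the top k results? The sketch below uses a canned `fake_retrieve` function in place of a real retriever, and the chunk ids and queries are hypothetical.

```python
from typing import Callable


def hit_rate_at_k(test_cases: list[tuple[str, str]],
                  retrieve: Callable[[str], list[str]], k: int = 3) -> float:
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    hits = sum(1 for query, expected in test_cases if expected in retrieve(query)[:k])
    return hits / len(test_cases)


# Hypothetical canned results standing in for a real retriever.
FAKE_RESULTS = {
    "gold price today": ["chunk_7", "chunk_2", "chunk_9"],
    "what is faiss": ["chunk_4", "chunk_1", "chunk_3"],
}


def fake_retrieve(query: str) -> list[str]:
    return FAKE_RESULTS.get(query, [])


cases = [("gold price today", "chunk_2"), ("what is faiss", "chunk_8")]
rate = hit_rate_at_k(cases, fake_retrieve, k=3)
print(rate)  # 0.5
```

Tracking this number over time shows whether a change to the chunking strategy or embedding model helps or hurts retrieval.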

Similarity Methods Used in Dense Embeddings

Dense embeddings commonly use one of the following similarity measurement methods:

Cosine Similarity

This is the most commonly used similarity method in RAG applications.

It measures the cosine of the angle between vectors, so it compares their direction rather than their length or distance.

If the vectors point in similar directions, the similarity score becomes higher.

Euclidean Distance

Measures the straight-line distance between vectors in vector space.

Dot Product

Measures similarity by multiplying corresponding vector values and summing them.
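The three measures can be sketched in plain Python as follows. The two sample vectors are chosen so that they point in the same direction but have different lengths, which shows how the measures differ.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Multiply corresponding values and sum them."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))   # 1.0, identical direction
print(euclidean_distance(a, b))  # 3.0, the points are still apart
print(dot_product(a, b))         # 18.0, grows with vector length
```

When vectors are normalized to length 1, cosine similarity and dot product give the same value, which is why many embedding pipelines normalize before storing.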

Why the Same Embedding Model Must Be Used

The same embedding model should be used for both:

Data ingestion phase
Retrieval phase

If different embedding models are used, the generated vectors may exist in completely different latent spaces or vector distributions.

As a result:

Similarity calculations become inaccurate
Retrieval quality decreases
Relevant chunks may not be retrieved correctly

Using the same embedding model ensures that both stored documents and user queries are represented consistently in the same vector space.

Sparse Embeddings

Sparse embeddings use TF-IDF and BM25 mechanisms for retrieval.

In sparse embeddings, vectors are generated mainly based on keyword frequency and importance rather than semantic meaning.

The combination of BM25 and vector search is called hybrid search.

Tools such as OpenSearch and Elasticsearch support hybrid search by combining:

Traditional keyword-based retrieval
Semantic vector-based retrieval

Similar to one-hot encoding, sparse embeddings generate vectors based on text frequency. Most values in the vector remain 0, while only important terms receive higher numerical values.

Example

[3.91, 0, 0, 1.62]

In this representation:

Higher values indicate more important or frequently occurring terms
Zero values indicate terms that are absent or not important in the document

Sparse embeddings mainly focus on exact keyword matching and are highly effective for traditional search use cases.