<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kunaal Thanik</title>
    <description>The latest articles on DEV Community by Kunaal Thanik (@kunaal_ai_tester).</description>
    <link>https://dev.to/kunaal_ai_tester</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1630365%2F031df712-4280-44d8-96bc-ec1ac127e0e5.png</url>
      <title>DEV Community: Kunaal Thanik</title>
      <link>https://dev.to/kunaal_ai_tester</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kunaal_ai_tester"/>
    <language>en</language>
    <item>
      <title>RAG+RAGAS+LangChain+FAISS+OpenAI</title>
      <dc:creator>Kunaal Thanik</dc:creator>
      <pubDate>Tue, 13 May 2025 23:35:43 +0000</pubDate>
      <link>https://dev.to/kunaal_ai_tester/ragragaslangchainfaissopenai-40nk</link>
      <guid>https://dev.to/kunaal_ai_tester/ragragaslangchainfaissopenai-40nk</guid>
      <description>&lt;h2&gt;
  
  
  RAG (Retrieval-Augmented Generation) Workflow
&lt;/h2&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Import Required Libraries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This imports all the necessary libraries for loading datasets, performing text splitting, creating embeddings, building a retrieval system, and evaluating metrics.&lt;/p&gt;

&lt;p&gt;dotenv: For loading environment variables (e.g., API keys).&lt;/p&gt;

&lt;p&gt;load_diabetes: Provides the diabetes dataset from scikit-learn.&lt;/p&gt;

&lt;p&gt;LangChain libraries: Tools for text splitting, embeddings, and setting up a question-answering (QA) pipeline.&lt;/p&gt;

&lt;p&gt;FAISS: A library for efficient similarity search and clustering of dense vectors.&lt;/p&gt;

&lt;p&gt;Ragas: For evaluating retrieval-based question-answering systems.&lt;/p&gt;

&lt;p&gt;Dataset: From the datasets library, useful for organizing data for evaluation.&lt;/p&gt;

&lt;p&gt;userdata: Used for securely retrieving sensitive data in Google Colab.&lt;/p&gt;
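&lt;p&gt;The post describes these imports but doesn't show the block itself. Here is a sketch of what it might look like; the exact module paths are assumptions on my part, since LangChain has moved classes between packages across releases, so adjust them to your installed versions:&lt;/p&gt;

```python
# Assumed import layout; verify against your LangChain / Ragas versions.
from dotenv import load_dotenv                      # environment variables (API keys)
from sklearn.datasets import load_diabetes          # the ground-truth dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS  # vector store for retrieval
from langchain.chains import RetrievalQA
from datasets import Dataset                        # packages data for evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
# from google.colab import userdata                 # only available inside Google Colab
```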

&lt;h2&gt;
  
  
  1. &lt;strong&gt;Ground Truth&lt;/strong&gt; - Source of Truth
&lt;/h2&gt;

&lt;p&gt;The foundation of the system lies in the source data. In this example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;diabetes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_diabetes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;raw_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;diabetes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESCR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;load_diabetes()&lt;/code&gt;&lt;/strong&gt;: Fetches the diabetes dataset from &lt;code&gt;sklearn.datasets&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;diabetes.DESCR&lt;/code&gt;&lt;/strong&gt;: Contains a detailed description of the dataset (variables, data characteristics).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;raw_text&lt;/code&gt;&lt;/strong&gt;: Represents the "Ground Truth" that the RAG system will reference for its operations.&lt;/li&gt;
&lt;/ul&gt;
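&lt;p&gt;To see what this "Ground Truth" actually contains, you can print a slice of the description (assuming scikit-learn is installed):&lt;/p&gt;

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
raw_text = diabetes.DESCR  # plain-text description of the dataset

print(raw_text[:200])               # first lines: title, sample/feature counts
print(len(raw_text), "characters")  # the full text the RAG system will index
```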




&lt;h2&gt;
  
  
  2. &lt;strong&gt;Retrieval&lt;/strong&gt; - Finding Relevant Information
&lt;/h2&gt;

&lt;p&gt;The raw text (input) is split into smaller chunks, and a vector store is created using &lt;strong&gt;FAISS&lt;/strong&gt;. Queries retrieve the most relevant chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Split into chunks (simulate document retrieval)
&lt;/span&gt;&lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create Embeddings &amp;amp; Build FAISS Index
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Recommended model
&lt;/span&gt;&lt;span class="n"&gt;docsearch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;&lt;/strong&gt;: Splits the large &lt;code&gt;raw_text&lt;/code&gt; into smaller, overlapping segments (&lt;code&gt;texts&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OpenAIEmbeddings&lt;/code&gt;&lt;/strong&gt;: Converts each text chunk into a numerical vector representation for similarity search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FAISS.from_texts(texts, embeddings)&lt;/code&gt;&lt;/strong&gt;: Builds an index of these embeddings for efficient retrieval.&lt;/li&gt;
&lt;/ul&gt;
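&lt;p&gt;The &lt;code&gt;chunk_size&lt;/code&gt;/&lt;code&gt;chunk_overlap&lt;/code&gt; idea can be illustrated with a naive pure-Python splitter. This is not how &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; is implemented (it also respects separators such as paragraphs and sentences), but it shows the sliding-window effect:&lt;/p&gt;

```python
def naive_split(text, chunk_size=500, chunk_overlap=50):
    """Fixed-window splitter: each chunk repeats the last 50 chars of the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split("a" * 1200)
print([len(c) for c in chunks])  # [500, 500, 300]
```

&lt;p&gt;The overlap keeps sentences that straddle a chunk boundary visible in both neighboring chunks, which helps retrieval.&lt;/p&gt;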




&lt;h2&gt;
  
  
  3. &lt;strong&gt;Augmentation&lt;/strong&gt; - Adding Context
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;RetrievalQA&lt;/code&gt;&lt;/strong&gt; chain in LangChain manages augmentation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It takes the user's query and uses the &lt;strong&gt;&lt;code&gt;docsearch&lt;/code&gt;&lt;/strong&gt; index to find relevant documents.&lt;/li&gt;
&lt;li&gt;These documents are passed to the LLM along with the original query, providing contextual information.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;qa_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_chain_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stuff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docsearch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;return_source_documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# helpful for debugging
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RetrievalQA.from_chain_type&lt;/code&gt;&lt;/strong&gt; sets up the RAG pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retriever=docsearch.as_retriever()&lt;/code&gt;&lt;/strong&gt;: connects the FAISS index to the QA chain so it can fetch the most relevant chunks for each query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;chain_type="stuff"&lt;/code&gt;&lt;/strong&gt;: passes the retrieved documents to the LLM by "stuffing" them into a single prompt along with the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;return_source_documents=True&lt;/code&gt;&lt;/strong&gt;: returns the retrieved chunks alongside the answer, which supports traceability and debugging.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. &lt;strong&gt;Generation&lt;/strong&gt; - Producing the Response
&lt;/h2&gt;

&lt;p&gt;The LLM generates a response by combining the query and augmented context retrieved in the previous step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;answers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract retrieved docs for Ragas evaluation
&lt;/span&gt;    &lt;span class="n"&gt;retrieved_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;contexts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
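&lt;p&gt;The imports mention Ragas, and the loop above collects &lt;code&gt;answers&lt;/code&gt; and &lt;code&gt;contexts&lt;/code&gt; for exactly that purpose. Here is a sketch of assembling the evaluation records; the field names follow Ragas conventions, the sample strings are placeholders of mine, and the actual &lt;code&gt;evaluate&lt;/code&gt; call (indicated in a comment) needs the &lt;code&gt;datasets&lt;/code&gt; package plus an OpenAI key:&lt;/p&gt;

```python
# Placeholder values standing in for the lists built in the generation loop
queries = ["How many samples are in the diabetes dataset?"]
answers = ["The dataset contains 442 samples."]
contexts = [["Number of Instances: 442"]]
ground_truths = ["There are 442 samples and 10 features."]

eval_records = {
    "question": queries,
    "answer": answers,
    "contexts": contexts,       # list of retrieved chunks per question
    "ground_truth": ground_truths,
}
# With `datasets` and `ragas` installed:
#   dataset = Dataset.from_dict(eval_records)
#   scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print({k: len(v) for k, v in eval_records.items()})
```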






&lt;h2&gt;
  
  
  5. &lt;strong&gt;Traceability&lt;/strong&gt; - Explaining the Source
&lt;/h2&gt;

&lt;p&gt;The system ensures traceability by showing the origin of the retrieved information, helping users understand the response's basis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;retrieved_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;contexts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
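&lt;p&gt;To actually surface the provenance to users, you can print each retrieved chunk next to the answer. A minimal sketch, using a mock of the result dict that &lt;code&gt;RetrievalQA&lt;/code&gt; returns when &lt;code&gt;return_source_documents=True&lt;/code&gt; (the &lt;code&gt;Doc&lt;/code&gt; namedtuple stands in for LangChain's &lt;code&gt;Document&lt;/code&gt;):&lt;/p&gt;

```python
from collections import namedtuple

Doc = namedtuple("Doc", ["page_content", "metadata"])  # stand-in for langchain Document

result = {  # mock of a RetrievalQA result, for illustration
    "query": "How many samples are in the dataset?",
    "result": "There are 442 samples.",
    "source_documents": [Doc("Number of Instances: 442", {"chunk": 0})],
}

print("Answer:", result["result"])
for i, doc in enumerate(result.get("source_documents", []), 1):
    print(f"Source {i}: {doc.page_content!r}")  # lets users verify the basis
```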



</description>
      <category>rag</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>openai</category>
    </item>
    <item>
      <title>Machine Learning testing strategies</title>
      <dc:creator>Kunaal Thanik</dc:creator>
      <pubDate>Tue, 13 May 2025 02:28:19 +0000</pubDate>
      <link>https://dev.to/kunaal_ai_tester/machine-learning-testing-strategies-410g</link>
      <guid>https://dev.to/kunaal_ai_tester/machine-learning-testing-strategies-410g</guid>
      <description>&lt;h3&gt;
  
  
  Model Accuracy and Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Using special test data to see how well a model works.
&lt;/li&gt;
&lt;li&gt;Measuring results with scores like precision, recall, and accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, let’s talk about how to test the accuracy of a model. Now that you have a dataset to train the model, you’ll keep some of it aside as a test set. This is called the hold-out data set. What does it mean? Well, it means the model hasn’t seen this data before, so you’ll use it to test how well it performs. If the model still works great, you can calculate some scores to see if it’s up to expectations. &lt;/p&gt;

&lt;p&gt;For example, precision, recall, accuracy, and F1 scores give you a good idea of how well the model is doing. These numbers usually range from 0 to 1. A score of 0 means the model is performing poorly, while a score of 1 means it’s performing very well. You can also use other tools to check the latency and throughput of the model, like JMeter or Locust. By using these two techniques, you can check the accuracy, performance, and other aspects of the model.&lt;/p&gt;
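&lt;p&gt;Computing these scores on a hold-out set takes only a few lines with scikit-learn. The labels below are a toy example of mine, not real model output:&lt;/p&gt;

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy hold-out labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall   :", recall_score(y_true, y_pred))     # 0.75
print("f1       :", f1_score(y_true, y_pred))         # 0.75
```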

&lt;p&gt;&lt;strong&gt;In my usual approach, I ask myself a question: can I trust these numbers?&lt;/strong&gt; As a QA, you’re not going to stop at these numbers; on their own they can be misleading. This happened recently with a major company developing an AI chatbot. They replaced human agents with AI chatbot agents and relied heavily on these metrics. They believed everything was working perfectly, but the reality was different, and they had to go back to human customer service agents. &lt;/p&gt;

&lt;p&gt;Let's assume you are working with a model that detects whether an email is spam or not. That means you have two classes: spam and non-spam. Ideally, the data you provide to the model should have an equal distribution: 50 spam emails and 50 non-spam emails, a 50-50 split. If the distribution is skewed, you need to resample the data. Resampling means balancing the data so that you get close to 50-50. You can do this using data synthesis, where you generate artificial examples and add them back to the dataset. After resampling, you run the same performance metrics [recall, accuracy, F1, precision] on both classes. Finally, you compare the per-class scores for class one and class two against the model’s overall scores. If all these numbers match closely, you can have high confidence in the model. If you see any major discrepancies, investigate and address them.&lt;/p&gt;
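&lt;p&gt;A minimal sketch of checking the class distribution and naively rebalancing by duplicating minority examples. Real data synthesis (e.g. SMOTE) generates new artificial examples rather than copies; duplication is only the simplest stand-in:&lt;/p&gt;

```python
from collections import Counter

labels = ["spam"] * 20 + ["non-spam"] * 80    # imbalanced toy dataset
counts = Counter(labels)
print(counts)                                  # 20 spam vs 80 non-spam

# Naive oversampling: duplicate minority examples until the split is 50-50
minority = min(counts, key=counts.get)
deficit = max(counts.values()) - counts[minority]
labels += [minority] * deficit                 # stand-in for synthesized rows

print(Counter(labels))                         # balanced: 80 spam, 80 non-spam
```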

&lt;h3&gt;
  
  
  Robustness
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Using tools like Great Expectations to check the quality of data.
&lt;/li&gt;
&lt;li&gt;Making sure models work well even with tricky or unusual inputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes [most of the time], people ask questions with spelling mistakes or grammatical errors. This is why the model should be able to understand the intent, even with errors. Adding noise to the data is a technique to test this.&lt;/p&gt;
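&lt;p&gt;A simple way to generate that noise is to perturb the test queries programmatically, e.g. by swapping adjacent characters to simulate typos (a sketch; real robustness suites also cover grammar errors, slang, and truncation):&lt;/p&gt;

```python
import random

def add_typo_noise(text, rate=0.1, seed=42):
    """Randomly swap adjacent letters to simulate user typos."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rate > rng.random():
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

query = "what are the symptoms of diabetes"
print(add_typo_noise(query))  # noisy variant to send alongside the clean query
```

&lt;p&gt;The test then checks that the model returns the same intent for the clean and noisy variants.&lt;/p&gt;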

&lt;h2&gt;
  
  
  Bias and Fairness
&lt;/h2&gt;

&lt;p&gt;Using tools like IBM AIF360, Google What-If, and Microsoft Fairlearn to find and fix unfairness in models.&lt;/p&gt;

&lt;p&gt;Bias and fairness are crucial parts of testing because a biased model can lead to serious consequences for the company. I won’t mention specific companies that have faced legal issues over biased data, but you can search online. You need to work hard to address this issue. There are multiple tools available, such as IBM’s AI Fairness 360, Google’s What-If Tool, and Microsoft’s Fairlearn. Each tool has its own strengths and weaknesses.&lt;/p&gt;

&lt;p&gt;Bias can happen in different ways. It can be in the training data or the model, or the output of the model. So, there are many different techniques we’ll use to detect and fix bias at each level.&lt;/p&gt;
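&lt;p&gt;One concrete output-level check is comparing selection rates across groups, the idea behind demographic parity, which tools like Fairlearn compute for you. A dependency-free sketch with made-up records:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical model decisions tagged with a sensitive attribute
records = [
    {"group": "A", "approved": 1}, {"group": "A", "approved": 1},
    {"group": "A", "approved": 0}, {"group": "A", "approved": 1},
    {"group": "B", "approved": 0}, {"group": "B", "approved": 1},
    {"group": "B", "approved": 0}, {"group": "B", "approved": 0},
]

totals, approved = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["group"]] += 1
    approved[r["group"]] += r["approved"]

rates = {g: approved[g] / totals[g] for g in totals}
print(rates)                                        # {'A': 0.75, 'B': 0.25}
print("parity gap:", abs(rates["A"] - rates["B"]))  # 0.5 -- flag for review
```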

&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Testing how applications work with systems like EHR (Electronic
Health Records).&lt;/li&gt;
&lt;li&gt;Ensuring smooth data sharing and system compatibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next step is integration. It’s super important because it doesn’t matter how good the model is: if the integration isn’t seamless, who’s going to use the application? For example, a doctor’s office runs an EHR (electronic health record) system. An AI model should integrate with that system seamlessly, ensuring smooth data sharing and system compatibility. Think of it like developing an app: you need to make sure it works on iOS, Android, and other platforms. Otherwise, it might be a big failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Keeping an eye on how models behave over time.
&lt;/li&gt;
&lt;li&gt;Using tools like Amazon SageMaker Model Monitor to spot problems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last step is monitoring. Keep an eye on the model’s behavior over time and do regular audits. Document everything. The goal is to make sure the model’s performance doesn’t degrade. Check regularly that performance is holding steady or improving; if it’s degrading, take action early. Amazon AWS provides tools like SageMaker Model Monitor, but you’ll also need to work manually with the compliance team.&lt;/p&gt;
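&lt;p&gt;Even without SageMaker, the core drift check is simple: compare recent performance against a baseline window and alert when the drop exceeds a threshold. A sketch with made-up weekly accuracy readings:&lt;/p&gt;

```python
# Hypothetical weekly accuracy readings from production monitoring
weekly_accuracy = [0.91, 0.90, 0.92, 0.88, 0.84, 0.79]

baseline = sum(weekly_accuracy[:3]) / 3   # early weeks as the reference window
threshold = 0.05                          # maximum tolerated absolute drop

alerts = [(week, acc) for week, acc in enumerate(weekly_accuracy)
          if baseline - acc > threshold]
print("baseline:", round(baseline, 2))    # 0.91
for week, acc in alerts:
    print(f"week {week}: accuracy {acc} degraded beyond threshold")
```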

&lt;h3&gt;
  
  
  Regulatory Compliance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Making sure AI follows important rules like HIPAA and protects personal data, PII, and PHI.&lt;/li&gt;
&lt;li&gt;Keeping sensitive information safe with strong security measures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be careful about regulatory compliance. Make sure the AI follows the important rules of your domain, whether it’s healthcare (HIPAA) or something else. Also, make sure the data is encrypted and protected with strong security measures. There is a very big healthcare insurance company in the USA that is facing huge legal issues. Lawsuits are active, and it was hit with a penalty of around 22 million dollars, but that is not all. That is just the tip of the iceberg: the company’s reputation is now at stake, and right after the news its stock market value dropped by about $118 billion, a huge loss, because the algorithm it was using (the NH algorithm) was not in line with compliance and regulatory requirements. I would say this is one of the most crucial parts of testing AI applications, especially in the healthcare and finance industries, where it can affect human lives directly.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>aitest</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
