<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeffrey Ip</title>
    <description>The latest articles on DEV Community by Jeffrey Ip (@guybuildingai).</description>
    <link>https://dev.to/guybuildingai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1101474%2Fb2892dba-7ad6-4ef0-9143-68588fda282a.jpeg</url>
      <title>DEV Community: Jeffrey Ip</title>
      <link>https://dev.to/guybuildingai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/guybuildingai"/>
    <language>en</language>
    <item>
      <title>‼️ Top 5 Arize AI Competitors in 2025 💥⚖️</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Wed, 19 Mar 2025 08:45:52 +0000</pubDate>
      <link>https://dev.to/guybuildingai/-top-5-arize-ai-competitors-alternatives-compared-30cp</link>
      <guid>https://dev.to/guybuildingai/-top-5-arize-ai-competitors-alternatives-compared-30cp</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR 📚
&lt;/h1&gt;

&lt;p&gt;Arize AI is great for LLM observability. Depending on what you need, though, its feature set might not be ideal for every use case. If you care more about evaluating the performance of your LLM apps, you should be using something like Confident AI or Giskard, while for tracing and observability there are cheaper options such as LangSmith.&lt;/p&gt;

&lt;p&gt;Let's begin!&lt;/p&gt;



&lt;h1&gt;
  
  
  What Do People Like &amp;amp; Dislike About Arize AI?
&lt;/h1&gt;

&lt;p&gt;Arize AI is a platform for monitoring and evaluating LLM applications. Its main product, Phoenix, is great for debugging LLM applications such as AI agents (for customer support, for example), and can be used to evaluate their performance as well. Originally built for more ML-focused workflows, the company has pivoted to focus on LLMs since 2023.&lt;/p&gt;

&lt;p&gt;However, depending on your use case (and budget) 🚩, you may find that Arize AI isn't the right fit. In this article, we'll list the top 5 alternatives you should consider in 2025 before deciding whether Arize is right for you.&lt;/p&gt;



&lt;h1&gt;
  
  
  1. &lt;a href="https://www.confident-ai.com/" rel="noopener noreferrer"&gt;Confident AI&lt;/a&gt; - The Eval-First LLM Observability Platform
&lt;/h1&gt;

&lt;p&gt;Confident AI is an eval-first cloud platform for LLM observability. Its evals are powered by &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt;, one of the world's most popular and widely adopted open-source LLM evaluation frameworks. It is well known for unit-testing LLM applications ✅&lt;/p&gt;
&lt;h2&gt;
  
  
  Key differences
&lt;/h2&gt;

&lt;p&gt;As the name suggests, it is best known for its laser focus on evaluation-first LLM observability. While Arize AI offers one-off evaluations on spans and traces during debugging, Confident AI focuses on custom benchmarking of LLM applications instead.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;More controllable and customizable metrics&lt;/li&gt;
&lt;li&gt;Evaluation results are more accurate&lt;/li&gt;
&lt;li&gt;Easier for entire organizations to collaborate on testing LLMs&lt;/li&gt;
&lt;li&gt;Scales to LLM safety testing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Confident AI, you can easily A/B test different iterations of your LLM application with a side-by-side, GitHub-like diff view of all regressions and improvements. Arize AI, on the other hand, focuses more on one-off debugging.&lt;/p&gt;
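&lt;p&gt;To make the diff view concrete, here's a minimal, hypothetical sketch of how a regression diff between two evaluation runs could be computed. The run data, test case names, and the 0.5 passing threshold are illustrative only, not Confident AI's actual API:&lt;/p&gt;

```python
# Hypothetical sketch: classify each shared test case between two evaluation
# runs as a regression, improvement, or unchanged. Names and the 0.5 passing
# threshold are illustrative only.

def diff_runs(baseline, candidate, threshold=0.5):
    """Compare two runs mapping test case IDs to metric scores."""
    report = {"regressions": [], "improvements": [], "unchanged": []}
    for case_id in sorted(set(baseline).intersection(candidate)):
        before, after = baseline[case_id], candidate[case_id]
        if before >= threshold > after:
            report["regressions"].append(case_id)   # was passing, now failing
        elif after >= threshold > before:
            report["improvements"].append(case_id)  # was failing, now passing
        else:
            report["unchanged"].append(case_id)
    return report

report = diff_runs(
    baseline={"tc1": 0.9, "tc2": 0.4, "tc3": 0.8},
    candidate={"tc1": 0.3, "tc2": 0.7, "tc3": 0.85},
)
# → {'regressions': ['tc1'], 'improvements': ['tc2'], 'unchanged': ['tc3']}
```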

&lt;p&gt;They also target slightly different stages of the LLM development lifecycle: Arize leans toward production monitoring, while Confident AI leans toward LLM evaluation before deployment. That said, each does the other part well too.&lt;/p&gt;
&lt;h2&gt;
  
  
  Side by side comparison summary
&lt;/h2&gt;

&lt;p&gt;We'll go down the feature list so you can make a more informed decision on which is best for you.&lt;/p&gt;
&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Confident AI&lt;/th&gt;
&lt;th&gt;Arize AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-the-box metrics&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;td&gt;10+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG metrics&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conversation (chatbot) metrics&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent metrics&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research-backed custom metrics&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deterministic LLM-as-a-judge metrics&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-source&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integrates with any LLM&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can be run locally in code&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can be run on the cloud&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto improves&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For open-source users, Confident AI allows you to use literally any LLM for evaluation metrics, whereas Arize AI metrics are limited to the LLMs available on their platform.&lt;/p&gt;
&lt;h3&gt;
  
  
  Platform
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Confident AI&lt;/th&gt;
&lt;th&gt;Arize AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;td&gt;10+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dataset Management&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt Management&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metric Alignment&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human Feedback&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Observability&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From afar, there are no big differences here. Let's dive deeper into each feature on the platform.&lt;/p&gt;
&lt;h3&gt;
  
  
  Evaluation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Confident AI&lt;/th&gt;
&lt;th&gt;Arize AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Testing Report&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A/B Experimentation&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;🚧&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regression Testing&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Side-by-side evaluation comparisons&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Statistical metric scores analysis&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Publicly sharable testing report&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced filtering for metrics/test cases&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human labelling for metrics&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metric score accuracy validation (confusion matrix)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scales to safety testing&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Although Arize supports LLM evaluation features, many of them don't scale well beyond hundreds of test cases. This makes it harder to benchmark LLM applications, which is required for experimentation and for satisfying external stakeholders through publicly sharable testing reports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://confident-ai.com" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Visit Confident AI Website&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset management
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Confident AI&lt;/th&gt;
&lt;th&gt;Arize AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100% DeepEval integration&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dataset editor&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uploading datasets from CSV&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push/pull datasets in code&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create datasets from production data&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create Datasets from testing reports&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Comment on datasets&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PIT recovery&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dataset backup&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Revision history&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom columns&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Finalized" flag&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Arize and Confident AI are mostly the same here. Confident AI does have a slight edge in dataset collaboration: domain experts can leave comments on datasets while engineers focus on building to ensure those test cases pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt management
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Confident AI&lt;/th&gt;
&lt;th&gt;Arize AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100% DeepEval integration&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt editor&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt auto versioning&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dynamic prompt variables&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can be used for evaluation&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can be used for observability&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Arize has prompt management support, but it is not as tightly integrated with evaluation and observability.&lt;/p&gt;
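&lt;p&gt;For illustration, here's a minimal, dependency-free sketch of what versioned prompts with dynamic variables look like in practice. The prompt names, versions, and variables are hypothetical, not either platform's API:&lt;/p&gt;

```python
# Hypothetical sketch of versioned prompt management with dynamic variables.
# Prompt names, versions, and variables are illustrative only.
from string import Template

PROMPTS = {
    ("summarize", "v1"): Template("Summarize the following text:\n$document"),
    ("summarize", "v2"): Template("Summarize the following text in $style style:\n$document"),
}

def render(name, version, **variables):
    """Fetch a prompt version and substitute its dynamic variables."""
    return PROMPTS[(name, version)].substitute(**variables)

prompt = render("summarize", "v2", style="bullet-point", document="Arize vs Confident AI...")
```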

&lt;h3&gt;
  
  
  LLM observability
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Confident AI&lt;/th&gt;
&lt;th&gt;Arize AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM output monitoring&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integrated LLM tracing&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom LLM tracing&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Has chatbot specific monitoring&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time evaluations&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human feedback leaving&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advance filtering for prompts and models&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advance filtering for custom properties&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Arize AI focuses more on deep, detailed debugging while Confident AI's observability is for monitoring the output of each LLM interaction, with tracing included.&lt;/p&gt;

&lt;h3&gt;
  
  
  Support, Security &amp;amp; Others
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Confident AI (Premium)&lt;/th&gt;
&lt;th&gt;Arize AI (Pro)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;Monthly&lt;/td&gt;
&lt;td&gt;Monthly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User roles &amp;amp; permissions&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SOC2 Type II&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HIPAA&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Retention&lt;/td&gt;
&lt;td&gt;1 year&lt;/td&gt;
&lt;td&gt;6 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;td&gt;Community + email&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both providers are compliant at their enterprise tiers, however.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which One Should You Choose?
&lt;/h2&gt;

&lt;p&gt;Arize AI is great for debugging, while Confident AI is great for LLM evaluation and benchmarking. Both have their own strengths and weaknesses, and their features overlap, but the choice ultimately depends on whether you care more about evaluation or observability.&lt;/p&gt;

&lt;p&gt;If you want to do both, go for Confident AI, since LLM observability is largely the same across providers anyway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://confident-ai.com" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Visit Confident AI Website&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  2. Giskard - Secure your LLM Agents
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Primary Use Case: Testing and debugging LLMs before deployment&lt;/li&gt;
&lt;li&gt;Features:

&lt;ul&gt;
&lt;li&gt;Focuses on pre-deployment testing and model validation.&lt;/li&gt;
&lt;li&gt;Helps identify biases, vulnerabilities, and errors in LLMs before production.&lt;/li&gt;
&lt;li&gt;Provides automated testing and explainability tools.&lt;/li&gt;
&lt;li&gt;Can be used for unit testing LLMs, similar to software testing frameworks.&lt;/li&gt;
&lt;li&gt;Helps ensure compliance with AI safety and fairness guidelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ideal for:&lt;/strong&gt; LLM teams who want to debug models, ensure robustness, and prevent issues before deployment.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Key Differences&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Arize AI&lt;/th&gt;
&lt;th&gt;Giskard&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Focus Area&lt;/td&gt;
&lt;td&gt;Production monitoring &amp;amp; observability&lt;/td&gt;
&lt;td&gt;Pre-deployment testing &amp;amp; debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Drift Detection&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bias &amp;amp; Fairness Testing&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root Cause Analysis&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explainability&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automated LLM Testing&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance &amp;amp; Safety Checks&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Which One Should You Choose?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you need to monitor production LLMs for drift and performance degradation, go with Arize AI.&lt;/li&gt;
&lt;li&gt;If you need to test and debug LLMs before deployment, go with Giskard.&lt;/li&gt;
&lt;li&gt;If you need both testing and monitoring, you might consider using both together.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.giskard.ai/" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Visit Giskard Website&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  3. Lunary - AI Developer Platform
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Primary Use Case: LLM &lt;strong&gt;chatbot&lt;/strong&gt; observability, evaluation, and debugging&lt;/li&gt;
&lt;li&gt;Features:

&lt;ul&gt;
&lt;li&gt;Provides logging, monitoring, and analytics for LLM chatbots.&lt;/li&gt;
&lt;li&gt;Tracks conversation history, user feedback, and model performance.&lt;/li&gt;
&lt;li&gt;Supports prompt versioning, management, and collaboration.&lt;/li&gt;
&lt;li&gt;Measures cost, latency, token usage, and model performance metrics.&lt;/li&gt;
&lt;li&gt;Offers both cloud-hosted and self-hosted deployment options.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Ideal for: Teams developing and deploying LLM chatbots who need monitoring, evaluation, and debugging capabilities.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Key Differences
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Arize AI&lt;/th&gt;
&lt;th&gt;Lunary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Focus Area&lt;/td&gt;
&lt;td&gt;LLM agents&lt;/td&gt;
&lt;td&gt;LLM chatbots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root Cause Analysis&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging and Tracing&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt Versioning&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost and Token Tracking&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automated LLM Testing&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance and Security&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ (SOC 2, ISO 27001)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Which One Should You Choose?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you need to track, debug, and evaluate LLM applications with logging, analytics, and user feedback, choose Lunary. It helps teams iterate on prompts, detect hallucinations, and analyze costs before and after deployment.
&lt;/li&gt;
&lt;li&gt;If you need a solution focused on production monitoring with real-time performance tracking and drift detection, choose Arize AI. It is designed for LLM observability at scale, ensuring models remain reliable in deployment.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://lunary.ai/" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Visit Lunary Website&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  4. Datadog - Modern monitoring &amp;amp; security
&lt;/h1&gt;

&lt;p&gt;Datadog is not LLM-specific, but it does offer some good features compared to Arize AI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary Use Case: General application monitoring, logging, and infrastructure observability
&lt;/li&gt;
&lt;li&gt;Features:

&lt;ul&gt;
&lt;li&gt;Provides monitoring for servers, databases, and cloud services with real-time dashboards.
&lt;/li&gt;
&lt;li&gt;Supports log management, distributed tracing, and security monitoring across applications.
&lt;/li&gt;
&lt;li&gt;Detects anomalies and performance bottlenecks in system infrastructure.
&lt;/li&gt;
&lt;li&gt;Offers alerting and automated incident response for system failures.
&lt;/li&gt;
&lt;li&gt;Integrates with various cloud providers, DevOps tools, and microservices architectures.
&lt;/li&gt;
&lt;li&gt;Focuses on infrastructure observability rather than model-specific insights.
&lt;/li&gt;
&lt;li&gt;Weaker than Arize AI when it comes to LLM evaluation, as it lacks built-in model performance tracking, data drift detection, and detailed LLM-specific analytics.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Key differences
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Datadog&lt;/th&gt;
&lt;th&gt;Arize AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Focus Area&lt;/td&gt;
&lt;td&gt;Infrastructure and application monitoring&lt;/td&gt;
&lt;td&gt;LLM observability and performance tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Performance Monitoring&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Drift Detection&lt;/td&gt;
&lt;td&gt;🚧&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging and Tracing&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root Cause Analysis&lt;/td&gt;
&lt;td&gt;🚧&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security and Compliance&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application Performance Monitoring&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Evaluation and Debugging&lt;/td&gt;
&lt;td&gt;🚧&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Which one should you choose?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you need to monitor system infrastructure, application performance, and security events, choose Datadog. It is best suited for DevOps and cloud-native applications that require end-to-end observability.
&lt;/li&gt;
&lt;li&gt;If you need to monitor LLMs in production, detect model drift, and analyze performance issues, choose Arize AI. Arize is significantly stronger in LLM evaluation, providing model-specific insights, drift detection, and performance tracking that Datadog lacks.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.datadoghq.com/" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Visit Datadog Website&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  5. MLFlow - ML and GenAI made simple
&lt;/h1&gt;

&lt;p&gt;As the name suggests, MLFlow is undecided on whether to focus on traditional ML or GenAI. For this reason, I would not recommend MLFlow unless you also have traditional ML workflows to satisfy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary Use Case: Experiment tracking, model management, and deployment&lt;/li&gt;
&lt;li&gt;Features:

&lt;ul&gt;
&lt;li&gt;Tracks and logs ML experiments, including parameters, metrics, and artifacts.&lt;/li&gt;
&lt;li&gt;Provides a central model registry for versioning and managing models.&lt;/li&gt;
&lt;li&gt;Supports model packaging for deployment in multiple environments.&lt;/li&gt;
&lt;li&gt;Enables reproducibility by logging code, dependencies, and environment configurations.&lt;/li&gt;
&lt;li&gt;Integrates with various ML frameworks, including TensorFlow, PyTorch, and Scikit-learn.&lt;/li&gt;
&lt;li&gt;Allows deployment of models to cloud services, on-premises, and edge devices.&lt;/li&gt;
&lt;li&gt;Offers APIs and a UI for tracking and managing experiments.&lt;/li&gt;
&lt;li&gt;Supports collaborative workflows for ML teams.&lt;/li&gt;
&lt;li&gt;Provides lifecycle management for ML models, from development to production.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
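
&lt;p&gt;To show the shape of what experiment tracking records, here's a dependency-free sketch. MLflow's real API uses calls like &lt;code&gt;mlflow.log_param&lt;/code&gt; and &lt;code&gt;mlflow.log_metric&lt;/code&gt;, so treat this only as an illustration of the data a tracked run holds:&lt;/p&gt;

```python
# Dependency-free sketch of what an experiment-tracking run records.
# MLflow's real API is mlflow.start_run() / mlflow.log_param() /
# mlflow.log_metric(); this only illustrates the recorded data.
import time
import uuid

class Run:
    def __init__(self, experiment):
        self.run_id = uuid.uuid4().hex   # unique ID, like MLflow's run ID
        self.experiment = experiment
        self.started_at = time.time()
        self.params = {}                 # hyperparameters: logged once per run
        self.metrics = {}                # metrics: a history of values per key

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        self.metrics.setdefault(key, []).append(value)

run = Run("prompt-tuning")
run.log_param("temperature", 0.2)
run.log_metric("answer_relevancy", 0.91)
run.log_metric("answer_relevancy", 0.94)
```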
&lt;h2&gt;
  
  
  Which One Should You Choose?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you need to track experiments, manage model versions, and handle deployments, choose MLflow. It is best suited for the early stages of the LLM lifecycle, helping teams develop, iterate, and manage models before deployment.&lt;/li&gt;
&lt;li&gt;If you need to monitor LLMs in production, detect performance issues, and analyze model drift, choose Arize AI. It is specifically designed for LLM observability, helping teams detect data drift, hallucinations, and degradation over time.&lt;/li&gt;
&lt;li&gt;If your workflow involves both training and production monitoring, consider using MLflow for experiment tracking and Arize AI for post-deployment monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://mlflow.org/" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Visit MLFlow Website&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;So there you have it, the list of the top 5 Arize AI alternatives in 2025. Think there's something I've missed? Comment below to let me know!&lt;/p&gt;

&lt;p&gt;Thank you for reading, and till next time 😊&lt;/p&gt;

</description>
    </item>
    <item>
      <title>🚨🏆 Top 5 Open-source Alternatives for LLM Development You Must Know About 💥</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Wed, 13 Nov 2024 11:55:27 +0000</pubDate>
      <link>https://dev.to/guybuildingai/top-5-open-source-alternatives-for-llm-development-you-must-know-about-p30</link>
      <guid>https://dev.to/guybuildingai/top-5-open-source-alternatives-for-llm-development-you-must-know-about-p30</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;I'm not a fan of closed-source, especially when it comes to LLM application development 👎 But thankfully, for every closed-source framework in nature there ought to be an equal open-source counterpart (after all, isn't this the third law of some famous scientist🤔?).&lt;/p&gt;

&lt;p&gt;So in this article, as someone who has soaked and bathed in the LLM development rabbit hole 🐇 for more than two years, I'm going to walk you through the five most important open-source alternatives to closed-source LLM development solutions.&lt;/p&gt;

&lt;p&gt;Here we go!🙌&lt;br&gt;
&lt;/p&gt;


&lt;p&gt;&lt;em&gt;(PS. please star &lt;strong&gt;all&lt;/strong&gt; the open-source repos to help them gain awareness over their closed-source counterparts!)&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  1. &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt; &amp;gt;&amp;gt; Humanloop
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;DeepEval is the open-source LLM evaluation framework&lt;/strong&gt;👍, while Humanloop, well, you guessed it, is a closed-source LLM evaluation solution 👎, with hidden API endpoints instead of open algorithms that let you see how evaluation is carried out.&lt;/p&gt;

&lt;p&gt;This is top of the list because, in my opinion, nothing is more important than open LLM evaluation 💯. Openness allows for transparency, and transparency, especially in LLM development, allows everyone to see what the standard of evaluation is. You wouldn't want some LLM safety evaluation to be done behind closed doors, while you simply get informed of the results, right?&lt;/p&gt;

&lt;p&gt;Open-source code gets scrutinized all the time, which helps make DeepEval much easier to use than Humanloop. Here's how to evaluate your LLM application in DeepEval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnswerRelevancyMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;

&lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many evaluation metrics does DeepEval offers?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14+ evaluation metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnswerRelevancyMetric&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_cases&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;br&gt;
(DeepEval's humble mascot wants a star)&lt;/p&gt;



&lt;h1&gt;
  
  
  2. &lt;a href="https://huggingface.co/meta-llama/Llama-3.1-8B" rel="noopener noreferrer"&gt;Llama3.1&lt;/a&gt; &amp;gt;&amp;gt; Open AI GPT-4
&lt;/h1&gt;

&lt;p&gt;I bet you saw this coming, but the next on the list is Meta's Llama3.1 vs OpenAI's GPT-4. Llama3.1 can be self-hosted, with much faster inference times and cheaper token costs than GPT-4, and the best part is it is open-source, with open-weights. What does this mean?&lt;/p&gt;

&lt;p&gt;This means if you want to customize Llama3.1, which by the way performs as well as GPT-4 on several benchmarks, you can do it yourself. The millions (or billions?) of dollars Meta spent on training Llama3.1 can be leveraged by literally anyone, and the weights are available for fine-tuning.&lt;/p&gt;

&lt;p&gt;Use Llama3.1 today:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.1-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;torch_dtype&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hey how are you doing today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h1&gt;
  
  
  3. LangChain &amp;gt;&amp;gt; OpenAI Assistants
&lt;/h1&gt;

&lt;p&gt;I'm sorry OpenAI, it's you again. LangChain is an LLM application orchestration framework, while OpenAI Assistants is more like a RAG API, with the inner orchestration logic hidden behind closed doors.&lt;/p&gt;

&lt;p&gt;I know, I'm using big words and this sounds very fancy, but let me explain what each of those means. LLM orchestration simply means connecting external data to your LLM, and allowing your LLM to fetch data as it sees fit by giving it access to your APIs. For example, a chatbot built on LangChain that reports the daily weather would let the LLM fetch the latest weather for today. OpenAI Assistants hides all of this away behind an API.&lt;/p&gt;

&lt;p&gt;This means it's not as customizable, and quite frankly, I haven't met a single person who uses the Assistants feature even though it was hugely hyped up as the next big thing.&lt;/p&gt;
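&lt;p&gt;To make "orchestration" concrete, here is a minimal plain-Python sketch of the loop a framework like LangChain manages for you (illustrative only, not LangChain's actual API):&lt;/p&gt;

```python
# Toy orchestration loop: a stand-in "LLM" decides whether it needs a tool,
# and the orchestrator runs the tool and feeds the result back in.
def get_weather(city: str) -> str:
    # A real app would call a weather API here
    return f"It is sunny in {city} today."

TOOLS = {"get_weather": get_weather}

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: requests the weather tool when relevant
    if "weather" in prompt.lower():
        return "CALL get_weather(Paris)"
    return "Final answer: " + prompt

def orchestrate(user_input: str) -> str:
    reply = fake_llm(user_input)
    if reply.startswith("CALL "):
        # Parse the tool request, run the tool, hand the result back to the model
        name, arg = reply[5:].rstrip(")").split("(")
        tool_result = TOOLS[name](arg)
        reply = fake_llm(f"Tool said: {tool_result}")
    return reply

print(orchestrate("What's the weather like?"))
```

&lt;p&gt;LangChain takes care of this loop for you, along with tool schemas, memory, and error handling.&lt;/p&gt;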

&lt;p&gt;&lt;a href="https://github.com/langchain-ai/langchain" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  4. Flowise &amp;gt;&amp;gt; Relevance AI
&lt;/h1&gt;

&lt;p&gt;While LangChain lets you build your LLM application in code, Flowise lets you do it via a UI, in an open-source way. Simply drag and drop to customize what data your LLM has access to, and you're pretty much good to go. &lt;/p&gt;

&lt;p&gt;The alternative? A slightly less pretty paid version of it.&lt;/p&gt;

&lt;p&gt;Look at Flowise:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzenqdt5x00r2h1f4eivy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzenqdt5x00r2h1f4eivy.png" alt="Image description" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/FlowiseAI/Flowise" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  5. LiteLLM &amp;gt;&amp;gt; Martian AI
&lt;/h1&gt;

&lt;p&gt;LiteLLM is an open-source library that lets you swap one LLM for another in a single line of code, while Martian, on the other hand, is a closed-source version of it.&lt;/p&gt;

&lt;p&gt;Ok, not quite. In reality, although both let you swap LLMs, Martian AI is an LLM router, meaning it chooses the best LLM for each input to optimize for accuracy, speed, and cost.&lt;/p&gt;
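&lt;p&gt;A toy sketch of what a router does under the hood (plain Python, purely illustrative, not Martian's actual logic):&lt;/p&gt;

```python
def route(prompt: str) -> str:
    """Pick a model per input: a cheap, fast model for simple prompts,
    a stronger (pricier) one for everything else."""
    # Hypothetical heuristic; real routers score accuracy/speed/cost trade-offs
    return "gpt-3.5-turbo" if len(prompt) < 200 else "gpt-4"

print(route("Hi!"))  # short prompt, routed to the cheaper model
print(route("Summarize this 50-page contract... " * 50))  # routed to the stronger model
```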

&lt;p&gt;On this rare occasion, I'd have to say both are pretty good products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/BerriAI/litellm" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Martian AI here: &lt;a href="https://withmartian.com/" rel="noopener noreferrer"&gt;https://withmartian.com/&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;So there you have it, the list of open-source vs closed-source LLM development tools you should definitely know about. Think there's something I've missed? Comment below to let me know!&lt;/p&gt;

&lt;p&gt;Thank you for reading, and till next time 😊&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>🚨💥 Top 5 Trending Open-source LLM Tools &amp; Frameworks You Must Know About ✨🚀</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Tue, 29 Oct 2024 11:00:38 +0000</pubDate>
      <link>https://dev.to/guybuildingai/top-5-trending-open-source-llm-tools-frameworks-you-must-know-about-1fk7</link>
      <guid>https://dev.to/guybuildingai/top-5-trending-open-source-llm-tools-frameworks-you-must-know-about-1fk7</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;"Just the other day, I was deciding which set of LLM tools to use to build my company's upcoming customer support chatbot, and it was the easiest decision of my life!" - &lt;strong&gt;said no one ever&lt;/strong&gt; 🚩🚩🚩&lt;/p&gt;

&lt;p&gt;It has been a while since GPT-4's release, but it still seems like every week a new open-source LLM framework is launched, each doing the same thing as its 50+ other competitors while desperately explaining how it is better than its predecessor. At the end of the day, what developers like yourself really want is some quick personal anecdotes to weigh the pros and cons of each. 👨🏻‍💻&lt;/p&gt;

&lt;p&gt;So, as someone who has played around with more than a dozen open-source LLM tools, I'm going to tell you my top picks so you don't have to do the boring work yourself. 😌&lt;/p&gt;

&lt;p&gt;Let's begin!&lt;/p&gt;



&lt;h1&gt;
  
  
  1. &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt; - The LLM Evaluation Framework
&lt;/h1&gt;

&lt;p&gt;DeepEval is the LLM tool that will &lt;strong&gt;help you quantify how well your LLM application, such as a customer support chatbot, is performing&lt;/strong&gt; 🎉 &lt;/p&gt;

&lt;p&gt;It takes top spot for &lt;strong&gt;two&lt;/strong&gt; simple reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Evaluating and testing LLM performance is IMO the most important part of building an LLM application.&lt;/li&gt;
&lt;li&gt;It is the best LLM evaluation framework available, and it's open-source 💯&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For those who might not be as familiar, &lt;strong&gt;LLM testing is hard because there are infinite possibilities in the responses an LLM can output&lt;/strong&gt;.😟 DeepEval makes testing LLM applications, such as those built with LlamaIndex or LangChain, extremely easy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Offers 14+ research-backed evaluation metrics&lt;/strong&gt; to test LLM applications built with literally any framework, like LangChain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple to use&lt;/strong&gt;, great docs, and intuitive to understand. Perfect for those just getting started, but also technical enough for experts to dive deep into this rabbit hole.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated with Pytest&lt;/strong&gt;, so you can include it in your CI/CD pipeline for deployment checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic dataset generation&lt;/strong&gt; - to help you get started with evaluation in case you don't have a dataset ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM safety scanning&lt;/strong&gt; - automatically scans for safety risks like your LLM app being biased, toxic, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After testing, simply go back to the LLM tool used for building your application (my pick for which I'll reveal later) to iterate on areas that need improvement. Here's a quick example to test how relevant your LLM chatbot responses are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnswerRelevancyMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;

&lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many evaluation metrics does DeepEval offers?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14+ evaluation metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnswerRelevancyMetric&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_cases&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;br&gt;
(DeepEval's humble mascot wants a star)&lt;/p&gt;



&lt;h1&gt;
  
  
  2. LlamaIndex - Data Framework for LLM applications
&lt;/h1&gt;

&lt;p&gt;While DeepEval evaluates, LlamaIndex builds. LlamaIndex is a data framework specifically designed for integrating large language models (LLMs) with various data sources, particularly for applications involving retrieval-augmented generation (RAG).&lt;/p&gt;

&lt;p&gt;For those who haven't heard of RAG, it is the programmatic equivalent of pasting some text into ChatGPT and asking questions about it. RAG simply makes your LLM application aware of context it otherwise wouldn't have through the process of retrieval, and LlamaIndex makes this extremely easy.&lt;/p&gt;

&lt;p&gt;You see, a big problem in RAG is connecting to data sources and parsing unstructured data (like tables in PDFs) from them. It's not hard, but extremely tedious to build out.&lt;/p&gt;

&lt;p&gt;Here's an example of how you can use LlamaIndex to build a customer support chatbot to answer questions on your private data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SimpleDirectoryReader&lt;/span&gt;

&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;query_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_query_engine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Some question about the data should go here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/run-llama/llama_index" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  3. Ollama - Get up and running with large language models
&lt;/h1&gt;

&lt;p&gt;Evaluating and building is important, but what about data privacy?&lt;/p&gt;

&lt;p&gt;Ollama is an interesting one because it unlocks LLMs to be used locally. It allows users to run, customize, and interact with LLMs directly on their own hardware, which can improve privacy, reduce dependency on cloud providers, and optimize latency for certain use cases. Ollama streamlines working with open-source LLMs, making them more accessible and manageable for individuals and organizations without needing extensive machine learning expertise or cloud infrastructure.&lt;/p&gt;

&lt;p&gt;For instance, using Ollama, you might load a model for customer support automation that runs locally on company servers. This setup keeps customer data private and may reduce response latency compared to a cloud-based setup. Ollama is also suitable for experimentation with open-source LLMs, like customizing models for specific tasks or integrating them into larger applications without relying on external cloud services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List available models
ollama list

# Run a model locally (e.g. Llama 3.1), passing a prompt
ollama run llama3.1 "Explain the benefits of using DSPy."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/ollama/ollama" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  4. Guidance
&lt;/h1&gt;

&lt;p&gt;Guidance is a framework designed to help developers craft dynamic, efficient prompts for large language models (LLMs). Unlike traditional prompt engineering, which often relies on fixed templates, Guidance allows prompts to be dynamically constructed, leveraging control structures like loops and conditionals directly within the prompt. This flexibility makes it especially useful for generating responses that require complex logic or customized outputs.&lt;/p&gt;

&lt;p&gt;A simple example is customer support bots: use conditionals to create prompts that adapt to the customer's question, providing personalized responses with a consistent tone and style instead of manual prompting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;guidance&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the Guidance model (e.g., OpenAI or another model API)
&lt;/span&gt;&lt;span class="n"&gt;gpt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;guidance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# You can specify another model if available
&lt;/span&gt;
&lt;span class="c1"&gt;# Define the dynamic prompt with Guidance
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;guidance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
{{#if summary}}
Please provide a brief summary of the topic: {{topic}}.
{{else}}
Provide a detailed explanation of the topic: {{topic}}, covering all relevant details.
{{/if}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set up input parameters
&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Machine Learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Toggle between True for summary or False for detailed response
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Run the prompt
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/guidance-ai/guidance" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  5. DSPy - Algorithmically optimize LM prompts and weights
&lt;/h1&gt;

&lt;p&gt;DSPy is designed to simplify the process of building applications that use LLMs, like those from OpenAI or Hugging Face. It makes it easier to manage how these models respond to inputs without needing to constantly adjust prompts or settings manually.&lt;/p&gt;

&lt;p&gt;The benefit of DSPy is that it simplifies and speeds up application development with large language models by separating logic from prompts, automating prompt tuning, and enabling flexible model switching. This means developers can focus on defining tasks rather than on technical details, making it easier to achieve reliable and consistent results.&lt;/p&gt;
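&lt;p&gt;The core idea, sketched in plain Python (names here are illustrative, not DSPy's real API): declare &lt;em&gt;what&lt;/em&gt; the task is as a signature, and let the framework decide &lt;em&gt;how&lt;/em&gt; to prompt for it.&lt;/p&gt;

```python
def make_predictor(signature: str):
    """Turn an 'inputs -> output' signature into a callable task.
    (A toy sketch of the declare-don't-prompt idea; not DSPy's actual API.)"""
    inputs, output = [s.strip() for s in signature.split("->")]
    def predict(**kwargs):
        # A real framework would compile an optimized prompt and call an LLM;
        # here we just show the prompt the signature expands to.
        filled = ", ".join(f"{k}: {v}" for k, v in kwargs.items())
        return f"Given {filled}, produce: {output}"
    return predict

qa = make_predictor("question -> answer")
print(qa(question="What does DSPy optimize?"))
```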

&lt;p&gt;However, I've personally found DSPy hard to get started with, which is why it sits lower on the list than the others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/stanfordnlp/dspy" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star on GitHub&lt;/a&gt;
&lt;/p&gt;




&lt;p&gt;So there you have it, the list of top LLM open-source trending tools and frameworks on Github you should definitely use to build your next LLM application. Think there's something I've missed? Comment below to let me know!&lt;/p&gt;

&lt;p&gt;Thank you for reading, and till next time 😊&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>‼️ Top 5 Open-Source LLM Evaluation Frameworks in 2026 🎉🔥</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Wed, 17 Jan 2024 10:24:02 +0000</pubDate>
      <link>https://dev.to/guybuildingai/-top-5-open-source-llm-evaluation-frameworks-in-2024-98m</link>
      <guid>https://dev.to/guybuildingai/-top-5-open-source-llm-evaluation-frameworks-in-2024-98m</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;"I feel like there are more LLM evaluation solutions out there than there are problems around LLM evaluation" - said Dylan, a Head of AI at a Fortune 500 company. &lt;/p&gt;

&lt;p&gt;And I couldn't agree more - it seems like every week there is a new open-source repo trying to do the same thing as the other 30+ frameworks that already exist. At the end of the day, what Dylan really wants is a framework, package, library, whatever you want to call it, that would simply quantify the performance of the LLM (application) he's looking to productionize.&lt;/p&gt;

&lt;p&gt;So, as someone who was once in Dylan's shoes, I've compiled a list of the top 5 LLM evaluation frameworks that exist in 2025 😌&lt;/p&gt;

&lt;p&gt;Let's begin!&lt;/p&gt;



&lt;h1&gt;
  
  
  1. &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt; - The Evaluation Framework for LLMs
&lt;/h1&gt;

&lt;p&gt;DeepEval is your favorite evaluation framework's favorite evaluation framework. It takes top spot for a variety of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offers &lt;strong&gt;14+ LLM evaluation metrics (both for RAG and fine-tuning use cases)&lt;/strong&gt;, updated with the latest research in the LLM evaluation field. These metrics include:

&lt;ul&gt;
&lt;li&gt;G-Eval&lt;/li&gt;
&lt;li&gt;Summarization&lt;/li&gt;
&lt;li&gt;Hallucination&lt;/li&gt;
&lt;li&gt;Faithfulness&lt;/li&gt;
&lt;li&gt;Contextual Relevancy&lt;/li&gt;
&lt;li&gt;Answer Relevancy&lt;/li&gt;
&lt;li&gt;Contextual Recall&lt;/li&gt;
&lt;li&gt;Contextual Precision&lt;/li&gt;
&lt;li&gt;RAGAS&lt;/li&gt;
&lt;li&gt;Bias&lt;/li&gt;
&lt;li&gt;Toxicity&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Most metrics are self-explaining, which means DeepEval's metrics will literally tell you why the metric score cannot be higher.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offers &lt;strong&gt;modular components that are extremely simple to plug in and use&lt;/strong&gt;. You can easily mix and match different metrics, or even use DeepEval to build your own evaluation pipeline if needed.&lt;/li&gt;
&lt;li&gt;Treats evaluations as unit tests. With &lt;strong&gt;an integration for Pytest&lt;/strong&gt;, DeepEval is a complete testing suite most developers are familiar with.&lt;/li&gt;
&lt;li&gt;Allows you to generate synthetic datasets using your knowledge base as context, or load datasets from CSVs, JSONs, or Hugging Face.&lt;/li&gt;
&lt;li&gt;Offers a hosted platform with a generous free tier to &lt;strong&gt;run real-time evaluations in production&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Pytest Integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_test&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HallucinationMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;

&lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many evaluation metrics does DeepEval offers?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14+ evaluation metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DeepEval offers 14+ evaluation metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HallucinationMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_hallucination&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepeval test run test_file.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, without Pytest (perfect for notebook environments):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  2. MLflow LLM Evaluate - LLM Model Evaluation
&lt;/h1&gt;

&lt;p&gt;MLflow is a modular and simplistic package that allows you to run evaluations in your own evaluation pipelines. It offers RAG evaluation and QA evaluation.&lt;/p&gt;

&lt;p&gt;MLflow is good because of its intuitive developer experience. For example, this is how you run evaluations with MLflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;eval_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question-answering&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star MLFlow on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;h1&gt;
  
  
  3. RAGAs - Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
&lt;/h1&gt;

&lt;p&gt;Third on the list, RAGAs was built for RAG pipelines. It offers 5 core metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faithfulness&lt;/li&gt;
&lt;li&gt;Contextual Relevancy&lt;/li&gt;
&lt;li&gt;Answer Relevancy&lt;/li&gt;
&lt;li&gt;Contextual Recall&lt;/li&gt;
&lt;li&gt;Contextual Precision&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics make up the final RAGAs score. DeepEval and RAGAs have very similar implementations, but RAGAs metrics are not self-explaining, making it much harder to debug unsatisfactory results.  &lt;/p&gt;
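&lt;p&gt;As a rough illustration, combining the per-metric scores into one number might look like this (a sketch; not necessarily RAGAs' exact aggregation):&lt;/p&gt;

```python
from statistics import harmonic_mean

# Component scores from a hypothetical evaluation run
scores = {"faithfulness": 0.9, "answer_relevancy": 0.8, "context_recall": 0.7}

# A harmonic mean penalizes any single weak metric more than an arithmetic mean would
overall = harmonic_mean(scores.values())
print(round(overall, 3))  # -> 0.792
```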

&lt;p&gt;RAGAs is third on the list primarily because it also incorporates the latest research into its RAG metrics and is simple to use, but it isn't higher because of its limited features and inflexibility as a framework.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-openai-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# prepare your huggingface dataset in the format
# Dataset({
#     features: ['question', 'contexts', 'answer', 'ground_truths'],
#     num_rows: 25
# })
&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/explodinggradients/ragas" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star RAGAs on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  4. Deepchecks
&lt;/h1&gt;

&lt;p&gt;Deepchecks stands out as it is geared more towards evaluating the LLM itself, rather than LLM systems/applications. &lt;/p&gt;

&lt;p&gt;It is not higher on the list due to its complicated developer experience (seriously, try setting it up yourself and let me know how it goes), but its open-source offering is unique in that it focuses heavily on dashboards and visualization, making it easy for users to explore evaluation results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknagi6pkcj8d0vednz74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknagi6pkcj8d0vednz74.png" alt=" " width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/deepchecks/deepchecks" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star Deepchecks on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  5. Arize AI Phoenix
&lt;/h1&gt;

&lt;p&gt;Last on the list, Arize AI evaluates LLM applications through extensive observability into LLM traces. However, it is quite limited, offering only three evaluation criteria:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;QA Correctness&lt;/li&gt;
&lt;li&gt;Hallucination&lt;/li&gt;
&lt;li&gt;Toxicity&lt;/li&gt;
&lt;/ol&gt;
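&lt;p&gt;For context, criteria like these are typically implemented as LLM-as-a-judge checks that grade an output against its retrieved context. As a rough illustration (a toy sketch, not the Phoenix API), a hallucination score boils down to "what fraction of the answer is unsupported by the context":&lt;/p&gt;

```python
# Illustrative sketch only -- not the Phoenix API. A real hallucination
# eval asks an LLM judge; this toy version just checks whether each
# sentence of the answer shares any words with the retrieved context.

def toy_hallucination_score(answer: str, context: str) -> float:
    """Fraction of answer sentences with zero word overlap with the context."""
    context_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    unsupported = sum(
        1 for s in sentences
        if not set(s.lower().split()) & context_words
    )
    return unsupported / len(sentences)

# "It also brews coffee." shares no words with the context, so half of
# the answer is flagged as unsupported.
score = toy_hallucination_score(
    answer="Phoenix traces LLM calls. It also brews coffee.",
    context="Phoenix offers tracing and observability for LLM calls.",
)
```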

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b8iomrnqz1djeyva3qy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b8iomrnqz1djeyva3qy.png" alt=" " width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Arize-ai/phoenix" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star Phoenix on GitHub&lt;/a&gt;
&lt;/p&gt;




&lt;p&gt;So there you have it, the list of top LLM evaluation frameworks GitHub has to offer in 2025. Think there's something I've missed? Comment below to let me know! &lt;/p&gt;

&lt;p&gt;Thank you for reading, and till next time 😊&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>🔪 6 Killer Open-Source Libraries to Achieve AI Mastery in 2024 🔥🪄</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Mon, 11 Dec 2023 10:41:32 +0000</pubDate>
      <link>https://dev.to/confidentai/6-killer-open-source-libraries-to-achieve-ai-mastery-before-2024-4p1c</link>
      <guid>https://dev.to/confidentai/6-killer-open-source-libraries-to-achieve-ai-mastery-before-2024-4p1c</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;AI has traditionally been a very difficult field for web developers to break into... until now 😌 With the introduction of large language models (LLMs) like ChatGPT, it seems like nowadays anyone can become an AI engineer. But make no mistake, this couldn't be further from the truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article, I will reveal the current top AI libraries that make a mediocre AI engineer exceptional.&lt;/strong&gt; As an ex-Google, ex-Microsoft AI engineer myself, I will show you how exceptional AI engineers use these libraries to build great applications.&lt;/p&gt;

&lt;p&gt;Are you ready to up-skill yourself and be one step closer to becoming an AI wizard before 2024? Let's begin 🤗&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/IN8gg3Gci335S/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/IN8gg3Gci335S/giphy.gif" width="499" height="199"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  1. &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt; - Open-source Evaluation Infrastructure for LLMs
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mefpmuz0bkbgzyy5ujd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mefpmuz0bkbgzyy5ujd.png" alt="Image description" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A good engineer can build, but an exceptional engineer can communicate the value of what they've built. DeepEval allows you to do exactly that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepEval allows you to unit test and debug your large language model (LLM, or just AI) applications at scale in both development and production&lt;/strong&gt; in under 10 lines of code.&lt;/p&gt;

&lt;p&gt;Why is this valuable, you ask? Because companies nowadays want to be seen as innovative AI companies, so stakeholders prefer engineers who can not just build like an indie hacker, but who know how to &lt;strong&gt;ship reliable AI applications&lt;/strong&gt; like a seasoned AI specialist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_test&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnswerRelevancyMetric&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chatbot&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_chatbot&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
   &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How to become an AI engineer in 2024?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
   &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
   &lt;span class="n"&gt;answer_relevancy_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnswerRelevancyMetric&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;answer_relevancy_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  2. Unstructured - Pre-processing for Unstructured Data
&lt;/h1&gt;

&lt;p&gt;LLMs thrive because they are versatile and can handle a large variety of inputs, but not all. Unstructured helps you easily transform unstructured data like webpages, PDFs, and tables into formats LLMs can read.&lt;/p&gt;

&lt;p&gt;What does this mean? It means you can now customize your AI application on your internal documents. Unstructured is amazing because, in my opinion, it operates at the right level of abstraction: it handles the boring hard work while giving you enough control as a developer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unstructured.partition.auto&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;

&lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example-docs/eml/fake-email.eml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/Unstructured-IO/unstructured" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star Unstructured&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  3. Airbyte - Data Integration for LLMs
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvz4bwmtilvs5qwu8qgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvz4bwmtilvs5qwu8qgg.png" alt="Image description" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Airbyte you can connect data sources and move data around: most of what you need to build a real-time AI application. It allows your LLM to access information beyond the data it was trained on.&lt;/p&gt;

&lt;p&gt;Like Unstructured, Airbyte provides a great level of abstraction over the work an AI engineer does.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/airbytehq/airbyte" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star Airbyte&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  4. Qdrant - Fast Vector Search for LLMs
&lt;/h1&gt;

&lt;p&gt;Ever wondered what happens if you feed too much data to ChatGPT? That's right, you'll hit the model's context length limit.&lt;/p&gt;

&lt;p&gt;That's because LLMs cannot take in infinite information. To help with that, we need a way to feed in only the relevant information, and this process is known as retrieval augmented generation (RAG). &lt;a href="https://www.confident-ai.com/blog/what-is-retrieval-augmented-generation" rel="noopener noreferrer"&gt;Here's another great article on what RAG is.&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qdrant is a vector database that helps you do just that. It stores and retrieves relevant information at blazing-fast speed, ensuring your application stays up to date with the real world.&lt;/p&gt;
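&lt;p&gt;Under the hood, a vector database embeds your documents and then ranks them by similarity to an embedded query. Here's a minimal sketch of the idea, using toy vectors rather than real embeddings and plain Python rather than the Qdrant client:&lt;/p&gt;

```python
# Illustrative sketch of what a vector database does -- not the Qdrant
# client API. Real embeddings would come from an embedding model.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# A tiny in-memory "collection": document id mapped to its embedding.
collection = {
    "doc_pricing": [0.9, 0.1, 0.0],
    "doc_refunds": [0.1, 0.9, 0.1],
    "doc_signup": [0.0, 0.2, 0.9],
}

def search(query_vector, top_k=1):
    """Return the ids of the top_k documents most similar to the query."""
    ranked = sorted(
        collection,
        key=lambda doc_id: cosine_similarity(collection[doc_id], query_vector),
        reverse=True,
    )
    return ranked[:top_k]
```

In a RAG pipeline, the text of the returned documents is then pasted into the LLM prompt, so the model only sees the relevant information.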

&lt;p&gt;&lt;a href="https://github.com/qdrant/qdrant" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star Qdrant&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  5. MemGPT - Memory Management for LLMs
&lt;/h1&gt;

&lt;p&gt;So Qdrant helps give LLMs "long-term memory", but what happens if there's too much to "remember"? MemGPT helps you manage memory for this exact use case. &lt;/p&gt;

&lt;p&gt;MemGPT is like a cache for vector databases, with its own way of clearing cached entries. It helps you manage redundant information in your knowledge bases, making your AI application more performant and accurate.&lt;/p&gt;
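&lt;p&gt;To make the cache analogy concrete: the core idea is an eviction policy over a bounded memory budget, much like an LRU cache. Here's a minimal toy sketch of that idea (not the MemGPT API):&lt;/p&gt;

```python
# Illustrative sketch of the caching idea behind LLM memory management,
# not the MemGPT API. An LRU policy keeps the most recently used
# memories and evicts the rest once the budget is exceeded.
from collections import OrderedDict

class ToyMemory:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def remember(self, key: str, value: str) -> None:
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

    def recall(self, key: str):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as recently used
        return self.store[key]

memory = ToyMemory(capacity=2)
memory.remember("user_name", "Ada")
memory.remember("user_goal", "learn RAG")
memory.recall("user_name")              # touch: user_name is now most recent
memory.remember("user_tone", "formal")  # over budget: evicts user_goal
```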

&lt;p&gt;&lt;a href="https://github.com/cpacker/MemGPT" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star MemGPT&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  6. LiteLLM - LLM proxy
&lt;/h1&gt;

&lt;p&gt;LiteLLM is a proxy for multiple LLMs. It is great for experimentation and, combined with &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt;, allows you to pick the best model for your use case. The best part? It lets you call any supported model through the same OpenAI interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;litellm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;## set ENV variables 
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-openai-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, how are you?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

&lt;span class="c1"&gt;# openai call
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/BerriAI/litellm" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star LiteLLM&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  Closing Remarks
&lt;/h1&gt;

&lt;p&gt;That's all folks, thanks for reading, and I hope you learned a few things along the way!&lt;/p&gt;

&lt;p&gt;Please like and comment if you enjoyed this article, and as always, don't forget to give open-source some love by starring their repos as a token of appreciation 🌟.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>⤴️How to build a Midjourney API with Nest.js 🚀</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Wed, 29 Nov 2023 10:59:17 +0000</pubDate>
      <link>https://dev.to/confidentai/how-to-build-unofficial-midjourney-api-with-nestjs-1lnd</link>
      <guid>https://dev.to/confidentai/how-to-build-unofficial-midjourney-api-with-nestjs-1lnd</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;In this post I will show you the architecture for building an unofficial Midjourney API with TypeScript and Nest.js.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famqplf9oby0m0pvbzau4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famqplf9oby0m0pvbzau4.gif" alt="Lets go" width="480" height="320"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h3&gt;

&lt;h4&gt;
  
  
  DeepEval evaluates performance based on metrics such as factual consistency, accuracy, answer relevancy
&lt;/h4&gt;

&lt;p&gt;We are just starting out.&lt;br&gt;
Can you help us with a star, please? 😽&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  ➡️ Please &lt;strong&gt;Like, Heart and star this article&lt;/strong&gt; 
&lt;/h2&gt;
&lt;h2&gt;
  
  
  What are we going to build?
&lt;/h2&gt;

&lt;p&gt;To start off, let's understand how Midjourney works on Discord. People use simple commands to talk to an AI bot. The bot then takes these commands and creates pictures that match the descriptions given by the users.&lt;/p&gt;

&lt;p&gt;To mimic this behavior, we will need to create an API that interacts with a Discord bot. This bot can interact with Discord and hence send commands to Midjourney. Here is a high-level design of the command flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5dyahofq7juij9e83sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5dyahofq7juij9e83sj.png" alt="architecture of midjourney" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: This tutorial is for educational purposes only, to understand how Discord and Midjourney interact. It is not recommended for use as a production service or in any official project.&lt;/p&gt;

&lt;p&gt;The reason I chose Nest.js is that there are plenty of Python examples of how to build a Midjourney API, but no decent one that shows the process using JavaScript or Node.js. I also prefer Nest.js because it is well organized and makes bootstrapping a project easy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Connecting a Midjourney's Discord bot
&lt;/h3&gt;

&lt;p&gt;In the absence of a formal API, a connection to Midjourney is facilitated via a Discord bot. The process includes the following steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: In order for this to work, you must have a Midjourney subscription.&lt;/p&gt;


&lt;h4&gt;
  
  
  Step 1: Create a Discord bot.
&lt;/h4&gt;

&lt;p&gt;Take a moment to help me please. I am working very hard to create the best open source for LLM evaluation. &lt;/p&gt;

&lt;p&gt;Please give me a star - I will truly appreciate it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;The first step towards a complete Midjourney API is to create our Discord bot. Discord has an interface for creating bots for different purposes. Go ahead &lt;a href="https://discord.com/developers/applications" rel="noopener noreferrer"&gt;and create your MJ bot&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwg95irzw6u0n7i020ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwg95irzw6u0n7i020ii.png" alt="Discord bot configuration" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a &lt;a href="https://www.upwork.com/resources/how-to-make-discord-bot" rel="noopener noreferrer"&gt;great article&lt;/a&gt; for creating a Discord bot.&lt;/p&gt;

&lt;p&gt;Once you've created the bot, you'll receive an invite link. Use it to invite the bot to your Discord server - we'll use this later to generate and receive images.&lt;/p&gt;




&lt;h4&gt;
  
  
  Step 2: Implementing /Imagine command
&lt;/h4&gt;

&lt;p&gt;After creating a &lt;a href="https://nestjs.com/" rel="noopener noreferrer"&gt;Nest.js&lt;/a&gt; app, go ahead and create your &lt;code&gt;discord&lt;/code&gt; module. This module will interact with our Discord server and Midjourney.&lt;/p&gt;

&lt;p&gt;Let's begin with our &lt;em&gt;controller&lt;/em&gt;, which should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Controller&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;discord&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscordController&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;discordService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DiscordService&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;imagine&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;imagine&lt;/span&gt;&lt;span class="p"&gt;(@&lt;/span&gt;&lt;span class="nd"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;prompt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;discordService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendImagineCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, I have created a discord module with a single &lt;code&gt;POST&lt;/code&gt; endpoint. We will pass a &lt;code&gt;prompt&lt;/code&gt; in the body of our &lt;code&gt;discord/imagine&lt;/code&gt; request.&lt;/p&gt;

&lt;p&gt;Next, let's create our discord service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Injectable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscordService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;httpService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;HttpService&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;


  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;sendImagineCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;postUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://discord.com/api/v9/interactions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;uniqueId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateId&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;postPayload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;application_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;guild_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;GUILD_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;CHANNEL_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;SESSION_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;COMMAND_VERSION&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;IMAGINE_COMMAND_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;imagine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prompt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; --no &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;uniqueId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="na"&gt;application_command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;IMAGINE_COMMAND_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;application_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;COMMAND_VERSION&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;default_member_permissions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;nsfw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;imagine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Create images with Midjourney&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;dm_permission&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;contexts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prompt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The prompt to imagine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;


    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;postHeaders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;your&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;httpService&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;postUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;postPayload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;postHeaders&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toPromise&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;uniqueId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;


  &lt;span class="nf"&gt;generateId&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will notice a few things here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We are using the &lt;code&gt;https://discord.com/api/v9/interactions&lt;/code&gt; Discord endpoint to interact with the Discord server and send commands. This is the main entry point for requests to Midjourney. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We mimic a web-browser request to Discord, and here is the real "magic": after signing in to Midjourney on the web, go ahead and send the &lt;code&gt;/imagine&lt;/code&gt; command to Midjourney from your Discord web interface. &lt;br&gt;
Once the request is sent, you will see the imagine command in the &lt;code&gt;Network&lt;/code&gt; tab of your browser's developer tools; its payload is very similar to the one above. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy the relevant fields: &lt;code&gt;IMAGINE_COMMAND_ID&lt;/code&gt;, &lt;code&gt;COMMAND_VERSION&lt;/code&gt;, &lt;code&gt;SESSION_ID&lt;/code&gt;, &lt;code&gt;GUILD_ID&lt;/code&gt;, &lt;code&gt;CHANNEL_ID&lt;/code&gt; and &lt;code&gt;APPLICATION_ID&lt;/code&gt;. These will be used in our service. We also need to copy the &lt;code&gt;MIDJOURNEY_TOKEN&lt;/code&gt;, which is sent as part of the request. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy the &lt;code&gt;BOT_TOKEN&lt;/code&gt; from the bot application page we created earlier. It is required in order to communicate with our bot. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will also notice the &lt;code&gt;uniqueId&lt;/code&gt; that we generate with our &lt;code&gt;generateId()&lt;/code&gt; function. We append it to the prompt via Midjourney's &lt;code&gt;--no&lt;/code&gt; parameter so we can later trace the request we sent to Discord and fetch the generated images. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
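&lt;p&gt;For reference, the values copied above typically end up as constants in the service, loaded from environment variables (this is a minimal sketch; the constant names simply mirror the placeholders used in this post, and you should never commit real tokens to source control):&lt;/p&gt;

```javascript
// Values copied from the browser's Network tab and the bot application page.
// Keep real tokens in environment variables, never hard-coded in the repo.
const IMAGINE_COMMAND_ID = process.env.IMAGINE_COMMAND_ID;
const COMMAND_VERSION = process.env.COMMAND_VERSION;
const SESSION_ID = process.env.SESSION_ID;
const GUILD_ID = process.env.GUILD_ID;
const CHANNEL_ID = process.env.CHANNEL_ID;
const APPLICATION_ID = process.env.APPLICATION_ID;
const MIDJOURNEY_TOKEN = process.env.MIDJOURNEY_TOKEN; // user token seen in the request
const BOT_TOKEN = process.env.BOT_TOKEN; // from the bot application page
```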

&lt;p&gt;Once this step is complete, you can call Discord with the &lt;code&gt;/imagine&lt;/code&gt; command and generate images with Midjourney. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reminder&lt;/strong&gt;: This is only a technical post describing how this flow works; using it in any real project is not recommended.&lt;/p&gt;




&lt;h4&gt;
  
  
  Step 3: Fetching generated images.
&lt;/h4&gt;

&lt;p&gt;Let's create a new controller to fetch images:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mj/results/:id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;getMidjourneyResults&lt;/span&gt;&lt;span class="p"&gt;(@&lt;/span&gt;&lt;span class="nd"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;discordService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getResultFromMidjourney&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;attachmentUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attachmentUrl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;discordService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processAndUpload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attachmentUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the unique &lt;code&gt;id&lt;/code&gt; generated when creating our &lt;code&gt;/imagine&lt;/code&gt; request to fetch the results from Discord.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;getResultFromMidjourney&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MIDJOURNEY_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;channelUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`https://discord.com/api/v9/channels/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;CHANNEL_ID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/messages?limit=50`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;httpService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;channelUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;toPromise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;matchingMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
        &lt;span class="nx"&gt;message&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;components&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;component&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;component&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;components&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;label&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;U1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// means that we can upscale results&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attachments&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;attachment&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;attachment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchAndEncodeImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attachment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="c1"&gt;// do something &lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;fetchAndEncodeImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AxiosResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;httpService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;responseType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arraybuffer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;toPromise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;binary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`data:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;content-type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt;;base64,&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;https://discord.com/api/v9/channels/${CHANNEL_ID}/messages?limit=50&lt;/code&gt; endpoint fetches the most recent messages in our Discord channel, from which we retrieve our images. &lt;/p&gt;

&lt;p&gt;Since Midjourney generation takes about 60 seconds or more, we need to poll this channel every few seconds to check for results. &lt;/p&gt;
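&lt;p&gt;The polling loop itself can be sketched with a small helper. This is a minimal, illustrative sketch; the helper name, interval, and attempt count are assumptions, not part of the service above:&lt;/p&gt;

```javascript
// Minimal polling helper: call fetchResults until it returns something
// truthy, waiting intervalMs between attempts, up to maxAttempts tries.
async function pollUntil(fetchResults, intervalMs = 10000, maxAttempts = 30) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchResults();
    if (result) return result; // Midjourney finished and the message was found
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return null; // gave up: generation took too long or the id never matched
}
```

A client could then wrap a call to our &lt;code&gt;mj/results/:id&lt;/code&gt; endpoint in this helper and stop as soon as attachments appear.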

&lt;p&gt;Let's give it a try with &lt;code&gt;{ prompt: "a cat" }&lt;/code&gt; :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8y1fbxfunpovnju6n8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8y1fbxfunpovnju6n8r.png" alt="Midjourney API cat" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;That's it! You now have a working Midjourney API for testing and fun, and you've learned how the Discord bot architecture works. &lt;/p&gt;

&lt;h3&gt;
  
  
  Final thoughts
&lt;/h3&gt;

&lt;p&gt;You now have a bootstrap project that demonstrates how Discord communicates with Midjourney to generate amazing AI images.&lt;br&gt;
You can build a nice UI on top of it and have your own generative AI platform. Good luck!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>midjourney</category>
      <category>api</category>
      <category>nestjs</category>
    </item>
    <item>
      <title>Why OpenAI Assistants is a Big Win for LLM Evaluation</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Fri, 24 Nov 2023 12:13:34 +0000</pubDate>
      <link>https://dev.to/confidentai/why-openai-assistants-is-a-big-win-for-llm-evaluation-540l</link>
      <guid>https://dev.to/confidentai/why-openai-assistants-is-a-big-win-for-llm-evaluation-540l</guid>
<description>&lt;p&gt;A week after the famous, or infamous, OpenAI Dev Day, we at Confident AI released JudgementalGPT — an LLM agent built with OpenAI’s Assistants API, designed specifically to evaluate other LLM applications. What started as an experimental idea quickly turned into a prototype we were eager to ship, as users reported that JudgementalGPT gave more accurate and reliable results than other state-of-the-art LLM-based evaluation approaches such as G-Eval.&lt;/p&gt;

&lt;p&gt;Understandably, knowing that &lt;a href="https://www.confident-ai.com/" rel="noopener noreferrer"&gt;Confident AI is the world’s first open-source evaluation infrastructure for LLMs&lt;/a&gt;, many demanded more transparency into how JudgementalGPT was built after our initial public release:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I thought it’s all open source, but it seems like JudgementalGPT, in particular, is a black box for users. It would be great if we had more knowledge on how this is built.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So here you go, dear anonymous internet stranger, this article is dedicated to you.&lt;/p&gt;



&lt;h1&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;DeepEval is a framework that helps engineers evaluate the performance of their LLM applications by providing default metrics to measure hallucination, relevancy, and much more.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are just starting out, and we really want to help more developers build safer AI apps. Would you mind giving it a star to spread the word, please? 🥺❤️🥺&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h1&gt;
  
  
  Limitations of LLM-based evaluations
&lt;/h1&gt;

&lt;p&gt;The authors of G-Eval &lt;a href="https://arxiv.org/pdf/2303.16634.pdf" rel="noopener noreferrer"&gt;state that&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For those who don’t already know, &lt;a href="https://www.confident-ai.com/blog/a-gentle-introduction-to-llm-evaluation" rel="noopener noreferrer"&gt;G-Eval is a framework that utilizes Large Language Models (LLMs) with chain-of-thought (CoT) processing to evaluate the quality of generated texts in a form-filling paradigm&lt;/a&gt;, and if you’ve ever tried implementing a version of your own, you’ll quickly find that using LLMs for evaluation presents its own set of problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unreliability&lt;/strong&gt; — although G-Eval uses a low-precision grading scale (1–5), which makes scores easier to interpret, scores can still vary a lot under the same evaluation conditions. This variability comes from an intermediate step in G-Eval that dynamically generates the evaluation steps, which increases the stochasticity of the final scores (and is also why providing an initial seed value doesn’t help).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inaccuracy&lt;/strong&gt; — for certain tasks, one digit usually dominates (e.g., 3 for a grading scale of 1–5 using gpt-3.5-turbo). A way to get around this problem would be to take the probabilities of output tokens from an LLM to normalize the scores and take their weighted summation as the final score. But, unfortunately, this isn’t an option if you’re using OpenAI’s GPT models as an evaluator, since they deprecated the logprobs parameter a few months ago.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
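&lt;p&gt;To make the weighted-summation idea concrete, here is a minimal sketch (plain JavaScript; the function name and token probabilities are invented for illustration, not real model output):&lt;/p&gt;

```javascript
// Combine the probability the evaluator assigns to each score token into a
// single probability-weighted score, instead of taking the most likely digit.
function weightedScore(tokenProbs) {
  const total = Object.values(tokenProbs).reduce((sum, p) => sum + p, 0);
  return Object.entries(tokenProbs).reduce(
    (sum, [score, p]) => sum + Number(score) * (p / total),
    0
  );
}

// Even when "3" is the single most likely token, the weighted score
// reflects the whole distribution:
weightedScore({ 1: 0.05, 2: 0.1, 3: 0.5, 4: 0.25, 5: 0.1 }); // ≈ 3.25
```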

&lt;p&gt;In fact, &lt;a href="https://arxiv.org/pdf/2306.05685.pdf" rel="noopener noreferrer"&gt;another paper that explored LLM-as-a-judge&lt;/a&gt; pointed out that using LLMs as an evaluator is flawed in several ways. For example, GPT-4 gives preferential treatment to self-generated outputs, is not very good at math (but neither am I), and is prone to verbosity bias, meaning it favors longer, more verbose responses over shorter, more accurate alternatives. &lt;em&gt;(In fact, an initial study has shown that GPT-4 exhibits verbosity bias 8.75% of the time.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Can you see how this becomes a problem if you’re trying to evaluate a summarization task?&lt;/p&gt;



&lt;h1&gt;
  
  
  OpenAI Assistants offers a workaround to existing problems
&lt;/h1&gt;

&lt;p&gt;Here’s a surprise — JudgementalGPT isn’t composed of one evaluator built using the new OpenAI Assistant API, but multiple. That’s right, behind the scenes, JudgementalGPT is a proxy for multiple assistants that perform different evaluations depending on the evaluation task at hand. Here are the problems JudgementalGPT was designed to solve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bias&lt;/strong&gt; — we’re still experimenting with this (another reason for close-sourcing JudgementalGPT!), but assistants can write and execute code using the code interpreter tool. This means that, with a bit of prompt engineering, they can handle tasks that are more prone to logical fallacies, such as assessing coding or math problems, or tasks that require factual accuracy rather than preferential treatment of the model’s own outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliability&lt;/strong&gt; — since we no longer require LLMs to dynamically generate CoTs/evaluation steps, we can enforce a set of rules for specific evaluation tasks. In other words, since we’ve pre-defined multiple sets of evaluation steps based on the evaluation task at hand, we have removed the biggest parameter contributing to stochasticity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt; — having a set of pre-defined evaluation steps for different tasks also means we can provide more guidance based on what we as humans actually expect from each evaluator, and quickly iterate on the implementation based on user feedback.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another insight we gained while integrating G-Eval into our open-source project DeepEval was that evaluation steps generated by LLMs tend to be arbitrary and generally do not provide helpful guidance for evaluation. Some of you might also wonder what happens when JudgementalGPT can’t find a suitable evaluator for a particular evaluation task. For this edge case, we default back to G-Eval. Here’s a quick architecture diagram of how JudgementalGPT works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ed8bxf1noyles82a4mp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ed8bxf1noyles82a4mp.png" alt="Image description" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;
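&lt;p&gt;In spirit, the routing can be sketched like this (purely illustrative: the task keys, steps, and function name are invented here, since the real implementation is closed-source):&lt;/p&gt;

```javascript
// Sketch of "one evaluator per task" routing: each known evaluation task
// maps to a fixed, human-written list of evaluation steps.
const EVALUATION_STEPS = {
  summarization: [
    "Check that every claim in the summary appears in the original text.",
    "Penalize omission of key information.",
  ],
  code: [
    "Reason through (or execute) the code to verify correctness.",
    "Judge factual correctness before style.",
  ],
};

// Known tasks get deterministic, pre-defined steps; anything else falls
// back to G-Eval's dynamically generated chain-of-thought.
function selectEvaluator(task) {
  const steps = EVALUATION_STEPS[task];
  return steps
    ? { strategy: "predefined", steps }
    : { strategy: "g-eval", steps: null };
}
```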

&lt;p&gt;As I’m writing this article, I discovered a recent paper introducing &lt;a href="https://arxiv.org/pdf/2310.08491.pdf" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, “a fully open-source LLM that is on par with GPT-4’s evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied”, which also requires evaluation steps to be explicitly defined.&lt;/p&gt;




&lt;h1&gt;
  
  
  Still, problems with LLM-based evaluation linger
&lt;/h1&gt;

&lt;p&gt;One unresolved issue is the accuracy problem caused by evaluation scores clustering around a single dominant digit. In theory, this phenomenon isn’t exclusive to older models and is likely to affect advanced versions like gpt-4-1106-preview as well. So, I’m keeping an open mind about how this might affect JudgementalGPT. We’re really looking forward to more research that’ll either back up what we think or give us a whole new perspective — either way, I’m all ears.&lt;/p&gt;

&lt;p&gt;Lastly, there can still be intricacies involved in defining our own set of evaluators. For example, just as G-Eval isn’t a one-size-fits-all solution, neither is a summarization or relevancy metric. Any metric that is subject to interpretation is guaranteed to disappoint users who expect something different. For now, the best solution is to have users clearly define their evaluation criteria to rid LLMs of any evaluation ambiguity.&lt;/p&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;At the end of the day, there’s no one-size-fits-all solution for LLM-based evaluations, which is why engineers/data scientists are frequently disappointed by non-human evaluation scores. However, by defining specific and concise evaluation steps for different use cases, LLMs are able to navigate ambiguity better, as they are provided more guidance into what a human might expect for different evaluation criteria.&lt;/p&gt;

&lt;p&gt;P.S. By now, those of you who read between the lines will probably know the key to building a better evaluator is to tailor them for specific use cases, and OpenAI’s new Assistant API along with its code interpreter functionality is merely the icing on the cake (and a good marketing strategy!).&lt;/p&gt;

&lt;p&gt;So, dear anonymous internet stranger, I hope you’re satisfied, and till next time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>learning</category>
    </item>
    <item>
      <title>⤴️ Be a prompt engineer: Understanding Midjourney LLM</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Wed, 15 Nov 2023 09:43:06 +0000</pubDate>
      <link>https://dev.to/confidentai/be-a-prompt-engineer-understanding-midjourney-llm-464k</link>
      <guid>https://dev.to/confidentai/be-a-prompt-engineer-understanding-midjourney-llm-464k</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;By now, you've probably seen those incredible AI-generated images on your social feeds and thought to yourself, &lt;strong&gt;"How are people making these amazing images?"&lt;/strong&gt; So you jump onto Midjourney, ready to create your own, but somehow, what comes out isn't quite what you pictured.&lt;/p&gt;

&lt;p&gt;Don't worry — I've got you covered.&lt;br&gt;
In order to get amazing images out of Midjourney, you need to be able to write prompts like a pro. Since Midjourney is based on an LLM, it all comes down to understanding its nature and how to get the most out of it.&lt;/p&gt;

&lt;p&gt;Do you want to become a Prompt Hero? Then this guide is for you!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntctmiu0qh7j4wljj8a3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntctmiu0qh7j4wljj8a3.png" alt="Midjourney prompt hero" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h3&gt;
&lt;h4&gt;
  
  
  DeepEval evaluates performance based on metrics such as factual consistency, accuracy, answer relevancy
&lt;/h4&gt;

&lt;p&gt;We are just starting out.&lt;br&gt;
Can you help us with a star, please? 😽&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Creating your first Midjourney artwork
&lt;/h2&gt;

&lt;p&gt;To get started with Midjourney, sign up for &lt;a href="https://discord.com/register" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; and complete the registration process. Once you have Discord running, open the &lt;a href="//midjourney.com"&gt;Midjourney website&lt;/a&gt; and choose &lt;code&gt;Join Beta&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxkuyy1upc0w6da6ndeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxkuyy1upc0w6da6ndeg.png" alt="Midjourney website" width="794" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After signing up, you can select a paid or a free plan. &lt;br&gt;
If you are on a free plan, you can generate images in any of the Midjourney newbies channels. Paid users can send commands directly to the Midjourney bot. &lt;/p&gt;

&lt;p&gt;To create your first image, type &lt;code&gt;/&lt;/code&gt; followed by the &lt;code&gt;imagine&lt;/code&gt; command. You can then enter a prompt (a description of the image to generate), for example: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine&lt;/code&gt; &lt;code&gt;prompt: beautiful colorful horse&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hkb2mjtottaat2jc4hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hkb2mjtottaat2jc4hb.png" alt="beautiful horse" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Midjourney will generate an image based on your prompt. &lt;/p&gt;
&lt;h2&gt;
  
  
  How does Midjourney work?
&lt;/h2&gt;

&lt;p&gt;Midjourney uses an LLM (a large language model) to create images from text descriptions. This model has been trained on a vast array of text-image pairs, enabling it to understand and interpret the text prompts to produce similar images.&lt;/p&gt;

&lt;p&gt;Let's break down this image creation process:&lt;/p&gt;
&lt;h4&gt;
  
  
  Analyzing the Prompt
&lt;/h4&gt;

&lt;p&gt;The LLM starts by dissecting the prompt into its core ideas and terms. If you input something like "a photorealistic portrait of a woman," the system identifies key concepts like "photorealistic," "portrait," and "woman."&lt;/p&gt;

&lt;p&gt;A basic Midjourney prompt looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyf29vpfbmdgc9ygorm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyf29vpfbmdgc9ygorm9.png" alt="basic prompt" width="570" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A more advanced prompt may look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d6eodywvyu8wnivfmop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d6eodywvyu8wnivfmop.png" alt="advanced prompt" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll get back to that later. What's important is to understand that whatever you write is used to create the &lt;strong&gt;latent vector&lt;/strong&gt; in the following step.&lt;/p&gt;
&lt;h4&gt;
  
  
  Generating a Latent Vector
&lt;/h4&gt;

&lt;p&gt;Next, the LLM translates these concepts into a latent vector. This is a numerical code that captures all the image details - its color palette, shapes, style, objects, and more.&lt;/p&gt;

&lt;p&gt;All those parameters are used inside the model to understand your request, by matching the vector to data it already knows and has been trained on.&lt;/p&gt;

&lt;p&gt;This is why the following tip from the official Midjourney documentation is important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Midjourney Bot works best with simple, short sentences that describe what you want to see. Avoid long lists of requests. Instead of: "Show me a picture of lots of blooming California poppies, make them bright, vibrant orange, and draw them in an illustrated style with colored pencils," try: "Bright orange California poppies drawn with colored pencils."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;Pro tip: use short prompts!&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;h4&gt;
  
  
  Using a Diffusion Model to generate the image
&lt;/h4&gt;

&lt;p&gt;The final step of generating the image involves converting this latent vector into the actual image. This is where a diffusion model comes into play. It's a kind of AI that can form images from seemingly random patterns.&lt;/p&gt;

&lt;p&gt;Starting with a blank canvas, the model slowly refines the image, adding layers of detail until it reflects what the latent vector describes. The way it removes this 'noise' is controlled, making sure the final image is clear and recognizable.&lt;/p&gt;
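&lt;p&gt;As a rough mental model (a toy sketch, not Midjourney's actual model), you can think of each refinement step as moving the noisy image a fraction closer to what the latent vector describes:&lt;/p&gt;

```python
# Toy illustration of diffusion-style generation: start from pure noise
# and repeatedly nudge every "pixel" toward the target description,
# reducing the remaining noise each step. Purely illustrative.
import random

def denoise(target, steps=10, seed=0):
    rng = random.Random(seed)
    image = [rng.uniform(-1.0, 1.0) for _ in target]  # pure noise
    for _ in range(steps):
        # each step removes half of the remaining distance to the target
        image = [px + 0.5 * (t - px) for px, t in zip(image, target)]
    return image

target = [0.2, 0.8, -0.3]  # stand-in for what the latent vector describes
result = denoise(target)
```

&lt;p&gt;Early iterations are mostly noise (the blurry previews you see), while later iterations are nearly indistinguishable from the target.&lt;/p&gt;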

&lt;p&gt;Other well-known generative AI platforms, such as &lt;a href="https://stability.ai/" rel="noopener noreferrer"&gt;Stable Diffusion&lt;/a&gt;, use the same techniques. &lt;/p&gt;

&lt;p&gt;This is also why, while waiting for Midjourney to complete its image creation, you notice blurry images that eventually turn into amazing artwork. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxcwh5azpn7qomzq5f6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxcwh5azpn7qomzq5f6u.png" alt="Diffusion model" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The basics
&lt;/h2&gt;

&lt;p&gt;Begin with a short prompt and focus on what you want to create: the subject.&lt;br&gt;
Let's say we are interested in creating a portrait of a woman. We can begin with something like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine A portrait of a young woman with light blue eyes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmayr4bnu62lc28wn1lc8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmayr4bnu62lc28wn1lc8.png" alt="A portrait of a young woma" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have our initial image, it is all about iterations and improvements. We can now focus on details that matter, such as &lt;strong&gt;medium, mood, composition, environment&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Let's say we want to get a more &lt;strong&gt;realistic&lt;/strong&gt; photo:&lt;br&gt;
&lt;code&gt;/imagine A realistic photo of a young woman with light blue eyes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau1dmx9xz04f3qfsdk88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau1dmx9xz04f3qfsdk88.png" alt="A realistic photo" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This one is more realistic; however, let's give it the look of an old photograph. To achieve that, we can simply add a year, say, 1960.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine A realistic photo of a young woman with light blue eyes, year 1960&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdd4vboiw2mymve5r5xg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdd4vboiw2mymve5r5xg.png" alt="year 1960" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've come a long way by only adding small details, such as the year and the medium type (realistic).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Pro tip: The Midjourney Bot does not comprehend grammar, sentence structure, or words as humans do. Using fewer words means that each one has a more powerful influence.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, let's add a composition; for instance, if we want a headshot from above, we can revise our prompt accordingly:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine Bird-eye view realistic photo, of a young woman with light blue eyes, 1960&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdstc4p9qlj4f0ptqkv8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdstc4p9qlj4f0ptqkv8h.png" alt="Bird-eye view realistic photo" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty cool, right?&lt;/p&gt;

&lt;p&gt;Continue experimenting with various elements such as environment, emotions, colors, and more to discover the diverse outcomes they can produce.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ymaqcfnurgkdn3hazk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ymaqcfnurgkdn3hazk.png" alt="Midjourney styles" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Midjourney, utilizing a well-trained Large Language Model (LLM) and a diffusion model, has the capability to generate a wide range of variations based on your initial image. This allows for a great deal of flexibility and creativity in the image creation process.&lt;/p&gt;

&lt;p&gt;By instructing the bot to produce either strong or weak variations, you can refine the output step by step. You might start with a broad concept and then progressively narrow down the details, or you could begin with a highly specific image and explore slight adjustments. The process continues until you reach a result that meets your vision or preference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9coeaqhq5mos9ibiq45b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9coeaqhq5mos9ibiq45b.png" alt="Image variation" width="479" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Asking for a strong variation will result in the following images:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw0nr9x4kytkbjp499yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw0nr9x4kytkbjp499yf.png" alt="Image variation" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Advanced techniques
&lt;/h2&gt;

&lt;p&gt;Now that we understand the basics of Midjourney LLM, we can dive into its parameters. Parameters are options added to a prompt that change how an image is generated.&lt;/p&gt;
&lt;h3&gt;
  
  
  Changing aspect ratio
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Pro tip: parameters are always added at the end of the prompt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;One of the most important parameters is the aspect ratio. Midjourney's default aspect ratio is square (1:1), but what if we want to create a great cover image (such as this article's cover) or a portrait image?&lt;br&gt;
We just need to add &lt;code&gt;--ar&lt;/code&gt; followed by the desired ratio at the end of the prompt. For example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine Bird-eye view realistic photo, of a young woman with light blue eyes, 1960 --ar 1:2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ciaa35gke0glvlq6mrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ciaa35gke0glvlq6mrt.png" alt="aspect ratio" width="768" height="1536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the &lt;code&gt;--ar&lt;/code&gt; followed by the aspect ratio here. &lt;/p&gt;
&lt;h3&gt;
  
  
  Getting more artistic
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Using styles
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;--style&lt;/code&gt; parameter replaces the default style of some Midjourney Model Versions. &lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;--style raw&lt;/code&gt; results in images that follow the prompt more literally, with less automatic beautification. Let's look at the following example: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine cat icon&lt;/code&gt; will generate this kind of image, which is beautiful, but not really an icon:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7gq5vfr4qysc32krk0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7gq5vfr4qysc32krk0b.png" alt="Image icon" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we add &lt;code&gt;--style raw&lt;/code&gt; to it, Midjourney will generate a much more relevant image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0vyf84qf321wh9ecfr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0vyf84qf321wh9ecfr0.png" alt="Image icon raw" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Niji model
&lt;/h4&gt;

&lt;p&gt;Midjourney has an alternative model called &lt;code&gt;niji 5&lt;/code&gt; which allows you to use other style parameters. &lt;br&gt;
Adding &lt;code&gt;--niji 5&lt;/code&gt; followed by a style such as &lt;code&gt;cute&lt;/code&gt;, &lt;code&gt;expressive&lt;/code&gt;, &lt;code&gt;original&lt;/code&gt;, or &lt;code&gt;scenic&lt;/code&gt; will result in more sophisticated images. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine cat --niji 5 --style cute&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwhti1lra9tm7hnpetwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwhti1lra9tm7hnpetwv.png" alt="a cute cat" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an LLM-based generator, Midjourney is trained on a huge amount of data, incorporating different artistic styles.&lt;br&gt;
Providing a &lt;code&gt;--stylize&lt;/code&gt; parameter influences how strongly this training is applied, with the range being between 0 and 1000; higher values will generate a more artistic image.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine child's drawing of a dog&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0asxrqqdqf8hbi6k8r32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0asxrqqdqf8hbi6k8r32.png" alt="stylize images" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Ready to become a pro?
&lt;/h2&gt;

&lt;p&gt;Before moving forward, I would appreciate it if you could like or 'heart' this article — it would help me a lot.&lt;/p&gt;

&lt;p&gt;Also, please check out my open-source GitHub library. Would you mind giving it a star? ❤️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Here comes the fun part. But before we start, I would like to share with you the way I create nice photos and understand Midjourney LLM better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding inspirations
&lt;/h3&gt;

&lt;p&gt;When looking for inspiration, I head to the &lt;a href="https://www.midjourney.com/showcase/" rel="noopener noreferrer"&gt;Midjourney Showcase page &lt;/a&gt; where I look for inspiring photos. Once I've found one, I download the photo and ask Midjourney to &lt;code&gt;describe&lt;/code&gt; it. This is effectively reverse engineering the LLM: it reveals how Midjourney transforms text into images.&lt;/p&gt;

&lt;p&gt;For example, I have found this image interesting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiv2awjg9i4w23wlww5s1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiv2awjg9i4w23wlww5s1.png" alt="elephant Midjourney" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And asked Midjourney to describe it using &lt;code&gt;/describe&lt;/code&gt; command. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjowbs5wno5iusu8hwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjowbs5wno5iusu8hwf.png" alt="Describe image" width="800" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's a good starting point for your next image generation. Take the keywords that created this image and use them to generate images with a similar look and feel.&lt;br&gt;
Here I noticed the text "a polygonal elephant in a dark background", which is dominant, but also &lt;strong&gt;"in the style of graphic design influence, stephen shortridge"&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Pro tip: Midjourney knows how to generate images in the style of a given artist&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Prompt &lt;code&gt;/imagine a polygonal elephant, in the style of stephen shortridge&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff194mkbexn6pzhjseevu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff194mkbexn6pzhjseevu.png" alt="A polygon elephant" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's get weird
&lt;/h3&gt;

&lt;p&gt;We can get unconventional images with the &lt;code&gt;--weird&lt;/code&gt; parameter. When using it, Midjourney creates unique and unexpected outcomes. &lt;code&gt;--weird&lt;/code&gt; accepts values from 0 to 3000 (the default is 0); the higher the value, the more unexpected the outcome.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine elephant --weird ...&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52ky52iwiqkjr37d2o88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52ky52iwiqkjr37d2o88.png" alt="weird elephant" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Permutations
&lt;/h3&gt;

&lt;p&gt;What if we wish to try different colors, say red/green/blue/yellow elephant?&lt;/p&gt;

&lt;p&gt;We can use permutations by adding &lt;code&gt;{ ... }&lt;/code&gt; to our prompt, comma-separating the options. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine a { red, green, blue, yellow } elephant&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will create 4 Midjourney jobs in a single shot. &lt;/p&gt;
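&lt;p&gt;The expansion can be sketched in a few lines of Python (this mimics the behavior for illustration; it is not Midjourney's actual code):&lt;/p&gt;

```python
# Illustrative sketch of how a permutation prompt like
# "a { red, green, blue, yellow } elephant" expands into one job
# per comma-separated option inside the braces.
import re

def expand_permutations(prompt: str) -> list[str]:
    match = re.search(r"\{([^}]*)\}", prompt)
    if not match:
        return [prompt]  # no permutation block: a single job
    options = [opt.strip() for opt in match.group(1).split(",")]
    # substitute each option into the prompt in place of the braces
    return [prompt[:match.start()] + opt + prompt[match.end():] for opt in options]

jobs = expand_permutations("a { red, green, blue, yellow } elephant")
# jobs -> ["a red elephant", "a green elephant", "a blue elephant", "a yellow elephant"]
```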

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r94baafn5fxcn9cvrjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r94baafn5fxcn9cvrjh.png" alt="4 elephants" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Midjourney Tiles
&lt;/h3&gt;

&lt;p&gt;This is probably one of the most amazing, yet hidden, Midjourney features. The &lt;code&gt;--tile&lt;/code&gt; parameter will generate an image that can be repeated seamlessly as a tile.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine watercolor elephant --tile&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u9fu3s6ncg41p24buat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u9fu3s6ncg41p24buat.png" alt="Midjourney tiles" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Final thoughts
&lt;/h3&gt;

&lt;p&gt;Understanding Midjourney's LLM is the key to generating amazing images and photos. &lt;br&gt;
If you know of any other helpful Midjourney prompt engineering tips that I haven't covered in this article, please share them in the comments section below. 👇🏻&lt;/p&gt;

&lt;p&gt;So, that is it for this article.&lt;/p&gt;

&lt;p&gt;Thank you so much for reading! 🤩🙏&lt;/p&gt;

</description>
      <category>llm</category>
      <category>promptengineering</category>
      <category>midjourney</category>
      <category>productivity</category>
    </item>
    <item>
      <title>What is Retrieval Augmented Generation (RAG)? 🚀</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Wed, 25 Oct 2023 10:30:40 +0000</pubDate>
      <link>https://dev.to/confidentai/what-is-retrieval-augmented-generation-rag-4n7g</link>
      <guid>https://dev.to/confidentai/what-is-retrieval-augmented-generation-rag-4n7g</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;In this article, I’m going to talk about what RAG is and how to implement a RAG-based LLM application (yes, with a complete code sample 😚)&lt;/p&gt;

&lt;p&gt;Let’s dive right in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/C2L2bXRnv2chSO1mAH/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/C2L2bXRnv2chSO1mAH/giphy.gif" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DeepEval is a framework that helps engineers evaluate the performance of their LLM applications by providing default metrics to measure hallucination, relevancy, and much more.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are just starting out, and we really want to help more developers build safer AI apps. Would you mind giving it a star to help spread the word, please? 🥺❤️🥺&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h1&gt;
  
  
  What is RAG?
&lt;/h1&gt;

&lt;p&gt;Retrieval augmented generation is a technique in NLP that allows LLMs like ChatGPT to generate customized outputs that are outside the scope of the data they were trained on. An LLM application without RAG is akin to asking ChatGPT to summarize an email without providing the actual email as context.&lt;/p&gt;

&lt;p&gt;A RAG system consists of two primary components: the retriever and the generator.&lt;/p&gt;

&lt;p&gt;The retriever is responsible for searching through the knowledge base for the most relevant pieces of information that correlate with the given input, which are referred to as retrieval results. The generator, on the other hand, utilizes these retrieval results to craft a series of prompts based on a predefined prompt template to produce a coherent and relevant response to the input.&lt;/p&gt;

&lt;p&gt;Here’s a diagram of a RAG architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7m0u8g9hmpjtq6igc9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7m0u8g9hmpjtq6igc9b.png" alt="A typical RAG architecture" width="800" height="696"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In most cases, your “knowledge base” consists of vector embeddings stored in a vector database like ChromaDB, and your “retriever” will 1) embed the given input at runtime, 2) search through the vector space containing your data to find the top K most relevant retrieval results, and 3) rank the results based on relevancy (or distance to your vectorized input embedding). This will then be processed into a series of prompts and passed onto your “generator”, which is your LLM of choice (GPT-4, Llama 2, etc.).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb75hcfkkj8sit0idgbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb75hcfkkj8sit0idgbe.png" alt="Image description" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more curious users, here are the models a retriever commonly employs to extract the most pertinent retrieval results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Neural Network Embeddings&lt;/strong&gt; (eg. OpenAI/Cohere’s embedding models): ranks documents based on their locational proximity in a multidimensional vector space, enabling an understanding of textual relationships and relevance between an input and the document corpus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Best Match 25 (BM25)&lt;/strong&gt;: a probabilistic retrieval model that enhances text retrieval precision. By considering term frequencies with inverse document frequencies, it takes into account term significance, ensuring that both common and rare terms influence the relevance ranking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TF-IDF (Term Frequency — Inverse Document Frequency)&lt;/strong&gt;: calculates the significance of a term within a document relative to the broader corpus. By juxtaposing a term’s occurrence in a document with its rarity across the corpus, it ensures a comprehensive relevance ranking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid Search&lt;/strong&gt;: optimizes the relevance of the search results by assigning distinctive weights to different methodologies, such as Neural Network Embeddings, BM25, and TF-IDF.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
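&lt;p&gt;To make the TF-IDF idea above concrete, here's a minimal, self-contained sketch. This is my own toy illustration, not production retrieval code, and the corpus and query are made up:&lt;/p&gt;

```python
import math
from collections import Counter

def tf_idf_rank(query, documents, k=2):
    """Rank documents by TF-IDF-weighted overlap with the query (toy sketch)."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    # Inverse document frequency: rarer terms across the corpus score higher.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}

    def score(doc_tokens):
        tf = Counter(doc_tokens)
        return sum(
            (tf[term] / len(doc_tokens)) * idf.get(term, 0.0)
            for term in query.lower().split()
        )

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return [documents[i] for i in ranked[:k]]

documents = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock prices rose sharply today",
]
print(tf_idf_rank("cat mat", documents, k=1))  # → ['the cat sat on the mat']
```

&lt;p&gt;In practice you wouldn't roll your own, of course; libraries such as rank-bm25, or the hybrid search built into most vector databases, handle this for you.&lt;/p&gt;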



&lt;h1&gt;
  
  
  Applications
&lt;/h1&gt;

&lt;p&gt;RAG has various applications across different fields due to its ability to combine retrieval and generation of text for enhanced responses. Having worked with numerous companies building LLM applications at Confident, here are the top four use cases I’ve seen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customer support / user onboarding chatbots&lt;/strong&gt;: No surprises here, retrieve data from internal documents to generate more personalized responses. &lt;a href="https://www.confident-ai.com/blog/building-a-customer-support-chatbot-using-gpt-3-5-and-llamaindex" rel="noopener noreferrer"&gt;Click here to read a full tutorial on how to build one yourself using LlamaIndex.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Extraction&lt;/strong&gt;. Interestingly, we can use RAG to extract relevant data from documents such as PDFs. &lt;a href="https://www.confident-ai.com/blog/how-to-build-a-pdf-qa-chatbot-using-openai-and-chromadb" rel="noopener noreferrer"&gt;You can find a tutorial on how to do it here.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sales enablement&lt;/strong&gt;: retrieve data from LinkedIn profiles and email threads to generate more personalized outreach messages&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Content creation and enhancement&lt;/strong&gt;: retrieve data from past message conversations to generate suggested message replies&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the following code walkthrough, we’ll be building a very generalized chatbot, and you’ll be able to customize its functionality into any of the use cases listed above by tweaking the prompts and data stored in your vector database.&lt;/p&gt;



&lt;h1&gt;
  
  
  Project Setup
&lt;/h1&gt;

&lt;p&gt;For this project, we’re going to build a question-answering (QA) chatbot based on your knowledge base. We’re not going to cover the part on how to index your knowledge base, as that’s a discussion for another day.&lt;/p&gt;

&lt;p&gt;We’re going to be using python, ChromaDB for our vector database, and OpenAI for both vector embeddings and chat completion. We’re going to build a chatbot on your favorite Wikipedia page.&lt;/p&gt;

&lt;p&gt;First, set up a new project directory and install the dependencies we need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;rag-llm-app
&lt;span class="nb"&gt;cd &lt;/span&gt;rag-llm-app
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your terminal should now start with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installing dependencies
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openai chromadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a new main.py file — the entry point to your LLM application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Getting your API keys
&lt;/h1&gt;

&lt;p&gt;Lastly, go ahead and get your OpenAI API key here if you don’t already have one, and set it as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-openai-api-key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’re good to go! Let’s start coding.&lt;/p&gt;



&lt;h1&gt;
  
  
  Building a RAG-based LLM application
&lt;/h1&gt;

&lt;p&gt;Begin by creating a Retriever class that will retrieve the most relevant data from ChromaDB for a given user question.&lt;/p&gt;

&lt;p&gt;Open main.py and paste in the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chromadb.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;embedding_functions&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Retriver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_retrieval_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;openai_ef&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_functions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddingFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-openai-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai_ef&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;retrieval_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query_texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;retrieval_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;openai_ef&lt;/code&gt; is the embedding function used under the hood by ChromaDB to vectorize an input. When a user sends a question to your chatbot, a vector embedding will be created from this question using OpenAI’s &lt;code&gt;text-embedding-ada-002&lt;/code&gt; model. This vector embedding will then be used by ChromaDB to perform a vector similarity search in the collection vector space, which contains data from your knowledge base (remember, we’re assuming you’ve already indexed data for this tutorial). This process allows you to search for the top K most relevant retrieval results on any given input.&lt;/p&gt;
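&lt;p&gt;If you're curious what that similarity search does under the hood, here's a stripped-down sketch with made-up 3-dimensional embeddings (real &lt;code&gt;text-embedding-ada-002&lt;/code&gt; embeddings have 1,536 dimensions). This is an illustration, not ChromaDB's actual implementation:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_embedding, index, k=2):
    """Return the k documents whose embeddings are closest to the query's."""
    ranked = sorted(
        index,
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]

# (document, embedding) pairs, as a vector database would store them
index = [
    ("Our refund policy lasts 30 days", [0.9, 0.1, 0.0]),
    ("Shipping takes 3-5 business days", [0.1, 0.9, 0.0]),
    ("We are hiring engineers", [0.0, 0.1, 0.9]),
]
query_embedding = [0.8, 0.2, 0.0]  # pretend embedding of a refund question
print(top_k(query_embedding, index, k=1))  # → ['Our refund policy lasts 30 days']
```

&lt;p&gt;ChromaDB does essentially this (plus indexing tricks so it doesn't have to scan every vector) when you call &lt;code&gt;collection.query&lt;/code&gt;.&lt;/p&gt;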

&lt;p&gt;Now that you’ve created your retriever, paste in the following code to create a generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;openai_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai_model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a helpful assistant with a thick country accent. Answer the question below and if you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know the answer, say you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know.

            {text}
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieval_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;prompts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;retrieval_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we construct a series of prompts in the &lt;code&gt;generate_response&lt;/code&gt; method based on a list of &lt;code&gt;retrieval_results&lt;/code&gt; that will be provided by the retriever we built earlier. We then send this series of prompts to OpenAI to generate an answer. Using RAG, your QA chatbot can now produce more customized outputs by enhancing the generation with retrieval results!&lt;/p&gt;

&lt;p&gt;To wrap things up, let’s put everything together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Chatbot&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retriver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Retriver&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;retrieval_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_retrieval_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieval_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Creating an instance of the Chatbot class
&lt;/span&gt;&lt;span class="n"&gt;chatbot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chatbot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Taking user input from the CLI
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chatbot: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all folks! You just built your very first RAG-based chatbot.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, you’ve learnt what RAG is, some use cases for RAG, and how to build your own RAG-based LLM application. However, you might have noticed that building your own RAG application is pretty complicated, and indexing your data is often a non-trivial task. Luckily, there are existing open-source frameworks like LangChain and LlamaIndex that allow you to implement what we’ve demonstrated in a much simpler way.&lt;/p&gt;

&lt;p&gt;If you like the article, don’t forget to give us a star on Github ❤️: &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also find the full code example here: &lt;a href="https://github.com/confident-ai/blog-examples/tree/main/rag-llm-app" rel="noopener noreferrer"&gt;https://github.com/confident-ai/blog-examples/tree/main/rag-llm-app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Till next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The one thing everyone's doing wrong with ChatGPT... 🤫🤔</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Tue, 03 Oct 2023 09:08:39 +0000</pubDate>
      <link>https://dev.to/confidentai/the-one-thing-everyones-doing-wrong-with-chatgpt-3api</link>
      <guid>https://dev.to/confidentai/the-one-thing-everyones-doing-wrong-with-chatgpt-3api</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;Most developers don't evaluate their GPT outputs when building applications, even if that means introducing unnoticed breaking changes, because evaluation is very, very hard. In this article, you're going to learn how to evaluate ChatGPT (LLM) outputs the right way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔥 On the agenda&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what are LLMs and why they're difficult to evaluate&lt;/li&gt;
&lt;li&gt;different ways to evaluate LLM outputs&lt;/li&gt;
&lt;li&gt;how to evaluate in python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enjoy! 🤗&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/QJvwBSGaoc4eI/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/QJvwBSGaoc4eI/giphy.gif" width="500" height="363"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DeepEval is a framework that helps engineers evaluate the performance of their LLM applications by providing default metrics to measure hallucination, relevancy, and much more.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are just starting out, and we really want to help more developers build safer AI apps. Would you mind giving it a star to help spread the word, please? 🥺❤️🥺&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h1&gt;
  
  
  What are LLMs and what makes them so hard to evaluate?
&lt;/h1&gt;

&lt;p&gt;To understand why LLMs are difficult to evaluate and why they're oftentimes referred to as a "black box", let's break down what LLMs are and how they work.&lt;/p&gt;

&lt;p&gt;ChatGPT is an example of a large language model (LLM) and was trained on huge amounts of data. To be exact, around 300 billion words from articles, tweets, r/tifu, stack-overflow, how-to-guides, and other pieces of data that were scraped off the internet 🤯&lt;/p&gt;

&lt;p&gt;Anyway, the GPT behind "Chat" stands for Generative Pre-trained Transformer. A transformer is a specific neural network architecture that is particularly good at predicting the next few tokens (a token is roughly 4 characters for ChatGPT, but it can be as short as one character or as long as a word, depending on the specific encoding strategy). &lt;/p&gt;

&lt;p&gt;So in fact, LLMs don't really "know" anything, but instead "understand" linguistic patterns due to the way in which they were trained, which often makes them pretty good at figuring out the right thing to say. Pretty manipulative huh?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/bQjaedezBNDNvyyGHT/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/bQjaedezBNDNvyyGHT/giphy.gif" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All jokes aside, if there's one thing you need to remember, it's this: the process of predicting the next plausible "best" token is probabilistic in nature. This means that, &lt;strong&gt;LLMs can generate a variety of possible outputs for a given input, instead of always providing the same response&lt;/strong&gt;. It is exactly this non-deterministic nature of LLMs that makes them challenging to evaluate, as there's often more than one appropriate response.&lt;/p&gt;
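&lt;p&gt;To see why sampling makes outputs non-deterministic, here's a toy next-token sampler. The logits are made up and this is nothing like ChatGPT's real decoder, but the temperature scaling shown is the same idea the OpenAI API's &lt;code&gt;temperature&lt;/code&gt; parameter controls:&lt;/p&gt;

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Pick a token index from a categorical distribution over logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax over the scaled logits
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [5.0, 1.0, 0.5]  # hypothetical scores for 3 candidate tokens
# Near-zero temperature: effectively greedy, always the top-scoring token.
print({sample_next_token(logits, temperature=0.05) for _ in range(100)})  # → {0}
# Higher temperature: lower-scoring tokens get picked too, so outputs vary.
print({sample_next_token(logits, temperature=5.0) for _ in range(100)})
```

&lt;p&gt;This is why the exact same prompt can yield different responses from run to run at the default temperature.&lt;/p&gt;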



&lt;h1&gt;
  
  
  Why do we need to evaluate LLM applications?
&lt;/h1&gt;

&lt;p&gt;When I say LLM applications, here are some examples of what I'm referring to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chatbots&lt;/strong&gt;: For customer support, virtual assistants, or general conversational agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Assistance&lt;/strong&gt;: Suggesting code completions, fixing code errors, or helping with debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal Document Analysis&lt;/strong&gt;: Helping legal professionals quickly understand the essence of long contracts or legal texts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalized Email Drafting&lt;/strong&gt;: Helping users draft emails based on context, recipient, and desired tone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM applications usually have one thing in common - they perform better when augmented with proprietary data to help with the task at hand. Want to build an internal chatbot that helps boost your employees' productivity? OpenAI certainly doesn't keep tabs on your company's internal data (hopefully 😥). &lt;/p&gt;

&lt;p&gt;This matters because it is now not only OpenAI's job to ensure ChatGPT is performing as expected ⚖️ but also yours to make sure your LLM application is generating the desired outputs by using the right prompt templates, data retrieval pipelines, model architecture (if you're fine-tuning), etc.&lt;/p&gt;

&lt;p&gt;Evaluation (I'll just call them evals from hereon) helps you measure how well your application is handling the task at hand. Without evals, you will be introducing unnoticed breaking changes and would have to manually inspect all possible LLM outputs each time you iterate on your application 👀 which to me sounds like a terrible idea 💀&lt;/p&gt;



&lt;h1&gt;
  
  
  How to evaluate LLM outputs
&lt;/h1&gt;

&lt;p&gt;There are two ways everyone should know about when it comes to evals - with and without ChatGPT. &lt;/p&gt;
&lt;h2&gt;
  
  
  Evals without ChatGPT
&lt;/h2&gt;

&lt;p&gt;A nice way to evaluate LLM outputs without using ChatGPT is to use other machine learning models derived from the field of NLP. You can use specific models to judge your outputs on different metrics such as factual correctness, relevancy, bias, and helpfulness (just to name a few, but the list goes on), despite non-deterministic outputs.&lt;/p&gt;

&lt;p&gt;For example, we can use natural language inference (NLI) models (which output an entailment score) to determine how factually correct a response is based on some provided context. The higher the entailment score, the more factually correct an output is, which is particularly helpful if you're evaluating a long output that's not so black and white in terms of factual correctness.&lt;/p&gt;

&lt;p&gt;You might also wonder how these models can possibly "know" whether a piece of text is factually correct 🤔 It turns out you can provide context to these models for them to take at face value 🥳 In fact, we call these contexts &lt;strong&gt;ground truths&lt;/strong&gt; or &lt;strong&gt;references&lt;/strong&gt;. A collection of these references is often referred to as an evaluation dataset.&lt;/p&gt;

&lt;p&gt;But not all metrics require references. For example, relevancy can be calculated using cross-encoder models (another kind of ML model), and all you need to supply is the input and output for it to determine how relevant they are to each other.&lt;/p&gt;

&lt;p&gt;Off the top of my head, here's a list of reference-less metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relevancy&lt;/li&gt;
&lt;li&gt;bias&lt;/li&gt;
&lt;li&gt;toxicity&lt;/li&gt;
&lt;li&gt;helpfulness&lt;/li&gt;
&lt;li&gt;harmlessness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And here is a list of reference-based metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;factual correctness&lt;/li&gt;
&lt;li&gt;conceptual similarity &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that reference-based metrics don't require you to provide the initial input, as they only judge the output against the provided context.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using ChatGPT for Evals
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzscwwk58fwpkkg29l64x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzscwwk58fwpkkg29l64x.png" alt="Image description" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's an emerging trend of using state-of-the-art LLMs (aka ChatGPT) to evaluate themselves or even other LLMs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;G-Eval is a recently developed framework that uses LLMs for evals.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'll attach an image from the &lt;a href="https://arxiv.org/pdf/2303.16634.pdf" rel="noopener noreferrer"&gt;research paper that introduced G-Eval&lt;/a&gt; below, but in a nutshell G-Eval is a two-part process - the first part generates evaluation steps, and the second uses those generated evaluation steps to output a final score.&lt;/p&gt;

&lt;p&gt;Let's run through a concrete example. First, to generate evaluation steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;introduce an evaluation task to ChatGPT (e.g. rate this summary from 1 - 5 based on relevancy)&lt;/li&gt;
&lt;li&gt;introduce the evaluation criteria (e.g. relevancy will be based on the collective quality of all sentences)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the evaluation steps have been generated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;concatenate the input, evaluation steps, context, and the actual output&lt;/li&gt;
&lt;li&gt;ask the LLM to generate a score between 1 and 5, where higher is better&lt;/li&gt;
&lt;li&gt;(Optional) take the probabilities of the output tokens from the LLM to normalize the score and take their weighted summation as the final result&lt;/li&gt;
&lt;/ol&gt;
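Here's a rough sketch of what the concatenation in step 1 might look like. The template wording below is my own for illustration, not the paper's verbatim prompt:

```python
def build_geval_prompt(evaluation_steps, input_text, context, actual_output):
    """Concatenate the generated evaluation steps with the test case
    to form the final scoring prompt. Illustrative wording only."""
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(evaluation_steps))
    return (
        "Evaluation steps:\n" + steps + "\n\n"
        "Input:\n" + input_text + "\n\n"
        "Context:\n" + context + "\n\n"
        "Actual output:\n" + actual_output + "\n\n"
        "Following the steps above, give a score from 1 to 5 (5 is best)."
    )

# Hypothetical evaluation steps, as if generated by ChatGPT in part one
prompt = build_geval_prompt(
    ["Read the context.", "Check the output only uses facts from the context."],
    "What if these shoes don't fit?",
    "All customers are eligible for a 30 day full refund.",
    "We offer a 30-day full refund.",
)
print(prompt)
```

You'd then send this prompt to the LLM and parse the score out of its reply.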

&lt;p&gt;Step 3 is actually pretty complicated 🙃 because to get the probability of the output tokens, you would typically need access to the raw model outputs, not just the final generated text. This step was introduced in the paper because it offers more fine-grained scores that better reflect the quality of outputs.&lt;/p&gt;
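The weighted summation itself is simple once you have the probabilities of the score tokens (the numbers below are made up for illustration):

```python
def normalized_score(token_probs):
    """token_probs maps each candidate score token ("1".."5") to the
    probability the LLM assigns to it. The final score is the
    probability-weighted sum, which gives finer-grained results than
    just taking the single most likely token."""
    total = sum(token_probs.values())
    return sum(int(tok) * p for tok, p in token_probs.items()) / total

# Hypothetical output-token probabilities from the raw model
probs = {"3": 0.1, "4": 0.6, "5": 0.3}
print(normalized_score(probs))
```

So instead of a flat "4", you'd get a score like 4.2 that reflects how much probability mass sat on neighboring scores.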

&lt;p&gt;Here's a diagram taken from the paper that can help you visualize what we've learnt:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjlzg9wh6dbpwenjty2f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjlzg9wh6dbpwenjty2f.jpg" alt="Image description" width="800" height="573"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Utilizing GPT-4 with G-Eval outperformed traditional metrics in areas such as coherence, consistency, fluency, and relevancy 😳 but evaluations using LLMs can often be very expensive.&lt;/p&gt;

&lt;p&gt;So, my recommendation would be to evaluate with G-Eval as a starting point to establish a performance standard and then transition to more cost-effective traditional methods where suitable.&lt;/p&gt;



&lt;h1&gt;
  
  
  Evaluating LLM outputs in Python
&lt;/h1&gt;

&lt;p&gt;By now, you probably feel inundated by all the jargon and definitely wouldn't want to implement everything from scratch. Imagine having to research what's the best way to compute each individual metric, train your own model for it, and code up an evaluation framework... 😰&lt;/p&gt;

&lt;p&gt;Luckily, there are a few open source packages such as ragas and DeepEval that provide an evaluation framework so you don't have to write your own 😌&lt;/p&gt;

&lt;p&gt;As the cofounder of Confident (the company behind DeepEval), I'm going to go ahead and shamelessly show you how you can unit test your LLM applications using DeepEval 😊 (but seriously, we have an amazing Pytest-like developer experience, it's easy to set up, and we offer a free platform for you to visualize your evaluation results)&lt;/p&gt;

&lt;p&gt;Let's wrap things up with some coding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6udlfxpxh54ipk8qynj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6udlfxpxh54ipk8qynj.gif" alt="Image description" width="500" height="281"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up your test environment
&lt;/h2&gt;

&lt;p&gt;To implement our much-anticipated evals, create a project folder and initialize a Python virtual environment by running the commands below in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir evals-example
cd evals-example
python3 -m venv venv
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your terminal prompt should now start with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(venv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing dependencies
&lt;/h2&gt;

&lt;p&gt;Run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install deepeval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting your OpenAI API Key
&lt;/h2&gt;

&lt;p&gt;Lastly, set your OpenAI API key as an environment variable. We'll need OpenAI for G-Eval later (which basically means using LLMs for evaluation). In your terminal, paste this in with your own API key (get yours &lt;a href="https://openai.com/blog/openai-api" rel="noopener noreferrer"&gt;here&lt;/a&gt; if you don't already have one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export OPENAI_API_KEY="your-api-key-here"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Writing your first test file
&lt;/h2&gt;

&lt;p&gt;Let's create a file called &lt;code&gt;test_evals.py&lt;/code&gt; (note that test files must start with "test"):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch test_evals.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste in the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics.factual_consistency&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FactualConsistencyMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics.answer_relevancy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnswerRelevancyMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics.conceptual_similarity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConceptualSimilarityMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics.llm_eval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMEvalMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.run_test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_test&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_factual_correctness&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What if these shoes don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t fit?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All customers are eligible for a 30 day full refund at no extra costs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We offer a 30-day full refund at no extra costs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;factual_consistency_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FactualConsistencyMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;factual_consistency_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_relevancy&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What does your company do?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Our company specializes in cloud computing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;relevancy_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnswerRelevancyMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;relevancy_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_conceptual_similarity&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What did the cat do?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat climbed up the tree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat ran up the tree.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;conceptual_similarity_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConceptualSimilarityMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;conceptual_similarity_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_humor&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_chat_completion_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write me something funny related to programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why did the programmer quit his job? Because he didn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t get arrays!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;llm_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMEvalMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How funny it is&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;completion_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;make_chat_completion_request&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;llm_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now run the test file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepeval test run test_evals.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each of the test cases, there is a predefined metric provided by DeepEval, and each of these metrics outputs a score from 0 to 1. For example, &lt;code&gt;FactualConsistencyMetric(minimum_score=0.5)&lt;/code&gt; means we want to evaluate how factually correct an output is, where &lt;code&gt;minimum_score=0.5&lt;/code&gt; means the test will only pass if the output score is higher than a 0.5 threshold. &lt;/p&gt;
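Conceptually, the passing logic boils down to every metric's score clearing its threshold. Here's a sketch of that idea (not DeepEval's actual implementation):

```python
def all_metrics_pass(scores, minimum_score=0.5):
    """A test case passes only when every metric's 0 to 1 score
    clears the threshold. Conceptual sketch, not DeepEval's code."""
    return all(score >= minimum_score for score in scores)

print(all_metrics_pass([0.72, 0.61]))  # every score clears 0.5
print(all_metrics_pass([0.72, 0.41]))  # the second score fails
```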

&lt;p&gt;Let's go over the test cases one by one:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;test_factual_correctness&lt;/code&gt; tests how factually correct your LLM output is relative to the provided context. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_relevancy&lt;/code&gt; tests how relevant the output is relative to the given input.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_conceptual_similarity&lt;/code&gt; tests how conceptually similar the LLM output is relative to the expected output.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_humor&lt;/code&gt; tests how funny your LLM output is. This test case is the only test case that uses ChatGPT for evaluation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Notice how there are up to four moving parameters for a single test case - the input, the expected output, the actual output (of your application), and the context (that was used to generate the actual output). Depending on the metric you're testing, some parameters are optional, while others are mandatory. &lt;/p&gt;

&lt;p&gt;Lastly, what if you want to test more than one metric on the same input? Here's how you can aggregate metrics on a single test case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_everything&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What did the cat do?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat climbed up the tree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat ran up the tree.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat ran up the tree.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;conceptual_similarity_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConceptualSimilarityMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;relevancy_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnswerRelevancyMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;factual_consistency_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FactualConsistencyMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;conceptual_similarity_metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relevancy_metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;factual_consistency_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not so hard after all huh? Write enough of these (10-20), and you'll have much better control over what you're building 🤗&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PS. And here's a bonus feature DeepEval offers: a free web platform for you to view data on all your test runs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Try running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepeval login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow the instructions (login, get your API key, paste it in the CLI), and run this again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepeval test run test_example.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me know what happens!&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, you've learnt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how ChatGPT works&lt;/li&gt;
&lt;li&gt;examples of LLM applications&lt;/li&gt;
&lt;li&gt;why it's hard to evaluate LLM outputs&lt;/li&gt;
&lt;li&gt;how to evaluate LLM outputs in python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With evals, you can stop making breaking changes to your LLM application ✅ quickly iterate on your implementation to improve on metrics you care about ✅ and most importantly be confident in the LLM application you build 😇&lt;/p&gt;

&lt;p&gt;If you enjoyed this article, don't forget to &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;give us a star on GitHub!&lt;/a&gt; The source code for this tutorial is available here:&lt;br&gt;
&lt;a href="https://github.com/confident-ai/blog-examples/tree/main/evals-example" rel="noopener noreferrer"&gt;https://github.com/confident-ai/blog-examples/tree/main/evals-example&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading, and till next time 🫡&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/DMHEccCwpNxCQBZlvQ/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/DMHEccCwpNxCQBZlvQ/giphy.gif" width="480" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>programming</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to build a PDF QA chatbot using OpenAI and ChromaDB 🤗</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Tue, 26 Sep 2023 10:38:37 +0000</pubDate>
      <link>https://dev.to/confidentai/how-to-build-a-pdf-qa-chatbot-using-openai-and-chromadb-4mcj</link>
      <guid>https://dev.to/confidentai/how-to-build-a-pdf-qa-chatbot-using-openai-and-chromadb-4mcj</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;In this article, you'll learn how to build a RAG-based chatbot to chat with any PDF of your choice so you can achieve your lifelong dream of talking to PDFs 😏 In the end, I'll also show you how to test what you've built ✅&lt;/p&gt;

&lt;p&gt;I know, I wrote something similar in my &lt;a href="https://www.confident-ai.com/blog/building-a-customer-support-chatbot-using-gpt-3-5-and-llamaindex" rel="noopener noreferrer"&gt;last article on building a customer support chatbot&lt;/a&gt; 😅 but this week we're going to dive deep into how to use the raw OpenAI API to chat with PDF data (including text trapped in visuals like tables) stored in ChromaDB, as well as how to use Streamlit to build the chatbot UI.&lt;/p&gt;

&lt;h1&gt;
  
  
  A small request 🙏🏻
&lt;/h1&gt;

&lt;p&gt;I'm trying to get DeepEval to &lt;strong&gt;5k stars&lt;/strong&gt; by the end of 2023, can you please help me out by starring my repo? It helps me create more weekly high quality content ❤️ thank you very very much!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;br&gt;
&lt;a href="https://i.giphy.com/media/5xtDarmwsuR9sDRObyU/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/5xtDarmwsuR9sDRObyU/giphy.gif" width="443" height="250"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Introducing RAG, Vector Databases, and OCR
&lt;/h1&gt;

&lt;p&gt;Before we dive into the code, let's unpack what we're going to implement 🕵️ To begin, &lt;strong&gt;OCR&lt;/strong&gt; (Optical Character Recognition) is a technology within the field of computer vision that recognizes the characters present in a document and converts them into text - this is particularly helpful in the case of tables and charts in documents 😬 We'll be using the OCR provided by Azure Cognitive Services in this tutorial.&lt;/p&gt;

&lt;p&gt;Once text chunks are extracted using OCR, they are converted into high-dimensional vectors (aka vectorized) using embedding models like Word2Vec, FastText, or BERT. These vectors, which encapsulate the semantic meaning of the text, are then indexed in a &lt;strong&gt;vector database&lt;/strong&gt;. We'll be using ChromaDB as our in-memory vector database 🥳&lt;/p&gt;

&lt;p&gt;Now, let's see what happens when a user asks their PDF something. First, the user query is vectorized using the same embedding model used to vectorize the extracted PDF text chunks. Then, the top K most semantically similar text chunks are fetched by searching through the vector database, which, remember, contains the text chunks from our PDF. The retrieved text chunks are then provided as context for ChatGPT to generate an answer based on information in the PDF. This is the process of &lt;strong&gt;retrieval-augmented generation (RAG)&lt;/strong&gt;.&lt;/p&gt;
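Stripped of the database machinery, the retrieval step is just a similarity search over embeddings. Here's a toy sketch with made-up 3-dimensional vectors standing in for real embeddings (a vector database does this at scale with approximate search):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, chunk_vecs, k=2):
    """Rank stored chunk vectors by cosine similarity to the query
    vector and return the indices of the k closest, most similar first."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dimensional "embeddings" standing in for a real model's output
chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(top_k([1.0, 0.05, 0.0], chunks, k=2))
```

The indices returned are the chunks you'd pass to ChatGPT as context.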

&lt;p&gt;&lt;a href="https://i.giphy.com/media/QVt7Jq9Sd7ZraYohxi/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/QVt7Jq9Sd7ZraYohxi/giphy.gif" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feeling educated? 😊 Let's begin. &lt;/p&gt;
&lt;h1&gt;
  
  
  Project Setup
&lt;/h1&gt;

&lt;p&gt;First, I'm going to guide you through how to set up your project folders and any dependencies you need to install.&lt;/p&gt;

&lt;p&gt;Create a project folder and a Python virtual environment by running the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir chat-with-pdf
cd chat-with-pdf
python3 -m venv venv
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your terminal prompt should now start with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(venv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing dependencies
&lt;/h2&gt;

&lt;p&gt;Run the following command to install the OpenAI SDK, ChromaDB, Azure Form Recognizer, Streamlit, and tabulate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install openai chromadb azure-ai-formrecognizer streamlit tabulate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's briefly go over what each of those packages does: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;streamlit&lt;/code&gt; - sets up the chat UI, which includes a PDF uploader (thank god 😌)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;azure-ai-formrecognizer&lt;/code&gt; - extracts textual content from PDFs using OCR &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chromadb&lt;/code&gt; - is an in-memory vector database that stores the extracted PDF content&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;openai&lt;/code&gt; - we all know what this does (receives relevant data from chromadb and returns a response based on your chatbot input)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, create a new &lt;code&gt;main.py&lt;/code&gt; file - the entry point to your application&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting your API keys
&lt;/h2&gt;

&lt;p&gt;Lastly, get your OpenAI and Azure API keys ready (click the hyperlinks to get them if you don't already have one).&lt;/p&gt;

&lt;p&gt;Note: It's pretty troublesome to sign up for an account on Azure Cognitive Services. You'll need a card (although they won't charge you automatically) and a phone number 😔 but do give it a try if you're trying to build something serious!&lt;/p&gt;

&lt;h1&gt;
  
  
  Building the Chatbot UI with Streamlit
&lt;/h1&gt;

&lt;p&gt;Streamlit is an easy way to build frontend applications using Python.&lt;/p&gt;

&lt;p&gt;Let's import streamlit along with setting up everything else we'll need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import streamlit as st
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from tabulate import tabulate
from chromadb.utils import embedding_functions
import chromadb
import openai

# You'll need this client later to store PDF data
client = chromadb.Client()
client.heartbeat()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Give our chat UI a title and create a file uploader:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
st.write("# Chat with PDF")

uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Listen for a change event in &lt;code&gt;uploaded_file&lt;/code&gt;. This will be triggered when you upload a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
if uploaded_file is not None:
    # Create a temporary file to write the bytes to
    with open("temp_pdf_file.pdf", "wb") as temp_file:
        temp_file.write(uploaded_file.read())
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;View your streamlit app by running &lt;code&gt;main.py&lt;/code&gt; (we'll implement the chat input UI later):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamlit run main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the easy part done 🥳! Next comes the not so easy part...&lt;/p&gt;

&lt;h1&gt;
  
  
  Extracting text from PDFs
&lt;/h1&gt;

&lt;p&gt;Carrying on from the previous code snippet, we're going to send &lt;code&gt;temp_file&lt;/code&gt; to Azure Cognitive Services for OCR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ...
    # you can set this up in the azure cognitive services portal
    AZURE_COGNITIVE_ENDPOINT = "your-custom-azure-api-endpoint"
    AZURE_API_KEY = "your-azure-api-key"
    credential = AzureKeyCredential(AZURE_API_KEY)
    AZURE_DOCUMENT_ANALYSIS_CLIENT = DocumentAnalysisClient(AZURE_COGNITIVE_ENDPOINT, credential)

    # Open the temporary file in binary read mode and pass it to Azure
    with open("temp_pdf_file.pdf", "rb") as f:
        poller = AZURE_DOCUMENT_ANALYSIS_CLIENT.begin_analyze_document("prebuilt-document", document=f)
        doc_info = poller.result().to_dict()
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;doc_info&lt;/code&gt; is a dictionary containing information on the extracted text chunks. It's a pretty complicated dictionary, so I would recommend printing it out and seeing for yourself what it looks like.&lt;/p&gt;
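&lt;p&gt;If you'd rather not print the whole thing, here's a made-up, heavily trimmed sketch of the shape the processing code relies on - the real dictionary from Azure has many more fields:&lt;/p&gt;

```python
# An invented, minimal illustration of the structure returned by
# poller.result().to_dict(); field names match what the code accesses,
# the data itself is fake.
doc_info = {
    "pages": [
        {
            "page_number": 1,
            "lines": [
                {"content": "Quarterly Report"},
                {"content": "Revenue grew 12% year over year."},
            ],
        }
    ],
    "tables": [],
}

# Join each page's lines into one text chunk, as the processing loop does.
for page in doc_info["pages"]:
    page_text = " ".join(line["content"] for line in page["lines"])
    print(page_text)  # Quarterly Report Revenue grew 12% year over year.
```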

&lt;p&gt;Paste in the following to finish processing the data received from Azure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   ...
   res = []
   CONTENT = "content"
   PAGE_NUMBER = "page_number"
   TYPE = "type"
   RAW_CONTENT = "raw_content"
   TABLE_CONTENT = "table_content"

   for p in doc_info['pages']:
        dict = {}
        page_content = " ".join([line["content"] for line in p["lines"]])
        dict[CONTENT] = str(page_content)
        dict[PAGE_NUMBER] = str(p["page_number"])
        dict[TYPE] = RAW_CONTENT
        res.append(dict)

    for table in doc_info["tables"]:
        dict = {}
        dict[PAGE_NUMBER] = str(table["bounding_regions"][0]["page_number"])
        col_headers = []
        cells = table["cells"]
        for cell in cells:
            if cell["kind"] == "columnHeader" and cell["column_span"] == 1:
                for _ in range(cell["column_span"]):
                    col_headers.append(cell["content"])

        data_rows = [[] for _ in range(table["row_count"])]
        for cell in cells:
            if cell["kind"] == "content":
                for _ in range(cell["column_span"]):
                    data_rows[cell["row_index"]].append(cell["content"])
        data_rows = [row for row in data_rows if len(row) &amp;gt; 0]

        markdown_table = tabulate(data_rows, headers=col_headers, tablefmt="pipe")
        dict[CONTENT] = markdown_table
        dict[TYPE] = TABLE_CONTENT
        res.append(dict)
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we accessed various properties of the dictionary returned by Azure to get the text on each page and the data stored in tables. The logic is pretty complex because of all the nested structures 😨 but from personal experience, Azure OCR works well even for complex PDF structures, so I highly recommend giving it a try :)&lt;/p&gt;
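&lt;p&gt;&lt;code&gt;tabulate&lt;/code&gt; does the markdown conversion for us above; purely to illustrate the output format, here's a hand-rolled (and simplified - no alignment or escaping) pipe table builder:&lt;/p&gt;

```python
def to_pipe_table(headers, rows):
    # Minimal markdown "pipe" table; tabulate(..., tablefmt="pipe") produces
    # a similar, properly aligned result.
    lines = ["| " + " | ".join(headers) + " |"]
    lines.append("|" + "|".join("---" for _ in headers) + "|")
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

table = to_pipe_table(["Item", "Price"], [["Widget", "$9.99"], ["Gadget", "$19.99"]])
print(table)
```

&lt;p&gt;Markdown tables like this keep the tabular structure legible to the LLM once the chunk is retrieved as context.&lt;/p&gt;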

&lt;p&gt;&lt;a href="https://i.giphy.com/media/eq8uOgcZ95PV5P97vq/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/eq8uOgcZ95PV5P97vq/giphy.gif" width="480" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Storing PDF content in ChromaDB
&lt;/h1&gt;

&lt;p&gt;Still with me? 😅 Great, we're almost there so hang in there!&lt;/p&gt;

&lt;p&gt;Paste in the code below to store extracted text chunks from &lt;code&gt;res&lt;/code&gt; in ChromaDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ...
    try:
        client.delete_collection(name="my_collection")
        st.session_state.messages = []
    except Exception:
        # The collection doesn't exist yet on the first upload - that's fine
        pass

    openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key="your-openai-api-key", model_name="text-embedding-ada-002")
    collection = client.create_collection(name="my_collection", embedding_function=openai_ef)
    data = []
    id = 1
    for dict in res:
        content = dict.get(CONTENT, '')
        page_number = dict.get(PAGE_NUMBER, '')
        type_of_content = dict.get(TYPE, '')

        content_metadata = {   
            PAGE_NUMBER: page_number,
            TYPE: type_of_content
        }

        collection.add(
            documents=[content],
            metadatas=[content_metadata],
            ids=[str(id)]
        )
        id += 1
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first try block ensures that we can continue uploading PDFs without having to refresh the page. &lt;/p&gt;

&lt;p&gt;You might have noticed that we add data into a &lt;code&gt;collection&lt;/code&gt; and not to the database directly. A collection in ChromaDB is a vector space. When a user enters a query, ChromaDB performs a search inside this collection instead of the entire database. In Chroma, a collection is identified by a unique &lt;code&gt;name&lt;/code&gt;, and with a single line of code you can add all extracted text chunks to this collection via &lt;code&gt;collection.add(...)&lt;/code&gt;.&lt;/p&gt;
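&lt;p&gt;If the idea of a collection feels abstract, here's a toy, dependency-free stand-in (&lt;code&gt;ToyCollection&lt;/code&gt; is invented for illustration and scores by word overlap rather than embeddings) that mimics the &lt;code&gt;add&lt;/code&gt;/&lt;code&gt;query&lt;/code&gt; shape we're using:&lt;/p&gt;

```python
class ToyCollection:
    """A stripped-down stand-in for a vector-database collection: it stores
    documents with ids and metadata, and searches only within itself."""

    def __init__(self, name):
        self.name = name
        self.docs = {}  # id -> (document, metadata)

    def add(self, documents, metadatas, ids):
        for doc, meta, id_ in zip(documents, metadatas, ids):
            self.docs[id_] = (doc, meta)

    def query(self, query_text, n_results=2):
        # Score by crude word overlap; a real collection compares embeddings.
        q_words = set(query_text.lower().split())
        ranked = sorted(
            self.docs.values(),
            key=lambda pair: len(q_words.intersection(set(pair[0].lower().split()))),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:n_results]]

collection = ToyCollection("my_collection")
collection.add(
    documents=["Refunds take 5 days.", "Support is open 24/7."],
    metadatas=[{"page_number": "1"}, {"page_number": "2"}],
    ids=["1", "2"],
)
result = collection.query("How long do refunds take?", n_results=1)
print(result)  # ['Refunds take 5 days.']
```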

&lt;p&gt;&lt;a href="https://i.giphy.com/media/aTGEmpKWeojBs7BCSC/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/aTGEmpKWeojBs7BCSC/giphy.gif" width="480" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Generating a response using OpenAI
&lt;/h1&gt;

&lt;p&gt;I get asked a lot about how to build a RAG chatbot without relying on frameworks like LangChain and LlamaIndex. Well, here's how you do it - you construct a list of prompts dynamically based on the retrieved results from your vector database. &lt;/p&gt;

&lt;p&gt;Paste in the following code to wrap things up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("What do you want to say to your PDF?"):
    # Display your message
    with st.chat_message("user"):
        st.markdown(prompt)
    # Add your message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})

    # Query ChromaDB based on your prompt, taking the top 5 most relevant results. These results are ordered by similarity.
    q = collection.query(
        query_texts=[prompt],
        n_results=5,
    )
    results = q["documents"][0]

    prompts = []
    for r in results:
        # Construct a prompt for each retrieved text chunk. Use a fresh variable
        # so we don't overwrite the original user prompt on each iteration.
        chunk_prompt = "Please extract the following: " + prompt + " solely based on the text below. Use an unbiased and journalistic tone. If you're unsure of the answer, say you cannot find the answer. \n\n" + r
        prompts.append(chunk_prompt)
    prompts.reverse()

    openai_res = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "assistant", "content": prompt} for prompt in prompts],
        temperature=0,
    )

    response = openai_res["choices"][0]["message"]["content"]
    with st.chat_message("assistant"):
        st.markdown(response)

    # append the response to chat history
    st.session_state.messages.append({"role": "assistant", "content": response})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how we reversed &lt;code&gt;prompts&lt;/code&gt; after constructing the list of prompts from the retrieved text chunks. This is because the results returned from ChromaDB are ordered by descending similarity, meaning the most relevant text chunk is always first in the results list. However, ChatGPT tends to weigh the last prompt in a list of prompts more heavily, which is why we have to reverse it.&lt;/p&gt;
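&lt;p&gt;Here's a minimal sketch of that ordering trick in isolation:&lt;/p&gt;

```python
# Retrieved chunks come back most-relevant-first from the vector database...
retrieved = ["most relevant chunk", "second", "third"]

# ...but we want the most relevant chunk to be the LAST message sent,
# since later messages tend to carry more weight in the completion.
prompts = [f"Answer based on: {chunk}" for chunk in retrieved]
prompts.reverse()

print(prompts[-1])  # Answer based on: most relevant chunk
```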

&lt;p&gt;Run the streamlit app and try things out for yourself 😙:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamlit run main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🎉 Congratulations, you made it to the end! &lt;/p&gt;

&lt;h1&gt;
  
  
  Taking it a step further
&lt;/h1&gt;

&lt;p&gt;As you know, LLM applications are a black box, so for production use cases you'll want to safeguard the performance of your PDF chatbot to keep your users happy. To learn how to build a simple evaluation framework that could get you set up in less than 30 minutes, &lt;a href="https://www.confident-ai.com/blog/how-to-evaluate-llm-applications" rel="noopener noreferrer"&gt;click here.&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, you've learnt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what a vector database is and how to use ChromaDB&lt;/li&gt;
&lt;li&gt;how to use the raw OpenAI API to build a RAG based chatbot without relying on 3rd party frameworks&lt;/li&gt;
&lt;li&gt;what OCR is and how to use Azure's OCR services&lt;/li&gt;
&lt;li&gt;how to quickly set up a beautiful chatbot UI using streamlit, which includes a file uploader. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tutorial walked you through an example of how you can build a "chat with PDF" application using just Azure OCR, OpenAI, and ChromaDB. With what you've learnt, you can build powerful applications that help increase the productivity of workforces (at least that's the most prominent use case I've come across). &lt;/p&gt;

&lt;p&gt;The source code for this tutorial is available here:&lt;br&gt;
&lt;a href="https://github.com/confident-ai/blog-examples/tree/main/chat-with-pdf" rel="noopener noreferrer"&gt;https://github.com/confident-ai/blog-examples/tree/main/chat-with-pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>programming</category>
      <category>chatgpt</category>
      <category>python</category>
    </item>
    <item>
      <title>Building a customer support chatbot using GPT-3.5 and LlamaIndex🚀</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Tue, 19 Sep 2023 07:18:42 +0000</pubDate>
      <link>https://dev.to/confidentai/building-a-customer-support-chatbot-using-gpt-35-and-llamaindex-3d1l</link>
      <guid>https://dev.to/confidentai/building-a-customer-support-chatbot-using-gpt-35-and-llamaindex-3d1l</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;In this article, you'll learn how to create a customer support chatbot using GPT-3.5 and LlamaIndex. Also, stay tuned for bonus tips and tricks on how to evaluate your chatbot at the end of this article :)&lt;/p&gt;

&lt;h1&gt;
  
  
  A small request 🥺
&lt;/h1&gt;

&lt;p&gt;I produce weekly content and your support would really help me continue. Please support me and my company &lt;a href="https://www.confident-ai.com" rel="noopener noreferrer"&gt;Confident AI&lt;/a&gt; by starring our GitHub library. We're building a platform to unit test your chatbot. Thank you very very much! ❤️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introducing the OpenAI API and LlamaIndex
&lt;/h1&gt;

&lt;p&gt;In this tutorial, we're going to use GPT-3.5 provided by the OpenAI API. GPT-3.5 is a machine learning model made by OpenAI - think of it as a super-smart computer buddy. It's been trained on tons of data from the internet, so it can chat, answer questions, and help with all sorts of language tasks.&lt;/p&gt;

&lt;p&gt;But, you might wonder, can raw, out-of-the-box GPT-3.5 answer customer support questions that are specific to my own internal data?&lt;/p&gt;

&lt;p&gt;Unfortunately, the answer is no 😔 because as you may know, GPT models have only been trained on public data up until 2021. This is precisely why we need open source frameworks like LlamaIndex! These frameworks help connect your internal data sources with GPT-3.5, so your chatbot can output tailored responses based on data that regular ChatGPT doesn't know about 😊&lt;/p&gt;

&lt;p&gt;Pretty cool, huh? Let's begin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/QJvwBSGaoc4eI/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/QJvwBSGaoc4eI/giphy.gif" width="500" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Project Setup 🚀
&lt;/h1&gt;

&lt;p&gt;First, I'll guide you through how to set up a project for your chatbot.&lt;/p&gt;

&lt;p&gt;Create the project folder and a Python virtual environment by running the code below in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir customer-support-chatbot
cd customer-support-chatbot
python3 -m venv venv
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your terminal prompt should now look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(venv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing dependencies
&lt;/h2&gt;

&lt;p&gt;Run the following code to install LlamaIndex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install llama-index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we don't need to install &lt;code&gt;openai&lt;/code&gt; separately because LlamaIndex already provides a wrapper that calls the OpenAI API under the hood.&lt;/p&gt;

&lt;p&gt;Create a new &lt;code&gt;main.py&lt;/code&gt; file - the entry point to your chatbot, and &lt;code&gt;chatbot.py&lt;/code&gt; - your chatbot implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch main.py chatbot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting up your internal knowledge base
&lt;/h2&gt;

&lt;p&gt;Create a new &lt;code&gt;data.txt&lt;/code&gt; file in a new &lt;code&gt;data&lt;/code&gt; folder that will contain fake data on MadeUpCompany:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir data
cd data
touch data.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file will contain the data that your chatbot is going to base its responses on. Luckily for us, ChatGPT prepared some fake information on MadeUpCompany 😌 Paste the following text in &lt;code&gt;data.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;About MadeUpCompany
MadeUpCompany is a pioneering technology firm founded in 2010, specializing in cloud computing, data analytics, and machine learning. Our headquarters is based in San Francisco, California, with satellite offices spread across New York, London, and Tokyo. We are committed to offering state-of-the-art solutions that help businesses and individuals achieve their full potential. With a diverse team of experts from various industries, we strive to redefine the boundaries of innovation and efficiency.

Products and Services
We offer a suite of services ranging from cloud storage solutions, data analytics platforms, to custom machine learning models tailored for specific business needs. Our most popular product is CloudMate, a cloud storage solution designed for businesses of all sizes. It offers seamless data migration, top-tier security protocols, and an easy-to-use interface. Our data analytics service, DataWiz, helps companies turn raw data into actionable insights using advanced algorithms.

Pricing
We have a variety of pricing options tailored to different needs. Our basic cloud storage package starts at $9.99 per month, with premium plans offering more storage and functionalities. We also provide enterprise solutions on a case-by-case basis, so it’s best to consult with our sales team for customized pricing.

Technical Support
Our customer support team is available 24/7 to assist with any technical issues. We offer multiple channels for support including live chat, email, and a toll-free number. Most issues are typically resolved within 24 hours. We also have an extensive FAQ section on our website and a community forum for peer support.

Security and Compliance
MadeUpCompany places the utmost importance on security and compliance. All our products are GDPR compliant and adhere to the highest security standards, including end-to-end encryption and multi-factor authentication.

Account Management
Customers can easily manage their accounts through our online portal, which allows you to upgrade your service, view billing history, and manage users in your organization. If you encounter any issues or have questions about your account, our account management team is available weekdays from 9 AM to 6 PM.

Refund and Cancellation Policy
We offer a 30-day money-back guarantee on all our products. If you're not satisfied for any reason, you can request a full refund within the first 30 days of your purchase. After that, you can still cancel your service at any time, but a prorated refund will be issued based on the remaining term of your subscription.

Upcoming Features
We’re constantly working to improve our services and offer new features. Keep an eye out for updates on machine learning functionalities in DataWiz and more collaborative tools in CloudMate in the upcoming quarters.

Your customer support staff can use these paragraphs to build their responses to customer inquiries, providing both detailed and precise information to address various questions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, navigate back to &lt;code&gt;customer-support-chatbot&lt;/code&gt; containing &lt;code&gt;main.py&lt;/code&gt;, and set your OpenAI API key as an environment variable. In your terminal, paste in this with your own API key (get yours &lt;a href="https://openai.com/blog/openai-api" rel="noopener noreferrer"&gt;here&lt;/a&gt; if you don't already have one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export OPENAI_API_KEY="your-api-key-here"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All done! Let's start coding.&lt;/p&gt;

&lt;h1&gt;
  
  
  Building a Chatbot with LlamaIndex 🦄
&lt;/h1&gt;

&lt;p&gt;To begin, we first have to chunk and index the text we have in &lt;code&gt;data.txt&lt;/code&gt; into a format that's readable for GPT-3.5. So you might wonder, what do you mean by "readable"? 🤯&lt;/p&gt;

&lt;p&gt;Well, GPT-3.5 has something called a context limit, which refers to how much text the model can "see" or consider at one time. Think of it like the model's short-term memory. If you give it a really long paragraph or a big conversation history, it might reach its limit and not be able to add much more to it. If you hit this limit, you might have to shorten your text so the model can understand and respond properly. &lt;/p&gt;

&lt;p&gt;In addition, GPT-3.5 performs worse if you supply it with way too much text, kind of like how someone loses focus if you tell too long a story. This is exactly where LlamaIndex shines 🦄 LlamaIndex helps us break down large bodies of text into chunks that can be consumed by GPT-3.5 🥳&lt;/p&gt;
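&lt;p&gt;Under the hood, chunking can be as simple as a sliding window over words. Here's a naive sketch (real frameworks like LlamaIndex chunk by tokens and respect sentence boundaries, so treat this as illustration only):&lt;/p&gt;

```python
def chunk_text(text, chunk_size=8, overlap=2):
    # Naive word-based chunker: each chunk holds chunk_size words, and
    # consecutive chunks share `overlap` words so context isn't cut mid-thought.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

text = " ".join(f"word{i}" for i in range(20))
for c in chunk_text(text):
    print(c)
```

&lt;p&gt;Each chunk is then small enough to embed and to fit comfortably inside the model's context limit.&lt;/p&gt;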

&lt;p&gt;&lt;a href="https://i.giphy.com/media/o1i0XbXsqd4CALDbnj/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/o1i0XbXsqd4CALDbnj/giphy.gif" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a few lines of code, we can build our chatbot using LlamaIndex. Everything from chunking the text from &lt;code&gt;data.txt&lt;/code&gt; to calling the OpenAI APIs is handled by LlamaIndex. Paste the following code into &lt;code&gt;chatbot.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

def query(user_input):
    return query_engine.query(user_input).response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the following in &lt;code&gt;main.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from chatbot import query

while True:
    user_input = input("Enter your question: ")
    response = query(user_input)
    print("Bot response:", response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now try it for yourself by running the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feel free to switch out the text in &lt;code&gt;data/data.txt&lt;/code&gt; with your own knowledge base!&lt;/p&gt;

&lt;h1&gt;
  
  
  Improving your Chatbot
&lt;/h1&gt;

&lt;p&gt;You might start to run into situations where the chatbot isn't performing as well as you hope for certain questions/inputs. Luckily there are several ways to improve your chatbot 😊&lt;/p&gt;

&lt;h2&gt;
  
  
  Parsing your data into smaller/bigger chunks
&lt;/h2&gt;

&lt;p&gt;The quality of output from your chatbot is directly affected by the size of the text chunks (scroll down for a better explanation of why). &lt;/p&gt;

&lt;p&gt;In &lt;code&gt;chatbot.py&lt;/code&gt;, add &lt;code&gt;service_context = ServiceContext.from_defaults(chunk_size=1000)&lt;/code&gt; and pass it to &lt;code&gt;VectorStoreIndex&lt;/code&gt; to alter the chunk size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(chunk_size=1000)
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()

def query(user_input):
    return query_engine.query(user_input).response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Play around with the size parameter to find what works best :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Providing more context to GPT-3.5
&lt;/h2&gt;

&lt;p&gt;Depending on your data, you might benefit from supplying fewer or more text chunks to GPT-3.5. You can do this by setting &lt;code&gt;query_engine = index.as_query_engine(similarity_top_k=5)&lt;/code&gt; in &lt;code&gt;chatbot.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(chunk_size=1000)
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=5)

def query(user_input):
    return query_engine.query(user_input).response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Evaluating your chatbot ⚖️
&lt;/h1&gt;

&lt;p&gt;By now, you might have run into the problem of eyeballing your chatbot's output. You make a small configuration change, such as changing the number of retrieved text chunks, run &lt;code&gt;main.py&lt;/code&gt;, type in the same old query, and wait 5 seconds to see if the result has gotten any better 😰 Sound familiar? &lt;/p&gt;

&lt;p&gt;The problem becomes worse if you want to inspect outputs from not just one, but several different queries. &lt;a href="https://www.confident-ai.com/blog/how-to-evaluate-llm-applications" rel="noopener noreferrer"&gt;Here is a great read on how you can build your own evaluation framework in less than 20 minutes&lt;/a&gt;, but if you'd prefer not to reinvent the wheel, consider using a free open source package like &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt;. It helps you evaluate your chatbot so you don't have to do it yourself 😌&lt;/p&gt;

&lt;p&gt;Since I'm slightly biased as the cofounder of Confident AI (which is the company behind DeepEval), I'm going to go ahead and show you how DeepEval can help with evaluating your chatbot (no but seriously, we offer unit testing for chatbots, have a stellar developer experience, and a &lt;a href="https://app.confident-ai.com" rel="noopener noreferrer"&gt;free platform&lt;/a&gt; for you to holistically view your chatbot's performance 🥵)&lt;/p&gt;

&lt;p&gt;Install by running the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install deepeval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a new test file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch test_chatbot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste in the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pytest
from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.test_case import LLMTestCase
from deepeval.run_test import assert_test
from chatbot import query

def test_1():
    input = "What does your company do?"
    output = query(input)
    context = "Our company specializes in cloud computing, data analytics, and machine learning. We offer a range of services including cloud storage solutions, data analytics platforms, and custom machine learning models."
    factual_consistency_metric = FactualConsistencyMetric(minimum_score=0.7)
    test_case = LLMTestCase(output=output, context=context)
    assert_test(test_case, [factual_consistency_metric])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepeval test run test_chatbot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your test should have passed! Let's break down what happened. The variable &lt;code&gt;input&lt;/code&gt; mimics a user input, and &lt;code&gt;output&lt;/code&gt; is what your chatbot outputs based on this query. The variable &lt;code&gt;context&lt;/code&gt; contains the relevant information from your knowledge base, and &lt;code&gt;FactualConsistencyMetric(minimum_score=0.7)&lt;/code&gt; is an out-of-the-box metric provided by DeepEval for you to assess how factually correct your chatbot's output is based on the provided context. The score ranges from 0 to 1, and &lt;code&gt;minimum_score=0.7&lt;/code&gt; ultimately determines whether your test passes.&lt;/p&gt;
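&lt;p&gt;To demystify the metric a little, here's a crude, dependency-free sketch of threshold-based scoring (&lt;code&gt;word_overlap_score&lt;/code&gt; is invented for illustration - DeepEval's actual factual consistency metric is model-based, not word overlap):&lt;/p&gt;

```python
def word_overlap_score(output, context):
    # Crude stand-in for a factual consistency metric: the fraction of
    # context words that also appear in the output.
    out_words = set(output.lower().split())
    ctx_words = set(context.lower().split())
    if not ctx_words:
        return 0.0
    return len(ctx_words.intersection(out_words)) / len(ctx_words)

context = "we offer cloud storage and data analytics"
output = "We offer cloud storage and data analytics to businesses"

score = word_overlap_score(output, context)
minimum_score = 0.7
print(score, score >= minimum_score)  # 1.0 True
```

&lt;p&gt;The key idea is the same as in the test above: compute a 0-1 score, then pass or fail against a minimum threshold.&lt;/p&gt;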

&lt;p&gt;&lt;a href="https://i.giphy.com/media/aNbGyHcDYphNbhe4EE/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/aNbGyHcDYphNbhe4EE/giphy.gif" width="480" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add more tests to stop wasting time on fixing breaking changes to your chatbot 😙&lt;/p&gt;

&lt;h1&gt;
  
  
  How does your chatbot work under the hood?
&lt;/h1&gt;

&lt;p&gt;The chatbot we just built actually relies on an architecture called &lt;strong&gt;Retrieval Augmented Generation (RAG)&lt;/strong&gt;. Retrieval Augmented Generation is a way to make GPT-3.5 smarter by letting it pull in fresh or specific information from an outside source, in this case &lt;code&gt;data.txt&lt;/code&gt;. So, when you ask it something, it can give you a more current and relevant answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdah2071nsip7cvlpl5j7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdah2071nsip7cvlpl5j7.png" alt="RAG architecture" width="800" height="1519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the previous two sections, we looked at how tweaking two parameters, the text chunk size and the number of text chunks used, can impact the quality of answers you get from GPT-3.5. This is because when you ask your chatbot a question, LlamaIndex &lt;strong&gt;retrieves&lt;/strong&gt; the most relevant text chunks from &lt;code&gt;data.txt&lt;/code&gt;, which GPT-3.5 then uses to &lt;strong&gt;generate&lt;/strong&gt; a data-&lt;strong&gt;augmented&lt;/strong&gt; answer.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, you've learnt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what OpenAI GPT-3.5 is,&lt;/li&gt;
&lt;li&gt;how to build a simple chatbot on your own data using LlamaIndex,&lt;/li&gt;
&lt;li&gt;how to improve the quality of your chatbot,&lt;/li&gt;
&lt;li&gt;how to evaluate your chatbot using DeepEval,&lt;/li&gt;
&lt;li&gt;what RAG is and how it works.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tutorial walked you through an example of a chatbot you can build using LlamaIndex and GPT-3.5. With LlamaIndex, you can create powerful personalized chatbots useful in various applications, such as customer support, user onboarding, sales enablement, and more 🥳&lt;/p&gt;

&lt;p&gt;The source code for this tutorial is available here:&lt;br&gt;
&lt;a href="https://github.com/confident-ai/blog-examples/tree/main/customer-support-chatbot" rel="noopener noreferrer"&gt;https://github.com/confident-ai/blog-examples/tree/main/customer-support-chatbot&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>chatgpt</category>
      <category>programming</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
