<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Confident AI</title>
    <description>The latest articles on DEV Community by Confident AI (@confidentai).</description>
    <link>https://dev.to/confidentai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7558%2F8ddc4c4a-1f4f-4470-94d7-a64a40831379.png</url>
      <title>DEV Community: Confident AI</title>
      <link>https://dev.to/confidentai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/confidentai"/>
    <language>en</language>
    <item>
      <title>🔪 6 Killer Open-Source Libraries to Achieve AI Mastery in 2024 🔥🪄</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Mon, 11 Dec 2023 10:41:32 +0000</pubDate>
      <link>https://dev.to/confidentai/6-killer-open-source-libraries-to-achieve-ai-mastery-before-2024-4p1c</link>
      <guid>https://dev.to/confidentai/6-killer-open-source-libraries-to-achieve-ai-mastery-before-2024-4p1c</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;AI has traditionally been a very difficult field for web developers to break into... until now 😌 With the introduction of large language models (LLMs) like ChatGPT, it seems like anyone can become an AI engineer these days. But make no mistake, this couldn't be further from the truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article, I will reveal the current top AI libraries that make a mediocre AI engineer exceptional.&lt;/strong&gt; As an ex-Google, ex-Microsoft AI engineer myself, I will show you how exceptional AI engineers use these libraries to build great applications.&lt;/p&gt;

&lt;p&gt;Are you ready to up-skill yourself and get one step closer to becoming an AI wizard before 2024? Let's begin 🤗&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/IN8gg3Gci335S/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/IN8gg3Gci335S/giphy.gif" width="499" height="199"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  1. &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt; - Open-source Evaluation Infrastructure for LLMs
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mefpmuz0bkbgzyy5ujd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mefpmuz0bkbgzyy5ujd.png" alt="Image description" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A good engineer can build, but an exceptional engineer can communicate the value of what they've built. DeepEval allows you to do exactly that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepEval allows you to unit test and debug your large language model (LLM, or just AI) applications at scale in both development and production&lt;/strong&gt; in under 10 lines of code.&lt;/p&gt;

&lt;p&gt;Why is this valuable, you ask? Because companies nowadays want to be seen as innovative AI companies, so stakeholders prefer engineers who can not just build like an indie hacker, but also &lt;strong&gt;ship reliable AI applications&lt;/strong&gt; like a seasoned AI specialist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_test&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnswerRelevancyMetric&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chatbot&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_chatbot&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
   &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How to become an AI engineer in 2024?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
   &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
   &lt;span class="n"&gt;answer_relevancy_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnswerRelevancyMetric&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;answer_relevancy_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  2. Unstructured - Pre-processing for Unstructured Data
&lt;/h1&gt;

&lt;p&gt;LLMs thrive because they are versatile and can handle a wide variety of inputs, but not every format. Unstructured helps you easily transform unstructured data like webpages, PDFs, and tables into readable formats for LLMs. &lt;/p&gt;

&lt;p&gt;What does this mean? It means you can now tailor your AI application to your internal documents. Unstructured is amazing because, in my opinion, it operates at the right level of abstraction - it handles the boring hard work while giving you enough control as a developer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unstructured.partition.auto&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;

&lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example-docs/eml/fake-email.eml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/Unstructured-IO/unstructured" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star Unstructured&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  3. Airbyte - Data Integration for LLMs
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvz4bwmtilvs5qwu8qgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvz4bwmtilvs5qwu8qgg.png" alt="Image description" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Airbyte, you can connect data sources and move data around - most of what you need to build a real-time AI application. It lets your LLMs tap into information beyond the data they were trained on. &lt;/p&gt;

&lt;p&gt;Like Unstructured, Airbyte provides a great level of abstraction over the work an AI engineer does.&lt;/p&gt;
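&lt;p&gt;To make the idea concrete, here is a minimal sketch in plain Python (not Airbyte's actual API) of the incremental sync pattern such a tool implements for you: each run moves only the records newer than a saved cursor, so the destination stays fresh without reloading everything.&lt;/p&gt;

```python
# Hypothetical sketch of incremental sync - NOT Airbyte's real interface.
# Only rows updated after the saved cursor are copied on each run.

def incremental_sync(source_rows, destination, state):
    """Copy rows newer than the saved cursor, then advance the cursor."""
    cursor = state.get("cursor", 0)
    new_rows = [r for r in source_rows if r["updated_at"] > cursor]
    destination.extend(new_rows)
    if new_rows:
        state["cursor"] = max(r["updated_at"] for r in new_rows)
    return destination, state

rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 200},
]
dest, state = incremental_sync(rows, [], {})       # first run copies both rows
dest, state = incremental_sync(rows, dest, state)  # second run copies nothing new
```

A real connector adds schemas, retries, and checkpointed state, but the cursor-based loop above is the core of what gets automated away.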

&lt;p&gt;&lt;a href="https://github.com/airbytehq/airbyte" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star Airbyte&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  4. Qdrant - Fast Vector Search for LLMs
&lt;/h1&gt;

&lt;p&gt;Ever wondered what happens if you feed too much data into ChatGPT? That's right, you'll run into a context length error. &lt;/p&gt;

&lt;p&gt;That's because LLMs cannot take in infinite information. To help with that, we need a way to feed in only the relevant information. This process is known as retrieval augmented generation (RAG). &lt;a href="https://www.confident-ai.com/blog/what-is-retrieval-augmented-generation" rel="noopener noreferrer"&gt;Here's another great article on what RAG is.&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;
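&lt;p&gt;Here's a bare-bones sketch of the retrieval step, with toy hard-coded embeddings standing in for a real embedding model: score every stored chunk against the query with cosine similarity and keep only the top match, so your prompt stays within the context window.&lt;/p&gt;

```python
# Toy retrieval step for RAG. The 3-dimensional "embeddings" are hard-coded
# for illustration; a real system would compute them with an embedding model.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
}

def retrieve(query_embedding, top_k=1):
    """Return the top_k document chunks most similar to the query."""
    scored = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:top_k]]

print(retrieve([0.85, 0.15, 0.05]))  # ['refund policy']
```

A vector database like Qdrant does exactly this lookup, just over millions of vectors with an index instead of a linear scan.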

&lt;p&gt;Qdrant is a vector database that helps you do just that. It stores and retrieves relevant information at blazing-fast speed, ensuring your application stays up to date with the real world.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/qdrant/qdrant" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star Qdrant&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  5. MemGPT - Memory Management for LLMs
&lt;/h1&gt;

&lt;p&gt;So Qdrant helps give LLMs "long-term memory", but what happens if there's too much to "remember"? MemGPT helps you manage memory for this exact use case. &lt;/p&gt;

&lt;p&gt;MemGPT is like a cache for vector databases, with its own way of clearing caches. It helps you manage redundant information in your knowledge bases, making your AI application more performant and accurate.&lt;/p&gt;
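&lt;p&gt;The underlying idea can be sketched in a few lines (this is not MemGPT's actual API): keep a small, bounded "working memory" for the LLM context, and page the oldest facts out to long-term storage when the budget is exceeded.&lt;/p&gt;

```python
# Illustrative sketch only - not MemGPT's real interface. A bounded working
# memory evicts its oldest entries into an archive (think: a vector store).
from collections import deque

class BoundedMemory:
    def __init__(self, max_items):
        self.working = deque()   # what fits in the LLM's context window
        self.archive = []        # stands in for long-term storage like Qdrant
        self.max_items = max_items

    def remember(self, fact):
        self.working.append(fact)
        while len(self.working) > self.max_items:
            # Page the oldest fact out of the context budget.
            self.archive.append(self.working.popleft())

memory = BoundedMemory(max_items=2)
for fact in ["user likes Python", "user is in Berlin", "user prefers dark mode"]:
    memory.remember(fact)

print(list(memory.working))  # ['user is in Berlin', 'user prefers dark mode']
print(memory.archive)        # ['user likes Python']
```

MemGPT's actual paging is smarter (the LLM itself decides what to move where), but the cache-plus-archive split is the core concept.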

&lt;p&gt;&lt;a href="https://github.com/cpacker/MemGPT" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star MemGPT&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  6. LiteLLM - LLM proxy
&lt;/h1&gt;

&lt;p&gt;LiteLLM is a proxy for multiple LLMs. It is great for experimentation and, combined with &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt;, allows you to pick the best model for your use case. The best part? It lets you call any supported model through the same OpenAI interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;litellm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;## set ENV variables 
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-openai-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, how are you?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

&lt;span class="c1"&gt;# openai call
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/BerriAI/litellm" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 Star LiteLLM&lt;/a&gt;
&lt;/p&gt;



&lt;h1&gt;
  
  
  Closing Remarks
&lt;/h1&gt;

&lt;p&gt;That's all, folks - thanks for reading, and I hope you learned a few things along the way!&lt;/p&gt;

&lt;p&gt;Please like and comment if you enjoyed this article, and as always, don't forget to give open source some love by starring these repos as a token of appreciation 🌟.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>⤴️How to build a Midjourney API with Nest.js 🚀</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Wed, 29 Nov 2023 10:59:17 +0000</pubDate>
      <link>https://dev.to/confidentai/how-to-build-unofficial-midjourney-api-with-nestjs-1lnd</link>
      <guid>https://dev.to/confidentai/how-to-build-unofficial-midjourney-api-with-nestjs-1lnd</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;In this post I will show you the architecture for building an unofficial Midjourney API with TypeScript and Nest.js. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famqplf9oby0m0pvbzau4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famqplf9oby0m0pvbzau4.gif" alt="Lets go" width="480" height="320"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h3&gt;

&lt;h4&gt;
  
  
  DeepEval evaluates performance based on metrics such as factual consistency, accuracy, answer relevancy
&lt;/h4&gt;

&lt;p&gt;We are just starting out.&lt;br&gt;
Can you help us with a star, please? 😽&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  ➡️ Please &lt;strong&gt;Like, Heart and star this article&lt;/strong&gt; 
&lt;/h2&gt;
&lt;h2&gt;
  
  
  What are we going to build?
&lt;/h2&gt;

&lt;p&gt;To start off, let's understand how Midjourney works on Discord. People use simple commands to talk to an AI bot. The bot then takes these commands and creates pictures that match the descriptions given by the users.&lt;/p&gt;

&lt;p&gt;In order to mimic this behavior, we will need to create an API that interacts with a Discord bot. This bot can interact with Discord and hence send commands to Midjourney. Here is a high-level design of the command flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5dyahofq7juij9e83sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5dyahofq7juij9e83sj.png" alt="architecture of midjourney" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: This tutorial is only for educational purposes and for understanding the nature of the Discord and Midjourney interaction. It is not recommended to use it as a production service or in any official project.&lt;/p&gt;

&lt;p&gt;The reason I chose Nest.js is that there are plenty of Python examples of how to build an API for Midjourney, but no decent one that shows the process using JavaScript or Node.js. I also prefer Nest.js because it is well organized and makes bootstrapping a project easy. &lt;/p&gt;
&lt;h3&gt;
  
  
  Connecting to Midjourney via a Discord bot
&lt;/h3&gt;

&lt;p&gt;In the absence of a formal API, a connection to Midjourney is facilitated via a Discord bot. The process includes the following steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: In order for this to work, you must have a Midjourney subscription.&lt;/p&gt;


&lt;h4&gt;
  
  
  Step 1: Create a Discord bot.
&lt;/h4&gt;

&lt;p&gt;Take a moment to help me, please. I am working very hard to create the best open-source tool for LLM evaluation. &lt;/p&gt;

&lt;p&gt;Please give me a star - I will truly appreciate it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;The first step towards a complete Midjourney API is to create our Discord bot. Discord has an interface for creating bots for different purposes. Go ahead &lt;a href="https://discord.com/developers/applications" rel="noopener noreferrer"&gt;and create your MJ bot&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwg95irzw6u0n7i020ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwg95irzw6u0n7i020ii.png" alt="Discord bot configuration" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a &lt;a href="https://www.upwork.com/resources/how-to-make-discord-bot" rel="noopener noreferrer"&gt;great article&lt;/a&gt; for creating a Discord bot.&lt;/p&gt;

&lt;p&gt;Once you've created the bot, you'll receive an invite link. Use it to invite the bot to your Discord server - we'll use this later to generate and receive images.&lt;/p&gt;




&lt;h4&gt;
  
  
  Step 2: Implementing /Imagine command
&lt;/h4&gt;

&lt;p&gt;After creating a &lt;a href="https://nestjs.com/" rel="noopener noreferrer"&gt;Nest.js&lt;/a&gt; app, go ahead and create your &lt;code&gt;discord&lt;/code&gt; module. This module will interact with our Discord server and Midjourney.&lt;/p&gt;

&lt;p&gt;Let's begin with our &lt;em&gt;controller&lt;/em&gt;, which should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Controller&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;discord&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscordController&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;discordService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DiscordService&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;imagine&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;imagine&lt;/span&gt;&lt;span class="p"&gt;(@&lt;/span&gt;&lt;span class="nd"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;prompt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;discordService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendImagineCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, I have created a discord module with a single &lt;code&gt;POST&lt;/code&gt; request. We will pass a &lt;code&gt;prompt&lt;/code&gt; to our &lt;code&gt;discord/imagine&lt;/code&gt; request.&lt;/p&gt;

&lt;p&gt;Next, let's create our discord service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Injectable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscordService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;httpService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;HttpService&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;


  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;sendImagineCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;postUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://discord.com/api/v9/interactions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;uniqueId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateId&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;postPayload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;application_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;guild_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;GUILD_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;CHANNEL_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;SESSION_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;COMMAND_VERSION&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;IMAGINE_COMMAND_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;imagine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prompt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; --no &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;uniqueId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="na"&gt;application_command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;IMAGINE_COMMAND_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;application_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;COMMAND_VERSION&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;default_member_permissions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;nsfw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;imagine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Create images with Midjourney&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;dm_permission&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;contexts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prompt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The prompt to imagine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;


    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;postHeaders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;your&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;httpService&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;postUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;postPayload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;postHeaders&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toPromise&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;uniqueId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;


  &lt;span class="nf"&gt;generateId&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will notice a few things here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We use the &lt;code&gt;https://discord.com/api/v9/interactions&lt;/code&gt; Discord endpoint to send commands to the Discord server. This is the main entry point for requests to Midjourney. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We mimic a web-browser request to Discord, and here is the real "magic": after signing in to Discord on the web, send the &lt;code&gt;/imagine&lt;/code&gt; command to Midjourney from the web interface. &lt;br&gt;
Once the request is sent, you will see the imagine command in the &lt;code&gt;Network&lt;/code&gt; tab, and its payload is very similar to the one above. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy the relevant fields: &lt;code&gt;IMAGINE_COMMAND_ID&lt;/code&gt;, &lt;code&gt;COMMAND_VERSION&lt;/code&gt;, &lt;code&gt;SESSION_ID&lt;/code&gt;, &lt;code&gt;GUILD_ID&lt;/code&gt;, &lt;code&gt;CHANNEL_ID&lt;/code&gt; and &lt;code&gt;APPLICATION_ID&lt;/code&gt;. These will be used in our service. We also need to copy &lt;code&gt;MIDJOURNEY_TOKEN&lt;/code&gt;, which is sent as part of the request. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy &lt;code&gt;BOT_TOKEN&lt;/code&gt; from the bot application page we created earlier. It is required to communicate with our bot. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will also notice the &lt;code&gt;uniqueId&lt;/code&gt; that we generate with our &lt;code&gt;generateId()&lt;/code&gt; function. We append it to the prompt via Midjourney's &lt;code&gt;--no&lt;/code&gt; parameter so we can later trace the request back in Discord and fetch the generated images. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
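&lt;p&gt;To make these moving parts concrete, here is a minimal sketch of how the copied values and the &lt;code&gt;--no&lt;/code&gt; tracking trick fit together. Every value below is a placeholder, and &lt;code&gt;buildTrackedPrompt&lt;/code&gt; is a hypothetical helper name, not part of any Discord or Midjourney API:&lt;/p&gt;

```javascript
// Sketch only: every value here is a placeholder. The real IDs and token
// must be copied from your own browser's Network tab, as described above.
const MIDJOURNEY_CONFIG = {
  IMAGINE_COMMAND_ID: "replace-with-command-id",
  COMMAND_VERSION: "replace-with-command-version",
  SESSION_ID: "replace-with-session-id",
  GUILD_ID: "replace-with-guild-id",
  CHANNEL_ID: "replace-with-channel-id",
  APPLICATION_ID: "replace-with-application-id",
  MIDJOURNEY_TOKEN: "replace-with-your-token",
};

// Append the generated id through Midjourney's `--no` parameter so the
// resulting Discord message can be traced back to this request later.
function buildTrackedPrompt(prompt, uniqueId) {
  return `${prompt} --no ${uniqueId}`;
}
```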

&lt;p&gt;Once this step is complete, you can call Discord with the &lt;code&gt;/imagine&lt;/code&gt; command and generate images with Midjourney. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reminder&lt;/strong&gt;: this post only describes how the flow works for technical interest; it is not recommended for use in any real project.&lt;/p&gt;




&lt;h4&gt;
  
  
  Step 3: Fetching generated images.
&lt;/h4&gt;

&lt;p&gt;Let's create a new controller to fetch images:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mj/results/:id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;getMidjourneyResults&lt;/span&gt;&lt;span class="p"&gt;(@&lt;/span&gt;&lt;span class="nd"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;discordService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getResultFromMidjourney&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;attachmentUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attachmentUrl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;discordService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processAndUpload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attachmentUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the unique &lt;code&gt;id&lt;/code&gt; generated when creating our &lt;code&gt;/imagine&lt;/code&gt; request to fetch the results from Discord.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;getResultFromMidjourney&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MIDJOURNEY_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;channelUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`https://discord.com/api/v9/channels/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;CHANNEL_ID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/messages?limit=50`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;httpService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;channelUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;toPromise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;matchingMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
        &lt;span class="nx"&gt;message&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;components&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;component&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;component&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;components&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;label&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;U1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// means that we can upscale results&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attachments&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;attachment&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;attachment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchAndEncodeImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attachment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;matchingMessage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="c1"&gt;// do something &lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;fetchAndEncodeImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AxiosResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;httpService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;responseType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arraybuffer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;toPromise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;binary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`data:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;content-type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt;;base64,&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;https://discord.com/api/v9/channels/${CHANNEL_ID}/messages?limit=50&lt;/code&gt; endpoint fetches the latest messages from our Discord channel, from which we retrieve our images. &lt;/p&gt;

&lt;p&gt;Since a Midjourney generation takes about 60 seconds or more, we need to poll this channel every few seconds to check for results. &lt;/p&gt;
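&lt;p&gt;A minimal polling sketch could look like the following; &lt;code&gt;fetchFn&lt;/code&gt; stands in for our service's &lt;code&gt;getResultFromMidjourney&lt;/code&gt;, and the interval and attempt counts are arbitrary choices, not values Discord requires:&lt;/p&gt;

```javascript
// Hedged sketch: poll until a matching message appears or we give up.
async function pollForResult(fetchFn, id, intervalMs = 5000, maxAttempts = 30) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const messages = await fetchFn(id);
    if (messages && messages.length) {
      return messages; // generation finished, upscale buttons are present
    }
    // wait before asking Discord again, to avoid hammering the endpoint
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return null; // timed out; the generation may still be in progress
}
```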

&lt;p&gt;Let's give it a try with &lt;code&gt;{ prompt: "a cat" }&lt;/code&gt; :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8y1fbxfunpovnju6n8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8y1fbxfunpovnju6n8r.png" alt="Midjourney API cat" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;That's it! You now have a working Midjourney API for testing and fun, and you've learned how Discord's bot architecture works. &lt;/p&gt;

&lt;h3&gt;
  
  
  Final thoughts
&lt;/h3&gt;

&lt;p&gt;You now have a bootstrap project that demonstrates how Discord communicates with Midjourney to generate amazing AI images.&lt;br&gt;
You can build a nice UI and have your own generative AI platform. Good luck!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>midjourney</category>
      <category>api</category>
      <category>nestjs</category>
    </item>
    <item>
      <title>Why OpenAI Assistants is a Big Win for LLM Evaluation</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Fri, 24 Nov 2023 12:13:34 +0000</pubDate>
      <link>https://dev.to/confidentai/why-openai-assistants-is-a-big-win-for-llm-evaluation-540l</link>
      <guid>https://dev.to/confidentai/why-openai-assistants-is-a-big-win-for-llm-evaluation-540l</guid>
      <description>&lt;p&gt;A week after the famous, or infamous, OpenAI Dev Day, we at Confident AI released JudgementalGPT — an LLM agent built using OpenAI’s Assistants API, specifically designed for the purpose of evaluating other LLM applications. What initially started off as an experimental idea quickly turned into a prototype that we were eager to ship as we received feedback from users that JudgementalGPT gave more accurate and reliable results when compared to other state-of-the-art LLM-based evaluation approaches such as G-Eval.&lt;/p&gt;

&lt;p&gt;Understandably, knowing that &lt;a href="https://www.confident-ai.com/" rel="noopener noreferrer"&gt;Confident AI is the world’s first open-source evaluation infrastructure for LLMs&lt;/a&gt;, many demanded more transparency into how JudgementalGPT was built after our initial public release:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I thought it’s all open source, but it seems like JudgementalGPT, in particular, is a black box for users. It would be great if we had more knowledge on how this is built.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So here you go, dear anonymous internet stranger, this article is dedicated to you.&lt;/p&gt;



&lt;h1&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;DeepEval is a framework that helps engineers evaluate the performance of their LLM applications by providing default metrics to measure hallucination, relevancy, and much more.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are just starting out, and we really want to help more developers build safer AI apps. Would you mind giving it a star to spread the word, please? 🥺❤️🥺&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h1&gt;
  
  
  Limitations of LLM-based evaluations
&lt;/h1&gt;

&lt;p&gt;The authors of G-Eval &lt;a href="https://arxiv.org/pdf/2303.16634.pdf" rel="noopener noreferrer"&gt;state that&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For those who don’t already know, &lt;a href="https://www.confident-ai.com/blog/a-gentle-introduction-to-llm-evaluation" rel="noopener noreferrer"&gt;G-Eval is a framework that utilizes Large Language Models (LLMs) with chain-of-thought (CoT) processing to evaluate the quality of generated texts in a form-filling paradigm&lt;/a&gt;, and if you’ve ever tried implementing a version of your own, you’ll quickly find that using LLMs for evaluation presents its own set of problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unreliability&lt;/strong&gt; — although G-Eval uses a low-precision grading scale (1–5), which makes scores easier to interpret, these scores can vary a lot even under the same evaluation conditions. This variability comes from an intermediate step in G-Eval that dynamically generates the steps used for later evaluation, which increases the stochasticity of evaluation scores (and is also why providing an initial seed value doesn’t help).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inaccuracy&lt;/strong&gt; — for certain tasks, one digit usually dominates (e.g., 3 for a grading scale of 1–5 using gpt-3.5-turbo). A way to get around this problem would be to take the probabilities of output tokens from an LLM to normalize the scores and take their weighted summation as the final score. But, unfortunately, this isn’t an option if you’re using OpenAI’s GPT models as an evaluator, since they deprecated the logprobs parameter a few months ago.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
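&lt;p&gt;To make the logprobs workaround concrete, here is a minimal sketch (plain code, not an OpenAI API call) of turning the probabilities assigned to each score token into a probability-weighted final score:&lt;/p&gt;

```javascript
// Given the probability the model assigns to each score token (e.g. "1".."5",
// as recovered from logprobs), return the probability-weighted score instead
// of a single sampled digit.
function weightedScore(tokenProbs) {
  const entries = Object.entries(tokenProbs);
  // normalize, since the listed token probabilities may not sum to exactly 1
  const total = entries.reduce((sum, [, p]) => sum + p, 0);
  return entries.reduce((sum, [token, p]) => sum + Number(token) * (p / total), 0);
}
```

For example, a model that puts equal mass on the "3" and "4" tokens yields a score of 3.5 rather than an arbitrary 3 or 4.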

&lt;p&gt;In fact, &lt;a href="https://arxiv.org/pdf/2306.05685.pdf" rel="noopener noreferrer"&gt;another paper that explored LLM-as-a-judge&lt;/a&gt; pointed out that using LLMs as an evaluator is flawed in several ways. For example, GPT-4 gives preferential treatment to self-generated outputs, is not very good at math (but neither am I), and is prone to verbosity bias. Verbosity bias means it favors longer, verbose responses instead of accurate, shorter alternatives. &lt;em&gt;(In fact, an initial study has shown that GPT-4 exhibits verbosity bias 8.75% of the time)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Can you see how this becomes a problem if you’re trying to evaluate a summarization task?&lt;/p&gt;



&lt;h1&gt;
  
  
  OpenAI Assistants offers a workaround to existing problems
&lt;/h1&gt;

&lt;p&gt;Here’s a surprise — JudgementalGPT isn’t composed of one evaluator built using the new OpenAI Assistant API, but multiple. That’s right, behind the scenes, JudgementalGPT is a proxy for multiple assistants that perform different evaluations depending on the evaluation task at hand. Here are the problems JudgementalGPT was designed to solve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bias&lt;/strong&gt; — we’re still experimenting with this (another reason for close-sourcing JudgementalGPT!), but assistants have the ability to write and execute code using the code interpreter tool, which means that, with a bit of prompt engineering, it can account for tasks that are more prone to logical fallacies, such as asserting coding or math problems, or tasks that require more factuality rather than giving preferential treatment to its own outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliability&lt;/strong&gt; — since we no longer require LLMs to dynamically generate CoTs/evaluation steps, we can enforce a set of rules for specific evaluation tasks. In other words, since we’ve pre-defined multiple sets of evaluation steps based on the evaluation task at hand, we have removed the biggest parameter contributing to stochasticity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt; — having a set of pre-defined evaluation steps for different tasks also means we can provide more guidance based on what we as humans actually expect from each evaluator, and quickly iterate on the implementation based on user feedback.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
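&lt;p&gt;As a hypothetical illustration (this is not Confident AI's actual implementation, and every name below is made up), routing an evaluation task to a pre-defined list of steps, with a dynamically generated G-Eval-style fallback, could look like this:&lt;/p&gt;

```javascript
// Hypothetical routing sketch: each known task maps to a fixed list of
// evaluation steps, removing the stochastic step-generation phase.
const PREDEFINED_STEPS = {
  summarization: [
    "Check the summary for factual alignment with the source text",
    "Check that all key points of the source are covered",
    "Penalize unnecessary verbosity",
  ],
  coding: [
    "Execute the code with the code interpreter",
    "Assert the output matches the stated requirements",
  ],
};

// Unknown tasks fall back to dynamically generated steps (as in G-Eval).
function selectEvaluationSteps(task, generateStepsDynamically) {
  const steps = PREDEFINED_STEPS[task];
  return steps ? steps : generateStepsDynamically(task);
}
```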

&lt;p&gt;Another insight we gained while integrating G-Eval into our open-source project DeepEval was that LLM-generated evaluation steps tend to be arbitrary, lengthy, and generally do not help in providing guidance for evaluation. Some of you might also wonder what happens when JudgementalGPT can’t find a suitable evaluator for a particular evaluation task. For this edge case, we default back to G-Eval. Here’s a quick architecture diagram of how JudgementalGPT works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ed8bxf1noyles82a4mp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ed8bxf1noyles82a4mp.png" alt="Image description" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While writing this article, I discovered a recent paper introducing &lt;a href="https://arxiv.org/pdf/2310.08491.pdf" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, “a fully open-source LLM that is on par with GPT-4’s evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied”, which also requires evaluation steps to be explicitly defined.&lt;/p&gt;




&lt;h1&gt;
  
  
  Still, problems with LLM-based evaluation linger
&lt;/h1&gt;

&lt;p&gt;One unresolved issue pertains to the accuracy challenges stemming from the predominance of a single digit in evaluation scores. This phenomenon, theoretically, isn’t exclusive to older models and is likely to affect advanced versions like &lt;code&gt;gpt-4-1106-preview&lt;/code&gt; as well. So, I’m keeping an open mind about how this might affect JudgementalGPT. We’re really looking forward to more research that’ll either back up what we think or give us a whole new perspective — either way, I’m all ears.&lt;/p&gt;

&lt;p&gt;Lastly, there can still be intricacies involved in defining our own set of evaluators. For example, just as G-Eval isn’t a one-size-fits-all solution, neither is summarization, or relevancy. Any metric that is open to interpretation is guaranteed to disappoint users who expect something different. For now, the best solution is to have users clearly define their evaluation criteria, ridding LLMs of any evaluation ambiguity.&lt;/p&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;At the end of the day, there’s no one-size-fits-all solution for LLM-based evaluations, which is why engineers/data scientists are frequently disappointed by non-human evaluation scores. However, by defining specific and concise evaluation steps for different use cases, LLMs are able to navigate ambiguity better, as they are provided more guidance into what a human might expect for different evaluation criteria.&lt;/p&gt;

&lt;p&gt;P.S. By now, those of you who read between the lines will probably know the key to building a better evaluator is to tailor them for specific use cases, and OpenAI’s new Assistant API along with its code interpreter functionality is merely the icing on the cake (and a good marketing strategy!).&lt;/p&gt;

&lt;p&gt;So, dear anonymous internet stranger, I hope you’re satisfied, and till next time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>learning</category>
    </item>
    <item>
      <title>⤴️ Be a prompt engineer: Understanding Midjourney LLM</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Wed, 15 Nov 2023 09:43:06 +0000</pubDate>
      <link>https://dev.to/confidentai/be-a-prompt-engineer-understanding-midjourney-llm-464k</link>
      <guid>https://dev.to/confidentai/be-a-prompt-engineer-understanding-midjourney-llm-464k</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;By now, you've probably seen those incredible AI-generated images on your social feeds and thought to yourself, &lt;strong&gt;"How are people making these amazing images?"&lt;/strong&gt; So you jump onto Midjourney, ready to create your own, but somehow, what comes out isn't quite what you pictured.&lt;/p&gt;

&lt;p&gt;Don't worry — I've got you covered.&lt;br&gt;
In order to get amazing images out of Midjourney, you need to be able to write prompts like a pro. Since Midjourney is based on an LLM, it all comes down to understanding its nature and how to get the most out of it.&lt;/p&gt;

&lt;p&gt;Do you want to become a Prompt Hero? Then this guide is for you!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntctmiu0qh7j4wljj8a3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntctmiu0qh7j4wljj8a3.png" alt="Midjourney prompt hero" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h3&gt;
&lt;h4&gt;
  
  
  DeepEval evaluates performance based on metrics such as factual consistency, accuracy, answer relevancy
&lt;/h4&gt;

&lt;p&gt;We are just starting out.&lt;br&gt;
Can you help us with a star, please? 😽&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Creating your first Midjourney artwork
&lt;/h2&gt;

&lt;p&gt;To get started with Midjourney, sign up for &lt;a href="https://discord.com/register" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; and complete the registration process. Once you have Discord running, open the &lt;a href="//midjourney.com"&gt;Midjourney website&lt;/a&gt; and choose &lt;code&gt;Join Beta&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxkuyy1upc0w6da6ndeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxkuyy1upc0w6da6ndeg.png" alt="Midjourney website" width="794" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After signing up, you can select a paid or a free plan. &lt;br&gt;
On a free plan, you can generate images in any of the Midjourney newbies channels; paid users can send commands directly to the Midjourney bot. &lt;/p&gt;

&lt;p&gt;To create your first image, type &lt;code&gt;/&lt;/code&gt; followed by the &lt;code&gt;imagine&lt;/code&gt; command. You can then enter a prompt (a description for generating an image), for example: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine&lt;/code&gt; &lt;code&gt;prompt: beautiful colorful horse&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hkb2mjtottaat2jc4hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hkb2mjtottaat2jc4hb.png" alt="beautiful horse" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Midjourney will generate an image based on your prompt. &lt;/p&gt;
&lt;h2&gt;
  
  
  How does Midjourney work?
&lt;/h2&gt;

&lt;p&gt;Midjourney uses an LLM (a large language model) to create images from text descriptions. This model has been trained on a vast array of text-image pairs, enabling it to understand and interpret the text prompts to produce similar images.&lt;/p&gt;

&lt;p&gt;Let's break down this image creation process:&lt;/p&gt;
&lt;h4&gt;
  
  
  Analyzing the Prompt
&lt;/h4&gt;

&lt;p&gt;The LLM starts by dissecting the prompt into its core ideas and terms. If you input something like "a photorealistic portrait of a woman," the system identifies key concepts like "photorealistic," "portrait," and "woman."&lt;/p&gt;

&lt;p&gt;A basic Midjourney prompt looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyf29vpfbmdgc9ygorm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyf29vpfbmdgc9ygorm9.png" alt="basic prompt" width="570" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A more advanced prompt may look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d6eodywvyu8wnivfmop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d6eodywvyu8wnivfmop.png" alt="advanced prompt" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll get back to that later. What's important is to understand that whatever you write is used to create the &lt;strong&gt;latent vector&lt;/strong&gt; in the following step.&lt;/p&gt;
&lt;h4&gt;
  
  
  Generating a Latent Vector
&lt;/h4&gt;

&lt;p&gt;Next, the LLM translates these concepts into a latent vector. This is a numerical code that captures all the image details - its color palette, shapes, style, objects, and more.&lt;/p&gt;

&lt;p&gt;All those parameters are used inside the model to understand your request, by matching the vector to data it already knows and has been trained on.&lt;/p&gt;

&lt;p&gt;This is why the following tip from the official Midjourney documentation is important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Midjourney Bot works best with simple, short sentences that describe what you want to see. Avoid long lists of requests. Instead of: "Show me a picture of lots of blooming California poppies, make them bright, vibrant orange, and draw them in an illustrated style with colored pencils," try: "Bright orange California poppies drawn with colored pencils."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;Pro tip: use short prompts!&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;h4&gt;
  
  
  Using a Diffusion Model to generate the image
&lt;/h4&gt;

&lt;p&gt;The final step of generating the image involves converting this latent vector into the actual image. This is where a diffusion model comes into play. It's a kind of AI that can form images from seemingly random patterns.&lt;/p&gt;

&lt;p&gt;Starting with a blank canvas, the model slowly refines the image, adding layers of detail until it reflects what the latent vector describes. The way it adds this 'noise' is controlled, making sure the final image is clear and recognizable.&lt;/p&gt;
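As a toy numeric analogy of that refinement loop (not Midjourney's actual model, which uses a trained network to predict and remove noise), you can picture an "image" as a vector that starts as pure noise and is nudged toward a target latent vector a little on every step:

```python
import random

# Toy analogy of iterative refinement in a diffusion model: start from
# random noise and take small steps toward a target "latent vector".
# This only illustrates why intermediate outputs look blurry at first
# and sharpen over successive steps.

def refine(target: list[float], steps: int = 50, rate: float = 0.2) -> list[float]:
    random.seed(0)
    image = [random.uniform(-1.0, 1.0) for _ in target]  # pure noise
    for _ in range(steps):
        # move each value a fraction of the way toward the target
        image = [x + rate * (t - x) for x, t in zip(image, target)]
    return image

target = [0.5, -0.3, 0.9, 0.1]   # made-up latent vector
result = refine(target)
error = sum(abs(t - x) for t, x in zip(target, result))
```

After 50 steps the remaining error is tiny, mirroring how the blurry preview converges to the final picture.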

&lt;p&gt;Other well-known generative AI platforms, such as &lt;a href="https://stability.ai/" rel="noopener noreferrer"&gt;Stable Diffusion&lt;/a&gt;, use the same techniques. &lt;/p&gt;

&lt;p&gt;This is also why, while waiting for Midjourney to complete its image creation, you notice blurry images that gradually turn into amazing artwork. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxcwh5azpn7qomzq5f6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxcwh5azpn7qomzq5f6u.png" alt="Diffusion model" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The basics
&lt;/h2&gt;

&lt;p&gt;Begin with a short prompt and focus on what you want to create: the subject.&lt;br&gt;
Let's say we are interested in creating a portrait of a woman. We can begin with something like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine A portrait of a young woman with light blue eyes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmayr4bnu62lc28wn1lc8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmayr4bnu62lc28wn1lc8.png" alt="A portrait of a young woma" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have our initial image, it is all about iterations and improvements. We can now focus on details that matter, such as &lt;strong&gt;medium, mood, composition, environment&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Let's say we want to get a more &lt;strong&gt;realistic&lt;/strong&gt; photo:&lt;br&gt;
&lt;code&gt;/imagine A realistic photo of a young woman with light blue eyes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau1dmx9xz04f3qfsdk88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau1dmx9xz04f3qfsdk88.png" alt="A realistic photo" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This one is more realistic; however, let's give it the touch of an old photograph. To achieve that, we can simply add a year, say, 1960.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine A realistic photo of a young woman with light blue eyes, year 1960&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdd4vboiw2mymve5r5xg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdd4vboiw2mymve5r5xg.png" alt="year 1960" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've come a long way by only adding small details, such as the year and the medium type (realistic).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Pro tip: The Midjourney Bot does not comprehend grammar, sentence structure, or words as humans do. Using fewer words means that each one has a more powerful influence.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, let's add a composition; for instance, if I am interested in a headshot from above, we can revise our prompt accordingly:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine Bird-eye view realistic photo, of a young woman with light blue eyes, 1960&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdstc4p9qlj4f0ptqkv8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdstc4p9qlj4f0ptqkv8h.png" alt="Bird-eye view realistic photo" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty cool, right?&lt;/p&gt;

&lt;p&gt;Continue experimenting with various elements such as environment, emotions, colors, and more to discover the diverse outcomes they can produce.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ymaqcfnurgkdn3hazk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ymaqcfnurgkdn3hazk.png" alt="Midjourney styles" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Midjourney, utilizing a well-trained Large Language Model (LLM) and a diffusion model, has the capability to generate a wide range of variations based on your initial image. This allows for a great deal of flexibility and creativity in the image creation process.&lt;/p&gt;

&lt;p&gt;By instructing the bot to produce either strong or weak variations, you can refine the output step by step. You might start with a broad concept and then progressively narrow down the details, or you could begin with a highly specific image and explore slight adjustments. The process continues until you reach a result that meets your vision or preference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9coeaqhq5mos9ibiq45b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9coeaqhq5mos9ibiq45b.png" alt="Image variation" width="479" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Asking for a strong variation will result in the following images:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw0nr9x4kytkbjp499yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw0nr9x4kytkbjp499yf.png" alt="Image variation" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Advanced techniques
&lt;/h2&gt;

&lt;p&gt;Now that we understand the basics of Midjourney LLM, we can dive into its parameters. Parameters are options added to a prompt that change how an image is generated.&lt;/p&gt;
&lt;h3&gt;
  
  
  Changing aspect ratio
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Pro tip: parameters are always added at the end of the prompt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;One of the most important parameters is the aspect ratio. Midjourney's default aspect ratio is square (1:1), but what if we want to create a wide cover image (such as this article's cover) or a portrait image?&lt;br&gt;
We just need to add &lt;code&gt;--ar&lt;/code&gt; at the end of the prompt. For example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine Bird-eye view realistic photo, of a young woman with light blue eyes, 1960 --ar 1:2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ciaa35gke0glvlq6mrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ciaa35gke0glvlq6mrt.png" alt="aspect ratio" width="768" height="1536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the &lt;code&gt;--ar&lt;/code&gt; followed by the aspect ratio here. &lt;/p&gt;
&lt;h3&gt;
  
  
  Getting more artistic
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Using styles
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;--style&lt;/code&gt; parameter replaces the default style of some Midjourney Model Versions. &lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;--style raw&lt;/code&gt; will result in a more accurate prompt style, and less beautification. Let's have a look at the following example: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine cat icon&lt;/code&gt; will generate this kind of image, which is beautiful, but not really an icon:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7gq5vfr4qysc32krk0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7gq5vfr4qysc32krk0b.png" alt="Image icon" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we add &lt;code&gt;--style raw&lt;/code&gt; to it, Midjourney will generate a much more relevant image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0vyf84qf321wh9ecfr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0vyf84qf321wh9ecfr0.png" alt="Image icon raw" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Niji model
&lt;/h4&gt;

&lt;p&gt;Midjourney has an alternative model called &lt;code&gt;niji 5&lt;/code&gt; which allows you to use additional style parameters. &lt;br&gt;
Adding &lt;code&gt;--niji 5&lt;/code&gt; followed by a style such as &lt;code&gt;cute&lt;/code&gt;, &lt;code&gt;expressive&lt;/code&gt;, &lt;code&gt;original&lt;/code&gt;, or &lt;code&gt;scenic&lt;/code&gt; will result in more sophisticated images. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine cat --niji 5 --style cute&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwhti1lra9tm7hnpetwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwhti1lra9tm7hnpetwv.png" alt="a cute cat" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an LLM-based generator, Midjourney is trained on a huge amount of data, incorporating different artistic styles.&lt;br&gt;
Providing a &lt;code&gt;--stylize&lt;/code&gt; parameter influences how strongly this training is applied, with the range being between 0 and 1000; higher values will generate a more artistic image.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine child's drawing of a dog&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0asxrqqdqf8hbi6k8r32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0asxrqqdqf8hbi6k8r32.png" alt="stylize images" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Ready to become a pro?
&lt;/h2&gt;

&lt;p&gt;Before moving forward, I would appreciate it if you could like or 'heart' this article — it would help me a lot.&lt;/p&gt;

&lt;p&gt;Also, please check out my open-source GitHub library. Would you mind giving it a star? ❤️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Here comes the fun part. But before we start, I would like to share with you the way I create nice photos and understand Midjourney LLM better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding inspirations
&lt;/h3&gt;

&lt;p&gt;When looking for inspiration, I head to the &lt;a href="https://www.midjourney.com/showcase/" rel="noopener noreferrer"&gt;Midjourney Showcase page&lt;/a&gt; where I look for inspiring photos. Once I've found one, I download the photo and ask Midjourney to &lt;code&gt;describe&lt;/code&gt; it. This process is similar to reverse engineering the LLM, revealing how Midjourney transforms text into an image.&lt;/p&gt;

&lt;p&gt;For example, I have found this image interesting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiv2awjg9i4w23wlww5s1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiv2awjg9i4w23wlww5s1.png" alt="elephant Midjourney" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I asked Midjourney to describe it using the &lt;code&gt;/describe&lt;/code&gt; command. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjowbs5wno5iusu8hwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjowbs5wno5iusu8hwf.png" alt="Describe image" width="800" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's a good starting point for your next image generation. Take the keywords that created this image and use them to generate images with a similar look and feel.&lt;br&gt;
Here I noticed the text "a polygonal elephant in a dark background", which is dominant, but also &lt;strong&gt;"in the style of graphic design influence, stephen shortridge"&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Pro tip: Midjourney knows how to generate images in the style of a given artist&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Prompt &lt;code&gt;/imagine a polygonal elephant, in the style of stephen shortridge&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff194mkbexn6pzhjseevu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff194mkbexn6pzhjseevu.png" alt="A polygon elephant" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's get weird
&lt;/h3&gt;

&lt;p&gt;We can get unconventional images with the &lt;code&gt;--weird&lt;/code&gt; parameter. When using this parameter, Midjourney creates unique and unexpected outcomes. &lt;code&gt;--weird&lt;/code&gt; accepts values from 0 to 3000 (the default is 0); the higher the value, the more unexpected the outcome.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine elephant --weird ...&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52ky52iwiqkjr37d2o88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52ky52iwiqkjr37d2o88.png" alt="weird elephant" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Permutations
&lt;/h3&gt;

&lt;p&gt;What if we wish to try different colors, say red/green/blue/yellow elephant?&lt;/p&gt;

&lt;p&gt;We can use permutations by adding &lt;code&gt;{ ... }&lt;/code&gt; to our prompt, separating the options with commas. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine a { red, green, blue, yellow } elephant&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will create 4 Midjourney jobs in a single shot. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r94baafn5fxcn9cvrjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r94baafn5fxcn9cvrjh.png" alt="4 elephants" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;
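The expansion logic is easy to sketch yourself. Here is a small Python illustration of how a single, non-nested `{...}` group fans out into separate jobs; this mimics the documented permutation behavior, it is not Midjourney's code:

```python
import itertools
import re

# Expand a prompt containing {a, b, c} groups into one prompt per
# combination, the way Midjourney turns a permutation prompt into
# multiple jobs. Handles flat (non-nested) groups only.

def expand_permutations(prompt: str) -> list[str]:
    groups = re.findall(r"\{([^{}]*)\}", prompt)
    if not groups:
        return [prompt]
    options = [[o.strip() for o in g.split(",")] for g in groups]
    results = []
    for combo in itertools.product(*options):
        out = prompt
        for choice in combo:
            # replace the next {...} group with the chosen option
            out = re.sub(r"\{[^{}]*\}", choice, out, count=1)
        results.append(out)
    return results

jobs = expand_permutations("a { red, green, blue, yellow } elephant")
```

With the four-color prompt above, `jobs` holds the four expanded prompts, one per color, matching the four jobs Midjourney would launch.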

&lt;h3&gt;
  
  
  Midjourney Tiles
&lt;/h3&gt;

&lt;p&gt;This is probably one of the most amazing, yet hidden, Midjourney features. The &lt;code&gt;--tile&lt;/code&gt; parameter generates an image that can be repeated seamlessly as a tile.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/imagine watercolor elephant --tile&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u9fu3s6ncg41p24buat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u9fu3s6ncg41p24buat.png" alt="Midjourney tiles" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Final thoughts
&lt;/h3&gt;

&lt;p&gt;Understanding the Midjourney LLM is the key to generating amazing images and photos. &lt;br&gt;
If you can think of any other helpful Midjourney prompt engineering tips that I haven't covered in this article, please share them in the comments section below. 👇🏻&lt;/p&gt;

&lt;p&gt;So, that is it for this article.&lt;/p&gt;

&lt;p&gt;Thank you so much for reading! 🤩🙏&lt;/p&gt;

</description>
      <category>llm</category>
      <category>promptengineering</category>
      <category>midjourney</category>
      <category>productivity</category>
    </item>
    <item>
      <title>What is Retrieval Augmented Generation (RAG)? 🚀</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Wed, 25 Oct 2023 10:30:40 +0000</pubDate>
      <link>https://dev.to/confidentai/what-is-retrieval-augmented-generation-rag-4n7g</link>
      <guid>https://dev.to/confidentai/what-is-retrieval-augmented-generation-rag-4n7g</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;In this article, I’m going to talk about what RAG is and how to implement a RAG-based LLM application (yes, with a complete code sample 😚)&lt;/p&gt;

&lt;p&gt;Let’s dive right in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/C2L2bXRnv2chSO1mAH/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/C2L2bXRnv2chSO1mAH/giphy.gif" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DeepEval is a framework that helps engineers evaluate the performance of their LLM applications by providing default metrics to measure hallucination, relevancy, and much more.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are just starting out, and we really want to help more developers build safer AI apps. Would you mind giving it a star to help spread the word, please? 🥺❤️🥺&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h1&gt;
  
  
  What is RAG?
&lt;/h1&gt;

&lt;p&gt;Retrieval augmented generation (RAG) is a technique in NLP that allows LLMs like ChatGPT to generate customized outputs beyond the scope of the data they were trained on. An LLM application without RAG is akin to asking ChatGPT to summarize an email without providing the actual email as context.&lt;/p&gt;

&lt;p&gt;A RAG system consists of two primary components: the retriever and the generator.&lt;/p&gt;

&lt;p&gt;The retriever is responsible for searching through the knowledge base for the most relevant pieces of information that correlate with the given input, which is referred to as retrieval results. On the other hand, the generator utilizes these retrieval results to craft a series of prompts based on a predefined prompt template to produce a coherent and relevant response to the input.&lt;/p&gt;

&lt;p&gt;Here’s a diagram of a RAG architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7m0u8g9hmpjtq6igc9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7m0u8g9hmpjtq6igc9b.png" alt="A typical RAG architecture" width="800" height="696"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In most cases, your “knowledge base” consists of vector embeddings stored in a vector database like ChromaDB, and your “retriever” will 1) embed the given input at runtime, 2) search through the vector space containing your data to find the top K most relevant retrieval results, and 3) rank the results by relevancy (or distance to your vectorized input embedding). These results are then processed into a series of prompts and passed onto your “generator”, which is your LLM of choice (GPT-4, Llama 2, etc.).&lt;/p&gt;
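A stripped-down sketch of that retriever step, with made-up three-dimensional embeddings standing in for what a real embedding model would produce, looks like this:

```python
import math

# Minimal sketch of the retriever: score every document in the knowledge
# base by cosine similarity to the query embedding, then return the top-K.
# The vectors below are invented for illustration; a real system would
# call an embedding model and a vector database instead.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], docs, k: int = 2) -> list[str]:
    # docs: list of (text, embedding) pairs
    scored = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in scored[:k]]

knowledge_base = [
    ("Refund policy: 30 days", [0.9, 0.1, 0.0]),
    ("Shipping times: 3-5 days", [0.1, 0.9, 0.2]),
    ("Careers page", [0.0, 0.1, 0.9]),
]
results = top_k([0.8, 0.2, 0.1], knowledge_base, k=2)
```

The returned texts would then be stuffed into the generator's prompt template as context.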

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb75hcfkkj8sit0idgbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb75hcfkkj8sit0idgbe.png" alt="Image description" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the more curious, here are the models a retriever commonly employs to extract the most pertinent retrieval results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Neural Network Embeddings&lt;/strong&gt; (e.g. OpenAI/Cohere’s embedding models): ranks documents based on their locational proximity in a multidimensional vector space, enabling an understanding of textual relationships and relevance between an input and the document corpus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Best Match 25 (BM25)&lt;/strong&gt;: a probabilistic retrieval model that enhances text retrieval precision. By weighting term frequencies against inverse document frequencies, it accounts for term significance, ensuring that both common and rare terms influence the relevance ranking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TF-IDF (Term Frequency — Inverse Document Frequency)&lt;/strong&gt;: calculates the significance of a term within a document relative to the broader corpus. By juxtaposing a term’s occurrence in a document with its rarity across the corpus, it ensures a comprehensive relevance ranking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid Search&lt;/strong&gt;: optimizes the relevance of the search results by assigning distinctive weights to different methodologies, such as Neural Network Embeddings, BM25, and TF-IDF.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
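&lt;p&gt;To make one of these ranking models concrete, here’s a minimal TF-IDF scorer in plain Python. This is a toy sketch under stated assumptions (a whitespace tokenizer, a smoothed IDF formula, and made-up example documents); production retrievers use proper tokenization, stemming, and normalization:&lt;/p&gt;

```python
import math
from collections import Counter

def tf_idf_scores(query, documents):
    # Toy sketch: whitespace tokenization and a smoothed IDF.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(documents)

    def idf(term):
        # Terms that appear in fewer documents get a higher weight
        doc_freq = sum(1 for tokens in tokenized if term in tokens)
        return math.log((n_docs + 1) / (doc_freq + 1)) + 1

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sum TF * IDF over the query terms for each document
        score = sum((counts[t] / len(tokens)) * idf(t) for t in query.lower().split())
        scores.append(score)
    return scores

documents = ["the cat sat on the mat", "dogs chase cats", "stocks fell sharply today"]
print(tf_idf_scores("cat mat", documents))  # highest score for the first document
```

&lt;p&gt;A retriever would then rank the documents by these scores and hand the top K to the generator, exactly as described above.&lt;/p&gt;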



&lt;h1&gt;
  
  
  Applications
&lt;/h1&gt;

&lt;p&gt;RAG has various applications across different fields due to its ability to combine retrieval and generation of text for enhanced responses. Having worked with numerous companies building LLM applications at Confident, here are the top four use cases I’ve seen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customer support / user onboarding chatbots&lt;/strong&gt;: No surprises here, retrieve data from internal documents to generate more personalized responses. &lt;a href="https://www.confident-ai.com/blog/building-a-customer-support-chatbot-using-gpt-3-5-and-llamaindex" rel="noopener noreferrer"&gt;Click here to read a full tutorial on how to build one yourself using LlamaIndex.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Extraction&lt;/strong&gt;. Interestingly, we can use RAG to extract relevant data from documents such as PDFs. &lt;a href="https://www.confident-ai.com/blog/how-to-build-a-pdf-qa-chatbot-using-openai-and-chromadb" rel="noopener noreferrer"&gt;You can find a tutorial on how to do it here.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sales enablement&lt;/strong&gt;: retrieve data from LinkedIn profiles and email threads to generate more personalized outreach messages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Content creation and enhancement&lt;/strong&gt;: retrieve data from past message conversations to generate suggested message replies.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the following code walkthrough, we’ll be building a very generalized chatbot, and you’ll be able to customize its functionality for any of the use cases listed above by tweaking the prompts and the data stored in your vector database.&lt;/p&gt;



&lt;h1&gt;
  
  
  Project Setup
&lt;/h1&gt;

&lt;p&gt;For this project, we’re going to build a question-answering (QA) chatbot based on your knowledge base. We’re not going to cover the part on how to index your knowledge base, as that’s a discussion for another day.&lt;/p&gt;

&lt;p&gt;We’re going to be using Python, ChromaDB for our vector database, and OpenAI for both vector embeddings and chat completion. We’re going to build a chatbot on top of your favorite Wikipedia page.&lt;/p&gt;

&lt;p&gt;First, set up a new project directory and install the dependencies we need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;rag-llm-app
&lt;span class="nb"&gt;cd &lt;/span&gt;rag-llm-app
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your terminal should now start with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installing dependencies
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openai chromadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a new main.py file — the entry point to your LLM application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Getting your API keys
&lt;/h1&gt;

&lt;p&gt;Lastly, go ahead and get your OpenAI API key here if you don’t already have one, and set it as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-openai-api-key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’re good to go! Let’s start coding.&lt;/p&gt;



&lt;h1&gt;
  
  
  Building a RAG-based LLM application
&lt;/h1&gt;

&lt;p&gt;Begin by creating a Retriever class that will retrieve the most relevant data from ChromaDB for a given user question.&lt;/p&gt;

&lt;p&gt;Open main.py and paste in the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chromadb.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;embedding_functions&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Retriver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_retrieval_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;openai_ef&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_functions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddingFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-openai-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai_ef&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;retrieval_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query_texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;retrieval_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;openai_ef&lt;/code&gt; is the embedding function used under the hood by ChromaDB to vectorize an input. When a user sends a question to your chatbot, a vector embedding will be created from this question using OpenAI’s &lt;code&gt;text-embedding-ada-002&lt;/code&gt; model. This vector embedding will then be used by ChromaDB to perform a vector similarity search in the collection vector space, which contains data from your knowledge base (remember, we’re assuming you’ve already indexed data for this tutorial). This process allows you to search for the top K most relevant retrieval results for any given input.&lt;/p&gt;
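&lt;p&gt;If you’re curious what that similarity search actually does, here’s a minimal sketch of top K retrieval with cosine similarity in plain Python. The 3-dimensional vectors are made-up stand-ins for the 1536-dimensional embeddings that &lt;code&gt;text-embedding-ada-002&lt;/code&gt; produces:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_embedding, doc_embeddings, documents, k=2):
    # Rank every document by similarity to the query vector, highest first
    ranked = sorted(
        zip(documents, doc_embeddings),
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]

documents = ["refund policy", "shipping times", "login help"]
embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
query = [0.8, 0.2, 0.1]  # pretend this is the embedded user question
print(top_k(query, embeddings, documents, k=2))  # ['refund policy', 'shipping times']
```

&lt;p&gt;A vector database performs the same ranking, just over millions of vectors using approximate nearest-neighbor indexes instead of a brute-force sort.&lt;/p&gt;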

&lt;p&gt;Now that you’ve created your retriever, paste in the following code to create a generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;openai_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai_model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a helpful assistant with a thick country accent. Answer the question below and if you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know the answer, say you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know.

            {text}
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieval_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;prompts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;retrieval_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we constructed a series of prompts in the &lt;code&gt;generate_response&lt;/code&gt; method based on a list of &lt;code&gt;retrieval_results&lt;/code&gt; that will be provided by the retriever we built earlier. We then send this series of prompts to OpenAI to generate an answer. Using RAG, your QA chatbot can now produce more customized outputs by enhancing the generation with retrieval results!&lt;/p&gt;

&lt;p&gt;To wrap things up, let’s put everything together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Chatbot&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retriver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Retriver&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;retrieval_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_retrieval_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieval_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Creating an instance of the Chatbot class
&lt;/span&gt;&lt;span class="n"&gt;chatbot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chatbot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Taking user input from the CLI
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chatbot: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all folks! You just built your very first RAG-based chatbot.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, you’ve learnt what RAG is, some use cases for RAG, and how to build your own RAG-based LLM application. However, you might have noticed that building your own RAG application is pretty complicated, and indexing your data is often a non-trivial task. Luckily, there are existing open-source frameworks like LangChain and LlamaIndex that allow you to implement what we’ve demonstrated in a much simpler way.&lt;/p&gt;

&lt;p&gt;If you like the article, don’t forget to give us a star on Github ❤️: &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also find the full code example here: &lt;a href="https://github.com/confident-ai/blog-examples/tree/main/rag-llm-app" rel="noopener noreferrer"&gt;https://github.com/confident-ai/blog-examples/tree/main/rag-llm-app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Till next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The one thing everyone's doing wrong with ChatGPT... 🤫🤔</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Tue, 03 Oct 2023 09:08:39 +0000</pubDate>
      <link>https://dev.to/confidentai/the-one-thing-everyones-doing-wrong-with-chatgpt-3api</link>
      <guid>https://dev.to/confidentai/the-one-thing-everyones-doing-wrong-with-chatgpt-3api</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;Most developers don't evaluate their GPT outputs when building applications, even if that means introducing unnoticed breaking changes, because evaluation is very, very hard. In this article, you're going to learn how to evaluate ChatGPT (LLM) outputs the right way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔥 On the agenda&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what are LLMs and why they're difficult to evaluate&lt;/li&gt;
&lt;li&gt;different ways to evaluate LLM outputs&lt;/li&gt;
&lt;li&gt;how to evaluate in python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enjoy! 🤗&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/QJvwBSGaoc4eI/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/QJvwBSGaoc4eI/giphy.gif" width="500" height="363"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  DeepEval - open-source evaluation framework for LLM applications
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DeepEval is a framework that helps engineers evaluate the performance of their LLM applications by providing default metrics to measure hallucination, relevancy, and much more.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are just starting out, and we really want to help more developers build safer AI apps. Would you mind giving it a star to help spread the word, please? 🥺❤️🥺&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🌟 DeepEval on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb2pvk36eqd892p30ug.png" alt="Github stars" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h1&gt;
  
  
  What are LLMs and what makes them so hard to evaluate?
&lt;/h1&gt;

&lt;p&gt;To understand why LLMs are difficult to evaluate and why they're often referred to as a "black box", let's unpack what LLMs are and how they work.&lt;/p&gt;

&lt;p&gt;ChatGPT is an example of a large language model (LLM) and was trained on huge amounts of data. To be exact, around 300 billion words from articles, tweets, r/tifu, stack-overflow, how-to-guides, and other pieces of data that were scraped off the internet 🤯&lt;/p&gt;

&lt;p&gt;Anyway, the GPT behind "Chat" stands for Generative Pre-trained Transformer. A transformer is a specific neural network architecture that is particularly good at predicting the next few tokens (a token is roughly 4 characters for ChatGPT, but it can be as short as one character or as long as a word, depending on the specific encoding strategy). &lt;/p&gt;

&lt;p&gt;So in fact, LLMs don't really "know" anything; instead, they "understand" linguistic patterns due to the way they were trained, which often makes them pretty good at figuring out the right thing to say. Pretty manipulative, huh?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/bQjaedezBNDNvyyGHT/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/bQjaedezBNDNvyyGHT/giphy.gif" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All jokes aside, if there's one thing you need to remember, it's this: the process of predicting the next plausible "best" token is probabilistic in nature. This means that, &lt;strong&gt;LLMs can generate a variety of possible outputs for a given input, instead of always providing the same response&lt;/strong&gt;. It is exactly this non-deterministic nature of LLMs that makes them challenging to evaluate, as there's often more than one appropriate response.&lt;/p&gt;
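&lt;p&gt;This probabilistic step is easy to see in code. Below is a sketch of softmax sampling with temperature, the mechanism behind that variability; the candidate tokens and their logit values are made up for illustration:&lt;/p&gt;

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    # Convert raw logits into probabilities (softmax with temperature),
    # then draw one token at random according to those probabilities.
    tokens = list(logits.keys())
    scaled = [logit / temperature for logit in logits.values()]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(tokens, weights=probs, k=1)[0]

# Made-up logits for the token following "The capital of France is"
logits = {"Paris": 3.2, "London": 1.1, "Rome": 0.3}
print(sample_next_token(logits, temperature=0.7))  # usually "Paris", but not always
```

&lt;p&gt;Lowering the temperature sharpens the distribution (near zero, the most likely token is picked essentially every time), while raising it makes outputs more varied. This is exactly why the same prompt can produce two different answers, and why evaluation can't just diff outputs against a single expected string.&lt;/p&gt;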



&lt;h1&gt;
  
  
  Why do we need to evaluate LLM applications?
&lt;/h1&gt;

&lt;p&gt;When I say LLM applications, here are some examples of what I'm referring to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chatbots&lt;/strong&gt;: For customer support, virtual assistants, or general conversational agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Assistance&lt;/strong&gt;: Suggesting code completions, fixing code errors, or helping with debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal Document Analysis&lt;/strong&gt;: Helping legal professionals quickly understand the essence of long contracts or legal texts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalized Email Drafting&lt;/strong&gt;: Helping users draft emails based on context, recipient, and desired tone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM applications usually have one thing in common - they perform better when augmented with proprietary data to help with the task at hand. Want to build an internal chatbot that helps boost your employees' productivity? OpenAI certainly doesn't keep tabs on your company's internal data (hopefully 😥). &lt;/p&gt;

&lt;p&gt;This matters because it is now not only OpenAI's job to ensure ChatGPT is performing as expected ⚖️ but also yours to make sure your LLM application is generating the desired outputs by using the right prompt templates, data retrieval pipelines, model architecture (if you're fine-tuning), etc.&lt;/p&gt;

&lt;p&gt;Evaluations (I'll just call them evals from here on) help you measure how well your application is handling the task at hand. Without evals, you will introduce unnoticed breaking changes and would have to manually inspect all possible LLM outputs each time you iterate on your application 👀 which to me sounds like a terrible idea 💀&lt;/p&gt;



&lt;h1&gt;
  
  
  How to evaluate LLM outputs
&lt;/h1&gt;

&lt;p&gt;There are two ways everyone should know about when it comes to evals - with and without ChatGPT. &lt;/p&gt;
&lt;h2&gt;
  
  
  Evals without ChatGPT
&lt;/h2&gt;

&lt;p&gt;A nice way to evaluate LLM outputs without using ChatGPT is to use other machine learning models from the field of NLP. You can use specific models to judge your outputs on different metrics such as factual correctness, relevancy, bias, and helpfulness (just to name a few, but the list goes on), despite the non-deterministic outputs.&lt;/p&gt;

&lt;p&gt;For example, we can use natural language inference (NLI) models (which output an entailment score) to determine how factually correct a response is based on some provided context. The higher the entailment score, the more factually correct an output is, which is particularly helpful if you're evaluating a long output that's not so black and white in terms of factual correctness.&lt;/p&gt;

&lt;p&gt;You might also wonder how these models can possibly "know" whether a piece of text is factually correct 🤔 It turns out you can provide context to these models for them to take at face value 🥳 In fact, we call these contexts &lt;strong&gt;ground truths&lt;/strong&gt; or &lt;strong&gt;references&lt;/strong&gt;. A collection of these references is often referred to as an evaluation dataset.&lt;/p&gt;

&lt;p&gt;But not all metrics require references. For example, relevancy can be calculated using cross-encoder models (another type of ML model), and all you need to do is supply the input and output for the model to determine how relevant they are to one another.&lt;/p&gt;

&lt;p&gt;Off the top of my head, here's a list of reference-less metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relevancy&lt;/li&gt;
&lt;li&gt;bias&lt;/li&gt;
&lt;li&gt;toxicity&lt;/li&gt;
&lt;li&gt;helpfulness&lt;/li&gt;
&lt;li&gt;harmlessness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And here is a list of reference-based metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;factual correctness&lt;/li&gt;
&lt;li&gt;conceptual similarity &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that reference-based metrics don't require you to provide the initial input, as they only judge the output based on the provided context.&lt;/p&gt;
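&lt;p&gt;The distinction boils down to function signatures. Here's a toy sketch (simple word overlap stands in for a real NLP model) showing that a reference-less metric compares the output to the input, while a reference-based metric compares it to a ground truth:&lt;/p&gt;

```python
def relevancy(input_text, output):
    # Reference-less: only needs the input and the output
    input_words = set(input_text.lower().split())
    output_words = set(output.lower().split())
    return len(input_words.intersection(output_words)) / max(len(output_words), 1)

def factual_correctness(output, reference):
    # Reference-based: compares the output against a ground-truth reference,
    # without ever seeing the initial input
    reference_words = set(reference.lower().split())
    output_words = set(output.lower().split())
    return len(reference_words.intersection(output_words)) / max(len(reference_words), 1)

print(relevancy("what is the capital of france", "the capital of france is paris"))
print(factual_correctness("the capital of france is paris", "paris is the capital of france"))
```

&lt;p&gt;Real metrics swap the word-overlap heuristic for cross-encoder or NLI models, but the shape of the inputs stays the same.&lt;/p&gt;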
&lt;h2&gt;
  
  
  Using ChatGPT for Evals
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzscwwk58fwpkkg29l64x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzscwwk58fwpkkg29l64x.png" alt="Image description" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a new emerging trend of using state-of-the-art LLMs (aka ChatGPT) to evaluate themselves or even other LLMs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;G-Eval is a recently developed framework that uses LLMs for evals.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'll attach an image from the &lt;a href="https://arxiv.org/pdf/2303.16634.pdf" rel="noopener noreferrer"&gt;research paper that introduced G-Eval&lt;/a&gt; below, but in a nutshell, G-Eval is a two-part process: the first part generates evaluation steps, and the second uses those generated steps to output a final score.&lt;/p&gt;

&lt;p&gt;Let's run through a concrete example. Firstly, to generate evaluation steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;introduce an evaluation task to ChatGPT (e.g. rate this summary from 1-5 based on relevancy)&lt;/li&gt;
&lt;li&gt;introduce an evaluation criterion (e.g. relevancy will be based on the collective quality of all sentences)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the evaluation steps have been generated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;concatenate the input, evaluation steps, context, and the actual output&lt;/li&gt;
&lt;li&gt;ask ChatGPT to generate a score between 1 and 5, where 5 is better than 1&lt;/li&gt;
&lt;li&gt;(Optional) take the probabilities of the output tokens from the LLM to normalize the score and take their weighted summation as the final result&lt;/li&gt;
&lt;/ol&gt;
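&lt;p&gt;To make the two phases concrete, here's a minimal sketch of how the prompts could be assembled (the wording and example data are hypothetical, not the paper's actual templates):&lt;/p&gt;

```python
# Hypothetical G-Eval-style prompt assembly; the structure follows the
# two-phase process described above, but the wording is illustrative only.
task = "Rate this summary from 1 to 5 based on relevancy."
criteria = "Relevancy is based on the collective quality of all sentences."

# Phase 1: ask the LLM to expand the task + criteria into evaluation steps.
step_prompt = f"{task}\n{criteria}\nList the evaluation steps you would follow:"

# Phase 2: pretend the LLM returned these steps, then concatenate the
# task, steps, context, and actual output into a single scoring prompt.
steps = "1. Read the source text. 2. Check each summary sentence against it."
source = "The cat ran up the tree."
summary = "The cat climbed a tree."
score_prompt = (
    f"{task}\n{criteria}\nEvaluation steps:\n{steps}\n"
    f"Source:\n{source}\nSummary:\n{summary}\n"
    "Score (1-5):"
)
print(score_prompt.splitlines()[-1])  # Score (1-5):
```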

&lt;p&gt;Step 3 is actually pretty complicated 🙃 because to get the probability of the output tokens, you would typically need access to the raw model outputs, not just the final generated text. This step was introduced in the paper because it offers more fine-grained scores that better reflect the quality of outputs.&lt;/p&gt;
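&lt;p&gt;The weighted summation itself boils down to a small calculation. Here's a sketch with made-up token probabilities (a real run would read these from the model's raw output logprobs):&lt;/p&gt;

```python
# Made-up probabilities the LLM might assign to each score token 1-5;
# in practice these come from the model's raw output logprobs.
token_probs = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}

# Normalize, since the probability mass over just the score tokens
# may not sum to 1.
total = sum(token_probs.values())
normalized = {score: p / total for score, p in token_probs.items()}

# The final score is the probability-weighted sum of the candidate scores,
# giving a finer-grained result than a single sampled integer.
final_score = sum(score * p for score, p in normalized.items())
print(round(final_score, 2))  # 3.7
```

&lt;p&gt;Notice how the weighted score (3.7) lands between 3 and 4, which a single sampled integer could never express.&lt;/p&gt;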

&lt;p&gt;Here's a diagram taken from the paper that can help you visualize what we've learnt:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjlzg9wh6dbpwenjty2f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjlzg9wh6dbpwenjty2f.jpg" alt="Image description" width="800" height="573"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Utilizing GPT-4 with G-Eval outperformed traditional metrics in areas such as coherence, consistency, fluency, and relevancy 😳 But evaluations using LLMs can often be very expensive.&lt;/p&gt;

&lt;p&gt;So, my recommendation would be to evaluate with G-Eval as a starting point to establish a performance standard and then transition to more cost-effective traditional methods where suitable.&lt;/p&gt;



&lt;h1&gt;
  
  
  Evaluating LLM Outputs in Python
&lt;/h1&gt;

&lt;p&gt;By now, you probably feel inundated by all the jargon and definitely wouldn't want to implement everything from scratch. Imagine having to research the best way to compute each individual metric, train your own model for it, and code up an evaluation framework... 😰&lt;/p&gt;

&lt;p&gt;Luckily, there are a few open source packages, such as ragas and DeepEval, that provide an evaluation framework so you don't have to write your own 😌&lt;/p&gt;

&lt;p&gt;As the cofounder of Confident (the company behind DeepEval), I'm going to go ahead and shamelessly show you how you can unit test your LLM applications using DeepEval 😊 (but seriously, we offer an amazing Pytest-like developer experience, are easy to set up, and provide a free platform for you to visualize your evaluation results)&lt;/p&gt;

&lt;p&gt;Let's wrap things up with some coding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6udlfxpxh54ipk8qynj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6udlfxpxh54ipk8qynj.gif" alt="Image description" width="500" height="281"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up your test environment
&lt;/h2&gt;

&lt;p&gt;To implement our much-anticipated evals, create a project folder and initialize a Python virtual environment by running the following commands in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir evals-example
cd evals-example
python3 -m venv venv
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your terminal prompt should now start with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(venv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing dependencies
&lt;/h2&gt;

&lt;p&gt;Run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install deepeval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting your OpenAI API Key
&lt;/h2&gt;

&lt;p&gt;Lastly, set your OpenAI API key as an environment variable. We'll need OpenAI for G-Eval later (which basically means using LLMs for evaluation). In your terminal, paste in this with your own API key (get yours &lt;a href="https://openai.com/blog/openai-api" rel="noopener noreferrer"&gt;here&lt;/a&gt; if you don't already have one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export OPENAI_API_KEY="your-api-key-here"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Writing your first test file
&lt;/h2&gt;

&lt;p&gt;Let's create a file called &lt;code&gt;test_evals.py&lt;/code&gt; (note that test files must start with "test"):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch test_evals.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste in the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics.factual_consistency&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FactualConsistencyMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics.answer_relevancy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnswerRelevancyMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics.conceptual_similarity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConceptualSimilarityMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics.llm_eval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMEvalMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.run_test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_test&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_factual_correctness&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What if these shoes don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t fit?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All customers are eligible for a 30 day full refund at no extra costs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We offer a 30-day full refund at no extra costs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;factual_consistency_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FactualConsistencyMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;factual_consistency_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_relevancy&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What does your company do?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Our company specializes in cloud computing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;relevancy_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnswerRelevancyMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;relevancy_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_conceptual_similarity&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What did the cat do?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat climbed up the tree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat ran up the tree.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;conceptual_similarity_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConceptualSimilarityMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;conceptual_similarity_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_humor&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_chat_completion_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write me something funny related to programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why did the programmer quit his job? Because he didn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t get arrays!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;llm_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMEvalMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How funny it is&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;completion_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;make_chat_completion_request&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;llm_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now run the test file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepeval test run test_evals.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each of the test cases, there is a predefined metric provided by DeepEval, and each of these metrics outputs a score from 0 to 1. For example, &lt;code&gt;FactualConsistencyMetric(minimum_score=0.5)&lt;/code&gt; means we want to evaluate how factually consistent an output is, where &lt;code&gt;minimum_score=0.5&lt;/code&gt; means the test will only pass if the output scores higher than the 0.5 threshold. &lt;/p&gt;

&lt;p&gt;Let's go over the test cases one by one:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;test_factual_correctness&lt;/code&gt; tests how factually correct your LLM output is relative to the provided context. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_relevancy&lt;/code&gt; tests how relevant the output is relative to the given input.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_conceptual_similarity&lt;/code&gt; tests how conceptually similar the LLM output is relative to the expected output.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_humor&lt;/code&gt; tests how funny your LLM output is. This test case is the only test case that uses ChatGPT for evaluation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Notice how there are up to four moving parameters for a single test case - the input, the expected output, the actual output (of your application), and the context (that was used to generate the actual output). Depending on the metric you're testing, some parameters are optional, while others are mandatory. &lt;/p&gt;

&lt;p&gt;Lastly, what if you want to test more than one metric on the same input? Here's how you can aggregate metrics on a single test case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_everything&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What did the cat do?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat climbed up the tree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat ran up the tree.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat ran up the tree.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;conceptual_similarity_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConceptualSimilarityMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;relevancy_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnswerRelevancyMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;factual_consistency_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FactualConsistencyMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;conceptual_similarity_metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relevancy_metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;factual_consistency_metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not so hard after all huh? Write enough of these (10-20), and you'll have much better control over what you're building 🤗&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PS. And here's a bonus feature DeepEval offers: free web platform for you to view data on all your test runs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Try running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepeval login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow the instructions (login, get your API key, paste it in the CLI), and run this again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepeval test run test_example.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me know what happens!&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, you've learnt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how ChatGPT works&lt;/li&gt;
&lt;li&gt;examples of LLM applications&lt;/li&gt;
&lt;li&gt;why it's hard to evaluate LLM outputs&lt;/li&gt;
&lt;li&gt;how to evaluate LLM outputs in Python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With evals, you can stop making breaking changes to your LLM application ✅ quickly iterate on your implementation to improve on metrics you care about ✅ and most importantly be confident in the LLM application you build 😇&lt;/p&gt;

&lt;p&gt;If you enjoyed this article, don't forget to &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;give us a star on GitHub!&lt;/a&gt; The source code for this tutorial is available here:&lt;br&gt;
&lt;a href="https://github.com/confident-ai/blog-examples/tree/main/evals-example" rel="noopener noreferrer"&gt;https://github.com/confident-ai/blog-examples/tree/main/evals-example&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading, and till next time 🫡&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/DMHEccCwpNxCQBZlvQ/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/DMHEccCwpNxCQBZlvQ/giphy.gif" width="480" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>programming</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to build a PDF QA chatbot using OpenAI and ChromaDB 🤗</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Tue, 26 Sep 2023 10:38:37 +0000</pubDate>
      <link>https://dev.to/confidentai/how-to-build-a-pdf-qa-chatbot-using-openai-and-chromadb-4mcj</link>
      <guid>https://dev.to/confidentai/how-to-build-a-pdf-qa-chatbot-using-openai-and-chromadb-4mcj</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;In this article, you'll learn how to build a RAG-based chatbot to chat with any PDF of your choice so you can achieve your lifelong dream of talking to PDFs 😏 In the end, I'll also show how you can test what you've built ✅&lt;/p&gt;

&lt;p&gt;I know, I wrote something similar in my &lt;a href="https://www.confident-ai.com/blog/building-a-customer-support-chatbot-using-gpt-3-5-and-llamaindex" rel="noopener noreferrer"&gt;last article on building a customer support chatbot&lt;/a&gt; 😅 but this week we're going to dive deep into how to use the raw OpenAI API to chat with PDF data (including text trapped in visuals like tables) stored in ChromaDB, as well as how to use Streamlit to build the chatbot UI.&lt;/p&gt;

&lt;h1&gt;
  
  
  A small request 🙏🏻
&lt;/h1&gt;

&lt;p&gt;I'm trying to get DeepEval to &lt;strong&gt;5k stars&lt;/strong&gt; by the end of 2023, can you please help me out by starring my repo? It helps me create more weekly high quality content ❤️ thank you very very much!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;br&gt;
&lt;a href="https://i.giphy.com/media/5xtDarmwsuR9sDRObyU/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/5xtDarmwsuR9sDRObyU/giphy.gif" width="443" height="250"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Introducing RAG, Vector Databases, and OCR
&lt;/h1&gt;

&lt;p&gt;Before we dive into the code, let's break down what we're going to implement 🕵️ To begin, &lt;strong&gt;OCR&lt;/strong&gt; (Optical Character Recognition) is a technology within the field of computer vision that recognizes the characters present in a document and converts them into text - this is particularly helpful in the case of tables and charts in documents 😬 We'll be using OCR provided by Azure Cognitive Services in this tutorial.&lt;/p&gt;

&lt;p&gt;Once text chunks are extracted using OCR, they are converted into a high-dimensional vector (aka. vectorized) using embedding models like Word2Vec, FastText, or BERT. These vectors, which encapsulate the semantic meaning of the text, are then indexed in a &lt;strong&gt;vector database&lt;/strong&gt;. We'll be using ChromaDB as our in-memory vector database 🥳&lt;/p&gt;

&lt;p&gt;Now, let's see what happens when a user asks their PDF something. First, the user query is vectorized using the same embedding model that vectorized the extracted PDF text chunks. Then, the top K most semantically similar text chunks are fetched by searching through the vector database, which, remember, contains the text chunks from our PDF. The retrieved text chunks are then provided as context for ChatGPT to generate an answer based on the information in the PDF. This is the process of &lt;strong&gt;retrieval-augmented generation (RAG)&lt;/strong&gt;.&lt;/p&gt;
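&lt;p&gt;The retrieval step can be pictured with a toy example. The sketch below swaps the real embedding model and ChromaDB for simple word-overlap (Jaccard) similarity so it runs without any external services - the chunk text and query are all hypothetical:&lt;/p&gt;

```python
def jaccard(a: str, b: str) -> float:
    """Toy word-overlap similarity, standing in for cosine similarity
    over real embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa.union(wb)
    return len(wa.intersection(wb)) / len(union) if union else 0.0

# "Indexed" PDF chunks - in a real system these live in the vector database.
chunks = [
    "You can request a refund within 30 days of purchase",
    "Our office opened in Berlin last year",
    "Shipping takes five business days",
]

# Retrieval: score every chunk against the query and keep the top K.
query = "can i get a refund"
top_k = sorted(chunks, key=lambda c: jaccard(query, c), reverse=True)[:1]

# Augmented generation: the retrieved chunk is injected into the prompt
# that would then be sent to ChatGPT.
prompt = f"Answer using this context: {top_k[0]}\n\nQuestion: {query}"
print(top_k[0])  # You can request a refund within 30 days of purchase
```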

&lt;p&gt;&lt;a href="https://i.giphy.com/media/QVt7Jq9Sd7ZraYohxi/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/QVt7Jq9Sd7ZraYohxi/giphy.gif" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feeling educated? 😊 Let's begin. &lt;/p&gt;
&lt;h1&gt;
  
  
  Project Setup
&lt;/h1&gt;

&lt;p&gt;First, I'm going to guide you through how to set up your project folders and any dependencies you need to install.&lt;/p&gt;

&lt;p&gt;Create a project folder and a python virtual environment by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir chat-with-pdf
cd chat-with-pdf
python3 -m venv venv
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your terminal prompt should now start with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(venv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing dependencies
&lt;/h2&gt;

&lt;p&gt;Run the following command to install OpenAI API, ChromaDB, and Azure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install openai chromadb azure-ai-formrecognizer streamlit tabulate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's briefly go over what each of these packages does: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;streamlit&lt;/code&gt; - sets up the chat UI, which includes a PDF uploader (thank god 😌)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;azure-ai-formrecognizer&lt;/code&gt; - extracts textual content from PDFs using OCR &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chromadb&lt;/code&gt; - is an in-memory vector database that stores the extracted PDF content&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;openai&lt;/code&gt; - we all know what this does (receives relevant data from chromadb and returns a response based on your chatbot input)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, create a new &lt;code&gt;main.py&lt;/code&gt; file - the entry point to your application&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting your API keys
&lt;/h2&gt;

&lt;p&gt;Lastly, get your OpenAI and Azure API keys ready (click the hyperlinks to get them if you don't already have them)&lt;/p&gt;

&lt;p&gt;Note: It's pretty troublesome to sign up for an account on Azure Cognitive Services. You'll need a card (although they won't charge you automatically) and a phone number 😔 but do give it a try if you're trying to build something serious!&lt;/p&gt;

&lt;h1&gt;
  
  
  Building the Chatbot UI with Streamlit
&lt;/h1&gt;

&lt;p&gt;Streamlit is an easy way to build frontend applications using python.&lt;/p&gt;

&lt;p&gt;Let's import Streamlit along with setting up everything else we'll need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import streamlit as st
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from tabulate import tabulate
from chromadb.utils import embedding_functions
import chromadb
import openai

# You'll need this client later to store PDF data
client = chromadb.Client()
client.heartbeat()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Give our chat UI a title and create a file uploader:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
st.write("# Chat with PDF")

uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Listen for a change event in &lt;code&gt;uploaded_file&lt;/code&gt;. This will be triggered when you upload a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
if uploaded_file is not None:
    # Create a temporary file to write the bytes to
    with open("temp_pdf_file.pdf", "wb") as temp_file:
        temp_file.write(uploaded_file.read())
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;View your streamlit app by running &lt;code&gt;main.py&lt;/code&gt; (we'll implement the chat input UI later):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamlit run main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the easy part done 🥳! Next comes the not-so-easy part...&lt;/p&gt;

&lt;h1&gt;
  
  
  Extracting text from PDFs
&lt;/h1&gt;

&lt;p&gt;Carrying on from the previous code snippet, we're going to send &lt;code&gt;temp_file&lt;/code&gt; to Azure Cognitive Services for OCR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ...
    # you can set this up in the azure cognitive services portal
    AZURE_COGNITIVE_ENDPOINT = "your-custom-azure-api-endpoint"
    AZURE_API_KEY = "your-azure-api-key"
    credential = AzureKeyCredential(AZURE_API_KEY)
    AZURE_DOCUMENT_ANALYSIS_CLIENT = DocumentAnalysisClient(AZURE_COGNITIVE_ENDPOINT, credential)

    # Open the temporary file in binary read mode and pass it to Azure
    with open("temp_pdf_file.pdf", "rb") as f:
        poller = AZURE_DOCUMENT_ANALYSIS_CLIENT.begin_analyze_document("prebuilt-document", document=f)
        doc_info = poller.result().to_dict()
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;doc_info&lt;/code&gt; is a dictionary containing information on the extracted text chunks. It's a pretty complicated dictionary, so I recommend printing it out and seeing for yourself what it looks like.&lt;/p&gt;
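&lt;p&gt;To poke around the result without drowning in nested output, the built-in &lt;code&gt;json&lt;/code&gt; module pretty-prints nested dictionaries nicely. The structure below is a made-up, heavily trimmed stand-in for what Azure returns (the real dictionary has many more keys), just to show the idea:&lt;/p&gt;

```python
import json

# A tiny, made-up stand-in for the dictionary Azure returns.
# The real doc_info has many more keys (spans, bounding boxes, styles, ...).
doc_info = {
    "pages": [
        {
            "page_number": 1,
            "lines": [{"content": "Hello"}, {"content": "world"}],
        }
    ],
    "tables": [],
}

# Pretty-print with indentation so the nesting is readable
print(json.dumps(doc_info, indent=2))

# Join the text lines on each page, like the processing code that follows
for page in doc_info["pages"]:
    page_text = " ".join(line["content"] for line in page["lines"])
    print(page["page_number"], page_text)
```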

&lt;p&gt;Paste in the following to finish processing the data received from Azure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   ...
   res = []
   CONTENT = "content"
   PAGE_NUMBER = "page_number"
   TYPE = "type"
   RAW_CONTENT = "raw_content"
   TABLE_CONTENT = "table_content"

   for p in doc_info['pages']:
        dict = {}
        page_content = " ".join([line["content"] for line in p["lines"]])
        dict[CONTENT] = str(page_content)
        dict[PAGE_NUMBER] = str(p["page_number"])
        dict[TYPE] = RAW_CONTENT
        res.append(dict)

    for table in doc_info["tables"]:
        dict = {}
        dict[PAGE_NUMBER] = str(table["bounding_regions"][0]["page_number"])
        col_headers = []
        cells = table["cells"]
        for cell in cells:
            if cell["kind"] == "columnHeader" and cell["column_span"] == 1:
                for _ in range(cell["column_span"]):
                    col_headers.append(cell["content"])

        data_rows = [[] for _ in range(table["row_count"])]
        for cell in cells:
            if cell["kind"] == "content":
                for _ in range(cell["column_span"]):
                    data_rows[cell["row_index"]].append(cell["content"])
        data_rows = [row for row in data_rows if len(row) &amp;gt; 0]

        markdown_table = tabulate(data_rows, headers=col_headers, tablefmt="pipe")
        dict[CONTENT] = markdown_table
        dict[TYPE] = TABLE_CONTENT
        res.append(dict)
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we accessed various properties of the dictionary returned by Azure to get texts on the page, and data stored in tables. The logic is pretty complex because of all the nested structures 😨 but from personal experience, Azure OCR works well even for complex PDF structures, so I highly recommend giving it a try :)&lt;/p&gt;
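&lt;p&gt;If you're wondering what the &lt;code&gt;tablefmt="pipe"&lt;/code&gt; option produces, it's a markdown-style pipe table. Here's a hand-rolled sketch of the same idea (tabulate additionally handles padding and alignment, so treat this as intuition rather than a drop-in replacement):&lt;/p&gt;

```python
def to_pipe_table(headers, rows):
    """Build a simplified markdown-style pipe table, the same shape
    tabulate's 'pipe' format produces (minus padding and alignment)."""
    lines = ["| " + " | ".join(headers) + " |"]
    lines.append("| " + " | ".join("---" for _ in headers) + " |")
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

print(to_pipe_table(["Name", "Qty"], [["apples", 3], ["pears", 5]]))
# | Name | Qty |
# | --- | --- |
# | apples | 3 |
# | pears | 5 |
```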

&lt;p&gt;&lt;a href="https://i.giphy.com/media/eq8uOgcZ95PV5P97vq/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/eq8uOgcZ95PV5P97vq/giphy.gif" width="480" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Storing PDF content in ChromaDB
&lt;/h1&gt;

&lt;p&gt;Still with me? 😅 Great, we're almost there so hang in there!&lt;/p&gt;

&lt;p&gt;Paste in the code below to store extracted text chunks from &lt;code&gt;res&lt;/code&gt; in ChromaDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ...
    # delete any existing collection so re-uploading a PDF starts fresh
    try:
        client.delete_collection(name="my_collection")
        st.session_state.messages = []
    except Exception:
        # the collection doesn't exist yet on the first upload - that's fine
        pass

    openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key="your-openai-api-key", model_name="text-embedding-ada-002")
    collection = client.create_collection(name="my_collection", embedding_function=openai_ef)
    data = []
    id = 1
    for item in res:
        content = item.get(CONTENT, '')
        page_number = item.get(PAGE_NUMBER, '')
        type_of_content = item.get(TYPE, '')

        content_metadata = {   
            PAGE_NUMBER: page_number,
            TYPE: type_of_content
        }

        collection.add(
            documents=[content],
            metadatas=[content_metadata],
            ids=[str(id)]
        )
        id += 1
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first try block ensures that we can continue uploading PDFs without having to refresh the page. &lt;/p&gt;

&lt;p&gt;You might have noticed that we add data into a &lt;code&gt;collection&lt;/code&gt; and not to the database directly. A collection in ChromaDB is a vector space. When a user enters a query, the search happens inside this collection instead of across the entire database. In Chroma, a collection is identified by a unique &lt;code&gt;name&lt;/code&gt;, and with a single line of code you can add all extracted text chunks to this collection via &lt;code&gt;collection.add(...)&lt;/code&gt;.&lt;/p&gt;
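&lt;p&gt;Conceptually, querying a collection is a nearest-neighbor search over embedding vectors. Here's a toy sketch of that idea with made-up 3-dimensional vectors and plain cosine similarity (ChromaDB does this for you, over real OpenAI embeddings, so the chunk names and vectors below are purely illustrative):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" for three stored text chunks
collection = {
    "chunk-about-cats": [0.9, 0.1, 0.0],
    "chunk-about-dogs": [0.3, 0.9, 0.2],
    "chunk-about-tax-law": [0.0, 0.1, 0.95],
}

query_vector = [0.85, 0.15, 0.05]  # pretend this embeds the user's question

# Rank stored chunks by similarity to the query, most relevant first
ranked = sorted(
    collection,
    key=lambda cid: cosine_similarity(collection[cid], query_vector),
    reverse=True,
)
print(ranked[0])  # chunk-about-cats
```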

&lt;p&gt;&lt;a href="https://i.giphy.com/media/aTGEmpKWeojBs7BCSC/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/aTGEmpKWeojBs7BCSC/giphy.gif" width="480" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Generating a response using OpenAI
&lt;/h1&gt;

&lt;p&gt;I get asked a lot about how to build a RAG chatbot without relying on frameworks like LangChain and LlamaIndex. Well, here's how you do it: you construct a list of prompts dynamically based on the results retrieved from your vector database.&lt;/p&gt;

&lt;p&gt;Paste in the following code to wrap things up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("What do you want to say to your PDF?"):
    # Display your message
    with st.chat_message("user"):
        st.markdown(prompt)
    # Add your message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})

    # query ChromaDB based on your prompt, taking the top 5 most relevant results (ordered by similarity, most relevant first)
    q = collection.query(
        query_texts=[prompt],
        n_results=5,
    )
    results = q["documents"][0]

    prompts = []
    for r in results:
        # construct one prompt per retrieved text chunk; use a new variable
        # so we don't overwrite the user's original prompt on each iteration
        augmented_prompt = "Please extract the following: " + prompt + " solely based on the text below. Use an unbiased and journalistic tone. If you're unsure of the answer, say you cannot find the answer. \n\n" + r
        prompts.append(augmented_prompt)
    prompts.reverse()

    openai_res = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "assistant", "content": prompt} for prompt in prompts],
        temperature=0,
    )

    response = openai_res["choices"][0]["message"]["content"]
    with st.chat_message("assistant"):
        st.markdown(response)

    # append the response to chat history
    st.session_state.messages.append({"role": "assistant", "content": response})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how we reversed &lt;code&gt;prompts&lt;/code&gt; after constructing one prompt per text chunk retrieved from ChromaDB. The results returned by ChromaDB are ordered by descending similarity, meaning the most relevant text chunk is always first in the list. ChatGPT, however, weighs the later prompts in a list more heavily, which is why we reverse the list so the most relevant chunk comes last.&lt;/p&gt;
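&lt;p&gt;To make the ordering concrete, here's a minimal sketch: the retriever returns chunks most-relevant-first, and reversing the constructed prompts puts the most relevant chunk last, right where the model pays the most attention (the prompt template here is a hypothetical simplification, not the exact one used in the app):&lt;/p&gt;

```python
# Chunks as a similarity search returns them: most relevant first
retrieved = ["most relevant chunk", "somewhat relevant chunk", "least relevant chunk"]

# One prompt per chunk
prompts = ["Answer using only this text: " + chunk for chunk in retrieved]

# Reverse so the most relevant chunk becomes the LAST message sent
prompts.reverse()

print(prompts[-1])  # ends with "most relevant chunk"
```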

&lt;p&gt;Run the streamlit app and try things out for yourself 😙:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamlit run main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🎉 Congratulations, you made it to the end! &lt;/p&gt;

&lt;h1&gt;
  
  
  Taking it a step further
&lt;/h1&gt;

&lt;p&gt;As you know, LLM applications are a black box and so for production use cases, you'll want to safeguard the performance of your PDF chatbot to keep your users happy. To learn how to build a simple evaluation framework that could get you setup in less than 30 minutes, &lt;a href="https://www.confident-ai.com/blog/how-to-evaluate-llm-applications" rel="noopener noreferrer"&gt;click here.&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, you've learnt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what a vector database is and how to use ChromaDB&lt;/li&gt;
&lt;li&gt;how to use the raw OpenAI API to build a RAG-based chatbot without relying on third-party frameworks&lt;/li&gt;
&lt;li&gt;what OCR is and how to use Azure's OCR services&lt;/li&gt;
&lt;li&gt;how to quickly set up a beautiful chatbot UI, including a file uploader, using Streamlit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tutorial walked you through an example of how you can build a "chat with PDF" application using just Azure OCR, OpenAI, and ChromaDB. With what you've learnt, you can build powerful applications that help increase the productivity of workforces (at least, that's the most prominent use case I've come across).&lt;/p&gt;

&lt;p&gt;The source code for this tutorial is available here:&lt;br&gt;
&lt;a href="https://github.com/confident-ai/blog-examples/tree/main/chat-with-pdf" rel="noopener noreferrer"&gt;https://github.com/confident-ai/blog-examples/tree/main/chat-with-pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>programming</category>
      <category>chatgpt</category>
      <category>python</category>
    </item>
    <item>
      <title>Building a customer support chatbot using GPT-3.5 and LlamaIndex🚀</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Tue, 19 Sep 2023 07:18:42 +0000</pubDate>
      <link>https://dev.to/confidentai/building-a-customer-support-chatbot-using-gpt-35-and-llamaindex-3d1l</link>
      <guid>https://dev.to/confidentai/building-a-customer-support-chatbot-using-gpt-35-and-llamaindex-3d1l</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;In this article, you'll learn how to create a customer support chatbot using GPT-3.5 and LlamaIndex. Also, stay tuned for bonus tips and tricks on how to evaluate your chatbot at the end of this article :)&lt;/p&gt;

&lt;h1&gt;
  
  
  A small request 🥺
&lt;/h1&gt;

&lt;p&gt;I produce weekly content and your support would really help me continue. Please support me and my company &lt;a href="https://www.confident-ai.com" rel="noopener noreferrer"&gt;Confident AI&lt;/a&gt; by starring our GitHub library. We're building a platform to unit test your chatbot. Thank you very very much! ❤️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introducing the OpenAI API and LlamaIndex
&lt;/h1&gt;

&lt;p&gt;In this tutorial, we're going to use GPT-3.5 provided by the OpenAI API. GPT-3.5 is a machine learning model and is like a super-smart computer buddy made by OpenAI. It's been trained with tons of data from the internet so it can chat, answer questions, and help with all sorts of language tasks.&lt;/p&gt;

&lt;p&gt;But, you might wonder, can raw, out-of-the-box GPT-3.5 answer customer support questions that are specific to my own internal data?&lt;/p&gt;

&lt;p&gt;Unfortunately, the answer is no 😔 because as you may know, GPT models have only been trained on public data up until 2021. This is precisely why we need open-source frameworks like LlamaIndex! These frameworks connect your internal data sources to GPT-3.5, so your chatbot can output tailored responses based on data that regular ChatGPT doesn't know about 😊&lt;/p&gt;

&lt;p&gt;Pretty cool, huh? Let's begin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/QJvwBSGaoc4eI/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/QJvwBSGaoc4eI/giphy.gif" width="500" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Project Setup 🚀
&lt;/h1&gt;

&lt;p&gt;First, I'll guide you through how to set up a project for your chatbot.&lt;/p&gt;

&lt;p&gt;Create the project folder and a python virtual environment by running the code below in terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir customer-support-chatbot
cd customer-support-chatbot
python3 -m venv venv
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your terminal prompt should now look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(venv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing dependencies
&lt;/h2&gt;

&lt;p&gt;Run the following command to install LlamaIndex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install llama-index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we don't need to install &lt;code&gt;openai&lt;/code&gt; separately because LlamaIndex already provides a wrapper that calls the OpenAI API under the hood.&lt;/p&gt;

&lt;p&gt;Create a new &lt;code&gt;main.py&lt;/code&gt; file - the entry point to your chatbot, and &lt;code&gt;chatbot.py&lt;/code&gt; - your chatbot implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch main.py chatbot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting up your internal knowledge base
&lt;/h2&gt;

&lt;p&gt;Create a new &lt;code&gt;data.txt&lt;/code&gt; file in a new &lt;code&gt;data&lt;/code&gt; folder; it will contain fake data on MadeUpCompany:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir data
cd data
touch data.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file will contain the data that your chatbot is going to base its responses on. Luckily for us, ChatGPT prepared some fake information on MadeUpCompany 😌 Paste the following text in &lt;code&gt;data.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;About MadeUpCompany
MadeUpCompany is a pioneering technology firm founded in 2010, specializing in cloud computing, data analytics, and machine learning. Our headquarters is based in San Francisco, California, with satellite offices spread across New York, London, and Tokyo. We are committed to offering state-of-the-art solutions that help businesses and individuals achieve their full potential. With a diverse team of experts from various industries, we strive to redefine the boundaries of innovation and efficiency.

Products and Services
We offer a suite of services ranging from cloud storage solutions, data analytics platforms, to custom machine learning models tailored for specific business needs. Our most popular product is CloudMate, a cloud storage solution designed for businesses of all sizes. It offers seamless data migration, top-tier security protocols, and an easy-to-use interface. Our data analytics service, DataWiz, helps companies turn raw data into actionable insights using advanced algorithms.

Pricing
We have a variety of pricing options tailored to different needs. Our basic cloud storage package starts at $9.99 per month, with premium plans offering more storage and functionalities. We also provide enterprise solutions on a case-by-case basis, so it’s best to consult with our sales team for customized pricing.

Technical Support
Our customer support team is available 24/7 to assist with any technical issues. We offer multiple channels for support including live chat, email, and a toll-free number. Most issues are typically resolved within 24 hours. We also have an extensive FAQ section on our website and a community forum for peer support.

Security and Compliance
MadeUpCompany places the utmost importance on security and compliance. All our products are GDPR compliant and adhere to the highest security standards, including end-to-end encryption and multi-factor authentication.

Account Management
Customers can easily manage their accounts through our online portal, which allows you to upgrade your service, view billing history, and manage users in your organization. If you encounter any issues or have questions about your account, our account management team is available weekdays from 9 AM to 6 PM.

Refund and Cancellation Policy
We offer a 30-day money-back guarantee on all our products. If you're not satisfied for any reason, you can request a full refund within the first 30 days of your purchase. After that, you can still cancel your service at any time, but a prorated refund will be issued based on the remaining term of your subscription.

Upcoming Features
We’re constantly working to improve our services and offer new features. Keep an eye out for updates on machine learning functionalities in DataWiz and more collaborative tools in CloudMate in the upcoming quarters.

Your customer support staff can use these paragraphs to build their responses to customer inquiries, providing both detailed and precise information to address various questions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, navigate back to &lt;code&gt;customer-support-chatbot&lt;/code&gt; containing &lt;code&gt;main.py&lt;/code&gt;, and set your OpenAI API key as an environment variable. In your terminal, paste the following, substituting your own API key (get yours &lt;a href="https://openai.com/blog/openai-api" rel="noopener noreferrer"&gt;here&lt;/a&gt; if you don't already have one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export OPENAI_API_KEY="your-api-key-here"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All done! Let's start coding.&lt;/p&gt;

&lt;h1&gt;
  
  
  Building a Chatbot with LlamaIndex 🦄
&lt;/h1&gt;

&lt;p&gt;To begin, we first have to chunk and index the text in &lt;code&gt;data.txt&lt;/code&gt; into a format that's readable for GPT-3.5. You might wonder, what do I mean by "readable"? 🤯&lt;/p&gt;

&lt;p&gt;Well, GPT-3.5 has something called a context limit, which refers to how much text the model can "see" or consider at one time. Think of it like the model's short-term memory. If you give it a really long paragraph or a big conversation history, it might reach its limit and not be able to add much more to it. If you hit this limit, you might have to shorten your text so the model can understand and respond properly. &lt;/p&gt;

&lt;p&gt;In addition, GPT-3.5 performs worse when you supply it with too much text, kind of like how someone loses focus when you tell too long a story. This is exactly where LlamaIndex shines 🦄 LlamaIndex helps us break down large bodies of text into chunks that GPT-3.5 can consume 🥳&lt;/p&gt;
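&lt;p&gt;For intuition, chunking boils down to something like the naive word-count splitter below. LlamaIndex's real splitters are smarter (they respect sentence boundaries and add overlap between chunks), so treat this as a sketch of the idea only:&lt;/p&gt;

```python
def chunk_text(text, max_words=200):
    """Naively split text into chunks of at most max_words words.
    A toy stand-in for LlamaIndex's text splitters, which also
    respect sentence boundaries and overlap adjacent chunks."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

print(chunk_text("one two three four five six seven", max_words=3))
# ['one two three', 'four five six', 'seven']
```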

&lt;p&gt;&lt;a href="https://i.giphy.com/media/o1i0XbXsqd4CALDbnj/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/o1i0XbXsqd4CALDbnj/giphy.gif" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a few lines of code, we can build our chatbot using LlamaIndex. Everything from chunking the text in &lt;code&gt;data.txt&lt;/code&gt; to calling the OpenAI APIs is handled by LlamaIndex. Paste the following code into &lt;code&gt;chatbot.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

def query(user_input):
    return query_engine.query(user_input).response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the following in &lt;code&gt;main.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from chatbot import query

while True:
    user_input = input("Enter your question: ")
    response = query(user_input)
    print("Bot response:", response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now try it for yourself by running the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feel free to switch out the text in &lt;code&gt;data/data.txt&lt;/code&gt; with your own knowledge base!&lt;/p&gt;

&lt;h1&gt;
  
  
  Improving your Chatbot
&lt;/h1&gt;

&lt;p&gt;You might start to run into situations where the chatbot isn't performing as well as you hope for certain questions/inputs. Luckily there are several ways to improve your chatbot 😊&lt;/p&gt;

&lt;h2&gt;
  
  
  Parsing your data into smaller/bigger chunks
&lt;/h2&gt;

&lt;p&gt;The quality of your chatbot's output is directly affected by the size of your text chunks (scroll down for a better explanation of why).&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;chatbot.py&lt;/code&gt;, create &lt;code&gt;service_context = ServiceContext.from_defaults(chunk_size=1000)&lt;/code&gt; and pass it to &lt;code&gt;VectorStoreIndex&lt;/code&gt; to alter the chunk size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(chunk_size=1000)
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()

def query(user_input):
    return query_engine.query(user_input).response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Play around with the size parameter to find what works best :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Providing more context to GPT-3.5
&lt;/h2&gt;

&lt;p&gt;Depending on your data, you might benefit from supplying fewer or more text chunks to GPT-3.5. Here's how you can do it via &lt;code&gt;query_engine = index.as_query_engine(similarity_top_k=5)&lt;/code&gt; in &lt;code&gt;chatbot.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(chunk_size=1000)
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=5)

def query(user_input):
    return query_engine.query(user_input).response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Evaluating your chatbot ⚖️
&lt;/h1&gt;

&lt;p&gt;By now, you might have run into the problem of eyeballing your chatbot's output. You make a small configuration change, such as changing the number of retrieved text chunks, run &lt;code&gt;main.py&lt;/code&gt;, type in the same old query, and wait five seconds to see if the result has gotten any better 😰 Sound familiar? &lt;/p&gt;

&lt;p&gt;The problem becomes worse if you want to inspect outputs from not just one, but several different queries. &lt;a href="https://www.confident-ai.com/blog/how-to-evaluate-llm-applications" rel="noopener noreferrer"&gt;Here is a great read on how you can build your own evaluation framework in less than 20 minutes&lt;/a&gt;, but if you'd rather not reinvent the wheel, consider using a free open-source package like &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;DeepEval&lt;/a&gt;. It helps you evaluate your chatbot so you don't have to do it yourself 😌&lt;/p&gt;

&lt;p&gt;Since I'm slightly biased as the cofounder of Confident AI (which is the company behind DeepEval), I'm going to go ahead and show you how DeepEval can help with evaluating your chatbot (no but seriously, we offer unit testing for chatbots, have a stellar developer experience, and a &lt;a href="https://app.confident-ai.com" rel="noopener noreferrer"&gt;free platform&lt;/a&gt; for you to holistically view your chatbot's performance 🥵)&lt;/p&gt;

&lt;p&gt;Install it by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install deepeval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a new test file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch test_chatbot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste in the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pytest
from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.test_case import LLMTestCase
from deepeval.run_test import assert_test
from chatbot import query

def test_1():
    input = "What does your company do?"
    output = query(input)
    context = "Our company specializes in cloud computing, data analytics, and machine learning. We offer a range of services including cloud storage solutions, data analytics platforms, and custom machine learning models."
    factual_consistency_metric = FactualConsistencyMetric(minimum_score=0.7)
    test_case = LLMTestCase(output=output, context=context)
    assert_test(test_case, [factual_consistency_metric])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepeval test run test_chatbot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your test should have passed! Let's break down what happened. The variable &lt;code&gt;input&lt;/code&gt; mimics a user input, and &lt;code&gt;output&lt;/code&gt; is what your chatbot returns for this query. The variable &lt;code&gt;context&lt;/code&gt; contains the relevant information from your knowledge base, and &lt;code&gt;FactualConsistencyMetric(minimum_score=0.7)&lt;/code&gt; is an out-of-the-box metric provided by DeepEval for you to assess how factually correct your chatbot's output is relative to the provided context. The score ranges from 0 to 1, and &lt;code&gt;minimum_score=0.7&lt;/code&gt; ultimately determines whether your test has passed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/aNbGyHcDYphNbhe4EE/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/aNbGyHcDYphNbhe4EE/giphy.gif" width="480" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add more tests and stop wasting time manually checking for breaking changes to your chatbot 😙&lt;/p&gt;

&lt;h1&gt;
  
  
  How does your chatbot work under the hood?
&lt;/h1&gt;

&lt;p&gt;The chatbot we just built actually relies on an architecture called &lt;strong&gt;Retrieval Augmented Generation (RAG)&lt;/strong&gt;. Retrieval Augmented Generation is a way to make GPT-3.5 smarter by letting it pull in fresh or specific information from an outside source, in this case &lt;code&gt;data.txt&lt;/code&gt;. So, when you ask it something, it can give you a more current and relevant answer.&lt;/p&gt;
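&lt;p&gt;Stripped of the libraries, the whole RAG loop is three steps: retrieve the chunks most related to the question, augment the prompt with them, and generate. Here's a toy end-to-end sketch in which simple keyword overlap stands in for embedding search and a placeholder &lt;code&gt;generate()&lt;/code&gt; stands in for the GPT-3.5 call (the knowledge-base strings are made up for illustration):&lt;/p&gt;

```python
KNOWLEDGE_BASE = [
    "MadeUpCompany was founded in 2010 in San Francisco.",
    "CloudMate is a cloud storage product for businesses.",
    "The basic cloud storage plan starts at 9.99 dollars per month.",
]

def retrieve(question, k=2):
    """Toy retriever: rank chunks by word overlap with the question.
    The real chatbot uses embedding similarity via LlamaIndex instead."""
    q_words = set(question.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda chunk: len(q_words.intersection(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt):
    """Placeholder for the actual GPT-3.5 API call."""
    return "(model answer based on a " + str(len(prompt)) + "-char prompt)"

def rag_answer(question):
    context = "\n".join(retrieve(question))                        # Retrieval
    prompt = "Context:\n" + context + "\n\nQuestion: " + question  # Augmentation
    return generate(prompt)                                        # Generation

print(retrieve("What does the basic cloud storage plan cost per month?")[0])
```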

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdah2071nsip7cvlpl5j7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdah2071nsip7cvlpl5j7.png" alt="RAG architecture" width="800" height="1519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the previous two sections, we looked at how tweaking two parameters (text chunk size and the number of text chunks used) can impact the quality of the answers you get from GPT-3.5. This is because when you ask your chatbot a question, LlamaIndex &lt;strong&gt;retrieves&lt;/strong&gt; the most relevant text chunks from &lt;code&gt;data.txt&lt;/code&gt;, which GPT-3.5 then uses to &lt;strong&gt;generate&lt;/strong&gt; a data-&lt;strong&gt;augmented&lt;/strong&gt; answer.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, you've learnt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what OpenAI's GPT-3.5 is,&lt;/li&gt;
&lt;li&gt;how to build a simple chatbot on your own data using LlamaIndex,&lt;/li&gt;
&lt;li&gt;how to improve the quality of your chatbot,&lt;/li&gt;
&lt;li&gt;how to evaluate your chatbot using DeepEval,&lt;/li&gt;
&lt;li&gt;what RAG is and how it works.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tutorial walked you through an example of a chatbot you can build using LlamaIndex and GPT-3.5. With LlamaIndex, you can create powerful personalized chatbots useful in various applications, such as customer support, user onboarding, sales enablement, and more 🥳&lt;/p&gt;

&lt;p&gt;The source code for this tutorial is available here:&lt;br&gt;
&lt;a href="https://github.com/confident-ai/blog-examples/tree/main/customer-support-chatbot" rel="noopener noreferrer"&gt;https://github.com/confident-ai/blog-examples/tree/main/customer-support-chatbot&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>chatgpt</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Generating synthetic data with LLMs - Part 1</title>
      <dc:creator>Jeffrey Ip</dc:creator>
      <pubDate>Tue, 12 Sep 2023 10:53:54 +0000</pubDate>
      <link>https://dev.to/confidentai/generating-synthetic-data-with-llms-part-1-52gn</link>
      <guid>https://dev.to/confidentai/generating-synthetic-data-with-llms-part-1-52gn</guid>
      <description>&lt;p&gt;The ability to use AI to generate data out of thin air is one of those things that seems too good to be true. Think about it: you can get your hands on quality data without needing to manually collect, clean, and annotate massive datasets.&lt;/p&gt;

&lt;p&gt;But, as you might expect, synthetic data is not without its caveats 😔. Although it is convenient, efficient, and cost-effective, the quality of synthetic data is only as good as the method used to generate it. Settle for rudimentary methods, and you’ll end up with unusable datasets that don’t represent real-world data well 🤯. &lt;/p&gt;

&lt;p&gt;In this article, I’m going to share how we managed to generate realistic textual synthetic data at &lt;a href="https://confident-ai.com" rel="noopener noreferrer"&gt;Confident AI&lt;/a&gt;. Let's dive right into it 😊.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is synthetic data?
&lt;/h2&gt;

&lt;p&gt;First and foremost, synthetic data is artificially generated data that attempts to simulate real-world data 🤖. Unlike real-world data, which is collected from observations or actual events (e.g., tweets on the platform formerly known as Twitter 🐦), synthetic data is made up, sometimes entirely, but more commonly based on a small subset of real-world data (also known as data augmentation). &lt;/p&gt;

&lt;p&gt;This kind of data is often used for testing, training, and validating machine learning models, especially in scenarios where real-world data is scarce or difficult to collect ✅.&lt;/p&gt;
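&lt;p&gt;As a toy illustration of the augmentation idea (the seed questions and paraphrase templates below are made up for this example), a handful of real queries can be expanded into synthetic variants:&lt;/p&gt;

```python
import random

# Toy sketch of data augmentation: expand a few real seed questions
# into synthetic variants using paraphrase templates. A real pipeline
# would use an LLM for the paraphrasing step, as discussed below.

TEMPLATES = [
    "Can you tell me {q}",
    "I was wondering, {q}",
    "Quick question: {q}",
]

def augment(seeds, n_variants=2, seed=0):
    rng = random.Random(seed)  # seeded for reproducible output
    variants = []
    for question in seeds:
        body = question.rstrip("?").lower()
        for template in rng.sample(TEMPLATES, n_variants):
            variants.append(template.format(q=body + "?"))
    return variants

seeds = ["How is synthetic data generated?", "Is synthetic data reliable?"]
for variant in augment(seeds):
    print(variant)
```

Template-based augmentation like this is cheap but shallow; it only rephrases the surface form, which is exactly the limitation that LLM-based generation addresses.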

&lt;h2&gt;
  
  
  The struggles in generating textual data
&lt;/h2&gt;

&lt;p&gt;Historically, while the demand for synthetic data continued to rise steadily over the years, advancements in generation methods struggled to keep pace. &lt;/p&gt;

&lt;p&gt;Methods available at the time were often simplistic, perhaps relying on basic statistical methods, or they were too domain-specific and hard to generalize, meaning they lacked the complexity to mimic real-world data in a meaningful way❌.&lt;/p&gt;

&lt;p&gt;Let’s take Generative Adversarial Networks (GANs) as an example. GANs employed a novel architecture of two neural networks — a generator and a discriminator — that competed with each other. The competition between these two networks resulted in the generation of highly realistic and complex synthetic data. &lt;/p&gt;

&lt;p&gt;However, as one might have guessed from the title of this article, there were still major drawbacks when leveraging GANs to generate textual data 😔. &lt;/p&gt;

&lt;p&gt;To list a few, here are the major pitfalls that GANs suffer from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mode Collapse:&lt;/strong&gt; A phenomenon where the generator starts to produce the same output (or very similar outputs) over and over again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficult to Train:&lt;/strong&gt; GANs are notoriously hard to train, with issues like vanishing/exploding gradients and oscillations in loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-Range Dependencies:&lt;/strong&gt; Textual data often involve long-range dependencies (e.g., the subject of a sentence affecting a verb that appears much later), and capturing these effectively is a challenge even for advanced GAN architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Very Needy:&lt;/strong&gt; They require lots of data to train on (ironically). &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Needless to say, there are a lot of hurdles to overcome when it comes to textual data. Let’s cut to the chase and see why you should use LLMs instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating Synthetic Data with LLMs
&lt;/h2&gt;

&lt;p&gt;Love it or hate it, large language models (LLMs) like GPT-4 have democratized textual synthetic data generation. Let’s say I want to generate some questions related to the topic of synthetic data. All I have to do is use either ChatGPT or OpenAI’s API. For example, here’s how you can do it in Python (note: I’m using GPT-3.5):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {
      "role": "user",
      "content": "Generate 5 questions one might have on synthetic data."
    }
  ],
  temperature=1,
  max_tokens=256
)

print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s a sample output 👀:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. What are the key advantages of using synthetic data over real-world data in machine learning models?

2. How is synthetic data generated and how closely can it mimic the characteristics of original datasets?

3. Is synthetic data reliable for training machine learning models in sensitive sectors like healthcare or finance?

4. What are the ethical considerations associated with using synthetic data, especially when it is used to replace or supplement personally identifiable information?

5. Can synthetic data be used to address the challenges of data imbalance in machine learning, and if so, how effective is it compared to traditional resampling techniques?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the generated data is quite varied, it may not accurately reflect real-world conditions, making it less useful for certain applications. Fortunately, by carefully crafting the input prompts, we can improve the authenticity of the synthetic data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Dynamic Prompt Templates to Make Synthetic Data Realistic
&lt;/h2&gt;

&lt;p&gt;The pervasive problem with synthetic data generation is that there’s often a mismatch between the generative distribution and the distribution of real-world data. However, due to the versatile and adaptable nature of LLMs, we can easily ground generated data by dynamically changing the prompt (basically string interpolation!) 🥳. For example, you might want to wrap the OpenAI API call in a function instead and make it accept additional context as a parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

def generate_synthetic_data(context):
  prompt = f""" 
    Generate 5 questions one might have on synthetic data. In your questions, also take into account the context below.
    Context: {context}
  """

  response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
      {
        "role": "user",
        "content": prompt
      }
    ],
    temperature=1,
    max_tokens=256
  )

  return response.choices[0].message.content

print(generate_synthetic_data("Data Privacy is a huge concern for enterprises."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s a sample output (don’t forget we’re using GPT-3.5!):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. How can synthetic data help enterprises address data privacy concerns while still maintaining the ability to perform data analytics and testing?

2. What are the key differences between real data and synthetic data in terms of their privacy implications for enterprises?

3. Are there any legal or regulatory considerations that enterprises should be aware of when using synthetic data to safeguard data privacy?

4. How can enterprises ensure that synthetic data accurately represents their real data while preserving the privacy of sensitive information?

5. What are the potential limitations or challenges that enterprises may face when implementing synthetic data solutions to protect data privacy, and how can they mitigate these challenges effectively?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the output is much more relevant than the earlier iteration generated without dynamic prompting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we explored ways to contextualize synthetic data effectively. LLMs like GPT-3.5 offer a simple yet powerful way of generating data through careful prompt design. &lt;/p&gt;

&lt;p&gt;Stay tuned for our Part 2 guide on diversifying your synthetic data set!&lt;/p&gt;

&lt;p&gt;(Thanks for reading my first article! Follow me on Twitter to track my journey building Confident AI: &lt;a href="https://twitter.com/jeffr_yyy" rel="noopener noreferrer"&gt;https://twitter.com/jeffr_yyy&lt;/a&gt;, and come &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;give our GitHub repo a star ⭐&lt;/a&gt;!) &lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
