<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Himanshu Bamoria</title>
    <description>The latest articles on DEV Community by Himanshu Bamoria (@hbamoria).</description>
    <link>https://dev.to/hbamoria</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1234252%2Fabb1e17a-77c7-4ef5-be03-0840791dfdfd.jpg</url>
      <title>DEV Community: Himanshu Bamoria</title>
      <link>https://dev.to/hbamoria</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hbamoria"/>
    <language>en</language>
    <item>
      <title>Detect LLM Hallucinations in CI/CD</title>
      <dc:creator>Himanshu Bamoria</dc:creator>
      <pubDate>Mon, 08 Apr 2024 00:39:11 +0000</pubDate>
      <link>https://dev.to/athina/detect-llm-hallucinations-in-ci-cd-594f</link>
      <guid>https://dev.to/athina/detect-llm-hallucinations-in-ci-cd-594f</guid>
      <description>&lt;p&gt;&lt;strong&gt;A guide to evaluating your RAG pipelines using GitHub Actions + Athina / Ragas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you've ever worked on coding projects, you know how important it is to make sure your code is solid before showing it to the world.&lt;/p&gt;

&lt;p&gt;That's where CI/CD pipelines come into play. They're like your coding safety net, catching bugs and problems automatically.&lt;/p&gt;

&lt;p&gt;So why not have the same process for your LLM pipeline?&lt;/p&gt;

&lt;p&gt;The best teams implement an evaluation step in their CI/CD pipelines for their RAG systems.&lt;/p&gt;

&lt;p&gt;This makes a lot of sense - LLMs are unpredictable at best, and tiny changes in your prompt or retrieval system can throw your whole application out of whack.&lt;/p&gt;

&lt;p&gt;Athina can help you detect mistakes and hallucinations in your RAG pipeline with a really simple integration. We're going to walk you through how to set this up using GitHub Actions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;You can use Athina evals in your CI/CD pipeline to catch regressions before they get to production.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here is a guide to setting up athina-evals in your CI/CD pipeline.&lt;br&gt;
All code described here is also present in our &lt;a href="https://github.com/athina-ai/athina-evals-ci/"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  GitHub Actions
&lt;/h2&gt;

&lt;p&gt;We're going to use GitHub Actions to create our CI/CD pipelines. GitHub Actions allow us to define workflows that are triggered by events (pull request, push, etc.) and execute a series of actions.&lt;/p&gt;

&lt;p&gt;Our GitHub Actions are defined under our repository's &lt;code&gt;.github/workflows&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;We have defined a workflow for the evals as well; the workflow file is named &lt;code&gt;athina_ci.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The workflow is triggered on every push to the &lt;code&gt;main&lt;/code&gt; branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: CI with Athina Evals

on:
  push:
    branches:
      - main

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt  # Install project dependencies
          pip install athina  # Install Athina Evals

      - name: Run Athina Evaluation and Validation Script
        run: python -m evaluations.run_athina_evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ATHINA_API_KEY: ${{ secrets.ATHINA_API_KEY }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
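
&lt;p&gt;A small optional variation, if you also want the evals to run on pull requests before they are merged (this is standard GitHub Actions trigger syntax, not something specific to Athina):&lt;/p&gt;

```yaml
# Run the eval job on pushes to main and on pull requests targeting main.
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
```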



&lt;h2&gt;
  
  
  Athina Evals Script
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;run_athina_evals.py&lt;/code&gt; script is the entry point for our Athina evals. It is a simple script that uses the Athina Evals SDK to evaluate and validate the RAG application.&lt;/p&gt;

&lt;p&gt;For example, we are testing whether the response from the RAG application answers the query, using the &lt;code&gt;DoesResponseAnswerQuery&lt;/code&gt; evaluation from Athina.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval_model = "gpt-3.5-turbo"
df = DoesResponseAnswerQuery(model=eval_model).run_batch(data=dataset).to_df()

# Validation: Check if all rows in the dataframe passed the evaluation
all_passed = df['passed'].all()

if not all_passed:
    failed_responses = df[~df['passed']]
    print(f"Failed Responses: {failed_responses}")
    raise ValueError("Not all responses passed the evaluation.")
else:
    print("All responses passed the evaluation.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
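
&lt;p&gt;For context, here is a hypothetical sketch of the dataset shape the snippet above assumes: each entry pairs a query with the retrieved context and the generated response. The field names mirror the evals used in this post; check the Athina docs for the exact schema.&lt;/p&gt;

```python
# Hypothetical example dataset; field names are assumptions based on the
# evals used in this post, not the authoritative schema.
dataset = [
    {
        "query": "What is the capital of France?",
        "context": "France is a country in Europe. Its capital is Paris.",
        "response": "The capital of France is Paris.",
    },
]

# Sanity-check the fields before handing the list to an evaluator.
required = {"query", "context", "response"}
for row in dataset:
    missing = required - row.keys()
    assert not missing, f"row missing fields: {missing}"
print("dataset looks well-formed")
```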



&lt;p&gt;You can also load a golden dataset and run the evaluation on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open('evaluations/golden_dataset.jsonl', 'r') as file:
  raw_data = file.read().split('\n')
  data = []
  for item in raw_data:
    item = json.loads(item)
    item['context'], item['response'] = app.generate_response(item['query'])
    data.append(item)
You can also run a suite of evaluations on the dataset.
eval_model = "gpt-3.5-turbo"
eval_suite = [
  DoesResponseAnswerQuery(model=eval_model),
  Faithfulness(model=eval_model),
  ContextContainsEnoughInformation(model=eval_model),
]


# Run the evaluation suite
batch_eval_result = EvalRunner.run_suite(
  evals=eval_suite,
  data=dataset,
  max_parallel_evals=2
)

# Validate the batch_eval_results as you want.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
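
&lt;p&gt;The validation step is left open. A minimal sketch, assuming the suite result converts to a dataframe with a boolean &lt;code&gt;passed&lt;/code&gt; column (mirroring the single-eval example earlier; adapt to the actual return type), might look like:&lt;/p&gt;

```python
import pandas as pd

# Stand-in for batch_eval_result.to_df(); the to_df() conversion and the
# 'passed' column are assumptions borrowed from the single-eval example.
df = pd.DataFrame({
    "query": ["q1", "q2"],
    "passed": [True, True],
})

failed = df[~df["passed"]]
if not failed.empty:
    print(f"{len(failed)} responses failed evaluation")
    raise SystemExit(1)  # a non-zero exit code fails the CI job
print("All evaluations passed.")
```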



&lt;h2&gt;
  
  
  Secrets
&lt;/h2&gt;

&lt;p&gt;We are using GitHub Secrets to store our API keys. &lt;br&gt;
We have two secrets, &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; and &lt;code&gt;ATHINA_API_KEY&lt;/code&gt;.&lt;br&gt;
You can add these secrets to your repository by navigating to &lt;code&gt;Settings&lt;/code&gt; &amp;gt; &lt;code&gt;Secrets&lt;/code&gt; &amp;gt; &lt;code&gt;New repository secret&lt;/code&gt;.&lt;/p&gt;
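
&lt;p&gt;If you prefer the command line, the same secrets can be added with the GitHub CLI (assuming &lt;code&gt;gh&lt;/code&gt; is installed and authenticated for the repository):&lt;/p&gt;

```shell
# gh prompts for each secret's value interactively,
# or you can pipe the value in via stdin / --body.
gh secret set OPENAI_API_KEY
gh secret set ATHINA_API_KEY

# List configured secrets to confirm (values are never shown).
gh secret list
```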

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;p&gt;We have more examples and details in our &lt;a href="https://github.com/athina-ai/athina-evals-ci/"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alright, we've covered how to add Athina to your CI/CD pipeline with GitHub Actions - with this simple modification, you can ensure your AI is top-notch before it goes live.&lt;/p&gt;

&lt;p&gt;If you're interested in continuous monitoring and evaluation of your AI in production, we can help.&lt;/p&gt;

&lt;p&gt;Watch this &lt;a href="https://bit.ly/athina-demo-feb-2024"&gt;demo video&lt;/a&gt; of Athina's platform, and feel free to &lt;a href="https://cal.com/shiv-athina/30min"&gt;schedule a call with us&lt;/a&gt; if you're interested in setting up safety nets for your LLM.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Athina AI: Monitor &amp; Evaluate LLM Outputs in 5 Minutes!</title>
      <dc:creator>Himanshu Bamoria</dc:creator>
      <pubDate>Fri, 05 Jan 2024 09:52:35 +0000</pubDate>
      <link>https://dev.to/hbamoria/athina-ai-monitor-evaluate-llm-outputs-in-5mins-2d15</link>
      <guid>https://dev.to/hbamoria/athina-ai-monitor-evaluate-llm-outputs-in-5mins-2d15</guid>
      <description>&lt;h4&gt;
  
  
  TL;DR: Athina helps you monitor and evaluate your LLM-powered app. Plug-and-play evals in production. 5-minute setup.
&lt;/h4&gt;




&lt;p&gt;👋 Hey everyone! We’re thrilled to announce the launch of Athina AI, a suite of tools for LLM developers to ship and develop AI products with confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Athina AI?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2wqescoj1r6qt2bco35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2wqescoj1r6qt2bco35.png" alt="Athina Monitoring Dashboard" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Athina AI is a &lt;strong&gt;Monitoring &amp;amp; Evaluation platform&lt;/strong&gt; for LLM developers. &lt;/p&gt;

&lt;p&gt;Developers use Athina’s evaluation framework and production monitoring platform to improve the performance and reliability of AI applications through real-time monitoring, analytics, and automatic evaluations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It is difficult to measure the quality of Generative AI responses.&lt;/li&gt;
&lt;li&gt;Eyeballing production responses is tough.&lt;/li&gt;
&lt;li&gt;No easy way to detect unreliable or bad outputs (especially in production).&lt;/li&gt;
&lt;li&gt;Low visibility into LLM touchpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM developers typically have to build lots of in-house infrastructure for monitoring and evaluation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Athina AI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quick Setup&lt;/strong&gt;: &lt;a href="https://docs.athina.ai/logging/log_via_api"&gt;Get started&lt;/a&gt; in just 5 minutes! The entire integration is 1 simple &lt;code&gt;POST&lt;/code&gt; request &lt;em&gt;(and we don’t interfere with your LLM calls)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Monitoring Platform&lt;/strong&gt;: Full visibility into your LLM touchpoints. Search, sort, filter, compare, debug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prebuilt Evaluations&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;You can configure automatic evaluations in just a few clicks - use one of our preset evals or define a custom eval.&lt;/li&gt;
&lt;li&gt;These evals will run against logged inferences automatically.&lt;/li&gt;
&lt;li&gt;You can also use our &lt;strong&gt;&lt;a href="http://github.com/athina-ai/athina-evals"&gt;open-source library&lt;/a&gt;&lt;/strong&gt; to run evals and iterate rapidly during development.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granular Analytics&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Tracks usage metrics like response time, cost, token usage, feedback, and more.&lt;/li&gt;
&lt;li&gt;Athina also tracks metrics from the evals, like Faithfulness, Answer Relevance, Context Sufficiency, etc.&lt;/li&gt;
&lt;li&gt;You can segment these metrics by any property: customer ID, environment, model, prompt, etc.

&lt;ul&gt;
&lt;li&gt;For example, you could use Athina to see how &lt;code&gt;prompt/v4&lt;/code&gt; is performing for customer ID &lt;code&gt;nike-usa&lt;/code&gt;, and how &lt;code&gt;gpt-4&lt;/code&gt; performance compares to a Llama fine-tune.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
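
&lt;p&gt;To make the "1 simple &lt;code&gt;POST&lt;/code&gt; request" claim concrete, here is a hypothetical sketch of what logging an inference looks like. The real endpoint URL, field names, and auth header are defined in the linked docs; everything below is illustrative, not the actual API.&lt;/p&gt;

```python
import json
import urllib.request

# Hypothetical payload; field names are placeholders, not Athina's schema.
payload = {
    "prompt_slug": "prompt/v4",
    "prompt_response": "Retrieval-augmented generation combines ...",
    "language_model_id": "gpt-3.5-turbo",
    "customer_id": "nike-usa",  # any property you want to segment metrics by
}

req = urllib.request.Request(
    "https://example.invalid/api/v1/log_inference",  # placeholder URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"athina-api-key": "YOUR_ATHINA_API_KEY",
             "Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment once endpoint and key are real
print("payload bytes:", len(req.data))
```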

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwkpa3p970r2f46qsqbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwkpa3p970r2f46qsqbs.png" alt="Athina Evaluation Dashboard" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Story
&lt;/h3&gt;

&lt;p&gt;As a team of engineers and hackers, we spent a summer trying to build various LLM-powered applications for developers. &lt;/p&gt;

&lt;p&gt;While working with LLMs, we found that the most challenging part was evaluating the Generative AI output and systematically improving model performance. &lt;/p&gt;

&lt;p&gt;We discovered a major gap in the tools that engineers need to effectively build production grade applications using LLMs, and set out to solve this problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Get Started
&lt;/h3&gt;

&lt;p&gt;Athina AI is a comprehensive suite of tools to supercharge your LLM development lifecycle and help you ship high-performing, reliable AI applications with confidence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🌟 Sign up for an account at &lt;a href="http://app.athina.ai"&gt;app.athina.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Log your inferences using &lt;a href="http://docs.athina.ai/logging/log_via_api"&gt;this guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Try our &lt;a href="http://github.com/athina-ai/athina-evals"&gt;open source evals&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cal.com/shiv-athina/30min"&gt;Schedule&lt;/a&gt; a call with us&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
