<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Athina AI</title>
    <description>The latest articles on DEV Community by Athina AI (@athina).</description>
    <link>https://dev.to/athina</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8606%2Fe0ca08ce-8d91-4bc7-ac46-84025bf1af6d.png</url>
      <title>DEV Community: Athina AI</title>
      <link>https://dev.to/athina</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/athina"/>
    <language>en</language>
    <item>
      <title>Detect LLM Hallucinations in CI/CD</title>
      <dc:creator>Himanshu Bamoria</dc:creator>
      <pubDate>Mon, 08 Apr 2024 00:39:11 +0000</pubDate>
      <link>https://dev.to/athina/detect-llm-hallucinations-in-ci-cd-594f</link>
      <guid>https://dev.to/athina/detect-llm-hallucinations-in-ci-cd-594f</guid>
      <description>&lt;p&gt;&lt;strong&gt;A Guide to evaluate your RAG pipelines using GitHub Actions + Athina / Ragas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you've ever worked on coding projects, you know how important it is to make sure your code is solid before showing it to the world.&lt;/p&gt;

&lt;p&gt;That's where CI/CD pipelines come into play. They're like your coding safety net, catching bugs and problems automatically.&lt;/p&gt;

&lt;p&gt;So why not have the same process for your LLM pipeline?&lt;/p&gt;

&lt;p&gt;The best teams implement an evaluation step as part of the CI/CD system for their RAG pipelines.&lt;/p&gt;

&lt;p&gt;This makes a lot of sense - LLMs are unpredictable at best, and tiny changes in your prompt or retrieval system can throw your whole application out of whack.&lt;/p&gt;

&lt;p&gt;Athina can help you detect mistakes and hallucinations in your RAG pipeline with a really simple integration. We're going to walk you through how to set this up using GitHub Actions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;You can use Athina evals in your CI/CD pipeline to catch regressions before they get to production.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here is a guide for setting up athina-evals in your CI/CD pipeline.&lt;br&gt;
All code described here is also present in our &lt;a href="https://github.com/athina-ai/athina-evals-ci/"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  GitHub Actions
&lt;/h2&gt;

&lt;p&gt;We're going to use GitHub Actions to create our CI/CD pipelines. GitHub Actions allow us to define workflows that are triggered by events (pull request, push, etc.) and execute a series of actions.&lt;/p&gt;

&lt;p&gt;Our GitHub Actions are defined under our repository's &lt;code&gt;.github/workflows&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;We have defined a workflow for the evals too. The workflow file is named &lt;code&gt;athina_ci.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The workflow is triggered on every push to the &lt;code&gt;main&lt;/code&gt; branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: CI with Athina Evals

on:
  push:
    branches:
      - main

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt  # Install project dependencies
          pip install athina  # Install Athina Evals

      - name: Run Athina Evaluation and Validation Script
        run: python -m evaluations.run_athina_evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ATHINA_API_KEY: ${{ secrets.ATHINA_API_KEY }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
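&lt;p&gt;Before relying on the workflow, you can run the same evaluation step locally. A sketch, assuming the two API keys are exported in your shell:&lt;/p&gt;

```shell
# Export the same keys the workflow reads from GitHub Secrets
export OPENAI_API_KEY="sk-..."
export ATHINA_API_KEY="your-athina-key"

# Run the evaluation script exactly as the workflow does
python -m evaluations.run_athina_evals
```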



&lt;h2&gt;
  
  
  Athina Evals Script
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;run_athina_evals.py&lt;/code&gt; script is the entry point for our Athina evals. It is a simple script that uses the Athina Evals SDK to evaluate and validate the RAG application.&lt;/p&gt;

&lt;p&gt;For example, we are testing whether the response from the RAG application answers the query, using the &lt;code&gt;DoesResponseAnswerQuery&lt;/code&gt; evaluation from Athina.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval_model = "gpt-3.5-turbo"
df = DoesResponseAnswerQuery(model=eval_model).run_batch(data=dataset).to_df()

# Validation: Check if all rows in the dataframe passed the evaluation
all_passed = df['passed'].all()

if not all_passed:
    failed_responses = df[~df['passed']]
    print(f"Failed Responses: {failed_responses}")
    raise ValueError("Not all responses passed the evaluation.")
else:
    print("All responses passed the evaluation.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
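&lt;p&gt;The validation above is plain pandas on the dataframe returned by &lt;code&gt;to_df()&lt;/code&gt;. A minimal sketch of the same pass/fail logic, using a hypothetical results frame:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical eval results: one row per (query, response) pair
df = pd.DataFrame({
    "query": ["q1", "q2", "q3"],
    "passed": [True, False, True],
})

all_passed = df["passed"].all()       # False here: one row failed
failed_responses = df[~df["passed"]]  # just the failing rows
```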



&lt;p&gt;You can also load a golden dataset and run the evaluation on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open('evaluations/golden_dataset.jsonl', 'r') as file:
  raw_data = file.read().split('\n')
  data = []
  for item in raw_data:
    item = json.loads(item)
    item['context'], item['response'] = app.generate_response(item['query'])
    data.append(item)
You can also run a suite of evaluations on the dataset.
eval_model = "gpt-3.5-turbo"
eval_suite = [
  DoesResponseAnswerQuery(model=eval_model),
  Faithfulness(model=eval_model),
  ContextContainsEnoughInformation(model=eval_model),
]


# Run the evaluation suite
batch_eval_result = EvalRunner.run_suite(
  evals=eval_suite,
  data=dataset,
  max_parallel_evals=2
)

# Validate the batch_eval_results as you want.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
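&lt;p&gt;For reference, the loader above expects &lt;code&gt;golden_dataset.jsonl&lt;/code&gt; to contain one JSON object per line, each with at least a &lt;code&gt;query&lt;/code&gt; field. A minimal sketch (the query values here are made up):&lt;/p&gt;

```python
import json

# Two example lines, as they might appear in golden_dataset.jsonl
lines = [
    '{"query": "What does Athina do?"}',
    '{"query": "How are evals triggered in CI?"}',
]

# Parse each non-empty line into a dict, mirroring the loader's loop
records = [json.loads(line) for line in lines if line.strip()]
```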



&lt;h2&gt;
  
  
  Secrets
&lt;/h2&gt;

&lt;p&gt;We are using GitHub Secrets to store our API keys. &lt;br&gt;
We have two secrets, &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; and &lt;code&gt;ATHINA_API_KEY&lt;/code&gt;.&lt;br&gt;
You can add these secrets to your repository by navigating to &lt;code&gt;Settings&lt;/code&gt; &amp;gt; &lt;code&gt;Secrets&lt;/code&gt; &amp;gt; &lt;code&gt;New repository secret&lt;/code&gt;.&lt;/p&gt;
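&lt;p&gt;If you prefer the command line, the same secrets can be set with the GitHub CLI (assuming &lt;code&gt;gh&lt;/code&gt; is installed and authenticated; the key values are placeholders):&lt;/p&gt;

```shell
# Add the two repository secrets used by the workflow
gh secret set OPENAI_API_KEY --body "sk-..."
gh secret set ATHINA_API_KEY --body "your-athina-key"
```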

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;p&gt;We have more examples and details in our &lt;a href="https://github.com/athina-ai/athina-evals-ci/"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alright, we've covered how to add Athina to your CI/CD pipeline with GitHub Actions. With this simple modification, you can make sure your AI is top-notch before it goes live.&lt;/p&gt;

&lt;p&gt;If you're interested in continuous monitoring and evaluation of your AI in production, we can help.&lt;/p&gt;

&lt;p&gt;Watch this &lt;a href="https://bit.ly/athina-demo-feb-2024"&gt;demo video&lt;/a&gt; of Athina's platform, and feel free to &lt;a href="https://cal.com/shiv-athina/30min"&gt;schedule a call with us&lt;/a&gt; if you're interested in setting up safety nets for your LLM.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
