<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Himanshu Bamoria</title>
    <description>The latest articles on DEV Community by Himanshu Bamoria (@hbamoria).</description>
    <link>https://dev.to/hbamoria</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1234252%2Fabb1e17a-77c7-4ef5-be03-0840791dfdfd.jpg</url>
      <title>DEV Community: Himanshu Bamoria</title>
      <link>https://dev.to/hbamoria</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hbamoria"/>
    <language>en</language>
    <item>
      <title>Detect LLM Hallucinations in CI/CD</title>
      <dc:creator>Himanshu Bamoria</dc:creator>
      <pubDate>Mon, 08 Apr 2024 00:39:11 +0000</pubDate>
      <link>https://dev.to/athina/detect-llm-hallucinations-in-ci-cd-594f</link>
      <guid>https://dev.to/athina/detect-llm-hallucinations-in-ci-cd-594f</guid>
      <description>&lt;p&gt;&lt;strong&gt;A guide to evaluating your RAG pipelines using GitHub Actions + Athina / Ragas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you've ever worked on coding projects, you know how important it is to make sure your code is solid before showing it to the world.&lt;/p&gt;

&lt;p&gt;That's where CI/CD pipelines come into play. They're like your coding safety net, catching bugs and problems automatically.&lt;/p&gt;

&lt;p&gt;So why not have the same process for your LLM pipeline?&lt;/p&gt;

&lt;p&gt;The best teams implement an evaluation step in their CI/CD pipelines for their RAG systems.&lt;/p&gt;

&lt;p&gt;This makes a lot of sense - LLMs are unpredictable at best, and tiny changes in your prompt or retrieval system can throw your whole application out of whack.&lt;/p&gt;

&lt;p&gt;Athina can help you detect mistakes and hallucinations in your RAG pipeline with a really simple integration. We're going to walk you through how to set this up using GitHub Actions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;You can use Athina evals in your CI/CD pipeline to catch regressions before they get to production.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here is a guide to setting up athina-evals in your CI/CD pipeline.&lt;br&gt;
All code described here is also present in our &lt;a href="https://github.com/athina-ai/athina-evals-ci/"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  GitHub Actions
&lt;/h2&gt;

&lt;p&gt;We're going to use GitHub Actions to create our CI/CD pipelines. GitHub Actions allow us to define workflows that are triggered by events (pull request, push, etc.) and execute a series of actions.&lt;/p&gt;

&lt;p&gt;Our GitHub Actions are defined under our repository's &lt;code&gt;.github/workflows&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;We have defined a workflow for the evals as well; the workflow file is named &lt;code&gt;athina_ci.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The workflow is triggered on every push to the &lt;code&gt;main&lt;/code&gt; branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: CI with Athina Evals

on:
  push:
    branches:
      - main

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt  # Install project dependencies
          pip install athina  # Install Athina Evals

      - name: Run Athina Evaluation and Validation Script
        run: python -m evaluations.run_athina_evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ATHINA_API_KEY: ${{ secrets.ATHINA_API_KEY }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
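
&lt;p&gt;A small optional variation, if you also want the evals to run on pull requests before they are merged (this is standard GitHub Actions trigger syntax, not something specific to Athina):&lt;/p&gt;

```yaml
# Run the eval job on pushes to main and on pull requests targeting main.
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
```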



&lt;h2&gt;
  
  
  Athina Evals Script
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;run_athina_evals.py&lt;/code&gt; script is the entry point for our Athina evals. It is a simple script that uses the Athina Evals SDK to evaluate and validate the RAG application.&lt;/p&gt;

&lt;p&gt;For example, we are testing whether the response from the RAG application answers the query, using the &lt;code&gt;DoesResponseAnswerQuery&lt;/code&gt; evaluation from Athina.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval_model = "gpt-3.5-turbo"
df = DoesResponseAnswerQuery(model=eval_model).run_batch(data=dataset).to_df()

# Validation: Check if all rows in the dataframe passed the evaluation
all_passed = df['passed'].all()

if not all_passed:
    failed_responses = df[~df['passed']]
    print(f"Failed Responses: {failed_responses}")
    raise ValueError("Not all responses passed the evaluation.")
else:
    print("All responses passed the evaluation.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
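
&lt;p&gt;For context, here is a hypothetical sketch of the dataset shape the snippet above assumes: each entry pairs a query with the retrieved context and the generated response. The field names mirror the evals used in this post; check the Athina docs for the exact schema.&lt;/p&gt;

```python
# Hypothetical example dataset; field names are assumptions based on the
# evals used in this post, not the authoritative schema.
dataset = [
    {
        "query": "What is the capital of France?",
        "context": "France is a country in Europe. Its capital is Paris.",
        "response": "The capital of France is Paris.",
    },
]

# Sanity-check the fields before handing the list to an evaluator.
required = {"query", "context", "response"}
for row in dataset:
    missing = required - row.keys()
    assert not missing, f"row missing fields: {missing}"
print("dataset looks well-formed")
```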



&lt;p&gt;You can also load a golden dataset and run the evaluation on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open('evaluations/golden_dataset.jsonl', 'r') as file:
  raw_data = file.read().split('\n')
  data = []
  for item in raw_data:
    item = json.loads(item)
    item['context'], item['response'] = app.generate_response(item['query'])
    data.append(item)
You can also run a suite of evaluations on the dataset.
eval_model = "gpt-3.5-turbo"
eval_suite = [
  DoesResponseAnswerQuery(model=eval_model),
  Faithfulness(model=eval_model),
  ContextContainsEnoughInformation(model=eval_model),
]


# Run the evaluation suite
batch_eval_result = EvalRunner.run_suite(
  evals=eval_suite,
  data=dataset,
  max_parallel_evals=2
)

# Validate the batch_eval_results as you want.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
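
&lt;p&gt;The validation step is left open. A minimal sketch, assuming the suite result converts to a dataframe with a boolean &lt;code&gt;passed&lt;/code&gt; column (mirroring the single-eval example earlier; adapt to the actual return type), might look like:&lt;/p&gt;

```python
import pandas as pd

# Stand-in for batch_eval_result.to_df(); the to_df() conversion and the
# 'passed' column are assumptions borrowed from the single-eval example.
df = pd.DataFrame({
    "query": ["q1", "q2"],
    "passed": [True, True],
})

failed = df[~df["passed"]]
if not failed.empty:
    print(f"{len(failed)} responses failed evaluation")
    raise SystemExit(1)  # a non-zero exit code fails the CI job
print("All evaluations passed.")
```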



&lt;h2&gt;
  
  
  Secrets
&lt;/h2&gt;

&lt;p&gt;We are using GitHub Secrets to store our API keys. &lt;br&gt;
We have two secrets, &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; and &lt;code&gt;ATHINA_API_KEY&lt;/code&gt;.&lt;br&gt;
You can add these secrets to your repository by navigating to &lt;code&gt;Settings&lt;/code&gt; &amp;gt; &lt;code&gt;Secrets&lt;/code&gt; &amp;gt; &lt;code&gt;New repository secret&lt;/code&gt;.&lt;/p&gt;
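
&lt;p&gt;If you prefer the command line, the same secrets can be added with the GitHub CLI (assuming &lt;code&gt;gh&lt;/code&gt; is installed and authenticated for the repository):&lt;/p&gt;

```shell
# gh prompts for each secret's value interactively,
# or you can pipe the value in via stdin / --body.
gh secret set OPENAI_API_KEY
gh secret set ATHINA_API_KEY

# List configured secrets to confirm (values are never shown).
gh secret list
```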

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;p&gt;We have more examples and details in our &lt;a href="https://github.com/athina-ai/athina-evals-ci/"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alright, we've covered how to add Athina to your CI/CD pipeline with GitHub Actions - with this simple modification, you can ensure your AI is top-notch before it goes live.&lt;/p&gt;

&lt;p&gt;If you're interested in continuous monitoring and evaluation of your AI in production, we can help.&lt;/p&gt;

&lt;p&gt;Watch this &lt;a href="https://bit.ly/athina-demo-feb-2024"&gt;demo video&lt;/a&gt; of Athina's platform, and feel free to &lt;a href="https://cal.com/shiv-athina/30min"&gt;schedule a call with us&lt;/a&gt; if you're interested in setting up safety nets for your LLM.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Athina AI: Monitor &amp; Evaluate LLM Outputs in 5 Minutes!</title>
      <dc:creator>Himanshu Bamoria</dc:creator>
      <pubDate>Fri, 05 Jan 2024 09:52:35 +0000</pubDate>
      <link>https://dev.to/hbamoria/athina-ai-monitor-evaluate-llm-outputs-in-5mins-2d15</link>
      <guid>https://dev.to/hbamoria/athina-ai-monitor-evaluate-llm-outputs-in-5mins-2d15</guid>
      <description>&lt;h4&gt;
  
  
  TL;DR: Athina helps you monitor and evaluate your LLM-powered app. Plug-and-play evals in production. 5-minute setup.
&lt;/h4&gt;




&lt;p&gt;👋 Hey everyone! We’re thrilled to announce the launch of Athina AI, a suite of tools for LLM developers to ship and develop AI products with confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Athina AI?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2wqescoj1r6qt2bco35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2wqescoj1r6qt2bco35.png" alt="Athina Monitoring Dashboard" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Athina AI is a &lt;strong&gt;Monitoring &amp;amp; Evaluation platform&lt;/strong&gt; for LLM developers. &lt;/p&gt;

&lt;p&gt;Developers use Athina’s evaluation framework and production monitoring platform to improve the performance and reliability of AI applications through real-time monitoring, analytics, and automatic evaluations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It is difficult to measure the quality of Generative AI responses.&lt;/li&gt;
&lt;li&gt;Eyeballing production responses is tough.&lt;/li&gt;
&lt;li&gt;No easy way to detect unreliable or bad outputs (especially in production).&lt;/li&gt;
&lt;li&gt;Low visibility into LLM touchpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM developers typically have to build lots of in-house infrastructure for monitoring and evaluation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Athina AI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quick Setup&lt;/strong&gt;: &lt;a href="https://docs.athina.ai/logging/log_via_api"&gt;Get started&lt;/a&gt; in just 5 minutes! The entire integration is 1 simple &lt;code&gt;POST&lt;/code&gt; request &lt;em&gt;(and we don’t interfere with your LLM calls)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Monitoring Platform&lt;/strong&gt;: Full visibility into your LLM touchpoints. Search, sort, filter, compare, debug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prebuilt Evaluations&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;You can configure automatic evaluations in just a few clicks - use one of our preset evals or define a custom eval.&lt;/li&gt;
&lt;li&gt;These evals will run against logged inferences automatically.&lt;/li&gt;
&lt;li&gt;You can also use our &lt;strong&gt;&lt;a href="http://github.com/athina-ai/athina-evals"&gt;open-source library&lt;/a&gt;&lt;/strong&gt; to run evals and iterate rapidly during development.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granular Analytics&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Tracks usage metrics like response time, cost, token usage, feedback, and more.&lt;/li&gt;
&lt;li&gt;Athina also tracks metrics from the evals, like Faithfulness, Answer Relevance, Context Sufficiency, etc.&lt;/li&gt;
&lt;li&gt;You can segment these metrics by any property: customer ID, environment, model, prompt, etc.

&lt;ul&gt;
&lt;li&gt;For example, you could use Athina to see how &lt;code&gt;prompt/v4&lt;/code&gt; is performing for customer ID &lt;code&gt;nike-usa&lt;/code&gt;, and how &lt;code&gt;gpt-4&lt;/code&gt; performance compares to a Llama fine-tune.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
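
&lt;p&gt;To make the "1 simple &lt;code&gt;POST&lt;/code&gt; request" claim concrete, here is a hypothetical sketch of what logging an inference looks like. The real endpoint URL, field names, and auth header are defined in the linked docs; everything below is illustrative, not the actual API.&lt;/p&gt;

```python
import json
import urllib.request

# Hypothetical payload; field names are placeholders, not Athina's schema.
payload = {
    "prompt_slug": "prompt/v4",
    "prompt_response": "Retrieval-augmented generation combines ...",
    "language_model_id": "gpt-3.5-turbo",
    "customer_id": "nike-usa",  # any property you want to segment metrics by
}

req = urllib.request.Request(
    "https://example.invalid/api/v1/log_inference",  # placeholder URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"athina-api-key": "YOUR_ATHINA_API_KEY",
             "Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment once endpoint and key are real
print("payload bytes:", len(req.data))
```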

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwkpa3p970r2f46qsqbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwkpa3p970r2f46qsqbs.png" alt="Athina Evaluation Dashboard" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Story
&lt;/h3&gt;

&lt;p&gt;As a team of engineers and hackers, we spent a summer trying to build various LLM-powered applications for developers. &lt;/p&gt;

&lt;p&gt;While working with LLMs, we found that the most challenging part was evaluating the Generative AI output and systematically improving model performance. &lt;/p&gt;

&lt;p&gt;We discovered a major gap in the tools that engineers need to effectively build production grade applications using LLMs, and set out to solve this problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Get Started
&lt;/h3&gt;

&lt;p&gt;Athina AI is a comprehensive suite of tools to supercharge your LLM development lifecycle and help you ship high-performing, reliable AI applications with confidence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🌟 Sign up for an account at &lt;a href="http://app.athina.ai"&gt;app.athina.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Log your inferences using &lt;a href="http://docs.athina.ai/logging/log_via_api"&gt;this guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Try our &lt;a href="http://github.com/athina-ai/athina-evals"&gt;open source evals&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cal.com/shiv-athina/30min"&gt;Schedule&lt;/a&gt; a call with us&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
