Samad Yar Khan for Middleware

LLAMA 3.1 vs GPT4: Which is smarter for analytics?

Introduction

Middleware is a platform that enables engineering leaders to derive actionable insights from data and improve processes, making dev teams more efficient. With the rapid pace of progress in AI, we have been continuously integrating ML models across the product, with the goal of turning that data into actionable insights.

After some experimentation, we found that the open-source LLAMA and Mistral models we wanted to use were good, but GPT-4o was more reliable on data-centric problems. We therefore moved in the more sophisticated direction of building RAG pipelines and using function calling.

All this changed when Meta dropped the LLAMA 3.1 models. The 70B and 405B models are among the best open-source models out there and compete neck and neck with GPT-4o. So we decided to integrate AI-powered DORA reports as an experimental effort and see how GPT-4o and LLAMA 3.1 perform when it comes to data analysis and reasoning.

Background

DORA metrics provide critical insights into the performance and reliability of software delivery processes.


1) Lead Time for Changes

  • Lead time consists of First Commit to PR Open time, First Response Time, Rework Time, Merge Time, and Merge to Deploy Time.

2) Deployment Frequency

  • This metric gauges how frequently code changes are deployed to production.

3) Mean Time to Recover (MTTR)

  • MTTR measures how swiftly a team can restore service after a failure occurs in production.
  • The team's average incident resolution time is used to compute its MTTR.

4) Change Failure Rate (CFR)

  • CFR quantifies the percentage of changes that result in a service impairment or outage in production, aiding in the evaluation of deployment process stability and reliability.
  • CFR is computed by linking incidents to deployments within an interval; each deployment may have several incidents or none (minimal sketches of the CFR and lead-time computations follow this list).
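
To make these concrete, here are minimal sketches of how the CFR and lead-time computations above can look in code. The field names (`incidents`, `first_response_time`, and so on) are illustrative assumptions, not Middleware's actual schema:

```python
from datetime import timedelta

def change_failure_rate(deployments: list[dict]) -> float:
    """CFR: percentage of deployments linked to at least one incident.

    Each deployment is assumed to carry an "incidents" list; deployments
    with no incidents still count towards the denominator.
    """
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.get("incidents"))
    return 100.0 * failed / len(deployments)

def lead_time(pr: dict) -> timedelta:
    """Lead time: the sum of the five segments listed above.

    Field names here are illustrative, not Middleware's schema.
    """
    return (
        pr["first_commit_to_open"]
        + pr["first_response_time"]
        + pr["rework_time"]
        + pr["merge_time"]
        + pr["merge_to_deploy"]
    )
```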

You can learn more about DORA metrics here. By leveraging advanced LLMs, we aim to automate the analysis of these metrics, providing teams with deeper and more actionable insights.

Objectives

  • To integrate LLMs into Middleware for the analysis of DORA metrics.
  • To compare the performance of different large language models in terms of:
    • Mathematical Accuracy: How well can it calculate the DORA score?
    • Data Analysis: Can the LLM analyse the input data and derive correct inferences?
    • Summarising: How well can the model summarise data?
    • Actionability: How well can the models suggest an action plan based on the input data?

Implementation

Data Processing: Middleware to the Rescue

  • Middleware syncs all your data from different sources and calculates the DORA Metrics for your teams.
  • Check out middlewarehq/middleware and set up the dev server using Docker.


Model Integration: Fireworks AI and OpenAI

  • We integrated OpenAI's GPT-4o and the LLAMA 3.1 (70B and 405B) models.
  • The GPT-4o integration uses the official OpenAI API under the hood, while the Fireworks AI APIs are used to integrate the 70B and 405B LLAMA 3.1 models.
  • These AI analytics are powered by the AIAnalyticsService in the analytics server. This service can be extended to use more closed-source models from OpenAI or open-source models via Fireworks AI (a rough sketch of the routing idea follows this list).

  • Changes on the front end introduce components and BFF logic allowing users to enter their token, choose a large language model, and generate AI reports for their DORA metrics.

  • Whenever the user tries to generate AI analysis, the UI makes a POST request to the BFF API: internal/ai/dora_metrics with all the preprocessed DORA Metrics and trends data.

  • This BFF API internally calls multiple analytics APIs with the DORA metrics and trends data, which in turn generate the analysis based on the processed data and the curated prompts.

  • Finally, the analysis for each individual metric trend is fed back into the LLM for summarisation, and all the data is sent to the front end.
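
The actual service code is in the pull request linked below; as a rough sketch of the routing idea: Fireworks AI exposes an OpenAI-compatible API, so a single client path can serve both providers. The helper and model table here are illustrative assumptions, not Middleware's actual AIAnalyticsService:

```python
from openai import OpenAI

# Fireworks AI exposes an OpenAI-compatible API, so only the base URL
# and the model name differ between the two providers.
MODELS = {
    "gpt-4o": ("https://api.openai.com/v1", "gpt-4o"),
    "llama-3.1-70b": (
        "https://api.fireworks.ai/inference/v1",
        "accounts/fireworks/models/llama-v3p1-70b-instruct",
    ),
    "llama-3.1-405b": (
        "https://api.fireworks.ai/inference/v1",
        "accounts/fireworks/models/llama-v3p1-405b-instruct",
    ),
}

def generate_analysis(model_key: str, token: str, prompt: str) -> str:
    """Send one curated prompt to the chosen model and return its reply."""
    base_url, model = MODELS[model_key]
    client = OpenAI(api_key=token, base_url=base_url)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```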

More implementation details can be found in this pull request.

Evaluation and Results: GPT-4o vs LLAMA 3.1

We ran the DORA AI analysis for July on the following open-source repositories: facebook/react, middlewarehq/middleware, meta-llama/llama and facebookresearch/dora.

Mathematical Accuracy

  • Middleware generated a DORA performance score for each team based on this guide by dora.dev.
  • To test each model's computational accuracy, we provide it with the four key metrics, prompt the LLM to generate a DORA score, and compare the result with Middleware's (a simplified version of that prompt is sketched below).
  • The four key metrics were passed as a JSON object of the following format:

```json
{
    "lead_time": 4000,
    "mean_time_to_recovery": 200000,
    "change_failure_rate": 20,
    "weekly_deployment_frequency": 2
}
```
  • The actual DORA score for the repositories was around 5. While OpenAI's GPT-4o was able to predict the score to be 4-5 most of the time, LLAMA 3.1 405B was often a margin away.
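
The scoring request boils down to something like the snippet below. The prompt text is a simplified stand-in for the curated prompts in the pull request, reusing the hypothetical generate_analysis helper sketched earlier:

```python
import json

four_keys = {
    "lead_time": 4000,
    "mean_time_to_recovery": 200000,
    "change_failure_rate": 20,
    "weekly_deployment_frequency": 2,
}

# Simplified stand-in for the curated scoring prompt.
prompt = (
    "Given a team's four key DORA metrics (times in seconds), use the "
    "performance levels from dora.dev to score the team from 1 to 10 "
    "and explain your reasoning.\n" + json.dumps(four_keys, indent=2)
)

score_analysis = generate_analysis("gpt-4o", "<YOUR_API_TOKEN>", prompt)
```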

DORA Metrics score: 5/10

GPT-4o with DORA score 5/10

LLAMA 3.1 with DORA Score 8/10 (incorrect)

GPT-4o's DORA score was closer to the actual DORA score than LLAMA 3.1's in 9/10 cases, so GPT-4o was more accurate than LLAMA 3.1 in this scenario.

Data Analysis

  • The trend data for the four key DORA metrics, calculated by Middleware, was fed to the LLMs as input along with different experimental prompts to ensure concrete data analysis.
  • The trend data is usually a JSON object with date strings as keys, each representing a week's start date, mapped to that week's metric data:

```json
{
    "2024-01-01": {
        ...
    },
    "2024-01-08": {
        ...
    }
}
```
  • Mapping Data: Both models were on par at extracting data from the JSON and interpreting it correctly. Example: both GPT-4o and LLAMA were able to map the correct data to the input weeks without errors or hallucinations.

    Deployment Trends Summarised: GPT4o

    Deployment Trends Summarised: LLAMA 3.1 405B

  • Extracting Inferences: Both models were able to derive solid inferences from the data.

    • LLAMA 3.1 identified the week with the maximum lead time, along with the reason for the high lead time.
    • This inference could be verified against the Middleware trend charts.
    • GPT-4o was also able to extract the week with the maximum lead time, and the reason too: high first-response time.
  • Data Presentation: Data presentation has been hit or miss with LLMs. There are cases where GPT-4o presents data better but lags behind LLAMA 3.1 in accuracy, and there have been cases, like the DORA score, where GPT-4o did the math better.

    • LLAMA and GPT were both given the lead time value in seconds. LLAMA rounded the value closer to the actual 16.99 days, while GPT-4o rounded it to 17 days 2 hours but presented the data in a more detailed format (the same kind of extraction and rounding is sketched below).

GPT-4o

LLAMA 3.1 405B
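
The sketch below shows the kind of extraction and rounding being compared here: finding the worst week in a trend and humanizing a duration given in seconds. The trend values are illustrative, not taken from the repositories above:

```python
lead_time_trend = {  # illustrative weekly values, in seconds
    "2024-07-01": {"lead_time": 1_467_936},  # ~16.99 days
    "2024-07-08": {"lead_time": 864_000},    # 10 days
}

# The inference the models were checked on: which week was worst?
worst_week = max(lead_time_trend, key=lambda w: lead_time_trend[w]["lead_time"])

def humanize(seconds: int) -> str:
    """Round a duration in seconds to 'X days Y hours'."""
    days, remainder = divmod(seconds, 86_400)
    hours = remainder // 3_600
    return f"{days} days {hours} hours"

print(worst_week, humanize(lead_time_trend[worst_week]["lead_time"]))
# 2024-07-01 16 days 23 hours
```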

Actionability

  • The models produced similar action items for improving team efficiency based on all the metrics.
  • Example: Both models identified the reason for high lead time to be first-response time and suggested that the team use an alerting tool to avoid delayed PR reviews. The models also suggested better planning to avoid rework in a week where rework was high.

GPT-4o

LLAMA 3.1 405B

Summarisation

To test the models' summarisation capabilities, we asked each model to summarise each metric trend individually, then fed the outputs for all the trends back into the LLM to get an overall summary, or, in internet slang, a DORA TL;DR for the team.

Both LLMs showed similar summarisation capability on large inputs (the two-stage flow is sketched below).
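
A minimal sketch of that two-stage flow, again reusing the hypothetical generate_analysis helper from earlier:

```python
import json  # generate_analysis() is the helper sketched earlier

def dora_tldr(trends: dict[str, dict], model_key: str, token: str) -> str:
    """Stage 1: summarise each metric trend individually.
    Stage 2: condense those summaries into one DORA TL;DR."""
    per_metric = [
        generate_analysis(
            model_key, token,
            f"Summarise this weekly {name} trend: {json.dumps(data)}",
        )
        for name, data in trends.items()
    ]
    return generate_analysis(
        model_key, token,
        "Combine these per-metric summaries into a short DORA TL;DR:\n"
        + "\n".join(f"- {summary}" for summary in per_metric),
    )
```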

LLAMA 3.1 405B

GPT4o

Conclusion

For a long time, LLAMA was playing catch-up with GPT in data processing and analytical ability. Our earlier experimentation with older LLAMA models led us to believe that GPT was way ahead, but the recent LLAMA 3.1 405B model is on par with GPT-4o.

If you value your customers' data privacy and want to try the open-source LLAMA 3.1 models instead of GPT-4, go ahead! The difference in performance is negligible, and you can ensure data privacy by self-hosting the models. Open-source LLMs have finally started to compete with their closed-source counterparts.

Both LLAMA 3.1 and GPT-4o are highly capable of deriving inferences from processed data, making Middleware's DORA metrics more actionable and digestible for engineering leaders and leading to more efficient teams.

Future Work

This was an experiment in building an AI-powered DORA solution; in the future we will focus on adding greater support for self-hosted or locally running LLMs in Middleware. Enhanced support for AI-powered action plans throughout the product using self-hosted LLMs, while ensuring data privacy, will be our goal for the coming months.

In the meantime, you can try out the AI DORA summary feature here.

middlewarehq / middleware

✨ Open-source DORA metrics platform for engineering teams ✨


Open-source engineering management that unlocks developer potential


Join our Engineering Leaders Community


Introduction

Middleware is an open-source tool designed to help engineering leaders measure and analyze the effectiveness of their teams using the DORA metrics. The DORA metrics are a set of four key values that provide insights into software delivery performance and operational efficiency.

They are:

  • Deployment Frequency: The frequency of code deployments to production or an operational environment.
  • Lead Time for Changes: The time it takes for a commit to make it into production.
  • Mean Time to Restore: The time it takes to restore service after an incident or failure.
  • Change Failure Rate: The percentage of deployments that result in failures or require remediation.

Top comments (12)

Shivam Chhuneja

This was quite an interesting read, Samad.

Looking forward to how this gets integrated here

Samad Yar Khan

Yup! We will soon add more exciting AI features.

Jayant Bhawal

I tried out the dev setup from the link shared in the post.

To me, GPT didn't work as well as the post claims. It somehow scored me in the 3-4/10 range, with Llama scoring 7-8/10 where the control score appears to be ~6.

I saw the prompt in the code. I think we could come up with something better.

Interesting stuff nonetheless. If the repo wasn't shared, I might have thought this was bait. 🤣

Samad Yar Khan

If you can come up with a better PR, would be open to that 😅

Dhruv Agarwal

Great article, Samad! OSS ftw! 🚀

Llama 3.1 has been wonderful and makes OpenAI bite dust 💨

Samad Yar Khan

OpenAI still dominates the LLM space 👀 but this is a great leap for Meta and open-sourced LLMs🔥

Adnan Hashmi

Great article Samad!
It is great that we get similar performance as GPT 4o with the added benefit of data privacy.

Andrew Yuan

Thank you for sharing this! 🌟
Llama 3.1 is truly impressive and a game-changer! 🚀

Samad Yar Khan

Yes, it definitely is. Can't wait to see how it performs with the tool calling!

Ayush Thakur

This is a highly informative read

Samad Yar Khan

Glad you liked it!

Eshaan

Great review of the LLMs, honest and tried hands-on! It's interesting how you broke down your experience into computation and cognition.