<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Firoj Alam</title>
    <description>The latest articles on DEV Community by Firoj Alam (@firojalam04).</description>
    <link>https://dev.to/firojalam04</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2381714%2F43bac449-44e0-4860-8469-81a209f74927.jpg</url>
      <title>DEV Community: Firoj Alam</title>
      <link>https://dev.to/firojalam04</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/firojalam04"/>
    <language>en</language>
    <item>
      <title>Benchmarking LLMs Made Easy with LLMeBench</title>
      <dc:creator>Firoj Alam</dc:creator>
      <pubDate>Fri, 28 Feb 2025 09:50:03 +0000</pubDate>
      <link>https://dev.to/firojalam04/benchmarking-llms-made-easy-with-llmebench-4eef</link>
      <guid>https://dev.to/firojalam04/benchmarking-llms-made-easy-with-llmebench-4eef</guid>
      <description>&lt;p&gt;🔹 Are you evaluating Large Language Models (LLMs) for your NLP tasks?&lt;br&gt;
🔹 Do you need a flexible, scalable framework that supports multiple providers?&lt;/p&gt;

&lt;p&gt;Look no further—LLMeBench is here!&lt;/p&gt;

&lt;h2&gt;What is LLMeBench?&lt;/h2&gt;

&lt;p&gt;LLMeBench is an open-source benchmarking framework designed to help researchers and developers evaluate LLMs across different tasks, providers, and languages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom89duly0lsz1vm0uq5s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom89duly0lsz1vm0uq5s.png" alt="LLMeBench" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With LLMeBench 1.1.0, we’ve added:&lt;/p&gt;

&lt;p&gt;✅ Expanded modality support (text, vision, multimodal tasks)&lt;br&gt;
✅ More evaluation metrics for precise comparisons&lt;br&gt;
✅ Improved dataset integration for smoother benchmarking&lt;/p&gt;

&lt;p&gt;🔗 GitHub Repo → &lt;a href="https://github.com/qcri/LLMeBench" rel="noopener noreferrer"&gt;github.com/qcri/LLMeBench&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;💡 Why Benchmarking LLMs is Important&lt;/h2&gt;

&lt;p&gt;The rapid rise of GPT-4, BLOOMZ, Falcon, and LLaMA has created a need for systematic performance evaluation. LLMs behave differently across tasks, datasets, and languages, making standardized benchmarking essential for:&lt;/p&gt;

&lt;p&gt;📌 Model Comparison → Which LLM performs best for a specific task?&lt;br&gt;
📌 Cost &amp;amp; Latency Analysis → Is an LLM efficient for real-world deployment?&lt;br&gt;
📌 Fairness &amp;amp; Bias Detection → Does the model exhibit language-specific biases?&lt;/p&gt;

&lt;p&gt;LLMeBench addresses these challenges with a structured benchmarking approach that supports various model providers, such as:&lt;br&gt;
🟢 OpenAI (GPT models)&lt;br&gt;
🟢 Hugging Face Inference API&lt;br&gt;
🟢 Azure AI models&lt;br&gt;
🟢 Models deployed through vLLM&lt;/p&gt;
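
&lt;p&gt;Under the hood, each provider maps to a model class in the llmebench.models package that a benchmarking configuration points to. The imports below are a sketch following the repository's naming conventions; verify the exact class names against the release you have installed.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative only: provider backends live in llmebench.models.
# These class names follow the repo's conventions; double-check them
# against the installed version before relying on them.
from llmebench.models import OpenAIModel                   # OpenAI GPT models
from llmebench.models import HuggingFaceInferenceAPIModel  # Hugging Face Inference API
from llmebench.models import AzureModel                    # Azure AI deployments
from llmebench.models import VLLMModel                     # models served through vLLM
&lt;/code&gt;&lt;/pre&gt;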

&lt;h2&gt;Getting Started with LLMeBench&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Install LLMeBench:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install 'llmebench[fewshot]'&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download the current assets:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;pre&gt;&lt;code&gt;python -m llmebench assets download&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will fetch assets and place them in the current working directory.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Download one of the datasets, e.g., ArSAS:&lt;/li&gt;
&lt;/ol&gt;

&lt;pre&gt;&lt;code&gt;python -m llmebench data download ArSAS&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will download the dataset into a data folder inside the current working directory.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Evaluate!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, to evaluate the performance of a random baseline for sentiment analysis on the ArSAS dataset, run:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;python -m llmebench --filter 'sentiment/ArSAS_Random*' assets/ results/&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This uses the ArSAS_Random "asset": a file that specifies the dataset, model, and task to evaluate. Here, ArSAS_Random is the asset name, combining the ArSAS dataset with the Random baseline model, and assets/ar/sentiment_emotion_others/sentiment/ is the directory containing the benchmarking asset for sentiment analysis on the Arabic ArSAS dataset. Results will be saved in a directory called results/.&lt;/p&gt;
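
&lt;p&gt;For the curious, here is roughly what such an asset file looks like. This is a minimal sketch assuming the config()/prompt()/post_process() layout LLMeBench assets follow; class names and arguments are illustrative, so check the shipped ArSAS_Random asset in the repo for the exact code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of an LLMeBench asset. Names and arguments are
# illustrative; see the actual ArSAS_Random asset for the exact code.
from llmebench.datasets import ArSASDataset
from llmebench.models import RandomModel
from llmebench.tasks import SentimentTask


def config():
    return {
        "dataset": ArSASDataset,   # which dataset to load
        "task": SentimentTask,     # which task (and its metrics) to run
        "model": RandomModel,      # which model/provider backend to call
        "model_args": {
            # the random baseline needs the label set to sample from
            "class_labels": ["Positive", "Negative", "Neutral", "Mixed"],
        },
    }


def prompt(input_sample):
    # Build the request sent to the model for one sample; a random
    # baseline ignores the text, so passing it through suffices here.
    return {"input": input_sample}


def post_process(response):
    # Map the raw model response back to a label the task can score.
    return response
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Swapping RandomModel for a provider-backed class is how the same asset targets a different backend, such as an OpenAI-hosted model.&lt;/p&gt;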

&lt;ol start="5"&gt;
&lt;li&gt;View the Results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;LLMeBench generates a performance report with:&lt;/p&gt;

&lt;p&gt;📊 Accuracy&lt;br&gt;
⏳ Response time&lt;br&gt;
📈 Task-specific metrics&lt;/p&gt;
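
&lt;p&gt;If you want to poke at the raw output programmatically, a small helper like the one below works. Note that the exact file names and layout under results/ are assumptions here; the sketch simply walks the directory for JSON files and pretty-prints whatever it finds.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hedged helper: walk results/ and print any JSON summaries found.
# File names and directory layout are assumptions; adapt the path
# and pattern to what LLMeBench actually wrote on your machine.
import json
from pathlib import Path

for path in sorted(Path("results").rglob("*.json")):
    with path.open(encoding="utf-8") as f:
        summary = json.load(f)
    print(path)
    print(json.dumps(summary, indent=2, ensure_ascii=False))
&lt;/code&gt;&lt;/pre&gt;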

&lt;h3&gt;🎯 Why Use LLMeBench?&lt;/h3&gt;

&lt;p&gt;✔ Works with any NLP model &amp;amp; dataset&lt;br&gt;
✔ Supports multiple providers (OpenAI, HF, Azure, Petals)&lt;br&gt;
✔ Handles multimodal &amp;amp; multilingual benchmarking&lt;br&gt;
✔ Saves time &amp;amp; effort in evaluation&lt;/p&gt;

&lt;h3&gt;⭐ Join the Community &amp;amp; Contribute&lt;/h3&gt;

&lt;p&gt;We’re excited to see researchers &amp;amp; developers using LLMeBench for their benchmarking needs! 🚀&lt;/p&gt;

&lt;p&gt;🔗 Try LLMeBench today: &lt;a href="https://github.com/qcri/LLMeBench" rel="noopener noreferrer"&gt;github.com/qcri/LLMeBench&lt;/a&gt;&lt;br&gt;
⭐ If you find it useful, give us a star on GitHub!&lt;/p&gt;

&lt;p&gt;💬 Have feedback or feature requests? Open an issue or PR; we’d love to hear from you!&lt;/p&gt;

&lt;h3&gt;💡 What’s Next?&lt;/h3&gt;

&lt;p&gt;We’re constantly improving LLMeBench with new features &amp;amp; optimizations. Stay tuned for:&lt;br&gt;
✅ More task-specific benchmarking modules&lt;br&gt;
✅ Fine-grained evaluation for multilingual models&lt;br&gt;
✅ Support for additional model providers&lt;/p&gt;

&lt;p&gt;🔥 If you’re working with LLMs and benchmarking, we’d love to hear how LLMeBench can help your workflow! Drop a comment below or connect with us on GitHub! 🚀✨&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
