<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Francesco Mattia</title>
    <description>The latest articles on DEV Community by Francesco Mattia (@fr4ncis).</description>
    <link>https://dev.to/fr4ncis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1407967%2Fdc4ddc77-787c-4212-a07e-b9421c308423.jpeg</url>
      <title>DEV Community: Francesco Mattia</title>
      <link>https://dev.to/fr4ncis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fr4ncis"/>
    <language>en</language>
    <item>
      <title>Supercharging Language Models: What I Learned Testing LLMs with Tools</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Sun, 04 May 2025 20:32:20 +0000</pubDate>
      <link>https://dev.to/fr4ncis/supercharging-language-models-what-i-learned-testing-llms-with-tools-28pl</link>
      <guid>https://dev.to/fr4ncis/supercharging-language-models-what-i-learned-testing-llms-with-tools-28pl</guid>
      <description>&lt;p&gt;LLMs are great at creative writing and language tasks, but they often stumble on basic knowledge retrieval and math. Popular tests like counting r's in "strawberry" or doing simple arithmetic often trip them up. This is where tools come into the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use tools with LLMs?
&lt;/h2&gt;

&lt;p&gt;Simply put, we're giving LLMs capabilities they don't naturally have, helping them deliver better answers.&lt;/p&gt;
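&lt;p&gt;In practice, a "tool" is just a function schema the model can ask the client to call. As an illustration (the tool name and fields here are my own, not taken from a specific library), a calculator tool in the widely used OpenAI-style function format looks like this:&lt;/p&gt;

```javascript
// Illustrative calculator tool definition (names are my own) in the
// OpenAI-style function schema that most tool-calling APIs accept.
const calculatorTool = {
  type: "function",
  function: {
    name: "calculator",
    description: "Evaluate an arithmetic expression and return the result",
    parameters: {
      type: "object",
      properties: {
        expression: { type: "string", description: "e.g. '100 * 1.03 ** 3'" },
      },
      required: ["expression"],
    },
  },
};
```

&lt;p&gt;The model never executes anything itself: it emits a call like &lt;code&gt;{"name": "calculator", "arguments": {"expression": "..."}}&lt;/code&gt;, your code runs it, and the result goes back into the conversation.&lt;/p&gt;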

&lt;h2&gt;
  
  
  Four Surprising Things I Found While Testing
&lt;/h2&gt;

&lt;p&gt;I ran some tests with local models on Ollama and noticed some interesting patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Even the Best Models Get Math Wrong Sometimes
&lt;/h3&gt;

&lt;p&gt;I tested various models with this straightforward financial question: "I have initially 100 USD in an account that gives 3.42% interest/year for the first 2 years then switches to a 3% interest/year. How much will I have after 5 years?"&lt;/p&gt;

&lt;p&gt;The correct answer is &lt;code&gt;100 * 1.0342^2 * 1.03^3 = 116.8747624&lt;/code&gt;.&lt;/p&gt;
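&lt;p&gt;For reference, the whole calculation fits in two lines - exactly the kind of deterministic arithmetic a calculator tool would hand back:&lt;/p&gt;

```javascript
// Compound interest from the prompt: 3.42%/year for the first 2 years,
// then 3%/year for the remaining 3 years, starting from 100 USD.
const balance = 100 * Math.pow(1.0342, 2) * Math.pow(1.03, 3);
console.log(balance.toFixed(7)); // prints "116.8747624"
```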

&lt;p&gt;What surprised me was that top models like Gemini 2.5 Pro and GPT-4o "understood" the approach but messed up the actual calculations. Gemini calculated &lt;code&gt;106.954164 * 1.092727 = 116.881&lt;/code&gt; - close, but not quite right.&lt;/p&gt;

&lt;p&gt;This is a good reminder to double-check LLM calculations, especially for important decisions like financial planning.&lt;/p&gt;

&lt;p&gt;Interestingly, even a small local model like Qwen3 4B could nail this when given a calculator tool - showing that the right tools can make a huge difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Tools Can Supercharge Performance
&lt;/h3&gt;

&lt;p&gt;The difference tools make is pretty dramatic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen3 4B without tools: Spent 12 minutes thinking only to get the wrong answer (somehow turning 100 USD into 1000 USD)&lt;/li&gt;
&lt;li&gt;Same model with a calculator: Got the right answer in just over 2 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I saw similar improvements across other local models like Llama 3.2 3B and 3.3 70B. I assume we'd reach the same conclusions with cloud-based LLMs.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) LLMs Can Fill in the Gaps When Tools Fall Short
&lt;/h3&gt;

&lt;p&gt;What's fascinating is how LLMs handle imperfect tool results. I experimented by simulating tool calls and controlling what they returned - often deliberately giving back information that wasn't quite what the LLM requested.&lt;/p&gt;

&lt;p&gt;For example, when I gave an LLM a weather tool that only showed temperatures in Fahrenheit but asked for Celsius, it just did the conversion itself without missing a beat.&lt;/p&gt;

&lt;p&gt;In another experiment, I simulated returning interest calculations for the wrong time period (e.g., 3 years instead of 5). The LLM recognized the mismatch and tried to adapt the information to solve the original problem. Sometimes it would request additional calculations, and other times it would attempt to extrapolate from what it received.&lt;/p&gt;

&lt;p&gt;These experiments show that LLMs don't just blindly use tool outputs - they evaluate the results, determine if they're helpful, and find ways to work with what they have. This adaptability makes tools even more powerful, as they don't need to be perfectly aligned with every request.&lt;/p&gt;
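&lt;p&gt;A minimal sketch of that harness (the tool name and payload here are hypothetical, not the exact ones from my tests): intercept the model's tool call and return a canned, deliberately mismatched result.&lt;/p&gt;

```javascript
// Hypothetical harness: instead of executing a real tool, return a canned
// result that deliberately mismatches the request - Fahrenheit instead of
// the Celsius the model asked for.
function simulateToolCall(toolCall) {
  if (toolCall.name === "get_weather") {
    return { role: "tool", content: JSON.stringify({ temperature: 86, unit: "F" }) };
  }
  throw new Error(`unknown tool: ${toolCall.name}`);
}

const reply = simulateToolCall({
  name: "get_weather",
  arguments: { city: "London", unit: "C" },
});
// A capable model converts on its own: (86 - 32) * 5 / 9 = 30 C.
console.log(reply.content);
```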

&lt;h3&gt;
  
  
  4) Smaller Models Don't Always Use Tools When They Should
&lt;/h3&gt;

&lt;p&gt;You might expect smaller models with limited knowledge to eagerly embrace tools as a crutch, but my testing revealed something quite different and fascinating.&lt;/p&gt;

&lt;p&gt;The tiniest model I tested, Qwen 0.6B, was surprisingly stubborn about using its own capabilities. Even when explicitly told about available tools that could help solve a problem, it consistently tried to work things out on its own - often with poor results. It's almost as if it lacked the self-awareness to recognise its own limitations.&lt;/p&gt;

&lt;p&gt;Llama 3.2 3B showed a different pattern. It attempted to use tools, showing it recognised the need for external help, but applied them incorrectly. For instance, when trying to solve our compound interest problem, it would call the calculator tool but input the wrong formula or misinterpret the results.&lt;/p&gt;

&lt;p&gt;Larger models seem to be more reliable in their calculations - sometimes rightfully so, but other times that confidence was misplaced and they still made errors. It still makes sense to use tools to ground answers in a deterministic output.&lt;/p&gt;

&lt;p&gt;This pattern suggests that effective tool use might not emerge naturally in smaller models - it may require specific fine-tuning to teach them when and how to leverage external tools. Perhaps smaller models need explicit training to recognise their own limitations and develop the "humility" to rely on tools when appropriate?&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next to Explore
&lt;/h2&gt;

&lt;p&gt;I'm particularly interested in understanding tool selection strategies (how models choose between multiple viable tools), tool chaining for complex problems, and whether smaller models can be specifically fine-tuned to better recognise when they need external help.&lt;/p&gt;

&lt;p&gt;The sweet spot in tool design is another critical area - finding the right balance between verbose outputs with explanations versus minimal outputs that are easier to parse could dramatically improve how effectively LLMs leverage external capabilities.&lt;/p&gt;

&lt;p&gt;Want to play around with this yourself? Check out my Node CLI app:&lt;br&gt;
&lt;a href="https://github.com/Fr4ncis/llm_and_tools" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>genai</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Testing LLM Speed Across Cloud Providers: Groq, Cerebras, AWS &amp; More</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Sun, 08 Dec 2024 20:57:02 +0000</pubDate>
      <link>https://dev.to/fr4ncis/testing-llm-speed-across-cloud-providers-groq-cerebras-aws-more-3f8</link>
      <guid>https://dev.to/fr4ncis/testing-llm-speed-across-cloud-providers-groq-cerebras-aws-more-3f8</guid>
      <description>&lt;p&gt;After &lt;a href="https://dev.to/fr4ncis/the-fastest-llama-uncovering-the-speed-of-llms-5ap8"&gt;my previous exploration of local vs cloud GPU performance&lt;/a&gt; for LLMs, I wanted to dive deeper into comparing inference speeds across different cloud API providers. With all the buzz around Groq and Cerebras's blazing-fast inference claims, I was curious to see how they stack up in real-world usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Testing Framework
&lt;/h3&gt;

&lt;p&gt;I developed a &lt;a href="https://github.com/Fr4ncis/small-llms-benchmark" rel="noopener noreferrer"&gt;simple Node.js-based framework&lt;/a&gt; to benchmark different LLM providers consistently. The framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs a series of standardised prompts across different providers&lt;/li&gt;
&lt;li&gt;Measures inference time and response generation&lt;/li&gt;
&lt;li&gt;Writes results to structured output files&lt;/li&gt;
&lt;li&gt;Supports multiple providers including OpenAI, Anthropic, AWS Bedrock, Groq, and Cerebras&lt;/li&gt;
&lt;/ul&gt;
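&lt;p&gt;The measurement core of such a framework is small. A sketch with illustrative names (not the actual repo's API):&lt;/p&gt;

```javascript
// Illustrative timing wrapper: time a single provider call and derive a
// rough throughput figure from the response length.
async function timeCompletion(callProvider, prompt) {
  const start = process.hrtime.bigint();
  const text = await callProvider(prompt);
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  return { text, seconds, charsPerSec: text.length / seconds };
}
```

&lt;p&gt;Each provider (OpenAI, Anthropic, Bedrock, Groq, Cerebras) is wrapped behind the same call signature so the timings stay comparable.&lt;/p&gt;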

&lt;p&gt;The test prompts were designed to cover different scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mathematical computations (typically challenging for LLMs)&lt;/li&gt;
&lt;li&gt;Long-form text summarisation (high input tokens, lower output)&lt;/li&gt;
&lt;li&gt;Structured output generation (JSON, XML, CSV formats)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Test Results
&lt;/h3&gt;

&lt;p&gt;The complete benchmark results are available in &lt;a href="https://my-fr4ncis-bucket.s3.amazonaws.com/small_llm_results.xlsx" rel="noopener noreferrer"&gt;this spreadsheet&lt;/a&gt;. While the GitHub repository contains the output from each LLM, we'll focus purely on performance metrics here.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6646ft2lr1skolbr6m6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6646ft2lr1skolbr6m6p.png" alt="Benchmark results" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most interesting findings was the significant speed variation for identical models across different providers. This suggests that infrastructure and optimization play a crucial role in inference speed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq0vw5gusyosbhxp7h4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq0vw5gusyosbhxp7h4f.png" alt="Llama 3.2 3B results" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most dramatic differences emerged when testing larger models like Llama 70B. Providers optimized for fast inference showed remarkable capabilities, demonstrating that even models with 70B parameters can achieve impressive speeds with the right infrastructure.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdby7m37m2o87zskkriv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdby7m37m2o87zskkriv4.png" alt="Llama 70B results" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Groq's performance across different model sizes reveals an intriguing pattern: whether running small or large models, inference speeds remain remarkably consistent, suggesting they have managed to optimise inference for bigger models as well.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxshkhnvyp2nek6atl8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxshkhnvyp2nek6atl8f.png" alt="Groq running different models" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Findings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Groq and Cerebras&lt;/strong&gt;: The hype is real. Both providers demonstrated exceptional performance, particularly with larger models like Llama 3 70B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt;: With a decent GPU (e.g., RTX 4090), smaller models (Llama 3.2 1B/3B) performed (speed-wise) comparably to the quickest "API-based models" like Anthropic's Claude Haiku 3 and Amazon's Nova Micro&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed rankings were fairly consistent&lt;/strong&gt; across different prompts (math, summarisation, structured output)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API throttling&lt;/strong&gt; became an issue with larger models on AWS Bedrock (Claude Sonnet 3.5, Opus 3, Nova Pro)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>genai</category>
      <category>benchmarks</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>The Fastest Llama: Uncovering the Speed of LLMs</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Sun, 01 Sep 2024 09:51:42 +0000</pubDate>
      <link>https://dev.to/fr4ncis/the-fastest-llama-uncovering-the-speed-of-llms-5ap8</link>
      <guid>https://dev.to/fr4ncis/the-fastest-llama-uncovering-the-speed-of-llms-5ap8</guid>
      <description>&lt;p&gt;Curious about LLM Speed? I Tested Local vs Cloud GPUs (and CPUs too!)&lt;/p&gt;

&lt;p&gt;I've been itching to compare the speed of locally-run LLMs against the big players like OpenAI and Anthropic. So, I decided to put my curiosity to the test with a series of experiments across different hardware setups.&lt;/p&gt;

&lt;p&gt;I started with LM Studio and Ollama on my trusty laptop, but then I thought, "Why not push it further?" So, I fired up my PC with an RTX 3070 GPU and dove into some cloud options like RunPod, AWS, and vast.ai. I wanted to see not just the speed differences but also get a handle on the costs involved.&lt;/p&gt;

&lt;p&gt;Now, I'll be the first to admit my test wasn't exactly scientific. I used just two prompts for inference, which some might argue is a bit basic. But hey, it gives us a solid starting point to compare speeds across different GPUs and to understand the nuances between prompt evaluation (input) and response generation (output) speeds.&lt;/p&gt;

&lt;p&gt;Check out this table of results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Device&lt;/th&gt;
&lt;th&gt;Cost/hr&lt;/th&gt;
&lt;th&gt;Phi-3 input (t/s)&lt;/th&gt;
&lt;th&gt;Phi-3 output (t/s)&lt;/th&gt;
&lt;th&gt;Phi-3 IO ratio&lt;/th&gt;
&lt;th&gt;Llama3 input (t/s)&lt;/th&gt;
&lt;th&gt;Llama3 output (t/s)&lt;/th&gt;
&lt;th&gt;Llama3 IO ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M1 Pro&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;96.73&lt;/td&gt;
&lt;td&gt;30.63&lt;/td&gt;
&lt;td&gt;3.158&lt;/td&gt;
&lt;td&gt;59.12&lt;/td&gt;
&lt;td&gt;25.44&lt;/td&gt;
&lt;td&gt;2.324&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3070&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;318.68&lt;/td&gt;
&lt;td&gt;103.12&lt;/td&gt;
&lt;td&gt;3.090&lt;/td&gt;
&lt;td&gt;167.48&lt;/td&gt;
&lt;td&gt;64.15&lt;/td&gt;
&lt;td&gt;2.611&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;g5g.xlarge (T4G)&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;185.55&lt;/td&gt;
&lt;td&gt;60.85&lt;/td&gt;
&lt;td&gt;3.049&lt;/td&gt;
&lt;td&gt;88.61&lt;/td&gt;
&lt;td&gt;42.33&lt;/td&gt;
&lt;td&gt;2.093&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;g5.12xlarge (4x A10G)&lt;/td&gt;
&lt;td&gt;$5.672&lt;/td&gt;
&lt;td&gt;266.46&lt;/td&gt;
&lt;td&gt;105.97&lt;/td&gt;
&lt;td&gt;2.514&lt;/td&gt;
&lt;td&gt;131.36&lt;/td&gt;
&lt;td&gt;68.07&lt;/td&gt;
&lt;td&gt;1.930&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A40 (runpod)&lt;/td&gt;
&lt;td&gt;$0.49 (spot)&lt;/td&gt;
&lt;td&gt;307.51&lt;/td&gt;
&lt;td&gt;123.73&lt;/td&gt;
&lt;td&gt;2.485&lt;/td&gt;
&lt;td&gt;153.41&lt;/td&gt;
&lt;td&gt;79.33&lt;/td&gt;
&lt;td&gt;1.934&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L40 (runpod)&lt;/td&gt;
&lt;td&gt;$0.69 (spot)&lt;/td&gt;
&lt;td&gt;444.29&lt;/td&gt;
&lt;td&gt;154.22&lt;/td&gt;
&lt;td&gt;2.881&lt;/td&gt;
&lt;td&gt;212.25&lt;/td&gt;
&lt;td&gt;97.51&lt;/td&gt;
&lt;td&gt;2.177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090 (runpod)&lt;/td&gt;
&lt;td&gt;$0.49 (spot)&lt;/td&gt;
&lt;td&gt;470.42&lt;/td&gt;
&lt;td&gt;168.08&lt;/td&gt;
&lt;td&gt;2.799&lt;/td&gt;
&lt;td&gt;222.27&lt;/td&gt;
&lt;td&gt;101.43&lt;/td&gt;
&lt;td&gt;2.191&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2x RTX 4090 (runpod)&lt;/td&gt;
&lt;td&gt;$0.99 (spot)&lt;/td&gt;
&lt;td&gt;426.73&lt;/td&gt;
&lt;td&gt;40.95&lt;/td&gt;
&lt;td&gt;10.4&lt;/td&gt;
&lt;td&gt;168.60&lt;/td&gt;
&lt;td&gt;111.34&lt;/td&gt;
&lt;td&gt;1.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090 (vast.ai)&lt;/td&gt;
&lt;td&gt;$0.24&lt;/td&gt;
&lt;td&gt;335.49&lt;/td&gt;
&lt;td&gt;142.02&lt;/td&gt;
&lt;td&gt;2.36&lt;/td&gt;
&lt;td&gt;145.47&lt;/td&gt;
&lt;td&gt;88.99&lt;/td&gt;
&lt;td&gt;1.63&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
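&lt;p&gt;For clarity, the "IO ratio" columns are simply prompt-evaluation (input) speed divided by generation (output) speed, e.g. for Phi-3 on the M1 Pro:&lt;/p&gt;

```javascript
// IO ratio = input tokens/s divided by output tokens/s (M1 Pro, Phi-3 row).
const ioRatio = 96.73 / 30.63;
console.log(ioRatio.toFixed(3)); // prints "3.158"
```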

&lt;h3&gt;
  
  
  Setup and Specs: For the Tech-Curious
&lt;/h3&gt;

&lt;p&gt;I ran tests on a variety of setups, from cloud services to my local machines. Below is a quick rundown of the hardware. I wrote in more detail about running LLMs in the cloud &lt;a href="https://dev.to/fr4ncis/running-your-own-llms-in-the-cloud-a-practical-guide-55jg"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The benchmarks are run using the Python scripts at &lt;code&gt;https://github.com/MinhNgyuen/llm-benchmark.git&lt;/code&gt;, which lean on Ollama for inference. On each environment, then, we need to set up Ollama and Python, pull the models we want to test, and run the benchmarks.&lt;/p&gt;

&lt;p&gt;On runpod (starting from ollama/ollama Docker template):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# basic setup (on ubuntu)
apt-get update
apt install pip python3 git python3.10-venv -y

# pull models we want to test
ollama pull phi3; ollama pull llama3

python3 -m venv venv
source venv/bin/activate

# download benchmarking script and install dependencies 
git clone https://github.com/MinhNgyuen/llm-benchmark.git
cd llm-benchmark
pip install -r requirements.txt

# run benchmarking script with installed models and these prompts
python benchmark.py --verbose --skip-models nomic-embed-text:latest --prompts "Why is the sky blue?" "Write a report on the financials of Nvidia"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Systems specs
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;Hardware Specification&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Software&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS EC2&lt;/td&gt;
&lt;td&gt;g5g.xlarge, Nvidia T4G&lt;/td&gt;
&lt;td&gt;16GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS EC2&lt;/td&gt;
&lt;td&gt;g5.12xlarge, 4x Nvidia A10G&lt;/td&gt;
&lt;td&gt;96GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runpod&lt;/td&gt;
&lt;td&gt;Nvidia A40&lt;/td&gt;
&lt;td&gt;48GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runpod&lt;/td&gt;
&lt;td&gt;Nvidia L40&lt;/td&gt;
&lt;td&gt;48GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runpod&lt;/td&gt;
&lt;td&gt;Nvidia RTX 4090&lt;/td&gt;
&lt;td&gt;24GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runpod&lt;/td&gt;
&lt;td&gt;2x Nvidia RTX 4090&lt;/td&gt;
&lt;td&gt;48GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vast.ai&lt;/td&gt;
&lt;td&gt;Nvidia RTX 3090&lt;/td&gt;
&lt;td&gt;24GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Mac&lt;/td&gt;
&lt;td&gt;M1 Pro 8 CPU Cores (6p + 2e) + 14 GPU cores&lt;/td&gt;
&lt;td&gt;16GB (V)RAM&lt;/td&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local PC&lt;/td&gt;
&lt;td&gt;Nvidia RTX 3070, LLM on GPU&lt;/td&gt;
&lt;td&gt;8GB VRAM&lt;/td&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local PC&lt;/td&gt;
&lt;td&gt;Ryzen 5500 6 CPU Cores, LLM on CPU&lt;/td&gt;
&lt;td&gt;64GB RAM&lt;/td&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  But Wait, What About CPUs?
&lt;/h3&gt;

&lt;p&gt;Curious about CPU performance compared to GPUs? I ran a quick test to give you an idea. I used a single prompt across three different setups:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Mac, which uses its integrated GPU&lt;/li&gt;
&lt;li&gt;A PC with an Nvidia GPU, which expectedly gave the best speed results&lt;/li&gt;
&lt;li&gt;A PC running solely on its CPU&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this test, I used LM Studio, which gives you flexibility over where to load the LLM layers, conveniently letting you choose whether or not to use your system's GPU. I ran the tests with temperature set to 0, using the prompt &lt;code&gt;Who is the president of the US?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here are the results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Device&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Phi3 mini 4k instruct q4&lt;/td&gt;
&lt;td&gt;M1 Pro&lt;/td&gt;
&lt;td&gt;0.04s&lt;/td&gt;
&lt;td&gt;~35 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;RTX 3070&lt;/td&gt;
&lt;td&gt;0.01s&lt;/td&gt;
&lt;td&gt;~97 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Ryzen 5&lt;/td&gt;
&lt;td&gt;0.07s&lt;/td&gt;
&lt;td&gt;~13 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Meta Llama 3 Instruct 7B&lt;/td&gt;
&lt;td&gt;M1 Pro&lt;/td&gt;
&lt;td&gt;0.17s&lt;/td&gt;
&lt;td&gt;~23 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;RTX 3070&lt;/td&gt;
&lt;td&gt;0.02s&lt;/td&gt;
&lt;td&gt;~64 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Ryzen 5&lt;/td&gt;
&lt;td&gt;0.13s&lt;/td&gt;
&lt;td&gt;~7 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma It 2B Q4_K_M&lt;/td&gt;
&lt;td&gt;M1 Pro&lt;/td&gt;
&lt;td&gt;0.02s&lt;/td&gt;
&lt;td&gt;~63 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;RTX 3070&lt;/td&gt;
&lt;td&gt;0.01s&lt;/td&gt;
&lt;td&gt;~170 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Ryzen 5&lt;/td&gt;
&lt;td&gt;0.05s&lt;/td&gt;
&lt;td&gt;~23 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  My takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Dedicated GPUs are speed demons: They outperform Macs when it comes to inference speed, especially considering the costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Size matters (for models): Smaller models can provide a viable experience even on lower-end hardware, as long as you've got the RAM or VRAM to back it up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPUs? Not so hot for inference: Your average desktop CPU is still vastly slower than a dedicated GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gaming GPUs for the win: a beastly gaming GPU like the 4090 is quite cost-effective and can deliver top-notch results, comparable to an H100. Multiple GPUs didn't necessarily make things faster in this scenario.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This little experiment has been a real eye-opener for me, and I'm eager to dive deeper. I'd love to hear your thoughts! What other tests would you like to see? Any specific hardware or models you're curious about?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running Your Own LLMs in the Cloud: A Practical Guide</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Sat, 31 Aug 2024 17:07:25 +0000</pubDate>
      <link>https://dev.to/fr4ncis/running-your-own-llms-in-the-cloud-a-practical-guide-55jg</link>
      <guid>https://dev.to/fr4ncis/running-your-own-llms-in-the-cloud-a-practical-guide-55jg</guid>
      <description>&lt;p&gt;Ever wondered what it would be like to have your own personal fleet of language models at your command? In this post, we'll explore how to run LLMs on cloud GPU instances, giving you more control, better performance, and greater flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Run Your Own LLM Instances?
&lt;/h3&gt;

&lt;p&gt;There are several compelling reasons to consider this approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;em&gt;Data Control&lt;/em&gt;: You have complete oversight of the data sent to and processed by the LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Enhanced Performance&lt;/em&gt;: Access to powerful GPU instances means faster responses and the ability to run larger models.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Model Ownership&lt;/em&gt;: Run fine-tuned models with behaviour that remains consistent over time.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Scalability&lt;/em&gt;: Easily scale resources up or down based on your needs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How does it work?
&lt;/h3&gt;

&lt;p&gt;We will be using Ollama, a tool for running LLMs, along with cost-effective cloud providers like RunPod and vast.ai. Here's the basic process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start a cloud instance with Ollama installed&lt;/li&gt;
&lt;li&gt;Serve the LLM through an API&lt;/li&gt;
&lt;li&gt;Access the API from your local machine&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  RunPod
&lt;/h3&gt;

&lt;p&gt;RunPod (&lt;a href="https://www.runpod.io/" rel="noopener noreferrer"&gt;runpod.io&lt;/a&gt;) offers a streamlined approach to creating cloud instances from Docker images. This means you can quickly spin up an instance that's already configured to serve Ollama and provide API access. It's worth noting that their pricing has become more competitive recently, with instances starting at $0.22/hr for a 24GB VRAM GPU.&lt;/p&gt;

&lt;p&gt;Here's a step-by-step guide to get you started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On runpod.io, navigate to "Pods"&lt;/li&gt;
&lt;li&gt;Click "Deploy" and select an NVIDIA instance&lt;/li&gt;
&lt;li&gt;Choose the "ollama template" based on ollama/ollama:latest Docker image&lt;/li&gt;
&lt;li&gt;Take note of the POD_ID - you'll need this for API access&lt;/li&gt;
&lt;li&gt;Connect to the API via HTTPS on {POD_ID}-11434.proxy.runpod.net:443&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While this setup is straightforward, it does raise some security concerns. The API is exposed on port 11434 without any built-in authentication or access limitations. I attempted to use an SSH tunnel as a workaround (similar to the method I'll describe for vast.ai), but encountered difficulties getting it to work with RunPod. This is an area where I'd appreciate community input on best practices or alternative solutions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Optional: SSH access
&lt;/h4&gt;

&lt;p&gt;If you need direct access to the instance, SSH is available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh tn0b2n8qpybgbv-644112be@ssh.runpod.io -i ~/.ssh/id_ed25519
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once connected, you can verify the GPU specifications using the &lt;code&gt;nvidia-smi&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu70u9auknwh2hdrueoz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu70u9auknwh2hdrueoz2.png" alt="Look at this beefy 4xA100 (320GB VRAM!)" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Vast.ai
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://vast.ai/" rel="noopener noreferrer"&gt;vast.ai&lt;/a&gt; operates as a marketplace where users can both offer and rent GPU instances. The pricing is generally quite competitive, often lower than RunPod, especially for low-end GPUs with less than 24GB of VRAM. However, it also provides access to more powerful systems, like the 4xA100 setup I used to run Llama3.1-405B.&lt;/p&gt;

&lt;p&gt;Setting up an instance on Vast.ai is straightforward. You can select a template for Ollama within their interface, leveraging Docker once again. Unlike RunPod, Vast.ai doesn’t automatically expose a port for API access; instead, I found SSH tunnelling to be the more secure and preferable solution. Once you’ve chosen an instance that meets your requirements, simply click on “Rent” and connect to the instance via SSH, which also sets up the SSH tunnel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -i ~/.ssh/vastai -p 31644 root@162.193.169.187 -L 11434:localhost:11434
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command creates a tunnel that forwards connections from port 11434 on your local machine to port 11434 on the remote machine, allowing you to access services on the remote machine as if they were running locally.&lt;/p&gt;
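&lt;p&gt;With the tunnel open, you can talk to the remote Ollama as if it were local. A minimal sketch using Ollama's standard &lt;code&gt;/api/generate&lt;/code&gt; endpoint (assumes Node 18+ for the global &lt;code&gt;fetch&lt;/code&gt;):&lt;/p&gt;

```javascript
// Build the request body for Ollama's /api/generate endpoint;
// stream: false asks for a single JSON response instead of a token stream.
function buildRequest(prompt, model = "llama3") {
  return { model, prompt, stream: false };
}

// Send it through the tunnel (localhost:11434 forwards to the remote Ollama).
async function askOllama(prompt, model) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify(buildRequest(prompt, model)),
  });
  return (await res.json()).response;
}
```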

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt; the vast.ai image does not run the Ollama server by default. To enable this, you need to modify the template during the instance rental process by adding &lt;code&gt;ollama serve&lt;/code&gt; to the on-start script. Alternatively, you can connect via SSH and manually run the command. Additionally, Vast.ai offers a CLI tool to search for available GPU instances, rent them, run Ollama, and connect directly via the CLI, which is quite neat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; During my initial tests on Vast.ai, I encountered issues where the Ollama server crashed, likely due to instance-specific factors. Restarting the instance resolved the problem, suggesting it might have been an isolated incident. &lt;/p&gt;

&lt;h3&gt;
  
  
  Checking the API and comparing models
&lt;/h3&gt;

&lt;p&gt;I've created some scripts to test models and compare performance (&lt;a href="https://github.com/Fr4ncis/llm-quantisation-comparison" rel="noopener noreferrer"&gt;see here&lt;/a&gt;). Here's how to use them.&lt;/p&gt;

&lt;p&gt;For RunPod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node stream_chat_completion.js -v --function ollama --hostname sbeu57aj70rdqu-11434.proxy.runpod.net --port 443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Vast.ai (using SSH tunnel, keep that terminal open!):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node stream_chat_completion.js -v --function ollama --models mistral-nemo:12b-instruct-2407-q2_K,mistral-nemo:12b-instruct-2407-q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What About AWS?
&lt;/h3&gt;

&lt;p&gt;While I initially looked into AWS EC2, it proved less straightforward and more costly for this specific use case than RunPod and Vast.ai. For completeness, here are the steps I took to set up the NVIDIA drivers and Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get update
sudo apt install ubuntu-drivers-common -y
sudo apt install nvidia-driver-550 -y # use 535 for A10G!!
sudo apt install nvidia-cuda-toolkit -y

# verify that nvidia drivers are running
sudo nvidia-smi 

# install ollama
curl -fsSL https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
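
Once installed, a quick way to confirm the server is actually up is to hit its API locally (Ollama listens on port 11434 by default, and &lt;code&gt;/api/tags&lt;/code&gt; lists the models you have pulled):

```shell
# Smoke test: returns the locally available models as JSON if the
# Ollama server is running on its default port.
curl -s http://localhost:11434/api/tags
```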



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Running LLMs on cloud GPU instances is more accessible (both from a cost and an effort perspective) than I originally thought, and it offers impressive performance across model sizes. The ability to run large models like Llama 3.1 405B, quantised to fit in 320GB of VRAM, is particularly noteworthy.&lt;/p&gt;

&lt;p&gt;However, beyond testing bigger models, I'm not yet sure what the compelling use case is compared to the large LLMs available through APIs (e.g. GPT-4o, Claude 3.5, etc.).&lt;/p&gt;

&lt;p&gt;Have you tried running your own LLMs in the cloud? What has your experience been like? I'd love to hear your thoughts and questions in the comments below!&lt;/p&gt;

</description>
      <category>cloudcomputing</category>
      <category>machinelearning</category>
      <category>aiinfrastructure</category>
      <category>gpupower</category>
    </item>
    <item>
      <title>Unlocking Vision: Evaluating LLMs for Home Security</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Wed, 22 May 2024 08:27:55 +0000</pubDate>
      <link>https://dev.to/fr4ncis/unlocking-vision-evaluating-llms-for-home-security-2dmk</link>
      <guid>https://dev.to/fr4ncis/unlocking-vision-evaluating-llms-for-home-security-2dmk</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;I am diving into the vision capabilities of large language models (LLMs) to see if they can accurately classify images, specifically focusing on spotting door handle positions to tell if they’re locked or unlocked. This experiment includes basic tests to evaluate accuracy, speed, and token usage, offering an initial comparison across models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Fr4ncis/LLM_Image_Classifier"&gt;Code on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiumioj1o5d1wa7r5ybli.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiumioj1o5d1wa7r5ybli.jpg" alt="Image description" width="600" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;

&lt;p&gt;Imagine using a webcam to monitor door security, providing images of door handles in different lighting conditions (day and night). The system’s goal is to classify the handle’s position—vertical (locked) or horizontal (unlocked)—and report the status in a parseable format like JSON. This could be a valuable feature in home automation systems. While traditional machine learning models, which require specific training, might achieve better performance, this experiment explores the potential of large language models (LLMs) in this task.&lt;/p&gt;
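
A parseable response might look like the following; this exact schema is my illustration, not necessarily the one used in the repo:

```shell
# Hypothetical JSON shape for the classifier's answer; the actual prompt
# and response schema live in the linked repository.
echo '{"handle_position": "vertical", "locked": true}'
```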

&lt;h3&gt;
  
  
  Approach
&lt;/h3&gt;

&lt;p&gt;First, I took some pictures and fed them to the leading LLMs (Claude 3 Opus, OpenAI GPT-4) directly through their web interfaces, with no code, to see if they could accurately classify door handle positions. Was this method viable, or would it end up being a waste of time?&lt;/p&gt;

&lt;p&gt;The initial results were encouraging, but I needed to verify if the models could consistently perform well. With a binary classifier, there’s a 50% chance of guessing correctly, so I wanted to ensure the accuracy was truly meaningful.&lt;/p&gt;

&lt;p&gt;To make the outputs as deterministic as possible, I set the temperature to 0.0. To save on tokens and improve processing speed, I resized the images using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;convert original_image.jpg -resize 200x200 resized.jpg&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Next, I wrote a script to access Anthropic models, comparing the classification results to the actual positions indicated by the image filenames (v for vertical, h for horizontal).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./locks_classifier.js -m Haiku -v
🤖 Haiku
images/test01_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 794 ms
images/test02_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 1073 ms
images/test03_h.jpg ❌
📊 In: 202 tkn Out: 11 Time: 604 ms

Correct Responses: (12 / 20) 60%
Total In Tokens: 3976
Total Out Tokens: 220
Avg Time: 598 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
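
The per-image ✅/❌ check above compares the model's answer against the filename convention. That ground-truth lookup can be sketched as a small helper, roughly like this (the real logic lives in the Node script in the repo):

```shell
# Sketch: derive the expected label from filenames like test03_h.jpg,
# where _v means vertical (locked) and _h means horizontal (unlocked).
label_for() {
  case "${1%.*}" in
    *_v) echo vertical ;;
    *_h) echo horizontal ;;
    *)   echo unknown ;;
  esac
}
label_for images/test01_v.jpg   # vertical
label_for images/test03_h.jpg   # horizontal
```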



&lt;p&gt;The results for Haiku were somewhat underwhelming, while Sonnet performed even worse, albeit with similar speed.&lt;/p&gt;

&lt;p&gt;I experimented with few-shot examples embedded in the prompt, but this did not improve the results.&lt;/p&gt;

&lt;p&gt;Out of curiosity, I also tested OpenAI models, adapting my scripts to accommodate their slightly different APIs (it’s frustrating that there isn’t a standard yet, right?).&lt;/p&gt;

&lt;p&gt;The results with OpenAI models were significantly better. Although slightly slower, they were much more accurate in comparison.&lt;/p&gt;

&lt;p&gt;GPT-4-Turbo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./locks_classifier.js -m GPT4 -v
Responses: (16 / 20) 80% 
In Tokens: 6360 Out Tokens: 240
Avg Time: 2246 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The just-released GPT-4o:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./locks_classifier.js -m GPT4o -v
Responses: (20 / 20) 100% 
In Tokens: 6340 Out Tokens: 232
Avg Time: 1751 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What I learnt
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1) LLM Performance:&lt;/strong&gt; I was curious to see how the models would perform, and I am quite impressed by GPT-4o. It delivered high accuracy and reasonable speed. On the other hand, Haiku’s performance was somewhat disappointing, although its lower cost and faster response time make it appealing for many applications. There’s definitely potential to explore Haiku further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Temperature 0.0:&lt;/strong&gt; I was surprised by the varying responses even with the temperature set to 0.0, which should theoretically produce consistent results. This variability was unexpected and suggests that other factors may be influencing the outputs. Any ideas on why this might be happening?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🤖 Haiku *Run #1*
Responses: (5 / 11) 45%
In Tokens: 2222 Out Tokens: 121
Avg Time: 585 ms

🤖 Haiku *Run #2*
Correct Responses: (7 / 11) 64% 
In Tokens: 2222 Out Tokens: 121 
Avg Time: 585 ms

🤖 Haiku *Run #3*
Correct Responses: (4 / 11) 36% 
In Tokens: 2222 Out Tokens: 121
Avg Time: 583 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3) Variability in Tokenization:&lt;/strong&gt; There is significant variability in the number of tokens generated by different models for the same input. This variability impacts cost estimates and efficiency, as token usage directly influences the expense of using these models.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;In Tks&lt;/th&gt;
&lt;th&gt;Out Tks&lt;/th&gt;
&lt;th&gt;$/M In Tks&lt;/th&gt;
&lt;th&gt;$/M Out Tks&lt;/th&gt;
&lt;th&gt;Images per $1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Haiku&lt;/td&gt;
&lt;td&gt;202&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;15,563&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;156&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;1,579&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4&lt;/td&gt;
&lt;td&gt;318&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;565&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;317&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;283&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
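
The "Images per $1" column follows directly from the token counts and prices. For Haiku, for example (small rounding differences aside, this reproduces the table's figure):

```shell
# Cost per classified image = in_tokens * in_price/1M + out_tokens * out_price/1M.
# For Haiku: 202 input tokens at $0.25/M plus 11 output tokens at $1.25/M.
awk 'BEGIN {
  cost = 202 * 0.25 / 1e6 + 11 * 1.25 / 1e6   # dollars per image
  printf "%d\n", 1 / cost                     # images per $1
}'
```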

&lt;p&gt;&lt;strong&gt;4) Variability in Response Time:&lt;/strong&gt; I did not expect the same model, given the same input size, to have such a wide range of response times. This variability suggests that there are underlying factors affecting the inference speed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Res Time (ms)&lt;/th&gt;
&lt;th&gt;Min Res Time (ms)&lt;/th&gt;
&lt;th&gt;Max Res Time (ms)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Haiku&lt;/td&gt;
&lt;td&gt;598&lt;/td&gt;
&lt;td&gt;351&lt;/td&gt;
&lt;td&gt;1073&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;605&lt;/td&gt;
&lt;td&gt;468&lt;/td&gt;
&lt;td&gt;1011&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4&lt;/td&gt;
&lt;td&gt;2246&lt;/td&gt;
&lt;td&gt;1716&lt;/td&gt;
&lt;td&gt;6037&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;1751&lt;/td&gt;
&lt;td&gt;1172&lt;/td&gt;
&lt;td&gt;4559&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Overall, while the accuracy and results are interesting, they can vary significantly depending on the images used. For instance, would larger images improve the performance of models like Haiku and Sonnet?&lt;/p&gt;

&lt;h3&gt;
  
  
  Next steps
&lt;/h3&gt;

&lt;p&gt;Here are a few ideas to dive deeper into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Explore Different Challenges:&lt;/strong&gt; Consider swapping the current challenge with a different task to further test the capabilities of LLMs in various scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Test Local Vision-Enabled Models:&lt;/strong&gt; Evaluate models like Llava 1.5 7B running locally on platforms such as LM Studio or Ollama. Would a local LLM provide a viable option?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Compare with Traditional ML Models:&lt;/strong&gt; Conduct tests against more traditional machine learning models to see how many sample images are needed to achieve similar or better accuracy.&lt;/p&gt;

&lt;p&gt;Let me know if you have any comments or questions. I’d love to hear your suggestions on where to go next and what tests you’d like to see conducted!&lt;/p&gt;

</description>
      <category>genai</category>
      <category>computervision</category>
      <category>homeautomation</category>
      <category>languagemodels</category>
    </item>
  </channel>
</rss>
