<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Francesco Mattia</title>
    <description>The latest articles on DEV Community by Francesco Mattia (@fr4ncis).</description>
    <link>https://dev.to/fr4ncis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1407967%2Fdc4ddc77-787c-4212-a07e-b9421c308423.jpeg</url>
      <title>DEV Community: Francesco Mattia</title>
      <link>https://dev.to/fr4ncis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fr4ncis"/>
    <language>en</language>
    <item>
      <title>Supercharging Language Models: What I Learned Testing LLMs with Tools</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Sun, 04 May 2025 20:32:20 +0000</pubDate>
      <link>https://dev.to/fr4ncis/supercharging-language-models-what-i-learned-testing-llms-with-tools-28pl</link>
      <guid>https://dev.to/fr4ncis/supercharging-language-models-what-i-learned-testing-llms-with-tools-28pl</guid>
      <description>&lt;p&gt;LLMs are great at creative writing and language tasks, but they often stumble on basic knowledge retrieval and math. Popular tests like counting r's in "strawberry" or doing simple arithmetic often trip them up. This is where tools come into the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use tools with LLMs?
&lt;/h2&gt;

&lt;p&gt;Simply put, we're giving LLMs capabilities they don't naturally have, helping them deliver better answers.&lt;/p&gt;
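&lt;p&gt;In practice, a "tool" is just a function schema the model can ask the client to call. As an illustration (the tool name and fields here are my own, not taken from a specific library), a calculator tool in the widely used OpenAI-style function format looks like this:&lt;/p&gt;

```javascript
// Illustrative calculator tool definition (names are my own) in the
// OpenAI-style function schema that most tool-calling APIs accept.
const calculatorTool = {
  type: "function",
  function: {
    name: "calculator",
    description: "Evaluate an arithmetic expression and return the result",
    parameters: {
      type: "object",
      properties: {
        expression: { type: "string", description: "e.g. '100 * 1.03 ** 3'" },
      },
      required: ["expression"],
    },
  },
};
```

&lt;p&gt;The model never executes anything itself: it emits a call like &lt;code&gt;{"name": "calculator", "arguments": {"expression": "..."}}&lt;/code&gt;, your code runs it, and the result goes back into the conversation.&lt;/p&gt;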

&lt;h2&gt;
  
  
  Four Surprising Things I Found While Testing
&lt;/h2&gt;

&lt;p&gt;I ran some tests with local models on Ollama and noticed some interesting patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Even the Best Models Get Math Wrong Sometimes
&lt;/h3&gt;

&lt;p&gt;I tested various models with this straightforward financial question: "I have initially 100 USD in an account that gives 3.42% interest/year for the first 2 years then switches to a 3% interest/year. How much will I have after 5 years?"&lt;/p&gt;

&lt;p&gt;The correct answer is &lt;code&gt;100 * 1.0342^2 * 1.03^3 = 116.8747624&lt;/code&gt;.&lt;/p&gt;
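&lt;p&gt;For reference, the whole calculation fits in two lines - exactly the kind of deterministic arithmetic a calculator tool would hand back:&lt;/p&gt;

```javascript
// Compound interest from the prompt: 3.42%/year for the first 2 years,
// then 3%/year for the remaining 3 years, starting from 100 USD.
const balance = 100 * Math.pow(1.0342, 2) * Math.pow(1.03, 3);
console.log(balance.toFixed(7)); // prints "116.8747624"
```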

&lt;p&gt;What surprised me was that top models like Gemini 2.5 Pro and GPT-4o "understood" the approach but messed up the actual calculations. Gemini calculated &lt;code&gt;106.954164 * 1.092727 = 116.881&lt;/code&gt; - close, but not quite right.&lt;/p&gt;

&lt;p&gt;This is a good reminder to double-check LLM calculations, especially for important decisions like financial planning.&lt;/p&gt;

&lt;p&gt;Interestingly, even a small local model like Qwen3 4B could nail this when given a calculator tool - showing that the right tools can make a huge difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Tools Can Supercharge Performance
&lt;/h3&gt;

&lt;p&gt;The difference tools make is pretty dramatic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen3 4B without tools: Spent 12 minutes thinking only to get the wrong answer (somehow turning 100 USD into 1000 USD)&lt;/li&gt;
&lt;li&gt;Same model with a calculator: Got the right answer in just over 2 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I saw similar improvements across other local models like Llama 3.2 3B and 3.3 70B. I assume we'd reach the same conclusions with cloud-based LLMs.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) LLMs Can Fill in the Gaps When Tools Fall Short
&lt;/h3&gt;

&lt;p&gt;What's fascinating is how LLMs handle imperfect tool results. I experimented by simulating tool calls and controlling what they returned - often deliberately giving back information that wasn't quite what the LLM requested.&lt;/p&gt;

&lt;p&gt;For example, when I gave an LLM a weather tool that only showed temperatures in Fahrenheit but asked for Celsius, it just did the conversion itself without missing a beat.&lt;/p&gt;

&lt;p&gt;In another experiment, I simulated returning interest calculations for the wrong time period (e.g., 3 years instead of 5). The LLM recognized the mismatch and tried to adapt the information to solve the original problem. Sometimes it would request additional calculations, and other times it would attempt to extrapolate from what it received.&lt;/p&gt;

&lt;p&gt;These experiments show that LLMs don't just blindly use tool outputs - they evaluate the results, determine if they're helpful, and find ways to work with what they have. This adaptability makes tools even more powerful, as they don't need to be perfectly aligned with every request.&lt;/p&gt;
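&lt;p&gt;A minimal sketch of that harness (the tool name and payload here are hypothetical, not the exact ones from my tests): intercept the model's tool call and return a canned, deliberately mismatched result.&lt;/p&gt;

```javascript
// Hypothetical harness: instead of executing a real tool, return a canned
// result that deliberately mismatches the request - Fahrenheit instead of
// the Celsius the model asked for.
function simulateToolCall(toolCall) {
  if (toolCall.name === "get_weather") {
    return { role: "tool", content: JSON.stringify({ temperature: 86, unit: "F" }) };
  }
  throw new Error(`unknown tool: ${toolCall.name}`);
}

const reply = simulateToolCall({
  name: "get_weather",
  arguments: { city: "London", unit: "C" },
});
// A capable model converts on its own: (86 - 32) * 5 / 9 = 30 C.
console.log(reply.content);
```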

&lt;h3&gt;
  
  
  4) Smaller Models Don't Always Use Tools When They Should
&lt;/h3&gt;

&lt;p&gt;You might expect smaller models with limited knowledge to eagerly embrace tools as a crutch, but my testing revealed something quite different and fascinating.&lt;/p&gt;

&lt;p&gt;The tiniest model I tested, Qwen 0.6B, was surprisingly stubborn about using its own capabilities. Even when explicitly told about available tools that could help solve a problem, it consistently tried to work things out on its own - often with poor results. It's almost as if it lacked the self-awareness to recognise its own limitations.&lt;/p&gt;

&lt;p&gt;Llama 3.2 3B showed a different pattern. It attempted to use tools, showing it recognised the need for external help, but applied them incorrectly. For instance, when trying to solve our compound interest problem, it would call the calculator tool but input the wrong formula or misinterpret the results.&lt;/p&gt;

&lt;p&gt;Larger models seem to be more reliable in their calculations - sometimes rightfully so, but other times that confidence was misplaced and they still made errors. It still makes sense to use tools to ground answers in a deterministic output.&lt;/p&gt;

&lt;p&gt;This pattern suggests that effective tool use might not emerge naturally in smaller models - it may require specific fine-tuning to teach them when and how to leverage external tools. Perhaps smaller models need explicit training to recognise their own limitations and develop the "humility" to rely on tools when appropriate?&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next to Explore
&lt;/h2&gt;

&lt;p&gt;I'm particularly interested in understanding tool selection strategies (how models choose between multiple viable tools), tool chaining for complex problems, and whether smaller models can be specifically fine-tuned to better recognise when they need external help.&lt;/p&gt;

&lt;p&gt;The sweet spot in tool design is another critical area - finding the right balance between verbose outputs with explanations versus minimal outputs that are easier to parse could dramatically improve how effectively LLMs leverage external capabilities.&lt;/p&gt;

&lt;p&gt;Want to play around with this yourself? Check out my Node CLI app:&lt;br&gt;
&lt;a href="https://github.com/Fr4ncis/llm_and_tools" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>genai</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Testing LLM Speed Across Cloud Providers: Groq, Cerebras, AWS &amp; More</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Sun, 08 Dec 2024 20:57:02 +0000</pubDate>
      <link>https://dev.to/fr4ncis/testing-llm-speed-across-cloud-providers-groq-cerebras-aws-more-3f8</link>
      <guid>https://dev.to/fr4ncis/testing-llm-speed-across-cloud-providers-groq-cerebras-aws-more-3f8</guid>
      <description>&lt;p&gt;After &lt;a href="https://dev.to/fr4ncis/the-fastest-llama-uncovering-the-speed-of-llms-5ap8"&gt;my previous exploration of local vs cloud GPU performance&lt;/a&gt; for LLMs, I wanted to dive deeper into comparing inference speeds across different cloud API providers. With all the buzz around Groq and Cerebras's blazing-fast inference claims, I was curious to see how they stack up in real-world usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Testing Framework
&lt;/h3&gt;

&lt;p&gt;I developed a &lt;a href="https://github.com/Fr4ncis/small-llms-benchmark" rel="noopener noreferrer"&gt;simple Node.js-based framework&lt;/a&gt; to benchmark different LLM providers consistently. The framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs a series of standardised prompts across different providers&lt;/li&gt;
&lt;li&gt;Measures inference time and response generation&lt;/li&gt;
&lt;li&gt;Writes results to structured output files&lt;/li&gt;
&lt;li&gt;Supports multiple providers including OpenAI, Anthropic, AWS Bedrock, Groq, and Cerebras&lt;/li&gt;
&lt;/ul&gt;
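&lt;p&gt;The measurement core of such a framework is small. A sketch with illustrative names (not the actual repo's API):&lt;/p&gt;

```javascript
// Illustrative timing wrapper: time a single provider call and derive a
// rough throughput figure from the response length.
async function timeCompletion(callProvider, prompt) {
  const start = process.hrtime.bigint();
  const text = await callProvider(prompt);
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  return { text, seconds, charsPerSec: text.length / seconds };
}
```

&lt;p&gt;Each provider (OpenAI, Anthropic, Bedrock, Groq, Cerebras) is wrapped behind the same call signature so the timings stay comparable.&lt;/p&gt;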

&lt;p&gt;The test prompts were designed to cover different scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mathematical computations (typically challenging for LLMs)&lt;/li&gt;
&lt;li&gt;Long-form text summarisation (high input tokens, lower output)&lt;/li&gt;
&lt;li&gt;Structured output generation (JSON, XML, CSV formats)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Test Results
&lt;/h3&gt;

&lt;p&gt;The complete benchmark results are available in &lt;a href="https://my-fr4ncis-bucket.s3.amazonaws.com/small_llm_results.xlsx" rel="noopener noreferrer"&gt;this spreadsheet&lt;/a&gt;. While the GitHub repository contains the output from each LLM, we'll focus purely on performance metrics here.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6646ft2lr1skolbr6m6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6646ft2lr1skolbr6m6p.png" alt="Benchmark results" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most interesting findings was the significant speed variation for identical models across different providers. This suggests that infrastructure and optimization play a crucial role in inference speed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq0vw5gusyosbhxp7h4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq0vw5gusyosbhxp7h4f.png" alt="Llama 3.2 3B results" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most dramatic differences emerged when testing larger models like Llama 70B. Providers optimized for fast inference showed remarkable capabilities, demonstrating that even models with 70B parameters can achieve impressive speeds with the right infrastructure.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdby7m37m2o87zskkriv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdby7m37m2o87zskkriv4.png" alt="Llama 70B results" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Groq's performance across different model sizes reveals an intriguing pattern: whether running small or large models, inference speeds remain remarkably consistent, suggesting they have managed to optimise inference for bigger models as well.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxshkhnvyp2nek6atl8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxshkhnvyp2nek6atl8f.png" alt="Groq running different models" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Findings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Groq and Cerebras&lt;/strong&gt;: The hype is real. Both providers demonstrated exceptional performance, particularly with larger models like Llama 3 70B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt;: With a decent GPU (e.g., RTX 4090), smaller models (Llama 3.2 1B/3B) performed (speed-wise) comparably to the quickest "API-based models" like Anthropic's Claude Haiku 3 and Amazon's Nova Micro&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed rankings were fairly consistent&lt;/strong&gt; across different prompts (math, summarisation, structured output)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API throttling&lt;/strong&gt; became an issue with larger models on AWS Bedrock (Claude Sonnet 3.5, Opus 3, Nova Pro)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>genai</category>
      <category>benchmarks</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>The Fastest Llama: Uncovering the Speed of LLMs</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Sun, 01 Sep 2024 09:51:42 +0000</pubDate>
      <link>https://dev.to/fr4ncis/the-fastest-llama-uncovering-the-speed-of-llms-5ap8</link>
      <guid>https://dev.to/fr4ncis/the-fastest-llama-uncovering-the-speed-of-llms-5ap8</guid>
      <description>&lt;p&gt;Curious about LLM Speed? I Tested Local vs Cloud GPUs (and CPUs too!)&lt;/p&gt;

&lt;p&gt;I've been itching to compare the speed of locally-run LLMs against the big players like OpenAI and Anthropic. So, I decided to put my curiosity to the test with a series of experiments across different hardware setups.&lt;/p&gt;

&lt;p&gt;I started with LM Studio and Ollama on my trusty laptop, but then I thought, "Why not push it further?" So, I fired up my PC with an RTX 3070 GPU and dove into some cloud options like RunPod, AWS, and vast.ai. I wanted to see not just the speed differences but also get a handle on the costs involved.&lt;/p&gt;

&lt;p&gt;Now, I'll be the first to admit my test wasn't exactly scientific. I used just two prompts for inference, which some might argue is a bit basic. But hey, it gives us a solid starting point to compare speeds across different GPUs and to understand the nuances between prompt evaluation (input) and response generation (output) speeds.&lt;/p&gt;

&lt;p&gt;Check out this table of results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Device&lt;/th&gt;
&lt;th&gt;Cost/hr&lt;/th&gt;
&lt;th&gt;Phi-3 input (t/s)&lt;/th&gt;
&lt;th&gt;Phi-3 output (t/s)&lt;/th&gt;
&lt;th&gt;Phi-3 IO ratio&lt;/th&gt;
&lt;th&gt;Llama3 input (t/s)&lt;/th&gt;
&lt;th&gt;Llama3 output (t/s)&lt;/th&gt;
&lt;th&gt;Llama3 IO ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M1 Pro&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;96.73&lt;/td&gt;
&lt;td&gt;30.63&lt;/td&gt;
&lt;td&gt;3.158&lt;/td&gt;
&lt;td&gt;59.12&lt;/td&gt;
&lt;td&gt;25.44&lt;/td&gt;
&lt;td&gt;2.324&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3070&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;318.68&lt;/td&gt;
&lt;td&gt;103.12&lt;/td&gt;
&lt;td&gt;3.090&lt;/td&gt;
&lt;td&gt;167.48&lt;/td&gt;
&lt;td&gt;64.15&lt;/td&gt;
&lt;td&gt;2.611&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;g5g.xlarge (T4G)&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;185.55&lt;/td&gt;
&lt;td&gt;60.85&lt;/td&gt;
&lt;td&gt;3.049&lt;/td&gt;
&lt;td&gt;88.61&lt;/td&gt;
&lt;td&gt;42.33&lt;/td&gt;
&lt;td&gt;2.093&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;g5.12xlarge (4x A10G)&lt;/td&gt;
&lt;td&gt;$5.672&lt;/td&gt;
&lt;td&gt;266.46&lt;/td&gt;
&lt;td&gt;105.97&lt;/td&gt;
&lt;td&gt;2.514&lt;/td&gt;
&lt;td&gt;131.36&lt;/td&gt;
&lt;td&gt;68.07&lt;/td&gt;
&lt;td&gt;1.930&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A40 (runpod)&lt;/td&gt;
&lt;td&gt;$0.49 (spot)&lt;/td&gt;
&lt;td&gt;307.51&lt;/td&gt;
&lt;td&gt;123.73&lt;/td&gt;
&lt;td&gt;2.485&lt;/td&gt;
&lt;td&gt;153.41&lt;/td&gt;
&lt;td&gt;79.33&lt;/td&gt;
&lt;td&gt;1.934&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L40 (runpod)&lt;/td&gt;
&lt;td&gt;$0.69 (spot)&lt;/td&gt;
&lt;td&gt;444.29&lt;/td&gt;
&lt;td&gt;154.22&lt;/td&gt;
&lt;td&gt;2.881&lt;/td&gt;
&lt;td&gt;212.25&lt;/td&gt;
&lt;td&gt;97.51&lt;/td&gt;
&lt;td&gt;2.177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090 (runpod)&lt;/td&gt;
&lt;td&gt;$0.49 (spot)&lt;/td&gt;
&lt;td&gt;470.42&lt;/td&gt;
&lt;td&gt;168.08&lt;/td&gt;
&lt;td&gt;2.799&lt;/td&gt;
&lt;td&gt;222.27&lt;/td&gt;
&lt;td&gt;101.43&lt;/td&gt;
&lt;td&gt;2.191&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2x RTX 4090 (runpod)&lt;/td&gt;
&lt;td&gt;$0.99 (spot)&lt;/td&gt;
&lt;td&gt;426.73&lt;/td&gt;
&lt;td&gt;40.95&lt;/td&gt;
&lt;td&gt;10.4&lt;/td&gt;
&lt;td&gt;168.60&lt;/td&gt;
&lt;td&gt;111.34&lt;/td&gt;
&lt;td&gt;1.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090 (vast.ai)&lt;/td&gt;
&lt;td&gt;$0.24&lt;/td&gt;
&lt;td&gt;335.49&lt;/td&gt;
&lt;td&gt;142.02&lt;/td&gt;
&lt;td&gt;2.36&lt;/td&gt;
&lt;td&gt;145.47&lt;/td&gt;
&lt;td&gt;88.99&lt;/td&gt;
&lt;td&gt;1.63&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
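&lt;p&gt;For clarity, the "IO ratio" columns are simply prompt-evaluation (input) speed divided by generation (output) speed, e.g. for Phi-3 on the M1 Pro:&lt;/p&gt;

```javascript
// IO ratio = input tokens/s divided by output tokens/s (M1 Pro, Phi-3 row).
const ioRatio = 96.73 / 30.63;
console.log(ioRatio.toFixed(3)); // prints "3.158"
```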

&lt;h3&gt;
  
  
  Setup and Specs: For the Tech-Curious
&lt;/h3&gt;

&lt;p&gt;I ran tests on a variety of setups, from cloud services to my local machines. Below is a quick rundown of the hardware. I wrote in more detail about running LLMs in the cloud &lt;a href="https://dev.to/fr4ncis/running-your-own-llms-in-the-cloud-a-practical-guide-55jg"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The benchmarks are run using the Python scripts at &lt;code&gt;https://github.com/MinhNgyuen/llm-benchmark.git&lt;/code&gt;, which lean on Ollama for inference. On each environment, then, we need to set up Ollama and Python, pull the models we want to test, and run the benchmarks.&lt;/p&gt;

&lt;p&gt;On runpod (starting from ollama/ollama Docker template):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# basic setup (on ubuntu)
apt-get update
apt install pip python3 git python3.10-venv -y

# pull models we want to test
ollama pull phi3; ollama pull llama3

python3 -m venv venv
source venv/bin/activate

# download benchmarking script and install dependencies 
git clone https://github.com/MinhNgyuen/llm-benchmark.git
cd llm-benchmark
pip install -r requirements.txt

# run benchmarking script with installed models and these prompts
python benchmark.py --verbose --skip-models nomic-embed-text:latest --prompts "Why is the sky blue?" "Write a report on the financials of Nvidia"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Systems specs
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;Hardware Specification&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Software&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS EC2&lt;/td&gt;
&lt;td&gt;g5g.xlarge, Nvidia T4G&lt;/td&gt;
&lt;td&gt;16GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS EC2&lt;/td&gt;
&lt;td&gt;g5.12xlarge, 4x Nvidia A10G&lt;/td&gt;
&lt;td&gt;96GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runpod&lt;/td&gt;
&lt;td&gt;Nvidia A40&lt;/td&gt;
&lt;td&gt;48GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runpod&lt;/td&gt;
&lt;td&gt;Nvidia L40&lt;/td&gt;
&lt;td&gt;48GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runpod&lt;/td&gt;
&lt;td&gt;Nvidia RTX 4090&lt;/td&gt;
&lt;td&gt;24GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runpod&lt;/td&gt;
&lt;td&gt;2x Nvidia RTX 4090&lt;/td&gt;
&lt;td&gt;48GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vast.ai&lt;/td&gt;
&lt;td&gt;Nvidia RTX 3090&lt;/td&gt;
&lt;td&gt;24GB VRAM&lt;/td&gt;
&lt;td&gt;ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Mac&lt;/td&gt;
&lt;td&gt;M1 Pro 8 CPU Cores (6p + 2e) + 14 GPU cores&lt;/td&gt;
&lt;td&gt;16GB (V)RAM&lt;/td&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local PC&lt;/td&gt;
&lt;td&gt;Nvidia RTX 3070, LLM on GPU&lt;/td&gt;
&lt;td&gt;8GB VRAM&lt;/td&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local PC&lt;/td&gt;
&lt;td&gt;Ryzen 5500 6 CPU Cores, LLM on CPU&lt;/td&gt;
&lt;td&gt;64GB RAM&lt;/td&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  But Wait, What About CPUs?
&lt;/h3&gt;

&lt;p&gt;Curious about CPU performance compared to GPUs? I ran a quick test to give you an idea. I used a single prompt across three different setups:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Mac, which uses its integrated GPU&lt;/li&gt;
&lt;li&gt;A PC with an Nvidia GPU, which expectedly gave the best speed results&lt;/li&gt;
&lt;li&gt;A PC running solely on its CPU&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this test, I used LM Studio, which gives you flexibility over where to load the LLM layers, conveniently letting you choose whether or not to use your system's GPU. I ran the tests with temperature set to 0, using the prompt &lt;code&gt;Who is the president of the US?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here are the results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Device&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Phi3 mini 4k instruct q4&lt;/td&gt;
&lt;td&gt;M1 Pro&lt;/td&gt;
&lt;td&gt;0.04s&lt;/td&gt;
&lt;td&gt;~35 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;RTX 3070&lt;/td&gt;
&lt;td&gt;0.01s&lt;/td&gt;
&lt;td&gt;~97 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Ryzen 5&lt;/td&gt;
&lt;td&gt;0.07s&lt;/td&gt;
&lt;td&gt;~13 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Meta Llama 3 Instruct 7B&lt;/td&gt;
&lt;td&gt;M1 Pro&lt;/td&gt;
&lt;td&gt;0.17s&lt;/td&gt;
&lt;td&gt;~23 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;RTX 3070&lt;/td&gt;
&lt;td&gt;0.02s&lt;/td&gt;
&lt;td&gt;~64 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Ryzen 5&lt;/td&gt;
&lt;td&gt;0.13s&lt;/td&gt;
&lt;td&gt;~7 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma It 2B Q4_K_M&lt;/td&gt;
&lt;td&gt;M1 Pro&lt;/td&gt;
&lt;td&gt;0.02s&lt;/td&gt;
&lt;td&gt;~63 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;RTX 3070&lt;/td&gt;
&lt;td&gt;0.01s&lt;/td&gt;
&lt;td&gt;~170 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Ryzen 5&lt;/td&gt;
&lt;td&gt;0.05s&lt;/td&gt;
&lt;td&gt;~23 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  My takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Dedicated GPUs are speed demons: They outperform Macs when it comes to inference speed, especially considering the costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Size matters (for models): Smaller models can provide a viable experience even on lower-end hardware, as long as you've got the RAM or VRAM to back it up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPUs? Not so hot for inference: Your average desktop CPU is still vastly slower than a dedicated GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gaming GPUs for the win: a beastly gaming GPU like the 4090 is quite cost-effective and can deliver top-notch results, comparable to an H100. Multiple GPUs didn't necessarily make things faster in this scenario.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This little experiment has been a real eye-opener for me, and I'm eager to dive deeper. I'd love to hear your thoughts! What other tests would you like to see? Any specific hardware or models you're curious about?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running Your Own LLMs in the Cloud: A Practical Guide</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Sat, 31 Aug 2024 17:07:25 +0000</pubDate>
      <link>https://dev.to/fr4ncis/running-your-own-llms-in-the-cloud-a-practical-guide-55jg</link>
      <guid>https://dev.to/fr4ncis/running-your-own-llms-in-the-cloud-a-practical-guide-55jg</guid>
      <description>&lt;p&gt;Ever wondered what it would be like to have your own personal fleet of language models at your command? In this post, we'll explore how to run LLMs on cloud GPU instances, giving you more control, better performance, and greater flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Run Your Own LLM Instances?
&lt;/h3&gt;

&lt;p&gt;There are several compelling reasons to consider this approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;em&gt;Data Control&lt;/em&gt;: You have complete oversight of the data sent to and processed by the LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Enhanced Performance&lt;/em&gt;: Access to powerful GPU instances means faster responses and the ability to run larger models.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Model Ownership&lt;/em&gt;: Run fine-tuned models with behaviour that remains consistent over time.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Scalability&lt;/em&gt;: Easily scale resources up or down based on your needs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How does it work?
&lt;/h3&gt;

&lt;p&gt;We will be using Ollama, a tool for running LLMs, along with cost-effective cloud providers like RunPod and vast.ai. Here's the basic process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start a cloud instance with Ollama installed&lt;/li&gt;
&lt;li&gt;Serve the LLM through an API&lt;/li&gt;
&lt;li&gt;Access the API from your local machine&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  RunPod
&lt;/h3&gt;

&lt;p&gt;RunPod (&lt;a href="https://www.runpod.io/" rel="noopener noreferrer"&gt;runpod.io&lt;/a&gt;) offers a streamlined approach to creating cloud instances from Docker images. This means you can quickly spin up an instance that's already configured to serve Ollama and provide API access. It's worth noting that their pricing has become more competitive recently, with instances starting at $0.22/hr for a 24GB VRAM GPU.&lt;/p&gt;

&lt;p&gt;Here's a step-by-step guide to get you started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On runpod.io, navigate to "Pods"&lt;/li&gt;
&lt;li&gt;Click "Deploy" and select an NVIDIA instance&lt;/li&gt;
&lt;li&gt;Choose the "ollama template" based on ollama/ollama:latest Docker image&lt;/li&gt;
&lt;li&gt;Take note of the POD_ID - you'll need this for API access&lt;/li&gt;
&lt;li&gt;Connect to the API via HTTPS on {POD_ID}-11434.proxy.runpod.net:443&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While this setup is straightforward, it does raise some security concerns. The API is exposed on port 11434 without any built-in authentication or access limitations. I attempted to use an SSH tunnel as a workaround (similar to the method I'll describe for vast.ai), but encountered difficulties getting it to work with RunPod. This is an area where I'd appreciate community input on best practices or alternative solutions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Optional: SSH access
&lt;/h4&gt;

&lt;p&gt;If you need direct access to the instance, SSH is available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh tn0b2n8qpybgbv-644112be@ssh.runpod.io -i ~/.ssh/id_ed25519
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once connected, you can verify the GPU specifications using the &lt;code&gt;nvidia-smi&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu70u9auknwh2hdrueoz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu70u9auknwh2hdrueoz2.png" alt="Look at this beefy 4xA100 (320GB VRAM!)" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Vast.ai
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://vast.ai/" rel="noopener noreferrer"&gt;vast.ai&lt;/a&gt; operates as a marketplace where users can both offer and rent GPU instances. The pricing is generally quite competitive, often lower than RunPod, especially for low-end GPUs with less than 24GB of VRAM. However, it also provides access to more powerful systems, like the 4xA100 setup I used to run Llama3.1-405B.&lt;/p&gt;

&lt;p&gt;Setting up an instance on Vast.ai is straightforward. You can select a template for Ollama within their interface, leveraging Docker once again. Unlike RunPod, Vast.ai doesn’t automatically expose a port for API access; instead, I found SSH tunnelling to be the more secure and preferable solution. Once you’ve chosen an instance that meets your requirements, simply click on “Rent” and connect to the instance via SSH, which also sets up the SSH tunnel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -i ~/.ssh/vastai -p 31644 root@162.193.169.187 -L 11434:localhost:11434
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command creates a tunnel that forwards connections from port 11434 on your local machine to port 11434 on the remote machine, allowing you to access services on the remote machine as if they were running locally.&lt;/p&gt;
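&lt;p&gt;With the tunnel open, you can talk to the remote Ollama as if it were local. A minimal sketch using Ollama's standard &lt;code&gt;/api/generate&lt;/code&gt; endpoint (assumes Node 18+ for the global &lt;code&gt;fetch&lt;/code&gt;):&lt;/p&gt;

```javascript
// Build the request body for Ollama's /api/generate endpoint;
// stream: false asks for a single JSON response instead of a token stream.
function buildRequest(prompt, model = "llama3") {
  return { model, prompt, stream: false };
}

// Send it through the tunnel (localhost:11434 forwards to the remote Ollama).
async function askOllama(prompt, model) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify(buildRequest(prompt, model)),
  });
  return (await res.json()).response;
}
```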

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt; the vast.ai image does not run the Ollama server by default. To enable this, you need to modify the template during the instance rental process by adding &lt;code&gt;ollama serve&lt;/code&gt; to the on-start script. Alternatively, you can connect via SSH and manually run the command. Additionally, Vast.ai offers a CLI tool to search for available GPU instances, rent them, run Ollama, and connect directly via the CLI, which is quite neat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; During my initial tests on Vast.ai, I encountered issues where the Ollama server crashed, likely due to instance-specific factors. Restarting the instance resolved the problem, suggesting it might have been an isolated incident. &lt;/p&gt;

&lt;h3&gt;
  
  
  Checking the API and comparing models
&lt;/h3&gt;

&lt;p&gt;I've created some scripts to test models and compare performance (&lt;a href="https://github.com/Fr4ncis/llm-quantisation-comparison" rel="noopener noreferrer"&gt;see here&lt;/a&gt;). Here's how to use them.&lt;/p&gt;

&lt;p&gt;For RunPod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node stream_chat_completion.js -v --function ollama --hostname sbeu57aj70rdqu-11434.proxy.runpod.net --port 443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Vast.ai (using SSH tunnel, keep that terminal open!):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node stream_chat_completion.js -v --function ollama --models mistral-nemo:12b-instruct-2407-q2_K,mistral-nemo:12b-instruct-2407-q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What About AWS?
&lt;/h3&gt;

&lt;p&gt;While I initially looked into AWS EC2, it proved less straightforward and more costly for this specific use case than RunPod and Vast.ai. For completeness, here are the steps I took to set up the NVIDIA drivers and Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get update
sudo apt install ubuntu-drivers-common -y
sudo apt install nvidia-driver-550 -y # use 535 for A10G!!
sudo apt install nvidia-cuda-toolkit -y

# verify that nvidia drivers are running
sudo nvidia-smi 

# install ollama
curl -fsSL https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
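
Once installed, a quick way to confirm the server is actually up is to hit its API locally (Ollama listens on port 11434 by default, and &lt;code&gt;/api/tags&lt;/code&gt; lists the models you have pulled):

```shell
# Smoke test: returns the locally available models as JSON if the
# Ollama server is running on its default port.
curl -s http://localhost:11434/api/tags
```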



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Running LLMs on cloud GPU instances is more accessible (both from a cost and an effort perspective) than I originally thought, and it offers impressive performance across model sizes. The ability to run large models like Llama 3.1 405B, quantised to fit in 320GB of VRAM, is particularly noteworthy.&lt;/p&gt;

&lt;p&gt;However, beyond testing bigger models, I'm not yet sure what the compelling use case is compared to the large LLMs available through APIs (e.g. GPT-4o, Claude 3.5, etc.).&lt;/p&gt;

&lt;p&gt;Have you tried running your own LLMs in the cloud? What has your experience been like? I'd love to hear your thoughts and questions in the comments below!&lt;/p&gt;

</description>
      <category>cloudcomputing</category>
      <category>machinelearning</category>
      <category>aiinfrastructure</category>
      <category>gpupower</category>
    </item>
    <item>
      <title>Unlocking Vision: Evaluating LLMs for Home Security</title>
      <dc:creator>Francesco Mattia</dc:creator>
      <pubDate>Wed, 22 May 2024 08:27:55 +0000</pubDate>
      <link>https://dev.to/fr4ncis/unlocking-vision-evaluating-llms-for-home-security-2dmk</link>
      <guid>https://dev.to/fr4ncis/unlocking-vision-evaluating-llms-for-home-security-2dmk</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;I am diving into the vision capabilities of large language models (LLMs) to see if they can accurately classify images, specifically focusing on spotting door handle positions to tell if they’re locked or unlocked. This experiment includes basic tests to evaluate accuracy, speed, and token usage, offering an initial comparison across models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Fr4ncis/LLM_Image_Classifier"&gt;Code on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiumioj1o5d1wa7r5ybli.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiumioj1o5d1wa7r5ybli.jpg" alt="Image description" width="600" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;

&lt;p&gt;Imagine using a webcam to monitor door security, providing images of door handles in different lighting conditions (day and night). The system’s goal is to classify the handle’s position—vertical (locked) or horizontal (unlocked)—and report the status in a parseable format like JSON. This could be a valuable feature in home automation systems. While traditional machine learning models, which require specific training, might achieve better performance, this experiment explores the potential of large language models (LLMs) in this task.&lt;/p&gt;
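
A parseable response might look like the following; this exact schema is my illustration, not necessarily the one used in the repo:

```shell
# Hypothetical JSON shape for the classifier's answer; the actual prompt
# and response schema live in the linked repository.
echo '{"handle_position": "vertical", "locked": true}'
```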

&lt;h3&gt;
  
  
  Approach
&lt;/h3&gt;

&lt;p&gt;First, I took some pictures and fed them to the leading LLMs (Claude 3 Opus, OpenAI GPT-4) directly through their web interfaces, with no code, to see if they could accurately classify door handle positions. Was this method viable, or would it end up being a waste of time?&lt;/p&gt;

&lt;p&gt;The initial results were encouraging, but I needed to verify if the models could consistently perform well. With a binary classifier, there’s a 50% chance of guessing correctly, so I wanted to ensure the accuracy was truly meaningful.&lt;/p&gt;

&lt;p&gt;To make the outputs as deterministic as possible, I set the temperature to 0.0. To save on tokens and improve processing speed, I resized the images using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;convert original_image.jpg -resize 200x200 resized.jpg&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Next, I wrote a script to access Anthropic models, comparing the classification results to the actual positions indicated by the image filenames (v for vertical, h for horizontal).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./locks_classifier.js -m Haiku -v
🤖 Haiku
images/test01_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 794 ms
images/test02_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 1073 ms
images/test03_h.jpg ❌
📊 In: 202 tkn Out: 11 Time: 604 ms

Correct Responses: (12 / 20) 60%
Total In Tokens: 3976
Total Out Tokens: 220
Avg Time: 598 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
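
The per-image ✅/❌ check above compares the model's answer against the filename convention. That ground-truth lookup can be sketched as a small helper, roughly like this (the real logic lives in the Node script in the repo):

```shell
# Sketch: derive the expected label from filenames like test03_h.jpg,
# where _v means vertical (locked) and _h means horizontal (unlocked).
label_for() {
  case "${1%.*}" in
    *_v) echo vertical ;;
    *_h) echo horizontal ;;
    *)   echo unknown ;;
  esac
}
label_for images/test01_v.jpg   # vertical
label_for images/test03_h.jpg   # horizontal
```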



&lt;p&gt;The results for Haiku were somewhat underwhelming, while Sonnet performed even worse, albeit with similar speed.&lt;/p&gt;

&lt;p&gt;I experimented with few-shot examples embedded in the prompt, but this did not improve the results.&lt;/p&gt;

&lt;p&gt;Out of curiosity, I also tested OpenAI models, adapting my scripts to accommodate their slightly different APIs (it’s frustrating that there isn’t a standard yet, right?).&lt;/p&gt;

&lt;p&gt;The results with OpenAI models were significantly better. Although slightly slower, they were much more accurate in comparison.&lt;/p&gt;

&lt;p&gt;GPT-4-Turbo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./locks_classifier.js -m GPT4 -v
Responses: (16 / 20) 80% 
In Tokens: 6360 Out Tokens: 240
Avg Time: 2246 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The just-released GPT-4o:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./locks_classifier.js -m GPT4o -v
Responses: (20 / 20) 100% 
In Tokens: 6340 Out Tokens: 232
Avg Time: 1751 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What I learnt
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1) LLM Performance:&lt;/strong&gt; I was curious to see how the models would perform, and I am quite impressed by GPT-4o. It delivered high accuracy and reasonable speed. On the other hand, Haiku’s performance was somewhat disappointing, although its lower cost and faster response time make it appealing for many applications. There’s definitely potential to explore Haiku further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Temperature 0.0:&lt;/strong&gt; I was surprised by the varying responses even with the temperature set to 0.0, which should theoretically produce consistent results. This variability was unexpected and suggests that other factors may be influencing the outputs. Any ideas on why this might be happening?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🤖 Haiku *Run #1*
Responses: (5 / 11) 45%
In Tokens: 2222 Out Tokens: 121
Avg Time: 585 ms

🤖 Haiku *Run #2*
Correct Responses: (7 / 11) 64% 
In Tokens: 2222 Out Tokens: 121 
Avg Time: 585 ms

🤖 Haiku *Run #3*
Correct Responses: (4 / 11) 36% 
In Tokens: 2222 Out Tokens: 121
Avg Time: 583 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3) Variability in Tokenization:&lt;/strong&gt; There is significant variability in the number of tokens generated by different models for the same input. This variability impacts cost estimates and efficiency, as token usage directly influences the expense of using these models.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;In Tks&lt;/th&gt;
&lt;th&gt;Out Tks&lt;/th&gt;
&lt;th&gt;$/M In Tks&lt;/th&gt;
&lt;th&gt;$/M Out Tks&lt;/th&gt;
&lt;th&gt;Images per $1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Haiku&lt;/td&gt;
&lt;td&gt;202&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;15,563&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;156&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;1,579&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4&lt;/td&gt;
&lt;td&gt;318&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;565&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;317&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;283&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
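
The "Images per $1" column follows directly from the token counts and prices. For Haiku, for example (small rounding differences aside, this reproduces the table's figure):

```shell
# Cost per classified image = in_tokens * in_price/1M + out_tokens * out_price/1M.
# For Haiku: 202 input tokens at $0.25/M plus 11 output tokens at $1.25/M.
awk 'BEGIN {
  cost = 202 * 0.25 / 1e6 + 11 * 1.25 / 1e6   # dollars per image
  printf "%d\n", 1 / cost                     # images per $1
}'
```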

&lt;p&gt;&lt;strong&gt;4) Variability in Response Time:&lt;/strong&gt; I did not expect the same model, given the same input size, to have such a wide range of response times. This variability suggests that there are underlying factors affecting the inference speed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Res Time (ms)&lt;/th&gt;
&lt;th&gt;Min Res Time (ms)&lt;/th&gt;
&lt;th&gt;Max Res Time (ms)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Haiku&lt;/td&gt;
&lt;td&gt;598&lt;/td&gt;
&lt;td&gt;351&lt;/td&gt;
&lt;td&gt;1073&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;605&lt;/td&gt;
&lt;td&gt;468&lt;/td&gt;
&lt;td&gt;1011&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4&lt;/td&gt;
&lt;td&gt;2246&lt;/td&gt;
&lt;td&gt;1716&lt;/td&gt;
&lt;td&gt;6037&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;1751&lt;/td&gt;
&lt;td&gt;1172&lt;/td&gt;
&lt;td&gt;4559&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Overall, while the accuracy and results are interesting, they can vary significantly depending on the images used. For instance, would larger images improve the performance of models like Haiku and Sonnet?&lt;/p&gt;

&lt;h3&gt;
  
  
  Next steps
&lt;/h3&gt;

&lt;p&gt;Here are a few ideas to dive deeper into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Explore Different Challenges:&lt;/strong&gt; Consider swapping the current challenge with a different task to further test the capabilities of LLMs in various scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Test Local Vision-Enabled Models:&lt;/strong&gt; Evaluate models like Llava 1.5 7B running locally on platforms such as LM Studio or Ollama. Would a local LLM provide a viable option?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Compare with Traditional ML Models:&lt;/strong&gt; Conduct tests against more traditional machine learning models to see how many sample images are needed to achieve similar or better accuracy.&lt;/p&gt;

&lt;p&gt;Let me know if you have any comments or questions. I’d love to hear your suggestions on where to go next and what tests you’d like to see conducted!&lt;/p&gt;

</description>
      <category>genai</category>
      <category>computervision</category>
      <category>homeautomation</category>
      <category>languagemodels</category>
    </item>
  </channel>
</rss>
