<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Britt</title>
    <description>The latest articles on DEV Community by David Britt (@djmbritt).</description>
    <link>https://dev.to/djmbritt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F990226%2F7df27c00-8b86-421a-9853-027f52254fb0.png</url>
      <title>DEV Community: David Britt</title>
      <link>https://dev.to/djmbritt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/djmbritt"/>
    <language>en</language>
    <item>
      <title>Builders' Challenge v3</title>
      <dc:creator>David Britt</dc:creator>
      <pubDate>Fri, 10 Oct 2025 12:24:32 +0000</pubDate>
      <link>https://dev.to/nosana/builders-challenge-v3-2979</link>
      <guid>https://dev.to/nosana/builders-challenge-v3-2979</guid>
      <description>&lt;p&gt;The Nosana Builder Challenge is back! After the success of Agents 101, we're excited to announce &lt;strong&gt;Agents 102&lt;/strong&gt; — a developer challenge where you'll build intelligent AI agents with frontend interfaces and deploy them on the Nosana decentralized compute network.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prize Pool&lt;/strong&gt;: $3,000 USDC for top 10 submissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start Date&lt;/strong&gt;: October 10, 2025&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Submission Deadline&lt;/strong&gt;: October 24, 2025&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Winners Announced&lt;/strong&gt;: October 31, 2025&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Submission Platform&lt;/strong&gt;: &lt;a href="https://earn.superteam.fun/listing/nosana-builders-challenge-agents-102" rel="noopener noreferrer"&gt;SuperTeam Builders Challenge Page&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/nosana-ci/agent-challenge" rel="noopener noreferrer"&gt;Agent Challenge Starter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Your Mission
&lt;/h2&gt;

&lt;p&gt;Build an intelligent AI agent that performs real-world tasks using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mastra framework&lt;/strong&gt; for agent orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling&lt;/strong&gt; to interact with external services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; for enhanced capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom frontend&lt;/strong&gt; to showcase your agent's functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then deploy your complete stack (agent + frontend + LLM) on Nosana's decentralized network!&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Ideas to Inspire You
&lt;/h2&gt;

&lt;p&gt;The possibilities are endless! Here are some ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🤖 &lt;strong&gt;Personal Assistant&lt;/strong&gt; - Schedule management, email drafting, task automation&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Data Analyst Agent&lt;/strong&gt; - Fetch financial data, generate insights, create visualizations&lt;/li&gt;
&lt;li&gt;🌐 &lt;strong&gt;Web Researcher&lt;/strong&gt; - Aggregate information from multiple sources, summarize findings&lt;/li&gt;
&lt;li&gt;🛠️ &lt;strong&gt;DevOps Helper&lt;/strong&gt; - Monitor services, automate deployments, manage infrastructure&lt;/li&gt;
&lt;li&gt;🎨 &lt;strong&gt;Content Creator&lt;/strong&gt; - Generate social media posts, blog outlines, marketing copy&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;Smart Search&lt;/strong&gt; - Multi-source search with AI-powered result synthesis&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Customer Support Bot&lt;/strong&gt; - Answer FAQs, ticket routing, knowledge base queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Be Creative!&lt;/strong&gt; The best agents solve real problems in innovative ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Framework: Mastra
&lt;/h2&gt;

&lt;p&gt;We're using &lt;a href="https://mastra.ai" rel="noopener noreferrer"&gt;Mastra&lt;/a&gt;, the powerful TypeScript framework that makes building AI applications intuitive and fast. Mastra provides all the primitives you need: workflows, agents, RAG, integrations, and evaluations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New to Mastra?&lt;/strong&gt; Check out these resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mastra.ai/en/docs/agents/overview" rel="noopener noreferrer"&gt;Mastra Agent Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mastra.ai/en/guides/guide/stock-agent" rel="noopener noreferrer"&gt;Build an AI Stock Agent Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mastra.ai/en/docs/agents/tools" rel="noopener noreferrer"&gt;Mastra Tool Calling Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Register
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Register at &lt;a href="https://earn.superteam.fun/listing/nosana-builders-challenge-agents-102" rel="noopener noreferrer"&gt;SuperTeam&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Register at the &lt;a href="https://luma.com/zkob1iae" rel="noopener noreferrer"&gt;Luma Event Page&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Star the required repos: &lt;a href="https://github.com/nosana-ci/agent-challenge" rel="noopener noreferrer"&gt;Agent Challenge&lt;/a&gt;, &lt;a href="https://github.com/nosana-ci/nosana-cli" rel="noopener noreferrer"&gt;Nosana CLI&lt;/a&gt;, &lt;a href="https://github.com/nosana-ci/nosana-sdk" rel="noopener noreferrer"&gt;Nosana SDK&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Complete the &lt;a href="https://e86f0b9c.sibforms.com/serve/MUIFALaEjtsXB60SDmm1_DHdt9TOSRCFHOZUSvwK0ANbZDeJH-sBZry2_0YTNi1OjPt_ZNiwr4gGC1DPTji2zdKGJos1QEyVGBzTq_oLalKkeHx3tq2tQtzghyIhYoF4_sFmej1YL1WtnFQyH0y1epowKmDFpDz_EdGKH2cYKTleuTu97viowkIIMqoDgMqTD0uBaZNGwjjsM07T" rel="noopener noreferrer"&gt;registration form&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2: Fork &amp;amp; Build
&lt;/h3&gt;

&lt;p&gt;Fork the &lt;a href="https://github.com/nosana-ci/agent-challenge" rel="noopener noreferrer"&gt;challenge repository&lt;/a&gt; and start building your agent using the provided starter template with Next.js, Mastra, and CopilotKit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fork this repo on GitHub, then clone your fork&lt;/span&gt;
git clone https://github.com/YOUR-USERNAME/agent-challenge

&lt;span class="nb"&gt;cd &lt;/span&gt;agent-challenge

&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env

pnpm i

pnpm run dev:ui      &lt;span class="c"&gt;# Start UI server (port 3000)&lt;/span&gt;
pnpm run dev:agent   &lt;span class="c"&gt;# Start Mastra agent server (port 4111)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Deploy to Nosana
&lt;/h3&gt;

&lt;p&gt;Build your Docker container and deploy your complete stack to the Nosana network using either the &lt;a href="https://dashboard.nosana.com/deploy" rel="noopener noreferrer"&gt;Nosana Dashboard&lt;/a&gt; or the Nosana CLI.&lt;/p&gt;
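&lt;p&gt;For the CLI route, the starter repo ships a job definition that you point at your published Docker image. As a minimal sketch (only the &lt;code&gt;image&lt;/code&gt; property is shown here; the full schema lives in the starter repo's &lt;code&gt;nos_job_def&lt;/code&gt; folder), the edit looks like:&lt;/p&gt;

```json
{
  "image": "docker.io/yourusername/agent-challenge:latest"
}
```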

&lt;h3&gt;
  
  
  Step 4: Submit
&lt;/h3&gt;

&lt;p&gt;Commit your code to your forked GitHub repo and submit your project on the &lt;a href="https://earn.superteam.fun/listing/nosana-builders-challenge-agents-102" rel="noopener noreferrer"&gt;SuperTeam Challenge Page&lt;/a&gt; before the deadline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimum Requirements
&lt;/h2&gt;

&lt;p&gt;Your submission &lt;strong&gt;must&lt;/strong&gt; include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Agent with Tool Calling&lt;/strong&gt; - At least one custom tool/function&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Frontend Interface&lt;/strong&gt; - Working UI to interact with your agent&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Deployed on Nosana&lt;/strong&gt; - Complete stack running on Nosana network&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Docker Container&lt;/strong&gt; - Published to Docker Hub&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Video Demo&lt;/strong&gt; - 1-3 minute demonstration of your deployed agent&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Updated README&lt;/strong&gt; - Clear documentation in your forked repo&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Social Media Post&lt;/strong&gt; - Share on X/BlueSky/LinkedIn with #NosanaAgentChallenge and tag @nosana_ai&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prizes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Top 10 submissions will be rewarded:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🥇 1st Place: $1,000 USDC&lt;/li&gt;
&lt;li&gt;🥈 2nd Place: $750 USDC&lt;/li&gt;
&lt;li&gt;🥉 3rd Place: $450 USDC&lt;/li&gt;
&lt;li&gt;🏅 4th Place: $200 USDC&lt;/li&gt;
&lt;li&gt;🏅 5th-10th Place: $100 USDC each&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Judging Criteria
&lt;/h2&gt;

&lt;p&gt;Submissions are evaluated on four key areas, weighted 25% each:&lt;/p&gt;

&lt;h3&gt;
  
  
  Innovation 🎨
&lt;/h3&gt;

&lt;p&gt;Originality of agent concept, creative use of AI capabilities, unique problem-solving approach&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Implementation 💻
&lt;/h3&gt;

&lt;p&gt;Code quality, proper use of Mastra framework, efficient tool implementation, error handling&lt;/p&gt;

&lt;h3&gt;
  
  
  Nosana Integration ⚡
&lt;/h3&gt;

&lt;p&gt;Successful deployment, resource efficiency, stability and performance, proper containerization&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Impact 🌍
&lt;/h3&gt;

&lt;p&gt;Practical use cases, potential for adoption, clear value proposition, demonstration quality&lt;/p&gt;

&lt;h2&gt;
  
  
  Support &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discord&lt;/strong&gt;: Join &lt;a href="https://nosana.com/discord" rel="noopener noreferrer"&gt;Nosana Discord&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev Chat&lt;/strong&gt;: &lt;a href="https://discord.com/channels/236263424676331521/1354391113028337664" rel="noopener noreferrer"&gt;Builders Challenge Channel&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Twitter&lt;/strong&gt;: Follow &lt;a href="https://x.com/nosana_ai" rel="noopener noreferrer"&gt;@nosana_ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://docs.nosana.io" rel="noopener noreferrer"&gt;Nosana Documentation&lt;/a&gt; | &lt;a href="https://mastra.ai/docs" rel="noopener noreferrer"&gt;Mastra Docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good luck, builders! We can't wait to see the innovative AI agents you create for the Nosana ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy Building!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Want access to exclusive builder perks, early challenges, and Nosana credits?&lt;br&gt;
Subscribe to our newsletter and never miss an update.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://e86f0b9c.sibforms.com/serve/MUIFALaEjtsXB60SDmm1_DHdt9TOSRCFHOZUSvwK0ANbZDeJH-sBZry2_0YTNi1OjPt_ZNiwr4gGC1DPTji2zdKGJos1QEyVGBzTq_oLalKkeHx3tq2tQtzghyIhYoF4_sFmej1YL1WtnFQyH0y1epowKmDFpDz_EdGKH2cYKTleuTu97viowkIIMqoDgMqTD0uBaZNGwjjsM07T" rel="noopener noreferrer"&gt; Join the Nosana Builders Newsletter &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Be the first to know about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🧠 Upcoming Builders Challenges&lt;/li&gt;
&lt;li&gt;💸 New reward opportunities&lt;/li&gt;
&lt;li&gt;⚙ Product updates and feature drops&lt;/li&gt;
&lt;li&gt;🎁 Early-bird credits and partner perks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Join the Nosana builder community today — and build the future of decentralized AI.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>llm</category>
      <category>hackathon</category>
    </item>
    <item>
      <title>How We're Helping AI Startups Cut Costs by 67% With Open-Source Models</title>
      <dc:creator>David Britt</dc:creator>
      <pubDate>Wed, 13 Aug 2025 20:24:36 +0000</pubDate>
      <link>https://dev.to/nosana/how-were-helping-ai-startups-cut-costs-by-67-with-open-source-models-1nl4</link>
      <guid>https://dev.to/nosana/how-were-helping-ai-startups-cut-costs-by-67-with-open-source-models-1nl4</guid>
      <description>&lt;h2&gt;
  
  
  The Hidden Cost of AI-Powered Products
&lt;/h2&gt;

&lt;p&gt;In today's AI-driven product landscape, impressive capabilities often come with significant cost challenges. One of our recent collaborations with an AI presentation tool startup illustrates this perfectly. Their sleek, intuitive platform generates professional slide decks in minutes—but behind the scenes, the economics were threatening their growth potential.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: When AI Costs Threaten Profitability
&lt;/h2&gt;

&lt;p&gt;This startup's AI presentation generator delivers impressive results. Users can go from a simple prompt to a complete, professional slide deck in just 10-15 minutes. The magic behind this capability? A powerful proprietary AI model—but that magic comes at a price: approximately $0.30 per slide.&lt;/p&gt;

&lt;p&gt;For a typical 20-slide presentation, that's $6 in AI costs alone—before accounting for hosting, development, support, or any other business expenses. At scale, these costs threatened to make their unit economics unsustainable, especially for a startup looking to offer competitive pricing.&lt;/p&gt;

&lt;p&gt;They approached us with a challenge: explore whether they could use an open-source model instead and cut their costs to around $0.05 to $0.10 per slide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating the Technical Requirements
&lt;/h2&gt;

&lt;p&gt;After testing their platform, we were impressed with the quality and interactivity of the AI-generated presentations. This level of sophistication meant we needed to find an open-source alternative that could deliver comparable results.&lt;/p&gt;

&lt;p&gt;The startup's application required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-quality text generation for professional content&lt;/li&gt;
&lt;li&gt;Sufficient context window to process complex presentation requirements&lt;/li&gt;
&lt;li&gt;Tool-calling capabilities for integration with their platform&lt;/li&gt;
&lt;li&gt;Reasonable generation speed for a good user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Technical Solution: Optimized Open-Source Models
&lt;/h2&gt;

&lt;p&gt;After evaluating several open-source models, our team identified Qwen3-32B as the optimal starting point for their needs. While not identical to proprietary models, it offers comparable capabilities at a fraction of the cost when deployed on optimized infrastructure.&lt;/p&gt;

&lt;p&gt;Key technical aspects of our solution included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimized deployment&lt;/strong&gt;: NVIDIA A100-80GB or H100 GPUs for maximum performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel processing&lt;/strong&gt;: Support for 40-50 concurrent users on a single GPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient resource utilization&lt;/strong&gt;: Careful memory management to maximize context window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable architecture&lt;/strong&gt;: Ability to grow with their user base&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our platform enables efficient deployment of these models with streamlined infrastructure management—crucial for a startup looking to minimize DevOps overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Business Impact: A 67% Cost Reduction
&lt;/h2&gt;

&lt;p&gt;The numbers tell a compelling story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Current cost with proprietary model&lt;/strong&gt;: $0.30 per slide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projected cost with open-source model&lt;/strong&gt;: $0.10 per slide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost reduction&lt;/strong&gt;: 67%&lt;/li&gt;
&lt;/ul&gt;
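&lt;p&gt;The headline figure checks out with a quick back-of-the-envelope calculation (a minimal shell sketch using the per-slide prices quoted above):&lt;/p&gt;

```shell
# Per-slide prices quoted in the article
proprietary=0.30
open_source=0.10

# Percentage reduction: (0.30 - 0.10) / 0.30 * 100, rounded to a whole percent
reduction=$(awk -v p="$proprietary" -v o="$open_source" \
  'BEGIN { printf "%.0f", (p - o) / p * 100 }')
echo "Cost reduction: ${reduction}%"   # prints: Cost reduction: 67%
```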

&lt;p&gt;This dramatic cost reduction transforms the startup's business possibilities. With improved unit economics, they can now implement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A viable freemium model&lt;/strong&gt;: Offer a free tier using open-source models to drive user acquisition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiered pricing strategy&lt;/strong&gt;: Reserve premium models for paid tiers with higher performance needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive pricing&lt;/strong&gt;: Maintain margins while offering more attractive price points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sustainable scaling&lt;/strong&gt;: Grow their user base without proportional AI cost increases&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Implementation Strategy: "Model Discovery Phase"
&lt;/h2&gt;

&lt;p&gt;Rather than a one-size-fits-all approach, we proposed a "model discovery phase" to find the optimal balance between cost and performance:&lt;/p&gt;

&lt;p&gt;"We'll explore which model is the best for your use case. Even though the model is not as capable as proprietary alternatives, we can provide access at a significantly reduced price."&lt;/p&gt;

&lt;p&gt;The implementation plan included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploying a dedicated endpoint for testing&lt;/li&gt;
&lt;li&gt;Running performance benchmarks with real-world content&lt;/li&gt;
&lt;li&gt;Fine-tuning model parameters for presentation generation&lt;/li&gt;
&lt;li&gt;Gradually optimizing for the ideal cost/performance balance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Possibilities: Grant Program Support
&lt;/h2&gt;

&lt;p&gt;Beyond the immediate cost benefits, this collaboration opens doors to additional opportunities through our grant program, which provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial support for implementation&lt;/li&gt;
&lt;li&gt;Technical resources for optimization&lt;/li&gt;
&lt;li&gt;A showcase example of effective open-source AI implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Technical Details: For the Curious
&lt;/h2&gt;

&lt;p&gt;For those interested in the technical aspects, our team conducted detailed calculations on the economics of running these models at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise GPU costs approximately $1.60/hour&lt;/li&gt;
&lt;li&gt;With parallel processing supporting 40+ users, effective cost drops to approximately $0.30 per million tokens&lt;/li&gt;
&lt;li&gt;For typical presentation workloads, this translates to roughly $0.10 per slide&lt;/li&gt;
&lt;/ul&gt;
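&lt;p&gt;Those figures imply a concrete throughput target: at $1.60/hour and $0.30 per million tokens, a GPU needs to serve roughly 5.3 million tokens per hour, about 1,500 tokens per second aggregated across the 40+ concurrent users. A quick shell sketch of the arithmetic, using only the numbers above:&lt;/p&gt;

```shell
gpu_cost_per_hour=1.60
cost_per_million_tokens=0.30

# Million tokens per hour the GPU must serve to break even at $0.30/Mtok
mtok_per_hour=$(awk -v g="$gpu_cost_per_hour" -v c="$cost_per_million_tokens" \
  'BEGIN { printf "%.2f", g / c }')

# Aggregate tokens per second across all concurrent users
tok_per_sec=$(awk -v m="$mtok_per_hour" \
  'BEGIN { printf "%.0f", m * 1000000 / 3600 }')

echo "${mtok_per_hour} Mtok/hour is about ${tok_per_sec} tokens/sec total"
```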

&lt;h2&gt;
  
  
  Making AI Sustainable for Startups
&lt;/h2&gt;

&lt;p&gt;This case demonstrates a critical reality in today's AI product landscape: proprietary models aren't always the most cost-effective solution. By leveraging open-source alternatives on optimized infrastructure, startups can dramatically improve unit economics while maintaining impressive capabilities.&lt;/p&gt;

&lt;p&gt;For this presentation tool startup, our solution represents the difference between a challenging cost structure and a sustainable, scalable business model. For us, it showcases the practical benefits of our infrastructure for AI-powered applications.&lt;/p&gt;

&lt;p&gt;This collaborative approach to AI implementation represents the future of sustainable AI-powered products—where technical innovation meets business reality to create truly viable solutions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Interested in exploring how open-source AI models could reduce costs for your product? &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSdfh5RIw2hWa1vnXhRUA4QIGADhBMkAHnpjqoNCHbrdF283cg/viewform" rel="noopener noreferrer"&gt; Contact our team to discuss your specific use case. &lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.nosana.com" rel="noopener noreferrer"&gt;Nosana Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nosana.com/discord" rel="noopener noreferrer"&gt;Join the Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nosana.com/x" rel="noopener noreferrer"&gt;Follow us on X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nosana.com/github" rel="noopener noreferrer"&gt;Nosana on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>llm</category>
      <category>startup</category>
    </item>
    <item>
      <title>Nosana Builders Challenge: Agent-101</title>
      <dc:creator>David Britt</dc:creator>
      <pubDate>Wed, 25 Jun 2025 09:52:48 +0000</pubDate>
      <link>https://dev.to/nosana/nosana-builders-challenge-agent-101-1po9</link>
      <guid>https://dev.to/nosana/nosana-builders-challenge-agent-101-1po9</guid>
      <description>&lt;p&gt;The main goal of this &lt;code&gt;Nosana Builders Challenge&lt;/code&gt; to teach participants to build and deploy agents. This first step will be in running a basic AI agent and giving it some basic functionality. Participants will add a tool, for the tool calling capabilities of the agent. These are basically some TypeScript functions, that will, for example, retrieve some data from a weather API, post a tweet via an API call, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/mastra-ai/mastra" rel="noopener noreferrer"&gt;Mastra&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;For this challenge, we will be using Mastra to build our agent and its tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Mastra is an opinionated TypeScript framework that helps you build AI applications and features quickly. It gives you the set of primitives you need: workflows, agents, RAG, integrations, and evals. You can run Mastra on your local machine, or deploy to a serverless cloud.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Required Reading
&lt;/h3&gt;

&lt;p&gt;We recommend reading the following sections to get started with how to create an Agent and how to implement Tool Calling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mastra.ai/en/docs/agents/overview" rel="noopener noreferrer"&gt;Mastra Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mastra.ai/en/guides/guide/stock-agent" rel="noopener noreferrer"&gt;Mastra Guide: Build an AI stock agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;To get started, run the following commands to start developing:&lt;br&gt;
We recommend using &lt;a href="https://pnpm.io/installation" rel="noopener noreferrer"&gt;pnpm&lt;/a&gt;, but you can try npm, or bun if you prefer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm &lt;span class="nb"&gt;install
&lt;/span&gt;pnpm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Assignment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge Overview
&lt;/h3&gt;

&lt;p&gt;Welcome to the Nosana AI Agent Hackathon! Your mission is to build and deploy an AI agent on Nosana.&lt;br&gt;
While we provide a weather agent as an example, your creativity is the limit. Build agents that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Beginner Level:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple Calculator&lt;/strong&gt;: Perform basic math operations with explanations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Todo List Manager&lt;/strong&gt;: Help users track their daily tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Intermediate Level:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;News Summarizer&lt;/strong&gt;: Fetch and summarize latest news articles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crypto Price Checker&lt;/strong&gt;: Monitor cryptocurrency prices and changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Stats Reporter&lt;/strong&gt;: Fetch repository statistics and insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advanced Level:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blockchain Monitor&lt;/strong&gt;: Track and alert on blockchain activities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trading Strategy Bot&lt;/strong&gt;: Automate simple trading strategies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy Manager&lt;/strong&gt;: Deploy and manage applications on Nosana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or any other innovative AI agent idea at your skill level!&lt;/p&gt;
&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fork the &lt;a href="https://github.com/nosana-ai/agent-challenge" rel="noopener noreferrer"&gt;Nosana Agent Challenge&lt;/a&gt;&lt;/strong&gt; to your GitHub account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clone your fork&lt;/strong&gt; locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install dependencies&lt;/strong&gt; with &lt;code&gt;pnpm install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the development server&lt;/strong&gt; with &lt;code&gt;pnpm run dev&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build your agent&lt;/strong&gt; using the Mastra framework&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  How to build your Agent
&lt;/h3&gt;

&lt;p&gt;Here we will describe the steps needed to build an agent.&lt;/p&gt;
&lt;h4&gt;
  
  
  Folder Structure
&lt;/h4&gt;

&lt;p&gt;Provided in this repo is the &lt;code&gt;Weather Agent&lt;/code&gt;.&lt;br&gt;
This is a fully working agent that lets a user chat with an LLM and fetches real-time weather data for the provided location.&lt;/p&gt;

&lt;p&gt;There are two main folders we need to pay attention to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="//./src/mastra/agents/weather-agent/"&gt;src/mastra/agents/weather-agent/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//./src/mastra/agents/your-agent/"&gt;src/mastra/agents/your-agents/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;code&gt;src/mastra/agents/weather-agent/&lt;/code&gt; you will find a complete example of a working agent: the agent definition, API calls, and interface definitions, everything needed to get a full-fledged agent up and running.&lt;br&gt;
In &lt;code&gt;src/mastra/agents/your-agents/&lt;/code&gt; you will find a bare-bones example of the components and imports needed to start building your own agent. We recommend you rename this folder and its files to get started.&lt;/p&gt;

&lt;p&gt;Rename these files to reflect the purpose of your agent and tools. You can use the Weather Agent example as a guide while you build, then delete its files before your final submission.&lt;/p&gt;

&lt;p&gt;As a bonus, for the ambitious ones, we have also provided &lt;code&gt;src/mastra/agents/weather-agent/weather-workflow.ts&lt;/code&gt; as an example. It shows how you can chain agents and tools into a workflow: the user provides their location, and the agent retrieves the weather for that location and suggests an itinerary.&lt;/p&gt;
&lt;h3&gt;
  
  
  LLM-Endpoint
&lt;/h3&gt;

&lt;p&gt;Agents depend on an LLM to be able to do their work.&lt;/p&gt;
&lt;h4&gt;
  
  
  Running Your Own LLM with Ollama
&lt;/h4&gt;

&lt;p&gt;The default configuration uses a local &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; LLM.&lt;br&gt;
For local development, or if you prefer to run your own LLM, you can use Ollama to serve the lightweight &lt;code&gt;qwen2.5:1.5b&lt;/code&gt; model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation &amp;amp; Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://ollama.com/download" rel="noopener noreferrer"&gt; Install Ollama &lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start Ollama service&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Pull and run the &lt;code&gt;qwen2.5:1.5b&lt;/code&gt; model&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5:1.5b
ollama run qwen2.5:1.5b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Update your &lt;code&gt;.env&lt;/code&gt; file&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are two predefined environments defined in the &lt;code&gt;.env&lt;/code&gt; file. One for local development and another, with a larger model, &lt;code&gt;qwen2.5:32b&lt;/code&gt;, for more complex use cases.&lt;/p&gt;
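&lt;p&gt;A hedged sketch of what such a &lt;code&gt;.env&lt;/code&gt; might contain (the variable names below are illustrative; mirror the keys in the repo's &lt;code&gt;.env.example&lt;/code&gt;):&lt;/p&gt;

```shell
# Illustrative names only -- use the keys from .env.example.
# Local development: lightweight model served by Ollama
MODEL_NAME_AT_ENDPOINT=qwen2.5:1.5b
API_BASE_URL=http://localhost:11434/api

# Larger model for more complex use cases (commented out by default)
# MODEL_NAME_AT_ENDPOINT=qwen2.5:32b
```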

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;qwen2.5:1.5b&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lightweight (only ~1GB)&lt;/li&gt;
&lt;li&gt;Fast inference on CPU&lt;/li&gt;
&lt;li&gt;Supports tool calling&lt;/li&gt;
&lt;li&gt;Great for development and testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do note that &lt;code&gt;qwen2.5:1.5b&lt;/code&gt; is not suited for complex tasks.&lt;/p&gt;

&lt;p&gt;The Ollama server will run on &lt;code&gt;http://localhost:11434&lt;/code&gt; by default and is compatible with the OpenAI API format that Mastra expects.&lt;/p&gt;
&lt;h3&gt;
  
  
  Testing your Agent
&lt;/h3&gt;

&lt;p&gt;You can read the &lt;a href="https://mastra.ai/en/docs/local-dev/mastra-dev" rel="noopener noreferrer"&gt;Mastra Documentation: Playground&lt;/a&gt; to learn more on how to test your agent locally.&lt;br&gt;
Before deploying your agent to Nosana, it's crucial to thoroughly test it locally to ensure everything works as expected. Follow these steps to validate your agent:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local Testing:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start the development server&lt;/strong&gt; with &lt;code&gt;pnpm run dev&lt;/code&gt; and navigate to &lt;code&gt;http://localhost:8080&lt;/code&gt; in your browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test your agent's conversation flow&lt;/strong&gt; by interacting with it through the chat interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify tool functionality&lt;/strong&gt; by triggering scenarios that call your custom tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check error handling&lt;/strong&gt; by providing invalid inputs or testing edge cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor the console logs&lt;/strong&gt; to ensure there are no runtime errors or warnings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Docker Testing:&lt;/strong&gt;&lt;br&gt;
After building your Docker container, test it locally before pushing to the registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build your container&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; yourusername/agent-challenge:latest &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Run it locally with environment variables&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env yourusername/agent-challenge:latest

&lt;span class="c"&gt;# Test the containerized agent at http://localhost:8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure your agent responds correctly and all tools function properly within the containerized environment. This step is critical as the Nosana deployment will use this exact container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Submission Requirements
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Code Development
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Fork this repository and develop your AI agent&lt;/li&gt;
&lt;li&gt;Your agent must include at least one custom tool (function)&lt;/li&gt;
&lt;li&gt;Code must be well-documented and include clear setup instructions&lt;/li&gt;
&lt;li&gt;Include environment variable examples in a &lt;code&gt;.env.example&lt;/code&gt; file&lt;/li&gt;
&lt;/ul&gt;
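&lt;p&gt;A minimal &lt;code&gt;.env.example&lt;/code&gt; might look like the following. The variable names here are only placeholders; use whichever variables your agent actually reads (the port matches the &lt;code&gt;docker run -p 8080:8080&lt;/code&gt; commands below):&lt;/p&gt;

```shell
# .env.example — placeholder values only; copy to .env and fill in real values.
# Variable names are illustrative; match the ones your agent reads.
PORT=8080
LLM_BASE_URL=http://localhost:11434
LLM_API_KEY=replace-with-your-key
```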

&lt;h4&gt;
  
  
  2. Docker Container
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Create a &lt;code&gt;Dockerfile&lt;/code&gt; for your agent&lt;/li&gt;
&lt;li&gt;Build and push your container to Docker Hub or GitHub Container Registry&lt;/li&gt;
&lt;li&gt;Container must be publicly accessible&lt;/li&gt;
&lt;li&gt;Include the container URL in your submission&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Build, Run, Publish
&lt;/h5&gt;

&lt;p&gt;Note: You'll need an account on &lt;a href="https://hub.docker.com/" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Build and tag&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; yourusername/agent-challenge:latest &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Run the container locally&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 yourusername/agent-challenge:latest

&lt;span class="c"&gt;# Login&lt;/span&gt;
docker login

&lt;span class="c"&gt;# Push&lt;/span&gt;
docker push yourusername/agent-challenge:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Nosana Deployment
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Deploy your Docker container on Nosana&lt;/li&gt;
&lt;li&gt;Your agent must successfully run on the Nosana network&lt;/li&gt;
&lt;li&gt;Include the Nosana job ID or deployment link&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Nosana Job Definition
&lt;/h5&gt;

&lt;p&gt;We have included a Nosana job definition at &lt;code&gt;./nos_job_def/nosana_mastra.json&lt;/code&gt;, which you can use to publish your agent to the Nosana network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Deploying using &lt;a href="https://github.com/nosana-ci/nosana-cli/" rel="noopener noreferrer"&gt;@nosana/cli&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Edit the file and set the &lt;code&gt;image&lt;/code&gt; property to your published Docker image: &lt;code&gt;"image": "docker.io/yourusername/agent-challenge:latest"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Download and install the &lt;a href="https://github.com/nosana-ci/nosana-cli/" rel="noopener noreferrer"&gt;@nosana/cli&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Load your wallet with some funds

&lt;ul&gt;
&lt;li&gt;Retrieve your address with: &lt;code&gt;nosana address&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Go to our &lt;a href="https://nosana.com/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; and ask for some NOS and SOL to publish your job.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Run: &lt;code&gt;nosana job post --file nosana_mastra.json --market nvidia-3060 --timeout 30&lt;/code&gt;
&lt;/li&gt;

&lt;li&gt;Go to the &lt;a href="https://dashboard.nosana.com/deploy" rel="noopener noreferrer"&gt;Nosana Dashboard&lt;/a&gt; to see your job&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;B. Deploying using the &lt;a href="https://dashboard.nosana.com/deploy" rel="noopener noreferrer"&gt;Nosana Dashboard&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure you have the &lt;a href="https://phantom.com/" rel="noopener noreferrer"&gt;Phantom&lt;/a&gt; wallet extension installed in your browser.&lt;/li&gt;
&lt;li&gt;Go to our &lt;a href="https://nosana.com/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; and ask for some NOS and SOL to publish your job.&lt;/li&gt;
&lt;li&gt;Click the &lt;code&gt;Expand&lt;/code&gt; button on the &lt;a href="https://dashboard.nosana.com/deploy" rel="noopener noreferrer"&gt;Nosana Dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Copy and paste your edited Nosana job definition into the text area&lt;/li&gt;
&lt;li&gt;Choose an appropriate GPU for the AI model that you are using&lt;/li&gt;
&lt;li&gt;Click &lt;code&gt;Deploy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Video Demo
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Record a 1-3 minute video demonstrating:

&lt;ul&gt;
&lt;li&gt;Your agent running on Nosana&lt;/li&gt;
&lt;li&gt;Key features and functionality&lt;/li&gt;
&lt;li&gt;Real-world use case demonstration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Upload to YouTube, Loom, or similar platform&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. Documentation
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Update this README with:

&lt;ul&gt;
&lt;li&gt;Agent description and purpose&lt;/li&gt;
&lt;li&gt;Setup instructions&lt;/li&gt;
&lt;li&gt;Environment variables required&lt;/li&gt;
&lt;li&gt;Docker build and run commands&lt;/li&gt;
&lt;li&gt;Example usage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Submission Process
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complete all requirements&lt;/strong&gt; listed above&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commit all of your changes to the &lt;code&gt;main&lt;/code&gt; branch of your forked repository&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;All your code changes&lt;/li&gt;
&lt;li&gt;Updated README&lt;/li&gt;
&lt;li&gt;Link to your Docker container&lt;/li&gt;
&lt;li&gt;Link to your video demo&lt;/li&gt;
&lt;li&gt;Nosana deployment proof&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social Media Post&lt;/strong&gt;: Share your submission on X (Twitter)

&lt;ul&gt;
&lt;li&gt;Tag @nosana_ai&lt;/li&gt;
&lt;li&gt;Include a brief description of your agent&lt;/li&gt;
&lt;li&gt;Add hashtag #NosanaAgentChallenge&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Finalize your submission on the &lt;a href="https://earn.superteam.fun/agent-challenge" rel="noopener noreferrer"&gt;SuperTeam challenge page&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Remember to add your forked GitHub repository link.&lt;/li&gt;
&lt;li&gt;Remember to add a link to your X post.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Judging Criteria
&lt;/h3&gt;

&lt;p&gt;Submissions will be evaluated based on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Innovation&lt;/strong&gt; (25%)

&lt;ul&gt;
&lt;li&gt;Originality of the agent concept&lt;/li&gt;
&lt;li&gt;Creative use of AI capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical Implementation&lt;/strong&gt; (25%)

&lt;ul&gt;
&lt;li&gt;Code quality and organization&lt;/li&gt;
&lt;li&gt;Proper use of the Mastra framework&lt;/li&gt;
&lt;li&gt;Efficient tool implementation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nosana Integration&lt;/strong&gt; (25%)

&lt;ul&gt;
&lt;li&gt;Successful deployment on Nosana&lt;/li&gt;
&lt;li&gt;Resource efficiency&lt;/li&gt;
&lt;li&gt;Stability and performance&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-World Impact&lt;/strong&gt; (25%)

&lt;ul&gt;
&lt;li&gt;Practical use cases&lt;/li&gt;
&lt;li&gt;Potential for adoption&lt;/li&gt;
&lt;li&gt;Value proposition&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Prizes
&lt;/h3&gt;

&lt;p&gt;We’re awarding the &lt;strong&gt;top 10 submissions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🥇 1st: $1,000 USDC&lt;/li&gt;
&lt;li&gt;🥈 2nd: $750 USDC&lt;/li&gt;
&lt;li&gt;🥉 3rd: $450 USDC&lt;/li&gt;
&lt;li&gt;🏅 4th: $200 USDC&lt;/li&gt;
&lt;li&gt;🔟 5th–10th: $100 USDC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All prizes are paid out directly to participants on &lt;a href="https://superteam.fun" rel="noopener noreferrer"&gt;SuperTeam&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.nosana.io" rel="noopener noreferrer"&gt;Nosana Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mastra.ai/docs" rel="noopener noreferrer"&gt;Mastra Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mastra.ai/en/guides/guide/stock-agent" rel="noopener noreferrer"&gt;Mastra Guide: Build an AI stock agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nosana-ci/nosana-cli" rel="noopener noreferrer"&gt;Nosana CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com" rel="noopener noreferrer"&gt;Docker Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Join the &lt;a href="https://discord.gg/nosana" rel="noopener noreferrer"&gt;Nosana Discord&lt;/a&gt; for technical support, where we have a dedicated &lt;a href="https://discord.com/channels/236263424676331521/1354391113028337664" rel="noopener noreferrer"&gt;Builders Challenge Dev chat&lt;/a&gt; channel.&lt;/li&gt;
&lt;li&gt;Follow &lt;a href="https://x.com/nosana_ai" rel="noopener noreferrer"&gt;@nosana_ai&lt;/a&gt; for updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Important Notes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ensure your agent doesn't expose sensitive data&lt;/li&gt;
&lt;li&gt;Test thoroughly before submission&lt;/li&gt;
&lt;li&gt;Keep your Docker images lightweight&lt;/li&gt;
&lt;li&gt;Document all dependencies clearly&lt;/li&gt;
&lt;li&gt;Make your code reproducible&lt;/li&gt;
&lt;li&gt;You can vibe code it if you want 😉&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Only one submission per participant&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Submissions that do not compile or do not meet the specified requirements will not be considered&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deadline: October 24, 2025&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Winners will be announced on October 31, 2025; stay tuned to our socials for details&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Don’t Miss Nosana Builder Challenge Updates
&lt;/h3&gt;

&lt;p&gt;Good luck, builders! We can't wait to see the innovative AI agents you create for the Nosana ecosystem.&lt;br&gt;
&lt;strong&gt;Happy Building!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>llm</category>
      <category>datascience</category>
    </item>
    <item>
      <title>LLM Benchmarking: Cost-Efficient Performance</title>
      <dc:creator>David Britt</dc:creator>
      <pubDate>Wed, 09 Apr 2025 05:08:00 +0000</pubDate>
      <link>https://dev.to/nosana/llm-benchmarking-cost-efficient-performance-5h69</link>
      <guid>https://dev.to/nosana/llm-benchmarking-cost-efficient-performance-5h69</guid>
      <description>&lt;p&gt;Economic viability is one of the most important factors in the success of new products and applications. No less so for Nosana. We show that the consumer-grade flagship RTX 4090 can provide LLM inference at a staggering 2.5X lower cost compared to the industry-standard enterprise A100 GPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nosana.io/blog/llm_benchmarking_on_the_nosana_grid/" rel="noopener noreferrer"&gt;Our previous article&lt;/a&gt; showed how we implemented a uniform LLM benchmark that helps track individual node performance and configurations. With this information, we are able to design fairer GPU compute markets by lowering their performance variation. But although the initial benchmark data is valuable in terms of market design optimization, it does not give meaningful insights into the realistic performance we are interested in. This is because the benchmark was designed to be compatible with all nodes on the network but it wasn’t able to test the full capacity of each node.&lt;/p&gt;

&lt;p&gt;In this article, we address this limitation and zoom in on the performance comparison between consumer-grade and enterprise hardware. We implement benchmarks and use the results in a cost-adjusted performance analysis to highlight the competitive advantage of the Nosana Grid over traditional compute providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM Inference
&lt;/h3&gt;

&lt;p&gt;When we talk about performance measurements in the context of LLM inference, we are mostly interested in inference speed. To better understand the factors influencing this speed, let’s begin with a brief overview of how LLM inference works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nosana.io/blog/llm_benchmarking_on_the_nosana_grid" rel="noopener noreferrer"&gt;&lt;em&gt;The previous blog post&lt;/em&gt;&lt;/a&gt; &lt;em&gt;went into more detail on this topic. If you have read it, you can skip ahead to this section. Readers who are interested in an in-depth explanation should refer to the previous blog post.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As far as computers are concerned, LLMs consist of two files: a large file containing the model parameters, and a smaller file that runs the model. The size of an LLM is determined by the number of parameters it has and the precision of those parameters, i.e. the accuracy with which they are represented, measured in bits. To calculate an example, let's take the popular LLM Llama 3.1 with 8 billion parameters and a commonly used 16-bit floating-point precision. At 16 bits, each parameter takes 2 bytes, and 2 bytes times 8 billion parameters gives a total model size of 16 GB. The model size is an important factor in the usability of LLMs because it determines which types of hardware are able to load the model.&lt;/p&gt;
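&lt;p&gt;The same arithmetic generalizes to any model; a quick sketch in Python:&lt;/p&gt;

```python
def model_size_gb(n_params: float, bits_per_param: int) -> float:
    """Weights size = number of parameters x bytes per parameter."""
    bytes_per_param = bits_per_param / 8
    return n_params * bytes_per_param / 1e9  # decimal gigabytes

# Llama 3.1 8B at 16-bit precision: 8e9 parameters x 2 bytes = 16 GB
print(model_size_gb(8e9, 16))   # → 16.0
# The 70B variant at the same precision: 140 GB
print(model_size_gb(70e9, 16))  # → 140.0
```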

&lt;p&gt;Once loaded onto hardware, LLMs perform next-token prediction. This means that LLMs iteratively predict and add single tokens to an input sequence that is provided as context. This process of generating tokens is called inference. To perform inference, an LLM goes through two stages, the &lt;strong&gt;prefill&lt;/strong&gt; phase and the &lt;strong&gt;decoding&lt;/strong&gt; phase. During the prefill phase, the model processes all input tokens simultaneously to compute all the necessary information for generating subsequent tokens. During the decoding phase, the model uses the cached information computed during the prefill phase to generate new tokens.&lt;/p&gt;

&lt;p&gt;In practice, the prefill phase corresponds to the time you have to wait until the LLM starts generating its response. It is a relatively short period that makes efficient use of available computing capacity through highly parallelized computations. We call the prefill phase &lt;strong&gt;compute-bound&lt;/strong&gt; because it is limited by the computational capacity of the hardware running the LLM.&lt;/p&gt;

&lt;p&gt;The decoding phase generally takes up the bulk of the inference time and corresponds to the period between the generation of the first token and the completion of the last. This process is not as computationally efficient as the prefill phase because it requires constantly moving cached computations between the processing units and memory. We call the decoding phase &lt;strong&gt;memory-bound&lt;/strong&gt; because its performance is limited by how fast data can be moved to and from memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPUs &amp;amp; Inference
&lt;/h2&gt;

&lt;p&gt;In large production use cases, LLM inference is predominantly performed on high-end graphics processing units, or GPUs. Three key specifications of GPUs are particularly relevant to LLM inference:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;VRAM (Video Random Access Memory): The amount of available memory on the GPU&lt;/li&gt;
&lt;li&gt;FLOPS (Floating Point Operations Per Second): A measure of the GPU’s computational capacity&lt;/li&gt;
&lt;li&gt;Memory bandwidth: The speed at which data can be transferred within the GPU&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The processing of single sequences as described in the previous section usually leaves the VRAM and computational capacity of GPUs underutilized. To make better use of these resources, we need to increase the number of tokens processed and computations performed. We can do this by processing a batch of multiple sequences at once. In production use cases this means that prompts from different users get bundled together and processed at the same time. Handling multiple requests, or &lt;strong&gt;concurrent users&lt;/strong&gt;, plays an important role in optimizing GPU usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Research
&lt;/h2&gt;

&lt;p&gt;Alright, with the basics of LLM inference in mind, let's get more specific about the goal of the current research. Previously, we benchmarked the performance of all GPU types on the Nosana grid using Llama 3.1–8B with a single concurrent user. Running inference with a single concurrent user leads to GPU underutilization, limiting the insights gained when comparing performance with other compute providers. In this article, we set up benchmarks for accurate performance comparisons. We’ll focus our analysis on comparing Nosana’s performance against established cloud computing platforms. This comparison involves two key benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A baseline assessment measuring the performance of current market leaders&lt;/li&gt;
&lt;li&gt;An experimental evaluation of the Nosana grid’s performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Baseline Benchmark
&lt;/h3&gt;

&lt;p&gt;Similar to running models on the Nosana grid, you can use a fully customized Docker image when renting a GPU from a compute provider. This means that we can keep important variables such as the model files and LLM serving framework constant for our experiment and only have to pick the &lt;em&gt;GPU type&lt;/em&gt; and the &lt;em&gt;price of usage&lt;/em&gt; for a fair comparison.&lt;/p&gt;

&lt;p&gt;Because running LLMs in a production setting requires high capacity in terms of computation and memory, there are two main types to consider when renting a GPU, the A100 and the H100. The H100 is a newer and more powerful GPU than the A100, but both cards are able to load in and effectively run most open-source models. Given its relative affordability and arguable cost-effectiveness, we opt for the A100 as our baseline GPU.&lt;/p&gt;

&lt;p&gt;For the price of usage variable there are more options to consider because there are various compute providers that offer a specific rental price per hour. To pick a competitive price we made use of the website &lt;a href="https://getdeploying.com/reference/cloud-gpu" rel="noopener noreferrer"&gt;https://getdeploying.com&lt;/a&gt;, which shows aggregated GPU rental prices for all cloud providers. At the time of writing the cheapest rental price for an A100–80GB is offered by &lt;a href="https://crusoe.ai/" rel="noopener noreferrer"&gt;Crusoe&lt;/a&gt; at $1.65 per hour, so we will use this price for our analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Experimental Benchmark
&lt;/h3&gt;

&lt;p&gt;To compare the Nosana grid with our baseline approach, we need to determine the GPU type and an accompanying price per hour for our experimental benchmark. We’ll leave the price per hour as a variable to allow comparisons across multiple hypothetical pricing scenarios. This means that we only have to choose the GPU type.&lt;/p&gt;

&lt;p&gt;The RTX 4090 is the most frequently encountered GPU on the Nosana grid, closely followed by the RTX 3090. The prevalence of the RTX 4090 and RTX 3090 GPUs on the Nosana grid highlights one of the network’s primary advantages over centralized compute providers: its ability to tap into a pool of underutilized consumer-grade hardware. Consequently, the most interesting comparison to make for Nosana is between popular enterprise hardware such as the A100 and underutilized consumer hardware such as the RTX 4090. Therefore, we pick the RTX 4090 for our experimental benchmark.&lt;/p&gt;

&lt;h3&gt;
  
  
  Research Setup
&lt;/h3&gt;

&lt;p&gt;Let's go over the rest of the research setup. Now that we have determined the fixed variables for the baseline and the experimental condition, we have to pick the shared variables. The model, the LLM serving framework, and the number of concurrent users.&lt;/p&gt;

&lt;p&gt;For the &lt;em&gt;model,&lt;/em&gt; we picked Llama 3.1–8B. Llama models are the most used open-source LLMs in the world, and the 8 billion variant makes it possible to easily load the model on both the A100 and the RTX 4090 GPUs.&lt;/p&gt;

&lt;p&gt;As an LLM &lt;em&gt;serving framework,&lt;/em&gt; we experimented with both &lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; and &lt;a href="https://github.com/InternLM/lmdeploy" rel="noopener noreferrer"&gt;LMdeploy&lt;/a&gt;. vLLM is one of the most popular frameworks and is frequently mentioned by our prospective clients. LMdeploy is a highly optimized framework and has shown the highest inference speed in &lt;a href="https://www.bentoml.com/blog/benchmarking-llm-inference-backends" rel="noopener noreferrer"&gt;recent benchmarking research&lt;/a&gt;. For both the baseline and experimental benchmarks, we used each framework's out-of-the-box inference configuration.&lt;/p&gt;

&lt;p&gt;In our benchmarking script we implemented functionality to send &lt;em&gt;concurrent user&lt;/em&gt; requests. While our previous article demonstrated that the 4090 slightly outperforms the A100 for a single concurrent user, this scenario rarely reflects optimized production environments. Therefore, we tested performance using 1, 5, 10, 50, and 100 concurrent users to see how the comparison holds up under different workloads.&lt;/p&gt;
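&lt;p&gt;Our actual benchmarking script is not shown here, but the core idea of simulating concurrent users can be sketched in a few lines of async Python; the &lt;code&gt;send_request&lt;/code&gt; coroutine is a stand-in for a real HTTP call to the serving framework:&lt;/p&gt;

```python
import asyncio

async def send_request(user_id: int) -> int:
    """Stand-in for a single user's inference request; a real
    benchmark would POST a prompt to the model server here."""
    await asyncio.sleep(0.01)  # simulated network + inference latency
    return user_id

async def run_level(concurrent_users: int) -> list[int]:
    # Launch all requests at once and wait for every response,
    # which models a fixed level of user concurrency.
    return await asyncio.gather(
        *(send_request(u) for u in range(concurrent_users))
    )

# The concurrency levels used in our experiments:
for level in (1, 5, 10, 50, 100):
    responses = asyncio.run(run_level(level))
    print(level, len(responses))
```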

&lt;p&gt;As an evaluation metric, we used tokens produced per second, which directly measures inference speed. We evaluated both the A100 and RTX 4090 GPUs across all combinations of the variables mentioned above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyzf6zxo0ty6nozempg4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyzf6zxo0ty6nozempg4.png" alt="Image description" width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;In the above graphs, we can see the performance of the RTX 4090 and the A100 with the LMdeploy and vLLM frameworks for different levels of concurrency. The graphs show that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At a low number of concurrent users, the A100s outperform the 4090s. However, this advantage shrinks as the number of concurrent users increases.&lt;/li&gt;
&lt;li&gt;At a higher number of concurrent users, LMdeploy greatly outperforms vLLM with its standard settings. The RTX 4090 with LMdeploy even outperforms the A100 with vLLM at 50 and 100 concurrent users.&lt;/li&gt;
&lt;li&gt;You need 1.5–2 RTX 4090s to reproduce the performance of an A100.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Price Comparison
&lt;/h2&gt;

&lt;p&gt;Considering the respective purchase costs of the RTX 4090 and the A100, the performance results of the RTX 4090 are quite impressive. In this section, we analyze both GPUs’ performance while taking into account their purchase cost and operational expenses. For the cost-adjusted analysis we assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The purchase cost of an RTX 4090 is $1,750.&lt;/li&gt;
&lt;li&gt;The purchase cost of an A100–80GB is $10,000.&lt;/li&gt;
&lt;li&gt;2 RTX 4090s are required to reproduce the performance of an A100.&lt;/li&gt;
&lt;li&gt;The price of energy is equal to the average American price of $0.16 per kWh.&lt;/li&gt;
&lt;li&gt;The energy consumption of an RTX 4090 is 300W.&lt;/li&gt;
&lt;li&gt;The energy consumption of an A100 is 250W.&lt;/li&gt;
&lt;li&gt;The price for renting an A100 is $1.65 per hour.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s start by calculating the return on investment (ROI) for the A100, which measures the amount of return relative to the investment cost. This helps us determine how quickly each GPU setup can earn its initial cost and start generating profit.&lt;/p&gt;

&lt;h4&gt;
  
  
  A100 ROI
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Initial Investment: $10,000&lt;/li&gt;
&lt;li&gt;Hourly Energy Cost: 0.25kW * 1 hour * $0.16/kWh = $0.04 per hour&lt;/li&gt;
&lt;li&gt;Hourly Rental Revenue: $1.65 per hour&lt;/li&gt;
&lt;li&gt;Hourly Net Profit: $1.65 - $0.04 = $1.61 per hour&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To find the break-even point, we divide the initial investment of $10,000 by the hourly net profit of $1.61, which gives us approximately 6,211 hours or 259 days. Therefore, it would take about 259 days of continuous operation and rental to earn back the initial investment on the A100 GPU.&lt;/p&gt;

&lt;h4&gt;
  
  
  RTX 4090 ROI
&lt;/h4&gt;

&lt;p&gt;Let’s perform a similar analysis for the RTX 4090 setup where we deliver the same performance as the A100 setup. Remember, we’re assuming that two RTX 4090s are required to match the performance of one A100.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Initial Investment: $1,750 * 2 = $3,500&lt;/li&gt;
&lt;li&gt; Hourly Energy Cost: (0.3kW * 2) * 1 hour * $0.16/kWh = $0.096 per hour&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s first calculate the ROI assuming we rent out the RTX 4090 setup at the same price as the A100:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Hourly Rental Revenue: $1.65 per hour&lt;/li&gt;
&lt;li&gt; Hourly Net Profit: $1.65 - $0.096 = $1.554 per hour&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To find the break-even point: $3,500 / $1.554 per hour ≈ 2,252 hours, or about 94 days.&lt;/p&gt;

&lt;p&gt;In this scenario, the RTX 4090 setup would break even much faster than the A100, in about 94 days compared to 259 days for the A100.&lt;/p&gt;

&lt;p&gt;Now, let’s determine the hourly rental price that would allow the RTX 4090 setup to break even in the same timeframe as the A100. Here’s the calculation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hourly rate to cover initial investment: $3,500 / 6,211 hours ≈ $0.563 per hour&lt;/li&gt;
&lt;li&gt;Total hourly rate including energy cost: $0.563 + $0.096 ≈ $0.66 per hour&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means that if we set the hourly rental price for the RTX 4090 setup at $0.66, it would break even at the same point as the A100.&lt;/p&gt;
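&lt;p&gt;These break-even figures are easy to check in a few lines, plugging in the assumptions listed above:&lt;/p&gt;

```python
def break_even_hours(investment: float, rent_per_hour: float,
                     kw: float, price_per_kwh: float = 0.16) -> float:
    """Hours of continuous rental needed to earn back the purchase cost."""
    net_profit_per_hour = rent_per_hour - kw * price_per_kwh
    return investment / net_profit_per_hour

a100 = break_even_hours(10_000, 1.65, kw=0.25)    # one A100-80GB at 250W
rtx = break_even_hours(3_500, 1.65, kw=2 * 0.30)  # two RTX 4090s at 300W each

print(round(a100), round(a100 / 24))  # → 6211 259  (hours, days)
print(round(rtx), round(rtx / 24))    # → 2252 94

# Rental price at which the 4090 pair breaks even when the A100 does:
print(round(3_500 / a100 + 2 * 0.30 * 0.16, 2))  # → 0.66
```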

&lt;p&gt;Comparing this to the A100’s rental price of $1.65 per hour, we can see that the RTX 4090 setup could potentially be rented out 2.5X cheaper than the A100 while still achieving the same return on investment timeline. On top of that, the initial investment for the RTX 4090 setup is significantly lower than that of the A100, which reduces the barrier to entry for those looking to offer GPU rental services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Through our comparison of the A100 and RTX 4090, we have demonstrated the potential competitive advantage that consumer-grade hardware has over enterprise hardware. As production models currently seem to trend toward smaller sizes, this benefit will only grow as more consumer-grade hardware becomes capable of running AI models efficiently. This trend holds enormous potential benefits for the Nosana grid, which primarily consists of consumer-grade technology.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>LLM Benchmarking on the Nosana grid</title>
      <dc:creator>David Britt</dc:creator>
      <pubDate>Mon, 07 Apr 2025 11:12:20 +0000</pubDate>
      <link>https://dev.to/nosana/llm-benchmarking-on-the-nosana-grid-fmk</link>
      <guid>https://dev.to/nosana/llm-benchmarking-on-the-nosana-grid-fmk</guid>
      <description>&lt;h3&gt;
  
  
  Intro
&lt;/h3&gt;

&lt;p&gt;The Nosana grid contains about two thousand nodes with various hardware configurations, which are actively running AI models. At the start of the Nosana test phase, these nodes have mostly been running image generation or transcription jobs through Stable Diffusion and Whisper. Although these jobs are suitable to make sure nodes are functioning properly, they do not provide any additional benefit from an AI use case perspective.&lt;/p&gt;

&lt;p&gt;So to make the best use of the nodes until the launch of the mainnet, at the beginning of 2024 we started looking for opportunities to run jobs that would be useful to the Nosana community. As Large Language Models (LLMs) are projected to be the central pillar of Nosana’s AI demand in the foreseeable future, we decided to hire a dedicated AI specialist team to start working on a large-scale LLM benchmarking project on the Nosana grid. This project aims to provide information that will help clients make better-informed decisions, help the Nosana team implement a fairer market system, and contribute valuable information to the LLM research community. In this blog post, we will go over the fundamentals required to understand how benchmarking works, and then show how we can use the results of the benchmarks to create fair markets.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM Fundamentals
&lt;/h3&gt;

&lt;p&gt;Let’s start with the fundamentals of LLMs. What is an LLM? How does it work? And what do we need to run one?&lt;/p&gt;

&lt;h4&gt;
  
  
  Architecture
&lt;/h4&gt;

&lt;p&gt;To anyone reading up on them, LLMs might seem like complex neural networks used in artificial intelligence. While this is true to some extent, in practice LLMs essentially consist of two easy-to-understand files: the model weights file and the model code file. The model weights file contains the parameters of the model and determines the model size, which is measured in bytes. The model code file contains the instructions on how to load and run the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngd8fkiks33vab60kurl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngd8fkiks33vab60kurl.png" alt="Image description" width="508" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When looking at llama3.1–70B, the identifier 70B means the model contains 70 billion parameters. The parameters of the model are stored with 16-bit floating-point precision, which equals 2 bytes per parameter, making the model weights file 140 gigabytes in size.&lt;/p&gt;

&lt;p&gt;Each parameter in the model weights file corresponds to a neuron in the architecture described in the model code file. For most modern day LLMs this architecture is called a transformer. The image below shows a generalized transformer architecture used for producing text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5yds1pr95xxtvj6wc01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5yds1pr95xxtvj6wc01.png" alt="Image description" width="612" height="856"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A detailed explanation of transformers is beyond the scope of this article, so we will focus only on the part that is most important for this benchmarking research. For an LLM to output text, it needs to perform computations at every layer, and to do this, it needs its parameters and specific cached computations loaded into memory at the respective layers of the model.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inference
&lt;/h4&gt;

&lt;p&gt;Now that we know what an LLM is, let’s see how we actually produce language with them. LLMs are trained on the task of next token prediction. Tokens are units of text that correspond to words or parts of words, and they are the vocabulary that is understood by LLMs. So as far as LLMs are concerned, producing language is nothing more than correctly predicting the next word or subword given the preceding ones. This process of producing tokens with LLMs is called inference. The speed of inference is an important factor in the usability of LLMs, and it is influenced by the model size, architecture, and the hardware &amp;amp; software configuration on which it is run.&lt;/p&gt;

&lt;p&gt;So how does inference work? LLM inference proceeds in two main stages: the &lt;strong&gt;prefill&lt;/strong&gt; phase and the &lt;strong&gt;decoding&lt;/strong&gt; phase.&lt;/p&gt;

&lt;p&gt;In the prefill phase, the model processes all input tokens simultaneously to compute all the necessary information for generating subsequent tokens. In practice, the duration of this phase corresponds to the time you wait until the LLM starts generating its response. The prefill phase is highly parallelized and makes efficient use of the computing capacity, passing through the model only once.&lt;/p&gt;

&lt;p&gt;At the start of the decoding phase, the model uses the cached information computed during the prefill phase to generate a token. From this point on, for every newly generated token, the previous token needs to pass through the network together with the cached computations. This process of repeatedly going through the network is not computationally intensive as computations only have to be performed for a single token. Instead, the decoding phase is memory intensive, because the cached information has to be moved around to perform the necessary computations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92slfazvnsf0nfu0wlhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92slfazvnsf0nfu0wlhs.png" alt="Image description" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The prefill phase is &lt;strong&gt;compute bound&lt;/strong&gt;, while the decoding phase is &lt;strong&gt;memory bound&lt;/strong&gt;. A process is considered compute-bound when it requires significant computation and its speed or performance is limited primarily by the amount of processing power of the hardware. A process is memory bound when its performance is limited by the rate at which data moves to and from memory. This rate is called the &lt;strong&gt;memory bandwidth&lt;/strong&gt;.&lt;/p&gt;
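
&lt;p&gt;The compute-bound versus memory-bound distinction can be made concrete with the roofline model: a workload whose arithmetic intensity (FLOPs per byte moved) exceeds the hardware's FLOPS-to-bandwidth ratio is compute bound, otherwise it is memory bound. A minimal sketch, using the RTX 4090 figures quoted later in this article; the single-sequence intensity of roughly 1 FLOP per byte is an illustrative approximation, not a measurement:&lt;/p&gt;

```python
# Roofline check: compare a workload's arithmetic intensity (FLOPs per
# byte moved) against the hardware's FLOPS-to-bandwidth ratio.

def is_compute_bound(flops_per_byte: float, peak_flops: float, bandwidth: float) -> bool:
    """True when the workload's intensity exceeds the hardware ridge point."""
    ridge_point = peak_flops / bandwidth  # FLOPs the GPU can perform per byte it moves
    return flops_per_byte > ridge_point

# RTX 4090: 82.58 TFLOPS peak, 1,008 GB/s memory bandwidth
PEAK_FLOPS = 82.58e12
BANDWIDTH = 1008e9
ridge = PEAK_FLOPS / BANDWIDTH  # roughly 82 FLOPs per byte

# Single-sequence decoding streams each fp16 weight (2 bytes) for about
# 2 FLOPs, i.e. roughly 1 FLOP per byte: far below the ridge point.
print(ridge, is_compute_bound(1.0, PEAK_FLOPS, BANDWIDTH))
```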

&lt;h4&gt;
  
  
  Hardware Requirements
&lt;/h4&gt;

&lt;p&gt;Alright, so we need compute and memory capacity to run LLMs. This is where our GPUs come in. Let's figure out what we would be able to run with the Nosana grid's most popular GPU, the RTX 4090.&lt;br&gt;
For GPUs, memory capacity is expressed in gigabytes of VRAM, memory bandwidth in bytes per second, and computational capacity in FLOPS, floating point operations per second. The RTX 4090 has 24 GB of VRAM, a memory bandwidth of 1,008 GB/s, and a peak throughput of 82.58 teraFLOPS.&lt;/p&gt;

&lt;p&gt;The hard requirement for running an LLM is having enough VRAM to store its parameters plus the cached computations. For llama3.1-8B at 16-bit floating point precision this comes to approximately 18 GB of VRAM: 16 GB of weights plus room for the cache. Our RTX 4090 can handle that.&lt;/p&gt;

&lt;p&gt;Figuring out &lt;em&gt;if&lt;/em&gt; a model can be run is fairly straightforward, but figuring out &lt;em&gt;how&lt;/em&gt; it will perform is harder, as performance depends on many variables. That said, for a memory-bound process we can make a rough theoretical estimate of the time per token by dividing the model size by the memory bandwidth.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;18GB ÷ 1,008 GB/s = 18ms&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This works out to 18 ms per token, or about 56 tokens per second. The estimate applies to an inference run with a single input sequence, which is a predominantly memory-bound process. If we increased the number of sequences processed at the same time, the workload could shift from memory bound to compute bound, and the GPU’s FLOPS would start to play a more important role.&lt;/p&gt;
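
&lt;p&gt;The back-of-the-envelope estimate above can be scripted. A minimal sketch using the figures from the text (18 GB effective footprint, 1,008 GB/s bandwidth); it gives an upper bound for a purely memory-bound decode, not a measured result:&lt;/p&gt;

```python
# Memory-bound decode estimate: each generated token requires streaming
# the full model footprint (weights plus cache) through the memory bus once.

def estimate_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper-bound decode speed for a purely memory-bound workload."""
    seconds_per_token = model_bytes / bandwidth_bytes_per_s
    return 1.0 / seconds_per_token

GB = 1e9
# Figures from the text: 18 GB footprint, RTX 4090 bandwidth of 1,008 GB/s
tps = estimate_tokens_per_second(18 * GB, 1008 * GB)
print(round(1000 * 18 / 1008, 1), "ms/token,", round(tps), "tokens/s")  # 17.9 ms/token, 56 tokens/s
```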

&lt;h4&gt;
  
  
  Optimization
&lt;/h4&gt;

&lt;p&gt;We have covered the fundamentals of LLMs and understand which variables matter for running them. Optimization techniques let us tweak these variables, yielding tradeoffs that allow us to run bigger models or increase inference speed. There are many knobs to turn, but the three most important are &lt;strong&gt;quantization, computation enhancement,&lt;/strong&gt; and &lt;strong&gt;caching strategies&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Quantization reduces the precision of the model’s weights to lower-bit representations. For example, using quantization to convert the 16-bit floating point weights of our llama3.1-8B model to 8-bit integers reduces the model size from 16 GB to 8 GB. While this can impact the accuracy of the model, the significant decrease in memory usage makes it a viable optimization technique.&lt;/p&gt;
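
&lt;p&gt;The effect of quantization on the weight footprint is simple arithmetic: parameter count times bytes per parameter. A small sketch reproducing the 16 GB and 8 GB figures from the text:&lt;/p&gt;

```python
# Weight footprint is parameters times bytes per parameter.

def model_size_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Model weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# llama3.1-8B from the text: fp16 versus int8 quantization
fp16_gb = model_size_gb(8, 16)  # 16.0
int8_gb = model_size_gb(8, 8)   # 8.0
print(fp16_gb, int8_gb)
```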

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr49bra0kqf7z77zoikag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr49bra0kqf7z77zoikag.png" alt="Image description" width="615" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Computation enhancement focuses on optimizing the operations performed within the model, such as the attention mechanism. This mechanism is central to the transformer’s success, but it is also computationally expensive. By reordering computations or by fusing certain model layers together, we can reduce the amount of data that needs to be read from and written to memory.&lt;/p&gt;

&lt;p&gt;Caching strategies involve the reduction of the cached computations that are kept in memory. By simplifying the structure of the cached computations it is possible to significantly reduce the memory footprint in exchange for a slight decrease in model accuracy.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inference Frameworks
&lt;/h4&gt;

&lt;p&gt;We have briefly gone over the main optimization techniques, and it is already apparent that implementing any of them by hand can get complicated. Luckily, various LLM inference frameworks provide an interface to a wide range of models with built-in options for optimization. As the field of LLM inference is rapidly evolving, these frameworks roll out updates frequently, and there is no clear-cut best framework.&lt;/p&gt;

&lt;p&gt;One such framework is &lt;strong&gt;Ollama&lt;/strong&gt;. We will give it a special mention here because it is the framework that was used to gather the initial benchmarking results. Ollama originated as a user-friendly framework with the goal of democratizing the use of LLMs. The Ollama team has impressively succeeded in achieving this goal, as it is undoubtedly the easiest framework for anyone with minimal hardware requirements to spin up an LLM. It is especially fitting for running and testing consumer grade hardware as its optimization techniques seamlessly allow models of any size to be run on GPUs with any amount of VRAM.&lt;/p&gt;
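
&lt;p&gt;As an illustration of how decoding speed can be derived in practice, Ollama’s REST API returns per-request metadata with each generate response, including eval_count (the number of generated tokens) and eval_duration (the decode time in nanoseconds). The sample values below are made up for illustration, not measured results:&lt;/p&gt;

```python
# Deriving decode speed from the metadata Ollama returns with each
# /api/generate response: eval_count is the number of generated tokens,
# eval_duration is the decode time in nanoseconds.

def tokens_per_second(response: dict) -> float:
    """Decode speed implied by an Ollama generate response."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Illustrative response fragment; the values are made up, not measured.
sample = {"eval_count": 280, "eval_duration": 5_000_000_000}  # 5 seconds
print(tokens_per_second(sample))  # 56.0
```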



&lt;h3&gt;
  
  
  Nosana Benchmarking
&lt;/h3&gt;

&lt;p&gt;Enough preliminaries. It is time to get into the actual benchmarking. Let’s start by going over the data we collected, and how we collected it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Benchmarking Setup
&lt;/h4&gt;

&lt;p&gt;As mentioned, Ollama was picked as the initial framework for benchmarking due to its compatibility with consumer-grade hardware. With this framework in place, we implemented a custom benchmarking script to gather data in two distinct categories: &lt;strong&gt;model performance&lt;/strong&gt; and &lt;strong&gt;system specifications&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The performance data covers the total number of tokens produced and how long it took to produce them, as well as the clock speed and wattage of the GPU. The system specifications data covers an extensive set of system configurations that can have large or small effects on model performance. The tables below illustrate the kinds of variables and their potential values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwb7ixrfee2ix9nr0o8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwb7ixrfee2ix9nr0o8m.png" alt="Image description" width="743" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhrhong3ix5jneyn94hu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhrhong3ix5jneyn94hu.png" alt="Image description" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the variables to collect defined, we had to pick a model for benchmarking. Given the notable performance of the newly launched llama models, and the wide variety of VRAM capacities across Nosana nodes, we picked llama3.1-8B, which fits on all GPUs.&lt;/p&gt;

&lt;p&gt;For the actual benchmarking procedure we had to create a method compatible with the current job posting structure of Nosana. A job has a maximum length of X hours, giving us plenty of time to load one or more models, prompt them, and measure their performance. During a job, every model got prompted with randomly sampled sequences such as &lt;em&gt;“Write a step-by-step guide on how to bake a chocolate cake from scratch”&lt;/em&gt;. The content of a prompt does not influence the performance of the model, but it does determine the length of the response, so to make sure the LLMs spent most of their time on actual inference, our pool of prompts encouraged longer answers. At the end of each job, the output contains the model performance and system specification variables, which we extracted and added to a large dataset.&lt;/p&gt;

&lt;h4&gt;
  
  
  Evaluation
&lt;/h4&gt;

&lt;p&gt;Before we get into the results, let's quickly look at the key evaluation metrics for LLM inference, &lt;strong&gt;inference speed&lt;/strong&gt; and &lt;strong&gt;time to first token (TTFT)&lt;/strong&gt;. The inference speed is measured in tokens produced per second during the decoding phase, and largely determines how long a user has to wait for a full response. The TTFT is a measurement of the time a user has to wait for the first response token, which is a crucial component in the usability and desirability of many LLM applications.&lt;/p&gt;

&lt;p&gt;In this first article, where we test consecutive single-user queries, we will mainly focus on inference speed as a measure of performance, not the TTFT. The TTFT depends on the prefill phase, which, as mentioned, is a compute-bound process. In our setting, where we process one query at a time, the number of computations needed during the prefill phase is low, resulting in uniformly low TTFTs for all GPUs. Inference speed, on the other hand, measures how fast the decoding process executes and is heavily dependent on memory bandwidth. Since memory bandwidth varies widely between the GPUs on the Nosana grid, focusing on inference speed will provide the most insightful observations.&lt;/p&gt;

&lt;p&gt;As a final note on our evaluation procedure, it is worth mentioning that most LLM inference benchmarking research focuses on the hardware’s capacity to handle &lt;strong&gt;concurrent users&lt;/strong&gt;. With concurrent users, the model has to handle multiple queries at the same time. This setting makes it possible to maximally utilize the GPU’s memory and computational capacities, especially for enterprise hardware with large amounts of VRAM. We have deliberately chosen to perform this initial benchmark with consecutive single queries, or 1 concurrent user, to define a setting in which we can evaluate all GPUs on the Nosana grid. Our results will therefore help us identify well-performing and underperforming nodes across all markets, which enables the implementation of a fair market structure. However, the current setup does not benchmark the actual maximum capabilities of the nodes, which will be a topic in one of our upcoming articles.&lt;/p&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;The dataset we used contains information on 10,596 jobs performed by 550 unique nodes in 15 Nosana Markets. Within these nodes there are 39 unique types of GPUs. The RTX 4090 and 3090 are the most common GPUs by far with 122 and 101 counts respectively. &lt;/p&gt;

&lt;h4&gt;
  
  
  Market Performance
&lt;/h4&gt;

&lt;p&gt;As an initial goal of our benchmarking research we set out to create fair markets. In the presented results we have aggregated the data on market level, so we can show the performance for each market and highlight opportunities for improvements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweaqxjmmzztgm806dp7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweaqxjmmzztgm806dp7s.png" alt="Image description" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above visual we see the average tokens per second for each market. All the way at the top we have the H100 with 111 tokens per second, and at the bottom we have the RTX 4060 with 42 tokens per second. At first glance, this graph does not indicate anything out of the ordinary. On average the general trend seems to be, the more expensive the GPU the better the performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzsizrnsrjlyvys54w90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzsizrnsrjlyvys54w90.png" alt="Image description" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we look at the performance variation within markets, indicated by the black bars, we get some more interesting findings. The longer the bar, the more the performance of individual nodes within that market varies. Varied performance within markets is undesirable for both clients and node runners. When clients use a Nosana node for inference compute, they want reliable performance suitable for their application. When node runners provide compute, they want to be paid based on the quality of compute they provide. High variance within markets interferes with both of these objectives.&lt;/p&gt;

&lt;p&gt;Completely fair markets would mean zero variance within all of them. However, getting to zero variance would require drastic solutions that impede the functionality of the Nosana grid. What we can do is design the markets to minimize performance variance. For example, we can implement a minimum performance threshold based on the average of each market: every node performing worse than the average minus the threshold is removed from the market. This not only reduces the variance of the market, which is caused predominantly by underperforming nodes, but also increases the average tokens per second within the market.&lt;/p&gt;
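
&lt;p&gt;The threshold rule can be sketched as a simple filter over a market’s measured speeds. The node names and tokens-per-second values below are hypothetical:&lt;/p&gt;

```python
# Minimum performance threshold: drop every node whose speed falls more
# than the threshold fraction below the market's average tokens per second.

def apply_threshold(speeds: dict, threshold: float) -> dict:
    """Filter a market; threshold=0.20 implements the 20% rule."""
    average = sum(speeds.values()) / len(speeds)
    cutoff = average * (1.0 - threshold)
    return {node: tps for node, tps in speeds.items() if tps >= cutoff}

# Hypothetical RTX 4090 market (illustrative numbers, not real nodes)
market = {"node-a": 58.0, "node-b": 55.0, "node-c": 52.0, "node-d": 30.0}
kept = apply_threshold(market, 0.20)
print(sorted(kept))  # node-d sits below the cutoff and is removed
```

Removing the underperforming node also raises the surviving market's average, matching the effect described above.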

&lt;p&gt;The visual below illustrates what would happen to the market if a 20% or a stricter 10% threshold were to be implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x7fikocfjtfl9fcgm44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x7fikocfjtfl9fcgm44.png" alt="Image description" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, for the markets with high variation, the threshold causes a significant increase in performance, because it removes a larger number of underperforming nodes in those markets. As a result, these markets become fairer: they provide more reliable compute to clients, and pay out similar amounts for similar-quality compute.&lt;/p&gt;

&lt;h4&gt;
  
  
  Performance Monitoring
&lt;/h4&gt;

&lt;p&gt;After observing the variance between nodes, we started analyzing which variables cause performance fluctuations. Even though we used a large set of hardware specs in our analysis, the results were unambiguous and pointed to two main factors: &lt;strong&gt;GPU type &amp;amp; wattage&lt;/strong&gt;. The GPU type determines the performance range of a node, but the wattage plays an arguably more crucial role by determining the location within this range. For example, a 3070 GPU running at full power can outperform a 4090 GPU that is not getting enough power, showing that proper wattage allocation can be just as important as the GPU model itself.&lt;/p&gt;

&lt;p&gt;With this knowledge we can categorize three types of node runners that deviate from the expected performance: &lt;strong&gt;spoofers&lt;/strong&gt;, malicious node runners that fake hardware configurations &amp;amp; performance; &lt;strong&gt;underclockers&lt;/strong&gt;, economically greedy node runners that do not provide enough power to their hardware setup; and a third category of node runners with unforeseen technical issues. As the Nosana team we want to &lt;strong&gt;remove&lt;/strong&gt; spoofers, keep underclockers in &lt;strong&gt;check&lt;/strong&gt;, and &lt;strong&gt;help&lt;/strong&gt; any node runner facing technical difficulties. Because model inference performance cannot be faked, monitoring this metric helps us identify which category of underperforming node runner we are dealing with, and take appropriate measures to balance the markets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Node Leaderboard
&lt;/h4&gt;

&lt;p&gt;As a first practical step towards fair markets we introduce the &lt;a href="https://leaderboard.nosana.io/" rel="noopener noreferrer"&gt;Nosana Node Leaderboard&lt;/a&gt;. Here we track the performance of each node within the market and display relevant hardware configurations. This allows us, together with the community, to monitor the performance of Nosana nodes in a transparent way. Go check it out!&lt;/p&gt;



&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;By doing research using the Nosana Grid we aim to accomplish two main goals. First, to create the optimal Nosana experience by incorporating data-driven insights into our decision making. Second, to contribute valuable research findings to the broader large language model community. In this article we mainly focused on the first goal, as the current benchmark results are practically useful for the Nosana Grid, but do not provide information about the maximum performance capacity of specific model-hardware combinations in realistic settings. In our next article, we’ll explore maximum performance in real-world conditions.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>llm</category>
      <category>datascience</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>David Britt</dc:creator>
      <pubDate>Thu, 03 Apr 2025 09:29:37 +0000</pubDate>
      <link>https://dev.to/djmbritt/-2lc3</link>
      <guid>https://dev.to/djmbritt/-2lc3</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/nosana/nosana-builder-challenge-create-a-nosana-template-2nca" class="crayons-story__hidden-navigation-link"&gt;Nosana Builders' Challenge - $3,000 USDC in prizes&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;
          &lt;a class="crayons-logo crayons-logo--l" href="/nosana"&gt;
            &lt;img alt="Nosana logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F6410%2F2a3099c0-43c4-4fe0-a455-e4359f4b8fae.png" class="crayons-logo__image"&gt;
          &lt;/a&gt;

          &lt;a href="/djmbritt" class="crayons-avatar  crayons-avatar--s absolute -right-2 -bottom-2 border-solid border-2 border-base-inverted  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F990226%2F7df27c00-8b86-421a-9853-027f52254fb0.png" alt="djmbritt profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/djmbritt" class="crayons-story__secondary fw-medium m:hidden"&gt;
              David Britt
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                David Britt
                
              
              &lt;div id="story-author-preview-content-2370872" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/djmbritt" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F990226%2F7df27c00-8b86-421a-9853-027f52254fb0.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;David Britt&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

            &lt;span&gt;
              &lt;span class="crayons-story__tertiary fw-normal"&gt; for &lt;/span&gt;&lt;a href="/nosana" class="crayons-story__secondary fw-medium"&gt;Nosana&lt;/a&gt;
            &lt;/span&gt;
          &lt;/div&gt;
          &lt;a href="https://dev.to/nosana/nosana-builder-challenge-create-a-nosana-template-2nca" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 1 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/nosana/nosana-builder-challenge-create-a-nosana-template-2nca" id="article-link-2370872"&gt;
          Nosana Builders' Challenge - $3,000 USDC in prizes
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/web3"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;web3&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/hackathon"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;hackathon&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/gpu"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;gpu&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/nosana/nosana-builder-challenge-create-a-nosana-template-2nca" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;3&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/nosana/nosana-builder-challenge-create-a-nosana-template-2nca#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            3 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>web3</category>
      <category>ai</category>
      <category>hackathon</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Nosana Builders' Challenge - $3,000 USDC in prizes</title>
      <dc:creator>David Britt</dc:creator>
      <pubDate>Tue, 01 Apr 2025 12:00:00 +0000</pubDate>
      <link>https://dev.to/nosana/nosana-builder-challenge-create-a-nosana-template-2nca</link>
      <guid>https://dev.to/nosana/nosana-builder-challenge-create-a-nosana-template-2nca</guid>
      <description>&lt;p&gt;We’re thrilled to launch the &lt;strong&gt;Nosana Builder Challenge&lt;/strong&gt;, a developer-focused contest designed to push the boundaries of AI model deployment on the &lt;strong&gt;Nosana Network&lt;/strong&gt;. This is your chance to showcase your skills, gain visibility, learn new tools — and compete for over &lt;strong&gt;$3,000 USDC in prizes&lt;/strong&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create reusable Nosana Templates for deploying AI models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Submit via GitHub PR to win USDC token prizes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;$3,000+&lt;/strong&gt; total rewards for top 10 submissions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deadline is &lt;strong&gt;April 14, 12:00 UTC&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Submission details: &lt;a href="https://earn.superteam.fun/listing/nosana-builders-challenge" rel="noopener noreferrer"&gt;Builders Challenge Page&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is the Builder's Challenge?
&lt;/h2&gt;

&lt;p&gt;The Builder Challenge empowers developers to build powerful tools and features using the &lt;a href="https://github.com/nosana-ci/nosana-sdk" rel="noopener noreferrer"&gt;Nosana SDK&lt;/a&gt;, &lt;a href="https://github.com/nosana-ci/nosana-cli" rel="noopener noreferrer"&gt;CLI&lt;/a&gt;, and &lt;a href="https://dashboard.nosana.com/" rel="noopener noreferrer"&gt;Dashboard&lt;/a&gt;. In this first edition the focus is on &lt;a href="https://dashboard.nosana.com/jobs/templates/" rel="noopener noreferrer"&gt;Templates&lt;/a&gt;. It's all about growing a strong community of builders who can unlock the full potential of decentralized AI inference on Nosana.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Challenge: Create Nosana Templates
&lt;/h2&gt;

&lt;p&gt;For our first edition, we’re zooming in on &lt;strong&gt;Nosana Templates&lt;/strong&gt; — reusable, pre-built job definition files that simplify AI model deployment on Nosana’s decentralized GPU network.&lt;/p&gt;

&lt;p&gt;Current templates include deploying &lt;a href="https://dashboard.nosana.com/jobs/create?templateId=qwen1.5b&amp;amp;randKey=3z707fh1chn" rel="noopener noreferrer"&gt;DeepSeek LLMs&lt;/a&gt; or running a &lt;a href="https://dashboard.nosana.com/jobs/create?templateId=vscode-server&amp;amp;randKey=akqekx4zh0n" rel="noopener noreferrer"&gt;VSCode instance&lt;/a&gt;. You can explore more examples in the &lt;a href="https://dashboard.nosana.com/jobs/templates" rel="noopener noreferrer"&gt;Templates section of the Dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Participate
&lt;/h2&gt;

&lt;p&gt;You can create a new template by crafting a &lt;a href="https://docs.nosana.com/inference/writing_a_job.html" rel="noopener noreferrer"&gt;Nosana Job Definition File&lt;/a&gt;. This can be done either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Directly through the Nosana Dashboard Interface, or&lt;/li&gt;
&lt;li&gt;By creating and editing a JSON file locally with your preferred text editor, then submitting it to the Nosana Network using the Nosana CLI Tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While we encourage AI models, feel free to get creative — analytics or dev tools are welcome too!&lt;/p&gt;

&lt;h3&gt;
  
  
  Submission Instructions
&lt;/h3&gt;

&lt;p&gt;Follow these steps to submit your template:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fork the &lt;a href="https://github.com/nosana-ci/pipeline-templates/tree/main" rel="noopener noreferrer"&gt;Nosana GitHub Template Repository&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Create your new template JSON file based on your chosen AI model or other innovative use-case.&lt;/li&gt;
&lt;li&gt;Submit a Pull Request clearly describing your template, its intended use-case, and implementation specifics. The PR should add a new folder for your Nosana Template containing the following files:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;job-definition.json&lt;/code&gt;: the standard Nosana Job Definition JSON file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;info.json&lt;/code&gt;: a JSON file with display information for the Dashboard&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;README.md&lt;/code&gt;: a README describing the job definition, the models used, and any other relevant information about the job&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Ensure your template is functional and deployable directly from the Nosana Dashboard.&lt;/li&gt;
&lt;li&gt;Finally, submit your entry on the &lt;a href="https://earn.superteam.fun/listing/nosana-builders-challenge" rel="noopener noreferrer"&gt;Builder Challenge Page&lt;/a&gt;.
&lt;/li&gt;
&lt;/ol&gt;
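&lt;p&gt;Before opening the PR, it's worth sanity-checking that your folder matches the required layout. A small illustrative helper (not part of the official Nosana tooling) could look like this:&lt;/p&gt;

```python
import json
from pathlib import Path

REQUIRED = ["job-definition.json", "info.json", "README.md"]

def check_template(folder):
    """Sanity-check a template folder before opening the PR.

    Returns a list of problems; an empty list means the layout looks good.
    Illustrative helper only, not part of the official Nosana tooling.
    """
    folder = Path(folder)
    problems = []
    for name in REQUIRED:
        path = folder / name
        if not path.is_file():
            problems.append(f"missing {name}")
        elif name.endswith(".json"):
            # Both job-definition.json and info.json must parse as JSON.
            try:
                json.loads(path.read_text())
            except json.JSONDecodeError as err:
                problems.append(f"{name} is not valid JSON: {err}")
    return problems
```

Running `check_template("my-template/")` on your folder lists anything missing or malformed before the reviewers see it.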

&lt;h3&gt;
  
  
  Example Template
&lt;/h3&gt;

&lt;p&gt;Here’s an example for deploying a DeepSeek R1 LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"container"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"trigger"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cli"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ops"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"container/run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deepseek-r1-qwen-1.5b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"entrypoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"/bin/sh"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"-c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"python3 -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --served-model-name R1-Qwen-1.5B --port 9000 --max-model-len 130000"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"docker.io/vllm/vllm-openai:latest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"gpu"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"expose"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9000&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we're using the &lt;a href="https://vllm.com/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; Docker image, but feel free to choose any container image suitable for your needs.&lt;/p&gt;
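&lt;p&gt;Because this template runs vLLM's OpenAI-compatible API server, any OpenAI-style client can talk to it once the job is live. A minimal sketch using only the Python standard library; the base URL below is a placeholder for the service endpoint Nosana assigns to your running job:&lt;/p&gt;

```python
import json
import urllib.request

def build_chat_request(base_url, prompt, model="R1-Qwen-1.5B"):
    """Build an OpenAI-style chat completion request for the vLLM server.

    base_url is a placeholder: Nosana assigns the real service URL when
    the job is running (visible on the job page in the Dashboard).
    """
    url = base_url.rstrip("/") + "/v1/chat/completions"
    payload = {
        "model": model,  # must match --served-model-name in the template
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return url, payload

def send_chat(url, payload):
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Once the job is running, `send_chat(*build_chat_request("https://your-job-url", "Hello!"))` returns the model's reply.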

&lt;h2&gt;
  
  
  Judging Criteria
&lt;/h2&gt;

&lt;p&gt;Submissions will be evaluated based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Creativity:&lt;/strong&gt; Original and innovative template ideas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Popularity of AI Model:&lt;/strong&gt; Implementation of widely adopted or cutting-edge AI models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical Interest:&lt;/strong&gt; Efficient, scalable, or uniquely creative use of Nosana’s capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diversity of Models:&lt;/strong&gt; Varied implementations including LLMs, GANs, Stable Diffusion, analytics, and other inferencing models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prizes
&lt;/h2&gt;

&lt;p&gt;We’re awarding the &lt;strong&gt;top 10 submissions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🥇 1st: $1,000 USDC&lt;/li&gt;
&lt;li&gt;🥈 2nd: $750 USDC&lt;/li&gt;
&lt;li&gt;🥉 3rd: $500 USDC&lt;/li&gt;
&lt;li&gt;🏅 4th: $250 USDC&lt;/li&gt;
&lt;li&gt;🔟 5th–10th: $100 USDC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All prizes are paid out directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tutorial &amp;amp; Resources
&lt;/h2&gt;

&lt;p&gt;For a comprehensive tutorial and additional insights into how Nosana works, how to run models, and best practices, visit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.nosana.io/tutorials/llm/deepseek.html" rel="noopener noreferrer"&gt;Nosana Full Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://earn.superteam.fun/listing/nosana-builders-challenge" rel="noopener noreferrer"&gt;Builders Challenge Page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Don’t Miss Nosana Builder Challenge Updates
&lt;/h3&gt;




&lt;p&gt;Join our &lt;a href="https://nosana.com/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;, where we have a dedicated &lt;a href="https://discord.com/channels/236263424676331521/1354391113028337664" rel="noopener noreferrer"&gt;Builders Challenge dev chat&lt;/a&gt; for technical support and information.&lt;/p&gt;

&lt;p&gt;Join our &lt;a href="https://nosana.com/telegram" rel="noopener noreferrer"&gt;Telegram&lt;/a&gt; or follow us on &lt;a href="https://nosana.com/x" rel="noopener noreferrer"&gt;X&lt;/a&gt; for the latest Nosana and NOS announcements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy Building!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>web3</category>
      <category>ai</category>
      <category>hackathon</category>
      <category>gpu</category>
    </item>
  </channel>
</rss>
