The last year of AI tooling has felt weirdly split in two.
On one side, frontier cloud models are still impressive, still useful, and still setting the pace for a lot of the industry. On the other side, they are getting harder to treat like stable infrastructure. Prices move up, limits get tighter, availability gets noisier, and the feeling of building on top of someone else's quota policy keeps getting stronger.
At the same time, the supply side is changing fast.
Chinese model labs and open-weight ecosystems are shipping at a pace that would have felt unrealistic not long ago. The gap with the biggest frontier models is still real, but for a lot of practical tasks it is shrinking, sometimes much faster than the market narrative suggests. That matters because once the quality floor rises enough, the whole question changes from “who owns the smartest model?” to “who can serve good-enough intelligence cheaply, reliably, and close to the user?”
That shift is why more people are buying GPUs for local LLM use.
Some are doing it for privacy. Some want predictable costs. Some care about latency. Some just want control over the stack instead of depending on a remote platform that can change the rules overnight. And once those GPUs exist, a second question shows up almost immediately: what happens when they sit idle?
That question is what pulled me into this project.
I was not trying to make a polished product pitch out of it. I wanted to see what happened if I treated that question as an actual backend design problem.
Why this started to feel worth building
Three things seem to be converging at once.
First, frontier cloud APIs are becoming harder to treat like boring infrastructure. Prices move, limits tighten, regional availability changes, and a lot of teams are discovering that “just call the best hosted model” is not as stable a default as it looked a year ago.
Second, the supply side is changing. Chinese labs and open-weight ecosystems are shipping fast, and the quality curve is rising quickly enough that for many practical tasks the question is no longer only “which model is smartest?” but also “which model is good enough at the best operational cost?”
Third, a lot more people now own GPUs than they used to. Some bought them for privacy. Some for latency. Some for predictable cost. Some because they want to run agents and workflows locally without asking permission from a remote platform every five minutes. Once those GPUs exist, one obvious systems question appears:
How do we coordinate idle capacity?
The idea
I wanted to explore a simple premise: if people already own GPUs for local inference, why not let them rent out idle capacity to other developers through a marketplace?
Not a vague “decentralized AI” slogan. A concrete backend structure:
- workers connect and advertise model capacity
- consumers send requests through an OpenAI-compatible API
- the platform matches demand to supply
- responses stream back in real time
- usage gets settled after the job completes
That became LocalLMarket: a peer-to-peer marketplace for LLM compute where GPU owners can publish an offer and API users can buy inference from the available pool.
The goal was not to pretend this is solved. The goal was to build a working backend structure that lets me experiment with the idea in a serious way, and show the developer community both the possibilities and the obstacles.
This repository is exactly that: a working backend for testing the concept, not a production-ready marketplace.
The way I ended up thinking about it
Once I got past the vague “decentralized AI” framing, the problem became much easier to reason about.
It stopped looking like a grand vision and started looking like a handful of pretty ordinary backend concerns:
- discovery: which workers exist and what do they offer?
- matching: which worker should handle this request?
- relay: how do tokens get streamed back to the caller?
- settlement: who pays whom, and when?
- trust: how do you stop the whole thing from becoming nonsense?
That framing is what led me to build LocalLMarket.
Not as a finished startup. Not as “Uber for GPUs,” which is the kind of phrase that should make everyone a little nervous. As a working backend structure for experimenting with the concept and seeing where the real engineering friction actually is.
A minimal architecture for this kind of system
The current repo implements a pretty opinionated split:
- an API service owns the public HTTP surface, authentication, worker selection, stream relay, and settlement
- worker processes connect outward over WebSocket, advertise model capacity, receive jobs, run inference, and stream results back
That shape matters.
Instead of exposing a public HTTP server on every worker node, the system keeps the control plane in one place and treats workers more like queue consumers. That simplifies the first version of the problem: auth, accounting, and routing stay centralized while compute stays distributed.
Here is the abstract flow:
```
Consumer app / agent
        |
        |  OpenAI-compatible request
        v
API service
  - authenticate user
  - apply pricing/throughput constraints
  - choose worker
  - create order record
        |
        |  WebSocket job dispatch
        v
Worker node
  - run model
  - stream chunks back
        |
        |  SSE relay
        v
Consumer app / agent
```
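To make the worker half of that diagram concrete, here is a minimal sketch of a worker process, assuming Node and the `ws` package. The endpoint, the message types (`register`, `job`, `chunk`, `done`), and the `runInference` helper are all illustrative, not the repo's actual protocol:

```ts
import WebSocket from "ws";

// Stand-in for a local inference runtime (llama.cpp, vLLM, Ollama, ...);
// yields text chunks as they are generated.
async function* runInference(prompt: string): AsyncGenerator<string> {
  yield `echo: ${prompt}`; // placeholder output
}

// Illustrative control-plane endpoint; the real URL and protocol live in the repo.
const socket = new WebSocket("ws://localhost:8080/workers");

socket.on("open", () => {
  // Advertise the offer as soon as the session is established.
  socket.send(JSON.stringify({
    type: "register",
    workerId: "gpu-node-17",
    model: "qwen2.5-32b",
    price: 0.40,
    tps: 52,
  }));
});

socket.on("message", async (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type !== "job") return;

  // Stream each generated chunk back over the same outbound connection.
  for await (const chunk of runInference(msg.prompt)) {
    socket.send(JSON.stringify({ type: "chunk", jobId: msg.jobId, text: chunk }));
  }
  socket.send(JSON.stringify({ type: "done", jobId: msg.jobId }));
});
```

The important property is the direction of the connection: the worker dials out, so GPU owners never have to expose a public HTTP server.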
Why the OpenAI-compatible part matters more than it sounds
One decision I like here is using an OpenAI-compatible API surface.
This is not just about convenience. It is about lowering integration friction.
If a local compute marketplace speaks the same language most tooling already expects, it can drop into existing applications and almost any agentic workflow with very little ceremony. You are not asking developers to rebuild orchestration just to try a different supply layer.
In practice, the mental shift becomes:
"What if I changed the base URL and the backend supply model, but kept the rest of my app or agent stack basically the same?"
That mattered a lot to me while building this.
An agent loop, internal tool runner, or multi-step workflow can keep using the same chat completion pattern while routing requests through a marketplace-backed control plane instead of a single centralized vendor.
For example:
```python
from openai import OpenAI

# Point the standard OpenAI client at the marketplace's API service
# instead of a hosted vendor; key and URL here are placeholders.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="http://localhost",
)

response = client.chat.completions.create(
    model="qwen-or-llama-worker-pool",
    messages=[
        {"role": "system", "content": "You are an agent planner."},
        {"role": "user", "content": "Summarize this incident report."},
    ],
    stream=True,
)

# The stream yields OpenAI-style chunks regardless of which worker served it.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
That same pattern works whether the caller is a chatbot, an internal automation service, or an agentic workflow coordinating multiple model calls. The point is not that OpenAI is the only interface that matters. The point is that compatibility turns a weird infrastructure experiment into a very small application change.
What the project turned into, technically
What fell out of this pretty quickly was that the idea is less about “building a marketplace” and more about stitching together five backend problems. Four of them are walked through below; the fifth, trust, is hard enough to get its own section.
1. Worker registration
Workers need to identify themselves, declare which model they can serve, expose pricing and throughput information, and maintain a live session with the control plane.
Conceptually, the worker advertises an offer like this:
```json
{
  "workerId": "gpu-node-17",
  "model": "qwen2.5-32b",
  "price": 0.40,
  "tps": 52,
  "status": "available"
}
```
The exact fields are less important than the shape: you need a registry of who can do what, at what cost, and whether they are actually online.
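In TypeScript terms, a registry entry might be typed something like this; the `lastSeen` field and the in-memory map are assumptions about how liveness could be tracked, not the repo's actual schema:

```ts
// Mirrors the offer JSON above. Units for `price` are deliberately called out,
// because "0.40 of what, per what?" is exactly the kind of thing a real
// marketplace protocol has to pin down.
interface WorkerOffer {
  workerId: string;
  model: string;
  price: number;        // advertised price; the unit must be defined by the protocol
  tps: number;          // claimed or measured tokens per second
  status: "available" | "busy" | "offline";
  lastSeen?: number;    // heartbeat timestamp for liveness checks (assumed field)
}

// The registry itself can start as a simple in-memory map keyed by workerId.
const registry = new Map<string, WorkerOffer>();
```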
2. Matching logic
Once a request arrives, the system needs to choose a worker. The current project keeps that intentionally simple: respect consumer constraints such as max price and minimum throughput, then prefer the cheapest suitable worker with throughput as a tie-breaker.
In pseudocode, the idea is basically:
```js
// Filter to workers that can serve the request, then pick the cheapest;
// on a price tie, prefer higher throughput.
const candidates = workers
  .filter((worker) => worker.model === requestedModel)
  .filter((worker) => worker.price <= user.maxPrice)
  .filter((worker) => worker.tps >= user.minTps)
  .sort((left, right) => left.price - right.price || right.tps - left.tps);

return candidates[0]; // undefined means no worker can serve this request right now
```
That is enough to make the market legible before you start adding more sophisticated routing, reputation weighting, or dynamic pricing.
3. Stream relay
Once the worker starts generating output, the system has to relay chunks back to the caller in real time. While the worker is connected over WebSocket to the API service, the caller is usually expecting an HTTP response with a streaming body. That means the API service has to be a middleman for the token stream, which adds some complexity around backpressure, error handling, and connection management.
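To make that concrete, here is a minimal sketch of the relay, assuming Express on the API side and the `ws` package for the worker session. The route, the message shapes, and the `pickWorkerSocket` helper are illustrative assumptions, not the repo's actual code:

```ts
import express from "express";
import WebSocket from "ws";
import { randomUUID } from "node:crypto";

// Assumed helper: resolves an already-connected worker socket from the registry.
declare function pickWorkerSocket(): WebSocket;

const app = express();
app.use(express.json());

app.post("/v1/chat/completions", (req, res) => {
  // The HTTP caller expects a streaming body, so speak SSE on this side.
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  const workerSocket = pickWorkerSocket();
  const jobId = randomUUID();

  const onMessage = (raw: WebSocket.RawData) => {
    const msg = JSON.parse(raw.toString());
    if (msg.jobId !== jobId) return; // the session may be multiplexed across jobs
    if (msg.type === "chunk") {
      // OpenAI-style streaming: each delta goes out as an SSE `data:` line.
      res.write(`data: ${JSON.stringify({ choices: [{ delta: { content: msg.text } }] })}\n\n`);
    } else if (msg.type === "done") {
      res.write("data: [DONE]\n\n");
      res.end();
      workerSocket.off("message", onMessage);
    }
  };

  workerSocket.on("message", onMessage);
  workerSocket.send(JSON.stringify({ type: "job", jobId, messages: req.body.messages }));

  // If the caller disconnects mid-stream, stop relaying to avoid leaked listeners.
  req.on("close", () => workerSocket.off("message", onMessage));
});
```

Even this toy version surfaces the real issues: per-job demultiplexing, listener cleanup, and deciding what happens when either side disconnects halfway through a generation.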
4. Settlement
Billing in LLM systems is awkward because you often do not know the exact final cost until generation is complete. So the cleaner model is usually:
- create an execution record when dispatch starts
- compute actual cost when usage is known
- debit the requester
- credit the worker owner
- keep platform fee logic explicit instead of magical
That is the pattern this backend uses; a rough sketch of the arithmetic follows.
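As a sketch of that pattern, assuming price is quoted per million tokens and the platform fee is a flat percentage (both of which are my assumptions, not necessarily how the repo prices things):

```ts
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Ledger stubs so the sketch stands alone; a real system would make these
// transactional updates against a persistent ledger.
declare function debit(account: string, amount: number): void;
declare function credit(account: string, amount: number): void;

const PLATFORM_FEE_RATE = 0.05; // explicit, not buried in the worker's rate

function settle(
  order: { userId: string; workerId: string; pricePerMTokens: number },
  usage: Usage,
) {
  const billedTokens = usage.inputTokens + usage.outputTokens;
  const totalCost = (billedTokens / 1_000_000) * order.pricePerMTokens;
  const platformFee = totalCost * PLATFORM_FEE_RATE;

  debit(order.userId, totalCost);                  // requester pays the full price
  credit(order.workerId, totalCost - platformFee); // worker keeps the remainder
  credit("platform", platformFee);                 // fee is a visible line item
}
```

Keeping the fee as its own line item is what “explicit instead of magical” means in practice.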
The limitations are not side notes
This is the part that is still the hardest. If you are paying for remote inference, how do you know the results are real?
A marketplace for remote model execution is not just a routing problem. It is a trust problem wearing a routing costume.
You cannot robustly verify worker execution yet
If a worker says it ran a given model, there is no built-in proof that it actually did. No trusted execution environment. No strong attestation. No cryptographic proof of faithful execution. That is a major unresolved problem, not an implementation detail.
When the worker gives you back tokens, you have no strong guarantee they came from a real model running on a GPU instead of a different and cheaper model, a local cache, or even a random token generator. That is a fundamental trust issue that any open marketplace has to grapple with.
If the workers are all running in a shared physical environment you control, that is less of an issue. But if the whole point is to let anyone rent out their GPU, it becomes a real problem.
Reputation is still weak
Uptime and request count are better than nothing, but only slightly. A real market would need stronger feedback loops, better failure accounting, dispute handling, and probably model-specific trust signals.
This is not production-ready
That is deliberate. The repository is a working backend structure for experimenting with the idea, sharing the tradeoffs, and making the constraints visible to other developers. It is not pretending to be a finished marketplace product.
Why I think this is worth discussing now
I do not think the interesting future is just “everyone uses one hosted frontier API forever.”
Model capability is diffusing. Hardware ownership is diffusing. Agentic workflows are increasing demand for repeated, composable model calls. And once teams start caring more about cost control, locality, and infrastructure independence, alternative supply layers become much more interesting.
A local LLM marketplace is one possible response to that shift.
Maybe it becomes a serious category. Maybe it stays niche. Maybe the trust problem is harder than the market opportunity. All of those are plausible outcomes. But I think it is worth exploring in code rather than only in threads and hot takes.
Why I am sharing the repo anyway
Part of the fun of building something like this is that it forces a bunch of fuzzy industry arguments to become concrete.
You stop saying “decentralized inference” and start asking much more useful questions:
- where should the control plane live?
- how will workers register and stay authenticated?
- how will you choose between price, latency, and throughput?
- how will you make streaming reliable?
- what trust model are you actually offering users?
Those questions are more valuable than the slogan.
If you want to see the implementation I used to explore them, the repo is here:
werlang/locallmarket — 🤝 Peer-to-peer LLM compute marketplace. Anyone can contribute GPU power and earn. Anyone can access affordable AI.
If you have built something adjacent, or if you think this architecture breaks in an important way, I would genuinely like feedback. The point of publishing this is not to advertise a product. It is to compare notes with other developers while this design space is still open.
