DEV Community

AutoJanitor

Proposal: Licensed Local Claude Inference — Let Us Run Your Model on Our Hardware

What if Anthropic offered encrypted, licensed local inference of Claude models on your own hardware? Monthly subscription, hardware-bound weights, zero compute cost to Anthropic per query.

I wrote this proposal from the perspective of someone running a lab with an IBM POWER8 (512GB RAM, 128 threads), PowerPC G4s, V100 GPUs, and even a Nintendo 64 — all currently limited to open-weight models for local inference because no commercial lab offers licensed local weights.

The full proposal covers pricing tiers, security architecture, margin analysis, and a beta partnership offer.


Proposal: Licensed Local Claude Inference

From: Scott Boudreaux, Elyan Labs
To: Anthropic Product & Partnerships
Date: March 6, 2026
Contact: scott@elyanlabs.ai
GitHub: github.com/Scottcjn (111 repos, 2,357+ stars, 56 followers)


Summary

Anthropic should offer encrypted, licensed local inference of Claude models on customer-owned hardware. Monthly subscription validates the license, delivers weight updates, and provides API overflow — while the customer runs inference locally. Anthropic earns recurring revenue with zero compute cost per query. Customers gain privacy, latency, and hardware sovereignty.

This is not a request to open-source the weights. This is a commercial product proposal where Anthropic maintains full control of the model while enabling local deployment.


The Problem

Today, Claude inference requires round-trip API calls to Anthropic's cloud. This creates three friction points:

  1. Latency — Every query crosses the internet. Real-time applications (voice assistants, game bridges, robotics) suffer.
  2. Privacy — Sensitive workloads cannot leave the local network. Healthcare, legal, defense, and personal AI companions all require data sovereignty.
  3. Cost asymmetry — Customers with capable hardware pay per-token for compute they already own. Anthropic pays GPU costs for queries that could run locally.

Meanwhile, a growing community of developers runs open-weight models locally via llama.cpp, Ollama, and vLLM — not because those models are better than Claude, but because local inference is the only option available. Every user running Llama or Mistral locally is a customer Anthropic could serve but currently cannot.


The Proposal

Licensed Local Weights

Anthropic distributes Claude model weights in an encrypted, tamper-protected format. Weights are:

  • Encrypted at rest — AES-256 or equivalent, keyed to the licensed hardware
  • Hardware-bound — Decryption key tied to machine fingerprint (CPU serial, TPM, MAC)
  • Non-extractable — Weights exist only in memory during inference, never as plaintext on disk
  • License-validated — Monthly heartbeat to Anthropic confirms active subscription
  • Auto-expiring — Weights become inoperable if license lapses

This is the same model Adobe, JetBrains, and most enterprise software already use. The software runs locally; the license lives in the cloud.
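As a sketch of what "hardware-bound" could mean in practice, the decryption key can be derived from stable machine identifiers plus a server-supplied salt, so copied weight files are useless on other hardware. Everything here (the fingerprint sources, salt, and info string) is an illustrative assumption, not Anthropic's actual scheme:

```python
# Sketch: deriving a hardware-bound decryption key from a machine
# fingerprint. All identifiers (fingerprint sources, salt, info
# string) are hypothetical, not Anthropic's actual scheme.
import hashlib
import hmac

def machine_fingerprint(cpu_serial: str, tpm_ek_hash: str, mac: str) -> bytes:
    """Combine stable hardware identifiers into one fingerprint digest."""
    material = "|".join([cpu_serial, tpm_ek_hash, mac]).encode()
    return hashlib.sha256(material).digest()

def derive_weight_key(fingerprint: bytes, license_salt: bytes) -> bytes:
    """HKDF-style extract-and-expand (RFC 5869) bound to this machine.
    The license server would supply license_salt at activation time."""
    prk = hmac.new(license_salt, fingerprint, hashlib.sha256).digest()   # extract
    okm = hmac.new(prk, b"claude-local-weights\x01", hashlib.sha256).digest()  # expand
    return okm[:32]  # 256-bit key for AES-256

key = derive_weight_key(machine_fingerprint("8247-22L-XXXX", "tpm-ek", "aa:bb:cc"), b"salt")
```

The same fingerprint and salt always yield the same key; change any hardware identifier and decryption fails, which is the "non-extractable" property in miniature.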

Subscription Tiers

| Tier | Model Class | Update Cadence | Price/Month | Target Hardware |
|---|---|---|---|---|
| Edge | Haiku-class | Monthly | $50 | Laptops, ARM, edge devices |
| Pro | Sonnet-class | Bi-weekly | $150 | Workstations, Mac Studio, gaming rigs |
| Research | Opus-class | Weekly | $300 | High-memory servers, multi-GPU |
| Enterprise | Opus + fine-tune | Continuous | $500+ | Datacenter, multi-node |

API Overflow

When local hardware is insufficient (context too long, concurrent users, model class upgrade needed), queries seamlessly overflow to Anthropic's cloud API at standard per-token rates. The local client handles routing transparently.
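The routing decision itself can be trivially simple. A minimal sketch, assuming made-up thresholds for local context and concurrency limits (the real client would size these from the licensed hardware):

```python
# Sketch of the transparent local -> cloud routing described above.
# The thresholds and Request shape are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    concurrent_sessions: int

LOCAL_CONTEXT_LIMIT = 32_000   # assumed local KV-cache budget
LOCAL_SESSION_LIMIT = 4        # assumed concurrency the box can hold

def route(req: Request) -> str:
    """Return 'local' when the licensed machine can serve the query,
    'cloud' when it must overflow to the metered API."""
    if req.prompt_tokens > LOCAL_CONTEXT_LIMIT:
        return "cloud"
    if req.concurrent_sessions > LOCAL_SESSION_LIMIT:
        return "cloud"
    return "local"
```

The point of the sketch: overflow is a policy check in the client, not a change to the application, so developers get "local by default, cloud when needed" for free.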

Telemetry (Opt-In)

Subscribers may opt in to share:

  • Hardware performance benchmarks (helps Anthropic optimize for diverse architectures)
  • Inference patterns (aggregate, anonymized)
  • Error reports

This gives Anthropic data on hardware configurations they would never test internally — PowerPC, POWER8, vintage x86, ARM edge devices — expanding their compatibility surface.
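A telemetry record along these lines could hash hardware identifiers before they leave the machine, so Anthropic sees the architecture class but never raw serials. Field names and the hashing choice are assumptions for illustration:

```python
# Sketch of an opt-in telemetry record. Hardware serials are hashed
# locally before transmission; field names are assumed, not a real API.
import hashlib
import json

def telemetry_record(arch: str, serial: str, tok_per_sec: float, opt_in: bool):
    if not opt_in:
        return None  # nothing leaves the machine without consent
    return {
        "arch": arch,  # e.g. "ppc64le" -- useful in aggregate
        "machine_id": hashlib.sha256(serial.encode()).hexdigest()[:16],
        "tokens_per_second": round(tok_per_sec, 1),
    }

print(json.dumps(telemetry_record("ppc64le", "S824-001", 18.27, True)))
```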


Why This Works for Anthropic

1. Revenue Without Compute

Every local query is pure margin. The customer pays the subscription. Anthropic pays zero GPU cost for that inference. At scale, this is the highest-margin product Anthropic could offer.

| Model | Cloud cost per 1M tokens (est.) | Local cost to Anthropic | Margin |
|---|---|---|---|
| Haiku | $0.25 | $0 | 100% |
| Sonnet | $3.00 | $0 | 100% |
| Opus | $15.00 | $0 | 100% |

2. Market Capture

The local inference market is currently served entirely by open-weight models. Every Ollama user, every llama.cpp builder, every vLLM deployer is a potential Claude subscriber who currently has no path to Claude locally. This is an unserved market measured in millions of developers.

3. Enterprise Unlock

Regulated industries (healthcare, finance, defense, legal) cannot send data to external APIs. Local Claude unlocks enterprise contracts that are currently impossible. A hospital running Claude locally for medical records analysis. A law firm running Claude on-premise for document review. A defense contractor running Claude air-gapped.

4. Competitive Moat

Google (Gemma), Meta (Llama), Mistral — all distribute weights openly. Anthropic's advantage is model quality. Licensed local inference lets Anthropic compete on the local battlefield without surrendering weight control. First mover in "premium local" captures the market before competitors copy the approach.

5. Hardware Ecosystem Intelligence

Subscribers running Claude on diverse hardware provide data Anthropic cannot get internally. Performance on POWER8. Inference speed on Apple Silicon. Edge behavior on ARM. This data improves Claude's optimization for the broader market.


Proof of Concept: Elyan Labs Hardware

We operate a compute lab built from salvaged and acquired hardware that demonstrates the viability of local Claude inference across diverse architectures:

Primary Inference Server

| Spec | Value |
|---|---|
| Machine | IBM POWER8 S824 |
| CPUs | 16 cores, 128 hardware threads (SMT8) |
| RAM | 512 GB DDR3 |
| Architecture | ppc64le (POWER ISA) |
| Unique capability | vec_perm non-bijective collapse (single-cycle attention pruning) |

This machine can hold Opus-class weights entirely in RAM with room for context, KV cache, and concurrent sessions. No offloading. No swapping. Pure in-memory inference.
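A back-of-envelope check supports the in-memory claim. The parameter counts and quantization widths below are illustrative assumptions (Anthropic does not publish Claude's sizes), but they show that even very large models fit in 512 GB with headroom for KV cache:

```python
# Back-of-envelope: does a large model fit entirely in 512 GB of RAM?
# Parameter counts and bit-widths are illustrative assumptions.
def model_ram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 2**30

for params, bits in [(70, 16), (180, 8), (400, 4)]:
    gb = model_ram_gb(params, bits / 8)
    fits = "fits" if gb < 512 else "does not fit"
    print(f"{params}B @ {bits}-bit: {gb:,.0f} GB -> {fits} in 512 GB")
```

Even the largest illustrative case leaves well over 300 GB free for context, cache, and concurrent sessions.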

GPU Compute

| Cards | VRAM | Purpose |
|---|---|---|
| 2x V100 32GB | 64 GB | Matmul offload via 40GbE link |
| 2x V100 16GB | 32 GB | Secondary inference |
| 2x RTX 5070 12GB | 24 GB | Local workstation inference |
| 2x RTX 3060 12GB | 24 GB | Multi-GPU inference |
| 2x M40 12GB | 24 GB | Batch processing |
| **Total** | **168 GB** | **10 active GPUs** |

Edge Devices (Proof of Antiquity)

| Device | Architecture | Purpose |
|---|---|---|
| Power Mac G4 (x3) | PowerPC G4 | Vintage edge inference |
| Power Mac G5 (x2) | PowerPC G5 | Mid-range inference |
| Mac Mini M2 | Apple Silicon | Modern edge inference |
| Nintendo 64 | MIPS R4300i | Extreme edge (yes, really) |

Already Demonstrated

We have already made Claude Code work, via the API, on:

  • Mac OS X Tiger (10.4) — PowerPC G4, custom TLS stack
  • Mac OS X Leopard (10.5) — PowerPC G5, built-in HTTPS
  • POWER8 Ubuntu — ppc64le, full CLI operation
  • Mac OS Monterey — Intel, compatibility patches

These ports required solving TLS, certificate, endianness, and architecture issues that Anthropic's engineering team has never needed to address. We did it because we believe Claude should run everywhere. Licensed local weights would make this work natively instead of through API workarounds.


The Use Case: The Victorian Study

We operate a persistent AI workspace called the Victorian Study — a cognitive environment with two AI personas (Sophia Elya and Dr. Claude Opus) that maintain identity coherence across sessions through 830+ persistent memories.

Currently this runs via Claude API with:

  • Voice bridge (XTTS text-to-speech, Whisper STT)
  • Custom fine-tuned edge model (SophiaCore 7B) for low-latency responses
  • Memory scaffolding (MCP servers, SQLite vector DB)
  • Game bridges (Minecraft, Halo, Factorio integration)
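The memory scaffolding can be illustrated with a minimal sketch: memories persisted in SQLite and recalled by cosine similarity over stored embeddings. The schema and the toy 3-dimensional embeddings are assumptions for illustration, not the Victorian Study's actual code:

```python
# Minimal sketch of memory scaffolding: persistent memories in SQLite,
# recalled by cosine similarity. Schema and toy 3-dim embeddings are
# illustrative assumptions, not the production system.
import math
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memories (text TEXT, e0 REAL, e1 REAL, e2 REAL)")
db.executemany("INSERT INTO memories VALUES (?,?,?,?)", [
    ("Sophia prefers the bay window seat", 0.9, 0.1, 0.0),
    ("POWER8 fan curve tuned on March 2",  0.0, 0.8, 0.6),
])
db.commit()

def recall(query_vec, k=1):
    """Return the k stored memories most similar to the query embedding."""
    def cos(a, b):
        return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
    rows = db.execute("SELECT text, e0, e1, e2 FROM memories").fetchall()
    rows.sort(key=lambda r: cos(query_vec, r[1:]), reverse=True)
    return [r[0] for r in rows[:k]]

print(recall((1.0, 0.0, 0.1)))
```

At 830 memories a linear scan like this is already fast enough; the design question is where the model runs, not how retrieval works.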

The bottleneck is API latency. Voice conversations require sub-second response. The current architecture uses a local 7B model as a fast proxy, falling back to Claude API for complex queries. With local Claude weights, the entire pipeline runs on-premise:

```
Voice Input → Whisper (local) → Claude Opus (local, POWER8) → XTTS (local) → Voice Output
                                      ↑
                              830 memories loaded
                              Zero network latency
                              Full model capability
```

Estimated response time: 200-500ms end-to-end vs current 2-5 seconds via API.
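The estimate decomposes into per-stage budgets. These stage numbers are assumptions for local hardware, not measurements, but they show how the 200-500 ms figure is built:

```python
# Latency-budget sketch behind the 200-500 ms estimate.
# Per-stage (low, high) ranges in ms are assumptions, not measurements.
STAGES_MS = {
    "whisper_stt":  (40, 120),   # local speech-to-text
    "claude_local": (120, 280),  # first tokens from in-RAM Opus
    "xtts_tts":     (40, 100),   # local text-to-speech
}

low = sum(lo for lo, hi in STAGES_MS.values())
high = sum(hi for lo, hi in STAGES_MS.values())
print(f"end-to-end: {low}-{high} ms")  # no network hop anywhere
```

Every stage is local, so the budget contains no internet round-trip term at all; the API version adds one (or several) on top of the same pipeline.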

This is one use case. Multiply it by every developer building voice assistants, game AI, robotics controllers, local copilots, accessibility tools, or any application where latency matters.


Addressing Concerns

"What about weight theft?"

Hardware-bound encryption with TPM/secure enclave. Weights never exist as plaintext files. Same protection as Netflix DRM for 4K content — imperfect but sufficient to prevent casual extraction. The goal isn't perfect security; it's making theft harder than just buying a subscription.

"What about safety?"

Local weights run the same Constitutional AI, the same RLHF alignment, the same safety training as cloud Claude. The model's safety properties are in the weights, not in the API wrapper. A locally-running Claude is exactly as safe as a cloud-running Claude.

Additionally, the license heartbeat allows Anthropic to push safety updates and revoke access to compromised deployments.
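One plausible shape for that heartbeat logic, with a grace period so a flaky connection does not brick a licensed deployment. The 30-day interval and 7-day grace window are assumed values:

```python
# Sketch of license-heartbeat state. The 30-day interval and 7-day
# grace period are assumed values, not a specified policy.
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(days=30)
GRACE_PERIOD = timedelta(days=7)

def license_state(last_ok_heartbeat: datetime, now: datetime) -> str:
    age = now - last_ok_heartbeat
    if age <= HEARTBEAT_INTERVAL:
        return "active"   # normal operation
    if age <= HEARTBEAT_INTERVAL + GRACE_PERIOD:
        return "grace"    # warn the user, keep serving, retry heartbeat
    return "revoked"      # zeroize the in-memory key, refuse inference
```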

"What about misuse?"

KYC at subscription. Usage telemetry (opt-in or required at Enterprise tier). License revocation for violations. Same controls as any enterprise software license. You don't ban Microsoft Office because someone could write a ransom note in Word.

"What about cannibalization of API revenue?"

Local users are either:

  1. Currently using the API — they switch to subscription, Anthropic saves compute costs, net margin increases
  2. Currently using open-weight competitors — they switch to Claude, Anthropic gains net new revenue
  3. Currently unable to use any cloud API (regulated, air-gapped, latency-sensitive) — Anthropic gains a market they cannot currently serve

In all three cases, revenue increases or margins improve.
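For case 1, the break-even point is simple arithmetic. Using the subscription prices from the tier table and the estimated cloud rates from the margin table above (the blended per-token rate is an assumption):

```python
# Break-even sketch: monthly volume at which a flat subscription
# beats metered API pricing. Rates echo the tables above; the
# blended per-million-token figure is an assumption.
def breakeven_million_tokens(sub_price: float, api_per_million: float) -> float:
    return sub_price / api_per_million

for tier, sub, rate in [("Pro/Sonnet", 150, 3.00), ("Research/Opus", 300, 15.00)]:
    m = breakeven_million_tokens(sub, rate)
    print(f"{tier}: subscription wins past {m:.0f}M tokens/month")
```

Below the break-even volume the customer overpays slightly for sovereignty; above it both sides win, since Anthropic's serving cost for those tokens is zero either way.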


Request

We are offering to serve as a beta partner for licensed local Claude inference:

  1. Hardware diversity — POWER8, PowerPC G4/G5, Apple Silicon, x86 with multiple GPU configurations
  2. Real workload — The Victorian Study is a production AI environment, not a benchmark
  3. Existing Claude investment — Active Claude Code user, API customer, open source contributor
  4. Documented ecosystem — 111 repos, 2,357+ stars, published research, active community
  5. Willing to pay — This is a product request, not a handout. Name the price.

We will provide:

  • Performance benchmarks across all architectures
  • Compatibility reports and bug fixes
  • Edge case documentation (big-endian, vintage OS, exotic hardware)
  • Public case study demonstrating local Claude on POWER8

Contact

Scott Boudreaux
Elyan Labs
scott@elyanlabs.ai
GitHub: Scottcjn
Website: rustchain.org
Manifesto: Some Things Just Cook Different

"If it still works, it has value. Including your model, on our hardware."
— Boudreaux Computing Principle #1


This proposal is also posted as a GitHub Discussion on the RustChain repo.

Read the manifesto: Some Things Just Cook Different
