DEV Community: Vaibhav Kumar Kandhway

Junkyard Computing: The Engineering Case for Building Server Clusters from Dead Smartphones

Vaibhav Kumar Kandhway — Sun, 21 Jun 2026 19:11:26 +0000

TL;DR

A cluster of discarded smartphones can match the cost and performance profile of cloud server instances for a defined, bounded class of workloads: bursty, latency-tolerant, horizontally scalable services such as microservices, dev environments, and educational platforms. This isn't a sustainability thought experiment. A 2023 prototype (10 Pixel 3A phones) ran real end-to-end microservice benchmarks at roughly 1/40th the three-year cost of an equivalent AWS instance. A 2024 follow-up deployed the same architecture for live university coursework. And in June 2026, Google backed a production-scale version of this exact design: a 2,000-phone cluster at UC San Diego, replacing the compute equivalent of ~50 traditional servers, launching Fall 2026.

The rest of this post derives why that conclusion holds, not by appeal to e-waste statistics but from the underlying compute economics. The carbon numbers show up as evidence, not motivation.

Four terms, defined precisely

Before building the argument, four terms need precise definitions, because the entire case rests on a metric most performance benchmarks ignore.

Embodied carbon: emissions incurred manufacturing a device, paid once, upfront, regardless of how long the device is used.
Operational carbon: emissions incurred running a device, accrued continuously over its service life.
Computational Carbon Intensity (CCI): a metric proposed in the foundational research, defined as total lifetime CO2e (embodied + operational + networking) divided by total lifetime operations performed. Lower is better. Critically, for a device that is reused rather than newly manufactured, embodied carbon is treated as already paid, i.e., C_M = 0.
Cloudlet: a small, localized cluster of compute nodes; in this case, a set of networked smartphones functioning as a single addressable compute resource.

CCI is the metric that makes the rest of this argument possible. Power Usage Effectiveness (PUE), the industry-standard datacenter efficiency metric, only measures operational overhead. It says nothing about whether the underlying hardware needed to be manufactured at all. A datacenter can have excellent PUE and still have a poor carbon footprint if it churns through new servers fast enough. CCI is the metric that catches that.
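The CCI definition above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function name `cci`, its parameter names, and every numeric input are assumptions chosen for the example, not figures from the research. It holds the operational and networking terms fixed and shows what happens to the ratio when embodied carbon is set to zero.

```python
def cci(embodied_kg: float, operational_kg: float, network_kg: float,
        lifetime_ops: float) -> float:
    """Computational Carbon Intensity: lifetime kg CO2e per operation."""
    if lifetime_ops <= 0:
        raise ValueError("lifetime_ops must be positive")
    return (embodied_kg + operational_kg + network_kg) / lifetime_ops


# Placeholder inputs for illustration only; NOT measurements from any study.
LIFETIME_OPS = 1e15
OPERATIONAL_KG = 500.0
NETWORK_KG = 50.0

new_hardware = cci(1000.0, OPERATIONAL_KG, NETWORK_KG, LIFETIME_OPS)
reused_hardware = cci(0.0, OPERATIONAL_KG, NETWORK_KG, LIFETIME_OPS)  # C_M = 0

print(f"new hardware:    {new_hardware:.3e} kg CO2e per op")
print(f"reused hardware: {reused_hardware:.3e} kg CO2e per op")
```

With throughput held equal, the only difference between the two results is the embodied term, which is the point the metric is designed to expose.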

Three measurements this argument stands on

Everything that follows is built from three things that have actually been measured: not assumed, not estimated for effect. Each is independently checkable, sourced from device-level benchmarking and published life-cycle assessments (LCAs).

Manufacturing dominates smartphone lifecycle emissions.
Published LCAs put manufacturing at 70-90% of a smartphone's total lifetime carbon footprint. Operational energy (the electricity used while running the device) is a minority contributor.

Modern smartphone compute already clears the performance bar for a defined class of cloud workloads.
GeekBench data across the top five Android phones released each year since 2013 shows multi-core throughput and memory capacity for recent devices meeting or exceeding AWS T4g burstable instances, the instance class AWS explicitly markets for microservices, small databases, and dev environments. This is a performance floor claim, not a peak-performance claim: it does not extend to GPU-bound or HPC-class workloads.

Reused hardware carries zero marginal embodied carbon.
If a device has already been manufactured and would otherwise sit idle or be discarded, its embodied carbon cost is sunk. Any additional compute extracted from it is amortized against zero new manufacturing.

The rest of this post is just what happens when you combine those three facts and follow them through.

Reuse beats new procurement on both cost and carbon, and it's not close

For workloads that fall inside a phone's performance envelope, reusing one strictly outperforms buying new, on both dollars and carbon. Put the first and third facts above together: a repurposed device's carbon-per-operation math loses its largest term, manufacturing, entirely. A purpose-built server's math keeps it. Hold throughput roughly comparable (the second fact, within the defined workload class), and the repurposed device comes out ahead by construction, not by luck.

This isn't theoretical. The empirical result: a 10-device Pixel 3A cloudlet running DeathStarBench's HotelReservation and SocialNetwork applications (real, end-to-end microservice stacks, not synthetic benchmarks) handled up to 4,000 queries/second within a 50ms median / 100ms tail latency budget, comparable to an AWS c5.9xlarge instance. Three-year cost: $1,028 for the phone cluster versus $40,404 for the equivalent EC2 instance. Carbon efficiency: 9.8×–18.9× better per request, depending on workload mix.
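As a quick check on the "roughly 1/40th" figure, the two three-year costs quoted above can be divided directly. The variable names are assumptions for the example; the dollar amounts and the device count are the ones stated in the text.

```python
PHONE_CLUSTER_3YR_USD = 1_028   # 10-device Pixel 3A cloudlet, from the text
EC2_3YR_USD = 40_404            # equivalent EC2 instance, from the text
CLUSTER_DEVICES = 10

cost_ratio = EC2_3YR_USD / PHONE_CLUSTER_3YR_USD
per_device_3yr = PHONE_CLUSTER_3YR_USD / CLUSTER_DEVICES

print(f"EC2 costs {cost_ratio:.1f}x the phone cluster over three years")  # 39.3x
print(f"Phone cluster cost per device over three years: ${per_device_3yr:.2f}")
```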

Note what's doing the work in that result: it is not that phones are faster. They aren't. It's that the device doesn't have to absorb a new manufacturing cost, in carbon or in dollars, before it's even started doing useful work.

The bottleneck was never the chip

The binding constraint on junkyard clusters is thermal, network, and power management, not compute. Here's why that has to be true: if reuse is strictly favorable, as established above, the only reason this isn't already universal practice is that something else is hard. Three failure modes were identified and independently characterized:

Thermal. Phones throttle at 40-50°C and hard-shutdown at 60-70°C; they were never designed for sustained, rack-density operation. Measured thermal output, however, came in low: ~2.6 W/device under 100% CPU load, ~1.2 W/device under realistic mixed workload. Extrapolated to a 256-device cluster, that's ~666 W total, coolable with two off-the-shelf 500 W server fans. The per-device throttling behavior functions as a built-in, distributed thermal governor; no centralized cooling control logic is required to keep the cluster from cascading into shutdown.
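The thermal extrapolation is simple enough to reproduce. This sketch uses the per-device wattages and cluster size from the paragraph above; treating each fan's 500 W figure as heat it can remove is an assumption about how to read that number, and the variable names are made up for the example.

```python
DEVICES = 256
WATTS_FULL_LOAD = 2.6    # per device at 100% CPU, from the text
WATTS_MIXED_LOAD = 1.2   # per device under realistic mixed workload, from the text
FAN_CAPACITY_W = 2 * 500  # assumption: each "500 W" fan removes 500 W of heat

full_load_w = DEVICES * WATTS_FULL_LOAD
mixed_load_w = DEVICES * WATTS_MIXED_LOAD
headroom_w = FAN_CAPACITY_W - full_load_w

print(f"worst case: {full_load_w:.1f} W")    # 665.6 W
print(f"mixed load: {mixed_load_w:.1f} W")   # 307.2 W
print(f"headroom at worst case: {headroom_w:.1f} W")
```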

Network. Co-located WiFi clustering was tested and found to degrade past ~30 devices due to interference. The proposed mitigation for small/edge deployments is a tree topology (phones grouped in cells of five, one device hotspotting to LTE, the rest bridging over its WiFi AP), which caps per-device throughput at ~18.5 Mbit/s. At true datacenter scale, this constraint is resolved trivially by reverting to wired Ethernet, the same way any rack of stripped-down nodes would be networked. Network is a real constraint, but not a hard one.
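The cell-of-five topology can be pictured as a simple partition of the device list. The sketch below is a hypothetical illustration, not the researchers' tooling: the `Cell` type, the `build_cells` helper, and the device names are all invented here, and picking the first phone in each group as the hotspot is an arbitrary choice.

```python
from dataclasses import dataclass

CELL_SIZE = 5  # phones per cell, from the text


@dataclass(frozen=True)
class Cell:
    hotspot: str   # the one device with the LTE uplink
    clients: tuple  # devices bridging over the hotspot's WiFi AP


def build_cells(device_ids, cell_size=CELL_SIZE):
    """Group devices into cells; the first device in each group is the hotspot."""
    if cell_size < 1:
        raise ValueError("cell_size must be at least 1")
    cells = []
    for start in range(0, len(device_ids), cell_size):
        group = device_ids[start:start + cell_size]
        cells.append(Cell(hotspot=group[0], clients=tuple(group[1:])))
    return cells


devices = [f"phone-{i:02d}" for i in range(12)]
for cell in build_cells(devices):
    print(cell.hotspot, "->", ", ".join(cell.clients))
```

A leftover group smaller than five simply becomes a smaller cell, which keeps every device attached to exactly one hotspot.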

Power. This is the constraint unique to phone-based clusters. Smartphone batteries degrade after ~2,500 charge cycles. Under light-medium load, that works out to roughly 2.3 years of service for a Pixel-class battery before replacement: non-trivial, recurring physical maintenance at scale (~9 hours of labor per 2 years for a 54-device cluster, by direct measurement). The battery cuts both ways: it doubles as a built-in UPS, and it enables smart charging (deferring charge cycles to low-carbon-intensity grid windows), which measured ~7% additional carbon reduction on a Pixel 3A, but it is also the single component most likely to require physical intervention.
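The battery figures imply a charge rate and a per-device maintenance cost that are easy to back out. The inputs below come from the paragraph above; the linear scaling in `labor_hours_per_year` is an assumption, since the text only reports the 54-device measurement.

```python
CYCLE_BUDGET = 2_500       # charge cycles before degradation, from the text
SERVICE_YEARS = 2.3        # service life under light-medium load, from the text
LABOR_HOURS = 9.0          # measured labor per maintenance period, from the text
LABOR_PERIOD_YEARS = 2.0
MEASURED_CLUSTER = 54      # devices in the measured cluster

cycles_per_day = CYCLE_BUDGET / (SERVICE_YEARS * 365)
minutes_per_device = LABOR_HOURS * 60 / MEASURED_CLUSTER


def labor_hours_per_year(devices: int) -> float:
    """Assumes labor scales linearly with device count (not stated in the source)."""
    return LABOR_HOURS / LABOR_PERIOD_YEARS * devices / MEASURED_CLUSTER


print(f"implied charge cycles per day: {cycles_per_day:.2f}")
print(f"labor per device per 2 years: {minutes_per_device:.0f} minutes")
print(f"256-device cluster: {labor_hours_per_year(256):.1f} hours of labor per year")
```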

None of these three are compute problems. All three are solvable with conventional infrastructure engineering. That's the load-bearing claim here: the barrier to junkyard computing was never the silicon.

The software barrier closed in three generations, and that's why 2026 happened

The remaining barrier, software, has closed measurably across three design generations, and that trajectory is what predicts the 2026 production deployment. Trace the actual implementation history:

Generation 1 (2023): OS replacement. Android removed entirely, replaced with Ubuntu Touch; kernel patched to add filesystem modules (BTRFS) required for Docker. Functional, but operationally fragile: every device requires manual OS surgery before joining the cluster.
Generation 2 (2024): Native virtualization. Android 14+ shipped KVM in the stock kernel. The redesigned architecture runs an Ubuntu VM inside unmodified Android, with a Kubernetes pod inside that VM. Setup dropped to a scriptable handful of terminal commands. No OS replacement required.
Generation 3 (2026, production): Hardware reduction. Per the Google-backed UCSD deployment, phones are physically stripped to bare motherboard (display, battery, camera, and chassis removed), and the SoC/RAM/storage run plain Linux directly, orchestrated with Kubernetes and indistinguishable to a scheduler from any other commodity node.

Each generation removed friction without changing the underlying economics laid out above. That's the pattern that makes the trajectory predictable rather than coincidental: the compute case for junkyard clusters was sound in 2023; what changed by 2026 was that the engineering overhead of standing one up dropped enough for an organization like Google to commit production resources to it.

Where this stops applying

No argument built this way is honest without stating where it stops holding.

This does not extend to: GPU/AI-training workloads (measured 15–22× throughput gap against a GTX 1080 Ti on FP32/INT32 in the same research lineage), latency-critical applications (inter-device network hops add measurable tail latency), or memory-bound workloads exceeding ~12GB per node (current high-end smartphone RAM ceilings).

It does extend to: containerized microservices, CI/dev environments, educational platforms (autograders, notebook hosting, coursework infrastructure), and any workload class characterized by burstiness and loose latency SLAs, which is precisely the workload class Google and UCSD are targeting for the Fall 2026 deployment.

Where this series goes next

This post establishes the why. The next posts in this series go device-by-device through the how:

How the thermal and network constraints above are actually engineered around at cluster scale
The full software stack evolution from Generation 1 to Generation 3, including the Kubernetes scheduling layer
A teardown of the CCI formula and how to apply it to your own infrastructure decisions

Sources: Switzer et al., "Junkyard Computing: Repurposing Discarded Smartphones to Minimize Carbon," ASPLOS 2023; Switzer et al., "Reducing the Carbon Footprint of EdTech with Repurposed Devices," 2024; Google Research / UC San Diego phone cluster computing project coverage, June 2026.

The Execution Safety Crisis in Multi-Agent Workflows — And the Architectural Pattern That Solves It

Vaibhav Kumar Kandhway — Sun, 07 Jun 2026 19:13:09 +0000

The biggest unresolved problem in multi-agent workflows is not reasoning. It is execution safety.

Most teams building with LLMs today have not encountered this problem yet — because they have not scaled yet. This article is for the ones who are about to.

The Core Tension

LLMs are probabilistic by nature. Every output is a sample from a probability distribution. There is no guarantee that the same prompt produces the same output twice. That is not a bug — it is the fundamental property that makes language models useful.

Production backend systems are deterministic by requirement. The same input must always produce the same state change, traceably, verifiably, with an audit log that can be reconstructed after the fact.

When you connect an agent directly to an execution environment — via raw Python, open-ended tool calling, or unstructured function dispatch — you are bridging these two worlds with no safety boundary between them.

The agent reasons correctly ninety-nine times. The hundredth time, it hallucinates a parameter, misreads a context window, or generates structurally valid but semantically wrong instructions. In a traditional software system, that is a bug you catch in testing. In an agentic system with direct execution access, that is a silent state corruption — no stack trace, no audit log, no clean error surface.

This is not a prompting problem. It is not a model quality problem. It is an architectural problem.

Three Approaches — And Why Two Fail at Scale

Approach 1 — Direct Execution (Raw Tool Calling)

The agent generates intent and executes it directly via function calls, shell commands, or Python scripts. The architecture looks like this:

User Intent → LLM → Tool Call → System Execution

This is where most teams start. It is fast to prototype, easy to wire up with LangChain or CrewAI, and works impressively in demos.

The problem surfaces in production. There is no layer between what the model decided and what the system did. Failures are runtime failures — discovered after state has already changed. Invalid arguments do not fail cleanly; they fail at the system boundary, often silently, often after partial execution.

A 2025 research taxonomy of multi-agent failures identified 14 unique failure modes across frameworks including AutoGen, ChatDev, and CrewAI. The study's core finding was stark: "improvements in the base model capabilities will be insufficient to address the full taxonomy. Instead, good multi-agent system design requires organizational understanding; even organizations of sophisticated individuals can fail catastrophically if the organization structure is flawed."

The failures are not in the model. They are in the architecture.

There is also a compounding reliability problem. If each agent in a chain is 95% reliable, chaining three agents together drops overall task success to roughly 86%. Add more steps and reliability falls exponentially — not because any individual agent is bad, but because failures cascade across the chain with no structural containment.
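The compounding effect is just repeated multiplication. A minimal sketch, assuming each step succeeds independently and with the same probability (a simplification; real agent failures are often correlated), with `chain_success` as a made-up helper name:

```python
def chain_success(per_step_reliability: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_reliability ** steps


for steps in (1, 3, 5, 10, 20):
    print(f"{steps:>2} steps at 95% each: {chain_success(0.95, steps):.1%} end to end")
```

Three steps at 95% come out at 85.7%, which is the "roughly 86%" figure quoted above.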

Direct execution has no containment layer. This is the approach that cannot scale.

Approach 2 — Natural Language Parsing with Guardrails

A validation layer sits between the agent and the execution environment, checking outputs against a set of rules before running them.

User Intent → LLM → Output → Guardrail Filter → Execution

This is better. Frameworks like NeMo Guardrails, Guardrails-AI, and AWS Bedrock Guardrails operate in this space. They provide output validation, content filtering, and policy enforcement at the boundary.

But the grammar of what the agent can produce is still unbounded. The model outputs free-form text or loosely structured JSON. The guardrail then attempts to validate that output against a rule set.

The problem is fundamental: you are filtering an infinite space rather than constraining the space itself. Rule-based validation written against ambiguous, open-ended output will always have edge cases. An agent that outputs something technically valid but semantically harmful can slip through. An agent that outputs something in a format the guardrail did not anticipate can fail unpredictably.

Microsoft's research on LLMs and DSLs found that models still hallucinate outputs even when given grammar files and format constraints — they produce correctly formatted responses that are semantically wrong. Filtering catches some of that. It cannot catch all of it, because the thing you are filtering against is not formally defined.

This approach is necessary but not sufficient.

Approach 3 — The LLM-to-DSL Compiler Pattern

This is the architectural shift that moves the safety guarantee from runtime behavior to structural design.

User Intent → LLM → DSL Output → Grammar Validator → Execution Engine

Instead of generating free-form code or natural language instructions, the agent compiles user intent into a Domain-Specific Language — a rigid, custom grammar with a strictly bounded output space. The system then runs that DSL through a deterministic validation engine before a single instruction touches system state.

We have used DSLs for decades to constrain logic to strict domains:

SQL does not let you accidentally invoke a shell command
Terraform does not let you accidentally write to a file system
CSS does not let you accidentally make a network request

The grammar defines what is expressible. Everything outside it is structurally impossible — not filtered, not blocked, but inexpressible by construction.

The new paradigm applies this same principle to AI orchestration.

The Three Stages of the LLM-to-DSL Pattern

Stage 1 — Constrained Generation

The agent translates user intent into DSL rather than general-purpose code.

Here is a minimal illustration of the difference. Consider an agent tasked with querying a database.

Open-ended tool calling (Approach 1):

# The LLM generates this. Anything goes.
import subprocess
result = subprocess.run(["psql", "-c", "DROP TABLE users;"], capture_output=True)

The model intended to query. It hallucinated a destructive operation. The grammar of Python allowed it.

DSL-constrained output (Approach 3):

QUERY users
  WHERE status = "active"
  LIMIT 100
  RETURN [id, name, email]

This grammar does not contain a DROP keyword. It cannot be expressed. The hallucination has no surface to land on.

The DSL defines the contract between the AI's reasoning and the system's execution. Not by filtering what the model says — by defining what the model can say.

Stage 2 — Deterministic Validation

A backend engine parses the DSL output against a formal grammar. Because the grammar is bounded, parsing is deterministic. Every DSL string either passes or fails: no ambiguity, no partial execution, no silent errors.

Here is what that validation layer looks like structurally:

DSL Input → Lexer → Token Stream → Parser → AST → Semantic Validator → Execution Plan

At each stage, failure is explicit:

Lexer rejects unknown tokens
Parser rejects malformed structure
Semantic Validator rejects valid syntax with invalid logic (e.g., referencing a field that does not exist in the schema)

The result: hallucinations and invalid logic do not produce silent runtime failures. They fail at the compilation step — before execution begins. The error is precise, attributable, and logged at the grammar level, not discovered as a corrupted state three steps later.
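To make the lexer, parser, and semantic validator stages concrete, here is a minimal sketch for the small QUERY grammar used in the examples. Everything in it is an assumption made for illustration: the `SCHEMA` table, the `MAX_LIMIT` cap, the `Query` and `DSLError` names, and the fixed clause order (WHERE, then LIMIT, then RETURN). It is not the implementation of any system mentioned in this article.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema and policy cap, for illustration only.
SCHEMA = {
    "users": {"id", "name", "email", "status"},
    "orders": {"order_id", "customer_id", "amount", "status"},
}
MAX_LIMIT = 1000
KEYWORDS = {"QUERY", "WHERE", "LIMIT", "RETURN"}
TOKEN_RE = re.compile(
    r'\s*(?:(?P<STRING>"[^"\n]*")|(?P<NUMBER>\d+)'
    r'|(?P<WORD>[A-Za-z_]\w*)|(?P<PUNCT>[=\[\],]))'
)


class DSLError(Exception):
    """Raised at compile time, before anything touches system state."""


@dataclass(frozen=True)
class Query:  # the whole AST for this one-statement grammar
    table: str
    where: Optional[tuple]
    limit: Optional[int]
    fields: tuple


def lex(text):
    """Split source text into (kind, value) tokens; reject unknown characters."""
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise DSLError(f"lexer: unexpected character {text[pos:].lstrip()[:1]!r}")
        kind, value = match.lastgroup, match.group(match.lastgroup)
        if kind == "PUNCT" or (kind == "WORD" and value in KEYWORDS):
            kind = value  # keywords and punctuation are their own token kind
        tokens.append((kind, value))
        pos = match.end()
    return tokens


def parse(tokens):
    """Build a Query from tokens; reject anything that is not a QUERY statement."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take(kind):
        nonlocal pos
        if peek() != kind:
            found = tokens[pos][1] if pos < len(tokens) else "end of input"
            raise DSLError(f"parser: expected {kind}, found {found!r}")
        pos += 1
        return tokens[pos - 1][1]

    take("QUERY")
    table, where, limit = take("WORD"), None, None
    if peek() == "WHERE":
        take("WHERE")
        field = take("WORD")
        take("=")
        where = (field, take("STRING").strip('"'))
    if peek() == "LIMIT":
        take("LIMIT")
        limit = int(take("NUMBER"))
    take("RETURN")
    take("[")
    fields = [take("WORD")]
    while peek() == ",":
        take(",")
        fields.append(take("WORD"))
    take("]")
    if peek() is not None:
        raise DSLError(f"parser: unexpected trailing input {tokens[pos][1]!r}")
    return Query(table, where, limit, tuple(fields))


def validate(query):
    """Semantic checks: known table, known fields, LIMIT within policy."""
    if query.table not in SCHEMA:
        raise DSLError(f"semantic: unknown table {query.table!r}")
    referenced = list(query.fields) + ([query.where[0]] if query.where else [])
    for name in referenced:
        if name not in SCHEMA[query.table]:
            raise DSLError(f"semantic: unknown field {name!r} on {query.table!r}")
    if query.limit is not None and not 1 <= query.limit <= MAX_LIMIT:
        raise DSLError(f"semantic: LIMIT must be between 1 and {MAX_LIMIT}")
    return query


def compile_dsl(text):
    return validate(parse(lex(text)))


print(compile_dsl('QUERY users WHERE status = "active" LIMIT 100 RETURN [id, name, email]'))
for bad in ("DROP TABLE users", "QUERY users RETURN [password]", "QUERY users; RETURN [id]"):
    try:
        compile_dsl(bad)
    except DSLError as err:
        print("rejected:", err)
```

In this sketch `DROP` lexes as an ordinary word and is rejected by the parser because no statement starts with it; a stricter lexer could reject it one stage earlier. Either way the rejection happens before an execution plan exists.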

This is the parallel to Rust's ownership model. C trusted the programmer — one lapse, and the consequences were severe. Garbage-collected languages trusted the runtime — safety was real, but you lost control. Rust encoded correctness into the compiler itself — the guarantee is structural, not behavioral. The LLM-to-DSL pattern does the same thing for agentic execution.

Stage 3 — Diffable Execution

The validated instruction set is human-readable, structured, and reviewable. Before any state change executes, a team can inspect exactly what the agent is proposing.

  AGENT PROPOSED EXECUTION PLAN
  ─────────────────────────────
  QUERY orders
-   WHERE status = "completed"
+   WHERE status = "pending"
    LIMIT 500
    RETURN [order_id, customer_id, amount]

This is not just good engineering practice. It is what makes human-in-the-loop workflows operationally viable at scale. Without a DSL layer, human review of agent actions means reading raw code or natural language outputs — which does not scale and introduces its own interpretation errors. With a DSL layer, review means reading a structured, bounded instruction set where the semantic meaning is explicit.

You can see what the agent is about to do. You can diff it against what you expected. You can reject it before execution. This is what "auditability" actually means in practice.
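Because a plan is plain structured text, an ordinary text diff is enough to surface what changed. A minimal sketch using Python's standard `difflib`; the two plans and the `expected`/`proposed` labels are made up for the example.

```python
import difflib

expected_plan = """QUERY orders
  WHERE status = "completed"
  LIMIT 500
  RETURN [order_id, customer_id, amount]
""".splitlines(keepends=True)

proposed_plan = """QUERY orders
  WHERE status = "pending"
  LIMIT 500
  RETURN [order_id, customer_id, amount]
""".splitlines(keepends=True)

diff = list(difflib.unified_diff(
    expected_plan, proposed_plan, fromfile="expected", tofile="proposed"
))
print("".join(diff), end="")

# Keep only the changed lines, skipping the '---' / '+++' file headers.
changed = [
    line for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(f"{len(changed)} changed line(s) need review")
```

A reviewer, or an approval gate, only has to look at the changed lines rather than re-read the whole plan.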

This Is Not Hypothetical — It Is Already in Production

In late 2025, PayPal published research detailing exactly this pattern deployed at production scale. Their system implements a declarative DSL that separates agent workflow specification from implementation — enabling the same pipeline definition to execute across multiple backend languages (Java, Python, Go) and deployment environments.

The results on real e-commerce workflows processing millions of daily interactions:

60% reduction in development time compared to imperative implementations
3x improvement in deployment velocity
Complex workflows expressed in under 50 lines of DSL versus 500+ lines of imperative code
Sub-100ms orchestration overhead — the DSL layer added no meaningful latency

The finding that stands out most: the declarative approach enabled non-engineers to modify agent behaviors safely. The grammar constraint did not just make the system safer — it made the system accessible to a wider set of contributors, because the bounded grammar prevented them from making structurally dangerous changes by accident.

Business Implications

The technical architecture has direct business consequences. They compound at scale.

Auditability Becomes a Compliance Asset

In regulated industries — finance, healthcare, legal — every action an agent takes must be attributable, reviewable, and reversible. A DSL-based control plane produces a structured, human-readable record of every proposed state change before execution. That is not just good engineering. In many jurisdictions, it is the difference between a deployable system and an undeployable one.

The GDPR's right to explanation, HIPAA's audit trail requirements, and SOC 2's access control standards all require that automated actions be attributable and reconstructable. An agent operating via direct execution cannot satisfy these requirements by design. An agent operating via a DSL control plane satisfies them structurally.

Incident Cost Drops Dramatically

When an agent operating via direct execution corrupts state, the failure is discovered at runtime — after the fact, often without a clear trace of what instruction caused it. Recovery requires reconstructing intent from logs that may be incomplete.

When an agent operating via DSL produces invalid logic, the failure is caught at parse time — before execution, with a precise error at the grammar level. The blast radius is zero. No state was changed. The mean time to detection collapses from hours to milliseconds.

The documented production failure cases make this concrete. Two agents trapped in a runaway interaction loop ran for 11 days before detection — generating a $47,000 API bill. Expense report agents fabricating plausible but false entries at Ramp generated over $1 million in fraudulent invoices in 90 days. These are not reasoning failures. They are execution containment failures. A DSL control plane with bounded grammar would have caught both patterns at the validation stage — an agent cannot enter an infinite loop if the DSL grammar does not express unbounded iteration.

Human Oversight Becomes Operationally Viable

Diffable execution means a human reviewer can inspect exactly what the agent is about to do — in structured, readable form — before approving it. This makes human-in-the-loop architectures practical at scale.

This matters because the emerging regulatory consensus around autonomous AI systems is moving toward mandatory human oversight for high-stakes actions. Building that oversight capability into the architecture now, rather than retrofitting it later, is a significant operational advantage.

Vendor and Model Portability Increases

When your execution layer depends on the specific output format of a particular model, switching models breaks production. Your agent's behavior is coupled to the model's generation behavior — and that coupling is invisible until it breaks.

When your execution layer depends on a DSL grammar that the model compiles into, the model becomes interchangeable. The contract is with the grammar, not the model. You can swap Claude for GPT-4o, or fine-tune a smaller model on DSL generation, without touching your execution layer. The separation of concerns is structural.
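The portability argument can be sketched as an execution layer that accepts any callable returning DSL text and checks only the grammar. All names here are hypothetical: `model_a`, `model_b`, and `model_c` are stubs standing in for real model clients, and the single regular expression is a deliberately crude stand-in for a real parser.

```python
import re
from typing import Callable

ModelBackend = Callable[[str], str]  # intent in, DSL text out

# Crude single-statement grammar check; a real system would use a full parser.
PLAN_RE = re.compile(
    r'QUERY \w+(?: WHERE \w+ = "[^"]*")?(?: LIMIT \d+)? RETURN \[\w+(?:, \w+)*\]'
)


def plan_from(model: ModelBackend, intent: str) -> str:
    """Ask a model for a plan and accept it only if it matches the grammar."""
    text = " ".join(model(intent).split())  # normalize whitespace and newlines
    if not PLAN_RE.fullmatch(text):
        raise ValueError(f"rejected before execution: {text!r}")
    return text


# Stub backends; each returns a canned response regardless of the intent.
def model_a(intent: str) -> str:
    return 'QUERY users\n  WHERE status = "active"\n  LIMIT 100\n  RETURN [id, name, email]'


def model_b(intent: str) -> str:
    return 'QUERY users WHERE status = "active" LIMIT 100 RETURN [id, name, email]'


def model_c(intent: str) -> str:
    return "Sure! Here is the query: SELECT * FROM users"


for backend in (model_a, model_b, model_c):
    try:
        print(backend.__name__, "->", plan_from(backend, "list active users"))
    except ValueError as err:
        print(backend.__name__, "->", err)
```

The execution layer never inspects which backend produced the text, so swapping one for another changes nothing downstream as long as the output still satisfies the grammar.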

The Deeper Principle

The LLM-to-DSL pattern is an instance of a broader principle that keeps appearing across the history of computing: the most reliable systems are the ones that make unsafe states inexpressible, not the ones that catch unsafe states at runtime.

Type systems do this for data. Memory ownership models do this for allocation. Formal grammars do this for syntax. The LLM-to-DSL pattern does this for agentic execution.

General-purpose languages build the engines. DSLs constrain the agents.

The teams that will win in production agentic infrastructure are not the ones with the best models. They are the ones that figured out the boundary between AI reasoning and system execution — and made that boundary structurally enforced.

Have you hit execution safety failures in an agentic system you were building? I would like to know where the boundary broke down — and what architectural choices you made in response.