
David Aronchick


While Everyone Argues About AI Regulation, Data Is the Real Wild West

Last month, Colorado's AI Act went into full effect. California's considering similar legislation. New York has its own version in committee. Meanwhile, Sam Altman sat across from Brad Gerstner and said something quietly alarming: "I don't know how we're supposed to comply with that Colorado law. I would love them to tell us what we're supposed to do."

Not "we disagree with it." Not "it's too burdensome." Simply: we don't know what compliance even means.

This same week, OpenAI released a new licensing framework positioning themselves as more responsible, less "edgy." They can write compliance-shaped language. They can commit to trying harder. But operational clarity on HOW to comply with Colorado's specific requirements? That's the gap Sam's describing. You can promise to be good without knowing what good means in fifty different state definitions.

I GET the intent here - a clean regulatory framework is an accelerant to AI development, not a source of friction. When you have things that many people, including prominent AI researchers, are saying could bring about the downfall of humanity, a little regulatory thought seems... good?

However, I also agree that fifty states creating fifty different interpretations of concepts like "algorithmic discrimination" is not ideal. This patchwork approach, as seen with some state-level privacy laws, will likely just result in a bunch of lawyers figuring out fifty different ways to get sued by someone claiming harm from a chatbot.

But I ALSO think that nobody's talking about the bottom 2/3rds of the iceberg: while we're building this elaborate regulatory framework for AI, we've created zero coherent rules for the thing AI actually runs on. Data.
The Infrastructure Nobody Regulates
Jensen Huang dropped a really big number. Like... REALLY big. Inference workloads are about to scale by a billion times. Not 10x. Not 100x. A billion times.

Now map that against what Satya Nadella said last week: Microsoft is "short on power and infrastructure" and has been for "many quarters." They're not compute-constrained in the traditional sense. They literally can't plug in all the GPUs they have because they don't have enough warm shells near power sources.

Connect these dots. We're about to generate a billion times more inference compute. That compute requires data: training data, context data, real-time data. And all that data has to move.

Where's it moving? Not just to hyperscale data centers in Virginia and Oregon. It's moving to edge devices, to on-premises deployments, to robots on factory floors. Moving across state lines, across national borders, through networks we barely understand and certainly don't regulate coherently.

The Colorado AI Act? It focuses on model outputs and bias. It says nothing substantive about data sovereignty, data movement costs, or the physics of moving petabytes across networks that weren't designed for AI workloads.
The Ghost in the Machine: Edge Computing
Both Jensen and Satya hinted at something profound: the future of AI isn't just massive centralized compute farms. Sam Altman said it explicitly: "Someday we will make an incredible consumer device that can run a GPT-5 or GPT-6 capable model completely locally at a low power draw."

Aside: it's quite unlikely to be anything like a GPT-5/6 model; those are far too general. Local models will more likely be heavily [quantized](https://huggingface.co/docs/optimum/en/concept_guides/quantization?ref=distributedthoughts.org) and [distilled](https://labelbox.com/guides/model-distillation/?ref=distributedthoughts.org), but the point remains the same.
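To make the aside concrete, here's a minimal sketch of what "shrink it until it runs locally" can look like, using PyTorch dynamic quantization on a toy stand-in model. The layer sizes and the rough 4x figure are illustrative assumptions, not measurements from any real GPT-class model.

```python
# Minimal sketch: shrinking a model for on-device inference with
# PyTorch dynamic quantization. The toy model stands in for a much
# larger network; the technique, not the architecture, is the point.
import io

import torch
import torch.nn as nn

# A stand-in block (purely illustrative sizes).
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Quantize the Linear layers' weights from fp32 to int8;
# activations are quantized dynamically at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m: nn.Module) -> float:
    """Serialize the model and report its footprint in MB."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 model: {size_mb(model):.1f} MB")
print(f"int8 model: {size_mb(quantized):.1f} MB")  # roughly 4x smaller
```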

Think about what that means. Not "someday far in the future." Someday. Soon enough to matter for business planning.

Right now, if you want to run a sophisticated AI model, you're paying inference costs to someone's cloud. You're sending your data to their servers, getting tokens back, and hoping the economics work out. The unit economics of AI today look nothing like search - Satya admitted that search had magical economics because you built one index and amortized it across billions of queries. Chat burns GPU cycles for every interaction.

But what if the model runs locally? On your phone. In your car. On the robot in your warehouse. Suddenly, the economics flip. No inference costs. No latency from round-trips to datacenters. No bandwidth constraints.
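Here's a back-of-envelope sketch of why the economics flip. Every number below is a placeholder assumption (fleet size, token price, amortized hardware cost), not anyone's actual pricing; swap in your own and see where the lines cross.

```python
# Back-of-envelope: cloud-hosted vs. on-device inference for a device fleet.
# Every number is a placeholder assumption -- substitute your own.

DEVICES                   = 10_000   # hypothetical fleet size
INTERACTIONS_PER_DEVICE   = 200      # hypothetical interactions/device/day
TOKENS_PER_INTERACTION    = 2_000    # hypothetical prompt + completion tokens
CLOUD_PRICE_PER_1K_TOKENS = 0.01     # hypothetical blended $/1K tokens
LOCAL_COST_PER_DEVICE_DAY = 0.25     # hypothetical amortized hardware + power, $/device/day

daily_tokens = DEVICES * INTERACTIONS_PER_DEVICE * TOKENS_PER_INTERACTION
cloud_daily  = daily_tokens / 1_000 * CLOUD_PRICE_PER_1K_TOKENS
local_daily  = DEVICES * LOCAL_COST_PER_DEVICE_DAY

print(f"Cloud inference: ${cloud_daily:,.0f}/day")  # scales with every token
print(f"Local inference: ${local_daily:,.0f}/day")  # scales with devices, not usage
```

The structural point survives any particular set of numbers: cloud inference cost grows with usage, while local inference cost grows with the number of devices you ship.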

And no coherent regulatory framework for any of it.
Three Regulatory Failures We're Ignoring
The patchwork AI regulation problem is real. But it's hiding three deeper issues that actually matter more:

Data residency requirements that ignore physics. Europe wants data to stay in Europe. China wants data to stay in China. California wants certain data to stay private. None of these regulatory regimes acknowledge that modern AI architectures require massive context windows, real-time updates, and distributed training. You can't just "keep the data in Germany" when your model needs to learn from global patterns. The latency costs alone make certain applications impossible.

No standards for data movement costs. When Satya talks about needing $250 billion in Azure commitments from OpenAI over five years, a huge portion of that is about data movement. Moving training data. Storing it in distributed, multi-region buckets. Preprocessing data locally (using local VMs) to ready it for execution. Every byte costs money in bandwidth, CPU cycles, and latency. The result? Architectures where legal compliance comes as an afterthought end up as a pile of band-aids and inefficiency rather than anything that makes technical or economic sense.
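To put rough numbers on it, here's a hedged back-of-envelope calculation. The egress rate, corpus size, and sync cadence are all made-up assumptions rather than any provider's real pricing, but the shape of the bill is the point.

```python
# Back-of-envelope: what moving training data between regions can cost.
# All rates and sizes are placeholders, not real cloud pricing.

EGRESS_PER_GB   = 0.08   # hypothetical $/GB cross-region egress
DATASET_TB      = 500    # hypothetical training corpus size
REFRESH_PER_MO  = 4      # hypothetical full re-syncs per month
REPLICA_REGIONS = 3      # hypothetical multi-region bucket copies

gb_moved_monthly = DATASET_TB * 1_000 * REFRESH_PER_MO * REPLICA_REGIONS
monthly_cost = gb_moved_monthly * EGRESS_PER_GB

print(f"Data moved:  {gb_moved_monthly / 1_000:,.0f} TB/month")
print(f"Egress bill: ${monthly_cost:,.0f}/month")  # before storage, compute, latency
```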

Edge deployment creates a mismatch between regulatory boundaries and technical architecture. Here's the critical distinction: we already know how to handle devices crossing state lines. You can carry a laptop anywhere, and nobody cares what calculations it performs. But AI regulations aren't about device movement; they're attempting to regulate algorithmic outputs and decisions. Colorado's Act requires explanations for algorithmic decisions. California's proposals mandate bias monitoring. These laws care about WHAT the model does, not just that it exists.

Politicians will find boundaries to enforce; they always do. Points of sale. Import/export restrictions. Certification requirements. Like tariffs: prove your product meets our standards or you can't sell here. The power lies in controlling the pipes. (thanks Squillace for this point!)

But here's the problem: those enforcement boundaries won't align with distributed technical architecture. Now put that model on your phone. You're traveling through three states. The model makes decisions about what content to show you, what actions to recommend, what patterns to flag. Which state's definition of "algorithmic discrimination" applies at the moment of inference? Or consider a robot in a Texas warehouse, using a model trained in California on data from customers in fifty states, certified for sale in New York. When it makes a decision about inventory routing that affects a worker, which jurisdiction's compliance requirements matter? The training location? The deployment location? The point of sale? The data sources?

Regulators will enforce at boundaries they control. But those boundaries - borders, sales, certification - don't map to where AI decisions actually happen. Edge deployment means the regulated boundaries and the technical boundaries exist in different dimensions.
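If nothing else, the question should at least be answerable after the fact. Here's a sketch, with purely illustrative field names, of what tagging an inference event with every jurisdiction it touches might look like. It doesn't resolve which rules win; it just makes the collision visible.

```python
# Sketch: record every jurisdiction an inference event touches, so a
# compliance question ("whose rules applied at this decision?") is at
# least answerable after the fact. Field names are illustrative.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class InferenceEvent:
    model_id: str                     # which weights made the decision
    trained_in: str                   # where the training run happened
    training_data_regions: list[str]  # where the training data came from
    certified_in: list[str]           # where the product was certified/sold
    executed_in: str                  # where the device was at inference time
    decision: str                     # what the model actually decided
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def jurisdictions(self) -> set[str]:
        """Every jurisdiction with a plausible claim on this decision."""
        return {self.trained_in, self.executed_in,
                *self.training_data_regions, *self.certified_in}

event = InferenceEvent(
    model_id="warehouse-nav-v7",
    trained_in="US-CA",
    training_data_regions=["US-CA", "US-TX", "US-NY", "EU-DE"],
    certified_in=["US-NY"],
    executed_in="US-TX",
    decision="reroute_inventory_lane_4",
)
print(event.jurisdictions())  # several regimes, one millisecond-scale decision
```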
The Compute-Over-Data Inversion
I've been thinking about distributed systems for longer than I care to admit, and I keep coming back to a fundamental principle: moving data is expensive. Moving compute is cheap(-er). We spent the last decade building centralized cloud architectures because centralization meant economies of scale.

But AI didn't break this model. We've known for years that network bandwidth, while increasing massively, wasn't keeping pace with edge data creation. The physics were already wrong. AI just made the cost so catastrophic we can't ignore it anymore.

Five years ago, you couldn't move that much data fast enough. You couldn't power that many data centers efficiently. You couldn't build network infrastructure quickly enough to handle the load. Those constraints haven't disappeared - they've compounded. When your inference workload scales by a billion times, centralization transforms from inefficient to economically impossible.
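A quick, hedged sanity check on the physics, with illustrative numbers rather than measurements: compare how much data one site creates per day against how much a sustained WAN link can actually ship.

```python
# Back-of-envelope: edge data creation vs. what a WAN link can carry.
# Numbers are illustrative assumptions, not measurements.

LINK_GBPS     = 1    # hypothetical sustained uplink from one site
DAILY_EDGE_TB = 50   # hypothetical data created at that site per day

seconds_per_day = 24 * 3600
bytes_per_day   = DAILY_EDGE_TB * 1e12
link_bytes_day  = LINK_GBPS / 8 * 1e9 * seconds_per_day

print(f"Created per day:   {bytes_per_day / 1e12:.0f} TB")
print(f"Shippable per day: {link_bytes_day / 1e12:.0f} TB")
# When the first number outgrows the second, centralization stops being
# an engineering preference and becomes physically impossible.
```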

The solution isn't bigger datacenters. It's distributing compute to where the data already lives.

This is why Jensen, Satya, Sama, and others keep talking about edge AI and "fungible fleets" across geographies and workloads. Why there's a race to get frontier models running on consumer devices. They're all describing the same architectural shift: from centralized compute with data movement, to distributed compute with data locality.

But our regulatory frameworks assume centralization. They assume enforceable boundaries align with the technical architecture: that you can identify where AI happens, who's responsible, and what jurisdiction applies at the moment of decision. Regulators will enforce where they can: at borders, at points of sale, at certification. But AI decisions happen everywhere else.
The Robotics Wildcard
And, as usual, there's no problem that can't be made worse by real-world implementation. Look at robotics.

This isn't science fiction; Figure, Tesla, Boston Dynamics, and a dozen Chinese companies are shipping real robots that use real AI models. These robots need to make decisions in milliseconds. They can't wait for a round-trip to a datacenter.

So they'll run models locally. Trained on data from multiple jurisdictions. Updated via networks that cross state and national boundaries. Operating in physical spaces where privacy, safety, and liability rules differ dramatically.

Colorado's AI Act requires that people can request explanations for algorithmic decisions that affect them. Fine. Now explain that to a robot that uses a locally-running vision model trained on 100 million images from 30 countries, making real-time decisions about navigation, object manipulation, and human interaction.

Nobody's answered the fundamental question: are model weights data? Legally, if weights aren't "data," then data sovereignty requirements don't apply. But technically, weights ARE data; they're the compressed, encoded representation of training data. You can't meaningfully separate them.

So when someone says "explain your algorithmic decision," and I'm using a locally-running model, what am I explaining? The weights (which encode patterns from millions of training examples)? The specific inference path? The training data provenance? If courts say "weights aren't data," we've created a massive loophole. If they say "weights ARE data," then every edge device becomes a data sovereignty nightmare. This ambiguity alone makes most current regulations operationally meaningless.
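For what it's worth, the technical half of the question is easy to demonstrate. A small sketch (toy model, illustrative names): weights serialize to bytes, hash, and ship exactly like any other dataset.

```python
# Sketch: model weights serialize, hash, and move like any other dataset,
# which is why "are weights data?" is hard to dodge technically.
import hashlib
import io

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for a real model

# Serialize the weights exactly as you would before shipping them to an
# edge device or across a border.
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
weight_bytes = buf.getvalue()

print(f"Payload size: {len(weight_bytes) / 1e6:.1f} MB of bytes on the wire")
print(f"SHA-256:      {hashlib.sha256(weight_bytes).hexdigest()[:16]}...")
# From the network's point of view, this is indistinguishable from a CSV of
# customer records. The distinction, if there is one, has to come from law.
```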
What Actually Needs Regulating
The EU is using the AI Act to focus on risk assessment before deployment, data governance during training, and safety evaluation before market entry. That's regulating where data gets encoded into weights, earlier in the pipeline. It's a product safety model: want to deploy a model? Get it certified. Make the training process auditable. (thanks @marypcbuk.bsky.social!)

US state regulations focus on output monitoring, bias detection after deployment, and explanation rights for end users. That's output-layer regulation; attempting to govern what happens AFTER the infrastructure decisions are already locked in.

If you want to regulate AI effectively, add a regulatory framework for the data layer. Set clear rules for:
1. Model weights status. Are they data? Establish this legally before building another decade of edge AI on ambiguity.
2. Data provenance and lineage. Not just "where did this data come from" but "who touched it, when, and how did it change?" Make data transformations auditable from source to training set to weight generation (see the sketch after this list).
3. Cross-border data flow standards. Not blanket prohibitions but sensible frameworks that acknowledge the technical requirements of distributed training while protecting legitimate sovereignty concerns.
4. Edge device accountability. Clear standards for who's responsible when locally-running models make decisions. Is it the device manufacturer? The model provider? The end user? The update service? Define the liability chain before millions of devices ship.
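On the provenance and lineage point, here's a sketch of what an auditable chain from raw source to trained weights could look like. The record fields, actors, and payloads are hypothetical; the idea is simply that every transformation is hash-linked to the one before it.

```python
# Sketch: an auditable lineage chain from raw source to trained weights.
# Each step records who touched the data, how it changed, and a hash of
# the output, linked to the previous step. Names/fields are illustrative.
import hashlib
import json
from datetime import datetime, timezone

def step(prev_hash: str, actor: str, action: str, payload: bytes) -> dict:
    """Append one transformation to the lineage chain."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,                 # who touched it
        "action": action,               # how it changed
        "prev": prev_hash,              # link to the previous step
        "output_sha256": hashlib.sha256(payload).hexdigest(),
    }
    record["record_sha256"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

raw     = b"...raw sensor logs..."                 # placeholder payloads
cleaned = b"...deduplicated, PII-scrubbed logs..."
weights = b"...serialized model state_dict..."

chain = [step("genesis", "ingest-service", "collect", raw)]
chain.append(step(chain[-1]["record_sha256"], "etl-job-42", "scrub_pii", cleaned))
chain.append(step(chain[-1]["record_sha256"], "train-run-7", "train_to_weights", weights))

for r in chain:
    print(r["action"], "->", r["output_sha256"][:12])
```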
But we're not doing any of this. We're writing laws about chatbot outputs while ignoring the infrastructure those chatbots run on. It's like regulating cars by specifying wheel sizes while ignoring road standards, traffic laws, and fuel regulations.
The Path Forward
Federal preemption would help, as Sam and Satya both noted. One set of rules beats fifty competing ones. But even federal rules focused on AI outputs miss the point; they'd still leave the data layer untouched: data movement, edge deployment standards, model and data lineage, versioning, and provenance, and liability frameworks for distributed systems.

In five years, hyperscale data centers will still have their place in AI, but I'm with Jensen on the bet from the top of this post: I think distributed AI will be a billion times larger, spread across edge devices, on-premises systems, and specialized hardware. It'll run locally, update occasionally, and move data constantly. And, unless we do it right, we'll still be arguing about chatbot bias while the real infrastructure remains unregulated.

The best time to regulate data was ten years ago. The second-best time is now, before we build another decade of AI on top of regulatory sand.

That's not a policy I'd bet on.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

NOTE: I'm currently writing a book, based on what I've seen, about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost concerns. I'd love to hear your thoughts!


Originally published at Distributed Thoughts.
