Michael Smith

Posted on Jun 15

Rio's "Homegrown" LLM: A Merge of Existing Models?

#discuss #news #tech #ai

TL;DR: Rio de Janeiro launched what it promoted as a locally developed AI language model, but technical analysis suggests it may be a merge or fine-tune of existing open-source models rather than a ground-up creation. This raises questions about transparency, accountability for public funding, and what "homegrown AI" really means for governments investing in the technology.

Key Takeaways

Rio de Janeiro's city-backed LLM was marketed as a homegrown, locally developed AI model
Technical fingerprinting and community analysis suggest it is a merge or fine-tune of an existing open-source model
Merging is not a technical failure in itself; the controversy is the "homegrown" framing
Government AI projects need clearer transparency standards around model provenance
The incident highlights a broader pattern of AI "washing" in public sector tech announcements
Residents, journalists, and policymakers have actionable steps to demand better accountability

What Happened With Rio de Janeiro's AI Model?

In mid-2025, Rio de Janeiro's municipal government made headlines by announcing it had developed its own large language model — a significant claim for any city government, let alone one in a developing economy. The announcement was framed with considerable civic pride: a Brazilian city building its own AI infrastructure, reducing dependency on foreign tech giants, and investing in local talent.

The problem? Technical investigators, open-source AI researchers, and members of the Brazilian AI community began poking under the hood. What they found raised serious questions: the evidence points to the model being a derivative or merged variant of a pre-existing open-source foundation model, with relatively modest local fine-tuning applied on top.

This distinction matters enormously, both technically and politically.

Understanding the Technical Difference: Built vs. Merged

Before diving into the controversy, it's worth understanding what these terms actually mean — because the difference between "building" and "merging" an LLM is substantial.

What Does "Building" an LLM From Scratch Mean?

Developing a genuine foundation model from scratch requires:

Massive compute infrastructure — typically hundreds to thousands of GPUs running for months
Curated training datasets — often hundreds of billions to trillions of tokens of text
A dedicated research team — typically dozens of ML engineers, researchers, and data scientists
Significant financial investment — foundation model training runs routinely cost millions of dollars
Novel architectural decisions — choices about model size, attention mechanisms, tokenization

This is what organizations like Meta (with LLaMA), Mistral AI, and Google do when they release foundation models.

What Is a Model Merge?

A model merge is a significantly different (and far less resource-intensive) process:

Takes existing pre-trained models as a starting point
Combines weights from multiple models using techniques like SLERP, TIES, or DARE merging
May include additional fine-tuning on domain-specific data
Requires far less compute — often achievable on a single high-end server
Does not require novel training from raw data

Model merging is a legitimate and increasingly popular technique in the open-source AI community. Tools like MergeKit have made it accessible to small teams and individuals. But it is categorically not the same as building a model from the ground up.
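To make "combining weights" concrete, here is a minimal sketch of SLERP (spherical linear interpolation) applied layer by layer to two toy models. It is not MergeKit's implementation and says nothing about how Rio's model was actually produced; the names `slerp` and `merge_models` and the tiny two-value "layers" are assumptions made for illustration.

```python
import math

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherically interpolate between two flattened weight vectors (0 <= t <= 1)."""
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    # Cosine of the angle between the vectors, clamped to acos's valid domain.
    dot = sum(x * y for x, y in zip(w_a, w_b)) / (norm_a * norm_b + eps)
    dot = max(-1.0, min(1.0, dot))
    omega = math.acos(dot)
    # Nearly parallel vectors: fall back to plain linear interpolation.
    if math.sin(omega) < eps:
        return [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]
    s_a = math.sin((1 - t) * omega) / math.sin(omega)
    s_b = math.sin(t * omega) / math.sin(omega)
    return [s_a * x + s_b * y for x, y in zip(w_a, w_b)]

def merge_models(model_a, model_b, t=0.5):
    """Merge two toy models (dicts of layer name -> flat weight list)."""
    return {name: slerp(model_a[name], model_b[name], t) for name in model_a}

# Hypothetical toy "models" sharing the same layer names and shapes.
parent_a = {"layer.0": [1.0, 0.0], "layer.1": [0.5, 0.5]}
parent_b = {"layer.0": [0.0, 1.0], "layer.1": [0.5, -0.5]}
merged = merge_models(parent_a, parent_b, t=0.5)
```

Note what the sketch does not contain: no training loop, no dataset, no gradient step. A merge is arithmetic over weights that someone else already trained, which is why it can run on a single server.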

The Spectrum Between the Two

It's also worth noting there's a real spectrum here:

Approach	Compute Required	Novelty	Legitimate?
Full pretraining from scratch	Extreme	High	Yes
Continued pretraining on new data	High	Medium-High	Yes
Supervised fine-tuning (SFT)	Moderate	Medium	Yes
Model merging + fine-tuning	Low-Moderate	Low-Medium	Yes
Model merge with rebranding	Minimal	Low	Ethically questionable

The last row is where Rio's project appears to land — and that's the crux of the controversy.

How Researchers Identified the Model's Origins

The AI research community has developed increasingly sophisticated methods for identifying model provenance. Here's how investigators pieced together the story of Rio de Janeiro's "homegrown" LLM.

Model Fingerprinting Techniques

Modern LLMs leave detectable "fingerprints" that often survive merging and fine-tuning:

Weight similarity analysis — comparing model weight distributions to known public models
Benchmark performance patterns — models tend to preserve characteristic strengths and weaknesses from their base
Tokenizer analysis — most merged models retain the original tokenizer, which is a strong identifier
Output style analysis — certain quirks in generation behavior persist through fine-tuning
Metadata inspection — model cards, configuration files, and architecture parameters often reveal origins

Researchers using tools like LM Evaluation Harness for benchmark comparison and direct weight inspection identified strong similarities between Rio's model and existing open-source models — similarities too close to be coincidental.
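As a rough illustration of weight similarity analysis, the sketch below compares two models layer by layer using cosine similarity. It is an assumption about how such a check could look, not the method the investigators used. The toy weights, layer names, and helper names (`layer_similarities`, `mean_similarity`) are all hypothetical; with real checkpoints the tensors would be loaded from the published weight files instead.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def layer_similarities(suspect, candidate):
    """Score every layer the two models share by name and size."""
    scores = {}
    for name, weights in suspect.items():
        other = candidate.get(name)
        if other is not None and len(other) == len(weights):
            scores[name] = cosine_similarity(weights, other)
    return scores

def mean_similarity(scores):
    """Average the per-layer scores; 0.0 if no layers were comparable."""
    return sum(scores.values()) / len(scores) if scores else 0.0

# Hypothetical toy weights: `tuned` is `base` with small perturbations,
# `unrelated` stands in for an independently trained model.
base = {"attn.q": [0.2, -0.5, 0.1, 0.7], "mlp.up": [0.3, 0.3, -0.8, 0.1]}
tuned = {"attn.q": [0.21, -0.49, 0.1, 0.69], "mlp.up": [0.3, 0.31, -0.79, 0.1]}
unrelated = {"attn.q": [-0.6, 0.1, 0.5, -0.2], "mlp.up": [0.1, -0.7, 0.2, 0.4]}

tuned_score = mean_similarity(layer_similarities(tuned, base))
unrelated_score = mean_similarity(layer_similarities(unrelated, base))
```

Light fine-tuning tends to leave most weights close to the base, so per-layer similarity stays high, while independently trained models of the same shape tend to show little alignment. Any cutoff for "too close to be coincidental" is a judgment call and should be reported along with the raw scores.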

The Open-Source Community's Role

Much of this detective work happened in public forums — Hugging Face discussions, Brazilian AI community Discord servers, and academic social media. This kind of distributed, open-source scrutiny is one of the most valuable accountability mechanisms in modern AI development.

The community's findings weren't malicious — many researchers were genuinely curious and even supportive of the project's goals. The concern was specifically about the framing, not the underlying technical approach.

Why the "Homegrown" Label Matters

You might reasonably ask: does it really matter if Rio used an existing model as a base? Isn't building on top of open-source work exactly how the ecosystem is supposed to work?

The answer is nuanced — and this is where the story gets genuinely interesting.

The Legitimate Case for Building on Open Source

Using open-source foundation models as a base is:

Standard industry practice — virtually every commercial LLM application does this
Cost-effective — it makes AI accessible to organizations without billion-dollar budgets
Technically sound — fine-tuning a strong base model often outperforms training a weaker one from scratch
Permitted by the models' licenses: Mistral has released models under Apache 2.0, and Meta's LLaMA models ship under a community license that allows fine-tuning and derivative models, subject to its conditions

If Rio had said "we've built a Portuguese-language AI assistant for city services, built on top of [Model X] with local fine-tuning," that would have been a perfectly respectable announcement.

The Problem With the "Homegrown" Framing

The issue is that "homegrown" implies something it wasn't. Specifically:

Public funding accountability — if taxpayer money funded the project, the public deserves accurate information about what was built
Misleading civic narrative — the announcement leveraged national pride around local AI development
Procurement questions — did the city pay for something that was largely already built?
Skills development claims — if the model wasn't truly built locally, the claimed benefits for local AI talent development are overstated
Reproducibility and trust — government AI systems should be auditable and transparent

This pattern — what some researchers are calling "AI washing" in the public sector — is becoming more common globally as governments rush to announce AI initiatives without always being precise about what they've actually built.

Rio's Situation in a Global Context

Rio de Janeiro is far from alone in this. The pressure on governments to demonstrate AI capability has created a troubling incentive structure.

A Growing Pattern of Government AI Washing

Across the world, government AI announcements frequently obscure the distinction between:

Deploying a third-party AI API (e.g., wrapping GPT-4)
Fine-tuning an existing open-source model
Building a genuinely novel system

Some notable patterns researchers have identified:

Rebranded API wrappers presented as national AI systems
Fine-tuned models described as "developed from the ground up"
Procurement of commercial AI described as "homegrown innovation"
Proof-of-concept demos presented as production-ready infrastructure

This matters because public AI investments are often justified on the basis of building local capability, reducing foreign dependency, and creating jobs — goals that are undermined when the actual work is primarily integration and rebranding.

What Genuine Municipal AI Development Looks Like

For contrast, there are examples of more transparent and genuinely novel municipal AI projects:

Barcelona's data sovereignty initiatives — clear about using open-source tools while being transparent about architecture
Singapore's GovTech AI projects — publish detailed model cards and methodology
Estonia's digital governance AI — open about partnerships and the role of international vendors

The common thread is transparency about what was actually built and how.

What This Means for AI Policy and Procurement

For readers working in government, policy, or civic tech, Rio's situation offers concrete lessons.

Questions Governments Should Be Required to Answer

When a public body announces an AI system, journalists, auditors, and citizens should demand answers to:

What is the base model? Was it trained from scratch or derived from an existing model?
What training data was used? Is it locally sourced? What are the privacy implications?
What was the total cost? Including compute, personnel, and vendor contracts
Who built it? Internal team, contractor, or vendor?
What is the model license? Are there restrictions on how it can be used or modified?
How is it being evaluated? What benchmarks and safety evaluations were conducted?
Is the model publicly accessible? Can independent researchers audit it?
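One way to make these questions checkable is to treat them as a structured disclosure record and flag whatever is left blank. The sketch below is only an illustration: the `ProvenanceDisclosure` class and its field names are hypothetical, not an established standard or an existing model card schema.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ProvenanceDisclosure:
    """One field per question above; None means the question is unanswered."""
    base_model: Optional[str] = None      # trained from scratch, or derived from what?
    training_data: Optional[str] = None   # sources and privacy implications
    total_cost: Optional[str] = None      # compute, personnel, vendor contracts
    builder: Optional[str] = None         # internal team, contractor, or vendor
    license: Optional[str] = None         # restrictions on use or modification
    evaluation: Optional[str] = None      # benchmarks and safety evaluations
    public_access: Optional[str] = None   # can independent researchers audit it?

def unanswered(disclosure):
    """Return the names of all fields that are missing or empty."""
    return [f.name for f in fields(disclosure) if not getattr(disclosure, f.name)]

# Hypothetical announcement that names a base model and nothing else.
announcement = ProvenanceDisclosure(base_model="Model X")
gaps = unanswered(announcement)
```

A journalist or auditor could fill in one record per announcement and publish the list of gaps next to the official claims.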

Recommended Tools for Independent AI Auditing

For researchers, journalists, and civic technologists who want to investigate government AI claims:

Hugging Face Hub — the primary repository for open-source models; useful for comparing architectures and model cards
LM Evaluation Harness — open-source benchmarking framework for systematic model evaluation
MergeKit — understanding this tool helps you understand what model merging actually involves
Weight analysis tools — Python libraries like safetensors and torch can be used to compare model weight distributions
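Before touching weights at all, a first-pass check can compare tokenizer vocabularies and architecture metadata, both of which are usually published alongside a model. The sketch below assumes the vocabularies and configs have already been loaded into plain dictionaries, for example with `json.load`. The helper names and toy values are hypothetical, and the config keys shown are common in Hugging Face `config.json` files but vary by architecture.

```python
def vocab_overlap(vocab_a, vocab_b):
    """Jaccard overlap of two token -> id maps, counting only identical pairs."""
    pairs_a, pairs_b = set(vocab_a.items()), set(vocab_b.items())
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0

# Architecture fields to compare; an assumption, adjust per model family.
ARCH_KEYS = ("hidden_size", "num_hidden_layers", "num_attention_heads", "vocab_size")

def matching_arch_keys(config_a, config_b, keys=ARCH_KEYS):
    """List the architecture keys on which both configs agree exactly."""
    return [k for k in keys if k in config_a and config_a[k] == config_b.get(k)]

# Hypothetical toy data standing in for two models' published files.
suspect_vocab = {"<s>": 0, "</s>": 1, "ol": 2, "á": 3, "rio": 4}
base_vocab = {"<s>": 0, "</s>": 1, "ol": 2, "á": 3}
suspect_config = {"hidden_size": 4096, "num_hidden_layers": 32,
                  "num_attention_heads": 32, "vocab_size": 5}
base_config = {"hidden_size": 4096, "num_hidden_layers": 32,
               "num_attention_heads": 32, "vocab_size": 4}

overlap = vocab_overlap(suspect_vocab, base_vocab)
shared_arch = matching_arch_keys(suspect_config, base_config)
```

Matching shapes alone prove little, since many models share common dimensions. A near-identical vocabulary with the same token ids is a much stronger hint, and it is worth following up with a weight-level comparison.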

What Rio Should Do Next

This doesn't have to be a dead end. The city has an opportunity to turn this controversy into a model (no pun intended) for better practice.

Recommended Steps for Transparency Recovery

Publish a full model card — document the base model, training data, fine-tuning methodology, and evaluation results
Open-source the fine-tuned weights — if built with public funds, the public should have access
Clarify the procurement process — who was paid, for what work, and at what cost
Commission an independent technical audit — have a neutral third party verify the model's origins and capabilities
Reframe the narrative accurately — "built on open-source AI, customized for Rio" is still a good story

The Brazilian AI community is vibrant and talented. There's a genuinely compelling story to tell about local adaptation of AI for Portuguese-language services — but it has to be told honestly.

Frequently Asked Questions

Q: Is using an existing model as a base actually wrong or illegal?

No, not inherently. Using open-source models as a foundation is standard practice and is explicitly permitted by most open-source licenses. The ethical issue arises specifically when this practice is misrepresented as building something from scratch, especially when public funds and civic trust are involved.

Q: How can I tell if a government AI announcement is genuine?

Look for specifics: training data sources, compute costs, architecture details, and model cards. Vague claims about "developing" AI without technical documentation are a red flag. Genuine projects can point to reproducible methodology and independent evaluation.

Q: What is model merging, exactly?

Model merging combines the weights of two or more pre-trained models into a single model, sometimes producing results that outperform either parent model on specific tasks. It's a legitimate technique widely used in the open-source community, but it requires far less original work than training a model from scratch.

Q: Does this mean Rio's AI system doesn't work or isn't useful?

Not necessarily. A well-executed fine-tune or merge of a strong base model can be highly effective for specific applications. The question is about accuracy in how it was presented — not necessarily about whether the resulting tool is functional or valuable.

Q: What should citizens do if they're concerned about government AI projects?

File freedom of information requests for procurement documents and technical specifications. Support local journalism covering AI policy. Engage with civic tech organizations that audit government technology. And share findings from independent researchers — community scrutiny is one of the most effective accountability mechanisms we have.

The Bottom Line

Rio de Janeiro's "homegrown" LLM appears to be a merge built on existing models. That is a technically defensible approach, but framing it as a locally built AI system does a disservice to the public that funded it and the community that celebrated it.

The real story here isn't about technical failure. It's about a transparency gap that's becoming increasingly common as governments race to claim AI credibility. The open-source AI community's ability to identify and publicize these gaps is genuinely valuable — and it's a reminder that in the age of open-weight models, claims about AI provenance are checkable.

What you can do right now:

If you're a journalist or researcher, use the auditing tools listed above to investigate AI claims in your own region
If you're a policymaker, push for mandatory model cards and open procurement documentation for any publicly funded AI system
If you're a citizen, ask your local government: what exactly did you build, and how much did it cost?

The technology is getting more accessible every year. That's a good thing. But accessibility doesn't mean the work of building something genuinely new has been done — and the public deserves to know the difference.

Have thoughts on government AI transparency? Found similar patterns in your region? The conversation is happening — and your voice matters in shaping how public AI is built and disclosed.
