Clear Code Intelligence

Posted on Jun 12

What We Learned Scanning Microsoft's Public Agent Framework Repository

#architecture

Clear Code Intelligence scanned a public Microsoft repository: microsoft/agent-framework.

This is not a dunk on Microsoft.

It is a public-code methodology test.

Microsoft's public GitHub organization is verified and publishes thousands of open-source repositories. microsoft/agent-framework is especially relevant because it is a framework for production-grade AI agents and multi-agent workflows.

That makes it a strong example of a new technical debt problem:

Large AI-agent frameworks need scope-aware technical debt reporting.

What We Scanned

The Clear Code scan reviewed the public microsoft/agent-framework repository and produced a 31-page technical diligence report.

The scan reported:

4,620 analyzed files
703,170 lines of code
250 report findings
1,156 raw managed findings
high AI token debt risk

The scorecard was severe:

Area	Score
Overall diligence	29/100
Projected after remediation	47/100
Architecture	100/100
Delivery	70/100
Maintainability	0/100
AI governance	0/100

That raw result needs careful interpretation.

This is a large repository with Python packages, .NET packages, frontend tooling, samples, documentation, test fixtures, generated-looking assets, and integration examples. A useful technical debt report cannot treat all of those scopes the same way.

The Important Lesson Is Scope

One example from the scan illustrates the point.

The scanner flagged an AWS access-key-shaped value in sample documentation:

AWS_ACCESS_KEY_ID | AKIAIOSFODNN7EXAMPLE

That value matches the AWS access key ID pattern.

But it is also the placeholder access key that AWS uses in its own documentation.

A noisy scanner would call this a breach.

A serious technical debt report should classify it as one of:

documentation example
active secret
test fixture
false positive
accepted risk
missing safe-example annotation

That classification step is critical.

The value still deserves evidence and review. But the remediation is probably not "rotate production credentials." The more likely remediation is to make the example classification explicit, so humans and AI agents do not keep rediscovering the same context.
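As a rough illustration of that classification step, here is a minimal Python sketch. The function name `classify_key_finding`, the allowlist, and the directory rules are assumptions made for this example, not a description of how Clear Code works. The only value taken from the scan is the example key quoted above.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# A commonly used detection shape for AWS access key IDs:
# "AKIA" followed by 16 uppercase letters or digits.
AWS_KEY_SHAPE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

# Assumption: an allowlist of known placeholder keys. Only the value
# quoted in this post is listed.
KNOWN_EXAMPLE_KEYS = {"AKIAIOSFODNN7EXAMPLE"}

# Assumption: directory names that mark test data (hypothetical convention).
TEST_DIRS = {"tests", "fixtures"}


def classify_key_finding(path: str, line: str) -> Optional[str]:
    """Return a classification label, or None if no key-shaped value is present."""
    match = AWS_KEY_SHAPE.search(line)
    if match is None:
        return None
    if match.group(0) in KNOWN_EXAMPLE_KEYS:
        return "documentation example"
    if TEST_DIRS & set(PurePosixPath(path).parts):
        return "test fixture"
    # Anything else still needs a human decision before it is called a secret.
    return "needs review: possible active secret"


print(classify_key_finding("docs/setup.md", "AWS_ACCESS_KEY_ID | AKIAIOSFODNN7EXAMPLE"))
```

The point of the sketch is the ordering: a known placeholder is labeled first, and only an unrecognized key-shaped value outside test data is escalated for review.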

Scanner Dumps Are Not Enough

In a large AI-agent repository, raw findings mix very different things:

core runtime code
generated frontend assets
sample applications
docs and READMEs
test fixtures
dependency metadata
security-sensitive examples
multi-language package boundaries

If those are all scored as one undifferentiated bucket, the report can become technically correct but operationally weak.

The better report should answer:

Is this production runtime code?
Is this sample code?
Is this generated code?
Is this documentation?
Is this an accepted risk?
Is this a real secret or an example credential?
Is this an AI-agent reasoning hotspot?

That is the difference between "we found 1,156 things" and "here is the remediation plan."
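To make the scope questions concrete, here is a small Python sketch that assigns a scope to each file path and counts findings per scope. The directory names, the suffix list, and the sample paths in the usage line are illustrative assumptions, not the actual layout rules of microsoft/agent-framework or the logic Clear Code uses.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: illustrative directory conventions, checked from the
# repository root downward. The first matching directory wins.
SCOPE_BY_DIR = {
    "samples": "sample",
    "examples": "sample",
    "docs": "documentation",
    "tests": "test fixture",
    "fixtures": "test fixture",
    "dist": "generated",
    "node_modules": "generated",
}
DOC_SUFFIXES = {".md", ".rst"}


def classify_scope(path: str) -> str:
    """Map a repository-relative path to a coarse scope label."""
    p = PurePosixPath(path)
    for part in p.parts[:-1]:  # directories only, not the file name
        scope = SCOPE_BY_DIR.get(part.lower())
        if scope:
            return scope
    if p.suffix.lower() in DOC_SUFFIXES:
        return "documentation"
    # Default: treat unclassified code as production so it is never ignored.
    return "production"


def summarize(paths):
    """Count findings per scope instead of reporting one undifferentiated total."""
    return Counter(classify_scope(p) for p in paths)


print(summarize(["python/samples/demo/main.py", "README.md", "python/packages/core/agent.py"]))
```

Defaulting to "production" is a deliberate choice in this sketch: a file that no rule recognizes is reviewed at the strictest level rather than silently downgraded.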

The AI Token Debt Signal

The strongest signal from the scan was AI token debt.

AI token debt is the extra AI-agent context, search, inference, retry, and review work created when a codebase is hard to reason about.

For microsoft/agent-framework, the scan modeled high AI token debt because the repository contains:

703,170 LOC
4,620 analyzed files
89 large files
62 complex files
386 dependency signals
72 files with deferred-work markers

A few context hotspots stood out:

python/packages/openai/agent_framework_openai/_chat_client.py
python/packages/core/agent_framework/observability.py
python/packages/core/agent_framework/security.py
python/packages/core/agent_framework/_agents.py
multiple DevUI frontend files above 1,500 LOC

The issue is not that large files are automatically bad.

The issue is that AI agents pay for ambiguity in tokens.

When the relevant code is spread across Python, .NET, frontend tools, samples, package boundaries, docs, and dependency policy, an agent needs more context to make safe changes. It searches more. It retries more. It asks for more review. It has to infer which code is production-critical and which code is illustrative.

That is technical debt in an AI-assisted engineering environment.
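One simple way to picture a context hotspot is to estimate how many tokens an agent would spend just reading a file. The Python sketch below is an assumption-heavy illustration, not the model behind the scan: the four-characters-per-token ratio is a rough rule of thumb (real tokenizers vary by model and language), and the names `estimate_tokens` and `context_hotspots` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: rough rule of thumb, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate token count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def context_hotspots(files: Dict[str, str], budget: int) -> List[Tuple[str, int]]:
    """Return (path, estimated tokens) for files above the budget, largest first."""
    sized = [(path, estimate_tokens(text)) for path, text in files.items()]
    over = [(path, tokens) for path, tokens in sized if tokens > budget]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Hypothetical in-memory files; a real run would read them from disk.
files = {"big_module.py": "x" * 100, "small_module.py": "x" * 10}
print(context_hotspots(files, budget=10))
```

File size is only one input. A fuller model would also need the search, retry, and review costs described above, which this sketch does not attempt to measure.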

Strong Architecture Can Still Have AI-Agent Friction

One of the most useful results of the scan was that architecture scored 100/100.

That prevents an overly simplistic conclusion.

The repo did not look structurally chaotic in the scanner's architecture model. The friction came from a different layer:

context size
classification gaps
long files
dependency uncertainty
deferred-work markers
mixed production/sample/documentation scopes

That is exactly why technical debt needs a richer model in the AI era.

The question is not only "is the architecture clean?"

The question is also:

How much work does the codebase force every future engineer and AI agent to do before they can safely change it?

What Clear Code Needs To Improve

This scan also teaches us something about our own product.

Clear Code needs stronger scope classification for large public repositories:

production package
sample package
docs
generated assets
test fixtures
demo credentials
accepted risk
false positive

That classification would make the score more useful and the remediation plan more credible.
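A sketch of what that could look like as data: each finding carries one of the labels above, and the report separates what goes into the remediation plan from what is set aside with a recorded reason. The `Finding` fields, the `split_findings` helper, and the choice of which labels are set aside are assumptions for illustration, not the current Clear Code schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Label(Enum):
    PRODUCTION_PACKAGE = "production package"
    SAMPLE_PACKAGE = "sample package"
    DOCS = "docs"
    GENERATED_ASSETS = "generated assets"
    TEST_FIXTURES = "test fixtures"
    DEMO_CREDENTIALS = "demo credentials"
    ACCEPTED_RISK = "accepted risk"
    FALSE_POSITIVE = "false positive"


# Assumption: a policy choice. Demo credentials stay in the plan because
# the likely fix is to annotate them, not to ignore them.
SET_ASIDE = {Label.ACCEPTED_RISK, Label.FALSE_POSITIVE}


@dataclass(frozen=True)
class Finding:
    path: str   # repository-relative path (hypothetical examples below)
    rule: str   # name of the rule that fired (hypothetical)
    label: Label


def split_findings(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Return (remediation plan, set-aside findings), preserving input order."""
    plan = [f for f in findings if f.label not in SET_ASIDE]
    set_aside = [f for f in findings if f.label in SET_ASIDE]
    return plan, set_aside


findings = [
    Finding("packages/core/module.py", "long-file", Label.PRODUCTION_PACKAGE),
    Finding("docs/setup.md", "aws-key-shape", Label.DEMO_CREDENTIALS),
    Finding("samples/demo.py", "todo-marker", Label.FALSE_POSITIVE),
]
plan, set_aside = split_findings(findings)
print(len(plan), len(set_aside))
```

Keeping the set-aside findings in the output, rather than dropping them, leaves the evidence inspectable for anyone who wants to challenge a label.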

The best technical debt report is not the harshest report.

It is the report that helps maintainers decide what to do next.

Why Public Scans Matter

Public repositories let technical debt discussions become concrete.

The evidence is inspectable. The methodology can be challenged. Maintainers can correct the interpretation.

That is the right standard.

If anyone from Microsoft Open Source or the Agent Framework maintainer community wants the full PDF report, we would be glad to share it and hear where the scan should be corrected, tuned, or scoped differently.

Public code deserves public, fair, evidence-backed analysis.