DEV Community: Adedoyinsola Ogungbesan

Scoring Goals Beyond the World Cup

Adedoyinsola Ogungbesan — Fri, 26 Jun 2026 17:57:32 +0000

The World Cup is heating up, with new stars emerging from every corner of the globe. Japan, Morocco and Cape Verde have been playing some of the best football we've seen in a long time.

While the countdown to the final in July continues, something else is quietly coming to an end in the coding world.

Gemini Code Assist's free GitHub code reviews are being sunset.

For those of us who leaned on the free tier, that's a bit of a shame.

My first thought wasn't, "What paid service do I replace it with?"

Instead it was:

How much of my engineering workflow can I push to open-source models instead?

The Problem

I've noticed something about myself.

Whenever Codex, Antigravity or another coding assistant generates code, I'm always too impatient to review it properly.

The code works.

The tests pass.

I have another idea waiting.

So I push.

That's usually when mistakes sneak in.

Rather than depending on my own discipline, I wanted the review to happen automatically inside my CI before I even opened a pull request.

Not because AI writes bad code.

Because software deserves another pair of eyes.

Dave Farley's Influence

I've spent a lot of time listening to Dave Farley over the last year.

One thing he keeps repeating is that quality shouldn't be inspected at the end.

It should be built into the process.

That idea stayed with me.

Rather than trying to review everything manually, I wondered if a lightweight reviewer could become just another part of my development pipeline.

Bigger Models Aren't Always the Answer

Around the same time I was working through Kaggle's five-day course on AI Agents.

One lesson that really stuck with me was that not every task needs the biggest model available.

Planning.

Architecture.

Design.

Those probably deserve stronger reasoning models.

Reviewing a pull request?

Checking test quality?

Looking for maintainability issues?

Maybe those can be delegated to much smaller models.

That completely changed how I started thinking about AI workflows.

Instead of asking:

What's the smartest model?

I started asking:

What's the smallest model that can do this job well enough?
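Here is a minimal sketch of that question as code. The model labels, sizes and scores are made up for illustration; in practice the scores would come from your own evaluation set for the task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str          # hypothetical model label
    params_b: float    # parameter count in billions
    eval_score: float  # score on your own eval set for this task, 0..1


def smallest_good_enough(candidates, min_score):
    """Return the smallest candidate whose score meets the bar, or None."""
    passing = [c for c in candidates if c.eval_score >= min_score]
    return min(passing, key=lambda c: c.params_b) if passing else None


# Made-up numbers, for illustration only.
CANDIDATES = [
    Candidate("small-1.5b", 1.5, 0.62),
    Candidate("medium-7b", 7.0, 0.81),
    Candidate("large-70b", 70.0, 0.90),
]

choice = smallest_good_enough(CANDIDATES, min_score=0.75)
print(choice.name if choice else "no model clears the bar")
```

The point of the sketch is the ordering of the question: set the quality bar first, then pick the cheapest model that clears it.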

Building the Reviewer

That question slowly became Silver-One.

The goal wasn't to replace frontier models.

The goal was to move as much repetitive engineering work as possible onto inexpensive open-source models, leaving the larger models for the work that genuinely benefits from them.

Using our replayable LLM cassette design, I started experimenting with small models like Qwen running directly inside CI.

They're tiny.

They're imperfect.

But because they're inexpensive, I no longer hesitate to let them review every pull request.

That completely changes the economics.

Instead of worrying about token costs every time I push code, I can reserve my paid credits for planning, architecture and difficult reasoning while letting smaller models handle continuous review.
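A rough sketch of what a review step like this could look like. This is not the Silver-One code: the prompt text, the character budget and the stub model are all assumptions, and the model is passed in as a plain callable so any small local model could be plugged in.

```python
import subprocess
from typing import Callable

MAX_DIFF_CHARS = 12_000  # assumed budget so a small model's context is not overrun

REVIEW_PROMPT = (
    "You are a code reviewer. Focus on correctness, maintainability and "
    "test quality. List at most three issues that matter.\n\nDiff:\n{diff}"
)


def get_diff(base: str = "origin/main") -> str:
    """Return the diff between the base branch and HEAD (needs git and a checkout)."""
    result = subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_prompt(diff: str, limit: int = MAX_DIFF_CHARS) -> str:
    """Truncate oversized diffs and wrap them in the review instructions."""
    if len(diff) > limit:
        diff = diff[:limit] + "\n[diff truncated]"
    return REVIEW_PROMPT.format(diff=diff)


def review(diff: str, model: Callable[[str], str]) -> str:
    """Send the prompt to whatever model callable the CI job provides."""
    if not diff.strip():
        return "No changes to review."
    return model(build_prompt(diff))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs anywhere."""
    return f"(stub review of {len(prompt)} prompt characters)"


SAMPLE_DIFF = "+def add(a, b):\n+    return a - b\n"
print(review(SAMPLE_DIFF, fake_model))
```

In a CI job, `review(get_diff(), real_model)` would replace the last line, with `real_model` wrapping whichever small model the pipeline runs.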

A Humbling Experience

After a few pull requests the feedback became surprisingly humbling.

Functions I thought were finished kept coming back with comments about maintainability, correctness and test quality.

It reminded me that generating code quickly isn't the same thing as building software that's easy to evolve.

Fast code generation is impressive.

Maintainable software is still hard.

Comparing Against Gemini

Before Gemini's free reviews disappear, I took the chance to compare them against the reviewer I'm currently building.

Gemini is genuinely impressive.

It often finds the two or three issues that actually matter instead of producing a long list of generic advice.

My reviewer still has a long way to go.

But that's also the exciting part.

Because it's open source, I control the prompts, the scoring, the workflow and the evaluation.

Improving it becomes an engineering problem instead of waiting for another hosted service.

The goal was never to replace Gemini.

It was to learn from it.

One Lesson I'll Probably Keep

The biggest thing I've learned isn't about models.

It's about workflows.

Sometimes the biggest improvement doesn't come from using a larger model.

Sometimes it comes from designing a better engineering process.

If I can push repetitive review work onto small open-source models while saving larger models for planning and architecture, then I think that's a workflow worth building.

Hopefully by the time the World Cup reaches its final whistle, I'll have scored a few goals of my own—not on the football pitch, but inside my CI pipeline.

Thanks to Dave Farley and everyone working to make software engineering a little more disciplined.

It's been a fun experiment so far, and I think it's only getting started.

Our Comeback Story

Adedoyinsola Ogungbesan — Sun, 07 Jun 2026 13:40:35 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

This project began life as an Agentbeats platform template for the AgentX-AgentBeats competition (initial commits in Nov 2025). At the time it was a straightforward pro/con debater and judge wired to local models (Ollama). Over the last several months I reworked the harness into silver-one: a reproducible, auditable pipeline for generating evidence-grounded code-security reasoning data.

Key ideas implemented:

BARRED (Boundary Adversarial Reasoning for Reproducible Evaluation and Dataset generation) — an agent debate + verifier architecture that records every changing input for replay and audit.
Deterministic replay: LLMCassette/ReplayManager to record prompts, responses, and run state so outputs are reproducible and debuggable.
Offline B-gate: offline_b_gate.py computes quality gates (structural completeness, anchor grounding, predicate aboutness, verifier parse/pass) and produces artifacts/metrics/*.json.
Telemetry and token accounting: per-stage and per-model token totals so we can optimize cost vs quality.

Why this matters: synthetic security corpora are easy to generate but hard to trust. silver-one treats generation settings, verifier outcomes, rejected attempts, and run checkpoints as first-class artifacts — enabling training data to be audited, replayed, and improved.
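To make the idea of an offline quality gate concrete, here is a small sketch that computes per-gate pass rates over corpus rows. It is not the logic of offline_b_gate.py: the boolean field names and the sample rows are hypothetical, chosen only to mirror the four gates listed above.

```python
import json

# Hypothetical per-row flags, one for each gate named in the post.
GATES = (
    "structurally_complete",
    "anchor_grounded",
    "predicate_about",
    "verifier_passed",
)


def gate_rates(rows):
    """Return the pass rate for each gate, plus the share of rows passing all gates."""
    total = len(rows)
    if total == 0:
        return {gate: 0.0 for gate in GATES} | {"all_gates": 0.0}
    rates = {
        gate: sum(1 for row in rows if row.get(gate) is True) / total
        for gate in GATES
    }
    rates["all_gates"] = (
        sum(1 for row in rows if all(row.get(g) is True for g in GATES)) / total
    )
    return rates


# Made-up rows, for illustration only.
ROWS = [
    dict.fromkeys(GATES, True),
    dict.fromkeys(GATES, True) | {"anchor_grounded": False},
    dict.fromkeys(GATES, True),
    dict.fromkeys(GATES, True) | {"predicate_about": False, "verifier_passed": False},
]

print(json.dumps(gate_rates(ROWS), indent=2))
```

Writing the result out as JSON keeps the metrics diffable between runs, which is the property that makes a gate like this useful for auditing.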

Demo

The repo includes scripts and instructions to run the BARRED stack locally (see scenarios/debate/start_stack.sh and scenarios/debate/run_batch.py). An example run, from starting the stack to computing the B-gate metrics:

# start judge + participants
./scenarios/debate/start_stack.sh

# run a clocked batch and export attempts + corpus
uv run python scenarios/debate/run_batch.py \
  --run-id pilot-v1-calibrated-d \
  --seed 42 \
  --mode record \
  --clock-now 2026-06-07T14:12:00Z \
  --seeds scenarios/debate/cve_seeds_test.jsonl \
  --output training_corpus_calibrated_d.jsonl \
  --attempts-out artifacts/attempts/pilot-v1-calibrated-d.jsonl


# compute B-gate metrics
./scripts/run_b_gate.sh \
  training_corpus_calibrated_d.jsonl \
  artifacts/attempts/pilot-v1-calibrated-d.jsonl \
  artifacts/metrics/b_gate-pilot-v1-calibrated-d.json

The harness is CLI-first and instrumented.

The Comeback Story

Where it started: the project was a small Agentbeats demo wired to Ollama models (Nov 2025). Over time, the repo grew into a research harness as I discovered that simple debate + label workflows produced many ungrounded labels.

What I changed and finished up:

Hardened structured-output parsing and repair to avoid malformed JSON rows.
Added deterministic replay (ReplayManager) so the same prompts and responses can be replayed and audited.
Implemented a verifier agent and wired it into judge gating; verifier reports are now used to improve grounding and reduce hallucinated mechanism claims.
Built offline_b_gate.py to compute reproducible quality metrics and surface failure modes (anchor normalization, mechanism grounding, predicate aboutness).
Added telemetry (per-stage tokens, per-model usage) so we can optimize generator context size without losing auditability.
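The telemetry item can be pictured with a small ledger like the one below. This is a sketch under assumptions, not the repo's implementation: the class, the model labels and the token counts are invented, and only the stage names are taken from the figures described later in this post.

```python
from collections import defaultdict


class TokenLedger:
    """Accumulate token totals per pipeline stage and per model."""

    def __init__(self):
        self.by_stage = defaultdict(int)
        self.by_model = defaultdict(int)

    def record(self, stage, model, prompt_tokens, completion_tokens):
        total = prompt_tokens + completion_tokens
        self.by_stage[stage] += total
        self.by_model[model] += total

    def tokens_per_accepted_row(self, accepted_rows):
        """Cost metric: total tokens spent divided by rows that survived gating."""
        total = sum(self.by_stage.values())
        return total / accepted_rows if accepted_rows else float("inf")


ledger = TokenLedger()
# Made-up counts; "model-a" and "model-b" are placeholder labels.
ledger.record("generator_boundary", "model-a", 900, 300)
ledger.record("generator_refine", "model-a", 400, 200)
ledger.record("judge_adjudication", "model-b", 500, 100)
ledger.record("verifier_audit", "model-b", 300, 100)

print(dict(ledger.by_stage))
print(dict(ledger.by_model))
print(ledger.tokens_per_accepted_row(accepted_rows=2))
```

Dividing by accepted rows rather than attempts is what ties cost to quality: rejected attempts still spend tokens.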

Result: the harness now produces a small high-fidelity corpus (examples: training_corpus_calibrated_a.jsonl…_d.jsonl) with accompanying artifacts/metrics/b_gate-pilot-v1-calibrated-*.json showing gate pass and detailed telemetry.
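For readers wondering what structured-output repair can look like, here is a small sketch. The schema (`label`, `evidence`) and the two repair steps are assumptions for illustration, not what src/agentbeats/structured_output.py actually does.

```python
import json
import re

REQUIRED_KEYS = {"label", "evidence"}  # hypothetical schema


def parse_row(raw):
    """Return a dict for a usable row, or None if it cannot be repaired."""
    # Repair 1: keep only the outermost {...} span, dropping chatty prose
    # or code fences the model wrapped around the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start:end + 1]
    # Repair 2: remove trailing commas before a closing brace or bracket.
    # Limitation: this would also touch such a sequence inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        row = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    # Reject rows that parse but do not match the expected shape.
    if not isinstance(row, dict) or not REQUIRED_KEYS.issubset(row):
        return None
    return row


messy = 'Sure! Here is the row: {"label": "vulnerable", "evidence": ["line 12",],}'
print(parse_row(messy))
print(parse_row("no json here"))
```

Returning None instead of raising lets the batch runner count the failure as a rejected attempt and move on.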

Inspiration & Context

The redesign was strongly inspired by the Google DeepMind AGI hackathon and the Metacognitive research conversations there: we wanted to observe how models behave under adversarial conditions and make those behaviours auditable. I also leaned on practical tutorials and engineering approaches (e.g., Daily Dose of Data Science, Dave Farley / Modern Software Engineering) to make the system deterministic and "test-easy by design" — small, replayable units, strong structured-output parsing, and clear telemetry.

This work is directly connected to the Metacognitive Coding Safety Benchmark (MCSB) — which was our submission to the Google DeepMind AGI hackathon. MCSB's multi-tier structure (pilot/core/adversarial) and its focus on directional confidence updates shaped our Tiered experiment design and the offline B-gate evaluation.

A note on cassettes — why the metaphor matters

Peter Quill's mixtape in Guardians of the Galaxy is a tiny, precious archive that preserves memory, identity, and the reason a song feels like "the best." Our LLMCassette serves a similar purpose for silver-one: it preserves the prompt, model responses, and run state so the project's decisions remain auditable, replayable, and emotionally intelligible. Treating generation traces as a cassette makes the system kinder to debugging and kinder to future researchers — you can rewind to the exact moment a label was created, listen to the context, and understand why a model thought a particular answer was "the best."
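A toy version of the cassette idea, for readers who have not met record/replay before. The `Cassette` class below is a hypothetical sketch, not the repo's LLMCassette or ReplayManager: it keys each call on the model name and prompt, stores the response in a JSON file, and refuses to call a live model in replay mode.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class Cassette:
    """Record model calls to a JSON file, or replay them without a model."""

    def __init__(self, path, mode):
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        self.path = Path(path)
        self.mode = mode
        # Load an existing tape so replay works across separate runs.
        self.tape = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(model, prompt):
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, model, prompt, live_fn):
        key = self._key(model, prompt)
        if self.mode == "replay":
            if key not in self.tape:
                raise KeyError("no recorded response for this model and prompt")
            return self.tape[key]["response"]
        response = live_fn(model, prompt)
        self.tape[key] = {"model": model, "prompt": prompt, "response": response}
        self.path.write_text(json.dumps(self.tape, indent=2, sort_keys=True))
        return response


def stub_model(model, prompt):
    """Stand-in for a real model call."""
    return f"{model} says: looks fine"


def must_not_run(model, prompt):
    raise RuntimeError("replay should never reach the live model")


with tempfile.TemporaryDirectory() as tmp:
    tape_path = Path(tmp) / "cassette.json"
    recorded = Cassette(tape_path, "record").call("tiny-model", "Review this diff", stub_model)
    replayed = Cassette(tape_path, "replay").call("tiny-model", "Review this diff", must_not_run)
    print(recorded == replayed)
```

Because replay fails loudly on an unseen prompt, any change to a prompt shows up as a missing tape entry rather than as a silently different answer.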

Kaggle Task & Repo

We published a companion Kaggle benchmark task for predicate-quality evaluation that lets you test multiple models locally and run remote benchmark jobs from the CLI. Task link:

https://www.kaggle.com/benchmarks/tasks/surfiniaburger/silver-one-predicate-quality

To push and run the task from the repository (example from kaggle_notebooks/silver_one_predicate_quality_task.py):

kaggle b t push silver-one-predicate-quality -f kaggle_notebooks/silver_one_predicate_quality_task.py --wait
kaggle b t run  silver-one-predicate-quality -m gemini-3.5-flash --wait

Repository (source): https://github.com/surfiniaburger/silver-one

My Experience with GitHub Copilot

GitHub Copilot (and iterative local editing) helped speed up refactors and suggested tests and doc improvements during the finish-up. I used the suggestions as a productivity assist rather than an authoritative change — every structural tweak was followed by running the deterministic smoke path and validating metric artifacts.

Files I relied on to finish this up

README.md — project summary and quick start
pilot_report.md — pilot run analysis, verifier-era notes, and telemetry interpretation
scenarios/debate/* — implementation of BARRED, judge, verifier, data generator, batch runner
src/agentbeats/replay.py and src/agentbeats/structured_output.py — deterministic replay and structured-output repair
artifacts/metrics/b_gate-pilot-v1-calibrated-*.json — final calibrated metrics (used to summarize improvements)

Where to go from here

Improve anchor normalization to increase accepted yield without raising predicate failures.
Add CI guardrails to limit generator_boundary prompt sizes and prevent runaway token use.
Expand verifier capabilities toward bit-level data-flow tracing for stronger mechanism grounding.

References & Related Docs

BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate — Arnon Mazza*, Elad Levi* (Plurai Inc.), Preprint Jan 21, 2026. Role: scenario specification and debate-based synthetic-data generation algorithm; served as the blueprint for the BARRED scenario, gating rules, and the offline B-gate implementation used in this repo. (*equal contribution)
Pioneer Agent: Continual Improvement of Small Language Models in Production — Dhruv Atreja, Julia White, Nikhil Nayak, Kelton Zhang, Henrijs Princis, George Hurn-Maloney, Ash Lewis, Urchade Zaratiana (Fastino Labs), arXiv:2604.09791, Apr 10, 2026. Role: engineering systems paper that inspired telemetry-driven adaptation loops used in our evaluation.

Figures — Key Metrics Visuals

The following figures were generated from artifacts/metrics.

Attempts vs Accepted rows by run — shows yield and how many attempts were required per accepted corpus row.

Predicate-fail and B2 strict-fail by run — highlights quality improvements and failure modes across runs.

Verifier called rate and verifier pass rate by run — shows verifier coverage and effectiveness.

Tokens per accepted row vs predicate-fail (point size = accepted rows) — visualizes the cost/quality tradeoff.

Stacked token usage by stage (generator_boundary, generator_refine, judge_adjudication, verifier_audit) — identifies where tokens are spent.

Per-model token share by run — shows which models dominate token costs.

Thanks for reading thus far. Keep an eye on In-vari for more updates.

Google I/O 2026 Wasn’t About AI Models — It Was About Infrastructure

Adedoyinsola Ogungbesan — Sun, 24 May 2026 03:02:49 +0000

This is a submission for the Google I/O Writing Challenge

I appointed myself AGI Police a few days back, on account of the number of times I've found myself sipping an AI slop milkshake.

As a reformed individual (with regulated slop intake), I keep listening to Yann LeCun's insights through several notable podcasts. He constantly reminds us that “LLMs in general cannot predict the consequences of their actions.”

Google I/O 2026 made an impact, with lots of exciting announcements. Training across the largest clusters in the world. Over 7x more tokens processed every month. Bigger infrastructure. Faster inference. More intelligence delivered instantly to billions of people.

But somewhere in the middle of all the demos and applause, I found myself thinking less about the models and more about the machinery underneath them.

So I asked Google AI Mode to help calculate the energy consumption behind large-scale token processing and compare it to something human-sized.

Here is what we found out in less than 30 seconds of processing.

We could power nearly 3 million light bulbs continuously, 24 hours a day, for an entire year. Let that sink in.

What struck me wasn’t just the raw number itself. It was the inversion of intuition.

AI feels weightless.

You type words into a chat box and receive intelligence back in seconds. No smoke. No factory floor. No visible machinery. Just text appearing instantly on a glowing rectangle in your hand.

But underneath that interface sits an industrial system consuming electricity, water, cooling infrastructure, and global semiconductor supply chains at unprecedented scale.

Before I looked away, Gemini made another suggestion that made me even more curious. It suggested comparing the energy consumption with Nvidia infrastructure and also estimating the amount of water required to cool the servers powering the inference workloads.

I indulged.

And in less than 5 seconds (which means I was exaggerating when I said 30 seconds earlier), this happened:

457 million litres matches the total annual water footprint of roughly 1,200 average households.
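A quick sanity check of that comparison, using only the two figures quoted above. It simply divides the total by the number of households to see what per-household usage the claim implies.

```python
TOTAL_LITRES = 457_000_000  # figure quoted above
HOUSEHOLDS = 1_200          # figure quoted above

per_household_year = TOTAL_LITRES / HOUSEHOLDS
per_household_day = per_household_year / 365

print(f"{per_household_year:,.0f} litres per household per year")  # 380,833
print(f"{per_household_day:,.0f} litres per household per day")    # 1,043
```

So the comparison implies roughly a thousand litres per household per day, which is the number to hold up against whatever household average you trust.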

At that point, the conversation stopped feeling like a fun experiment and started feeling like a glimpse into the physical economics of intelligence itself.

The real takeaway from Google I/O wasn’t simply that models are getting smarter.

It was that intelligence is becoming infrastructure.

Every prompt now has a physical cost attached to it:

Electricity generation
Water cooling systems
Data center expansion
Semiconductor fabrication
TPU supply chains
Thermal management at planetary scale

And the strange part is that users rarely see any of it.

What fascinated me most was how Gemini framed the answers. Instead of treating the numbers like an alarming revelation, it immediately contextualized them against broader industry infrastructure. The response was technically useful, but it also revealed something subtle about AI systems: they do not merely answer questions; they shape how scale is emotionally interpreted.

The answer, although accurate, still felt slightly biased — almost like the system instinctively softened the psychological impact of the numbers by normalizing them within the broader AI race.

And honestly, I understand why.

Because the benefit of knowledge being delivered at our fingertips is genuinely incredible.

A student can learn quantum mechanics from a village with weak infrastructure. A founder can prototype an idea in hours instead of months. A developer can debug systems faster than ever before. The productivity gains are real.

But so is the cost.

For years, software scaled mostly through abstraction. AI may be the first mainstream computing paradigm where scaling intelligence also means scaling physical consumption in the real world.

That may ultimately become the defining tradeoff of this era.

The question after Google I/O is no longer just:

“How intelligent can these systems become?”

But also:

“What will it cost to sustain them?”

The First Modular AI Chain

Adedoyinsola Ogungbesan — Mon, 04 Mar 2024 00:16:00 +0000

In the vast expanse of the deep blue sea, where mysteries abound and the unknown beckons, lies a realm ripe for exploration and discovery. Much like intrepid explorers charting uncharted waters, 0G Labs ventures into the depths of innovation, pushing the boundaries of what's possible in the realm of decentralized technology. Join us as we embark on a journey through the depths, guided by 0G's pioneering spirit and modular technology.

Plunging into the Abyss: Understanding 0G's Modular Technology

Beneath the surface lies a world of complexity and possibility, much like the intricate framework of 0G's modular technology. Here, developers and content creators are equipped with the tools to navigate the depths of decentralized AI applications. Picture it as a vessel, sturdy yet adaptable, allowing for seamless integration of AI components into the ever-shifting currents of innovation.

Discovering Hidden Treasures: Unleashing Creativity with 0G's Modular Canvas

Just as the ocean conceals untold treasures within its depths, so too does 0G's modular canvas hold the promise of boundless creativity. Here, developers and content creators are invited to explore uncharted territories, bringing their boldest ideas to life in a realm where imagination knows no bounds. With an intuitive interface akin to a captain's chart and a vast library of modular components as diverse as the ocean's inhabitants, the canvas becomes a playground for innovation.

Navigating the Undercurrents: Building Decentralized AI Applications

In the depths of the ocean, currents ebb and flow, shaping the landscape in unexpected ways. Similarly, within the realm of decentralized AI applications, 0G's modular technology provides a framework that adapts to the ever-changing tides of innovation. Developers navigate these undercurrents with ease, leveraging the flexibility and scalability of 0G's platform to bring their creations to life.

Forging Alliances in the Deep: Fostering Collaboration and Community

Just as explorers band together to conquer the challenges of the deep, so too does the 0G community come together to tackle the complexities of decentralized technology. Here, developers, artists, and visionaries unite, sharing insights, collaborating on projects, and supporting one another on their journey. Through open dialogue, shared resources, and a spirit of camaraderie, the community becomes a beacon of light in the darkness, guiding explorers through the depths.

Emerging into the Light: Shaping the Future of Innovation

As explorers resurface from the depths, they bring with them newfound knowledge and discoveries that shape the course of history. Similarly, with 0G's modular technology, developers emerge from the depths of innovation, armed with groundbreaking creations that push the boundaries of what's possible. Together, we chart a course towards a brighter future, where the depths of creativity and the vast expanse of technology converge in harmony.

Conclusion: Charting a Course for Tomorrow's Innovators

In the boundless depths of the deep blue sea, and within the realm of decentralized technology, the journey is as important as the destination. With 0G Labs as our compass, we navigate uncharted waters, guided by a spirit of exploration and a commitment to innovation. Join us as we embark on this journey into the unknown, shaping the future of technology one discovery at a time.

Navigating the Decentralized Cosmos: A Journey with 0G's Modular Starship

Adedoyinsola Ogungbesan — Sun, 03 Mar 2024 23:57:19 +0000

Embark on a journey like no other with 0G Labs' modular starship, the gateway to the decentralized cosmos. In this blog post, we'll take you on a voyage through space and time, exploring the vast expanse of possibilities that await those who dare to dream and innovate with 0G's modular technology.

Launching into the Unknown: Exploring New Frontiers

Just as explorers of old set sail for uncharted territories, so too do developers and content creators embark on a journey into the unknown with 0G's modular starship. With its modular architecture and interoperable design, the starship serves as a vessel for exploration, enabling users to navigate the decentralized cosmos with ease and discover new worlds of possibility.

Charting Your Course: Customizing Your Journey

No two journeys are alike, and with 0G's modular starship, users have the freedom to chart their own course through the decentralized cosmos. Whether you're exploring the depths of AI-powered virtual reality or charting a course through the blockchain galaxy, the modular starship puts the power of customization in your hands, allowing you to tailor your journey to your unique vision and objectives.

Forging Alliances: Collaborating Across the Cosmos

As you traverse the decentralized cosmos, you'll encounter fellow explorers, innovators, and visionaries who share your passion for discovery and innovation. With 0G's modular starship, forging alliances and collaborating across the cosmos has never been easier. Whether you're joining forces to tackle a common challenge or pooling resources to explore new frontiers, the modular starship provides a platform for collaboration and cooperation that transcends boundaries and borders.

Pushing the Boundaries: Innovation Without Limits

Innovation knows no bounds, and with 0G's modular starship, the sky's the limit. Whether you're pushing the boundaries of AI, blockchain, or virtual reality, the starship provides a platform for innovation without limits, empowering users to dream big, think boldly, and explore new horizons. With 0G's modular technology as your guide, the possibilities are limitless, and the journey is yours to define.

Conclusion: Beyond the Stars

As we journey through the decentralized cosmos with 0G's modular starship, we're reminded that the true essence of exploration lies not in the destination, but in the journey itself. With its modular architecture, interoperable design, and boundless potential, the starship serves as a beacon of innovation, guiding us toward new frontiers of possibility and paving the way for a brighter, more decentralized future for all. https://0g.ai/

ML unification is all about family

Adedoyinsola Ogungbesan — Wed, 05 Jul 2023 01:52:03 +0000

In a never-ending series of twists and turns, we haven't seen the last of Dom and his ever-growing family.

Machine learning, like the Fast and Furious franchise, keeps getting more fascinating by the day.

ML unification, according to OpenAI's ChatGPT, refers to the convergence and integration of various machine learning (ML) techniques, methodologies, and frameworks into a unified framework or ecosystem. It aims to create a standardized and cohesive approach to ML development, deployment, and management.

The need for ML unification arises from the growing complexity and diversity of ML models, algorithms, and tools. ML practitioners often work with different frameworks and libraries for specific tasks, such as deep learning, reinforcement learning, or natural language processing. This fragmented landscape can lead to inefficiencies, interoperability challenges, and duplication of efforts.

ML unification addresses these challenges by providing a unified platform combining multiple ML techniques, frameworks, and tools. It involves creating common standards, interfaces, and protocols that enable seamless integration and collaboration across different ML domains.

Dominic Toretto is to Fast and Furious as Ivy is to ML unification.

Ivy is both an ML transpiler and a framework, currently supporting JAX, TensorFlow, PyTorch and NumPy.

According to Kapa.ai, Ivy unifies all ML frameworks, enabling you not only to write code that can be used with any of these frameworks as the backend but also to convert any function, model or library written in any of them to your preferred framework. This makes it broadly applicable to a wide range of applications, from cutting-edge deep learning to more conventional machine learning, general numerical computing, and data analytics.

TWIST
Dan Fu of Stanford University talked about FlashAttention at the Ivy paper reading group.

As stated in their paper on FlashAttention, published a year ago on arXiv, FlashAttention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM.

Transformer models are essential building blocks for natural language processing and image classification. They have grown larger and deeper, but equipping them with longer context remains difficult, because the self-attention module at their core has time and memory complexity quadratic in sequence length. In other words, it gets slow and memory-hungry as sequences grow.
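To give a feel for the tiling idea, here is a NumPy sketch of attention computed one key/value block at a time with a running softmax. It is only an illustration of the arithmetic: the real FlashAttention also tiles the queries and runs as a fused GPU kernel, which is where the memory savings come from, and the function names and block size here are my own.

```python
import numpy as np


def naive_attention(q, k, v):
    """Standard attention: materialises the full (n, n) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, block=4):
    """Same result, but only ever holds an (n, block) slice of the scores."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)  # running max of scores per query
    row_sum = np.zeros((n, 1))          # running softmax denominator
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        scores = (q @ k_blk.T) * scale
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        # Rescale what was accumulated under the old max to the new max.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 8)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

The two functions agree to floating-point tolerance, which is the sense in which the algorithm is "exact": tiling changes the order of the work, not the answer.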

FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K).

FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

EXIT
With more and more major advances being reported every day in the vast world of machine learning, accelerating your AI with one line of code is here to stay.

Additional Reading
unify.ai

Reference(s)
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135