DEV Community: K X

Can Parallel AI Agents Build SaaS in Minutes? Benchmarking GPT-5.6 Sol's "Ultra Mode"

K X — Fri, 10 Jul 2026 15:13:20 +0000

With OpenAI's sudden launch of the GPT-5.6 model series led by the flagship GPT-5.6 Sol, the AI landscape has shifted from simple chatbots to autonomous agent execution.

In this article, we'll dive deep into Sol's core architecture, its SOTA score on Terminal-Bench 2.1, and test whether its native "Ultra Mode" can coordinate parallel subagents to build and deploy web applications.

1. Under the Hood: What is Sol's "Ultra Mode"?

Older reasoning models require developer teams to build complex outer-loop Python frameworks (like AutoGen or CrewAI) to coordinate multiple models. GPT-5.6 Sol handles multi-agent orchestration natively.

When you send a complex prompt with orchestration_mode: "ultra", Sol acts as a manager agent that:

Spawns specialized subagents (e.g., Coding agent, Reviewing agent, Testing agent).
Coordinates their tasks and assembles the files they produce in parallel.
Runs self-correction loops (feeding traceback errors back to the code agent to repair syntax bugs before outputting).

Here is a quick Python snippet demonstrating how to configure this with the official OpenAI SDK:


import openai

client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-5.6-sol",
    messages=[{"role": "user", "content": "Refactor repository structure and add database indexing."}],
    orchestration_mode="ultra",
    max_subagents=6,
    reasoning_effort="max"
)

print(response.choices[0].message.content)
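
The self-correction loop from the list above can be approximated outside the model. This is a minimal sketch of the general pattern only, not Sol's internal mechanism: `self_correct` and `fake_coder` are hypothetical names, and the stand-in coder simply returns broken code first and fixed code once it sees an error message.

```python
import traceback
from typing import Callable


def self_correct(generate: Callable[[str], str], task: str, max_rounds: int = 3) -> str:
    """Ask `generate` for code and feed syntax errors back until it compiles."""
    prompt = task
    for _ in range(max_rounds):
        source = generate(prompt)
        try:
            # Syntax check only: compile() does not execute the candidate code.
            compile(source, "<candidate>", "exec")
            return source
        except SyntaxError:
            error = traceback.format_exc()
            prompt = f"{task}\n\nThe previous attempt failed with:\n{error}\nPlease fix it."
    raise RuntimeError(f"no syntactically valid code after {max_rounds} rounds")


def fake_coder(prompt: str) -> str:
    """Hypothetical stand-in for a coding agent."""
    if "failed with" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b)\n    return a + b\n"  # missing colon on purpose


fixed = self_correct(fake_coder, "Write add(a, b).")
```

Capping the number of rounds matters: without `max_rounds`, a coder that never converges would loop forever.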

Google I/O 2026 Wasn't About One More Model. It Was About the Agent Stack.

K X — Sat, 23 May 2026 15:50:30 +0000

This is a submission for the Google I/O Writing Challenge.

The release that stuck with me

The Google I/O 2026 announcement I keep coming back to is not a single model, a single demo, or a single product screenshot.

It is the shape of the developer stack Google is assembling around Gemini: Google AI Studio for fast prototyping, the Gemini API for production integration, Managed Agents for less infrastructure friction, and Antigravity as an agent-first development environment.

Google's own challenge prompt asks us to go beyond summary, so here is my take: I/O 2026 was less about "look, the model is smarter" and more about "the unit of software is changing."

The new unit is not just an app. It is a loop:

intent -> prototype -> agent/tool loop -> deployed workflow -> feedback -> improvement

That sounds abstract until you compare it with how developers actually build today. We still spend a lot of time translating intent into boilerplate, wiring tools, building glue code, and creating small internal dashboards that should not take a sprint. Google is clearly trying to compress that middle layer.

Why this is different from another chatbot launch

A chatbot helps when you already know what to ask.

An agentic developer stack helps when the work itself has multiple steps:

inspect the current project
create or modify code
call tools
run checks
explain tradeoffs
repeat until the workflow is closer to done

That is why the developer announcements matter. The official Google I/O developer highlights describe updates around Google Antigravity, an enhanced Gemini API, and native Android support in Google AI Studio. The Developer Keynote roundup also points to Google AI Studio integrations, native Kotlin support, and Managed Agents through the Gemini API.

Those details are easy to skim past. I think they are the main story.

If AI Studio can turn an idea into an Android prototype faster, and the Gemini API can host agent behavior with less custom infrastructure, and Antigravity can make the development loop more agent-native, then Google is not just improving a model. It is trying to own the path from idea to shipped AI workflow.

The useful mental model: three layers

The way I now think about Google's I/O 2026 developer stack is three layers.

1. Prototype layer: Google AI Studio

AI Studio is the "try the idea now" layer.

This matters because many AI projects die before they become software. Not because the model cannot do the task, but because the setup cost is too high: credentials, app shell, mobile scaffolding, API wiring, prompt iteration, output handling.

Native Android support and Kotlin-oriented workflows are interesting because they shorten the distance between "I have an idea" and "I can touch the thing on a device."

My critique: fast prototyping is only valuable if it does not trap you in a toy environment. The winning version of AI Studio is not a magic demo generator. It is a staging area that teaches good patterns and lets developers graduate to real code.

2. Runtime layer: Gemini API and Managed Agents

This is the layer I find most underrated.

A lot of agent projects look impressive in a demo but become annoying in production because you have to maintain orchestration, tool calls, retries, memory boundaries, logs, permissions, and state.

Managed Agents are interesting because they suggest Google wants to make agent infrastructure feel more like an API capability than a pile of custom glue.

That does not remove the need for engineering judgment. In fact, it makes judgment more important. If agents are easier to create, developers need stronger defaults around observability, permission scopes, evaluation, and failure handling.

The dangerous version is "just let the agent do it."

The useful version is "give the agent a narrow job, clear tools, auditable outputs, and a way to fail safely."
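
To make "clear tools, auditable outputs, and a way to fail safely" concrete, here is a minimal sketch of a tool dispatcher with an allowlist and an audit log. It is not part of any Google API; `NarrowAgentTools` and the `count_lines` tool are hypothetical names used only for illustration.

```python
from typing import Any, Callable


class NarrowAgentTools:
    """Dispatch tool calls through an allowlist and record every attempt."""

    def __init__(self, allowed: dict[str, Callable[..., Any]]):
        self.allowed = allowed
        self.audit_log: list[dict[str, Any]] = []

    def call(self, name: str, **kwargs: Any) -> dict[str, Any]:
        entry: dict[str, Any] = {"tool": name, "args": kwargs}
        if name not in self.allowed:
            entry.update(status="denied", error="tool not in allowlist")
        else:
            try:
                entry.update(status="ok", result=self.allowed[name](**kwargs))
            except Exception as exc:
                # Fail safely: report the error instead of crashing the agent loop.
                entry.update(status="failed", error=repr(exc))
        self.audit_log.append(entry)
        return entry


tools = NarrowAgentTools({"count_lines": lambda text: len(text.splitlines())})
ok = tools.call("count_lines", text="a\nb")
denied = tools.call("delete_repo")
```

Every call, including denied and failed ones, lands in `audit_log`, which is what makes the agent's behavior reviewable after the fact.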

3. Development layer: Antigravity

Antigravity is the signal that Google sees coding itself as an agent workflow, not just a text completion problem.

That fits where developer tools are going. The IDE is no longer only a place where humans type code. It is becoming a coordination surface where humans, models, tests, tools, and deploy systems negotiate changes.

This changes what "developer productivity" means.

The old productivity question was:

How fast can I write code?

The new productivity question is:

How safely can I turn intent into a reviewed, tested, shipped change?

That is a much better question.

What I would build with this stack

If I were using the Google I/O 2026 stack for a real project, I would build a small "agent readiness checker" for developer repos.

The workflow would be:

Use AI Studio to prototype the UI and interaction.
Use the Gemini API to inspect repository docs, tests, CI files, and issue history.
Use a Managed Agent to produce a narrow readiness report.
Ask the agent to suggest one safe improvement at a time.
Require every suggestion to include a test or verification step.

The output would not be "rewrite my whole app." It would be closer to:

{
  "repo_status": "almost shippable",
  "highest_risk_gap": "no CI for sample workflow",
  "safe_next_step": "add a smoke test and GitHub Actions job",
  "verification": "run python -m unittest discover -s tests"
}

That is the kind of agent I trust: boring, narrow, inspectable, and useful.
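
The deterministic part of such a checker needs no model at all. The sketch below only inspects the file system and fills in the same four keys as the report above; the function name `readiness_report`, the two checks, and the wording of the other status strings are assumptions for illustration.

```python
from pathlib import Path


def readiness_report(repo: Path) -> dict[str, str]:
    """Produce a narrow readiness report from two file-system checks."""
    has_tests = (repo / "tests").is_dir()
    workflows = repo / ".github" / "workflows"
    has_ci = workflows.is_dir() and any(workflows.glob("*.yml"))

    if not has_tests:
        gap, step = "no tests directory", "add one smoke test under tests/"
    elif not has_ci:
        gap, step = "no CI for sample workflow", "add a smoke test and GitHub Actions job"
    else:
        gap, step = "none found by this check", "review test coverage manually"

    return {
        "repo_status": "passes this check" if has_tests and has_ci else "almost shippable",
        "highest_risk_gap": gap,
        "safe_next_step": step,
        "verification": "run python -m unittest discover -s tests",
    }
```

A model could then be asked to explain or prioritize these findings, while the facts themselves stay cheap to verify.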

What developers should be careful about

My main concern after I/O is not that agents will be useless. It is that they will become too easy to deploy without enough boundaries.

The hard parts are still hard:

permissions
cost controls
prompt injection
user consent
logging
human review
model drift
tool failure
rollback

The more powerful the stack becomes, the more important it is to design small loops instead of giant autonomous blobs.

My rule of thumb:

If you cannot explain what the agent is allowed to do, what it is not allowed to do, and how you will verify the result, the agent is not ready for production.

My takeaway

Google I/O 2026 made me more convinced that AI development is moving from prompt demos to workflow systems.

The interesting competition is not only "which model is best?" It is:

who gives developers the shortest path from idea to working workflow?
who makes agent infrastructure observable and safe?
who lets prototypes graduate into real software?
who helps developers keep control as automation gets stronger?

That is why the AI Studio + Gemini API + Managed Agents + Antigravity direction feels important.

It is not one more chatbot.

It is a bet that the next generation of developer tools will be agent-native from the first sketch to the final deploy.

AI disclosure

I used AI assistance to organize this draft and check clarity. The perspective, source selection, final structure, and publication decision are mine.

I Finished My Hermes Content Ops Agent With Tests, CI, and a Public Audit Trail

K X — Fri, 22 May 2026 08:29:40 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge.

What I Started

I started with Hermes Agent Content Ops, a real content-operations workflow for a Chinese WeChat Official Account.

The private version uses Hermes Agent to help run a two-post-a-day media workflow: topic research, article packaging, WeChat-safe HTML, local visual assets, draft-first WeChat delivery, and a Xiaohongshu pet-supplies review queue.

The first public repo was useful, but it was still mostly a challenge package. It had a README, diagrams, and sanitized sample scripts, but it did not have enough proof that the public artifact was finished and safe to maintain.

The unfinished parts were:

no CI
no tests
no public audit command
no before/after evidence inside the repo
no explicit Copilot instructions for future assisted edits
no repeatable quality gate for secret hygiene

What I Shipped

For this finish-up pass, I turned the repo from a polished demo into a more complete public artifact.

I added:

content_ops/audit.py for repeatable completion checks
scripts/run_finish_audit.py to generate a Markdown audit report
tests/test_audit.py for basic quality gates
.github/workflows/ci.yml for GitHub Actions
.github/copilot-instructions.md for future Copilot-assisted work
BEFORE_AFTER.md to make the completion arc explicit
FINISH_UP_A_THON.md to document what changed and why
WATCH_LAYER.md and content_ops/watch_layer.py after a community comment pointed out that cron should not be the whole operating model

The public repo now has a way to answer: is the submission complete, inspectable, and safe to publish?

Update after community feedback

After publishing the first version, Harpinder left a useful comment: the system should not rely only on cron. A real content operations agent should also have a watch layer that can wake on source changes, human review events, or safe retry conditions.

I agreed, so I added a lightweight event-aware layer:

content_ops/watch_layer.py
scripts/watch_content_events.py
WATCH_LAYER.md
tests/test_watch_layer.py
examples/watch-report.json

Cron still owns the 07:00 and 18:00 rhythm, but the watch layer now models event-based wakeups for changed research, approved human review, and WeChat draft upload failures.

Example output:

{
  "wakeAgent": true,
  "nextAction": "wake_agent",
  "events": [
    {
      "kind": "source_changed",
      "path": "topic_research.md",
      "reason": "topic research changed; agent should re-evaluate freshness and angle",
      "severity": "info"
    }
  ]
}

This made the project feel much less like "a cron script" and much more like an event-aware agent workflow.
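
A source-change check like the one in the example output can be sketched with a content hash. This is not the code in content_ops/watch_layer.py; `file_digest`, `check_source`, and the `"sleep"` action are hypothetical, and only the output shape follows the example report.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_digest(path: Path) -> str:
    """Hash the file contents so edits are detected regardless of timestamps."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_source(path: Path, last_digest: Optional[str]) -> dict:
    """Return a wake decision in the same shape as the example report."""
    events = []
    if path.exists() and file_digest(path) != last_digest:
        events.append({
            "kind": "source_changed",
            "path": path.name,
            "reason": "topic research changed; agent should re-evaluate freshness and angle",
            "severity": "info",
        })
    wake = bool(events)
    return {
        "wakeAgent": wake,
        "nextAction": "wake_agent" if wake else "sleep",
        "events": events,
    }
```

The caller is expected to store the digest from the last processed run and pass it back in on the next check.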

Before & After

Before:

Strong project idea.
Real private workflow behind it.
Good README and visuals.
Sanitized sample scripts.
But no automated verification.

After:

Public audit command.
Unit tests.
CI workflow.
Secret-hygiene check.
Before/after docs.
Copilot-ready instructions.
A generated completion report.
Event-aware wakeups for source changes, human review approvals, and safe retries.

Run the audit:

python scripts/run_finish_audit.py

It writes:

finish-audit-report.md

Demo

Repository:

https://github.com/kax168/hermes-agent-content-ops

Core files added during the finish-up:

content_ops/audit.py
scripts/run_finish_audit.py
tests/test_audit.py
.github/workflows/ci.yml
.github/copilot-instructions.md
BEFORE_AFTER.md
FINISH_UP_A_THON.md

The audit checks that the required public files exist, that the finish-up evidence exists, and that obvious credential tokens are not present in public source files.
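
As a rough illustration of those checks, here is a minimal sketch. It is not the repository's content_ops/audit.py: the required-file list is a subset taken from this post, and the regular expression and scanned suffixes are assumptions that would need tuning for a real project.

```python
import re
from pathlib import Path

# Subset of the files named in this post; the real audit may require more.
REQUIRED_FILES = [
    "README.md",
    "BEFORE_AFTER.md",
    "FINISH_UP_A_THON.md",
    ".github/workflows/ci.yml",
]
SCAN_SUFFIXES = {".py", ".md", ".json", ".yml", ".txt"}
# Assumed pattern: a key-like name assigned a long quoted literal.
SECRET_PATTERN = re.compile(
    r"""(?i)(api[_-]?key|secret|access[_-]?token)\s*[=:]\s*['"][A-Za-z0-9_\-]{16,}['"]"""
)


def audit(repo: Path) -> dict:
    """Check required files and scan text files for obvious credential literals."""
    missing = [name for name in REQUIRED_FILES if not (repo / name).is_file()]
    leaks = []
    for path in sorted(repo.rglob("*")):
        if path.is_file() and path.suffix in SCAN_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if SECRET_PATTERN.search(text):
                leaks.append(str(path.relative_to(repo)))
    return {
        "missing_files": missing,
        "possible_secrets": leaks,
        "passed": not missing and not leaks,
    }
```

A pattern this simple only catches obvious mistakes; it complements, rather than replaces, keeping secrets out of the working tree in the first place.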

How GitHub Copilot Helped

The honest note: my local GitHub CLI exposed the gh copilot entry point, but the Copilot CLI binary was not installed in this environment when I attempted to invoke it. I did not want to pretend otherwise.

So I made the project Copilot-ready rather than writing a fake Copilot story.

The repo now includes .github/copilot-instructions.md, which tells future Copilot-assisted edits to:

keep credentials out of the repo
preserve draft-first publishing
avoid live API calls in public samples
keep Xiaohongshu review human-in-the-loop
add tests or audit checks when changing the public package

I also added a concrete audit script and CI workflow, which are useful because Copilot works better when a repository has clear boundaries, tests, and a feedback loop.

If I continue this project with Copilot enabled, the next Copilot tasks are obvious: expand tests, add issue templates, strengthen the audit report, and refactor the sample scripts while preserving the safety boundaries.

What I Learned

The hard part of finishing an agent project is not only making it run. It is making it inspectable.

Agent workflows touch schedules, APIs, credentials, generated content, and user intent. A public repo needs more than screenshots. It needs proof that the dangerous parts are fenced off and the useful parts are easy to understand.

This finish-up pass gave the project that missing layer: tests, CI, auditability, and a clear before/after story.

I Built a Local Gemma 4 Content Radar for Private Editorial Decisions

K X — Wed, 20 May 2026 07:07:04 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4.

What I Built

I built Local Gemma 4 Content Radar, a small but practical editorial intelligence tool that runs Gemma 4 locally and turns a messy batch of content signals into a structured publishing decision report.

The problem I wanted to solve is simple: creators and technical media operators do not need one more random idea generator. They need a way to compare signals, choose the strongest angle, explain why it matters now, and flag risks before something gets published.

The tool takes a JSON file of candidate signals like trend notes, draft ideas, source snippets, audience hooks, or risk observations. It sends the full batch to a local Gemma 4 model through Ollama and returns:

the strongest topic to pursue
why that topic matters now
a reader-facing hook
ranked alternative candidates
evidence notes
risk notes
practical next actions

The project is designed around privacy. The default endpoint is http://127.0.0.1:11434/api/generate, so the editorial notes stay on the local machine unless the user chooses otherwise.

Demo

Repository:

https://github.com/kax168/gemma4-local-content-radar

Run it locally:

python3 scripts/content_radar.py examples/signals.json \
  --out examples/radar-output.json \
  --markdown examples/radar-output.md

In the example run, Gemma 4 selected this top topic:

Developers are testing local multimodal models for private document review

Gemma 4 explained that this is timely because teams increasingly want to process proprietary PDFs, screenshots, internal docs, and research notes without sending sensitive data to cloud APIs. It also produced a hook, risk notes, ranked alternatives, and follow-up actions.

That is the core user experience: local Gemma 4 acts as a private editorial reasoning layer, not just a text generator.

Code

The implementation is intentionally small and inspectable.

.
+-- examples/
|   +-- signals.json
|   +-- radar-output.json
|   +-- radar-output.md
+-- scripts/
|   +-- content_radar.py
+-- assets/
    +-- hero.svg

The script does four things:

loads candidate signals from JSON
builds a prompt that asks for strict structured output
calls local Ollama with Gemma 4
writes both JSON and Markdown reports

I kept the project dependency-light on purpose. It uses Python's standard library so the workflow is easy to audit, clone, and adapt.
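
The prompt-building and Ollama steps can be sketched with the standard library alone. This is not the code in scripts/content_radar.py: the function names and the JSON key names in the prompt are assumptions. The request fields (`model`, `prompt`, `stream`, `format`) and the `response` field in the reply follow Ollama's /api/generate endpoint.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_prompt(signals: list[dict]) -> str:
    """Ask for strict JSON so the reply can be parsed without guesswork."""
    return (
        "You are an editorial analyst. Compare the candidate signals below and "
        "reply with JSON only, using the keys: top_topic, why_now, hook, "
        "alternatives, evidence_notes, risk_notes, next_actions.\n\n"
        + json.dumps(signals, ensure_ascii=False, indent=2)
    )


def build_request(signals: list[dict], model: str = "gemma4:e4b") -> urllib.request.Request:
    payload = {
        "model": model,
        "prompt": build_prompt(signals),
        "stream": False,   # one JSON object instead of a token stream
        "format": "json",  # ask Ollama to constrain the output to valid JSON
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_radar(signals: list[dict], timeout: float = 300.0) -> dict:
    """Send the batch to the local model; requires a running Ollama server."""
    with urllib.request.urlopen(build_request(signals), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return json.loads(body["response"])  # generated text lives in "response"
```

Separating `build_request` from `run_radar` keeps the payload inspectable and testable without a model running.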

How I Used Gemma 4

I used gemma4:e4b locally through Ollama.

From my local setup:

Architecture: gemma4
Effective parameters: 8.0B
Context length: 131072
Capabilities reported by Ollama: completion, vision, audio, tools, thinking
License: Apache License 2.0

I chose E4B because this project is about practical local AI, not winning a benchmark screenshot. E4B is small enough to run locally, but capable enough to compare a batch of signals, rank competing angles, and explain the tradeoffs.

Gemma 4 is doing the central work here:

long-context comparison across all candidate signals
editorial prioritization
hook generation
risk-aware summarization
next-step planning

The part I like most is that the model is not being used as a cloud chatbot bolted onto a workflow. It is the local reasoning engine at the center of the product.

What Gemma 4 Unlocked

The useful unlock is privacy-friendly judgment.

For content operations, the sensitive material is often not a final article. It is the messy middle: private notes, early research, screenshots, customer language, internal docs, and unverified claims. Sending all of that to a hosted API is not always acceptable.

A local Gemma 4 workflow changes the shape of the product. It lets the user perform the first editorial pass on private material before anything is sanitized, summarized, or published.

That makes this pattern useful beyond content marketing. The same approach could support:

private research triage
local document review
pre-publication safety checks
creator workflow planning
multilingual editorial planning
internal knowledge synthesis

What I Would Build Next

The next version would add a small browser UI, source importers, and an optional multimodal path where Gemma 4 can inspect screenshots or visual drafts before producing the editorial report.

I would also add a "claim hygiene" mode that forces every generated angle to include what is known, what is inferred, and what still needs verification.
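
One way such a mode could be enforced is a validator that runs on the model's output. This is a sketch of a feature that does not exist yet; the field names `known`, `inferred`, and `needs_verification` are assumptions.

```python
REQUIRED_CLAIM_FIELDS = ("known", "inferred", "needs_verification")


def claim_hygiene_problems(angles: list[dict]) -> list[str]:
    """List every angle that is missing one of the three claim sections."""
    problems = []
    for index, angle in enumerate(angles):
        for field in REQUIRED_CLAIM_FIELDS:
            # Each section must be present as a list, even if it is empty.
            if not isinstance(angle.get(field), list):
                problems.append(f"angle {index}: missing list field '{field}'")
    return problems
```

An empty result means every angle separates what is known from what is inferred; a non-empty result can be sent back to the model as a repair prompt.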

For this submission, I wanted the core to be honest and reproducible: local model in, structured decision report out.

I Built a Hermes Agent That Runs a Two-Post-a-Day AI Media Workflow

K X — Wed, 20 May 2026 05:10:08 +0000

This is a submission for the Hermes Agent Challenge.

What I Built

I built a Hermes-powered content operations system for a Chinese WeChat Official Account focused on AI tools, cross-border monetization, content going global, and practical AI workflows.

The goal was not to make a generic chatbot. I wanted Hermes to do the unglamorous operational work behind a real media account:

research current topics before writing;
pick a high-click-through topic that fits the account strategy;
generate a polished WeChat article;
create a theme-matched cover and visual structure;
format everything as WeChat-safe HTML;
upload the result into the WeChat Official Account draft box;
run the whole thing twice per day on a schedule.

The account publishes one article at 7:00 AM and another at 6:00 PM. The morning slot favors practical tutorials and tool/workflow guides. The evening slot favors trend analysis, controversy, reviews, and case breakdowns.

I also added a second operational workflow: a daily Xiaohongshu research queue for pet-supplies affiliate videos. It prepares search targets and review files, but intentionally keeps the final approval human-in-the-loop because the acceptance criteria require audio and on-screen text judgment.

Demo

The WeChat pipeline produces a complete local article package for each run:

~/.hermes/wechat-mp/YYYY-MM-DD/morning/
  article.md
  article.html
  article.json
  topic_research.md
  image_plan.md
  cover.png
  cover_prompt.txt

The draft creation step uses the official WeChat API. A run only counts as successful if the API returns a real draft media ID. Otherwise, the generated package stays on disk for inspection and Hermes reports the failure.

The Xiaohongshu workflow produces a daily review queue:

~/.hermes/xhs-pet-supplies-videos/YYYY-MM-DD/
  review_queue.txt
  approved_links_template.txt

Approved video links are appended to:

~/.hermes/xhs-pet-supplies-videos/links.txt

The important design choice: Hermes automates the repeatable operational work, but it does not pretend to verify things it cannot safely verify. For videos, "no Chinese narration" and "no more than 10 Chinese characters on screen" need visual/audio review, so the system creates the queue and saves only manually approved links.
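
The "save only manually approved links" step could look roughly like this. It is a sketch, not the private script: the function name is hypothetical, and the approval file format (one URL per line, `#` for comments) is an assumption.

```python
from pathlib import Path


def append_approved(approved_file: Path, links_file: Path) -> list[str]:
    """Append links from a human-filled approval file, skipping duplicates."""
    existing = set()
    if links_file.exists():
        existing = set(links_file.read_text(encoding="utf-8").split())

    added = []
    for line in approved_file.read_text(encoding="utf-8").splitlines():
        link = line.strip()
        if not link or link.startswith("#") or link in existing:
            continue
        existing.add(link)
        added.append(link)

    if added:
        with links_file.open("a", encoding="utf-8") as handle:
            handle.writelines(link + "\n" for link in added)
    return added
```

Nothing in this function decides whether a video qualifies; it only records what a human has already approved.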

Validation From My Local Build

The private local deployment has been tested end-to-end with real scheduled jobs and real WeChat draft creation. I am not publishing account credentials, access tokens, draft media IDs, or screenshots that expose private account metadata, but the workflow is designed around concrete success checks:

Hermes cron has two active daily article jobs: morning and evening.
A successful WeChat run must produce article.md, article.html, article.json, topic_research.md, image_plan.md, and cover.png.
The WeChat API call must return a real draft result; otherwise the run is considered failed.
The Xiaohongshu workflow creates review queues but does not auto-save links without human verification.
The public repo contains sanitized runnable samples so the architecture can be reviewed without leaking secrets.

Code

Repository: github.com/kax168/hermes-agent-content-ops

The private installation contains:

a Hermes skill for WeChat Official Account operations;
deterministic cron scripts for reliable scheduled execution;
WeChat Official Account API integration;
a content strategy file persisted under Hermes home;
Xiaohongshu review-queue scripts;
environment-based model/provider configuration.

Secrets are stored outside the repo in ~/.hermes/.env and are never printed or committed.

My Tech Stack

Hermes Agent
Hermes cron scheduler
Python
WeChat Official Account API
Google News RSS and Hacker News search for topic discovery
Configurable model provider for article generation
WeChat-safe inline HTML/CSS
Local PNG cover generation as a reliable fallback

How I Used Hermes Agent

Hermes is the operating layer of the project, not just a model wrapper.

1. Persistent skills

I created a local Hermes skill called wechat-official-account-operator. It contains the account strategy, content rules, visual requirements, API safety defaults, output schema, and success criteria.

That mattered because the account has a strong editorial position:

AI tool tutorials and reviews;
global AI product news;
AI agents and automation;
cross-border monetization;
GEO, AdSense, affiliate, independent sites, and content export;
legal "freebie" and cost-saving AI tool tactics.

The skill keeps these rules close to the agent so each run is not starting from a blank prompt.

2. Scheduled autonomous work

Hermes cron runs two jobs every day:

0 7 * * * for the morning article;
0 18 * * * for the evening article.

This is where Hermes felt different from a normal chat interface. The system is not waiting for a human to ask, "Please write a post now." It wakes up, collects context, generates artifacts, calls an API, and leaves a concrete result.
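
For dashboards or status messages it can help to compute which slot fires next. The sketch below mirrors the two cron expressions above with plain `datetime` arithmetic; it assumes naive local time, and `next_run` is a hypothetical helper, not part of Hermes.

```python
from datetime import datetime, time, timedelta

# Mirrors "0 7 * * *" and "0 18 * * *".
SLOTS = {"morning": time(7, 0), "evening": time(18, 0)}


def next_run(now: datetime) -> tuple[str, datetime]:
    """Return the name and time of the first slot strictly after `now`."""
    candidates = []
    for day_offset in (0, 1):
        day = (now + timedelta(days=day_offset)).date()
        for name, slot in SLOTS.items():
            candidates.append((datetime.combine(day, slot), name))
    when, name = min(c for c in candidates if c[0] > now)
    return name, when
```

Looking at today and tomorrow is enough because both slots repeat daily.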

3. Tool use and API execution

The WeChat pipeline performs several real steps:

Collect current topic candidates from web/news/community sources.
Generate a topic-research file with candidate topics, source links, heat signals, and final rationale.
Generate an article with a strong hook, practical value, risks, and CTA.
Render a polished WeChat-safe HTML layout instead of dumping plain Markdown.
Create a local cover image and visual plan.
Write a machine-readable article.json.
Upload the cover and article to WeChat through the official API.
Report the run status.

I added a strict "no simulation" rule: the run is not successful unless the local files exist and the WeChat API returns a real draft result.
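
That rule can be expressed as a single gate. This is a sketch rather than the private implementation: the artifact list comes from the validation section above, and it assumes the draft API result is a dict whose `media_id` key holds the draft ID on success.

```python
from pathlib import Path

REQUIRED_ARTIFACTS = (
    "article.md",
    "article.html",
    "article.json",
    "topic_research.md",
    "image_plan.md",
    "cover.png",
)


def run_succeeded(package_dir: Path, api_result: dict) -> tuple[bool, list[str]]:
    """A run passes only if every artifact exists and the API returned a draft ID."""
    reasons = [
        f"missing {name}"
        for name in REQUIRED_ARTIFACTS
        if not (package_dir / name).is_file()
    ]
    media_id = api_result.get("media_id")
    if not isinstance(media_id, str) or not media_id:
        reasons.append("API result has no draft media_id")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict keeps a failed run explainable in the status report.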

4. Practical reliability over agent theater

One of the more interesting lessons was that a fully autonomous large-prompt agent run was not the most reliable way to handle scheduled publishing. Some agentic runs timed out or got stuck on browser/model calls.

So I changed the architecture: Hermes still owns the schedule, skills, memory, and operational workflow, but the daily execution path uses deterministic Python scripts for the fragile parts.

That hybrid design made the system more useful:

the agent defines and evolves the workflow;
scripts perform predictable API/file work;
Hermes cron keeps it alive on a schedule;
failures leave inspectable artifacts instead of vague chat logs.
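
A small wrapper is one way to get "inspectable artifacts instead of vague chat logs". This sketch is an assumption about how such a step runner could look, not the project's code; `run_step` and the `.status.json` naming are hypothetical.

```python
import json
import traceback
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def run_step(name: str, step: Callable[[], object], artifact_dir: Path) -> bool:
    """Run one pipeline step and always leave a status file behind."""
    record = {"step": name, "started_at": datetime.now(timezone.utc).isoformat()}
    try:
        step()
        record["status"] = "ok"
    except Exception:
        # Keep the full traceback on disk so the failure can be inspected later.
        record["status"] = "failed"
        record["traceback"] = traceback.format_exc()
    artifact_dir.mkdir(parents=True, exist_ok=True)
    status_path = artifact_dir / f"{name}.status.json"
    status_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record["status"] == "ok"
```

Because the status file is written on both success and failure, a missing file is itself a signal that the step never ran.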

5. Human-in-the-loop where judgment is required

The Xiaohongshu workflow is deliberately semi-automatic. It searches for pet-supplies affiliate video candidates and prepares a review queue, but final saving requires manual confirmation.

That is not a limitation I wanted to hide. It is a product decision. If the rule says "no Chinese narration" and "almost no Chinese on-screen text," then silently adding links without checking would produce bad data. Hermes is still useful because it removes the repetitive search/setup work while preserving the judgment step.

Why This Fits the Judging Criteria

The Build With Hermes Agent prompt is judged on effective use of Hermes Agent's agentic capabilities, technical implementation and code quality, creativity and originality, and usability/user experience. Here is how I designed for those points:

Agentic capabilities: Hermes is used for persistent skills, scheduled execution, workflow memory, tool/API orchestration, and delivery reporting.
Technical implementation: Fragile steps are handled by deterministic scripts, credentials stay in environment files, generated artifacts are inspectable, and API success is verified instead of assumed.
Creativity: The project applies an open agent to a real media-operations workflow across WeChat and Xiaohongshu, not a generic assistant demo.
Usability: The system defaults to draft creation instead of risky auto-publishing, keeps failures debuggable, and uses human review where automation would be unsafe.

What Makes This Useful

This project is small, but it is real. It has all the annoying parts of actual automation:

credentials and API permissions;
scheduled jobs;
retries and timeouts;
local artifacts for debugging;
editorial constraints;
platform-specific HTML limitations;
safe publishing defaults;
human review boundaries.

Hermes Agent was a good fit because the workflow needed memory, scheduling, tools, and persistence more than it needed a prettier chat response.

The result is an agentic system that can operate a narrow but real content pipeline: from topic discovery to WeChat draft creation, twice a day.

What I Would Improve Next

I would like to add:

a visual dashboard for scheduled run status;
screenshot-based QA for WeChat article layout before upload;
safer browser automation for Xiaohongshu candidate collection;
automatic source-quality scoring;
multi-platform repurposing from one article package to X/Twitter, LinkedIn, and short-video scripts.

The biggest takeaway: the best agent projects are often not magic demos. They are workflows where the agent keeps showing up, on schedule, doing the boring parts correctly.

5 AI Prompt Engineering Techniques That Actually Work in 2026

K X — Fri, 06 Mar 2026 06:17:44 +0000

After testing hundreds of prompts over the past year, I've identified 5 techniques that consistently deliver high-quality outputs.

1. Specificity Over Generality

❌ Bad: "Write a story"
✅ Good: "Write a 500-word mystery story set in 1920s Paris, featuring a detective who solves crimes using psychology"

The more specific you are, the better the output.

2. Role Assignment

Start with: "You are an expert [role]..."

This primes the AI to respond with domain-specific knowledge and appropriate tone.

Example: "You are an expert copywriter with 10 years of experience in SaaS marketing..."

3. Output Format Specification

Always specify the format you want:

Bullet points
Numbered lists
JSON
Markdown table
Code blocks

This prevents the AI from choosing a format you don't need.

4. Iterative Refinement

Don't expect perfection on the first try. Use follow-up prompts:

"Make it more concise"
"Add more technical details"
"Rewrite in a casual tone"

5. Context Layering

Provide context in layers:

Background information
Specific task
Constraints
Desired outcome

Example:

Background: I'm writing a blog post about AI for beginners
Task: Explain neural networks
Constraints: Use simple language, no math
Outcome: 300 words, engaging tone
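
If you build prompts in code, the same layering can be captured in a small helper. This is a minimal sketch; `layered_prompt` is a hypothetical function, and the optional `role` argument reuses the role-assignment technique from section 2.

```python
def layered_prompt(background: str, task: str, constraints: str, outcome: str, role: str = "") -> str:
    """Assemble a prompt from labeled layers, skipping any layer left empty."""
    layers = [
        ("Background", background),
        ("Task", task),
        ("Constraints", constraints),
        ("Outcome", outcome),
    ]
    lines = [f"{label}: {text.strip()}" for label, text in layers if text.strip()]
    if role:
        lines.insert(0, f"You are an expert {role}.")
    return "\n".join(lines)


prompt = layered_prompt(
    background="I'm writing a blog post about AI for beginners",
    task="Explain neural networks",
    constraints="Use simple language, no math",
    outcome="300 words, engaging tone",
)
```

Keeping the layers as separate arguments makes it easy to vary one of them, such as the constraints, while holding the rest fixed.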

Resources

If you want ready-to-use prompts, check out LaerKai - a collection of 200 battle-tested prompts across 8 categories including creative writing, business content, and technical documentation.

What prompt techniques work best for you? Share in the comments!