Arthur

Posted on May 24

I built an AI PR-triage agent in 30 lines of Markdown

#devchallenge #googleiochallenge #python #tutorial

This is a submission for the Google I/O Writing Challenge

A recipe for the AI PR-triage agent I built after Google I/O 2026: three Markdown skill files, one Python runner, one real public GitHub repo, about twelve cents per run.

1. What I built

At Google I/O 2026, Logan from the Gemini API team walked through an AGENTS.md file for an AI talk-radio agent and dropped a line on stage that stuck with me: "the hottest new programming language is Markdown." He had written no orchestration logic, just skills and tools in Markdown files, and the agent shipped a finished podcast episode from a single API call.

I took that seriously. The next day, I spent a few hours building an AI pull-request triage agent on the public Gemini API. Three Markdown skill files, one small Python runner, one real public GitHub repo as the target. The agent scanned sixteen open pull requests, categorized each by risk, drafted a one-line summary, and produced a grouped report. Two consecutive runs, identical category distributions, under two minutes each, about twelve cents per run.

This article is the recipe. Working code, real cost, an excerpt of the actual triage report the agent wrote, and enough scaffolding for you to try it tonight against any public repo you care about.

2. What "skills as Markdown" actually means

A skill is a single .md file with four pieces:

A name and a one-sentence description of when to invoke it.
A numbered procedure.
Constraints (what the skill must not do).
Composition notes (which skill, if any, the agent should call next).

The agent loads skills when they are relevant to the user request, and it calls the tools they reference. There is no orchestration logic inside the skill file. The skill is the spec.

This is meaningfully different from cramming everything into a system prompt. Skills compose: skill A can hand off to skill B without the runner reshuffling state. Skills version independently from the runner, so you can iterate on the prose without touching Python. Skills carry per-tool constraints, which the model tends to respect because each constraint is attached to the procedure it governs rather than buried in a long preamble.
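The four-part shape can be checked mechanically before a skill ever reaches the model. Here is a minimal sketch, assuming the heading convention used by scan_open_prs.md (`# Skill:`, `## Procedure`, `## Constraints`, `## Composition`). The function name `check_skill` is hypothetical and is not part of the runner described in this article.

```python
import re

# Section headings every skill file is assumed to carry.
REQUIRED_SECTIONS = ("Procedure", "Constraints", "Composition")


def check_skill(markdown: str) -> list[str]:
    """Return a list of problems; an empty list means the file has all four pieces."""
    problems = []
    lines = markdown.splitlines()

    # Piece 1: the name line.
    if not any(re.match(r"^# Skill: \S+", line) for line in lines):
        problems.append("missing '# Skill: <name>' heading")

    # Pieces 2-4: the three second-level sections.
    headings = {line[3:].strip() for line in lines if line.startswith("## ")}
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing '## {section}' section")

    # The procedure should be a numbered list.
    if "Procedure" in headings and not any(re.match(r"^\d+\. ", line) for line in lines):
        problems.append("procedure has no numbered steps")

    return problems
```

A check like this fits naturally in CI, so a skill edit that drops its constraints section fails review before it costs any tokens.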

3. The three skills

I wrote three files. Together they are 101 lines of Markdown for the entire agent definition. Here is the first one verbatim, the entry point for the agent:

# Skill: scan_open_prs

Use this skill when asked to list, scan, or audit open pull requests on a GitHub
repository.

## Procedure

1. Call the `github_list_prs` tool with `state="open"` and the requested `limit`
   (default 25, maximum 50). The tool requires a `repo` argument in the form
   `owner/name` (for example, `cli/cli`).
2. For each returned PR, keep these fields verbatim: `number`, `title`, `user`,
   `additions`, `deletions`, `changed_files`, `draft`, `created_at`, and the
   first 200 characters of `body`.
3. Return the result as a JSON array. Do not paraphrase the title or body.

## Constraints

- Do not fetch full diffs in this skill. That is `categorize_by_risk`'s job.
- Skip draft PRs unless the user has explicitly asked for them.
- If the tool returns zero PRs, report that plainly and stop. Do not invent PRs.
- If the tool returns an error, surface the error message verbatim and stop.

## Composition

After running this skill, the agent should call `categorize_by_risk` once with
the JSON array as input.

Twenty-six lines. That is the entire entry-point skill. Notice how much of it is constraints: "do not paraphrase," "do not invent PRs," "skip drafts unless asked." Most of the work in writing a good skill goes into anticipating the model's bad habits and writing them out of the procedure.
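What step 2 and the draft constraint ask of the model can also be written as plain Python, which is handy as a reference when checking the agent's JSON trace. This is a sketch only: in the agent, the model does this work guided by the skill. `project_prs` is a hypothetical helper, and the input dicts are assumed to already carry the field names listed in the skill.

```python
# Fields the skill says to keep verbatim (body is handled separately).
KEPT_FIELDS = ("number", "title", "user", "additions", "deletions",
               "changed_files", "draft", "created_at")
BODY_LIMIT = 200  # "the first 200 characters of body"


def project_prs(prs: list[dict], include_drafts: bool = False) -> list[dict]:
    """Keep only the fields named in scan_open_prs, skipping drafts by default."""
    out = []
    for pr in prs:
        if pr.get("draft") and not include_drafts:
            continue  # constraint: skip draft PRs unless explicitly requested
        row = {field: pr.get(field) for field in KEPT_FIELDS}
        row["body"] = (pr.get("body") or "")[:BODY_LIMIT]  # body may be None
        out.append(row)
    return out
```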

The second skill, categorize_by_risk.md, is 41 lines. It calls github_get_pr_files for each PR and applies first-match-wins heuristics: breaking if the PR touches dependency files, security if it touches auth or crypto paths, docs if it only changes docs, fix if the title contains certain keywords, refactor if additions roughly equal deletions, feature otherwise. Each PR gets a category, a confidence, and a one-sentence reason.
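To make "first match wins" concrete, here is the same rule order as a small Python function. In the agent these rules live in Markdown and the model applies them, so treat this as an illustration of the ordering, not the skill itself. The specific file names, path markers, title keywords, and the 20% tolerance for "roughly equal" are all assumptions chosen for the example.

```python
import posixpath

# All four constants below are assumed values for illustration.
DEPENDENCY_FILES = {"go.mod", "go.sum", "package.json", "requirements.txt"}
SECURITY_MARKERS = ("auth", "crypto")      # crude substring match on paths
FIX_KEYWORDS = ("fix", "bug", "regression")
REFACTOR_TOLERANCE = 0.2                   # |additions - deletions| <= 20% of total


def is_doc(path: str) -> bool:
    return path.startswith("docs/") or path.endswith(".md")


def categorize(title: str, files: list[str], additions: int, deletions: int) -> tuple[str, str]:
    """Return (category, one-sentence reason); the first matching rule wins."""
    if any(posixpath.basename(f) in DEPENDENCY_FILES for f in files):
        return "breaking", "Touches a dependency file."
    if any(m in f.lower() for f in files for m in SECURITY_MARKERS):
        return "security", "Touches an auth or crypto path."
    if files and all(is_doc(f) for f in files):
        return "docs", "Only documentation files changed."
    if any(k in title.lower() for k in FIX_KEYWORDS):
        return "fix", "Title contains a fix keyword."
    total = additions + deletions
    if total > 0 and abs(additions - deletions) <= REFACTOR_TOLERANCE * total:
        return "refactor", "Additions roughly equal deletions."
    return "feature", "No earlier rule matched."
```

Because the order is fixed, a PR that bumps a dependency and also edits an auth path lands in breaking, not security. That is the kind of edge case worth stating explicitly in the skill prose.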

The third, draft_summary.md, is 34 lines. It produces an action-verb-first one-line summary for each PR and emits the final report grouped by category, security first.
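The report layout is simple enough to pin down in code, which makes a useful yardstick for the model's output. This is a sketch under assumptions: only "security first" comes from the skill, while the order of the remaining categories and the item keys (`number`, `category`, `summary`) are hypothetical.

```python
# Security first, as the skill instructs; the rest of the order is assumed.
CATEGORY_ORDER = ("security", "breaking", "fix", "feature", "refactor", "docs")


def render_report(items: list[dict]) -> str:
    """Group summarized PRs by category and render them as Markdown sections."""
    sections = []
    for category in CATEGORY_ORDER:
        rows = [item for item in items if item["category"] == category]
        if not rows:
            continue  # omit empty categories
        lines = [f"## {category}", ""]
        lines += [f"- #{row['number']}: {row['summary']} [{category}]" for row in rows]
        sections.append("\n".join(lines))
    # Items whose category is not in CATEGORY_ORDER are left out of this sketch.
    return "\n\n".join(sections)
```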

One short note on composition. When skill A says "now call skill B," the agent treats the boundary as a turn break. Skill B runs in a fresh turn with the JSON output of skill A as its input. This is multi-turn composition, not in-call composition, and it shapes how you structure your skills: each one is a complete unit of work with a clean input and output, not a function in a chain.

4. The runner

The runner is roughly 70 lines of Python that loads the skills, registers two function tools (github_list_prs and github_get_pr_files), and drives a multi-turn loop until the model says it is finished.
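For a sense of what sits behind a function tool, here is one way the local side of github_get_pr_files could look, using only the standard library and GitHub's REST endpoint for listing the files of a pull request. It is a sketch, not the runner's actual code: `build_pr_files_request` is a hypothetical helper, the User-Agent string is arbitrary, and only the first page of up to 100 files is fetched.

```python
import json
import os
import urllib.request

API_ROOT = "https://api.github.com"


def build_pr_files_request(repo: str, number: int, token: str = "") -> urllib.request.Request:
    """Build the GET request for a PR's changed files; repo is 'owner/name'."""
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "pr-triage-sketch",  # arbitrary value for this example
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    url = f"{API_ROOT}/repos/{repo}/pulls/{number}/files?per_page=100"
    return urllib.request.Request(url, headers=headers)


def github_get_pr_files(repo: str, number: int) -> list[dict]:
    """Fetch the changed files and keep only what the risk heuristics need."""
    req = build_pr_files_request(repo, number, os.environ.get("GH_TOKEN", ""))
    with urllib.request.urlopen(req, timeout=30) as resp:
        files = json.load(resp)
    return [
        {"filename": f["filename"], "additions": f["additions"], "deletions": f["deletions"]}
        for f in files
    ]
```

Trimming the response before it goes back to the model keeps the context small, which matters when the loop makes one of these calls per PR.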

Google's official Managed Agents API is early-access only at the moment, but the same shape (one call, attached skills, attached tools) runs on the public Gemini API today, with the same skill files.

The shape, abbreviated:

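import os

from google import genai
from google.genai import types

# load_skills, user_turn, extract_tool_calls, run_tools, and the two *_decl
# tool declarations are small pieces of the runner not shown in this excerpt.
repo = "cli/cli"                          # target repository, owner/name
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
skills = load_skills("skills/")           # reads the three .md files
tools = [github_list_prs_decl, github_get_pr_files_decl]

# The skill text rides along in the first user turn.
contents = [user_turn(f"Triage open PRs on {repo}. Skills:\n\n{skills}")]
while True:
    resp = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=contents,
        config=types.GenerateContentConfig(tools=tools),
    )
    calls = extract_tool_calls(resp)
    if not calls:
        break                             # no tool calls left: this is the final answer
    contents.append(resp.candidates[0].content)   # keep the model's turn in history
    contents.append(run_tools(calls))     # executes locally, returns FunctionResponse
print(resp.text)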

That is the entire control structure. One client, two function tools, one loop, three Markdown files attached as part of the user turn. The loop pays for everything the agent learns about the repo: which PRs exist, which files they touch, what the titles look like. No graph framework, no orchestrator, no agent class hierarchy.
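The load_skills helper is the simplest piece to fill in. Below is a minimal sketch, assuming the skills are joined into one string for the user turn as in the loop above. The alphabetical load order and the `---` separator are choices made for this example, not something the pattern requires.

```python
from pathlib import Path


def load_skills(directory: str) -> str:
    """Read every .md file in the directory and join them into one prompt block."""
    paths = sorted(Path(directory).glob("*.md"))  # deterministic, alphabetical order
    if not paths:
        raise FileNotFoundError(f"no .md skill files found in {directory!r}")
    return "\n\n---\n\n".join(
        path.read_text(encoding="utf-8").strip() for path in paths
    )
```

Failing loudly on an empty directory is deliberate: a run with no skills attached still costs tokens and produces a confident-looking report.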

5. The runs

I pointed the agent at cli/cli, the GitHub CLI repository, which had sixteen open non-draft pull requests at the time. I ran it twice from a cold start.

The numbers:

19 tool calls per run. Three github_list_prs calls during exploration (the agent verified pagination), then sixteen github_get_pr_files calls, one per PR.
Elapsed time. Run 1: 112 seconds. Run 2: 84 seconds. The second run was faster, apparently because the model committed to the plan earlier; the tool-call count was the same in both runs.
Cost. About $0.12 per run in Gemini 3.5 Flash spend.
Stability. Both runs produced identical category distributions across the sixteen PRs. No hallucinated PR numbers, no missed PRs.

Here is what the agent wrote for the top of the report:

## security

- #13500: Refactor string splitting in loops to use the more efficient SplitSeq function. [security]
- #13492: Add gh-cli-site-deployer App to replace SITE_DEPLOY_PAT in release workflows. [security]
- #13403: Refactor GitHub database IDs to use 64-bit integers across commands and API clients. [security]
- #13250: Add categorized target host categorization (github.com vs tenant) to telemetry data. [security]

## feature

- #13471: Add --all flag to gh skill install to support installing all discovered skills. [feature]

The category-by-category reasoning was crisp. Security PRs were grouped at the top, exactly as draft_summary.md had instructed. Every summary led with a verb. Confidence scores matched the heuristics in categorize_by_risk.md. The skill files did the work.

At nightly cadence on a repo this size, the annual cost lands somewhere around $40 to $50. Cheap, especially compared to the developer-hours of triage it replaces.
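The arithmetic behind that estimate, as a quick check. The per-run figure is the one measured above; one run per night for 365 nights is the assumed schedule, and `annual_cost` is just a name for the multiplication.

```python
COST_PER_RUN_USD = 0.12   # measured spend per run, from the numbers above
RUNS_PER_YEAR = 365       # assumed: one run every night


def annual_cost(cost_per_run: float, runs_per_year: int) -> float:
    """Yearly spend if every run costs about the same."""
    return cost_per_run * runs_per_year


print(f"{annual_cost(COST_PER_RUN_USD, RUNS_PER_YEAR):.2f}")  # prints 43.80
```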

6. Three things worth knowing

A few practical notes from the build.

Composition is multi-turn, not in-call. If skill A invokes skill B, plan for a turn boundary between them. The model's working memory between turns is whatever you put back into contents, so emit clean JSON at skill boundaries rather than relying on natural-language handoff.
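One way to enforce a clean boundary is to validate the handoff before it goes back into contents. This is a sketch, assuming the handoff is a JSON array of PR objects as scan_open_prs specifies. `parse_handoff` and the minimal required key set are hypothetical.

```python
import json

# Assumed minimum for this example; a real check might require every kept field.
REQUIRED_KEYS = {"number", "title"}


def parse_handoff(text: str) -> list[dict]:
    """Parse a skill's JSON output and reject anything that is not a list of PR objects."""
    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, list):
        raise ValueError("handoff must be a JSON array")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not an object")
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"item {index} is missing keys: {sorted(missing)}")
    return data
```

If the check fails, the runner can stop or ask the model to re-emit the array, so a malformed handoff never reaches the next skill.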

Token spend is non-deterministic. The agent pays to learn the repo, and how much it pays depends on what it finds. On a 1,000-PR monorepo, set an explicit tool-call budget in the runner and have the loop break when it is exceeded. Otherwise a single run can quietly become expensive.
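A budget guard can be as small as a counter around the loop. The sketch below separates the loop from the model so it can be tested with stubs: `next_calls` stands in for "ask the model and extract its tool calls" and `run_tools` for local execution. Both names, the exception class, and the default budget of 60 are assumptions for illustration.

```python
from typing import Callable


class ToolBudgetExceeded(RuntimeError):
    """Raised when a run asks for more tool calls than the budget allows."""


def drive(next_calls: Callable[[], list],
          run_tools: Callable[[list], None],
          max_tool_calls: int = 60) -> int:
    """Loop until the model requests no tools; return the number of calls executed."""
    used = 0
    while True:
        calls = next_calls()
        if not calls:
            return used  # model is finished
        if used + len(calls) > max_tool_calls:
            # Stop before executing the batch that would exceed the budget.
            raise ToolBudgetExceeded(
                f"{used} calls used, {len(calls)} more requested, budget is {max_tool_calls}"
            )
        run_tools(calls)
        used += len(calls)
```

Checking before execution means the budget is a hard ceiling. For scale, the cli/cli runs above used 19 calls each.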

For audited or strictly deterministic pipelines, an orchestrator graph still wins. Markdown skills are the right tool for exploratory work, summarization, triage, and drafting. If your pipeline has compliance hooks, retry semantics, or a fan-out fan-in shape, reach for a graph framework. The two patterns coexist.

7. Try it tonight

The whole recipe:

Run pip install google-genai in a virtualenv. Set GEMINI_API_KEY to a key from Google AI Studio and GH_TOKEN to a read-only GitHub personal access token.
Save three skill files in skills/. Use scan_open_prs.md above as the template; write categorize_by_risk.md and draft_summary.md in the same shape (name, procedure, constraints, composition).
Write the runner: one genai.Client, two function tool declarations (github_list_prs, github_get_pr_files), one multi-turn loop driving until the model emits no more tool calls.
Point it at a public GitHub repo. Start with something small. cli/cli is a good first target because the PR titles are descriptive.
Read the JSON trace the loop produces. Tweak the skill prose where the agent went sideways. Run again. The whole iteration cycle is about a minute.

Budget two evenings of work, including the runs; the agent pays for itself the first time you let it sweep a backlog before standup.

8. Closing

I am optimistic about this pattern. Markdown skills make agent definitions reviewable in a pull request, runnable from any IDE, portable across runners. The skill file is a primary artifact, not a string buried inside a Python class. Anyone on the team can edit it. Anyone reading the repo can see what the agent will do.

Which workflows in your stack feel like a natural fit for Markdown skills, and which still need a graph?
