DEV Community: Yoges Mohan

When AI Reviewers Disagree: Building a Multi-Agent DevSecOps Tribunal with Qwen-Max

Yoges Mohan — Sat, 11 Jul 2026 02:22:47 +0000

Every AI code review tool has the same weakness

I've been building AI code review tools since 2024. Every one I've tried, and every one on the market, has the same structural flaw: it's a single AI voice making a single judgment with no cost to being wrong.

Even the newer "multi-agent" tools like GitHub Copilot's parallel review agents don't really solve this. They divide labor: one agent for security, one for linting, one for testing. But when two of those agents disagree about how serious an issue is, nothing resolves it. The reviewer's job just gets pushed onto the developer.

That's the gap I built ShiftLeft Society to fill, for the Qwen Cloud Global AI Hackathon 2026.

The idea: two AI reviewers who actually negotiate with each other when they disagree, with a cost structure that keeps the whole thing auditable.

The result: a live, deployed multi-agent tribunal that scored 95% on a 40-case benchmark, versus 82.5% for a single-agent baseline. A 12.5-point improvement, primarily by reducing false positives on safe code.

Here's how I built it and what I learned.

The core insight: the LLM proposes, deterministic code disposes

Most "multi-agent negotiation" tutorials I found online let the LLM decide everything, including the math. They ask the model to self-report a confidence number (0-100), then multiply it against something, and out pops a "verdict." That produces different outputs on every run. Unauditable. Uncalibrated. Basically vibes.

I inverted that:

The LLM only picks a categorical position: DEFEND, PARTIAL, or CONCEDE.
Deterministic Python computes the budget cost, the revised severity, and everything downstream.

Here's the actual cost table:

Position	Cost
DEFEND (dig in on your severity)	`gap_tiers × 30`
PARTIAL (compromise halfway)	15
CONCEDE (yield fully)	0

Each agent starts a negotiation with a 100-point budget. If Security wants to DEFEND against Performance and they're 2 severity tiers apart (say, Security = CRITICAL vs Performance = MEDIUM), DEFEND costs 60 points. If they're 3 tiers apart, DEFEND costs 90. Meaning: the stronger your disagreement, the more it costs you to hold your ground. That forces the LLM to actually think about whether it wants to spend the budget.

The LLM never touches a number. It just picks a category. The math is Python.
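The cost table above is small enough to sketch directly. This is a minimal illustration, not the repo's code: the four-tier severity scale (including the name `LOW`) and the function names are assumptions chosen to match the CRITICAL-vs-MEDIUM example.

```python
from enum import Enum

# Assumed severity scale, ordered from least to most severe.
SEVERITY_TIERS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
START_BUDGET = 100


class Position(Enum):
    DEFEND = "DEFEND"
    PARTIAL = "PARTIAL"
    CONCEDE = "CONCEDE"


def gap_tiers(own: str, peer: str) -> int:
    """Number of severity tiers separating the two agents."""
    return abs(SEVERITY_TIERS.index(own) - SEVERITY_TIERS.index(peer))


def position_cost(position: Position, own: str, peer: str) -> int:
    """Deterministic cost of a categorical position, per the cost table."""
    if position is Position.DEFEND:
        return gap_tiers(own, peer) * 30
    if position is Position.PARTIAL:
        return 15
    return 0  # CONCEDE is free


def remaining_budget(position: Position, own: str, peer: str,
                     budget: int = START_BUDGET) -> int:
    """Budget left after paying for the chosen position."""
    return budget - position_cost(position, own, peer)
```

The model's only contribution is the `Position` value; everything numeric is computed from it and the two severities.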

Three properties this gives you that pure-LLM agent demos rarely have:

Auditable. Every verdict has a transcript: who argued what, what position they took, what it cost them.
Reproducible. Same positions in → same verdict out. The arithmetic never varies with LLM temperature; only the categorical pick can.
Defensible. The mediator's final call is derived from budget-weighted state, not from vibes.

Cross-PR credibility: the agents remember

The negotiation isn't stateless. Every time the tribunal produces a verdict, each agent's Round 2 judgment gets scored: did it defend correctly, or did it correctly defer to a peer flagging a more serious issue?

That outcome is:

Bayesian-smoothed against a 5-negotiation neutral prior (so a single early win/loss can't swing anything)
Capped at ±15 budget points (so no agent can ever fully dominate or be silenced)
Persisted across every PR the tribunal has ever reviewed

The next time that agent negotiates, on a completely different PR days later, it starts from 100 + track_record_adjustment, not a fresh 100 every time.

effective_budget = 100 + clamp(-15, +15, round((smoothed_win_rate - 0.5) * 30))
smoothed_win_rate = (wins + 2.5) / (total_negotiations + 5)
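As a runnable sketch, the two formulas translate almost line for line into Python. The function names and the `clamp(lo, hi, x)` argument order follow the formulas above; nothing else is assumed.

```python
def smoothed_win_rate(wins: int, total_negotiations: int) -> float:
    """Win rate smoothed against a 5-negotiation neutral (50%) prior."""
    return (wins + 2.5) / (total_negotiations + 5)


def clamp(lo: int, hi: int, x: int) -> int:
    """Restrict x to the inclusive range [lo, hi]."""
    return max(lo, min(hi, x))


def effective_budget(wins: int, total_negotiations: int) -> int:
    """Starting budget for the next negotiation, adjusted by track record."""
    adjustment = round((smoothed_win_rate(wins, total_negotiations) - 0.5) * 30)
    return 100 + clamp(-15, 15, adjustment)
```

An agent with no history gets exactly 100, because the prior alone yields a 0.5 win rate. A single early result moves the budget by only a few points, which is the point of the smoothing.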

This turns the tribunal from a system that negotiates well once into a system that gets better at knowing which of its own voices to trust over time, without any human manually re-weighting anything. It shows up right in the PR comment:

"track record: 95% upheld over 42 past negotiations → budget +13"

The tech stack

Qwen-Max via Alibaba Cloud DashScope international endpoint (dashscope-intl.aliyuncs.com/compatible-mode/v1) powers all agent reasoning.
LangGraph orchestrates the agent society: parallel Round 1 (using Send API for concurrent fan-out), conditional Round 2 negotiation only when severity gap exists, then the mediator, then end. Not a linear chain, a stateful graph with conditional edges.
FastMCP (Model Context Protocol) exposes real security tools: scan_vulnerabilities, detect_secrets, check_yaml_pinning, analyze_complexity. If the MCP server is unreachable, everything falls back to local regex-based detection so the tribunal degrades gracefully.
FastAPI for the gateway: REST endpoints, Server-Sent Events for the dashboard, GitHub webhook receiver, SARIF and CycloneDX SBOM export.
SQLite on a Docker-persisted volume for analyses, dialogue history, and the credibility table.
Docker + Caddy for containerized deployment with automatic Let's Encrypt HTTPS.
Alibaba Cloud ECS in Singapore as the live substrate.
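The graph shape described above (parallel Round 1, Round 2 only when a severity gap exists, then the mediator) can be sketched in plain asyncio. This is deliberately not LangGraph code: the two reviewer functions are hypothetical stubs standing in for LLM calls, and the severity scale is an assumption.

```python
import asyncio

# Assumed severity scale, ordered from least to most severe.
SEVERITY_TIERS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


async def security_review(diff: str) -> str:
    """Hypothetical stub: a real agent would call the LLM here."""
    return "CRITICAL" if "eval(" in diff else "LOW"


async def performance_review(diff: str) -> str:
    """Hypothetical stub: always reports LOW in this sketch."""
    return "LOW"


def severity_gap(a: str, b: str) -> int:
    return abs(SEVERITY_TIERS.index(a) - SEVERITY_TIERS.index(b))


async def run_tribunal(diff: str) -> list[str]:
    """Return the sequence of stages the review passed through."""
    # Round 1: both reviewers run concurrently (the fan-out step).
    security, performance = await asyncio.gather(
        security_review(diff), performance_review(diff)
    )
    stages = ["round1"]
    # Conditional edge: negotiate only when the reviewers disagree.
    if severity_gap(security, performance) > 0:
        stages.append("round2_negotiation")
    stages.append("mediator")
    return stages
```

Skipping Round 2 when the reviewers already agree keeps the common case down to a single round of model calls.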

Building on Qwen Cloud: honest impressions

This is a hackathon post so I want to be honest about the developer experience, both the good and the friction.

What worked well:

The OpenAI-compatible endpoint (compatible-mode/v1) is a genuine developer experience win. My tribunal uses LangChain's ChatOpenAI wrapper unchanged, just pointed at Alibaba's base URL:

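import os

from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI wrapper at DashScope's OpenAI-compatible endpoint.
llm = ChatOpenAI(
    model="qwen-max",
    api_key=os.getenv("QWEN_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    temperature=0.1,
    max_tokens=4096,
)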

That's it. No custom client, no bespoke JSON handling. Everything from LangGraph to structured output parsing just worked.

Qwen-Max's structured output is reliable enough for production. I use LangChain's .with_structured_output(Pydantic model) on almost every agent call, and Qwen-Max produces valid JSON on ~99% of calls. The rare failures are caught by a 4-layer fallback chain (structured output → raw parse → regex scrape → deterministic inference), which means the tribunal never crashes on a malformed model response.
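A fallback chain of that shape might look like the sketch below. It is an illustration under assumptions, not the repo's parser: the `position` field name, the function name, and the choice of `CONCEDE` as the deterministic last resort (it costs nothing) are all hypothetical.

```python
import json
import re

VALID_POSITIONS = ("DEFEND", "PARTIAL", "CONCEDE")


def parse_position(raw, structured=None):
    """Return (position, layer) using the first layer that succeeds."""
    # Layer 1: structured output already parsed by the framework.
    if structured is not None and structured.get("position") in VALID_POSITIONS:
        return structured["position"], "structured"

    # Layer 2: parse the raw response as JSON ourselves.
    try:
        data = json.loads(raw)
        if isinstance(data, dict) and data.get("position") in VALID_POSITIONS:
            return data["position"], "raw_parse"
    except (json.JSONDecodeError, TypeError):
        pass  # not JSON (or not a string); fall through

    # Layer 3: scrape the first recognizable keyword from free text.
    match = re.search(r"\b(DEFEND|PARTIAL|CONCEDE)\b", raw or "")
    if match:
        return match.group(1), "regex"

    # Layer 4: deterministic default (assumed here to be CONCEDE).
    return "CONCEDE", "default"
```

Returning the layer name alongside the position makes it possible to log how often each fallback fires.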

Where I hit friction:

Model catalog navigation was harder than expected. Alibaba's catalog lists several Qwen models, but not all are accessible on every API tier. I tested qwen3.7-max (which is a reasoning model) and found it works but adds ~5x latency because of the reasoning tokens. For a system that needs to respond within GitHub's 10-second webhook timeout, that's disqualifying. So I use qwen-max (the stable current alias) which is fast enough and just as accurate for my use case.

That's a real production engineering decision, not a compromise: for webhook-driven agents, latency budget matters more than running the newest model.

Measurable results

I ran a 40-case benchmark: 22 vulnerable, 18 safe code samples covering categories from SQL injection to unpinned dependencies to secure random tokens. Same model, same prompts for both systems. The only difference: single-agent verdict vs multi-agent tribunal.

System	Correct verdicts	Notes
Single-agent baseline	33 / 40 (82.5%)	Mostly false positives, flagged safe code as vulnerable
Multi-agent tribunal	38 / 40 (95.0%)	One miss on an insecure-random case

The interesting story is where the tribunal improved. On six cases, the single agent wrongly flagged safe code as dangerous, including:

A properly hashed password
A verified JWT with an explicit algorithm
TLS verification enabled with a timeout
Proper Python context managers
Sliced pagination on a list

On all six, the negotiation cleared the false positive. Security's Round 1 caution got tempered by Performance's correctly identifying "no actual issue here." That's the tribunal's biggest practical win: less noise, not just more accuracy.

Results committed to the repo as benchmark_results.json.

What I learned building this

1. Non-determinism in LLM-driven negotiation is a real production problem. Letting the model self-report confidence produces unreproducible verdicts. Constraining the LLM to categorical choices and computing consequences in code fixed this cleanly.

2. Blocking calls silently break async systems. Early builds had requests.get for the GitHub diff, unwrapped sqlite3.execute in a FastAPI endpoint, and asyncio.create_task without holding references. Every one of those is a latent bug that only shows under real load. Wrapping in asyncio.to_thread and holding tasks in a module-level set turned "works on my laptop once" into "survives real webhook traffic."

3. Error handling has to be visible, not just present. I built 4-layer fallbacks for the mediator, MCP tool calls, and structured output parsing. But that engineering is invisible unless you tell people. Documenting them in the README is how a judge sees the depth.

4. Persistence matters more than you think. My credibility trust data got wiped twice by container restarts before I put SQLite on a Docker volume. Non-persistent state is invisible until it disappears.

Limitations

Benchmark is 40 cases, not thousands.
Tuned and tested on Python only.
Single-instance SQLite deployment, not multi-node Postgres.
Credibility metric rewards outcome-matching, not full outcome-grounded calibration.

What I'd actually build next

1. Ground the credibility signal in real merge outcomes.

Right now the credibility metric asks: "did this agent's severity match the final verdict?" That's a proxy. The real signal is what happened to the PR after the tribunal reviewed it. Did the developer merge it as-is, rewrite the flagged code, or ignore the review entirely? That's not a bigger benchmark. It's a fundamentally different feedback loop. The system stops guessing whether it was correct and starts learning from what humans actually did with its verdicts.

2. Multi-language through the MCP layer, not just prompting harder.

The obvious "extend to JavaScript" answer is to prompt the LLM more aggressively. That works okay. What's more interesting: add language-specific detectors (ESLint for JS, go vet for Go, cargo-audit for Rust) as MCP tools. The deterministic layer grows, the LLM guesses less, and the whole thing gets more auditable as it gets more general.

3. A third specialist agent to stress-test the negotiation mechanic.

With only two agents, negotiation is 1-v-1. Adding a Maintainability voice tests whether the confidence-budget mechanic generalizes cleanly to N-way disputes, or whether it needs redesigning at higher agent counts.

4. Team-level credibility, not just global agent-level.

Different teams have different risk tolerances. A payments team should weight security's judgment more heavily. A hackathon team can tolerate more noise. Track credibility per-repo or per-team, and the system tunes itself to each team's actual priorities over time.

Try it

Live demo: shiftleft-society.duckdns.org
Live PR: github.com/jmy744/shiftleft-society/pull/4
Repo (MIT license): github.com/jmy744/shiftleft-society

The whole system is open source. Comments and PRs welcome.

Finishing What I Started: AutoDoc: AI-Powered OpenAPI Docs from C# 🤖

Yoges Mohan — Thu, 28 May 2026 07:55:03 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built:

AutoDoc is a tool that automatically generates enriched OpenAPI 3.0.3 documentation from raw C# ASP.NET Core controller code using a local AI model - no manual annotations, no attribute decorators, no maintenance required.

The idea came from a real frustration I noticed during my software engineering internship. Swagger documentation in .NET projects only documents what developers explicitly declare. Miss an attribute, forget an error response, or skip a summary, and your docs silently drift from reality. Developers spend time writing documentation instead of writing code.

AutoDoc fixes this by reading raw C# controller code directly and inferring everything automatically: routes, HTTP methods, parameters, response codes, error schemas, and operation summaries.

The project was built in 3 phases:

Phase	What it does
Phase 1 - Console App	Paste controller code, get OpenAPI YAML in the terminal
Phase 2 - Docker Web API	Containerised REST backend, call via HTTP POST, get YAML back
Phase 3 - Playground UI	Browser dashboard: paste code, see Raw YAML + Swagger Preview instantly

What started as a terminal script became a fully containerised, browser-accessible API documentation tool, and this challenge is what pushed me to finish it.

Demo:

How it works: paste any C# ASP.NET Core controller, click generate, get full OpenAPI docs instantly.

Step 1 - Paste your controller and click Generate

The AutoDoc Playground UI is a split-panel dashboard. On the left, you paste any ASP.NET Core controller code - no modifications needed. The sample TodoController comes pre-filled so you can test immediately. Hit Generate OpenAPI Docs and AutoDoc sends the raw code to the AI backend running locally via Ollama. No attributes. No decorators. No [ProducesResponseType]. Just raw C# code.

Step 2 - Raw YAML is generated instantly

Input: this is the only code AutoDoc received:

[ApiController]
[Route("api/[controller]")]
public class TodoController : ControllerBase
{
    [HttpGet]
    public IActionResult GetAll() => Ok(new[] { "Task 1", "Task 2" });
}

No [ProducesResponseType]. No [SwaggerOperation]. No XML comments.
No manual annotations of any kind. Just raw C# code.

Output: full OpenAPI 3.0.3 spec generated automatically:

AI inferred everything automatically:

What was inferred	How
`operationId`	Unique names like `GetAllTodos`, `CreateTodo`
`tags`	Grouped by controller name `TodoController`
`summary`	Human-readable descriptions per endpoint
`requestBody`	Schema inferred from method parameters
`responses`	200, 201, 400, 404, 500 from code logic
`components/schemas/ErrorResponse`	Defined once, referenced everywhere

None of this existed in the original controller code.

Step 3 - Swagger Preview: POST endpoint

Switching to the Swagger Preview tab renders the YAML as a live interactive Swagger UI. The POST /api/Todo endpoint shows:

A required request body with example JSON schema.
201 Created response with the created resource structure.
Internal Server Error fallback - inferred automatically, not declared anywhere in the controller.

Step 4 - Swagger Preview: GET endpoint

The GET /api/Todo endpoint shows:

200 OK response returning an array - inferred from Ok(new[] { "Task 1", "Task 2" }).
Internal Server Error fallback - added to every endpoint automatically.
Try it out button - fully interactive, ready to fire real requests.

The AI understood what Ok(new[] { ... }) returns and documented it correctly as an array schema.

Step 5 - Auto-generated Schemas

At the bottom of the Swagger Preview, the Schemas section shows the auto-generated ErrorResponse schema with:

code : integer
message : string

This schema was never defined anywhere in the C# controller. AutoDoc inferred it from the pattern of error responses across all endpoints and generated it once, referenced consistently throughout the entire spec.

The Comeback Story:

Where it started

AutoDoc began as a simple console application - a proof of concept to answer 1 question:

Can an AI model read raw C# controller code and generate valid OpenAPI documentation without any manual annotations?

The answer was yes. But the Phase 1 console app was far from finished:

What existed	What was missing
Console app that called Ollama locally	No HTTP interface - unusable by others
Raw YAML printed to terminal	No output validation
Basic Llama 3.2 prompt	No error handling
Proof the concept worked	No UI, no Docker, no way to share it

It worked on my machine. That was it.

What I changed:

Phase 2 - From console app to Docker REST API

The console app was converted into a proper ASP.NET Core minimal API with 2 endpoints:

POST /generate-openapi - receives controller code as JSON, returns enriched OpenAPI YAML.
GET /health - confirms the service is running.

A Dockerfile was added so the entire service runs in a container with a single docker run command. The Ollama host was made configurable via environment variable so it works both locally (localhost:11434) and inside Docker (host.docker.internal:11434).
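AutoDoc itself is C#, but the configuration pattern is language-neutral, so here is a minimal Python sketch of it. The environment variable name `OLLAMA_HOST` and the function name are assumptions for illustration; the post does not say what the project calls them.

```python
import os

# Default for running directly on the host machine.
DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def ollama_base_url(env=None) -> str:
    """Resolve the Ollama base URL from the environment, with a local default."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST).rstrip("/")
    # Accept bare host:port values such as host.docker.internal:11434.
    if not host.startswith(("http://", "https://")):
        host = "http://" + host
    return host
```

With nothing set, local runs talk to localhost. Inside Docker, passing the variable with a value of `host.docker.internal:11434` switches the target without a code change.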

Phase 3 - Playground UI

A browser-based dashboard was built and served directly from the Docker container via wwwroot. No separate frontend deployment needed - the UI and API ship together.

The playground features:

Split-panel layout - controller input left, output right.
Raw YAML tab - full generated spec in monospace.
Swagger Preview tab - live interactive Swagger UI rendered from the YAML.
Status pill - shows generation state in real time.
Pre-filled sample controller - works out of the box with 0 setup.

The hardest part - taming Llama 3.2 output

Llama 3.2 is a small, free, local model. It does a remarkable job generating mostly correct OpenAPI YAML - but it makes small inconsistent mistakes. Each one broke the Swagger Preview with a parsing error.

A full post-processing pipeline was built to fix every category of output error:

Helper function	Problem it solves
`MergeDuplicatePaths`	Llama repeating the same API path twice
`MergeDuplicateMethods`	Llama repeating GET/POST under the same path
`MergeDuplicateComponents`	Llama emitting `components:` key twice
`RemoveDuplicateSchemaKeys`	Llama defining `ErrorResponse` twice
`RemoveMalformedSecurity`	Llama adding broken `security:` blocks
`FixMalformedOperationIds`	Llama using `{verb}` as a literal placeholder
`NormaliseOutput`	Schema name typos like `ErrorRespond`

Each helper was born from a real parsing error seen in production output. The pipeline runs on every generation before the YAML is returned to the client.

Before and after:

	Phase 1 - Console App	Phase 3 - Full AutoDoc
Interface	Terminal only	Browser dashboard
Deployment	Local .NET required	Docker container
Output	Raw YAML in terminal	Raw YAML + live Swagger Preview
Validation	None	YAML cleaning pipeline + spec validation
Usability	Developer only	Anyone with a browser
Shareable	No	Yes - 1 `docker run` command

What started as a terminal experiment is now a containerised, browser-accessible documentation tool that anyone can run and use.

My Experience with GitHub Copilot:

GitHub Copilot was my implementation partner throughout AutoDoc. I defined the problems and architecture, Copilot wrote the code.

The approach:

I used Copilot not for autocomplete but as a problem-to-code translator. I described what I needed in plain English, Copilot read my existing code, and produced working implementations I could review, test, and keep.

Prompt 1 - Serving the Playground UI from the API

The first key prompt connected the backend to the frontend:

"I am building an ASP.NET Core 10 Minimal API called AutoDoc. I want to serve static files from a wwwroot folder and make sure it serves index.html as the default page. How should I update my Program.cs to allow this?"

Copilot read my existing Program.cs directly, understood the current pipeline, and knew exactly what to add without breaking anything.

It explained what it was doing and why - then applied the change automatically:

app.UseDefaultFiles();
app.UseStaticFiles();

2 lines. Copilot explained that UseDefaultFiles() rewrites / to /index.html and UseStaticFiles() serves everything in wwwroot. It then validated the file for compile errors before confirming. The change was applied in one shot - I clicked Keep.

The result: index.html created in wwwroot, Program.cs updated, Copilot confirming 1 file changed, +4 -0. The Playground UI was now being served directly from the Docker container.

Prompt 2 - Building the Playground UI from scratch

With the static file serving in place, the next prompt built the entire frontend:

"Build a dark-themed single-page HTML dashboard for AutoDoc. Left panel: controller name input and C# code textarea with a Generate button. Right panel: tabbed output with Raw YAML (monospace) and Swagger Preview rendered from the YAML using SwaggerUIBundle. Show a status pill in the header that updates during generation. No frameworks, vanilla JS only."

Copilot generated the complete index.html:

Dark theme with CSS variables.
Split-panel responsive layout.
Tab switching between Raw YAML and Swagger Preview.
fetch call to /generate-openapi with full error handling.
Swagger UI integration using jsyaml for YAML-to-spec parsing.
Real-time status pill updates during generation.

The entire Playground UI came from a single prompt.

Prompt 3 - YAML post-processing pipeline

Every YAML cleaner in AutoDoc came from describing a real Llama output bug to Copilot:

"Llama is generating the components: key twice in the output YAML. Write a C# helper that scans line by line and removes duplicate components: keys, keeping only the first."

Copilot wrote MergeDuplicateComponents. I repeated this pattern for all seven helpers - each one targeting a specific category of Llama output error discovered during testing.
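The line-by-line scan described in that prompt is simple to illustrate. The sketch below is a Python rendering of the idea, not the C# helper Copilot produced: it drops every repeated top-level `components:` line and keeps the first. It assumes the duplicated blocks do not also repeat the same child keys, which the post handles with separate helpers.

```python
def drop_duplicate_components_keys(yaml_text: str) -> str:
    """Remove repeated top-level 'components:' lines, keeping only the first."""
    seen = False
    kept = []
    for line in yaml_text.splitlines():
        # Match only the unindented key, so nested keys are never touched.
        if line.rstrip() == "components:":
            if seen:
                continue  # duplicate key line: skip it, keep its children
            seen = True
        kept.append(line)
    return "\n".join(kept)
```

Because only the key line is dropped, the children of the second block end up under the first `components:` key, which merges the two blocks when they hold different sections.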

What I learned:

Copilot was most valuable not when I said "write this" but when I said "here is the problem I am seeing, fix it."

It read my existing code before responding. It explained its reasoning. It validated changes before applying them. It felt less like autocomplete and more like a senior developer who already knew my codebase.

Limitations I noticed:

Copilot is powerful but not perfect. Here is what it could not do:

Limitation	What happened
Could not predict Llama output bugs	Every YAML cleaner was reactive - I discovered the bug first, then asked Copilot to fix it
Did not flag environment differences	Generated `host.docker.internal` hardcoded - broke local `dotnet run`, I found the error myself
Could not test browser output	Generated the Playground UI confidently but could not see the broken Swagger Preview in a browser
Prompt quality = output quality	Vague prompts gave generic results - writing precise prompts was a skill I had to develop

The pattern was consistent: Copilot excelled at implementation, but discovery and debugging required a human. It could not run the code, open the browser, or experience the errors - I had to do that and bring the findings back to Copilot to fix.

What's Next:

AutoDoc is functional but it is just the beginning.

The project:

The current version uses Llama 3.2 locally via Ollama, which is free and private but limited in output consistency. The natural next step is upgrading to a more capable model for cleaner, more reliable YAML generation.

Here is what I hope to build next:

Feature	Why
Support for larger Ollama models (Llama 3.1, Mistral)	Better output quality, fewer post-processing fixes needed
Multi-controller batch generation	Generate docs for an entire project at once
GitHub Actions CI/CD integration	Auto-generate docs on every push
Diff-based incremental generation	Only regenerate endpoints that changed
Export to JSON	Support both YAML and JSON OpenAPI formats

The biggest hope is this: AI-generated documentation that stays in sync with code automatically - no manual maintenance, no drift, no outdated Swagger specs.

Developers should write code. The AI should write the docs.

GitHub Copilot:

Working with GitHub Copilot on AutoDoc changed how I think about building software. Before this project, I used it for small autocomplete suggestions. After this project, I use it as a genuine collaborator.

What I hope to do differently next time:

Start with Copilot earlier - involve it in architecture decisions before writing the first line, not just implementation.
Better prompt engineering - the more precise and constrained the prompt, the better the output. This is a skill worth developing deliberately.
Trust it more on boilerplate - every time I wrote something myself that I could have asked Copilot to write, I lost time.

The most important thing I learned: Copilot is only as good as the problem you give it. Vague problems get vague code. Clear problems get working code.

AutoDoc taught me to think more clearly about problems because I had to describe them precisely enough for an AI to solve them. That skill makes me a better developer with or without Copilot.

Try it yourself: clone the repo, run Ollama, and paste your first C# controller. I'd love to see what AutoDoc generates for your API. Drop a comment with your results.

GitHub: https://github.com/jmy744/AutoDoc

Beyond Alt-Text: Building a Personalized AI Narrator for Accessibility

Yoges Mohan — Sun, 13 Apr 2025 09:44:41 +0000

What if the way visually impaired users experienced images online wasn't just about basic identification, but about genuine understanding tailored to their world? Imagine, instead of simply hearing 'painting of a woman', an art student could get insights into the brushstrokes and historical context relevant to their studies. Or picture a botanist learning the specific species of a flower in a photo, not just 'flower'.

Currently, standard image descriptions and alt-text, while essential, often provide only these basic labels. This limits deeper engagement and can create unequal access to the rich information embedded in visual content, especially for individuals with specialized knowledge or passions. Why should their experience be less informative just because the default description is generic?

In this project, I explore that 'What if?'. I introduce the novel Personalized AI Narrator, a prototype I built using Google Cloud's powerful Vertex AI Gemini models. My goal is to move beyond generic alt-text by automatically generating image descriptions dynamically tailored to an individual's unique interests, aiming for a future where everyone has the opportunity to connect with visual information in a way that truly resonates with them.

Introducing the 'Personalized AI Narrator'

So, how did I bridge this gap? The solution I explored in this project is the Personalized AI Narrator prototype. Instead of just one standard description for everyone, the aim is to generate a narration that specifically highlights what you, as an individual user, might find most relevant or interesting in an image.

The process involves several steps, orchestrating different AI capabilities on Vertex AI:

Detailed Image Understanding: An advanced multimodal Gemini model (gemini-1.5-pro-002) first analyzes the image to generate a rich, detailed base description.
Text & Interest Representation: This base description is broken down into sentence chunks. Both chunks and the user's pre-defined interests are then converted into numerical embeddings using a Vertex AI embedding model (text-embedding-004), capturing their semantic meaning.
Semantic Relevance Matching: The system calculates cosine similarity between the user's interest embeddings and the description chunk embeddings to find the parts of the description most relevant to the user.
Context Selection: The text of the Top N most relevant chunks is selected.
Tailored Synthesis (Controlled Generation): Finally, this selected relevant_context and the user's interests are fed to a Gemini text model (gemini-2.0-flash). Guided by a specific prompt I engineered, the model synthesizes these relevant excerpts into a concise, new narrative tailored to the user, ensuring it remains grounded in the selected information.
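The matching and selection steps above can be sketched without any cloud calls. In this illustration, toy 2-dimensional vectors stand in for text-embedding-004 output. Scoring each chunk by its best match across interests and returning chunks in their original order are assumptions, not details confirmed by the post.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_chunks(chunks, chunk_vecs, interest_vecs, n=5):
    """Pick the n chunks most similar to any user interest.

    Expects at least one interest vector.
    """
    scored = []
    for idx, (chunk, vec) in enumerate(zip(chunks, chunk_vecs)):
        # Assumed aggregation: a chunk's score is its best match over interests.
        best = max(cosine_similarity(vec, iv) for iv in interest_vecs)
        scored.append((best, idx, chunk))
    top = sorted(scored, key=lambda item: item[0], reverse=True)[:n]
    top.sort(key=lambda item: item[1])  # restore the description's order
    return [chunk for _, _, chunk in top]
```

Joining the returned chunks yields the text that would fill the `relevant_context` slot in the personalization prompt.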

The result? A narration designed to provide deeper insight and a more engaging, informative experience.

Powered by Vertex AI Gemini

At the heart of this prototype lies Google Cloud's Vertex AI platform, enabling seamless integration of cutting-edge AI. I leveraged the Gemini models for powerful image understanding (using gemini-1.5-pro-002) and versatile controlled generation (using gemini-2.0-flash). Vertex AI's Embeddings API (text-embedding-004) provided the numerical representations for semantic matching, and I also used Gemini again for automated Gen AI evaluation. The Vertex AI SDK for Python made orchestrating these capabilities straightforward in my Kaggle Notebook environment.

From Generic to Personal: Seeing the Difference

But does it truly make a difference? Let's see it in action with an example image:

A vibrant butterfly on a red and yellow flower.
[https://images.pexels.com/photos/462118/pexels-photo-462118.jpeg]

The Base Description (from Gemini Vision) identified key details like:

...Plain Tiger butterfly (Danaus chrysippus)... perched on a Blanket flower (Gaillardia)... wings are a mix of orange, black, and white... flower... exhibits a gradient of red and yellow petals... background... bokeh effect... green stems and leaves... bright and even lighting... natural scene of pollination...

(A detailed starting point, rich with potential context).

Personalized for Evelyn (Botanist) (Using Top 5 relevant sentences):
```
Evelyn, the image features Blanket flowers (likely Gaillardia) displaying a fascinating gradient of red and yellow petals, with the red deepening towards the brownish-red center of the blooms. Note how one flower, slightly behind the main subject, shows a more even split of red and yellow hues, while glimpses of green stems and leaves add ecological context to the vibrant blossoms.
```
(Result: Excellent focus on flower specifics, color gradients, and ecological context, all details directly present in the relevant base description chunks.)

AI Evaluation Score: An automated AI check rated this 4/5, noting that it effectively highlighted grounded details such as the flower's color variations and ecological context relevant to botany.

Personalized for Anya (Art Student) (Using Top 5 relevant sentences):
```
Anya, imagine a vibrant close-up photograph reminiscent of Renaissance portraiture, where a butterfly takes center stage against a softly blurred background of flowers. The bright, even lighting enhances the vivid colors, creating a depth of field that subtly emphasizes the delicate interaction between the butterfly and its floral perch, much like the sfumato technique used to soften edges and focus attention.
```
(Result: Attempts to link visual elements like lighting and depth of field to Renaissance concepts.)

AI Evaluation Score: The AI evaluation gave this a lower score of 2/5. The justification highlighted that the connections to Renaissance art were superficial and weakly supported by the base description's mention of 'blurred background' and 'vivid colors.'

Code Snippet - The Prompt's Core: How did I guide the AI for personalization? Through prompt engineering based on the selected context:

personalization_prompt = f"""Act as [Role: Expert Narrator]...
User Profile: Name: {persona_name}, Interests: {interests_string}
Relevant Context:
---
{relevant_context} #<-- The Top 5 selected sentences
---
Task: Synthesize context concisely focusing on interests, grounded ONLY in context provided..."""

(This snippet demonstrates the core instruction guiding the controlled generation based on selected context).

This comparison of the actual results shows both sides of the semantic approach: it tailors well where relevant details exist (Evelyn's 4/5), and it struggles with grounding when source details are sparse (Anya's 2/5), a finding the AI evaluation step confirmed.

Notebook Output:

1. Evelyn:

2. Anya:

The Potential: Towards Richer Digital Accessibility

The need for better digital accessibility is immense. The World Health Organization (WHO) estimates that for at least 1 billion people globally, existing vision impairment could have been prevented or has yet to be addressed. This staggering figure underscores the urgency for innovative solutions. Tools like the Personalized AI Narrator demonstrate how AI can contribute, aiming to create more inclusive and equitable digital experiences. By generating descriptions that resonate with individual interests, I believe this approach can help users move beyond basic labeling towards deeper understanding and engagement with visual content.

As my results showed, a key limitation of the current grounded approach (even with semantic chunk selection) is its dependency on the initial image analysis. To overcome this and provide truly rich context, future work should focus on integrating external knowledge sources (using RAG - Retrieval Augmented Generation). Developing seamless integration with screen readers is another vital next step for real-world usability.

Conclusion: A Step Towards More Personal AI Narration

The Personalized AI Narrator prototype I built showcases a novel application of Vertex AI Gemini for enhancing accessibility. By tailoring image descriptions to individual interests, it offers a glimpse into a future where visual content is not just described, but truly brought to life for everyone, respecting individual perspectives. While challenges remain, particularly around balancing relevance with groundedness when source details are sparse, the potential for using AI to foster greater digital inclusion is immense.

Thanks for reading, and have a great day!

Explore the full implementation I developed and try it yourself in the Kaggle Notebook here: [https://www.kaggle.com/code/yogesmohan/personalized-ai-narrator-for-visual-accessibility]