Building an AI task generator for Vikunja that doesn't trust itself

#ai #selfhosted #opensource #kotlin

I self-host Vikunja to organize my projects, and I kept hitting the same chore: turning a vague idea ("set up CI for a Kotlin service") into a clean list of individual tasks, by hand, every time.

LLMs are obviously good at this kind of decomposition. But I didn't want an AI that silently writes a pile of tasks into my tracker and hopes for the best. I wanted it to propose, and then get out of my way so I can approve.

So I built Trof — an AI task generator for Vikunja with a human-in-the-loop step baked in. It's open source and self-hosted. This post is about how it works and a few decisions I made along the way.

Why "Trof"? It's named after the Trophy active protection system on the Merkava tank, which intercepts incoming threats before they reach the hull. Trof does the same for your task board: nothing lands in Vikunja until you've reviewed it.

The flow

1. You describe a project in plain language.
2. The AI decomposes it into structured tasks, and can ask clarifying questions first if your description is ambiguous.
3. You review and edit everything (task names, descriptions, comments, tags) and drop the ones you don't want.
4. Only on confirm does anything get written to Vikunja.

It also works for editing existing projects, not just creating new ones — you describe the change, and it proposes a diff of tasks to add, edit, or remove.
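
To make the "diff of tasks" idea concrete, here is a minimal Kotlin sketch of how such a diff could be modeled. The names `TaskDraft`, `TaskChange`, and `applyChanges` are made up for illustration and are not taken from the Trof codebase; the function works on an in-memory snapshot rather than on Vikunja itself.

```kotlin
// Hypothetical types for illustration; Trof's real model may look different.
data class TaskDraft(
    val title: String,
    val description: String = "",
    val labels: List<String> = emptyList(),
)

// One entry in a proposed diff against an existing project.
sealed interface TaskChange {
    data class Add(val draft: TaskDraft) : TaskChange
    data class Edit(val taskId: Long, val draft: TaskDraft) : TaskChange
    data class Remove(val taskId: Long) : TaskChange
}

// Applies a reviewed diff to a snapshot of the project's tasks and returns a new map.
// `nextId` stands in for the ID the tracker would assign to a newly created task.
fun applyChanges(
    current: Map<Long, TaskDraft>,
    changes: List<TaskChange>,
    nextId: Long,
): Map<Long, TaskDraft> {
    val result = current.toMutableMap()
    var id = nextId
    for (change in changes) {
        when (change) {
            is TaskChange.Add -> {
                result[id] = change.draft
                id++
            }
            is TaskChange.Edit -> {
                require(change.taskId in result) { "Cannot edit unknown task ${change.taskId}" }
                result[change.taskId] = change.draft
            }
            is TaskChange.Remove -> {
                result.remove(change.taskId)
            }
        }
    }
    return result
}
```

Because the diff is plain data, the review screen can show it, let you drop or reword individual entries, and only then hand the surviving entries to whatever performs the real writes.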

Architecture

I split it into two services instead of one:

Backend gateway (Kotlin/Spring) — talks to Vikunja and the frontend, owns the confirm/apply logic.
AI service (Kotlin/Spring) — runs the agentic workflow, isolated from the rest so I can swap providers and iterate on prompts without touching the gateway.

The agent workflow runs on Koog, JetBrains' agent framework for Kotlin. The frontend is React + Vite. Everything ships as Docker Compose behind nginx with TLS.

It's provider-agnostic: Anthropic, OpenAI, Google, DeepSeek, or fully local via Ollama, in which case nothing leaves your machine. Your Vikunja token and API keys live in

The design decision I care about most

The interesting part wasn't "call an LLM and parse the output." It was the review-before-apply boundary.

It would have been easier to let the agent call the Vikunja API directly (that's basically what an MCP server does). But then the AI mutates your real data on every run, and you're left cleaning up after a confident-but-wrong decomposition.

Instead, the AI never touches Vikunja. It produces a proposal; the gateway holds it; you edit it; and only an explicit confirm turns it into real API calls. The AI proposes, you approve. For a tool that manages my actual projects, that tradeoff was worth the extra plumbing.
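
A minimal sketch of that boundary, assuming an in-memory store and a single write path. `ProposalGateway`, `TrackerClient`, and `ProposedTask` are hypothetical names for illustration; this is neither Trof's actual code nor Vikunja's API, and a real gateway would also need persistence and thread safety.

```kotlin
import java.util.UUID

// Hypothetical names throughout; not Trof's code and not Vikunja's API.
data class ProposedTask(val title: String, val description: String = "")

// The only component that is allowed to write to the tracker.
fun interface TrackerClient {
    fun createTask(projectId: Long, task: ProposedTask)
}

class ProposalGateway(private val tracker: TrackerClient) {
    private data class Pending(val projectId: Long, val tasks: List<ProposedTask>)

    // In-memory only and not thread-safe; enough to show the shape of the flow.
    private val pending = mutableMapOf<String, Pending>()

    // Receives the AI service's output: stores it and writes nothing.
    fun hold(projectId: Long, tasks: List<ProposedTask>): String {
        val id = UUID.randomUUID().toString()
        pending[id] = Pending(projectId, tasks)
        return id
    }

    // The user's edits replace the held tasks; still no writes.
    fun edit(proposalId: String, tasks: List<ProposedTask>) {
        val current = pending[proposalId] ?: error("Unknown proposal $proposalId")
        pending[proposalId] = current.copy(tasks = tasks)
    }

    // The single place where a proposal turns into real calls. Returns the number of tasks written.
    fun confirm(proposalId: String): Int {
        val proposal = pending.remove(proposalId)
            ?: error("Unknown or already applied proposal $proposalId")
        proposal.tasks.forEach { tracker.createTask(proposal.projectId, it) }
        return proposal.tasks.size
    }
}
```

The point of the shape is that the AI-facing entry point (`hold`) has no way to reach `TrackerClient`; only `confirm` does, and it consumes the proposal so the same proposal cannot be applied twice.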

The clarifying-questions step

This one I'm genuinely unsure about. When the description is vague, the agent pauses and asks a follow-up question instead of guessing. It produces noticeably better task breakdowns, but it costs an extra round-trip and some friction.
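
One way to picture the pause-and-ask step is as a reply type with two cases and a small loop around it. This is a sketch under assumptions: `AgentReply`, `decompose`, and the `maxQuestions` cap are invented for illustration, and the `agent` and `ask` parameters stand in for the AI service call and the UI round-trip.

```kotlin
// Hypothetical types; the real agent's output format is not shown in this post.
sealed interface AgentReply {
    data class Question(val text: String) : AgentReply
    data class Proposal(val tasks: List<String>) : AgentReply
}

// `agent` stands in for one call to the AI service: it sees the description plus the answers so far.
// `ask` stands in for the round-trip to the user.
fun decompose(
    description: String,
    agent: (description: String, answers: List<String>) -> AgentReply,
    ask: (question: String) -> String,
    maxQuestions: Int = 3,
): AgentReply.Proposal {
    val answers = mutableListOf<String>()
    // At most maxQuestions questions are asked; the agent gets one more call after the last answer.
    for (round in 0..maxQuestions) {
        when (val reply = agent(description, answers.toList())) {
            is AgentReply.Proposal -> return reply
            is AgentReply.Question -> {
                if (round == maxQuestions) break
                answers += ask(reply.text)
            }
        }
    }
    error("Agent still had questions after $maxQuestions answers")
}
```

Capping the number of questions is one way to bound the friction. The alternative design would replace the final `error` with a forced best-effort proposal that the user then corrects in the review step.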

If you've built conversational/agentic tools: do you prefer asking clarifying questions up front, or generating a best-effort result and letting the user correct it? I keep going back and forth.

Try it / poke at it

It's early and rough in places, but it works end to end. Repo with setup instructions:

👉 https://github.com/Yooshyasha/Trof

Feedback welcome — especially on the review flow and whether the clarifying-questions step is worth the friction.

Top comments (2)

Harjot Singh • May 31

"Doesn't trust itself" is exactly the right posture. The systems that work assume the model is wrong until proven otherwise instead of shipping its first output. Self-distrust in practice is a generate-then-check loop: produce the tasks, then validate against constraints the generator can't fudge (does it parse, fit the schema, conflict with existing items). The trick is making the check independent of the generation, otherwise it just rationalizes its own mistakes. I build the same distrust into Moonshift: every generation step has to clear a verify gate before it counts as done, and a second pass judges against the spec rather than the model grading its own work. For the Vikunja tasks, what's your check, schema/format validation, or a separate model judging quality?

Maksim • May 31

Thanks a lot for the feedback!

My system has two validation layers. First, a Serializable DTO via Koog's structured-output tooling so anything that doesn't parse or fit the schema fails before it counts. Second, a separate agent with an isolated context that validates the generation: it receives the original input, the clarifications gathered during the user's dialogue with the AI, and the generated tasks themselves. It judges the quality, and on FAIL the graph goes back to the generation state with refinement instructions (it keeps the previous generation in context), looping until it produces a result which the user ultimately validates too (though they can just hit confirm without even looking).

How does Moonshift handle the case where the verifier and generator never converge, do you cap the number of iterations, or escalate to a human at some point?
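
For readers following the thread, the two-layer loop described above can be sketched in plain Kotlin. This is an illustration only: it deliberately avoids Koog's graph and structured-output APIs, every name in it is hypothetical, and the `maxAttempts` cap is an added assumption, since the comment does not say whether Trof bounds the loop.

```kotlin
// Hypothetical shapes; Koog's own APIs are deliberately not used here.
data class Verdict(val pass: Boolean, val feedback: String = "")

sealed interface LoopResult<out T> {
    data class Accepted<T>(val value: T, val attempts: Int) : LoopResult<T>
    data class GaveUp<T>(val lastValue: T?, val attempts: Int) : LoopResult<T>
}

fun <T : Any> generateUntilValid(
    generate: (feedback: String?) -> String, // raw model output, given refinement notes if any
    parse: (raw: String) -> T?,              // layer 1: null when the output doesn't fit the schema
    judge: (candidate: T) -> Verdict,        // layer 2: a validator that is independent of the generator
    maxAttempts: Int = 3,                    // assumed cap; not stated in the thread
): LoopResult<T> {
    var feedback: String? = null
    var last: T? = null
    for (attempt in 1..maxAttempts) {
        val candidate = parse(generate(feedback))
        if (candidate == null) {
            feedback = "Output did not match the schema"
            continue
        }
        last = candidate
        val verdict = judge(candidate)
        if (verdict.pass) return LoopResult.Accepted(candidate, attempt)
        feedback = verdict.feedback
    }
    // The loop did not converge: hand the last parseable candidate (if any) to a human.
    return LoopResult.GaveUp(last, maxAttempts)
}
```

Returning `GaveUp` with the last candidate instead of throwing keeps the human review step as the final fallback when generator and validator never agree.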