<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Xuan Li</title>
    <description>The latest articles on DEV Community by Xuan Li (@xuan_li_sf).</description>
    <link>https://dev.to/xuan_li_sf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3921089%2Fb7c6ec49-78b9-4e56-a5b9-29f29ffc2ac4.png</url>
      <title>DEV Community: Xuan Li</title>
      <link>https://dev.to/xuan_li_sf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xuan_li_sf"/>
    <language>en</language>
    <item>
      <title>Why We Built Komos</title>
      <dc:creator>Xuan Li</dc:creator>
      <pubDate>Sat, 09 May 2026 05:15:07 +0000</pubDate>
      <link>https://dev.to/xuan_li_sf/why-we-built-komos-4cdf</link>
      <guid>https://dev.to/xuan_li_sf/why-we-built-komos-4cdf</guid>
      <description>&lt;p&gt;We believe humans should spend their time on work that matters. Thinking, discovering, creating. Not clicking through the same screens over and over.&lt;/p&gt;

&lt;p&gt;Every company has repetitive work. Data entry across old systems. Documents moving between five different tools. Checks that need someone to do the same steps hundreds of times a day. This work is important, but it does not need a human brain. It needs reliability and speed. AI can do that.&lt;/p&gt;

&lt;h2&gt;AI as a co-worker&lt;/h2&gt;

&lt;p&gt;Think about what happens when a new person joins your team. You sit with them, share your screen, and walk them through how things work. After a few times, they get it. They can do it on their own.&lt;/p&gt;

&lt;p&gt;We wanted AI to work the same way. Show it a task once. Walk it through the steps. It learns the pattern and can repeat it on its own. If something changes, it adapts, just like a good team member would.&lt;/p&gt;

&lt;p&gt;That is what Komos does. Record a workflow or upload a video, and our AI learns how to do it. It builds an automation you can run again and again.&lt;/p&gt;

&lt;h2&gt;Humans discover. AI executes.&lt;/h2&gt;

&lt;p&gt;Our goal is simple. No human should do repetitive work. Humans should discover, innovate, and decide what needs to happen. Once the action is clear, AI does the work.&lt;/p&gt;

&lt;p&gt;This is not about replacing people. It is about freeing them to do the things only humans can do.&lt;/p&gt;

&lt;h2&gt;Real problems in real companies&lt;/h2&gt;

&lt;p&gt;Today's enterprise is full of gaps. Legacy systems that do not talk to each other. Knowledge stuck in people's heads or scattered across tools. Processes that depend on someone remembering the right steps in the right order.&lt;/p&gt;

&lt;p&gt;These are real problems. We are going to solve them one by one.&lt;/p&gt;

&lt;p&gt;Komos handles browser automation, document processing, API connections, and data work. Teams use it for insurance checks, vendor onboarding, legal reviews, and more. And we are just getting started.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>startup</category>
      <category>automation</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Moss: an AI engineer that automates browser workflows</title>
      <dc:creator>Xuan Li</dc:creator>
      <pubDate>Sat, 09 May 2026 05:05:54 +0000</pubDate>
      <link>https://dev.to/xuan_li_sf/moss-an-ai-engineer-that-automates-browser-workflows-g3l</link>
      <guid>https://dev.to/xuan_li_sf/moss-an-ai-engineer-that-automates-browser-workflows-g3l</guid>
      <description>&lt;p&gt;We spent a lot of time thinking about how people build automations. The task builder is powerful, but there is still a gap between having an idea and seeing it run. You need to know which nodes to use, how to connect them, and how to handle edge cases.&lt;/p&gt;

&lt;p&gt;Moss closes that gap.&lt;/p&gt;

&lt;h2&gt;What is Moss&lt;/h2&gt;

&lt;p&gt;Moss is an AI engineer inside Komos. Tell it what you want to automate in plain words, and Moss builds the full task for you. It picks the right nodes, writes the prompts, connects the variables, and sets up the outputs.&lt;/p&gt;

&lt;p&gt;You can also show Moss what you want. Upload a video of yourself doing the workflow, paste a process document, or share a screenshot. Moss watches and builds the automation from it.&lt;/p&gt;

&lt;h2&gt;Moss can do anything you can do&lt;/h2&gt;

&lt;p&gt;Moss operates with your permissions. It calls APIs on your behalf, creates tasks, manages integrations, and takes real actions in your workspace. If you can do it in Komos, Moss can do it for you.&lt;/p&gt;

&lt;p&gt;This is not a chat window that gives you suggestions and asks you to go click buttons. Moss actually does the work.&lt;/p&gt;

&lt;h2&gt;It knows the product inside out&lt;/h2&gt;

&lt;p&gt;Moss has deep knowledge of how Komos works. It knows the product documentation, understands the platform architecture, and even knows parts of our codebase. When you ask a question, Moss does not guess. It gives you the right answer because it truly understands the system.&lt;/p&gt;

&lt;p&gt;This makes Moss a great support tool too. Ask it how a feature works, why a run failed, or how to set something up. Moss gives you specific, accurate answers based on how the product actually works.&lt;/p&gt;

&lt;h2&gt;It sees what you see&lt;/h2&gt;

&lt;p&gt;Moss can see your current screen. Ask a question while looking at a failed run, and Moss reads the error, checks the task, and tells you what went wrong. It can take screenshots to help debug visual problems.&lt;/p&gt;

&lt;p&gt;Sessions persist between conversations. Come back the next day and Moss remembers what you were working on and what is left to do.&lt;/p&gt;

&lt;h2&gt;What is next&lt;/h2&gt;

&lt;p&gt;We are investing a lot in Moss. Background jobs let it do complex, multi-step work while you focus on other things. And we are making it faster and more reliable in long conversations.&lt;/p&gt;

&lt;p&gt;The goal is simple: if you can explain what you need, Moss can build it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
      <category>automation</category>
    </item>
    <item>
      <title>Why most AI agents fail in production</title>
      <dc:creator>Xuan Li</dc:creator>
      <pubDate>Sat, 09 May 2026 05:04:05 +0000</pubDate>
      <link>https://dev.to/xuan_li_sf/why-most-ai-agents-fail-in-production-4jdo</link>
      <guid>https://dev.to/xuan_li_sf/why-most-ai-agents-fail-in-production-4jdo</guid>
      <description>&lt;p&gt;Most AI agents work in demos. They break in production.&lt;/p&gt;

&lt;p&gt;MIT's 2025 &lt;em&gt;State of AI in Business&lt;/em&gt; report (&lt;a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/" rel="noopener noreferrer"&gt;Fortune summary&lt;/a&gt;) found that 95 percent of enterprise AI pilots show no measurable business return. The same study found something more telling. Pilots built by buying tools from a vendor succeed about 67 percent of the time. Pilots built internally succeed at about a third of that rate, roughly one in five.&lt;/p&gt;

&lt;p&gt;That gap matches what we see every week. A team builds something that works on a sample of five cases. The first week of real volume, it falls apart.&lt;/p&gt;

&lt;p&gt;The failures are not random. After watching this happen across recruiting, healthcare back-office, legal review, financial reconciliation, supply chain, and more, two patterns explain almost all of them.&lt;/p&gt;

&lt;h2&gt;Reliability and governance&lt;/h2&gt;

&lt;p&gt;Most prototypes are built to demo. Demos ask one question. Can the agent finish the task once?&lt;/p&gt;

&lt;p&gt;Production asks different questions. Can it finish the task one thousand times in a row, on slightly different inputs, while no one is watching? When something goes wrong, can someone reconstruct what the agent saw, decided, and did? Can a human approve consequential steps before they happen, not after? When the auditor asks what we ran last quarter, is there a paper trail?&lt;/p&gt;

&lt;p&gt;Most prototypes do not answer any of these. They were not built to.&lt;/p&gt;

&lt;p&gt;AI agents make non-deterministic decisions. Without a step-by-step audit log, you cannot debug a bad run. Without explicit approval gates on actions that cannot be undone, the first time the agent runs against real data is the first time anyone notices the difference between reading a file and wiring money. Without access controls, secrets management, and retention policies, you do not meet the bar that any real enterprise asks for.&lt;/p&gt;
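
&lt;p&gt;To make that concrete, here is a minimal Python sketch of what a step-by-step audit log can look like. The names (&lt;code&gt;AuditLog&lt;/code&gt;, &lt;code&gt;record_step&lt;/code&gt;) are illustrative, not from any particular framework; the point is that every step records what the agent observed, decided, and did.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch: a per-step audit log for an agent run.
import json
import time
import uuid

class AuditLog:
    def __init__(self, run_id=None):
        self.run_id = run_id or str(uuid.uuid4())
        self.steps = []

    def record_step(self, observed, decided, action, result):
        # What the agent saw, decided, and did, with a timestamp,
        # so a bad run can be reconstructed after the fact.
        self.steps.append({
            "ts": time.time(),
            "observed": observed,
            "decided": decided,
            "action": action,
            "result": result,
        })

    def dump(self):
        # One JSON document per run doubles as the paper trail
        # when the auditor asks what ran last quarter.
        return json.dumps({"run_id": self.run_id, "steps": self.steps}, indent=2)
&lt;/code&gt;&lt;/pre&gt;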

&lt;p&gt;When something has to give, reliability has to win. No one consciously chooses fast and broken. But every prototype has implicit speed defaults. Short timeouts. No retries. No verification step. Those defaults survive into production unless someone redesigns them. A run that takes thirty seconds and is right is always worth more than a run that takes five seconds and is wrong ten percent of the time.&lt;/p&gt;
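
&lt;p&gt;Redesigning those defaults is not much code. Here is a rough sketch of reliability-first defaults in Python; &lt;code&gt;run_step&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt; are hypothetical callables standing in for whatever your agent actually does.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch: generous timeout, retries with backoff,
# and a verification step before accepting a result.
import time

def reliable_step(run_step, verify, retries=3, timeout=30.0, backoff=2.0):
    delay = 1.0
    for _ in range(retries):
        result = run_step(timeout=timeout)
        if verify(result):
            return result  # slow and right beats fast and wrong
        time.sleep(delay)  # back off before retrying
        delay = delay * backoff
    raise RuntimeError("step failed verification after retries")
&lt;/code&gt;&lt;/pre&gt;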

&lt;p&gt;The teams that succeed treat reliability and governance as the architecture, not as features they bolt on at the end. They are not building "an agent". They are building a small operations system that happens to use a model. The model is the easy part.&lt;/p&gt;

&lt;h2&gt;Standard, not bespoke&lt;/h2&gt;

&lt;p&gt;It has never been easier to ship a prototype. A founder, an analyst, a recruiter, anyone can vibe code a workflow with Claude or ChatGPT in a weekend. That is a real productivity win at the prototype stage.&lt;/p&gt;

&lt;p&gt;The moment that workflow is supposed to be how the company does something, the bespoke version becomes the problem.&lt;/p&gt;

&lt;p&gt;Five people on a team need to do the same thing. They each end up with a slightly different flow. One person's version skips a step. Another adds an unauthorized vendor. A third silently uses an old API. They all work in isolation. None of them is the process.&lt;/p&gt;

&lt;p&gt;There is no single source of truth. The flow lives in someone's chat history, or a doc, or a script in a GitHub gist. When the manager asks how we do something, there are five answers, each one slightly out of date.&lt;/p&gt;

&lt;p&gt;When the upstream API or the policy changes, the change has to be made in five places. Usually it gets made in two and the rest drift. A month later, half the team is operating on the old version and no one knows.&lt;/p&gt;

&lt;p&gt;You cannot deploy one hundred slightly different copies of a workflow and call that production. Production means one canonical version. Versioned, shared, observable, updated in one place. Vibe-coded flows in one hundred chat sessions are the opposite of that.&lt;/p&gt;

&lt;h2&gt;What to do about it&lt;/h2&gt;

&lt;p&gt;Treat reliability as a v1 requirement, not a v2 cleanup. Ship audit logs from day one. Build the approval gate before you build the action. Define guardrails for destructive operations before the agent has the credentials to execute them. The cost of doing this up front is hours. The cost of retrofitting after a bad run is months and trust.&lt;/p&gt;
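
&lt;p&gt;As a sketch of the idea, here is one shape an approval gate might take in Python. The action names and the &lt;code&gt;request_approval&lt;/code&gt; hook are placeholders for your own list of consequential operations and your own review channel.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch: actions that cannot be undone never run
# without a human decision.
DESTRUCTIVE = {"wire_transfer", "delete_record", "send_external_email"}

def execute(action, payload, request_approval, do_action):
    if action in DESTRUCTIVE:
        # The gate exists before the action does: no approval, no execution.
        ticket = request_approval(action, payload)
        if not ticket.approved:
            return {"status": "blocked", "reason": ticket.reason}
    return do_action(action, payload)
&lt;/code&gt;&lt;/pre&gt;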

&lt;p&gt;Pick speed or reliability for the right work, but pick consciously. Some workflows benefit from a fast, best-effort agent. Most enterprise workflows do not. If the work touches financial data, customer records, or anything a regulator cares about, default to reliability and accept the latency cost.&lt;/p&gt;

&lt;p&gt;Centralize the workflow definition. Treat each automation like code. One canonical version, in a system everyone can see, with version history. Vibe coding is great for prototyping the first version. It is a disaster as the operating version. The transition from one to the other is the work.&lt;/p&gt;
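
&lt;p&gt;One possible shape for that canonical version, sketched in Python with illustrative field names: a single versioned record in a shared registry, so an update lands in one place and everyone runs the same steps.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch: one canonical, versioned workflow definition.
from dataclasses import dataclass

@dataclass(frozen=True)
class Workflow:
    name: str
    version: int
    steps: tuple  # ordered, immutable step definitions
    owner: str    # the person accountable for changes

def publish(registry, wf):
    # Updating the process means publishing a new version in one
    # place, not patching five private copies.
    registry[(wf.name, wf.version)] = wf
    return wf
&lt;/code&gt;&lt;/pre&gt;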

&lt;p&gt;Make governance someone's job, not no one's. One person owns access controls, audit, and exception handling for the agents you deploy. This is unglamorous and necessary.&lt;/p&gt;

&lt;p&gt;Be honest about the production bar. "It worked in the demo" is a fact about the demo, not about the agent. Run the work on real volume, with real data, watch what happens to the failure modes, and fix them. That phase often takes longer than building the agent in the first place. That is normal.&lt;/p&gt;
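
&lt;p&gt;A rough sketch of what that phase can look like in Python. &lt;code&gt;run_once&lt;/code&gt; is a stand-in for your automation; the output is a tally of failure modes, so you fix the biggest buckets first and rerun.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch: tally how the automation fails on real
# volume, not whether it passed once on a sample of five.
# Assumes run_once returns a dict with an "ok" flag and a "reason".
from collections import Counter

def burn_in(run_once, cases):
    failures = Counter()
    for case in cases:
        try:
            result = run_once(case)
            if not result.get("ok"):
                failures[result.get("reason", "unknown")] += 1
        except Exception as exc:
            failures[type(exc).__name__] += 1
    return failures
&lt;/code&gt;&lt;/pre&gt;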

&lt;h2&gt;What this really means&lt;/h2&gt;

&lt;p&gt;The bottleneck is not the model. The current models are more than capable. The bottleneck is the audit trail, the guardrails, the source of truth, the governance, the operational discipline.&lt;/p&gt;

&lt;p&gt;That is good news. It means it is all addressable. The teams investing in those layers compound much further than the teams chasing the next benchmark.&lt;/p&gt;

&lt;p&gt;If you are building an AI workflow that runs unattended on real work, take reliability seriously before you take speed seriously. Standardize the process before you scale it.&lt;/p&gt;

&lt;p&gt;The teams that do are the ones we see succeed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
