Shixin Zhang

Posted on May 21

Next-Generation Software: From by-AI to within-AI

#agents #ai #software

Why "using AI to build more apps, faster" may just be putting an engine on a horse carriage

Over the past year, most discussions about AI and software have stayed within a very intuitive picture: a human describes a requirement, an AI agent writes the code, and after the code is written the software is still traditional software, only produced faster. Vibe coding, in essence, lowers the marginal cost of software production. What used to require engineers to implement line by line becomes an iterative process of generation, debugging, and refactoring driven by natural language.

This view is correct, but it is only half the picture.

The bigger change is not whether AI can generate software. The bigger change is whether the word "software" itself is about to change. Using AI to generate yet another standalone app often feels like putting an engine on a horse carriage: the power system has changed, but the form factor still belongs to the previous era. You still have a frontend, backend, account system, deployment, database, permissions, logs, subscriptions, settings pages. These things can now be generated faster, but they are still the same old shape.

But if the engine is already powerful enough, perhaps we should stop optimizing the carriage. The real next step is this: software is no longer merely written by AI agents. It starts to exist within AI agents.

Next-generation software does not necessarily have to be a complete app, a website, a SaaS product, a desktop client, or even a CLI with a fixed entry point. It may be a collection of prompts, skills, scripts, schemas, local files, cache conventions, tool permissions, and agent-facing instructions. The real runtime is not a specialized intelligent application. It is a general agent such as Codex or Claude Code. In other words, the harness is becoming the software.

Everyday ArXiv is a concrete example of this insight.

Research Software Without A Traditional Form

Everyday ArXiv is a daily intelligent arXiv processing assistant. The traditional way to build it would be straightforward: write a backend service, connect to the arXiv API, implement a recommendation algorithm, add a user system, build a web interface or email push system, and embed LLM calls at specific nodes such as summarization, scoring, recommendation explanations, and email drafts. In the end, it would become a specialized agent or SaaS product for researchers.

There is nothing wrong with this path. Many products will continue to be built this way. The problem is that, in this specific setting, the most valuable part is not the UI, not the database, and not a fixed pipeline. The most valuable part is the judgment made during each reading session.

Why is this paper worth reading today? Which of the user's previous papers is it actually connected to? Is the overlap merely keyword-level, or is there a real methodological connection? Is a proposed idea once again falling into the mediocre pattern of "add noise, change the model, run larger numerics"? If a new paper does not cite the user's work, is it an obvious omission, a weak connection, or only conceptually adjacent?

These questions are hard to compress into fixed software features. They look more like the judgment process of a research assistant. So the architecture of this project is inverted. Python code handles only deterministic tasks: fetching arXiv metadata, parsing Google Scholar, writing stable JSON caches, loading configuration, and maintaining local file boundaries. Judgment-heavy tasks are not hardcoded inside the application. They are delegated to a general agent. The repository provides the agent with a workspace, skills, profiles, prompts, scripts, and privacy rules.
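To make the deterministic side concrete, here is a minimal sketch of a stable JSON cache writer. It is not the project's actual code: the function names, the date-based file name, and the `id` field are assumptions for illustration. Only the `data/raw/arxiv` directory comes from the repository layout described later in this post.

```python
import json
import os
import tempfile
from datetime import date
from pathlib import Path


def cache_path(root: Path, day: date) -> Path:
    """Return the single fixed location for one day's metadata cache.

    The date-based file name is an assumed convention, not the project's.
    """
    return root / "data" / "raw" / "arxiv" / f"{day.isoformat()}.json"


def write_cache(root: Path, day: date, papers: list[dict]) -> Path:
    """Write papers as stable JSON: sorted records, sorted keys, atomic replace."""
    target = cache_path(root, day)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Sorting records and keys makes the file byte-identical for the same
    # input, whatever order the papers arrived in.
    ordered = sorted(papers, key=lambda paper: paper["id"])
    payload = json.dumps(ordered, indent=2, sort_keys=True, ensure_ascii=False)
    # Write to a temporary file first so a crash never leaves a partial cache.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(payload + "\n")
    os.replace(tmp_name, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_cache(Path(tmp), date(2025, 1, 2), [{"id": "b"}, {"id": "a"}])
        print(path.name, path.read_text(encoding="utf-8"))
```

The point of the design is that the agent never decides where the cache lives or what it looks like. It calls the tool, and the tool gives the same answer every time.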

In other words, this is not "software with LLM features." It is software that an LLM agent can directly run.

Why This Is Not "Everyone Writes A Custom App"

A common prediction is that AI lowers the cost of software production, so everyone will write many small custom apps for themselves. I think this is only half right.

What may actually happen is not that everyone has a pile of custom apps, but that many custom apps never exist in app form at all.

Once general agents are powerful enough, many "software" systems do not need to be compiled into standalone products. They can remain open-form: a few Skills, a few scripts, a directory convention, a profile, and some examples. Their functionality unfolds at runtime through the agent. They have no fixed buttons, but clear protocols; no complete backend, but stable tools; no page, but Markdown or HTML reports; no embedded intelligence module, but access to the general intelligence of the agent.

This goes beyond vibe coding. Vibe coding still assumes the goal is to generate a software product. Agent-native software tries to avoid prematurely generating software in the old form. It asks: does this thing really need to be productized, or does it only need to be agentized?

The Structural Tax Of Vibe Coding

The appeal of vibe coding is that, for the first time, building software feels cheap. You can ask an agent to generate a full-stack repo with React pages, API routes, database schemas, Dockerfiles, auth, deployment notes, and a README.

But this reveals another problem: the faster AI generates software, the more visible the structural tax of the old software form becomes.

Standalone apps carry many default taxes. There is a UI tax, because every capability must be turned into buttons, forms, and pages. There is a deployment tax, because every capability needs its own runtime environment. There is an integration tax, because every new app has to reconnect to data sources, permissions, and user state. There is a maintenance tax, because dependencies drift, frameworks upgrade, and deployments break. There is also a product-shape tax, because many open-ended judgment processes must be compressed into fixed features, losing flexibility and customization.

When the task is essentially "run a high-judgment workflow in a specific context," these taxes become heavy.

If Everyday ArXiv were built as traditional software, it would be forced to invent many things that are not its core value: recommendation pages, profile editors, PDF readers, email draft editors, background jobs, account systems, synchronization state. Of course these can be built. But they are not the core of "read arXiv and make research judgments." The core is to put the user's profile, today's papers, paper full texts, historical preferences, and research taste into the same reasoning loop.

If a general agent can already read files, run commands, call tools, edit Markdown, maintain local state, and follow project rules, many standalone app shells start to look unnecessary.

This is why "AI helps me write an app faster" may only be a transitional form. It optimizes the speed of software production, not the shape of software itself.

The Compilation Target Of Software Has Changed

Traditional software compiles to machines: CPUs, browsers, mobile devices, cloud services. Even SaaS ultimately compiles into deterministic behavior on some fixed runtime.

Agent-native software does not compile only to machines. It compiles to agents.

This sounds strange, but it is the key point. A Skill is not merely documentation. It is closer to a runtime definition: when an agent encounters a certain kind of task, which files should it read, which scripts should it call, which boundaries should it respect, how should it handle failure, when should it stop, what output format should it use, which judgments must not be hardcoded, and which data must not be committed to Git.

In Everyday ArXiv, the Python package under src/ is the deterministic kernel. .agents/skills/arxiv-daily/SKILL.md is the workflow definition. user_profile/ is user-space memory. agents.md is the runtime specification. data/raw/arxiv and data/reports are the persistence layer. Codex or Claude Code is the execution environment and runtime.
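The mapping above can be written down as a small check that an agent or a human could run before starting a session. This is a sketch, not part of the project: the paths and roles are taken from the paragraph above, while the `LAYOUT` table and the `missing_parts` helper are hypothetical names.

```python
from pathlib import Path

# Paths and roles as described in the text; the check itself is illustrative.
LAYOUT = {
    "src": "deterministic kernel",
    ".agents/skills/arxiv-daily/SKILL.md": "workflow definition",
    "user_profile": "user-space memory",
    "agents.md": "runtime specification",
    "data/raw/arxiv": "persistence layer: raw cache",
    "data/reports": "persistence layer: reports",
}


def missing_parts(root: Path) -> list[str]:
    """List every expected workspace part that does not exist under root."""
    return [
        f"{relative} ({role})"
        for relative, role in LAYOUT.items()
        if not (root / relative).exists()
    ]


if __name__ == "__main__":
    for entry in missing_parts(Path(".")):
        print("missing:", entry)
```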

From the perspective of traditional software, src/ is the software, and everything else is documentation or data.

In agent-native software, this boundary is inverted. src/ is only the tool layer. The real software behavior emerges from the tool layer, Skill instructions, user profiles, cache formats, report conventions, privacy boundaries, and the general reasoning ability of the agent.

This is the architectural inversion: infrastructure moves downward from each standalone application into the agent platform. The application itself becomes a lightweight, injectable, modifiable, and portable capability layer.

The shift can be summarized as follows:

Dimension	Old Paradigm: Software by AI	New Paradigm: Software within AI
Architecture	AI generates a custom full-stack repo: frontend framework, backend API, database, and hosting layer included.	Lightweight skills, structured manifests, execution scripts, and local directory conventions are injected into the agent as capabilities.
Infrastructure	Each app has its own runtime, database, DevOps pipeline, permissions, and deployment environment.	The app reuses the native environment of the host agent platform: filesystem, command line, browser, sandbox, tool calling, and context window.
Cost Model	A standalone AI app must maintain a SaaS shell and pay the marginal API cost of each model call. Heavy usage quickly becomes expensive.	An agent-native workflow lives inside general-agent subscriptions such as Claude Code or Codex, letting users share the subscription economics of model providers.
Flexibility	Features are hardcoded into UI, backend, and schemas. New capabilities require code changes, redeployment, and redesigned entry points.	The agent dynamically interprets Skills, reads and writes files, and calls scripts based on runtime intent, adapting to edge cases without rebuilding the product shape.

A Skill Is Not A Plugin. It Is A New Software Unit.

We are used to thinking of plugins as accessories to a host program. Browser extensions depend on browsers. Editor extensions depend on editors.

But Skills inside agents are closer to a new unit of software.

They contain at least four layers.

The first layer is deterministic tools. These are ordinary scripts, CLIs, parsers, fetchers, and formatters. They handle the parts that should not be left to an LLM's improvisation.

The second layer is semantic policy. These are the instructions: what counts as a good recommendation, what counts as a mediocre idea, when to run a citation check, when not to pad the list to ten papers, and when to write only to local private files.

The third layer is private state. User profiles, historical papers, negative preferences, idea logs, and local config are not merely "database records" in the traditional sense. They are personal context that the agent can read, interpret, update, and audit at runtime.

The fourth layer is the execution substrate. This is what the agent platform provides: filesystem access, command execution, browsing, code understanding, long context, multi-tool coordination, and natural language interaction.

A traditional app often packages all four layers into its own code and services. Agent-native software separates them: stable parts become scripts; judgment-heavy parts become Skills; personal parts remain in local files; execution is reused from a general agent.

So it behaves more like a dynamically loaded driver than a complete machine. It does not need to spin up a company-sized software shell every time. It only needs to inject capability into an existing agent runtime.

The Cost Advantage Is Not Just Form. It Is A Pricing-Layer Mismatch.

Another key point is the cost difference between API calls and subscriptions. If you build a specialized agent yourself, every intelligent step is a metered model API call. If you use a subscription-based general agent such as Codex or Claude Code, the cost of many of these steps is absorbed by the platform subscription. This difference is not merely "a little cheaper." It can be an order-of-magnitude architectural difference.

Take Karpathy's LLM Wiki / agentic wiki idea as an example. At its core, it is a lightweight set of directory conventions, Markdown files, schemas, and agent instructions. Of course you can productize it: turn it into a standalone knowledge-base and note-taking app, add login, upload, search, sync, team workspaces, RAG pipelines, a polished UI, and then connect it to frontier-model APIs. At that point it becomes a standard AI SaaS product: every ingest, query, rewrite, and cross-reference burns your API bill.

But the same idea does not have to be productized. You can put raw sources, wiki pages, and instructions in a local repo and let a general agent such as Claude Code maintain it directly. For the user, this is not "opening a new SaaS product." It is "running a workflow inside an agent subscription I already have." The workflow is lightweight enough that its main cost moves from API metering into the agent subscription.

The gap can be huge. For heavy users, the equivalent API cost can easily be more than ten times higher than the subscription cost. Put differently: for the same frontier model capability, a standalone AI app must pay by the API meter, while an agent-native workflow living inside Claude Code or Codex may have its entire user-facing cost absorbed by the platform subscription, because the user already needs an AI subscription plan anyway.
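The arithmetic behind this comparison is simple enough to write down. The sketch below only shows how the ratio is computed. Every number in the usage example is a made-up placeholder, not a real price or a measured workload, and the model ignores any usage limits a subscription may have.

```python
def monthly_api_cost(
    sessions: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Metered cost of running `sessions` identical sessions in one month."""
    per_session = (
        input_tokens / 1_000_000 * usd_per_m_input
        + output_tokens / 1_000_000 * usd_per_m_output
    )
    return sessions * per_session


def cost_ratio(api_usd: float, subscription_usd: float) -> float:
    """How many times more the metered path costs than the flat fee."""
    return api_usd / subscription_usd


if __name__ == "__main__":
    # Placeholder inputs only: 100 sessions, 1M input and 100k output tokens
    # each, at hypothetical prices of $2 and $10 per million tokens.
    api = monthly_api_cost(100, 1_000_000, 100_000, 2.0, 10.0)
    print(f"metered: ${api:.2f}, ratio vs a hypothetical $20 plan: {cost_ratio(api, 20.0):.1f}x")
```

With these placeholder inputs the metered path comes to $300 against a $20 flat fee, a ratio of 15. The useful part is the shape of the formula: metered cost grows linearly with sessions and context size, while the subscription term stays constant.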

If frontier model providers can sustain this pricing structure, the consequences will be severe. General agent tools such as Codex will devour a large fraction of so-called intelligent software that merely connects large-model APIs to old software shells. Those products carry two layers of cost: the product-shape cost of traditional SaaS, and the usage-based API cost of frontier models. Agent-native workflows reuse a runtime that the agent platform has already subsidized, deployed, and sold to the user.

So the cost advantage comes from two directions.

First, the form is lighter. Standalone software is expensive not only because it runs servers, but because it must maintain a fixed product shape. That shape forces you to predefine user paths, feature boundaries, error handling, state synchronization, permission models, UI copy, and upgrade mechanisms. For high-frequency standardized tasks, this is worth it. For personalized, low-frequency, high-judgment tasks, it becomes a burden.

Second, the billing layer is lower. API-wrapper software turns every intelligent action into its own marginal cost. Agent-native software tries to place intelligent action inside the general-agent runtime that the user already owns. The former is like putting an engine inside every small tool. The latter is like loading different tools onto a unified power system.

The filesystem becomes state. Markdown becomes interface. JSONL becomes database. Skills become product logic. Python CLIs become reproducible tools. The agent becomes the interaction layer, reasoning layer, and glue code. The user does not need a complete app. The user needs a workspace that an agent can understand and operate.
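"JSONL becomes database" can be taken quite literally. A minimal sketch, assuming an append-only log with one JSON object per line; the helper names and the example record fields are hypothetical:

```python
import json
from pathlib import Path


def append_record(path: Path, record: dict) -> None:
    """Append one record as a single JSON line; earlier lines are never rewritten."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True, ensure_ascii=False) + "\n")


def read_records(path: Path) -> list[dict]:
    """Load every record in insertion order; a missing file is an empty table."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    log = Path("example_log.jsonl")  # hypothetical file name
    append_record(log, {"paper": "example-id", "verdict": "read"})
    print(read_records(log))
```

Because each line is self-contained, both the agent and the user can inspect the file with ordinary text tools, and line-based diffs stay readable.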

This does not mean engineering quality becomes unimportant. The opposite is true: engineering boundaries become more important. Deterministic tasks must live in code. Privacy boundaries must be protected by .gitignore and file naming rules. Cache formats must be stable. Profile updates must be traceable. Reports must be reviewable. We simply no longer have to assume that all software value must be packaged into a fixed UI.

LLM OS Is Not A Metaphor

This also explains why software within AI agents resonates with the idea of an LLM OS.

If we think of the LLM as an operating system, the model itself is not the whole system. A real OS includes filesystems, permissions, processes, tool calls, environment variables, package management, history, working directories, user preferences, executable scripts, and application protocols. Agent platforms are reorganizing these pieces.

From this perspective, a Skill is like an application. A prompt is like configuration and entry point. A script is like the executable behind a system call. user_profile is like user-space data. agents.md is like a software manual, permission model, and runtime specification. Cache directories are persistence. The agent is a mixture of shell, window manager, workflow engine, and interpreter.

Traditional software runs on top of operating systems. Next-generation lightweight software runs inside the LLM OS.

This does not mean all software disappears. High-frequency, multi-user, strongly consistent, permission-heavy, transaction-heavy systems will still need traditional software forms. Banking systems, collaborative editors, production databases, payment platforms, and medical systems cannot rely solely on an agent runtime.

But a large amount of personalized, low-frequency, high-judgment software will be rewritten.

Research reading assistants, personal knowledge systems, paper response tools, code review workflows, experiment records, document drafting, data cleaning, chart generation, long-term research projects, and idea management have historically been hard to turn into good software. Not because the need does not exist, but because every person's need is too specific, the market is too small, the shape is too fragmented, and fixed products quickly stop fitting.

Agents change those economics.

This Example Generalizes Far Beyond arXiv

Everyday ArXiv is just one example. The structure behind it generalizes to many scenarios that used to require being "turned into software."

The first category is knowledge workflows. Today it is arXiv. Tomorrow it could be a paper library, technical blog library, investment research library, legal document library, or internal decision memo system. The traditional approach is to build a standalone application: dashboard, search box, favorites, summaries, recommendations, and RAG. The agent-native approach is looser: raw materials are files, indexes are scripts, workflows are Skills, user preferences are profiles, reports are Markdown. It is less like a product and more like a work environment that an agent can unfold at runtime.

The second category is scientific computing and experiment management. A research project may need to manage models, parameters, run scripts, remote machines, result directories, logs, figures, and conclusions. Of course you can write an independent CLI with commands such as submit, status, plot, and report. This is still valuable, because deterministic low-level tasks need stable tools. But if the entire experiment-management process is compressed into a CLI, you lose a great deal of contextual judgment: when to rerun, which parameter combinations are worth extending, which anomaly may be a bug, which figure should enter the paper, and which result is already sufficient to stop.

The more natural architecture is: keep low-level scripts deterministic, and use a set of Skills to specify how the agent should read experiment directories, submit jobs, record provenance, generate reports, and avoid overwriting results. The experiment system is not a closed tool. It is an agent-operable research workspace. Its flexibility is often stronger than that of an independent CLI, and its results are often better, because scientific experimentation is not a fixed sequence of commands. It is a process of continuous judgment, adjustment, and interpretation.
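Two of the rules named above, recording provenance and never overwriting results, are the kind of thing that belongs in a deterministic script rather than in the agent's memory. A minimal sketch, assuming a numbered run-directory convention and a `provenance.json` file; both are hypothetical choices, not taken from any specific project:

```python
import json
import sys
from datetime import datetime, timezone
from pathlib import Path


def new_run_dir(results_root: Path, name: str) -> Path:
    """Create the next free numbered run directory; existing runs are never reused."""
    results_root.mkdir(parents=True, exist_ok=True)
    index = 1
    while True:
        candidate = results_root / f"{name}-{index:03d}"
        try:
            candidate.mkdir()  # raises FileExistsError if the run already exists
            return candidate
        except FileExistsError:
            index += 1


def record_provenance(run_dir: Path, params: dict, command: list[str]) -> Path:
    """Write provenance once; opening with mode 'x' refuses to overwrite it."""
    record = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "command": command,
        "params": params,
    }
    path = run_dir / "provenance.json"
    with path.open("x", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return path
```

A Skill can then say "always create runs through this script", and the judgment about which runs to launch stays with the agent.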

The third category is existing Python software frameworks. In the past we would ask: should we wrap it in a GUI? Should we build a web app where users can select parameters, drag modules, and display results? But for many scientific computing, machine learning, and quantum simulation frameworks, the better interface may not be a GUI. It may be an agent.

The framework itself provides strict APIs, types, tests, documentation, and examples. Agent-native adaptation lets the agent read the documentation, compose algorithms, write scripts, run demos, explain results, and generate figures directly. The user no longer has to learn every API before starting to explore. The user describes the goal in natural language, and the agent compiles that goal into framework code. This is not wrapping an old framework in a shell. It is connecting the framework to a natural-language programmable operating layer. TensorCircuit-NG represents this agent-native direction: the point is not to build another polished GUI, or a CLI that restricts functionality, but to make the framework itself a computational substrate that agents can understand, invoke, and extend.

These examples point to the same conclusion: next-generation software does not necessarily turn every tool into a standalone product. It lets tools enter the fluid environment of agents. This form has one enormous advantage: fluidity.

Traditional software is hard, in the sense of rigid. It must be installed, deployed, upgraded, compiled, and released. Its features solidify into buttons and pages. If users want to change it, they usually have to file an issue, wait for developers, fork the repo, or edit code.

Agent-native software is soft. It can be copied as a directory, changed into another set of Skills, locally rewritten by users through natural language, and migrated across agent platforms. It does not necessarily need compilation, a fixed UI, or versioned releases. Often, the software is simply a set of readable, editable, executable conventions.

If the user really needs an interface, the agent can generate an HTML page on demand. Today it can be a minimal table. Tomorrow it can be a flashy dashboard. The day after tomorrow it can be a paper-style report page. The interface becomes a runtime artifact, not the fixed shell of the software.
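The "minimal table" case is small enough to show in full. This is a sketch of the kind of throwaway renderer an agent might write on demand; the function name and the record fields in the usage example are hypothetical.

```python
from html import escape


def render_table(rows: list[dict], columns: list[str], title: str) -> str:
    """Render records as a bare HTML page, escaping every value."""
    head = "".join(f"<th>{escape(column)}</th>" for column in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(row.get(column, '')))}</td>" for column in columns)
        + "</tr>"
        for row in rows
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>'
        f"<body><h1>{escape(title)}</h1>"
        f"<table><tr>{head}</tr>{body}</table></body></html>"
    )


if __name__ == "__main__":
    page = render_table(
        [{"title": "An example paper", "verdict": "read"}],
        ["title", "verdict"],
        "Today's picks",
    )
    print(page)
```

Nothing here needs to be maintained. If tomorrow's report wants a different layout, the agent can discard this renderer and write another one against the same underlying records.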

This may be the most counterintuitive part: software in the AI era may not increasingly look like "smarter apps." A lot of software may become less app-like, more like an amorphous fluid that agents can read, modify, compose, and temporarily materialize.

This fluid form does not depend on a fixed UI. It can be executed by Codex, by Claude Code, or by future agents. As long as the agent is strong enough to read files, run commands, follow Skills, and maintain boundaries, it can run the software.

Software portability changes accordingly. In the past, migrating software meant migrating applications and data. Now it means migrating workspace conventions. What you take with you are docs, skills, scripts, templates, profile schemas, and examples. More concretely: a folder. The execution runtime can change; the software remains.

Design Principles From This Project

From Everyday ArXiv, we can extract several design principles.

First, deterministic work belongs in code. Fetching, parsing, caching, schemas, configuration, paths, and format checks should be ordinary software engineering. Do not let an LLM "remember" where today's cache should go. Do not let it invent data structures at runtime every time.

Second, judgment belongs in the agent. Recommendation, selection, close reading, research ideas, citation risk, and email tone are exactly where general agents are strong. Hardcoding them into fixed API pipelines sacrifices flexibility.

Third, user profiles should be local files, not abstract preference buttons. Research interests, negative preferences, prior papers, and citation anchors are detailed and personal. The agent should be able to read, cite, update, and audit them directly.

Fourth, Skills are the product core. They are not documentation attached to the product. They are the main execution logic of agent-native software. Traditional software has its core in code paths; agent-native software often has its core entry point in Skills.

Fifth, the open-source boundary must be designed upfront. The public repository should store the general protocol. Private files should store the user. This allows the software to be reusable without leaking personal knowledge and workflows.

Together, these principles define a new software form: not an application wrapped around LLM APIs, but a workspace growing around general agent platforms.

Closing

AI agents first looked like programmers who could write code faster. Then they looked like assistants that could operate tools. Next, they may look more like general intelligent runtimes and operating systems.

If this is true, part of next-generation software will no longer be understood as "applications." It will be agent-readable directories, prompts, Skills, scripts, profiles, and caches. It will have no fixed shape, but still run reliably. It will have no complete UI, but still complete complex work. It will not be generated once by AI and then left alone; it will continuously live inside AI agents.

Everyday ArXiv is a small research tool, but it shows the early form of this direction: the intelligent part of software does not necessarily need to be packaged into a specialized agent. When general agents become strong enough, software can be written as a harness for those agents instead. I would even make a stronger claim: very few specialized agents will remain useful. Most will be swallowed by general agents, as Sutton's bitter lesson would suggest.

This may be the shift from software generated by AI to software existing within AI.

DEV Community