Stylianos Kamnakis

Prototype: Voice-Controlling Web Apps with LLMs and Actions

I've been tinkering with an idea that I can't seem to shake: what if we could interact with web applications just by talking to them? Not in a gimmicky way, but in a way that genuinely makes them faster and more accessible to use. This isn't a new idea, of course. The command line is the original conversational interface, and we've seen assistants on our phones and in our homes for years. But I've been feeling like there's a gap when it comes to web applications themselves.

The Little Problem That's Actually a Big Problem

Modern web apps are powerful. They can be feature-rich, complex, and, let's be honest, sometimes a little overwhelming. We've all been there—clicking through nested menus, hunting for that one specific function, or trying to remember the precise sequence of operations to get something done.

This complexity can be a barrier. For power users, it might be a minor annoyance, a few wasted seconds. But for new users, it can be a steep learning curve. And for users with accessibility needs, a complex UI can be a significant hurdle. What if we could flatten that curve? What if, instead of hunting and clicking, you could simply state your intent?

That's the vision that got me started on this little project. The gut feeling is that by enabling users to express their goals in natural language—"add a new task called 'buy milk' and mark it as important," or "show me all the high-priority items in the 'work' project"—we could create a more fluid and intuitive user experience.

The First Step: A Proof of Concept

[Demo: the natural language interface in the to-do app]

To test this idea, I put together a small framework, which I've been calling Kinesis. The core concept is to bridge the gap between a user's natural language input and the application's executable functions. Based on the code I've been working on, the flow looks something like this:

  1. Action Registration: The application developer defines a set of "actions" that a user can perform. These are just plain JavaScript functions with a name, a description, and a schema for the parameters they expect. In a React context, this could be done with a simple hook like useKinesisAction.
  2. Manifest Generation: The framework collects all these registered actions into a "manifest." This manifest is essentially a menu of capabilities that the application has. It describes what the app can do in a structured way; a sketch of what a manifest entry and the agent's reply could look like follows this list.
  3. The AI Agent: When a user types a command, it's sent to a backend "agent." This agent, powered by a large language model (LLM) like GPT-4 or a local model via Ollama, is given the user's message, the manifest of available actions, and the current state of the application.
  4. Intent to Action: The LLM's job is to act as a translator. It looks at the user's request and the list of available tools (the actions in the manifest) and determines which function, or sequence of functions, to call. It then returns the name of the action and the parameters to execute it with.
  5. Execution: The frontend receives the action(s) from the agent and runs them, updating the UI accordingly.
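
To make steps 2 and 4 a bit more concrete, here is a rough sketch of what a manifest entry and the agent's reply could look like. The field names are illustrative assumptions, not the prototype's actual wire format:

// Hypothetical shapes, for illustration only -- not Kinesis's actual wire format.

// One manifest entry: a machine-readable description of a registered action.
const manifestEntry = {
  name: "add-todo",
  description: "Add a new todo item to the list",
  params: { text: "string", done: "boolean" },
};

// What the agent might return for "add a task to buy milk":
// the action(s) to run and the parameters it extracted from the sentence.
const agentReply = {
  actions: [
    { name: "add-todo", params: { text: "buy milk", done: false } },
  ],
};

With a reply like this, the frontend only needs to look up each returned action by name in its registry and invoke the corresponding handler.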

I've built out a small to-do list application to try this out. It has actions like add-todo, delete-todo, and even reorder-todos. Being able to type "add a task to buy groceries" and see it appear in the list feels... promising.

The Bumps in the Road: Implementation Challenges

Getting this to work, even as a prototype, brought a few interesting challenges to the surface.

First was the question of context. For the AI to make good decisions, it needs to understand not just the user's command but also the current state of the application. I decided to serialize the application's state and send it along with the prompt. This gives the model a snapshot of what's happening on the screen, allowing for more intelligent suggestions. For example, if you say "delete the first item," the AI needs to know what the "first item" actually is.
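
Conceptually, the request to the agent bundles the user's message with a snapshot of that state. A minimal sketch of what such a payload could look like (the field names are assumptions, not the framework's actual contract):

// Illustrative only: how a request to the agent could be assembled.
const todos = [
  { id: "t1", text: "buy milk", done: false },
  { id: "t2", text: "file taxes", done: true },
];

const agentRequest = {
  message: "delete the first item",
  // A serialized snapshot of what's currently on screen, so the model
  // can resolve references like "the first item" to a concrete todo ID.
  state: JSON.stringify({ todos }),
};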

Another challenge was action definition and discovery. How do you make it easy for developers to expose their app's functionality to the AI? The approach I've taken is a registration pattern (registerAction). In a React app, custom hooks can wrap this, making it almost declarative. A component can simply say, "here's an action I support," and the framework handles the rest.
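
As a rough idea of how such a hook could work under the hood (a sketch under assumptions, not the actual Kinesis implementation): register on mount, unregister on unmount, so each action lives and dies with its component.

import { useEffect } from 'react';
// Hypothetical core entry point -- registerAction is assumed here to return
// an unregister function; the real package layout may differ.
import { registerAction } from '@kinesis-framework/core';

export function useKinesisAction(actionDef) {
  useEffect(() => {
    // Add the action to the app's manifest while the component is mounted...
    const unregister = registerAction(actionDef);
    // ...and remove it when the component unmounts, so stale actions
    // from previous pages can no longer be called by the AI.
    return unregister;
    // Re-register only if the action's name changes; a real implementation
    // would handle changing handlers/params more carefully.
  }, [actionDef.name]);
}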

Finally, there's the LLM adapter layer. Not everyone wants to use the same language model. I created a simple LLMAdapter interface that allows plugging in different AI providers. I've implemented adapters for both OpenAI and Ollama so far. This allows for flexibility—a developer could use a powerful cloud model for production but a local, free model for development.
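
To give a flavor of that layer, here is a hedged sketch of such an adapter interface; the method name and the Ollama request shape below are assumptions about the design, not the actual Kinesis API.

// Illustrative adapter interface -- the method name is an assumption.
interface LLMAdapter {
  // Given a prompt (user message + manifest + state), return the model's raw reply.
  complete(prompt: string): Promise<string>;
}

// A local model via Ollama's HTTP API (assumes Ollama is running locally).
class OllamaAdapter implements LLMAdapter {
  constructor(private model = 'llama3') {}

  async complete(prompt: string): Promise<string> {
    const res = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: this.model, prompt, stream: false }),
    });
    const data = await res.json();
    return data.response;
  }
}

An OpenAI-backed adapter would implement the same interface with the OpenAI SDK, and the agent code would only ever see LLMAdapter.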

The Leap to a Library: The Next Set of Challenges

Turning this from a project-specific implementation into a generic, reusable library is a whole other mountain to climb. The challenges here are less about getting it to work and more about getting it to be adoptable.

  • Generality vs. Specificity: How do you create a library that is general enough to work for any application but specific enough to be genuinely useful? The current implementation is tied to my to-do list's state management. A truly universal library would need to be agnostic about how the host application manages its state.
  • Developer Experience: For a library like this to be successful, it has to be incredibly easy for developers to integrate. This means clear documentation, a simple API, and as little boilerplate as possible. The useKinesisAction hook is a step in this direction, but there's more to be done.
  • Security: As soon as you're executing functions based on AI-generated output, you have to think about security. What if the model hallucinates a function call that could have unintended consequences? We need robust validation and maybe even a permission model for actions; a minimal validation sketch follows this list. Furthermore, implementing a versioning system for actions could be crucial. This would provide a clear audit trail and rollback capability for the actions that take place.
  • Performance: Making a round trip to a server for every command might not always be ideal. For certain actions, we might want to explore running smaller, on-device models for faster feedback, reserving the larger models for more complex queries.
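
As a starting point on the security front, even a thin validation layer in front of the handlers helps. This is a minimal sketch, assuming the simple string-typed params schema shown later in the tutorial; it is not part of the current prototype:

// Minimal guardrail sketch: only run actions that exist in the registry,
// and only with parameters that match their declared (simple) schema.
function validateAndRun(registry, aiAction) {
  const def = registry[aiAction.name];
  if (!def) {
    throw new Error(`Unknown action "${aiAction.name}" -- refusing to execute`);
  }
  for (const [key, expectedType] of Object.entries(def.params)) {
    const value = aiAction.params[key];
    if (typeof value !== expectedType) {
      throw new Error(`Parameter "${key}" should be a ${expectedType}`);
    }
  }
  return def.handler(aiAction.params);
}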

Tutorial: Bringing it to Life with Code

Let's look at some simplified code snippets to illustrate how this natural language UI framework can be integrated into a React application. The core idea revolves around defining actions and then letting the AI agent pick and execute them based on user input.

1. Defining and Registering an Action

The useKinesisAction hook is the primary way to expose an application's functionality to the AI. It takes an ActionDef object, which includes the action's name, a description for the AI, a params schema, and the handler function that performs the actual work.

Here's how you might define a simple "add-todo" action within a React component:

import { useKinesisAction } from '@kinesis-framework/react'; // Assuming this is an installed external package

// ... (other imports and component setup)

export default function TodosPage() {
  // Assuming you have a state management function like this
  const { addTodo } = useTodosStore();

  useKinesisAction({
    name: "add-todo",
    description: "Add a new todo item to the list",
    params: {
      text: "string", // The text content of the todo
      done: "boolean", // Whether the todo should be marked as done initially
    },
    handler: async ({ text, done }) => {
      // This is the actual function that modifies your application state
      addTodo(text, done);
      console.log(`Added todo: "${text}" (Done: ${done})`);
    },
  });

  // ... (rest of your component's JSX)
}


In this example:

  • name: "add-todo": A unique identifier for this action.
  • description: This is crucial. It tells the AI what this action does in plain language. The more descriptive, the better the AI's understanding.
  • params: Defines the expected input for the handler. Here, text is a string and done is a boolean. The framework uses this schema to guide the AI in extracting parameters from the user's natural language.
  • handler: This is your application's logic. When the AI decides to call add-todo, this function will be executed with the parameters extracted from the user's prompt.

A possible upside of using useKinesisAction within components is the flexibility to define actions that are relevant only to the current view or page. Just like event listeners are scoped to the elements they are attached to, actions registered with useKinesisAction are available only for the lifetime of the component they are defined in. This means actions from a previous page (e.g., if you navigate away) will automatically be unregistered and not be callable by the AI, preventing unintended side effects.

If you desire global actions that are always available regardless of the current page or view, you would define them in a higher-level layout component (e.g. your main App component or a root layout) that persists across navigation.
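
For instance, a hedged sketch of a global action registered in a root layout; the useKinesisAction usage mirrors the tutorial above, but the "toggle-theme" action itself is made up for illustration:

// In a root layout that never unmounts, so the action is always available.
import { useKinesisAction } from '@kinesis-framework/react';

export default function RootLayout({ children }) {
  useKinesisAction({
    name: "toggle-theme", // hypothetical global action
    description: "Switch the application between light and dark theme",
    params: {},
    handler: async () => {
      // Call whatever global theme toggle your app actually has here.
      document.documentElement.classList.toggle('dark');
    },
  });

  return <main>{children}</main>;
}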

2. Triggering the AI Intent

Once actions are registered, you need a way to send the user's natural language input to the AI agent and then execute the returned actions. The askAndRunAIIntent function simplifies this process.

Typically, you'd connect this to an input field or a voice command interface.

// In your main application component or a command bar component

import { useState } from 'react';
import { askAndRunAIIntent } from '@kinesis-framework/run'; // Assuming this is an installed external package

// ... (other imports and component setup, e.g. the useTodosStore hook)

export default function App() {
  const [inputValue, setInputValue] = useState('');
  // Assuming you have some global application state you want to send to the AI
  const appState = useTodosStore.getState(); // Example: get current todos state

  const handleCommandSubmit = async (e) => {
    e.preventDefault();

    if (inputValue.trim() === '') return;

    try {
      // Send the user's message and the current app state to the AI agent
      const success = await askAndRunAIIntent({
        message: inputValue,
        state: appState, // Provide the current application state for context
      });

      if (success) {
        console.log("AI intent processed successfully!");
        setInputValue(''); // Clear input on success
      } else {
        console.log("AI could not process the intent or found no relevant action.");
        // Optionally, shake a command bar or show an error message
      }
    } catch (error) {
      console.error("Error processing AI intent:", error);
      // Handle error, e.g., show a user-friendly message
    }
  };

  return (
    <form onSubmit={handleCommandSubmit}>
      {/* Your main application UI */}
      <input
        type="text"
        value={inputValue}
        onChange={(e) => setInputValue(e.target.value)}
        placeholder="Type a command, e.g., 'add buy milk'"
      />
      <button type="submit">Go</button>
    </form>
  );
}


In this snippet:

  • The inputValue holds the user's natural language command.
  • appState is passed to the AI agent, providing crucial context for the LLM to make informed decisions (e.g., if the user says "delete the last item," the AI needs to know what the "last item" is).
  • askAndRunAIIntent handles the network request to your backend AI endpoint (/api/ai/intent) and then executes the actions returned by the AI.

This simplified view demonstrates the core interaction loop: define capabilities, send user input with context, and execute AI-determined actions. The beauty lies in abstracting the complex LLM interaction behind simple function calls, allowing developers to focus on defining their application's capabilities.
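
Conceptually, what a helper like askAndRunAIIntent does boils down to a round trip plus a dispatch loop. The sketch below is an assumption about that internal flow rather than the actual implementation, and it omits the validation discussed earlier:

// Rough idea of the loop a helper like askAndRunAIIntent could perform.
async function runIntent({ message, state, registry }) {
  // 1. Send the user's message, the app state, and the manifest to the agent.
  const res = await fetch('/api/ai/intent', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message,
      state,
      manifest: Object.values(registry).map(({ name, description, params }) => ({
        name,
        description,
        params,
      })),
    }),
  });
  const { actions = [] } = await res.json();

  // 2. Execute each action the agent picked, in order.
  for (const action of actions) {
    const def = registry[action.name];
    if (def) await def.handler(action.params);
  }
  return actions.length > 0;
}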

3. Adding More Actions: A Seamless Process

One of the design goals for this framework was to make extending the natural language capabilities as straightforward as possible. Once the initial setup is in place, adding new actions is a highly declarative process.

Consider if you wanted to add an action to "mark a todo as done." You would simply add another useKinesisAction call within your component:

// Inside your React component (e.g., page.tsx)

import { useKinesisAction } from '@kinesis-framework/react';
// ... (other imports)

export default function Home() {
  const { addTodo, doneTodo, deleteTodo } = useTodosStore(); // Assume doneTodo is available

  // Existing 'add-todo' action
  useKinesisAction({
    name: "add-todo",
    description: "Add a new todo item to the list",
    params: { text: "string", done: "boolean" },
    handler: async ({ text, done }) => {
      addTodo(text, done);
      console.log(`Added todo: "${text}" (Done: ${done})`);
    },
  });

  // New 'mark-todo-done' action
  useKinesisAction({
    name: "mark-todo-done",
    description: "Marks a specific todo item as completed",
    params: {
      id: "string", // The ID of the todo item to mark as done
    },
    handler: async ({ id }) => {
      doneTodo(id); // Your application's function to mark a todo as done
      console.log(`Marked todo with ID "${id}" as done.`);
    },
  });

  // Another example: 'delete-todo'
  useKinesisAction({
    name: "delete-todo",
    description: "Deletes a specific todo item from the list",
    params: {
      id: "string", // The ID of the todo item to delete
    },
    handler: async ({ id }) => {
      deleteTodo(id); // Your application's function to delete a todo
      console.log(`Deleted todo with ID "${id}".`);
    },
  });

  // ... (rest of your component's JSX)
}


As you can see, adding new functionality for natural language interaction is as simple as:

  1. Defining a unique name for the action.
  2. Providing a clear, descriptive description for the AI.
  3. Specifying the params schema to guide the AI in extracting necessary information.
  4. Implementing the handler function that directly calls your existing application logic.

The framework automatically handles the registration of this new action with the manifest, making it available to the AI agent without any further manual configuration or retraining of the LLM. This design aims to minimize friction for developers, allowing them to rapidly expand the natural language capabilities of their applications.
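
The reorder-todos action mentioned earlier follows the same pattern with a slightly richer schema. A hedged sketch, assuming the store exposes a reorderTodos(ids) function and that the params schema accepts an array type like "string[]":

// Inside a component, alongside the other useKinesisAction calls.
const { reorderTodos } = useTodosStore();

useKinesisAction({
  name: "reorder-todos",
  description: "Reorder the todo list to match the given sequence of todo IDs",
  params: {
    ids: "string[]", // the todo IDs in their new order
  },
  handler: async ({ ids }) => {
    reorderTodos(ids); // apply the new ordering in the app's state
  },
});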

Where to From Here?

I'm not claiming to have solved all these problems. This is just the start of an exploration. But the initial feeling is one of cautious optimism. There's something powerful about the idea of making our applications more conversational, more human-centric.

My next steps will likely involve trying to abstract the core logic into a standalone package and testing it with a different, more complex application. The goal isn't to build a massive, all-encompassing framework, but rather to continue exploring this space and see if this gut feeling leads to something genuinely useful. I'd be curious to hear if others have been thinking along the same lines.
