The number of tools and functions that aim to enhance the abilities of language models (LMs) is growing rapidly. For example, the popular LM framework LangChain grew its tool catalog from three to seventy-seven in the last 15 months. However, this approach of building tools for every little thing may be misguided and ultimately counterproductive. Instead, providing AI with direct access to a terminal, where it can use the many command line tools already created, and even create its own tools, will lead to more powerful, flexible, and future-proof systems.
Theory
Rich Sutton's short essay "The Bitter Lesson" strikes at the heart of this issue. On the surface, the essay is about choosing general methods over human-designed algorithms. But if that's all it was, it'd be just a lesson. The bitter part is that we humans want to feel special. We desperately need to believe that our knowledge and our contributions matter. So we saddle the AI with our knowledge, come up with processes for it to follow, and fill the gaps with our own code - and it all works, for a while. However, the next wave of innovation always comes and washes this sand castle away. We don't remember the clever algorithms modeling human vocal cords in speech recognition, we don't remember the clever algorithms searching for generalized cylinders in computer vision, and so neither will we remember most of LangChain's current seventy-seven tools.
The Princeton Language & Intelligence lab recently released SWE-Agent, shocking the world with its ability to bump OpenAI's GPT-4's percentage resolution of real-world GitHub Python issues from 1.4% to 12%. What will be remembered from their achievement is not all the clever work in optimizing GPT-4 (remember the Bitter Lesson), but the introduction of the idea of the Agent-Computer Interface, and the focus on improving this interface. The Princeton researchers took a very important step of asking how we can improve the 'user experience' of an AI agent. Here are some features they implemented:
- File search with 50 results at a time.
- File viewer that displays 100 lines at a time.
- Context management to remind the agent what it's working on.
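To make the first two features concrete, here is a minimal sketch of a windowed file viewer and a capped search. It is not SWE-Agent's code: the names FileWindow and search are hypothetical, and the search runs over the lines of one file rather than across a repository, to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class FileWindow:
    """Shows a fixed-size window of lines instead of the whole file."""

    lines: list[str]
    window: int = 100  # lines per page, matching the figure quoted above
    start: int = 0     # zero-based index of the first visible line

    def view(self) -> str:
        end = min(self.start + self.window, len(self.lines))
        body = "\n".join(
            f"{number}: {text}"
            for number, text in enumerate(self.lines[self.start:end], start=self.start + 1)
        )
        # Tell the reader how much is hidden, so nothing has to be remembered.
        return f"({self.start} lines above)\n{body}\n({len(self.lines) - end} lines below)"

    def scroll_down(self) -> None:
        last_start = max(len(self.lines) - self.window, 0)
        self.start = min(self.start + self.window, last_start)

    def scroll_up(self) -> None:
        self.start = max(self.start - self.window, 0)


def search(lines: list[str], term: str, limit: int = 50) -> list[int]:
    """Return at most `limit` one-based line numbers whose text contains `term`."""
    hits = [number for number, text in enumerate(lines, start=1) if term in text]
    return hits[:limit]
```

The point of the sketch is that both features are ordinary pagination: the agent sees a bounded slice plus a hint about what lies outside it.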
These user experience challenges are not exclusive to AI. Humans are also more productive when we can use search and pagination instead of having to read and remember thousands of lines of text. Which is why there's already a solution to most of these challenges, the king of all software: The terminal. I'd estimate that 80% of any tasks that need to be done on the computer can be done in the terminal. And the terminal can be used to develop solutions for the remaining 20%.
In fact, the Princeton researchers explicitly called out this possibility in their paper, but dismissed it as insufficiently user-friendly for LMs. They write:
Consider a text editor like Vim which relies on cursor-based line navigation and editing. Carrying out any operation leads to long chains of granular and inefficient interactions. Furthermore, humans can ignore unexpected inputs, such as accidentally outputting a binary file or thousands of lines returned from a ‘grep’ search. LMs are sensitive to these types of inputs, which can be distracting and take up a lot of the limited context available to the model. On the other end, commands that succeed silently confuses LMs. We observe that LMs will often expend extra actions to verify that a file was removed or an edit was applied if no automatic confirmation is given. We show that LMs are substantially more effective when using interfaces built with their needs and limitations in mind.
These are important points, but I don't agree that they have any bearing on the viability of just letting an LM use the terminal:
- A terminal window has fixed dimensions, so an LM that interacts with a terminal won't be forced to process thousands of lines of unexpected output - only whatever it can see in that window.
- Having to carry out lots of granular interactions and execute silent commands that don't return output may confuse the current generation of LMs, but this seems a limitation of just the current generation, not a universal limitation of all LMs. Per the Bitter Lesson, it's counterproductive to build tools while focusing on current limitations, as those tools then get an expiration date to their usefulness.
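The first point can be illustrated with a small sketch of what a fixed-size window does to a flood of output. The function name visible_window and the 24x80 default are assumptions for illustration, and the sketch ignores escape sequences, tabs, and wide characters that a real terminal would handle.

```python
def visible_window(output: str, rows: int = 24, cols: int = 80) -> str:
    """Return only what a rows x cols terminal would still show after printing `output`."""
    wrapped: list[str] = []
    for line in output.splitlines():
        # A terminal wraps long lines onto extra rows rather than dropping the overflow.
        chunks = [line[i:i + cols] for i in range(0, len(line), cols)] or [""]
        wrapped.extend(chunks)
    # Older rows scroll off the top; only the last `rows` remain visible.
    return "\n".join(wrapped[-rows:])


if __name__ == "__main__":
    flood = "\n".join(f"match {i}" for i in range(5000))  # e.g. a runaway grep
    print(visible_window(flood))  # prints only the final 24 rows
```

However many lines a command emits, the amount of text handed to the model is bounded by rows times cols.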
Implementation
As the Princeton paper says, the current generation of LMs isn't capable of using the terminal to perform complex tasks reliably.
However, you can try it out for yourself by installing langchain-community from my fork (I opened a pull request into the official LangChain repo but it's currently blocked due to security concerns).
You'll have to install tmux first.
If you're using pip:
pip install libtmux git+https://github.com/panasenco/langchain.git@terminal-window-tool#subdirectory=libs/community
Alternatively, if you're using poetry, add this to your dependencies section in pyproject.toml:
langchain_community = {git = "https://github.com/panasenco/langchain.git", branch="terminal-window-tool", subdirectory="libs/community"}
libtmux = "^0.37.0"
Then follow the instructions in the documentation notebook. Here's an excerpt that shows how an LM can interact with the terminal:
from langchain_community.tools import (
    TerminalLiteralInputTool,
    TerminalSpecialInputTool,
    TerminalBottomCaptureTool,
)
from langchain_openai import ChatOpenAI

lm = ChatOpenAI(model="gpt-4o-2024-05-13").bind_tools(
    [TerminalLiteralInputTool(), TerminalSpecialInputTool(), TerminalBottomCaptureTool()]
)
msg = lm.invoke("What top 3 processes consuming the most memory?")
msg.tool_calls
[{'name': 'terminal_literal_input',
  'args': {'__arg1': 'ps aux --sort=-%mem | head -n 4'},
  'id': 'call_xS2CDWlgZFkslqe7lM2QBoBz'},
 {'name': 'terminal_special_input',
  'args': {'__arg1': 'Enter'},
  'id': 'call_p6b4MVlPZ5FdC2aWsk2F4dEo'}]
In the above example we can see that GPT-4o knows how to translate the problem statement into a shell command, and knows to press the Enter key after entering the command. It can't do much more than this without a lot of hand-holding yet, but I'm sure we'll have a model for which navigating the terminal won't be a challenge by the end of 2024.
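The tool calls shown above still have to be turned into keystrokes somewhere. Here is a minimal sketch of that step, assuming a pane object that exposes a tmux-style send_keys(cmd, enter, literal) method. PaneLike and apply_tool_calls are hypothetical names for illustration, not part of the fork, and the actual tools may work differently.

```python
from typing import Protocol


class PaneLike(Protocol):
    """Anything with a tmux-style send_keys method (hypothetical interface)."""

    def send_keys(self, cmd: str, enter: bool = True, literal: bool = False) -> None: ...


def apply_tool_calls(pane: PaneLike, tool_calls: list[dict]) -> None:
    """Replay the model's tool calls as keystrokes, in the order they were emitted."""
    for call in tool_calls:
        text = call["args"]["__arg1"]
        if call["name"] == "terminal_literal_input":
            # Type the text exactly as given; Enter is never pressed implicitly.
            pane.send_keys(text, enter=False, literal=True)
        elif call["name"] == "terminal_special_input":
            # Key names such as "Enter" are interpreted by tmux, not typed out.
            pane.send_keys(text, enter=False, literal=False)
        else:
            raise ValueError(f"unsupported tool: {call['name']}")
```

A libtmux pane is expected to fit this interface, assuming its send_keys accepts these keyword arguments in the installed version. Keeping literal text and special keys separate matters because otherwise typing the word Enter would be indistinguishable from pressing the key.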
Security
AIs are prone to many categories of unsafe behavior, as highlighted by the paper Concrete Problems in AI Safety. The more power and freedom we give AI agents, the more likely they are to behave in unexpected and unwanted ways, and I can't think of a single application that gives its users as much power and freedom as the terminal.
The minimum that needs to be done to satisfy the concerns of safety and security in granting LMs terminal access depends on the generation of LMs we're talking about.
- GPT-4 and equivalent LMs merely need local containerization. Just use a development container when working on your app. Development containers work in VS Code and GitHub Codespaces, and are a best practice in general to let others easily collaborate. Working inside a development container will ensure that oopsies like rm -rf / do minimal damage to the parent systems. These LMs don't yet seem capable of using the terminal to intentionally break out of the container.
- The next generation of LMs will need a greater degree of isolation. Cloud providers like Amazon already allow arbitrary users access to their systems without knowing whether the user is a hacker. The startup E2B brings the same secure containerization technology Amazon uses to the AI space. Treating these LMs as if they were potentially malicious human-level hackers and placing similar restrictions on them as cloud providers do on human users should be sufficient to contain the threat.
- The following generation of LMs will probably need to be treated as superhuman hackers. These LMs should probably not be given access to any tools at all, not just the terminal, at least until the Superalignment team figures something out.
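For the first tier, local containerization can be as simple as starting the agent's shell inside a throwaway container. The sketch below builds a docker run command line with a few standard hardening flags. It is an illustration, not a development container setup: the function name sandbox_command, the image, the path, and the resource limits are all assumptions to adjust.

```python
import shlex


def sandbox_command(image: str, workdir: str) -> list[str]:
    """Build a `docker run` argv for a throwaway, locked-down shell."""
    return [
        "docker", "run", "--rm", "--interactive", "--tty",
        "--network", "none",                     # no network access from inside
        "--cap-drop", "ALL",                     # drop all Linux capabilities
        "--security-opt", "no-new-privileges",   # block privilege escalation via setuid
        "--pids-limit", "256",                   # blunt fork bombs
        "--memory", "1g",                        # cap memory use
        "--volume", f"{workdir}:/workspace",     # only this directory is shared
        "--workdir", "/workspace",
        image, "bash",
    ]


if __name__ == "__main__":
    # Example values only; the image and path are placeholders.
    print(shlex.join(sandbox_command("ubuntu:24.04", "/tmp/project")))
```

Inside such a container, a stray rm -rf / can destroy the mounted project directory but not the rest of the host, which is why that directory should be under version control. Disabling the network also prevents package installs, so that flag may need relaxing in practice.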
Conclusion
The Bitter Lesson teaches us that AI researchers over the years have tried desperately to retain a feeling of human control and a sense of value of human knowledge, only for those illusions to be shattered over and over. The only things of lasting value we human engineers seem to have to offer to AI are general methods that scale as the AI's abilities grow. I predict the "more dangerous" tools that also offer AI agents more flexibility and power will begin to replace the more limited ones. In the end, only the terminal tool will be necessary.
What about you? Can you use these tools to find a way to make current-generation LMs interact with the terminal reliably, contrary to what the Princeton paper claims? What do you think about the safety of letting LMs use a terminal? Let me know in the comments!
Top comments (10)
Anything I've tried to date that you can pipeline within a shell you can send to an LLM for processing with FluentCLI (token limitations and some escaping of the content apply). This should be true with Tmux also but I'll dig into it later today and let you know.
Currently, with FluentCLI there is an output flag to parse the code blocks out of an LLM response. It returns them as a string and if that is wrapped in an eval expression the command is run by the system.
The FlowiseAI chatflow looks like this.
It's unconfigured except for the connections between components. FluentCLI transparently provides the configuration at runtime while also decrypting secrets stored in a private replicable vault.
This is the command I ran to get the top 3 processes on my system.
eval "$(fluent GPT4oToolAgentRepoCloud "what are the top 3 processes consuming memory on my system, give me the command to find out for macOS" -p )"
The command structure for fluent is fluent flowname "request" [flags]. So in the above example, we are sending the request to the FlowiseAI chatflow called GPT4oToolAgentRepoCloud and asking it for the right command to run on macOS.
The -p flag tells FluentCLI to parse the response and return only the code blocks.
The entire output is wrapped in an eval and then invoked. It looked a little like this. Notice the pipe to tee /dev/tty: this is so the command would show on the screen. Otherwise, the pipeline consumes it.
I ran it again and told it not to use top.
Here is one more example.
This prompt invokes a FlowiseAI chatflow that creates another FluentCLI prompt to execute a Langflow Blog writer chatflow. It includes updating and changing the URLs that the Langflow chatflow uses to write the blog post.
You can use Fluent to call Fluent.
I have set up long dialogues between all the various LLMs in a round table sort of lead discussion with one moderator.
I've created loops where Fluent asks for code to be written, the code is output to a file and compiled, and the output of the compilation is fed back into Fluent so the LLM can revise the code and do it again. It works well for smaller code bases.
This pattern is how I made a Rust Airtable utility a couple of weeks ago, in an afternoon. It was also used to make a Rust Logseq utility, which took a bit longer because of the Logseq formatting, but was still scary fast. Those utilities were specifically created to allow simple and easy interaction with both services and FluentCLI, and FluentCLI was the primary tool used to create them.
This is very interesting! Does Fluent build the FlowiseAI flows, or do you have to do it yourself?
It's impressive how productive you've managed to become with the current generation of LMs. I still find myself rewriting pretty much all code LMs produce, but I also haven't seriously attempted the recursive approach you describe, which I believe SWE-Agent also uses.
Omg, I wrote up a long response, and it was lost. I can't bring myself to do it again. I'll write a post soon on the full design.
To answer your question, Flowise and Langflow chatflows still need to be defined.
But, once they are defined, FluentCLI lets them become much more flexible.
Let me share an example configuration for a RAG workflow. This setup can efficiently query github repository information stored in a vector store, which could greatly enhance our workflow.
All of the 'AMBER_' settings are replaced at runtime with the keys stored in the secure amber vault.
In the same way that the amber keys are replaced, any settings that have been defined on the command line as '-override' are also replaced.
This means I can reconfigure this chatflow to a different GitHub repository, upsert the data in the same command, and issue a query to the new RAG topology.
The currently available source code contains the configuration file I'm using for development. It also has many publicly available (unconfigured) chatflows.
To use any of these public chatflows, you must configure 2 ENV variables and then walk through the Amber key setup, adding your API keys to your Amber vault.
Once configured, an Amber vault can be versioned to GitHub or stored on a cloud sync drive. To use Fluent on any computer, you only need to define those environment keys, and Fluent will work with your configuration.
This opens the possibility of providing scripts, applications, and future designs that can be easily shared and consumed by anybody using FluentCLI. They add a configuration (which is applied immediately) and ensure they have defined the right API keys in the configuration and it will just work.
And specifically about my productivity, this summarizes what I've learned: start with the smallest part of your program you can and get it working, then iterate, adding features and functionality incrementally. When you get to about 2.5k lines in one file, code generation results will degrade rapidly. That's all I have. 😀
Just to follow up, the direct integration with the LangChain terminal component isn't in FluentCLI. Perhaps I'll add it at some point. For me, it's somewhat unnecessary in a FluentCLI world, but I do see the significant potential of that implementation.
FluentCLI allows a much broader use case (I think) because you can give the necessary context for language models to understand how to write the syntax which can then be executed.
This is a shell script to run some Fluent sessions within a Tmux env.
The output looks like this after execution.
I didn't tell the chatflows the commands to execute, just what I wanted to accomplish and maybe some parameters around them like a local repo path. That's enough for the rest to work, generally speaking, and when it doesn't, it's likely just a tiny prompt tweak to make the change.
This is really fascinating! How do you think current LM limitations can be overcome to effectively use the terminal? Any insights on potential breakthroughs?
Thanks Ethan! I don't think there's anything special we can or should do, as per 'The Bitter Lesson'. As much as GPT-4o blew people away recently, I'd bet the next 'whale-sized' LM that OpenAI is already either training or safety-testing (see Microsoft CTO video here) will be able to use the terminal right out of the box, no special training needed. We just need to wait 6 months. :)
If you’re interested in interacting with your LLMs from the cli, I highly recommend:
fluentcli homepage.
fluentcli GitHub repo.
FluentCLI is designed to interact with Langflow, FlowiseAI and generic webhook services. It supports pipeline input, dynamic overrides and numerous other features which are unmatched by other offerings.
Frankly, it is the most powerful and full-featured AI CLI application right now... all self-proclaimed, of course, since I created it. 😊
Looks like Fluent integrates with LangChain, so could it use the tmux tool I've discussed in this post to call itself? :D
I have been enjoying the Fabric CLI tool. It is very easy to add your own extensions to it and makes it easy to chain commands together