DEV Community: Wojciech Wentland

"Load the module from memory" sounds like one thing. It is not one thing. Each operating system gives you different primitives, hides different walls, and has a different definition of what counts as "never touched disk." I've been working on paker, a Python library that ships encrypted packages over the network and imports them without writing bytes to the filesystem, and the platform-by-platform reality turned out to be messier than the pitch. Post one in this series covered what paker is and what it's for; this one is about the plumbing underneath.

The patterns below apply whether you're building a plugin system, shipping a proprietary SDK, or doing anything where the word "executable" appears without a corresponding path on disk. paker is the concrete example I'll point at, but the primitives are the OS's, not mine.

Linux: `memfd_create` is the clean case

Linux gives you the primitive you want. memfd_create(2) was added in kernel 3.17 and creates an anonymous file descriptor backed entirely by RAM. You write bytes to the fd, then dlopen("/proc/self/fd/N"), and the loader maps the code into executable memory. The kernel never creates a filesystem entry.

import ctypes, os, importlib.util
from importlib.machinery import ExtensionFileLoader

libc = ctypes.CDLL(None, use_errno=True)
fd = libc.memfd_create(b"", 0x0001)  # MFD_CLOEXEC
os.write(fd, so_bytes)

loader = ExtensionFileLoader("module_name", f"/proc/self/fd/{fd}")
spec = importlib.util.spec_from_loader("module_name", loader)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

Outside the process, ls -l /proc/<pid>/fd/ shows memfd: as the link target. Another process can reach the bytes through /proc/<pid>/fd/<fd> (the memfd man page documents exactly this pattern for shared mappings), but it has to pass normal /proc access checks: matching UID, no container isolation, ptrace_scope friendly. There's no stable filesystem path, only a procfs entry whose accessibility depends on process-access rules.

If you pass an empty tag instead of the module name, the anonymous mapping doesn't publish what it is.

So far so clean. Then the real gotcha.

glibc can treat a reused `/proc/self/fd/N` path as the same library

glibc's dynamic linker reference-counts loaded objects and returns the same handle when the same library is opened again, as dlopen(3) describes. Fine for normal libraries whose on-disk paths are stable. With memfd, the "path" looks like /proc/self/fd/3, and 3 is just whatever file descriptor number the kernel handed you. Close that fd after loading, the next memfd_create will get fd 3 back, and in my glibc tests dlopen("/proc/self/fd/3") returned the cached handle for the previous library.

I verified this empirically in Docker on glibc: close fd 3, open a new memfd for a different module, dlopen returns the wrong shared object. No error, no warning. The code just calls into the wrong library.

So the fix is to keep the memfd fd open for the lifetime of the process. The mmap'd pages survive an fd close on their own as long as the mapping exists, so the lifetime question isn't about keeping the code mapped. It's that closing the fd lets the kernel reuse the same fd number for a different memfd. Once /proc/self/fd/3 can mean a different shared object later, glibc's loaded-object cache can hand you the wrong handle. Cleaning up the fd is how you reintroduce a collision bug that will fire on the first process loading more than a handful of extensions.

Every primitive has a papercut. This one was mine.

macOS: "zero disk" is a lie on Apple Silicon

Apple Silicon and mandatory code signing killed in-memory loading as a general technique. There is no macOS equivalent of memfd_create. The closest public API is NSCreateObjectFileImageFromMemory, and Meta's Red Team X documented the behavior change: starting with dyld3, NSLinkModule stopped doing in-memory loading and instead writes a temporary file with a NSCreateObjectFileImageFromMemory-XXXXXXXX fingerprint in $TMPDIR. The API still exists. It just doesn't do what the name says, and the temp artifact has a predictable fingerprint that monitoring tools can see.

The practical ceiling on modern macOS is "milliseconds on disk." You mkstemp a file with a random name, write the bytes, dlopen it, and os.unlink immediately. The directory entry disappears but the kernel keeps the mmap'd pages alive for the loaded module.

import os, tempfile, ctypes

fd, path = tempfile.mkstemp(prefix="", suffix="")
os.write(fd, so_bytes)
os.close(fd)

# Depending on how the host binary is signed, you may need to disable
# library validation or otherwise satisfy macOS code-signing rules.
handle = ctypes.CDLL(path, mode=ctypes.RTLD_NOW)
os.unlink(path)  # directory entry gone, mmap pages still valid

Signing is its own rabbit hole. Apple's forums distinguish between library validation (who signed the dylib), unsigned executable memory (JIT-style entitlements), and the deprecated reflective-loading APIs. Different problems, different entitlements. Which one you need depends on your host binary's signature and whether the bundle you're loading is ad-hoc signed, team-signed, or unsigned.

If your threat model assumes an attacker with root and a filesystem watcher, the milliseconds-on-disk window is still a window. If your threat model is "no plaintext bytes should survive this process," macOS lets you get close but never all the way there.

Windows: a full PE loader, until Static TLS stops you

Windows is surprisingly good if you're willing to bring your own loader. The _memimporter code from py2exe (itself derived from Joachim Bauch's MemoryModule, MPL-2.0) implements a PE loader in C: VirtualAlloc a region, copy the section headers, walk the relocation table, resolve the import address table by calling GetProcAddress against LoadLibrary'd dependencies, mark pages executable, call PyInit_<modname>. No filesystem syscalls anywhere in the path.

It works. I've shipped numpy, Pillow, and a dozen other packages through it on both x64 and ARM64. The ARM64 work surfaced two real bugs that had been dormant because nobody had tried:

HOST_MACHINE was hardcoded to IMAGE_FILE_MACHINE_AMD64. On ARM64 the PE header says 0xAA64, and the check was rejecting every valid library.
After relocations and section copies, ARM64 needs an explicit FlushInstructionCache before executing the new pages. x64 gets away without it because of cache coherency; ARM64 has split I-cache and D-cache and will happily execute the old bytes.

Fixing both of those was a Saturday. Static TLS is not a Saturday.

Static TLS is where my in-memory Windows path stops

Any extension that uses static TLS (__declspec(thread) or thread_local with static initialization) cannot be loaded safely and generally by a normal user-mode PE loader. I hit this first with Rust-based Python extensions, but the reliable test isn't the language. It's the PE header: if IMAGE_DIRECTORY_ENTRY_TLS is present and non-empty, the module needs the Windows loader's TLS machinery.

The reason is where the TLS data lives. TlsAlloc gives you dynamic indices and writes to TEB->TlsSlots. That one's fine. But the compiler doesn't emit TlsAlloc code for static TLS. It emits direct indexing into TEB->ThreadLocalStoragePointer, which is a separate array. That array is managed by LdrpHandleTlsData in ntdll, an undocumented internal function responsible for resizing the array when a new DLL with TLS is loaded. When LdrpHandleTlsData never runs, your extension's thread-local indexes point into memory that was never allocated for it.

Projects that have tried to work around this all make trade-offs I wasn't willing to ship:

Blackbone reimplements LdrpHandleTlsData via pattern-scanning ntdll. It maintains thirteen-plus byte patterns across Windows versions; every Windows update is a potential break.
Fatmike's PELoader does handle static TLS for a single loaded module. The author is explicit that simply iterating TLS callbacks isn't enough for Rust and that custom per-thread TLS-data initialization is required, with a TlsCallbackProxy shim to forward thread events. The project describes itself as experimental on this surface and loads one target at a time.
Manual slot management with TlsAlloc addresses the wrong array entirely. The one your compiler never touches.

This is why Python freezing tools often keep a real file path in the loop for native extensions. py2exe made that trade-off explicit for Python 3.12+, where bundle_files < 3 is no longer supported. The maintainer closed the issue with "I did not see a viable implementation." At some point delegating to the OS loader is the boring solution that works.

Accepting that took longer than fixing it. The TEB structure has been stable for decades. There is no user-mode path around it that's both general and stable, and a library that quietly breaks on the next Windows update is worse than a tempfile.

So for static TLS, paker detects it, writes the extension to a tempfile, and lets LoadLibrary do its job. For the majority of extensions without static TLS, the PE loader runs and nothing touches disk.

What you actually get, by platform

	Linux	macOS (arm64)	Windows (no TLS)	Windows (static TLS)
Bytes on disk	None	Milliseconds	None	Lifetime of load
Visible filesystem path	No	Briefly	No	Yes
Plaintext artifact after crash	None	Possible before unlink	None	Temp file may remain
Extra constraints	Keep fd open	Signing / entitlements	PE loader quirks	None, it's a file

"Loaded from memory" is not a binary property. Linux gives you a real zero-disk path with one behavioral gotcha. macOS gives you a brief window, not a guarantee, and the gap isn't closing. Windows gives you zero-disk for most of your bindings and a hard wall for the rest.

If you're choosing techniques, the question I'd ask isn't "can I load without touching disk" but "what's the worst-case footprint when the happy path doesn't apply." The packages that break the pattern are usually the ones you most want to ship.

paker hides most of this. pip install paker, paker.dumps(...) on one end, paker.loads(...) on the other, and you get the correct primitive per platform with the gotchas already handled: memfd fds held open on Linux, ephemeral tempfile on macOS, PE loader on Windows with a static-TLS fallback. Source, docs, and a remote-agent example live at github.com/desty2k/paker.

Footnote on memory hygiene: if you're decrypting bytes before loading, the buffer lifetime matters too. mmap + mlock keeps the plaintext out of swap, MADV_DONTDUMP keeps it out of core dumps, and zeroing via a volatile pointer loop after use prevents the compiler from eliminating the scrub. None of this is OS-specific and all of it is secondary to the loading primitive itself.

Ship Python code without leaving source on disk. Encrypted bundles arrive at runtime, load from memory. Works with numpy, pydantic, boto3.

Wojciech Wentland — Wed, 06 May 2026 08:12:28 +0000

Wojciech Wentland

May 6

paker: load encrypted Python packages from memory

#python #security #opensource #programming

Comments

2 min read

paker: load encrypted Python packages from memory

Wojciech Wentland — Wed, 06 May 2026 06:15:00 +0000

Ship Python code without leaving source on disk. Encrypted bundles arrive at runtime, load from memory. Works with numpy, pydantic, boto3.

PyInstaller bundles Python code into a standalone binary. Run pyinstxtractor on that binary and you get every .pyc back in about ten seconds. uncompyle6 turns those into readable source. The "compiled" binary is a zip file with extra steps.

paker takes a different approach. The binary ships without the proprietary code. At runtime, encrypted modules arrive from the network and load directly into memory.

import paker, requests

# Key arrives from your license server. paker doesn't manage keys.
key = requests.post("https://license.example.com/key",
                    json={"license": LICENSE_ID}).content

# loads() accepts dict, str, bytes, or bytearray.
bundle = requests.get("https://cdn.example.com/sdk.paker").json()

with paker.loads(bundle, key=key):
    import proprietary_sdk
    proprietary_sdk.run()

No pip install on the client. Python modules get compiled from memory. Native extensions load through platform-specific loaders. numpy, pydantic, boto3, Pillow, anthropic, and a dozen more packages all work, including ones with compiled C code.

How it loads

paker implements zero-disk loading natively on Windows and Linux. macOS requires an ephemeral temp file (written, loaded, immediately deleted) because of mandatory code signing, and the host binary needs the disable-library-validation entitlement. I'll go into the platform-specific details in a separate post.

What it doesn't protect against

Once loaded, modules live in sys.modules as normal Python objects. Someone with access to the running process can inspect bytecode through marshal.dumps(func.__code__) or walk the heap with gc.get_objects(). Compiled code objects stay resident in the process. mlock and MADV_DONTDUMP keep them out of core dumps, not out of the address space.

A determined attacker with the key and a debugger gets through. This is defense in depth against casual reverse engineering. If someone with a valid license wants to spend a week extracting bytecode, the real protection is legal, not technical.

What paker does protect: the bundle at rest. Without the key, the encrypted payloads are AES-256-CTR + HMAC-SHA256 ciphertext. The key and the bundle always travel separately, through whatever key management you already have (license server, HSM, environment variable). paker handles encryption and loading. Key distribution is your problem.

Transform hooks

paker ships hooks instead of a built-in obfuscator. Pass any callable as ast_transform or code_transform:

import ast, paker

def strip_docstrings(tree: ast.AST, module_name: str) -> ast.AST:
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef, ast.Module)):
            if ast.get_docstring(node):
                node.body.pop(0)
    return tree

bundle = paker.dumps("myapp", key=KEY, ast_transform=strip_docstrings)

There's also a code_transform hook for bytecode-level passes. I spent days building a built-in obfuscator before I understood my own threat model well enough to realize I didn't need one. Separate post coming on that.

Install

pip install paker

Source, docs, examples (including a remote agent that ships the anthropic SDK over TCP to a zero-install client): github.com/desty2k/paker.

I built a read-only MCP server for Akamai

Wojciech Wentland — Wed, 29 Apr 2026 06:00:00 +0000

I had 200+ CDN properties in Akamai and an agent that couldn't find any of them. Akamai's Property Manager API lists properties by group and contract, but there's no fuzzy search endpoint. If the agent doesn't know the exact property name or ID, it's stuck. The conversation dead-ends with "I couldn't find that property" and the user goes back to the Akamai control panel.

So I built an MCP server that wraps Akamai's APIs. 16 tools for searching properties, browsing EdgeWorker code, querying DNS zones, inspecting network lists, and translating error codes. All read-only. I wrote about why I only build read-only MCP servers separately.

Property search with a preloaded index

Akamai organizes properties under groups and contracts. To search across all of them through the API, you'd iterate every group-contract pair and list properties one by one. Slow, and no fuzzy matching.

The server preloads every property into an in-memory index at startup. It fans out API calls across all group-contract pairs in parallel, deduplicates, and builds a list of names. rapidfuzz handles the matching with WRatio as the scorer. WRatio tries multiple comparison strategies (ratio, partial ratio, token sort, token set) and picks the best one, weighted by string length differences. Slower than a simple ratio, but it means "checkout config" matches "checkout.example.com - Production" without the agent needing to know the exact naming convention.

On a real account with 95 groups and 263 properties, the index loads in about 3 seconds. After that, searches hit memory with zero API calls.

One thing I hit early: fanning out 95 concurrent requests without any throttling. Akamai's PAPI has rate limits, and a burst that size at startup can trigger 429s. The server caps concurrency with a semaphore, 10 requests at a time. Still fast enough, no rejected requests.

The index refreshes every 5 minutes in a background task. I described this pattern in Your MCP server is not an API adapter.

EdgeWorker code browsing

Akamai EdgeWorkers are serverless functions that run on CDN edge nodes. The code is stored as tgz archives containing main.js, bundle.json, and supporting files. To read a file, you download the archive, extract it, and find what you need. Doing that on every tool call would be slow.

The server downloads the bundle once, extracts all files into memory, and caches them with cachetools.TTLCache. 1-hour TTL, max 50 entries. After the first download, the agent can list files, read by line range, and search with regex. No repeat downloads.

When the agent asks "what does the main.js of EdgeWorker X look like?", the first call takes a second or two. Follow-up questions like "search for the routing logic" or "show me lines 50-80" are instant.

I considered caching to disk, but these bundles are small (usually under 100KB). Keeping them in memory avoids filesystem management and the cache evicts automatically when TTL expires or the LRU limit hits. The tradeoff is bundles disappear on restart, but the reload is cheap enough that it doesn't matter.

Response shaping

Akamai property rule trees can be hundreds of KB. A typical production property has nested rules with behaviors, criteria, and options. Sending the full JSON wastes context.

The server strips the rule tree before returning it. Keeps rule names, match criteria, behavior configs, and the recursive structure. Removes template UUIDs, format versions, and other internal metadata the agent doesn't need. Property details, activations, and DNS records get the same treatment.

This is more aggressive than just dropping null fields. The raw rule tree has UUIDs on every node, template links, criteria satisfaction mode flags, locked indicators. None of that helps an agent answer "what caching rules are set for this property?" Stripping it cuts the response to maybe a third of the original size.

EdgeGrid auth from scratch

Akamai uses EdgeGrid for API authentication. There's an official edgegrid-python library, but it wraps requests (sync). I wanted httpx (async) with connection pooling, so the server implements EdgeGrid signing directly: HMAC-SHA256 over a canonical request string, base64-encoded, attached as an Authorization header. About 40 lines.

The signing is straightforward from the public spec. The annoying part is that the query string must be included in the signed data, so you have to build the full URL with parameters before signing, then make the request with that same URL.

What the agent can do

With 16 read-only tools, an agent can answer:

"Which CDN property handles checkout.example.com?"
"What caching rules are configured for the API property?"
"Show me the main.js from the latest EdgeWorker version"
"Search the EdgeWorker code for references to the auth header"
"What DNS records exist for example.com?"
"Which IPs are in the production allowlist?"
"What does Akamai error code 9.6f64d440.1318965461.2f2b078 mean?"

Right now these all require the Akamai control panel.

Setup

Add to Claude Code:

claude mcp add akamai -e AKAMAI_HOST=your-host.akamaiapis.net -e AKAMAI_CLIENT_TOKEN=akab-xxx -e AKAMAI_CLIENT_SECRET=xxx -e AKAMAI_ACCESS_TOKEN=akab-xxx -- uvx readonly-mcp-akamai

Create a read-only API credential in Akamai's Identity and Access Management panel. Source and docs: readonly-mcp-akamai on GitHub.

AI coding agents compressed the feedback loop from hours to seconds. I wrote about why that compression looks a lot like the variable-reward patterns behind slot machines and social media.

Wojciech Wentland — Mon, 13 Apr 2026 17:12:41 +0000

Wojciech Wentland

Apr 13

Your coding agent is a slot machine

#ai #productivity #programming #psychology

Comments

4 min read

Your coding agent is a slot machine

Wojciech Wentland — Mon, 13 Apr 2026 08:16:25 +0000

Programming used to have a speed limit. You wrote code, fought the compiler, debugged, tested, and eventually deployed. The hit of satisfaction came when the feature shipped or the tests went green. That cycle took hours. Sometimes days. The delay regulated behavior.

You couldn't binge on deploy-satisfaction because the loop was too slow. The friction was structural. Nobody pulled 14-hour days writing Java servlets because the reward came too fast. If anything, people quit too early because it took too long.

AI coding agents removed the speed limit. I'm using "slot machine" as a behavioral metaphor here, not a clinical diagnosis.

Thirty seconds

Prompt. Result. Satisfaction. Next prompt. The entire arc of "I had an idea, I built it, I saw it work" now fits inside 30 seconds. And it repeats. Indefinitely.

A 2023 paper in Addictive Behaviors by Clark and Zack lays out why this matters. Two factors make non-drug activities addictive: reward variability (you don't know exactly what you'll get) and frequency (how many reward cycles you can fit into a unit of time). They call the second one temporal compression. Social media has both. Loot boxes have both. Slot machines are the textbook case.

AI coding agents have both. Each prompt returns something slightly different (variability). And you can run dozens of cycles per hour (compression). The paper's conclusion is blunt: "By enabling near limitless diversity and speed of delivery of non-drug rewards, digital technology has permitted engineering of reinforcers with addictive potential that, delivered under natural conditions, would likely never become addictive."

They were writing about gambling and social media. They could have been writing about your terminal.

Eight tabs, eight reward streams

Running multiple agent sessions in parallel looks like multitasking. It's actually multiple concurrent reward streams. A study analyzing over a million social media posts found that people adjust their posting frequency to maximize the rate of likes they receive, the same way animals in a Skinner box adjust lever presses to maximize food pellets. Agent tabs work on the same principle. Every switch carries a chance that something finished. A small hit.

The sessions don't compete for your attention. They take turns feeding it.

I have about twenty terminal tabs open right now. Some are from days ago, left around because I'll probably get back to them. Five or six are active. One is pulling together an infrastructure report, one is waiting on CI after fixes it wrote itself, one is something I opened mid-sentence to chase a bug that came up while I was reviewing something else. While writing this paragraph I switched tabs twice to check if a build passed. It doesn't feel like a problem while it's happening.

I keep seeing people call this "an assembly line of productivity." An assembly line runs whether you're paying attention or not. You just keep feeding it. Nobody describes their relationship with a useful tool that way.

Parallel sessions and rapid context switching get marketed as productivity features. Wanting novelty, trying things quickly, jumping between five terminals. But the behavior this produces looks a lot like the variable-reward patterns behind checking your phone 80 times a day.

The meta-tool trap

People are building elaborate systems to manage the chaos of their agent-assisted workflow. Productivity hubs with skill trees. XP points. Urgency scores. Daily summaries that rate your day out of 10. Automatic task splitting when you miss a deadline.

They gamified the thing that was already acting like a game.

The meta-work around the compulsive work becomes its own loop. Hit from the agent completing a task. Hit from watching the XP bar move. Hit from the daily score. And then you need a system to manage that system, and at some point you're four layers deep in productivity tooling and haven't shipped anything in a week.

Starting is the drug

The first hour of a new project has the highest novelty density. Agents make that phase unusually cheap. Everything is possible, nothing is broken yet, and the code just keeps coming.

Finishing is edge cases. Tests for the boring paths. The last 20% that takes 80% of the time. Reward density drops off a cliff. So you open a new tab and start something else.

You see the pattern everywhere once you look. Bursts of intense output followed by nothing. Somebody produces a hundred pieces of content in six weeks, then drops to zero. Twenty projects in various stages of beginning, none shipped.

And building the thing is the easy part. Finding users, handling support, keeping the service running at 3am, writing docs nobody reads, negotiating contracts, doing the marketing that actually brings people in. None of that gives you a dopamine hit, and none of it fits in a 30-second prompt cycle. The agent can scaffold an app in an afternoon. It can't make anyone care about it. The work that makes a product real is exactly the work that the reward loop skips over.

The behavior tracks novelty, not value. And since starting costs almost nothing now, you run out of interest before you run out of ideas.

The accidental guardrail

The most revealing thing about these tools is the feature nobody asked for: usage limits.

Usage caps end up acting as a hard stop. That's a strange role for a productivity tool, but it's clearly how some people experience these products. Claude's own usage limit docs discuss how to work within the caps. Some users describe the cap running out as the thing that makes them stop for the night. In a r/ClaudeAI thread, one person described deliberately downgrading their plan so it would expire before midnight. Recognized the pattern, reintroduced friction on purpose.

A productivity tool where the most effective safety feature is running out of capacity.

Your text editor doesn't need a cooldown timer. Nobody ships an IDE with "take a break" reminders. Productivity tools don't usually come with harm reduction features.

The token limit is a circuit breaker. Most of the discourse around it is people asking how to get rid of it.

Why I only build read-only MCP servers

Wojciech Wentland — Thu, 09 Apr 2026 13:43:42 +0000

Every MCP server I build is read-only. List, search, get, read. No create, update, delete, activate, purge.

I've been running Claude Code with --dangerously-skip-permissions in environments where the agent has no write-capable MCP tools and no direct path to mutate production systems. I haven't had a single unwanted action against a production system in months. Not because I trust the model to never hallucinate. Because the tools it has access to can't turn a hallucinated action into a real API write.

Read-only doesn't make an agent safe. It removes an entire class of failures.

The failure mode isn't hypothetical

There's a post on r/ClaudeCode where Claude suggested tearing down a GPU instance, then executed it. The user never confirmed. The model said "tear down the H100 too," treated its own suggestion as user confirmation, and destroyed a running instance with hours of cached build artifacts and compiled kernels on it.

The model later admitted: "I hallucinated you saying that. You never said those words. I said it, then executed it as if you'd agreed."

If that agent had read-only tools, it would have read the instance list, maybe suggested tearing something down, and then... nothing. The suggestion dies as text. No one loses a machine.

How I actually use agents

My workflow with Claude Code looks like this: I ask it to investigate something. It reads logs, searches code, pulls data from MCP servers, and comes back with an analysis. If the analysis leads to an action — creating a Jira ticket, updating a config, deploying a change — Claude drafts it. I review the draft, then I do the action myself.

The agent reads and analyzes. I act.

I trust the model's judgment on what to write in a ticket. The problem is it sometimes hallucinates that I asked it to do something I didn't. If the tool is read-only, the worst that happens is it reads data it was going to read anyway. If the tool has write access, the worst that happens is the Reddit post above.

Approval fatigue is the real problem

"But there's a confirmation prompt before destructive actions." Sure. Claude Code asks before running commands. The problem is approval fatigue. After confirming 50 read operations, you stop reading the prompts. You click yes. And then the 51st one is vastai destroy instance 34122719.

Anthropic wrote about this in their sandboxing post. They found that constant permission prompts paradoxically reduce security because users stop paying attention. Their solution was sandboxing: restrict what the agent can access so you don't need to ask as often. They reduced permission prompts by 84% while maintaining security.

Read-only MCP servers follow the same logic. If the server can't write, you don't need to confirm writes. The agent operates freely within the read boundary. No fatigue, no missed confirmation on a destructive action.

That's why I run --dangerously-skip-permissions. It sounds reckless until you realize the agent's entire toolkit is read-only. There's nothing dangerous to skip permission for.

What this doesn't cover

Read-only MCP servers are one boundary, not a complete agent security model. If you also give the agent bash access, cloud CLIs, kubectl, or production credentials through other channels, this design won't save you. Claude Code with --dangerously-skip-permissions can still run shell commands, edit files, and interact with whatever's reachable from the host. Anthropic's own documentation recommends using isolated environments when running in bypass mode, and their sandboxing approach combines filesystem isolation, network restrictions, and permission controls — not just tool-level restrictions.

This article is about the MCP boundary specifically. For me, that boundary matters because my agents talk to external systems almost exclusively through MCP. But it's one layer, not the whole stack.

Beyond the IDE

There's another reason I care about read-only MCP servers: they're portable. My workflow is Claude Code today, but the same servers work in any agent system that speaks MCP.

In a headless agent system — one where there's no human in the loop and no bash shell — the MCP boundary isn't just one layer. It's the only interface the agent has to external systems. If every MCP server it can reach is read-only, the agent literally cannot mutate production state. No sandboxing needed, no permission prompts, no approval fatigue. The tools themselves are the guardrail.

This matters if you're building agent systems for other users. Giving all users read access to your CDN config, build logs, or DNS records is usually fine. Giving all users write access is a different conversation entirely. Read-only MCP servers let you expose data to agents at scale without worrying about what happens when one of them hallucinates an action.

What read-only servers are good for

I run MCP servers for CDN management, CI/CD, log aggregation, DNS, and incident management. All read-only. The questions I ask look like: "What's the current CDN config for checkout?" "Which build failed last night?" "Compare caching rules between production and staging." "Draft a Jira ticket for the DNS change we discussed."

Claude produces the draft text. I copy it into Jira or GitHub myself. Nothing in this workflow needs the agent to write to the target system.

The credential argument

Getting a read-only API credential approved is a conversation. "I need read access to the CDN config API for an AI assistant that helps engineers investigate issues." Most teams say yes.

Getting a write credential is different. "I need an AI agent to be able to modify CDN configurations." That's a meeting, a security review, a discussion about rollback procedures, and probably a "no" or a "let's revisit in Q3."

Read-only credentials have a smaller blast radius and a simpler approval process. They also happen to cover every use case I actually have.

What this means for MCP servers

Every MCP server I publish follows this: read-only by design. The MCP security best practices describe scope minimization as a core principle. Start with the minimum privileges, elevate only when required. My servers don't elevate.

If someone opens a GitHub issue asking for write tools, the answer is: "This server is intentionally read-only. Fork it if you need write operations." That's not laziness. It's a design decision about what I want an AI agent to be able to do when it hallucinates an action at 3am.

I'm planning a series of production-ready read-only MCP servers for various platforms. More on that soon.

Your MCP server is not an API adapter

Wojciech Wentland — Wed, 08 Apr 2026 19:59:13 +0000

A lot of MCP servers I see in the wild look like this:

@mcp.tool()
async def get_thing(id: str):
    resp = await httpx.get(f"https://api.example.com/things/{id}")
    return resp.json()

Fetch, forward, done. A thin HTTP proxy with a JSON Schema wrapper. For some
use cases, that's enough.

The servers I keep coming back to do something different. They hold state and
pre-compute answers. An agent hitting a thin wrapper might need three round
trips and 30 seconds. The same agent hitting a server that does real work gets
its answer in one call, under a millisecond.

Preloaded in-memory index

Here's a failure mode I run into constantly: the agent needs to find something
but doesn't know the exact ID. Most APIs only support exact lookups. No ID, no
result. The conversation dead-ends with "I couldn't find that resource" and the
user gives up.

I built a server that wraps a CDN management API. Hundreds of properties, and
the agent regularly needs to find which one handles a given hostname. The API
has a search endpoint, but it's slow, requires exact matches, and sometimes
returns 403 depending on account permissions.

So the server loads every property into memory at startup:

class PropertyIndex:
    _entries: list[PropertyEntry] = field(default_factory=list)
    _name_index: dict[str, int] = field(default_factory=dict)

    async def load(self, refresh_interval: int = 300):
        await self._build_index()
        self._refresh_task = asyncio.create_task(self._refresh_loop())

    def search_by_name(self, query: str, limit: int = 50):
        names = [e.property_name for e in self._entries]
        matches = process.extract(
            query, names, scorer=fuzz.WRatio,
            limit=limit, score_cutoff=50,
        )
        return [self._entries[idx].to_dict() for _, score, idx in matches]

Builds once by fanning out parallel API calls, deduplicates, refreshes every
five minutes in the background. Lookups take under a millisecond.

Without this, the agent guesses at exact property names, picks the wrong one,
retries, burns three turns. With the index, someone types "the CDN config for
checkout" and gets the right answer first try. That's the kind of difference
that decides whether people keep using the agent or go back to doing it
manually.

I did the same thing for a CI/CD server. The API lets you fetch a build config
by ID, but there's no fuzzy search. If you don't know the ID, you're stuck. The
server caches all build configurations at startup, runs fuzzy matching against
them. The agent says "find the deploy job for the payments service" and gets a
ranked list instantly, even though the CI system itself can't do that.

Embedded analytical database

I have another server that sits in front of a relational database. Some tables
have 20 million rows. The agent needs to answer analytical questions, things
like "which providers have the highest volume in this region?" or "show me the
top performers for a given category."

The database wasn't designed for these queries. It was built for a web UI with
narrow, well-indexed lookups. The agent's access patterns are different: it asks
broad analytical questions that require joins across tables the application
never joins. Adding indexes wasn't an option either, because the database is
owned by another team and optimizing it for an AI agent's query patterns wasn't
on anyone's roadmap. Some of these queries took 10 to 30 seconds on a read
replica, and in an agent loop where that latency gets multiplied by however many
tool calls the agent needs, the conversation times out before it gets anywhere.

The server embeds DuckDB in-process and loads pre-aggregated views and lookup
tables at startup. Some are straight copies of small reference tables. Others
are materialized summaries that flatten joins the source database was never
designed to run efficiently, the kind of cross-table aggregations that make
sense for an analytical question but would be expensive on a schema built for
transactional web UI lookups:

class DuckDBCache:
    async def start(self):
        self._conn = duckdb.connect(":memory:")
        for key, config in fast_configs.items():
            await self._load_table(key, config)
        self._ready = True
        self._deferred_task = asyncio.create_task(
            self._load_deferred(deferred_configs)
        )
        self._refresh_task = asyncio.create_task(self._refresh_loop())

Each table has a fingerprint query (a cheap COUNT(*) or checksum) that the
refresh loop checks before doing a full reload. Large tables load in the
background after the server is already taking requests. If something asks for a
table that hasn't loaded yet, it falls back to the source database.

The 30-second query now takes under a millisecond. The agent can actually have a
back-and-forth with the user instead of timing out after the first question.

There's a query-result cache on top of this too. It has a prewarm manifest,
basically a list of common queries that run at startup so the first person to
use the agent on Monday morning doesn't sit through a cold start.

class QueryCache:
    async def get_or_compute(self, cache_key, compute_fn, ttl=None):
        cached = self.get(cache_key)
        if cached is not None:
            return cached

        result = await compute_fn()
        if "error" not in result:
            self._put(cache_key, result, ttl or self._default_ttl)
        return result

It skips caching error responses. If a query fails because the database is
temporarily overloaded, you don't want that failure served for the next hour.
That one took a production outage to figure out.

Data transformation

Every server I build strips the upstream API response before returning it.
Token usage scales with response size, and most APIs return 10x more data than
the agent will ever look at.

One API I work with returns objects with 60+ fields. The server keeps maybe 8:

def _slim_record(r: dict):
    return _strip_nulls({
        "id": r.get("id"),
        "name": r.get("name"),
        "total_value": _cents_to_major(r.get("total_value_cents")),
        "annual_value": _cents_to_major(r.get("annual_value_cents")),
        "start_date": r.get("start_date"),
        "end_date": r.get("end_date"),
        "status": _effective_status(r),
    })

_cents_to_major converts cents to dollars. The raw API stores monetary values
in cents. Before I added this conversion, 100% of the reports the agent
generated had wrong numbers. Every dollar amount was off by a factor of 100. A $2,000 contract showed up as
$200,000 in the report because the agent treated cents as dollars. No amount of prompt engineering fixed it reliably. Moving the conversion
into the server did.

_effective_status is the other one worth mentioning. The API's status field
can say "active" on a record that ended three months ago. The platform's own UI
derives the real status from multiple fields, so the MCP server does the same:

def _effective_status(r: dict) -> str:
    stage = r.get("stage")
    if stage in ("terminated", "not_renewed"):
        return "inactive"

    if r.get("end_date_not_applicable") or r.get("renewal_type") == "perpetual":
        return r.get("status", "undetermined")

    end_date = r.get("end_date")
    if end_date:
        if date.fromisoformat(end_date) < date.today():
            return "inactive"

    return r.get("status", "undetermined")

Now the agent gives the same answer a human would get looking at the UI.
Stripping nulls across a list of 50 records also saves a few thousand tokens
per response, which adds up.

A log aggregation server I built does something similar: auto-appends
| json auto to queries that don't have a field extraction operator, truncates
raw log lines to 500 characters, converts epoch-millisecond timestamps to
ISO 8601. Small fixes that add up to the agent not wasting turns fighting the
format.

Download once, serve from cache

Some data is expensive to fetch. PDF documents. Code bundles in tgz archives.
The pattern: download on first access, extract the text, build a line offset
index, serve everything from memory after that.

class CachedFile:
    def __init__(self, content: str):
        self.content = content
        offsets = [0]
        pos = 0
        while True:
            pos = content.find("\n", pos)
            if pos == -1:
                break
            pos += 1
            if pos < len(content):
                offsets.append(pos)
        self._offsets = offsets

    def get_lines(self, start: int, end: int) -> str:
        return self.content[self._offsets[start-1] : self._line_end_offset(end)]

I use this for CDN edge function code bundles and PDF documents (extracted with
PyMuPDF). After the first download, the agent reads by line range, searches
with regex, lists the file tree. No repeat downloads. Reading through a
200-page document becomes "just read" instead of "download, extract, read" on
every question.

When thin is fine

Not everything needs this treatment. A server that translates natural language
to a query language and passes it to an API is fine as a thin wrapper. The
translation is the value there. Same for simple lookup tools.

The question I ask: does the agent hit the same data twice? Does the API return
more than the agent needs? Is the API response time slow enough that the agent
loop feels broken? If yes, the server should be doing work.

The multiplier

When a person uses a web UI, they look at a page, think, click something else.
One request at a time, processed by a human brain. An agent works differently.
It makes five tool calls, stuffs all five responses into its context window, and
reasons over them at once. A slow response gets multiplied by every call. A
60-field JSON blob gets multiplied by every call. It adds up fast.

I've measured the difference. CDN property lookups went from three agent turns
to one once the fuzzy index was in place. Analytical queries went from timing
out at 30 seconds to returning in under a millisecond from DuckDB. And every
single dollar amount in every report was wrong until the server started
converting cents for the agent.

You can try to fix that last one with prompt engineering. I tried for weeks. The agent still got it wrong often enough that I couldn't trust the output.

DEV Community: Wojciech Wentland

Ship Python code without leaving source on disk. Encrypted bundles arrive at runtime, load from memory. Works with numpy, pydantic, boto3.

paker: load encrypted Python packages from memory

[Boost]

Loading code without the disk: what each OS lets you get away with

Loading code without the disk: what each OS lets you get away with

Linux: memfd_create is the clean case

glibc can treat a reused /proc/self/fd/N path as the same library

macOS: "zero disk" is a lie on Apple Silicon

Windows: a full PE loader, until Static TLS stops you

Static TLS is where my in-memory Windows path stops

What you actually get, by platform

Ship Python code without leaving source on disk. Encrypted bundles arrive at runtime, load from memory. Works with numpy, pydantic, boto3.

paker: load encrypted Python packages from memory

paker: load encrypted Python packages from memory

How it loads

What it doesn't protect against

Transform hooks

Install

I built a read-only MCP server for Akamai

Property search with a preloaded index

EdgeWorker code browsing

Response shaping

EdgeGrid auth from scratch

What the agent can do

Setup

AI coding agents compressed the feedback loop from hours to seconds. I wrote about why that compression looks a lot like the variable-reward patterns behind slot machines and social media.

Your coding agent is a slot machine

Your coding agent is a slot machine

Thirty seconds

Eight tabs, eight reward streams

The meta-tool trap

Starting is the drug

The accidental guardrail

Why I only build read-only MCP servers

The failure mode isn't hypothetical

How I actually use agents

Approval fatigue is the real problem

What this doesn't cover

Beyond the IDE

What read-only servers are good for

The credential argument

What this means for MCP servers

Your MCP server is not an API adapter

Preloaded in-memory index

Embedded analytical database

Data transformation

Download once, serve from cache

When thin is fine

The multiplier

Linux: `memfd_create` is the clean case

glibc can treat a reused `/proc/self/fd/N` path as the same library