What goes wrong when you're building a SaaS with AI agents against a real WordPress codebase? This post started as a reply to a comment on my LinkedIn launch post. The question: "Where do you see the first real bottleneck — in the AI pipeline itself, or in maintaining reliability across diverse real-world WordPress environments?" The answer got long enough that it deserved its own page. What follows is a raw breakdown of the 11 real bottlenecks we hit building StrictWP, a WordPress management platform built by a human founder directing a team of AI coding agents.
Short answer: The WordPress environment bottlenecks hit first and were more surprising. The AI orchestration bottlenecks were harder to solve and keep recurring. Neither category is "the" bottleneck — they compound each other.
Phase 1: WordPress Environment Bottlenecks (Hit First)
These are the problems that showed up the moment we started building our SaaS with AI agents against real WordPress sites — before we even had an orchestration pipeline.
1. SSH + wp-cli is not the clean API you think it is
WordPress doesn't have a management API. To inspect or update a remote site, we run wp-cli commands over SSH (`wp --ssh=user@host/path`). Sounds simple. It isn't.
- Noisy output: PHP notices, deprecation warnings, and plugin startup messages get mixed into wp-cli's JSON output. We had to write `_extract_json()` — a function that scans through garbage to find the actual JSON blob wp-cli meant to return. This is not an edge case; it happens on maybe 30% of real-world sites.
- Server software detection is impossible: We wanted to detect whether a site runs Apache or Nginx. The PHP SAPI (Server API) variable returns `cli` when you're running over SSH — because you are running the CLI. We dropped the feature entirely. Sometimes the pragmatic answer is "don't."
- Shell users aren't always real users: On CWP Pro (a hosting panel), user accounts exist for web file ownership but are deliberately locked out of shell access. The system uses `pam_limits` to set hard limits that look like this:

      username hard nproc 0
      username hard nofile 0

  This means SSH key authentication succeeds (the PAM auth module passes), but the limits module then kills the session before bash can start. The error message is completely unhelpful. We had to read CWP's source to figure out why SSH login silently failed after key injection. The fix: detect and remove the lockout file, but only when its contents exactly match the known lockout pattern (a guarded regex — we don't want to delete a custom limits config that happens to be in the same directory).
- Hosting panel API diversity: Cloudways has a REST API but nests app data inside server responses (there is no `/app` endpoint — you have to call `/server` and flatten the nested response). Credential endpoints use query params, not path params. The response key is `app_creds` and the user field is `sys_user`, not `username`. CWP Pro's API runs on port 2304 with self-signed SSL and wraps every response in a non-standard `msj` (message) key instead of something predictable like `data` or `result`. Every hosting platform speaks a different dialect, and the documentation is sparse. Each integration is days of trial-and-error against live servers.
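The scanning approach can be sketched in a few lines of Python. This is a simplified illustration, not StrictWP's actual `_extract_json()`; it assumes the payload is a `{...}` or `[...]` blob somewhere in the output with noise lines before or after it:

```python
import json

def extract_json(raw: str):
    """Find the JSON blob wp-cli meant to return, ignoring PHP notices
    and plugin chatter printed around it.

    Simplified sketch: try to parse from each '{' or '[' onward,
    trimming trailing noise one line at a time until json.loads() works.
    """
    starts = [i for i, ch in enumerate(raw) if ch in "{["]
    for i in starts:
        chunk = raw[i:].strip()
        while chunk:
            try:
                return json.loads(chunk)
            except json.JSONDecodeError:
                # Drop the last line of trailing noise and retry
                chunk = chunk.rsplit("\n", 1)[0].strip() if "\n" in chunk else ""
    return None

noisy = (
    "PHP Notice: some_plugin_function is deprecated\n"
    "Warning: unexpected output before headers\n"
    '[{"name": "akismet", "status": "active"}]\n'
    "Some plugin printed this after the JSON\n"
)
print(extract_json(noisy))
```

The real version has to be more paranoid (nested brackets inside noise, truncated output), but the principle is the same: never trust that wp-cli's stdout is only JSON.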
2. The "works on 95% of sites" problem
WordPress sites are astonishingly diverse. Same CMS, wildly different configurations:
- Plugin update failures are opaque: When wp-cli fails to update a plugin, it returns something like:
Error: No plugins updated (1 failed).
That tells you nothing. We rewrote error extraction to parse `Warning:` lines from the actual output — things like `Could not find a valid zip file` — which gives the user something actionable. But this required understanding wp-cli's output format intimately, filtering out noise about cache directories and temporary paths.
- License-gated updates: WooCommerce extensions, Elementor Pro, and other premium plugins require active license keys to download updates. If the key is expired, wp-cli gets a malformed download URL and fails silently or with a cryptic error. This isn't a StrictWP bug, but our users see it in our UI, so it's our problem to explain clearly.
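The warning-filtering step reduces to a line scan. A hedged sketch (the function name and the noise markers are illustrative assumptions, not our real filter list, which is tuned against live output):

```python
def extract_actionable_errors(output: str) -> list[str]:
    """Pull human-actionable 'Warning:' lines out of raw wp-cli output,
    skipping noise about cache directories and temp paths.

    Illustrative sketch only; real-world filters need ongoing tuning.
    """
    NOISE_MARKERS = ("/tmp/", "cache", "Temporary")  # assumed noise patterns
    warnings = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("Warning:"):
            continue
        if any(marker in line for marker in NOISE_MARKERS):
            continue
        warnings.append(line.removeprefix("Warning:").strip())
    return warnings

raw = """Warning: Failed to create directory in cache path
Warning: Could not find a valid zip file for elementor-pro
Error: No plugins updated (1 failed)."""
print(extract_actionable_errors(raw))
```

The user sees "Could not find a valid zip file for elementor-pro" instead of "1 failed" — same data wp-cli already printed, just surfaced instead of buried.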
3. Backup reliability across diverse hosting
Our backup system SSHs into each site, rsyncs files down, and stores them in Backblaze B2 via restic. The edge cases:
- Some hosts have SSH but block rsync
- Some hosts have aggressive connection timeouts that kill long-running transfers
- File permissions vary wildly (some sites have files owned by `www-data`, others by the user, others by `nobody`)
- Our SLA monitor (`backup-check`) runs twice daily and emails alerts if any site hasn't had a successful backup in 25 hours — because "it probably worked" isn't good enough when someone needs a restore
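The SLA check itself is simple arithmetic over timestamps. A minimal sketch of the 25-hour rule (illustrative; the real `backup-check` reads restic snapshot metadata and sends email rather than printing):

```python
from datetime import datetime, timedelta

# Twice-daily backups give a 12-hour cadence; 25 hours allows one
# missed run plus an hour of slack before we alert.
SLA_WINDOW = timedelta(hours=25)

def stale_sites(last_success: dict[str, datetime], now: datetime) -> list[str]:
    """Return sites whose most recent successful backup is older than the SLA window."""
    return sorted(
        site for site, ts in last_success.items()
        if now - ts > SLA_WINDOW
    )

now = datetime(2025, 6, 1, 12, 0)
last = {
    "client-a.example": datetime(2025, 6, 1, 2, 0),   # 10 hours ago: fine
    "client-b.example": datetime(2025, 5, 30, 9, 0),  # ~51 hours ago: alert
}
print(stale_sites(last, now))  # a real run would email this list
```

The hard part isn't this check; it's making "last successful backup" a trustworthy timestamp across hosts that block rsync or kill long transfers.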
These WordPress bottlenecks stay mostly solved once you've handled each edge case, but they keep expanding as we add support for new hosting panels. Every new provider (Cloudways, CWP Pro, eventually Vultr, GridPane, etc.) brings its own set of API quirks, SSH configurations, and authentication models.
Phase 2: AI Orchestration Bottlenecks (Harder, Keep Recurring)
Once we had a working platform, building the SaaS with AI agents meant constructing an orchestration pipeline to accelerate development. This is where the bottlenecks got more interesting — and more persistent.
4. Permission prompts destroy flow
Claude Code (the AI coding tool) runs in a terminal and asks for permission before executing shell commands it considers potentially dangerous. This is good security design. It's also a massive bottleneck when you're trying to run a multi-step pipeline autonomously.
The compound command pattern was the first killer:
```
cd /path && git commit -m "message"
```
This triggers a "bare repository attack" security guard. The fix:
```
git -C /path commit -m "message"
```
One command, no compound operators, no permission prompt. Similarly, heredoc inside command substitution triggers a check:
```
git commit -m "$(cat <<'EOF'
commit message here
EOF
)"
```
Fix: write the message to a temp file instead:
```
git commit -F /tmp/msg.txt
```
General principle we learned: avoid `&&`, `||`, and `;` with git when a single invocation with flags does the same thing. Fewer prompts means a faster pipeline. This sounds trivial, but it took days of friction to systematize.
5. Context window limits force architectural decisions
Large language models have a finite context window — think of it as short-term memory. When your codebase has 15+ database tables, 30+ PHP classes, a React SPA, and a Perl backend, you can't fit it all "in memory" at once.
Our solution: a 7-agent pipeline where each agent has a narrow job:
| Agent | Job |
|---|---|
| IssueOps | GitHub branch/PR lifecycle |
| Coder | Writes code + tests |
| Reviewer | Security + permissions review |
| Tester | Runs test suite, parses results |
| DBA | Database migrations |
| Ship | Deploy queue serialization |
| Scout | Post-deploy browser verification |
We name everything. Ship and Scout (the deploy and browser-verification agents) were the first two to get names. The long-running sessions followed: Cowork, Cody, Carl, Cathy — each one a separate Claude instance with its own clone of the repo, its own Docker stack, and its own personality quirks. Naming them started as a joke; it stuck because "tell Ship to hold the deploy" is clearer than "tell agent 6 to hold," and "Cathy wrote the blog post" is clearer than "the marketing session wrote the blog post."
Each agent reads only what it needs. The Coder doesn't need to understand deploy infrastructure. The Tester doesn't need to read business logic — just run the suite and report pass/fail. This is essentially microservices architecture applied to AI agents.
A bonus win: token efficiency. Because each agent's context is scoped to just its job, we burn far fewer tokens per task than we would with a single monolithic session that has to hold the entire codebase in context. The coordinator routes; the specialists execute. Total token spend goes down even as throughput goes up.
The tricky part: the coordinator (Opus, the most capable model) has to hold enough context to route between agents intelligently without doing their jobs. We write detailed agent definition files (`.claude/agents/*.md`) so each agent knows its role, constraints, and conventions. When something falls through a crack — say, the Reviewer misses a multi-tenancy bug because the context about our permission model wasn't in its prompt — we update the agent definitions. It's essentially documentation-driven architecture where the docs are actually load-bearing.
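To make the shape concrete, here is a hypothetical agent definition — not StrictWP's actual file, just the kind of load-bearing doc we mean:

```markdown
# Reviewer

Role: security and permissions review of every PR before merge.

Must check:
- Multi-tenancy: every DB query filters by the owning account's ID.
- Capability checks on every REST endpoint; never trust client-side state.
- No secrets in diffs, logs, or test fixtures.

Must NOT:
- Edit code. Report findings; the Coder fixes them.
- Approve its own suggested changes.
```

When a class of bug slips through, the fix usually lands in a file like this, not in a prompt typed from memory.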
6. Flag-file races and concurrent operations
We built the flag-file system because our first AI session (Cowork) runs inside a sandboxed VM with no direct access to the outside world — no SSH, no git push, no GitHub CLI. This was before we started using Claude Code. The only way for Cowork to trigger a deploy or run a remote command was to write a flag file; a watcher on the host Mac would pick it up, execute the real command, and write the result back. When we later added Claude Code sessions running directly on the Mac (with full world access), they inherited the same flag-file workflow just to keep things consistent — the watcher was already running and the conventions were established. Simple, but fragile:
- Two agents writing `.deploy` at the same time would corrupt each other
- A stale result file from a previous run could be read by the wrong agent
- Long-running SSH commands would block all other operations

We solved this incrementally:

- Deploy queue (`bin/deploy-queue`): serializes all deploys through a FIFO queue with lock files. Stale locks (>90 seconds) are auto-cleared.
- Unique IDs (`bin/run-cmd`): each command gets a UUID. The flag file is `.channel-{uuid}`, the result is `.channel-result-{uuid}`. No collisions even when multiple agents fire concurrently.
- Concurrent dispatch: all flag handlers run in background subshells. A long SSH command no longer blocks local operations.
- Timeout safety: SSH calls include `ConnectTimeout=10`, `ServerAliveInterval=5`, `ServerAliveCountMax=3` — a hung connection auto-terminates after ~25 seconds instead of blocking forever.
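The UUID handshake reduces to: write `.channel-{uuid}`, poll for `.channel-result-{uuid}`. A stripped-down Python sketch — the file-name convention matches the list above, but the watcher here is a fake stand-in so the demo is self-contained (the real watcher is a daemon on the host Mac that executes actual commands):

```python
import json
import tempfile
import threading
import time
import uuid
from pathlib import Path

def dispatch(workdir: Path, command: str, timeout: float = 30.0) -> dict:
    """Write a uniquely named flag file, then poll for the matching result.

    Each request/result pair carries its own UUID, so concurrent agents
    can never pick up each other's results.
    """
    req_id = uuid.uuid4().hex
    flag = workdir / f".channel-{req_id}"
    result = workdir / f".channel-result-{req_id}"
    flag.write_text(json.dumps({"cmd": command}))

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if result.exists():
            payload = json.loads(result.read_text())
            result.unlink()  # consume the result so nothing stale lingers
            return payload
        time.sleep(0.05)
    raise TimeoutError(f"no result for {command!r} within {timeout}s")

def fake_watcher(workdir: Path) -> None:
    """Stand-in for the host-side watcher daemon: answer every flag it sees."""
    for _ in range(3000):  # ~30s of polling, plenty for a demo
        for flag in workdir.glob(".channel-*"):
            if flag.name.startswith(".channel-result-"):
                continue  # that's an answer, not a request
            req_id = flag.name.removeprefix(".channel-")
            cmd = json.loads(flag.read_text())["cmd"]
            flag.unlink()
            (workdir / f".channel-result-{req_id}").write_text(
                json.dumps({"cmd": cmd, "exit": 0})
            )
        time.sleep(0.01)

workdir = Path(tempfile.mkdtemp())
threading.Thread(target=fake_watcher, args=(workdir,), daemon=True).start()
print(dispatch(workdir, "wp plugin list --format=json"))
```

Note what the UUID buys: the fix for the `.deploy` corruption wasn't a lock, it was making collisions structurally impossible. Locks came later, and only for the one resource (the deploy slot) that genuinely must be exclusive.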
This was one of the hardest bottlenecks because it only manifested under concurrency — a single agent working alone never hit these races. It took real incidents (two deploys colliding, a stale result causing the wrong agent to proceed) before we built the queue.
7. Agent coordination without shared state
Each agent runs in its own isolated context. Agent A can't see what Agent B is doing. This means:
- The Coder finishes, but the Reviewer needs to know what changed: We pass file lists and branch names explicitly in each agent's prompt. There's no "shared clipboard."
- Two agents shouldn't edit the same file: We have working agreements — Vinny (the human) coordinates the project, makes architectural decisions, and handles QA/E2E testing. The AI agents handle all code — PHP, React, REST API, Perl workers. Agents coordinate before touching shared files like `deploy.sh` or schema migrations.
- Inter-session messaging: We built `bin/relay-msg` so that separate Claude sessions (Cody, Cowork, Carl, Cathy) can send short messages to each other through a local MySQL table. Direct messages, scoped broadcast channels (`dash` for the dashboard team, `mkt` for marketing, `all` for everyone) — so a dashboard conversation doesn't ping the marketing session and vice versa. A hook checks for incoming messages on every prompt. This is how the message that prompted this post was delivered.
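The relay schema is tiny. Here's a sketch using SQLite in place of the MySQL table — same idea (direct messages plus scoped channels), but the table name, columns, and `@name` DM convention are illustrative assumptions, not the real `bin/relay-msg` schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE relay_msg (
        id      INTEGER PRIMARY KEY,
        sender  TEXT NOT NULL,
        channel TEXT NOT NULL,   -- 'dash', 'mkt', 'all', or '@name' for DMs
        body    TEXT NOT NULL,
        read_by TEXT DEFAULT ''  -- comma-separated sessions that have seen it
    )
""")

def send(sender: str, channel: str, body: str) -> None:
    db.execute("INSERT INTO relay_msg (sender, channel, body) VALUES (?, ?, ?)",
               (sender, channel, body))

def inbox(session: str, subscriptions: tuple[str, ...]) -> list[tuple[str, str]]:
    """What the per-prompt hook would check: unread messages addressed to this
    session directly, to a channel it subscribes to, or to 'all'."""
    channels = subscriptions + ("all", f"@{session}")
    placeholders = ",".join("?" * len(channels))
    rows = db.execute(
        f"SELECT id, sender, body FROM relay_msg "
        f"WHERE channel IN ({placeholders}) AND instr(read_by, ?) = 0",
        (*channels, session),
    ).fetchall()
    for msg_id, _, _ in rows:
        db.execute("UPDATE relay_msg SET read_by = read_by || ? WHERE id = ?",
                   (session + ",", msg_id))
    return [(sender, body) for _, sender, body in rows]

send("vinny", "@cody", "document the bottlenecks from memory")
send("cowork", "dash", "holding the deploy until tests pass")
print(inbox("cody", subscriptions=("dash",)))
print(inbox("cathy", subscriptions=("mkt",)))  # dash traffic doesn't ping marketing
```

The channel scoping is the whole point: an empty inbox for Cathy on every prompt costs almost nothing, while a cross-team broadcast still reaches everyone through `all`.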
How this post was written: Vinny asked Cody (one of the AI coding sessions that built StrictWP) to document the bottlenecks from memory. Cathy (the marketing-focused AI session) shaped it into the post you're reading now. Vinny reviewed for accuracy — his memory isn't as good as Cody's, but his judgment calls on what matters are better. What you're reading is the result of that three-way collaboration, delivered through the same relay system described above.
8. Inherited tooling that doesn't fit every session
The flag-file system described in #6 was built for Cowork's sandboxed VM. When we added Cody — running directly on Vinny's Mac with full world access (SSH, gh CLI, direct git) — it inherited the same flag-file workflow. Cody could just run `ssh vinny@wp-backup-01 "..."` directly, but we kept the indirection for consistency. This is a general pattern worth watching for: tooling built for one constraint gets cargo-culted into contexts where it doesn't apply. Recognizing when to shed inherited complexity is its own ongoing bottleneck.
9. Shared directories → session isolation
This was arguably the most disruptive bottleneck in the AI pipeline. Initially, two long-running AI sessions (Cowork and Cody) shared the same project directory. The result:
- One session's uncommitted changes would block the other's `git pull`
- Both sessions editing the same file meant one would silently overwrite the other
- Git state (current branch, staging area, stash) is inherently single-tenant — two writers in one repo is a race condition by design
The first fix was isolation via `git worktree`: worktrees give each agent an isolated checkout of the repo. This worked well for short-lived sub-agents (Coder, Reviewer, Tester) that do their work and exit. But it didn't solve the problem for the long-running orchestrator sessions that need persistent state.
The real fix was giving each session its own full clone of the repo and its own identity: Cowork, Cody, Carl, Cathy — each with a separate directory, separate Docker stack (different ports, container names, volumes), and separate flag-file namespace. The watcher monitors all directories independently. A deploy triggered from Cowork's directory uses Cowork's code; a deploy from Cody's directory uses Cody's code.
Once sessions stopped fighting over shared state, the coordination problem shifted from "don't collide" to "stay informed." That's when the relay messaging system (#7) became essential. The shared directory had been providing implicit coordination (you could see the other session's changes); explicit messaging turned out to be far more reliable.
10. The "dev environment" bottleneck
For the first couple of weeks, all testing happened on a live dev VPS. This created problems:
- Deploying a feature branch to dev could break it for other sessions testing simultaneously
- SSH-dependent features couldn't be tested without a real server
- Schema migrations on dev could conflict with prod
We're solving this with Docker: a full local stack (MariaDB, nginx, PHP-FPM, Perl workers) that mirrors production. Each developer/agent clone gets its own Docker stack on different ports. Feature testing happens locally; the dev VPS is now reserved for user-acceptance testing of main only. Phase 2 will add mock SSH targets so backup/restore flows can be tested without any real servers.
11. Git workflow in an AI context
Things that are simple for humans become bottlenecks for AI agents:
- Rebasing can silently break things: Git reports "no conflicts" but auto-merges can be semantically broken — two branches both adding a method in the same class can result in duplicate definitions that compile fine but break at runtime. Rule: always run tests after a rebase, even when git says it's clean.
- Amending after a failed pre-commit hook destroys work: If a pre-commit hook fails, the commit didn't happen. Running `git commit --amend` next would modify the previous commit — potentially destroying someone else's work. Rule: always create a new commit after hook failures.
- Interactive git commands don't work: `git rebase -i`, `git add -i` — anything requiring interactive input is impossible for an AI agent. Every git operation must be expressible as a single non-interactive command.
The Real Answer: Which Bottleneck Matters More?
WordPress environment bottlenecks are finite. Each hosting panel has a fixed set of quirks. Once you've handled CWP's `pam_limits` issue, it stays handled. The work is tedious and requires reading the source code of systems with poor documentation, but each fix is permanent.
AI orchestration bottlenecks are systemic and keep recurring. Every new feature we add creates new coordination challenges. Context limits mean we're always deciding what each agent can and can't see. Permission systems evolve. Concurrency edge cases surface only when you scale up the number of parallel agents. The flag-file race conditions we solved two weeks ago reappear in different forms as we add new communication channels.
The compounding effect is what makes it hard. A wp-cli quirk is annoying but manageable. A wp-cli quirk that manifests differently across 50 sites, discovered by an AI agent that has limited context about why the output looks wrong, reported through a messaging system that itself had a race condition — that's where the real complexity lives.
If you're building a SaaS with AI agents and want one sentence to summarize: the WordPress environment is where you stub your toe; the AI orchestration is where you get lost. The first hurts more per incident; the second costs more over time.
Try StrictWP
StrictWP is the product that came out of this pipeline. Try the live demo or see pricing.
If you're building with AI agents and want to compare notes, reach out — I'm always happy to talk shop.
(This post includes affiliate links; I may receive compensation if you purchase products or services from the linked providers.)
