I spent a weekend wiring Claude directly into the Shopify Admin API to see how a custom agent compares to Shopify Magic on the tasks I actually run every week. This post is the benchmark — the code, the numbers, and the surprises.
TL;DR: Magic wins on latency and zero-setup. A custom agent wins on everything that requires state, scheduling, or taking action. The gap is wider than I expected.
## The Setup
I tested four real tasks I run on my Shopify store:
- Generate 50 product descriptions from title + attributes
- Publish a new collection of 20 products with scheduled visibility
- Bulk-adjust prices for 200 SKUs based on a CSV rule sheet
- Send a weekly inventory report to Telegram
Magic can attempt task #1 natively. The other three require human clicks, third-party apps, or a developer. So I built a Claude-based agent for comparison.
Stack for the custom agent:
- `@anthropic-ai/sdk` with Claude Sonnet 4.6
- Shopify Admin GraphQL API (2026-01)
- A minimal MCP-style tool registry
- Node 20 + BullMQ for scheduling
The full agent took ~200 lines. Here's the core tool loop:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

const tools: Anthropic.Tool[] = [
  {
    name: "shopify_update_product",
    description: "Update one or many products via Admin GraphQL",
    input_schema: {
      type: "object",
      properties: {
        productId: { type: "string" },
        patch: { type: "object" },
      },
      required: ["productId", "patch"],
    },
  },
  {
    name: "shopify_bulk_price_update",
    description: "Apply a pricing rule across a list of variant IDs",
    input_schema: {
      type: "object",
      properties: {
        variantIds: { type: "array", items: { type: "string" } },
        rule: { type: "string", description: "e.g. '+10%' or '-5'" },
      },
      required: ["variantIds", "rule"],
    },
  },
];

async function runAgent(userGoal: string) {
  let messages: Anthropic.MessageParam[] = [{ role: "user", content: userGoal }];
  while (true) {
    const res = await anthropic.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      tools,
      messages,
    });
    // Exit whenever Claude isn't asking for a tool (end_turn, max_tokens, etc.)
    // so a truncated response can't spin the loop forever.
    if (res.stop_reason !== "tool_use") return res;
    const toolUses = res.content.filter(
      (b): b is Anthropic.ToolUseBlock => b.type === "tool_use"
    );
    const toolResults: Anthropic.ToolResultBlockParam[] = await Promise.all(
      toolUses.map(async (t) => ({
        type: "tool_result" as const,
        tool_use_id: t.id,
        content: JSON.stringify(await dispatchTool(t.name, t.input)),
      }))
    );
    messages = [
      ...messages,
      { role: "assistant", content: res.content },
      { role: "user", content: toolResults },
    ];
  }
}
```
`dispatchTool` just maps tool names to `fetch` calls against the Shopify GraphQL endpoint. That's it. That's the whole thing.
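A trimmed-down sketch of the dispatcher, in case that sounds hand-wavy. The query strings are abbreviated and the env var names are illustrative; the real version also reshapes each tool's input into the mutation's variables.

```typescript
// Map each tool name to a GraphQL document (abbreviated here).
const GRAPHQL_DOCS: Record<string, string> = {
  shopify_update_product: `
    mutation productUpdate($input: ProductInput!) {
      productUpdate(input: $input) {
        product { id }
        userErrors { field message }
      }
    }`,
  // ...one entry per tool
};

// Pure part: resolve the document for a tool and pair it with its variables.
// In the real code this step also maps tool input -> mutation variables.
function buildToolRequest(name: string, input: unknown) {
  const query = GRAPHQL_DOCS[name];
  if (!query) throw new Error(`Unknown tool: ${name}`);
  return { query, variables: input };
}

// Side-effecting part: POST the request to the Admin API.
async function dispatchTool(name: string, input: unknown) {
  const body = buildToolRequest(name, input);
  const res = await fetch(
    `https://${process.env.SHOPIFY_SHOP}.myshopify.com/admin/api/2026-01/graphql.json`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Shopify-Access-Token": process.env.SHOPIFY_TOKEN!,
      },
      body: JSON.stringify(body),
    }
  );
  return res.json();
}
```

Keeping the lookup pure (no network) makes it trivial to unit-test the tool wiring without hitting the store.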
## Benchmark Results
I ran each task 5 times and averaged. Costs include Anthropic API charges; Magic is "free" (bundled in Shopify plan).
| Task | Shopify Magic | Custom Claude Agent |
|---|---|---|
| 50 product descriptions | 8 min (manual clicks) | 54 s (batched) |
| Publish 20-product collection with scheduled visibility | Not supported | 12 s |
| Bulk price update from CSV | Not supported (needs app) | 2 min 10 s |
| Weekly Telegram inventory report | Not supported | 3 s (scheduled, no human) |
| Cost per run | $0 | $0.08 – $0.42 |
| Setup time | 0 min | ~4 h first time |
| Autonomous execution | No | Yes |
Magic held up on content generation quality — output was comparable. The killer is that every description still required a click to accept and publish. 50 products = 50 clicks. The agent batched the whole thing and wrote back via `productUpdate` mutations in one pass.
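The batching itself is nothing clever: chunked concurrent writes through the dispatcher. A sketch, with the dispatcher injected as a parameter (a testability choice, not how the names appear in my repo):

```typescript
// Write generated descriptions back in chunks of concurrent calls.
// `dispatch` is the same dispatcher the agent's tool loop uses.
async function writeDescriptions(
  items: { productId: string; descriptionHtml: string }[],
  dispatch: (name: string, input: unknown) => Promise<unknown>,
  chunkSize = 10
) {
  const results: unknown[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    // Concurrency within a chunk, sequential between chunks,
    // to stay under Shopify's API rate limits.
    results.push(
      ...(await Promise.all(
        chunk.map((p) =>
          dispatch("shopify_update_product", {
            productId: p.productId,
            patch: { descriptionHtml: p.descriptionHtml },
          })
        )
      ))
    );
  }
  return results;
}
```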
## Where Magic Actually Wins
Being fair:
- First-token latency: Magic feels snappy because it's inline in the admin. No agent loop, no tool dispatch.
- Zero trust overhead: You're not granting API scopes to a third party. For a solo merchant shipping 5 products a week, this matters.
- No infra: Magic has no queue, no Redis, no worker. If you just need copy, it's the pragmatic choice.
If you're writing descriptions once a month, stop reading. Magic is fine.
## Where a Custom Agent Wins (and why it's not close)
The moment you need any of these, Magic is out:
- Multi-step workflows — "Find products with <10 stock, flag them, post to Slack, update the forecast sheet."
- Scheduled runs — cron-driven actions without a human in the loop.
- Cross-channel reach — Telegram, Discord, WhatsApp, email. Magic lives in the admin.
- Memory — store voice, seasonal patterns, previous decisions. Stateless vs stateful.
- Model choice — pick Claude, GPT, or a local model based on task cost/quality.
The most under-appreciated one is #4. After a few weeks, the custom agent was producing descriptions that matched our brand voice without me prompting for it — because it had examples in context. Magic started from zero every time.
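The scheduling piece (#2) is smaller than it sounds. A sketch of the BullMQ side, typed against a minimal structural stand-in so it reads standalone; the cron pattern and job names are illustrative:

```typescript
// Structural stand-in for bullmq's Queue so this sketch compiles on its own;
// in the real worker it's: new Queue("agent-jobs", { connection }).
type QueueLike = {
  add(
    name: string,
    data: unknown,
    opts?: { repeat?: { pattern: string } }
  ): Promise<unknown>;
};

// Every Monday at 09:00 (hypothetical schedule).
const WEEKLY_REPORT_CRON = "0 9 * * 1";

async function scheduleWeeklyReport(queue: QueueLike) {
  return queue.add(
    "inventory-report",
    { goal: "Summarize current stock levels and post the report to Telegram" },
    { repeat: { pattern: WEEKLY_REPORT_CRON } }
  );
}

// Worker side (real bullmq), feeding the job's goal straight into the agent loop:
// new Worker("agent-jobs", async (job) => runAgent(job.data.goal), { connection });
```

The point is that the "goal" is just a natural-language string; the agent loop doesn't know or care whether a human or a cron job queued it.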
## The Architecture That Matters
If you're building your own, the two decisions that dominate everything else:
1. Tool granularity. I initially exposed `shopify_graphql_execute(query, variables)` as one tool. Claude abused it, hallucinating fields and firing bad mutations. Splitting it into 12 narrow tools (`update_product`, `create_collection`, `schedule_publication`, etc.) cut the error rate from 18% to 2%.
2. Confirmation gates on mutations. For anything that writes (price changes, publishes), I wrap the tool with a "dry run first, then confirm" pattern:
```typescript
async function priceUpdate({
  variantIds,
  rule,
  confirm,
}: {
  variantIds: string[];
  rule: string;
  confirm?: boolean;
}) {
  const preview = await computePriceChanges(variantIds, rule);
  if (!confirm) return { preview, requiresConfirmation: true };
  return await applyPriceChanges(preview);
}
```
Claude gets the preview, shows it in the chat, asks for confirmation, then calls again with `confirm: true`. This one pattern eliminated the "what if the AI does something dumb" anxiety.
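The gate generalizes to any preview/apply pair, so you only write it once. A hypothetical helper along those lines (not lifted from my repo):

```typescript
// Generic confirmation gate: turn any "compute preview" + "apply" pair
// into a dry-run-first tool. First call returns the preview; the agent
// must call again with confirm: true to actually mutate anything.
function withConfirmation<T, P, R>(
  preview: (input: T) => Promise<P>,
  apply: (p: P) => Promise<R>
) {
  return async (input: T & { confirm?: boolean }) => {
    const p = await preview(input);
    if (!input.confirm) return { preview: p, requiresConfirmation: true as const };
    return apply(p);
  };
}
```

With that in place, `priceUpdate` is just `withConfirmation(previewFn, applyFn)`, and every other mutating tool gets the same safety net for free.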
## What Surprised Me
Three things I didn't expect:
- Token cost was negligible. Even with 50-product batches and retries, I never hit $0.50 per run. Prompt caching (5-min TTL on the Anthropic API) cut repeat calls by 60%.
- GraphQL mutations are the bottleneck, not inference. Shopify's `bulkOperationRunMutation` saved me 4x on the price-update task.
- The hardest part was the tool schema, not the agent. Once the tools were right, the agent just worked.
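On the caching point: the setup is tiny. A sketch of how I tag the static prefix, assuming the API's `cache_control` fields (the Anthropic API caches everything up to and including the last tagged block, so the tag goes on the *last* tool and on the system block):

```typescript
type ToolDef = { name: string; [k: string]: unknown };

// Mark the static prefix (tool definitions + system prompt) as cacheable
// so repeat agent turns reuse it instead of re-billing the full prompt.
function withPromptCaching(tools: ToolDef[], system: string) {
  const cachedTools = tools.map((t, i) =>
    i === tools.length - 1
      ? { ...t, cache_control: { type: "ephemeral" as const } }
      : t
  );
  return {
    tools: cachedTools,
    system: [
      {
        type: "text" as const,
        text: system,
        cache_control: { type: "ephemeral" as const },
      },
    ],
  };
}
```

Spread the result into `anthropic.messages.create({...})` and the per-turn cost of a long tool registry mostly disappears after the first call.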
## If You're Considering This
Build it yourself if:
- You operate across multiple channels or tools
- You run recurring workflows (weekly reports, scheduled publishes)
- You want API-level control and model choice
- You're comfortable owning infra (or you use a platform that hosts it)
Stick with Magic if:
- Your AI needs stop at "help me write things"
- Setup time has zero ROI for your store size
- You don't want another service to trust with API scopes
I wrote a longer teardown of the tradeoffs (including when a hosted agent platform like Clawify makes more sense than rolling your own) over on the full comparison.
What's your setup? Drop a comment if you've benchmarked Magic against a custom agent — especially interested in what other people saw on multi-step workflows.