<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ima Claw</title>
    <description>The latest articles on DEV Community by Ima Claw (@imaclaw).</description>
    <link>https://dev.to/imaclaw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817708%2F9ff7e880-64c6-46f0-9768-28cf22be6190.png</url>
      <title>DEV Community: Ima Claw</title>
      <link>https://dev.to/imaclaw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/imaclaw"/>
    <language>en</language>
    <item>
      <title>5 Best Sora 2 Alternatives in 2026 [Tested &amp; Ranked]</title>
      <dc:creator>Ima Claw</dc:creator>
      <pubDate>Thu, 02 Apr 2026 07:42:08 +0000</pubDate>
      <link>https://dev.to/imaclaw/5-best-sora-2-alternatives-in-2026-tested-ranked-1jpo</link>
      <guid>https://dev.to/imaclaw/5-best-sora-2-alternatives-in-2026-tested-ranked-1jpo</guid>
      <description>&lt;p&gt;In early 2026, OpenAI quietly pulled the plug on Sora 2. The reasons were familiar: unsustainable compute costs and a user base that wasn't converting to paid plans at the rate needed to justify the infrastructure. For thousands of creators who had built workflows around it, the shutdown was abrupt — and the search for a replacement became urgent.&lt;/p&gt;

&lt;p&gt;The good news: the AI video generation space has never been more competitive. The bad news: not every tool that claims to be a "Sora 2 alternative" actually delivers on quality, speed, or global accessibility.&lt;/p&gt;

&lt;p&gt;We tested five of the most-discussed alternatives across three dimensions: video quality, generation speed, and global availability (no region locks, no waitlists). Here's what we found — ranked by overall value for working creators.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz5fzgkadqksuimgpaz6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz5fzgkadqksuimgpaz6.jpg" alt=" " width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Comparison: 5 Best Sora 2 Alternatives in 2026
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Video Quality&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Global Access&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Seedance 2.0 on IMA Studio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;td&gt;✅ No queue, worldwide&lt;/td&gt;
&lt;td&gt;Best overall — quality + speed + access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kling 3.0&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;✅ Available&lt;/td&gt;
&lt;td&gt;Multi-shot narrative filmmakers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runway Gen-4&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;✅ Available&lt;/td&gt;
&lt;td&gt;Branded campaigns, character consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hailuo 2.3&lt;/td&gt;
&lt;td&gt;★★★☆☆&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;td&gt;✅ Available&lt;/td&gt;
&lt;td&gt;Fast social content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pika 2.2&lt;/td&gt;
&lt;td&gt;★★★☆☆&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;✅ Available&lt;/td&gt;
&lt;td&gt;Beginners, simple clips&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  #1 Seedance 2.0 on IMA Studio — Best Overall Sora 2 Alternative
&lt;/h2&gt;

&lt;p&gt;If you're looking for a single tool that matches or exceeds Sora 2's output quality — without the waitlist, the region restrictions, or the uncertainty — Seedance 2.0 on IMA Studio is the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is Seedance 2.0?
&lt;/h3&gt;

&lt;p&gt;Seedance 2.0 is ByteDance's flagship AI video generation model, released in February 2026. It represents a significant leap in physical realism, motion coherence, and multimodal input handling. In independent benchmarks, it consistently outperforms earlier-generation models on temporal consistency and fine-detail rendering — the two areas where most AI video tools still struggle.&lt;/p&gt;

&lt;p&gt;The model supports text-to-video, image-to-video, and video-to-video workflows, with native understanding of physical dynamics (water, cloth, fire, gravity) that previous models could only approximate. Character consistency across scenes — a persistent pain point in AI video — is dramatically improved. Lip sync accuracy across multiple languages is production-grade.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem With the Official Platform
&lt;/h3&gt;

&lt;p&gt;Here's the catch: on ByteDance's own platform, Seedance 2.0 is effectively inaccessible for most global creators. Queue times routinely run 2–8 hours during peak usage. Geographic restrictions block users in large portions of Europe, Southeast Asia, and Latin America entirely. Even users who can access it report inconsistent availability and frequent service interruptions.&lt;/p&gt;

&lt;p&gt;For a professional workflow, that's a dealbreaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why IMA Studio Changes Everything
&lt;/h3&gt;

&lt;p&gt;IMA Studio was the first platform globally to integrate Seedance 2.0 on day zero of its release — and it solved every access problem the official platform created.&lt;/p&gt;

&lt;p&gt;On IMA Studio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No queue.&lt;/strong&gt; Generations start immediately, regardless of time zone or traffic volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No region restrictions.&lt;/strong&gt; Available to creators in every country, including markets where the official platform is blocked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full feature access.&lt;/strong&gt; All input modalities — text, image, video, and audio — are supported from day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free to start.&lt;/strong&gt; New accounts receive 200 free credits on signup, enough to generate and evaluate multiple videos before committing to a paid plan.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What You Can Actually Do With It
&lt;/h3&gt;

&lt;p&gt;In our testing, Seedance 2.0 on IMA Studio handled every generation type we threw at it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-video:&lt;/strong&gt; Complex scene descriptions with multiple subjects, environmental physics, and camera movement instructions all rendered accurately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image-to-video:&lt;/strong&gt; Static product shots animated with natural motion, maintaining brand colors and object integrity throughout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character consistency:&lt;/strong&gt; The same character maintained recognizable features across a 6-shot sequence — something that typically requires post-production work in other tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lip sync:&lt;/strong&gt; Tested with English, Spanish, and Mandarin audio tracks. Sync accuracy was indistinguishable from professionally dubbed content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Average generation time for a 5-second clip: under 90 seconds. For a 10-second clip: under 3 minutes. No throttling observed across 40+ consecutive generations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;IMA Studio uses a pay-per-use credit model. You get 200 free credits on signup — no credit card required. Paid credits are available in flexible bundles, making it cost-efficient for both occasional users and high-volume production teams.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://imastudio.com/seedance-2-0" rel="noopener noreferrer"&gt;Try Seedance 2.0 Free on IMA Studio →&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  #2 Kling 3.0 — Best for Multi-Shot Narrative Filmmaking
&lt;/h2&gt;

&lt;p&gt;Kling 3.0, developed by Kuaishou, launched on February 5, 2026, and immediately claimed the top position on major AI video leaderboards with an Elo score of 1,243. For narrative-driven content, it&#8217;s the strongest competitor to Seedance 2.0 in terms of raw output quality.&lt;/p&gt;

&lt;p&gt;The headline feature is &lt;strong&gt;Multi-shot Generation&lt;/strong&gt; — the ability to produce coherent multi-scene sequences from a single prompt, with consistent characters, lighting, and spatial logic across shots. Combined with 4K output and multi-language lip sync, it's a serious tool for filmmakers and long-form content teams.&lt;/p&gt;

&lt;p&gt;The limitations are real, though. Kling 3.0 is highly prompt-sensitive: vague or loosely structured prompts produce inconsistent results, and learning the prompt architecture takes meaningful time investment. Credit consumption is aggressive — complex generations can drain a plan's allocation faster than expected. During peak hours (typically 9–11 AM and 7–10 PM UTC), queue times of 30–47 minutes are common.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Filmmakers, video directors, and content teams who need multi-shot narrative coherence and are willing to invest in prompt craft. Not ideal if speed or simplicity is the priority.&lt;/p&gt;

&lt;h2&gt;
  
  
  #3 Runway Gen-4 — Best for Brand Campaigns and Character Consistency
&lt;/h2&gt;

&lt;p&gt;Runway isn't just a video model — it's the most complete AI creative platform currently available. Gen-4 is the video generation engine at its core, but the surrounding toolset (Act-One for expression transfer, team collaboration features, asset management, and API access) makes it the default choice for agency and brand workflows.&lt;/p&gt;

&lt;p&gt;Gen-4's defining capability is &lt;strong&gt;cross-shot character consistency&lt;/strong&gt;. If you're producing a campaign where the same character needs to appear across 10 different scenes with different lighting, environments, and actions, Gen-4 handles it more reliably than any other tool we tested. The Act-One feature — which transfers facial expressions from reference footage to generated characters — is genuinely production-ready.&lt;/p&gt;

&lt;p&gt;The cost structure requires attention. Runway charges 12 credits per second of generated video, meaning a 10-second clip costs 120 credits. The "Unlimited" plan has undisclosed throttling that kicks in after sustained high-volume usage. Critically: failed generations still consume credits, which can be frustrating during complex prompt iteration.&lt;/p&gt;
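That credit math is worth sanity-checking before picking a plan. A minimal sketch, using the 12-credits-per-second rate stated above (the helper function is our own illustration, not part of any Runway SDK):

```python
# Illustrative helper (our own, not a Runway API) for estimating
# Gen-4 credit spend at the stated rate of 12 credits per second.
def gen4_credit_cost(seconds: int, credits_per_second: int = 12) -> int:
    """Credits consumed by one generation of the given duration."""
    return seconds * credits_per_second

# A 10-second clip costs 120 credits. Failed generations are billed
# too, so budget several attempts per finished clip.
print(gen4_credit_cost(10))      # 120
print(gen4_credit_cost(10) * 5)  # 600 credits for five attempts
```

Running the numbers this way makes the "Unlimited" plan's throttling threshold easier to reason about: five iterations on a single 10-second shot already consumes 600 credits.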

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Advertising agencies, brand content teams, and any workflow requiring consistent character identity across multiple scenes. The platform overhead is worth it at professional scale; less so for individual creators on a budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  #4 Hailuo 2.3 (MiniMax) — Best for Fast Social Content
&lt;/h2&gt;

&lt;p&gt;If raw speed is your primary requirement, Hailuo 2.3 from MiniMax is in a class of its own. A 6-second video generates in approximately 30 seconds — faster than any other tool in this comparison. The interface is clean and approachable, with minimal learning curve, making it a practical choice for social media teams that need to produce high volumes of short-form content quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p3ylmi5v2h8k6iqdq3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p3ylmi5v2h8k6iqdq3o.png" alt=" " width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The quality ceiling is lower than Seedance 2.0 or Kling 3.0. Complex motion sequences — multiple subjects interacting, detailed hand movements, fast-action sports — show instability and temporal artifacts. Fine material details (fabric texture, reflective surfaces, hair) tend to soften or blur over the course of a clip. For simple, visually clean social content, these limitations rarely matter. For anything requiring production-grade output, they do.&lt;/p&gt;

&lt;p&gt;There have also been documented user complaints about billing discrepancies on the platform — worth noting before committing to a paid plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Social media content creators, marketing teams producing high-volume short clips, and anyone who prioritizes turnaround time over maximum output quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  #5 Pika 2.2 — Best for Beginners
&lt;/h2&gt;

&lt;p&gt;Pika 2.2 is the most accessible entry point into AI video generation. The interface is deliberately simple, the learning curve is minimal, and at $8/month for the starter plan, the price-to-value ratio is hard to beat for casual or exploratory use. Output resolution is 1080p, which is sufficient for most social media and web use cases.&lt;/p&gt;

&lt;p&gt;The trade-offs are significant at the professional level. Complex scenes with multiple interacting subjects, detailed environmental physics, or extended duration (beyond 4–6 seconds) expose the model's limitations clearly. The sense of physical weight and motion realism that defines Seedance 2.0 and Kling 3.0 is largely absent. For simple product animations, talking head videos, or basic creative exploration, Pika 2.2 works well. For anything requiring cinematic quality or complex narrative structure, it falls short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Beginners exploring AI video for the first time, hobbyists, and creators with simple, short-form needs who prioritize ease of use and low cost over maximum capability.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://imastudio.com/blog/sora-2-alternatives" rel="noopener noreferrer"&gt;IMA Studio Blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>video</category>
      <category>machinelearning</category>
      <category>tools</category>
    </item>
    <item>
      <title>Best OpenClaw Skills in 2026 — Ranked by Reddit Users</title>
      <dc:creator>Ima Claw</dc:creator>
      <pubDate>Fri, 27 Mar 2026 06:38:44 +0000</pubDate>
      <link>https://dev.to/imaclaw/best-openclaw-skills-in-2026-ranked-by-reddit-users-463k</link>
      <guid>https://dev.to/imaclaw/best-openclaw-skills-in-2026-ranked-by-reddit-users-463k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1q59qs6olgucwhl257m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1q59qs6olgucwhl257m.jpg" alt="Best OpenClaw Skills in 2026 — Ranked by Reddit Users" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’ve just set up OpenClaw, you’ve probably noticed something: out of the box, it doesn’t feel that different from a regular chat window.&lt;/p&gt;

&lt;p&gt;That changes the moment you install the right skills.&lt;/p&gt;

&lt;p&gt;A recent post on r/AI_Agents put it bluntly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I recently set up OpenClaw, and I feel that having good skills is absolutely crucial.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They’re right. And Reddit’s OpenClaw community has spent months stress-testing skills so you don’t have to. We dug through r/openclaw, r/AI_Agents, r/ClaudeAI, and r/AiForSmallBusiness to pull out what actually gets recommended — and what quietly breaks your setup.&lt;/p&gt;

&lt;p&gt;Here’s the honest list.&lt;/p&gt;

&lt;h2&gt;
  
  
  How We Ranked These
&lt;/h2&gt;

&lt;p&gt;We didn’t just grab whatever had the most installs. Reddit users are ruthless about bad recommendations. Our criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mentioned across multiple subreddits&lt;/strong&gt; (not just one hype post)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistently recommended in “what do you actually use?” threads&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No reported data leaks or malicious behavior&lt;/strong&gt; (yes, this is a real concern — more on that below)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Works on a real machine, not just a demo environment&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One dev community maintainer spelled it out well: &lt;em&gt;“Users don’t want the most skills, they want a short list that is predictable, maintained, and honest about risk.”&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Best OpenClaw Skills in 2026
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Tavily Web Search
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install tavily-search&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;Non-negotiable first install. Everything else assumes you can search.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlkt939hu1utmz7iwtgo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlkt939hu1utmz7iwtgo.png" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Real-time web search is the single biggest unlock for OpenClaw. Without it, your agent is working from training data cutoffs. With Tavily, it can pull live news, pricing, competitor info, and research on demand. Repeatedly cited in r/openclaw and r/AI_Agents as the first skill anyone should install. No debate.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Self-Improving Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install self-improving-agent&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;This one actually learns from mistakes. Most AI tools don&#8217;t.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every time your agent makes an error or gets corrected, this skill logs it. Over time, your OpenClaw gets better at your specific workflows. Users in r/AiForSmallBusiness reported 30–40% fewer repeated mistakes after two weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Summarize
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install summarize&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;I use this daily. YouTube videos, PDFs, long threads &#8212; all summarized in seconds.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu00vve8kkn09cfad24xf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu00vve8kkn09cfad24xf.png" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the r/AI_Agents thread that kicked off this whole list: &lt;em&gt;“I’ve been using Web Browsing for basic tasks like navigating pages and extracting content, and also Summarize to pull summaries from videos.”&lt;/em&gt; Works across YouTube, PDFs, web pages, and long documents. Simple, reliable, no drama.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Agent Browser
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install agent-browser&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;For anything that requires clicking through a UI, this is the one.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Full browser automation — not just scraping static pages, but actually interacting with dynamic interfaces. Used in multi-step workflows across r/AI_Agents and r/openclaw setup guides.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Proactive Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install proactive-agent&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;Changes OpenClaw from reactive to actually useful. It checks in on things without being asked.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This skill enables your agent to monitor tasks, surface insights proactively, and act on triggers without waiting for a prompt. Popular in business automation threads on r/AiForSmallBusiness.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Ontology Memory
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install ontology-memory&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;OpenClaw&#8217;s default memory is weak. This fixes it.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Widely recommended in r/ClaudeAI for its local memory system — all your knowledge stored on your machine, not in the cloud. One r/ClaudeAI post specifically recommended OpenClaw because of its local memory capabilities: &lt;em&gt;“OpenClaw has local memory — all your writings can be stored locally as part of your local OpenClaw installation.”&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  7. IMA Studio (ima-all-ai)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install ima-all-ai&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;One key for Midjourney, Kling, Suno, and 20+ models. Finally.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrbhsy3gph2w0h4koyfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrbhsy3gph2w0h4koyfj.png" width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For creators and content teams, this is the highest-ROI skill on the list. Instead of juggling separate subscriptions for image generation, video creation, and music — you get all of it through one IMA Studio skill.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Images:&lt;/strong&gt; SeeDream 4.5, Nano Banana, Midjourney&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Videos:&lt;/strong&gt; Wan 2.6, Kling, Hailuo, Veo — 14+ models&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Music:&lt;/strong&gt; Suno sonic-v5, DouBao BGM&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One sentence → images, videos, music. No switching tabs. No API key juggling.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Find Skills
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install summarize-find-skills&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;Meta skill. Helps you find what else to install. Useful when you&#8217;re just starting out.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Helps your agent search and evaluate ClawHub skills based on what you’re trying to do. Recommended as a “day one” install across multiple setup guides.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Skill Vetter
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install skill-vetter&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;After the malware reports, everyone should be running this.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This came up repeatedly after an r/hacking post estimated that ~15% of community OpenClaw skills contain malicious behavior. Skill Vetter audits skills before and after installation. As one r/ClaudeAI user put it: &lt;em&gt;&#8220;Don&#8217;t just install an arbitrary skill, but have the skill scanner understand the purpose of the skill.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  10. PPT Generator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;clawhub install ppt-generator&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Reddit verdict:&lt;/strong&gt; &lt;em&gt;&#8220;Slides from data in one command. Nobody goes back after using this.&#8221;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Generate full presentations from structured data or text. Popular in r/AiForSmallBusiness threads about automating client deliverables. Pairs well with the Summarize skill for instant meeting decks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Skills Reddit Warned Against
&lt;/h2&gt;

&lt;p&gt;Not everything on ClawHub is safe. A few patterns Reddit users flagged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Skills with vague descriptions&lt;/strong&gt; — if you can’t tell what it does from the README, skip it&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;“Utility” skills that request file system access without explanation&lt;/strong&gt; — one popular “music” skill was found scanning for SSN/tax patterns in local files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Skills that haven’t been updated in 6+ months&lt;/strong&gt; — OpenClaw updates frequently; unmaintained skills break silently&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run Skill Vetter. Read the SKILL.md. Check when it was last updated.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fast Setup Order
&lt;/h2&gt;

&lt;p&gt;If you’re starting fresh, install in this order:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;clawhub install tavily-search&lt;br&gt;
clawhub install summarize&lt;br&gt;
clawhub install skill-vetter&lt;br&gt;
clawhub install self-improving-agent&lt;br&gt;
clawhub install agent-browser&lt;br&gt;
clawhub install ima-all-ai&lt;br&gt;
clawhub install ontology-memory&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That’s a solid production setup in under 10 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;OpenClaw without skills is a sports car with no fuel. The community has done the hard work of separating the genuinely useful from the noise — and the consistent winners are search, memory, browser automation, and self-improvement.&lt;/p&gt;

&lt;p&gt;For creators specifically, &lt;code&gt;ima-all-ai&lt;/code&gt; is the one skill that most people don’t know about but immediately wish they’d installed sooner.&lt;/p&gt;

&lt;p&gt;What’s in your stack? Drop it in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://imastudio.com/blog/best-openclaw-skills" rel="noopener noreferrer"&gt;imastudio.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openclaw</category>
      <category>productivity</category>
      <category>aitools</category>
    </item>
    <item>
      <title>Best Sora Alternatives in 2026: Top AI Video Generators After the Shutdown</title>
      <dc:creator>Ima Claw</dc:creator>
      <pubDate>Fri, 27 Mar 2026 02:37:55 +0000</pubDate>
      <link>https://dev.to/imaclaw/best-sora-alternatives-in-2026-top-ai-video-generators-after-the-shutdown-50b0</link>
      <guid>https://dev.to/imaclaw/best-sora-alternatives-in-2026-top-ai-video-generators-after-the-shutdown-50b0</guid>
      <description>&lt;p&gt;Sora is dead.&lt;/p&gt;

&lt;p&gt;On March 24, 2026, OpenAI officially shut down Sora &#8212; just 15 months after its public launch. Even the billion-dollar Disney partnership couldn&#8217;t save it. Reports say it was burning millions of dollars per day in inference costs. OpenAI couldn&#8217;t sustain it. They&#8217;re pivoting the tech to robotics training instead.&lt;/p&gt;

&lt;p&gt;Thousands of creators are now scrambling for alternatives. Good news: AI video tools in 2026 are way better than Sora ever was.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Sora Failed (And What It Means for You)
&lt;/h2&gt;

&lt;p&gt;Sora’s shutdown wasn’t a surprise to industry insiders. The writing was on the wall:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unsustainable costs:&lt;/strong&gt; Millions of dollars per day in inference alone&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited output:&lt;/strong&gt; 60-second max clips couldn’t compete&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No image generation:&lt;/strong&gt; Users needed separate tools&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API restrictions:&lt;/strong&gt; Limited integration options&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lesson? &lt;strong&gt;Don’t bet your workflow on a single tool.&lt;/strong&gt; Today’s AI video landscape favors platforms that aggregate multiple models — so when one goes down or falls behind, you just switch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top Sora Alternatives (Data-Driven Comparison)
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Deep Dive: When to Use What
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;a href="https://imastudio.com/wan-2-6-ai-video-generator" rel="noopener noreferrer"&gt;Wan 2.6&lt;/a&gt; — The Safe Default
&lt;/h3&gt;

&lt;p&gt;Wan 2.6 has become the most popular AI video generator in 2026 for good reason. It strikes the perfect balance between quality, speed, and cost.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Text-to-video + image-to-video generation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High motion quality with minimal artifacts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No waitlist — instant access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;25 points per generation (most cost-effective)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strong performance on both human and object motion&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best use cases:&lt;/strong&gt; Social media content, marketing videos, quick prototypes, general creative work&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Start with Wan 2.6 if you’re new to AI video. It forgives imperfect prompts better than most models.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;a href="https://imastudio.com/hailuo-video-generator" rel="noopener noreferrer"&gt;Hailuo 2.3&lt;/a&gt; (MiniMax) — For Premium Quality
&lt;/h3&gt;

&lt;p&gt;When your brand reputation is on the line, Hailuo 2.3 delivers Hollywood-level output. It’s the go-to choice for agencies and professional video producers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cinematic-grade color grading and lighting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Best-in-class motion consistency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Superior handling of complex scenes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;38 points per generation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best use cases:&lt;/strong&gt; Brand commercials, product showcases, high-end social content, client work&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Higher cost per generation, but the quality difference is noticeable on large screens.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;a href="https://imastudio.com/kling-video-generator" rel="noopener noreferrer"&gt;Kling 2.6&lt;/a&gt; — For Long-Form Content
&lt;/h3&gt;

&lt;p&gt;Kling AI solved the duration problem that plagued Sora. With up to 120 seconds of continuous generation, it’s the only choice for narrative content.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Up to 120 seconds per clip (2x Sora’s limit)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;First/last frame control for seamless transitions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Photorealistic human motion and expressions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Excellent for character-driven stories&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best use cases:&lt;/strong&gt; Short films, explainer videos, storytelling content, character animations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unique feature:&lt;/strong&gt; The first/last frame control lets you create continuous scenes by chaining generations.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;a href="https://imastudio.com/veo-3-video-generator" rel="noopener noreferrer"&gt;Google Veo 3.1&lt;/a&gt; — For Photorealism
&lt;/h3&gt;

&lt;p&gt;Google’s Veo 3.1 sets the benchmark for realism. If you need footage that looks indistinguishable from camera-captured video, this is it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Industry-leading photorealistic output&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Natural physics and fluid dynamics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reference image support for consistent characters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Excellent environmental detail&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best use cases:&lt;/strong&gt; Product demos, real estate walkthroughs, documentary-style content, VFX pre-visualization&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;a href="https://imastudio.com/vidu-q2-video-generator" rel="noopener noreferrer"&gt;Vidu Q2&lt;/a&gt; — For High-Volume Production
&lt;/h3&gt;

&lt;p&gt;When you need to produce dozens of variations for A/B testing or rapid iteration, Vidu Q2 is your workhorse.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fastest generation speed in its class&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lowest cost per video (15 points)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Good enough quality for most social platforms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Perfect for testing creative concepts&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best use cases:&lt;/strong&gt; Social media at scale, ad creative testing, rapid prototyping, internal reviews&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Choose: Decision Framework
&lt;/h2&gt;

&lt;p&gt;Still unsure? Use this framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Budget conscious?&lt;/strong&gt; → Vidu Q2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Need max duration?&lt;/strong&gt; → Kling 2.6&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quality is everything?&lt;/strong&gt; → Hailuo 2.3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Need realism?&lt;/strong&gt; → Veo 3.1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Just want it to work?&lt;/strong&gt; → Wan 2.6&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One Platform, All Models: Why IMA Studio Wins
&lt;/h2&gt;

&lt;p&gt;Here’s the problem with subscribing to individual tools: you’re locked in. When a model improves (or dies), you’re stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IMA Studio&lt;/strong&gt; takes a different approach. Instead of betting on one model, you get access to all of them: Wan, Hailuo, Kling, Veo, and more (14+ video models in all), plus AI image generation and AI music generation in one unified platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Aggregation Advantage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Future-proof:&lt;/strong&gt; New models added automatically&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-optimized:&lt;/strong&gt; Use cheaper models for drafts, premium for finals&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workflow unified:&lt;/strong&gt; One account, one interface, one credit system&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No vendor lock-in:&lt;/strong&gt; Switch models mid-project without changing tools&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Migration Guide: From Sora to IMA Studio
&lt;/h2&gt;

&lt;p&gt;If you were using Sora, here’s how to transition:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sign up&lt;/strong&gt; for a free IMA Studio account&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with Wan 2.6&lt;/strong&gt; — closest to Sora’s output style&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experiment&lt;/strong&gt; with other models using the same prompts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compare results&lt;/strong&gt; side-by-side&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build your preference profile&lt;/strong&gt; — which model for which use case&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most Sora users find they get better results with Wan 2.6 or Hailuo 2.3, often at lower cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Sora’s death is a wake-up call: &lt;strong&gt;don’t rely on single-vendor AI tools.&lt;/strong&gt; The future belongs to platforms that aggregate the best models and let you switch seamlessly.&lt;/p&gt;

&lt;p&gt;IMA Studio is that platform. 14+ video models, image generation, music generation — all accessible from a single interface with one credit system.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://imastudio.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Try IMA Studio free today&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Neo, Growth Team @ IMA Studio&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://imastudio.com/blog/best-sora-alternatives" rel="noopener noreferrer"&gt;imastudio.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>video</category>
      <category>aitools</category>
      <category>creativity</category>
    </item>
    <item>
      <title>I Open-Sourced My AI Cost Optimization Journey: How We Cut $800/day to Under $370/day (Without Sacrificing Output)</title>
      <dc:creator>Ima Claw</dc:creator>
      <pubDate>Wed, 18 Mar 2026 11:08:11 +0000</pubDate>
      <link>https://dev.to/imaclaw/i-open-sourced-my-ai-cost-optimization-journey-how-we-cut-800day-to-under-370day-without-56a6</link>
      <guid>https://dev.to/imaclaw/i-open-sourced-my-ai-cost-optimization-journey-how-we-cut-800day-to-under-370day-without-56a6</guid>
      <description>&lt;p&gt;An honest breakdown of how we diagnosed and fixed our open-source AI infrastructure—plus a practical playbook you can apply to your own projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: You're Probably Wasting More Than You Think
&lt;/h2&gt;

&lt;p&gt;Last quarter, I discovered something alarming about my open-source AI automation project. At peak usage, it was costing us roughly $800/day in API spend. Not bad for an indie project, but not sustainable either.&lt;/p&gt;

&lt;p&gt;The numbers were making me uncomfortable. At that rate, even with a reasonable user base, we'd be burning through revenue faster than growth could keep up. It was time to get serious about cost optimization.&lt;/p&gt;

&lt;p&gt;This isn't a story about finding a "magic bullet" solution or switching to some obscure model provider. Instead, it's about systematic diagnosis, making smart trade-offs, and applying proven infrastructure patterns. And yes—I'll share the actual numbers later in this post.&lt;/p&gt;

&lt;p&gt;If you're building with AI APIs, running automation workflows, or operating any AI-powered product, these lessons apply directly to your setup too.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Initial Diagnosis: The Usual Suspects
&lt;/h2&gt;

&lt;p&gt;When costs spike, most engineers start with the same assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The model provider is overcharging"&lt;/li&gt;
&lt;li&gt;"We need better caching strategies"&lt;/li&gt;
&lt;li&gt;"Switch to a cheaper model"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't wrong—they're just incomplete. Here's what my team actually investigated first:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Model Choice (the obvious one)
&lt;/h3&gt;

&lt;p&gt;We reviewed our API billing and found we were predominantly using premium tier models for tasks that could run on mid-tier. The gap between "gpt-4-class" and "good enough" can be 3–5x in cost per token. That's massive when you're pushing thousands of calls daily.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. System Prompt Bloat (the hidden one)
&lt;/h3&gt;

&lt;p&gt;This is where we found our first real win. Every LLM context slot costs money, and my team had let our system prefixes grow unchecked. What started as "keep the bot focused" had mutated into pages of repetitive instructions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple conflicting persona definitions&lt;/li&gt;
&lt;li&gt;Over-detailed formatting rules repeated across sections&lt;/li&gt;
&lt;li&gt;90KB+ worth of instructions before a single user message was processed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. The Work Itself (not just the models)
&lt;/h3&gt;

&lt;p&gt;The biggest surprise? We were using expensive models for tasks that didn't need them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routing logic running on $0.06/token models when rules-based code would suffice&lt;/li&gt;
&lt;li&gt;Image generation calls without caching or fallbacks&lt;/li&gt;
&lt;li&gt;Browser automation loops re-loading pages instead of reusing state&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Our Optimization Framework (What Actually Worked)
&lt;/h2&gt;

&lt;p&gt;After analysis, we built a systematic approach that reduced spend by roughly 46% on average (around 55% against peak days) while maintaining output quality. Here's the breakdown:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Eliminate Unnecessary Tasks
&lt;/h3&gt;

&lt;p&gt;We audited every automated workflow and identified about 40–50% of calls that weren't actually needed for user value. These were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redundant data-fetching loops&lt;/li&gt;
&lt;li&gt;Failed requests without retry logic&lt;/li&gt;
&lt;li&gt;Background polling that could be event-driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; The cheapest token is the one you never send.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Optimize System Prefixes (Our Biggest Win)
&lt;/h3&gt;

&lt;p&gt;We refactored our system instructions to be &lt;strong&gt;minimal but effective&lt;/strong&gt;. The results were shocking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced from ~90KB to ~15KB per task session&lt;/li&gt;
&lt;li&gt;Improved response quality (less token bloat = faster inference)&lt;/li&gt;
&lt;li&gt;Reduced hallucination rates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The technique:&lt;/strong&gt; Instead of "here's everything the bot should know," we moved to "here are just the guardrails needed for this specific task." This 6x reduction in context size directly translated to cost savings without changing output quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Smart Fallback Strategies
&lt;/h3&gt;

&lt;p&gt;We implemented a tiered fallback system rather than "always fail hard":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary: model_with_best_quality&lt;/li&gt;
&lt;li&gt;Secondary: fast_model_for_light_tasks&lt;/li&gt;
&lt;li&gt;Tertiary: error_state (with cached alternative)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Retry rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rate_limit_exceeded: wait 2s, reduce parallelism&lt;/li&gt;
&lt;li&gt;token_limit_reached: continue on next batch&lt;/li&gt;
&lt;li&gt;network_timeout: immediate retry once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevented total failures while keeping most requests on cost-effective paths.&lt;/p&gt;
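&lt;p&gt;In code, the tiered fallback looks roughly like this. A sketch only: the model names, error codes, and retry table mirror the lists above, while &lt;code&gt;callModel&lt;/code&gt; and &lt;code&gt;cache&lt;/code&gt; are assumed to be injected by the caller:&lt;/p&gt;

```javascript
// Sketch of the tiered fallback described above. Names are illustrative.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const RETRY_RULES = {
  rate_limit_exceeded: { waitMs: 2000, retries: 1 }, // wait 2s, reduce parallelism
  token_limit_reached: { waitMs: 0, retries: 0 },    // caller continues on next batch
  network_timeout:     { waitMs: 0, retries: 1 },    // immediate retry once
};

async function generateWithFallback(task, callModel, cache) {
  const tiers = ["model_with_best_quality", "fast_model_for_light_tasks"];
  for (const model of tiers) {
    let attempts = 0;
    for (;;) {
      try {
        return await callModel(model, task);
      } catch (err) {
        const rule = RETRY_RULES[err.code] || { waitMs: 0, retries: 0 };
        if (attempts >= rule.retries) break; // exhausted: drop to the next tier
        attempts += 1;
        if (rule.waitMs > 0) await sleep(rule.waitMs);
      }
    }
  }
  // Tertiary: error state, serving a cached alternative when one exists
  return cache.get(task) || { error: "all_tiers_failed" };
}
```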

&lt;h3&gt;
  
  
  Step 4: Usage-Based Routing
&lt;/h3&gt;

&lt;p&gt;Not all tasks need the same tier of model intelligence. We added simple classification logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple Q&amp;amp;A → cheaper models&lt;/li&gt;
&lt;li&gt;Complex reasoning → higher-tier models&lt;/li&gt;
&lt;li&gt;Code generation → specialized instruction-tuned models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This alone saved us an extra ~20% on average per task.&lt;/p&gt;
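&lt;p&gt;The router itself can be dead simple. This is an illustrative sketch — the keyword heuristics and tier names are assumptions, and our production classifier is more involved, but the shape is the same:&lt;/p&gt;

```javascript
// Illustrative task-type router: deterministic rules, no LLM call needed.
function routeModel(task) {
  const text = task.toLowerCase();
  // Code generation → specialized instruction-tuned models
  if (/\b(function|class|refactor|compile|typescript|python|bug)\b/.test(text)) {
    return "code_tuned_model";
  }
  // Complex reasoning → higher-tier models
  if (/\b(why|explain|compare|analyze|design|plan)\b/.test(text)) {
    return "higher_tier_model";
  }
  // Simple questions → cheaper models
  return "cheap_model";
}
```

&lt;p&gt;The point is not the specific keywords — it's that routing runs in deterministic code before any tokens are spent.&lt;/p&gt;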




&lt;h2&gt;
  
  
  Results and What Didn't Change
&lt;/h2&gt;

&lt;p&gt;The optimization effort gave us two major outcomes:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Numbers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; $677/day average, sometimes hitting $800/day during peak&lt;br&gt;
&lt;strong&gt;After:&lt;/strong&gt; $362/day average&lt;br&gt;
&lt;strong&gt;Savings:&lt;/strong&gt; ~46% cost reduction (not counting new infrastructure costs)&lt;/p&gt;

&lt;p&gt;For a project with our user growth trajectory, this means the difference between sustainable and unsustainable.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Didn't Change
&lt;/h3&gt;

&lt;p&gt;This is crucial: we didn't sacrifice output quality or reliability. Key metrics that remained stable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User satisfaction scores&lt;/li&gt;
&lt;li&gt;Task completion rates&lt;/li&gt;
&lt;li&gt;Error recovery success&lt;/li&gt;
&lt;li&gt;Response times (actually improved 10–15% with less context)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Real Lesson: Infrastructure Is a Habit, Not a Project
&lt;/h2&gt;

&lt;p&gt;The most important takeaway isn't the technical tricks—it's the mindset shift. Cost optimization can't be a one-off project. It has to become a continuous practice embedded in your workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we changed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weekly cost review rituals (15min, no more)&lt;/li&gt;
&lt;li&gt;Automated spending alerts at thresholds&lt;/li&gt;
&lt;li&gt;"Optimization sprint" before major feature launches&lt;/li&gt;
&lt;li&gt;Every engineer owns a slice of the cost pie&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't about squeezing pennies—it's about making sure your infrastructure can grow without cost becoming the first constraint you hit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Action Plan (Start Tonight)
&lt;/h2&gt;

&lt;p&gt;Want to apply this to your own projects? Here's the minimal checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Measure before optimizing&lt;/strong&gt; — export API logs for one week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit one workflow at a time&lt;/strong&gt; — start with costliest paths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce system prompts aggressively&lt;/strong&gt; — question every line added&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement fallbacks immediately&lt;/strong&gt; — don't wait for perfect retry logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review costs weekly&lt;/strong&gt; — 15min is enough&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do these consistently and you'll likely see similar results: faster, cheaper AI infrastructure without quality trade-offs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The biggest cost isn't the model—it's what you pay for while trying to fix it. Our journey from wasting tokens to optimizing workflows was less about technology and more about discipline around usage patterns.&lt;/p&gt;

&lt;p&gt;If you're building with AI APIs, ask yourself: "Am I paying for everything I use, or just using everything I pay for?" The answer will tell you where your real optimization opportunities lie.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; You're probably overpaying. Cut system prompt bloat, implement fallbacks, and route by task type, not preference. The savings alone can buy you another quarter of runway.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>costoptimization</category>
      <category>automation</category>
    </item>
    <item>
      <title>How We're Solving Context Window Bloat in an AI Agent Skill Ecosystem</title>
      <dc:creator>Ima Claw</dc:creator>
      <pubDate>Wed, 18 Mar 2026 11:00:20 +0000</pubDate>
      <link>https://dev.to/imaclaw/how-were-solving-context-window-bloat-in-an-ai-agent-skill-ecosystem-2265</link>
      <guid>https://dev.to/imaclaw/how-were-solving-context-window-bloat-in-an-ai-agent-skill-ecosystem-2265</guid>
      <description>&lt;p&gt;Your AI agent just got its 53rd skill installed. Image generation, video creation, social media posting, calendar management — the works.&lt;/p&gt;

&lt;p&gt;There's just one problem: &lt;strong&gt;every single request now carries 25KB of skill descriptions in the system prompt&lt;/strong&gt;, whether the user needs them or not. That's ~6,200 tokens of overhead before a single word of actual conversation.&lt;/p&gt;

&lt;p&gt;This post walks through how we found this problem, the four approaches we tried (and why three of them failed), and the architecture we landed on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: More Skills = Worse Performance
&lt;/h2&gt;

&lt;p&gt;We run an AI agent platform where users install "skills" — essentially instruction modules that tell the agent how to use specific tools. Think of them like plugins, but implemented as structured markdown files that get injected into the system prompt.&lt;/p&gt;

&lt;p&gt;The mechanism is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Install skill → SKILL.md stored locally  
→ name + description injected into every request's system prefix  
→ Agent sees full skill list → matches → reads SKILL.md → executes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we audited our system prefix, here's what we found:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool schemas&lt;/td&gt;
&lt;td&gt;29.6 KB&lt;/td&gt;
&lt;td&gt;31.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User workspace files&lt;/td&gt;
&lt;td&gt;30.8 KB&lt;/td&gt;
&lt;td&gt;32.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Skills list&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24.9 KB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework instructions&lt;/td&gt;
&lt;td&gt;9.5 KB&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;94.8 KB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The skills list was eating over a quarter of our context budget. And usage data showed &lt;strong&gt;45% of installed skills had never been triggered&lt;/strong&gt; — they were just dead weight on every request.&lt;/p&gt;

&lt;p&gt;At 53 skills this was annoying but survivable. At 500? The system would collapse.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Tension
&lt;/h2&gt;

&lt;p&gt;The business needs &lt;strong&gt;breadth&lt;/strong&gt; — the more skills available, the more capable the agent. But the runtime needs &lt;strong&gt;precision&lt;/strong&gt; — each request should only carry what's relevant.&lt;/p&gt;

&lt;p&gt;We also had a hard constraint: &lt;strong&gt;LLM prefix caching&lt;/strong&gt;. The cache matches tokens from the start of the sequence. If you change anything in the system prefix, everything after that point becomes a cache miss. Skills sit near the front of the prefix, before all conversation history. Touching them means rewriting the cache for the entire conversation — exactly the opposite of what we want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 1: Two-Layer Architecture (Pinned + Dynamic Discovery)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; Split skills into a "pinned" tier (10-15 high-frequency ones, always injected) and an "ecosystem" tier (hundreds, discovered on demand). Add a new built-in tool for skill discovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it failed:&lt;/strong&gt; This required modifying the agent framework's source code — its configuration format, adding a new built-in tool, changing the prompt assembly pipeline.&lt;/p&gt;

&lt;p&gt;The framework we use ships updates almost every other day. Maintaining a fork against that velocity is a maintenance nightmare. Even with zero feature work on our end, we'd be constantly rebasing against upstream changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision:&lt;/strong&gt; No approach that requires forking or modifying the core framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 2: Use a Skill to Manage Skills
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; Completely non-invasive. Move low-frequency skills out of the scan directory (so they're not injected), and create a "skill-router" skill that searches through archived skills when needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;High-frequency skills → standard directory (injected)
Low-frequency skills → archive directory (not injected)
skill-router → searches archive via grep when agent can't handle a request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was elegant — zero code changes, just filesystem operations plus one regular skill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it failed:&lt;/strong&gt; We tracked trigger reliability across our production data and found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skill trigger rate based on description matching alone: &lt;strong&gt;&amp;lt; 30%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;With cross-references from other skills: &lt;strong&gt;70-80%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Even our best-documented knowledge-base skill (strong description + referenced by multiple other skills) was &lt;strong&gt;missed ~25% of the time&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root cause: agents are probabilistic. Building a critical path on "the agent realizes it needs to search for help" has a reliability ceiling that's too low for production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision:&lt;/strong&gt; Critical routing can't depend on the agent's probabilistic judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 3: Dynamic Injection via Plugin Hook
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; Use the framework's plugin system (specifically a context assembly hook) to dynamically choose which skills to inject based on the user's message. Instead of a static skill list, compute the relevant subset each time.&lt;/p&gt;

&lt;p&gt;This felt right — deterministic code picks the skills, not the agent's judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it failed:&lt;/strong&gt; Remember the cache constraint? The skills list sits in the system prefix, &lt;strong&gt;before&lt;/strong&gt; all conversation history. Dynamically changing it means the prefix is different on every request, which cascades into a full cache miss for all historical messages.&lt;/p&gt;

&lt;p&gt;We ran the numbers: saving 24.9 KB of skill space but causing 50-100 KB of cache rewrites on every turn. Net negative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision:&lt;/strong&gt; The system prefix must remain 100% stable. No dynamic modifications to anything before the conversation history.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 4: Append to End (The Solution) ✅
&lt;/h2&gt;

&lt;p&gt;The breakthrough was reframing the problem. Instead of &lt;em&gt;replacing&lt;/em&gt; part of the prefix, we &lt;strong&gt;append&lt;/strong&gt; to the end of the message sequence — after all conversation history.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Fixed prefix: tools + pinned skills + user files + instructions]  → NEVER CHANGES (cache hit)
[Conversation history]                                              → cache hit  
[Additional Skills: dynamically matched this turn]                  → small new addition
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's why this works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prefix stays 100% stable&lt;/strong&gt; — full cache hit on every turn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic content is append-only&lt;/strong&gt; — minimal cache write cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic matching&lt;/strong&gt; — code picks the skills, not the agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scales indefinitely&lt;/strong&gt; — ecosystem can have thousands of skills, but each request only carries 2-3 relevant ones&lt;/li&gt;
&lt;/ol&gt;
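&lt;p&gt;Concretely, the per-turn assembly looks something like this. A sketch under assumptions: &lt;code&gt;buildMessages&lt;/code&gt; and the message shapes are our illustration of the pattern, not a framework API:&lt;/p&gt;

```javascript
// Sketch of append-only assembly. The prefix and history are reused verbatim,
// so the provider's prefix cache hits; only the tail is new each turn.
function buildMessages(fixedPrefix, history, matchedSkills) {
  const messages = [
    { role: "system", content: fixedPrefix }, // byte-identical every turn
    ...history,                               // grows append-only
  ];
  if (matchedSkills.length > 0) {
    const list = matchedSkills
      .map((s) => `- ${s.name}: ${s.description} (${s.location})`)
      .join("\n");
    messages.push({
      role: "system",
      content: "Here are some additional skills that may be relevant to this request:\n" + list,
    });
  }
  return messages;
}
```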

&lt;h3&gt;
  
  
  The Matching Layer
&lt;/h3&gt;

&lt;p&gt;We use embedding similarity to match the user's message against pre-computed skill description vectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In the assembly hook&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;queryVector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;matchedSkills&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosineSimilaritySearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queryVector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;skillIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill index is pre-computed at install time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"xhs-note-creator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Create Xiaohongshu note content..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"~/.agent/skills-archive/xhs-note-creator/SKILL.md"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"embedding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.012&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.034&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.056&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Index size: 500 skills × 1536-dim float32 ≈ 3 MB (totally manageable)&lt;/li&gt;
&lt;li&gt;Matching latency: ~100-200ms per request&lt;/li&gt;
&lt;li&gt;Cost: ~$0.02 per million tokens (negligible)&lt;/li&gt;
&lt;/ul&gt;
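&lt;p&gt;For completeness, here's a plausible implementation of the &lt;code&gt;cosineSimilaritySearch&lt;/code&gt; helper used above — the real one isn't shown in this post. Brute force is fine at this scale: 500 skills × 1536 dims is a few hundred thousand multiplies per query.&lt;/p&gt;

```javascript
// Brute-force top-K cosine similarity over the pre-computed skill index.
function cosineSimilaritySearch(queryVector, skillIndex, topK) {
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const qNorm = norm(queryVector);
  return skillIndex
    .map((skill) => {
      const dot = queryVector.reduce((s, x, i) => s + x * skill.embedding[i], 0);
      return { ...skill, score: dot / (qNorm * norm(skill.embedding)) };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

&lt;p&gt;If the index ever grows past a few thousand entries, swap this for an approximate nearest-neighbor library — but don't add that dependency before you need it.&lt;/p&gt;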

&lt;h3&gt;
  
  
  Why Append Works for the Agent
&lt;/h3&gt;

&lt;p&gt;If you've done RAG, this pattern is familiar. The agent sees:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Here are some additional skills that may be relevant to this request: [skill descriptions]"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It then reads the corresponding SKILL.md files and executes normally. From the agent's perspective, it's just extra context — no behavioral changes needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Got Wrong Along the Way
&lt;/h2&gt;

&lt;p&gt;A few pitfalls worth noting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We overestimated agent self-awareness.&lt;/strong&gt; We assumed the agent would reliably recognize "I don't know how to do this, let me search for a skill." In practice, it either hallucinated an answer or just apologized — searching was the last resort, not the first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We underestimated cache sensitivity.&lt;/strong&gt; Our initial mental model was "save tokens in the prefix → save money." But prefix caching means the &lt;em&gt;stability&lt;/em&gt; of the prefix matters more than its &lt;em&gt;size&lt;/em&gt;. A 90 KB stable prefix is cheaper than a 70 KB prefix that changes every turn.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We almost built a fork.&lt;/strong&gt; The two-layer architecture was technically clean, but maintaining a fork of a fast-moving open source project is a long-term tax that compounds. Using the official plugin system — even if it's less flexible — was the right call.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Rollout Plan
&lt;/h2&gt;

&lt;p&gt;We're being deliberate about timing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Now (&amp;lt; 60 skills):&lt;/strong&gt; No changes needed. The overhead is acceptable and we're collecting usage data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100+ skills:&lt;/strong&gt; Deploy the routing extension. Move low-frequency skills to archive. Validate matching accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;500+ skills:&lt;/strong&gt; Automate index management. Add user-profile-based pinning. Connect to the skill registry for remote discovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Injection cost is the hidden tax of plugin ecosystems.&lt;/strong&gt; Every plugin/skill/tool added to an AI agent's context has a per-request cost, even when unused.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache-friendliness is a first-class architectural constraint.&lt;/strong&gt; For LLM-based systems, prefix stability matters more than prefix size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't build critical paths on probabilistic behavior.&lt;/strong&gt; If your system relies on the agent "deciding" to do the right thing, measure the actual trigger rate before shipping.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Append &amp;gt; Replace for dynamic context.&lt;/strong&gt; When you need to add context without breaking caches, treat it like RAG — add to the end, not the middle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resist the fork.&lt;/strong&gt; Plugin/extension systems exist for a reason. The flexibility tax of a fork almost always exceeds the flexibility gain.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;This architecture now powers part of how we think about skill scaling at &lt;a href="https://www.imaclaw.ai" rel="noopener noreferrer"&gt;www.imaclaw.ai&lt;/a&gt;, where we build AI creative agents with 50+ multimodal skills. The pattern should generalize to any LLM agent system dealing with growing plugin ecosystems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>openai</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
