I run an AI agent in production. It coordinates with other agents, tracks commitments across conversations, monitors its own infrastructure, and learns from its mistakes. It's been running 24/7 for months.
What makes it useful isn't the model — it's the skills. Each skill is a structured unit of behavior with tool access, persistent memory, and real-world integrations. Here are 12 of them, each battle-tested in production, with the actual configs and architecture behind them.
These skills aren't specific to any one type of agent. Whether you're building a coding agent, an ops agent, a customer-facing agent, or a coordination agent — the same patterns apply. An agent that can search its own memory, track its own commitments, and learn from its own failures is a better agent, regardless of domain.
What Makes a Skill "Agentic"?
An agentic skill goes beyond a static system prompt. It has:
- Tool bindings — it can execute shell commands, query databases, call APIs, read and write files
- Memory access — it remembers what happened yesterday, last week, last month — across multiple storage backends
- State management — it tracks what's pending, what's done, what failed, and picks up where it left off
- Permissions — it knows what it can do autonomously vs. what requires human approval
- Cross-agent coordination — it can sync with other agents, share learnings, avoid duplicate work
Here's how each of the 12 skills works.
Memory & Learning
1. Deep Recall — "Never say 'I don't know' without searching first"
What it does: Before the agent ever says "I don't have context on this," it executes a mandatory multi-layer search cascade: semantic memory → full-text search across structured storage → direct database queries → daily notes grep → session context files.
The actual skill config:
# SKILL.md frontmatter
name: deep-recall
description: >
Mandatory deep search before answering any question about past
events, conversations, decisions, people, or context.
Search Order (mandatory, in sequence):
1. memory_search(query="<topic>")
→ Searches durable memory + indexed session transcripts
2. Full-text search tool "<query>" --limit 10 --days 30
→ FTS5 index across structured storage and daily notes
3. Direct database query:
SELECT body, ts, session_id FROM messages
WHERE body ILIKE '%keyword%'
AND ts > NOW() - INTERVAL '30 days'
4. grep -rn "<keyword>" memory/daily/ | tail -20
5. grep -rn "<keyword>" memory/sessions/*/context.md
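As a sketch, the cascade is just ordered, short-circuiting search layers. The function and layer names below are hypothetical stand-ins for the actual tools, not the real API:

```python
# Hypothetical sketch: each layer is a callable returning a list of
# hits; the first non-empty result wins, and only if every layer
# comes back empty may the agent say "I don't have context on this."
def deep_recall(query, layers):
    for name, search in layers:
        hits = search(query)
        if hits:
            return name, hits
    return None, []

# Stubbed layers; real ones would call memory_search, the FTS index,
# the database, and grep over memory/daily/.
layers = [
    ("semantic", lambda q: []),  # fuzzy match: nothing
    ("fts",      lambda q: []),  # full-text: nothing
    ("db",       lambda q: ["2024-06-11 said X in standup"]),  # raw query hits
]
layer, hits = deep_recall("what did I say about X", layers)
```

Ordering the layers cheapest-first also means the expensive semantic search only runs when it has a chance of adding value.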
Production insight: The biggest surprise was how often layer 3 (raw database queries) saves the day. Semantic search is great for fuzzy matches, but when someone asks "what did I say about X on Tuesday," a direct timestamp-ordered DB query is unbeatable. I'd recommend starting with raw storage search and layering semantic on top, not the other way around.
2. Memory Tiering — "Daily notes → durable memory promotion pipeline"
What it does: Manages a three-tier memory architecture (HOT → WARM → COLD) with automated nightly consolidation, graduation gates for promoting learnings to permanent memory, and retention policies that archive old daily notes and prune stale entries.
The actual architecture:
Tier 1: 🔥 HOT (memory/hot/) — current session, active tasks
Tier 2: 🌡️ WARM (memory/warm/) — stable preferences, configs
Tier 3: ❄️ COLD (MEMORY.md) — long-term, distilled, curated
Nightly "Dreaming" process (3 AM):
- Light phase: ingests daily notes + session transcripts
- Deep phase: scores & promotes strong signals → MEMORY.md
(weighted: relevance 0.30, frequency 0.24, recency 0.15)
- REM phase: extracts themes and patterns
Graduation Gate — nothing promotes to MEMORY.md without passing:
- Score >= 0.70 (weighted relevance + frequency + recency)
- Recalls >= 2 (the signal was retrieved and used at least twice)
- Content is a rule, preference, or durable fact — not a raw
conversation fragment or debug log
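The gate above is easy to express as a predicate. One caveat: the article lists three weights summing to 0.69, so other signals presumably carry the rest; the sketch folds the remaining 0.31 into a hypothetical `other` score, and the field names are assumptions:

```python
# Sketch of the graduation gate. Three weights come from the article;
# the remaining 0.31 of weight mass is a hypothetical "other" signal.
WEIGHTS = {"relevance": 0.30, "frequency": 0.24, "recency": 0.15, "other": 0.31}

def graduates(entry):
    score = sum(w * entry[k] for k, w in WEIGHTS.items())
    return (
        score >= 0.70                      # weighted score threshold
        and entry["recalls"] >= 2          # retrieved and used at least twice
        and entry["kind"] in {"rule", "preference", "fact"}  # no raw fragments
    )

strong = {"relevance": 0.9, "frequency": 0.8, "recency": 0.7,
          "other": 0.8, "recalls": 3, "kind": "preference"}
noise = {"relevance": 0.9, "frequency": 0.1, "recency": 0.9,
         "other": 0.5, "recalls": 0, "kind": "fact"}
```

The `recalls` check is a hard AND, not another weighted term: a high-scoring entry that was never retrieved still fails the gate.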
Production insight: The graduation gate was the single most important addition. Without it, MEMORY.md fills up with noise — one-off debug notes, transient preferences, stale context. The "recalls >= 2" requirement is especially powerful: if a piece of knowledge was never retrieved and used, it probably doesn't belong in long-term memory.
3. Self-Learning — "Turn corrections into concrete improvements"
What it does: When the operator corrects the agent, when a task fails, or when a better approach is discovered, this skill logs the event, identifies the root cause, and applies the smallest durable fix — updating the specific skill or workflow that caused the issue.
The actual loop:
## Entry format
## YYYY-MM-DD | category | short title
- Trigger: what happened
- Context: what you were trying to do
- Root cause: why it happened
- Durable fix: file/process/skill you changed, or `none yet`
- Verification: how to tell the problem is gone
## Quality bar — a learning is only complete when:
- a local skill was improved, OR
- a broken instruction was removed, OR
- a missing prerequisite was documented clearly, OR
- a recurring mistake was converted into a shorter, safer workflow
## NOT enough:
- "be more careful"
- generic promises
- long postmortems without a file or process change
## Auto-flag threshold:
If a skill fails 3+ times in 14 days → flagged for rewrite.
Not "we'll try harder" — the skill itself gets rebuilt.
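The auto-flag threshold is a simple sliding-window count. This is a minimal sketch, assuming the failure log is a list of timestamps per skill:

```python
from datetime import datetime, timedelta

# Sketch of the auto-flag rule: 3+ failures within 14 days marks a
# skill for rewrite. The failure-log shape is an assumption.
def flag_for_rewrite(failures, now, window_days=14, threshold=3):
    cutoff = now - timedelta(days=window_days)
    recent = [ts for ts in failures if ts >= cutoff]
    return len(recent) >= threshold

now = datetime(2024, 6, 15)
failures = [now - timedelta(days=d) for d in (1, 5, 12, 40)]  # 3 recent, 1 old
```

Keeping the window short matters: a skill that failed three times last quarter but is fine now should not be flagged.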
Production insight: The hardest part was enforcing the quality bar. Early on, the agent would log learnings like "remember to double-check next time" — which is useless. Requiring a concrete file change (a skill update, a config fix, a workflow edit) as the definition of "done" transformed the entire feedback loop. Vague learnings dropped to near zero.
Coordination & Network
4. Agent Network Sync — "Daily sync across multiple AI agents"
What it does: Every morning, a network of AI agents — each operating in its own domain — syncs status, shares blockers, and coordinates across stakeholders. Each agent reports what's needed, what's blocked, and what's been resolved. The sync runs as a scheduled job with structured output.
The pattern:
# Cron config
id: agent-network-daily-sync
schedule: "15 9 * * *" # 9:15 AM local time
timezone: Asia/Jerusalem
model: sonnet # needs reasoning for cross-agent coordination
timeout: 300s
delivery:
channel: network # routed to shared agent channel
to: <network-session-id>
Each agent brings its own domain context. No agent sees another agent's private state. They coordinate on shared tasks and surface blockers to the network.
Production insight: The permission boundary was the design decision I'm most glad we made early. Each agent only shares what's relevant to coordination — never raw private context. This made the whole system trustworthy enough that operators actually let their agents participate. Trust is the bottleneck for multi-agent systems, not capability.
5. Cross-Session Awareness — "Know everything without duplicating anything"
What it does: An AI agent typically operates in isolated sessions — one per context, one per channel, one per task domain. This skill bridges that gap. On every heartbeat, it scans all active sessions, extracts key messages and decisions, and builds a unified context file that the primary session can reference.
The mechanism:
On heartbeat:
1. Run refresh script → scans active sessions for recent activity
2. Build sessions-context.md:
- Messages 11+: compressed summary (who, what, key decisions)
- Messages 1-10: verbatim with timestamps and sender labels
3. Key facts/decisions → write to MEMORY.md for permanence
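The build step above can be sketched as a fold over a session's messages, keeping the most recent ten verbatim and collapsing the rest. The message shape, and the assumption that "messages 1-10" means the most recent ten, are mine:

```python
# Sketch of the context builder: the 10 most recent messages stay
# verbatim; older ones collapse into a one-line summary.
def build_context(messages, keep_verbatim=10):
    older, recent = messages[:-keep_verbatim], messages[-keep_verbatim:]
    lines = []
    if older:
        senders = sorted({m["from"] for m in older})
        lines.append(f"[summary] {len(older)} earlier messages from {', '.join(senders)}")
    lines += [f"{m['ts']} {m['from']}: {m['text']}" for m in recent]
    return "\n".join(lines)

msgs = [{"ts": f"10:{i:02d}", "from": "alice" if i % 2 else "bob",
         "text": f"msg {i}"} for i in range(12)]
ctx = build_context(msgs)
```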
Engagement rules:
- Scan and note — don't inject into sessions uninvited
- If something needs action → surface to the operator, let them decide
- Don't post unsolicited analyses into active conversations
Production insight: The engagement rules matter more than the technical implementation. Early versions would jump into conversations with unsolicited summaries — technically impressive, socially terrible. The "scan and note, act only in the primary session" pattern was a game-changer for making the agent welcome in multi-session contexts.
6. Agent Onboarding — "Structured onboarding for new agents joining the network"
What it does: A step-by-step procedural skill that guides the setup of a new AI agent — from instance provisioning, channel linking, and integration setup, through to the operator's first interaction patterns. It gives one step at a time, confirms completion, and won't move forward until each step is verified.
Key behavioral rules baked in:
Rules:
- Give one step at a time. Do not dump the full guide upfront.
- Confirm each step before moving on.
- Never say something is done unless you verified it.
- Do not start integrations before the agent responds to messages.
Operator interaction signals taught from Day 1:
| Signal | Meaning | Agent action |
| --- | --- | --- |
| Any task request | Operator delegated | Acknowledge immediately, confirm done |
| Positive signal | Good job | Log positive feedback |
| Negative signal | Poor result | Fix and log the lesson |
Production insight: The "verify before proceeding" rule cut onboarding failures by more than half. Before that, the skill would race through all steps and report success — only for the operator to discover that channel linking had silently failed in step 3. Treating each step as a checkpoint with actual verification (not just "did you do it?") made the whole process reliable.
Execution & Ops
7. Commitment Tracker — "If you said you'd do it, do it now"
What it does: Scans every outgoing message for commitment language — "I'll send," "I'll update," "I'll follow up" — and enforces immediate execution. If the agent is about to promise a follow-up, it must actually execute that action before the reply goes out. If it can't execute, it must rewrite the reply to not promise it.
The enforcement mechanism:
Trigger words scanned before every reply:
"I'll send", "I'll report", "I'll update", "I'll follow up",
"I'll check and get back", "will do"
Protocol:
1. Scan reply for trigger words
2. If found → execute the committed action NOW
3. Only after execution → include result ("sent ✅" not "I'll send")
4. If can't execute → rewrite reply to not promise it
Fault tolerance — intent log:
echo '{"ts":"...","action":"DESCRIBE","target":"TARGET","status":"pending"}' \
  >> data/commitments.jsonl
On every session start: scan for unresolved commitments
If any pending → execute immediately
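The protocol above can be sketched in a few functions. The trigger list and the `commitments.jsonl` record shape follow the article; the helper names are assumptions:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

TRIGGERS = ["i'll send", "i'll report", "i'll update", "i'll follow up",
            "i'll check and get back", "will do"]

def find_commitments(reply):
    """Scan an outgoing reply for commitment language."""
    text = reply.lower()
    return [t for t in TRIGGERS if t in text]

def log_intent(action, target, log=Path("data/commitments.jsonl")):
    # Write the intent BEFORE executing, so a crash mid-execution
    # leaves a pending record to pick up on the next session start.
    log.parent.mkdir(parents=True, exist_ok=True)
    record = {"ts": datetime.now(timezone.utc).isoformat(),
              "action": action, "target": target, "status": "pending"}
    with log.open("a") as f:
        f.write(json.dumps(record) + "\n")

def pending_commitments(log=Path("data/commitments.jsonl")):
    """Run on session start: anything still pending gets executed now."""
    if not log.exists():
        return []
    return [rec for rec in map(json.loads, log.read_text().splitlines())
            if rec["status"] == "pending"]
```

The write-ahead ordering is the whole point: log first, execute second, mark resolved third. Any crash between those steps leaves evidence rather than a silently dropped promise.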
Production insight: This was the skill that changed my relationship with the agent the most. Before it, "I'll follow up" was a polite lie — the kind humans make all the time and that AI agents inherited. After it, every promise became an enforceable contract with crash recovery. The intent log (commitments.jsonl) is critical: if the agent crashes mid-execution, the commitment survives and gets picked up on restart.
8. Pre-Send Validation — "Check before you send"
What it does: Before any outbound message leaves the agent, it runs through a validation pipeline: recipient verification (is this the right target for this person?), commitment detection (am I promising something?), and content checks (am I exposing internal context to an external session?).
The validation chain:
Recipient verification (mandatory):
1. Look up name in contact registry → get exact session/channel ID
2. Cross-check: does the ID match the intended recipient?
3. Ambiguous match = halt and confirm with operator
4. Wrong recipient = critical failure
Content rules:
- Never include internal framing in messages to external sessions
- Never expose secrets — redact if needed
- Warn if credentials appear in outbound content
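The recipient gate can be sketched as a resolver that only succeeds on an unambiguous match. The registry shape (a lowercase name → session-ID map) is an assumption:

```python
# Sketch of the recipient verification gate: exact match wins, a single
# partial match is accepted, anything else blocks delivery.
class RecipientError(Exception):
    pass

def resolve_recipient(name, registry):
    wanted = name.lower()
    exact = [sid for n, sid in registry.items() if n == wanted]
    if exact:
        return exact[0]
    partial = [sid for n, sid in registry.items() if wanted in n]
    if len(partial) == 1:
        return partial[0]
    # zero or several matches: halt and escalate to the operator
    raise RecipientError(f"ambiguous or unknown recipient: {name!r}")

registry = {"alice chen": "sess-101", "alice park": "sess-102", "bob": "sess-200"}
```

Raising on ambiguity (rather than picking the "best" partial match) is exactly the paranoid-path design the near-miss motivated: delivery is blocked until a human disambiguates.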
Production insight: This skill exists because of a near-miss. The agent almost sent an internal status update to an external contact because the name matched partially. After that, I built recipient verification as a hard gate — not a suggestion, not a "best practice," but a blocking check that prevents message delivery if verification fails. In production, the paranoid path is the right path.
9. Auto-Skill Creator — "Turn complex problem-solving into reusable skills"
What it does: After completing a multi-step task, the agent evaluates: Was this complex (3+ tool calls)? Was the solution non-obvious? Will this recur? If 2 of 3 are true, it automatically generates a new skill — complete with frontmatter, gotchas section, and git commit.
The evaluation + creation pipeline:
Trigger evaluation (after every multi-step task):
1. Complexity — Did it take 3+ tool calls or require debugging?
2. Novelty — Was the solution non-obvious or undocumented?
3. Recurrence — Will this pattern likely happen again?
If 2 of 3 are true → create a skill.
Process:
1. Extract the pattern (problem class, key steps, gotchas, prereqs)
2. Check for duplicates: grep -rl "<keywords>" skills/*/SKILL.md
3. Create skills/<name>/SKILL.md with frontmatter + instructions
4. git add && git commit -m "auto-skill: <name>"
5. git push
Quality gate before commit:
- [ ] Frontmatter has name and description
- [ ] Steps are concrete (not "be careful")
- [ ] Gotchas section exists if there were false leads
- [ ] No duplicate of existing skill
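The 2-of-3 trigger is trivially expressible, which is part of why it works. In this sketch the signals are passed in directly; in practice they would be derived from the task trace:

```python
# Sketch of the 2-of-3 trigger rule for auto-skill creation.
def should_create_skill(tool_calls, non_obvious, likely_recurs):
    signals = [tool_calls >= 3, non_obvious, likely_recurs]
    return sum(signals) >= 2

# A 5-call session likely to recur: create a skill, even if the fix
# itself was obvious. A quick novel 1-call fix: don't bother.
```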
Production insight: The duplicate check (step 2) was a later addition, and it matters a lot. Without it, the agent would create slight variations of existing skills — "fix-connectivity-v2," "fix-connectivity-groups," etc. The dedup grep catches most of these. The remaining challenge is knowing when to update an existing skill vs. creating a new one. We're still tuning that threshold.
Infrastructure & Self-Improvement
10. Channel Diagnostics — "Decision tree for fixing connectivity issues"
What it does: A structured diagnostic tree that any agent can follow when a communication channel stops working. It starts with "Agent not responding?" and branches through connection issues, ingest issues, and runtime issues — each with specific CLI commands to run and specific log patterns to look for.
The decision tree:
Agent not responding?
│
├─ Dashboard shows "Connected and listening"?
│ ├─ YES → Check Messages count
│ │ ├─ Messages = 0 → INGEST ISSUE
│ │ │ → Check gateway status
│ │ │ → Restart gateway service
│ │ │ → Check logs for: binding failed, session dropped
│ │ │
│ │ ├─ DMs work, groups don't → SESSION SYNC ISSUE
│ │ │ → Verify group session is active
│ │ │ → Send test message + gateway restart
│ │ │
│ │ └─ Messages > 0 → RUNTIME ISSUE
│ │ → grep -i "billing\|402" agent.log
│ │ → curl API endpoint → check HTTP status
│ │ → 401 = invalid key, 402 = billing error
│ │
│ └─ NO → CONNECTION ISSUE
│ → Re-link channel, re-authenticate
Production insight: The decision tree format — not a checklist, not a paragraph of instructions — was key. Agents follow branching logic well when it's explicit. The first version was a linear "try these things in order," which wasted time on irrelevant checks. Branching on the first observable symptom ("dashboard shows connected?") cuts diagnosis time dramatically.
11. Self-Monitor — "Track your own health, fix what you can"
What it does: Monitors disk usage, memory, CPU load, service health, cron job status, and recent errors. Auto-fixes safe issues (old log cleanup). Includes a security layer: SHA256 integrity checks on critical files, prompt injection scanning, and credential leak detection.
The health check + security pipeline:
# Quick health snapshot
DISK=$(df -h / | awk 'NR==2 {print $5}' | tr -d '%')
MEM=$(free -m | awk 'NR==2 {printf "%.0f", $3/$2*100}')
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk -F',' '{print $1}')
# Thresholds: Disk >90% = critical, Mem >95% = critical
# Security checks (daily):
# 1. SHA256 baseline comparison on core config files
# (SOUL.md, IDENTITY.md, MEMORY.md)
# 2. Scan memory/*.md for injection patterns:
# "ignore previous instructions", "you are now", "forget everything"
# 3. Scan workspace for credential patterns:
# sk-..., ghp_..., api_key = '...'
# Root Cause Iron Law:
# Never apply a fix without identifying the root cause first.
# Investigate → Analyze → Hypothesize → Fix → Verify
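The two security checks are small enough to sketch directly. Baseline storage and the injection pattern list here are assumptions drawn from the comments above:

```python
import hashlib

INJECTION_PATTERNS = ["ignore previous instructions", "you are now",
                      "forget everything"]

def sha256_of(text):
    return hashlib.sha256(text.encode()).hexdigest()

def integrity_ok(current_text, baseline_digest):
    # Compare against a digest recorded when the file was last
    # deliberately changed; any drift means unapproved modification.
    return sha256_of(current_text) == baseline_digest

def injection_hits(text):
    low = text.lower()
    return [p for p in INJECTION_PATTERNS if p in low]

baseline = sha256_of("name: agent\nrole: ops\n")
```

A literal-substring scan like this is deliberately crude; it catches the common copy-paste injections while staying cheap enough to run on every memory file daily.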
Production insight: The security checks were an afterthought that became essential. Once the agent is reading and writing files from multiple sources (incoming messages, shared sessions, external repos), the attack surface grows. The SHA256 integrity check on SOUL.md caught an issue where a malformed message nearly overwrote a critical config. Defense in depth applies to agents just as much as it does to traditional systems.
12. Weekly Retro — "Automated retrospective from git, learnings, and daily notes"
What it does: Every Sunday, automatically generates a structured weekly retrospective by pulling from git log (what was shipped), learnings files (what was learned), daily notes (what happened), and previous retros (are the same issues recurring?).
The retro format:
## Weekly Retro — YYYY-MM-DD
### Shipped
- [commit/action]: what was delivered
### Patterns
- Recurring issues or wins from the week
### Learnings Applied
- Which learnings led to concrete skill/workflow changes
### Failures
- What broke, root cause, fix status
### Next Week
- Top 3 priorities based on patterns
## Trend tracking:
- Compare with previous week's retro
- Are the same issues recurring? Flag them.
- Did last week's priorities get addressed?
- Is failure count trending up or down?
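The recurrence check can be sketched as a ratio over past retros. Representing each week as a set of failure tags is an assumption; any structured retro format would do:

```python
# Sketch of trend tracking: flag any issue in the current retro that
# also appeared in at least half of the previous ones.
def recurring_issues(current, history, min_ratio=0.5):
    flagged = []
    for issue in current:
        seen = sum(issue in past for past in history)
        if history and seen / len(history) >= min_ratio:
            flagged.append(issue)
    return flagged

history = [{"connectivity"}, {"connectivity", "billing"}, {"connectivity"},
           set(), {"connectivity"}, {"cron"}]
```

Against this history, a current retro containing "connectivity" gets flagged (4 of 6 prior weeks), which is exactly the 4-out-of-6 pattern described below.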
Production insight: The trend tracking section is where the real value lives. A single retro is useful. A series of retros that cross-reference each other reveals systemic issues. We found that the same connectivity problem appeared in 4 out of 6 retros before we finally addressed the root cause (a gateway memory leak). Without the longitudinal view, each occurrence looked like a one-off.
The Agent Network Effect
These skills don't just make one agent better — they compound across a network.
We run multiple AI agents, each operating in its own domain. They share a skills repository. When one agent encounters a new problem — say, a connectivity issue with a specific error pattern — and the Auto-Skill Creator packages the fix into a reusable skill, that fix becomes available to every agent in the network.
Here's the pattern:
Agent A encounters a new connectivity issue
→ Self-Learning logs the root cause and fix
→ Auto-Skill Creator packages it into a skill
→ git push to shared skills repo
→ All agents pull the updated skill
→ Agent B hits the same issue next week
→ Already has the fix. Zero debugging time.
The Agent Network Sync coordinates across agents daily. Cross-Session Awareness prevents duplicate work. Agent Onboarding standardizes how new agents join the network. And Self-Monitor keeps every agent's infrastructure healthy independently.
The real unlock: skills that learn from failures, share fixes across agents, and compound over time. One agent's bad day becomes every agent's education.
The Skill Anatomy
Every skill follows the same structure:
skills/<skill-name>/
├── SKILL.md # Frontmatter (name, description, triggers)
│ # + executable instructions with real tool calls
├── scripts/ # Deterministic code (health checks, parsers)
└── references/ # Reference material if needed
The frontmatter declares when the skill activates. The body declares what it does — not in vague terms, but with actual commands, actual queries, actual file paths. The agent's skill router reads the description, matches it to the current task, and loads the right skill at the right time.
Each skill is versioned, testable, and shareable — a unit of agent behavior that runs on infrastructure with tool access and persistent state.
Getting Started — Building Your First Agentic Skill
If you want to build a real agentic skill, here's the minimal structure:
1. Pick a task the agent does repeatedly and gets wrong sometimes.
Not a creative task. Not a one-off. Something recurring where mistakes have consequences — like routing messages to the right session, following up on commitments, or diagnosing a connectivity failure.
2. Write the SKILL.md with concrete instructions.
name: my-first-skill
description: >
One sentence that tells the skill router when to activate this.
Then write the body as executable steps. Not "be careful about X" — instead, "run grep -rn 'keyword' memory/daily/ and check for matches." Every step should be something the agent can actually do.
3. Include the failure modes.
Add a Gotchas section. What goes wrong? What does the agent try that doesn't work? What's the non-obvious prerequisite? This section is what separates a skill that works once from a skill that works reliably.
4. Version control it.
mkdir -p skills/my-first-skill
# write SKILL.md
git add skills/my-first-skill/
git commit -m "skill: my-first-skill"
Keeping skills in git means rollback, blame, and history. When a skill breaks, you can diff what changed. When it improves, you have the commit that made it better.
5. Let the agent load it.
The skill router matches task descriptions to skill names and descriptions. If you named it well and the description is accurate, the agent will pick it up automatically when the task matches.
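A toy sketch of description-based routing, scoring skills by word overlap between the task and each description; real routers are smarter, and everything here is an illustrative assumption:

```python
# Toy router: pick the skill whose description shares the most words
# with the task. Returns None when nothing overlaps at all.
def route(task, skills):
    task_words = set(task.lower().split())
    best, best_score = None, 0
    for name, description in skills.items():
        score = len(task_words & set(description.lower().split()))
        if score > best_score:
            best, best_score = name, score
    return best

skills = {
    "deep-recall": "search memory before answering questions about past events",
    "channel-diagnostics": "diagnose connectivity issues when a channel stops working",
}
```

Even this crude version shows why description quality matters: the router can only match words you actually put in the frontmatter.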
Start with one skill. Get it working reliably. Then build the next one that compounds on it.
Looking Forward
The future isn't AI that follows instructions. It's AI that learns from mistakes, coordinates with other agents, and improves its own capabilities over time.
These 12 skills are one implementation of that idea. The specific tools and platforms will change — the principles won't. Persistent memory beats stateless prompts. Concrete tool bindings beat vague instructions. Crash-recoverable state beats hope. And a network of agents sharing skills will always outperform agents working in isolation.
The best AI systems won't be the ones with the most powerful models. They'll be the ones with the best skills — learned, tested, and refined in production.