<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Thiago da Silva Teixeira</title>
    <description>The latest articles on DEV Community by Thiago da Silva Teixeira (@thiagoteixeiradev).</description>
    <link>https://dev.to/thiagoteixeiradev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2454726%2Fbe3912cb-ce4f-4351-af86-a85e936bbdb0.jpg</url>
      <title>DEV Community: Thiago da Silva Teixeira</title>
      <link>https://dev.to/thiagoteixeiradev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thiagoteixeiradev"/>
    <language>en</language>
    <item>
      <title>Are "Agent Skills" the Secret Sauce for AI Productivity?</title>
      <dc:creator>Thiago da Silva Teixeira</dc:creator>
      <pubDate>Mon, 16 Feb 2026 21:46:28 +0000</pubDate>
      <link>https://dev.to/thiagoteixeiradev/are-agent-skills-the-secret-sauce-for-ai-productivity-73f</link>
      <guid>https://dev.to/thiagoteixeiradev/are-agent-skills-the-secret-sauce-for-ai-productivity-73f</guid>
      <description>&lt;p&gt;A massive new study titled &lt;strong&gt;SKILLSBENCH&lt;/strong&gt; has just been released, and it’s a must-read for anyone building or using AI agents. As LLMs evolve into autonomous agents, the industry is racing to find the best way to help them handle complex, domain-specific tasks without the high cost of fine-tuning.&lt;/p&gt;

&lt;p&gt;The answer? &lt;strong&gt;Agent Skills&lt;/strong&gt;—modular packages of procedural knowledge (instructions, code templates, and heuristics) that augment agents at inference time.&lt;/p&gt;
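
&lt;p&gt;To make that concrete: a Skill is typically shipped as a small folder the agent loads at inference time. Below is a minimal sketch, loosely following the &lt;code&gt;SKILL.md&lt;/code&gt; convention used by tools like Claude Code; the file names and the domain are illustrative, not taken from the paper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invoice-extraction/      # hypothetical skill package
├── SKILL.md             # name, description, step-by-step procedure
├── templates/
│   └── extract.py       # reusable code template the agent adapts
└── heuristics.md        # domain rules, e.g. "totals must equal the sum of line items"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;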

&lt;h3&gt;
  
  
  📊 The Study at a Glance
&lt;/h3&gt;

&lt;p&gt;Researchers tested &lt;strong&gt;7 agent-model configurations&lt;/strong&gt; (including &lt;strong&gt;Claude Code&lt;/strong&gt;, &lt;strong&gt;Gemini CLI&lt;/strong&gt;, and &lt;strong&gt;Codex&lt;/strong&gt;) across &lt;strong&gt;84 tasks&lt;/strong&gt; in 11 different domains. They compared three conditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Skills&lt;/strong&gt;: The agent flies solo with just instructions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Curated Skills&lt;/strong&gt;: Human-authored, high-quality procedural guides.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-Generated Skills&lt;/strong&gt;: The agent is asked to write its own guide before starting.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  💡 Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Curated Skills are a Game Changer&lt;/strong&gt;: Adding human-curated Skills boosted average pass rates by &lt;strong&gt;16.2 percentage points&lt;/strong&gt;. In specialized fields like &lt;strong&gt;Healthcare&lt;/strong&gt; and &lt;strong&gt;Manufacturing&lt;/strong&gt;, the gains were massive (up to &lt;strong&gt;+51.9pp&lt;/strong&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Cannot Grade Its Own Homework&lt;/strong&gt;: "Self-generated" Skills provided &lt;strong&gt;zero benefit&lt;/strong&gt; on average. Models often fail to recognize when they need specialized knowledge, or they produce vague, unhelpful procedures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Smaller Models Can "Punch Up"&lt;/strong&gt;: A smaller model (like &lt;strong&gt;Haiku 4.5&lt;/strong&gt;) equipped with Skills can actually outperform a much larger model (like &lt;strong&gt;Opus 4.5&lt;/strong&gt;) that doesn't have them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Less is More&lt;/strong&gt;: Focused Skills with only &lt;strong&gt;2-3 modules&lt;/strong&gt; outperformed massive, "comprehensive" documentation. Too much info creates "cognitive overhead" for the agent.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🏆 Top Performer
&lt;/h3&gt;

&lt;p&gt;The combination of &lt;strong&gt;Gemini CLI + Gemini 3 Flash&lt;/strong&gt; achieved the highest raw performance, hitting a &lt;strong&gt;48.7% pass rate&lt;/strong&gt; when equipped with Skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  🛠 Why This Matters
&lt;/h3&gt;

&lt;p&gt;For developers and enterprise teams, this proves that &lt;strong&gt;human expertise is still the bottleneck.&lt;/strong&gt; Building a library of high-quality, modular "Skills" is currently a more effective (and cheaper) way to scale AI agent performance than just waiting for bigger models or spending a fortune on fine-tuning.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://arxiv.org/abs/2602.12670" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2602.12670&lt;/a&gt;&lt;/p&gt;

</description>
      <category>abotwrotethis</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Tokenization</title>
      <dc:creator>Thiago da Silva Teixeira</dc:creator>
      <pubDate>Tue, 22 Apr 2025 16:20:44 +0000</pubDate>
      <link>https://dev.to/thiagoteixeiradev/tokenization-4dja</link>
      <guid>https://dev.to/thiagoteixeiradev/tokenization-4dja</guid>
      <description>&lt;p&gt;When you build an NLP pipeline—whether for sentiment analysis, chatbots, or translation—the very first step is always the same: &lt;strong&gt;tokenization&lt;/strong&gt;. In plain words, tokenization dices raw text into smaller, consistent chunks that a model can count, index, and learn from.&lt;/p&gt;

&lt;h3&gt;
  
  
  1  What &lt;em&gt;is&lt;/em&gt; a Token?
&lt;/h3&gt;

&lt;p&gt;Think of tokens as the LEGO® bricks of language. They can be as big as a whole word or as tiny as a single character, depending on how you slice them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sentence: "IBM taught me tokenization."
Possible tokens: ["IBM", "taught", "me", "tokenization"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Different models expect different brick sizes, so choosing the right tokenizer is strategic.&lt;/p&gt;
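
&lt;p&gt;As a quick illustration in plain Python (no libraries needed), here is the same sentence sliced at two granularities:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;sentence = "IBM taught me tokenization."

# Word-level: split on whitespace (punctuation stays attached to words)
print(sentence.split())     # ['IBM', 'taught', 'me', 'tokenization.']

# Character-level: every character, including spaces, becomes a token
print(list(sentence)[:10])  # ['I', 'B', 'M', ' ', 't', 'a', 'u', 'g', 'h', 't']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;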

&lt;h3&gt;
  
  
  2  Why Tokenization Matters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Sentiment analysis: detect “good” vs “bad”.
&lt;/li&gt;
&lt;li&gt;Text generation: decide what piece comes next.
&lt;/li&gt;
&lt;li&gt;Search engines: tokenization feeds stemming, letting “running” match “run”.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without tokenization, your neural net sees text as a long, unreadable string of bytes—hardly the recipe for comprehension.&lt;/p&gt;

&lt;h3&gt;
  
  
  3  The Three Classical Approaches
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;th&gt;Watch‑outs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Word‑based&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Splits on whitespace &amp;amp; punctuation&lt;/td&gt;
&lt;td&gt;Quick prototypes, rule‑based systems&lt;/td&gt;
&lt;td&gt;Huge vocabulary, OOV* explosion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Character‑based&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every character is a token&lt;/td&gt;
&lt;td&gt;Morphologically rich languages, misspellings&lt;/td&gt;
&lt;td&gt;Longer sequences, less semantic punch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sub‑word&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keeps common words whole, chops rare ones into pieces&lt;/td&gt;
&lt;td&gt;State‑of‑the‑art transformers (BERT, GPT‑x)&lt;/td&gt;
&lt;td&gt;More complex training &amp;amp; merges&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*OOV = out‑of‑vocabulary words&lt;/p&gt;

&lt;h3&gt;
  
  
  4  A Closer Look at Sub‑word Algorithms
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;WordPiece&lt;/strong&gt; (BERT)
&lt;em&gt;Greedy merges&lt;/em&gt;: start with characters, repeatedly join pairs that boost likelihood.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BertTokenizer&lt;/span&gt;
   &lt;span class="n"&gt;tok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BertTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bert-base-uncased&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tokenization lovers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
   &lt;span class="c1"&gt;# ['token', '##ization', 'lovers']
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Unigram&lt;/strong&gt; (XLNet, SentencePiece)
&lt;em&gt;Vocabulary pruning&lt;/em&gt;: begin with many candidates, drop the least useful until a target size is reached.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SentencePiece&lt;/strong&gt;
&lt;em&gt;Language‑agnostic&lt;/em&gt;: a toolkit implementing BPE and Unigram that trains directly on raw text and treats whitespace as an ordinary symbol, so no pre‑tokenization is needed (see the sketch below).&lt;/li&gt;
&lt;/ol&gt;
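
&lt;p&gt;Here is a minimal SentencePiece sketch. It assumes the &lt;code&gt;sentencepiece&lt;/code&gt; package is installed and that a &lt;code&gt;corpus.txt&lt;/code&gt; file with enough raw text exists; the printed pieces are illustrative, since they depend on the training corpus.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sentencepiece as spm

# Train a small unigram model directly on raw text (no pre-tokenization)
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="demo",
    vocab_size=1000, model_type="unigram",
)

# Load the trained model and encode a sentence into sub-word pieces
sp = spm.SentencePieceProcessor(model_file="demo.model")
print(sp.encode("Tokenization is powerful", out_type=str))
# e.g. ['▁Token', 'ization', '▁is', '▁power', 'ful']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;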

&lt;h3&gt;
  
  
  5  Tokenization + Indexing in PyTorch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torchtext.data.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_tokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torchtext.vocab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_vocab_from_iterator&lt;/span&gt;

&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Life is short&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tokenization is powerful&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;basic_english&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;yield_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_iter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_iter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;vocab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_vocab_from_iterator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;yield_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                  &lt;span class="n"&gt;specials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;unk&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;bos&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;eos&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;pad&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_default_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;unk&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;          &lt;span class="c1"&gt;# ['life', 'is', 'short']
&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                   &lt;span class="c1"&gt;# [5, 6, 7]  (example output)
&lt;/span&gt;
&lt;span class="c1"&gt;# Add special tokens + padding
&lt;/span&gt;&lt;span class="n"&gt;max_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
&lt;span class="n"&gt;padded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;bos&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;eos&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;padded&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;pad&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_len&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;padded&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Why it matters&lt;/em&gt;: Models operate on integers, not strings. &lt;code&gt;torchtext&lt;/code&gt; lets you jump from raw text to GPU‑ready tensors in a handful of lines.&lt;/p&gt;
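
&lt;p&gt;To close the loop, one more step (reusing the &lt;code&gt;vocab&lt;/code&gt; and &lt;code&gt;padded&lt;/code&gt; variables from the snippet above) turns the padded tokens into the tensor a model actually consumes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

# Map padded tokens to indices, then wrap them in a LongTensor
ids = torch.tensor(vocab(padded), dtype=torch.long)
print(ids.shape)  # torch.Size([6]), ready to batch and move to the GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;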

&lt;h3&gt;
  
  
  6  Special Tokens Cheat‑Sheet
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;bos&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Beginning of sentence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;eos&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;End of sentence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;pad&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sequence padding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;unk&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unknown / rare word&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Adding them makes batching cleaner and gives generation well‑defined start and stop points.&lt;/p&gt;

&lt;h3&gt;
  
  
  7  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization is non‑negotiable&lt;/strong&gt;—mis‑tokenize and your downstream model will stumble.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose by trade‑off&lt;/strong&gt;: word‑level (semantic clarity) vs character‑level (tiny vocab) vs sub‑word (best of both, extra complexity).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modern transformers ♥ sub‑word&lt;/strong&gt; algorithms such as WordPiece, Unigram, and SentencePiece.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing turns tokens into numbers&lt;/strong&gt;; libraries like &lt;code&gt;torchtext&lt;/code&gt;, spaCy, and &lt;code&gt;transformers&lt;/code&gt; automate the grunt work.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Special tokens&lt;/strong&gt; (&lt;code&gt;&amp;lt;bos&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;eos&amp;gt;&lt;/code&gt;, etc.) keep sequence models from losing their place.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;p&gt;My study notes from the &lt;a href="https://www.coursera.org/learn/generative-ai-llm-architecture-data-preparation" rel="noopener noreferrer"&gt;IBM Generative AI and LLMs: Architecture and Data Preparation&lt;/a&gt; course. &lt;/p&gt;

</description>
      <category>nlp</category>
      <category>ai</category>
      <category>abotwrotethis</category>
    </item>
  </channel>
</rss>
