<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shmulik Cohen</title>
    <description>The latest articles on DEV Community by Shmulik Cohen (@shmulc).</description>
    <link>https://dev.to/shmulc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3800211%2F6b7433c9-9f19-4c23-99f4-bccd3df7d4bf.jpg</url>
      <title>DEV Community: Shmulik Cohen</title>
      <link>https://dev.to/shmulc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shmulc"/>
    <language>en</language>
    <item>
      <title>Stop “Vibe Merging”</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Fri, 27 Feb 2026 13:32:07 +0000</pubDate>
      <link>https://dev.to/shmulc/stop-vibe-merging-1jpo</link>
      <guid>https://dev.to/shmulc/stop-vibe-merging-1jpo</guid>
      <description>&lt;p&gt;&lt;em&gt;A Deep Dive into the Code Review Bench Results&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flozt1ovnv2287nat8pue.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flozt1ovnv2287nat8pue.jpeg" width="700" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The AI Code Explosion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We are living through an AI code explosion. Coding agents are writing more code than ever before, churning out boilerplate and building new features at a record pace. But this incredible speed has created a massive new bottleneck: generating code is fast, but reviewing it is slow.&lt;/p&gt;

&lt;p&gt;Recent telemetry from across the industry has exposed a “Productivity Paradox.” While developers using AI are completing more tasks, their Pull Request (PR) review time has spiked by &lt;strong&gt;91%&lt;/strong&gt; [&lt;strong&gt;Faros AI:&lt;/strong&gt; &lt;em&gt;The AI Productivity Paradox Research Report&lt;/em&gt;]. We’ve reached a point where individual velocity is up, but organizational delivery is stalling because the human verification layer cannot scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this post, we’ll explore:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Code Review Crisis:&lt;/strong&gt; Why AI-generated code is actually harder to review than human code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The “Vibe Merging” Danger:&lt;/strong&gt; How teams are accidentally sacrificing safety for speed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Market Divide:&lt;/strong&gt; The difference between “All-in-One” giants and “Pure-Play” specialists.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Code Review Bench:&lt;/strong&gt; A deep dive into the first neutral, data-driven benchmark to rank the top agents, and a look at its results.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Code Review Crisis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s be clear: code review is a notoriously difficult problem even for senior engineers. It requires holding massive amounts of context in your head to ensure that a “simple” change doesn’t break a distant, existing system.&lt;/p&gt;

&lt;p&gt;AI code review is fundamentally harder than generation. While an agent can generate a local fix in seconds, a reviewer must reason globally across multiple files, architectural patterns, and intricate system integrations. AI-assisted changes are often larger and touch more surfaces, making the cognitive load on reviewers nearly unbearable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b27r7itzp64ic134onw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b27r7itzp64ic134onw.png" width="700" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The stakes at the review stage have never been higher. While AI can generate code &lt;strong&gt;10x&lt;/strong&gt; faster, data shows that AI-generated code produces &lt;strong&gt;1.7x more logic and correctness issues&lt;/strong&gt; than human-written code [&lt;strong&gt;CodeRabbit:&lt;/strong&gt; &lt;em&gt;AI vs. Human Code Quality Analysis&lt;/em&gt;]. The main problem in software engineering has shifted: it’s no longer about how fast we can author code, but how accurately we can ensure its quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The “Good Case” vs. The Reality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The bottleneck described above is actually the &lt;strong&gt;good case&lt;/strong&gt;. In this scenario, teams are at least attempting to maintain their standards and keep a “human in the loop” to catch errors before they hit production.&lt;/p&gt;

&lt;p&gt;The far more dangerous reality is what’s happening in companies that have simply given up on the bottleneck. When PR queues get backed up for weeks, the pressure to ship becomes overwhelming. We are seeing a surge in &lt;strong&gt;“vibe merging”&lt;/strong&gt; — where developers, overwhelmed by the volume of AI-generated code, simply skim the diff or hit “Approve” based on a gut feeling rather than a proper review.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Vibe Merging:&lt;/strong&gt; The act of approving a Pull Request based on a “gut feeling” or the reputation of the author, rather than a line-by-line verification of the logic.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Is your team “Vibe Merging”? Look for these symptoms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The “LGTM” Speedrun:&lt;/strong&gt; Approving a 300+ line diff in under three minutes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Green Light Fallacy:&lt;/strong&gt; Assuming that because the CI/CD pipeline passed, the logic must be sound. CI/CD catches &lt;em&gt;syntax&lt;/em&gt; and &lt;em&gt;crashes&lt;/em&gt;, but it doesn’t understand &lt;em&gt;intent&lt;/em&gt;. Vibe merging happens when we trust the “green check” to do the thinking for us.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Seniority Pass:&lt;/strong&gt; Skimming a PR because the author is a “rockstar” who rarely makes mistakes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Ghost Review:&lt;/strong&gt; Adding a comment like “Nice work!” without actually catching the logic bug on line 42.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
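&lt;p&gt;To make the “Green Light Fallacy” concrete, here is a minimal, hypothetical sketch (the function and test are invented for illustration): a logic bug that sails through CI because the test only checks that the code runs, not what it means.&lt;/p&gt;

```python
# Hypothetical example: a logic bug that passes a shallow CI check.
# The function is *supposed* to apply a 20% discount, i.e. return 80.0.
def discounted_price(price, discount):
    # Bug: multiplies by the discount instead of (1 - discount).
    return price * discount

def test_discounted_price_runs():
    # A "green check" style test: asserts the code runs and returns a float,
    # but never verifies the business logic (the intent).
    result = discounted_price(100.0, 0.2)
    assert isinstance(result, float)  # passes, so CI goes green

test_discounted_price_runs()
print("CI is green, but the price is", discounted_price(100.0, 0.2))  # 20.0, not 80.0
```

&lt;p&gt;CI happily reports green here; only a reviewer reasoning about intent catches that the customer just got an 80% discount.&lt;/p&gt;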

&lt;p&gt;When companies merge code to &lt;code&gt;main&lt;/code&gt; without deep verification, they aren’t just moving faster; they are accumulating technical debt and security risks at an exponential rate. This lack of a “Quality Gate” is how massive regressions and vulnerabilities slip into production unnoticed.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Market’s Answer: Code Review Agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The industry hasn’t ignored this crisis. In the last year, the market for “Code Review Agents” has exploded, but the players generally fall into two distinct camps:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The “All-in-One” Giants&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Platform powerhouses like &lt;strong&gt;GitHub (Copilot)&lt;/strong&gt;, &lt;strong&gt;Anthropic (Claude Code)&lt;/strong&gt;, and &lt;strong&gt;Anysphere (Cursor)&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Their Worldview:&lt;/strong&gt; They want to own the entire developer experience. They believe the best review comes from the same agent that helped you write the code, leveraging the shared context of your intent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Power Play:&lt;/strong&gt; This shift was cemented when &lt;strong&gt;Anysphere&lt;/strong&gt; acquired &lt;strong&gt;Graphite&lt;/strong&gt; to bridge the gap between local coding and the final merge.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The “Pure-Play” Specialists&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Companies like &lt;strong&gt;CodeRabbit&lt;/strong&gt;, &lt;strong&gt;Qodo&lt;/strong&gt;, and &lt;strong&gt;Baz&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Their Worldview:&lt;/strong&gt; They believe the agent writing the code shouldn’t be the one grading it. They focus exclusively on the review layer, investing in deeper repository indexing to catch architectural breaks that “generalist” agents overlook.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Evaluation Nightmare: Why We’re Still Flying Blind&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Choosing between these players is a nightmare because they all appear nearly identical on the surface. This has left engineering leaders stuck with three flawed ways to evaluate their choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The “Vibe Check”:&lt;/strong&gt; Install a tool, wait a week, and see if it “feels” correct. It’s subjective and ignores the critical bugs the tool missed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Internal Benchmark:&lt;/strong&gt; Trusting vendor marketing. As the saying goes, “Every vendor is #1 on their own test.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The False Economy (Cheapest Option):&lt;/strong&gt; Choosing based on price. You may save budget, but you can’t measure the impact on your actual safety goals.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Precision Trap:&lt;/strong&gt; The biggest hidden danger is &lt;strong&gt;Noise&lt;/strong&gt;. A player might claim a high “Recall” (they catch every bug), but if they achieve that by leaving 50 “nitpick” comments on a 10-line PR, they cause “Alert Fatigue.” Developers start ignoring the bot, eventually reverting to “vibe merging” just to clear the queue.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introducing Code Review Bench&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’ve needed a neutral, third-party way to measure these agents. That is exactly why &lt;strong&gt;Martian&lt;/strong&gt; built &lt;strong&gt;Code Review Bench&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Who is Martian?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Martian is an independent AI research lab (the team behind the Model Router). Because they do not sell a code generation or review agent themselves, they are in a unique and unbiased position to referee the industry. Their core research focuses on &lt;strong&gt;mechanistic interpretability&lt;/strong&gt; — unpacking the “black box” of LLMs to understand exactly how they make decisions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1j9hbb60d70ezgsghr6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1j9hbb60d70ezgsghr6.png" width="270" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Methodology: Beyond the Static Test&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Released recently, &lt;strong&gt;Code Review Bench&lt;/strong&gt; is a public, &lt;a href="https://github.com/withmartian/code-review-benchmark" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; benchmark designed to keep AI tools honest. Unlike previous benchmarks that rely on static datasets (which agents can eventually “memorize” or game), Martian uses a dual-layer approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Offline Benchmark (The Gold Set):&lt;/strong&gt; This is the controlled environment. Martian uses a curated set of 50 PRs from 5 major open source repositories with human-verified golden comments.&lt;br&gt;&lt;br&gt;
Each PR has curated golden comments with severity labels. An LLM judge matches each tool’s review against the golden comments and computes precision and recall.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Online Benchmark (The Continuous&lt;/strong&gt; &lt;strong&gt;Reality Check):&lt;/strong&gt; This is where the benchmark gets revolutionary. It continuously samples fresh real-world PRs from GitHub where code review bots left comments. Because the PRs are recent, tools can’t have memorized them during training.&lt;br&gt;&lt;br&gt;
Each tool is ranked by extracting its suggestions on each PR and matching them against the human response to each comment: did the developer (or their agent) fix the issue or ignore it?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
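&lt;p&gt;As a sketch of the offline scoring math (my reading of the methodology, not Martian’s actual code): once the LLM judge has matched a tool’s comments against the golden set, precision, recall, and F1 fall out of simple counts. The function and variable names below are illustrative.&lt;/p&gt;

```python
# Sketch of the offline scoring, assuming the LLM judge has already matched
# tool comments to golden comments. Names are illustrative, not Martian's API.
def score_review(matched, tool_comments_total, golden_comments_total):
    """Compute precision, recall, and F1 from match counts."""
    precision = matched / tool_comments_total   # how many of its comments were right
    recall = matched / golden_comments_total    # how many real bugs it found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy numbers: a tool left 10 comments on a PR set containing 12 golden bugs,
# and the judge matched 6 of those comments to golden comments.
p, r, f1 = score_review(matched=6, tool_comments_total=10, golden_comments_total=12)
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
```

&lt;p&gt;This is also why precision and recall pull against each other: commenting more raises the chance of matching golden comments (recall) while diluting the share of correct comments (precision).&lt;/p&gt;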

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6631j7541vc6i66stc0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6631j7541vc6i66stc0k.png" width="700" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How the LLM judge works&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Standouts: Deciphering the Leaderboard&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The most important takeaway from Martian’s data is that there is no single “best” tool, only the best tool for your specific goals. While high-volume tools often dominate the charts, the results split clearly between controlled “lab” performance and real-world behavior, revealing where each vendor focuses.&lt;/p&gt;

&lt;p&gt;To check the dataset yourself, take a look at &lt;a href="https://codereview.withmartian.com/" rel="noopener noreferrer"&gt;https://codereview.withmartian.com/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Below, I’ll walk through the results as I see them.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Offline Mode: The Augment Dominance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4mt4xlf5szwl7qvxg8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4mt4xlf5szwl7qvxg8q.png" width="700" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the controlled Offline Benchmark (the “Gold Set”), &lt;strong&gt;Augment&lt;/strong&gt; didn’t just lead — they dominated. In a “closed-book” environment where bugs are verified and static, Augment’s engine proved remarkably adept at connecting the dots.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Leader:&lt;/strong&gt; Augment took the top spot with a powerful &lt;strong&gt;53.8% F1 score&lt;/strong&gt;, creating a massive gap over the second-place finisher, Cursor, at &lt;strong&gt;44.9%&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Balance:&lt;/strong&gt; With &lt;strong&gt;62.8% recall&lt;/strong&gt; and &lt;strong&gt;47.0% precision&lt;/strong&gt;, Augment shows it can find a significant portion of problems without drowning the developer in noise.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4uvuw8dlj7zuodjv5dr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4uvuw8dlj7zuodjv5dr.jpeg" width="700" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the &lt;strong&gt;Graphite&lt;/strong&gt; case is perhaps the most interesting outlier in the offline set. Graphite operates like a surgeon: it achieved a staggering &lt;strong&gt;75.0% precision,&lt;/strong&gt; the highest in the category by a wide margin, but it came at a major cost. Its &lt;strong&gt;recall was only 8.8%&lt;/strong&gt;, leading to an overall &lt;strong&gt;15.7% F1 score&lt;/strong&gt;. This suggests that while Graphite is almost always right when it speaks, it stays silent on the vast majority of issues in the PR.&lt;/p&gt;

&lt;h4&gt;
  
  
  Update from 7.3.2026:
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g9efh7h2yntalod0mpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g9efh7h2yntalod0mpb.png" width="800" height="678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I didn’t fully realize this when I wrote the previous section, but I’ve since learned that the “offline” dataset is actually a replication of the code review benchmark used by companies like Augment and Greptile. This context makes Augment’s achievement feel slightly less groundbreaking, as they were essentially tested on familiar ground.&lt;/p&gt;

&lt;p&gt;Currently, &lt;strong&gt;Qodo&lt;/strong&gt; (another impressive Israeli company) holds second place with an F1 score of 47.9%. The story behind their ranking is quite interesting: they were initially in 5th place, but after internal review, the team discovered they had run the benchmark using an incorrect and outdated configuration. They coordinated with &lt;strong&gt;Martian&lt;/strong&gt; to resolve the issue, and once the correct settings were applied, their score jumped — marking another Israeli achievement to be proud of.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Online Mode: Baz’s “David &amp;amp; Goliath” Story&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ht8npewukvpf708tpqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ht8npewukvpf708tpqa.png" width="700" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we move to the &lt;strong&gt;Online Benchmark&lt;/strong&gt; — which tracks how real developers react to AI comments in the wild — the narrative shifts toward the “underdog.” This is where &lt;strong&gt;Baz&lt;/strong&gt;, a newer Israeli startup, put up staggering numbers that challenge the industry giants.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The “Surgical Sniper” Results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Taken at face value, Baz dominated the leaderboard in quality-centric metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;#1 in Precision — 70.9%&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;#1 in F1 Score — 52.5%&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;#1 in F0.5 Score — 62.2%&lt;/strong&gt; (a metric that weights precision more heavily than recall)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
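&lt;p&gt;A quick note on F0.5, since it decides this ranking: it is the general Fβ score with β = 0.5, which counts precision roughly twice as heavily as recall. A small sketch of the arithmetic (my own back-of-the-envelope check against the published Baz numbers, not Martian’s code):&lt;/p&gt;

```python
# F-beta: beta controls the precision/recall trade-off.
# beta=1 is the balanced F1; beta=0.5 weights precision more heavily.
def f_beta(precision, recall, beta):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Baz's published online numbers: 70.9% precision and a 52.5% F1.
# Solving F1 = 2PR/(P+R) for recall gives the implied recall.
p = 0.709
r = 0.525 * p / (2 * p - 0.525)       # roughly 41.7%
print(f"implied recall = {r:.1%}")
print(f"F0.5 = {f_beta(p, r, 0.5):.1%}")   # lands near the published 62.2%
```

&lt;p&gt;In other words, Baz’s F0.5 lead is precisely what you would expect from a tool that trades raw coverage for being right when it speaks.&lt;/p&gt;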

&lt;p&gt;The performance was so impressive that the Martian team actually made &lt;strong&gt;Baz&lt;/strong&gt; the default reviewer for their own &lt;strong&gt;ARES&lt;/strong&gt; repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fel58asjso15vk33715go.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fel58asjso15vk33715go.png" width="700" height="679"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Elephant in the Room: Scale&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We have to be intellectually honest about the “Sample Size” gap: &lt;strong&gt;CodeRabbit&lt;/strong&gt; has tracked nearly 300,000 PRs in this benchmark; &lt;strong&gt;Baz&lt;/strong&gt; has tracked just 790.&lt;/p&gt;

&lt;p&gt;Because Baz is a smaller, newer player, their data is naturally noisier. With a smaller user base, a tool can provide specialized attention that is harder to maintain at a “Goliath” scale. However, Baz’s #1 ranking in precision suggests they are operating as a &lt;strong&gt;“Surgical Sniper.”&lt;/strong&gt; They aren’t trying to find every possible bug (which causes alert fatigue); they are trying to ensure that when they &lt;em&gt;do&lt;/em&gt; interrupt a developer, they are roughly 70% likely to be right.&lt;/p&gt;

&lt;p&gt;It is incredible to see Israeli tech competing at this elite level in this field. While the giants have the data, the underdogs currently have the precision and attention. The real test will be whether Baz can maintain these surgical numbers as they scale to meet the volume of the industry leaders.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Giants: CodeRabbit and Cursor&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoxdqpt0c9qpx5bggi03.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoxdqpt0c9qpx5bggi03.jpeg" width="700" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we have to look at the tools that are actually “stress-tested” by the market every single day (at least 3,000 PRs each). Their results highlight the classic trade-off between &lt;strong&gt;coverage&lt;/strong&gt; and &lt;strong&gt;conciseness&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CodeRabbit (The High-Volume King):&lt;/strong&gt; CodeRabbit is the clear leader in terms of sheer scale. It achieved the &lt;strong&gt;best F1 score (51.2%)&lt;/strong&gt; and the &lt;strong&gt;best recall (53.5%)&lt;/strong&gt; in the online category. If your priority is a “safety net” that catches as many bugs as possible across a massive organization, CodeRabbit is the current gold standard.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cursor (The High-Precision Specialist):&lt;/strong&gt; Cursor maintains its “surgical” reputation even in the wild. While its recall sits lower at &lt;strong&gt;36.6%&lt;/strong&gt;, it boasts a high &lt;strong&gt;precision of 68.1%&lt;/strong&gt;. Cursor isn’t trying to find every single bug; it’s trying to ensure that when it interrupts a developer’s flow, it’s for a very good reason.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters: Moving Beyond Goodhart’s Law&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This benchmark brings a much-needed layer of accountability to the AI coding space. It does for code review what &lt;strong&gt;SWE-bench&lt;/strong&gt; did for code generation, but with a critical evolutionary step: it accounts for human behavior.&lt;/p&gt;

&lt;p&gt;We’ve learned that static benchmarks will always eventually fall victim to &lt;strong&gt;Goodhart’s Law&lt;/strong&gt;: &lt;em&gt;“When a measure becomes a target, it ceases to be a good measure.”&lt;/em&gt; If AI vendors only optimize for a static “Gold Set,” they will eventually game those metrics without actually helping real-world developers.&lt;/p&gt;

&lt;p&gt;The future of AI evaluation requires these &lt;strong&gt;dynamic systems&lt;/strong&gt; tied directly to real-world impact. Whether you are optimizing for the highest possible recall to catch every potential bug, or the highest precision to protect your senior engineers from alert fatigue, we finally have the data to stop guessing and start measuring.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Which Agent Should You Choose?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Choose High Recall (Augment, CodeRabbit)&lt;/strong&gt; if you are in a high-stakes industry (FinTech/Security) or have many junior devs. You want the bot to catch everything, even if it adds some noise.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Choose High Precision (Cursor, Graphite, Baz)&lt;/strong&gt; if you have a lean team of senior engineers. You only want the bot to speak up if it’s 90% sure it found a real logic flaw, protecting your team from “Alert Fatigue.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Bottom Line&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We are still in the early innings of the “Verifier Era.” No tool has yet cracked the code on perfect, human-level review — the 63% recall ceiling proves that. But with frameworks like &lt;strong&gt;Code Review Bench&lt;/strong&gt;, we are finally moving past the “Vibe Check” and toward a future where we can trust the agents that help us ship.&lt;/p&gt;

&lt;p&gt;Before you buy a tool, look at your own telemetry. If your “Time-to-Merge” has spiked while your “Comments-per-PR” has dropped, you are already Vibe Merging. Use the &lt;strong&gt;Code Review Bench&lt;/strong&gt; results to pick a partner that fits your risk tolerance — whether you need a high-recall safety net or a high-precision surgical assistant.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>programming</category>
    </item>
    <item>
      <title>Vercel Skills 101</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Fri, 06 Feb 2026 14:25:06 +0000</pubDate>
      <link>https://dev.to/shmulc/vercel-skills-101-334l</link>
      <guid>https://dev.to/shmulc/vercel-skills-101-334l</guid>
      <description>&lt;p&gt;&lt;em&gt;The Package Manager Your AI Agents Were Missing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1zusdujkjvheocwmom6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1zusdujkjvheocwmom6.png" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Skills: The New Standard
&lt;/h4&gt;

&lt;p&gt;If you’ve been following the AI space lately, you’ve likely heard about &lt;strong&gt;Agent Skills&lt;/strong&gt;. Pioneered by Anthropic, this open standard allows us to package specialized instructions, tools, and scripts into a modular format.&lt;/p&gt;

&lt;p&gt;It’s a brilliant architectural shift: instead of bloating an AI’s context with a 10,000-line system prompt, you provide “dormant” manuals. The agent only reads and “activates” them when it actually needs to perform a specific task.&lt;/p&gt;

&lt;p&gt;In my previous post, &lt;strong&gt;Demystifying Coding Agents&lt;/strong&gt;, I took a deep dive into the problems Skills solve and why they are the natural evolution of context management for coding agents.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Manual Installation Problem
&lt;/h4&gt;

&lt;p&gt;But here’s the catch: &lt;strong&gt;A standard is not a manager.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While tools like &lt;strong&gt;Claude Code&lt;/strong&gt; have built-in ways to fetch skills, the rest of the ecosystem is a fragmented mess. If you’re a developer jumping between &lt;strong&gt;Cursor&lt;/strong&gt;, &lt;strong&gt;Windsurf&lt;/strong&gt;, &lt;strong&gt;Claude Code&lt;/strong&gt;, and &lt;strong&gt;OpenClaw&lt;/strong&gt; (or just using one that doesn’t have a built-in way to install Skills), you’re currently living in the “Manual Installation Era.”&lt;/p&gt;

&lt;p&gt;To give your agent a new capability today, you usually have to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Hunt down a GitHub repo or a &lt;code&gt;SKILL.md&lt;/code&gt; from anywhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download the directory or file manually.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Manually paste it into a specific hidden folder, like &lt;code&gt;.cursor/skills&lt;/code&gt; or &lt;code&gt;.windsurf/skills&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’ve essentially regressed to the days before &lt;code&gt;npm&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt;, where “installing” a library meant dragging a ZIP file into your project and praying your environment variables were correct. It’s manual, it doesn’t scale, and it’s a massive barrier to building truly portable AI agents.&lt;/p&gt;

&lt;h4&gt;
  
  
  A Package Manager for Skills
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Vercel Skills&lt;/strong&gt; solves exactly this. It isn’t trying to create a new standard, it is the &lt;strong&gt;Package Manager&lt;/strong&gt; for the existing one.&lt;/p&gt;

&lt;p&gt;Think of it as the “&lt;strong&gt;NPM for Agents&lt;/strong&gt;.” It provides a CLI and a central registry (&lt;a href="https://skills.sh" rel="noopener noreferrer"&gt;skills.sh&lt;/a&gt;) to find, install, and manage these open-standard skills across every AI agent in your workflow.&lt;/p&gt;

&lt;p&gt;In the first installment of &lt;strong&gt;My Digital Arsenal&lt;/strong&gt;, we discussed how tools like &lt;code&gt;uv&lt;/code&gt; and &lt;code&gt;pip&lt;/code&gt; revolutionized Python development; Vercel’s Skills CLI brings that same level of sanity to the AI ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Use Vercel Skills
&lt;/h3&gt;

&lt;p&gt;The beauty of Vercel Skills is that there is &lt;strong&gt;nothing to install&lt;/strong&gt;. It lives in the cloud and runs on your machine via &lt;code&gt;npx&lt;/code&gt;, mirroring the “zero-config” philosophy of modern dev tools.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Discovery: Finding Your Edge&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Instead of scouring GitHub for the right instructions, you can search the registry directly from your terminal — &lt;code&gt;npx skills find&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F921na7wv393htv1o8nd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F921na7wv393htv1o8nd9.png" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This triggers an interactive, searchable list. If you’re looking for something specific, you can pass a keyword (for example, &lt;code&gt;npx skills find react&lt;/code&gt;).&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Installation: Adding Powers to Your Agents
&lt;/h4&gt;

&lt;p&gt;Once you’ve found a skill (e.g., &lt;code&gt;vercel-labs/agent-skills&lt;/code&gt;), you can download it with the command &lt;code&gt;npx skills add vercel-labs/agent-skills&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnah033qdm3f6sqc3hp2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnah033qdm3f6sqc3hp2r.png" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This starts an interactive process where you choose the exact skills you want, select which AI agents you are using, and decide if you want the skill at the &lt;strong&gt;global&lt;/strong&gt; or &lt;strong&gt;repo&lt;/strong&gt; level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works under the hood:&lt;/strong&gt; Vercel Skills clones the repo and then downloads the skills into a central &lt;code&gt;.agents/skills&lt;/code&gt; directory. For every agent you use, it &lt;strong&gt;copies or symlinks&lt;/strong&gt; the relevant files into the appropriate folders (like &lt;code&gt;.cursor/skills&lt;/code&gt; or &lt;code&gt;.windsurf/skills&lt;/code&gt;).&lt;/p&gt;
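&lt;p&gt;To make the mechanics concrete, here is a minimal Python sketch of that copy-or-symlink step. The function name and the fallback logic are illustrative assumptions, not Vercel’s actual implementation:&lt;/p&gt;

```python
import os
import shutil
from pathlib import Path

def link_skill(central_dir: Path, agent_dir: Path, skill: str) -> Path:
    """Expose one skill from the central store inside one agent's folder.

    Hypothetical sketch: tries a symlink first, then falls back to a
    full copy on platforms/filesystems where symlinks are unavailable.
    """
    src = central_dir / skill
    dest = agent_dir / skill
    dest.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Remove any stale link from a previous install before relinking.
        if dest.exists() or dest.is_symlink():
            dest.unlink()
        os.symlink(src, dest)
    except OSError:
        # Symlinks can fail (e.g. some Windows setups); copy instead.
        shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

&lt;p&gt;Either way, every agent ends up reading the same skill files from the single central directory.&lt;/p&gt;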

&lt;p&gt;It supports multiple source formats:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# GitHub shorthand&lt;br&gt;
npx skills add vercel-labs/agent-skills
&lt;h1&gt;
  
  
  Direct path to a specific skill within a repo
&lt;/h1&gt;

&lt;p&gt;npx skills add &lt;a href="https://github.com/vercel-labs/agent-skills/tree/main/skills/web-design-guidelines" rel="noopener noreferrer"&gt;https://github.com/vercel-labs/agent-skills/tree/main/skills/web-design-guidelines&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Local development paths
&lt;/h1&gt;

&lt;p&gt;npx skills add ./my-local-skills&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  3. Managing Your Arsenal
&lt;/h4&gt;


&lt;p&gt;Vercel Skills doesn’t just “drop and forget” files. It manages the lifecycle of your skills:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;npx skills list&lt;/code&gt;: List installed skills&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;npx skills check&lt;/code&gt;: Check for available skill updates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;npx skills update&lt;/code&gt;: Update all installed skills to latest versions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;npx skills remove [skills]&lt;/code&gt;: Remove installed skills from agents&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Verification: “What Can You Do?”
&lt;/h4&gt;

&lt;p&gt;After installing a skill, the best way to test it is simply to ask your agent:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What skills do you have access to right now?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If everything is wired up correctly, the agent will have access to the new skill, ready to be “woken up” when the task demands it.&lt;/p&gt;

&lt;p&gt;You can see the full list of commands at &lt;a href="https://github.com/vercel-labs/skills" rel="noopener noreferrer"&gt;vercel-labs/skills&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sharing Your Own Skill
&lt;/h3&gt;

&lt;p&gt;Publishing your own skill is just as easy. To demonstrate, I built a simple GitHub repository: &lt;a href="https://github.com/anuk909/Skills" rel="noopener noreferrer"&gt;anuk909/Skills&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cs8v4ek34426cilu49v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cs8v4ek34426cilu49v.png" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It currently contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;gh-pr-review&lt;/code&gt;: Improves agent capabilities for GitHub PRs (based on &lt;code&gt;agynio/gh-pr-review&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;git-worktree-workflow&lt;/code&gt;: A skill I built (based on a Cursor rule from a colleague) that finally made the worktree workflow usable for me.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To try them out, just run: &lt;code&gt;npx skills add anuk909/skills&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq3ovy5x7m5taxyee1gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq3ovy5x7m5taxyee1gx.png" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;Vercel Skills is the missing link for anyone serious about using AI agents. It solves the “manual installation” headache and makes specialized knowledge portable across our entire stack.&lt;/p&gt;

&lt;p&gt;However, we are still in the early days. Unlike &lt;code&gt;npm&lt;/code&gt; packages, Skills don’t yet have a robust way to manage &lt;strong&gt;versions&lt;/strong&gt; or &lt;strong&gt;dependencies&lt;/strong&gt;. While this simplicity makes them easy to deploy today, it will be interesting to see if the ecosystem evolves toward the complexity of traditional package managers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about you?&lt;/strong&gt; Have you started using Skills in your workflow yet? What’s your preferred method of installing them today?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>devtools</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Demystifying Coding Agents</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Sun, 01 Feb 2026 22:51:49 +0000</pubDate>
      <link>https://dev.to/shmulc/demystifying-coding-agents-59il</link>
      <guid>https://dev.to/shmulc/demystifying-coding-agents-59il</guid>
      <description>&lt;p&gt;&lt;em&gt;Simple Concepts Can Take You a Long Way&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuadaw4dntsywse4g245.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuadaw4dntsywse4g245.jpeg" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The transition from “Chat” to “Agent” in software development is often framed as a mystical leap in artificial intelligence. However, from a systems engineering perspective, the shift is actually a result of standardizing the interface between three specific components: &lt;strong&gt;The Reasoning Engine&lt;/strong&gt;, &lt;strong&gt;External Tooling&lt;/strong&gt;, and &lt;strong&gt;Context Management&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Whether you are using Cursor, Windsurf, Claude Code, or a custom open-source setup, the underlying architecture follows a repeatable pattern that manages state and execution over a stateless core.&lt;/p&gt;

&lt;p&gt;The secret? The advancements in coding agents today aren’t about “magic”, they come down to these very simple concepts working in unison.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The LLM: The Reasoning Engine
&lt;/h2&gt;

&lt;p&gt;At the center of any agent is the Large Language Model. In 2026, we’ve moved past the “autocomplete” era. The leap from GPT-4 to the current generation (Claude 4.5, GPT-5) wasn’t just about parameters, it was the shift toward &lt;strong&gt;Native Reasoning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Models are now trained specifically to utilize larger context windows (2M+ tokens) without losing the “needle in the haystack,” and they are fine-tuned on synthetic “Chain of Thought” data.&lt;/p&gt;

&lt;p&gt;This allows the LLM to act as a CPU with a massive, high-fidelity RAM. It doesn’t just predict the next token, it simulates the logic of the code before typing it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Planning Loop:&lt;/strong&gt; Before writing a single line of code, a robust agent executes a “Plan → Critique → Act” cycle. It writes a plan, checks if that plan breaks anything, and then executes.&lt;/p&gt;
&lt;/blockquote&gt;
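&lt;p&gt;The cycle above can be sketched as a small control loop. The &lt;code&gt;planner&lt;/code&gt;, &lt;code&gt;critic&lt;/code&gt;, and &lt;code&gt;executor&lt;/code&gt; callables are hypothetical stand-ins for what would be LLM calls in a real agent:&lt;/p&gt;

```python
def plan_critique_act(task, planner, critic, executor, max_rounds=3):
    """One 'Plan -> Critique -> Act' cycle, retried until the critic approves.

    Illustrative sketch: planner(task) -> plan, critic(plan) -> (ok, feedback),
    executor(plan) -> result. In practice all three wrap model calls.
    """
    feedback = None
    for _ in range(max_rounds):
        # Re-plan with the critic's feedback folded into the prompt.
        prompt = task if feedback is None else f"{task}\nFix: {feedback}"
        plan = planner(prompt)
        ok, feedback = critic(plan)
        if ok:
            return executor(plan)
    raise RuntimeError("No plan survived critique")
```

&lt;p&gt;The point is structural: execution is gated behind an explicit approval step, so a bad plan costs a retry instead of a broken codebase.&lt;/p&gt;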

&lt;h4&gt;
  
  
  The Gateway (API Standardization)
&lt;/h4&gt;

&lt;p&gt;One of the silent drivers of the agent explosion is the standardization of the interface between the “Brain” and the machine.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Great Equalizer:&lt;/strong&gt; Libraries like &lt;strong&gt;LiteLLM&lt;/strong&gt; and standards like OpenAI’s structured outputs mean you can swap a local Llama-3 model for Claude 4.5 Opus with a single line of config. This “pluggability” allows agents to remain model-agnostic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Power Play:&lt;/strong&gt; Conversely, the “Big Three” (Anthropic, OpenAI, Google) often bake specialized headers into their APIs specifically for tool-calling. If you’re a big enough provider, you don’t follow the interface — you &lt;em&gt;are&lt;/em&gt; the interface, forcing Agent frameworks to write custom logic just to squeeze out that extra 5% of reliability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
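&lt;p&gt;Here is a toy illustration of that pluggability. The backends are stubs (not the real LiteLLM or provider APIs); the point is that the agent calls one uniform signature and swapping models is a one-key config change:&lt;/p&gt;

```python
# Stub backends standing in for real provider clients.
def _call_claude(messages):
    return "claude:" + messages[-1]["content"]

def _call_gpt(messages):
    return "gpt:" + messages[-1]["content"]

def _call_local_llama(messages):
    return "llama:" + messages[-1]["content"]

# The "gateway": every provider is adapted to the same call shape.
BACKENDS = {
    "anthropic/claude-4.5": _call_claude,
    "openai/gpt-5": _call_gpt,
    "local/llama-3": _call_local_llama,
}

def completion(model: str, messages: list) -> str:
    """Uniform entry point: the agent never touches provider-specific code."""
    return BACKENDS[model](messages)
```

&lt;p&gt;Swap &lt;code&gt;"openai/gpt-5"&lt;/code&gt; for &lt;code&gt;"local/llama-3"&lt;/code&gt; and nothing else in the agent changes; that is the model-agnosticism the article describes.&lt;/p&gt;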

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ywa91cy0b2fjkg4xmu5.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ywa91cy0b2fjkg4xmu5.jpeg" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tools &amp;amp; Instructions: The Execution Layer
&lt;/h2&gt;

&lt;p&gt;If the LLM is the &lt;strong&gt;Navigator&lt;/strong&gt;, the &lt;strong&gt;Tools&lt;/strong&gt; are the &lt;strong&gt;Driver&lt;/strong&gt;. An LLM on its own can only talk; Tools give it “hands.” The leap we’ve seen recently is about the orchestration of these hands through a specific set of instructions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tools
&lt;/h4&gt;

&lt;p&gt;Early tool calling was clunky, relying on rigid JSON blocks that often broke. Today, we use flexible, standardized execution environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool Calling:&lt;/strong&gt; Coding agents ship proprietary tools for file editing and task management. Tools like &lt;code&gt;apply_diff&lt;/code&gt; or &lt;code&gt;undo_rewrite&lt;/code&gt; allow for surgical changes to code, and tools like &lt;code&gt;todo_list&lt;/code&gt; keep track of progress.&lt;br&gt;&lt;br&gt;
Each coding agent uses a completely different toolset that defines a big part of its DNA.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sandboxed Terminal Execution:&lt;/strong&gt; Modern agents have direct access to a &lt;strong&gt;pseudo-terminal (PTY)&lt;/strong&gt;. The LLM generates standard shell commands (&lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;find&lt;/code&gt;, &lt;code&gt;pytest&lt;/code&gt;, &lt;code&gt;sed&lt;/code&gt;) that run in a secure, isolated sandbox. By capturing &lt;code&gt;stdout&lt;/code&gt; and &lt;code&gt;stderr&lt;/code&gt;, the agent can “see” a compiler error or a failing test and self-correct, closing the loop between thinking and doing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Context Protocol (MCP):&lt;/strong&gt; MCP is the open standard that connects AI assistants to systems. It decouples the tool logic from the agent UI. It allows a local or remote server to expose its resources, such as a database schema or a Jira board, via a unified JSON-RPC protocol.&lt;br&gt;&lt;br&gt;
The agent doesn’t need a custom plugin for every service, it only needs to speak MCP, and the server handles the rest.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
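&lt;p&gt;The observe-and-self-correct channel boils down to capturing &lt;code&gt;stdout&lt;/code&gt;, &lt;code&gt;stderr&lt;/code&gt;, and the exit code of each command. A simplified sketch (real agents add a PTY and an isolation layer such as a container, which are omitted here):&lt;/p&gt;

```python
import subprocess

def run_command(cmd: list, timeout: int = 30) -> dict:
    """Run a shell command and capture everything the agent needs to 'see'.

    The returned dict is what gets fed back into the LLM's context so it
    can notice a compiler error or failing test and self-correct.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

&lt;p&gt;A non-zero &lt;code&gt;exit_code&lt;/code&gt; plus the captured &lt;code&gt;stderr&lt;/code&gt; is exactly the signal that closes the loop between thinking and doing.&lt;/p&gt;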

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zwife9zphy2v6fx158y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zwife9zphy2v6fx158y.png" width="720" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Instructions
&lt;/h4&gt;

&lt;p&gt;This is the “Secret Sauce.” It’s why &lt;strong&gt;Cursor&lt;/strong&gt;, &lt;strong&gt;Windsurf&lt;/strong&gt;, and &lt;strong&gt;Claude Code&lt;/strong&gt; can all use the same Claude 4.5 Sonnet model but produce completely different results.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;System Prompt&lt;/strong&gt; is a massive, invisible set of instructions that acts as the agent’s “Operating Manual.” It tells the model &lt;em&gt;how&lt;/em&gt; to use its utility belt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Before you edit a file, you must search the codebase for related symbols. After every shell command, analyze the output for hidden warnings.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The difference in how these prompts are written, some prioritizing speed, others safety and testing, is what defines the product’s DNA. One agent feels like a cautious &lt;strong&gt;Senior Architect&lt;/strong&gt;, while another feels like a rapid-fire &lt;strong&gt;Prototyper&lt;/strong&gt;, all based on the orchestration of the same tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Context Management &amp;amp; Memory: The State Machine
&lt;/h3&gt;

&lt;p&gt;Since the LLM is a stateless engine, the Agent framework must maintain a stateful environment. This is the “Operating System” of the agent, and it’s where the heaviest engineering complexity lies.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Context Window Paradox
&lt;/h4&gt;

&lt;p&gt;A few years ago, we struggled with 8k token windows. In 2026, we have models with 2M+ tokens. However, a bigger bucket doesn’t automatically mean better results. &lt;strong&gt;The paradox is that as the window grows, our expectations grow faster.&lt;/strong&gt; We no longer ask for a single snippet, we expect agents to refactor entire modules, maintain architectural consistency across microservices, and debug complex integration errors.&lt;/p&gt;

&lt;p&gt;This “mission creep” means that even with millions of tokens, &lt;strong&gt;context remains the most precious currency in the system.&lt;/strong&gt; More noise increases the &lt;strong&gt;Needle in a Haystack risk&lt;/strong&gt;. The more ‘fluff’ you add to support these massive tasks, the more likely the LLM is to miss the one critical line of code that matters.&lt;/p&gt;

&lt;p&gt;To manage this, we use &lt;strong&gt;Context Caching&lt;/strong&gt; to keep the codebase “hot” and affordable in GPU memory, and we structure the “State Machine” into three distinct layers:&lt;/p&gt;

&lt;h4&gt;
  
  
  A. The Baseline (Static Rules)
&lt;/h4&gt;

&lt;p&gt;This is the “BIOS” or system configuration, rules that are always true.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Global Rules:&lt;/strong&gt; Project-wide constraints (e.g., &lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;.instructions.md&lt;/code&gt;, &lt;code&gt;.windsurfrules&lt;/code&gt;) like “Never use external CSS libraries.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spatial Context:&lt;/strong&gt; Directory-specific rules (e.g., &lt;code&gt;AGENTS.md&lt;/code&gt;). The agent only loads the “map” for its current folder, keeping the context window lean and focused on the immediate task.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
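&lt;p&gt;A minimal sketch of that spatial loading, assuming an &lt;code&gt;AGENTS.md&lt;/code&gt; file may exist at each directory level. The override order (project root first, deepest folder last) is an illustrative choice, not a mandated one:&lt;/p&gt;

```python
from pathlib import Path

def collect_agent_rules(start: Path, root: Path) -> list:
    """Gather AGENTS.md contents from the repo root down to the working dir.

    Only the folders on the current path are loaded, keeping the context
    window lean; deeper files come last so local rules can override
    project-wide ones.
    """
    chain, cur, root = [], start.resolve(), root.resolve()
    while True:
        chain.append(cur)
        if cur == root or cur == cur.parent:  # stop at repo (or fs) root
            break
        cur = cur.parent
    rules = []
    for folder in reversed(chain):  # root first, deepest last
        f = folder / "AGENTS.md"
        if f.is_file():
            rules.append(f.read_text())
    return rules
```

&lt;p&gt;Concatenating that list (in order) is the “map” the agent carries for its current folder.&lt;/p&gt;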

&lt;h4&gt;
  
  
  B. The Knowledge (On-Demand Retrieval)
&lt;/h4&gt;

&lt;p&gt;This layer fetches information only when the agent realizes it doesn’t know something.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Codebase RAG:&lt;/strong&gt; Using Vector Search (for concepts) and Code Graphs (for definitions) to pluck specific snippets. It acts as the agent’s “Library.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-Term Memory:&lt;/strong&gt; Systems like Windsurf’s Cascade or Copilot index your past PRs and corrections. This creates a “Personal Profile” so the agent learns your specific habits over time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
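&lt;p&gt;Stripped of the infrastructure, the retrieval step is just similarity ranking. A toy sketch with hand-made embedding vectors; a real system uses a learned embedding model and an approximate-nearest-neighbor index, but the ranking logic is the same:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=2):
    """Rank code snippets by embedding similarity -- the 'Library' lookup.

    `index` maps snippet text -> its (toy) embedding vector.
    """
    ranked = sorted(index, key=lambda s: cosine(query_vec, index[s]),
                    reverse=True)
    return ranked[:top_k]
```

&lt;p&gt;Only the top-ranked snippets are pasted into the context window, which is how RAG scales to codebases far larger than any model can read at once.&lt;/p&gt;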

&lt;h4&gt;
  
  
  &lt;strong&gt;C. The Manuals (Just-in-Time Skills)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Following the &lt;strong&gt;Anthropic Agent Skills&lt;/strong&gt; (agentskills.io) standard, “Skills” are dormant manuals.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JIT Loading:&lt;/strong&gt; You might have 500 specialized skills (e.g., “AWS Deployment,” “Stripe Integration”). The agent doesn’t “know” them by heart, it “copy-pastes” the relevant manual into its brain only when the task triggers that specific need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native Contextual Awareness:&lt;/strong&gt; Modern agents now use “Context Caching.” Instead of re-sending your entire 50,000-line codebase with every message, the API “remembers” the base code, only charging you for the new tokens. This makes “Bigger Context” not just a technical feat, but an economic one.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
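&lt;p&gt;JIT loading can be sketched as a keyword-triggered lookup. The skill names, triggers, and manual strings below are made up for illustration:&lt;/p&gt;

```python
SKILLS = {
    # name: (trigger keywords, full manual loaded only on demand)
    "aws-deploy": ({"deploy", "aws", "lambda"}, "AWS deployment manual ..."),
    "stripe": ({"payment", "stripe", "charge"}, "Stripe integration manual ..."),
}

def load_relevant_skills(task: str) -> list:
    """'Copy-paste' only the manuals whose triggers match the task.

    Dormant skills cost zero context tokens until a keyword wakes them up.
    """
    words = set(task.lower().split())
    return [manual for triggers, manual in SKILLS.values()
            if triggers.intersection(words)]
```

&lt;p&gt;With 500 skills on disk, a task still only pays the token cost of the one or two manuals it actually triggers.&lt;/p&gt;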

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8napmc2stm7md091apb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8napmc2stm7md091apb.png" width="528" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. The Cumulative Toolkit: Layered Defense&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The advancement of the field isn’t about the newest tool replacing the old one, it’s about building a &lt;strong&gt;Layered Defense&lt;/strong&gt;. No single method is a silver bullet, so we stack them based on their specific strengths and operational costs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmws9zjbvv9lxnf621tjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmws9zjbvv9lxnf621tjs.png" width="700" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use Static Rules for safety, RAG for scale, and Skills for precision. We don’t choose one, we layer them so the system has multiple chances to find the right context before it fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. The Punchline: Standardization as the Catalyst&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The reason these tools finally feel like a “Senior Partner” today isn’t because the models became &lt;strong&gt;smarter&lt;/strong&gt;. It’s because we standardized the &lt;strong&gt;system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By moving toward open protocols like &lt;strong&gt;MCP&lt;/strong&gt; and &lt;strong&gt;Agent Skills&lt;/strong&gt; , we have replaced custom-coded complexity with &lt;strong&gt;composability&lt;/strong&gt;. You can write a skill once and share it across Cursor, Windsurf, or your own CLI.&lt;/p&gt;

&lt;p&gt;Once you strip away the marketing, you realize that ‘Agentic AI’ is mostly just a very sophisticated &lt;code&gt;while&lt;/code&gt; loop wrapped around a copy-paste mechanism. But in engineering, a sufficiently advanced loop is indistinguishable from intelligence.&lt;/p&gt;

&lt;p&gt;But that is the beauty of it. It’s not magic, it’s a highly efficient, automated loop of terminal calls and context-stuffing. When you apply that simple loop at scale, fueled by an engaged community building shared “manuals” and servers, the result is a system that works at a professional level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6auky9185n0ahybv2yzy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6auky9185n0ahybv2yzy.jpeg" width="700" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next time you hear about a “groundbreaking” new trend in the AI world, there is a high chance it boils down to one of these simple concepts. And if it doesn’t? That’s when things get truly exciting.&lt;/p&gt;

&lt;p&gt;The future of coding isn’t only about building “smarter” brains, it’s about building better connections between the &lt;strong&gt;Brain&lt;/strong&gt; , the &lt;strong&gt;Tools&lt;/strong&gt; , and the &lt;strong&gt;Memory&lt;/strong&gt;. Simple concepts, standardization, and a community that shares its manuals will take us much further than a black box ever could.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>coding</category>
      <category>llm</category>
    </item>
    <item>
      <title>50 Shades of BERT</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Thu, 15 Jan 2026 12:04:48 +0000</pubDate>
      <link>https://dev.to/shmulc/50-shades-of-bert-10nn</link>
      <guid>https://dev.to/shmulc/50-shades-of-bert-10nn</guid>
      <description>&lt;p&gt;&lt;em&gt;The Encoder Architecture that Unified NLP&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvke998amnt5w2q0imn2s.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvke998amnt5w2q0imn2s.jpeg" width="700" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction: The Era of Fragmentation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before 2018, Natural Language Processing was a collection of siloed crafts. Researchers built custom sequential models, like LSTMs or GRUs, for every specific problem. If a model solved Sentiment Analysis, that progress did not easily transfer to Named Entity Recognition. It was an era of “reinventing the wheel” for every dataset.&lt;/p&gt;

&lt;p&gt;The release of Google’s BERT (Bidirectional Encoder Representations from Transformers) marked a turning point. It was built on the “Attention is All You Need” architecture, but it fundamentally changed the NLP world by being truly bidirectional and providing a single model that can solve many different tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i7d0mfgjbkqy9hboclx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i7d0mfgjbkqy9hboclx.png" width="700" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Encoder–Decoder Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To understand BERT, we must look back at the original Transformer. It was designed for &lt;strong&gt;Machine Translation&lt;/strong&gt; , which required two distinct specialized roles working together.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Encoder (The Reader):&lt;/strong&gt; Its job is to take an input sentence and look at all the words simultaneously. Instead of just one “thought vector,” it generates a rich, contextual mathematical signature for every single word.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Decoder (The Writer):&lt;/strong&gt; Its job is to take those signatures and generate a translation one word at a time, ensuring each new word fits with the ones it has already written.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BERT is the pure distillation of the Encoder. It doesn’t just summarize, it provides a map of the entire sentence where every word “knows” about every other word.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Great Decoupling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2018, the AI community realized these components were powerful enough to stand on their own. This led to the two main lineages of modern AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decoder-Only (GPT-style):&lt;/strong&gt; Optimized for generation. These models are mathematically restricted to looking only at the past to predict the future.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Encoder-Only (BERT-style):&lt;/strong&gt; Optimized for understanding. These models stack multiple Encoders to create a reading comprehension engine. They do not “chat,” but they understand context and nuance better than almost any other architecture.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5ar9m18zy5v7dev0hmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5ar9m18zy5v7dev0hmw.png" width="700" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Engine: How BERT Learns to Read&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Because BERT is an Encoder-only model, it utilizes bidirectional context. In a sentence like “The bank was closed,” a traditional left-to-right model is blind to the future. It doesn’t know if “bank” refers to a financial building or a riverbed until it reaches the very end. BERT, however, sees the entire sentence at once, looking at words before and after every token simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Two Games of Pre-training&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;BERT discovered the structure of language by solving billions of self-supervised puzzles through two primary methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Masked Language Modeling (MLM):&lt;/strong&gt; Researchers hide about 15% of the words in a sentence. BERT must guess the hidden words using the surrounding 85% of the context. This forces the model to understand how words relate to each other semantically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Next Sentence Prediction (NSP):&lt;/strong&gt; BERT is shown two sentences and must decide if the second logically follows the first. This teaches the model to understand the flow of ideas and the relationship between entire sentences.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
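&lt;p&gt;The MLM game is easy to sketch. This toy version only replaces tokens with &lt;code&gt;[MASK]&lt;/code&gt;; BERT’s full recipe also sometimes swaps in a random token or keeps the original, which is omitted here:&lt;/p&gt;

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """The MLM 'game': hide ~15% of tokens; the model must guess them.

    Returns the corrupted sequence plus the answer key
    (position -> original token) used to score the model's guesses.
    """
    rng = random.Random(seed)
    masked, answers = [], {}
    for i, tok in enumerate(tokens):
        if mask_rate > rng.random():
            masked.append("[MASK]")
            answers[i] = tok
        else:
            masked.append(tok)
    return masked, answers
```

&lt;p&gt;Training is then just: feed in &lt;code&gt;masked&lt;/code&gt;, compare the model’s predictions at the masked positions against &lt;code&gt;answers&lt;/code&gt;, and repeat billions of times.&lt;/p&gt;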

&lt;h3&gt;
  
  
  &lt;strong&gt;The Secret Sauce: Self-Attention&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Imagine every word in a sentence is a person in a room. Self-attention is the process where every person looks at everyone else to decide who is most relevant to them. In the phrase “The animal didn’t cross the street because it was too tired,” the word “it” uses attention to look at every other word. It realizes that “it” has a much stronger mathematical relationship with “animal” than with “street.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqi9rqrqrzwetqwibjyn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqi9rqrqrzwetqwibjyn.png" width="700" height="876"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This allows BERT to create context-aware embeddings. Instead of having one static number for the word “bank,” BERT generates a unique mathematical signature for “bank” when it is near “money” and a completely different one when it is near “river.”&lt;/p&gt;
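&lt;p&gt;Self-attention itself is compact enough to write out. Here is a pure-Python sketch of scaled dot-product attention over toy word vectors (real BERT adds learned query/key/value projections and multiple heads):&lt;/p&gt;

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: each word's output is a weighted mix
    of every word's value vector, weighted by relevance.

    This is the mechanism that gives 'bank' a different signature
    near 'money' than near 'river'.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Softmax (shifted by the max for numerical stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

&lt;p&gt;Because every query attends over every key, each output vector already “knows” about the whole sentence, which is the bidirectionality described above.&lt;/p&gt;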

&lt;h3&gt;
  
  
  &lt;strong&gt;Input and Output: The Transformation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While we think in words, BERT thinks in vectors (lists of numbers).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Input:&lt;/strong&gt; You feed BERT a sequence of tokens (words or pieces of words). Along with the words, we provide &lt;strong&gt;Positional Encodings,&lt;/strong&gt; essentially coordinates for each word so the model knows where they sit in the sentence compared to other words.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Output:&lt;/strong&gt; BERT outputs a high-dimensional vector for every single token you gave it. These aren’t just definitions, they are rich summaries of what that word means &lt;strong&gt;in that specific sentence&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns1ekbxg43jtdlt11fbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns1ekbxg43jtdlt11fbg.png" width="700" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, you can take those outputs and feed them into a tiny final layer (a “head”) to perform your specific task, whether that is classifying an email or finding a person’s name.&lt;/p&gt;
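&lt;p&gt;That final “head” can be as small as one linear layer plus a softmax. A toy sketch with illustrative shapes (real heads are trained jointly with, or on top of, the frozen encoder):&lt;/p&gt;

```python
import math

def classify(cls_vector, weights, biases):
    """A tiny task 'head': linear layer + softmax over BERT's [CLS] output.

    The encoder does the heavy lifting (reading); the head just maps its
    summary vector to task labels, e.g. spam vs. not-spam.
    """
    # One logit per class: dot(weight_row, cls_vector) + bias.
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, biases)]
    # Softmax turns logits into a probability distribution.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

&lt;p&gt;Swapping the head (and fine-tuning) is all it takes to repurpose the same encoder for classification, tagging, or span extraction.&lt;/p&gt;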

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Note:&lt;/strong&gt; This is a high-level intuition, not a full mathematical breakdown. If you want deeper intuition (clearly explained with great visuals), I highly recommend this amazing video by &lt;strong&gt;StatQuest&lt;/strong&gt;:&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Encoder Renaissance&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The 5 Pillars: Deep Technical Variants&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2026, we don’t just use “BERT.” We use specialized “shades” optimized for different engineering constraints like speed, memory, and context length.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Original BERT (2018):&lt;/strong&gt; The Google pioneer. It established the bidirectional standard and the 512-token limit. While considered “legacy” by some, it remains the most documented and widely supported baseline for academic reproducibility.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RoBERTa (2019):&lt;/strong&gt; Facebook’s “Robustly Optimized” upgrade. By removing the Next Sentence Prediction (NSP) task and training on &lt;strong&gt;10x more data&lt;/strong&gt; (160GB vs 16GB), it proved that BERT hadn’t been trained long enough. It remains the gold standard for pure accuracy on sentence-level tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DistilBERT (2019):&lt;/strong&gt; Hugging Face’s production workhorse. Using &lt;strong&gt;knowledge distillation&lt;/strong&gt;, it retains 97% of BERT’s performance while being &lt;strong&gt;40% smaller and 60% faster&lt;/strong&gt;. It is the go-to for low-latency sentiment or classification pipelines running on standard CPUs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TinyBERT (2020):&lt;/strong&gt; An ultra-compact variant from Huawei. Unlike other models, it uses &lt;strong&gt;layer-by-layer distillation&lt;/strong&gt; (mimicking the teacher’s attention and hidden states) to compress BERT down to just ~14.5M parameters. It is specifically designed for extreme constraints like mobile apps and IoT devices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ModernBERT (2024):&lt;/strong&gt; A breakthrough by &lt;strong&gt;Answer.AI&lt;/strong&gt; and &lt;strong&gt;LightOn&lt;/strong&gt; that drags the architecture into the modern era. It shatters the context limit with a native &lt;strong&gt;8,192-token window&lt;/strong&gt; using &lt;strong&gt;RoPE&lt;/strong&gt;. By integrating &lt;strong&gt;Flash Attention 2&lt;/strong&gt; and being heavily pre-trained on code, it is a hardware-optimized powerhouse that is faster and more accurate than its predecessors for almost every 2026 use case.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The 10 Faces of Inference: The Multiplier&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When you cross those 5 variants with these tasks, you get the “50 Shades.” However, BERT’s utility is best understood through its functional strengths:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Token-Level Precision&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NER (Named Entity Recognition):&lt;/strong&gt; Identifying medical codes or legal clauses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part-of-Speech Tagging:&lt;/strong&gt; Labeling grammar for deep linguistic analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coreference Resolution:&lt;/strong&gt; The “pronoun solver” (e.g., figuring out what “it” refers to).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Semantic Logic&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sentiment Analysis:&lt;/strong&gt; Quantifying emotional tone (e.g., brand reputation).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aspect-Based Sentiment:&lt;/strong&gt; Analyzing specific features (e.g., Food: +, Service: -).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Natural Language Inference (NLI):&lt;/strong&gt; A logic-gate to check if statements are contradictory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-Shot Classification:&lt;/strong&gt; Categorizing text into labels the model was never specifically trained for.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
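&lt;p&gt;Zero-shot classification is typically built on NLI under the hood: each candidate label becomes a hypothesis like “This text is about sports,” and the label whose hypothesis is most strongly entailed wins. The sketch below shows only that ranking logic, with a toy keyword scorer standing in for a real NLI model so it runs without downloading anything (a production system would back &lt;code&gt;entailment_score&lt;/code&gt; with an NLI-fine-tuned encoder):&lt;/p&gt;

```python
def zero_shot_classify(text, labels, entailment_score):
    """Rank candidate labels by how strongly `text` entails the
    hypothesis 'This text is about <label>.'

    `entailment_score(premise, hypothesis)` is any callable returning a
    float; a real system would use an NLI-fine-tuned encoder here.
    """
    hypotheses = {label: f"This text is about {label}." for label in labels}
    scores = {
        label: entailment_score(text, hyp) for label, hyp in hypotheses.items()
    }
    return max(scores, key=scores.get), scores

# Toy stand-in scorer: naive keyword overlap instead of a trained model.
def toy_scorer(premise, hypothesis):
    topic = hypothesis.removeprefix("This text is about ").rstrip(".")
    return 1.0 if topic in premise.lower() else 0.0

best, scores = zero_shot_classify(
    "The striker scored twice in the sports final",
    ["sports", "politics"],
    toy_scorer,
)
print(best)  # sports
```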

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Search &amp;amp; Retrieval&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extractive Question Answering:&lt;/strong&gt; Reading a 50-page PDF and highlighting the exact answer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Semantic Similarity:&lt;/strong&gt; Scoring how closely sentences align to deduplicate datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Paraphrase Detection:&lt;/strong&gt; Recognizing if two different search prompts seek the same intent.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why the Spotlight has Returned to Encoders&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;BERT was released in 2018, millennia ago in AI years, yet it remains the “Invisible Giant” of the ecosystem. Even today, the &lt;code&gt;bert-base-uncased&lt;/code&gt; checkpoint sees &lt;strong&gt;38M+ monthly downloads&lt;/strong&gt; (the 4th most downloaded model), maintaining its status as one of the most integrated architectures in history.&lt;/p&gt;

&lt;p&gt;In fact, the &lt;em&gt;&lt;a href="https://huggingface.co/models?sort=downloads" rel="noopener noreferrer"&gt;Hugging Face hub&lt;/a&gt;&lt;/em&gt; is dominated by Encoders. Models like &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; see over &lt;strong&gt;140M monthly downloads&lt;/strong&gt;, while others like &lt;code&gt;electra-base-discriminator&lt;/code&gt; pull in &lt;strong&gt;52M+&lt;/strong&gt;. This enduring popularity is due to an architecture that provides the surgical precision needed for high-stakes, real-world tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retrieval (RAG):&lt;/strong&gt; Using sentence transformers to find exact context within massive datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Classification:&lt;/strong&gt; Powering instant content moderation and sentiment analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entity Extraction:&lt;/strong&gt; Identifying specific names or codes for privacy and regulatory compliance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the world focuses on chatty generative models, the numbers show that Encoders continue to do the heavy lifting where accuracy, cost, and latency matter most.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Blind Spots: When NOT to Use an Encoder&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While Encoders are surgical, they are not a universal solution for every understanding task. Even in “read-only” missions, there are structural boundaries where the BERT architecture reaches its limit. To be a practical architect, you must recognize when the task shifts from pattern recognition to complex reasoning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Fine-Tuning Tax:&lt;/strong&gt; Unlike large-scale Decoders that excel at Zero-Shot or Few-Shot tasks, BERT is not “plug-and-play.” To achieve its legendary precision, you generally need a substantial labeled dataset to fine-tune the model on your specific domain. If you lack the data to “teach” the model your nuances, a multi-billion parameter Decoder will likely outperform a raw Encoder through sheer scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Reasoning Ceiling:&lt;/strong&gt; BERT is a master of &lt;strong&gt;pattern matching&lt;/strong&gt;, but it is not a deep &lt;strong&gt;reasoner&lt;/strong&gt;. If your mission requires multi-step causal logic — such as tracing a complex security vulnerability across multiple code files or following an agentic workflow — the “shallow” understanding of a 300M parameter model cannot compete with the emergent logic found in massive Decoders.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contextual Rigidity:&lt;/strong&gt; While ModernBERT has expanded the context window, Encoders still process information in a relatively “flat” manner. For tasks that require a “holistic” understanding of a massive project or the ability to weigh conflicting abstract concepts, the dense, multi-layered representations of the largest models still hold a significant edge.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;My Personal Story: BERT Usage At Apiiro&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When I recently joined the AI team at Apiiro, I was surprised to find fine-tuned BERT models powering some of our most critical core projects. Initially, I thought they were historical relics. I quickly learned that &lt;strong&gt;for high-scale, mission-critical workloads, BERT isn’t just a fallback — it’s the winner.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; When processing millions of queries, a CPU-based BERT beats a token-streaming LLM every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Running specialized encoders on standard hardware is a fraction of the cost of generative APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Precision:&lt;/strong&gt; For “Discriminative” tasks, like identifying a specific vulnerability in code, BERT’s bidirectional context provides surgical accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine-Tuning over Prompting:&lt;/strong&gt; Unlike API-based LLMs that rely on prompt engineering, BERT allows us to fine-tune the entire model on our specific domain data. This “muscle memory” makes the model a specialized expert that does one thing perfectly without being distracted by general-purpose “helpfulness.”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After that initial experience, I got to work on another project involving GPU inference of BERT. That led me down a rabbit hole of evaluation, distillation, optimization, benchmarks, and platform comparisons.&lt;br&gt;&lt;br&gt;
But I will keep all of that (and much more) for another post.&lt;/p&gt;

&lt;p&gt;Overall, it was a humbling experience. I learned that sometimes the “senior” move isn’t using the newest model everyone talks about, but choosing the proven, efficient architecture that delivers the best results for your data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frei27iwv0fa6kny33rwg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frei27iwv0fa6kny33rwg.jpeg" width="700" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Implementation: Three Shades of BERT&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Implementation has become trivial thanks to the &lt;code&gt;transformers&lt;/code&gt; library by Hugging Face. By 2026, the ecosystem has moved toward hardware-aware defaults, meaning these few lines of code often trigger specialized kernels like Flash Attention 2 automatically if they detect a compatible GPU.&lt;/p&gt;

&lt;p&gt;The beauty of these “shades” is that the API remains nearly identical. You simply swap the model checkpoint to change your entire performance profile.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline&lt;br&gt;
import torch
&lt;h1&gt;
  
  
  1. DistilBERT: The Production Workhorse
&lt;/h1&gt;
&lt;h1&gt;
  
  
  Task: Sentiment Analysis (High-throughput classification)
&lt;/h1&gt;

&lt;p&gt;classifier = pipeline(&lt;br&gt;
    “sentiment-analysis”, &lt;br&gt;
    model=”distilbert-base-uncased-finetuned-sst-2-english”&lt;br&gt;
)&lt;/p&gt;
&lt;h1&gt;
  
  
  2. RoBERTa: The Precision Specialist
&lt;/h1&gt;
&lt;h1&gt;
  
  
  Task: NER (Token-level sequence labeling)
&lt;/h1&gt;

&lt;p&gt;ner_tagger = pipeline(&lt;br&gt;
    “ner”, &lt;br&gt;
    model=”xlm-roberta-large-finetuned-conll03-en”&lt;br&gt;
)&lt;/p&gt;
&lt;h1&gt;
  
  
  3. ModernBERT: The 2026 Long-Context Standard
&lt;/h1&gt;
&lt;h1&gt;
  
  
  Task: Document-level Classification (Long-form analysis)
&lt;/h1&gt;

&lt;p&gt;doc_model = pipeline(&lt;br&gt;
    “text-classification”, &lt;br&gt;
    model=”answerdotai/ModernBERT-base”,&lt;br&gt;
    model_kwargs={”attn_implementation”: “flash_attention_2”} &lt;br&gt;
)&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion: The Silent Workhorse&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While Generative AI captures the headlines and the public imagination, the BERT family remains the invisible foundation of enterprise software. It is the silent workhorse behind global search engines, automated content moderation, and the high-speed data pipelines that keep modern applications running.&lt;/p&gt;

&lt;p&gt;Understanding these “shades” is what separates a prompt engineer from a practical NLP architect. It is about knowing that you do not always need a trillion parameters to solve a problem. Sometimes, you just need a specialized expert that understands the context of a single sentence with surgical precision.&lt;/p&gt;

&lt;p&gt;As we move further into 2026, the trend is clear: the most senior engineering moves are not about using the biggest, shiniest model, but about using the most efficient one for the job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about you?&lt;/strong&gt; Have you found yourself reaching back for “old-school” encoders to solve cost or latency issues in your recent projects, or are you still trying to make generative models fit every classification task? Let’s discuss in the comments below!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>How AI Can Actually Make You More Authentic</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Sun, 30 Nov 2025 21:01:59 +0000</pubDate>
      <link>https://dev.to/shmulc/how-ai-can-actually-make-you-more-authentic-273b</link>
      <guid>https://dev.to/shmulc/how-ai-can-actually-make-you-more-authentic-273b</guid>
      <description>&lt;p&gt;&lt;em&gt;Use AI in personal branding the right way&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyo5obfyvq5xq3jkpzyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyo5obfyvq5xq3jkpzyc.png" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI Era presents a tough choice for creators:&lt;/strong&gt; authenticity or productivity? Ever since I started writing this blog, I’ve been wrestling with that very battle. On one side is the huge temptation and potential of cutting-edge AI tools; on the other, the need to maintain genuine, personal content in a web saturated with “AI slop.”&lt;/p&gt;

&lt;p&gt;My readers deserve my genuine voice and my personal experience. Where is the value in creating content when new tools promise to generate it without any human touch?&lt;/p&gt;

&lt;p&gt;I’m not the only one facing this. The following guest post is from James. He doesn’t just talk about the struggle; he &lt;strong&gt;designs authentic, AI-powered content systems that turn founders into Unpromptable thought leaders.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;His publication name says it all: he’s managed to achieve both productivity and authenticity. Read carefully, and see what lessons you can take away.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;AI won’t make you fake unless you let it&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You’ve probably heard the warnings. Use AI to build your brand and you’ll sound like everyone else. Automate your content and you’ll lose your voice. Let algorithms handle your messaging and you’ll become a hollow shell of ChatGPT-speak.&lt;/p&gt;

&lt;p&gt;There’s truth in those fears, of course.&lt;/p&gt;

&lt;p&gt;But only if you’re not paying attention.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The authenticity trap&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We need to separate what’s true from what’s not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;True:&lt;/strong&gt; Using AI makes you more likely to fall into this trap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False:&lt;/strong&gt; You can NEVER be authentic when you’re using AI. AI and authenticity are opposites; you can have one or the other, but never both.&lt;/p&gt;

&lt;p&gt;This shows up everywhere.&lt;/p&gt;

&lt;p&gt;Technical founders building in public worry that using AI to draft their updates makes them frauds. Creators using ChatGPT to speed up their newsletters feel guilty, like they’re cheating. Businesses who automate parts of their content pipeline wonder if they’re sacrificing the very thing that makes their brand theirs.&lt;/p&gt;

&lt;p&gt;So they choose. Either grind it out manually to stay “real,” or use AI and accept that their brand will feel manufactured.&lt;/p&gt;

&lt;p&gt;But that’s just not true.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;So, when does AI make you less authentic?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It’s when you implement AI mindlessly.&lt;/p&gt;

&lt;p&gt;Unmindful AI integration does exactly what the critics warn about.&lt;/p&gt;

&lt;p&gt;When you automate your thinking, you lose your edge. When you let AI decide what you should say, your voice disappears. When you use it to generate content without filtering through your values, your audience feels it immediately.&lt;/p&gt;

&lt;p&gt;They sense the hollow core.&lt;/p&gt;

&lt;p&gt;The posts that sound smart but say nothing. The articles that read like everyone else’s because they were generated the same way everyone else generates them. The fake images.&lt;/p&gt;

&lt;p&gt;This actively damages your brand.&lt;/p&gt;

&lt;p&gt;People don’t trust voices that feel manufactured. They scroll past content that could have come from anyone. They unfollow and block accounts that sound increasingly like bots.&lt;/p&gt;

&lt;p&gt;The fear isn’t unfounded. Bad AI use makes you promptable, easy to replicate, impossible to remember.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Does using AI for content creation make you more or less authentic?&lt;br&gt;&lt;br&gt;
Please share your personal experience in the comments.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;But it doesn’t have to be that way&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Those risks are real. But they’re not inevitable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can avoid them through mindful AI integration.&lt;/strong&gt; Not some complex framework, just conscious decision-making about where, why, and how you apply AI to your brand.&lt;/p&gt;

&lt;p&gt;The goal isn’t to use AI for everything. It’s to use AI for the right things, so you can be more human where it matters.&lt;/p&gt;

&lt;p&gt;How does this actually happen? Let’s go through the three most significant ways AI can make you more authentic, not less:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. AI clears mental space for the work that matters&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When AI handles the grunt work, you gain clarity.&lt;/p&gt;

&lt;p&gt;Most creators can’t think strategically because they’re buried in execution. You’re too busy drafting, editing, formatting, and scheduling to ask bigger questions: What does my audience actually need? What change am I trying to create? What makes my perspective unique?&lt;/p&gt;

&lt;p&gt;Using AI to handle repetitive tasks like first drafts, research compilation or formatting, frees your mind to think at a higher level. You can step back and see the forest instead of counting trees.&lt;/p&gt;

&lt;p&gt;This makes you more authentic because you can align your work with your actual values instead of just surviving your task list. When you’re not exhausted from manual labor, you can ensure every piece of content serves your mission.&lt;/p&gt;

&lt;p&gt;You can also use AI as a thinking partner. Feed it your half-formed ideas and let it ask the questions you haven’t considered. Challenge your assumptions. Spot gaps in your logic. Not to replace your thinking, but to sharpen it.&lt;/p&gt;

&lt;p&gt;Need help thinking through this? AI can act as your thinking partner. Copy this prompt to your AI:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Act as my thinking partner. I want to clarify my mission and positioning. Ask me 5 questions that will help me identify what truly matters in my work and what change I want to create. Wait for my answer after each question before moving to the next. Once you have enough information, summarize what you learned and give me suggestions.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; clearer positioning, stronger messaging, work that sounds more like you because you’ve had time to figure out what “you” actually means.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. AI forces you to define what’s truly yours&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;AI makes you more authentic by forcing you to decide what you want to keep doing yourself.&lt;/p&gt;

&lt;p&gt;When you start delegating tasks to AI, you have to get specific. What parts of content creation matter to you? What do you actually enjoy? What feels essential versus what feels like busywork?&lt;/p&gt;

&lt;p&gt;For me, that looked like this: I realized I love coming up with ideas based on personal experience, identifying the emotional or practical value, and structuring the argument. What I hate is the mechanical task of turning bullets into paragraphs, the tedious work of first-draft generation.&lt;/p&gt;

&lt;p&gt;So I outsource that.&lt;/p&gt;

&lt;p&gt;AI handles the painful parts. I handle the parts that I know will make the difference. I double down on my strengths while shoring up my weaknesses.&lt;/p&gt;

&lt;p&gt;And here’s the thing: your audience feels this too. When you double down on what you’re naturally good at and use tools to cover your weaknesses, the quality improves. Your content gets sharper. Your ideas land harder. You show up more consistently because you’re not burning out on tasks you hate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s a quick prompt you can use:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I need you to act as my world-class branding coach. Help me map my content creation process. For each step: ideation, research, outlining, drafting, editing, formatting, distribution, ask me: Do I enjoy this? Am I good at this? Does this feel essential to my voice?&lt;/p&gt;

&lt;p&gt;Based on my answers, show me what I should keep doing myself and what I could delegate to AI.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Authenticity isn’t about doing everything yourself. It’s about ensuring the unique value is yours, then using whatever tools help you deliver it.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. It creates space for what AI can never replace&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The equation is simple: less time on menial tasks means more time talking to people.&lt;/p&gt;

&lt;p&gt;Real conversations with your audience. Understanding their problems. Solving those problems. Building actual relationships, not just collecting followers.&lt;/p&gt;

&lt;p&gt;In the age of AI, this matters more than ever. Everyone can generate content now. Not everyone can be genuinely present with their community.&lt;/p&gt;

&lt;p&gt;So, use AI to handle the scalable stuff so you have energy left for the irreplaceable stuff. Responding to comments thoughtfully. Having real conversations in DMs. Hosting calls with your community. Noticing patterns in what they’re struggling with and adjusting your work accordingly.&lt;/p&gt;

&lt;p&gt;This makes you more authentic because nothing replicates your personal presence.&lt;/p&gt;

&lt;p&gt;There’s no substitute for your specific insights shaped by your specific experiences. Your willingness to show up and give a damn about the people following your work.&lt;/p&gt;

&lt;p&gt;Strong relationships can’t be automated. But if you don’t use AI or some form of leverage, you’ll never build them at scale. You’ll be too busy fighting with sentence structure to notice what your audience actually needs.&lt;/p&gt;

&lt;p&gt;To push you in the right direction, here’s a prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Analyze this piece of content or post:&lt;/p&gt;

&lt;p&gt;[insert link]&lt;/p&gt;

&lt;p&gt;Identify the top 3 recurring questions or problems my audience mentions. Then suggest 3 specific ways I could spend 30 minutes this week having real conversations with my community about these issues—without creating more content.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The authenticity equation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Mindful AI integration makes you more authentic, not less.&lt;/p&gt;

&lt;p&gt;Not because it makes you work harder, but because it lets you work on what matters. It handles the mechanical so you can focus on the meaningful. It takes care of the repeatable so you can invest in the irreplaceable.&lt;/p&gt;

&lt;p&gt;The creators who are building trust at scale aren’t just telling ChatGPT: “Write an article about X.”&lt;/p&gt;

&lt;p&gt;They’re using AI intentionally. They’re documenting their creative process and inserting AI where it matters. They’re creating space for the work only they can do.&lt;/p&gt;

&lt;p&gt;That’s how AI helps you be more authentic, not less.&lt;/p&gt;

&lt;p&gt;PS. Are you a founder or creator who wants to learn more about AI-powered authenticity in personal branding? Subscribe to my newsletter.&lt;/p&gt;

&lt;p&gt;Enjoyed the post? The most sincere compliment is sharing our work; it means a lot.&lt;/p&gt;




&lt;p&gt;Thanks so much to James for this incredible collaboration. Our interaction was truly pleasant, and I’m very much looking forward to sharing the post soon.&lt;/p&gt;

&lt;p&gt;If this inspires you, and you’re interested in a future collaboration, please reach out, let’s make it happen!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>psychology</category>
      <category>writing</category>
    </item>
    <item>
      <title>My Digital Arsenal #2: Keeping Your Codebase Clean with Pre-Commit Hooks</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Fri, 21 Nov 2025 13:38:32 +0000</pubDate>
      <link>https://dev.to/shmulc/my-digital-arsenal-2-keeping-your-codebase-clean-with-pre-commit-hooks-1goa</link>
      <guid>https://dev.to/shmulc/my-digital-arsenal-2-keeping-your-codebase-clean-with-pre-commit-hooks-1goa</guid>
      <description>&lt;p&gt;&lt;em&gt;Automate Code Quality with Pre-Commit Hooks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhiuvs6brq17aq3kjgb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhiuvs6brq17aq3kjgb1.png" width="700" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Welcome back to the second installment of &lt;strong&gt;“My Digital Arsenal,”&lt;/strong&gt; the series where I share the essential tools that power my development workflow.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://open.substack.com/pub/shmulc/p/my-digital-arsenal-1?r=4bvghf&amp;amp;utm_campaign=post&amp;amp;utm_medium=web" rel="noopener noreferrer"&gt;first post&lt;/a&gt; we dove deep into the world of Python package managers, the unsung heroes that keep our project dependencies from collapsing into chaos.&lt;/p&gt;

&lt;p&gt;Today, we are moving from &lt;strong&gt;managing dependencies&lt;/strong&gt; to &lt;strong&gt;managing quality.&lt;/strong&gt; We are setting up our automated guardian for clean code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this post:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Why “consistency” matters more than you think.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The “Manual Trap” of standard linters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to set up &lt;code&gt;pre-commit&lt;/code&gt; with &lt;code&gt;uv&lt;/code&gt; (The 60-second setup).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt; A look at &lt;code&gt;prek&lt;/code&gt;, the blazing-fast Rust alternative.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Code Quality Starts with Consistency&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we talk about tools, let’s talk about the code itself.&lt;/p&gt;

&lt;p&gt;We often think of &lt;strong&gt;“Code Quality”&lt;/strong&gt; as high-level architecture or efficient algorithms. But there is a lower, grittier level of quality that impacts us every single hour: &lt;strong&gt;Consistency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine reading a book where every paragraph uses a different font size, some sentences end with two periods, and random words are capitalized. Could you read it? Sure. Would it be exhausting? Absolutely.&lt;/p&gt;

&lt;p&gt;Code is no different. When you work on a team, or even alone on a project over several months, entropy naturally sets in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One file uses single quotes, another uses double.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One function has trailing whitespace, another doesn’t.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Imports are scattered randomly at the top of the file.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Code Review Nightmare:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, sys # messy imports
def  calculate( x):
    print( "debug") # remove print
    return x*2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;These might seem like “non-important” issues, but they create &lt;strong&gt;cognitive friction&lt;/strong&gt;. Every time your brain has to process inconsistent formatting, it has less energy for solving the actual business logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: The Mechanical Fixers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To remove the human element from style policing, the development community created two types of tools to automate the job:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linters (e.g., Flake8, Ruff, Pylint):&lt;/strong&gt; These are the inspectors. They analyze your code for structural rot, catching errors like undefined variables or unused imports.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Formatters (e.g., Black, YAPF, isort):&lt;/strong&gt; These are the beautifiers. They don’t care what your code &lt;em&gt;does&lt;/em&gt; ; they care how it &lt;em&gt;looks&lt;/em&gt;. They rewrite your code to strictly follow a style guide.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
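&lt;p&gt;For contrast, here is roughly what tools like these would leave of the earlier “nightmare” snippet (the exact output depends on the tool and its configuration):&lt;/p&gt;

```python
# After a formatter + linter pass: the unused imports, stray spaces,
# debug print, and trailing semicolon are all gone.
def calculate(x):
    return x * 2
```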

&lt;p&gt;&lt;strong&gt;The “Manual Execution” Trap&lt;/strong&gt; In the past, using these tools was a manual ritual. You had to remember to run a command like &lt;code&gt;black .&lt;/code&gt; or &lt;code&gt;flake8&lt;/code&gt; before every single commit.&lt;/p&gt;

&lt;p&gt;It sounds simple, but humans are terrible at repetitive tasks. If you were in a rush (and we are always in a rush), you would forget. You would push the code, wait for the CI pipeline, and then watch it fail 10 minutes later because of a trailing comma.&lt;/p&gt;

&lt;p&gt;This led to the infamous “Walk of Shame” in your git history: a stream of tiny commits labeled &lt;em&gt;“fix linting,”&lt;/em&gt; &lt;em&gt;“formatting,”&lt;/em&gt; and &lt;em&gt;“really fix formatting this time.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We had the tools, but we lacked the automation trigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is exactly what drove me to adopt &lt;code&gt;pre-commit&lt;/code&gt;.&lt;/strong&gt; On a previous team, we had a CI stage that checked for linting errors, followed by a very long testing stage. If I forgot to run the formatter locally, the CI would fail early, but the context switch killed my momentum. I would have to fix a trivial whitespace error, push again, and wait for the pipeline to restart. We were losing hours of productivity to simple formatting mistakes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Enter The Automated Guardian&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is where &lt;strong&gt;pre-commit&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;If a Continuous Integration (CI) pipeline is the security checkpoint at the airport, &lt;code&gt;pre-commit&lt;/code&gt; is the helper at your front door checking if you have your keys and wallet before you leave the house.&lt;/p&gt;

&lt;p&gt;Under the hood, Git has a feature called “hooks” — scripts that run automatically at specific points in the Git lifecycle. Historically, managing these hooks was a pain. You had to copy-paste unwieldy Bash scripts between projects.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pre-commit&lt;/code&gt; framework solves this. Instead of messy scripts, you define your rules in one simple &lt;code&gt;.pre-commit-config.yaml&lt;/code&gt; file. When you try to commit, the framework downloads the tools, runs them against your changes, and stops you if something is wrong.&lt;/p&gt;
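&lt;p&gt;To make that concrete, a minimal config can be just a few lines. The repo URL and hook ids below are the standard &lt;code&gt;pre-commit-hooks&lt;/code&gt; ones; the &lt;code&gt;rev&lt;/code&gt; is an example pin, so use whatever release is current:&lt;/p&gt;

```yaml
# .pre-commit-config.yaml — a minimal starting point
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0  # example pin; prefer the latest tagged release
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
```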

&lt;p&gt;Here is why it is an essential part of my arsenal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It Focuses Your Code Reviews:&lt;/strong&gt; The documentation says it best: it “allows a code reviewer to focus on the architecture of a change while not wasting time with trivial style nitpicks.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It Fixes the “Small Stuff” Automatically:&lt;/strong&gt; It doesn’t just catch issues, it often fixes them. Trailing whitespace, end-of-file, and formatting issues are corrected before they ever hit your repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It’s Multi-Language:&lt;/strong&gt; While this series focuses on the &lt;strong&gt;Python ecosystem&lt;/strong&gt; (and the incredible modern tooling like Ruff), pre-commit is language-agnostic. It can manage hooks for JavaScript, Terraform, JSON, and more, all without you needing to manage &lt;code&gt;npm&lt;/code&gt; or &lt;code&gt;gem&lt;/code&gt; files manually.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivshvqfoigbtwuuuqvd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivshvqfoigbtwuuuqvd6.png" width="700" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;My Go-To Pre-Commit Toolkit&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s walk through the minimal, high-impact setup I use to fix the most common annoyances.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. The 60-Second Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Since we are already using &lt;strong&gt;uv&lt;/strong&gt; from the last article, let’s use it here. Instead of polluting your global Python install, we will install &lt;code&gt;pre-commit&lt;/code&gt; as an isolated tool.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The modern way: Install as an isolated tool (can be done of course with pip too)
uv tool install pre-commit

# The “Magic” command that activates the hooks in this repo
pre-commit install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That second command is the most important one. It installs a tiny script into your &lt;code&gt;.git/hooks/&lt;/code&gt; directory. Now, every time you type &lt;code&gt;git commit&lt;/code&gt;, that script will trigger the framework before Git even saves your changes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ The “First Run” Warning:&lt;/strong&gt; The very first time you commit after setting this up, it will be &lt;strong&gt;slow&lt;/strong&gt;. You will see a message like &lt;code&gt;[INFO] Initializing environment for...&lt;/code&gt;. Don’t panic: it is creating isolated environments for your hooks. This happens only once. Future commits will be instantaneous.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. The Recipe: My &lt;/strong&gt;&lt;code&gt;.pre-commit-config.yaml&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Create a file named &lt;code&gt;.pre-commit-config.yaml&lt;/code&gt; in your project’s root. Here is the configuration I use. It covers syntax checking, formatting, and basic file hygiene.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-ast # Is it valid Python?
      - id: check-case-conflict # Avoid case-sensitivity issues on Windows/Mac
      - id: check-merge-conflict # Block commits with '&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;' markers
      - id: check-toml # Validates pyproject.toml
      - id: check-yaml
      - id: check-json
      - id: end-of-file-fixer # Ensures files end with a newline
      - id: trailing-whitespace # Trims accidental whitespace at end of lines

  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.14.5
    hooks:
      # Run the linter (replaces Flake8)
      - id: ruff
        types_or: [python, pyi]
        args: [--fix]
      # Run the formatter (replaces Black)
      - id: ruff-format
        types_or: [python, pyi]

default_stages: [pre-commit]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What it looks like in action:&lt;/strong&gt; When you commit, you’ll see this satisfying output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnojodckcoaqddmi15oz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnojodckcoaqddmi15oz.png" width="640" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. The “Escape Hatch” (For Emergencies)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sometimes, you just need to commit. Maybe you are working on a messy prototype, or you are saving work before your computer dies. If the hooks are blocking you and you need to bypass them, use the &lt;code&gt;--no-verify&lt;/code&gt; flag:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git commit -m “wip: messy code saving for later” --no-verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Use this sparingly. If you bypass the guard too often, you defeat the purpose of having one.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Level Up: The &lt;/strong&gt;&lt;code&gt;pre-push&lt;/code&gt;&lt;strong&gt; Hook&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You will notice I added &lt;code&gt;default_stages: [pre-commit]&lt;/code&gt; at the bottom. This means these checks run on every &lt;em&gt;commit&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But what about heavier tasks? Running your entire unit test suite (&lt;code&gt;pytest&lt;/code&gt;) on every commit is too slow; it breaks your flow. But you definitely want those tests to run before you push your code to the team.&lt;/p&gt;

&lt;p&gt;Git has a specific hook for this called &lt;code&gt;pre-push&lt;/code&gt;. You can add a separate section to your config to run heavy tests only when you push:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- repo: local
    hooks:
      - id: pytest
        name: Run Unit Tests
        entry: uv run pytest
        language: system
        stages: [pre-push]      # &amp;lt;--- Only runs on ‘git push’
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To activate this, you need to run one extra install command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pre-commit install --hook-type pre-push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now you have a tiered defense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit:&lt;/strong&gt; Fast linting &amp;amp; formatting (Instant).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Push:&lt;/strong&gt; Heavy testing &amp;amp; security scans (Slower, but safe).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Level 5: Expanding Your Toolkit&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We have focused heavily on formatting, but the &lt;code&gt;pre-commit&lt;/code&gt; ecosystem is massive. You can find hooks for almost anything—from enforcing static typing to preventing security leaks.&lt;/p&gt;

&lt;p&gt;I highly recommend exploring hooks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;mypy&lt;/code&gt;: To catch type errors before execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;detect-secrets&lt;/code&gt;: To prevent accidental commits of API keys or passwords.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;commitizen&lt;/code&gt;: To enforce standardized commit messages across your team.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
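&lt;p&gt;As a sketch of how these slot into the same config, you would simply append more entries under &lt;code&gt;repos:&lt;/code&gt;. The &lt;code&gt;rev&lt;/code&gt; values below are example pins, not recommendations; check each project for its current release:&lt;/p&gt;

```yaml
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.13.0  # example pin
    hooks:
      - id: mypy
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0   # example pin
    hooks:
      - id: detect-secrets
```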

&lt;p&gt;&lt;strong&gt;Where to find more?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The best place to start is the &lt;a href="https://pre-commit.com/hooks.html" rel="noopener noreferrer"&gt;Official Supported Hooks Index&lt;/a&gt;. It is a searchable database of thousands of hooks for every language and task imaginable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For a curated deep dive, I strongly recommend checking out Gatlen Culp’s article: &lt;strong&gt;&lt;a href="https://gatlenculp.medium.com/effortless-code-quality-the-ultimate-pre-commit-hooks-guide-for-2025-57ca501d9835" rel="noopener noreferrer"&gt;Effortless Code Quality: The Ultimate Pre-Commit Hooks Guide for 2025&lt;/a&gt;&lt;/strong&gt;. It was a major inspiration for this post and helped me refine my own setup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; If you already rely on a specific CLI tool (like &lt;code&gt;bandit&lt;/code&gt;, &lt;code&gt;hadolint&lt;/code&gt;, or &lt;code&gt;sqlfluff&lt;/code&gt;), just Google &lt;code&gt;“tool-name pre-commit”&lt;/code&gt;. If the tool is popular, there is a very high chance a hook for it already exists.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Level 6: The Rust Revolution (&lt;/strong&gt;&lt;code&gt;prek&lt;/code&gt;&lt;strong&gt;)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you read my last post about &lt;code&gt;uv&lt;/code&gt;, you know I am bullish on the “Rust-ification” of the Python ecosystem. We are seeing a wave of tools that are faster, smarter, and easier to use than their predecessors.&lt;/p&gt;

&lt;p&gt;While pre-commit is the industry standard, it is starting to show its age. It requires a Python runtime and can be slow on large repos.&lt;/p&gt;

&lt;p&gt;Then there is the &lt;strong&gt;social aspect&lt;/strong&gt;. I had heard rumors that the maintainer’s interaction style could be ‘abrasive’, but I didn’t get it until I looked at the issue tracker myself. After reading through a few threads and rejected feature requests, I understood why some developers are looking for a friendlier alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Enter &lt;/strong&gt;&lt;code&gt;prek&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;prek&lt;/code&gt; is a reimagined version of pre-commit, built entirely in Rust. It is designed to be a drop-in replacement that respects your existing config but runs circles around the original.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why I’m keeping my eye on it:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architectural Efficiency (Speed &amp;amp; Disk Space):&lt;/strong&gt; It completely redesigned how environments are managed. Unlike the original, &lt;code&gt;prek&lt;/code&gt; shares toolchains between hooks rather than duplicating them. It also clones repositories and installs hooks in parallel. Combined with its internal use of &lt;code&gt;uv&lt;/code&gt;, this results in significantly faster setups and half the disk usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-Hassle Setup:&lt;/strong&gt; It compiles to a single binary with no dependencies. You don’t need to manage Python versions or virtual environments manually — &lt;code&gt;prek&lt;/code&gt; handles all of that automatically. You just download it and run it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modern Workflow Features:&lt;/strong&gt; It solves long-standing pain points like Monorepo support (via “workspaces”) out of the box. It also adds smarter CLI commands we’ve always wanted, like &lt;code&gt;prek run --directory &amp;lt;dir&amp;gt;&lt;/code&gt; to scope checks to a specific folder, or &lt;code&gt;--last-commit&lt;/code&gt; to check only your latest work.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to try it&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Since we are already using &lt;code&gt;uv&lt;/code&gt;, installing &lt;code&gt;prek&lt;/code&gt; is a one-liner:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The modern way: Install as an isolated tool (can be done of course with pip too)
uv tool install

# The “Magic” command that activates the hooks in this repo
prek install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, instead of running &lt;code&gt;pre-commit run&lt;/code&gt;, you just run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prek run --all-files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;code&gt;prek&lt;/code&gt; is newer than &lt;code&gt;pre-commit&lt;/code&gt;, but it is already battle-tested, powering massive projects like &lt;em&gt;Apache Airflow&lt;/em&gt;. While it is still reaching full feature parity, the speed and “plug-and-play” experience make it a compelling alternative if you are tired of the friction with the legacy tool.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion: The Guardian at the Gate&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We started this series by taming our dependencies. Today, we tamed our code quality.&lt;/p&gt;

&lt;p&gt;By setting up &lt;code&gt;pre-commit&lt;/code&gt; (or &lt;code&gt;prek&lt;/code&gt;), you are doing your future self a massive favor. You are stopping the entropy that slowly kills codebases. You are freeing your brain from worrying about whitespace and commas so you can focus on logic and architecture.&lt;/p&gt;

&lt;p&gt;Set it up once. Configure it. Then forget about it, and let the robots do the cleaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you have a favorite custom hook I didn’t mention? Let me know in the comments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://shmulc.substack.com/p/my-digital-arsenal-2/comments" rel="noopener noreferrer"&gt; Leave a comment&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See you in the next installment of &lt;strong&gt;My Digital Arsenal&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>codequality</category>
      <category>git</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Zero-Cost Automation: How Students Get n8n Free for a Year</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Sun, 12 Oct 2025 11:39:35 +0000</pubDate>
      <link>https://dev.to/shmulc/zero-cost-automation-how-students-get-n8n-free-for-a-year-3e3f</link>
      <guid>https://dev.to/shmulc/zero-cost-automation-how-students-get-n8n-free-for-a-year-3e3f</guid>
      <description>&lt;p&gt;&lt;em&gt;The Step-by-Step to Max Out Your GitHub Student Developer Pack&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1gjklf4nrwd86uobvo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1gjklf4nrwd86uobvo2.png" width="560" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Automation is the superpower of the modern developer. Whether you want to scrape a hundred websites, connect five different apps, or build an AI workflow that summarizes posts for you, tools like &lt;strong&gt;n8n&lt;/strong&gt; make it incredibly simple with visual, low-code nodes. n8n allows you to build complex integrations that save you hours of manual work and introduce a whole new level of efficiency to your projects.&lt;/p&gt;

&lt;p&gt;The problem? That automated magic has to run somewhere. While it’s easy to build a powerful workflow, the associated hosting costs can quickly add up. You could pay for the convenience of n8n’s cloud service, spend a few dollars a month on a cloud hosting platform like AWS, or even run the instance locally on your laptop 24/7.&lt;/p&gt;

&lt;p&gt;Beyond just hosting, you’ll also typically need a domain name to easily access and manage your n8n instance from anywhere. For students, or anyone just running a random side-script or learning a new skill, all those options often feel like overkill, or just plain too expensive (or at least you worry they will be if you don’t pay attention).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if you could have the full power of n8n running in the cloud, accessible via your own domain, for an entire year without touching your wallet?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://shmulc.substack.com/subscribe?" rel="noopener noreferrer"&gt; Subscribe now&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;GitHub Student Developer Pack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The secret to achieving zero-cost automation lies with &lt;strong&gt;GitHub&lt;/strong&gt; and its incredibly generous Student Developer Pack (GSDP). This isn’t just a free GitHub Pro account, it’s a bundled collection of free premium software and services that would normally cost hundreds of dollars.&lt;/p&gt;

&lt;p&gt;The GSDP gives you direct access to essential perks like &lt;strong&gt;GitHub Copilot Pro&lt;/strong&gt; for AI-powered coding, &lt;strong&gt;cloud credits&lt;/strong&gt; for hosting platforms like &lt;strong&gt;DigitalOcean&lt;/strong&gt;, and a free domain for a year from providers like &lt;strong&gt;Namecheap&lt;/strong&gt;. By getting this pack, you unlock the specific benefits needed to host your self-managed n8n instance and secure a domain, all for free for a full year.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The First Step: Getting Your Developer Pack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To unlock these benefits, you first need to gain access to the GSDP.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sign up for a GitHub account:&lt;/strong&gt; If you don’t have one, create a free account at &lt;a href="https://github.com/" rel="noopener noreferrer"&gt;github.com&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apply for the GSDP:&lt;/strong&gt; Head to &lt;a href="https://education.github.com/pack" rel="noopener noreferrer"&gt;https://education.github.com/pack&lt;/a&gt; and apply for the Student Developer Pack. You’ll need to provide proof of your student status (like a school email or student ID).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wait for Approval:&lt;/strong&gt; This verification process can take a few days. You’ll receive a confirmation email once you’re approved and gain access to all the partner offers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This video walks through the eligibility criteria and the sign-up process for the GitHub Student Developer Pack:&lt;/p&gt;

&lt;p&gt;Once your GitHub account is approved, you will be able to claim all the benefits at &lt;a href="https://education.github.com/pack" rel="noopener noreferrer"&gt;https://education.github.com/pack&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Second Step: Claim your $200 in DigitalOcean Credits&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the most valuable perks for hosting n8n is the &lt;strong&gt;$200 in free credits&lt;/strong&gt; from DigitalOcean. This provides more than enough allowance to run a robust n8n instance for a full year.&lt;/p&gt;

&lt;p&gt;To claim your credits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Visit the special DigitalOcean GitHub Student Pack &lt;a href="https://cloud.digitalocean.com/github-student-pack?onboarding_origin=github-student-pack&amp;amp;activation_redirect=%2Fgithub-student-pack." rel="noopener noreferrer"&gt;offer page&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You may need to register for a new DigitalOcean account&lt;/strong&gt; if you don’t already have one. Simply follow the on-screen prompts to create your account and link it to your GitHub Student Developer Pack.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once successfully claimed, the $200 credit will be applied to your DigitalOcean account, ready to be used!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Third Step: Registering a Free Domain with Namecheap&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To make your n8n instance accessible from anywhere and to ensure secure connections, you’ll need a domain name and an SSL certificate. Thankfully, the GitHub Student Developer Pack has you covered here too!&lt;/p&gt;

&lt;p&gt;For this, we’ll use Namecheap’s free .me domain offer, which also includes a one-year free SSL certificate.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Head over to the Namecheap GitHub Student Developer Pack portal: &lt;a href="https://nc.me/github/auth" rel="noopener noreferrer"&gt;https://nc.me/github/auth&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Follow the instructions to register your free &lt;code&gt;.me&lt;/code&gt; domain. I personally chose &lt;code&gt;shmulc.me&lt;/code&gt;, but you can pick any available &lt;code&gt;.me&lt;/code&gt; domain (or explore other domain offers within the GSDP if you prefer a different ending).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;During the process, ensure you activate the included free one-year SSL certificate. This is crucial for securing your n8n instance with HTTPS.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you need a bit more guidance, this video provides a helpful walkthrough for registering your domain with &lt;strong&gt;Namecheap&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;With your &lt;strong&gt;DigitalOcean&lt;/strong&gt; credits and a free domain (with SSL) in hand, you’re now fully equipped for the final step: deploying n8n!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Final Step: Creating Your n8n Instance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You now have all the components for a powerful, zero-cost n8n setup: the DigitalOcean credits for hosting and the free domain for access. The only thing left is the deployment itself.&lt;/p&gt;

&lt;p&gt;This video by &lt;strong&gt;Nick Saraev&lt;/strong&gt; does a great job visually explaining how to set up and deploy your n8n instance using a cloud server and your new domain name. This link takes you directly to the relevant part of the video that focuses on the deployment steps with &lt;strong&gt;DigitalOcean:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once your instance is live and secure under your new domain, the deployment setup is complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automate!&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that you’ve deployed your private n8n instance, your final task is to simply create great automations and share them with the community! Go build those AI summarizers, Telegram bots, and complex data pipelines without worrying about the bill for the next year.&lt;/p&gt;

&lt;p&gt;To help you get started, I highly recommend diving into the &lt;a href="https://docs.n8n.io/courses/" rel="noopener noreferrer"&gt;official text courses&lt;/a&gt;. Once you’ve grasped the fundamentals, challenge yourself to think about a cool automation that could genuinely improve your life or streamline your student workflow. The web is brimming with resources and examples; a quick search will unveil countless possibilities. If there’s a specific n8n tutorial you’d like me to cover in the future, just let me know!&lt;/p&gt;

&lt;p&gt;I truly hope this guide helps you break through the initial automation barrier and empowers you to build incredible things. I’m genuinely excited to hear what you create with this powerful, zero-cost setup! Feel free to share your projects in the comments below or reach out to me directly if you build something cool.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>n8n</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Forget Manual Solving, Let Z3 Crack The Code</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Wed, 08 Oct 2025 17:34:32 +0000</pubDate>
      <link>https://dev.to/shmulc/forget-manual-solving-let-z3-crack-the-code-2op4</link>
      <guid>https://dev.to/shmulc/forget-manual-solving-let-z3-crack-the-code-2op4</guid>
      <description>&lt;p&gt;&lt;em&gt;A Formal Approach for Solving Logic Puzzles with an SMT Solver&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt5urwhvjriidwb88glk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt5urwhvjriidwb88glk.jpeg" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You know the feeling. That moment when a &lt;strong&gt;Sudoku&lt;/strong&gt; grid snaps into place, the hidden shapes of a &lt;strong&gt;Nonogram&lt;/strong&gt; finally emerge from the pixels, or a perfectly balanced &lt;strong&gt;Kakuro&lt;/strong&gt; sum falls into line. It’s the thrill of logic, the satisfaction of a challenge overcome by sheer brainpower.&lt;/p&gt;

&lt;p&gt;But have you ever noticed just how similar these seemingly different puzzles are? At their core, they all boil down to the exact same thing: &lt;strong&gt;a set of constraints that must be satisfied&lt;/strong&gt;. What if I told you there’s a powerful, elegant way to encode those rules, not just for one puzzle, but for &lt;em&gt;all&lt;/em&gt; of them, and instantly reveal the answer? Welcome to the formal approach to puzzle-solving.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;For the rest of this post, I’ll be assuming you’re familiar with the standard rules of &lt;strong&gt;Sudoku&lt;/strong&gt;, &lt;strong&gt;Kakuro&lt;/strong&gt;, and &lt;strong&gt;Nonograms&lt;/strong&gt;. If you need a quick refresher on any of them, please take a moment now to look them up before we proceed.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introducing the SMT Solver&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You are right to see the similarity. The elegant solution to these diverse problems lies in &lt;strong&gt;Constraint Satisfaction&lt;/strong&gt;. This is where we introduce the star of the show: an &lt;strong&gt;SMT Solver&lt;/strong&gt;, specifically &lt;strong&gt;Z3&lt;/strong&gt; from Microsoft Research. SMT stands for &lt;strong&gt;Satisfiability Modulo Theories&lt;/strong&gt;. Don’t let the name scare you. Think of it as a highly specialized logic engine that takes all your puzzle’s rules, your “constraints”, and determines if a configuration exists that makes every single rule true simultaneously. If it does, Z3 finds that configuration for you: the solution.&lt;/p&gt;

&lt;p&gt;If no configuration exists that satisfies all the rules, Z3 returns a result of “&lt;strong&gt;unsatisfiable&lt;/strong&gt;.” This happens when your puzzle is logically flawed, perhaps because it was designed poorly or because the starting clues are contradictory. For instance, if you provide a Sudoku with two of the same number in a single row as a starting point, Z3 will quickly determine that the puzzle is impossible, which is often a useful insight in itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmtiqtvndlq3uvedmp0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmtiqtvndlq3uvedmp0m.png" width="502" height="504"&gt;&lt;/a&gt;Another example for Unsatisfiable Sudoku, can you see why?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Not a SAT Solver&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, you may have heard of a simpler &lt;strong&gt;SAT Solver&lt;/strong&gt;, which only deals with pure Boolean logic, meaning statements that are strictly true or false. It’s worth a quick aside: in theoretical computer science, SAT is an &lt;strong&gt;NP-complete problem&lt;/strong&gt;, meaning it’s generally considered one of the hardest problems to solve efficiently. However, in the real world, modern SAT solvers use clever &lt;strong&gt;heuristics&lt;/strong&gt; and techniques that make them incredibly fast and capable of solving most practical instances.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;SMT Solver&lt;/strong&gt;, however, is much more powerful. It extends SAT by adding support for common mathematical concepts, or “&lt;strong&gt;Theories&lt;/strong&gt;,” like integers, real numbers, and arithmetic. This is critical for puzzles like &lt;strong&gt;Kakuro&lt;/strong&gt;, which rely heavily on summing numbers, or even &lt;strong&gt;Sudoku&lt;/strong&gt;, which uses integer constraints. Z3 allows us to easily model all those numeric, non-Boolean puzzle requirements, giving us a universal key to cracking these challenges.&lt;/p&gt;

&lt;p&gt;The concept of using computers to solve logic puzzles is certainly not new; specialized Sudoku and Nonogram solvers have been around for years. However, the true elegance of the &lt;strong&gt;formal approach&lt;/strong&gt; lies in its &lt;strong&gt;generality&lt;/strong&gt;. Instead of building a new, dedicated solver for every puzzle type, we can use a single general tool like Z3 to tackle a huge range of problems with surprisingly little, highly adaptable code. That versatility is the real game-changer, and we’ll see exactly how it works soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introducing Z3&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we dive into the code, let’s formally introduce the tool. &lt;strong&gt;Z3&lt;/strong&gt; is one of the world’s most powerful and widely used &lt;strong&gt;SMT solvers&lt;/strong&gt;, developed by &lt;strong&gt;Microsoft Research&lt;/strong&gt; and open-sourced under the &lt;strong&gt;MIT License&lt;/strong&gt;. Initially built for complex problems in &lt;strong&gt;software verification&lt;/strong&gt;, Z3 handles logical formulas and mathematical constraints with exceptional speed. While its core engine is written in C++, it provides robust bindings for languages like &lt;strong&gt;C, C++, Java, .NET, and Python&lt;/strong&gt;. For this post, we will exclusively use the user-friendly &lt;strong&gt;Python binding&lt;/strong&gt;, which lets us model our puzzles quickly and elegantly without unnecessary overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Z3 101: The Essentials&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First, you’ll need the Python bindings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install z3-solver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, let’s look at the four core components you need to translate any problem into Z3 code.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Variables: Defining the Unknowns&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In Z3, everything starts with variables that represent the unknown values you’re trying to find. Because Z3 supports various logical &lt;em&gt;Theories&lt;/em&gt;, these variables aren’t just true/false statements; they can represent numbers, arrays, and more.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Int('name')&lt;/code&gt;: Creates an &lt;strong&gt;Integer&lt;/strong&gt; variable (e.g., for numbers 1-9).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Bool('name')&lt;/code&gt;: Creates a &lt;strong&gt;Boolean&lt;/strong&gt; variable (True/False).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Ints('a b c')&lt;/code&gt;: A shorthand for creating multiple integer variables.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from z3 import *
x = Int('x')
flag = Bool('flag')  # avoid naming this is_true, which shadows z3.is_true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Constraints: Writing the Rules&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Constraints are the logical formulas that must be true for the system to have a solution. You build these using familiar comparison operators (&lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;==&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;) and Z3’s specialized logic functions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;And(c1, c2, ...)&lt;/code&gt;: Requires all contained constraints to be true.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Or(c1, c2, ...)&lt;/code&gt;: Requires at least one contained constraint to be true.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Distinct(a, b, c)&lt;/code&gt;: Forces all listed variables to have different values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;x + y == 10&lt;/code&gt;: Represents standard arithmetic constraints.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can use the simple &lt;code&gt;solve()&lt;/code&gt; function to find values for variables that meet a system of constraints:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from z3 import *
x, y = Ints('x y')
solve(x &amp;gt; 2, y &amp;lt; 10, x + 2*y == 7)
# Output: [y = 0, x = 7]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;3. The Solver: The Engine&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Solver&lt;/code&gt; object is the central component where you accumulate your constraints and ask Z3 to find a solution.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;s = Solver()&lt;/code&gt;: Initializes the solver.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;s.add(constraint)&lt;/code&gt;: Asserts constraints into the solver’s stack.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;s.check()&lt;/code&gt;: Executes the search. Returns &lt;code&gt;sat&lt;/code&gt; (solvable), &lt;code&gt;unsat&lt;/code&gt; (unsolvable), or &lt;code&gt;unknown&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;unsat&lt;/code&gt; Result:&lt;/strong&gt; If &lt;code&gt;s.check()&lt;/code&gt; returns &lt;code&gt;unsat&lt;/code&gt;, your constraints are contradictory: no assignment of values can satisfy all the rules simultaneously.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;4. The Model: The Answer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;s.check()&lt;/code&gt; returns &lt;code&gt;sat&lt;/code&gt; (satisfiable), the solver has successfully found a solution. This assignment of values is called the &lt;strong&gt;Model&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;m = s.model()&lt;/code&gt;: Retrieves the solution object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;m[variable]&lt;/code&gt;: Extracts the solved value for a specific variable from the model.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from z3 import *

x = Int('x')
s = Solver()
s.add(x * 2 == 14)

if s.check() == sat:
    m = s.model()
    print(m[x])  # prints 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For deeper guidance and advanced topics, I suggest reading the &lt;a href="https://microsoft.github.io/z3guide/programming/Z3%20Python%20-%20Readonly/Introduction" rel="noopener noreferrer"&gt;Official Tutorial&lt;/a&gt; — it’s a great resource for getting a deeper understanding of Z3’s more advanced features.&lt;/p&gt;

&lt;p&gt;By understanding these four building blocks, you have everything required to translate the rules of our logic puzzles into Z3’s logical language. Our goal is to create a single, general function for each puzzle that captures its fundamental rules. Once those core constraints are in the solver, plugging in a specific puzzle instance is trivial. We’ll start with the most familiar puzzle.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Sudoku: The Constraint Classic&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A standard 9x9 Sudoku has four universal constraints: every cell must contain a value from 1 to 9, and every row, column, and 3x3 block must contain distinct values. We can translate these rules directly into Z3.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;The General Solver Function&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We’ll define a function, &lt;code&gt;solve_sudoku&lt;/code&gt;, that takes a &lt;code&gt;sudoku&lt;/code&gt; object (which holds the puzzle’s structure and clues) as its input.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from z3 import *

def solve_sudoku(sudoku: SudokuPuzzle) -&amp;gt; dict[str, str]:
  """Solves the given Sudoku puzzle using Z3's SMT capabilities."""

  solver = Solver()

  # 9x9 grid of Z3 Integer variables
  # Assuming 'sudoku.positions' is a list of all 81 cell names ('A1', 'A2', ... 'I9')
  symbols = {pos: Int(pos) for pos in sudoku.positions}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. General Constraints (The Rules)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next, we add the three core uniqueness constraints using Z3’s powerful &lt;code&gt;Distinct&lt;/code&gt; function.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Constraint 1: Cell Range [1, 9]
  for symbol in symbols.values():
    solver.add(And(symbol &amp;gt;= 1, symbol &amp;lt;= 9))

# Constraint 2: Rows must have distinct values
for row in “ABCDEFGHI”:
  row_symbols = [symbols[row + col] for col in “123456789”]
  solver.add(Distinct(row_symbols))

# Constraint 3: Columns must have distinct values
for col in “123456789”:
  col_symbols = [symbols[row + col] for row in “ABCDEFGHI”]
  solver.add(Distinct(col_symbols))

# Constraint 4: 3x3 Blocks must have distinct values
for i in range(3):
  for j in range(3):
    # List comprehension to select cells in the 3x3 block
    block_symbols = [symbols[”ABCDEFGHI”[m + i * 3] + “123456789”[n + j * 3]] 
                     for m in range(3) for n in range(3)]
    solver.add(Distinct(block_symbols))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Instance Constraints (The Clues)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With the general rules loaded, we now add the specific clues from the puzzle instance passed in the &lt;code&gt;sudoku&lt;/code&gt; parameter.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Assuming ‘sudoku.grid’ is a dictionary like {’A1’: ‘5’, ‘A2’: ‘0’, ...}
for pos, value in sudoku.grid.items():
  if value != “0”: 
    solver.add(symbols[pos] == int(value)) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;3. Solving and Interpreting the Model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The final step is to ask the solver to find a solution and extract the model.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if solver.check() != sat:
  raise Exception(”Unsolvable Sudoku provided!”)

  # Retrieve the model (the solution)
  model = solver.model()

  # Extract the final integer value for every cell
  values = {pos: model.evaluate(symbol).as_string() for pos, symbol in symbols.items()}

  # The function returns the solved values
  return values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The resulting code is essentially a direct translation of the Sudoku rules into Z3 constraints. The puzzle-agnostic power of the SMT solver is clear: we didn’t write an algorithm to &lt;em&gt;search&lt;/em&gt; for the solution; we simply wrote code that &lt;em&gt;describes&lt;/em&gt; the solution.&lt;/p&gt;

&lt;p&gt;Now, let’s look at &lt;strong&gt;Kakuro&lt;/strong&gt; , where we’ll leverage Z3’s powerful &lt;strong&gt;Arithmetic Theory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20r810skf7vnn0s6m1iq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20r810skf7vnn0s6m1iq.png" width="360" height="360"&gt;&lt;/a&gt;An example of a Sudoku puzzle&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Kakuro: The Arithmetic Challenge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Unlike Sudoku, &lt;strong&gt;Kakuro&lt;/strong&gt; (or Cross Sums) requires both uniqueness (no repeated digits in a run) &lt;em&gt;and&lt;/em&gt; arithmetic (the cells in a run must sum to a target). This is where the power of the &lt;strong&gt;Arithmetic Theory&lt;/strong&gt; within the SMT solver becomes essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The General Solver Function&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We define the &lt;code&gt;solve_kakuro&lt;/code&gt; function to accept the &lt;code&gt;KakuroPuzzle&lt;/code&gt; object. This solver leverages Z3’s &lt;code&gt;Sum&lt;/code&gt; and &lt;code&gt;Distinct&lt;/code&gt; functions together to enforce the puzzle rules.&lt;/p&gt;

&lt;p&gt;To manage the runs, we utilize a simple helper function, &lt;code&gt;get_sum_run&lt;/code&gt;, which identifies the sequence of cells for any given clue.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from z3 import Solver, Int, Sum, Distinct, sat, And, Not, Ints

# Helper function to identify the sequence of cells for a clue
def get_sum_run(
    puzzle: KakuroPuzzle, first_x: int, first_y: int, direction: str
) -&amp;gt; list[Cell]:
    “”“Get cells involved in a sum run starting from a clue cell”“”
    rows, cols = puzzle.size
    cells = []

    if direction == “right”:
        for x in range(first_x + 1, cols):
            if puzzle.is_wall(x, first_y): break
            cells.append((x, first_y))
    else:
        for y in range(first_y + 1, rows):
            if puzzle.is_wall(first_x, y): break
            cells.append((first_x, y))
    return cells

def solve_kakuro(puzzle: KakuroPuzzle) -&amp;gt; Solution | None:
    rows, cols = puzzle.size
    solver = Solver()

    # Create grid of Z3 Integer variables
    grid = [[Int(f”cell_{col}_{row}”) for row in range(rows)] for col in range(cols)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. General Constraints (The Rules and Clues)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;These constraints establish the size and valid range for every cell in the grid, including 0 for the clue/wall cells.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Constraint 1: Cell Range [1, 9] for fillable cells
for row in range(rows):
    for col in range(cols):
        if puzzle.get_clue(col, row):
            # Clue cells (walls) are assigned a constant 0
            solver.add(grid[col][row] == 0)
        else:
            # Fillable cells must be between 1 and 9
            solver.add(And(grid[col][row] &amp;gt;= 1, grid[col][row] &amp;lt;= 9))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Instance Constraints (Sums and Uniqueness)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For every sum clue provided, we apply two simultaneous constraints to the run of cells that follows: the numbers must sum correctly, and all numbers must be unique.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add sum and uniqueness constraints for every clue
for clue in puzzle.clues:
    x, y, row_sum, col_sum = clue.x, clue.y, clue.row_sum, clue.col_sum

    # Horizontal Run
    if row_sum is not None:
        if right_cells := get_sum_run(puzzle, x, y, “right”):
            cell_vars = [grid[col][row] for col, row in right_cells]

            # Sum Constraint &amp;amp; Uniqueness Constraint
            solver.add(Sum(cell_vars) == row_sum)
            solver.add(Distinct(cell_vars))

    # Vertical Run
    if col_sum is not None:
        if down_cells := get_sum_run(puzzle, x, y, “down”):
            cell_vars = [grid[col][row] for col, row in down_cells]

            # Sum Constraint &amp;amp; Uniqueness Constraint
            solver.add(Sum(cell_vars) == col_sum)
            solver.add(Distinct(cell_vars))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;3. Solving and Interpreting the Model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The final step is to retrieve the solution, filtering out the zeros corresponding to the clue/wall cells.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if solver.check() == sat:
        model = solver.model()
        solution_cells = []

        # Iterate over the grid to extract the non-zero (fillable) values
        for col in range(cols):
            for row in range(rows):
                if value := model.evaluate(grid[col][row]).as_long():
                    if value &amp;gt; 0:
                        solution_cells.append(SolutionCell(col, row, value))

        return solution_cells
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The difference between the Sudoku and Kakuro solutions is minimal in terms of code complexity — the SMT solver handles the massive increase in logical complexity (arithmetic) seamlessly.&lt;/p&gt;

&lt;p&gt;If you want to see the complete solutions, including a visualizer and a scraper, check out my &lt;a href="https://github.com/anuk909/Kakuro-Solver" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42rn52hgs0sg4ex9nsbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42rn52hgs0sg4ex9nsbh.png" width="449" height="449"&gt;&lt;/a&gt;An example of a Kakuro puzzle&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Nonogram: A Modeling Challenge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The journey from &lt;strong&gt;Sudoku&lt;/strong&gt; (simple uniqueness) to &lt;strong&gt;Kakuro&lt;/strong&gt; (uniqueness plus arithmetic) shows how Z3’s &lt;strong&gt;SMT capabilities&lt;/strong&gt; handle increasing logical complexity with minimal code changes. However, &lt;strong&gt;Nonograms&lt;/strong&gt; (or Picross) present a slightly different, more abstract modeling challenge.&lt;/p&gt;

&lt;p&gt;Nonograms swap the &lt;strong&gt;Integer Theory&lt;/strong&gt; for a complex sequence problem. The constraints aren’t simple sums or distinct values; they are rules governing the &lt;em&gt;arrangement&lt;/em&gt; of blocks (True/False cells) in a line, separated by at least one space, to match the clue numbers.&lt;/p&gt;

&lt;p&gt;This problem is solvable using the exact same Z3 tools you’ve already learned — Boolean variables, logic operators, and creative constraint formulation — but correctly defining the sequence logic is considerably harder.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Your Challenge:&lt;/strong&gt; I spent several hours finding an elegant way to model the sequence constraints for Nonograms using the Z3 API. Now that you have the knowledge of &lt;strong&gt;Variables&lt;/strong&gt;, &lt;strong&gt;Constraints&lt;/strong&gt;, &lt;strong&gt;Solvers&lt;/strong&gt;, and &lt;strong&gt;Models&lt;/strong&gt;, I’m leaving the Nonogram solver as a challenge to you. If you solve it, I would love to see and discuss your solution in the comments! 💡&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thank you for joining me on this formal journey into puzzle-solving. I hope this post inspires you to look at your daily newspaper puzzles not just as a pastime, but as a fascinating challenge in &lt;strong&gt;Constraint Satisfaction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cnh9m2zgw9l2fcyw0ae.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cnh9m2zgw9l2fcyw0ae.jpeg" width="700" height="700"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>security</category>
    </item>
    <item>
      <title>The Table That Saved Me</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Mon, 14 Jul 2025 13:56:14 +0000</pubDate>
      <link>https://dev.to/shmulc/the-table-that-saved-me-5cmk</link>
      <guid>https://dev.to/shmulc/the-table-that-saved-me-5cmk</guid>
      <description>&lt;p&gt;&lt;em&gt;How The Big Tasks Table Rescued My Sanity and Deadlines&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3600hl6u0h8m7b5xpoow.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3600hl6u0h8m7b5xpoow.jpeg" width="800" height="533"&gt;&lt;/a&gt;An homage to the movie “The Girl That Saved Me”, which I’ve never watched and only found out about five minutes ago.&lt;/p&gt;

&lt;p&gt;You know that feeling when your brain just can’t hold all your to-dos? In the tech industry, that feeling goes into overdrive. We juggle countless meetings, intricate details, ongoing projects, and so many other obligations that demand serious organization, whether you’re working solo or collaborating with a team.&lt;/p&gt;

&lt;p&gt;While countless resources offer task management solutions, I want to introduce &lt;strong&gt;my personal method: “The Big Tasks Table.”&lt;/strong&gt; It’s far from perfect, incredibly simple, but surprisingly effective.&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk you through the method, share how I started using it, discuss its pros and cons compared to other approaches, and explain why discovering a personal system that genuinely suits your needs is so vital.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;The Task Management Tightrope&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Before I found my footing, task management felt like walking a tightrope. As a junior software engineer, my initial role was manageable with basic organization, mostly relying on OneNote. But my second role? That’s where things spiraled. Our team operated with tight Scrum sprints and GitLab issues, and for me, it was a nightmare. I spent almost half my time organizing and describing tasks instead of actually doing the work. I’m not saying these methods are inherently wrong; it was just my personal experience then, but it left me feeling traumatized by complex systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjslj39byrq4lbeds14w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjslj39byrq4lbeds14w.jpeg" width="700" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I transitioned to being a course instructor for a technology course, the chaos continued. We had a demanding one to two months to organize the curriculum, followed by an intense course period. The sheer volume of tasks meant we had to be incredibly organized. One day, my own instructor from when I took the course came to help. They mentioned a common way to categorize tasks: &lt;strong&gt;“Must Have,” “Really Want,” and “Nice To Have.”&lt;/strong&gt; Something immediately clicked.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Introducing: The Big Tasks Table&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;That simple idea sparked “The Big Tasks Table.” I took those three categories and organized them into a structured table, which quickly became full of tasks, hence the name! It took some time to refine and adjust it, even experimenting with colors for collaborative work at one point. I’ve been using it consistently ever since.&lt;/p&gt;

&lt;p&gt;The method itself is incredibly straightforward: You create a table with &lt;strong&gt;three columns&lt;/strong&gt; and &lt;strong&gt;three rows&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Columns:&lt;/strong&gt; This Week, This Month, Ongoing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rows:&lt;/strong&gt; Must Have, Really Want, Nice to Have&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdam6u707c2tewasuw15t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdam6u707c2tewasuw15t.png" width="700" height="127"&gt;&lt;/a&gt;It should look like that!&lt;/p&gt;

&lt;p&gt;Each week, you fill the blocks with your tasks, marking them off as you complete them. At the end of the week, you simply copy the table as is, remove any finished tasks, reset any “ongoing” marks, and keep going. I primarily use OneNote for this, but it works just as well in Obsidian or any other note-taking app.&lt;/p&gt;

&lt;p&gt;When deciding which tasks to tackle, I recommend starting with the &lt;strong&gt;“Must Have”&lt;/strong&gt; items for &lt;strong&gt;“This Week”&lt;/strong&gt; (the upper-left block) and then moving across or down. If a task requires more detail or has sub-tasks, you can add them directly within that block. For tasks where you’re waiting on someone else, you can mark them in any way you prefer — I personally use highlighting to indicate a dependency.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Handy OneNote Shortcuts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;OneNote is, in my opinion, one of Microsoft’s best products. Here are some shortcuts worth knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ctrl+1 — Apply, select, or clear the &lt;strong&gt;To Do&lt;/strong&gt; tag.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ctrl+2 — Apply or clear the &lt;strong&gt;Important&lt;/strong&gt; tag.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ctrl+K — Insert a hyperlink.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ctrl+Alt+H — Highlight the selected text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alt+Shift+D — Insert the current date.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alt+Shift+T — Insert the current time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ctrl+F — Search the current page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ctrl+E — Search all notebooks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to learn more, check out &lt;a href="https://support.microsoft.com/en-us/office/keyboard-shortcuts-in-onenote-44b8b3f4-c274-4bcc-a089-e80fdcc87950" rel="noopener noreferrer"&gt;Keyboard shortcuts in OneNote&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Why It Works (And When It Might Not)&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The beauty of the Big Tasks Table lies in its simplicity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Good:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Super Simple:&lt;/strong&gt; It takes only a few minutes to fill with all your tasks. If it takes longer, you’re probably overthinking it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always Visible:&lt;/strong&gt; You can see all your tasks at a glance, ensuring nothing gets lost or forgotten.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Highly Adaptable:&lt;/strong&gt; It’s easy to adjust to your specific needs. For example, a friend I shared this method with had so many tasks that he adapted it to a daily and weekly view!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Minimal Detail:&lt;/strong&gt; Its simplicity means there isn’t much space for comprehensive task descriptions or detailed notes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Analytics:&lt;/strong&gt; You can’t run analytics or generate reports from it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solo Focus:&lt;/strong&gt; It’s primarily designed for individual work and doesn’t seamlessly integrate with complex collaborative systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fciy19lekedxle94dkhs7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fciy19lekedxle94dkhs7.jpeg" width="580" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, these challenges are manageable. For the past two years, I’ve successfully used the Big Tasks Table alongside team-based systems like Jira. It takes almost no extra time to manage my personal tasks in OneNote, even when collaborating within more complex platforms. While I still don’t particularly enjoy intricate systems like Jira, I understand their necessity when working with a team.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Find Your Own System&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;So, do you already have a method to manage your tasks? If yes, I’d love to hear what it is. Share your insights in the comments below, including what works well for you and any unique adaptations you’ve made. Your experiences could inspire others!&lt;/p&gt;

&lt;p&gt;And if you don’t yet have a system, I truly recommend you start one. The specific tools or methodologies matter less than the act of taking control. Whether it’s my Big Tasks Table, a simple to-do list app, or complex project-management software, any system is infinitely better than no system at all.&lt;/p&gt;

&lt;p&gt;Experiment, adapt, and find what resonates with your personal workflow. The goal is to free up your mental energy from remembering tasks so you can focus on actually completing them, leading to less stress and more accomplished goals.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>learning</category>
      <category>career</category>
      <category>management</category>
    </item>
    <item>
      <title>AI Snippets — Napkin AI</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Sun, 29 Jun 2025 22:11:35 +0000</pubDate>
      <link>https://dev.to/shmulc/ai-snippets-napkin-ai-2nd6</link>
      <guid>https://dev.to/shmulc/ai-snippets-napkin-ai-2nd6</guid>
      <description>&lt;p&gt;&lt;em&gt;Become the Presentation Champion with only one Napkin&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;How often have you found yourself drowning in a sea of text, desperately wishing for a magic wand to turn it into something visually digestible? In the high-stakes world of presentations, words alone often fall short of making an impact.&lt;/p&gt;

&lt;p&gt;Welcome to the first AI Snippets post, a series dedicated to showcasing innovative AI solutions that tackle real-world problems. Today, we’re diving into &lt;strong&gt;Napkin AI&lt;/strong&gt;, a tool designed to revolutionize how you communicate by turning text into stunning diagrams in mere seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k0xw62h7gd8i8q1bxf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k0xw62h7gd8i8q1bxf6.png" width="700" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Sometimes text is not enough&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“A picture is worth a thousand words.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Has it ever happened to you that you struggled to understand a piece of text, and then a single diagram made everything clear? Or perhaps you needed to create a presentation that just looked too dry, no matter how much you tweaked the text?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visuals don’t just clarify; they make information memorable, foster engagement, and can transform a tedious presentation into a captivating experience.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Napkin AI website is designed to help you with exactly that: transforming any text into a visual diagram in mere seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Example — Cash Flow and Profit&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To illustrate, let’s look at the following text:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Cash flow refers to the actual money moving in and out of a business. When a business receives payments from sales, loans, or investments, that’s cash inflow. When it pays for expenses like salaries, rent, or supplies, that’s cash outflow. Positive cash flow means the business has more money coming in than going out, which is essential for paying bills and funding operations. Negative cash flow means more money is leaving the business than entering it, which can lead to financial difficulties.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Profit, or net income, is the money left over after all expenses have been subtracted from total revenue. This includes both cash expenses (like salaries) and non-cash expenses (like depreciation, which accounts for the wear and tear on assets over time). A business can be profitable on paper (meaning its revenues exceed its expenses) but still have poor cash flow if customers are not paying quickly enough, or if a lot of its expenses are due immediately. Therefore, both strong cash flow and good profit are important indicators of a business’s financial health.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To truly put Napkin AI to the test, I fed it the text on cash flow and profit. After exploring various diagram options, I settled on a visual that clarifies these concepts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjd08dmwwftqoza8gu6fz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjd08dmwwftqoza8gu6fz.png" width="700" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Getting Started with Napkin AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To start using Napkin AI, go to &lt;a href="https://www.napkin.ai/" rel="noopener noreferrer"&gt;https://www.napkin.ai/&lt;/a&gt; (PC only) and click “Get Napkin Free/Sign In”. You can then sign in with your Google account and start creating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh1bxxj4wv2oij5xja5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh1bxxj4wv2oij5xja5o.png" width="700" height="339"&gt;&lt;/a&gt;Opening Screen&lt;/p&gt;

&lt;p&gt;On the website, you can create a new “Napkin” in three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with a blank one&lt;/strong&gt; and insert your text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Import text from an existing file&lt;/strong&gt;, with support for Docs, PDF, PPTX, MD, and HTML.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generate text with AI&lt;/strong&gt;, where you can choose any topic in the world and receive AI-generated text on it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Turtles!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To test the last option, I prepared some text about turtles (of course) with a few diagrams I created in a minute. Feel free to take a look:&lt;a href="https://app.napkin.ai/page/CgoiCHByb2Qtb25lEiwKBFBhZ2UaJDc3MWZhODg1LWE0NmQtNGY3Yy05OThlLWI4ZWJhYTg4OGVjNw?s=1" rel="noopener noreferrer"&gt; https://app.napkin.ai/page/CgoiCHByb2Qtb25lEiwKBFBhZ2UaJDc3MWZhODg1LWE0NmQtNGY3Yy05OThlLWI4ZWJhYTg4OGVjNw?s=1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here are some of the diagrams I got:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyuffcd8z7o1u7huw6uqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyuffcd8z7o1u7huw6uqv.png" width="552" height="408"&gt;&lt;/a&gt;Understanding Turtle Biology&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kqvow0xxioz58n2eyfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kqvow0xxioz58n2eyfc.png" width="700" height="552"&gt;&lt;/a&gt;Turtle Reproduction Cycle&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pricing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Currently, Napkin is in beta, and &lt;strong&gt;most of its features are available for free&lt;/strong&gt;. For users needing more, Napkin offers &lt;strong&gt;Plus&lt;/strong&gt; and &lt;strong&gt;Pro&lt;/strong&gt; tiers. These paid plans provide additional benefits such as more AI credits, the option to remove Napkin branding, team management tools, exclusive designs and more.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Plus plan costs $12 per month&lt;/strong&gt;, while the &lt;strong&gt;Pro plan is $30 per month&lt;/strong&gt;. You can also get a &lt;strong&gt;25% discount if you opt for an annual subscription&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For most casual users, the free tier will likely be more than enough. One commendable aspect of Napkin’s approach is their refreshingly &lt;strong&gt;unobtrusive promotion of premium features&lt;/strong&gt;. Unlike many freemium services, they don’t aggressively push upgrades; I didn’t even know there was a paid plan until I stumbled on the pricing tab by accident.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypcxf1elp89r191n711o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypcxf1elp89r191n711o.png" width="700" height="424"&gt;&lt;/a&gt;Full Plans Information&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations, and Human Touch&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So, will Napkin AI solve all your problems? In my opinion, not yet. It’s an innovative and unique product, but the diagrams can be a bit repetitive (at least on the free plan), and it can falter on more complex texts full of numbers. Sometimes none of the suggested diagrams fits what you have in mind, and you walk away with nothing.&lt;/p&gt;

&lt;p&gt;It’s important to note that human involvement is still required in the process to decide which diagram is most suitable for the case and to ensure there are no errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In conclusion, I believe this is a brilliant tool that can often help you transform dry text into an impressive visual diagram. And even when it doesn’t, exploring its suggestions can give you better ideas for presenting the text.&lt;/p&gt;

&lt;p&gt;Why not give Napkin AI a spin yourself? Head over to &lt;a href="https://www.napkin.ai/" rel="noopener noreferrer"&gt;https://www.napkin.ai/&lt;/a&gt; and unleash your inner presentation champion.&lt;/p&gt;

&lt;p&gt;I would love to see the incredible diagrams you create; share them in the comments below!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnq92neptyyhht732srle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnq92neptyyhht732srle.png" width="700" height="394"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tools</category>
      <category>productivity</category>
      <category>design</category>
    </item>
    <item>
      <title>My Digital Arsenal #1: Mastering Python Package Managers</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Wed, 04 Jun 2025 22:49:11 +0000</pubDate>
      <link>https://dev.to/shmulc/my-digital-arsenal-1-mastering-python-package-managers-3gb7</link>
      <guid>https://dev.to/shmulc/my-digital-arsenal-1-mastering-python-package-managers-3gb7</guid>
      <description>&lt;p&gt;&lt;em&gt;Mastering Python Package Managers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzidexgb7q9rn4qstqbpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzidexgb7q9rn4qstqbpv.png" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Welcome to the first installment of my new series, “My Digital Arsenal,” where I’ll be sharing the essential tools that power my development workflow. Forget the dusty old toolbox; this is about the sleek, powerful software that makes coding less of a chore and more of a creative adventure. Each post will spotlight a tool or family of tools I’ve found invaluable and think you might too.&lt;/p&gt;

&lt;p&gt;Today, we’re diving headfirst into the often-underestimated world of &lt;strong&gt;package managers&lt;/strong&gt;. These aren’t just utilities; they’re your project’s lifeline, your sanity savers, and the unsung heroes that prevent your coding environment from collapsing into a chaotic mess of conflicting libraries.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Our Focus: Python’s Package Management&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It’s true that nearly every modern programming language comes equipped with its own package manager(s) — you might have heard of &lt;strong&gt;npm&lt;/strong&gt; for JavaScript, &lt;strong&gt;Maven&lt;/strong&gt; or &lt;strong&gt;Gradle&lt;/strong&gt; for Java, and &lt;strong&gt;Cargo&lt;/strong&gt; for Rust, to name a few. So, you might wonder why this series kicks off with Python, and why it will likely remain a frequent topic. The main reason is straightforward: Python has been my primary programming language for several years now, and I’m most familiar with its ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Appeal of Python in This Context&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Beyond personal preference, Python has some distinct advantages, especially when we talk about managing external code. It’s known for being incredibly beginner-friendly. If you’ve ever faced the challenge of manually handling dependencies in a language like C++, the relative simplicity of Python’s package management can feel like a breath of fresh air. Coupled with Python’s vast versatility and widespread popularity across many fields, understanding how to effectively manage its packages is a truly valuable skill for any developer working with the language.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Are Python Package Managers &amp;amp; Why Use Them?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Python package manager&lt;/strong&gt; is a tool that automates managing the external “code toolkits” (packages, libraries or tools) your project needs. It handles finding, installing, updating, and resolving issues for these packages.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why they’re essential:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Without one, you’re navigating a tricky path! Here’s how they save the day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handle Dependencies:&lt;/strong&gt; Packages often rely on other packages. A manager automatically finds and installs all these necessary “sub-packages” correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ensure Stability with Versioning:&lt;/strong&gt; Code toolkits change. Package managers let you control which version of a toolkit your project uses. This prevents updates from unexpectedly breaking your code and ensures everyone on your team uses the same versions (often via a “lock file”).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep Projects Separate:&lt;/strong&gt; Using &lt;strong&gt;virtual environments&lt;/strong&gt; (often managed by or with the package manager), you can isolate each project’s toolkits. This stops different projects from having conflicting toolkit versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save Time:&lt;/strong&gt; They automate the tedious tasks of downloading, installing, and managing packages, letting you focus on coding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplify Teamwork:&lt;/strong&gt; When everyone uses the same package manager and configuration, it’s easy to share projects and ensure everyone has a consistent development environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
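
&lt;p&gt;As a minimal sketch of the isolation idea, here is how a project-local virtual environment is created with Python’s built-in &lt;code&gt;venv&lt;/code&gt; module — the same mechanism the package managers below automate for you:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create an isolated environment in the .venv folder (stdlib, no extra install)
python -m venv .venv

# Activate it (Linux/macOS); on Windows use .venv\Scripts\activate
source .venv/bin/activate

# Anything installed now lands inside .venv, not in the system Python
pip install requests

# Leave the environment when done
deactivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;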

&lt;p&gt;In short, package managers are crucial for efficiently and reliably using external code in Python, preventing many common headaches.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;My Go-To Python Package Management Toolkit&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Let’s look at the tools I frequently reach for in my Python adventures.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;pip: The OG, The Standard, The Everywhere Tool&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What It Is&lt;/strong&gt; : &lt;strong&gt;pip&lt;/strong&gt; (Pip Installs Packages) is Python’s standard package installer. It’s the command-line tool you’ll use most often to add libraries from the Python Package Index (PyPI) and other sources to your projects. If you’re using a modern version of Python, pip is typically available right out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It’s Foundational&lt;/strong&gt; :&lt;/p&gt;

&lt;p&gt;Pip is the universal starting point for Python package management due to its ubiquity with Python installs. Its simple commands (like &lt;code&gt;pip install package-name&lt;/code&gt;) make common tasks straightforward, and it reliably handles core installation needs for countless developers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Catch?&lt;/strong&gt; : pip is primarily an &lt;em&gt;installer&lt;/em&gt;. Its dependency resolver aims to find a compatible set of packages rather than creating a deterministic lock file for the entire dependency graph. This can sometimes lead to subtle conflicts or variations in environments over time if dependencies are not meticulously pinned. It also doesn’t inherently create a “lock file” to guarantee identical environments everywhere, though &lt;code&gt;pip freeze&lt;/code&gt; is a common way to pin versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Usage (The Classics)&lt;/strong&gt; :&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install a package from PyPI (Python Package Index)
pip install requests

# Install a specific version
pip install requests==2.25.1

# Generate a list of installed packages
pip freeze &amp;gt; requirements.txt

# Install all packages from a requirements file
pip install -r requirements.txt

# Uninstall a package
pip uninstall requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Useful Tip:&lt;/strong&gt; If you find yourself with a Python installation that somehow doesn’t have pip, you can usually install it using Python’s built-in &lt;code&gt;ensurepip&lt;/code&gt; module: &lt;code&gt;python -m ensurepip --upgrade&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtvgq4p77epbt3g8ip7n.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtvgq4p77epbt3g8ip7n.jpeg" width="700" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Poetry: The All-in-One Project &amp;amp; Dependency Maestro&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What It Is&lt;/strong&gt; : &lt;strong&gt;Poetry&lt;/strong&gt; is a modern Python tool for comprehensive dependency management, packaging, and project organization. It moves beyond just installing packages, offering an integrated workflow for developing, managing, and distributing Python applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It’s a Game Changer&lt;/strong&gt; : Poetry brings robust structure, consistency, and reliability to Python projects. It standardizes project configuration and metadata through the &lt;code&gt;pyproject.toml&lt;/code&gt; file and excels at creating fully reproducible environments, which is critical for serious development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; How It Addresses pip's Limitations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unified Project Definition with &lt;code&gt;pyproject.toml&lt;/code&gt;&lt;/strong&gt;: Poetry uses &lt;code&gt;pyproject.toml&lt;/code&gt; (a now-standard Python project file) as the single source of truth for your project's metadata, dependencies (both main and development), scripts, and even other tool configurations. This is more organized than relying solely on a &lt;code&gt;requirements.txt&lt;/code&gt; or a &lt;code&gt;setup.py&lt;/code&gt; file. Example &lt;code&gt;pyproject.toml&lt;/code&gt; snippet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[tool.poetry]
name = "my-awesome-project"
version = "0.1.0"
description = "A truly awesome project."
authors = ["Your Name &amp;lt;you@example.com&amp;gt;"]

[tool.poetry.dependencies]
python = "^3.8"  # Specifies compatible Python versions
requests = "^2.25.1" # Main dependency with version constraint

[tool.poetry.group.dev.dependencies]
pytest = "^7.0" # Development-only dependency
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliable, Reproducible Builds&lt;/strong&gt; : Addressing limitations of simpler tools, Poetry employs a sophisticated dependency resolver. This generates a detailed &lt;code&gt;poetry.lock&lt;/code&gt; file, which, while often large and machine-generated (and should be committed to your repository), precisely records all package versions and hashes. This guarantees identical, deterministic builds for everyone on the team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatic Virtual Environments&lt;/strong&gt; : Poetry automatically creates and manages a dedicated virtual environment per project, simplifying isolation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Potential Downsides&lt;/strong&gt; : While powerful, Poetry might be overkill for very simple scripts and projects. Its thorough dependency resolution can sometimes be slower than &lt;code&gt;pip&lt;/code&gt; for initial setups in complex projects, and it presents a slightly steeper learning curve due to its richer feature set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Usage (A Full Workflow)&lt;/strong&gt; :&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install Poetry (pip works; official docs also recommend their installer
# or pipx for isolated installs)
pip install poetry

# Start a new Poetry-managed project
poetry new my-awesome-project

# Navigate into your project
cd my-awesome-project

# Add a new dependency (updates pyproject.toml and poetry.lock)
poetry add requests

# Install all dependencies from poetry.lock (or pyproject.toml if no lock)
poetry install

# Run a script within the project's virtual environment
poetry run python your_script.py

# Update a specific package (and its dependencies if needed)
poetry update requests # Or 'poetry update' to update all

# To regenerate the lock file based on pyproject.toml (e.g., after manual edits)
poetry lock

# See your dependency tree
poetry show --tree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Good to Know&lt;/strong&gt; : Poetry also streamlines the process of building your project into distributable formats (e.g., wheels, sdists) and publishing them to PyPI or private repositories, making it a complete lifecycle tool. You can also manage different dependency groups (e.g., for ‘dev’ tools, ‘docs’ generation, or ‘testing’) beyond the main project dependencies.&lt;/p&gt;
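
&lt;p&gt;In outline, that build-and-publish workflow looks like this (a sketch; the &lt;code&gt;docs&lt;/code&gt; group here is a hypothetical example and must be defined in your &lt;code&gt;pyproject.toml&lt;/code&gt;, and publishing requires credentials or a configured token):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Build a wheel and an sdist into the dist/ folder
poetry build

# Upload the built artifacts to PyPI (or a configured private repository)
poetry publish

# Install the main dependencies plus an optional group, e.g. 'docs' tooling
poetry install --with docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;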

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xfinkvtyn089dnnpik9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xfinkvtyn089dnnpik9.png" width="700" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;uv: The Blazing-Fast Python Packager ⚡&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What It Is&lt;/strong&gt; : &lt;strong&gt;uv&lt;/strong&gt; is an extremely fast Python package installer, resolver, and virtual environment manager, built in Rust by Astral (the creators of Ruff, which I hope to write more about in a future post). It’s designed to be a significantly faster alternative to &lt;strong&gt;pip&lt;/strong&gt; and &lt;strong&gt;virtualenv&lt;/strong&gt;, and can work with &lt;code&gt;pyproject.toml&lt;/code&gt; for full project management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It’s Gaining Traction&lt;/strong&gt; : &lt;strong&gt;uv&lt;/strong&gt;’s standout feature is its &lt;strong&gt;exceptional speed&lt;/strong&gt;. It can be 10-100x faster than &lt;strong&gt;pip&lt;/strong&gt; for common operations like installing packages or creating virtual environments, especially when leveraging its global cache. This performance dramatically reduces waiting times, particularly in CI/CD and for large projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21IPvd%21%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F7e711305-60a1-42d5-90ce-c0418d46b517_496x107.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21IPvd%21%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F7e711305-60a1-42d5-90ce-c0418d46b517_496x107.svg" width="496" height="107"&gt;&lt;/a&gt;Speed Comparison&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Capabilities &amp;amp; How It Compares&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed Demon:&lt;/strong&gt; This is &lt;strong&gt;uv&lt;/strong&gt;’s hallmark, drastically cutting down time for package installation, resolution, and virtual environment creation across all its usage modes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Versatile as a pip Replacement:&lt;/strong&gt; &lt;strong&gt;uv&lt;/strong&gt; offers a &lt;code&gt;uv pip&lt;/code&gt; interface mirroring many common &lt;strong&gt;pip&lt;/strong&gt; commands (&lt;code&gt;install&lt;/code&gt;, &lt;code&gt;freeze&lt;/code&gt;, &lt;code&gt;uninstall&lt;/code&gt;), serving as a high-speed drop-in for &lt;code&gt;requirements.txt&lt;/code&gt;-based workflows. It also includes &lt;code&gt;uv venv&lt;/code&gt; for rapid virtual environment creation and &lt;code&gt;uv pip compile&lt;/code&gt; for fast &lt;code&gt;requirements.in&lt;/code&gt; to &lt;code&gt;requirements.txt&lt;/code&gt; compilation (similar to &lt;code&gt;pip-tools&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Powerful as a Project Manager:&lt;/strong&gt; &lt;strong&gt;uv&lt;/strong&gt; also excels at managing projects that use a &lt;code&gt;pyproject.toml&lt;/code&gt; file (with the standard &lt;code&gt;[project]&lt;/code&gt; table for dependencies), offering an organized, centralized approach. Example &lt;code&gt;pyproject.toml&lt;/code&gt; snippet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[project]
name = "my-awesome-project"
version = "0.1.0"
description = "A truly awesome project."
readme = "README.md"
dependencies = [
  "httpx",
  "ruff&amp;gt;=0.3.0"
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dedicated CLI Tool Management:&lt;/strong&gt; &lt;strong&gt;uv&lt;/strong&gt; includes commands like &lt;code&gt;uv tool install&lt;/code&gt; and &lt;code&gt;uv tool run&lt;/code&gt; to install and run Python CLI applications (such as &lt;strong&gt;ruff&lt;/strong&gt;) in isolated environments, similar to &lt;strong&gt;pipx&lt;/strong&gt;, offering a convenient way to manage your Python-based developer tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
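
&lt;p&gt;The tool-management commands mentioned above look like this in practice (a sketch; &lt;code&gt;ruff&lt;/code&gt; is just an example CLI tool here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install a CLI tool into its own isolated environment and put it on PATH
uv tool install ruff

# Or run a tool ephemerally without a permanent install
uv tool run ruff check .

# See which tools uv currently manages
uv tool list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;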

&lt;p&gt;&lt;strong&gt;Potential Downsides &amp;amp; Considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maturity &amp;amp; Feature Gaps:&lt;/strong&gt; As a newer tool, &lt;strong&gt;uv&lt;/strong&gt; is still rapidly evolving. While highly capable, it might not yet mirror every niche feature of the more tenured &lt;strong&gt;pip&lt;/strong&gt; or offer Poetry's full, mature suite of integrated project lifecycle commands (e.g., complex build configurations or extensive publishing plugins). Most existing projects will also be using these established tools, which might influence adoption for ongoing work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Practical Adoption Hurdles:&lt;/strong&gt; Unlike &lt;strong&gt;pip&lt;/strong&gt; (usually bundled with Python), &lt;strong&gt;uv&lt;/strong&gt; requires a separate installation step. Furthermore, while its speed is a major draw, this specific benefit might be less critical for smaller projects or workflows with infrequent package operations, where existing tools may already be "good enough."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Basic Usage (Speeding Up Your Workflow):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install uv (e.g., via pip; check official uv docs for all install options)
pip install uv

# Create a blazing-fast virtual environment
uv venv .venv
# Activate: source .venv/bin/activate (Linux/macOS) or .venv\Scripts\activate (Windows)

# Install packages (like pip; can use pyproject.toml if present)
uv pip install requests

# Add a dependency to pyproject.toml &amp;amp; install it (in uv project mode)
uv add httpx

# Install packages from a requirements.txt file
uv pip install -r requirements.txt

# Generate requirements.txt from current environment (like pip freeze)
uv pip freeze &amp;gt; requirements.txt

# Compile a requirements.in file to requirements.txt
uv pip compile requirements.in -o requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Good to Know:&lt;/strong&gt; &lt;strong&gt;uv&lt;/strong&gt; is part of Astral's high-performance Python tooling suite. It's under very active development, with its feature set expanding quickly, and can even manage Python installations directly (&lt;code&gt;uv python install &amp;lt;version&amp;gt;&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8x7otwco88gxluamrqjj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8x7otwco88gxluamrqjj.png" width="700" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Conclusions: Choosing Your Package Management Ally&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Selecting the right Python package manager is key for an efficient, stable development experience. There’s no single “best” tool, as each shines in different scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;pip&lt;/strong&gt; is your foundational tool, perfect for simple scripts and its universal availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Poetry&lt;/strong&gt; excels at robust, end-to-end project management, offering guaranteed reproducibility for libraries and applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;uv&lt;/strong&gt; delivers exceptional speed, whether used as a super-fast &lt;strong&gt;pip&lt;/strong&gt; replacement or an increasingly capable modern project manager.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these three cover a wide range of needs, the Python ecosystem offers other excellent tools. For instance, &lt;strong&gt;PDM&lt;/strong&gt; provides a modern, comprehensive approach similar to &lt;strong&gt;Poetry&lt;/strong&gt;; &lt;strong&gt;Conda&lt;/strong&gt; is widely adopted in scientific computing for managing packages (Python and non-Python) and complex environments; and &lt;strong&gt;Hatch&lt;/strong&gt; offers powerful project management and build capabilities.&lt;/p&gt;

&lt;p&gt;Consider your project’s complexity, team collaboration needs, whether absolute reproducibility or sheer speed is paramount, &lt;strong&gt;and if your team (including yourself) leans towards adopting the newest, potentially fastest tools or prefers the stability and widespread familiarity of more established options.&lt;/strong&gt; The Python tooling landscape is always improving, so experiment to find what best suits your workflow — it’s a worthwhile investment for any developer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeo93ufsvczo9n5wolab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeo93ufsvczo9n5wolab.png" width="700" height="398"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>python</category>
      <category>tooling</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Wasting LLM Tokens!</title>
      <dc:creator>Shmulik Cohen</dc:creator>
      <pubDate>Sun, 18 May 2025 19:21:19 +0000</pubDate>
      <link>https://dev.to/shmulc/stop-wasting-llm-tokens-5bj0</link>
      <guid>https://dev.to/shmulc/stop-wasting-llm-tokens-5bj0</guid>
      <description>&lt;p&gt;&lt;em&gt;The Shocking Truth About What Really Affects Your LLM&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquwgsx9mlkknskdc9d8o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquwgsx9mlkknskdc9d8o.jpeg" width="800" height="639"&gt;&lt;/a&gt;Zero Waste Tokens&lt;/p&gt;

&lt;p&gt;In recent years, Large Language Models (LLMs) and vision-language models (VLMs) have taken the world by storm. With their meteoric rise, a new discipline emerged: &lt;strong&gt;Prompt Engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As prompt engineering exploded, so did the myths around it. In this post, I break down what works, what doesn’t, and why being concise might just be the real prompt superpower.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Prompt Engineering?
&lt;/h3&gt;

&lt;p&gt;Prompt engineering is the art of crafting task-specific instructions — prompts — to elicit high-quality outputs from AI models, without modifying their internal architecture or retraining them. Instead of changing the model, we change the &lt;em&gt;input&lt;/em&gt; to unlock the model’s latent capabilities.&lt;/p&gt;

&lt;p&gt;These prompts can be natural language instructions, few-shot examples, or even learned vector embeddings. At their best, prompts act like keys that unlock the right behavior within a powerful pre-trained model.&lt;/p&gt;

&lt;p&gt;The results have been impressive. Prompt engineering has powered everything from more coherent summaries to stronger reasoning and even complex task automation. Naturally, this led to an explosion of prompt libraries, marketplaces, and tools promising “10x better results.”&lt;/p&gt;

&lt;h4&gt;
  
  
  The Prompt Engineering Hype
&lt;/h4&gt;

&lt;p&gt;Take a simple instruction like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Summarize this article in 3–4 paragraphs”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A prompt engineer might turn it into something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq6hv92wtpp3q6l7kbna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq6hv92wtpp3q6l7kbna.png" width="715" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Companies and researchers claim these elaborate prompts significantly outperform basic ones. Prompt marketplaces even emerged, selling optimized templates as premium assets.&lt;/p&gt;

&lt;p&gt;But here’s the thing: &lt;strong&gt;many of these verbose constructions don’t actually improve results as much as people think.&lt;/strong&gt; In many cases, they just waste tokens — and money.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Skeptical View
&lt;/h4&gt;

&lt;p&gt;I’ve always held a healthy skepticism toward “Prompt Engineering”. Not because it’s useless (it can be incredibly valuable, and I’ve seen that value firsthand many times), but because its impact is often &lt;strong&gt;overstated.&lt;/strong&gt; Many token-heavy components in so-called optimized prompts don’t meaningfully affect the output at all.&lt;/p&gt;

&lt;p&gt;In this post, I’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Rise of Prompt Engineering, from academic labs to mainstream practice&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Great Simplification, which lets us craft excellent prompts with far fewer words&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prompt Debloat, a tool that lets you see which parts of your prompt matter most&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Future of Prompting&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s separate prompt engineering fact from fiction and learn how to communicate with LLMs more efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Rise Of Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;When GPT-3 debuted in 2020, users made a fascinating discovery: the &lt;em&gt;way&lt;/em&gt; you asked the model a question mattered as much as the question itself. This observation birthed prompt engineering — a discipline focused on the art and science of communicating with AI.&lt;/p&gt;

&lt;h4&gt;
  
  
  From Academic Labs to Mainstream Practice
&lt;/h4&gt;

&lt;p&gt;The field evolved rapidly through key research breakthroughs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Few-Shot Learning (2020)&lt;/strong&gt;: Researchers found that showing a model examples of what you wanted — “Here’s how you solve problem X, now solve problem Y” — dramatically improved performance. This technique allowed models to adapt to new tasks with minimal guidance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chain-of-Thought (2022)&lt;/strong&gt;: The simple instruction “think step by step” revolutionized how models handled complex reasoning. Accuracy on math and logic problems jumped by 20–40%, simply by asking models to show their work.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These techniques helped bridge the gap between general-purpose language models and task-specific results — without any fine-tuning.&lt;/p&gt;
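&lt;p&gt;As an illustration of the few-shot idea, a prompt can be assembled by prepending worked examples to the new task. The &lt;code&gt;few_shot_prompt&lt;/code&gt; helper below is a minimal sketch of my own, not taken from any particular library, and the Q/A layout is just one common convention:&lt;/p&gt;

```python
def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: worked Q/A examples first, then the new task."""
    lines = []
    for question, answer in examples:
        lines.append("Q: " + question)
        lines.append("A: " + answer)
    # The trailing "A:" cues the model to continue the pattern
    # the examples establish.
    lines.append("Q: " + query)
    lines.append("A:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("2 + 2", "4"), ("3 + 5", "8")],
    "7 + 6",
)
```

&lt;p&gt;The point is structural: the model infers the task format from the examples, so no explicit instructions are needed.&lt;/p&gt;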

&lt;h4&gt;
  
  
  The Prompt-Heavy Era: Midjourney and Maximum Verbosity
&lt;/h4&gt;

&lt;p&gt;Nowhere was prompt maximalism more visible than in the early days of image generation. In Midjourney v1–4, generating a compelling image required long, detailed prompts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3a51szn0jjfmmgi6og45.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3a51szn0jjfmmgi6og45.jpeg" width="800" height="800"&gt;&lt;/a&gt;Food photography of cute tiny people preparing a gigantic cheeseburger, ultra realistic, ultra detailed, UHD image — style raw — s 750 — v 5.1&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshrpgl9u13p1xc2fnr8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshrpgl9u13p1xc2fnr8w.png" width="800" height="533"&gt;&lt;/a&gt;A meticulously engineered external sensor pod for an isolated outpost on Mars, designed to capture environmental data in real time, with a futuristic, aerodynamic shape, integrated soft lighting, and rugged construction that withstands the extreme conditions of a Martian night, perfectly aligned with the advanced space exploration theme. in photorealistic style, Isometric 3D style, isolated on white background with negative space&lt;/p&gt;

&lt;p&gt;A whole subculture emerged on Discord around “prompt recipes.” People spent hours crafting elaborate incantations to control everything from lighting to lens distortion. Some prompts grew to hundreds of tokens. Prompt marketplaces like &lt;strong&gt;PromptBase&lt;/strong&gt; started selling “optimized prompts” that sometimes cost more to run than the image itself.&lt;/p&gt;

&lt;p&gt;The philosophy was clear: &lt;em&gt;more is better.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But that’s no longer the case.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Great Simplification
&lt;/h3&gt;

&lt;p&gt;As models advanced in 2023 and 2024, the need for elaborate prompts sharply declined. Why? Because the models got &lt;strong&gt;smarter&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Models Now Understand More with Less
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-Reasoning&lt;/strong&gt;: GPT-4, Claude, and others began reasoning step-by-step without needing the phrase “let’s think step by step.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Intent Inference&lt;/strong&gt;: These models now infer your goal even from vague or poorly phrased requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-Prompting Architecture&lt;/strong&gt;: With systems like GPT-4o and Claude 3.5, the models effectively write internal prompts for themselves as they solve problems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, this means your original prompt matters &lt;em&gt;less&lt;/em&gt; than it used to — especially for reasoning tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Midjourney v3 vs. GPT-4o: One Line Is Enough
&lt;/h4&gt;

&lt;p&gt;Want to generate a stunning image?&lt;/p&gt;

&lt;p&gt;In Midjourney v3, it might have taken you 80 carefully chosen tokens. In &lt;strong&gt;GPT-4o&lt;/strong&gt; , you can just type:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“A surreal painting of a cat floating in space.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;…and get something genuinely beautiful. No need to specify “high detail,” “octane render,” or “golden hour lighting” — the model fills in the gaps intelligently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnc8nvktpl4rwqtl9d89p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnc8nvktpl4rwqtl9d89p.png" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This reflects a broader truth: &lt;strong&gt;prompt engineering today is less about verbosity and more about clarity.&lt;/strong&gt; The goal isn’t to cast a magic spell — it’s to &lt;em&gt;specify what you want&lt;/em&gt; and &lt;em&gt;how you want it formatted&lt;/em&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Reasoning Models: The Ultimate Simplification
&lt;/h4&gt;

&lt;p&gt;The latest frontier is models specifically designed for reasoning, such as OpenAI’s o1 or DeepSeek’s R1. These systems represent a fundamental shift:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;They don’t need explicit scaffolding to reason logically&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They &lt;strong&gt;internally&lt;/strong&gt; generate their own task breakdowns&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The initial prompt has a weaker effect on the final result&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These models are designed to “figure things out” rather than follow exhaustive instructions. Prompt engineering for them is no longer about micromanaging behavior — it’s about &lt;em&gt;efficiently triggering&lt;/em&gt; the right internal processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Debloat
&lt;/h3&gt;

&lt;p&gt;Recently I joined LinkedIn, and amidst the usual stream of motivational posts, job offers, and thoughts about the future of AI, I found this post by &lt;a href="https://www.linkedin.com/in/iddogino?miniProfileUrn=urn%3Ali%3Afsd_profile%3AACoAAAfVMegBgfDLAeNK-zqC-WSis-nNw0X4hCE" rel="noopener noreferrer"&gt;Iddo Gino&lt;/a&gt; (the founder of RapidAPI and a talented software engineer featured in Forbes 30 Under 30):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.linkedin.com/posts/iddogino_aioptimization-promptengineering-activity-7320842006878908416-svq_?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAyeke8BITafzZWtP6WMCOl4gpfmzLZMDS4" rel="noopener noreferrer"&gt;Built a tool to analyze prompt bloat and quality | Iddo Gino on LinkedIn&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;“Been exploring the concept of ‘prompt bloat’ this week, and ended up building a tool to analyze the importance of words…”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the post, Gino introduced a tool he built to analyze which parts of a prompt actually influence the model’s output. The inspiration? A now-famous comment from Sam Altman noting that users adding “please” and “thank you” to prompts was costing OpenAI millions in unnecessary tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28mnndwb6c0c5ny13dic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28mnndwb6c0c5ny13dic.png" width="799" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gino’s tool is both clever and practical. It lets you visually inspect which parts of a prompt are contributing to the result — and which parts are just taking up space (and money). You can use it as an educational tool to understand prompt mechanics or as a utility to strip down bloated prompts to their most effective core.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of tooling the community needs more of: not magic recipes, but clarity tools that make prompting more intentional and efficient.&lt;/p&gt;

&lt;p&gt;You can try it out for yourself at this link:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://promptdebloat.datawizz.ai/" rel="noopener noreferrer"&gt;https://promptdebloat.datawizz.ai/&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  A Few Examples
&lt;/h4&gt;

&lt;p&gt;To put the tool to the test, I ran some prompts I found online, and the results were stunning.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://docs.anthropic.com/en/resources/prompt-library/python-bug-buster" rel="noopener noreferrer"&gt;Python bug buster&lt;/a&gt;:&lt;/strong&gt; Anthropic maintains a Prompt Library of optimized prompts for a breadth of business and personal tasks, hardly a place you’d expect to find useless tokens. I tried the ‘Python bug buster’ prompt, and here are the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdim9ieuzq0t7nl4zo2z5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdim9ieuzq0t7nl4zo2z5.png" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;140 IQ Senior:&lt;/strong&gt; Recently, someone in a vibe coding group that I’m in asked about a general system prompt for an agent, and someone else responded with this draft:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;You are a senior independent {role} with a 140 IQ and an unparalleled attention to detail. Your mission is to act as a quality assurance inspector for the final {task you want to criticize}.&lt;br&gt;&lt;br&gt;
 Your review must ensure that every aspect of the {task} is spot on, all assignment requirements are fully met, and answers are 100% accurate.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I checked this prompt too and got this result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuveffyscl9agjsl7oms3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuveffyscl9agjsl7oms3.png" width="768" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apparently the IQ reference had minimal impact, showing that verbosity doesn’t equate to utility.&lt;/p&gt;

&lt;p&gt;You can try any prompt you have (up to 500 words). But how does it work? And can you trust the results, or is this just random green and red coloring?&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How Does It Work?&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Prompt Debloat&lt;/strong&gt; uses a method called &lt;strong&gt;token ablation&lt;/strong&gt; (also known as &lt;strong&gt;input perturbation&lt;/strong&gt;) to figure out which words in your prompt actually matter. The basic idea is simple: it removes words from your prompt one by one and sees how much the model’s response changes.&lt;/p&gt;

&lt;p&gt;If removing a word makes little or no difference to the output, that word is probably not pulling its weight — and might just be wasting tokens. On the other hand, if removing a word &lt;em&gt;does&lt;/em&gt; change the response significantly, it’s likely doing important work.&lt;/p&gt;

&lt;p&gt;This process helps you trim down your prompt by spotting the “bloat” — unnecessary words that can safely be cut to save on cost and improve clarity.&lt;/p&gt;
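&lt;p&gt;To make the idea concrete, here is a minimal leave-one-out sketch of my own. The &lt;code&gt;query_model&lt;/code&gt; callback and the &lt;code&gt;toy_model&lt;/code&gt; stand-in are hypothetical, and real tools typically compare token probabilities rather than raw string similarity:&lt;/p&gt;

```python
from difflib import SequenceMatcher

def ablate(prompt, query_model):
    """Leave-one-out ablation: drop each word, re-query, and score how much
    the output shifts (0 = no change, 1 = completely different)."""
    baseline = query_model(prompt)
    words = prompt.split()
    scores = {}
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        changed = query_model(reduced)
        scores[word] = round(1 - SequenceMatcher(None, baseline, changed).ratio(), 3)
    return scores

# Hypothetical stand-in model that only reacts to the word "summarize".
def toy_model(p):
    return "SUMMARY" if "summarize" in p else "ECHO: " + p

scores = ablate("Please kindly summarize this text", toy_model)
# "Please" and "kindly" score 0.0 here; "summarize" dominates.
```

&lt;p&gt;Each query re-runs the model, so the cost grows linearly with prompt length; that is one reason tools like this test one token at a time rather than every combination.&lt;/p&gt;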

&lt;h4&gt;
  
  
  Limitations of Token Ablation
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context Matters&lt;/strong&gt;: A word that seems unimportant in one prompt might be critical in another. Results aren’t always universal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Too Many Possibilities&lt;/strong&gt;: Testing every combination of tokens quickly becomes overwhelming as the prompt gets longer, so most tools test one token at a time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Subtle Changes&lt;/strong&gt;: Sometimes a word might influence the tone or nuance in ways that aren’t easy to measure just by comparing probabilities or visible output.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So now that you know how it works and what its limitations are, I invite you to run your most complex prompts through &lt;a href="https://promptdebloat.datawizz.ai/" rel="noopener noreferrer"&gt;Prompt Debloat&lt;/a&gt; and share the results in the comments below.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Future of Prompting: Less Magic, More Simplicity
&lt;/h3&gt;

&lt;p&gt;Not long ago, I attended a meetup where the organizers shared their experience building an AI agent to assist with software engineering tasks. As expected, they kicked things off by talking about prompt engineering — how they refined their inputs, added detailed instructions, and experimented with all sorts of formatting tricks to boost performance.&lt;/p&gt;

&lt;p&gt;It sounded like classic prompt wizardry: custom templates, carefully worded system messages, and all the usual prompt engineering lore.&lt;/p&gt;

&lt;p&gt;But then they said something that caught everyone off guard.&lt;/p&gt;

&lt;p&gt;In the end, most of the improvements didn’t come from some secret prompt formula. What actually made the biggest difference? Just using &lt;a href="https://www.cursor.com/" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;’s built-in prompt suggestions — simple, well-structured, and focused on clear output formats. That was it. No prompt maximalism, no elaborate frameworks. The tooling alone lifted their agent far beyond the baseline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8qkd716wg4uhfyixbr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8qkd716wg4uhfyixbr0.png" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That moment really stayed with me.&lt;/p&gt;

&lt;p&gt;It confirmed something I’ve been noticing for a while: the future of prompting isn’t about conjuring magic words. It’s about thoughtful design — clear intent, less noise, and trusting the model to do its job without micromanagement.&lt;/p&gt;

&lt;p&gt;Prompt engineering isn’t dead. But the “more is more” era is fading. The real power now lies in restraint — knowing what to say, what to leave out, and how to shape the interaction like a good interface, not a spellbook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best prompts aren’t the longest. They’re the clearest.&lt;br&gt;&lt;br&gt;
Let’s stop wasting tokens — and start communicating better with our models.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://shmulc.substack.com" rel="noopener noreferrer"&gt;AI Superhero&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
