Tim Escolopio

Posted on Feb 18

Code Scalpel: Making Software Development with AI Agents Cheaper, Governable, Accurate, and Safe

#ai #mcp #agents #codenewbie

Here's what happens when an AI agent edits your code without the right tools:

# Agent task: "Add error handling to the login function"
# Agent response: *regenerates 200 lines, introduces syntax error*

def login(username, password:
    # ^ Missing closing parenthesis - build breaks

Standard tools write the broken code to disk. Your CI fails. You debug. You waste time.

Code Scalpel catches this before it touches disk. AST parser validates syntax → edit rejected → error logged. Your build never breaks from agent hallucinations.

This is one of the Four Pillars that make Code Scalpel different from every other AI code tool.

The four pillars of Code Scalpel

1. Cheaper AI: up to 99% context reduction

Instead of feeding 10 full files (15,000 tokens) to an LLM, Code Scalpel's PDG Engine surgically extracts only the relevant function and its dependencies.

Example: Agent needs to understand process_payment

Without Code Scalpel (naive approach):

Read entire payments.py: 3,500 tokens
Read models.py (imports): 2,800 tokens
Read notifications.py (dependency): 1,200 tokens
Read stripe config: 800 tokens
Total: 8,300 tokens
LLM context: 8,300 tokens of mostly irrelevant code
Cost: ~$0.025 per call (assuming roughly $3 per million input tokens)

With Code Scalpel (surgical extraction):

# Tool: extract_code with dependencies
result = extract_code(
    file="payments.py", 
    symbol="process_payment",
    include_dependencies=True
)

Returns:

process_payment function: 25 lines
validate_amount (dependency): 10 lines
Relevant imports only: 3 lines
Total: ~200 tokens
Cost: ~$0.0006 per call

Savings: 97.6% fewer tokens, and a matching 97.6% reduction in input cost
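To make the idea concrete, here is a minimal sketch of dependency-aware extraction using Python's built-in ast module. It is not Code Scalpel's PDG Engine: it only follows direct calls to functions defined in the same module, and extract_with_dependencies is a hypothetical helper name chosen for illustration.

```python
import ast


def extract_with_dependencies(source: str, symbol: str) -> str:
    """Return the source of `symbol` plus same-module functions it calls.

    Simplification: calls are resolved by bare name only, so methods,
    imports, and decorators are ignored.
    """
    tree = ast.parse(source)
    functions = {
        node.name: node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    if symbol not in functions:
        raise KeyError(f"no top-level function named {symbol!r}")

    wanted = []          # functions to keep, in discovery order
    pending = [symbol]   # names still to inspect
    while pending:
        name = pending.pop()
        if name in wanted or name not in functions:
            continue
        wanted.append(name)
        # Queue every bare-name call made inside this function.
        for node in ast.walk(functions[name]):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                pending.append(node.func.id)

    return "\n\n".join(
        ast.get_source_segment(source, functions[name]) for name in wanted
    )
```

Called on a module that defines process_payment, validate_amount, and an unrelated helper, the result contains only the first two, which is the effect the token counts above describe.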

Real-world numbers:

Before Code Scalpel: 15k tokens/call × 100 calls/day = 1.5M tokens/day
After Code Scalpel: 200 tokens/call × 100 calls/day = 20k tokens/day
Savings: 98.7% fewer input tokens; reported bill: $450/month → $22/month
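The percentages in both examples follow directly from the token counts. A small worked check, where token_savings is a hypothetical helper written only for this calculation:

```python
def token_savings(before: int, after: int):
    """Return (percent of tokens saved, reduction factor)."""
    percent_saved = (1 - after / before) * 100
    factor = before / after
    return round(percent_saved, 1), round(factor, 1)


# process_payment example: 8,300 tokens down to 200
print(token_savings(8_300, 200))   # (97.6, 41.5)

# daily-usage example: 15,000 tokens per call down to 200
print(token_savings(15_000, 200))  # (98.7, 75.0)
```

The reduction factor applies to input tokens only; output tokens are not affected by context extraction.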

Bonus: Smaller context = better focus. The model doesn't get distracted by irrelevant code.

Why this matters:

AI code tools are expensive. In the examples above, Code Scalpel sends roughly 40-75x fewer input tokens per call (8,300 → 200 and 15,000 → 200) by sending only what matters.

2. Governable AI: the invisible audit trail

Compliance isn't sexy, but it's required. Code Scalpel creates a .code-scalpel/audit.jsonl trail for every operation.

Provenance: We log the decision path (Graph Trace), not just the output.

{"timestamp": "2026-02-18T15:05:00Z", "tool": "extract_code", 
 "file": "auth.py", "symbol": "login", "graph_trace": [...], 
 "agent_id": "security_reviewer", "policy_checked": true}

Integrity: Our verify_policy_integrity tool cryptographically ensures your AI follows your security rules without drift.

Your policy: "Never modify functions tagged @security_critical"
Agent tries to edit decorated function
Code Scalpel: Policy violation → Edit blocked → Logged
Returns: cryptographic hash of all policy checks
Any deviation is caught immediately

verify_policy_integrity(policy_file=".code-scalpel/policy.yaml")
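The source does not describe how the audit log or verify_policy_integrity work internally, so the following is only a sketch of the general pattern: an append-only JSONL log whose records carry a SHA-256 fingerprint of the policy file. The helper names policy_digest, policy_unchanged, and append_audit_record, and the policy_sha256 field, are assumptions made for illustration.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone
from pathlib import Path


def policy_digest(policy_bytes: bytes) -> str:
    """Fingerprint the policy file contents with SHA-256."""
    return hashlib.sha256(policy_bytes).hexdigest()


def policy_unchanged(policy_bytes: bytes, pinned_digest: str) -> bool:
    """Compare the current policy against a previously pinned digest."""
    return hmac.compare_digest(policy_digest(policy_bytes), pinned_digest)


def append_audit_record(log_path, *, tool, file, symbol, agent_id, policy_bytes):
    """Append one JSON object per line; earlier lines are never rewritten."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tool": tool,
        "file": file,
        "symbol": symbol,
        "agent_id": agent_id,
        "policy_sha256": policy_digest(policy_bytes),  # hypothetical field
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

Storing the digest in each record means a reviewer can later tell which version of the policy was in force for any logged operation.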

Why this matters: SOC2, ISO 27001, and HIPAA compliance all require audit trails. Code Scalpel gives you provenance for every AI decision.

3. Accurate AI: the end of hallucination

When Code Scalpel says "This function has 3 callers," it's a Graph Fact, not an LLM guess.

Example: Agent needs to rename a function

Without Code Scalpel (LLM hallucination):

# Agent thinks: "authenticate_user is called in 2 places"
# Reality: Called in 5 places, 3 are in imported modules
# Result: Rename breaks 3 call sites

With Code Scalpel (Graph Fact):

# Tool: get_symbol_references
result = get_symbol_references(file="auth.py", symbol="authenticate_user")

# Returns:
{
  "definition": "auth.py:45",
  "references": [
    "auth.py:102",
    "middleware.py:23", 
    "api.py:156",
    "tests/test_auth.py:34",
    "admin.py:89"
  ],
  "call_count": 5  # Graph Fact, not LLM guess
}
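For intuition about why a parsed view beats a guess, here is a minimal sketch that finds call sites across several files with Python's ast module. It matches calls by name only, so it is cruder than a real reference graph that resolves imports and scopes; find_call_sites is a hypothetical helper, not the get_symbol_references implementation.

```python
import ast


def find_call_sites(sources: dict, symbol: str) -> list:
    """Return 'file:line' for every call to `symbol` in the given sources.

    `sources` maps a filename to its source text. Both plain calls
    (symbol(...)) and attribute calls (module.symbol(...)) are matched.
    """
    sites = []
    for filename, source in sources.items():
        for node in ast.walk(ast.parse(source, filename=filename)):
            if not isinstance(node, ast.Call):
                continue
            func = node.func
            if isinstance(func, ast.Name):
                name = func.id
            elif isinstance(func, ast.Attribute):
                name = func.attr
            else:
                name = None
            if name == symbol:
                sites.append((filename, node.lineno))
    # Sort by file, then numerically by line.
    return [f"{filename}:{lineno}" for filename, lineno in sorted(sites)]
```

Because the result comes from the syntax tree, a call in another module is found even when it never appears in the text the model was shown.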

Symbolic Execution: We use the Z3 solver to mathematically explore edge cases that humans (and LLMs) miss.

# Agent task: "Is this path traversal vulnerable?"
def read_file(user_filename):
    if not user_filename.startswith('/tmp/'):
        return "Invalid path"
    path = f"/var/uploads/{user_filename}"
    return open(path).read()

# LLM might say: "Looks safe, checks for /tmp/"
# Z3 finds: user_filename="/tmp/../../../etc/passwd" passes the prefix check,
#   and /var/uploads//tmp/../../../etc/passwd normalizes to /etc/passwd
# Result: Vulnerability confirmed with mathematical proof
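The flaw in read_file can be checked by hand without a solver. The sketch below does not use Z3; it takes one concrete input that passes the prefix check and normalizes the resulting path lexically with posixpath. The function names and the containment check are illustrative assumptions, and normpath does not follow symlinks.

```python
import posixpath

UPLOAD_ROOT = "/var/uploads"


def passes_original_check(user_filename: str) -> bool:
    """The guard from read_file: only the prefix is inspected."""
    return user_filename.startswith("/tmp/")


def resolved_path(user_filename: str) -> str:
    """Build the path the same way read_file does, then normalize it."""
    return posixpath.normpath(f"{UPLOAD_ROOT}/{user_filename}")


def is_contained(user_filename: str) -> bool:
    """A stricter guard: the normalized path must stay under UPLOAD_ROOT."""
    resolved = resolved_path(user_filename)
    return resolved.startswith(UPLOAD_ROOT + "/")


payload = "/tmp/../../../etc/passwd"
print(passes_original_check(payload))  # True
print(resolved_path(payload))          # /etc/passwd
print(is_contained(payload))           # False
```

A symbolic executor automates the search for such an input; the containment check is the kind of guard that removes the counterexample.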

Why this matters: Accuracy builds trust. Graph facts eliminate the "AI said it, but was it right?" problem.

4. Safer AI: the syntax-aware gatekeeper

We verified this in our recent Community Tier Report: Code Scalpel parses every AI edit before writing to disk.

Scenario: Agent hallucinates a missing parenthesis.

Standard Tool:

# Agent generates broken code
def login(username, password:  # Missing )
    validate_credentials(username, password)

# Tool writes to disk → Build breaks → CI fails → Dev debugs

Code Scalpel:

# Agent generates same broken code
# Code Scalpel AST Parser validates BEFORE write:

ast_result = parse(agent_code, language="python")
if ast_result.errors:
    # Edit rejected: nothing is written to disk
    # Error logged to audit trail
    # Agent receives: "Syntax error: line 1, missing ')'"
    # Agent tries again
    pass  # placeholder so this snippet itself parses

# Only syntactically valid code reaches disk
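The reject-and-retry loop can be sketched for Python edits with the standard library alone. This uses ast.parse rather than Code Scalpel's parser, and generate stands in for whatever produces the agent's code; both helper names are hypothetical.

```python
import ast


def syntax_feedback(code: str):
    """Return None if the code parses, else a short message for the agent."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"Syntax error: line {exc.lineno}, {exc.msg}"
    return None


def generate_until_valid(generate, max_attempts: int = 3) -> str:
    """Call generate(feedback) until it returns code that parses.

    `generate` receives the previous error message (None on the first
    attempt) and returns a new candidate as a string.
    """
    feedback = None
    for _ in range(max_attempts):
        code = generate(feedback)
        feedback = syntax_feedback(code)
        if feedback is None:
            return code
    raise ValueError(f"no valid code after {max_attempts} attempts: {feedback}")
```

The exact wording of the error message depends on the Python version, so the agent should treat it as a hint rather than a stable format.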

Real-world impact:

0 broken builds from syntax errors
0 manual debugging of agent hallucinations
Faster iteration (agent gets immediate feedback)

Why this matters: AI agents make mistakes. Code Scalpel catches them before they break your codebase.

How it works: AST + PDG + Graph facts

Code Scalpel doesn't use regex or text patterns. Everything is based on Abstract Syntax Trees and Program Dependence Graphs.

1. Parse code into AST

# Tool: analyze_code
ast = analyze_code(file="auth.py", language="python")

# Returns structural representation:
{
  "functions": ["login", "logout", "authenticate_user"],
  "classes": ["AuthManager", "User"],
  "imports": ["hashlib", "jwt", "datetime"],
  "control_flow": {...}
}

Tree-sitter parses Python, JavaScript, TypeScript, and Java with 100% accuracy.
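A rough idea of what such a structural summary involves, sketched with Python's built-in ast module instead of Tree-sitter and limited to top-level definitions. summarize_module is a hypothetical name, and the output omits the control_flow field.

```python
import ast


def summarize_module(source: str) -> dict:
    """List top-level functions, classes, and imported modules."""
    summary = {"functions": [], "classes": [], "imports": []}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            summary["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            summary["classes"].append(node.name)
        elif isinstance(node, ast.Import):
            summary["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports without a module name are shown as dots.
            summary["imports"].append(node.module or "." * node.level)
    return summary
```

Names appear in source order, so the summary is stable across runs and cheap to diff.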

2. Build Program Dependence Graph (PDG)

The PDG tracks:

Control dependencies: What affects what executes
Data dependencies: How data flows through variables

# Example:
if user.is_admin:           # Control dependency
    data = load_config()    # Data dependency  
    process(data)           # Both dependencies

# PDG knows:
# - process() depends on load_config() for data
# - Both depend on user.is_admin check for execution
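The two edge types can be derived mechanically for simple code. The sketch below is a deliberately small approximation of a PDG, assuming straight-line assignments and if statements only; it does not merge definitions across branches, and dependence_edges is a hypothetical helper.

```python
import ast


def dependence_edges(source: str):
    """Return (data_edges, control_edges) as lists of (from_line, to_line)."""
    last_def = {}   # variable name -> line of its most recent assignment
    data, control = [], []

    def visit(statements, guard_line):
        for stmt in statements:
            if guard_line is not None:
                control.append((guard_line, stmt.lineno))

            # For an `if`, only the condition is read at this line.
            reads_root = stmt.test if isinstance(stmt, ast.If) else stmt
            for node in ast.walk(reads_root):
                if (
                    isinstance(node, ast.Name)
                    and isinstance(node.ctx, ast.Load)
                    and node.id in last_def
                ):
                    data.append((last_def[node.id], stmt.lineno))

            if isinstance(stmt, ast.Assign):
                for target in stmt.targets:
                    if isinstance(target, ast.Name):
                        last_def[target.id] = stmt.lineno

            if isinstance(stmt, ast.If):
                visit(stmt.body, stmt.lineno)
                visit(stmt.orelse, stmt.lineno)

    visit(ast.parse(source).body, None)
    return data, control


example = (
    "if user.is_admin:\n"
    "    data = load_config()\n"
    "    process(data)\n"
)
print(dependence_edges(example))  # ([(2, 3)], [(1, 2), (1, 3)])
```

The data edge (2, 3) says process(data) needs the value assigned on line 2; the control edges say both lines run only when the check on line 1 passes.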

3. Extract graph facts

When an agent asks "Where is this used?", Code Scalpel walks the graph:

# Tool: get_symbol_references
refs = get_symbol_references(file="config.py", symbol="api_key")

# Graph walk finds:
# - Definition: config.py:12
# - Assignment: config.py:45
# - Read: auth.py:23, api.py:67
# - Cross-file: services/payment.py:102

# Result: Graph Fact (5 references), not LLM guess
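The split between assignments and reads comes straight from the syntax tree: each name node records whether it is being stored to or loaded from. A single-file sketch, where classify_name_uses is a hypothetical helper and cross-file resolution is out of scope:

```python
import ast


def classify_name_uses(source: str, name: str) -> dict:
    """Return the lines where `name` is written and where it is read."""
    uses = {"write": [], "read": []}
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Name) and node.id == name):
            continue
        if isinstance(node.ctx, ast.Store):
            uses["write"].append(node.lineno)
        elif isinstance(node.ctx, ast.Load):
            uses["read"].append(node.lineno)
    uses["write"].sort()
    uses["read"].sort()
    return uses


example = (
    "api_key = 'x'\n"
    "print(api_key)\n"
    "api_key = api_key.strip()\n"
)
print(classify_name_uses(example, "api_key"))
# {'write': [1, 3], 'read': [2, 3]}
```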

4. Validate before write

Every edit goes through AST validation:

import hashlib

def apply_edit(path, agent_response):
    # Agent generates edit
    new_code = agent_response.code

    # Parse BEFORE writing
    ast_result = parse(new_code, language="python")

    if ast_result.errors:
        log_to_audit(
            action="edit_rejected",
            reason="syntax_error",
            details=ast_result.errors,
        )
        return {"success": False, "error": ast_result.errors}

    # Only valid syntax reaches disk
    write_file(path, new_code)
    digest = hashlib.sha256(new_code.encode("utf-8")).hexdigest()
    log_to_audit(action="edit_applied", hash=digest)
    return {"success": True}
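One detail worth spelling out is how to keep a failed or interrupted write from leaving a half-written file. A minimal standard-library sketch for Python sources, assuming validation with ast.parse and an atomic rename; write_if_valid is a hypothetical helper, not Code Scalpel's update_symbol.

```python
import ast
import hashlib
import os
import tempfile


def write_if_valid(path: str, new_code: str) -> dict:
    """Write new_code to path only if it parses; replace the file atomically."""
    try:
        ast.parse(new_code)
    except SyntaxError as exc:
        # Rejected: the file on disk is left untouched.
        return {"success": False, "error": f"line {exc.lineno}: {exc.msg}"}

    # Write to a temporary file in the same directory, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(new_code)
    os.replace(tmp_path, path)

    digest = hashlib.sha256(new_code.encode("utf-8")).hexdigest()
    return {"success": True, "sha256": digest}
```

The returned digest is what an audit record would store, so the logged edit can later be matched against the file contents.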

MCP integration: 23 tools for AI agents

Code Scalpel is an MCP (Model Context Protocol) server. AI agents get 23 specialized tools.

Setting up Code Scalpel as an MCP server

Code Scalpel runs as an MCP server that provides tools to your AI coding environment. It doesn't execute as part of your agent's code - instead, your agent calls Code Scalpel's 23 tools when it needs precise code operations.

Claude Desktop (most popular):

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "codescalpel": {
      "command": "uvx",
      "args": ["codescalpel", "mcp"]
    }
  }
}

Restart Claude Desktop. Now Claude can call Code Scalpel tools like extract_code, rename_symbol, etc.
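If the config file already lists other MCP servers, pasting the snippet over it would drop them. A small sketch of merging the entry instead, assuming the file uses the top-level mcpServers key shown above; add_mcp_server is a hypothetical helper, not something Code Scalpel ships.

```python
import json


def add_mcp_server(config_text: str, name: str, command: str, args: list) -> str:
    """Return config JSON with one server entry added or updated.

    Existing servers and unrelated top-level keys are preserved.
    An empty file is treated as an empty config.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": list(args)}
    return json.dumps(config, indent=2)


existing = '{"mcpServers": {"other": {"command": "npx", "args": []}}}'
print(add_mcp_server(existing, "codescalpel", "uvx", ["codescalpel", "mcp"]))
```

Write the returned text back to the config file, then restart the client so it reloads the server list.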

VS Code (with Continue or other MCP extensions):

Install an MCP-compatible extension, then add to your MCP config:

{
  "mcpServers": {
    "codescalpel": {
      "command": "uvx",
      "args": ["codescalpel", "mcp"]
    }
  }
}

Cursor IDE:

Cursor supports MCP servers. Add to .cursor/mcp.json in your project, or ~/.cursor/mcp.json for all projects:

{
  "mcpServers": {
    "codescalpel": {
      "command": "uvx",
      "args": ["codescalpel", "mcp"]
    }
  }
}

Windsurf IDE:

Similar to Cursor - configure MCP server in settings.

Advanced: Agent frameworks (AutoGen, CrewAI, etc.)

If you're building custom agents with frameworks, Code Scalpel works as an MCP tool provider:

# Example: AutoGen (if MCP support available)
from autogen import AssistantAgent

code_agent = AssistantAgent(
    name="CodeEditor",
    system_message="You use Code Scalpel tools for precise code operations.",
    # Framework-specific MCP configuration
)

Note: Framework MCP support varies. Code Scalpel is primarily designed for IDE/desktop AI assistants (Claude, VS Code, Cursor) where MCP is a first-class citizen.

The 23 tools

Surgical code operations:

analyze_code - Parse AST structure (graph facts)
extract_code - Extract function/class with dependencies (context reduction)
update_symbol - Safe in-place edits (syntax validated)
rename_symbol - Rename across files (graph-based accuracy)

Graph facts (accuracy):

get_file_context - File overview without full read
get_symbol_references - All usages (not LLM guess)
get_call_graph - Function call relationships
get_cross_file_dependencies - Import chains
get_graph_neighborhood - k-hop subgraph around a node
get_project_map - High-level project structure map
crawl_project - Full project structure analysis

Security scanning (bonus feature):

security_scan - Taint-based vulnerability detection
cross_file_security_scan - Multi-file taint flow tracking
unified_sink_detect - Find dangerous operations across languages
scan_dependencies - CVE scanning via OSV database
type_evaporation_scan - TypeScript type loss at API boundaries

Advanced analysis:

symbolic_execute - Z3-based mathematical proof of edge cases
generate_unit_tests - Test generation from symbolic paths
simulate_refactor - Behavior preservation verification

Policy & governance:

code_policy_check - Compliance checking
verify_policy_integrity - Cryptographic policy verification
validate_paths - Docker-aware path validation
get_capabilities - Tier feature introspection

Real agent workflow

# Agent task: "Refactor process_payment to use async/await"

# 1. Extract with dependencies (Cheaper AI: ~94% less context in this case)
code = extract_code(
    file="payments.py", 
    symbol="process_payment",
    include_dependencies=True
)
# Returns: 200 tokens instead of 3,500

# 2. Get accurate caller info (Accurate AI: Graph Fact)
callers = get_call_graph(file="payments.py", function="process_payment")
# Returns: Exact list, not LLM guess

# 3. Agent generates async version

# 4. Validate syntax BEFORE write (Safer AI: Syntax-aware gatekeeper)
validation = parse(async_version, language="python")
if validation.errors:
    raise SyntaxError("Syntax error: fix and try again")

# 5. Apply edit with audit trail (Governable AI: Provenance)
update_symbol(
    file="payments.py",
    symbol="process_payment",
    new_code=async_version,
    policy_check=True  # Ensures compliance rules met
)
# Logged to .code-scalpel/audit.jsonl

# 6. Update call sites (Accurate AI: Graph-based)
for caller in callers:
    update_symbol(file=caller.file, symbol=caller.name, add_await=True)

All four pillars in one workflow.

Security scanning: a useful bonus

Because Code Scalpel tracks data flow through the PDG, it can also detect security vulnerabilities.

16+ vulnerability types:

SQL/NoSQL/LDAP injection
XSS, command injection, code injection
Path traversal, SSRF, open redirect
Hardcoded secrets, credential leaks
CSRF, auth bypasses
SSTI, prototype pollution
Weak crypto, insecure deserialization

How it works: Taint analysis tracks untrusted data (user input, files, network) through the PDG to dangerous sinks (database queries, system calls, file writes).

# Tool: security_scan
vulns = security_scan(file="api.py")

# Returns:
[
  {
    "type": "SQL_INJECTION",
    "severity": "HIGH",
    "location": "api.py:45",
    "flow": "request.args['id'] → query → execute_sql()",
    "proof": "Z3 confirms exploitable path exists"
  }
]
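The source-to-sink idea can be shown in a few lines. This sketch tracks taint through top-level assignments in a single file using Python's ast module; it has no PDG, no cross-function flow, and no Z3 step. The SOURCES and SINKS sets and the function names are assumptions chosen for the example.

```python
import ast

SOURCES = {"input"}        # assumed: calls that return untrusted data
SINKS = {"execute_sql"}    # assumed: calls that must not receive it


def _is_tainted(expr, tainted) -> bool:
    """True if the expression reads a tainted name or calls a source."""
    for node in ast.walk(expr):
        if isinstance(node, ast.Name) and node.id in tainted:
            return True
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in SOURCES
        ):
            return True
    return False


def find_tainted_sinks(source: str) -> list:
    """Return (sink_name, line) for each sink call fed by tainted data."""
    tainted = set()
    findings = []
    for stmt in ast.parse(source).body:
        # Check sink calls in this statement against the current taint set.
        for node in ast.walk(stmt):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in SINKS
                and any(_is_tainted(arg, tainted) for arg in node.args)
            ):
                findings.append((node.func.id, node.lineno))
        # Propagate or clear taint on assignment.
        if isinstance(stmt, ast.Assign):
            value_tainted = _is_tainted(stmt.value, tainted)
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    if value_tainted:
                        tainted.add(target.id)
                    else:
                        tainted.discard(target.id)
    return findings
```

Given a script that builds a query from input() and passes it to execute_sql, the function reports that call and ignores calls whose arguments are constants.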

<10% false positive rate. Uses symbolic execution (Z3) to prove vulnerabilities exist, not just pattern matching.

This wasn't the original goal — the Four Pillars are the point. But it's a useful bonus for CI/CD security checks.

Getting started

Installation

UVX (recommended):

uvx codescalpel mcp

Or pip:

pip install codescalpel

Basic usage

Extract function (context reduction):

codescalpel extract payments.py process_payment

Get references (graph fact):

codescalpel references auth.py authenticate_user

Validate code (syntax check):

codescalpel validate edited_file.py

Run as MCP server:

uvx codescalpel mcp
# Now available to AI agents

Testing & quality

Precision tools need precision testing.

7,297 test cases across 4 languages

94.86% coverage (96.28% statement, 90.95% branch)

Validation:

Syntax validation: 100% accuracy (rejects every edit that fails to parse)
Graph facts: 99.8% accuracy (symbol references, call graphs)
Context reduction: Average 97% token savings
Security scanning: <10% false positives

Use cases

1. Claude Desktop users

Give Claude precise code tools instead of "regenerate this file." Extract functions surgically, rename symbols safely, get accurate reference counts. Result: Better edits, fewer errors, up to 99% less context.

2. VS Code/Cursor/Windsurf users

AI coding assistants (Copilot, Continue, Cursor's AI) gain 23 specialized tools for exact operations. Result: IDE-quality precision in AI-assisted coding.

3. Enterprise compliance teams

SOC2/ISO 27001 require audit trails for AI decisions. Code Scalpel's .code-scalpel/audit.jsonl logs every operation with provenance. Result: Compliance-ready AI coding tools.

4. Cost-conscious developers

Surgical extraction (200 tokens vs 15,000 tokens) sends 75x fewer input tokens per call. Result: $450/month → $22/month for production AI coding.

5. Teams shipping AI-assisted code

Syntax validation before write = zero broken builds from AI hallucinations. Result: Ship faster, debug less.

6. Security-conscious teams (bonus)

Data flow analysis detects 16+ vulnerability types with <10% false positives. Result: Security scanning as side benefit of precise code tools.

Roadmap

v1.4.0 (in progress):

Enhanced TypeScript/React support
Improved policy enforcement
Better audit trail visualization

Planned:

Go/Rust/C++ language support
VS Code extension
GitHub App (automated PR reviews)
Real-time policy enforcement dashboard

The bottom line

AI coding assistants (Claude Desktop, VS Code, Cursor, Windsurf) need four things to work in production:

Governable - Audit trails and policy enforcement
Accurate - Graph facts, not LLM guesses
Safe - Syntax validation before write
Cheap - Up to 99% context reduction

Code Scalpel delivers all four as an MCP server with 23 specialized tools.

Get started:

uvx codescalpel mcp

Then add to your Claude Desktop, VS Code, or Cursor MCP config (see setup instructions above).

Questions? Using Claude/Cursor for coding?

Open an issue or reach out. I'd especially love to hear from teams using AI coding assistants in production.

Repository: https://github.com/3D-Tech-Solutions/code-scalpel

License: MIT

Testing: 7,297 tests, 94.86% coverage

About the author: Building MCP tools for AI coding assistants. If you're working with Claude Desktop, VS Code AI extensions, or Cursor, let's connect.

Top comments (1)

Allison Escolopio • Feb 19

Tools like this are exactly what’s been missing from AI coding workflows. Reducing context and preventing broken builds at the tool level is a big win for both cost control and developer sanity. Great work!