When we write code, we often think it's correct because it runs without an error. But what about the errors that are hiding, waiting to cause trouble later? What about the confusing parts that will make another programmer—or even you in six months—scratch their head? This is where static analysis comes in. Think of it like having a meticulous assistant read through your code, line by line, without ever hitting the 'run' button. It looks for potential bugs, security holes, confusing style, and parts that might be hard to change later.
Python is a wonderfully flexible language, and that same flexibility can sometimes lead to messy code. The good news is Python also gives us powerful tools to clean it up. Over the years, I've built and used many of these tools to keep large codebases manageable. Let me walk you through some of the most effective techniques.
The foundation of almost all Python static analysis is the Abstract Syntax Tree, or AST. When you write print("Hello"), Python doesn't just see text. It sees a structure: a Call node, where the function is print and the argument is the string "Hello". The ast module lets us see this structure directly. We can write programs that understand other programs.
This is how tools like linters work. They don't execute your code; they examine its shape. Let's build a simple analyzer to see how it feels. We'll look for a few common issues: functions that are too long, loops nested too deeply, and mysterious "magic numbers" scattered in the logic.
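You can inspect this structure yourself with nothing but the standard library; `ast.dump` pretty-prints the tree for any parsed statement:

```python
import ast

# Parse a single statement and print its tree structure
tree = ast.parse('print("Hello")')
print(ast.dump(tree.body[0], indent=2))  # indent= requires Python 3.9+
```

The output is an `Expr` wrapping a `Call` whose `func` is the `Name` print and whose only argument is the `Constant` 'Hello': exactly the shape described above.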
Let's build one step by step and explain what's happening along the way. Imagine we have a file with a very tangled function.
We write our analyzer to open the file and parse it into a tree. We then walk through every node in that tree. When we find a function definition node, we count its lines. If it's over 50, we record an issue. When we find a loop, we check how many loops are inside it. Deep nesting is a classic sign that code is becoming complex and hard to follow.
The magic number check is interesting. A bare 42 buried in the middle of a calculation is a perfect example. Is it a scaling factor? A timeout value? Without a name, its purpose is lost. Our analyzer tries to spot these bare numbers, though we must be careful. The number 0 in result = 0 is usually fine. We can also skip numbers assigned to variables with uppercase names (like MAX_RETRIES = 5), treating those as intentional constants.
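To make this concrete, here is a minimal sketch of such an analyzer. The thresholds (50 lines per function, two levels of loop nesting) and the set of "non-magic" numbers are illustrative choices, not fixed rules:

```python
import ast

class SimpleAnalyzer(ast.NodeVisitor):
    """Flags long functions, deeply nested loops, and magic numbers."""
    def __init__(self, max_func_lines=50, max_loop_depth=2):
        self.issues = []
        self.max_func_lines = max_func_lines
        self.max_loop_depth = max_loop_depth
        self._loop_depth = 0

    def visit_FunctionDef(self, node):
        length = (node.end_lineno or node.lineno) - node.lineno + 1
        if length > self.max_func_lines:
            self.issues.append(f"Line {node.lineno}: function '{node.name}' is {length} lines long.")
        self.generic_visit(node)

    def _visit_loop(self, node):
        self._loop_depth += 1
        if self._loop_depth > self.max_loop_depth:
            self.issues.append(f"Line {node.lineno}: loops nested {self._loop_depth} deep.")
        self.generic_visit(node)
        self._loop_depth -= 1

    visit_For = visit_While = _visit_loop

    def visit_Assign(self, node):
        # Skip values assigned to UPPERCASE names: intentional constants
        if any(isinstance(t, ast.Name) and t.id.isupper() for t in node.targets):
            return
        self.generic_visit(node)

    def visit_Constant(self, node):
        # 0, 1, and -1 are usually harmless; other bare numbers get flagged
        if isinstance(node.value, (int, float)) and node.value not in (0, 1, -1):
            self.issues.append(f"Line {node.lineno}: magic number {node.value}.")

sample = """
MAX_RETRIES = 5
def scale(values):
    result = 0
    for v in values:
        for i in range(3):
            result += v * 42
    return result
"""
analyzer = SimpleAnalyzer()
analyzer.visit(ast.parse(sample))
for issue in analyzer.issues:
    print(issue)
```

On the sample above, the analyzer reports the 3 and the 42 as magic numbers while letting MAX_RETRIES = 5 and result = 0 pass.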
Running this on real code gives you a list of concerns. It's not that the code is broken, but it might be brittle. This kind of automated review is invaluable on a team. It catches the small things humans gloss over during a code review.
While the AST shows us the structure, sometimes we need to understand the flow. This is where Control Flow Graphs (CFG) come in. A CFG shows how execution can move through your code—which paths are possible. It maps out every if-statement, every loop, and every function call in terms of possible jumps.
Why do we care? Complexity. A function with a dozen if and for statements has many possible paths. This makes it hard to test and easy to miss a bug. We can calculate this formally as cyclomatic complexity. A simple visitor that counts decision nodes gets us close; a control flow graph makes the picture precise.
Let's build a small CFG generator for a simple function. It's a more advanced technique, but it reveals so much about code quality.
```python
import ast
from typing import Dict, List

class BasicCFGNode:
    """Represents a basic block in the control flow graph."""
    def __init__(self, id: int):
        self.id = id
        self.instructions: List[ast.AST] = []
        self.successors: List['BasicCFGNode'] = []
        self.predecessors: List['BasicCFGNode'] = []

    def __repr__(self):
        return f"Block({self.id}) -> {[s.id for s in self.successors]}"

class SimpleCFGBuilder(ast.NodeVisitor):
    """Builds a basic CFG from a function's AST (if/return only)."""
    def visit_FunctionDef(self, node: ast.FunctionDef) -> Dict[int, BasicCFGNode]:
        self.node_counter = 0
        self.graph: Dict[int, BasicCFGNode] = {}
        self.return_blocks: List[BasicCFGNode] = []
        self.current_node = self._new_node()
        self._visit_sequence(node.body)
        # Link the final block and every return block to an implicit exit
        exit_node = self._new_node()
        self._link(self.current_node, exit_node)
        for block in self.return_blocks:
            block.successors.append(exit_node)
            exit_node.predecessors.append(block)
        return self.graph

    def _new_node(self) -> BasicCFGNode:
        node = BasicCFGNode(self.node_counter)
        self.node_counter += 1
        self.graph[node.id] = node
        return node

    @staticmethod
    def _link(src: BasicCFGNode, dst: BasicCFGNode):
        # A block that ends in a return has no fall-through edge
        if src.instructions and isinstance(src.instructions[-1], ast.Return):
            return
        src.successors.append(dst)
        dst.predecessors.append(src)

    def _visit_sequence(self, stmts: List[ast.stmt]):
        """Process a sequence of statements, splitting at control flow."""
        for stmt in stmts:
            if isinstance(stmt, ast.If):
                self.visit(stmt)  # leaves current_node at the merge block
            elif isinstance(stmt, ast.Return):
                self.visit(stmt)
                return  # nothing after a return is reachable
            else:
                self.current_node.instructions.append(stmt)

    def visit_If(self, node: ast.If):
        # The current block ends with the condition check
        condition_node = self.current_node
        condition_node.instructions.append(node.test)
        # Process the 'then' branch
        then_node = self._new_node()
        self._link(condition_node, then_node)
        self.current_node = then_node
        self._visit_sequence(node.body)
        then_exit = self.current_node
        # Process the 'else' branch if it exists
        else_exit = None
        if node.orelse:
            else_node = self._new_node()
            self._link(condition_node, else_node)
            self.current_node = else_node
            self._visit_sequence(node.orelse)
            else_exit = self.current_node
        # Merge point after the if/else
        merge_node = self._new_node()
        self._link(then_exit, merge_node)
        if else_exit is not None:
            self._link(else_exit, merge_node)
        else:
            # With no else, the condition falls through to the merge block
            self._link(condition_node, merge_node)
        self.current_node = merge_node

    def visit_Return(self, node: ast.Return):
        self.current_node.instructions.append(node)
        self.return_blocks.append(self.current_node)

# Example function to analyze
source = """
def check_value(x, threshold):
    y = x * 2
    if y > threshold:
        print("High")
        return True
    else:
        print("Low")
    z = y + 10
    return False
"""

tree = ast.parse(source)
func = tree.body[0]
builder = SimpleCFGBuilder()
cfg = builder.visit(func)
print("Control Flow Graph Nodes:")
for node_id, node in sorted(cfg.items()):
    print(f"  {node}")
    for instr in node.instructions:
        print(f"    - {ast.dump(instr)[:60]}")
```
This is a simplified view, but it shows the concept. The function splits at the if, has two branches, and then comes back together. By building this graph, we can automatically calculate the cyclomatic complexity: it's roughly the number of decision points plus one. A high number is a flag that the function is trying to do too much.
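Once we're counting decision points anyway, we can compute cyclomatic complexity directly from the AST without building the full graph. Which node types count as decisions is a judgment call; the set below is one common choice, with each `and`/`or` chain contributing its extra branches:

```python
import ast

class ComplexityCounter(ast.NodeVisitor):
    """Approximates cyclomatic complexity as decision points + 1."""
    DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

    def complexity(self, func: ast.FunctionDef) -> int:
        count = 1  # one path through a straight-line function
        for node in ast.walk(func):
            if isinstance(node, self.DECISION_NODES):
                count += 1
            elif isinstance(node, ast.BoolOp):
                # 'a and b and c' short-circuits: len(values) - 1 extra branches
                count += len(node.values) - 1
        return count

code = """
def triage(x):
    if x < 0 and x != -99:
        return "negative"
    for i in range(x):
        if i % 2 == 0:
            print(i)
    return "done"
"""
func = ast.parse(code).body[0]
print(ComplexityCounter().complexity(func))  # prints 5
```

A common rule of thumb is to flag anything above 10 for review, though the right threshold depends on your codebase.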
Moving beyond structure and flow, we come to the meaning of the data itself: types. Python's type hints are a fantastic step forward. They tell us what kind of value a variable is supposed to hold. But the built-in typing module is just the start. We can build much smarter checks.
Sometimes, a "string" isn't just any string. It must be a valid email format, or a non-empty username, or a country code from a specific list. Basic type checking says "this is a str". We want to say "this is a ValidEmail".
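The standard typing module gives us a lightweight way to express this with NewType: create a distinct name for the validated value and route all construction through one checked function. The names below (`ValidEmail`, `parse_email`) are illustrative, not from any library:

```python
from typing import NewType

ValidEmail = NewType('ValidEmail', str)

def parse_email(raw: str) -> ValidEmail:
    # The one place a plain str is allowed to become a ValidEmail
    if '@' not in raw:
        raise ValueError(f"not an email: {raw}")
    return ValidEmail(raw)

def send_welcome(address: ValidEmail) -> None:
    print(f"Sending welcome mail to {address}")

send_welcome(parse_email("user@example.com"))
# send_welcome("just a string")  # a checker such as mypy rejects this line
```

At runtime ValidEmail is a no-op wrapper; the guarantee comes from static checkers refusing to pass a bare str where a ValidEmail is expected.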
We can build a custom type checker that understands our project's rules. It uses the same AST parsing, but focuses on the annotations and the values being assigned.
```python
import ast
import re

class DomainTypeChecker(ast.NodeVisitor):
    """Checks for domain-specific type violations."""
    def __init__(self):
        self.issues = []

    def visit_Assign(self, node: ast.Assign):
        for target in node.targets:
            if isinstance(target, ast.Name):
                self._check_domain_rules(target.id, node.value, node.lineno)
        self.generic_visit(node)

    def visit_AnnAssign(self, node: ast.AnnAssign):
        # Check annotated assignments (e.g., `count: int = 0`)
        if isinstance(node.target, ast.Name) and node.value is not None:
            self._check_domain_rules(node.target.id, node.value, node.lineno)
        if isinstance(node.annotation, ast.Name) and node.value is not None:
            hinted_type = node.annotation.id
            expected = {'int': int, 'str': str}.get(hinted_type)
            if expected and isinstance(node.value, ast.Constant):
                if not isinstance(node.value.value, expected):
                    actual = type(node.value.value).__name__
                    self.issues.append(
                        f"Line {node.lineno}: Annotation says '{hinted_type}', but assigned {actual}.")
        self.generic_visit(node)

    def _check_domain_rules(self, var_name: str, value: ast.AST, lineno: int):
        """Heuristic: if the variable name suggests a domain type, check the value."""
        if not (isinstance(value, ast.Constant) and isinstance(value.value, str)):
            return
        if var_name.endswith('_email') and not self._looks_like_email(value.value):
            self.issues.append(
                f"Line {lineno}: Value '{value.value}' assigned to '{var_name}' does not look like a valid email.")
        if var_name.endswith('_id') and not value.value.isalnum():
            self.issues.append(f"Line {lineno}: ID '{value.value}' should be alphanumeric.")

    def _looks_like_email(self, s: str) -> bool:
        # A very basic regex for illustration
        pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        return re.match(pattern, s) is not None

# Test with some problematic code
test_code = """
user_email: str = "not-an-email"
account_id = "invalid id!"
valid_id = "ABC123"
count: int = "five"
"""

tree = ast.parse(test_code)
checker = DomainTypeChecker()
checker.visit(tree)
print("Domain Type Check Results:")
for issue in checker.issues:
    print(f"  - {issue}")
```
This checker is primitive, but it shows the idea. We're adding a layer of project-specific logic on top of Python's type system. In a real tool, you would hook this into your CI/CD pipeline. Every pull request would be automatically scanned not just for syntax errors, but for business logic violations.
Another powerful technique is pattern matching for security flaws. Many security vulnerabilities follow common patterns: using eval() on user input, calling os.system() with a string built from variables, or reading files with paths that aren't sanitized.
We can build a security scanner that looks for these dangerous patterns in the AST. It's like a search for specific, dangerous code shapes.
```python
import ast

class SecurityPatternScanner(ast.NodeVisitor):
    """Scans AST for potentially dangerous patterns."""
    def __init__(self):
        self.vulnerabilities = []
        self.imported_modules = set()

    def visit_Import(self, node: ast.Import):
        for alias in node.names:
            self.imported_modules.add(alias.name)
        self.generic_visit(node)

    def visit_ImportFrom(self, node: ast.ImportFrom):
        if node.module:
            self.imported_modules.add(node.module)
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call):
        # Check for eval()
        if isinstance(node.func, ast.Name) and node.func.id == 'eval':
            self.vulnerabilities.append({
                'line': node.lineno,
                'type': 'DANGEROUS_EVAL',
                'message': 'Use of eval() is highly dangerous if any input is not trusted.'
            })
        # Check for module.function patterns
        if isinstance(node.func, ast.Attribute) and isinstance(node.func.value, ast.Name):
            module_name = node.func.value.id
            func_name = node.func.attr
            if module_name == 'os' and func_name == 'system':
                # os.system always runs its argument through the shell
                self.vulnerabilities.append({
                    'line': node.lineno,
                    'type': 'SHELL_INJECTION',
                    'message': 'os.system runs its argument through the shell and can lead to command injection.'
                })
            elif module_name == 'subprocess' and func_name in ('call', 'run', 'Popen'):
                # Flag subprocess calls only when shell=True (simplified check)
                for keyword in node.keywords:
                    if (keyword.arg == 'shell'
                            and isinstance(keyword.value, ast.Constant)
                            and keyword.value.value is True):
                        self.vulnerabilities.append({
                            'line': node.lineno,
                            'type': 'SHELL_INJECTION',
                            'message': f'Using subprocess.{func_name} with shell=True can lead to command injection.'
                        })
            # Check for pickle.loads with untrusted data
            if module_name == 'pickle' and func_name == 'loads':
                self.vulnerabilities.append({
                    'line': node.lineno,
                    'type': 'INSECURE_DESERIALIZATION',
                    'message': 'pickle.loads can execute arbitrary code. Do not use with untrusted data.'
                })
        self.generic_visit(node)

# Example code with security issues
insecure_code = """
import os
import pickle
import subprocess

def process_user_input(data):
    # Dangerous eval
    result = eval(data.get('expression', '0'))
    # Possible shell injection
    filename = data.get('file')
    os.system(f"ls -la {filename}")
    # Insecure deserialization
    config = pickle.loads(data['config_blob'])
    # Another shell injection
    cmd = f"echo {data['name']}"
    subprocess.run(cmd, shell=True)
    return result
"""

tree = ast.parse(insecure_code)
scanner = SecurityPatternScanner()
scanner.visit(tree)
print("\nSecurity Scan Results:")
for vuln in scanner.vulnerabilities:
    print(f"  Line {vuln['line']} [{vuln['type']}]: {vuln['message']}")
```
This scanner would catch several classic mistakes before the code ever reaches production. It's not perfect—it can't know if data is truly "trusted"—but it raises essential questions during development.
A major part of code quality is consistency. This is where style enforcers and custom linters come in. Tools like Flake8 and pylint are built on the principles we've discussed. But sometimes, your team has its own rules. Maybe all API endpoint functions must be named with a _api suffix, or database connection objects must be closed in a specific way.
We can write a small linter plugin to enforce these house rules. Here’s how you might ensure that all class names follow a CamelCase convention and that error logging uses a dedicated logger.
```python
import ast

class CustomStyleEnforcer(ast.NodeVisitor):
    """Enforces project-specific coding standards."""
    def visit_ClassDef(self, node: ast.ClassDef):
        # Rule: class names must be CamelCase
        if not node.name[0].isupper():
            self._report(node.lineno, f"Class name '{node.name}' should start with an uppercase letter.")
        # Rule: classes should have a docstring as their first statement
        has_docstring = (len(node.body) > 0 and
                         isinstance(node.body[0], ast.Expr) and
                         isinstance(node.body[0].value, ast.Constant) and
                         isinstance(node.body[0].value.value, str))
        if not has_docstring:
            self._report(node.lineno, f"Class '{node.name}' is missing a docstring.", severity='info')
        self.generic_visit(node)

    def visit_FunctionDef(self, node: ast.FunctionDef):
        # Rule: functions that handle errors must log them.
        # Simple heuristic: look for top-level try/except blocks.
        has_try_except = any(isinstance(stmt, ast.Try) for stmt in node.body)
        if has_try_except and not self._contains_logging(node):
            self._report(node.lineno, f"Function '{node.name}' has error handling but no apparent logging.", severity='warning')
        self.generic_visit(node)

    def _contains_logging(self, node: ast.AST) -> bool:
        """Check if the node contains a call to a logging function."""
        for child in ast.walk(node):
            if isinstance(child, ast.Call):
                if isinstance(child.func, ast.Attribute):
                    if child.func.attr in ('error', 'warning', 'exception', 'info', 'debug'):
                        return True
                if isinstance(child.func, ast.Name):
                    if child.func.id in ('log', 'logging'):
                        return True
        return False

    def _report(self, line: int, message: str, severity: str = 'warning'):
        print(f"[{severity.upper()}] Line {line}: {message}")

# Code that violates some style rules
style_code = """
import logging

class badClassName:  # Violates CamelCase
    pass

class DataProcessor:
    # Missing docstring
    def process(self, data):
        try:
            result = data['value'] / data['divisor']
        except KeyError:
            # No logging here!
            result = 0
        except ZeroDivisionError:
            # No logging here either!
            result = None
        return result

class ProperClass:
    '''This one is okay.'''
    def safe_method(self, x):
        try:
            return 10 / x
        except ZeroDivisionError:
            logging.error("Division by zero in safe_method")
            return 0
"""

tree = ast.parse(style_code)
enforcer = CustomStyleEnforcer()
enforcer.visit(tree)
```
This enforcer helps maintain a consistent codebase. When everyone follows the same patterns, the code is easier to read and maintain. It reduces the mental load for everyone on the team.
One of my favorite advanced techniques is metrics calculation and trend analysis. It's not enough to know a function is complex today. We need to know if it's getting more complex over time. We can track metrics like lines of code, complexity, and coupling across git history.
We can write a script that checks out each commit in a branch, runs our analyzers, and stores the results. This creates a history of code health. You can see when complexity spiked, perhaps during a rushed feature development, and plan to refactor it.
Here's a conceptual outline for a historical metrics tracker.
```python
# This is a conceptual example. A real version would use gitpython
# to check out each commit in a repository before analyzing it.
import datetime
from pathlib import Path
import pandas as pd

class CodeMetricsTracker:
    """Tracks code metrics over git history."""
    def __init__(self, repo_path: Path):
        self.repo_path = repo_path
        self.metrics_history = []

    def analyze_commit(self, commit_hash: str, author: str, date: datetime.datetime):
        """Analyze code at a specific commit."""
        # In a real implementation, you would:
        # 1. Check out the commit using gitpython
        # 2. Find all .py files
        # 3. Run your analyzers (complexity, style, security) on each
        # 4. Aggregate results
        # Simulated metrics for this example
        file_metrics = {
            'commit': commit_hash[:8],
            'author': author,
            'date': date.date(),
            'total_files': 42,  # Example values
            'total_lines': 15000,
            'avg_complexity': 8.2,
            'high_complexity_functions': 5,
            'style_violations': 12,
        }
        self.metrics_history.append(file_metrics)
        return file_metrics

    def generate_trend_report(self):
        """Create a report showing metrics over time."""
        if not self.metrics_history:
            return "No data collected."
        df = pd.DataFrame(self.metrics_history)
        df.set_index('date', inplace=True)
        report_lines = ["Code Quality Trends", "=" * 50]
        report_lines.append(f"Time period: {df.index.min()} to {df.index.max()}")
        report_lines.append(f"Number of commits analyzed: {len(df)}")
        report_lines.append("\nKey Metrics:")
        trend = 'up' if df['avg_complexity'].iloc[-1] > df['avg_complexity'].iloc[0] else 'down'
        report_lines.append(f"  Average Complexity: {df['avg_complexity'].mean():.2f} (trend: {trend})")
        report_lines.append(f"  High Complexity Functions: {df['high_complexity_functions'].sum()} total")
        report_lines.append(f"  Style Violations: {df['style_violations'].sum()} total")
        # Flag if recent commits are trending worse than the overall average
        recent = df.iloc[-5:] if len(df) >= 5 else df
        if recent['avg_complexity'].mean() > df['avg_complexity'].mean():
            report_lines.append("\nWARNING: Recent commits show increasing complexity.")
        return "\n".join(report_lines)

# Simulating a run over a few commits
tracker = CodeMetricsTracker(Path("."))
sample_commits = [
    ("a1b2c3d4", "Alice", datetime.datetime(2023, 10, 1)),
    ("e5f6g7h8", "Bob", datetime.datetime(2023, 10, 5)),
    ("i9j0k1l2", "Alice", datetime.datetime(2023, 10, 10)),
]
for commit in sample_commits:
    tracker.analyze_commit(*commit)
print(tracker.generate_trend_report())
```
Seeing these trends helps teams make data-driven decisions about when to pay down technical debt.
Finally, we can tie all these techniques together into a quality gate. A quality gate is an automated check that passes or fails. It might say: "No new functions with complexity > 15", "Zero security vulnerabilities of high severity", or "Test coverage must not decrease".
We can build a script that runs all our analyzers, aggregates the results, and returns a success or failure code. This script becomes the final judge in your CI/CD pipeline.
```python
import ast
import sys
from pathlib import Path
from typing import List

class QualityGate:
    """Aggregates multiple analyzers to pass/fail code quality checks."""
    def __init__(self, code_dir: Path):
        self.code_dir = code_dir
        self.analyzers = []
        self.results = {}

    def add_analyzer(self, name: str, analyzer_class, **kwargs):
        self.analyzers.append((name, analyzer_class, kwargs))

    def run(self):
        """Run all analyzers and evaluate against thresholds."""
        all_passed = True
        report = {}
        py_files = list(self.code_dir.rglob("*.py"))
        for name, AnalyzerClass, kwargs in self.analyzers:
            analyzer_instance = AnalyzerClass(**kwargs)
            issues = []
            for py_file in py_files:
                if hasattr(analyzer_instance, 'analyze_file'):
                    issues.extend(analyzer_instance.analyze_file(str(py_file)))
                elif hasattr(analyzer_instance, 'visit'):
                    with open(py_file, 'r') as f:
                        tree = ast.parse(f.read())
                    analyzer_instance.visit(tree)
                    if hasattr(analyzer_instance, 'issues'):
                        issues.extend(analyzer_instance.issues)
            # Evaluate against rules
            passed = self._evaluate(name, issues)
            if not passed:
                all_passed = False
            report[name] = {
                'passed': passed,
                'issue_count': len(issues),
                'sample_issues': issues[:3]  # Show first 3
            }
        self.results = report
        return all_passed, report

    def _evaluate(self, analyzer_name: str, issues: List) -> bool:
        """Simple evaluation logic."""
        thresholds = {
            'ASTAnalyzer': {'max_high_severity': 0},     # No high-severity findings allowed
            'SecurityPatternScanner': {'max_total': 0},  # Zero security issues allowed
            'CustomStyleEnforcer': {'max_total': 5},     # Allow some style warnings
        }
        if analyzer_name not in thresholds:
            return len(issues) == 0  # Default: pass only if no issues
        rules = thresholds[analyzer_name]
        if 'max_high_severity' in rules:
            high_sev = [i for i in issues
                        if isinstance(i, dict) and i.get('severity') == 'high']
            if len(high_sev) > rules['max_high_severity']:
                return False
        if 'max_total' in rules:
            if len(issues) > rules['max_total']:
                return False
        return True

    def print_report(self):
        print("QUALITY GATE REPORT")
        print("=" * 60)
        for analyzer_name, data in self.results.items():
            status = "PASS" if data['passed'] else "FAIL"
            print(f"\n{analyzer_name}: {status}")
            print(f"  Issues Found: {data['issue_count']}")
            if data['sample_issues']:
                print("  Sample Issues:")
                for issue in data['sample_issues']:
                    if isinstance(issue, dict):
                        print(f"    - {issue.get('message', issue)}")
                    else:
                        print(f"    - {issue}")

# Simulate running the gate on a project
project_path = Path(".")  # Current directory
gate = QualityGate(project_path)
# In a real scenario, we would register our analyzer classes here:
# gate.add_analyzer("SecurityPatternScanner", SecurityPatternScanner)
print("Simulating Quality Gate run...")
passed, report = gate.run()
gate.print_report()
if not passed:
    print("\n❌ Quality Gate FAILED. Fix issues before merging.")
    sys.exit(1)
else:
    print("\n✅ Quality Gate PASSED.")
    sys.exit(0)
```
This gate becomes the guardian of your codebase. It ensures that no matter how fast you move, a baseline of quality is maintained.
These seven techniques—AST parsing, control flow analysis, enhanced type checking, security scanning, style enforcement, metrics tracking, and the final quality gate—give you a comprehensive toolkit. You don't need to implement them all at once. Start with a simple AST analyzer for complexity. Then add a security scanner. Over time, you'll build a custom suite that perfectly fits your team's needs.
The goal isn't to create busywork or reject every piece of code. It's to have a conversation with your codebase before it goes live. It's about finding the hidden problems while they're still easy to fix. I've seen these techniques transform chaotic projects into clean, maintainable systems. They turn subjective arguments about code style into objective, automated checks. That lets developers focus on what they do best: solving problems and building great software.