<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pawan Singh Kapkoti</title>
    <description>The latest articles on DEV Community by Pawan Singh Kapkoti (@pawan_singhkapkoti_ea8a0).</description>
    <link>https://dev.to/pawan_singhkapkoti_ea8a0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877270%2F0584f15d-bd61-4896-af42-4b2772a4344a.jpg</url>
      <title>DEV Community: Pawan Singh Kapkoti</title>
      <link>https://dev.to/pawan_singhkapkoti_ea8a0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pawan_singhkapkoti_ea8a0"/>
    <language>en</language>
    <item>
      <title>Published a SQL Linter to PyPI Because I Was Tired of Bad Queries Hitting Production</title>
      <dc:creator>Pawan Singh Kapkoti</dc:creator>
      <pubDate>Sat, 18 Apr 2026 07:12:10 +0000</pubDate>
      <link>https://dev.to/pawan_singhkapkoti_ea8a0/published-a-sql-linter-to-pypi-because-i-was-tired-of-bad-queries-hitting-production-18o0</link>
      <guid>https://dev.to/pawan_singhkapkoti_ea8a0/published-a-sql-linter-to-pypi-because-i-was-tired-of-bad-queries-hitting-production-18o0</guid>
      <description>&lt;p&gt;Food manufacturing ERPs run on SQL Server. SSRS reports, stored procedures, ad-hoc queries — often written by people who learned SQL from Stack Overflow.&lt;/p&gt;

&lt;p&gt;A DELETE without WHERE against a staging table is a wake-up call. sql-sop catches these patterns before they reach the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  sql-sop: 18 rules, 55 tests, 0.08 seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sql-sop
sql-sop check &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. Point it at a directory and it scans every &lt;code&gt;.sql&lt;/code&gt; file in 0.08 seconds. No config file needed. No database connection. Just pattern matching against compiled regex and sqlparse AST analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rules
&lt;/h2&gt;

&lt;p&gt;5 errors (block commits):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E001&lt;/td&gt;
&lt;td&gt;DELETE without WHERE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E002&lt;/td&gt;
&lt;td&gt;DROP without IF EXISTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E003&lt;/td&gt;
&lt;td&gt;GRANT/REVOKE in application code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E004&lt;/td&gt;
&lt;td&gt;String concatenation in WHERE (SQL injection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E005&lt;/td&gt;
&lt;td&gt;INSERT without explicit column list&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
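
&lt;p&gt;To make the error tier concrete, here is a rough sketch of how a rule like E001 could be expressed as a compiled regex. This is an illustration of the approach, not sql-sop's actual source:&lt;/p&gt;

```python
import re

# Hypothetical sketch of a regex-based rule in the spirit of E001
# (DELETE without WHERE) -- not sql-sop's real implementation.
E001 = re.compile(
    r"\bDELETE\s+FROM\s+\w+\s*(?:;|$)",  # DELETE FROM table, nothing before end/semicolon
    re.IGNORECASE,
)

def check_e001(sql: str) -> bool:
    """Return True if the statement is a DELETE with no WHERE clause."""
    return bool(E001.search(sql.strip()))
```

&lt;p&gt;A statement like &lt;code&gt;DELETE FROM users WHERE id = 1&lt;/code&gt; passes; a bare &lt;code&gt;DELETE FROM users&lt;/code&gt; is flagged.&lt;/p&gt;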

&lt;p&gt;10 warnings (advisory):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;W001&lt;/td&gt;
&lt;td&gt;SELECT *&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W002&lt;/td&gt;
&lt;td&gt;Missing LIMIT on large result sets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W003&lt;/td&gt;
&lt;td&gt;Functions on indexed columns (kills index usage)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W004&lt;/td&gt;
&lt;td&gt;Multi-table JOIN without aliases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W005&lt;/td&gt;
&lt;td&gt;Subquery in WHERE that could be a JOIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W006&lt;/td&gt;
&lt;td&gt;ORDER BY without LIMIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W007&lt;/td&gt;
&lt;td&gt;Hardcoded magic numbers in WHERE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W008&lt;/td&gt;
&lt;td&gt;Inconsistent keyword casing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W009&lt;/td&gt;
&lt;td&gt;Missing semicolons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W010&lt;/td&gt;
&lt;td&gt;Commented-out code blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;3 structural rules (v0.3.0, sqlparse AST):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S001&lt;/td&gt;
&lt;td&gt;Implicit cross join (comma-separated tables in FROM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S002&lt;/td&gt;
&lt;td&gt;Subquery nested more than 2 levels deep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S003&lt;/td&gt;
&lt;td&gt;CTE defined but never referenced&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The fluent API
&lt;/h2&gt;

&lt;p&gt;v0.2.0 added a chainable Python API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sql_guard&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SqlGuard&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SqlGuard&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;enable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;E001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;W001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# False
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# "1 error in 1 statement"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets you use sql-sop programmatically - in test suites, CI pipelines, or other tools. The CLI is for humans; the API is for code.&lt;/p&gt;
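
&lt;p&gt;The chainable shape is easy to reproduce. Here is a toy stand-in for the pattern - returning &lt;code&gt;self&lt;/code&gt; from each builder method - using two regex rules; it is not sql-sop's real &lt;code&gt;SqlGuard&lt;/code&gt; class:&lt;/p&gt;

```python
import re

class MiniGuard:
    """Toy chainable scanner illustrating the fluent-API pattern."""
    def __init__(self):
        self.rules = {}

    def enable(self, *codes):
        # Each enabled code maps to a compiled pattern; only two rules here.
        known = {
            "E001": re.compile(r"\bDELETE\s+FROM\s+\w+\s*$", re.IGNORECASE),
            "W001": re.compile(r"\bSELECT\s+\*", re.IGNORECASE),
        }
        for code in codes:
            self.rules[code] = known[code]
        return self  # returning self is what makes the API chainable

    def scan(self, sql):
        self.hits = [code for code, pat in self.rules.items() if pat.search(sql)]
        # Only error-level (E) hits fail the scan; warnings are advisory.
        self.passed = not any(code.startswith("E") for code in self.hits)
        return self
```

&lt;p&gt;&lt;code&gt;MiniGuard().enable("E001", "W001").scan("DELETE FROM users")&lt;/code&gt; reads the same way as the real API above.&lt;/p&gt;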

&lt;h2&gt;
  
  
  Pre-commit hook
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/Pawansingh3889/sql-guard&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sql-sop&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;--severity&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;error&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every SQL file gets checked before every commit. Dangerous patterns are caught before they reach the PR, let alone production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structural rules with sqlparse
&lt;/h2&gt;

&lt;p&gt;The regex-based rules catch surface patterns. But some bad SQL looks fine line-by-line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is an implicit cross join. It works, but it is fragile and unclear. The structural rule S001 catches it by parsing the FROM clause rather than matching text.&lt;/p&gt;

&lt;p&gt;For S002 (deeply nested subqueries), sqlparse builds an actual token tree. I walk it recursively, counting parenthesis depth. More than 2 levels deep gets flagged with a suggestion to use CTEs.&lt;/p&gt;
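
&lt;p&gt;As a rough stdlib approximation of the S002 idea (the real rule walks a sqlparse token tree, which this sketch does not), tracking parenthesis depth around &lt;code&gt;SELECT&lt;/code&gt; keywords is enough to flag deep nesting:&lt;/p&gt;

```python
# Text-level approximation of S002: flag subqueries nested more than
# 2 levels deep. Illustrative only -- the real rule uses sqlparse.
def max_select_depth(sql: str) -> int:
    depth, max_depth = 0, 0
    upper = sql.upper()
    for i, ch in enumerate(upper):
        if ch == "(":
            depth += 1
            # Only count parens that open a subquery, not function calls.
            if upper[i + 1:].lstrip().startswith("SELECT"):
                max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth

def violates_s002(sql: str) -> bool:
    return max_select_depth(sql) > 2
```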

&lt;h2&gt;
  
  
  Notes on publishing to PyPI
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hatchling is the simplest build backend.&lt;/strong&gt; &lt;code&gt;pyproject.toml&lt;/code&gt; with &lt;code&gt;[build-system] requires = ["hatchling"]&lt;/code&gt; — no setup.py, no setup.cfg.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test matrix matters.&lt;/strong&gt; Python 3.10 through 3.13 each have slightly different regex behaviour. CI catches what local testing misses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;195 monthly downloads is modest but meaningful.&lt;/strong&gt; Most PyPI packages get zero. Each download is someone protecting their database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The pre-commit hook drives adoption.&lt;/strong&gt; More usage comes via pre-commit than the CLI. Meeting users where they already work matters more than features.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;PyPI: &lt;a href="https://pypi.org/project/sql-sop/" rel="noopener noreferrer"&gt;pypi.org/project/sql-sop&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/Pawansingh3889/sql-guard" rel="noopener noreferrer"&gt;github.com/Pawansingh3889/sql-guard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install: &lt;code&gt;pip install sql-sop&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sql</category>
      <category>ai</category>
      <category>devops</category>
      <category>python</category>
    </item>
    <item>
      <title>How I Reverse-Engineered a Reverse ETL Tool and Wrote the Docs Nobody Had</title>
      <dc:creator>Pawan Singh Kapkoti</dc:creator>
      <pubDate>Sat, 18 Apr 2026 07:07:43 +0000</pubDate>
      <link>https://dev.to/pawan_singhkapkoti_ea8a0/how-i-reverse-engineered-a-reverse-etl-tool-and-wrote-the-docs-nobody-had-954</link>
      <guid>https://dev.to/pawan_singhkapkoti_ea8a0/how-i-reverse-engineered-a-reverse-etl-tool-and-wrote-the-docs-nobody-had-954</guid>
      <description>&lt;p&gt;drt is an open-source reverse ETL tool. Five destination connectors existed. No guide for building new ones. No documentation beyond the source code.&lt;/p&gt;

&lt;p&gt;This post walks through the process of reverse-engineering the connector architecture, shipping five new connectors, and writing the official tutorial that got merged.&lt;/p&gt;

&lt;h2&gt;
  
  
  The approach
&lt;/h2&gt;

&lt;p&gt;Start with the source, not the README. The actual implementation files tell you what the maintainers intended.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;drt/destinations/base.py&lt;/code&gt; defines the Destination Protocol with one method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Destination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DestinationConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sync_options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SyncOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SyncResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire interface. One method. Takes records, config, and options. Returns success/failure counts. Every destination - Slack, PostgreSQL, REST API, Discord - implements this same method.&lt;/p&gt;
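
&lt;p&gt;A minimal destination satisfying that contract needs nothing beyond the one method. Here is a sketch with stdlib stand-ins for drt's &lt;code&gt;DestinationConfig&lt;/code&gt;, &lt;code&gt;SyncOptions&lt;/code&gt;, and &lt;code&gt;SyncResult&lt;/code&gt; types, plus a hypothetical in-memory destination:&lt;/p&gt;

```python
# Sketch of the one-method contract with stand-in types -- the real
# DestinationConfig / SyncOptions / SyncResult live in drt.
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable

@dataclass
class SyncResult:
    succeeded: int = 0
    failed: int = 0

@runtime_checkable
class Destination(Protocol):
    def load(self, records: list[dict[str, Any]], config: Any, sync_options: Any) -> SyncResult:
        ...

class MemoryDestination:
    """Hypothetical destination that just collects records in memory."""
    def __init__(self):
        self.rows: list[dict[str, Any]] = []

    def load(self, records, config, sync_options) -> SyncResult:
        for record in records:
            self.rows.append(record)
        return SyncResult(succeeded=len(records), failed=0)
```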

&lt;h2&gt;
  
  
  Mapping the architecture
&lt;/h2&gt;

&lt;p&gt;I traced the full flow by reading backwards from the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLI (_get_destination) -&amp;gt; isinstance check -&amp;gt; Destination.load()
                                                    |
                                            Config model (Pydantic)
                                            with type: Literal["xxx"]
                                                    |
                                            DestinationConfig union
                                            (discriminated by type field)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four files. That is it. To add a new destination, you touch four files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Config model&lt;/strong&gt; in &lt;code&gt;drt/config/models.py&lt;/code&gt; - a Pydantic BaseModel with &lt;code&gt;type: Literal["your_type"]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination class&lt;/strong&gt; in &lt;code&gt;drt/destinations/your_dest.py&lt;/code&gt; - implements &lt;code&gt;load()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLI registration&lt;/strong&gt; in &lt;code&gt;drt/cli/main.py&lt;/code&gt; - one isinstance branch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt; in &lt;code&gt;tests/unit/test_your_dest.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No plugin registry. No entry points. No dynamic discovery. Just a Pydantic discriminated union and an isinstance chain. Simple enough that I could hold the whole architecture in my head.&lt;/p&gt;
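
&lt;p&gt;The dispatch shape looks roughly like this. drt uses Pydantic models with &lt;code&gt;type: Literal[...]&lt;/code&gt; discriminators; plain dataclasses and made-up config fields stand in for them here:&lt;/p&gt;

```python
# Stdlib sketch of the config-union + isinstance dispatch described above.
from dataclasses import dataclass
from typing import Literal, Union

@dataclass
class SlackConfig:
    type: Literal["slack"] = "slack"
    webhook_url: str = ""

@dataclass
class PostgresConfig:
    type: Literal["postgres"] = "postgres"
    dsn: str = ""

DestinationConfig = Union[SlackConfig, PostgresConfig]

def get_destination(config: DestinationConfig) -> str:
    # Mirrors the isinstance chain in _get_destination(): one branch per
    # config type. Returning a name string keeps the sketch small.
    if isinstance(config, SlackConfig):
        return "SlackDestination"
    if isinstance(config, PostgresConfig):
        return "PostgresDestination"
    raise ValueError(f"unknown destination type: {config!r}")
```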

&lt;h2&gt;
  
  
  Five connectors from one pattern
&lt;/h2&gt;

&lt;p&gt;Once the pattern is clear, building connectors becomes repetitive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt; - database destination with batch inserts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake&lt;/strong&gt; - cloud warehouse with snowflake-connector-python&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parquet&lt;/strong&gt; - file-based output for data lake patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams&lt;/strong&gt; - Microsoft Teams webhook notifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSV/JSON&lt;/strong&gt; - simple file export&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one followed the same pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Config model with destination-specific fields&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load()&lt;/code&gt; method iterating records with RowError on failure&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resolve_env()&lt;/code&gt; for secrets (never hardcode credentials)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RateLimiter&lt;/code&gt; + &lt;code&gt;with_retry()&lt;/code&gt; for HTTP destinations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;try/finally&lt;/code&gt; for database connection cleanup&lt;/li&gt;
&lt;li&gt;Respect &lt;code&gt;on_error&lt;/code&gt;: "fail" returns early, "skip" continues&lt;/li&gt;
&lt;/ul&gt;
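
&lt;p&gt;The per-row loop with &lt;code&gt;on_error&lt;/code&gt; handling is the piece every connector repeats. A minimal stdlib sketch, with illustrative names (the error tuple stands in for RowError; this is not drt's actual helper):&lt;/p&gt;

```python
# Sketch of the per-row load loop: count successes, record failures,
# honour on_error ("fail" stops early, "skip" continues).
from dataclasses import dataclass, field

@dataclass
class SyncResult:
    succeeded: int = 0
    failed: int = 0
    errors: list = field(default_factory=list)

def load_records(records, send, on_error="fail"):
    result = SyncResult()
    for index, record in enumerate(records):
        try:
            send(record)
            result.succeeded += 1
        except Exception as exc:
            result.failed += 1
            result.errors.append((index, str(exc)))  # stand-in for RowError
            if on_error == "fail":
                return result  # "fail" returns at the first bad row
            # "skip" falls through and continues with the next record
    return result
```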

&lt;p&gt;All five connectors were merged into the main branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing the tutorial nobody had
&lt;/h2&gt;

&lt;p&gt;After five connectors, the pattern was clear. But the next contributor should not have to read five implementations to learn it. So the obvious next step was to write the guide.&lt;/p&gt;

&lt;p&gt;PR: &lt;a href="https://github.com/drt-hub/drt/pull/332" rel="noopener noreferrer"&gt;drt-hub/drt#332&lt;/a&gt; - merged.&lt;/p&gt;

&lt;p&gt;The tutorial walks through building a fictional Webhook destination step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Config model with Pydantic validators&lt;/li&gt;
&lt;li&gt;Destination class with the full &lt;code&gt;load()&lt;/code&gt; implementation&lt;/li&gt;
&lt;li&gt;CLI registration (one line)&lt;/li&gt;
&lt;li&gt;Tests using pytest-httpserver for HTTP destinations or unittest.mock for databases&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I included a checklist at the end - 14 items that every connector should satisfy. Things like "uses resolve_env() for secrets" and "respects on_error setting" and "builds RowError on per-row failures."&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons on reverse engineering open source
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with the interface, not the implementation.&lt;/strong&gt; &lt;code&gt;base.py&lt;/code&gt; told me everything I needed to know about the contract. The implementations were just variations on the theme.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the CLI entry point.&lt;/strong&gt; &lt;code&gt;_get_destination()&lt;/code&gt; showed me exactly how destinations are discovered and instantiated. No magic, no reflection, just isinstance checks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The config layer is the key.&lt;/strong&gt; Pydantic discriminated unions with &lt;code&gt;type: Literal["xxx"]&lt;/code&gt; meant the YAML config drives everything. Understanding the config model meant understanding the whole system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test patterns are documentation.&lt;/strong&gt; The existing tests showed me what the maintainers considered important: success path, error-skip, error-fail, missing credentials, connection cleanup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write the docs you wish existed.&lt;/strong&gt; Five implementations is enough context to write the guide. The next person should not have to repeat the journey.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;drt: &lt;a href="https://github.com/drt-hub/drt" rel="noopener noreferrer"&gt;github.com/drt-hub/drt&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;My connector tutorial PR: &lt;a href="https://github.com/drt-hub/drt/pull/332" rel="noopener noreferrer"&gt;#332&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>OpsMind: On-Prem AI for Manufacturing — No Cloud, No API Keys, No Budget</title>
      <dc:creator>Pawan Singh Kapkoti</dc:creator>
      <pubDate>Wed, 15 Apr 2026 23:27:06 +0000</pubDate>
      <link>https://dev.to/pawan_singhkapkoti_ea8a0/opsmind-on-prem-ai-for-manufacturing-no-cloud-no-api-keys-no-budget-8m7</link>
      <guid>https://dev.to/pawan_singhkapkoti_ea8a0/opsmind-on-prem-ai-for-manufacturing-no-cloud-no-api-keys-no-budget-8m7</guid>
      <description>&lt;p&gt;Manufacturing companies run on SQL Server ERPs with hundreds of tables. Shift managers need yield numbers, waste reports, temperature readings — daily. The usual path: email IT, wait for an SSRS report, get yesterday's numbers tomorrow.&lt;/p&gt;

&lt;p&gt;OpsMind is an open-source tool that lets anyone on the factory floor type a question in English and get the SQL result in 5 seconds. No SQL knowledge required. Runs locally on Ollama, no cloud dependency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Manager: "What was today's yield by product?"

OpsMind: Salmon fillets: 91.2% (target 90%)
         Cod loins: 88.7% (below target - check line 2 defrost timing)
         Haddock: 93.1%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It runs entirely on-premises. A Gemma 3 12B model via Ollama on a desktop PC. No data leaves the building. No cloud subscription. No API keys. Total hardware cost: one PC.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;OpsMind uses a LangGraph state graph with 6 nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;question -&amp;gt; detect_domain -&amp;gt; check_library -&amp;gt; generate_sql -&amp;gt; validate_sql -&amp;gt; execute_sql -&amp;gt; explain_results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;detect_domain&lt;/strong&gt; identifies which of 7 business areas the question belongs to (production, waste, orders, compliance, staff, suppliers, traceability)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;check_library&lt;/strong&gt; checks 20 pre-built queries first. If there is a match, it skips the LLM entirely. Instant, guaranteed-correct SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;generate_sql&lt;/strong&gt; if no match, the LLM generates SQL scoped to only the relevant tables (not all 147)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;validate_sql&lt;/strong&gt; 5-stage safety check: statement type, injection detection, table existence, column existence, row limit enforcement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;execute_sql&lt;/strong&gt; runs the validated query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;explain_results&lt;/strong&gt; LLM explains the numbers in business terms&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: the pre-built query library handles the top 20 questions managers ask every day. The LLM is only the fallback. This means the most common queries are fast and reliable, while novel questions still work.&lt;/p&gt;
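
&lt;p&gt;The routing step can be sketched in a few lines. The library entry and function names here are illustrative, not OpsMind's actual code:&lt;/p&gt;

```python
# Library-first routing: exact-match the question against vetted
# queries, and only fall back to the LLM when nothing matches.
QUERY_LIBRARY = {
    "what was today's yield by product?":
        "SELECT product, yield_pct FROM daily_yield WHERE run_date = CAST(GETDATE() AS date)",
}

def route_question(question, generate_with_llm):
    key = question.strip().lower()
    if key in QUERY_LIBRARY:
        return QUERY_LIBRARY[key], "library"   # instant, vetted SQL
    return generate_with_llm(question), "llm"  # fallback path
```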

&lt;h2&gt;
  
  
  The SQL validation layer
&lt;/h2&gt;

&lt;p&gt;This is the critical layer. An LLM generating SQL against a production database needs safety gates. The 5-stage validation catches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tautologies&lt;/strong&gt; like &lt;code&gt;WHERE 1=1&lt;/code&gt; (injection attempt)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UNION injection&lt;/strong&gt; (appending malicious queries)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comment injection&lt;/strong&gt; (&lt;code&gt;--&lt;/code&gt; to truncate queries)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-existent tables&lt;/strong&gt; (hallucinated table names)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing LIMIT&lt;/strong&gt; (auto-adds LIMIT 1000 to prevent accidental full table scans)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only SELECT and WITH (CTEs) are allowed. INSERT, UPDATE, DELETE, DROP are blocked at the validation layer. The database connection uses read-only credentials. Defence in depth.&lt;/p&gt;
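
&lt;p&gt;A toy version of that gate order - statement type first, then injection signatures, then the row cap - looks like this. OpsMind's real validator also checks tables and columns against the live schema, which this sketch omits:&lt;/p&gt;

```python
import re

# Tautologies like "x = x" or the classic "1=1".
TAUTOLOGY = re.compile(r"\b(\w+)\s*=\s*\1\b")

def validate(sql: str) -> str:
    statement = sql.strip().rstrip(";")
    first_word = statement.split(None, 1)[0].upper()
    if first_word not in ("SELECT", "WITH"):
        raise ValueError(f"blocked statement type: {first_word}")
    if TAUTOLOGY.search(statement):
        raise ValueError("tautology detected (possible injection)")
    if "--" in statement:
        raise ValueError("comment injection detected")
    if not re.search(r"\b(TOP|LIMIT)\b", statement, re.IGNORECASE):
        statement += " LIMIT 1000"  # cap accidental full table scans
    return statement
```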

&lt;h2&gt;
  
  
  MCP server architecture
&lt;/h2&gt;

&lt;p&gt;The architecture includes Model Context Protocol servers to decouple the data access layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database server (port 9000) exposes query, table discovery, and domain schema as tools&lt;/li&gt;
&lt;li&gt;Document search server (port 9001) exposes RAG search over factory SOPs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means OpsMind is not the only tool that can use the data. Any MCP-compatible agent can connect to the same servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pre-built queries beat LLM generation for common questions.&lt;/strong&gt; A query library handles 80% of real usage with zero latency and zero hallucination risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Domain scoping is critical.&lt;/strong&gt; Exposing all 147 tables to the LLM produces garbage SQL. Scoping to 4-10 relevant tables per domain produces accurate SQL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Runtime-loaded documentation works.&lt;/strong&gt; Business rules change. Compliance thresholds change. Loading these from markdown files at runtime keeps domain knowledge current without redeploying.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local LLMs are sufficient for structured tasks.&lt;/strong&gt; Gemma 3 12B handles NL-to-SQL and result explanation. No GPT-4 needed. No internet dependency.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The code
&lt;/h2&gt;

&lt;p&gt;Everything is open source: &lt;a href="https://github.com/Pawansingh3889/OpsMind" rel="noopener noreferrer"&gt;github.com/Pawansingh3889/OpsMind&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with: Python, LangGraph, Ollama, Streamlit, SQLAlchemy, ChromaDB, pgvector, FastMCP, sqlparse.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>manufacturing</category>
    </item>
  </channel>
</rss>
