<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amadou Wolfgang Cisse</title>
    <description>The latest articles on DEV Community by Amadou Wolfgang Cisse (@amadou6e).</description>
    <link>https://dev.to/amadou6e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2955539%2Fbb2a9df1-7af9-4658-8aea-14b7e3e97edd.png</url>
      <title>DEV Community: Amadou Wolfgang Cisse</title>
      <link>https://dev.to/amadou6e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amadou6e"/>
    <language>en</language>
    <item>
      <title>Simplifying Programmatic Database Handling</title>
      <dc:creator>Amadou Wolfgang Cisse</dc:creator>
      <pubDate>Sat, 31 May 2025 11:06:34 +0000</pubDate>
      <link>https://dev.to/amadou6e/simplifying-programmatic-database-handling-p7l</link>
      <guid>https://dev.to/amadou6e/simplifying-programmatic-database-handling-p7l</guid>
      <description>&lt;p&gt;If you’ve ever worked on a backend or data-centric project in a small team, chances are you’ve hit the same wall I did: setting up local databases consistently, reliably, and without friction.&lt;/p&gt;

&lt;p&gt;It’s a deceptively simple task. And yet, it’s where many projects start to feel fragile.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: The Local Database Setup Spiral
&lt;/h2&gt;

&lt;p&gt;Picture this: you're building a microservice architecture that talks to a PostgreSQL database. You’re working with two other developers. You write a quick &lt;code&gt;README&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Make sure you have Postgres installed. Create a user, a password, a database. Import this SQL script. Use port 5432 unless it’s taken.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You think it’s fine. Then the pull requests start rolling in with bugs that don’t make sense. Someone’s DB is misconfigured. Someone else forgot to run the schema script. Another person installed the wrong version of Postgres on Windows and it won’t even start.&lt;/p&gt;

&lt;p&gt;And when you try to onboard a new teammate? If it's been a while since anyone performed the setup, it can quickly turn into a full afternoon of troubleshooting.&lt;/p&gt;

&lt;p&gt;Local database setup is deceptively expensive. It introduces variance into your dev environments and bakes hidden assumptions into your codebase.&lt;/p&gt;

&lt;p&gt;Even with Docker, it's rarely elegant. You might end up with a mess of &lt;code&gt;docker-compose&lt;/code&gt; files, environment variables, half-broken shell scripts, and manual volume mounts that no one dares touch.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example: Automating WG-Gesucht Notifications
&lt;/h2&gt;

&lt;p&gt;In my case, this hit home during development of a bot that scrapes listings from &lt;strong&gt;WG-Gesucht&lt;/strong&gt; (a German apartment-sharing site) and automatically alerts users based on their preferences. It’s built as a collection of microservices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A scraper service (pulls data and stores in Postgres),&lt;/li&gt;
&lt;li&gt;A vector search module using pgvector (to recommend listings),&lt;/li&gt;
&lt;li&gt;A notification dispatcher (integrates with email/Telegram).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each service uses a local DB during development and testing. I needed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quickly spin up Postgres and MySQL instances with test data,&lt;/li&gt;
&lt;li&gt;Run init scripts and seed content from Python,&lt;/li&gt;
&lt;li&gt;Share configurations with collaborators using Jupyter notebooks,&lt;/li&gt;
&lt;li&gt;Avoid any OS-specific setup pain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker is an obvious choice—but I didn’t want my dev flow to depend on Docker CLI commands buried in scripts. I wanted everything runnable in Python, so it could live side-by-side with my logic and be testable, restartable, and explicit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Right Solution Should Look Like
&lt;/h2&gt;

&lt;p&gt;At this point, I had some clear goals in mind for a better approach to local databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python-first interface&lt;/strong&gt;: no shell scripts, no &lt;code&gt;docker-compose.yml&lt;/code&gt;, no Makefiles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal dependencies&lt;/strong&gt;: no installing client tools or external setup scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform&lt;/strong&gt;: works on macOS, Linux, and Windows (even WSL).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supports init scripts and volumes&lt;/strong&gt;: I want to seed data or persist it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear lifecycle control&lt;/strong&gt;: I want to &lt;code&gt;.start_db()&lt;/code&gt;, &lt;code&gt;.stop_db()&lt;/code&gt;, &lt;code&gt;.delete_db()&lt;/code&gt; like any Python object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the tool I wish existed from the start. So I built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing &lt;code&gt;py-dockerdb&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/py-dockerdb" rel="noopener noreferrer"&gt;&lt;code&gt;py-dockerdb&lt;/code&gt;&lt;/a&gt; is a Python library that lets you manage real Dockerized databases like native Python objects. Visit the project on &lt;a href="https://github.com/amadou-6e/py-dockerdb/tree/main" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; for more info and usage examples!&lt;/p&gt;

&lt;p&gt;With just a few lines of code, you can spin up Postgres, MongoDB, MySQL, or SQL Server, inject init scripts, connect with familiar Python drivers—and tear them down cleanly when done.&lt;/p&gt;

&lt;p&gt;No shell, no YAML, no guesswork.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docker_db.postgres_db&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostgresConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PostgresDB&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;botuser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;botpass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wggesucht_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_db&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM listings;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs an actual PostgreSQL instance in Docker, ready for interaction and controlled entirely from Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Every database type has two classes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;Config&lt;/code&gt; class that defines connection settings and init behavior.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;DB&lt;/code&gt; class that manages lifecycle: &lt;code&gt;create_db()&lt;/code&gt;, &lt;code&gt;stop_db()&lt;/code&gt;, &lt;code&gt;delete_db()&lt;/code&gt;, &lt;code&gt;restart_db()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Init scripts&lt;/strong&gt;: SQL, JS, or SH, depending on the DB engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume persistence&lt;/strong&gt;: so you can reuse data across runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment injection&lt;/strong&gt;: useful for script templating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native drivers&lt;/strong&gt;: &lt;code&gt;psycopg2&lt;/code&gt;, &lt;code&gt;pymongo&lt;/code&gt;, &lt;code&gt;pyodbc&lt;/code&gt;, &lt;code&gt;mysql-connector&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All you need is Python 3.7+ and a running Docker daemon on the host machine.&lt;/p&gt;
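&lt;p&gt;Because the lifecycle is just method calls (&lt;code&gt;create_db()&lt;/code&gt;, &lt;code&gt;delete_db()&lt;/code&gt;), it composes naturally with Python's context-manager protocol. Here's a small wrapper sketch of my own (not part of &lt;code&gt;py-dockerdb&lt;/code&gt;) that guarantees teardown even when tests fail; the demo uses a stand-in object, but with the library you'd pass a &lt;code&gt;PostgresDB(config)&lt;/code&gt; instance:&lt;/p&gt;

```python
from contextlib import contextmanager

@contextmanager
def temporary_db(db):
    """Create a database on entry and always delete it on exit.
    `db` is any object exposing create_db()/delete_db(), such as a
    py-dockerdb PostgresDB instance (illustrative sketch, not library code)."""
    db.create_db()
    try:
        yield db
    finally:
        db.delete_db()

# Demo with a stand-in object so the pattern is visible without Docker.
class FakeDB:
    def __init__(self):
        self.events = []
    def create_db(self):
        self.events.append("created")
    def delete_db(self):
        self.events.append("deleted")

fake = FakeDB()
with temporary_db(fake):
    fake.events.append("used")
print(fake.events)  # ['created', 'used', 'deleted']
```

&lt;p&gt;The &lt;code&gt;finally&lt;/code&gt; block is the point: the container is removed even if the code inside the &lt;code&gt;with&lt;/code&gt; raises, which keeps CI runs from accumulating orphaned containers.&lt;/p&gt;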

&lt;h2&gt;
  
  
  What You Can Use It For
&lt;/h2&gt;

&lt;p&gt;Here are some use cases I’ve explored or seen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data science notebooks&lt;/strong&gt; with SQL backends that boot on demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI testing environments&lt;/strong&gt; that require disposable database containers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teaching&lt;/strong&gt; SQL or NoSQL without asking students to install anything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microservice development&lt;/strong&gt; with predictable, isolated DB instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid prototyping&lt;/strong&gt; for apps that need seeded data on day one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Philosophy
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;py-dockerdb&lt;/code&gt; is intentionally &lt;strong&gt;minimal&lt;/strong&gt;. It’s not trying to replace &lt;code&gt;docker-compose&lt;/code&gt; for full-stack orchestration. It doesn’t scaffold services or guess your intentions.&lt;/p&gt;

&lt;p&gt;Instead, it focuses on one thing: &lt;strong&gt;let you control local databases, entirely from Python&lt;/strong&gt;, using real Docker containers.&lt;/p&gt;

&lt;p&gt;No DSLs. No hidden automagic. Just code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Databases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;MySQL&lt;/li&gt;
&lt;li&gt;MongoDB&lt;/li&gt;
&lt;li&gt;Microsoft SQL Server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And Cassandra is on the roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;Just install it from PyPI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;py-dockerdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you're ready to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Working with databases locally shouldn’t be an afterthought. It’s one of the most repeated steps in any backend, data, or devops workflow. Yet we still hand-wave it away with vague instructions and flaky setup scripts.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;py-dockerdb&lt;/code&gt;, you can keep database setup alongside your logic—reproducible, isolated, and inspectable.&lt;/p&gt;

&lt;p&gt;Your teammates (and future self) will thank you.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>database</category>
      <category>vectordatabase</category>
      <category>python</category>
    </item>
    <item>
      <title>Minifying Tables with pymtd2json: Boosting Efficiency in RAG Systems</title>
      <dc:creator>Amadou Wolfgang Cisse</dc:creator>
      <pubDate>Sun, 27 Apr 2025 13:43:38 +0000</pubDate>
      <link>https://dev.to/amadou6e/minifying-tables-with-pymtd2json-boosting-efficiency-in-rag-systems-150k</link>
      <guid>https://dev.to/amadou6e/minifying-tables-with-pymtd2json-boosting-efficiency-in-rag-systems-150k</guid>
      <description>&lt;p&gt;In retrieval-augmented generation (RAG) pipelines, input efficiency is paramount, not just in terms of tokens, but also character limits&lt;/p&gt;

&lt;p&gt;When building a multilingual embedding pipeline, I faced a real challenge:&lt;br&gt;&lt;br&gt;
the Cohere multilingual model imposes a maximum of 2048 characters per input, not a token limit.&lt;/p&gt;

&lt;p&gt;This article walks you through a clever solution:&lt;br&gt;&lt;br&gt;
preprocessing Markdown tables into dense JSON blocks using &lt;a href="https://github.com/amadou-6e/pymdt2json" rel="noopener noreferrer"&gt;&lt;code&gt;pymtd2json&lt;/code&gt;&lt;/a&gt;, to ensure smooth, efficient embeddings without errors.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Challenge: Character Limits vs Token Limits
&lt;/h2&gt;

&lt;p&gt;Classical chunking methods, like &lt;code&gt;SentenceSplitter&lt;/code&gt; from LlamaIndex, are token-focused:&lt;br&gt;&lt;br&gt;
you set a maximum number of tokens per chunk, but not characters.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why This Matters:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Markdown (especially GitHub-Flavored Markdown, GFM) wastes space with formatting.&lt;/li&gt;
&lt;li&gt;A Markdown chunk might have only 170 tokens but still exceed 2048 characters.&lt;/li&gt;
&lt;li&gt;This results in rejected API requests or inefficient extra splitting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important Note&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Markdown tables are up to &lt;strong&gt;3x less token-efficient&lt;/strong&gt; than other formats, further compounding the problem.&lt;br&gt;&lt;br&gt;
👉 &lt;a href="https://medium.com/singapore-gds/cutting-cost-and-enhancing-performance-minifying-markdown-tables-to-improve-token-efficiency-in-af488a784fd5" rel="noopener noreferrer"&gt;Read more on token inefficiency of Markdown tables here.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  A Real-World Example: Measuring the Problem
&lt;/h2&gt;

&lt;p&gt;Let's dive into a simple simulation:&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Create a Large Markdown Table
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Build data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Person&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create DataFrame
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A very long row content, which leads to a lot of white spaces&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Convert to Markdown
&lt;/span&gt;&lt;span class="n"&gt;table_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This generates a verbose table with 30 rows and a very long header.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Analyze Token and Character Counts
&lt;/h3&gt;

&lt;p&gt;Using Cohere’s tokenizer (available via Hugging Face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cohere/Cohere-embed-multilingual-v3.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;encoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;num_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;num_chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Characters: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_chars&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Characters: &lt;strong&gt;2719&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Tokens: &lt;strong&gt;432&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚡ &lt;strong&gt;Problem&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
While token count is fine, character count exceeds 2048, causing API errors like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cohere.error.CohereAPIError: input text exceeds maximum allowed size of 2048 characters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Solution: Minifying Tables into JSON
&lt;/h2&gt;

&lt;p&gt;Instead of traditional Markdown, why not store the data in a &lt;strong&gt;dense JSON block&lt;/strong&gt;?&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of Minifying Tables:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Remove pipes, dashes, and whitespace: all the formatting overhead.&lt;/li&gt;
&lt;li&gt;Preserve semantic meaning.&lt;/li&gt;
&lt;li&gt;Shrink text to meet character limits safely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example of the compact JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"Person0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Person1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Person2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="nl"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"20"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"21"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"22"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="nl"&gt;"City"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"City0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"City1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"City2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New Stats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Characters: &lt;strong&gt;1027&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Tokens: &lt;strong&gt;461&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ Now well within Cohere’s input limit!&lt;/p&gt;
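&lt;p&gt;To make the transformation concrete, here is a minimal stdlib-only sketch of the idea: parse the pipe-delimited rows and re-emit them as a compact, column-oriented JSON string. This is my own simplified illustration of the technique, not the actual &lt;code&gt;pymtd2json&lt;/code&gt; implementation (which handles edge cases like escaped pipes and multiple tables per document):&lt;/p&gt;

```python
import json

def minify_markdown_table(md: str) -> str:
    """Collapse a simple GitHub-flavored Markdown table into a compact
    column-oriented JSON string. Simplified sketch of the minification
    idea -- not the real pymtd2json implementation."""
    lines = [l.strip() for l in md.strip().splitlines() if l.strip()]
    # Split each row on pipes and drop the empty edge cells
    rows = [[c.strip() for c in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    columns = {h: [r[i] for r in body] for i, h in enumerate(header)}
    # separators=(",", ":") removes all spaces from the JSON output
    return json.dumps(columns, separators=(",", ":"))

table = """
| Name    | Age | City  |
|---------|-----|-------|
| Person0 | 20  | City0 |
| Person1 | 21  | City1 |
"""
compact = minify_markdown_table(table)
print(compact)
print(f"Before: {len(table)} chars, after: {len(compact)} chars")
```

&lt;p&gt;Every character of alignment padding, pipes, and separator dashes disappears, while the column names and cell values survive intact, which is exactly why the character count drops so much faster than the token count.&lt;/p&gt;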

&lt;h2&gt;
  
  
  Applying Minification in Practice
&lt;/h2&gt;

&lt;p&gt;Want to prepare documents before chunking?&lt;br&gt;&lt;br&gt;
Here's how you can automatically process all Markdown files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleDirectoryReader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;your_minifier&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MinifyMDT&lt;/span&gt;

&lt;span class="n"&gt;source_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example_dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required_exts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;recursive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;doc_texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;doc_texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MinifyMDT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 And voilà: &lt;strong&gt;Your data is compact, clean, and embedding-ready!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Working with multilingual RAG systems means optimizing every byte.&lt;br&gt;&lt;br&gt;
Whitespace-heavy Markdown tables might look nice for humans, but they’re expensive for machine understanding.&lt;/p&gt;

&lt;p&gt;By minifying your tables with &lt;code&gt;pymtd2json&lt;/code&gt;, you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cut down API errors.&lt;/li&gt;
&lt;li&gt;Reduce token overhead.&lt;/li&gt;
&lt;li&gt;Boost overall performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Efficiency isn't optional; it's a superpower.&lt;/strong&gt; 🚀&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Supercharge Your Jupyter Notebook: SQL Command Magic for IPython</title>
      <dc:creator>Amadou Wolfgang Cisse</dc:creator>
      <pubDate>Tue, 18 Mar 2025 22:02:25 +0000</pubDate>
      <link>https://dev.to/amadou6e/supercharge-your-jupyter-notebook-sql-command-magic-for-ipython-1ahf</link>
      <guid>https://dev.to/amadou6e/supercharge-your-jupyter-notebook-sql-command-magic-for-ipython-1ahf</guid>
      <description>&lt;p&gt;Find the executable notebook &lt;a href="https://github.com/amadou-6e/ipython-sqlcmd/blob/main/usage/general_example.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Jupyter Notebooks are widely used for data analysis and scientific computing, but working with databases inside them has always been somewhat cumbersome. While libraries like &lt;code&gt;sqlite3&lt;/code&gt; or &lt;code&gt;pymssql&lt;/code&gt; provide connectivity, they require &lt;strong&gt;extra Python boilerplate&lt;/strong&gt; for managing connections, executing queries, and formatting results.  &lt;/p&gt;
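&lt;p&gt;To see what that boilerplate looks like in practice, here is a typical driver-level round trip, shown with the stdlib &lt;code&gt;sqlite3&lt;/code&gt; module (the table and data are made up for illustration; the pattern is the same with &lt;code&gt;pymssql&lt;/code&gt;):&lt;/p&gt;

```python
import sqlite3

# Connection, cursor, execution, commit, fetch, and formatting:
# all handled by hand before a single result is visible.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT, age INTEGER)")
cur.execute("INSERT INTO users VALUES (?, ?)", ("Ada", 36))
conn.commit()

cur.execute("SELECT name, age FROM users")
rows = cur.fetchall()
for name, age in rows:
    print(f"{name}: {age}")  # prints "Ada: 36"
conn.close()
```

&lt;p&gt;Six lines of ceremony around one line of SQL, repeated in every cell that touches the database.&lt;/p&gt;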

&lt;p&gt;Wouldn’t it be better if we could &lt;strong&gt;directly run SQL queries&lt;/strong&gt; inside a Jupyter Notebook, just like in SQL Server Management Studio (SSMS)?  &lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: SQL in Jupyter Notebooks
&lt;/h2&gt;

&lt;p&gt;Many data professionals need to execute SQL queries within a Jupyter Notebook. However, the existing approaches often come with drawbacks:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex Setup:&lt;/strong&gt; Managing database connections, cursors, and transactions manually.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verbosity:&lt;/strong&gt; Writing additional Python code to fetch and display query results.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Integration:&lt;/strong&gt; Difficult to run &lt;strong&gt;multi-statement SQL batches&lt;/strong&gt; using &lt;code&gt;GO&lt;/code&gt; commands.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of spending time writing extra Python code, what if we could just &lt;strong&gt;run SQL commands directly inside a cell&lt;/strong&gt;, as if we were in SSMS?  &lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: SQL Command Magic
&lt;/h2&gt;

&lt;p&gt;SQL Command Magic for IPython is an &lt;strong&gt;IPython extension&lt;/strong&gt; that integrates Microsoft’s &lt;code&gt;sqlcmd&lt;/code&gt; utility into Jupyter Notebooks. It allows users to execute &lt;strong&gt;native SQL queries inside Jupyter&lt;/strong&gt;, without any extra Python code.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Seamless SQL Execution&lt;/strong&gt; - Write SQL directly in notebook cells without additional Python code.&lt;br&gt;
✅ &lt;strong&gt;Built-in Connection Management&lt;/strong&gt; - Connect to Microsoft SQL Server dynamically.&lt;br&gt;
✅ &lt;strong&gt;Multi-Statement Execution&lt;/strong&gt; - Supports &lt;code&gt;GO&lt;/code&gt; statements for executing multiple queries at once.&lt;br&gt;
✅ &lt;strong&gt;Variable Substitution&lt;/strong&gt; - Pass Python variables directly into SQL queries.&lt;br&gt;
✅ &lt;strong&gt;Debugging Support&lt;/strong&gt; - Use &lt;code&gt;--debug&lt;/code&gt; to analyze query execution details.  &lt;/p&gt;
&lt;h2&gt;
  
  
  Installation and Setup
&lt;/h2&gt;

&lt;p&gt;Find the executable notebook &lt;a href="https://github.com/amadou-6e/ipython-sqlcmd/blob/main/usage/general_example.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Install the Extension
&lt;/h3&gt;

&lt;p&gt;First, install the required packages (&lt;code&gt;python-dotenv&lt;/code&gt; is used later to load credentials from a &lt;code&gt;.env&lt;/code&gt; file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ipython-sqlcmd python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Load the Extension
&lt;/h3&gt;

&lt;p&gt;In your Jupyter Notebook, load the extension using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;load_ext&lt;/span&gt; &lt;span class="n"&gt;sqlcmd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enables the &lt;code&gt;%sqlcmd&lt;/code&gt; magic command inside Jupyter.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Connect to SQL Server
&lt;/h3&gt;

&lt;p&gt;To connect to a &lt;strong&gt;SQL Server instance&lt;/strong&gt;, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;sqlcmd&lt;/span&gt; &lt;span class="n"&gt;master&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;localhost&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sa&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SSMS_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;encrypt&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;trust&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;certificate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can replace &lt;code&gt;localhost&lt;/code&gt; and the credentials with your own connection details. Note that the password is read from the &lt;code&gt;SSMS_PASSWORD&lt;/code&gt; environment variable rather than hard-coded in the notebook.  &lt;/p&gt;
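&lt;p&gt;The connection line above assumes &lt;code&gt;os&lt;/code&gt; has been imported and &lt;code&gt;SSMS_PASSWORD&lt;/code&gt; is set. A minimal, standard-library-only way to wire that up in a cell before connecting (&lt;code&gt;python-dotenv&lt;/code&gt;’s &lt;code&gt;load_dotenv()&lt;/code&gt; can populate the environment from a &lt;code&gt;.env&lt;/code&gt; file first; the &lt;code&gt;"placeholder"&lt;/code&gt; fallback below is purely for illustration):&lt;/p&gt;

```python
import os

# Avoid hard-coding secrets in the notebook: read the password from
# the environment. Set SSMS_PASSWORD in your shell, or load it from a
# .env file with python-dotenv's load_dotenv() before this cell runs.
os.environ.setdefault("SSMS_PASSWORD", "placeholder")  # demo-only fallback
password = os.environ["SSMS_PASSWORD"]
assert password, "SSMS_PASSWORD is not set"  # fail fast if missing
```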

&lt;h2&gt;
  
  
  Running SQL Queries in Jupyter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Simple Query
&lt;/h3&gt;

&lt;p&gt;Once connected, you can execute &lt;strong&gt;SQL commands&lt;/strong&gt; inside a notebook cell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;sqlcmd&lt;/span&gt;
&lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;TOP&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt; 
&lt;span class="n"&gt;ORDER&lt;/span&gt; &lt;span class="n"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the first &lt;strong&gt;10 rows&lt;/strong&gt; of the &lt;code&gt;sys.tables&lt;/code&gt; catalog view, ordered by name, just like in SSMS.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Creating and Populating Tables
&lt;/h3&gt;

&lt;p&gt;Creating tables and inserting data is just as straightforward. Let’s create a table and insert a few rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;sqlcmd&lt;/span&gt;
&lt;span class="n"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;TABLE&lt;/span&gt; &lt;span class="nc"&gt;TestSpaces &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ID&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Description&lt;/span&gt; &lt;span class="nf"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Code&lt;/span&gt; &lt;span class="nf"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;INSERT&lt;/span&gt; &lt;span class="n"&gt;INTO&lt;/span&gt; &lt;span class="nc"&gt;TestSpaces &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;VALUES&lt;/span&gt; 
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;This has spaces&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Another spaced value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No spaces&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;C3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;TestSpaces&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will &lt;strong&gt;create the table&lt;/strong&gt;, insert some values, and return the data in a single execution.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Using Python Variables Inside Queries
&lt;/h3&gt;

&lt;p&gt;You can use &lt;strong&gt;Python variables&lt;/strong&gt; to dynamically modify your SQL queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sys.tables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;sqlcmd&lt;/span&gt;
&lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;TOP&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt; 
&lt;span class="n"&gt;ORDER&lt;/span&gt; &lt;span class="n"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;$limit&lt;/code&gt; and &lt;code&gt;$table_name&lt;/code&gt; placeholders are automatically replaced with the Python variables before execution.  &lt;/p&gt;
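&lt;p&gt;To see the mechanism in miniature: this style of &lt;code&gt;$name&lt;/code&gt; substitution can be done with Python’s &lt;code&gt;string.Template&lt;/code&gt;. A rough sketch of the idea, not necessarily the extension’s actual implementation:&lt;/p&gt;

```python
from string import Template

def substitute(sql: str, namespace: dict) -> str:
    """Replace $name placeholders in a SQL string with values from the
    given namespace (e.g. the notebook's global variables)."""
    return Template(sql).safe_substitute(namespace)

query = "SELECT TOP $limit * FROM $table_name ORDER BY name"
rendered = substitute(query, {"limit": 5, "table_name": "sys.tables"})
print(rendered)  # SELECT TOP 5 * FROM sys.tables ORDER BY name
```

Note that this is plain text substitution, not parameter binding, so it should only be used with trusted values.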

&lt;h2&gt;
  
  
  Executing External SQL Scripts
&lt;/h2&gt;

&lt;p&gt;SQL Command Magic also supports executing &lt;strong&gt;external SQL files&lt;/strong&gt;, making it useful for &lt;strong&gt;database migrations&lt;/strong&gt; or &lt;strong&gt;schema setup&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;sqlcmd&lt;/span&gt;
&lt;span class="n"&gt;EXECUTE_SQL_FILE&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;../src/tests/empty.sql&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will run all SQL commands inside &lt;code&gt;empty.sql&lt;/code&gt;.  &lt;/p&gt;
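&lt;p&gt;&lt;code&gt;sqlcmd&lt;/code&gt; handles script files natively; the underlying read-and-execute pattern is the same in most drivers. A minimal sketch with the standard library’s &lt;code&gt;sqlite3&lt;/code&gt; (the file name and contents below are invented for the demo):&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

# Write a small SQL script to disk (stands in for your migration file).
script = """
CREATE TABLE users (id INTEGER, name TEXT);
INSERT INTO users VALUES (1, 'ada'), (2, 'linus');
"""
path = os.path.join(tempfile.mkdtemp(), "setup.sql")
with open(path, "w") as f:
    f.write(script)

# Read the file back and run every statement in it.
conn = sqlite3.connect(":memory:")
with open(path) as f:
    conn.executescript(f.read())

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 2
```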

&lt;h2&gt;
  
  
  Debugging Queries
&lt;/h2&gt;

&lt;p&gt;To troubleshoot execution issues, enable &lt;strong&gt;debug mode&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;sqlcmd&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;
&lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;@VERSION&lt;/span&gt; &lt;span class="n"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;SQLServerVersion&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This outputs &lt;strong&gt;detailed execution logs&lt;/strong&gt;, showing how the query was processed.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Running Multiple SQL Batches
&lt;/h2&gt;

&lt;p&gt;Unlike standard SQL execution in Jupyter, &lt;strong&gt;SQL Command Magic&lt;/strong&gt; fully supports &lt;strong&gt;multi-statement execution&lt;/strong&gt; using &lt;code&gt;GO&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;sqlcmd&lt;/span&gt;
&lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="nc"&gt;DB_NAME&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;CurrentDatabase&lt;/span&gt;
&lt;span class="n"&gt;GO&lt;/span&gt;
&lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;@SERVERNAME&lt;/span&gt; &lt;span class="n"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ServerName&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each query batch executes separately, just like in SSMS.  &lt;/p&gt;
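&lt;p&gt;Conceptually, batch handling just means splitting the cell on lines that contain only the &lt;code&gt;GO&lt;/code&gt; keyword and sending each piece to the server in turn. A rough sketch of the splitting step (the extension delegates the real work to &lt;code&gt;sqlcmd&lt;/code&gt;, which does this for you):&lt;/p&gt;

```python
import re

def split_batches(script: str) -> list:
    """Split a T-SQL script into batches on lines that contain only
    the GO keyword (case-insensitive), as sqlcmd does."""
    batches = re.split(r"(?im)^\s*GO\s*$", script)
    return [b.strip() for b in batches if b.strip()]

script = """SELECT DB_NAME() AS CurrentDatabase
GO
SELECT @@SERVERNAME AS ServerName"""
for batch in split_batches(script):
    print(batch)
```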

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SQL Command Magic for IPython is a &lt;strong&gt;simple yet powerful&lt;/strong&gt; tool for running SQL queries inside Jupyter Notebooks. It removes unnecessary Python boilerplate, enables &lt;strong&gt;multi-statement execution&lt;/strong&gt;, and integrates seamlessly with &lt;strong&gt;Microsoft SQL Server&lt;/strong&gt;.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Key Benefits
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Reduces Boilerplate&lt;/strong&gt; – No need to write extra Python code for database connections.&lt;br&gt;
✅ &lt;strong&gt;More Natural SQL Workflow&lt;/strong&gt; – Execute queries just like in SSMS.&lt;br&gt;
✅ &lt;strong&gt;Advanced Features&lt;/strong&gt; – Supports &lt;code&gt;GO&lt;/code&gt; statements, variable substitution, and script execution.  &lt;/p&gt;

&lt;p&gt;If you frequently run &lt;strong&gt;SQL queries in Jupyter&lt;/strong&gt;, this extension is a &lt;strong&gt;game changer&lt;/strong&gt;.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Get Started
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ipython-sqlcmd python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try it out and let me know your thoughts!&lt;/p&gt;

</description>
      <category>sqlserver</category>
      <category>jupyter</category>
      <category>python</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
