DEV Community

Nithin Bharadwaj

How to Reverse Engineer Any Database Structure Using Python


Let's talk about databases. They're everywhere, holding the information that makes our digital world run. But sometimes, you encounter a database you didn't build. There's no manual, no diagram, just raw tables and columns. Your job is to figure out how it all fits together. This process of figuring out an existing database's structure is what we call reverse engineering. Python is an excellent tool for this job, and I want to share some practical ways to use it.

I'll walk you through several techniques, from the simple to the more complex, with plenty of code you can try yourself. Think of it like being a digital archaeologist, carefully brushing away the dust to reveal the blueprint of an ancient system.

The first and most fundamental step is connecting to the database and asking it directly about its structure. Every major database system—like SQLite, PostgreSQL, or MySQL—has a special set of system tables that store this blueprint. In Python, we use libraries to talk to these systems.

Here's a basic starter. We create a class that can connect to different types of databases and fetch a list of all the tables.

import sqlite3
import psycopg2
import mysql.connector

class DatabaseExplorer:
    def __init__(self, db_type='sqlite', **connection_params):
        self.db_type = db_type
        self.connection_params = connection_params
        self.connection = None

    def connect(self):
        """Establishes a connection to the database."""
        if self.db_type == 'sqlite':
            self.connection = sqlite3.connect(**self.connection_params)
        elif self.db_type == 'postgresql':
            self.connection = psycopg2.connect(**self.connection_params)
        elif self.db_type == 'mysql':
            self.connection = mysql.connector.connect(**self.connection_params)
        else:
            raise ValueError(f"This database type isn't supported yet: {self.db_type}")
        print(f"Connected to {self.db_type} database.")
        return self.connection

# Let's try it with a SQLite file.
explorer = DatabaseExplorer('sqlite', database='mystery_database.db')
explorer.connect()
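A quick aside: if you don't have a mystery file on hand, you can fabricate a tiny throwaway database in memory and point the same catalog query at it. The schema below is invented purely for illustration.

```python
import sqlite3

# Build a small disposable database entirely in memory.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE customers (
        cust_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        cust_id  INTEGER REFERENCES customers(cust_id),
        amount   REAL
    );
""")

# The same catalog query the explorer runs for SQLite.
cursor = conn.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = [row[0] for row in cursor.fetchall()
          if not row[0].startswith('sqlite_')]
print(f"Found tables: {tables}")
```

Everything that follows can be tried against this in-memory connection instead of a real file.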

Once you're connected, you need to know what you're looking at. You query the database's internal catalog. For a SQLite database, you ask the sqlite_master table. For PostgreSQL or MySQL, you query the information_schema. The code below gets the table names.

    def get_table_names(self):
        """Fetches all non-system table names from the database."""
        cursor = self.connection.cursor()
        tables = []

        if self.db_type == 'sqlite':
            cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
            for row in cursor.fetchall():
                table_name = row[0]
                # Filter out internal SQLite tables
                if not table_name.startswith('sqlite_'):
                    tables.append(table_name)

        elif self.db_type == 'postgresql':
            cursor.execute("""
                SELECT table_name
                FROM information_schema.tables
                WHERE table_schema = 'public';
            """)
            tables = [row[0] for row in cursor.fetchall()]

        elif self.db_type == 'mysql':
            cursor.execute("""
                SELECT table_name
                FROM information_schema.tables
                WHERE table_schema = DATABASE();
            """)
            tables = [row[0] for row in cursor.fetchall()]

        cursor.close()
        return tables

# Using our explorer
table_list = explorer.get_table_names()
print(f"Found tables: {table_list}")

Knowing table names is just the beginning. The real details are in the columns. What's each column called? What type of data does it hold? Is it a primary key? We can write a method to inspect each table closely. Let's look at a SQLite example.

    def inspect_table(self, table_name):
        """Gets detailed column information for a specific table."""
        cursor = self.connection.cursor()
        columns_info = []

        if self.db_type == 'sqlite':
            # PRAGMA is SQLite's special command for internal info
            cursor.execute(f"PRAGMA table_info('{table_name}');")
            for row in cursor.fetchall():
                # row format: (cid, name, type, notnull, default_value, pk)
                col_id, col_name, col_type, not_null, default_val, is_primary = row
                columns_info.append({
                    'name': col_name,
                    'type': col_type,
                    'can_be_null': not_null == 0,  # 0 means NULL is allowed
                    'default_value': default_val,
                    'is_primary_key': is_primary == 1
                })

        cursor.close()
        return columns_info

# Inspect a table called 'users'
user_columns = explorer.inspect_table('users')
for col in user_columns:
    pk_flag = " (Primary Key)" if col['is_primary_key'] else ""
    print(f"  - {col['name']}: {col['type']}{pk_flag}")
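Columns aren't the only structural detail worth recovering. Indexes often reveal which lookups the original designers cared about. Here's one possible sketch for SQLite, built on its PRAGMA index_list and PRAGMA index_info commands; the auto-generated index name in the output is whatever SQLite assigns for the UNIQUE constraint.

```python
import sqlite3

def inspect_indexes(connection, table_name):
    """Lists each index on a table and the columns it covers (SQLite)."""
    cursor = connection.cursor()
    cursor.execute(f"PRAGMA index_list('{table_name}');")
    indexes = []
    # Grab all rows first, because index_info() below reuses the cursor.
    index_rows = cursor.fetchall()
    for row in index_rows:
        # row[1] is the index name, row[2] the unique flag
        index_name, is_unique = row[1], bool(row[2])
        cursor.execute(f"PRAGMA index_info('{index_name}');")
        columns = [info[2] for info in cursor.fetchall()]
        indexes.append({'name': index_name,
                        'unique': is_unique,
                        'columns': columns})
    cursor.close()
    return indexes

# UNIQUE constraints create indexes automatically, so this table has one.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
indexes = inspect_indexes(conn, 'users')
print(indexes)
```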

Tables rarely live in isolation. They are connected through foreign keys, which are crucial for understanding the database's logic. A foreign key in one table points to the primary key in another, creating a relationship. Finding these links is our next technique.

    def find_foreign_keys(self, table_name):
        """Identifies relationships where this table references others."""
        cursor = self.connection.cursor()
        relationships = []

        if self.db_type == 'sqlite':
            cursor.execute(f"PRAGMA foreign_key_list('{table_name}');")
            for row in cursor.fetchall():
                # row: (id, seq, table, from, to, on_update, on_delete, match)
                _, _, target_table, from_column, to_column, _, _, _ = row
                relationships.append({
                    'from_column': from_column,
                    'to_table': target_table,
                    'to_column': to_column
                })

        cursor.close()
        return relationships

# Find what the 'orders' table connects to
order_links = explorer.find_foreign_keys('orders')
for link in order_links:
    print(f"The 'orders.{link['from_column']}' column points to '{link['to_table']}.{link['to_column']}'.")
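One caveat: many legacy databases never declare their foreign keys at all, so the pragma query can come back empty even when relationships clearly exist in the data. In that situation a naming-convention heuristic can suggest candidates to verify by hand. This is a rough sketch, and the `_id` suffix convention is an assumption about the schema, not a guarantee.

```python
def guess_relationships(schema_columns):
    """Suggests probable foreign keys from column names alone.

    schema_columns maps table name -> list of column names. A column like
    'customer_id' is guessed to reference a table named 'customers' or
    'customer'. Heuristic only: every suggestion needs human verification.
    """
    table_names = set(schema_columns)
    guesses = []
    for table, columns in schema_columns.items():
        for col in columns:
            if not col.endswith('_id'):
                continue
            base = col[:-3]                       # 'customer_id' -> 'customer'
            for candidate in (base + 's', base):  # try the plural first
                if candidate in table_names and candidate != table:
                    guesses.append({'from_table': table,
                                    'from_column': col,
                                    'to_table': candidate})
                    break
    return guesses

# Column lists as discovered by earlier inspection (names are illustrative)
schema = {
    'customers': ['cust_id', 'name'],
    'orders': ['order_id', 'customer_id', 'amount'],
}
guesses = guess_relationships(schema)
print(guesses)
```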

Sometimes you don't have a live database connection. All you have is a giant .sql file—a dump of the database structure. This is where SQL parsing comes in. Instead of querying a live system, you read the text file and look for patterns. It's like reading the architect's original notes instead of touring the building.

We can use regular expressions, or more powerful tools like the sqlparse library, to break down the SQL script.

import re
import sqlparse

def parse_sql_dump(file_path):
    """Reads an SQL file and extracts CREATE TABLE statements."""
    with open(file_path, 'r') as f:
        sql_content = f.read()

    # Use sqlparse to split the dump into individual statements
    statements = sqlparse.parse(sql_content)
    table_defs = {}

    for statement in statements:
        raw = str(statement).strip()

        # Look for the start of a CREATE TABLE command
        if raw.lower().startswith('create table'):
            # A very simple regex to get the table name.
            # Warning: this is fragile for complex SQL.
            match = re.search(
                r'create\s+table\s+(?:if\s+not\s+exists\s+)?[`"]?(\w+)[`"]?',
                raw, re.IGNORECASE)
            if match:
                table_name = match.group(1)
                table_defs[table_name] = raw  # Store the original statement

    return table_defs

# Parse a file
definitions = parse_sql_dump('legacy_dump.sql')
print(f"Found definitions for tables: {list(definitions.keys())}")

With the raw CREATE TABLE text, we can do more precise extraction. We want to pull out column definitions, constraints, and keys. Let's write a more focused parser for a single statement.

def analyze_create_statement(create_sql):
    """Takes a single CREATE TABLE SQL string and breaks it down."""
    # Find the part between the outermost parentheses
    start = create_sql.find('(')
    end = create_sql.rfind(')')
    if start == -1 or end == -1:
        return {}

    inner_content = create_sql[start + 1:end]

    # Split on commas, but only at parenthesis depth zero, so commas
    # inside DECIMAL(10,2) or CHECK constraints don't split a definition.
    definitions, depth, current = [], 0, ''
    for ch in inner_content:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        if ch == ',' and depth == 0:
            definitions.append(current.strip())
            current = ''
        else:
            current += ch
    if current.strip():
        definitions.append(current.strip())

    columns = []
    primary_key = None
    foreign_keys = []

    for line in definitions:
        if not line:
            continue
        line_upper = line.upper()
        # Check table-level constraints first; otherwise a line like
        # "FOREIGN KEY (...)" would be mistaken for a column definition.
        if line_upper.startswith('PRIMARY KEY'):
            # Find column(s) inside parentheses
            match = re.search(r'PRIMARY KEY\s*\(([^)]+)\)', line, re.IGNORECASE)
            if match:
                primary_key = [col.strip(' `"') for col in match.group(1).split(',')]
        elif line_upper.startswith('FOREIGN KEY') or line_upper.startswith('CONSTRAINT'):
            # Match pattern: FOREIGN KEY (col1) REFERENCES other_table(other_col)
            match = re.search(
                r'FOREIGN KEY\s*\(([^)]+)\)\s*REFERENCES\s*[`"]?(\w+)[`"]?\s*\(([^)]+)\)',
                line, re.IGNORECASE)
            if match:
                from_col, to_table, to_col = match.groups()
                foreign_keys.append({
                    'from_column': from_col.strip(' `"'),
                    'to_table': to_table,
                    'to_column': to_col.strip(' `"')
                })
        elif line.startswith('`') or line.startswith('"') or line[0].isalpha():
            parts = line.split()
            col_name = parts[0].strip('`"')
            col_type = parts[1] if len(parts) > 1 else 'unknown'
            columns.append({'name': col_name, 'type': col_type})
            # Catch a column-level key, e.g. "order_id INTEGER PRIMARY KEY"
            if 'PRIMARY KEY' in line_upper and primary_key is None:
                primary_key = [col_name]

    return {
        'columns': columns,
        'primary_key': primary_key,
        'foreign_keys': foreign_keys
    }

# Test it
sample_sql = """
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    amount DECIMAL(10,2),
    FOREIGN KEY (customer_id) REFERENCES customers(cust_id)
);
"""
table_info = analyze_create_statement(sample_sql)
print(f"Columns: {[c['name'] for c in table_info['columns']]}")
print(f"Primary Key: {table_info['primary_key']}")
print(f"Foreign Keys: {table_info['foreign_keys']}")

After gathering all this raw data—tables, columns, keys—you have a collection of facts. The next step is to build a model that shows how they connect. This is about moving from a list to a map. I like to build a dictionary where each table knows about its connections.

class SchemaModel:
    def __init__(self):
        self.tables = {}  # Format: { 'table_name': { 'columns': [], 'outgoing_links': [], 'incoming_links': [] } }

    def add_table(self, name, columns, primary_key, foreign_keys):
        """Adds a table and its immediate foreign key connections."""
        self.tables[name] = {
            'columns': columns,
            'primary_key': primary_key,
            'outgoing_links': foreign_keys,  # Links FROM this table
            'incoming_links': []             # Links TO this table (will be filled later)
        }

    def build_relationships(self):
        """Processes all tables to find bidirectional relationships."""
        # First, reset incoming links
        for table_info in self.tables.values():
            table_info['incoming_links'] = []

        # For each foreign key in a table, tell the target table about it.
        for source_table, info in self.tables.items():
            for fk in info['outgoing_links']:
                target_table = fk['to_table']
                if target_table in self.tables:
                    # Add a note to the target table that someone points to it.
                    self.tables[target_table]['incoming_links'].append({
                        'from_table': source_table,
                        'from_column': fk['from_column'],
                        'to_column': fk['to_column']
                    })

    def find_orphaned_tables(self):
        """Identifies tables with no relationships to others."""
        orphaned = []
        for name, info in self.tables.items():
            if not info['outgoing_links'] and not info['incoming_links']:
                orphaned.append(name)
        return orphaned

# Build a simple model
model = SchemaModel()
model.add_table('customers', [{'name':'cust_id', 'type':'INT'}], ['cust_id'], [])
model.add_table('orders', [{'name':'order_id', 'type':'INT'}, {'name':'cust_id', 'type':'INT'}],
                ['order_id'],
                [{'from_column':'cust_id', 'to_table':'customers', 'to_column':'cust_id'}])
model.build_relationships()

print(f"Customers is linked from: {model.tables['customers']['incoming_links']}")
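One practical payoff of knowing the link structure is that you can compute an order in which the tables could be recreated or reloaded, referenced tables first. Here's a minimal sketch using the standard library's graphlib (Python 3.9+); it takes a plain dict of dependencies, matching the outgoing links the model tracks, rather than the SchemaModel object itself.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def creation_order(dependencies):
    """Returns a table order with referenced tables first.

    dependencies maps each table to the tables it points at via foreign
    keys. Raises graphlib.CycleError if the schema has circular references,
    which real legacy schemas sometimes do.
    """
    return list(TopologicalSorter(dependencies).static_order())

# 'orders' references 'customers'; 'items' references 'orders'.
deps = {'customers': [], 'orders': ['customers'], 'items': ['orders']}
order = creation_order(deps)
print(order)
```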

A picture is worth a thousand table definitions. Once you have a model, visualizing it helps you and others understand the design instantly. We can use Graphviz, or networkx together with matplotlib, to draw a simple graph.

Here’s a basic example using networkx and matplotlib to plot tables as nodes and foreign keys as edges.

import matplotlib.pyplot as plt
import networkx as nx

def visualize_schema(model):
    """Creates a simple directed graph of the database schema."""
    G = nx.DiGraph()  # Directed Graph

    # Add nodes (tables)
    for table_name in model.tables:
        G.add_node(table_name)

    # Add edges (foreign key relationships)
    for source_table, info in model.tables.items():
        for fk in info['outgoing_links']:
            target_table = fk['to_table']
            if target_table in model.tables:
                # Label the edge with the column name
                G.add_edge(source_table, target_table, label=fk['from_column'])

    # Draw the graph
    pos = nx.spring_layout(G, seed=42)  # Layout for consistent placement
    plt.figure(figsize=(10, 8))

    nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=2000)
    nx.draw_networkx_labels(G, pos, font_weight='bold')

    # Draw edges with labels
    edge_labels = nx.get_edge_attributes(G, 'label')
    nx.draw_networkx_edges(G, pos, arrowstyle='-|>', arrowsize=20)
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)

    plt.title("Database Schema Relationships")
    plt.axis('off')  # Turn off the axis
    plt.tight_layout()
    plt.show()

# Visualize our small model
visualize_schema(model)
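If you'd rather hand the diagram to someone without a Python environment, you can also emit Graphviz DOT text and render it with the dot command-line tool or any online viewer. A small sketch that reads the same outgoing_links shape used above:

```python
def schema_to_dot(tables):
    """Emits Graphviz DOT text for the schema.

    Expects the same dict shape SchemaModel builds; only the
    'outgoing_links' entries are read here.
    """
    lines = ['digraph schema {', '  node [shape=box];']
    for table in tables:
        lines.append(f'  "{table}";')
    for table, info in tables.items():
        for fk in info.get('outgoing_links', []):
            lines.append(f'  "{table}" -> "{fk["to_table"]}"'
                         f' [label="{fk["from_column"]}"];')
    lines.append('}')
    return '\n'.join(lines)

tables = {
    'customers': {'outgoing_links': []},
    'orders': {'outgoing_links': [{'from_column': 'cust_id',
                                   'to_table': 'customers'}]},
}
dot_text = schema_to_dot(tables)
print(dot_text)
```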

The final technique is about putting it all together into a useful report. You've discovered the structure; now document it. Generate a Markdown file, a JSON schema, or even SQL CREATE statements that reflect your understanding. This creates a living document for future developers.

from datetime import datetime

def generate_markdown_report(model, filename='schema_report.md'):
    """Produces a human-readable Markdown document of the schema."""
    with open(filename, 'w') as f:
        f.write("# Database Schema Analysis Report\n\n")
        f.write(f"*Generated on: {datetime.now().strftime('%Y-%m-%d %H:%M')}*\n")
        f.write(f"*Total Tables Analyzed: {len(model.tables)}*\n\n")

        for table_name, info in model.tables.items():
            f.write(f"## Table: `{table_name}`\n\n")

            # Columns
            f.write("### Columns\n")
            f.write("| Name | Type | Primary Key |\n")
            f.write("|------|------|-------------|\n")
            for col in info['columns']:
                is_pk = "Yes" if col['name'] in (info['primary_key'] or []) else "No"
                f.write(f"| {col['name']} | {col['type']} | {is_pk} |\n")
            f.write("\n")

            # Relationships
            f.write("### Relationships\n")
            if info['outgoing_links']:
                f.write("**This table references:**\n")
                for fk in info['outgoing_links']:
                    f.write(f"- `{table_name}.{fk['from_column']}` -> `{fk['to_table']}.{fk['to_column']}`\n")
            if info['incoming_links']:
                f.write("\n**Referenced by:**\n")
                for link in info['incoming_links']:
                    f.write(f"- `{link['from_table']}.{link['from_column']}` -> `{table_name}.{link['to_column']}`\n")
            if not info['outgoing_links'] and not info['incoming_links']:
                f.write("*No foreign key relationships found.*\n")
            f.write("\n---\n\n")

    print(f"Report written to {filename}")

# Generate the report
generate_markdown_report(model)
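The Markdown report is for humans; for tooling, a JSON version of the same model is handy. A minimal sketch, where the dict mirrors the shape SchemaModel.tables holds and the contents are illustrative:

```python
import json

def schema_to_json(tables):
    """Serializes the discovered model as pretty-printed JSON, a
    machine-readable companion to the Markdown report."""
    return json.dumps(tables, indent=2, sort_keys=True)

# The same dict shape SchemaModel.tables holds (contents illustrative)
schema = {
    'customers': {'columns': [{'name': 'cust_id', 'type': 'INT'}],
                  'primary_key': ['cust_id']},
}
json_text = schema_to_json(schema)
print(json_text)
```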

Working through an unknown database can feel daunting. I remember once facing a database with over 200 tables, named with cryptic abbreviations, and no documentation. By applying these Python techniques systematically—first mapping the tables, then drawing the connections—I slowly pieced together the business logic it represented. The moment the visualization graph appeared, showing a clear central hub of customer data, the entire design intention became obvious. Python didn't just give me answers; it gave me a way to ask the right questions. These methods turn an opaque data store into a clear, manageable structure you can understand, document, and work with effectively.
