<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fahad Shah</title>
    <description>The latest articles on DEV Community by Fahad Shah (@1fahadshah).</description>
    <link>https://dev.to/1fahadshah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3056863%2Fe62de53c-f164-43f5-9858-eb945f5197fb.jpg</url>
      <title>DEV Community: Fahad Shah</title>
      <link>https://dev.to/1fahadshah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/1fahadshah"/>
    <language>en</language>
    <item>
      <title>Designing Scalable SQLite Schemas for Python Apps</title>
      <dc:creator>Fahad Shah</dc:creator>
      <pubDate>Tue, 19 Aug 2025 17:37:53 +0000</pubDate>
      <link>https://dev.to/1fahadshah/designing-scalable-sqlite-schemas-for-python-apps-27gd</link>
      <guid>https://dev.to/1fahadshah/designing-scalable-sqlite-schemas-for-python-apps-27gd</guid>
      <description>&lt;p&gt;&lt;em&gt;(The Foundation Every Systems Builder Needs — by 1FahadShah)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most beginners treat SQLite as a toy database.&lt;br&gt;
I learned the hard way: your schema decisions today decide whether your project survives tomorrow.&lt;/p&gt;

&lt;p&gt;In my Python journey (Course 4 of Python for Everybody), I stopped thinking of SQLite as “just storage” — and started treating it like the backbone of real pipelines.&lt;/p&gt;

&lt;p&gt;Here’s how I approached schema design so my projects didn’t collapse the moment they touched real-world data.&lt;/p&gt;
&lt;h2&gt;
  
  
  🚧 The Naive Approach (and Why It Breaks)
&lt;/h2&gt;

&lt;p&gt;Early scripts often look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One table for everything&lt;/li&gt;
&lt;li&gt;CSV-like storage&lt;/li&gt;
&lt;li&gt;Fields crammed together with no normalization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works for a single file. It dies once you hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple data sources&lt;/li&gt;
&lt;li&gt;Relationships between entities&lt;/li&gt;
&lt;li&gt;Queries that need speed and accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: duplication, inconsistency, and painful debugging.&lt;/p&gt;
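To make the failure mode concrete, here is a hypothetical one-table version of an email store (table and column names are invented for this sketch, not from any real project). Every message row carries a full copy of the sender's address:

```python
import sqlite3

# Hypothetical "everything in one table" design: sender details repeat per row.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Mail (sender_email TEXT, subject TEXT)")
cur.executemany(
    "INSERT INTO Mail VALUES (?, ?)",
    [("ana@example.com", "Hi"),
     ("ana@example.com", "Re: Hi"),
     ("bob@example.com", "Report")],
)

# The same address is stored once per message, so fixing a typo in it
# means rewriting every row that sender ever produced.
cur.execute("SELECT COUNT(*) FROM Mail WHERE sender_email = 'ana@example.com'")
dup_rows = cur.fetchone()[0]
print(dup_rows)
```

Two rows already hold the identical string, and the count grows with every message. That duplication is exactly what normalization removes.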


&lt;h2&gt;
  
  
  ✅ The Scalable Schema Mindset
&lt;/h2&gt;

&lt;p&gt;I shifted to a schema-first approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People, Messages, Logs, Transactions&lt;/li&gt;
&lt;li&gt;Each gets its own table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Normalize Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No repeated emails or usernames scattered across rows.&lt;/li&gt;
&lt;li&gt;Relationships are modeled once, referenced many times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Think in Queries&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema isn’t just storage.&lt;/li&gt;
&lt;li&gt;It’s the shape of the answers you’ll need later.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🗄 Example: Email System Schema
&lt;/h2&gt;

&lt;p&gt;Here’s a simplified schema I built while parsing large email archives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;person_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sent_at&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it scales:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Person table stores each unique sender once.&lt;/li&gt;
&lt;li&gt;Message table references the person via person_id.&lt;/li&gt;
&lt;li&gt;No duplication, fast lookups, easy aggregation.&lt;/li&gt;
&lt;/ul&gt;
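The "easy aggregation" point is worth seeing in action. Here is a sketch, using the Person/Message schema above built in memory with a few invented rows, that counts messages per sender with a single JOIN:

```python
import sqlite3

# In-memory demo of the Person/Message schema, seeded with invented rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Person (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE
);
CREATE TABLE Message (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    person_id INTEGER,
    sent_at   TEXT,
    subject   TEXT,
    FOREIGN KEY (person_id) REFERENCES Person(id)
);
""")
cur.execute("INSERT INTO Person (email) VALUES ('ana@example.com'), ('bob@example.com')")
cur.executemany(
    "INSERT INTO Message (person_id, sent_at, subject) VALUES (?, ?, ?)",
    [(1, "2025-08-19 10:00:00", "Hi"),
     (1, "2025-08-19 11:00:00", "Re: Hi"),
     (2, "2025-08-19 12:00:00", "Report")],
)

# Messages per sender: one JOIN, no duplicated email strings anywhere.
cur.execute("""
    SELECT Person.email, COUNT(*) AS n
    FROM Message JOIN Person ON Message.person_id = Person.id
    GROUP BY Person.email
    ORDER BY n DESC
""")
rows = cur.fetchall()
for email, n in rows:
    print(email, n)
```

Each email string lives in exactly one Person row; the aggregate only ever touches integer foreign keys.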

&lt;h3&gt;
  
  
  Connecting with Python:
&lt;/h3&gt;

&lt;p&gt;Here’s how cleanly you can now add a new message. Notice how we look up the person's id first, ensuring no duplicate data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_db.sqlite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Assume the schema from above is already created
&lt;/span&gt;
&lt;span class="n"&gt;sender_email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;new.sender@example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;message_subject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Important Update&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-08-19 22:50:00&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Find or create the person
&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;INSERT OR IGNORE INTO Person (email) VALUES (?)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender_email&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SELECT id FROM Person WHERE email = ?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender_email&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;span class="n"&gt;person_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Insert the message with the foreign key
&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
    INSERT INTO Message (person_id, sent_at, subject)
    VALUES (?, ?, ?)
&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message_subject&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔑 Lessons That Stick
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Schemas aren’t an afterthought — they are the system.&lt;/li&gt;
&lt;li&gt;Clean separation of entities → fewer bugs, easier joins.&lt;/li&gt;
&lt;li&gt;Good schemas survive when you evolve from scripts → services → pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why I call schema design my first systems upgrade. It’s where scripts stop being disposable and start becoming infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 Why This Matters for AI Systems
&lt;/h2&gt;

&lt;p&gt;Most “AI engineers” ignore databases.&lt;br&gt;
But every LLM workflow is powered by structured + semi-structured data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parsing messy logs? → store clean.&lt;/li&gt;
&lt;li&gt;Building embeddings? → index consistently.&lt;/li&gt;
&lt;li&gt;Agent workflows? → Modern AI using RAG (Retrieval-Augmented Generation) needs queryable memory. A good schema is the foundation for reliable context retrieval.&lt;/li&gt;
&lt;/ul&gt;
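At its simplest, "queryable memory" is just a clean table you can filter. A hypothetical sketch (the memory table and its columns are illustrative only, not part of any RAG framework):

```python
import sqlite3

# Hypothetical "agent memory" table; names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE memory (topic TEXT, chunk TEXT, created_at TEXT)")
cur.executemany(
    "INSERT INTO memory VALUES (?, ?, ?)",
    [("sqlite", "SQLite is an embedded SQL database.", "2025-08-19"),
     ("python", "Python ships sqlite3 in the standard library.", "2025-08-19")],
)

# With a clean schema, context retrieval is just a filtered query.
cur.execute("SELECT chunk FROM memory WHERE topic = ?", ("sqlite",))
context = cur.fetchone()[0]
print(context)
```

Swap the equality filter for vector similarity later and the schema still holds; the retrieval interface stays a query.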

&lt;p&gt;&lt;strong&gt;Your schema is your leverage.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Final Takeaway
&lt;/h2&gt;

&lt;p&gt;Stop treating SQLite like a notepad.&lt;br&gt;
Treat it like your first step in backend + AI infra design.&lt;/p&gt;

&lt;p&gt;Once you think in schemas, every Python project becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;easier to scale,&lt;/li&gt;
&lt;li&gt;easier to extend,&lt;/li&gt;
&lt;li&gt;and closer to production.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Follow My Build Journey
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Personal Site: &lt;a href="//1fahadshah.com"&gt;1fahadshah.com&lt;/a&gt; (Launching soon)&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="//github.com/1FahadShah"&gt;github.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="//linkedin.com/in/1fahadshah"&gt;linkedin.com/in/1fahadshah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Twitter/X: &lt;a href="//x.com/1FahadShah"&gt;x.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Medium: &lt;a href="//1fahadshah.medium.com"&gt;1fahadshah.medium.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hashnode: &lt;a href="//hashnode.com/@1FahadShah"&gt;hashnode.com/@1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>schema</category>
      <category>database</category>
    </item>
    <item>
      <title>5 Python Scripts That Taught Me Real-World Parsing and Automation</title>
      <dc:creator>Fahad Shah</dc:creator>
      <pubDate>Thu, 07 Aug 2025 16:47:43 +0000</pubDate>
      <link>https://dev.to/1fahadshah/5-python-scripts-that-taught-me-real-world-parsing-and-automation-k9n</link>
      <guid>https://dev.to/1fahadshah/5-python-scripts-that-taught-me-real-world-parsing-and-automation-k9n</guid>
      <description>&lt;p&gt;(From Course 2 &amp;amp; 3 of Python for Everybody – Applied Like a Pro)&lt;/p&gt;

&lt;p&gt;Most beginners stop at print statements.&lt;br&gt;
I used every course module to build scripts that scrape, parse, and automate real data pipelines.&lt;/p&gt;

&lt;p&gt;Here are 5 scripts that went beyond the basics — each one feels like a tool, not a toy.&lt;/p&gt;


&lt;h2&gt;
  
  
  1️⃣ 📬 Spam Confidence Extractor
&lt;/h2&gt;

&lt;p&gt;Scans an email archive and calculates the average spam confidence from X-DSPAM-Confidence: headers.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;startswith(), split(), float(), string parsing&lt;/p&gt;

&lt;p&gt;File reading, data cleaning&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count = 0
total = 0

with open("mbox.txt") as f:
    for line in f:
        if line.startswith("X-DSPAM-Confidence:"):
            num = float(line.split(":")[1].strip())
            count += 1
            total += num

print("Average spam confidence:", total / count)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: Email filtering, NLP pre-cleaning, header analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  2️⃣ 📧 Email Address Counter
&lt;/h2&gt;

&lt;p&gt;Counts how many times each sender appears and prints the most frequent one.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;dict counting, string parsing, file handling&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;emails = {}

with open("mbox.txt") as f:
    for line in f:
        if line.startswith("From "):
            parts = line.split()
            email = parts[1]
            emails[email] = emails.get(email, 0) + 1

max_email = max(emails, key=emails.get)
print(max_email, emails[max_email])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: Inbox analytics, sender clustering, contact insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  3️⃣ ⏰ Hour Histogram
&lt;/h2&gt;

&lt;p&gt;Parses timestamps from From lines and prints an hour-by-hour distribution.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;split(), dict, sorting keys&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hours = {}

with open("mbox.txt") as f:
    for line in f:
        if line.startswith("From "):
            time = line.split()[5]
            hour = time.split(":")[0]
            hours[hour] = hours.get(hour, 0) + 1

for hour in sorted(hours):
    print(hour, hours[hour])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: Time-based behavior analysis, email scheduling data, logs monitoring.&lt;/p&gt;




&lt;h2&gt;
  
  
  4️⃣ 🌐 BeautifulSoup Scraper
&lt;/h2&gt;

&lt;p&gt;Pulls every link (href) out of a live webpage using BeautifulSoup.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;HTTP requests, HTML parsing, bs4 tag navigation&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import urllib.request
from bs4 import BeautifulSoup

url = input("Enter URL: ")
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

for tag in soup("a"):
    print(tag.get("href", None))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: Link scraping, data crawling, sitemap audits.&lt;/p&gt;




&lt;h2&gt;
  
  
  5️⃣ 🔗 JSON API Extractor
&lt;/h2&gt;

&lt;p&gt;Fetches data from a REST API, parses JSON, and processes nested fields.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;urllib, json, nested dictionary access&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import urllib.request, urllib.parse, json

url = "http://py4e-data.dr-chuck.net/comments_42.json"
data = urllib.request.urlopen(url).read().decode()
info = json.loads(data)

total = sum([int(item["count"]) for item in info["comments"]])
print("Sum:", total)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: API response processing, backend pipelines, data analytics inputs.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 Why This Matters
&lt;/h2&gt;

&lt;p&gt;These aren’t random exercises.&lt;br&gt;
Each script taught me core data processing patterns that show up in real-world systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parsing messy input → extracting value&lt;/li&gt;
&lt;li&gt;Aggregating + filtering data&lt;/li&gt;
&lt;li&gt;Understanding structure behind unstructured sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not toy problems — these are backend blueprints.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔗 Follow My Build Journey
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="//github.com/1FahadShah"&gt;github.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Twitter/X: &lt;a href="//x.com/1FahadShah"&gt;x.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Medium: &lt;a href="//1fahadshah.medium.com"&gt;1fahadshah.medium.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="//linkedin.com/in/1fahadshah"&gt;linkedin.com/in/1fahadshah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hashnode: &lt;a href="//hashnode.com/@1FahadShah"&gt;hashnode.com/@1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Personal Site: &lt;a href="//1fahadshah.com"&gt;1fahadshah.com&lt;/a&gt; (Launching soon)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;#1FahadShah #Python #DataParsing #BackendEngineering #BuildInPublic #WebScraping #JSON #APIs #LearningInPublic&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>🐍 How I Built a Terminal Knowledge Crawler in Pure Python (No Frameworks)</title>
      <dc:creator>Fahad Shah</dc:creator>
      <pubDate>Mon, 04 Aug 2025 17:43:16 +0000</pubDate>
      <link>https://dev.to/1fahadshah/how-i-built-a-terminal-knowledge-crawler-in-pure-python-no-frameworks-53hf</link>
      <guid>https://dev.to/1fahadshah/how-i-built-a-terminal-knowledge-crawler-in-pure-python-no-frameworks-53hf</guid>
      <description>&lt;p&gt;Real-world AI systems aren’t built on tutorials. They start with foundational tools. Here’s how I built my own — and why every serious engineer should too.&lt;/p&gt;

&lt;h2&gt;
  
  
  🛠️ Problem:
&lt;/h2&gt;

&lt;p&gt;Most Python learners finish courses with throwaway scripts.&lt;br&gt;
I finished mine (Python for Everybody) by building a real system: KRAWLIX — a CLI Knowledge Crawler that fetches, stores, and structures topic summaries like the base layer of an AI assistant.&lt;/p&gt;
&lt;h2&gt;
  
  
  🚀 Features:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pure Python: standard library only (sqlite3 and urllib, no third-party packages).&lt;/li&gt;
&lt;li&gt;Fetches summaries from the DuckDuckGo &amp;amp; Wikipedia APIs.&lt;/li&gt;
&lt;li&gt;Stores data both as .txt files and in a local SQLite database.&lt;/li&gt;
&lt;li&gt;Fault-tolerant, modular, CLI-driven — built for real workflows, not just demos.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full code: &lt;a href="https://github.com/1FahadShah/krawlix" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;1️⃣ Project Structure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The repo isn’t one flat script; it’s organized like a real project.&lt;br&gt;
&lt;strong&gt;Directory layout:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;krawlix/
│
├── main.py               # CLI entrypoint
├── crawler/              # Core logic modules
│     ├── fetch.py
│     ├── db_writer.py
│     └── utils.py
├── db/                   # SQLite database(s)
├── summaries/            # Text file outputs
├── data/                 # Input topics.txt and test files
├── failed_topics.txt     # Log for failed fetches
├── README.md
└── ... (tests, demo, etc.)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2️⃣ How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A. The CLI entry point (main.py)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes an input file (data/topics.txt) with one topic per line&lt;/li&gt;
&lt;li&gt;For each topic:

&lt;ul&gt;
&lt;li&gt;Fetches a summary from DuckDuckGo, Wikipedia (via fetch.py)&lt;/li&gt;
&lt;li&gt;Saves to both a .txt file and SQLite DB (via db_writer.py)&lt;/li&gt;
&lt;li&gt;Logs failed fetches&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
import os
from crawler.fetch import fetch_summary
from crawler.db_writer import create_table, insert_summary, save_summary_to_file
from crawler.utils import get_timestamp
from datetime import datetime

def crawl_topics(topics_file_path):
    """
    this function reads topics from a text file
    and fetches summaries for each of them.
    """

    if not os.path.exists(topics_file_path):
        print("File not found:", topics_file_path)
        return

    create_table()

    topics = []

    # this will get topics from file and append it to 'topics' list
    with open(topics_file_path, "r") as file:
        for line in file:
            line = line.strip()
            if line != "":
                topics.append(line)

    for topic in topics:
        print("\n")
        print("Fetching Summary for: ",topic)
        result = fetch_summary(topic)

        if result:
            result["created_at"] = get_timestamp()
            insert_summary(result)
            save_summary_to_file(result)
            print(f"Summary for {topic} saved in DB")
            filename = result["topic"].replace(" ", "_") + ".txt"
            print(f"{filename} file created\n")
        else:
            print("No Summary found for:", topic)
            with open("failed_topics.txt","a",encoding="utf-8") as fail_log:
                fail_log.write(topic + f", {get_timestamp()}" + "\n")


#manage inputs from CLI

if __name__ == "__main__":
    if len(sys.argv) &amp;lt; 2:
        print("Usage: python main.py data/topics.txt")
    else:
        topics_file = sys.argv[1]
        crawl_topics(topics_file)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;B. Fetcher Module (crawler/fetch.py)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses DuckDuckGo API as primary, Wikipedia as fallback.&lt;/li&gt;
&lt;li&gt;Handles network errors, empty results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import urllib.request
import urllib.parse
import json

def fetch_summary(topic):
    # Fetch summary for the given topic using DuckDuckGo API
    # Return a dictionary with summary or None

    base_url = "https://api.duckduckgo.com/"
    params = {
        'q': topic,
        'format': 'json',
        'no_redirect': '1',
        'no_html': '1'
    }
    query_string = urllib.parse.urlencode(params)
    full_url = base_url + "?" + query_string

    try:
        with urllib.request.urlopen(full_url) as response:
            data = response.read()
            json_data = json.loads(data)

            summary = json_data.get("Abstract", "").strip()
            url = json_data.get("AbstractURL", "").strip()

            if summary:
                return {
                    "topic": topic,
                    "summary": summary,
                    "source": "DuckDuckGo",
                    "source_url": url
                }

    except Exception as e:
        print('DuckDuckGo fetch Failed', e)


    #If DuckDuckGo gave us nothing, we will try Wikipedia

    try:
        wiki_base_url = "https://en.wikipedia.org/api/rest_v1/page/summary/"
        query_string = urllib.parse.quote(topic)
        wiki_url = wiki_base_url + query_string

        with urllib.request.urlopen(wiki_url) as response:
            data = response.read()
            json_data = json.loads(data)

            summary = json_data.get("extract", "").strip()
            url = json_data.get("content_urls", {}).get("desktop", {}).get("page", "")

            if summary:
                return {
                    "topic": topic,
                    "summary": summary,
                    "source": "Wikipedia",
                    "source_url":url
                }


    except Exception as e:
        print("Wikipedia fetch failed", e)

    return None

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;C. Storage Modules&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;File:&lt;/strong&gt; crawler/utils.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime

def get_timestamp():
    return datetime.now().strftime("%d-%m-%Y %H-%M-%S")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; crawler/db_writer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3
import os
from crawler.utils import get_timestamp

DB_PATH = os.path.join("db","krawlix.sqlite")

def create_table():
    # Creates knowledge table if it doesn't exist

    connect = sqlite3.connect(DB_PATH)
    cur = connect.cursor()
    cur.execute('''
        CREATE TABLE IF NOT EXISTS knowledge(
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT,
            summary TEXT,
            source TEXT,
            source_url TEXT,
            created_at TEXT
        )
    ''')

    connect.commit()
    connect.close()


def insert_summary(summary_data):
    # Insert summary into knowledge table

    connect = sqlite3.connect(DB_PATH)
    cur = connect.cursor()

    cur.execute('''
        INSERT OR IGNORE INTO knowledge (topic, summary, source, source_url, created_at)
        VALUES (?, ?, ?, ?, ?)
    ''',
    (
        summary_data['topic'],
        summary_data['summary'],
        summary_data['source'],
        summary_data['source_url'],
        summary_data['created_at']
    ))

    connect.commit()
    connect.close()

def save_summary_to_file(summary_data, folder="summaries"):
    # save summary to text file inside summaries/folder

    if not os.path.exists(folder):
        os.makedirs(folder)

    filename = summary_data["topic"].replace(" ", "_") + ".txt"
    filepath = os.path.join(folder, filename)

    with open(filepath, "w", encoding="utf-8") as f:
        f.write(summary_data["summary"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3️⃣ How To Run
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Prepare your input:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;data/topics.txt&lt;/code&gt; with each topic on a new line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Run:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python main.py data/topics.txt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;3. Outputs:&lt;br&gt;
*&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;summaries/:&lt;/code&gt; Each topic as a separate .txt file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;db/krawlix.sqlite:&lt;/code&gt; SQLite DB with all summaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;failed_topics.txt:&lt;/code&gt; Any failed topics for troubleshooting&lt;/li&gt;
&lt;/ul&gt;
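Once a crawl finishes, the SQLite output is easy to inspect. A standalone sketch (run here against an in-memory copy of the knowledge table with one seeded row; point the connection at db/krawlix.sqlite to inspect real crawl data):

```python
import sqlite3

# ":memory:" keeps this runnable anywhere; after a real crawl, connect to
# os.path.join("db", "krawlix.sqlite") instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same table create_table() makes; one seeded row so the query has data.
cur.execute("""CREATE TABLE IF NOT EXISTS knowledge(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT, summary TEXT, source TEXT, source_url TEXT, created_at TEXT)""")
cur.execute("INSERT INTO knowledge (topic, summary, source) VALUES (?, ?, ?)",
            ("Python", "Python is a programming language.", "Wikipedia"))

# Newest crawls first
cur.execute("SELECT topic, source FROM knowledge ORDER BY id DESC")
rows = cur.fetchall()
for topic, source in rows:
    print(topic, "via", source)
```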




&lt;h2&gt;
  
  
  4️⃣ What Sets KRAWLIX Apart
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modular folder structure:&lt;/strong&gt; Not a monolithic script, but reusable, maintainable modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No external libraries:&lt;/strong&gt; Runs anywhere with basic Python 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error logging &amp;amp; resilience:&lt;/strong&gt; Failures don’t stop the pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built for extension:&lt;/strong&gt; Easily add new sources (Google, LLMs), new outputs (Markdown, CSV), or convert to API&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5️⃣ Lessons Learned &amp;amp; AI Relevance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;“The habits that make KRAWLIX robust are the same that make AI systems scale: modularity, clean storage, error handling, CLI-first design.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now ready to plug this into RAG pipelines, agent stacks, or wrap with FastAPI.&lt;/p&gt;

&lt;p&gt;Built from &lt;em&gt;Python for Everybody&lt;/em&gt; principles — but leveled up.&lt;/p&gt;

&lt;p&gt;#1FahadShah #python #cli #opensource #sqlite #ai #buildinpublic #scraping #api #web&lt;/p&gt;




&lt;p&gt;If you want a full step-by-step walk-through, advanced features, or want to see this project evolve into an API or LLM pipeline — let me know in the comments!&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Follow My Build Journey
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/1FahadShah" rel="noopener noreferrer"&gt;github.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium:&lt;/strong&gt; &lt;a href="https://1fahadshah.medium.com" rel="noopener noreferrer"&gt;1fahadshah.medium.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://linkedin.com/in/1fahadshah" rel="noopener noreferrer"&gt;linkedin.com/in/1fahadshah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Twitter/X:&lt;/strong&gt; &lt;a href="https://x.com/1FahadShah" rel="noopener noreferrer"&gt;x.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hashnode:&lt;/strong&gt; &lt;a href="https://hashnode.com/@1FahadShah" rel="noopener noreferrer"&gt;hashnode.com/@1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal Site:&lt;/strong&gt; &lt;a href="https://1fahadshah.com" rel="noopener noreferrer"&gt;1fahadshah.com&lt;/a&gt; &lt;em&gt;(Launching soon!)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;I post every new tool, deep-dive, and lesson learned—always with code, always with execution. Got questions, want to collaborate, or building something similar? Drop a comment or DM me!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>rag</category>
      <category>basic</category>
    </item>
  </channel>
</rss>
