Fahad Shah

🐍 How I Built a Terminal Knowledge Crawler in Pure Python (No Frameworks)

Real-world AI systems aren’t built on tutorials. They start with foundational tools. Here’s how I built my own — and why every serious engineer should too.

🛠️ Problem:

Most Python learners finish courses with throwaway scripts.
I finished mine (Python for Everybody) by building a real system: KRAWLIX — a CLI Knowledge Crawler that fetches, stores, and structures topic summaries like the base layer of an AI assistant.

🚀 Features:

  • Pure Python: no third-party libraries, just the standard library (urllib, json, sqlite3).
  • Fetches summaries from DuckDuckGo & Wikipedia APIs.
  • Stores data as both .txt files and in a local SQLite database.
  • Fault-tolerant, modular, CLI-driven — built for real workflows, not just demos.

Full code: GitHub Repo


1️⃣ Project Structure

The repo isn’t a flat script; it’s structured like a real engineering project.
Directory layout:

krawlix/
│
├── main.py               # CLI entrypoint
├── crawler/              # Core logic modules
│     ├── fetch.py
│     ├── db_writer.py
│     └── utils.py
├── db/                   # SQLite database(s)
├── summaries/            # Text file outputs
├── data/                 # Input topics.txt and test files
├── failed_topics.txt     # Log for failed fetches
├── README.md
└── ... (tests, demo, etc.)


2️⃣ How It Works

A. The CLI Entry Point (main.py)

  • Takes an input file (data/topics.txt) with one topic per line
  • For each topic:
    • Fetches a summary from DuckDuckGo or Wikipedia (via fetch.py)
    • Saves to both a .txt file and SQLite DB (via db_writer.py)
    • Logs failed fetches

Code:

import sys
import os
from crawler.fetch import fetch_summary
from crawler.db_writer import create_table, insert_summary, save_summary_to_file
from crawler.utils import get_timestamp

def crawl_topics(topics_file_path):
    """
    Read topics from a text file and fetch a summary for each of them.
    """

    if not os.path.exists(topics_file_path):
        print("File not found:", topics_file_path)
        return

    create_table()

    topics = []

    # Collect non-empty lines from the topics file into the 'topics' list
    with open(topics_file_path, "r") as file:
        for line in file:
            line = line.strip()
            if line != "":
                topics.append(line)

    for topic in topics:
        print("\n")
        print("Fetching Summary for: ",topic)
        result = fetch_summary(topic)

        if result:
            result["created_at"] = get_timestamp()
            insert_summary(result)
            save_summary_to_file(result)
            print(f"Summary for {topic} saved in DB")
            filename = result["topic"].replace(" ", "_") + ".txt"
            print(f"{filename} file created\n")
        else:
            print("No Summary found for:", topic)
            with open("failed_topics.txt","a",encoding="utf-8") as fail_log:
                fail_log.write(topic + f", {get_timestamp()}" + "\n")


# Manage inputs from the CLI

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py data/topics.txt")
    else:
        topics_file = sys.argv[1]
        crawl_topics(topics_file)

B. Fetcher Module (crawler/fetch.py)

  • Uses the DuckDuckGo API as the primary source, with Wikipedia as a fallback.
  • Handles network errors and empty results.

Code:

import urllib.request
import urllib.parse
import json

def fetch_summary(topic):
    # Fetch a summary for the given topic: DuckDuckGo API first, then Wikipedia as a fallback
    # Returns a dictionary with the summary, or None if both sources come up empty

    base_url = "https://api.duckduckgo.com/"
    params = {
        'q': topic,
        'format': 'json',
        'no_redirect': '1',
        'no_html': '1'
    }
    query_string = urllib.parse.urlencode(params)
    full_url = base_url + "?" + query_string

    try:
        with urllib.request.urlopen(full_url) as response:
            data = response.read()
            json_data = json.loads(data)

            summary = json_data.get("Abstract", "").strip()
            url = json_data.get("AbstractURL", "").strip()

            if summary:
                return {
                    "topic": topic,
                    "summary": summary,
                    "source": "DuckDuckGo",
                    "source_url": url
                }

    except Exception as e:
        print("DuckDuckGo fetch failed:", e)


    # If DuckDuckGo gave us nothing, fall back to Wikipedia

    try:
        wiki_base_url = "https://en.wikipedia.org/api/rest_v1/page/summary/"
        query_string = urllib.parse.quote(topic)
        wiki_url = wiki_base_url + query_string

        with urllib.request.urlopen(wiki_url) as response:
            data = response.read()
            json_data = json.loads(data)

            summary = json_data.get("extract", "").strip()
            url = json_data.get("content_urls", {}).get("desktop", {}).get("page", "")

            if summary:
                return {
                    "topic": topic,
                    "summary": summary,
                    "source": "Wikipedia",
                    "source_url":url
                }


    except Exception as e:
        print("Wikipedia fetch failed", e)

    return None

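If you want to sanity-check the fetcher on its own, a quick throwaway snippet like the one below (run from the project root, with network access) shows the shape of the dictionary it returns. The topic is just an example:

from crawler.fetch import fetch_summary

result = fetch_summary("Alan Turing")
if result:
    print(result["source"])          # "DuckDuckGo" or "Wikipedia"
    print(result["summary"][:100])   # first 100 characters of the summary
    print(result["source_url"])
else:
    print("No summary found")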

C. Storage Modules
File: crawler/utils.py

from datetime import datetime

def get_timestamp():
    return datetime.now().strftime("%d-%m-%Y %H-%M-%S")

File: crawler/db_writer.py

import sqlite3
import os
from crawler.utils import get_timestamp

DB_PATH = os.path.join("db","krawlix.sqlite")

def create_table():
    # Create the knowledge table if it doesn't exist

    connect = sqlite3.connect(DB_PATH)
    cur = connect.cursor()
    cur.execute('''
        CREATE TABLE IF NOT EXISTS knowledge(
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT,
            summary TEXT,
            source TEXT,
            source_url TEXT,
            created_at TEXT
        )
    ''')

    connect.commit()
    connect.close()


def insert_summary(summary_data):
    # Insert the summary into the knowledge table
    # (OR IGNORE only skips duplicates if the table defines a UNIQUE constraint)

    connect = sqlite3.connect(DB_PATH)
    cur = connect.cursor()

    cur.execute('''
        INSERT OR IGNORE INTO knowledge (topic, summary, source, source_url, created_at)
        VALUES (?, ?, ?, ?, ?)
    ''',
    (
        summary_data['topic'],
        summary_data['summary'],
        summary_data['source'],
        summary_data['source_url'],
        summary_data['created_at']
    ))

    connect.commit()
    connect.close()

def save_summary_to_file(summary_data, folder="summaries"):
    # Save the summary text to a file inside the summaries/ folder

    if not os.path.exists(folder):
        os.makedirs(folder)

    filename = summary_data["topic"].replace(" ", "_") + ".txt"
    filepath = os.path.join(folder, filename)

    with open(filepath, "w", encoding="utf-8") as f:
        f.write(summary_data["summary"])

3️⃣ How To Run

1. Prepare your input:

Edit data/topics.txt with each topic on a new line.
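
For example, data/topics.txt could look like this (the topics here are just placeholders; use whatever you like):

Alan Turing
Machine learning
SQLite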

2. Run:

python main.py data/topics.txt

3. Outputs:

  • summaries/: Each topic as a separate .txt file
  • db/krawlix.sqlite: SQLite DB with all summaries
  • failed_topics.txt: Any failed topics for troubleshooting

4️⃣ What Sets KRAWLIX Apart

  • Modular folder structure: Not a monolithic script, but reusable, maintainable modules
  • No external libraries: Runs anywhere with basic Python 3
  • Error logging & resilience: Failures don’t stop the pipeline
  • Built for extension: Easily add new sources (Google, LLMs), new outputs (Markdown, CSV), or convert it into an API (see the sketch just below)
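
As a quick illustration of that last point, here is a rough sketch of what a Markdown output writer could look like, mirroring the existing save_summary_to_file pattern. Nothing below exists in the repo yet; the function name and output folder are made up for illustration:

import os

def save_summary_to_markdown(summary_data, folder="summaries_md"):
    # Hypothetical extension: write one summary as a small Markdown note
    if not os.path.exists(folder):
        os.makedirs(folder)

    filename = summary_data["topic"].replace(" ", "_") + ".md"
    filepath = os.path.join(folder, filename)

    with open(filepath, "w", encoding="utf-8") as f:
        f.write(f"# {summary_data['topic']}\n\n")
        f.write(summary_data["summary"] + "\n\n")
        f.write(f"Source: {summary_data['source']} ({summary_data['source_url']})\n")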

5️⃣ Lessons Learned & AI Relevance

“The habits that make KRAWLIX robust are the same that make AI systems scale: modularity, clean storage, error handling, CLI-first design.”

It’s now ready to plug into RAG pipelines or agent stacks, or to be wrapped with FastAPI.
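
To make that concrete, here is a minimal sketch of how downstream code (a RAG loader, an agent tool, a FastAPI handler) might read the crawled summaries back out of the SQLite database. The table and column names match db_writer.py; the load_knowledge function itself is hypothetical:

import sqlite3

def load_knowledge(db_path="db/krawlix.sqlite"):
    # Read every stored summary back out as a list of dictionaries
    connect = sqlite3.connect(db_path)
    cur = connect.cursor()
    cur.execute("SELECT topic, summary, source, source_url FROM knowledge")
    rows = cur.fetchall()
    connect.close()
    return [
        {"topic": t, "summary": s, "source": src, "source_url": url}
        for (t, s, src, url) in rows
    ]

if __name__ == "__main__":
    for entry in load_knowledge():
        print(entry["topic"], "->", entry["summary"][:60])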

Built from Python for Everybody principles — but leveled up.

#1FahadShah #python #cli #opensource #sqlite #ai #buildinpublic #scraping #api #web


If you want a full step-by-step walk-through, advanced features, or want to see this project evolve into an API or LLM pipeline — let me know in the comments!

🚀 Follow My Build Journey

I post every new tool, deep-dive, and lesson learned—always with code, always with execution. Got questions, want to collaborate, or building something similar? Drop a comment or DM me!
