Real-world AI systems aren’t built on tutorials. They start with foundational tools. Here’s how I built my own — and why every serious engineer should too.
🛠️ Problem:
Most Python learners finish courses with throwaway scripts.
I finished mine (Python for Everybody) by building a real system: KRAWLIX — a CLI Knowledge Crawler that fetches, stores, and structures topic summaries like the base layer of an AI assistant.
🚀 Features:
- Pure Python: no external libraries, just the standard library (urllib, json, sqlite3).
- Fetches summaries from the DuckDuckGo and Wikipedia APIs.
- Stores data both as .txt files and in a local SQLite database.
- Fault-tolerant, modular, CLI-driven — built for real workflows, not just demos.
Full code: GitHub Repo
1️⃣ Project Structure
The repo isn't a flat script; it's structured like a real project.
Directory layout:
krawlix/
│
├── main.py              # CLI entrypoint
├── crawler/             # Core logic modules
│   ├── fetch.py
│   ├── db_writer.py
│   └── utils.py
├── db/                  # SQLite database(s)
├── summaries/           # Text file outputs
├── data/                # Input topics.txt and test files
├── failed_topics.txt    # Log for failed fetches
├── README.md
└── ... (tests, demo, etc.)
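For reference, data/topics.txt is just a plain text file with one topic per line. A tiny illustrative example (these particular topics are placeholders, not the repo's actual list):

```
Alan Turing
SQLite
Machine learning
```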
2️⃣ How It Works
A. The CLI Entry Point (main.py)
- Takes an input file (data/topics.txt) with one topic per line
- For each topic:
  - Fetches a summary from DuckDuckGo, with Wikipedia as a fallback (via fetch.py)
  - Saves it both to a .txt file and to the SQLite DB (via db_writer.py)
  - Logs failed fetches to failed_topics.txt
Code:
import sys
import os
from crawler.fetch import fetch_summary
from crawler.db_writer import create_table, insert_summary, save_summary_to_file
from crawler.utils import get_timestamp


def crawl_topics(topics_file_path):
    """
    Read topics from a text file and fetch a summary for each of them.
    """
    if not os.path.exists(topics_file_path):
        print("File not found:", topics_file_path)
        return

    create_table()

    # Collect non-empty topic lines from the input file
    topics = []
    with open(topics_file_path, "r") as file:
        for line in file:
            line = line.strip()
            if line != "":
                topics.append(line)

    for topic in topics:
        print("\nFetching summary for:", topic)
        result = fetch_summary(topic)
        if result:
            result["created_at"] = get_timestamp()
            insert_summary(result)
            save_summary_to_file(result)
            print(f"Summary for {topic} saved in DB")
            filename = result["topic"].replace(" ", "_") + ".txt"
            print(f"{filename} file created\n")
        else:
            # Log the failed topic with a timestamp for later retries
            print("No summary found for:", topic)
            with open("failed_topics.txt", "a", encoding="utf-8") as fail_log:
                fail_log.write(f"{topic}, {get_timestamp()}\n")


# Manage inputs from the CLI
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py data/topics.txt")
    else:
        topics_file = sys.argv[1]
        crawl_topics(topics_file)
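Because crawl_topics is an importable function rather than inline script code, it can also be driven from other Python code. A minimal sketch, assuming it runs from the repo root (run_batch.py and data/more_topics.txt are hypothetical, not files in the repo):

```python
# run_batch.py: hypothetical helper, not part of the KRAWLIX repo
from main import crawl_topics

# Crawl two topic lists back to back; results accumulate in the
# same SQLite database and summaries/ folder.
for topics_file in ["data/topics.txt", "data/more_topics.txt"]:
    crawl_topics(topics_file)
```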
B. Fetcher Module (crawler/fetch.py)
- Uses DuckDuckGo API as primary, Wikipedia as fallback.
- Handles network errors and empty results gracefully.
Code:
import urllib.request
import urllib.parse
import json


def fetch_summary(topic):
    # Fetch a summary for the given topic using the DuckDuckGo Instant Answer API.
    # Returns a dictionary with the summary, or None if nothing was found.
    base_url = "https://api.duckduckgo.com/"
    params = {
        'q': topic,
        'format': 'json',
        'no_redirect': '1',
        'no_html': '1'
    }
    query_string = urllib.parse.urlencode(params)
    full_url = base_url + "?" + query_string

    try:
        with urllib.request.urlopen(full_url) as response:
            data = response.read()
            json_data = json.loads(data)
            summary = json_data.get("Abstract", "").strip()
            url = json_data.get("AbstractURL", "").strip()
            if summary:
                return {
                    "topic": topic,
                    "summary": summary,
                    "source": "DuckDuckGo",
                    "source_url": url
                }
    except Exception as e:
        print("DuckDuckGo fetch failed:", e)

    # If DuckDuckGo gave us nothing, try Wikipedia's REST summary endpoint
    try:
        wiki_base_url = "https://en.wikipedia.org/api/rest_v1/page/summary/"
        query_string = urllib.parse.quote(topic)
        wiki_url = wiki_base_url + query_string
        with urllib.request.urlopen(wiki_url) as response:
            data = response.read()
            json_data = json.loads(data)
            summary = json_data.get("extract", "").strip()
            url = json_data.get("content_urls", {}).get("desktop", {}).get("page", "")
            if summary:
                return {
                    "topic": topic,
                    "summary": summary,
                    "source": "Wikipedia",
                    "source_url": url
                }
    except Exception as e:
        print("Wikipedia fetch failed:", e)

    return None
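Because fetch_summary has no dependency on the storage layer, it is easy to sanity-check on its own. A quick hypothetical smoke test (needs network access; the topic is just an example):

```python
# smoke_test_fetch.py: hypothetical, not part of the repo
from crawler.fetch import fetch_summary

result = fetch_summary("Alan Turing")
if result:
    # Show which source answered and the first 100 characters of the summary
    print(result["source"], "->", result["summary"][:100])
else:
    print("No summary found")
```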
C. Storage Modules
File: crawler/utils.py
from datetime import datetime


def get_timestamp():
    return datetime.now().strftime("%d-%m-%Y %H-%M-%S")
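One detail worth knowing: this format produces strings like 05-07-2025 14-30-59, which read nicely but don't sort chronologically as plain text. If you ever need sortable values in the created_at column, an ISO-style variant is a one-line change; a sketch only, not what the repo currently uses:

```python
from datetime import datetime

def get_timestamp_sortable():
    # Hypothetical alternative: "YYYY-MM-DD HH:MM:SS" sorts correctly as text
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
```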
File: crawler/db_writer.py
import sqlite3
import os

DB_PATH = os.path.join("db", "krawlix.sqlite")


def create_table():
    # Create the knowledge table if it doesn't exist yet
    connect = sqlite3.connect(DB_PATH)
    cur = connect.cursor()
    cur.execute('''
        CREATE TABLE IF NOT EXISTS knowledge(
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT,
            summary TEXT,
            source TEXT,
            source_url TEXT,
            created_at TEXT
        )
    ''')
    connect.commit()
    connect.close()


def insert_summary(summary_data):
    # Insert a summary row into the knowledge table
    connect = sqlite3.connect(DB_PATH)
    cur = connect.cursor()
    cur.execute('''
        INSERT OR IGNORE INTO knowledge (topic, summary, source, source_url, created_at)
        VALUES (?, ?, ?, ?, ?)
    ''',
    (
        summary_data['topic'],
        summary_data['summary'],
        summary_data['source'],
        summary_data['source_url'],
        summary_data['created_at']
    ))
    connect.commit()
    connect.close()


def save_summary_to_file(summary_data, folder="summaries"):
    # Save the summary to a text file inside the summaries/ folder
    if not os.path.exists(folder):
        os.makedirs(folder)
    filename = summary_data["topic"].replace(" ", "_") + ".txt"
    filepath = os.path.join(folder, filename)
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(summary_data["summary"])
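Reading the data back out is just as simple. A minimal sketch of a reader (read_back.py is hypothetical, not part of the repo):

```python
# read_back.py: hypothetical example of querying the knowledge table
import sqlite3

connect = sqlite3.connect("db/krawlix.sqlite")
cur = connect.cursor()

# List every stored topic with its source and fetch time, oldest first
for topic, source, created_at in cur.execute(
    "SELECT topic, source, created_at FROM knowledge ORDER BY id"
):
    print(f"{topic} [{source}] fetched at {created_at}")

connect.close()
```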
3️⃣ How To Run
1. Prepare your input: edit data/topics.txt with one topic per line.
2. Run:
   python main.py data/topics.txt
3. Outputs:
   - summaries/: each topic saved as a separate .txt file
   - db/krawlix.sqlite: SQLite DB with all summaries
   - failed_topics.txt: any failed topics for troubleshooting
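If you have the sqlite3 command-line shell installed, you can also spot-check the database directly without writing any Python, for example:

```
sqlite3 db/krawlix.sqlite "SELECT topic, source, created_at FROM knowledge;"
```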
4️⃣ What Sets KRAWLIX Apart
- **Modular folder structure:** Not a monolithic script, but reusable, maintainable modules
- **No external libraries:** Runs anywhere with basic Python 3
- **Error logging & resilience:** Failures don't stop the pipeline
- **Built for extension:** Easily add new sources (Google, LLMs), new outputs (Markdown, CSV), or convert it into an API
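As a concrete taste of that last point, a Markdown output could reuse the exact summary dict that db_writer already receives. A hypothetical sketch (save_summary_to_markdown is not part of the current repo):

```python
import os

def save_summary_to_markdown(summary_data, folder="summaries_md"):
    # Hypothetical add-on: write the same summary dict as a Markdown note
    # with a title, the summary text, and a source link.
    os.makedirs(folder, exist_ok=True)
    filename = summary_data["topic"].replace(" ", "_") + ".md"
    filepath = os.path.join(folder, filename)
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(f"# {summary_data['topic']}\n\n")
        f.write(summary_data["summary"] + "\n\n")
        f.write(f"Source: [{summary_data['source']}]({summary_data['source_url']})\n")
```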
5️⃣ Lessons Learned & AI Relevance
“The habits that make KRAWLIX robust are the same ones that make AI systems scale: modularity, clean storage, error handling, CLI-first design.”
KRAWLIX is now ready to plug into RAG pipelines and agent stacks, or to be wrapped with FastAPI.
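For the FastAPI direction specifically, a minimal read-only wrapper over the existing SQLite file could look roughly like this (a sketch only: FastAPI and uvicorn are external dependencies, and api.py is not part of the repo):

```python
# api.py: hypothetical FastAPI wrapper around the KRAWLIX database
import sqlite3
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/summary/{topic}")
def get_summary(topic: str):
    # Look the topic up in the existing knowledge table
    connect = sqlite3.connect("db/krawlix.sqlite")
    cur = connect.cursor()
    cur.execute(
        "SELECT topic, summary, source, source_url FROM knowledge WHERE topic = ?",
        (topic,),
    )
    row = cur.fetchone()
    connect.close()
    if row is None:
        raise HTTPException(status_code=404, detail="Topic not found")
    return {"topic": row[0], "summary": row[1], "source": row[2], "source_url": row[3]}

# Run with: uvicorn api:app --reload
```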
Built from Python for Everybody principles — but leveled up.
#1FahadShah #python #cli #opensource #sqlite #ai #buildinpublic #scraping #api #web
If you want a full step-by-step walk-through, advanced features, or want to see this project evolve into an API or LLM pipeline — let me know in the comments!
🚀 Follow My Build Journey
- GitHub: github.com/1FahadShah
- Medium: 1fahadshah.medium.com
- LinkedIn: linkedin.com/in/1fahadshah
- Twitter/X: x.com/1FahadShah
- Hashnode: hashnode.com/@1FahadShah
- Personal Site: 1fahadshah.com (Launching soon!)
I post every new tool, deep-dive, and lesson learned—always with code, always with execution. Got questions, want to collaborate, or building something similar? Drop a comment or DM me!