<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fahad Shah</title>
    <description>The latest articles on DEV Community by Fahad Shah (@1fahadshah).</description>
    <link>https://dev.to/1fahadshah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3056863%2Fe62de53c-f164-43f5-9858-eb945f5197fb.jpg</url>
      <title>DEV Community: Fahad Shah</title>
      <link>https://dev.to/1fahadshah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/1fahadshah"/>
    <language>en</language>
    <item>
      <title>Designing Scalable SQLite Schemas for Python Apps</title>
      <dc:creator>Fahad Shah</dc:creator>
      <pubDate>Tue, 19 Aug 2025 17:37:53 +0000</pubDate>
      <link>https://dev.to/1fahadshah/designing-scalable-sqlite-schemas-for-python-apps-27gd</link>
      <guid>https://dev.to/1fahadshah/designing-scalable-sqlite-schemas-for-python-apps-27gd</guid>
      <description>&lt;p&gt;&lt;em&gt;(The Foundation Every Systems Builder Needs — by 1FahadShah)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most beginners treat SQLite as a toy database.&lt;br&gt;
I learned the hard way: your schema decisions today decide whether your project survives tomorrow.&lt;/p&gt;

&lt;p&gt;In my Python journey (Course 4 of Python for Everybody), I stopped thinking of SQLite as “just storage” — and started treating it like the backbone of real pipelines.&lt;/p&gt;

&lt;p&gt;Here’s how I approached schema design so my projects didn’t collapse the moment they touched real-world data.&lt;/p&gt;
&lt;h2&gt;
  
  
  🚧 The Naive Approach (and Why It Breaks)
&lt;/h2&gt;

&lt;p&gt;Early scripts often look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One table for everything&lt;/li&gt;
&lt;li&gt;CSV-like storage&lt;/li&gt;
&lt;li&gt;Fields crammed together with no normalization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works for a single file. It dies once you hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple data sources&lt;/li&gt;
&lt;li&gt;Relationships between entities&lt;/li&gt;
&lt;li&gt;Queries that need speed and accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: duplication, inconsistency, and painful debugging.&lt;/p&gt;
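To make the failure mode concrete, here is a hypothetical one-table version of an email store (table and column names are invented for this sketch, not from any real project). Every message row carries a full copy of the sender's address:

```python
import sqlite3

# Hypothetical "everything in one table" design: sender details repeat per row.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Mail (sender_email TEXT, subject TEXT)")
cur.executemany(
    "INSERT INTO Mail VALUES (?, ?)",
    [("ana@example.com", "Hi"),
     ("ana@example.com", "Re: Hi"),
     ("bob@example.com", "Report")],
)

# The same address is stored once per message, so fixing a typo in it
# means rewriting every row that sender ever produced.
cur.execute("SELECT COUNT(*) FROM Mail WHERE sender_email = 'ana@example.com'")
dup_rows = cur.fetchone()[0]
print(dup_rows)
```

Two rows already hold the identical string, and the count grows with every message. That duplication is exactly what normalization removes.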


&lt;h2&gt;
  
  
  ✅ The Scalable Schema Mindset
&lt;/h2&gt;

&lt;p&gt;I shifted to a schema-first approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People, Messages, Logs, Transactions&lt;/li&gt;
&lt;li&gt;Each gets its own table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Normalize Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No repeated emails or usernames scattered across rows.&lt;/li&gt;
&lt;li&gt;Relationships are modeled once, referenced many times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Think in Queries&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema isn’t just storage.&lt;/li&gt;
&lt;li&gt;It’s the shape of the answers you’ll need later.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🗄 Example: Email System Schema
&lt;/h2&gt;

&lt;p&gt;Here’s a simplified schema I built while parsing large email archives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;person_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sent_at&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it scales:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Person table stores each unique sender once.&lt;/li&gt;
&lt;li&gt;Message table references the person via person_id.&lt;/li&gt;
&lt;li&gt;No duplication, fast lookups, easy aggregation.&lt;/li&gt;
&lt;/ul&gt;
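The "easy aggregation" point is worth seeing in action. Here is a sketch, using the Person/Message schema above built in memory with a few invented rows, that counts messages per sender with a single JOIN:

```python
import sqlite3

# In-memory demo of the Person/Message schema, seeded with invented rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Person (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE
);
CREATE TABLE Message (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    person_id INTEGER,
    sent_at   TEXT,
    subject   TEXT,
    FOREIGN KEY (person_id) REFERENCES Person(id)
);
""")
cur.execute("INSERT INTO Person (email) VALUES ('ana@example.com'), ('bob@example.com')")
cur.executemany(
    "INSERT INTO Message (person_id, sent_at, subject) VALUES (?, ?, ?)",
    [(1, "2025-08-19 10:00:00", "Hi"),
     (1, "2025-08-19 11:00:00", "Re: Hi"),
     (2, "2025-08-19 12:00:00", "Report")],
)

# Messages per sender: one JOIN, no duplicated email strings anywhere.
cur.execute("""
    SELECT Person.email, COUNT(*) AS n
    FROM Message JOIN Person ON Message.person_id = Person.id
    GROUP BY Person.email
    ORDER BY n DESC
""")
rows = cur.fetchall()
for email, n in rows:
    print(email, n)
```

Each email string lives in exactly one Person row; the aggregate only ever touches integer foreign keys.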

&lt;h3&gt;
  
  
  Connecting with Python:
&lt;/h3&gt;

&lt;p&gt;Here’s how cleanly you can now add a new message. Notice how we look up the person's id first, ensuring no duplicate data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_db.sqlite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Assume the schema from above is already created
&lt;/span&gt;
&lt;span class="n"&gt;sender_email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;new.sender@example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;message_subject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Important Update&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-08-19 22:50:00&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Find or create the person
&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;INSERT OR IGNORE INTO Person (email) VALUES (?)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender_email&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SELECT id FROM Person WHERE email = ?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender_email&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;span class="n"&gt;person_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Insert the message with the foreign key
&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
    INSERT INTO Message (person_id, sent_at, subject)
    VALUES (?, ?, ?)
&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message_subject&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔑 Lessons That Stick
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Schemas aren’t an afterthought — they are the system.&lt;/li&gt;
&lt;li&gt;Clean separation of entities → fewer bugs, easier joins.&lt;/li&gt;
&lt;li&gt;Good schemas survive when you evolve from scripts → services → pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why I call schema design my first systems upgrade. It’s where scripts stop being disposable and start becoming infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 Why This Matters for AI Systems
&lt;/h2&gt;

&lt;p&gt;Most “AI engineers” ignore databases.&lt;br&gt;
But every LLM workflow is powered by structured + semi-structured data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parsing messy logs? → store clean.&lt;/li&gt;
&lt;li&gt;Building embeddings? → index consistently.&lt;/li&gt;
&lt;li&gt;Agent workflows? → Modern AI using RAG (Retrieval-Augmented Generation) needs queryable memory. A good schema is the foundation for reliable context retrieval.&lt;/li&gt;
&lt;/ul&gt;
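At its simplest, "queryable memory" is just a clean table you can filter. A hypothetical sketch (the memory table and its columns are illustrative only, not part of any RAG framework):

```python
import sqlite3

# Hypothetical "agent memory" table; names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE memory (topic TEXT, chunk TEXT, created_at TEXT)")
cur.executemany(
    "INSERT INTO memory VALUES (?, ?, ?)",
    [("sqlite", "SQLite is an embedded SQL database.", "2025-08-19"),
     ("python", "Python ships sqlite3 in the standard library.", "2025-08-19")],
)

# With a clean schema, context retrieval is just a filtered query.
cur.execute("SELECT chunk FROM memory WHERE topic = ?", ("sqlite",))
context = cur.fetchone()[0]
print(context)
```

Swap the equality filter for vector similarity later and the schema still holds; the retrieval interface stays a query.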

&lt;p&gt;&lt;strong&gt;Your schema is your leverage.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Final Takeaway
&lt;/h2&gt;

&lt;p&gt;Stop treating SQLite like a notepad.&lt;br&gt;
Treat it like your first step in backend + AI infra design.&lt;/p&gt;

&lt;p&gt;Once you think in schemas, every Python project becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;easier to scale,&lt;/li&gt;
&lt;li&gt;easier to extend,&lt;/li&gt;
&lt;li&gt;and closer to production.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Follow My Build Journey
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Personal Site: &lt;a href="//1fahadshah.com"&gt;1fahadshah.com&lt;/a&gt; (Launching soon)&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="//github.com/1FahadShah"&gt;github.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="//linkedin.com/in/1fahadshah"&gt;linkedin.com/in/1fahadshah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Twitter/X: &lt;a href="//x.com/1FahadShah"&gt;x.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Medium: &lt;a href="//1fahadshah.medium.com"&gt;1fahadshah.medium.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hashnode: &lt;a href="//hashnode.com/@1FahadShah"&gt;hashnode.com/@1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>schema</category>
      <category>database</category>
    </item>
    <item>
      <title>5 Python Scripts That Taught Me Real-World Parsing and Automation</title>
      <dc:creator>Fahad Shah</dc:creator>
      <pubDate>Thu, 07 Aug 2025 16:47:43 +0000</pubDate>
      <link>https://dev.to/1fahadshah/5-python-scripts-that-taught-me-real-world-parsing-and-automation-k9n</link>
      <guid>https://dev.to/1fahadshah/5-python-scripts-that-taught-me-real-world-parsing-and-automation-k9n</guid>
      <description>&lt;p&gt;(From Course 2 &amp;amp; 3 of Python for Everybody – Applied Like a Pro)&lt;/p&gt;

&lt;p&gt;Most beginners stop at print statements.&lt;br&gt;
I used every course module to build scripts that scrape, parse, and automate real data pipelines.&lt;/p&gt;

&lt;p&gt;Here are 5 scripts that went beyond the basics — each one feels like a tool, not a toy.&lt;/p&gt;


&lt;h2&gt;
  
  
  1️⃣ 📬 Spam Confidence Extractor
&lt;/h2&gt;

&lt;p&gt;Scans an email archive and calculates the average spam confidence from X-DSPAM-Confidence: headers.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;startswith(), split(), float(), string parsing&lt;/p&gt;

&lt;p&gt;File reading, data cleaning&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count = 0
total = 0

with open("mbox.txt") as f:
    for line in f:
        if line.startswith("X-DSPAM-Confidence:"):
            num = float(line.split(":")[1].strip())
            count += 1
            total += num

print("Average spam confidence:", total / count)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: Email filtering, NLP pre-cleaning, header analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  2️⃣ 📧 Email Address Counter
&lt;/h2&gt;

&lt;p&gt;Counts how many times each sender appears and prints the most frequent one.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;dict counting, string parsing, file handling&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;emails = {}

with open("mbox.txt") as f:
    for line in f:
        if line.startswith("From "):
            parts = line.split()
            email = parts[1]
            emails[email] = emails.get(email, 0) + 1

max_email = max(emails, key=emails.get)
print(max_email, emails[max_email])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: Inbox analytics, sender clustering, contact insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  3️⃣ ⏰ Hour Histogram
&lt;/h2&gt;

&lt;p&gt;Parses timestamps from From lines and prints an hour-by-hour distribution.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;split(), dict, sorting keys&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hours = {}

with open("mbox.txt") as f:
    for line in f:
        if line.startswith("From "):
            time = line.split()[5]
            hour = time.split(":")[0]
            hours[hour] = hours.get(hour, 0) + 1

for hour in sorted(hours):
    print(hour, hours[hour])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: Time-based behavior analysis, email scheduling data, logs monitoring.&lt;/p&gt;




&lt;h2&gt;
  
  
  4️⃣ 🌐 BeautifulSoup Scraper
&lt;/h2&gt;

&lt;p&gt;Pulls every link (href) out of a live webpage using BeautifulSoup.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;HTTP requests, HTML parsing, bs4 tag navigation&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import urllib.request
from bs4 import BeautifulSoup

url = input("Enter URL: ")
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

for tag in soup("a"):
    print(tag.get("href", None))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: Link scraping, data crawling, sitemap audits.&lt;/p&gt;




&lt;h2&gt;
  
  
  5️⃣ 🔗 JSON API Extractor
&lt;/h2&gt;

&lt;p&gt;Fetches data from a REST API, parses JSON, and processes nested fields.&lt;/p&gt;

&lt;p&gt;✅ Skills:&lt;/p&gt;

&lt;p&gt;urllib, json, nested dictionary access&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import urllib.request, urllib.parse, json

url = "http://py4e-data.dr-chuck.net/comments_42.json"
data = urllib.request.urlopen(url).read().decode()
info = json.loads(data)

total = sum([int(item["count"]) for item in info["comments"]])
print("Sum:", total)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📎 Real-World Use: API response processing, backend pipelines, data analytics inputs.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 Why This Matters
&lt;/h2&gt;

&lt;p&gt;These aren’t random exercises.&lt;br&gt;
Each script taught me core data processing patterns that show up in real-world systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parsing messy input → extracting value&lt;/li&gt;
&lt;li&gt;Aggregating + filtering data&lt;/li&gt;
&lt;li&gt;Understanding structure behind unstructured sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not toy problems — these are backend blueprints.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔗 Follow My Build Journey
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="//github.com/1FahadShah"&gt;github.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Twitter/X: &lt;a href="//x.com/1FahadShah"&gt;x.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Medium: &lt;a href="//1fahadshah.medium.com"&gt;1fahadshah.medium.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="//linkedin.com/in/1fahadshah"&gt;linkedin.com/in/1fahadshah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hashnode: &lt;a href="//hashnode.com/@1FahadShah"&gt;hashnode.com/@1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Personal Site: &lt;a href="//1fahadshah.com"&gt;1fahadshah.com&lt;/a&gt; (Launching soon)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;#1FahadShah #Python #DataParsing #BackendEngineering #BuildInPublic #WebScraping #JSON #APIs #LearningInPublic&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>🐍 How I Built a Terminal Knowledge Crawler in Pure Python (No Frameworks)</title>
      <dc:creator>Fahad Shah</dc:creator>
      <pubDate>Mon, 04 Aug 2025 17:43:16 +0000</pubDate>
      <link>https://dev.to/1fahadshah/how-i-built-a-terminal-knowledge-crawler-in-pure-python-no-frameworks-53hf</link>
      <guid>https://dev.to/1fahadshah/how-i-built-a-terminal-knowledge-crawler-in-pure-python-no-frameworks-53hf</guid>
      <description>&lt;p&gt;Real-world AI systems aren’t built on tutorials. They start with foundational tools. Here’s how I built my own — and why every serious engineer should too.&lt;/p&gt;

&lt;h2&gt;
  
  
  🛠️ Problem:
&lt;/h2&gt;

&lt;p&gt;Most Python learners finish courses with throwaway scripts.&lt;br&gt;
I finished mine (Python for Everybody) by building a real system: KRAWLIX — a CLI Knowledge Crawler that fetches, stores, and structures topic summaries like the base layer of an AI assistant.&lt;/p&gt;
&lt;h2&gt;
  
  
  🚀 Features:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pure Python: standard library only (sqlite3 and urllib, no third-party packages).&lt;/li&gt;
&lt;li&gt;Fetches summaries from the DuckDuckGo &amp;amp; Wikipedia APIs.&lt;/li&gt;
&lt;li&gt;Stores data both as .txt files and in a local SQLite database.&lt;/li&gt;
&lt;li&gt;Fault-tolerant, modular, CLI-driven — built for real workflows, not just demos.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full code: &lt;a href="https://github.com/1FahadShah/krawlix" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;1️⃣ Project Structure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The repo isn’t one flat script; it’s organized like a real project.&lt;br&gt;
&lt;strong&gt;Directory layout:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;krawlix/
│
├── main.py               # CLI entrypoint
├── crawler/              # Core logic modules
│     ├── fetch.py
│     ├── db_writer.py
│     └── utils.py
├── db/                   # SQLite database(s)
├── summaries/            # Text file outputs
├── data/                 # Input topics.txt and test files
├── failed_topics.txt     # Log for failed fetches
├── README.md
└── ... (tests, demo, etc.)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2️⃣ How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A. The CLI entry point (main.py)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes an input file (data/topics.txt) with one topic per line&lt;/li&gt;
&lt;li&gt;For each topic:

&lt;ul&gt;
&lt;li&gt;Fetches a summary from DuckDuckGo, Wikipedia (via fetch.py)&lt;/li&gt;
&lt;li&gt;Saves to both a .txt file and SQLite DB (via db_writer.py)&lt;/li&gt;
&lt;li&gt;Logs failed fetches&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
import os
from crawler.fetch import fetch_summary
from crawler.db_writer import create_table, insert_summary, save_summary_to_file
from crawler.utils import get_timestamp
from datetime import datetime

def crawl_topics(topics_file_path):
    """
    this function reads topics from a text file
    and fetches summaries for each of them.
    """

    if not os.path.exists(topics_file_path):
        print("File not found:", topics_file_path)
        return

    create_table()

    topics = []

    # this will get topics from file and append it to 'topics' list
    with open(topics_file_path, "r") as file:
        for line in file:
            line = line.strip()
            if line != "":
                topics.append(line)

    for topic in topics:
        print("\n")
        print("Fetching Summary for: ",topic)
        result = fetch_summary(topic)

        if result:
            result["created_at"] = get_timestamp()
            insert_summary(result)
            save_summary_to_file(result)
            print(f"Summary for {topic} saved in DB")
            filename = result["topic"].replace(" ", "_") + ".txt"
            print(f"{filename} file created\n")
        else:
            print("No Summary found for:", topic)
            with open("failed_topics.txt","a",encoding="utf-8") as fail_log:
                fail_log.write(topic + f", {get_timestamp()}" + "\n")


#manage inputs from CLI

if __name__ == "__main__":
    if len(sys.argv) &amp;lt; 2:
        print("Usage: python main.py data/topics.txt")
    else:
        topics_file = sys.argv[1]
        crawl_topics(topics_file)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;B. Fetcher Module (crawler/fetch.py)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses DuckDuckGo API as primary, Wikipedia as fallback.&lt;/li&gt;
&lt;li&gt;Handles network errors, empty results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import urllib.request
import urllib.parse
import json

def fetch_summary(topic):
    # Fetch summary for the given topic using DuckDuckGo API
    # Return a dictionary with summary or None

    base_url = "https://api.duckduckgo.com/"
    params = {
        'q': topic,
        'format': 'json',
        'no_redirect': '1',
        'no_html': '1'
    }
    query_string = urllib.parse.urlencode(params)
    full_url = base_url + "?" + query_string

    try:
        with urllib.request.urlopen(full_url) as response:
            data = response.read()
            json_data = json.loads(data)

            summary = json_data.get("Abstract", "").strip()
            url = json_data.get("AbstractURL", "").strip()

            if summary:
                return {
                    "topic": topic,
                    "summary": summary,
                    "source": "DuckDuckGo",
                    "source_url": url
                }

    except Exception as e:
        print('DuckDuckGo fetch Failed', e)


    #If DuckDuckGo gave us nothing, we will try Wikipedia

    try:
        wiki_base_url = "https://en.wikipedia.org/api/rest_v1/page/summary/"
        query_string = urllib.parse.quote(topic)
        wiki_url = wiki_base_url + query_string

        with urllib.request.urlopen(wiki_url) as response:
            data = response.read()
            json_data = json.loads(data)

            summary = json_data.get("extract", "").strip()
            url = json_data.get("content_urls", {}).get("desktop", {}).get("page", "")

            if summary:
                return {
                    "topic": topic,
                    "summary": summary,
                    "source": "Wikipedia",
                    "source_url":url
                }


    except Exception as e:
        print("Wikipedia fetch failed", e)

    return None

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;C. Storage Modules&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;File:&lt;/strong&gt; crawler/utils.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime

def get_timestamp():
    return datetime.now().strftime("%d-%m-%Y %H-%M-%S")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; crawler/db_writer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3
import os
from crawler.utils import get_timestamp

DB_PATH = os.path.join("db","krawlix.sqlite")

def create_table():
    # Creates knowledge table if it doesn't exist

    connect = sqlite3.connect(DB_PATH)
    cur = connect.cursor()
    cur.execute('''
        CREATE TABLE IF NOT EXISTS knowledge(
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT,
            summary TEXT,
            source TEXT,
            source_url TEXT,
            created_at TEXT
        )
    ''')

    connect.commit()
    connect.close()


def insert_summary(summary_data):
    # Insert summary into knowledge table

    connect = sqlite3.connect(DB_PATH)
    cur = connect.cursor()

    cur.execute('''
        INSERT OR IGNORE INTO knowledge (topic, summary, source, source_url, created_at)
        VALUES (?, ?, ?, ?, ?)
    ''',
    (
        summary_data['topic'],
        summary_data['summary'],
        summary_data['source'],
        summary_data['source_url'],
        summary_data['created_at']
    ))

    connect.commit()
    connect.close()

def save_summary_to_file(summary_data, folder="summaries"):
    # save summary to text file inside summaries/folder

    if not os.path.exists(folder):
        os.makedirs(folder)

    filename = summary_data["topic"].replace(" ", "_") + ".txt"
    filepath = os.path.join(folder, filename)

    with open(filepath, "w", encoding="utf-8") as f:
        f.write(summary_data["summary"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3️⃣ How To Run
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Prepare your input:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;data/topics.txt&lt;/code&gt; with each topic on a new line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Run:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python main.py data/topics.txt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;3. Outputs:&lt;br&gt;
*&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;summaries/:&lt;/code&gt; Each topic as a separate .txt file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;db/krawlix.sqlite:&lt;/code&gt; SQLite DB with all summaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;failed_topics.txt:&lt;/code&gt; Any failed topics for troubleshooting&lt;/li&gt;
&lt;/ul&gt;
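Once a crawl finishes, the SQLite output is easy to inspect. A standalone sketch (run here against an in-memory copy of the knowledge table with one seeded row; point the connection at db/krawlix.sqlite to inspect real crawl data):

```python
import sqlite3

# ":memory:" keeps this runnable anywhere; after a real crawl, connect to
# os.path.join("db", "krawlix.sqlite") instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same table create_table() makes; one seeded row so the query has data.
cur.execute("""CREATE TABLE IF NOT EXISTS knowledge(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT, summary TEXT, source TEXT, source_url TEXT, created_at TEXT)""")
cur.execute("INSERT INTO knowledge (topic, summary, source) VALUES (?, ?, ?)",
            ("Python", "Python is a programming language.", "Wikipedia"))

# Newest crawls first
cur.execute("SELECT topic, source FROM knowledge ORDER BY id DESC")
rows = cur.fetchall()
for topic, source in rows:
    print(topic, "via", source)
```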




&lt;h2&gt;
  
  
  4️⃣ What Sets KRAWLIX Apart
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modular folder structure:&lt;/strong&gt; Not a monolithic script, but reusable, maintainable modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No external libraries:&lt;/strong&gt; Runs anywhere with basic Python 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error logging &amp;amp; resilience:&lt;/strong&gt; Failures don’t stop the pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built for extension:&lt;/strong&gt; Easily add new sources (Google, LLMs), new outputs (Markdown, CSV), or convert to API&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5️⃣ Lessons Learned &amp;amp; AI Relevance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;“The habits that make KRAWLIX robust are the same that make AI systems scale: modularity, clean storage, error handling, CLI-first design.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now ready to plug this into RAG pipelines, agent stacks, or wrap with FastAPI.&lt;/p&gt;

&lt;p&gt;Built from &lt;em&gt;Python for Everybody&lt;/em&gt; principles — but leveled up.&lt;/p&gt;

&lt;p&gt;#1FahadShah #python #cli #opensource #sqlite #ai #buildinpublic #scraping #api #web&lt;/p&gt;




&lt;p&gt;If you want a full step-by-step walk-through, advanced features, or want to see this project evolve into an API or LLM pipeline — let me know in the comments!&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Follow My Build Journey
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/1FahadShah" rel="noopener noreferrer"&gt;github.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium:&lt;/strong&gt; &lt;a href="https://1fahadshah.medium.com" rel="noopener noreferrer"&gt;1fahadshah.medium.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://linkedin.com/in/1fahadshah" rel="noopener noreferrer"&gt;linkedin.com/in/1fahadshah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Twitter/X:&lt;/strong&gt; &lt;a href="https://x.com/1FahadShah" rel="noopener noreferrer"&gt;x.com/1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hashnode:&lt;/strong&gt; &lt;a href="https://hashnode.com/@1FahadShah" rel="noopener noreferrer"&gt;hashnode.com/@1FahadShah&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal Site:&lt;/strong&gt; &lt;a href="https://1fahadshah.com" rel="noopener noreferrer"&gt;1fahadshah.com&lt;/a&gt; &lt;em&gt;(Launching soon!)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;I post every new tool, deep-dive, and lesson learned—always with code, always with execution. Got questions, want to collaborate, or building something similar? Drop a comment or DM me!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>rag</category>
      <category>basic</category>
    </item>
  </channel>
</rss>
