DEV Community: CodeFather

Page to Markdown — Privacy Policy

CodeFather — Thu, 25 Jun 2026 18:50:01 +0000

Last updated: June 25, 2026

Summary

Page to Markdown does not collect, store, transmit, or share any user data. Period.

What the extension does

Page to Markdown converts the content of the current web page into Markdown format. All processing happens entirely within your browser. No data is sent to any external server.

Data collection

We do not collect:

Personal information
Browsing history
Page content
Usage analytics
Cookies or tracking data

Local storage

The extension stores a single counter in your browser's local storage to track daily usage (for the free tier limit of 3 conversions per day). This counter:

Contains only a number and a date
Never leaves your browser
Is not transmitted anywhere
Can be cleared by removing the extension

Permissions

activeTab: Used to read the current page's content when you click the extension icon
storage: Used to store the daily usage counter locally
scripting: Used to inject the content extraction script into the current page

Third parties

This extension does not use any third-party services, analytics, or tracking tools.

Changes

If this policy changes, the updated version will be posted at this URL.

Contact

For questions about this privacy policy, reach out via Bluesky @devtoolslab.bsky.social.

I Built a Chrome Extension That Converts Any Web Page to LLM-Ready Markdown

CodeFather — Thu, 25 Jun 2026 18:30:16 +0000

Every time I wanted to feed a web page into ChatGPT or Claude, I found myself doing the same tedious dance:

Select all the text
Copy-paste into a text editor
Manually strip out navigation, ads, and cookie banners
Reformat the headings and code blocks
Wonder if I got all the relevant content

After doing this dozens of times, I built Page to Markdown — a Chrome extension that does it in one click.

What It Does

Click the extension icon on any web page. You get:

Clean Markdown with headings, lists, tables, code blocks, images, and links preserved
An estimated token count, so you know roughly how much of your context window you're using
Word and character counts for quick reference
Copy to clipboard or download as .md file

It strips out everything you don't need: navigation bars, footers, sidebars, ads, cookie banners, social media widgets, and other boilerplate.

Why Markdown?

If you're building AI workflows, Markdown is the ideal intermediate format:

LLMs understand it natively — headings convey hierarchy, code blocks preserve formatting
It's compact — typically 50-70% fewer tokens than raw HTML
It's portable — works with every LLM, RAG pipeline, and note-taking tool
It's readable — you can verify what you're feeding into your prompt

How It Works (No API Calls)

The extension runs entirely in your browser. Zero API calls, zero data sent anywhere.

Here's the approach:

Find the main content — it walks through <article>, <main>, [role='main'], .post-content, and falls back to <body>
Strip non-content elements — removes <nav>, <footer>, <aside>, cookie banners, social widgets, ad containers, and modals
Convert HTML to Markdown — a recursive converter handles each element type: headings become #, lists become -, tables become pipe-delimited, code blocks get proper fencing
Count tokens — estimates using the ~4 characters per token heuristic (close enough for planning purposes)

The whole thing is about 300 lines of JavaScript. No dependencies, no build step.
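The strip and convert steps above can be sketched in a few lines. The extension itself is JavaScript; this is only an illustration in Python using the standard library's html.parser. It handles just headings, paragraphs, list items, and inline code, and the SKIP_TAGS set is an assumption, not the extension's actual selector list.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate (assumed list).
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}
HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}


class MarkdownSketch(HTMLParser):
    """Converts a small subset of HTML to Markdown blocks."""

    def __init__(self):
        super().__init__()
        self.out = []        # finished Markdown blocks
        self.buf = []        # text pieces of the block being built
        self.prefix = ""     # Markdown prefix for the current block
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def _flush(self):
        # Collapse whitespace and emit the current block if it has text.
        text = " ".join("".join(self.buf).split())
        if text:
            self.out.append(self.prefix + text)
        self.buf, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in HEADINGS:
            self._flush()
            self.prefix = HEADINGS[tag] + " "
        elif tag == "li":
            self._flush()
            self.prefix = "- "
        elif tag == "p":
            self._flush()
        elif tag == "code":
            self.buf.append("`")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(self.skip_depth - 1, 0)
            return
        if self.skip_depth:
            return
        if tag in HEADINGS or tag in ("li", "p"):
            self._flush()
        elif tag == "code":
            self.buf.append("`")

    def handle_data(self, data):
        if not self.skip_depth:
            self.buf.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser._flush()
    lines = []
    for block in parser.out:
        # Blank line between blocks, except between consecutive list items.
        if lines and not (block.startswith("- ") and lines[-1].startswith("- ")):
            lines.append("")
        lines.append(block)
    return "\n".join(lines)
```

Calling html_to_markdown on a page fragment drops everything inside the skipped tags and keeps the rest as Markdown. Tables, images, links, and fenced code blocks would each need their own branch.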

The Token Counting Angle

This is the feature I actually use most. Before pasting content into an LLM, I want to know:

Will this fit in my context window?
How much of my budget am I using?
Should I split this across multiple prompts?

The extension shows the estimate (for example, ~1,580 tokens) right in the popup. Quick mental math, no surprises.
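For reference, the heuristic and the "will it fit?" check can be written as a small sketch. The function names and the reserved parameter are hypothetical, and a real tokenizer will give somewhat different counts than the 4-characters-per-token rule.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic; real tokenizers vary by model and content


def estimate_tokens(text: str) -> int:
    """Rough token estimate: character count divided by four, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def prompts_needed(text: str, context_window: int, reserved: int = 0) -> int:
    """How many prompts the text needs if split to fit the window.

    `reserved` is the number of tokens set aside for instructions and the reply.
    """
    budget = context_window - reserved
    if budget <= 0:
        raise ValueError("reserved tokens leave no room for content")
    return max(1, math.ceil(estimate_tokens(text) / budget))
```

A 6,320-character page comes out at an estimated 1,580 tokens, which needs two prompts if only 1,000 tokens are available for content.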

Try It

The extension is free — 3 conversions per day, unlimited with Pro.

Privacy: No accounts, no tracking, no data leaves your browser.

Built by The CodeFather — making data offers you can't refuse.

What's your workflow for feeding web content into LLMs? I'd love to hear how others handle this.

Extract GitHub Repository Data Without Hitting Rate Limits

CodeFather — Thu, 25 Jun 2026 11:47:25 +0000

GitHub's REST API is powerful but has aggressive rate limits: 60 requests per hour without a token, 5,000 with one. If you're doing any serious data extraction — searching repos, pulling contributor lists, exporting stargazers — you'll hit those limits fast.

I built a scraper that handles this properly. Here's what I learned and how you can extract GitHub data at scale without getting blocked.

The Rate Limit Problem

GitHub's API returns a 403 Forbidden or 429 Too Many Requests once you exceed your quota. The response headers tell you exactly when your limit resets:

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719324000

Without a personal access token, 60 requests per hour is nothing. A single search query + paginating through results + fetching user profiles can burn through that in minutes.

With a token (free to generate at github.com/settings/tokens), you get 5,000/hr — much more workable, but still requires careful request management.

How the Scraper Handles It

The GitHub Scraper I built handles rate limits with:

Retry after reset: on a 403 or 429 rate-limit response, it reads the X-RateLimit-Reset header and waits until the limit resets
Request budgeting: it tracks the remaining requests and slows down before hitting zero
Pagination via Link headers: GitHub returns the next page's URL in a Link: <url>; rel="next" header, so the scraper follows that URL instead of building page numbers by hand

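# Simplified version of the retry logic
import asyncio
import time
import httpx

client = httpx.AsyncClient()

async def request_with_retry(url, headers):
    resp = await client.get(url, headers=headers)
    # GitHub signals an exhausted quota with 403 or 429 plus X-RateLimit-Remaining: 0
    exhausted = resp.headers.get("X-RateLimit-Remaining") == "0"
    if resp.status_code == 429 or (resp.status_code == 403 and exhausted):
        reset_time = int(resp.headers.get("X-RateLimit-Reset", 0))
        wait = max(reset_time - time.time(), 1)  # seconds until the window resets
        await asyncio.sleep(wait)
        return await client.get(url, headers=headers)
    return resp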
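The other two items in that list, request budgeting and Link-header pagination, are not shown in the simplified snippet. Here is a minimal sketch of how they could look. The function names and the min_remaining threshold are hypothetical and are not taken from the scraper's source.

```python
import re
import time

# Matches the URL of the rel="next" entry in a Link header.
_NEXT_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="next"')


def next_page_url(link_header):
    """Return the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def seconds_to_pause(headers, now=None, min_remaining=5):
    """Return how long to sleep before the next request.

    Returns 0 while more than `min_remaining` requests are left, otherwise
    the time until X-RateLimit-Reset. `headers` is any mapping; a plain dict
    must use the exact header casing shown here.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", min_remaining + 1))
    if remaining > min_remaining:
        return 0.0
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    now = time.time() if now is None else now
    return max(reset_at - now, 0.0)
```

In a crawl loop you would sleep for seconds_to_pause(resp.headers) after each response, then continue with next_page_url(resp.headers.get("Link")) until it returns None.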

What You Can Extract

The scraper supports 6 modes, including these four:

Search repos — Find repos by keyword, language, and star count:

{
  "scrapeType": "search_repos",
  "searchQuery": "web scraping language:python stars:>100",
  "maxItems": 50
}

Scrape profiles — Get user details including email, company, location:

{
  "scrapeType": "profiles",
  "usernames": ["torvalds", "gvanrossum"],
  "enrichProfiles": true
}

Export contributors — Who's building a project:

{
  "scrapeType": "contributors",
  "repos": ["scrapy/scrapy", "microsoft/playwright"]
}

Stargazer export — Everyone who starred a repo (great for developer lead gen):

{
  "scrapeType": "stargazers",
  "repos": ["fastapi/fastapi"]
}

Example: Finding Top Python Scraping Libraries

Here's a real search result:

{
  "full_name": "nicoleahmed/Scrapling",
  "stars": 66129,
  "forks": 2840,
  "language": "Python",
  "topics": ["scraping", "web-scraping", "python"],
  "license": "BSD-3-Clause",
  "description": "Undetected, lightweight, and adaptive web scraping...",
  "url": "https://github.com/nicoleahmed/Scrapling"
}

You get stars, forks, language, topics, license, and description — all the metadata you need for research or comparison.

Profile Enrichment

The enrichProfiles option fetches full user details for every username. This is especially useful for contributor and stargazer exports:

{
  "username": "torvalds",
  "name": "Linus Torvalds",
  "email": "torvalds@linux-foundation.org",
  "company": "Linux Foundation",
  "location": "Portland, OR",
  "followers": 228000,
  "public_repos": 16,
  "bio": ""
}

Note: email and company are only available if the user has made them public on their profile.

Token or No Token?

The scraper works both ways:

Without token: 60 req/hr, good for small extractions (< 50 items)
With token: 5,000 req/hr, needed for bulk exports. The token input field is marked as secret so it's never logged or exposed

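Generate a token at github.com/settings/tokens. Reading public data needs no scopes on a classic token. The public_repo scope additionally grants write access to public repositories, so leave it unchecked unless you need it.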

Try It

Run it on Apify: github-scraper

The default input searches for "web scraping language:python stars:>100" and returns the top results — you can see the output format immediately.

Or call it via API:

curl -X POST "https://api.apify.com/v2/acts/ambitious_door~github-scraper/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"scrapeType": "search_repos", "searchQuery": "machine learning stars:>1000", "maxItems": 20}'
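If you would rather start the run from Python than from curl, the same request can be built with the standard library. This is a sketch: build_run_request is a hypothetical helper, the URL and headers are copied from the curl command above, and the network call is left commented out because it needs a valid token.

```python
import json
import urllib.request

APIFY_RUNS_URL = "https://api.apify.com/v2/acts/ambitious_door~github-scraper/runs"


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an Actor run."""
    return urllib.request.Request(
        APIFY_RUNS_URL,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_APIFY_TOKEN",
    {
        "scrapeType": "search_repos",
        "searchQuery": "machine learning stars:>1000",
        "maxItems": 20,
    },
)

# To actually start the run (network call, needs a real token):
# with urllib.request.urlopen(request) as resp:
#     run = json.load(resp)
```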

Built with Python and httpx. Source on GitHub.

How to Scrape Any Shopify Store and Get Structured Product Data

CodeFather — Thu, 25 Jun 2026 11:46:40 +0000

Most Shopify stores expose a public /products.json endpoint. No API key, no authentication, and usually no anti-bot measures. Append /products.json to the store URL and you get structured product data back.

I built a tool around this that handles pagination, variant extraction, and batch processing across multiple stores. Here's how it works and why you might want it.

The /products.json Endpoint

The endpoint looks like this:

https://any-store.myshopify.com/products.json?limit=250&page=1

It returns JSON with product titles, descriptions, pricing, variants (sizes, colors, SKUs), images, availability, and tags. Max 250 products per page, so you paginate through larger catalogs.

This works on custom domains too: if a store runs on Shopify, the endpoint is usually there, unless the owner has locked it down (for example, with a password-protected storefront).

What You Get Back

Here's what the structured output looks like per product:

{
  "title": "Classic Cotton T-Shirt",
  "vendor": "BrandName",
  "product_type": "Clothing",
  "price": 29.99,
  "price_max": 34.99,
  "compare_at_price": 49.99,
  "available": true,
  "variants_count": 3,
  "variants": [
    {"title": "S", "price": "29.99", "sku": "TS-S", "available": true},
    {"title": "M", "price": "29.99", "sku": "TS-M", "available": true},
    {"title": "L", "price": "34.99", "sku": "TS-L", "available": false}
  ],
  "tags": ["cotton", "summer", "sale"],
  "image_url": "https://cdn.shopify.com/...",
  "product_url": "https://store.com/products/classic-cotton-t-shirt",
  "store_url": "https://store.com"
}

You get min/max pricing across variants, availability status, SKUs, compare-at prices (for sale items), and direct product URLs.
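The min/max and availability fields are simple roll-ups over the variants list. Here is a minimal sketch of that roll-up. summarize_variants is a hypothetical name, and it assumes variant prices arrive as strings, as in the sample above.

```python
def summarize_variants(variants):
    """Collapse a product's variants into the top-level summary fields."""
    prices = [float(v["price"]) for v in variants]
    return {
        "price": min(prices) if prices else None,      # cheapest variant
        "price_max": max(prices) if prices else None,  # most expensive variant
        "available": any(v["available"] for v in variants),
        "variants_count": len(variants),
    }


variants = [
    {"title": "S", "price": "29.99", "sku": "TS-S", "available": True},
    {"title": "M", "price": "29.99", "sku": "TS-M", "available": True},
    {"title": "L", "price": "34.99", "sku": "TS-L", "available": False},
]
summary = summarize_variants(variants)
```

For the T-shirt example this gives a price of 29.99, a price_max of 34.99, available set to True, and a variants_count of 3, matching the sample output.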

Running It at Scale

I packaged this as an Apify Actor so you can run it in the cloud without setting up infrastructure:

{
  "storeUrls": [
    "https://competitor-a.com",
    "https://competitor-b.myshopify.com",
    "https://competitor-c.com"
  ],
  "maxProducts": 500,
  "includeVariants": true,
  "includeCollections": true
}

It handles:

Batch processing — scrape multiple stores in one run
Pagination — automatically walks through all pages
Variant extraction — every size/color/SKU with individual pricing
Collections — store categories and their metadata
Error handling — retries on failures, skips non-Shopify stores

Use Cases

Price monitoring — Run it daily on competitor stores. Track when prices drop, when items go on sale, when new products appear.

Product research — Analyzing a niche? Scrape 10 stores in that category and compare product ranges, pricing strategies, and inventory depth.

Dropshipping research — Find products across suppliers, compare margins, check availability.

Market analysis — Compare how different brands in the same space position their products and pricing.

The Technical Details

Under the hood it uses httpx for async HTTP with retry logic and redirect following (important because many Shopify stores use custom domains that redirect). It validates that a store is actually running Shopify before trying to scrape it, so you don't waste time on non-Shopify URLs.

The pagination follows Shopify's pattern: keep requesting pages until you get fewer than 250 products back, which means you've reached the end.
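That stop condition fits in a short generator. This is a sketch, not the Actor's code. The page fetcher is passed in so the loop can be tried without network access, and fetch_page_http assumes the response wraps the list in a top-level "products" key.

```python
import json
import urllib.request

PAGE_SIZE = 250  # maximum products per page, as described above


def iter_products(fetch_page, max_products=None):
    """Yield products page by page until a short page signals the end.

    `fetch_page(page, limit)` must return the list of products for that page.
    """
    page, yielded = 1, 0
    while True:
        products = fetch_page(page, PAGE_SIZE)
        for product in products:
            if max_products is not None and yielded >= max_products:
                return
            yield product
            yielded += 1
        done = max_products is not None and yielded >= max_products
        if done or len(products) < PAGE_SIZE:
            return
        page += 1


def fetch_page_http(store_url):
    """Build a fetcher for a real store (performs network calls when used)."""
    def fetch(page, limit):
        url = f"{store_url.rstrip('/')}/products.json?limit={limit}&page={page}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["products"]  # assumed top-level key
    return fetch
```

Usage would look like list(iter_products(fetch_page_http("https://store.com"), max_products=500)). A catalog whose size is an exact multiple of 250 costs one extra request, which returns an empty page.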

Try It

The quickest way to test: run it on Apify with a single store URL. The default input scrapes Shopify's own demo store so you can see the output format immediately.

If you want to integrate it programmatically, call the Apify API:

curl -X POST "https://api.apify.com/v2/acts/ambitious_door~shopify-scraper/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"storeUrls": ["https://your-target-store.com"], "maxProducts": 100}'

The data comes back as a standard Apify dataset you can export as JSON, CSV, or pipe into any workflow.

Built with Python and httpx. Source on GitHub.

How to Build a RAG Knowledge Base from Any Documentation Site in 5 Minutes

CodeFather — Thu, 25 Jun 2026 10:55:11 +0000

The Problem

You want to feed documentation into your RAG pipeline, but web scraping gives you a mess of navigation, sidebars, cookie banners, and broken formatting mixed with actual content. You spend hours cleaning up HTML before you can even start building your knowledge base.

The Solution

I built an automated extraction + chunking pipeline that converts any documentation site into clean, structured markdown ready for your vector store.

Step 1: Extract and Chunk the Docs

Using the RAG Docs Extractor on Apify, you can crawl any docs site and get chunked output with a single API call:

{
  "startUrl": "https://fastapi.tiangolo.com/",
  "maxPages": 100,
  "chunkByHeading": true
}

Each chunk in the output looks like:

{
  "url": "https://fastapi.tiangolo.com/tutorial/first-steps/",
  "title": "First Steps - FastAPI",
  "heading": "Create a FastAPI instance",
  "content": "## Create a FastAPI instance\n\nThe simplest FastAPI file could look like this...\n\n```python\nfrom fastapi import FastAPI\n\napp = FastAPI()\n```",
  "token_count": 245
}

Notice the token_count field — it uses cl100k_base encoding (GPT-4 / modern embedding models), so you know exactly how many tokens each chunk costs before embedding.
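To make the chunkByHeading idea concrete, here is a rough sketch of splitting a page's Markdown at headings. It is not the extractor's code. chunk_by_heading is a hypothetical function, and its token_count uses the 4-characters-per-token heuristic as a stand-in rather than real cl100k_base counts.

```python
import math
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code fence marker; headings inside fences are ignored


def chunk_by_heading(markdown, url, title):
    """Split Markdown into one chunk per heading section."""
    chunks, heading, lines = [], "", []
    in_fence = False

    def flush():
        content = "\n".join(lines).strip()
        if content:
            chunks.append({
                "url": url,
                "title": title,
                "heading": heading,
                "content": content,
                # Stand-in estimate, not a cl100k_base count.
                "token_count": math.ceil(len(content) / 4),
            })

    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section
            heading, lines = match.group(2).strip(), [line]
        else:
            lines.append(line)
    flush()
    return chunks
```

Tracking the fence state matters for docs sites: a Python comment such as "# install first" inside a code block would otherwise be mistaken for a heading and split the example in half.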

Step 2: Load Chunks into Your Vector Store

With LangChain and ChromaDB:

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
import json

# Load the extracted chunks (from Apify dataset export)
with open("dataset.json") as f:
    chunks = json.load(f)

# Convert to LangChain documents
docs = [
    Document(
        page_content=chunk["content"],
        metadata={
            "url": chunk["url"],
            "title": chunk["title"],
            "heading": chunk.get("heading", ""),
            "token_count": chunk["token_count"],
        }
    )
    for chunk in chunks
]

# Create vector store
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
print(f"Indexed {len(docs)} chunks")

No re-tokenization needed — the token counts are already computed.
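One way to put the precomputed counts to work is grouping chunks into batches that stay under a token cap before sending them for embedding. This is a sketch under assumptions: batch_by_tokens is a hypothetical helper, and the cap is a number you choose for your provider, not a value from the extractor.

```python
def batch_by_tokens(chunks, max_tokens_per_batch):
    """Group chunks so each batch's summed token_count stays within the cap."""
    batches, current, used = [], [], 0
    for chunk in chunks:
        cost = chunk["token_count"]
        if cost > max_tokens_per_batch:
            raise ValueError(f"chunk exceeds batch cap: {chunk.get('url')}")
        if current and used + cost > max_tokens_per_batch:
            # Current batch is full; start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(chunk)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch can then be indexed separately, for example by building one list of Document objects per batch and adding them to the vector store in turn.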

Step 3: Query Your Knowledge Base

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4")
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)

result = qa.invoke("How do I add authentication to a FastAPI app?")
print(result["result"])

Alternative: Single-Page Extraction

If you just need to convert individual pages to markdown (no chunking), use Website to Markdown instead:

{
  "startUrl": "https://docs.python.org/3/library/asyncio.html",
  "maxPages": 1
}

Output is clean markdown with token counts. Good for when you want to control your own chunking strategy or feed single pages into an LLM context window.

How the Cleaning Works

Under the hood, the extractor:

Crawls the site using Crawlee (handles rate limiting, dedup, robots.txt)
Strips noise — removes <nav>, <footer>, .sidebar, .cookie-banner, <script>, <style>, and 20+ other noise selectors
Finds content — looks for <article>, <main>, .markdown-body, .prose, etc.
Converts to markdown — preserves headings, code blocks, tables, links, lists
Counts tokens — uses cl100k_base encoding for accurate token counts

The result is clean, structured content that's ready for any RAG pipeline.
