Muhammad Ikramullah Khan

Posted on May 28

# agents.md: Teaching AI Agents How to Scrape (The Future of Web Automation)

#agents #ai #agentskills #python

You're building a Scrapy scraper. You ask Claude Code for help. "Add pagination to my spider."

Claude reads your code. It sees scraper.py, settings.py, the folder structure. Then it makes suggestions.

But the suggestions don't match your style. You use specific naming conventions. Your error handling is particular. Your Scrapy middleware is custom. Claude doesn't know any of this.

So it writes code that technically works but doesn't fit. You spend 30 minutes rewriting it to match your patterns. You ask again. Same problem. Claude doesn't learn. You have to explain everything every time.

This is frustrating. You want the AI to understand your project. Not just the code, but the philosophy. The patterns. The conventions. The way you actually work.

Then you discover agents.md.

A simple Markdown file. You drop it in your project root. It teaches AI agents how you work. Your naming conventions. Your architectural decisions. Your scraping patterns. Your error handling approach. Everything.

Now when you ask Claude Code for help, it reads agents.md first. It understands your project. The code it writes matches your style immediately. No rewriting. No repeating yourself.

You're not building a scraper anymore. You're building a scraper that teaches AI how to build scrapers.

This is agents.md. And it's changing how developers work with AI.

Let me show you.

What agents.md Actually Is (And Why It Matters)

agents.md is a Markdown file that teaches AI coding agents how to work on your project.

Think of it like this:

README.md teaches humans about your project. What it does. How to install it. How to contribute.

agents.md teaches AI about your project. The patterns you use. The decisions you made. How to write code that fits your style.

Both live in your project. Both are important. They serve different audiences.

Why This Matters for Scrapers

Scrapy projects are complex. You have spiders, pipelines, middleware, settings. You have patterns for error handling. Conventions for naming. Architectural decisions about how data flows.

When you ask AI for help, it sees all this complexity. But it doesn't understand the patterns. The philosophy. The way you think about scraping.

agents.md fixes this.

You document your patterns once. AI reads it. Now every suggestion, every piece of generated code, follows your patterns automatically.

What's the Difference: agents.md vs SKILL.md vs CLAUDE.md?

There are several variants of this concept:

agents.md: Open standard for any AI coding agent. Works with GitHub Copilot, Cursor, OpenAI Codex CLI, and more.

SKILL.md: Anthropic's format for specific reusable skills. More structured. Separate folder per skill.

CLAUDE.md: Claude Code's own project instruction file. Only Claude Code reads it.

agents.md is the tool-agnostic format, supported by the widest range of agents. This post focuses on agents.md.

The Basic Structure

The agents.md files in this post have two parts:

YAML frontmatter (optional metadata about your project; the agents.md convention itself only needs plain Markdown)
Markdown body (instructions for the agent)
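
To make the two-part layout concrete, here is a minimal sketch of how a tool might separate the frontmatter from the body. It is an illustration only: `split_frontmatter` is a hypothetical helper, it handles flat `key: value` lines rather than full YAML, and real agents may parse the file differently or simply read the whole thing as text.

```python
def split_frontmatter(text):
    """Split a document into (metadata dict, Markdown body).

    Hypothetical helper: handles only flat `key: value` lines, not full YAML.
    """
    lines = text.splitlines()
    # No opening delimiter means the whole file is body text
    if not lines or lines[0].strip() != "---":
        return {}, text

    meta = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the Markdown body
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return meta, body
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()

    # Opening delimiter without a closing one: treat the file as plain body
    return {}, text


sample = """---
name: ecommerce-scraper
description: Web scraper for e-commerce product data
---

# E-commerce Scraper
"""

meta, body = split_frontmatter(sample)
print(meta["name"])
print(body.splitlines()[0])
```

If the file has no frontmatter at all, the helper returns an empty dict and the untouched text, which matches how a plain-Markdown agents.md would be treated.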

Simple Example

Here's a minimal agents.md file:

---
name: ecommerce-scraper
description: Web scraper for e-commerce product data
---

# E-commerce Scraper

## Project Structure

- spiders/ - Scrapy spiders for different sites
- pipelines.py - Data cleaning and storage
- settings.py - Scrapy configuration
- items.py - Item definitions

## Naming Conventions

Spiders are named: spider_<site_name>.py

Examples: spider_amazon.py, spider_ebay.py

## Error Handling

All network errors are caught and logged

That's it. Simple. The agent reads this. Understands your project. Follows your style.

Parts of the YAML Frontmatter

name: Your project's identifier (used internally)

description: Short description (aim for under 100 characters). Tells the agent what the project is for.

Optional fields:

---
name: ecommerce-scraper
description: Web scraper for e-commerce product data
version: 1.0.0
author: Your Name
tags: scrapy, web-scraping, ecommerce
license: MIT
---

These fields help the agent understand the project scope. They're optional but helpful for teams.
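
For teams, a small check can catch missing or overlong fields before the file is committed. The sketch below is one possible approach, not a standard tool: `check_frontmatter` is a hypothetical function, and the 100-character ceiling simply mirrors the guideline above.

```python
REQUIRED_FIELDS = ("name", "description")
MAX_DESCRIPTION_CHARS = 100  # ceiling suggested in this post, not a formal rule


def check_frontmatter(meta):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not meta.get(field):
            problems.append(f"missing required field: {field}")
    description = meta.get("description", "")
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description is {len(description)} chars, limit is {MAX_DESCRIPTION_CHARS}"
        )
    return problems


good = {"name": "ecommerce-scraper", "description": "Web scraper for e-commerce product data"}
bad = {"name": "", "version": "1.0.0"}

print(check_frontmatter(good))
print(check_frontmatter(bad))
```

Returning a list of problems instead of raising on the first one lets a reviewer see everything that needs fixing in a single run.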

Writing agents.md for Your Scrapy Projects

Here's how to build an effective agents.md for a scraper.

Step 1: Create the File

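In your project root (tools that follow the convention look for the file as AGENTS.md, so use that casing on case-sensitive filesystems):

touch AGENTS.md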

Step 2: Add Frontmatter

---
name: product-scraper
description: Scrapy spider for scraping product data
version: 1.0.0
author: Your Name
tags: scrapy, web-scraping, python
---

Step 3: Document Project Structure

## Project Structure

- scrapers/spiders/ - All Scrapy spiders
  - One spider per website
  - Named spider_<site>.py

- scrapers/pipelines.py - Data processing
  - TextCleaningPipeline - Normalize text
  - DuplicateRemovalPipeline - Remove duplicates
  - DatabaseStoragePipeline - Save to PostgreSQL

- scrapers/items.py - Data schema definitions
- scrapers/settings.py - Scrapy configuration
- scrapers/middlewares.py - Custom middleware

Step 4: Document Naming Conventions

## Naming Conventions

Spiders: spider_<site_name>.py
- Example: spider_amazon.py, spider_ebay.py
- Class: <SiteName>Spider (CamelCase)
- Example: AmazonSpider, EbaySpider

Pipelines: <FunctionName>Pipeline
- Example: TextCleaningPipeline, DuplicateRemovalPipeline

Items: <EntityName>Item
- Example: ProductItem, ReviewItem

Methods: snake_case
- parse_product(), extract_price(), clean_text()
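
Conventions like these are easy to express as code, which also makes them easy to test. The following is a rough sketch under the naming rules listed above; the helper names `spider_filename`, `spider_classname`, and `is_snake_case` are made up for this illustration.

```python
import re


def spider_filename(site_name):
    """Build the spider file name, e.g. 'Best Buy' -> 'spider_best_buy.py'."""
    words = re.findall(r"[A-Za-z0-9]+", site_name)
    return "spider_" + "_".join(word.lower() for word in words) + ".py"


def spider_classname(site_name):
    """Build the spider class name, e.g. 'Best Buy' -> 'BestBuySpider'."""
    words = re.findall(r"[A-Za-z0-9]+", site_name)
    return "".join(word[:1].upper() + word[1:].lower() for word in words) + "Spider"


def is_snake_case(name):
    """True for method names like parse_product or clean_text."""
    return re.fullmatch(r"[a-z_][a-z0-9_]*", name) is not None


print(spider_filename("Best Buy"), spider_classname("Best Buy"))
print(is_snake_case("parse_product"), is_snake_case("parseProduct"))
```

Writing the rule down this precisely is a useful test of the agents.md wording: if you cannot turn a convention into a function, it is probably too vague for an agent as well.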

Step 5: Document Your Patterns

Include code examples showing your approach:

## Spider Patterns

### Basic Spider Structure

Every spider follows this pattern:

import scrapy

class <SiteName>Spider(scrapy.Spider):
    name = '<site-identifier>'
    allowed_domains = ['<domain.com>']
    start_urls = ['<url>']

    def parse(self, response):
        # Extract items
        for item in response.css('<selector>'):
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
            }

        # Handle pagination (the selector must return the link's href, not the tag)
        next_page = response.css('<next-page-selector>::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

### Error Handling

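Always wrap parsing logic in try-except and log the URL. Network failures such as timeouts never reach the callback; handle those with an errback: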

def parse_product(self, response):
    try:
        name = response.css('.name::text').get()
        if not name:
            self.logger.warning(f"No name found on {response.url}")
            return

        price = response.css('.price::text').get()
        # process...

    except Exception as e:
        self.logger.error(f"Error parsing {response.url}: {e}")
        return
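
If every callback repeats the same try-except, the pattern can be factored into a decorator. This is only a sketch of one way to do it: `log_parse_errors` and `DemoSpider` are hypothetical names, and `DemoSpider` is a plain class standing in for a real `scrapy.Spider` subclass so the example runs without Scrapy installed.

```python
import functools
import logging
from types import SimpleNamespace


def log_parse_errors(callback):
    """Wrap a spider callback so any exception is logged with the page URL."""

    @functools.wraps(callback)
    def wrapper(self, response, **kwargs):
        try:
            result = callback(self, response, **kwargs)
            # Callbacks may return None, a list, or a generator
            if result is not None:
                yield from result
        except Exception as exc:
            self.logger.error("Error parsing %s: %s", response.url, exc)

    return wrapper


class DemoSpider:
    """Stand-in for a scrapy.Spider subclass; only `logger` is needed here."""

    logger = logging.getLogger("demo")

    @log_parse_errors
    def parse_product(self, response):
        yield {"name": response.data["name"], "url": response.url}


# Fake responses: the second one is missing the field the callback expects
ok = SimpleNamespace(url="https://example.com/p/1", data={"name": "Widget"})
broken = SimpleNamespace(url="https://example.com/p/2", data={})

spider = DemoSpider()
items = list(spider.parse_product(ok)) + list(spider.parse_product(broken))
print(items)
```

The broken page produces a log line and no item, while the good page still yields its item. Whether to use a decorator or explicit try-except blocks is a style choice; whichever you pick, state it in agents.md so the agent does not mix the two.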

### Price Extraction

Prices should be extracted as floats:

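import re

def clean_price(self, price_str):
    # Selectors return None when nothing matches, so guard before searching
    if not price_str:
        return None
    # Drop thousands separators so "1,299.99" parses as 1299.99
    match = re.search(r'\d+(?:\.\d+)?', price_str.replace(',', ''))
    return float(match.group(0)) if match else None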

### Data Validation

Check for required fields before yielding:

def parse(self, response):
    for item in response.css('.product'):
        product = {
            'name': item.css('.name::text').get(),
            # get('') avoids passing None to clean_price when the selector misses
            'price': self.clean_price(item.css('.price::text').get('')),
            'url': response.url,
        }

        # Only yield if required fields are present (a price of 0.0 is still valid)
        if product['name'] and product['price'] is not None:
            yield product
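
The required-field check can also live in one reusable function instead of an inline condition. Here is a minimal sketch, assuming the required fields are name, price, and url as in the examples above; `missing_fields` is a hypothetical helper, not part of Scrapy.

```python
REQUIRED_FIELDS = ("name", "price", "url")


def missing_fields(item, required=REQUIRED_FIELDS):
    """Return the required fields that are absent, None, or empty strings."""
    return [
        field
        for field in required
        if item.get(field) is None or item.get(field) == ""
    ]


free_sample = {"name": "Sticker", "price": 0.0, "url": "https://example.com/s"}
no_price = {"name": "Widget", "price": None, "url": "https://example.com/w"}

print(missing_fields(free_sample))
print(missing_fields(no_price))
```

Checking for `None` explicitly matters here: a plain truthiness test would treat a legitimate price of 0.0 as missing and silently drop the item.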

Step 6: Document Configuration

## Scrapy Settings

Key settings in settings.py:

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

Pipelines (in order):
1. TextCleaningPipeline (normalize text)
2. DuplicateRemovalPipeline (remove duplicates)
3. DatabaseStoragePipeline (save to database)
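
The pipeline order comes from the integer priorities in `ITEM_PIPELINES`, not from the order the entries are written in. The snippet below sketches that idea; the dotted paths assume the pipelines live in scrapers/pipelines.py, and the sorting is only a way to display the order, not how Scrapy is implemented internally.

```python
# Priorities as they might appear in settings.py (deliberately out of order)
ITEM_PIPELINES = {
    "scrapers.pipelines.DatabaseStoragePipeline": 300,
    "scrapers.pipelines.TextCleaningPipeline": 100,
    "scrapers.pipelines.DuplicateRemovalPipeline": 200,
}

# Items pass through pipelines in ascending priority, not dict order
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(path.rsplit(".", 1)[-1])
```

Leaving gaps of 100 between priorities makes it easy to slot a new pipeline between two existing ones later.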

Step 7: Document Common Tasks

## Common Tasks

### Adding a New Spider

1. Create scrapers/spiders/spider_<sitename>.py
2. Inherit from scrapy.Spider
3. Define name, allowed_domains, start_urls
4. Implement parse() method
5. Return items with required fields (name, price, url)
6. Test locally: scrapy crawl <spider-name>

### Adding a New Pipeline

1. Create new class in pipelines.py
2. Implement process_item() method
3. Add to ITEM_PIPELINES in settings.py
4. Assign integer priority (lower = earlier)
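
The pipeline steps above can be sketched as a small class. This is an assumption-laden example rather than the project's actual code: it deduplicates by URL, which may not be the right key for your data, and it defines a local `DropItem` stand-in so it runs without Scrapy installed. In a real project you would import `DropItem` from `scrapy.exceptions`.

```python
class DropItem(Exception):
    """Local stand-in so the sketch runs without Scrapy installed.

    In a real project, import DropItem from scrapy.exceptions instead.
    """


class DuplicateRemovalPipeline:
    """Drop any item whose URL has already been seen in this crawl."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            # Raising DropItem stops the item from reaching later pipelines
            raise DropItem(f"Duplicate item: {url}")
        self.seen_urls.add(url)
        return item


pipeline = DuplicateRemovalPipeline()
first = pipeline.process_item({"name": "Widget", "url": "https://example.com/w"}, spider=None)

try:
    pipeline.process_item({"name": "Widget", "url": "https://example.com/w"}, spider=None)
    dropped = False
except DropItem:
    dropped = True

print(first["name"], dropped)
```

Returning the item passes it on to the next pipeline; raising `DropItem` discards it.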

### Debugging a Spider

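Debug a specific spider (-a passes debug as a spider argument, so the spider must read it):
scrapy crawl spider_name -a debug=True

Inspect the downloaded HTML (JavaScript is not executed):
scrapy shell 'https://example.com'
response.css('.selector').get()

Inspect items before any pipeline runs (temporarily disable all pipelines):
ITEM_PIPELINES = {}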

Step 8: Document Dependencies

## Dependencies

Core:
- Scrapy 2.9+
- Python 3.8+

Data Processing:
- pandas (data cleaning)
- sqlalchemy (database)
- psycopg2 (PostgreSQL)

Testing:
- pytest
- pytest-scrapy

Install all:
pip install -r requirements.txt

Real Item Structure

Document what your items actually look like:

## Item Format

All items follow this structure:

{
    'name': str,              # Product name (required)
    'price': float,           # Price in USD (required)
    'original_price': float,  # Before discount (optional)
    'rating': float,          # 0-5 stars (optional)
    'review_count': int,      # Number of reviews (optional)
    'url': str,               # Source URL (required)
    'site': str,              # Which site (amazon, ebay, etc)
    'scraped_at': str,        # ISO datetime
}
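
The same structure can be written as a typed class, which gives the agent (and your editor) something firmer than a comment to follow. The sketch below uses a standard-library dataclass; `ProductRecord` and `utc_now_iso` are names invented for this example, and using UTC for `scraped_at` is an assumption.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def utc_now_iso():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ProductRecord:
    # Required fields
    name: str
    price: float
    url: str
    site: str
    # Optional fields default to None when the page does not show them
    original_price: Optional[float] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
    # Filled in automatically at creation time
    scraped_at: str = field(default_factory=utc_now_iso)


record = ProductRecord(
    name="Widget", price=19.99, url="https://example.com/w", site="example"
)
item = asdict(record)
print(sorted(item))
```

Forgetting a required field now fails immediately with a `TypeError` at construction time instead of producing a half-empty row in the database.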

How to Use agents.md with AI Coding Agents

With Claude Code

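Claude Code's native instruction file is CLAUDE.md. To have it pick up agents.md as well, import the file from CLAUDE.md with a line such as @agents.md, or symlink one file to the other.

Then just ask for help and Claude will have your patterns in context.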

Example:
"Add a spider for Target.com following the existing patterns"

Claude reads agents.md. Knows your naming conventions. Your error handling. Your spider template. Generates code that fits perfectly.

With GitHub Copilot

Create agents.md in your repo root. GitHub Copilot's chat and agent features can read it in VS Code and on GitHub.

When you ask Copilot to write a new spider, it can follow your patterns.

With Cursor

Same as GitHub Copilot. Cursor reads agents.md automatically.

With OpenAI Codex CLI

Codex CLI looks for agents.md in your repository and loads it at the start of a session.

Progressive Disclosure: Growing Your agents.md

Start small. Add more as you need it.

Version 1: Minimal (Start Here)

---
name: my-scraper
description: Scrapy scraper for product data
---

## Project Structure

- spiders/ - Scrapy spiders
- pipelines.py - Data processing
- items.py - Data schema

## Naming

Spiders: spider_<site>.py
Methods: snake_case

## Error Handling

Wrap in try-except. Log errors with URL.

Start with this. Get working. Then expand.

Version 2: Patterns

Add spider template, pipeline patterns, common tasks.

Version 3: Complete

Add full reference, all patterns, all conventions, troubleshooting.

Version 4: Advanced

Add complex patterns, edge cases, performance tips.
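
As the file grows through these versions, it helps to keep an eye on its size. Here is a minimal sketch of a length check, assuming the 2,000-word budget this post recommends; `count_words` and `over_budget` are hypothetical helpers, and the budget is a rule of thumb rather than a limit imposed by any tool.

```python
import re

WORD_LIMIT = 2000  # budget suggested in this post, not a hard limit of any tool


def count_words(text):
    """Count whitespace-separated tokens that contain a letter or digit."""
    return sum(1 for token in text.split() if re.search(r"\w", token))


def over_budget(text, limit=WORD_LIMIT):
    """True when the document exceeds the word budget."""
    return count_words(text) > limit


short_doc = "## Naming\n\nSpiders: spider_<site>.py\nMethods: snake_case\n"
long_doc = "word " * 2500

print(count_words(short_doc), over_budget(short_doc))
print(count_words(long_doc), over_budget(long_doc))
```

In practice you would pass in the file contents, for example `Path("agents.md").read_text()` from `pathlib`, and run the check before committing.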

Common Mistakes (And How to Fix Them)

Mistake 1: Too Long

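You write a 5,000-word agents.md. Too much. The file is loaded into the agent's context alongside your code, so a long one crowds out the code and buries the rules that matter.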

Fix: Keep it under 2,000 words. Use progressive disclosure. Reference external files:

"See docs/advanced-patterns.md for complex scraping scenarios."

Mistake 2: Too Vague

"Use good names."

Too vague. AI doesn't know what good means.

Fix: Be specific. Show examples.

"Spiders: spider_<site_name>.py. Classes: <SiteName>Spider. Methods: snake_case."

Mistake 3: Outdated

You update your patterns but forget to update agents.md.

AI follows the old patterns from the file.

Fix: Update agents.md whenever you change patterns. Commit it to Git.
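
One cheap signal that the file has drifted is a mismatch between the spider files it mentions and the ones that actually exist. The sketch below illustrates the idea under that assumption; `referenced_spider_files` and `find_drift` are hypothetical helpers, and the check only covers file names, not the patterns themselves.

```python
import re


def referenced_spider_files(agents_text):
    """Collect concrete spider file names such as spider_amazon.py."""
    return set(re.findall(r"\bspider_[a-z0-9_]+\.py\b", agents_text))


def find_drift(agents_text, files_on_disk):
    """Compare documented spider files with the ones that actually exist."""
    documented = referenced_spider_files(agents_text)
    on_disk = set(files_on_disk)
    return {
        "documented_but_missing": sorted(documented - on_disk),
        "undocumented": sorted(on_disk - documented),
    }


agents_text = "Examples: spider_amazon.py, spider_ebay.py"
files_on_disk = ["spider_amazon.py", "spider_walmart.py"]

print(find_drift(agents_text, files_on_disk))
```

Placeholders like `spider_<site>.py` are ignored by the regular expression, so only concrete file names are compared.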

Mistake 4: Only Examples, No Explanation

You show an example but give no context.

Fix: Explain why. Show both example and reasoning.

Mistake 5: Assumes Too Much Knowledge

"Remember to use kwargs for flexibility."

New developers don't understand this.

Fix: Explain for beginners. Show what it does.
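
For the kwargs advice, a short example says more than the sentence does. The sketch below shows the pattern with a plain `BaseSpider` class standing in for `scrapy.Spider`, which similarly stores extra keyword arguments as attributes; `CategorySpider`, `category`, and `max_pages` are names invented for this illustration.

```python
class BaseSpider:
    """Stand-in for scrapy.Spider so the sketch runs without Scrapy."""

    def __init__(self, name=None, **kwargs):
        self.name = name
        # Any keyword argument not named above becomes an attribute
        self.__dict__.update(kwargs)


class CategorySpider(BaseSpider):
    def __init__(self, category="all", **kwargs):
        # Pass everything we do not handle ourselves up to the base class
        super().__init__(**kwargs)
        self.category = category


spider = CategorySpider(category="laptops", name="example", max_pages="5")
print(spider.category, spider.name, spider.max_pages)
```

With a real spider, values passed on the command line via `-a category=laptops` arrive through the same keyword arguments, always as strings. Accepting `**kwargs` means new arguments can be added later without changing every subclass.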

agents.md for Teams

When multiple developers work on the scraper, agents.md becomes crucial.

Enforcing Standards

All developers read the same agents.md. Everyone follows the same patterns.

No more debates: "Should this be snake_case or CamelCase?" It's in agents.md.

Onboarding New Developers

New developer joins. You say: "Read agents.md. It explains how we work."

They read one file. Now they can contribute following your patterns immediately.

Faster Code Review

Reviewer checks if code follows agents.md patterns. Reduces comments about style.

Focus on logic, not formatting.

AI Helps Enforce Patterns

Add this to agents.md:

"AI will check:

All methods use snake_case
All spiders inherit from scrapy.Spider
All errors are logged with URL
All items validated before yielding"

When you ask Claude Code for help, it has this checklist in context and can apply it to the code it writes.
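
Instructions in agents.md are guidance, so it can help to back the most important ones with an automated check. As a rough sketch of the snake_case rule, the following uses Python's standard `ast` module; `non_snake_case_functions` is a hypothetical helper, and a real project might rely on an existing linter instead.

```python
import ast
import re

SNAKE_CASE = re.compile(r"[a-z_][a-z0-9_]*")


def non_snake_case_functions(source):
    """Return names of functions and methods that are not snake_case."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not SNAKE_CASE.fullmatch(node.name)
    ]


sample = '''
class AmazonSpider:
    def parse_product(self, response):
        pass

    def extractPrice(self, response):
        pass
'''

print(non_snake_case_functions(sample))
```

Run over each spider file in CI, a check like this catches the cases where a person or an agent ignored the convention.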

Real-World Example: Complete agents.md

Here's a complete agents.md for an e-commerce scraper:

---
name: ecommerce-scraper
description: Scrapy spider for scraping product data
version: 1.0.0
author: Your Team
tags: scrapy, web-scraping, ecommerce, python
license: MIT
---

# E-commerce Product Scraper

This Scrapy project scrapes product data from multiple e-commerce websites.

## Quick Facts

- Framework: Scrapy 2.9+
- Python: 3.8+
- Database: PostgreSQL
- Sites: Amazon, eBay, Walmart, Best Buy
- Update Frequency: Daily via cron job

## Project Structure

ecommerce-scraper/
├── scrapers/
│   ├── spiders/
│   │   ├── spider_amazon.py
│   │   ├── spider_ebay.py
│   │   └── spider_walmart.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── middlewares.py
├── requirements.txt
└── agents.md

## Data Flow

Start URLs
    ↓
Parse Product Page
    ↓
Extract: name, price, rating, reviews, url
    ↓
Validation Pipeline
    ↓
Text Cleaning
    ↓
Deduplication
    ↓
Database Storage

## Item Structure

All items follow this structure:

{
    'name': str,              # Product name (required)
    'price': float,           # Price in USD (required)
    'original_price': float,  # Before discount (optional)
    'rating': float,          # 0-5 stars (optional)
    'review_count': int,      # Number of reviews (optional)
    'url': str,               # Source URL (required)
    'site': str,              # Which site
    'scraped_at': str,        # ISO datetime
}

## Naming Conventions

Spiders: spider_<site>.py
Classes: <SiteName>Spider (CamelCase)
Methods: snake_case

Examples:
- Spider file: spider_amazon.py
- Class: AmazonSpider
- Method: extract_price()

## Spider Template

import scrapy

class <SiteName>Spider(scrapy.Spider):
    name = '<site-slug>'
    allowed_domains = ['<domain.com>']
    start_urls = ['<start-url>']

    def parse(self, response):
        try:
            # Extract products
            for product in response.css('<selector>'):
                yield self.extract_product(product, response)

            # Handle pagination (the selector must return the link's href, not the tag)
            next_page = response.css('<next-selector>::attr(href)').get()
            if next_page:
                yield response.follow(next_page, self.parse)

        except Exception as e:
            self.logger.error(f"Error parsing {response.url}: {e}")

    def extract_product(self, product, response):
        return {
            'name': product.css('.name::text').get('').strip(),
            'price': self.extract_price(product),
            'url': response.urljoin(product.css('a::attr(href)').get()),
            'site': self.name,
        }

    def extract_price(self, product):
        import re
        # Strip thousands separators so "1,299.99" parses as 1299.99
        price_str = product.css('.price::text').get('').replace(',', '')
        match = re.search(r'\d+(?:\.\d+)?', price_str)
        return float(match.group(0)) if match else None

## Error Handling

Always wrap in try-except. Log errors with URL.

## Scrapy Settings

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

ITEM_PIPELINES = {
    'scrapers.pipelines.TextCleaningPipeline': 100,
    'scrapers.pipelines.DuplicateRemovalPipeline': 200,
    'scrapers.pipelines.DatabaseStoragePipeline': 300,
}

## Common Tasks

Adding a New Spider:
1. Create scrapers/spiders/spider_<sitename>.py
2. Inherit from scrapy.Spider
3. Define name, allowed_domains, start_urls
4. Implement parse() method
5. Test: scrapy crawl <spider-name>

Testing a Spider:
scrapy shell 'https://example.com'
response.css('.selector').get()

Running All Spiders:
scrapy crawl amazon && scrapy crawl ebay && scrapy crawl walmart

## Dependencies

- Scrapy 2.9+
- Python 3.8+
- sqlalchemy
- psycopg2-binary
- pandas

Install: pip install -r requirements.txt

This is a complete agents.md. When Claude Code reads this, it understands everything about your project. Every suggestion follows your patterns.

Summary

agents.md teaches AI agents how your project works.

What it is:

Markdown file in project root. YAML frontmatter + Markdown body. Describes patterns, conventions, architecture. Read by GitHub Copilot, Cursor, and Codex CLI; Claude Code can load it through an import in CLAUDE.md.

Why it matters:

AI writes code matching your style. No rewriting AI-generated code. Faster onboarding for new developers. Consistent project standards. Works with all AI coding agents.

How to use it:

Create agents.md in project root
Document your patterns (spiders, pipelines, naming)
Add examples
AI reads it automatically
Ask for help confidently

Best practices:

Keep it under 2,000 words. Be specific, not vague. Update when you change patterns. Use progressive disclosure. Explain the why, not just examples.

For teams:

One agents.md = everyone on same page. Faster code review. Easier onboarding. Enforce standards without debate.

You're not just building a scraper. You're building a scraper that teaches AI how to build scrapers.

The future of development is humans and AI working together. agents.md is how you teach the AI to be a good collaborator.

Next Steps:

Create agents.md in your next project (or existing one)
Document your patterns
Ask Claude Code or GitHub Copilot for help
Watch how it follows your style perfectly
Update agents.md as patterns evolve

The investment in documenting patterns pays dividends every time you ask for AI help.
