You're building a Scrapy scraper. You ask Claude Code for help. "Add pagination to my spider."
Claude reads your code. It sees scraper.py, settings.py, the folder structure. Then it makes suggestions.
But the suggestions don't match your style. You use specific naming conventions. Your error handling is particular. Your Scrapy middleware is custom. Claude doesn't know any of this.
So it writes code that technically works but doesn't fit. You spend 30 minutes rewriting it to match your patterns. You ask again. Same problem. Claude doesn't learn. You have to explain everything every time.
This is frustrating. You want the AI to understand your project. Not just the code, but the philosophy. The patterns. The conventions. The way you actually work.
Then you discover agents.md.
A simple Markdown file. You drop it in your project root. It teaches AI agents how you work. Your naming conventions. Your architectural decisions. Your scraping patterns. Your error handling approach. Everything.
Now when you ask Claude Code for help, it reads agents.md first. It understands your project. The code it writes matches your style immediately. No rewriting. No repeating yourself.
You're not building a scraper anymore. You're building a scraper that teaches AI how to build scrapers.
This is agents.md. And it's changing how developers work with AI.
Let me show you.
What agents.md Actually Is (And Why It Matters)
agents.md is a Markdown file that teaches AI coding agents how to work on your project.
Think of it like this:
README.md teaches humans about your project. What it does. How to install it. How to contribute.
agents.md teaches AI about your project. The patterns you use. The decisions you made. How to write code that fits your style.
Both live in your project. Both are important. They serve different audiences.
Why This Matters for Scrapers
Scrapy projects are complex. You have spiders, pipelines, middleware, settings. You have patterns for error handling. Conventions for naming. Architectural decisions about how data flows.
When you ask AI for help, it sees all this complexity. But it doesn't understand the patterns. The philosophy. The way you think about scraping.
agents.md fixes this.
You document your patterns once. AI reads it. Now every suggestion, every piece of generated code, follows your patterns automatically.
What's the Difference: agents.md vs SKILL.md vs CLAUDE.md?
There are several variants of this concept:
agents.md (usually written AGENTS.md): An open format for any AI coding agent. Tools such as OpenAI Codex CLI, Cursor, and GitHub Copilot's agent features support it.
SKILL.md: Anthropic's format for specific reusable skills. More structured. Separate folder per skill.
CLAUDE.md: Claude Code's own project instructions file, loaded automatically at the start of each session.
agents.md is the tool-agnostic format. This post focuses on agents.md.
The Basic Structure
The agents.md files in this post have two parts:
- YAML frontmatter (metadata about your project; optional, since the format itself is plain Markdown)
- Markdown body (instructions for the agent)
Simple Example
Here's a minimal agents.md file:
---
name: ecommerce-scraper
description: Web scraper for e-commerce product data
---
# E-commerce Scraper
## Project Structure
- spiders/ - Scrapy spiders for different sites
- pipelines.py - Data cleaning and storage
- settings.py - Scrapy configuration
- items.py - Item definitions
## Naming Conventions
Spiders are named: spider_<site_name>.py
Examples: spider_amazon.py, spider_ebay.py
## Error Handling
All network errors are caught and logged
That's it. Simple. The agent reads this. Understands your project. Follows your style.
Parts of the YAML Frontmatter
name: Your project's identifier (used internally)
description: Short description (50-100 chars). Tells the agent when to use this.
Optional fields:
---
name: ecommerce-scraper
description: Web scraper for e-commerce product data
version: 1.0.0
author: Your Name
tags: scrapy, web-scraping, ecommerce
license: MIT
---
These fields help the agent understand the project scope. They're optional but helpful for teams.
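To make the two-part layout concrete, here is a minimal sketch that splits a file into frontmatter and body using only the standard library. The `split_frontmatter` helper is hypothetical (no tool exposes it), and it assumes flat `key: value` lines; nested values would need a real YAML parser.

```python
def split_frontmatter(text):
    """Split a document into (metadata dict, Markdown body).

    Assumes the file starts with '---', followed by flat 'key: value'
    lines, followed by a closing '---'. Hypothetical helper, not part
    of any tool's API.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter: the whole file is the body

    metadata = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[index + 1:])
            return metadata, body
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()

    return {}, text  # unterminated frontmatter: treat everything as body


SAMPLE = """---
name: ecommerce-scraper
description: Web scraper for e-commerce product data
---
# E-commerce Scraper
"""

meta, body = split_frontmatter(SAMPLE)
print(meta["name"])  # ecommerce-scraper
print(body)          # # E-commerce Scraper
```

A file without frontmatter comes back as an empty dict plus the full text, which matches the fact that the metadata block is optional.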
Writing agents.md for Your Scrapy Projects
Here's how to build an effective agents.md for a scraper.
Step 1: Create the File
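In your project root (tools that follow the open format look for the uppercase filename AGENTS.md, which matters on case-sensitive filesystems):
touch AGENTS.md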
Step 2: Add Frontmatter
---
name: product-scraper
description: Scrapy spider for scraping product data
version: 1.0.0
author: Your Name
tags: scrapy, web-scraping, python
---
Step 3: Document Project Structure
## Project Structure
- scrapers/spiders/ - All Scrapy spiders
  - One spider per website
  - Named spider_<site>.py
- scrapers/pipelines.py - Data processing
  - TextCleaningPipeline - Normalize text
  - DuplicateRemovalPipeline - Remove duplicates
  - DatabaseStoragePipeline - Save to PostgreSQL
- scrapers/items.py - Data schema definitions
- scrapers/settings.py - Scrapy configuration
- scrapers/middlewares.py - Custom middleware
Step 4: Document Naming Conventions
## Naming Conventions
Spiders: spider_<site_name>.py
- Example: spider_amazon.py, spider_ebay.py
- Class: <SiteName>Spider (CamelCase)
- Example: AmazonSpider, EbaySpider
Pipelines: <FunctionName>Pipeline
- Example: TextCleaningPipeline, DuplicateRemovalPipeline
Items: <EntityName>Item (named after the data they hold, not the site)
- Example: ProductItem, ReviewItem
Methods: snake_case
- parse_product(), extract_price(), clean_text()
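Conventions like these are easy to check mechanically. The following is a minimal sketch, assuming the three rules listed above; the `check_names` function and the regular expressions are illustrative, not part of Scrapy or of any agent tooling.

```python
import re

# Patterns mirror the conventions listed above (assumed, adjust to taste).
SPIDER_FILE = re.compile(r"^spider_[a-z0-9_]+\.py$")
SPIDER_CLASS = re.compile(r"^[A-Z][A-Za-z0-9]*Spider$")
SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")


def check_names(filename, class_name, method_names):
    """Return a list of human-readable convention violations."""
    problems = []
    if not SPIDER_FILE.match(filename):
        problems.append(f"file '{filename}' should look like spider_<site_name>.py")
    if not SPIDER_CLASS.match(class_name):
        problems.append(f"class '{class_name}' should look like <SiteName>Spider")
    for method in method_names:
        if not SNAKE_CASE.match(method):
            problems.append(f"method '{method}' should be snake_case")
    return problems


# A conforming spider produces no complaints.
print(check_names("spider_amazon.py", "AmazonSpider", ["parse_product", "clean_text"]))

# A non-conforming one is reported rule by rule.
for problem in check_names("AmazonSpider.py", "amazon_spider", ["parseProduct"]):
    print(problem)
```

Writing the rules down as patterns also forces them to be specific, which is what the agent needs.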
Step 5: Document Your Patterns
Include code examples showing your approach:
## Spider Patterns
### Basic Spider Structure
Every spider follows this pattern:
import scrapy


class <SiteName>Spider(scrapy.Spider):
    name = '<site-identifier>'
    allowed_domains = ['<domain.com>']
    start_urls = ['<url>']

    def parse(self, response):
        # Extract items
        for item in response.css('<selector>'):
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
            }

        # Handle pagination (the selector should target the link's href,
        # e.g. end in ::attr(href), so .get() returns a URL)
        next_page = response.css('<next-page-selector>').get()
        if next_page:
            yield response.follow(next_page, self.parse)
### Error Handling
Wrap parsing logic in try-except and log the URL. (Download failures never reach the callback; handle those with a request errback.)
def parse_product(self, response):
    try:
        name = response.css('.name::text').get()
        if not name:
            self.logger.warning(f"No name found on {response.url}")
            return
        price = response.css('.price::text').get()
        # process...
    except Exception as e:
        self.logger.error(f"Error parsing {response.url}: {e}")
        return
### Price Extraction
Prices should be extracted as floats:
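import re

def clean_price(self, price_str):
    if not price_str:
        return None  # selector matched nothing
    # Drop thousands separators so "$1,299.99" parses as 1299.99
    match = re.search(r'\d+(?:\.\d+)?', price_str.replace(',', ''))
    return float(match.group(0)) if match else None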
### Data Validation
Check for required fields before yielding:
def parse(self, response):
    for item in response.css('.product'):
        product = {
            'name': item.css('.name::text').get(),
            'price': self.clean_price(item.css('.price::text').get()),
            'url': response.url,
        }
        # Only yield if required fields are present
        # (compare price to None so a price of 0.0 is not dropped)
        if product['name'] and product['price'] is not None:
            yield product
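The same check can live in a small standalone function, which makes it easy to test outside a spider. This is a sketch under assumptions: `missing_required_fields` is a hypothetical helper, and the required fields (name, price, url) are taken from the item format used in this post.

```python
REQUIRED_FIELDS = ("name", "price", "url")


def missing_required_fields(item, required=REQUIRED_FIELDS):
    """Return the required keys that are absent, None, or blank strings."""
    missing = []
    for field in required:
        value = item.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


good = {"name": "Desk Lamp", "price": 0.0, "url": "https://example.com/lamp"}
bad = {"name": "   ", "price": None, "url": "https://example.com/x"}

print(missing_required_fields(good))  # []
print(missing_required_fields(bad))   # ['name', 'price']
```

Returning the list of missing fields, rather than a bare True or False, gives the log message something useful to say.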
Step 6: Document Configuration
## Scrapy Settings
Key settings in settings.py:
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
Pipelines (in order):
1. TextCleaningPipeline (normalize text)
2. DuplicateRemovalPipeline (remove duplicates)
3. DatabaseStoragePipeline (save to database)
Step 7: Document Common Tasks
## Common Tasks
### Adding a New Spider
1. Create scrapers/spiders/spider_<sitename>.py
2. Inherit from scrapy.Spider
3. Define name, allowed_domains, start_urls
4. Implement parse() method
5. Return items with required fields (name, price, url)
6. Test locally: scrapy crawl <spider-name>
### Adding a New Pipeline
1. Create new class in pipelines.py
2. Implement process_item() method
3. Add to ITEM_PIPELINES in settings.py
4. Assign integer priority (lower = earlier)
### Debugging a Spider
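Debug a specific spider (debug is a custom argument passed with -a; the spider has to read it):
scrapy crawl <spider-name> -a debug=True
Inspect the fetched HTML (Scrapy does not run JavaScript):
scrapy shell 'https://example.com'
response.css('.selector').get()
Check all items before they reach the pipelines (temporarily, in settings.py):
ITEM_PIPELINES = {}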
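The "Adding a New Pipeline" steps above can be sketched as follows. The class name `DuplicateRemovalPipeline` comes from this post, but the body is an assumption: it deduplicates by URL and expects dict-like items. The fallback `DropItem` class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicateRemovalPipeline:
    """Drop items whose URL was already seen during this crawl (assumed rule)."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            # Raising DropItem tells Scrapy to stop processing this item.
            raise DropItem(f"Duplicate item: {url}")
        self.seen_urls.add(url)
        return item  # pass the item on to the next pipeline


# Quick demonstration outside Scrapy.
pipeline = DuplicateRemovalPipeline()
first = pipeline.process_item({"url": "https://example.com/a"}, spider=None)
try:
    pipeline.process_item({"url": "https://example.com/a"}, spider=None)
    dropped = False
except DropItem:
    dropped = True

print(first["url"], dropped)
```

To activate a pipeline like this, register its import path in ITEM_PIPELINES with an integer priority; lower numbers run earlier.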
Step 8: Document Dependencies
## Dependencies
Core:
- Scrapy 2.9+
- Python 3.8+
Data Processing:
- pandas (data cleaning)
- sqlalchemy (database)
- psycopg2 (PostgreSQL)
Testing:
- pytest
- pytest-scrapy
Install all:
pip install -r requirements.txt
Real Item Structure
Document what your items actually look like:
## Item Format
All items follow this structure:
{
    'name': str,              # Product name (required)
    'price': float,           # Price in USD (required)
    'original_price': float,  # Before discount (optional)
    'rating': float,          # 0-5 stars (optional)
    'review_count': int,      # Number of reviews (optional)
    'url': str,               # Source URL (required)
    'site': str,              # Which site (amazon, ebay, etc)
    'scraped_at': str,        # ISO datetime
}
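One way to keep real items aligned with this structure is to build them in a single place. The sketch below is illustrative: `ProductRecord` and `build_item` are hypothetical names, and stamping `scraped_at` in UTC is an assumption, since the format above only says "ISO datetime".

```python
from datetime import datetime, timezone
from typing import Optional, TypedDict


class ProductRecord(TypedDict, total=False):
    """Type hints for the documented item shape (hypothetical name)."""
    name: str
    price: float
    original_price: Optional[float]
    rating: Optional[float]
    review_count: Optional[int]
    url: str
    site: str
    scraped_at: str


def build_item(name, price, url, site, **optional):
    """Assemble an item dict in the documented shape and stamp it in UTC."""
    item: ProductRecord = {
        "name": name,
        "price": float(price),
        "url": url,
        "site": site,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }
    item.update(optional)  # e.g. rating=4.5, review_count=120
    return item


item = build_item("Desk Lamp", "19.99", "https://example.com/lamp", "amazon", rating=4.5)
print(item["price"], item["rating"], item["site"])  # 19.99 4.5 amazon
```

The TypedDict is only a hint for editors and type checkers; nothing is enforced at runtime, so the required-field check still belongs in the spider or a pipeline.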
How to Use agents.md with AI Coding Agents
With Claude Code
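Claude Code's native instructions file is CLAUDE.md. To have it pick up agents.md, import the file from CLAUDE.md (a line containing @AGENTS.md) or ask Claude to read it; check the current Claude Code docs, since support changes quickly.
Then just ask for help and Claude will follow your patterns.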
Example:
"Add a spider for Target.com following the existing patterns"
Claude reads agents.md. Knows your naming conventions. Your error handling. Your spider template. Generates code that fits perfectly.
With GitHub Copilot
Create agents.md in your repo root. Copilot's agent features (the coding agent on GitHub and agent mode in VS Code) can read it.
When you ask Copilot to write a new spider, it has your patterns as context.
With Cursor
Much the same. Cursor picks up agents.md from the project root.
With OpenAI Codex CLI
Codex looks for agents.md in the repository and loads it before it starts working.
Progressive Disclosure: Growing Your agents.md
Start small. Add more as you need it.
Version 1: Minimal (Start Here)
---
name: my-scraper
description: Scrapy scraper for product data
---
## Project Structure
- spiders/ - Scrapy spiders
- pipelines.py - Data processing
- items.py - Data schema
## Naming
Spiders: spider_<site>.py
Methods: snake_case
## Error Handling
Wrap in try-except. Log errors with URL.
Start with this. Get working. Then expand.
Version 2: Patterns
Add spider template, pipeline patterns, common tasks.
Version 3: Complete
Add full reference, all patterns, all conventions, troubleshooting.
Version 4: Advanced
Add complex patterns, edge cases, performance tips.
Common Mistakes (And How to Fix Them)
Mistake 1: Too Long
You write a 5,000-word agents.md. Too much. A long file eats context on every request and buries the rules that matter.
Fix: Keep it under 2,000 words. Use progressive disclosure. Reference external files:
"See docs/advanced-patterns.md for complex scraping scenarios."
Mistake 2: Too Vague
"Use good names."
Too vague. AI doesn't know what good means.
Fix: Be specific. Show examples.
"Spiders: spider_<site>.py. Classes: <SiteName>Spider. Methods: snake_case."
Mistake 3: Outdated
You update your patterns but forget to update agents.md.
AI follows the old patterns from the file.
Fix: Update agents.md whenever you change patterns. Commit it to Git.
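A small script can nudge you when the file may have drifted. This is a rough sketch, assuming the `scrapers/` layout used in this post; `newer_than` and `stale_sources` are hypothetical helpers. File modification times are only a hint (a fresh Git checkout resets them), so treat the result as a reminder, not proof.

```python
from pathlib import Path


def newer_than(doc_mtime, source_mtimes):
    """Return the source paths modified after the documentation file."""
    return sorted(path for path, mtime in source_mtimes.items() if mtime > doc_mtime)


def stale_sources(doc_path="agents.md", source_glob="scrapers/**/*.py"):
    """Collect modification times from disk and compare them."""
    doc_mtime = Path(doc_path).stat().st_mtime
    sources = {str(p): p.stat().st_mtime for p in Path(".").glob(source_glob)}
    return newer_than(doc_mtime, sources)


# Pure-function demo with made-up timestamps (seconds since the epoch).
print(newer_than(100.0, {"spider_amazon.py": 90.0, "pipelines.py": 150.0}))  # ['pipelines.py']
```

Running `stale_sources()` from the project root lists the Python files changed since agents.md was last touched, which is a reasonable prompt to reread the file before committing.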
Mistake 4: Only Examples, No Explanation
You show an example but give no context, so the AI copies the surface and misses the intent.
Fix: Explain why. Show both example and reasoning.
Mistake 5: Assumes Too Much Knowledge
"Remember to use kwargs for flexibility."
New developers don't understand this.
Fix: Explain for beginners. Show what it does.
agents.md for Teams
When multiple developers work on the scraper, agents.md becomes crucial.
Enforcing Standards
All developers read the same agents.md. Everyone follows the same patterns.
No more debates: "Should this be snake_case or CamelCase?" It's in agents.md.
Onboarding New Developers
New developer joins. You say: "Read agents.md. It explains how we work."
They read one file. Now they can contribute following your patterns immediately.
Faster Code Review
Reviewer checks if code follows agents.md patterns. Reduces comments about style.
Focus on logic, not formatting.
AI Helps Enforce Patterns
Add this to agents.md:
"AI will check:
- All methods use snake_case
- All spiders inherit from scrapy.Spider
- All errors are logged with URL
- All items validated before yielding"
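When you ask Claude Code for help, it can check its output against this list. Treat the list as a reminder for the agent, not a guarantee; linters and tests still catch what it misses.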
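A checklist like this can also be backed by a script, so the rules do not depend on the agent alone. Below is a minimal sketch using Python's `ast` module that covers two of the four rules (snake_case methods and inheritance from scrapy.Spider). The `check_source` function is hypothetical, and it only recognizes direct inheritance, so a spider built on a custom base class would be flagged.

```python
import ast
import re

SNAKE_CASE = re.compile(r"^_{0,2}[a-z][a-z0-9_]*$")


def base_name(node):
    """Render a base-class expression such as `scrapy.Spider` as a string."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return f"{base_name(node.value)}.{node.attr}"
    return ""


def check_source(source):
    """Return convention violations found in a spider module's source code."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name.endswith("Spider"):
            bases = {base_name(base) for base in node.bases}
            if not bases & {"scrapy.Spider", "Spider"}:
                problems.append(f"{node.name} does not inherit from scrapy.Spider")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not SNAKE_CASE.match(node.name):
                problems.append(f"{node.name}() is not snake_case")
    return problems


SAMPLE = '''
import scrapy

class AmazonSpider(scrapy.Spider):
    name = "amazon"

    def parseProduct(self, response):
        pass

    def extract_price(self, product):
        pass
'''

print(check_source(SAMPLE))  # ['parseProduct() is not snake_case']
```

The source is parsed, never imported or executed, so the check runs without Scrapy installed and fits in a pre-commit hook or CI step.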
Real-World Example: Complete agents.md
Here's a complete agents.md for an e-commerce scraper:
---
name: ecommerce-scraper
description: Scrapy spider for scraping product data
version: 1.0.0
author: Your Team
tags: scrapy, web-scraping, ecommerce, python
license: MIT
---
# E-commerce Product Scraper
This Scrapy project scrapes product data from multiple e-commerce websites.
## Quick Facts
- Framework: Scrapy 2.9+
- Python: 3.8+
- Database: PostgreSQL
- Sites: Amazon, eBay, Walmart, Best Buy
- Update Frequency: Daily via cron job
## Project Structure
ecommerce-scraper/
├── scrapers/
│ ├── spiders/
│ │ ├── spider_amazon.py
│ │ ├── spider_ebay.py
│ │ └── spider_walmart.py
│ ├── items.py
│ ├── pipelines.py
│ ├── settings.py
│ └── middlewares.py
├── requirements.txt
└── agents.md
## Data Flow
Start URLs
↓
Parse Product Page
↓
Extract: name, price, rating, reviews, url
↓
Validation (required fields checked in the spider)
↓
Text Cleaning
↓
Deduplication
↓
Database Storage
## Item Structure
All items follow this structure:
{
    'name': str,              # Product name (required)
    'price': float,           # Price in USD (required)
    'original_price': float,  # Before discount (optional)
    'rating': float,          # 0-5 stars (optional)
    'review_count': int,      # Number of reviews (optional)
    'url': str,               # Source URL (required)
    'site': str,              # Which site
    'scraped_at': str,        # ISO datetime
}
## Naming Conventions
Spiders: spider_<site>.py
Classes: <SiteName>Spider (CamelCase)
Methods: snake_case
Examples:
- Spider file: spider_amazon.py
- Class: AmazonSpider
- Method: extract_price()
## Spider Template
import re

import scrapy


class <SiteName>Spider(scrapy.Spider):
    name = '<site-slug>'
    allowed_domains = ['<domain.com>']
    start_urls = ['<start-url>']

    def parse(self, response):
        try:
            # Extract products
            for product in response.css('<selector>'):
                yield self.extract_product(product, response)

            # Handle pagination (the selector should target the link's href)
            next_page = response.css('<next-selector>').get()
            if next_page:
                yield response.follow(next_page, self.parse)
        except Exception as e:
            self.logger.error(f"Error parsing {response.url}: {e}")

    def extract_product(self, product, response):
        return {
            'name': product.css('.name::text').get('').strip(),
            'price': self.extract_price(product),
            'url': response.urljoin(product.css('a::attr(href)').get()),
            'site': self.name,
        }

    def extract_price(self, product):
        price_str = product.css('.price::text').get('')
        # Drop thousands separators so "$1,299.99" parses as 1299.99
        match = re.search(r'\d+(?:\.\d+)?', price_str.replace(',', ''))
        return float(match.group(0)) if match else None
## Error Handling
Always wrap in try-except. Log errors with URL.
## Scrapy Settings
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
ITEM_PIPELINES = {
    'scrapers.pipelines.TextCleaningPipeline': 100,
    'scrapers.pipelines.DuplicateRemovalPipeline': 200,
    'scrapers.pipelines.DatabaseStoragePipeline': 300,
}
## Common Tasks
Adding a New Spider:
1. Create scrapers/spiders/spider_<sitename>.py
2. Inherit from scrapy.Spider
3. Define name, allowed_domains, start_urls
4. Implement parse() method
5. Test: scrapy crawl <spider-name>
Testing a Spider:
scrapy shell 'https://example.com'
response.css('.selector').get()
Running All Spiders:
scrapy crawl amazon && scrapy crawl ebay && scrapy crawl walmart
## Dependencies
- Scrapy 2.9+
- Python 3.8+
- sqlalchemy
- psycopg2-binary
- pandas
Install: pip install -r requirements.txt
This is a complete agents.md. When Claude Code reads it, it has the context it needs: your structure, your conventions, your templates. Suggestions start from your patterns instead of generic ones.
Summary
agents.md teaches AI agents how your project works.
What it is:
- Markdown file in project root
- YAML frontmatter + Markdown body
- Describes patterns, conventions, architecture
- Read by AI coding agents such as Codex, Cursor, and GitHub Copilot (Claude Code via CLAUDE.md)
Why it matters:
- AI writes code matching your style
- Less rewriting of AI-generated code
- Faster onboarding for new developers
- Consistent project standards
- Works across AI coding agents
How to use it:
- Create agents.md in project root
- Document your patterns (spiders, pipelines, naming)
- Add examples
- AI reads it automatically
- Ask for help confidently
Best practices:
- Keep it under 2,000 words
- Be specific, not vague
- Update when you change patterns
- Use progressive disclosure
- Explain the why, not just examples
For teams:
- One agents.md = everyone on the same page
- Faster code review
- Easier onboarding
- Enforce standards without debate
You're not just building a scraper. You're building a scraper that teaches AI how to build scrapers.
The future of development is humans and AI working together. agents.md is how you teach the AI to be a good collaborator.
Next Steps:
- Create agents.md in your next project (or existing one)
- Document your patterns
- Ask Claude Code or GitHub Copilot for help
- Watch how it follows your style perfectly
- Update agents.md as patterns evolve
The investment in documenting patterns pays dividends every time you ask for AI help.