Craigslist remains one of the most visited classifieds platforms in the world. Despite its famously minimalist design, it hosts millions of active listings across categories ranging from real estate and vehicles to jobs and services — spread across hundreds of city-specific subdomains covering virtually every metropolitan area in the United States and many international cities.
For data analysts, real estate investors, market researchers, and developers building aggregation tools, scraping Craigslist offers access to hyper-local market data that simply isn't available anywhere else.
In this comprehensive guide, we'll explore Craigslist's unique data architecture, walk through practical scraping implementations in both Node.js and Python, tackle common challenges, and show how to scale your extraction using Apify's cloud scraping platform.
Understanding Craigslist's Data Architecture
Craigslist is architecturally unique among major websites. Understanding its structure is essential before writing any scraping code.
Geographic Subdomain System
Unlike most platforms that use URL paths for location, Craigslist uses subdomains — one per metropolitan area:
- newyork.craigslist.org — New York City
- sfbay.craigslist.org — San Francisco Bay Area
- chicago.craigslist.org — Chicago
- losangeles.craigslist.org — Los Angeles
- seattle.craigslist.org — Seattle
There are over 400 active subdomains covering US cities and international locations. Each operates as essentially an independent instance with its own listings.
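Because each metro is its own subdomain, building base URLs is plain string interpolation. A minimal sketch using the subdomains listed above:

```python
# Each Craigslist metro lives on its own subdomain, so base URLs
# are just string interpolation over the city slug.
CITIES = ["newyork", "sfbay", "chicago", "losangeles", "seattle"]

def base_url(city: str) -> str:
    """Return the root URL for a metro-area subdomain."""
    return f"https://{city}.craigslist.org"

urls = [base_url(c) for c in CITIES]
print(urls[0])  # https://newyork.craigslist.org
```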
Category Hierarchy
Within each city, listings are organized into a deep category tree:
craigslist.org
├── community
├── housing
│ ├── apts / housing (apa)
│ ├── rooms / shared (roo)
│ ├── sublets / temporary (sub)
│ ├── housing wanted (hou)
│ └── real estate for sale (rea)
├── for sale
│ ├── antiques (ata)
│ ├── electronics (ela)
│ ├── furniture (fua)
│ ├── cars+trucks (cta)
│ └── ... 30+ subcategories
├── services
├── jobs
│ ├── software / qa / dba (sof)
│ ├── web / info design (web)
│ └── ... many subcategories
└── gigs
Each category has a three-letter code used in URLs. For example, apartments for rent in San Francisco:
https://sfbay.craigslist.org/search/apa
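These codes can be kept in a small lookup table and combined with a subdomain to form search URLs. A sketch using only the codes shown in the hierarchy above:

```python
# Category codes taken from the hierarchy above; combine with a
# city subdomain to form a search URL.
CATEGORY_CODES = {
    "apts / housing": "apa",
    "rooms / shared": "roo",
    "sublets / temporary": "sub",
    "cars+trucks": "cta",
    "software / qa / dba": "sof",
}

def search_url(city: str, category_name: str) -> str:
    """Build a search URL from a city slug and a category name."""
    code = CATEGORY_CODES[category_name]
    return f"https://{city}.craigslist.org/search/{code}"

print(search_url("sfbay", "apts / housing"))
# https://sfbay.craigslist.org/search/apa
```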
Listing Structure
Every Craigslist listing contains these data fields:
- Title: The listing headline
- Price: Displayed in the title line (e.g., "$2,500/mo")
- Location: Neighborhood or area within the metro region
- Post date: When the listing was created
- Post ID: A unique numeric identifier
- Body text: The full description
- Images: Uploaded photos (often multiple)
- Map coordinates: Latitude/longitude when provided
- Attributes: Structured metadata (bedrooms, sqft, etc. for housing)
- Contact method: Reply link (anonymized email) or phone number
Search and Filtering
Craigslist supports several URL-based search parameters:
| Parameter | Description | Example |
|---|---|---|
| query | Search keywords | query=furnished+studio |
| min_price | Minimum price | min_price=1000 |
| max_price | Maximum price | max_price=2500 |
| sort | Sort order | sort=date or sort=priceasc |
| hasPic | Has photos | hasPic=1 |
| postedToday | Today's posts only | postedToday=1 |
| min_bedrooms | Minimum bedrooms | min_bedrooms=2 |
| min_bathrooms | Minimum bathrooms | min_bathrooms=1 |
| minSqft / maxSqft | Square footage range | minSqft=500&maxSqft=1200 |
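These parameters combine freely in the query string. A quick sketch using the standard library to assemble a filtered search URL (category and filter values here are illustrative):

```python
from urllib.parse import urlencode

def build_search_url(city: str, category: str, **filters) -> str:
    """Assemble a filtered search URL; None-valued filters are dropped."""
    base = f"https://{city}.craigslist.org/search/{category}"
    qs = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{base}?{qs}" if qs else base

url = build_search_url(
    "sfbay", "apa",
    query="furnished studio", min_price=1000, max_price=2500, hasPic=1,
)
print(url)
```

`urlencode` handles the space-to-`+` conversion shown in the table, so keywords can be passed as plain strings.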
Basic Scraping: Node.js with Cheerio
Craigslist uses server-rendered HTML, which makes it one of the simpler major sites to scrape — no headless browser required for basic extraction. Here's a practical implementation using Node.js:
const axios = require('axios');
const cheerio = require('cheerio');
class CraigslistScraper {
constructor(city) {
this.baseUrl = `https://${city}.craigslist.org`;
this.headers = {
'User-Agent':
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
'AppleWebKit/537.36 (KHTML, like Gecko) ' +
'Chrome/120.0.0.0 Safari/537.36',
};
}
async searchListings(category, options = {}) {
const {
query = '',
minPrice = null,
maxPrice = null,
hasPic = false,
maxPages = 5,
} = options;
const allListings = [];
for (let page = 0; page < maxPages; page++) {
const offset = page * 120;
const url = this.buildSearchUrl(
category, query, minPrice, maxPrice, hasPic, offset
);
console.log(`Fetching page ${page + 1}: ${url}`);
const listings = await this.fetchSearchPage(url);
if (listings.length === 0) break;
allListings.push(...listings);
// Respectful delay between pages
await this.delay(2000 + Math.random() * 2000);
}
return allListings;
}
buildSearchUrl(category, query, minPrice, maxPrice, hasPic, offset) {
const params = new URLSearchParams();
if (query) params.set('query', query);
if (minPrice) params.set('min_price', minPrice);
if (maxPrice) params.set('max_price', maxPrice);
if (hasPic) params.set('hasPic', '1');
if (offset > 0) params.set('s', offset);
return `${this.baseUrl}/search/${category}?${params.toString()}`;
}
async fetchSearchPage(url) {
try {
const response = await axios.get(url, { headers: this.headers });
const $ = cheerio.load(response.data);
const listings = [];
$('li.cl-static-search-result, .result-row').each((i, el) => {
const $el = $(el);
const titleEl = $el.find('.titlestring, a.result-title');
const priceEl = $el.find('.priceinfo, .result-price');
const metaEl = $el.find('.meta, .result-meta');
const listing = {
title: titleEl.text().trim(),
url: titleEl.attr('href'),
price: priceEl.text().trim(),
location: metaEl.find('.location').text().trim()
|| $el.find('.result-hood').text().trim(),
date: $el.find('time').attr('datetime')
|| $el.find('.date').text().trim(),
postId: $el.attr('data-pid')
|| this.extractPostId(titleEl.attr('href')),
};
if (listing.title) {
listings.push(listing);
}
});
return listings;
} catch (error) {
console.error(`Error fetching ${url}: ${error.message}`);
return [];
}
}
async fetchListingDetails(listingUrl) {
try {
const fullUrl = listingUrl.startsWith('http')
? listingUrl
: `${this.baseUrl}${listingUrl}`;
const response = await axios.get(fullUrl, {
headers: this.headers,
});
const $ = cheerio.load(response.data);
const details = {
title: $('#titletextonly').text().trim(),
price: $('span.price').first().text().trim(),
description: $('section#postingbody').text().trim()
.replace(/QR Code Link to This Post/g, ''),
location: $('div.mapaddress').text().trim(),
attributes: {},
images: [],
postDate: $('time.date').first().attr('datetime'),
mapCoordinates: null,
};
// Extract structured attributes
$('p.attrgroup span').each((i, el) => {
const text = $(el).text().trim();
const [key, value] = text.split(':').map(s => s?.trim());
if (key && value) {
details.attributes[key] = value;
} else if (key) {
details.attributes[`attr_${i}`] = key;
}
});
// Extract images
$('a.thumb').each((i, el) => {
const href = $(el).attr('href');
if (href) details.images.push(href);
});
// Extract map coordinates
const mapEl = $('#map');
if (mapEl.length) {
details.mapCoordinates = {
latitude: parseFloat(mapEl.attr('data-latitude')),
longitude: parseFloat(mapEl.attr('data-longitude')),
};
}
return details;
} catch (error) {
console.error(
`Error fetching details for ${listingUrl}: ${error.message}`
);
return null;
}
}
extractPostId(url) {
if (!url) return null;
const match = url.match(/(\d{10,})\.html/);
return match ? match[1] : null;
}
delay(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
// Usage example
async function main() {
const scraper = new CraigslistScraper('sfbay');
// Search for apartments in SF Bay Area
const listings = await scraper.searchListings('apa', {
query: 'furnished',
minPrice: 2000,
maxPrice: 4000,
hasPic: true,
maxPages: 3,
});
console.log(`Found ${listings.length} apartment listings`);
// Get details for the first 5 listings
for (const listing of listings.slice(0, 5)) {
const details = await scraper.fetchListingDetails(listing.url);
if (details) {
console.log(`\n--- ${details.title} ---`);
console.log(`Price: ${details.price}`);
console.log(`Location: ${details.location}`);
console.log(`Bedrooms: ${details.attributes.bedrooms || 'N/A'}`);
console.log(`Images: ${details.images.length}`);
}
await scraper.delay(2000);
}
}
main().catch(console.error);
Python Implementation with Beautiful Soup
Here's an equivalent implementation in Python with Beautiful Soup:
import requests
from bs4 import BeautifulSoup
import json
import time
import random
from dataclasses import dataclass, asdict
from typing import Optional
@dataclass
class CraigslistListing:
title: str
url: str
price: Optional[str] = None
location: Optional[str] = None
date: Optional[str] = None
post_id: Optional[str] = None
description: Optional[str] = None
attributes: Optional[dict] = None
images: Optional[list] = None
latitude: Optional[float] = None
longitude: Optional[float] = None
class CraigslistScraper:
def __init__(self, city: str):
self.base_url = f"https://{city}.craigslist.org"
self.session = requests.Session()
self.session.headers.update({
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
),
})
def search(
self,
category: str,
query: str = "",
min_price: Optional[int] = None,
max_price: Optional[int] = None,
has_pic: bool = False,
max_pages: int = 5,
) -> list[CraigslistListing]:
"""Search Craigslist listings with filters."""
all_listings = []
for page in range(max_pages):
offset = page * 120
params = {"s": offset}
if query:
params["query"] = query
if min_price:
params["min_price"] = min_price
if max_price:
params["max_price"] = max_price
if has_pic:
params["hasPic"] = 1
url = f"{self.base_url}/search/{category}"
print(f"Fetching page {page + 1}...")
try:
resp = self.session.get(url, params=params, timeout=15)
resp.raise_for_status()
except requests.RequestException as e:
print(f"Error: {e}")
break
soup = BeautifulSoup(resp.text, "html.parser")
listings = self._parse_search_results(soup)
if not listings:
break
all_listings.extend(listings)
time.sleep(2 + random.random() * 2)
return all_listings
def _parse_search_results(
self, soup: BeautifulSoup
) -> list[CraigslistListing]:
"""Parse search results page into listing objects."""
listings = []
for item in soup.select("li.cl-static-search-result, .result-row"):
title_el = item.select_one(".titlestring, a.result-title")
price_el = item.select_one(".priceinfo, .result-price")
if not title_el:
continue
listing = CraigslistListing(
title=title_el.get_text(strip=True),
url=title_el.get("href", ""),
price=price_el.get_text(strip=True) if price_el else None,
location=self._extract_location(item),
date=self._extract_date(item),
)
listings.append(listing)
return listings
def _extract_location(self, item) -> str:
loc = item.select_one(".location, .result-hood")
return loc.get_text(strip=True) if loc else ""
def _extract_date(self, item) -> str:
time_el = item.select_one("time")
if time_el:
return time_el.get("datetime", time_el.get_text(strip=True))
date_el = item.select_one(".date")
return date_el.get_text(strip=True) if date_el else ""
def get_details(self, listing_url: str) -> dict:
"""Fetch full details for a single listing."""
full_url = (
listing_url
if listing_url.startswith("http")
else f"{self.base_url}{listing_url}"
)
try:
resp = self.session.get(full_url, timeout=15)
resp.raise_for_status()
except requests.RequestException as e:
return {"error": str(e)}
soup = BeautifulSoup(resp.text, "html.parser")
# Extract description
body = soup.select_one("#postingbody")
description = ""
if body:
description = body.get_text(strip=True).replace(
"QR Code Link to This Post", ""
)
# Extract attributes
attributes = {}
for span in soup.select("p.attrgroup span"):
text = span.get_text(strip=True)
if ":" in text:
key, val = text.split(":", 1)
attributes[key.strip()] = val.strip()
# Extract images
images = [a["href"] for a in soup.select("a.thumb") if a.get("href")]
# Extract map coordinates
map_el = soup.select_one("#map")
# Guard against a #map element that lacks coordinate attributes
lat = float(map_el["data-latitude"]) if map_el and map_el.has_attr("data-latitude") else None
lng = float(map_el["data-longitude"]) if map_el and map_el.has_attr("data-longitude") else None
return {
"title": soup.select_one("#titletextonly").get_text(strip=True)
if soup.select_one("#titletextonly") else "",
"price": soup.select_one("span.price").get_text(strip=True)
if soup.select_one("span.price") else "",
"description": description,
"attributes": attributes,
"images": images,
"latitude": lat,
"longitude": lng,
}
# Usage
def main():
scraper = CraigslistScraper("sfbay")
# Search for apartments
listings = scraper.search(
category="apa",
min_price=2000,
max_price=4500,
has_pic=True,
max_pages=3,
)
print(f"\nFound {len(listings)} listings")
# Enrich first 5 with full details
for listing in listings[:5]:
details = scraper.get_details(listing.url)
print(f"\n{listing.title}")
print(f" Price: {listing.price}")
print(f" Beds: {details.get('attributes', {}).get('bedrooms', 'N/A')}")
print(f" Sqft: {details.get('attributes', {}).get('sqft', 'N/A')}")
print(f" Photos: {len(details.get('images', []))}")
time.sleep(2)
if __name__ == "__main__":
main()
Multi-City Scraping: Covering Geographic Markets
One of Craigslist's biggest data advantages is its geographic coverage. Here's how to scrape across multiple cities efficiently:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from typing import Optional
MAJOR_CITIES = [
"newyork", "losangeles", "chicago", "sfbay", "seattle",
"boston", "denver", "austin", "portland", "miami",
"atlanta", "dallas", "philadelphia", "minneapolis", "sandiego",
]
async def fetch_page(
session: aiohttp.ClientSession, url: str
) -> Optional[str]:
"""Fetch a page with error handling and rate limiting."""
try:
async with session.get(
url, timeout=aiohttp.ClientTimeout(total=15)
) as resp:
if resp.status == 200:
return await resp.text()
print(f"HTTP {resp.status} for {url}")
return None
except Exception as e:
print(f"Error fetching {url}: {e}")
return None
async def search_city(
session: aiohttp.ClientSession,
city: str,
category: str,
query: str = "",
) -> list[dict]:
"""Search a single city's listings."""
url = f"https://{city}.craigslist.org/search/{category}"
params = {}
if query:
params["query"] = query
param_str = "&".join(
f"{k}={str(v).replace(' ', '+')}" for k, v in params.items()
)
full_url = f"{url}?{param_str}" if param_str else url
html = await fetch_page(session, full_url)
if not html:
return []
soup = BeautifulSoup(html, "html.parser")
listings = []
for item in soup.select("li.cl-static-search-result, .result-row"):
title_el = item.select_one(".titlestring, a.result-title")
price_el = item.select_one(".priceinfo, .result-price")
if title_el:
listings.append({
"city": city,
"title": title_el.get_text(strip=True),
"url": title_el.get("href", ""),
"price": price_el.get_text(strip=True) if price_el else None,
})
return listings
async def scrape_all_cities(
category: str, query: str = "", concurrency: int = 3
) -> list[dict]:
"""Scrape listings from all major cities with controlled concurrency."""
all_results = []
semaphore = asyncio.Semaphore(concurrency)
async with aiohttp.ClientSession(
headers={
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36"
)
}
) as session:
async def search_with_limit(city):
async with semaphore:
results = await search_city(session, city, category, query)
await asyncio.sleep(2) # Rate limiting
return results
tasks = [search_with_limit(city) for city in MAJOR_CITIES]
city_results = await asyncio.gather(*tasks)
for results in city_results:
all_results.extend(results)
return all_results
# Run multi-city search
results = asyncio.run(scrape_all_cities("apa", "furnished studio"))
print(f"Found {len(results)} listings across {len(MAJOR_CITIES)} cities")
# Analyze price distribution by city
from collections import defaultdict
import re
city_prices = defaultdict(list)
for item in results:
if item["price"]:
match = re.search(r"[\$]([\d,]+)", item["price"])
if match:
price = int(match.group(1).replace(",", ""))
city_prices[item["city"]].append(price)
for city, prices in sorted(city_prices.items()):
if prices:
avg = sum(prices) / len(prices)
print(f"{city}: avg ${avg:,.0f} ({len(prices)} listings)")
Scaling with Apify for Production Workloads
For production-grade Craigslist scraping — thousands of listings across dozens of cities, running on a schedule — you need cloud infrastructure. Apify makes this straightforward:
Building a Craigslist Apify Actor
const Apify = require('apify');
Apify.main(async () => {
const input = await Apify.getInput();
const {
cities = ['sfbay'],
category = 'apa',
query = '',
minPrice = null,
maxPrice = null,
maxPagesPerCity = 3,
} = input;
const requestQueue = await Apify.openRequestQueue();
const dataset = await Apify.openDataset();
// Queue search pages for each city
for (const city of cities) {
const baseUrl = `https://${city}.craigslist.org/search/${category}`;
const params = new URLSearchParams();
if (query) params.set('query', query);
if (minPrice) params.set('min_price', minPrice);
if (maxPrice) params.set('max_price', maxPrice);
for (let page = 0; page < maxPagesPerCity; page++) {
params.set('s', page * 120);
await requestQueue.addRequest({
url: `${baseUrl}?${params.toString()}`,
userData: { city, type: 'search', page },
});
}
}
const crawler = new Apify.CheerioCrawler({
requestQueue,
maxConcurrency: 5,
maxRequestRetries: 3,
requestHandlerTimeoutSecs: 30,
requestHandler: async ({ request, $ }) => {
const { city, type } = request.userData;
if (type === 'search') {
// Parse search results
const listings = [];
$('li.cl-static-search-result, .result-row').each((i, el) => {
const $el = $(el);
const titleEl = $el
.find('.titlestring, a.result-title');
const priceEl = $el
.find('.priceinfo, .result-price');
const href = titleEl.attr('href');
if (titleEl.text().trim() && href) {
listings.push({
title: titleEl.text().trim(),
url: href,
price: priceEl.text().trim() || null,
});
// Queue detail pages
const detailUrl = href.startsWith('http')
? href
: `https://${city}.craigslist.org${href}`;
requestQueue.addRequest({
url: detailUrl,
userData: { city, type: 'detail' },
}).catch(() => {});
}
});
console.log(
`${city}: Found ${listings.length} listings on ` +
`page ${request.userData.page + 1}`
);
} else if (type === 'detail') {
// Parse listing details
const details = {
city,
title: $('#titletextonly').text().trim(),
price: $('span.price').first().text().trim(),
description: $('#postingbody').text().trim()
.replace(/QR Code Link to This Post/g, ''),
location: $('div.mapaddress').text().trim(),
attributes: {},
images: [],
latitude: null,
longitude: null,
sourceUrl: request.url,
scrapedAt: new Date().toISOString(),
};
// Attributes; split on the first colon only so values
// containing colons survive intact
$('p.attrgroup span').each((i, el) => {
const text = $(el).text().trim();
const idx = text.indexOf(':');
if (idx > -1) {
details.attributes[text.slice(0, idx).trim()] =
text.slice(idx + 1).trim();
}
});
// Images
$('a.thumb').each((i, el) => {
const href = $(el).attr('href');
if (href) details.images.push(href);
});
// Map
const map = $('#map');
if (map.length) {
details.latitude = parseFloat(map.attr('data-latitude'));
details.longitude = parseFloat(map.attr('data-longitude'));
}
await dataset.pushData(details);
}
},
});
await crawler.run();
const info = await dataset.getInfo();
console.log(`Scraping complete! ${info.itemCount} listings collected.`);
});
Scheduling and Monitoring
With Apify, you can schedule your Craigslist scraper to run daily, tracking new listings and price changes over time:
from apify_client import ApifyClient
def schedule_craigslist_monitor(api_token: str):
"""Set up scheduled Craigslist monitoring."""
client = ApifyClient(api_token)
# Configure the actor to run daily
schedule = client.schedules().create(
name="craigslist-daily-monitor",
cron_expression="0 8 * * *", # 8 AM daily
actions=[{
"type": "RUN_ACTOR",
"actorId": "your-username/craigslist-scraper",
"runInput": {
"cities": [
"sfbay", "newyork", "losangeles",
"seattle", "austin",
],
"category": "apa",
"maxPagesPerCity": 5,
},
}],
)
print(f"Schedule created: {schedule['id']}")
return schedule
Handling Craigslist-Specific Challenges
Contact Information Patterns
Craigslist anonymizes contact info through relay emails. The reply link format is:
mailto:xxxxx-xxxxxxxxxx@sale.craigslist.org
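When collecting emails from listing text, it helps to separate real addresses from these anonymized relays. A hedged sketch based on the relay format above (the `sale` section prefix may vary by category, so treat the pattern as an assumption to verify against real reply links):

```python
import re

# Matches the relay format shown above, e.g.
# "abc12-3456789012@sale.craigslist.org". The section label ("sale")
# is assumed to be a single lowercase token; verify on live listings.
RELAY_RE = re.compile(r"^\w+-\w+@[a-z]+\.craigslist\.org$", re.IGNORECASE)

def is_relay_email(email: str) -> bool:
    """True if the address looks like a Craigslist anonymized relay."""
    return bool(RELAY_RE.match(email.strip()))

print(is_relay_email("abc12-3456789012@sale.craigslist.org"))  # True
print(is_relay_email("seller@example.com"))                    # False
```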
Some sellers include phone numbers in the listing body. Here's how to extract them:
import re
def extract_contact_info(description: str) -> dict:
"""Extract phone numbers and emails from listing text."""
contacts = {"phones": [], "emails": []}
# Phone patterns
phone_patterns = [
r'\b(\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})\b',
r'\((\d{3})\)\s?(\d{3})[-.\s]?(\d{4})',
]
for pattern in phone_patterns:
matches = re.findall(pattern, description)
for match in matches:
phone = "".join(match)
if len(phone) == 10:
contacts["phones"].append(
f"({phone[:3]}) {phone[3:6]}-{phone[6:]}"
)
# Email patterns (non-Craigslist relay)
email_pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'
emails = re.findall(email_pattern, description)
contacts["emails"] = [
e for e in emails if "craigslist.org" not in e
]
return contacts
Dealing with Expired Listings
Craigslist listings expire quickly — typically within 7-45 days depending on the category. Build your scraper to handle 404s gracefully and timestamp everything:
const axios = require('axios');

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithExpiredHandling(url) {
try {
const response = await axios.get(url, {
validateStatus: (status) => status < 500,
});
if (response.status === 404) {
return { expired: true, url };
}
if (response.status === 403) {
console.log('Rate limited, backing off...');
await delay(10000);
return { rateLimited: true, url };
}
return { data: response.data, url };
} catch (error) {
return { error: error.message, url };
}
}
Geographic Deduplication
Craigslist users often post the same listing in multiple nearby cities. Detect duplicates by comparing title + price + description hash:
import hashlib
def generate_listing_fingerprint(listing: dict) -> str:
"""Create a fingerprint for deduplication."""
content = (
listing.get("title", "").lower().strip()
+ str(listing.get("price", ""))
+ listing.get("description", "")[:200].lower().strip()
)
return hashlib.md5(content.encode()).hexdigest()
def deduplicate_listings(listings: list[dict]) -> list[dict]:
"""Remove duplicate listings posted across multiple cities."""
seen = set()
unique = []
for listing in listings:
fp = generate_listing_fingerprint(listing)
if fp not in seen:
seen.add(fp)
unique.append(listing)
removed = len(listings) - len(unique)
print(f"Removed {removed} duplicates from {len(listings)} listings")
return unique
Real-World Use Cases
Real Estate Market Analysis
Track rental prices across neighborhoods over time to identify trends, underpriced areas, or market shifts. This data is invaluable for real estate investors and property managers.
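Once listings are collected, per-neighborhood rent statistics reduce to a simple group-by. A sketch over listing dicts (the `price_usd` and `location` field names are illustrative, assuming prices have already been parsed to numbers):

```python
from collections import defaultdict
from statistics import median

def rent_stats_by_area(listings: list[dict]) -> dict:
    """Group listings by location and compute the median asking price.
    Assumes each dict carries a numeric 'price_usd' and a 'location'."""
    by_area = defaultdict(list)
    for item in listings:
        if item.get("price_usd") and item.get("location"):
            by_area[item["location"]].append(item["price_usd"])
    return {area: median(prices) for area, prices in by_area.items()}

sample = [
    {"location": "mission district", "price_usd": 2800},
    {"location": "mission district", "price_usd": 3200},
    {"location": "soma", "price_usd": 3500},
]
print(rent_stats_by_area(sample))
# {'mission district': 3000.0, 'soma': 3500}
```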
Vehicle Market Research
Monitor used car prices by make, model, and year to find deals or understand depreciation curves. Dealerships use this data to price their inventory competitively.
Job Market Intelligence
Scrape job listings to track which skills are in demand, what companies are hiring, and how salary ranges vary by city.
Academic Research
Researchers study Craigslist data for everything from housing discrimination patterns to local economic indicators.
Ethical Guidelines and Legal Considerations
- Respect robots.txt: Always check the site's robots.txt before scraping
- Rate limiting: Keep requests slow — 1-2 per second maximum. Craigslist actively blocks aggressive scrapers
- No personal data harvesting: Don't collect or store personal contact information in bulk
- Terms of service: Review Craigslist's ToS regarding automated access
- Data retention: Don't store data longer than needed for your analysis
- No republishing raw listings: Craigslist actively litigates against sites that republish their content
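The rate-limiting guideline above can be enforced mechanically instead of sprinkling ad-hoc sleeps through the code. A minimal sketch of a blocking limiter capped at a requests-per-second budget:

```python
import time

class RateLimiter:
    """Blocks so that successive calls never exceed `rate` per second."""

    def __init__(self, rate: float = 1.0):
        self.min_interval = 1.0 / rate
        self.last_call = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep the gap >= min_interval
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

limiter = RateLimiter(rate=1.0)  # at most 1 request per second
# for url in urls_to_fetch:
#     limiter.wait()
#     fetch(url)
```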
Conclusion
Craigslist's simple HTML structure makes it technically straightforward to scrape, but its geographic scale and data volume create real engineering challenges. By combining efficient parsing (Cheerio/BeautifulSoup for search pages) with cloud-based infrastructure (Apify for scale and scheduling), you can build comprehensive datasets covering local markets nationwide.
The key to successful Craigslist scraping is respecting rate limits, handling the geographic subdomain structure intelligently, and deduplicating across cities. Whether you're analyzing rental markets, tracking vehicle prices, or monitoring job postings, the techniques in this guide provide a solid foundation for extracting actionable data from the world's largest classifieds platform.
Start with a single city and category, validate your extraction pipeline, then scale horizontally across Craigslist's 400+ metro areas using Apify's cloud infrastructure.