Twitter/X scraping in 2026 is a minefield. After Elon Musk's aggressive API changes, rate limit crackdowns, and multiple lawsuits against scrapers, most of the old methods are dead. But public data extraction still works — if you know the current landscape.
This guide covers what actually works right now, what got killed, and how to scrape Twitter/X data without getting your IP banned or your account suspended.
The Current State of Twitter/X Data Access
Let's be clear about what changed:
- Official API: The free tier is nearly useless (1,500 tweets/month read limit). Basic tier ($200/mo) gives you 10K tweets. Pro tier ($5,000/mo) for serious access.
- Aggressive bot detection: Twitter now uses advanced fingerprinting, behavioral analysis, and ML-based detection.
- Legal threats: Twitter/X has sued multiple scraping companies. They actively monitor for scraping activity.
- Login walls: Most content now requires authentication to view.
What's still public and legal to access:
- Public profiles and their tweet history
- Public tweet content (when accessible without login)
- Publicly visible engagement metrics
- Trending topics and hashtags
Method 1: The Official API (When It Makes Sense)
Despite the cost, the official API is still the most reliable method for certain use cases.
```python
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"

def search_recent_tweets(query: str, max_results: int = 10):
    url = "https://api.x.com/2/tweets/search/recent"
    headers = {
        "Authorization": f"Bearer {BEARER_TOKEN}"
    }
    params = {
        "query": query,
        "max_results": max_results,
        "tweet.fields": "created_at,public_metrics,author_id",
        "expansions": "author_id",
        "user.fields": "username,name,public_metrics"
    }
    response = requests.get(url, headers=headers, params=params)
    return response.json()

# Search for recent tweets
results = search_recent_tweets("python web scraping", 10)
for tweet in results.get("data", []):
    print(f"{tweet['text'][:100]}...")
    print(f"  Likes: {tweet['public_metrics']['like_count']}")
    print()
```
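One gap in the snippet above: a single call returns at most one page of results. The v2 search endpoint paginates via a `next_token` field in the response's `meta` object, which you pass back as a query parameter on the next request. Here's a sketch of that loop with the HTTP call injected as a function, so the pagination logic stands on its own (the function names here are mine, not part of the API):

```python
# Pagination sketch: the v2 search response carries meta.next_token
# when more pages exist; pass it back on the next request to continue.
# fetch_page is injected so this loop can be exercised without network access.

def paginate_search(fetch_page, query: str, max_pages: int = 3) -> list:
    """Collect tweets across pages. fetch_page(query, token) -> response dict."""
    tweets, token = [], None
    for _ in range(max_pages):
        payload = fetch_page(query, token)
        tweets.extend(payload.get("data", []))
        token = payload.get("meta", {}).get("next_token")
        if not token:  # no more pages available
            break
    return tweets
```

Wired to the real endpoint, `fetch_page` would call `search_recent_tweets` with the token added to `params` as `next_token`.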
When the API makes sense:
- You need < 10K tweets/month (Basic tier at $200)
- You need real-time data (streaming endpoints)
- Compliance and legal safety matter (enterprise use)
- You need guaranteed uptime and structured data
When it doesn't:
- Budget-constrained projects
- Historical data (API only goes back 7 days on Basic)
- Large-scale data collection
Method 2: Managed Scraping Services
This is what I actually recommend for most people. Let someone else deal with the proxy rotation, CAPTCHA solving, and detection evasion.
ScraperAPI
ScraperAPI handles the hard parts — rotating proxies, browser rendering, and anti-bot bypass. You send a URL, get back the HTML.
```python
import requests
from bs4 import BeautifulSoup

SCRAPER_API_KEY = "YOUR_KEY"

def scrape_twitter_profile(username: str):
    """Scrape a public Twitter profile via ScraperAPI."""
    target_url = f"https://x.com/{username}"
    response = requests.get(
        "http://api.scraperapi.com",
        params={
            "api_key": SCRAPER_API_KEY,
            "url": target_url,
            "render": "true",  # Enable JS rendering
            "country_code": "us"
        },
        timeout=60
    )
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        return soup
    return None
```
Pros: No proxy management, automatic retries, scales easily
Cons: Cost per request, depends on their infrastructure
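Once you have the rendered HTML back, you still need to extract tweet text from it. A small parsing helper, sketched below; note that the `data-testid="tweetText"` selector matches Twitter's DOM at the time of writing, but these attributes change frequently, so treat the selector as an assumption you'll need to re-verify:

```python
# Parsing sketch: pull visible tweet text out of rendered page HTML.
# The data-testid selector is an assumption about Twitter's current DOM.
from bs4 import BeautifulSoup

def extract_tweet_texts(html: str) -> list:
    """Return the text of each tweet found in rendered page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        el.get_text(" ", strip=True)
        for el in soup.select('[data-testid="tweetText"]')
    ]
```

This pairs with `scrape_twitter_profile` above: pass it `response.text` instead of returning the raw soup.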
ScrapeOps
ScrapeOps offers a proxy aggregator and monitoring dashboard that's particularly useful for Twitter scraping. They route your requests through the best-performing proxy for each target.
```python
import requests

SCRAPEOPS_API_KEY = "YOUR_KEY"

def scrape_with_scrapeops(url: str):
    response = requests.get(
        "https://proxy.scrapeops.io/v1/",
        params={
            "api_key": SCRAPEOPS_API_KEY,
            "url": url,
            "render_js": "true",
            "residential": "true"
        },
        timeout=60
    )
    return response

# Scrape a public tweet
result = scrape_with_scrapeops(
    "https://x.com/elonmusk/status/1234567890"
)
print(f"Status: {result.status_code}")
```
What makes ScrapeOps stand out is their proxy benchmarking — they test proxy providers against specific targets and route through whichever performs best. For Twitter specifically, this matters because detection methods change frequently.
Method 3: Browser Automation with Stealth
For maximum control, you can run a headless browser with anti-detection measures. This is the most flexible approach but requires the most maintenance.
```python
from playwright.async_api import async_playwright
import asyncio

async def scrape_twitter_search(query: str, max_tweets: int = 50):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
            ]
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/131.0.0.0 Safari/537.36"
            ),
        )

        # Remove automation indicators before any page script runs
        page = await context.new_page()
        await page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)

        await page.goto(
            f"https://x.com/search?q={query}&src=typed_query",
            wait_until="networkidle"
        )

        # Scroll and collect tweets
        tweets = []
        last_height = 0
        while len(tweets) < max_tweets:
            # Extract visible tweets
            tweet_elements = await page.query_selector_all(
                'article[data-testid="tweet"]'
            )
            for element in tweet_elements:
                text_el = await element.query_selector(
                    '[data-testid="tweetText"]'
                )
                if text_el:
                    text = await text_el.inner_text()
                    if text not in [t["text"] for t in tweets]:
                        tweets.append({"text": text})

            # Scroll down and wait for new content to load
            await page.evaluate(
                "window.scrollBy(0, window.innerHeight)"
            )
            await page.wait_for_timeout(2000)

            new_height = await page.evaluate(
                "document.body.scrollHeight"
            )
            if new_height == last_height:
                break  # no new content; we've reached the end
            last_height = new_height

        await browser.close()
        return tweets[:max_tweets]

# Run the scraper
tweets = asyncio.run(
    scrape_twitter_search("web scraping 2026", 20)
)
for t in tweets:
    print(t["text"][:100])
```
Important caveats with browser automation:
- Twitter aggressively detects headless browsers
- You need residential proxies (datacenter IPs are instantly blocked)
- Login is required for most content — and logging in with automation violates ToS
- Sessions get invalidated frequently
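On the residential proxy point: Playwright accepts a proxy configuration directly on `launch()`. Here's a minimal sketch of the plumbing; the gateway hostname, port, and credentials below are placeholders for whatever your paid provider gives you, and `build_proxy_config` is my own helper name, not a Playwright API:

```python
# Sketch: shape proxy credentials into the dict that Playwright's
# launch(proxy=...) parameter expects. Host/port/user values are
# placeholders -- substitute your residential provider's gateway.

def build_proxy_config(host: str, port: int, username: str, password: str) -> dict:
    """Return a Playwright-style proxy settings dict."""
    return {
        "server": f"http://{host}:{port}",
        "username": username,
        "password": password,
    }

# Usage inside the scraper above:
#   browser = await p.chromium.launch(
#       headless=True,
#       proxy=build_proxy_config("gateway.example-provider.com", 8001, "user", "pass"),
#   )
```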
Method 4: Alternative Data Sources
Sometimes the best way to get Twitter data isn't scraping Twitter directly.
Nitter Instances
Nitter is an open-source Twitter frontend. Some public instances still work:
```python
import requests
from bs4 import BeautifulSoup

def search_via_nitter(query: str, instance: str = "nitter.net"):
    """Try multiple Nitter instances as fallback."""
    instances = [
        instance,
        "nitter.privacydev.net",
        "nitter.poast.org",
    ]
    for inst in instances:
        try:
            url = f"https://{inst}/search?q={query}"
            resp = requests.get(url, timeout=10)
            if resp.status_code == 200:
                soup = BeautifulSoup(resp.text, "html.parser")
                tweets = soup.select(".tweet-content")
                return [t.get_text() for t in tweets]
        except Exception:
            continue
    return []
```
Reality check: Nitter instances are unreliable in 2026. Many have shut down. Don't build a production system on them.
Google Cache / Archive.org
For historical tweets, search engines and web archives sometimes have cached versions:
- `site:twitter.com "your search term"` on Google
- Wayback Machine API for archived tweet pages
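The Wayback Machine's availability endpoint is simple enough to query directly: it returns the closest archived snapshot for a given URL, if one exists. A sketch, with the response parsing split into its own function so it can be checked against a canned payload:

```python
# Sketch using archive.org's availability endpoint, which reports the
# closest archived snapshot of a URL (or nothing, if never archived).
import requests

def parse_snapshot(payload: dict):
    """Extract the closest snapshot URL from an availability response."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    return closest.get("url") if closest else None

def closest_snapshot(url: str):
    """Ask archive.org for the closest archived copy of a URL."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=15,
    )
    return parse_snapshot(resp.json())
```

Deleted tweets often survive in the archive this way, though coverage is spotty for anything that wasn't widely linked.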
Academic Access
Twitter's Academic Research API still exists for qualified researchers. If you're affiliated with a university, this gives you much broader access than the commercial API.
Rate Limits and How to Handle Them
Regardless of your method, you need to respect rate limits. Here's a reusable rate limiter:
```python
import time
import random
from collections import deque

class RateLimiter:
    def __init__(
        self,
        max_requests: int,
        time_window: int,
        jitter: float = 0.5
    ):
        self.max_requests = max_requests
        self.time_window = time_window  # seconds
        self.jitter = jitter
        self.requests = deque()

    def wait_if_needed(self):
        now = time.time()
        # Remove old requests outside the window
        while (
            self.requests
            and self.requests[0] < now - self.time_window
        ):
            self.requests.popleft()
        if len(self.requests) >= self.max_requests:
            sleep_time = (
                self.requests[0]
                + self.time_window
                - now
                + random.uniform(0, self.jitter)
            )
            print(f"Rate limit — sleeping {sleep_time:.1f}s")
            time.sleep(sleep_time)
        self.requests.append(time.time())

# Usage
limiter = RateLimiter(
    max_requests=30, time_window=60, jitter=2.0
)
urls_to_scrape = ["https://x.com/user1", "https://x.com/user2"]
for url in urls_to_scrape:
    limiter.wait_if_needed()
    # ... make your request here
```
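If you'd rather not sprinkle `wait_if_needed()` calls through your code, the same limiter can be applied as a decorator. A minimal sketch (`rate_limited` is my own name for it):

```python
# Decorator sketch: gate every call to the wrapped function through
# a limiter object that exposes wait_if_needed(), like RateLimiter above.
from functools import wraps

def rate_limited(limiter):
    """Wrap a function so each call waits on the limiter first."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            limiter.wait_if_needed()
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Then any scraping function becomes `@rate_limited(limiter)` and the throttling happens transparently.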
What Doesn't Work Anymore
Let's save you time. These methods are dead or dying:
- snscrape — The most popular Twitter scraping library. Broken since mid-2023 and abandoned. Don't use it.
- Tweepy free tier — Rate limits make it impractical for any real data collection.
- Simple HTTP requests without rendering — Twitter is a fully JavaScript-rendered SPA. Raw HTTP gets you nothing useful.
- Free proxy lists — Every free proxy list is full of dead or compromised IPs. Use paid services.
- Guest tokens — Twitter killed unauthenticated API access. Guest tokens no longer work for most endpoints.
Ethical and Legal Considerations
I want to be straightforward about this:
- Public data is generally legal to access in most jurisdictions (see hiQ v. LinkedIn)
- Terms of Service violations are not criminal, but can lead to account bans and civil liability
- The CFAA (in the US) is a gray area — the Van Buren decision narrowed its scope, but scraping behind auth could still be risky
- GDPR (in the EU) applies to personal data regardless of how you collected it
- Twitter's specific stance: They've sued companies for scraping and won injunctions. Individual hobbyists are unlikely targets, but commercial operations should be careful.
My recommendation: Use the official API when you can afford it. Use managed services like ScraperAPI or ScrapeOps when you can't. Only go the browser automation route if you truly need it and understand the risks.
Recommended Stack for Twitter/X Scraping in 2026
| Component | Recommendation |
|---|---|
| Primary data source | Official API (if budget allows) |
| Proxy service | ScraperAPI or ScrapeOps for managed proxies |
| Browser automation | Playwright with stealth plugins |
| Rate limiting | Custom rate limiter (code above) |
| Data storage | PostgreSQL or MongoDB |
| Monitoring | Track success rates per method |
| Fallback | Always have 2+ methods ready |
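The fallback row in that table is worth making concrete. A small orchestrator that tries each method in priority order and returns the first non-empty result; the method callables here are placeholders for your own implementations (official API, managed service, browser automation):

```python
# Fallback sketch: try each scraping method in order until one
# returns a non-empty result. Methods are (name, callable) pairs.

def scrape_with_fallback(url: str, methods):
    """Try methods in priority order; return (method_name, result) or (None, None)."""
    for name, fetch in methods:
        try:
            result = fetch(url)
            if result:  # a non-empty result counts as success
                return name, result
        except Exception:
            continue  # a failing method just hands off to the next one
    return None, None
```

In production you'd also log which method succeeded, which feeds the "track success rates per method" monitoring row above.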
Quick Start: Minimal Working Example
If you just want to get started quickly, here's the simplest path:
```python
import requests

# Using ScraperAPI — simplest approach
API_KEY = "YOUR_SCRAPERAPI_KEY"

def get_tweet_page(tweet_url: str) -> str:
    """Fetch rendered tweet page via ScraperAPI."""
    resp = requests.get(
        "http://api.scraperapi.com",
        params={
            "api_key": API_KEY,
            "url": tweet_url,
            "render": "true"
        },
        timeout=60
    )
    return resp.text if resp.status_code == 200 else ""

# Fetch a public tweet
html = get_tweet_page(
    "https://x.com/elonmusk/status/1234567890"
)
if html:
    print(f"Got {len(html)} bytes of rendered HTML")
    # Parse with BeautifulSoup from here
```
The Twitter/X scraping landscape will keep changing. The key is building flexible systems that can swap between data sources when one breaks. Don't over-invest in any single method — it will break eventually.
Have a method that still works? Found something I missed? Share it in the comments — the community benefits when we share what's actually working right now.