DEV Community

agenthustler

How to Scrape Twitter/X in 2026: Public Data, Rate Limits, and What Still Works

Twitter/X scraping in 2026 is a minefield. After Elon Musk's aggressive API changes, rate limit crackdowns, and multiple lawsuits against scrapers, most of the old methods are dead. But public data extraction still works — if you know the current landscape.

This guide covers what actually works right now, what got killed, and how to scrape Twitter/X data without getting your IP banned or your account suspended.


The Current State of Twitter/X Data Access

Let's be clear about what changed:

  • Official API: The free tier is nearly useless (1,500 tweets/month read limit). Basic tier ($200/mo) gives you 10K tweets. Pro tier ($5,000/mo) for serious access.
  • Aggressive bot detection: Twitter now uses advanced fingerprinting, behavioral analysis, and ML-based detection.
  • Legal threats: Twitter/X has sued multiple scraping companies. They actively monitor for scraping activity.
  • Login walls: Most content now requires authentication to view.

What's still public and legal to access:

  • Public profiles and their tweet history
  • Public tweet content (when accessible without login)
  • Publicly visible engagement metrics
  • Trending topics and hashtags

Method 1: The Official API (When It Makes Sense)

Despite the cost, the official API is still the most reliable method for certain use cases.

import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"

def search_recent_tweets(query: str, max_results: int = 10):
    url = "https://api.x.com/2/tweets/search/recent"
    headers = {
        "Authorization": f"Bearer {BEARER_TOKEN}"
    }
    params = {
        "query": query,
        "max_results": max_results,
        "tweet.fields": "created_at,public_metrics,author_id",
        "expansions": "author_id",
        "user.fields": "username,name,public_metrics"
    }

    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

# Search for recent tweets
results = search_recent_tweets("python web scraping", 10)
for tweet in results.get("data", []):
    print(f"{tweet['text'][:100]}...")
    print(f"  Likes: {tweet['public_metrics']['like_count']}")
    print()

When the API makes sense:

  • You need < 10K tweets/month (Basic tier at $200)
  • You need real-time data (streaming endpoints)
  • Compliance and legal safety matter (enterprise use)
  • You need guaranteed uptime and structured data

When it doesn't:

  • Budget-constrained projects
  • Historical data (API only goes back 7 days on Basic)
  • Large-scale data collection
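If you do go the API route, the v2 endpoints advertise their limits in the x-rate-limit-* response headers, so you can back off precisely instead of guessing. A minimal sketch of that pattern — the helper names here are mine, not part of any SDK:

```python
import time
import requests

def seconds_until_reset(headers, default: float = 60.0) -> float:
    """Compute how long to sleep from the x-rate-limit-reset header
    (a Unix timestamp), falling back to a default when it's absent."""
    reset = headers.get("x-rate-limit-reset")
    if reset is None:
        return default
    return max(float(reset) - time.time(), 0.0)

def get_with_backoff(url, headers, params, max_retries: int = 3):
    """Retry a v2 API call when it returns 429, sleeping until the
    advertised reset time before trying again."""
    for _ in range(max_retries):
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        if resp.status_code != 429:
            return resp
        time.sleep(seconds_until_reset(resp.headers))
    return resp
```

Drop this in place of the plain `requests.get` in `search_recent_tweets` and a 429 becomes a pause instead of an error.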

Method 2: Managed Scraping Services

This is what I actually recommend for most people. Let someone else deal with the proxy rotation, CAPTCHA solving, and detection evasion.

ScraperAPI

ScraperAPI handles the hard parts — rotating proxies, browser rendering, and anti-bot bypass. You send a URL, get back the HTML.

import requests
from bs4 import BeautifulSoup

SCRAPER_API_KEY = "YOUR_KEY"

def scrape_twitter_profile(username: str):
    """Scrape a public Twitter profile via ScraperAPI."""
    target_url = f"https://x.com/{username}"

    response = requests.get(
        "http://api.scraperapi.com",
        params={
            "api_key": SCRAPER_API_KEY,
            "url": target_url,
            "render": "true",  # Enable JS rendering
            "country_code": "us"
        },
        timeout=60
    )

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        return soup
    return None

Pros: No proxy management, automatic retries, scales easily
Cons: Cost per request, depends on their infrastructure
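Once ScraperAPI hands you back rendered HTML, extraction is a separate problem. A hedged sketch using the data-testid hooks X currently ships in its markup — these attributes are not a stable API, and `extract_tweet_texts` is my own helper name, so verify the selectors against live HTML before relying on them:

```python
from bs4 import BeautifulSoup

def extract_tweet_texts(html: str) -> list:
    """Pull visible tweet text out of a rendered profile page.

    Assumes X still wraps tweet bodies in data-testid="tweetText"
    containers — check this against the live page, it changes.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [
        el.get_text(" ", strip=True)
        for el in soup.select('[data-testid="tweetText"]')
    ]
```

Feed it the `response.text` from `scrape_twitter_profile` and you get a plain list of strings, ready for storage.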

ScrapeOps

ScrapeOps offers a proxy aggregator and monitoring dashboard that's particularly useful for Twitter scraping. They route your requests through the best-performing proxy for each target.

import requests

SCRAPEOPS_API_KEY = "YOUR_KEY"

def scrape_with_scrapeops(url: str):
    response = requests.get(
        "https://proxy.scrapeops.io/v1/",
        params={
            "api_key": SCRAPEOPS_API_KEY,
            "url": url,
            "render_js": "true",
            "residential": "true"
        },
        timeout=60
    )
    return response

# Scrape a public tweet
result = scrape_with_scrapeops(
    "https://x.com/elonmusk/status/1234567890"
)
print(f"Status: {result.status_code}")

What makes ScrapeOps stand out is their proxy benchmarking — they test proxy providers against specific targets and route through whichever performs best. For Twitter specifically, this matters because detection methods change frequently.
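Managed proxies also make it reasonably safe to parallelize, since the provider spreads requests across IPs. A small fan-out sketch — `scrape_batch` is a hypothetical helper that accepts any fetch callable, including `scrape_with_scrapeops` above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_batch(urls, fetch, max_workers: int = 5) -> dict:
    """Fan a list of URLs out over a thread pool.

    `fetch` is any callable taking a URL; exceptions are captured
    per URL instead of killing the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Keep `max_workers` modest — the proxy layer hides your IP, but hammering the target still burns through your request quota.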


Method 3: Browser Automation with Stealth

For maximum control, you can run a headless browser with anti-detection measures. This is the most flexible approach but requires the most maintenance.

from playwright.async_api import async_playwright
import asyncio

async def scrape_twitter_search(query: str, max_tweets: int = 50):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
            ]
        )

        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/131.0.0.0 Safari/537.36"
            ),
        )

        # Remove automation indicators
        page = await context.new_page()
        await page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)

        await page.goto(
            f"https://x.com/search?q={query}&src=typed_query",
            wait_until="domcontentloaded"  # "networkidle" rarely fires on x.com
        )
        # Wait for the first tweets to render (raises on a login wall)
        await page.wait_for_selector('article[data-testid="tweet"]')

        # Scroll and collect tweets, deduplicating by text
        tweets = []
        seen = set()
        last_height = 0

        while len(tweets) < max_tweets:
            # Extract visible tweets
            tweet_elements = await page.query_selector_all(
                'article[data-testid="tweet"]'
            )

            for element in tweet_elements:
                text_el = await element.query_selector(
                    '[data-testid="tweetText"]'
                )
                if text_el:
                    text = await text_el.inner_text()
                    if text not in seen:
                        seen.add(text)
                        tweets.append({"text": text})

            # Scroll down
            await page.evaluate(
                "window.scrollBy(0, window.innerHeight)"
            )
            await page.wait_for_timeout(2000)

            new_height = await page.evaluate(
                "document.body.scrollHeight"
            )
            if new_height == last_height:
                break
            last_height = new_height

        await browser.close()
        return tweets[:max_tweets]

# Run the scraper
tweets = asyncio.run(
    scrape_twitter_search("web scraping 2026", 20)
)
for t in tweets:
    print(t["text"][:100])

Important caveats with browser automation:

  • Twitter aggressively detects headless browsers
  • You need residential proxies (datacenter IPs are instantly blocked)
  • Login is required for most content — and logging in with automation violates ToS
  • Sessions get invalidated frequently
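If you do proceed, Playwright can route the whole browser through a residential proxy via the `proxy` option on `launch()`. A sketch of building those options — the gateway address and credentials are placeholders for whatever your provider gives you, and `proxy_launch_options` is my own helper name:

```python
from typing import Optional

def proxy_launch_options(server: str,
                         username: Optional[str] = None,
                         password: Optional[str] = None) -> dict:
    """Build keyword arguments for Playwright's launch() with a proxy.

    `server` is your provider's gateway, e.g.
    "http://residential.example-proxy.com:8000" (a placeholder).
    """
    proxy = {"server": server}
    if username:
        proxy["username"] = username
        proxy["password"] = password
    return {"headless": True, "proxy": proxy}
```

Then in the scraper above: `browser = await p.chromium.launch(**proxy_launch_options("http://gateway:8000", "user", "pass"))`.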

Method 4: Alternative Data Sources

Sometimes the best way to get Twitter data isn't scraping Twitter directly.

Nitter Instances

Nitter is an open-source Twitter frontend. Some public instances still work:

import requests
from bs4 import BeautifulSoup

def search_via_nitter(query: str, instance: str = "nitter.net"):
    """Try multiple Nitter instances as fallback."""
    instances = [
        instance,
        "nitter.privacydev.net",
        "nitter.poast.org",
    ]

    for inst in instances:
        try:
            url = f"https://{inst}/search?q={query}"
            resp = requests.get(url, timeout=10)
            if resp.status_code == 200:
                soup = BeautifulSoup(resp.text, "html.parser")
                tweets = soup.select(".tweet-content")
                return [t.get_text() for t in tweets]
        except Exception:
            continue
    return []

Reality check: Nitter instances are unreliable in 2026. Many have shut down. Don't build a production system on them.

Google Cache / Archive.org

For historical tweets, search engines and web archives sometimes have cached versions:

  • site:x.com OR site:twitter.com "your search term" on Google
  • Wayback Machine API for archived tweet pages
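The Wayback Machine exposes a simple availability endpoint that returns the closest archived snapshot for a URL. A standard-library-only sketch — `availability_url` and `closest_snapshot` are my own helper names:

```python
import json
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available"

def availability_url(page_url: str, timestamp: str = None) -> str:
    """Build an availability query; timestamp is YYYYMMDD to ask
    for the snapshot nearest that date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urllib.parse.urlencode(params)}"

def closest_snapshot(page_url: str):
    """Return the archived snapshot URL for a page, or None."""
    with urllib.request.urlopen(availability_url(page_url), timeout=15) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None
```

Snapshots of tweet pages are hit-or-miss, but for deleted or historical tweets this is sometimes the only lawful source left.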

Academic Access

Twitter/X still offers a research access tier for qualified academics, though eligibility has tightened repeatedly since the old free Academic Research API was retired. If you're affiliated with a university, it can give you broader access than the commercial tiers — check the current requirements before planning a project around it.


Rate Limits and How to Handle Them

Regardless of your method, you need to respect rate limits. Here's a reusable rate limiter:

import time
import random
from collections import deque

class RateLimiter:
    def __init__(
        self,
        max_requests: int,
        time_window: int,
        jitter: float = 0.5
    ):
        self.max_requests = max_requests
        self.time_window = time_window  # seconds
        self.jitter = jitter
        self.requests = deque()

    def wait_if_needed(self):
        now = time.time()

        # Remove old requests outside the window
        while (
            self.requests
            and self.requests[0] < now - self.time_window
        ):
            self.requests.popleft()

        if len(self.requests) >= self.max_requests:
            sleep_time = (
                self.requests[0]
                + self.time_window
                - now
                + random.uniform(0, self.jitter)
            )
            print(f"Rate limit — sleeping {sleep_time:.1f}s")
            time.sleep(sleep_time)

        self.requests.append(time.time())

# Usage
limiter = RateLimiter(
    max_requests=30, time_window=60, jitter=2.0
)

urls_to_scrape = ["https://x.com/user1", "https://x.com/user2"]

for url in urls_to_scrape:
    limiter.wait_if_needed()
    # ... make your request here
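If you prefer not to call `wait_if_needed()` by hand at every call site, the same sliding-window logic works as a decorator. A self-contained sketch (`rate_limited` is my own name, not a library function):

```python
import time
import random
from collections import deque
from functools import wraps

def rate_limited(max_requests: int, time_window: float, jitter: float = 0.5):
    """Decorator: allow at most max_requests calls per time_window
    seconds, sleeping (plus random jitter) when the window is full."""
    calls = deque()

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.time()
            # Drop calls that have aged out of the window
            while calls and calls[0] < now - time_window:
                calls.popleft()
            if len(calls) >= max_requests:
                time.sleep(
                    calls[0] + time_window - now
                    + random.uniform(0, jitter)
                )
            calls.append(time.time())
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(max_requests=2, time_window=1.0, jitter=0.0)
def fetch(url: str) -> str:
    # Stand-in for a real request
    return f"fetched {url}"
```

Every call to `fetch` now self-throttles; the third call within a second blocks until the window slides.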

What Doesn't Work Anymore

Let's save you time. These methods are dead or dying:

  1. snscrape — The most popular Twitter scraping library. Broken since mid-2023 and abandoned. Don't use it.
  2. Tweepy free tier — Rate limits make it impractical for any real data collection.
  3. Simple HTTP requests without rendering — Twitter is a fully JavaScript-rendered SPA. Raw HTTP gets you nothing useful.
  4. Free proxy lists — Every free proxy list is full of dead or compromised IPs. Use paid services.
  5. Guest tokens — Twitter killed unauthenticated API access. Guest tokens no longer work for most endpoints.

Ethical and Legal Considerations

I want to be straightforward about this:

  • Public data is generally legal to access in most jurisdictions (see hiQ v. LinkedIn)
  • Terms of Service violations are not criminal, but can lead to account bans and civil liability
  • The CFAA (in the US) is a gray area — the Van Buren decision narrowed its scope, but scraping behind auth could still be risky
  • GDPR (in the EU) applies to personal data regardless of how you collected it
  • Twitter's specific stance: They've sued companies for scraping and won injunctions. Individual hobbyists are unlikely targets, but commercial operations should be careful.

My recommendation: Use the official API when you can afford it. Use managed services like ScraperAPI or ScrapeOps when you can't. Only go the browser automation route if you truly need it and understand the risks.


Recommended Stack for Twitter/X Scraping in 2026

  • Primary data source: Official API (if budget allows)
  • Proxy service: ScraperAPI or ScrapeOps for managed proxies
  • Browser automation: Playwright with stealth plugins
  • Rate limiting: Custom rate limiter (code above)
  • Data storage: PostgreSQL or MongoDB
  • Monitoring: Track success rates per method
  • Fallback: Always have 2+ methods ready
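The fallback advice deserves code: a tiny dispatcher that tries each method in order and reports which one succeeded, which also produces the per-method data the monitoring advice asks for. A sketch — `fetch_with_fallback` and the method names are illustrative:

```python
def fetch_with_fallback(url: str, methods):
    """Try each (name, fetch_fn) pair in order.

    Returns (method_name, result) for the first success, so callers
    can log which source actually delivered. Falsy results count as
    failures, since a dead scraper often returns an empty page.
    """
    errors = {}
    for name, fetch in methods:
        try:
            result = fetch(url)
            if result:
                return name, result
        except Exception as exc:
            errors[name] = exc
    raise RuntimeError(f"all methods failed for {url}: {errors}")
```

Wire your official-API client, ScraperAPI wrapper, and Playwright scraper in as the method list, and a single outage stops being a single point of failure.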

Quick Start: Minimal Working Example

If you just want to get started quickly, here's the simplest path:

import requests
import json

# Using ScraperAPI — simplest approach
API_KEY = "YOUR_SCRAPERAPI_KEY"

def get_tweet_page(tweet_url: str) -> str:
    """Fetch rendered tweet page via ScraperAPI."""
    resp = requests.get(
        "http://api.scraperapi.com",
        params={
            "api_key": API_KEY,
            "url": tweet_url,
            "render": "true"
        },
        timeout=60
    )
    return resp.text if resp.status_code == 200 else ""

# Fetch a public tweet
html = get_tweet_page(
    "https://x.com/elonmusk/status/1234567890"
)
if html:
    print(f"Got {len(html)} bytes of rendered HTML")
    # Parse with BeautifulSoup from here

The Twitter/X scraping landscape will keep changing. The key is building flexible systems that can swap between data sources when one breaks. Don't over-invest in any single method — it will break eventually.


Have a method that still works? Found something I missed? Share it in the comments — the community benefits when we share what's actually working right now.
