Ken.xu

Posted on Feb 23

From Scrapling Failure to Dev.to API Success - My Scraper Optimization Journey

#python #scraping #automation #tutorial

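Background: why build a news aggregator?

As a developer, I need to keep up with the latest tech news every day, but browsing multiple sites by hand takes too long. So I decided to build an automated news aggregation system.

The goals were simple:

📰 Fetch news automatically from multiple sources
🤖 Push a digest automatically at 8 AM every day
📊 Aggregate 50+ high-quality items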

First attempt: scraping with Scrapling

I went with Scrapling, a Python web scraping library.

Early success

The first 5 sources went smoothly:

from scrapling.fetchers import Fetcher

# Hacker News
page = Fetcher.get('https://news.ycombinator.com', stealthy_headers=True)
items = page.css('.athing')[:10]

# GitHub Trending
page = Fetcher.get('https://github.com/trending', stealthy_headers=True)
repos = page.css('article.Box-row')[:10]

✅ Sources that worked:

Hacker News (10 items)
GitHub Trending (10 items)
CSS-Tricks (10 items)
Smashing Magazine (10 items)
Medium (9 items)

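The problems

But when I tried to add more sources, things broke:

❌ Product Hunt - returned 403 Forbidden
❌ Reddit - returned 403 Forbidden
❌ Dev.to - complex page structure, selectors failed

These sites rely on bot protection or JavaScript rendering, which the plain HTTP Fetcher I was using could not get past.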

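Second attempt: tuning the selectors

I tried several approaches to fix Dev.to:

# Attempt 1: article tags
articles = page.css('article')  # 0 matches

# Attempt 2: feed-related classes
feed_items = page.css('[class*="feed"]')  # 3 matches

# Attempt 3: story-related classes
stories = page.css('[class*="story"]')  # 279 matches

# Attempt 4: post-related classes
posts = page.css('[class*="post"]')  # 36 matches

The selectors did match elements, but pulling titles and links out of them was still hard. The page structure was simply too complex.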

Turning point: discovering the Dev.to API

Then a simpler option occurred to me: why not just use the API?

Dev.to provides an official API that returns article data directly:

import requests

url = "https://dev.to/api/articles?per_page=10&sort_by=latest"
response = requests.get(url)
articles = response.json()

for article in articles:
    print(article['title'])
    print(article['url'])

The result? Perfect! 10 articles, complete data, no scraping needed.
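
A slightly more defensive variant of the same call, as a minimal sketch using only the standard library. The fields `title`, `url` and `user.name` are the ones used elsewhere in this post; the helper names `normalize_article` and `fetch_dev_to`, the User-Agent string, and the choice to skip incomplete records are assumptions made for illustration.

```python
import json
import urllib.request

API_URL = "https://dev.to/api/articles?per_page=10"


def normalize_article(article):
    """Map one raw API article dict to the fields the aggregator keeps."""
    title = (article.get("title") or "").strip()
    url = article.get("url")
    if not title or not url:
        return None  # skip incomplete records instead of raising KeyError
    user = article.get("user") or {}
    return {
        "title": title,
        "url": url,
        "author": user.get("name", ""),
        "source": "Dev.to",
    }


def fetch_dev_to(api_url=API_URL, timeout=10):
    """Fetch the article list and return the normalized items."""
    request = urllib.request.Request(
        api_url,
        headers={"User-Agent": "daily-news-aggregator/1.0"},  # placeholder value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = json.load(response)
    items = (normalize_article(article) for article in raw)
    return [item for item in items if item is not None]
```

Keeping the normalization separate from the network call makes it easy to check against a saved sample response without hitting the API.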

Final approach: a hybrid strategy

I ended up with a hybrid strategy:

Source	Method	Status
Hacker News	Scrapling	✅
GitHub Trending	Scrapling	✅
CSS-Tricks	Scrapling	✅
Smashing Magazine	Scrapling	✅
Medium	Scrapling	✅
Dev.to	API	✅
Total	-	59 items/day
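
One way to wire such a mix together is a small registry that maps each source name to a fetch function, whether that function scrapes or calls an API. This is a sketch, not the project's actual code: `collect` and the two stub fetchers are hypothetical names, and the stubs stand in for real Scrapling or API calls.

```python
def collect(sources):
    """Run every fetcher; one failing source must not abort the others."""
    items, errors = [], {}
    for name, fetch in sources.items():
        try:
            items.extend(fetch())
        except Exception as exc:  # broad on purpose: isolate each source
            errors[name] = str(exc)
    return items, errors


# Stub fetchers standing in for real scraper / API functions
def fake_api_source():
    return [{"title": "A", "url": "https://example.com/a", "source": "Dev.to"}]


def broken_scraper():
    raise RuntimeError("403 Forbidden")


items, errors = collect({"Dev.to": fake_api_source, "Reddit": broken_scraper})
```

Here `items` holds the one article from the working source and `errors` records why the other one failed, so a blocked site shows up in the logs instead of killing the daily run.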

Key lessons

1. Anti-bot countermeasures

What made the Scrapling sources work was a set of countermeasures against anti-bot protection:

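import random
import time

from scrapling.fetchers import Fetcher

# Placeholder pool: fill in real browser User-Agent strings
user_agents = ['Mozilla/5.0 (X11; Linux x86_64)', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)']

def fetch_with_retry(url):
    # Random User-Agent
    headers = {'User-Agent': random.choice(user_agents)}

    # Rate limiting (1-2 second interval)
    time.sleep(random.uniform(1, 2))

    # Retry with exponential backoff
    for attempt in range(3):
        try:
            return Fetcher.get(url, headers=headers, timeout=10)
        except Exception:
            time.sleep(2 ** attempt)
    return None  # every attempt failed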
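
The retry logic can also be pulled out into a reusable helper that is independent of any fetching library. The following is a minimal sketch: `retry_with_backoff`, its parameters, and the added random jitter are assumptions for illustration, not part of Scrapling.

```python
import random
import time


def retry_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or the retries run out.

    Waits base_delay * 2**attempt plus up to 1s of random jitter between
    attempts, and re-raises the last error if every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < retries - 1:  # no point sleeping after the last try
                sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    raise last_error
```

A call might look like `retry_with_backoff(lambda: Fetcher.get(url, timeout=10))`. Passing `sleep` as a parameter keeps the helper testable without real waiting, and re-raising the final error makes a permanently failing source visible to the caller.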

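2. APIs beat scraping

When an official API exists, use it first:

✅ Advantages of an API:

Stable
Complete data
No selectors to maintain
Officially supported

❌ Disadvantages of scraping:

Breaks easily
Needs frequent maintenance
May violate the site's ToS
Slower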

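3. Adaptive selectors

For sites that have to be scraped, try several selectors in order and fall back when one comes up empty:

# Try multiple selectors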
title = article.css('h2 a::text').get()
if not title:
    title = article.css('h3 a::text').get()
if not title:
    title = article.css('[class*="title"] a::text').get()
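
The chain of `if not title` checks can be folded into a loop over a selector list. This sketch assumes that `node.css(selector).get()` returns `None` or an empty string when nothing matches, as the snippet above implies; `first_match` and `TITLE_SELECTORS` are hypothetical names.

```python
# Selectors to try, most specific first
TITLE_SELECTORS = ["h2 a::text", "h3 a::text", '[class*="title"] a::text']


def first_match(node, selectors):
    """Return the first non-blank result of node.css(selector).get(), stripped."""
    for selector in selectors:
        value = node.css(selector).get()
        if value and value.strip():
            return value.strip()
    return None  # none of the selectors matched
```

With this in place the extraction becomes `title = first_match(article, TITLE_SELECTORS)`, and supporting a new page layout only means adding one entry to the list.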

The complete solution

Architecture

┌─────────────────────────────────────┐
│   Daily News Aggregator             │
├─────────────────────────────────────┤
│ Scrapling (5 sources)               │
│ ├─ Hacker News                      │
│ ├─ GitHub Trending                  │
│ ├─ CSS-Tricks                       │
│ ├─ Smashing Magazine                │
│ └─ Medium                           │
│                                     │
│ API (1 source)                      │
│ └─ Dev.to                           │
├─────────────────────────────────────┤
│ Data Processing                     │
│ ├─ Deduplication                    │
│ ├─ Classification                   │
│ └─ Storage (JSON)                   │
├─────────────────────────────────────┤
│ Automation                          │
│ ├─ Cron (Daily 8 AM)                │
│ └─ Telegram Push                    │
└─────────────────────────────────────┘
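
The Deduplication and Storage (JSON) stages in the diagram could look roughly like this. It is a sketch under assumptions: treating two items as duplicates when their URLs match after light normalization is one possible rule, and `canonical_url`, `deduplicate` and `save_json` are hypothetical helper names.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def deduplicate(items):
    """Keep the first item seen for each canonical URL, preserving order."""
    seen = set()
    unique = []
    for item in items:
        key = canonical_url(item["url"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def save_json(items, path):
    """Write the items as readable UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(items, fh, ensure_ascii=False, indent=2)
```

Deduplicating on the URL rather than the title avoids dropping distinct articles that happen to share a headline, at the cost of missing the same story reposted under different links.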

Performance

Total run time: 10-15 seconds
Success rate: 99%
Volume: 59 items/day
Stability: ⭐⭐⭐⭐⭐

Code examples

Basic scraper

from scrapling.fetchers import Fetcher

class DailyNewsAggregator:
    def scrape_hacker_news(self):
        page = Fetcher.get('https://news.ycombinator.com', 
                          stealthy_headers=True)
        items = page.css('.athing')[:10]

        for item in items:
            title = item.css('.titleline > a::text').get()
            link = item.css('.titleline > a::attr(href)').get()

            if title and link:
                yield {
                    'title': title.strip(),
                    'url': link,
                    'source': 'Hacker News'
                }
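
One detail worth guarding against: self-posts on Hacker News (Ask HN threads, for example) link to a relative `item?id=...` path rather than an external URL, so the extracted `href` is not always absolute. A small sketch of a fix using the standard library, where `absolute_link` and `HN_BASE` are hypothetical names:

```python
from urllib.parse import urljoin

HN_BASE = "https://news.ycombinator.com/"


def absolute_link(href, base=HN_BASE):
    """Resolve a possibly relative href against the site's base URL."""
    return urljoin(base, href)
```

`urljoin` leaves absolute URLs untouched, so the helper can be applied to every extracted link, for instance as `'url': absolute_link(link)` in the yielded dict.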

API call

import requests

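def scrape_dev_to():
    url = "https://dev.to/api/articles?per_page=10&sort_by=latest"
    # Identify the client; the User-Agent value here is a placeholder
    headers = {'User-Agent': 'daily-news-aggregator/1.0'}
    response = requests.get(url, headers=headers, timeout=10)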

    if response.status_code == 200:
        articles = response.json()
        for article in articles:
            yield {
                'title': article['title'],
                'url': article['url'],
                'author': article['user']['name'],
                'source': 'Dev.to'
            }

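Summary

This project taught me to:

Prefer official APIs - more stable and reliable than scraping
Take anti-bot countermeasures seriously - User-Agent, rate limiting, retries
Use a hybrid strategy - pick the method that fits each source
Automate - a scheduled job keeps everything running on its own

Now I get 59 curated items every morning and never have to browse by hand!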

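Next steps

I plan to keep improving the system:

[ ] Integrate Agent Browser to handle JS-rendered sites
[ ] Add database storage (SQLite)
[ ] Implement automatic push to Telegram
[ ] Add content classification and deduplication
[ ] Monitor investment signals and keyword alerts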
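
For the SQLite item, a possible starting point is a single table keyed by URL, which gives deduplication across days for free. This is only a sketch of one option: the table name, columns, and the `open_store` / `save_items` helpers are all assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    url        TEXT PRIMARY KEY,
    title      TEXT NOT NULL,
    source     TEXT NOT NULL,
    fetched_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def _count(conn):
    return conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]


def save_items(conn, items):
    """Insert items, skipping URLs already stored; return the number added."""
    before = _count(conn)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO articles (url, title, source) "
            "VALUES (:url, :title, :source)",
            items,
        )
    return _count(conn) - before
```

Because `url` is the primary key, `INSERT OR IGNORE` silently skips anything collected on a previous run, so the return value is the number of genuinely new items for that day.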

Open source

The project is open source on GitHub:

🔗 daily-news-aggregator

Stars ⭐ and forks are welcome!

Are you working on something similar? Share your experience in the comments!
