[Golang] Research and Implementation: Replacing a PTT Data Crawler with the Firecrawl API

#data #backend #go #api

title: [Golang] Research and Implementation of Using Firecrawl API for PTT Data Crawling
published: false
date: 2025-05-25 00:00:00 UTC
tags: 
canonical_url: https://www.evanlin.com/firecrawl-go-ptt/
---

# Research and Implementation of Using Firecrawl API for PTT Data Crawling

## Preface

The `photomgr` project (https://github.com/kkdai/photomgr) has long been a handy tool for crawling the PTT Beauty board, fetching its articles and images. Originally, it connected to the PTT website directly from Google Cloud Platform (GCP), which was simple and convenient. Recently, however, PTT moved behind Cloudflare protection, and connections from GCP were often blocked. The likely cause is Cloudflare's anti-crawling mechanisms (such as CAPTCHA challenges or IP restrictions), which left the original crawler effectively unusable. To keep `photomgr` running, we evaluated several alternatives and settled on the Firecrawl API for crawling PTT data. This post explains why we made the change, how we made it, and what the code looks like afterward.

## Related Changes

PTT recently started using Cloudflare to protect its website, adding anti-crawling mechanisms such as CAPTCHA challenges and IP restrictions. Our original approach, sending HTTP requests to PTT directly from GCP, stopped being viable because Cloudflare blocked or throttled those requests and the crawler failed frequently. After some research, we found that the Firecrawl service fits this problem well. It gets past Cloudflare's anti-crawling protection and also converts the page content into clean markdown, which is easier to parse. We therefore changed the crawling logic of `photomgr` from a direct connection to the Firecrawl API for both the article list and the article content of the PTT Beauty board.

## What is Firecrawl?

Firecrawl (https://www.firecrawl.dev/) is an API service built for web crawling. It fetches a web page and converts the content into a structured format such as markdown or JSON. Its biggest advantage for us is that it copes with anti-crawling protection like Cloudflare's, and it can simulate browser behavior to fetch dynamically loaded content. With the `/v1/scrape` endpoint, we send a URL and get back the main content of the page, so we no longer have to deal with complex HTML ourselves. This makes crawling more stable and saves a lot of time that would otherwise go into parsing logic.

## How to Modify?

To move `photomgr` to Firecrawl, the crawling logic in `ptt.go` has to call the Firecrawl API instead of fetching PTT directly. The key points of the change are:

1.  **Keep Public APIs Unchanged**: Other users already rely on the public functions in `ptt.go` (such as `GetPosts` and `GetPostDetails`), so their input and output formats must stay the same; only the internal implementation changes.
2.  **Use the Firecrawl API**: Use Firecrawl's `/v1/scrape` endpoint to fetch the list page of the PTT Beauty board (https://www.ptt.cc/bbs/Beauty/index.html) and single article pages (such as https://www.ptt.cc/bbs/Beauty/M.1748080032.A.015.html).
3.  **Parse Markdown Data**: Firecrawl returns markdown, which we parse into structured data such as article title, URL, author, date, and push count (or the "爆" mark).
4.  **Manage the API Key with an Environment Variable**: The Firecrawl API requires a key. We read it from the environment variable `FIRECRAWL_KEY` so that it is never hardcoded in the source.
5.  **Unit Testing**: Add unit tests for the crawler and the parsing logic, aiming for at least 80% code coverage.

### Setting the `over18=1` Cookie

The PTT Beauty board is age-restricted, and a request must carry the `over18=1` cookie to pass the check. In the Firecrawl API request, we add this cookie under `headers` so that the page content can be fetched. The request body looks like this:

```json
{
  "url": "https://www.ptt.cc/bbs/Beauty/index.html",
  "headers": {
    "Cookie": "over18=1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
  },
  "formats": ["markdown"],
  "onlyMainContent": true,
  "waitFor": 1000
}
```


This JSON tells Firecrawl to include the `over18=1` cookie when it sends the request, simulating a user who has already passed the age verification. The `User-Agent` header imitates a browser so that the PTT server is less likely to reject the request because of an unusual header.

## Related Code

The following core examples show how to fetch and parse data from the PTT Beauty board with the Firecrawl API:

### Grab the List Page

```go
package ptt

import (
	"bytes"
	"encoding/json"
	"errors"
	"net/http"
	"os"
	"strings"
)

// FirecrawlResponse holds the fields we read from a /v1/scrape response.
type FirecrawlResponse struct {
	Success bool `json:"success"`
	Data    struct {
		Markdown string `json:"markdown"`
	} `json:"data"`
}

// Post is the assumed structure of one entry on the list page.
type Post struct {
	Title     string
	URL       string
	Author    string
	Date      string
	PushCount int
}

// GetPosts fetches the Beauty board list page through Firecrawl and parses it.
func GetPosts() ([]Post, error) {
	apiKey := os.Getenv("FIRECRAWL_KEY")
	if apiKey == "" {
		return nil, errors.New("FIRECRAWL_KEY is not set")
	}

	url := "https://www.ptt.cc/bbs/Beauty/index.html"
	reqBody := map[string]interface{}{
		"url": url,
		"headers": map[string]string{
			"Cookie":     "over18=1",
			"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
		},
		"formats":         []string{"markdown"},
		"onlyMainContent": true,
		"waitFor":         1000,
	}

	body, err := json.Marshal(reqBody)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, "https://api.firecrawl.dev/v1/scrape", bytes.NewBuffer(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var fcResp FirecrawlResponse
	if err := json.NewDecoder(resp.Body).Decode(&fcResp); err != nil {
		return nil, err
	}
	if !fcResp.Success {
		return nil, errors.New("Firecrawl API request failed")
	}

	// Parse the markdown into Post structs.
	return parseIndexMarkdown(fcResp.Data.Markdown)
}

func parseIndexMarkdown(markdown string) ([]Post, error) {
	// Parse with regular expressions or a markdown parsing library.
	// Example: extract the title, URL, author, date, and push count.
	var posts []Post
	// Parsing logic (simplified example).
	for _, line := range strings.Split(markdown, "\n") {
		if strings.Contains(line, "[正妹]") || strings.Contains(line, "[公告]") {
			// Parse the title, URL, and so on.
			post := Post{ /* fill in the parsed fields */ }
			posts = append(posts, post)
		}
	}
	return posts, nil
}
```
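
The parsing step is left as a placeholder in the example. As a rough sketch, assuming each list entry arrives as a single line shaped like the sample used in this post's unit test (`title author M/D ...`), a regular expression can split out the title, author, and date. The names `indexEntry`, `indexLineRe`, and `parseIndexLine` are hypothetical, and the markdown Firecrawl actually returns may be laid out differently, so inspect real output before relying on this pattern.

```go
package main

import "regexp"

// indexEntry holds the fields this sketch recovers from one list line.
// The type is hypothetical and not the Post type used by photomgr.
type indexEntry struct {
	Title  string
	Author string
	Date   string
}

// indexLineRe assumes a line of the form
//   [category] title text author M/D ...
// The title is matched lazily so that the last two tokens before the
// rest of the line are taken as the author and the date.
var indexLineRe = regexp.MustCompile(`^\s*(\[[^\]]+\].*?)\s+(\S+)\s+(\d{1,2}/\d{1,2})\b`)

// parseIndexLine returns the parsed entry and true when the line matches
// the assumed layout, or a zero entry and false otherwise.
func parseIndexLine(line string) (indexEntry, bool) {
	m := indexLineRe.FindStringSubmatch(line)
	if m == nil {
		return indexEntry{}, false
	}
	return indexEntry{Title: m[1], Author: m[2], Date: m[3]}, true
}
```

A title that itself ends in a token followed by something date-like would be split at the wrong place, which is one reason to treat this as a starting point rather than a finished parser.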


### Grab a Single Article

```go
// PostDetail is the assumed structure of a single article.
type PostDetail struct {
	Author    string
	Board     string
	Title     string
	Date      string
	ImageURLs []string
	Content   string
}

// GetPostDetails fetches one article page through Firecrawl and parses it.
func GetPostDetails(url string) (PostDetail, error) {
	apiKey := os.Getenv("FIRECRAWL_KEY")
	if apiKey == "" {
		return PostDetail{}, errors.New("FIRECRAWL_KEY is not set")
	}

	reqBody := map[string]interface{}{
		"url": url,
		"headers": map[string]string{
			"Cookie":     "over18=1",
			"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
		},
		"formats":         []string{"markdown"},
		"onlyMainContent": true,
		"waitFor":         1000,
	}

	body, err := json.Marshal(reqBody)
	if err != nil {
		return PostDetail{}, err
	}
	req, err := http.NewRequest(http.MethodPost, "https://api.firecrawl.dev/v1/scrape", bytes.NewBuffer(body))
	if err != nil {
		return PostDetail{}, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		return PostDetail{}, err
	}
	defer resp.Body.Close()

	var fcResp FirecrawlResponse
	if err := json.NewDecoder(resp.Body).Decode(&fcResp); err != nil {
		return PostDetail{}, err
	}
	if !fcResp.Success {
		return PostDetail{}, errors.New("Firecrawl API request failed")
	}

	// Parse the markdown into a PostDetail struct.
	return parsePostMarkdown(fcResp.Data.Markdown)
}

func parsePostMarkdown(markdown string) (PostDetail, error) {
	// Parsing logic (simplified example).
	var detail PostDetail
	for _, line := range strings.Split(markdown, "\n") {
		if strings.HasPrefix(line, "作者") {
			detail.Author = strings.TrimSpace(strings.TrimPrefix(line, "作者"))
		} else if strings.Contains(line, "https://i.imgur.com") {
			// Simplified: this stores the whole line, not just the URL.
			detail.ImageURLs = append(detail.ImageURLs, line)
		} // Parse the other fields here.
	}
	return detail, nil
}
```
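
The simplified parser stores every line that mentions `https://i.imgur.com` as is, so markdown image syntax or surrounding text would end up in `ImageURLs`. One possible refinement, sketched here under the assumption that images are hosted on `i.imgur.com` with a common file extension, is to extract only the URLs and drop duplicates. `imgurURLRe` and `extractImgurURLs` are hypothetical names.

```go
package main

import "regexp"

// imgurURLRe matches direct image links on i.imgur.com.
// The list of extensions is an assumption for this sketch.
var imgurURLRe = regexp.MustCompile(`https://i\.imgur\.com/[0-9A-Za-z]+\.(?:jpg|jpeg|png|gif)`)

// extractImgurURLs returns each distinct image URL found in the markdown,
// in order of first appearance.
func extractImgurURLs(markdown string) []string {
	seen := make(map[string]bool)
	var urls []string
	for _, u := range imgurURLRe.FindAllString(markdown, -1) {
		if seen[u] {
			continue
		}
		seen[u] = true
		urls = append(urls, u)
	}
	return urls
}
```

Because the pattern matches anywhere in the text, it handles both bare URLs and links wrapped in `![](...)` without a separate case for each.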


### Unit Test Example

```go
package ptt

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestGetPosts(t *testing.T) {
	// Simulated Firecrawl API response.
	mockMarkdown := `[正妹] OOO jerryyuan 5/24 [搜尋同標題文章](https://www.ptt.cc/bbs/Beauty/search?q=thread%3A%5B%E6%AD%A3%E5%A6%B9%5D)`
	posts, err := parseIndexMarkdown(mockMarkdown)
	assert.NoError(t, err)
	assert.Len(t, posts, 1)
	// Holds once parseIndexMarkdown fills in Title from the line.
	assert.Equal(t, "[正妹] OOO", posts[0].Title)
}
```
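
The test above covers parsing only. To exercise the HTTP call without a real API key or network access, one option is to make the endpoint a parameter and point it at a local fake server from `net/http/httptest`. The sketch below assumes such a refactoring; `scrapeMarkdown`, `scrapeResponse`, and `newFakeFirecrawl` are hypothetical names and not part of `photomgr`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// scrapeResponse mirrors the response fields used in this post.
type scrapeResponse struct {
	Success bool `json:"success"`
	Data    struct {
		Markdown string `json:"markdown"`
	} `json:"data"`
}

// scrapeMarkdown posts pageURL to endpoint and returns the markdown field.
// Taking endpoint as a parameter lets a test substitute a local server.
func scrapeMarkdown(client *http.Client, endpoint, apiKey, pageURL string) (string, error) {
	body, err := json.Marshal(map[string]interface{}{
		"url":     pageURL,
		"formats": []string{"markdown"},
	})
	if err != nil {
		return "", err
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("scrape request returned status %d", resp.StatusCode)
	}
	var out scrapeResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if !out.Success {
		return "", fmt.Errorf("scrape request reported failure")
	}
	return out.Data.Markdown, nil
}

// newFakeFirecrawl starts a local server that rejects requests without the
// expected bearer token and otherwise returns the canned markdown.
func newFakeFirecrawl(wantKey, markdown string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer "+wantKey {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		var resp scrapeResponse
		resp.Success = true
		resp.Data.Markdown = markdown
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(resp)
	}))
}
```

Inside a test function, the fake would be created with `newFakeFirecrawl`, closed with `defer srv.Close()`, and queried through `scrapeMarkdown(srv.Client(), srv.URL, ...)`, so both the success path and the rejected-key path can be asserted.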


## Summary

After PTT moved behind Cloudflare protection, our original GCP-based crawler stopped working, and we had to find another way to fetch the data. The Firecrawl API filled that gap: it gets past Cloudflare's anti-crawling protection and converts PTT pages into clean markdown, which saves a lot of parsing work. We reworked `ptt.go` in `photomgr` to fetch the Beauty board list and articles through Firecrawl while preserving the original public API, so existing users are not affected. Reading the API key from an environment variable keeps it out of the source code, and the added unit tests guard the parsing logic. The migration taught us how to cope with website protection mechanisms and showed that Firecrawl is a capable helper for crawling tasks. We will keep watching for changes on the PTT side so the crawler continues to run smoothly.
