Oussama Belhadi

Automating Script Execution and Building a Production-Ready Data Pipeline with GitHub Actions

Learn how to set up a fully automated workflow to fetch external data, update your web app’s content, and trigger redeployment using GitHub Actions. In this guide, I’ll use news fetching as a practical example, but the approach applies to any data pipeline. You’ll see real-world CI/CD, automation tips, and how to keep your site’s data up-to-date—no manual intervention required!

🚀 The Goal

I wanted my Next.js app to always display the latest news from a third-party site, without me having to manually update a JSON file or trigger a redeploy. The solution? Automate everything: scraping, data update, and deployment.

🛠️ The Stack

Next.js (App Router)
Tailwind CSS + DaisyUI
Puppeteer for scraping
GitHub Actions for CI/CD
Vercel for hosting

🧩 The Workflow

1. Scraping News with Puppeteer
I wrote a Node.js script using Puppeteer to scrape news headlines and details, then save them to src/data/news.json in my repo.
Key code snippet:

const puppeteer = require('puppeteer');
const fs = require('fs');

async function scrapeNews() {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'] // Required for CI!
  });
  const page = await browser.newPage();
  await page.goto('https://www.news-website.com/', { waitUntil: 'networkidle2' });

  const news = []; // populated by the scraping logic below
  // ...scraping logic (extract headlines and details from the page)...

  fs.writeFileSync('src/data/news.json', JSON.stringify(news, null, 2));
  await browser.close();
}

scrapeNews().catch((err) => {
  console.error(err);
  process.exit(1); // make the CI job fail loudly if scraping breaks
});

Pro tip:
If you run Puppeteer in CI (like GitHub Actions), you must use the --no-sandbox and --disable-setuid-sandbox flags, or Chromium will fail to launch.
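If you prefer to keep Chromium's sandbox enabled during local development, one option is to toggle the flags off the CI environment variable that GitHub Actions sets. A small sketch (launchBrowser is just an illustrative helper, not part of my actual script):

const puppeteer = require('puppeteer');

// GitHub Actions (and most CI providers) set CI=true in the environment,
// so the sandbox-disabling flags can be limited to CI runs only.
const isCI = !!process.env.CI;

async function launchBrowser() {
  return puppeteer.launch({
    headless: true,
    args: isCI ? ['--no-sandbox', '--disable-setuid-sandbox'] : []
  });
}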

2. Automating with GitHub Actions

I created a workflow (.github/workflows/scrape.yml) to:
Run on a schedule (every 12 hours) or manually
Install dependencies
Run the scraper
Commit and push the updated news.json back to GitHub
Key workflow steps:

permissions:
  contents: write

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v4
        with:
          node-version: '18'
      - run: npm ci
      - run: node utils/News/scrape-news.js
      - run: |
          # commit as the GitHub Actions bot user
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
      - run: |
          git add src/data/news.json
          # commit only if news.json actually changed
          git diff --cached --quiet || git commit -m "chore: update news.json [auto]"
          git push
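The snippet above only shows the permissions and the job itself; the schedule and the manual trigger live in the workflow's on: block. A minimal sketch of what that block can look like (one way to express the 12-hour schedule):

on:
  schedule:
    - cron: '0 */12 * * *'   # every 12 hours (times are UTC)
  workflow_dispatch:          # adds a "Run workflow" button in the Actions tab

workflow_dispatch is what covers the "or manually" part: you can kick off the scrape on demand from the Actions tab.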

Don’t forget:
Add permissions: contents: write at the top level of your workflow, or you’ll get a 403 error when trying to push.

3. Automatic Vercel Redeploys
Vercel is connected to my GitHub repo. When the workflow pushes a new commit (with updated news), Vercel automatically rebuilds and redeploys my app. The site always shows the latest scraped news—no manual intervention!

🐛 Common Pitfalls & Fixes

Puppeteer fails to launch in CI:
Add args: ['--no-sandbox', '--disable-setuid-sandbox'] to your launch options.

Workflow can’t push to repo:
Add permissions: contents: write to your workflow YAML.

Case-sensitive paths:
Linux runners use a case-sensitive filesystem, unlike the defaults on macOS and Windows. Double-check your file and folder names (e.g. utils/News vs utils/news).

✅ The Result
News is scraped and updated automatically, every 12 hours.
My Next.js app always displays fresh content.
No manual data updates or redeploys needed.
The workflow is robust, production-ready, and easy to maintain.
