Rehan Sayyed
I Tried Scraping LinkedIn Posts with Puppeteer and This Is What Actually Worked

I thought this would take 20 minutes.

Open Puppeteer.
Grab some text.
Save it to JSON. Done.

Instead, LinkedIn reminded me that modern web apps are not just pages.

They are systems.
They react.
They delay.
They break your assumptions.

And suddenly, a simple script turns into a late night debugging session.

This is what actually worked.


🚀 Setting up Puppeteer

We start simple.

const puppeteer = require('puppeteer');
const fs = require('fs');

  • Puppeteer controls the browser
  • fs writes the scraped data to disk

Clean. Minimal. Enough to begin.

Then we wrap everything so we can use async and await properly.

(async () => {

๐ŸŒ Launching a real browser

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

I kept headless mode off on purpose.

Because LinkedIn is very sensitive to automation.

Headless browsers get flagged faster.
A visible browser behaves more like a real user.

And when you are debugging, visibility is everything.
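If you want to lean further into looking like a real user, Puppeteer's launch options give you a few more knobs. The sketch below is my own suggestion, not part of the original script: `defaultViewport` and `slowMo` are standard Puppeteer options, and the Chromium flag hides one common automation fingerprint (it is not a detection bypass).

```javascript
// Sketch: launch options that make automation less obvious.
// They reduce some fingerprints; they do not defeat detection.
const launchOptions = {
  headless: false,
  defaultViewport: null, // keep the real window size instead of 800x600
  slowMo: 50,            // add a small delay to every Puppeteer action
  args: ['--disable-blink-features=AutomationControlled']
};

// usage: const browser = await puppeteer.launch(launchOptions);
```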


๐Ÿ” Logging into LinkedIn

await page.goto('https://www.linkedin.com/login', {
  waitUntil: 'networkidle2',
  timeout: 60000
});

await page.type('#username', 'xxxx');
await page.type('#password', 'xxxxxx');
await page.click('button[type="submit"]');
await page.waitForNavigation({ timeout: 60000 });

At this point, everything felt done.

I thought I was finished.
I was not.

LinkedIn often throws:

  • security checks
  • captchas
  • verification screens

So even after login, we pause.


โณ Handling the security check

console.log('Please complete the security check...');
await new Promise(resolve => setTimeout(resolve, 12000));

No hacks. No bypass.

Just wait.

Sometimes the most reliable solution
is the least clever one.
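If you keep the fixed wait, a tiny helper with random jitter makes repeated pauses look less machine-precise. The helper and the jitter range are my own sketch, not part of the original script:

```javascript
// Sleep for ms plus a random jitter, so pauses are not suspiciously exact.
function sleep(ms, jitterMs = 0) {
  const delay = ms + Math.floor(Math.random() * (jitterMs + 1));
  return new Promise(resolve => setTimeout(resolve, delay));
}

// usage: await sleep(12000, 3000); // waits between 12 and 15 seconds
```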


🔗 Providing the post links

const postLinks = [ ... ];

Instead of crawling everything, I used a fixed list of URLs.

Why?

Predictable input gives predictable output.

LinkedIn's infinite scroll can get messy very quickly.


๐Ÿ” Visiting each post

for (const link of postLinks) {
  await page.goto(link, {
    waitUntil: 'domcontentloaded',
    timeout: 60000
  });

Notice I used domcontentloaded.

LinkedIn keeps making background requests forever.
Waiting for network idle can hang your script.

This small change
saved a lot of frustration.
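Related hardening I would consider here (my own addition, not from the original script): make sure one dead link or timeout skips that post instead of crashing the whole run. A small generic helper captures the pattern:

```javascript
// Run an async task for each item, collecting failures instead of throwing.
// Hypothetical helper, not part of the original script.
async function forEachSafe(items, task) {
  const failed = [];
  for (const item of items) {
    try {
      await task(item);
    } catch (err) {
      console.error(`Skipping ${item}: ${err.message}`);
      failed.push(item);
    }
  }
  return failed;
}

// usage with the scraper:
// const failed = await forEachSafe(postLinks, async link => {
//   await page.goto(link, { waitUntil: 'domcontentloaded', timeout: 60000 });
//   // ... scroll and extract ...
// });
```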


📜 Scrolling to load content

This is where things started breaking.

await page.evaluate(async () => {
  await new Promise(resolve => {
    let lastHeight = document.body.scrollHeight;
    let idleTicks = 0;

    const scrollInterval = setInterval(() => {
      window.scrollBy(0, 1000);

      if (document.body.scrollHeight > lastHeight) {
        // New content loaded; keep scrolling.
        lastHeight = document.body.scrollHeight;
        idleTicks = 0;
      } else if (++idleTicks >= 5) {
        // No growth for several ticks: nothing new is loading.
        clearInterval(scrollInterval);
        resolve();
      }
    }, 300);
  });
});

LinkedIn loads content lazily.

No scroll.
No content.

So we simulate a real user:

Scroll
Pause
Repeat

Until nothing new loads.


๐Ÿ–ผ๏ธ Waiting for images

try {
  await page.waitForSelector('img[src*="media.licdn.com"]', { timeout: 15000 });
} catch {
  // Text-only posts never load a media image; move on without one.
}

Without this wait, image URLs can be missing from your data.

It pauses until an image from LinkedIn's media CDN appears in the DOM, so extraction does not run before the post image exists.


🧠 Extracting content and images

const postData = await page.evaluate(() => {
  const contentElement =
    document.querySelector('[data-test-post-container] .break-words') ||
    document.querySelector('.feed-shared-update-v2__description') ||
    document.querySelector('.feed-shared-text__text-view') ||
    document.querySelector('span.break-words');

  const images = [];
  // Skip index 0: the first media.licdn.com image is usually the author's avatar.
  const imageElements = document.querySelectorAll('img[src*="media.licdn.com/"]');

  if (imageElements.length > 1) {
    images.push(imageElements[1].src);
  }

  return {
    content: contentElement
      ? contentElement.innerText.trim()
      : 'No content found',
    images: images.length > 0 ? images : ['No post image found']
  };
});

This is real-world scraping.

There is no single selector that always works.

So we try multiple fallbacks until something works.

Also:

  • first image โ†’ profile
  • second image โ†’ actual post

Not perfect.
But practical.
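The `||` fallback chain works, but it gets unwieldy as selectors pile up. A small helper (my own sketch, not part of the original script) makes the same idea reusable; it takes a query function, so inside `page.evaluate` you would pass `s => document.querySelector(s)`:

```javascript
// Return the first truthy result from trying each selector in order.
function firstMatch(query, selectors) {
  for (const sel of selectors) {
    const el = query(sel);
    if (el) return el;
  }
  return null;
}

// Example with a fake lookup table standing in for the DOM:
const fakeDom = { '.feed-shared-text__text-view': 'post text' };
const hit = firstMatch(s => fakeDom[s] ?? null, [
  '[data-test-post-container] .break-words',
  '.feed-shared-text__text-view'
]);
console.log(hit); // → 'post text'
```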


📦 Storing the data

scrapedPosts.push({
  link,
  content: postData.content,
  images: postData.images
});

We collect everything step by step.

Simple. Clean. Reliable.


💾 Saving it to a file

fs.writeFileSync(
  'linkedInScrapedPosts.json',
  JSON.stringify(scrapedPosts, null, 2)
);

Now everything is structured and ready to use.


🧩 Final thoughts

This script is not perfect.

But it is real.

It reflects how scraping actually works:

You try something
It breaks
You tweak
You retry

And slowly, it starts working.

The real lesson?

Web pages are not static
They are systems reacting to user behavior

Once you understand that, everything changes.


โš ๏ธ A small note

Always respect platform rules and use scraping responsibly.


🧾 Full working script

If you just want it to work, copy this and run it.

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto('https://www.linkedin.com/login', {
    waitUntil: 'networkidle2',
    timeout: 60000
  });

  await page.type('#username', 'YOUR_EMAIL');
  await page.type('#password', 'YOUR_PASSWORD');
  await page.click('button[type="submit"]');
  await page.waitForNavigation({ timeout: 60000 });

  console.log('Complete the security check if prompted...');
  await new Promise(resolve => setTimeout(resolve, 12000));

  const postLinks = [
    // add your links here
  ];

  const scrapedPosts = [];

  for (const link of postLinks) {
    await page.goto(link, {
      waitUntil: 'domcontentloaded',
      timeout: 60000
    });

    await page.evaluate(async () => {
      await new Promise(resolve => {
        let lastHeight = document.body.scrollHeight;
        let idleTicks = 0;

        const scrollInterval = setInterval(() => {
          window.scrollBy(0, 1000);

          if (document.body.scrollHeight > lastHeight) {
            // New content loaded; keep scrolling.
            lastHeight = document.body.scrollHeight;
            idleTicks = 0;
          } else if (++idleTicks >= 5) {
            // No growth for several ticks: nothing new is loading.
            clearInterval(scrollInterval);
            resolve();
          }
        }, 300);
      });
    });

    try {
      await page.waitForSelector('img[src*="media.licdn.com"]', { timeout: 15000 });
    } catch {
      // Text-only posts never load a media image; move on without one.
    }

    const postData = await page.evaluate(() => {
      const contentElement =
        document.querySelector('[data-test-post-container] .break-words') ||
        document.querySelector('.feed-shared-update-v2__description') ||
        document.querySelector('.feed-shared-text__text-view') ||
        document.querySelector('span.break-words');

      const images = [];
      const imageElements = document.querySelectorAll('img[src*="media.licdn.com/"]');

      if (imageElements.length > 1) {
        images.push(imageElements[1].src);
      }

      return {
        content: contentElement
          ? contentElement.innerText.trim()
          : 'No content found',
        images: images.length > 0 ? images : ['No post image found']
      };
    });

    scrapedPosts.push({
      link,
      content: postData.content,
      images: postData.images
    });
  }

  fs.writeFileSync(
    'linkedInScrapedPosts.json',
    JSON.stringify(scrapedPosts, null, 2)
  );

  console.log(scrapedPosts);

  await browser.close();
})();
