Rehan Sayyed
I Tried Scraping LinkedIn Posts with Puppeteer and This Is What Actually Worked

I thought this would take 20 minutes.

Open Puppeteer.
Grab some text.
Save it to JSON. Done.

Instead, LinkedIn reminded me that modern web apps are not just pages.

They are systems.
They react.
They delay.
They break your assumptions.

And suddenly, a simple script turns into a late night debugging session.

This is what actually worked.


🚀 Setting up Puppeteer

We start simple.

const puppeteer = require('puppeteer');
const fs = require('fs');

  • Puppeteer controls the browser
  • fs writes the scraped data to disk

Clean. Minimal. Enough to begin.

Then we wrap everything so we can use async and await properly.

(async () => {

๐ŸŒ Launching a real browser

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

I kept headless mode off on purpose.

Because LinkedIn is very sensitive to automation.

Headless browsers get flagged faster.
A visible browser behaves more like a real user.

And when you are debugging, visibility is everything.
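If you want to lean further into looking like a real user, Puppeteer's launch options give you a few more knobs. The sketch below is my own suggestion, not part of the original script: `defaultViewport` and `slowMo` are standard Puppeteer options, and the Chromium flag hides one common automation fingerprint (it is not a detection bypass).

```javascript
// Sketch: launch options that make automation less obvious.
// They reduce some fingerprints; they do not defeat detection.
const launchOptions = {
  headless: false,
  defaultViewport: null, // keep the real window size instead of 800x600
  slowMo: 50,            // add a small delay to every Puppeteer action
  args: ['--disable-blink-features=AutomationControlled']
};

// usage: const browser = await puppeteer.launch(launchOptions);
```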


๐Ÿ” Logging into LinkedIn

await page.goto('https://www.linkedin.com/login', {
  waitUntil: 'networkidle2',
  timeout: 60000
});

await page.type('#username', 'xxxx');
await page.type('#password', 'xxxxxx');
await page.click('button[type="submit"]');
await page.waitForNavigation({ timeout: 60000 });

At this point, everything felt done.

I thought I was finished.
I was not.

LinkedIn often throws:

  • security checks
  • captchas
  • verification screens

So even after login, we pause.


โณ Handling the security check

console.log('Please complete the security check...');
await new Promise(resolve => setTimeout(resolve, 12000));

No hacks. No bypass.

Just wait.

Sometimes the most reliable solution
is the least clever one.
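If you keep the fixed wait, a tiny helper with random jitter makes repeated pauses look less machine-precise. The helper and the jitter range are my own sketch, not part of the original script:

```javascript
// Sleep for ms plus a random jitter, so pauses are not suspiciously exact.
function sleep(ms, jitterMs = 0) {
  const delay = ms + Math.floor(Math.random() * (jitterMs + 1));
  return new Promise(resolve => setTimeout(resolve, delay));
}

// usage: await sleep(12000, 3000); // waits between 12 and 15 seconds
```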


🔗 Providing the post links

const postLinks = [ ... ];

Instead of crawling everything, I used a fixed list of URLs.

Why?

Predictable input gives predictable output.

LinkedIn's infinite scroll can get messy very quickly.


๐Ÿ” Visiting each post

for (const link of postLinks) {
  await page.goto(link, {
    waitUntil: 'domcontentloaded',
    timeout: 60000
  });

Notice I used domcontentloaded.

LinkedIn keeps making background requests forever.
Waiting for network idle can hang your script.

This small change
saved a lot of frustration.
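Related hardening I would consider here (my own addition, not from the original script): make sure one dead link or timeout skips that post instead of crashing the whole run. A small generic helper captures the pattern:

```javascript
// Run an async task for each item, collecting failures instead of throwing.
// Hypothetical helper, not part of the original script.
async function forEachSafe(items, task) {
  const failed = [];
  for (const item of items) {
    try {
      await task(item);
    } catch (err) {
      console.error(`Skipping ${item}: ${err.message}`);
      failed.push(item);
    }
  }
  return failed;
}

// usage with the scraper:
// const failed = await forEachSafe(postLinks, async link => {
//   await page.goto(link, { waitUntil: 'domcontentloaded', timeout: 60000 });
//   // ... scroll and extract ...
// });
```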


📜 Scrolling to load content

This is where things started breaking.

await page.evaluate(async () => {
  await new Promise(resolve => {
    let lastHeight = document.body.scrollHeight;
    let idleTicks = 0;

    const scrollInterval = setInterval(() => {
      window.scrollBy(0, 1000);

      if (document.body.scrollHeight > lastHeight) {
        // New content loaded; keep scrolling.
        lastHeight = document.body.scrollHeight;
        idleTicks = 0;
      } else if (++idleTicks >= 5) {
        // No growth for several ticks: nothing new is loading.
        clearInterval(scrollInterval);
        resolve();
      }
    }, 300);
  });
});

LinkedIn loads content lazily.

No scroll.
No content.

So we simulate a real user:

Scroll
Pause
Repeat

Until nothing new loads.


๐Ÿ–ผ๏ธ Waiting for images

try {
  await page.waitForSelector('img[src*="media.licdn.com"]', { timeout: 15000 });
} catch {
  // Text-only posts never load a media image; move on without one.
}

Without this wait, image URLs can be missing from your data.

It pauses until an image from LinkedIn's media CDN appears in the DOM, so extraction does not run before the post image exists.


🧠 Extracting content and images

const postData = await page.evaluate(() => {
  const contentElement =
    document.querySelector('[data-test-post-container] .break-words') ||
    document.querySelector('.feed-shared-update-v2__description') ||
    document.querySelector('.feed-shared-text__text-view') ||
    document.querySelector('span.break-words');

  const images = [];
  // Skip index 0: the first media.licdn.com image is usually the author's avatar.
  const imageElements = document.querySelectorAll('img[src*="media.licdn.com/"]');

  if (imageElements.length > 1) {
    images.push(imageElements[1].src);
  }

  return {
    content: contentElement
      ? contentElement.innerText.trim()
      : 'No content found',
    images: images.length > 0 ? images : ['No post image found']
  };
});

This is real-world scraping.

There is no single selector that always works.

So we try multiple fallbacks until something works.

Also:

  • first image โ†’ profile
  • second image โ†’ actual post

Not perfect.
But practical.
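The `||` fallback chain works, but it gets unwieldy as selectors pile up. A small helper (my own sketch, not part of the original script) makes the same idea reusable; it takes a query function, so inside `page.evaluate` you would pass `s => document.querySelector(s)`:

```javascript
// Return the first truthy result from trying each selector in order.
function firstMatch(query, selectors) {
  for (const sel of selectors) {
    const el = query(sel);
    if (el) return el;
  }
  return null;
}

// Example with a fake lookup table standing in for the DOM:
const fakeDom = { '.feed-shared-text__text-view': 'post text' };
const hit = firstMatch(s => fakeDom[s] ?? null, [
  '[data-test-post-container] .break-words',
  '.feed-shared-text__text-view'
]);
console.log(hit); // → 'post text'
```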


📦 Storing the data

scrapedPosts.push({
  link,
  content: postData.content,
  images: postData.images
});

We collect everything step by step.

Simple. Clean. Reliable.


💾 Saving it to a file

fs.writeFileSync(
  'linkedInScrapedPosts.json',
  JSON.stringify(scrapedPosts, null, 2)
);

Now everything is structured and ready to use.


🧩 Final thoughts

This script is not perfect.

But it is real.

It reflects how scraping actually works:

You try something
It breaks
You tweak
You retry

And slowly, it starts working.

The real lesson?

Web pages are not static
They are systems reacting to user behavior

Once you understand that, everything changes.


โš ๏ธ A small note

Always respect platform rules and use scraping responsibly.


🧾 Full working script

If you just want it to work, copy this and run it.

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto('https://www.linkedin.com/login', {
    waitUntil: 'networkidle2',
    timeout: 60000
  });

  await page.type('#username', 'YOUR_EMAIL');
  await page.type('#password', 'YOUR_PASSWORD');
  await page.click('button[type="submit"]');
  await page.waitForNavigation({ timeout: 60000 });

  console.log('Complete the security check if prompted...');
  await new Promise(resolve => setTimeout(resolve, 12000));

  const postLinks = [
    // add your links here
  ];

  const scrapedPosts = [];

  for (const link of postLinks) {
    await page.goto(link, {
      waitUntil: 'domcontentloaded',
      timeout: 60000
    });

    await page.evaluate(async () => {
      await new Promise(resolve => {
        let lastHeight = document.body.scrollHeight;
        let idleTicks = 0;

        const scrollInterval = setInterval(() => {
          window.scrollBy(0, 1000);

          if (document.body.scrollHeight > lastHeight) {
            // New content loaded; keep scrolling.
            lastHeight = document.body.scrollHeight;
            idleTicks = 0;
          } else if (++idleTicks >= 5) {
            // No growth for several ticks: nothing new is loading.
            clearInterval(scrollInterval);
            resolve();
          }
        }, 300);
      });
    });

    try {
      await page.waitForSelector('img[src*="media.licdn.com"]', { timeout: 15000 });
    } catch {
      // Text-only posts never load a media image; move on without one.
    }

    const postData = await page.evaluate(() => {
      const contentElement =
        document.querySelector('[data-test-post-container] .break-words') ||
        document.querySelector('.feed-shared-update-v2__description') ||
        document.querySelector('.feed-shared-text__text-view') ||
        document.querySelector('span.break-words');

      const images = [];
      const imageElements = document.querySelectorAll('img[src*="media.licdn.com/"]');

      if (imageElements.length > 1) {
        images.push(imageElements[1].src);
      }

      return {
        content: contentElement
          ? contentElement.innerText.trim()
          : 'No content found',
        images: images.length > 0 ? images : ['No post image found']
      };
    });

    scrapedPosts.push({
      link,
      content: postData.content,
      images: postData.images
    });
  }

  fs.writeFileSync(
    'linkedInScrapedPosts.json',
    JSON.stringify(scrapedPosts, null, 2)
  );

  console.log(scrapedPosts);

  await browser.close();
})();
