I scraped social media platforms and built an api with it, cause why not 🤷‍♂️

Alaa Desouky ・6 min read

First of all, I'm by no means a professional software engineer, so this won't be the cleanest code you'll see. I'm using this blog post to document my coding process, share my thoughts and the approaches I took to solve problems, and get feedback on what I did right or wrong.

The inspiration for this project came from Wes Bos's Twitter and Instagram scraping project.

You can find the repo here: status-scraper

So, what does it do exactly?

It's an API that accepts a social media flag and a username and returns the user's status (e.g. number of followers, following, posts, likes, etc.).

Endpoint is /scrape/:flag/:username, and currently the :flag can be any of the following:

  • t => twitter.com
  • r => reddit.com
  • g => github.com
  • b => behance.net
  • q => quora.com
  • i => instagram.com
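
The flag list above could be sketched as a plain lookup object. This is only an illustration: `SITES` and `resolveFlag` are hypothetical names, not taken from the repo, where the router shown later appears to declare one route per site instead.

```javascript
// Hypothetical lookup table built from the flag list above.
const SITES = {
  t: "twitter.com",
  r: "reddit.com",
  g: "github.com",
  b: "behance.net",
  q: "quora.com",
  i: "instagram.com"
};

// Returns the site for a flag, or null when the flag is not supported.
function resolveFlag(flag) {
  return Object.prototype.hasOwnProperty.call(SITES, flag) ? SITES[flag] : null;
}

console.log(resolveFlag("g")); // github.com
console.log(resolveFlag("x")); // null
```

Using `hasOwnProperty` keeps inherited keys such as `toString` from being treated as valid flags.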

So, a call for https://statusscraperapi.herokuapp.com/scrape/t/mkbhd would return the following response:

{
 user: "mkbhd",
 status: {
  twitterStatus: {
  tweets: "45,691",
  following: "339",
  followers: "3,325,617",
  likes: "25,255"
  }
 }
}

Tech used

  • Node
  • esm, an ECMAScript module loader
  • Express
  • Axios
  • Cheerio

Server configuration

// lib/server.js
import app from "./app";

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));


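// lib/app.js
import express from "express";
import cors from "cors";
import helmet from "helmet";
import { Routes } from "./routes/router";

class App {
  constructor() {
    this.app = express();
    this.config();
    // register all routes on the express instance
    this.routePrv = new Routes().routes(this.app);
  }

  // apply middleware: CORS headers and helmet's security headers
  config() {
    this.app.use(cors());
    this.app.use(helmet());
  }
}

export default new App().app;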

Project structure

The app has three modules:

Module 1 - Router:

// lib/routes/router.js
import Counter from "../scraper/counter";

// all routes have the same structure
export class Routes {
  routes(app) {
    ....
    // @route  GET /scrape/g/:user
    // @desc   log github user status
    app.get("/scrape/g/:user", async (req, res) => {
      const user = req.params.user;
      try {
        const githubStatus = await Counter.getGithubCount(
          `https://github.com/${user}`
        );
        res.status(200).send({ user, status: { githubStatus } });
      } catch (error) {
        res.status(404).send({
          message: "User not found"
        });
      }
    });
    ...
  }
}

Module 2 - Counter:

  • Acts as a middleware between the route and the actual scraping.
  • It gets the html page and passes it to the scraper module.

// lib/scraper/counter.js
class Counter extends Scraper {
  ...
  // Get github count
  async getGithubCount(url) {
    const html = await this.getHTML(url);
    const githubCount = await this.getGithubStatus(html);
    return githubCount;
  }
  ...
}

export default new Counter();

Module 3 - Scraper:

It's where all the work is done, and I'll be explaining each social network approach.
Let's start.

Twitter

Twitter's response has multiple <a> elements that contain all the data we want, and each one looks like this:

<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" title="70 Tweets" data-nav="tweets" tabindex=0>
  <span class="ProfileNav-label" aria-hidden="true">Tweets</span>
  <span class="u-hiddenVisually">Tweets, current page.</span>
  <span class="ProfileNav-value"  data-count=70 data-is-compact="false">70</span>
</a>

The class ProfileNav-stat--link is unique for these elements.
With cheerio, we can simply get all <a> with the class, loop through them, and extract the data of the title attribute.
Now we have "70 Tweets", just split it and store as a key-value pair.

// lib/scraper/scraper.js

// Get twitter status
async getTwitterStatus(html) {
  try {
    const $ = cheerio.load(html);
    let twitterStatus = {};
    $(".ProfileNav-stat--link").each((i, e) => {
      if (e.attribs.title !== undefined) {
        let data = e.attribs.title.split(" ");
        twitterStatus[data[1].toLowerCase()] = data[0];
      }
    });
    return twitterStatus;
  } catch (error) {
    return error;
  }
}
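
To see the split step on its own, here is a small sketch that runs without Cheerio. `parseStatTitle` is a hypothetical helper name, and the sample titles are reconstructed from the example response at the top of the post, assuming every title has the form "<count> <label>".

```javascript
// Assumes every title looks like "<count> <label>", e.g. "70 Tweets".
function parseStatTitle(title) {
  const [count, label] = title.split(" ");
  return { [label.toLowerCase()]: count };
}

// Sample titles (reconstructed, not scraped live).
const titles = [
  "45,691 Tweets",
  "339 Following",
  "3,325,617 Followers",
  "25,255 Likes"
];

// Merge the single-key objects into one status object.
const twitterStatus = Object.assign({}, ...titles.map(parseStatTitle));
console.log(twitterStatus);
```

Note that the counts stay strings with their thousands separators, which is why the API response shows values like "3,325,617".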

Reddit

Reddit user page has a <span id="profile--id-card--highlight-tooltip--karma"> on the right side with user's total karma, so it's very easy to get. But when hovered over, it displays post/comment karma.

Reddit response has a <script id="data"> that contains these two pieces of data nested inside an object.

window.___r = {"accountManagerModalData":....
...."sidebar":{}}}; window.___prefetches = ["https://www....};

Just extract the <script> data and parse 'em into json. But we need to get rid of window.___r = at the start, ; window.___prefetches.... at the end and everything after it.

This could be the laziest/worst thing ever :D
I split on " = ", counted the number of characters from that ; to the end of the string (using a web app, of course), and sliced them off. Now I have a pure object in a string.

// lib/scraper/scraper.js

  // Get reddit status
  async getRedditStatus(html, user) {
    try {
      const $ = cheerio.load(html);
      const totalKarma = $("#profile--id-card--highlight-tooltip--karma").html();

      const dataInString = $("#data").html().split(" = ")[1];
      const pageObject = JSON.parse(dataInString.slice(0, dataInString.length - 22));
      const { commentKarma, postKarma } = pageObject.users.models[user];

      return { totalKarma, commentKarma, postKarma };
    } catch (error) {
      return error;
    }
  }
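
Counting characters breaks as soon as the trailing part changes length. A slightly less brittle sketch, assuming the script text still has the `window.___r = {...}; window.___prefetches = [...]` shape shown above, is to cut at the markers instead. `extractPageObject` is a hypothetical name and the sample data is made up.

```javascript
// Cuts the JSON out of the script text by locating the two markers
// instead of relying on a fixed character count.
function extractPageObject(scriptText) {
  const startMarker = "window.___r = ";
  const endMarker = "; window.___prefetches";
  const start = scriptText.indexOf(startMarker);
  const end = scriptText.lastIndexOf(endMarker);
  if (start === -1 || end === -1 || end < start) {
    throw new Error("Unexpected script format");
  }
  return JSON.parse(scriptText.slice(start + startMarker.length, end));
}

// Made-up sample that mimics the structure of the real script.
const sample =
  'window.___r = {"users":{"models":{"someuser":{"commentKarma":10,"postKarma":5}}}}' +
  '; window.___prefetches = ["https://example.com/a.js"];';

const { commentKarma, postKarma } = extractPageObject(sample).users.models.someuser;
console.log(commentKarma, postKarma); // 10 5
```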

Linkedin

It responded with status code 999! Like, really, LinkedIn?

I tried sending a request with customized headers, which worked for everyone on Stack Overflow, but it did not work for me. Does it have something to do with the csrf-token? I'm not really sure.
Anyways, that was a dead-end, moving on to the next one.

Github

This one was fairly easy: there are five <span class="Counter"> elements that display the number of repositories, stars, etc. Loop through 'em to extract the data, and with Cheerio I can get the element's parent, which is an <a> that has what these numbers represent. Store 'em as key-value pairs and we're ready to go.

// lib/scraper/scraper.js

  // Get github status
  async getGithubStatus(html) {
    try {
      const $ = cheerio.load(html);
      const status = {};
      $(".Counter").each((i, e) => {
        status[e.children[0].parent.prev.data.trim().toLowerCase()] = e.children[0].data.trim();
      });
      return status;
    } catch (error) {
      return error;
    }
  }

Behance

Also an easy one: a <script id="beconfig-store_state"> that has an object with all the required data. Parse it into JSON and extract them.
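
A rough sketch of that idea, not the repo's actual code and assuming the script body is plain JSON. With Cheerio the string would come from something like `$("#beconfig-store_state").html()`; a regular expression is swapped in here so the snippet runs without dependencies. `getScriptJson` and the sample fields are hypothetical, since the real object's shape isn't described in this post.

```javascript
// Finds <script id="..."> in an HTML string and parses its body as JSON.
// Regex-based stand-in for a Cheerio lookup; fine for a sketch, fragile
// for arbitrary HTML.
function getScriptJson(html, id) {
  const pattern = new RegExp(
    `<script[^>]*\\bid="${id}"[^>]*>([\\s\\S]*?)</script>`
  );
  const match = html.match(pattern);
  if (!match) {
    throw new Error(`No <script id="${id}"> found`);
  }
  return JSON.parse(match[1]);
}

// Made-up page; the field names are placeholders.
const sampleHtml =
  '<html><body><script type="application/json" id="beconfig-store_state">' +
  '{"profile":{"followers":12}}' +
  "</script></body></html>";

const state = getScriptJson(sampleHtml, "beconfig-store_state");
console.log(state.profile.followers); // 12
```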

Youtube - you broke my heart

Youtube's response is a huge mess: it has a bunch of <script> tags that don't have any ids or classes. I wanted to get the channel's number of subscribers and total video views; both can be found in the About tab.

The desired <script> is similar to the Reddit one, so I could use the same split, slice, parse thing and be done.

But, these two simple numbers are nested like 12 levels deep within the object and there are arrays involved, it's basically hell.

So, I wrote a little helper function that accepts the large JSON/object and the object key to be extracted, and it returns an array of all matches.

// lib/_helpers/getNestedObjects.js

export function getNestedObjects(dataObj, objKey) {
  // initialize an empty array to store all matched results
  let results = [];
  getObjects(dataObj, objKey);

  function getObjects(dataObj, objKey) {
    // loop through the key-value pairs on the object/json.
    Object.entries(dataObj).forEach(entry => {
      const [key, value] = entry;
      // check if the current key matches the required key.
      if (key === objKey) {
        results = [...results, { [key]: value }];
      }

      // check if the current value is an object/array.
      // if the current value is an object, call the function again.
      // if the current value is an array, loop through it, check for an object, and call the function again.
      if (Object.prototype.toString.call(value) === "[object Object]") {
        getObjects(value, objKey);
      } else if (Array.isArray(value)) {
        value.forEach(val => {
          if (Object.prototype.toString.call(val) === "[object Object]") {
            getObjects(val, objKey);
          }
        });
      }
    });
  }

  // return an array of all matches, or return "no match"
  if (results.length === 0) {
    return "No match";
  } else {
    return results;
  }
}
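
A quick way to try the helper on made-up data. The snippet carries a condensed copy of the function so it runs on its own; the sample object and its field names are invented for illustration and are not Youtube's real structure.

```javascript
// Condensed copy of getNestedObjects, same logic as above.
function getNestedObjects(dataObj, objKey) {
  const results = [];
  const isPlainObject = v =>
    Object.prototype.toString.call(v) === "[object Object]";

  (function walk(obj) {
    Object.entries(obj).forEach(([key, value]) => {
      if (key === objKey) results.push({ [key]: value });
      if (isPlainObject(value)) {
        walk(value);
      } else if (Array.isArray(value)) {
        // only objects directly inside the array are visited
        value.filter(isPlainObject).forEach(v => walk(v));
      }
    });
  })(dataObj);

  return results.length === 0 ? "No match" : results;
}

// Invented sample with the key buried at different depths.
const page = {
  header: { stats: [{ subscriberCount: "1.2M" }, { viewCount: "300M" }] },
  tabs: [{ about: { stats: { viewCount: "300M" } } }]
};

const views = getNestedObjects(page, "viewCount");
console.log(views.length); // 2
console.log(getNestedObjects(page, "missing")); // No match
```

One thing the run makes visible: a miss comes back as the string "No match" rather than an empty array, so callers have to check the type before destructuring.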

As thrilled as I was that getNestedObjects actually works (try it), the joy didn't last long.
Somehow the received html didn't contain that <script>, and I have no idea why. I checked whether it had the numbers at all, but that was a dead end too.
Thanks, youtube.

Quora

The response has multiple <span class="list_count"> elements, so it's handled exactly the same way as Github.

Instagram

The response literally has a problem from each one above:

  • ✅ Multiple <script> tags with the same type="text/javascript"
  • ✅ split, slice, parse
  • ✅ The numbers are nested very deep within the object

  // Get instagram status
  async getInstagramStatus(html) {
    try {
      const $ = cheerio.load(html);
      // get the script containing the data
      let script;
      $('script[type="text/javascript"]').each((i, e) => {
        if (e.children[0] !== undefined && e.children[0].data.includes("window._sharedData =")) {
          return (script = e.children[0].data);
        }
      });

      // get JSON-format string
      const dataInString = script.split(" = ")[1];

      // parse into an object (the slice drops the last character)
      const pageObject = JSON.parse(dataInString.slice(0, dataInString.length - 1));

      // extract objects with status
      const [{ edge_followed_by }] = getNestedObjects(pageObject, "edge_followed_by");
      const [{ edge_follow }] = getNestedObjects(pageObject, "edge_follow");
      const [{ edge_owner_to_timeline_media }] = getNestedObjects(pageObject, "edge_owner_to_timeline_media");

      return {
        followers: edge_followed_by.count,
        following: edge_follow.count,
        posts: edge_owner_to_timeline_media.count
      };
    } catch (error) {
      return error;
    }
  }

At least I got to use the helper.

Wrapping up

This was a cool project to make and I've learned a lot of stuff building it.
I've also created a frontend app with React and Next that interacts with the API. You can view it here: Status Logger
Maybe I'll write a blog post for it later.

In the meantime, feel free to share your opinion about it, good or bad. Also, let me know if you have any other social media networks I should scrape.

Discussion

I have an error in the server.js, syntax error in the import app from "./app"