DEV Community

loading...

Reddit mass scraping via API

patarapolw profile image Pacharapol Withayasakpunt ・3 min read

First of all, there are rules to obey; if you don't want to get 429 or banned.

https://github.com/reddit-archive/reddit/wiki/API

I am using Node.js, where parallel async is easy; but the general rules should be applicable to any programming languages.

I am playing with Reddit while my Twitter requires official allowing...

Getting the OAuth token

You will need client ID and client secret from https://www.reddit.com/prefs/apps

app pref

  • Client ID is below "Personal-use script".
  • Secret is red underlined.
import axios from "axios";
import fs from "fs";

async function getToken(): Promise<{
  access_token: string;
}> {
  return axios
    .post(
      "https://www.reddit.com/api/v1/access_token",
      "grant_type=client_credentials",
      {
        headers: {
          Authorization: `Basic ${Buffer.from(
            `${process.env.REDDIT_CLIENT_ID}:${process.env.REDDIT_CLIENT_SECRET}`
          ).toString("base64")}`,
          "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
        },
        params: {
          scope: "read",
        },
      }
    )
    .then((r) => r.data);
}

if (require.main === module) {
  getToken().then((data) =>
    fs.writeFileSync("token.json", JSON.stringify(data))
  );
}
Enter fullscreen mode Exit fullscreen mode

The real access token is in data.access_token.

Exploring the API

I recommend exploring the API on Postman. I feel it is more convenient than Insomnia, or just cURL with Terminal.

You don't have to login, to the website, nor the app. I feel that logging in is very annoying; but I can't find an alternative.

Another way to test, is go to any Reddit suburls on Firefox, and replace www.reddit.com with api.reddit.com.

Real scraping, while avoiding the troubles.

import axios from "axios";
import rateLimit from "axios-rate-limit";

const api = rateLimit(
  axios.create({
    baseURL: "https://oauth.reddit.com",
    headers: {
      Authorization: `Bearer ${
        JSON.parse(fs.readFileSync("token.json", "utf-8")).access_token
      }`,
    },
  }),
  {
    /**
     * Clients connecting via OAuth2 may make up to 60 requests per minute.
     */
    maxRequests: 60,
  }
);
Enter fullscreen mode Exit fullscreen mode

Helper function

declare global {
  interface Array<T> {
    mapAsync<U>(
      callbackfn: (value: T, index: number, array: T[]) => Promise<U>,
      thisArg?: any
    ): Promise<U[]>;
  }
}

Array.prototype.mapAsync = async function (callbackfn, thisArg) {
  return Promise.all(this.map(callbackfn, thisArg));
};

function dotProp<R>(o: any, p: string | string[], def?: R): R {
  if (typeof o === "undefined") {
    return def!;
  }

  const ps = typeof p === "string" ? p.split(".") : p;

  if (!ps.length) {
    return o;
  }

  if (o && typeof o === "object") {
    if (Array.isArray(o)) {
      return dotProp(o[parseInt(ps[0])], ps.slice(1), def);
    }

    return dotProp(o[ps[0]], ps.slice(1), def);
  }

  return def!;
}
Enter fullscreen mode Exit fullscreen mode

Make use of Async Iterator

Of course, you can also send 1010 requests at once, but that would not only make unpredictable response time, but also get blocked.

function iterListing(apiPath = "/hot", count = 1000) {
  const limit = 50;
  const maxDepth = Math.ceil(count / limit);

  return {
    [Symbol.asyncIterator]() {
      return {
        depth: 0,
        after: "",
        async next() {
          if (!this.after && this.depth) {
            return { done: true };
          }

          if (this.depth < maxDepth) {
            this.depth++;

            const value = await api
              .get(apiPath, {
                params: {
                  after: this.after,
                  limit,
                },
              })
              .then((r) => {
                this.after = dotProp<string>(r, "data.data.after");
                console.log(this.depth, this.after);

                return dotProp<any[]>(r, "data.data.children", []).mapAsync(
                  async ({ data: { name } }) => {
                    return api
                      .get("/comments/" + name.split("_")[1])
                      .then((r) => {
                        const getComment = ({ data: { body = "", replies } }) =>
                          body +
                          "\n" +
                          (replies
                            ? dotProp<any[]>(replies, "data.children")
                                .map((r) => getComment(r))
                                .join("\n")
                            : "");

                        return `${dotProp(
                          r,
                          "data.0.data.children.0.data.title",
                          ""
                        )}\n${dotProp(
                          r,
                          "data.0.data.children.0.data.selftext",
                          ""
                        )}\n${dotProp<any[]>(r, "data.1.data.children", [])
                          .map((r) => getComment(r))
                          .join("\n")}`;
                      });
                  }
                );
              });

            return {
              done: false,
              value,
            };
          }

          return {
            done: true,
          };
        },
      };
    },
  };
}
Enter fullscreen mode Exit fullscreen mode

Don't write everything to file at once in Node.js

Learn to use stream. Stream is a very powerful concept in Node.js.

async function main() {
  const outStream = fs.createWriteStream("raw/reddit.txt", {
    encoding: "utf-8",
  });

  try {
    for await (const out of iterListing()) {
      if (out) {
        out.map((it) => outStream.write(it + "\n"));
      }
    }
  } catch (e) {
    console.error(e.response || e);
  }

  outStream.close();
}

if (require.main === module) {
  main();
}
Enter fullscreen mode Exit fullscreen mode

Discussion (1)

pic
Editor guide
Collapse
joyshaheb profile image
Joy Shaheb

Genius you are ! 🎖️