Luke Hagar

Scraping Archives of Nethys for fun and profit

I have been getting into Pathfinder lately, and one frustration I have is that the character creators are all closed source, or just messy and clumsy.

I'm a developer, so I wanted to take a crack at building a better solution, but I can't hold a candle to the data sets everyone else seems to already have at the ready.

So my solution is to pull this data from the best resource available, which currently seems to be the Archives of Nethys.

Digging In

Let's start on the Ancestries page. Step one of figuring out how to get that sweet, sweet data is to hit F12.

And shimmy on over to the Network tab.

I would suggest filtering to only the Fetch/XHR requests; there is typically an easy option for this.

With this menu open, I just started looking around.

Oh, wait, WHAT'S THIS????

When I hover over the Elf ancestry, the card that pops up appears to make a POST call.

Let's look at that POST body.

Alright, so when this card opened, it made a multi-get POST call to an Elasticsearch instance for the site.

We can certainly work with that.

Now to the fun part

Now to the code

Alright, so thankfully AON (Archives of Nethys) is using Elasticsearch! While that may seem like overkill for this kind of use case, it makes our goal much easier to accomplish.

Setup

In an effort to make this easier for other folks to use, I am centralizing the config into one file and will reference these values elsewhere.

export const config = {
  // These values should be static, and tell the scraper 
  // how to access the AON elastic instance.
  root: "https://elasticsearch.aonprd.com/",
  index: "aon",

  // Comment out any targets you do not want to scrape. 
  // This is just a cursory list I pulled from exploring the site
  targets: [
    "action",
    "ancestry",
    "archetype",
    "armor",
    "article",
    "background",
    "class",
    "creature",
    "creature-family",
    "deity",
    "equipment",
    "feat",
    "hazard",
    "rules",
    "skill",
    "shield",
    "spell",
    "source",
    "trait",
    "weapon",
    "weapon-group",
  ],
};
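
Before wiring up the SDK, a quick sanity check is possible with nothing but fetch. This snippet is my own addition, not part of the original scraper; it assumes Node 18+ (for the global fetch) and simply reuses the root and index values from the config above to ask for a single "ancestry" document.

import { config } from "./config";

// Sanity check: POST a minimal search straight at the public instance
// and print the total hit count plus the first entry's name.
async function sanityCheck() {
  const response = await fetch(`${config.root}${config.index}/_search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      size: 1,
      query: { match: { category: "ancestry" } },
    }),
  });

  const body = await response.json();
  console.log(body.hits.total, body.hits.hits[0]?._source?.name);
}

sanityCheck().catch(console.error);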


Now that we have the Elastic instance info populated, let's set up the SDK:

import { Client } from "@elastic/elasticsearch";
import fs from "fs";
import path from "path";
import { config } from "./config";
import sanitize from "sanitize-filename";

const client = new Client({
  node: config.root,
});

We now have a client object that can easily perform search operations against the Elasticsearch instance.
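
As an optional extra check (again my own addition, not something from the original post), the client also exposes an info() call that can confirm the instance is reachable before we start looping over every category:

// Optional: verify the client can reach the AON instance and log
// which Elasticsearch version it is running.
async function checkConnection() {
  const info = await client.info();
  console.log(`Connected to ${info.name} (Elasticsearch ${info.version.number})`);
}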

The next portion of the code is going to utilize await, so I will wrap it in an async function.

Additionally, I want to loop through the targets I supply in my config object.

Lastly, while looping through the targets, I want to run an Elasticsearch query for each one:

export async function retrieveTargets() {
  for (const target of config.targets.sort()) {
    const search = await client.search({
      index: config.index,
      from: 0,
      size: 10000,
      query: { match: { category: target } },
    });
  }
}

The nice thing about Elasticsearch is that the default maximum result window is 10,000, which for our use case means we can get all of the results at once.
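
None of the current categories come close to that limit, but if one ever did, paging is still possible. Below is a hedged sketch (my own addition, not from the repo) of walking a category in chunks with search_after; the _doc sort and the retrieveAllHits name are assumptions, since ordering does not matter for a bulk export.

// Sketch: page through a category in 1,000-hit chunks using search_after,
// for the hypothetical case where a category outgrows the 10,000-result window.
async function retrieveAllHits(target: string) {
  const hits: any[] = [];
  let searchAfter: any[] | undefined;

  while (true) {
    const page = await client.search({
      index: config.index,
      size: 1000,
      query: { match: { category: target } },
      sort: ["_doc"], // cheap ordering; good enough when we only want everything
      ...(searchAfter ? { search_after: searchAfter } : {}),
    });

    const pageHits = page.hits.hits;
    if (pageHits.length === 0) break;

    hits.push(...pageHits);
    searchAfter = pageHits[pageHits.length - 1].sort; // cursor for the next page
  }

  return hits;
}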

Now that we have our full result set for each target, we want to save that data locally for processing.

Below is the full function, with a log statement added for progress and counts, a tracking object to record which targets succeeded, and some simple fs operations to save the full array as a local file.

// Tracks which targets were scraped successfully so the later
// parse and sort steps can skip anything that failed.
const tracking: Record<string, boolean> = {};

export async function retrieveTargets() {
  for (const target of config.targets.sort()) {
    try {
      const search = await client.search({
        index: config.index,
        from: 0,
        size: 10000,
        query: { match: { category: target } },
      });

      console.log({
        action: "saving",
        target,
        count: search?.hits?.total?.value,
      });

      fs.mkdirSync(path.join(__dirname, "raw"), {
        recursive: true,
      });
      fs.writeFileSync(
        path.join(__dirname, "raw", `${target}.json`),
        JSON.stringify(search?.hits?.hits)
      );
      tracking[target] = true;
    } catch (err) {
      console.error(err);
      tracking[target] = false;
    }
  }
}

Now the full array is saved locally, albeit in a pretty specific format, and that's not great for my desired use case.

So let's do some processing. The next async function is very simple: it reads in the local files using the same logic they were saved with, changes the shape of the objects in the array, and then saves the new array in a parsed folder.

export async function parseTargets() {
  for (const target of config.targets.sort()) {
    console.log({ action: "parsing", target });

    if (tracking[target] === true) {
      const raw = JSON.parse(
        fs.readFileSync(path.join(__dirname, "raw", `${target}.json`), "utf-8")
      );
      const parsed = raw.map((entry) => entry._source);
      try {
        fs.mkdirSync(path.join(__dirname, "parsed"), {
          recursive: true,
        });

        fs.writeFileSync(
          path.join(__dirname, "parsed", `${target}.json`),
          JSON.stringify(parsed, null, " ")
        );
      } catch (err) {
        console.error(err);
      }
    }
  }
}

Now this data is all in a good state for me to programmatically upload it to a database. But I want to take things one step further just in case it helps someone else who comes along.

I'm going to split the now parsed data up further into individual JSON files instead of one large array.

Below is that last async function. It reads in the parsed files the same way we did during parsing, and then just iterates through the array and saves each entry under its given name value.

It is worth noting that this is why I included the sanitize-filename package: some of the names in the entries from AON are rather abnormal, and sanitizing lets us save the files as-is and worry about all that jazz later.
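
To make that concrete before the function itself, here are a couple of hypothetical names (not real AON entries) run through sanitize:

// Hypothetical examples: slashes and quotes are not legal in filenames,
// and sanitize-filename simply strips them out.
sanitize("Detect Alignment (Good/Evil)"); // "Detect Alignment (GoodEvil)"
sanitize('Sword of "Truth"');             // "Sword of Truth"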

export async function sortTargets() {
  for (const target of config.targets.sort()) {
    console.log({ action: "sorting", target });

    if (tracking[target] === true) {
      const parsed = JSON.parse(
        fs.readFileSync(path.join(__dirname, "parsed", `${target}.json`), "utf-8")
      );
      for (const entry of parsed) {
        try {
          fs.mkdirSync(path.join(__dirname, "sorted", `${target}`), {
            recursive: true,
          });

          fs.writeFileSync(
            path.join(
              __dirname,
              "sorted",
              `${target}`,
              `${sanitize(entry?.name)}.json`
            ),
            JSON.stringify(entry, null, " ")
          );
        } catch (err) {
          console.error(err);
        }
      }
    }
  }
}
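
To tie the three steps together, a minimal runner (my own sketch; the actual repo may wire this up differently) just calls them in order:

// Minimal entry point: scrape the raw hits, reshape them,
// then split them into one file per entry.
async function main() {
  await retrieveTargets();
  await parseTargets();
  await sortTargets();
}

main().catch(console.error);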

And there we are!

We now have a plethora of raw data for use later on.

I'll make another post in the future documenting the process I use to upload this data to PocketBase.

For those interested, here is the repo containing all of the code, plus the raw, parsed, and sorted JSON data.

Top comments (1)

Jayboy75

Well shucks, too bad they're not using Elastic Search to serve the 1st Edition data. I spun up your scraper before realising the 1e section of the site is in ASP.