
Anden Acitelli


Web Scraping + GPT for Data Validation - Credit Cards

Overview

I run Offer Optimist, a site for comparing sign-up bonuses. I maintain a free API to access the data here. One of the main challenges is keeping this data up-to-date, which I’ve traditionally done through Doctor of Credit and user-sourced reports.

However, this can be difficult and time-intensive, depending on how accurate I want the data to be. I generally promise 80%+ accuracy, but am very clear that it shouldn’t be used for any mission-critical use case. It’s just too time-intensive for me to keep updated through any kind of proactive searching.

I work for Akkio, where I'm building out a no-code predictive AI platform. If you're looking to harness the power of AI without needing a data scientist, give us a try!

Enter, Web Scraping

So, I had the idea to automatically pull down the webpages via web scraping in order to verify accuracy. After all, I had URLs for each of the cards in my dataset, so I took a shot at it.

Note that all of the following is after applying the html-to-text package to the result in order to get rid of all the HTML structure. I’m using ScrapingAnt as a web scraping proxy, which has worked fairly well for me.
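To illustrate that first step, here’s a rough sketch of the HTML-stripping stage. This is a deliberately naive stand-in for the html-to-text package (which handles scripts, styles, and layout far more robustly); the function name and regexes here are illustrative, not the package’s actual internals.

```typescript
// Simplified stand-in for the html-to-text package: drop script/style
// blocks, strip remaining tags, and collapse whitespace so the later
// regex rules operate on plain text.
function htmlToPlainText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ") // strip remaining tags
    .replace(/&nbsp;/g, " ")
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}
```

The real package is a drop-in replacement for this function; everything downstream just assumes it receives plain text.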

First Attempt, Regex

The code ended up looking something like this.

const text = lines
  .map((line) => {
    // Remove substring, not whole line
    return line
      .replaceAll(",", "") // Remove commas, which can interfere with regex finding
      .replaceAll(/spend \$\d+/gi, "") // Remove things that are clearly a spend amount
      .replaceAll(/\d+x/gi, "") // Remove anything that looks like a multiplier
      .replaceAll(/\d%/gi, "") // Remove anything that looks like a percentage
      .replaceAll(/annual.{0,4}\d/gi, "") // Remove anything that looks like it's annual
      .replaceAll(/\d+\.\d+x?/gi, "") // Remove anything that looks like a decimal
      .replaceAll(/up to (a )?\$?\d+/gi, "") // Tends to indicate an offer with several breakpoints; we want to consider the smaller one
      .replaceAll(/[^a-zA-Z0-9\$ ]*/gi, ""); // Strip anything that isn't a letter, digit, $, or space
  })
  .filter((line) => {
    line = line.replaceAll("\n", "").trim();

    // Remove entire lines that clearly don't contain a bonus
    if (!line.length) return false;
    if (line.length < 5 || line.length > 200) {
      logger.debug(
        `Removing line ${line} b/c it's too short or too long.`
      );
      return false;
    }
    if (line.startsWith("--")) {
      logger.debug(`Removing line ${line} b/c it starts with --`);
      return false;
    } // Remove artifact introduced by scraping proxy

    if (!/\$?\d+/g.test(line)) {
      logger.debug(
        `Removing line ${line} b/c it has no distinct numbers in it.`
      );
      return false;
    }

    if (/per|each|every/gi.test(line)) {
      // Indicates something recurring; usually referrals or an earnings rate
      logger.debug(
        `Removing line ${line} b/c it has 'per', 'each', or 'every' in it.`
      );
      return false;
    }

    if (/\d+\/\d+\/\d+/gi.test(line)) {
      logger.debug(`Removing line ${line} b/c it has a date in it.`);
      return false;
    }

    return true;
  });

// Find first regex match; already sorted from highest to lowest specificity
const match = CARD_SUBSTRING_TO_REGEX.reduce<string | undefined>(
  (acc, search) => {
    if (acc) return acc; // Already found our "match"
    for (const line of text) {
      const result = search.f(card, line);
      if (result) return result;
    }
    return acc;
  },
  undefined
);
if (!match) {
  skips.push({
    card,
    reason: `No regex match. Text (${text.length}): ${JSON.stringify(
      text.map((t) => t.substring(0, 200)),
      null,
      2
    )}`,
  });
  return;
}

Lots of very finicky regex rules, as you can see. This worked for maybe 70% of cards, but tended to produce a good chunk of false positives and was ultimately too high-maintenance and rule-based to be worth my time.
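For context on the matching step, each entry in CARD_SUBSTRING_TO_REGEX pairs a card with an extraction function, tried in order from most to least specific. The actual table isn’t shown here; this is a hypothetical two-entry version purely to illustrate the shape (the regexes and the `Card` type are illustrative).

```typescript
// Hypothetical shape of CARD_SUBSTRING_TO_REGEX: each entry exposes an
// `f` that tries to pull a bonus amount out of one line of page text,
// returning undefined when the line doesn't match.
type Card = { name: string };

const CARD_SUBSTRING_TO_REGEX = [
  {
    // More specific: points/miles offers, e.g. "Earn 60,000 bonus points"
    f: (card: Card, line: string): string | undefined => {
      const m = line.match(/earn (\d{1,3}(?:,\d{3})*) bonus (?:points|miles)/i);
      return m?.[1];
    },
  },
  {
    // Less specific: plain dollar bonuses, e.g. "$200 cash back"
    f: (card: Card, line: string): string | undefined => {
      const m = line.match(/\$(\d+) (?:cash back|bonus)/i);
      return m?.[1];
    },
  },
];
```

The `reduce` in the snippet above then just walks this list and stops at the first line any entry matches.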

Second Attempt, GPT

I got an excellent recommendation to look into GPT for this kind of thing. After all, it’s much better at generalizing. The general idea was to scrape the data, then apply whatever pre-processing (regex) I could to minimize the noise (and tokens, because tokens = cost) that GPT would have to sift through. Some rules I applied were:

  • Minimize whitespace (mostly just to save on tokens)
  • Only include lines that have a number somewhere in them
  • Remove anything referencing COVID-19
  • Remove anything that looked like a duration or date

I also applied some site-specific rules; for example, American Express pages tend to include a lot of irrelevant blocks surrounded by [], and other quirks like that.

The specific code is as follows.

const cleaned = text
    .replaceAll(/\[.*]/g, "")
    .replaceAll(/[\n\r]/g, "  ")
    .replaceAll(/\s{2,}/g, " | ")
    .split("|")
    .map((s) => s.trim())
    .filter((s) => /\d/.test(s)) // Must include a number
    .filter((s) => !/covid-19/gi.test(s)) // COVID-19
    .filter((s) => !/\d* seconds/gi.test(s)) // Time
    .filter((s) => !/ 101/gi.test(s)) // XYZ 101 is text that tends to show up in Amex
    .filter((s) => !/2023/gi.test(s)) // Number is the current year
    .filter((s) => !/\d*%/gi.test(s)) // Percentages
    .filter(
      // URLs
      (s) =>
        !/^https?:\/\/(?:www\.)?[\w#%+.:=@~-]{1,256}\.[\d()a-z]{1,6}\b[\w#%&()+./:=?@~-]*$/gi.test(
          s
        )
    )
    .join(" | ");

Once I had the cleaned output, it was time to feed it into the actual prompt.

const prompt = `I am scraping credit card websites to check whether credit card data I have on file is accurate, especially sign up bonus amounts. You are a helpful assistant helping me verify whether my data is still accurate.

      I have the following data on file for this card, which I am providing in JSON format.
      ${JSON.stringify({
        ...card,
        historicalOffers: undefined,
        imageUrl: undefined,
      })}

      I stripped the HTML from the page, so I now just have the raw text. Here it is, with each page "section" separated by " | ". The page will either be specific to this card, or have information on this card. Here's the text:
      ${cleaned}

      If my data is still up to date, please reply ONLY with "Up To Date." If the text resembles some kind of error, please reply ONLY with "Error" and then a brief description of the error. If my data is not up to date, please reply ONLY with details of what is inaccurate. Also feel free to check for any other inaccuracies, such as an incorrect annual fee or an incorrect value for whether the annual fee is waived the first year.`;

I borrowed a few techniques commonly applied in prompt engineering here: I gave the LLM a “role” to assume, and I was very explicit about the output format I wanted from it.
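A prompt like this plugs into OpenAI’s chat completions payload roughly as follows. The model name and the zero temperature are my assumptions here, not something dictated by the prompt; the key points are the system message carrying the “role” and the user message carrying the prompt.

```typescript
// Sketch of the chat completions request body for a prompt like the one
// above. The system message sets the assistant's role; the user message
// carries the card data and scraped text. Model name is illustrative.
function buildChatRequest(prompt: string) {
  return {
    model: "gpt-3.5-turbo",
    temperature: 0, // we want deterministic, parseable answers
    messages: [
      {
        role: "system",
        content:
          "You are a helpful assistant verifying whether credit card data is still accurate.",
      },
      { role: "user", content: prompt },
    ],
  };
}
```

From there it’s a single POST to the chat completions endpoint (or one call through the official SDK) per card.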

How Well Does It Work?

Here’s a sample of the output:

[09:43:23.866] INFO (16328): Getting page text for card BARCLAYS Upromise...
[09:43:27.207] INFO (16328): Got page text for card BARCLAYS Upromise...
[09:43:27.219] INFO (16328): Cleaned text: Get up to $250 in cash back rewards per calendar year on eligible gift card | Earn $100 Bonus Cash Back Rewards. | $0 Fraud Liability Protection. | made within 45 days of account opening.
 After that (and for balance transfers | $0 | $100 BONUS CASH BACK REWARDS | Earn $100 bonus cash back rewards after spending $500 on purchases in the first | 90 days2 | when linked to an eligible College Savings Plan2 | EARN UP T
O $250 IN CASH BACK REWARDS PER YEAR | Get up to $250 in cash back rewards per calendar year on eligible gift card | at MyGiftCardsPlus.com .3 | $0 | annual fee1 | * EARN $100 BONUS CASH BACK REWARDS | Earn $100 bonus cash back r
ewards after spending $500 on purchases in the | first 90 days. | * EARN UP TO $250 IN CASH BACK REWARDS PER YEAR | Get up to $250 in cash back rewards per calendar year on eligible gift card | at MyGiftCardsPlus.com .3 | based o
n the limit you set (from $1 to $500). The total Round Up Amount | is considered a purchase and converted to cash back rewards.2 | on international purchases.1 | * $0 FRAUD LIABILITY PROTECTION | that your score has changed.4 | t
R: EARN 60,000 BONUS POINTS | after qualifying account activity2 | 6X POINTS | on eligible JetBlue purchases2 | 2X POINTS | at restaurants and eligible grocery stores2 | $99 | annual fee1 | * EARN 60,000 BONUS POINTS | after spending $1,000 on purchases and paying the annual fee in full, both | within the first 90 days2 | * 6X POINTS | on eligible JetBlue purchases2 | * 2X POINTS AT RESTAURANTS AND ELIGIBLE GROCERY STORES | and 1X points on all other purchases2 | for you and up to 3 eligible travel companions on JetBlue-operated flights2,4 | Earn toward Mosaic with every purchase3 | on eligible inflight purchases on JetBlue-operated flights2,4 | That’s any seat, any time, on JetBlue-operated flights3 | when you redeem for and travel on a JetBlue-operated Award Flight2 | fare at the time of booking3 | * ANNUAL $100 STATEMENT CREDIT | after you purchase a JetBlue Vacations package of $100 or more with your | JetBlue Plus Card2 | Your points will be ready whenever you are3 | * $0 FRAUD LIABILITY PROTECTION | Earn & share points with family and friends3 | $1,000 annually2 | combination or dollars and TrueBlue points – starting with as few as 500 | points3 | on international purchases1 | * EARN 5,000 POINTS BONUS | each year after your JetBlue Plus Card account anniversary2 | transfer that posts to your account within 45 days of account opening. | After that (and for balance transfers that do not post within 45 days of account | $99 | 1. Offer subject to credit approval. This offer is available through this | days of account opening is applicable for the first 12 billing cycles that | 2. Conditions and limitations apply. Please refer to the Reward Rules within | 3. Refer to TrueBlue Terms and Conditions | 4. JetBlue-operated flights only. Codeshare flights and flights operated by a | Credit Card Customer Support: 877-523-0478...
[09:43:35.314] INFO (16328): GPT Response: Up To Date. The text matches the information you provided in the JSON format, including the sign-up bonus of 60,000 points after spending $1,000 on purchases and paying the annual fee in full, both within the first 90 days.

Yes, it’s a bit of a mess, but GPT handles it like a champ. This is with ChatGPT (GPT-3.5), which I’m using for its good cost-to-performance ratio.

The Other Stuff

I’m running this from a Next.js route handler and triggering it via Vercel’s Cron Jobs. Due to Vercel’s runtime limits, I’ll probably have to make some tweaks so that each card is processed in a separate HTTP request, rather than all in one. However, the hard part is all done!
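For reference, the cron trigger lives in vercel.json; it looks something like this (the route path and schedule here are illustrative, not my actual values):

```json
{
  "crons": [
    { "path": "/api/validate-cards", "schedule": "0 9 * * *" }
  ]
}
```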

After I get the output, I’m using octokit to automatically create a GitHub Issue describing the outdated data.
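The issue itself is just a title/body built from the flagged card and passed to octokit.rest.issues.create. Here’s a sketch of the payload-building step; the `Flagged` type, field names, and label are my illustrative assumptions, not the actual schema.

```typescript
// Builds the GitHub issue payload for a flagged card. The result is what
// gets passed to octokit.rest.issues.create({ owner, repo, ...payload }).
type Flagged = { cardName: string; gptResponse: string };

function buildIssuePayload(flagged: Flagged) {
  return {
    title: `Possibly outdated data: ${flagged.cardName}`,
    body: [
      "GPT flagged this card as potentially out of date.",
      "",
      `**Card:** ${flagged.cardName}`,
      "**GPT response:**",
      `> ${flagged.gptResponse}`,
    ].join("\n"),
    labels: ["data-accuracy"],
  };
}
```

Keeping the payload construction separate from the octokit call makes it easy to unit-test the issue text without touching the GitHub API.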

Conclusion

Hope you enjoyed! If you’d like to learn more about me, take a look at my Portfolio Page. If you’d like to support me, take a look at my Support Page.
