DEV Community

Jordan Hansen

Posted on • Originally published at javascriptwebscrapingguy.com on

Jordan Scrapes Secretary of State: Alabama

Demo code here

That picture above is of the USS Alabama. It’s a very cool museum ship, a retired World War II battleship. Which brings us to our 18th post in the Secretary of State web scraping series. We are going to scrape the Alabama secretary of state website in order to get business leads.

I lived down in the panhandle of Florida for a little bit and Mobile, Alabama was just a short hour drive away. It was a cool place and it had the nearest Costco so we visited there quite a few times.

Investigation

I try to look for the most recently registered businesses. They are the businesses that are very likely trying to get set up with new services and products and probably don’t have existing relationships. I think these are typically going to be the more valuable leads.

If the state doesn’t offer a date range with which to search, I’ve discovered a trick that works pretty okay. I just search for “2020”. 2020 is kind of a catchy number and, because we are currently in that year, people tend to start businesses with that number in the name.

Once I find one of these that was registered recently, I look for a business id somewhere. It’s typically a query parameter in the url or form data in the POST request. Either way, if I can increment that id by one and still get a company that was recently registered, I know I can find recently registered businesses simply by increasing the id with which I search.

Alabama secretary of state

Here we can see the search results for “2020”. The dashes between those numbers initially make me think that maybe it’s not in ascending order but there aren’t any alpha characters so I’m hopeful.

Clicking on 643-391 takes me here.

The first and most notable thing is that it looks like we have a more recently registered business. We’ll check a few more in a moment to see if bigger generally means more recent.

It’s also worth looking at that URL:

http://arc-sos.state.al.us/cgi/corpdetail.mbr/detail?corp=643391&page=name&file=&type=ALL&status=ALL&place=ALL&city=

The query parameters after corp are not necessary; removing them still returns this page as normal. However, they do tell me that we can filter by other things, such as city and business type, which is pretty neat.
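To make the URL observation concrete, here is a minimal sketch of building the trimmed-down detail URL. The helper names (parseEntityId, buildDetailUrl, stripOptionalParams) are hypothetical and not from the demo code; only the base URL, the corp parameter, and the "643-391" to 643391 relationship come from what we saw above.

```typescript
// Base detail URL as seen in the browser; everything after `corp` is optional.
const DETAIL_URL = "http://arc-sos.state.al.us/cgi/corpdetail.mbr/detail";

// Turn an id as displayed in the search results ("643-391") into the
// numeric form that appears in the URL (643391).
function parseEntityId(displayed: string): number {
    const digits = displayed.replace(/-/g, "").trim();
    if (!/^\d+$/.test(digits)) {
        throw new Error(`Unexpected entity id: ${displayed}`);
    }
    return Number(digits);
}

// Build the minimal detail URL, keeping only the `corp` query parameter.
function buildDetailUrl(sosId: number): string {
    const url = new URL(DETAIL_URL);
    url.searchParams.set("corp", String(sosId));
    return url.toString();
}

// Reduce a full URL copied from the browser down to the same minimal form.
function stripOptionalParams(fullUrl: string): string {
    const corp = new URL(fullUrl).searchParams.get("corp");
    if (corp === null) {
        throw new Error("URL has no corp parameter");
    }
    return buildDetailUrl(parseEntityId(corp));
}
```

Calling buildDetailUrl(parseEntityId("643-391")) gives the same address as above with everything after corp=643391 dropped.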

Doing some quick incrementing shows that a bigger id does indeed mean more recent. Cool, we’re in.
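I did that check by hand, but it could be sketched in code, assuming you have already collected a few (id, formation date) pairs. The Sample type and idsTrackDates function are hypothetical names, and the dates in the usage comment are made up for illustration, not scraped from the site.

```typescript
// One observation: an entity id and the formation date shown on its page.
interface Sample {
    id: number;
    formed: Date;
}

// Returns true when formation dates never go backwards as the id increases.
function idsTrackDates(samples: Sample[]): boolean {
    // Sort a copy by id so the caller's array is left untouched.
    const sorted = [...samples].sort((a, b) => a.id - b.id);
    for (let i = 1; i < sorted.length; i++) {
        if (sorted[i].formed.getTime() < sorted[i - 1].formed.getTime()) {
            return false;
        }
    }
    return true;
}

// Usage with made-up values:
// idsTrackDates([
//     { id: 643391, formed: new Date("2020-01-03") },
//     { id: 643392, formed: new Date("2020-01-03") },
// ]);
```

This is strict on purpose. In practice a few out-of-order filings might be fine, so you may want to tolerate a small number of misses before giving up on the trick.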

The code

The top block is the same kind of formula used in most of the states. We loop through starting from a certain id and then handle the details in a separate function.

(async () => {
     const startingId = 642999;
     for (let i = 0; i < 1000; i+=50) {
         await getDetails(startingId + i);
         await timeout(1000);
     }
})();
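The loop above relies on a timeout helper that isn't shown here. Below is a minimal sketch of what it could look like, along with the same loop split into pieces that are easy to test without touching the network. The idsToVisit and crawl names are hypothetical, and this timeout is an assumption about the helper, not a copy of the demo code.

```typescript
// Resolve after `ms` milliseconds; a common way to pause inside an async loop.
function timeout(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

// List the ids the loop visits: startingId, startingId + step, and so on,
// for as long as the offset stays below `span`.
function idsToVisit(startingId: number, span: number, step: number): number[] {
    const ids: number[] = [];
    for (let offset = 0; offset < span; offset += step) {
        ids.push(startingId + offset);
    }
    return ids;
}

// Visit each id in order, pausing after every request to go easy on the site.
async function crawl(
    ids: number[],
    handle: (sosId: number) => Promise<void>,
    delayMs: number
): Promise<void> {
    for (const id of ids) {
        await handle(id);
        await timeout(delayMs);
    }
}
```

With the values from the loop above, idsToVisit(642999, 1000, 50) yields 20 ids, from 642999 through 643949, so we are sampling every 50th id rather than checking each one. Passing getDetails as the handle argument with a 1000 ms delay reproduces the original behavior.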

The selectors were a little bit more complicated but, using the pattern from states such as Vermont, we just built a simple switch on each row’s label.

    const informationFields = $("#block-sos-content tr ");
    for (let i = 0; i < informationFields.length; i++) {
        const cells$ = cheerio.load(informationFields[i]);
        const label = cells$(".aiSosDetailDesc").text();
        const value = cells$(".aiSosDetailValue").text();


The main problem is that we can’t use an index because the data displayed is not always in the same spot. So, as we do above, we grab all the rows that have the data we want. Then we loop through them and pick out the label and the value from each row.

It’s a really neat way to make sure we only get the data we want. Here’s what the entire function looks like:

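import axios from 'axios';
import * as cheerio from 'cheerio';

async function getDetails(sosId: number) {
    // Only the corp query parameter is needed to load the detail page
    const axiosResponse = await axios.get(`http://arc-sos.state.al.us/cgi/corpdetail.mbr/detail?corp=${sosId}`);
    const $ = cheerio.load(axiosResponse.data);
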
    const business: any = {};
    const title = $("thead:nth-of-type(1) tr:first-child td:first-child").text();
    business.title = title.trim();
    const informationFields = $("#block-sos-content tr ");
    for (let i = 0; i < informationFields.length; i++) {
        const cells$ = cheerio.load(informationFields[i]);
        const label = cells$(".aiSosDetailDesc").text();
        const value = cells$(".aiSosDetailValue").text();

        switch (label) {
            case 'Entity ID Number':
                business.idNumber = value;
                break;
            case 'Formation Date':
                business.formationDate = value.replace(/\n/g, "").trim();
                break;
            case 'Registered Office Street Address':
                business.address = value.trim();
                break;
            case 'Registered Agent Name':
                business.agentName = value.replace(/\n/g, "").trim();
                break;
            // Qualify date varies widely compared to formation date when it is a foreign business
            case 'Qualify Date':
                business.qualifyDate = value;
                break;
            case 'Entity Type':
                business.entityType = value;
                break;
            default:
                break;
        }
    }
    console.log("business", business);
}
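If the switch starts to feel long, the same label matching could be sketched with a lookup table instead. This is a hypothetical refactor, not what the demo code does. The labels and field names are the ones from getDetails, but the toBusiness function takes plain [label, value] pairs so it can be tried without cheerio or a network request, and it cleans every value the same way, whereas getDetails only strips newlines from some of them.

```typescript
type Business = { [field: string]: string };

// Page label -> property name, mirroring the cases in the switch.
const LABEL_TO_FIELD = new Map<string, string>([
    ["Entity ID Number", "idNumber"],
    ["Formation Date", "formationDate"],
    ["Registered Office Street Address", "address"],
    ["Registered Agent Name", "agentName"],
    ["Qualify Date", "qualifyDate"],
    ["Entity Type", "entityType"],
]);

// Strip newlines and surrounding whitespace from a cell's text.
function cleanValue(value: string): string {
    return value.replace(/\n/g, "").trim();
}

// Build a business object from label/value pairs in whatever order they appear.
function toBusiness(rows: Array<[string, string]>): Business {
    const business: Business = {};
    for (const [label, value] of rows) {
        const field = LABEL_TO_FIELD.get(label.trim());
        // Rows with labels we don't care about are skipped.
        if (field !== undefined) {
            business[field] = cleanValue(value);
        }
    }
    return business;
}
```

Because rows are matched by label rather than position, it doesn’t matter where a field shows up on a given page, which is the same reason the switch works.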


And…that’s it. We did it. We scraped the Alabama secretary of state for recently registered businesses.

Demo code here

Looking for business leads?

Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome web data. Learn more at Cobalt Intelligence!

The post Jordan Scrapes Secretary of State: Alabama appeared first on JavaScript Web Scraping Guy.
