Today we do web scraping on the North Carolina Secretary of State. I’ve been to North Carolina once and it seemed like a great state. Really pretty, with some beautiful beaches. This is the 15th (!!) entry in the Secretary of States web scraping series.
Investigation
I try to look for the most recently registered businesses. They are the businesses that are very likely trying to get set up with new services and products and probably don’t have existing relationships. I think these are typically going to be the more valuable leads.
If the state doesn’t offer a date range with which to search, I’ve discovered a trick that works pretty okay. I just search for “2020”. 2020 is kind of a catchy number, and because we are currently in that year, people tend to start businesses with that number in the name.
Once I find one of these that is registered recently, I look for a business id somewhere. It’s typically a query parameter in the url or form data in the POST request. Either way, if I can increment that id by one and still get a company that is recently registered, I know I can find recently registered businesses simply by increasing the id with which I search.
North Carolina was not much different. They allow you to search with pretty standard stuff. No date range, sadly, so the above “2020” trick works pretty well.
Bingo, just like that we find a business registered in July of this year. Worked like a charm.
Whenever I’m first investigating a site, I always check the network requests. Often you can find that there are direct requests to an API that has the data you need. When I selected this company, 2020 Analytics LLC, I saw this network request and I thought I was in business.
This request didn’t return any easy-to-parse JSON, sadly, only HTML. Still, I figured I should be able to POST that Sos id to this endpoint, get what I wanted, and just increment from there.
Maybe you’re seeing what I missed.
Database id vs Sos id
The id shown in that photo was a lot bigger than the Secretary of State id. 16199332 vs 2006637. I started making requests and plucking out the filing date and the business title starting with 16199332.
The results were pretty intermittent. The first indication that something was up was that the filing dates weren’t exactly sequential. One business would be registered on 7/21/2020 and then 10 numbers later a business was registered on 6/24/2020.
I’m not exactly sure programmatically what is happening that they are making entries into the database like that. In any case, I soon realized that something wasn’t matching up.
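A check like the one I was doing by eye can be sketched in a few lines. This is only an illustration: the Sample shape and both helper names are made up for this example, the dates are assumed to be in the M/D/YYYY format shown above, and the pairing of ids with dates in the sample data is illustrative rather than taken from real responses.

```typescript
// Hypothetical sample shape: the id that was requested and the date text scraped for it.
interface Sample {
  id: number;
  filingDate: string; // assumed M/D/YYYY, e.g. "7/21/2020"
}

// Convert an M/D/YYYY string into a UTC timestamp so dates compare numerically.
function parseFilingDate(text: string): number {
  const [month, day, year] = text.split('/').map(Number);
  return Date.UTC(year, month - 1, day);
}

// Return every pair of neighbouring samples where the higher id has an earlier date.
function findOutOfOrder(samples: Sample[]): Array<[Sample, Sample]> {
  const sorted = samples.slice().sort((a, b) => a.id - b.id);
  const breaks: Array<[Sample, Sample]> = [];
  for (let i = 1; i < sorted.length; i++) {
    const previous = parseFilingDate(sorted[i - 1].filingDate);
    const current = parseFilingDate(sorted[i].filingDate);
    if (current < previous) {
      breaks.push([sorted[i - 1], sorted[i]]);
    }
  }
  return breaks;
}

// Illustrative data only: ten ids apart, but the later id has the earlier date.
const outOfOrder = findOutOfOrder([
  { id: 16199332, filingDate: '7/21/2020' },
  { id: 16199342, filingDate: '6/24/2020' },
]);
console.log('out-of-order pairs:', outOfOrder.length);
```

If that list is not empty, the id you are incrementing probably isn’t tied to registration order.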
I wanted to call directly to this details page but for that I needed to get the database id somehow. Fortunately, North Carolina has a way to search by Sos id.
The resulting HTML looks like this:
Because I’m searching by Sos id it only returned one result. I just grabbed and parsed this anchor tag to pluck out the database id from that ShowProfile function. Two requests: one to get the database id, another to use that database id to get the business details.
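Plucking the id out of that onclick attribute can be sketched on its own, without any HTML parsing. This assumes the attribute looks like ShowProfile('16199332'), which is what the parsing described here implies. The helper name extractDatabaseId is made up for this example, and the regular expression is a swapped-in alternative to string splitting.

```typescript
// Assumed shape of the onclick attribute: ShowProfile('16199332')
// Returns the text between the quotes, or null when the attribute is missing
// or does not contain a ShowProfile call.
function extractDatabaseId(onclick: string | undefined): string | null {
  if (!onclick) {
    return null;
  }
  const match = /ShowProfile\('([^']+)'\)/.exec(onclick);
  return match ? match[1] : null;
}

console.log(extractDatabaseId("ShowProfile('16199332')"));
console.log(extractDatabaseId(undefined));
```

Returning null instead of throwing makes it easy to skip Sos ids that have no matching business.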
The code
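// Imports assumed by the snippets in this post
import axios from 'axios';
import * as cheerio from 'cheerio';
import FormData from 'form-data';

(async () => {
    const startingSosId = 2011748;
    // Increment by large amounts so we can find the most recently registered businesses
    for (let i = 0; i < 5000; i += 100) {
        // We use the query post to get the database id
        const databaseId = await getDatabaseId(startingSosId + i);
        // With the database id we can just POST directly to the details endpoint
        if (databaseId) {
            await getBusinessDetails(databaseId);
        }
        // Good neighbor timeout
        await timeout(1000);
    }
})();

// Assumed helper (not shown in the original post): resolves after `ms` milliseconds
function timeout(ms: number) {
    return new Promise(resolve => setTimeout(resolve, ms));
}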
This is the base of my scraping code. It shows how I’m incrementing by larger jumps so I can quickly find where the most recent ids end. For each Sos id, I go get the database id and then use that to get the business details.
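If you want the exact last id rather than a rough neighborhood, one option is to follow the big jumps with a binary search. To be clear, this is not what my code above does. It is a sketch that assumes Sos ids are handed out without gaps, so every id at or below the newest one returns a business, and that the starting id exists. The names findLastRegisteredId and ExistsFn are made up, and exists stands in for a real lookup such as checking whether getDatabaseId returned something.

```typescript
// Hypothetical predicate: resolves to true when a business exists for this Sos id.
type ExistsFn = (sosId: number) => Promise<boolean>;

async function findLastRegisteredId(
  start: number,      // an id that is known to exist
  step: number,       // size of the coarse jumps
  maxProbes: number,  // upper bound on coarse jumps, to avoid probing forever
  exists: ExistsFn
): Promise<number> {
  // Phase 1: jump forward in large steps until the first miss.
  let lastHit = start;
  let firstMiss: number | null = null;
  for (let probe = 1; probe <= maxProbes; probe++) {
    const candidate = start + probe * step;
    if (await exists(candidate)) {
      lastHit = candidate;
    } else {
      firstMiss = candidate;
      break;
    }
  }
  if (firstMiss === null) {
    // Never ran off the end, so the best we can report is the last hit.
    return lastHit;
  }

  // Phase 2: binary search between the last hit and the first miss.
  let low = lastHit;
  let high = firstMiss;
  while (high - low > 1) {
    const mid = Math.floor((low + high) / 2);
    if (await exists(mid)) {
      low = mid;
    } else {
      high = mid;
    }
  }
  return low;
}

// Usage with a fake lookup where 2012345 is the newest id (made-up number).
const fakeExists: ExistsFn = async (sosId) => sosId <= 2012345;
findLastRegisteredId(2011748, 100, 50, fakeExists).then((id) => console.log('last id', id));
```

With a real site you would still want the good neighbor timeout between requests, and gaps in the id sequence would make the result approximate.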
async function getDatabaseId(sosId: number) {
    const url = 'https://www.sosnc.gov/online_services/search/Business_Registration_Results';

    const formData = new FormData();
    formData.append('SearchCriteria', sosId.toString());
    formData.append('__RequestVerificationToken', 'qnPxLQeaFPiEj4f1so7zWF8e5pTwiW0Ur8A0qkiK_45A_3TL__wTjYlmaBmvWvYJVd2GiFppbLB39eD0F6bmbEUFsQc1');
    formData.append('CorpSearchType', 'CORPORATION');
    formData.append('EntityType', 'ORGANIZATION');
    formData.append('Words', 'SOSID');

    const axiosResponse = await axios.post(url, formData,
        {
            headers: formData.getHeaders()
        });

    const $ = cheerio.load(axiosResponse.data);

    // The single result row has an anchor whose onclick calls ShowProfile('<database id>')
    const onclickAttrib = $('.double tbody tr td a').attr('onclick');
    if (onclickAttrib) {
        const databaseId = onclickAttrib.split("ShowProfile('")[1].replace("')", '');
        return databaseId;
    }
    else {
        console.log('No business found for SosId', sosId);
        return null;
    }
}
Getting the database id looks like this: select that anchor tag shown above and parse the ShowProfile call to grab the database id.
The most enjoyable part was working on the business details. This section had a lot of the data that I wanted, but the fields weren’t always in the same order and a company didn’t always have the same fields.
So I used a trick I’ve used before where I just loop through all of the elements in this section, get the text from the label section, and put the value where it needs to go based on that label.
const informationFields = $('.printFloatLeft section:nth-of-type(2) div:nth-of-type(1) span');
for (let i = 0; i < informationFields.length; i++) {
    if (informationFields[i].attribs.class === 'greenLabel') {
        // This is kind of perverting cheerio objects
        const label = informationFields[i].children[0].data.trim();
        const value = informationFields[i + 1].children[0].data.trim();

        switch (label) {
            case 'SosId:':
                business.sosId = value;
                break;
            case 'Citizenship:':
                business.citizenShip = value;
                break;
            case 'Status:':
                business.status = value;
                break;
            case 'Date Formed:':
                business.filingDate = value;
                break;
            default:
                break;
        }
    }
}
I had to do a little almost-abuse of cheerio’s normally very easy API. The problem was that, as you can see at the top, I’m selecting all the spans in this information section. I needed to loop through each one and I couldn’t find a way to access the text() function without using a proper css selector. For example, $('something').text() is easy. But as I looped I didn’t want to select any further. I wanted that element. And that’s why I ended up with children[0].data.
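For what it’s worth, cheerio does let you stay inside its API while looping: .eq(i) returns the i-th match as a cheerio object, so .text() and .hasClass() are available on it. Here is a small sketch of the same label/value walk written that way. The HTML is a made-up stand-in for the real page, assuming label spans carry the greenLabel class and each one is followed by its value span. Note that .text() joins all nested text, which can differ slightly from children[0].data when a span has child elements.

```typescript
import * as cheerio from 'cheerio';

// Assumed markup, for illustration only; the real page may be structured differently.
const html = `
  <div>
    <span class="greenLabel">SosId:</span><span> 2006637 </span>
    <span class="greenLabel">Date Formed:</span><span>7/21/2020</span>
  </div>`;

const $ = cheerio.load(html);
const spans = $('span');
const fields: Record<string, string> = {};

// Stop one short of the end so spans.eq(i + 1) always refers to a real element.
for (let i = 0; i < spans.length - 1; i++) {
  const current = spans.eq(i);
  if (!current.hasClass('greenLabel')) {
    continue;
  }
  // The span right after a label is treated as that label's value.
  fields[current.text().trim()] = spans.eq(i + 1).text().trim();
}

console.log(fields);
```

Wrapping a raw element with $(element) gives you the same thing, so $(informationFields[i]).text() would also work in the loop above.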
Here’s the full function:
async function getBusinessDetails(databaseId: string) {
    const url = 'https://www.sosnc.gov/online_services/search/_Business_Registration_profile';

    const formData = new FormData();
    formData.append('Id', databaseId);

    const axiosResponse = await axios.post(url, formData,
        {
            headers: formData.getHeaders()
        });

    const $ = cheerio.load(axiosResponse.data);

    const business: any = {
        businessId: databaseId
    };

    business.title = $('.printFloatLeft section:nth-of-type(1) div:nth-of-type(1) span:nth-of-type(2)').text();
    if (business.title) {
        business.title = business.title.replace(/\n/g, '').trim();
    }
    else {
        console.log('No business title found. Likely no business here', databaseId);
        return;
    }

    const informationFields = $('.printFloatLeft section:nth-of-type(2) div:nth-of-type(1) span');
    for (let i = 0; i < informationFields.length; i++) {
        if (informationFields[i].attribs.class === 'greenLabel') {
            // This is kind of perverting cheerio objects
            const label = informationFields[i].children[0].data.trim();
            const value = informationFields[i + 1].children[0].data.trim();

            switch (label) {
                case 'SosId:':
                    business.sosId = value;
                    break;
                case 'Citizenship:':
                    business.citizenShip = value;
                    break;
                case 'Status:':
                    business.status = value;
                    break;
                case 'Date Formed:':
                    business.filingDate = value;
                    break;
                default:
                    break;
            }
        }
    }

    console.log('business', business);
}
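If the list of labels grows, the switch can be swapped for a lookup table, and the any can become a real type. This is a sketch of that variation rather than what the function above does. The Business interface, the labelToField map and the applyLabel helper are names made up for this example, and the field names just mirror the ones used in the switch.

```typescript
// Typed stand-in for the `any` object; optional fields because not every company has them.
interface Business {
  businessId: string;
  sosId?: string;
  citizenShip?: string;
  status?: string;
  filingDate?: string;
}

type LabeledField = 'sosId' | 'citizenShip' | 'status' | 'filingDate';

// Label text as it appears on the page, mapped to the property it fills.
const labelToField = new Map<string, LabeledField>([
  ['SosId:', 'sosId'],
  ['Citizenship:', 'citizenShip'],
  ['Status:', 'status'],
  ['Date Formed:', 'filingDate'],
]);

// Hypothetical helper: store `value` on `business` when the label is one we know,
// and silently skip anything else, like the default branch of the switch.
function applyLabel(business: Business, label: string, value: string): void {
  const field = labelToField.get(label.trim());
  if (field) {
    business[field] = value;
  }
}

const example: Business = { businessId: '16199332' };
applyLabel(example, 'SosId:', '2006637');
applyLabel(example, 'Some Other Label:', 'ignored');
console.log(example);
```

Adding a new field then means one new line in the map plus one new property on the interface.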
And…that’s it! It turned out pretty nice.
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome web data. Learn more at Cobalt Intelligence!
The post Jordan Scrapes Secretary of State: North Carolina appeared first on JavaScript Web Scraping Guy.