This post on scraping the Idaho secretary of state website will be the first in the series of posts scraping various states’ business entity registration. If you are new to web scraping, I suggest checking out the Learn To Web Scrape series first.
An important part of these state scrapes will be acting as good citizens. This is public data maintained by taxpayers, so we will strive to be respectful and not put a heavy load on these sites. I’m also not sure what the requirements are for which data has to be public. My experience so far has been that business names and addresses are required to be public, but it doesn’t look like anything beyond that is.
Idaho
I live in Idaho and I am a big fan. The goal with this post is to talk over scraping the Idaho secretary of state website. I’ve looked at this site before and I thought it would be kind of tricky. As I dug more into it, though, I found that it was packaged together a lot easier than I expected.
The site has a business search form. In an ideal world, we could search only by date. That way we would just find all newly registered businesses in a specific date range. We aren’t looking for specific businesses, just…all businesses.
The great thing about this form is that if we watch the dev tools while we search we can see that it’s a simple POST request that returns all the data in JSON. It’s honestly a web scraper’s dream.
Request URL: https://sosbiz.idaho.gov/api/Records/businesssearch
Request Method: POST
This is great but it doesn’t have all of the data that we would expect to have, like address. If we select a specific business, however, we get more data packaged nicely for us. It looks like it just uses the ID field returned in the general business search.
Request URL: https://sosbiz.idaho.gov/api/FilingDetail/business/689711/false
Request Method: GET
From here it’s just formatting the data in a way that we want.
Starter requests
The actual code for getting the data from these calls is incredibly simple. I use axios to make the call and return the JSON. This one will search for all businesses based on a specific search query and start date.
import axios, { AxiosResponse } from 'axios';

export async function searchBusinesses(search: string, date: string) {
  const url = 'https://sosbiz.idaho.gov/api/Records/businesssearch';
  const body = {
    SEARCH_VALUE: search,
    STARTS_WITH_YN: true,
    CRA_SEARCH_YN: false,
    ACTIVE_ONLY_YN: false,
    FILING_DATE: {
      start: date,
      end: null
    }
  };
  let axiosResponse: AxiosResponse;
  try {
    axiosResponse = await axios.post(url, body);
  }
  catch (e) {
    console.log('Error searching Idaho business info for', search, e);
    throw new Error(`Error searching Idaho business info for ${search}`);
  }
  // Only touch data.rows once we know the response actually has them
  if (axiosResponse.data && axiosResponse.data.rows) {
    console.log('Total businesses found using', search, Object.keys(axiosResponse.data.rows).length);
    return axiosResponse.data.rows;
  }
  return null;
}
This one gets the data for an individual business:
export async function getBusinessInformation(businessId: number) {
  const url = `https://sosbiz.idaho.gov/api/FilingDetail/business/${businessId}/false`;
  let axiosResponse: AxiosResponse;
  try {
    axiosResponse = await axios.get(url);
  }
  catch (e) {
    console.log('Error getting Idaho business info for', businessId, e);
    throw new Error(`Error getting Idaho business info for ${businessId}`);
  }
  return axiosResponse.data;
}
Searching for businesses
Parsing the data from the business search call was definitely the simpler of the two. The API returns a list of businesses, but instead of being arranged in an array, it’s a map of objects keyed by ID. So we do something simple like this to loop through that data:
const businesses = await searchBusinesses(alphabet[i], date);
for (let key in businesses) {
  if (businesses.hasOwnProperty(key)) {
    const currentDate = new Date();
    const formattedBusiness = {
      filingDate: businesses[key].FILING_DATE,
      recordNumber: businesses[key].RECORD_NUM,
      agent: businesses[key].AGENT,
      status: businesses[key].STATUS,
      standing: businesses[key].STANDING,
      title: businesses[key].TITLE[0].split('(')[0].trim(),
      state: 'Idaho',
      sosId: businesses[key].ID,
      createdAt: currentDate,
      updatedAt: currentDate
    };
    formattedBusinesses.push(formattedBusiness);
  }
}
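Because the response is a map keyed by ID, `Object.keys` plus `map` is another way to turn it into an array without the `hasOwnProperty` check. This is only a sketch: the `SearchRow` and `FormattedBusiness` interfaces and the `formatRows` name are hypothetical, and the row shape is inferred solely from the fields used in the loop above, not from any documented API.

```typescript
// Hypothetical row shape, inferred only from the fields used in this post.
interface SearchRow {
  ID: number;
  TITLE: string[];
  FILING_DATE: string;
  RECORD_NUM: string;
  AGENT: string;
  STATUS: string;
  STANDING: string;
}

// Hypothetical output shape mirroring the formattedBusiness object above.
interface FormattedBusiness {
  filingDate: string;
  recordNumber: string;
  agent: string;
  status: string;
  standing: string;
  title: string;
  state: string;
  sosId: number;
  createdAt: Date;
  updatedAt: Date;
}

// Turn the ID-keyed map into a plain array of formatted businesses.
// Object.keys only returns own enumerable keys, so no hasOwnProperty check is needed.
function formatRows(rows: { [id: string]: SearchRow }, now: Date = new Date()): FormattedBusiness[] {
  return Object.keys(rows).map((key) => {
    const row = rows[key];
    return {
      filingDate: row.FILING_DATE,
      recordNumber: row.RECORD_NUM,
      agent: row.AGENT,
      status: row.STATUS,
      standing: row.STANDING,
      // Keep only the part of the title before the first "("
      title: row.TITLE[0].split('(')[0].trim(),
      state: 'Idaho',
      sosId: row.ID,
      createdAt: now,
      updatedAt: now
    };
  });
}
```

Passing `now` in as a parameter keeps every row from one run on the same timestamp and makes the function easy to test.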
Okay, now this is the part that kind of sucks. Since I have to have some kind of search term but I’m not looking for any specific business, I just…loop through the alphabet. I know, I know, shameful. I’m still working on a better way to do this. If you have any ideas, I’d love to hear them.
const alphabet = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"];
for (let i = 0; i < alphabet.length; i++) {
  // do stuff here
}
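One way to keep this loop polite is to pause between requests and merge the results by ID. The sketch below is only an illustration: `searchAllPrefixes`, `sleep`, the `SearchFn` type, and the one-second default delay are hypothetical choices, and it assumes the search function resolves to the ID-keyed map described earlier (or null). It also adds the digits 0 through 9 on the assumption that the search is a prefix match, as `STARTS_WITH_YN: true` suggests, in which case names starting with a number would not show up under any letter.

```typescript
// Hypothetical signature matching searchBusinesses from earlier in the post.
type SearchFn = (term: string, date: string) => Promise<{ [id: string]: any } | null>;

// Resolve after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Run one search per prefix, one at a time, pausing between calls.
// Results are merged into a single map keyed by ID, so repeats collapse.
async function searchAllPrefixes(
  search: SearchFn,
  date: string,
  delayMs: number = 1000 // assumed delay; tune to whatever feels respectful
): Promise<{ [id: string]: any }> {
  // Letters plus digits (assumption: the search matches on prefix)
  const terms = 'abcdefghijklmnopqrstuvwxyz0123456789'.split('');
  const all: { [id: string]: any } = {};
  for (let i = 0; i < terms.length; i++) {
    const rows = await search(terms[i], date);
    if (rows) {
      for (const id of Object.keys(rows)) {
        all[id] = rows[id];
      }
    }
    // No need to wait after the final request
    if (i < terms.length - 1) {
      await sleep(delayMs);
    }
  }
  return all;
}
```

Called as `searchAllPrefixes(searchBusinesses, date)`, this makes 36 sequential requests rather than firing them all at once.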
Getting business details
This one required a lot more work, mostly around parsing the addresses into separate fields for city, state, and zipcode, and then doing it three times: once for the principal address, once for the mailing address, and once for the registered agent address.
The data itself looked like this:
This time, oddly enough, the data was all in an array. This was odd to me because each member of the array had different data. It followed the same pattern most of the time so I was generally safe to just do something like the below.
businesses[i].filingType = businessInfo.DRAWER_DETAIL_LIST[0].VALUE;
businesses[i].status = businessInfo.DRAWER_DETAIL_LIST[1].VALUE;
businesses[i].formedIn = businessInfo.DRAWER_DETAIL_LIST[2].VALUE;
I built a small helper function to parse the addresses, since each one came as a single string like this: PO BOX 12↵MOUNTAIN HOME, ID 83847.
function formatCityStateAndZip(cityStateAndZip: string) {
  const cityStateAndZipObject = {
    city: '',
    state: '',
    zipcode: ''
  };
  if (!cityStateAndZip) {
    return cityStateAndZipObject;
  }
  // "MOUNTAIN HOME, ID 83847" -> city before the comma, state and zipcode after it
  const [city, stateAndZip] = cityStateAndZip.split(',');
  cityStateAndZipObject.city = city;
  if (stateAndZip) {
    const parts = stateAndZip.trim().split(' ');
    cityStateAndZipObject.state = parts[0];
    // Sometimes there are two spaces between the state and the zipcode
    cityStateAndZipObject.zipcode = parts.length < 3 ? parts[1] : parts[2];
  }
  return cityStateAndZipObject;
}
And then we just took the formatted data and treated it as follows:
const formattedPrincipalCityStateAndZip = formatCityStateAndZip(principalAddressSplit[1]);
businesses[i].principalAddressCity = formattedPrincipalCityStateAndZip.city;
businesses[i].principalAddressState = formattedPrincipalCityStateAndZip.state;
businesses[i].principalAddressZipcode = formattedPrincipalCityStateAndZip.zipcode;
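The post does not show how `principalAddressSplit` is built. One plausible approach, assuming the ↵ in the example address is a newline character, is to split on the newline and match the last line with a regular expression. Everything here (`parseAddress`, `ParsedAddress`, and the pattern itself) is a hypothetical sketch, and it assumes US-style "CITY, ST 12345" last lines; anything that does not match is left blank rather than guessed.

```typescript
// Hypothetical result shape for one parsed address.
interface ParsedAddress {
  street: string;
  city: string;
  state: string;
  zipcode: string;
}

// Split "STREET\nCITY, ST 12345" into its parts.
function parseAddress(address: string): ParsedAddress {
  const parsed: ParsedAddress = { street: '', city: '', state: '', zipcode: '' };
  if (!address) {
    return parsed;
  }
  const lines = address.split(/\r?\n/);
  if (lines.length === 1) {
    // No newline: treat the whole string as the street line
    parsed.street = lines[0].trim();
    return parsed;
  }
  // Everything except the last line is treated as the street portion
  parsed.street = lines.slice(0, -1).join(' ').trim();
  const lastLine = lines[lines.length - 1];
  // City, two-letter state, then a 5-digit zip with optional +4.
  // \s+ tolerates one or more spaces between state and zipcode.
  const match = /^\s*([^,]*),\s*([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)\s*$/.exec(lastLine);
  if (match) {
    parsed.city = match[1].trim();
    parsed.state = match[2];
    parsed.zipcode = match[3];
  }
  return parsed;
}
```

The same function could then be reused for the principal, mailing, and registered agent addresses.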
I mentioned above that this data is only formatted in the same order most of the time. To be able to better handle those times when it didn’t match, I just did the following checks:
if (businessInfo.DRAWER_DETAIL_LIST[7] && businessInfo.DRAWER_DETAIL_LIST[7].LABEL === 'AR Due Date') {
  businesses[i].arDueDate = businessInfo.DRAWER_DETAIL_LIST[7].VALUE;
}
else if (businessInfo.DRAWER_DETAIL_LIST[7] && businessInfo.DRAWER_DETAIL_LIST[7].LABEL === 'Registered Agent') {
  // Stuff is done here
}
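Since the position of an entry can shift, another option is to look entries up by their LABEL instead of by index. This is a sketch only: `findDetailValue` and the `DrawerDetail` interface are hypothetical, and it assumes every entry in DRAWER_DETAIL_LIST carries the LABEL and VALUE fields used in the checks above, with VALUE being a string.

```typescript
// Hypothetical entry shape, based on the LABEL and VALUE fields used above.
interface DrawerDetail {
  LABEL: string;
  VALUE: string;
}

// Return the VALUE of the first entry with the given LABEL,
// or undefined when no entry has that label.
function findDetailValue(details: DrawerDetail[], label: string): string | undefined {
  for (const detail of details) {
    if (detail.LABEL === label) {
      return detail.VALUE;
    }
  }
  return undefined;
}
```

With that in place, something like `businesses[i].arDueDate = findDetailValue(businessInfo.DRAWER_DETAIL_LIST, 'AR Due Date');` would no longer care whether the entry sits at index 7 or somewhere else.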
And…there we have it. All of that sweet, succulent Idaho Secretary of State data returned in a big array.
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!
The post Jordan Scrapes Secretary of States: Idaho appeared first on JavaScript Web Scraping Guy.