Today I will be scraping the Oregon Secretary of State business registration site. This is the second post in a series on scraping each state’s business entity registrations. If you are new to web scraping, I suggest checking out the Learn to Web Scrape series first.
An important part of these state scrapes is acting as a good citizen. This is public data maintained by taxpayers, so I will strive to be respectful and not put a heavy load on these sites. I’m also not sure what the requirements are for which data must be public. My experience so far has been that business names and addresses are required to be public, but it doesn’t look like anything beyond that is.
Oregon investigation
The target that I am looking at today is the Oregon Secretary of State. As with any scraping project, I start with the form where the data is submitted. That looks like this:
It’s a very simple form, which is usually unfortunate. I always look for an advanced search so that I can narrow results by date or by specific fields, and this form doesn’t have one.
The next step is to submit the form and see whether the data is loaded via Ajax or rendered directly on the page, with the search parameters in the query string. Looking at the network tab, I can see that the data isn’t loaded via Ajax. Looks like I’m not getting as lucky as I did with Idaho.
If the data had come from an Ajax call, I could possibly have called that endpoint directly and maybe, just maybe, gotten everything back as JSON. As it is, though, the search parameters are in the query string, so I can navigate directly to the results page for whatever search I want.
So now I know that I can go straight to the results for a search term and get a HUGE list of what I want. It’s simple HTML, so I can just parse it with Cheerio and be on my way.
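As a rough sketch of what "navigating directly" can look like, the query string can be assembled with the built-in URLSearchParams instead of by hand. The parameter names and values are the ones visible in the results page URL; the helper name buildSearchUrl is made up for this example.

```typescript
// Base URL and parameter names are taken from the results page's address bar.
const SEARCH_BASE = 'http://egov.sos.state.or.us/br/pkg_web_name_srch_inq.do_name_srch';

// Hypothetical helper (not from the original scraper): builds the results URL
// for a search term and lets URLSearchParams handle the escaping.
function buildSearchUrl(search: string): string {
    const params = new URLSearchParams({
        p_name: search,
        p_regist_nbr: '',
        p_srch: 'PHASE1',
        p_print: 'FALSE',
        p_entity_status: 'ACT',
    });
    return `${SEARCH_BASE}?${params.toString()}`;
}

console.log(buildSearchUrl('B'));
```

The main benefit is that a search term containing characters such as & or a space gets encoded rather than silently breaking the query string.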
One disappointing thing is that, unlike Idaho, Oregon doesn’t let me search by date. This is a big problem if I want to keep an updated list. I don’t want to pull every business each time I update my list and then keep only the ones I need, but that may be what has to happen.
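If a full pull on every update does turn out to be necessary, one way to keep only the new entries is to compare each run against the links already seen. This is a minimal sketch, assuming the details page link is a stable identifier for a business; findNewLinks and the sample link strings are invented for illustration.

```typescript
// Hypothetical helper: returns the links that are not yet in `known`
// and records them in `known` so repeats within the same run are skipped too.
function findNewLinks(scraped: string[], known: Set<string>): string[] {
    const fresh: string[] = [];
    for (const link of scraped) {
        if (!known.has(link)) {
            known.add(link);
            fresh.push(link);
        }
    }
    return fresh;
}

// Made-up link values, only to show the shape of the call.
const knownLinks = new Set<string>(['details-1']);
const newLinks = findNewLinks(['details-1', 'details-2', 'details-2', 'details-3'], knownLinks);
console.log(newLinks);
```

A Set keeps the membership check cheap even when the list of known businesses gets large.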
For now, I’m just going to focus on getting a dump of all of the current business entities.
The code
I call two different pages. One is for searching the list of all businesses and the other is for getting the details of a specific business.
Business search code
It’s not too difficult. I go directly to http://egov.sos.state.or.us/br/pkg_web_name_srch_inq.do_name_srch?p_name=B&p_regist_nbr=&p_srch=PHASE1&p_print=FALSE&p_entity_status=ACTINA with whatever search term I want to use. Then I parse the HTML with Cheerio, find the link to the details page for each business, and push them all into an array of links.
import axios, { AxiosResponse } from 'axios';
import * as cheerio from 'cheerio';

// Searches by name and appends any details page links not already in `links`.
export async function searchBusinesses(search: string, links: string[]) {
    const url = `http://egov.sos.state.or.us/br/pkg_web_name_srch_inq.do_name_srch?p_name=${encodeURIComponent(search)}&p_regist_nbr=&p_srch=PHASE1&p_print=FALSE&p_entity_status=ACT`;

    let axiosResponse: AxiosResponse;
    try {
        axiosResponse = await axios.get(url);
    }
    catch (e) {
        console.log('Error searching for Oregon businesses', search, e);
        throw new Error(`Error searching for Oregon businesses ${search}`);
    }

    const $ = cheerio.load(axiosResponse.data);

    // The details page link is the anchor inside the fourth cell of each row.
    $('tr td:nth-of-type(4) a').each((index, element) => {
        const link = $(element).attr('href');
        // attr() returns undefined when there is no href, so skip those.
        if (link && !links.includes(link)) {
            links.push(link);
        }
    });

    console.log('links found for ', search, links.length);
}
You can see the selector I am using in the image above. I loop through every element it matches and take the href from each one. I then use that array to visit each details page.
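Since the list of links can be huge, visiting the details pages one at a time with a pause in between is one way to stay a good citizen. The sketch below is only an illustration: visitSequentially and sleep are hypothetical helpers, the visit callback stands in for whatever function scrapes a details page, and the right delay is a guess that depends on the site.

```typescript
// Resolves after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
    return new Promise<void>(resolve => setTimeout(resolve, ms));
}

// Hypothetical helper: calls `visit` on each link in order, never in parallel,
// and waits `delayMs` between consecutive requests.
async function visitSequentially<T>(
    links: string[],
    visit: (link: string) => Promise<T>,
    delayMs: number
): Promise<T[]> {
    const results: T[] = [];
    let first = true;
    for (const link of links) {
        if (!first) {
            await sleep(delayMs);
        }
        first = false;
        results.push(await visit(link));
    }
    return results;
}
```

Running requests strictly in sequence is slower than firing them all at once, but it keeps the load on the state’s server predictable.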
Business details page code
The biggest part here is just getting the correct selectors. I’ll share the code for what I’m doing, but it’s really just being clever with CSS selectors. Learn your CSS selectors!
// Assumes `$` is the Cheerio instance loaded from the details page
// and `url` is the address of that page.
const currentDate = new Date();
const business = {
    title: $('table:nth-of-type(3) tr td:nth-of-type(2)').text(),
    filingDate: $('table:nth-of-type(2) tr td:nth-of-type(5)[bgcolor="#CCDDFF"]').text(),
    principalAddressStreet: $('table:nth-of-type(8) tr:nth-of-type(1) td:nth-of-type(2)').text(),
    principalAddressCity: $('table:nth-of-type(9) tr:nth-of-type(1) td:nth-of-type(2)').text(),
    principalAddressState: $('table:nth-of-type(9) tr:nth-of-type(1) td:nth-of-type(3)').text(),
    principalAddressZipcode: $('table:nth-of-type(9) tr:nth-of-type(1) td:nth-of-type(4)').text(),
    // The agent name is split across two cells, so join them with a space.
    registeredAgentName: `${$('table:nth-of-type(13) tr:nth-of-type(1) td:nth-of-type(2)').text()} ${$('table:nth-of-type(13) tr:nth-of-type(1) td:nth-of-type(4)').text()}`,
    registeredAgentStreetAddress: $('table:nth-of-type(14) tr:nth-of-type(1) td:nth-of-type(2)').text(),
    registeredAgentCity: $('table:nth-of-type(15) tr:nth-of-type(1) td:nth-of-type(2)').text(),
    registeredAgentState: $('table:nth-of-type(15) tr:nth-of-type(1) td:nth-of-type(3)').text(),
    registeredAgentZipcode: $('table:nth-of-type(15) tr:nth-of-type(1) td:nth-of-type(4)').text(),
    mailingAddressStreet: $('table:nth-of-type(18) tr:nth-of-type(1) td:nth-of-type(2)').text(),
    mailingAddressCity: $('table:nth-of-type(19) tr:nth-of-type(1) td:nth-of-type(2)').text(),
    mailingAddressState: $('table:nth-of-type(19) tr:nth-of-type(1) td:nth-of-type(3)').text(),
    mailingAddressZipcode: $('table:nth-of-type(19) tr:nth-of-type(1) td:nth-of-type(4)').text(),
    state: 'Oregon',
    recordNumber: $('table:nth-of-type(2) tr:nth-of-type(2) td:nth-of-type(1)').text(),
    filingType: $('table:nth-of-type(2) tr:nth-of-type(2) td:nth-of-type(2)').text(),
    url: url,
    createdAt: currentDate,
    updatedAt: currentDate
};
See all those CSS selectors? It looks gross, but it’s cool how specific I can get with them. The details page does come with some caveats that I’ll probably have to clean up at some point.
See the list above? Sometimes there are more members, which throws off the table positions my selectors rely on. That’s definitely an improvement I’ll have to make.
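One possible direction for that improvement is to look rows up by their label text instead of by table position, so an extra member doesn’t shift everything. The sketch below is not tested against the real page: it assumes the rows have already been reduced to arrays of cell text (for example with Cheerio), and the helper name and sample labels are made up for illustration.

```typescript
// Each row is the text of its cells, extracted from the page beforehand.
type Row = string[];

// Hypothetical helper: finds the first row containing a cell equal to `label`
// (ignoring case and surrounding whitespace) and returns the cells after it.
function findCellsAfterLabel(rows: Row[], label: string): string[] | undefined {
    const wanted = label.trim().toLowerCase();
    for (const row of rows) {
        const index = row.findIndex(cell => cell.trim().toLowerCase() === wanted);
        if (index !== -1) {
            return row.slice(index + 1).map(cell => cell.trim());
        }
    }
    return undefined;
}

// Invented sample rows; the real page's labels and layout may differ.
const sampleRows: Row[] = [
    ['Member', ' Example Person One '],
    ['Member', 'Example Person Two'],
    ['Mailing Address', '123 Example St'],
];
console.log(findCellsAfterLabel(sampleRows, 'mailing address'));
```

With this approach the mailing address row is found no matter how many member rows come before it.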
The end!
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!
The post Jordan Scrapes Secretary of States: Oregon appeared first on JavaScript Web Scraping Guy.