DEV Community


I wrote a crawler for the first time.

Kayla Sween on February 12, 2021

Early on in the pandemic, I decided that I wanted a way to track the moving average of cases per day in my state, Mississippi, since that wasn't so...
Functional Javascript

Great work Kayla!

A quick tip. If you use the async-await pattern, you don't need the ".then" pattern.

const getCovidData = async () => {
  try {
    daily.newDeaths = await getDailyDeaths();
    daily.testsRun = await getTestsRun();
    daily.newCases = await getDailyCases();
    daily.totalCases = await getTotalCases();

//....
Muhammad Hasnain

The problem with this approach is that each await blocks the next one, so the four requests run one after another instead of concurrently. Instead, opt for something like this.

const promises = [
  getDailyDeaths(),
  getTestsRun(),
  getDailyCases(),
  getTotalCases(),
];

// If you don't want to use await just go with .then in the line below
const resolvedPromises = await Promise.all(promises);
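The values resolve in the same order as the promises array, so (as a rough sketch, reusing the daily object and getter functions from the snippet above) you can map them straight onto the fields:

// Promise.all resolves to an array of values in the same order as the input promises
const [newDeaths, testsRun, newCases, totalCases] = resolvedPromises;
Object.assign(daily, { newDeaths, testsRun, newCases, totalCases });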
Functional Javascript

You could do that in certain cases, but there are only four calls here. Plus, by making them serially to the same domain, you're not overwhelming that one server endpoint.

Actually, as a robustness strategy, with multiple calls to the same domain, you typically want to insert manual delays to prevent server rejections. I'll place a "sleep" between calls. It's not a race to get the calls done as fast as possible. The goal is robustness.
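As a rough sketch of what I mean (the sleep helper and the 1-second delay are just illustrative, reusing the daily object and getters from the snippet above):

// Illustrative helper: resolves after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const getCovidData = async () => {
  try {
    daily.newDeaths = await getDailyDeaths();
    await sleep(1000); // arbitrary pause so we don't hammer the same endpoint
    daily.testsRun = await getTestsRun();
    await sleep(1000);
    daily.newCases = await getDailyCases();
    await sleep(1000);
    daily.totalCases = await getTotalCases();
  } catch (err) {
    console.error(err);
  }
};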

Also, this is a background job, so saving a second or two is not a critical performance criterion here. In this case one request isn't dependent on another, so it's fine; but if one were dependent on another, you'd want to run them serially.

Also, with Promise.all, if one fails, they all fail. It's less robust. With the serial approach each request is atomic and can succeed on its own. I.e., getting 3 out of 4 successful results is better than getting 0 out of 4 just because one of them failed.

Also, you now have an array of resolved values that you still have to loop through and process. With the serial approach, it was done in 4 lines. Much easier to grok. Much easier to read. Much easier to debug.

If I had to do dozens of requests that had no order dependency on each other, and they all went to various domains, and 100% of them had to succeed otherwise all fail, then Promise.all is certainly the way to go. If you have 2 or 3 or 4 requests, there's really no compelling benefit. Default to simple.

So there are pros and cons to each approach to consider.
Thanks for the input!

Muhammad Hasnain

Ahan, thank you for presenting the case in such detail. These things never occurred to me and now I know better. :)

Robloche

The point about robustness is valid, but for the sake of completeness, I'll mention that the potential issue with Promise.all can be avoided by using Promise.allSettled instead.
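As a rough sketch (assuming the same four getter functions as above), Promise.allSettled waits for every promise and reports a status for each instead of rejecting as soon as one fails:

const results = await Promise.allSettled([
  getDailyDeaths(),
  getTestsRun(),
  getDailyCases(),
  getTotalCases(),
]);

// Each entry is { status: 'fulfilled', value } or { status: 'rejected', reason }
results.forEach((result, i) => {
  if (result.status === 'fulfilled') {
    console.log(`request ${i} succeeded:`, result.value);
  } else {
    console.error(`request ${i} failed:`, result.reason);
  }
});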

Kayla Sween

Good to know! Thanks for the tip! I'll change that in my code!

raddevus

This is a really nice write-up and a great idea.

Are you running Ubuntu or some other Linux version as your desktop OS? Just curious.
I run Ubuntu 20.04 (I switched to Ubuntu about 2 years ago, after using Windows for over 25 years... seriously).
Thanks for sharing this interesting article.

Kayla Sween

Thanks! Nope, Ubuntu was just the default option in that YAML file, and I didn't change it because it worked just fine. I'm an OSX person.

(Comment hidden by post author; thread only accessible via permalink)
Dónal

Tip: there's no point in using a try-catch if you're just going to re-throw the error. In other words, this

try {
  doStuff()
} catch (err) {
  throw err
}

is functionally identical to

doStuff()
Kayla Sween

Tip: if you're going to give unsolicited advice on someone's code in a blog post, at least comment about the core of the post. Like, "hey great job! This is something you could do instead:" or "hey this blog sucked and here's why:"

I see your account is brand new, so I'm hoping to help educate you on a little bit of etiquette in regards to constructive criticism. Hope that helps!

(Comment marked as low quality/non-constructive by the community)
Dónal

Well seeing as you asked, here's why this blog sucks:

A web crawler is not a technically challenging task. As we can see from the post above, it's just a combination of basic web programming steps such as making HTTP requests and HTML parsing.

That it took someone whose core competency is front-end development 9 years to tackle something so straightforward is fairly risible. I don't see how you can claim to be a "senior-ish" front-end engineer yet be apparently ignorant of core concepts such as error-handling and asynchronous programming.

Hope this helps!

Kayla Sween

Thanks for reading!

Kevin Hicks

I had to write some crawlers when I first started. It definitely is one of those things where you start out having no clue how to do it and think it's going to be extremely tough.

I like the simple approach you took to store the data in a JSON file and use GitHub Actions to update it. It's nice when we can keep things simple instead of spinning up databases and complex infrastructure.

Also, congratulations on getting over the intimidation of doing this. We senior developers definitely do get afraid of tasks. We need to remember that we can always learn how to do something, just as we didn't know how to write code before we became developers.

Kayla Sween

Definitely! Keeping it simple was really important to me because I didn't want to spend a long time getting it set up or potentially spending longer maintaining it.

Thanks! Yeah, this is one that's been haunting me for sure, so it gave me a good confidence boost! And that's very true!

Thanks for reading!

pashacodes

Thank you so much for sharing! I have been really curious about writing cron jobs, and it was so helpful to read through your thought process. It seems a lot less intimidating now.

Kayla Sween

Awesome!!! So glad I could help. Thank you for reading!

MƩdƩric Burlet

I used cheerio for a while but realized it's very slow when processing a lot of pages.
I'd recommend using regex instead; I cut my processing time down by 80%.
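For illustration only (this markup and field are made up, not taken from Kayla's crawler), the idea is to match values straight out of the HTML string instead of building a DOM with cheerio:

// Hypothetical markup: <td id="new-cases">1,234</td>
const html = '<table><tr><td id="new-cases">1,234</td></tr></table>';

// Pull the cell contents with a regex instead of parsing the whole page
const match = html.match(/<td id="new-cases">([\d,]+)<\/td>/);
const newCases = match ? Number(match[1].replace(/,/g, '')) : null;

console.log(newCases); // 1234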

Good job though it's a great start!

Kayla Sween

Good to know! I'm just collecting data from one page for now, but I'll definitely keep that in mind. Thank you!

Tom Connors

As you say, I've not done it and thank you for showing how!!

Kayla Sween

Thank you for reading! 😊

Michael Hungbo

Amazing tutorial, Kayla. And congratulations on getting over the intimidation you had to live with for 9 years! It's a win and I'm happy you got over it.

Kayla Sween

Thank you so much! I'm happy I got over it too. 😅

Rafael • Edited

That's awesome! I love your initiative. Also, I'm glad that cases seem to be going down lately.

Kayla Sween • Edited

Thanks! Yeah, I'm pretty thankful for that. I believe we have more people vaccinated than have tested positive in Mississippi at this point, so I'm optimistic!

Some comments have been hidden by the post's author.