CLI – What You Need to Know about Scraping

Cassandra Parisi — Mon, 11 Jan 2021 03:48:53 +0000

What is scraping?

Web scraping is the act of inspecting the HTML code of a webpage and extracting information from its HTML, CSS, or JavaScript, though scraping from JavaScript is a more intricate process.

In my command line interface (CLI) project, I chose to scrape a bookstore website, https://bookshop.org/books?keywords=permaculture, to extract each book’s title, author, price, and URL. I needed the URL to scrape a second layer, the book’s own page, and extract its description.

How to tell if a site is scrapeable?

One challenge I ran into was figuring out what makes a website scrapeable. After reading and listening to several resources, I’ve collected a list of qualifiers.
You will want to find a page that:

1) Has a repetitive format. You will want to use a site that has a list of items that follow the same format, making it easier to iterate, or filter through, each item to extract the information you want.

For instance, in Figure 1 below, there is an unordered list of items. Great start. Figure 2 shows two items from the list in Figure 1 after they’ve been expanded. As you can see, the format repeats from one item to the next: each one contains the same hierarchy of elements, namely h2, a (with its href), h3, and div. These elements contain the information I want to scrape for my CLI.

Figure 1

Figure 2

2) Does not have a lot of JavaScript in its code.
One way to confirm this is to right-click anywhere on the webpage and click “Inspect.” A panel will appear. At the top of the panel, select the “Sources” tab. This lists the files the page loads, so you can see how much of the site relies on JavaScript, like in Figure 3.

Figure 3

Another way to confirm this is to look for telltale signs of JavaScript when you inspect the page: class names prefixed with js-, elements whose classes toggle as you interact with them, or endless scrolling on the site.

3) Won’t change often. If the programmers of the site you scraped decide to change the layout of their site, your scraping code will break.

4) Has useful descriptor tags like class and ID names.
For example, take a look at Figure 4, a website that shows yoga studios in Charlotte. Although this site has a list of studios that you might be tempted to scrape, it does not have many descriptors for each studio. As Figure 5 shows, all of a studio’s information sits inside a single “p” tag. This means there’s no good way to access individual pieces of information such as location, style, or even the name of the studio.
Figure 4

Figure 5

Instead, look for a site with descriptor classes and IDs, preferably classes, because an ID refers to one specific element (which makes iterating over items harder). Each of the highlighted items in Figure 6, below, shows a hierarchy of data, with descriptor classes that can be drilled down into for scraping specific pieces of information in my CLI.
Figure 6

**What in the *CODE!?!***

Figure 7

Figure 7 is an excerpt from my CLI project. Since I used scraping to extract information from the website, I required the gems called Nokogiri and open-uri.

When the program runs and reaches the open-uri code, open-uri fetches the page at the referenced address. Then the Nokogiri code parses the HTML that comes back into a document the program can search.
After those gems are executed, you’ll want to comb out the information you want to use.

https://www.youtube.com/watch?v=_rFCEXpP28E

In the video linked above, I demonstrate how to scrape using Nokogiri and extract data.

When it came to scraping the URL, I had to use a workaround. The URL I scraped was incomplete, and my code broke. This is because the site left off the beginning "https://..." that is necessary for open-uri and Nokogiri to load the page. I hard-coded the "https://..." portion in front of the part of the URL that becomes an individual web address for that item.

In my code, as array_of_books is iterated over and each item’s URL is looked up, a nil return value tells the program to add the hard-coded portion so the run can complete.
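
A related way to complete a partial address, shown here as an alternative sketch rather than the code from my project, is Ruby’s standard `URI.join`. The `BASE_URL` constant, the `absolute_url` helper, and the example paths are hypothetical names chosen for illustration.

```ruby
require "uri"

# Assumed base address of the site being scraped.
BASE_URL = "https://bookshop.org"

# Turns a scraped href into a full web address.
# Returns nil when the element had no href to begin with.
def absolute_url(href, base = BASE_URL)
  return nil if href.nil? || href.strip.empty?

  # URI.join resolves a relative path against the base and
  # leaves an already complete address unchanged.
  URI.join(base, href.strip).to_s
end

puts absolute_url("/books/example-title")
puts absolute_url("https://bookshop.org/books/example-title")
```

Both calls print the same full address, so the rest of the program does not need to know whether the site supplied a complete URL or only the path.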

Why Software Engineering?

Cassandra Parisi — Mon, 21 Dec 2020 02:25:12 +0000

This year has been a season of reflection and change. I was working as a personal banker when the coronavirus broke out. This was a job I had held for a year and a half. I’d received praise from my superiors, and was even given a promotion. Despite that success, I was unsure of whether a future in personal banking was really what I wanted for my life. People were unapologetically rude to me on an almost daily basis, and I had been a victim in multiple robberies. Frankly, it gave me a lot of anxiety. But changing careers is hard. I felt like I was excelling in my job, and I didn’t want to jump ship on my team.

When COVID hit the US, I was considered an essential worker, and continued to go into work at the branch. My bank took all the precautions to ensure we were protected, but being on the front lines made me uneasy. I was worried I might be increasing my chances of being exposed to the virus. Also, the ever-present threat of my branch suffering another robbery loomed over me as I read about millions of people losing their jobs. During the first few months of COVID, our branch’s lobby was closed and we served customers strictly through the drive-through. So that helped to ease some of my stress. But, as if it were a sign directed at me, telling me once and for all that I needed to get out, the very same week we opened the lobby back up to our customers, my branch was robbed. In the immortal words of the Doors, “The time to hesitate [was] through.”

Software engineering was the first field that I considered that felt right, and for a variety of reasons.

• The job outlook is promising - employment for software developers is projected to grow 22 percent from 2019 to 2029, much faster than the average for all occupations.(1)

• The pay is pretty sweet - The median annual wage for software developers was $107,510 in May 2019. The lowest 10 percent earned less than $64,240, and the highest 10 percent earned more than $164,590.(2)

• Software engineers can often work remotely, and who wouldn’t want to have the freedom to go on a road trip or vacation to the Bahamas and not have to use vacation time? Also, there are some foreign employers on the other side of the world who hire engineers in the US so they can have round-the-clock IT support.

• I’ve also had an interest in coding ever since my teen years dabbling around customizing my MySpace page. That interest was deepened in my first job out of college, an accounting gig where I had to use Excel to build custom, automated financial reports for company executives.

With all these stats in mind, I began to dream about what I would do with this particular set of skills. One goal I have upon completing this program is to find a company for which I could use my creativity to design the look of their website, and then use my technical abilities to sustain the site in line with the mission of the company. I want to build user-friendly, professional-looking websites that create positive experiences for customers and facilitate interactions between those customers and my company.

My other goal is to begin freelancing for my friends and other entrepreneurs. I want to support those who are also following their passions. I want to work with people who uphold values similar to mine, who live and work with integrity.

I want to work for companies that improve the lives of their customers and community through creative innovations, with disciplined and passionate employees. I believe that when people of similar values collaborate, communication comes easily and the drive to succeed is apparent within the group. When everyone is working towards a goal that’s bigger than any one individual can achieve on their own, the success becomes more than a career boost. It’s about the satisfaction of having made a positive impact that comes from creating a superior product or service. We need more of this in the world today.

So here I am writing my first blog post, as a student of Flatiron School.

(1) https://www.bls.gov/ooh/computer-and-information-technology/software-developers.htm#tab-6
(2) https://www.bls.gov/ooh/computer-and-information-technology/software-developers.htm#tab-5

DEV Community: Cassandra Parisi
