Overview
- Extract data from a file using the terminal
- Extract data from a file using a script
- Putting it all together to build a web scraper
Introduction
Each step of this tutorial has a purpose. Try not to skip any, because you will miss valuable skills needed to become a proficient software engineer. These small steps are how you come to own your craft.
Section 1: Extracting data from a file using the terminal teaches command-line skills and how to narrow down the arguments needed to extract exactly the data you want.
Section 2: Extracting data from a file using a script gives you practice writing precise scripts. A computer does exactly what it is told, so your instructions must be exact, and that precision will be needed for the final step.
Section 3: Putting it all together to build a web scraper is the final step, where you show you have mastered the skill. You will build a web scraper you can use to crawl a website, gather information for yourself, or build scrapers for freelance clients.
How to Build a Web Scraper
Assignment: create a Job Listings Aggregator
Section 1: Extracting data from a file using the terminal
- Select one website
For this tutorial I will be using LinkedIn.
- Download one job listing HTML page onto your computer with wget
Website pages can be downloaded as HTML files by right-clicking on the page, selecting "View page source", then right-clicking again and choosing "Save as". There is another way, using wget.
wget is a command-line application that saves the file onto your computer in the current directory. The -O flag specifies the output file name:
$ wget -O name_your_file.html www.websiteYouAreScraping.com
You can see more information about wget and the other flags it accepts with the --help flag:
$ wget --help
- Use command-line applications to extract the data: grep, wc, and sort
grep is a command-line application that searches a file for a specified term. The vertical bar (|) feeds the output of the grep command into the wc command. wc is another terminal application, one that counts the number of lines, words, and characters in its input; the -l flag limits the output to the line count:
$ grep "search term" file_name.html | wc -l
sort will organize the data. It takes in the output of the grep command, orders it, and outputs the sorted lines. The greater-than sign (>) redirects the sorted output into a new file you specify:
$ grep "search term" file_name.html | sort > output_file.txt
Alias Commands & Shortening the Default Prompt
When working with a long command that you use often, it can be useful to create an alias for it. Aliases for the terminal can be made by adding an alias command to the .bash_profile or .bashrc file.
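For example, the counting pipeline from earlier could be shortened to a single word by adding a line like this to ~/.bashrc (the alias name countjobs is just an example):
alias countjobs='grep "job-card" linkedin_jobs.html | wc -l'
After reloading the file with source ~/.bashrc, typing countjobs runs the whole pipeline.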
When the prompt in the terminal becomes long because the full directory path is shown, it can easily be shortened. Use the command:
$ export PS1='[\!:\w]$ '
The backslash-exclamation-point and the backslash-w are special prompt escapes that control what the prompt includes: \! inserts the history number of the current command, and \w inserts the current working directory (a capital \W would show only the last directory, for an even shorter prompt). You can read more at https://ss64.com/bash/syntax-prompt.html
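To make the shorter prompt permanent (assuming bash is your shell), add the same export line to your ~/.bashrc, then reload the file:
$ source ~/.bashrc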
Section 2: Extracting data from a file using a script
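As a minimal sketch of such a script, written in bash and assuming the HTML file from Section 1 plus a hypothetical class="job-title" marker in the markup (adjust the pattern to whatever your saved page actually contains):

#!/usr/bin/env bash
# extract_titles.sh -- pull job titles out of a saved HTML page.
# Usage: ./extract_titles.sh linkedin_jobs.html
# The class name "job-title" is an assumption; inspect your file first.

set -euo pipefail

file="${1:?usage: $0 <html-file>}"

# Keep only the lines containing the title marker, strip the HTML tags
# and leading whitespace with sed, then sort and de-duplicate.
grep 'class="job-title"' "$file" \
  | sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//' \
  | sort -u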
Section 3: Putting it all together to build a web scraper
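As a minimal sketch of the finished scraper, again in bash, combining the wget step from Section 1 with the extraction script from Section 2 (the URL, file names, and "job-title" class are all placeholders):

#!/usr/bin/env bash
# scrape_jobs.sh -- download a listings page and extract its job titles.
# Usage: ./scrape_jobs.sh https://www.websiteYouAreScraping.com/jobs
# The URL, file names, and "job-title" class are assumptions.

set -euo pipefail

url="${1:?usage: $0 <url>}"
page="page_$(date +%Y%m%d).html"

# Section 1: download the page with wget.
wget -O "$page" "$url"

# Section 2: extract the data with grep/sed, sort it, and save it.
grep 'class="job-title"' "$page" \
  | sed 's/<[^>]*>//g' \
  | sort -u > titles.txt

echo "Saved $(wc -l < titles.txt) titles to titles.txt"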
Closing Remarks
Now that you have completed the tutorial, I encourage you to select a website and practice scraping data from it. Try using a web scraping framework, or get paid for your new skills by finding freelance work: go on UpWork.com or Freelancer.com and try to create a web scraper for a client.
Web Scraper Project Ideas
- E-commerce Price Tracker: Scrape product prices from various e-commerce websites and track price changes over time.
- Job Listings Aggregator: Extract job listings from multiple job portals and compile them into a single, searchable database.
- Social Media Sentiment Analysis: Scrape social media platforms for mentions of a particular brand or product and analyze the sentiment of the comments.
- Real-Time News Aggregator: Extract news articles from various news websites and create a real-time news feed.
- Stock Market Trend Analysis: Scrape stock market data and analyze trends to predict future movements.
- Recipe Recommendation Engine: Extract recipes from cooking websites and create a recommendation engine based on user preferences.
- Automated Travel Itinerary Planner: Scrape travel websites for flight, hotel, and activity information to create personalized travel itineraries.
- Sports Statistics Collector: Extract sports statistics from various sources and create a comprehensive database for analysis.
- Product Review Aggregator: Scrape product reviews from e-commerce sites and compile them into a single, searchable database.
- Real Estate Market Analysis: Extract real estate listings and analyze market trends to provide insights for buyers and sellers.