Web scraping is a skill that comes in handy in a number of situations, mainly when you need to get a particular set of data from a website. I believe it is used most often in engineering and the sciences for retrieving data such as statistics or articles with specific keywords. In this tutorial I will teach you how to scrape a website for the latter: articles with specific keywords.
Before we begin, I want to introduce web scraping and some of its limitations. Web scraping, also known as web harvesting or web data extraction, is a method of automatically extracting data from websites over the internet. The method I will be teaching you today is HTML parsing, which means our web scraper will look at the HTML content of a page and extract the information inside the elements whose class we are interested in (if this doesn't make sense, don't worry. I'll go into more detail later!). This method is limited by the fact that not all websites store all of their information in HTML: much of what we see today is dynamic and built after the page has been loaded. Seeing that information requires a more sophisticated web crawler, typically one that can load and render the page itself, which is beyond the scope of this tutorial.
I chose to build a web scraper in C# because the majority of tutorials build theirs in Python. Although Python is likely the ideal language for the job, I wanted to prove to myself that it can be done in C#. I also hope to help others learn to build their own web scrapers by providing one of only a few C# web scraping tutorials (as of the time of writing).
Building a Web Scraper
The website we will be scraping is Ocean Networks Canada, a website dedicated to providing information about the ocean and our planet. People using this project to scrape the internet for articles and data will find that this website provides a similar model to many other websites they will encounter.
1. Launch Visual Studio and create a new C# .NET Windows Forms Application.
2. Design a basic Form with a Button to start the scraper and a Rich Textbox for printing the results.
3. Open your NuGet Package Manager by right-clicking your project name in the Solution Explorer and selecting "Manage NuGet Packages". Search for "AngleSharp" and click Install.
4. Add an array of query terms (these should be the words you want your articles to have in the title) and create a method where we will set up our document to scrape. Your code should look like the following:
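// Namespaces used throughout this tutorial (place at the top of the file)
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

private string Title { get; set; }
private string Url { get; set; }
private string siteUrl = "https://www.oceannetworks.ca/news/stories";
public string[] QueryTerms { get; } = {"Ocean", "Nature", "Pollution"};

internal async void ScrapeWebsite()
{
    CancellationTokenSource cancellationToken = new CancellationTokenSource();
    HttpClient httpClient = new HttpClient();

    // Download the page
    HttpResponseMessage request = await httpClient.GetAsync(siteUrl);
    cancellationToken.Token.ThrowIfCancellationRequested();

    Stream response = await request.Content.ReadAsStreamAsync();
    cancellationToken.Token.ThrowIfCancellationRequested();

    // Build an AngleSharp document from the downloaded HTML
    HtmlParser parser = new HtmlParser();
    IHtmlDocument document = parser.ParseDocument(response);

    // Hand the document to the method created in the next step
    GetScrapeResults(document);
}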
CancellationTokenSource creates a CancellationToken and signals it when a cancellation is requested.
HttpClient provides a class for sending HTTP requests and receiving HTTP responses from a URI-identified resource.
HttpResponseMessage represents an HTTP response message and includes the status code and data.
HtmlParser and IHtmlDocument are AngleSharp types (a class and an interface, respectively) that allow you to build and parse documents from website HTML content.
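The token in ScrapeWebsite above is created but never cancelled or handed to the request, so on its own it cannot stop anything. As a minimal sketch of how a token could take effect, the helper below passes it to GetAsync. The names PageFetcher and GetPageAsync are hypothetical and not part of the original project:

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Hypothetical helper: downloads a page as a string and gives up
    // as soon as the supplied token is cancelled.
    public static async Task<string> GetPageAsync(HttpClient client, string url, CancellationToken token)
    {
        // Stop before doing any work if cancellation was already requested.
        token.ThrowIfCancellationRequested();

        // Passing the token lets the in-flight request itself be cancelled.
        using (HttpResponseMessage response = await client.GetAsync(url, token))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

One way to use this would be to create the source with a time limit, for example new CancellationTokenSource(TimeSpan.FromSeconds(10)), and pass its Token to GetPageAsync so that a slow site cannot hang the scraper indefinitely.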
5. Create another new method to get and display the results from your AngleSharp document. Here we will parse the document and retrieve any articles that match our QueryTerms. This can be tricky, as no two websites use the same HTML naming conventions - it can take some trial and error to get the "articleLink" LINQ query correct:
private void GetScrapeResults(IHtmlDocument document)
{
    foreach (var term in QueryTerms)
    {
        // Elements with the target class whose parent's HTML mentions the term
        IEnumerable<IElement> articleLink = document.All.Where(x =>
            x.ClassName == "views-field views-field-nothing" &&
            (x.ParentElement.InnerHtml.Contains(term) ||
             x.ParentElement.InnerHtml.Contains(term.ToLower())));

        // Check inside the loop so every term's matches are handled
        if (articleLink.Any())
        {
            // Print Results: See Next Step
            PrintResults(term, articleLink);
        }
    }
}
If you aren't sure what happened here, I'll explain in more detail: we loop through each of our QueryTerms (Ocean, Nature, and Pollution) and search the document for every element whose ClassName is "views-field views-field-nothing" and whose ParentElement.InnerHtml contains the term we're currently querying, either as written or in lowercase.
If you're unfamiliar with how to see the HTML of a webpage, you can find it by navigating to your desired URL, right-clicking anywhere on the page, and choosing "View Page Source". Some pages have a small amount of HTML, others have tens of thousands of lines. You will need to sift through all of this to find where the article headers are stored, then determine the class that holds them. A trick I use is searching for part of one of the article headers, then moving up a few lines.
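To experiment with the query without hitting the live site, the same kind of filter can be run against a small HTML string. This is only a sketch: the markup in SampleHtml is made up to resemble the structure described here, and the names ClassQueryExample and FindLinks are hypothetical. Instead of keeping the whole element, it reads the href straight from the link inside each match:

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

public static class ClassQueryExample
{
    // Made-up markup that imitates two article rows (an assumption, not the real page).
    public const string SampleHtml =
        "<div class=\"views-row\">" +
        "<div class=\"views-field views-field-nothing\"><span class=\"field-content\">" +
        "<div><a href=\"/news/stories/example-one\">Ocean example</a></div></span></div></div>" +
        "<div class=\"views-row\">" +
        "<div class=\"views-field views-field-nothing\"><span class=\"field-content\">" +
        "<div><a href=\"/news/stories/example-two\">Unrelated example</a></div></span></div></div>";

    // Returns the href of the first link inside every element that has the
    // target class and whose parent's HTML contains the search term.
    public static List<string> FindLinks(string html, string term)
    {
        IHtmlDocument document = new HtmlParser().ParseDocument(html);

        return document.All
            .Where(x => x.ClassName == "views-field views-field-nothing"
                        && x.ParentElement != null
                        && x.ParentElement.InnerHtml.Contains(term))
            .Select(x => x.QuerySelector("a")?.GetAttribute("href"))
            .OfType<string>() // drops matches that had no link or no href
            .ToList();
    }
}
```

Calling ClassQueryExample.FindLinks(ClassQueryExample.SampleHtml, "Ocean") should return only the first row's link, since the second row never mentions the term.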
6. Now, if our query terms were fruitful, we should have a list of several sets of HTML inside of which are our article titles and URLs. Create a new method to print your results to the Rich Textbox.
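public void PrintResults(string term, IEnumerable<IElement> articleLink)
{
    foreach (var result in articleLink)
    {
        // Clean Up Results: See Next Step
        CleanUpResults(result);

        // Append, so earlier results are not overwritten
        resultsTextbox.Text += $"{Title} - {Url}{Environment.NewLine}";
    }
}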
7. If we were to print our results as-is, they would come in looking like HTML markup, with all the tags, angle brackets, and other non-human-friendly items. We need to add a method that will clean up our results before we print them to the form and, as in step 5, the markup will vary widely by website.
private void CleanUpResults(IElement result)
{
    // Note: ReplaceFirst is not a built-in string method; it has to be defined as an extension method
    string htmlResult = result.InnerHtml.ReplaceFirst(" <span class=\"field-content\"><div><a href=\"", "https://www.oceannetworks.ca");
    htmlResult = htmlResult.ReplaceFirst("\">", "*");
    htmlResult = htmlResult.ReplaceFirst("</a></div>\n<div class=\"article-title-top\">", "-");
    htmlResult = htmlResult.ReplaceFirst("</div>\n<hr></span> ", "");

    // Split Results: See Next Step
    SplitResults(htmlResult);
}
So what happened here? I examined the InnerHtml of the incoming result object to see what needed to be removed to leave only what I actually wanted to display: a Title and a URL. Working from left to right, I replaced the markup in front of the link with the site's base URL, so that the link path that follows becomes a full URL. I then replaced the chunk between the URL and the title with a "*" as a placeholder to split the string on later, and replaced the remaining chunks of HTML with a "-" or an empty string. The ReplaceFirst() calls will be different for each website, and they may not even work flawlessly on every article on a particular site. You can continue to add new replacements for the exceptions, or just ignore them if they are uncommon enough.
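Since System.String has no ReplaceFirst method, the cleanup code relies on a helper being defined somewhere in the project. A minimal sketch of such an extension method might look like the following. The class name StringExtensions is an assumption, and this is one possible implementation rather than necessarily the one used in the original application:

```csharp
using System;

public static class StringExtensions
{
    // Replaces only the first occurrence of `search` in `text`.
    // If `search` is not found, the text is returned unchanged.
    public static string ReplaceFirst(this string text, string search, string replace)
    {
        int position = text.IndexOf(search, StringComparison.Ordinal);
        if (position < 0)
        {
            return text;
        }

        // Everything before the match + the replacement + everything after the match
        return text.Substring(0, position) + replace + text.Substring(position + search.Length);
    }
}
```

Replacing only the first match matters here because a sequence like "\">" can appear more than once in a block of markup, and a plain Replace() would turn every occurrence into a "*".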
8. I'm sure you noticed from the previous step that there's one last method to add before we can print a clean result to our textbox. Now that we've cleaned up our result string, we can use our "*" placeholder to split it into two strings - a Title and a URL.
private void SplitResults(string htmlResult)
{
    // The cleaned string has the form "<url>*<title>"
    string[] splitResults = htmlResult.Split('*');
    Url = splitResults[0];
    Title = splitResults[1];
}
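Because the replacements won't match every article, a cleaned string may sometimes contain no "*" at all, in which case reading splitResults[1] would throw. The sketch below shows a more defensive variant of the same split. ResultSplitter and TrySplit are hypothetical names, and the input in the usage note is made up for illustration:

```csharp
using System;

public static class ResultSplitter
{
    // Splits "<url>*<title>" into its two parts.
    // Returns false when the placeholder is missing instead of throwing.
    public static bool TrySplit(string cleaned, out string url, out string title)
    {
        // Limit the split to two parts so a "*" inside the title is preserved.
        string[] parts = cleaned.Split(new[] { '*' }, 2);
        if (parts.Length < 2)
        {
            url = string.Empty;
            title = string.Empty;
            return false;
        }

        url = parts[0].Trim();
        title = parts[1].Trim();
        return true;
    }
}
```

For example, TrySplit("https://www.oceannetworks.ca/news/stories/example*Example title", out var url, out var title) returns true with the URL and title separated, while a string with no "*" returns false so the caller can skip that result.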
9. Finally we have a clean, human-friendly result! If all went well and the articles haven't drastically changed since the time of writing, running your code should print a set of results scraped by your application from Ocean Networks Canada, each in the form "Title - URL" (and more... there were a lot!).
I hope this tutorial has given you some insight into the world of web scraping. If there's enough interest, I can continue this series and teach you how to set up your application to do a fresh scrape at specific time intervals and send you a newsletter-style email with a day's or week's worth of results.
If you'd like to catch up with me on social media, come find me over on Twitter or LinkedIn and say hello!
Top comments (26)
I found this useful, but I admit to getting a bit stuck around connecting each of your steps together. To help others in the future, here's a Gist that links everything together.
I admit its output isn't as neat as yours, so I have a mistake somewhere... but it's a start. One quick note: it's WPF rather than WinForms, so take that into consideration for all UI interactions.
Follow the link Mathew F. linked to, but edit these lines and everything will work!
Regarding:
gist.github.com/CodeCommissions/43...
Edit:
TO THIS:
Take note of the:
.Skip(1)
The reason it was ugly is because the first element in the IEnumerable was not filtered properly so instead of spending lots of time filtering through that mess we simply skip the first element :)
Thanks for the fix ^_^
I've updated the Gist to include your suggestion.
Thank you for your contribution, but this is definitely not a beginner's project. I am in the middle of adding the missing parts of the project (namespaces) and I'm done. I went through it because it looks like a very simple and easy project; now I'm out looking for something "easier".
This article (dev.to/anjankant/visual-studio-lea...) will also help you scrape a whole website
Does this follow a similar method as I wrote above? I see it's using the HTML Agility Pack library, and I'm not familiar with that.
Yes Rachel, HTMLAgilityPack is a more advanced library that extracts data with XPath and can also be used with LINQ. I have written in depth about scraping websites, and have scraped a number of websites myself using HTMLAgilityPack. But you explained beautifully how to get started with web scraping.
Very cool! I'll have to check it out next time I have some free time for a personal project. Thanks for the recommendation, your articles look very good as well.
Thanks, Rachel, for taking the time. If you need any help, please message me.
Hi Rachel,
Thank you for the post - I've just discovered AngleSharp!
How should I modify the search if I want to go to a site, set values in the search controls, and imitate clicking the Search button? For example, on the site app.toronto.ca/DevelopmentApplicat..., I want to set the filter to New Development = 30 days, click Search, and read the results below.
Thank you so much,
Alexander
This may be just me, but what I look for in a nicely written blog post such as this one, with the title "create-a-simple-web-scraper", is completeness, because it should be a foolproof starter for beginners.
The code here doesn't work without adding the missing parts and fixing implied wrong usage suggestions.
I'm sorry you cannot get it working, but I built the application from the ground up while writing the post. It absolutely does work and is in its fullest form; there are no missing parts, and I'm unsure what you mean by "implied wrong usage suggestions". Could you be more specific?
I would be glad to help you get the application working, can you provide the error you're getting and perhaps a link to your code?
Thank you very much for your tutorial! It helped me a lot! I could successfully build my own C# Web scraper: nerd-corner.com/how-to-program-a-w...
I found something useful from your post and want to apply it to my blog seothetop.com, I will create an xml sitemap generator to submit to Google search
Thank You so much
How would this work on a website that uses a login session?
I found something useful from your post and want to apply it to my blog mucintechmax.com.vn/, I will create an xml sitemap generator to submit to Google search
Thank You so much