IronSoftware

Originally published at ironsoftware.com

Web Scraping in C#

What is IronWebScraper?

IronWebScraper is a class library and framework for C# and the .NET platform that lets developers programmatically read websites and extract their content. This is ideal for reverse engineering websites or existing intranets and turning them back into databases or JSON data. It is also useful for downloading large volumes of documents from the internet.

In many respects, IronWebScraper is similar to the Scrapy library for Python, but it leverages the advantages of C#, particularly the ability to step through and debug code while a scrape is in progress.

Installation

Your first step will be to install IronWebScraper, which you may do from NuGet or by downloading the DLL from our website.

All of the classes you will need can be found in the IronWebScraper namespace.

PM > Install-Package IronWebScraper

Popular Use Cases

Migrating Websites to Databases

IronWebScraper provides the tools and methods to allow you to re-engineer your websites back into structured databases. This technology is useful when migrating content from legacy websites and intranets into your new C# application.

Migrating Websites

Being able to easily extract the content of a partial or complete website in C# reduces the time and cost of migrating or upgrading website and intranet resources. This can be significantly more efficient than direct SQL transformations, because it flattens the data down to what can be seen on each web page, and requires neither an understanding of the previous SQL data structures nor complex SQL queries.

Populating Search Indexes

IronWebScraper may be pointed at your own website or intranet to read every page and extract structured data, so that a search engine within your organization can be populated accurately.

IronWebScraper is an ideal tool to scrape content for your search index. A search application such as IronSearch can read structured content from IronWebScraper to build a powerful enterprise search system.

Using IronWebScraper

To learn how to use IronWebScraper, it is best to look at examples. This basic example creates a class to scrape titles from a website blog.

C#:

using IronWebScraper;
namespace WebScrapingProject
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            // Create the scraper and start it.
            var scraper = new BlogScraper();
            scraper.Start();
        }
    }
    class BlogScraper : WebScraper
    {
        // Init applies initial settings and queues the first request.
        public override void Init()
        {
            this.LoggingLevel = WebScraper.LogLevel.All;
            this.Request("https://blog.scrapinghub.com", Parse);
        }
        // Parse is called with each page that has been downloaded.
        public override void Parse(Response response)
        {
            // Record the title of every post linked from this page.
            foreach (var title_link in response.Css("h2.entry-title a"))
            {
                string strTitle = title_link.TextContentClean;
                Scrape(new ScrapedData() { { "Title", strTitle } });
            }
            // If the page links to a further page of posts, request it
            // and handle it with this same Parse method.
            if (response.CssExists("div.prev-post > a[href]"))
            {
                var next_page = response.Css("div.prev-post > a[href]")[0].Attributes["href"];
                this.Request(next_page, Parse);
            }
        }
    }
}

VB:

Imports IronWebScraper
Namespace WebScrapingProject
    Friend Class MainClass
        Public Shared Sub Main(ByVal args() As String)
            ' Create the scraper and start it.
            Dim scraper = New BlogScraper()
            scraper.Start()
        End Sub
    End Class
    Friend Class BlogScraper
        Inherits WebScraper
        ' Init applies initial settings and queues the first request.
        Public Overrides Sub Init()
            Me.LoggingLevel = WebScraper.LogLevel.All
            Me.Request("https://blog.scrapinghub.com", AddressOf Parse)
        End Sub
        ' Parse is called with each page that has been downloaded.
        Public Overrides Sub Parse(ByVal response As Response)
            ' Record the title of every post linked from this page.
            For Each title_link In response.Css("h2.entry-title a")
                Dim strTitle As String = title_link.TextContentClean
                Scrape(New ScrapedData() From {
                    { "Title", strTitle }
                })
            Next title_link
            ' If the page links to a further page of posts, request it
            ' and handle it with this same Parse method.
            If response.CssExists("div.prev-post > a[href]") Then
                Dim next_page = response.Css("div.prev-post > a[href]")(0).Attributes("href")
                Me.Request(next_page, AddressOf Parse)
            End If
        End Sub
    End Class
End Namespace

To scrape a specific website, we create our own class to read that website. This class extends WebScraper. We add some methods to this class, including Init, where we can apply initial settings and start the first request. That request in turn sets off a chain reaction in which the entire website is scraped.

We must also add at least one Parse method. Parse methods read webpages that have been downloaded from the internet and use jQuery-like CSS selectors to select content and extract the relevant text and/or images.
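
As a rough sketch of how more than one Parse method might fit together, the class below uses one method for the listing page and a second for each article. It assumes that Request accepts any method that takes a Response, in the same way the example above passes Parse. The ArticleScraper and ParseArticle names and the h1.entry-title and div.entry-content selectors are hypothetical and would need to match the markup of the site being scraped.

```csharp
using IronWebScraper;

namespace WebScrapingProject
{
    // Hypothetical scraper with two parse methods: one per page type.
    class ArticleScraper : WebScraper
    {
        public override void Init()
        {
            this.LoggingLevel = WebScraper.LogLevel.All;
            this.Request("https://blog.scrapinghub.com", Parse);
        }

        // Listing page: queue a request for every article link found.
        public override void Parse(Response response)
        {
            foreach (var title_link in response.Css("h2.entry-title a"))
            {
                var articleUrl = title_link.Attributes["href"];
                // Assumption: a non-overridden method with the same
                // signature as Parse can be used as the callback.
                this.Request(articleUrl, ParseArticle);
            }
        }

        // Article page: extract the title and body text.
        // Both selectors are assumed, not taken from the real site.
        public void ParseArticle(Response response)
        {
            if (response.CssExists("h1.entry-title") && response.CssExists("div.entry-content"))
            {
                string title = response.Css("h1.entry-title")[0].TextContentClean;
                string body = response.Css("div.entry-content")[0].TextContentClean;
                Scrape(new ScrapedData() { { "Title", title }, { "Body", body } });
            }
        }
    }
}
```

The scraper would be started the same way as BlogScraper, by creating an instance and calling Start(). Keeping one method per page type means each method only needs selectors for the layout it handles.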

Within a Parse method, we may also specify which hyperlinks we wish the crawler to continue to follow and which ones it will ignore.
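
One way to decide which links to follow is a small filter that runs before each call to Request. The helper below is a sketch only and is not part of IronWebScraper: the LinkFilter class, the ShouldFollow method, and the rule that skips paths under /tag/ are all hypothetical. It uses only System.Uri from the .NET base library.

```csharp
using System;

namespace WebScrapingProject
{
    // Hypothetical helper, not part of IronWebScraper.
    static class LinkFilter
    {
        // Returns true and sets absoluteUrl when the link should be followed.
        public static bool ShouldFollow(Uri pageUrl, string href, out string absoluteUrl)
        {
            absoluteUrl = string.Empty;
            if (string.IsNullOrWhiteSpace(href)) return false;

            // Resolve relative links against the page they were found on.
            if (!Uri.TryCreate(pageUrl, href, out var target)) return false;

            // Ignore mailto:, javascript: and other non-web schemes.
            if (target.Scheme != Uri.UriSchemeHttp && target.Scheme != Uri.UriSchemeHttps)
                return false;

            // Stay on the host we started from.
            if (!string.Equals(target.Host, pageUrl.Host, StringComparison.OrdinalIgnoreCase))
                return false;

            // Example rule (an assumption): skip tag archive pages.
            if (target.AbsolutePath.StartsWith("/tag/", StringComparison.OrdinalIgnoreCase))
                return false;

            absoluteUrl = target.AbsoluteUri;
            return true;
        }
    }

    // Possible use inside a Parse method, where siteRoot is a hypothetical
    // field holding the start URL as a Uri:
    //
    //   if (LinkFilter.ShouldFollow(siteRoot, href, out var url))
    //       this.Request(url, Parse);
}
```

Keeping the rules in one place makes them easy to adjust and to test without running a scrape.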

We may use the Scrape method to extract any data and dump it into a convenient JSON-style file format for later use.
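
The exact location and layout of the output are not described here. Purely as an illustration, the sketch below assumes the scraped data ends up as text with one JSON object per line, using the Title field from the example above, and reads such records back with System.Text.Json. The ScrapedOutputReader class is hypothetical and the sample lines are made up; check the real output of your scraper before relying on this shape.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

namespace WebScrapingProject
{
    // Hypothetical reader, not part of IronWebScraper.
    static class ScrapedOutputReader
    {
        // Parses one JSON object per non-empty line into a
        // dictionary of field name to raw JSON value.
        public static List<Dictionary<string, JsonElement>> ReadRecords(IEnumerable<string> lines)
        {
            var records = new List<Dictionary<string, JsonElement>>();
            foreach (var line in lines)
            {
                if (string.IsNullOrWhiteSpace(line)) continue;
                var record = JsonSerializer.Deserialize<Dictionary<string, JsonElement>>(line);
                if (record != null) records.Add(record);
            }
            return records;
        }

        public static void Main()
        {
            // Made-up sample data standing in for a scraper output file.
            var sample = new[]
            {
                "{\"Title\":\"First post\"}",
                "",
                "{\"Title\":\"Second post\"}"
            };

            foreach (var record in ReadRecords(sample))
            {
                Console.WriteLine(record["Title"].GetString());
            }
        }
    }
}
```

For a real file, the lines could be supplied with File.ReadLines(path). System.Text.Json ships with .NET Core 3.0 and later; on .NET Framework it is available as a NuGet package.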

Moving Forward

To learn more about IronWebScraper, we recommend you read the API Reference Documentation, and then start looking at the examples within the tutorial section of our documentation.

The next example we recommend you look at is the C# "blog" web scraping example, where we learn how we might extract the text content from a blog, such as a WordPress blog. This might be very useful in a site migration.

From there, you might go on to the other advanced web scraping tutorial examples, which cover concepts such as websites with many different types of pages, ecommerce websites, and the use of multiple proxies, identities, and logins when scraping data from the internet.