DEV Community

Steven McLintock

Puppeteer Sharp: Crawl the Web using C# and Headless Chrome

Puppeteer Sharp is a .NET port of Puppeteer, the popular Headless Chrome Node.js API built by Google. It was written in C# and released in 2017 by Darío Kondratiuk to offer the same functionality to .NET developers.

Puppeteer Sharp enables a .NET developer to programmatically control, or ‘puppeteer’, the open-source Google Chromium web browser. A key convenience of the Puppeteer API is that it can drive a headless instance of the browser, which does not display the UI and so uses fewer resources.

Why use Puppeteer Sharp?

If you are a .NET developer, installing the Puppeteer Sharp NuGet package into your project enables you to:

  • Crawl the web using a headless web browser
  • Automate testing of a web application using a test framework
  • Retrieve JavaScript-rendered HTML

On the modern web it is common for a web application to rely on JavaScript to load the UI. If you were to programmatically load Bing Maps without using Puppeteer, you may be disappointed to receive a mostly empty page, because the map is only rendered once the page's JavaScript has run.
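For comparison, here is a minimal sketch of what loading the page ‘without Puppeteer’ might look like, using the standard HttpClient class. The class name PlainHttpExample is only an illustration. A plain HTTP request returns the markup the server sends and never executes the page's scripts, so anything rendered by JavaScript is missing from the result.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical example class; not part of Puppeteer Sharp.
public static class PlainHttpExample
{
    public static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Downloads only the initial HTML document; no JavaScript is executed
            string html = await client.GetStringAsync("https://www.bing.com/maps");
            Console.WriteLine($"Received {html.Length} characters of HTML");
        }
    }
}
```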

In addition to retrieving JavaScript-rendered HTML, Puppeteer Sharp is also capable of navigating a website, injecting HTML, interacting with UI elements, and taking screenshots or creating PDFs, and it has many more of the features included in the popular Google Node.js API.

Getting Started

To use Puppeteer Sharp in a new or existing .NET project, install the latest version of the NuGet package ‘PuppeteerSharp’.

PuppeteerSharp on NuGet

The first step needed to ‘puppeteer’ a web browser is to download a revision of Chromium to the local machine. This is the browser that Puppeteer Sharp will use to interact with a website.

Fortunately, we can use C# to download either the default revision, or a revision the developer specifies. The revision will only download if it does not already exist on the local machine.



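// Download the Chromium revision if it does not already exist.
// Note: later PuppeteerSharp releases no longer expose DefaultRevision (see the comments below);
// there, DownloadAsync() can be called with no arguments to fetch the default browser build.
await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);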



If the download is successful, you will see the version of the browser necessary to run on your operating system in your project directory:

Chrome EXE

Load a Webpage

Now that you have a browser downloaded to your local machine, you can begin to load a webpage and retrieve the JavaScript rendered HTML.

First, we will programmatically launch an instance of the headless web browser, open a new tab and go to ‘https://www.bing.com/maps’:



// Create an instance of the browser and configure launch options
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true
});

// Create a new page and go to Bing Maps
var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.bing.com/maps");



Bing Maps (1)

With the webpage successfully loaded in the headless browser, let’s interact with the webpage by searching for a local tourist attraction:



// Search for a local tourist attraction on Bing Maps
await page.WaitForSelectorAsync(".searchbox input");
await page.FocusAsync(".searchbox input");
await page.Keyboard.TypeAsync("CN Tower, Toronto, Ontario, Canada");

// Start waiting before the click so that a fast navigation is not missed
var navigationTask = page.WaitForNavigationAsync();
await page.ClickAsync(".searchIcon");
await navigationTask;



Bing Maps (2)

We’re able to use Puppeteer Sharp to interact with the JavaScript rendered HTML of Bing Maps and search for ‘CN Tower, Toronto, Ontario, Canada’!

If you would like to store the HTML to parse elements such as the address or description, you can easily store the HTML in a variable:



// Store the HTML of the current page
string content = await page.GetContentAsync();


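As a rough illustration of parsing the stored HTML, the sketch below pulls the text of the page's title element out of a string such as content. HtmlSnippets and ExtractTitle are hypothetical names, not part of Puppeteer Sharp, and a regular expression is only reasonable here because the target is a single, well-known tag; for elements such as an address or description, a dedicated HTML parser is likely the better fit.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of Puppeteer Sharp.
public static class HtmlSnippets
{
    // Matches the first <title>...</title> pair, ignoring case and spanning line breaks
    private static readonly Regex TitlePattern = new Regex(
        @"<title\b[^>]*>(?<text>.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the decoded, trimmed title text, or null when no title element is found
    public static string ExtractTitle(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return null;
        }

        Match match = TitlePattern.Match(html);
        return match.Success
            ? WebUtility.HtmlDecode(match.Groups["text"].Value).Trim()
            : null;
    }
}
```

With the variable from the snippet above, usage would be along the lines of: string title = HtmlSnippets.ExtractTitle(content);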

Once you are finished, close the browser to free up resources:



// Close the browser
await browser.CloseAsync();


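If an exception is thrown between launching the browser and closing it, the CloseAsync call is never reached and the Chromium process may be left running. One way to guard against that is to wrap the work in try/finally, as in the minimal sketch below. It assumes Chromium has already been downloaded as described in Getting Started, and the class name CrawlExample is only an illustration.

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

// Hypothetical example class; assumes Chromium was downloaded beforehand.
public static class CrawlExample
{
    public static async Task Main()
    {
        var browser = await Puppeteer.LaunchAsync(new LaunchOptions
        {
            Headless = true
        });

        try
        {
            var page = await browser.NewPageAsync();
            await page.GoToAsync("https://www.bing.com/maps");

            // Store the JavaScript-rendered HTML of the current page
            string content = await page.GetContentAsync();
            Console.WriteLine($"Retrieved {content.Length} characters of HTML");
        }
        finally
        {
            // Runs whether or not the code above threw an exception
            await browser.CloseAsync();
        }
    }
}
```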

Screenshots and PDF Documents

One of the benefits of Puppeteer Sharp is the ability to generate screenshots and PDF documents of the current page. This can be particularly useful for debugging, for automated testing, or for capturing a webpage at a specific resolution.

If you would like to take a screenshot of the current page:



await page.ScreenshotAsync("C:\\Files\\screenshot.png");



Puppeteer screenshots

Alternatively, to generate a PDF document of the current page:



await page.PdfAsync("C:\\Files\\document.pdf");


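Both methods can also take an options object. The sketch below captures the full scrollable page rather than only the visible area, and includes background graphics in the PDF. The output file names and the class name CaptureExample are placeholders, and the sketch assumes Chromium has already been downloaded as described in Getting Started.

```csharp
using System.Threading.Tasks;
using PuppeteerSharp;

// Hypothetical example class; file names are placeholders.
public static class CaptureExample
{
    public static async Task Main()
    {
        var browser = await Puppeteer.LaunchAsync(new LaunchOptions
        {
            Headless = true
        });

        try
        {
            var page = await browser.NewPageAsync();
            await page.GoToAsync("https://www.bing.com/maps");

            // Capture the entire scrollable page, not just the current viewport
            await page.ScreenshotAsync("full-page.png", new ScreenshotOptions
            {
                FullPage = true
            });

            // Include background colors and images in the generated PDF
            await page.PdfAsync("document.pdf", new PdfOptions
            {
                PrintBackground = true
            });
        }
        finally
        {
            await browser.CloseAsync();
        }
    }
}
```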

Change the Viewport

If you need to test a webpage at a specific display size, such as to see how the page would appear on a mobile handset, you can use Puppeteer Sharp to change the size of the viewport of the current page:



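// Change the size of the viewport to simulate the iPhone X:
// 375 x 812 CSS pixels at a 3x device scale factor (1125 x 2436 physical pixels)
await page.SetViewportAsync(new ViewPortOptions
{
    Width = 375,
    Height = 812,
    DeviceScaleFactor = 3,
    IsMobile = true
});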



Bing Maps iPhone X viewport

Trace Logs

Whilst the functionality discussed thus far is useful for monitoring and detecting issues in the user interface of a webpage, a .NET developer may also use Puppeteer Sharp to closely examine performance issues, including network activity.

To accomplish this we can programmatically start and stop a trace log, which is written to a JSON file that can be opened in Chrome DevTools:



await page.Tracing.StartAsync(new TracingOptions { Path = "C:\\Files\\trace.json" });

...

await page.Tracing.StopAsync();



Bing Maps trace log

If a trace log is not capturing the amount of detail you require in your debugging session, you can programmatically enable Chrome DevTools to yield further insight:



var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Devtools = true
});



If you enable Chrome DevTools in Puppeteer Sharp, the headless configuration is automatically disabled, so the browser window is visible. DevTools then lets you inspect the JavaScript-rendered code of your web application and view network activity, among other features.

Puppeteer dev tools

Connect to a Remote Browser

One last feature of Puppeteer Sharp that I would like to mention is the ability to connect to a remote browser. This may be useful if you are using a serverless environment where installing a browser is not an option, such as the scalable ‘Azure Functions’.

One such service that complements this feature is browserless.io:

Browserless



// Point Puppeteer Sharp at the WebSocket endpoint of the remote browser
var connectOptions = new ConnectOptions()
{
    BrowserWSEndpoint = "wss://chrome.browserless.io/"
};

using (var browser = await Puppeteer.ConnectAsync(connectOptions))
{
    ...
}





Contribute to Puppeteer Sharp

If you would like to use this excellent API, be sure to visit puppeteersharp.com. If you would like to contribute to the project, you can find the GitHub repository at github.com/kblok/puppeteer-sharp.

Top comments (2)

Marcos

await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
error: BrowserFetcher does not contain a definition for DefaultRevision

madestroIT

I ran into the same issue. It seems this constant no longer exists. I am guessing the parameterless constructor does this implicitly but I couldn't find any information on it.