DEV Community

Dipti M

Scraping Multiple Pages from the Same Website

As a data scientist, manually opening dozens of web pages and copying data is not scalable. The more data you collect, the better your models become. But what if that data lives on web pages rather than in clean datasets?
This is where web scraping becomes powerful.
Web scraping is the process of extracting structured or unstructured data from HTML pages and converting it into analysis-ready formats such as data frames. In R, several packages help automate this process, and one of the most popular is rvest, created by Hadley Wickham.
In this guide, you’ll learn how to scrape real-world web pages using R.

Getting Started with rvest
Before diving in, you should have a basic working knowledge of R.
You’ll need these packages:
install.packages("rvest")
library(rvest)

rvest also depends on the xml2 and selectr packages, which are installed automatically alongside it.

How Web Scraping Works in R
The typical workflow is:
Read the web page into R
Identify HTML tags (CSS selectors)
Extract text/content
Convert results into structured formats (e.g., data frames)
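The four steps above can be sketched as one short pipeline. This is only an illustration: the URL and the `.entry-date` selector are taken from the example worked through below.

```r
library(rvest)

# 1. Read the web page into R
page <- read_html("http://pgdbablog.wordpress.com/2015/12/10/pre-semester-at-iim-calcutta/")

# 2. Identify nodes via a CSS selector, 3. extract their text
dates <- html_text(html_nodes(page, ".entry-date"))

# 4. Convert the result into a structured format
df <- data.frame(Date = dates, stringsAsFactors = FALSE)
```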
Let’s work through a real example.

Step 1: Read a Web Page
We’ll scrape a WordPress blog article.
url <- 'http://pgdbablog.wordpress.com/2015/12/10/pre-semester-at-iim-calcutta/'
webpage <- read_html(url)

Now the HTML content is loaded into memory.

Step 2: Identify Tags Using Selector Gadget
Web pages are structured using HTML tags and CSS selectors.
To identify them easily:
vignette("selectorgadget")

You can also install the Chrome extension from:
http://selectorgadget.com/
Click elements on the web page to get their CSS selector.

Step 3: Scrape Individual Elements
Scrape the Post Date
post_date_html <- html_nodes(webpage, '.entry-date')
post_date <- html_text(post_date_html)
post_date

Output:
"December 10, 2015"

Scrape Title & Summary
title_summary_html <- html_nodes(webpage, 'em')
title_summary <- html_text(title_summary_html)

title_summary[2] # Main title
title_summary[1] # Summary

Scrape the Main Content
content_data_html <- html_nodes(webpage, 'p')
content_data <- html_text(content_data_html)

length(content_data)

You’ll see extra content (comments, footer, etc.). The main article is in the first 11 paragraphs:
content_data[1:11]

Scrape Comments
comments_html <- html_nodes(webpage, '.fn')
comments <- html_text(comments_html)

length(comments)
length(unique(comments))

Step 4: Convert Scraped Data into a Data Frame
first_blog <- data.frame(
  Date        = post_date,
  Title       = title_summary[2],
  Description = title_summary[1],
  Content     = paste(content_data[1:11], collapse = ' '),
  Commenters  = length(comments),
  stringsAsFactors = FALSE
)

str(first_blog)

Scraping Multiple Pages from the Same Website
You can reuse the same logic for other URLs from the same site:
url <- 'http://pgdbablog.wordpress.com/2015/12/18/pgdba-chronicles-first-semester/'
webpage <- read_html(url)

Repeat the same scraping steps.
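Rather than repeating the steps by hand, you can wrap them in a function and loop over the URLs. This is a minimal sketch that assumes every post on the site uses the same selectors shown above:

```r
library(rvest)

urls <- c(
  "http://pgdbablog.wordpress.com/2015/12/10/pre-semester-at-iim-calcutta/",
  "http://pgdbablog.wordpress.com/2015/12/18/pgdba-chronicles-first-semester/"
)

# Scrape one post into a one-row data frame
scrape_post <- function(url) {
  page          <- read_html(url)
  title_summary <- html_text(html_nodes(page, "em"))
  data.frame(
    Date        = html_text(html_nodes(page, ".entry-date"))[1],
    Title       = title_summary[2],
    Description = title_summary[1],
    Commenters  = length(unique(html_text(html_nodes(page, ".fn")))),
    stringsAsFactors = FALSE
  )
}

# One row per post
all_posts <- do.call(rbind, lapply(urls, scrape_post))
```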

Scraping Images
To download an image, select it by its class, read its src attribute, then download the file:
webpage <- read_html(url)

img_link <- webpage %>% html_nodes(".wp-image-54")
img_url <- img_link[1] %>% html_attr("src")

# If src is a relative path, resolve it first:
# img_url <- xml2::url_absolute(img_url, url)
download.file(img_url, "test.jpg", mode = "wb")

Scraping More Complex Sites (Example: IMDb)
IMDb pages use structured IDs such as #titleCast and #titleDidYouKnow, along with HTML tables for the cast. Read an IMDb title page into webpage with read_html(), then apply the same selector-based approach.

Scrape Movie Cast
cast_html <- html_nodes(webpage, "#titleCast .itemprop span")
cast <- html_text(cast_html)
cast

Scrape Cast Table
cast_table_html <- html_nodes(webpage, "table")
cast_table <- html_table(cast_table_html)

cast_table[[1]]

Using XPath Instead of CSS Selectors
You can also select nodes with XPath expressions instead of CSS selectors:
CSS selector: table
XPath: //table
Both approaches work with rvest; pass the XPath expression via the xpath argument of html_nodes().
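For example, these two calls select the same nodes (assuming `webpage` already holds a page read with read_html()):

```r
# CSS selector and the equivalent XPath expression
tables_css   <- html_nodes(webpage, "table")
tables_xpath <- html_nodes(webpage, xpath = "//table")

length(tables_css) == length(tables_xpath)
```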

Why Web Scraping Matters
Web scraping allows you to:
Collect social media data
Scrape blog content
Automate job data collection
Extract customer reviews
Capture competitor data
Scraped data can be analyzed further using tools like:
Tableau
Power BI
R and Python models

Summary
The core workflow of web scraping in R is simple:
Load the webpage
Identify selectors
Extract text/images
Structure the data into data frames
The rvest package makes this process intuitive and powerful.
Once you master the basics, you can build large-scale data pipelines from web sources.
At Perceptive Analytics, our mission is “to enable businesses to unlock value in data.” For two decades, we’ve supported 100+ organizations worldwide in building high-impact analytics systems. Our offerings span marketing analytics and AI consulting, helping organizations turn raw data into meaningful, decision-ready insights. We would love to talk to you. Do reach out to us.
