Peter Hansen

Posted on Feb 18, 2020 • Edited on Jan 24, 2021

Web Scraping with no coding

Hello World 👋 🌍,

In this article, I will show how easy it can be to do Web Scraping.

I will show how to extract content (text, HTML, links, images, etc.) from a webpage without writing code.

The only thing you will need to do is send an HTTP request and specify the CSS selectors of the elements you want to scrape.

Below you can see an example of a basic request body.

	[
	{
	"selector": "#someId .someClass a",
	"get": "text"
	}
	]

The response can be an array of extracted values, or it can be in a JSON-like format:

	{
	"data": [
	{
	"title": "Our Band Could Be ...",
	"price": "£57.25",
	"image": "media/cache/54/60/54607fe8945897cdcced0044103b10b6.jpg",
	"link": "catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html"
	},
	{
	"title": "Libertarianism for Beginners",
	"price": "£51.33",
	"image": "media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg",
	"link": "catalogue/libertarianism-for-beginners_982/index.html"
	},
	{
	"title": "It's Only the Himalayas",
	"price": "£45.17",
	"image": "media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg",
	"link": "catalogue/its-only-the-himalayas_981/index.html"
	}
	]
	}

For the Demo I’m going to use:

1) Books to scrape — a playground for web scraping.

2) Postman — app for sending HTTP requests.
3) Proxybot — API service helper tool for web scraping.

Let’s get started 👨‍💻

For people who prefer watching videos, there is a quick video showing how to scrape basic webpages.

The idea is very simple. We just need to:

1) Find a page we want to scrape
2) Get the CSS selectors of the desired elements
3) Send an HTTP POST request with a body containing the CSS selectors from the previous step

And now the same steps, in more detail 🔎.

1) Find a page we want to scrape

We will use the ‘Books to Scrape’ (http://books.toscrape.com/) website as our web scraping playground.

The ‘Books to Scrape’ website contains dummy information about various books.

The website is ideal if you want to practice basic web scraping skills.

2) Get CSS selector of desired elements

To get the CSS selectors for the desired elements, open the developer tools of your browser and inspect the elements.

Inspecting a book element gives us the following HTML markup 🕵.

	<article class="product_pod">
	<div class="image_container">
	<a href="catalogue/a-light-in-the-attic_1000/index.html"><img
	src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic"
	class="thumbnail" /></a>
	</div>

	<h3>
	<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
	</h3>

	<div class="product_price">
	<p class="price_color">£51.77</p>

	<p class="instock availability">
	<i class="icon-ok"></i>
	In stock
	</p>

	<form>
	<button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">
	Add to basket
	</button>
	</form>
	</div>
	</article>

NB! We will use the above HTML for creating request objects.
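
Before sending a selector to the service, it can help to check locally what it would match. The sketch below is only an illustration using Python's standard `html.parser`: the `TitleLinkParser` class and the trimmed `MARKUP` string are assumptions made for this example, and the parser merely mimics the `h3 a` selector rather than implementing CSS selectors in general.

```python
from html.parser import HTMLParser

# Trimmed copy of the book markup shown above (illustrative only).
MARKUP = """
<article class="product_pod">
  <h3>
    <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
  </h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
  </div>
</article>
"""


class TitleLinkParser(HTMLParser):
    """Collects the text and href of every <a> nested inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.in_link = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            # Start a new record for this link and remember its href.
            self.in_link = True
            self.results.append({"text": "", "href": dict(attrs).get("href")})

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only text found inside an <h3><a> pair is kept.
        if self.in_link:
            self.results[-1]["text"] += data.strip()


parser = TitleLinkParser()
parser.feed(MARKUP)
print(parser.results)
```

The collected text is the truncated title shown on the page ("A Light in the ..."), which is also what the "text" option returns in the responses later in this article.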

3) Send HTTP POST request to API service 🚀

In the example below, we will extract the title and link of every book found on the page.

Request URL :

https://proxybot.io/api/v1/API_KEY?url=http://books.toscrape.com

Request BODY:

	[
	{
	"selector": "article.product_pod h3",
	"get": "text"
	},
	{
	"selector": "article.product_pod h3 a",
	"get": "attribute",
	"attribute": "href"
	}
	]

We specify the CSS selector for the book’s title and request its value as text. For the link, however, we need to instruct the service to read the value from the href attribute.
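
Postman is all this article needs, but for readers who later want to script the same call, here is a minimal sketch using only Python's standard library. The endpoint pattern and body are taken from the request above; the `build_scrape_request` helper, the `Content-Type` header, and the URL quoting are assumptions for this example, not documented Proxybot behaviour.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://proxybot.io/api/v1"  # base URL as shown in the article


def build_scrape_request(api_key, target_url, rules):
    """Builds (but does not send) a POST request carrying the selector rules."""
    # Keep ':' and '/' unescaped so the URL looks like the one in the article.
    endpoint = f"{API_BASE}/{api_key}?url={quote(target_url, safe=':/')}"
    body = json.dumps(rules).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


rules = [
    {"selector": "article.product_pod h3", "get": "text"},
    {"selector": "article.product_pod h3 a", "get": "attribute", "attribute": "href"},
]
req = build_scrape_request("API_KEY", "http://books.toscrape.com", rules)
print(req.get_method(), req.full_url)

# Sending is left commented out because it needs a real API key and network access:
# with urllib.request.urlopen(req) as response:
#     print(json.load(response))
```

Replace `API_KEY` with your own key before uncommenting the last lines.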

The response will contain titles and links of all books on the page.

Response:

	[
	{
	"selector": "article.product_pod h3",
	"get": "text",
	"data": [
	"A Light in the ...",
	"Tipping the Velvet",
	"Soumission",
	"Sharp Objects",
	"Sapiens: A Brief History ...",
	"The Requiem Red",
	"The Dirty Little Secrets ...",
	"The Coming Woman: A ...",
	"The Boys in the ...",
	"The Black Maria",
	"Starving Hearts (Triangular Trade ...",
	"Shakespeare's Sonnets",
	"Set Me Free",
	"Scott Pilgrim's Precious Little ...",
	"Rip it Up and ...",
	"Our Band Could Be ...",
	"Olio",
	"Mesaerion: The Best Science ...",
	"Libertarianism for Beginners",
	"It's Only the Himalayas"
	]
	},
	{
	"selector": "article.product_pod h3 a",
	"get": "attribute",
	"attribute": "href",
	"data": [
	"catalogue/a-light-in-the-attic_1000/index.html",
	"catalogue/tipping-the-velvet_999/index.html",
	"catalogue/soumission_998/index.html",
	"catalogue/sharp-objects_997/index.html",
	"catalogue/sapiens-a-brief-history-of-humankind_996/index.html",
	"catalogue/the-requiem-red_995/index.html",
	"catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html",
	"catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html",
	"catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html",
	"catalogue/the-black-maria_991/index.html",
	"catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html",
	"catalogue/shakespeares-sonnets_989/index.html",
	"catalogue/set-me-free_988/index.html",
	"catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html",
	"catalogue/rip-it-up-and-start-again_986/index.html",
	"catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html",
	"catalogue/olio_984/index.html",
	"catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html",
	"catalogue/libertarianism-for-beginners_982/index.html",
	"catalogue/its-only-the-himalayas_981/index.html"
	]
	}
	]

This is already super cool!

If you are only interested in specific data, this type of response might already be good enough.
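
If you do want to combine the two arrays yourself, a small script is enough. The sketch below assumes that both "data" arrays have the same length and order, which holds for this page because every h3 contains exactly one link; the `pair_results` helper is a hypothetical name used only for illustration.

```python
def pair_results(response):
    """Zips the parallel "data" arrays of a flat response into one record per book."""
    titles, links = (entry["data"] for entry in response)
    return [{"title": t, "link": l} for t, l in zip(titles, links)]


# First two entries of the response shown above.
sample_response = [
    {"selector": "article.product_pod h3", "get": "text",
     "data": ["A Light in the ...", "Tipping the Velvet"]},
    {"selector": "article.product_pod h3 a", "get": "attribute", "attribute": "href",
     "data": ["catalogue/a-light-in-the-attic_1000/index.html",
              "catalogue/tipping-the-velvet_999/index.html"]},
]

for book in pair_results(sample_response):
    print(book["title"], "->", book["link"])
```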

However, I would like to have a formatted response. Let's see how we can achieve that.

Request BODY for formatted response:

	[
	{
	"selector": "article.product_pod",
	"get": "json",
	"extract": [
	{
	"selector": "h3",
	"get": "text",
	"as": "title"
	},
	{
	"selector": ".product_price p.price_color",
	"get": "text",
	"as": "price"
	},
	{
	"selector": "img",
	"get": "attribute",
	"attribute": "src",
	"as": "image"
	},
	{
	"selector": "a",
	"get": "attribute",
	"attribute": "href",
	"as": "link"
	}
	]
	}
	]

We need to ask the service to return “json” and provide an array of selectors in the “extract” property.

Additionally, we can specify an “as” property, whose value is used as the key for that field in each response object.

The above request will result in the following response.

Formatted response:

	[
	{
	"selector": "article.product_pod",
	"get": "json",
	"extract": [
	{
	"selector": "h3",
	"get": "text",
	"as": "title"
	},
	{
	"selector": ".product_price p.price_color",
	"get": "text",
	"as": "price"
	},
	{
	"selector": "img",
	"get": "attribute",
	"attribute": "src",
	"as": "image"
	},
	{
	"selector": "a",
	"get": "attribute",
	"attribute": "href",
	"as": "link"
	}
	],
	"data": [
	{
	"title": "A Light in the ...",
	"price": "£51.77",
	"image": "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg",
	"link": "catalogue/a-light-in-the-attic_1000/index.html"
	},
	{
	"title": "Tipping the Velvet",
	"price": "£53.74",
	"image": "media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg",
	"link": "catalogue/tipping-the-velvet_999/index.html"
	},
	{
	"title": "Soumission",
	"price": "£50.10",
	"image": "media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg",
	"link": "catalogue/soumission_998/index.html"
	},
	{
	"title": "Sharp Objects",
	"price": "£47.82",
	"image": "media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg",
	"link": "catalogue/sharp-objects_997/index.html"
	},
	{
	"title": "Sapiens: A Brief History ...",
	"price": "£54.23",
	"image": "media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg",
	"link": "catalogue/sapiens-a-brief-history-of-humankind_996/index.html"
	},
	.
	.
	.
	.
	.
	.
	]
	}
	]

Wow! We can specify the format of an object we want to get back! How cool is that?
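
Note that the prices come back as text and the image and link values are relative paths. If you need numbers and absolute URLs, a small post-processing step can handle that. The sketch below is one possible approach, not part of the service: the `normalise_book` helper and the `BASE_URL` constant are assumptions for this example.

```python
from urllib.parse import urljoin

BASE_URL = "http://books.toscrape.com/"  # the page that was scraped


def normalise_book(book, base_url=BASE_URL):
    """Returns a copy with a numeric price and absolute image/link URLs."""
    return {
        "title": book["title"],
        "price": float(book["price"].lstrip("£")),  # "£50.10" becomes a float
        "image": urljoin(base_url, book["image"]),
        "link": urljoin(base_url, book["link"]),
    }


# One object copied from the formatted response above.
sample = {
    "title": "Soumission",
    "price": "£50.10",
    "image": "media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg",
    "link": "catalogue/soumission_998/index.html",
}
print(normalise_book(sample))
```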

Congratulations 🥳 Now you know how to scrape websites without coding. As you can see it is pretty simple. I hope this article was interesting and useful.

In case you are looking for proxy providers, here you can find a list of the top 7 proxy providers in 2021.
