Some people think that only serious backend developers can scrape images from a website with Python. Is it the kind of task you'd see in a job description and dismiss?
The funny thing is, that idea is no longer true. Automation has gone from something obscure and technical to something approachable and even fun for a Python developer. About 30 lines of code are all you need to download hundreds of images. Doing that by hand would have taken you half a day five years ago.
Scraping images from a website with Python is definitely suitable for beginners. You just need someone to explain it without jargon, and that is exactly what this guide does.
Prerequisites to Make Everything Work
Before we start, check if you have:
- Python 3.7+
- Basic understanding of HTML structure
- requests and BeautifulSoup4 installed (pip install requests beautifulsoup4)
- For dynamic sites: selenium + Chrome driver
Scrape Images from a Website with Python: The Core Idea
Web pages may have a lot of images. Most of them live inside a dedicated tag: <img>. This tag has a src attribute that holds the image URL. Responsive images may also carry a srcset attribute that lists alternative sizes, though not every image has one. That second attribute is what tends to confuse beginners.
Your first task is to fetch the page and parse its HTML source code. Next, you find all the <img> tags and collect the image links. Finally, you download each file. The instructions seem simple, since there are only three steps. In practice, each step has edge cases that can cause a lot of chaos.
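To make the idea concrete before any libraries are installed, here is a minimal sketch that pulls src values out of a small HTML string using only Python's built-in html.parser. The class name ImgCollector and the SAMPLE_HTML markup are made up for illustration; the steps later in this guide use BeautifulSoup for the same job.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "img":
            attr_map = dict(attrs)
            if attr_map.get("src"):
                self.sources.append(attr_map["src"])


# Made-up page content standing in for a downloaded HTML document
SAMPLE_HTML = """
<html><body>
  <h1>Gallery</h1>
  <img src="/images/cat.jpg" alt="A cat">
  <img src="https://cdn.example.com/dog.png"
       srcset="https://cdn.example.com/dog-2x.png 2x" alt="A dog">
  <p>No image here</p>
</body></html>
"""

collector = ImgCollector()
collector.feed(SAMPLE_HTML)
print(collector.sources)
# ['/images/cat.jpg', 'https://cdn.example.com/dog.png']
```

Note that the first result is a relative path and the second tag also has a srcset. Both cases come up again in the extraction step.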
First Step: Get the HTML of the Page
import requests
from bs4 import BeautifulSoup
from pathlib import Path

url = "https://example.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
# A timeout keeps the script from hanging on an unresponsive server
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
We recommend specifying the user agent every time. Without it, many sites return a 403 error or serve a different page than the one you want. Remember, the requests library identifies itself as python-requests/<version> by default, and servers often block that on sight.
Second Step: Start to Extract Images
from urllib.parse import urljoin

images = soup.find_all("img")
image_urls = []
for img in images:
    src = img.get("src")
    if not src:
        # Fall back to the first candidate listed in srcset, if there is one
        candidates = img.get("srcset", "").split()
        src = candidates[0].rstrip(",") if candidates else None
    if src:
        # Resolve relative URLs against the page URL
        image_urls.append(urljoin(url, src))
A few things worth noting. Some images only show up in the srcset attribute, not the src attribute, especially on news sites and e-commerce platforms with responsive designs. When src is missing, the code falls back to the first URL listed in srcset. That first entry is not necessarily the highest-quality one, because sites often list candidates from smallest to largest.
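If you want the largest candidate rather than the first one, the descriptors in srcset (such as 800w or 2x) can be compared. The sketch below is one possible approach, not a full parser: the function name pick_largest_from_srcset is hypothetical, it assumes all candidates use the same kind of descriptor, and it splits on commas, so it will misread URLs that contain commas themselves.

```python
def pick_largest_from_srcset(srcset):
    """Return the URL with the largest width (w) or density (x) descriptor."""
    best_url, best_size = None, -1.0
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        url = parts[0]
        size = 1.0  # a candidate without a descriptor counts as 1x
        if len(parts) > 1:
            try:
                # Strip the trailing unit letter: "800w" -> 800.0, "2x" -> 2.0
                size = float(parts[1][:-1])
            except ValueError:
                size = 0.0  # unreadable descriptor, rank it last
        if size > best_size:
            best_url, best_size = url, size
    return best_url


print(pick_largest_from_srcset("small.jpg 480w, medium.jpg 800w, large.jpg 1200w"))
# large.jpg
print(pick_largest_from_srcset("photo.jpg 1x, photo@2x.jpg 2x"))
# photo@2x.jpg
```

The return value is None for an empty string, so it can be used as a fallback after checking src.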
Relative paths like /images/photo.jpg need to be turned into full image URLs. Easy to forget, annoying to debug.
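One way to do that conversion reliably is urljoin from the standard library. The short sketch below shows how it treats the common forms a src value can take; PAGE_URL and the sample paths are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical page the images were found on
PAGE_URL = "https://example.com/blog/post"

samples = [
    "/images/photo.jpg",              # root-relative path
    "thumb.png",                      # relative to the current directory
    "//cdn.example.com/banner.jpg",   # protocol-relative URL
    "https://other.example.org/x.gif" # already absolute
]

resolved = [urljoin(PAGE_URL, src) for src in samples]
for src, full in zip(samples, resolved):
    print(f"{src} -> {full}")
# /images/photo.jpg -> https://example.com/images/photo.jpg
# thumb.png -> https://example.com/blog/thumb.png
# //cdn.example.com/banner.jpg -> https://cdn.example.com/banner.jpg
# https://other.example.org/x.gif -> https://other.example.org/x.gif
```

Plain string concatenation gets the second and third cases wrong, which is why a URL-aware join is usually the safer choice.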
Third Step: Save Images on Your Local Storage
from pathlib import Path
from urllib.parse import urlparse

output_dir = Path("images_directory")
output_dir.mkdir(exist_ok=True)

for i, img_url in enumerate(image_urls):
    try:
        img_response = requests.get(img_url, headers=headers, timeout=10)
        img_response.raise_for_status()  # don't save error pages as images
        # Get the file extension from the URL path, ignoring any query string
        # (falling back to .jpg is an assumption for URLs with no extension)
        ext = Path(urlparse(img_url).path).suffix or ".jpg"
        filename = output_dir / f"image_{i}{ext}"
        with open(filename, "wb") as f:
            f.write(img_response.content)
        print(f"Saved: {filename}")
    except Exception as e:
        print(f"Failed: {img_url}: {e}")
img_response.content gives you a bytes object: raw binary data. Always write with "wb" (write binary), not "w". If you forget that, Python raises a TypeError, and if you work around it by converting the bytes to text first, you'll get corrupted files and spend 20 minutes wondering why your images won't open. All files end up stored locally in your images_directory folder.
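Guessing the extension from the URL breaks down when the URL has none, for example an image served from /download?id=5. A possible refinement, sketched below, is to prefer the Content-Type header of the response and only fall back to the URL. The helper name pick_extension, the CONTENT_TYPE_EXT mapping, and the .bin fallback are all assumptions made for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Small hand-written mapping; extend it for the formats you expect
CONTENT_TYPE_EXT = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
    "image/svg+xml": ".svg",
}


def pick_extension(img_url, content_type=None):
    """Choose a file extension from the Content-Type, else from the URL path."""
    if content_type:
        # Drop parameters such as "; charset=..." before the lookup
        mime = content_type.split(";")[0].strip().lower()
        if mime in CONTENT_TYPE_EXT:
            return CONTENT_TYPE_EXT[mime]
    suffix = PurePosixPath(urlparse(img_url).path).suffix.lower()
    return suffix or ".bin"


print(pick_extension("https://example.com/download?id=5", "image/png"))
# .png
print(pick_extension("https://example.com/a/photo.JPG?w=100"))
# .jpg
```

Inside the download loop it could be called as pick_extension(img_url, img_response.headers.get("Content-Type")).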
Work with Selenium: Handling Dynamic Content
Static HTML scraping works maybe 60% of the time. What about the rest? On those sites, images load via JavaScript after the page renders, so they never appear in the HTML that requests receives. That's when you need Selenium.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
time.sleep(3)
page_source = driver.page_source
driver.quit()
soup = BeautifulSoup(page_source, "html.parser")
# Continue with same extraction logic...
This approach grabs the page source after JavaScript has run. The time.sleep(3) is crude but often works fine for simple cases. For production scripts, use WebDriverWait with explicit conditions instead. Selenium 4.6 and later download a matching driver automatically through Selenium Manager; with older versions you need to install a Chrome driver that matches your browser version yourself.
Use a Proxy to Avoid Restrictions
Run this image scraper against a large, well-protected website and you will likely hit rate limits or IP blocks quickly. Scraping images at scale usually calls for rotating proxies.
Proxy-Seller provides residential and datacenter proxies that work well for image scraping tasks. Integration is straightforward:
proxies = {
    "http": "http://user:pass@proxy_ip:port",
    "https": "http://user:pass@proxy_ip:port"
}
response = requests.get(url, headers=headers, proxies=proxies)
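The snippet above uses a single proxy. Rotation means switching to a different one on each request, and a simple round-robin can be sketched with itertools.cycle. The proxy addresses in PROXY_POOL are placeholders and proxy_rotation is a hypothetical helper name; substitute the endpoints your provider gives you.

```python
from itertools import cycle

# Placeholder endpoints, not real proxies
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]


def proxy_rotation(pool):
    """Yield requests-style proxy dicts, cycling through the pool forever."""
    for proxy in cycle(pool):
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXY_POOL)
first = next(rotation)
second = next(rotation)
print(first["https"])   # http://user:pass@proxy1.example.com:8000
print(second["https"])  # http://user:pass@proxy2.example.com:8000
```

Each download would then pass proxies=next(rotation) to requests.get, so consecutive requests leave from different addresses.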
If you're using Scrapy for larger scraping projects, check out the proxy for Scrapy. The setup is slightly different and worth reading before you start.
Static vs Dynamic Sites: Quick Reference
| Site Type | Tool | When to Use |
|---|---|---|
| Static HTML | requests + BeautifulSoup | Most blogs, simple pages |
| JS-rendered | Selenium + Chrome driver | SPAs, lazy-loaded images |
| Large scale | Scrapy + middleware | Hundreds of pages |
Common Issues You May Avoid If You Know Them
- Images downloading as 0 bytes: Usually a redirect or auth issue. Check the response status code before saving.
- Only getting tiny thumbnails: You're grabbing src but the actual image is in data-src (lazy loading). Add img.get("data-src") to your extraction logic.
- 403 Forbidden: Missing or wrong user agent, or the site checks the Referer header. Add "Referer": url to your headers dict.
- Duplicate files: Track downloaded URLs in a set and skip if already seen.
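The thumbnail and duplicate fixes from the list above can be sketched together. The helper names best_source and unique_in_order are made up for this example, and plain dicts stand in for parsed <img> tags; BeautifulSoup tags offer the same .get() method, so the helpers should work on them too.

```python
def best_source(tag):
    """Return the most useful image URL from a tag-like object with .get().

    Lazy-loading scripts often keep the real URL in data-src and leave a
    placeholder in src, so data-src is checked first.
    """
    return tag.get("data-src") or tag.get("src")


def unique_in_order(urls):
    """Drop missing and repeated URLs while keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Made-up sample data standing in for soup.find_all("img")
tags = [
    {"src": "/placeholder.gif", "data-src": "/photos/full-1.jpg"},
    {"src": "/photos/full-2.jpg"},
    {"src": "/placeholder.gif", "data-src": "/photos/full-1.jpg"},  # duplicate
    {"alt": "no source at all"},
]

urls = unique_in_order(best_source(tag) for tag in tags)
print(urls)
# ['/photos/full-1.jpg', '/photos/full-2.jpg']
```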
Wrapping Up
The data extraction process can be elegant and simple at the same time. Here's the full picture of how to scrape images from a website with Python:
- Make HTTP requests
- Find <img> tags
- Grab src attributes
- Download images as binary files
Believe it or not, a full ninety percent of tasks can be solved using this approach.
However, there can be challenges. Common ones include lazy loading of images, dynamic content, and bot detection. Selenium handles the first two well. Proxy rotation is the right solution for the third.
One final tip: always check robots.txt and the site's terms of service before scraping.
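The robots.txt check can be automated with urllib.robotparser from the standard library. The sketch below parses a made-up robots.txt from a string so it runs offline; the rules in SAMPLE_ROBOTS and the user agent name my-image-scraper are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules standing in for a site's real robots.txt
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

allowed = parser.can_fetch("my-image-scraper", "https://example.com/images/cat.jpg")
blocked = parser.can_fetch("my-image-scraper", "https://example.com/private/secret.jpg")
print(allowed)  # True
print(blocked)  # False
```

Against a live site you would call parser.set_url("https://example.com/robots.txt") followed by parser.read() instead of parse(), then skip any image URL for which can_fetch returns False.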