DEV Community

Proxy-Seller

How to Scrape Images From a Website Using Python

Some people think that only serious backend developers can scrape images from a website with Python. Is it the kind of task you've seen in a job description and dismissed as out of reach?

The funny thing is, that idea no longer holds. Automation has gone from something obscure and technical to something approachable, even fun, for a Python developer. About 30 lines of code are all you need to download hundreds of images. Five years ago, collecting them by hand would have taken you half a day.

Scraping images from a website with Python is well suited to beginners, as long as someone explains it without jargon. This is exactly what this guide does.

Prerequisites to Make Everything Work

Before we start, check if you have:

  • Python 3.7+
  • Basic understanding of HTML structure
  • requests and BeautifulSoup4 installed (pip install requests beautifulsoup4)
  • For dynamic sites: selenium + Chrome driver

Scraping Images from a Website with Python: The Core Idea

A web page can contain many images, and each one lives inside an <img> tag. The tag's src attribute holds the image URL. Responsive images may also carry a srcset attribute that lists several candidate URLs at different sizes, though not every image has one. srcset is the part that tends to confuse beginners.

The job breaks into three steps: parse the page's HTML, find every <img> tag and collect its image link, then download each file. That sounds simple, but real pages add plenty of complications.

First Step: Get the HTML of the Page

import requests
from bs4 import BeautifulSoup

url = "https://example.com"

# Identify as a regular browser; many sites reject the default requests user agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, "html.parser")

We recommend setting a User-Agent header on every request. Without it, many sites return a 403 error or serve a page you don't want. By default, the requests library sends its own user agent (python-requests/<version>), and servers often block it on sight.
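One way to avoid repeating the header on every call is a requests.Session, which applies its default headers to each request it sends. The sketch below assumes that pattern; make_session and BROWSER_UA are hypothetical names chosen for illustration.

```python
import requests

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"


def make_session(user_agent: str = BROWSER_UA) -> requests.Session:
    """Return a session that sends `user_agent` with every request (hypothetical helper)."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


# What requests would send if no header were set: "python-requests/<version>"
default_ua = requests.utils.default_user_agent()

session = make_session()
# session.get("https://example.com", timeout=10) would now carry BROWSER_UA
```

Calling session.get(...) instead of requests.get(...) then sends the same header for the page and for every image download.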

Second Step: Start to Extract Images

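from urllib.parse import urljoin

images = soup.find_all("img")
image_urls = []

for img in images:
    src = img.get("src")
    if not src:
        # Fall back to the first srcset candidate; guard against a missing or empty attribute
        candidates = img.get("srcset", "").split()
        src = candidates[0].rstrip(",") if candidates else None
    if src:
        # urljoin resolves relative and protocol-relative URLs against the page URL
        image_urls.append(urljoin(url, src))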

A few things are worth noting. Some images appear only in the srcset attribute, not in src, especially on news sites and e-commerce platforms with responsive designs. The fallback grabs the first URL listed in that attribute, which is not necessarily the highest-quality one, since the order of candidates is up to the page author.
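If you want the largest variant rather than the first one, the descriptors in srcset (such as 480w or 2x) can be compared. The function below is a simplified sketch, not a full srcset parser: largest_srcset_candidate is a hypothetical name, and it assumes candidates are comma-separated and that the URLs themselves contain no commas.

```python
from typing import Optional


def largest_srcset_candidate(srcset: str) -> Optional[str]:
    """Return the URL with the largest width (w) or density (x) descriptor.

    Simplified: assumes comma-separated candidates and no commas inside URLs.
    """
    best_url, best_size = None, -1.0
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue  # skip empty entries such as a trailing comma
        url = parts[0]
        size = 1.0  # a bare URL with no descriptor counts as 1x
        if len(parts) > 1 and parts[1][-1] in "wx":
            try:
                size = float(parts[1][:-1])
            except ValueError:
                pass  # keep the default for malformed descriptors
        if size > best_size:
            best_url, best_size = url, size
    return best_url


print(largest_srcset_candidate("small.jpg 480w, medium.jpg 800w, large.jpg 1200w"))
# prints: large.jpg
```

In the extraction loop, this could replace the first-candidate fallback when image quality matters more than download size.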

Relative paths like /images/photo.jpg need to be turned into full image URLs. Easy to forget, annoying to debug.
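The standard library's urllib.parse.urljoin handles this conversion, including cases that plain string concatenation gets wrong. The short demonstration below uses a hypothetical page address to show how each kind of link resolves.

```python
from urllib.parse import urljoin

page_url = "https://example.com/blog/post.html"  # hypothetical page address

# Root-relative path: replaces the whole path of the page URL
root_relative = urljoin(page_url, "/images/photo.jpg")
print(root_relative)
# prints: https://example.com/images/photo.jpg

# Document-relative path: resolved against the page's directory
doc_relative = urljoin(page_url, "thumbs/photo.jpg")
print(doc_relative)
# prints: https://example.com/blog/thumbs/photo.jpg

# Protocol-relative URL: inherits only the scheme from the page
protocol_relative = urljoin(page_url, "//cdn.example.com/photo.jpg")
print(protocol_relative)
# prints: https://cdn.example.com/photo.jpg

# Absolute URL: returned unchanged
absolute = urljoin(page_url, "https://other.example.org/a.png")
print(absolute)
# prints: https://other.example.org/a.png
```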

Third Step: Save Images on Your Local Storage

from pathlib import Path
from urllib.parse import urlparse

output_dir = Path("images_directory")
output_dir.mkdir(exist_ok=True)

for i, img_url in enumerate(image_urls):
    try:
        img_response = requests.get(img_url, headers=headers, timeout=10)
        img_response.raise_for_status()  # skip error pages instead of saving them

        # Take the extension from the URL path (ignores the query string);
        # default to .jpg when the URL has none
        ext = Path(urlparse(img_url).path).suffix or ".jpg"
        filename = output_dir / f"image_{i}{ext}"

        with open(filename, "wb") as f:
            f.write(img_response.content)

        print(f"Saved: {filename}")
    except (requests.RequestException, OSError) as e:
        print(f"Failed: {img_url}: {e}")

img_response.content gives you a bytes object: raw binary data. Always open the file with "wb" (write binary), not "w". In text mode, writing bytes raises a TypeError, and writing img_response.text instead produces corrupted files that leave you spending 20 minutes wondering why your images won't open. All files end up stored locally in your images_directory folder.
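Many image URLs carry no usable file extension, so the server's Content-Type header can be a better guide to the file name. Here is a minimal sketch of that idea; pick_extension and the CONTENT_TYPE_EXT mapping are hypothetical, and the mapping covers only a few common formats.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Small hand-picked mapping; extend it for other formats you expect
CONTENT_TYPE_EXT = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
    "image/svg+xml": ".svg",
}


def pick_extension(img_url: str, content_type: str = "") -> str:
    """Choose a file extension from the Content-Type header, then the URL path."""
    # Drop parameters such as "; charset=utf-8" and normalise case
    mime = content_type.split(";")[0].strip().lower()
    if mime in CONTENT_TYPE_EXT:
        return CONTENT_TYPE_EXT[mime]
    # Fall back to the suffix of the URL path, which excludes the query string
    suffix = PurePosixPath(urlparse(img_url).path).suffix.lower()
    return suffix or ".bin"


print(pick_extension("https://example.com/a/photo.PNG?size=large"))
# prints: .png
print(pick_extension("https://example.com/image?id=42", "image/jpeg"))
# prints: .jpg
```

In the download loop it could be called as pick_extension(img_url, img_response.headers.get("Content-Type", "")).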

Work with Selenium: Handling Dynamic Content

Static HTML scraping works maybe 60% of the time. On the remaining sites, images load via JavaScript after the page renders, so they never appear in the HTML that requests receives. That's when you need Selenium.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")

time.sleep(3)

page_source = driver.page_source
driver.quit()

soup = BeautifulSoup(page_source, "html.parser")
# Continue with same extraction logic...

This approach grabs the page source after JavaScript has run. The time.sleep(3) call is crude but often works fine for simple cases. For production scripts, use WebDriverWait with explicit conditions instead. Selenium 4.6 and later fetch a matching Chrome driver automatically through Selenium Manager; on older versions you need to install a driver that matches your browser version yourself.
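An explicit wait could look roughly like the sketch below. It waits until at least one <img> element is present in the DOM, which is an assumption about what "ready" means; a real page may need a more specific condition. The function name page_source_when_images_ready is hypothetical.

```python
def page_source_when_images_ready(url: str, timeout: float = 10.0) -> str:
    """Load `url` in headless Chrome and return the HTML once an <img> exists."""
    # Imported here so this module still loads where Selenium is not installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Poll until at least one <img> is in the DOM; raises TimeoutException otherwise
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "img"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser, even on timeout
```

The returned HTML can be passed to BeautifulSoup exactly like response.text in the static case. Unlike a fixed sleep, the wait returns as soon as the condition holds and fails loudly when it never does.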

Use a Proxy to Avoid Restrictions

Run this image scraper against any serious website and you can hit rate limits or IP blocks within minutes. Scraping images at scale generally calls for rotating proxies.

Proxy-Seller provides residential and datacenter proxies that work well for image scraping tasks. Integration is straightforward:

proxies = {
    "http": "http://user:pass@proxy_ip:port",
    "https": "http://user:pass@proxy_ip:port"
}

response = requests.get(url, headers=headers, proxies=proxies)
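Rotation itself can be as simple as cycling through a list of proxy addresses and handing a different one to each request. The sketch below assumes a round-robin scheme; the addresses in PROXY_URLS are placeholders, and rotating_proxies is a hypothetical helper, not part of requests.

```python
from itertools import cycle
from typing import Dict, Iterator, List

# Placeholder addresses; substitute the credentials and hosts from your provider
PROXY_URLS = [
    "http://user:pass@proxy1.example:8000",
    "http://user:pass@proxy2.example:8000",
    "http://user:pass@proxy3.example:8000",
]


def rotating_proxies(proxy_urls: List[str]) -> Iterator[Dict[str, str]]:
    """Yield requests-style proxy dicts, cycling through `proxy_urls` forever."""
    for proxy_url in cycle(proxy_urls):
        yield {"http": proxy_url, "https": proxy_url}


proxy_pool = rotating_proxies(PROXY_URLS)

# Each download takes the next proxy from the pool:
#   requests.get(img_url, headers=headers, proxies=next(proxy_pool), timeout=10)
first, second = next(proxy_pool), next(proxy_pool)
print(first["https"])
# prints: http://user:pass@proxy1.example:8000
```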

If you're using Scrapy for larger scraping projects, check out the proxy for Scrapy. The setup is slightly different and worth reading before you start.

Static vs Dynamic Sites: Quick Reference

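| Site Type   | Tool                     | When to Use              |
|-------------|--------------------------|--------------------------|
| Static HTML | requests + BeautifulSoup | Most blogs, simple pages |
| JS-rendered | Selenium + Chrome driver | SPAs, lazy-loaded images |
| Large scale | Scrapy + middleware      | Hundreds of pages        |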

Common Issues and How to Avoid Them

  • Images downloading as 0 bytes — Usually a redirect or auth issue. Check the response status code before saving.
  • Only getting tiny thumbnails — You're grabbing src but the actual image is in data-src (lazy loading). Add img.get("data-src") to your extraction logic.
  • 403 Forbidden — Missing or wrong user agent, or the site checks the Referer header. Add "Referer": url to your headers dict.
  • Duplicate files — Track downloaded URLs in a set and skip if already seen.
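Two of these fixes, the lazy-loading fallback and duplicate tracking, can be sketched together. The attribute names in LAZY_ATTRS beyond data-src are assumptions about what lazy-loading scripts commonly use, and both helper names are hypothetical. The functions take plain dicts, so they work with the attrs dictionary of a BeautifulSoup tag.

```python
from typing import Dict, Iterable, List, Optional

# Attributes checked before src; names other than data-src are assumptions
LAZY_ATTRS = ("data-src", "data-original", "data-lazy-src")


def best_source(attrs: Dict[str, str]) -> Optional[str]:
    """Prefer a lazy-loading attribute over src, which may hold a placeholder."""
    for name in LAZY_ATTRS + ("src",):
        value = attrs.get(name)
        if value:
            return value
    return None


def unique_in_order(urls: Iterable[str]) -> List[str]:
    """Drop repeated URLs while keeping the first occurrence of each."""
    seen = set()
    result = []
    for u in urls:
        if u not in seen:
            seen.add(u)
            result.append(u)
    return result


# Stand-ins for img.attrs from four parsed <img> tags
tags = [
    {"src": "placeholder.gif", "data-src": "/img/a.jpg"},
    {"src": "/img/b.jpg"},
    {"src": "/img/a.jpg"},
    {"alt": "no source at all"},
]
sources = [best_source(t) for t in tags]
urls = unique_in_order(s for s in sources if s)
print(urls)
# prints: ['/img/a.jpg', '/img/b.jpg']
```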

Wrapping Up

The data extraction process can be elegant and simple at the same time. Here's the full picture of how to scrape images from a website with Python:

  • Make HTTP requests
  • Find <img> tags
  • Grab src attributes
  • Download images as binary files

Believe it or not, this approach covers the bulk of everyday image-scraping tasks.

However, there can be challenges. Common ones include lazy loading of images, dynamic content, and bot detection. Selenium handles the first two well. Proxy rotation is the right solution for the third.

One final tip: always check robots.txt and the site's terms of service before scraping.
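The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses a sample file from a string so it runs offline; allowed_by_robots is a hypothetical helper and SAMPLE_ROBOTS is made-up content, not any real site's rules.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/images/a.jpg"))
# prints: True
print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/private/b.jpg"))
# prints: False
```

Against a live site, RobotFileParser can also download the file itself via set_url(...) followed by read(). The terms of service still need to be read by a person.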
