Max Klein
Deploying Python Scrapers on Linux VPS: Complete Setup Guide

Web scraping is a powerful tool for extracting data from websites, but deploying scrapers in a production environment requires careful planning and execution. Whether you're building a price comparison tool, gathering market insights, or automating data entry, ensuring your scraper runs reliably and efficiently is critical. A Linux Virtual Private Server (VPS) offers the perfect environment for hosting Python scrapers, providing scalability, security, and control.

In this tutorial, we’ll walk through the complete process of deploying a Python-based web scraper on a Linux VPS. From setting up the server to writing and scheduling the scraper, we cover every step with practical code examples and best practices, so that beginners and experienced developers alike come away with a robust, production-ready scraping solution.


Prerequisites

Before we dive into the deployment process, ensure you have the following:

Hardware and Software Requirements

  • A Linux VPS: Any major provider (e.g., DigitalOcean, Linode, AWS EC2) with a Debian/Ubuntu-based Linux distribution.
  • SSH access: To connect to your VPS and manage files.
  • Python 3.8+: Installed on the VPS.
  • Basic Linux terminal skills: Familiarity with commands like sudo, apt, and curl.

Tools and Libraries

  • Python libraries: requests, BeautifulSoup, lxml, and schedule (for scheduling).
  • Optional tools: virtualenv for isolated environments, gunicorn for serving a companion web app (such as a status dashboard), and nginx as a reverse proxy in front of it.

Tips

  • Use a dedicated VPS: Avoid sharing resources with other applications to prevent performance bottlenecks.
  • Choose a provider with good uptime: For mission-critical scrapers, reliability is essential.

Step 1: Setting Up Your Linux VPS

Once you’ve provisioned your VPS, connect to it via SSH:

ssh username@your_vps_ip

1.1 Update the System

Always start by updating your system packages to ensure security and compatibility:

sudo apt update && sudo apt upgrade -y

1.2 Install Python and Dependencies

Install Python 3 and pip if they’re not already installed:

sudo apt install python3 python3-pip -y

Install virtualenv for isolated environments (note: on newer Debian/Ubuntu releases, PEP 668 blocks system-wide pip installs; there you can use the built-in venv module via sudo apt install python3-venv instead):

sudo pip3 install virtualenv

1.3 Create a Project Directory

Organize your code in a dedicated directory:

mkdir ~/web_scraper
cd ~/web_scraper

Step 2: Writing Your Python Scraper

Now, let’s create a simple scraper using requests and BeautifulSoup.

2.1 Create a Virtual Environment

Use virtualenv to isolate dependencies:

virtualenv venv
source venv/bin/activate

2.2 Install Required Libraries

Install the necessary packages (lxml is included because the scraper passes it to BeautifulSoup as the parser):

pip install requests beautifulsoup4 lxml

2.3 Write the Scraper Code

Create a file named scraper.py with the following content:

import requests
from bs4 import BeautifulSoup
import time
import random

def scrape_website(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        # Example: Extract all article titles from a blog
        articles = soup.find_all("h2", class_="article-title")
        for idx, article in enumerate(articles):
            print(f"Article {idx+1}: {article.get_text(strip=True)}")
        return True
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return False

if __name__ == "__main__":
    urls = [
        "https://example-blog.com/page1",
        "https://example-blog.com/page2",
        "https://example-blog.com/page3"
    ]
    for url in urls:
        success = scrape_website(url)
        if success:
            print(f"Successfully scraped {url}")
        else:
            print(f"Failed to scrape {url}")
        time.sleep(random.uniform(1, 3))  # Respectful scraping delay
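The script above only prints titles to stdout. In production you will usually want to persist results; here is a minimal sketch using the standard csv module (the save_titles helper and its fieldnames are illustrative additions, not part of the original script):

```python
import csv

def save_titles(rows, path="titles.csv"):
    # rows: a list of dicts such as {"url": ..., "title": ...}
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(rows)

save_titles([
    {"url": "https://example-blog.com/page1", "title": "First post"},
    {"url": "https://example-blog.com/page1", "title": "Second post"},
])
```

You could call save_titles from scrape_website after collecting article.get_text(strip=True) for each match, or swap CSV for SQLite once data volumes grow.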

📌 Best Practice

Always include a User-Agent header and respect robots.txt to avoid being blocked by websites.
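The robots.txt check itself can be automated with the standard library's urllib.robotparser. A small sketch (the user agent string and rules below are placeholders); in production you would point RobotFileParser at the live robots.txt via set_url() and read() instead of passing the text in directly:

```python
from urllib.robotparser import RobotFileParser

def can_fetch(url, robots_txt, user_agent="MyScraperBot"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(can_fetch("https://example.com/blog", rules))       # True: not disallowed
print(can_fetch("https://example.com/private/x", rules))  # False: under /private/
```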


Step 3: Deploying the Scraper on the VPS

3.1 Save the Script

Ensure your scraper.py file is in the ~/web_scraper directory.

3.2 Run the Scraper Manually

Test the script (with the virtual environment activated):

python scraper.py

You should see output from the example blog URLs.

3.3 Automate Execution with schedule

Install the schedule library to run the scraper periodically:

pip install schedule

Update scraper.py to include scheduling:

import schedule
import time

def job():
    print("Running scraper job...")
    scrape_website("https://example-blog.com/page1")

# Schedule the job every 10 minutes
schedule.every(10).minutes.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)

⚠️ Warning

Avoid scraping too frequently. Use delays and respect website policies to prevent IP bans.
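A related defensive pattern is retrying failed requests with exponential backoff plus jitter, so a flaky site is never hammered in a tight loop. A hedged sketch (the fetch_with_backoff name and the retry/base values are arbitrary choices, not from the original script):

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=4, base=2.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds, sleeping base**attempt seconds
    (plus up to 1 s of jitter) between attempts."""
    for attempt in range(max_retries):
        try:
            result = fetch(url)
            if result is not None:
                return result
        except Exception:
            pass  # treat any error as a retryable failure
        sleep(base ** attempt + random.uniform(0, 1))
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```

With requests, fetch could be a thin wrapper such as lambda u: requests.get(u, headers=headers, timeout=10) that calls raise_for_status() so 5xx responses count as failures.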


Step 4: Running the Scraper as a Background Service

To ensure the scraper runs continuously, set it up as a systemd service.

4.1 Create a Systemd Service File

Create a new service file:

sudo nano /etc/systemd/system/web-scraper.service

Add the following configuration:

[Unit]
Description=Web Scraper Service
After=network.target

[Service]
User=your_username
WorkingDirectory=/home/your_username/web_scraper
ExecStart=/home/your_username/web_scraper/venv/bin/python /home/your_username/web_scraper/scraper.py
Restart=always
RestartSec=30
Environment=PATH=/home/your_username/web_scraper/venv/bin:/usr/bin:/bin

[Install]
WantedBy=multi-user.target

Replace your_username with your actual username.

4.2 Enable and Start the Service

sudo systemctl daemon-reload
sudo systemctl enable web-scraper.service
sudo systemctl start web-scraper.service

Check the status:

sudo systemctl status web-scraper.service

Step 5: Securing Your Scraper

5.1 Use Proxies to Avoid IP Blocking

Install fake-useragent to rotate user agents:

pip install fake-useragent

Update your scraper to rotate user agents and route requests through a proxy pool (the proxy URLs below are placeholders):

from fake_useragent import UserAgent
import random

import requests

ua = UserAgent()
headers = {
    "User-Agent": ua.random
}
proxies = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080"
]
# Use the same proxy for both schemes so HTTPS requests are covered too.
proxy = random.choice(proxies)
response = requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy}, timeout=10)

5.2 Set Up a Reverse Proxy with Nginx

If you expose a status dashboard or small API for your scraper (for example, a Flask app listening on port 5000), Nginx can sit in front of it as a reverse proxy. Note that a reverse proxy handles incoming traffic only; it does not hide the IP your scraper uses for outbound requests, so rely on the proxy pool above for that. Install Nginx:

sudo apt install nginx -y

Configure Nginx (e.g., /etc/nginx/sites-available/scraper):

server {
    listen 80;
    server_name your_vps_ip;

    location / {
        proxy_pass http://localhost:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Link the configuration, test it, and restart Nginx:

sudo ln -s /etc/nginx/sites-available/scraper /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx

Step 6: Monitoring and Logging

6.1 Configure Logging in the Scraper

Update scraper.py to log output to a file:

import logging

logging.basicConfig(filename='scraper.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
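On a service that runs indefinitely, scraper.log will grow without bound. The standard library's logging.handlers.RotatingFileHandler caps it; a sketch (the 1 MB limit and 3 backups are arbitrary values):

```python
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("scraper")
logger.setLevel(logging.INFO)

# Rotate scraper.log at roughly 1 MB, keeping scraper.log.1 ... scraper.log.3.
handler = RotatingFileHandler("scraper.log", maxBytes=1_000_000, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
logger.addHandler(handler)

logger.info("scraper started")
```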

6.2 Use journalctl for Systemd Logs

View logs for your service:

journalctl -u web-scraper.service -f

Step 7: Scaling Your Scraper

For large-scale scraping, consider:

7.1 Using concurrent.futures for Parallel Requests

Update your scraper to use threading:

from concurrent.futures import ThreadPoolExecutor

def scrape_urls(urls):
    with ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(scrape_website, urls)
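One caveat with executor.map: results are discarded here, so per-URL failures vanish silently. If you want each URL's result or error, submit/as_completed is more explicit. A sketch (scrape_all is a hypothetical wrapper; pass the scrape_website function defined earlier as scrape):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(urls, scrape, max_workers=5):
    """Run scrape(url) across a thread pool; return {url: result or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(scrape, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc  # keep the failure instead of losing it
    return results
```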

7.2 Deploying with Docker

Create a requirements.txt (run pip freeze > requirements.txt inside the virtual environment) and a file named Dockerfile for containerization:

FROM python:3.9-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "scraper.py"]

Build and run the container (the scraper doesn’t listen on a port, so no -p mapping is needed):

docker build -t web-scraper .
docker run -d --name web-scraper web-scraper

Conclusion

Deploying Python scrapers on a Linux VPS requires a combination of careful planning, secure coding practices, and robust infrastructure. By following this guide, you’ve set up a scalable, reliable scraping solution that can handle complex data extraction tasks. Whether you’re scraping a single website or scaling to thousands of URLs, the techniques covered here provide a solid foundation.

Remember to always respect website terms of service and legal boundaries. Scraping should be ethical and transparent: honor robots.txt, keep request rates low, and stop if a site operator objects.


Next Steps

Now that you’ve deployed your scraper, consider these advanced topics:

  • Automating backups: Use rsync or cloud storage to back up your logs and data.
  • Monitoring with Prometheus/Grafana: Track scraper performance and errors in real-time.
  • Handling CAPTCHA and anti-scraping tools: Explore libraries like selenium or playwright for dynamic content.
  • Migrating to cloud platforms: Use services like AWS Lambda or Google Cloud Functions for serverless scraping.

By continuously refining your setup, you’ll be well on your way to building a professional-grade web scraping pipeline.


Need professional web scraping done for you? Check out N3X1S INTELLIGENCE on Fiverr.
