Web Scraping Car Sites to Help Me Shop Better

Introduction

It is a terrible time to buy a used car. According to the car research website iseecars.com:

"[...] the average age of used cars sold increased from 4.8 years to 6.1 years, while the average price across all ages increased 33 percent, from $20,398 to $27,133."

I just graduated. I don't have a lot of money. I (currently) don't have a job to get more money.

These problems stack up. If my job is in-person or hybrid, I need a car to get there; a car that I literally don't have because I don't have a job. (The contradictions in my future plans are mind-boggling, I know.)

Confronted by all these problems, I chose the only rational answer: create a database on Redshift Serverless of used car entries from edmunds.com so I can research cars with ad-hoc SQL queries and make a pub/sub alert system for high-value entries.

I want to share what I've learned and how I implemented this project in this post, but if you want to learn more, check out the repo (give it a star just for me 😁). I especially suggest the readme tutorials and guides, since they're a pretty comprehensive list of what I had to learn as a beginner to make this project.

(But first...)

A Simple Table of Contents

  1. Design
    1. Local Scripts
    2. Loading Bucket
    3. Lambda
    4. Car DB
  2. Extensions
  3. Conclusion
  4. More Notes

Design

AWS Architecture

The AWS architecture so far relies on just a few components of AWS (all of this was implemented on the AWS free trial).

Design Contents

  1. Local Scripts
  2. Loading Bucket
  3. Lambda
  4. Car DB

1. Local Scripts

Extract and Transform

The Extract and Transform aspects are done in the file edmunds_scraper.py, which uses the packages bs4 (a.k.a. Beautiful Soup) for web scraping, boto3 for interacting programmatically with AWS, and requests for fetching data from Edmunds.

It scrapes the Edmunds inventory and generates a file called car_data.csv for a particular make and model of car.
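If you haven't used Beautiful Soup before, the core loop looks something like the sketch below. To be clear, the URL parameters, tag names, and CSS classes here are hypothetical placeholders (Edmunds' real markup is more involved and changes over time), but the requests → bs4 → csv pipeline has the same shape as the real script.

import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical sketch of the scrape loop. The URL, tag names, and CSS
# classes are placeholders, not Edmunds' actual markup.
url = "https://www.edmunds.com/inventory/srp.html?make=honda&model=civic"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for card in soup.find_all("div", class_="vehicle-card"):  # placeholder class
    rows.append({
        "vin": card.get("data-vin", ""),  # placeholder attribute
        "offer": card.find("span", class_="price").get_text(strip=True),
    })

with open("car_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["vin", "offer"])
    writer.writeheader()
    writer.writerows(rows)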

Fields in the CSV File

  • car_entry_id: A made-up field vaguely representing a unique ID
  • VIN: The Vehicle Identification Number described here. Theoretically unique, although there are probably data entry errors.
  • year: The year the car was made
  • make: The brand of the car
  • model
  • trim
  • miles: The number of miles the car's been driven.
  • offer: The dealership's offer as seen from the Edmunds site.
  • mpg_avg: The mpg assuming 55% city driving, 45% highway driving. (needs to be renamed, I know)
  • mpg_city
  • mpg_highway
  • driver_count: Number of owners of the vehicle.
  • accidents: The number of (recorded!) accidents for the vehicle.
  • usage_type: The way the car's been used (usually corporate vehicle, personal use only, or personal use)
  • city: City of the dealership
  • state: State of the dealership
  • dist_from_car: Distance of the car from my location (DFW area)
  • run_date: The date that the ETL job was performed.
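Inside the script, that list becomes the CSV header row. Here's a minimal sketch with the standard csv module (the real script's column order may differ slightly):

import csv

# The fields from the list above as a CSV header. The real script's
# column order may differ slightly.
FIELDNAMES = [
    "car_entry_id", "vin", "year", "make", "model", "trim", "miles",
    "offer", "mpg_avg", "mpg_city", "mpg_highway", "driver_count",
    "accidents", "usage_type", "city", "state", "dist_from_car", "run_date",
]

with open("car_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    # writer.writerows(scraped_rows) once the scrape is done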

Load (pt. 1)

The copyToS3.py file does exactly what it says on the tin. It's just a tiny script that uses my boto3 credentials to load the file into my S3 bucket.
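For the curious, the whole script amounts to something like this (the bucket and key match the COPY command in the Lambda script further down; the local filename is whatever the scraper produced):

import boto3

# Upload the scraped CSV to the loading bucket. Credentials come from the
# usual boto3 chain (environment variables, ~/.aws/credentials, etc.).
s3 = boto3.client("s3")
s3.upload_file("car_data.csv", "edmunds-cars", "load/car_entries.csv")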

2. Loading Bucket (S3)

At least as of right now, S3 seems like the easiest part of AWS. I just created the bucket, configured a few IAM permissions, and I was set.

3. Lambda

The Lambda is set to trigger once data is loaded into S3, performing a Redshift COPY command. From what I've read, this two-part load sequence (stage in S3, then COPY) is best practice for automatically loading data into Redshift.

The script for Lambda was mostly borrowed code, by the way. You can find it below if you want to take a look. I still need to make some edits to the script... as embarrassing as it is to say, right now the Lambda event parameter doesn't actually do anything; it's all hard-coded in.
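For what it's worth, fixing that is straightforward: an S3 PUT event carries the bucket and object key that fired the trigger, so the COPY source path can be derived instead of hard-coded. A sketch of what that would look like (not yet in the repo):

import urllib.parse

def lambda_handler(event, context):
    # The S3 event names the bucket and object that triggered this run,
    # so the COPY source path doesn't have to be hard-coded.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    copy_source = f"s3://{bucket}/{key}"
    # ...build the COPY statement with copy_source and proceed as below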

4. Car DB

There are two things that I learned from working with Redshift... the first is that it actually isn't that hard to copy data in through S3 and query it if you just use the console. So if you're just starting out, copy some data into your Redshift database and try out some querying! It's actually just fine for experimentation (although technically AWS Athena might be the better choice for experimental ad-hoc queries).

The second thing that I learned is that it is much harder to query programmatically. I had to borrow the Lambda Python script and tinker with IAM permissions for a couple of hours before it worked. Granted, I'm new to this, but still.

Extensions

This is where things get interesting. Having access to your own database of cars gives you statistics that these car database websites don't give you access to.

Here are a few examples of useful features I can build now that the data is loaded:

  • Create Alerts for Cars I Like: No passive scrolling through cars anymore. I can set a query I like to run every time new data comes into the database, and if any entries fit the query, a report gets emailed my way (see the sketch after this list).
  • Track Order Volume Over Time: Assuming that scrapes happen periodically, I can just make a periodic snapshot table for the models that I'm interested in.
  • Connect to Make, Model, Trim Databases: If I do additional webscraping for features in a car broken down all the way to the trim grain, I can query down to the exact features that I want. I can choose my own aesthetic (nice leather seats with black exterior) or core functionality (an engine with x amount of horsepower).
  • Connect to VIN Databases: The VIN (Vehicle Identification Number) is tied to a specific car and its entire history, and there are web databases keyed on it. So I can increase the value of my database by scraping those too for the cars I'm interested in.
  • Calculate Time On The Market: Dealers supposedly use high-pressure tactics like saying that the car might not stay on the market for long. With a database, I can study whether or not that's actually true! I can counter aggressive car salesmen by pointing to the evidence to the contrary for their own dealership!
  • Anything else I want to do, really. I think that's actually the takeaway, and a great transition to my conclusion.
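To make the first bullet concrete, here's a minimal sketch of the alert flow using the redshift-data API and SNS. The query, thresholds, and topic ARN are made up for illustration, and the sleep stands in for the proper polling loop you'll see in the Lambda script below.

import time

import boto3

# Sketch of the alert flow (illustrative values throughout): run a saved
# query, and publish a report to an SNS topic if anything matches.
rsd = boto3.client("redshift-data")
sns = boto3.client("sns")

QUERY = """
SELECT vin, year, make, model, offer, miles
FROM dev.public.cars
WHERE make = 'Honda' AND offer < 15000 AND accidents = 0;
"""

res = rsd.execute_statement(
    Database="dev", WorkgroupName="default-workgroup", Sql=QUERY)
time.sleep(5)  # crude; poll describe_statement until FINISHED in practice
result = rsd.get_statement_result(Id=res["Id"])

if result["Records"]:
    sns.publish(
        TopicArn="arn:aws:sns:us-east-2:123456789012:car-alerts",  # hypothetical
        Subject="New high-value car entries",
        Message="\n".join(str(r) for r in result["Records"]),
    )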

Conclusion

There's nothing wrong with doing car shopping the simpler way instead of turning to over-engineered solutions. But this way turned my dread about car shopping into a feeling of mastery. I'm actually looking forward to talking to that overbearing car salesman, so that I can show him all the cool charts that I made.

... Okay, maybe he won't appreciate the charts when he's trying to sell me a car. Either way, I hope you've enjoyed my little article on setting up a data scrape ETL into the cloud. If you did, take a look at my repo to see what's new about the project, and consider giving it a star. Especially when I'm just starting out, it really helps.

More Notes

GitHub Repo

Can be found here. Contribution guidelines aren't set, but get in touch with me if you're interested in working on it! My plan is to make it open source.

Lambda Script

Here's the code I used for Lambda:

import time
import traceback
import boto3
import logging
from collections import OrderedDict

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # input parameters passed from the caller event
    # Amazon Redshift Serverless Workgroupname
    redshift_workgroup_name="default-workgroup"
    # database name for the Amazon Redshift serverless instance
    redshift_database_name = "dev"
    # IAM Role of Amazon Redshift serverless instance having access to S3
    redshift_iam_role = "arn:aws:iam::751714364321:role/service-role/AmazonRedshift-CommandsAccessRole-20230825T095934"
    # run_type can be either asynchronous or synchronous; try tweaking based on your requirement
    run_type = "synchronous"

    sql_statements = OrderedDict()
    res = OrderedDict()

    if run_type not in ("synchronous", "asynchronous"):
        raise Exception(
            "Invalid Event run_type. \n run_type has to be synchronous or asynchronous.")

    isSynchronous = (run_type == "synchronous")

    # initiate redshift-data redshift_data_api_client in boto3
    redshift_data_api_client = boto3.client('redshift-data')

    sql_statements['CURRENT_USER'] = "select current_user;"
    sql_statements['COPY'] = "COPY dev.public.cars FROM 's3://edmunds-cars/load/car_entries.csv' " + "iam_role '" + redshift_iam_role + """' FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS 'us-east-2';"""
    logger.info("Running sql queries in {} mode!\n".format(run_type))

    try:
        for command, query in sql_statements.items():
            logger.info("Example of {} command:".format(command))
            res[command + " STATUS: "] = execute_sql_data_api(redshift_data_api_client, redshift_database_name, command, query,
                                                            redshift_workgroup_name, isSynchronous)

    except Exception as e:
        raise Exception(str(e) + "\n" + traceback.format_exc())
    return res


def execute_sql_data_api(redshift_data_api_client, redshift_database_name, command, query, redshift_workgroup_name, isSynchronous):

    MAX_WAIT_CYCLES = 20
    attempts = 0
    # Calling Redshift Data API with executeStatement()
    res = redshift_data_api_client.execute_statement(
        Database=redshift_database_name, WorkgroupName=redshift_workgroup_name, Sql=query)
    query_id = res["Id"]
    desc = redshift_data_api_client.describe_statement(Id=query_id)
    query_status = desc["Status"]
    logger.info(
        "Query status: {} .... for query-->{}".format(query_status, query))
    done = False

    # Wait until query is finished or max cycles limit has been reached.
    while not done and isSynchronous and attempts < MAX_WAIT_CYCLES:
        attempts += 1
        time.sleep(1)
        desc = redshift_data_api_client.describe_statement(Id=query_id)
        query_status = desc["Status"]

        if query_status == "FAILED":
            raise Exception('SQL query failed:' +
                            query_id + ": " + desc["Error"])

        elif query_status == "FINISHED":
            logger.info("query status is: {} for query id: {} and command: {}".format(
                query_status, query_id, command))
            done = True
            # print result if there is a result (typically from Select statement)
            if desc['HasResultSet']:
                response = redshift_data_api_client.get_statement_result(
                    Id=query_id)
                logger.info(
                    "Printing response of {} query --> {}".format(command, response['Records']))
        else:
            logger.info(
                "Current working... query status is: {} ".format(query_status))

    # Timeout Precaution
    if not done and isSynchronous and attempts >= MAX_WAIT_CYCLES:
        logger.info("Limit for MAX_WAIT_CYCLES has been reached before the query was able to finish. We have exited out of the while-loop. You may increase the limit accordingly. \n")
        raise Exception("query status is: {} for query id: {} and command: {}".format(
            query_status, query_id, command))

    return query_status
