<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: vcavanna</title>
    <description>The latest articles on DEV Community by vcavanna (@vcavanna).</description>
    <link>https://dev.to/vcavanna</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1126666%2Fff655539-3e44-4a37-b6a5-cf857264d7df.png</url>
      <title>DEV Community: vcavanna</title>
      <link>https://dev.to/vcavanna</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vcavanna"/>
    <language>en</language>
    <item>
      <title>Web Scraping Car Sites to Help Me Shop Better</title>
      <dc:creator>vcavanna</dc:creator>
      <pubDate>Wed, 27 Sep 2023 20:27:28 +0000</pubDate>
      <link>https://dev.to/vcavanna/web-scraping-car-sites-to-help-me-shop-better-pop</link>
      <guid>https://dev.to/vcavanna/web-scraping-car-sites-to-help-me-shop-better-pop</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;It is a terrible time to buy a used car. According to the car research website &lt;a href="https://www.iseecars.com/" rel="noopener noreferrer"&gt;iseecars.com&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"[...] the average age of used cars sold increased from 4.8 years to 6.1 years, while the average price across all ages increased 33 percent, from $20,398 to $27,133."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I just graduated. I don't have a lot of money. I (currently) don't have a job to get more money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;These problems stack up&lt;/strong&gt;. If my job is in-person or hybrid, I need a car to get there; a car that I literally don't have because I don't have a job. &lt;em&gt;(The contradictions in my future plans are mind-boggling, I know.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confronted by all these problems, I chose the only rational answer:&lt;/strong&gt; create a database on Redshift Serverless of used car entries from &lt;a href="https://edmunds.com" rel="noopener noreferrer"&gt;edmunds.com&lt;/a&gt; so I can research cars with ad-hoc SQL queries and make a pub/sub alert system for high-value entries.&lt;/p&gt;

&lt;p&gt;In this post I want to share what I've learned and how I implemented this project, but if you want to learn more, check out the &lt;a href="https://github.com/vcavanna/scrapers" rel="noopener noreferrer"&gt;repo&lt;/a&gt; (give it a star just for me 😁). I especially suggest the README's &lt;a href="https://github.com/vcavanna/scrapers/blob/master/README.md#tutorials-and-guides" rel="noopener noreferrer"&gt;tutorials and guides&lt;/a&gt;, since it's a pretty comprehensive list of what I had to learn as a beginner to make this project.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(But first...)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A Simple Table of Contents
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
Design

&lt;ol&gt;
&lt;li&gt;
Local Scripts

&lt;ul&gt;
&lt;li&gt;Extract and Transform&lt;/li&gt;
&lt;li&gt;Load (Part 1)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Loading Bucket&lt;/li&gt;
&lt;li&gt;Lambda&lt;/li&gt;
&lt;li&gt;Car DB&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Extensions&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;More Notes&lt;/li&gt;

&lt;/ol&gt;

&lt;h2&gt;
  
  
  Design
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy38rvybn4vie2x7zeffo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy38rvybn4vie2x7zeffo.png" alt="AWS Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AWS architecture, so far, relies on just a few components of AWS (all of it was implemented on the AWS free tier).&lt;/p&gt;

&lt;h3&gt;
  
  
  Design Contents
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
Local Scripts

&lt;ul&gt;
&lt;li&gt;Extract and Transform&lt;/li&gt;
&lt;li&gt;Load (Part 1)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Loading Bucket&lt;/li&gt;
&lt;li&gt;Lambda&lt;/li&gt;
&lt;li&gt;Car DB&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. Local Scripts
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Extract and Transform
&lt;/h4&gt;

&lt;p&gt;The Extract and Transform aspects are done in the file &lt;a href="https://github.com/vcavanna/scrapers/blob/master/edmunds_etl/edmunds_scraper.py" rel="noopener noreferrer"&gt;&lt;code&gt;edmunds_scraper.py&lt;/code&gt;&lt;/a&gt;, which uses the packages &lt;code&gt;bs4&lt;/code&gt; (a.k.a. &lt;a href="https://beautiful-soup-4.readthedocs.io/en/latest/#" rel="noopener noreferrer"&gt;&lt;code&gt;Beautiful Soup&lt;/code&gt;&lt;/a&gt;) for web scraping, &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html" rel="noopener noreferrer"&gt;&lt;code&gt;boto3&lt;/code&gt;&lt;/a&gt; for interacting programmatically with AWS, and &lt;a href="https://docs.python-requests.org/en/latest/index.html" rel="noopener noreferrer"&gt;&lt;code&gt;requests&lt;/code&gt;&lt;/a&gt; for fetching data from Edmunds.&lt;/p&gt;

&lt;p&gt;It scrapes the Edmunds inventory and generates a file called &lt;code&gt;car_data.csv&lt;/code&gt; for a particular make and model of car.&lt;/p&gt;
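Boiled down, the bs4 side of the scrape is selecting listing cards and pulling text out of each. Here's a minimal sketch; note that `li.vehicle-card` is an assumed, illustrative selector, not the real Edmunds markup that `edmunds_scraper.py` targets.

```python
# Hypothetical sketch of the parse step. In the real script, requests fetches
# the inventory page HTML first; "li.vehicle-card" is an assumed selector.
from bs4 import BeautifulSoup

def parse_listings(html: str) -> list:
    """Return one dict per listing card found on an inventory page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("li.vehicle-card"):
        rows.append({"title": card.get_text(" ", strip=True)})
    return rows
```

Each card would then be broken down further into the CSV fields listed below.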

&lt;p&gt;&lt;b&gt;Fields in the CSV File&lt;/b&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;car_entry_id&lt;/strong&gt;: A made-up surrogate field standing in for a unique ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VIN&lt;/strong&gt;: The Vehicle Identification Number. Theoretically unique, although there are probably data-entry errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;year&lt;/strong&gt;: The year the car was made&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;make&lt;/strong&gt;: The brand of the car&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;model&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;trim&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;miles&lt;/strong&gt;: The number of miles the car's been driven.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;offer&lt;/strong&gt;: The dealership's offer as seen from the Edmunds site.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mpg_avg&lt;/strong&gt;: The mpg assuming 55% city driving, 45% highway driving. (needs to be renamed, I know)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;mpg_city&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;mpg_highway&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;driver_count&lt;/strong&gt;: Number of owners of the vehicle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;accidents&lt;/strong&gt;: The number of (recorded!) accidents for the vehicle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;usage_type&lt;/strong&gt;: The way the car's been used (usually either corporate vehicle, personal use only, or personal use)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;city&lt;/strong&gt;: City of the dealership&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;state&lt;/strong&gt;: State of the dealership&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dist_from_car&lt;/strong&gt;: Distance of the car from my location (DFW area)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;run_date&lt;/strong&gt;: The date that the ETL job was performed.&lt;/li&gt;
&lt;/ul&gt;
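A minimal sketch of how rows with these fields could be flushed to `car_data.csv` using the stdlib `csv` module (the actual repo code may differ; the field order just mirrors the list above):

```python
import csv

# Field order mirrors the field list above.
CSV_FIELDS = [
    "car_entry_id", "VIN", "year", "make", "model", "trim", "miles",
    "offer", "mpg_avg", "mpg_city", "mpg_highway", "driver_count",
    "accidents", "usage_type", "city", "state", "dist_from_car", "run_date",
]

def write_car_data(entries, path="car_data.csv"):
    """Write one CSV row per scraped car entry; missing fields stay blank."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_FIELDS)
        writer.writeheader()
        writer.writerows(entries)
```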

&lt;h4&gt;
  
  
  Load (pt. 1)
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://github.com/vcavanna/scrapers/blob/master/edmunds_etl/copyToS3.py" rel="noopener noreferrer"&gt;&lt;code&gt;copyToS3.py&lt;/code&gt;&lt;/a&gt; file does exactly what it says on the tin. It's just a tiny script to use my &lt;code&gt;boto3&lt;/code&gt; auth and load the file into my S3 bucket.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. S3
&lt;/h3&gt;

&lt;p&gt;At least right now, S3 seems like the easiest part of AWS. I just created the bucket, configured a few IAM permissions, and I was set.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Lambda
&lt;/h3&gt;

&lt;p&gt;The Lambda is set to trigger once data is loaded into S3, performing a Redshift &lt;code&gt;COPY&lt;/code&gt; command. From &lt;a href="https://stackoverflow.com/questions/37480621/how-can-i-copy-data-from-amazon-s3-to-redshift-automatically/37481851#37481851" rel="noopener noreferrer"&gt;what I've read&lt;/a&gt;, this two-part load sequence is best practice for automatically copying from S3 into Redshift.&lt;/p&gt;

&lt;p&gt;The script for Lambda was mostly borrowed code, by the way. You can find it below if you want to take a look. I still need to make some edits to the script... as embarrassing as it is to say, right now the Lambda &lt;code&gt;event&lt;/code&gt; parameter doesn't actually do anything; it's all hard-coded in.&lt;/p&gt;
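For what it's worth, wiring up the `event` parameter would mostly mean reading the bucket and key out of the S3 `ObjectCreated` notification payload instead of hard-coding them. A sketch:

```python
def s3_object_from_event(event):
    """Pull (bucket, key) from an S3 ObjectCreated notification event."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return bucket, key
```

The returned pair could then be interpolated into the `COPY` statement in place of the hard-coded S3 path.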

&lt;h3&gt;
  
  
  4. Car DB
&lt;/h3&gt;

&lt;p&gt;There are two things that I learned from working with Redshift... the first is that &lt;strong&gt;it actually isn't that hard to &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html" rel="noopener noreferrer"&gt;copy data in through S3&lt;/a&gt; and query it if you just use the console&lt;/strong&gt;. So if you're just starting out, copy some data into your Redshift database and try out some querying! It's actually just fine for experimentation (although technically &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/what-is.html" rel="noopener noreferrer"&gt;AWS Athena&lt;/a&gt; might be the better choice for experimental ad-hoc queries).&lt;/p&gt;

&lt;p&gt;The second thing that I learned is that &lt;strong&gt;it is much harder to query programmatically&lt;/strong&gt;. I had to borrow the Lambda python script and tinker with IAM permissions for a couple hours before it worked. Granted, I'm new to this, but still.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extensions
&lt;/h2&gt;

&lt;p&gt;This is where things get interesting. Having access to your own database of cars gives you statistics that these car database websites don't give you access to.&lt;/p&gt;

&lt;p&gt;Here are a few examples of useful features I can build now that the data is loaded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create Alerts for Cars I Like&lt;/strong&gt;: No more passive scrolling through cars. I can schedule a query I like to run every time new data comes into the database, and if any entries fit the query, a report gets emailed my way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track Order Volume Over Time&lt;/strong&gt;: Assuming that scrapes happen periodically, I can just make a periodic snapshot table for the models that I'm interested in. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect to Make, Model, Trim Databases&lt;/strong&gt;: If I do additional web scraping for features in a car broken down all the way to the trim grain, I can query down to the exact features that I want. I can choose my own &lt;strong&gt;aesthetic&lt;/strong&gt; &lt;em&gt;(nice leather seats with black exterior)&lt;/em&gt; or &lt;strong&gt;core functionality&lt;/strong&gt; &lt;em&gt;(an engine with x amount of horsepower).&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect to VIN Databases&lt;/strong&gt;: The VIN (Vehicle Identification Number) is a number associated with a specific car and tied to its entire history. There are web databases keyed on the VIN as well, so I can increase the value of my database by scraping those records for the cars I'm interested in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculate Time On The Market&lt;/strong&gt;: Dealers supposedly use high-pressure tactics like saying that the car might not stay on the market for long. With a database, I can study whether or not that's actually true! &lt;em&gt;&lt;strong&gt;I can counter aggressive car salesmen by pointing to the evidence to the contrary for their own dealership!&lt;/strong&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anything else I want to do, really.&lt;/strong&gt; I think that's actually the takeaway, and a great transition to my conclusion.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There's nothing wrong with doing car shopping the simpler way instead of turning to over-engineered solutions. &lt;em&gt;&lt;strong&gt;But this way turned my dread about car shopping into a feeling of mastery.&lt;/strong&gt;&lt;/em&gt; I'm actually looking forward to talking to that overbearing car salesman, so that I can show him all the cool charts that I made.&lt;/p&gt;

&lt;p&gt;... Okay, maybe he won't appreciate the charts when he's trying to sell me a car. Either way, I hope you've enjoyed my little article on setting up a data scrape ETL into the cloud. If you did, take a look at my repo to see what's new about the project, and consider giving it a star. Especially when I'm just starting out, it really helps.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Notes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GitHub Repo
&lt;/h3&gt;

&lt;p&gt;Can be found &lt;a href="https://github.com/vcavanna/scrapers" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Contribution guidelines aren't set, but get in touch with me if you're interested in working on it! My plan is to make it open source.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lambda Script
&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;Here's the code I used for Lambda&lt;/b&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import traceback
import boto3
import logging
from collections import OrderedDict

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # input parameters passed from the caller event
    # Amazon Redshift Serverless Workgroupname
    redshift_workgroup_name="default-workgroup"
    # database name for the Amazon Redshift serverless instance
    redshift_database_name = "dev"
    # IAM Role of the Amazon Redshift serverless instance having access to S3
    redshift_iam_role = "arn:aws:iam::751714364321:role/service-role/AmazonRedshift-CommandsAccessRole-20230825T095934"
    # run_type can be either asynchronous or synchronous; try tweaking based on your requirement
    run_type = "synchronous"

    sql_statements = OrderedDict()
    res = OrderedDict()

    if run_type != "synchronous" and run_type != "asynchronous":
        raise Exception(
            "Invalid Event run_type. \n run_type has to be synchronous or asynchronous.")

    isSynchronous = run_type == "synchronous"

    # initiate redshift-data redshift_data_api_client in boto3
    redshift_data_api_client = boto3.client('redshift-data')

    sql_statements['CURRENT_USER'] = "select current_user;"
    sql_statements['COPY'] = "COPY dev.public.cars FROM 's3://edmunds-cars/load/car_entries.csv' " + "iam_role '" + redshift_iam_role + """' FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS 'us-east-2';"""
    logger.info("Running sql queries in {} mode!\n".format(run_type))

    try:
        for command, query in sql_statements.items():
            logging.info("Example of {} command :".format(command))
            res[command + " STATUS: "] = execute_sql_data_api(redshift_data_api_client, redshift_database_name, command, query,
                                                            redshift_workgroup_name, isSynchronous)

    except Exception as e:
        raise Exception(str(e) + "\n" + traceback.format_exc())
    return res


def execute_sql_data_api(redshift_data_api_client, redshift_database_name, command, query, redshift_workgroup_name, isSynchronous):

    MAX_WAIT_CYCLES = 20
    attempts = 0
    # Calling Redshift Data API with executeStatement()
    res = redshift_data_api_client.execute_statement(
        Database=redshift_database_name, WorkgroupName=redshift_workgroup_name, Sql=query)
    query_id = res["Id"]
    desc = redshift_data_api_client.describe_statement(Id=query_id)
    query_status = desc["Status"]
    logger.info(
        "Query status: {} .... for query--&amp;gt;{}".format(query_status, query))
    done = False

    # Wait until query is finished or max cycles limit has been reached.
    while not done and isSynchronous and attempts &amp;lt; MAX_WAIT_CYCLES:
        attempts += 1
        time.sleep(1)
        desc = redshift_data_api_client.describe_statement(Id=query_id)
        query_status = desc["Status"]

        if query_status == "FAILED":
            raise Exception('SQL query failed:' +
                            query_id + ": " + desc["Error"])

        elif query_status == "FINISHED":
            logger.info("query status is: {} for query id: {} and command: {}".format(
                query_status, query_id, command))
            done = True
            # print result if there is a result (typically from Select statement)
            if desc['HasResultSet']:
                response = redshift_data_api_client.get_statement_result(
                    Id=query_id)
                logger.info(
                    "Printing response of {} query --&amp;gt; {}".format(command, response['Records']))
        else:
            logger.info(
                "Current working... query status is: {} ".format(query_status))

    # Timeout Precaution
    if not done and attempts &amp;gt;= MAX_WAIT_CYCLES and isSynchronous:
        logger.info("Limit for MAX_WAIT_CYCLES has been reached before the query was able to finish. We have exited out of the while-loop. You may increase the limit accordingly. \n")
        raise Exception("query status is: {} for query id: {} and command: {}".format(
            query_status, query_id, command))

    return query_status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>aws</category>
      <category>database</category>
    </item>
    <item>
      <title>Interactions Tracker, Part 3: Why I stopped and Lessons Learned</title>
      <dc:creator>vcavanna</dc:creator>
      <pubDate>Mon, 25 Sep 2023 14:07:28 +0000</pubDate>
      <link>https://dev.to/vcavanna/interactions-tracker-part-3-why-i-stopped-and-lessons-learned-eik</link>
      <guid>https://dev.to/vcavanna/interactions-tracker-part-3-why-i-stopped-and-lessons-learned-eik</guid>
      <description>&lt;p&gt;I had taken a break from writing posts to assess which projects I really wanted to work on and do some more research into the field of data engineering. Unfortunately, working with the Interactions Tracker didn't really make the cut. This article explains why and shows which directions I'm looking to for project inspiration now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topics for Future Posts
&lt;/h2&gt;

&lt;p&gt;As you've seen, it's been a while since I've posted. I'm still planning to contribute posts on dev.to, and the break from writing posts and from this project has been enriching. I have 3 posts planned, each of which I'll write and publish within the next 3 weeks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Project Design: Data-Informed Car Purchases&lt;/strong&gt;: scraping data off the Edmunds car website to answer which car would be the best buy. Unlike the interactions project, the Edmunds data-scraping project has a proof-of-concept ETL demo already working, with AWS infrastructure up and running (Redshift, S3, Lambda, IAM, and IAM Identity Center). I'll show how I implemented this entire setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Resource Review: Kaggle&lt;/strong&gt;: I break down how all of the resources on Kaggle can be used for data engineering, data science, and data analytics projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Book Review: &lt;em&gt;How to Read a Book&lt;/em&gt;&lt;/strong&gt;: While it certainly was written for a broader audience than developers, &lt;em&gt;How to Read a Book&lt;/em&gt; offers a solution to the problems of imposter syndrome and an ever-expanding knowledge base. I show how the principles in the book have been applied to my computer science projects.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you can see, I still have plans to contribute interesting articles. I know this section seems strange, but keep in touch! Subscribe!&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Stopped
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Accessing the Data Source
&lt;/h3&gt;

&lt;p&gt;Every ETL job starts with &lt;strong&gt;E&lt;/strong&gt;xtraction. Every extraction needs a source to extract from. Sources hidden by passwords aren't publicly facing. Sources that have personally identifiable information (PII) are trusted only to particular entities. I am not one of those entities that could access that data.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Value-to-Work Ratio
&lt;/h3&gt;

&lt;p&gt;The actual value that an interactions tracker adds is fairly minimal: assuming I only made the Minimum Viable Product, all it does is add the ability to check which students aren't being connected to on campus.&lt;/p&gt;

&lt;p&gt;There was another project of significantly larger scope that I considered: developing an event-planning tool that would publish events, track timelines, and enable collaboration with University officials, with calendar and mail add-ons. It would add value to the degree that it integrated with other University services (ideally, a poster, an email, and a push notification would all go out when you pressed publish, and photos associated with events could be attached to event attendance records to make developing yearbooks much easier). But at least at the moment, working on this would take up more time than I can allow. To the backburner it goes!&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The details of modeling data
&lt;/h3&gt;

&lt;p&gt;I learned that when representing the real world through data, I need to think at the lowest possible grain of detail. One problem I encountered in the dataset was mixing up the grains: I had an "event" grain for interactions as grouped by event, and an "interactions" grain one level below that. As you might have seen in the last article, that only confused the issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make simple, complete projects to start
&lt;/h3&gt;

&lt;p&gt;Every project has to start with a bare bones implementation. But in order to make those projects worthwhile, you have to introduce value early on. That's what keeps the motivation going. A project like my database modeling for student interactions is an interesting thought experiment, but &lt;em&gt;does not immediately deliver value&lt;/em&gt;. That's why my next project is associated with something I'll need to do anyway: research for buying a car.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't be afraid to start
&lt;/h3&gt;

&lt;p&gt;Even though I didn't complete the project, I got some insight into data modeling that I would not otherwise have. None of this learning would have happened if I had simply read books about the subject; I actually needed to jump in and try to articulate how this database would work. As long as I limit the scope, the learning alone makes any of these projects worthwhile.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope any other aspiring data engineers can take a look at my mistakes and avoid them, or at the very least get started. I had sat around waiting for the ideal project to fall into my lap. That strategy hadn't really worked. When you're someone like me, searching for jobs without great success, the winning strategy is to &lt;em&gt;experiment&lt;/em&gt;. Just beginning helps. It's only after I started tinkering with this (admittedly poor) project idea that I began to develop other ideas. So get started! I wish you all the best on your development journey.&lt;/p&gt;

</description>
      <category>interactionstracker</category>
      <category>redshift</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Interactions Tracker, Part 2: Revising the Data Model</title>
      <dc:creator>vcavanna</dc:creator>
      <pubDate>Fri, 25 Aug 2023 14:32:40 +0000</pubDate>
      <link>https://dev.to/vcavanna/interactions-tracker-part-2-revising-the-data-model-4nef</link>
      <guid>https://dev.to/vcavanna/interactions-tracker-part-2-revising-the-data-model-4nef</guid>
      <description>&lt;p&gt;For my next article on the interactions tracker, I had planned to show how the model that I had created could be implemented. Unfortunately enough, the data model I made isn't ready for a RDBMS. That's okay! This mistake gives the opportunity to go over the exact sticking point, learn both my thought process and the RDBMS process, and how I can implement this in a way that better adheres to warehousing principles drawn from Kimball.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what exactly was the mistake that I made in my data model?
&lt;/h2&gt;

&lt;p&gt;Well, let's re-evaluate the diagram that I had produced earlier:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dZ5_52d---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qojb2xyuzc9woxt5v0m7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dZ5_52d---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qojb2xyuzc9woxt5v0m7.png" alt="Image description" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This diagram has two fact tables at two different levels of detail or grain.&lt;/strong&gt; So we're not really practicing dimensional modeling.&lt;/p&gt;

&lt;p&gt;This became apparent as I began making the SQL tables for this dimensional model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3
conn = sqlite3.connect("rez_life.db")
event_columns = [
    "event_key INTEGER PRIMARY KEY",
    "staff_key INTEGER",
    "student_keys ??????", # What do I do here??
    "conversation_type VARCHAR",
    ...
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;student_keys&lt;/code&gt; can't be represented as a single element of data. This is important because &lt;code&gt;student_keys&lt;/code&gt; is meant to be a foreign key to the &lt;code&gt;dim_member&lt;/code&gt; table. One of my business questions was "Who is attending each event?" To answer it, I need a query like &lt;code&gt;SELECT student_keys FROM fct_events&lt;/code&gt; to get the list of students.&lt;/p&gt;

&lt;h2&gt;
  
  
  Events should not be a fact table
&lt;/h2&gt;

&lt;p&gt;Of course, it's &lt;em&gt;hypothetically&lt;/em&gt; possible to store a list of names in a single column like I wanted in my first attempt at data modeling, and keep the "event" level of grain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt; residentKeyList = ["Jeffrey", "Zack", "Cassandra"]
&amp;gt;&amp;gt; residentKeysAsString = ""
&amp;gt;&amp;gt; for key in residentKeyList:
        residentKeysAsString += key + "_"

&amp;gt;&amp;gt; print(residentKeysAsString)
Jeffrey_Zack_Cassandra_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now I have it all together, like I wanted. The issue with this is twofold: 1) now I have to create a custom parser, which is prone to breaking, and 2) in choosing a higher-level grain, I'm imposing limits on what I can do with this data.&lt;/p&gt;

&lt;p&gt;For the second issue, let's suppose I don't just want a list of residents that attended the event. &lt;strong&gt;Suppose I only want the residents that live in one particular dormitory&lt;/strong&gt;! This makes sense from a residence life standpoint: while it's great to offer events to non-residents, you want to make sure that events mostly bring in students that you're asked to serve.&lt;/p&gt;

&lt;p&gt;All of a sudden, our handy concatenated string of residents is close to useless! We would need to get the list of students through SQL, parse it in Python, and then check each student in Python for whether they reside in the dormitory of choice. Already I sense the data analysts shaking their fists!&lt;/p&gt;

&lt;h2&gt;
  
  
  Remodeling to make events a dimension table is the answer
&lt;/h2&gt;

&lt;p&gt;Wouldn't it be so much better if we could answer this question in a single query? For instance:&lt;br&gt;
&lt;code&gt;SELECT dim_members.resident_name FROM fct_interactions INNER JOIN dim_members ON fct_interactions.member_key = dim_members.member_key WHERE fct_interactions.event_key = 'Chili Cookout' AND dim_members.residence = 'Clark Hall'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This gets the same data as the above complicated process simply by joining two tables together and filtering the results. Keeping things at the lowest possible grain is standard practice for Kimball, and it's a mistake of mine to try something else.&lt;/p&gt;

&lt;p&gt;Luckily, the remodel is very simple:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hZHgJE8L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lfngfirataa3tp3b3948.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hZHgJE8L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lfngfirataa3tp3b3948.png" alt="Image description" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simply making the events table a dimension table rather than a separate fact table keeps the grain at one level, meaning the queries I mentioned above can be run fairly easily.&lt;/p&gt;
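Continuing the sqlite3 sketch from earlier, the remodel works out to roughly the tables below (column names partly assumed from the diagram), and the attendance question becomes a plain two-join query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_members (
    member_key INTEGER PRIMARY KEY,
    resident_name TEXT,
    residence TEXT
);
CREATE TABLE dim_events (
    event_key INTEGER PRIMARY KEY,
    event_name TEXT
);
-- One row per interaction: the lowest grain.
CREATE TABLE fct_interactions (
    interaction_key INTEGER PRIMARY KEY,
    member_key INTEGER REFERENCES dim_members (member_key),
    event_key INTEGER REFERENCES dim_events (event_key)
);
""")

# "Who from Clark Hall attended the Chili Cookout?" in one query.
ATTENDANCE_SQL = """
SELECT m.resident_name
FROM fct_interactions f
JOIN dim_members m ON f.member_key = m.member_key
JOIN dim_events  e ON f.event_key  = e.event_key
WHERE e.event_name = ? AND m.residence = ?;
"""
```

No string parsing, no Python-side filtering: the join plus the WHERE clause do all the work.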

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Unless you want angry analysts, keep the data to one element per cell. Doing otherwise defeats the point of an RDBMS. Seems like a clear point in retrospect.&lt;/li&gt;
&lt;li&gt;The events table will be treated as a dimension table at the lowest grain level from now on. We'll see more about how this impacts the design in the next post, when I implement the SQL database.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>interactionstracker</category>
    </item>
    <item>
      <title>Part 1: Tracking Interactions - Initial Design and Data Modeling</title>
      <dc:creator>vcavanna</dc:creator>
      <pubDate>Mon, 21 Aug 2023 16:16:27 +0000</pubDate>
      <link>https://dev.to/vcavanna/part-1-tracking-interactions-initial-design-and-data-modeling-563k</link>
      <guid>https://dev.to/vcavanna/part-1-tracking-interactions-initial-design-and-data-modeling-563k</guid>
      <description>&lt;p&gt;I'm making a project to track interactions. Follow my progress in these articles as I work towards a fully operational relation tracking site, SQL database, and REST API.&lt;/p&gt;

&lt;p&gt;This first article introduces the tech stack and models the data based off of the Four-Step Dimensional Model Process in Kimball's Book: &lt;a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/"&gt;Data Warehouse Toolkit&lt;/a&gt;, which I will be referencing periodically later.&lt;/p&gt;

&lt;h1&gt;
  
  
  Four-Step Dimensional Model Process
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1) Select the business process
&lt;/h2&gt;

&lt;p&gt;Community leaders are often challenged to ensure everyone is included and feels like their needs are met. While for small groups a leader just has to remember who they talked to, if for some ungodly reason you have groups of 20+ people and multiple people leading the community, you cannot easily ensure that everyone is being seen. To make things more complicated, some interactions are one-on-one, and some are within a group context. So I'm building a community tracker to make sure that leaders have access to data on whether each person in the community has interacted with the leaders.&lt;/p&gt;

&lt;p&gt;Normally this step would involve talking with the people who operate the business to understand their work and data needs; since I'm experienced in this business myself, I can skip that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Business Process:&lt;/strong&gt; Track the process of interacting with community members in conversations and community events.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business Needs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Provide a portal for data entry&lt;/strong&gt; for individual interactions and event attendance between staff/community leaders and members.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide a backend database&lt;/strong&gt; to answer the below business questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide a REST API&lt;/strong&gt; so other developers can reference and extract data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Business Questions
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Need answered
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Who is not regularly interacting with staff and community leaders?&lt;/li&gt;
&lt;li&gt;Who has had mainly negative interactions with the staff and community leaders?&lt;/li&gt;
&lt;/ul&gt;
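&lt;p&gt;As a sketch of how the first question could eventually be answered in SQLite (table and column names here are placeholders, not the final schema):&lt;/p&gt;

```python
import sqlite3

# Sketch only: table and column names are assumptions based on the
# fact/dimension design in this post, not a finished schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member (member_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE interaction_fact (
    member_key INTEGER REFERENCES member(member_key),
    staff_key INTEGER,
    date_key TEXT
);
INSERT INTO member VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO interaction_fact VALUES (1, 10, '2023-08-01');
""")

# "Who is not regularly interacting?" -- simplified here to members
# with no recorded interactions at all.
rows = conn.execute("""
SELECT m.name
FROM member m
LEFT JOIN interaction_fact f ON f.member_key = m.member_key
WHERE f.member_key IS NULL
""").fetchall()
print(rows)  # [('Grace',)]
```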

&lt;h4&gt;
  
  
  Nice to have answered
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;What types of interactions are there?&lt;/li&gt;
&lt;li&gt;What events are associated with a particular group leader, and which members are associated with those events?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2) Declare the Grain
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The business process outlined above actually implies two kinds of transactions, so there will be one row per transaction at the interaction level and one row per transaction at the event level.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grain 1: One row per interaction between a student and staff&lt;br&gt;
Grain 2: One row per event held by staff&lt;/p&gt;

&lt;h2&gt;
  
  
  3) Identify the dimensions
&lt;/h2&gt;

&lt;p&gt;In order to map out dimension tables, we answer the who, what, where, when, why, and how questions.&lt;/p&gt;

&lt;p&gt;Who?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The leader&lt;/li&gt;
&lt;li&gt;The student&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The event in which the interaction occurred (grain 1)&lt;/li&gt;
&lt;li&gt;The type of conversation / interaction (more on this later)&lt;/li&gt;
&lt;li&gt;Whether it was a recurring event (grain 2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The location of the event / interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The date of the event / interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why? How?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;N/A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, though, we have to recognize an issue with this kind of data: it isn't generated by a machine, it has to be entered by a staff member. Contrast that with machine-generated data such as retail purchases or Salesforce records.&lt;/p&gt;

&lt;p&gt;Since this is the case, it's best to limit the amount of user input to the minimum viable product, at least at first. The business doesn't benefit from a product that its staff doesn't use.&lt;/p&gt;

&lt;h2&gt;
  
  
  4) Identify the facts
&lt;/h2&gt;

&lt;p&gt;Interactions Fact Table&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Student Key&lt;/li&gt;
&lt;li&gt;Staff Key&lt;/li&gt;
&lt;li&gt;Conversation Type&lt;/li&gt;
&lt;li&gt;Event Key&lt;/li&gt;
&lt;li&gt;Location Key&lt;/li&gt;
&lt;li&gt;Date Key&lt;/li&gt;
&lt;li&gt;Negative or Positive Interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Events Fact Table&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staff Key&lt;/li&gt;
&lt;li&gt;Attendance (Student Keys)&lt;/li&gt;
&lt;li&gt;Event Key&lt;/li&gt;
&lt;li&gt;Date Key&lt;/li&gt;
&lt;li&gt;Location Key&lt;/li&gt;
&lt;li&gt;Negative or Positive Interaction&lt;/li&gt;
&lt;li&gt;Conversation Type&lt;/li&gt;
&lt;li&gt;Recurring Event? (Y/N)&lt;/li&gt;
&lt;/ul&gt;
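&lt;p&gt;As a rough SQLite sketch of these two fact tables (column names are illustrative, and the multi-valued "Attendance (Student Keys)" is pulled out into a separate attendance table, since a list of keys can't live in a single column):&lt;/p&gt;

```python
import sqlite3

# Sketch of the two fact tables from this post as SQLite DDL.
# Column names are illustrative, not final.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE interaction_fact (
    student_key INTEGER,
    staff_key INTEGER,
    conversation_type TEXT,
    event_key INTEGER,
    location_key INTEGER,
    date_key TEXT,
    is_positive INTEGER  -- 1 positive, 0 negative
);

CREATE TABLE event_fact (
    event_key INTEGER,
    staff_key INTEGER,
    date_key TEXT,
    location_key INTEGER,
    is_positive INTEGER,
    conversation_type TEXT,
    is_recurring INTEGER  -- Y/N as 1/0
);

-- Attendance: one row per (event, student) pair.
CREATE TABLE event_attendance (
    event_key INTEGER,
    student_key INTEGER
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```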

&lt;p&gt;Now that these questions have been answered, we can create a sample diagram of how these fact and dimension tables relate:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dZ5_52d---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qojb2xyuzc9woxt5v0m7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dZ5_52d---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qojb2xyuzc9woxt5v0m7.png" alt="Image description" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, I'll just quickly outline the tech stack for this project:&lt;/p&gt;

&lt;h1&gt;
  
  
  Tech Stack
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Python language&lt;/li&gt;
&lt;li&gt;Flask framework&lt;/li&gt;
&lt;li&gt;SQLite database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These components, already used in the earlier tutorial, should be all I need to make this project happen.&lt;/p&gt;
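&lt;p&gt;A minimal sketch of how these pieces would fit together, assuming a placeholder /api/members route and a throwaway in-memory table (not the real schema):&lt;/p&gt;

```python
# Minimal sketch of the planned stack: Flask serving data out of
# SQLite. Route and table names are placeholders, not the final API.
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

def get_members():
    # Throwaway in-memory database; the real project will use a file.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE member (name TEXT)")
    conn.execute("INSERT INTO member VALUES ('Ada')")
    return [row[0] for row in conn.execute("SELECT name FROM member")]

@app.route("/api/members")
def members():
    return jsonify(get_members())
```

&lt;p&gt;Run it with &lt;code&gt;flask run&lt;/code&gt; and hit the route in a browser or Postman.&lt;/p&gt;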

&lt;p&gt;That's all for now. I'll be back with updates as the project progresses!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Making a Python SQL Database with a Flask Web API</title>
      <dc:creator>vcavanna</dc:creator>
      <pubDate>Mon, 14 Aug 2023 00:51:26 +0000</pubDate>
      <link>https://dev.to/vcavanna/making-a-python-sql-database-with-a-flask-web-api-4a</link>
      <guid>https://dev.to/vcavanna/making-a-python-sql-database-with-a-flask-web-api-4a</guid>
<description>&lt;p&gt;This post documents my project to get familiar with Python, SQL databases, and Flask by building a Flask API. My goal is to follow and implement tutorials, gradually working up to my first project.&lt;/p&gt;

&lt;p&gt;The first tutorial that I'm following for this project is &lt;a href="https://realpython.com/flask-connexion-rest-api/"&gt;realpython's flask and connexion tutorial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm going to update this post with any takeaways, difficulties, etc., as well as post the end results.&lt;/p&gt;

&lt;p&gt;Challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dependency conflicts!! Despite following the tutorial's steps, the connexion dependency didn't work with the flask-sqlalchemy package installed. Updating the connexion dependency resolved the issue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Swagger UI, which connexion serves alongside the Flask app, lets you document your Flask API according to the OpenAPI standard.&lt;/li&gt;
&lt;li&gt;One thing I wish the tutorial had was a way to easily test the API as I was developing it. So I added Postman following &lt;a href="https://apidog.com/blog/how-to-import-swagger-into-postman/"&gt;this tutorial&lt;/a&gt;, which was very helpful.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4oKMy-7n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hwo3cqdgdmkqgl7qdbtb.png" alt="Image description" width="800" height="450"&gt;
&lt;/li&gt;
&lt;li&gt;It would definitely be worth researching the SQLAlchemy library! I'm new to SQL database management, so this seems like a good place to start.&lt;/li&gt;
&lt;/ul&gt;
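&lt;p&gt;As a first taste of SQLAlchemy (the model and field names below are made up for illustration):&lt;/p&gt;

```python
# Define a table as a Python class, create it in an in-memory SQLite
# database, then insert and query a row. Names are illustrative.
from sqlalchemy import create_engine, Integer, String, Column
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Person(Base):
    __tablename__ = "person"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")  # in-memory SQLite
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Person(name="Ada"))
    session.commit()
    names = [p.name for p in session.query(Person).all()]
print(names)  # ['Ada']
```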

</description>
    </item>
  </channel>
</rss>
