DEV Community

Taylor Higgins


Step Two: Process the data with Python and QGIS

Now that I had a general plan for the web app, I wanted to collect and clean all of my data.

The first data set I used was open data from the government of Tuscany, specifically 2011 Census data for the city of Florence. For each census tract, it gave me the number of renters and homeowners, the overall resident population, and the total number of housing units and bedrooms.

The second data set I used was scraped Airbnb data from the non-profit Inside Airbnb, a research group dedicated to showing the real impact of Airbnb on communities. This data, though not fully open, follows the "A" (Accessibility) principle of FAIR data by being "as open as possible but as closed as necessary".

It was important to line up the census data geographically, by coordinates, with the Airbnb listing data in order to understand how saturated a particular census tract or neighborhood was.

The questions I wanted to answer were:

  • What percentage of the total number of residential units and available bedrooms are listed on Airbnb?

  • Compared to the total number of residents living in a particular geographic unit (census tract, street or neighborhood), how many Airbnb guests are there in a given month or year?
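Both questions come down to simple ratios once the census and listing figures sit side by side for the same tract. The sketch below is only an illustration of that arithmetic: the dictionary keys and the numbers in `example` are made up for the demo and are not column names or values from either dataset.

```python
def saturation(tract):
    """Return Airbnb saturation ratios for one census tract.

    `tract` is a dict with hypothetical keys: housing_units, bedrooms and
    residents (census side), plus listings, listed_bedrooms and
    guests_per_month (Airbnb side).
    """
    def share(part, whole):
        # Guard against tracts with no housing or residents recorded.
        return round(100 * part / whole, 1) if whole else None

    return {
        "pct_units_listed": share(tract["listings"], tract["housing_units"]),
        "pct_bedrooms_listed": share(tract["listed_bedrooms"], tract["bedrooms"]),
        "guests_per_100_residents": share(tract["guests_per_month"], tract["residents"]),
    }


# Invented numbers, for illustration only.
example = {
    "housing_units": 400, "bedrooms": 900, "residents": 850,
    "listings": 50, "listed_bedrooms": 90, "guests_per_month": 170,
}
print(saturation(example))
```

Returning `None` for an empty denominator keeps tracts without recorded housing or residents from raising a division error; whether to drop or flag such tracts is a separate choice.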

I attempted to use geopandas to join the two datasets by coordinates, and had success converting both to the same coordinate reference system (CRS), but ultimately found it faster to join the datasets in QGIS and then export the result as a shapefile to continue processing in Python. I was able to use geopandas to read the resulting joined shapefile in the next step, where I wrote the main functions that fueled the map and stats for the web app.
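For readers who want to try the geopandas route, here is a minimal sketch of a point-in-polygon spatial join. It is not the workflow used for the app (that join was done in QGIS): the two toy squares stand in for census tracts, the three points stand in for listings, and every column name is hypothetical. The `predicate` argument assumes a reasonably recent geopandas (older releases called it `op`).

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two toy census tracts (unit squares) standing in for the real polygons.
tracts = gpd.GeoDataFrame(
    {"tract_id": ["A", "B"], "housing_units": [400, 250]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

# Three toy listings as points.
listings = gpd.GeoDataFrame(
    {"listing_id": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(0.2, 0.8), Point(1.5, 0.5)],
    crs="EPSG:4326",
)

# Make sure both layers share a CRS before joining
# (a no-op here, but essential when the sources differ).
listings = listings.to_crs(tracts.crs)

# Attach to each listing the attributes of the tract it falls within.
joined = gpd.sjoin(listings, tracts, how="left", predicate="within")

# Count listings per tract.
counts = joined.groupby("tract_id").size()
print(counts)
```

With real data, the two `GeoDataFrame` constructors would be replaced by `gpd.read_file(...)` calls on the census and listing files, and the same `read_file` call loads a shapefile exported from QGIS.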

I did this after removing unneeded columns from both datasets, in order to save on storage space and reduce processing time.

I also created new variables that would help me when doing analysis and mapping in later steps by converting qualitative, user-created values into quantitative binary values.

For example, to more easily group the listings by a qualitative data point, like where the host was based, I converted the user inputs (e.g. Tuscany, Firenze, Toscana, Italy, USA, Egypt, Nebraska, etc.) into a yes/no variable for whether the host was based in or out of the city.
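A conversion like this can be sketched as a small text-matching function. The version below is an assumption about how such a flag might be built, not the exact rule used for the app: it treats a host as city-based only when the free-text location mentions "Firenze" or "Florence", so region-level entries such as "Tuscany" count as outside the city.

```python
def host_in_florence(host_location):
    """Return 1 if the free-text host location mentions Florence, else 0."""
    if not host_location:
        # Missing values are treated as "not in the city" (an assumption).
        return 0
    text = host_location.casefold()
    return int(any(name in text for name in ("firenze", "florence")))


for value in ["Firenze, Toscana, Italy", "Tuscany, Italy", "Nebraska", None]:
    print(repr(value), "->", host_in_florence(value))
```

Simple substring matching will also flag other places named Florence, so a real pipeline would likely need a check on the country or region as well.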

After this sort of processing, I was able to ask questions of the dataset more easily. In total I created 27 new variables, not all of which I ended up being able to use, but I hope to use them all in the future.

new_variables = [
    "days_rented_ltm", "rounded_revenue_ltm", "occupancy_rate_approx",
    "is_hotel", "is_entire", "many_listings", "only_1_listing", "only_2_listings",
    "host_florence", "has_liscense", "is_instant_bookable",
    "dist_duomo", "buffer_zone",
    "is_centro", "is_gavinana", "is_isolotto", "is_rifredi", "is_campo",
    "listing_revenue_exceed_LTR",
    "effected_by_policy_1", "effected_by_policy_2", "effected_by_policy_3",
    "commercial", "very_likely_commercial",
    "tourist_tax", "unpaid_tourist_tax", "geom",
]
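As an illustration of how one of these variables could be derived, the sketch below computes a distance to the Duomo with the haversine formula. This is an assumption about what `dist_duomo` represents, not the method used for the app, and the cathedral coordinates are approximate values supplied for the example.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate latitude/longitude of Florence Cathedral (assumed for this sketch).
DUOMO = (43.7731, 11.2560)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def dist_duomo(lat, lon):
    """Distance in kilometres from a listing to the Duomo."""
    return haversine_km(lat, lon, *DUOMO)


# A point a few hundred metres south-west of the cathedral.
print(round(dist_duomo(43.7680, 11.2531), 2))
```

A distance column like this could then feed a thresholded flag such as a buffer zone around the center, though the cut-off would be a separate modelling decision.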

An important caveat that should've been accounted for in more depth is that the census data came from 2011 and the Airbnb data came from 2021. I look
