<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Taylor Higgins</title>
    <description>The latest articles on DEV Community by Taylor Higgins (@thiggimajig).</description>
    <link>https://dev.to/thiggimajig</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F960022%2Faa03799d-a485-4847-affc-11526c3e0353.jpeg</url>
      <title>DEV Community: Taylor Higgins</title>
      <link>https://dev.to/thiggimajig</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thiggimajig"/>
    <language>en</language>
    <item>
      <title>Analyzing Italy's Vulnerability to Sea Level After Applying Feedback</title>
      <dc:creator>Taylor Higgins</dc:creator>
      <pubDate>Wed, 08 Feb 2023 12:18:33 +0000</pubDate>
      <link>https://dev.to/thiggimajig/analyzing-italys-vulnerability-to-sea-level-after-applying-feedback-4nb8</link>
      <guid>https://dev.to/thiggimajig/analyzing-italys-vulnerability-to-sea-level-after-applying-feedback-4nb8</guid>
      <description>&lt;p&gt;This is a continuation of a previous post, but where we will start off at the analysis and visualisation part. For some context, after presenting the maps to the Geographic Society of Italy, a couple suggestions and improvements were given, so this is a result of applying that feedback to make the analysis and visualisation more useful to the academic research commnunity. &lt;/p&gt;

&lt;h2&gt;Recap&lt;/h2&gt;

&lt;p&gt;First we load in the files from the data processing step. This includes: the elevation data with max, min and median elevations by hexagon from the zonal stats; pivot files of sea level rise by hydrobasin for each of the four projections (4.5 2050, 4.5 2100, 8.5 2050, 8.5 2100); the shapefile with the geometric data corresponding to the hexagons we used for the zonal stats; and the country border of Italy to use as a buffer. &lt;/p&gt;

&lt;p&gt;Then we make sure they are cleaned the way we want by dropping and renaming some columns, and finally we merge the first three files together so we have one geodataframe containing the elevation data (max, min, med), the eslr data by hydrobasin, and the geometric data of each hexagon. We will later use the country border to clip that merged file to remove isolates, so we leave that file separate. &lt;/p&gt;
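The merge step can be sketched with plain pandas. The frames and column names below are hypothetical stand-ins for the three processed files, keyed on a shared hexagon id:

```python
import pandas as pd

# Hypothetical stand-ins for the three processed files, keyed by hexagon id
elev = pd.DataFrame({"hex_id": [1, 2], "elev_min": [0.1, 2.0],
                     "elev_med": [0.5, 2.5], "elev_max": [1.0, 3.0]})
eslr = pd.DataFrame({"hex_id": [1, 2], "eslr_45_2050_5": [0.3, 0.3]})
geoms = pd.DataFrame({"hex_id": [1, 2], "geometry": ["POLYGON A", "POLYGON B"]})

# Chain two inner merges so each row carries elevation, eslr and geometry data
merged = elev.merge(eslr, on="hex_id").merge(geoms, on="hex_id")
```

With geopandas the geometry frame would be a GeoDataFrame and the result stays one, but the join logic is identical.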

&lt;h2&gt;Analysis and Visualisation&lt;/h2&gt;

&lt;p&gt;Applying the feedback from the presentation means that we will end up with more maps, but they will be more focused, and some of the data will be abstracted to make them easier to interpret. &lt;/p&gt;

&lt;p&gt;For each of the four projections (4.5 2050, 4.5 2100, 8.5 2050, 8.5 2100), we will make three maps based on the degree of severity of each projection (i.e. the 5th, 50th and 95th percentiles). &lt;br&gt;
In each of those maps there will be three color-coded layers that you can toggle on and off, corresponding to the degree of vulnerability (red being the most severe at level 3, orange the second most severe at level 2, and green the least severe at level 1). &lt;/p&gt;

&lt;p&gt;We arrive at those 36 layers by doing the 9 subtractions below for each of the 4 projections; as an example, shown below are the 9 subtractions for projection 4.5 2050:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;4.5 2050 5% min&lt;/th&gt;
&lt;th&gt;4.5 2050 5% med&lt;/th&gt;
&lt;th&gt;4.5 2050 5% max&lt;/th&gt;
&lt;th&gt;4.5 2050 50% min&lt;/th&gt;
&lt;th&gt;4.5 2050 50% med&lt;/th&gt;
&lt;th&gt;4.5 2050 50% max&lt;/th&gt;
&lt;th&gt;4.5 2050 95% min&lt;/th&gt;
&lt;th&gt;4.5 2050 95% med&lt;/th&gt;
&lt;th&gt;4.5 2050 95% max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;elev min - eslr projection point 5%&lt;/td&gt;
&lt;td&gt;elev med - eslr projection point 5%&lt;/td&gt;
&lt;td&gt;elev max - eslr projection point 5%&lt;/td&gt;
&lt;td&gt;elev min - eslr projection point 50%&lt;/td&gt;
&lt;td&gt;elev med - eslr projection point 50%&lt;/td&gt;
&lt;td&gt;elev max - eslr projection point 50%&lt;/td&gt;
&lt;td&gt;elev min - eslr projection point 95%&lt;/td&gt;
&lt;td&gt;elev med - eslr projection point 95%&lt;/td&gt;
&lt;td&gt;elev max - eslr projection point 95%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
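In code, those nine columns reduce to subtracting each percentile's eslr projection point from each elevation statistic. A sketch for 4.5 2050, with made-up eslr values, using the smi/sme/sma naming pattern that appears in the filters below:

```python
import pandas as pd

elev_subs = pd.DataFrame({"elev_min": [-0.2, 1.0], "elev_med": [0.1, 1.5],
                          "elev_max": [0.4, 2.0]})
# Hypothetical eslr projection points for 4.5 2050 at the 5th/50th/95th percentiles
eslr_points = {"5": 0.10, "50": 0.25, "95": 0.40}

# Nine subtraction columns: smi/sme/sma (min/med/max elevation) x three percentiles
for pct, rise in eslr_points.items():
    elev_subs[f"smi4550{pct}"] = elev_subs["elev_min"] - rise
    elev_subs[f"sme4550{pct}"] = elev_subs["elev_med"] - rise
    elev_subs[f"sma4550{pct}"] = elev_subs["elev_max"] - rise
```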

&lt;p&gt;And then, based on the following logic, we arrive at three levels of severity for each projection's percentile:&lt;/p&gt;

&lt;p&gt;RED 4.5 2050 5%: Elevation hexagons where all 5% subtractions (max, med, min) are below zero&lt;br&gt;
&lt;code&gt;v45_50_5_red = elev_subs[(elev_subs.smi45505 &amp;lt;= 0.0) &amp;amp; (elev_subs.sme45505 &amp;lt;= 0.0) &amp;amp; (elev_subs.sma45505 &amp;lt;= 0.0)]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;ORANGE 4.5 2050 5%: Elevation hexagons where both med and min 5% subtractions (med, min) are below zero.&lt;br&gt;
&lt;code&gt;v45_50_5_orange = elev_subs[(elev_subs.smi45505 &amp;lt;= 0.0) &amp;amp; (elev_subs.sme45505 &amp;lt;= 0.0)]&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;GREEN 4.5 2050 5%: Elevation hexagons where only the min 5% subtraction (min) is below zero.&lt;br&gt;
&lt;code&gt;v45_50_5_green = elev_subs[(elev_subs.smi45505 &amp;lt;= 0.0)]&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;RED 4.5 2050 50%: Elevation hexagons where all 50% subtractions (max, med, min) are below zero&lt;/p&gt;

&lt;p&gt;&lt;code&gt;v45_50_50_red = elev_subs[(elev_subs.smi455050 &amp;lt;= 0.0) &amp;amp; (elev_subs.sme455050 &amp;lt;= 0.0) &amp;amp; (elev_subs.sma455050 &amp;lt;= 0.0)]&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;ORANGE 4.5 2050 50%: Elevation hexagons where both med and min 50% subtractions (med, min) are below zero.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;v45_50_50_orange = elev_subs[(elev_subs.smi455050 &amp;lt;= 0.0) &amp;amp; (elev_subs.sme455050 &amp;lt;= 0.0)]&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;GREEN 4.5 2050 50%: Elevation hexagons where only the min 50% subtraction (min) is below zero.&lt;br&gt;
&lt;code&gt;v45_50_50_green = elev_subs[(elev_subs.smi455050 &amp;lt;= 0.0)]&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;RED 4.5 2050 95%: Elevation hexagons where all 95% subtractions (max, med, min) are below zero&lt;br&gt;
&lt;code&gt;v45_50_95_red = elev_subs[(elev_subs.smi455095 &amp;lt;= 0.0) &amp;amp; (elev_subs.sme455095 &amp;lt;= 0.0) &amp;amp; (elev_subs.sma455095 &amp;lt;= 0.0)]&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;ORANGE 4.5 2050 95%: Elevation hexagons where both med and min 95% subtractions (med, min) are below zero.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;v45_50_95_orange = elev_subs[(elev_subs.smi455095 &amp;lt;= 0.0) &amp;amp; (elev_subs.sme455095 &amp;lt;= 0.0)]&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;GREEN 4.5 2050 95%: Elevation hexagons where only the min 95% subtraction (min) is below zero.&lt;br&gt;
&lt;code&gt;v45_50_95_green = elev_subs[(elev_subs.smi455095 &amp;lt;= 0.0)]&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;
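Because the three levels are nested by construction (red requires all three subtractions at or below zero, orange the min and med, green only the min), the twelve sets of filters can share one helper instead of being written out by hand. A minimal sketch over the same column-naming scheme, using pandas' le/all to express the masks:

```python
import pandas as pd

# Toy subtraction columns for one projection percentile (4.5 2050, 5th percentile)
elev_subs = pd.DataFrame({"smi45505": [-1.0, -0.5, -0.1, 0.2],
                          "sme45505": [-0.8, -0.2, 0.3, 0.5],
                          "sma45505": [-0.4, 0.1, 0.6, 0.9]})

def severity_layers(df, mi, me, ma):
    """Return the red/orange/green row subsets for one projection percentile."""
    red = df[df[[mi, me, ma]].le(0.0).all(axis=1)]  # all three at or below zero
    orange = df[df[[mi, me]].le(0.0).all(axis=1)]   # min and med at or below zero
    green = df[df[mi].le(0.0)]                      # only min at or below zero
    return red, orange, green

red, orange, green = severity_layers(elev_subs, "smi45505", "sme45505", "sma45505")
```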

&lt;p&gt;Finally we visualise this data with folium's choropleth function, and remove the legend by accessing the map's child object and deleting all children whose keys start with 'color_map'. Each of the 12 maps will have 3 layers corresponding to red, orange or green severity, and each map will only have data for one projection and percentile (i.e. 4.5 2050 5%, 8.5 2100 50%, and so on).&lt;/p&gt;
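For the legend removal: folium registers the choropleth's colour scale on the map's _children mapping under keys beginning with 'color_map', so the cleanup is a dictionary sweep. The sketch below runs against a plain dict standing in for m._children, so it works without folium; on a real map you would pass m._children instead:

```python
def strip_color_map_legends(children):
    """Delete every child whose key starts with 'color_map' (the choropleth legends)."""
    for key in [k for k in children if k.startswith("color_map")]:
        del children[key]
    return children

# Stand-in for a folium map's m._children ordered dict (hypothetical keys)
children = {"choropleth_1": "layer", "color_map_ab12": "legend", "layer_control": "widget"}
strip_color_map_legends(children)
```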

&lt;h2&gt;Problems I Encountered and How I Solved Them&lt;/h2&gt;

&lt;p&gt;Each map has isolates: hexagons that are not surrounded by other hexagons. There was a bit of a debate amongst the team on how to resolve them, since they could be relevant given how the river network works in Italy. Ultimately, we decided to clip them with a buffer (of varying size) from the coastline, tailored for each projection percentile so as to only remove the most drastic isolates. &lt;/p&gt;

&lt;p&gt;The most time-consuming way would have been to select each isolate in QGIS, export that selection as a csv, and then remove the polygons by id from the geodataframe. &lt;/p&gt;

&lt;p&gt;Since we had 36 geodataframes, that was not worth the time. An option that would be exciting to explore once more engineering resources are devoted to this project is to write a script that iterates through each geodataframe's rows of polygons and uses the geometry column to examine whether each one is near another row's geometry. Some functions that could be useful here are geopandas' touches and distance functions, or a nearest-neighbour search (coming from shapely): remove any polygon that doesn't touch another polygon, or that is more than 600 meters (two hexagon widths) away from another polygon, or whose nearest neighbour is more than a certain distance away. &lt;/p&gt;
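That nearest-neighbour variant can be prototyped on hexagon centroids with the standard library alone, before reaching for geopandas' touches or distance functions: drop any hexagon whose nearest neighbour is farther than a cutoff, 600 meters here (two hexagon widths). The centroid coordinates below are made up:

```python
import math

def drop_isolates(centroids, cutoff_m=600.0):
    """Keep only points whose nearest neighbour lies within cutoff_m metres."""
    kept = []
    for i, a in enumerate(centroids):
        nearest = min((math.dist(a, b) for j, b in enumerate(centroids) if j != i),
                      default=math.inf)
        if nearest > cutoff_m:
            continue  # an isolate: nothing within two hexagon widths
        kept.append(a)
    return kept

coast_cluster = [(0.0, 0.0), (300.0, 0.0), (300.0, 300.0)]  # hypothetical centroids, metres
kept = drop_isolates(coast_cluster + [(5000.0, 5000.0)])    # the last point is an isolate
```

This is quadratic per geodataframe; with 36 of them, something like geopandas' sjoin_nearest or a spatial index would be the production route.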

&lt;p&gt;In the interest of time, and since the first approach of merging all the hexagons and then filtering out (with the overlay function, using how='intersection') any hexagons that didn't intersect the coastline took more time than we had planned for, I decided to hold off on exploring this more accurate option and instead used a buffer to clip the hexagons in a rather rudimentary way, by trial and error. &lt;/p&gt;

&lt;p&gt;The reason the first approach took more time than I allotted was ultimately that I didn't initially understand what shapely's unary_union function would output. I hoped it would output groupings of polygons, so that we ended up with a core group of polygons around the coast, but instead it merged all polygons, even those that were not touching, into one multipolygon. Part of that geometry was of course intersecting the coastline, so overlaying by an intersection didn't produce a useful result. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>ETL with scraped Airbnb Data</title>
      <dc:creator>Taylor Higgins</dc:creator>
      <pubDate>Wed, 11 Jan 2023 13:51:22 +0000</pubDate>
      <link>https://dev.to/thiggimajig/etl-with-scraped-airbnb-data-5g29</link>
      <guid>https://dev.to/thiggimajig/etl-with-scraped-airbnb-data-5g29</guid>
      <description>&lt;p&gt;&lt;code&gt;import geojson&lt;br&gt;
import geopandas&lt;br&gt;
import pandas as pd&lt;br&gt;
import numpy as np&lt;br&gt;
import statistics as st&lt;br&gt;
import matplotlib as plt&lt;br&gt;
import folium&lt;br&gt;
from folium.features import CustomIcon&lt;br&gt;
import csv&lt;br&gt;
from IPython.display import display, HTML&lt;br&gt;
from pathlib import Path &lt;br&gt;
import os&lt;br&gt;
from datetime import date, time, datetime&lt;br&gt;
from dateutil import tz #for dealing with timezone&lt;br&gt;
from geopy import distance&lt;br&gt;
from collections import defaultdict&lt;br&gt;
from django.contrib.gis.geos import GEOSGeometry, Point&lt;br&gt;
import math&lt;br&gt;
from django.contrib.gis.db import models #GeoDjango models module (shadows django.db.models, so only this import is needed)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.) EXTRACT Download the raw csv file from Inside Airbnb and load them into a dataframe&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def load_files(ia):&lt;br&gt;
    #load and create dataframes&lt;br&gt;
    raw_df_02_22 = pd.read_csv(ia)&lt;br&gt;
    print(raw_df_02_22.head())&lt;br&gt;
    #write to file so we can work on it in next function&lt;br&gt;
    raw_df_02_22.to_csv("../suwa_maps/airbnb_jupyter/raw_file.csv")&lt;br&gt;
    return(raw_df_02_22)&lt;br&gt;
raw_df_02_22 = load_files("../suwa_maps/airbnb_jupyter/listings_feb_2022.csv")&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Step 2.) TRANSFORM Run the processing script on the raw file dataframe to clean and create new variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;def clean_file(raw_df, raw_file):&lt;br&gt;
    #clean file&lt;br&gt;
    #drop columns&lt;br&gt;
    raw_df.drop(columns=["scrape_id", "bathrooms_text","host_verifications", "host_neighbourhood","property_type", "amenities", "host_verifications", "neighbourhood", "host_listings_count", "picture_url","host_url", "description", "neighborhood_overview", "host_about", "host_response_time", "host_response_rate", 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_has_profile_pic', 'host_identity_verified', 'calendar_updated', 'has_availability', 'calendar_last_scraped', 'first_review', 'last_review', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'neighbourhood_group_cleansed', 'beds', 'maximum_nights_avg_ntm', 'bathrooms', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms'], inplace=True)&lt;br&gt;
    raw_df.to_csv("../suwa_maps/airbnb_jupyter/clean_file.csv")&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#adjust price column
raw_df['price'] = raw_df['price'].str.replace("$", "", regex=False).str.replace(',', '', regex=False)
raw_df['price'] = pd.to_numeric(raw_df['price'])


cleaned_df = raw_df
# print(cleaned_df.head)

cleaned_df.to_csv("../suwa_maps/airbnb_jupyter/cleaned_file.csv")
return(cleaned_df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;cleaned_df_02_22 = clean_file(raw_df_02_22, "../suwa_maps/airbnb_jupyter/raw_file.csv")&lt;/code&gt;&lt;/p&gt;
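One detail in the price cleanup above is worth pinning down: in older pandas versions, Series.str.replace treats its pattern as a regular expression, where a bare dollar sign is an end-of-string anchor and strips nothing. Passing regex=False makes the literal replacement explicit on any version. A standalone sketch:

```python
import pandas as pd

prices = pd.Series(["$1,250.00", "$80.00"])
# Strip the literal dollar sign and thousands separator, then convert to floats
cleaned = (prices.str.replace("$", "", regex=False)
                 .str.replace(",", "", regex=False))
numeric = pd.to_numeric(cleaned)
```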

&lt;p&gt;def new_variables(cleaned_file_input, file_output, cleaned_dataframe): &lt;br&gt;
    #convert categorical to numeric&lt;br&gt;
    all = []&lt;br&gt;
    with open(cleaned_file_input, 'r') as file:&lt;br&gt;
        with open(file_output, 'w', encoding="utf-8") as newfile:&lt;br&gt;
            csvwriter = csv.writer(newfile,lineterminator='\n')&lt;br&gt;
            csvreader = csv.reader(file)&lt;br&gt;
            #extract headers&lt;br&gt;
            header = next(csvreader)&lt;br&gt;
            new_headers = ['days_rented_ltm', 'rounded_revenue_ltm', 'occupancy_rate_approx', 'is_hotel', 'is_entire', 'many_listings', 'only_1_listing', 'only_2_listings', 'host_florence', 'has_liscense', 'is_instant_bookable', 'dist_duomo', 'buffer_zone', 'is_centro', 'is_gavinana', 'is_isolotto', 'is_rifredi', 'is_campo', 'listing_revenue_exceed_LTR', 'effected_by_policy_1', 'effected_by_policy_2', 'effected_by_policy_3', 'commercial', 'very_likely_commercial', 'tourist_tax', 'unpaid_tourist_tax', 'geom']&lt;br&gt;
            for item in new_headers:&lt;br&gt;
                header.append(item)&lt;br&gt;
            all.append(header)&lt;br&gt;
            for row in csvreader:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;            min_night_avg = float(row[17]) #minimum_nights_avg_ntm 
            name_of_listing = row[4]
            #filter out long term and inactive listings; len(row[24]) &amp;gt; 0 says reviews in the last 30 days are not null, but how about if reviews for the last year are not 0?
            if int(row[23]) &amp;gt; 0 and min_night_avg &amp;lt; 29:
                listing_row_id = int(row[0])
                listing_id = int(row[1])

                #year and month of data scraped
                date_scraped = row[3]
                year = date_scraped.split('-')[0]
                month = date_scraped.split('-')[1]
                day = date_scraped.split('-')[2]


                #dist_duomo lat, long 
                duomo_coords = (43.7731, 11.2560)

                #listing_coords = lat, long
                listing_coords = (row[11], row[12]) # ia_df_clean['latitude'], ia_df_clean['longitude']

                ##dist_duomo = duomo_coords - listing_coords
                dist_duomo = (distance.distance(duomo_coords, listing_coords).m)

                #buffer zone circle 5
                # buffer_zone = 0
                if dist_duomo &amp;lt;= 1500:
                    buffer_zone = 1
                elif dist_duomo &amp;gt; 1500 and dist_duomo &amp;lt;= 3000:
                    buffer_zone = 2
                elif dist_duomo &amp;gt; 3000 and dist_duomo &amp;lt;= 4500:
                    buffer_zone = 3
                elif dist_duomo &amp;gt; 4500 and dist_duomo &amp;lt;= 6000:
                    buffer_zone = 4
                elif dist_duomo &amp;gt; 6000 and dist_duomo &amp;lt;= 7500:
                    buffer_zone = 5

                # geom = ((row[11], row[12]), srid=4326)
                # pnt = GEOSGeometry('POINT(5 23)') # WKT 
                # gedjango is long, lat 
                coords_wkt = (round(float(row[12]),4) , round(float(row[11]),4) )
                coords_wkt = Point(coords_wkt)
                geom = coords_wkt

                #policy considerations
                #has_liscense
                if len(row[25]) &amp;gt; 0: 

                    has_liscense = 1
                else:
                    has_liscense = 0

                #is_instant_bookable
                if row[26] == 't': 
                    is_instant_bookable = 1
                else:
                    is_instant_bookable = 0

                #is in florence host
                if "Florence" in row[8] or "Firenze" in row[8]: #row[8] ia_df_clean['host_location']
                    host_florence = 1
                else:
                    host_florence = 0
                    #print(host_florence)



                #to calculate rent burden from statistica averages, refine
                centro_neighborhood_rent = 1238.70 * 12
                campo_neighborhood_rent = 1078.00 * 12
                rif_neighborhood_rent = 954.80 * 12
                iso_neighborhood_rent = 939.40 * 12
                gav_neighborhood_rent = 1070.30 * 12


                #which neighborhoods
                #centro
                if "Centro" in row[10]: #row[10] ia_df_clean['neighbourhood_cleansed']
                    is_centro = 1
                else:
                    is_centro = 0

                #campo
                if "Campo" in row[10]:
                    is_campo = 1
                else:
                    is_campo = 0

                #rifredi
                if "Rifredi" in row[10]:
                    is_rifredi = 1
                else:
                    is_rifredi = 0

                #isolotto
                if "Isolotto" in row[10]:
                    is_isolotto = 1
                else:
                    is_isolotto = 0


                #gavinana
                if "Gavinana" in row[10]:
                    is_gavinana = 1
                else:
                    is_gavinana = 0

                #occupancy rate from SF model
                #did reviews a year divided by how likely someone gives a review (airbnb data) * avg stay in florence (2.6)
                review_rate = .50
                avg_length = 3.2 


                if len(row[24]) &amp;gt; 0: #if number of reviews in last month more than 0 then it's active or there was an error
                    # print(row[23])
                    num_reviews = float(row[23]) #number_of_reviews_ltm
                    #test here for long term
                    if min_night_avg &amp;gt; 29:
                        print("min_night_avg &amp;gt; 29: is {} on row {}".format(min_night_avg,listing_row_id))
                    elif min_night_avg &amp;gt; avg_length:
                        avg_days = min_night_avg
                    else:
                        avg_days = avg_length
                    num_bookings = review_rate * num_reviews
                    days_rented_ltm = num_bookings * avg_days
                    #test for days

                    #adjusting to make occupancy rate 70 to be conservative according to Murray's suggestions
                    if days_rented_ltm &amp;gt; 255:
                        #print("days_rented_ltm &amp;gt; 365: is {} on row {}".format(days_rented_ltm,listing_row_id))
                        days_rented_ltm = 255
                    tourist_tax = (days_rented_ltm * 3) *2 #3 euro is the lower end (a 3 star hotel is 3.80 euro), and 2 persons is conservative
                    occupancy_rate_approx = (days_rented_ltm/365) * 100

                    if occupancy_rate_approx &amp;gt; 70.00:
                        print("occupancy_rate_approx &amp;gt; 70: is {} on row {}".format(occupancy_rate_approx,listing_row_id))
                        occupancy_rate_approx = 70.00

                    #revenue #have to strip the string for price then turn into float
                    price_stripped_dollar = row[16].strip('$') #price
                    price_replaced_comma = price_stripped_dollar.replace(',','')
                    new_price = float(price_replaced_comma)
                    if new_price &amp;lt; 1.0:
                        print("new_price is basically zero: is {} on row {}".format(new_price,listing_row_id))


                    #yearly revenue
                    revenue_ltm = days_rented_ltm * new_price
                    rounded_revenue_ltm = round(revenue_ltm, 2)
                    if rounded_revenue_ltm &amp;gt;0.0 and rounded_revenue_ltm &amp;lt; new_price:
                        print("rounded_revenue_ltm &amp;lt; {}: and is {} on row {}".format(new_price, rounded_revenue_ltm,listing_row_id))
                    # print(rounded_revenue_ltm)

                else:
                    days_rented_ltm = 'n/a'
                    rounded_revenue_ltm = 'n/a'
                    # listing_revenue_exceed_LTR = 'n/a'

                if is_centro == 1:
                    if rounded_revenue_ltm == 'n/a':
                        listing_revenue_exceed_LTR = 'n/a'
                    elif rounded_revenue_ltm &amp;gt; centro_neighborhood_rent:
                        listing_revenue_exceed_LTR = 1
                    else:
                        listing_revenue_exceed_LTR = 0
                elif is_gavinana == 1:
                    if rounded_revenue_ltm == 'n/a':
                        listing_revenue_exceed_LTR = 'n/a'
                    elif rounded_revenue_ltm &amp;gt; gav_neighborhood_rent:
                        listing_revenue_exceed_LTR = 1
                    else:
                        listing_revenue_exceed_LTR = 0
                elif is_isolotto == 1:
                    if rounded_revenue_ltm == 'n/a':
                        listing_revenue_exceed_LTR = 'n/a'
                    elif rounded_revenue_ltm &amp;gt; iso_neighborhood_rent:
                        listing_revenue_exceed_LTR = 1
                    else:
                        listing_revenue_exceed_LTR = 0
                elif is_rifredi ==1:
                    if rounded_revenue_ltm == 'n/a':
                        listing_revenue_exceed_LTR = 'n/a'
                    elif rounded_revenue_ltm &amp;gt; rif_neighborhood_rent:
                        listing_revenue_exceed_LTR = 1
                    else:
                        listing_revenue_exceed_LTR = 0
                elif is_campo == 1:
                    if rounded_revenue_ltm == 'n/a':
                        listing_revenue_exceed_LTR = 'n/a'
                    elif rounded_revenue_ltm &amp;gt; campo_neighborhood_rent:
                        listing_revenue_exceed_LTR = 1
                    else:
                        listing_revenue_exceed_LTR = 0

                # listing_revenue_exceed_LTR = 'n/a'   

                #many_listings 
                # print(type(row[9]))
                try:
                    if float(row[9]) &amp;gt; 2: # ia_df_clean['host_total_listings_count']
                        many_listings = 1
                    else:
                        many_listings = 0
                    #only_1 success
                    if float((row[9])) == 1:
                        only_1_listing = 1
                    else: 
                        only_1_listing = 0
                    #only_2 success
                    if float((row[9])) == 2:
                        only_2_listings = 1
                    else:
                        only_2_listings = 0
                except ValueError:
                    many_listings = 'na'
                    only_1_listing = 'na'
                    only_2_listings = 'na'

                #is_entire success 
                if row[13] == 'Entire home/apt':  # ia_df_clean['room_type']
                    is_entire = 1
                else:
                    is_entire = 0
                #is_hotel success
                if row[13] == 'Hotel room':
                    is_hotel = 1
                else:
                    is_hotel = 0

                #commercial
                if cleaned_dataframe.loc[listing_row_id, 'host_total_listings_count'] &amp;gt; 4:
                    commercial = 1
                elif host_florence == 0:
                    commercial = 1
                else:
                    commercial = 0

                #very_likely_commercial
                if cleaned_dataframe.loc[listing_row_id, 'host_total_listings_count'] &amp;gt; 2 and cleaned_dataframe.loc[listing_row_id, 'instant_bookable'] == 't' and cleaned_dataframe.loc[listing_row_id, 'room_type']== 'Entire home/apt':
                    very_likely_commercial = 1
                else:
                    very_likely_commercial = 0

                #likely unpaid tourist tax for unlicensed listings; tourist tax was calculated by (3*days rented ltm)
                if has_liscense == 0:
                    unpaid_tourist_tax = tourist_tax
                else:
                    unpaid_tourist_tax = 0

                #effected_by_policy_x (1-liscense, 2-is entire, 3-2 listing max, 4-day limit 90) #switched 2 and 4 from original
                if has_liscense == 0:
                    effected_by_policy_1 = 1
                else: 
                    effected_by_policy_1 = 0
                if is_entire == 1:
                    effected_by_policy_2 = 1
                else:
                    effected_by_policy_2 = 0

                #more than 2 listings means many
                if many_listings == 1:
                    effected_by_policy_3 = 1
                else:
                    effected_by_policy_3 = 0


                new_variables = [days_rented_ltm, rounded_revenue_ltm, occupancy_rate_approx, is_hotel, is_entire, many_listings, only_1_listing, only_2_listings, host_florence, has_liscense, is_instant_bookable, dist_duomo, buffer_zone, is_centro, is_gavinana, is_isolotto, is_rifredi, is_campo, listing_revenue_exceed_LTR, effected_by_policy_1, effected_by_policy_2, effected_by_policy_3, commercial, very_likely_commercial, tourist_tax, unpaid_tourist_tax, geom]
                for item in new_variables:
                    row.append(item)
                #adding entire row to all 
                all.append(row)
            else:
                # print("either inactive or long term listing so removing {} from file".format(name_of_listing))
                pass
        #writing all rows to new out file
        csvwriter.writerows(all)
return csvwriter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;new_var_df = new_variables('../suwa_maps/airbnb_jupyter/cleaned_file.csv', '../suwa_maps/airbnb_jupyter/new_variables_file.csv', cleaned_df_02_22)&lt;/code&gt;&lt;/p&gt;
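The occupancy model buried in that loop can be pulled out for inspection: estimated bookings are reviews times an assumed 50 percent review rate, the stay length is the larger of the listing's minimum and the 3.2-night average, and the results are conservatively capped at 255 days rented and a 70 percent occupancy rate. A simplified sketch, not the exact script:

```python
def occupancy_estimate(num_reviews_ltm, min_night_avg,
                       review_rate=0.50, avg_length=3.2):
    """Approximate days rented and occupancy rate for an active short-term listing."""
    avg_days = max(min_night_avg, avg_length)        # guests stay at least the minimum
    days_rented = min(review_rate * num_reviews_ltm * avg_days, 255)  # conservative cap
    occupancy = min(days_rented / 365 * 100, 70.0)   # cap at 70% per the write-up
    return days_rented, occupancy

days, occ = occupancy_estimate(num_reviews_ltm=40, min_night_avg=2.0)
```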

&lt;p&gt;&lt;strong&gt;Step 3.) LOAD Migrate the csv data into a PostGIS spatial database.&lt;/strong&gt; It was very important that we made a geom point column from the lat/long attributes in the last step. The next code block won't run here since environment variables need to be set up, but I provide it to show how the database schema is set up to store the airbnb listing data we just processed. Then run &lt;code&gt;python manage.py migrate&lt;/code&gt; and &lt;code&gt;python manage.py check&lt;/code&gt;, and finally the script to load our data into the database table we just created with the django migration function.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;#models.py file&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# This is an auto-generated Django model module created by ogrinspect.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class AirbnbListings(models.Model): #issues here ...&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# Regular Django fields corresponding to the attributes in the IA csv file&lt;/code&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id = models.IntegerField(primary_key=True)
listing_url = models.CharField(max_length=1000)
last_scraped = models.DateField(max_length=1000)
name = models.CharField(max_length=1000)
host_id = models.IntegerField()
host_name = models.CharField(max_length=1000)
host_since = models.CharField(max_length=1000)
host_location = models.CharField(max_length=1000)
host_total_listings_count = models.IntegerField()
neighbourhood_cleansed = models.CharField(max_length=1000)
latitude = models.FloatField()
longitude = models.FloatField()
room_type = models.CharField(max_length=1000)
accommodates = models.IntegerField()
bedrooms = models.IntegerField()
price = models.FloatField()
minimum_nights_avg_ntm = models.FloatField()
availability_30 = models.IntegerField()
availability_60 = models.IntegerField()
availability_90 = models.IntegerField()
availability_365 = models.IntegerField()
number_of_reviews = models.IntegerField()
number_of_reviews_ltm = models.IntegerField()
number_of_reviews_l30d = models.IntegerField()
license = models.CharField(max_length=1000)
instant_bookable = models.CharField(max_length=1000)
calculated_host_listings_count = models.IntegerField()
reviews_per_month = models.FloatField()
days_rented_ltm = models.FloatField()
rounded_revenue_ltm = models.FloatField()
occupancy_rate_approx = models.FloatField()
is_hotel = models.IntegerField()
is_entire = models.IntegerField()
many_listings = models.IntegerField()
only_1_listing = models.IntegerField()
only_2_listings = models.IntegerField()
host_florence = models.IntegerField()
has_liscense = models.IntegerField()
is_instant_bookable = models.IntegerField()
dist_duomo = models.FloatField()
is_centro = models.IntegerField()
is_gavinana = models.IntegerField()
is_isolotto = models.IntegerField()
is_rifredi = models.IntegerField()
is_campo = models.IntegerField()
listing_revenue_exceed_LTR = models.IntegerField()
effected_by_policy_1 = models.IntegerField()
effected_by_policy_2 = models.IntegerField()
effected_by_policy_3 = models.IntegerField()
commercial = models.IntegerField()
# editing to see if this is the problem JULY not so putting back
geom = models.PointField(verbose_name='geo', srid=4326) # GeoDjango-specific: a geometry field
#could be points, line, polygon, multipolygon, raster etc 
# Returns the string representation of the model. maybe add this because it seems to be in docs.
# def __str__(self):
#     return self.name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;Auto-generated &lt;code&gt;LayerMapping&lt;/code&gt; dictionary for AirbnbListings model&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;augmented_airbnblistings_mapping = {&lt;br&gt;
    'id': 'id',&lt;br&gt;
    'listing_url': 'listing_url',&lt;br&gt;
    'last_scraped': 'last_scraped',&lt;br&gt;
    'name': 'name',&lt;br&gt;
    'host_id': 'host_id',&lt;br&gt;
    'host_name': 'host_name',&lt;br&gt;
    'host_since': 'host_since',&lt;br&gt;
    'host_location': 'host_location',&lt;br&gt;
    'host_total_listings_count': 'host_total_listings_count',&lt;br&gt;
    'neighbourhood_cleansed': 'neighbourhood_cleansed',&lt;br&gt;
    'latitude': 'latitude',&lt;br&gt;
    'longitude': 'longitude',&lt;br&gt;
    'room_type': 'room_type',&lt;br&gt;
    'accommodates': 'accommodates',&lt;br&gt;
    'bedrooms': 'bedrooms',&lt;br&gt;
    'price': 'price',&lt;br&gt;
    'minimum_nights_avg_ntm': 'minimum_nights_avg_ntm',&lt;br&gt;
    'availability_30': 'availability_30',&lt;br&gt;
    'availability_60': 'availability_60',&lt;br&gt;
    'availability_90': 'availability_90',&lt;br&gt;
    'availability_365': 'availability_365',&lt;br&gt;
    'number_of_reviews': 'number_of_reviews',&lt;br&gt;
    'number_of_reviews_ltm': 'number_of_reviews_ltm',&lt;br&gt;
    'number_of_reviews_l30d': 'number_of_reviews_l30d',&lt;br&gt;
    'license': 'license',&lt;br&gt;
    'instant_bookable': 'instant_bookable',&lt;br&gt;
    'calculated_host_listings_count': 'calculated_host_listings_count',&lt;br&gt;
    'reviews_per_month': 'reviews_per_month',&lt;br&gt;
    'days_rented_ltm': 'days_rented_ltm',&lt;br&gt;
    'rounded_revenue_ltm': 'rounded_revenue_ltm',&lt;br&gt;
    'occupancy_rate_approx': 'occupancy_rate_approx',&lt;br&gt;
    'is_hotel': 'is_hotel',&lt;br&gt;
    'is_entire': 'is_entire',&lt;br&gt;
    'many_listings': 'many_listings',&lt;br&gt;
    'only_1_listing': 'only_1_listing',&lt;br&gt;
    'only_2_listings': 'only_2_listings',&lt;br&gt;
    'host_florence': 'host_florence',&lt;br&gt;
    'has_liscense': 'has_liscense',&lt;br&gt;
    'is_instant_bookable': 'is_instant_bookable',&lt;br&gt;
    'dist_duomo': 'dist_duomo',&lt;br&gt;
    'is_centro': 'is_centro',&lt;br&gt;
    'is_gavinana': 'is_gavinana',&lt;br&gt;
    'is_isolotto': 'is_isolotto',&lt;br&gt;
    'is_rifredi': 'is_rifredi',&lt;br&gt;
    'is_campo': 'is_campo',&lt;br&gt;
    'listing_revenue_exceed_LTR': 'listing_revenue_exceed_LTR',&lt;br&gt;
    'effected_by_policy_1': 'effected_by_policy_1',&lt;br&gt;
    'effected_by_policy_2': 'effected_by_policy_2',&lt;br&gt;
    'effected_by_policy_3': 'effected_by_policy_3',&lt;br&gt;
    'commercial': 'commercial',&lt;br&gt;
    'geom': ['latitude', 'longitude'],&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Script to load the data into the database, adapted from &lt;a href="https://towardsdatascience.com/use-python-scripts-to-insert-csv-data-into-django-databases-72eee7c6a433" rel="noopener noreferrer"&gt;https://towardsdatascience.com/use-python-scripts-to-insert-csv-data-into-django-databases-72eee7c6a433&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_csv_db():
    with open('../suwa_maps/airbnb_jupyter/cleaned_file.csv') as file:
        reader = csv.reader(file)
        next(reader)  # Advance past the header

        AirbnbListings.objects.all().delete()

        for row in reader:
            print(row)

            listing = AirbnbListings(
                id=row[0],
                listing_url=row[1],
                last_scraped=row[2],
                name=row[3],
                host_id=row[4],
                host_name=row[5],
                host_since=row[6],
                host_location=row[7],
                host_total_listings_count=row[8],
                neighbourhood_cleansed=row[9],
                latitude=row[10],
                longitude=row[11],
                room_type=row[12],
                accommodates=row[13],
                bedrooms=row[14],
                price=row[15],
                minimum_nights_avg_ntm=row[16],
                availability_30=row[17],
                availability_60=row[18],
                availability_90=row[19],
                availability_365=row[20],
                number_of_reviews=row[21],
                number_of_reviews_ltm=row[22],
                number_of_reviews_l30d=row[23],
                license=row[24],
                instant_bookable=row[25],
                calculated_host_listings_count=row[26],
                reviews_per_month=row[27],
                days_rented_ltm=row[28],
                rounded_revenue_ltm=row[29],
                occupancy_rate_approx=row[30],
                is_hotel=row[31],
                is_entire=row[32],
                many_listings=row[33],
                only_1_listing=row[34],
                only_2_listings=row[35],
                host_florence=row[36],
                has_liscense=row[37],
                is_instant_bookable=row[38],
                dist_duomo=row[39],
                is_centro=row[40],
                is_gavinana=row[41],
                is_isolotto=row[42],
                is_rifredi=row[43],
                is_campo=row[44],
                listing_revenue_exceed_LTR=row[45],
                effected_by_policy_1=row[46],
                effected_by_policy_2=row[47],
                effected_by_policy_3=row[48],
                commercial=row[49],
                geom=row[50],
            )
            listing.save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
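&lt;p&gt;One gotcha with a loader like this: a PointField can't be populated from a raw CSV string, but GeoDjango does accept EWKT strings. A minimal sketch of a hypothetical helper that builds one from the latitude/longitude columns (assuming SRID 4326, as in the model):&lt;/p&gt;

```python
def point_ewkt(lat, lon, srid=4326):
    """Build an EWKT string that GeoDjango can assign directly to a PointField."""
    # EWKT is 'SRID=n;POINT(x y)' where x is longitude and y is latitude.
    return f"SRID={srid};POINT({float(lon)} {float(lat)})"

# In the loader this could replace geom=row[50]:
# geom=point_ewkt(row[10], row[11])
```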

&lt;p&gt;&lt;strong&gt;NOTE: These next steps would normally pull the data from the database.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4.) EXPLORE with summary statistics in pandas&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_variables_df = pd.read_csv('../suwa_maps/airbnb_jupyter/new_variables_file.csv')
df = new_variables_df  # shorthand used below

quant_labels = ['calculated_host_listings_count', 'accommodates', 'bedrooms', 'days_rented_ltm', 'rounded_revenue_ltm', 'occupancy_rate_approx', 'tourist_tax', 'unpaid_tourist_tax']
df_quant = df.loc[:, quant_labels]
qual_labels = ['id', 'listing_url', 'name', 'host_id', 'host_name', 'neighbourhood_cleansed', 'host_location', 'room_type', 'license', 'instant_bookable', 'is_hotel', 'is_entire', 'many_listings', 'only_1_listing', 'only_2_listings', 'host_florence', 'has_liscense', 'is_instant_bookable', 'effected_by_policy_1', 'effected_by_policy_2', 'effected_by_policy_3', 'commercial', 'very_likely_commercial']
df_qual = df.loc[:, qual_labels].astype("category")
continuous_quant_labels = ['host_since', 'latitude', 'longitude', 'price', 'dist_duomo']
df_continuous = df.loc[:, continuous_quant_labels]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;df_quant.describe().transpose()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df_qual.describe().transpose()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pivot_discrete_neighbourhood = pd.pivot_table(data=new_variables_df, index=["neighbourhood_cleansed"],columns=["room_type"], values='host_total_listings_count', aggfunc='mean')&lt;br&gt;
print(pivot_discrete_neighbourhood)&lt;/code&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def original_airbnb_map(mapdf, filetitle):
    # create bubble map
    bub_map = folium.Map(location=[mapdf.latitude.mean(), mapdf.longitude.mean()], zoom_start=16, tiles='CartoDB Positron', control_scale=True)
    folium.raster_layers.TileLayer('Stamen Toner').add_to(bub_map)
    folium.raster_layers.TileLayer('Stamen Terrain').add_to(bub_map)
    folium.raster_layers.TileLayer('CartoDB Positron').add_to(bub_map)
    folium.LayerControl().add_to(bub_map)

    for index, location_info in mapdf.iterrows():
        # popup = [str(location_info["host_name"]), "http://www.airbnb.com/users/show/" + str(location_info["host_id"]), str(location_info["license"]), str(location_info["host_total_listings_count"]), str(location_info["global_total_listings"]), str(location_info["listing_url"]), str(location_info["id"]), str(location_info["name"]), str(location_info["neighbourhood_cleansed"]), str(location_info["room_type"]), str(location_info["bedrooms"]), str(location_info["rounded_revenue"]), str((location_info["rounded_revenue"]) / 12), str(location_info["price"]), str(location_info["days_rented"]), str(location_info["reviews_per_month"])]
        popup = 'tbd'
        tooltip = 'tbd'
        folium.Circle([location_info["latitude"], location_info["longitude"]], radius=2, color="blue", opacity=.2, fill=True, fill_opacity=0.1, fill_color="blue", popup=popup, tooltip=tooltip).add_to(bub_map)
        # fill_opacity=0.5,

    # bub_map.save('Out_Map/' + filetitle + '.html')
    return bub_map

# original_airbnb_map(mapdf, datadf, tileinfo)
bub_map = original_airbnb_map(cleaned_df_02_22, 'original_airbnb_map')
bub_map
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

</description>
      <category>productivity</category>
    </item>
    <item>
      <title>Making an ETL Pipeline for geospatial data to postgres/postgis database</title>
      <dc:creator>Taylor Higgins</dc:creator>
      <pubDate>Sat, 07 Jan 2023 17:19:33 +0000</pubDate>
      <link>https://dev.to/thiggimajig/making-an-elt-pipeline-for-geospatial-data-to-postgrespostgis-database-52kf</link>
      <guid>https://dev.to/thiggimajig/making-an-elt-pipeline-for-geospatial-data-to-postgrespostgis-database-52kf</guid>
      <description>&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt; - First I gather the data from USGS's earth explorer for elevation data, global extreme sea level prediction data from the European commission, and the hydrobasin data from the hydroshed project and finally the country boundries data from natural earth data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt; - Then I load each dataset into the relational database, which was built using a Docker image of the PostGIS extension of Postgres and then connected to QGIS for the team to access. This amazing &lt;a href="https://alexurquhart.com/post/set-up-postgis-with-docker/" rel="noopener noreferrer"&gt;tutorial&lt;/a&gt; was super helpful here. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt; - Finally I clip, vectorize, deduplicate and merge the datasets with SQL spatial queries, QGIS and geopandas so that all the data is consistent. &lt;/p&gt;

&lt;p&gt;The steps here involve resampling the rasters to a coarser grain, clipping the data to remove ocean pixels, vectorizing the rasters with zonal statistics, removing irrelevant attributes from the hydrobasin dataset, merging the hydrobasin and sea level rise prediction datasets with a spatial join, and then merging that result onto the vectorized raster by the common hydrobasin column. Finally, basic subtraction is applied to find the areas of elevation data that are at risk from sea level rise. &lt;/p&gt;
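&lt;p&gt;The merge on the shared hydrobasin column and the final subtraction can be sketched with pandas; the column names here (hybas_id, elev_median, esl_85_2100) are illustrative, not necessarily the ones in the real files:&lt;/p&gt;

```python
import pandas as pd

# Hexagon elevations from the zonal statistics step (toy values, metres).
hex_elev = pd.DataFrame({
    "hex_id": [1, 2],
    "hybas_id": [10, 10],  # hydrobasin each hexagon falls within
    "elev_median": [0.4, 3.2],
})

# Pivoted sea level rise prediction per hydrobasin (toy value, metres).
eslr = pd.DataFrame({"hybas_id": [10], "esl_85_2100": [1.1]})

merged = hex_elev.merge(eslr, on="hybas_id", how="left")
# Negative values flag hexagons projected to be underwater.
merged["risk_85_2100"] = merged["elev_median"] - merged["esl_85_2100"]
```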

&lt;p&gt;Further mapping is then done in qGis to understand the crop yield loss and other economic factors. &lt;/p&gt;

&lt;p&gt;Transform steps - Processing:&lt;/p&gt;

&lt;p&gt;These are all the steps leading up to the analysis and visualisation, so the outputs are: the file with all hexagons and their elevation zonal stats, the sea level rise data pivoted on hydrobasin, and the clipped shapefile for Italy in EPSG:3035 with the hexagons and their geometry. &lt;/p&gt;

&lt;p&gt;We pulled the DEM data with a USGS script over a bounding box for Italy, merged it into a single TIFF, then clipped it with the Italy boundary in QGIS and resampled it to a coarser resolution. &lt;/p&gt;

&lt;p&gt;The ESL point data we also clipped with the boundary of Italy in QGIS.&lt;/p&gt;

&lt;p&gt;The watersheds came from HydroBASINS and were also clipped by the Italy border in QGIS.&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Analyzing Italy's Vulnerability to Sea Level Rise</title>
      <dc:creator>Taylor Higgins</dc:creator>
      <pubDate>Tue, 15 Nov 2022 17:10:37 +0000</pubDate>
      <link>https://dev.to/thiggimajig/plan-to-analyze-sardinias-vulnerability-to-sea-level-rise-in-2050-3jmp</link>
      <guid>https://dev.to/thiggimajig/plan-to-analyze-sardinias-vulnerability-to-sea-level-rise-in-2050-3jmp</guid>
      <description>&lt;h2&gt;
  
  
  Overall Goal:
&lt;/h2&gt;

&lt;p&gt;To produce a vulnerability/resilience map of the coastal areas of Italy based on sea level rise predictions for RCP 4.5 and 8.5, and to calculate how much cropland could be lost. &lt;/p&gt;

&lt;h2&gt;
  
  
  Datasets to use:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;sea level rise sample point data&lt;/strong&gt; - sample collection points for sea level rise predictions for 2050 and 2100 based on extreme risk and three levels of confidence. These collection points are spaced 20 meters apart around the coastline of Italy and the main Italian islands of Sardinia and Sicily.&lt;br&gt;
&lt;a href="https://data.jrc.ec.europa.eu/dataset/jrc-liscoast-10012#dataaccess"&gt;https://data.jrc.ec.europa.eu/dataset/jrc-liscoast-10012#dataaccess&lt;/a&gt; (global coastline data)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;elevation raster data&lt;/strong&gt; - DEM/DTM lidar collected images of the topographical elevation at 30 meter resolution &lt;a href="https://earthexplorer.usgs.gov/"&gt;https://earthexplorer.usgs.gov/&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;watershed polygon area data&lt;/strong&gt; - watersheds of Italy Level of granularity 8 &lt;a href="https://www.hydrosheds.org/products/hydrobasins"&gt;https://www.hydrosheds.org/products/hydrobasins&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;crop raster data&lt;/strong&gt; &lt;a href="https://data.jrc.ec.europa.eu/dataset/15f86c84-eae1-4723-8e00-c1b35c8f56b9"&gt;https://data.jrc.ec.europa.eu/dataset/15f86c84-eae1-4723-8e00-c1b35c8f56b9&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://land.copernicus.vgt.vito.be/manifest/ndvi300_v2_333m/manifest_cgls_ndvi300_v2_333m_latest.txt"&gt;https://land.copernicus.vgt.vito.be/manifest/ndvi300_v2_333m/manifest_cgls_ndvi300_v2_333m_latest.txt&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;corine: &lt;a href="https://land.copernicus.eu/pan-european/corine-land-cover"&gt;https://land.copernicus.eu/pan-european/corine-land-cover&lt;/a&gt; &lt;br&gt;
corine legend: &lt;a href="http://clc.gios.gov.pl/doc/clc/CLC_Legend_EN.pdf"&gt;http://clc.gios.gov.pl/doc/clc/CLC_Legend_EN.pdf&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;admin country boundary polygon data&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.naturalearthdata.com/downloads/10m-cultural-vectors/"&gt;https://www.naturalearthdata.com/downloads/10m-cultural-vectors/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware used:
&lt;/h2&gt;

&lt;p&gt;Text editor     visual studio code&lt;br&gt;
Device          AsusVivoBook&lt;br&gt;
Processor   Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz   1.19 GHz&lt;br&gt;
Installed RAM   20.0 GB (19.8 GB usable)&lt;br&gt;
System type 64-bit operating system, x64-based processor&lt;/p&gt;

&lt;p&gt;OS Edition    Windows 10 Home&lt;br&gt;
Version           21H2&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Processing Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;First I will convert the raster elevation data into vector data. We will do this by overlaying a lattice of 500-square-meter hexagons on top of the DTM/DEM dataset and applying zonal statistics to get summary statistics (min, max and median) of the height of the raster pixels, giving an approximation of the elevation for each hexagon. Then I will merge by location the watershed polygon file to the vectorized elevation file so that each hexagon row has an associated watershed that it falls within. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Second I'll join the extreme sea level rise (ESL) point data to the watershed polygon layer using a near function. Since I will use a left join, this will result in a polygon vector file where each row signifies a watershed, and the watersheds closest to the coast will have the closest ESL points appended in a new column. The near function will search for ESL points within a 5 km radius of the watershed and cap the number of nearest points at 5. We should expect null data for watersheds that are further than 5 km from a collection point, and the ESL point columns appended to the watershed rows should be averaged so that no duplicated watershed rows are created. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
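&lt;p&gt;The zonal statistics in step 1 run inside QGIS, but the aggregation itself is simple; a toy sketch over a dict of elevation pixel values per hexagon:&lt;/p&gt;

```python
from statistics import median

def hex_zonal_stats(pixels_by_hex):
    """Summarise the raster pixels falling in each hexagon:
    min, max and median elevation, as in the QGIS zonal statistics step."""
    return {
        hex_id: {"min": min(px), "max": max(px), "median": median(px)}
        for hex_id, px in pixels_by_hex.items()
    }
```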

&lt;p&gt;When I have that, we will then have the most accurate predicted future sea level forecast for each of the watersheds within the search radius.&lt;/p&gt;

&lt;p&gt;NOTE: it matters how you do the near-function join. If we were to instead join the watersheds to the collection points, we would lose data about the watersheds; since we are interested in the watersheds, this is not ideal, and since there is no data we would want to average from the watershed file, it doesn't serve us to do the join that way. Ultimately the watershed file is the link between the ESL point data and the DEM raster data. We could have joined directly to the hexagonal vectorized DEM data, removing the watersheds entirely, but taking the hydraulic realities into account makes the flood and crop loss predictions more accurate. &lt;/p&gt;
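&lt;p&gt;The near join described above (ESL points within a 5 km radius, capped at the 5 nearest, averaged onto the watershed) can be sketched in plain Python; in practice this ran as a QGIS/geopandas operation, and the names here are hypothetical:&lt;/p&gt;

```python
from math import hypot

def mean_nearest_esl(basin_xy, esl_points, k=5, radius=5000):
    """Average the ESL values of up to k sample points within radius metres
    of a watershed centroid; None mimics the expected nulls for inland basins."""
    by_dist = sorted(
        (hypot(x - basin_xy[0], y - basin_xy[1]), esl) for (x, y), esl in esl_points
    )
    vals = [esl for dist, esl in by_dist[:k] if dist <= radius]
    return sum(vals) / len(vals) if vals else None
```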

&lt;p&gt;Third we will merge the two vector files (the watershed polygon vector that has the nearest ESL points joined from step 2, and the hexagonal vectorized DEM elevation data step 1) on the watershed id that they both have in common. &lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis and Visualisation
&lt;/h2&gt;

&lt;p&gt;At this point we are done processing the data and we can create new columns after subtracting the predicted sea level rise in meters data from the elevation data. Since there are four predictions for ESL (RCP 4.5, RCP 8.5 for 2050 and 2100) where each prediction has the 5%, 50% and 95% confidence value, we will create 16 resulting columns from a conservative mix of calculations seen in the table below. The new columns that are negative or around 0 imply the area will be underwater in 2050 and 2100 respectively based on the RCP predictions. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;4.5 2050&lt;/th&gt;
&lt;th&gt;8.5 2050&lt;/th&gt;
&lt;th&gt;4.5 2100&lt;/th&gt;
&lt;th&gt;8.5 2100&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;elev med - esl 50%&lt;/td&gt;
&lt;td&gt;elev med - esl 50%&lt;/td&gt;
&lt;td&gt;elev med - esl 50%&lt;/td&gt;
&lt;td&gt;elev med - esl 50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;elev min - esl 50%&lt;/td&gt;
&lt;td&gt;elev min - esl 50%&lt;/td&gt;
&lt;td&gt;elev min - esl 50%&lt;/td&gt;
&lt;td&gt;elev min - esl 50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;elev min - esl 95%&lt;/td&gt;
&lt;td&gt;elev min - esl 95%&lt;/td&gt;
&lt;td&gt;elev min - esl 95%&lt;/td&gt;
&lt;td&gt;elev min - esl 95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;elev min - esl 5%&lt;/td&gt;
&lt;td&gt;elev min - esl 5%&lt;/td&gt;
&lt;td&gt;elev min - esl 5%&lt;/td&gt;
&lt;td&gt;elev min - esl 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
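&lt;p&gt;The table reads as four elevation/confidence pairings applied to each of the four scenarios; generating the 16 columns programmatically (column names hypothetical) avoids copy-paste errors:&lt;/p&gt;

```python
scenarios = ["rcp45_2050", "rcp85_2050", "rcp45_2100", "rcp85_2100"]
pairings = [("elev_median", "esl_p50"), ("elev_min", "esl_p50"),
            ("elev_min", "esl_p95"), ("elev_min", "esl_p5")]

def risk_columns(row):
    """row maps e.g. 'elev_min' and 'esl_p50_rcp45_2050' to values in metres.
    Returns one difference column per scenario/pairing; negative or near-zero
    values imply the hexagon will be underwater under that prediction."""
    return {
        f"risk_{elev}_{esl}_{sc}": row[elev] - row[f"{esl}_{sc}"]
        for sc in scenarios
        for elev, esl in pairings
    }
```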

&lt;p&gt;Lastly I will visualize the data that we just processed by creating maps from each of the new columns, highlighting the hexagonal areas of the coast that will be most affected by extreme sea level rise. By creating choropleth maps classified into 5 groups with the equal-interval algorithm and color coded accordingly, the areas underwater or close to sea level will be showcased. &lt;/p&gt;
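&lt;p&gt;Equal-interval classification just slices the value range into k equally wide bins (QGIS and the mapclassify library do this for you; the underlying arithmetic is):&lt;/p&gt;

```python
def equal_interval_bins(values, k=5):
    """Upper bin edges for an equal-interval classification of the values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * (i + 1) for i in range(k)]
```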

&lt;h2&gt;
  
  
  To be continued...
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
Adding the crop data to see how much crop revenue could be lost to sea level rise, and ultimately see how much crop land would become too degraded to be fertile under current agricultural practices, even if it is above sea level.&lt;/li&gt;
&lt;li&gt;
Expanding to the entire Euro-Mediterranean region.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;Since sea level rise has above-ground and below-ground effects, especially when considering the salinity levels of soil, understanding the effects of sea level rise isn't as simple as subtracting the predicted rise from the elevation. Due to capillary action and soil seepage, understanding how far sea water can travel underground is an important consideration when evaluating a region's vulnerability to sea level rise. &lt;/p&gt;

&lt;h2&gt;
  
  
  Problems I encountered and How I solved them
&lt;/h2&gt;

&lt;p&gt;Memory issues: with the merged file being close to 15 GB, I manually adjusted the page file limit in advanced system settings, but the job ultimately needed to run on other, less powerful computers (mine has 32 GB of RAM), so instead of that loophole we segmented the raster mosaic into 24 pieces. That created its own problem: pulling the TIFF files from the ASTER sensor for the entire Euromed region was cumbersome and time consuming even with a script, so I ultimately split the satellite image pull by country, using a script that calls the USGS Earth Explorer API after setting up a bounding box for the country of interest. &lt;/p&gt;

&lt;p&gt;Downloading GDAL is always a bit cumbersome: before you can pip install it, you have to have the actual pre-built GDAL wheel file and make sure it's on your path. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Step Two: Process the data with Python and qGIS</title>
      <dc:creator>Taylor Higgins</dc:creator>
      <pubDate>Wed, 09 Nov 2022 16:47:44 +0000</pubDate>
      <link>https://dev.to/thiggimajig/step-2-process-the-data-with-python-4llp</link>
      <guid>https://dev.to/thiggimajig/step-2-process-the-data-with-python-4llp</guid>
      <description>&lt;p&gt;Now that I had a general plan for the web app I wanted to collect and clean all of my data. &lt;/p&gt;

&lt;p&gt;The first data set I used was open data from the government of Tuscany, specifically 2011 Census data for the city of Florence. This had important information on the total number of renters and homeowners and also overall resident population by census tract, and the total number of housing units and bedrooms by census tract. &lt;/p&gt;

&lt;p&gt;The second data set I used was scraped Airbnb data from the non-profit Inside Airbnb, a research group dedicated to showing the real impact of Airbnb on communities. This data, though not fully open, follows the FAIR "A" (Accessible) data principle by being "as open as possible but as closed as necessary". &lt;/p&gt;

&lt;p&gt;It was important to line up the census data geographically by coordinates with the airbnb listing data in order to understand how saturated a particular census tract or neighborhood was. &lt;/p&gt;

&lt;h3&gt;
  
  
  The questions I wanted to answer were:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What percentage of the total number of residential units and available bedrooms are listed as airbnb? &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compared to the total number of residents living in a particular geographic unit (census tract, street or neighborhood), how many airbnb guests are there in a given month or year?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I attempted to use geopandas to join the two datasets by coordinates, and had success converting the CRS projection to the same coordinate system, but ultimately found it faster to join the datasets in QGIS and then export a shapefile to continue processing in Python. I was able to use geopandas to read the resulting joined shapefile in the next step, where I wrote the main functions that fueled the map and stats for the webapp. &lt;/p&gt;
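&lt;p&gt;The spatial join itself happened in QGIS, but conceptually it is point-in-polygon assignment; a toy version using tract bounding boxes (real tracts need true polygon tests, e.g. a geopandas sjoin):&lt;/p&gt;

```python
def assign_tract(lon, lat, tracts):
    """Return the id of the first census tract whose bounding box contains the
    listing. tracts maps tract_id to (minx, miny, maxx, maxy) in the point's CRS."""
    for tract_id, (minx, miny, maxx, maxy) in tracts.items():
        if minx <= lon <= maxx and miny <= lat <= maxy:
            return tract_id
    return None
```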

&lt;p&gt;I did this after removing unneeded columns from both datasets, in order to save on storage space and increase processing times. &lt;/p&gt;

&lt;p&gt;I also created new variables that would help me when doing analysis and mapping in later steps by converting qualitative user created values to quantitative bivariate values. &lt;/p&gt;

&lt;p&gt;For example, to more easily group the listings together based off of a qualitative data point, like where the host was based, I converted the user inputs (ie. Tuscany, Firenze, Toscana, Italy, USA, Egypt, Nebraska etc) into a yes/no variable for whether the host was based in or out of the city. &lt;/p&gt;
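&lt;p&gt;That Florence flag can be a simple token lookup over the free-text field; a sketch (the token list and helper name are illustrative, not the exact ones used):&lt;/p&gt;

```python
FLORENCE_TOKENS = {"florence", "firenze"}

def host_florence(host_location):
    """1 if the free-text host location mentions Florence, else 0."""
    if not host_location:
        return 0
    tokens = host_location.lower().replace(",", " ").split()
    return int(bool(FLORENCE_TOKENS.intersection(tokens)))
```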

&lt;p&gt;After this sort of processing I was able to ask questions of the dataset more easily. In total I created 27 new variables, not all of which I ended up being able to use, but I hope to use them all in the future. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;new_variables = [days_rented_ltm, rounded_revenue_ltm, occupancy_rate_approx, is_hotel, is_entire, many_listings, only_1_listing, only_2_listings, host_florence, has_liscense, is_instant_bookable, dist_duomo, buffer_zone, is_centro, is_gavinana, is_isolotto, is_rifredi, is_campo, listing_revenue_exceed_LTR, effected_by_policy_1, effected_by_policy_2, effected_by_policy_3, commercial, very_likely_commercial, tourist_tax, unpaid_tourist_tax, geom]&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;An important caveat that should've been accounted for in more depth was that the census data came from 2011 and the airbnb data came from 2021. I look&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Step One: Design front-end mock up using figma</title>
      <dc:creator>Taylor Higgins</dc:creator>
      <pubDate>Sun, 06 Nov 2022 16:46:23 +0000</pubDate>
      <link>https://dev.to/thiggimajig/step-onedesign-front-end-mock-up-using-figma-4a38</link>
      <guid>https://dev.to/thiggimajig/step-onedesign-front-end-mock-up-using-figma-4a38</guid>
      <description>&lt;p&gt;This was a very humbling part of the process, especially when I looked back after finishing the project at all I &lt;em&gt;thought&lt;/em&gt; I was going to fit into the app. &lt;/p&gt;

&lt;p&gt;I learned that this part is crucial, it's the visual equivalent to making an outline for a written essay, and the more detailed and more thought out, the easier it'll be for you when you go to actually build the thing. &lt;/p&gt;

&lt;p&gt;It feels like a step you can skip, but please, be kind to your future self and do not skip it! If you're anything like me, you probably get really excited when you first come up with an idea, and &lt;em&gt;think&lt;/em&gt; you can see it perfectly in your head. You can't wait to get started -- and good on you for capturing that momentum and energy, it's also essential if you want to finish any project to move when you're excited -- but please be kind to yourself and sketch out your idea, get it out of your head in some capacity!&lt;/p&gt;

&lt;p&gt;It not only helps guide you, during the process of building -- when you're stuck or if you need to stop for a couple weeks mid-way through building the project -- but most importantly it helps you see the gaps and weak spots in your idea and design. For some reason our brains love to smooth over those inconvenient gaps, like a blur filter. &lt;/p&gt;

&lt;p&gt;It amazes me how the general stereotype of programmers can be that they're solo workers, when it involves so much communication to build anything. The best programmers I know are so good because they can communicate their ideas well, either before hand or during when they're stuck. The first piece of advice I ever got when I first started learning coding was that the best skill you can learn initially is to know how to ask for help. &lt;/p&gt;

&lt;p&gt;And it's proven to be so true. Whenever I'm stuck, writing out my question on stackoverflow often clears up my confusion, before I even hit submit on the question. If not, then after about 10 minutes as I imagine a person reading my question, it's like magic and the missing link pops into my head! Likewise, calling a friend and explaining it to them, often before they can even respond, I have an idea on what was missing. &lt;/p&gt;

&lt;p&gt;This first design step is similar to that, it's getting the idea out there, and since inevitably we'll want others to use our project, or enjoy it in some way, involving them as early as possible simply makes sense. &lt;/p&gt;

&lt;p&gt;Anyway, I learned to love this step, and would love to hear your thoughts: &lt;/p&gt;

&lt;p&gt;How do you first get started on a project? Do you have any "go-to" design steps? Tips on how to keep your project realistic and simple? Write them below 🫶&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>beginners</category>
      <category>design</category>
      <category>javascript</category>
    </item>
    <item>
      <title>My journey creating a map-based webapp with django, leaflet/folium, docker and heroku</title>
      <dc:creator>Taylor Higgins</dc:creator>
      <pubDate>Sun, 30 Oct 2022 22:20:39 +0000</pubDate>
      <link>https://dev.to/thiggimajig/my-journey-creating-my-first-map-based-webapp-with-django-leafletfolium-docker-and-heroku-3763</link>
      <guid>https://dev.to/thiggimajig/my-journey-creating-my-first-map-based-webapp-with-django-leafletfolium-docker-and-heroku-3763</guid>
      <description>&lt;p&gt;I learned a lot during this process. I wound up documenting the process fairly well -- mostly out of the desperate need to get all the hurdles out of my brain and onto paper -- so I decided to share it here! Hopefully my experience can help someone else out there (like so many of these blogs and message forums had helped me), and also hopefully so others more experienced than me can give me tips on how to improve next time. Thanks for reading! &lt;/p&gt;

&lt;p&gt;This will be a series of blog posts outlining what I did and some of the major time consuming issues I had to overcome. Here are a list of entries that'll make up the series:&lt;/p&gt;

&lt;h4&gt;
  
  
  Step One:
&lt;/h4&gt;

&lt;p&gt;Design front-end mock up using figma&lt;/p&gt;




&lt;h4&gt;
  
  
  Step Two:
&lt;/h4&gt;

&lt;p&gt;Process the data using jupyter notebook, scraped airbnb data, and government published census data&lt;/p&gt;




&lt;h4&gt;
  
  
  Step Three:
&lt;/h4&gt;

&lt;p&gt;Test functions to calculate statistics and build maps for the web app using jupyter &lt;/p&gt;




&lt;h4&gt;
  
  
  Step Four:
&lt;/h4&gt;

&lt;p&gt;Connect the django web app to Heroku remote server &lt;/p&gt;




&lt;h4&gt;
  
  
  Step Five:
&lt;/h4&gt;

&lt;p&gt;Customize the views, urls and templates for the django web app&lt;/p&gt;




&lt;h4&gt;
  
  
  Step Six:
&lt;/h4&gt;

&lt;p&gt;Migrate the data using postgis/postgres&lt;/p&gt;




&lt;h4&gt;
  
  
  Step Seven:
&lt;/h4&gt;

&lt;p&gt;Write tests for the codebase using pytest&lt;/p&gt;




&lt;h4&gt;
  
  
  To do in the future:
&lt;/h4&gt;

&lt;p&gt;Incorporate user experience and design feedback from beta users&lt;/p&gt;




</description>
      <category>python</category>
      <category>docker</category>
      <category>heroku</category>
      <category>maps</category>
    </item>
  </channel>
</rss>
