One of the most common headaches for people who work with georeferenced data is manipulating the layers of a map in a way that feels both agile and professional. With Folium and GeoPandas, two wonderful Python libraries, this is no longer a problem.
GeoPandas is like pandas but focused on the tasks of the surveyor, civil engineer, or environmentalist. It has all the power of pandas, with the added value of letting you manipulate georeferenced data efficiently thanks to a new data type: geometries.
An example of this is polygons: sets of vertices that join to form a geographically delimited space. Their locations are expressed in a geographic coordinate system, and they can be stored as an additional column in the GeoDataFrame that GeoPandas provides.
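To make the idea concrete, here is a minimal sketch (with made-up coordinates and attributes) of how a polygon becomes the geometry column of a GeoDataFrame:

import geopandas as gpd
from shapely.geometry import Polygon

# A hypothetical district: vertices given as (lon, lat) pairs
district = Polygon([(-74.10, 4.60), (-74.05, 4.60), (-74.05, 4.65), (-74.10, 4.65)])

# The geometry column is what turns an ordinary table into a GeoDataFrame
gdf = gpd.GeoDataFrame({'name': ['Example district']},
                       geometry=[district],
                       crs='EPSG:4326')  # geographic coordinates (WGS84)
print(gdf)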
The applications are many: imagine using the variables in your database to color districts, insert informative markers, trace traffic flows, and build a whole range of spatial representations. And since it all lives in Python, you can embed the HTML object representing the map directly in a Jupyter notebook. With this you can take your communication skills to the next level!
Now I am going to show you, with an example, how both libraries can support the exploratory data analysis stage, before moving on to building analytical models. We will use data on real estate properties in Colombia that includes geographic location information; I will use the dataset available on this Kaggle site.
The dataset contains the latitude and longitude of each property, the price, the operation involved (sale, rent, and others), the type of property (house, apartment, lot, commercial premises, ...), and information on rooms, bedrooms, bathrooms, total surface, and covered surface. All these variables could feed a sales price forecasting model to help real estate agents value properties better.
Depending on the detail required, the data can be presented at the department, municipality, district, or local level; these variables are already present in the dataset. In addition, I used this public layer of the administrative division of Colombia to group results by department.
This layer lets me show how easy it is to import shapefiles with GeoPandas and manipulate the data to create visualizations, in this case a choropleth map of property prices by department. Let's take a look at the code:
import pandas as pd
import os

# Paths to the Kaggle properties file and the department-to-region lookup
path_file = os.path.join(os.path.expanduser('~'),
                         'Properties_price/data/co_properties.csv')
path_reg = os.path.join(os.path.expanduser('~'), 'Properties_price/data/regiones.csv')

# Load both files and attach each department's region via the "l2" column
df = pd.read_csv(path_file, encoding='utf8')
df_reg = pd.read_csv(path_reg, encoding='latin1', sep=";")
df = pd.merge(df, df_reg, on="l2")
With these lines we import the data from the "co_properties.csv" file downloaded from Kaggle into a pandas dataframe and put everything in one place. I also built an additional column that assigns each department in the Kaggle data its corresponding region (Amazonía, Orinoquía, Andina, ...).
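If you want to sanity-check the load and the merge, a quick inspection helps (a sketch, assuming the region column in "regiones.csv" is named Region, as referenced later in the modeling code):

# Share of missing values in the main modeling candidates
print(df[['lat', 'lon', 'price', 'rooms', 'bathrooms', 'surface_total']].isna().mean().round(3))

# One row per department with its assigned region
print(df[['l2', 'Region']].drop_duplicates().sort_values('Region'))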
Once we have the data, the magic begins. With two simple lines of code we have our interactive heat map! Isn't it amazing? We only have to pass the columns that define the geographic location of the points to represent, and Folium does the rest.
import folium
from folium.plugins import HeatMap

# Making a map centered on Colombia
mapa = folium.Map(location=[5.170035, -74.914305], tiles='cartodbpositron', zoom_start=6)

# Adding a HeatMap layer on top of the base map.
# With this we can see where most of the real estate properties are located.
HeatMap(data=df[['lat', 'lon']], radius=10).add_to(mapa)
Needless to say, the data layer is decoupled from the map, so we can apply whatever transformation we want to our dataframe to represent subsets of cases under different filters. If we want to embed two maps in a single Jupyter notebook cell, we can use HTML to insert the source of each map:
maps = []
classes = ['Casa', 'Apartamento']

# Build one heat map per property type
for class_ in classes:
    obj_map = folium.Map(location=[5.170035, -74.914305], tiles='cartodbpositron', zoom_start=6)
    data = df[df['property_type'] == class_]
    HeatMap(data=data[['lat', 'lon']], radius=10).add_to(obj_map)
    maps.append(obj_map)
# Defining titles for the maps:
titles = ('<table width="100%">'
          '<tr>'
          '<td style="text-align: center; vertical-align: middle;">'
          '<b>Map for property type = ' + classes[0] + '</b>'
          '</td>'
          '<td style="text-align: center; vertical-align: middle;">'
          '<b>Map for property type = ' + classes[1] + '</b>'
          '</td>'
          '</tr>'
          '</table>')
# We build the HTML in which we embed the two maps, applying convenient styles:
from IPython.display import HTML, display

htmlmap = HTML(titles +
               '<iframe srcdoc="{}"'
               'style="'
               'float:left;'
               'width: {}px;'
               'height: {}px;'
               'display:inline-block;'
               'width: 50%;'
               'margin: 0 auto;'
               'border: 2px solid black">'
               '</iframe>'
               '<iframe srcdoc="{}"'
               'style="'
               'float:right;'
               'width: {}px;'
               'height: {}px;'
               'display:inline-block;'
               'width: 50%;'
               'margin: 0 auto;'
               'border: 2px solid black">'
               '</iframe>'.format(maps[0].get_root().render().replace('"', '&quot;'), 500, 500,
                                  maps[1].get_root().render().replace('"', '&quot;'), 500, 500))
display(htmlmap)
And the output is this:
It is so practical! Now let's move on to the matter that interests us: how do we import shapefiles? Again, two simple lines of code ...
import geopandas as gpd

# Read the departments shapefile into a GeoDataFrame
fname = os.path.join(os.path.expanduser('~'), 'Properties_price/data/depto.shp')
deptos = gpd.GeoDataFrame.from_file(fname)
Once the data has been imported into the "deptos" GeoDataFrame, this is how it looks on the Python side:
There is a geometry column. By default, GeoPandas recognizes this column as the one that stores the geographic shapes of the features to be represented.
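You can confirm which column GeoPandas treats as the active geometry and peek at a few shapes:

print(deptos.geometry.name)    # name of the active geometry column
print(deptos.geometry.head())  # one (multi)polygon per department
print(deptos.crs)              # coordinate reference system read from the shapefile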
Folium is built on Leaflet.js, so we usually have to transform our coordinate system to EPSG:4326. This is somewhat technical, but suffice it to say that it corresponds to the conventional WGS84 system used for representing cartography worldwide. GeoPandas lets us go from one system to another in a single instruction:
# The shapefile uses a projected coordinate system, so we reproject to EPSG:4326
deptos = deptos.to_crs("EPSG:4326")
Folium does not directly accept a GeoPandas object by default, but converting it to the GeoJSON format that Folium requires is relatively straightforward. Before getting to that point, suppose our interest is to draw the median property price in each department, so that the median defines the color each department gets, letting us visualize the spatial distribution of property prices by shade. This is known as a choropleth map.
I previously prepared the median price data, both for houses and for apartments, in two independent dataframes, each with two columns: the median price and the department it is computed for. The shapefile downloaded from Google Sites obviously does not carry this data, but merging it in is straightforward as long as there is a column that serves as the join key, in this case the department name, which must be written identically in the two dataframes being joined.
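That preparation step is not shown here; a sketch of how it could look (assuming the department names in df can be aligned with NOMBRE_DPT from the shapefile; in practice, accents and casing may need normalizing first so the keys match exactly):

# Median price per department, houses and apartments separately
medianas_casas = (df[df['property_type'] == 'Casa']
                  .groupby('l2')['price'].median().reset_index()
                  .rename(columns={'l2': 'NOMBRE_DPT', 'price': 'price_c'}))
medianas_aptos = (df[df['property_type'] == 'Apartamento']
                  .groupby('l2')['price'].median().reset_index()
                  .rename(columns={'l2': 'NOMBRE_DPT', 'price': 'price_ap'}))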
# Attach the house and apartment medians to the departments layer
deptos = pd.merge(deptos, medianas_casas, on="NOMBRE_DPT")
deptos = pd.merge(deptos, medianas_aptos, on="NOMBRE_DPT")

# Centroid coordinates of each department, useful for placing markers later
deptos['centroid_lon'] = deptos['geometry'].centroid.x
deptos['centroid_lat'] = deptos['geometry'].centroid.y
Finally, centroids can be computed for each geometry in the shapefile; GeoPandas implements a handy method for this. The centroid of each department is then available for later use (for example, to place a marker with additional information). Now only two simple tasks remain to complete the choropleth: 1. prepare the GeoJSON with the polygon information and the dataframe with the variable that will be used to color the map; 2. pass them as parameters to the plotting function.
# This code prepares a choropleth for houses (price_c);
# repeat with price_ap to prepare a choropleth for apartments.

# Data used to color the polygons (prices rescaled to millions of COP)
deptos_dict = deptos[['DPTO', 'price_c']].copy()
deptos_dict['price_c'] = deptos_dict['price_c'] / 1000000

# Data used to trace the polygons on the map:
# set the department code as the index so it becomes the GeoJSON feature id
deptos = deptos.set_index("DPTO")
deptos = deptos[["NOMBRE_DPT", "geometry"]]
deptos_geo_json = deptos.to_json()
# Pass the prepared data to the Folium functions
bins = list(round(deptos_dict["price_c"].quantile([0, 0.15, 0.45, 0.60, 0.80, 0.90, 1]), 1))
m = folium.Map(location=[5.170035, -74.914305], tiles='cartodbpositron', zoom_start=6)
folium.Choropleth(
    geo_data=deptos_geo_json,
    name="choropleth",
    data=deptos_dict,
    columns=["DPTO", "price_c"],
    key_on="feature.id",
    fill_color="BuPu",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Median price (millions COP)",
    bins=bins,
    reset=True,
).add_to(m)
folium.LayerControl().add_to(m)
display(m)
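The centroids computed earlier can also be used to place informative markers on top of the choropleth. A minimal sketch (the popup text is hypothetical, and the centroids are recomputed here because those columns were dropped above; GeoPandas will warn that centroids on a geographic CRS are approximate):

# Label each department with its median house price at its centroid
centroids = deptos.geometry.centroid
for dpto_id, point in centroids.items():
    price = deptos_dict.loc[deptos_dict['DPTO'] == dpto_id, 'price_c']
    if not price.empty:
        folium.Marker(
            location=[point.y, point.x],
            popup="{}: {:.0f} M COP".format(deptos.loc[dpto_id, 'NOMBRE_DPT'], price.iloc[0]),
        ).add_to(m)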
Running the code twice, once for houses and once for apartments, and joining the two maps, we get this result:
Naturally, the above is just a data exploration stage on the way to a predictive price model, but it shows the power of GeoPandas and Folium for displaying information. This story culminates in a linear regression model for our data, which includes the department variable as a covariate:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Choose variables for the model based on the previous visualizations.
X = df[['rooms', 'bedrooms', 'bathrooms', 'surface_total', 'surface_covered',
        'price', 'l2', 'property_type', 'bin_rooms', 'bin_bedrooms',
        'bin_bathrooms', 'bin_surface_total', 'bin_surface_covered']]  # , 'Region']]

# Add log-price
X['log_price'] = np.log10(X['price'])

# Add logs for the surface_total and surface_covered covariates
X['log_surface_total'] = np.log10(X['surface_total'])
X['log_surface_covered'] = np.log10(X['surface_covered'])

# Remove the original variables
X = X.drop(columns=['price', 'surface_total', 'surface_covered'])

# Drop rows with too many missing-value indicators
columns = ['bin_rooms', 'bin_surface_total', 'bin_surface_covered', 'bin_bathrooms']
conteos = X[columns].apply(sum, axis=1)
X_clean = X[conteos <= 2]

# Cap upper extreme values at the 98th percentile
quantil = lambda col: col.quantile([0.98])
columns = ['rooms', 'log_surface_total', 'log_surface_covered', 'bathrooms']
df_atp = X_clean[columns].apply(quantil, axis=0)
for col in list(df_atp.columns):
    X_clean.loc[X_clean[col] > df_atp[col].iloc[0], col] = df_atp[col].iloc[0]

# Define the response vector
y = X_clean['log_price']
X = X_clean.drop(columns=['log_price'])
print(X['l2'].unique())

# Build dummy variables for the categorical features
var_cat = X.select_dtypes(include=['object']).copy().columns
for col in var_cat:
    try:
        X = pd.concat([X.drop(col, axis=1),
                       pd.get_dummies(X[col], prefix=col, prefix_sep="_",
                                      drop_first=True, dummy_na=False)], axis=1)
    except Exception as e:
        print(col, "could not be processed")
        print(e)
display(X.iloc[0:10, ])

# Split the data set into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Fit the model to the train set
# (the deprecated normalize argument is omitted; it does not affect OLS predictions)
lm_model = LinearRegression()
lm_model.fit(X_train, y_train)

# Compute r2 over the train and test sets:
y_train_preds = lm_model.predict(X_train)
r2_train = r2_score(y_train, y_train_preds)
y_test_preds = lm_model.predict(X_test)
r2_test = r2_score(y_test, y_test_preds)
print("The r2 for the train set was {} and for the test set was {}".format(round(r2_train, 4), round(r2_test, 4)))
Finally, the model ends up explaining about 74% of the variability of the logarithm of the price, which is not bad considering the large amount of missing data in the dataset.
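As a final check, you can inspect which covariates weigh most in the fitted model; a quick sketch using the objects defined above:

# Rank coefficients by absolute magnitude; department dummies carry the "l2_" prefix
coefs = pd.Series(lm_model.coef_, index=X_train.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(15))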
Feel free to explore further how these tools support model building by taking a look at this repository on GitHub, where I have stored a fully reproducible example of everything explained here, including additional graphics and more modeling details.
Thanks for reading, and see you at the next map! :)