DEV Community: Magali

Evaluating Time Series Models

Magali — Wed, 02 Nov 2022 03:00:41 +0000

Model evaluation is a critical, yet often overlooked, step in data science. Much time is dedicated to obtaining, cleaning, and analyzing large amounts of structured and unstructured data in order to build models that may generate new insights. In the rush to extract value from data and to put such insights to work in a fast-paced world, the process of evaluating the performance of the model can be dismissed altogether. Recent examples showcase the need for model evaluation. Most notably, Zillow shut down its home buying business after it paid too much for houses by relying on an algorithm that could not accurately forecast prices during a period of rapid price changes. The algorithm did not perform well with new, changing data.

This blog post will highlight how to incorporate evaluation into modeling time series--such as real estate prices, FX rates, stock prices, retail sales, temperature readings, and numbers of subscribers--in Python.

Train-test splits

The first step is to split the data set into a training subset (to train the model) and a testing subset (to test the trained model). This is referred to as a 'train-test split', where the training data typically accounts for 80% of the sample and the test data accounts for 20% of the sample. In non-time series machine learning, the train-test split function partitions the data randomly.

With time series, however, the data is not split into random subsets. Rather, the testing subset consists of the observations towards the end of the dataset, so that the model is always evaluated on data that comes after the data it was trained on. For example, in a dataset with 60 records (let's say these are months), the training set would consist of the first 48 months and the test set would consist of the last 12 months. The test set is sometimes described as the 'hold-out' set since this data is withheld from the dataset that is used for fitting the model. Importantly, a model should not be trained on test data.

A function can be defined in Python to perform such a split on multiple time series, and the slicing notation can be used to slice the dataset accordingly.

# Define function to apply train_test_split to all selected dfs
def tt_split(df_list, names):
    tts_list = []

    for i, df in enumerate(df_list):
        train_data = df[:-12] #all data except last 12 months
        test_data = df[-12:] #last 12 months of data
        tts_list.extend([train_data, test_data])
        print(names[i],':', df.shape, ' // Train: ', train_data.shape,
             'Test: ', test_data.shape)
    return tts_list    #return list of train, test dfs

Once the data has been split into training and testing subsets, the developed model will be fit to the training set. Evaluation metrics relevant to the use case can then be assessed. Residuals are calculated on the training set, and forecasts are calculated on the test set.
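To make that last step concrete, here is a minimal sketch (not the model used in this post) that fits a straight-line trend to a training subset with NumPy, then computes residuals on the training data and forecast errors on the held-out test data. The synthetic series and the choice of a linear trend are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: 60 observations with a trend plus noise
rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-01", periods=60, freq="MS")
series = pd.Series(100 + 2.0 * np.arange(60) + rng.normal(0, 5, 60), index=dates)

# Chronological split: first 48 months for training, last 12 for testing
train, test = series[:-12], series[-12:]

# Fit a linear trend on the training data only
t_train = np.arange(len(train))
slope, intercept = np.polyfit(t_train, train.to_numpy(), deg=1)

# Residuals: actual minus fitted values on the training set
fitted = intercept + slope * t_train
residuals = train.to_numpy() - fitted

# Forecast errors: actual minus predicted values on the test set
t_test = np.arange(len(train), len(series))
forecast = intercept + slope * t_test
forecast_errors = test.to_numpy() - forecast

rmse_train = np.sqrt(np.mean(residuals ** 2))
rmse_test = np.sqrt(np.mean(forecast_errors ** 2))
print(f"Train RMSE: {rmse_train:.2f}  Test RMSE: {rmse_test:.2f}")
```

A test RMSE that is much larger than the training RMSE is a hint that the model does not carry over well to unseen data.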

Workflow visualization

A simplified workflow to develop a time series model could look like this, according to Google's Machine Learning guide:

Benchmark models

Another key step in the model validation process is to build a baseline (i.e., benchmark) model, followed by iterative modeling with different assumptions, inputs, hyper-parameters, and even different model types to compare results. At this point, models can be compared on relevant evaluation metrics (such as RMSE and AIC) and on how well they generalize to new data. A model that fits the training data closely but performs poorly on new data is over-fitting, while a model that performs poorly on both is under-fitting. This matters because models are developed on sample data, which is usually not fully representative of the new data that the model will encounter in the real world. The model should ultimately learn well enough that it can be applied to examples it did not see during the training phase.
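As a rough illustration of benchmarking, the sketch below compares a naive forecast (repeat the last observed value) against a drift forecast (extend the average historical change) using RMSE on a hold-out set. The data, the `rmse` helper, and the choice of these two simple forecasts are assumptions for illustration, not the models from the original analysis.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical data: 48 training months and 12 test months with a steady upward trend
train = 100 + 2.0 * np.arange(48)
test = 100 + 2.0 * np.arange(48, 60)

# Benchmark: the naive forecast repeats the last observed training value
naive_forecast = np.full(len(test), train[-1])

# Candidate: the drift forecast extends the average change seen in training
drift = (train[-1] - train[0]) / (len(train) - 1)
drift_forecast = train[-1] + drift * np.arange(1, len(test) + 1)

# Compare both models on the same hold-out data, best (lowest RMSE) first
results = {"naive": rmse(test, naive_forecast), "drift": rmse(test, drift_forecast)}
for name, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {score:.2f}")
```

Any more complex model should at least beat the benchmark on the hold-out set to justify its added complexity.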

Conclusion

Incorporating model evaluation practices--through train-test splits, building benchmark models, and iterating on them--is a good practice that should be part of every time series analysis, especially when such models inform decision-making.

How Bias Sneaks into Machine Learning

Magali — Tue, 20 Sep 2022 22:01:10 +0000

Machine learning offers a unique opportunity to transform how credit is analyzed for risk and is allocated by lenders. It can learn patterns in data to efficiently and accurately assess the credit risk of borrowers without reliance on traditional credit reporting systems. This is transformative because it means that machine learning can be applied in a range of environments--from banked to unbanked, underserved segments--to develop new ways to quickly determine borrowers' creditworthiness.

Despite the promising advantages, the application of machine learning in financial services can also lead to bad outcomes, such as introducing biases and reinforcing discrimination in lending decisions. Bias can sneak into machine learning at various stages of model development, and there are no built-in checks to detect it.

Example

What does this look like in practice? Let's walk through a simplified example. In this example, a machine learning model was trained on the German credit dataset available on UCI's Machine Learning Repository to classify creditworthy (1) and not-creditworthy (0) borrowers. The dataset has 1000 records, with 21 variables comprising numerical and categorical data. According to the model, the most important features for classifying a borrower as creditworthy relate to the borrower's checking account, terms of credit (amount and duration), age, and credit history, among other attributes.

According to the machine learning model, creditworthy borrowers are older (36.2 years old) than non-creditworthy borrowers (33.9 years old) on average. In particular, the average age of creditworthy males is higher than non-creditworthy males. The average age of creditworthy and non-creditworthy females is about the same. And, in general, older males are more likely to be classified as creditworthy and receive higher credit amounts than younger females.

Types of bias

Sample bias, also known as selection bias, was unknowingly introduced into the model because the age data provided as an input to train the model is not likely to be representative of the data the model will encounter going forward when it is used for the business application. Indeed, with a mean of 35.5 years and positive skew, the age data the model was trained on is not normally distributed. Data encountered in the real world is likely to include more younger and older borrowers.

Moreover, women are notably underrepresented in the dataset. In other words, the model is trained on a lot more male data and, thus, has better learned male borrowers' credit attributes.
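A quick way to surface this kind of imbalance is to look at group shares and outcome rates before training. The sketch below uses a tiny made-up sample; the column names `sex`, `age`, and `creditworthy` are hypothetical and do not match the raw UCI column names.

```python
import pandas as pd

# Hypothetical sample; in practice this would be the training data
df = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "male", "female", "male", "male"],
    "age": [45, 38, 29, 31, 52, 24, 36, 41],
    "creditworthy": [1, 1, 0, 0, 1, 1, 1, 0],
})

# Share of each group in the training data
group_share = df["sex"].value_counts(normalize=True)

# Group size, share classified as creditworthy, and mean age within each group
summary = df.groupby("sex").agg(
    n=("creditworthy", "size"),
    creditworthy_rate=("creditworthy", "mean"),
    mean_age=("age", "mean"),
)
print(group_share)
print(summary)
```

Large gaps in group size or outcome rates do not prove bias on their own, but they flag where the model has less data to learn from and where results deserve a closer look.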

There are also several other types of biases that can make their way into machine learning: exclusion bias, measurement bias, recall bias, observer bias, association bias, and racial bias.

Reducing Bias

Underrepresented data can lead to a machine learning model wrongly inferring attributes about the underrepresented segments, thus leading to unintended consequences such as exacerbating existing biases. Seeing how prevalent biases can be in machine learning, it is crucial for data scientists and analysts to incorporate bias testing to test assumptions, collect and prepare data fairly, and examine modeling decisions throughout the model development cycle.

Making Use of Zip Codes

Magali — Thu, 28 Jul 2022 20:48:00 +0000

Overview

I am continuing to learn Python coding and data science as a student in Flatiron School’s Online Data Science Bootcamp. My motivation is to explore how these skills can be applied to make better, more informed, and timely operating and policy decisions in finance and economics. I am already finding value in the application of these skills to real-world questions and challenges.

Business Problem

My second project involved working with King County, Washington’s house price dataset. This is a large dataset, with over 21,000 entries. Among these entries are house prices and location-related indicators such as latitude, longitude, and over 70 zip codes in King County. Since location can affect home prices, I was interested in exploring how house prices vary across the county. The challenge is that the dataset includes neither addresses nor city names, which makes the zip code data difficult to interpret if one does not know what area the zip code represents. Moreover, because zip codes were created to help postal workers deliver mail, they do not necessarily cleanly delineate an actual area that one would be able to reference on a map. Indeed, many zip codes have odd boundaries. Could the zip code data be made useful and interpretable in Python for this analysis?

Data Exploration

The first step is to confirm that price indeed varies by location. A simple scatterplot of longitude and latitude, with markers weighted by price, indicates that it does.

Next, I explored the zip code data, which does not have a normal distribution. I then plotted the average price by zip code, which further supports the view that location matters: homes in some zip codes have higher mean prices than homes in others. However, zip code numbers are labels rather than quantities, so there is no ordered relationship between zip codes and home prices.

Bringing in City Data

Since I, like most people, do not know what area each of these 70+ zip codes represents, I brought in city data obtained from King County GIS Open Data to match with zip codes. I have done similar matching of data previously, but never with Python, so this was a test of my new skills. I approached the challenge by creating a dictionary with zip codes as keys and cities as values; calling on the dictionary when applying the pandas DataFrame.replace method to the dataset; and, finally, creating a visualization of home prices by city. Check it out.

import pandas as pd

#Load primary dataframe that underlies the King County home price analysis.
data = pd.read_csv('data/kingcounty.csv')

#Load second dataframe with zipcode and city data, obtained from King County GIS Open Data.
df_zips = pd.read_csv('data/zipcodes.csv')

#Using this second dataframe, make a dictionary with key, values represented by zipcode, city.
zips_dictionary = dict(zip(df_zips.zipcode, df_zips.city))
zips_dictionary

{98001: 'Auburn',
98002: 'Auburn',
98003: 'Federal_Way',
98004: 'Bellevue',
98005: 'Bellevue',
98006: 'Bellevue',
98007: 'Bellevue',
98008: 'Bellevue',...}

#In primary dataframe named data, create new column named 'city' that contains zip codes as placeholder values.
data['city'] = data.zipcode

#Call on dictionary to replace zip code values in this new column with city names.
data.replace({'city': zips_dictionary}, inplace=True)

#Examine results
data.city.value_counts()

Seattle 8777
Renton 1584
Bellevue 1263
Kent 1197
Redmond 960
Kirkland 941
Auburn 907
Sammamish 778
Federal_Way 777
Issaquah 717
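As an alternative sketch, the same lookup can be done in one step with the pandas Series.map method. One practical difference is that map turns zip codes missing from the dictionary into NaN, whereas replace leaves the original zip code number in place, so gaps in the lookup table are easier to spot. The miniature dataframe and the zip code 99999 below are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the dataframe and dictionary used above
data = pd.DataFrame({
    "zipcode": [98001, 98004, 98003, 99999],
    "price": [300000, 900000, 350000, 400000],
})
zips_dictionary = {98001: "Auburn", 98003: "Federal_Way", 98004: "Bellevue"}

# map() looks up each zip code in the dictionary; unmatched codes become NaN
data["city"] = data["zipcode"].map(zips_dictionary)

# List zip codes that have no city so they can be reviewed
unmatched = data.loc[data["city"].isna(), "zipcode"].tolist()
print(data)
print("Unmatched zip codes:", unmatched)
```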

Creating Visualization of Mean Price by City

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

#Create figure
fig, ax = plt.subplots(figsize = (8,4))

# Plot mean price by city
mean_price_by_city = data.groupby('city')['price'].mean()
mean_price_by_city.sort_values(ascending=True).plot(kind='bar', ax=ax, color='powderblue', label='Mean Price by City')

# Plot mean price for King County as a horizontal line (reported as $504,333 in this analysis)
county_mean = data.price.mean()
ax.axhline(county_mean, color='mediumblue', label=f'Mean Price of King County: ${county_mean:,.0f}')

#Format y-axis: set limits, then format tick labels with thousands separators
ax.set_ylabel('Mean Price ($)', size=12)
ax.set_ylim([0, 1500000])
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{x:,.0f}'))

#Format x-axis
plt.xlabel('City', size=12)
plt.xticks(rotation=90)

#Add legend, title
plt.legend(loc='upper left', borderaxespad=0.2, edgecolor='white', fontsize=11)
fig.suptitle('Home Prices: County and City', fontsize=15)
fig.subplots_adjust(top=0.94)

#Remove chart borders
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()

Conclusion

By bringing in city data to match with each zip code, the variation in prices across the county can be more easily interpreted. The number of bins was reduced from over seventy to twenty or so, which reduces granularity but results in a grouping that is well known and understandable by the public. This is useful because subsequent analysis (for example, of price per square foot and structural home features) can examine differences within the county through the lens of cities.

FS Project 1: Charting with Python

Magali — Tue, 01 Jun 2021 20:17:02 +0000

Introduction

I am learning Python coding and data science because I am interested in applying these skills to work with bigger data sets and to analyze such data to make better, more informed, and timely operating and policy decisions in finance and economics.

As a student enrolled in Flatiron School's Data Science Bootcamp, I am enjoying learning these new skills and am already finding value in their application to solve real-world questions and challenges.

Business problem

My first project examines the movie industry for a client that is interested in entering the business. Descriptive data analysis shows that the movie industry is profitable but there is significant variation in performance across films and production studios. The client can use this analysis to understand the key trends in the movie industry, identify its main competitors, and determine the types of films they will be creating. This analysis also serves as a baseline for deeper dives on the movie industry.

Data and methodology

The project explored more than ten zipped, movie-related data sets from four sources. Data files provide a wide range of information over the last 20 years about individual movies' box office revenues, budgets, genres, as well as about production studios and associated cast and crew members.

The project applies exploratory data analysis and examines trends of key metrics over time. This provides an insightful overview of the evolution of the performance of the movie industry. Several data files contain useful information that complements the other files, while some data was duplicated. This required extensive cleaning and joining data sets. I have experience preparing and analyzing data sets, mostly in Excel; but I have never worked with so many large files simultaneously nor done this with Python code. It was fun to tackle this in a new way, and I particularly enjoyed analyzing the data and generating chart output that presents the results.

Charting with Python

How to make a chart with two y-axes

One of the first findings is that there are large outliers in the data and there is notable variation across movies for most indicators. Given these characteristics, I wanted to explore the gross revenue and ROI of the top grossing movies (i.e., those in the 99th percentile). A reasonable hypothesis is that the movies with the highest gross revenue would be among the ones with the highest ROI. This is not the case. To show this cleanly, I plotted the top movies' gross revenue and ROI on two different axes.

The chart shows that top grossing movies are profitable, but these movies do not necessarily have the highest ROI. Blockbuster box office performance does not translate to higher ROI, implying that cost control is an important determinant of the bottom line.

import matplotlib.pyplot as plt
import numpy as np

#Create figure
fig, ax1 = plt.subplots(figsize=(14,9))

#Assign chart variables
title = df['title']
ww_gross = df['worldwide_gross_m']
ROIp = df['ROIpct']

#Create a second (right) y-axis that shares the same x-axis as the first (left) y-axis
ax2 = ax1.twinx()

#Create standard bar chart of gross revenue on the left y-axis
ax1.bar(title, ww_gross, color='lightsteelblue')

#Add line plot of ROI to same chart on the right y-axis. Set markers to '.' and remove line.
ax2.plot(title, ROIp, marker = '.', markersize = 12, color='navy', linestyle='None')

#X-axis label formatting: rotate tick labels (centered by default)
ax1.tick_params(axis='x', labelrotation=90)

#Y-axis label formatting: Set labels and change colors of labels to match chart content
ax1.set_ylabel('Gross Revenue (Millions $)', color='gray')
ax2.set_ylabel('ROI (Percent)', color='navy')

#Y-axis tick marks: Set min, max, intervals
ax1.set_yticks(np.arange (0, 3250, 250))
ax2.set_yticks(np.arange (0, 3250, 250))

plt.show()
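The chart code above assumes that `df` already holds the top grossing movies with an ROI column. A minimal sketch of how such a dataframe could be prepared follows. The `budget_m` column name and the sample values are hypothetical, and the 0.60 quantile is used only because the sample is tiny; the analysis in this post used the 99th percentile.

```python
import pandas as pd

# Hypothetical sample; budget_m is an assumed column name (budget in millions of dollars)
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "budget_m": [200.0, 10.0, 150.0, 5.0, 80.0],
    "worldwide_gross_m": [1000.0, 90.0, 300.0, 2.0, 160.0],
})

# ROI in percent: profit relative to budget
movies["ROIpct"] = (
    (movies["worldwide_gross_m"] - movies["budget_m"]) / movies["budget_m"] * 100
)

# Keep only movies at or above the chosen gross revenue percentile
cutoff = movies["worldwide_gross_m"].quantile(0.60)
df = (
    movies[movies["worldwide_gross_m"] >= cutoff]
    .sort_values("worldwide_gross_m", ascending=False)
)

# The movie with the highest ROI is not necessarily among the top grossing ones
best_roi_title = movies.loc[movies["ROIpct"].idxmax(), "title"]
print(df[["title", "worldwide_gross_m", "ROIpct"]])
print("Highest ROI overall:", best_roi_title)
```

In this made-up sample, the low-budget movie with the highest ROI falls below the revenue cutoff, which mirrors the finding described above.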

How to make a stacked percentage bar chart

Another valuable takeaway from the data is the share of movies that are profitable versus unprofitable. I was curious to see what percentage of movies fall within given profitability ranges, so I set out to make a stacked percentage chart. This analysis shows that the movie industry is a profitable, but challenging, business. Forty percent of movies generate a healthy return on investment exceeding 100%, while 25% generate positive but lower returns below 100%. Notably, 35% of movies lose money.

After some initial troubleshooting on my code, I approached this as follows:

import matplotlib.pyplot as plt
import numpy as np

#Prepare data for stacked 100% bar chart. Create df grouped by year and count of ROI buckets. Convert count of ROI buckets to percent of total count.
df = ((df_roi.groupby(['year', 'ROI_buckets'])['ROI_buckets'].count()
                       /df_roi.groupby(['year'])['ROI_buckets'].count()))*100

#Set color map and select number of colors from color map
#(plt.get_cmap is used because cm.get_cmap was removed in recent Matplotlib releases)
viridis = plt.get_cmap('viridis', 9)

#Create stacked bar chart
ax = df.unstack().plot.bar(stacked = True, figsize=(14,10), color=viridis.colors)

#Set title, x-label, y-label
ax.set_title('ROI - Movies in 90th percentile', fontsize = 18)
ax.set_xlabel('Year', fontsize = 14)
ax.set_ylabel('Percent of movies (%)', fontsize = 14)
#Set y-axis ticks: min, max, interval
ax.yaxis.set_ticks(np.arange(0, 110, 10))

#Set tick marks on right side.
ax.tick_params(labeltop=False, labelright=True)

#Reverse legend order and set legend location
handles, labels = ax.get_legend_handles_labels()
ax.legend(reversed(handles), reversed(labels), loc='center left', bbox_to_anchor=(1.05,0.5))

plt.show()
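The code above relies on an existing ROI_buckets column in df_roi. One way to build such a column is with pd.cut, sketched below with three illustrative buckets. The sample data, the `ROIpct` column in this dataframe, and the bucket edges and labels are assumptions; the original analysis may have used different ranges.

```python
import pandas as pd

# Hypothetical data; df_roi and ROI_buckets follow the naming used above
df_roi = pd.DataFrame({
    "year": [2018, 2018, 2018, 2018, 2019, 2019],
    "ROIpct": [-40.0, 50.0, 250.0, 120.0, -10.0, 300.0],
})

# Three illustrative buckets: loss, positive return up to 100%, return above 100%
# Bins are closed on the right, so an ROI of exactly 0 lands in the first bucket
bins = [-float("inf"), 0, 100, float("inf")]
labels = ["Loss", "0-100%", "Over 100%"]
df_roi["ROI_buckets"] = pd.cut(df_roi["ROIpct"], bins=bins, labels=labels)

# Share of movies per bucket within each year, in percent
shares = pd.crosstab(df_roi["year"], df_roi["ROI_buckets"], normalize="index") * 100
print(shares)
```

The crosstab call is an alternative to the two groupby counts: normalize="index" divides each row by its total, so every year sums to 100%.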

Final insights

Looking forward to continuing to learn and share data science insights--more to come!