Ty Mick

Posted on • Originally published at tymick.me

Natural Language Processing for Loan Risk

  1. The story so far
  2. Exploratory data analysis
  3. Imputing missing values
  4. Optimizing data types
  5. Creating document vectors
  6. Building the pipeline
  7. Evaluating the model
  8. Next steps

The story so far

A few months ago, I built a neural network regression model to predict loan risk, training it with a public dataset from LendingClub. Then I built a public API with Flask to serve the model's predictions.

Then last month, I decided to put my model to the test and found out that my model can pick grade A loans better than LendingClub!

But I'm not done. Now that I've learned the fundamentals of natural language processing (I highly recommend Kaggle's course on the subject), I'm going to see if I can eke out a bit more predictive power using a couple of freeform text fields in the dataset: title and desc (description).

import joblib

prev_notebook_folder = "../input/building-a-neural-network-to-predict-loan-risk/"
loans = joblib.load(prev_notebook_folder + "loans_for_nlp.joblib")
num_loans = loans.shape[0]
print(f"This dataset includes {num_loans:,} loans.")

This dataset includes 1,110,171 loans.

loans.head()

loan_amnt term emp_length home_ownership annual_inc purpose dti delinq_2yrs cr_hist_age_mths fico_range_low ... pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit fraction_recovered issue_d title desc
0 3600.0 36 months 10+ years MORTGAGE 55000.0 debt_consolidation 5.91 0.0 148 675.0 ... 0.0 0.0 178050.0 7746.0 2400.0 13734.0 1.0 Dec-2015 Debt consolidation NaN
1 24700.0 36 months 10+ years MORTGAGE 65000.0 small_business 16.06 1.0 192 715.0 ... 0.0 0.0 314017.0 39475.0 79300.0 24667.0 1.0 Dec-2015 Business NaN
2 20000.0 60 months 10+ years MORTGAGE 63000.0 home_improvement 10.78 0.0 184 695.0 ... 0.0 0.0 218418.0 18696.0 6200.0 14877.0 1.0 Dec-2015 NaN NaN
4 10400.0 60 months 3 years MORTGAGE 104433.0 major_purchase 25.37 1.0 210 695.0 ... 0.0 0.0 439570.0 95768.0 20300.0 88097.0 1.0 Dec-2015 Major purchase NaN
5 11950.0 36 months 4 years RENT 34000.0 debt_consolidation 10.20 0.0 338 690.0 ... 0.0 0.0 16900.0 12798.0 9400.0 4000.0 1.0 Dec-2015 Debt consolidation NaN
5 rows × 69 columns

This post, like its predecessors, was adapted from a Jupyter Notebook, so feel free to fork my notebook on Kaggle or GitHub if you'd like to follow along.

Exploratory data analysis

There isn't too much exploratory data analysis left to do after how thoroughly I cleaned the data in my first post, but I do have a few quick questions about the title and desc fields I'd like to answer before I move on.

  • How many loans use each field?
  • Have these fields always been included in the loan application?
  • What is the typical length of each field (in number of words)?

nlp_cols = ["title", "desc"]

loans[nlp_cols].describe()

title desc
count 1097288 71967
unique 35863 70927
top Debt consolidation
freq 573992 23

If the most frequent desc value is empty (or maybe just whitespace), perhaps I need to convert all empty or whitespace-only values to NaN before continuing.

import re
import numpy as np

for col in nlp_cols:
    # Raw string avoids an invalid-escape warning; np.NaN was removed in NumPy 2.0.
    replace_empties = lambda x: x if re.search(r"\S", x) else np.nan
    loans[col] = loans[col].map(replace_empties, na_action="ignore")

description = loans[nlp_cols].describe()
description

title desc
count 1097288 71943
unique 35863 70925
top Debt consolidation Borrower added on 03/17/14 > Debt consolidat...
freq 573992 9

Thankfully that didn't remove too many values, but this "Borrower added on [date]" prefix worries me now. I'll deal with that a little later.
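For what it's worth, the same blank-to-NaN cleanup can be sketched without a regular expression. This is a minimal example on a made-up Series (the `toy` values are assumptions, not rows from the dataset), using pandas string methods and `Series.mask`:

```python
import pandas as pd

# Toy stand-in for a text column (assumed values, not the real data).
toy = pd.Series(["Debt consolidation", "", "   ", None, " Pay off bills "])

# True wherever the value is a string containing nothing but whitespace.
is_blank = toy.str.strip().eq("")

# mask() swaps the flagged entries for NaN and leaves everything else untouched.
cleaned = toy.mask(is_blank)

blank_or_missing = int(cleaned.isna().sum())
print(cleaned)
print(f"{blank_or_missing} of {len(cleaned)} values are now missing.")
```

Values that were already missing stay missing, so this should behave like the `map(..., na_action="ignore")` version above.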

for col in nlp_cols:
    percentage = int(description.at["count", col] / num_loans * 100)
    print(f"`{col}` is used in {percentage}% of loans in the dataset.")

percentage = int(description.at["freq", "title"] / num_loans * 100)
print(f'The title "Debt consolidation" is used in {percentage}% of loans.')

`title` is used in 98% of loans in the dataset.
`desc` is used in 6% of loans in the dataset.
The title "Debt consolidation" is used in 51% of loans.

These fields may not be as useful as I had previously thought. Even though there are 35,863 unique titles used across the dataset, 51% of loans just use "Debt consolidation". Maybe the titles are more descriptive in the other 49%?

And the desc field is only used with 6% of loans.

Now to check and see when these fields were introduced.

# `issue_d` is just the month and year the loan was issued, by the way.
loans["issue_d"] = loans["issue_d"].astype("datetime64[ns]")

print("Total date range:")
print(loans["issue_d"].agg(["min", "max"]))
print("\n`title` date range:")
print(loans[["title", "issue_d"]].dropna(axis="index")["issue_d"].agg(["min", "max"]))
print("\n`desc` date range:")
print(loans[["desc", "issue_d"]].dropna(axis="index")["issue_d"].agg(["min", "max"]))

Total date range:
min   2012-08-01
max   2018-12-01
Name: issue_d, dtype: datetime64[ns]

`title` date range:
min   2012-08-01
max   2018-12-01
Name: issue_d, dtype: datetime64[ns]

`desc` date range:
min   2012-08-01
max   2016-07-01
Name: issue_d, dtype: datetime64[ns]

Neither of these fields was introduced late, but LendingClub may have stopped using the desc field after July 2016, which covers roughly the last two and a half years of the dataset.
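The min/max check only shows the endpoints. A finer-grained look would be the share of loans with a description per issue year. Here's a minimal sketch on a toy frame (the rows below are made up for illustration; only the column names `issue_d` and `desc` come from the dataset):

```python
import pandas as pd

# Toy frame with the same two columns (assumed values).
toy = pd.DataFrame(
    {
        "issue_d": pd.to_datetime(
            ["2013-05-01", "2014-02-01", "2016-07-01", "2017-03-01", "2018-12-01"]
        ),
        "desc": ["pay off cards", "wedding", "new roof", None, None],
    }
)

# Fraction of loans per issue year that have a non-missing description.
share_by_year = toy["desc"].notna().groupby(toy["issue_d"].dt.year).mean()
print(share_by_year)
```

On the real data, a sharp drop to zero in one year would support the idea that the field was retired rather than just falling out of favor.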

Now I'll take a closer look at values in these fields.

import pandas as pd

with pd.option_context("display.min_rows", 50):
    print(loans["title"].value_counts())

Debt consolidation                       573992
Credit card refinancing                  214423
Home improvement                          64028
Other                                     56166
Major purchase                            20734
Medical expenses                          11454
Debt Consolidation                        10638
Business                                  10142
Car financing                              9660
Moving and relocation                      6806
Vacation                                   6707
Home buying                                5097
Consolidation                              4069
debt consolidation                         3310
Credit Card Consolidation                  1607
consolidation                              1538
Debt Consolidation Loan                    1265
Consolidation Loan                         1260
Personal Loan                              1040
Credit Card Refinance                      1020
Home Improvement                           1016
Credit Card Payoff                          991
Consolidate                                 947
Green loan                                  626
Loan                                        621
                                          ...
House Buying Consolidation                    1
Credit Card Deby                              1
Crdit cards                                   1
"CCC"                                         1
Loan to Moving & Relocation Expense           1
BILL PAYMENT                                  1
creit card pay off                            1
Auto Repair & Debt Consolidation              1
BMW 2004                                      1
Moving Expenses - STL to PHX                  1
 Pay off Bills                                1
Room addition                                 1
Optimistic                                    1
Consolid_loan2                                1
ASSISTANCE NEEDED                             1
My bail out                                   1
myfirstloan                                   1
second home                                   1
Just consolidating credit cards               1
Financially Sound Loan                        1
refinance loans and home improvements         1
credit cart refincition                       1
Managable Repayment Plan                      1
ccdebit                                       1
Project Pay Off Debt                          1
Name: title, Length: 35863, dtype: int64

Interesting. It seems like there's plenty of variety in loan titles in the other 49%. A lot of them seem to directly correspond to the purpose categorical field, but not so many as to make this field useless, I think.
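Some of that variety is only capitalization and stray whitespace ("Debt consolidation" vs. "Debt Consolidation" vs. "debt consolidation"). As a rough way to gauge how much, here's a sketch that compares unique counts before and after a simple normalization. The sample titles are a handful picked from the output above, and normalizing like this is an assumption for illustration, not a step I take in the pipeline:

```python
import pandas as pd

# A few titles taken from the value counts above.
titles = pd.Series(
    [
        "Debt consolidation",
        "Debt Consolidation",
        "debt consolidation",
        " Pay off Bills",
        "Consolidation",
        "consolidation",
    ]
)

# Trim surrounding whitespace and ignore case before counting.
normalized = titles.str.strip().str.lower()

unique_before = titles.nunique()
unique_after = normalized.nunique()
print(f"{unique_before} unique titles before, {unique_after} after normalizing.")
```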

Side note: I discovered at one point when perusing this column that someone entered the Konami Code as the title of their loan application, and their inclusion in this dataset means that the code apparently worked for them—they got the loan.

loans[loans["title"] == "up up down down left right left right ba"][
    ["loan_amnt", "title", "issue_d"]
]

loan_amnt title issue_d
1856340 12000.0 up up down down left right left right ba 2013-04-01

loans["desc"].value_counts()

  Borrower added on 03/17/14 > Debt consolidation<br>    9
  Borrower added on 01/15/14 > Debt consolidation<br>    7
  Borrower added on 02/19/14 > Debt consolidation<br>    7
  Borrower added on 03/10/14 > Debt consolidation<br>    7
  Borrower added on 01/29/14 > Debt consolidation<br>    7
  ..
  Borrower added on 01/14/13 > Credit Card consolidation<br>    1
  Borrower added on 03/14/14 > Debts consolidation and cash for minor improvements on condominium<br>    1
  Borrower added on 03/02/14 > I lost a house and need to pay taxes nd have credit card debt thatI already pay $350 a month on and it goes nowhere.<br>    1
  Borrower added on 04/09/13 > I want to put in a conscious effort in eliminating my debt by converting high interest cards to a fixed payment that can be effectively managed by me.<br>    1
  Borrower added on 09/18/12 > Want to become debt free, because of several circumstances and going back to school I got into debt. I want to pay for what I have purchased without it having an effect on my credit. That is why I want to consolidate my debt and become debt free!<br>    1
Name: desc, Length: 70925, dtype: int64

Do all these descriptions start with "Borrower added on [date]"?

pattern = r"^\s*Borrower added on \d\d/\d\d/\d\d > "
prefix_count = (
    loans["desc"]
    .map(lambda x: True if re.search(pattern, x, re.I) else None, na_action="ignore")
    .count()
)
print(
    f"{prefix_count:,} loan descriptions begin with that pattern.",
    f"({description.loc['count', 'desc'] - prefix_count:,} do not.)",
)

71,858 loan descriptions begin with that pattern. (85 do not.)
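The same prefix test can also be written with pandas' vectorized string methods instead of `map` and a lambda. This is a minimal sketch on three made-up values (the `descs` Series is an assumption), using `Series.str.match`, which anchors at the start of each string:

```python
import pandas as pd

prefix_pattern = r"\s*Borrower added on \d\d/\d\d/\d\d > "

# Toy descriptions: one with the prefix, one without, one missing.
descs = pd.Series(
    [
        "  Borrower added on 03/17/14 > Debt consolidation<br>",
        "consolidate my debt",
        None,
    ]
)

# na=False treats missing descriptions as "no prefix" instead of NaN.
has_prefix = descs.str.match(prefix_pattern, case=False, na=False)

with_prefix = int(has_prefix.sum())
without_prefix = int((descs.notna() & ~has_prefix).sum())
print(f"{with_prefix} with the prefix, {without_prefix} without.")
```

The second mask, `descs.notna() & ~has_prefix`, is the vectorized counterpart of picking out the descriptions that don't start with the prefix.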

Well now I need to check those other 85.

other_desc_map = loans["desc"].map(
    lambda x: False if pd.isna(x) or re.search(pattern, x, re.I) else True
)
other_descs = loans["desc"][other_desc_map]
other_descs.value_counts()

Debt Consolidation    2
I would like to pay off 3 different credit cards, at 12%, 17% and 22% (after initial 0% period is up).  It would be great to have everything under one loan, making it easier to pay off.  Also, once I've paid off or down the loan, I can start looking into buying a house.    1
loan will be used for paying off credit card.    1
This loan will be used to consolidate high interest credit card debt.    Over the course of this past year my wife and I had our first child, purchase a home and received a large bonus from work.  With the new home and the child on the way I chose to spread my tax withholdings on the bonus to all checks received in 2008 this caused my monthly income to fall by $1500.  This in combination with an unexpected additional down payment for our home of $17,000 with only a weeks notice we were force to dip into our Credit Cards for the past several months.    Starting January 1, 2009 I will be able to readjust my tax withholding and start to pay off the Credit Card debt we have racked up.  This loan will help lower the interest rate during the repayment period and give one central place for payment.  My wife and I have not missed a payment or been late for the past 5 years.  My fico score is 670 mainly due to several low limit credit cards near their max.  I manage the international devision of a software company and my wife is a kindergarten teacher, combined we make 140K a year.    Thank you for your consideration and I look forward to working with you.    1
to pay off different credit cards to consolidate my debt, so I can have just one monthly payment.    1
..
Hello, I would like to consolidate my debt into a lower more convenient payment. I have a very stable career of more than 20 years with the same company. My community is in a part of the country that made it through the last few years basically unscathed and has a very promising future.<br>Thank You<br>    1
consolidate my debt    1
I am looking to pay off my credit card debts.    1
This loan is to help me payoff my credit card debt. I've done what I can to negotiate lower rates, but the interest is killing me and my monthly payments are basically just taking care of interest. Paying them off will give me the fresh start I need on my way to financial independence. Thank you.    1
I have been in business for a year and want to eliminate some personal debt and use the remainder of the loan to take care of business expenses. Also lessening the number of trade lines I have open puts me in a better position to pursue business loans since it will  be based on my personal credit. A detailed report can be created to show where exactly the funds will go and this can be provided at any time during the course of the loan.    1
Name: desc, Length: 84, dtype: int64

It looks like the borrower may be able to add information to the description at different points in time. I should check and see if any of those dates come after the actual issue date of the loan.

from datetime import datetime, date

for row in loans[["desc", "issue_d"]].itertuples():
    if not pd.isna(row.desc):
        month_after_issue = date(
            day=row.issue_d.day,
            month=row.issue_d.month % 12 + 1,
            year=row.issue_d.year + row.issue_d.month // 12,
        )

        date_strings = re.findall(r"\d\d/\d\d/\d\d", row.desc)
        dates = []
        for string in date_strings:
            try:
                dates.append(datetime.strptime(string, "%m/%d/%y").date())
            except ValueError:  # the match isn't a valid calendar date
                continue

        for d in dates:
            if d >= month_after_issue:
                print(f"{row.issue_d} – {row.desc}")
                break

2014-01-01 00:00:00 –   Borrower added on 01/08/14 > I am tired of making monthly payments and getting nowhere.  With your help, except for my mortgage, I intend to be completely debt free by 12/31/2016.<br>
2014-01-01 00:00:00 –   Borrower added on 01/08/14 > We have been engaged for  2 1/2yrs and wanted to bring out blended family together as one. We are set to get married on 03/22/14 and we are paying for it on our own. We saved the majority of the budget unfortunately there were a few unexpected cost that we still need help with.<br>
2014-01-01 00:00:00 –   Borrower added on 01/06/14 > I am getting married 04/05/2014 and I want to have a cushion for expenses just in case.<br>
2014-01-01 00:00:00 – BR called in to push payment date to 09/19/14 because of not having the exact amount of funds in their bank account.  Payment was processing. Was able to cancel. It is within grace period.
2014-01-01 00:00:00 –   Borrower added on 01/01/14 > This loan is to consolidate my credit cards debt. I made one year this past  11/28/2013 at my current job. I considered to have job security because I'm a good employee. I make all may credit cards payments on time.<br>
2013-05-01 00:00:00 –   Borrower added on 04/27/13 > My father passed away 05/12/2012 and I had to pay for the funeral.  My mother could not afford it.  He was not ill so I could not have planned it.  I paid with what I had in my savings and the rest I had to pay with my credit cards.  I would like to pay off the CC &amp; pay one monthly payment.<br><br>  Borrower added on 04/27/13 > My paerents own the house so I do not pay rent.    The utilities, insurance and taxes, etc my mother pays.  She can afford that.  I help when needed.<br>
2013-02-01 00:00:00 –   Borrower added on 02/10/13 > I am getting married in a week (02/17/2013) and have made some large purchases across my credit cards.  I would like to consolidate all of my debt with this low rate loan.<br><br> Borrower added on 02/10/13 > I will be getting married in a week (02/17/13) and have had to make some large purchases on my CC. I am financially sound otherwise with low debt obligations.<br>
2012-12-01 00:00:00 –   Borrower added on 12/10/12 > Approximately 1 year ago I had a highefficency furnace /AC installed.  The installing Co. used GECRB to get me a loan.  If I payoff the loan within one year, I pay no interest.  The interest rate if not payed by 12/23/2012 is 26.99%.  A 6.62% rate sounds a lot better.<br>
2012-11-01 00:00:00 –   Borrower added on 11/19/12 > Looking to finish off consolidating the rest of my bills and lower my payments on my exsisting loan. Thanks!!!<br><br>  Borrower added on 11/20/12 > Thanks again for everyone who has invested thus far. With this loan it will give me the ability to have only one payment monthly besides utilities and I will be almost debt free by my wedding date of 12/13/14!! Thanks again everyone!<br>
2012-10-01 00:00:00 –   Borrower added on 10/22/12 > Need money by 10/26/2012 to purchase property on discounted APR.<br>

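Good. None of the "Borrower added on" dates falls after the month the loan was issued. The rows flagged above only come up because the description itself mentions another date: usually a future event like a wedding or a payoff deadline, and sometimes a four-digit year such as 05/12/2012, which my two-digit pattern reads as 05/12/20.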
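Two details of that check are easy to miss, so here's a small standalone sketch of both. The helper name `first_of_next_month` is hypothetical (the loop above does the same arithmetic inline), and it assumes the day of the month exists in the following month, which holds here because `issue_d` is always the first.

```python
import re
from datetime import date, datetime


def first_of_next_month(d):
    """Shift a date forward one month, rolling December over into January."""
    return date(
        year=d.year + d.month // 12,  # only December (12 // 12 == 1) bumps the year
        month=d.month % 12 + 1,  # 1..11 -> 2..12, and 12 -> 1
        day=d.day,
    )


january = first_of_next_month(date(2014, 12, 1))
february = first_of_next_month(date(2014, 1, 1))

# The two-digit-year pattern stops matching partway through a four-digit year.
matches = re.findall(r"\d\d/\d\d/\d\d", "completely debt free by 12/31/2016.")
parsed = [datetime.strptime(m, "%m/%d/%y").date() for m in matches]

print(january, february)
print(matches, parsed)
```

The second half shows why a description mentioning 12/31/2016 gets parsed as a date in 2020: the pattern captures only `12/31/20`, and `%y` maps `20` to 2020.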

Now, to clean these desc values up a bit, I'm going to remove the "Borrower added on [date] >" prefixes and the <br> tags, since those don't add value to the description content.

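def clean_desc(desc):
    """Strip "Borrower added on [date] > " prefixes and <br> tags from a description."""
    if pd.isna(desc):
        return desc
    # Each prefix or tag becomes a single space so neighboring words stay separated.
    return re.sub(r"\s*Borrower added on \d\d/\d\d/\d\d > |<br>", " ", desc).strip()


loans["desc"] = loans["desc"].map(clean_desc)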
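One of the descriptions printed earlier also contains an HTML entity (`&amp;`), and removing prefixes can leave runs of spaces behind. As a hedged extension, here's a sketch of a slightly more thorough cleaner. `clean_desc_extra` is a hypothetical name, it isn't what the rest of this post uses, and it only handles strings (no NaN check):

```python
import html
import re

PREFIX_OR_BR = re.compile(r"\s*Borrower added on \d\d/\d\d/\d\d > |<br>")


def clean_desc_extra(desc):
    """Remove prefixes and <br> tags, decode HTML entities, collapse whitespace."""
    text = PREFIX_OR_BR.sub(" ", desc)
    text = html.unescape(text)  # "&amp;" -> "&"
    return re.sub(r"\s+", " ", text).strip()


# Shortened version of a description from the output earlier in this post.
sample = (
    "  Borrower added on 04/27/13 > I would like to pay off the CC &amp; "
    "pay one monthly payment.<br><br>  Borrower added on 04/27/13 > "
    "I help when needed.<br>"
)
cleaned_sample = clean_desc_extra(sample)
print(cleaned_sample)
```

Whether the extra steps matter depends on the tokenizer used later; many tokenizers ignore repeated whitespace anyway.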

Imputing missing values

Since only 2% of loans in this set are missing a title, and since most titles simply copy the loan's purpose, I'm going to impute missing titles with their loan's purpose.

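# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may not update the DataFrame.
loans["title"] = loans["title"].fillna(
    loans["purpose"].map(lambda x: x.replace("_", " ").capitalize())
)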
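To make the effect concrete, here's the same imputation on a three-row toy frame (the rows are made up; only the column names and purpose values mirror the dataset). It uses the vectorized `.str` methods instead of `map`, which is just a stylistic alternative:

```python
import pandas as pd

# Toy frame: one loan has a title, two don't (assumed values).
toy = pd.DataFrame(
    {
        "purpose": ["debt_consolidation", "home_improvement", "small_business"],
        "title": ["Paying off my cards", None, None],
    }
)

# "home_improvement" -> "Home improvement", matching the style of the common titles.
fallback = toy["purpose"].str.replace("_", " ").str.capitalize()

# Existing titles are kept; only the missing ones take the fallback.
toy["title"] = toy["title"].fillna(fallback)
imputed_titles = toy["title"].tolist()
print(imputed_titles)
```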

Since only 6% of loans use a description, I'll just impute missing descriptions with an empty string. I'm going to wait and include that as a pipeline step a little later, though.

Optimizing data types

I'd really love to get right to the fun part, converting these text fields into document vectors, but I ran into a problem the first several times I tried doing so. Manually adding two sets of 300-dimensional vectors to this 1,110,171-row DataFrame caused its size in memory to skyrocket, exhausting the 16GB Kaggle gives me.
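A back-of-the-envelope calculation shows why. This sketch assumes each vector component is stored as its own dense numeric column, and it ignores pandas overhead and any temporary copies made while the columns are being added (both assumptions on my part):

```python
num_loans = 1_110_171  # rows in the dataset
dims_per_field = 300  # vector length per text field
num_fields = 2  # title and desc


def vector_bytes(bytes_per_value):
    """Raw size of all the vector components at a given numeric width."""
    return num_loans * dims_per_field * num_fields * bytes_per_value


float64_bytes = vector_bytes(8)
float32_bytes = vector_bytes(4)

print(f"float64: {float64_bytes / 1e9:.2f} GB")
print(f"float32: {float32_bytes / 1e9:.2f} GB")
```

That works out to roughly 5.3 GB at 64-bit precision and half that at 32-bit, before counting the original 69 columns or any intermediate copies, so it's easy to see how 16GB disappears.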

My first attempt to fix this was optimizing my data types, which still didn't solve the problem on its own, but it's a worthwhile step to take anyway.
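As a quick illustration of what downcasting buys, here's a minimal sketch with `pd.to_numeric` on two toy Series (the values are borrowed from the first rows of `loan_amnt` and `dti` shown earlier; the variable names are mine):

```python
import pandas as pd

# Whole-number amounts that happen to be stored as float64.
amounts = pd.Series([3600.0, 24700.0, 20000.0])
# Genuinely fractional values.
ratios = pd.Series([5.91, 16.06, 10.78])

# "unsigned" picks the smallest unsigned integer type that holds every value.
small_amounts = pd.to_numeric(amounts, downcast="unsigned")
# "float" goes no smaller than float32.
small_ratios = pd.to_numeric(ratios, downcast="float")

for name, before, after in [
    ("amounts", amounts, small_amounts),
    ("ratios", ratios, small_ratios),
]:
    print(
        f"{name}: {before.dtype} -> {after.dtype}, "
        f"{before.memory_usage(index=False)} -> {after.memory_usage(index=False)} bytes"
    )
```

Note that `float32` keeps only about seven significant digits, which is plenty for ratios and percentages but worth keeping in mind for very large dollar amounts.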

After removing the issue_d column, which is no longer needed, the dataset contains five types of data: float, integer, ordinal, (unordered) categorical, and text.

from pandas.api.types import CategoricalDtype


loans = loans.drop(columns=["issue_d"])

float_cols = ["annual_inc", "dti", "inv_mths_since_last_delinq",
    "inv_mths_since_last_record", "revol_util", "inv_mths_since_last_major_derog",
    "annual_inc_joint", "dti_joint", "bc_util", "inv_mo_sin_rcnt_rev_tl_op",
    "inv_mo_sin_rcnt_tl", "inv_mths_since_recent_bc", "inv_mths_since_recent_bc_dlq",
    "inv_mths_since_recent_inq", "inv_mths_since_recent_revol_delinq", "pct_tl_nvr_dlq",
    "percent_bc_gt_75", "fraction_recovered"]
int_cols = ["loan_amnt", "delinq_2yrs", "cr_hist_age_mths", "fico_range_low",
    "fico_range_high", "inq_last_6mths", "open_acc", "pub_rec", "revol_bal",
    "total_acc", "collections_12_mths_ex_med", "acc_now_delinq", "tot_coll_amt",
    "tot_cur_bal", "total_rev_hi_lim", "acc_open_past_24mths", "avg_cur_bal",
    "bc_open_to_buy", "chargeoff_within_12_mths", "delinq_amnt", "mo_sin_old_il_acct",
    "mo_sin_old_rev_tl_op", "mort_acc", "num_accts_ever_120_pd", "num_actv_bc_tl",
    "num_actv_rev_tl", "num_bc_sats", "num_bc_tl", "num_il_tl", "num_op_rev_tl",
    "num_rev_accts", "num_rev_tl_bal_gt_0", "num_sats", "num_tl_120dpd_2m",
    "num_tl_30dpd", "num_tl_90g_dpd_24m", "num_tl_op_past_12m", "pub_rec_bankruptcies",
    "tax_liens", "tot_hi_cred_lim", "total_bal_ex_mort", "total_bc_limit",
    "total_il_high_credit_limit"]
ordinal_cols = ["emp_length"]
category_cols = ["term", "home_ownership", "purpose", "application_type"]
text_cols = nlp_cols

size_metrics = pd.DataFrame(
    {
        "previous_dtype": loans.dtypes,
        "previous_size": loans.memory_usage(index=False, deep=True),
    }
)
previous_size = loans.memory_usage(deep=True).sum()


for col_name in float_cols:
    loans[col_name] = pd.to_numeric(loans[col_name], downcast="float")

for col_name in int_cols:
    loans[col_name] = pd.to_numeric(loans[col_name], downcast="unsigned")

emp_length_categories = ["< 1 year", "1 year", "2 years", "3 years", "4 years",
    "5 years", "6 years", "7 years", "8 years", "9 years", "10+ years"]
emp_length_type = CategoricalDtype(categories=emp_length_categories, ordered=True)
loans["emp_length"] = loans["emp_length"].astype(emp_length_type)

for col_name in category_cols:
    loans[col_name] = loans[col_name].astype("category")


current_size = loans.memory_usage(deep=True).sum()
reduction = (previous_size - current_size) / previous_size
print(f"Reduced DataFrame size in memory by {int(reduction * 100)}%.")

size_metrics["current_dtype"] = loans.dtypes
size_metrics["current_size"] = loans.memory_usage(index=False, deep=True)
pd.options.display.max_rows = 100
size_metrics
Reduced DataFrame size in memory by 69%.
column previous_dtype previous_size current_dtype current_size
loan_amnt float64 8881368 uint16 2220342
term object 73271286 category 1110383
emp_length object 71853397 category 1111197
home_ownership object 69841784 category 1110700
annual_inc float64 8881368 float32 4440684
purpose object 79927721 category 1111750
dti float64 8881368 float32 4440684
delinq_2yrs float64 8881368 uint8 1110171
cr_hist_age_mths int64 8881368 uint16 2220342
fico_range_low float64 8881368 uint16 2220342
fico_range_high float64 8881368 uint16 2220342
inq_last_6mths float64 8881368 uint8 1110171
inv_mths_since_last_delinq float64 8881368 float32 4440684
inv_mths_since_last_record float64 8881368 float32 4440684
open_acc float64 8881368 uint8 1110171
pub_rec float64 8881368 uint8 1110171
revol_bal float64 8881368 uint32 4440684
revol_util float64 8881368 float32 4440684
total_acc float64 8881368 uint8 1110171
collections_12_mths_ex_med float64 8881368 uint8 1110171
inv_mths_since_last_major_derog float64 8881368 float32 4440684
application_type object 74360578 category 1110384
annual_inc_joint float64 8881368 float32 4440684
dti_joint float64 8881368 float32 4440684
acc_now_delinq float64 8881368 uint8 1110171
tot_coll_amt float64 8881368 uint32 4440684
tot_cur_bal float64 8881368 uint32 4440684
total_rev_hi_lim float64 8881368 uint32 4440684
acc_open_past_24mths float64 8881368 uint8 1110171
avg_cur_bal float64 8881368 uint32 4440684
bc_open_to_buy float64 8881368 uint32 4440684
bc_util float64 8881368 float32 4440684
chargeoff_within_12_mths float64 8881368 uint8 1110171
delinq_amnt float64 8881368 uint32 4440684
mo_sin_old_il_acct float64 8881368 uint16 2220342
mo_sin_old_rev_tl_op float64 8881368 uint16 2220342
inv_mo_sin_rcnt_rev_tl_op float64 8881368 float32 4440684
inv_mo_sin_rcnt_tl float64 8881368 float32 4440684
mort_acc float64 8881368 uint8 1110171
inv_mths_since_recent_bc float64 8881368 float32 4440684
inv_mths_since_recent_bc_dlq float64 8881368 float32 4440684
inv_mths_since_recent_inq float64 8881368 float32 4440684
inv_mths_since_recent_revol_delinq float64 8881368 float32 4440684
num_accts_ever_120_pd float64 8881368 uint8 1110171
num_actv_bc_tl float64 8881368 uint8 1110171
num_actv_rev_tl float64 8881368 uint8 1110171
num_bc_sats float64 8881368 uint8 1110171
num_bc_tl float64 8881368 uint8 1110171
num_il_tl float64 8881368 uint8 1110171
num_op_rev_tl float64 8881368 uint8 1110171
num_rev_accts float64 8881368 uint8 1110171
num_rev_tl_bal_gt_0 float64 8881368 uint8 1110171
num_sats float64 8881368 uint8 1110171
num_tl_120dpd_2m float64 8881368 uint8 1110171
num_tl_30dpd float64 8881368 uint8 1110171
num_tl_90g_dpd_24m float64 8881368 uint8 1110171
num_tl_op_past_12m float64 8881368 uint8 1110171
pct_tl_nvr_dlq float64 8881368 float32 4440684
percent_bc_gt_75 float64 8881368 float32 4440684
pub_rec_bankruptcies float64 8881368 uint8 1110171
tax_liens float64 8881368 uint8 1110171
tot_hi_cred_lim float64 8881368 uint32 4440684
total_bal_ex_mort float64 8881368 uint32 4440684
total_bc_limit float64 8881368 uint32 4440684
total_il_high_credit_limit float64 8881368 uint32 4440684
fraction_recovered float64 8881368 float32 4440684
title object 82840461 object 82840461
desc object 46918516 object 46918516
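The three kinds of conversion behind those numbers can be reproduced on toy data. This is a minimal sketch with made-up values; the variable names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

amounts = pd.Series([3600.0, 24700.0])  # integer-valued floats
ratios = pd.Series([5.91, 16.06])  # genuinely fractional values
terms = pd.Series([" 36 months", " 60 months"] * 1000)  # repeated strings

# downcast="unsigned" picks the smallest unsigned integer type that fits.
amounts_small = pd.to_numeric(amounts, downcast="unsigned")
# downcast="float" goes no lower than float32.
ratios_small = pd.to_numeric(ratios, downcast="float")
# A categorical stores each distinct string once plus a small integer code per row.
terms_small = terms.astype("category")

print(amounts_small.dtype, ratios_small.dtype, terms_small.dtype)
print(
    terms.memory_usage(index=False, deep=True),
    terms_small.memory_usage(index=False, deep=True),
)
```

The categorical conversion is where most of the savings in the table come from, since the object columns were by far the largest before the change.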

Creating document vectors

Now the fun part. Wrapping my spaCy document vector function in a scikit-learn FunctionTransformer turned out to be the secret that kept this process within memory limits. Scikit-learn must just be way better optimized than whatever manual process I was using (go figure).

import spacy
from sklearn.preprocessing import FunctionTransformer


def get_doc_vectors(X):
    n_cols = X.shape[1]
    nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser", "ner"])

    result = []
    for row in X:
        result_row = []
        for i in range(n_cols):
            result_row.append(nlp(row[i]).vector)

        result.append(np.concatenate(result_row))

    return np.array(result)


vectorizer = FunctionTransformer(get_doc_vectors)
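The shape logic of get_doc_vectors can be checked without downloading a spaCy model. In this sketch, `fake_vector` is a hypothetical stand-in for `nlp(text).vector`, and `DIM` stands in for the 300 dimensions of the real vectors; everything else mirrors the function above.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

DIM = 4  # stand-in for the 300 dimensions of the spaCy vectors


def fake_vector(text):
    # Hypothetical stand-in for nlp(text).vector: every entry equals the
    # text length, so an empty string gives an all-zero vector.
    return np.full(DIM, float(len(text)), dtype=np.float32)


def get_toy_doc_vectors(X):
    # Same shape logic as get_doc_vectors: one vector per text column,
    # concatenated along each row.
    return np.array(
        [np.concatenate([fake_vector(cell) for cell in row]) for row in X]
    )


X = np.array([["Debt consolidation", ""], ["Business", "New roof"]], dtype=object)
vectors = FunctionTransformer(get_toy_doc_vectors).fit_transform(X)
print(vectors.shape)
```

Two text columns produce rows of width 2 × DIM, so the real pipeline adds 600 columns when both title and desc are included.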

Building the pipeline

First, the transformer. I'll use scikit-learn's ColumnTransformer to apply different transformations to different kinds of data.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from pathlib import Path


def generate_cat_encoder(col_name):
    categories = list(loans[col_name].cat.categories)
    if loans[col_name].cat.ordered:
        return (
            col_name,
            OrdinalEncoder(categories=[categories], dtype=np.uint8),
            [col_name],
        )
    else:
        return (
            col_name,
            OneHotEncoder(categories=[categories], drop="if_binary", dtype=np.bool_),
            [col_name],
        )


Path("../tmp/transformer_cache").mkdir(parents=True, exist_ok=True)
transformer = ColumnTransformer(
    [
        (
            "nlp_cols",
            Pipeline(
                [
                    (
                        "nlp_imputer",
                        SimpleImputer(strategy="constant", fill_value=""),
                    ),
                    ("nlp_vectorizer", vectorizer),
                    ("nlp_scaler", StandardScaler(with_mean=False)),
                ],
                verbose=True,
            ),
            make_column_selector("^(title|desc)$"),
        ),
    ]
    + [generate_cat_encoder(col_name) for col_name in ordinal_cols + category_cols],
    remainder=StandardScaler(),
    verbose=True,
)
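To make the two encoder branches in generate_cat_encoder concrete, here is a small sketch on made-up values. The category lists are shortened for illustration and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordered category: the position in the list becomes the encoded value.
lengths = ["< 1 year", "1 year", "10+ years"]
ordinal = OrdinalEncoder(categories=[lengths], dtype=np.uint8)
length_rows = np.array([["1 year"], ["10+ years"], ["< 1 year"]], dtype=object)
encoded_lengths = ordinal.fit_transform(length_rows)

# Unordered two-value category: drop="if_binary" keeps a single column.
terms = ["36 months", "60 months"]
one_hot = OneHotEncoder(categories=[terms], drop="if_binary", dtype=np.bool_)
term_rows = np.array([["36 months"], ["60 months"]], dtype=object)
encoded_terms = one_hot.fit_transform(term_rows).toarray()

print(encoded_lengths.ravel(), encoded_terms.ravel())
```

Passing the categories explicitly keeps the encoding fixed regardless of which values happen to appear in a given training split.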

The model itself will be identical to my previous model, but I'll use Keras callbacks and a tqdm progress bar to make the training logs much more concise.

import tensorflow as tf
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from tqdm import tqdm

np.random.seed(0)
tf.random.set_seed(1)


class ProgressBar(tf.keras.callbacks.Callback):
    def __init__(self, epochs=100):
        super().__init__()  # let the base Callback set up its own state
        self.epochs = epochs

    def on_train_begin(self, logs=None):
        self.progress_bar = tqdm(desc="Training model", total=self.epochs, unit="epoch")

    def on_epoch_end(self, epoch, logs=None):
        self.progress_bar.update()

    def on_train_end(self, logs=None):
        self.progress_bar.close()


class FinalMetrics(tf.keras.callbacks.Callback):
    def on_train_end(self, logs=None):
        metrics_msg = "Final metrics:"
        for metric, value in logs.items():
            metrics_msg += f" {metric}: {value:.5f} -"
        metrics_msg = metrics_msg[:-2]
        print(metrics_msg)


def run_pipeline(X, y, transformer, validate=True):
    X_train, X_val, y_train, y_val = (
        train_test_split(X, y, test_size=0.2, random_state=2)
        if validate
        else (X, None, y, None)
    )

    X_train_t = transformer.fit_transform(X_train)
    X_val_t = transformer.transform(X_val) if validate else None

    model = Sequential()
    model.add(Input((X_train_t.shape[1],)))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(32, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(16, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(1))
    model.compile(optimizer="adam", loss="mean_squared_logarithmic_error")

    history = model.fit(
        X_train_t,
        y_train,
        validation_data=(X_val_t, y_val) if validate else None,
        batch_size=128,
        epochs=100,
        verbose=0,
        callbacks=[ProgressBar(), FinalMetrics()],
    )

    return history.history, model, transformer
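For reading the loss values that follow, it may help to see the loss function spelled out. This is a NumPy sketch of the standard definition of mean squared logarithmic error; the Keras implementation may differ in small details such as clipping, and the sample targets are made up, assuming fraction_recovered lies between 0 and 1 as the name suggests.

```python
import numpy as np


def msle(y_true, y_pred):
    # Mean of squared differences between log(1 + y) values.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))


# Made-up targets on the assumed 0-to-1 scale of fraction_recovered.
y_true = [1.0, 1.0, 0.0, 0.5]
perfect = msle(y_true, y_true)
constant = msle(y_true, [0.8, 0.8, 0.8, 0.8])
print(perfect, constant)
```

Since the error is measured on a log scale, differences between models in the third or fourth decimal place are small but not necessarily meaningless when the targets themselves are bounded.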

Evaluating the model

import dill

history_1, _, _ = run_pipeline(
    loans.drop(columns="fraction_recovered").copy(),
    loans["fraction_recovered"],
    transformer,
)

Path("save_points").mkdir(exist_ok=True)
dill.dump_session("save_points/model_1.pkl")
/opt/conda/lib/python3.7/site-packages/pandas/core/strings.py:2001: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  return func(self, *args, **kwargs)


[Pipeline] ....... (step 1 of 3) Processing nlp_imputer, total=   0.4s
[Pipeline] .... (step 2 of 3) Processing nlp_vectorizer, total= 1.2min
[Pipeline] ........ (step 3 of 3) Processing nlp_scaler, total=   8.6s
[ColumnTransformer] ...... (1 of 7) Processing nlp_cols, total= 1.3min
[ColumnTransformer] .... (2 of 7) Processing emp_length, total=   0.2s
[ColumnTransformer] .......... (3 of 7) Processing term, total=   0.3s
[ColumnTransformer]  (4 of 7) Processing home_ownership, total=   0.3s
[ColumnTransformer] ....... (5 of 7) Processing purpose, total=   0.3s
[ColumnTransformer]  (6 of 7) Processing application_type, total=   0.3s
[ColumnTransformer] ..... (7 of 7) Processing remainder, total=   1.3s


Training model: 100%|██████████| 100/100 [23:41<00:00, 14.22s/epoch]


Final metrics: loss: 0.02365 - val_loss: 0.02360
# Restore save point if needed
import dill

try:
    history_1
except NameError:
    dill.load_session("save_points/model_1.pkl")
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


def plot_loss_metrics(history, model_num=None):
    for metric, values in history.items():
        sns.lineplot(x=range(len(values)), y=values, label=metric)
    plt.xlabel("epoch")
    # No extra space after the optional model number.
    plt.title(
        f"Model {f'{model_num} ' if model_num else ''}loss metrics during training"
    )
    plt.show()


plot_loss_metrics(history_1, "1")

A line plot entitled "Model 1 loss metrics during training", with separate lines for training loss and validation loss, plotting the loss metric value on the y-axis across the 100 epochs of training on the x-axis. Training loss falls rapidly and fairly smoothly, with another small but interesting drop around the 40th epoch. The validation loss line, while very jagged, appears on average to follow the same trend as training loss throughout the 100 epochs of training, indicating that the dropout layers in the neural network were sufficient to prevent overfitting.

Well, it didn't overfit, but this model performed a bit worse than my original, which had settled around a loss of 0.0231. I bet the desc feature is getting in the way—zeroes spanning 300 columns of the input data on 94% of the rows is probably quite confusing to the model. I'll see what happens if I repeat the process while leaving desc out (making the title vectors the only new feature of this model compared to my original).

history_2, _, _ = run_pipeline(
    loans.drop(columns=["fraction_recovered", "desc"]).copy(),
    loans["fraction_recovered"],
    transformer,
)

Path("save_points").mkdir(exist_ok=True)
dill.dump_session("save_points/model_2.pkl")
/opt/conda/lib/python3.7/site-packages/pandas/core/strings.py:2001: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  return func(self, *args, **kwargs)


[Pipeline] ....... (step 1 of 3) Processing nlp_imputer, total=   0.1s
[Pipeline] .... (step 2 of 3) Processing nlp_vectorizer, total=  41.3s
[Pipeline] ........ (step 3 of 3) Processing nlp_scaler, total=   4.6s
[ColumnTransformer] ...... (1 of 7) Processing nlp_cols, total=  45.9s
[ColumnTransformer] .... (2 of 7) Processing emp_length, total=   0.2s
[ColumnTransformer] .......... (3 of 7) Processing term, total=   0.3s
[ColumnTransformer]  (4 of 7) Processing home_ownership, total=   0.3s
[ColumnTransformer] ....... (5 of 7) Processing purpose, total=   0.3s
[ColumnTransformer]  (6 of 7) Processing application_type, total=   0.3s
[ColumnTransformer] ..... (7 of 7) Processing remainder, total=   1.1s


Training model: 100%|██████████| 100/100 [22:26<00:00, 13.46s/epoch]


Final metrics: loss: 0.02396 - val_loss: 0.02451
# Restore save point if needed
import dill

try:
    history_2
except NameError:
    dill.load_session("save_points/model_2.pkl")
plot_loss_metrics(history_2, "2")

A line plot entitled "Model 2 loss metrics during training", with separate lines for training loss and validation loss, plotting the loss metric value on the y-axis across the 100 epochs of training on the x-axis. The validation loss line is even more chaotic this time than in the model 1 plot but still doesn't appear to be overfitting.

Wow, still not good enough to beat my original model. Just for kicks, I also tried additional runs where I trained for 1,000 epochs, others where I increased the number of nodes in the first two dense layers to 128 and 64, and others where I decreased the batch size to 64. None of these beat my original model either. I suppose these text features don't add any predictive power for loan outcomes. Interesting.

Next steps

If adding these two features decreased predictive capability, then perhaps some of the other variables I was already using are doing the same thing. I should try using some of scikit-learn's feature selection methods to reduce the dimensionality of the input data.
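As a rough idea of what that could look like, here is a sketch using scikit-learn's SelectKBest with f_regression on synthetic data. This is only one of several feature selection methods and not necessarily the one best suited to this dataset; the data and variable names are made up.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic data: only columns 0 and 3 actually influence the target.
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=500)

# Score each feature on its own against the target and keep the top two.
selector = SelectKBest(score_func=f_regression, k=2)
X_reduced = selector.fit_transform(X, y)
kept = selector.get_support(indices=True).tolist()
print(kept, X_reduced.shape)
```

A univariate score like this only captures each feature's individual linear relationship with the target, so it would be a starting point rather than a final answer for a neural network's inputs.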

A more efficient method of hyperparameter optimization would be pretty useful as well. I should give AutoKeras a shot.


Well, that was fun! Have any thoughts on how to better integrate language data into the model? I'd love to hear them in the discussion below.

Discussion (1)

brodi333

I have been running my business for a very long time, and everything is going very well, but I remember when I first wanted to start it, and I did not have enough money. I remember thinking of all the possibilities in my head, but I was too afraid to execute them. I wanted to start a shoe repair company, and I needed to build the dwelling on my land. I then thought of loaning money from a company and decided to loan them from kviklanet.dk because that is where I wanted to start my business, in Denmark. There was no problem with the construction loan percentage, and I did not have to pay any extra fees, so don't worry about it.