FDIC Bank Failures - Time Series Analysis
My latest endeavor with SQL is working with time series data.
For this mini-analysis, I decided to chart US bank failures over time.
I found an official dataset of FDIC Bank Failures on data.gov, which hosts public
datasets from the US government. It lists all the banks that have failed since 2000.
Readout
Over the period from 2000 to 2023, 568 banks closed. Georgia, Florida, and Illinois had the most closures over the period, with 93, 76, and 69 closures respectively. Florida had the most closures in a single year, with 29 banks closing in 2010.
As expected, bank closures trend upward between 2008 and 2012, peaking at 157 closures in 2010. The year-over-year changes highlight this explosion of failures.
From 2007 to 2008, the number of bank closures grew more than 8x, from 3 closures to 25, and grew again the following year - from 25 closures in 2008 to 140 in 2009, a 460% increase. Closures in a single year peaked in 2010 with 157, with Florida recording the most of any state that year at 29. July 2009 was the worst month for closures, with 24 closings in a single month.
Importing the Data
To import the data into my Postgres database, I did the following steps:
- Create a table to represent the rows in the CSV file:
CREATE TABLE bank_failures (
id bigserial,
bank_name TEXT,
city TEXT,
state TEXT,
cert TEXT,
acquiring_institution TEXT,
closing_date DATE,
fund BIGINT
);
- Copy the data into the table from the CSV file, using the COPY command:
COPY bank_failures (
bank_name,
city,
state,
cert,
acquiring_institution,
closing_date,
fund
) FROM '/path/to/banklist.csv' WITH (FORMAT csv, DELIMITER ',', QUOTE '"', ESCAPE '"', ENCODING 'UTF-8');
I originally set the ENCODING to ASCII, but after running into some issues realized I needed to set it to 'UTF-8'.
Now, with the correct character encoding, I imported my data for real, but got this error message:
ERROR: invalid byte sequence for encoding "UTF8": 0x96
After some digging online, I realized this meant there was a non-UTF-8 character in the CSV file. One of the bank names used a special dash character in place of a plain hyphen. I replaced it with a "regular" hyphen '-' character, and that allowed me to import all the data.
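Side note: the byte 0x96 is an en dash in the Windows-1252 encoding, so another option, rather than editing the file by hand, would likely have been to declare the file's actual encoding in the COPY command and let Postgres transcode it to UTF-8. A sketch (untested against this exact file):
-- Declare the source file's encoding so Postgres converts it during import
COPY bank_failures (bank_name, city, state, cert, acquiring_institution, closing_date, fund)
FROM '/path/to/banklist.csv' WITH (FORMAT csv, ENCODING 'WIN1252');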
Once imported, the data looked like this:
id | bank_name | city | state | cert | acquiring_institution | closing_date | fund |
---|---|---|---|---|---|---|---|
40 | Citizens Bank | Sac City | IA | 8758 | Iowa Trust & Savings Bank | 2023-11-03 | 10545 |
41 | Heartland Tri-State Bank | Elkhart | KS | 25851 | Dream First Bank, N.A. | 2023-07-28 | 10544 |
42 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 2023-05-01 | 10543 |
43 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 2023-03-12 | 10540 |
44 | Silicon Valley Bank | Santa Clara | CA | 24735 | First-Citizens Bank & Trust Company | 2023-03-10 | 10539 |
Querying the Data
My goal was to run a few simple queries to get the number of bank failures by year,
and then another query to get all the bank failures by month.
I started with a query to find all the month/year combinations in which banks failed:
-- Return bank failures by month, year based on the dataset
SELECT
EXTRACT('YEAR' FROM closing_date :: DATE) as closing_year,
EXTRACT('MONTH' FROM closing_date :: DATE) as closing_month,
COUNT(*) as total_failures
FROM bank_failures
GROUP BY closing_year, closing_month
ORDER BY closing_year, closing_month;
To my surprise, some months were missing from this query. There were no rows for the months from January through October 2000.
It may seem odd, but this is the expected output. If there weren't any failures in a given month, there is no row for it. If I want a row for every month from 2000 to 2023, I need data for those months too.
To accomplish this, I made a table called timescale that contains every month and year from January 2000 through December 2023. My initial attempt was overcomplicated, but I realized PostgreSQL provides a helpful function, generate_series(), that lets me generate a series of rows between two dates:
-- Generate a row for every month between Jan, 2000 and Dec, 2023
SELECT generate_series('2000-01-01', '2023-12-01','1 month'::interval) as timescale;
-- Generate a row for every year between Jan, 2000 and Dec, 2023
SELECT generate_series('2000-01-01', '2023-12-01','1 year'::interval) as timescale;
The first query returns a result set containing a row for every month from January 2000 through December 2023; the second returns one row per year.
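One small note: generate_series() returns timestamps here (e.g. 2000-01-01 00:00:00). If plain dates are preferred, the output can be cast, for example:
-- Same monthly series, cast to plain dates
SELECT generate_series('2000-01-01'::date, '2023-12-01', '1 month'::interval)::date as timescale;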
With this series in hand, we can redo our original "Failures per Month" query and LEFT JOIN the bank_failures table onto the timescale:
-- Return bank failures by month, year based on the dataset
with series as (SELECT
generate_series('2000-01-01', '2023-12-01','1 month'::interval) as timescale)
SELECT
EXTRACT('YEAR' FROM timescale) as closing_year,
EXTRACT('MONTH' FROM timescale) as closing_month,
COUNT(bank_failures.closing_date) as total_failures
-- the series table has complete dates, so join bank_failures onto the series for complete data for the complete time period
FROM series LEFT JOIN bank_failures
on EXTRACT('YEAR' from timescale) = EXTRACT('YEAR' from closing_date :: DATE)
and EXTRACT('MONTH' from timescale) = EXTRACT('MONTH' from closing_date :: DATE)
GROUP BY closing_year, closing_month
ORDER BY closing_year, closing_month;
Important Notes:
- The bank_failures table is left joined onto the timescale. This is done because the timescale includes every month/year, while the bank_failures table only has records for the month/years in which a bank failed.
- We use COUNT(bank_failures.closing_date) rather than COUNT(*). The left join returns a row for every entry in the timescale, including months with no failures, and COUNT(*) counts those all-NULL rows too - so an empty month would be reported as 1 failure. COUNT(bank_failures.closing_date) only counts rows where a bank actually failed in that month (see the small illustration below).
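To make the COUNT distinction concrete, here is a tiny standalone example (not from the dataset) showing how the two functions treat an all-NULL row like the ones produced for empty months:
-- One NULL row, similar to what the LEFT JOIN produces for a month with no failures
SELECT COUNT(*) AS count_star, COUNT(v) AS count_value
FROM (VALUES (NULL::date)) AS t(v);
-- count_star = 1, count_value = 0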
Likewise, if we wanted to downsample the data to yearly buckets, the query would look like this:
with series as (SELECT
generate_series('2000-01-01', '2023-12-01','1 year'::interval) as timescale)
SELECT
EXTRACT('YEAR' FROM timescale) as closing_year,
COUNT(bank_failures.closing_date) as total_failures
FROM series LEFT JOIN bank_failures
on EXTRACT('YEAR' from timescale) = EXTRACT('YEAR' from closing_date :: DATE)
GROUP BY closing_year
ORDER BY closing_year;
We've updated the generate_series() function to use an interval of 1 year rather than 1 month, and
updated our JOIN condition to be based on matching years.
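As a side note, the join condition could also be written with date_trunc(), which avoids repeating EXTRACT on both sides of the comparison. A sketch of the yearly query in that style (a variant, not what I used above):
with series as (SELECT
generate_series('2000-01-01', '2023-12-01','1 year'::interval) as timescale)
SELECT
EXTRACT('YEAR' FROM timescale) as closing_year,
COUNT(bank_failures.closing_date) as total_failures
FROM series LEFT JOIN bank_failures
-- a single comparison on truncated dates replaces the EXTRACT-based match
on date_trunc('year', timescale) = date_trunc('year', closing_date)
GROUP BY closing_year
ORDER BY closing_year;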
Analyzing the Data
I saved the queries as materialized views and ran a few queries on the data. I used a materialized view because the data covers a fixed time period and will not need to be refreshed. A temp table, or even writing to a regular table, would work as well.
create materialized view bank_failures_by_month as (
with series as (SELECT
generate_series('2000-01-01', '2023-12-01','1 month'::interval) as timescale)
SELECT
EXTRACT('YEAR' FROM timescale) as closing_year,
EXTRACT('MONTH' FROM timescale) as closing_month,
COUNT(bank_failures.closing_date) as total_failures
FROM series LEFT JOIN bank_failures
on EXTRACT('YEAR' from timescale) = EXTRACT('YEAR' from closing_date :: DATE)
and EXTRACT('MONTH' from timescale) = EXTRACT('MONTH' from closing_date :: DATE)
GROUP BY closing_year, closing_month
ORDER BY closing_year, closing_month);
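If the underlying table ever were reloaded, the view can be rebuilt on demand:
REFRESH MATERIALIZED VIEW bank_failures_by_month;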
Here are some of the queries I ran on the dataset and the materialized view:
How many banks closed over the entire time period?
select count(*) from bank_failures bf;
Which states had the most bank failures over the complete time period?
select
state,
COUNT(*) as bank_failures
from bank_failures bf
group by state
order by COUNT(*) desc
limit 10;
state | bank_failures |
---|---|
GA | 93 |
FL | 76 |
IL | 69 |
CA | 43 |
MN | 23 |
WA | 19 |
MO | 16 |
AZ | 16 |
MI | 14 |
TX | 13 |
Which state and year combinations had the most closures?
select
state,
extract('YEAR' from closing_date) :: TEXT as closing_year,
COUNT(*) as state_failures_by_year
from bank_failures bf
group by state, extract('YEAR' from closing_date)
order by COUNT(*) desc
limit 10;
state | closing_year | state_failures_by_year |
---|---|---|
FL | 2010 | 29 |
GA | 2009 | 25 |
GA | 2011 | 23 |
IL | 2009 | 21 |
GA | 2010 | 21 |
CA | 2009 | 17 |
IL | 2010 | 16 |
FL | 2009 | 14 |
FL | 2011 | 13 |
CA | 2010 | 12 |
For each year, which state(s) had the most bank failures? Include ties.
with ranked_state_failures_by_year as (
select
extract('YEAR' from closing_date) as closing_year,
state,
COUNT(*) as state_failures_by_year,
-- Window functions are evaluated AFTER the group by clause, so this is comparing the bank failures per state
dense_rank() over (partition by extract('YEAR' from closing_date) order by COUNT(*) desc)
from bank_failures bf
group by extract('YEAR' from closing_date), state
order by extract('YEAR' from closing_date), COUNT(*) desc),
all_years as (
select generate_series('2000-01-01', '2023-01-01', '1 year'::interval) as timescale
)
select
extract('YEAR' from all_years.timescale) as closing_year,
coalesce(state, 'N/A') as state,
coalesce(state_failures_by_year, 0) as state_failures_by_year
from
all_years left JOIN
ranked_state_failures_by_year
on extract('YEAR' from all_years.timescale) = ranked_state_failures_by_year.closing_year
-- Some years do not have bank failures, so we include a null check to ensure they appear in the result set
where dense_rank = 1 or dense_rank is null
order by all_years.timescale;
closing_year | state | state_failures_by_year |
---|---|---|
2000 | HI | 1 |
2000 | IL | 1 |
2001 | AR | 1 |
2001 | OH | 1 |
2001 | IL | 1 |
2001 | NH | 1 |
2002 | FL | 2 |
2003 | WI | 1 |
Which years had the most bank failures?
select
RANK() over (order by sum(total_failures) desc),
closing_year::int,
SUM(total_failures) as total_failures
from bank_failures_by_month
group by closing_year
order by SUM(total_failures) desc
limit 5;
rank | closing_year | total_failures |
---|---|---|
1 | 2010 | 157 |
2 | 2009 | 140 |
3 | 2011 | 92 |
4 | 2012 | 51 |
5 | 2008 | 25 |
Which month/year had the most bank failures?
select
RANK() over (order by total_failures desc),
closing_year :: INT,
closing_month,
total_failures
from bank_failures_by_month
order by total_failures desc
limit 5;
rank | closing_year | closing_month | total_failures |
---|---|---|---|
1 | 2009 | 7 | 24 |
2 | 2010 | 4 | 23 |
3 | 2010 | 7 | 22 |
4 | 2009 | 10 | 20 |
5 | 2010 | 3 | 19 |
What was the percent change each year?
with bank_failures_by_year as (
-- Return a table with the year, total failures, and total failures in the previous year, defaulting to 0 if no data is found
SELECT
EXTRACT('YEAR' FROM timescale) as closing_year,
COUNT(bank_failures.closing_date) as total_failures,
coalesce(LAG(COUNT(bank_failures.closing_date)) over (order by EXTRACT('YEAR' FROM timescale)),0) as prev_year_failures
FROM (SELECT
generate_series('2000-01-01', '2023-12-01','1 year'::interval) as timescale) as series LEFT JOIN bank_failures
-- Join the series of years with the bank failures table based on matching years
on EXTRACT('YEAR' from timescale) = EXTRACT('YEAR' from closing_date)
GROUP BY closing_year
ORDER BY closing_year)
-- Calculate the year over year change between each year of bank failures
select closing_year, total_failures,
case
when closing_year = 2000 then 'N/A'
when prev_year_failures = total_failures then '0%'
when prev_year_failures = 0 then 'N/A'
else ROUND(((total_failures - prev_year_failures) :: NUMERIC / prev_year_failures) * 100.0, 2) || '%'
end as yoy_pct_change
from bank_failures_by_year;
closing_year | total_failures | yoy_pct_change |
---|---|---|
2006.0 | 0 | 0% |
2007.0 | 3 | N/A |
2008.0 | 25 | 733.33% |
2009.0 | 140 | 460.00% |
2010.0 | 157 | 12.14% |
2011.0 | 92 | -41.40% |
2012.0 | 51 | -44.57% |
2013.0 | 24 | -52.94% |
Wrapping it up
For time series data that has gaps, the generate_series() function makes it possible to
analyze the data against a complete timeline. And, if needed, it can generate different
levels of granularity (day, week, month, etc.) to match the granularity of your data.
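For example, only the interval passed to generate_series() needs to change to produce other granularities:
-- A row for every week between Jan 2000 and Dec 2023
SELECT generate_series('2000-01-01', '2023-12-31', '1 week'::interval) as timescale;
-- A row for every day between Jan 2000 and Dec 2023
SELECT generate_series('2000-01-01', '2023-12-31', '1 day'::interval) as timescale;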
Improvements
I used the EXTRACT function often in my queries. It might make sense to just add dedicated month and year columns to the table instead.
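One way to do that, assuming PostgreSQL 12 or later (which supports generated columns), might look like this sketch; I haven't applied it to the table above:
-- Hypothetical: store year and month alongside closing_date so queries can skip EXTRACT
ALTER TABLE bank_failures
ADD COLUMN closing_year INT GENERATED ALWAYS AS (EXTRACT(YEAR FROM closing_date)::int) STORED,
ADD COLUMN closing_month INT GENERATED ALWAYS AS (EXTRACT(MONTH FROM closing_date)::int) STORED;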
Each row has an acquiring institution. It would be interesting to revisit
the dataset and use some recursive queries to uncover chains of acquisitions. Normalizing some columns, such as state, acquiring institution, and institution, could make it easier to run queries about who acquired which banks. The FDIC has additional datasets that assign a unique id to institutions. An enhancement to this project would be to import all FDIC institutions and use them as a reference to normalize the institution data in the bank failures dataset.
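A rough sketch of what such a recursive query might look like (hypothetical; it matches acquiring_institution against bank_name by exact text, which is precisely where the normalization above would help, and 'Some Failed Bank' is a made-up starting point):
-- Follow a chain where a failed bank's acquirer later failed as well
WITH RECURSIVE acquisition_chain AS (
SELECT bank_name, acquiring_institution, 1 AS depth
FROM bank_failures
WHERE bank_name = 'Some Failed Bank' -- hypothetical starting bank
UNION ALL
SELECT bf.bank_name, bf.acquiring_institution, ac.depth + 1
FROM bank_failures bf
JOIN acquisition_chain ac ON bf.bank_name = ac.acquiring_institution
)
SELECT * FROM acquisition_chain;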