Leandro Ruiz
# Group Your Data by the Index

A few days ago, I was working on a risk-management project when I came across a simple problem I had never seen before.

## The Problem

I needed to group the data by the different IDs of a transaction dataset. The dataset contains 8,250 transactions for 1,125 customers of a bank.
Here are the first five rows:

| id | OVD_t1 | OVD_t2 | OVD_t3 | OVD_sum | pay_normal | prod_code | prod_limit | update_date | new_balance | highest_balance | report_date |
|----|--------|--------|--------|---------|------------|-----------|------------|-------------|-------------|-----------------|-------------|
| 54982353 | 0 | 0 | 0 | 0 | 19 | 10 | NaN | 16/07/2011 | 31677.6 | 204037.0 | 14/12/2015 |
| 54982353 | 0 | 0 | 0 | 0 | 11 | 2 | NaN | 21/02/2001 | 0.0 | 248874.0 | 22/06/2007 |
| 54982353 | 0 | 0 | 0 | 0 | 1 | 10 | NaN | 08/01/2006 | 0.0 | 1622.0 | NaN |
| 54982353 | 0 | 0 | 0 | 0 | 1 | 10 | NaN | 21/11/2006 | 0.0 | NaN | NaN |
| 54982353 | 0 | 0 | 0 | 0 | 0 | 10 | NaN | 30/09/2005 | 0.0 | NaN | NaN |

As you can see, all the rows have the same ID, and I needed to group the transactions according to the ID that created them.

Without a solution in mind, I started looking through my data science books, mostly those about data analysis and feature engineering, but I didn't find anything.

So I searched Stack Overflow until I found something that resembled my situation, although I had to adapt it to the characteristics of the project.


[Link to GitHub](https://github.com/ruizleandro/Risk-Management-with-LightGBM)

## The Solution

To solve this problem, you need Pandas' well-known groupby.
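Before applying it to the real dataset, here is a minimal sketch of how groupby works, using a toy dataframe (the data below is made up for illustration):

```python
import pandas as pd

# Toy dataframe: two customers, several transactions each
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "amount": [10, 20, 5, 5, 5],
})

# groupby splits the rows by "id", applies an aggregation to each
# group, and combines the results into a new dataframe
totals = toy.groupby("id")["amount"].sum().reset_index()
print(totals)
#    id  amount
# 0   1      30
# 1   2      15
```

The same split-apply-combine pattern drives every step below; only the column and the aggregation change.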

First, we are going to group prod_code by ID (if you want to know more, you can check the GitHub link above), collecting the various prod_code values for each ID into a list:

```python
import pandas as pd

df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_0.head(10)
```

This is the output:

| id | prod_code |
|----|-----------|
| 54982353 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1... |
| 54982356 | [27, 27, 10, 6, 6, 12, 10] |
| 54982387 | [1, 6, 6, 6, 6, 1, 10, 10, 1, 10, 10] |
| 54982463 | [5, 13] |
| 54982530 | [10, 10, 10, 10] |
| 54982549 | [17, 6, 6, 1, 2, 5, 17, 17, 17, 17] |
| 54982579 | [2, 10, 10, 10, 10, 10, 2, 10, 10, 1, 10, 10, ... |
| 54982665 | [10, 10, 5, 10] |
| 54982697 | [10, 10] |
| 54982721 | [10, 6, 1, 5, 13, 10, 5, 10, 13, 12] |
| 54982725 | [10] |

As we can see, we now have all the prod_code values (one for every transaction made) grouped by ID.
The next feature I needed to group was new_balance, where I decided to take the mean of all the new balances for each ID:

```python
import numpy as np

df_1 = payment_data.groupby(['id'])['new_balance'].apply(np.mean, axis=0).reset_index()
df_1.head(10)
```
| id | new_balance |
|----|-------------|
| 54982353 | 42033.133333 |
| 54982356 | 2277.771429 |
| 54982387 | 117455.127273 |
| 54982463 | 39890.400000 |
| 54982530 | 1261.200000 |
| 54982549 | 410060.400000 |
| 54982579 | 86454.654545 |
| 54982665 | 40112.400000 |
| 54982697 | 8752.800000 |
| 54982721 | 49843.440000 |
| 54982725 | 15817.200000 |

Now, I had to group all the different payment types: OVD_t1, OVD_t2, and OVD_t3. First, I grouped the sum of each type by its corresponding ID, and then merged all the types together into one dataframe.
For this I had to use the function reduce:

```python
from functools import reduce

df_OVD_t1 = payment_data.groupby(['id'])['OVD_t1'].apply(np.sum, axis=0).reset_index()
df_OVD_t2 = payment_data.groupby(['id'])['OVD_t2'].apply(np.sum, axis=0).reset_index()
df_OVD_t3 = payment_data.groupby(['id'])['OVD_t3'].apply(np.sum, axis=0).reset_index()

OVD_dfs = [df_OVD_t1, df_OVD_t2, df_OVD_t3]
df_2 = reduce(lambda left, right: pd.merge(left, right, on='id'), OVD_dfs)
df_2.head(10)
```
| id | OVD_t1 | OVD_t2 | OVD_t3 |
|----|--------|--------|--------|
| 54982353 | 3 | 2 | 38 |
| 54982356 | 0 | 0 | 0 |
| 54982387 | 3 | 0 | 0 |
| 54982463 | 0 | 0 | 0 |
| 54982530 | 0 | 0 | 0 |
| 54982549 | 6 | 2 | 0 |
| 54982579 | 1 | 0 | 0 |
| 54982665 | 0 | 2 | 26 |
| 54982697 | 0 | 0 | 0 |
| 54982721 | 2 | 1 | 43 |
| 54982725 | 0 | 0 | 0 |
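As a side note, the same three sums can also be computed in a single pass by selecting all three columns before aggregating. A sketch with made-up data (not the original payment_data):

```python
import pandas as pd

# Made-up stand-in for payment_data, just the OVD columns
payment_data = pd.DataFrame({
    "id":     [1, 1, 2],
    "OVD_t1": [1, 2, 0],
    "OVD_t2": [0, 1, 0],
    "OVD_t3": [5, 5, 0],
})

# Selecting a list of columns aggregates all of them at once, so the
# three separate groupbys and the reduce/merge step are not needed
df_2 = payment_data.groupby("id")[["OVD_t1", "OVD_t2", "OVD_t3"]].sum().reset_index()
print(df_2)
#    id  OVD_t1  OVD_t2  OVD_t3
# 0   1       3       1      10
# 1   2       0       0       0
```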

The final feature I needed to group was pay_normal. Here, the only operation I applied was np.sum, because I needed the total number of times the payment was normal for each client.

```python
df_3 = payment_data.groupby(['id'])['pay_normal'].apply(np.sum, axis=0).reset_index()
df_3.head(10)
```
| id | pay_normal |
|----|------------|
| 54982353 | 229 |
| 54982356 | 117 |
| 54982387 | 246 |
| 54982463 | 34 |
| 54982530 | 56 |
| 54982549 | 211 |
| 54982579 | 323 |
| 54982665 | 34 |
| 54982697 | 45 |
| 54982721 | 109 |
| 54982725 | 36 |

## Merging All the Dataframes into One

The final step was to merge all the dataframes I created into a single one. For this, I used the reduce function again, in the same way as when grouping the OVD types above.

```python
dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
df_final.head(10)
```
| id | prod_code | new_balance | OVD_t1 | OVD_t2 | OVD_t3 | pay_normal |
|----|-----------|-------------|--------|--------|--------|------------|
| 54982353 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1... | 42033.133333 | 3 | 2 | 38 | 229 |
| 54982356 | [27, 27, 10, 6, 6, 12, 10] | 2277.771429 | 0 | 0 | 0 | 117 |
| 54982387 | [1, 6, 6, 6, 6, 1, 10, 10, 1, 10, 10] | 117455.127273 | 3 | 0 | 0 | 246 |
| 54982463 | [5, 13] | 39890.400000 | 0 | 0 | 0 | 34 |
| 54982530 | [10, 10, 10, 10] | 1261.200000 | 0 | 0 | 0 | 56 |
| 54982549 | [17, 6, 6, 1, 2, 5, 17, 17, 17, 17] | 410060.400000 | 6 | 2 | 0 | 211 |
| 54982579 | [2, 10, 10, 10, 10, 10, 2, 10, 10, 1, 10, 10, ... | 86454.654545 | 1 | 0 | 0 | 323 |
| 54982665 | [10, 10, 5, 10] | 40112.400000 | 0 | 2 | 26 | 34 |
| 54982697 | [10, 10] | 8752.800000 | 0 | 0 | 0 | 45 |
| 54982721 | [10, 6, 1, 5, 13, 10, 5, 10, 13, 12] | 49843.440000 | 2 | 1 | 43 | 109 |
| 54982725 | [10] | 15817.200000 | 0 | 0 | 0 | 36 |
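For what it's worth, pandas can also build this whole table in one step with named aggregation, which applies a different reduction to each column and avoids the final merge. A sketch with made-up data, not the original payment_data:

```python
import pandas as pd

# Made-up stand-in for payment_data with the columns grouped above
payment_data = pd.DataFrame({
    "id":          [1, 1, 2],
    "prod_code":   [10, 6, 5],
    "new_balance": [100.0, 200.0, 50.0],
    "OVD_t1":      [1, 2, 0],
    "OVD_t2":      [0, 0, 1],
    "OVD_t3":      [3, 0, 0],
    "pay_normal":  [4, 5, 6],
})

# One agg call: each output column is a (source column, function) pair
df_final = payment_data.groupby("id").agg(
    prod_code=("prod_code", list),
    new_balance=("new_balance", "mean"),
    OVD_t1=("OVD_t1", "sum"),
    OVD_t2=("OVD_t2", "sum"),
    OVD_t3=("OVD_t3", "sum"),
    pay_normal=("pay_normal", "sum"),
).reset_index()
```

Both approaches give the same result; the step-by-step version above just makes each aggregation easier to inspect on its own.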

## Conclusion

And that's it. I hope this tutorial helps you, and thanks for reading this far.
