Modeling default rate in P2P lending via LSTM
With the fast development of peer-to-peer (P2P) lending, financial institutions face a substantial challenge from losses caused by the delinquent behavior of borrowers.
Therefore, a comprehensive understanding of the changing trend of the default rate in the P2P domain is crucial.
In this paper, we comprehensively study the changing trend of the default rate in the US P2P market at the aggregate level from August 2007 to January 2016.
From the data visualization perspective, we found that three features could potentially increase the default rate.
The long short-term memory (LSTM) approach shows its great potential in modeling the P2P transaction data.
Furthermore, incorporating the macroeconomic feature can improve the LSTM performance by decreasing RMSE on both training and testing datasets.
Our study can broaden the applications of the LSTM approach in the P2P market.
Keywords: peer-to-peer lending; default rate; long short-term memory.
Peer-to-peer (P2P) lending, which means lending money virtually, is one of the fastest growing segments of the financial lending market.
A thorough assessment of loan applicants' risk is essential for investors to make successful investments, since investors aim to minimize risk while expecting high returns.
As a result, lending institutions continuously focus on exploring methods to understand the behavior of loan applicants during the economic cycles.
The long short-term memory (LSTM) model, which is one of the state-of-the-art methods to model the sequential data, has been widely used in language modeling, disease forecasting, and speech recognition (, , , ).
With respect to the financial domain, LSTM has shown its superiority in credit risk modeling, overdue bank-loan prediction, and credit card fraud detection (, ).
However, there is limited research that uses LSTM to analyze the sequential data generated in the P2P lending market.
Motivated by the aforementioned research, in this paper we present a comprehensive case study aimed at understanding the changing trend of the default rate at an aggregate level of P2P transactions in the USA.
We first analyzed the changing trend of the features as well as their relationship with the default rate from a data visualization perspective.
Then, an LSTM model is employed to fit the default rate, incorporating a macroeconomic feature into the dataset.
The findings of our study could provide a reference for investors in the P2P market.
Since the LSTM model is used in this study, its principle is briefly discussed in this section. LSTM is a variant of the recurrent neural network (RNN) and has recently become popular for modeling sequential data (, ). Compared to a plain RNN, the key idea of LSTM is that it maintains a sequence of cell states. Figure 1 is a classic illustration of the cell state inside an LSTM that appears in many studies (, , ).

The cell state is controlled by three gates: the forget gate, the input gate, and the output gate. The forget gate determines which information from prior steps should be removed or kept (i.e., f_t) using Equation 1, where h_{t-1}, C_{t-1}, and x_t denote the previous hidden output, the previous cell state, and the current input, respectively; b_f and W_f denote the bias and weights of the forget gate; σ(·) denotes the sigmoid function; and ∘ denotes the Hadamard product (i.e., pointwise multiplication):

f_t = σ(W_f [h_{t-1}, x_t] + b_f).    (1)

The input gate decides which information from the current step should be stored (i.e., i_t), and a tanh layer then generates the candidate information to be added to the state (i.e., C̃_t). Similar to f_t, i_t and C̃_t can be expressed by Equations 2 and 3, respectively, where b_i and b_C denote the biases of the input gate and W_i and W_C denote the corresponding weights:

i_t = σ(W_i [h_{t-1}, x_t] + b_i),    (2)
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C).    (3)

The previous cell state C_{t-1} is then updated into the new cell state C_t using Equation 4:

C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t.    (4)

Finally, the output gate provides the information (i.e., o_t) for the next hidden state h_t using Equations 5 and 6, where b_o and W_o denote the bias and weights of the output gate:

o_t = σ(W_o [h_{t-1}, x_t] + b_o),    (5)
h_t = o_t ∘ tanh(C_t).    (6)
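The gate equations above can be sketched as a single LSTM cell step in NumPy. This is a minimal illustration under the assumption that each gate's weight matrix acts on the concatenated [h_{t-1}, x_t]; the names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step. W maps gate name -> weight matrix acting on
    the concatenated [h_prev, x_t]; b maps gate name -> bias vector."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate (Eq. 1)
    i_t = sigmoid(W['i'] @ z + b['i'])       # input gate (Eq. 2)
    c_tilde = np.tanh(W['c'] @ z + b['c'])   # candidate state (Eq. 3)
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state (Eq. 4)
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate (Eq. 5)
    h_t = o_t * np.tanh(c_t)                 # new hidden state (Eq. 6)
    return h_t, c_t

# Tiny smoke run with random weights (hidden size 4, input size 3).
rng = np.random.default_rng(0)
H, D = 4, 3
W = {g: rng.standard_normal((H, H + D)) for g in 'fico'}
b = {g: np.zeros(H) for g in 'fico'}
h, c = lstm_cell_step(np.ones(D), np.zeros(H), np.zeros(H), W, b)
```

Note how the Hadamard products in Equations 4 and 6 appear as elementwise `*`, while the gate pre-activations are ordinary matrix-vector products.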
The case study in this paper uses a publicly available dataset downloaded via the following URL: https://www.lendingclub.com/info/download-data.action.
The dataset records the P2P lending transactions ranging from 2007 to 2017 in USA.
There are millions of loan transactions, and each transaction is identified by a unique ID.
For each transaction, there are over thirty features that describe the financial information of the borrowers as well as loan-related information such as the starting date.
The target variable describes the status of each loan transaction: ongoing, fully paid off, or default.
With respect to the features in the dataset, they mainly fall into three categories: personal property (PP), credit information (CI), and loan information (LI).
We remove the features that have ambiguous meanings and keep only those with clear descriptions in this study.
Table 1 provides the descriptions, types, as well as the categories of the selected features in the dataset.
Except for the target variable, most features are numerical; only three features are categorical.
In order to obtain more information that potentially affects the target, we collect one macroeconomic feature, the unemployment rate, from the following URL: https://datahub.io/core/employment-us#data.
The unemployment rate then serves as an additional numerical feature in the following analysis.
| Description | Category | Type |
| --- | --- | --- |
| Indicates whether the loan is an individual application or a joint application with two co-borrowers | LI | Categorical |
| Home ownership status of the borrowers | PP | Categorical |
| Indicates if income was verified by LC, not verified, or if the income source was verified | PP | Categorical |
| The loan is fully paid off or default | LI | Categorical |
| Annual income reported by the borrowers | PP | Numerical |
| Post charge off collection fee | LI | Numerical |
| The past-due amount owed for the accounts on which the borrower is now delinquent | CI | Numerical |
| Number of over 30 days past-due incidences of delinquency in the borrowers' credit files for the past 2 years | CI | Numerical |
| Interest rate on the loan | LI | Numerical |
| The monthly payment owed by the borrower if the loan originates | LI | Numerical |
| Last total payment amount received | LI | Numerical |
| The amount of the loan | LI | Numerical |
| Number of accounts opened in past 24 months | CI | Numerical |
| Number of derogatory public records | CI | Numerical |
| Post charge off gross recovery | LI | Numerical |
| Total credit revolving balance | CI | Numerical |
| The total number of credit lines currently in the borrower's credit file | CI | Numerical |
| Payments received to date for total amount funded | LI | Numerical |
| Late fees received to date | LI | Numerical |
3.2 Problem statement
Based on the information provided by the dataset, we define our two main goals in this study as follows:
1. Explore the status of loans in the P2P market over time at the aggregate level (rather than the applicant level).
Furthermore, explore the relationships between the features and the loan status.
This could provide investors with insights on their investments when dealing with different borrowers.
2. Since the LSTM algorithm has not been widely used in the P2P lending market in previous studies, we implement an LSTM model to summarize and forecast the changing trend of the loan status and thereby explore its potential in modeling P2P transaction data.
3.3 Data pre-processing
To address the research goals described in Section 3.2, several data pre-processing procedures are performed sequentially as follows:
(a) We focus on P2P loans that have a determined status.
Since most P2P lending transactions last for 36 months (some last for 60 months), most loan transactions that begin after February 2016 are still ongoing.
Therefore, observations with the status 'ongoing' are removed.
Then, we transform the categorical target into a numerical one by assigning observations valued 'fully paid off' a value of 0 and those valued 'default' a value of 1.
As a result, around one million observations remain, with transactions ranging from August 2007 to January 2016.
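A sketch of this filtering and encoding step in pandas is shown below on a toy table; the column names and status labels are assumptions standing in for the Lending Club export.

```python
import pandas as pd

# Toy transactions standing in for the Lending Club export; the real
# status labels and column names may differ.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'loan_status': ['Fully Paid', 'Charged Off', 'Current', 'Fully Paid'],
})

# Drop still-ongoing loans, then encode the determined status as 0/1.
status_map = {'Fully Paid': 0, 'Charged Off': 1}
df = df[df['loan_status'].isin(status_map)].copy()
df['default'] = df['loan_status'].map(status_map)
print(df['default'].tolist())  # -> [0, 1, 0]
```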
(b) The original P2P transaction dataset is aggregated by month in order to get the monthly default rate.
In other words, by calculating the percentage of defaults within each month, we obtain the default rate of the lending at the aggregate level.
In this case, the original categorical target is transformed into a numerical monthly default rate.
As a result, we obtain a default-rate value for each month from August 2007 to January 2016.
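The monthly aggregation can be sketched with a pandas group-by; the column names are illustrative assumptions, not the dataset's actual field names.

```python
import pandas as pd

# Toy loan-level data; 'issue_month' and 'default' are assumed names.
loans = pd.DataFrame({
    'issue_month': ['2007-08', '2007-08', '2007-09', '2007-09', '2007-09'],
    'default':     [0, 1, 0, 0, 1],
})

# Aggregate to one default-rate observation per month: the mean of the
# 0/1 indicator is exactly the percentage of defaults in that month.
monthly_rate = loans.groupby('issue_month')['default'].mean()
# 2007-08 -> 0.5, 2007-09 -> 1/3
```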
(c) Features (both numerical and categorical) whose missing/invalid percentage exceeded a chosen threshold were removed.
Then median-based imputation is applied on the continuous features.
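Step (c) can be sketched as follows; the 0.5 cutoff and the column names are illustrative assumptions, since the paper's exact threshold is not stated here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'annual_inc': [50000.0, np.nan, 70000.0, 60000.0],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
})

# Drop features whose missing fraction exceeds a threshold (0.5 is
# illustrative; the paper's exact cutoff is not stated here).
threshold = 0.5
df = df.loc[:, df.isna().mean() <= threshold]

# Median-based imputation for the remaining continuous features.
df = df.fillna(df.median(numeric_only=True))
print(df['annual_inc'].tolist())  # -> [50000.0, 60000.0, 70000.0, 60000.0]
```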
(d) Exploratory data analysis (EDA) is implemented with the goal to transform categorical features into numerical values.
As described in Table 1, besides the target variable, there are only three categorical features in the dataset:
, , and .
The effects of the different levels of these three categorical variables on the default rate are first visualized using barplots and then compared using the Wilcoxon rank-sum test.
Figure 2 displays the average default rate for each level of the three categorical features.
From the first subplot in Figure 2, it is surprising to find that borrowers who rent or own a home have a higher default rate than those who have a mortgage.
Applicants without a home have the lowest default rate, while those who select 'ANY' for their home-ownership status have the highest.
The Wilcoxon rank-sum test shows that, at the 0.05 significance level, there are statistically significant differences in the default rate among the six levels of home ownership.
Therefore, we keep all six levels and use one-hot encoding to convert them into numerical values.
In the third subplot of Figure 2, the application-type feature has two levels: 'individual' and 'joint app' (refer to Table 1 for details).
The Wilcoxon rank-sum test shows that loan applicants in the 'joint app' group have a significantly higher default rate than those in the 'individual' group.
Thus, as with home ownership, the two levels are kept and converted into numerical values using one-hot encoding.
The second subplot in Figure 2 shows the effects of the three verification-status levels on the default rate.
The Wilcoxon rank-sum test demonstrates that verified applicants (both 'verified' and 'source verified') show a significantly higher default rate than unverified ones.
However, there is no significant difference in the default rate between the levels 'verified' and 'source verified'.
As a result, we pool 'verified' and 'source verified' together as 'verified', leaving only two levels: 'not verified' and 'verified'.
These two levels are then transformed into numerical values using one-hot encoding.
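The test-then-pool-then-encode step above can be sketched with SciPy and pandas. The level names and the sample default rates below are illustrative assumptions, not the study's data.

```python
import pandas as pd
from scipy.stats import ranksums

# Toy per-group default rates (illustrative values, not the study's data).
verified = [0.060, 0.070, 0.065, 0.072]
source_verified = [0.061, 0.069, 0.066, 0.071]
stat, p = ranksums(verified, source_verified)
if p >= 0.05:
    print('no significant difference; pool the two levels')

# Pool 'Verified' and 'Source Verified', then one-hot encode.
df = pd.DataFrame({'verification_status':
                   ['Verified', 'Source Verified', 'Not Verified']})
df['verification_status'] = df['verification_status'].replace(
    {'Source Verified': 'Verified'})
dummies = pd.get_dummies(df['verification_status'])
print(sorted(dummies.columns))  # -> ['Not Verified', 'Verified']
```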
(e) After one-hot-encoding transformation, the missing values in the categorical features are imputed using the corresponding mode values.
3.4 Data trends
After the data pre-processing discussed in Section 3.3, we obtain 102 monthly observations at the aggregate level along with 19 features.
We visualize the changes in all numerical features described in Table 1 from August 2007 to January 2016 using line plots.
Figure 3 shows illustrative examples of the changing trends of six numerical features.
One feature gradually decreases from August 2007 to early 2010 but begins to increase afterwards, while, surprisingly, another changes in the opposite direction.
Three of the features share a very similar trend: a sharp decrease around the middle of 2008 followed by a gradual increase, with some fluctuations, until 2016.
The remaining feature behaves differently: it drops sharply near the end of 2011, reaches its peak around 2013, and gradually decreases afterwards.
It is worth noting that two of the features follow similar trends, indicating a potential correlation between them.
After visualizing the data trend, a heat map is then used to display the correlations between these numerical time series features.
In the heat map, features that are positively correlated are ‘hot’ while negatively correlated variables are ‘cold’ ().
Figure 4 shows the heat map generated by the continuous features in this study.
It shows that most features are positively correlated with each other, except for two features that are negatively correlated with most of the others.
The target default rate has a very strong positive relationship with three of the features.
Therefore, these three features are considered critical features that can potentially increase the default rate in the P2P market.
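A correlation heat map like the one in Figure 4 can be sketched as follows; the three toy series and their names are assumptions for illustration (real feature names come from Table 1), and a 'coolwarm' colormap gives the hot-positive/cold-negative convention described above.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# Toy monthly series; real feature names come from Table 1.
rng = np.random.default_rng(0)
t = np.arange(102)
df = pd.DataFrame({
    'feature_a': t + rng.normal(0, 5, 102),
    'feature_b': -t + rng.normal(0, 5, 102),
    'default_rate': 0.1 + 0.001 * t + rng.normal(0, 0.01, 102),
})

corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)  # hot = positive
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha='right')
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
fig.savefig('heatmap.png', bbox_inches='tight')
```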
3.5 LSTM for the prediction of default rate
The LSTM approach is applied to model the sequential P2P lending data.
After the data pre-processing described in Section 3.3, the dataset was split into training and testing sets.
Specifically, the training set uses data from August 2007 to May 2014, while the testing set uses data from June 2014 to January 2016.
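A chronological split like this can be sketched on a monthly-indexed series; the toy values below are placeholders, but the date boundaries match the ones stated above.

```python
import pandas as pd

# Monthly default-rate series indexed by month (toy values).
idx = pd.period_range('2007-08', '2016-01', freq='M')
series = pd.Series(range(len(idx)), index=idx, dtype=float)

# Chronological split: train through May 2014, test from June 2014 on.
train = series.loc[:'2014-05']
test = series.loc['2014-06':]
print(len(series), len(train), len(test))  # -> 102 82 20
```

Note that the full range, August 2007 through January 2016, yields exactly the 102 monthly observations reported in Section 3.4.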
The LSTM is trained by minimizing the root mean squared error (RMSE) between the predicted and true default rates via the Adam algorithm.
To identify whether the incorporated macroeconomic feature, the unemployment rate, is beneficial to model performance, two LSTM models are implemented: (1) an LSTM model without the unemployment rate, denoted LSTM(1); and (2) an LSTM model with the unemployment rate as an additional feature, denoted LSTM(2).
The LSTM models are implemented with the Keras library in Python 3 on a personal laptop with a 3.3 GHz Intel Core i7 processor, 16 GB RAM, and macOS.
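A minimal Keras version of such a model is sketched below. The sliding-window framing, layer sizes, window length, and random data are illustrative assumptions, not the paper's configuration; minimizing MSE with Adam yields the same optimum as minimizing RMSE.

```python
import numpy as np
import tensorflow as tf

# Toy supervised framing: predict next month's default rate from the
# previous `window` months of features (sizes are illustrative).
window, n_features = 6, 20
X = np.random.rand(80, window, n_features).astype('float32')
y = np.random.rand(80, 1).astype('float32')

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window, n_features)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
# MSE loss with the Adam optimizer; same minimizer as RMSE.
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=2, batch_size=8, verbose=0)

pred = model.predict(X, verbose=0)
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
```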
Figures 5 and 6 show the predicted monthly default rate along with the true values from August 2007 to January 2016 for the two LSTM models.
The predicted values are displayed in red while the true values are shown by the black line.
The portion to the left of the vertical line is generated from the training data (i.e., August 2007 to May 2014), while the portion to the right is based on the testing set (i.e., June 2014 to January 2016).
Both LSTM models produce a good fit to the default-rate trend on the training and testing sets, regardless of whether the macroeconomic feature is used.
Therefore, LSTM shows its great potential in predicting the default rate in the P2P lending domain.
LSTM(1) yields RMSE values of 0.014 and 0.015 on the training and testing sets, respectively.
LSTM(2) yields RMSE values of 0.011 and 0.012 on the training and testing sets, respectively.
This indicates that incorporating the macroeconomic feature could further improve the model performance.
In this study, we explore the changing trend of the default rate at the aggregate level in the US P2P lending market from August 2007 to January 2016.
From the data visualization perspective, we found that three features could potentially increase the default rate.
An LSTM model is employed to fit the sequential P2P transaction data.
To further improve the performance of LSTM model, we incorporate a macroeconomic feature as an additional feature.
The results show that, although not widely used in the P2P market, LSTM is a good alternative for modeling P2P transaction data.
It is also demonstrated that the macroeconomic feature can improve LSTM performance by decreasing RMSE on both the training and testing datasets.
Therefore, the case study in this paper provides a good reference for investors in their future investments.
Furthermore, our study can broaden the applications of the modern data-driven approaches in the P2P market.
In the future, more macroeconomic features as well as more transaction data should be incorporated to improve the predictive power of LSTM models.
-  M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
-  L. Liu, M. Han, Y. Zhou, and Y. Wang, "LSTM recurrent neural networks for influenza trends prediction," in International Symposium on Bioinformatics Research and Applications. Springer, 2018, pp. 259–264.
-  A. N. Jagannatha and H. Yu, “Bidirectional rnn for medical event detection in electronic health records,” in Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, vol. 2016. NIH Public Access, 2016, p. 473.
-  A. Graves, N. Jaitly, and A.-r. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.
-  J. Jurgovsky, M. Granitzer, K. Ziegler, S. Calabretto, P.-E. Portier, L. He-Guelton, and O. Caelen, “Sequence classification for credit-card fraud detection,” Expert Systems with Applications, vol. 100, pp. 234–245, 2018.
-  X. Li, X. Long, G. Sun, G. Yang, and H. Li, "Overdue prediction of bank loans based on LSTM-SVM," in 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). IEEE, 2018, pp. 1859–1863.
-  M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Sequential deep learning for human action recognition,” in International Workshop on Human Behavior Understanding. Springer, 2011, pp. 29–39.
-  J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in Advances in neural information processing systems, 2015, pp. 2980–2988.
-  S. Hochreiter and J. Schmidhuber, "LSTM can solve hard long time lag problems," in Advances in Neural Information Processing Systems, 1997, pp. 473–479.
-  J. S. Ren, Y. Hu, Y.-W. Tai, C. Wang, L. Xu, W. Sun, and Q. Yan, "Look, listen and learn: a multimodal LSTM for speaker identification," in AAAI, 2016, pp. 3581–3587.
-  J. C. B. Gamboa, “Deep learning for time-series analysis,” arXiv preprint arXiv:1701.01887, 2017.
-  M. E. Garr, J. Rojicek, and J. Vass, "Heatmap timeline for visualization of time series data," Oct. 18, 2012, US Patent App. 13/086,255.