HarvardX Final Report - LendingClub dataset

After three months of work, I submitted the final report for the HarvardX Data Science course.

It is based on the LendingClub dataset. LendingClub is a peer-to-peer lender: it matches private borrowers with individual investors. The loans are small and fairly high-risk (if they could, the borrowers would probably have gone through a bank). Surprisingly, after tapping a market of individual lenders, its biggest lenders are now the banks. To inform investors, LendingClub makes historical loan information publicly available.

This work went through many blind alleys. I won’t list them here; they are in the report’s post-mortem section. But it was an enriching experience overall. I learned a lot, often about the limitations of what I tried: the dataset is big (a few million samples, which is big for an old laptop), with many (ca. 150) misspecified, mixed categorical and numeric variables. The experience will be filed in the ‘it-builds-character’ category…

One point that is still on my mind is learning about Conditional Inference Trees used to bin variables. The binned variables are then fed into a logistic regression to predict probabilities of loan default.

Why are those trees interesting? They are rooted in information theory: they measure the information content of a predictor variable for predicting a binary response, and the predictor is then partitioned into a few intervals (bins). What is great about them?

  • The measurement does NOT rely on the actual values of the predictor variable. This means that NAs go from being a nuisance to being stashed in a bin of their own, treated like any other bin.

  • The logistic regression context called for a binary response, which was perfect for the purpose of this report. But those trees do not require binary outcomes. The binning relies on what are called the Weight of Evidence (calculated for each bin) and the Information Value (calculated for each variable); see the formulas just after this list.

  • The calculations are very quick (about a tenth of a second to bin one million samples) with a small memory footprint.
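
For reference, these are the usual credit-scoring definitions, with $g_i$ and $b_i$ the counts of goods (non-defaults) and bads (defaults) in bin $i$, and $G$ and $B$ their totals:

$$
\mathrm{WoE}_i = \ln\!\left(\frac{g_i/G}{b_i/B}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{g_i}{G} - \frac{b_i}{B}\right) \mathrm{WoE}_i
$$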

In other words, whatever comes in, we do not have to worry about scaling, z-scoring, or filling NAs; each variable is quickly reformatted into a handful of bins (literally of that order of magnitude, depending on the parameters used) based on its relevance to predicting what needs to come out.
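
Here is a minimal sketch in R of what that pipeline can look like, using `partykit::ctree`. The function names (`bin_ctree`, `woe_iv`) and the simulated data are mine for illustration, not the report’s, and the WoE calculation omits the smoothing you would want for bins with zero goods or bads:

```r
library(partykit)

# Bin one predictor against a binary response with a shallow conditional
# inference tree; NAs are stashed in a bin of their own (bin id 0).
bin_ctree <- function(x, y, maxdepth = 3) {
  ok  <- !is.na(x)
  fit <- ctree(y ~ x,
               data    = data.frame(x = x[ok], y = factor(y[ok])),
               control = ctree_control(maxdepth = maxdepth))
  bin     <- rep(0L, length(x))                # 0 = the NA bin
  bin[ok] <- predict(fit, newdata = data.frame(x = x[ok]), type = "node")
  bin
}

# Weight of Evidence per bin and Information Value for the whole variable
# (assumes every bin contains both goods and bads).
woe_iv <- function(bin, y) {
  g   <- tapply(y == 0, bin, sum)              # goods (non-defaults) per bin
  b   <- tapply(y == 1, bin, sum)              # bads (defaults) per bin
  pg  <- g / sum(g)
  pb  <- b / sum(b)
  woe <- log(pg / pb)
  list(woe = woe, iv = sum((pg - pb) * woe))
}

# Usage on simulated data: x has NAs, y is a 0/1 default flag.
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))
x[sample(n, 5e3)] <- NA                        # inject missing values

bin   <- bin_ctree(x, y)
enc   <- woe_iv(bin, y)
x_woe <- enc$woe[as.character(bin)]            # each sample gets its bin's WoE
fit   <- glm(y ~ x_woe, family = binomial)     # logistic regression on WoE
enc$iv                                         # Information Value of the variable
```

Replacing each bin by its WoE (rather than one dummy variable per bin) is what keeps the logistic regression compact: one coefficient per variable, NAs included.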

If I didn’t know better, I would call this model impedance matching (electrical engineers can explain)!

Apart from that, the number of avenues to explore with this dataset (especially using data from other sources) could fill many more months. I listed possible techniques in the report’s conclusion. This is what does, and will, keep banks’ credit risk departments busy and well-staffed…

I am working on making the MovieLens and LendingClub reports available as gitbooks. To be announced.

Emmanuel Rialland
Consultant Finance - Machine Learning