*This was my final project for my graduate course, Programming Machine Learning Applications, at DePaul University.*

### What is K-Nearest Neighbor?

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. Source: Wikipedia
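
To make the majority vote concrete, here is a minimal from-scratch sketch in plain Python. The function name `knn_predict`, the toy points, and the labels are all made up for illustration; this is not the code used in the project.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for this example
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off", "Charged Off"]

prediction = knn_predict(points, labels, query=(1.1, 1.0), k=3)
prediction_k1 = knn_predict(points, labels, query=(4.8, 5.1), k=1)
print(prediction, "|", prediction_k1)
```

With k = 1 the vote reduces to copying the label of the single closest point, which is the special case mentioned in the definition.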

### Analysis process

The original data set was downloaded from Kaggle. It aggregates the loans issued by Lending Club, a US peer-to-peer lending company, from 2007 through 2015. The original data set contains 887,383 rows and 75 columns.

Because of the limited computing power of my MacBook Pro, I chose to sample the data down to 5% of the original for the analysis. I also pre-processed the data by removing categorical variables with high cardinality. Finally, I imputed NaN values as zero, because the alternative of dropping every row that contains a NaN would have eliminated the entire data set.
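
The sampling and cleaning choices can be sketched with pandas roughly as follows. This is only an illustration under assumptions: the toy DataFrame stands in for the Kaggle CSV, the column names are examples, and the `MAX_CATEGORIES` cutoff is a hypothetical threshold, since the write-up does not say how "high cardinality" was decided.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Lending Club data; columns and values are illustrative
rng = np.random.default_rng(0)
n = 1000
loans = pd.DataFrame({
    "loan_amnt": rng.integers(1000, 35000, size=n).astype(float),
    "grade": rng.choice(list("ABCDEFG"), size=n),
    "emp_title": [f"title_{i}" for i in range(n)],  # one distinct value per row
    "annual_inc": rng.normal(60000, 15000, size=n),
})
loans.loc[::10, "annual_inc"] = np.nan  # inject some missing values

# 1. Take a random 5% sample (fixed seed so the result is reproducible)
sample = loans.sample(frac=0.05, random_state=42)

# 2. Drop non-numeric columns with too many distinct values
MAX_CATEGORIES = 20  # hypothetical cutoff, not taken from the project
categorical = sample.select_dtypes(exclude="number").columns
high_cardinality = [c for c in categorical if sample[c].nunique() > MAX_CATEGORIES]
sample = sample.drop(columns=high_cardinality)

# 3. Impute missing numeric values with zero instead of dropping rows
numeric = sample.select_dtypes(include="number").columns
sample[numeric] = sample[numeric].fillna(0)

print(sample.shape, "dropped:", high_cardinality)
```

Zero imputation keeps every row, at the cost of treating "missing" as a real value of 0, which is the trade-off described above.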

**Data Acquisition**: I loaded the necessary libraries and downloaded the Zip package containing the CSV file from Kaggle. After viewing the data and its shape, I took a random 5% sample of the data to perform the analysis on.

**Exploratory data analysis**: During this step I performed some descriptive analysis and determined the target variable. I also explored how many classes were in the target and examined a selection of other possibly problematic (high cardinality) variables. I visualized the target variable in a histogram, which is a good technique for understanding the distribution of the data and helps with parameter tuning.

**Data Cleaning**: I dropped the high cardinality variables during this step as a precursor to the pre-processing step.

**Pre-processing & Transformation**: I separated the target variable from the rest of the data set and transformed the categorical variables into a model matrix with one-hot encoding. Some algorithms require the data in this numeric (often sparse) matrix format. Other statistical software, such as R, automates this step when generating models. I imputed the missing values in the data as 0. I scaled the continuous variables using min-max normalization, which maps values onto a scale from 0 to 1 and prevents variables on larger scales from dominating the distance calculation.

**Data Partition**: I partitioned the pre-processed data into a training set and a test set.

**Modeling**: I built a k-NN classifier model using 10 neighbors (k = 10) and the Euclidean distance.

**Evaluation**: I scored the classifier on unseen test data and calculated its accuracy on both the training and test data. A confusion matrix and classification report were generated, as seen below.
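
The pre-processing, partitioning, modeling, and evaluation steps can be sketched as below. This is a minimal sketch, not the project's code: it assumes pandas and scikit-learn, and the toy data, column names, synthetic target, and the 80/20 split are all invented for illustration. Only the 10 neighbors, the Euclidean distance, the zero imputation, and the min-max scaling come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the cleaned loan data; names and values are illustrative
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "loan_amnt": rng.uniform(1000, 35000, size=n),
    "int_rate": rng.uniform(5, 25, size=n),
    "grade": rng.choice(["A", "B", "C"], size=n),
})
# Synthetic target so the example has some signal to learn
data["loan_status"] = np.where(data["int_rate"] > 15, "Charged Off", "Fully Paid")

# Separate the target, one-hot encode categoricals, impute missing values as 0
y = data["loan_status"]
X = pd.get_dummies(data.drop(columns="loan_status"), dtype=float).fillna(0)

# Min-max normalization: every column is mapped onto [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)

# Partition into training and test sets (the 80/20 ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# k-NN classifier with 10 neighbors and Euclidean distance
knn = KNeighborsClassifier(n_neighbors=10, metric="euclidean")
knn.fit(X_train, y_train)

# For a classifier, score() returns mean accuracy
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)

y_pred = knn.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)

print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
print(cm)
print(report)
```

Passing `zero_division=0` keeps the report quiet for classes that are never predicted, which is the situation behind the rows of zeros in the results.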

### Results

Class | precision | recall | f1-score | support |
---|---|---|---|---|
Charged Off | 0.69 | 0.12 | 0.2 | 663 |
Current | 0.8 | 0.96 | 0.88 | 9101 |
Default | 0 | 0 | 0 | 20 |
Does not meet the credit policy. Status:Charged Off | 0 | 0 | 0 | 7 |
Does not meet the credit policy. Status:Fully Paid | 0 | 0 | 0 | 24 |
Fully Paid | 0.81 | 0.6 | 0.69 | 3069 |
In Grace Period | 0 | 0 | 0 | 95 |
Issued | 0 | 0 | 0 | 147 |
Late (16-30 days) | 0 | 0 | 0 | 39 |
Late (31-120 days) | 0 | 0 | 0 | 146 |
avg / total | 0.77 | 0.8 | 0.77 | 13311 |
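
The avg / total row can be reproduced from the per-class rows as a support-weighted average, which shows why it looks healthy even though seven of the ten classes score zero. The short check below uses only the numbers from the table; the variable and function names are mine.

```python
# (class, precision, recall, f1-score, support), copied from the table
rows = [
    ("Charged Off", 0.69, 0.12, 0.20, 663),
    ("Current", 0.80, 0.96, 0.88, 9101),
    ("Default", 0.0, 0.0, 0.0, 20),
    ("Does not meet the credit policy. Status:Charged Off", 0.0, 0.0, 0.0, 7),
    ("Does not meet the credit policy. Status:Fully Paid", 0.0, 0.0, 0.0, 24),
    ("Fully Paid", 0.81, 0.60, 0.69, 3069),
    ("In Grace Period", 0.0, 0.0, 0.0, 95),
    ("Issued", 0.0, 0.0, 0.0, 147),
    ("Late (16-30 days)", 0.0, 0.0, 0.0, 39),
    ("Late (31-120 days)", 0.0, 0.0, 0.0, 146),
]

total_support = sum(row[4] for row in rows)


def weighted_average(column):
    """Average a metric column, weighting each class by its support."""
    return sum(row[column] * row[4] for row in rows) / total_support


def macro_average(column):
    """Plain unweighted mean over classes, for comparison."""
    return sum(row[column] for row in rows) / len(rows)


weighted_precision = round(weighted_average(1), 2)
weighted_recall = round(weighted_average(2), 2)
weighted_f1 = round(weighted_average(3), 2)
macro_f1 = round(macro_average(3), 2)

print(total_support, weighted_precision, weighted_recall, weighted_f1, macro_f1)
```

Because Current and Fully Paid make up most of the 13311 test rows, they dominate the weighted figures. The unweighted (macro) f1-score over all ten classes is far lower, which reflects how poorly the rare classes are predicted.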
