R Language (26): Prosper Loan Data Analysis, Part 2

Univariate analysis

First, we analyze basic information about the platform's customers, including location, credit status, and reason for borrowing, to identify the general characteristics of the target customers:

  • Regional distribution:
library(ggplot2)
# Bar chart of borrower counts per state, skipping rows with a blank state
ggplot(data = subset(data, BorrowerState != ""),
       aes(x = BorrowerState)) +
  geom_bar(fill = "pink", color = "black") +
  theme(axis.text = element_text(size = 5))


The company's customers are concentrated in California, New York, Florida, Texas, and Illinois, which lead all other states. The company could increase its marketing in the remaining states to develop new customers. Prosper is based in San Francisco, which may help explain why California has the largest user base.
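
The ranking read off the bar chart can also be checked numerically. The sketch below is only an illustration: `toy` is a made-up stand-in for the real `data` frame, and the counts are not Prosper figures.

```r
# Hypothetical stand-in for the Prosper data; the values are invented.
toy <- data.frame(
  BorrowerState = c("CA", "CA", "NY", "", "TX", "CA", "NY", "FL"),
  stringsAsFactors = FALSE
)

# Drop blank states, count rows per state, and sort from most to least common.
state_counts <- sort(table(toy$BorrowerState[toy$BorrowerState != ""]),
                     decreasing = TRUE)

# Keep the three most common states (the real data would support a top 5 or 10).
top_states <- head(state_counts, 3)
print(top_states)
```

Applied to the real data, the same `sort(table(...))` call would give the exact counts behind the five leading states.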

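  • Number of delinquencies in the past 7 years:
# Assuming DelinquenciesLast7Years is numeric, missing values are NA rather than ""
ggplot(data = subset(data, !is.na(DelinquenciesLast7Years)),
       aes(x = DelinquenciesLast7Years)) +
  geom_bar(fill = "orange", color = "black") +
  theme(axis.text = element_text(size = 5)) +
  scale_x_continuous(limits = c(-1, 50))  # values above 50 are dropped from the plot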

  • Customer employment:
ggplot(aes(EmploymentStatus),data = subset(data,!(data$EmploymentStatus==""))) + 
  geom_bar(color="black",fill=I("#B2DFEE"),width = 0.5) +
  theme(axis.text.x=element_text(angle = 90,hjust = 1,vjust=0,size=8))


Most of the platform's customers are recorded as "Employed" or "Full-time", which suggests that they have jobs and a stable income.
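
To put a number on "most", the share of each employment status can be tabulated. This is a minimal sketch on invented values; the status labels are assumed to match those in the `EmploymentStatus` column.

```r
# Hypothetical sample of employment statuses; blanks mimic missing entries.
toy <- data.frame(
  EmploymentStatus = c("Employed", "Full-time", "Employed", "", "Self-employed",
                       "Employed", "Not employed", "Full-time"),
  stringsAsFactors = FALSE
)

# Remove blank entries, then convert counts to proportions.
status <- toy$EmploymentStatus[toy$EmploymentStatus != ""]
status_share <- prop.table(table(status))

# Combined share of the two categories the text singles out.
employed_share <- sum(status_share[c("Employed", "Full-time")])
print(round(status_share, 3))
print(employed_share)
```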

  • Number of customer credit inquiries:
# Histogram of one column of `data`; .data[[ ]] replaces the deprecated aes_string()
bar_plot <- function(varname, binwidth) {
  ggplot(data, aes(x = .data[[varname]])) +
    geom_histogram(binwidth = binwidth) + xlab(varname)
}

# Zoom in on the bottom 95% of values and mark the 95th percentile with a dashed line
inq_p95 <- quantile(data$InquiriesLast6Months, probs = 0.95, na.rm = TRUE)
bar_plot('InquiriesLast6Months', 1) +
  coord_cartesian(xlim = c(0, inq_p95)) +
  geom_vline(xintercept = inq_p95, linetype = "dashed", color = "red") +
  theme(panel.background = element_rect(fill = "white"))


The number of credit inquiries indicates how many times the borrower has recently applied for credit; the more inquiries, the tighter the borrower's finances are likely to be. The figure shows that 95% of customers have roughly 5 or fewer inquiries in the last six months.
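
The dashed line in the plot is the 95th percentile, which can also be computed and checked directly. The numbers below are invented for illustration and are not taken from the Prosper data.

```r
# Hypothetical inquiry counts, including one missing value and one outlier.
inquiries <- c(0, 1, 1, 2, 0, 3, NA, 1, 2, 12)

# 95th percentile, ignoring missing values (R's default type-7 interpolation).
p95 <- quantile(inquiries, probs = 0.95, na.rm = TRUE)

# Share of non-missing observations at or below that cutoff.
share_below <- mean(inquiries <= p95, na.rm = TRUE)

print(p95)
print(share_below)
```

On a small sample the share at or below the cutoff will not be exactly 95%, because the percentile is interpolated between observed values.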

  • Debt to income ratio of customers:
# Same view as above: bottom 95% of values, with the 95th percentile marked
dti_p95 <- quantile(data$DebtToIncomeRatio, probs = 0.95, na.rm = TRUE)
bar_plot('DebtToIncomeRatio', 0.04) +
  coord_cartesian(xlim = c(0, dti_p95)) +
  geom_vline(xintercept = dti_p95, linetype = "dashed", color = "red") +
  theme(panel.background = element_rect(fill = "white"))


The higher the debt-to-income ratio, the lower the borrower's ability to repay the loan. 95% of the platform's customers have a debt-to-income ratio below 0.5, so the overall debt-to-income ratio of customers is relatively low.
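
As a worked illustration of why a higher ratio means less capacity to repay, the sketch below computes the ratio for three made-up borrowers with the same income. The column names `monthly_debt` and `monthly_income` are hypothetical and do not come from the Prosper data.

```r
# Three hypothetical borrowers with equal income and increasing debt payments.
borrowers <- data.frame(
  monthly_debt   = c(500, 1500, 2400),
  monthly_income = c(4000, 4000, 4000)
)

# Debt-to-income ratio: monthly debt payments divided by monthly income.
borrowers$dti <- borrowers$monthly_debt / borrowers$monthly_income

# Income left after debt payments shrinks as the ratio grows.
borrowers$income_left <- borrowers$monthly_income - borrowers$monthly_debt

# Flag borrowers under the 0.5 level mentioned in the text.
borrowers$below_half <- borrowers$dti < 0.5
print(borrowers)
```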

  • Customer's monthly income:
# Histogram of stated monthly income from 0 to 15000; dashed lines mark 3000 and 5000
bar_plot('StatedMonthlyIncome', 425) +
  scale_x_continuous(limits = c(0, 15000), breaks = seq(0, 15000, 500)) +
  geom_vline(xintercept = c(3000, 5000), linetype = "dashed", color = "red") +
  theme(panel.background = element_rect(fill = "white"),
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0, size = 8))


The stated monthly income of most borrowers falls between 3,000 and 5,000 US dollars.
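
One way to check a claim like this is to bin incomes into bands and look at the share in each band. The incomes below are invented, and the band boundaries simply mirror the two dashed lines in the plot.

```r
# Hypothetical stated monthly incomes, with one missing value.
income <- c(2500, 3200, 4100, 4800, 5200, 7000, 3600, NA, 12000, 4500)

# Left-closed bands: [0, 3000), [3000, 5000), [5000, Inf].
bands <- cut(income,
             breaks = c(0, 3000, 5000, Inf),
             labels = c("under 3000", "3000 to under 5000", "5000 and over"),
             right = FALSE, include.lowest = TRUE)

# Counts and shares per band; table() ignores the NA by default.
band_counts <- table(bands)
band_share <- prop.table(band_counts)
print(band_counts)
print(round(band_share, 3))
```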

  • Reason for loan:
ggplot(data,aes(x=ListingCategory..numeric.))+
  geom_bar(color="black",fill=I("#70DBDB"))+scale_x_continuous(breaks = c(0:20))+scale_y_sqrt()


This chart shows that loans are concentrated in categories 1, 0, and 7. The data itself does not explain what the numeric codes mean, so the specific loan purposes are not clear from the chart alone; they have to be looked up in the dataset's variable definitions.
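
Once the code meanings are known, the numeric categories can be replaced with readable labels before plotting. The mapping below is an assumption based on the variable definitions commonly distributed with the Prosper data set and should be verified against that file; only codes 0 to 7 are listed, and the sample codes are invented.

```r
# ASSUMED mapping of ListingCategory codes 0-7; verify against the data dictionary.
category_labels <- c(
  "0" = "Not Available", "1" = "Debt Consolidation", "2" = "Home Improvement",
  "3" = "Business",      "4" = "Personal Loan",      "5" = "Student Use",
  "6" = "Auto",          "7" = "Other"
)

# Hypothetical sample of category codes, including one code outside the mapping.
codes <- c(1, 0, 7, 1, 2, 13, 1)

# Look up each code by name; codes missing from the mapping come back as NA.
labels <- unname(category_labels[as.character(codes)])

# Keep unmapped codes visible instead of silently dropping them.
unmapped <- is.na(labels)
labels[unmapped] <- paste("Code", codes[unmapped])

category_counts <- sort(table(labels), decreasing = TRUE)
print(category_counts)
```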

  • Credit status of platform users (grade / score):
library(gridExtra)
# Convert the grade columns to ordered factors, from best (AA) to worst (HR)
grade_levels <- c("AA", "A", "B", "C", "D", "E", "HR")
data$creditlevel <- factor(data$creditlevel, ordered = TRUE, levels = grade_levels)
data$CreditGrade <- factor(data$CreditGrade, ordered = TRUE, levels = grade_levels)
data$ProsperRating..Alpha. <- factor(data$ProsperRating..Alpha., ordered = TRUE,
                                     levels = grade_levels)

p1 <- ggplot(data,aes(x=creditscore))+
  geom_histogram(binwidth=20,color="black",fill=I("#DBDB70"))+
  scale_x_continuous(limits = c(400,900))

p2 <- ggplot(data=subset(data,data$CreditGrade!=""& data$CreditGrade!="NC"),aes(x=CreditGrade))+
  geom_bar(color="black",fill=I("#7093DB"))+
  xlab("creditlevel(pre2009)")

p3 <- ggplot(data=subset(data,data$ProsperRating..Alpha.!=""),
             aes(x=ProsperRating..Alpha.))+
  geom_bar(color="black",fill=I("#E9C2A6"))+
  xlab("creditlevel(after2009)")

p4 <- ggplot(data=subset(data,!is.na(data$creditlevel)),aes(x=creditlevel))+
  geom_bar(color="black",fill=I("#EAADEA"))
grid.arrange(p1,p2,p3,p4,ncol = 1)


The credit score and credit grade charts show distributions that are roughly normal. Credit scores are concentrated between 650 and 750 points, and credit grades are concentrated in B, C, and D. After 2009, the separation between the top grades (AA and A) and the bottom grades (E and HR) is clearer.
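
The concentration in the middle grades can be quantified with an ordered factor. This is a minimal sketch on invented grades that reuses the AA-to-HR level order from the code above.

```r
# Grade levels ordered from best (AA) to worst (HR).
grade_levels <- c("AA", "A", "B", "C", "D", "E", "HR")

# Hypothetical sample of grades.
grades <- factor(c("B", "C", "C", "D", "A", "HR", "C", "B", "D", "AA", "E", "C"),
                 levels = grade_levels, ordered = TRUE)

# Share of each grade, reported in level order rather than alphabetically.
grade_share <- prop.table(table(grades))

# Combined share of the middle grades B, C, and D.
middle_share <- sum(grade_share[c("B", "C", "D")])

# Ordered factors support comparisons: count grades at A or better.
top_count <- sum(grades <= "A")

print(round(grade_share, 3))
print(middle_share)
print(top_count)
```

Declaring the factor as ordered is what keeps the bars and the table in grade order and makes comparisons such as `grades <= "A"` meaningful.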


Tags: less P4

Posted on Mon, 10 Feb 2020 00:00:10 -0500 by kendhal