Credit scoring with Python

Project introduction

Credit risk: the risk of economic loss caused by non-performance of contractual obligations.
Scorecard: a means of expressing risk probability in the form of a score. The higher the score, the safer the customer.

Data source: Kaggle (Give Me Some Credit)

There are 150,000 samples, covering the following aspects:

– basic attributes: including the borrower's age at the time of borrowing.
– solvency: including the borrower's monthly income and debt ratio.
– credit transactions: number of times 30-59 days overdue, 60-89 days overdue, and 90 days or more overdue in the past two years.
– property status: including the number of open credit lines and loans, and of real estate loans or lines.
– loan attribute: none.
– other factors: including the number of family members of the borrower (excluding the borrower).
– time window: the observation window of the independent variables is the past two years, and the performance window of the dependent variable is the next two years.

Variable code – Variable label – Variable interpretation
x0 – SeriousDlqin2yrs – good or bad customer
x1 – RevolvingUtilizationOfUnsecuredLines – revolving utilization of unsecured credit lines
x2 – age – age of the borrower at the time of borrowing
x3 – NumberOfTime30-59DaysPastDueNotWorse – number of times 30-59 days past due but not worse
x4 – DebtRatio – debt ratio
x5 – MonthlyIncome – monthly income
x6 – NumberOfOpenCreditLinesAndLoans – number of open credit lines and loans
x7 – NumberOfTimes90DaysLate – number of times 90 or more days past due
x8 – NumberRealEstateLoansOrLines – number of real estate loans or lines
x9 – NumberOfTime60-89DaysPastDueNotWorse – number of times 60-89 days past due but not worse
x10 – NumberOfDependents – number of family members (dependents)
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
import seaborn as sns
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
import os

os.chdir(r'\Credit score\GiveMeSomeCredit')
df = pd.read_csv('cs-training.csv')
states = {"Unnamed: 0": "user ID",
          "SeriousDlqin2yrs": "Good and bad customers",
          "RevolvingUtilizationOfUnsecuredLines": "Available limit ratio",
          "age": "Age",
          "NumberOfTime30-59DaysPastDueNotWorse": "Overdue 30-59 days count",
          "DebtRatio": "Debt ratio",
          "MonthlyIncome": "monthly income",
          "NumberOfOpenCreditLinesAndLoans": "Credit quantity",
          "NumberOfTimes90DaysLate": "Number of transactions overdue for 90 days",
          "NumberRealEstateLoansOrLines": "Fixed asset loans",
          "NumberOfTime60-89DaysPastDueNotWorse": "Overdue 60-89 days count",
          "NumberOfDependents": "Number of family members"}
df.rename(columns=states, inplace=True)
df.head()

Data preprocessing
df.info()

def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()
    # Proportion of missing values per column
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Combine into one table
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing values', 1: 'Missing proportion'})
    # Keep only columns with missing values, sorted by proportion
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        'Missing proportion', ascending=False).round(1)
    # Print a summary
    print("Data have " + str(df.shape[1]) + " columns.\n"
          "Among them, " + str(mis_val_table_ren_columns.shape[0]) +
          " columns contain missing values.")
    # Return the table of columns with missing values
    return mis_val_table_ren_columns

# Show the missing-value summary
missing_values_table(df)

Missing value

The "Number of family members" variable has relatively few missing values, so dropping those rows directly will not have a great impact on the overall model. After handling the missing values, duplicate records are also removed.

df = df[df['Number of family members'].notnull()].drop_duplicates()

The missing rate of "monthly income" is relatively high, so the random forest method is used to fill in the missing values based on the correlations between variables.

# Fill missing values of "monthly income" with a random forest prediction
def set_missing(df):
    # Take the numeric features, putting "monthly income" first
    # process_df = df.iloc[:, 1:]
    process_df = df.iloc[:, [6, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11]]
    # Split into rows with known and unknown "monthly income"
    known = process_df[process_df['monthly income'].notnull()].values
    unknown = process_df[process_df['monthly income'].isnull()].values
    # X holds the feature values
    X = known[:, 1:]
    # y holds the target ("monthly income")
    y = known[:, 0]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the unknown incomes with the fitted model
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    # Fill the original missing values with the predictions
    df.loc[df['monthly income'].isnull(), 'monthly income'] = predicted
    return df

df = set_missing(df)

Outliers

After the missing values are handled, we also need to deal with outliers. Outliers are values that deviate significantly from the rest of the sampled data; for example, a customer whose recorded age is 0 is generally considered an outlier. Here the capping method is used.
Capping method: values above the 99th percentile of a variable are set equal to the 99th-percentile value, and values below the 1st percentile are set equal to the 1st-percentile value.

df.Age.hist(bins=50)

# Capping function: clamp values to the [floor, root] range
def blk(floor, root):
    def f(x):
        if x < floor:
            x = floor
        elif x > root:
            x = root
        return x
    return f

q1 = df.Age.quantile(0.01)
q99 = df.Age.quantile(0.99)
blk_tot = blk(floor=q1, root=q99)
df.Age = df.Age.map(blk_tot)
df.Age.hist(bins=50)
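As a side note, the same capping can be written more concisely with pandas' built-in Series.clip; a minimal sketch reusing the q1 and q99 quantiles computed above:

# Equivalent capping with Series.clip, using the same 1% / 99% bounds
df.Age = df.Age.clip(lower=q1, upper=q99)
df.Age.hist(bins=50)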

Exploratory analysis

Univariate analysis

age_cut = pd.cut(df.Age, 5)
age_cut_grouped = df['Good and bad customers'].groupby(age_cut).count()
age_cut_grouped1 = df['Good and bad customers'].groupby(age_cut).sum()
df2 = pd.merge(pd.DataFrame(age_cut_grouped),
               pd.DataFrame(age_cut_grouped1),
               right_index=True, left_index=True)
df2.rename(columns={'Good and bad customers_x': 'Total customer',
                    'Good and bad customers_y': 'Bad customer'}, inplace=True)
df2.insert(2, 'Bad customer rate', df2['Bad customer'] / df2['Total customer'])
ax1 = df2[['Total customer', 'Bad customer']].plot.bar()
ax1.set_xticklabels(df2.index, rotation=15)
ax1.set_ylabel('Number of customers')
ax1.set_title('Distribution of age and number of good and bad customers')

ax11 = df2['Bad customer rate'].plot()
ax11.set_ylabel('Bad customer rate')
ax11.set_title('Trend of bad customer rate with age')

Multivariate analysis

corr = df.iloc[:, 1:].corr()  # Correlation matrix of the variables
xticks = list(corr.index)     # x-axis labels
yticks = list(corr.index)     # y-axis labels
fig = plt.figure(figsize=(10, 10))
ax1 = fig.add_subplot(1, 1, 1)
sns.heatmap(corr, annot=True, cmap="rainbow", ax=ax1, linewidths=.5,
            annot_kws={'size': 9, 'weight': 'bold', 'color': 'k'})
ax1.set_xticklabels(xticks, rotation=30, fontsize=10)
ax1.set_yticklabels(yticks, rotation=0, fontsize=10)
plt.show()

WOE value substitution and LR modeling

WOE stands for Weight of Evidence and is widely used in risk assessment and credit scorecard development.
IV stands for Information Value; it is obtained as a WOE-weighted sum and measures the predictive power of the corresponding independent variable.
The WOE of a bin is calculated as ln[(bad in bin / total bad) / (good in bin / total good)]:
$WOE_i=\ln\left(\frac{Bad_i/Bad_T}{Good_i/Good_T}\right)=\ln\left(\frac{Bad_i}{Bad_T}\right)-\ln\left(\frac{Good_i}{Good_T}\right)$

$IV_i=\left(\frac{Bad_i}{Bad_T}-\frac{Good_i}{Good_T}\right)\times WOE_i$

$IV=\displaystyle\sum_{i=1}^{n} IV_i$
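To make these formulas concrete, here is a minimal, self-contained sketch with made-up bin counts (not taken from the dataset), showing how the WOE of each bin and the IV of a variable are computed:

import numpy as np

# Made-up counts for a variable split into 3 bins (illustration only)
bad = np.array([30, 50, 20])      # bad customers in each bin (Bad_i)
good = np.array([400, 300, 200])  # good customers in each bin (Good_i)

bad_dist = bad / bad.sum()        # Bad_i / Bad_T
good_dist = good / good.sum()     # Good_i / Good_T

woe = np.log(bad_dist / good_dist)          # WOE of each bin
iv = ((bad_dist - good_dist) * woe).sum()   # IV of the whole variable

print(woe.round(3))
print(round(iv, 3))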

pinf = float('inf')    # Positive infinity
ninf = float('-inf')   # Negative infinity
cut1 = pd.qcut(df['Available limit ratio'], 4)
cut2 = pd.qcut(df['Age'], 8)
bins3 = [ninf, 0, 1, 3, 5, pinf]
cut3 = pd.cut(df['Overdue 30-59 days count'], bins3)
cut4 = pd.qcut(df['Debt ratio'], 3)
cut5 = pd.qcut(df['monthly income'], 4)
cut6 = pd.qcut(df['Credit quantity'], 4)
bins7 = [ninf, 0, 1, 3, 5, pinf]
cut7 = pd.cut(df['Number of transactions overdue for 90 days'], bins7)
bins8 = [ninf, 0, 1, 2, 3, pinf]
cut8 = pd.cut(df['Fixed asset loans'], bins8)
bins9 = [ninf, 0, 1, 3, pinf]
cut9 = pd.cut(df['Overdue 60-89 days count'], bins9)
bins10 = [ninf, 0, 1, 2, 3, 5, pinf]
cut10 = pd.cut(df['Number of family members'], bins10)
# Overall bad/good customer ratio
rate = df['Good and bad customers'].sum() / (df['Good and bad customers'].count() - df['Good and bad customers'].sum())

# WOE calculation function
def get_woe_data(cut):
    grouped = df['Good and bad customers'].groupby(cut, as_index=True).value_counts()
    # WOE of each bin: ln((bad_i / good_i) / (bad_T / good_T))
    woe = np.log(pd.DataFrame(grouped).unstack().iloc[:, 1] /
                 pd.DataFrame(grouped).unstack().iloc[:, 0] / rate)
    return woe

cut1_woe = get_woe_data(cut1)
cut2_woe = get_woe_data(cut2)
cut3_woe = get_woe_data(cut3)
cut4_woe = get_woe_data(cut4)
cut5_woe = get_woe_data(cut5)
cut6_woe = get_woe_data(cut6)
cut7_woe = get_woe_data(cut7)
cut8_woe = get_woe_data(cut8)
cut9_woe = get_woe_data(cut9)
cut10_woe = get_woe_data(cut10)
# IV calculation function
def get_IV_data(cut, cut_woe):
    grouped = df['Good and bad customers'].groupby(cut, as_index=True).value_counts()
    cut_IV = ((pd.DataFrame(grouped).unstack().iloc[:, 1] / df['Good and bad customers'].sum() -
               pd.DataFrame(grouped).unstack().iloc[:, 0] /
               (df['Good and bad customers'].count() - df['Good and bad customers'].sum())) * cut_woe).sum()
    return cut_IV

# IV value of each variable
cut1_IV = get_IV_data(cut1, cut1_woe)
cut2_IV = get_IV_data(cut2, cut2_woe)
cut3_IV = get_IV_data(cut3, cut3_woe)
cut4_IV = get_IV_data(cut4, cut4_woe)
cut5_IV = get_IV_data(cut5, cut5_woe)
cut6_IV = get_IV_data(cut6, cut6_woe)
cut7_IV = get_IV_data(cut7, cut7_woe)
cut8_IV = get_IV_data(cut8, cut8_woe)
cut9_IV = get_IV_data(cut9, cut9_woe)
cut10_IV = get_IV_data(cut10, cut10_woe)

# Visualize the IV value of each variable
df_IV = pd.DataFrame([cut1_IV, cut2_IV, cut3_IV, cut4_IV, cut5_IV,
                      cut6_IV, cut7_IV, cut8_IV, cut9_IV, cut10_IV],
                     index=df.columns[2:])
df_IV.plot(kind='bar')
for a, b in zip(range(10), df_IV.values):
    plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=9)


LR modeling

# Keep only the target and the variables retained after the IV analysis (drop the user ID as well)
df_new = df.drop(['Debt ratio', 'monthly income', 'Credit quantity',
                  'Fixed asset loans', 'Number of family members', 'user ID'], axis=1)

# Replace each bin with its WOE value
def replace_data(cut, cut_woe):
    a = []
    for i in cut.unique():
        a.append(i)
    a.sort()
    for m in range(len(a)):
        cut.replace(a[m], cut_woe.values[m], inplace=True)
    return cut

df_new['Available limit ratio'] = replace_data(cut1, cut1_woe)
df_new['Age'] = replace_data(cut2, cut2_woe)
df_new['Overdue 30-59 days count'] = replace_data(cut3, cut3_woe)
df_new['Number of transactions overdue for 90 days'] = replace_data(cut7, cut7_woe)
df_new['Overdue 60-89 days count'] = replace_data(cut9, cut9_woe)
x = df_new.iloc[:, 1:]
y = df_new.iloc[:, 0]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
model = LogisticRegression()
clf = model.fit(x_train, y_train)
print('Test accuracy: {}'.format(clf.score(x_test, y_test)))
y_pred = clf.predict(x_test)
y_pred1 = clf.decision_function(x_test)
print('y_pred\n', y_pred)
print('y_pred1\n', y_pred1)

# Plot the ROC curve and compute the AUC
fpr, tpr, threshold = roc_curve(y_test, y_pred1)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc="lower right")
plt.show()

Converting the LR model into the corresponding scores

In the logistic regression model, the log-odds is expressed as a linear combination of the feature variables; the formula is as follows:
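$Logit(p_i)=\ln\left(\frac{p_i}{1-p_i}\right)=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_nx_n$

where $p_i$ is the predicted probability of default and $\beta_0,\dots,\beta_n$ are the regression coefficients.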

Logit(p_i) is a logarithmic value, i.e. log(P(bad)/P(good)). Since P(bad)/P(good) ranges over (0, ∞), log(P(bad)/P(good)) ranges over (-∞, +∞).
To make the score more practical, each attribute's contribution is linearly rescaled and an offset is added. The score is proportional to the logarithm of the good/bad odds used in the logistic regression, not to the odds themselves; it can therefore be negative, and the smaller the score, the higher the risk.
The score corresponding to each attribute can be calculated as follows: multiply the attribute's WOE by the variable's regression coefficient, add a share of the regression intercept a (spread over the n variables), multiply by the scale factor, and finally add a share of the offset:
$Score_i=\left(woe_i\times\beta_i+\frac{a}{n}\right)\times factor+\frac{offset}{n}$
The total score of the scorecard can then be calculated from the overall odds:

odds = P_good / P_bad
Score = offset + factor * log(odds)
The scale factor and offset can be determined from common industry conventions: a good:bad odds of 20:1 corresponds to a score of 600, and every 20-point increase in the score doubles the good-to-bad odds.
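Plugging these conventions in (base score 600 at odds 20:1, 20 points to double the odds, natural logarithm as in the code below), the two parameters work out to:

$factor=\frac{20}{\ln 2}\approx 28.85$

$offset=600-\frac{20}{\ln 2}\times\ln 20\approx 513.56$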

In the conversion from log(odds) to score, the derivation above uses factor and offset as the conversion parameters. In practice, however, these two parameters are hard to interpret, so it is inconvenient to specify their initial values directly. More commonly, the following three parameters are specified in advance:
b: base score
o: odds corresponding to the base score (odds at the base point)
p: points to double the odds, i.e. the score increase when the odds double
The conversion between factor, offset and b, o, p is:
$factor=\frac{p}{\ln 2}$

$offset=b-p\times\frac{\ln o}{\ln 2}$
Whether factor and offset or b, o and p are used, these parameters only serve to convert log(odds) into convenient scores; they have nothing to do with the logistic regression itself. Different conversion parameters yield different scores, but they do not change the underlying probabilities.

coe = model.coef_

# Common industry convention: a score of 600 corresponds to odds of 20:1,
# and every additional 20 points doubles the odds
factor = 20 / np.log(2)
offset = 600 - 20 * np.log(20) / np.log(2)

# Score of each bin of a variable
def get_score(coe, woe, factor):
    scores = []
    for w in woe:
        score = round(coe * w * factor, 0)
        scores.append(score)
    return scores

# Scores for each variable kept in the model
x1 = get_score(coe[0][0], cut1_woe, factor)
x2 = get_score(coe[0][1], cut2_woe, factor)
x3 = get_score(coe[0][2], cut3_woe, factor)
x7 = get_score(coe[0][3], cut7_woe, factor)
x9 = get_score(coe[0][4], cut9_woe, factor)

# Print the scores corresponding to each feature
print("Scores for available limit ratio: {}".format(x1))
print("Scores for age: {}".format(x2))
print("Scores for overdue 30-59 days count: {}".format(x3))
print("Scores for number of transactions overdue for 90 days: {}".format(x7))
print("Scores for overdue 60-89 days count: {}".format(x9))
