Credit scoring with Python

Project introduction

Credit risk: the risk of economic loss caused by non performance of contractual obligations.
Scorecard: a means of measuring risk probability in the form of score. The higher the score, the safer.

Data source: Kaggle

There are 150000 sample data,

– basic attributes: including the age of the borrower at that time.
– solvency: including the borrower's monthly income and debt ratio.
– credit transactions: 35-59 days overdue in two years, 60-89 days overdue in two years, 90 days or more overdue in two years.
– property status: including the number of open-end credit and loans, real estate loans or lines.
– loan attribute: none.
– other factors: including the number of family members of the borrower (excluding myself).
– time window: the observation window of the independent variable is the past two years, and the performance window of the dependent variable is the next two years.

gradeVariable labelVariable interpretation
x0SeriousDlqin2yrsGood customers and bad customers
x1RevolvingUtilizationOfUnsecuredLinesRecycling of unsecured loans
x2ageAge of the borrower at the time of borrowing
x3NumberOfTime30-59DaysPastDueNotWorse35-59 days overdue but not bad times
x4DebtRatioDebt ratio
x5MonthlyIncomemonthly income
x6NumberOfOpenCreditLinesAndLoansNumber of open credit and loans
x7NumberOfTimes90DaysLate90 day overdue times
x8NumberRealEstateLoansOrLinesNumber of real estate loans or lines
x9NumberOfTime60-89DaysPastDueNotWorse60-89 days overdue but not bad times
x10NumberOfDependentsNumber of family members
import numpy as np
import matplotlib
import matplotlib.pyplot as plt 
plt.rcParams['font.sans-serif'] =['Microsoft YaHei']
import seaborn as sns 
import pandas as pd 
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
import os 
os.chdir(r'\Credit score\GiveMeSomeCredit')
states={"Unnamed: 0":"user ID",
        "SeriousDlqin2yrs":"Good and bad customers",
        "RevolvingUtilizationOfUnsecuredLines":"Available limit ratio",
        "NumberOfTime30-59DaysPastDueNotWorse":"Overdue 30-59 Tianbi number",
        "DebtRatio":"Debt ratio",
        "MonthlyIncome":"monthly income",
        "NumberOfOpenCreditLinesAndLoans":"Credit quantity",
        "NumberOfTimes90DaysLate":"Number of transactions overdue for 90 days",
        "NumberRealEstateLoansOrLines":"Fixed asset loans",
        "NumberOfTime60-89DaysPastDueNotWorse":"Overdue 60-89 Tianbi number",
        "NumberOfDependents":"Number of family members"}

Data preprocessing

def missing_values_table(df):
    #All missing values
    mis_val =df.isnull().sum()
    #Proportion of missing values
    mis_val_percent =100*df.isnull().sum()/len(df)
    #Can make a watch
    #Change column name
                               columns={0:'Missing value',1:'Missing proportion'})
    #Sort missing values
                              .sort_values('Missing proportion',ascending=False).round(1)
    #Print out table
    print("Data have"+str(df.shape[1])+"column.\n"
    "Column contains missing values.")
    #Return true rows and columns
    return mis_val_table_ren_columns #Show missing values and functions representing missing values

Missing value

The missing value of "number of family members" variable is relatively small, and it will not have a great impact on the overall model if it is deleted directly. After processing the missing values, delete the duplicates.

df=df[df.Number of family members.notnull()].drop_duplicates()

The missing rate of "monthly income" is relatively large. The random forest method is used to fill in the missing value according to the correlation between variables

# Prediction of filling function with missing values by random forest
def set_missing(df):
    # Take out the existing numerical features
#     process_df = df.iloc[:,1: ]
    process_df = df.iloc[:,[6,1,2,3,4,5,7,8,9,10,11]]
    # It is divided into two parts: known feature and unknown feature
    known = process_df[process_df.monthly income.notnull()].values
    unknown = process_df[process_df.monthly income.isnull()].values
    # X is the characteristic attribute value
    X = known[:,1:]
    # y is the result label value
    y = known[:,0]
    # fit into RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0,n_estimators=200,max_depth=3,n_jobs=-1),y)
    # The obtained model is used for unknown eigenvalue prediction
    predicted = rfr.predict(unknown[:,1:]).round(0)
    # The prediction results obtained are used to fill in the original missing data
    df.loc[(df.monthly income.isnull()), 'monthly income'] = predicted
    return df



After the missing value is processed, we also need to process the abnormal value. Outliers refer to values that deviate significantly from most of the sampling data. For example, when the age of an individual customer is 0, it is generally considered as outliers. Here, the capping method is used.
Capping method: replace more than 99% and less than 1% of the points in the data frame in the whole line, and set more than 99% of the point value = 99% of the point value; Point value less than 1% = point value of 1%.


#Capping method
def blk(floor,root):
    def f(x):
        if x<floor:
        elif x>root:
        return x
    return f


Exploratory analysis

Univariate analysis

age_cut_grouped=df['Good and bad customers'].groupby(age_cut).count()
age_cut_grouped1=df['Good and bad customers'].groupby(age_cut).sum()

df2=pd.merge(pd.DataFrame(age_cut_grouped), pd.DataFrame(age_cut_grouped1),right_index=True, left_index=True)
df2.rename(columns={'Good and bad customers_x':'Total customer', 'Good and bad customers_y':'Bad customer'},inplace=True)
df2.insert(2,'Bad customer rate',df2['Bad customer']/df2['Total customer'])

ax1=df2[['Total customer','Bad customer']]
ax1.set_ylabel('Number of customers')
ax1.set_title('Distribution of age and number of good and bad customers')

ax11=df2['Bad customer rate'].plot()
ax11.set_ylabel('Bad customer rate')
ax11.set_title('Trend chart of bad customer rate with age')

Multivariate analysis

corr = df.iloc[:,1:].corr() #Calculate the correlation coefficient of each variable
xticks = list(corr.index)  #x-axis label
yticks = list(corr.index)  #y-axis label
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(1, 1, 1)

sns.heatmap(corr,annot=True,cmap="rainbow",ax=ax1,linewidths=.5,annot_kws={'size':9,'weight':'bold', 'color':'k'})
ax1.set_xticklabels(xticks, rotation=30, fontsize=10)
ax1.set_yticklabels(yticks, rotation=0, fontsize=10)

WOE value substitution and LR modeling

The full name of woe is Weight of Evidence, which is often used in risk assessment, credit scoring card and other fields.
The full name of IV is Information value, which can be obtained by woe weighted summation to measure the prediction ability of the corresponding variable of the independent variable.
The calculation formula of WOE is: ln [(default / total default) / (normal / total normal)].
W O E i = l n ( B a d i B a d T / G o o d i G o o d T ) = l n ( B a d i B a d T ) − l n ( G o o d i G o o d T ) WOE_i=ln(\frac{Bad_i}{Bad_T} / \frac{Good_i}{Good_T})=ln(\frac{Bad_i}{Bad_T})-ln( \frac{Good_i}{Good_T}) WOEi​=ln(BadT​Badi​​/GoodT​Goodi​​)=ln(BadT​Badi​​)−ln(GoodT​Goodi​​)

I V i = ( B a d i B a d T − G o o d i G o o d T ) ∗ W O E i IV_i=(\frac{Bad_i}{Bad_T} -\frac{Good_i}{Good_T})* WOE_i IVi​=(BadT​Badi​​−GoodT​Goodi​​)∗WOEi​

I V = ∑ i = 1 n ( I V i ) IV=\displaystyle \sum_{i=1}^n(IV_i) IV=i=1∑n​(IVi​)

pinf = float('inf') #Positive infinity
ninf = float('-inf') #Negative infinity

cut1=pd.qcut(df['Available limit ratio'],4)
bins3=[ninf, 0, 1, 3, 5, pinf]
cut3=pd.cut(df['Overdue 30-59 Tianbi number'],bins3)
cut4=pd.qcut(df['Debt ratio'],3)
cut5=pd.qcut(df['monthly income'],4)
cut6=pd.qcut(df['Credit quantity'],4)
bins7=[ninf, 0, 1, 3, 5, pinf]
cut7=pd.cut(df['Number of transactions overdue for 90 days'],bins7)
bins8=[ninf, 0,1,2, 3, pinf]
cut8=pd.cut(df['Fixed asset loans'],bins8)
bins9=[ninf, 0, 1, 3, pinf]
cut9=pd.cut(df['Overdue 60-89 Tianbi number'],bins9)
bins10=[ninf, 0, 1, 2, 3, 5, pinf]
cut10=pd.cut(df['Number of family members'],bins10)
#Good / bad customer ratio
rate=df['Good and bad customers'].sum()/(df['Good and bad customers'].count()-df['Good and bad customers'].sum())

#Define woe calculation function
def get_woe_data(cut):
    grouped=df['Good and bad customers'].groupby(cut,as_index = True).value_counts()
    woe=np.log(pd.DataFrame(grouped).unstack().iloc[:,1]/pd.DataFrame(grouped).unstack().iloc[:,0]/rate) #Calculate the woe value for each group

#Define IV value calculation function
def get_IV_data(cut,cut_woe):
    grouped=df['Good and bad customers'].groupby(cut,as_index = True).value_counts()
    cut_IV=((pd.DataFrame(grouped).unstack().iloc[:,1]/df['Good and bad customers'].sum()-pd.DataFrame(grouped).unstack().iloc[:,0]/
    (df['Good and bad customers'].count()-df['Good and bad customers'].sum()))*cut_woe).sum()

#Calculate the IV value of each group

#Visualization of IV values for each group
for a,b in zip(range(10),df_IV.values):
    plt.text(a,b,'%.2f' % b, ha='center',va= 'bottom',fontsize=9)

LR modeling

df_new=df.drop(['Debt ratio', 'monthly income', 'Credit quantity','Fixed asset loans', 'Number of family members','user ID'],axis=1)

def replace_data(cut,cut_woe):
    for i in cut.unique():

    for m in range(len(a)):
    return cut

df_new['Available limit ratio']=replace_data(cut1,cut1_woe)
df_new['Overdue 30-59 Tianbi number']=replace_data(cut3,cut3_woe)
df_new['Number of transactions overdue for 90 days']=replace_data(cut7,cut7_woe)
df_new['Overdue 60-89 Tianbi number']=replace_data(cut9,cut9_woe)



print('Test results:{}'.format(clf.score(x_test,y_test)))

#Draw ROC curve and calculate AUC value
fpr,tpr,threshold = roc_curve(y_test, y_pred1)
roc_auc = auc(fpr,tpr)
         label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")

The LR model is converted to the corresponding score

In the Logistic regression model, the logarithm of probability occurrence ratio is expressed as a linear combination of characteristic variables, and the formula is as follows:

Logit(pi) is a logarithmic value, i.e. log(P(bad)/P(good)). Since the value of P(bad)/P(good) is between 0 and ∞, the value of log(P(bad)/P(good)) is between - ∞ and + ∞.
In order to make the score more "practical", the score of each attribute needs to be linearly proportional transformed, and then an offset is added. The score is proportional to the logarithm of the good/bad odds ratio used for Logistic Regression modeling, not the good/bad odds ratio itself. Therefore, the score can be negative, and the smaller the score, the higher the risk.
The score corresponding to each attribute can be calculated by the following formula: WOE multiply the regression coefficient of the variable, add the regression intercept, multiply the scale factor, and finally add the offset:
( w o e i ∗ β i + a n ) ∗ f a c t o r + o f f s e t n (woe_i* \beta_i+\frac{a}{n})*factor+\frac{offset}{n} (woei​∗βi​+na​)∗factor+noffset​
The score of the scorecard can be calculated as follows:

According to the above paper data:
Score = offset + factor * log(odds)
The scale factor and offset can be determined by the following industry rules: Good: bad = 20:1, the score scale is 600; Every 20 points increase or decrease in the score will double the ratio of good to bad.

In the conversion process from log (odds) to score, the previous derivation uses factor and offset as conversion parameters. However, in practical applications, these two parameters are difficult to explain, so it is not convenient to specify the initial value. More commonly, the following three parameters are specified in advance:
b: Base point
o: Odds corresponding to benchmark score (odds at base point)
p: When the odds is doubled, the score is increased (point double odds)
The conversion formula between factor, offset and b, o,p is:
f a c t o r = p l o g 2 factor=\frac{p}{log2} factor=log2p​
o f f s e t = b − p ∗ l o g o l o g 2 offset=b-p*\frac{log o}{log2} offset=b−p∗log2logo​
Whether it is factor,offset, or b, o, p, these parameters are only used to convert log (odds) into appropriate scores, which has nothing to do with the logistic regression itself. Different conversion parameters will get different scores, but will not change the probability.


# General industry rules. When the odds is 50, the score is 600
# When Odds is doubled, score+20
factor = 20/np.log(2)
offset = 600 - 20 * np.log(20)/np.log(2)

#Define variable score calculation function
def get_score(coe,woe,factor):
    for w in woe:
    return scores

#Calculate the score for each variable
x1 = get_score(coe[0][0], cut1_woe, factor)
x2 = get_score(coe[0][1], cut2_woe, factor)
x3 = get_score(coe[0][2], cut3_woe, factor)
x7 = get_score(coe[0][3], cut7_woe, factor)
x9 = get_score(coe[0][4], cut9_woe, factor)

#Print out the score corresponding to each feature
print("Score corresponding to available quota ratio:{}".format(x1))
print("Age corresponding score:{}".format(x2))
print("Overdue 30-59 Score corresponding to the number of Tianbi:{}".format(x3))
print("Score corresponding to the number of pen overdue for 90 days:{}".format(x7))
print("Overdue 60-89 Score corresponding to the number of Tianbi:{}".format(x9))

Tags: Python

Posted on Thu, 30 Sep 2021 16:01:02 -0400 by webmasternovis