Hands on data analysis | datawhale August | Task05: Data Modeling and model evaluation

Task 05: Data Modeling and model evaluation

According to different task requirements, we should consider what model to establish. We use the popular sklearn library to establish the model. We need to evaluate the quality of a model, and then we will evaluate our model and optimize the model.

1, Data preprocessing

We have the data set of the Titanic, so our purpose this time is to complete the task of predicting the survival of the Titanic.

1. Guide Package

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
# Set unlimited number of rows
pd.set_option('display.max_rows',None)
# Set unlimited number of columns
pd.set_option('display.max_columns',None)
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set output picture size

2. Missing value filling and one hot coding

  1. Missing value (Cabin) of category variable: fill in a missing value character (NA) and fill it with the most categories
  2. Missing value for continuous variable (Age): fill in mean, median and mode
  3. Coding classification variable
train['Embarked'].groupby(train['Embarked']).count()

# Missing values for categorical variables: the category with the most cabins is NA, and the category with the most embanked is S
train.Cabin = train['Cabin'].fillna('NA')
train.Embarked = train['Embarked'].fillna('S')
#Missing value of continuous variable: Age (filled with mean)
train.Age = train['Age'].fillna(train['Age'].mean())
# Check the proportion of missing values
train.isnull().sum()

#Bisection type variable: Sex, far
# Extract all input features
data = train[['Pclass','Sex','Age','SibSp','Parch','Fare', 'Embarked']]
data.head()

data = pd.get_dummies(data)
data.head()

#Save file
data.to_csv('clear_data_alone.csv')

Two, model building

1. Select model

  • After processing the previous data, we will get the modeling data. The next step is to select the appropriate model
  • Before model selection, we need to know whether the data set is finally supervised learning or unsupervised learning
  • On the one hand, the choice of model is determined by our task.
  • In addition to selecting the model according to our task, it can also be determined according to the sample size of data and the sparsity of features

2. Cut training set and test set

  1. The data set is divided into independent variables and dependent variables
  2. Cut the training set and test set in proportion (the proportion of general test set is 30%, 25%, 20%, 15% and 10%)
  3. Using stratified sampling
  4. Set random seeds so that the results can be reproduced
from sklearn import tree
from sklearn.model_selection import train_test_split
import pandas as pd
train = pd.read_csv('train.csv')
data = pd.read_csv('clear_data_alone.csv')
target = train['Survived']
target

#Cut dataset (stratified sampling)
# Random sampling is not used when the data distribution is uneven  
# Random = none
Xtrain,Xtest,Ytrain,Ytest = train_test_split(data,target,stratify = target,random_state=1)

3. Model creation

Here, the decision tree classification tree model is selected as an example

#Modeling Trilogy
#instantiation 
clf = tree.DecisionTreeClassifier(criterion='entropy',
                                  random_state=2,
                                  max_depth=4,
                                  splitter='random'
                                 )
#Training model
clf = clf.fit(Xtrain,Ytrain)
#Import test set
score = clf.score(Xtest,Ytest)                 #Prediction accuracy
score

#Forecast classification results / labels
pred = clf.predict(Xtest)
pred

#Predicted tag probability
proba = clf.predict_proba(Xtest)
proba[:10]

4. Optimization

#Draw learning curve
import matplotlib.pyplot as plt
test = []
for i in range(10):
    clf = tree.DecisionTreeClassifier(max_depth=i+1,
                                      random_state=2,
                                      criterion='entropy',
                                      splitter='random'
                                     )
    clf = clf.fit(Xtrain,Ytrain)
    score = clf.score(Xtest,Ytest)
    test.append(score)
print(max(test))
plt.plot(range(1,11),test,color = 'red',label = 'max_depth')
plt.xticks(range(1,11))  
plt.legend()
plt.show()

#Grid search
import numpy as np
from sklearn.model_selection import GridSearchCV


gini_thresholds = np.linspace(0,0.5,10)

#A series of parameters and the value range of the grid search we want corresponding to these parameters
parameters = {'criterion':('gini','entropy'),
              'splitter':('random','best'),
              'max_depth':[*range(1,10)],
              'min_samples_leaf':[*range(1,50,5)],
              'min_impurity_decrease':[*np.linspace(0,0.5,10)]
             }

clf = tree.DecisionTreeClassifier(random_state=2)
GS = GridSearchCV(clf,parameters,cv = 10)
GS = GS.fit(Xtrain,Ytrain)
#Return to best combination
In []:GS.best_params_
Out[]:
{'criterion': 'gini',
 'max_depth': 9,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 6,
 'splitter': 'best'}
#Returns the model evaluation criteria after grid search
In []:GS.best_score_
Out[]:
0.8159430122116689

summary

Here is only an example of simple modeling of decision tree, and more model effects need to be tested.
Hands on data analysis | datawhale - August.
end.
thank.

Tags: Python Machine Learning Decision Tree

Posted on Thu, 02 Sep 2021 16:37:40 -0400 by bradgrafelman