Model building
2.1 Feature engineering
Filling missing values
- Missing values in categorical variables: fill with a placeholder category such as 'NA', or with the most frequent category
- Missing values in continuous variables: fill with the mean, median, or mode
# Fill categorical variables
train['Cabin'] = train['Cabin'].fillna('NA')
train['Embarked'] = train['Embarked'].fillna('S')
# Fill continuous variables
train['Age'] = train['Age'].fillna(train['Age'].mean())
# Count the missing values remaining in each column
train.isnull().sum().sort_values(ascending=False)
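The snippet above hard-codes 'S' for Embarked and uses the mean for Age. As a minimal sketch, the most frequent category and the median can instead be computed from the data. The toy DataFrame below is made up for illustration and is not the real training set.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), standing in for the real training set
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", None, "Q"],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
})

# Categorical variable: fill with the most frequent category
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

# Continuous variable: fill with the median, which is less
# sensitive to outliers than the mean
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

print(toy)
```

Computing the fill values from the data avoids baking in a constant that may no longer be the most frequent category if the data changes.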
Encoding categorical variables
# Extract all input features
data = train[['Pclass','Sex','Age','SibSp','Parch','Fare', 'Embarked']]
# Convert the categorical variables to one-hot (dummy) encoding
data = pd.get_dummies(data)
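To see what pd.get_dummies does to a frame that mixes numeric and categorical columns, here is a small sketch on made-up rows (the values are hypothetical; only the column names follow the data above).

```python
import pandas as pd

# Toy data (hypothetical values) with one numeric and two categorical columns
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Only string/category columns are expanded; numeric columns pass through unchanged
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Note that Pclass is stored as a number, so it is left as a single column. Applied to the seven input features above, the same logic gives five numeric columns plus two for Sex and three for Embarked, which is consistent with the ten columns seen in the data shapes later on.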

2.2 Model construction
- After the preprocessing above, we have data that is ready for modeling. The next step is to select an appropriate model
- Before selecting a model, we need to know whether the task is supervised or unsupervised learning
- Besides the task itself, the choice of model can also depend on the sample size and the sparsity of the features
- We usually start with a simple model as a baseline, then train other models for comparison, and finally choose the one with the better generalization performance
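The baseline-then-compare workflow can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (make_classification), and the DummyClassifier used as a trivial reference point is an addition of this sketch, not a model used in the text.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption, not the Titanic data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Candidate models, from a trivial reference point to more flexible ones
models = {
    "baseline (most frequent class)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training set and record its test set accuracy
test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_scores[name] = model.score(X_test, y_test)
    print("{}: {:.2f}".format(name, test_scores[name]))
```

A model is only worth keeping if it clearly beats the simpler ones on data it has not seen.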
Splitting the training set and test set
- Split the data into a training set and a test set by proportion (common test set proportions are 30%, 25%, 20%, 15% and 10%)
- Stratify the split by the target variable so that both sets keep the same class proportions
- Set a random seed so that the results can be reproduced
from sklearn.model_selection import train_test_split
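# Features and target (assuming the label column is named 'Survived')
X = data
y = train['Survived']
# Split the dataset, stratified by the target (test_size defaults to 0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)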
# View data shapes
X_train.shape, X_test.shape
((668, 10), (223, 10))
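The call above does not pass test_size, so train_test_split falls back to its default of 0.25, which is consistent with the 668/223 split shown. The sketch below uses hypothetical imbalanced labels to show an explicit test_size and the effect of stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives and 30 positives
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

# test_size=0.2 keeps 20% for testing; stratify=y preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Share of positives in each part
print(y_train.mean(), y_test.mean())
```

Both parts keep the 30% share of positives, which matters most when one class is rare.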
Model creation
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Logistic regression model with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Show the accuracy on the training set and the test set
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))
Training set score: 0.80
Testing set score: 0.78
Output model prediction results
# Predict labels
pred = lr.predict(X_train)
# The result is an array of 0s and 1s
pred[:10]
array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1], dtype=int64)
# Predicted class probabilities
pred_proba = lr.predict_proba(X_train)
pred_proba[:10]
array([[0.62887291, 0.37112709],
[0.14897206, 0.85102794],
[0.47162003, 0.52837997],
[0.20365672, 0.79634328],
[0.86428125, 0.13571875],
[0.9033887 , 0.0966113 ],
[0.13829338, 0.86170662],
[0.89516141, 0.10483859],
[0.05735141, 0.94264859],
[0.13593291, 0.86406709]])
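The two outputs above are related: each row of predict_proba holds the probabilities of class 0 and class 1, and predict returns the more probable class. A short sketch using the first five rows shown above (the 0.8 threshold is an arbitrary value chosen for illustration):

```python
import numpy as np

# First rows of the predict_proba output shown above: [P(class 0), P(class 1)]
pred_proba = np.array([
    [0.62887291, 0.37112709],
    [0.14897206, 0.85102794],
    [0.47162003, 0.52837997],
    [0.20365672, 0.79634328],
    [0.86428125, 0.13571875],
])

# Each row sums to 1
row_sums = pred_proba.sum(axis=1)

# Taking the most probable class reproduces the labels from predict
labels = pred_proba.argmax(axis=1)

# A stricter threshold on P(class 1) labels fewer samples as positive
strict_labels = (pred_proba[:, 1] >= 0.8).astype(int)

print(labels, strict_labels)
```

The argmax index equals the label here only because the classes are 0 and 1; in general the column order follows the model's classes_ attribute.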
2.3 Model evaluation
- Model evaluation estimates the generalization ability of the model.
- Cross-validation is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into a training set and a test set.
- In cross-validation, the data is split several times and a separate model is trained for each split.
- The most commonly used form is k-fold cross-validation, where k is specified by the user, usually 5 or 10.
- Precision measures how many of the samples predicted as positive are actually positive
- Recall measures how many of the actual positive samples are predicted as positive
- The f-score is the harmonic mean of precision and recall
Cross validation
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)
# k-fold cross validation score
scores
array([0.82352941, 0.79411765, 0.80597015, 0.80597015, 0.8358209 ,
0.88059701, 0.72727273, 0.86363636, 0.75757576, 0.71212121])
# Average cross validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
Average cross-validation score: 0.80
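To make the splitting concrete, here is a minimal pure-Python sketch of how k-fold cross-validation partitions sample indices. The helper k_fold_indices is hypothetical and written for illustration; it uses plain consecutive folds, whereas cross_val_score uses stratified folds for classifiers, so the actual fold sizes can differ slightly.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        # Each fold serves as the test set once; the rest is used for training
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# 668 training samples split into 10 folds, as in the call above
folds = list(k_fold_indices(668, 10))
for i, (train_idx, test_idx) in enumerate(folds):
    print("fold {}: train={}, test={}".format(i, len(train_idx), len(test_idx)))
```

Every sample appears in exactly one test fold, so each of the ten scores is computed on data the corresponding model did not see during training.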
Confusion matrix
- Compute the confusion matrix for a binary classification problem
- Compute the precision, recall and f-score
from sklearn.metrics import confusion_matrix, classification_report
# Train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
# Model predictions (on the training set here, so the numbers below do not measure generalization)
pred = lr.predict(X_train)
# Confusion matrix
confusion_matrix(y_train, pred)
array([[350, 62],
[ 71, 185]], dtype=int64)
# Precision, recall and F1 score
print(classification_report(y_train, pred))
             precision    recall  f1-score   support
          0       0.83      0.85      0.84       412
          1       0.75      0.72      0.74       256
avg / total       0.80      0.80      0.80       668
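The row for class 1 in the report can be recomputed by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix from above: rows are true classes, columns are predicted classes
tn, fp = 350, 62
fn, tp = 71, 185

# Of the samples predicted positive, how many are actually positive
precision = tp / (tp + fp)
# Of the actual positive samples, how many are predicted positive
recall = tp / (tp + fn)
# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Share of all samples that are classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision: {:.2f}".format(precision))
print("recall: {:.2f}".format(recall))
print("f1: {:.2f}".format(f1))
print("accuracy: {:.2f}".format(accuracy))
```

The first three values round to 0.75, 0.72 and 0.74, matching the class 1 row of the report.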