[Hands-on Data Analysis] Task05 - Model Building and Evaluation

Basic process of modeling and evaluation:

0. Feature Engineering

Import data:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image

plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set output picture size

# Read training set
train = pd.read_csv('train.csv')

After reading the data set, some preprocessing is needed so that the data is ready for the subsequent model building and training.

1. Fill in missing values: .fillna()

  • Continuous variables: mean, median, or mode
  • Categorical variables: a separate NA category, or the most frequent category
# Check the proportion of missing values
train.isnull().sum().sort_values(ascending=False)
Embarked       0
Cabin          0
Fare           0
Ticket         0
Parch          0
SibSp          0
Age            0
Sex            0
Name           0
Pclass         0
Survived       0
PassengerId    0
dtype: int64
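The training set above happens to have no missing values, but as a minimal sketch of the two filling strategies listed earlier (using a hypothetical toy frame, not the Titanic data):

```python
import pandas as pd
import numpy as np

# Toy frame with missing values (illustrative, not the Titanic data)
df = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, np.nan, 26.0],    # continuous
    'Embarked': ['S', 'C', None, 'S', None],      # categorical
})

# Continuous variable: fill with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Categorical variable: fill with the most frequent category (mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

print(df.isnull().sum().sum())  # 0 missing values remain
```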

2. Encode categorical variables: pandas.get_dummies()

For example, Sex has only two possible categories, so it expands into two indicator columns.

# Extract all input features
data = train[['Pclass','Sex','Age','SibSp','Parch','Fare', 'Embarked']]
# Convert categorical columns to dummy variables
data = pd.get_dummies(data)
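To see what the conversion does, a small sketch on a hypothetical two-column frame: numeric columns pass through unchanged, while each categorical column becomes one indicator column per category.

```python
import pandas as pd

# Toy example: one numeric and one categorical column
toy = pd.DataFrame({'Pclass': [1, 3], 'Sex': ['male', 'female']})

dummies = pd.get_dummies(toy)
# Numeric 'Pclass' passes through; 'Sex' becomes 'Sex_female' and 'Sex_male'
print(list(dummies.columns))  # ['Pclass', 'Sex_female', 'Sex_male']
```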

1. Model Building

After processing the data, the next step is to build a model; before modeling, an appropriate model must be chosen.

  • First determine the type of learning: supervised or unsupervised
  • Selection basis: the task, the sample size, and the sparsity of the features
  • Steps: first try a basic model as a baseline, then compare it with other models, and finally select the model with the better generalization ability or performance.

1.1 Splitting into training and test sets

  • Objective: to make it possible to evaluate the generalization ability of the model later
  • Splitting methods:
    • Proportional splitting: the test set is generally 30%, 25%, 15% or 10% of the data
    • Stratified splitting according to the target variable
    • Setting a random seed so the results are reproducible
  • Function for splitting data in sklearn: train_test_split()

Random shuffling can be skipped when splitting only if the data set has already been shuffled or the sample size is large enough.

from sklearn.model_selection import train_test_split

# X and y are usually extracted before splitting; the unsplit X and y are sometimes needed later, so keep them
X = data
y = train['Survived']

# Cut the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# View training set test set size
X_train.shape, X_test.shape
# ((668, 10), (223, 10))
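The stratify=y argument is what makes the split proportional to the target variable: both halves keep the same class ratio. A minimal sketch on synthetic labels (the data here is hypothetical; train_test_split defaults to a 25% test set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced binary labels: 80 zeros, 20 ones
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Stratified splitting preserves the class ratio in both halves
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```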

1.2 Model creation

Model category

  • Classification models based on linear models (sklearn.linear_model): logistic regression (logistic regression is a classification model; linear regression is a regression model)
  • Classification models based on trees (sklearn.ensemble): decision tree and random forest (a random forest is an ensemble of decision trees that reduces the overfitting of a single decision tree)

For a proof of why linear models can be used for binary classification, see: Machine learning notes - Classification using linear models

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Default parameter logistic regression model
lr = LogisticRegression()
lr.fit(X_train, y_train)
# View training set and test set score values
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))
# Logistic regression model after adjusting parameters
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))


The score on the test set increased after adjusting the parameters.
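The parameter C in LogisticRegression is the inverse of the regularization strength: a larger C weakens the L2 penalty and lets the coefficients grow larger. A minimal sketch on synthetic data (the data here is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)

# C is the inverse regularization strength: larger C -> weaker penalty,
# so the fitted coefficients are allowed to be larger
w_small_C = np.abs(LogisticRegression(C=0.01).fit(X, y).coef_).max()
w_large_C = np.abs(LogisticRegression(C=100).fit(X, y).coef_).max()

print(w_small_C < w_large_C)  # True
```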

# Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))
# Random forest classification model with adjusted parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))

1.3 Outputting model predictions

In sklearn, a supervised model generally provides predict to output the predicted labels and predict_proba to output the predicted label probabilities.

# Predicted labels
pred = lr.predict(X_train)
# Predicted label probabilities
pred_proba = lr.predict_proba(X_train)
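The two outputs are consistent with each other: predict_proba returns one column per class, and predict picks the class whose probability column is largest. A minimal sketch on a tiny hypothetical data set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)   # shape (n_samples, 2): columns are P(class 0), P(class 1)
labels = clf.predict(X)        # the class with the higher probability

# predict agrees with the argmax over the probability columns
print((labels == proba.argmax(axis=1)).all())  # True
```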

2. Model Evaluation

  • Objective: to measure the generalization ability of the model
  • Method: cross-validation
  • Precision measures how many of the samples predicted as positive are truly positive
  • Recall measures how many of the truly positive samples are predicted as positive (TP)
  • The F1-score is the harmonic mean of precision and recall
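The three definitions above can be written directly in terms of the counts of true positives (TP), false positives (FP) and false negatives (FN); the counts below are hypothetical, for illustration only:

```python
# Illustrative counts from a hypothetical binary classifier
TP, FP, FN = 80, 20, 10

precision = TP / (TP + FP)   # of the predicted positives, how many are truly positive
recall = TP / (TP + FN)      # of the truly positive samples, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.8 0.89 0.84
```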

2.1 Cross-validation

Module in sklearn: sklearn.model_selection

from sklearn.model_selection import cross_val_score

# Evaluate the logistic regression model with 10-fold cross-validation
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)

# k-fold cross validation score
scores

# Average cross validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))


The larger k is, the more time k-fold cross-validation takes, but the average error is a better estimate of the generalization error, so the result is more reliable.

2.2 Confusion matrix

The commonly used evaluation metrics for binary classification are precision and recall; the overall metric for a classifier is classification accuracy.

A classifier's prediction on the test set can be correct or incorrect in four ways: true positive (TP), false positive (FP), true negative (TN), and false negative (FN):

  • Module in sklearn: sklearn.metrics
  • The confusion matrix requires the input of real labels and prediction labels
from sklearn.metrics import confusion_matrix

# Training model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)

# Model prediction results
pred = lr.predict(X_train)
# Confusion matrix
confusion_matrix(y_train, pred)

from sklearn.metrics import classification_report
# Accuracy, recall and F1 score
print(classification_report(y_train, pred))
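For a binary problem, the four cells of the 2x2 confusion matrix can be unpacked directly; sklearn orders them as [[TN, FP], [FN, TP]]. A minimal sketch on hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# sklearn orders the 2x2 matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```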

2.3 ROC curve

  • The module of ROC curve in sklearn is sklearn.metrics
  • The larger the area under the ROC curve (AUC), the better
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to zero (the default decision boundary of decision_function)
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
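The area under the curve can also be computed as a single number with roc_auc_score, which takes the true labels and the classifier's scores; the toy labels and scores below are hypothetical:

```python
from sklearn.metrics import roc_auc_score

# Perfectly separable toy scores: every positive scores above every negative
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.9]

print(roc_auc_score(y_true, scores))  # 1.0
```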

Tags: Python Machine Learning Data Analysis numpy sklearn

Posted on Thu, 23 Sep 2021 08:10:50 -0400 by jOE :D