Data analysis learning

Model building

2.1 Feature engineering

Filling missing values

  • Categorical variables: fill with a placeholder string that marks the value as missing (such as 'NA'), or with the most frequent category
  • Continuous variables: fill with the mean, median or mode
# Fill categorical variables
train['Cabin'] = train['Cabin'].fillna('NA')
train['Embarked'] = train['Embarked'].fillna('S')

# Fill continuous variables
train['Age'] = train['Age'].fillna(train['Age'].mean())

# Count the missing values remaining in each column, largest first
train.isnull().sum().sort_values(ascending=False)
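
The other fill strategies listed above (most frequent category, median) can be sketched on a small made-up frame. The `toy` DataFrame and its values are illustrative assumptions, not rows from the actual training data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a categorical and a continuous column
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q"],
    "Age": [22.0, np.nan, 30.0, 40.0, np.nan],
})

# Categorical: fill with the most frequent category (the mode)
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

# Continuous: fill with the median, which is less sensitive to outliers than the mean
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# No missing values should remain
print(toy.isnull().sum())
```

Whether the mean or the median is the better choice depends on how skewed the column is, so it is worth looking at the distribution before deciding.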

Encoding categorical variables

# Extract all input features
data = train[['Pclass','Sex','Age','SibSp','Parch','Fare', 'Embarked']]

# Convert the categorical columns to one-hot (dummy) encoding
data = pd.get_dummies(data)
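
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` values are assumptions chosen for illustration). Numeric columns pass through unchanged, while each string column is expanded into one indicator column per category.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and two string columns
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

encoded = pd.get_dummies(toy)

# Pclass is kept as is; Sex becomes 2 columns and Embarked becomes 3
print(list(encoded.columns))
print(encoded.shape)
```

This is presumably why the seven selected input columns turn into ten features after encoding: `Sex` contributes two indicator columns and `Embarked` three.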

2.2 Model construction

  • After the preprocessing above, we have the data used for modeling. The next step is to select an appropriate model
  • Before selecting a model, we need to know whether the task is supervised or unsupervised learning
  • Besides the task itself, the choice of model can also depend on the sample size and on how sparse the features are
  • We usually start with a basic model as a baseline, then train other models for comparison, and finally choose the one with the better generalization ability or performance
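
The baseline-then-compare workflow from the last bullet can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), logistic regression is taken as the baseline, and a random forest is taken as the model to compare, since the post imports both.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the data set used in this post)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Baseline: a simple linear model
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Candidate: a more flexible model trained on the same split
candidate = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Compare on held-out data, not on the training set
baseline_acc = baseline.score(X_te, y_te)
candidate_acc = candidate.score(X_te, y_te)
print("Baseline test accuracy: {:.2f}".format(baseline_acc))
print("Candidate test accuracy: {:.2f}".format(candidate_acc))
```

The more complex model is only worth keeping if it beats the baseline on held-out data by a margin that matters for the task.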

Splitting the training set and test set

  • Split the data into a training set and a test set by proportion (typical test set proportions are 30%, 25%, 20%, 15% or 10%)
  • Stratify the split by the target variable so that both sets keep the same class proportions
  • Set a random seed so that the results can be reproduced
from sklearn.model_selection import train_test_split

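# Features and target. X and y are not defined in the original post; they are
# assumed here to be the encoded features and the 'Survived' column
X = data
y = train['Survived']

# Split the dataset (test_size is left at its default of 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)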

# View data shapes
X_train.shape, X_test.shape
((668, 10), (223, 10))
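
The effect of `stratify` can be checked on a small made-up example. The labels below are hypothetical (70% class 0, 30% class 1), and `test_size=0.2` is set explicitly; when `test_size` is omitted, as in the call above, scikit-learn holds out 25% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 samples of class 0, 30 of class 1
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# With stratification, the share of class 1 is the same in both parts
print("Share of class 1 in train: {:.2f}".format(y_tr.mean()))
print("Share of class 1 in test: {:.2f}".format(y_te.mean()))
```

Without `stratify`, a small test set can end up with noticeably more or fewer positive samples than the full data, which makes the test score harder to interpret.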

Model creation

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Logistic regression model with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Report accuracy on the training set and the test set
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))
Training set score: 0.80
Testing set score: 0.78

Outputting model predictions

# Predicted labels
pred = lr.predict(X_train)

# The result is an array of 0s and 1s
pred[:10]
array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1], dtype=int64)
# Predicted class probabilities (one column per class)
pred_proba = lr.predict_proba(X_train)

pred_proba[:10]
array([[0.62887291, 0.37112709],
       [0.14897206, 0.85102794],
       [0.47162003, 0.52837997],
       [0.20365672, 0.79634328],
       [0.86428125, 0.13571875],
       [0.9033887 , 0.0966113 ],
       [0.13829338, 0.86170662],
       [0.89516141, 0.10483859],
       [0.05735141, 0.94264859],
       [0.13593291, 0.86406709]])
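
How the two outputs relate can be sketched on synthetic data (the data set and the names `clf`, `proba` and `labels` are assumptions for illustration). Each row of `predict_proba` sums to 1, its columns follow the order of `classes_`, and taking the most probable class per row reproduces what `predict` returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the data set used in this post)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)   # shape (n_samples, 2)
labels = clf.predict(X_demo)        # shape (n_samples,)

# Each row is a probability distribution over the classes
row_sums = proba.sum(axis=1)

# Choosing the most probable class per row gives the predicted label
from_proba = clf.classes_[np.argmax(proba, axis=1)]
print((from_proba == labels).all())
```

For a binary problem this amounts to a threshold of 0.5 on the second column. Working with the probabilities directly makes it possible to choose a different threshold when false positives and false negatives have different costs.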

2.3 Model evaluation

  • The purpose of model evaluation is to estimate the generalization ability of the model.
  • Cross validation is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into training and test sets.
  • In cross validation, the data is split several times and one model is trained for each split.
  • The most commonly used form is k-fold cross validation, where k is a number specified by the user, usually 5 or 10.
  • Precision measures how many of the samples predicted as positive are actually positive
  • Recall measures how many of the actual positive samples are predicted as positive
  • The f-score is the harmonic mean of precision and recall

Cross validation

from sklearn.model_selection import cross_val_score

lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)

# k-fold cross validation score
scores
array([0.82352941, 0.79411765, 0.80597015, 0.80597015, 0.8358209 ,
       0.88059701, 0.72727273, 0.86363636, 0.75757576, 0.71212121])
# Average cross validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
Average cross-validation score: 0.80
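
What `cross_val_score` does internally can be approximated with an explicit loop. This is a sketch on synthetic data, not the library's actual implementation; it relies on the documented behaviour that an integer `cv` with a classifier uses stratified k-fold splitting without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; not the data set used in this post)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000)

# Manual k-fold: train on k-1 folds, score on the remaining fold, repeat k times
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The helper function should give the same per-fold scores
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(manual_scores)
print(auto_scores)
```

Every sample is used for validation exactly once, which is why the averaged score is less dependent on one lucky or unlucky split.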

Confusion matrix

  • Calculate the confusion matrix for the binary classification problem
  • Calculate precision, recall and f-score
from sklearn.metrics import confusion_matrix, classification_report

# Train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)

# Model prediction results
pred = lr.predict(X_train)

# Confusion matrix
confusion_matrix(y_train, pred)
array([[350,  62],
       [ 71, 185]], dtype=int64)
# Precision, recall and F1 score
print(classification_report(y_train, pred))
             precision    recall  f1-score   support

          0       0.83      0.85      0.84       412
          1       0.75      0.72      0.74       256

avg / total       0.80      0.80      0.80       668
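
The figures for class 1 in the report can be recomputed by hand from the confusion matrix shown above. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so the four cells are TN, FP, FN and TP in reading order.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([[350, 62],
               [71, 185]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of all predicted positives, how many are correct
recall = tp / (tp + fn)      # of all actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1-score:  {:.2f}".format(f1))
print("accuracy:  {:.2f}".format(accuracy))
```

The rounded values (0.75, 0.72, 0.74) match the row for class 1 in the classification report, and the accuracy of 0.80 matches the training set score reported earlier.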


Posted on Sat, 20 Nov 2021 07:51:05 -0500 by Topper