# Model building

## 2.1 feature engineering

### Filling missing values

- Missing values in categorical variables: fill with a placeholder category (such as 'NA') or with the most frequent category
- Missing values in continuous variables: fill with the mean, median, or mode

These fills can be applied with pandas:

    # Fill categorical variables: a placeholder for Cabin, a fixed category for Embarked
    train['Cabin'] = train['Cabin'].fillna('NA')
    train['Embarked'] = train['Embarked'].fillna('S')

    # Fill continuous variables with the column mean
    train['Age'] = train['Age'].fillna(train['Age'].mean())

    # Count the remaining missing values per column
    train.isnull().sum().sort_values(ascending=False)
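
The bullets above also mention filling with the most frequent category and with the median, which the Titanic code does not show. A minimal sketch of both, assuming a small made-up frame (`toy` and its values are hypothetical, not taken from the data set):

```python
import pandas as pd

# Hypothetical toy data, not the Titanic set used above
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q"],
    "Age": [22.0, None, 30.0, 41.0, None],
})

# Categorical column: fill with the most frequent category (the mode)
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

# Continuous column: fill with the median of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

The median is often preferred over the mean when a column has outliers, since a few extreme values shift it less.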

### Encoding categorical variables

    # Extract all input features
    data = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]

    # Convert the categorical columns to one-hot (dummy) variables
    data = pd.get_dummies(data)
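
To see what `pd.get_dummies` produces, here is a small sketch on a made-up frame (the `toy` rows are hypothetical). Numeric columns pass through unchanged, while each text column is replaced by one indicator column per category:

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column is created per category of each text column
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

Applied to the seven Titanic features above, the same expansion (two `Sex` columns, three `Embarked` columns, five numeric columns) would account for the ten feature columns used in the rest of this section.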

## 2.2 model construction

- After the preprocessing above, the data is ready for modeling. The next step is to select an appropriate model
- Before selecting a model, we need to know whether the task is supervised or unsupervised learning
- Besides the task itself, the choice of model can also depend on the sample size and the sparsity of the features
- We usually start with a simple model as a baseline, then train other models for comparison, and finally choose the model with the best generalization ability or performance
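
The baseline-then-compare workflow in the last bullet can be sketched as follows. This is only an illustration on synthetic data: `make_classification` stands in for a real data set, and `DummyClassifier` (which always predicts the most frequent class) is used here as an even simpler reference point than the logistic regression this section goes on to use.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Candidate model to compare against the baseline
candidate = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_score = baseline.score(X_te, y_te)
candidate_score = candidate.score(X_te, y_te)
print("Baseline test accuracy:  {:.2f}".format(baseline_score))
print("Candidate test accuracy: {:.2f}".format(candidate_score))
```

A model that cannot clearly beat such a baseline on held-out data is probably not learning anything useful from the features.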

### Splitting the training set and test set

- Split the data into a training set and a test set by proportion (the test set is typically 30%, 25%, 20%, 15%, or 10% of the data)
- Stratify the split by the target variable so that both sets keep the same class proportions
- Set a random seed so that the results can be reproduced

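A stratified split with a fixed random seed:

    from sklearn.model_selection import train_test_split

    # X and y are not defined in the original; this assumes X is the encoded
    # feature table and y is the Titanic target column 'Survived'
    X = data
    y = train['Survived']

    # Split the dataset; with no test_size given, 25% of the rows go to the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # View the data shapes
    X_train.shape, X_test.shape

Output:

    ((668, 10), (223, 10))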
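
To illustrate what stratifying by the target does, the sketch below splits a made-up data set with a 60/40 class balance and checks that both parts keep roughly that balance. The arrays `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 60 of class 0 and 40 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# stratify keeps the class proportions, random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

Running the split again with the same `random_state` returns the same rows in each part, which is what makes the later scores reproducible.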

### Model creation

    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # Logistic regression model with default parameters
    lr = LogisticRegression()
    lr.fit(X_train, y_train)

    # View the training set and test set scores
    print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
    print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

Output:

    Training set score: 0.80
    Testing set score: 0.78

### Output model prediction results

    # Predict labels
    pred = lr.predict(X_train)
    # The result is an array of 0s and 1s
    pred[:10]

Output:

    array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1], dtype=int64)

The model can also return the probability of each class:

    # Predicted class probabilities
    pred_proba = lr.predict_proba(X_train)
    pred_proba[:10]

Output:

    array([[0.62887291, 0.37112709],
           [0.14897206, 0.85102794],
           [0.47162003, 0.52837997],
           [0.20365672, 0.79634328],
           [0.86428125, 0.13571875],
           [0.9033887 , 0.0966113 ],
           [0.13829338, 0.86170662],
           [0.89516141, 0.10483859],
           [0.05735141, 0.94264859],
           [0.13593291, 0.86406709]])
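
The two outputs are related: each row of `pred_proba` sums to 1, and for this binary model the predicted label is the class with the larger probability. The sketch below checks this on the first five rows printed above, assuming column 0 corresponds to class 0 and column 1 to class 1. The 0.7 threshold at the end is an arbitrary value chosen for illustration.

```python
import numpy as np

# First five rows of the pred_proba output shown above
proba = np.array([
    [0.62887291, 0.37112709],
    [0.14897206, 0.85102794],
    [0.47162003, 0.52837997],
    [0.20365672, 0.79634328],
    [0.86428125, 0.13571875],
])

# Each row is a probability distribution over the two classes
row_sums = proba.sum(axis=1)

# Picking the more probable class reproduces the first five predicted labels
labels = proba.argmax(axis=1)
print(labels)

# A stricter, hypothetical threshold: predict 1 only if P(class 1) >= 0.7
strict_labels = (proba[:, 1] >= 0.7).astype(int)
print(strict_labels)
```

Working from probabilities rather than labels makes it possible to trade precision against recall by moving the threshold.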

## 2.3 model evaluation

- The purpose of model evaluation is to estimate the generalization ability of the model.
- Cross validation is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into training set and test set.
- In cross validation, the data is split several times and a separate model is trained for each split.
- The most commonly used form is k-fold cross validation, where k is a number chosen by the user, usually 5 or 10.
- Precision measures how many of the samples predicted to be positive are actually positive
- Recall measures how many of the actual positive samples are predicted to be positive
- F-score is the harmonic mean of precision and recall
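
The k-fold procedure described above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper `k_fold_indices` is hypothetical, and it neither shuffles nor stratifies the data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Each sample lands in the test fold exactly once across the k splits
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

One model is trained per fold on `train_idx` and scored on `test_idx`, which gives k scores instead of the single score from one train/test split.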

### Cross validation

    from sklearn.model_selection import cross_val_score

    lr = LogisticRegression(C=100)
    scores = cross_val_score(lr, X_train, y_train, cv=10)

    # k-fold cross validation scores
    scores

Output:

    array([0.82352941, 0.79411765, 0.80597015, 0.80597015, 0.8358209 ,
           0.88059701, 0.72727273, 0.86363636, 0.75757576, 0.71212121])

The mean of the fold scores gives a single summary value:

    # Average cross validation score
    print("Average cross-validation score: {:.2f}".format(scores.mean()))

Output:

    Average cross-validation score: 0.80
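
The average hides how much the individual folds differ. As a small sketch, the spread of the ten fold scores printed above can be summarized with the standard library (the values are copied from that output):

```python
from statistics import mean, stdev

# The ten fold scores printed above
scores = [0.82352941, 0.79411765, 0.80597015, 0.80597015, 0.8358209,
          0.88059701, 0.72727273, 0.86363636, 0.75757576, 0.71212121]

avg = mean(scores)
spread = stdev(scores)  # sample standard deviation across folds

print("Mean: {:.2f}".format(avg))
print("Std:  {:.2f}".format(spread))
print("Min:  {:.2f}, Max: {:.2f}".format(min(scores), max(scores)))
```

A wide gap between the best and worst fold suggests that the score from any single train/test split depends noticeably on which rows end up in the test set.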

### Confusion matrix

- Calculate the confusion matrix of a binary classification problem
- Calculate the precision, recall, and f-score

Both can be computed with `sklearn.metrics`:

    from sklearn.metrics import confusion_matrix, classification_report

    # Train the model
    lr = LogisticRegression(C=100)
    lr.fit(X_train, y_train)

    # Model predictions
    pred = lr.predict(X_train)

    # Confusion matrix (rows: actual class, columns: predicted class)
    confusion_matrix(y_train, pred)

Output:

    array([[350,  62],
           [ 71, 185]], dtype=int64)

The per-class metrics come from the classification report:

    # Precision, recall and F1 score
    print(classification_report(y_train, pred))

Output:

                 precision    recall  f1-score   support

              0       0.83      0.85      0.84       412
              1       0.75      0.72      0.74       256

    avg / total       0.80      0.80      0.80       668
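
The class 1 row of the report can be reproduced by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes, so the four cells are TN, FP, FN, and TP in reading order.

```python
# Counts from the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 350, 62
fn, tp = 71, 185

# Precision: share of predicted positives that are actually positive
precision = tp / (tp + fp)

# Recall: share of actual positives that are predicted positive
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: share of all samples that are classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1-score:  {:.2f}".format(f1))
print("accuracy:  {:.2f}".format(accuracy))
```

Note that these figures are computed on the training set, as in the code above, so they say little about generalization until the same calculation is repeated on held-out data.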