1, Summary of learning points
- Information gleaned from the competition description
- Reading the data
- Evaluation and calculation of classification indicators
- Evaluation and calculation of regression indicators
- Understanding of some terminology
2, Learning content:
1. New knowledge learned from the competition
a. Desensitization: masking private information, for example showing a mobile phone number as 186****7392.
b. Label encoding: converting categorical values into numeric form (a small sketch follows this list).
c. Anonymized features: data columns whose real-world meaning is not disclosed.
d. Evaluation metric: a measure of the gap between the model's predictions and the actual values (the specific metrics are written below).
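For item b, here is a minimal label-encoding sketch using sklearn's LabelEncoder (the category values are made-up examples, not from the competition data):

```python
# Minimal label-encoding sketch; the category values are made up
from sklearn.preprocessing import LabelEncoder

fuel_types = ['gasoline', 'diesel', 'gasoline', 'electric']
encoder = LabelEncoder()
encoded = encoder.fit_transform(fuel_types)
print(encoded)           # [2 0 2 1]: each category mapped to an integer
print(encoder.classes_)  # ['diesel' 'electric' 'gasoline']: the reverse mapping
```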
2. Lessons learned from the task code:
(1) From the code, I learned:
a. The head function in pandas displays the first rows of a DataFrame (five by default)
b. How to read data (see the sketch below)
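A minimal sketch of reading data with pandas (the file name 'train.csv' and the space separator are assumptions; adjust them to the actual competition files):

```python
import pandas as pd

# 'train.csv' and sep=' ' are assumptions; adjust to the actual data file
train_data = pd.read_csv('train.csv', sep=' ')
print(train_data.head())    # first 5 rows by default
print(train_data.head(10))  # pass n for a different number of rows
```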
3. Evaluation and calculation of classification indicators
1. Accuracy
```python
# Accuracy
from sklearn.metrics import accuracy_score

y_pred = [0, 1, 3, 4]
y_true = [0, 1, 4, 4]
print('ACC:', accuracy_score(y_true, y_pred, normalize=False))  # 3: the number of correct predictions
print('ACC:', accuracy_score(y_true, y_pred))                   # 0.75: the proportion of correct predictions
```
We use the accuracy_score function here, which is one evaluation method. Classification accuracy is the percentage of samples classified correctly.
`sklearn.metrics.accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)`
If normalize is False, the function returns the number of elements in y_pred that match y_true (the count of correct predictions). The default is True, in which case it returns the proportion of correct predictions.
- Prerequisite knowledge (a small confusion-matrix sketch follows this list):
I use T (true) for a correct prediction and F (false) for a wrong one,
and P for the positive class and N for the negative class.
TP: predicted positive, and the prediction is correct (actually positive)
FP: predicted positive, but the prediction is wrong (actually negative)
FN: predicted negative, but the prediction is wrong (actually positive)
TN: predicted negative, and the prediction is correct (actually negative)
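A minimal sketch that reads these four counts off sklearn's confusion matrix (the labels are the same made-up ones used in the precision example below):

```python
# Confusion-matrix sketch; same made-up labels as the precision example below
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0]
# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print('TP:', tp, 'FP:', fp, 'FN:', fn, 'TN:', tn)  # TP: 1 FP: 1 FN: 1 TN: 2
```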
- Precision
```python
## Precision
from sklearn import metrics

y_pred = [1, 1, 0, 0, 0]
y_true = [1, 0, 0, 1, 0]
print('Precision:', metrics.precision_score(y_true, y_pred))
'''
average is not specified here, so it takes the default value 'binary'.
1 is the default positive class, so TP (positive-class values predicted as
positive, i.e. as 1) is 1, and TP + FP (the total number predicted as 1)
is 2, hence P = TP / (TP + FP) = 0.5.
'''
```
The precision_score function is used here. Its parameter average defaults to 'binary', which requires y_true and y_pred to contain only 0 and 1 (i.e. binary labels). This also involves another parameter, pos_label: it specifies which value is treated as the positive class, and it defaults to 1, so 1 is the positive class by default.
- The other values of average, such as the commonly used 'macro' and 'weighted', are likewise computed from the per-class precision P (see the sketch below).
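A minimal multi-class sketch (the labels are made up): 'macro' averages the per-class precisions equally, while 'weighted' weights each class by its number of true samples:

```python
# Multi-class precision sketch; the labels are made up
from sklearn import metrics

y_true = [0, 1, 2, 2, 2, 0]
y_pred = [0, 1, 2, 2, 1, 1]
# Per-class precision: class 0 -> 1.0, class 1 -> 1/3, class 2 -> 1.0
print('macro:', metrics.precision_score(y_true, y_pred, average='macro'))        # (1 + 1/3 + 1) / 3
print('weighted:', metrics.precision_score(y_true, y_pred, average='weighted'))  # weighted by class support
```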
- Recall
As above, average takes its default value 'binary' here.
```python
# Recall (uses the y_true and y_pred from the precision example above)
print('Recall:', metrics.recall_score(y_true, y_pred))
'''
Analysis: first find TP, which is 1. Then find TP + FN = 2 (FN means
predicted negative but actually positive, i.e. predicted as 0 while the
true value is 1). Therefore recall = TP / (TP + FN) = 0.5.
'''
```
- F1 score
```python
# F1-score
print('F1-score:', metrics.f1_score(y_true, y_pred))
# Computed as (2 * P * R) / (P + R), where P and R are the precision and recall above
```
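As a quick check of the formula, F1 can be computed by hand from the precision and recall above (same made-up data as the previous examples):

```python
# Hand check of the F1 formula; same data as the precision/recall examples
from sklearn import metrics

y_pred = [1, 1, 0, 0, 0]
y_true = [1, 0, 0, 1, 0]
p = metrics.precision_score(y_true, y_pred)  # 0.5
r = metrics.recall_score(y_true, y_pred)     # 0.5
print('F1 by hand:', 2 * p * r / (p + r))    # 0.5, matches f1_score
```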
4. Evaluation and calculation of regression indicators
(the relevant definitions and explanations are written in my notes)
```python
# coding=utf-8
import numpy as np
from sklearn import metrics

# MAPE is not in the sklearn library, so it has to be implemented by hand
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# MSE: mean squared error
print('MSE:', metrics.mean_squared_error(y_true, y_pred))
# RMSE: root mean squared error
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE: mean absolute error
print('MAE:', metrics.mean_absolute_error(y_true, y_pred))
# MAPE: mean absolute percentage error
print('MAPE:', mape(y_true, y_pred))
```
Besides these, there is also a metric for goodness of fit:
```python
## R2 score: the coefficient of determination (goodness of fit); the closer to 1, the better
from sklearn.metrics import r2_score

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:', r2_score(y_true, y_pred))
```
The closer R2 is to 1, the better the model fits the data!
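As a check on the definition, R2 = 1 - SS_res / SS_tot can also be computed by hand (using the same arrays as above):

```python
# Hand computation of R2 = 1 - SS_res / SS_tot; same data as above
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print('R2 by hand:', 1 - ss_res / ss_tot)       # matches r2_score
```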
3, Learning questions and answers
- When I encounter unfamiliar functions, I first look up the information myself, from CSDN and the official documentation, and then solve the problem.
4, Learning, thinking and summary
- I have just started learning, so it is normal to run into things I cannot do yet. I hope to keep a steady mindset, find my weak points, and make up for them.
- I try to keep my daily study time at about 2 hours: first, to sustain long-term focus; second, to keep some pressure on myself and not slack off in my studies.
- Learn while doing and review in time.