Cancer classification with logistic regression: predicting benign / malignant breast tumors and drawing the ROC curve

logistic regression

Logistic regression is a classification model in machine learning: despite the word "regression" in its name, it is a classification algorithm. Because it is simple and efficient, it is widely used in practice, and it is a powerful tool for binary classification problems.
The input of logistic regression is the output of a linear regression.
sigmoid function

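The linear regression output z is passed through the sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)).
Output: a probability value in the interval [0, 1]. The default threshold is 0.5: a sample whose probability exceeds it is assigned to the positive class, otherwise to the negative class.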
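The mapping from a linear score to a class can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation; the function names `sigmoid` and `classify` are hypothetical, and the 0.5 threshold is the default mentioned above.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 (positive class) if the probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


# z stands for the linear regression output w.x + b of one sample.
for z in (-2.0, 0.5, 2.0):
    print(z, round(sigmoid(z), 4), classify(z))
```

Raising the threshold makes the classifier more conservative about predicting the positive class, which is the trade-off the ROC curve later in this post visualizes.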
API

sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C = 1.0)

solver: the algorithm used to optimize the problem. Options: {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}.

Default: 'liblinear' in scikit-learn versions before 0.22; from 0.22 onward the default is 'lbfgs'.
For small datasets, 'liblinear' is a good choice, while 'sag' and 'saga' are faster for large datasets.

For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can handle the multinomial loss; 'liblinear' is limited to one-versus-rest schemes.

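penalty: the type of regularization ('l2' by default)

C: inverse of the regularization strength; a positive float, where smaller values mean stronger regularization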
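To see what C does in practice, the sketch below fits the same model with three values of C and prints the size of the learned coefficients. The data comes from `make_classification`, which is an assumption made purely for illustration (it is not the cancer dataset), and the chosen C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data: an assumption for illustration, not the cancer dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # The L2 penalty is the default, so only solver and C are set here.
    model = LogisticRegression(solver="liblinear", C=C)
    model.fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

A small C shrinks the coefficients towards zero (strong regularization), while a large C lets them grow to fit the training data more closely.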

Case: cancer classification - predicting benign / malignant breast tumors
Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/

import ssl

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.model_selection import train_test_split  # data splitting
from sklearn.preprocessing import StandardScaler  # standardization

# Workaround: skip HTTPS certificate verification for the download (insecure; use only if the download fails otherwise)
ssl._create_default_https_context = ssl._create_unverified_context

# 1. Get the data
# 1.1 set the column names
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
# 1.2 read the data
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=names)

# 2. Prepare the dataset
# 2.1 handle missing values (marked "?" in the raw file)
data = data.replace(to_replace="?", value=np.nan)
data = data.dropna()
# 2.2 set the feature values and the target value
c_data = data.iloc[:, 1:10]  # rows first, then columns: all rows, columns 1 to 9 (skips the sample code and Class)
c_target = data["Class"]  # 2 = benign, 4 = malignant
# 2.3 split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(c_data, c_target, random_state=22)

# 3. Feature engineering - standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)  # reuse the mean and standard deviation of the training set

# 4. Machine learning - logistic regression
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

# 5. Model evaluation
y_predict = estimator.predict(x_test)
print(estimator.score(x_test, y_test))  # accuracy on the test set

result:

0.9766081871345029

Classification evaluation method

In a classification task, the predicted condition and the true condition can combine in four different ways: true positive (TP), false positive (FP), false negative (FN) and true negative (TN). Together they form the confusion matrix (the idea also extends to multiclass classification).

Common evaluation indicators:

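Accuracy: proportion of all predictions that are correct, (TP+TN) / (TP+TN+FN+FP)
Precision: proportion of predicted positives that are truly positive, TP / (TP+FP)
Recall: proportion of actual positives that are found, TP / (TP+FN)
F1 score: harmonic mean of precision and recall, 2 * Precision * Recall / (Precision + Recall); reflects the robustness of the model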
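The four indicators follow directly from the confusion-matrix counts. The sketch below computes them in plain Python; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the cancer model.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a cancer screen, recall is usually the number to watch: a false negative (a missed malignant tumor) is far more costly than a false positive.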

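ROC curve and AUC index
ROC curve: plot the true positive rate, TPR = TP / (TP+FN), against the false positive rate, FPR = FP / (FP+TN), as the decision threshold varies; the curve is then summarized by a single indicator, the AUC
AUC: area under the ROC curve; the closer to 1, the better the model
The closer to 0.5, the closer the model is to random guessing; values below 0.5 mean the model ranks samples worse than random (its predictions are systematically inverted)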
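The construction of the curve can be sketched without any library: sweep the threshold over the observed scores, record one (FPR, TPR) point per threshold, and integrate with the trapezoid rule. The helper names `roc_points` and `auc_trapezoid` and the four-sample toy data are assumptions for illustration; in practice `sklearn.metrics.roc_curve` and `auc` do this work.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels (0 = negative, 1 = positive) and scores, made up for the example.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, y_score)
print(points)
print("AUC:", auc_trapezoid(points))
```

Each point on the curve corresponds to one choice of threshold, so the AUC measures ranking quality independently of any single threshold.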

Drawing of ROC curve

API

AUC calculation API
from sklearn.metrics import roc_auc_score
sklearn.metrics.roc_auc_score(y_true, y_score)
Calculates the area under the ROC curve, i.e. the AUC value.
y_true: the true category of each sample, which must be marked with 0 (negative example) or 1 (positive example)
y_score: the predicted score, which can be the estimated probability of the positive class, a confidence value, or the return value of the classifier's decision function

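from sklearn.metrics import roc_auc_score

# Map the classes to 0 / 1: 2 (benign) becomes 0, 4 (malignant) becomes 1
y_test = np.where(y_test > 2.5, 1, 0)
# y_predict holds hard class labels (2 or 4) from estimator.predict, not probabilities
print("AUC index:", roc_auc_score(y_test, y_predict))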

result:

AUC index: 0.97432432432432432433
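The AUC above is computed from hard class predictions. Because y_score may also be a probability, the sketch below compares the two inputs on a small made-up example; the labels, scores and the 0.5 cut-off are assumptions for illustration and have nothing to do with the cancer data.

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities of the positive class (made up).
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.3, 0.7, 0.4, 0.8, 0.9]

# Hard labels obtained by cutting the probabilities at 0.5.
y_label = [1 if s > 0.5 else 0 for s in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)
print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
```

Here the probabilities give 8/9 while the hard labels give 2/3: thresholding throws away the ranking information inside each predicted class. With a fitted scikit-learn classifier, the probability of the positive class is typically taken from `predict_proba(x_test)[:, 1]`.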

import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import roc_curve, auc  # compute the ROC curve and the AUC
from sklearn.svm import SVC

# Reuse data and c_data from the logistic regression example; map the classes to 0 / 1
c_target = data["Class"].replace(to_replace=2, value=0)
c_target = c_target.replace(to_replace=4, value=1)
X = np.array(c_data, dtype=float)  # "Bare Nuclei" was read as strings, so convert explicitly
y = np.array(c_target)

# Add noise features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Shuffle and split into training and test sets:
# train_data / train_label are the training samples and their labels,
# test_data / test_label are the test samples and their labels
train_data, test_data, train_label, test_label = model_selection.train_test_split(X, y, test_size=.3, random_state=0)

# Create the classifier: an SVM with a linear kernel, other parameters left at their defaults
clf = SVC(kernel='linear', probability=True, random_state=random_state)

# Train the model with fit(), then use decision_function() to get a score for each
# test sample; these scores are what roc_curve() needs
test_predict_label = clf.fit(train_data, train_label).decision_function(test_data)
print(test_predict_label)

# Compute the ROC curve: roc_curve compares the true test labels with the scores and
# varies the threshold, giving one (fpr, tpr) pair per threshold
fpr, tpr, threshold = roc_curve(test_label, test_predict_label)  # false and true positive rates
print(fpr)
print(tpr)
print(threshold)
roc_auc = auc(fpr, tpr)  # AUC is the area under the curve; the larger the better

lw = 2
plt.figure(figsize=(10, 10))
# The false positive rate is the abscissa and the true positive rate is the ordinate
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')  # diagonal: random guessing
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

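result: a plot of the ROC curve (true positive rate against false positive rate), with the dashed diagonal as the random-guess reference.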

16 June 2020, 02:10
