Using PCA in Python

PCA

Principal Component Analysis (PCA) is a linear dimensionality reduction technique. It maps data from a high-dimensional space to a low-dimensional one, retaining the directions along which the data varies most (the informative part) while discarding the directions with little variance (mostly noise).

dimension

"Dimension" here refers to the number of features used to describe each sample.

principal component

Principal components are the core of PCA: they capture the dominant patterns hidden in the dataset.
In short, when a dataset is mapped from a high-dimensional space (say, three dimensions) down to a lower-dimensional one (say, two dimensions), the two new axes are the principal components, and together they carry most of the variation in the data. Each principal component accounts for a percentage of the total variance of the dataset.
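To make this concrete, here is a minimal sketch (not part of the original walkthrough; the variable names are illustrative) using synthetic three-dimensional data whose variation is concentrated along a single direction:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 3-D data: almost all of the variation lies along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
data = np.hstack([
    t,                                            # varies with t
    0.5 * t + 0.01 * rng.normal(size=(200, 1)),   # strongly correlated with t
    0.01 * rng.normal(size=(200, 1)),             # nearly constant noise
])

# Keep two of the three dimensions.
pca = PCA(n_components=2)
pca.fit(data)
print(pca.explained_variance_ratio_)
```

The first ratio comes out close to 1, confirming that a single principal component already captures nearly all of the variance in this toy dataset.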

data set

The dataset used in the examples in this article is one of the classic datasets in data analysis: Breast Cancer (Wisconsin Diagnostic).
It is a real-valued multivariate dataset with two classes (malignant and benign), each label indicating whether a patient has breast cancer. It contains 212 malignant samples and 357 benign ones.
Each sample has 30 features (radius, texture, perimeter, area, fractal dimension, etc.).

Code

Next, we use Python code to perform PCA analysis on the Breast Cancer dataset.

  1. Data exploration
    Using the sklearn module, import the breast cancer dataset from its built-in datasets:
from sklearn.datasets import load_breast_cancer

Now that the loader has been imported, fetch the dataset and its data matrix:

breast = load_breast_cancer()
breast_data = breast.data

Check the data (note that the data does not contain labels at this time):

print(breast_data.shape)

The result is:

(569, 30)

It can be seen that there are 569 samples, each with 30 features.
View all the feature names:

features = breast.feature_names
print(features)

The result is:

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

For each sample there is a corresponding label, and this dataset contains two label values. To view the labels in the dataset:

print(load_breast_cancer().target)

The result is:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 0 0 0 0 0 1]

As you can see from the results, there are two kinds of labels in this dataset, 0 and 1.
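As a sanity check (this step is not in the original walkthrough), sklearn exposes the meaning of the two labels and their counts directly:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
print(breast.target_names)           # ['malignant' 'benign']: 0 is malignant, 1 is benign
print(np.bincount(breast.target))    # [212 357]
```

This matches the class counts quoted earlier: 212 malignant and 357 benign samples.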

  2. Combining data and labels with numpy
    Combine the data and the labels to generate a single array containing both.
import numpy as np
breast_labels = breast.target
labels = np.reshape(breast_labels, (569, 1))
concatenate_data = np.concatenate([breast_data, labels], axis=1)
print(concatenate_data.shape)

The result is:

(569, 31)

Now the data matrix is 569 × 31.

  3. Using the pandas module to present data in tabular form
import pandas as pd
breast_dataset = pd.DataFrame(concatenate_data)

Assign column names to the DataFrame, appending a 'label' column for the target:

features_labels = np.append(features, 'label')
breast_dataset.columns = features_labels

Next, replace the numeric values in the label column with readable names. In sklearn's version of this dataset, 0 stands for malignant tumors and 1 for benign ones:

breast_dataset['label'] = breast_dataset['label'].replace({0: 'Malignant', 1: 'Benign'})

At this point, look at the first 10 rows of the DataFrame:

print(breast_dataset.head(10))

The result is:

   mean radius  mean texture   ...    worst fractal dimension      label
0        17.99         10.38   ...                    0.11890  Malignant
1        20.57         17.77   ...                    0.08902  Malignant
2        19.69         21.25   ...                    0.08758  Malignant
3        11.42         20.38   ...                    0.17300  Malignant
4        20.29         14.34   ...                    0.07678  Malignant
5        12.45         15.70   ...                    0.12440  Malignant
6        18.25         19.98   ...                    0.08368  Malignant
7        13.71         20.83   ...                    0.11510  Malignant
8        13.00         21.82   ...                    0.10720  Malignant
9        12.46         24.04   ...                    0.20750  Malignant

[10 rows x 31 columns]
  4. Using PCA to process data

Because PCA is sensitive to the scale of the features, the data should be standardized first.
First, import StandardScaler and PCA from the sklearn library:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Standardized treatment:

x = breast_dataset.loc[:, features].values
x = StandardScaler().fit_transform(x)

Convert the normalized data to a table format:

feat_cols = ['feature' + str(i) for i in range(x.shape[1])]
normalised_data = pd.DataFrame(x, columns=feat_cols)

Example results (the last five rows):

     feature0  feature1  feature2    ...      feature27  feature28  feature29
564  2.110995  0.721473  2.060786    ...       1.629151  -1.360158  -0.709091
565  1.704854  2.085134  1.615931    ...       0.733827  -0.531855  -0.973978
566  0.702284  2.045574  0.672676    ...       0.414069  -1.104549  -0.318409
567  1.838341  2.336457  1.982524    ...       2.289985   1.919083   2.219635
568 -1.808401  1.221792 -1.814389    ...      -1.745063  -0.048138  -0.751207

[5 rows x 30 columns]

Next is the core of the PCA analysis: project the 30-dimensional data onto two principal components.

pca_breast = PCA(n_components=2)
principal_components = pca_breast.fit_transform(x)

Create a DataFrame from the principal components:

principal_dataframe = pd.DataFrame(data=principal_components,
                                   columns=['principal component 1', 'principal component 2'])

View results:

     principal component 1  principal component 2
564               6.439315              -3.576817
565               3.793382              -3.584048
566               1.256179              -1.902297
567              10.374794               1.672010
568              -5.475243              -0.670637
  5. explained_variance_ratio_
    The explained variance ratio gives the fraction of the total variance (i.e., the amount of information) held by each principal component.
print(pca_breast.explained_variance_ratio_)

Result

[0.44272026 0.18971182]

From the results, principal component 1 retains 44.3% of the variance, while principal component 2 retains only 19.0%. Note that about 36.8% of the information is lost when mapping the 30-dimensional data down to two dimensions.
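A natural follow-up question is how many components are needed to keep most of the information. A short sketch (assuming the same standardized data as above, not part of the original article) computes the cumulative explained variance over all 30 components:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the data, as in the main walkthrough.
x = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA with all components and accumulate the variance ratios.
pca_full = PCA().fit(x)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components retaining at least 95% of the variance.
print(np.argmax(cumulative >= 0.95) + 1)
```

Alternatively, passing a float such as `PCA(n_components=0.95)` asks sklearn to pick that number of components automatically.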

  6. Visualization
    Next, use matplotlib to plot the results of the PCA analysis.
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1', fontsize=20)
plt.ylabel('Principal Component - 2', fontsize=20)
plt.title('Principal Component Analysis of Breast Cancer Dataset', fontsize=20)
targets = ['Benign', 'Malignant']
colors = ['r', 'b']

for target, color in zip(targets, colors):
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(principal_dataframe.loc[indicesToKeep, 'principal component 1'],
                principal_dataframe.loc[indicesToKeep, 'principal component 2'],
                c=color, s=50)

plt.legend(targets, prop={'size': 15})
plt.show()

Results:

As the scatter plot shows, once the benign and malignant classes are projected into the two-dimensional principal-component space, they are linearly separable to a reasonable extent.
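To quantify that separability (an extra check, not part of the original article), one can train a simple linear classifier on just the two principal components:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

breast = load_breast_cancer()

# Standardize, then project onto the first two principal components.
x2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(breast.data))

# 5-fold cross-validated accuracy of a linear model on the 2-D projection.
scores = cross_val_score(LogisticRegression(), x2, breast.target, cv=5)
print(round(scores.mean(), 3))
```

An accuracy well above 90% on only two dimensions supports the visual impression of near-linear separability. (For a rigorous evaluation the PCA fit would go inside the cross-validation pipeline; this is just a quick check.)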


Tags: Python

Posted on Thu, 16 Jan 2020 11:44:02 -0500 by gateway69