12.3

Based on a principal component analysis of the scores of 50 students in the table below, how many composite variables should be selected to represent these students' scores in the 6 courses?

The data are as follows:

## 1. Data import and data standardization

```r
> data=read.csv('D:/R languaga_main factors_analysis.csv',header=TRUE)
> View(data)
> std_data=scale(data[1:6])   # data standardization
> # Optionally, rownames(std_data)=data[[1]] sets the row names of the
> # standardized matrix to the first column of the data file
> View(std_data)              # view the data after standardization
> class(std_data)             # view the data type
[1] "matrix"
> df=as.data.frame(std_data)
> class(df)
[1] "data.frame"
```
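What `scale()` does above can be mirrored in a few lines of NumPy: subtract each column's mean and divide by its sample standard deviation. This is a minimal sketch with a made-up 4×2 score matrix, not the 50-student file from the exercise.

```python
import numpy as np

# Hypothetical raw scores (4 students x 2 courses), for illustration only
X = np.array([[80., 70.],
              [90., 60.],
              [70., 80.],
              [60., 90.]])

# Z-score standardization; ddof=1 matches R's sd(), which scale() uses
std_X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(std_X.mean(axis=0))             # each column mean is ~0
print(std_X.std(axis=0, ddof=1))      # each column sd is 1
```

After standardization every course is on the same scale, so the covariance matrix of `std_X` is its correlation matrix — which is why the PCA below can equivalently work with `cor=TRUE`.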

## 2. PCA (principal component analysis)

```r
> df.pr=princomp(df,cor=TRUE)   # principal component analysis
> summary(df.pr,loadings=TRUE)  # list the results, including the eigenvectors
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5     Comp.6
Standard deviation     1.9170867 1.1026637 0.62781730 0.58486784 0.47527712 0.38314243
Proportion of Variance 0.6125369 0.2026445 0.06569243 0.05701173 0.03764806 0.02446635
Cumulative Proportion  0.6125369 0.8151814 0.88087386 0.93788559 0.97553365 1.00000000

Loadings:
            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
mathematics  0.402  0.406  0.207  0.701  0.136  0.347
physics      0.304  0.614 -0.663 -0.279        -0.113
chemistry    0.415  0.268  0.672 -0.529 -0.137
chinese     -0.463  0.278        -0.269 -0.125  0.788
history     -0.431  0.367  0.214         0.744 -0.283
English     -0.418  0.417  0.139  0.278 -0.628 -0.401

> cor(df)                       # output the correlation coefficient matrix
            mathematics    physics  chemistry    chinese    history    English
mathematics   1.0000000  0.6248332  0.6652483 -0.5716666 -0.4365279 -0.3727902
physics       0.6248332  1.0000000  0.5396874 -0.3024095 -0.2532592 -0.2122329
chemistry     0.6652483  0.5396874  1.0000000 -0.5651174 -0.4947985 -0.4916407
chinese      -0.5716666 -0.3024095 -0.5651174  1.0000000  0.8075182  0.7985865
history      -0.4365279 -0.2532592 -0.4947985  0.8075182  1.0000000  0.7670363
English      -0.3727902 -0.2122329 -0.4916407  0.7985865  0.7670363  1.0000000

> y=eigen(cor(df))              # find the eigenvalues and eigenvectors
> y$values                      # output the eigenvalues
[1] 3.6752214 1.2158672 0.3941546 0.3420704 0.2258883 0.1467981
```

The output lists the eigenvalues of all six principal components; the first two are the eigenvalues of the retained principal components.

Finding the eigenvectors of the correlation matrix gives the rotated coordinate axes: the directions along which the data are spread out — differentiated — to the greatest extent. A given eigenvalue divided by the sum of all eigenvalues is the variance contribution rate of the corresponding eigenvector, i.e. the proportion of the total information carried by that dimension. Using this relationship, the score of each principal component can be computed by hand in Excel, reproducing the component score matrix that SPSS reports.
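The relationship described above — eigenvectors of the correlation matrix are the rotated axes, and each eigenvalue is the variance of the data along its axis — can be verified numerically. This is a NumPy sketch on small synthetic data (not the 50-student file), mirroring what `eigen(cor(df))` and `princomp` compute in R.

```python
import numpy as np

# Synthetic data: 50 observations of 3 correlated variables (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.5 * rng.normal(size=50)   # make column 2 correlate with column 0

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize, as scale() does
R = np.corrcoef(Z, rowvar=False)                  # correlation matrix

vals, vecs = np.linalg.eigh(R)                    # eigendecomposition (symmetric matrix)
order = np.argsort(vals)[::-1]                    # sort eigenvalues in descending order
vals, vecs = vals[order], vecs[:, order]

scores = Z @ vecs                                 # principal component scores

# The sample variance of each component's scores equals its eigenvalue,
# so eigenvalue / sum(eigenvalues) is exactly the variance contribution rate
print(np.var(scores, axis=0, ddof=1))
print(vals)
print(vals / vals.sum())                          # variance contribution rates
```

Because the variables are standardized, the eigenvalues sum to the number of variables (the trace of the correlation matrix), which is why the contribution rates always sum to 1.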

Generally speaking, the data projected onto the eigenvectors are called the principal components of the variables. SPSS lists the variance contribution rate and cumulative variance contribution rate of all principal components in its total variance explained table. As a rule, the first few principal components whose cumulative variance contribution rate exceeds 80% are retained as the final principal components, which reduces the dimension of the data. (A component whose eigenvalue is less than 1 is usually not retained, because it explains less than a single original variable does; here only the two components with eigenvalues 3.6752214 and 1.2158672 are selected.)

```r
> sum(y$values[1:6])/sum(y$values)   # cumulative variance contribution rate of the six components
[1] 1
> df.pr$loadings[,1:6]               # output the loading matrix of all six principal components
                Comp.1    Comp.2      Comp.3      Comp.4       Comp.5     Comp.6
mathematics  0.4015600 0.4058115  0.20738435  0.70142207  0.136499735  0.3470349
physics      0.3037728 0.6144972 -0.66286300 -0.27925871  0.007852544 -0.1126090
chemistry    0.4152894 0.2679773  0.67231227 -0.52884389 -0.136820503 -0.0729615
chinese     -0.4625867 0.2781553  0.02373273 -0.26858310 -0.124567353  0.7876713
history     -0.4305473 0.3668097  0.21382904 -0.03382354  0.743825599 -0.2827300
English     -0.4179102 0.4171075  0.13897656  0.27760560 -0.627529094 -0.4014976
> screeplot(df.pr,type='lines')      # draw the scree plot
> biplot(df.pr)                      # draw the biplot of components and observations
```

The cumulative variance contribution rate of all principal components is exactly 1, which confirms that no component has been missed: in traditional PCA, the number of principal components equals the number of original variables (six here). When writing out the mathematical model, the teacher also suggests writing all six principal components — if only the extracted components are written, the eigenvectors no longer have unit length.
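Writing out the model concretely: each principal component is a linear combination of the six standardized course scores, with the corresponding loading column as coefficients. This sketch computes a Comp.1 score from the first loading column above for one hypothetical student (the z-score vector is made up for illustration, not taken from the data file).

```python
import numpy as np

# First loading column from df.pr$loadings above (mathematics .. English)
a1 = np.array([0.4015600, 0.3037728, 0.4152894, -0.4625867, -0.4305473, -0.4179102])

# Hypothetical standardized scores for one student: strong in sciences, weak in humanities
z = np.array([1.2, 0.8, 1.0, -0.5, -0.9, -0.3])

comp1 = a1 @ z                  # score on the first principal component
print(round(comp1, 4))          # 1.8843: high Comp.1 = science-leaning profile
print(np.linalg.norm(a1))       # ~1: each full eigenvector has unit length
```

Note the sign pattern: Comp.1 loads positively on mathematics, physics, and chemistry and negatively on Chinese, history, and English, so it contrasts science scores against humanities scores.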

For another, similar R statistics walkthrough, see: Principal component analysis (PCA) in R.