The main purpose of factor analysis is to describe a group of measured variables in terms of a smaller number of underlying latent variables that cannot be measured directly. Like principal component analysis, factor analysis can be used for dimensionality reduction, but it differs in that it can be seen as an extension of principal component analysis: the extracted common factors take into account not only whether the measured variables are correlated but also how strongly, so their meaning is easier to interpret.
This post collects three cases from around the web and organizes them; the main goal is to walk through the steps of factor analysis and make them easier to understand. The data can be downloaded here (each worksheet is one dataset).
Factor analysis steps:
(1) Import the data and run the KMO and Bartlett's sphericity tests to judge whether the data are suitable for factor analysis.
(2) If the data are suitable for factor analysis and the number of factors is unknown, first do exploratory factor analysis and draw a scree plot to decide how many factors to extract; if the number of variables measured by the questionnaire is known, exploratory factor analysis can be done first, followed by confirmatory factor analysis, and a heat map can be used to judge whether the variables are measured accurately.
(3) If factor analysis is to be used to produce a ranking, first compute the factor scores, then combine each sample's performance on the factors into an overall score and rank by that score (see the sketch below).
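Before going into the cases, here is a rough sketch of the whole workflow assembled from the code used in the three cases below. It is only meant to show the shape of the workflow, not a finished script; the file and sheet names are the ones used in Case 1, and the data frame is assumed to contain only the numeric item columns.

# Rough sketch of the overall workflow (details are worked through in the cases below)
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity

dt = pd.read_excel('Factor analysis data.xlsx', sheet_name='Questionnaire data')  # numeric item columns only

# (1) adequacy tests
kmo_all, kmo_model = calculate_kmo(dt)
chi_square_value, p_value = calculate_bartlett_sphericity(dt)

# (2) exploratory analysis: eigenvalues decide the number of factors
fa = FactorAnalyzer(rotation=None)
fa.fit(dt)
ev, v = fa.get_eigenvalues()
n_factors = int((ev > 1).sum())          # "eigenvalue greater than 1" rule

# rotate and refit with the chosen number of factors
fa = FactorAnalyzer(n_factors=n_factors, rotation='varimax', method='principal')
fa.fit(dt)

# (3) factor scores, weighted by each factor's variance contribution rate
scores = fa.transform(dt) @ fa.get_factor_variance()[1]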
Case 1: exploratory factor analysis and confirmatory factor analysis of questionnaire data
This is a questionnaire with 25 measurement items. There are 5 measured traits: agreeableness; conscientiousness; extraversion; neuroticism; openness.
For now we pretend the five traits are unknown and use factor analysis to verify the validity of the questionnaire.
Import data first:
import pandas as pd
dt=pd.read_excel(r'Factor analysis data.xlsx',sheet_name='Questionnaire data')
dt.head(2)
The data are as follows:
It can be seen that some columns in this data are not needed, so we remove them:
dt1=dt.drop(['Unnamed: 0','gender','education','age'],axis=1)
dt1.head(2)
The data after removal are as follows:
Then check whether the data contain missing values and delete the rows that do:
dt1.isnull().sum()
It is found that the data do have missing values, so delete the rows containing them and then check how much data remains:
dt1.dropna(inplace=True)
dt1.shape
(2436, 25)
There are 2436 records left, which is a large enough sample.
First, KMO and Bartlett tests are performed:
Import the required packages; the factor_analyzer package needs to be installed in advance (e.g. with pip install factor_analyzer):
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
Calculate the corresponding KMO and Bartlett test results:
kmo_all,kmo_model=calculate_kmo(dt1)
chi_square_value,p_value=calculate_bartlett_sphericity(dt1)
print('kmo_all:',kmo_all)
print('kmo_model',kmo_model)
print('chi_square_value',chi_square_value)
print('p_value',p_value)
The result is:
It can be seen that the KMO statistic of every item is greater than 0.75, which is a good result (as a rule of thumb the minimum acceptable value is 0.5), while the overall KMO reaches 0.8485 and the p value is 0, indicating that the questionnaire data are well suited to factor analysis.
Assuming we do not know that there are five traits, we first do exploratory factor analysis to see how many common factors it is appropriate to extract, and compute the eigenvalues:
fa=FactorAnalyzer(10,rotation=None)
fa.fit(dt1)
ev,v=fa.get_eigenvalues()
Then draw the scree plot:
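A minimal sketch of the scree plot, reusing the same plotting code as in Case 2; it assumes dt1 and ev from the steps above:

# Scree plot: eigenvalue of each component, used to choose the number of factors
from matplotlib import pyplot as plt
plt.figure(figsize=(10,8))
plt.scatter(range(1,dt1.shape[1]+1),ev)
plt.plot(range(1,dt1.shape[1]+1),ev)
plt.grid(True)
plt.show()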
According to the "eigenvalue greater than 1" rule, five common factors are indeed appropriate. In practice, however, the decision is not made from the plot alone; it also relies on the cumulative variance contribution rate, which should generally reach about 85%. This dataset is somewhat problematic in that respect: its cumulative variance contribution rate is low, which also shows up in the heat map later.
Output eigenvalue, variance contribution rate and cumulative variance contribution rate:
fa_v = fa.get_factor_variance()
fa_dt = pd.DataFrame(
    {'Eigenvalue': fa_v[0],
     'Variance contribution rate': fa_v[1],
     'Cumulative variance contribution rate': fa_v[2]})  # eigenvalue and variance contribution rate of each factor
print("\n",fa_dt)
Only 5 eigenvalues are greater than 1, so we keep 5 common factors, rotate the factors, and fit again:
fa = FactorAnalyzer(rotation='varimax', n_factors=5, method='principal')
fa.fit(dt1)
Output the result again:
fa_v = fa.get_factor_variance()
fa_dt = pd.DataFrame(
    {'Eigenvalue': fa_v[0],
     'Variance contribution rate': fa_v[1],
     'Cumulative variance contribution rate': fa_v[2]})  # eigenvalue and variance contribution rate of each factor
print("\n",fa_dt)
It can be seen that the cumulative variance contribution rate is indeed not high. Next, conduct confirmatory factor analysis and draw the heat map:
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

dt2=pd.DataFrame(np.abs(fa.loadings_),index=dt1.columns)
plt.figure(figsize=(10,8))
ax=sns.heatmap(dt2,annot=True,cmap='BuPu')  # cmap sets the colour scheme; 'BuPu' runs roughly from grey to purple
plt.show()
It can be seen that, for items sharing the same prefix (such as C1, C2, C3), the high loadings basically all fall under a single factor (such as the third column), indicating that the validity of the questionnaire is good. However, the data also show that some items' loadings do not reach 0.5, and they should be at least 0.5.
So far, exploratory factor analysis suggests that the data contain five factors, confirmatory factor analysis also points to five factors, and the validity of the questionnaire basically meets the standard.
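To make that check concrete, here is a small sketch that lists the items whose largest absolute loading falls below 0.5; it assumes the dt2 loading table built above:

# Items whose highest absolute loading is below 0.5 (weakly measured items)
weak_items = dt2[dt2.max(axis=1) < 0.5]
print(weak_items)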
Case 2: analysis of arts/science bias in exam score data
This dataset contains the exam scores of nearly 600 students (apparently junior high school). We conduct factor analysis on the data to judge whether the students lean towards the arts or the sciences.
Again, import data first and remove redundant columns:
dt=pd.read_excel(r'Factor analysis data.xlsx',sheet_name='Achievement data')
dt.head(2)
Student number, class and PE (which is generally not counted as an arts or science subject) are redundant columns, so delete them:
dt1=dt.drop(['Student number','class','Sports'],axis=1)
After confirming that there are no missing values, we can continue. First conduct the KMO and Bartlett tests:
The code is the same as the above example:
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

kmo_all,kmo_model=calculate_kmo(dt1)
chi_square_value,p_value=calculate_bartlett_sphericity(dt1)
print('kmo_all:',kmo_all)
print('kmo_model',kmo_model)
print('chi_square_value',chi_square_value)
print('p_value',p_value)
Similarly, the data pass the KMO and Bartlett tests and are suitable for factor analysis.
As before, do exploratory factor analysis first, draw the scree plot, and output the eigenvalues and related statistics:
fa=FactorAnalyzer(9,method='principal',rotation=None)
fa.fit(dt1)
ev,v=fa.get_eigenvalues()
from matplotlib import pyplot as plt
plt.figure(figsize=(10,8))
plt.scatter(range(1,dt1.shape[1]+1),ev)
plt.plot(range(1,dt1.shape[1]+1),ev)
plt.grid(True)
plt.show()
fa_v = fa.get_factor_variance()
fa_dt = pd.DataFrame(
    {'Eigenvalue': fa_v[0],
     'Variance contribution rate': fa_v[1],
     'Cumulative variance contribution rate': fa_v[2]})  # eigenvalue and variance contribution rate of each factor
print("\n",fa_dt)
The scree plot is as follows:
The eigenvalue, variance contribution rate and cumulative variance contribution rate are as follows:
Judging by the eigenvalues, a single common factor would almost be enough, but since the question is about an arts/science split, we follow that idea and extract two common factors.
As in the above steps, perform factor rotation fitting again:
# Take two factors (the cumulative variance contribution rate is then close to 0.85) and fit again after rotation
fa = FactorAnalyzer(rotation='varimax', n_factors=2, method='principal')
fa.fit(dt1)
Draw the heat map:
import seaborn as sns
from pylab import mpl
mpl.rcParams['font.sans-serif']=['SimHei']    # so that Chinese subject labels display correctly
mpl.rcParams['axes.unicode_minus']=False
dt2=pd.DataFrame(np.abs(fa.loadings_),index=dt1.columns)
plt.figure(figsize=(10,8))
ax=sns.heatmap(dt2,annot=True,cmap='BuPu')
plt.show()
It can be seen that only Chinese and politics load more heavily on the second factor (F2), while the other subjects load more heavily on the first factor (F1). For the time being, call the second factor the liberal-arts factor and the first the science factor. Chinese, for example, can be expressed as 0.35F1+0.87F2.
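To read such an expression directly from the loading table, you can print the corresponding row of dt2; the label 'Chinese' below is a placeholder for the actual column name used in the spreadsheet:

# Loadings of one subject on the two factors; 'Chinese' is a placeholder label
print(dt2.loc['Chinese'])   # roughly 0.35 on F1 and 0.87 on F2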
In my understanding there is basically no arts/science divide in junior high school, but it is easy to accept that Chinese is related to politics and mathematics to physics and chemistry. Interestingly, English sits with the science factor: students who do well in the sciences also tend to do well in English.
Then make a heat map for all students:
dt2=fa.transform(dt1)   # factor scores for every student
ax=sns.heatmap(dt2,annot=True,cmap='BuPu')
plt.show()
It can be seen that most of the cells are purple, so most of the students can be considered to lean towards science.
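To quantify that impression rather than eyeballing the colours, a quick sketch that counts how many students score higher on the science factor (F1) than on the liberal-arts factor (F2); it assumes dt2 holds the factor scores computed just above:

# Column 0 is F1 (science), column 1 is F2 (liberal arts)
science_leaning = (dt2[:, 0] > dt2[:, 1]).sum()
print(science_leaning, 'of', len(dt2), 'students score higher on the science factor')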
Suppose a student's scores are [130, 108, 120, 67.2, 48.25, 50, 48.5, 47.5, 47].
We substitute it in and calculate the factor coefficient:
fa.transform(pd.DataFrame([[130,108,120,67.2,48.25,50,48.5,47.5,47]]))
The result is: array([[-0.98165912, 4.00093525]])
It can be seen that this student stands out on the second factor, i.e. is seriously biased towards the liberal arts?
Case 3: factor analysis of candidate selection
The data are questionnaire responses collected by an organization from 48 candidates. The organization wants to use these data to admit the 6 best candidates. How should this be done?
Still import data first and remove redundant columns:
dt=pd.read_excel(r'Factor analysis data.xlsx',sheet_name='Candidate data')
dt1=dt.drop('ID',axis=1)
dt1.head(2)
The data are as follows:
As usual, conduct the KMO and Bartlett tests first:
# KMO and Bartlett tests
from factor_analyzer.factor_analyzer import calculate_kmo
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
kmo_all,kmo_model=calculate_kmo(dt1)
chi_square_value,p_value=calculate_bartlett_sphericity(dt1)
print(kmo_all)
print(kmo_model)
print(chi_square_value)
print(p_value)
Overall the result looks acceptable, although some individual items are not very good.
Exploratory factor analysis:
Draw the scree plot and compute the eigenvalues and related statistics:
# Exploratory factor analysis
fa = FactorAnalyzer(rotation=None, n_factors=15, method='principal')
fa.fit(dt1)   # fit on dt1 so the ID column is excluded
ev,v=fa.get_eigenvalues()
plt.figure(figsize=(10,8))
plt.scatter(range(1,dt1.shape[1]+1),ev)
plt.plot(range(1,dt1.shape[1]+1),ev)
plt.show()
fa_v = fa.get_factor_variance()
fa_dt = pd.DataFrame(
    {'Eigenvalue': fa_v[0],
     'Variance contribution rate': fa_v[1],
     'Cumulative variance contribution rate': fa_v[2]})  # eigenvalue and variance contribution rate of each factor
fa_dt
Judging by the eigenvalues, four common factors would be acceptable, but to keep the cumulative contribution rate above 85% we choose five common factors and fit again.
# Take five factors so that the cumulative variance contribution rate exceeds 0.85, then fit again after rotation
fa = FactorAnalyzer(rotation='varimax', n_factors=5, method='principal')
fa.fit(dt1)
Calculate each candidate's overall score from the factor scores:
# Calculate score
def F(factors):
    # weight each factor score by that factor's variance contribution rate
    return sum(factors * fa.get_factor_variance()[1])

factor_scores = fa.transform(dt1)   # fa.transform(dt1) gives each candidate's factor scores
scores = []
for i in range(len(factor_scores)):
    scores.append(F(factor_scores[i]))
print(scores)
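As a side note, the loop above can be collapsed into a single matrix product, since fa.transform(dt1) is an (n_samples, n_factors) array and the variance contribution rates form the weight vector; this assumes fa and dt1 from the step above:

# Equivalent one-liner: factor-score matrix times the vector of variance contribution rates
scores = list(fa.transform(dt1) @ fa.get_factor_variance()[1])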
Sort all candidates by score:
dt['scores']=scores
dt.scores.sort_values(ascending=False)
You can see that candidates 39, 38, 7, 9, 6 and 22 are the ones the organization should admit.
Let's also draw a chart to see the result:
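One possible sketch is a bar chart of the sorted scores, with the strongest candidates on the left; it assumes the scores column added to dt above:

# Bar chart of candidates sorted by composite score (highest first)
from matplotlib import pyplot as plt
ranked = dt.scores.sort_values(ascending=False)
plt.figure(figsize=(12,6))
plt.bar(ranked.index.astype(str), ranked.values)
plt.xlabel('candidate index')
plt.ylabel('score')
plt.show()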
However, my result for this case differs from the one posted online; I am not sure whether the problem lies in the data or in one of the steps, so corrections are welcome.
All right, that wraps up all three cases.