I only know a little about the covariance homogeneity test, and I get confused when I try to explain it in more depth.
This post covers the homogeneity test of covariance matrices for two-class problems and for multi-class problems.
Homogeneity test of covariance matrix for two classes
Let $\Sigma_1$ be the covariance matrix of class 1, $\Sigma_2$ the covariance matrix of class 2, and $\hat{\Sigma}$ (written $S$ below) the pooled covariance matrix of the two classes:

$$S = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2}$$

The test statistics are

$$Q_i = (n_i-1)\left[\ln|S| - \ln|S_i| - p + \operatorname{tr}(S^{-1}S_i)\right], \qquad i = 1, 2$$

where $\operatorname{tr}$ stands for the trace, i.e., the sum along the diagonal, and $p$ is the dimension, i.e., how many features the data has. Under $H_0$ each $Q_i$ follows a $\chi^2$ distribution with $p(p+1)/2$ degrees of freedom.
In the Bayesian discriminant formula, if the covariance matrices of the two classes are equal, their pooled covariance is used in the calculation.
So this test measures the difference between $\Sigma_1$ and the pooled covariance, and between $\Sigma_2$ and the pooled covariance.
If neither difference is significant, the covariance matrices are homogeneous and the pooled covariance can be used in the calculation.
We again use the data from the earlier example.
import numpy as np
import scipy.stats as stats

x = np.array([[1.14,1.78],[1.18,1.96],[1.20,1.86],[1.26,2.00],[1.28,2.00],
              [1.30,1.96],[1.24,1.72],[1.36,1.74],[1.38,1.64],[1.38,1.82],
              [1.38,1.90],[1.40,1.70],[1.48,1.82],[1.54,1.82],[1.56,2.08]])
y = np.array([1,1,1,1,1,1,2,2,2,2,2,2,2,2,2])

# Separate the data by class
x1 = x[y==1]
x2 = x[y==2]
def cov_homogeneity_2(arr1, arr2):
    '''
    Covariance homogeneity test for two classes.
    Returns True or False, the statistics Q1 and Q2, and the rejection threshold.
    arr1, arr2 are the feature data of the two classes.
    '''
    # p is the dimension, i.e., how many features the data has
    if arr1.shape[1] == arr2.shape[1]:
        p = arr1.shape[1]
    else:
        raise ValueError("Different data dimensions")
    a1 = np.array(arr1, dtype=np.float64)
    a2 = np.array(arr2, dtype=np.float64)
    # Sample sizes, class covariances, and the pooled covariance
    n1, n2 = a1.shape[0], a2.shape[0]
    s1 = np.cov(a1, rowvar=False)
    s2 = np.cov(a2, rowvar=False)
    s = ((n1-1)*s1 + (n2-1)*s2) / (n1+n2-2)
    # Compute the Q statistics according to the formula above
    Q1 = (n1-1)*(np.log(np.linalg.det(s)) - np.log(np.linalg.det(s1))
                 - p + np.trace(np.linalg.inv(s).dot(s1)))
    Q2 = (n2-1)*(np.log(np.linalg.det(s)) - np.log(np.linalg.det(s2))
                 - p + np.trace(np.linalg.inv(s).dot(s2)))
    df = p*(p+1)/2
    # The statistics follow a chi-square distribution, so compute the
    # critical value, i.e., the rejection threshold ("lingjie")
    lingjie = stats.chi2.ppf(0.95, df)
    # If both are below the threshold, the two covariances do not differ
    # significantly (H0: no significant difference in covariance)
    if (Q1 < lingjie) and (Q2 < lingjie):
        return True, Q1, Q2, lingjie
    else:
        return False, Q1, Q2, lingjie
print(cov_homogeneity_2(x1, x2))
# out: (True, 2.5784296441237355, 0.7417513518834973, 7.814727903251179)

It returns True, so this data passes the covariance homogeneity test and the two covariance matrices can be considered equal.
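As a quick sanity check, here is a small experiment of my own (synthetic data, not from the original example): two samples drawn with clearly different covariance structures should fail the test.

# Sanity check with synthetic data: two clearly different covariance
# structures should be rejected as inhomogeneous.
rng = np.random.default_rng(0)
a = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=50)
b = rng.multivariate_normal([0, 0], [[5.0, 2.0], [2.0, 5.0]], size=50)
print(cov_homogeneity_2(a, b))  # expect (False, ...)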
Homogeneity test of covariance matrix for multiple classes
First consider the case of three populations. In fact, the three populations could also be compared pairwise with the two-class method above, drawing a conclusion from several comparisons (a pairwise sketch is shown after the k-class example below); four or five populations work the same way, and enough comparisons will eventually give an answer. A single overall test, however, is more direct.
The test statistic, written directly for k classes (the three-class case is just k = 3), is

$$M = (n-k)\ln|S| - \sum_{i=1}^{k}(n_i-1)\ln|S_i|$$
$$d = \frac{2p^2+3p-1}{6(p+1)(k-1)}\sum_{i=1}^{k}\left(\frac{1}{n_i-1} - \frac{1}{n-k}\right)$$
$$T = (1-d)\,M \sim \chi^2\!\left(\frac{p(p+1)(k-1)}{2}\right)$$

Compared with the two-class case, the main difference is the pooled covariance matrix $S$. If there are k classes, then

$$S = \frac{1}{n-k}\sum_{i=1}^{k}(n_i-1)S_i$$

where $n_1 + n_2 + \cdots + n_k = n$ is the total number of samples.
Note that when n = k, for example three samples in three classes, the test cannot be carried out: each class would have a single sample, whose covariance is undefined, and there is too little data to distinguish anything. Each class must therefore contain at least two samples; that is, n > k.
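A tiny illustration of why (my own addition, not from the original post):

# With a single observation, np.cov divides by n - 1 = 0, so the class
# covariance comes out as NaN (numpy also emits a RuntimeWarning).
import numpy as np
print(np.cov(np.array([[1.0, 2.0]]), rowvar=False))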
The test function, extended directly to k classes, is given below:
def cov_homogeneity_n(x, y):
    if x.shape[0] != y.shape[0]:
        return "x and y contain different numbers of samples"
    n = x.shape[0]            # total number of samples
    p = x.shape[1]            # dimension (number of features)
    k = np.unique(y).size     # number of classes
    if n == k:
        return "Please check the data"
    s1 = 0                    # accumulates the pooled covariance matrix
    s2 = 0                    # accumulates the subtracted sum in M
    df = p*(p+1)*(k-1)/2      # degrees of freedom
    # Split d into a fixed factor d1 and a summation term d2 computed in the loop
    d1 = (2*p**2 + 3*p - 1) / (6*(p+1)*(k-1))
    d2 = 0
    for y_i in np.unique(y):
        x_i = x[y == y_i]
        n_y_i = x_i.shape[0]  # number of samples in this class
        d2 += 1/(n_y_i-1) - 1/(n-k)
        s_i = np.cov(x_i, rowvar=False)
        # The pooled covariance matrix is itself a sum: the denominator is
        # the same for every term, and the numerators are summed term by term
        s1 += (n_y_i-1)*s_i/(n-k)
        s2 += (n_y_i-1)*np.log(np.linalg.det(s_i))
    # Final formulas
    d = d1*d2
    S = s1
    M = (n-k)*np.log(np.linalg.det(S)) - s2
    T = (1-d)*M
    # Critical value (rejection threshold) of the chi-square distribution
    lingjie = stats.chi2.ppf(0.95, df)
    if T < lingjie:
        return True, T, lingjie
    else:
        return False, T, lingjie
The following data are given
x = np.array([[261.01,7.36],[185.39,5.99],[249.58,6.11],[137.13,4.35],[231.34,8.79],
              [231.38,8.53],[260.25,10.02],[259.51,9.79],[273.84,8.79],[303.59,8.53],
              [231.03,6.15],[308.90,8.49],[258.69,7.16],[355.54,9.43],[476.69,11.32],
              [316.12,8.17],[274.57,9.67],[409.42,10.49],[330.34,9.61],[331.47,13.72],
              [352.50,11.00],[347.31,11.19],[189.59,5.46]])
y = np.array([1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3])
print(cov_homogeneity_n(x, y))
# out: (True, 8.138112152750503, 12.591587243743977)
The covariance matrices of the three classes in this data set are homogeneous; that is, there is no significant difference among them.
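As mentioned earlier, the classes can also be compared pairwise with the two-class test. A quick cross-check of my own on the same data:

# Pairwise cross-check: compare the three classes two at a time
# with the two-class test defined above.
from itertools import combinations
for i, j in combinations(np.unique(y), 2):
    ok, Q1, Q2, crit = cov_homogeneity_2(x[y == i], x[y == j])
    print(f"classes {i} vs {j}: homogeneous = {ok}")

Note that repeated pairwise testing inflates the overall error rate, which is one reason the single overall test is preferable.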
Testing covariance homogeneity in SPSS
SPSS also provides a hypothesis test for the homogeneity of covariance matrices.
The data above have three classes. In SPSS, open Analyze > Classify > Discriminant; refer to a guide on SPSS discriminant analysis for the full procedure. Here we only discuss the hypothesis test for the covariance.
Click the Statistics button in the dialog shown above to get the following:
Box's M test is the hypothesis test for the covariance; H0 is that the covariance matrices are equal.
An F test is used here. I don't understand the specifics of this test, and information online is limited; if you know, please leave me a message. Thank you.
The result is that the covariance matrices of these three classes are equal: there is no reason to reject the null hypothesis.
In addition, in the Statistics dialog box shown above, the Function Coefficients list has the options Fisher's and Unstandardized.
Note that the Fisher's option is actually Bayesian discrimination: it uses probabilities to determine which class an observation belongs to.
The coefficients it gives:
There is one set of function coefficients for each class.
To classify an observation, substitute it into each of the three equations; call the results Y1, Y2, and Y3.
Then the probability of belonging to the first class is

$$P_1 = \frac{e^{Y_1}}{e^{Y_1} + e^{Y_2} + e^{Y_3}}$$

The probabilities of belonging to the other classes are computed the same way, and the observation is assigned to the class with the highest probability.
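A minimal numeric sketch (the scores below are made up for illustration, not real SPSS output):

# Softmax over hypothetical classification-function scores Y1, Y2, Y3.
import numpy as np
Y = np.array([12.3, 10.1, 9.7])       # made-up scores for one observation
probs = np.exp(Y) / np.exp(Y).sum()   # P_i = e^{Y_i} / sum_j e^{Y_j}
print(probs, probs.argmax() + 1)      # probabilities and the chosen class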
If Unstandardized is checked, the unstandardized coefficients of the Fisher discriminant functions are given (the standardized coefficients are output by default).
With Fisher's discriminant formula, the distance from each class centroid is computed to judge which class the observation belongs to.
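A minimal sketch of that nearest-centroid rule (the scores and centroids below are made-up illustrations, not real SPSS output):

# Nearest-centroid classification in discriminant-score space.
import numpy as np
centroids = np.array([[1.2, -0.3], [-0.8, 0.5], [0.1, 1.4]])  # one centroid per class (made up)
score = np.array([0.9, -0.1])                                  # discriminant scores of a new observation
dists = np.linalg.norm(centroids - score, axis=1)              # Euclidean distance to each centroid
print(dists.argmin() + 1)                                      # assign to the nearest centroid's class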
These two are easy to confuse, so I'm writing it down!