Homogeneity test of covariance matrix (continued from the previous article)

I only know a little about the covariance homogeneity test, and going into more depth would only confuse me.
This article describes the covariance matrix test for the two-class problem and for the multi-class problem.

Homogeneity test of covariance matrix for two classes

Σ1 is the covariance matrix of category 1, Σ2 is the covariance matrix of category 2, and Σ is the joint covariance matrix of the two (written Σ_hat or S in the figure).
tr stands for the trace, that is, the sum along the diagonal.
p is the dimension, that is, the number of features in the data.
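Written out, a form of the statistic that is consistent with the symbols described above is given here. It is a reconstruction, so treat the exact form as an assumption; S_1 and S_2 denote the sample covariance matrices and n_1, n_2 the sample counts of the two categories.

```latex
\hat{\Sigma} = \frac{(n_1 - 1)\,S_1 + (n_2 - 1)\,S_2}{n_1 + n_2 - 2}

Q_i = (n_i - 1)\left[\ln\lvert\hat{\Sigma}\rvert - \ln\lvert S_i\rvert - p
      + \operatorname{tr}\!\left(\hat{\Sigma}^{-1} S_i\right)\right],
\qquad i = 1, 2

Q_i \;\overset{H_0}{\sim}\; \chi^2\!\left(\tfrac{p(p+1)}{2}\right) \quad \text{(approximately)}
```

When S_i equals the joint covariance, the logarithm terms cancel and the trace equals p, so Q_i is 0; the larger the difference, the larger Q_i.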

In the formula of Bayesian discrimination, if the covariance matrices of the two categories are equal, their joint covariance is used for the calculation.

So this test checks the difference between Σ1 and the joint covariance, and between Σ2 and the joint covariance.
If neither difference is significant, the covariances are homogeneous and the joint covariance can be used in the calculation.

We continue with the data from the earlier example.

import numpy as np
import scipy.stats as stats

x = np.array([[1.14,1.78],[1.18,1.96],[1.20,1.86],[1.26,2.00],[1.28,2.00],[1.30,1.96],[1.24,1.72],[1.36,1.74],[1.38,1.64],

y = np.array([1,1,1,1,1,1,2,2,2,2,2,2,2,2,2])
# Separate the data by category
x1 = x[y==1]
x2 = x[y==2]
def cov_homogeneity_2(arr1,arr2):
    """
    Covariance homogeneity test for two categories.
    arr1, arr2: feature arrays of the two categories (rows are samples).
    Returns True or False, followed by Q1, Q2 and the rejection threshold.
    """
    # p is the dimension, that is, the number of features
    if arr1.shape[1] == arr2.shape[1]:
        p = arr1.shape[1]
    else:
        raise ValueError("Different data dimensions")
    a1 = np.array(arr1,dtype=np.float64)
    a2 = np.array(arr2,dtype=np.float64)
    # Calculate the number of samples, the covariances, and the joint covariance
    n1,n2 = a1.shape[0],a2.shape[0]
    s1 = np.cov(a1,rowvar=False)
    s2 = np.cov(a2,rowvar=False)
    s = ((n1-1)*s1+(n2-1)*s2)/(n1+n2-2)
    # Calculate the Q statistics: Q_i = (n_i-1)*(ln|s| - ln|s_i| - p + tr(s^-1 s_i)).
    # The trace term is assumed from the description of tr in the text above.
    s_inv = np.linalg.inv(s)
    Q1 = (n1-1)*(np.log(np.linalg.det(s)) - np.log(np.linalg.det(s1)) - p + np.trace(s_inv @ s1))
    Q2 = (n2-1)*(np.log(np.linalg.det(s)) - np.log(np.linalg.det(s2)) - p + np.trace(s_inv @ s2))
    df = p*(p+1)/2
    # Q follows a chi-square distribution, so first calculate the critical value, that is, the rejection threshold
    lingjie = stats.chi2.ppf(0.95,df)

    # If both are below the rejection threshold, there is no significant difference between the two covariances (H0: no significant difference in covariance)
    if (Q1 < lingjie) and (Q2 < lingjie):
        return True,Q1,Q2,lingjie
    else:
        return False,Q1,Q2,lingjie
cov_homogeneity_2(x1,x2)
(True, 2.5784296441237355, 0.7417513518834973, 7.814727903251179)
# Returns True, so the covariance matrices of this data pass the homogeneity test and can be considered equal

Homogeneity test of covariance matrix for multiple classes

First consider the case of three populations. In fact, three populations can also be compared in pairs with the method above, and a conclusion drawn from the repeated comparisons. The same goes for four or five populations: enough pairwise comparisons will always give a result.
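The pairwise approach just described can be sketched as follows. This is a minimal sketch, assuming the two-class statistic has the form (n_i-1)(ln|S| - ln|S_i| - p + tr(S^-1 S_i)); the helper names two_class_q and pairwise_cov_homogeneity are hypothetical and the sample data are made up.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def two_class_q(a1, a2):
    """Q statistics of the two-class test (rows of a1, a2 are samples)."""
    n1, n2 = a1.shape[0], a2.shape[0]
    p = a1.shape[1]
    s1 = np.cov(a1, rowvar=False)
    s2 = np.cov(a2, rowvar=False)
    s = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)  # joint covariance
    s_inv = np.linalg.inv(s)
    log_det_s = np.log(np.linalg.det(s))
    q1 = (n1 - 1) * (log_det_s - np.log(np.linalg.det(s1)) - p + np.trace(s_inv @ s1))
    q2 = (n2 - 1) * (log_det_s - np.log(np.linalg.det(s2)) - p + np.trace(s_inv @ s2))
    return q1, q2


def pairwise_cov_homogeneity(x, y, alpha=0.05):
    """Run the two-class test on every pair of categories (hypothetical helper).

    Returns a dict mapping (label_a, label_b) to True when neither Q
    exceeds the chi-square threshold for that pair.
    """
    p = x.shape[1]
    threshold = stats.chi2.ppf(1 - alpha, p * (p + 1) / 2)
    results = {}
    for a, b in combinations(np.unique(y), 2):
        q1, q2 = two_class_q(x[y == a], x[y == b])
        results[(a.item(), b.item())] = bool(q1 < threshold and q2 < threshold)
    return results


# Made-up data: categories 1 and 2 share a covariance, category 3 is scaled by 100
base = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
x_demo = np.vstack([base, base + 10.0, base * 100.0])
y_demo = np.repeat([1, 2, 3], 5)
pair_results = pairwise_cov_homogeneity(x_demo, y_demo)
```

Each pair is tested at the same significance level, so with many categories the chance of at least one false rejection grows; that is one reason to prefer a single test over all k categories.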
See the figure below:

The figure above shows three categories. Extending to k categories, the only difference is the joint covariance matrix S.
If there are k categories, then

where n1 + n2 + ... + nk = n is the total number of samples.
Note that when n = k, for example three samples in three categories, there is too little data to tell the categories apart and the test cannot be carried out. Each category needs at least two samples, so n > k.
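For reference, the k-category quantities can be written out as below. This is a reconstruction, assuming the usual chi-square approximation of Box's M test; it matches the quantities named d1, M, T and df in the function that follows, with S_i the sample covariance matrix and n_i the sample count of category i.

```latex
S = \frac{1}{n - k} \sum_{i=1}^{k} (n_i - 1)\, S_i

M = (n - k)\,\ln\lvert S\rvert - \sum_{i=1}^{k} (n_i - 1)\,\ln\lvert S_i\rvert

d = \frac{2p^2 + 3p - 1}{6\,(p + 1)\,(k - 1)}
    \left( \sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{n - k} \right)

T = (1 - d)\,M \;\overset{H_0}{\sim}\; \chi^2\!\left(\tfrac{p\,(p+1)\,(k-1)}{2}\right)
\quad \text{(approximately)}
```

If every S_i equals S, the two logarithm terms in M cancel and T is 0.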
The test function extended to k categories is given directly:

def cov_homogeneity_n(x,y):
    if x.shape[0]!=y.shape[0]:
        raise ValueError("x and y do not have the same number of samples")
    n = x.shape[0] # Total number of samples
    p = x.shape[1] # Dimension
    k = np.unique(y).size # Number of categories
    if n<=k:
        raise ValueError("Please check the data")
    s1=0 # Accumulates the joint covariance matrix
    s2=0 # Accumulates the sum subtracted in M: sum of (n_i-1)*ln|S_i|
    d2=0 # Accumulates the summation term of d: sum of 1/(n_i-1)
    df=p*(p+1)*(k-1)/2 # Degrees of freedom
    # d is the product of the fixed factor d1 and the summation term d2; d2 is calculated within the loop
    d1 = (2*p**2+3*p-1)/(6*(p+1)*(k-1))
    for y_i in np.unique(y):
        x_i = x[y==y_i]
        n_y_i = x_i.shape[0] # Number of samples in this category
        s_i = np.cov(x_i,rowvar=False)
        s1+=(n_y_i-1)*s_i/(n-k) # The joint covariance is also a sum: same denominator, one numerator term per category
        # The two accumulations below assume the standard form of Box's M test
        s2+=(n_y_i-1)*np.log(np.linalg.det(s_i))
        d2+=1/(n_y_i-1)
    d2-=1/(n-k)
    # Final formula:
    d = d1*d2
    S = s1
    M = (n-k)*np.log(np.linalg.det(S))-s2
    T = (1-d)*M
    lingjie = stats.chi2.ppf(0.95,df)
    if T<lingjie:
        return True,T,lingjie
    else:
        return False,T,lingjie

The following data are given

x = np.array([[261.01,7.36],[185.39,5.99],[249.58,6.11],[137.13,4.35],[231.34,8.79],[231.38,8.53],[260.25,10.02],

y = np.array([1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3])

cov_homogeneity_n(x,y)
(True, 8.138112152750503, 12.591587243743977)

The covariance matrices of the three categories in this data set are homogeneous, that is, there is no significant difference between them.

Testing covariance homogeneity in SPSS

SPSS also provides a hypothesis test for the homogeneity of covariance.
The data above have three categories. In SPSS, go to Analyze > Classify > Discriminant; for the procedure itself, refer to the article on discriminant analysis in SPSS. Only the hypothesis test of covariance is explained here.

Click the statistics button in the figure above to get the following:

Box's M test is the hypothesis test of covariance. H0 is that the covariance matrices are equal.

An F test is used here. I don't understand the specific test method, and the information online is limited. Leave me a message if you know. Thank you.

The result is that there is no reason to reject the null hypothesis, so the covariance matrices of these three categories are considered equal.

In addition, in the Statistics dialog box in the figure above, the Function Coefficients list contains the options Fisher's and Unstandardized.

Note that the Fisher's option is Bayesian discrimination: it uses probability to decide which category an observation belongs to.
The coefficients it gives:

There are as many sets of function coefficients as there are categories.
To classify an observation, substitute it into the three equations; the results are Y1, Y2 and Y3.
Then the probability of belonging to the first category is

The probabilities of belonging to the other categories are calculated in the same way. The observation is assigned to the category with the highest probability.
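The probability step can be sketched in a few lines. This is a minimal sketch, assuming the usual exponential form P(category i) = exp(Yi) / (exp(Y1) + exp(Y2) + exp(Y3)) for classification function values; the function name posterior_from_scores and the score values are made up for illustration.

```python
import math


def posterior_from_scores(scores):
    """Turn classification function values Y1..Yk into category probabilities."""
    m = max(scores)  # subtract the maximum so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values of Y1, Y2, Y3 for one observation
probs = posterior_from_scores([12.3, 10.1, 9.8])
predicted = probs.index(max(probs)) + 1  # 1-based category number
```

Because the exponential is increasing, the category with the largest Y value is always the one with the highest probability, so the probabilities matter mainly when you want to know how clear the decision is.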

If Unstandardized is selected, the unstandardized coefficients of Fisher discrimination are given, because the standardized coefficients are given by default.
Fisher discrimination uses the distance to each category centroid, calculated from the discriminant functions, to decide which category an observation belongs to.
It's easy to confuse here, so write it down!

Tags: Python linear algebra

Posted on Fri, 03 Dec 2021 20:08:45 -0500 by mastercool