Using chi-square binning for data discretization: a Python implementation

We often wonder why some modeling approaches, such as logistic regression scorecard modeling, use binning. Understanding the advantages of discretization provides the answer.

Discretization: the process of converting quantitative data into qualitative data.

  • Some data mining algorithms only accept categorical attributes (LVF, FINCO, naive Bayes).

  • When the data has only quantitative features, the learning process is usually inefficient and ineffective.

ChiMerge Algorithm

  • This discretization method uses a bottom-up merging approach.

  • The relative class frequencies should be fairly consistent within an interval (otherwise the interval should be split).

  • χ² is a statistical measure used to test the hypothesis that two discrete attributes are statistically independent.

  • For two adjacent intervals, if the χ² test concludes that the class is independent of the interval, the two intervals should be merged. If the χ² test concludes that they are not independent, i.e. the difference in relative class frequencies is statistically significant, the two intervals should remain separate.

(Figure: chi-square distribution)

Contingency table

Calculating χ²

The χ² value can be calculated as follows:

χ² = Σ_{i=1..2} Σ_{j=1..k} (A_ij − E_ij)² / E_ij

where k is the number of classes, A_ij is the number of examples in the i-th interval belonging to the j-th class, and E_ij = R_i × C_j / N is the expected frequency of A_ij, with R_i the number of examples in the i-th interval, C_j the number of examples of the j-th class, and N the total number of examples.
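
As a worked illustration of this formula (a sketch of my own, reusing the 2 × 2 table that also appears in the SciPy example later in this article), the expected frequencies and the χ² value can be computed directly with NumPy. Note that scipy.stats.chi2_contingency applies a continuity correction to 2 × 2 tables by default, so its statistic will differ slightly:

import numpy as np

# observed counts: rows = two adjacent intervals, columns = two classes
A = np.array([[43.0, 9.0],
              [44.0, 4.0]])

R = A.sum(axis=1, keepdims=True)   # row totals R_i (examples per interval)
C = A.sum(axis=0, keepdims=True)   # column totals C_j (examples per class)
N = A.sum()                        # total number of examples

E = R * C / N                      # expected frequencies E_ij under independence
chi2 = ((A - E) ** 2 / E).sum()
print(chi2)                        # about 1.78; below 2.706, so merge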

Computational procedure

  • Compute the χ² value of each pair of adjacent intervals.

  • Merge the adjacent pair with the lowest χ² value.

  • Repeat the above steps until the χ² values of all adjacent pairs exceed the threshold.

  • Threshold: determined by the significance level and the degrees of freedom = number of classes − 1 (a minimal sketch of this procedure follows the list).
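
Here is a minimal sketch of that procedure (my own illustration, not code from the original article; the names chimerge and chi2_of_pair are invented for this example):

import numpy as np

def chi2_of_pair(counts_a, counts_b):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    A = np.array([counts_a, counts_b], dtype=float)
    R = A.sum(axis=1, keepdims=True)          # interval totals
    C = A.sum(axis=0, keepdims=True)          # class totals
    E = R * C / A.sum()                       # expected frequencies
    E[E == 0] = 1e-9                          # avoid division by zero
    return ((A - E) ** 2 / E).sum()

def chimerge(values, labels, threshold):
    classes = sorted(set(labels))
    # start with one interval per distinct value, holding per-class counts
    intervals = [[v, [sum(1 for x, y in zip(values, labels) if x == v and y == c)
                      for c in classes]]
                 for v in sorted(set(values))]
    while len(intervals) > 1:
        chi2s = [chi2_of_pair(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = int(np.argmin(chi2s))
        if chi2s[i] > threshold:              # every pair exceeds the threshold: stop
            break
        # merge the adjacent pair with the lowest chi-square value
        intervals[i][1] = [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])]
        del intervals[i + 1]
    return [iv[0] for iv in intervals]        # left endpoints of the final intervals

# toy usage: 2.706 is the 90% critical value for df = 1 (two classes)
print(chimerge([1, 2, 3, 4, 5, 6, 7, 8],
               ["a", "a", "a", "b", "b", "b", "a", "a"], 2.706))

The sketch stops as soon as the smallest adjacent χ² exceeds the threshold, which is exactly the stopping criterion described above.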

Example of ChiMerge chi-square binning

ChiMerge proceeds as follows. Initially, each distinct value of the numerical attribute A is considered a separate interval. A χ² test is performed on each pair of adjacent intervals. The adjacent intervals with the lowest χ² value are merged, because a low χ² value indicates similar class distributions. This merging process is repeated until the predefined stopping criterion is met.

  • Step 1

Split the dataset into two datasets, one per numeric feature, keeping the class column with each, and compute the χ² values for each feature separately (a toy illustration follows this step).

Dataset 1 → X, class

Dataset 2 → Y, class
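
A hedged illustration of this split, assuming a toy table whose column names (X, Y, class) are made up for this example:

import pandas as pd

df = pd.DataFrame({
    "X":     [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
    "Y":     [10,  12,  11,  15,  14,  13],
    "class": ["a", "a", "b", "b", "a", "b"],
})

dataset1 = df[["X", "class"]]   # Dataset 1 -> X, class
dataset2 = df[["Y", "class"]]   # Dataset 2 -> Y, class
print(dataset1.sort_values("X"))

Each feature is then binned independently against the class column.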

ChiMerge chi-square binning implemented in Python

Let's use the Iris dataset and try to implement the ChiMerge process.

Python's SciPy package contains chi-square test functions:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

import numpy as np
import scipy.stats as stats

data = np.array([[43, 9],
                 [44, 4]])
V, p, dof, expected = stats.chi2_contingency(data)
print(p)
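
Building on chi2_contingency, here is a hedged sketch of how such a test could be applied to two adjacent intervals of an Iris feature (the interval boundaries 4.3 / 4.9 / 5.5 are arbitrary choices for illustration, and scikit-learn's load_iris is assumed to be available):

import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 0]          # sepal length (cm)
y = iris.target              # class labels: 0, 1, 2

# two adjacent candidate intervals over sepal length
left = (x >= 4.3) & (x < 4.9)
right = (x >= 4.9) & (x < 5.5)

# 2 x 3 contingency table: class counts within each interval
table = np.array([np.bincount(y[left], minlength=3),
                  np.bincount(y[right], minlength=3)])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)   # a high p-value suggests similar class distributions -> merge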

More chi-square test code

# -*- coding: utf-8 -*-
'''
Chi-square formula: (o - e)^2 / e
o (observed) is the observed count and e (expected) is the expected count;
square (o - e), then divide by the expected count e.
Both the observed and the expected counts should be at least 5.
'''

import numpy as np
from scipy.stats import chisquare

list_observe = [30, 14, 34, 45, 57, 20]
list_expect = [20, 20, 30, 40, 60, 30]

result = chisquare(f_obs=list_observe, f_exp=list_expect)
print(result)
p = result[1]

# the p-value may come back as NaN for degenerate input
if p > 0.05 or np.isnan(p):
    print("H0 wins: there is no difference")
else:
    print("H1 wins: there is a difference")

The following version does not require the expected frequencies to be supplied, which is more convenient. It applies to a 2 × 2 contingency table, with degrees of freedom = 1 and critical value = 2.706.

# -*- coding: utf-8 -*-
# The test for independence is also a chi-square test.
# Precondition: a, b, c, d must each be at least 5.

# 2.706 is the critical value (90% confidence level, df = 1).
# The larger the statistic, the stronger the dependence;
# the smaller, the weaker.
def value_independence(a, b, c, d):
    if a >= 5 and b >= 5 and c >= 5 and d >= 5:
        return ((a + b + c + d) * (a * d - b * c) ** 2) / float((a + b) * (c + d) * (a + c) * (b + d))
    return None

# Returns True to indicate the attributes are related,
# False to indicate they are independent.
def judge_independence(num_independence):
    if num_independence is not None and num_independence > 2.706:
        print("there is relationship")
        return True
    else:
        print("there is no relationship")
        return False

a = 34
b = 38
c = 28
d = 50
chi_square = value_independence(a, b, c, d)
relation = judge_independence(chi_square)
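
For the sample values above (a = 34, b = 38, c = 28, d = 50), the statistic works out to 150 × (34×50 − 38×28)² / (72 × 78 × 62 × 88) ≈ 1.98, which is below 2.706, so judge_independence prints "there is no relationship".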

That concludes the introduction to chi-square binning for data discretization. It should be emphasized that not all algorithms need discretized data. Ensemble tree algorithms such as XGBoost, LightGBM and CatBoost are very popular at the moment, and they do not require discretization. I cannot say which method is best: across multiple project tests, different data distributions have led to different conclusions. Don't blindly trust theory; run more tests and think for yourself.

Welcome to sign up for the "Python financial risk control scorecard model and data analysis" micro-professional course to learn more about this topic:
https://edu.csdn.net/combo/detail/1927

Copyright notice: this article comes from the WeChat official account "Python risk control model". No plagiarism is allowed without permission. It follows the CC 4.0 BY-SA copyright agreement; please attach the original source link and this statement when reprinting.

