You may wonder why some approaches, such as logistic regression scorecard modeling, use binning. Understanding the advantages of discretization answers that question.
Discretization: the process of converting quantitative data into qualitative data.
Some data mining algorithms accept only categorical attributes (e.g. LVF, FINCO, naive Bayes).
When the data contains only quantitative features, the learning process of such algorithms is usually inefficient and ineffective.
ChiMerge Algorithm
This discretization method works bottom-up, by merging adjacent intervals.
The guiding principle: relative class frequencies should be fairly consistent within an interval (otherwise the interval should be split).
χ² is a statistical measure used to test the hypothesis that two discrete attributes are statistically independent.
For two adjacent intervals, if the χ² test concludes that the class is independent of the interval, the two intervals should be merged. If the χ² test concludes that they are not independent, i.e. the difference in relative class frequencies is statistically significant, the two intervals should remain separate.
(Figure: chi-square distribution)
Calculating χ²
The procedure is as follows:
1. Compute the χ² value for each pair of adjacent intervals.
2. Merge the adjacent pair with the lowest χ² value.
3. Repeat the above steps until the χ² value of every adjacent pair exceeds a threshold.
Threshold: determined by the significance level and the degrees of freedom = number of classes - 1.
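The χ² value in step 1 can be computed from the per-class counts of the two adjacent intervals. A minimal sketch (the helper name `pair_chi2` is my own, not from the original post):

```python
# chi-square statistic for a pair of adjacent intervals, as used by ChiMerge.
# `counts` is a 2 x k array: one row per interval, one column per class.
import numpy as np

def pair_chi2(counts):
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)   # interval totals
    col_sums = counts.sum(axis=0, keepdims=True)   # class totals
    expected = row_sums * col_sums / total         # E_ij = R_i * C_j / N
    # Convention: cells with expected count 0 contribute 0 to the statistic.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (counts - expected) ** 2 / expected, 0.0)
    return terms.sum()

print(pair_chi2([[2, 0], [0, 2]]))  # very different class mix -> 4.0
print(pair_chi2([[2, 2], [1, 1]]))  # identical class mix -> 0.0
```

The threshold itself can be looked up with `scipy.stats.chi2.ppf(1 - alpha, df)`, with df = number of classes - 1 as stated above.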
Example of ChiMerge chi-square binning
ChiMerge works as follows. Initially, each distinct value of the numerical attribute A is treated as its own interval. A χ² test is performed for each pair of adjacent intervals. The adjacent intervals with the smallest χ² value are merged, because a low χ² value indicates similar class distributions. The merging process repeats until the predefined stopping criteria are met.
- Step 1
Split the dataset into two datasets and compute the values for each:
Dataset 1 → X, class
Dataset 2 → Y, class
ChiMerge chi-square binning implemented in Python
Let's use the iris dataset and try to implement the ChiMerge process.
Python's SciPy provides chi-square test functions:
```python
# WeChat official account: python risk control model
import numpy as np
import scipy.stats as stats

data = np.array([[43, 9],
                 [44, 4]])
chi2_val, p, dof, expected = stats.chi2_contingency(data)
print(p)
```
More chi-square test code:
```python
# -*- coding: utf-8 -*-
'''
Tencent Cloud Classroom: Python financial risk control scorecard model and data analysis:
https://edu.csdn.net/combo/detail/1927
WeChat official account: python risk control model
'''
'''
Chi-square formula: (o - e)^2 / e
Both the observed and the expected counts should be at least 5.
o (observed) is the observed count, e (expected) the expected count:
square (o - e), then divide by the expected count e.
'''
import numpy as np
from scipy.stats import chisquare

list_observe = [30, 14, 34, 45, 57, 20]
list_expect = [20, 20, 30, 40, 60, 30]

statistic, p = chisquare(f_obs=list_observe, f_exp=list_expect)
print(statistic, p)

# p may be NaN in degenerate cases
if np.isnan(p) or p > 0.05:
    print("H0 wins: there is no difference")
else:
    print("H1 wins: there is a difference")
```
The following version does not require computing expected values, which is more convenient. It applies to a 2 × 2 contingency table, with degrees of freedom = 1 and a critical value of 2.706 (at the 0.10 significance level).
```python
# -*- coding: utf-8 -*-
'''
Tencent Cloud Classroom: Python financial risk control scorecard model and data analysis:
https://edu.csdn.net/combo/detail/1927
WeChat official account: python risk control model
'''
# The test for independence is also a chi-square test.
# Precondition: a, b, c, d must all be at least 5.
# 2.706 is the critical value (90% confidence): the larger the statistic,
# the stronger the association; the smaller, the weaker.

def value_independence(a, b, c, d):
    if a >= 5 and b >= 5 and c >= 5 and d >= 5:
        return ((a + b + c + d) * (a * d - b * c) ** 2) / float(
            (a + b) * (c + d) * (a + c) * (b + d))

def judge_independence(num_independence):
    # Returns True if related, False if not.
    if num_independence > 2.706:
        print("there is a relationship")
        return True
    else:
        print("there is no relationship")
        return False

a, b, c, d = 34, 38, 28, 50
chi_square = value_independence(a, b, c, d)
relation = judge_independence(chi_square)
```
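The post sets out to run ChiMerge on the iris data but stops at the chi-square tests themselves. Tying the steps together, a minimal end-to-end sketch might look like the following (assumes scikit-learn is installed; the function names `chimerge` and `pair_chi2` are my own, and the threshold follows the rule above: significance level 0.10 with df = number of classes - 1):

```python
# Minimal ChiMerge sketch: merge adjacent intervals until every adjacent
# pair's chi-square statistic exceeds the critical value.
import numpy as np
from scipy.stats import chi2
from sklearn.datasets import load_iris

def pair_chi2(counts):
    # chi-square statistic for two adjacent intervals (rows) over k classes.
    counts = np.asarray(counts, dtype=float)
    expected = (counts.sum(axis=1, keepdims=True)
                * counts.sum(axis=0, keepdims=True) / counts.sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (counts - expected) ** 2 / expected, 0.0)
    return terms.sum()

def chimerge(x, y, p=0.10):
    classes = np.unique(y)
    threshold = chi2.ppf(1 - p, df=len(classes) - 1)
    # Start with one interval per distinct value; count classes per interval.
    values = np.unique(x)
    counts = np.array([[np.sum((x == v) & (y == c)) for c in classes]
                       for v in values], dtype=float)
    bounds = list(values)                      # left edges of the intervals
    while len(bounds) > 1:
        chi2s = [pair_chi2(counts[i:i + 2]) for i in range(len(bounds) - 1)]
        i = int(np.argmin(chi2s))
        if chi2s[i] > threshold:               # all pairs differ significantly
            break
        counts[i] += counts[i + 1]             # merge the most similar pair
        counts = np.delete(counts, i + 1, axis=0)
        del bounds[i + 1]
    return bounds

iris = load_iris()
for j, name in enumerate(iris.feature_names):
    print(name, "->", chimerge(iris.data[:, j], iris.target))
```

This is a sketch for illustration, not a production implementation: ChiMerge variants also impose a minimum or maximum number of intervals, and for credit scorecards the resulting bins are usually inspected by hand afterwards.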
That concludes this introduction to chi-square binning for data discretization. It should be emphasized that not all algorithms need discretization. Tree ensemble algorithms such as XGBoost, LightGBM and CatBoost are currently very popular, and they do not require it. I cannot say which method is best: across multiple project tests, different data distributions led to different conclusions. Don't blindly trust theory; run more experiments and think for yourself.
Welcome to sign up for the <Python financial risk control scorecard model and data analysis> micro-professional course to learn more about this topic.
Copyright notice: this article comes from the official account (python risk control model). Plagiarism without permission is not allowed. It follows the CC 4.0 BY-SA copyright agreement; please attach the original source link and this statement when reposting.