Feature selection using chi squared for continuous features

by Jondiedoop   Last Updated October 03, 2018 15:19 PM

I'm looking at univariate feature selection. A method that is often described is to look at the p-values of a $\chi^2$-test. However, I'm confused as to how this works for continuous variables.

1. How can the $\chi^2$-test work for feature selection for continuous variables? I have always learned that this test works for counts. It appears to me that you have to bin in some way, but the outcome depends on the binning you choose. I'm also interested in how this works for a combination of continuous and categorical variables. If so many respected people are doing this, I don't believe it to be wrong, but I'm not sure how it is supposed to work, either.

2. Is it a problem that this test is scale-dependent? This is not an issue for counts, which are dimensionless, but it can have a great impact on feature selection for continuous variables; see the example below.
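Regarding question 1, one way to make the test well-defined for a continuous feature is to bin it explicitly and run an ordinary contingency-table test. A minimal sketch of that idea (the quartile binning here is my own arbitrary choice, and a different binning can give a different p-value):

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 2]  # petal length, a continuous feature
y = iris.target

# Bin the continuous feature into quartiles (arbitrary choice).
edges = np.quantile(x, [0.25, 0.5, 0.75])
binned = np.digitize(x, edges)  # bin index 0..3 per sample

# Contingency table of bin vs. class, then the usual chi-squared test.
table = np.zeros((4, 3))
for b, c in zip(binned, y):
    table[b, c] += 1
stat, p, dof, expected = chi2_contingency(table)
print(p)  # tiny: petal length is strongly associated with the species
```

Note that this version is scale-invariant: multiplying x by 10 rescales the quantile edges along with the data, so every sample lands in the same bin.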

Example: the test is scale-dependent
As an example, let's look at the snippet from the scikit-learn documentation: http://scikit-learn.org/stable/modules/feature_selection.html

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
selector = SelectKBest(chi2, k=2)  # keep the 2 highest-scoring features
selector.fit(X, y)
print(selector.pvalues_)
print(selector.get_support())

Output:

[4.47e-03 1.657e-01 5.94e-26 2.50e-15]
[False False True True]

Now let's imagine we had recorded the first and third column not in cm, but in mm. Obviously, this doesn't change the dependence of the class type on sepal and petal length. However, the p-values change strongly, and accordingly, the selected columns change:

X[:, 0] = 10*X[:, 0]
X[:, 2] = 10*X[:, 2]
selector.fit(X, y)
print(selector.pvalues_)
print(selector.get_support())

Output:

[3.23e-024 1.66e-001 5.50e-253 2.50e-015]
[True False True False]

If I had also recorded the 2nd column in mm instead of cm, that would also have given me a significant p-value.
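To verify that claim, here is a quick check of my own that rescales only the second column and calls the chi2 scorer directly:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target

X[:, 1] = 10 * X[:, 1]  # record sepal width in mm instead of cm
stats, pvals = chi2(X, y)
print(pvals[1])  # now far below 0.05, although the data is unchanged
```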

I believe this has to do with the fact that the method does not do any binning, but instead sums all feature values per class and compares those sums to their expected values. Additionally, because the numerator of the $\chi^2$ statistic is squared while the denominator is not, the statistic scales linearly with the feature, which adds to the problem.
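To make that concrete, here is my understanding of what sklearn's chi2 computes for a single feature, written out by hand (a sketch, not the actual implementation): the observed values are the per-class sums of the raw feature, and the expected values are the class frequencies times the total sum. Multiplying the feature by 10 multiplies both observed and expected by 10, so the squared numerator grows by 100 while the denominator grows by 10, and the statistic grows by 10:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

def chi2_stat(x, y):
    # Observed: sum of the feature values within each class (no binning).
    # Expected: class frequency times the total sum of the feature.
    classes = np.unique(y)
    observed = np.array([x[y == c].sum() for c in classes])
    class_prob = np.array([(y == c).mean() for c in classes])
    expected = class_prob * x.sum()
    return ((observed - expected) ** 2 / expected).sum()

s1 = chi2_stat(X[:, 0], y)
s2 = chi2_stat(10 * X[:, 0], y)
print(s2 / s1)  # 10: the statistic scales linearly with the feature
```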


