How can I test if a permutation hypothesis test produces valid results?

by Rodrigo Borges   Last Updated October 03, 2018 13:19 PM

I'm using Join Count statistics to check whether a categorical feature is spatially autocorrelated. I would like to check whether the distribution behind the pseudo p-value it returns is "valid".

It is a pseudo p-value because it assumes normality in the distribution, which is not always the case. Since the distribution is the result of several (usually 999) permutations of the data, it can take other forms that are not Gaussian.

My unfounded assumption (and the reason for this question) is that I can't consider the p-value valid if the distribution is uniform or consists of a single value (both cases happen when the analysed variable occurs rarely).

Based on that, I started running a Shapiro-Wilk test on the distributions resulting from the analysis. But when I looked graphically at the ones the test rejected, I realized that several are truncated normal distributions (which I would consider valid).

Since the permutations can produce all sorts of distributions, it seems reasonable to ask how to analyse them and which distributions can produce valid hypothesis tests.

How can I (and should I) test a distribution that produces a pseudo p-value?

Contextualization of Join Count statistics, if needed

I believe experience with the method I'm trying to use is not needed to help me, so I will try to contextualize it briefly.

The analysis looks at the original distribution of my data, counting how many elements neighbor an element with the same value (black-black connection, or BB) and how many neighbor an element with a different value (black-white connection, or BW). Then it permutes my data several times, doing the same counts each time.

A one-tailed hypothesis test is then performed using the distribution of permuted values and the measurement for the original data. The pseudo p-value for positive autocorrelation is the rank of the original BB count within the BB distribution; for negative autocorrelation, it is the rank of the original BW count within the BW distribution.

An example

Consider a binary variable that occurs on points that are distributed in space.

Original data distribution

I determine that the neighbors of a point are the K points nearest to it. BB and BW connections are counted based on points of value 1 and their neighbors. In this case, the BB count will be high and the BW count will be low (BW connections only happen in the border between orange and blue).
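The counting rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any particular library: the helper names `knn_neighbors` and `join_counts`, the toy coordinates, and the choice to count each connection once per point of value 1 (so a pair of mutual neighbors contributes twice) are all hypothetical.

```python
import math


def knn_neighbors(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort the other points by Euclidean distance to point i.
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors.append(others[:k])
    return neighbors


def join_counts(values, neighbors):
    """Count BB and BW connections, starting from every point of value 1."""
    bb = bw = 0
    for i, v in enumerate(values):
        if v != 1:
            continue  # only points of value 1 start a connection
        for j in neighbors[i]:
            if values[j] == 1:
                bb += 1  # neighbor has the same value
            else:
                bw += 1  # neighbor has a different value
    return bb, bw


# Hypothetical toy data: six points on a line, with the 1s clustered on the left.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
values = [1, 1, 1, 0, 0, 0]
neighbors = knn_neighbors(points, k=2)
print(join_counts(values, neighbors))  # (6, 0)
```

With the 1s clustered together, every neighbor of a 1 is also a 1, so the BB count is at its maximum and the BW count is zero, which mirrors the situation in the example.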

The method now permutes the variable values across the data, maintaining both the positions of the points and the proportion of 1 and 0 occurrences.

Random permutation of the data

New BB and BW counts are evaluated, resulting in close values, as their occurrences are the same.

Another permutation follows, then another count, until N (999) values are gathered. The BB counts are then sorted, and the position of the original BB count among them is the pseudo p-value of the test.
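The permutation loop and the ranking step can be sketched as follows. This is a minimal sketch, not the library's code: the function names, the seed, and the simplified statistic `adjacent_bb` (which counts adjacent pairs of 1s in a sequence instead of using spatial neighbors) are hypothetical. The formula (M + 1) / (N + 1), where M is the number of permuted values at least as extreme as the observed one, is one common convention for a pseudo p-value and is an assumption here; under it, a small value (rather than a value near 1) indicates an extreme observation.

```python
import random


def permutation_distribution(values, statistic, n_permutations=999, seed=0):
    """Shuffle the values repeatedly and collect the statistic each time."""
    rng = random.Random(seed)
    shuffled = list(values)  # copy, so the original order is untouched
    results = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # keeps the proportion of 1s and 0s fixed
        results.append(statistic(shuffled))
    return results


def pseudo_p_value(observed, permuted, tail="upper"):
    """Return (M + 1) / (N + 1) for a one-tailed permutation test."""
    if tail == "upper":
        m = sum(1 for s in permuted if s >= observed)
    else:
        m = sum(1 for s in permuted if s <= observed)
    return (m + 1) / (len(permuted) + 1)


def adjacent_bb(values):
    """Hypothetical stand-in statistic: adjacent pairs that are both 1."""
    return sum(1 for a, b in zip(values, values[1:]) if a == 1 and b == 1)


# Hypothetical data: ten 1s followed by ten 0s, i.e. strongly clustered.
data = [1] * 10 + [0] * 10
observed = adjacent_bb(data)
dist = permutation_distribution(data, adjacent_bb)
p = pseudo_p_value(observed, dist, tail="upper")
print(observed, p)
```

The calculation uses only the rank of the observed value among the permuted ones, so it can be applied to whatever shape the permutation distribution takes.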

In this case, the original BB count is the highest among all permutations, resulting in 0.999 confidence in rejecting the hypothesis of no positive autocorrelation. Meanwhile, the original BW count is the lowest, resulting in 0.000 confidence in rejecting the hypothesis of no negative autocorrelation.

Permutation distribution and original value

Therefore, a valid conclusion is that this variable has a positive spatial autocorrelation.


