Friday, March 21, 2014

Statistics based on resampling in Python: bootstrapping and permutation tests

Resampling techniques in statistics are appealing because they rely on few assumptions about the underlying distribution of one's data. Unfortunately, Python does not seem to have a simple library implementing the more common approaches. I set out to write my own routines for a few of these, based mostly on Wikipedia's article. My idea was that it would be useful to be able to construct a confidence interval for any metric (e.g. np.mean or np.std) of the values in an array along any axis, ignoring NaNs and masked values. The method should also accept as parameters the alpha of the confidence interval and the number of resampling iterations to compute. What I came up with is available on my GitHub site:
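Independently of that code, the basic idea of a percentile bootstrap can be sketched as follows. This is only a minimal illustration, not the routine from the repository: the function name `bootstrap_ci`, its parameter names, and the choice of the simple percentile method are all assumptions made for the example, and it assumes a NumPy version recent enough to provide `np.random.default_rng`. Masked values are converted to NaN, so a NaN-aware metric such as np.nanmean or np.nanstd should be passed in.

```python
import numpy as np


def bootstrap_ci(data, metric=np.nanmean, axis=0, alpha=0.05, n_iter=2000, seed=None):
    """Percentile bootstrap confidence interval of `metric` along `axis`.

    Hypothetical helper for illustration; `metric` must accept an `axis`
    keyword and should ignore NaNs (e.g. np.nanmean, np.nanstd).
    """
    rng = np.random.default_rng(seed)
    # Turn masked entries into NaN so that NaN-aware metrics skip them.
    data = np.ma.filled(np.ma.asarray(data, dtype=float), np.nan)
    # Move the resampling axis to the front to simplify indexing.
    data = np.moveaxis(data, axis, 0)
    n = data.shape[0]
    stats = np.empty((n_iter,) + data.shape[1:])
    for i in range(n_iter):
        # Draw n observations with replacement along the resampling axis.
        idx = rng.integers(0, n, size=n)
        stats[i] = metric(data[idx], axis=0)
    # The interval is read off the empirical distribution of the statistic.
    lower = np.nanpercentile(stats, 100 * alpha / 2, axis=0)
    upper = np.nanpercentile(stats, 100 * (1 - alpha / 2), axis=0)
    return lower, upper
```

For a 2-D array, `bootstrap_ci(x, metric=np.nanstd, axis=0, alpha=0.05)` would return one lower and one upper bound per column. The plain percentile interval is the simplest variant; bias-corrected versions exist but are not shown here.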

After completing most of this, I found a useful site describing the basics of computing bootstrap confidence intervals (along with other things) here. I edited my code to be more directly comparable to the snippets there. I have also implemented a similar permutation test to compare any 'metric' (again, e.g. np.mean) of two groups:

This time I intentionally borrowed heavily from Cliburn Chan. (Note, however, that I believe there is an error on that page in the 'permutation_resampling' method, which assumes one wants to compute the mean of the groups.)
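To make the permutation idea concrete, here is a rough sketch of a two-sided test in which the statistic is passed in as `metric` rather than fixed to the mean. It is not the code from my repository or from Cliburn Chan's page: the name `permutation_test`, its parameters, and the +1 correction in the p-value are assumptions chosen for the example.

```python
import numpy as np


def permutation_test(a, b, metric=np.nanmean, n_iter=10000, seed=None):
    """Two-sided permutation test on metric(a) - metric(b).

    Hypothetical helper for illustration; `a` and `b` are 1-D samples.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled = np.concatenate([a, b])
    observed = metric(a) - metric(b)
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        # Shuffle the group labels by permuting the pooled values.
        shuffled = rng.permutation(pooled)
        diffs[i] = metric(shuffled[:a.size]) - metric(shuffled[a.size:])
    # Fraction of relabelings at least as extreme as the observed difference;
    # the +1 counts the observed labeling itself and avoids p = 0.
    p_value = (np.sum(np.abs(diffs) >= abs(observed)) + 1) / (n_iter + 1)
    return observed, p_value
```

Calling `permutation_test(group1, group2, metric=np.nanstd)` would then compare the spread of the two groups instead of their means.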