Monday, April 5, 2010

Automatic Parameterization of Scipy's Random Number Generators

I am working on a new version of Model-Builder, which will incorporate Uncertainty analysis features based on my BIP package.

One key aspect of doing uncertainty analysis, is the ability to attribute prior distributions to the model's parameters, and programatically, this is easily accomplished through the use of scipy.stats random number generators which provide a large variety of both continuous and discrete distributions. However, since Model-Builder is a GUI-based software, I need to offer a unified interface for the parameterization of such distribution.

This should not be as hard as it seems because in the stats package, all distributions are either a subclass of stats.rv_continuous or stats.rv_discrete, and thus [almost] all of them can be parameterized by calling the class constructor with a tuple of shape parameters, the length of which will vary from distribution to distribution but is available as the attribute numargs. This is exemplified in the doc string of all the distributions as the following pseudo-doctest:
>>> import matplotlib.pyplot as plt
>>> numargs = norm.numargs
>>> [ ] = [0.9,]*numargs
>>> rv = norm()

Display frozen pdf

>>> x = np.linspace(0,np.minimum(rv.dist.b,3))
>>> h=plt.plot(x,rv.pdf(x))
So I set out to test if this would actually work for all generators. If it worked for at least the most important ones, it would mean I could just have the user provide me with a -tuple of numbers, and pass them to whichever distribution the user chose. To run this test I came up with the following script:

# -*- coding: utf-8 -*-
import pylab as P
import numpy as np
f1 = P.figure()
f2 = P.figure()
a1 = f1.add_subplot(111)
a1.set_title("Continuous Distributions")
a2 = f2.add_subplot(111)
a2.set_title("Discrete Distributions")
for d in dir(st):
obj = getattr(st, d)
if isinstance(obj, st.rv_continuous):
numargs = obj.numargs
shape_pars = [0.9]*numargs
rv = obj(shape_pars)
x = np.linspace(0,np.minimum(rv.dist.b,3))
print "+++>",
print "--->",
if isinstance(obj, st.rv_discrete):
numargs = obj.numargs
shape_pars = [0.9]*numargs
rv = obj(shape_pars)
x = np.linspace(np.maximum(-3, rv.dist.a),np.minimum(rv.dist.b,3))
print "+++>",
print "--->",

When I ran this simple script, I not got the plots of each probability distribution that conformed with the parameterization via the shape tuple, but also a report which did (+++>) or did not(--->) conform. Here are the results:

+++> alpha
---> anglit
---> arcsine
+++> bernoulli
---> beta
---> betaprime
---> binom
---> boltzmann
+++> bradford
---> burr
---> cauchy
+++> chi
+++> chi2
---> cosine
+++> dgamma
+++> dlaplace
+++> dweibull
+++> erlang
---> expon
+++> exponpow
---> exponweib
---> f
+++> fatiguelife
+++> fink
+++> foldcauchy
+++> foldnorm
+++> frechet_l
+++> frechet_r
+++> gamma
---> gausshyper
---> genexpon
+++> genextreme
---> gengamma
+++> genhalflogistic
+++> genlogistic
+++> genpareto
+++> geom
---> gilbrat
+++> gompertz
---> gumbel_l
---> gumbel_r
---> halfcauchy
---> halflogistic
---> halfnorm
---> hypergeom
---> hypsecant
+++> invgamma
+++> invnorm
+++> invweibull
---> johnsonb
---> johnsonsu
+++> ksone
---> kstwobign
---> laplace
---> levy
---> levy_l
---> levy_stable
+++> loggamma
---> logistic
+++> loglaplace
+++> lognorm
+++> logser
+++> lomax
---> maxwell
---> mielke
+++> nakagami
---> nbinom
---> ncf
---> nct
---> ncx2
---> norm
+++> pareto
+++> planck
+++> poisson
+++> powerlaw
---> powerlognorm
+++> powernorm
---> randint
---> rayleigh
+++> rdist
+++> recipinvgauss
---> reciprocal
+++> rice
---> semicircular
+++> t
+++> triang
+++> truncexpon
---> truncnorm
+++> tukeylambda
---> uniform
+++> vonmises
---> wald
+++> weibull_max
+++> weibull_min
+++> wrapcauchy
+++> zipf

From the list above, it is clear that although the majority passed the test, many important ones did not (n, beta, uniform, etc.). This don't mean those cannot be used or won't work, but only that they cannot be parameterized in the same way as the rest. The norm distribution, for instance, has numargs = 0, (WTF?)

Although initially I was pretty excited about the possibility of handling every probability distribution in the same standard way, in the end I was disappointed that the Scipy people, after having come up with a such nice design for the stats package, did not think about running their own doc-tests (or making them runnable in the first place).

I am still looking for a solution to this. If you have an idea of how to elegantly work around this issue, please comment below.
Reblog this post [with Zemanta]


Stefan van der Walt said...

We certainly need more eyes on this code and your comments are much appreciated! Currently, most of the statistics work is being done by Josef Perktold, who will be glad to learn of your experiments. I hope you'll raise these issues on the SciPy mailing list so that we may address them further.

Flavio Coelho said...

@Stefan: I am glad my little experiment has help to draw attention to these issues. I'll be sure to blog more often about rough corners of the scipy code when I run into them.

joep said...

The way you have it set up, it only works when numargs is exactly equal to one.
The distributions don't take a tuple
rv = obj(shape_pars)

but individual arguments
rv = obj(*shape_pars)

as parameters.

For automatic parameter choices across distributions, we also need to take into account that many of the parameters have additional restrictions, e.g. bounds, some have to be integers.

The scipy.stats test suite and my original fuzz testing script has versions of parameterizations that work across all distributions. But, I had to special case many of them.

The standard normal distribution doesn't have a shape parameter so numarg=0. There are two extra parameters, loc and scale, that also change the distribution.

Also, the docstrings are generic and autogenerated for 90 distributions and don't necessarily work as written for individual distributions. There is work in progress to improve this.

Flavio Coelho said...


You are right. That was a quick and dirty test based on the docstrings. I have since improved in away similar to the one you point out.

Some distribution do take a tuple of parameters, or so say their docstrings, nevertheless, unpacking the tuple of arguments (rv(*shape_pars)), works for some but not all of them as you point out.

I think that in any case, all distributions, take number of arguments, be it 1 or more. So all of them should be parameterized by unpacking a tuple of arguments of the right length. If extra arguments could be silently ignored, It would be easier to generate automated doctests for all.

In terms of my application, I resorted to expecting the user to provide a tuple with the right number of arguments, handling the potential errors and asking the user to try again.