Friday, August 8, 2014

Prior for normality (df) parameter in t distribution

A routine way to describe outliers in metric data is with a heavy-tailed t distribution instead of with a normal distribution. The heaviness of the tails is governed by a normality parameter, ν, also called the df parameter. What is a reasonable prior for ν? One answer: A shifted exponential distribution, which is what I used in the "BEST" article and software. But the book (on pp. 432-433) seems to use a different and obscure prior distribution. The purpose of this post is to show that the book's prior on the normality parameter is, in fact, a shifted exponential.

The book's prior on ν, there called tdf, is
  tdf <- 1 - tdfGain*log(1-udf)
  udf ~ dunif(0,1)
where tdfGain is a constant. In other words, tdf is a strange logarithmic transformation of udf, and udf has a uniform prior. Well, that's the same thing as a shifted exponential on tdf, with tdfGain being its mean.

To prove that claim, we could do some math, as suggested by the method in Section 23.4 (p. 630) regarding reparameterization. But instead I'll show a large Monte Carlo sample of tdf from the book, and superimpose a shifted exponential density, and you can see that they are identical:

In each of the above graphs, the title indicates the value of tdfGain, labelled as nuGain. A large number of values, udf, were randomly generated from a uniform distribution, and then those values were transformed to tdf (=ν) values using the formula tdf <- 1 - tdfGain*log(1-udf). The histogram plots the resulting tdf (=ν) values. The title of each graph also indicates the mean of Monte Carlo sample, which should be nuGain+1 except for randomly sampling noise. Superimposed on the histogram is a exponential density with mean set to nuGain, and shifted 1 unit to the right (because ν must be > 1). Thus, you can see that the book's prior for tdf really is a shifted exponential, as used in the BEST article.

Appendix: Here is the code for generating the graphs.
for ( nuGain in c(3,30,100) ) {
  # Random sample of nu from book prior:
  nu = 1 - nuGain*log( 1 - runif(50000) )
  # Plot histogram:
  xMax = 5*nuGain
  histInfo = hist( nu , probability=TRUE ,
                   col="skyblue" , border="white" ,
                   xlab=bquote(nu) , xlim=c(0,xMax) ,
                   cex.lab=1.5 , cex.main = 1.5 ,
                   breaks=c(0,seq(1,xMax,length=30),1E10) ,
                   main=bquote( list( nuGain==.(nuGain) ,
                                      mean(nu)==.(round(mean(nu),2)) ) ) )
  # Superimpose shifted exponential curve with same mean-1:
  x = seq( 1 , xMax , length=201 )   # Make x-axis comb.
  lines( x , dexp( x-1 , 1/(nuGain) ) , lwd=2 )
  text( mean(nu) , dexp( mean(nu)-1 , 1/(nuGain ) ) ,
        bquote("curve: dexp( "*nu*"-1 , mean="*.(nuGain)*" )") ,
        adj=c(-0.1,-0.1) , cex=1.5 )


  1. This comment has been removed by the author.

  2. So is there a reason for why you use one version in the book and one version in the paper? Performance differences?

  3. Just a note: To save some computations, it's possible to replace log(1-udf) by log(udf), since 1-udf and udf are both distributed uniformly on [0,1].

    Not that it matters so much today...

  4. Why did the book do it the strange indirect way? I have no idea. Past self did not leave enough bread crumbs for present self to follow. Maybe some motivation can be constructed post hoc.

  5. Oh, don't get me started on Past self. He is this really irritating guy who makes these strange decisions, never does the dishes and never comments his code...

  6. Hi John,

    I would you to know your opinion on these (and may be other) priors for the normality parameter. And if you still think that a shifted-exponential with mean 30 is a good choice (as you recommend in the second edition of your book).

    * gamma(2, 0.1) (i.e mean=20)

    * nu.y <- 1/nu0
    nu0 ~ dunif(0,.5) (i.e uniform between 2 and inf)

    It seems that some people would like to avoid nu < 1, while others nu < 2 and other do not have problems with values below 1.

    Theoretically I guess that avoiding values below 1 is related to the indetermination on the mean, and avoiding values below 2 because the indetermination on the variance.

    From a computational point of view (at least on PyMC3) I have found that is better to avoid values below 1, otherwise the sampler keeps jumping to values far away the posterior mean and the sapling process get really slow (using Metropolis or NUTS), I guess this is the results of allowing "ridiculously· fat tails when nu < 1.

    Thanks in advance.

  7. Yes, I still think that a shifted-exponential prior with mean 30 is a good choice for practical applications.

    Nu (the df parameter in a t distribution) is not defined below 1, so whatever prior you choose, it should now allow values below 1.

    There was some discussion on Gelman's blog about appropriate priors for the df parameter in a t distribution, but now I can't find the thread. Maybe you can. Basically, it's about an article that shows the exponential prior works reasonably well, although it does shrink extreme df values toward the mean a little bit.

  8. Hi,

    Thanks for your answer. I found the Gelman's post you mention.

    Why do you say that the df parameter in a t distribution is not defined below 1? Is not defined in the sense that is not a "degree of freedom". The post start with the suggestion to use the prior gamma(2, 0.1), this will give you values below 1, right?. Also in the Gelman's post there is a comment that states "It is unnecessary to allow nu to be larger than 30 or 50, but it is very important to allow nu to be smaller than 1." A very surprising comment!

  9. Hmmm... Past self had learned that nu ranged from 1 to infinity (not from 0 to infinity). That "knowledge" found its way into both editions of the book. After all these years, you are the first person to question the restriction! Present self recognizes that past self was just plain wrong.

    The fact that the error has (apparently) made virtually no difference to any practical applications reflects the fact that most applications focus on the mu and sigma parameters, not the nu parameter, AND in most applications the kurtosis is moderate not extreme. Therefore the bias in the estimate of nu is (a) mild and (b) not very consequential for the estimate of mu and sigma.

    I will go through all the programs and modify them. Basically, any program that has file name "-Mrobust" will need some attention. I want to be sure that the MCMC samples won't break if nu is ever probed at zero or values very close to zero. The only other concern is that the plots of the posterior use log10(nu), which will break if nu is actually zero (or maybe very close to zero). Hopefully I can do this this week...

    Thanks for bringing this to my attention.

    (By the way, the comment about "very important to be smaller than 1" is strange to me; it would depend very much on the how extreme the non-normality of the data actually is. I would think that it's very important to allow nu<1 only if the data have extreme non-normality that can only be adequately captured with nu<1. I suspect that this sort of data is extremely rare, but of course that depends on the domain of application.)

  10. May be the whole nu < 1 think is a frequentist contamination!!! :-)

    Probably restricting nu > 1 is a good idea, on the one hand I think is easier on the sampler and on the other as you say in general we are interested on mu and sigma parameters, not the nu parameter.

    I hope the guy on the other post agrees to expand his comment.

  11. This comment has been removed by the author.