DBDA2E. The color coding explains the correspondence of metaphor to math.
Sunday, March 5, 2017
Thursday, February 23, 2017
Upcoming multi-day workshops in doing Bayesian data analysis (2017):
- June 5 - 9. Stats Camp, Albuquerque, New Mexico (USA). Taught by Prof. John Kruschke.
- June 12 - 16. Global School for Empirical Research Methods, St. Gallen, Switzerland. Taught by Prof. Michael Kalish.
- June 20 - 23. Interuniversity Consortium for Political and Social Research (ICPSR), Ann Arbor, Michigan (USA). Taught by Prof. John Kruschke.
- Aug. 28 - Sept. 1. Global School for Empirical Research Methods, Ljubljana, Slovenia. Taught by Prof. Michael Kalish.
Sunday, February 19, 2017
Background: Suppose a researcher is interested in the Bayesian posterior distribution of a parameter, because the parameter is directly meaningful in the research domain. This occurs, for example, in psychometrics. Specifically, in item response theory (IRT; for details and an example of Bayesian IRT see this blog post), the data from many test questions (i.e., the items) and many people yield estimates of the difficulties \(\delta_i\) and discriminations \(\gamma_i\) of the items along with the abilities \(\alpha_p\) of the people. That is, the item difficulty is a parameter \(\delta_i\), and the analyst is specifically interested in the magnitude and uncertainty of each item's difficulty. The same is true for the other parameters, item discrimination and person ability. That is, the analyst is specifically interested in the magnitude and uncertainty of the discrimination \(\gamma_i\) for every item and of the ability \(\alpha_p\) for every person.
The question: How should the posterior distribution of a meaningful parameter be summarized? We want a number that represents the central tendency of the (posterior) distribution, and numbers that indicate the uncertainty of the distribution. There are two options I'm considering, one based on densities, the other based on percentiles.
Densities. One way of conveying a summary of the posterior distribution is in terms of densities. This seems to be the most intuitive summary, as it directly answers the natural questions from the researcher:
- Question: Based on the data, what is the most credible parameter value? Answer: The modal (highest density) value. For example, we ask: Based on the data, what is the most credible value for this item's difficulty \(\delta_i\)? Answer: The mode of the posterior is 64.5.
- Question: Based on the data, what is the range of the 95% (say) most credible values? Answer: The 95% highest density interval (HDI). For example, we ask: Based on the data, what is the range of the 95% most credible values of \(\delta_i\)? Answer: 51.5 to 75.6.
|An illustration from DBDA2E showing how highest-density intervals and equal-tailed intervals (based on percentiles) are not necessarily equivalent.|
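The difference between the two kinds of summary can be sketched numerically. The example below is a minimal illustration, assuming a skewed gamma density (shape 2, rate 1) as a stand-in for a posterior; the distribution, the grid, and the helper name `summarize_on_grid` are all hypothetical choices, not taken from DBDA2E.

```python
import math

def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution, used here as an illustrative skewed 'posterior'."""
    return (rate ** shape) * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)

def summarize_on_grid(xs, dens, mass=0.95):
    """Mode, HDI, median, and equal-tailed interval from a grid approximation."""
    total = sum(dens)
    probs = [d / total for d in dens]  # probability mass per grid point
    mode = xs[max(range(len(xs)), key=lambda i: probs[i])]

    # HDI: keep the highest-density grid points until they hold `mass`.
    order = sorted(range(len(xs)), key=lambda i: probs[i], reverse=True)
    kept, acc = [], 0.0
    for i in order:
        kept.append(i)
        acc += probs[i]
        if acc >= mass:
            break
    hdi = (xs[min(kept)], xs[max(kept)])  # a single interval only if unimodal

    def quantile(q):
        """Smallest grid value whose cumulative mass reaches q."""
        cum = 0.0
        for x, p in zip(xs, probs):
            cum += p
            if cum >= q:
                return x
        return xs[-1]

    tail = (1.0 - mass) / 2.0
    eti = (quantile(tail), quantile(1.0 - tail))
    return {"mode": mode, "hdi": hdi, "median": quantile(0.5), "eti": eti}

xs = [i * 0.01 for i in range(1, 3001)]        # grid over (0, 30]
dens = [gamma_pdf(x, 2.0, 1.0) for x in xs]
summary = summarize_on_grid(xs, dens)
print(summary)
```

For a skewed distribution like this one, the HDI is narrower than the equal-tailed interval and sits closer to the mode, while the median lies above the mode.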
Some pros and cons:
Density answers what the researcher wants to know: What is the most credible value of the parameter, and what is the range of the credible (i.e., high density) values? Those questions simply are not answered by percentiles. On the other hand, density is not invariant under non-linear (but monotonic) transformations of the parameters. By squeezing or stretching different regions of the parameter, the densities can change dramatically, but the percentiles simply map to the corresponding values on the transformed scale. This lack of transformation invariance is the key reason that analysts avoid using densities in abstract, generic models and derivations.
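The invariance point can be checked numerically. This is a minimal sketch, assuming a gamma(shape 2, scale 1) distribution as an illustrative posterior and a log transform as the monotonic re-scaling; both choices, and the helper `grid_mode`, are hypothetical. The density on the log scale follows from the change-of-variables formula, \(p_y(y) = p_x(e^y)\,e^y\).

```python
import math
import random
import statistics

random.seed(1)

# Illustrative skewed "posterior" draws (assumption: gamma, shape 2, scale 1).
x_draws = [random.gammavariate(2.0, 1.0) for _ in range(10001)]
y_draws = [math.log(x) for x in x_draws]   # same draws on the log scale

# Percentiles commute with a monotonic transform.
median_x = statistics.median(x_draws)
median_y = statistics.median(y_draws)
print(median_y, math.log(median_x))        # these agree

def grid_mode(density, lo, hi, n=100001):
    """Location of the highest density on an evenly spaced grid."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    return max(grid, key=density)

# Unnormalized densities: x * exp(-x) on the original scale,
# and exp(2y - exp(y)) on the log scale (change of variables).
mode_x = grid_mode(lambda x: x * math.exp(-x), 0.0, 20.0)
mode_y = grid_mode(lambda y: math.exp(2.0 * y - math.exp(y)), -10.0, 5.0)
print(mode_x, math.exp(mode_y))            # the back-transformed mode differs
```

The median on the log scale is the log of the original median, but the mode on the log scale back-transforms to about 2, not to the original mode of about 1.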
But in applications where the parameters have meaningful interpretations, I don't think researchers are satisfied with percentiles. If you told a researcher, "Well, we cannot tell you what the most probable parameter value is, all we can tell you is the median (50 %ile)," I don't think the researcher would be satisfied. If you told the researcher, "We can tell you that 30% of the posterior falls below this 30th %ile, but we cannot tell you whether values below the 30th %ile have lower or higher probability density than values above the 30th %ile," I don't think the researcher would be satisfied. Lots of parameters in traditional psychometric models have meaningful scales (and aren't arbitrarily non-linearly transformed). Lots of parameters in conventional models have scales that directly map onto the data scales, for example the mean and standard deviation of a normal model (and the data scales are usually conventional and aren't arbitrarily non-linearly transformed). And in spatial or temporal models, many parameters directly correspond to space and time, which (in most terrestrial applications) are not non-linearly transformed.
Decision theory to the rescue? I know there is not a uniquely "correct" answer to this question. I suspect that the pros and cons could be formalized as cost functions in formal decision theory, and then an answer would emerge depending on the utilities assigned to density and transformation invariance. If the cost function depends on densities, then mode and HDI would emerge as the better basis for decisions. If the cost function depends on transformation invariance, then median and equal-tail interval would emerge as the better basis for decisions.
What do you think?
Thursday, February 16, 2017
In this blog post I show that frequentist equivalence testing (using the procedure of two one-sided tests: TOST) with null hypothesis significance testing (NHST) can produce conflicting decisions for the same parameter values, that is, TOST can accept the value while NHST rejects the same value. The Bayesian procedure using highest density interval (HDI) with region of practical equivalence (ROPE) never produces that conflict.
The Bayesian HDI+ROPE decision rule.
For a review of the HDI+ROPE decision rule, see this blog post and specifically this picture in that blog post. To summarize:
- A parameter value is rejected when its ROPE falls entirely outside the (95%) HDI. To "reject" a parameter value merely means that all the most credible parameter values are not practically equivalent to the rejected value. For a parameter value to be "rejected", it is not merely outside the HDI!
- A parameter value is accepted when its ROPE completely contains the (95%) HDI. To "accept" a parameter value merely means that all the most credible parameter values are practically equivalent to the accepted value. For a parameter value to be "accepted", it is not merely inside the HDI! In fact, parameter values can be "accepted" that are outside the HDI, because reject or accept depends on the ROPE.
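The two rules above can be sketched as a small function. This is a minimal illustration, assuming a symmetric ROPE specified by a half-width; the function name `hdi_rope_decision` and the HDI limits in the usage lines are hypothetical.

```python
def hdi_rope_decision(hdi, value, rope_halfwidth=0.1):
    """Classify a candidate parameter value as 'reject', 'accept', or 'undecided'.

    hdi            -- (low, high) limits of the 95% HDI
    value          -- the candidate parameter value being assessed
    rope_halfwidth -- half-width of a symmetric ROPE around `value` (assumption)
    """
    hdi_lo, hdi_hi = hdi
    rope_lo, rope_hi = value - rope_halfwidth, value + rope_halfwidth
    if rope_hi < hdi_lo or rope_lo > hdi_hi:
        return "reject"      # ROPE falls entirely outside the HDI
    if rope_lo <= hdi_lo and hdi_hi <= rope_hi:
        return "accept"      # ROPE completely contains the HDI
    return "undecided"       # ROPE and HDI partially overlap

hdi = (0.02, 0.08)                    # hypothetical 95% HDI
print(hdi_rope_decision(hdi, 0.00))   # accept, even though 0.00 is outside the HDI
print(hdi_rope_decision(hdi, 0.15))   # undecided, even though 0.15 is outside the HDI
print(hdi_rope_decision(hdi, 0.30))   # reject
```

The first two usage lines show that being outside the HDI is neither necessary nor sufficient for rejection; the decision depends on the ROPE.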
In the frequentist TOST procedure, the analyst sets up a ROPE, and does a one-sided test for being below the high limit of the ROPE and another one-sided test for being above the low limit of the ROPE. If both limits are rejected, the parameter value is "accepted". The TOST is the same as checking that the 90% (not 95%) confidence interval falls inside the ROPE. The TOST procedure is used to decide on equivalence to a ROPE'd parameter value.
To reject a parameter value, the frequentist uses good ol' NHST. In other words, if the parameter value falls outside the 95% CI, it is rejected.
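The frequentist pair of procedures can be sketched in terms of confidence intervals, following the description above: TOST accepts when the 90% CI falls inside the ROPE, and NHST rejects when the value falls outside the 95% CI. The function name `tost_nhst_decision`, the symmetric ROPE half-width, and the interval limits in the usage lines are assumptions for illustration.

```python
def tost_nhst_decision(ci95, ci90, value, rope_halfwidth=0.1):
    """Combine TOST (via the 90% CI) and NHST (via the 95% CI) for one candidate value."""
    rope_lo, rope_hi = value - rope_halfwidth, value + rope_halfwidth
    tost_accepts = rope_lo <= ci90[0] and ci90[1] <= rope_hi   # 90% CI inside the ROPE
    nhst_rejects = value < ci95[0] or value > ci95[1]          # value outside the 95% CI
    if tost_accepts and nhst_rejects:
        return "conflict"
    if tost_accepts:
        return "accept"
    if nhst_rejects:
        return "reject"
    return "undecided"

ci95 = (-0.15, 0.15)      # hypothetical 95% CI, wider than the ROPE
ci90 = (-0.1245, 0.1245)  # 0.83 times the width of the 95% CI
print(tost_nhst_decision(ci95, ci90, 0.0))   # undecided: the 90% CI cannot fit in the ROPE
print(tost_nhst_decision(ci95, ci90, 0.3))   # reject: 0.3 is outside the 95% CI
```

Because the two tests are separate procedures applied to different intervals, nothing in this logic prevents both conditions from being true at once, which is why a "conflict" outcome has to be allowed for.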
Examples comparing TOST+NHST with HDI+ROPE.
Examples below show the ranges of parameter values rejected, accepted, undecided, or conflicting (both rejected and accepted) by the two procedures.
- In these cases the ROPE is symmetric around its parameter value, with ROPE limits at \(-0.1\) and \(+0.1\) from the central value. These are merely default ROPE limits we might use if the parameter value were the effect-size Cohen's \(\delta\) because \(0.1\) is half of a "small" effect size. In general, ROPE limits could be asymmetric, and should be chosen in the context of current theory and measurement abilities. The key point is that the ROPE is the same for TOST and for HDI procedures for all the examples.
- In all the examples, the 95% HDI and the 95% CI are arbitrarily set to be equal. In general, the 95% HDI and 95% CI will not be equal, especially when the CI is corrected for multiple tests or correctly computed for stopping rules that do not assume fixed sample size. But merely for simplicity of comparison, the 95% HDI and 95% CI are arbitrarily set equal to each other. The 90% CI is set to 0.83 times the width of the 95% CI, as it would be for a normal distribution. The exact numerical value does not matter for the qualitative results.
Example 1: HDI and 90% CI are wider than the ROPE.
In the first example below, the HDI and 90% CI are wider than the ROPE. Therefore no parameter values can be accepted, because the ROPE, no matter what parameter value it is centered on, can never contain the HDI or 90% CI.
Example 2: HDI and 90% CI are a little narrower than the ROPE.
In the next example, below, the HDI and 90% CI are a little narrower than the ROPE. Therefore there are some parameter values whose ROPEs contain the HDI or 90% CI and are "accepted". Notice that the TOST procedure accepts a wider range of parameter values than the HDI+ROPE decision rule, because the 90% CI is narrower than the 95% HDI (which in these examples is arbitrarily set equal to the 95% CI).
Example 3: HDI and 90% CI are much narrower than the ROPE.
The third example, below, has the HDI and CI considerably narrower than the ROPE. This situation might arise when there is a windfall of data, with high precision estimates but lenient tolerance for "practical equivalence". Notice that this leads to conflicting decisions for TOST and NHST: There are parameter values that are both accepted by TOST and rejected by NHST. Conflicts like this cannot happen when using the HDI+ROPE decision rule.
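The qualitative pattern across the three examples can be reproduced with a small scan over candidate parameter values. This is a sketch under stated assumptions: the intervals are centered at zero, the 95% HDI equals the 95% CI, the 90% CI is 0.83 times as wide, and the three half-widths (0.15, 0.09, 0.02) are hypothetical values chosen to be wider than, a little narrower than, and much narrower than the ROPE half-width of 0.1. They are not the values used in the original figures.

```python
ROPE_HALF = 0.1      # ROPE half-width, as in the examples
CI90_RATIO = 0.83    # width of the 90% CI relative to the 95% CI

def tost_nhst(value, half95):
    """TOST via the 90% CI plus NHST via the 95% CI, both centered at 0."""
    half90 = CI90_RATIO * half95
    tost_accepts = (value - ROPE_HALF <= -half90) and (half90 <= value + ROPE_HALF)
    nhst_rejects = abs(value) > half95
    if tost_accepts and nhst_rejects:
        return "conflict"
    if tost_accepts:
        return "accept"
    if nhst_rejects:
        return "reject"
    return "undecided"

def hdi_rope(value, half95):
    """HDI+ROPE rule with the 95% HDI centered at 0."""
    if abs(value) - ROPE_HALF > half95:
        return "reject"      # ROPE entirely outside the HDI
    if abs(value) + half95 <= ROPE_HALF:
        return "accept"      # ROPE completely contains the HDI
    return "undecided"

candidates = [i / 1000 for i in range(-300, 301)]
examples = {"example 1": 0.15, "example 2": 0.09, "example 3": 0.02}  # hypothetical half-widths

results = {}
for name, half95 in examples.items():
    results[name] = {
        "TOST+NHST": sorted({tost_nhst(v, half95) for v in candidates}),
        "HDI+ROPE": sorted({hdi_rope(v, half95) for v in candidates}),
    }
    print(name, results[name])
```

Under these assumptions the "conflict" label appears only in the third scan, and only for TOST+NHST. The HDI+ROPE rule cannot produce it because a ROPE that contains the HDI necessarily overlaps it.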
The comparison illustrated here was inspired by a recent blog post by Daniel Lakens, which emphasized the similarity of results using TOST and HDI+ROPE. Here I've tried to illustrate at least one aspect of their different meanings.
Wednesday, February 8, 2017
The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective.
Abstract: In the practice of data analysis, there is a conceptual distinction between hypothesis testing, on the one hand, and estimation with quantified uncertainty on the other. Among frequentists in psychology, a shift of emphasis from hypothesis testing to estimation has been dubbed “the New Statistics” (Cumming, 2014). A second conceptual distinction is between frequentist methods and Bayesian methods. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The article reviews frequentist and Bayesian approaches to hypothesis testing and to estimation with confidence or credible intervals. The article also describes Bayesian approaches to meta-analysis, randomized controlled trials, and power analysis.
Published in Psychonomic Bulletin & Review.
Final submitted manuscript: https://osf.io/preprints/psyarxiv/wntsa/
Published version view-only online (displays some figures incorrectly): http://rdcu.be/o6hd
Published article: http://link.springer.com/article/10.3758/s13423-016-1221-4
Published online: 2017-Feb-08
Corrected proofs submitted: 2017-Jan-16
Revision 2 submitted: 2016-Nov-15
Editor action 2: 2016-Oct-12
Revision 1 submitted: 2016-Apr-16
Editor action 1: 2015-Aug-23
Initial submission: 2015-May-13
Sunday, January 29, 2017
Choose from two multi-day workshops doing Bayesian data analysis in June, 2017:
- 2017 June 20 - 23. Four-day course. Interuniversity Consortium for Political and Social Research (ICPSR), Ann Arbor, Michigan.
- 2017 June 5 - 9. Five-day course. Stats Camp, Albuquerque, New Mexico.
Wednesday, December 21, 2016
A blog post by Christian Robert considered an ancient (2011!) article titled "Bayesian assessment of null values via parameter estimation and model comparison." Here I'll try to clarify the ideas from way back then through the lens of more recent diagrams from my workshops and a new article.
Terminology: "Bayesian assessment of null values" is supposed to be neutral wording to refer to any Bayesian method for assessing null values. Bayesian "hypothesis testing" is reserved for Bayes factors. Making decisions by posterior interval is not referred to as hypothesis testing and is not equivalent to Bayes factors.
Bayesian hypothesis testing: Suppose we are modeling some data with a model that has parameter δ in which we are currently interested, along with some other parameters. A null hypothesis model can be formulated as a prior on the parameters that puts a "spike" at the null value of δ but is spread out over the other parameters. A non-null alternative model puts a prior on δ that allows non-null values. The two models are indexed by a higher-level discrete parameter M. The entire hierarchy (a mixture model) has all its parameters updated by the data. The following slide from my workshops illustrates:
The Bayes factor (BF) is the shift in model-index probabilities:
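In symbols, this relationship is conventionally written as follows. This is a generic sketch rather than a transcription of the slide: \(D\) denotes the data, and labeling the null model \(M=1\) and the alternative model \(M=2\) is an assumption made here for illustration.

\[
\underbrace{\frac{p(M=2 \mid D)}{p(M=1 \mid D)}}_{\text{posterior odds}}
=
\underbrace{\frac{p(D \mid M=2)}{p(D \mid M=1)}}_{\text{BF}}
\times
\underbrace{\frac{p(M=2)}{p(M=1)}}_{\text{prior odds}}
\]

The BF is the factor by which the data shift the prior odds of the two models to their posterior odds.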
Digression: I throw in two usual caveats about using Bayes factors. First, Bayesian model comparison --for null hypothesis testing or more generally for any (non-nested) models-- must use meaningful priors on the parameters in both models for the Bayes factor to be meaningful. Default priors for either model are typically not very meaningful and quite possibly misleading.
Second, the Bayes factor is not the posterior probability of the models. Typically we ultimately want to know the posterior probabilities of the models, and the BF is just a step in that direction.
|Link in slide above: http://doingbayesiandataanalysis.blogspot.com/2015/12/lessons-from-bayesian-disease-diagnosis_27.html|
Assessing null value through parameter estimation: There's another way to assess null values. This other way focuses on the (marginal) posterior distribution of the parameter in which we're interested. (As mentioned at the outset, this approach is not called "hypothesis testing.") This approach is analogous to frequentist equivalence testing, which sets up a region of practical equivalence (ROPE) around the null value of the parameter:
The logic of this approach stems from a direct reading of the meaning of the intervals. We decide to reject the null value when the 95% highest density parameter values are all not practically equivalent to the null value. We decide to accept the null value when the 95% highest density parameter values are all practically equivalent to the null value. Furthermore, we can make direct probability statements about the probability mass inside the ROPE such as, "the probability that the parameter is practically equivalent to the null is 0.017" or "the probability that the parameter is practically equivalent to the null is 0.984."
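The probability mass inside the ROPE is straightforward to estimate from posterior draws (e.g., MCMC output). The sketch below is a minimal illustration in which normally distributed random numbers stand in for real posterior samples; the two sets of draws, the ROPE half-width of 0.1, and the function name `prob_in_rope` are all hypothetical.

```python
import random

random.seed(0)

def prob_in_rope(draws, null_value, rope_halfwidth):
    """Proportion of posterior draws that fall inside the ROPE around the null value."""
    inside = sum(1 for d in draws if abs(d - null_value) <= rope_halfwidth)
    return inside / len(draws)

# Stand-ins for posterior draws of a parameter (assumptions, not real MCMC output).
draws_near_null = [random.gauss(0.02, 0.03) for _ in range(20000)]
draws_far_from_null = [random.gauss(0.50, 0.10) for _ in range(20000)]

p_near = prob_in_rope(draws_near_null, null_value=0.0, rope_halfwidth=0.1)
p_far = prob_in_rope(draws_far_from_null, null_value=0.0, rope_halfwidth=0.1)
print(p_near, p_far)
```

The first proportion is close to 1 and the second is close to 0, corresponding to the two kinds of probability statement quoted above.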
The ROPE is part of the decision rule, not part of the null hypothesis. The ROPE does not constitute an interval null hypothesis; the null hypothesis here is a point value. The ROPE is part of the decision rule for two main purposes: First, it allows decisions to accept the null (again, analogous to frequentist equivalence testing). Second, it makes the decision rule asymptotically correct: As data sample size increases, the rule will come to the correct decision, either practically equivalent to the null value (within the ROPE) or not (outside the ROPE).
Juxtaposing the two approaches: Notice that the two approaches to assessing null values are not equivalent and have different emphases. The BF focuses on the model index, whereas the HDI and ROPE focus on the parameter estimate:
Therefore the two approaches will not always come to the same decision, though often they will. Neither approach is uniquely "correct;" the two approaches frame the question differently and provide different information.
Below is an example of the different information provided by hypothesis testing and estimation (for both frequentist and Bayesian analyses). The data are dichotomous, with z=14 successes out of N=18 attempts (e.g., 14 heads out of 18 flips). The data are modeled by a Bernoulli distribution with parameter θ. The null value is taken to be θ=0.50. For the Bayesian analysis, the alternative-hypothesis prior is uniform merely for purposes of illustration; uniform is equivalent to dbeta(1,1).
|From: Kruschke, J. K. and Liddell, T. (in press 2016), The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review.|
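The quantities for this example can be recomputed with a short script. This is a minimal sketch, not the code behind the published figure: it assumes a fixed-N two-sided binomial test for the frequentist p value, a point null at θ=0.5 against the uniform alternative prior for the Bayes factor, and a simple grid approximation for the posterior mode and 95% HDI.

```python
import math

z, N = 14, 18
null_theta = 0.5

# Frequentist: two-sided p value, assuming N was fixed in advance.
upper_tail = sum(math.comb(N, k) for k in range(z, N + 1)) / 2 ** N
p_two_sided = 2 * upper_tail

# Bayesian hypothesis test: marginal likelihood of the data under each model.
# The binomial coefficient is common to both and cancels in the ratio.
p_data_null = null_theta ** z * (1 - null_theta) ** (N - z)
# Uniform prior: integral of theta^z (1-theta)^(N-z) is the beta function B(z+1, N-z+1).
log_beta = math.lgamma(z + 1) + math.lgamma(N - z + 1) - math.lgamma(N + 2)
p_data_alt = math.exp(log_beta)
bf_alt_over_null = p_data_alt / p_data_null

# Bayesian estimation: the posterior is beta(z+1, N-z+1), approximated on a grid.
grid = [i / 10000 for i in range(10001)]
dens = [t ** z * (1 - t) ** (N - z) for t in grid]   # unnormalized posterior density
total = sum(dens)
order = sorted(range(len(grid)), key=lambda i: dens[i], reverse=True)
kept, acc = [], 0.0
for i in order:                  # keep highest-density points up to 95% mass
    kept.append(i)
    acc += dens[i] / total
    if acc >= 0.95:
        break
posterior_mode = grid[order[0]]
hdi = (grid[min(kept)], grid[max(kept)])

print(f"two-sided p = {p_two_sided:.3f}")
print(f"BF (alternative over null) = {bf_alt_over_null:.2f}")
print(f"posterior mode = {posterior_mode:.3f}, 95% HDI = {hdi}")
```

The hypothesis tests each return a single number about the null value (p of about 0.031, BF of about 4.5 in favor of the alternative), whereas the estimation summary describes where the credible values of θ lie and how uncertain they are.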