Recently I have been asked by reviewers or editors of papers under consideration for publication to explain what is meant by the term “joint test”. This has surprised me somewhat, as in most cases joint tests are the most appropriate test for the effect of a categorical variable and should be commonplace. However, they are not produced automatically by many statistical packages and so are probably not used as much as they should be.

When a categorical variable is used in a regression model, the standard output of most statistical packages gives you a p-value comparing each category with the baseline group. This results in one fewer comparison than you have categories; for example, if there are four categories there will be three p-values. Each of these specifically tests whether there is a difference between one category and the baseline group. Sometimes this is appropriate, for example if you want to see for which ethnic minority groups there is evidence that their outcome differs from the white majority. Often, however, the choice of baseline group is quite arbitrary and we are more interested in whether there is variation across the categories, for example by age group. This is what a joint test does. Rather than testing individual comparisons, the joint test considers the overall variation across categories. Formally, it tells us the probability of seeing the variation in our data, or a more extreme variation, under the null hypothesis that there is no variation in the population from which we sampled.
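To make this concrete, here is a minimal sketch in Python using statsmodels; the data, group labels and numbers are made up purely for illustration. Fitting a regression on a four-category variable yields three per-category p-values against the baseline, while the joint F test gives a single p-value for any variation across the categories.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# made-up data: four groups (A is the baseline), 10 observations each
base = np.array([8, 9, 9, 10, 10, 10, 10, 11, 11, 12], dtype=float)
df = pd.DataFrame({
    "y": np.concatenate([base, base + 1.5, base + 2.5, base + 3.5]),
    "group": np.repeat(["A", "B", "C", "D"], 10),
})

model = smf.ols("y ~ C(group)", data=df).fit()

# standard output: one p-value per non-baseline category (three here)
print(model.pvalues)

# joint test: a single F test of any variation across the four groups
table = anova_lm(model)
print(table)
```

In Stata the analogous step after fitting the model would be a testparm-style joint test; here anova_lm reports the joint F test for the categorical term.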

While individual tests can often give you a fair idea of what sort of answer a joint test might give, they can be misleading. One example is when a small group is chosen as your baseline. Because the estimate of the mean in this group will be imprecise, comparisons with all other groups may lead to null results, i.e. no evidence of any difference between them and the baseline group, even when there is strong evidence of differences between the other groups. Consider the example below (where the dots show the point estimates and the bars show the 95% confidence intervals).

If group 1 is used as the reference group then none of the individual tests will show a significant difference, but group 2 is clearly different from group 3. It might be argued that this is just a poor choice of baseline group, but the problem does not have to be this extreme, as in the example below, where the choice of baseline group is more arbitrary.
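The small-baseline scenario can be sketched numerically, here in Python with scipy and made-up numbers: group 1 is small and its mean is imprecise, so both individual comparisons against it come out null, yet the joint test finds strong evidence of variation across the groups.

```python
from scipy import stats

# group 1: a small baseline (n = 3) with an imprecise mean
g1 = [5.0, 10.0, 15.0]
# groups 2 and 3: larger, precise, and clearly different from each other
g2 = [5, 6, 6, 7, 7, 7, 7, 8, 8, 9]
g3 = [11, 12, 12, 13, 13, 13, 13, 14, 14, 15]

# individual comparisons against the baseline (Welch t tests): both null
p12 = stats.ttest_ind(g1, g2, equal_var=False).pvalue
p13 = stats.ttest_ind(g1, g3, equal_var=False).pvalue

# joint test (one-way ANOVA): strong evidence of variation across groups
p_joint = stats.f_oneway(g1, g2, g3).pvalue

print(p12, p13, p_joint)
```

Both individual p-values are far from significance, while the joint p-value is very small: exactly the situation described above.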

Finally, it is worth noting that the reverse is also possible: significant results for one or more individual categories but no significant variation overall. This is related to multiple testing: when several comparisons are made, one of them can reach significance by chance even when there is little evidence of overall variation.
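This situation can also be illustrated with made-up numbers in Python with scipy: five groups, four identical and one shifted just enough that its comparison with the baseline is "significant", while the joint test across all five groups is not.

```python
from scipy import stats

# made-up data: five groups of 10; only the last differs, and only slightly
base = [8, 9, 9, 10, 10, 10, 10, 11, 11, 12]
groups = [base, base, base, base, [x + 1.15 for x in base]]

# individual comparisons against the baseline (the first group)
pairwise = [stats.ttest_ind(groups[0], g).pvalue for g in groups[1:]]

# joint test across all five groups
p_joint = stats.f_oneway(*groups).pvalue

print(pairwise, p_joint)
```

One of the four individual p-values falls below 0.05, but the joint test, which accounts for the variation across all five groups at once, does not.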

In conclusion, joint tests are often the most appropriate test to use, and often give you the answer to the question you probably should be asking. In my opinion they should be the default option, with individual tests only presented when particularly appropriate to the setting.

Stata tip – Commands such as test, testparm and lrtest are useful for performing joint tests.