
Published on 24/05/2015 by admin

Filed under Psychiatry

Last modified 22/04/2025


CHAPTER 62 Statistics in Psychiatric Research

THREE CLASSES OF STATISTICS IN PSYCHIATRIC RESEARCH

The word “statistics” derives from a term used for “numbers describing the state;” that is, the original statistics were numbers used by rulers of states to better understand their population. Thus, the first statistics were simply counts of things (such as the population of towns, or the amount of grain produced by a particular town). Today, we call these kinds of simple counts or averages “descriptive statistics,” and these are used in almost every research study, to describe the demographic and clinical characteristics of the participants in a particular study.

Modern psychiatric research also involves two additional classes of statistics: psychometric statistics and inferential statistics. Most psychiatric studies will involve all three classes of statistics.

In psychiatric research, demographic variables (such as gender and height) can be measured objectively. However, most of our studies also require the measurement of variables that are not as objective (e.g., clinical diagnoses and rating scales of psychopathology). Here, we usually cannot measure directly the characteristics we are really interested in, so instead, we rely on a subject’s score on either self-report or on investigator-administered scales. Psychometrics is concerned with how reproducible a subject’s score is (i.e., how reliable it is), and how closely it measures the characteristic we are really interested in (i.e., how valid it is).

Psychiatric researchers study relatively small samples of subjects, usually with the intent to generalize their findings to the larger population from which their sample was drawn. This is the realm of inferential statistics, which is based on probability theory. Researchers are reporting inferential statistics when you see the telltale p-values and asterisks denoting statistical significance in the text and tables of the Results sections.

All three kinds of statistics (descriptive, psychometric, and inferential) are present in most published papers in psychiatric research, and are considered in a particular order, for the following reasons. First, without reliable and valid measures, neither of the other kinds of statistics will be meaningful. For example, if we rely solely on clinicians’ judgments of patient improvement, but the study clinicians rarely agree on whether a particular patient has improved, any additional statistics will be meaningless. Likewise, a measure can be perfectly reliable, as with a patient’s cell phone number, yet have no validity for any of the purposes of the study. Second, descriptive statistics are needed to summarize the many individual subjects’ scores into summary statistics (such as counts, proportions, averages [or means], and standard deviations) that can then be compared between groups. Inferential statistics would be impossible without first having these summary statistics. Third, without inferential statistics and their computed probability values, the researcher cannot generalize any positive findings beyond the particular group being studied (and this is, after all, the usual goal of a research study).
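This ordering can be sketched with a toy example in Python (all scores below are invented for illustration): descriptive statistics summarize each group, and those summaries are then the raw material for an inferential test statistic.

```python
from statistics import mean, stdev

# Hypothetical post-treatment symptom scores (invented, not real data)
drug = [12, 10, 14, 9, 11, 13, 8, 10]
placebo = [16, 14, 18, 15, 13, 17, 15, 16]

# Descriptive statistics: summarize each group
m1, s1, n1 = mean(drug), stdev(drug), len(drug)
m2, s2, n2 = mean(placebo), stdev(placebo), len(placebo)
print(f"drug: mean={m1:.2f}, sd={s1:.2f}; placebo: mean={m2:.2f}, sd={s2:.2f}")

# Inferential statistic: Welch's t, built entirely from the descriptive summaries
t = (m1 - m2) / ((s1**2 / n1 + s2**2 / n2) ** 0.5)
print(f"t = {t:.2f}")
```

Note that the t statistic is computed from the group means, standard deviations, and sample sizes alone, which is why the descriptive statistics must come first.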

Table 62-1 illustrates the characteristics of each class, as well as the order in which the classes must be considered, since each successive class rests on the foundation of the preceding class.

Table 62-1 The Three Classes of Statistics Used in Psychiatric Research (in Order of Applicability)

Class of Statistic: Purpose

Psychometric Statistics: Statistics used to assess how reproducible (reliable) and how valid the study’s measures are, before any further statistics are computed.

Descriptive Statistics: Statistics used to summarize the scores of many subjects in a single count or average to describe the group as a whole. After descriptive statistics have been computed for one or more samples, they can then be used to compute inferential statistics to attempt to generalize these results to the larger population from which these samples were drawn.

Inferential Statistics: Statistics that provide probability estimates used to generalize descriptive statistics to the larger population from which the samples were drawn.

Concrete Examples of the Three Classes of Statistics in a Research Article

To provide a concrete example of these sometimes abstract concepts, consider a fictional study based on the simplest research design in psychiatric research: a randomized double-blind trial of a new drug versus a placebo pill for obsessive-compulsive disorder (OCD).

Figures 62-1 through 62-3 contain the annotated Method and Results sections for this fictional study, showing how the various psychometric statistics are presented in the Method section, while descriptive statistics are presented in the Method and Results sections, and inferential statistics are presented in the Results section (for definitions of terms used in these figures, refer to the section on statistical terms and their definitions).

Selecting an Appropriate Statistical Method

The two key determinants in choosing a statistical method are (1) your research goal, and (2) the level of measurement of your outcome (or dependent) variable(s). Table 62-3 illustrates the key characteristics of the various levels of measurement and provides examples of each.

Table 62-3 Levels of Measurement of Variables

Level of Measurement: Description of Level

Continuous (also known as interval or ratio): A scale on which there are approximately equal intervals between scores (e.g., height, or the total score on a rating scale).

Ordinal (also known as ranks): A scale in which scores are arranged in order, but intervals between scores may not be equal.

Nominal (also known as categorical): Scores are simply names for different groups and do not imply magnitude; often used to define groups based on experimental treatment or diagnosis.

Dichotomous (also known as binary): A special case of a nominal variable in which there are only two possible values (e.g., drug versus placebo group).

Once the level of measurement of your outcome variable has been determined, you will decide whether your research question will require you to compare two or more different groups of subjects, or to compare variables within a single group of subjects. Tables 62-4 and 62-5 will help you choose the appropriate statistical method once you have made these decisions. (Note that these tables consider only univariate statistical tests; multivariate tests are beyond the scope of this chapter.)

For example, if you want to conduct a study comparing a new drug to two control conditions, and your outcome measure is a continuous rating scale, Table 62-4 indicates that you would typically use the analysis of variance (ANOVA) to analyze your data. If you wanted to assess the association of two continuous measures of dissociation and anxiety in a single depressed sample, Table 62-5 indicates that you would usually select the Pearson correlation coefficient. (Note that the procedures listed for ranked outcome measures are those typically referred to as “nonparametric tests.”)
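Both of these choices can be sketched by hand to show what each procedure actually computes (the data below are invented; in practice a statistics package would also supply the p-values):

```python
from statistics import mean

# One-way ANOVA: hypothetical continuous outcome scores for three groups
# (new drug vs. two control conditions)
groups = [[10, 12, 9, 11], [14, 15, 13, 16], [13, 12, 15, 14]]

# F = between-group mean square / within-group mean square
grand = mean(x for g in groups for x in g)
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
F = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {F:.2f}")

# Pearson correlation: two hypothetical continuous measures in one sample
anxiety = [4, 7, 5, 8, 6, 9]
dissociation = [3, 6, 5, 7, 4, 8]
mx, my = mean(anxiety), mean(dissociation)
num = sum((x - mx) * (y - my) for x, y in zip(anxiety, dissociation))
den = (sum((x - mx) ** 2 for x in anxiety)
       * sum((y - my) ** 2 for y in dissociation)) ** 0.5
r = num / den
print(f"r = {r:.2f}")
```

The F statistic compares variability between group means to variability within groups, while r captures how tightly the two measures co-vary in a single sample.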

STATISTICAL TERMS AND THEIR DEFINITIONS

p-Value

This reflects the chance that the result of a statistical test is a false positive (i.e., the probability of a spurious finding). If a finding is almost certainly spurious, and would not be reproducible in another sample, the p-value will be near 1.00. On the other hand, if the finding almost certainly represents a “true” finding, the p-value will be near zero (small p-values are represented by several zeros after the decimal point, e.g., p < .0001, but never list a p-value as 0.00, because this is impossible, and it annoys reviewers!).

Most journals require p < 0.05 for significance. If many statistical tests are performed in a study, a more conservative p-value can be used to minimize experiment-wise error (see Table 62-2). Caveat: Do not be overly impressed by very low p-values. All a low p-value tells you is that the difference is probably not zero, and, given enough subjects, that is easy to demonstrate. Thus, a very low p-value does not necessarily indicate a large clinical effect; it indicates a very reliable effect. Check the effect size (the squared correlation coefficient, or the size of the t-statistic) to get an idea of the magnitude of the difference or relation.
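This caveat can be demonstrated with invented numbers: holding the group means and standard deviations fixed, the t statistic (and hence the apparent significance) grows with the sample size, while the standardized effect size does not change at all.

```python
# Two groups with the same hypothetical means and SD at two sample sizes:
# the t statistic depends on n, the standardized effect size does not.
m1, m2, sd = 10.0, 10.5, 2.0  # a small mean difference of 0.5 points

def welch_t(m1, m2, sd, n):
    # Welch's t for two equal-sized groups with equal SDs
    return (m2 - m1) / ((sd**2 / n + sd**2 / n) ** 0.5)

cohens_d = (m2 - m1) / sd  # standardized effect size (0.25, a "small" effect)

for n in (20, 2000):
    print(f"n={n}: t = {welch_t(m1, m2, sd, n):.2f}, d = {cohens_d:.2f}")
```

With n = 2000 per group, this small effect would reach a very low p-value, yet the clinical magnitude of the difference is identical to the n = 20 case.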

Reliability

This is the dependability of a score; that is, the degree to which a measurement can be depended on (i.e., How reproducible is the score?). For self-rated scales, such as paper-and-pencil questionnaires, there is no rater error to take into account, so the main source of unreliability to assess is variation in the person’s self-rating over time. For example, if a patient completes a depression questionnaire at 3 p.m., how close would his or her score be if he or she were to take the same scale at 4 p.m., assuming no change in his or her depression? If the scores were identical, and this was the case for all patients, the correlation coefficient would be a perfect 1.00 (in this case the correlation coefficient is referred to as the “reliability coefficient”).

For scales or measures administered by a rater, the major question is, “Would this patient get the same score on this depression scale if Doctor A rated him as if Doctor B rated him?” If the agreement were perfect for all patients, the interrater reliability coefficient would be 1.00; if, on the other hand, there were a random relationship between the scores of the two raters, it would be 0.00. Reliability is necessary but not sufficient for a useful scale: a scale can be perfectly reliable, but have no validity for a particular purpose. For example, every time you ask me my phone number I will give you the same answer (perfect reliability); however, if you attempt to use my phone number to predict my anxiety level, you will find a zero correlation (no validity).

For a continuous measure, reliability is assessed with the correlation coefficient; as a rule of thumb, r ≥ 0.80 is generally considered adequate. For a binary measure (e.g., the presence or absence of a disease), agreement is often assessed with the kappa coefficient; the corresponding rule of thumb is generally κ ≥ 0.70.