
Published on 24/05/2015 by admin

Filed under Psychiatry

Last modified 22/04/2025


CHAPTER 62 Statistics in Psychiatric Research

THREE CLASSES OF STATISTICS IN PSYCHIATRIC RESEARCH

The word “statistics” derives from a term used for “numbers describing the state;” that is, the original statistics were numbers used by rulers of states to better understand their population. Thus, the first statistics were simply counts of things (such as the population of towns, or the amount of grain produced by a particular town). Today, we call these kinds of simple counts or averages “descriptive statistics,” and these are used in almost every research study, to describe the demographic and clinical characteristics of the participants in a particular study.

Modern psychiatric research also involves two additional classes of statistics: psychometric statistics and inferential statistics. Most psychiatric studies will involve all three classes of statistics.

In psychiatric research, demographic variables (such as gender and height) can be measured objectively. However, most of our studies also require the measurement of variables that are not as objective (e.g., clinical diagnoses and rating scales of psychopathology). Here, we usually cannot measure directly the characteristics we are really interested in, so instead, we rely on a subject’s score on either self-report or on investigator-administered scales. Psychometrics is concerned with how reproducible a subject’s score is (i.e., how reliable it is), and how closely it measures the characteristic we are really interested in (i.e., how valid it is).

Psychiatric researchers study relatively small samples of subjects, usually with the intent to generalize their findings to the larger population from which their sample was drawn. This is the realm of inferential statistics, which is based on probability theory. Researchers are reporting inferential statistics when you see the telltale p-values and asterisks denoting statistical significance in the text and tables of the Results sections.

All three kinds of statistics (descriptive, psychometric, and inferential) are present in most published papers in psychiatric research, and are considered in a particular order, for the following reasons. First, without reliable and valid measures, neither of the other kinds of statistics will be meaningful. For example, if we rely solely on clinicians’ judgments of patient improvement, but the study clinicians rarely agree on whether a particular patient has improved, any additional statistics will be meaningless. Likewise, a measure can be perfectly reliable, as with a patient’s cell phone number, yet have no validity for any of the purposes of the study. Second, descriptive statistics are needed to summarize the many individual subjects’ scores into summary statistics (such as counts, proportions, averages [or means], and standard deviations) that can then be compared between groups. Inferential statistics would be impossible without first having these summary statistics. Third, without inferential statistics and their computed probability values, the researcher cannot generalize any positive findings beyond the particular group being studied (and this is, after all, the usual goal of a research study).
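This ordering can be sketched with a toy example in Python (all scores below are invented for illustration): descriptive statistics summarize each group, and those summaries are then the raw material for an inferential test statistic.

```python
from statistics import mean, stdev

# Hypothetical post-treatment symptom scores (invented, not real data)
drug = [12, 10, 14, 9, 11, 13, 8, 10]
placebo = [16, 14, 18, 15, 13, 17, 15, 16]

# Descriptive statistics: summarize each group
m1, s1, n1 = mean(drug), stdev(drug), len(drug)
m2, s2, n2 = mean(placebo), stdev(placebo), len(placebo)
print(f"drug: mean={m1:.2f}, sd={s1:.2f}; placebo: mean={m2:.2f}, sd={s2:.2f}")

# Inferential statistic: Welch's t, built entirely from the descriptive summaries
t = (m1 - m2) / ((s1**2 / n1 + s2**2 / n2) ** 0.5)
print(f"t = {t:.2f}")
```

Note that the t statistic is computed from the group means, standard deviations, and sample sizes alone, which is why the descriptive statistics must come first.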

Table 62-1 illustrates the characteristics of each class, as well as the order in which the classes must be considered, since each successive class rests on the foundation of the preceding class.

Table 62-1 The Three Classes of Statistics Used in Psychiatric Research (in Order of Applicability)

Class of Statistic: Purpose

Psychometric Statistics: Statistics used to assess how reproducible (reliable) and how valid the study’s measures are, before any further statistics are computed.

Descriptive Statistics: Statistics used to summarize the scores of many subjects in a single count or average to describe the group as a whole. After descriptive statistics have been computed for one or more samples, they can then be used to compute inferential statistics to attempt to generalize these results to the larger population from which these samples were drawn.

Inferential Statistics: Statistics that provide probability estimates used to generalize descriptive statistics to the larger population from which the samples were drawn.

Concrete Examples of the Three Classes of Statistics in a Research Article

To provide a concrete example of these sometimes abstract concepts, consider a fictional study based on the simplest research design in psychiatric research: a randomized double-blind trial of a new drug versus a placebo pill for obsessive-compulsive disorder (OCD).

Figures 62-1 through 62-3 contain the annotated Method and Results sections for this fictional study, showing how the various psychometric statistics are presented in the Method section, while descriptive statistics are presented in the Method and Results sections, and inferential statistics are presented in the Results section (for definitions of terms used in these figures, refer to the section on statistical terms and their definitions).

Selecting an Appropriate Statistical Method

The two key determinants in choosing a statistical method are (1) your research goal, and (2) the level of measurement of your outcome (or dependent) variable(s). Table 62-3 illustrates the key characteristics of the various levels of measurement and provides examples of each.

Table 62-3 Levels of Measurement of Variables

Level of Measurement: Description of Level

Continuous (also known as interval or ratio): A scale on which there are approximately equal intervals between scores (e.g., height, or the total score on a rating scale).

Ordinal (also known as ranks): A scale in which scores are arranged in order, but intervals between scores may not be equal.

Nominal (also known as categorical): Scores are simply names for different groups and do not imply magnitude; often used to define groups based on experimental treatment or diagnosis.

Dichotomous (also known as binary): A special case of a nominal variable in which there are only two possible values (e.g., drug versus placebo group).

Once the level of measurement of your outcome variable has been determined, you will decide whether your research question will require you to compare two or more different groups of subjects, or to compare variables within a single group of subjects. Tables 62-4 and 62-5 will help you choose the appropriate statistical method once you have made these decisions. (Note that these tables consider only univariate statistical tests; multivariate tests are beyond the scope of this chapter.)

For example, if you want to conduct a study comparing a new drug to two control conditions, and your outcome measure is a continuous rating scale, Table 62-4 indicates that you would typically use the analysis of variance (ANOVA) to analyze your data. If you wanted to assess the association of two continuous measures of dissociation and anxiety in a single depressed sample, Table 62-5 indicates that you would usually select the Pearson correlation coefficient. (Note that the procedures listed for ranked outcome measures are those typically referred to as “nonparametric tests.”)
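Both of these choices can be sketched by hand to show what each procedure actually computes (the data below are invented; in practice a statistics package would also supply the p-values):

```python
from statistics import mean

# One-way ANOVA: hypothetical continuous outcome scores for three groups
# (new drug vs. two control conditions)
groups = [[10, 12, 9, 11], [14, 15, 13, 16], [13, 12, 15, 14]]

# F = between-group mean square / within-group mean square
grand = mean(x for g in groups for x in g)
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
F = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {F:.2f}")

# Pearson correlation: two hypothetical continuous measures in one sample
anxiety = [4, 7, 5, 8, 6, 9]
dissociation = [3, 6, 5, 7, 4, 8]
mx, my = mean(anxiety), mean(dissociation)
num = sum((x - mx) * (y - my) for x, y in zip(anxiety, dissociation))
den = (sum((x - mx) ** 2 for x in anxiety)
       * sum((y - my) ** 2 for y in dissociation)) ** 0.5
r = num / den
print(f"r = {r:.2f}")
```

The F statistic compares variability between group means to variability within groups, while r captures how tightly the two measures co-vary in a single sample.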

STATISTICAL TERMS AND THEIR DEFINITIONS

p-Value

This reflects the chance that the result of a statistical test is a false positive (i.e., the probability of a spurious finding). If a finding is almost certainly spurious, and would not be reproducible in another sample, the p-value will be near 1.00. On the other hand, if the finding almost certainly represents a “true” finding, the p-value will be near zero (small p-values are represented by several zeros after the decimal point, e.g., p < .0001, but never list a p-value as 0.00, because this is impossible, and it annoys reviewers!).

Most journals require p < 0.05 for significance. If many statistical tests are performed in a study, a more conservative p-value can be used to minimize experiment-wise error (see Table 62-2). Caveat: Do not be overly impressed by very low p-values. All a low p-value tells you is that the difference is probably not zero, and, given enough subjects, that is easy to demonstrate. Thus, a very low p-value does not necessarily indicate a large clinical effect; it indicates a very reliable effect. Check the effect size (the squared correlation coefficient, or the size of the t-statistic) to get an idea of the magnitude of the difference or relation.
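This caveat can be demonstrated with invented numbers: holding the group means and standard deviations fixed, the t statistic (and hence the apparent significance) grows with the sample size, while the standardized effect size does not change at all.

```python
# Two groups with the same hypothetical means and SD at two sample sizes:
# the t statistic depends on n, the standardized effect size does not.
m1, m2, sd = 10.0, 10.5, 2.0  # a small mean difference of 0.5 points

def welch_t(m1, m2, sd, n):
    # Welch's t for two equal-sized groups with equal SDs
    return (m2 - m1) / ((sd**2 / n + sd**2 / n) ** 0.5)

cohens_d = (m2 - m1) / sd  # standardized effect size (0.25, a "small" effect)

for n in (20, 2000):
    print(f"n={n}: t = {welch_t(m1, m2, sd, n):.2f}, d = {cohens_d:.2f}")
```

With n = 2000 per group, this small effect would reach a very low p-value, yet the clinical magnitude of the difference is identical to the n = 20 case.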

Reliability

This is the dependability of a score; that is, the degree to which a measurement can be depended on (i.e., How reproducible is the score?). For self-rated scales, such as paper-and-pencil questionnaires, there is no rater error to take into account, so the main source of unreliability to assess is variation in the person’s self-rating over time. For example, if a patient completes a depression questionnaire at 3 p.m., how close would his or her score be if he or she were to take the same scale at 4 p.m., assuming no change in his or her depression? If the scores were identical, and this was the case for all patients, the correlation coefficient would be a perfect 1.00 (in this case the correlation coefficient is referred to as the “reliability coefficient”).

For scales or measures administered by a rater, the major question is, “Would this patient get the same score on this depression scale if Doctor A rated him as if Doctor B rated him?” If the agreement were perfect for all patients, the interrater reliability coefficient would be 1.00; if, on the other hand, there were a random relationship between the scores of the two raters, it would be 0.00. Reliability is necessary but not sufficient for a useful scale: a scale can be perfectly reliable, but have no validity for a particular purpose. For example, every time you ask me my phone number I will give you the same answer (perfect reliability); however, if you attempt to use my phone number to predict my anxiety level, you will find a zero correlation (no validity).

For a continuous measure, reliability is assessed with the correlation coefficient; as a rule of thumb, r ≥ 0.80 is generally considered adequate. For a binary measure (e.g., the presence or absence of a disease), agreement is often assessed with the kappa coefficient; the corresponding rule of thumb is generally κ ≥ 0.70.