Research Design and Biostatistics

Published on 17/03/2015 by admin

Filed under Orthopaedics

Last modified 22/04/2025


Chapter 13

Research Design and Biostatistics

Contents

Section 1 Introduction

Critical review of medical research is essential for orthopaedic surgeons. Experiments are conducted in both the clinical and basic sciences, and decisions are based on the results of these experiments and associated statistical analyses. The concepts that support or refute these decisions and generalizations must be understood by the astute consumer of medical literature.

Research starts with developing a question that is important to a particular area of investigation or clinical population; then the study population is defined, and the most appropriate outcome measures and variables are selected. It is important that the research team collaborate so that their combined expertise can contribute to the study aims.

This chapter describes some important concepts to consider in designing a research study and in analyzing and interpreting results.

Section 2 Common Research Designs and Research Terminology

I PROSPECTIVE STUDIES

II RETROSPECTIVE STUDIES

III LONGITUDINAL STUDIES

IV OBSERVATIONAL RESEARCH DESIGNS

    These designs can be prospective, retrospective or longitudinal (Figures 13-1 and 13-2). Common observational designs are as follows:

Case reports

Case series

Case-control studies

Cohort studies

Cross-sectional studies

V RESEARCH TERMINOLOGY

Instrument validity: the ability of a measure, test, or instrument to accurately represent truth and reality; established by comparison with a "gold standard."

Instrument reliability: the reproducibility of a test or measure with repeated measurements and in similar situations.

Clinical studies can be designed to determine superiority of one treatment over another, whether one treatment is no worse than another (noninferiority), or whether both treatments are equally effective (equivalency).

Clinical research can be designed to assess outcomes data that are reported by the patient (subjective) or collected by an examiner (objective).

VI EXPERIMENTAL RESEARCH DESIGNS

VII POTENTIAL PROBLEMS WITH RESEARCH DESIGNS

Internal validity concerns the quality of a research design and how well the study is controlled and can be reproduced. External validity concerns the ability of the results to generalize to a whole population of interest.

Confounding variables are factors extraneous to a research design that potentially influence the outcome. Apparent cause-effect relationships may actually be explained by confounding variables rather than by the treatment or intervention being studied; confounders must therefore be controlled or accounted for.

Bias is unintentional, systematic error that threatens the internal validity of a study. Sources of bias include selection of subjects (sampling bias), loss of subjects to follow-up (nonresponder bias), observer/interviewer bias, and recall bias.

Protection against these threats can be achieved through randomization (i.e., random allocation to one or more treatments), which helps ensure that bias and confounding factors are distributed equally among the study groups. Single blinding (either the examiner or the patient is unaware of the study condition to which the patient is assigned) or double blinding (both examiner and patient are unaware of the assignment) is important for minimizing bias.

Control groups can help account for the potential placebo effect of interventions.

Control subjects are often matched on the basis of specific characteristics (e.g., gender, age), which helps account for potential confounding variables that may influence the interpretation of research findings.

The strongest research design involves random allocation, blinding, and the use of concurrent control subjects who are matched to the experimental group(s).
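As a minimal sketch of random 1:1 allocation (subject IDs and the seed are hypothetical; real trials use concealed allocation schedules), using Python's standard `random` module:

```python
import random

# Hypothetical subject IDs for a two-arm study
subjects = list(range(1, 21))

# Seeded generator so this sketch is reproducible
rng = random.Random(42)
rng.shuffle(subjects)

# 1:1 allocation: first half to treatment, second half to control
treatment_group = sorted(subjects[:10])
control_group = sorted(subjects[10:])
```

Because every subject has an equal chance of landing in either arm, known and unknown confounders tend to be distributed evenly between groups as the sample grows.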

VIII DESCRIPTIVE AND CONTROLLED LABORATORY STUDIES

These studies are common in basic science research, but they involve many of the same concepts as clinical research, and similar statistical and design methods are used to protect against sources of bias and confounding.

Section 4 Concepts of Epidemiology

I DEFINITIONS

    Epidemiology is the study of the distribution and determinants of disease. The following measures are commonly used in this type of research.

Prevalence is the proportion of existing injury or disease cases within a particular population.

Incidence (absolute risk) is the proportion of new injuries or disease cases within a specified time interval (a follow-up period is required).

Relative risk (RR) is a ratio between the incidences of an outcome in two cohorts. Typically, a treated or exposed cohort (in the numerator of the ratio) is compared with an untreated or unexposed (control) group (in the denominator of the ratio). Values can range from 0 to infinity and are interpreted as follows: an RR of 1 indicates no difference in risk between the cohorts, an RR greater than 1 indicates increased risk in the exposed cohort, and an RR less than 1 indicates decreased risk in the exposed cohort.

Odds ratios are calculated from the odds of an outcome in two cohorts: the odds of the outcome in the treated or exposed cohort divided by the odds in the control cohort.

Interpreting relative risk and odds ratio: values near 1 indicate no association between exposure and outcome; the farther a value lies from 1 in either direction, the stronger the association. When the outcome is rare, the odds ratio approximates the relative risk.
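As an illustration (all counts are hypothetical), both measures can be computed directly from a 2×2 table of outcomes by cohort:

```python
# Hypothetical 2x2 counts: outcome yes/no in each cohort
exposed_yes, exposed_no = 30, 70      # treated/exposed cohort
control_yes, control_no = 10, 90      # untreated/unexposed (control) cohort

# Incidence (absolute risk) of the outcome in each cohort
risk_exposed = exposed_yes / (exposed_yes + exposed_no)   # 0.30
risk_control = control_yes / (control_yes + control_no)   # 0.10

# Relative risk: ratio of the incidences (exposed cohort in the numerator)
relative_risk = risk_exposed / risk_control               # 3.0

# Odds ratio: ratio of the odds of the outcome in the two cohorts
odds_ratio = (exposed_yes / exposed_no) / (control_yes / control_no)  # ~3.86
```

Note that the odds ratio (about 3.86) exceeds the relative risk (3.0) here because the outcome is not rare in the exposed cohort.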

II CLINICAL USEFULNESS OF DIAGNOSTIC TESTS

A 2×2 contingency table (Figure 13-3) can be used to tabulate the occurrences of a disease or outcome of interest among patients whose diagnostic test results were positive or negative.

Analysis of diagnostic ability

1. Sensitivity: the proportion of patients who have the disease whose test result is positive; calculated as TP/(TP + FN).

2. Specificity: the proportion of patients who do not have the disease whose test result is negative; calculated as TN/(TN + FP).

3. Positive predictive value: the probability that a patient with a positive test result actually has the disease; calculated as TP/(TP + FP).

4. Negative predictive value: the probability that a patient with a negative test result actually does not have the disease; calculated as TN/(TN + FN).

5. Likelihood ratio: the degree to which a test result changes the probability of disease; the positive likelihood ratio is sensitivity/(1 − specificity), and the negative likelihood ratio is (1 − sensitivity)/specificity.

6. Receiver operating characteristic (ROC) curves are graphic representations of the overall clinical utility of a particular diagnostic test that can be used to compare accuracy of different tests in diagnosing a particular condition (Figure 13-4).
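The diagnostic measures above follow directly from the four cells of a 2×2 contingency table; a sketch with hypothetical counts:

```python
# Hypothetical 2x2 contingency table for a diagnostic test
tp, fn = 90, 10   # patients WITH the disease: true positives, false negatives
fp, tn = 20, 80   # patients WITHOUT the disease: false positives, true negatives

sensitivity = tp / (tp + fn)   # 0.90: proportion of diseased patients testing positive
specificity = tn / (tn + fp)   # 0.80: proportion of disease-free patients testing negative

ppv = tp / (tp + fp)           # probability of disease given a positive result
npv = tn / (tn + fn)           # probability of no disease given a negative result

lr_positive = sensitivity / (1 - specificity)   # how much a positive result raises suspicion
lr_negative = (1 - sensitivity) / specificity   # how much a negative result lowers it
```

Unlike sensitivity and specificity, the predictive values depend on the prevalence of disease in the sample being tested.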

Section 5 Statistical Methods for Testing Hypotheses

Statistical tests are chosen to match the purpose of a particular research study. Statistical analyses differ according to the goals of the researcher: for example, comparing groups to identify differences or establishing relationships between variables (Table 13-1).

Table 13-1

Decision-Making Guide for Choosing Common Parametric and Nonparametric Statistical Tests According to the Desired Study Purpose

Desired Analysis | Parametric Statistics* | Nonparametric Statistics†
Comparison of two groups | |
 Paired | Dependent (paired)-samples t-test | Wilcoxon test
 Unpaired | Independent-samples t-test | Mann-Whitney U test
Comparison of three or more groups | |
 One outcome variable | Analysis of variance (ANOVA) | Kruskal-Wallis test
 Repeated observations in same patient | Repeated-measures ANOVA | Friedman test
 Multiple dependent variables | Multivariate analysis of variance (MANOVA) | —
 Analysis including a covariate | Analysis of covariance (ANCOVA) | —
Establishing relationship or association | Pearson product-moment correlation coefficient | Spearman rho correlation coefficient
Prediction | |
 From one predictor variable | Simple regression | Logistic regression
 From two or more predictor variables | Multiple regression | —
Comparisons of categorical data | |
 Two or more variables | Chi-square | Chi-square
 Better for low sample size | Fisher exact test | Fisher exact test

*Appropriate for normally distributed continuous data.

†Alternative tests appropriate for nonnormally distributed data, small sample sizes, or both.

I SAMPLING AND GENERAL TERMINOLOGY

A population consists of all individuals who share a specific characteristic of clinical or scientific interest. Parameters describe the characteristics of a population.

Random sampling affords all members of a specific population an equal chance of being studied or enrolled in a clinical study.

Sample populations are representative subsets of the whole population. Statistics describe the characteristics of a sample and are intended to be generalized to the whole population.

Populations are delimited on the basis of inclusion and exclusion criteria that are established before a study starts.

Types of data collected from samples: continuous data (infinitely many possible values) or categorical data (a finite set of possible values); categorical data can be binary (only two options), ordered (e.g., a scale of intensity or severity), or unordered (e.g., race).

Data can be plotted in frequency distributions (histograms) to summarize basic characteristics of the study sample.

Cutoff points: Continuous data are often converted into categorical or binary data through the use of cutoff points. Cutoff points can be arbitrary or evidence based.
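A minimal sketch of applying a cutoff point (the measurements are hypothetical; the cutoff of 30 for body mass index is the commonly used obesity threshold):

```python
# Hypothetical continuous measurements (BMI) converted to binary data
# using an evidence-based cutoff point of 30
bmi_values = [22.5, 31.2, 27.8, 35.0, 29.9]
obese = [bmi >= 30 for bmi in bmi_values]   # [False, True, False, True, False]
```

Converting continuous data this way simplifies analysis and reporting but discards information: values of 29.9 and 30.1 fall into different categories despite being nearly identical.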

II DESCRIPTIVE STATISTICS

A data distribution histogram describes the frequency of occurrence of each data value. Distributions can be summarized with descriptive statistics such as the following:

Characteristics of a dataset (Figure 13-5): measures of central tendency (mean, median, mode) and of variability (standard deviation).

Confidence intervals quantify the precision of the mean or of another statistic, such as an odds ratio or relative risk.
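As a sketch of a 95% confidence interval for a mean (the scores are hypothetical, and the normal critical value 1.96 is used; a t critical value is more appropriate for small samples):

```python
import math
import statistics

# Hypothetical sample of outcome scores
scores = [72, 75, 78, 80, 81, 83, 85, 88, 90, 94]

mean = statistics.mean(scores)                 # 82.6
sd = statistics.stdev(scores)                  # sample standard deviation
se = sd / math.sqrt(len(scores))               # standard error of the mean

# Approximate 95% CI: mean plus or minus 1.96 standard errors
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se
```

A narrower interval reflects a more precise estimate, which is achieved with larger samples or less variable measurements.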

III INFERENTIAL STATISTICS

Inferential statistics are used to test specific hypotheses about associations or differences among groups of subjects or samples.

The dependent variable is what is being measured as the outcome. There can be multiple dependent variables, depending on how many outcome measures are desired.

The independent variables include the conditions or groupings of the experiment that are systematically manipulated by the investigator.

Inferential statistics can be generally divided into parametric and nonparametric statistics.

1. Parametric inferential statistics are appropriate for continuous data and rely on the assumption that data are normally distributed.

2. Nonparametric inferential statistics are appropriate for categorical data and for continuous data that are not normally distributed.

3. The goal of inferential statistics is to estimate parameters.

IV WHICH TEST TO USE

    The decision on what statistical test to use is based on several factors inherent in research designs.

Some important distinctions:

When two groups of data are compared, the t-test is used. There are two variations of the t-test: the independent-samples t-test for unpaired groups and the dependent (paired)-samples t-test for matched groups or for repeated measurements in the same subjects.

When three or more groups are compared, an analysis of variance is used. This is also known as the F-test.

1. Analysis of variance (ANOVA) is appropriate when three or more groups of continuous, normally distributed data are used.

2. Repeated-measures ANOVA is a variation of the ANOVA that is appropriate for sequential measurements recorded from the same subjects.

3. Multivariate analysis of variance (MANOVA), a variation of the ANOVA, is used when multiple dependent variables are compared among three or more groups.

4. Analysis of covariance (ANCOVA) is an appropriate test when confounding factors must be accounted for in the statistical analysis.

5. Post hoc testing is necessary after a significant ANOVA result to determine the exact location of differences among groups.

6. Factorial designs are used for multiple independent variables.
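The two-group comparison described above can be sketched in pure Python (the scores are hypothetical; the pooled-variance form assumes similar variability in both groups):

```python
import math
import statistics

# Hypothetical outcome scores for two independent groups
group_a = [12.0, 14.5, 11.8, 13.2, 15.1, 12.9]
group_b = [10.2, 11.0, 9.8, 12.1, 10.5, 11.3]

na, nb = len(group_a), len(group_b)
mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Pooled variance combines the variability of the two groups
pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)

# Independent-samples t statistic; compared against a t distribution
# with na + nb - 2 degrees of freedom to obtain the P value
t_stat = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))
```

A larger absolute t statistic (here roughly 3.8) corresponds to a smaller P value, making rejection of the null hypothesis more likely.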

Correlation and regression

1. Correlation coefficients

2. Simple linear regression

3. Multiple linear regression describes the ability of several independent (predictor) variables to predict a dependent variable.

4. Logistic regression is used when the outcome is categorical (typically binary); the predictor variables can be either categorical or continuous.
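The Pearson product-moment correlation coefficient from the list above can be computed directly from its definition (the paired observations are hypothetical):

```python
import math

# Hypothetical paired observations on the same subjects
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Pearson r: covariance divided by the product of the standard deviations
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
r = cov / math.sqrt(ss_x * ss_y)   # close to +1: strong positive linear association
```

Values of r range from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).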

Statistical tests for categorical data

Section 6 Important Concepts in Research and Statistics

I STATISTICAL ERROR

II PROBABILITY (P) VALUES

Inferential test statistics (e.g., t statistic, F statistic, r coefficient) are accompanied by a probability (P) value. These values range from 0 to 1 and indicate the probability that the observed differences or relationships among study data occurred by chance.

P values less than 0.05 mean there is less than a 5% chance that the observed difference or relationship has occurred by chance alone and not because of the study intervention.

Typical threshold for “statistical significance”: P value = 0.05 or less (type I error may occur in 5 of 100 tests)

Therefore, in accordance with the P value, the null hypothesis (that no difference or association exists) either is rejected (P < 0.05) or fails to be rejected (P > 0.05).

Bonferroni correction to the P value: when multiple comparisons are performed, the significance threshold is divided by the number of comparisons so that the overall chance of type I error remains at the desired level.
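A sketch of the Bonferroni correction (the P values and number of comparisons are hypothetical):

```python
# Bonferroni correction: divide the significance threshold by the
# number of comparisons so the overall type I error rate stays at alpha
alpha = 0.05
n_comparisons = 4
corrected_alpha = alpha / n_comparisons        # 0.0125

# Hypothetical P values from four post hoc comparisons
p_values = [0.001, 0.020, 0.013, 0.300]
significant = [p < corrected_alpha for p in p_values]   # only the first survives
```

Note that two comparisons (P = 0.020 and P = 0.013) would be "significant" at the uncorrected 0.05 threshold but not after correction, which is exactly the inflation of type I error the correction guards against.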

III STATISTICAL POWER AND ESTIMATING SAMPLE SIZE

Research studies should have enough subjects or samples in order to obtain valid results that can be generalized to a population while minimizing unnecessary risk.

Sample size estimates are based on the desired statistical power (these estimates are often termed power analyses).

Sample sizes are justified as the number of subjects needed for researchers to find a statistically significant difference or association (i.e., P ≤ 0.05) while statistical power is maintained at 80% or higher.

Larger sample sizes or highly precise measurements (lower variability) are necessary to detect small differences between study groups.

Power analyses can be done before the study starts (a priori) or after the study has been completed (post hoc).

1. Studies with low power have a higher likelihood of missing statistical differences when they actually exist (i.e., type II error).

2. To understand the power of a statistical test, the following aspects should be considered: the number of subjects in the study, the effect size (see subsection V, Effect Sizes), the acceptable level of type I error (usually 5%; i.e., P ≤ 0.05), and an estimate of variability in the data.

3. The power of a statistical test (the likelihood of finding statistical differences when they exist) increases with more subjects, greater treatment effect, and lower variability among the data.

4. Researchers often design studies to maximize the potential for response to a particular treatment by using stringent inclusion and exclusion criteria and selecting a measurement device or outcome instrument that is more precise and accurate (Figure 13-6).
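An a priori sample-size estimate for comparing two group means can be sketched under the normal approximation (α, power, and effect size below are illustrative choices; an exact t-based calculation gives a slightly larger n):

```python
import math
from statistics import NormalDist

# Illustrative design choices: two-sided alpha, desired power, and
# expected effect size d in standard-deviation units
alpha, power, d = 0.05, 0.80, 0.5

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for a two-sided test
z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power

# Normal-approximation formula for n per group
n_per_group = math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)   # 63
```

Halving the expected effect size d quadruples the required sample size, which is why detecting small differences demands large studies or very precise measurements.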

IV MINIMAL CLINICALLY IMPORTANT DIFFERENCES

V EFFECT SIZES

Testable Concepts

Section 2 Common Research Designs and Research Terminology

• Observational studies can be prospective or retrospective. Common designs include case series (patients with a common injury or disease), case-control studies (similar to case series but with a defined control group, typically retrospective), cohort (defined groups of subjects to monitor over time), and cross-sectional (measurements taken on a single occasion with no retrospective or prospective review). Case reports are descriptions of a single, unique observation of a patient.

• Incidence can be calculated from prospective or longitudinal study designs because a follow-up period is required. Prevalence is calculated from cross-sectional designs to describe injury distribution at a particular time point.

• Odds ratios and relative risks can be calculated from clinical studies that are designed to determine associations between risk factor exposure and patient outcomes.

• Clinical trials are experimental studies in which a research hypothesis is tested through a specific intervention.

• The “gold standard” of experimental research designs is the randomized, blinded, and controlled clinical trial. These design types require greater time and resources; however, findings from a well-designed randomized controlled trial are considered highly influential.

• Bias is unintentional, systematic error that threatens the internal validity of a study. Randomization, matching, blinding, and using control conditions are methods to protect against the numerous forms of bias.

Section 4 Concepts of Epidemiology

• Common epidemiologic measures of the distribution and determinants of disease include prevalence, incidence, odds ratio, and relative risk.

• Prevalence is the proportion of existing injuries or disease cases within a particular population. Incidence is the proportion of new injuries or disease cases within a specified time interval.

• Relative risk and odds ratio describe the risk and odds of a particular outcome of interest between two groups: typically a group in which subjects are treated or exposed and a reference or control group. Relative risk is calculated as the ratio between the incidence rates of an outcome in two cohorts.

• Reliability is the reproducibility of a test or measure; similarly, precision is the repeatability of the results. Validity is the ability of a measure, test, or instrument to represent truth and reality; similarly, accuracy describes the ability of a test to differentiate between correct and incorrect outcomes.

• Sensitivity is a ratio (true-positive test results divided by the number of patients with disease) that describes the proportion of patients who actually have a disease or condition and whose diagnostic test result is positive. Because highly sensitive tests (Sn) yield few false-negative results, a negative (N) result would confidently rule “OUT” the condition of interest (mnemonic: “SnNOUT”).

• Specificity is a ratio (true-negative test results divided by the number of patients without disease) that describes the proportion of patients who do not have the disease or condition and whose diagnostic test result is negative. Because highly specific tests (Sp) yield few false-positives, a positive (P) test would confidently rule “IN” the condition of interest (SpPIN).

• Like specificity and sensitivity, likelihood ratios, positive predictive values, and negative predictive values are calculated from 2×2 contingency tables and can describe the accuracy of diagnostic tests.

Section 5 Statistical Methods for Testing Hypotheses

• Data from research studies can be continuous (infinitely many possible values) or categorical (a finite set of possible values); the latter type of data can be binary (only two options), ordered (e.g., a scale of intensity or severity), or unordered (e.g., race).

• Descriptive statistics include mean, median, mode, and standard deviation.

• Confidence intervals provide a range of values around a point estimate (e.g., mean, relative risk, odds ratio) that describe the level of confidence in the ability of the study data to accurately describe truth.

• Statistical tests can be parametric (appropriate for normally distributed continuous data) or nonparametric (appropriate for skewed data, categorical data, or small sample sizes). Each parametric test has a nonparametric equivalent.

• For comparing two groups of normally distributed data, the independent-samples t-test is used (the paired-samples t-test if the groups are matched or if measures are recorded in the same individuals over time). For comparisons of three or more groups, ANOVA is used (repeated-measures ANOVA for repeated measures, ANCOVA if there is a covariate, MANOVA if there are multiple dependent variables). The Pearson product-moment correlation coefficient is used for correlations.

• The chi-square test is used for comparing categorical data. When sample sizes are small, the Fisher exact test is used.

• Among nonparametric tests, the Mann-Whitney U test is used for comparing two unpaired groups; the Wilcoxon test, for paired groups; the Friedman test, for comparisons of three or more groups; and the Spearman rho correlation coefficient, for correlations.

Section 6 Important Concepts in Research and Statistics

• Type I error (α error, false-positive error) is the probability that a statistical test result is wrong when the null hypothesis is rejected (i.e., concluding that groups are different when they actually are not). Researchers are willing to accept this error in 5% of tests.

• Type II error (β error, false-negative error) is the probability that a statistical test result is wrong when the test fails to reject the null hypothesis (i.e., concluding that two groups are not different when they actually are different).

• P values lower than 0.05 mean that the probability that the observed difference or relationship has occurred by chance alone, and not because of the study intervention, is less than 5%.

• Sample size estimates are used to determine the necessary number of subjects or observations needed for statistically significant results and are based on the desired statistical power (these estimates are often termed power analyses).

• Statistical power is the probability of finding differences among groups when differences actually exist (i.e., avoiding type II error). Higher sample sizes or highly precise measurements (lower variability) are needed in order to find small differences between study groups.

• Effect sizes are used to describe the magnitude of a treatment effect. They are calculated as the difference between treatment groups divided by the standard deviation (typically pooled standard deviation, or the standard deviation of the reference or control group).
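The effect-size calculation described above (here using the pooled standard deviation, with hypothetical group data) can be sketched as:

```python
import math
import statistics

# Hypothetical outcome scores for treatment and control groups
treatment = [24.0, 26.5, 25.2, 27.8, 23.9, 26.1]
control = [21.0, 22.4, 20.6, 23.1, 21.8, 22.0]

n_t, n_c = len(treatment), len(control)
mean_t, mean_c = statistics.mean(treatment), statistics.mean(control)
var_t, var_c = statistics.variance(treatment), statistics.variance(control)

# Pooled standard deviation across the two groups
pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))

# Effect size: group difference expressed in pooled-SD units
effect_size = (mean_t - mean_c) / pooled_sd
```

Because the effect size is expressed in standard-deviation units, it allows treatment effects to be compared across studies that used different outcome scales.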

Chapter 13 Review Questions

ANSWER 1: B. This description is of a case-control study in which patients are matched on the basis of age, weight, and level of arthritis. There is no intervention, so it is not a clinical trial. A case series is an observational design for a particular patient population. A case report is an observational design describing an occurrence of a unique medical finding or outcome.

ANSWER 2: A. The validity of a clinical instrument or test is the ability to accurately represent truth or reality. Validity is established by comparison with a “gold standard.” Reliability is the ability to consistently describe a particular characteristic with repeated measurements and in similar situations. Validity and reliability are not related to determining statistical significance.

ANSWER 3: B. Incidence is the proportion of new injuries or disease cases within a specified time interval. Therefore, a follow-up period is needed to calculate rate as a measure of new cases per unit time. Prevalence is the proportion of existing injuries or disease cases within a particular population at a particular moment in time. Reliability is the consistency of observations by the same rater (intrarater reliability) or among multiple raters (interrater reliability).

ANSWER 4: C. Statistical comparison between two groups is performed with an independent-samples t-test. The paired-samples (also referred to as dependent-samples) t-test is appropriate for within-subject comparisons or for matched-group comparisons. Repeated-measures ANOVA is used to compare sequential measurements recorded in the same subjects. The Spearman rho correlation coefficient is used to calculate relationships between two sets of categorical or nonnormally distributed continuous data. Logistic regression is used for prediction with categorical data.

ANSWER 5: A. P (probability) values describe the probability that a test statistic occurred by chance. If the test statistic occurred by chance, then it would be “wrong” to say the relationship was real or “statistically significant.” It is generally acceptable to commit this error (type I error) 5% of the time or less (indicated by P = 0.05). If P = 0.04, then there is a 4% chance of committing a type I error, so it is acceptable to say the test is “statistically significant.” P values do not describe statistical power, rate of type II errors, or clinical importance.
