
CHAPTER 11 Neurosurgical Epidemiology and Outcomes Assessment

In this chapter we focus on the application of statistical and epidemiologic principles to neurosurgical diagnosis, measurement, outcomes assessment and improvement, and critical review skills, which together provide the skill set necessary to develop an evidence-based approach to clinical practice.

To design a rational treatment plan for any patient, a physician must identify that patient’s most likely diagnosis, the most likely natural course of events without intervention, the potential risks and benefits of the available therapies, and how the expected course with treatment differs from the natural history of the process. Each aspect of this process has strong mathematical and statistical underpinnings. An evidence-based approach then requires the application of experimental clinical data to the individual patient, a concept known as generalization.

These skills are intended to be tools for the practicing clinician and those who perform clinical as opposed to basic science research. Taken together, they are studied academically under the rubric of clinical epidemiology, which differs in scope and purpose from the epidemic investigation and population-based studies of health and disease that characterize the more traditional field of survey epidemiology.

Concepts—Sources of Error in Analysis

Bias

All research must begin with a question. Research design is the process of constructing clinical experiments that provide true answers to the research question. Two types of errors threaten this process: random error (noise) and structural error (bias). The first is a result of the natural variability among subjects in their response to illness, treatment, or both. Random error, given a large enough sample, applies to all groups in a study equally. Usually, although not always, it makes it harder for an experiment to show a difference between experimental and control groups. Adopting the language of radio transmissions, this is the noise that obscures the signal, which is the answer that the investigator or reader is after. This type of error is predictable and, to a certain extent, measurable. Statistical analyses, from simple t-tests to complex analyses of variance, are intended to quantify random error.

Structural error, or bias, in contrast, is error that tends to apply specifically to one group of subjects in a study and is a consequence of intentional or unintentional features of the study design itself. This is error designed into the experiment; it is not measurable or controllable by statistical techniques, but it can be minimized with proper experimental design. For example, if we are determining the normal head size of term infants by measuring the head size of all such patients born over a 1-week period, we have a high probability of introducing random error into our reported value because we are considering a small sample. We can measure this tendency of random error in the form of a confidence interval around the reported head size. The larger the confidence interval, the higher the chance that our reported head size differs from the true average head size of our population of interest. We can solve this problem by increasing the sample size without any other change in experimental technique. In contrast, if we choose to conduct our study only in the newborn intensive care unit, with exclusion of the well-child nursery, and report our measurements as pertaining to newborn children in general, we have introduced a structural error, or bias, into our experimental design. No statistical test that we can apply to our data set will alert us that this is occurring, nor will measures such as increasing the number of patients in our study help solve the problem. In considering the common study designs enumerated later, note how different types of studies are subject to different biases and how the differences in study design arise out of an attempt to control for bias.1

Control of Confounding Variables

Common to several of the study methodologies to be noted later is the concept of confounding variables. In studies assessing disease causation, natural history, and treatment efficacy, it is assumed that a number of factors may influence the outcome of interest. The presence of these factors, or confounders, alters the mathematical relationship of the risk factor of interest to the outcome. For example, in a cohort study attempting to assess the impact of smoking on stroke rates, hypertension would be considered a probable confounding variable that might obscure the true relationship between smoking and stroke. There are six basic strategies for dealing with confounders: restriction, matching, and randomization at the design stage, and stratification, standardization, and multivariable adjustment at the analysis stage.

Noise—Statistical Analysis in Outcomes Assessment

The principal role of statistical analysis is to describe and quantify the naturally occurring variations in the world around us. Chance variation in experimentation comes from several sources, but principally from the variability inherent in the study subjects and the variability associated with measurement. The latter can be minimized by choosing the appropriate measurement tool, particularly one with a high degree of clinical agreement associated with its use. Appropriately used statistical analysis answers the question of whether the observed differences or relationships among groups are beyond what would be expected based on the intrinsic variability in members of each group. The choice of statistical test depends on the type of data being analyzed, be it continuous or categorical, and whether the underlying population from which the data are drawn can be assumed to have a “normal” or bell-shaped distribution. For continuous data with a normal distribution, t-tests and other similar tests are appropriate, whereas for non-normally distributed data, the nonparametric Wilcoxon or Mann-Whitney tests are often used. Categorical data are usually described by using contingency tables, which can be assessed with χ2 methodology or summarized with odds ratios.
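As a concrete illustration, the sketch below (hypothetical data; the scipy library is assumed) pairs each data type with the tests named above: a t-test for normally distributed continuous data, the Mann-Whitney test as the nonparametric alternative, and a χ2 test with an odds ratio for a 2 × 2 contingency table.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(50, 10, 40)   # continuous, roughly normal
treated = rng.normal(55, 10, 40)

# Continuous, normally distributed data: two-sample t-test
t_stat, p_t = stats.ttest_ind(control, treated)

# Continuous data when normality cannot be assumed: Mann-Whitney test
u_stat, p_u = stats.mannwhitneyu(control, treated)

# Categorical data: 2 x 2 contingency table assessed with chi-square
table = np.array([[30, 10],    # outcome present/absent, group A
                  [18, 22]])   # outcome present/absent, group B
chi2, p_c, dof, expected = stats.chi2_contingency(table)

# The same table summarized as an odds ratio: (a*d)/(b*c)
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
print(p_t, p_u, p_c, odds_ratio)
```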

Alpha and Beta Error and Power

In hypothesis testing, two types of statistically rooted error can occur. The first, or type I error, represents the chance that we will incorrectly conclude that a difference exists between groups when it does not. This is traditionally fixed at 5%. That is, in selecting P < .05 for rejection of the null hypothesis, we understand that 5% of the time a sample from the control population will differ from the population mean by this amount, and we may thus incorrectly conclude that control and experimental groups differing by this amount came from different populations. The 5% is an arbitrary value, and the investigator can choose any level that is reasonable and convincing in terms of the study hypothesis.

The second, more common error is a type II or β error.2 This represents the probability of incorrectly concluding that a difference does not exist between therapies when in fact it does. This is analogous to missing the signal for the noise. Here, the key issue is the natural degree of variability among study subjects, independent of what might be due to treatment effects. When small but real differences exist between groups and there is enough variability among subjects, the study requires a large sample size to be able to show a difference. Missing the difference is a β error, and the power of a study is 1 − β. Study power is driven by sample size. The larger the sample, the greater the power to resolve small differences between groups.

Frequently, when a study fails to show a difference between groups, there is a tendency to assume that there is in fact no difference between them.3 What is needed is an assessment of the likelihood of a difference being detected, given the sample size and subject variability. Although the authors should provide this, they frequently do not, thereby leaving readers to perform their own calculations or refer to published nomograms.4,5 Arbitrarily, we say that a power of 80%, or the chance of detecting a difference of a specified amount given a sample size of n, is acceptable. If the penalties for missing a difference are high, other values for power can be selected, but the larger the power desired, the larger the sample size required.
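Readers who prefer to compute rather than consult nomograms can perform the calculation directly; the following minimal sketch assumes the statsmodels library and an illustrative standardized effect size (Cohen's d).

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a small standardized effect
# (d = 0.3) with alpha = 0.05 and the conventional power of 0.80
n_per_group = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.80)

# Conversely, the power actually achieved with only 30 subjects per group
achieved_power = analysis.solve_power(effect_size=0.3, alpha=0.05, nobs1=30)

print(round(n_per_group), round(achieved_power, 2))  # ~175 per group, ~0.2
```

A study of 30 subjects per group that "fails to show a difference" of this size has done little to exclude it, which is precisely the β-error problem described above.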

Statistics and Probability in Diagnosis

The diagnosis of neurosurgical illness, as in other fields, requires the development of a list of possible explanations for the patient’s complaints to form a differential diagnosis. As an example, imagine the scenario of a hydrocephalic child with his first shunt placed 3 months earlier who now has a 2-day history of vomiting and irritability. An experienced clinician hearing this story would quickly formulate a list of possible diagnoses, including otitis media, gastroenteritis, constipation, upper respiratory infection, and of course, shunt malfunction. Through a series of additional questions about the medical history and a physical examination, the clinician would be able to narrow this list down considerably to one or two possibilities. Additional tests such as computed tomography (CT) or a shunt tap would then be used to either support or refute particular diagnoses.

Pretest and Posttest Probability—A Bayesian Approach

At the outset, certain quickly ascertainable features about a clinical encounter should lead to the formation of a baseline probability of the most likely diagnoses for the particular clinical findings. In the example just presented, among all children seen by a neurosurgeon within 3 months of initial shunt placement, about 30% will subsequently be found to have shunt malfunction.6 Before any tests are performed or even a physical examination completed, elements of the history provided allow the clinician to form this first estimate of the likelihood of shunt failure. The remainder of the history and physical findings allow revision of that probability, up or down. This Bayesian approach to diagnosis requires some knowledge of how particular symptoms and signs affect the baseline, or the “pretest” probability of an event. The extent to which a particular clinical finding or diagnostic study influences the probability of a particular end diagnosis is a function of its sensitivity and specificity, which may be combined into the clinically more useful likelihood ratio.7 Estimates of the pretest probability of disease tend to be situation specific; for example, the risk of failure of a cerebrospinal fluid (CSF) shunt after placement is heavily dependent on the age of the patient and the time elapsed since the last shunt-related surgical procedure.8 Individual practitioners can best provide their own data for this by studying their own practice patterns.

Properties of Diagnostic Tests

Elements of the history, physical examination, and subsequent diagnostic studies can all be assessed to describe their properties as predictors of an outcome of interest. These properties may be summarized as sensitivity, specificity, positive and negative predictive values, and likelihood ratios, as illustrated in Table 11-1.

Sensitivity indicates the percentage of patients with a given illness who have a positive clinical finding or study. In other words, that a straight leg raise test has a sensitivity of 80% for lumbar disk herniation in the setting of sciatic pain means that 80 of 100 patients with sciatica and disk herniation do, in fact, have a positive straight leg raise test.9 In Table 11-1, sensitivity is equal to a/(a + c). Conversely, specificity, equal to d/(b + d) in Table 11-1, refers to the proportion of patients without the diagnosis who also do not have the clinical finding. Again, using the example of the evaluation of low back pain in patients with sciatica, the straight leg raise test has about 40% specificity, which means that of 100 patients with sciatica but without disk herniation, 40 will have a negative straight leg raise test.

Sensitivity and specificity are difficult to use clinically because they describe clinical behavior in a group of patients known to carry the diagnosis of interest. These properties provide no information about performance of the clinical factor in patients who do not have the disease and who probably represent the majority of patients seen. Sensitivity ignores the important category of false-positive results (Table 11-1, B), whereas specificity ignores false-negative results (Table 11-1, C). If the sensitivity of a finding is very high (e.g., >95%) and the disease is not overly common, it is safe to assume that there will be few false-negative results (C in Table 11-1 will be small). Thus, the absence of a finding with high sensitivity for a given condition will tend to rule out the condition. When the specificity is high, the number of false-positive results is low (B in Table 11-1 will be small), and the presence of a symptom will tend to rule in the condition. Epidemiologic texts have suggested the mnemonics SnNout (when sensitivity is high, a negative or absent clinical finding rules out the target disorder) and SpPin (when specificity is very high, a positive study rules in the disorder), although some caution may be in order when applying these mnemonics strictly.10

However, when sensitivity or specificity is not as high or a disease process is common, the more relevant clinical quantities are the number of patients with the symptom or study result who end up having the disease of interest, otherwise known as the positive predictive value [PPV = a/(a + b)]. The probability of a patient not having the disease when the symptom or study result is absent or negative is referred to as the negative predictive value [NPV = d/(c + d)]. Subtracting NPV from 1 gives a useful “rule out” value that provides the probability of a diagnosis even when the symptom is absent or the study result negative.

A key component of these latter two values is the underlying prevalence of the disease. Examine Table 11-2. In both the common disease and rare disease case, the sensitivity and specificity for the presence or absence of a particular sign are each 90%, but in the common disease case, the total number of patients with the diagnosis is much larger, thereby leading to much higher positive and lower negative predictive values. When the prevalence of the disease drops, the important change is the relative increase in the number of false versus true positives, with the reverse being true for false versus true negatives.
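The arithmetic behind Table 11-2 is simple enough to verify directly. The following sketch (illustrative numbers only) holds sensitivity and specificity at 90% and varies only the prevalence, reproducing the collapse of the positive predictive value for a rare disease.

```python
def predictive_values(sens, spec, prevalence, n=10_000):
    """Build the 2 x 2 table of Table 11-1 for a cohort of size n."""
    diseased = prevalence * n
    healthy = n - diseased
    a = sens * diseased        # true positives
    c = diseased - a           # false negatives
    d = spec * healthy         # true negatives
    b = healthy - d            # false positives
    ppv = a / (a + b)          # positive predictive value
    npv = d / (c + d)          # negative predictive value
    return ppv, npv

for prev in (0.30, 0.01):      # a common vs. a rare disease
    ppv, npv = predictive_values(sens=0.90, spec=0.90, prevalence=prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.2f}, NPV {npv:.3f}")
# prevalence 30%: PPV 0.79, NPV 0.955
# prevalence 1%:  PPV 0.08, NPV 0.999
```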

The impact of prevalence plays a particularly important role when one tries to assess the role of screening studies in which there is only limited clinical evidence that a disease process is present. Examples include the use of screening magnetic resonance imaging studies for cerebrovascular aneurysms in patients with polycystic kidney disease11 and routine use of CT for follow-up after shunt insertion12 and for determining cervical spine instability in patients with Down syndrome,13 as well as to search for cervical spine injuries in polytrauma patients.14,15 In these situations, the low prevalence of the disease makes false-positive results likely, particularly if the specificity of the test is not extremely high, and usually results in additional, potentially more hazardous and expensive testing.

Likelihood ratios (LRs) are an extension of the previously noted properties of diagnostic information. LRs express the odds (rather than the percentage) of a patient with a target disorder having a particular finding present as compared with the odds of the finding in patients without the target disorder. An LR of 4 for a clinical finding indicates that it is four times as likely for a patient with the disease to have the finding as for those without the disease. The advantage of this approach is that likelihood ratios, which are based only on sensitivity and specificity, do not have to be recalculated for any given prevalence of a disease. In addition, if one starts with the pretest odds of a disease, this may simply be multiplied by the LR to generate the posttest odds of the disease. Nomograms exist to further simplify this process.16 A number of useful articles and texts are available to the reader for further explanation.17,18 Table 11-3 gives examples of the predictive values of selected symptoms and signs encountered in clinical neurosurgical conditions.9,19–24
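The update itself takes only a few lines, as the following sketch shows using the earlier shunt example with its 30% pretest probability of malfunction; the sensitivity and specificity of the test here (and hence its LRs) are hypothetical, not taken from the cited literature.

```python
def posttest_probability(pretest_p, lr):
    """Convert probability to odds, apply the LR, convert back."""
    pretest_odds = pretest_p / (1 - pretest_p)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

sens, spec = 0.85, 0.80            # hypothetical test properties
lr_positive = sens / (1 - spec)    # LR+ = 4.25
lr_negative = (1 - sens) / spec    # LR- ~ 0.19

print(posttest_probability(0.30, lr_positive))  # ~0.65 if the test is positive
print(posttest_probability(0.30, lr_negative))  # ~0.07 if the test is negative
```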

Measurement in Clinical Neurosurgery

Neurosurgeons have become increasingly aware of the need to offer concrete scientific support for their medical practices. Tradition and reasoning from pathophysiologic principles, although important in generating hypotheses, are not a substitute for the scientific method. Coincident with this move toward empiricism has been the recognition that traditional measures of outcome, such as mortality and morbidity rates, are insufficient to capture important changes in functional status. Similarly, traditional outcomes as recorded in the medical record lack precision and reproducibility. Aware that better outcome assessments are required to inform treatment choices, we must ask two important questions: What should we measure? How should we measure it?

Surrogate versus True End Points

In determining what should be measured, one starting point is the question, “What does the patient care about?” Even though this may seem obvious, in the settings of research, quality control, and economic analysis, outcomes that are not clearly of interest to the patient are regularly used. For example, fusion rates have commonly been used for assessing the outcomes of spinal surgery. Yet this is clearly a substitute or surrogate end point for what the patient cares about—that the pain be relieved or function improved. Improvement in one does not always equate with improvement in the other. Other neurosurgical examples of surrogate end points include shunt revision rates, extent of surgical resection (e.g., in an 80-year-old with an acoustic neuroma, the degree of resection may be only partially correlated with overall patient outcome), and CSF flow analysis after posterior fossa decompression. In many cases, an argument can be made that changes in the surrogate measure translate into tangible improvements for the patient. However, there remain numerous situations in which that relationship falters and misses important outcomes for the patient.

Surrogate or intermediate end points clearly have a role early in the investigation of a therapy. They are usually chosen because they are easily quantifiable and generally reproducible. However, the evidence used to justify the widespread application of a technology or procedure should include a broader assessment of the overall impact of the intervention on the target population.

Techniques of Measurement

Quantifying clinical findings and outcomes remains one of the largest stumbling blocks in performing and interpreting clinical research. Although outcomes such as mortality and some physical attributes such as height and weight are easily assessed, many of the clinical observations and judgments that we make are much more difficult to measure. All measurement scales should be valid, reliable, and responsive. A scale is valid if it truly measures what it purports to measure. It is reliable when it reports the same result for the same condition, regardless of who applies it or when they do so. It is responsive when meaningful changes in the item or event being measured are reflected as changes in the scale.25 The complexity of neurologic assessment has led to a proliferation of such measurement scales.26

Validity

The validity of a scale can be assessed along both logical and numerical lines. Face validity assesses whether a scale appears on the surface to measure its targeted outcome. For example, a scale to measure outcome with respect to activities of daily living that focused predominantly on physical pain measurement would have questionable face validity. Although a scale would not deliberately be designed to miss its target, this can happen when an established scale is used to assess a different disease process.

Content validity is a similar concept that arises out of scale development. In the course of developing scales, a series of domains are determined. For example, again referring to the activities of daily living, the functional independence measure has domains of self-care, sphincter control, mobility and transfer, locomotion, communication, and social cognition.27 The more completely that a scale addresses the perceived domains of interest, the better its content validity.

Construct validity refers to the degree to which the scale correlates with predictions based on the understanding of the disease process itself. A pathophysiologic understanding of a disease process should generate hypotheses about the performance of a scale in particular clinical settings. For example, if an instrument is developed to assess patients with tethered cord syndrome and then applied in a screening setting to a population of children with incontinence, the few patients found with tethering lesions should score significantly higher on the scale than children with other causes of incontinence.

Criterion validity refers to performance of the new scale in comparison to a previously established standard. Simplified shorter scales that are more readily usable in routine clinical practice should be expected to have criterion validity when compared with longer, more complex research-oriented scales.28

Reliability

To be clinically useful, a scale should give a consistent report when applied repeatedly to a given patient. Reliability and discrimination are intimately linked in that the smaller the error attributable to measurement, the finer the resolving power of the measurement instrument. Measurement in part involves determining differences, so the discriminatory power of an instrument is of critical importance. Although often taught otherwise, reliability and agreement are not the same thing. An instrument that produces a high degree of agreement and no discrimination between subjects is, by definition, unreliable. This need for reproducibility and discrimination is dependent not only on the nature of the measurement instrument but also on the nature of the system being measured. An instrument found reliable in one setting may not be so in another if the system being measured has changed dramatically. Mathematically, reliability is equal to subject variability divided by the sum of subject and measurement variability.25 Thus, the larger the variability (error) caused by measurement alone, the lower the reliability of a system. Similarly, for a given measurement error, the larger the degree of subject variability, the less the impact of measurement error on overall reliability. Reliability testing involves repeated assessments of subjects by the same observers, as well as by different observers.

Reliability is typically measured by correlation coefficients, with values ranging between 0 and 1. Although there are several different statistical methods of determining the value of the coefficient, a common strategy involves the use of analysis of variance (ANOVA) to produce an intraclass correlation coefficient (ICC).29 The ICC may be interpreted as representing the percentage of total variability attributable to subject variability, with 1 being perfect reliability and 0 being no reliability. Reliability data reported as an ICC can be converted into a range of scores expected simply because of measurement error. The square root of (1 − ICC) gives the percentage of the standard deviation of a sample expected to be associated with measurement error. A quick calculation brings the ICC into perspective. For an ICC of 0.9, the error associated with measurement is (1 − 0.9)^0.5, or about 30% of the standard deviation.25 Reliability measures can also be reported with the κ statistic, discussed more fully in the next section as a measure of clinical agreement.30
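As an illustration, the one-way ANOVA calculation of the ICC can be written in a few lines; the ratings matrix below is hypothetical (two observers each scoring five subjects), and the one-way random-effects form of the ICC is assumed.

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects ICC(1,1) from a subjects x raters matrix."""
    n, k = ratings.shape
    subj_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-subjects and within-subject mean squares from one-way ANOVA
    ms_between = k * ((subj_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - subj_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

scores = np.array([[8, 7], [5, 6], [9, 9], [3, 4], [6, 6]])
icc = icc_oneway(scores)

# Fraction of the sample standard deviation attributable to measurement error
error_fraction = (1 - icc) ** 0.5
print(round(icc, 2), round(error_fraction, 2))
```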

Internal consistency is a related property of scales that are used for measurement. Because scales should measure related aspects of a single dimension or construct, responses on one question should have some correlation to responses on another question. Although a high degree of correlation renders a question redundant, very limited correlation raises the possibility that the question is measuring a different dimension or construct. This has the effect of adding additional random variability to the measure. There are a variety of statistics that calculate this parameter, the best known of which is Cronbach’s alpha coefficient.31 Values between 0.7 and 0.9 are considered acceptable.25
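Cronbach's alpha is similarly direct to compute from an item-response matrix; the data here are hypothetical (five subjects answering three items intended to measure one construct).

```python
import numpy as np

def cronbach_alpha(items):
    """items: subjects x items matrix of scores on a single construct."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

responses = np.array([[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2], [3, 3, 3]])
print(round(cronbach_alpha(responses), 2))  # 0.7-0.9 is the acceptable range
```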

Clinical Agreement

The success of measurement instruments and clinical decision rules such as those described earlier depends on the degree to which two or more clinicians can agree on the observed findings. Observations that are difficult to reproduce from examiner to examiner are of much less value in predicting the presence of significant injury, illness, or subsequent prognosis. Clinical agreement can be divided into intrarater and interrater reliability, depending on the likelihood of one individual identifying the same findings in a series of assessments done at different times or the probability of two independent evaluators arriving at the same conclusion given the same material to observe.

One of the common statistical ways to measure agreement is the κ statistic.32,33 It attempts to compensate for the likelihood of chance agreement when only a few possible choices are available. If two observers each identify the presence of a clinical finding 70% of the time in a particular patient population, by chance they will agree on the finding 58% of the time (Table 11-4). If they actually agree 80% of the time, the κ statistic for this clinical parameter is the actual agreement beyond chance (80% − 58%) divided by the potential agreement beyond chance (100% − 58%). This is equal to 0.52. Interpretation of these κ values is somewhat arbitrary, but by convention, κ values of 0 to 0.2 indicate slight agreement; 0.2 to 0.4, fair agreement; 0.4 to 0.6, moderate agreement; 0.6 to 0.8, substantial agreement; and 0.8 to 1.0, high agreement.34
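The worked example above translates directly into code; this sketch simply restates the arithmetic for two observers who each call the finding present 70% of the time and actually agree 80% of the time.

```python
def kappa(observed_agreement, p_positive_a, p_positive_b):
    """Agreement beyond chance, scaled by potential agreement beyond chance."""
    # Chance agreement: both observers positive plus both observers negative
    chance = (p_positive_a * p_positive_b
              + (1 - p_positive_a) * (1 - p_positive_b))
    return (observed_agreement - chance) / (1 - chance)

print(round(kappa(0.80, 0.70, 0.70), 2))  # chance = 0.58, kappa ~ 0.52
```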

The concept of clinical agreement can be broadened to apply not only to individual symptoms, signs, or tests but also to whole diagnoses; for example, Kestle and colleagues reported that a series of objective criteria for establishing the diagnosis of shunt malfunction have a high degree of clinical agreement (κ = 0.81) with the surgeon’s decision to revise the shunt.35

The κ statistic is commonly used to report the agreement beyond that expected by chance alone and is easy to calculate. However, concern has been raised about its mathematical rigor and its usefulness for comparison across studies (for a detailed discussion see http://ourworld.compuserve.com/homepages/jsuebersax/kappa.htm#start).36,37

Responsiveness

Measurement is largely a matter of discrimination among different states of being, but change needs to be both detected and meaningful. Between 1976 and 1996, Bracken and associates performed a series of randomized controlled trials (RCTs) to evaluate the role of corticosteroids in the management of acute spinal cord injury.38–40 In the last two trials, the main outcome measure was performance on the American Spinal Injury Association (ASIA) scale. Both studies demonstrated small positive benefits in preplanned subgroup analyses for patients receiving methylprednisolone. Subsequent to their publication, however, controversy arose regarding whether the degree of change demonstrated on the ASIA scale is clinically meaningful and justifies the perceived risk of increased infection rates with the therapy. The scale itself assesses motor function in 10 muscle groups on a 6-point scale from 0 (no function) to 5 (normal function and resistance). Sensory function is assessed in 28 dermatomes for both pinprick and light touch. In the first National Acute Spinal Cord Injury Study (NASCIS 1), the net benefit of methylprednisolone was a mean motor improvement of 6.7 versus 4.8 on a scale of 0 to 70. What is unclear from the scale is whether changes of this degree produce any significant functional improvement in a spinal cord injury patient.41–43 Presumably seeking to address this question, the last of these trials included a separate functional assessment, the functional independence measure. This multidimensional instrument assesses the need for assistance in a variety of activities of daily living. A change in this measure, at least on its face, represents a more obvious change in functional state than does a change on the ASIA scale. Indeed, a small improvement in the sphincter control subgroup of the scale was seen in the NASCIS 3 trial, although considering the scale as a whole, no improvement was seen.38

Measurement Instruments

A large number of measurement instruments are now in regular use for the assessment of neurosurgical outcomes. Table 11-5 lists just a few of these,44–84 as do other summaries.85 Several types of instruments are noted. First, a number of measures focus on physical findings, such as the Glasgow Coma Scale (GCS), Mini-Mental State Examination, and House-Brackmann score for facial nerve function. These measures are specific to a disease process or body system. A second group of instruments focuses on the functional limitations produced by the particular system insult. A third group of instruments looks at the role that functional limitations play in the general perception of health. Instruments that assess health status or health perceptions, such as the 36-item short-form health survey (SF-36) or the Sickness Impact Profile, measure in multiple domains and are thought of as generic instruments applicable to a wide variety of illnesses. Other instruments include both general and disease-specific questions. The World Health Organization defines health as “a state of complete physical, mental, and social well-being and not merely the absence of disease or infirmity.”86 In this context, it is easy to see the importance of broader measures of outcome that assess quality of life. Finally, economic measures such as health utilities indices are designed to translate health outcomes into monetary terms.87

TABLE 11-5 Common Measurement Instruments in Neurosurgery

DISEASE PROCESS/SCALE FEATURES REFERENCE
Head Injury
Glasgow Coma Scale (GCS) Examination scale: Based on eye (1-4 points), verbal (1-5 points), and motor assessment (1-6 points) 44
Glasgow Outcome Scale 1, Dead; 2, vegetative; 3, severely disabled; 4, moderately disabled; 5, good 45
DRS (disability rating scale) Score of 0 (normal) to 30 (dead); incorporates elements of the GCS and functional outcome 46
Rancho Los Amigos I, No response; II, generalized response to pain; III, localizes; IV, confused/agitated; V, confused/nonagitated; VI, confused/appropriate; VII, automatic/appropriate; VIII, purposeful/appropriate 47
Mental Status Testing
Folstein’s Mini-Mental State Examination 11 Questions: orientation, registration (repetition), attention (serial 7s), recall, language (naming, multistage commands, writing, copying design) 48, 49
Cognitive Testing
Wechsler Adult Intelligence Scale (WAIS) The standard IQ test for adults, with verbal (information, comprehension, arithmetic, similarities, digit span, vocabulary) and performance domains (digit symbols, picture completion, block designs, picture arrangement, object assembly) 50, 51
Epilepsy
Engel class 52
Movement Disorders
Unified Parkinson’s Disease Rating Scale Subscales for mentation/mood (0-16), activities of daily living (0-52), motor examination, including speech and tremor (0-108), and complications from therapy (0-23) 53
Ashworth scale (spasticity) 54
Vascular
Hunt & Hess 0, Unruptured; 1, mild headache or nuchal rigidity; up to 5, deep coma 55, 56
WFNS grading scale 0, Unruptured; 1, GCS score of 15 without major focal deficit; up to 5, GCS score of 3 with or without deficit 57
NIH Stroke Scale 15-item scale: level of consciousness, visual symptoms, facial and limb weakness, ataxia, sensory loss, neglect, and language dysfunction 58
Rankin Disability Scale, modified (RDS) Modified version shown; the original version had no grade 0, and its grade 1 was equivalent to grade 2 of the modified version 59, 60
Cranial Nerve
House/Brackmann scale for facial nerve function 0, Normal function; to 6, complete paralysis; based on appearance during casual observation, appearance at rest, and appearance with motion. Cut point at 3 vs. 4 for complete eye closure 61
Spine Trauma
ASIA classification Radicular motor function at 5 upper extremity and 5 lower extremity levels (0-5 each), plus light touch (0-2) and pinprick (0-2) for each of 28 levels. Maximum (L + R) motor = 100, sensory = 112 per modality 62
Frankel/ASIA impairment 63
Peripheral Nerve Trauma
Medical Research Council grading system Scores strength 0-5; similar to ASIA motor grading 64
Degenerative Spine
North American Spine Society questionnaire Separate cervical and lumbar instruments; combines disease-specific questions and the SF-36. Normative data published 65
Oswestry low back pain disability questionnaire 10 domains, including pain intensity, personal care, lifting, walking, sitting, standing, sleeping, sex life, social life, and traveling; 6-item scale per domain 66
Japanese Orthopedic Association scale Reported for cervical myelopathy: scores of 0-2 to 0-4 for motor function in arm and leg; sensory function in arm, leg, and trunk; and sphincter dysfunction. Higher scores signify greater function 67
Hydrocephalus
HOQ (Hydrocephalus Outcome Questionnaire) 51-question disease-specific multidimensional outcome measure developed and validated for pediatric hydrocephalus. Responses scored to yield an overall score from 0 to 1; can be converted to a health utility score 68, 69
Craniofacial
Whitaker grade 70
Oncology
Karnofsky performance scale 100-point scale (scored by 10s) from 100 (normal) to 0 (dead) that measures degree of dependence; <70 is no longer independent 71
EORTC QLQ-C30 Multidimensional quality-of-life measure used for glioma outcomes 72, 73
University of Toronto 16 items from the Sickness Impact Profile plus 13 items specific to brain tumor patients, with an overall assessment question; each question answered on a visual analog-type scale between descriptive extremes 72
Functional
Functional independence measure (FIM) 7-point scale from independence to total assistance applied to 6 ADL areas: self-care (eating, grooming, bathing, dressing, and toilet), sphincter control, transfers, locomotion (walking and stairs), communication, and social cognition 74
WeeFIM Modification of the FIM for pediatric patients 75, 76
Barthel index 10-item (or 15-item in the Granger modification) score addressing ADLs (feeding, transfers, toiletry, etc.). Each item scored as dependent vs. independent, with several items at an intermediate grade. Score of 0-100 (0-20 for the modification) 77, 78
Pain
McGill Pain Questionnaire Very commonly used. Adjectival description of pain assessing three domains: sensory-discriminative, motivational-affective, and cognitive-evaluative. The adjective selected by the patient carries an intensity weighting. Several scoring systems exist 79
Visual analog scale Ruler scale of 0-10 for pain 80
General/Multidimensional
SF-36 and shorter forms 36 questions. Domains: physical activity, social activity, societal role, pain, mental health, emotions, vitality, and health perceptions. Can be self-administered; current testing involves Web-based applications. Scored out of 100 for each of the 8 domains, with higher scores indicating better health. Physical and mental summary scores also available 81
Sickness Impact Profile 136 questions. Domains/categories: physical (ambulation, mobility, body care and movement) and psychosocial (social interaction, communication, alertness behavior, emotional behavior), plus sleep and rest, eating, home management, recreation and pastimes, and employment 82
Nottingham Health Profile 38 questions. Domains: physical mobility, energy level, pain, emotional reactions, sleep, and social isolation. Can be combined into a summary measure 83, 84

ADLs, activities of daily living; ASIA, American Spinal Injury Association; EORTC, European Organization for Research and Treatment of Cancer; NIH, National Institutes of Health; SF-36, short-form health survey (36 items); WFNS, World Federation of Neurosurgical Societies.

Choosing a Specific Measure

From the standpoint of study design, the choice of measures is driven by the study question and context. When a study focuses on a narrow clinical question, focused measurement instruments are appropriate. However, the more varied the interventions considered and the broader the generalizations that the authors wish to draw, the more generic the outcome measure should be.

Functional outcome measures assess the impact of physiologic changes on more general function. Evidence from these sources may be more compelling than changes in physiologic parameters (see discussion of the NASCIS trials earlier). Even more broadly, health status measures place the loss of function in the context of the patient’s global health picture. These measures are particularly useful when comparing across disease processes or comparing healthy and diseased populations. A common strategy is to incorporate both system/disease-specific and generic outcome measures in the same study. Authors and readers must be clear which of these represents the primary outcome measure of the study. When the purpose of the study is to measure the economic impact of a therapy or its “utility” in an economic context, summary measures that report a single score rather than a series of scores for different domains are necessary. In many cases it is appropriate to have the quality-of-life measure as the principal outcome assessment tool. For example, in assessments of therapy for glioblastoma multiforme, efforts to extend survival have thus far been only minimally successful. Studies focusing on maximizing quality of life with the available modalities are often the best ones to provide information on therapeutic decision making.88

Rating scales and quality-of-life assessment measures need not be limited to research studies. Familiarity with these measures can lead to their introduction into general clinic practice and thereby give the practitioner a more effective way to communicate with others and assess individual results. The “Guidelines for the Management of Acute Cervical Spine and Spinal Cord Injuries” offers as a guideline that the functional independence measure be used as a clinical assessment tool by practitioners when assessing patients with spinal cord injury.89 Practical considerations dictate that instruments used in the conduct of routine care be short and feasible to perform. However, ad hoc modification or simplification of instruments to suit individual tastes or needs voids the instruments’ validation and reduces comparability with other applications. Access to many of the outcome measures listed in Table 11-5 is available through the Internet.

Specific Study Designs

Clinical researchers and clinicians using clinical research must be familiar with the various research designs available and commonly used in the medical literature. To better understand the features of study design, it is helpful to consider the factors that limit the ability of a study to answer its research question. In this context, it is easier to understand the various permutations of clinical research design. Elementary epidemiology textbooks broadly divide study designs into descriptive and analytic studies (Table 11-6).90 The former seek to describe health or illness in populations or individuals but are not suited to describing causal links or establishing treatment efficacy. Comparisons can be made with these study modalities only on an implicit, or population, basis. Thus, a death rate may be compared with a population norm or with death rates in other countries. Analytic or explanatory studies have the specific purpose of direct comparison between assembled groups and are designed to yield answers concerning causation and treatment efficacy. Analytic studies may be observational, as in case-control and cohort studies, or interventional, as in controlled clinical trials. In addition to being directed at different questions, different study designs are more or less robust in managing the various study biases that must be considered as threats to their validity (Table 11-7).1,91

TABLE 11-6 Study Designs

  EXAMPLE LIMITATIONS (TYPICAL)
Descriptive Studies
Population correlation studies Rate of disease in population vs. incidence of exposure in population No link at the individual level, cannot assess or control for other variables. Used for hypothesis generation only
  Changes in disease rates over time No control for changes in detection techniques
Individuals
Case reports and case series Identification of rare events, report of outcome of particular therapy No specific control group or effort to control for selection biases
Cross-sectional surveys Prevalence of disease in sample, assessment of coincidence of risk factor and disease at a single point in time at an individual level “Snapshot” view does not allow assessment of causation, cannot assess incident vs. prevalent cases. Sample determines degree to which findings can be generalized
Descriptive cohort studies Describes outcome over time for specific group of individuals, without comparison of treatments Cannot determine causation, risk of sample-related biases
Analytic Studies
Observational
Case control Disease state is determined first. Identified control group retrospectively compared with cases for presence of particular risk factor Highly suspect for bias in selection of control group. Generally can study only one or two risk factors
Retrospective cohort studies Population of interest determined first, outcome and exposure determined retrospectively Uneven search for exposure and outcome between groups. Susceptible to missing data, results dependent on entry criteria for cohort
Prospective cohort studies Exposure status determined in a population of interest, then monitored for outcome Losses to follow-up over time, expensive, dependent on entry criteria for cohort
Interventional
Dose escalation studies (phase I) Risk for injury from dose escalation Comparison is between doses, not vs. placebo. Determines toxicity not efficacy
Controlled nonrandom studies Allocation to different treatment groups by patient/clinician choice Selection bias in allocation between treatment groups
Randomized controlled trials Random allocation of eligible subjects to treatment groups Expensive. Experimental design can limit generalizability of results
Meta-analysis Groups randomized trials together to determine average response to treatment Limited by quality of original studies, difficulty combining different outcome measures. Variability in base study protocols

After Hennekens C, Buring J. Epidemiology in Medicine. Boston: Little, Brown; 1987.

TABLE 11-7 Common Biases in Clinical Research

BIAS NAME EXPLANATION
Sampling Biases
Prevalence-incidence Drawing a sample of patients late in a disease process excludes those who have died of the disease early in its course. Prevalent (existing) cases may not reflect the natural history of incident (newly diagnosed) cases
Unmasking In studies investigating causation, an exposure may cause symptoms that in turn prompt a more diligent search for the disease of interest. An example might be a medication that causes headaches, which leads to the performance of more magnetic resonance imaging studies and results in an increase in the diagnosis of arachnoid cysts in patients taking the medication. The conclusion that the medication caused arachnoid cysts would reflect unmasking bias
Diagnostic suspicion A predisposition to consider an exposure as causative prompts a more thorough search for the presence of the outcome of interest
Referral filter Patients referred to tertiary care centers are often not reflective of the population as a whole in terms of disease severity and comorbid conditions. Natural history studies are particularly prone to biases of this sort
Chronologic