PRINCIPLES OF NEUROPSYCHOMETRIC ASSESSMENT

Published on 10/04/2015 by admin

Last modified 22/04/2025

CHAPTER 2 PRINCIPLES OF NEUROPSYCHOMETRIC ASSESSMENT

It might legitimately be asked why this chapter was written by both a neurologist and a neuropsychologist. The answer, in part, is that a neurologist who has worked closely with neuropsychologists is perhaps in the best position to interpret the discipline to his or her colleagues; neuropsychology is often a “black box” to neurologists, to a greater extent than neuropsychologists themselves may realize. This can lead to uncritical acceptance of neuropsychologists’ conclusions without the productive interaction that characterizes, for example, neuroradiological review sessions. At the other extreme, the real added value of expert neuropsychological assessment may be discounted by those unconvinced of its validity. In any event, the value of neuropsychological assessment is considerably increased when the neurologist requesting it understands its strengths, limitations and pitfalls, and the sort of data on which its conclusions are based.

COGNITIVE DOMAINS AND NEUROPSYCHOLOGICAL TESTS

Cognitive Domains

Cognitive domains are constructs (intellectual conceptualizations to explain observed phenomena, such as gravity) invoked to provide a coherent framework for analysis and testing of cognitive functions. The various cognitive processes in each domain are more or less related and are more or less independent of processes in other domains. Although these domains do not have strict, entirely separable neuroanatomical substrates, they do each depend on particular (but potentially overlapping) neural networks.¹ In view of the way in which cognitive domains are delineated, it is not surprising that there is some variation in their stated number and properties, but commonly recognized ones with their potential neural substrates are listed in Table 2-1.

TABLE 2-1 Commonly Assessed Cognitive Domains and Their Potential Neural Substrate

Domain	Main Neural Substrate
Attention	Ascending reticular activating system, superior colliculus, thalamus, parietal lobe, anterior cingulate cortex, and the frontal lobe
Language	Classical speech zones, typically in the left dominant hemisphere, including Wernicke’s and Broca’s areas, and the angular gyrus
Memory	Hippocampal-entorhinal cortex complex
	Frontal regions
	Left parietal cortex
Object recognition (visual)	Ventral visual system: occipital regions to anterior pole of temporal lobe
Spatial processing	Posterior parietal cortex, frontal eye fields, dorsal visual system
Spatial processing	Inferotemporal/midtemporal and polar temporal cortex
Executive functioning	Frontal-subcortical circuits, including dorsolateral prefrontal, orbital frontal, and anterior cingulate circuits

Neuropsychological Assessment of Individual Cognitive Domains

In practice, although many neuropsychological tests assess predominantly one domain, very few in routine clinical use are pure tests of a single domain, and almost all can be performed poorly for several reasons. For example, inability to copy the Rey Complex Figure might result from impairments of volition, comprehension, planning, or praxis, as well as from the most obvious cause, some form of visual impairment. Assignment of particular tests to particular domains is therefore to some extent arbitrary; many tests are capable of being assigned to more than one domain. The interested reader is referred to Lezak and colleagues (2004) and Spreen and Strauss (1998) for details of individual tests. Multidimensional tests, such as the Mini-Mental State Examination (MMSE), may be subjected to factor analysis. This type of analysis identifies groupings of test items that correlate with each other and may well assess aspects of the same domain. In this way, the range of domains assessed by such a test may be identified.

Prerequisites for Meaningful Testing

Adequate testing within some domains requires that some others are sufficiently intact. For example, a patient whose sustained, focused attention (concentration) is severely compromised by a delirium is unable to register a word list adequately. Consequently, delayed recall is impaired, even in the absence of a true amnesia or its usual structural correlates. A patient with sufficiently impaired comprehension may perform poorly on the Wisconsin Card Sorting Test because the instructions were not understood, rather than because hypothesis generation was compromised. These considerations give rise to the concept of a pyramid of cognitive domains, with valid testing at each level dependent on the adequacy of lower-level performance² (Fig. 2-1).

Figure 2-1 The pyramid of cognitive domains. An unconscious or inattentive patient is not able to comprehend test instructions, even though the relevant linguistic networks may be intact. A patient with severely impaired comprehension may not understand the test instructions for praxis, for example, and so forth.

In addition to intact attention and comprehension, patient performance may be compromised by poor motivation—for example, as a result of depression or in the setting of potential secondary gain—or by anxiety. Neurological impairments (e.g., poor vision, ataxia), psychiatric comorbid conditions, preexisting cognitive impairments (e.g., mental retardation), specific learning difficulties or lack of education (e.g., resulting in illiteracy), and lack of mastery of the testing language can all interfere with valid testing and must be carefully considered by the neuropsychologist in interpreting test results.³

BASIC PRINCIPLES OF PSYCHOMETRICS

Test Reliability

For a neuropsychological test (or any other test) to be clinically useful, it must be both reliable and valid. A reliable test is one for which differences in scores reflect true differences in what is being measured, rather than random variation (“noise”) or systematic bias (e.g., consistent differences between test scores at different centers). The reliability coefficient of a test is the proportion of total test result variability that is attributable to true differences in test results. It may also be conceptualized as the variability that would remain after multiple administrations of a test resulted in random variations that canceled each other out, with no systematic bias assumed. (An analogy familiar to neurologists would be electronic averaging in evoked potentials.) Reliability coefficients of standard neuropsychological tests typically vary from about 0.70 (acceptable) to 0.95 (high).

Reliability may be assessed in a number of ways. Test-retest reliability accounts for both random variability resulting from the test itself and systematic bias resulting from practice effects, although it cannot enable the clinician to easily distinguish between the two. It presupposes a stable test population, which may be an unattainable ideal over longer periods of time, inasmuch as acute pathological conditions such as results of strokes and traumatic injuries tend to improve and degenerative conditions tend to worsen. The internal consistency of a multi-item test can be gauged by split-half reliability, whereby scores from half the test items are compared with scores from the other half (but this leaves moot how the division is performed), or by calculating the mean reliability coefficient obtained from all possible split-half comparisons. The latter strategy generates a statistic called Cronbach’s α. Sometimes, alternative (parallel) versions of tests are constructed, often in order to facilitate serial testing in an effort to avoid practice effects. The reliabilities of the different versions can then be compared, in a process very similar to split-half reliability testing. The difficulty, of course, is in knowing whether the two versions really are equivalent, so that variance between the two represents unreliability rather than differences in difficulty or in the variable or variables actually being measured.
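As a rough illustration of internal consistency, the sketch below computes Cronbach’s α with the usual variance-based formula, which avoids enumerating every split half. The function name and the item scores are hypothetical and chosen only to show the arithmetic; they do not come from any real test.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table of scores.

    item_scores: one row per subject, each row a list of item scores.
    Uses alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("at least two items are needed")
    # Variance of each item across subjects
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of each subject's total score
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 subjects, 3 items
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 5, 4],
    [3, 3, 3],
]
alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Items that rise and fall together across subjects push α toward 1; items that vary independently of each other pull it toward 0.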

Interrater reliability accounts for the variation in test scores resulting from administration by different testers. This is clearly important particularly in multicenter studies and is an essential property for semiquantitative clinical rating scales.

The importance of test reliability underlies the importance of test administration with standardized materials in a standardized manner and a conducive environment, and by appropriately trained personnel (e.g., not by an intern in a noisy ward).

Test Validity

A valid test measures what it is purported to measure. Whereas an unreliable test cannot be valid (as score variations reflecting true differences in the intended measured variable are concealed by noise or systematic bias), reliability itself is no guarantee of validity. Consideration of the following test of semantic knowledge illustrates this point:

1. What is 2 + 3?

2. Which city is the capital of the USA?

3. How many seconds are in a minute?

4. What was the maiden name of Charcot’s maternal grandmother?

All readers presumably score 75% on this test, which is therefore absolutely reliable but quite invalid as a test of semantic knowledge.

Validity can be gauged in a number of ways. Criterion validity reflects the utility of the test in decision making. Perhaps the ideal form of criterion validity is predictive validity, in which test results are used to make a decision or prediction, such as which patients with amnestic mild cognitive impairment will convert to Alzheimer’s disease, and the validity of the decision is subsequently established on follow-up. Such studies tend to be long and expensive, however, and so other methods of assessing validity are often required.

Concurrent validity, another form of criterion validity often used instead, involves comparing test results with a nontest parameter of relevance; for example, scores on a test of sustained, directed attention in children might be compared with their class disciplinary records. Ecological validity, a related concept, reflects the predictive value of the test for performance in real-world situations. For example, neuropsychological tests of visual attention and executive function, but not of other domains, have been found to have reasonable ecological validity for predicting driving safety, in comparison with the “gold standard” of on-road testing.⁴

Construct validity assesses whether, for example, a test purportedly of a particular cognitive domain is correlated with other established tests of that particular domain and functions as tests of that domain are expected to function.

Content validity concerns checking the test items against the boundaries and content of the domain (or portion of the domain) to be assessed. Face validity exists when, to a layperson (such as the subject undergoing testing), a test seems to measure what it is purported to measure. Thus, a driving simulator has good face validity as a test of on-road safety, whereas an overlapping figures test of figure/ground discrimination may not, even though it may actually be relevant to perceptual tasks during driving. More detailed discussions of reliability and validity are given by Mitrushina and associates (2005), Halligan and colleagues (2003), and Murphy and Davidshofer (2004).

Symptom Validity Testing

Symptom validity testing is rather different and is used as a method to reveal nonorganic deficits (e.g., malingering). It relies on the fact that patients with no residual ability in a domain, who are forced to respond to items randomly or by guessing, can nevertheless sometimes be correct by chance. Performance that is statistically significantly worse than chance can be explained only by some retention of ability in that domain, with that ability being used (consciously or unconsciously) to produce incorrect answers. This forced-choice/statistical analysis concept should already be familiar to neurologists, as it underlies much of psychophysically correct sensory testing. (A fine example is the University of Pennsylvania Smell Identification Test [UPSIT] for evaluation of olfaction.⁵)
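The forced-choice logic can be sketched with a one-tailed binomial calculation: how likely is a score this low, or lower, if the patient were purely guessing? The 50-item, two-alternative test and the score of 15 are hypothetical values used only for illustration.

```python
from math import comb


def prob_at_most(correct, n_items, p_chance=0.5):
    """P(X <= correct) under pure guessing, X ~ Binomial(n_items, p_chance)."""
    return sum(
        comb(n_items, k) * p_chance**k * (1 - p_chance) ** (n_items - k)
        for k in range(correct + 1)
    )


# Hypothetical example: 15 correct out of 50 two-alternative items
p = prob_at_most(15, 50)
print(f"P(15 or fewer correct by guessing alone) = {p:.4f}")
```

A very small probability suggests that the score is lower than guessing would plausibly produce, which is the pattern that symptom validity testing looks for.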

Other methods for detecting nonorganic deficits also exist; they depend on recognition of deviation from the usual patterns of cognitive impairment (e.g., recognition memory’s being worse than spontaneous recall) or discrepancy between scores on explicit tests of a domain and behavior or other tests implicitly dependent on that domain (e.g., dysfluency appearing only when “language” is tested). This subject is covered in more detail elsewhere.⁶

Ceiling and Floor Effects

Two further difficulties may limit the use of neuropsychological measures: lack of discrimination across the range of abilities expected and practice effects on repeated testing. An ideal test would reveal a linear decline in ability in the tested domain, from the supremely gifted to the profoundly impaired. In practice, this is rarely, if ever, achieved. Some tests discriminate well between patients with different severities of obvious impairment but are problematic in attempts to detect subtle disorders and fail to stratify the normal population appropriately. This is known as a ceiling effect. On the other hand, some tests that are sensitive to subtle declines and capable of stratifying the normal population are too difficult for patients with more profound deficits, so that real differences in their residual abilities may be missed. This is a floor effect. Some tests have both ceiling and floor effects, leading to a sigmoid curve of scores versus ability. If patients with Alzheimer’s disease are assumed to decline at a constant rate, then on average, over time, the MMSE shows both ceiling and floor effects (Fig. 2-2).

Figure 2-2 The Mini-Mental State Examination (MMSE) shows both ceiling effects (flatter segment of graph, upper left) and floor effects (flatter segment of graph, to right) in comparison with its sensitivity to change in the middle stages of Alzheimer’s disease (AD).

(From Gauthier S, ed: Clinical Diagnosis and Management of Alzheimer’s Disease. London, UK: Martin Dunitz, 1996.)

Practice Effects

Practice effects arise when the act of taking a test more than once results in an improvement in the subject’s true score. Repeated assessments over time are often desirable, to determine whether a deficit is static or declining or to monitor treatment or recovery. Such serial assessment is virtually impossible with some tests because of practice effects. For example, once the patient has been exposed to the Wisconsin Card Sorting Test and has learned that the examiner periodically changes the correct sorting rule without divulging this to the patient, much of the challenge and novelty of this test is lost. Repeated exposure to a test, particularly over a short period, may result in overt learning. This probably accounts for the initial rise in MMSE and Alzheimer’s Disease Assessment Scale, Cognitive subscale (ADAS-Cog) scores in the placebo recipients in trials of cholinesterase inhibitors in Alzheimer’s disease, for example (Fig. 2-3).

Figure 2-3 Alzheimer’s Disease Assessment Scale, Cognitive subscale (ADAS-Cog) scores in a trial of an acetylcholinesterase inhibitor in Alzheimer’s disease. Diamonds represent scores of patients receiving placebo in the double-blind phase and inhibitor in the extension phase; squares represent scores of patients taking the inhibitor. A drop in score (negative values) represents an improvement on this test on which 0/70 is a perfect score. Note the improvement of about −½ at 6 weeks in the placebo recipients’ scores. In large part, this probably represents a practice effect.

(From Coyle J, Kershaw P: Galantamine, a cholinesterase inhibitor that allosterically modulates nicotinic receptors: effects on the course of Alzheimer’s disease. Biol Psychiatry 2001; 49: 289-299. Copyright 2001. Reprinted with permission from the Society of Biological Psychiatry.)

The use of alternative forms, such as the Crawford version of the Rey Auditory Verbal Learning Test, may overcome some of these difficulties,⁷ but not all alternative versions of tests really are equivalent (e.g., the Taylor alternative version of the complex figure is easier to recall than the original Rey Complex Figure itself; see Chapter 12 in Mitrushina et al [2005]). Furthermore, learning may be implicit (procedural), in such a way that patients become more proficient at a type of test with practice in the absence of conscious remembering. This may improve scores even on true alternative versions.

Comparison with Appropriate Normative Data

For tests to be useful in making clinical predictions about an individual, particularly on the first assessment, it is essential that individuals’ scores be compared with appropriate normative data. Many test scores in normal populations show systematic variation with demographic variables, such as age, years of education, and gender, and these must be accounted for before interpretation is possible. This may be done by using either stratified norms or regression equations with the relevant variables factored in. It is essential that the “normal” population sampled to provide the normative data is relevant to the patient being tested. For example, a stratification category of native English speakers “>60 years” with an average of 16 years of education is hardly an appropriate normative population against which to compare an 89-year-old patient with only 5 years of education who learned English as an adult immigrant. The selection of appropriate norms is covered in detail in the handbook by Mitrushina and associates (2005).

Testing patients whose language of preference is not that of the test (and the examiner) is particularly plagued with pitfalls: Direct translation on the spot introduces too much random variability. Versions in the target language must first be validated and norms established in that population, allowing for differences in word usage and familiarity. In the case of nonverbal tests, it must be shown that scores are equivalent in the different target groups. Even a carefully translated test may end up measuring something different from the original version.⁸ These considerations can considerably restrict test choice in assessment in these patients.

The effects of culture are even more insidious. Even populations with a common language and broadly similar cultures, such as Americans and Australians, cannot always be directly compared. For example, the Boston Naming Test, a commonly used confrontational naming assessment instrument, contains pictures of a pretzel and a beaver. Older Australians, without experience of either, tend to call the first a snake and the second a platypus!

Reporting of Test Scores

Once appropriate norms are identified, the test scores have to be reported in an intelligible manner. This is commonly done in terms of standard deviations from the mean for the appropriate normative sample, by using Z scores, T scores, or IQ scores. A Z score of −1½ would indicate a score 1½ standard deviations (SD) below average, whereas one of +2 would indicate a score 2 SD above average. T scores have the mean set at 50 and 1 SD set at 10. Hence, a T score of 70 is 2 SD above average. IQ scores have the mean set at 100, and 1 SD set at 15. Hence, an IQ score of 85 is 1 SD below average. Of course, reporting scores in terms of SD assumes that the measured variable is normally distributed in the population. This is not the case with, for example, the Boston Naming Test or the Rey Complex Figure Test copy, for which most normal subjects score near the maximum, so that results from the normal population are negatively skewed (see Mitrushina et al [2005]). Reporting by percentiles (the percentage of the normative population whose scores fall below that score level) avoids this difficulty: If a score is below the second percentile, it would indicate impairment, regardless of the distribution of test scores. On the other hand, for tests with a normal distribution of scores, reporting by percentiles does tend to overemphasize unimportant differences near the mean and underemphasize more extreme deviations. For example, the real difference between the first and the 10th percentile levels is likely to be much more important (and larger) than the real difference between the 41st and 50th percentiles. However, other tests have their own idiosyncratic scoring systems (e.g., MMSE: best possible score 30/30; ADAS-Cog: best possible score 0/70). Neuropsychologists also frequently report scores by bands with descriptive labels (such as superior, above average, borderline, etc.). The relationship between some of these various scoring methods for a test with normally distributed scores is shown in Figure 2-4.
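The conversions between these reporting scales are simple linear transformations of the Z score, and percentiles follow from the normal distribution. The sketch below assumes normally distributed scores; the function names are illustrative only.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1


def z_to_t(z):
    """T score: mean 50, SD 10."""
    return 50 + 10 * z


def z_to_iq(z):
    """IQ-type score: mean 100, SD 15."""
    return 100 + 15 * z


def z_to_percentile(z):
    """Percentage of a normal population scoring below z."""
    return 100 * STANDARD_NORMAL.cdf(z)


def percentile_to_z(percentile):
    """Z score below which the given percentage of a normal population falls."""
    return STANDARD_NORMAL.inv_cdf(percentile / 100)


for z in (-2.0, -1.5, 0.0, 1.0, 2.0):
    print(f"Z={z:+.1f}  T={z_to_t(z):.0f}  IQ={z_to_iq(z):.1f}  "
          f"percentile={z_to_percentile(z):.1f}")

# Percentile gaps of similar width can hide very different Z-score gaps
gap_low = percentile_to_z(10) - percentile_to_z(1)
gap_mid = percentile_to_z(50) - percentile_to_z(41)
print(f"1st to 10th percentile: {gap_low:.2f} SD")
print(f"41st to 50th percentile: {gap_mid:.2f} SD")
```

The last two lines illustrate the point about percentiles: a nine-point percentile step in the tail spans several times as many SD units as a nine-point step near the mean.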

Figure 2-4 The relationship between various ways of reporting test scores and the normal distribution. SD, standard deviation; WAIS III, Wechsler Adult Intelligence Scale–Third Edition.

(Adapted from The Psychological Corporation. Methods of expressing test scores. Test Service Notebook, April 1955, No. 1 48. Reproduced with permission of publisher, Harcourt Assessment, Inc.)

Decision Theory

The concepts of sensitivity, specificity, and, more particularly for decision making, positive and negative predictive value and likelihood ratio are as important for neuropsychological tests as for any other form of diagnostic testing in medicine. Their definitions are provided in Table 2-2. The important effect of base rate (prevalence) on these values must also be remembered. For example, the base rate of Alzheimer’s disease in 75-year-old patients with memory complaints is much higher than that in 45-year-olds worried that they are not staying on top of their jobs, so that poor performance on a brief verbal memory test has a much higher positive predictive value in the older population.
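The effect of base rate can be made concrete with Bayes’ rule. In the sketch below, the sensitivity, specificity, and two prevalence figures are hypothetical numbers picked for illustration; they are not estimates for any real memory test or population.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV from test characteristics and base rate (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical test: sensitivity 0.80, specificity 0.90
ppv_high_base_rate = positive_predictive_value(0.80, 0.90, prevalence=0.30)
ppv_low_base_rate = positive_predictive_value(0.80, 0.90, prevalence=0.01)

print(f"PPV at 30% prevalence: {ppv_high_base_rate:.2f}")
print(f"PPV at 1% prevalence:  {ppv_low_base_rate:.2f}")
```

With identical test characteristics, the same poor score is far more likely to reflect true disease in the high-prevalence group than in the low-prevalence group.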

TABLE 2-2 Definitions of Important Test Parameters

a = true positive scores

b = false positive scores

c = false negative scores

d = true negative scores

Prevalence = (a + c)/(a + b + c + d)

Sensitivity = a/(a + c)

Specificity = d/(b + d)

Positive predictive value = a/(a + b)

Negative predictive value = d/(c + d)

Likelihood ratio (positive) = posttest odds/pretest odds

= sensitivity/{1 − specificity}

= {a/(a + c)}/{1 − [d/(b + d)]}
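The definitions in Table 2-2 translate directly into code. The following sketch uses the table’s own cell labels (a, b, c, d); the function name and the example counts are hypothetical.

```python
def diagnostic_parameters(a, b, c, d):
    """Test parameters from a 2 x 2 table.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    return {
        "prevalence": (a + c) / (a + b + c + d),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "lr_positive": sensitivity / (1 - specificity),
    }


# Hypothetical counts: 200 subjects, 50 of whom have the condition
params = diagnostic_parameters(a=40, b=10, c=10, d=140)
for name, value in params.items():
    print(f"{name}: {value:.3f}")

# The positive likelihood ratio also equals posttest odds / pretest odds
pretest_odds = params["prevalence"] / (1 - params["prevalence"])
posttest_odds = params["ppv"] / (1 - params["ppv"])
print(f"posttest odds / pretest odds: {posttest_odds / pretest_odds:.3f}")
```

The final check shows that the two expressions for the positive likelihood ratio in the table give the same value for these counts.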

Cutoffs and Receiver Operating Characteristic Curves

When a test cutoff point is set (as is often done for the MMSE or the Modified Mini-Mental State Examination, to distinguish demented from nondemented subjects), there is a trade-off between sensitivity and specificity. This may be formalized as a receiver operating characteristic (ROC) curve, on which sensitivity is plotted against 1 − specificity for each proposed cutoff point. The optimal cutoff point for the purpose (e.g., individual diagnosis, requiring high specificity, or community screening, requiring high sensitivity) can then be ascertained from this graph. For example, Monsch and colleagues (1995) used ROC analysis to determine that the optimal cutoff score for the MMSE in a geriatric outpatient service is 25/26.⁹ The utility of different measures of the same parameter can also be compared (e.g., see Fig. 2-5), or the effects of adding another test of the same parameter studied; the test or combination that has the greatest area under the ROC curve is the most accurate discriminator.
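A minimal sketch of how an ROC curve is built from raw scores follows. It assumes that lower scores indicate impairment (as with the MMSE) and that a score at or below the cutoff counts as a positive test. The scores and diagnoses are invented for illustration and are not taken from any study.

```python
def roc_points(scores, impaired, cutoffs):
    """Return (1 - specificity, sensitivity) for each cutoff.

    A score <= cutoff is treated as a positive (abnormal) result.
    """
    n_pos = sum(impaired)
    n_neg = len(impaired) - n_pos
    points = []
    for cut in cutoffs:
        tp = sum(1 for s, y in zip(scores, impaired) if y and s <= cut)
        fp = sum(1 for s, y in zip(scores, impaired) if not y and s <= cut)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Hypothetical data: 6 impaired and 6 unimpaired subjects
scores = [18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 29, 30]
impaired = [True, True, True, True, False, True,
            False, True, False, False, False, False]

points = roc_points(scores, impaired, cutoffs=range(17, 31))
auc = area_under_curve(points)
print(f"AUC = {auc:.3f}")
```

Each cutoff contributes one point to the curve. Which point is "optimal" depends on the purpose, as noted above, so the sketch reports only the overall area rather than choosing a cutoff.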

Figure 2-5 Receiver operating characteristic curves of the power of four diagnostic indexes to discriminate between patients with mild cognitive impairment that did (in 22 patients) or did not (in 8 patients) progress to Alzheimer’s disease. AUC, area under the curve; CBF, cerebral blood flow; CSF, cerebrospinal fluid.

(From Okamura N, Arai H, Maruyama M, et al: Combined analysis of CSF tau levels and [(123)I]iodoamphetamine SPECT in mild cognitive impairment: implications for a novel predictor of Alzheimer’s disease. Am J Psychiatry 2002; 159:474-476. Reprinted with permission from the American Journal of Psychiatry. Copyright 2002 American Psychiatric Association.)

MEASURING DEFICITS AND CHANGES

Methods of Establishing a Baseline

Most patients have not previously undergone neuropsychological assessment when they are referred, and so there is no established personal baseline against which they can be compared when they are assessed. There are several approaches to this problem. Using demographically stratified norms (see section on Comparison with Appropriate Normative Data) is helpful, and there are even demographic formulas available in some countries, including the United States, to enable estimation of premorbid IQ.¹⁰ However, these still involve comparison with a group, which may not be completely appropriate for a given individual.

A second approach is to estimate premorbid ability from performance on a cognitive task known to be (relatively) resistant to cognitive decline, such as semantic knowledge. The National Adult Reading Test (of pronunciation of irregularly spelled words)¹¹ and its U.S. variants, as well as the vocabulary subtest of the Wechsler Adult Intelligence Scale and its successors, have been used for this purpose.

A large variation between Z scores in different domains might suggest that the lower scores are the result of deterioration and that the higher scores (the patient’s best performance), qualified by all available qualitative information about the patient’s premorbid achievements and abilities, provide a reasonable estimate of the patient’s overall cognitive ability. For example, sometimes an individual’s occupational history is helpful: some otherwise normal older individuals may have difficulty copying a cube, but such difficulty in a former architect, draftsperson, or mathematics teacher would indeed be cause for concern. This best performance approach is discussed by Lezak and colleagues (2004, pp 97-99). The pitfalls in relying on the best test score in the absence of such further qualifying information have been illustrated by Mortensen and associates (1991).¹²

Measuring Change

Sometimes, despite the previously mentioned inferential methods for obtaining a baseline, there is still doubt as to whether deterioration has occurred. Repeated assessments can help to identify progressive deterioration in such circumstances, even if there was uncertainty about score interpretation at the initial assessment. However, this raises the question of how true deterioration can be distinguished from random fluctuations in test scores.

One simple way of determining whether a change in test score is significant is the standard deviation method, in which it is assumed that any score change of more than 1 SD is significant. Although this often does reveal truly significant changes, it is less accurate in doing so than are a number of more sophisticated methods.¹³ In part, this potential inaccuracy arises from the random error component of the actual test scores themselves.

Even if a test does not display practice effects, or if truly parallel (alternative) forms are available, only part of a patient’s actual test score consists of the true score, whereas part consists of random variability. The reliability coefficient (r_xx) of a test is a measure of the proportion of the total variance of a test score that results from variance in the true score. If a subject took such a test multiple times, the average score would approximate the true score. The extent to which a single observed total score represents that patient’s true score can be estimated with the standard error of measurement (SEM), which increases as the standard deviation of the test scores (σ_x) increases and decreases as the reliability coefficient (r_xx) increases (indicating a decrease in that portion of the total variance due to random variance), according to the formula SEM = σ_x √(1 − r_xx). This means that confidence intervals (e.g., 95%) can be placed on an individual score, or two scores obtained on separate occasions can be compared to determine whether they are likely to represent a true change.
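As a worked illustration, the SEM (σ_x multiplied by the square root of 1 − r_xx) can be used to place a confidence interval around a single observed score. The sketch below centers the interval on the observed score, which is a simplification, and the test SD of 15 and reliability of 0.91 are hypothetical values.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_of_measurement(sd, reliability):
    """SEM = sd * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


def confidence_interval(observed, sd, reliability, level=0.95):
    """Two-sided interval around an observed score, assuming normal error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * standard_error_of_measurement(sd, reliability)
    return observed - half_width, observed + half_width


# Hypothetical test: SD 15, reliability 0.91, observed score 100
sem = standard_error_of_measurement(15, 0.91)
low, high = confidence_interval(100, 15, 0.91)
print(f"SEM = {sem:.2f}")
print(f"95% interval: {low:.1f} to {high:.1f}")
```

Even with a reliability above 0.90, the interval spans well over half an SD in each direction, which is why small score differences should be interpreted cautiously.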

Reliable change indices (RCIs), calculated from data on normal subjects or—better—from populations containing individuals identified as undergoing significant change according to an external “gold standard” (e.g., reaching criterion for diagnosis of dementia) have been devised that account for measurement error, practice effects, and regression to the mean.¹³ Regression equation-based measures have also been developed.¹³ Overall, these perform more satisfactorily, although none is ideal.
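One widely used form of reliable change index divides the retest difference, less the expected practice gain, by the standard error of the difference between two scores. The sketch below follows that general form; it is not necessarily the method evaluated in the cited reference, it does not correct for regression to the mean, and every number in it is hypothetical.

```python
from math import sqrt


def reliable_change_index(baseline, retest, sd, reliability, practice_effect=0.0):
    """RCI = (retest - baseline - practice_effect) / SE of the difference.

    SE of the difference is taken as sqrt(2) * SEM, which assumes equal
    measurement error at both testing sessions.
    """
    sem = sd * sqrt(1 - reliability)
    se_difference = sqrt(2) * sem
    return (retest - baseline - practice_effect) / se_difference


# Hypothetical T scores: 50 at baseline, 42 at retest; test SD 10,
# reliability 0.80, and an average practice gain of 2 points in controls
rci = reliable_change_index(50, 42, sd=10, reliability=0.80, practice_effect=2)
print(f"RCI = {rci:.2f}")
print("reliable change" if abs(rci) > 1.96 else "within measurement error")
```

In this example an 8-point drop, which would nearly qualify as significant under the 1 SD rule, does not exceed the conventional ±1.96 threshold once measurement error is taken into account.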

STRATEGIES IN NEUROPSYCHOLOGICAL ASSESSMENT

Neuropsychological assessment would be extremely protracted and exhausting (and thereby inaccurate) for both tester and patient, not to mention prohibitively expensive, if all domains were assessed in all possible detail. Some strategy for keeping time and costs to acceptable levels is therefore required. One possibility is to use a standard battery, such as the Halstead-Reitan or Luria-Nebraska battery.¹⁴,¹⁵ The difficulty with this approach is that testing may occupy several hours, but the particular referral problems are still insufficiently clarified at the end. Another approach is hypothesis driven, on the basis of referral details and the history from the patient. This makes intrinsic sense to physicians: a patient complaining of diplopia and unsteadiness of gait will rightly be given a more thorough neurological examination than will a patient complaining of dyspnea, pleuritic chest pain, and productive cough.

Many assessors use a combination of screening tests across the range of cognitive domains, concentrating on those that seem most relevant (e.g., episodic memory in Alzheimer’s disease). This flexible approach may be modified in midsession: As pointed out in the Cognitive Domains and Neuropsychological Tests section, an abnormal performance may well have a number of possible causes, each of which must then be assessed. Often, the Wechsler Adult Intelligence Scale–Third Edition and the Wechsler Memory Scale–Third Edition, or a selection of items from these, are used for screening. Although the Wechsler Adult Intelligence Scale was designed to assess the range of abilities in the normal population rather than to investigate patients with particular cognitive deficits, test administration is very well standardized, and the normative data are extensive (although not stratified by educational level); therefore, these tests are an attractive option for this purpose. Their merits and pitfalls are discussed by Lezak and colleagues (2004, pp 648-660). Some multidomain bedside screening mental status tests used by nonneuropsychologists, such as the Mattis Dementia Rating Scale¹⁶ and the Cognistat (formerly called the Neurobehavioral Cognitive State Examination),¹⁷ are designed on a “tripwire” (“screen + metric”) basis, with a challenging item given first and, only if the screening item is failed, easier ones then given to establish the degree of impairment in that domain. This approach can save time for both examiner and patient.

Computerized versions of some individual tests are in widespread use (e.g., the Continuous Performance Test and the Wisconsin Card Sorting Test). In view of the duration and expense of neuropsychological assessment, however, it is not surprising that attempts have been made to computerize the entire testing process. An example is the Cambridge Neuropsychological Test Automated Battery.¹⁸ The drawbacks of this approach, however, include not only the loss of flexibility, and therefore the ability to perform hypothesis-generated testing, but also the loss of the potentially very valuable information derived from consideration of referral details, the history from patient and informant, and observation during the test procedure. Consideration of all these features by a trained practitioner is the basis of neuropsychological assessment, as distinct from the more limited neuropsychological testing. An experienced neuropsychologist is able to detect evidence of impulsivity, poor or fluctuating attention, poor planning, and so forth. These observations are particularly important when assessing patients with the dysexecutive syndrome (see Chapter 7: Executive Function and its Assessment), in whom the manner of test performance is often more revealing than the result. Furthermore, anxiety, fatigue, and depression adversely affect performance on many neuropsychological tests; detection of and allowance for or minimization of these process factors are important parts of the neuropsychologist-patient interaction.

In interpreting neuropsychological test results, as with any other collection of test results, the neuropsychologist must remember that if abnormality is defined statistically, and if enough tests are performed, some are expected to yield “abnormal” results by chance. Formal adjustment of the statistical threshold of abnormality (a Bonferroni correction) is possible but cannot be applied blindly in a situation in which the tests are not necessarily fully independent. Conversely, even mild abnormalities on several different measures of a particular domain greatly increase the likelihood that function is impaired in that domain. The corollary is that reliance on a single test to define abnormality within a domain is unsound.
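The size of the multiple-testing problem can be estimated with a short calculation. The sketch below assumes fully independent tests, which, as noted above, neuropsychological tests generally are not, so the figures are best read as an upper-bound illustration rather than a clinical rule.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests is 'abnormal' by chance."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_tests


for n in (1, 5, 10, 20):
    print(f"{n:2d} tests: P(at least one chance abnormality) = "
          f"{prob_any_false_positive(n):.2f}, "
          f"Bonferroni threshold = {bonferroni_alpha(n):.4f}")
```

Because scores on related tests are correlated, the true chance of a spurious abnormality is usually lower than the independent-test figure, and a Bonferroni threshold will tend to be overly conservative.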