Published on 24/05/2015 by admin

Filed under Psychiatry


CHAPTER 7 Understanding and Applying Psychological Assessment


Psychological assessment is a consultation procedure that can greatly enhance clinicians’ understanding of their patients and facilitate the treatment process. To a considerable extent, psychological assessment consultations are underutilized in the current mental health care environment. This is unfortunate given the evidence that psychological tests generally produce reliability and validity coefficients similar to many routine diagnostic medical tests.1 This chapter will provide a detailed review of what a psychological assessment comprises and discuss the potential benefits of an assessment consultation. This will be accomplished by reviewing the methods used to construct valid psychological instruments, the major categories of psychological tests (including detailed examples of each category), and the application of these instruments in clinical assessment. Issues relating to the ordering of psychological testing and the integration of information from an assessment report into the treatment process will also be presented.

Test Development Strategies


Three general test development strategies are available to guide test construction: rational, empirical, and construct validation methods.

Rational test construction relies on a theory of personality or psychopathology (e.g., the cognitive theory of depression) to guide item selection and scale construction. Scales are developed to operationalize the important features of a theory. The Millon Clinical Multiaxial Inventory (MCMI) is an example of a test that was originally developed using primarily rational test construction.

Empirically guided test construction uses a large number of items (called an item pool) and statistical methods to determine which items differentiate between well-defined groups of subjects (a process termed empirical keying). The items that successfully differentiate one group from another are organized together to form a scale, without regard to their thematic content. The Minnesota Multiphasic Personality Inventory (MMPI) is an example of a test developed using this method.

The construct validation method combines aspects from both the rational and the empirical test construction methodologies. Within this framework, a large pool of items is written to reflect a theoretical construct (e.g., impulsivity) and then these items are tested to determine if they actually differentiate subjects who are expected to differ on the construct (impulsive vs. nonimpulsive subjects). Items that successfully differentiate between groups and that meet other psychometric criteria (e.g., adequate internal consistency) are retained for the scale. In addition, if theoretically important items do not differentiate between the two groups, this finding may lead to a revision in the theory. The construct validation methodology is considered the most sophisticated strategy for test development. The Personality Assessment Inventory (PAI) is an example of a test developed with a construct validation approach.
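The item-retention step common to the empirical and construct validation strategies can be sketched in a few lines of Python. This is a toy illustration with made-up response data, not an actual scale-construction pipeline: items are retained only when their endorsement rates differ sufficiently between two well-defined groups.

```python
# Toy sketch of empirically keyed item selection: keep items whose
# endorsement rates differ between a clinical and a control group.

def select_items(clinical, control, min_diff=0.25):
    """clinical/control: lists of item-response vectors (1 = endorsed).
    Returns indices of items whose endorsement rates differ by min_diff."""
    n_items = len(clinical[0])
    kept = []
    for i in range(n_items):
        p_clin = sum(r[i] for r in clinical) / len(clinical)
        p_ctrl = sum(r[i] for r in control) / len(control)
        if abs(p_clin - p_ctrl) >= min_diff:
            kept.append(i)
    return kept

# Hypothetical data: item 0 differentiates the groups, item 1 does not.
clinical = [[1, 1], [1, 0], [1, 1], [0, 1]]
control  = [[0, 1], [0, 0], [1, 1], [0, 1]]
kept = select_items(clinical, control)
```

In a real test-development project the threshold would be replaced by a significance test, and retained items would be further screened for internal consistency, as the chapter describes.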

Reliability and Validity

To be meaningfully employed for either research or clinical activities, psychological tests must possess adequate reliability and validity. Reliability represents the repeatability, stability, or consistency of a subject’s test score. Reliability is usually represented as some form of correlation coefficient ranging from 0 to 1.0. Research instruments can have reliability coefficients as low as .70, whereas clinical instruments should have reliability coefficients in the high .80s to low .90s. This is because research instruments are interpreted in the aggregate, as group measures, whereas clinical instruments are interpreted for a single individual. A number of reliability statistics are available for evaluating a test: internal consistency (the degree to which the items in a test perform in the same manner), test-retest reliability (the consistency of a test score over time, typically over intervals ranging from a few days to a year), and interrater reliability (the degree of agreement between different observers using the same rating scale). The kappa statistic is considered the best estimate of interrater reliability because it reflects the degree of agreement between raters after accounting for chance agreement. Error that reduces reliability can be introduced by variability in the subject (the subject changes over time), in the examiner (rater error, rater bias), or in the test itself (e.g., administration under different instructions).

Validity is a more difficult concept to understand and to demonstrate than reliability. The validity of a test reflects the degree to which the test actually measures the construct it was designed to measure (also known as construct validity). Validity is often demonstrated by comparing the test in question with an already established measure (or measures). As with reliability, validity measures are usually represented as correlation coefficients ranging from 0 to 1.0. Validity coefficients are typically squared (reported as R2) to reflect the amount of variance shared between two or more scales. Multiple types of data are needed before a test can be considered valid. Content validity assesses the degree to which an instrument covers the full range of the target construct (e.g., a test of depression that does not include items covering disruptions in sleep and appetite would have limited content validity). Predictive validity and concurrent validity show either how well a test predicts future occurrence of the construct (predictive validity) or how well it correlates with other current measures of the construct (concurrent validity). Convergent validity and divergent validity refer to the ability of scales using different methods (e.g., interview vs. self-report) to measure the same construct (convergent validity), while also having low or negative correlations with scales measuring unrelated traits (divergent validity). Taken together, the convergent and divergent correlations indicate the specificity with which a scale measures the intended construct. It is important to realize that, regardless of the amount of supporting data, psychological tests are not themselves considered valid; rather, it is the scores from tests that are valid in specific situations for making specific decisions.
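The relationship between a validity coefficient and shared variance can be shown directly (the two sets of scale scores below are hypothetical):

```python
# Sketch: a validity coefficient is a correlation between two measures;
# squaring it gives the proportion of variance the measures share (R^2).
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores on a new depression scale and an established one.
new_scale = [10, 14, 18, 22, 26]
established = [12, 15, 17, 24, 25]
r = pearson_r(new_scale, established)
shared_variance = r ** 2   # proportion of variance the two scales share
```

A concurrent validity coefficient of .70, for example, means the new scale shares only about half (.49) of its variance with the established measure, which is why squared coefficients give a more sober picture than the raw correlation.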


Intelligence Tests

Alfred Binet (1857-1911) is credited with developing the first true measure of intelligence. Binet and Theodore Simon were commissioned by the French school authorities to develop a test to identify students who might benefit from special education programs. Binet’s 1905 and 1908 scales form the basis of our current intelligence tests. In fact, it was the development of Binet’s 1905 test that marked the beginning of modern psychological testing. His approach was practical and effective: he developed a group of tests with sufficient breadth and depth to separate underachieving children with normal intellectual ability from those who were underachieving because of lower intellectual ability. In addition to mathematics and reading tasks, Binet also tapped into other areas (such as object identification, judgment, and social knowledge). About a decade later at Stanford University, Lewis Terman translated Binet’s test into English, added additional items, and made some scoring revisions. Terman’s test is still in use today and is called the Stanford-Binet Intelligence Scales.2

David Wechsler, drawing on his experience assessing recruits in World War I, combined what were essentially the Stanford-Binet verbal tasks with his own tests to form the Wechsler-Bellevue test (1939). Unlike the Stanford-Binet test, the Wechsler-Bellevue test produced a full-scale intelligence quotient (IQ) score, as well as separate measures of verbal and nonverbal intellectual abilities. The use of three scores for describing IQ became popular with clinicians, and the Wechsler scales were widely adopted. To this day, the Wechsler scales remain the dominant measure of intellectual capacity used in the United States.

Intelligence is a hard construct to define. Wechsler wrote that “intelligence, as a hypothetical construct, is the aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with the environment.”3 This definition helps clarify what modern IQ tests try to measure (i.e., adaptive functioning) and why intelligence or IQ tests can be important aids in clinical assessment and treatment planning. If an IQ score reflects aspects of effective functioning, then IQ tests measure aspects of adaptive capacity. The Wechsler series of IQ tests covers the majority of age ranges. The series starts with the Wechsler Preschool and Primary Scales of Intelligence (ages 4 to 6 years), progressing to the Wechsler Intelligence Scale for Children–III (ages 5 to 16 years), and ending with the Wechsler Adult Intelligence Scale–III (ages 16 to 89 years).4 Although the current discussion will focus on the measurement of adult intelligence, all of the Wechsler scales provide three major IQ test scores: the Full Scale IQ (FSIQ), Verbal IQ (VIQ), and Performance IQ (PIQ). All three IQ scores have a mean of 100 and a standard deviation (SD) of 15. This statistical feature means that a 15-point difference between a subject’s VIQ and PIQ can be considered both statistically significant and clinically meaningful. Table 7-1 presents an overview of the IQ categories.

Table 7-1 IQ Score Ranges with Corresponding IQ Scores and Percentile Distribution

Full-Scale IQ Score IQ Category Percentage of Normal Distribution
≥130 Very superior 2.2
120-129 Superior 6.7
110-119 High average 16.1
90-109 Average 50.0
80-89 Low average 16.1
70-79 Borderline 6.7
≤69 Mentally retarded 2.2
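Table 7-1 can be expressed as a simple lookup, using the chapter’s score ranges and labels (a minimal sketch, not a clinical tool):

```python
# Map a Full-Scale IQ score to the category labels of Table 7-1.
def iq_category(fsiq):
    if fsiq >= 130:
        return "Very superior"
    if fsiq >= 120:
        return "Superior"
    if fsiq >= 110:
        return "High average"
    if fsiq >= 90:
        return "Average"
    if fsiq >= 80:
        return "Low average"
    if fsiq >= 70:
        return "Borderline"
    return "Mentally retarded"
```

Note that the percentages in the table describe the proportion of the normal distribution falling in each band (e.g., half of all test takers score 90-109), not a percentile rank for an individual score.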

IQ scores do not represent a patient’s innate, unchangeable intelligence. Rather, it is most accurate to view IQ scores as representing a patient’s ordinal position, or percentile ranking, on the test relative to the normative sample at any given time. In other words, a patient who scores at the 50th percentile performed better than 50% of the individuals in his or her age bracket. Clinically, IQ scores can be thought of as representing the patient’s best possible current level of adaptive function. Furthermore, because IQ scores are not totally reliable (they contain some degree of measurement and scoring error), they should be reported with confidence intervals indicating the range of scores in which the subject’s true IQ is likely to fall.
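The confidence interval mentioned above follows from the standard error of measurement, SEM = SD × √(1 − reliability), a standard psychometric formula; the reliability value in this sketch is hypothetical:

```python
# Sketch: 95% confidence interval for an observed IQ score using the
# standard error of measurement (assumes normally distributed error).
import math

def iq_confidence_interval(observed_iq, reliability, sd=15, z=1.96):
    sem = sd * math.sqrt(1 - reliability)
    return (observed_iq - z * sem, observed_iq + z * sem)

# Hypothetical reliability of .95 (FSIQ reliabilities are typically high).
low, high = iq_confidence_interval(105, reliability=0.95)
```

Even with a reliability of .95, the 95% interval around an observed score of 105 spans roughly 13 points, which is why a single-point IQ value should never be reported without its band.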

The Wechsler IQ tests are composed of 10 or 11 subtests, which were developed to tap two primary intellectual domains: verbal intelligence (VIQ; Vocabulary, Similarities, Arithmetic, Digit Span, Information, and Comprehension) and nonverbal, or performance, intelligence (PIQ; Picture Completion, Digit Symbol, Block Design, Matrix Reasoning, and Picture Arrangement). All the Wechsler subtests are constructed to have a mean score of 10 and an SD of 3. Given this statistical feature, we know that if two subtests differ by 3 or more scaled-score points, the difference is clinically meaningful. All IQ scores and subtest scaled scores are also adjusted for age.4

The subscales included in the VIQ and PIQ can be deconstructed even further to provide specific information about verbal and nonverbal abilities. Table 7-2 provides an outline of the WAIS-III subtests and their location on the verbal and nonverbal indexes. The Verbal Comprehension Index (VCI) and the Working Memory Index (WMI) are the two indexes subsumed under the VIQ. Because the subtests involved in the VCI are not timed, they are considered to be a more pure measure of verbal ability (i.e., word knowledge, verbal abstract reasoning, and general information). As the name implies, the WMI is a measure of how quickly individuals can manipulate, process, and respond to verbal information. For example, the Arithmetic subtest that is part of the WMI is administered orally, and the individual is not permitted to use paper and pencil to solve the word problems. The PIQ is composed of the Perceptual Organization Index (POI) and the Processing Speed Index (PSI). Subtests in the POI measure nonverbal reasoning, attention to detail, and the integration of visual and motor systems. Although two out of the three subtests that make up the POI are timed, fast responding is not the focus of this index. The PSI is a measure of how quickly individuals can process and respond to visually presented information.4

One of the initial strategies for interpreting a patient’s WAIS performance is to review the consistency of the three main scores: FSIQ, VIQ, and PIQ. For example, an IQ of 105 falls within the average range and by itself would not raise any “red flags.” However, a very high VIQ and very low PIQ can combine to create an FSIQ of 105 as easily as if the VIQ and PIQ were both in the average range. The clinical implications in these two scenarios are quite different and would lead to very different interpretations. Examination of discrepancies is essential when interpreting the WAIS-III, because score differences can occur at the IQ (VIQ vs. PIQ), index (e.g., POI vs. PSI), or subtest (Vocabulary vs. Arithmetic) level. However, the existence of a discrepancy does not always indicate an abnormality. In fact, small to medium VIQ-PIQ discrepancies are not uncommon, even in the general population. Typically, discrepancies of 12 to 15 points are needed before they can be considered significant and noted in the report.
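The discrepancy check described above can be sketched directly (the threshold and scores here are illustrative; 15 points corresponds to 1 SD on the Wechsler IQ metric):

```python
# Flag a VIQ-PIQ split that reaches the 12- to 15-point range the
# chapter cites as significant.
def viq_piq_discrepancy(viq, piq, threshold=15):
    diff = viq - piq
    return {"difference": diff, "significant": abs(diff) >= threshold}

# Hypothetical profile: FSIQ near 105 despite a 30-point split.
result = viq_piq_discrepancy(viq=120, piq=90)
```

This is exactly the "average FSIQ hiding a large split" scenario: the 30-point difference would be flagged even though the composite score alone looks unremarkable.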

In sum, although all measures of intelligence are highly intercorrelated, intelligence is best thought of as a multifaceted phenomenon. In keeping with Binet’s original intent, IQ tests should be used to assess individual strengths and weaknesses relative to a normative sample. Too often, mental health professionals become overly focused on the FSIQ score and fall into the proverbial problem of missing the trees for the forest. To counter this error, knowledge of the subtests and indexes of the WAIS-III is essential to understanding the complexity of an IQ score. They are the “trees” of the “forest” that often get missed, and they can provide a great deal more specific information about patients than the FSIQ alone.

Objective (Self-Report) Tests of Personality and Psychopathology

Modern objective personality assessment (more appropriately called self-report) has its roots in World War I when the armed forces turned to psychology to help assess and classify new recruits. Robert Woodworth was asked to develop a self-report test to help assess the emotional stability of new recruits in the Army. Unfortunately, his test, called the Personal Data Sheet, was completed later than anticipated and it had little direct impact on the war effort. However, the methodology used by Woodworth would later influence the development of the most commonly used personality instrument, the MMPI.

Hathaway and McKinley (1943) published the original version of the MMPI at the University of Minnesota.5 (Although the original version of the MMPI was produced in 1943, the official MMPI manual was not published until 1967.5) The purpose of the test was to differentiate psychiatric patients from normal individuals, as well as to accurately place patients in the proper diagnostic group. A large item pool was generated, and hundreds of psychiatric patients were interviewed and asked to respond to each of the items. The same was done with a large sample of people who were not receiving psychiatric treatment. The results of this project showed that, while the item pool did exceptionally well in differentiating the normal group from the clinical groups, differentiating one psychiatric group from another was more difficult. A major confounding factor was that patients with different conditions tended to endorse the same items; this led to scales with a high degree of item overlap (i.e., items appeared on more than one scale). This method of test development, known as empirical keying (described earlier), was innovative for its time because most personality tests preceding it were based solely on items that test developers theorized would measure the construct in question (rational test development). The second innovation introduced with the MMPI was the development of validity scales intended to identify the response style of test takers. In response to criticisms that some items contained outdated language and that the original normative group was a “sample of convenience,” the MMPI was revised in 1989. The MMPI-2 is the result of this revision process, and it is the version of the test currently in use today.6

The Minnesota Multiphasic Personality Inventory–2

The Minnesota Multiphasic Personality Inventory–2 (MMPI-2) is a 567-item true/false, self-report test of psychological function.6 As mentioned earlier, the MMPI was designed both to separate subjects into “normals” and “abnormals” and to subcategorize the abnormal group into specific classes.7 The MMPI-2 contains 10 Clinical Scales that assess major categories of psychopathology and six Validity Scales designed to assess test-taking attitudes. MMPI raw scores are transformed into standardized T-scores, where the mean is 50 and the SD is 10. A T-score of 65 or greater indicates clinically significant psychopathology on the MMPI-2. An interesting feature of the MMPI-2 is that over 300 “new” or experimental scales have been developed for the test over the years. This is made possible by the empirical keying method described earlier. Groups of items that have been shown to reliably differentiate two or more samples or populations can be added to the MMPI-2 as a clinical or supplemental scale. The addition of these scales helps sharpen and individualize the clinical interpretation of the MMPI-2 results.
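The T-score metric described above can be illustrated with a simple linear transformation (the actual MMPI-2 conversion uses published normative tables, and the raw-score mean and SD below are hypothetical):

```python
# Sketch of a linear T-score transformation: mean 50, SD 10, the metric
# on which MMPI-2 scale scores are reported.
def to_t_score(raw, norm_mean, norm_sd):
    z = (raw - norm_mean) / norm_sd   # standardize against the norm group
    return 50 + 10 * z

# Hypothetical scale: normative mean 18, SD 5; a raw score of 28 is
# two SDs above the mean.
t = to_t_score(raw=28, norm_mean=18, norm_sd=5)
clinically_significant = t >= 65   # the MMPI-2 cutoff noted above
```

Because 65 is 1.5 SDs above the normative mean, the cutoff flags roughly the upper several percent of the normative distribution on any given scale.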

The MMPI-2 validity scales are the Lie (L), Infrequency (F), Correction (K), Variable Response Inconsistency (VRIN), True Response Inconsistency (TRIN), and Back F (FB) scales. The L scale was designed to identify respondents who attempt to minimize pathology to the extent that they deny even minor faults to which most individuals will admit. It is commonly thought of as an unsophisticated attempt to appear healthier than one might actually be (i.e., faking good). The F scale contains items of unusual or severe pathology that are infrequently endorsed by most people. Therefore, elevation of the F scale is thought of as either a “cry for help” or a more intentional attempt to appear worse off psychologically (i.e., faking bad). Like the L scale, the K scale is purported to measure defensiveness, but data have suggested that persons with a higher level of education tend to score higher on the K scale items than on the L scale items.8
