What is Validity?
The most crucial issue in test construction is validity. Whereas reliability addresses issues of consistency, validity assesses what the test is to be accurate about. A test that is valid for clinical assessment should measure what it is intended to measure and should also produce information useful to clinicians. A psychological test cannot be said to be valid in any abstract or absolute sense, but more practically, it must be valid in a particular context and for a specific group of people (Messick, 1995). Although a test can be reliable without being valid, the opposite is not true; a necessary prerequisite for validity is that the test must have achieved an adequate level of reliability. Thus, a valid test is one that accurately measures the variable it is intended to measure. For example, a test comprising questions about a person’s musical preference might erroneously state that it is a test of creativity. The test might be reliable in the sense that if it is given to the same person on different occasions, it produces similar results each time. However, it would not be valid in that an investigation might indicate it does not correlate with other more valid measurements of creativity.
Establishing the validity of a test can be extremely difficult, primarily because psychological variables are usually abstract concepts such as intelligence, anxiety, and personality. These concepts have no tangible reality, so their existence must be inferred through indirect means. In addition, conceptualization and research on constructs undergo change over time requiring that test validation go through continual refinement (G. Smith & McCarthy, 1995). In constructing a test, a test designer must follow two necessary, initial steps. First, the construct must be theoretically evaluated and described; second, specific operations (test questions) must be developed to measure it (S. Haynes et al., 1995). Even when the designer has followed these steps closely and conscientiously, it is sometimes difficult to determine what the test really measures. For example, IQ tests are good predictors of academic success, but many researchers question whether they adequately measure the concept of intelligence as it is theoretically described. Another hypothetical test that, based on its item content, might seem to measure what is described as musical aptitude may in reality be highly correlated with verbal abilities. Thus, it may be more a measure of verbal abilities than of musical aptitude.
Any estimate of validity is concerned with relationships between the test and some external independently observed event. The Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999; G. Morgan, Gliner, & Harmon, 2001) list the three main methods of establishing validity as content-related, criterion-related, and construct-related.
During the initial construction phase of any test, the developers must first be concerned with its content validity. This refers to the representativeness and relevance of the assessment instrument to the construct being measured. During the initial item selection, the constructors must carefully consider the skills or knowledge area of the variable they would like to measure. The items are then generated based on this conceptualization of the variable. At some point, it might be decided that the item content over-represents, under-represents, or excludes specific areas, and alterations in the items might be made accordingly. If experts on subject matter are used to determine the items, the number of these experts and their qualifications should be included in the test manual. The instructions they received and the extent of agreement between judges should also be provided. A good test covers not only the subject matter being measured, but also additional variables. For example, factual knowledge may be one criterion, but the application of that knowledge and the ability to analyze data are also important. Thus, a test with high content validity must cover all major aspects of the content area and must do so in the correct proportion.
A concept somewhat related to content validity is face validity. These terms are not synonymous, however, because content validity pertains to judgments made by experts, whereas face validity concerns judgments made by the test users. The central issue in face validity is test rapport. Thus, a group of potential mechanics who are being tested for basic skills in arithmetic should have word problems that relate to machines rather than to business transactions. Face validity, then, is present if the test looks good to the persons taking it, to policymakers who decide to include it in their programs, and to other untrained personnel. Despite the potential importance of face validity in regard to test-taking attitudes, disappointingly few formal studies on face validity are performed and/or reported in test manuals.
In the past, content validity has been conceptualized and operationalized as being based on the subjective judgment of the test developers. As a result, it has been regarded as the least preferred form of test validation, albeit necessary in the initial stages of test development. In addition, its usefulness has been focused primarily on achievement tests (how well has this student learned the content of the course?) and personnel selection (does this applicant know the information relevant to the potential job?). More recently, it has been used more extensively in personality and clinical assessment (Butcher, Graham, Williams, & Ben-Porath, 1990; Millon, 1994). This has paralleled more rigorous and empirically based approaches to content validity along with a closer integration with criterion and construct validation.
A second major approach to determining validity is criterion validity, which has also been called empirical or predictive validity. Criterion validity is determined by comparing test scores with some sort of performance on an outside measure. The outside measure should have a theoretical relation to the variable that the test is supposed to measure. For example, an intelligence test might be correlated with grade point average; an aptitude test, with independent job ratings; or general maladjustment scores, with other tests measuring similar dimensions. The relation between the two measurements is usually expressed as a correlation coefficient.
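As an illustration of how such a coefficient might be obtained, the following minimal sketch computes a Pearson correlation between test scores and an outside criterion. The function name `pearson_r` and the scores for six students are hypothetical, invented only for this example; they are not drawn from any published study.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: intelligence test scores and grade point averages.
test_scores = [95, 100, 105, 110, 120, 130]
gpa = [2.4, 3.0, 2.7, 3.1, 3.3, 3.8]

validity_coefficient = pearson_r(test_scores, gpa)
print(f"validity coefficient r = {validity_coefficient:.2f}")
```

With so few cases the resulting coefficient would be very unstable; actual validation studies rely on much larger samples.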
Criterion-related validity is most frequently divided into either concurrent or predictive validity. Concurrent validity refers to measurements taken at the same, or approximately the same, time as the test. For example, an intelligence test might be administered at the same time as assessments of a group’s level of academic achievement. Predictive validity refers to outside measurements that were taken some time after the test scores were derived. Thus, predictive validity might be evaluated by correlating the intelligence test scores with measures of academic achievement a year after the initial testing. Concurrent validation is often used as a substitute for predictive validation because it is simpler, less expensive, and not as time-consuming. However, the main consideration in deciding whether concurrent or predictive validation is preferable is the test’s purpose. Predictive validity is most appropriate for tests used for selection and classification of personnel. This may include hiring job applicants, placing military personnel in specific occupational training programs, screening out individuals who are likely to develop emotional disorders, or identifying which category of psychiatric populations would be most likely to benefit from specific treatment approaches. These situations all require that the measurement device provide a prediction of some future outcome. In contrast, concurrent validation is preferable if an assessment of the client’s current status is required, rather than a prediction of what might occur to the client at some future time. The distinction can be summarized by asking “Is Mr. Jones maladjusted?” (concurrent validity) rather than “Is Mr. Jones likely to become maladjusted at some future time?” (predictive validity).
An important consideration is the degree to which a specific test can be applied to a unique work-related environment (see Hogan, Hogan, & Roberts, 1996). This relates more to the social value and consequences of the assessment than the formal validity as reported in the test manual (Messick, 1995). In other words, can the test under consideration provide accurate assessments and predictions for the environment in which the examinee is working? To answer this question adequately, the examiner must refer to the manual and assess the similarity between the criteria used to establish the test’s validity and the situation to which he or she would like to apply the test. For example, can an aptitude test that has adequate criterion validity in the prediction of high school grade point average also be used to predict academic achievement for a population of college students? If the examiner has questions regarding the relative applicability of the test, he or she may need to undertake a series of specific tasks. The first is to identify the required skills for adequate performance in the situation involved. For example, the criteria for a successful teacher may include such attributes as verbal fluency, flexibility, and good public speaking skills. The examiner then must determine the degree to which each skill contributes to the quality of a teacher’s performance. Next, the examiner has to assess the extent to which the test under consideration measures each of these skills. The final step is to evaluate the extent to which the attributes that the test measures are relevant to the skills the examiner needs to predict. Based on these evaluations, the examiner can estimate the confidence that he or she places in the predictions developed from the test.
This approach is sometimes referred to as synthetic validity because examiners must integrate or synthesize the criteria reported in the test manual with the variables they encounter in their clinical or organizational settings.
The strength of criterion validity depends in part on the type of variable being measured. Usually, intellectual or aptitude tests give relatively higher validity coefficients than personality tests because there are generally a greater number of variables influencing personality than intelligence. As the number of variables that influence the trait being measured increases, it becomes progressively more difficult to account for them. When a large number of variables are not accounted for, the trait can be affected in unpredictable ways. This can create a much wider degree of fluctuation in the test scores, thereby lowering the validity coefficient. Thus, when evaluating a personality test, the examiner should not expect as high a validity coefficient as for intellectual or aptitude tests. A helpful guide is to look at the validities found in similar tests and compare them with the test being considered. For example, if an examiner wants to estimate the range of validity to be expected for the extraversion scale on the Myers-Briggs Type Indicator, he or she might compare it with the validities for similar scales found in the California Psychological Inventory and Eysenck Personality Questionnaire. The relative level of validity, then, depends both on the quality of the construction of the test and on the variable being studied.
An important consideration is the extent to which the test accounts for the trait being measured or the behavior being predicted. For example, the typical correlation between intelligence tests and academic performance is about .50 (Neisser et al., 1996). Because no one would say that grade point average is entirely the result of intelligence, the relative extent to which intelligence determines grade point average has to be estimated. This can be calculated by squaring the correlation coefficient and changing it into a percentage. Thus, if the correlation of .50 is squared, it comes out to 25%, indicating that 25% of academic achievement can be accounted for by IQ as measured by the intelligence test. The remaining 75% may include factors such as motivation, quality of instruction, and past educational experience. The problem facing the examiner is to determine whether 25% of the variance is sufficiently useful for the intended purposes of the test. This ultimately depends on the personal judgment of the examiner.
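The arithmetic described above can be sketched in a few lines. The helper name `variance_explained` is hypothetical and chosen only for this illustration; the value .50 is the typical correlation cited in the text, and the other two coefficients are included purely for comparison.

```python
def variance_explained(r):
    """Convert a correlation coefficient into the percentage of variance
    in the criterion that the test accounts for (r squared, times 100)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return (r ** 2) * 100


# .50 is the typical IQ/academic performance correlation cited in the text;
# .30 and .70 are arbitrary comparison values.
for r in (0.30, 0.50, 0.70):
    explained = variance_explained(r)
    unexplained = 100 - explained
    print(f"r = {r:.2f}: {explained:.0f}% explained, {unexplained:.0f}% unexplained")

# For r = .50 this gives 25% explained and 75% unexplained.
```

Because the coefficient is squared, modest correlations account for surprisingly little variance: a coefficient has to exceed roughly .70 before the test explains half of the variation in the criterion.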
The main problem confronting criterion validity is finding an agreed-on, definable, acceptable, and feasible outside criterion. Whereas for an intelligence test the grade point average might be an acceptable criterion, it is far more difficult to identify adequate criteria for most personality tests. Even with so-called intelligence tests, many researchers argue that it is more appropriate to consider them tests of scholastic aptitude rather than of intelligence. Yet another difficulty with criterion validity is the possibility that the criterion measure will be inadvertently biased. This is referred to as criterion contamination and occurs when knowledge of the test results influences an individual’s later performance. For example, a supervisor in an organization who receives such information about subordinates may act differently toward a worker placed in a certain category after being tested. This situation may set up negative or positive expectations for the worker, which could influence his or her level of performance. The result is likely to artificially alter the level of the validity coefficients. To work around these difficulties, especially in regard to personality tests, a third major method must be used to determine validity.
The method of construct validity was developed in part to correct the inadequacies and difficulties encountered with content and criterion approaches. Early forms of content validity relied too much on subjective judgment, while criterion validity was too restrictive in working with the domains or structure of the constructs being measured. Criterion validity had the further difficulty in that there was often a lack of agreement in deciding on adequate outside criteria. The basic approach of construct validity is to assess the extent to which the test measures a theoretical construct or trait. This assessment involves three general steps. Initially, the test constructor must make a careful analysis of the trait. This is followed by a consideration of the ways in which the trait should relate to other variables. Finally, the test designer needs to test whether these hypothesized relationships actually exist (Foster & Cone, 1995). For example, a test measuring dominance should have a high correlation with the individual accepting leadership roles and a low or negative correlation with measures of submissiveness. Likewise, a test measuring anxiety should have a high positive correlation with individuals who are measured during an anxiety-provoking situation, such as an experiment involving some sort of physical pain. As these hypothesized relationships are verified by research studies, the degree of confidence that can be placed in a test increases.
There is no single, best approach for determining construct validity; rather, a variety of different possibilities exist. For example, if some abilities are expected to increase with age, correlations can be made between a population’s test scores and age. This may be appropriate for variables such as intelligence or motor coordination, but it would not be applicable for most personality measurements. Even in the measurement of intelligence or motor coordination, this approach may not be appropriate beyond the age of maturity. Another method for determining construct validity is to measure the effects of experimental or treatment interventions. Thus, a posttest measurement may be taken following a period of instruction to see if the intervention affected the test scores in relation to a previous pretest measure. For example, after an examinee completes a course in arithmetic, it would be predicted that scores on a test of arithmetical ability would increase. Often, correlations can be made with other tests that supposedly measure a similar variable. However, a new test that correlates too highly with existing tests may represent needless duplication unless it incorporates some additional advantage such as a shortened format, ease of administration, or superior predictive validity. Factor analysis is of particular relevance to construct validation because it can be used to identify and assess the relative strength of different psychological traits. Factor analysis can also be used in the design of a test to identify the primary factor or factors measured by a series of different tests. Thus, it can be used to simplify one or more tests by reducing the number of categories to a few common factors or traits. The factorial validity of a test is the relative weight or loading that a factor has on the test. 
For example, if a factor analysis of a measure of psychopathology determined that the test was composed of two clear factors that seemed to be measuring anxiety and depression, the test could be considered to have factorial validity. This would be especially true if the two factors seemed to be accounting for a clear and large portion of what the test was measuring.
Another method used in construct validity is to estimate the degree of internal consistency by correlating specific subtests with the test’s total score. For example, if a subtest on an intelligence test does not correlate adequately with the overall or Full Scale IQ, it should be either eliminated or altered in a way that increases the correlation. A final method for obtaining construct validity is for a test to converge or correlate highly with variables that are theoretically similar to it. The test should not only show this convergent validity but also have discriminant validity, in which it would demonstrate low or negative correlations with variables that are dissimilar to it. Thus, scores on reading comprehension should show high positive correlations with performance in a literature class and low correlations with performance in a class involving mathematical computation.
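The reading comprehension example can be sketched as a comparison of two correlations. All scores below are hypothetical and were invented so that the expected pattern is visible; `correlate` is an illustrative helper, not part of any testing package.

```python
from math import sqrt


def correlate(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for the same six students on three measures.
reading_comprehension = [12, 15, 18, 22, 25, 30]
literature_grade = [60, 66, 70, 78, 80, 91]   # theoretically similar variable
math_grade = [75, 62, 80, 68, 77, 70]         # theoretically dissimilar variable

convergent_r = correlate(reading_comprehension, literature_grade)
discriminant_r = correlate(reading_comprehension, math_grade)

print(f"convergent (reading vs. literature):    r = {convergent_r:.2f}")
print(f"discriminant (reading vs. mathematics): r = {discriminant_r:.2f}")
```

A high first coefficient together with a low second one is the pattern that would support both convergent and discriminant validity for the reading measure.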
Related to discriminant and convergent validity is the degree of sensitivity and specificity an assessment device demonstrates in identifying different categories. Sensitivity refers to the percentage of true positives that the instrument has identified (persons who have the condition and are correctly detected), whereas specificity is the relative percentage of true negatives (persons who do not have the condition and are correctly ruled out). A structured clinical interview might be quite sensitive in that it would accurately identify 90% of schizophrenics in an admitting ward of a hospital. However, it may not be sufficiently specific in that 30% of patients who do not have schizophrenia would be incorrectly classified as having it. The difficulty in determining sensitivity and specificity lies in developing agreed-on, objectively accurate outside criteria for categories such as psychiatric diagnosis, intelligence, or personality traits.
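A minimal sketch of how these two indices are computed from classification counts follows. The patient counts are hypothetical and assume that an agreed-on outside criterion for the diagnosis is available; the function names are illustrative only.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of persons WITH the condition whom the instrument detects."""
    total_with_condition = true_positives + false_negatives
    if total_with_condition == 0:
        raise ValueError("no persons with the condition in the sample")
    return true_positives / total_with_condition


def specificity(true_negatives, false_positives):
    """Proportion of persons WITHOUT the condition whom the instrument rules out."""
    total_without_condition = true_negatives + false_positives
    if total_without_condition == 0:
        raise ValueError("no persons without the condition in the sample")
    return true_negatives / total_without_condition


# Hypothetical admitting ward: 50 patients meet the outside criterion for the
# diagnosis and 100 do not.
sens = sensitivity(true_positives=45, false_negatives=5)
spec = specificity(true_negatives=70, false_positives=30)

print(f"sensitivity = {sens:.0%}")  # 45 of 50 detected
print(f"specificity = {spec:.0%}")  # 70 of 100 correctly ruled out
```

The two indices are computed on different groups of people, which is why an instrument can be high on one and low on the other.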
As indicated by the variety of approaches discussed, no single, quick, efficient method exists for determining construct validity. It is similar to testing a series of hypotheses in which the results of the studies determine the meanings that can be attached to later test scores (Foster & Cone, 1995; Messick, 1995). Almost any data can be used, including material from the content and criterion approaches. The greater the amount of supporting data, the greater is the level of confidence with which the test can be used. In many ways, construct validity represents the strongest and most sophisticated approach to test construction, and all other types of validity can be considered subcategories of it. Construct validation involves theoretical knowledge of the trait or ability being measured, knowledge of other related variables, hypothesis testing, and statements regarding the relationship of the test variable to a network of other variables that have been investigated. Thus, construct validation is a never-ending process in which new relationships can always be verified and investigated.