What is Reliability?
The quality of being trustworthy or of performing consistently well. The degree to which the result of a measurement, calculation, or specification can be depended on to be accurate.
Definitions of Reliability
The ability of an apparatus, machine, or system to consistently perform its intended or required function or mission, on demand and without degradation or failure.
Manufacturing: The probability of failure-free performance over an item’s useful life, or a specified time-frame, under specified environmental and duty-cycle conditions. Often expressed as mean time between failures (MTBF) or reliability coefficient. Also called quality over time.
Consistency and validity of test results determined through statistical methods after repeated trials.
The reliability of a test refers to its degree of stability, consistency, predictability, and accuracy. It addresses the extent to which scores obtained by a person are the same if the person is reexamined by the same test on different occasions. Underlying the concept of reliability is the possible range of error, or error of measurement, of a single score. This is an estimate of the range of possible random fluctuation that can be expected in an individual’s score. It should be stressed, however, that a certain degree of error or noise is always present in the system, from such factors as a misreading of the items, poor administration procedures, or the changing mood of the client. If there is a large degree of random fluctuation, the examiner cannot place a great deal of confidence in an individual’s scores. The goal of a test constructor is to reduce, as much as possible, the degree of measurement error, or random fluctuation. If this is achieved, the difference between one score and another for a measured characteristic is more likely to result from some true difference than from some chance fluctuation.
Two main issues relate to the degree of error in a test. The first is the inevitable, natural variation in human performance. Usually the variability is less for measurements of ability than for those of personality. Whereas ability variables (intelligence, mechanical aptitude, etc.) show gradual changes resulting from growth and development, many personality traits are much more highly dependent on factors such as mood. This is particularly true in the case of a characteristic such as anxiety. The practical significance of this in evaluating a test is that certain factors outside the test itself can serve to reduce the reliability that the test can realistically be expected to achieve. Thus, an examiner should generally expect higher reliabilities for an intelligence test than for a test measuring a personality variable such as anxiety. It is the examiner’s responsibility to know what is being measured, especially the degree of variability to be expected in the measured trait.
The second important issue relating to reliability is that psychological testing methods are necessarily imprecise. For the hard sciences, researchers can make direct measurements such as the concentration of a chemical solution, the relative weight of one organism compared with another, or the strength of radiation. In contrast, many constructs in psychology are often measured indirectly. For example, intelligence cannot be perceived directly; it must be inferred by measuring behavior that has been defined as being intelligent. Variability relating to these inferences is likely to produce a certain degree of error resulting from the lack of precision in defining and observing inner psychological constructs. Variability in measurement also occurs simply because people have true (not because of test error) fluctuations in performance between one testing session and the next. Whereas it is impossible to control for the natural variability in human performance, adequate test construction can attempt to reduce the imprecision that is a function of the test itself. Natural human variability and test imprecision make the task of measurement extremely difficult. Although some error in testing is inevitable, the goal of test construction is to keep testing errors within reasonably accepted limits. A high correlation is generally .80 or more, but the variable being measured also changes the expected strength of the correlation. Likewise, the method of determining reliability alters the relative strength of the correlation. Ideally, clinicians should hope for correlations of .90 or higher in tests that are used to make decisions about individuals, whereas a correlation of .70 or more is generally adequate for research purposes.
The purpose of reliability is to estimate the degree of test variance caused by error. The four primary methods of obtaining reliability involve determining (a) the extent to which the test produces consistent results on retesting (test-retest), (b) the relative accuracy of a test at a given time (alternate forms), (c) the internal consistency of the items (split half), and (d) the degree of agreement between two examiners (inter-scorer). Another way to summarize this is that reliability can be time to time (test-retest), form to form (alternate forms), item to item (split half), or scorer to scorer (inter-scorer). Although these are the main types of reliability, there is a fifth type, the Kuder-Richardson; like the split half, it is a measurement of the internal consistency of the test items. However, because this method is considered appropriate only for tests that are relatively pure measures of a single variable, it is not covered in this book.
Test-retest reliability is determined by administering the test and then repeating it on a second occasion. The reliability coefficient is calculated by correlating the scores obtained by the same person on the two different administrations. The degree of correlation between the two scores indicates the extent to which the test scores can be generalized from one situation to the next. If the correlations are high, the results are less likely to be caused by random fluctuations in the condition of the examinee or the testing environment. Thus, when the test is being used in actual practice, the examiner can be relatively confident that differences in scores are the result of an actual change in the trait being measured rather than random fluctuation.
A number of factors must be considered in assessing the appropriateness of test-retest reliability. One is that the interval between administrations can affect reliability. Thus, a test manual should specify the interval as well as any significant life changes that the examinees may have experienced, such as counseling, career changes, or psychotherapy. For example, tests of preschool intelligence often give reasonably high correlations if the second administration is within several months of the first one. However, correlations with later childhood or adult IQ are generally low because of innumerable intervening life changes. One of the major difficulties with test-retest reliability is the effect that practice and memory may have on performance, which can produce improvement between one administration and the next. This is a particular problem for speeded and memory tests such as those found on the Digit Symbol and Arithmetic subtests of the WAIS-III. Additional sources of variation may be the result of random, short-term fluctuations in the examinee, or of variations in the testing conditions. In general, test-retest reliability is the preferred method only if the variable being measured is relatively stable. If the variable is highly changeable (e.g., anxiety), this method is usually not adequate.
The alternate forms method avoids many of the problems encountered with test-retest reliability. The logic behind alternate forms is that, if the trait is measured several times on the same individual by using parallel forms of the test, the different measurements should produce similar results. The degree of similarity between the scores represents the reliability coefficient of the test. As in the test-retest method, the interval between administrations should always be included in the manual as well as a description of any significant intervening life experiences. If the second administration is given immediately after the first, the resulting reliability is more a measure of the correlation between forms and not across occasions. Correlations determined by tests given with a wide interval, such as two months or more, provide a measure of both the relation between forms and the degree of temporal stability.
The alternate forms method eliminates many carryover effects, such as the recall of previous responses the examinee has made to specific items. However, there is still likely to be some carryover effect in that the examinee can learn to adapt to the overall style of the test even when the specific item content between one test and another is unfamiliar. This is most likely when the test involves some sort of problem-solving strategy in which the same principle in solving one problem can be used to solve the next one. An examinee, for example, may learn to use mnemonic aids to increase his or her performance on an alternate form of the WAIS-III Digit Symbol subtest.
Perhaps the primary difficulty with alternate forms lies in determining whether the two forms are actually equivalent. For example, if one test is more difficult than its alternate form, the difference in scores may represent actual differences in the two tests rather than differences resulting from the unreliability of the measure. Because the test constructor is attempting to measure the reliability of the test itself and not the differences between the tests, this could confound and lower the reliability coefficient. Alternate forms should be independently constructed tests that use the same specifications, including the same number of items, type of content, format, and manner of administration.
A final difficulty is encountered primarily when there is a delay between one administration and the next. With such a delay, the examinee may perform differently because of short-term fluctuations such as mood, stress level, or the relative quality of the previous night’s sleep. Thus, an examinee’s abilities may vary somewhat from one examination to another, thereby affecting test results. Despite these problems, alternate forms reliability has the advantage of at least reducing, if not eliminating, many carryover effects of the test-retest method. A further advantage is that the alternate test forms can be useful for other purposes, such as assessing the effects of a treatment program or monitoring a patient’s changes over time by administering the different forms on separate occasions.
Split Half Reliability
The split half method is the best technique for determining reliability for a trait with a high degree of fluctuation. Because the test is given only once, the items are split in half, and the two halves are correlated. As there is only one administration, it is not possible for the effects of time to intervene as they might with the test-retest method. Thus, the split half method gives a measure of the internal consistency of the test items rather than the temporal stability of different administrations of the same test. To determine split half reliability, the test is often split on the basis of odd and even items. This method is usually adequate for most tests. Dividing the test into a first half and second half can be effective in some cases, but is often inappropriate because of the cumulative effects of warming up, fatigue, and boredom, all of which can result in different levels of performance on the first half of the test compared with the second.
As is true with the other methods of obtaining reliability, the split half method has limitations. When a test is split in half, there are fewer items on each half, which results in wider variability because the individual responses cannot stabilize as easily around a mean. As a general principle, the longer a test is, the more reliable it is because the larger the number of items, the easier it is for the majority of items to compensate for minor alterations in responding to a few of the other items. As with the alternate forms method, differences in content may exist between one half and another.
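The relationship between test length and reliability described above is captured by the Spearman-Brown prophecy formula, a standard classical-test-theory result (not named in the text) that predicts reliability when a test is lengthened or shortened by a factor k, assuming the added items are comparable to the existing ones:

```python
# Spearman-Brown prophecy formula: predicted reliability when the
# number of items is multiplied by k (assumes comparable items).
def prophecy(r: float, k: float) -> float:
    """Predicted reliability of a test whose length is multiplied by k."""
    return (k * r) / (1 + (k - 1) * r)

# A test with reliability .70, doubled in length:
print(round(prophecy(0.70, 2), 2))    # 0.82 -- lengthening raises reliability
# The same test cut in half:
print(round(prophecy(0.70, 0.5), 2))  # 0.54 -- shortening lowers it
```

This quantifies why splitting a test in half yields a lower correlation than the full test would produce, and why the half-test correlation must be adjusted upward.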
In some tests, scoring is based partially on the judgment of the examiner. Because judgment may vary between one scorer and the next, it may be important to assess the extent to which reliability might be affected. This is especially true for projectives and even for some ability tests where hard scorers may produce results somewhat different from easy scorers. This variance in interscorer reliability may apply for global judgments based on test scores such as brain-damaged versus normal, or for small details of scoring such as whether a person has given a shading versus a texture response on the Rorschach. The basic strategy for determining interscorer reliability is to obtain a series of responses from a single client and to have these responses scored by two different individuals. A variation is to have two different examiners test the same client using the same test and then to determine how close their scores or ratings of the person are. The two sets of scores can then be correlated to determine a reliability coefficient. Any test that requires even partial subjectivity in scoring should provide information on interscorer reliability.
The best form of reliability is dependent on both the nature of the variable being measured and the purposes for which the test is used. If the trait or ability being measured is highly stable, the test-retest method is preferable, whereas split half is more appropriate for characteristics that are highly subject to fluctuations. When using a test to make predictions, the test-retest method is preferable because it gives an estimate of the dependability of the test from one administration to the next. This is particularly true if, when determining reliability, an increased time interval existed between the two administrations. If, on the other hand, the examiner is concerned with the internal consistency and accuracy of a test for a single, one-time measure, either the split half or the alternate forms would be best.
Another consideration in evaluating the acceptable range of reliability is the format of the test. Longer tests usually have higher reliabilities than shorter ones. Also, the format of the responses affects reliability. For example, a true-false format is likely to have a lower reliability than multiple choice because each true-false item has a 50% possibility of the answer being correct by chance. In contrast, each question in a multiple-choice format having five possible choices has only a 20% possibility of being correct by chance. A final consideration is that tests with various subtests or subscales should report the reliability for the overall test as well as for each of the subtests. In general, the overall test score has a significantly higher reliability than its subtests. In estimating the confidence with which test scores can be interpreted, the examiner should take into account the lower reliabilities of the subtests. For example, a Full Scale IQ on the WAIS-III can be interpreted with more confidence than the specific subscale scores.
Most test manuals include a statistical index of the amount of error that can be expected for test scores, which is referred to as the standard error of measurement (SEM). The logic behind the SEM is that test scores consist of both truth and error. Thus, there is always noise or error in the system, and the SEM provides a range to indicate how extensive that error is likely to be. The range depends on the test’s reliability: the higher the reliability, the narrower the range of error. The SEM is a standard deviation score, so, for example, a SEM of 3 on an intelligence test would indicate that an individual’s score has a 68% chance of falling within ± 3 IQ points of the estimated true score. This is because ± 3 points corresponds to a band extending from one standard error below to one standard error above the estimated true score. Likewise, there would be an approximately 95% chance that the individual’s score would fall within ± 6 points (± 2 SEMs) of the estimated true score. From a theoretical perspective, the SEM is a statistical index of how a person’s repeated scores on a specific test would fall around a normal distribution. Thus, it is a statement of the relationship among a person’s obtained score, his or her theoretically true score, and the test reliability. Because it is an empirical statement of the probable range of scores, the SEM has more practical usefulness than a knowledge of the test reliability alone. This band of error is also referred to as a confidence interval.
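In classical test theory the SEM is computed from the test's standard deviation and reliability as SEM = SD √(1 − r). A minimal sketch with illustrative values for an IQ-style scale (SD = 15), treating the obtained score as the estimated true score for simplicity:

```python
# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
# Illustrative values for an IQ-style scale (SD = 15, reliability = .96).
import math

sd = 15.0
reliability = 0.96

sem = sd * math.sqrt(1 - reliability)  # = 15 * 0.2 = 3.0

obtained = 100  # hypothetical obtained score, used here as the true-score estimate
band_68 = (obtained - sem, obtained + sem)          # +/- 1 SEM, ~68% band
band_95 = (obtained - 2 * sem, obtained + 2 * sem)  # +/- 2 SEMs, ~95% band
print(f"SEM = {sem:.1f}, 68% band = {band_68}, 95% band = {band_95}")
```

Note how the formula makes the text's point directly: as reliability approaches 1.0, √(1 − r) shrinks toward zero and the confidence band narrows.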
The acceptable range of reliability is difficult to identify and depends partially on the variable being measured. In general, unstable aspects (states) of the person produce lower reliabilities than stable ones (traits). Thus, in evaluating a test, the examiner should expect higher reliabilities on stable traits or abilities than on changeable states. For example, a person’s general fund of vocabulary words is highly stable and therefore produces high reliabilities. In contrast, a person’s level of anxiety is often highly changeable. This means examiners should not expect nearly as high reliabilities for anxiety as for an ability measure such as vocabulary. A further consideration, also related to the stability of the trait or ability, is the method of reliability that is used. Alternate forms are considered to give the lowest estimate of the actual reliability of a test, while split half provides the highest estimate. Another important way to estimate the adequacy of reliability is by comparing the reliability derived on other similar tests. The examiner can then develop a sense of the expected levels of reliability, which provides a baseline for comparisons. In the example of anxiety, a clinician may not know what is an acceptable level of reliability. A general estimate can be made by comparing the reliability of the test under consideration with other tests measuring the same or a similar variable. The most important thing to keep in mind is that lower levels of reliability usually suggest that less confidence can be placed in the interpretations and predictions based on the test data. However, clinical practitioners are less likely to be concerned with low statistical reliability if they have some basis for believing the test is a valid measure of the client’s state at the time of testing. The main consideration is that the sign or test score does not mean one thing at one time and something different at another.