Inter-Subtest Branching in Computerized Adaptive Testing

Shu-Hui Chang, Ph.D.
Chung-Yuan Christian University
PO Box 12-339, Chung-Li (320)
Taiwan, R.O.C.

Shu-Hui Chang, Chung-Yuan Christian University, Taiwan, R.O.C.
William R. Koch, The University of Texas at Austin
Barbara G. Dodd, The University of Texas at Austin

9. Computer applications

Abstract

This study compared three inter-subtest branching methods used in computerized adaptive testing (CAT) procedures. The inter-subtest branching methods were based on multiple correlation, which decided the order of presentation of the subtests so that multiple regression on the preceding subtests could be used to predict the initial trait level for the current subtest. One thousand simulees were used for item calibration, and 200 simulees were used for the CAT procedures. Results comparing the branching methods in terms of test length did not reveal distinct differences. It was concluded that branching based on raw scores yielded better results than the other competing methods; however, the differences among the methods were not large enough to make a practical distinction.

A basic assumption of item response theory (IRT) models is that the test measures a single trait (Lord & Novick, 1968; Lord, 1980). Applying unidimensional IRT models may not be appropriate when this assumption is violated. Most tests, particularly achievement tests, seldom measure a single dimension. Unidimensional IRT models are applicable only if they are shown to be robust to violations of this assumption. According to previous studies investigating this issue (Reckase, 1979; Ansley & Forsyth, 1985), the violation of unidimensionality has an effect on parameter estimation. LOGIST is robust only to minor violations of the assumption. The key issue is the interpretability of the final trait estimate, for which information may be lost when multicontent achievement tests are used.
Therefore, applying multidimensional IRT models to multicontent achievement tests, or separately estimating an ability level on each dimension, may be more appropriate. Since multidimensional IRT models are still under development, little research has been conducted on applying these models in practical testing settings. Thus, the application of multidimensional IRT models to CAT will not be possible until the characteristics of these new models and the properties of their parameter estimates are well defined and fully investigated. At present, applying a branching strategy to computerized adaptive achievement tests seems a more promising way to resolve such practical problems.

According to previous studies (Brown & Weiss, 1977; Gialluca & Weiss, 1979; Maurelli & Weiss, 1981), the combination of inter-subtest and intra-subtest branching significantly reduced the mean test length while maintaining a high level of psychometric information. The "inter-subtest" branching method is based on linear multiple regression to decide the order of presentation of the subtests. The "intra-subtest" branching procedure, on the other hand, chooses the item that provides maximum information at the current ability estimate. Therefore, the present study focused on comparing various branching methods in the CAT version of an achievement test. Previous branching studies have been limited to the use of Bayesian estimation. In contrast, the present study used maximum likelihood estimation of achievement levels.

Methods

The research design followed four steps: data simulation, item calibration, CAT branching, and data analysis.

Step 1: Data Simulation. One data set was generated to simulate item response data obtained from one hypothetical achievement test that contained four subtests. The number of items in each subtest was 100. Thus the complete achievement test was composed of an item bank containing 400 items.
A computer program for the data generation procedures proposed by Wherry, Naylor, Wherry & Fallis (1965) was used to generate the data set according to a general linear factor analytic model. Simulated responses from 1200 examinees were generated for each item in the data set. These simulees were divided into two samples. The first 1000 simulees were used as the "calibration sample" and the last 200 simulees were used as the "CAT sample". This design is intended to simulate a real testing situation in which the examinees' trait levels are unknown before the CAT session.

Step 2: Item Calibration. The three-parameter logistic (3PL) model was used for item calibration before forming a CAT item pool. Since each subtest has one dominant factor (unidimensionality), LOGIST was run separately for each of the four subtests.

Step 3: CAT Branching. Two hundred simulees (the CAT sample) were used for the CAT procedure. Three branching methods were implemented. These methods shared the same principle: the entry point (initial theta estimate) for the first subtest was the item with median difficulty. The entry points for the remaining three subtests were based on the final theta estimates from the preceding subtests. The methods differed in the order in which the subtests were administered. The fourth method is a "no inter-subtest branching" method. The four methods are:

(1) branching based on multiple correlation of raw scores -- the higher the correlation, the earlier the subtest is administered;
(2) branching based on multiple correlation of raw scores -- the lower the correlation, the earlier the subtest is administered;
(3) branching based on random order; and
(4) no-branching.

Step 4: Data Analysis. The data analysis was based on comparisons among the four methods in terms of test length reduction.
Mean test length was obtained for the subtests in each of the three branching methods and in the no-branching method to investigate the relationship between test length reduction and the inter-subtest branching methods. The branching method that resulted in a shorter test while providing psychometric information equivalent to that obtained with the whole version of the test (measured by the average standard error and the correlation between the known true theta and the theta estimate) was evaluated as the best CAT procedure among the competing branching methods.

Results

Step 1: Data Simulation. A total of 1200 simulees (1000 for calibration and 200 for the CAT procedure) were generated. The inter-subtest correlations ranged from low to high (r = .14 to .69 among subtests), a desirable feature that better reflects a real testing situation. In addition, four Z scores (Z1, Z2, Z3 and Z4, the known true trait levels for the four subtests, respectively) are available for later comparison with the theta estimates obtained from the CAT procedures. According to the descriptive statistics of these Z scores, they fall roughly between -3 and +3, with means close to zero and standard deviations close to one.

Step 2: Item Calibration. The output of the LOGIST runs consisted of item and person parameter estimates for the four subtests. The results include the theta estimates for the 1000 simulees (calibration sample) and the item characteristics (a estimates, b estimates and c estimates) of the 100 items in each subtest. Table 1 shows the descriptive statistics of the item parameter estimates. The levels of the "a" estimates are good. Most discrimination levels are greater than .80, which is the minimum requirement for a good item pool. The ranges of the "b" estimates are somewhat restricted, and the means of the "b" estimates are slightly negative. The ranges of the "c" estimates are reasonable.
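
As a minimal illustration of the quantities involved in calibration and item selection, the sketch below computes the 3PL response probability, the item information function, and the asymptotic standard error of a theta estimate, and picks the most informative item at a given theta. It is not the program used in the study. The scaling constant D = 1.7, the function names, and the three-item pool are assumptions introduced for illustration only.

```python
import math

D = 1.7  # scaling constant; an assumption, not stated in the paper


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of a single 3PL item at theta."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def most_informative_item(theta, items, used=()):
    """Index of the not-yet-used item with maximum information at theta."""
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))


def standard_error(theta, administered):
    """Asymptotic standard error: 1 / sqrt(sum of item information)."""
    info = sum(item_information(theta, *item) for item in administered)
    return 1.0 / math.sqrt(info)


# Hypothetical (a, b, c) triples; not items from the simulated pool.
pool = [(1.2, -1.0, 0.13), (1.3, -0.2, 0.13), (1.0, 0.8, 0.14)]

# Starting theta of -0.20, the value reported for the initial estimate.
first = most_informative_item(-0.20, pool)
se_after_all = standard_error(-0.20, pool)
```

With this hypothetical pool, the second item (index 1) is chosen first because it combines the highest discrimination with a difficulty equal to the starting theta.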
In addition, the peaks of the test information functions for the four subtests are all at -0.20 on the theta scale. Therefore, the initial theta estimate was set to -0.20, and the item with the most information at that theta level was selected as the first item to administer in the CAT procedure. Finally, the correlations between the known true traits (Z scores) and the theta estimates obtained from the LOGIST runs range from .84 to .96. These correlations confirm the compatibility of the Z scores and the theta estimates.

Step 3: CAT Branching. Four CAT runs were conducted using the four methods. An additional CAT was also run on the full-length test for the purpose of comparison. The order of subtests for each method is presented in Table 2. For method 1, the order of subtests is based on the correlations among subtests. According to the inter-subtest correlations, the correlation between subtest 1 and subtest 2 (r = .69) is the highest among the six possible correlations of the four subtests. Subtest 2 was administered first and subtest 1 second because R2.34 (.61) was higher than R1.34 (.47). Furthermore, subtest 3 was administered third and subtest 4 last because R3.12 (.51) was higher than R4.12. Thus, the order of subtests for method 1 was subtest 2, subtest 1, subtest 3, and subtest 4. Method 2 applies the same principle as method 1 but orders the subtests according to the lowest correlation instead of the highest, and therefore yields a different subtest order from method 1. Of the two remaining methods, method 3 was based on a random order for each simulee and method 4 is a no-branching method. Finally, the CAT procedure stopped when the standard error reached .30 or when a maximum of 20 items had been administered.

Step 4: Data Analysis. The comparison of mean test lengths across the four methods is presented in Table 3. For subtest 1, method 1 and method 3 produced the shortest mean test length (12.810 items).
For subtest 2, method 2 had the shortest mean test length (10.685). For subtest 3, method 1 had the shortest mean test length (13.715). For subtest 4, method 1 had the shortest mean test length (18.215); the other three methods had the same mean test length (18.265). Subtest 4 required the most items of the four subtests.

Table 4 presents correlations that provide additional confirmation. The correlations between the CAT theta estimates and the Z scores range from 0.77 to 0.88, and the correlations between the CAT theta estimates and the theta estimates from the full-length test range from 0.93 to 0.96. This is the case for all four methods.

Conclusions

No large differences were found among the four methods. Within the small range of differences, the inter-subtest branching methods appeared to be superior to the no-branching method. Nevertheless, inter-subtest branching did not contribute much to shortening the test. Therefore, the no-branching method may be useful in practice. With the no-branching method, however, the subtest order could be an issue. Whether subtests are administered in a fixed order for every examinee or in whatever order each examinee prefers may be an important factor that needs to be investigated in future research. In addition, the present study was limited to four subtests and had inter-subtest correlations ranging from low to high. The number of subtests, the magnitude of the inter-subtest correlations, and their interaction could affect the improvement gained from branching and need to be investigated systematically in future research.

Table 1
Descriptive Statistics of a, b and c Estimates for Four Subtests
_______________________________________________________________
                   Subtest 1   Subtest 2   Subtest 3   Subtest 4
_______________________________________________________________
Est. a    Mean       1.22        1.35        1.20        1.04
          SD         0.18        0.20        0.18        0.14
          Min        0.85        0.83        0.83        0.76
          Max        1.73        1.85        1.72        1.49
Est. b    Mean      -0.22       -0.17       -0.19       -0.16
          SD         0.76        0.75        0.73        0.79
          Min       -1.91       -1.66       -1.79       -1.78
          Max        1.19        1.44        1.03        1.36
Est. c    Mean       0.13        0.13        0.14        0.14
          SD         0.06        0.04        0.07        0.05
          Min        0.00        0.03        0.00        0.01
          Max        0.33        0.25        0.42        0.31
No. Items            100         100         100         100
_______________________________________________________________

Table 2
The Order of Subtests for Four Methods
_______________________________________________________________
Method    Subtest Order
_______________________________________________________________
1         Subtest 2 - Subtest 1 - Subtest 3 - Subtest 4
2         Subtest 4 - Subtest 3 - Subtest 1 - Subtest 2
3         Random Order
4         No-Branching
_______________________________________________________________

Table 3
Mean Test Length for Four Methods
_______________________________________________________________
Method    Subtest 1   Subtest 2   Subtest 3   Subtest 4
_______________________________________________________________
1          12.810      10.945      13.715      18.215
2          12.845      10.685      13.825      18.265
3          12.810      10.820      13.780      18.265
4          12.925      10.945      13.920      18.265
_______________________________________________________________

Table 4
Correlations of Theta Estimates of CAT and Z Scores, and Correlations of Theta Estimates of CAT and Theta Estimates of Whole Test, for Four Methods
______________________________________________________________
          Corr. of Est. Theta of CAT    Corr. of Est. Theta of CAT
          and Z scores                  and Est. Theta of whole test
______________________________________________________________
Method    Subtest                       Subtest
          1     2     3     4           1     2     3     4
1         0.79  0.78  0.84  0.88        0.94  0.96  0.96  0.96
2         0.79  0.77  0.84  0.87        0.93  0.96  0.96  0.96
3         0.79  0.78  0.84  0.88        0.93  0.96  0.96  0.96
4         0.80  0.78  0.84  0.88        0.93  0.96  0.96  0.96
______________________________________________________________

References

Ansley, T.N. & Forsyth, R.A. (1985). An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data. Applied Psychological Measurement, 1, 37-48.

Brown, J.M. & Weiss, D.J. (1977). An adaptive testing strategy for achievement test batteries (Research Report 77-6). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.

Gialluca, K.A. & Weiss, D.J. (1979). Efficiency of an adaptive inter-subtest branching strategy in the measurement of classroom achievement (Research Report 79-6). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.

Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Lord, F.M. & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Maurelli, V.A. & Weiss, D.J. (1981). Factors influencing the psychometric characteristics of an adaptive testing strategy for test batteries (Research Report 81-4). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.

Reckase, M.D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 3, 207-230.

Wherry, R.J., Sr., Naylor, J.C., Wherry, R.J., Jr., & Fallis, R.F. (1965). Generating multiple samples of multivariate data with arbitrary population parameters. Psychometrika, 30, 303-313.
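
The ordering rule described for method 1 under Step 3 of the Results can be sketched as follows. This is one reading of that description, not the authors' program: the pair of subtests with the highest correlation is administered first, and within each pair the subtest with the higher multiple correlation on the other two subtests goes first. The correlation matrix is hypothetical (only r = .69 for subtests 1 and 2 and the lower bound of .14 follow the reported values), and the function names are illustrative.

```python
import math
from itertools import combinations


def multiple_r(r, y, x1, x2):
    """Multiple correlation of subtest y on two predictor subtests x1, x2."""
    ry1, ry2, r12 = r[y][x1], r[y][x2], r[x1][x2]
    r_squared = (ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)


def order_method_1(r):
    """Return 1-based subtest order for four subtests (assumed reading)."""
    n = len(r)  # this sketch assumes exactly four subtests
    # Step 1: find the pair of subtests with the highest correlation.
    pair = max(combinations(range(n), 2), key=lambda p: r[p[0]][p[1]])
    rest = [i for i in range(n) if i not in pair]
    # Step 2: within the leading pair, the subtest with the higher multiple
    # correlation on the remaining two subtests is administered first.
    lead = sorted(pair, key=lambda i: multiple_r(r, i, *rest), reverse=True)
    # Step 3: order the remaining two by multiple correlation on the pair.
    tail = sorted(rest, key=lambda i: multiple_r(r, i, *pair), reverse=True)
    return [i + 1 for i in lead + tail]


# Hypothetical inter-subtest correlation matrix (not the study's matrix).
R = [
    [1.00, 0.69, 0.40, 0.14],
    [0.69, 1.00, 0.50, 0.30],
    [0.40, 0.50, 1.00, 0.35],
    [0.14, 0.30, 0.35, 1.00],
]

order = order_method_1(R)
print(order)  # [2, 1, 3, 4]
```

For this hypothetical matrix the sketch reproduces the method 1 order shown in Table 2. It does not cover method 2, whose rule the text describes only as using the lowest correlation instead of the highest.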