Changes in students' mathematics achievement in

Australian lower secondary schools over time:

A Rasch Analysis.

 

 

Tilahun Mengesha Afrassa

John P Keeves

School of Education

The Flinders University of South Australia

 

 

 

Abstract

 

This paper aims to analyse and scale mathematics data over time by

applying the Rasch model using the QUEST (Adams & Khoo, 1993) computer

program. The mathematics achievement of the students is brought to a

common scale. This common scale is independent of both the samples of

students tested and the samples of items employed. The scale is used to

examine the changes in mathematics achievement of students in Australia

over time. Conclusions are drawn as to the robustness of the common

scale and the changes of students' mathematics achievement over time in

Australia.

 

 

1. CHANGES IN MATHEMATICS ACHIEVEMENTS OVER TIME

 

Over the past three decades researchers have shown considerable

interest in the study of student achievement in mathematics at all

levels across educational systems and over time. Many important

considerations can be drawn from various research studies about

students' achievement in mathematics over time. Willett (1997, 327)

argues that by measuring change over time, it is possible to map

phenomena at the heart of the educational enterprise. In addition, he

argues that education seeks to enhance learning, and to develop change

in achievement, attitudes and values. It is Willett's belief that "only

by measuring individual change is it possible to document each person's

progress and, consequently, to evaluate the effectiveness of

educational systems" (Willett, 1997, 327). Therefore, measuring changes

in achievement over time is one of the most important tools for finding

ways and means to improve the educational system of a country.

 

Since Australia participated in the 1964, 1978 and 1994 International

Mathematics Studies, it should be possible to examine the mathematics

achievement differences over time across the 30-year time period.

 

Therefore, the purpose of this study is to investigate changes in

achievement in mathematics of Australian lower secondary school

students between 1964, 1978 and 1994.

 

In this paper the results of the Rasch analyses of the mathematics

achievement of the 1964, 1978 and 1994 Australian students who

participated in FIMS, SIMS and TIMS are presented and discussed. The

paper is divided into seven sections. The sampling procedures used on

the three occasions are presented in the first section, while the

second section examines the measurement procedures employed in the

study. The third section considers the statistical procedures applied

in the calibration and scoring of the mathematics tests. The fourth

section assesses whether or not the mathematics items administered in

the studies fit the Rasch model. Section five discusses the equating

procedures used in the study. The comparisons of the achievement of

FIMS, SIMS and TIMS students are presented in the next section. The

last section of the paper examines the findings and conclusions drawn

from the study.

 

1.1. Sampling procedure

Table 1 shows the Target Populations of the three international

studies conducted in Australia. In the First International Mathematics

Study (FIMS), conducted in 1964, two groups of students participated,

13-year-old students in Years 7, 8 and 9 and students in Year 8 of

schooling. The total number of students taking part was 4320 (see Table

1).

 In the first study only government schools in New South Wales (NSW),

Victoria (VIC), Queensland (QLD), Western Australia (WA) and Tasmania

(TAS) participated. In the Second International Mathematics Study

(SIMS), which was administered in 1978, nongovernment schools and the

Australian Capital Territory (ACT) and South Australia (SA) were also

involved as well as those states that participated in FIMS. Thus in

SIMS government and nongovernment school students in six states and one

territory were involved. The total number of participants was 5120

students (see Table 1).

 

Meanwhile, in the Third International Mathematics Study (TIMS), which

was conducted in 1994, government and nongovernment school students in

all states and territories including the Northern Territory were involved.

The total number of students tested was 12852 (see Table 1).

 

In 1964 and 1978 the samples were age samples and included students

from Years 7, 8 and 9 in all participating states and territory, while

in TIMS the samples were grade samples drawn from Years 7 and 8 or

Years 8 and 9. In ACT, NSW, VIC and TAS Years 7 and 8 students were

selected while in QLD, SA, WA and NT samples were drawn from Years 8

and 9.

 

Therefore, to make the most meaningful possible comparison of

mathematics achievement over time by using the 1964, 1978 and 1994 data

sets, the following steps were taken.

 

Table 1:- Target populations in FIMS, SIMS and TIMS

 

The 1964 students were divided into two groups, because a grade sample

had been drawn in addition to an age sample: 13-year-old students formed

the first group, and all Year 8 students, including the 13-year-olds at

that year level, formed the second. It is important to observe that

13-year-old students in Year 8 were considered as members of both

groups. In the first group students were chosen for their age and in

the second group for their year level. The 1978 students were chosen as

an age sample and included students from both government and

non-government schools. In order to make the 1978 sample comparable

with the 1964 sample, students from

non-government schools in all participating states and all students from

SA and ACT were excluded from the analyses presented in this paper,

although Rosier (1980) and Moss (1982) considered both the total

student groups and the restricted government school student groups

drawn from only five states.

 

Meanwhile, in TIMS the only common sample for all states and

territories was Year 8 students. In order to make the TIMS samples

comparable with the FIMS samples, only Year 8 government school

students in the five states that participated in FIMS are considered as

the TIMS data set in this study. All non-government school students in

the five states and all students in SA, ACT and NT are excluded from

the analyses presented in this study, although they have been

considered in the recent report by Lokan et al. (1996).

 

After excluding schools and the states and territories that did not

participate in the 1964 study, two sub-populations of students were

identified for comparison between occasions. The first sub-population

consisted of the 13-year-old students in FIMS (the FIMSA group) and

SIMS, who were distributed across Years 7, 8 and 9 on both occasions.

Hence, these two groups of 13-year-old students were considered to be

comparable for the examination of achievement over time between 1964

and 1978. For the comparison between FIMS and TIMS, the second

sub-population consisted of the 1964 Year 8 students (the FIMSB group)

and the 1994 Year 8 students. Students in

both groups were at the same year level, although there were

differences in the ages between these groups which were tested on the

two occasions. Hence, the comparisons in this study are between

13-year-old students in FIMSA and SIMS on the one hand, and FIMSB and

TIMS Year 8 students on the other.

 

1. 2. Measurement procedures employed in the study

In this study the procedures employed to measure mathematics

achievement level of students on the three occasions involved the use

of the Rasch model to scale students' responses to the mathematics test

items. The tests included both multiple choice and constructed response

items, and in the 1994 testing program both dichotomous and

polychotomous scoring procedures were employed for the constructed

response items.

 

1. 2. 1. Use of Rasch model

The Rasch model has been shown to be the most robust of the item

response models (Sontag, 1984), and was used in this study primarily to

equate students' performance in mathematics on a common scale for the

Australian investigations conducted in FIMS, undertaken in 1964, SIMS,

carried out in 1978 and TIMS, which was conducted in 1994.

 

1. 2. 1. 1 Unidimensionality

In order to employ the Rasch model for calibrating the items in the

mathematics tests it was necessary to examine whether or not the items

were unidimensional since the unidimensionality of items is one of the

requirements for the use of the Rasch model (Hambleton and Cook, 1977;

Anderson, 1994). If the items were found not to satisfy the condition

of unidimensionality, it would not be possible to employ the Rasch

procedures in the calibration of the tests. Hence, a literature search

was undertaken to examine whether or not the test developers had

examined the dimensionality of these items, and to the investigators'

knowledge the items had at no time been examined for unidimensionality,

although Peaker (1969) had considered how the part scores should be

weighted.

 

Consequently, confirmatory factor analysis procedures were employed to

test the unidimensionality of the mathematics test items. Confirmatory

factor analysis is a statistical procedure employed for investigating

relations between a set of observed variables and the underlying latent

variables (Byrne, 1989; Kim & Mueller, 1978a, 1978b; Long, 1983;

Spearritt, 1994). Thus, confirmatory factor analysis assumes that the

observed variables are derived from some underlying source variables

(Kim & Mueller, 1978a). Factor analysis may also be used as an

appropriate method for determining the minimum number of hypothetical

variables that would account for the observed covariation, and thus as

a means of exploring the data for possible data reduction (Kim &

Mueller, 1978a). However, one of the main purposes of confirmatory

factor analysis is to examine and test the common underlying dimensions

associated with a number of observed variables.

 

The results of the confirmatory factor analyses of FIMS and SIMS data

sets revealed that a nested model in which the mathematics items were

assigned to three specific correlated first-order factors of

Arithmetic, Algebra and Geometry as well as a general higher order

factor, which was labelled as Mathematics, provided the best fitting

model. In addition, in the confirmatory factor analyses undertaken, no

evidence was found to reject the assumption of the existence of one

general factor involved in the mathematics tests, in so far as in the

nested model the Mathematics factor extracted more of the total

variance than did the specific first-order factors taken together.

Therefore, the mathematics test items in the FIMS and SIMS studies are

considered to satisfy the requirement of unidimensionality. The item

cluster-based design procedure (Adams and Gonzalez, 1996) employed in

the construction of the TIMS data sets would seem to preclude the use

of confirmatory factor analysis to test the unidimensionality of the

TIMS data set.

 

1. 3. The statistical procedures employed in the study

In this section the statistical procedures employed in the study are

discussed.

 

1. 3. 1. Effect size

In this paper both the standardized effect size and the magnitude of

effect on the calibrated scales are used to examine the level of

practical significance of the differences between FIMS, SIMS and TIMS

in mathematics achievements over time.

 

In this study effect size values less than 0.20 are considered as

trivial, while values between 0.20 and 0.50 are considered as small.

Furthermore, effect size values between 0.50 and 0.80 are taken as

moderate and values above 0.80 are treated as large (Cohen, 1991;

Keeves, 1992).
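
The cut-off values above can be expressed as a small classification rule. The following is a minimal sketch in Python, not part of the original analysis; the function name is hypothetical, and the treatment of values falling exactly on a boundary (0.20, 0.50 or 0.80) is an assumption, since the text does not specify it.

```python
def classify_effect_size(d: float) -> str:
    """Label an effect size using the cut-offs stated in the text.

    Assumption: a value exactly on a boundary is assigned to the
    higher category; the paper does not say how ties are handled.
    """
    size = abs(d)  # the sign only indicates the direction of the difference
    if size < 0.20:
        return "trivial"
    if size < 0.50:
        return "small"
    if size < 0.80:
        return "moderate"
    return "large"


if __name__ == "__main__":
    for value in (0.05, 0.30, 0.65, 0.95):
        print(value, classify_effect_size(value))
```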

 

1. 3. 2. Growth between grade levels

Since the TIMS project tested two adjacent grades, it is possible to

estimate the gain between the lower grade and the upper grade for the

Australian sample and thus to interpret the calibrated effect size in

terms of a year of mathematics learning at the lower secondary school

level. The present study revealed that the yearly growth in

mathematics achievement in Australian lower secondary schools

was 37 centilogits. This value is equivalent to an effect size of 0.30.

Keeves (1992) has indicated that an effect size of 0.30 was also found

to be equivalent to a year's learning in science at the lower secondary

school level. Therefore, this additional information allows the

differences between the achievement level of the different groups to be

interpreted in terms of practical significance rather than depending

solely on statistical significance.
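
As an illustration of this interpretation, a difference on the calibrated scale can be re-expressed in years of learning by dividing it by the yearly growth of 37 centilogits. The sketch below is in Python; the function name is hypothetical, and the assumption that growth is linear, so that differences convert proportionally, is made here for illustration only.

```python
CENTILOGITS_PER_YEAR = 37.0  # yearly growth reported in the text


def years_of_learning(difference_centilogits: float) -> float:
    """Convert a scale difference into years of lower secondary learning.

    Assumption: growth is linear, so a difference converts in
    proportion to the reported yearly gain.
    """
    return difference_centilogits / CENTILOGITS_PER_YEAR


if __name__ == "__main__":
    # A difference of 16 centilogits is a little under half a year.
    print(round(years_of_learning(16), 2))  # 0.43
```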

 

1. 3. 3. The t-test

In order to determine the level of statistical significance between the

mean scores on FIMS, SIMS and TIMS in mathematics achievements a t

statistic was calculated, which took into account errors from three

sources: (a) sampling error, (b) errors of calibration, and (c) equating

error. Further comment on the estimation of these errors is given in

Appendix A.
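
One way such a statistic could be formed is sketched below in Python. The procedure actually used is described in Appendix A, which is not reproduced here, so this sketch rests on an assumption: the error components are treated as independent and combined by summing their squares. The function and parameter names are hypothetical.

```python
import math


def t_statistic(mean_a: float, mean_b: float,
                sampling_se_a: float, sampling_se_b: float,
                calibration_error: float = 0.0,
                equating_error: float = 0.0) -> float:
    """Difference between two means divided by a combined standard error.

    Assumption: sampling, calibration and equating errors are
    independent, so their variances are additive.
    """
    combined_se = math.sqrt(
        sampling_se_a ** 2
        + sampling_se_b ** 2
        + calibration_error ** 2
        + equating_error ** 2
    )
    return (mean_a - mean_b) / combined_se


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(t_statistic(10.0, 0.0, 3.0, 4.0))  # 2.0
```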

 

1. 3. 4. Treatment of omits and non-responses

Issues regarding the occurrence and handling of missing data in

achievement tests of the kind employed in these three mathematics

studies were considered. However, the results of the Rasch analyses did

not show marked differences between ignoring the missing data or

treating the missing data as wrong during the calibration and scoring.

Therefore, for both calibration and scoring purposes it was decided to

treat the missing data as wrong. While all the items in the mathematics

test were employed for scoring purposes, for calibration purposes only

those items that fitted the Rasch scale were considered. The main

justification for the use of these procedures would seem to lie in the

greater number of misfitting items when the procedure that involved the

ignoring of the missing data was tested with the SIMS data.

Consequently, the procedure that involved treating missing data as

wrong was chosen for this study.

 

1. 3. 5. Treatment of zero and perfect scores

The QUEST computer program (Adams and Khoo, 1993) by default does not

process cases with perfect and zero scores, because neither group

provides information for the calibration of the scale. Cases with

perfect scores are those cases who provided correct responses for all

the items, while cases with zero scores are those cases who provided

wrong responses for all items. Hence, in order to include those cases

with perfect and zero scores in the calculation of the mean and

standard deviation of the mathematics achievement test scores for each

student sample, the values of perfect and zero scores were calculated

by extrapolation from the logit table produced by the QUEST computer

program (see Appendix B). Subsequently, the SPSS 6.1 (Norusis & SPSS

Inc, 1990) computer program was used to calculate the case estimate

mean scores and standard deviations with appropriate weights, and the

WesVarPC 2.11 (Brick, Broene, James and Severynse, 1997) computer

program was employed for calculating the standard error of the mean

values, again with appropriate weighting of the data, and with

allowance made for the fact that all samples were of a stratified

cluster sample design.

 

1. 3. 6. Developing a common mathematics scale

The calibration of the mathematics data permitted a scale to be

constructed that extended across the three groups, namely FIMS, SIMS

and TIMS students on the mathematics scale. The fixed point of the

scale was set at 500 with one logit, the natural metric of the scale,

being set at 100 units. The fixed point of the scale, namely 500 was

taken as the mean of the difficulty level of the calibrated items in

the FIMS test administered in 1964. The mathematics scale constructed

in this way for all different sample groups of students in FIMS, SIMS

and TIMS is presented in Figures 1, 2 and 3, with 100 scale units

(centilogits) being equivalent to one logit.
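
The conversion between logits and the reporting scale can be sketched as follows in Python. The function names are hypothetical, and the sketch assumes that the logit estimates are expressed relative to an origin at the mean difficulty of the calibrated FIMS items, so that a logit value of zero corresponds to 500 scale units.

```python
SCALE_ORIGIN = 500.0     # fixed point: mean difficulty of the calibrated FIMS items
UNITS_PER_LOGIT = 100.0  # one logit equals 100 scale units (centilogits)


def logit_to_scale(logit: float) -> float:
    """Convert a logit estimate to the reporting scale.

    Assumption: logits are centred on the mean FIMS item difficulty.
    """
    return SCALE_ORIGIN + UNITS_PER_LOGIT * logit


def scale_to_logit(score: float) -> float:
    """Convert a reporting-scale score back to logits."""
    return (score - SCALE_ORIGIN) / UNITS_PER_LOGIT


if __name__ == "__main__":
    print(logit_to_scale(0.0))  # 500.0
    print(logit_to_scale(0.5))  # 550.0
```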

 

1. 3. 7. Conclusion

In the last two sections the scaling and the statistical procedures

employed in the study are discussed. The Rasch model was the major

scaling procedure employed. The effect size and the t-test were

employed for comparing the mean values of different groups of students.

 

With respect to the missing data a decision was made, from a study of

the results of the Rasch analyses, for both calibration and scoring

purposes to treat the missing data as wrong, while for calibration

purposes only those items that fitted the Rasch scale were employed.

 

1. 4. Rasch Analysis

Three groups of students, namely FIMS (4320), SIMS (5120) and TIMS

(7926) were employed in the calibration and scoring analyses. The

necessary requirement for calibration in Rasch scaling is that the

items and persons must fit the Rasch scale. Items and persons that do

not fit the scale must be deleted in calibration. In order to examine

whether or not the items and persons fitted the scale, it was also

important to evaluate both the item fit statistics and the person fit

statistics. The results of these analyses are presented below.

 

1. 4. 1. Item fit statistics

One of the key item fit statistics is the infit mean square (INFIT

MNSQ). The infit mean square measures the consistency of fit of the

students to the item characteristic curve for each item with weighted

consideration given to those persons close to the 0.5 probability

level. The acceptable range of the infit mean square statistic for each

item in this study was taken to be from 0.77 to 1.30 (Adams and Khoo,

1993). Values above 1.30 indicate that an item does not

discriminate well, while values below 0.77 indicate that the

item provides redundant information. Hence, consideration must be given

to excluding those items that are outside this range. In calibration,

items that do not fit the Rasch model and which are outside of the

acceptable range must be deleted from the analysis (Rentz and Bashaw,

1975; Wright and Stone, 1979; Kolen and Whitney, 1981; Smith and

Kramer, 1992). Hence, two items in FIMS (Items 13 and 29), two items in

SIMS (Items 21 and 29) and one item in TIMS (Item T1b, No 148) were

removed from the calibration analyses because they did not fit the

Rasch model (see Appendices C and D). One further TIMS item (No 94)

had already been excluded from the international TIMSS analysis.
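
To make the fit criterion concrete, the sketch below computes an information-weighted (infit) mean square for one dichotomous item and checks it against the 0.77 to 1.30 range. It is written in Python and uses the standard textbook definition of the statistic; it is not the QUEST implementation, whose internal computations may differ, and all names are hypothetical.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def infit_mean_square(responses, abilities, difficulty: float) -> float:
    """Sum of squared residuals divided by the sum of response variances.

    Persons whose ability is close to the item difficulty (probability
    near 0.5) have the largest variance and so carry the most weight.
    """
    squared_residuals = 0.0
    information = 0.0
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)
        squared_residuals += (x - p) ** 2
        information += p * (1.0 - p)
    return squared_residuals / information


def within_acceptable_range(infit: float,
                            lower: float = 0.77,
                            upper: float = 1.30) -> bool:
    """Apply the acceptance range used in this study."""
    return lower <= infit <= upper


if __name__ == "__main__":
    # Hypothetical responses of four persons located at the item difficulty.
    value = infit_mean_square([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0)
    print(value, within_acceptable_range(value))
```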

 

 

1. 4. 2. Case Estimates

The other way of investigating the fit of the Rasch scale to data is to

examine the estimates for each case. The case estimates give the

performance level of each student on the total scale. In order to

identify whether the cases fit the scale or not, it is important to

examine the case OUTFIT mean square statistic (OUTFIT MNSQ) which

measures the consistency of the fit of the persons to the student

characteristic curve for each student, with special consideration given

to extreme items. In this study, the general guideline used for

interpreting t as a sign of misfit was t > 5 (Wright and Stone, 1979,

169). Thus, if the OUTFIT MNSQ value for a person has a t-value

greater than 5, that person does not fit the scale and is deleted from

the analysis. In this analysis no person was deleted, because the

t-value for all cases was less than 5. However, students with a zero

score or with a perfect score were automatically excluded from the

calibration procedure.

 

1. 4. 3. Conclusion

In summary, the results of the infit mean square indices, revealed that

68 out of 70 items for FIMS, 70 out of 72 items for SIMS and 156 out

of 157 items for TIMS data sets fitted the Rasch model. In addition,

the evidence indicated that for all cases, the responses of the

students sampled fitted the Rasch model, except for those students who

had perfect or zero scores.

 

1. 5. Equating of mathematics achievement over time

Equating of the mathematics tests requires common items between

occasions, that is between FIMS, SIMS and TIMS. Wright and Stone (1979)

have recommended that 10 to 20 items (17 to 34 per cent of the items in

each test) should be employed for equating two different test forms

consisting of 60 items each. Meanwhile, Hambleton et al. (1991)

suggested that approximately 20 to 25 per cent of the

items in the tests should be common. However, Smith and Kramer (1992)

have argued that as few as a single item is required.

 

In this study, the number of common items in the mathematics test for

FIMS and SIMS data sets was 65. For the mathematics tests the common

items formed approximately 93 per cent of the items for FIMS, and 90

per cent for SIMS. Thus, the common items in the mathematics test for

these two occasions were well above the percentage ranges proposed by

Wright and Stone (1979) and Hambleton et al. (1991).

 

There were also some items which were common for FIMS, SIMS and TIMS

data sets. Garden and Orpwood (1996, 2-2) reported that achievement in

TIMSS was intended to be linked with the results of the two earlier IEA

studies. Thus, in the TIMS data set there were nine items which were

common for the three occasions. Therefore, it was possible to claim

that there were sufficient numbers of common items to equate the

mathematics test on the three occasions.

 

Rasch model equating procedures were employed for equating the three

data sets. Rentz and Bashaw (1975), Beard and Pettie (1979), Sontag

(1984) and Wright (1995) have argued that Rasch model equating

procedures are better than other procedures for equating achievement

tests. All three types of Rasch model equating procedures, namely

concurrent equating, anchor item equating and common item difference

equating were used for equating the three data sets.

 

Concurrent equating was employed for equating the data sets from FIMS

and SIMS. In this method, the 65 common items between FIMS and SIMS

were combined into one data set. Hence, the analysis was done on a

single data file. Only one misfitting item was deleted at a time so as

to avoid dropping some items that might eventually prove to be good

fitting items. The acceptable infit mean square values were between

0.77 and 1.30 (Adams and Khoo, 1993). The concurrent equating analyses

revealed that among the 65 common items 64 items fitted the Rasch

model. Therefore, the threshold values of these 64 items were used as

anchor values (see Appendix E) in the anchor item equating procedures

employed in the scoring of the FIMS and SIMS data sets separately.

Among the 64 common items, nine were common to the FIMS, SIMS and TIMS

data sets. The threshold values of these nine items generated in this

analysis are presented in Table 2 and were used in equating the FIMS

data set with TIMS data sets.

 

The design of TIMS was different from FIMS and SIMS in two ways. In the

first place, only one mathematics test was administered in both FIMS

and SIMS, whereas in the 1994 study the test included mathematics and

science items and the study was named TIMSS (Third International

Mathematics and Science Study). The other difference was that in the

first two international studies, the test was designed as one booklet.

Every participant used the same test booklet. In TIMSS, by contrast, a

rotated test design was used. The test was designed in eight booklets.

Garden and Orpwood (1996, 2-16) explained the arrangement of the test

in eight booklets as follows.

This design called for items to be grouped into "clusters", which were

distributed (or "rotated") through the test booklets so as to obtain

eight booklets of approximately equal difficulty and equivalent content

coverage. Some items (the core cluster) appeared in all booklets, some

(the focus cluster) in three or four booklets, some (the free-response

clusters) in two booklets, and the remainder (the breadth clusters) in

one booklet only. In addition, each booklet was designed to contain

approximately equal numbers of mathematics and science items.

All in all there were 286 unique items that were distributed across

eight booklets for Population 2 (Adams and Gonzalez, 1996, 3-2).

 

In order to investigate the level of mathematics achievement in TIMS,

it is necessary to find ways and means for equating these eight

booklets. Furthermore, in order to employ any kind of test equating

procedure there must be common items between the different booklets.

Garden and Orpwood (1996) reported that the core cluster items (six

items for mathematics) were common to all booklets. In addition, the

focus cluster and free-response clusters were common to some booklets.

Thus, it was possible to equate these eight booklets and report the

achievement level in TIMS on a common scale.

 

Hence, from among the Rasch model test equating procedures concurrent

equating was chosen for equating these eight booklets. The purpose of

the test equating in TIMS was to investigate the mathematics

achievement level of Australian students in TIMS and to compare the

result with FIMS and SIMS data sets.

 

Consequently, concurrent equating procedures were employed for the TIMS

data set. Appendix D shows the infit mean square values of the first

and the last concurrent equating runs. The result of the Rasch analysis

indicated that only one item had to be deleted from the analysis. This

item was Item T1b (No 148), whose infit mean square was

below the critical value of 0.77. All other items fitted the Rasch

model well.

 

Table 2:-Descriptive statistics of the common item difference equating

procedure employed in FIMS and TIMS

 

Out of 157 TIMS test items, 156 fitted the

Rasch model well. From the output of the concurrent equating, it was

possible to obtain the threshold values of the nine common items in

TIMS. These threshold values are shown in Table 2.

 

The next step involved the equating of the FIMS data set with the TIMS

data set using the common item difference equating procedure. In this

method the threshold values of the FIMS test generated by the QUEST

computer program (Adams and Khoo, 1993) for each state are first

subtracted from threshold values of the TIMS test. Then the differences

are summed and divided by the number of anchor test items to obtain a

mean difference between FIMS and TIMS for each state (see Table 2). The

interesting point to be mentioned is that the difference in threshold

values between the two occasions in each of the five states was

generally similar. The gap between the highest and the lowest state

mean threshold difference was only

0.18. The highest mean threshold difference was registered in WA

(1.14), while the lowest was in NSW and VIC, where the mean difference

was 0.96 in both states (see Table 2). This result revealed that the

common items in the two tests behaved similarly in all the five states.

 

 

The grand mean difference was calculated by adding the five states' mean

threshold differences and dividing the sum by five. The resulting

mean difference across states was 1.03. The grand mean of the

differences (1.03) is called the equating constant. The equating

constant is subsequently employed in the calculation of the TIMS scores

on the FIMS scale. That is, the equating constant was subtracted from

the Rasch estimated mean score on the TIMS for each state to obtain the

adjusted mean value of TIMS for each state. A comparison of achievement

over time in the five Australian states using the weighted Rasch

estimated scores of 1964, 1978 and 1994 for each state is discussed in

the next section.
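
The common item difference equating steps described above can be sketched as follows in Python. The function names and the threshold values in the usage example are hypothetical and serve only to illustrate the arithmetic; they are not the values reported in Table 2.

```python
from statistics import mean


def mean_threshold_difference(fims_thresholds, tims_thresholds) -> float:
    """Mean of (TIMS threshold - FIMS threshold) over the common items."""
    differences = [t - f for f, t in zip(fims_thresholds, tims_thresholds)]
    return mean(differences)


def equating_constant(state_mean_differences) -> float:
    """Grand mean of the state mean differences."""
    return mean(list(state_mean_differences.values()))


def adjust_to_fims_scale(tims_estimate: float, constant: float) -> float:
    """Place a TIMS estimate on the FIMS scale by subtracting the constant."""
    return tims_estimate - constant


if __name__ == "__main__":
    # Hypothetical thresholds (in logits) for three common items in two states.
    state_differences = {
        "state_1": mean_threshold_difference([-0.5, 0.0, 0.5], [0.5, 1.0, 1.5]),
        "state_2": mean_threshold_difference([-0.4, 0.1, 0.6], [0.8, 1.3, 1.8]),
    }
    constant = equating_constant(state_differences)
    print(state_differences, constant)
    print(adjust_to_fims_scale(1.5, constant))
```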

 

1. 6. Comparisons of Achievement over Time

The comparisons of the performance of students on the mathematics test

for the three occasions were undertaken for two different subgroups

namely: (a) 13-year-old students in government schools, who

participated in the FIMS and SIMS studies; and (b) Year 8 government

school students who participated in the FIMS and TIMS studies. All SIMS

students were 13-year-old students. Meanwhile, some of the FIMS

students were 13-year-old students, while others were younger and/or

older students who were in Year 8. Therefore, for comparison purposes

the FIMS students were divided into two groups, namely: (a) FIMSA -

involving all 13-year-old students, and (b) FIMSB - including all Year

8 students. Thus, FIMSA students' results could be compared with SIMS

students in the government schools of five states, because all students

were 13-year-olds. In the TIMS analyses a decision was made to include

only Year 8 students, because they were the only group of students who

were common to all participating states. Thus TIMS Year 8 government

school students from the five states involved could be compared with

FIMSB students, because in both groups the students were at the same

year level.

 

1. 6. 1. Comparison between students in mathematics achievement over

time

The first part of this section addresses the comparisons between FIMSA

and SIMS, while the second part discusses the comparison between FIMSB

and TIMS.

 

 

 

Table 3:- Comparisons between FIMS and SIMS 13-year-old Government

School Students

 

1. 6. 1. 1. Comparison between 13-year-old students' mathematics

achievement over time

In this section the achievement of 13-year-old Australian students who

participated in FIMS and SIMS is compared. Table 3 presents the

results of the analyses of the comparison between the two occasions.

The first and second panels of the table show the participating states,

the estimated case means of the 13-year-old students, the standard

deviations and the standard error values, the sample sizes, design

effects and effective sample sizes for FIMS and SIMS respectively.

The third panel presents the estimated mean differences between

the two groups, the effect sizes and t-values of the differences and

the significance levels.
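
The relation between the sample size, the design effect and the effective sample size listed in the table can be illustrated with the usual definition, in which the effective sample size is the actual sample size divided by the design effect. The Python sketch below assumes that this standard definition was the one applied; the function name and the numbers are hypothetical.

```python
def effective_sample_size(sample_size: float, design_effect: float) -> float:
    """Size of a simple random sample with the same sampling precision.

    Assumption: the standard definition n / deff applies to the
    stratified cluster samples described in the text.
    """
    if design_effect <= 0:
        raise ValueError("design effect must be positive")
    return sample_size / design_effect


if __name__ == "__main__":
    # Hypothetical values: 1200 students with a design effect of 4.
    print(effective_sample_size(1200, 4.0))  # 300.0
```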

 

 

Figure 1:- Comparison of Achievement in Mathematics between 1964 and

1978 in Australia

 

1.6.1.1.1. State A

When the estimated mean score of the 1964 13-year-old State A students is

compared with that of the 1978 students of the same age in the same state, the

mean score of the 1964 students (458) was higher than that of their

1978 peers (442). The difference was 16 centilogits (see Figure 1 and

Table 3). The differences in standard deviation and standard error

values for the two groups were slight, while the design effect was

larger in 1964 than in 1978. The effect size was trivial (0.16) and the

t-value was 1.39. The estimated mean difference indicated that the

mathematics achievement of 13-year-old students in State A had declined

over time. However, the effect size and t-values showed that the

difference was not practically or statistically significant. Hence, it

is possible to conclude that there was no significant decline in

mathematics achievement in State A at the 13-year-old student level.

 

1.6.1.1.2. State B

Table 3 and Figure 1 indicate that when the estimated mean score of the

13-year-old State B students who participated in the 1964 First

International Mathematics Study is compared with that of the 1978

students of the same age who participated in the Second International

Mathematics Study, the 1964 students (483) were found to be higher

achievers than their 1978 peers (472). However, the difference was

slight, 11 centilogits (see Figure 1 and Table 3). The differences in

the standard deviation values for the two groups were also slight,

while the standard error and design effect were large. Both were larger

in 1964 than in 1978. The effect size was trivial (0.11) and the

t-value was 0.70. The estimated mean difference in scores indicated

that the mathematics achievement of 13-year-old students in State B

declined only slightly over time, since, the effect size and t-values

showed that the difference was not practically or statistically

significant. Therefore, there was no significant decline in mathematics

achievement in State B at the 13-year-old student level.

 

1.6.1.1.3. State C

The next state considered in the comparison between 1964 and 1978 was
State C. The estimated mean score of the 1964 13-year-old State C
students was 423, while the students of the same age in 1978 scored 428
(see Table 3 and Figure 1). The mean score difference between the two
groups was five centilogits in favour of the 1978 students. This
indicated that, unlike State A and State B, in State C the achievement
level of 13-year-old students increased over time. The difference in
standard deviation values for the two groups was slight, while the
standard errors and the design effects were larger in 1964 than in
1978. The effect size was trivial (0.05) and the t-value was 0.14. The
estimated mean difference indicated that the mathematics achievement of
13-year-old students in State C had improved very slightly over time.
However, the effect size and t-value showed that the difference was
neither practically nor statistically significant. Hence, it was
possible to conclude that there was no significant improvement in
mathematics achievement in State C at the 13-year-old student level
between 1964 and 1978.

 

1.6.1.1.4. State D

State D was one of the five Australian states that participated in both
the 1964 and 1978 international mathematics studies. The estimated mean
score of the 1964 13-year-old State D students was compared with that
of the 1978 government school students of the same age in that state.
The mean score difference between students in the two studies was 36
centilogits, in favour of the 1964 students (see Figure 1 and Table 3).
This showed that the achievement of the 1978 students was noticeably
lower than that of the 1964 students. In other words, achievement had
declined from 1964 to 1978 in the State D government schools. The
differences in standard deviation and standard error values for the two
groups were slight, while the design effect was larger in 1978 than in
1964. The effect size was small (0.37) and the t-value was 3.12. The
effect size and t-value showed that the difference was both practically
significant and statistically significant at the 0.01 level. Hence, it
would seem possible to conclude that there was a significant decline in
mathematics achievement in State D government schools at the
13-year-old student level from 1964 to 1978, and that the decline
represented more than one year's learning of mathematics in the lower
secondary schools of Australia.

 

1.6.1.1.5. State E

 

 

State E was the last state for the comparison of performance between
the 1964 and 1978 13-year-old students who participated in FIMS and
SIMS respectively. When the estimated mean score of the 1964
13-year-old State E students was compared with that of the 1978
government school students of the same age in the same state, there was
no difference: the mean score of both groups was 444 (see Figure 1 and
Table 3). The differences in standard deviation and standard error
values for the two groups were slight, while the design effect was
larger in 1964 than in 1978. The effect size and the t-value were both
0.00. Hence, it would seem possible to conclude that there was no
difference in mathematics achievement in State E government schools at
the 13-year-old student level over the 14-year period.

 

The above comparisons were for 13-year-old students in the five states
between 1964 and 1978. Among the five states, only State C showed a
slight improvement in achievement over time, and that improvement was
not statistically significant. There was no difference between the two
occasions in State E. In the remaining three states, State A, State B
and State D, achievement declined over time, but the decline was
significant at the 0.01 level only for State D.

 

1.6.1.1.6. Australia

The results addressed in Sections 1.6.1.1.1 to 1.6.1.1.5 led to the
comparison of the overall Australian 13-year-old students between the
two occasions. The estimated mean score difference between the two
occasions was 19 centilogits, in favour of the 1964 13-year-old
Australian students. This revealed that the mathematics achievement of
Australian students declined from 1964 to 1978. The differences in
standard deviation and standard error values for the two groups were
small, while the design effect was slightly larger in 1964 than in
1978. The effect size was not inconsiderable (0.19) and the t-value was
2.91. Hence, the mean difference was statistically significant at the
0.01 level (see Table 2 and Figures 1 and 3). Moreover, the decline in
the mathematics achievement level of Australian 13-year-old students
between 1964 and 1978 represented approximately two-thirds of a year of
mathematics learning.

 

In conclusion, the comparisons of the mathematics achievement of the
13-year-old students between 1964 and 1978 in the five Australian
states and in Australia overall revealed that achievement had declined
over time in three states and in Australia overall. Statistically
significant declines were recorded only for State D and for Australia
overall. There was no difference between the two occasions for State E
students. While an improvement was recorded for State C, the increase
was slight and not statistically significant. The next section presents
the comparison of mathematics achievement at the Year 8 level between
1964 and 1994 for students in the government schools of the five
states.

 

1.6.1.2. Comparison of Year 8 students' mathematics achievement over time

In Section 1.6.1.1 the mathematics achievement of 13-year-old students
in the five Australian states and in Australia overall was compared
between FIMS and SIMS. In this section the achievement levels of the
Year 8 students in 1964 and 1994 are compared. The results of the
comparisons of students by state are presented in Table 4 and Figure 2.

 

 

1.6.1.2.1. State A

The first state selected for comparison was State A. The estimated mean
score difference between the two occasions at the Year 8 level in State
A was two centilogits, in favour of the 1964 students (see Figure 2 and
Table 4). This result indicated that mathematics achievement at the
Year 8 level had declined very slightly between 1964 and 1994 in State
A government schools. The effect size and t-value were too small to be
considered, and this decline in achievement over time in State A
schools at the Year 8 level was not found to be statistically
significant.

 

1.6.1.2.2. State B

State B was the next state that participated in the three international
mathematics studies. When the estimated mean scores of the FIMSB and
TIMS groups were compared, the 1964 students' mean score was noticeably
higher than that of the 1994 students. This revealed that mathematics
achievement had declined over time in State B schools at the Year 8
level. The standard deviation, standard error and the design effect
were larger in 1994 than in 1964. The effect size (0.83) and t-value
(4.22) were large. Hence, the decline in mathematics achievement at the
Year 8 level between 1964 and 1994 was practically significant and
statistically significant at the 0.01 level. Moreover, it should be
noted that the decline in mathematics achievement in this state
represented more than two years of learning.

 

Table 4:- Comparisons between FIMS and TIMS Year 8 Government School
Student

 

1.6.1.2.3. State C

The mathematics achievement difference between 1964 and 1994 in State C
schools at the Year 8 level was small. The mean difference was 27
centilogits in favour of the 1964 Year 8 students. The result indicated
that the Year 8 students' mathematics achievement in State C had
declined over time. The standard deviation, standard error and the
design effect were larger in 1994 than in 1964. The effect size (0.28)
was small, and the t-value (1.46) was not significant. Hence, it would
seem possible to conclude that there was no statistically significant
decline in achievement between 1964 and 1994 in State C at the Year 8
level. However, a substantial decline would seem to have occurred,
since the effect size (0.28) represented approximately three-quarters
of a year of mathematics learning.

 

Figure 2:- Comparison of Achievement in Mathematics between 1964 and
1994 in Australia

 

1.6.1.2.4. State D

The next comparison was between the State D students. The mean score
difference between the 1964 and 1994 Year 8 students was 61
centilogits, in favour of the 1964 students. This indicated that the
mathematics achievement level of Year 8 State D school students had
declined by more than a year and a half of mathematics learning over
the last 30 years. This difference was marked. The effect size was
medium (0.59) and the t-value was large (3.47). Hence, the difference
was statistically significant at the 0.01 level. The standard deviation
and standard error were larger in 1994 than in 1964, whereas the design
effect was larger in 1964 than in 1994. Thus, in State D government
schools the mathematics achievement level of Year 8 students had
declined substantially over the last three decades.

 

1.6.1.2.5. State E

The last state for comparison between the 1964 and 1994 Year 8 students
who participated in FIMS and TIMS respectively was State E. When the
estimated mean score of the 1964 State E Year 8 students was compared
with that of the 1994 students, the mean score of the 1994 students was
higher than that of the 1964 students (see Figure 2 and Table 4). The
difference was 13 centilogits. The finding indicated that in State E
schools the mathematics achievement level of Year 8 students had
improved over the last three decades. The standard deviation, standard
error and design effect were markedly larger in 1994 than in 1964. The
effect size (0.14) and the t-value (0.74) were too small to be
considered significant. Hence, it would seem possible to conclude that
while there was no statistically significant difference in mathematics
achievement in State E schools at the Year 8 level between 1964 and
1994, some signs of improvement had occurred, in marked contrast to the
other four states, and that the gain was estimated to be approximately
half a year of mathematics learning.

 

 

 

The comparisons of the mathematics achievement level of Year 8 students
between 1964 and 1994 in State A, State B, State C, State D and State E
revealed that only State E showed improvement in mathematics
achievement over the last 30 years. However, the improvement was not
found to be statistically significant. In State A the decline was very
slight (two centilogits), while in State B, State C and State D the
achievement of Year 8 students had declined over the past three
decades. A significant decline was recorded in both State B and State
D, although the decline in State C was not statistically significant.
The next comparison is between Australian Year 8 students on the two
occasions.

 

1.6.1.2.6. Australia

The estimated mean score of the 1964 Australian Year 8 students was
451, while it was 426 in 1994. The difference was 31 centilogits in
favour of the 1964 students (see Table 4, Figures 2 and 3). This
difference revealed that the mathematics achievement level of
Australian Year 8 students had declined over the 30-year period. The
standard deviation, standard error and the design effect were larger in
1994 than in 1964. The effect size was 0.29 and the t-value was 2.16.
The effect size and t-value indicated that the decline in mathematics
achievement between the 1964 and 1994 Year 8 Australian students was
marginally significant at the 0.05 level, and the size of the decline
was a little less than a year of mathematics learning.
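
The significance levels attached to the t-values reported in Sections
1.6.1.1 and 1.6.1.2 can be checked with a brief sketch. It assumes a
two-tailed test and a large-sample normal approximation, since the
degrees of freedom used by the authors are not stated in this section;
the function names are hypothetical.

```python
from statistics import NormalDist


def two_tailed_p(t):
    """Two-tailed p-value under a standard normal approximation (assumed)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


def significance_label(t):
    """Classify a t-value at the 0.01 and 0.05 levels used in the text."""
    p = two_tailed_p(t)
    if p < 0.01:
        return "significant at 0.01"
    if p < 0.05:
        return "significant at 0.05"
    return "not significant"


# t-values as reported in the text for each comparison.
reported_t = {
    "13-year-olds, State A": 1.39,
    "13-year-olds, State B": 0.70,
    "13-year-olds, State C": 0.14,
    "13-year-olds, State D": 3.12,
    "13-year-olds, Australia": 2.91,
    "Year 8, State B": 4.22,
    "Year 8, State C": 1.46,
    "Year 8, State D": 3.47,
    "Year 8, State E": 0.74,
    "Year 8, Australia": 2.16,
}

for comparison, t in reported_t.items():
    print(f"{comparison}: t = {t:.2f}, {significance_label(t)}")
```

Under this approximation the classification agrees with the text: the
State D and Australian 13-year-old declines and the State B and State D
Year 8 declines reach the 0.01 level, while the Australian Year 8
decline reaches only the 0.05 level.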

 

Table 5:- Comparisons of standard deviation values between FIMS and

TIMS

 

1.6.1.3. Comparison of Standard Deviation

Table 5 shows the standard deviation values for each state in TIMS and
FIMS. There would appear to be a large increase in the spread of scores
as measured on the scale of mathematics achievement between 1964 and
1994. This increase may be a consequence of less accurate measurement,
since in 1964 students answered 70 test items while in 1994 the
students answered between 33 and 41 items. However, it would seem more
probable that there was greater variability in students' mathematical
achievement in 1994 compared with 1964 at the Year 8 level as a
consequence of changed teaching and learning practices. This issue
warrants further investigation.
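
The possible contribution of test length to the spread of scores can be
gauged with a rough sketch. It assumes that one centilogit is 0.01
logit, that every item is perfectly targeted so that each contributes
the maximum Rasch information of 0.25 (which gives a lower bound on the
measurement error), and that observed variance is the sum of true
variance and error variance. The true standard deviation of 100
centilogits is a hypothetical value, and the function names are
illustrative.

```python
import math


def min_sem_centilogits(n_items):
    """Lower bound on the Rasch standard error of a person estimate.

    Test information is the sum of p * (1 - p) over items, which cannot
    exceed 0.25 per item; the result is converted to centilogits.
    """
    return 100 / math.sqrt(0.25 * n_items)


def observed_sd(true_sd, sem):
    """Observed spread when independent measurement error is added."""
    return math.sqrt(true_sd ** 2 + sem ** 2)


TRUE_SD = 100.0  # HYPOTHETICAL true standard deviation in centilogits

for n_items in (70, 41, 33):  # test lengths mentioned in the text
    sem = min_sem_centilogits(n_items)
    print(f"{n_items} items: SEM >= {sem:.1f}, "
          f"observed SD >= {observed_sd(TRUE_SD, sem):.1f}")
```

Under these assumptions, shortening the test from 70 to 33 items raises
the observed standard deviation by only a few centilogits, which would
be consistent with the view that less accurate measurement alone is
unlikely to explain a large increase in spread.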

 

1.7. Conclusion

In order to investigate the mathematics achievement level of Australian
lower secondary school students over time, three data sets, from the
FIMS, SIMS and TIMS studies, were analysed. From the three data sets
two groups of students were compared. The first comparison was between
13-year-old government school students in five states who participated
in FIMS and SIMS. The result of the comparison revealed that only State
C showed improvement in mathematics achievement over the 14-year
period. However, the improvement was not statistically significant.
Furthermore, no achievement difference was found in State E between
1964 and 1978. Mathematics achievement showed a decline in State A,
State B and State D, and among these three states a significant decline
was found only in State D. When the overall Australian students'
performance was compared between 1964 and 1978, the mathematics
achievement level of the 13-year-old students was found to have
declined over the 14-year period (see Figure 3).

 

 

Figure 3:- Comparison of Achievement in Mathematics between 1964, 1978

and 1994 in Australia

 

The second comparison concerned the mathematics achievement level of
Year 8 students between 1964 and 1994 in the government schools of the
five states. The findings indicated that only State E improved in
mathematics achievement over the 30 years, and the improvement was not
found to be statistically significant. In State A the decline was
negligible, whereas in State B, State C and State D the achievement of
Year 8 students declined over the three decades. Significant declines
were recorded in State B and State D, but the decline in State C was
not significant. When Australian Year 8 government school students who
participated in FIMS and TIMS were compared, the decline in mathematics
achievement level was found to be marginally significant over the
30-year period (see Figure 3), and of the order of a little less than a
year of mathematics learning in Australian lower secondary schools.

 

The findings in both comparisons revealed that the mathematics
achievement level of Australian students at the lower secondary school
level has declined over the last three decades (see Figure 3). The
findings also indicate that there is a need to investigate differences
in conditions of learning. Carroll's (1963) model of school learning
has guided IEA studies and could guide this investigation. Carroll
(1963) identified five factors that influence school learning. These
factors are divided into two levels, namely student and school level
factors. The student level factors in Carroll's model are aptitude
(home background), ability and perseverance (motivation, attitudes),
while the school level factors are time for learning (including
homework time for mathematics) and quality of instruction. The
investigation demands the use of both:
(1) multivariate analysis, and
(2) multilevel analysis.
Thus, it is necessary to conduct further research to identify the
reasons for the decline and to recommend solutions.

 

Acknowledgment

The first author was sponsored by The Flinders University of South

Australia Overseas Postgraduate Research Scholarship and the Flinders

University Research Scholarship.

 

References

Adams, R. J. & Khoo, S. T. (1993). Quest: The interactive test analysis

system. Hawthorn, Victoria: ACER.

Adams, R. J. & Gonzalez, E. J. (1996). The TIMSS test design. In M O

Martin & D L Kelly (eds.), Third International Mathematics and Science

Study Technical Report vol. 1, Boston: IEA, pp. 3-1 - 3-26.

Anderson, L. W. (1994). Attitude Measures. In T. Husén (eds.), The

International Encyclopedia of Education, vol. 1, (second ed.),

Pergamon, pp. 380-390.

 

Beard, J. G. & Pettie, A. L. (1979). A comparison of Linear and Rasch

Equating results for basic skills assessment Tests. Florida: Florida

State University: ERIC.

Brick, J. M., Broene, P., James, P. & Severynse, J. (1997). A user's

guide to WesVarPC (Version 2.11). Rockville, MD: Westat, Inc.

Byrne, B. M. (1989). A primer of LISREL: Basic applications and

programming for confirmatory factor analytic models. New York:

Springer-Verlag.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112 (1),

155-159.

Garden, R. A. & Orpwood, G. (1996). Development of the TIMSS

achievement tests. In M O Martin & D L Kelly (eds.), Third

International Mathematics and Science Study Technical Report Volume 1:

Design and Development, Boston: IEA, pp. 2-1 to 2-19.

Hambleton, R. K. & Cook, L. L. (1977). Latent trait models and their use

in the analysis of educational test data. Journal of educational

measurement, 14 (2), 75-96.

Hambleton, R. K., Zaal, J. N. & Pieters, J. P. M. (1991). Computerized

adaptive testing: theory, applications, and standards. In R K Hambleton

& J N Zaal (eds.), Advances in Educational and Psychological Testing,

Boston, Mass.: Kluwer Academic Publishers, pp. 341-366.

Keeves, J. P. (1992). The design and conduct of the second science

study. In J P Keeves (eds.), The IEA Study of Science III: Changes in

Science Education and Achievement: 1970 to 1984, Oxford: Pergamon, pp.

42-67.

Kim, J. & Mueller, C. W. (1978a). Introduction to Factor Analysis: What

It Is and How to Do It. London: Sage Publications.

Kim, J. & Mueller, C. W. (1978b). Factor Analysis: Statistical Methods

and Practical Issues. London: Sage Publications.

Kolen, M. J. & Whitney, D. R. (1981). Comparison of four procedures for

equating the tests of general educational development. Paper presented

at the annual meeting of the American Educational Research

Association. Los Angeles, California.

Lokan, J., Ford, P. & Greenwood, L. (1996). Maths & Science On the

Line: Australian Junior Secondary Students' Performance in the Third

International Mathematics and Science Study. Camberwell: ACER.

Long, J. S. (1983). Confirmatory Factor Analysis: A preface to LISREL.

Beverly Hills: Sage Publications.

Moss, J. D. (1982). Towards Equality: Progress by Girls in Mathematics

in Australian Secondary Schools. Hawthorn, Victoria: ACER.

Norusis, M. J. & SPSS Inc (1993). SPSS for windows: Base system user's

guide: Release 6.0. Chicago: SPSS Inc.

Peaker, G. F. (1969). How should national part scores be weighted?

International Review of Education, 15, 229-237.

Rentz, R. R. & Bashaw, W. L. (1975). Equating Reading tests with the

Rasch model, Vol. I Final Report. Athens, Georgia: University of

Georgia: Educational Research Laboratory, College of Education.

Rosier, M. J. (1980). Changes in Secondary School Mathematics in

Australia. Hawthorn, Victoria: ACER.

Smith, R. M. & Kramer, G. A. (1992). A comparison of two methods of test

Equating in the Rasch model. Educational and Psychological Measurement,

52 (4), 835-846.

Sontag, L. M. (1984). Vertical equating methods: A comparative study of

their efficacy. DAI, 45-03B, page 1000.

Spearritt, D. (1994). Factor Analysis. In T Husén & T.N Postlethwaite

(eds.), The International Encyclopedia of Education, (second ed.),

vol.4. Oxford: Pergamon, pp. 2230-2241.

Wright, B. D. (1995). 3PL or Rasch? Rasch Measurement Transactions, 9

(1), 408-409.

Wright, B. D., and Stone, M. H. (1979). Best Test Design: Rasch

Measurement. Chicago: Mesa Press.

Willett, J. B. (1997). Change, Measurement of. In J P Keeves (ed.),

Educational Research, Methodology, and Measurement: An International

Handbook, (second ed.), Oxford: Pergamon, pp. 327-334.

 

Appendix A:- Errors of Estimations and Scaling

In the present study, sources of errors of estimation and scaling that

are related to the calculation of gains and losses in mathematics

achievement are associated with the sampling design, the fitting of

individual items to a scale based on the Rasch model (calibration) and

the use of the equating constant based on the FIMS and TIMS data sets.

 

(1). The error associated with the sampling design for each data set was

generated using the WesVarPC computer program (Brick et al., 1997).

(2). The error associated with the use of the mean value of the equating

constant arises from the equating using the nine common items in the

FIMS and TIMS mathematics tests in the five state samples.

 

The error associated with the equating constant was estimated to be

0.104. The items employed for anchoring in the common item difference

equating procedure are not a random sample of items but a fixed sample

of specifically chosen items. Under these circumstances the error of

the grand mean is given by _embed Equation ___ where n is the number of

items in the sample for each state.

Standard error of equating constant = _embed Equation ___

(3). For individual students the QUEST computer program (Adams and Khoo,

1993) provided a value for the magnitude of the measurement error

associated with the estimation of student performance. This estimate

for TIMS was about 35 scale units. In order to calculate the error

arising from calibration the following formula was used:

 

 

The QUEST computer program (Adams and Khoo, 1993) by default does not
process cases with perfect and zero scores, because neither group
provides information for the calibration of the scale. Hence, in order
to include those cases in the calculation of the mean and standard
deviation of the mathematics achievement test scores for the students
sampled, the values of the perfect and zero scores were calculated by
extrapolation from the logit table produced by the QUEST computer
program. Table A shows the procedures employed to estimate the scores
of cases with a perfect and zero score. The calculation of the scores
of the FIMS students who had perfect and zero scores is used here as an
example. Table A1 shows the procedures employed to estimate the scores
of cases with a perfect score. The first column indicates the top three
raw scores (69, 68 and 67) excluding the highest possible raw score
(70). The second column indicates the logit values obtained from the
logit table generated by the QUEST computer program (Adams and Khoo,
1993), that is, the Rasch scores corresponding to the top three
possible raw scores in the test excluding the maximum score. Column D1
gives the successive differences between the top three logit values. It
was assumed that the perfect score was likely to exceed the highest
logit value by an amount equal to the sum of the difference between the
top two scores and the difference between the consecutive differences
of the top three scores. Therefore, the following calculation was
employed to estimate the perfect score. To obtain the first entry
(0.73) in column D1, the second highest logit value (4.17) was
subtracted from the highest logit value (4.90). The same procedure was
applied to obtain the second entry (0.44), in which the third highest
value (3.73) was subtracted from the second highest value (4.17). The
difference between the two entries in column D1, that is the difference
between 0.73 and 0.44, namely 0.29, was entered in column D2. The
estimated Rasch score for the maximum raw score of 70 is given in the
column Perfect Score in Table A1. The score (5.92) was estimated by
adding the highest logit value (4.90) for a score of 69, the first
entry in column D1 (0.73) and the entry in column D2 (0.29).

 

For the estimation of the zero score it was assumed that the zero score
would most likely be less than the logit value for a score of one by an
amount equal to the sum of the difference between the bottom two scores
and the difference between the consecutive differences of the bottom
three scores. Hence, the same procedure was employed for the estimation
of the zero score. Table A2 shows the estimation of the zero scores.
The estimation procedure employed for zero scores was similar to the
one applied for the perfect score estimation, except that the
subtractions were made from the bottom. Thus, to obtain the first entry
(-0.73) in column D1, the second lowest value (-3.97) was subtracted
from the lowest value (-4.70). In order to obtain the second entry
(-0.44) in column D1, the third lowest value (-3.53) was subtracted
from the second lowest value (-3.97). Moreover, in order to obtain the
entry in column D2 (-0.29), the second entry in column D1 (-0.44) was
subtracted from the first entry in column D1 (-0.73). The estimated
Rasch score for the minimum raw score of zero is given in the column
Zero Score in Table A2. The score (-5.72) was estimated by adding the
lowest logit value (-4.70) for a score of one, the first entry in
column D1 (-0.73) and the entry in column D2 (-0.29).
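
The extrapolation described above for perfect and zero scores can be
sketched as follows. The function name is hypothetical and the sketch
simply restates the arithmetic of columns D1 and D2 for the FIMS
example; it is not part of the QUEST program.

```python
def extrapolate_extreme_score(logits):
    """Extrapolate the logit for an extreme (perfect or zero) raw score.

    `logits` holds the three logit values nearest the extreme, ordered
    from the most extreme raw score inwards (e.g. raw scores 69, 68, 67
    for a perfect score, or 1, 2, 3 for a zero score).
    """
    first, second, third = logits
    d1_first = first - second     # first entry in column D1
    d1_second = second - third    # second entry in column D1
    d2 = d1_first - d1_second     # entry in column D2
    return first + d1_first + d2


# FIMS logit values quoted in the text.
perfect_score = extrapolate_extreme_score([4.90, 4.17, 3.73])
zero_score = extrapolate_extreme_score([-4.70, -3.97, -3.53])
print(round(perfect_score, 2), round(zero_score, 2))
```

Applied to the FIMS values this gives 5.92 for the perfect score and
-5.72 for the zero score, the same estimates as those described for
Tables A1 and A2.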

 

Appendix C:- Infit mean square values (INFIT MNSQ) for Mathematics test

items all students in FIMS and SIMS using Anchor Item Equating

Procedure

 

=========================================|

       |    F I M S    |    S I M S     |

=======|===============|================|

Item   |Before  |After   |Before  |After   |

No     |Deletion|Deletion|Deletion|Deletion|

=======|=======|=======|========|=======|

item 1 | 0.97| 0.97| 0.96 | 0.96|

item 2 | 0.81| 0.81| 0.83 | 0.83|

item 3 | 0.93| 0.93| 1.06 | 1.06|

item 4 | 0.88| 0.88| 0.86 | 0.86|

item 5 | 0.94| 0.94| 1.03 | 1.03|

item 6 | 0.93| 0.93| 0.95 | 0.96|

item 7 | 1.02| 1.03| 0.86 | 0.85|

item 8 | 0.93| 0.93| 0.97 | 0.98|

item 9 | 0.83| 0.83| 0.84 | 0.84|

item 10| 1.03| 1.03| 0.90 | 0.90|

item 11| 0.85| 0.85| 0.80 | 0.80|

item 12| 1.09| 1.09| 1.02 | 1.03|c

item 13| 0.74|Deleted| 0.95 | 0.94|

item 14| 1.19| 1.19| 1.11 | 1.12|

item 15| 1.11| 1.10| 1.04 | 1.05|

item 16| 1.08| 1.08| 1.05 | 1.06|d

item 17| 0.94| 0.94| 0.96 | 0.96|

item 18| 0.97| 0.97| 1.03 | 1.04|

item 19| 0.89| 0.90| 0.80 | 0.79|

item 20| 0.90| 0.90| 0.90 | 0.90|

item 21| 1.24| 1.23| 1.39 |Deleted|

item 22| 0.82| 0.82| 0.94 | 0.94|

item 23| 0.94| 0.94| 0.92 | 0.92|

item 24| 0.87| 0.87| 0.85 | 0.85|

item 25| 0.86| 0.86| 0.87 | 0.87|

item 26| 0.94| 0.94| 0.94 | 0.95|c

item 27| 0.97| 0.97| 0.85 | 0.86|

item 28| 1.00| 1.00| 0.96 | 0.96|

item 29| 1.34|Deleted| 0.74 |Deleted|

item 30| 0.83| 0.82| 1.13 | 1.13|

item 31| 0.82| 0.82| 1.09 | 1.09|c

item 32| 0.90| 0.90| 0.89 | 0.90|c

item 33| 0.82| 0.82| 0.87 | 0.87|c

item 34| 0.96| 0.95| 0.89 | 0.89|

item 35| 1.15| 1.15| 1.05 | 1.06|

item 36| 1.01| 1.01| 1.04 | 1.05|

item 37| 1.04| 1.04| 1.00 | 1.00|

item 38| 0.92| 0.92| 0.95 | 0.95|c

item 39| 0.94| 0.94| 0.90 | 0.90|

item 40| 0.92| 0.92| 0.96 | 0.96|

item 41| 1.09| 1.09| 1.11 | 1.12|d

item 42| 0.77| 0.77| 1.09 | 1.08|

item 43| 1.22| 1.21| 1.05 | 1.06|d

item 44| 1.08| 1.07| 1.13 | 1.14|

item 45| 1.18| 1.18| 0.98 | 0.99|d

item 46| 0.97| 0.97| 0.99 | 0.99|

item 47| 0.80| 0.80| 0.92 | 0.93|

item 48| 0.85| 0.85| 0.87 | 0.87|

item 49| 1.10| 1.10| 1.09 | 1.09|

item 50| 0.87| 0.87| 0.90 | 0.90|

item 51| 0.81| 0.81| 0.80 | 0.80|

item 52| 0.79| 0.79| 0.95 | 0.96|

item 53| 0.87| 0.86| 0.87 | 0.87|

item 54| 0.98| 0.98| 0.92 | 0.92|c

item 55| 0.91| 0.91| 0.92 | 0.93|

item 56| 0.89| 0.89| 0.93 | 0.93|

item 57| 0.87| 0.87| 0.99 | 0.98|

item 58| 0.91| 0.91| 0.92 | 0.93|

item 59| 1.18| 1.18| 0.98 | 0.99|d

item 60| 1.12| 1.12| 1.11 | 1.12|

item 61| 1.13| 1.13| 1.25 | 1.26|

item 62| 0.99| 0.99| 1.03 | 1.04|

item 63| 0.94| 0.94| 1.08 | 1.08|

item 64| 0.94| 0.93| 1.11 | 1.11|

item 65| 1.09| 1.08| 1.15 | 1.16|

item 66| 0.99| 0.98| 1.20 | 1.21|

item 67| 1.01| 1.01| 0.98 | 0.99|c

item 68| 0.92| 0.92| 0.95 | 0.96|

item 69| 0.97| 0.97| 0.98 | 0.98|

item 70| 1.08| 1.08| 1.18 | 1.19|

item 71| © | | 1.05 | 1.06|

item 72| | | 0.80 | 0.80|

===============|=======|======= ========|

Mean | 0.97 | 0.96| 0.98 | 0.98|

SD | 0.12 | 0.11| 0.12 | 0.11|

------------------------------------------

N | 4320 | 4320 | 5120 | 5120 |

==========================================

SD = standard deviation

d= Different items were administered for each occasion, therefore, the

items were not anchor items

© = 70 items were administered for FIMS while 72 for SIMS

c= Common items for FIMS, SIMS and TIMS

 

Appendix D: Infit mean square values for Mathematics test items Year 8

students in TIMS using Concurrent Equating procedure

 

===============================================================================================
Items     | Before   | After    | Items     | Before   | After    | Items     | Before   | After
          | Deletion | Deletion |           | Deletion | Deletion |           | Deletion | Deletion
===============================================================================================
item 1    | 0.81     | 0.81     | item 54   | 0.99     | 0.99     | item 107  | 1.07     | 1.07
item 2    | 1.02     | 1.01     | item 55   | 1.02     | 1.01     | item 108  | 0.82     | 0.82
item 3    | 0.92     | 0.92     | item 56   | 1.15     | 1.14     | item 109  | 0.91     | 0.91
item 4    | 1.00     | 1.00     | item 57   | 0.92     | 0.92     | item 110  | 0.95     | 0.95
item 5    | 1.22     | 1.22     | item 58   | 0.86     | 0.86     | item 111  | 1.01     | 1.01
item 6c   | 1.08     | 1.08     | item 59   | 1.12     | 1.12     | item 112  | 0.86     | 0.86
item 7    | 1.00     | 0.99     | item 60   | 0.98     | 0.98     | item 113  | 0.94     | 0.94
item 8    | 1.28     | 1.28     | item 61   | 0.98     | 0.98     | item 114  | 1.20     | 1.20
item 9    | 1.07     | 1.07     | item 62c  | 1.01     | 1.01     | item 115  | 1.05     | 1.05
item 10   | 0.82     | 0.81     | item 63   | 1.07     | 1.07     | item 116  | 1.16     | 1.16
item 11   | 0.96     | 0.96     | item 64   | 1.07     | 1.07     | item 117  | 0.95     | 0.95
item 12   | 0.87     | 0.87     | item 65   | 1.03     | 1.03     | item 118  | 1.01     | 1.01
item 13   | 0.99     | 0.99     | item 66   | 1.15     | 1.15     | item 119  | 0.91     | 0.91
item 14   | 0.91     | 0.91     | item 67   | 1.08     | 1.07     | item 120  | 0.99     | 0.99
item 15   | 0.89     | 0.89     | item 68   | 0.93     | 0.93     | item 121  | 0.86     | 0.86
item 16   | 1.28     | 1.28     | item 69   | 1.05     | 1.05     | item 122  | 1.27     | 1.27
item 17   | 0.97     | 0.97     | item 70c  | 1.17     | 1.15     | item 123  | 1.08     | 1.08
item 18   | 1.00     | 1.00     | item 71   | 0.80     | 0.79     | item 124  | 1.12     | 1.12
item 19   | 0.99     | 0.99     | item 72   | 1.16     | 1.15     | item 125  | 1.11     | 1.11
item 20   | 0.97     | 0.97     | item 73   | 0.95     | 0.95     | item 126  | 0.96     | 0.96
item 21   | 0.83     | 0.83     | item 74   | 1.07     | 1.06     | item 127  | 0.94     | 0.94
item 22   | 1.22     | 1.21     | item 75   | 0.83     | 0.83     | item 128  | 0.98     | 0.98
item 23   | 0.94     | 0.94     | item 76   | 1.06     | 1.06     | item 129c | 0.95     | 0.95
item 24   | 1.03     | 1.03     | item 77   | 0.92     | 0.92     | item 130  | 0.80     | 0.80
item 25   | 1.05     | 1.05     | item 78   | 0.97     | 0.97     | item 131  | 0.99     | 0.99
item 26   | 1.03     | 1.03     | item 79   | 1.19     | 1.19     | item 132  | 0.97     | 0.97
item 27   | 1.00     | 1.00     | item 80   | 0.87     | 0.87     | item 133  | 0.95     | 0.95
item 28   | 0.89     | 0.89     | item 81   | 0.94     | 0.94     | item 134  | 1.10     | 1.10
item 29   | 0.88     | 0.88     | item 82   | 1.13     | 1.13     | item 135  | 1.06     | 1.06
item 30   | 1.00     | 1.00     | item 83   | 1.05     | 1.05     | item 136c | 0.96     | 0.96
item 31c  | 1.12     | 1.12     | item 84   | 0.88     | 0.88     | item 137  | 0.95     | 0.95
item 32   | 1.11     | 1.11     | item 85   | 0.91     | 0.91     | item 138  | 1.04     | 1.04
item 33   | 0.96     | 0.96     | item 86   | 0.93     | 0.93     | item 139  | 0.84     | 0.84
item 34   | 0.96     | 0.96     | item 87   | 1.10     | 1.10     | item 140  | 0.89     | 0.89
item 35   | 0.85     | 0.85     | item 88   | 1.03     | 1.03     | item 141  | 0.79     | 0.79
item 36   | 1.05     | 1.05     | item 89   | 0.93     | 0.93     | item 142  | 1.03     | 1.03
item 37   | 1.13     | 1.12     | item 90   | 1.08     | 1.08     | item 143  | 0.93     | 0.92
item 38   | 0.99     | 0.99     | item 91   | 0.91     | 0.91     | item 144  | 0.83     | 0.83
item 39c  | 1.05     | 1.05     | item 92c  | 1.03     | 1.03     | item 145  | 0.81     | 0.80
item 40   | 0.95     | 0.95     | item 93   | 1.00     | 1.00     | item 146  | 0.89     | 0.88
item 41   | 0.97     | 0.96     | item 94   | Excluded Item       | item 147  | 0.96     | 1.04
item 42c  | 0.99     | 0.99     | item 95   | 0.92     | 0.92     | item 148  | 0.75     | Deleted
item 43   | 1.07     | 1.07     | item 96   | 1.08     | 1.08     | item 149  | 0.85     | 0.85
item 44   | 0.89     | 0.89     | item 97   | 0.92     | 0.92     | item 150  | 1.02     | 1.02
item 45   | 0.95     | 0.95     | item 98   | 1.01     | 1.01     | item 151  | 1.01     | 1.01
item 46   | 0.89     | 0.89     | item 99   | 1.27     | 1.27     | item 152  | 1.00     | 0.99

item 47 | 1.00 | 1.00 | item 100 | 1.12 | 1.12 | item 153

| 1.09 | 1.08

item 48 | 0.95 | 0.95 | item 101 | 1.14 | 1.14 | item 154

| 0.91 | 0.90

item 49 | 1.23 | 1.22 | item 102 | 0.91 | 0.91 | item 155

| 0.83 | 0.83

item 50 | 0.99 | 0.98 | item 103 | 0.78 | 0.78 | item 156

| 1.06 | 1.06

item 51 | 1.29 | 1.28 | item 104 | 0.96 | 0.96 | item 157

| 1.02 | 1.02

item 52 | 1.00 | 0.99 | item 105 | 0.95 | 0.95 | item 158

| 0.92 | 0.92

item 53 | 0.95 | 0.94 | item 106 | 1.02 | 1.02 |

| |
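
For readers unfamiliar with the statistic tabulated in this appendix, the following is a minimal sketch of the textbook definition of the infit (information-weighted) mean square for one dichotomous Rasch item: the sum of squared residuals divided by the sum of the model variances. It is not the QUEST implementation, whose internal computations are not described here. The function names, and the assumption that student ability and item difficulty estimates (in logits) are already available, are illustrative only.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def infit_mean_square(responses, abilities, difficulty):
    """Information-weighted (infit) mean square for a single item.

    responses  -- 0/1 scores, one per student (None = item not administered)
    abilities  -- ability estimates in logits, in the same order (assumed given)
    difficulty -- difficulty estimate of the item in logits (assumed given)
    """
    squared_residuals = 0.0
    information = 0.0
    for score, ability in zip(responses, abilities):
        if score is None:
            continue  # skip students who did not see this item
        p = rasch_probability(ability, difficulty)
        squared_residuals += (score - p) ** 2  # observed minus expected, squared
        information += p * (1.0 - p)           # model variance of the response
    return squared_residuals / information


# Hypothetical data: able students succeed, less able students fail.
orderly = infit_mean_square([0, 0, 1, 1], [-2.0, -1.0, 1.0, 2.0], 0.0)
# The same students with the response pattern reversed.
erratic = infit_mean_square([1, 1, 0, 0], [-2.0, -1.0, 1.0, 2.0], 0.0)
print(round(orderly, 2), round(erratic, 2))
```

Values near 1.00 indicate responses about as variable as the model expects; values well below 1.00 indicate responses more predictable than expected, and values well above 1.00 indicate erratic responses.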