Year: 2019
Author: Low-Choy, Samantha; Singh, Parlo
Type of paper: Abstract refereed
Abstract:
As part of the global education reform movement, many countries have adopted standardised tests as a common approach to evaluating students’ outcomes. Two examples of these instruments are the Australian National Assessment Program – Literacy and Numeracy (NAPLAN) and the American National Assessment of Educational Progress (NAEP). They compare each student’s ability to answer correctly against the average ability of all students who sat the same test. Standardised tests are also used to compare educational attainment across countries, for instance via international tests such as the Programme for International Student Assessment (PISA). All of these tests rely on Item Response Theory (IRT), and the reported results are normalised to a particular mean and standard deviation. There are many possible IRT models. For instance, NAPLAN scores are calculated by fitting the simplest one, the Rasch model, which focuses on the relative difficulty of each item (question). PISA scores are calculated mainly by fitting a more complex model, which also considers how well each item discriminates between students of low and high ability. For NAEP an even more complex model is used. The choice of IRT model affects how students’ abilities are calculated: given the same test responses, different scores can be obtained depending on the chosen model. Hence, it is important to check whether the chosen model fits the data well. Following the trend in other disciplines, such as psychology and the sciences, and highlighting the danger of relying solely on summary statistics (as illustrated by the “Datasaurus”; Matejka & Fitzmaurice, 2017), this paper proposes a similar approach to the “Datasaurus” to test the reproducibility of reported educational data, using NAPLAN as an example.
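For context, the item response functions usually associated with the Rasch, two-parameter logistic (2PL) and three-parameter logistic (3PL) models can be sketched in standard IRT notation (theta_i for the ability of student i, b_j for the difficulty of item j, a_j for its discrimination, c_j for a pseudo-guessing parameter; this notation is not taken from the abstract itself):

% Sketch of the standard IRT item response functions, in conventional notation.
% P(X_{ij}=1) is the probability that student i answers item j correctly.
\begin{align}
  \text{Rasch:} \quad & P(X_{ij}=1 \mid \theta_i) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \\
  \text{2PL:}   \quad & P(X_{ij}=1 \mid \theta_i) = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}} \\
  \text{3PL:}   \quad & P(X_{ij}=1 \mid \theta_i) = c_j + (1 - c_j)\,\frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}
\end{align}

The 2PL form adds the discrimination parameter a_j that the abstract describes for PISA-style models, and the 3PL form adds a further parameter c_j, illustrating the kind of “even more complex model” the abstract attributes to NAEP.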
References:
Lash, T. L. (2017). The harm done to reproducibility by the culture of null hypothesis significance testing. American Journal of Epidemiology, 186(6), 627-635.
Matejka, J., & Fitzmaurice, G. (2017). Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 1290-1294). ACM.