AERO responds to James Ladwig’s critique

AERO’s response is below, with additional comments from Associate Professor Ladwig. For more information about the statistical issues discussed, a more detailed Technical Note is available at AERO.

On Monday, EduResearch Matters published a post by Associate Professor James Ladwig which critiqued the Australian Education Research Organisation’s report, Writing development: what does a decade of NAPLAN data reveal?

AERO: The article makes three key criticisms of the analysis presented in the AERO report, all of which are inaccurate.

Ladwig claims that the report lacks consideration of sampling error and measurement error in its analysis of the trends of the writing scores. In fact, those errors were accounted for in the complex statistical method applied. AERO’s analysis used both simple and complex statistical methods to examine the trends. While the simple method did not consider error, the more complex statistical method (referred to as the ‘Differential Item Analysis’) explicitly considered a range of errors (including measurement error, and cohort and prompt effects).

Associate Professor Ladwig: AERO did not include any of that in its report nor in any of the technical papers. There is no over-time DIF analysis of the full score, and I wouldn’t expect one. All of the DIF analyses rely on data that itself carries error (more below). There is no way for the educated reader to verify these claims without expanded and detailed reporting of the technical work underpinning this report. This lacks transparency, falls short of the standards we should expect from AERO and makes it impossible for AERO to be held accountable for its specific interpretation of its own results.

AERO: Criticism of the perceived lack of consideration of ‘ceiling effects’ in AERO’s analysis of the trends of high-performing students’ results omits the fact that AERO’s analysis focused on the criteria scores (not the scaled measurement scores). AERO used the proportion of students achieving the top 2 scores (not the top score), for each criterion, as the metric to examine the trends. Given only a small proportion of students achieved a top score for any criterion (as shown in the report statistics), there is no ‘ceiling effect’ that could have biased the interpretation of the trends.

Associate Professor Ladwig made his ‘ceiling effect’ comments while explaining how the NAPLAN writing scores are designed, not in relation to the AERO analysis.

AERO: The third major inaccuracy relates to the comments made about the ‘measurement error’ around the NAPLAN bands and the use of adaptive testing to reduce error. These are irrelevant to AERO’s analysis because the main analysis did not use scaled scores, it did not use bands, and adaptive testing is not applicable to the writing assessment.

Associate Professor Ladwig’s comment about scaling was made while explaining how the scores are developed, not about the AERO analysis.

In relation to AERO’s use of NAPLAN criterion score data in the writing analysis, however, please note that those scores are created either through scorer moderation processes or (increasingly, where possible) text-interpretive algorithms. Here again, any treatment of the reliability of these raw scores is absent, apart from one declared limitation, noted in AERO’s own terms:

Another key assumption underlying most of the interpretation of results in this report is that marker effects (that is, marking inconsistency across years) are small and therefore they do not impact on the comparability of raw scores over time. (p. 66)

This is where AERO has taken another shortcut, with an assumption that should not be made. ACARA has reported the reliability estimates needed to account for this in the analysis of scores. It is readily possible to report those estimates and use them in trend analyses.
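As a rough illustration of how a reported reliability estimate translates into an error margin, the sketch below applies the standard classical test theory formula, in which the standard error of measurement equals the score standard deviation multiplied by the square root of one minus the reliability. The function names and all the figures are hypothetical and chosen only for illustration; they are not ACARA’s or AERO’s values, and this is not a reconstruction of either agency’s method.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_interval(score, sd, reliability, z=1.96):
    """Approximate 95% band around an observed score (z = 1.96 by default)."""
    margin = z * standard_error_of_measurement(sd, reliability)
    return (score - margin, score + margin)


# Hypothetical figures only: a criterion score of 4, SD of 2, reliability of 0.75.
sem = standard_error_of_measurement(sd=2.0, reliability=0.75)
low, high = score_interval(score=4.0, sd=2.0, reliability=0.75)
print(f"SEM = {sem:.2f}, approximate 95% band = ({low:.2f}, {high:.2f})")
```

Under these invented figures, an observed score of 4 is compatible with a true score anywhere from roughly 2 to 6, which is the kind of margin a reader would need in order to judge whether a reported shift is larger than the error around it.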

AERO: A final point: the mixed-methods design of the research was not recognised in the article. AERO’s analysis examined the skills students were able to achieve at the criterion level against curriculum documents. Given the assessment is underpinned by a theory of language, we were able to complement the quantitative analysis with a qualitative analysis that specifically highlighted the features of language students were able to achieve. This was validated by analysis of student writing scripts.

Associate Professor Ladwig says this is irrelevant to his analysis. The logic of this is also a concern. Using multiple methods and methodologies does not correct for any that are technically lacking. In relation to the overall point of concern, we have a clear example of an agency reporting statistical results in a manner that eludes external scrutiny, accompanied by an extreme media positioning. Any of the qualitative insights into the minutiae these numbers represent will probably be very useful for teachers of writing, but whether or not they are generalisable, big, or shifting depends on those statistical analyses themselves.

