James Ladwig

AERO responds to James Ladwig’s critique

On Monday, EduResearch Matters published a post by Associate Professor James Ladwig which critiqued the Australian Education Research Office’s Writing development: what does a decade of NAPLAN data reveal? 

AERO’s response, provided on 9 November, is below, with additional comments from Associate Professor Ladwig. For more information about the statistical issues discussed, a more detailed Technical Note is available at AERO.

AERO: This article makes three key criticisms of the analysis presented in the AERO report, all of which are inaccurate.

Ladwig claims that the report lacks consideration of sampling error and measurement error in its analysis of the trends of the writing scores. In fact, those errors were accounted for in the complex statistical method applied. AERO’s analysis used both simple and complex statistical methods to examine the trends. While the simple method did not consider error, the more complex statistical method (referred to as the ‘Differential Item Analysis’) explicitly considered a range of errors (including measurement error, and cohort and prompt effects).

Associate Professor Ladwig: AERO did not include any of that in its report, nor in any of the technical papers. There is no over-time DIF analysis of the full score – and I wouldn’t expect one. All of the DIF analyses rely on data that itself carries error (more below). There is no way for the educated reader to verify these claims without expanded and detailed reporting of the technical work underpinning this report. This is lacking in transparency, falls short of the standards we should expect from AERO and makes it impossible for AERO to be held accountable for its specific interpretation of its own results.

AERO: The criticism of a perceived lack of consideration of ‘ceiling effects’ in AERO’s analysis of the trends of high-performing students’ results omits the fact that AERO’s analysis focused on the criterion scores (not the scaled measurement scores). AERO used the proportion of students achieving the top 2 scores (not the top score) for each criterion as the metric to examine the trends. Given only a small proportion of students achieved a top score for any criterion (as shown in the report statistics), there is no ‘ceiling effect’ that could have biased the interpretation of the trends.

Associate Professor Ladwig made his ‘ceiling effect’ comments while explaining how the NAPLAN writing scores are designed, not in relation to the AERO analysis.

AERO: The third major inaccuracy relates to the comments made about the ‘measurement error’ around the NAPLAN bands and the use of adaptive testing to reduce error. These are irrelevant to AERO’s analysis because the main analysis did not use scaled scores, it did not use bands, and adaptive testing is not applicable to the writing assessment.

Associate Professor Ladwig’s comment was about the scaling, in the context of explaining how the scores are developed, not about the AERO analysis.

In relation to AERO’s use of NAPLAN criterion score data in the writing analysis, however, please note that those scores are created either through scorer moderation processes or (increasingly, where possible) text-interpretative algorithms. Here again, the reliability of these raw scores was not addressed, apart from one declared limitation, noted in AERO’s own terms:

Another key assumption underlying most of the interpretation of results in this report is that marker effects (that is, marking inconsistency across years) are small and therefore they do not impact on the comparability of raw scores over time. (p. 66)

This is where AERO has taken another short cut, with an assumption that should not be made. ACARA has reported the reliability estimates needed to include that error in the score analysis. It is readily possible to report those estimates and use them in trend analyses.

AERO: A final point: the mixed-methods design of the research was not recognised in the article. AERO’s analysis examined the skills students were able to achieve at the criterion level against curriculum documents. Given the assessment is underpinned by a theory of language, we were able to complement the quantitative analysis with a qualitative analysis that specifically highlighted the features of language students were able to achieve. This was validated by analysis of student writing scripts.

Associate Professor Ladwig says this is irrelevant to his analysis. The logic of this is also a concern. Using multiple methods and methodologies does not correct for any that are technically lacking. In relation to the overall point of concern, we have a clear example of an agency reporting statistical results in a manner that eludes external scrutiny, accompanied by extreme media positioning. Any qualitative insights into the minutiae these numbers represent will probably be very useful for teachers of writing – but whether or not those patterns are generalisable, big, or shifting depends on the statistical analyses themselves.

AERO’s writing report is causing panic. It’s wrong. Here’s why.

If ever there was a time to question public investment in developing reports using ‘data’ generated by the National Assessment Program, it is now, with the release of the Australian Educational Research Organisation’s report ‘Writing development: What does a decade of NAPLAN data reveal?’

I am sure the report was meant to provide reliable diagnostic analysis for improving the function of schools. 

It doesn’t. Here’s why.

There are deeply concerning technical questions about both the testing regime which generated the data used in the current report, and the functioning of the newly created (and arguably redundant) office which produced this report.

There are two lines of technical concern which need to be noted. These concerns reveal reasons why this report should be disregarded – and why the media response is a beat-up.

The first technical concern for all reports of NAPLAN data (and any large scale survey or testing data) is how to represent the inherent fuzziness of estimates generated by this testing apparatus.  

Politicians and almost anyone outside of the very narrow fields reliant on educational measurement would like to talk about these numbers as if they are definitive and certain.

They are not. They are just estimates – and all of the summary statistics in these reports are just estimates.

The fact these are estimates is not apparent in the current report.  There is NO presentation of any of the estimates of error in the data used in this report. 

Sampling error is important and, as ACARA itself has noted (see, for example, the 2018 NAPLAN technical report), must be taken into account when comparing the different samples used for analyses of NAPLAN. This form of error is the estimate used to generate confidence intervals and calculations of ‘statistical difference’.

Readers who recall seeing survey results or polling estimates being represented with a ‘plus or minus’ range will recognise sampling error. 

Sampling error is a measure of the probability of getting a similar result if the same analyses were done again, with a new sample of the same size, with the same instruments, etc.  (I probably should point out that the very common way of expressing statistical confidence often gets this wrong – when we say we have X level of statistical confidence, that isn’t a percentage of how confident you can be with that number, but rather the likelihood of getting a similar result if you did it again.)  

In this case, we know about 10% of the population do not sit the NAPLAN writing exam, so we already know there is sampling error.  

This is also the case when trying to infer something about an entire school from the results of a couple of year levels. The problem here is that we know the sampling error introduced by test absences is not random, and accounting for it can very much change trend analyses, especially for sub-populations. So, what does this persuasive writing report say about sampling error?

Nothing. Nada. Zilch. Zero. 

Anyone who knows basic statistics knows that when you have very large samples, the amount of error is far less than with smaller samples.  In fact, with samples as large as we get in NAPLAN reports, it would take only a very small difference to create enough ripples in the data to show up as being statistically significant.  That doesn’t mean, however, the error introduced is zero – and THAT error must be reported when representing mean differences between different groups (or different measures of the same group).
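To make that concrete, here is a minimal sketch (mine, not AERO’s or ACARA’s analysis) of how the standard error of a difference between two means shrinks as samples grow. The student-level standard deviation of 70 scale points and the cohort sizes are assumptions chosen purely for illustration:

```python
import math

def se_of_difference(sd: float, n1: int, n2: int) -> float:
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd**2 / n1 + sd**2 / n2)

sd = 70.0  # assumed student-level standard deviation on a NAPLAN-style scale
for n in (500, 10_000, 150_000):  # hypothetical cohort sizes
    se = se_of_difference(sd, n, n)
    print(f"n = {n:>7,}: SE of mean difference ≈ {se:5.2f} points; "
          f"a gap of ~{1.96 * se:.1f} points is already 'significant' at p < .05")
```

With cohorts in the tens of thousands, a gap of a couple of scale points clears the significance bar – which is exactly why the size of the error still needs to be reported, not assumed away.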

Given the size of the sampling here, you might think it OK to let that slide. However, that isn’t the only short cut taken in the report. The second most obvious measure ignored in this report is measurement error. Measurement error exists any time we create some instrument to estimate a ‘latent’ variable – i.e. something you can’t see directly. We can’t SEE achievement directly – it is an inference based on measuring several things we can theoretically argue are valid indicators of the thing we want to measure.

Measurement error is by no means a simple issue, but it directly impacts the validity of any one individual student’s NAPLAN score and any aggregate based on those individual results. In ‘classical test theory’ a measured score is made up of what is called a ‘true score’ and error (+/-). In more modern measurement theories error can become much more complicated to estimate, but the general conception remains the same. Any parent who has looked at NAPLAN results for their child and queried whether or not the test is accurate is implicitly questioning measurement error.
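For readers who want the arithmetic, here is a minimal classical test theory sketch of that idea – observed score = true score + error – with the standard error of measurement derived from an assumed reliability. The standard deviation and reliability values are illustrative assumptions, not NAPLAN’s published figures:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

sd = 70.0           # assumed spread of observed scale scores
reliability = 0.90  # assumed reliability of the test
sem = standard_error_of_measurement(sd, reliability)
print(f"SEM ≈ {sem:.1f} scale points")
print(f"A 95% band around any observed score is roughly ±{1.96 * sem:.0f} points")
```

Even with a quite respectable reliability of 0.90, the band around any individual score spans tens of scale points.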

Educational testing advocates have developed many very mathematically complicated ways of dealing with measurement error – and have developed new testing techniques for improving their tests. The current push for adaptive testing is precisely one of those developments, in the local case being rationalised on the grounds that adaptive testing (where the specific test items asked of the person being tested change depending on prior answers) does a better job of differentiating those at the top and bottom ends of the scoring range (see the 2019 NAPLAN technical report for this analysis).

This bottom/top of the range problem is referred to as a floor or ceiling effect. When a large proportion of students either don’t score anything or get everything correct, there is no way to differentiate those students from each other – adaptive testing is a way of dealing with floor and ceiling effects better than a predetermined set of test items. This adaptive testing has been included in the newer deliveries of the online form of the NAPLAN test.
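As a rough illustration of the adaptive idea, here is a toy sketch only – real systems use proper IRT estimation, and the item difficulties and the crude ability update below are invented for illustration:

```python
import math
import random

def run_adaptive_test(true_ability: float, item_bank: list[float], n_items: int = 10) -> float:
    """Return a crude ability estimate after adaptively administering n_items items."""
    estimate, step = 0.0, 1.0
    remaining = list(item_bank)
    for _ in range(n_items):
        # pick the unused item whose difficulty is closest to the current estimate
        item = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(item)
        p_correct = 1.0 / (1.0 + math.exp(-(true_ability - item)))
        answered_correctly = random.random() < p_correct
        # step the estimate up after a correct answer, down after an incorrect one
        estimate += step if answered_correctly else -step
        step *= 0.7  # shrink the step so the estimate settles
    return estimate

random.seed(1)
bank = [d / 10 for d in range(-30, 31)]  # hypothetical item difficulties from -3 to +3 logits
print(f"estimated ability ≈ {run_adaptive_test(true_ability=1.5, item_bank=bank):.2f} logits")
```

The point of the design is simply that high and low performers keep getting items near their own level, instead of piling up at the floor or ceiling of a fixed test.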

Two important things to note. 

First, the current report claims high-performing students’ scores have shifted down – despite new adaptive testing regimes obtaining very different patterns of ceiling effect. Second, the test is not identical for all students (the tests never have been).

The process used for selecting test items is based on ‘credit models’ generated by testers. Test items are determined to have particular levels of ‘difficulty’ based on the probability of correct answers being given by different populations and samples, after assuming population-level equivalence in prior ‘ability’ AND creating difficulty scores for items while assuming individual student ‘ability’ measures are stable from one time period to the next. That’s how they can create the 800-point scales that are designed for comparing different year levels.
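For the technically inclined, here is a hedged sketch of the kind of ‘partial credit’ model being described – the probability of a student with a given ability scoring each category on a polytomous item. The step difficulties and the linear logit-to-scale transformation are illustrative assumptions, not ACARA’s published constants:

```python
import math

def partial_credit_probabilities(theta: float, step_difficulties: list[float]) -> list[float]:
    """P(score = 0..K) for one polytomous item under a Rasch partial credit model."""
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]

steps = [-1.0, 0.2, 1.5]   # assumed step difficulties for one writing criterion (0-3 marks)
theta = 0.5                # a student's ability in logits
print([round(p, 2) for p in partial_credit_probabilities(theta, steps)])

# Illustrative (assumed) linear transform from logits to a reporting scale:
print(f"scale score ≈ {500 + 80 * theta:.0f}")
```

Every step in that chain – the step difficulties, the ability estimates, the logit-to-scale transformation – is itself an estimate with its own error.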

So what does this report say about any measurement error that may impact the comparisons they are making?  Nothing.

One of the ways ACARA and politicians have settled their worries about such technical concerns as accurately interpreting statistical reports is by introducing the reporting of test results in ‘Bands’. Now these bands are crucial for qualitatively describing rough ranges of what the numbers might mean in curriculum terms – but they come with a big consequence. Using ‘Band’ scores is known as ‘coarsening’ data – taking a more detailed scale and summarising it in a smaller set of ordered categories – and that process is known to increase any estimates of error. This latter problem has received much attention in the statistical literature, with new procedures being recommended for how to adjust estimates to account for that error when conducting group comparisons using such data.
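A small simulation makes the coarsening point concrete: take students whose underlying scores sit near a band cut-off, add plausible measurement error, and see how often the reported band flips. The cut-off value and the standard error of measurement are assumptions for illustration only:

```python
import random

random.seed(0)
band_cutoff = 530.0   # hypothetical band boundary on the scale
sem = 35.0            # assumed standard error of measurement
trials = 100_000
flips = 0
for _ in range(trials):
    true_score = random.uniform(band_cutoff - 20, band_cutoff + 20)  # students near the boundary
    observed_score = true_score + random.gauss(0.0, sem)
    if (true_score >= band_cutoff) != (observed_score >= band_cutoff):
        flips += 1
print(f"{100 * flips / trials:.1f}% of near-boundary students are reported in the 'wrong' band")
```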

As before, the amount of reporting of that error issue? Nada.

 This measurement problem is not something you can ignore – and yet the current report is worse than careless on this question.

It takes advantage of readers not knowing about it. 

When the report attempts to diagnose which components of the persuasive writing tasks were of most concern, it does not bother reporting that each of the separate measures of those ten dimensions of writing carries far more error than the total writing score, simply because the number of marks for each is a fraction of the total. The smaller the number of indicators, the more error (and the less reliability).
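The Spearman–Brown relationship makes this concrete: cut a test to a fraction of its length and the predicted reliability drops sharply. The starting reliability and the ‘one tenth of the marks’ fraction below are assumptions for illustration:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when the number of items is multiplied by length_factor."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

full_score_reliability = 0.90  # assumed reliability of the total writing score
single_criterion = spearman_brown(full_score_reliability, length_factor=0.1)
print(f"Total score: {full_score_reliability:.2f}; "
      f"one criterion (~1/10 of the marks): {single_criterion:.2f}")
```

Under those assumptions a single criterion’s reliability falls below 0.5 – which is exactly why criterion-level trends need their error reported.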

Now all of these technical concerns simply raise the question of whether or not the overall findings of the report will hold up to robust tests and rigorous analysis – there is no way to assess that from this report. But there is an even bigger reason to question why it was given as much attention as it was. That is, for any statistician, there is always a challenge in translating the numeric conclusions into some form of ‘real life’ scenario.

To explain why AERO has significantly dropped the ball on this last point, consider its headline claim that Year 9 students’ persuasive writing scores have declined, and the way that is represented as a major new concern.

First, note that the ONLY reporting of this using the actual scale values is a vaguely labelled line graph showing scores from 2011 until 2018 – skipping 2016, since the writing task that year wasn’t persuasive writing (p. 26 of the report has this graph). Of those year-to-year shifts, the only two that may be statistically significant, and are readily visible, are from 2011 to 2012, and then again from 2017 to 2018. Why speak so vaguely? Because, from the report, we can’t tell you the numeric value of that drop: there is no reporting of the actual numbers represented in that line graph.

Here is where the final reality check comes in.  

If this data matches the data reported in the national reports from 2011 and 2018, the reported mean values on the writing scale were 565.9 and 542.9 respectively. So that is a drop between those two time points of 23 points. That may sound like a concern, but recall those scores are based on 48 marks given for writing. In other words, that 23-point difference is no more than one mark’s difference (it could be far less, since each different mark carries a different weighting in the formulation of that 800-point scale).

Consequently, even if all the technical concerns were sufficiently addressed and the pattern still held, the realistic headline for the Year 9 claim would be: ‘Year 9 students in the 2018 NAPLAN writing test scored one mark less than the Year 9 students of 2011.’

Now, assuming that 23-point difference has anything to do with the students at all, start thinking about all the plausible reasons why students in that last year of NAPLAN may not have been as attentive to detail as they were when NAPLAN was first getting started. I can think of several, not least being the way my own kids did everything possible to ignore the Year 9 test – since it had zero consequences for them.

Personally, I find these reports troubling for many reasons, including the use of statistics to assert certainty without good justification, but also because saying student writing has declined belies the obvious fact that it hasn’t been all that great for decades. This is where I am totally sympathetic to the issues raised by the report – we do need better writing among the general population. But using national data to produce a report of this calibre, by an agency beholden to government, really does little more than provide click-bait and knee-jerk diagnoses from all sides of a debate we don’t really need to have.

James Ladwig is Associate Professor in the School of Education at the University of Newcastle.  He is internationally recognised for his expertise in educational research and school reform.  Find James’ latest work in Limits to Evidence-Based Learning of Educational Science, in Hall, Quinn and Gollnick (Eds) The Wiley Handbook of Teaching and Learning published by Wiley-Blackwell, New York. James is on Twitter @jgladwig

Why appeasing Latham won’t make our students any more remarkable

Are our schools making the kids we think we should? The tussle between politics and education continues and Latham is just the blunt end of what is now the assumed modus operandi of school policy in Australia. 

Many readers of this blog will no doubt have noticed a fair amount of public educational discussion about NSW’s School Success Model (SSM) which, according to the Department flyer, is ostensibly new. For background NSW context, it is important to note that this policy was released in the context of a new Minister for Education who has openly challenged educators to ‘be more accountable’, alongside an entire set of parliamentary educational inquiries set up to appease Mark Latham, who chairs a portfolio committee with a very clear agenda motivated by the populism of his political constituency.

This matters because there are two specific logics used in the political arena that have been shifted into the criticisms of schools: public dissatisfaction leading to the accountability question (so there’s a ‘public good’ ideal somewhere behind this), and the general rejection of authorities and elitism (alternatively, easily labelled anti-intellectualism). Both of these political concerns are connected to the School Success Model. The public dissatisfaction is motivating the desire for measures of accountability that the public believes can be free of tampering, and that ‘matter’. Test scores dictate students’ futures, so they matter, etc. The rejection of elitism is also embedded in the accountability issue. That is due to a (not always unwarranted) lack of trust. That lack of trust often gets openly directed at specific people.

Given the context, while the new School Success Model (SSM) is certainly well intended, it also represents one of the more direct links between politics and education than we typically see. The ministerialisation of schooling is clearly alive and well in Australia. This isn’t the first time we have seen such direct links – the politics of NAPLAN was, after all, straight from the political intents of its creators. It is important to note that the logic at play has been used by both major parties in government. Implied in that observation is that the systems we have live well beyond election cycles.

Now in this case, the basic political issue is how to ‘make’ schools rightfully accountable, and at the same time push for improvement. I suspect these are at least popular sentiments, if not overwhelmingly accepted as a given by the vast majority of the public. So aside from general commitments to ‘delivering support where it is needed’ and ‘learning from the past’, the model is most notable for its main driver – a matrix of measured ‘outcome’ targets. In the public document, that includes targets at the system level and school level – aligned. NAPLAN, Aboriginal Education, HSC, Attendance, Student growth (equity), and Pathways are the main areas specified for naming targets.

But, like many of the other systems created with the same good intent before it, this one really does invite the growing criticism already noted in public commentary. Since, with luck, public debate will continue, here I would like to put some broader historical context around these debates and take a look under the hood of these measures to show why they really aren’t fit for school accountability purposes without a far more sophisticated understanding of what they can and cannot tell you.

In the process of walking through some of this groundwork, I hope to show why the main problem here is not something a reform here or there will change. The systems are producing pretty much what they were designed to produce.

On the origins of this form of governance

Anyone who has studied the history of schooling and education (shockingly few in the field these days) would immediately see the target-setting agenda as a ramped-up version of scientific management (see Callahan, 1962), blended with a bit of Michael Barber’s methodology for running government (Barber, 2015), using contemporary measurements.

More recently, at least since the then-labelled ‘economic rationalist’ radical changes brought to the Australian public service and government structures in the late 1980s and early 1990s, the notion of measuring the outcomes of schools as a performance issue has matured, in tandem with the increasing dominance of the testing industry over the past few decades (an industry which also grew throughout the 20th century). The central architecture of this governance model would be called neo-liberal these days, but it is basically a centralised ranking system based on pre-defined measures determined by a select few, and those measures are designed to be palatable to the public. Using such systems to instil a bit of group competition between schools fits very well with those who believe market logic works for schooling, or those who like sport.

The other way of motivating personnel in such systems is, of course, mandate, such as the now-mandated Phonics Screening Check announced in the flyer.

The devil in the details

Now when it comes to school measures, there are many types, and we actually know a fair amount about most if not all of them – as most are generated from research somewhere along the way. There are some problems of interpretation that all school measures face, which relate to the basic problem that most measures are actually measures of individuals (and not the school), or vice versa. Relatedly, we also often see school-level measures which are simply the aggregate of the individuals. In all of these cases, there are many good intentions that don’t match reality.

For example, it isn’t hard to make a case for saying schools should measure student attendance. The logic here is that students have to be at school to learn school things (aka achievement tests of some sort). You can simply aggregate individual students’ attendance to the school level and report it publicly (as on MySchool), because students need to be in school. But it would be a very big mistake to assume that the school-level aggregated mean attendance of the student data is at all related to school-level achievement. It is often the case that what is true for the individual is not true for the collective to which the individual belongs. Another case in point here is the policy argument that we need expanded educational attainment (which is ‘how long you stay in schooling’) because if more people get more education, that will bolster the general economy. Nationally, that is a highly debatable proposition (among OECD countries there isn’t even a significant correlation between average educational attainment and GDP). Individually, it does make sense – educational attainment and personal income, or individual status attainment, are generally quite positively related. School-level attendance measures that are simple aggregates are not related to school achievement (Ladwig and Luke, 2013). This may be why the current articulation of the attendance target is a percentage of students attending more than 90% of the time (surely a better articulation than a simple average – but still an aggregate of untested effect). The point is more direct – often these targets are motivated by a goal that has been based on some causal idea, but the actual measures often don’t reflect that idea directly.

Another general problem, especially for the achievement data, is the degree to which all of the national (and state) measures are in fact estimates, designed to serve specific purposes. The degree to which this is true varies from test to test. Almost all design options in assessment systems carry trade-offs. There is a big difference between an HSC score – where the HSC exams and syllabuses are very closely aligned and the student performance is designed to reflect that – and NAPLAN, which is deliberately not directly related to syllabuses but is instead designed to estimate achievement on an underlying scale derived from the population. For HSC scores, it makes some sense to set targets, but notice those targets come in the form of percentages of students in a given ‘Band’.

Now these bands are tidy and no doubt intended to make interpretation of results easier for parents (that’s the official rationale). However, both HSC Bands and NAPLAN bands represent ‘coarsened’ data, which means they are calculated on the basis of some more finely measured scale (HSC raw scores, NAPLAN scale scores). There are two known problems with coarsened data: 1) in general they increase measurement error (almost by definition), and 2) they are not static over time. Of these two systems, the HSC would be much more stable over time, but even there much development occurs, and the actual qualitative descriptors of the bands change as syllabuses are modified. So these band scores, and the number of students in each, are something that really needs to be understood as far less precise than counting kids in those categories implies. For more explanation, and an example of one school which decided to change its spelling program on the basis of needing one student to get one more test item correct in order to meet its goal of having a given percentage of students in a given band, see Ladwig (2018).
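A toy example shows just how sensitive these ‘percentage in a band’ targets can be in a small cohort – the cohort size and band counts below are invented for illustration:

```python
cohort_size = 52                 # hypothetical Year 5 cohort
students_in_target_band = 20     # hypothetical count in the target band
before = 100 * students_in_target_band / cohort_size
after = 100 * (students_in_target_band + 1) / cohort_size  # one student crosses the cut-off
print(f"Before: {before:.1f}% in the band; after: {after:.1f}% "
      f"(a jump of {after - before:.1f} percentage points from a single student)")
```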

There is a lot of detail behind this general description, but the point is made very clearly in the technical reports, such as when ACARA shifted how it calibrated its 2013 results relative to previous test years – where you find the technical report explaining that ACARA would need to stop assuming previous scaling samples were ‘secure’. New scaling samples have been drawn each year since 2013. When explaining why it needed to estimate sampling error in a test that was given to all students in a given year, ACARA was forthright and made it very clear:

‘However, the aim of NAPLAN is to make inference about the educational systems each year and not about the specific student cohorts in 2013’ (p. 24).

Here you can see overtly that the test was NOT designed for the purposes the NSW Minister wishes to pursue.

The slippage between any credential (or measure) and what it is supposed to represent has a couple of names. When it comes to testing and achievement measurements, it’s called error. There’s a margin within which we can be confident, so analysis of any of that data requires a lot of judgement, best made by people who know what and who is being measured. But that judgement cannot be exercised well without a lot of background knowledge that is not typically in the extensive catalogue of background knowledge needed by school leaders.

At a system level, the slippage between what’s counted and what it actually means is called decoupling. And any of the new school-level targets are ripe for such slippage. The number of Aboriginal students obtaining an HSC is clear enough – but does it reflect the increasing numbers of alternative pathways used by an increasingly wide array of institutions? Counting how many kids continue to Year 12 makes sense, but it is also motivation for schools to count kids simply for that purpose.

In short, while the public critics have spotted potential perverse unintended consequences, I would hazard a prediction that they’ve only scratched the surface. Australia already has ample evidence of NAPLAN results being used as the basis of KPI development with significantly problematic side effects – there is no reason to think this model would be immune from misuse, and in fact it invites more (see Mockler and Stacey, 2021).

The challenge we need to take up is not how to make schools ‘perform’ better or teachers ‘teach’ better – all of those aims are well intended, but this is a good time to point out that common sense really isn’t sensible once you understand how the systems work. To me it is the wrong question to ask how we make this or that part of the system do something more or better.

In this case, it’s a question of how we can build systems in which schools and teachers are rightfully and fairly accountable and in which schools, educators and students are all growing. And THAT question cannot be reached until Australia opens up bigger questions about curriculum that have been locked into what has been a remarkably resilient structure ever since the early 1990s attempts to create a national curriculum.

Figure 1 Taken from the NAPLAN 2013 Technical Report, p.19

This extract shows the path from a raw score on a NAPLAN test to what eventually becomes a ‘scale score’ – per domain. It is important to note that the scale score isn’t a count – it is based on a set of interlocking estimations that align (calibrate) the test items. That ‘logit’ score is based on the overall probability of test items being correctly answered.

James Ladwig is Associate Professor in the School of Education at the University of Newcastle and co-editor of the American Educational Research Journal.  He is internationally recognised for his expertise in educational research and school reform.  Find James’ latest work in Limits to Evidence-Based Learning of Educational Science, in Hall, Quinn and Gollnick (Eds) The Wiley Handbook of Teaching and Learning published by Wiley-Blackwell, New York (in press). James is on Twitter @jgladwig

References

Barber, M. (2015). How to Run A Government: So that Citizens Benefit and Taxpayers Don’t Go Crazy: Penguin Books Limited.

Callahan, R. E. (1962). Education and the Cult of Efficiency: University of Chicago Press.

Ladwig, J., & Luke, A. (2013). Does improving school level attendance lead to improved school level achievement? An empirical study of indigenous educational policy in Australia. The Australian Educational Researcher, 1-24. doi:10.1007/s13384-013-0131-y

Ladwig, J. G. (2018). On the Limits to Evidence-Based Learning of Educational Science. In G. Hall, L. F. Quinn, & D. M. Gollnick (Eds.), The Wiley Handbook of Teaching and Learning (pp. 639-658). New York: Wiley and Sons.

Mockler, N., & Stacey, M. (2021). Evidence of teaching practice in an age of accountability: when what can be counted isn’t all that counts. Oxford Review of Education, 47(2), 170-188. doi:10.1080/03054985.2020.1822794

Main image: Australian politician Mark Latham at the 2018 Church and State Summit, 15 January 2018. Source: “Mark Latham – Church And State Summit 2018”, YouTube (screenshot). Author: Pellowe Talk YouTube channel (Dave Pellowe).

Put professional judgement of teachers first or we’ll never get the systemic education improvements we all want. Let’s talk about it

In this blog I’d like to bring together three different lines of educational analysis to show how our contemporary discussions of policy are really not going to lead to any significant change or educationally defensible reforms.  I realise that is a very big call, but I’m pretty confident in saying it, and I hope to show why.

Essentially, I think we really need to change the way educational reform debates are framed, because they are based on questions that will not lead us to the systemic improvement that I think most ‘stakeholders’ really seek in common. Before I launch into this discussion, though, I also need to point out that there are a host of related issues which really can’t be sufficiently addressed here, and which I won’t explain at all – but which I will name toward the end of this post.

For now, consider three main points:

1) there is growing recognition that a fundamental linchpin of quality schooling is always going to be our reliance on the professional judgement of teachers,

 2) there is also growing recognition that our current system architecture works against that in several ways, and

3) and this is the clincher: the systems that we have implemented are producing exactly what they were designed to produce (a design in which teacher professional judgement plays little or no part).

The practical conclusion of bringing these observations together is obvious to me. We are never going to get that “systemic improvement” that we all seem to think will be good for Australia, because we don’t have the right system architecture to achieve it. I believe we need to start thinking more carefully and creatively about how our educational systems are designed.

The hard part begins after sufficient numbers of stakeholders come to this realisation and want to shift the debates.  We aren’t there yet, so for now I just want to open up this line of thought.

The first starting point won’t be a surprise for readers of this blog, or followers of public educational policy pitches. On the one hand, anyone with Finlandia envy, and followers of the recent statements from Pasi Sahlberg, now at UNSW’s Gonski Institute, will know that much of the strength of the ‘Finnish Education Mystery’ (as it has been named by Hannu Simola) has been built on a strong commitment to the professional autonomy and expertise of Finnish teachers. This isn’t simply accidental, but a consequence of a long and understandable history that included (but isn’t only due to) careful and intelligent design by the Finnish Government.

On the other hand, here in Australia, Associate Professor Nicole Mockler of the University of Sydney and her colleagues have aptly shown that teachers are more than interested in using evidence-based approaches to help guide their local decisions, but their judgements are not really being supported by evidence they see as relevant and useful.

My own analysis of this situation has led me to raise significant questions about the way in which advanced technical issues of measurement and its statistical applications have been reduced to incorrect and really misleading uses, and the way in which the institutions that are supposed to promote teachers and teaching have reduced that exercise to classic institutional credentialism based on tick-box exercises that really don’t reflect what they claim to.

No matter how much politicians and other stakeholders might wish to create systems that guarantee this or that universal practice, student learning is always individual and in schools always dependent on whomever is guiding that learning (the same would apply to entirely automated systems, by the way).  So the goal of designing systems based on the presumption that we can somehow specify practice to a point where there is no uncertainty in delivery, is folly.

And yet, point two, these are precisely the sorts of education systems Australia has been building since at least the late 1980s. In broad terms this corresponds to the significant changes in educational governance known as ‘the ministerialisation of education’, documented by educational researchers Dr Janice Dudley and Professor Lesley Vidovich long ago. It was in this period that the penultimate attempt to nationalise curriculum developed, with the corresponding creation of national goals (the Hobart, Adelaide and Melbourne declarations); former civil servants were replaced by contracted ‘Senior Executives’ across federal and state bureaucracies; and teacher education was handed to the federally funded universities alone (plus a range of massive shifts in TAFE).

Since then it has been a long slow process of standardisation within and across state systems, the formation of ‘professional institutes,’ and the expansion of public funding to private schooling. 

The roll out of ‘standardisation’

The case for why these systems inhibit or actively work against the exercise of teachers’ professional judgement should be pretty obvious from the term ‘standardisation’. These days, national curriculum is designed with the intent of making sure children of the military can move around the nation and ‘get the same stuff’; accountability is centrally developed and deployed via the least expensive forms, like NAPLAN (and an expanding host of supposedly valid measures); teaching has become regulated through standardising the people (at least on paper, via the ‘professional standards’); and securing employment and advancement has been directly tied to these mechanisms.

Even measurement instruments originally designed only for research, and later to help provide evidence for teachers’ use, have become tick-box instruments of surveillance. As a researcher I am not opposed to good measurement – in fact I’ve created some of those measures being used in this larger schema – but how systems deploy them makes a huge difference.

From the reports of the implementation of NAPLAN it is very clear (as was predicted by its then opponents) that many of these measures have become much more high stakes than advocates predicted or intended (the opponents were right about this one). Whether it be novice teachers beholden to developing paperwork ‘evidence’ of standards for their job security, through to executives whose jobs depend on meeting Key Performance Indicators (which are themselves abstracted from actual effect), we have developed systems of compliance within institutions in which real humans play roles that are pre-defined and largely circumscribed. And those who readily fit them without too much critique fill these roles.

After years of this, is it any wonder that teacher education programs by and large no longer teach the history and practice of curriculum design, nor the history and philosophy of education (which is now largely relegated to ‘ethics’ in service of codes of conduct), and that what once were lively fields of educational psychology and sociology of education have become handmaidens of ‘evidence-based’ teaching techniques and bureaucratic definitions of ‘equity’? (In the university sector these ‘foundational’ disciplines literally do not belong in education anymore for research accountability purposes.)

One bit of historical memory: in the late 1970s and early 1980s, this process of moving the intellectual (‘mental’) work of teachers into standardised categories defined by management was shown to have a long-term effect known as ‘de-skilling’. From our work in the New Basics Trial in Queensland (which was actually much more successful than most realise) it has been very clear that what once were widespread teacher capacities in local curriculum design and development have been forfeited to (extremely well paid) bureaucrats. When I met the teachers who took part in the early 1990s National Schools Project (in 1993 and 1994), state differences on this were really obvious and relevant.

When teachers were invited to restructure any aspect of their work to improve student learning, through an overt agreement between the unions and employers, teachers from states where there were strong traditions of local curriculum development and pedagogical reflection (most obviously Victoria and South Australia) were squarely focused on trying to find ways of providing rich educational experiences for their students (curriculum and pedagogy were their mainstay). Teachers from the state that has provided the basic structure of our current systems (NSW) were largely concerned about timetables and budgets. Of course this is a very big generalisation, but it is also obvious when you work with teachers in schools developing new curriculum projects.

What is the effect of all this? Precisely as intended: the systems are standardised, stratified, countable and a ready source of ‘evidence’ used to meet the needs of politicians, ‘independent’ stakeholders, and advancing employees who probably actually believe in the reforms and initiatives they advocate.

But let’s be honest, these actors are not around after they have used the political capital gained from initiating their pet projects.

Let’s go further

Here is where there are hosts of other developments that buttress this larger system and need further analysis and elaboration than I can provide here. From the expansion of testing measures based on statistical assumptions few teachers and principals (and fewer parents) really know well (they aren’t taught them), to professional development schemes based on market-determined popularity, to pre-packaged curriculum and apps literally sold as the next silver bullet, contemporary ‘texts’ of education carry far more implications than the ones named by those selling them.

There is a huge range of ideas and presumptions that lie behind those sales pitches. Of course some teachers sometimes blindly seek these out in the hope of finding new ideas and effective practices. Teachers’ dispositions and capacities have not come from nowhere; they are the historical product of this system. But who is going to blame them (or the bureaucrats, for that matter) when they rightfully focus on making sure they have a job in that system so they can support their own children and parents?

Yes, we have systems we created. On the one hand, that’s not encouraging.  On the other hand, that does mean that we can re-create them into something quite different.

Change the questions

One of the first steps to collectively trying to find new ways of constructing our school systems, I think, really is about changing the questions we think we are answering. Instead of using the type of questions needed to drive research, e.g. anything of the form ‘what works?’, we need to start asking, ‘how do we build systems that increase the likelihood that teachers will make intelligent and wise decisions in their work?’

Research and the categories of analysis CAN provide clear ideas about what has occurred in the past (with all the necessary qualifications about when, where, and measured how), but those answers should never be the basis for systems to prescribe what teachers are supposed to do in any given individual event or context. For example, diagnostic testing can be incredibly useful for teachers, but it can’t tell teachers what to do, with whom, or when.

Do we have systems that support teachers in taking the next step in their decisions about which students need what support at what time, while knowing what those tests actually measure, with what margin of error, in what contexts, for whom? The question for system design isn’t what’s ‘best practice’; it’s what system increases the probability of teachers making wise and compassionate decisions for their students, in their context, at the appropriate time. That includes making judgements relative to what’s happening in our nation, our economy and the larger global transformations.

Our systems, in the pursuit of minimising risk, are very good at proscribing what teachers shouldn’t do; but they are not designed to support teachers to wisely exercise the autonomy they need to do their jobs in a manner that demonstrates the true potential of our nation.

We can see that potential in the all too rare events in which our students and teachers are given that sort of support – often on the backs of incredibly dedicated and professional teachers and school leaders. From local innovative uses of technology, to large scale performances in the arts, the potential of Australian educators isn’t really hard to find.  But we need new systems to support them in doing more of that type of work, with more students, more of the time.

So when it comes to advocating this or that system reform, please, change the focus. We don’t need more ‘best practice’ policies from vested interests to discipline our teachers; we need systems designed to promote true, authentic excellence in education.

James Ladwig is Associate Professor in the School of Education at the University of Newcastle and co-editor of the American Educational Research Journal.  He is internationally recognised for his expertise in educational research and school reform.  Find James’ latest work in Limits to Evidence-Based Learning of Educational Science, in Hall, Quinn and Gollnick (Eds) The Wiley Handbook of Teaching and Learning published by Wiley-Blackwell, New York (in press). James is on Twitter @jgladwig

It’s time to be honest with parents about NAPLAN: your child’s report is misleading, here’s how

You know there is something going wrong with Australia’s national testing program when the education minister of the largest state calls for it to be axed. The testing program, which started today across the nation, should be urgently dumped, according to NSW Education Minister Rob Stokes, because it is being “used dishonestly as a school rating system” and has “sprouted an industry that extorts money from desperate families”.

I think it should be dumped too, in its current form, but for an even more compelling reason than Stokes has aired. I believe we are not being honest with parents about how misleading the results can be.

The Federal Minister for Education, Simon Birmingham, has rejected the NSW minister’s call, of course, arguing that “parents like NAPLAN”. Birmingham is probably right. Many parents see NAPLAN scores as one of the few clear indications they get of how their child’s performance at school compares to other children.

Parents receive a NAPLAN student report showing their child’s score as dots on a band, clearly positioned in relation to their school average and national average. It all looks very precise.

But how precise are the NAPLAN results? Should parents, or any of us, be so trusting of the results?

There is considerable fallout from the reporting of NAPLAN results, so I believe it is important to talk about what is going on. The comparisons we make, the decisions we make, and the assumptions we make about those NAPLAN results can all be based on very imprecise data.

How NAPLAN results can be wildly inaccurate

While communication of results to parents suggests a very high level of precision, the technical report issued by ACARA each year suggests something quite different. Professor Margaret Wu, a world leading expert in educational measurement and statistics, has done excellent and sustained work over a number of years on what national testing data can (and can’t) tell us.

Her work shows that because of the relatively small number of questions asked in each section of NAPLAN tests, which are then used to estimate a child’s performance for each (very large) assessment area, there is a lot of what statisticians call ‘measurement error’ involved. This means that while parents are provided with an indication of their child’s performance that looks very precise, the real story is quite different.

Here’s an example of how wrong the results can be

Figure A is based on performance on the 2016 Year 7 Grammar and Punctuation test: in this case, the student has achieved a score of 615, placing them in the middle of Band 8. We can see that on the basis of this, we might conclude that they are performing above their school average of about 590 and well above the national average of 540. Furthermore, the student is at the cut-off of the 60% shaded area, which means their performance appears to be just in the top 20% of students nationally.

Figure A

However, Figure B tells a different story. Here we have the same result, with the ‘error bars’ added (using the figures provided in the 2016 NAPLAN Technical Report, and a 90% Confidence Interval, consistent with the MySchool website). The solid error bars on Figure B indicate that while the student has received a score of 615 on this particular test, we can be 90% confident on the basis of this that their true ability in grammar and punctuation lies somewhere between 558 and 672, about two bands worth. If we were to use a 95% confidence interval, which is the standard in educational statistics, the span would be even wider, from 547 to 683 – this is shown by the dotted error bars.
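For readers who want to see the arithmetic, here is a sketch of the confidence interval calculation behind Figure B. The standard error of measurement of about 35 scale points is inferred from the intervals quoted above (it is not ACARA’s published figure), and rounding means the results differ by a point or so from those in the text:

```python
from statistics import NormalDist

observed_score = 615
sem = 35  # standard error of measurement inferred from the intervals quoted above

for confidence in (0.90, 0.95):
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    low, high = observed_score - z * sem, observed_score + z * sem
    print(f"{confidence:.0%} confidence interval: {low:.0f} to {high:.0f}")
```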

In other words, the student’s ‘true ability’ might be very close to the national average, toward the bottom of Band 7, or quite close to the top of Band 9.

That’s a very wide ‘window’ indeed.

Figure B

Wu goes on to note that NAPLAN is also not very good at representing student ability at the class or school level because of what statisticians call ‘sampling error’: the error caused by variation in the mean scores of students due to the characteristics of different cohorts. Sampling error is affected by the number of students in a year group – the smaller the cohort size, the larger the sampling error. Wu points out that the margin of error on school means can easily be close to, or indeed larger than, one year of expected annual growth. So for schools with cohorts of 50 or fewer, a very significant change in mean performance from one year to another would be possible just on the basis of sampling error.
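A rough sketch of that cohort-size effect, using an assumed student-level standard deviation of 70 scale points (an illustration only, not Wu’s or ACARA’s figure), shows how quickly the margin of error on a school mean grows as cohorts shrink:

```python
import math

student_sd = 70.0  # assumed student-level standard deviation on the NAPLAN scale
for cohort_size in (25, 50, 100, 200):
    margin = 1.96 * student_sd / math.sqrt(cohort_size)
    print(f"cohort of {cohort_size:>3}: 95% margin of error on the school mean ≈ ±{margin:.0f} points")
```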

The problem is that school NAPLAN results are published on the MySchool website. Major decisions are made based on them and on the information parents get in their child’s individual report; parents can spend a lot of money (tutoring, changing school, choosing a school) based on them. As Minister Stokes said, a big industry has developed around selling NAPLAN textbooks, programs and tutoring services. But the results we get are not precise. The precision argument just doesn’t hold. Don’t fall for it.

Any teacher worth their salt, especially one who hadn’t felt the pressure to engage in weeks of NAPLAN preparation with their students, would be far more precise than any dot on a band in assessing their students’ ability. Teachers continually assess their students and continually collect substantial evidence of how their students are performing.

Research also suggests that publishing the NAPLAN results on the MySchool website has played a driving role in Australian teachers and students experiencing NAPLAN as ‘high stakes’.

So is NAPLAN good for anything?

At the national level, however, the story is different. What NAPLAN is good for, and indeed what it was originally designed for, is to provide a national snapshot of student ability, and to conduct comparisons between different groups (for example, students with a language background other than English and students from English-speaking backgrounds) at a national level.

This is important data to have. It tells us where support and resources are particularly needed. But we could collect the data we need by using a rigorous sampling method, where a smaller number of children are tested (a sample) rather than having every student in every school sit tests every two years. This is a move that would be a lot more cost-effective, both financially and in terms of other costs to our education system.

So, does NAPLAN need to be urgently dumped?

Our current use of NAPLAN data definitely does need to be urgently dumped. We need to start using NAPLAN results for, and only for, the purpose for which they are fit. I believe we need to get the individual school results off the MySchool website for starters. That would quite quickly cut out much of the hype and anxiety. I think it is time, at the very least, to be honest with parents about what NAPLAN does and doesn’t tell them about their children’s learning and about their schools.

In the process we might free up some of that precious classroom time for more productive things than test preparation.

*With thanks to A/Prof James Ladwig for his helpful comments on the draft of this post.

 

Dr Nicole Mockler is an Associate Professor in Education at the University of Sydney. She has a background in secondary school teaching and teacher professional learning. In the past she has held senior leadership roles in secondary schools, and after completing her PhD in Education at the University of Sydney in 2008, she joined the University of Newcastle in 2009, where she was a Senior Lecturer in the School of Education until early 2015. Nicole’s research interests are in education policy and politics, professional learning and curriculum and pedagogy, and she also continues to work with teachers and schools in these areas.

Nicole is currently the Editor in Chief of The Australian Educational Researcher, and a member of the International Advisory Board of the British Educational Research Journal and Educational Action Research. She was the Communications Co-ordinator for the Australian Association for Research in Education from 2011 until 2016, and until December 2016 was Associate Editor for both Critical Studies in Education and Teaching and Teacher Education.


Here’s what is going wrong with ‘evidence-based’ policies and practices in schools in Australia

An academic’s job is, quite often, to name what others might not see. Scholars of school reform in particular are used to seeing paradoxes and ironies. The contradictions we come across are a source of intellectual intrigue, theoretical development and, at times, humour. But the point of naming them in our work is often a fairly simple attempt to get policy actors and teachers to see what they might not see when they are in the midst of their daily work. After all, one of the advantages of being in ‘the Ivory Tower’ is having the opportunity to see larger, longer-term patterns of human behaviour.

This blog is an attempt to continue this line of endeavour. Here I would like to point out some contradictions in current public rhetoric about the relationship between educational research and schooling – focusing on teaching practices and curriculum for the moment.

The call for ‘evidence-based’ practice in schools

By now we have all seen repeated calls for policy and practice to be ‘evidence-based’. On the one hand, this is common sense – a call to restrain the well-known tendency of educational reforms to fervently push one fad after another, based mostly on beliefs and normative appeals (that is, messages that indicate what one should or should not do in a certain situation). And let’s be honest, these often get tangled in party political debates – between ostensible conservatives and supposed progressives. The reality is that both sides are guilty of pushing reforms with either no serious empirical basis or a half-baked re-interpretation of research – and both claiming authority based on that ‘research’. Of course, not all high-quality research is empirical – nor should it all be – but the appeal to evidence as a way of moving beyond stalemate is not without merit. Calling for empirical adjudication or verification does provide a pathway to establish more secure bases for justifying what reforms and practices ought to be implemented.

There are a number of ways in which we already know empirical analysis can move educational reform further, because we can name very common educational practices for which there is ample evidence that the effects are not what advocates intended. For example, NAPLAN has been implemented in a manner that directly contradicts what some of its advocates intended: it has become far more high-stakes than intended and has narrowed curriculum, a consequence its early advocates said would not happen. (Never mind that many of us predicted this. That’s another story.) This is an example of where empirical research can serve the vital role of assessing the difference between intended and experienced results.

Good research can turn into zealous advocacy

So on a general level, the case for evidence-based practice has a definite value. But let’s not over-extend this general appeal, because we also have plenty of experience of seeing good research turn into zealous advocacy with dubious intent and consequence. The current over-extensions of the empirical appeal have led paradigmatic warriors to push the authority of their work well beyond its actual capacity to inform educational practice. Here, let me name two forms of this over-extension.

Synthetic reviews

Take the contemporary appeal to summarise studies of specific practices as a means of deciphering which practices offer the most promise in practice. (This is called a ‘synthetic review’; John Hattie’s well-known work would be an example.) There are, of course, many ways to conduct synthetic reviews of previous research – but we all know the statistical appeal of meta-analyses, based on one form or another of aggregating effect sizes reported in research, has come to dominate the minds of many Australian educators (without a lot of reflection on the strengths and weaknesses of different forms of reviews).

So if we take the stock-standard effect size compilation exercise as authoritative, let us also note the obvious constraints implied in that exercise. First, to do that work, all of the included studies have to have measured an outcome that is taken to be the same outcome. This implies that the outcome is a) actually valuable and b) sufficiently consistent to be measured consistently. Since most research that fits this bill has already bought into the ideology behind standardised measures of educational achievement, that is its strongest footing. And it is good for that. These forms of analysis are also often about more than teaching alone, since the practices summarised frequently include pre-packaged curriculum as well (e.g. direct instruction research assumes a previously set, given curriculum is being implemented).

Now just think about how many times you have seen someone say this or that practice has this or that effect size without also mentioning the very restricted nature of the studied ‘cause’ and measured outcome.

Simply ask ‘effect on what?’ and you have a clear idea of just how limited such meta-analyses actually are.
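For those who have not looked under the hood of a meta-analysis, here is a minimal sketch of the standard inverse-variance pooling of effect sizes. The study values are invented purely for illustration, and note that the pooled number says nothing about whether the measured outcome was worth measuring.

```python
# A minimal sketch of inverse-variance pooling of effect sizes (Cohen's d).
# The effect sizes and variances below are invented for illustration.
effects = [0.40, 0.25, 0.60]     # hypothetical standardised effects from three studies
variances = [0.02, 0.05, 0.01]   # hypothetical sampling variances of those effects

weights = [1 / v for v in variances]
pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
print(f"pooled effect size: {pooled:.2f}")   # about 0.50 with these made-up inputs

# Nothing in this calculation asks 'effect on what?' - that question sits
# entirely outside the arithmetic.
```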

Randomised Control Trials

Also keep in mind what this form of research can actually tell us about new innovations: nothing directly. This last point applies doubly to the now ubiquitous calls for Randomised Control Trials (RCTs). By definition, RCTs cannot tell us what the effect of an innovation will be simply because that innovation has to already be in place to do an RCT at all. And to be firm on the methodology, we don’t need just one RCT per innovation, but several – so that meta-analyses can be conducted based on replication studies.

This isn’t an argument against meta-analyses and RCTs, but an appeal to be sensible about what we think we can learn from such necessary research endeavours.

Both of these forms of analysis are fundamentally committed to rigorously studying single cause-effect relationships, of the ‘X leads to Y’ form, since the most rigorous empirical assessment of causality in this tradition is based on isolating the effects of everything other than the designed cause – the X of interest. This is how you specify just what needs to be randomised. Although RCTs in education are built from the tradition of educational psychology that sought to examine generalised claims about all of humanity, where randomisation was needed at the individual student level, most reform applications of RCTs will randomise whatever unit of analysis best fits the intended reform. Common contemporary forms of this application will randomise teachers or schools in this or that innovation. The point of that randomisation is to find effects that are independent of the differences between whatever is randomised.
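As a sketch of what randomising at the unit of the reform looks like, the following assigns a made-up list of schools, rather than individual students, to intervention and control arms; the school names and the seed are arbitrary.

```python
# A minimal sketch of cluster (school-level) random assignment.
# The school list is hypothetical; only the assignment logic matters here.
import random

schools = [f"school_{i:02d}" for i in range(1, 21)]
random.seed(1)          # fixed seed so the split is reproducible
random.shuffle(schools)

half = len(schools) // 2
intervention, control = schools[:half], schools[half:]
print("intervention arm:", sorted(intervention))
print("control arm:", sorted(control))
```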

Research shows what has happened, not what will happen

The point of replications is to mitigate known human flaws (biases, mistakes, etc.) and to examine the effect of contexts. This is where our language about what research ‘says’ needs to be much more precise than what we typically see in news editorials and on Twitter. For example, when phonics advocates say ‘rigorous empirical research has shown phonics program X leads to effect Y’, don’t forget the background presumptions. What that research may have shown is that when phonics program X was implemented in a systematic study, the outcomes measured were Y. What this means is that the claims which can reasonably be drawn from such research are far more limited than zealous advocates hope. That research studied what happened, not what will happen.

Such research does NOT say anything about whether or not that program, when transplanted into a new context, will have the same effect. You have to be pretty sure the contexts are sufficiently similar to make that presumption. (Personally I am quite sceptical about crossing national boundaries with reforms, especially into Australia.)

Fidelity of implementation studies and instruments

More importantly, such studies cannot say anything about whether or not reform X can actually be implemented with sufficient ‘fidelity’ to expect the intended outcome. This reality is precisely why researchers seeking the ‘gold standard’ of research are now producing voluminous ‘fidelity of implementation’ studies and instruments. The Gates Foundation has funded many of these in the US, and I see manuscripts arising from them all the time in my editorial role. Essentially, fidelity of implementation measures attempt to estimate the degree to which the new program has been implemented as intended, often by analysing direct evidence of the implementation.

Each time I see one of these studies, it raises the question: ‘If the intent of the reform is to produce the qualities identified in the fidelity of implementation instruments, doesn’t the need for those instruments suggest the reform isn’t readily implemented?’ And why not use the fidelity of implementation instrument itself, if that’s what you really think has the effect? For a nice critique and re-framing of this issue see Tony Bryk’s Fidelity of Implementation: Is It the Right Concept?

The reality of ‘evidence-based’ policy

This is where the overall structure of the current push for evidence-based practices becomes most obvious. The fundamental paradox of current educational policy is that most of it is intended to centrally pre-determine what practices occur in local sites, what teachers do (and don’t do) – and yet the policy claims this will lead to the most advanced, innovative curriculum and teaching. It won’t. It can’t.

What it can do is provide a solid basis of knowledge for teachers to know and use in their own professional judgements about what is the best thing to do with their students on any given day. It might help convince schools and teachers to give up on historical practices and debates we are pretty confident won’t work. But what will work depends entirely on the innovation, professional judgement and, as Paul Brock once put it, nous of all educators.

 

James Ladwig is Associate Professor in the School of Education at the University of Newcastle and co-editor of the American Educational Research Journal.  He is internationally recognised for his expertise in educational research and school reform. 

Find James’ latest work in Limits to Evidence-Based Learning of Educational Science, in Hall, Quinn and Gollnick (Eds) The Wiley Handbook of Teaching and Learning published by Wiley-Blackwell, New York (in press).

James is on Twitter @jgladwig

National Evidence Base for educational policy: a good idea or half-baked plan?

The recent call for a ‘national education evidence base’ by the Australian Government came as no surprise to Australian educators. The idea is that we need to gather evidence, nationally, on which education policies, programs and teaching practices work in order for governments to spend money wisely on education. There have long been arguments that Australia has been increasing its spending on education, particularly school education, without improving outcomes. We need to ‘get more bang for our buck’, as Education Minister Simon Birmingham famously told us, or, as the Australian Productivity Commission put it, we need to ‘improve education outcomes in a cost-effective manner’.

I am one of the many educators who submitted a response to the Australian Productivity Commission’s national education evidence base proposal as set out in the draft report ‘National Education Evidence Base’. This blog post is based on my submission. Submissions are now closed and the Commission’s final report is due to be forwarded to the Australian Government in December 2016.

Inherent in the argument for a national education evidence base are criticisms of current educational research in Australia. As an educational researcher working in Australia, I focus on those criticisms here.

Here I will address five points raised in the report: 1) the extent to which there is a need for better research to inform policy, 2) the nature of the needed research, 3) the capacity needed to produce that research, 4) the audience of that research, and 5) the governance structure needed to produce research that is in the public interest.

The need for better research to inform policy

As the report notes, there are several aspects of ongoing educational debate which could well be advanced if a stronger evidence base existed. Examples of ongoing public educational debates are easily identified in Australia, most notably the perpetual literacy wars. In a rational world, so the report seems to suggest, such debates could become a thing of the past if only we had strong enough research to settle them. To me, this is a laudable goal.

However, such a position is naïve in its assessment of why these debates are in fact ongoing, and more naïve in proposing recommendations that barely address any but the most simplistic reasons for the current situation. For example, whatever the current state of literacy research, the report itself demonstrates that the major source of these debates is not actually the research that government-directed policy agents decide to use and interpret, but the simple fact that there is NO systemic development of research-informed policy analysis in Australia which is independent of government itself.

The introductory justification for this report, based loosely on a weak analysis of a small slice of available international comparative data, demonstrates clearly how government-directed research works in Australia.

As an editor of a top-ranking educational research journal (the American Educational Research Journal), I can confidently say this particular analysis would not meet the standards of our highest-ranked research journals because it is apparently partial, far from comprehensive and lacking in its own internal logic. It is a very good example of the very sort of research use the report claims to want to move away from.

The nature of the needed research

The report makes much of the need for research which tests causal claims (claims of the form “A was a cause of B”), placing high priority on experimental and quasi-experimental designs. This portion of the report simply sums up arguments about the need for the type of research in education promoted as ‘gold standard’ more than a decade ago in the USA and UK. This argument is, in part, common sense. However, it is naïve to presume that such research will provide what policy makers in Australia today need to develop policy.

Comparisons are made between research in education and research in medicine for a variety of sensible reasons. However, the implications of that comparison go largely unrecognised in the report.

If Australia wishes to develop a more secure national evidence base for educational policy akin to that found in medicine, it must confront basic realities which most often are ignored and which are inadequately understood in this report:

a) the funding level of educational research is a minuscule fraction of that available to medicine,

b) the range and types of research that inform medical policy extend far beyond anything seen as ‘gold standard’ for education, including epidemiological studies, program evaluations and qualitative studies relevant to most medical practices, and

c) the degree to which educational practices are transportable across national and cultural differences is far less than that confronted by doctors whose basic unit of analysis is the human body.

Just at a technical level, while the need for randomised trials is identified in the report, there are clearly naïve assumptions about how that can actually be done with statistical validity that accounts for school-level error estimation and the consequent need for large samples of schools. (Individual-level randomisation is insufficient.) Thus, the investment needed for truly solid evidence-based policy research in education is dramatically under-estimated in the report and in most public discussions.
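To give a sense of the scale involved, here is a minimal design-effect sketch under assumed values (an intra-class correlation of 0.2 and 50 students per school, neither of which comes from the report). Randomising whole schools inflates the required sample many times over.

```python
# A minimal sketch of the 'design effect' for school-level randomisation:
# 1 + (m - 1) * ICC, where m is the cluster size. All values are assumptions.
students_per_school = 50
icc = 0.20                          # assumed intra-class correlation

design_effect = 1 + (students_per_school - 1) * icc
n_individual = 400                  # illustrative sample if students were randomised
n_clustered = n_individual * design_effect
schools_needed = n_clustered / students_per_school

print(f"design effect: {design_effect:.1f}")
print(f"students needed: {n_clustered:.0f}, spread across roughly {schools_needed:.0f} schools")
```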

The capacity needed to produce that research

The report does well to identify a substantial shortage of Australian expertise available for this sort of research, and in the process demonstrates two dynamics which deserve much more public discussion and debate. First, there has been a trend towards relying on disciplines outside of education for the technical expertise of analysing currently available data. While this can be quite helpful at times, it is often fraught with the problems of invalid interpretations, simplistic (and practically unhelpful) policy recommendations which fail to take the history of the field and its systems into account, and over-promising the future effects of following the policy advice given.

Second, the report dramatically fails to acknowledge that the current shortage of research capacity is directly related to the manner and form of higher education funding available to do the work needed to develop future researchers. There is the additional obvious issue of a lack of secure career development in Australia for educational researchers. This, of course, is directly related to the previous point.

Audience of evidence-based policy research

While the report is clearly directed at developing solid evidence for policy-makers, it understates the need for that research also to provide sufficient reporting to a broader public as part of the policy-making process. By necessity this involves the development of a much larger dissemination infrastructure than currently exists.

At the moment it would be very difficult for any journalist, much less any member of the general public, to find sound independent reports of larger bodies of (necessarily complicated and sometimes conflicting) research written for the purpose of informing the public. Almost all of the most independent research is either not translated from its scholarly home journals or not readily available due to restrictions in government contracts. What is available publicly, and sometimes claims to be independent, is almost always conducted with clear and obviously partial political and/or self-interest.

The reason this situation exists is simply that there is no independent body of educational research apart from that carried out by individual researchers in projects supported by the independent funding of the ARC (and that funding is barely sufficient for its current disciplinary task).

Governance structure needed to produce research that is in the public interest

Finally, I think perhaps the most important point to make about this report is that it claims to want to develop a national evidence base for informing policy, yet the proposed governance of that evidence and research sits entirely under the same government strictures that currently limit what is done and said in the name of educational policy research in Australia. That is, however much there is a need to increase the research capacities of the various government departments and agencies which currently advise government, all of those are beholden to restrictive contracts, or rely on individuals who are obliged not to open current policy to public criticism.

By definition this means that public debate cannot be informed by independent research under the proposed governance for the development of the proposed national evidence base.

This is a growing trend in education that warrants substantial public debate. With the development of a single curriculum body and a single institute for professional standards, all with similarly restricted governance structures (just as was recently proposed in the NSW review of its Board of Studies), the degree to which alternative educational ideas, programs and institutions can be openly developed and tested is becoming more and more restricted.

Given the report’s desire to develop experimental testing, it is crucial to keep in mind that such research is predicated on the development of sound alternative educational practices which require the support of substantial and truly independent research.

 


 

James Ladwig is Associate Professor in the School of Education at the University of Newcastle and co-editor of the American Educational Research Journal.  He is internationally recognised for his expertise in educational research and school reform.

Myth buster: improving school attendance does not improve student outcomes

Does improved student attendance lead to improved student achievement?

Join prime ministers, premiers and education ministers from all sides of politics if you believe it does. They regularly tell us about the need to “improve” or “increase” attendance in order to improve achievement.

We recently had unprecedented access to state government data on individual school and student attendance and achievement in over 120 schools (as part of a major 2009–2013 study of the reform and leadership of schools serving Indigenous students and communities), so we decided to test the widely held assumption.

What we found is both surprising and challenging.

The overall claim that increased attendance is linked with improved achievement seems like common sense. It stands to reason that if a student attends more, s/he is more likely to perform better on annually administered standardised tests. The inverse also seems intuitive and commonsensical: if an individual student doesn’t attend, s/he is less likely to achieve well on these conventional measures.

But sometimes what appears to make sense about an individual student may not factually hold up when we look at the patterns across a larger school or system.

In our research we were studying background patterns in attendance and achievement using very conventional empirical statistical analysis. What we found first up was that, whatever else we may hope, school-level attendance rates generally don’t change all that much.

Despite officially supported policies and high-profile school and regional foci, schools making big improvements in attendance rates are the rare exception.

Further, we found that the vast majority (around 76%) of the variation in school attendance levels is empirically related to geographic remoteness, the percentage of Indigenous kids, and levels of socio-economic marginalisation. These are matters that, for the most part, are beyond the purview of schools and systems to change. Most importantly and most surprisingly, we found there is no relationship between school attendance and school-level NAPLAN results. This is the case whether you are looking at overall levels and rates of change or at the achievement of specific groups of Indigenous and non-Indigenous students.

The particular policy story that improved attendance will improve or bootstrap conventional achievement has no basis in fact at a school level. The policy-making and funding that goes into lifting attendance rates of whole schools or systems erroneously assumes that improvements in achievement by individual students will logically follow.

The bottom line is you can’t simply generalise an individual story and apply it to schools. The data shows this.

Further, and this is important in current reform debates, we observed that the very few schools with high percentages of Indigenous children that increased both attendance and achievement had also implemented significant curriculum and teaching method reforms over the same period.

In other words, attending school may or may not help generally, but improving achievement depends on what children do once we get them to school.

In our view, there is no short cut around the need for substantial ongoing reform of curriculum and teaching methods and affiliated professional development for teachers. Building quality teaching and learning relations is both the problem and the solution – not attendance or testing or accountability policies per se.

 


James Ladwig is an Associate Professor in the School of Education at the University of Newcastle and Adjunct Professor in the Victoria Institute of Victoria University.  He is internationally recognised for his expertise in educational research and school reform.

Allan Luke is Emeritus Professor in the Faculty of Education at the Queensland University of Technology and Adjunct Professor in the Werklund School of Education, University of Calgary, Canada, where he works mentoring First Nations academics. He is an educator, researcher, and theorist studying multiliteracies, linguistics, family literacy, and educational policy. Dr Luke has written or edited over 14 books and more than 140 articles and book chapters.

Here is the full article: Does improving school level attendance lead to improved school level achievement? An empirical study of indigenous educational policy in Australia.