The former TES journalist writes for NAHT on current education issues. The views expressed do not necessarily reflect those of NAHT
Has Ofqual misled the public? Part 1
The comment, from a senior exam board official in a paper dated July 30th as problems with this year’s GCSE English awards were emerging, was clear.
It said: “If asked by centres or the press to explain the rise in controlled assessment [grade] boundaries, the rationale has to be based on [examiners’] qualitative judgements of work seen, not on a statistical fix.”
Nearly three months on, the central question remaining to be answered over this year’s grading controversy is whether, in fact, there has been a “statistical fix”, and if there has - with some kind of “fix” almost inarguable now - the nature and defensibility of it. To put it another way, have all pupils been treated fairly or not?
For this blog, I’m going to draw on correspondence between Ofqual and the boards presented in a Freedom of Information request which has drawn coverage in the TES http://bit.ly/QyivGj, on a subset of those documents which were reported last month on Newsnight http://bbc.in/Pnrwkp and on a report by the Welsh Assembly Government http://bit.ly/TJrJC2 .
The picture these documents present means that Ofqual’s explanation so far as to what has happened is very vulnerable, I think.
I believe it’s vulnerable in two ways, the first of which relates to the substantive issue of whether pupils have been treated unfairly; the second relating to whether Ofqual, in its initial report on the issue, was at all transparent as to its own role in how this controversy developed.
In this blog, however, I only want to discuss the first issue. I’m likely to come back to the second in another piece next week.
The substantive point centres on one argument, deployed by Ofqual as probably the central justification for not re-grading any GCSE English papers this year.
The bare publicly-discussed outline of what happened this year is reasonably clear, and uncontested. As GCSE results were published in late August, some schools began to pick up on a plummeting in their English results, with the knock-on effect for many of them that the headline statistic by which institutions are judged – the proportion of pupils achieving five or more A*-Cs, including English and maths – would fall, too.
Schools began to notice that grade boundaries, particularly at the C/D borderline, had shifted upwards between January, the previous time written exams had been sat, and June. Perhaps most suspiciously, some grade boundaries for controlled assessment tasks – for readers outside of secondary schools, this is coursework sat in exam conditions – had also changed from January to June despite pupils submitting the same controlled assessment assignments in June as in January.
At the same time, Ofqual had introduced a system of statistical checking of exam boards’ suggested results, known as “comparable outcomes”, which, as discussed several times now on this blog, was designed to ensure that roughly the same proportions of pupils achieved each grade as in previous years.
The suspicion from the start, in a repetition of concerns which marred the first grading of new-style A-levels 10 years ago, has been that Ofqual and the exam boards had to make it harder for pupils to achieve a C grade in papers and controlled assessment taken at the end of the course in June than in January because this was the only way they could hit what essentially were a set of statistical targets for results overall.
In other words, the claim from schools was that, having given out too many good grades earlier, the boards had to toughen up their grading at the end. Some pupils sitting papers at the end were, then, victims of that “statistical fix”: treated harshly by boards – sometimes under severe pressure to do so from Ofqual – with each board having to ensure the end numbers came out right, no matter what.
Ofqual’s argument, in its initial report on GCSE English http://bit.ly/RFl1Nq , is that there was no fix. It has said that grade boundaries were set too low in January, effectively making it too easy to get a C grade in particular, but that they were about right in June. So, although some candidates sitting exams and having controlled assessment graded in January may have “got lucky”, those getting graded in June were not disadvantaged by any earlier mistakes in the grading process: they got what they would have got in any other year and thus “got the grades they deserved”. The results for GCSE English as a whole were also, Ofqual said, about right.
So what’s the new evidence on this?
Well, I’ve looked first at a very detailed Freedom of Information request http://bit.ly/Ou26WF to Ofqual. This has had some coverage in the TES already. But this was more around the boards’ belief about whether Ofqual was the right organisation to conduct an inquiry into events this summer, which I want to discuss in my next blog.
What interested me first, here, is whether, actually, we do have a “statistical fix”.
On reading the response to the FOI request highlighted by the TES (note 1) I decided to start to read through the correspondence it produced in reverse date order. And I immediately came upon something which deepened my suspicions that some statistical jiggery-pokery might have been going on.
Among the earliest pieces of correspondence are emails from the exam boards to Ofqual reporting on how the January grading session for GCSE English had gone.
The Welsh Joint Education Committee (WJEC) board – the second largest for GCSE English – wrote, in a form sent on March 13th, about its uncertainty as to where to set grade boundaries for English and English language papers taken in January.
It said: “The awarders [senior examiners] had some difficulty in reconciling the standard that they would expect at unit [individual paper] level with boundary values that would lead to outcomes closer to those suggested by the statistics.
“Final decisions reflected a degree of caution and a desire not to set unacceptably low boundary values (from a standards perspective).”
I think this is saying that mathematical modelling was telling the board to set grade boundaries at particular points on these papers to enable it to hit its statistical target for how many of each grade to award by the end of the course – with Ofqual insisting that overall grades be comparable with the last full year of the previous version of this GCSE, in 2010 - but that examiners were struggling to reconcile that with the work they saw completed by pupils.
The most interesting point came next, however. The board said: “There remains 60 per cent of the course (two internally assessed units) to be awarded for the first time in June 2012 which allows some flexibility in bringing 2012 subject outcomes in line with those of 2010.”
This seemed to me to be the board saying to Ofqual: “it’s not the end of the world if the grading is not perfect in January, as we can always adjust grade boundaries of internally-assessed work submitted for grading in June in order that we hit our comparable outcomes targets in the end”. In other words, “we can toughen things up for the final modules if we need to in order to hit the overall statistical targets required by the comparable outcomes policy”. Is this, in fact, in line with what happened in the end?
Similarly, forms filled out and sent to Ofqual on March 9th by AQA, by far the largest board for GCSE English, include a statement that, while there was some uncertainty as to where to set controlled assessment grade boundaries in January 2012, in the end a decision was taken in the knowledge that there was “room for adjustments to be made in June 2012, should this be necessary to protect subject-level standards next summer”. In other words, the grade boundaries, even for controlled assessment, could be moved in June 2012 to allow for the overall proportion of grades in the whole course to be in line with previous years.
There is other evidence which might support a hypothesis of a statistical “fix” elsewhere in this dossier. On 9th August, the Welsh board produced a document setting out possible grade boundary changes in response to a letter the previous day from Dennis Opposs, director of standards and research at Ofqual, which had asked them to “review” the C grade boundary “in order to produce outcomes which are much closer to [statistical] predictions”.
The WJEC paper sets out three possible options which include changes of the number of marks needed for a C grade. These would only increase the number of marks needed by a mark or two. However, because of the number of pupils with marks clustered around these scores, those small changes could make a big difference to the numbers achieving C grades overall. Ofqual eventually accepts the first option WJEC puts forward, with grade boundaries raised by a single mark in three papers.
The WJEC paper seems to be accepting that the need to raise the grade boundaries in individual papers is driven by the need to produce a result overall – in terms of the numbers of pupils gaining each grade across the GCSE as a whole – which Ofqual feels is comparable with previous years, in admitting that “only dramatic grade boundary changes [in individual papers] affect the [overall] outcomes”.
Finally, a letter from Glenys Stacey, chief executive of Ofqual, to Russell Hobby dated 25th August (after the controversy had broken) seems to include an implicit admission that the structure of these new qualifications can lead to the necessity of what this observer could describe as a “statistical fix”, centring on grade boundaries in modules which have yet to be graded.
Ms Stacey writes: “The key difficulty posed by modular qualifications is that when the qualification awards come to be made, some modules have already been awarded, so the scope for adjusting grade boundaries so that standards are set correctly is reduced.”
This is really very clear, I think. Ms Stacey is describing the structure of the current GCSE, whereby pupils can be graded for modules they take part-way through the course, as well as at the end. For a GCSE course which ran, for example, as the one at issue here did, from September 2010 to June 2012, exam boards would have been handing out grades in January and June 2011 and in January 2012, as well as in final papers taken in June 2012.
But, for the first three of those sessions, they would have to hand out the grades without knowing the standard of work or entry profile of those modules which were to be graded in June 2012.
Yet Ofqual’s “comparable outcomes” system – which seeks to ensure roughly the same number of good grades on the course overall are handed out from year to year, all other things being equal (as discussed in previous blogs) to control grade inflation – effectively means that the boards have very limited room for manoeuvre when it comes to ensuring that they do what Ofqual wants them to do.
Because they had already handed out a certain number of good grades in grading sessions before June 2012, and they couldn’t take these decisions back from pupils, if they wanted to make adjustments to grade boundaries in order to hit what are their overall statistical targets for the qualification as a whole, the only grade boundaries they could change, in this case, were those in papers taken in June 2012.
The risk, and the central charge, perhaps, of this controversy, is that the boards – under pressure in certain cases from Ofqual – gave too many good grades away in the early sessions and therefore had to over-compensate by making grade boundaries much higher in the later ones. Importantly, a central argument of those now contesting this process might be that the later grade boundaries were not only set to make them harder for pupils to clear than had been the “generous” ones of earlier papers, but they were even harder to reach than they would have been had Ofqual and the boards known from the start what the entry and achievement patterns for all modules were going to be. In other words, those completing later modules were the victims of grade boundaries being made extra-high because they were set too low in earlier modules.
If this analysis is correct, pupils taking the later papers could be the victims of a “statistical fix”: the need to keep results at a certain level overall.
I wanted to pull back from this, now, to try to offer a very simple mathematical analogy.
Imagine a scenario where I have been set a task: to give out some oranges to two groups of people. I have a box containing oranges to give out, and, at the start, I know that I’m going to be allowed to give out roughly 12 oranges. But I don’t know precisely how many, and I’m not going to be told how many I can give out until I’ve already given the first group of oranges.
So suppose the first group of people comes along. There are seven people in this group and, knowing I have to keep to around 12 oranges in total, I don’t want to give them two each. So I decide to give them one each. I know this comes to seven oranges given out to this first group, and I know that’s slightly more than half of the 12, but I decide it’s the best thing to do at this point.
Then I’m told that, actually, the number of oranges given out in total should be set at precisely 12, even though there are more than that in the box. And then another seven people arrive and want to be given oranges. What do I do?
Perhaps, as scenario A, I should argue with the person who controls the number of oranges to be given out in total that the fairest thing to do would be to give this second group seven oranges between them. For that is what the first group had, and I need to treat both groups fairly and equally. If I take this course, I need to argue firmly with the controller of the oranges that this is the only fair course of action.
Or maybe, as scenario B, I look back at what I’ve done and think that I’ve messed up by deciding to give the first group seven oranges. Looking back, if I had known that there was a fixed limit of 12 oranges, and if I had known that the second group were going to turn up with seven people, I should have viewed the two groups as identical and split the number of oranges down the middle. So I should have handed out six oranges to each group. I made a mistake, handing out seven oranges to the first group. However, I could say to the second group: “here, have six oranges. This may not be exactly what the first group had, but it is what you would have got, if I had divided the oranges out with the benefit of hindsight. So you may have lost out relative to the first group, but you have not lost out compared to what I think is a ‘fair’ outcome, so you should be happy.” This, though, would still leave me having given away a total of 13 oranges, which is one more than I’m supposed to give away.
Or consider the third scenario, scenario C. I decide that I clearly made an error – knowing what I know now – in giving seven oranges to the first group. But I consider it absolutely imperative that I hit my target and don’t give out more than 12 oranges overall. So I say to the second group: “I’m sorry, I can only give you five oranges. I know that isn’t very fair, but I made a mistake with my first allocation of oranges, and I absolutely can’t give out more than 12, so you have to pay the price of that. Sorry, but that’s life.”
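For readers who like to see the sums laid out, the three scenarios above can be sketched in a few lines of Python. This is only an illustration of the orange analogy, not a model of any board's actual awarding process, and the names used (TARGET_TOTAL, second_group_allocation, summarise) are invented purely for the purpose.

```python
# Illustrative sketch of the orange analogy only; all names are hypothetical.
TARGET_TOTAL = 12       # the fixed limit, announced after the first hand-out
FIRST_GROUP_GIVEN = 7   # already handed out and cannot be taken back


def second_group_allocation(scenario: str) -> int:
    """Return how many oranges the second group receives under each scenario."""
    if scenario == "A":
        # Match the first group's treatment, whatever the target says.
        return FIRST_GROUP_GIVEN
    if scenario == "B":
        # Give what an even split of the target would have produced in hindsight.
        return TARGET_TOTAL // 2
    if scenario == "C":
        # Give only what is left once the target is treated as absolute.
        return TARGET_TOTAL - FIRST_GROUP_GIVEN
    raise ValueError(f"unknown scenario: {scenario}")


def summarise(scenario: str) -> dict:
    """Total handed out, and by how much that total exceeds the target."""
    second = second_group_allocation(scenario)
    total = FIRST_GROUP_GIVEN + second
    return {
        "second_group": second,
        "total": total,
        "over_target": total - TARGET_TOTAL,
    }


for name in ("A", "B", "C"):
    print(name, summarise(name))
```

Only scenario C lands exactly on the target, and it does so by giving the second group fewer oranges than either the first group received or the hindsight "fair share" would suggest.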
OK, well I think Ofqual has so far offered a version of scenario B in its explanation of what has happened. But, as you might have guessed by now, there are question marks over whether it really adds up, in more ways than one. If the real answer is closer to scenario C, then it will look like there has been a clear injustice to some pupils.
So let’s go through the three scenarios again. There is a case, as I’ve argued previously, http://bit.ly/PBRwgr that a situation analogous to scenario A should be followed in relation to GCSE English. If the boards did really make a mistake in making it too “easy” for pupils to get good marks in earlier modules, there is a case, in terms of fairness, for holding that standard for the later papers so as not to disadvantage one set of candidates against another. If this means handing out a greater number of A*-C grades in total, so be it, would be the argument: the fairness to pupils (within a single academic year) principle trumps all else.
But Ofqual has said this is not possible, saying only that those pupils (arguing that it is relatively few) who benefited from relatively “easy” grade boundaries in the earlier modules got “lucky”, but that the most important thing is that the standard of the exam was set correctly for the later modules.
As discussed, I think this is a version of scenario B. But this explanation has puzzled me a bit since Ofqual used it in its initial report on this controversy in August.
Ofqual’s argument has been: the standard of papers was set to be slightly too easy in modules taken before June 2012. It was about right in June 2012. And the standard of the exam – the number of pupils gaining good grades overall – was about right.
The thing that puzzled me is that the number of pupils gaining good grades overall essentially comes from adding up marks on modules taken before June 2012, and in June 2012. And Ofqual has said the number gaining good marks before June 2012 was too high; that the number doing so in June 2012 was about right; and that the number gaining good grades overall (summing results from June and before June) was about right. For me, the maths of this does not add up.
It’s as if, in scenario B above, I’d said: “I’ve given out seven oranges to the first group. Oh well, they just got lucky. Then I’ve given out six oranges to the second group. That’s OK: this was what they should have had. And in total, I’ve only given out 12 oranges: exactly as I should have done.”
Yes, the maths does not add up. If you think too many good grades were given out early on, and the right number were given out in June, then the logic of this is that you should also think too many good grades were given out in total.
Ok, it is certainly possible to argue that my scenarios were overly simplistic. What about if, instead of 12 oranges to give out, I know I am going to have roughly 120, but far more groups to distribute them amongst? Say I still have to make a decision, after the first group turns up, as to how many oranges to give it, and that only after this first group has been given oranges will I know how many other groups I will have to distribute oranges to, how many people will be in each of the following groups and exactly how many oranges I will have to distribute overall.
Suppose, again, that seven people turn up in this first group, and I also know there will be roughly 20 groups overall. I decide, again, to give this group seven oranges. Then I’m told that there are exactly 120 oranges to give out overall, and at the same time another 19 groups of people turn up, each with seven people in them.
I could then give the remaining 19 groups six oranges each, arguing that this is the fairest thing to do, and that it was how I would have divided the oranges among all 20 groups with the benefit of hindsight. Yes, could be the argument: the first group did get lucky. But they represent a small fraction of the overall number of people being given oranges. The other 19 groups got exactly the right number of oranges. And, although the total number of oranges given out, at 121, is one more than it should have been, in the grand scheme of things this is a very small excess, and it really doesn’t mean we should say that the wrong number of oranges was given out, since 121 is very close to 120.
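The arithmetic of this larger version can be checked with a short Python sketch. As before, it illustrates the analogy only, the variable names are invented, and the last figure (what each later group would average if the 120 limit were treated as absolute, as in scenario C) is simply derived from the numbers above rather than taken from any document.

```python
# Illustrative sketch of the scaled-up orange analogy; names are hypothetical.
GROUPS = 20             # total number of groups, known only after the first hand-out
TARGET_TOTAL = 120      # exact limit, also announced after the first hand-out
FIRST_GROUP_GIVEN = 7   # oranges already given to the first group

later_groups = GROUPS - 1

# Scenario B: every later group gets the hindsight "fair share".
hindsight_share = TARGET_TOTAL // GROUPS             # 120 / 20 = 6
scenario_b_total = FIRST_GROUP_GIVEN + later_groups * hindsight_share
overshoot = scenario_b_total - TARGET_TOTAL
overshoot_fraction = overshoot / TARGET_TOTAL

# Scenario C: the limit is absolute, so later groups share what remains.
scenario_c_share = (TARGET_TOTAL - FIRST_GROUP_GIVEN) / later_groups

print("Scenario B total:", scenario_b_total)
print("Overshoot:", overshoot, f"({overshoot_fraction:.1%} of the target)")
print("Scenario C average per later group:", round(scenario_c_share, 2))
```

The smaller the share of people dealt with early, the smaller both the overshoot under scenario B and the penalty per later group under scenario C, which is why the size of the early entry matters so much to the argument.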
The above seems to be very close to Ofqual’s argument in its paper, in that it has to say that the number of pupils taking modules before June 2012 is so small, relative to the overall total, that some generosity in the awarding process at that stage is not hugely significant.
It is very difficult to be sure, even after looking at a fair amount of documentation, whether the maths of that stands up. But the alternative explanation is potentially explosive. This is that we have a version of scenario C, with pupils taking later modules being made to pay for the mistakes of Ofqual and the boards in being too generous with their grading earlier on, and with grade boundaries in June set not just above the “overly-generous” ones from before that, but higher still, by an extra amount to compensate for the earlier generosity. Allegations of this kind are what appear to have made the Curriculum 2000 A-level grading controversy of 2002 so messy. And I certainly haven’t seen definitive proof yet that this is not exactly what happened this summer. Certainly, these documents do not make me less suspicious.
There is one other very interesting piece of documentation in this FOI which is worth discussing in relation to the business of how pupils have been treated, I think.
On Saturday, August 25th, with the controversy out in the open for two days, someone from the AQA board (the name is redacted) sent an email to Glenys Stacey discussing the results and grade boundaries for the two elements of AQA’s English exams which were proving, and have proved, the most controversial.
First, it referred to the controlled assessment element, where teachers were furious that the number of marks needed to achieve a grade C rose by three. This is despite the tasks completed by pupils handing in work in January, and in June, remaining the same.
The email simply makes the point that, if the grade boundaries for June had remained at what they were in January, then overall grades at C grade or better would be up by a massive 6.5 per cent on what the statistical modelling (ie the number of C or better grades the boards thought they should be awarding, given these pupils’ prior results at key stage two) said they should be.
The second point was, if anything, however, more revealing. It discussed the decision to raise the C grade boundary for the foundation tier written paper (taken by less able pupils) from 43 marks out of 80 in January to 53 marks out of 80 in June.
The email argues that “it is not surprising that boundaries move between series [of sittings of a particular paper] given varying demands of the individual papers”. This has continued to be Ofqual’s line, including in questioning by MPs. I find it surprising that it can be said without any acknowledgement not just that the grade boundaries moved (which in principle is to be expected), but the extent of that movement. A ten mark change suggests a very large change in the difficulty level of the paper between January and June. By contrast, the Welsh government’s report on the controversy says the C grade boundaries would “typically” be expected to vary by only “up to three marks” between sittings of an individual paper.
The email goes on: “The Chair of Examiners[sic] report to me this summer is very clear that the boundary of 53 produces scripts [sic] clearly meet the grade c descriptors, the tick chart evidence [a system whereby senior examiners set provisional grade boundaries by marking ticks or crosses according to whether completed scripts given a particular mark meet the qualitative standard for a grade, or not] also clearly supports this boundary.”
The email goes on: “As you and your colleagues will know judgements in early modules of new specifications are more challenging when there is no subject level award to look at which would have been the case in the winter award this year but the tick chart looks reasonably secure for this.”
Dennis Opposs, in an email response to Glenys Stacey and Amanda Spielman, Ofqual’s chair, writes that “this is very helpful and along the lines that Amanda and I discussed yesterday”. Again, this seems to have been in line with Ofqual’s overall response in its initial report and since then: that grades were handed out too generously before June, but were about right in June and about right overall.
It looks to me as if it is a way of fitting this year’s grading to a version of scenario B above. But it also looks like an example of post hoc rationalisation. The grade boundaries for the foundation paper may be “defensible”, to use a word which features a fair bit in these documents. But would they have been set in the same place if AQA – and the other boards – were not either seeking themselves, or under pressure from Ofqual, to limit the number of A*-C grades overall? In other words, would these boundaries have been set in the same way had “too many” high marks not already been awarded earlier in this GCSE course?
I think it is an open question. For examiners’ judgements themselves would seem to leave open a degree of doubt as to where boundaries should be set. Evidence of this even comes within this document cache itself: a set of official descriptions as to the qualities scripts should display at particular grades comes with the following caveat:
“Note: in principle the grade descriptions apply to the mid-grade point. In practice, it is unlikely that there would be a material difference [ie in whether one can be sure that the quality of work matches the grade descriptor] between the boundary mark and the mid-grade point. However, it is likely that candidates at the boundary will only just meet the descriptions set out below.”
In other words, this is not a precise science. A particular grade boundary, set by a board because it is where the statistics tell it that it needs to be set in order to meet the “comparable outcomes” objectives overall, may be defensible afterwards in terms of fairly loose qualitative descriptors of pupils’ work. But students may still be being set a higher grade boundary than they would have been had examiners not been guided by calculations based on the number of grades already awarded and the need to limit the numbers overall.
Indeed, although in AQA’s case there is no record of much disagreement between examiners and Ofqual over where to set grade boundaries, in the case of the WJEC and another board, Edexcel, there was a clear disagreement, with the latter two having to be forced to set some boundaries higher than they had originally intended.
So: has Ofqual misled the public as to whether all pupils got the grades their work deserved, in effect without recourse to any “statistical fixes”? My hunch is that the answer is ‘yes’, but it is not possible to know definitively without knowing all the numbers behind the boards’ calculations.
As I say, I hope to return to Ofqual’s report on this issue, and give more thoughts on what the documents now in the public domain show but Ofqual’s official report does not, in my next blog.
Note 1: The FOI request discussed at length in this piece and covered by the TES was submitted by Antony Carpen, who I think is a former civil servant, writer and blogger.