Grading problems

Notes on: Anderson, L. W. (2018). A critique of grading: Policies, practices, and technical matters. Education Policy Analysis Archives, 26(49). http://dx.doi.org/10.14507/epaa.26.3814 This article is part of the Special Issue, Historical and Contemporary Perspectives on Educational Evaluation: Dialogues with the International Academy of Education, guest edited by Lorin W. Anderson, Maria de Ibarrola, and D. C. Phillips.

Dave Harris

I am very grateful to an AI review of one of my Academia.edu papers for suggesting this reference!

The central questions are why we grade students, what grades mean, how reliable grades are, how valid they are, and what the consequences of grading students are. The results suggest that there are several purposes for grading students and that the way grades are assigned and reported should be consistent with them, but that grades mean different things to different people, including the teachers who assign them. Grades on a single task are 'quite unreliable' (2) while cumulative grades are 'reasonably reliable'. The validity of grades on a single task 'is virtually impossible to determine' while 'cumulative grades are reasonably valid'. Grades influence student affective characteristics like self-esteem but 'their influence is no greater, nor less than, a host of other school-related factors'.

There is a general negativity about student grading and several criticisms have been published. Academics, teachers and consultants have all criticised it over the past 50 years, and the criticisms remain. Grading requires a careful critique. Anderson defines a grade as a position on a continuum of 'quality, proficiency, intensity or value', expressed as a percentage, by letters or by verbal descriptors. Grades are often used synonymously with marks [bad mistake].

Grades might be used to motivate students, and to record shortcomings so that teachers can modify instruction. There might be a third reason, to communicate information to a variety of audiences. As motivators, grades do work, if competition among students works and if the ratings and rankings work, but this might still be 'the "wrong" kind of motivation' (4), not synonymous with interest in learning, and possibly leading to 'gamesmanship' (citing Schwartz and Sharpe 2011). There is also a summary in Schinske and Tanner (2014): grading can motivate high-achieving students to continue getting good grades whether or not this overlaps with learning, but it can also lower interest in learning and enhance anxiety and extrinsic motivation. In terms of feedback, grading normally '"registers relative standing… [It does]… not provide for treatment"' [citing a really early source here, which Anderson claims is still valid: a grade should really provide information about what students have not learned, although standards-based grading claims that it does provide enough detail, since there are many objectives attracting separate grades. Even so, there is no information about how to change instruction {and there is an assumption that it is possible to provide individual instruction}]. When aggregated, grades might inform the teaching of an entire class, and might help compare student performance between classes.

Communication with other audiences is important, although the list of 'relevant' audiences seems to be growing. Teachers who need information about students entering their purview in subsequent years are one relevant audience. So are policymakers who want to level the playing field or deal with students transferring into their aegis, or who may be providing state-supported scholarships [US]. The media are occasionally interested as well, in grading, grade inflation and the ease of failure.

There may be a structural tension between '"what promotes learning [and] what enables a massive system to function"' (6), reflected in the interests of the different audiences. An obvious answer is written reports or narratives [profiles], but teacher interest in qualitative data means aggregation is difficult. Selective universities express this tension best: there is a trend to stress qualitative means like interviews and essays in admissions, while the media are more interested in mean SAT scores.

There ought to be some standard notion of the merit a grade represents, but teachers do seem to vary. Early reports suggested that what counted was sometimes memorising what was taught, sometimes critical analysis, sometimes handing good work in on time, and sometimes some other quality standard. Generally, grades may represent performance on single or multiple tasks, achievement at a single point in time or over time, achievement only or effort and attendance as well, achievement of learning outcomes or achievement in comparison with peers. A series of grades requires data aggregation. Grading improvement is particularly difficult, and might penalise high-achieving students, who have little chance to demonstrate improvement, although meritorious achievement might be more valuable. It is hard to isolate academic achievement, and teachers find it particularly hard to focus on achievement only anyway.

'Virtually all grading systems in the early 20th century were norm referenced', but in 1963 there was a move to criterion-referenced measurement. This has not brought standardisation 'and, quite likely, never will' (8). Instead of trying to work towards standard grades, it might be 'more reasonable' to openly acknowledge 'contextual or situational specific nature of grades', getting teachers to explain what they mean. This has been done [in the form of classic marking criteria], and these can sometimes be made explicit and combined with a contract: students sign up to aim for a particular grade, and the teacher commits to award that grade if the agreed levels are achieved. There can be generic or specific contract systems.

The reliability of grades varies. Single task grades are assessed in terms of interrater reliability, and most studies refer to percentage grades. Again, early studies showed a considerable range in the grades awarded, especially if teachers were marking without an agreed scale. Teachers showed themselves to be inconsistent even when marking the same work at different times. Sources of unreliability identified early on included the inability of teachers to distinguish close degrees of merit, different criteria used by different teachers, differences in quality standards, and differences in the way teachers distributed their grades. Early work [1918!] suggested that five divisions could be '"handled accurately by teachers"' (10), leading to the letter grades which are still popular today. Further suggestions included a rubric for evaluating students' work, with criteria and descriptions of quality standards, covering things like 'spelling, mechanics… sentence construction… ability to reason from premises to conclusion and "ability to present the argument effectively, that is with tact and force"' [quoting from a piece of work in 1915].

Rubrics are still popular, but modern studies still raise doubts, like one in 2011 that supplied training but still produced grades ranging from 50 to 96 for a single essay (11). Exemplars or ideal papers might help, although this was attempted early on as well, in 1917. So was an attempt to suggest that answers would fall on some sort of accepted curve distribution, that there would be '"five approximately equal steps of ability"', with particular percentages falling into each step, such as 4% in excellent and 4% in failure. This was 1914, and 'educators' belief and faith in the normal distribution continued through much of the 20th century'.

Actual teachers did not distribute grades like this, however, either in 1918 or in 1994. In 1998 a longitudinal study by the US office of research showed that 'almost 70% of eighth grade students in their national sample reported receiving "mostly A's" or "mostly B's"'.

The GPA in the USA is the primary cumulative grade, an aggregate of all the individual task grades across the courses for a particular semester or for an entire career. 'Typically, an A grade is worth four points, B grades were three points, and so on' (12). Studies tend to show a 'stability of GPAs over courses and over time' (12) [universities are much better at imposing standard distributions, or were].
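The aggregation can be sketched in a few lines. This is only an illustration under stated assumptions: the source gives A = 4 and B = 3 'and so on', so the values for C, D and F below follow the common US convention, and weighting by course credits is an assumption not mentioned in the article. The function name `gpa` and the sample semester are hypothetical.

```python
# Assumed 4-point scale. The source gives only A=4 and B=3 "and so on";
# the values for C, D and F follow the common US convention.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def gpa(courses):
    """Credit-weighted mean of grade points over (letter, credits) pairs.

    Weighting by credits is an assumption; with equal credits this
    reduces to a plain average of the grade points.
    """
    total_credits = sum(credits for _, credits in courses)
    if total_credits == 0:
        raise ValueError("no credits to average over")
    total_points = sum(GRADE_POINTS[letter] * credits for letter, credits in courses)
    return total_points / total_credits


# Hypothetical semester: three courses with made-up credit values.
semester = [("A", 3), ("B", 3), ("C", 4)]
print(gpa(semester))
```

The same function could be applied to a whole career simply by passing every course taken, which is the sense in which the GPA is cumulative.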

A study in 2012 at the University of Missouri collected GPAs over several semesters and calculated alpha reliability coefficients across different groups of semesters, to measure the percentage of variance that can be attributed to differences among students rather than semesters [meaning that 'the larger the coefficient the more reliable the GPAs are over time']. Alpha coefficients were 0.7 for 2 semesters, 0.8 for 4, 0.8 for 6 and 0.9 for 8 semesters.
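For readers unfamiliar with the coefficient, here is a minimal sketch of the standard Cronbach's alpha formula, treating each semester as an 'item' and each student as a row. Whether the Missouri study computed alpha in exactly this way is an assumption on my part, and the GPA values below are invented purely for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table with one row per student and one
    column per semester: k/(k-1) * (1 - sum(item variances) / variance(totals))."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two students and two semesters")
    # Variance of each semester's GPAs across students.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Variance of each student's summed GPA across students.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Invented semester GPAs for four hypothetical students (not data from the study).
gpas = [
    [3.8, 3.6, 3.9, 3.7],
    [3.0, 3.2, 2.9, 3.1],
    [2.2, 2.5, 2.4, 2.1],
    [3.4, 3.1, 3.3, 3.5],
]
print(round(cronbach_alpha(gpas), 2))
```

When students keep roughly the same standing from semester to semester, most of the variance in the totals comes from differences among students and alpha approaches 1.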

So how do cumulative grades become reliable when they aggregate unreliable single task grades? Task grades may be unreliable in terms of being consistent across teachers, but there may be patterns detectable among students nevertheless [I think: in the example 'even with the lack of agreement across teachers on each individual student's essay… It is quite clear that the teachers consistently favour student A over student B'. In other words, they get the rank order right] [individual variations randomise out?].
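The 'randomising out' idea can be checked with a small simulation. This is a sketch under my own assumptions, not anything from the article: each task grade is modelled as a student's stable ability plus independent random marker noise, and reliability is estimated as the correlation between two independently graded sets of tasks. All names and parameter values are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def simulate_reliability(n_tasks, n_students=200, ability_sd=10.0,
                         noise_sd=15.0, seed=0):
    """Correlation between two independent cumulative grades per student,
    each the mean of n_tasks noisy task grades (assumed model)."""
    rng = random.Random(seed)
    abilities = [rng.gauss(70.0, ability_sd) for _ in range(n_students)]

    def cumulative(ability):
        # Each task grade = stable ability + independent marking noise.
        return statistics.fmean(
            [ability + rng.gauss(0.0, noise_sd) for _ in range(n_tasks)]
        )

    first = [cumulative(a) for a in abilities]
    second = [cumulative(a) for a in abilities]
    return pearson(first, second)


for n in (1, 5, 20):
    print(n, round(simulate_reliability(n), 2))
```

Under this model the noise in individual grades averages away as more tasks are aggregated, while the differences among students remain, so the cumulative grade recovers the rank order much more consistently than any single task does.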

The question of validity is more difficult. There are different types of validity and different threats to validity. If students who learn more get better grades, the grades are descriptively valid. If they go on to be more successful later, the grades are predictively valid. This sort of validity arises from cumulative grades, not single task grades.

Threats to validity arise from differences in the grades awarded in different schools, especially if those schools have 'radically different student populations'. The longitudinal study took students in a nationally representative sample and asked about their grades, then compared students in 'high poverty schools' and in 'more affluent schools' (14) and found a difference [indicating grade inflation: students in high-poverty schools who got mostly A's had the same reading scores as students in affluent schools who got C and D, and similarly with maths, where A students in high-poverty schools 'most closely resembled the scores of D students in the more affluent schools'].

Anderson separates out grade inflation as a general tendency affecting higher academic grades. Higher grades in themselves do not show inflation, only higher grades that are somehow not deserved. It might be the result of capping grades, leading to a greater concentration at the top, compressing grades and reducing their value as an indicator.

Descriptive or concurrent validity looks at the relation between cumulative grades and [separate] test scores, assuming that test scores reflect achievement and that those with higher test scores have learned more. In general, correlations between grades and test scores range from 0.3 to 0.7. Correlations increase if subject-specific tests are designed and aligned with specific courses.

In terms of predicting success in subsequent institutions, say comparing high school averages with college GPAs, we have 'quite positive' (15) results. High school GPA is the strongest predictor of college grades, with correlation coefficients from 0.35 to 0.55. These increase when we allow for other differences, and apparently over the course of the college career. One large study says high school grades are more important than race, gender, special education placement, free lunch status and standardised test scores. [Important evidence for the warming up effect of high schools].

The consequences of grading students can be important and can affect the quality of student lives positively and negatively. The negative ones tend to be emphasised, but there are lots of other negative factors in schools as well, including boring teachers and pupil cliques. The negative effects of grades on students can be cumulative, one study shows, because students tend to internalise low grades. However, Anderson's own view is that if grades are explained and administered in a fair and impartial way, their effects can improve.

Grading must be fully discussed as a part of improving the education system, along with improving the curriculum, the training of teachers, personalising learning and so on. The integrity and fairness of the grading system must be exemplary: tasks could be representative of learning outcomes, quality standards made consistent, students given sufficient information about the basis for the grades, and a variety of audiences asked to provide input. Teachers are still notoriously lacking in consistency, although students have more consistent views and like teachers who follow the guidelines, use reliable information, and avoid being influenced by irrelevant factors or giving ambiguous or unclear explanations. Reform is especially important for students with special needs. We should consider grading for different purposes and for different audiences: parents might want different information from the lawyers, as a survey on page 19 indicates.

This should carry on into teacher training and certification. Anderson calls for continued research into grading policies and practices, rather than focusing on 'the comfort of Op Ed pieces' (21), which can advocate particular approaches or the elimination of grading altogether.