Notes on: Anderson, L. W.
(2018). A critique of grading: Policies,
practices, and technical matters. Education
Policy Analysis Archives, 26(49).
http://dx.doi.org/10.14507/epaa.26.3814
This article is part of the Special
Issue, Historical and Contemporary
Perspectives on Educational Evaluation:
Dialogues with the International Academy
of Education, guest edited by Lorin W.
Anderson, Maria de Ibarrola, and D. C.
Phillips.
Dave Harris
I am very grateful to an AI review of one
of my Academia Ed papers for suggesting
this reference!
The central questions are why we grade
students, what grades mean, how reliable
the grades are, how valid they are, and
what the consequences are of grading
students. The results suggest that there
are several purposes for grading students
and that the way grades are assigned and
reported should be consistent with them,
but that grades mean different things to
different people, including the teachers
who assign them. Grades on a single task are
'quite unreliable' (2) while cumulative
grades are 'reasonably reliable'. Validity
of grades on a single task 'is virtually
impossible to determine' while 'cumulative
grades are reasonably valid'. Grades
influence student affective
characteristics like self esteem but
'their influence is no greater, nor less
than, a host of other school-related
factors'.
There is a general negativity about
student grading and several criticisms
have been published. Academics, teachers
and consultants have all criticised
grading for over 50 years, and the
criticisms remain. Grading requires a
careful critique. The article defines a
grade as a position on a continuum of
'quality, proficiency,
intensity or value', expressed as a
percentage or by letters or by verbal
descriptors. Grades are often used
synonymously with marks [bad mistake].
Grades might be used to encourage
students, to record shortcomings and to
enable teachers to modify instruction.
There might be a third reason, to
communicate information to a variety of
audiences. As motivators, they do work, if
competition among students works, and if
the ratings and rankings work, but this
might be still 'the "wrong" kind of
motivation' (4), not synonymous with
interest in learning, and possibly leading
to 'gamesmanship' (citing Schwartz and
Sharpe 2011). There is also a
summary in Schinske and Tanner (2014):
grading can motivate high achieving
students to continue getting good grades
whether or not this overlaps with
learning, but it can also lower interest
in learning and enhance anxiety and
extrinsic motivation. In terms of
feedback, grading normally '"registers
relative standing… [It does]… not provide
for treatment"' [citing a really early
source here, which Anderson claims is
still valid: a grade should really provide
information about what students have not
learned, although standards-based grading
claims that it does provide enough
detail, since there are many objectives
attracting separate grades. Even so, there
is no information about how to change
instruction {and an assumption that it's
possible to provide individual
instruction}]. When aggregated, grades
might inform teaching for an entire class,
and might help compare student performance
between classes.
Communication with other audiences is
important, although 'relevant' audiences
seem to be growing. Teachers who need
information about students entering their
purview in subsequent years are one
relevant audience, and so are policymakers
who want to level the playing field, deal
with students transferring into their
aegis, or perhaps provide state-supported
scholarships [US]. The media are
occasionally interested as well in
grading, grade inflation and the ease of
failure.
There may be a structured tension between
'"what promotes learning [and] what
enables a massive system to function"' (6)
reflected
in the interests of the different
audiences. An obvious answer is writing
reports, narratives [profiles], but
teacher interest in qualitative data means
aggregation is difficult. Selective
universities express this tension best,
and there is a trend to stress qualitative
means like interviews and essays in
admissions, while the media are more
interested in mean SAT scores.
There must be some standard notion of a
grade for merit. Teachers do seem to vary.
Early reports suggested that sometimes
memorising what was taught was important,
sometimes critical analysis, sometimes
handing good work in on time, and
sometimes one quality standard or another.
Generally, grades may represent
performance on single or multiple tasks,
achievement at a single point in time or
over time, achievement only or effort and
attendance, achievement of learning
outcomes or achievement in comparison with
peers. Series of grades require data
aggregation. Grading improvement is
particularly difficult, and might penalise
high achieving students to have little
chance to demonstrate improvement,
although meritorious achievement might be
more valuable. It is hard to isolate
academic achievement only, and teachers
find it particularly hard to focus on
achievement only anyway.
'Virtually all grading systems in the
early 20th century were norm referenced'.
In 1963 there was a move to criterion
referenced measurement, but this has not
brought standardisation 'and, quite
likely, never will' (8). Instead of trying
to work towards standard grades, it might
be 'more reasonable' to openly acknowledge
'contextual or situational specific nature
of grades', getting teachers to explain
what they mean. This has been done [in the
form of classic marking criteria], and
these can sometimes be made explicit and
combined with a contract: students sign up
to aim for a particular grade, and the
teacher commits to award this grade if the
agreed levels are achieved. There can be
generic or specific contract systems.
The reliability of grades varies. Single
task grades are rated according to
interrater reliability, and most studies
refer to percentage grades. Again early
studies showed considerable range in
grades awarded, especially if teachers
were marking without an agreed scale.
Teachers showed themselves to be
inconsistent even in terms of marking the
same work at different times. Sources of
unreliability identified early on included
the inability of teachers to distinguish
close degrees of merit, different criteria
used by different teachers, differences in
quality standards, and differences in the
way teachers distributed their grades.
Early
work [1918!] suggested that five divisions
could be '"handled accurately by
teachers"' (10) leading to establishing
letter grades which are still popular
today. Further suggestions included a
rubric for evaluating students' work, with
criteria and descriptions of quality
standards, covering things like 'spelling,
mechanics… sentence construction… ability
to reason from premises to conclusion and
"ability to present the argument
effectively, that is with tact and
force"' [quoting from a piece of work in
1915].
Rubrics are still popular, but modern
studies still raise some doubts, like one
in 2011 that supplied training but still
produced grades ranging from 50 to 96 for
a single essay (11). Exemplars
or ideal papers might help, although this
was attempted early on as well, in 1917.
So was an attempt to suggest that answers
would fall on some sort of accepted curve
distribution, that there would be '"five
approximately equal steps of ability"',
with particular percentages falling into
each step, such as 4% in excellent and 4%
in failure. This was 1914, and 'educators'
belief and faith in the normal
distribution continued through much of the
20th century'.
Actual teachers did not distribute grades
like this, however, either in 1918 or in
1994. Thus in 1998 a longitudinal study
by the US office of research showed that
'almost 70% of eighth grade students in
their national sample reported receiving
"mostly A's" or "mostly B's"'.
The GPA in the USA is the primary
cumulative grade, an aggregate from all
the individual task grades across the
courses for a particular semester or for
an entire career. 'Typically, an A grade
is worth four points, B grades were
three points, and so on' (12). Studies
tend to show a 'stability of GPAs over
courses and over time' (12) [universities
are much better at imposing standard
distributions, or were].
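
The GPA arithmetic described above can be
sketched in Python. This is only a minimal
illustration. The quoted passage gives A = 4
and B = 3 "and so on", so the values for C,
D and F, the function name and the sample
grades are all assumptions, and real
schemes often weight each course by credit
hours rather than taking a plain mean.

```python
# Hypothetical 4-point scale: the article quotes only A=4 and B=3 ("and so on"),
# so the C, D and F values below are an assumption.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def gpa(letter_grades):
    """Return the unweighted mean of grade points for a list of letter grades."""
    if not letter_grades:
        raise ValueError("at least one grade is required")
    points = [GRADE_POINTS[grade] for grade in letter_grades]
    return sum(points) / len(points)


# Illustrative course grades for one semester, not data from the article.
semester = ["A", "B", "B", "C"]
print(gpa(semester))
```

A cumulative GPA over a whole career would
apply the same mean to the grades from
every semester taken together.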
A study in 2012 at the University of
Missouri collected GPAs over several
semesters and calculated alpha reliability
coefficients across different groups of
semesters to measure the percentage of
variance that can be attributed to
differences among students rather than
semesters [meaning that 'the larger the
coefficient the more reliable the GPAs are
over time']. Alpha coefficients were 0.7
for 2 semesters, 0.8 for 4, 0.8 for 6 and
0.9 for 8 semesters.
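
For readers unfamiliar with alpha
coefficients, the sketch below computes
Cronbach's alpha, treating each semester as
an "item" and each student as a row. It
assumes this is the kind of coefficient the
study used; the notes do not give the exact
procedure. The function name and the GPA
figures are invented for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table with one row per student.

    Each row holds that student's GPA in each semester (the "items").
    """
    k = len(rows[0])  # number of semesters
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two semesters and two students")
    # Variance of each semester's GPAs across students.
    semester_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Variance of each student's total across semesters.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(semester_vars) / total_var)


# Hypothetical semester GPAs for five students (not data from the study).
sample = [
    [3.8, 3.6, 3.9, 3.7],
    [3.0, 3.2, 2.9, 3.1],
    [2.1, 2.5, 2.3, 2.0],
    [3.4, 3.1, 3.3, 3.5],
    [2.7, 2.6, 3.0, 2.8],
]
print(round(cronbach_alpha(sample), 2))
```

Alpha approaches 1 when students keep
roughly the same relative standing from
semester to semester, which is the sense of
'reliable over time' in the quotation.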
So how do cumulative grades get reliable
by aggregating single unreliable task
grades? Task grades may be unreliable in
terms of being consistent across teachers,
but there may be patterns detectable among
students nevertheless [I think: in the
example 'even with the lack of agreement
across teachers on each individual
student's essay… It is quite clear that
the teachers consistently favour student A
over student B'. In other words, they get
the rank order right] [individual
variations randomise out?].
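
One standard way to see why aggregation
helps is the Spearman-Brown formula from
classical test theory. The article notes
above do not mention it, so it is offered
here only as an illustration. It assumes
that the k task grades are parallel
measures, each with the same single-task
reliability r, in which case the
reliability of their aggregate is
k*r / (1 + (k - 1)*r). The value r = 0.3 is
hypothetical.

```python
def spearman_brown(r, k):
    """Predicted reliability of the aggregate of k parallel measures.

    r is the reliability of a single measure (between 0 and 1).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r / (1 + (k - 1) * r)


# Hypothetical single-task reliability of 0.3, aggregated over more tasks.
for k in (1, 5, 20, 40):
    print(k, round(spearman_brown(0.3, k), 2))
```

Under these assumptions the predicted
reliability rises steadily as more task
grades are pooled, which fits the idea that
individual variations randomise out.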
The question of validity is more
difficult. There are different types of
validity and different threats to
validity. If students who learn more get
better grades, the grades are
descriptively valid. If they go on to be
more successful later, the grades are
predictively valid. This sort of validity
arises from cumulative grades, not single
task grades.
Threats to validity arise from differences
of grades awarded in different schools,
especially if those schools have
'radically different student populations'.
The longitudinal study took students in a
nationally representative sample and asked
about their grades, then took students in
'high poverty schools' and in 'more
affluent schools' (14) and found a
difference [indicating grade inflation:
students in poverty schools who got mostly
A's had the same reading scores as
students in affluent schools who got C and
D, and similarly with maths, where the
scores of A students in poverty schools
'most closely resembled the scores of D
students in the more affluent schools'].
Anderson separates out grade inflation as
a general tendency affecting higher
academic grades. Higher grades in
themselves do not show this; they do so
only if they are somehow not deserved. It
might be the
result of capping grades leading to a
greater concentration at the top,
compressing grades and reducing their
value as an indicator.
Descriptive or concurrent validity looks
at the relation between cumulative grades
and [separate] test scores, assuming that
test scores reflect achievement and that
those with higher test scores have learned
more.
In general, correlations between grades
and test scores range from 0.3 to 0.7.
Correlations increase if subject specific
special tests are designed and aligned
with specific courses.
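
The correlations reported here are
presumably Pearson coefficients, though the
notes do not say so. As a minimal sketch of
how such a figure is obtained, the function
below computes Pearson's r for paired
cumulative grades and test scores. The
function name and both data lists are made
up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (sd_x * sd_y)


# Hypothetical cumulative GPAs and test scores for five students.
gpas = [2.0, 2.5, 3.0, 3.5, 4.0]
scores = [55, 70, 60, 80, 75]
print(round(pearson_r(gpas, scores), 2))
```

A value of 1 would mean grades and test
scores rank and space students identically;
values in the 0.3 to 0.7 range indicate a
real but far from perfect overlap.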
In terms of predicting success in
subsequent institutions, say comparing
high school averages with GPAs, we have
'quite positive' (15) results. High school
GPA is the strongest predictor of college
grades, with correlation coefficients from
0.35 to 0.55, which increase when we allow
for other differences, and apparently over
the course of the college career. One
large study says they are more important
than race, gender, special education
placement, free lunch status and
standardised test scores.
[Important evidence for the warming up
effect of high schools].
The consequences of grading students can
be important and impact the quality of
student lives positively and negatively.
The negative ones tend to be emphasised,
but there are lots of other negative
factors in schools as well, including
boring teachers and pupil cliques. The
negative effects of grades on students can
be cumulative, one study shows, because
students tend to internalise low grades.
However, Anderson's own view is that if
grades are explained and if they are
administered in a fair and impartial way,
they can improve.
Grading must be fully discussed as a part
of improving the education system, along
with improving the curriculum, the
training of teachers, personalising
learning and so on. The integrity and
fairness of the grading system must be
exemplary: tasks could be representative
of learning outcomes, quality standards
made consistent, students given sufficient
information about the basis for the
grades, and a variety of audiences asked
to provide input. Teachers are still
notoriously lacking in consistency, while
students have more consistent views and
like teachers who follow the guidelines,
use reliable information, avoid being
influenced by irrelevant factors and avoid
giving ambiguous or unclear explanations.
Reform is especially important for
students with special needs. We should
consider grading for different purposes,
and for different audiences: parents might
want different information from the
lawyers, as a survey on page 19 indicates.
This should carry on into teacher training
and certification. We should continue to
research grading policies and practices,
rather than focusing on 'the comfort of Op
Ed pieces' (21) which can advocate
particular approaches or the elimination
of grading altogether.