What follows are some actual working documents, reflecting discussions about assessment that I have had with colleagues in a variety of higher education institutions. Obviously, I have changed the names of these institutions and colleagues. They offer a much-needed 'practical' dimension to more abstract debates about assessment, its functions, and its operation, some of which are included elsewhere.
THE ASSESSMENT SUB-COMMITTEE OF THE EGYPTOLOGY DEPARTMENT, ARCHER UNIVERSITY
The following Report is intended to summarise the main points of discussion that emerged during the 2 meetings with students, and during the several informal discussions among the staff members that also took place.
(1) Why Look at the Assessment Scheme?
Initially, we approached the Egyptology Department with the intention of getting some “raw” data in order to calculate the actual weightings of each part of the assessment scheme as it had affected 3rd year students. Dr Meldew produced a Report (which everyone in the Department subsequently received). This Report (below) formed the basis of discussion in the 2 meetings, which ensued.
Although the specific “Implications” attached to the end of Dr. Meldew’s Report were the subject of some disagreement, the main points of the analysis in the Report were generally accepted. In the discussions that ensued, the following points in particular were agreed upon.
Anomalies in the workings of our assessment scheme can easily occur unless careful overall monitoring and control are instituted. In particular, the weightings which we intend to attach to the various components of the scheme (essays, special studies and exams) can be badly upset in practice. This is because the weight of a test in the final ranking is determined by the extent to which the test spreads out the students, as well as by the percentage of marks awarded for it. Further, anomalies can occur if no standardisation of marking is performed: tutors can vary in their marking standards, Departments can vary, and different tests can give different scores with the same populations. (The evidence for the existence of such anomalies in last year’s schemes is given in the Report.) Some important design points also emerged. If the distribution of students provided by a test is heavily skewed, then discrimination between students can become very unreliable. In particular, if a test produces a marked clustering, with most students getting around the same scores, then any attempt to draw a pass/fail line between students will rely on very fine differences in scores. The finer the differences between pass marks and fail marks, the greater the possibilities of error in allocating students: if 1 mark is all that separates the successful from the unsuccessful, the chances of mistakes being made over the award of that 1 mark can be pretty high.
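As an illustration of the standardisation of marking mentioned above, the sketch below rescales each marker's scores to a common mean and spread before they are combined. It is only one possible approach and is not a procedure taken from the Report; the function name `standardise`, the target mean of 50, the target spread of 10, and both sets of marks are hypothetical.

```python
from statistics import mean, pstdev


def standardise(marks, target_mean=50.0, target_sd=10.0):
    """Rescale one marker's marks to a common mean and spread (hypothetical targets)."""
    m = mean(marks)
    sd = pstdev(marks)
    if sd == 0:
        # All marks identical: there is no spread to rescale, so return the target mean.
        return [target_mean for _ in marks]
    return [target_mean + target_sd * (x - m) / sd for x in marks]


# Hypothetical marks from a severe marker and a lenient marker.
severe = [40, 50, 60]
lenient = [55, 60, 65]

print(standardise(severe))
print(standardise(lenient))
```

After rescaling, both sets of marks have the same mean and spread, so a candidate's standing no longer depends on which marker they happened to get. Whether the two groups of candidates really are of equal ability is a separate question that rescaling cannot answer.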
Certain specific recommendations to improve design and to control anomalies were made in Dr. Meldew’s Report. We felt, though, that the implications of such recommendations would need to be discussed to the fullest extent and that, in a sense, they had already prejudged the answers to important questions about the functions of our assessment scheme. In order to gain the time for full discussion, we decided to recommend the appointment of an Assessment Officer as indicated on the enclosed slip. We hoped such a person would not only collect data but actually suggest modifications to the assessment scheme, thereby initiating, in concrete ways, the full discussion of the assessment scheme which we favour.
(3) What We Might Do in the Future
As examples of the issues which we would like to see discussed in this Committee and in the Egyptology Department as a whole, the following points emerged at our meetings. The issues can be divided into 2 major sections: those connected with the reliability of assessment schemes and those connected with their validity. It was agreed that reliability should receive our attention in the short run.
The first question that needs to be asked is: should we use our assessment scheme to discriminate between students (e.g. in allocating them to advanced courses)? The consensus of opinion was that this is an important function of assessment and that, at least in the short run, this function was not likely to disappear. Given this function, though, do we not need to ensure that our assessment scheme discriminates as reliably as possible, i.e. that it doesn’t work arbitrarily but produces an intended distribution of students? If this too is desirable then there are certain ways to increase the reliability of discrimination: skewed distributions must be replaced by more even distributions (full ranges of scores must be used by markers); attempts must be made to standardise and normalise tests; more reliable tests (“objective tests”) might be considered as a substitute for the present reliance on essays and exams; and so on. At this point, objections were raised, based on doubts about the purely educational merits of items like “objective tests” (this is discussed below under “Validity”). However, there was wide agreement that if discrimination were to be an important function of our assessment scheme then the reliability of that discrimination must be investigated: the function of discrimination could not be ignored at the design stage only to be introduced at a later stage. It was further pointed out that the logic of objective testing might already be embodied in our marking of essays and examinations. It is possible to see an essay answer in terms of a series of answers to specific questions which are then marked “right” or “wrong”, just as in multiple-choice “objective” tests. If we are to reject multiple-choice tests as being educationally inferior in some way, should we not also doubt any attempts to mark essays “objectively”?
Some members of the Committee argued that reliability and validity were complementary, and indeed statisticians do tend to run the two together (e.g. by using reliability as an operational measure of validity). Others saw a definite difference between the two concepts. For us, validity refers to the ability of assessment items to tell us something about the qualities of the students we are interested in. With this definition, a possible clash occurs between the interests of reliable and valid testing. The most reliable tests might be the least valid: e.g. a test which asks students for their height might produce the desired distribution, enabling fine discrimination, but their height might not be a very valid way of assessing the students’ suitability for teaching. The clash between the two interests is especially acute in colleges, since the simplest and most reliable tests tend to feature an emphasis on the factual and on the non-controversial, whereas sometimes we want to emphasise the open-ended and controversial: we want students’ own creative thinking and individual meanings to be expressed in assignment answers. Naturally this doesn’t mean that essays (even when given the most open-ended titles) will bring forth such creativity where objective tests cannot do so. We would need further evidence concerning how the essays were actually written by students, and marked by tutors, before we could assert the superiority of essays in this way.
It would seem to us, then, that assignments are expected to perform at least two functions: one, to discriminate between students, and two, to validly establish some qualities inherent in students. A third function is also sometimes stated: assessment is a means of gaining “feedback” to judge the effectiveness of teaching. What remains to be discussed is whether one set of assignment items can fulfill all of these functions, and some of us are doubtful about this. If we want maximum reliability in discrimination we may have to compromise on some of our educationally valid aims. If we want to emphasise individual creativity then we may have to sacrifice a good deal of our claims to objectivity when it comes to grading essays. If students think that assignment answers are key elements in determining their careers they will tend to play safe and stress the non-controversial, whatever their own beliefs, and they will coach each other and co-operate, and this raises doubts about validity as defined above. What seems especially undesirable is that one set of assignments be assumed to cover all of these aspects equally and as efficiently as possible. In practice, by blurring these functions, none of them is likely to be performed well.
With these considerations before us, one possible long-term development might involve a clear splitting of functions for assignments. We might develop reliable discriminations on the one hand, and effective ways of gaining knowledge about students’ individual beliefs, free from any grading, on the other. A third kind of non-graded assignment might be designed to test out the effectiveness of our courses.
SELECTION FOR ADVANCED STUDY IN THE EGYPTOLOGY DEPARTMENT OF ARCHER UNIVERSITY
After scrutinising the results in Egyptology pertaining to all certificate students examined in the University at the end of the Academic Year 1973/4, it has been found that whilst the various marks contributing to the final assessment were weighted more or less in accordance with tutors’ intentions, the fairness of contributory marks seems highly questionable. Evidence has been discovered that points to a lack of comparability among tutorial teams marking the three ‘option papers’ set at the end of the courses so that almost half of the candidates received a ‘bonus’ of over three marks if they offered Advanced Hieroglyphics or Demotic Script rather than Archaeology. Grave doubts regarding the validity of the final assessments have been entertained.
The intention of the Egyptology Department was, apparently, to ‘weight’ scores obtained for (a) five essays, (b) a Special Study and (c) the final examination option paper in the ratio of 5:3:2, so that marks for coursework would be the main factor influencing ultimate grades. Members of the Department perhaps naively assumed that marking five individual essays “out of 12”, marking the special study “out of 36” and the examination paper “out of 24” would automatically secure the weighting intended. A close inspection of the marks indicates that this is not the case.
The matter becomes clear if the marks obtained by three candidates are scrutinised:
Let us suppose that we keep the same relationship within the Special Study marks, but score “out of 18” in order to halve the weighting of this component, as follows:
We now find that the same one-mark difference between the students’ final scores exists, and the ultimate ‘order of merit’ is unchanged. And from this example it is clear that component weights cannot be defined by the mere statement of maximum marks available for each – indeed, the statement of such maxima is largely irrelevant to the issue! In fact, weights are determined by the way marks for each component are ‘spread’ in relation to the mark distributions for the other components.
An examination of the distributions obtaining in the final assessment of students in 1974 indicated that operative weights were approximately in the ratio of 43:34:23, and therefore fairly close to the intended 50:30:20 relationship.
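The point that spread, rather than the stated maximum, fixes a component's weight can be illustrated with a rough sketch. It assumes that a component's operative weight is approximately its share of the summed standard deviations, which ignores correlations between components and is not necessarily how the figures above were calculated. The function name `effective_weights` and all of the marks are hypothetical.

```python
from statistics import pstdev


def effective_weights(components):
    """Approximate each component's weight by its share of the total spread.

    Assumption: spread is measured by the population standard deviation and
    correlations between components are ignored.
    """
    spreads = {name: pstdev(marks) for name, marks in components.items()}
    total = sum(spreads.values())
    return {name: spread / total for name, spread in spreads.items()}


# Hypothetical marks for three candidates. The stated maxima are 60, 36 and 24,
# i.e. a nominal 50:30:20 weighting.
marks = {
    "essays": [48, 50, 52],         # out of 60, tightly bunched
    "special_study": [18, 24, 30],  # out of 36, widely spread
    "option_paper": [12, 14, 16],   # out of 24, tightly bunched
}
print(effective_weights(marks))

# Same differences between candidates, but re-expressed "out of 18":
rescored = dict(marks, special_study=[0, 6, 12])
print(effective_weights(rescored))
```

With these made-up figures the Special Study carries about 60% of the effective weight despite a nominal 30%, and re-scoring it out of a smaller maximum changes nothing so long as the differences between candidates are preserved.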
The Option Paper Component
The results of a detailed investigation into the scores candidates obtained with respect to the three option papers are summarised in the following table.
The significance of differences among means for the three option groups was tested by an Analysis of Variance. It has been found that such differences are extremely significant (F = 17.38, 2/112 d.f., P less than .001). In other words, if there had been no real differences among groups of students or option paper markers, discrepancies among means as large as those observed could occur by ‘sheer bad luck’ less than once in a thousand trials. However, these figures alone do not indicate whether the cause of the discrepancies was probably genuinely higher ability on the part of Advanced Hieroglyphics and Demotic Script students, or a more ruthless team of Archaeology markers, or even some combination of both ability and marker effects.
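For readers unfamiliar with the technique, the following is a minimal sketch of how a one-way Analysis of Variance F ratio is computed. The function name `one_way_anova_f` and the three small groups of marks are hypothetical and are not the candidates' actual scores. The sketch stops at the F ratio and its degrees of freedom; converting these to a probability requires tables or a statistics library.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way analysis of variance."""
    all_marks = [m for group in groups for m in group]
    grand_mean = mean(all_marks)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual marks around their own group mean.
    ss_within = sum((m - mean(g)) ** 2 for g in groups for m in g)
    df_between = len(groups) - 1
    df_within = len(all_marks) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical option-paper marks for three small option groups.
group_a = [1, 2, 3]
group_b = [2, 3, 4]
group_c = [5, 6, 7]

print(one_way_anova_f([group_a, group_b, group_c]))
```

The degrees of freedom quoted in the text (2 and 112) are what this calculation gives for three groups containing 115 candidates in all.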
The Special Studies
In relation to option groups, means for Special Studies were as follows:
Apparently, examination of Special Studies provides no evidence of a marked inferiority on the part of students opting for Archaeology. On the other hand, the lower average mark obtained by Advanced Hieroglyphics students contrasts with their higher average mark on the option paper. However, analysis of variance yields a result (F = 2.16, 2/112 d.f.) that lends no support to the notion that any real differences exist.
To a certain extent, students may be misled by the information that special studies are “marked out of 36”. Tutors did not discriminate among special studies on a 37-point scale, and maybe none would claim the ability to do so. The great majority of students were awarded marks on an 8-point scale: 15, 18, 21, 24, 27, 30, 33, 36, apart from four students who, for some obscure reason, obtained 22, 26 or 29 marks. Only two students were awarded a mark lower than 15. 79 students were graded B-, B or B+ and awarded 24, 27 or 30 marks accordingly.
From this description of the distribution of Special Study marks, the informed test-constructor will understand that the scores are unlikely to be very reliable. Furthermore, ‘multiplying the grades by 3’ does nothing to help remove inaccuracies, but merely serves to increase the effects of such errors on the final assessments.
The Essay Component
Since the publication of the classic study “An Examination of Examinations” by Hartog and Rhodes, the essentially unreliable nature of essay marks has become a well-known phenomenon, and it is therefore not proposed to elaborate on this matter at this stage. Nevertheless, it should be understood that if one wishes to test a candidate’s verbal fluency, his ability to select relevant facts, marshal his arguments, and so on, then an essay is quite appropriate; but as a measure of his knowledge about Ptolemy, the Eleventh Dynasty, what Herodotus said, and the like, the essay is of very dubious worth.
The distribution of marks for essays was, like the Special Studies mark distribution, decidedly top-heavy: with a maximum of 60 marks, only five candidates scored less than 30. However, marks bunched between 43 and 55, and below 43 there was a long ‘tail’. Statisticians call this a ‘negative skew’.
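A negative skew of this kind can be checked numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5), which is one common measure among several; the function name `skewness` and the marks are hypothetical and merely imitate a bunched top with a long lower tail.

```python
from statistics import mean


def skewness(marks):
    """Moment coefficient of skewness: negative when the long tail lies below the mean."""
    m = mean(marks)
    n = len(marks)
    second_moment = sum((x - m) ** 2 for x in marks) / n
    third_moment = sum((x - m) ** 3 for x in marks) / n
    if second_moment == 0:
        # No spread at all, so there is no asymmetry to measure.
        return 0.0
    return third_moment / second_moment ** 1.5


# Hypothetical essay totals out of 60: most bunched near the top, a few far below.
essay_marks = [20, 35, 45, 48, 50, 52, 53, 55]
print(skewness(essay_marks))
```

A value below zero indicates the top-heavy shape described above; a symmetrical distribution gives a value of zero.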
The observed skew is of more than mere academic interest, because to a certain extent it invalidates calculations of ‘spread’ used to determine the weighting of components in the final assessment. It is impossible to adjust the estimates of weights on the basis of statistical measurements related to skewness. However, one should note the spuriously high value of 43% weighting for essays, and consider that in the ‘Advanced Studies qualification region’ a more realistic figure might be more in the 30% bracket.
In relation to option groups, mean essay marks were as follows:
The pattern of group differences matches that observed in connection with Special Study marks. However, analysis of variance again indicated (F = 1.52, 2/112 d.f.) the probable absence of real differences.
At this point, it might occur to the reader that if data for essays and special studies had been scrutinised together in a two-way analysis of variance, the significance of observed differences among groups might have been established. However, to carry out such an analysis, groups of equal size are required.
The analyses of marks for essays and special studies throw some light on the problem posed by the significant differences found among mean scores for option papers. These analyses reveal no justification for equating the abilities and attainments of groups opting for Advanced Hieroglyphics and Demotic Script, and giving these students a 3-plus mark lead over all those opting for Archaeology. There were unavoidable limitations that dictated the use of a less sensitive significance test. Nevertheless, if real differences of ability between groups had been proven, essay grades and special studies marks would have arranged the groups, independently and consistently, in the order: Demotic Script (above-average), Archaeology (average) and Advanced Hieroglyphics (below-average).
This investigation has revealed no strong evidence to support a claim for validity with regard to the assessment procedure adopted by the Egyptology Department. Weighting seems to be a hit-and-miss affair, governed by wishful thinking and the vagaries of distorted, abnormal mark distributions. At the Advanced Studies cut-off point, distributions seem to be too compressed for adequate reliability and reasonably sensitive discrimination. In view of the great amount of time and care expended by many students on Special Studies, grading on an eight-point scale might appear to be a remarkably crude means of assessment.
Assessable submissions may be long or short. They may be unsupervised or written ‘under examination conditions’. But they are all, invariably and persistently, essay-type answers to questions. There are many objections to the exclusive use of essay-type assignments. Apart from creating major difficulties for markers, essays by their very nature put non-verbal-type students (physicists, mathematicians, Egyptologists, artists, craftsmen, etc.) at a considerable disadvantage. They unfairly give extra marks to the good speller, the neat calligrapher, the fluent talker and the waffler. Any assessment procedure tied securely to only one kind of measuring instrument is bound to lead to frustration on the part of some students, who are altogether denied an opportunity to show their real worth in another way.
The case against essay-type responses would be weakened if it could be shown that reasonably objective assessments might easily be achieved. This investigation has unearthed no evidence to support such a claim. On the contrary, it would seem that for no good reason, Advanced Hieroglyphics students received a bonus of over three marks, whilst Archaeology students were deprived. It is difficult to imagine what the Educational Philosophers mean by objective in such circumstances.
Some suggestions for improving assessment of students are made:
1 Measuring instruments more reliable than essays should be tried.
Document 3
OBJECTIVITY IN ASSESSMENT -- A DISCUSSION DOCUMENT
1 In order to avoid a long and complex discussion, let me offer a working definition of an “objective” judgement as one which involves the maximum intersubjective agreement by people who are qualified to make the judgement. Such agreement is not the result of a social consensus in a simple sense; rather, it is produced by the use of agreed, specialist, relatively reliable procedures (e.g. “scientific methods”). Any individual judge who uses these specialist methods will come to the agreed conclusion, if that conclusion is an “objective” one.
2 Assessment involves making judgements about the abilities of candidates to reach certain standards of performance. These standards can be thought of as being expressed in terms of certain “right answers” (discussed below) to the questions used in assessment. The extent to which an actual answer approximates to its particular “right answer” determines the marks awarded to each candidate for each question. If assessment is to be “objective” there must be the maximum intersubjective agreement among the (expert) assessors, arrived at by the use of specialist procedures of relative reliability as suggested above. This intersubjective agreement must extend to two major areas:
3 Both of these issues can be resolved by using purely personal judgements which may be derived from all kinds of private feelings, hunches, assertions, the results of personal experiences, and so on. However, if there is no intersubjective agreement based upon explicit procedures, there can be no claims made to “objectivity”. Non-objective assessment might be perfectly acceptable, of course, although it can lead to problems of comparability, problems in justifying marks actually awarded, suspicions of unfair treatment of some students, and so on. Attempts to make assessment “objective” might help to solve some of these possible problems, although such attempts raise their own problems, as below.
4 If “objectivity” depends upon expert intersubjective agreement, then it is easier to come by in some subjects than in others, obviously. In mathematics, pure logic, and other areas which primarily involve the manipulation of symbols according to conventional grammars such intersubjective agreement is relatively easy to come by. Mathematicians tend to agree upon what might be the “right answer” to a problem, and the ways in which they come to that agreement are accepted as being relatively reliable (compared to, say, guessing or speculating). It is also relatively easy to decide if an actual answer conforms to a right answer, even if deciding on the actual marks awarded to “wrong answers” might present a few problems.
5 In other subject areas -- notoriously the humanities and social sciences -- intersubjective agreements about “right answers” to questions are more complex. One reason for this is that these subjects offer not only concepts to be manipulated, but interpretations of phenomena: it is the meaning of empirical phenomena that allows for disagreements among the experts. However, the existence of profound disagreements about the interpretation of phenomena need not preclude all attempts to assess “objectively” -- on some issues, one typically finds recognizable schools of thought which offer a manageable number of different intersubjective agreements about the interpretation of a phenomenon. An assessor can then construct a kind of meta “right answer” for assessment purposes which involves an agreed combination of the interpretations offered by the different schools. Thus a “right answer” in these subject areas might consist of a display of an open argument about interpretations, without the need to accept any particular interpretation as the right one.
6 A further complication is that there may be very little time available to consult with interested parties in order to arrive at a “right answer”. What we would have to do is to devise a “right answer” which would be acceptable in principle to our colleagues involved in teaching a course, and which would be acceptable in principle to external bodies as a legitimate, specialist University “right answer”. There may be a further requirement that the “right answer” should, perhaps, be attainable in principle by our actual students. It is my own opinion that no “right answer” should be based heavily upon one’s own unpublished work or upon a particularly obscure or esoteric text, or upon arguments which had not been covered in the course, for example. However, despite these difficulties, and despite the lack of a simple set of “right answers” which can be assumed to be shared by all assessors, it seems possible to construct working intersubjective agreements, or to assess in that spirit, even in humanities or social sciences.
7 In my view, the real problems arise when we set assessment questions which ask for interpretations about localised, very recent, or purely personally-experienced phenomena. If we ask our students to write essays on the problems they encountered on archaeological practice, or on how their own experience of family life fits with certain theories of life in ancient Egypt, or if we invite them to express their personal interpretations of various writings, then we encounter a serious difficulty. Clearly, there can be no agreed expert interpretations of such matters -- the actual events may be known only to the candidate, and they will not have been researched. Several ways out of the difficulty are available:
(a) We could decide not to ask students for these interpretations and suggest that we should only assess in areas where there is a possible area of intersubjective agreement. Thus, for example, instead of asking questions like “What social and methodological problems did you encounter in your visit to the local area?”, we should restrict ourselves to asking questions like “What are the social and methodological problems identifiable in the fieldwork undertaken by Lord Carnarvon?” -- the latter question offers much more chance of having an “objective” right answer as defined above. Of course, questions like the former one might well be very important ones, and they could still be asked and discussed. However, they could not be used in any assessment system that aimed to be “objective”.
(b) We could extend our concept of a “right answer” to make it include a “right approach”. Thus the actual conclusions reached by each candidate in answering questions about personal experiences could be ignored in favour of an analysis of the approach employed, the way opinions were tested, arguments developed and so on. Again, as long as this “right approach” were constructed according to the same principles as other forms of “right answer”, this would support a more “objective” approach.
(c) We could award marks on a non-objective basis and offer a “right answer” which contained our own (personal) judgements about the events being discussed and the validity of the interpretations offered by the candidates.
8 Actual answers may be difficult to interpret, even if an agreed “right answer” has been constructed. In my opinion, written answers should be interpreted using the same procedures as are used to interpret other written texts whose meaning is unclear (such as an historical document, a sacred text, or whatever). Thus there must be agreed expert interpretations -- again, at least in principle -- of each answer, if “objectivity” is to be achieved. This might involve the use of vivas with the students actually present -- although the students might not be considered as expert enough to be involved in constructing “right answers”, their expertise in interpreting their own actual answers might be considerable.
9 Some implications for practice
(b) If we do decide to increase the “objectivity” of our assessment scheme, we should follow the implications for the design of assessment items, and we should devote much more time to securing the necessary intersubjective agreements — with colleagues, with our External Examiner, with students.
(c) Whatever scheme of assessment we choose, it should be made absolutely clear to the students that they are being assessed on certain criteria, and that these criteria may be intersubjectively agreed or not as the case may be.
(d) Disputes about marks should always be referred to the “right answer” involved and an account of how the actual answer in dispute had been compared to this “right answer”.
COMPARABILITY IN ASSESSMENT -- A DISCUSSION PAPER
This seems to be one issue which causes students considerable concern as reflected in the letter written by some of last year’s students.
As I understand it, the term comparability has to do with equality in a loose sense. To talk of comparability in the context of assessment would seem to mean that there were no gross inequalities relating to the quantity and quality of teaching, the provision of pastoral care, the amount of assessable items and their evaluation and work-loads between and/or within units and/or departments. Naturally there are considerable difficulties in attempting to equate the sorts of demands which can reasonably be made of a student studying English and a History student but an attempt should be made, it seems to me, to eradicate gross anomalies or inequalities and to ensure as far as possible that students in some areas did not feel that they were expected to work harder or that their efforts were evaluated more harshly than students in other areas.
Comparability would seem to fall into 3 distinct areas:
A. Comparability within units/departments. This would relate to the quality and quantity of teaching and contact time and the possible variations in assessment of common items.
B. Comparability across units/departments. Again concerns here would be related to the quality and quantity of contact time, relative work loads and number, type and methods of evaluating assessable items.
C. Comparability across universities. There might be concern that one university might be marking to different standards and thus awarding more 1st/2nd class degrees.
A. The concerns here are normally catered for by moderation. This, in my opinion, is engaged in more for political reasons than to ensure comparability. The fact that a unit co-ordinator asks for an A, B, C, D and E essay from each marker often poses more problems than it solves. In a large unit (200 students) it involves the co-ordinator with a fairly heavy marking load. Also, suppose the co-ordinator discovers that one marker is consistently marking lower than the others. Does he then automatically upgrade all the assignments marked by that one tutor? Similarly, does he downgrade all assignments over-generously marked? This only makes sense where the re-marker is operating with similar criteria, and it may be that the co-ordinator and markers may not be able to agree on the weighting to be allotted to particular items. Another solution is to subject all marks to a statistical analysis to ensure the marks fall within a predetermined spread. This is politically inadvisable. Problems of moderation do not normally arise within small units where the number of assignments and markers is limited by the number of students. One answer within large units is to submit assignments throughout the course to external moderation (External Examiner?). Another, which I am not in favour of in all instances, would be a detailed marking guide and chart. My reservations regarding this are in the appendix.
B. Comparability across units/disciplines. Here, it seems to me, students are reconciled to the difficulties in equating an “A” student in Maths with an “A” student in Geography. The similarities would lie in ‘quality of work’ but there are semantic difficulties in making explicit the criteria employed. However, what is important is that students see that the work-loads, i.e. the amount of time spent each week and the assessable items required over a particular period, are roughly equivalent. Obviously there are difficulties here. Does a Linguistics student have to spend more time in the workshop to become an efficient Egyptologist than a Hieroglyphics student in the lecture-room? Nevertheless, it would seem that an effort should be made to prevent anomalies. The number and type of assessable items required of e.g. an English student should bear comparison with those required of an Egyptology student. It should not seem to students that one discipline appears to spend more time on its students than another (see reference to ‘flying Doctor’ Service - recent students’ letter). Obviously, again, there are considerable difficulties here relating to the quality of students admitted to the various disciplines and also to the disparity between individual weaker students requiring a disproportionate amount of time. It might be thought that if all units/departments published their work-loads, numbers and types of assessable items and criteria employed in marking these, and if these were collated and made available to students, then ill-informed criticisms would be avoided.
C. Comparability across different colleges could be catered for by more cross-reference and cross-marking and by greater co-ordination. Again, however, there would be difficulties relating to the question of agreement over the number and type of assignments and criteria employed for assessment. Nevertheless, it would seem that politically a greater degree of co-operation would be beneficial.
(1) The idea that students suggest a grade themselves for assignments could be investigated.
Bacharach University criteria for marking essay assignments.
i) Depth of treatment. This is considered to be the most important of the 3 categories and it refers to the quality or level of thinking which is apparent from the essay.
Poor answers may take the form of a string of unrelated facts. Substantial sections of the text may be reproduced without any evident structure. The answer will reveal very little understanding of the units and there will be no attempt to analyse or produce a coherent argument.
Adequate answers will go beyond a simple listing of facts. There will be an attempt to analyse the source material. The student will show evidence of being able to apply the concepts introduced in the units and there will also be evidence of structure and organisation in the thinking.
Good answers will show some of the following characteristics:
(iii) Style of presentation. This general heading is intended to cover a large number of aspects of the student’s work which affect his ability to communicate his ideas. It does not refer to the structure or organisation of his essay. It refers rather to his ability to provide an adequate introduction and conclusion; his use of language; his referencing of quotations; even his punctuation and spelling. This is an area where tutors should on the whole be more concerned with teaching than with testing. The course team certainly consider this category to be less important than the other two. Tutors should not be misled into overrating an essay simply because the style is elegant. Nor should poor style and presentation result in an otherwise acceptable answer being judged a failure - excepting the rare occasion when the student’s ability to communicate his ideas has been seriously impaired.
N.B. You should bear in mind that “Depth of Treatment” is the most important of the 3 categories. “Style of Presentation” is the least important and should not normally greatly affect the overall grade.