Assessment in Higher Education -- Policy and Practice

Introduction

What follows are some actual working documents, reflecting discussions about assessment that I have had with colleagues in a variety of higher education institutions. Obviously, I have changed the names of these institutions and colleagues. They offer a much-needed 'practical' dimension to more abstract debates about assessment, its functions, and its operation, some of which are included elsewhere -- such as here and here.

Document 1

THE ASSESSMENT SUB-COMMITTEE OF THE EGYPTOLOGY DEPARTMENT, ARCHER UNIVERSITY

CHAIRMAN’S REPORT

The following Report is intended to summarise the main points of discussion that emerged during the 2 meetings with students, and during the several informal discussions among the staff members that also took place.

(1) Why Look at the Assessment Scheme?
 

  1. Firstly, assessment seems to take up a good deal of the time of both students and staff, and the workings of the assessment scheme do have important consequences for students’ careers. 
  2. Secondly, there is a considerable body of research on the workings of assessment schemes in other institutions, and such research might well have implications for our scheme.  For example, some work on the design of assessment items has shown that unfortunate consequences can occur unless design is very carefully controlled. Again, some work on the effects of assessment schemes (especially continuous assessment schemes) has suggested that an "instrumental" perspective among students can be produced (as an unintended consequence), and that such a perspective can in turn result in students trivialising the academic content of Archer courses.
  3. Against this background of general concern, specific doubts had been raised about the workings of last year's 3rd-year assessment. In particular, 2 of us felt that the true weighting of the 3rd year exam had been much greater than many students and staff had expected. As we discovered later, the students had specific doubts about the comparability of standards among tutors.


(2) What We Did

Initially, we approached the Egyptology Department with the intention of getting some "raw" data in order to calculate the actual weightings of each part of the assessment scheme as it had affected 3rd-year students. Dr Meldew produced a Report (which everyone in the Department subsequently received). This Report (below) formed the basis of discussion in the 2 meetings which ensued.

Although the specific "Implications" attached to the end of Dr. Meldew's Report were the subject of some disagreement, the main points of the analysis in the Report were generally accepted. In the discussions that ensued, the following points in particular were agreed upon.

Anomalies in the workings of our assessment scheme can easily occur unless careful overall monitoring and control are instituted. In particular, the weightings which we intend to attach to the various components of the scheme (essays, special studies and exams) can be badly upset in practice. This is because a test's weight in the final ranking is determined by the extent to which it spreads out the students, not simply by the percentage of marks nominally awarded for it. Further, anomalies can occur if no standardisation of marking is performed: tutors can vary in their marking standards, Departments can vary, and different tests can give different scores with the same populations. (The evidence for the existence of such anomalies in last year's scheme is given in the Report.)

Some important design points also emerged. If the distribution of students provided by a test is heavily skewed, then discrimination between students can become very unreliable. In particular, if a test produces a marked clustering, with most students getting around the same scores, then, obviously, any attempt to draw a pass/fail line between students will rely on very fine differences in scores. The finer the differences between pass marks and fail marks, the greater the possibilities of error in allocating students: if 1 mark is all that separates the successful from the unsuccessful, the chances of mistakes being made over the award of that 1 mark can obviously be pretty high.
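
The point about spread can be made concrete with a small numerical sketch. The Python fragment below uses invented marks (not any real Archer data): an essay component nominally worth 60 of the 100 final marks but with tightly bunched scores, and an exam nominally worth only 40 but with widely spread scores. The rank correlations printed at the end show the final order following the widely spread exam far more closely than the bunched essays, despite the essays' larger nominal maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Essays: nominally worth 60 of the 100 marks, but tutors bunch everyone near 45.
essays = np.clip(rng.normal(45, 2, n), 0, 60)
# Exam: nominally worth only 40 marks, but scores are widely spread.
exam = np.clip(rng.normal(20, 8, n), 0, 40)

total = essays + exam

def rank_corr(x, y):
    """Spearman-style rank correlation, computed directly with numpy."""
    rx, ry = x.argsort().argsort(), y.argsort().argsort()
    return np.corrcoef(rx, ry)[0, 1]

print("essays vs final order:", round(rank_corr(essays, total), 2))
print("exam   vs final order:", round(rank_corr(exam, total), 2))
```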

Certain specific recommendations to improve design and to control anomalies were made in Dr. Meldew's Report. We felt, though, that the implications of such recommendations would need to be discussed to the fullest extent and that, in a sense, they had already prejudged the answers to important questions about the functions of our assessment scheme. In order to gain the time for full discussion, we decided to recommend the appointment of an Assessment Officer, as indicated on the enclosed slip. We hoped such a person would not only collect data but actually suggest modifications to the assessment scheme, thereby initiating, in concrete ways, the full discussion of the assessment scheme which we favour.

(3) What We Might Do in the Future

As examples of the issues which we would like to see discussed in this Committee and in the Egyptology Department as a whole, the following points emerged at our meetings. The issues can be divided into 2 major sections: those connected with the reliability of assessment schemes and those connected with their validity. It was agreed that reliability should receive our attention in the short run.

(a) Reliability

The first question that needs to be asked is: should we use our assessment scheme to discriminate between students (e.g. in allocating them to advanced courses)? The consensus of opinion was that this is an important function of assessment and that, at least in the short run, this function was not likely to disappear. Given this function, though, do we not need to ensure that our assessment scheme discriminates as reliably as possible, i.e. that it doesn't work arbitrarily but produces an intended distribution of students? If this too is desirable, then there are certain ways to increase the reliability of discrimination: for example, skewed distributions must be replaced by more even distributions (full ranges of scores must be used by markers); attempts must be made to standardise and normalise tests; more reliable tests ("objective tests") might be considered as a substitute for the present reliance on essays and exams; and so on. At this point, objections were raised, based on doubts about the purely educational merits of items like "objective tests" (and this is discussed below under "Validity"). However, there was wide agreement that if discrimination were to be an important function of our assessment scheme then the reliability of that discrimination must be investigated; the function of discrimination could not be ignored at the design stage only to be introduced at a later stage. It was further pointed out that the logic of objective testing might already be embodied in our marking of essays and examinations: it is possible to see an essay answer in terms of a series of answers to specific questions which are then marked "right" or "wrong", just as in multiple-choice "objective" tests. If we are to reject multiple-choice tests as being educationally inferior in some way, should we not also doubt any attempts to mark essays "objectively"?

(b) Validity

Some members of the Committee argued that reliability and validity were complementary, and indeed statisticians do tend to run the two together (e.g. by using reliability as an operational measure of validity). Others saw a definite difference between the two concepts. For us, validity refers to the ability of assessment items to tell us something about the qualities of the students we are interested in. With this definition, a possible clash occurs between the interests of reliable and valid testing. The most reliable tests might be the least valid: a test which asks students for their height might produce the desired distribution, enabling fine discrimination, but height might not be a very valid way of assessing the students' suitability for teaching. The clash between the two interests is especially acute in colleges, since the simplest and most reliable tests tend to feature an emphasis on the factual and on the non-controversial, whereas sometimes we want to emphasise the open-ended and controversial: we want students' own creative thinking and individual meanings to be expressed in assignment answers. Naturally this doesn't mean that essays (even when given the most open-ended titles) will bring forth such creativity where objective tests cannot do so. We would need further evidence concerning how the essays were actually written by students, and marked by tutors, before we could assert the superiority of essays in this way.

It would seem to us, then, that assignments are expected to perform at least two functions: one, to discriminate between students, and two, to establish validly some qualities inherent in students. A third function is also sometimes stated: assessment is a means of gaining "feedback" to judge the effectiveness of teaching. What remains to be discussed is whether one set of assignment items can fulfil all of these functions, and some of us are doubtful about this. If we want maximum reliability in discrimination we may have to compromise on some of our educationally valid aims. If we want to emphasise individual creativity then we may have to sacrifice a good deal of our claims to objectivity when it comes to grading essays. If students think that assignment answers are key elements in determining their careers, they will tend to play safe and stress the non-controversial, whatever their own beliefs, and they will coach each other and co-operate, and this raises doubts about validity as defined above. What seems especially undesirable is that one set of assignments be assumed to cover all of these aspects equally and as efficiently as possible. In practice, by blurring these functions, none of them is likely to be performed well.

With these considerations before us, one possible long-term development might involve a clear splitting of functions for assignments. We might develop reliable discriminations on the one hand, and effective ways of gaining knowledge about students' individual beliefs, free from any grading, on the other. A third kind of non-graded assignment might be designed to test out the effectiveness of our courses.

Document 2

SELECTION FOR ADVANCED STUDY IN THE EGYPTOLOGY DEPARTMENT OF ARCHER UNIVERSITY

H Meldew

After scrutinising the results in Egyptology pertaining to all certificate students examined in the University at the end of the Academic Year 1973/4, it has been found that whilst the various marks contributing to the final assessment were weighted more or less in accordance with tutors’ intentions, the fairness of contributory marks seems highly questionable. Evidence has been discovered that points to a lack of comparability among tutorial teams marking the three ‘option papers’ set at the end of the courses so that almost half of the candidates received a ‘bonus’ of over three marks if they offered Advanced Hieroglyphics or Demotic Script rather than Archaeology. Grave doubts regarding the validity of the final assessments have been entertained.

The intention of the Egyptology Department was, apparently, to 'weight' scores obtained for (a) five essays, (b) a Special Study and (c) the final examination option paper in the ratio of 5:3:2, so that marks for coursework would be the main factor influencing ultimate grades. Members of the Department perhaps naively assumed that marking five individual essays "out of 12", marking the special study "out of 36" and the examination paper "out of 24" would automatically secure the weighting intended. A close inspection of the marks indicates that this is not the case.

The matter becomes clear if the marks obtained by three candidates are scrutinised: 
 

 
           Essays   Special Study   Option Paper   Final Mark
Cand. A      45          21              16            82
Cand. B      42          24              15            81
Cand. C      33          27              20            80

Let us suppose that we keep the same relationship within the Special Study marks, but score "out of 18" in order to halve the weighting of this component, as follows:
 

 
           Essays   Special Study   Option Paper   Final Mark
Cand. A      45           3              16            64
Cand. B      42           6              15            63
Cand. C      33           9              20            62

We now find that the same one-mark difference between the students’ final scores exists, and the ultimate ‘order of merit’ is unchanged. And from this example it is clear that component weights cannot be defined by the mere statement of maximum marks available for each – indeed, the statement of such maxima is largely irrelevant to the issue! In fact, weights are determined by the way marks for each component are ‘spread’ in relation to the mark distributions for the other components.

An examination of the distributions obtaining in the final assessment of students in 1974 indicated that operative weights were approximately in the ratio of 43:34:23, and therefore fairly close to the intended 50:30:20 relationship.
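
The kind of calculation behind the 'operative weights' figure can be sketched as follows. This is an illustration only: the candidate-level marks for 1973/4 are not reproduced in the Report, so the arrays below are placeholders with roughly plausible spreads, and the convention used (each component's effective weight taken as proportional to the standard deviation of its marks, correlations between components being ignored) is one common approximation rather than the only possible one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder marks for 115 candidates; the real 1973/4 scores are not
# reproduced in the report, so the spreads below are merely plausible for
# components marked out of 60, 36 and 24 respectively.
essays        = rng.normal(46, 7.5, 115)
special_study = rng.normal(26, 6.0, 115)
option_paper  = rng.normal(15, 4.0, 115)

# One common approximation: when components are simply added, each component's
# operative weight is roughly proportional to the standard deviation of its
# marks (correlations between components being ignored).
sds = np.array([essays.std(ddof=1),
                special_study.std(ddof=1),
                option_paper.std(ddof=1)])
weights = 100 * sds / sds.sum()

for name, w in zip(["essays", "special study", "option paper"], weights):
    print(f"{name}: {w:.1f}%")
```

With spreads of about 7.5, 6 and 4 marks, the printed weights come out close to the 43:34:23 ratio quoted above.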

The Option Paper Component

The results of a detailed investigation into the scores candidates obtained with respect to the three option papers are summarised in the following table.

 
                         Group N   Mean   Standard Deviation   Mark Range
Advanced Hieroglyphics      19     16.7          3.6             10-22
Demotic Script              34     16.8          2.3             10-21
Archaeology                 62     13.4          3.1              8-21
All                        115     14.9          3.4              8-22

The significance of differences among means for the three option groups was tested by an Analysis of Variance. It was found that such differences are highly significant (F = 17.38, 2/112 d.f., P less than .001). In other words, if there had been no real differences among groups of students or option paper markers, discrepancies among means as large as those observed could occur by 'sheer bad luck' less than once in a thousand trials. However, these figures alone do not indicate whether the cause of the discrepancies was genuinely higher ability on the part of Advanced Hieroglyphics and Demotic Script students, a more ruthless team of Archaeology markers, or some combination of both ability and marker effects.
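
The test referred to here is a standard one-way analysis of variance, which can be reproduced in outline as below. Since the Report does not give the candidate-level option-paper scores, the arrays are invented stand-ins matched only to the group sizes, means and standard deviations in the table above; only the real marks would return the F of 17.38 quoted here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Invented stand-ins matching the group sizes, means and standard deviations
# reported in the table; the actual candidate-level marks are not given.
adv_hieroglyphics = rng.normal(16.7, 3.6, 19)
demotic_script    = rng.normal(16.8, 2.3, 34)
archaeology       = rng.normal(13.4, 3.1, 62)

f_stat, p_value = stats.f_oneway(adv_hieroglyphics, demotic_script, archaeology)
df_between = 3 - 1
df_within = 19 + 34 + 62 - 3
print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.4g}")
```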

The Special Studies

In relation to option groups, means for Special Studies were as follows:
 
 
 

Advanced Hieroglyphics                             24.3
Demotic Script                                     27.2
Both (Advanced Hieroglyphics and Demotic Script)   26.2
Archaeology                                        26.3
All groups                                         26.3

Apparently, examination of Special Studies provides no evidence of a marked inferiority on the part of students opting for Archaeology. On the other hand, the lower average mark obtained by Advanced Hieroglyphics students contrasts with their higher average mark on the option paper. However, analysis of variance yields a result (F = 2.16, 2/112 d.f.) that lends no support to the notion that any real differences exist.

To a certain extent, students may be misled by the information that special studies are "marked out of 36". Tutors did not discriminate among special studies on a 37-point scale, and maybe none would claim the ability to do so. The great majority of students were awarded marks on an 8-point scale: 15, 18, 21, 24, 27, 30, 33, 36, apart from four students who, for some obscure reason, obtained 22, 26 or 29 marks. Only two students were awarded a mark lower than 15. 79 students were graded B-, B or B+, and awarded 24, 27 or 30 marks accordingly.

From this description of the distribution of Special Study marks, the informed test-constructor will understand that the scores are unlikely to be very reliable. Furthermore, ‘multiplying the grades by 3’ does nothing to help remove inaccuracies, but merely serves to increase the effects of such errors on the final assessments.

The Essay Component

Since the publication of the classic study "An Examination of Examinations" by Hartog and Rhodes, the essentially unreliable nature of essay marks has become a well-known phenomenon, and it is therefore not proposed to elaborate on this matter at this stage. Nevertheless, it should be understood that if one wishes to test a candidate's verbal fluency, his ability to select relevant facts, marshal his arguments, and so on, then an essay is quite appropriate; but as a measure of his knowledge about Ptolemy, the Eleventh Dynasty, what Herodotus said, and the like, the essay is of very dubious worth.

The distribution of marks for essays was, like the Special Studies mark distribution, decidedly top-heavy: with a maximum of 60 marks, only five candidates scored less than 30. However, marks bunched between 43 and 55, and below 43 there was a long 'tail'. Statisticians call this a 'negative skew'.

The observed skew is of more than mere academic interest, because to a certain extent it invalidates the calculations of 'spread' used to determine the weighting of components in the final assessment. It is impossible to adjust the estimates of weights on the basis of statistical measurements related to skewness. However, one should note the spuriously high value of 43% weighting for essays, and consider that in the 'Advanced Studies qualification region' a more realistic figure might be closer to 30%.
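
As an aside, the 'negative skew' described here is easy to check numerically. The sketch below builds an invented mark distribution of the shape described (marks bunched between 43 and 55 with a long lower tail) simply to show that such a shape yields a negative skewness coefficient; it is not the Department's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# An invented distribution of the shape described: most marks bunched between
# 43 and 55, with a smaller number trailing away well below 43.
essay_marks = np.concatenate([rng.uniform(43, 55, 100),
                              rng.uniform(20, 42, 15)])

print("mean:", round(float(essay_marks.mean()), 1))
print("skewness:", round(float(stats.skew(essay_marks)), 2))  # negative => long lower tail
```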

In relation to option groups, mean essay marks were as follows:
 
 

Advanced Hieroglyphics                             44.4
Demotic Script                                     46.8
Both (Advanced Hieroglyphics and Demotic Script)   46.0
Archaeology                                        45.3
All groups                                         45.6

The pattern of group differences matches that observed in connection with Special Study marks. However, analysis of variance again indicated (F = 1.52, 2/112 d.f.) the probable absence of real differences.

At this point, it might occur to the reader that if data for essays and special studies had been scrutinised together in a two-way analysis of variance, the significance of observed differences among groups might have been established. However, to carry out such an analysis, groups of equal size are required.

DISCUSSION

The analyses of marks for essays and special studies throw some light on the problem posed by the significant differences found among mean scores for option papers. These analyses reveal no justification for equating the abilities and attainments of groups opting for Advanced Hieroglyphics and Demotic Script, and giving these students a 3-plus mark lead over all those opting for Archaeology. There were unavoidable limitations that dictated the use of a less sensitive significance test. Nevertheless, if real differences of ability between groups had been proven, essay grades and special studies marks would have arranged the groups, independently and consistently, in the order: Demotic Script (above average), Archaeology (average) and Advanced Hieroglyphics (below average).

CONCLUSIONS 

This investigation has revealed no strong evidence to support a claim for validity with regard to the assessment procedure adopted by the Egyptology Department. Weighting seems to be a hit-and-miss affair, governed by wishful thinking and the vagaries of distorted, abnormal mark distributions. At the Advanced Studies cut-off point, distributions seem to be too compressed for adequate reliability and reasonably sensitive discrimination. In view of the great amount of time and care expended by many students on Special Studies, grading on an eight-point scale might appear to be a remarkably crude means of assessment.

Assessable submissions may be long or short. They may be unsupervised or written 'under examination conditions'. But they are all, invariably and persistently, essay-type answers to questions. There are many objections to the exclusive use of essay-type assignments. Apart from creating major difficulties for markers, essays by their very nature put non-verbal-type students (physicists, mathematicians, Egyptologists, artists, craftsmen, etc.) at a considerable disadvantage. They unfairly give extra marks to the good speller, the neat calligrapher, the fluent talker and the waffler. Any assessment procedure tied securely to only one kind of measuring instrument is bound to lead to frustration on the part of some students who are altogether denied an opportunity, perhaps, to show their real worth in another way.

The case against essay-type responses would be weakened if it could be shown that reasonably objective assessments might easily be achieved. This investigation has unearthed no evidence to support such a claim. On the contrary, it would seem that for no good reason, Advanced Hieroglyphics students received a bonus of over three marks, whilst Archaeology students were deprived. It is difficult to imagine what the Educational Philosophers mean by objective in such circumstances.

IMPLICATIONS

Some suggestions for improving assessment of students are made:

1. Measuring instruments more reliable than essays should be tried.
2. The marking of coursework essays should be improved by either (a) independent grading by two tutors, or (b) the use of check-lists.
3. Literal grading should be completely abandoned in favour of numerical grading on a full 21-point scale.
4. Wherever possible, assignments should be marked in fairly large batches (40-50) and steps taken to ensure a normal distribution of scores over the whole available range.
5. One tutor should be made responsible for collection of scripts, dispersal to markers, etc.
6. Essays should be submitted to markers in such a way that the names of authors are known only to the tutor in charge.
7. The tutor in charge should arrange for 10% of all scripts to be re-marked, either by the same or by a different marker.
8. All mark distributions should be standardised (and normalised if necessary) to common means and standard deviations (a brief sketch of such a standardisation follows this list).
9. The use of student teams marking in conjunction with a tutor should be considered.
10. Criteria should be agreed with students, so that when presenting an essay, the student would be able to 'claim' a particular mark. If the mark claimed could not be 'agreed' by the marker, reasons might then be stated in terms of the criteria. The student would then be in a position to improve his work and re-submit with the original claim.
11. At some point, if the term 'continuous assessment' has any meaning, a student reaches 'Advanced Studies probable' status. The policy might be adopted whereby he is informed of this instantly, rather than prolonging the agony through to the end of his course. The failure to give this information could act as a spur to students.
12. Some thought should be given to the problem of re-conceptualising the notion of continuous assessment.
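
Point 8 refers to a familiar statistical operation, and the following sketch shows the kind of standardisation intended: each marker's raw marks are rescaled to a common mean and standard deviation before being combined. The marker names, marks and target values are illustrative only, and the further step of normalising the shape of the distribution, also mentioned in point 8, is not shown.

```python
import numpy as np

def standardise(marks, target_mean=50.0, target_sd=15.0):
    """Rescale a marker's raw marks to an agreed common mean and spread."""
    marks = np.asarray(marks, dtype=float)
    z = (marks - marks.mean()) / marks.std(ddof=1)  # z-scores
    return target_mean + target_sd * z

# A 'severe' marker and a 'generous' marker who have in fact ranked their
# candidates identically are brought onto the same scale.
severe_marker   = standardise([38, 41, 44, 47, 52])
generous_marker = standardise([60, 63, 66, 69, 74])
print(severe_marker.round(1))
print(generous_marker.round(1))
```

In this example the 'severe' and 'generous' markers have used different parts of the scale but ranked their candidates identically, so the rescaled marks coincide.
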
Document 3 

OBJECTIVITY IN ASSESSMENT -- A DISCUSSION DOCUMENT

1 In order to avoid a long and complex discussion, let me offer a working definition of an "objective" judgement as one which involves the maximum intersubjective agreement by people who are qualified to make the judgement. Such agreement is not the result of a social consensus in a simple sense; rather, it is produced by the use of agreed, specialist, relatively reliable procedures (e.g. "scientific methods"). Any individual judge who uses these specialist methods will come to the agreed conclusion, if that conclusion is an "objective" one.

2 Assessment involves making judgements about the abilities of candidates to reach certain standards of performance. These standards can be thought of as being expressed in terms of certain "right answers" (discussed below) to the questions used in assessment. The extent to which an actual answer approximates to its particular "right answer" determines the marks awarded to each candidate for each question. If assessment is to be "objective" there must be the maximum intersubjective agreement among the (expert) assessors, arrived at by the use of specialist procedures of relative reliability as suggested above. This intersubjective agreement must extend to two major areas:
(a) what the "right answer" to each question is;
(b) how far the actual answers submitted approximate to the “right answer” in each case.

3 Both of these issues can be resolved by using purely personal judgements which may be derived from all kinds of private feelings, hunches, assertions, the results of personal experiences, and so on. However, if there is no intersubjective agreement based upon explicit procedures, there can be no claims made to "objectivity". Non-objective assessment might be perfectly acceptable, of course, although it can lead to problems of comparability, problems in justifying marks actually awarded, suspicions of unfair treatment of some students, and so on. Attempts to make assessment "objective" might help to solve some of these possible problems, although such attempts raise their own problems, as below.

4 If “objectivity” depends upon expert intersubjective agreement, then it is easier to come by in some subjects than in others, obviously. In mathematics, pure logic, and other areas which primarily involve the manipulation of symbols according to conventional grammars such intersubjective agreement is relatively easy to come by. Mathematicians tend to agree upon what might be the “right answer” to a problem, and the ways in which they come to that agreement are accepted as being relatively reliable (compared to, say, guessing or speculating). It is also relatively easy to decide if an actual answer conforms to a right answer, even if deciding on the actual marks awarded to “wrong answers” might present a few problems.

5 In other subject areas -- notoriously the humanities and social sciences -- intersubjective agreements about "right answers" to questions are more complex. One reason for this is that these subjects offer not only concepts to be manipulated, but interpretations of phenomena; it is the meaning of empirical phenomena that allows for disagreements among the experts. However, the existence of profound disagreements about the interpretation of phenomena need not preclude all attempts to assess "objectively": on some issues, one typically finds recognisable schools of thought which offer a manageable number of different intersubjective agreements about the interpretation of a phenomenon. An assessor can then construct a kind of meta "right answer" for assessment purposes which involves an agreed combination of the interpretations offered by the different schools. Thus a "right answer" in these subject areas might consist of a display of an open argument about interpretations, without the need to accept any particular interpretation as the right one.

6 A further complication is that there may be very little time available to consult with interested parties in order to arrive at a "right answer". What we would have to do is to devise a "right answer" which would be acceptable in principle to our colleagues involved in teaching a course, and which would be acceptable in principle to external bodies as a legitimate, specialist University "right answer". There may be a further requirement that the "right answer" should, perhaps, be attainable in principle by our actual students. It is my own opinion that no "right answer" should be based heavily upon one's own unpublished work, or upon a particularly obscure or esoteric text, or upon arguments which had not been covered in the course, for example. However, despite these difficulties, and despite the lack of a simple set of "right answers" which can be assumed to be shared by all assessors, it seems possible to construct working intersubjective agreements, or to assess in that spirit, even in the humanities or social sciences.

7 In my view, the real problems arise when we set assessment questions which ask for interpretations about localised, very recent, or purely personally-experienced phenomena. If we ask our students to write essays on the problems they encountered on archaeological practice, or on how their own experience of family life fits with certain theories of life in ancient Egypt, or if we invite them to express their personal interpretations of various writings, then we encounter a serious difficulty. Clearly, there can be no agreed expert interpretations of such matters -- the actual events may be known only to the candidate, and they will not have been researched. Several ways out of the difficulty are available:

(a) We could decide not to ask students for these interpretations and suggest that we should only assess in areas where there is a possible area of intersubjective agreement. Thus, for example, instead of asking questions like "What social and methodological problems did you encounter in your visit to the local area?", we should restrict ourselves to asking questions like "What are the social and methodological problems identifiable in the fieldwork undertaken by Lord Carnavon?" -- the latter question offers much more chance of having an "objective" right answer as defined above. Of course, questions like the former one might well be very important ones, and they could still be asked and discussed. However, they could not be used in any assessment system that aimed to be "objective".

(b) We could extend our concept of a "right answer" to make it include a "right approach". Thus the actual conclusions reached by each candidate in answering questions about personal experiences could be ignored in favour of an analysis of the approach employed, the way opinions were tested, arguments developed, and so on. Again, as long as this "right approach" were constructed according to the same principles as other forms of "right answer", this would support a more "objective" approach.

(c) We could award marks on a non-objective basis and offer a “right answer” which contained our own (personal) judgements about the events being discussed and the validity of the interpretations offered by the candidates.

8 Actual answers may be difficult to interpret, even if an agreed "right answer" has been constructed. In my opinion, written answers should be interpreted using the same procedures as are used to interpret other written texts whose meaning is unclear (such as an historical document, a sacred text, or whatever). Thus there must be agreed expert interpretations -- again, at least in principle -- of each answer, if "objectivity" is to be achieved. This might involve the use of vivas with the students actually present: although the students might not be considered expert enough to be involved in constructing "right answers", their expertise in interpreting their own actual answers might be considerable.

9 Some implications for practice are:
(a) Since we must all be using some suitable “right answer” whenever we assess, we might as well have this right answer written down, whether we arrive at it “objectively” or not.

(b) If we do decide to increase the “objectivity” of our assessment scheme, we should follow the implications for the design of assessment items, and we should devote much more time to securing the necessary intersubjective agreements — with colleagues, with our External Examiner, with students.

(c) Whatever scheme of assessment we choose, it should be made absolutely clear to the students that they are being assessed on certain criteria, and that these criteria may be intersubjectively agreed or not, as the case may be.

(d) Disputes about marks should always be referred to the "right answer" involved, together with an account of how the actual answer in dispute was compared to this "right answer".

Document 4

COMPARABILITY IN ASSESSMENT -- A DISCUSSION PAPER

This seems to be one issue which causes students considerable concern as reflected in the letter written by some of last year’s students.

As I understand it, the term comparability has to do with equality in a loose sense. To talk of comparability in the context of assessment would seem to mean that there were no gross inequalities relating to the quantity and quality of teaching, the provision of pastoral care, the number of assessable items and their evaluation, and work-loads between and/or within units and/or departments. Naturally there are considerable difficulties in attempting to equate the sorts of demands which can reasonably be made of a student studying English and a History student, but an attempt should be made, it seems to me, to eradicate gross anomalies or inequalities and to ensure as far as possible that students in some areas do not feel that they are expected to work harder, or that their efforts are evaluated more harshly, than students in other areas.

Comparability would seem to fall into 3 distinct areas:

A. Comparability within units/departments. This would relate to the quality and quantity of teaching and contact time and the possible variations in assessment of common items.

B. Comparability across units/departments. Again concerns here would be related to the quality and quantity of contact time, relative work loads and number, type and methods of evaluating assessable items.

C. Comparability across universities. There might be concern that one university might be marking to different standards and thus awarding more 1st/2nd class degrees.

A. The concerns here are normally catered for by moderation. This, in my opinion, is engaged in more for political reasons than to ensure comparability. The fact that a unit co-ordinator asks for an A, B, C, D and E essay from each marker often poses more problems than it solves. In a large unit (200 students) it involves the co-ordinator in a fairly heavy marking load. Also, suppose the co-ordinator discovers that one marker is consistently marking lower than the others. Does he then automatically upgrade all the assignments marked by that one tutor? Similarly, does he downgrade all assignments over-generously marked? This only makes sense where the re-marker is operating with similar criteria, and it may be that the co-ordinator and markers cannot agree on the weighting to be allotted to particular items. Another solution is to subject all marks to a statistical analysis to ensure the marks fall within a predetermined spread. This is politically inadvisable. Problems of moderation do not normally arise within small units, where the number of assignments and markers is limited by the number of students. One answer within large units is to submit assignments throughout the course to external moderation (the External Examiner?). Another, which I am not in favour of in all instances, would be a detailed marking guide and chart. My reservations regarding this are in the appendix.
Comparability of assessment with reference to practical courses would appear to be reasonable. The machinery whereby good and bad students are often seen by 2 tutors and/or an External Examiner seems to work satisfactorily, although there would seem to be some evidence that students feel that comparability of supervision needs investigation. Do all students receive the same degree of help and support?

B. Comparability across units/disciplines. Here, it seems to me, students are reconciled to the difficulties in equating an "A" student in Maths with an "A" student in Geography. The similarities would lie in 'quality of work', but there are semantic difficulties in making explicit the criteria employed. However, what is important is that students see that the work-loads (i.e. the amount of time spent each week and the assessable items required over a particular period) are roughly equivalent. Obviously there are difficulties here. Does a Linguistics student have to spend more time in the workshop to become an efficient Egyptologist than a Hieroglyphics student in the lecture-room? Nevertheless, it would seem that an effort should be made to prevent anomalies. The number and type of assessable items required of, say, an English student should bear comparison with those required of an Egyptology student. It should not seem to students that one discipline spends more time on its students than another (see the reference to the 'flying Doctor' service in the recent students' letter). Obviously, again, there are considerable difficulties here relating to the quality of students admitted to the various disciplines, and also to individual weaker students requiring a disproportionate amount of time. It might be thought that if all units/departments published their work-loads, the numbers and types of assessable items, and the criteria employed in marking these, and if these were collated and made available to students, then ill-informed criticisms would be avoided.

C. Comparability across different colleges could be catered for by more cross-reference and cross-marking and by greater co-ordination. Again, however, there would be difficulties relating to the question of agreement over the number and type of assignments and the criteria employed for assessment; nevertheless, it would seem that politically a greater degree of co-operation would be beneficial.
More generally, this question might arise with our CNAA-validated degrees, although I would think that to a great degree the problem for colleges offering CNAA degrees would be solved by the CNAA itself. Some suggestions for alleviating concern over comparability might be as follows:

(1) The idea that students suggest a grade themselves for assignments could be investigated.
(2) Students should have the right of appeal if they disagree with a grade. The Staff/Student Liaison Committee might be the proper body to investigate requests for a re-mark.
(3) Markers should, wherever possible, explain at the end of assessable items of work why they have allotted a particular grade and make suggestions as to how they think the particular item could have been improved.
(4) It should be emphasised to students that assessable items are not merely measuring devices of a student’s work but are designed to increase a student’s understanding and help him improve.
(5) All units/departments should use the same scale and the grades within that scale should be equivalent. Also it should be indicated what the “average” student should be expected to score (e.g. C,  B-, B).
(6) Methods of arriving at cumulative grades should be roughly equivalent.
(7) It would seem advisable that all units/depts keep a ‘running total’ so that students know their position termly. It ought not to be the case that in some units/depts, students know fairly precisely what is required of them but in others the final grade is only computed at a very late period in the course.
(8) It would appear inadvisable to allow students to qualify without completing the whole of the course.


Appendix

Bacharach University criteria for marking essay assignments.

(i) Depth of treatment. This is considered to be the most important of the 3 categories and it refers to the quality or level of thinking which is apparent from the essay.

Poor answers may take the form of a string of unrelated facts. Substantial sections of the text may be reproduced without any evident structure. The answer will reveal very little understanding of the units and there will be no attempt to analyse or produce a coherent argument.

Adequate answers will go beyond a simple listing of facts. There will be an attempt to analyse the source material. The student will show evidence of being able to apply the concepts introduced in the units, and there will also be evidence of structure and organisation in the thinking. Good answers will show some of the following characteristics:

  1. An attempt to evaluate, indicating the criteria by which a judgement is made and providing the evidence to support the judgement. 
  2. An attempt to synthesise, bringing together knowledge acquired from different sources; an attempt to produce hypotheses relevant to the assignment.
(ii) Breadth of content. This is intended to refer to both the student's ability to answer the whole of the question and his selection of an adequate number of points of content. It is not possible to state with certainty the points which should be included in an essay. The assignment notes, however, suggest a number of issues which we should expect to see discussed. A poor answer will probably contain too little content, inaccurate content or (more likely) too many points of content which are irrelevant to the issue being discussed.

(iii) Style of presentation. This general heading is intended to cover a large number of aspects of the student's work which affect his ability to communicate his ideas. It does not refer to the structure or organisation of his essay. It refers rather to his ability to provide an adequate introduction and conclusion; his use of language; his referencing of quotations; even his punctuation and spelling. This is an area where tutors should on the whole be more concerned with teaching than with testing. The course team certainly consider this category to be less important than the other two. Tutors should not be misled into overrating an essay simply because the style is elegant. Nor should poor style and presentation result in an otherwise acceptable answer being judged a failure, excepting the rare occasion when the student's ability to communicate his ideas has been seriously impaired.

N.B. You should bear in mind that Depth of Treatment is the most important of the 3 categories; "Style of Presentation" is the least important and should not normally greatly affect the overall grade.

Length
Guidance should be given - somewhere in the region of 2000 words.