The grading of students is possibly the most significant social or 'political' aspect of higher education, yet one of the least explicit and least criticized. In conventional teaching, substantial effort may be devoted to the design of courses, and much concern shown for standards and resources, yet assessment design and the implementation of assessment policies remain much less well discussed or analysed. There are some technical discussions of various assessment procedures, and some work on the effects of assessment on student perspectives, but very little detailed discussion of actual assessment practices.
In many ways, this is a surprising omission, especially in the social science literature, since student grading seems to involve problems which have been well examined in other areas. Thus assigning numerical grades to student essays is a coding procedure, involving the same difficulties as, say, scoring questionnaire returns. Interpreting student essays involves similar problems to interpreting other texts, behaviour, actions, dreams, or interview responses. Arranging students' work on a continuum provided by a range of marks in a way which gains consent involves an 'accounting procedure' similar to those in use in other professions such as counselling or policing. All these activities have been critically examined in a variety of ways, in hermeneutics, 1 in critical theory, 2 and in ethnomethodology. 3 There are specific factors relating to the context of assessment which must be included in any comparisons, of course, but nevertheless it is a significant omission in the literature, with few exceptions. 4
Perhaps the turn towards more specific aspects of practice in critical analyses of education might rectify this omission, and eventually affect practice. At present, it seems common, in British higher education, to accomplish student grading with a mixture of relentless 'common sense' at the initial phases of design and marking, with much reliance placed on 'experience', 'hunches', personal insights and the like, and a sophisticated technical manipulation of the data once produced in the 'hardened' form of numerical grades.
This inconsistent approach is the result of a particular shift in the organizing principles of the British system, as well as a more general contradiction. In the old 'sponsored' system of higher education, internal assessment did not need to grade finely or discriminate accurately, since a relatively homogeneous student population, a chance to understand students better face-to-face, and no systematic student failure policies, made fine discriminations redundant. 5 Yet in 'contest' systems, with open access, continuous assessment, and more emphasis on selection, failure, and rational, 'fair' grading, the old tacit understandings and 'experience' are quite inadequate.
The current British system has adopted many of the characteristic forms of 'contest' assessment, but without the necessary insights to control unintended consequences. With new forms of assessment on the agenda for schools, and with increased Government interest in assessment as a 'performance indicator', or as 'quality control', this curious mixture of assessment styles might be changing. At the Open University, assessment techniques, and the need to systematize them, soon led to a series of debates which offer a discussion of the likely fate of these new initiatives, and which reveal interesting paradoxes.
Debates about assessment indicate the nature of the potential and performance of educational technology described in the previous chapter. On the one hand, educational technologists were able to show that conventional forms of student assessment were often inconsistent, incoherent, and ineffective, but, on the other, the response to these inadequacies took the form of imposing certain debatable technical solutions. These led to an argument about principles of assessment and, at last, to certain critical insights about the role of social context. In brief, educational technologists discovered that in the context of their earlier work (small scale groups, activities marginal to the education system as a whole, or exercises in skill instruction), assessment could be used solely as a diagnostic device, but in a university context, especially at an open university, assessment necessarily meant grading, selection, and failure for some students.
An early version of the Student Handbook expresses the original diagnostic function of assessment very clearly:
'. . . [students']. . . success will be measured. . . [but]. . . there is no need for an Open University student to get obsessed by grades. The Open University system is not based on a rat-race conception of grading. The award of a course credit does not depend on a competition: we do not say "We will pass x per cent and fail the rest". The continuous assessment scheme has two main functions besides assessing your progress - to help you learn and to provide feedback about the effectiveness of the learning materials. Thus as we analyse your assignment results. . . we can identify any areas within the course that are not teaching effectively, and the course team can take remedial action.' 6

The implications of 'student-centred' educational technology forbade the use of assessment for any other purpose at that time. Educational technologists were able to draw upon a sizeable literature on assessment design which assumed a diagnostic function. This also considered a wider range of techniques than was usual, and promised greater precision and reliability. Thus, from the beginning, three major forms of assessment were provided: the final examination, the 'tutor-marked assignment' (TMA), and, most novel, the multiple-choice question, in its latest format - the 'computer-marked assignment' (CMA). Each of these enabled different objectives to be operationalized in an assessable form: broadly, CMAs tested students' abilities to recall information, or manipulate specific arguments in the course materials; TMAs tested the conventional skills of being able to discuss a topic, apply an argument, or carry out some practical task; and final examinations tested the usual skills of tackling topics while under constraints of time and memory. 7
The use of 'objective testing' in CMAs was the most controversial of these proposals. Educational technologists had to convince academics, especially in Arts and Social Sciences, that such testing was possible and desirable, and this involved a critique of existing conventional practice. One argument held that disciplines in Arts and Social Sciences inherently resisted objective assessment since there were no right answers. Educational technologists were able to reply that, if this were the case, no assessment at all could be justified. In practice, academics in these areas did indeed award a range of grades, so there must be some implicit conception of good or bad answers: why could not these implicit criteria be made explicit?
Further, reliance upon essays could be demonstrated to be conservative and irrational. Essay questions were frequently vague or ambiguous, and marks often awarded according to dubious assumptions about what students actually meant by what they had written.8 Clear possibilities arose for the intrusion of personal bias, or, more of a problem for a British open university, for the quiet 'sponsorship' of those students with approved 'cultural capital'. Educational technology could do better, transforming vague, general, and ambiguous essay questions into a series of precise, limited, and specific questions, based exclusively on the texts provided. The questions would be readily understandable, and much more reliable. So transparent and rational would assessment become, so little would it all depend on the suspiciously elusive qualities of interpretation and judgment by professional academics, that the actual marking of the tests could be performed by a computer program.
Thus it was possible to test whether students had read and grasped the major implications of, say, a passage from 'The German Ideology' (an actual example), by asking them to choose, from a list of plausible statements, the one which most closely resembled the actual arguments in the passage. While essays allow students to bluff or 'waffle', objective tests force them to be specific about what they know of the text. While the 'right answer' in such a test clearly depends still upon the judgment of the course team, at least that judgment is clear, specific, and open, and not buried in the marking schemes used, often implicitly, privately, and unclearly, in essay marking.
Of course, test designers have to guard against students guessing, but various techniques can minimize this, including penalizing incorrect answers. The grid format of CMAs also permitted other types of test, apart from the 'choose one answer from the above list' variety. The most complex forms asked students to use an array of information, and draw particular inferences from it. 9 Finally, CMAs had one particular 'practical' advantage - they enabled the rapid marking and return of enormous numbers of assignment items. Indeed the problems of time and cost involved in the marking of TMAs still serve to limit their deployment, regardless of any debate about relative 'educational' merits. Resistance to the proposals for 'objective testing', which was considerable, found itself once more forced to rely on relatively easily criticized 'progressive' or conservative assumptions, or to be arguing against the very logic and practice of the teaching system itself: how else could assessment 'at a distance' be organized?
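The text does not give the OU's exact penalty rule, but the classic correction-for-guessing formula illustrates the principle at work: a fraction of wrong answers is subtracted so that blind guessing gains nothing on average. A minimal sketch in Python:

```python
def corrected_score(right, wrong, options_per_item):
    """Classic correction for guessing: score = R - W / (k - 1).

    With k options per item, a pure guesser expects one right answer
    for every k - 1 wrong ones, so the expected gain from guessing
    is zero once the penalty is applied.
    """
    return right - wrong / (options_per_item - 1)
```

On five-option items, a pure guesser who gets 20 right and 80 wrong by chance ends up with 20 - 80/4 = 0, while genuine knowledge is left intact.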
The usual design problems soon arose in practice, of course, when educational technologists realized that there were no purely rational procedures to guide test construction that did not involve the judgments or experience of academics. For British academics, there was precious little experience! Yet at least it was possible to commend a few basic techniques, such as the need to avoid unintended ambiguity, to try to match test answer alternatives in terms of length, style and level of sophistication to avoid guessing, and so on.
The major problem that emerged was of a different order, however. A recognizable 'political' crisis arose when the Science Faculty discovered that, prior to taking their end-of-course examinations, some 40 per cent of Science students had gained the equivalent of an upper second (B) grade, or better, on course work. This discovery led to some alarm about the public acceptability and credibility of the OU teaching system, especially in the context of press speculations about the slide in standards that could be expected (see file). Some educational technologists might have been prepared to argue that these very good results showed the high levels of effectiveness that the very first courses had achieved (or indeed that the first intakes had been overwhelmingly well-qualified already). However, they would have been in a small minority had they done so, and doubts were already being raised about whether assessment results were accurate measurements of student ability.
The crisis produced a growing recognition that the grading functions of student assessment could not be ignored in a public university setting. In industrial training, the goal might well be to produce as many well-trained persons as possible, to teach to a criterion, rather than producing an acceptable range of graded performances. In teaching certain basic skills in educational contexts, the same requirements might hold. But in an open university, for mainstream academic courses, discriminatory grading had to be accomplished too. All would be well if acceptable distributions of students just happened to follow from attempts to raise all to the agreed standard, but it became clear that this need not often happen, and that a grading function would have to be introduced deliberately.
Recent attempts to discuss criterion-based assessment in schools seem likely to encounter this same problem. Assessment is not an even, constant, or abstract process: at certain stages it can indeed be devoted to diagnostic functions, or aimed at teaching most students to the agreed standard. But at other stages, in other conditions, grading students becomes a necessity. Without a proper understanding of the forces operating on educational systems from outside, from labour markets, from State and other bureaucracies, from funding bodies and even from students, all of whom might be able to insist on a grading system, this point can be missed. Educational systems are not unitary, and techniques cannot always be abstracted from one part and 'applied' to another.
The movement in assessment functions at the OU is the reverse of the one in some recent proposals. At the OU, assessment began as a diagnostic, but rapidly became a matter of discrimination, the 'rat-race conception' of grading, after all. This movement acquired momentum from another direction too - the attempt to improve certain technical aspects of assessment. The reliability of assessment items had also emerged as a problem, for example: some Faculties seemed to be 'more generous' than others in the award of good grades, raising the issues of comparability and equivalence between courses and grades. Different types of grading raised the same problems - were tutor-awarded grades comparable to CMAs, for example?
The OU system, like many developed since, proposed to combine scores gained in a variety of different forms to produce a grade for the course as a whole, and, ultimately, for the degree scheme as a whole. Further, this was to be complicated by a decision to 'weight' different components, to reflect different levels of significance or difficulty. Technical problems were clearly involved in these procedures, and these had to be investigated and dealt with. It is worth saying at this point that the problems do not even seem to have been recognized clearly in other schemes in conventional education, and that it is doubtful if any other institutions of higher education could have pursued these problems in such an open and rigorous manner. The results at the OU were to be particularly informative.
The arrival of an educational consultant from the USA in 1971 led to some preliminary analysis, apparently routine in American education, which drew attention to the very widespread incidence of technical problems, especially with CMAs. Virtually all the discussion which follows is drawn from, or suggested by, the arguments begun by this visit, and the papers which the arguments produced. Although actual techniques have doubtless improved since these papers were written, they remain a model of clarity as far as the principles involved are concerned. It is to be regretted that they were never published.
The analysis found, for example, different equivalences between literal grades and their corresponding percentage points in the different Faculties, so that an A grade in Humanities (actually called Arts, but the consultant's usage is adhered to) and in Mathematics translated as a mark of 80 per cent, while in Science an A meant a mark of 85 per cent, and in Social Science it meant 90 per cent. 10 During the first year of operation, the Humanities Faculty had changed their conventions for their seventh CMA, raising the percentage mark equivalent for an A grade because they had thought the assignment 'too easy'. Distributions of grades varied widely too - only 0.2 per cent of students had received an A grade overall in Social Science, while 20 per cent got an overall A in Science (for A and B grades together, the figures were 15.5 per cent and 70 per cent respectively). The obvious explanations for these variations - that there were differences inherent in the subject matter or in the calibre of students - were called into doubt since substantial internal discrepancies among the grades awarded within each Faculty were also apparent. Thus the paper showed that proportions of students obtaining an A grade in Science varied in different assignments, from 0 per cent to 41 per cent. In one Science assignment, no less than 99.5 per cent of the students received a grade of B or better. Comparability between these considerably different grades was obviously not easy to establish.
In subsequent analyses, similar anomalies emerged even more strongly, virtually everywhere that CMAs were used. Some simple design faults were still apparent - some testers had developed the habit of putting the right answers always in the first two or three options, which could help shrewd students maximize their chances of guessing. Some items tended to assume that students would have understood particular concepts without actually testing them, so that a basic mistake about the use of a specialist concept, or a category mistake, could produce a range of wrong answers on the actual test. 11
Generally, there was a preponderance of items producing high mean scores and skewed distributions ('bunching' at the high scoring end of the range). It is worth noting that this would be exactly the result that effective criterion-based teaching would produce. However, as indicated earlier, poor design of assessment items made this defence difficult to sustain. The assignments in question often seemed likely to produce inconsistent results. The explanation for 'bunching' was not, therefore, 'some real but unexplained characteristics of a single group of up to 5000 students. . . [but an]. . . infinitely more probable outcome - unsteadiness in one or two test writers'. 12 Or, on a different tack, other universities had much more even distributions of grades than the OU, it was argued. That the OU was somehow particularly effective compared with its peers 'seems an unlikely explanation. . . On the other hand, the poor construction of tests and CMAs can very easily produce these effects'. 13
Behind these views might lie an unfamiliarity with the British context, in fact. British universities did not set out particularly to produce reliable discrimination, as has been argued. Moreover, actual research (very recently systematized in Britain) shows considerable variation in the proportions of graduates receiving 'good degrees', not only between institutions but between different subject departments in the same institution. 14 This probably seemed unthinkable to anyone familiar with American practice: discussions of American practice, as in the case of 'item banks', for example, tend to be met with equal incredulity from British academics, of course. The British OU was in the forefront, for good or ill, of the introduction of American techniques and expertise, and thus was quite different from its conventional peers.
The strategic, public relations, aspects of the suggested discrepancy with other universities were stressed by British commentators too, though: 'justly or no, we are looked at askance by our colleagues in other universities', 15 although again the evidence for this scepticism is not clear, and it would be surprising to find many British universities who would want to debate the issue in public with the OU even today. Students were also seen as being likely to suffer from bunched grades: 'The Open University must either conform broadly to ordinary university practice in these matters or accept different treatment for its graduates'. 16 In these circumstances, efficient selection and discrimination seemed 'mandatory'. 17
Strategic and technical pressures came together in the plan for research to identify anomalies and to perfect 'quality control' devices. One paper begins by discussing the crucial issue of validity in assessment. A valid test is defined as one that produces results according to some external criterion. In academic life, where there are no observable skills or performances to act as a criterion, difficulties arise, it was argued. It might be possible to design a valid test in, say, History, by specifying elements which were essential and had to be mastered in order for the testee to become a successful historian, but a circularity arises since passing the test itself defines success: a successful historian is one who does well on the test set by the History Department, and there is no other external or independent criterion. Some sort of external criterion might be found in the subsequent fate of OU graduates, so that a graduate with a good degree who went on to develop a glittering career elsewhere might be seen as a kind of vindication of the validity of the assessment scheme, but this is rather long term and remote, and possible only if it is assumed that careers are based on talent alone (a most unwise assumption, as the social mobility data in an earlier file suggests). The usual solution to the problems of validity is offered here, just as it was in similar problems faced in curriculum design - to 'trust that test constructors are homing in on something which they know to be worth measuring'. 18
However, another alternative is available in this case. The reliability of assessment items is measurable, even if validity is not. Reliability was defined as the extent to which a test produces results which would be replicated if the same test, or a similar one, were to be applied to the same or similar group(s) of students. The concept can be operationalized still further: since it is practically impossible to repeat tests while avoiding the effects of rehearsal, a more convenient way to measure reliability is to gauge how well an individual test item reproduces the results of the test as a whole. Reliability is connected to validity in that a valid test must also be a reliable one - but the reverse does not apply, and a reliable test need not be valid at all. Tests might be reliable enough to reproduce the same pattern of results, but to be measuring some quality different to that which the designer was hoping to measure. Deciding to opt for reliable tests as the goal of assessment policy is a most significant change: a concern for valid measurement has been abandoned, and with it, inevitably, the last notion of assessment as diagnostic.
A peculiar unevenness in educational technology's procedures is also detectable here. In curriculum design, problems of moving from intentions to some valid depiction of them in behavioural objectives, or of moving from argument to knowledge structure, were cheerfully 'solved' by relying on judgment, trust, or power exercised by course teams. In assessment design, this was not enough, and some more objective procedures were preferred, even at the expense of a significant shift in emphasis and function. Validity could have been defined simply enough as what the course team decided were the 'essential elements to be mastered', and indeed this is what continued to happen in practice. The drive for reliability does not arise just because the problem of validity cannot be dealt with (there is surely an element of sophistry in the arguments of the consultants here). The desire for reliable tests is also determined by the wish to produce acceptable distributions of marks for strategic reasons.
This double determination can be seen in the development of various indices to measure reliability. The first step was to construct a 'facility index' to compare the performance of assessment items: clearly, an easy item will produce a different distribution from a difficult one, and this will produce anomalies if not controlled. The facility index was established by examining the proportions of students who can achieve success on the test item (select the right answer on a CMA). If 50 per cent of students select the right answer, the facility index is 50. Of course, it was argued, the facility index is improved if it is known that the most able students have chosen the right response. For this reason, facility indices were calculated originally on the proportions of the best 27 per cent of students (those who did best on the whole assignment, on all the test items overall) who chose the right answer. 19 Performance on the test as a whole is crucial in the measurement of reliability: as argued above it is the operational substitute for validity.
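The calculation just described can be sketched in a few lines of Python. This is a minimal illustration with the ranking procedure simplified; the 27 per cent cut comes from the text, but the data representation is invented:

```python
def facility_index(students, top_fraction=0.27):
    """Facility index based on the top fraction of students by overall score.

    `students` is a list of (overall_score, answered_item_correctly) pairs,
    one per student. Returns the percentage of the top-scoring group who
    chose the right answer on this particular item.
    """
    ranked = sorted(students, key=lambda s: s[0], reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    top_group = ranked[:top_n]
    correct = sum(1 for _, answered_correctly in top_group if answered_correctly)
    return 100.0 * correct / top_n
```

If every one of the best 27 per cent chose the right answer, the index is 100; if half did, it is 50, and so on.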
Internal consistency between test items can also be measured in the same way. Scores produced by individual items are correlated with distributions of scores produced by the whole assignment. The correlation coefficient so produced was termed an 'index of discrimination' or a 'correlation index'. The two indices used together can guide the design of assignments with some power. As an example, if an item has a high facility index, it means that some of those students who scored poorly on the test as a whole will have been able, nevertheless, to have chosen the right answer. The reverse applies for a test with a low facility index. In both cases, the correlation index will be low. It follows that the items with the highest possible correlation index will have a facility index of 50 - all the best 27 per cent of students will have chosen the right answer, none of the worst 27 per cent, and the best half of the remainder. Since the OU has to grade students more finely than just in terms of producing half the student population as passes and half as failures, items with different facility indices will be required, to produce a distribution across four or five further categories of pass. It will have to be accepted that these items will have lower correlation indices. Reliability here will have to be redefined, not in terms of a simple correlation with the results of the test as a whole, but in terms of a correspondence to the maximum possible correlation for a given facility index (which can be calculated in advance). 20
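The correlation index can be sketched as a Pearson correlation between an item's 0/1 scores and students' totals on the whole assignment (a point-biserial correlation, in effect; the exact coefficient the OU papers used is not specified, so this is an assumption):

```python
from math import sqrt

def discrimination_index(item_scores, total_scores):
    """Pearson correlation between item scores (0 or 1) and total
    assignment scores: high values mean the item 'agrees' with the
    whole test about who the strong students are."""
    n = len(item_scores)
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    covariance = sum((x - mean_item) * (y - mean_total)
                     for x, y in zip(item_scores, total_scores)) / n
    sd_item = sqrt(sum((x - mean_item) ** 2 for x in item_scores) / n)
    sd_total = sqrt(sum((y - mean_total) ** 2 for y in total_scores) / n)
    return covariance / (sd_item * sd_total)
```

An item answered correctly by exactly the top half of students correlates strongly (though not perfectly, since a binary item can never track a continuous total exactly), while an item everyone gets right, or that the weakest students get right as often as the strongest, correlates poorly.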
As an example of the usefulness of these exercises, one assessment item in the Arts Foundation Course (A100) was investigated in some depth. It had produced significant numbers of students in the two adjacent categories of C and D (equivalent to 'bare pass' and 'fail' respectively). On this particular item, 79 per cent of students had received grades lower than C. Only one other item in the whole range for the course had produced such a negatively skewed distribution (or low facility index). A low correlation index, between the distribution of grades produced by this one item and the overall distribution of grades for A100, confirmed the atypicality of this item. The best 27 per cent of students overall were checked to see how they had fared on the item, and 68 per cent of these best students had not managed to get the right answer on this occasion. The item seemed to be a most dubious one, and, when further examined, turned out to contain a possible ambiguity, although this had not been detected at the time in the routine scrutiny of the course team. Unfortunately, this particular item had had a considerable effect in determining student grades of C or D for this particular assignment - it had given so many students such low marks that it had pulled down grades for the assignment as a whole. 21
The main public purpose for 'item analysis' like this was to identify such poorly designed items and thus avoid some of the unintended consequences. Generally, unreliable items were to be diluted, at least, by including more assessment items overall. Large numbers of items would produce a sufficient pool of data to make the correlation index itself more meaningful. However, other possibilities arose for controlling the 'bunching' that had caused the embarrassment mentioned above. Items that were 'too easy' were to be identified and eliminated.
Here, a slip occurs again between the use of techniques designed simply to improve the reliability of assessment items and a use of the techniques to pursue more debatable policy goals of 'improving' distributions by eliminating items which produce undesired results. The same slippage is apparent in the discussion which followed about the kind of distribution of grades that would be most desirable. The original paper had proposed that distributions be normalized and standardized - that is that the original rank ordering of students produced by an assignment be preserved, but the original scores be converted, projected on to a 'normal curve'. Grade boundaries would be redrawn at standard intervals, and students reallocated to grades. The major justification for this step was a technical one - normalization enabled a more accurate distinction to be drawn between grades. For example, projecting grades on to a normal distribution enables the use of a number of powerful routine statistical measures to calculate standard errors of measurement. Once errors are known, borderline cases can be considered, and all those which fall within the standard errors of measurement can be reallocated. 22
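Normalization and standardization of this kind can be sketched briefly: raw scores are converted to z-scores (distances from the mean in standard deviations), and grade boundaries are drawn at standard intervals. The boundaries below are illustrative only; the actual intervals used are not given in the text:

```python
from math import sqrt

def standardize(scores):
    """Convert raw scores to z-scores, preserving the rank order
    but projecting the marks on to a standard scale."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return [(s - mean) / sd for s in scores]

def grade(z, boundaries=((1.5, 'A'), (0.5, 'B'), (-0.5, 'C'), (-1.5, 'D'))):
    """Map a z-score to a grade using fixed standard-interval
    boundaries (hypothetical cut-offs for illustration)."""
    for cut_off, letter in boundaries:
        if z >= cut_off:
            return letter
    return 'F'
```

Because the boundaries are fixed on the standard scale, roughly constant proportions of any cohort fall into each grade, whatever the raw marks were - which is precisely the property that makes grades comparable, and precisely what severs the link with diagnosis.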
Normalization and standardization also provide a convenient, standard definition of each grade, which solves the problem of comparability and enables grades to be conflated with some confidence at last. Without some standardized definition, unintended consequences and anomalies are bound to occur, and policy concerning 'weighting', for example, can lead to contradictory results. This argument applies above all to any attempt to mix forms of assessment based on different measurement philosophies or procedures, in conventional education and at the OU itself, as will be argued.
Of course, normalization and standardization procedures imply that there will be a fixed proportion of grades available in each category, and that therefore students will be competing for 'good' grades after all. Although a more familiar practice in the USA, such a policy would, in 1971 in Britain, have been 'anathema' to senior policy makers, even at the OU. 23 At this early stage, those educational technologists who believed in assessment as diagnostic also rallied to resist normalization: the technique would produce a distribution of grades that was 'not the result of social or educational policy but a technical artefact'. 24 The procedures would finally sever the connections between assessment and validity (or diagnosis) since virtually any test results can be made to yield a normal distribution, whether they relate adequately to the course materials or not. 25 Nevertheless, even those who objected to full normalization and standardization were beginning to realize that something had to be done to produce acceptable distributions, and to pay much more attention to the technical problems raised by grading. The notion of an 'expected' distribution gained ground, even if this was less rigid and specific than a 'normal' distribution. Techniques to achieve such distributions were also being discussed explicitly.
One alternative proposal to increase the technical rigour of conflation procedures was developed as part of a more general 'theory of assessment' produced as a result of the crisis and the debates about grading. For conflation purposes, grades from different types of assessment were conceived as 'orthogonal vectors' which could then be summed qualitatively, so to speak, by an Examination Board, rather than just being added together as if they were both standard measures on the same scale. This would take care of the need to reduce all grades to a common standard and, again, give Examination Boards more control over the distribution of grades.
For example, grades from continuous assessment, and grades from examinations would be presented to Boards in the form of a matrix, with each type of grade representing each axis of the matrix. The Board then would decide which of the cells in the matrix were to receive which final grades - whether three adjacent cells in the top corner received a First (A grade), or only two, or six cells, or whatever number seemed appropriate. The Board could also decide whether the chosen cells should be distributed evenly around the diagonal, or skewed so as to favour either course work or examinations. This approach leaves conflation policy in the hands of the Boards concerned, to conform with British practice, but still offers a rigorous and clear focus for decisions.
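The matrix procedure might be sketched as follows. The cell allocations here are entirely hypothetical, standing in for whatever a particular Examination Board might decide; this version skews the allocation slightly towards the examination axis:

```python
# Rows are continuous-assessment grades, columns are examination
# grades, both running A, B, C, D, F. Each cell holds the final
# grade the Board has allocated to that combination (hypothetical).
GRADES = 'ABCDF'
CONFLATION = [
    #      exam: A    B    C    D    F
    ['A', 'A', 'B', 'C', 'D'],  # continuous assessment: A
    ['A', 'B', 'B', 'C', 'D'],  # continuous assessment: B
    ['B', 'B', 'C', 'C', 'D'],  # continuous assessment: C
    ['C', 'C', 'C', 'D', 'F'],  # continuous assessment: D
    ['D', 'D', 'D', 'F', 'F'],  # continuous assessment: F
]

def conflate(continuous_grade, exam_grade):
    """Look up the final grade for a pair of component grades."""
    return CONFLATION[GRADES.index(continuous_grade)][GRADES.index(exam_grade)]
```

The Board's policy decisions - how many cells yield a First, whether the allocation is symmetric about the diagonal or favours one component - are all visible at a glance in the matrix, which is the point of the proposal.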
The more general 'theory of assessment' also offers a step back from the 'technical artefact' of normalization and standardization. The latter proposals had led to the use of the simplest forms of assessment items, for example, on the principle that they offered as good a form of discrimination as any. The 'theory' attempts to retain something of the original multiple functions for assessment requiring multiple and more complex forms. It tries to clarify issues such as the 'level', 'weight', or 'relevance' of assessment items in terms of an agreed set of 'primary descriptors', for example. 'Scope' and 'length' are easily quantifiable (operationalizable) 'extensive' factors, while 'complexity', 'difficulty' and 'importance' are 'intensive' factors. Of these latter, 'difficulty' can be quantified, via the construction of the facility index described above, but the others could at least be clarified, and possibly arranged on ordinal scales. 26 Here again, pinning down the characteristics of assessment in this way provides a rational framework in order to classify, compare or aggregate different types of assessment items.
Further, if course materials too could be analysed in terms of these primary descriptors, as the 'knowledge structures' project described in Chapter Two seemed to promise, assessment could be tightly related to courses. To anticipate how this could have been done, a concept discussed in a course could be defined as being, say, 'complex', or 'important' by examining its place in a knowledge structure. It would even be possible to quantify these descriptors by counting the number of connections joining a concept to those around it ('complexity'), or the number of more basic concepts needed before the chosen one could be reconstructed ('importance'). Complex and important concepts would have to be given a prominent place in assessment, and treated there in ways which would also be complex and important. Concepts, and types of assessment, would be related using the same limited vocabulary and agreed definitions, enabling rational discussion of the entire problem.
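To see how such counting might work, here is a toy Python sketch. The concept names and the tiny prerequisite graph are invented for illustration; 'complexity' is read off as the number of direct connections, and 'importance' as the number of transitive prerequisites, following the definitions above:

```python
# A toy 'knowledge structure': each concept maps to the concepts
# it directly presupposes. All names are invented for illustration.
PREREQS = {
    'variable':   [],
    'function':   ['variable'],
    'limit':      ['function'],
    'derivative': ['function', 'limit'],
    'integral':   ['derivative', 'limit'],
}

def complexity(concept):
    # Number of connections joining a concept to those around it:
    # its prerequisites plus the concepts that build on it.
    uses = sum(concept in ps for ps in PREREQS.values())
    return len(PREREQS[concept]) + uses

def importance(concept):
    # Number of more basic concepts needed before this one
    # could be reconstructed: all transitive prerequisites.
    seen, stack = set(), list(PREREQS[concept])
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(PREREQS[c])
    return len(seen)
```

On this toy graph, 'integral' presupposes four more basic concepts and would therefore demand a prominent place in assessment, while 'variable' presupposes none.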
Assessment techniques had developed in practice too. A year after normal distributions were being described as 'anathema', expected distributions were being firmly institutionalized. In 1972, the Examinations and Assessments Committee was noting that 'the Academic Advisory Committee had already advised [sic] that the majority of students should fall in the category of "pass with merit"', and that 'there were guidelines for Examinations and Assessment Boards' which had 'already been approved by Senate'. These guidelines were to be expanded to include '. . . a notion of an equitable distribution of 1st, 2nd and 3rd Class Honours Degrees'. This was to be achieved by asking Boards to consider the categories they were awarding in Foundation Courses, and to aim at:
'approximately the following proportions: (1) "pass with distinction" - between 5 per cent and 10 per cent of students awarded a credit; (2) "pass" - about 10 - 20 per cent of the students awarded a credit; (3) "pass with merit" - the remainder of the students awarded a credit'. 27 It is worth noting that no fixed quota of failures was to be decided by this procedure. Other possible quotas were held in reserve in case the actual distributions of grades required further action: the 'pass with merit' category could be subdivided into two equal bands, or more, should 'the conflation procedure for the determination of class of Honours' so require. These amendments would offer more categories or grades than was apparent in public, in order to meet the needs for accurate discrimination now central to the activities of Examination Boards.
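Mechanically, an 'equitable distribution' of this kind is a quota applied to ranked scores rather than an absolute standard. The Python sketch below is hypothetical throughout: the function, the particular proportions chosen from within the guideline ranges, and the assumption that every listed student has already earned a credit are all mine, not the committee's:

```python
def band_by_quota(scores, distinction=0.10, plain_pass=0.15):
    # Rank all credited students, then fill the categories by quota:
    # the top slice 'pass with distinction', the bottom slice a bare
    # 'pass', and everyone else 'pass with merit'.
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_dist = round(n * distinction)
    n_pass = round(n * plain_pass)
    bands = [''] * n
    for rank, i in enumerate(order):
        if rank < n_dist:
            bands[i] = 'pass with distinction'
        elif rank >= n - n_pass:
            bands[i] = 'pass'
        else:
            bands[i] = 'pass with merit'
    return bands
```

Notice that the actual marks play no part beyond establishing the ranking: the category boundaries move wherever the quotas require them to.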
Educational technologists too were in the business of expected distributions. One suggestion advised that examinations in Social Science be marked to produce a distribution of grades as follows: 'A - 5 per cent, B - 35 per cent, C - 35 per cent, D - 10 per cent, F - 10 per cent, R - 5 per cent'. 28 This distribution was calculated as that necessary to produce the desired overall distribution, after course work grades were known. This proposal, whether implemented or not, raises an interesting possibility of using examination marking, which is under the control of central staff in the last instance, to 'correct' any deviations introduced by course work grading, which is performed by part-time tutors outside the immediate control of central staff. This issue of controlling any excesses by part-time regional tutors led to other developments, as seen below.
The general procedures outlined in the discussion papers were implemented, in more or less identical forms, for every CMA after 1976. 29 Proposals for normalization were raised again as a result of wide variations in the means and standard deviations of scores on assignments produced, once more, in Science. This time, the problem arose at the conflation stage, since assignments with larger standard deviations have more influence on the final score than those with a lower spread of scores - whatever official policies of 'weighting' may have decreed. The basic indices were also used to identify unreliable items in final examinations (whether just the multiple-choice items or not is unclear). Finally, more reliable items, together with reports and various performance data on these items, were being stored in an 'item bank'. In 1978, the bank could not yet be used for all courses, however, owing to technical difficulties with data processing. 30 Although item banks have been in use in the USA since the 1960s, they are still new to Britain: Byrne describes the 'state of the art', and suggests a definite place for standardized and tested items in the assessment of routine aspects of university education such as the 'lower skills' of English language usage for students. 31
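The arithmetic behind this unintended weighting is worth spelling out. In the hypothetical Python sketch below, coursework and examination marks are nominally weighted equally, yet when raw scores are summed the widely spread examination component dictates the final ranking; standardizing both components to z-scores first restores the intended balance (all numbers are invented):

```python
from statistics import mean, pstdev

def conflate(coursework, exam, w_cw=0.5, w_ex=0.5, standardise=False):
    # Combine two components; optionally convert each to z-scores
    # first, so that nominal weights rather than raw spreads decide
    # each component's influence on the final ordering.
    def z(xs):
        m, s = mean(xs), pstdev(xs)
        return [(x - m) / s for x in xs]
    if standardise:
        coursework, exam = z(coursework), z(exam)
    return [w_cw * c + w_ex * e for c, e in zip(coursework, exam)]

# Coursework ranks the three students one way, the exam the other,
# but the exam's larger spread wins when raw marks are simply added:
cw = [62, 60, 58]        # tightly bunched
ex = [30, 60, 90]        # widely spread
raw_totals = conflate(cw, ex)
fair_totals = conflate(cw, ex, standardise=True)
```

With equal weights and standardized components the two opposed orderings cancel exactly; the raw sum simply reproduces the examination's ranking, whatever the official weighting said.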
As a final example of how far practice had developed towards discrimination as the main goal, a 'workshop' on assessment had identified a particular assessment item as offering opposing interpretations and implications. On one hand the item contained an ambiguity, and course team members had been unable to agree on the choice of a right answer. According to the interest in identifying weak items, which was behind the original research into assessment, the question should have been replaced. However, on the other hand, the item had produced the required distribution of grades among the student population, despite its ambiguity. According to the interest in discrimination, the item should be retained. Indeed, as the educational technologists attached to the course team pointed out: 'The problem arises, if the question is modified to eliminate the [ambiguity], will it continue to discriminate?' 32 The actual recommendation had been to keep the item. There can be no clearer example of how the position on assessment had become centred on discrimination and grading. Even if ambiguities existed which impeded student understanding or destroyed validity, items could be retained. OU practice came to resemble precisely those aspects of conventional practice which had been so ably satirized in the early discussions, where unintended ambiguity is permitted and, if detected, is rationalized by some claim that it helps to sort out the 'able student'!
All the analysis described so far concerns CMAs, since these were the only items which could be analysed directly with the appropriate techniques. Yet there are implications for TMAs in the proposals too. As suggested above, tutor marking is not easily controlled in distance systems, and it is reasonable to expect considerable variation in tutor practices. The usual processes of coming to some intersubjective agreement could not be organized realistically in sufficient detail in the widely dispersed conditions of the teaching system. Dangers existed of unfair practices, with tutors varying in terms of their severity or leniency. Of course, detailed marking instructions could be despatched, describing expected answers in some detail and recommending marking strategies - but different course teams offered different levels of advice, and tutors conformed to the advice with different degrees of enthusiasm. British academics apparently were not used to having someone else decide the specifics of assessment criteria for them, and have reported feelings of lack of involvement. 33
One early suggestion involved the use of CMAs to regulate tutor assessment. CMAs could act as a reliable yardstick, 'common beacons', 34 to measure tutor performance: any discrepancies between grades for CMAs and TMAs for groups of students might lead to an investigation of the marking patterns of the tutor concerned. Whether or not CMA grades were ever actually used in this way, before they became less common, is not known. A policy of comparing individual tutor grades with overall distributions, in the logic of the CMA work, was developed, however. Individual tutors' grades are correlated with the overall pattern produced by all tutor grading. Any particularly 'deviant' tutors are contacted by senior Regional staff, 35 but many still seem able to resist any advice to fall into line, despite a very weak market position. 36 Nevertheless, when tutor grades are considered centrally, it would be easy enough to 'tag' the grades from deviant tutors and simply 'correct' them. Grades from 'deviant' tutors are taken into account in borderline cases at Examination Board 'standardization' meetings. 37
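A crude version of this comparison is easy to sketch. The Python below flags any tutor whose average mark strays too far from the average over all tutors' grades; the tutor labels, marks, and tolerance are invented, and the documented procedure correlated grades with the overall pattern rather than using this simple comparison of means:

```python
from statistics import mean

def flag_deviant_tutors(grades_by_tutor, tolerance=6.0):
    # Compare each tutor's average mark against the overall average;
    # report those outside the tolerance band, with their deviation.
    all_grades = [g for gs in grades_by_tutor.values() for g in gs]
    overall = mean(all_grades)
    return {tutor: round(mean(gs) - overall, 2)
            for tutor, gs in grades_by_tutor.items()
            if abs(mean(gs) - overall) > tolerance}

tutors = {
    'A': [58, 60, 62],
    'B': [57, 59, 64],
    'C': [61, 60, 59],
    'D': [78, 80, 82],   # a conspicuously lenient marker
}
flagged = flag_deviant_tutors(tutors)
```

Once the 'deviant' tutors are identified centrally in this way, tagging or 'correcting' their grades is a trivial further step, which is exactly the possibility noted in the text.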
A correlation index exercise seems to operate with final examinations too, where large numbers of markers have to be coordinated and standardized. Once again, 'abnormal' distributions produced by particular markers are identified by comparing individual grades to grades awarded overall. Again, an early suggestion was that the computer simply be programmed to apply suitable weighting factors to correct any anomalies. 38 Certainly, Examination Boards seem to be able to make a number of adjustments to scores, for particular components or overall, in order to maintain expected distributions, either by exercising considerable discretion in conflation procedures or by altering means and distributions directly. 39 The discretionary powers of Boards in British higher education are often far-reaching too, of course, and possibly less organized and focussed.
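The suggested correction amounts to a linear rescaling: shift and stretch each anomalous marker's scores until their mean and spread match the target distribution. Below is a minimal Python sketch, assuming a simple z-score transformation; the early proposal's actual 'weighting factors' are not documented in this detail, so the method shown is my own reconstruction:

```python
from statistics import mean, pstdev

def correct_marker(scores, target_mean, target_sd):
    # Linearly rescale one marker's scores so that their mean and
    # standard deviation match the target distribution, while
    # preserving the marker's rank ordering of the candidates.
    m, s = mean(scores), pstdev(scores)
    return [target_mean + (x - m) * (target_sd / s) for x in scores]

# A severe marker's scores pulled up to a target mean of 62:
adjusted = correct_marker([40, 50, 60], target_mean=62, target_sd=5)
```

The transformation leaves the marker's ordering of candidates untouched; only the position and spread of the scores are brought into line with the expected distribution.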
Finally, much work seems to have been done on TMAs, as the use of CMAs has declined, particularly for Arts and Social Science courses. Analyses of the reliability of TMAs have also detected considerable anomalies - when a sample of essays was double-marked 'blind', for example, the range of differences was often two grades or more. 40 (It is worth adding in passing that it is difficult to imagine tutors even permitting such experiments in conventional universities). The group undertaking these comparisons seem to have settled for solutions which involve advice to improve the design of TMAs, rather than any attempt to 'correct' grades against 'common beacons', however.
The advice sounds completely reasonable and sensible for anyone using essays, in fact, stressing the need to check question wording and coverage, to pre-test questions with tutor panels, and to design better tutor notes. 41 The investigations also uncovered other points of interest which will be discussed later: first, student plagiarism seemed widespread, and caused some of the greatest discrepancies in marking, and, second, and more immediate, problems of low reliability seemed to have affected project work too.
Project work as a mode of student assessment appeared to offer a radical rejection of the older logic of testing and discrimination according to pre-set requirements to achieve mastery. The proposals look 'progressive' and the choice of project work was indeed inspired by Deweyan views of education. As part of the apparent 'paradigm shift' in educational technology (discussed in the previous chapter), projects were intended to '[challenge] the myth of "absolute objectivity" in measurement'. However, project work remains limited, and seems to have encountered some problems, including 'student apprehension', low cost-effectiveness, and 'problems with grading and standardization'. 42
These problems are not surprising, of course, since project work seems rooted in the old non-discriminatory view of assessment which had been largely discarded. If students are permitted to select their own titles, and if all receive individual and skilled advice and criticism before submitting the finished piece, projects should indeed lead to distributions of grades with high means and low scatters, unless there is a deliberate policy of subsequent discrimination. Because of this tendency, conventional systems which use project work invariably use other forms of assessment as well in order to supply the necessary degree of discrimination.
At the OU, certain stages in completing project work seem to have been subject to conventional assessment, for one course at least, and a chance to submit draft reports to tutors seems common, if only for 'formative assessment'. 43 It seems that the majority of students writing projects also feel the need for considerable tutorial guidance, and that this often involves tutors 'narrow[ing] the scope'. 44 Although there are assessment criteria in general, there seems to be a tendency for tutors to 'mark for effort, and a reluctance to fail students'. 45 In distance systems, projects seem to produce a 'bi-modal distribution' of grades. Although it is not clear from the actual studies, it is tempting to see this 'extra' group clustered at the lower end as those students who have not sought help and who have not therefore had their 'own' ideas shaped and developed by tutorial guidance: no doubt a 'bi-modal' pattern will appear in conventional education in Britain as staff-student ratios rise.
The inclusion of some projects need not challenge the basic tendencies to discrimination at all, therefore. Even if individual differences in diligence or in requests for supervision fail to produce a spread of grades in practice, grades for projects can be made to conform to an expected distribution just as easily as grades for essays. Or, in a mixed system, any tendency for project work to produce 'bunched' grades can be offset by including other forms to spread out the distributions again at the conflation stage. Projects can come to act as a kind of constant, to raise students to the pass level, say, while examinations can be used to determine more or less completely the details of actual grades and class bands. This sort of thing seems to go on in an unintended way in conventional education, for example when project scores are simply added to examination scores with larger standard deviations, resulting in the 'unintended consequence' noted in the Science Faculty at the OU. At the much more technically proficient OU, the necessary calculations are readily available to establish a deliberate policy, should one be necessary.
For these reasons, it is surely optimistic to suggest that project work alone has checked the use of grading as a discrimination device. The error seems to involve assuming that the mere employment of a technique can alter the social relations between assessor and assessed which have determined the course of events beforehand. As with course design techniques, it is more likely that the technique will take on new consequences according to its new context of application. The same points apply to other attempts to liberalize the assessment scheme: not to grade Foundation Courses at all, for example, or to develop profile assessment to avoid the problems and vulgarities of conflation procedures. 46
These devices are currently in vogue and in use, and they do offer a solution, but only a partial one. At the OU, as usual, problems with new techniques are pursued in much more detail and with more rigour than in conventional organizations. Byrne, for example, has pointed to some logical difficulties with profile assessment (in terms of the internal consistency of the items in the profile), and to similar problems with formative assessment. 47 With the latter, there are also non-logical difficulties: formative assessment has to be similar enough to summative (usually graded) assessment to attract the interests of students, but dissimilar enough to be given a different status, and to avoid the effects of 'rehearsal' or 'coaching'. 48 As usual, abstract difficulties have not prevented the use of these devices, but without adequate clarification, unintended consequences are almost certain to occur.
If Foundation courses, or other elements, are not assessed, for example, this only means that the burden of discrimination is borne by later courses, often with fewer items and lower reliability as a result. Profiles are also paradoxical: in proposing that the technical logic of conflation be held at bay, advocates often seem prepared to extend processes of assessment into all sorts of areas that had been omitted before, such as motives, extra-curricular activities and so on. In these proposals too, non-discriminating forms of assessment need not challenge the principles of discrimination, but can add areas or constants to them. Championing particular techniques, in a spirit of seeking an instant 'technical fix', is common, but is no solution to the genuine difficulties raised by the need to assess.
The career of student assessment policy at the OU is testimony, of a particularly powerful and authoritative kind, to these difficulties. There was a genuine commitment to a then rather radical type of diagnostic and criterion-based scheme, supported with experience and enthusiasm by the educational technologists who were designing the system, but even then, the need to discriminate asserted itself. Of course, the shift to a discriminatory system, based on the reliable reproduction of expected distributions of grades can be read in different ways.
At the simplest level, the realpolitik of grading made its appearance for the first time, in the new context of publicly-funded higher education, and the effects had not been anticipated by the designers of the system. As with admissions policy, it is possible to see the development of a sophisticated discrimination system as a rather covert activity, performed at the centre, in relative privacy, while maintaining a public face of concern for students and their problems, and tolerance for academic disagreements about grading procedures among Regional staff. Again as with admissions policy, however, this would be too conspiratorial.
Before leaving this view, though, it is worth noting one aspect of it: educational technology reveals itself in this area too as a most selective radicalism. Although in the forefront of attempts to open to doubt the practices of conventional assessment and to question the many assumptions at work (still), educational technology proved remarkably uncritical (with some exceptions) of the demands of the politicians, both internal and external, for reliable discrimination, convenient distributions of grades, and conformity to some (largely imagined) agreed university practice. A discipline devoted to a 'practical', 'value-free', 'technological' orientation to its subject matter is always vulnerable to being steered according to the values of the strongest groups. 49
The same factors affect the recent re-emergence of criterion-referenced assessment at the OU. Although introduced and discussed as a largely technical matter, a solution to the problems of validity rediscovered in discriminatory assessment, connections with the values of dominant groups are not far from view. 50 Criterion referencing is advocated this time not primarily as a diagnostic device to assist student learning as such, but more as a licensing device to guarantee to employers that students really have achieved certain levels of expertise and skill. The same shift, for the same reasons, is apparent in new proposals for national examinations for schoolchildren - the GCSE schemes. 51
These moves could be explained in terms of perceived changes by employers in the demand for skilled labour in the British economy: a low demand for labour, in the current recession, might mean that the detailed grading of applicants to make them fit available slots in occupational systems becomes less relevant. Instead, a simpler focus upon explicit training would arise. This sort of analysis, even when modified to include the necessary ideological background to neoconservatism, misses some of the specifics, however. New assessment devices win the consent of teachers and lecturers because they also appear more 'fair' or more 'progressive' than the older forms: continuous assessment triumphs because it is 'fairer' than the old final examinations scheme, or more 'relevant' or more 'universal', for example. 52 Given widespread ignorance of or indifference to the mechanics of assessment, different specific options can be circulated to, or be discovered by, different audiences at different times.
Yet the career of the assessment scheme is not just the old story of technologism triumphant. It is possible to argue that at least the educational technologists managed to raise some of the issues for discussion and to pursue implications to their logical conclusions. The rational analyses of assessment techniques and practices offer a clear and pointed critique of much of what goes on in conventional assessment, and the discussions of these analyses were intended to lead to clear policies. In conventional education, much less is known about how systems actually work, and still less about the policies adopted to resolve the problems. At a guess, based on limited experience, it might be possible to suggest that expected distributions are maintained in a rather ad hoc manner, at the final Boards themselves, shrouded in a protective confidentiality, and in ways often deeply in contradiction with public pronouncements meant for students or junior staff. By contrast, the rational procedures at the OU are models of reasonableness and responsibility. The technological orientation has at least penetrated some of the obfuscation and mystery, and led to modest but welcome improvements.
In assessment matters, mastering some of the principles behind the technical arguments is essential if discussion is to proceed beyond the 'experience', beliefs, or often simply the prejudices of university examiners, while avoiding another kind of 'veiling', this time by statistical expertise. The critique offered by educational technology is certainly better than no critique at all, or an 'anti-numerate' idealism.
The real omission in the work on assessment can be identified as a failure, once more, to reflect upon the social context of assessment policy. Universities must grade their students, but where exactly does the pressure to grade emanate from? Answering this question involves considerable reflection on more than technical matters. A naive functionalism ('Society expects it') or a naive economism ('the labour market, employers, or the British economy demands it') was all that was available to inform educational technology. In particular, an active role for the OU in the politics of grading was underemphasised. There are no clear requirements from labour markets, from validating bodies, or from students, and there is no simple agreement about practice between universities or even between university departments. Grading policies seem not to be 'rational' in some sense of being founded on deep-seated traditions, or social requirements, although it might be most convenient to argue that they were. Disappointingly, educational technology failed to penetrate these rationalizations.
Finally, another social context seems to have been omitted too. With assessment and with curriculum design, little attention was given to the ways in which students might react, especially if those ways took them outside officially prescribed roles. Despite the close attention to the views and responses of students in market research exercises, no thought was given to the possibility that students might be operating with covert strategies of their own. Yet students do 'play the game' in conventional higher education, using a variety of 'short-cuts' and semi-deviant activities to gain the best grade for the work put in. The extent and nature of such activities at the OU will be discussed in the next chapter, but it is sufficient now to suggest that student 'instrumentalism' towards assessment can produce the most serious modifications to the most carefully planned courses, especially 'at a distance'. Once more, the 'technical fix' which ignores social contexts produces contradictory results.
NOTES AND REFERENCES
1 See, for example, Palmer, R. (1969), Evanston, Northwestern University Press, or Thompson, J. (1981) Critical Hermeneutics: A Study in the Thought of Paul Ricoeur and Jurgen Habermas, Cambridge, Cambridge University Press.