Is it a Level Playing Field? Factors Which Influence Student Evaluation of Teaching

Moya Adams, Ruth Neumann and Cathy Rytmeister
Centre for Higher Education and Professional Development
Macquarie University, Sydney

Introduction

Students' evaluation of their teachers is regarded as a legitimate and defensible way for university teachers to gain feedback on their teaching, and to gain summative evaluation for career purposes. While it is important to understand that it is not the only way for academics to evaluate their teaching, student evaluation of teaching is being used increasingly in Australian universities, and has in fact become mandatory in some.

When university teachers interpret the results of student evaluation of their teaching, whether for formative or summative purposes, they are often intuitively aware that factors other than their own teaching may influence the evaluations. Many, for example, believe that they receive poorer ratings from larger classes in non-elective subjects. Research has been directed towards understanding these factors since the seventies, but the research is often difficult to interpret, because it has taken place in different countries (mainly the USA) and in different educational contexts, using a variety of questionnaires which are not necessarily comparable. Data based on the Australian university context is limited, and has not generally addressed the contextual factors which may influence the evaluation of teaching.

Factors which have been shown in earlier studies to affect student evaluations include contextual characteristics such as the discipline, course level, class size, electivity of the course and the time of day the course is held. Some teacher characteristics, such as position, years of experience and gender, have also been shown to relate to student ratings. This paper will focus on three course characteristics in the Australian context: discipline, course level and class size.
Research so far

Disciplines

In a meta-analysis of research on the relationship of course characteristics to student ratings of their teachers, Feldman (1978) reported that studies consistently found differences among ratings of teachers in varying academic fields. To enable him to compare the results of a diverse group of studies, he ranked the different academic fields for each study and standardised the ranks. The relationship between rankings and academic fields proved to be quite strong, as these differences still generally appeared when other factors such as class size, gender and rank of the teacher were controlled for. His analysis showed that the humanities, English, arts and languages mostly fall in the high and medium ranks; the social sciences, especially political science, sociology, psychology and economics, tend to fall in the medium to low third of rankings. The sciences, mathematics and engineering fall in the lower range.

Later studies such as Cranton and Smith (1986), Cashin and Clegg (1987) and Cashin (1990) have supported Feldman's evidence that, generally speaking, the humanities tend to rate highly compared with mathematics and the sciences. Cashin (1990) used the same approach of ranking the academic fields on data from two widely used scales in the USA, the Student Instructional Report (SIR) and the Instructional Development and Assessment (IDEA) system, and found a similar pattern, especially in disciplines where course effectiveness and teacher effectiveness were ranked similarly. That is, the humanities tended to have the highest rankings, and mathematics, sciences and technology received relatively low rankings.

Class Level

Feldman's meta-analysis also looked at research on the relationship of class level and class size to student ratings. Many of the studies reviewed showed that the higher the course level, the higher the average student rating of the teacher.
This relationship was relatively consistent across items in almost all studies, although it tended to be quite weak.

Class Size

Overall, Feldman (1978) concluded after examining large numbers of studies using different approaches that it was "much more likely for a researcher to find an association between class size and ratings than not to find one" (p. 208) and that this was likely to be an inverse relationship. That is, the smaller the class size, the higher the ratings. A small group of studies found, however, a U-shaped curvilinear relationship in which relatively smaller and relatively larger classes tended to receive higher ratings than did medium sized classes. A further, smaller group of studies found no relationship or a positive relationship between class size and ratings, but the significance of this group is questionable, since large class sizes were defined in such ways as being more than 21 students or even more than 9 students in the class. Other more recent studies have found a linear inverse relationship between class size and student ratings of teachers, for example Avi-Itzhak and Kremer (1983) and those reviewed in Feldman's later meta-analysis (1984).

The present study

A detailed study of the accumulated data from student evaluation of teaching at Macquarie University, Sydney has been made, looking at various factors which affect student ratings. It is based on data accumulated over four and a half years from second semester 1989 to second semester 1993 inclusive, using the Student Evaluation of Teaching and Subjects (SETS) questionnaire (originally developed at the University of Queensland and slightly modified at the University of Technology, Sydney). A notable aspect of this study is that it is based on data already collected in the normal course of events, and therefore is limited by the nature of such data.
It was not possible to set up comparisons or control groups, or to include information that had not already been collected as part of the normal processes. The sample is self-selected, since student evaluation of teaching is voluntary at Macquarie, and therefore may not be truly representative of the whole population of teachers. During the years that the data represents, staff participation in SETS increased from 9 per cent to 33 per cent. The present paper will focus on the undergraduate data.

For the present paper the relationship of student ratings of their teachers to the three context variables of discipline group, course level and class size has been analysed as part of a wider research project at Macquarie University, Sydney, funded by the National (Priority) Reserve Fund.

Method

The sample

The sample is based on all classes which used SETS during the period from 1989 to the end of 1993, representing nine semesters of evaluations. Where small classes had a response rate of less than 50 per cent, their data were omitted as invalid, leaving 736 classes deemed valid for analysis. Although the sample was self-selected, the teachers approximately represented the gender balance and numbers of staff in each school or department at the university. The basic unit for analysis is the mean rating of each class evaluating their teacher.

The Questionnaire

The SETS questionnaire consists of fifteen core questions, and up to nine optional questions. Of the core questions, ten are about teaching, four are about the course and one is an overall "global" question on teaching: (Question 11) All things considered, how would you rate the teaching of this lecturer in this subject? The analysis in this paper focuses on the ten core questions on aspects of teaching. All questions except the global teaching question require students to respond on a five-point scale, ranging from Strongly Agree (5), through Neutral (3), to Strongly Disagree (1).
Students can also select a response of (0) which indicates that the question is Not Applicable. (The global teaching question requires a rating on a seven-point scale, ranging from Excellent (7), to Very Poor (1)).

Rather than report the results for each individual question, four scales were constructed by means of principal components factor analysis. Question 6 (The lecturer demonstrated the relevance of the subject to the whole course) was omitted because it was judged to be ambiguous in the Macquarie context and student responses reflected this. The four scales were thus developed from the remaining nine questions.

Table 1: Scales used from items in the questionnaire, representing aspects of teaching

Name of scale   Questions in each scale                        Eigenvalue*
Organisation    Q 1: having clear objectives for each session  .44
                Q 3: being well prepared for each session
Communication   Q 2: able to explain concepts clearly          5.77
                Q 4: helped understanding of subject matter
                Q 5: tried to make subject interesting
Availability    Q 7: made opportunities to ask questions       1.14
                Q 8: was available for consultation
Feedback        Q 9: made assessment criteria clear            .69
                Q 10: gave adequate feedback on assignments
* for correlated factors

The four scales constructed represented the following aspects of teaching: organisation, communication, availability and feedback. The questions included in each scale are set out in Table 1, together with the eigenvalues for each of the corresponding highly correlated factors. Although the eigenvalues might indicate a two factor solution, we found, using a Varimax rotation method, that the four factor solution gave the best fit, and the most interpretable, least ambiguous results. The four factors account for 89 per cent of the variance in the undergraduate data.

Context variables

Staff order forms for SETS contained considerable demographic data, and from this we were able to construct three context variables to be used for the present analysis.
1. Discipline Groups. The fourteen schools at Macquarie represent over seventy different subjects, and some schools are cross-disciplinary in that they encompass sciences and social sciences for instance, or social sciences and professional courses. To enable meaningful analysis of these data, the subjects from all schools were grouped into four broad discipline groups which are based on recent studies of disciplinary differences in higher education (Becher, 1989). They are as follows:

   Humanities, including Modern Languages, History, Philosophy and Politics
   Professional, including Accounting, Actuarial Studies, Computing, Law and Teacher Education
   Sciences, including Mathematics, Physics, Electronics, Biology, Chemistry, Earth Sciences
   Social Sciences, including Anthropology, Education, Human Geography, Sociology, Psychology

2. Level. The undergraduate data included five levels, but levels four and five, with smaller numbers, represented senior law students and honours students. Because these students share much in common with level three students, levels four and five were recoded to be included in level three. Thus there were three levels for analysis: first year, second year, and third year and above.

3. Class Size. Class Size was analysed in four categories: fewer than 20 students; 21-50 students; 51-100 students; more than 100 students.

Results

To determine the relationship of the context variables to student evaluations, one-way analyses of variance were conducted, using each scale (Organisation, Communication, Availability and Feedback) in turn as the dependent variable, and the context characteristics (Discipline Group, Level, and Class Size) as the independent variables. Further two-way analyses were conducted where appropriate. Where preliminary analysis identified extreme outliers, these were omitted from the final analysis, scale by scale (that is, different cases were omitted for different scales).
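As a rough illustration of the procedure just described, the following Python sketch recodes two of the context variables and computes a one-way analysis of variance for a single scale, returning the F statistic and the proportion of variance explained (eta-squared). The records, function names and category labels are hypothetical and are not drawn from the SETS data. The sketch also assumes that a class of exactly 20 students belongs to the smallest category, and it neither computes p values nor removes outliers.

```python
from collections import defaultdict


def size_category(n_students):
    """Map a class enrolment to one of the four class-size categories."""
    if n_students <= 20:  # assumption: a class of exactly 20 counts as small
        return "20 or fewer"
    if n_students <= 50:
        return "21-50"
    if n_students <= 100:
        return "51-100"
    return "over 100"


def recode_level(level):
    """Fold levels four and five into level three (third year and above)."""
    return min(level, 3)


def one_way_anova(groups):
    """Return (F, eta_squared) for a dict of group label -> list of class means."""
    values = [v for group in groups.values() for v in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )
    ss_within = ss_total - ss_between
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, ss_between / ss_total


# Hypothetical records, NOT the study data:
# (enrolment, level, Availability class mean)
classes = [
    (12, 3, 4.4), (18, 5, 4.3), (35, 2, 4.2), (48, 3, 4.1),
    (75, 2, 3.9), (90, 1, 4.0), (150, 1, 3.6), (220, 1, 3.7),
]

by_size = defaultdict(list)
by_level = defaultdict(list)
for enrolment, level, availability in classes:
    by_size[size_category(enrolment)].append(availability)
    by_level[recode_level(level)].append(availability)

for name, groups in (("Class Size", by_size), ("Level", by_level)):
    f_stat, eta_sq = one_way_anova(groups)
    print(f"{name}: F = {f_stat:.2f}, eta-squared = {100 * eta_sq:.1f}%")
```

In this sketch the unit of analysis is the class mean, as in the study, so each record contributes one value per scale regardless of how many students responded.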
Table 2 contains the descriptive statistics for each scale after excluding extreme outliers. As we have noted, the means for each aspect of teaching are derived from items answered on a five point scale. A score of 5 represents a high rating on the aspects of teaching (Organisation, Communication, Availability and Feedback) while a score of 1 indicates a low rating. The total sample of classes included in the analysis was 736, but the total number for each scale varies because of the omission of extreme outliers.

Table 2: Descriptive statistics for scales of aspects of teaching (outliers excluded)

Scale           N     Mean   SD    Minimum rating   Maximum rating
Organisation    735   4.12   .35   2.95             4.98
Communication   731   4.01   .48   2.47             4.99
Availability    735   4.11   .42   2.73             5.00
Feedback        736   3.76   .45   2.35             5.00

From Table 2 it can be seen that the sample of teachers at Macquarie were rated highly overall by their students. This may be an indication of the good teaching at the university, but is also likely to be influenced by the fact that evaluation of teaching is voluntary. Nevertheless there are clear differences in the mean ratings, with Organisation rating highest, closely followed by Availability, and Feedback rating lowest. Organisation, generally speaking, is not a factor which discriminates greatly among teachers in the present data, having non-significant differences in means and the smallest standard deviations. It seems that most teachers who evaluate their teaching are judged as well organised by their students. The other three scales did discriminate more clearly among teachers, however.

Context factor effects

One-way analyses of variance

Table 3. Significance and Explained Variation for Multiple-level One-way Analyses.
                        DEPENDENT VARIABLES
                        Organisation       Communication      Availability       Feedback
Independent variables   p         η² (%)   p         η² (%)   p         η² (%)   p         η² (%)
Discipline Group        <0.0001   3.60     <0.0001   6.62     <0.0001   4.89     <0.0001   4.07
Level                   0.3782    0.27     0.0005    2.04     <0.0001   5.94     0.0010    1.87
Class Size              0.0018    2.32     <0.0001   13.42    <0.0001   40.19    <0.0001   15.55
* p = level of significance; η² (eta-squared) = explained variance

Given the generally high ratings and the differences apparent in students' ratings of the four aspects of teaching, the important question we are examining is whether there is a relationship between teaching context and student ratings of teaching. Multiple-level one-way analyses of variance were conducted using the three context factors (Discipline Group, Level and Class Size) as the independent variables, and the four scales on aspects of teaching (Organisation, Communication, Availability and Feedback) as the dependent variables. Table 3 indicates that all three context factors were significantly related to almost all of the aspects of teaching measured; the one exception was Level on the Organisation scale. The following section will describe the results of the one-way analyses for each context variable in turn.

1. Discipline Group. The Discipline Group clearly influenced student ratings, revealing a fairly consistent pattern across all four scales (see Table 4). Thus, Humanities consistently rated highest, and Science consistently rated lowest. In between these two, the Social Sciences and Professional groups were not significantly different from each other on any of the scales. The Science and Professional groups differed significantly only on the Communication scale.

Table 4.
Scale means and standard deviations by Discipline Group

                   Organisation   Communication   Availability   Feedback
Discipline Group   mean   sd      mean   sd       mean   sd      mean   sd
Humanities         4.28   0.32    4.17   0.39     4.27   0.37    3.93   0.45
Professional       4.18   0.37    4.01   0.53     4.08   0.41    3.73   0.42
Science            4.07   0.41    3.75   0.45     3.98   0.43    3.68   0.49
Social Science     4.23   0.27    4.00   0.41     4.07   0.43    3.70   0.42

Communication was the scale which was most differentiated by Discipline (η² = 6.62%), and Organisation was the least differentiated (η² = 3.60%). The Humanities group rated significantly higher than the Sciences and Professional groups on all four scales, and significantly higher than the Social Sciences on all scales except Organisation. The Science group rated significantly lower than the Humanities group on all scales, and lower than Social Sciences on Organisation and Communication.

2. Level. Level was significantly related to student ratings on all scales except Organisation, with third year and above rating highest, and first year usually rating lowest. The relationship was not consistently linear, however, as first year and second year had similar scores on all scales except Availability, and the difference was only just significant on this scale. Third year and above rated significantly higher than both first year and second year on the Communication and Availability scales. For the Feedback scale, however, the only significant difference was between third year and above and first year.

Table 5.
Scale Means and Standard Deviations by Course Level

                    Organisation   Communication   Availability   Feedback
Course Level        mean   sd      mean   sd       mean   sd      mean   sd
first year          4.17   0.32    3.96   0.44     3.95   0.50    3.66   0.45
second year         4.18   0.36    3.94   0.53     4.06   0.40    3.75   0.45
third year & over   4.22   0.36    4.08   0.45     4.21   0.37    3.82   0.45

Of the scales on which Level was significant, Availability was differentiated most by Level, with 5.94 per cent of variance explained, while Feedback was differentiated least, with Level explaining only 1.87 per cent of variance.

3. Class Size. The relationship of Class Size to student ratings of teaching was initially analysed in five categories: fewer than 10 students, 10-20 students, 21-50 students, 51-100 students and more than 100 students. Since there was little difference between the two smallest categories, however, and because of small numbers in some cells, these two groups were combined to form one group of fewer than 20 students. Given these categories, class size proved to be a highly significant factor in ratings on all four scales. In general, this was an inverse relationship: larger classes rated lower and smaller classes rated higher.

Table 6. Scale means and standard deviations by Class Size

                  Organisation   Communication   Availability   Assess/Feedback
Class Size        mean   sd      mean   sd       mean   sd      mean   sd
<20 students      4.26   0.32    4.18   0.37     4.35   0.29    3.92   0.43
21-50 students    4.21   0.37    4.08   0.45     4.23   0.30    3.83   0.43
51-100 students   4.15   0.37    3.84   0.52     3.92   0.40    3.67   0.44
>100 students     4.11   0.32    3.73   0.48     3.63   0.37    3.43   0.34

Class Size had powerful effects on students' ratings and explained more of the variability than either of the other context factors on three of the teaching scales. The effect of Class Size was most apparent on the Availability scale (η² = 40.19%), where ratings decreased steadily as classes became larger and all four categories were significantly different from each other.
A similar situation existed on the Feedback scale (η² = 15.55%), where differences were significant for all Class Sizes except between the two smallest groups. On the Communication scale, Class Size accounted for 13.43% of variance, but the main division came between the two largest and the two smallest groups of class size. That is, the classes of over 50 students differed from those with fewer than 50 students in terms of Communication. As expected, Class Size had the least explanatory power on the Organisation scale (η² = 2.32%).

Two-way analyses of variance

The two-way analyses of variance were carried out for each scale using Discipline Group and Level, Discipline Group and Class Size, and Level and Class Size, and a short summary of significant results is reported in the following section. There are limitations to the interpretation of two-way analyses of variance: because of different cell counts, interpretation of any significant interaction is complex and can be inconsistent, making it difficult to isolate the significance of different interactional effects.

Discipline Group and Level

On the Communication and Feedback scales, the pattern of results found in the one-way analyses of Discipline Group generally held across all Levels, and vice versa. On the Availability scale, the Discipline Groups were generally more clearly differentiated at third year than at the lower levels.

Discipline Group and Class Size

On the Availability, Feedback and Communication scales, the basic patterns of differences for each factor (as observed in the one-way analyses) still held when the other factor was taken into account. The proportion of variance explained by fitting both Discipline Group and Class Size was, however, not much greater than that explained by Class Size on its own. In other words, Class Size was the strongest factor.
For the Organisation scale, there was a significant interaction between Class Size and Discipline Group: for each factor, the pattern of differences was inconsistent across levels of the other factor. It should be kept in mind, however, that of all the scales, Organisation appears to be least affected by class characteristics, and is also the scale which differentiated least among teachers.

Class Size and Level

A problem with this design is that the two factors are related: the lower-level classes tend to be larger than the higher-level classes. It is therefore unsurprising that, once Class Size is taken into account, the effect of Level is of little significance, at least on the Feedback and Communication scales. For the Availability scale, differentiation according to Level is only apparent in the largest classes (>100 students). The usual inverse relationship between Class Size and ratings is still apparent at all levels.

Discussion

Although evaluation of teaching is voluntary at Macquarie University, the data used was representative of university staff in terms of gender and discipline spread. Since the staff are self-selected, however, it is not clear whether the sample is biased in terms of commitment to, or skills in, teaching. Nevertheless, the results support the findings of Feldman and others that the course characteristics of class size and discipline area have a strong association with students' ratings of their teachers, and "are more resistant to disappearing when control variables are introduced into the analysis than other course characteristics" (Feldman, 1978, p. 225).

The pattern of disciplinary differences found elsewhere was clearly evident in our study. That is, teachers in the Humanities consistently rated highest and those in the Sciences lowest, with the Professions and Social Sciences falling between them, and not so greatly or consistently differentiated from each other.
The fact that the discipline group's effect was strongest in relation to Communication would seem to indicate that students in the humanities have a perception that their teachers are better able to explain concepts clearly, help students' understanding and make the subject interesting. This may reflect something of the communication style of the teaching in this area. It is certainly apparent from our data that, of all the aspects of teaching measured, Communication is the one that students judge most keenly and that is most clearly differentiated by discipline group.

It is not as easy to theorise about the fact that the discipline group related in the same way to students' ratings of staff's availability and giving adequate feedback. It might possibly be a reflection of the different kinds of interaction and communication which take place in the humanities compared with the sciences. Certainly discussion with teachers in the sciences indicates that the first two years in most science subjects tend to involve large amounts of factual information that has to be mastered, so that students do not tend to become independent learners until their third year or even their honours year. This would not account for the fact that disciplinary differences held across year levels, however.

Class Size proved to be the most powerful influence on student ratings in our study. Even when differences between means were not substantial, the pattern of results persisted in almost all cases when Discipline and Level were controlled for. Consistently, the smaller the class size, the higher the ratings of teaching, with no indication of a U-shaped curvilinear relationship as found in some studies referred to by Feldman (1978, 1984). The reasons for these relationships are still not clear, and we can only theorise until we have firmer independent evidence that teaching practices are actually different for different class sizes.
It may be that teachers change their teaching approach when teaching large classes compared with small classes in such a way that students rate it less well. Certainly the very strong relationship of Availability to class size is understandable, as larger numbers may well make opportunities for questions and consultation more limited. The relationship of class size to Communication and Feedback is less easy to interpret. Perhaps the demands of large classes do in fact influence the effectiveness of the communication process and the ability to give adequate feedback. It should also be recognised that with very large classes, feedback is more likely to be given by tutors than by the lecturer who is being evaluated.

Although course Level proved significant for the Communication, Availability and Feedback scales, its influence on ratings was weaker than that of the other two course characteristics, and was quite weak once Class Size was taken into account. One interesting finding, however, was that disciplinary differences were more apparent at the higher level (third year and above) for the Availability scale. This would seem strange in view of the fact that staff and students are more likely to be highly motivated at these levels, yet these disciplinary differences held. We need to analyse how staff interact with students differently in the disciplines to understand what might account for this.

Overall the results indicate that students perceive significant differences in their teachers in the context of different disciplines and class sizes. Although the reasons for this are not verified beyond our own understanding of the area, these differences can at least be taken into account by teachers interpreting their results for formative or summative purposes.
To enable staff to make use of this information, we have developed Interpretation Guides which give staff the norms for their discipline, class size and year level when they receive their results from student evaluations. To enable us to interpret the effects of these course characteristics better, it is planned to conduct classroom observations and interviews which may help our understanding of what underlies the different perceptions of teaching in different disciplines and in different class sizes.

References

Avi-Itzhak, T. and Kremer, L. (1983). The Effects of Organizational Factors on Student Ratings and Perceived Instruction. Higher Education, 12, 411-418.

Becher, T. (1989). Academic Tribes and Territories. Milton Keynes: SRHE and Open University Press.

Cashin, W. E. (1990). Students do rate different academic fields differently. In Theall, M. and Franklin, J. (Eds), Student Ratings of Instruction: Issues for Improving Practice. San Francisco: Jossey-Bass Inc.

Cashin and Clegg (1987). Are Student Ratings of Different Academic Fields Different? Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.

Cranton, P. A. and Smith, R. A. (1986). A New Look at the Effect of Course Characteristics on Student Ratings of Instruction. American Educational Research Journal, 23, 1, 117-128.

Feldman, K. A. (1978). Course Characteristics and College Students' Ratings of Their Teachers: What we know and what we don't know. Research in Higher Education, 9, 199-242.

Feldman, K. A. (1984). Class Size and College Students' Evaluations of Teachers and Courses: A closer look. Research in Higher Education, 21, 1, 45-116.

________________________________________________________________________
Moya Adams, Ruth Neumann, Cathy Rytmeister, AARE, 1996