The Use of Student Evaluations of University Teaching in Different Settings: The Applicability Paradigm

Herbert W. Marsh and Lawrence A. Roche
University of Western Sydney, Macarthur
October 3, 1992

Running Head: Students' Evaluations of Teaching

Descriptors: Student Evaluation of Teacher Performance, Teacher Evaluation, Teacher Effectiveness, Higher Education, University Environment, Multitrait-Multimethod Analysis, SEEQ Instrument

Requests for further information should be sent to Herbert W. Marsh, Professor of Education, University of Western Sydney, Macarthur, PO Box 555, Campbelltown, NSW 2560, Australia. The authors would like to acknowledge financial support from the Australian Department of Employment, Education and Training National Priority (Reserve) Fund.

Abstract

The applicability paradigm has been used in five studies to evaluate the applicability of items from two North American instruments designed to measure students' evaluations of teaching effectiveness. In the present investigation, the paradigm is used at the new University of Western Sydney, Macarthur (UWSM). Items from both instruments were seen as appropriate and important, and differentiated among lecturers chosen as good, average and poor teachers. A multitrait-multimethod analysis of responses from the two instruments supported their convergent and divergent validity. The pattern of items judged to be most important at UWSM was more similar to patterns found at two research universities than patterns at TAFE or two other institutions. These results support the applicability of the instruments at UWSM and across a diversity of educational settings.
In Australian universities, students' evaluations of teaching effectiveness (SETs) are becoming more widely considered for purposes such as feedback to lecturers that may lead to the improvement of teaching, student course selection, personnel decisions, and research on teaching (e.g., Dunkin, 1990; Isaacs, 1989; Moses, 1986a, 1986b; Prosser & Trigwell, 1990). In North American universities, SETs are collected almost universally and much of the existing research comes from there. This research shows that the ratings are multidimensional (a teacher may be enthusiastic but lack organization), reliable, stable, reasonably valid against a variety of indicators of effective teaching, relatively unrelated to a wide variety of background variables, and apparently useful to lecturers for purposes of improving teaching effectiveness (see Marsh, 1987, for an overview of this research). However, there have been only limited attempts to test the applicability of North American instruments, or the generalizability of findings from North American research, to Australian settings. Marsh (1981a) noted that there is danger in assuming that instruments developed in one setting can be used effectively in new settings without first testing their applicability. In order to address this issue, he introduced the applicability paradigm for studying the applicability of two widely researched North American instruments: Students' Evaluations of Educational Quality (SEEQ; Marsh, 1982a, 1982b, 1984, 1987, in press; Marsh & Hocevar, 1991) and Endeavor (Frey, 1973, 1978; Frey, Leonard & Beatty, 1975). Thus far, applicability paradigm studies have been conducted in Australia, both at the University of Sydney (Marsh, 1981a) and in the Technical and Further Education (TAFE) sector (Hayton, 1983), and in Papua New Guinea (PNG; Clarkson, 1984), Spain (Marsh, Touron, & Wheeler, 1985), and New Zealand (Watkins, Marsh, & Young, 1987).
In a review of all but the subsequent New Zealand study, Marsh (1986) concluded that the results provided strong support for the applicability of both North American instruments across a diversity of educational settings. He further suggested that by comparing the importance assigned to different components of teaching effectiveness in the studies, it may be possible to infer differences in academic climate. For example, consistent with Clarkson's (1984) observation that the educational climate in the PNG study differed substantially from that in a typical Western environment, Marsh (1986) reported that the pattern of importance ratings in the PNG study was the most distinct, whereas patterns in the Spanish and TAFE studies were more similar to each other than to the pattern in the University of Sydney study. The present study has two main purposes: to apply the applicability paradigm at the newly established University of Western Sydney, Macarthur (UWSM) and to use these findings to compare the academic climate at UWSM with those in other applicability studies (including the New Zealand study that was not included in Marsh's 1986 review). Prior to November 1989, UWSM was an autonomous institution within the College of Advanced Education sector -- the middle tier of Australia's three-tier system of higher education, the other two tiers consisting of research universities like the University of Sydney and the TAFE sector. However, the formal distinction between research universities and Colleges of Advanced Education was abolished and all institutions from the middle tier were either amalgamated into one of the old universities or formed new universities. (For further discussion of the restructuring, see Harman, 1989; Meek & O'Neill, 1990.) What was to become UWSM combined with two other former Colleges of Advanced Education to establish the three federated network members of the University of Western Sydney.
Whereas the formal distinction between the old Colleges of Advanced Education and universities has been abolished, it may take longer for differences in educational climates to disappear. From this perspective, UWSM is in the midst of an important transition. Hence, it is particularly timely to use the applicability paradigm to compare the educational climate at UWSM with climates at other settings in which the paradigm has been used. Of particular relevance is the question of whether the inferred UWSM climate is more similar to the climates in the two research universities (Sydney University and the University of Canterbury, New Zealand), the TAFE sector, or the other two overseas institutions (Universidad de Navarra, Spain and the Papua New Guinea University of Technology).

Multidimensionality of Students' Evaluations of Teaching Effectiveness

Effective teaching is a multidimensional construct (e.g., a teacher may be organized but lack enthusiasm). Thus, it is not surprising that a considerable body of North American research has shown that SETs are also multidimensional (see Marsh, 1987). In evaluating the need to distinguish among appropriately defined multiple dimensions it is important to consider the purposes that the evaluations are intended to serve. Marsh (1984, 1987; also see Braskamp, Brandenburg & Ory, 1985; Centra, 1979; Doyle, 1983; McKeachie, 1979; Murray, 1980) noted that student ratings are used variously to provide: (a) formative feedback to faculty about the effectiveness of their teaching; (b) a summative measure of teaching effectiveness to be used in personnel decisions; (c) information for students to use in the selection of lecturers and courses; and (d) an outcome or a process description for research on teaching.
Whereas there is some disagreement about whether a single summary score is more useful than multidimensional ratings for purposes of personnel decisions, there is general agreement that appropriately constructed multiple dimensions are more useful for the other three purposes. Information from SETs depends upon the content of the items. Poorly worded or inappropriate items will not provide useful information. If a survey instrument contains an ill-defined hodgepodge of different items and student ratings are summarized by an average of these items, then there is no basis for knowing what is being measured. Particularly when the purpose of the ratings is formative, it is important that careful attention be given to the components of teaching effectiveness that are to be measured. Surveys should contain separate groups of related items which are derived from a logical analysis of the content of effective teaching and the purposes which the ratings are to serve, and should be supported by empirical procedures such as factor analysis and multitrait-multimethod (MTMM) analysis. The SET literature contains several examples of well constructed instruments with clearly defined factor structures that provide measures of distinct components of teaching effectiveness. In addition to his SEEQ instrument, Marsh (1987) noted Frey's Endeavor instrument (Frey, Leonard & Beatty, 1975; also see Marsh, 1981a, 1986), the Student Description of Teaching questionnaire (Hildebrand, Wilson & Dienst, 1971), and the Michigan State Student Instructor Rating System (Warrington, 1973). Factor analyses of responses to each of these instruments provided clear support for the factor structure they were designed to measure, demonstrating that the SETs do measure distinct components of teaching effectiveness. He suggested that the systematic approach used in the development of these instruments and the similarity of the factors which they measure, supports their construct validity. 
Two of these instruments -- Endeavor and SEEQ -- are used in applicability paradigm studies including the present investigation. The Endeavor Instrument The Endeavor instrument measures seven components of effective teaching -- components that have been identified in factor analytic studies in different settings (Frey, Leonard & Beatty, 1975). The seven Endeavor scales are: Presentation Clarity, Workload, Personal Attention, Class Discussions, Organization/Planning, Grading, and Student Accomplishments (the items in each scale are presented in subsequent discussion of Table 4). Frey has shown that Endeavor responses are correlated with student learning (Frey, 1973, 1978; Frey, Leonard, & Beatty, 1975) in multisection validity studies. In these studies (see Cohen, 1981; Marsh, 1987 for an overview), SETs are collected in large multisection courses in which large numbers of students are divided into separate sections. All instruction is delivered separately to each section by different lecturers, but each section is taught according to a similar course outline, has similar goals and objectives, and, most importantly, is tested with the same standardized final examination at the end of the course. Frey demonstrated that the sections shown to be taught most effectively according to the SETs were the ones that learned the most, thus supporting the validity of the Endeavor ratings. Frey (1978) emphasized that it is important to recognize the multidimensionality of SETs. In an examination of relations between Endeavor responses and a variety of criterion variables, he demonstrated that the size and even the direction of correlations vary depending on the Endeavor scale and the particular criterion. 
The SEEQ Instrument In the development of SEEQ: 1) a large item pool was obtained from a literature review, forms in current usage, and interviews with faculty and students about what they saw as effective teaching; 2) students and faculty were asked to rate the importance of items; 3) faculty were asked to judge the potential usefulness of the items as a basis for feedback; and 4) open-ended student comments were examined to determine if important aspects had been excluded. These criteria, along with psychometric properties, were used to select items and revise subsequent versions, thus supporting the content validity of SEEQ responses. The SEEQ scales and the items in each scale are presented in subsequent discussion of Table 4. In addition, a copy of the most recent version of the SEEQ (incorporating minor Australian spelling and usage modifications) is presented in the appendix. Factor analytic support for SEEQ is particularly strong. To date, more than 30 published factor analyses of SEEQ responses have identified the factors that SEEQ is designed to measure (e.g., Marsh, 1982b; 1984; 1987; in press; Marsh & Hocevar, 1991). Factor analysis of responses by 50,000 classes (representing responses to nearly 1 million SEEQ surveys) provided clear support for the SEEQ factor structure (Marsh & Hocevar, 1991). In separate analyses of responses from 21 different groups representing different levels of instruction (e.g., undergraduate and graduate level courses) and a diversity of academic disciplines, the same set of SEEQ factors were identified. When lecturers evaluated their own teaching effectiveness on the same SEEQ form as completed by their students, factor analyses of student ratings and lecturer self-evaluations each identified the same SEEQ factors (Marsh, 1982b; Marsh, Overall & Kesler, 1979). 
Marsh and Bailey (1991) evaluated profiles of SEEQ responses for a cohort of 221 teachers who had been evaluated with SEEQ regularly (an average of 25 sets of ratings per teacher) over a 13 year period. Not only were ratings on separate SEEQ scales stable over time (Marsh & Hocevar, in press), but so were the multidimensional profiles of ratings. The profile for each teacher (e.g., high on Enthusiasm but low on Organization) was distinct from the profiles of other teachers, and generalized over time and course level (graduate vs. undergraduate). These studies demonstrate the broad generalizability of SEEQ factors over time, across academic discipline, and across responses by students and by teachers. SEEQ responses have been successfully validated in relation to learning in multisection validity studies (Marsh, Fleiner & Thomas, 1975; Marsh & Overall, 1980), the ratings of former students (Marsh, 1977; Overall & Marsh, 1980), lecturer self-evaluations of their own teaching effectiveness (Marsh, 1982b; Marsh, Overall & Kesler, 1979), affective course consequences such as plans to pursue further study (Marsh & Overall, 1980), and a variety of other criteria (see Marsh, 1987, for an overview). Sixteen potential sources of bias (e.g., class size, expected grades, course level, prior subject interest) were relatively unrelated to SEEQ responses (Marsh, 1980; 1983). SEEQ ratings are primarily a function of the lecturer who teaches a course and not the course that is being evaluated (Marsh, 1981b; Marsh & Overall, 1981). Feedback from SEEQ responses, particularly when coupled with a candid discussion with an external consultant, led to improved student ratings and better student learning (Overall & Marsh, 1979; also see Cohen, 1980). This research is largely based on a construct validity approach in which the various SEEQ scales are shown to be differentially related to each validity criterion. 
For example, MTMM studies of agreement between SETs and lecturer self-evaluations showed lecturer/student agreement on matching scales was higher than agreement on nonmatching scales for all 9 SEEQ factors. This approach is also important in interpreting relations between SETs and potential biases. For example, class size is negatively related to SETs but almost all of the relation is due to lower ratings of Group Interaction and Individual Rapport, whereas the remaining 7 SEEQ factors are almost unrelated to class size. A similar pattern of responses was observed between class size and lecturer self-evaluation, suggesting that the class size is a valid source of influence for these SEEQ factors. Thus, Marsh (1987), like Frey (1978) emphasized the multidimensionality of SETs. Feldman's Categories of Effective Teaching Feldman (1976; also see Feldman, 1983, 1984) took an alternative approach to determining the different components of effective teaching. He categorized the different characteristics of the superior university teacher from the student's point of view. Feldman systematically reviewed research that either asked students to specify these characteristics or inferred them on the basis of correlations between specific characteristics and overall SET. His list of categories (Table 1) provides the most extensive set of characteristics that are likely to underlie SETs. Nevertheless, Feldman used primarily a logical analysis based on his examination of the SET literature, and his results do not necessarily imply that students can differentiate these characteristics. This set of characteristics does, however, provide a useful basis for evaluating the comprehensiveness of the set of evaluation factors on a given instrument. Insert Table 1 About Here Feldman (1976) noted that factors identified by factor analysis typically corresponded to more than one of his categories. The highest loading items on any given factor often come from more than one of his categories. 
In Table 1 we have attempted to match Feldman's categories with the empirical factors identified in responses to the SEEQ and Endeavor instruments that are used in the present investigation. There is substantial overlap in the empirical factors from the two instruments (also see Marsh, 1986). Most of Feldman's categories are associated with empirical factors from the two instruments, although SEEQ factors represent Feldman's categories more comprehensively than do Endeavor factors. All of the empirical factors in SEEQ and Endeavor represent at least one of Feldman's categories and most reflect two or more categories. In contrast, none of Feldman's categories reflects more than one of the empirical factors. This logical content analysis demonstrates that there is substantial overlap between Feldman's categories and the empirical factors but that Feldman's categories reflect more narrowly defined constructs than do the empirical factors. The Applicability Paradigm In the first applicability paradigm study Marsh (1981a) asked University of Sydney students from diverse disciplines to select "one of the best" and "one of the worst" lecturers they had experienced, and to rate each on an instrument containing SEEQ and Endeavor items. Because of the politically sensitive nature of the study, students were specifically instructed not to indicate their own name or the names of lecturers whom they had selected. As part of the study, students were asked to indicate "inappropriate" items, and to select up to five items that they "felt were most important in describing either positive or negative aspects of the overall learning experience in this instructional sequence" for each lecturer who they evaluated. 
Analyses of the results included a discrimination analysis examining the ability of items and factors to differentiate between "best" and "worst" lecturers, a summary of "not appropriate" responses, a summary of "most important item" responses, and a MTMM analysis of agreement between SEEQ and Endeavor scales. This applicability paradigm was subsequently used in four other studies: Hayton (1983) with Australian students in TAFE; Marsh, Touron and Wheeler (1985) with students from the Universidad de Navarra in Spain; Clarkson (1984) with students from the Papua New Guinea (PNG) University of Technology; and Watkins, Marsh and Young (1987) with students from the University of Canterbury in New Zealand. The PNG study differed from the others in that it was based on a much more limited selection of students and teachers. Clarkson also noted that the PNG setting was "non-Western" and differed substantially from the "Western" settings in most SET research. The TAFE, New Zealand, and Spanish studies differed from the other two in that students were asked to select a good, an average, and a poor teacher instead of a best and worst teacher, and the five-point response scale used in the original study was expanded to include nine response categories. The Spanish study also differed in that the items were first translated into Spanish. (English is the official language in PNG even though Clarkson, 1984, noted that 720 different languages are spoken in that country.) The groups of best and worst lecturers selected by the students constitute criterion groups, and student ratings should be able to differentiate between these groups. In each of the studies, all but the Workload/Difficulty items strongly differentiated between the groups. In the more recent studies with three criterion groups -- good, average, and poor lecturers -- nearly all of the between group variance could be explained by a linear component. 
Differences in the Workload/Difficulty items tended to be much smaller, sometimes failing to reach statistical significance, although the best teachers tended to teach courses that were judged to be more difficult and to have a heavier workload. It is hardly surprising that a lecturer selected as being "best" by a student is consistently rated more favourably than one who is selected as being "worst". The halo effect produced by this selection process probably exaggerates the differentiation among groups, but also makes it more difficult to distinguish among the multiple components of effective teaching in the MTMM analyses and substantially increases the size of correlations among the different factors. Hence, the differentiation is a double-edged sword; too little would suggest that the ratings are not valid, but too much would undermine support for their multidimensionality. SEEQ and Endeavor items were judged to be "inappropriate" if a student specifically indicated the item to be inappropriate or failed to respond to the item. Results from all the studies were similar in that every item was judged to be appropriate by 80% or more of the students, even though 3 to 7 of the 55 items were judged to be "inappropriate" by more than 10% of the students; on average each student indicated 2 of the 55 items to be inappropriate. The items most frequently judged to be "inappropriate" came from the Group Interaction, Individual Rapport, Examination, and Assignment factors from the two instruments. After completing a survey, students were asked to select up to five items that were "most important" in describing the overall learning environment. In all but the PNG study, every item was selected by at least some of the students as being "most important." Across all the studies the most frequently nominated items came from the Enthusiasm, Learning/value, and Organization/clarity factors. 
These findings support the applicability of the SEEQ and Endeavor items in a diversity of settings. It is also important to note that the items seen to be most important varied from study to study, and an evaluation of these patterns is an important focus of the present investigation. The SEEQ and Endeavor forms were independently designed, and do not even measure the same number of components of effective teaching. Nevertheless, a content analysis of the items and factors from each instrument suggests that there is considerable overlap (Marsh, 1981a; also see Table 1). There appears to be a one-to-one correspondence between five SEEQ factors (Learning/Value, Group Interaction, Individual Rapport, Examinations/Grading, Workload/Difficulty) and five Endeavor factors (Student Accomplishments, Class Discussions, Personal Attention, Grading, and Workload), whereas the Organization/Clarity factor from SEEQ seems to combine the Organization/Planning and Presentation Clarity factors from Endeavor. The remaining three SEEQ factors -- Instructor Enthusiasm, Breadth of Coverage, and Assignments -- do not parallel any factors from Endeavor. This hypothesized structure of correspondence between SEEQ and Endeavor factors is the basis of the MTMM analyses conducted in each of the studies. In MTMM analyses, relations are examined among multiple traits (the student evaluation factors) measured by multiple methods (the SEEQ and Endeavor instruments). The MTMM matrix is used to infer evidence in support of convergent validity and discriminant validity. Convergent validity is inferred when there is substantial agreement between matching traits measured by different methods (e.g., the SEEQ Learning/Value scale and the Endeavor Student Accomplishments scale). Divergent (discriminant) validity refers to the distinctiveness of the different scales. It is inferred when correlations between matching traits are systematically higher than other correlations in the MTMM matrix.
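The comparison between matching and nonmatching scales described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' analysis (which was run in SPSSx): the function name `campbell_fiske_summary`, the argument names, and the small correlation matrix are hypothetical, and the counting rule (each convergent validity is compared with every nonmatching correlation in its own row and column) is one plausible reading of the criterion.

```python
import numpy as np

def campbell_fiske_summary(r_cross, matches):
    """Compare convergent validities with nonmatching cross-instrument correlations.

    r_cross : 2-D array; rows are scales from one instrument, columns are
              scales from the other (the heteromethod block of the MTMM matrix).
    matches : list of (row, col) index pairs hypothesised to measure the same trait.
    """
    r_cross = np.asarray(r_cross, dtype=float)
    match_set = set(matches)
    n_rows, n_cols = r_cross.shape
    satisfied = 0
    comparisons = 0
    for i, j in matches:
        # Nonmatching correlations that share a row or a column with this pair.
        others = [r_cross[i, k] for k in range(n_cols) if (i, k) not in match_set]
        others += [r_cross[k, j] for k in range(n_rows) if (k, j) not in match_set]
        for other in others:
            comparisons += 1
            if r_cross[i, j] > other:
                satisfied += 1
    convergent = [r_cross[i, j] for i, j in matches]
    return {
        "mean_convergent": float(np.mean(convergent)),
        "comparisons": comparisons,
        "satisfied": satisfied,
    }

# Hypothetical 3 x 3 heteromethod block (illustrative values, not study data).
demo = [[0.80, 0.50, 0.40],
        [0.45, 0.85, 0.55],
        [0.35, 0.50, 0.75]]
print(campbell_fiske_summary(demo, [(0, 0), (1, 1), (2, 2)]))
```

Under this counting rule, a 9 x 7 block in which five scales match one-to-one and one row matches two columns yields 5 x 14 + 2 x 13 = 96 comparisons.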
Support for the divergent validity also constitutes support for the multidimensionality of the ratings. Application of criteria developed by Campbell and Fiske (1959; also see Marsh, 1988) in each of the different studies provided clear support for both convergent and divergent validity of the responses to the SEEQ and Endeavor instruments. In summary, these studies supported the applicability of the two North American instruments in each of the settings where the paradigm was used. Methods Sample and Procedures. Lecturers from a cross-section of disciplines at UWSM were asked to seek student volunteers from their classes to complete the evaluation instrument. Students were informed that the research was part of a study about teaching effectiveness and that they would not be asked to identify themselves or the lecturers whom they selected to evaluate. Volunteer students were given the questionnaire in a preaddressed, postage paid envelope at the end of the class session and asked to complete the survey within the next week. Completed surveys were returned to the School offices in sealed envelopes or directly mailed to the first author. Completed questionnaires were obtained from a total of 51 undergraduate students who provided evaluations of 153 different classes (a "poor", an "average," and a "good," lecturer for each of the 51 students). Approximately equal numbers of students were enrolled in first, second and third year courses and there were classes representing all five schools at UWSM: Arts, Business and Technology (including mathematics and sciences), Education and Language Studies, Nursing, and Welfare. The majority of evaluations, however, were of courses in Education and Language Studies and in Business and Technology. Each questionnaire consisted of a cover page with instructions and a limited number of items requesting demographic information. 
Students were initially requested to select a good, an average, and a poor lecturer from their experience at UWSM. Students were asked to try to limit their choices to lecturers who were in charge of an instructional sequence that lasted at least one term, and who taught courses that employed a lecture or discussion format. Students were then asked to complete three separate questionnaires, one for each of the "good," "average," and "poor" lecturers. The SEEQ and Endeavor items, in paraphrased form, and the scales that they reflect appear in Table 4. Items were presented in a randomized order on the questionnaire. Students responded to items on a nine-point response scale which varied from "1--very poor, very low, or almost never" to "9--very good, very high, or almost always" (except for the three SEEQ items designed to measure Workload/Difficulty -- see Table 4). An additional "not appropriate" response was provided for items judged to be not relevant to the particular course being evaluated (responses to items left blank were also counted as "not appropriate"). After completing the ratings for a given lecturer, students were asked to select up to five items that they felt were "most important in describing either positive or negative aspects of the overall learning experience in this instructional sequence." Statistical Analysis Each item was tested in terms of: (a) its ability to discriminate among the good, average, and poor lecturers; (b) its appropriateness (i.e., a lack of "not appropriate" responses); and (c) its importance (i.e., the number of "most important" nominations). The internal consistency estimates of reliability were computed for each of the SEEQ and Endeavor scales. MTMM analyses were used to test the convergent and discriminant validity of SEEQ and Endeavor responses. All statistical analyses were conducted with the commercially available SPSSx statistical package (SPSS, 1988). 
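As an illustration of the internal consistency estimates mentioned above, the following Python sketch computes coefficient alpha for one scale. The text does not state which internal consistency estimate was used, so coefficient alpha is an assumption here; the function name `cronbach_alpha` and the example ratings are hypothetical.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha for a (respondents x items) matrix from a single scale."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                                  # number of items in the scale
    sum_item_var = x.var(axis=0, ddof=1).sum()      # sum of the item variances
    total_var = x.sum(axis=1).var(ddof=1)           # variance of the scale totals
    return k / (k - 1) * (1.0 - sum_item_var / total_var)

# Hypothetical nine-point ratings: five evaluations of a three-item scale.
ratings = [[8, 9, 8],
           [5, 4, 5],
           [2, 3, 1],
           [7, 7, 8],
           [4, 5, 5]]
print(round(cronbach_alpha(ratings), 3))
```

Alpha approaches 1 when the items of a scale rise and fall together across evaluations. Treating "not appropriate" responses as missing would require an additional rule that is not shown here.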
Results from the present investigation were also compared with those from five other applicability paradigm studies that were reviewed earlier. Methodological details of the earlier studies and the results used here are all available in the original published version of each study or in Marsh's 1986 summary of the applicability paradigm, and were summarized earlier.

Results

Convergent and Discriminant Validity

The SEEQ and Endeavor instruments are designed to measure 9 and 7 scales of effective teaching respectively. Correlations among these 16 scales are presented in the form of a MTMM matrix (Table 2) in which the different scales are the multiple traits and the different instruments are the multiple methods. Based on a content analysis of the two instruments, Marsh (1981a; see earlier discussion and Table 1) noted that 5 SEEQ factors correspond to 5 Endeavor factors and that a sixth SEEQ scale corresponds to two different Endeavor factors. For purposes of the MTMM analyses in the applicability paradigm studies, correlations among "matching scales" are interpreted as convergent validities (the 7 correlations marked with asterisks in Table 2). Support for the convergent and divergent validity of the responses requires that the convergent validities are large, and larger than other correlations in the MTMM matrix. The traditional application of MTMM analysis requires that the same scales are measured with each instrument, but with minor modifications the Campbell-Fiske criteria have been applied in the applicability studies (Marsh, 1981a, 1986). In this study:

1. Convergent validities, correlations between SEEQ and Endeavor factors that are hypothesized to match (correlations marked with asterisks in Table 2), should be substantial. Convergent validities vary from .66 to .96 (mean r = .83) and all but one are greater than .80. These results clearly satisfy this criterion.

2.
One criterion of discriminant validity is that convergent validities should be higher than correlations between nonmatching SEEQ and Endeavor scales in the same row and column of the rectangular submatrix (i.e., the heterotrait-heteromethod correlations). The convergent validities (mean r = .83) are consistently higher than the comparison correlations for this criterion (mean r = .54), and this criterion is satisfied for 95 of 96 comparisons. These results clearly satisfy this criterion.

3. The second criterion of discriminant validity is that the convergent validities (mean r = .83) should be higher than heterotrait-homomethod correlations -- correlations among the SEEQ factors (mean r = .63) and correlations among the Endeavor scales (mean r = .63). This criterion is satisfied for 93 of 98 comparisons, so these results also satisfy this criterion.

Insert Tables 2 and 3 About Here

Although not formally part of the MTMM analysis, relations between the specific evaluation scales and three global criteria are presented in Table 2. The first two are overall ratings of the course and the lecturer from the SEEQ instrument. Results from North American settings (Marsh, 1982b, 1983, 1984, 1987; Marsh & Hocevar, 1991) consistently show that the overall lecturer rating is most strongly related to Instructor Enthusiasm and, to lesser extents, the Organization/Clarity and Learning/Value factors, whereas the overall course rating is most strongly related to the Learning/Value and, to a lesser extent, the Instructor Enthusiasm factors. Although the overall rating items are highly correlated with all the SEEQ factors except Workload/Difficulty, a similar pattern is evident in the correlations in Table 2. The third global indicator -- the discrimination index in Table 2 -- is the linear component of differences between ratings of lecturers selected by students to be good, average, and poor.
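The idea of a linear component of differences among the poor, average, and good groups can be illustrated with a short Python sketch. The exact formula behind the discrimination index in Table 2 is not given in the text, so this is an assumption: the sketch partitions the between-group sum of squares using the standard linear contrast weights (-1, 0, +1) for three equally sized groups, and the function name `linear_share` and the ratings are hypothetical.

```python
import numpy as np

def linear_share(poor, average, good):
    """Share of between-group variation captured by a linear trend.

    Assumes three equally sized groups ordered poor < average < good and
    at least some difference among the group means.
    """
    groups = [np.asarray(g, dtype=float) for g in (poor, average, good)]
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    means = np.array([g.mean() for g in groups])
    grand_mean = means.mean()
    ss_between = n * ((means - grand_mean) ** 2).sum()
    weights = np.array([-1.0, 0.0, 1.0])            # linear contrast weights
    ss_linear = n * (weights @ means) ** 2 / (weights ** 2).sum()
    return ss_linear / ss_between

# Hypothetical nine-point ratings of one item for the three lecturer groups.
print(linear_share(poor=[2, 3, 1, 2], average=[5, 4, 6, 5], good=[8, 9, 7, 8]))
```

A value near 1 means that the "average" group falls roughly midway between the "poor" and "good" groups. Because each student rated all three lecturers, a full analysis would also need to account for the repeated-measures structure, which this sketch ignores.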
As with the overall rating items, these discrimination indices are all very large except those associated with the Workload/Difficulty scales. It is interesting to note, however, that the highest indices are for the Organization/Clarity (SEEQ) factor and the corresponding Clarity of Presentations (Endeavor) factor. The results of the MTMM analysis of the UWSM data are compared with the results from the five earlier applicability paradigm studies in Table 3. This comparison indicates that the UWSM results are generally similar to those from the other studies. In relation to the mean across all studies, the UWSM results show slightly higher reliability estimates and slightly higher convergent validities, but correlations among nonmatching SEEQ and Endeavor factors are also slightly higher. Thus, support for convergent validity is modestly stronger in the UWSM study whereas support for discriminant validity is modestly weaker. These differences, however, are small and there is clear support for the convergent and divergent validity of SEEQ and Endeavor responses in each of the different studies. Whereas there is clear support for the convergent and discriminant validity of the SEEQ and Endeavor responses, the correlations among most of the scales (all but the Workload/Difficulty and Workload scales) are very high. In his review of the applicability paradigm, Marsh (1986; also see Marsh, 1987) noted that this was due in part to a halo bias that is apparently inherent in the applicability paradigm. Because students were specifically instructed to select a "poor," an "average," and a "good" lecturer, there is a likely halo effect in which the "poor" lecturers receive poorer ratings on all the scales and the "good" lecturers receive higher ratings on all the scales. Because the applicability paradigm has not actually been conducted in a North American setting, this pattern cannot be evaluated in relation to findings from there.
We would expect, however, that such findings would be similar to those observed here and in the other applicability paradigm studies. In typical use, ratings are based on class-average responses for instructors not specifically chosen as being poor, average, or good, thus reducing the likelihood of such halo effects. Although correlations between class-average sets of ratings such as those considered by Marsh and Hocevar (1991) are not directly comparable, those results indicate that correlations among the different scales are much smaller than those found in the applicability paradigm studies, including the present one. This supports the interpretation of halo effects offered here.

Inappropriate and Most Important Items

Inappropriate items. Items were designated to be inappropriate if students specifically marked them as inappropriate or left the item blank. In the UWSM study (Table 4), 1.8% of the SEEQ and Endeavor items -- an average of about 1 item for each of the 153 sets of evaluations -- were judged to be inappropriate. This value is smaller than the mean of 3.8% across all the applicability studies and is the smallest reported in any of the 6 studies. Whereas all the previous studies had at least one item judged to be inappropriate for at least 10% of the evaluations, the largest value in the UWSM study is only 8%. It is interesting to note that, along with the UWSM study, the lowest percentages of inappropriate items were found in the two studies involving research universities (Sydney University, 3.6%; and Canterbury University, 3.4%). Overall, these results indicate that the SEEQ and Endeavor items are generally appropriate for each of the different settings, but that they may be particularly appropriate at UWSM and the two research universities.

Insert Table 4 About Here

Across all studies, the items most frequently judged to be inappropriate come from the Group Interaction, Individual Rapport, Examination, and Readings/Assignments scales from the two instruments.
Whereas items from the Readings/Assignments scales are also most frequently judged to be inappropriate in the UWSM study, the number of inappropriate items in the other scales is consistently lower than in the other studies. This may reflect the typically small class sizes at UWSM and the fact that students are formally assessed in all classes.

Most important items. For each of the 153 sets of evaluations, students were asked to nominate up to 5 items as being most important in describing the overall learning environment. In the UWSM study (Table 4) every SEEQ and Endeavor item was nominated as being most important by at least some students. There were, however, large differences in the number of times different items were selected. Six items, all from SEEQ, were selected as most important for at least 15% of the sets of ratings: teaching style held your interest (33%), lecturer explanations were clear (27%), lecturer was dynamic and energetic (20%), the course was challenging and stimulating (19%), the lecturer was enthusiastic about teaching (16%), and the lecturer enhanced presentations with the use of humour (16%). These items represent the Instructor Enthusiasm, Learning/Value, and Organization/Clarity scales. Across all the different applicability studies, there appeared to be a similar pattern of results.

Patterns of relations in most important responses from different studies. Clarkson (1984) noted that the pattern of most important responses in the PNG study seemed to differ substantially from that in the Sydney University study, whereas we suggested that the pattern for the UWSM study seemed to be similar to the average pattern across all studies. More generally, Marsh (1986) proposed that the relative similarity in patterns of the "most important" responses across all the studies may provide an interesting way to better understand the educational contexts in the different settings.
Because of the large number of values (importance indices for 55 items in each of 6 studies and their total), however, an objective index of similarity is needed. In order to index the similarity of two or more sets of scores (i.e., the sets of importance indices in each of the studies), Nunnally (1978) suggested that the two sets of coefficients should be correlated -- a proposal that he likened to the Q approach to factor analysis. Whereas tests of statistical significance for such similarity indices are not available, the indices provide a readily interpretable descriptive index of the similarity between two sets of coefficients on a standardized metric that is familiar to most researchers. Following this proposal, a matrix of similarity indices was constructed (Table 5) to index the similarity in patterns of most important items in each of the six applicability studies and the total across all the studies.

Insert Table 5 About Here

The similarity indices relating each individual study to the total indicate how representative the pattern of importance ratings in a particular study is of the overall pattern. Consistent with Clarkson's (1984) suggestion, the pattern in the PNG study is much less consistent with the overall pattern than that in any of the other studies, although the similarity index (.596) is still substantial. UWSM has the highest index (.913), followed by the two research universities (Sydney University, .879, and the University of Canterbury, .878), the Spanish university (.861), and TAFE (.847). This indicates that the pattern of most important items at UWSM is the most representative of any of the applicability paradigm studies. A related question of interest in this investigation is how the patterns from each of the different studies compare with each other.
The highest similarity index between two individual institutions (.857) is between the two research universities, followed by the similarity of UWSM to the University of Canterbury (.822) and the similarity of UWSM to Sydney University (.803). The pattern of importance indices at UWSM is less similar to the patterns at the Spanish university and TAFE (.746 and .750) and particularly to the pattern at PNG (.440). Thus, in relation to the results of this study, UWSM is more similar to the two research universities than to institutions in the other applicability studies.

Although suggestive, the interpretation of the similarity indices must be made cautiously. First, the comparisons are based on only the relative importance of different components of teaching effectiveness as judged by students. Second, there were differences among the six studies that may have influenced the degree of similarity. The PNG study differed most drastically from the others -- in terms of the sample of students and limitations in the choice of criterion lecturers -- and this is the study with the most discrepant pattern of most important items. In the TAFE, Spanish, and UWSM studies -- in contrast to the University of Sydney, University of Canterbury, and PNG studies -- each student was asked to select three criterion lecturers instead of two and to use a nine-point response scale instead of a five-point scale. The methodological similarities in the two research university studies may have contributed to the similarity in patterns of most important items. These methodological differences, however, would apparently make the pattern for UWSM more similar to the TAFE and Spanish studies, and less similar to the two research university studies. Hence, this methodological consideration does not appear to undermine the conclusion that the UWSM pattern is more like the patterns at the two research universities than the patterns in the other studies.
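The similarity indices discussed above are ordinary product-moment correlations computed across items rather than across respondents. As a minimal sketch of that idea (the function names and the three short importance profiles are hypothetical illustrations, not values taken from Table 4), a matrix like the one in Table 5 could be built along these lines:

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def similarity_matrix(profiles):
    """Correlate the importance profile of every study with every other."""
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names}
            for a in names}


# Hypothetical proportions of "most important" nominations for four items
# in three studies (illustrative numbers only).
profiles = {
    "study_a": [0.30, 0.20, 0.10, 0.05],
    "study_b": [0.28, 0.22, 0.12, 0.04],
    "study_c": [0.05, 0.10, 0.20, 0.30],
}
matrix = similarity_matrix(profiles)
```

In this toy example, study_a and study_b rank the items in nearly the same order and so correlate highly, whereas study_c reverses the ordering and correlates negatively with both. In the paper itself, each profile would hold the 55 importance indices for one study.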
The most serious limitation, however, is the representativeness of the sample of students used in each study. The PNG sample was clearly not chosen to be representative of that institution and the TAFE sample was specifically chosen to be representative of that institution, but the self-selection of students in the other studies makes their representativeness difficult to judge.

Summary and Implications

Items from two North American instruments -- SEEQ and Endeavor -- designed to measure SETs were administered to a sample of UWSM students. Most of the items were judged to be appropriate by students, every item was chosen by at least a few students to be most important, and all but the Workload/Difficulty items clearly differentiated between lecturers chosen to be good, average, and poor. MTMM analyses provided support for the convergent and divergent validity of responses from the two instruments. Whereas correlations among different components of teaching effectiveness were large -- due in part to the built-in halo bias of asking students to evaluate teachers whom they had already selected to be good, average, or poor teachers -- the MTMM analyses indicated that students did differentiate among the components. These results clearly support the applicability of the two North American instruments to the UWSM setting, one of the major issues in this study. Taken together with the other 5 applicability studies, the results suggest that the components of teaching effectiveness derived from North American research are appropriate in a wide variety of educational settings.

The second major issue of this study was to develop more fully Marsh's (1986) suggestion that the pattern of importance that students placed on different components of teaching effectiveness in each applicability study may provide a basis for understanding the different educational contexts in the institutions that were considered.
Not surprisingly, the patterns at the two research universities were most similar, but the pattern at UWSM was more similar to patterns in the research universities than to those in the other applicability studies. Whereas limitations in the comparability of the different studies dictate caution, these results suggest that the educational climate at this newly formed university may be similar to those at older research universities -- at least in terms of the components of teaching seen to be important by undergraduate students. More generally, the results show, with the possible exception of the PNG study, that the relative importance placed on different components of teaching by students is surprisingly similar across a diversity of different settings.

REFERENCES

Braskamp, L. A., Brandenburg, D. C., & Ory, J. C. (1985). Evaluating teaching effectiveness: A practical guide. Beverly Hills, CA: Sage.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Centra, J. A. (1979). Determining faculty effectiveness. San Francisco: Jossey-Bass.
Clarkson, P. C. (1984). Papua New Guinea students' perceptions of mathematics lecturers. Journal of Educational Psychology, 76, 1386-1395.
Cohen, P. A. (1980). Effectiveness of student-rating feedback for improving college instruction: A meta-analysis. Research in Higher Education, 13, 321-341.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis of multisection validity studies. Review of Educational Research, 51, 281-309.
Doyle, K. O. (1983). Evaluating teaching. Lexington, MA: Lexington Books.
Dunkin, M. J. (1990). Willingness to obtain student evaluations as a criterion of academic staff performance. Higher Education Research and Development, 9, 51-60.
Feldman, K. A. (1976). The superior college teacher from the student's view. Research in Higher Education, 5, 243-288.
Feldman, K. A. (1983). The seniority and instructional experience of college teachers as related to the evaluations they receive from their students. Research in Higher Education, 18, 3-124.
Feldman, K. A. (1984). Class size and students' evaluations of college teacher and courses: A closer look. Research in Higher Education, 21, 45-116.
Frey, P. W. (1973). Student ratings of teaching: Validity of several factors. Science, 182, 83-85.
Frey, P. W. (1978). A two dimensional analysis of student ratings of instruction. Research in Higher Education, 9, 69-91.
Frey, P. W., Leonard, D. W., & Beatty, W. W. (1975). Student ratings of instruction: Validation research. American Educational Research Journal, 12, 327-336.
Harman, G. (1989). The Dawkins reconstruction of Australian higher education. Paper presented at the Annual Meeting of the American Educational Research Association, 27-31 March, San Francisco.
Hayton, G. E. (1983). An investigation of the applicability in Technical and Further Education of a student evaluation of teaching instrument. An unpublished thesis, Faculty of Education, University of Sydney.
Hildebrand, M., Wilson, R. C., & Dienst, E. R. (1971). Evaluating university teaching. Berkeley: Center for Research and Development in Higher Education, University of California, Berkeley.
Isaacs, G. (1989). Changes in ratings for staff who evaluated their teaching more than once. Assessment and Evaluation in Higher Education, 14, 1-10.
Marsh, H. W. (1977). The validity of students' evaluations: Classroom evaluations of instructors independently nominated as best and worst teachers by graduating seniors. American Educational Research Journal, 14, 441-447.
Marsh, H. W. (1980). The influence of student, course and instructor characteristics on evaluations of university teaching. American Educational Research Journal, 17, 219-237.
Marsh, H. W. (1981a). Students' evaluations of tertiary instruction: Testing the applicability of American surveys in an Australian setting. Australian Journal of Education, 25, 177-192.
Marsh, H. W. (1981b). The use of path analysis to estimate teacher and course effects in student ratings of instructional effectiveness. Applied Psychological Measurement, 6, 47-60.
Marsh, H. W. (1982a). SEEQ: A reliable, valid, and useful instrument for collecting students' evaluations of university teaching. British Journal of Educational Psychology, 52, 77-95.
Marsh, H. W. (1982b). Validity of students' evaluations of college teaching: A multitrait-multimethod analysis. Journal of Educational Psychology, 74, 264-279.
Marsh, H. W. (1983). Multidimensional ratings of teaching effectiveness by students from different academic settings and their relation to student/course/instructor characteristics. Journal of Educational Psychology, 75, 150-166.
Marsh, H. W. (1984). Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology, 76, 707-754.
Marsh, H. W. (1986). Applicability paradigm: Students' evaluations of teaching effectiveness in different countries. Journal of Educational Psychology, 78, 465-473.
Marsh, H. W. (1987). Students' evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11, 253-388. (Whole Issue No. 3)
Marsh, H. W. (1988). Multitrait-multimethod analyses. In J. P. Keeves (Ed.), Educational research methodology, measurement and evaluation: An international handbook (pp. 570-580). Oxford: Pergamon Press.
Marsh, H. W. (in press). Multidimensional students' evaluations of teaching effectiveness: A test of alternative higher-order structures. Journal of Educational Psychology.
Marsh, H. W., & Bailey, M. (1991). Multidimensional students' evaluations of teaching effectiveness: A profile analysis. (In Review).
Marsh, H. W., Fleiner, H., & Thomas, C. S. (1975). Validity and usefulness of student evaluations of instructional quality. Journal of Educational Psychology, 67, 833-839.
Marsh, H. W., & Hocevar, D. (1991). The multidimensionality of students' evaluations of teaching effectiveness: The generality of factor structure across academic discipline, instructor level, and course level. Teaching and Teacher Education, 7, 9-18.
Marsh, H. W., & Hocevar, D. (in press). Students' evaluations of teaching effectiveness: The stability of mean ratings of the same teacher over a 13-year period. Teaching and Teacher Education.
Marsh, H. W., & Overall, J. U. (1980). Validity of students' evaluations of teaching effectiveness: Cognitive and affective criteria. Journal of Educational Psychology, 72, 468-475.
Marsh, H. W., & Overall, J. U. (1981). The relative influence of course level, course type, and instructor on students' evaluations of college teaching. American Educational Research Journal, 18, 103-112.
Marsh, H. W., Overall, J. U., & Kesler, S. P. (1979). Validity of students' evaluations of teaching effectiveness: A comparison of faculty self-evaluations and evaluations by their students. Journal of Educational Psychology, 71, 149-160.
Marsh, H. W., Touron, J., & Wheeler, B. (1985). Students' evaluations of university instructors: The applicability of American instruments in a Spanish setting. Teaching and Teacher Education: An International Journal of Research and Studies, 1, 123-138.
McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 384-397.
Meek, V. L., & O'Neill, A. Organizational change in Australian higher education: Process and outcome. Australian Educational Researcher, 17, 1-23.
Moses, I. (1986a). Self and student evaluations of academic staff. Assessment and Evaluation in Higher Education, 11, 76-86.
Moses, I. (1986b). Student evaluation of teaching in an Australian university: Staff perceptions and reactions. Assessment and Evaluation in Higher Education, 11, 117-129.
Murray, H. G. (1980). Evaluating university teaching: A review of research. Toronto, Canada: Ontario Confederation of University Faculty Associations.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Overall, J. U., & Marsh, H. W. (1979). Midterm feedback from students: Its relationship to instructional improvement and students' cognitive and affective outcomes. Journal of Educational Psychology, 71, 856-865.
Overall, J. U., & Marsh, H. W. (1980). Students' evaluations of instruction: A longitudinal study of their stability. Journal of Educational Psychology, 72, 321-325.
Prosser, M., & Trigwell, K. (1990). Student evaluations of teaching and courses: Student study strategies as a criterion of validity. Higher Education, 20, 135-142.
SPSS (1986). SPSS-x user's guide (2nd ed.). New York: McGraw-Hill.
Warrington, W. G. (1973). Student evaluation of instruction at Michigan State University. In A. L. Sockloff (Ed.), Proceedings: The first invitational conference on faculty effectiveness as evaluated by students (pp. 164-182). Philadelphia: Measurement and Research Center, Temple University.
Watkins, D., Marsh, H. W., & Young, D. (1987). Evaluating tertiary teaching: A New Zealand perspective. Teaching and Teacher Education: An International Journal of Research and Studies, 2, 41-53.
Table 1
Categories of Effective Teaching Adapted From Feldman (1976, 1983, 1984) and the Students' Evaluations of Educational Quality (SEEQ) and Endeavor Factors Most Closely Related to Each Category
---------------------------------------------------------------------------------------
Feldman's Categories                SEEQ Factors              Endeavor Factors
----------------------------------  ------------------------  -------------------------
 1) Stimulation of interest         Instructor Enthusiasm     None
 2) Enthusiasm                      Instructor Enthusiasm     None
 3) Subject knowledge               Breadth of Coverage (a)   None
 4) Intellectual expansiveness      Breadth of Coverage       None
 5) Preparation and organization    Organization/Clarity      Organization/Planning
 6) Clarity and understandableness  Organization/Clarity      Presentation Clarity
 7) Elocutionary skills             None                      None
 8) Sensitivity to class progress   None                      None
 9) Clarity of objectives           Organization/Clarity      Organization/Planning
10) Value of course materials       Assignments/Readings      None
11) Supplementary materials         Assignments/Readings      None
12) Perceived outcome/impact        Learning/Value            Student Accomplishments
13) Fairness, impartiality          Examinations/Grading      Grading/Exams
14) Classroom management            None                      None
15) Feedback to students            Examinations/Grading      Grading/Exams (a)
16) Class discussion                Group Interaction         Class Discussion
17) Intellectual challenge          Learning/Value            Student Accomplishments (a)
18) Respect for students            Individual Rapport        Personal Attention
19) Availability/helpfulness        Individual Rapport        Personal Attention
20) Difficulty/workload             Workload/Difficulty       Workload
---------------------------------------------------------------------------------------
Note. The actual categories used by Feldman in different studies (e.g., Feldman, 1976, 1983, 1984) varied somewhat. Categories 14 and 20 were included by Feldman (1976) but not in subsequent studies.
(a) Whereas these factors most closely match the corresponding categories, the match is apparently not particularly close.

Table 2
Multitrait-multimethod Matrix of Correlations Among SEEQ Scales and Endeavor Scales For Responses By UWS Students (N=150 Sets of Ratings)
-----------------------------------------------------------------------------------------------------------
                               1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
SEEQ Scales
 1 Learning/Value           1000
 2 Instructor Enthusiasm     840 1000
 3 Organization/Clarity      900  879 1000
 4 Group Interaction         806  854  829 1000
 5 Individual Rapport        793  811  776  833 1000
 6 Breadth of Coverage       814  819  836  823  839 1000
 7 Examinations/Grading      801  719  795  766  800  795 1000
 8 Assignments/Readings      829  713  785  763  710  750  794 1000
 9 Workload/Difficulty       047 -038  058 -137  025  032  098 -034 1000
Endeavor Scales
10 Student Accomplishment   *949  817  870  810  788  808  790  844  023 1000
11 Clarity                   873  879 *922  820  766  828  770  782 -022  861 1000
12 Organization/Planning     716  704 *801  701  708  733  731  623  083  685  727 1000
13 Class Discussion          786  801  782 *956  803  809  761  768 -090  791  777  682 1000
14 Personal Attention        796  787  774  838 *939  824  808  728  033  796  789  687  831 1000
15 Grading                   765  653  710  719  771  729 *884  732  040  753  683  667  711  751 1000
16 Workload                  360  238  364  233  344  338  381  346 *656  326  268  403  248  383  328 1000
Global Criteria
Overall Course Rating        878  845  869  821  777  796  790  743 -000  860  866  678  793  796  750  302
Overall Instructor Rating    832  894  873  811  782  784  744  719  007  806  903  706  778  808  643  255
Discrimination Index (a)     751  806  832  697  698  714  671  610  077  739  811  656  642  709  605  334
-----------------------------------------------------------------------------------------------------------
Note. Coefficients are presented without decimal points.
* Convergent validities.
(a) Discrimination Index is the linear component of differences in ratings of lecturers chosen as good, average, and poor.
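Table 2 does not spell out how the discrimination index was computed. One plausible reading, offered here only as an assumption, is the correlation between a scale score and lecturer type coded -1 (poor), 0 (average), and +1 (good), which isolates the linear trend across the three groups. The function name and the ratings in the example are hypothetical.

```python
from math import sqrt


def linear_discrimination_index(poor, average, good):
    """Correlation between ratings and a linear contrast code.

    Assumption: lecturer type is coded -1 / 0 / +1 for poor / average / good,
    so the coefficient reflects only the linear component of group differences.
    """
    codes = [-1] * len(poor) + [0] * len(average) + [1] * len(good)
    ratings = list(poor) + list(average) + list(good)
    n = len(ratings)
    mean_c = sum(codes) / n
    mean_r = sum(ratings) / n
    sxy = sum((c - mean_c) * (r - mean_r) for c, r in zip(codes, ratings))
    sxx = sum((c - mean_c) ** 2 for c in codes)
    syy = sum((r - mean_r) ** 2 for r in ratings)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so nothing to discriminate
    return sxy / sqrt(sxx * syy)


# Hypothetical nine-point scale scores for lecturers chosen as
# poor, average, and good (illustrative numbers only).
index = linear_discrimination_index([2, 3, 4], [5, 5, 6], [7, 8, 9])
```

A scale such as Workload/Difficulty, on which poor and good lecturers receive similar ratings, would yield an index near zero under this reading.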
Table 3
Summaries of Three Campbell-Fiske Criteria For The Analysis of Six MTMM Matrices Presented in Table 3
-------------------------------------------------------------------------------------------------------
                                                                         Analysis
                                                      -------------------------------------------------
                                                           SU   Span   TAFE    PNG     NZ   UWSM  Total
Criterion 1
  1) Convergent Validities: Mean                         .833   .869   .824   .721   .851   .882   .830
  2) Proportion Statistically Significant (out of 7)    1.000  1.000  1.000  1.000  1.000  1.000  1.000
Criterion 2
  1) Comparison Coefficients: Mean                       .468   .541   .573   .509   .464   .627   .530
  2) Proportion of Successful Comparisons (out of 96)    .990  1.000  1.000   .995  1.000   .990   .996
Criterion 3
  1) Comparison Coefficients
     Mean for SEEQ Factors                               .487   .554   .579   .512   .496   .625   .542
     Mean for Endeavor factors                           .448   .530   .567   .501   .445   .626   .520
     Mean for SEEQ & Endeavor factors                    .473   .545   .574   .508   .477   .625   .534
  2) Proportion of Successful Comparisons
     SEEQ factors (out of 56)                            .964   .964   .982   .964  1.000   .929   .967
     Endeavor factors (out of 42)                        .976  1.000  1.000   .988  1.000   .976   .990
     SEEQ & Endeavor factors (out of 98)                 .980   .980   .990   .975  1.000   .949   .979
Coefficient Alpha Reliability Estimate
     Mean for SEEQ Factors                               .901   .884   .820   .869   .898   .890   .877
     Mean for Endeavor factors                           .887   .889   .826   .801   .911   .926   .873
     Mean for SEEQ & Endeavor factors                    .895   .886   .822   .839   .904   .906   .875
-------------------------------------------------------------------------------------------------------
Note. NZ = University of Canterbury, New Zealand; UWS = University of Western Sydney; SU = Sydney University; Span = Universidad de Navarra, Spain; TAFE = Technical and Further Education; PNG = University of Technology, Papua New Guinea. Comparisons conducted to test criteria 2 and 3 were counted as half a success and half a failure when a convergent validity was equal to a comparison coefficient.
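The counts behind Criterion 2 in Table 3 can be reproduced from the Endeavor-by-SEEQ block of Table 2. The sketch below is an illustration rather than the authors' own procedure: the function name and data layout are assumptions, the correlations are those printed in Table 2 with decimal points restored, and ties are scored as half a success, following the note to Table 3. Where two Endeavor scales match the same SEEQ factor, the other convergent validity in that column is skipped; this is an inference from the 96 comparisons reported, not something the paper states.

```python
# Endeavor (rows, scales 10-16) by SEEQ (columns, scales 1-9) correlations
# from Table 2, with decimal points restored.
BLOCK = [
    [.949, .817, .870, .810, .788, .808, .790, .844, .023],   # Student Accomplishment
    [.873, .879, .922, .820, .766, .828, .770, .782, -.022],  # Clarity
    [.716, .704, .801, .701, .708, .733, .731, .623, .083],   # Organization/Planning
    [.786, .801, .782, .956, .803, .809, .761, .768, -.090],  # Class Discussion
    [.796, .787, .774, .838, .939, .824, .808, .728, .033],   # Personal Attention
    [.765, .653, .710, .719, .771, .729, .884, .732, .040],   # Grading
    [.360, .238, .364, .233, .344, .338, .381, .346, .656],   # Workload
]
# Column index of the matching SEEQ factor for each Endeavor row
# (the starred cells in Table 2).
MATCH = [0, 2, 2, 3, 4, 6, 8]


def count_comparisons(block, match):
    """Compare each convergent validity with the other correlations in its
    row and column of the heteromethod block.

    Returns (successes, total); a tie counts as half a success.
    """
    convergent = {(row, col) for row, col in enumerate(match)}
    successes, total = 0.0, 0
    for row, col in sorted(convergent):
        validity = block[row][col]
        same_row = [(row, j) for j in range(len(block[row])) if j != col]
        same_col = [(i, col) for i in range(len(block)) if i != row]
        for cell in same_row + same_col:
            if cell in convergent:
                continue  # do not compare one convergent validity with another
            other = block[cell[0]][cell[1]]
            total += 1
            if validity > other:
                successes += 1.0
            elif validity == other:
                successes += 0.5
    return successes, total


successes, total = count_comparisons(BLOCK, MATCH)
```

With the Table 2 values this gives 95 successes out of 96 comparisons, in line with the UWSM proportion of .990 in Table 3. The single failure involves Organization/Planning, whose convergent validity (.801) falls below the correlation between Student Accomplishment and Organization/Clarity (.870).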
Table 4
Paraphrased Items and the Factors They Are Designed To Represent in Marsh's (M) SEEQ Instrument and Frey's (F) Endeavor Instrument
-------------------------------------------------------------------------------------------------------------------
                                            Proportion of "not appropriate"      Proportion of "most important"
                                            responses for each study             responses for each study
                                            ----------------------------------   ----------------------------------
                                            Tot  NZ   UWS  SU   Span TAFE PNG    Tot  NZ   UWS  SU   Span TAFE PNG
Learning (SEEQ)
  M1 Course challenging & stimulating       .022 .006 .000 .003 .026 .023 .075   .171 .275 .193 .199 .175 .104 .080
  M2 Learned something valuable             .026 .003 .006 .006 .026 .014 .100   .114 .137 .113 .161 .142 .113 .020
  M3 Increase subject interest              .010 .008 .000 .013 .019 .018 .000   .102 .115 .126 .095 .112 .066 .100
  M4 Learned & understood subject matter    .010 .003 .012 .003 .011 .018 .015   .054 .048 .099 .051 .037 .057 .030
Instructor Enthusiasm (SEEQ)
  M5 Enthusiastic about teaching            .014 .000 .000 .003 .008 .000 .075   .204 .216 .159 .278 .260 .164 .150
  M6 Dynamic and energetic                  .013 .000 .000 .003 .027 .017 .030   .125 .148 .199 .198 .069 .064 .070
  M7 Enhanced presentation with humour      .031 .000 .025 .013 .035 .029 .085   .131 .104 .159 .158 .115 .098 .150
  M8 Teaching style held your interest      .014 .006 .006 .003 .024 .018 .025   .273 .280 .332 .367 .193 .179 .285
Organization/clarity (SEEQ)
  M9 Lecturer explanations clear            .007 .000 .000 .006 .006 .000 .030   .231 .165 .272 .212 .290 .300 .145
  M10 Materials well explained & prepared   .018 .011 .012 .013 .019 .015 .035   .140 .132 .139 .108 .147 .109 .205
  M11 Course objectives stated & pursued    .047 .017 .012 .006 .049 .046 .150   .088 .087 .093 .168 .067 .060 .055
  M12 Lectures facilitated taking notes     .035 .003 .043 .022 .033 .052 .060   .088 .137 .079 .095 .102 .046 .070
Group interaction (SEEQ)
  M13 Encouraged class discussion           .063 .045 .049 .101 .032 .024 .130   .073 .073 .099 .111 .077 .078 .000
  M14 Students shared knowledge/ideas       .063 .070 .031 .079 .075 .024 .100   .052 .034 .053 .038 .045 .070 .075
  M15 Encouraged questions & gave answers   .022 .036 .012 .022 .019 .006 .035   .070 .067 .046 .038 .030 .115 .125
  M16 Encouraged expression of ideas        .038 .050 .018 .038 .049 .044 .030   .030 .042 .033 .032 .013 .040 .020
Individual Rapport (SEEQ)
  M17 Friendly towards individual students  .021 .014 .006 .032 .011 .009 .055   .092 .073 .093 .073 .137 .075 .100
  M18 Welcomed students seeking help/advice .031 .017 .006 .022 .078 .028 .035   .124 .098 .106 .076 .057 .089 .320
  M19 Interested in individual students     .044 .022 .000 .047 .057 .015 .120   .072 .059 .099 .032 .083 .060 .100
  M20 Accessible to individual students     .057 .070 .031 .066 .056 .086 .035   .065 .042 .079 .032 .049 .047 .140
Breadth of coverage (SEEQ)
  M21 Contrasted various implications       .058 .067 .043 .057 .072 .060 .050   .057 .073 .046 .035 .065 .046 .080
  M22 Gave background of ideas/concepts     .040 .028 .037 .032 .053 .061 .030   .040 .028 .046 .044 .024 .032 .065
  M23 Gave different points of view         .061 .087 .031 .085 .096 .037 .030   .043 .062 .020 .070 .045 .041 .020
  M24 Discussed current developments        .067 .053 .049 .054 .183 .060 .000   .061 .070 .066 .028 .035 .052 .115
Examinations/grading (SEEQ)
  M25 Examination feedback valuable         .085 .106 .018 .130 .142 .080 .035   .053 .078 .086 .025 .045 .056 .030
  M26 Evaluation methods fair/appropriate   .048 .056 .000 .070 .061 .043 .060   .063 .059 .060 .054 .122 .063 .020
  M27 Tested course content as emphasized   .075 .090 .025 .082 .091 .133 .030   .037 .034 .053 .041 .045 .028 .020
Assignments (SEEQ)
  M28 Readings/texts were valuable          .089 .059 .080 .038 .211 .130 .015   .043 .056 .066 .032 .026 .046 .030
  M29 They contributed to understanding     .060 .048 .055 .006 .158 .066 .025   .042 .048 .053 .063 .022 .049 .015
Overall Rating Items (SEEQ)
  M31 Overall Instructor Rating             .006 .008 .000 .003 .021 .005 .000   .070 .039 .079 .123 .045 .099 .035
  M30 Overall Course Rating                 .008 .003 .000 .000 .019 .011 .015   .071 .081 .093 .101 .078 .055 .020
Workload/difficulty (SEEQ)
  M32 Course difficulty (easy-hard)         .011 .000 .000 .006 .000 .028 .030   .049 .073 .053 .025 .030 .058 .055
  M33 Course workload (light-heavy)         .011 .003 .000 .000 .000 .023 .040   .073 .076 .060 .063 .030 .073 .135
  M34 Course pace (slow-fast)               .012 .003 .000 .000 .003 .029 .040   .083 .070 .060 .073 .053 .063 .180
Presentation clarity (Endeavor)
  F1 Presentations clarified materials      .018 .011 .000 .000 .069 .020 .010   .088 .101 .079 .196 .080 .070 .000
  F2 Presented clearly & summarized         .017 .000 .018 .000 .016 .011 .060   .166 .199 .132 .285 .137 .153 .090
  F3 Made good use of examples              .015 .011 .018 .000 .019 .017 .025   .070 .073 .040 .070 .093 .037 .105
Workload (Endeavor)
  F4 Students had to work hard              .008 .008 .006 .000 .002 .012 .020   .044 .022 .066 .038 .053 .047 .040
  F5 Course required a lot of work          .016 .008 .000 .000 .008 .020 .060   .043 .025 .086 .016 .030 .031 .070
  F6 Course workload was heavy              .019 .011 .006 .000 .011 .052 .035   .046 .031 .046 .009 .053 .028 .110
Personal Attention (Endeavor)
  F7 Listened & was willing to help         .052 .070 .031 .066 .057 .015 .070   .103 .045 .099 .051 .067 .136 .220
  F8 Able to get personal attention         .103 .090 .037 .184 .155 .067 .085   .066 .042 .079 .057 .057 .095 .065
  F9 Concerned about student difficulties   .029 .028 .018 .028 .035 .017 .045   .095 .031 .046 .066 .102 .083 .245
Class discussion (Endeavor)
  F10 Class discussion was welcome          .053 .053 .031 .098 .010 .018 .105   .045 .031 .033 .019 .054 .057 .075
  F11 Students encouraged to participate    .072 .070 .037 .101 .033 .034 .155   .064 .048 .060 .111 .056 .075 .035
  F12 Encouraged students to express ideas  .064 .050 .037 .050 .054 .031 .160   .017 .017 .013 .009 .032 .032 .000
Planning/objectives (Endeavor)
  F13 Presentations planned in advance      .025 .003 .006 .006 .019 .015 .100   .112 .092 .093 .133 .139 .122 .095
  F14 Provided detailed course schedule     .036 .008 .018 .013 .035 .038 .105   .047 .087 .060 .057 .028 .049 .000
  F15 Activities orderly scheduled          .055 .087 .031 .089 .029 .028 .065   .041 .020 .033 .013 .048 .051 .080
Grading/examinations (Endeavor)
  F16 Grading was fair and impartial        .071 .090 .000 .066 .086 .063 .120   .073 .039 .079 .051 .139 .072 .060
  F17 Grading reflected student performance .073 .090 .006 .098 .086 .078 .080   .042 .042 .053 .032 .062 .046 .015
  F18 Grading indicative of accomplishments .073 .081 .006 .089 .086 .102 .075   .028 .025 .040 .019 .032 .052 .000
Student Accomplishments (Endeavor)
  F19 Understood the advanced material      .034 .059 .025 .035 .022 .063 .000   .052 .039 .040 .095 .030 .050 .055
  F20 Ability to analyze issues             .022 .028 .031 .009 .027 .038 .000   .077 .104 .086 .101 .081 .057 .035
  F21 Increased knowledge & competence      .012 .006 .006 .006 .019 .021 .015   .087 .132 .106 .082 .085 .083 .035
-------------------------------------------------------------------------------------------------------------------
Note. NZ = University of Canterbury, New Zealand; UWS = University of Western Sydney; SU = Sydney University; Span = Universidad de Navarra, Spain; TAFE = Technical and Further Education; PNG = University of Technology, Papua New Guinea.

Table 5
Similarity in Patterns of Items Judged to Be Most Important in the Six Different Studies
---------------------------------------------------------------------------
         TOT    NZ   UWS    SU  Span  TAFE   PNG
TOT     1000   878   913   879   861   847   596
NZ       878  1000   822   857   732   622   328
UWS      913   822  1000   803   746   750   440
SU       879   857   803  1000   719   686   283
Span     861   732   746   719  1000   791   363
TAFE     847   622   750   686   791  1000   459
PNG      596   328   440   283   363   459  1000
---------------------------------------------------------------------------
Note. NZ = University of Canterbury, New Zealand; UWS = University of Western Sydney; SU = Sydney University; Span = Universidad de Navarra, Spain; TAFE = Technical and Further Education; PNG = University of Technology, Papua New Guinea. The proportion of most important item responses for the 55 SEEQ and Endeavor items in the six different studies and their total were correlated with each other (i.e., the values in the last five columns in Table 4).
Thus each coefficient (presented without decimal points) in this table represents the similarity in patterns of most important items for one pair of studies.

The Use of Students' Evaluations and an Individually Structured Intervention to Enhance University Teaching Effectiveness

Herbert W. Marsh and Lawrence Roche
University of Western Sydney, Macarthur
October 3, 1992

Running Head: Student Evaluation Feedback Effects

Requests for further information should be sent to Herbert W. Marsh, Professor of Education, University of Western Sydney, Macarthur, PO Box 555, Campbelltown NSW 2560, Australia. The authors would like to acknowledge the financial support of the Australian Department of Employment, Education and Training for the conduct of this research, the teachers at the University of Western Sydney, Macarthur who volunteered to participate in this research, and Raymond Debus who made helpful comments on an earlier draft of this article.

ABSTRACT

The present investigation evaluates the effectiveness of students' evaluations of teaching effectiveness (SETs) as a means for enhancing university teaching. We emphasize the multidimensionality of SETs, an Australian version of the Students' Evaluations of Educational Quality (Marsh, 1987) instrument (ASEEQ), and Wilson's (1986) feedback/consultation intervention. All teachers (N=92) completed self-evaluation surveys and were evaluated by students at the middle of semester 1 and at the end of semesters 1 and 2. Three randomly assigned groups received the feedback/consultation intervention at midterm of semester 1 (MT), at the end of semester 1 (ET), or received no intervention (control). Each MT and ET teacher "targeted" specific ASEEQ dimensions that were the focus of his/her individually structured intervention. The ratings for all groups improved over time, but only ratings for the ET group improved significantly more than those for the control group.
For both ET and MT groups, targeted dimensions improved more than nontargeted dimensions. The results suggest that SET feedback coupled with consultation is an effective means to improve teaching effectiveness and provide one model for feedback/consultation.

Students' evaluations of teaching effectiveness (SETs) are variously collected to provide: (1) diagnostic feedback to faculty that will be useful for the improvement of teaching; (2) a measure of teaching effectiveness to be used in personnel and administrative decision making; (3) information for students to use in the selection of courses and teachers; and (4) an outcome or a process description for research on teaching. Presumably, all SET programs would endorse the first reason for collecting SETs. None of the other reasons is so universal (although SETs are widely used for personnel decisions). Consistent with these priorities, the purpose of the present investigation is to evaluate the effectiveness of feedback from multidimensional SETs and feedback/consultation as a means for enhancing university teaching, based on a new Australian version of the Students' Evaluations of Educational Quality (SEEQ; Marsh, 1987) instrument called ASEEQ.

The Multidimensionality of SETs and the SEEQ Instrument

Effective teaching is a multidimensional construct (e.g., a teacher may be organized but lack enthusiasm). Thus, it is not surprising that a considerable body of research shows that SETs are also multidimensional (see Marsh, 1987). Information from SETs depends upon the content of the items. Poorly worded or inappropriate items will not provide useful information. If a survey instrument contains an ill-defined hodgepodge of different items and SETs are summarized by an average of these items, then there is no basis for knowing what is being measured.
Particularly when the purpose of the ratings is to provide teachers with formative feedback about their teaching effectiveness, it is important that careful attention be given to the components of teaching effectiveness that are to be measured. Surveys should contain separate groups of related items that are derived from a logical analysis of the content of effective teaching and the purposes that the ratings are to serve, and that are supported by theory, previous research, and empirical procedures such as factor analysis and multitrait-multimethod (MTMM) analysis.

The strongest support for the multidimensionality of SETs apparently comes from research using the Students' Evaluations of Educational Quality (SEEQ; Marsh, 1987) instrument. In the development of SEEQ: 1) a large item pool was obtained from a literature review, forms in current usage, and interviews with teachers and students about what they saw as effective teaching; 2) students and teachers were asked to rate the importance of items; 3) teachers were asked to judge the potential usefulness of the items as a basis for feedback; and 4) open-ended student comments were examined to determine if important aspects had been excluded. These criteria, along with psychometric properties, were used to select items and revise the instrument, thus supporting the content validity of SEEQ responses. Marsh and Dunkin (in press) subsequently demonstrated that the SEEQ dimensions are consistent with principles of effective teaching and learning established on the basis of accepted theory and research. Based on their review, they concluded that SEEQ factors conform to principles of teaching and learning emerging from attempts to synthesize knowledge of teaching effectiveness.

Factor analytic support for the SEEQ scales is particularly strong.
To date, more than 30 published factor analyses of SEEQ responses have identified the factors that SEEQ is designed to measure (e.g., Marsh, 1982a; 1983; 1984; 1987; 1991a; Marsh & Hocevar, 1984, 1991). Factor analyses of teacher self-evaluations using SEEQ also identified the SEEQ factors, demonstrating that the factors generalize beyond responses by students. Multitrait-multimethod analyses of student/teacher agreement on SEEQ factors provided support for the convergent and discriminant validity of SEEQ responses (Marsh, 1982b; Marsh, Overall & Kesler, 1979). More recently, Marsh and Bailey (in press) examined the consistency of profiles of SEEQ scores (e.g., high on Enthusiasm and low on Organization) for a cohort of teachers who had been evaluated continuously over a 13-year period. They reported that each teacher has a relatively unique profile of SEEQ scales that generalizes over different courses, over graduate and undergraduate level courses, and over an extended period of time. Whereas the value of multidimensional ratings is widely accepted for purposes of diagnostic feedback, there is heated debate about the relative usefulness of multidimensional profiles and overall summary ratings for purposes of personnel decisions (e.g., Abrami, 1989; Abrami & d'Apollonia, 1991; Marsh, 1987, 1991b). Although this issue is not a specific focus of the present investigation, a possible compromise arising from this debate is to summarize SETs as a weighted average of specific SEEQ dimensions. Marsh and Dunkin (in press; also see Marsh & Bailey, in press) noted that one approach to operationalizing a weighted average approach is to weight specific SEEQ components according to the relative importance of each scale as judged by the teacher who is being evaluated. 
This strategy has the added benefit of providing the teacher with a systematic role in the interpretations of the ratings used to summarize his/her teaching effectiveness, but to our knowledge this weighted average approach has not been previously employed (but see a related application by Hoyt, Owens, & Grouling, 1973). Because all teachers were asked to judge the relative importance of each SEEQ dimension as part of the feedback intervention used in the present investigation, we were able to construct a (teacher rated importance) weighted average of SEEQ dimensions and use it as one of our criterion measures.

SETs are commonly collected and frequently studied at North American universities, but not in most other parts of the world. Because of the extensive exposure of North American research, there is a danger that North American instruments will be used in new settings without first studying their applicability. In order to address this issue, Marsh (1981) described the applicability paradigm for studying the initial suitability of SEEQ that was used in several Australian studies as well as studies in Spain, New Zealand, Papua New Guinea, and elsewhere (e.g., Marsh, 1987). Of particular relevance, Marsh and Roche (in press) conducted one of these studies at the newly established University of Western Sydney that served as a pilot study for the present investigation.

Utility Of Student Ratings

Braskamp, Brandenburg, and Ory (1985), using a broad rationale based on organizational research, argued that it is important for universities and individual teachers to take evaluations seriously. Summarizing this perspective, Braskamp, et al. (p. 14) stated that: "the clarity and pursuit of purpose is best done if the achievements are known. A course is charted and corrections are inevitable. Evaluation plays a role in the clarity of purpose and determining if the pursuit is on course."
In a related perspective, Marsh (1984, 1987) argued that the introduction of a broad institution-based, carefully planned program of SETs is likely to lead to the improvement of teaching. Teachers will give serious consideration to their own teaching in order to evaluate the merits of the program. Clear support of a program by the central administration will serve notice that teaching effectiveness is being taken seriously. The results of SETs, as one indicator of effective teaching, will provide a basis for informed administrative decisions and thereby increase the likelihood that quality teaching will be recognized and rewarded, and that good teachers will be given tenure. The social reinforcement of gaining favorable ratings will provide added incentive for the improvement of teaching, even for teachers who are tenured. Finally, teachers report that the feedback from student evaluations is useful to their own efforts for the improvement of their teaching. Murray (1987) presented a similar logic in making the case for why SETs improve teaching effectiveness and offered four reasons: (a) SETs provide useful feedback for diagnosing strengths and weaknesses, (b) feedback can provide the impetus for professional development aimed at improving teaching, (c) the use of SETs in personnel decisions provides a tangible incentive to work to improve teaching, and (d) the use of SETs in tenure decisions means that good teachers are more likely to be retained. In support of his argument, Murray (1987) summarized results of published surveys from seven universities that asked teachers whether SETs are useful for improving teaching and, across the seven studies, about 80% of the respondents indicated that SETs led to improved teaching. None of these logical arguments, however, provides an empirical demonstration of improved teaching effectiveness resulting from SET feedback.

Feedback Studies.
In most studies of the effects of feedback from SETs, teachers are randomly assigned to experimental (feedback) and one or more control groups; SETs are collected during the term; ratings are returned to the feedback teachers as quickly as possible; and the various groups are compared at the end of the term on a second administration of SETs. Earlier versions of SEEQ were employed in two such feedback studies using multiple sections of the same course (also see related research by McKeachie, et al., 1980). In the first study results from an abbreviated form of the survey were simply returned to teachers, and the impact of the feedback was positive, but very modest (Marsh, Fleiner, & Thomas, 1975). In the second study (Overall & Marsh, 1979) researchers actually met with teachers in the feedback group to discuss the evaluations and possible strategies for improvement. In this study students in the feedback group subsequently performed better on a standardized final examination, rated teaching effectiveness more favorably at the end of the course, and experienced more favorable affective outcomes (i.e., feelings of course mastery, and plans to pursue and apply the subject). This second study is particularly important because it is one of the few to demonstrate that SET feedback supplemented by consultation can lead to improvement in objectively measured student learning and other student learning outcomes, as well as to the improvement of teaching effectiveness inferred from subsequent SETs (also see McKeachie, et al., 1980). Together, these two studies suggest that feedback, coupled with a candid discussion with an external consultant, can be an effective intervention for the improvement of teaching effectiveness.
In his classic meta-analysis, Cohen (1980) found that teachers who received midterm (MT) feedback were subsequently rated about one-third of a standard deviation higher than controls on the Total Rating (an overall rating item or the average of multiple items), and even larger differences were observed for ratings of Instructor Skill, Attitude Toward Subject, and Feedback to Students. Studies that augmented feedback with consultation produced substantially larger differences, but other methodological variations had little effect. The results of this meta-analysis support the SEEQ findings described above and demonstrate that SET feedback, particularly when augmented by consultation, can lead to improvement in teaching effectiveness. L'Hommedieu, Menges, and Brinko (1990; also see L'Hommedieu, Menges, & Brinko, 1988) noted the need for meta-analyses of the influence of design and contextual effects. They updated Cohen's (1980) meta-analysis and critically evaluated the methodology used in the 28 feedback studies. They concluded that the overall effect size (.342) attributable to feedback was probably attenuated by threats to validity in existing research and developed methodological recommendations for future research. Among their many recommendations, they emphasized the need to: use stratified random assignment and covariance analyses in conjunction with a sufficiently large number of teachers to ensure the initial equivalence of the groups; more critically evaluate findings within a construct validity framework as emphasized by Marsh (1987); more critically evaluate the assumed generalizability of MT feedback to ET feedback; and to base results on well-standardized instruments such as SEEQ. They also noted the apparently inevitable threat of a John Henry effect in which the anticipation of being rated may lead to more effective teaching by teachers in randomly assigned control groups, thus making it more difficult to measure the true effects of the intervention. 
Although not a particular focus of their review, L'Hommedieu et al. (1990) also noted that teachers initially rated lowest tended to be more positively influenced by the feedback intervention than other experimental teachers -- an aptitude-treatment interaction. Consistent with Cohen (1980) they concluded that "the literature reveals a persistently positive, albeit small, effect from written feedback alone and a considerably increased effect when written feedback is augmented with personal consultation" (1990, p. 240), but that improved research that incorporated their suggestions would probably lead to larger, more robust effects.

Marsh (1987; Marsh & Dunkin, in press) summarized important issues that remain unresolved in SET feedback research. Of particular relevance to the present investigation, it was noted that nearly all of the studies were based on MT feedback. This limitation probably weakens effects in that many instructional characteristics cannot be easily altered within the same semester (in some studies the period between receiving the feedback and subsequent data collection is as little as 4 or 5 weeks). Also, because students may be substantially influenced by what happens in the first half of the course, even substantial changes in teaching effectiveness in the last half of the term may have only modest effects on SETs. Thus, for example, Marsh, Fleiner and Thomas (1975) found significant differences for an overall teacher rating in which students were asked to judge changes in teaching effectiveness between the middle of the term and the end of the term, but not on a traditional overall teacher rating. Adding to these concerns, Marsh and Overall (1980) used results from a multisection validity study to demonstrate that MT ratings were less valid than ET ratings. Because SETs are typically collected near the end of the term, the more relevant question for SET feedback research to address is the impact of feedback from ET ratings.
Even if there are short-term gains due to MT feedback, it is important to determine whether these effects generalize to ratings in subsequent semesters. L'Hommedieu et al. (1990, p. 238) similarly argued that "most experiments using midterm feedback are intended to generalize to a quite different situation: end-of-term summative ratings to be used by teachers for improving instruction in subsequent terms" leading them to conclude that "the legitimacy of extrapolating the results to end-of-term rating situations is questionable." A few studies have considered long-term follow-ups of short-term interventions, but these were apparently not designed for this purpose and were sufficiently flawed in relation to this extrapolation that no generalizations are warranted (see Marsh, 1987). No research has examined the effects of continued SET feedback over a long period of time with a true experimental design, and such research may be ethically dubious and very difficult to conduct. The long-term effects of SET feedback may be amenable to quasi-experimental designs (e.g., Aleamoni & Yimer, 1973; Voght & Lasher, 1973), but the difficulties inherent in the interpretation of such studies may preclude any firm generalizations. For shorter periods, however, it may be justifiable to withhold the SETs from randomly selected teachers or not to collect SETs at all. In particular, it is reasonable to evaluate the effects of feedback from ET ratings -- augmented with consultation -- on SETs collected the next semester in relation to SETs for no-feedback controls. The failure to systematically compare the effects of MT and ET interventions is one of the two most important deficits in the SET feedback research. 
Although not a specific emphasis in their review of SET feedback studies, Marsh and Dunkin (in press) reviewed considerable research demonstrating the validity of SETs and support for their multidimensionality in relation to objective student learning, teacher self-evaluations, ratings by former students at the time of graduation or several years after graduation, and the frequency of occurrence of specific behaviors by trained observers. They lamented, however, the heavy reliance of such validity research on the multisection validity study and its typically narrow focus on student outcomes inferred from results on teacher-made multiple choice tests. Instead, they called for studies that incorporated a much wider variety of student outcomes as criteria for validating SETs. A similar concern exists in SET feedback studies in that most studies -- with some notable exceptions (e.g., Overall & Marsh, 1980) -- evaluate the effects of SET feedback in relation to subsequent SETs by the same students.

The most robust finding from the SET feedback research is that consultation augments the effects of written summaries of SETs. Other sources also support this conclusion. For example, in the Jacobs (1987) survey of Indiana University faculty, 70% of the respondents indicated that SETs had helped them improve their teaching but 63% indicated that even when teachers can interpret their ratings, they often do not know what to do in order to improve their teaching. Also, Franklin and Theall (1989), based on a 153-item, multiple-choice test of knowledge about SETs that was validated by experts in the field, concluded that many users lacked the knowledge to adequately use the SETs for summative or formative purposes. Nevertheless, insufficient attention in SET research has been given to the nature of consultative feedback that is most effective.
Based on a review of research in education, psychology, and organizational behavior, Brinko (1991; also see Brinko, 1990) contrasted five models of interaction relevant to instructional consultation: product model (the consultant is the expert and provider of expertise), prescription model (the consultant identifies, diagnoses, and remedies problems), collaborative/process model (there is a synergistic relationship between the consultant as a facilitator of change and the teacher as the content expert), affiliative model (the consultant is both an instructional consultant and psychological counsellor and the teacher is a seeker of personal and professional growth), and confrontational model (in which the consultant is a challenger or a devil's advocate). She suggested that a skillful instructional consultant may need to master several styles and use the one that is most responsive to the needs of the teacher-client and the particular situation. Brinko (p. 48) concluded, however, that "we still have no empirical evidence to differentiate between strategies and practices that make consultation successful and those that do not." It is surprising that there is not more systematic research on this practical issue and it represents, perhaps, the other most important deficit in SET feedback research.

Wilson's feedback/consultation process

Wilson (1986; also see Wilson, 1984, 1987) described a feedback/consultation process that appears to have considerable potential. A key element in this process was a set of teaching packets that were keyed to the items on the SET instrument used in his research. Each packet contained suggestions from teachers who had received Distinguished Teaching Awards or received multiple "best teacher" nominations by graduating seniors. In an application of this program conducted over a three-year period, participants were volunteer teachers who had been evaluated previously in the same course they would again be teaching.
Based on SETs and self-evaluations of their own teaching, participants nominated specific evaluation items on which they would like assistance at a preliminary consultation session. The main consultation was held shortly before the second time the teachers were to teach the same course. The consultant began the session by noting items on which the teacher received the highest ratings, and then considered three to five items that the teacher had selected or that had received the lowest ratings. For each selected item the three to six strategies from the corresponding teaching packet were described and the teacher was given copies of the two or three that were of most interest to the teacher. During the next week the consultant summarized the main consultation and strategies to be pursued in a letter to the teacher and subsequently telephoned the teacher during the term to ask how things were going. This process clearly fits the "collaborative/process model" in Brinko's (1991) typology, but may be sufficiently flexible to incorporate aspects of other models as appropriate.

Wilson's results indicated that ratings were systematically better at time 2 for the targeted items -- particularly those items that referred to concrete behaviors (e.g., states objectives for each class session) -- and an overall rating item. He also recognized the need for a nonintervention comparison group. For this purpose he considered SETs for 101 teachers who had not volunteered to be in the study but who had been evaluated on two occasions for the same course during the period of his study. For this large comparison group, there were no systematic changes in either specific or global SETs, supporting Wilson's contention that SETs without a consultation intervention are not likely to lead to improved teaching.
Wilson suggested that the key elements in the consultation intervention were providing teachers with information on how to improve teaching in areas in which they are weak and the interpersonal expectations that created for some teachers a desire to fulfill an implied contract with their consultant.

Despite the obvious appeal of Wilson's feedback/consultation process and its successful application, empirical support for its effectiveness is weak -- based on a non-experimental design that does not rule out alternative explanations. The interpretation of pretest/posttest gain scores is a weak basis for causal inference, particularly since subjects were self-selected volunteers and more than half of the original participants either dropped out or did not complete the intervention within three years (when the study was terminated). Although Wilson reported that another group of teachers who did not volunteer to participate in the study showed no improvement, the comparability of the two groups may be dubious.

A more subtle problem is the comparison of pretest/posttest gain scores on the "targeted" items that was the major focus of Wilson's conclusions. The results suggest that the intervention is most effective for those items that teachers target. Extending the logic of Marsh's (1987) construct validity approach, this finding apparently supports the construct validity of interpretations of the intervention in that gains are larger for those areas that were the focus of the intervention and smaller for those areas that were not. Wilson, however, did not actually report comparisons of these gains on targeted items with gains on other, untargeted items or with gains in the same items by other teachers who did not target them. Also, because the items selected for targeting typically had particularly low ratings at time 1, regression to the mean alone would result in some positive gains and must be controlled more adequately.
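The regression-to-the-mean concern can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a reanalysis of Wilson's data: the number of items, the number of targeted items, the error variance, and the function name `simulate_gains` are all hypothetical, and ratings are expressed in arbitrary standardized units. No intervention is simulated at all; the "targeted" items are simply those rated lowest at time 1.

```python
import random
import statistics


def simulate_gains(n_teachers=1000, n_items=30, n_targeted=4,
                   noise_sd=0.5, seed=1):
    """Mean time-1 to time-2 gain for targeted vs. other items, with NO intervention.

    Each item has a stable true score plus independent rating error on each
    occasion (all values hypothetical). The items with the lowest time-1
    ratings are treated as the targeted items.
    """
    rng = random.Random(seed)
    targeted_gains, other_gains = [], []
    for _ in range(n_teachers):
        true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
        time1 = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
        time2 = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
        # "Target" the items that were rated lowest at time 1.
        by_time1 = sorted(range(n_items), key=lambda i: time1[i])
        targeted = set(by_time1[:n_targeted])
        for i in range(n_items):
            gain = time2[i] - time1[i]
            (targeted_gains if i in targeted else other_gains).append(gain)
    return statistics.mean(targeted_gains), statistics.mean(other_gains)


targeted_mean, other_mean = simulate_gains()
print(f"mean gain, targeted items:    {targeted_mean:+.3f}")
print(f"mean gain, nontargeted items: {other_mean:+.3f}")
```

Under these assumptions the items selected for their low time-1 ratings show a positive average gain even though nothing about the teaching has changed, which is why gains on targeted items need to be compared with gains on nontargeted items or with a control group.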
Whereas it might be possible to randomly assign the items to be targeted by each teacher in order to strengthen the experimental design, such a strategy would pervert the intended purpose of the intervention to provide the teacher with a personal stake in the intervention by selecting the areas on which to focus self-improvement efforts.

The pretest/posttest gains on overall rating items for all participating teachers provide a stronger basis of inference (although the lack of a randomly assigned control group is still a concern). Because the overall ratings reflect all the different SET areas to some extent, changes in the overall ratings reflect an implicit average across targeted and nontargeted items. The essence of the feedback/consultation, however, is that teachers target particular items and the intervention is specific to these targeted items. Logically, the design of the study requires that improvement should be larger for targeted items than for nontargeted items. If there is not differential growth on targeted and nontargeted items, then the specific and individual nature of the intervention is called into question. Thus, gains on overall rating items do not adequately capture the multidimensionality of SETs or the content specificity of the intervention as embodied in the construct validity approach.

In summary, Wilson developed an apparently valuable feedback/consultation process, described its systematic application, and provided evidence suggestive of its effectiveness. In the present investigation we provide a methodologically stronger paradigm for testing this intervention and an apparently stronger evaluation of its potential usefulness.
The Present Investigation

The purpose of the present investigation is to evaluate the effectiveness of feedback from a new Australian version of the SEEQ instrument (ASEEQ) and an adaptation of Wilson's feedback/consultation process for purposes of improving university teaching as inferred on the basis of subsequent SETs. The study is apparently unique in incorporating earlier proposals (e.g., L'Hommedieu, et al., 1990; Marsh, 1987; Marsh & Dunkin, in press) to systematically compare the effectiveness of MT and ET feedback. In particular, different randomly assigned experimental groups received the feedback/consultation in the middle of semester 1 or at the end of semester 1 and were compared to a randomly assigned control group that received no feedback (until the end of semester 2 after the end of the study). Results for all three groups were compared on ratings collected at the middle of semester 1 (T1), the end of semester 1 (T2), and the end of semester 2 (T3).

The feedback intervention was based substantially on the work of Wilson (1986) and incorporated slightly modified versions of his idea packets that are designed to parallel the SEEQ dimensions. We extended his research by evaluating the effectiveness of the feedback intervention with a stronger experimental design that incorporated a randomly assigned control group. Furthermore, we more systematically evaluated his suggestion that areas of teaching effectiveness selected by teachers to be targeted in the intervention are the areas most substantially influenced by the intervention. Other components considered in the study are: the use of a well-standardized SET instrument (L'Hommedieu, et al.
(1990) specifically recommended SEEQ); a systematic evaluation of aptitude-treatment interactions to test L'Hommedieu, et al.'s (1990) suggestion that initially less effective teachers benefit more from the intervention; the use of a weighted average total score in which teacher self-ratings of the importance of each ASEEQ factor are used to weight the factors; and the application of a construct-validity approach in order to test the underlying rationale of the intervention.

Methods

Background. Prior to 1990, the recently established University of Western Sydney, Macarthur (UWSM) was an autonomous institution within the College of Advanced Education sector that was the middle tier of Australia's three-tier system of higher education. In 1990, however, the formal distinction between research universities and Colleges of Advanced Education was abolished and all institutions from the middle tier were either amalgamated into one of the former universities or formed new universities. What was to become UWSM combined with two other institutions from the College of Advanced Education sector to establish the three UWS campuses. As part of a large "priority reserve" source of funding for areas of national importance, the Australian Department of Employment, Education and Training provided funding for projects that would improve the quality of teaching in Australian universities. Projects were selected that would improve the quality of teaching at a specific university but that were sufficiently general to provide a model for use at other universities. The present investigation describes results from one of these grants. (Subsequent matching funds, based in part on the success of this program, were provided by the Australian Department of Employment, Education and Training to establish the Centre for Teaching Development that provided a permanent home for this program.)
Because there was no campus-wide SET program at UWSM prior to the initiation of this project, the formal collection of SETs using a standardized form was unfamiliar to many staff and students, although many teachers used a variety of informal means to evaluate their teaching effectiveness.

Sample and Procedures. Teachers were recruited to participate in the study through a variety of sources including letters sent to each teacher, brief descriptions of the study in a university newsletter, and presentations to faculty staff meetings. The final sample of 92 teachers who volunteered to participate in the study represented all the UWSM faculties and all the different academic ranks. Discipline areas included arts and social sciences, education, foreign languages, public health, nursing, social welfare, business, technology, mathematics and sciences. Reasons for volunteering included the desire to improve teaching effectiveness, to formally evaluate teaching effectiveness for purposes of personnel decisions, to support a good cause, or simply to satisfy curiosity about the program. Not surprisingly, given the variety of reasons for participation, teachers varied substantially in their initial level of teaching effectiveness as indicated by the T1 SETs.

Prior to the initiation of the study, teachers were told that they would be asked to evaluate their own teaching effectiveness and the relative importance of different components of teaching effectiveness, and to be evaluated by their own students in the middle of the first semester (T1), at the end of the first semester (T2), and at the end of the second semester (T3). Prospective volunteers were told that the confidentiality of all individual responses would be strictly maintained and that results of the ratings for each teacher would only be sent to the individual teacher.
They were also told that randomly selected teachers would be asked to participate in a feedback/consultation program in which a consultant (one of the authors) would meet with the teacher to discuss the results of the SETs and strategies to improve ratings in areas selected by the teacher, whereas other teachers randomly assigned to the control group would not receive any feedback until the end of the second semester after the completion of the study. Prospective participants were asked to nominate two instructional sequences -- typically separate classes but occasionally an independent component that was part of a larger program -- in which to be evaluated in the first and second semester. Although all teachers were encouraged to nominate similar settings in which to be evaluated, this was not always possible and no one was excluded from the study for this reason. At the middle of the first semester 92 teachers volunteered to be in the study, completed a self-evaluation survey, and were evaluated by students. At T1, T2, and T3 the ratings were collected by the teacher or a nominated student from his/her class. Standardized administration instructions were read aloud to students, students completed the ASEEQ forms, and forms were put into a sealed envelope that was returned to the faculty office. The completed forms were subsequently sent to the principal investigator of the study to be processed. Although teachers were encouraged to collect ET ratings during the last week or two of regularly scheduled classes, teachers selected when the ratings were actually collected. (Not all teaching sequences that were evaluated corresponded to the university calendar.) Participants were stratified on the basis of overall teacher ratings by students at T1 and were randomly assigned to the MT feedback group, the ET feedback group, or the control group.
MT teachers were immediately sent relevant materials (ASEEQ instruments completed by their students, a computerized summary sheet, and a guide for interpreting the ratings) and were contacted to set up individual feedback/consultation sessions. All other participants were merely told that they had not been selected to be in the MT group (i.e., they were not told whether they were in the control or ET group). Similarly, at the end of the first semester, ET teachers were contacted to set up their feedback/consultation, and MT teachers were contacted to set up their second feedback/consultation. Finally, at the end of the second semester, all previously unreturned materials were returned to all participants -- including the control teachers. The feedback/consultation protocol used for MT and ET groups was based substantially on earlier work by Wilson (1986). Each session began by the consultant providing a general overview and specifically stating that "I do not have sufficient background in your area to know what is 'best.' Instead, I will discuss the ratings and work with you to develop some strategies in particular areas selected by you. In this sense, my role is to be a facilitator." The teacher was then asked to describe the special or unique characteristics about the class being evaluated, the students, or the circumstances. The consultant ascertained that the teacher had read the materials previously sent to him/her. Focusing on the ASEEQ scale scores rather than responses to individual items, the consultant first emphasized the ASEEQ areas of relative strength and then noted areas in which the ratings were relatively lower. Student written comments were then examined for themes and relevant information. 
The consultant then suggested that the teacher select 2 or 3 ASEEQ dimensions that were important to the teacher (based on responses to the self-evaluation instrument that had previously been completed by the teacher) and that had received relatively lower ratings by students (based on decile ranks that compared the ratings by that teacher with those of all other teachers in the study -- noting potential limitations in these normative comparisons). In some cases the consultant suggested ASEEQ dimensions that satisfied these criteria, but the final selection of target areas was always made by the teacher. As part of the process, the consultant asked the teacher if "these are appropriate areas to target improvement efforts." The consultant and teacher then considered the ratings of individual ASEEQ items in each targeted dimension and the student written comments relevant to the targeted areas. The consultant then introduced the teaching idea packet relevant for each ASEEQ dimension that had been targeted. There were between 17 and 32 suggested strategies for each ASEEQ dimension that were largely adapted from materials developed by Wilson (1986). The consultant noted that each strategy was only a potential suggestion and that some would be inappropriate in a particular situation, but that some strategies -- or derivations of them -- might be appropriate for the teacher to test out. The teacher read the suggested strategies, discussed how they might be applied in his/her situation, and noted the strategies (or variations thereof) that he/she would pursue. The teacher then recorded the particular strategy and any variations in his/her copy of the teaching idea packet. In concluding the session, the consultant noted the ASEEQ areas selected by the teacher that would be the focus of the intervention and the strategies selected for this purpose. 
The teacher was asked if he/she felt that they would be able to carry out the suggestions and whether the strategies were likely to lead to improvement. In closing, the consultant noted that he would send a brief letter summarizing the feedback/consultation session (particularly the targeted areas and selected strategies) in 2 weeks and in 4 weeks would telephone the teacher to check on progress. All materials, including the teaching idea packets for all (targeted and nontargeted) ASEEQ dimensions, were left with the teacher and the teacher was encouraged to telephone the consultant if he/she subsequently wanted to discuss any aspect of the study.

Statistical Analyses. An important problem in any applied field research -- particularly a longitudinal study involving multiple waves of data collection -- is how to deal with missing data. Of the 92 teachers who began the study, a total of 9 had missing SETs for at least one of the three waves; 5 in the MT group, 3 in the control group, and 1 in the ET group. The reasons for the missing data were that the teacher was not teaching any instructional sequence at either T2 or T3 (reflecting a change in scheduling or a misunderstanding of the requirements for participation in the study), was unable to allocate class time for the administration of the SETs, or forgot to administer the forms despite intending to do so and having been sent ASEEQ forms and relevant materials. In order to facilitate analyses and presentation, all results presented in the results section are based on the 83 teachers with complete data. Supplemental analyses were conducted for teachers with partially complete data. The 9 teachers with some missing data did not differ significantly from the remaining 83 teachers on T1 ratings that were complete for all participating teachers. Four of the 5 MT teachers and 2 of the 3 control teachers with missing data had T2 or T3 responses, making possible some comparisons between the MT and control groups.
Each of these comparisons was pursued in unreported analyses, but did not differ from the results that are presented in the results section in terms of effects being statistically significant or nonsignificant. Effect sizes in these supplemental analyses were also similar to those subsequently reported in the results section. Similarly, comparisons of ET and control groups were possible for teachers who had missing T2 data by using T1 responses -- instead of the average of T1 and T2 responses -- as the pretest covariate. Again, however, none of the differences based on these unreported analyses differed from those reported in the analyses in terms of being significant or nonsignificant, and the effect sizes were similar to those subsequently reported. An interesting feature of the present investigation is that teachers completed self-evaluation surveys that included ratings of the importance of each ASEEQ area. These importance ratings were, of course, a central component of the intervention process. In addition, however, we used the importance ratings to construct importance weighted total scores based on the 8 ASEEQ dimensions. Following procedures described by Marsh (1986) the importance ratings were "ipsatized." For each teacher the mean importance rating for the 8 ASEEQ dimensions was computed and then each individual importance rating was divided by this mean. This resulted in a set of 8 "ipsatized" importance scores that had a mean of exactly 1.0 for each individual teacher. This provided an index of the relative importance of each ASEEQ factor -- relative to the importance ratings assigned by the same teacher to other ASEEQ factors. The importance weighted total score was then computed by taking the mean crossproduct of each ipsatized importance rating multiplied by the student rating of the corresponding ASEEQ factor. Separate weighted averages were computed using the importance ratings and SETs at T1, T2, and T3 so long as there were no missing values. 
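The ipsatizing and weighting steps described above can be illustrated with a minimal sketch. It assumes one importance rating and one class-average student rating per ASEEQ dimension for a single teacher at a single time; the function names and sample values are hypothetical and are not taken from the study.

```python
from statistics import mean

def ipsatize(importance):
    """Divide each importance rating by the teacher's own mean importance.

    The resulting scores have a mean of 1.0 for that teacher.
    """
    m = mean(importance)
    return [x / m for x in importance]

def importance_weighted_total(importance, ratings):
    """Mean crossproduct of ipsatized importance and student ratings."""
    weights = ipsatize(importance)
    return mean([w * r for w, r in zip(weights, ratings)])

def unweighted_total(ratings):
    """Plain average of the ASEEQ scale scores, for comparison."""
    return mean(ratings)

# Hypothetical values for one teacher (8 ASEEQ dimensions, 1-9 scale).
importance = [9, 7, 5, 8, 6, 9, 4, 8]
ratings = [6.5, 7.0, 5.5, 6.0, 7.5, 6.0, 8.0, 5.0]
weighted = importance_weighted_total(importance, ratings)
unweighted = unweighted_total(ratings)
```

Because the weights average 1.0, the weighted and unweighted totals are on the same scale; they differ only to the extent that a teacher's higher-rated dimensions are also the ones he or she considers more important.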
If SETs were missing, the weighted average total was deemed to be missing (and these teachers were excluded from the final analyses). If the importance ratings were missing, however, the relative importance of each ASEEQ factor was determined by taking the average importance assigned to that ASEEQ factor on the remaining self-evaluation surveys. Because all teachers completed the self-evaluation survey at least once, this procedure allowed us to compute weighted averages for all teachers included in the final analyses. For purposes of comparison, the corresponding unweighted average of the 8 ASEEQ scales was also computed. Although a wide variety of analytic techniques are appropriate, a multiple regression (general linear model) approach to analysis of variance (ANOVA) was selected because of its flexibility. In the general analytic strategy, each outcome (SETs at T2 or T3 depending on the particular comparison) was related to a dichotomous grouping variable (ET vs. control or MT vs. control), a pretest covariate (SETs at T1 or the average of T1 and T2 responses depending on the comparison), and the group x covariate crossproduct reflecting the aptitude-treatment interaction. In order to facilitate comparisons, all independent and dependent variables were first standardized (mean = 0, SD = 1), the crossproduct terms were based on products of z-scores (but were not subsequently standardized), and the results are presented in terms of unstandardized beta weights. Except for the interaction terms, these results are exactly the same as the standardized beta weights resulting from the analysis of untransformed independent and dependent variables that are typically easier to interpret than unstandardized beta weights. The standardized beta weights for interaction terms, however, are not generally comparable to those based on nonproduct terms.
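The general analytic strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' SPSS procedure: it uses the sample standard deviation for standardizing, fits ordinary least squares through the normal equations, and all function names are hypothetical.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize to mean 0 and SD 1 (sample SD is an assumption here)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ati_regression(outcome, group, covariate):
    """Regress a standardized outcome on group, covariate, and their product.

    The crossproduct is the product of z-scores and is NOT re-standardized,
    so the reported weights are unstandardized weights for z-scored variables.
    """
    y, g, c = zscores(outcome), zscores(group), zscores(covariate)
    gc = [gi * ci for gi, ci in zip(g, c)]
    columns = [[1.0] * len(y), g, c, gc]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns]
           for ci in columns]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    b0, b_group, b_cov, b_inter = solve(xtx, xty)
    return {"intercept": b0, "group": b_group,
            "covariate": b_cov, "interaction": b_inter}
```

Here `group` would be a dichotomous code (for example 0 for control and 1 for a feedback group), `covariate` the pretest SETs, and `outcome` the posttest SETs. A nonzero interaction weight would indicate that the group difference varies with initial teaching effectiveness.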
The procedure used here is an effective compromise in which all effects -- including interaction terms -- are appropriately presented in relation to the standard deviations of the underlying variables (see Aiken & West, 1991, for more discussion). In subsequent analyses, a traditional repeated measures ANOVA was used to compare relative changes in the ratings of targeted and nontargeted ASEEQ factors over time. Because of potential problems related to the "sphericity" assumption in repeated measures analysis, all tests of statistical significance were conducted using the Greenhouse-Geisser epsilon and the Huynh-Feldt epsilon correction factors (SPSS, 1991). Because both these estimates of epsilon were close to 1.0, there were no differences in effects judged to be statistically significant using either of these approaches or the traditional (uncorrected) approach.

Preliminary analyses.

Factor analysis. Particularly because the ASEEQ has not been used previously, it is important to demonstrate that SEEQ dimensions identified in North American settings (e.g., Marsh, 1987; Marsh & Hocevar, 1991) can be replicated. As argued elsewhere (e.g., Marsh, 1987), the most appropriate unit of analysis for such factor analyses is the class average. In order to obtain as large a sample as possible, all classes evaluated with ASEEQ were included -- those formally considered in the feedback intervention and those that were not. Also, each set of evaluations for each teacher -- those based on T1, T2, and T3 responses -- was considered as a separate case. Hence, the evaluations were based on a total of 305 sets of ratings of 118 different teachers (92 of whom also participated in the feedback/consultation intervention). The factor analysis -- as in earlier SEEQ research -- consisted of a principal axis factor extraction, following a Kaiser normalization, and an oblique rotation using the commercially available SPSS (1991) procedure.
The 32 "target loadings" (see Table 1), the factor loadings of items designed to measure each factor, are consistently high (median loading = .64) and none is less than .37. The remaining 256 "nontarget loadings" are consistently much smaller (median loading = .07) and none is greater than .38. Not surprisingly, the remaining 27 factor loadings associated with the 3 overall rating items tend to load on several different SEEQ factors -- particularly the Teacher Enthusiasm, Organization, Learning/value, Assignments, and Individual Rapport factors. In summary, the results of this factor analysis demonstrate a clear "simple structure" that is consistent with previous SEEQ research.

Insert Table 1 About Here

Although not presented, separate factor analyses were conducted for classes evaluated at T1, T2, and T3. In particular, the evaluations collected at the end of each semester (T2 and T3) provided very good solutions -- slightly better, perhaps, than the one based on all three times (Table 1). The factor solution based on T1 (midterm) ratings was not quite so clean. It is not clear whether this was because some items could not be adequately evaluated at this time (e.g., some students responded "not appropriate" to items about examinations and assignments) or because many students had not previously completed a SET instrument. Whereas the factor analyses generally support the a priori factor structure, there may be some support for the Marsh and Overall (1980) contention that MT SETs may not be as valid as ET SETs.

Reliability. The reliability of SETs is most appropriately evaluated in studies of interrater agreement (i.e., agreement among ratings by different students in the same class; for further discussion see Feldman, 1977; Gilmore, Kane, & Naccarato, 1978; Marsh, 1987).
The correlation between responses by any two students in the same class (i.e., the single rater reliability) is typically in the .20s but the reliability of the class-average response depends upon the number of students rating the class. For example, the estimated reliability for SEEQ factors (Marsh, 1987) is about .95 for the average response from 50 students, .90 from 25 students, .74 from 10 students, and .60 from five students. Given a sufficient number of students, the reliability of class-average SETs compares favorably with that of the best objective tests. For present purposes the intra-class correlation was used to assess the reliability of class-average responses to each ASEEQ item and each ASEEQ scale. This was accomplished with a oneway ANOVA that divides variability in individual student scores into within-class and between-class components. If class-average differences are no larger than expected by chance (i.e., the F-ratio is 1.0), then the reliability of class-average scores is -- by definition -- zero. Estimates of reliability for class-average scores are higher when there are larger differences between classes, smaller differences within classes, and larger numbers of students within each class. The reliability should be higher when the average number of students in each class is larger (all other things being equal -- always a worrisome assumption). Based on results from the total sample of 305 classes, the median interrater reliability is .89 (Table 2) for an average class size of 23 students and is comparable to the median of .90 for an average class size of 25 students reported in earlier (North American) SEEQ research. Reliability estimates of ASEEQ scale scores are consistently somewhat higher than those of the items that comprise the scales.
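The dependence of class-average reliability on the oneway ANOVA F-ratio and on the number of raters can be sketched as follows. The sketch assumes the common estimate (F - 1) / F for the reliability of class means, which is zero when F is 1.0 as stated above, and uses the Spearman-Brown formula to step from single-rater to class-average reliability. The source does not give the exact formulas used, so both are assumptions, and the function names are hypothetical.

```python
from statistics import mean

def class_average_reliability(classes):
    """Reliability of class means from a oneway ANOVA.

    classes: list of lists, each holding the individual student
    ratings for one class. Returns (F - 1) / F.
    """
    scores = [s for c in classes for s in c]
    grand = mean(scores)
    k, n = len(classes), len(scores)
    ss_between = sum(len(c) * (mean(c) - grand) ** 2 for c in classes)
    ss_within = sum((s - mean(c)) ** 2 for c in classes for s in c)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    f_ratio = ms_between / ms_within
    return (f_ratio - 1.0) / f_ratio

def spearman_brown(single_rater_r, n_raters):
    """Reliability of the average of n_raters, given single-rater reliability."""
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)
```

As an illustration, a single-rater reliability of .25 (a value chosen from within the .20s range, not reported in the study) gives a Spearman-Brown estimate of about .89 for 25 raters, which is in line with the class-size pattern described above.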
Consistent with expectations, the reliability estimates vary systematically with class size (median estimates are .80, .83, .86, and .96 for groups of classes in which the average class sizes are 10, 16, 21 and 48). In summary, these results demonstrate that responses to ASEEQ are reliable for classes of even moderate size and are consistent with previous SEEQ research conducted in North America.

Insert Table 2 About Here

The SEEQ Workload factor and responses to the experimental item asking students to rank overall teaching effectiveness in relation to a hypothetical "representative sample" of 100 teachers (on a 1 to 100 scale) are not considered further. The Workload factor was treated as a background factor (see Marsh, 1983, 1987); information was presented to teachers and discussed as part of the feedback/consultation -- along with other background information such as expected grades, class size, etc. -- but it was not specifically considered as a target dimension and there were no "strategies" associated with this scale. Also, interpretations of the workload ratings are complicated in that scores vary along a nonlinear scale in which some intermediate value is optimal (i.e., a class that is neither too easy nor too difficult). The experimental overall teacher "ranking" (Q32 in Tables 1 and 2) was included in an unsuccessful attempt to counter the typical negative skew in SETs. Also, some students apparently misunderstood the 1-100 response scale and responded on the 1-9 scale used for the other SEEQ items. For purposes of the preliminary analyses presented here, values between 1 and 9 were multiplied by 10. Also, whereas the extended response scale was intended to produce more reliable responses, reliability analyses in Table 2 indicate that it is slightly less reliable than the traditional overall teacher rating (Q31).

Results

In order to facilitate presentation of the results, separate analyses of the effects of ET and MT feedback are presented.
We begin with the ET results in which the analyses are summarized most easily, and then move to the more complicated analyses based on the MT feedback. Finally, we compare results based on those ASEEQ scales that teachers specifically selected to target for purposes of the intervention with those based on the remaining nontargeted areas.

End of Term Feedback

A multiple regression approach to analysis of covariance (Table 3) was used to assess differences between ET and control group ratings at the end of the second semester (T3). For purposes of this analysis, each T3 ASEEQ score was related to its covariate (the mean of the corresponding T1 and T2 score after standardizing each), a group contrast variable (ET vs. control), and their interaction. Not surprisingly, the effects of the covariate were substantial (betas of .5 to .7) indicating that SET ratings are stable over time (i.e., semester 1 to semester 2). Of central importance to the present investigation, the ET feedback group has higher ratings for all 12 ASEEQ scores and 8 of these differences are statistically significant (see group effects in Table 3). Only one of the covariate x group interactions is statistically significant, indicating that the generally positive effects of the intervention generalize reasonably well across teachers with initial differences in their teaching effectiveness (i.e., there was little evidence of aptitude-treatment interactions). The one statistically significant interaction -- as well as the largest of the nonsignificant interactions -- suggested that initially less effective teachers according to T1 responses benefited most from the intervention.

Insert Table 3 About Here

In interpreting these ET feedback results, it is important to note that the overall ratings may be more defensible to use than ratings of the specific ASEEQ scales.
Because each teacher in the ET group targeted only a few (typically 2 or 3) of the ASEEQ dimensions, the experimental group means for specific ASEEQ scales include ratings by teachers -- typically a majority of the teachers -- who did not target that specific dimension. Thus, it is not surprising that the differences are apparently smaller and sometimes do not reach statistical significance for the specific ASEEQ dimensions. Whereas it would be possible to base comparisons on only those experimental teachers who targeted each dimension, this subset of self-selected teachers no longer constitutes a randomly assigned group so that comparisons with the control group may be dubious. (This issue of the distinction between targeted and nontargeted dimensions is addressed in subsequent analyses.) It is, however, reasonable to expect the intervention to significantly influence overall teaching effectiveness no matter what specific ASEEQ dimensions were targeted. Consistent with this expectation, there were statistically significant effects for all 4 summary ratings (global ratings and total scores). Whereas it was anticipated the effects would be larger for the importance weighted total scores (Table 3), the effect sizes for the different summary scores are reasonably similar (i.e., the effect size d statistic varies between .4 and .5).

Mid-Term Feedback

A multiple regression approach similar to that used with the ET intervention was used to evaluate the MT intervention. The analysis, however, is complicated by the fact that both T2 (end of first semester) and T3 (end of second semester) ratings are outcome measures. As in the typical (mid-term) feedback study, the T2 ratings provide a basis for evaluating the short-term, immediate effects of the intervention administered in the middle of the first semester.
Because MT feedback teachers received an additional feedback/consultation at the end of the first semester, the T3 scores provide a basis for evaluating the continued and cumulative effectiveness of the intervention process.

Insert Table 4 About Here

Each T2 and T3 ASEEQ score was related to the effects of the covariate (the corresponding T1 score), a group contrast (MT vs. control group), and their interaction. The T2 (Table 4) results indicate that none of the group differences are statistically significant and that this lack of difference does not interact with initial levels of teaching effectiveness (i.e., the group x covariate interactions were nonsignificant). The T3 results also indicate that none of the group differences are statistically significant. In these analyses, however, 4 of the 12 group x covariate interactions are statistically significant. As with the ET group comparisons, the nature of these interactions (as well as the nonsignificant interactions that approach statistical significance) indicate that the intervention is more beneficial for the initially less effective teachers. Thus, whereas evidence for the effectiveness of the MT intervention is weak, there is some support that it works with teachers who are initially less effective.

SETs in Targeted and NonTargeted ASEEQ Dimensions

The distinction between targeted and nontargeted ASEEQ dimensions is a critical feature of the intervention that has not been adequately captured in the analyses presented thus far. For any particular ASEEQ dimension, the so-called intervention effect was based on results of some teachers who actually targeted that dimension but the majority of the experimental group did not (i.e., they selected other dimensions to target). In this respect, results for the specific dimensions presented thus far may not give an adequate representation of the intervention effect.
In contrast to the ratings of specific ASEEQ dimensions, the overall ratings -- particularly the overall teacher rating -- and total scores provide a fairer representation of the intervention effects in that all teachers in the experimental groups attempted to enhance their overall teaching effectiveness. Even these summary scores, however, do not adequately represent the multidimensional emphasis in previous SEEQ research that was the basis of this intervention. Unfortunately, there appears to be no fully satisfactory approach to the analysis of the target/nontarget ratings. Whereas it would be possible to compare ratings of experimental teachers who did and did not target a specific ASEEQ dimension and to compare these with those of the control group, interpretations of these results would be dubious. Because each experimental teacher typically targeted only 2 or 3 of the 8 ASEEQ dimensions, such comparisons would be based on small samples. More importantly, the self-selected group of teachers selecting any one ASEEQ dimension is clearly not a "random" sample of teachers. Thus any observed group differences confound the effects of the intervention with initial group differences. Furthermore, to the extent that teachers initially selected ASEEQ scales on which they initially had the poorest ratings, apparent gains in these scales relative to teachers who did not target these scales and the control group would be expected on the basis of regression to the mean. The approach used here is to consider 6 scores for each teacher: the mean of targeted and nontargeted ASEEQ dimensions at T1, T2, and T3. In this sense, the nontargeted dimensions form one basis of comparison for evaluating changes in the targeted dimensions that most accurately reflect the intervention effects.
Whereas teachers in the control group did not actually target any dimensions, the targeted dimensions for experimental groups usually consisted of ASEEQ scales that were relatively high in importance (as perceived by the teacher) and relatively low in terms of SETs at T1. Using these criteria, we selected ASEEQ factors that we would have recommended to be targeted by control teachers. Although not totally satisfactory, this approach provides a basis for comparing differences in the ratings of targeted and nontargeted dimensions in the three groups (Figure 1) that provides an apparently defensible control for regression effects. A preliminary inspection of Figure 1 reveals that targeted dimensions (dashed lines) are rated substantially lower than nontargeted dimensions (solid lines) for all three groups at T1 and T2. This is, of course, to be expected in that these scales were targeted in part because the initial ratings were low. At T3, targeted dimensions are still rated substantially lower than nontargeted dimensions in the control group. In the two experimental groups, however, ratings of the targeted dimensions are marginally better than those of the nontargeted dimensions at T3. Over the course of the study, ratings of targeted dimensions improved substantially relative to nontargeted areas for both experimental groups, but not for the control group.

Insert Table 5 and Figure 1 About Here

In order to evaluate the statistical significance of these apparent effects, a 3 group (MT, ET, control) x 3 time (T1, T2, T3) x 2 target (target, nontarget) analysis of variance was conducted in which time and target are within-subject factors (repeated measures factors) and group is a between-subject factor. The results (Table 5) demonstrate significant main effects of time and target, and a significant time x target interaction. Overall, ratings went up over time, target ratings were lower than nontarget ratings, and the target/nontarget difference changed over time.
The critical effect for present purposes, however, is the statist