A Multi-Facet Rasch Analysis of the College Teacher Evaluation Inventory

 

Wen-Chung Wang

National Chung Cheng University

Ying-Yao Cheng

National Sun Yat-Sen University

Correspondence:

Wen-Chung Wang

Department of Psychology

National Chung Cheng University

Chia-Yi, Taiwan

Phone: (886) 52720411 EXT. 6430

FAX: (886) 52720857

E-Mail: psywcw@ccunix.ccu.edu.tw

 

Abstract

A teacher evaluation inventory with ten Likert items was developed. Thirty college teachers were rated by 293 students. A multi-facet Rasch technique was applied to analyze the data. The items fit the Rasch-type model fairly well. The separation reliability for the teachers is .98, meaning that the items differentiate the teachers very well. A cutoff score was set at .50 logits: only teachers with performances above the cutoff score can apply for an outstanding teacher award.

Keywords: teacher evaluation, Rasch model, outstanding teachers, inventory.

Introduction

Teacher evaluation has become very common in colleges (Peterson, 1995; McLaughlin & Pfeifer, 1988; Shinkfield & Stufflebeam, 1995). Students' ratings of teachers' performance are the keystone of the evaluation. Most teacher evaluation inventories attempt to identify poor teachers and to give teachers feedback on instruction. Because of this purpose, items in teacher evaluation inventories are usually not very "difficult" for ordinary teachers to accomplish. Such items are too "easy" to identify outstanding teachers, just as easy test items do not provide sufficient information about gifted students. To distinguish outstanding teachers from ordinary ones, more difficult items are needed.

This paper describes how we were sponsored by the College of Management, National Sun Yat-Sen University, to develop a teacher evaluation inventory and how the Rasch technique (Rasch, 1960) was applied to analyze the data and to set a cutoff score for teachers who wish to apply for the outstanding teacher award in the college. The faculty in the college were not satisfied with the teacher evaluation inventories developed by the university. These inventories, like most other teacher evaluation inventories, require students to evaluate teachers mainly on preparation, instructional behavior, and assessment. Examining the items carefully, we found that they describe tasks that ordinary teachers should accomplish, such as preparing a syllabus, giving instruction as scheduled, and giving quizzes. These items are so "easy" that accomplishing the tasks does not make a teacher outstanding. Even if students give a teacher very high scores, it does not necessarily mean that the teacher's performance is outstanding.

The college wished to develop an inventory to screen teachers: only teachers who pass a cutoff score on the inventory can apply for the outstanding teacher award granted by the college. For those teachers who pass the cutoff score, a second evaluation will be carried out. The inventory had to be short, because the college did not want to increase students' burden given that they still have to fill out the old inventories, but not so short that it could not differentiate outstanding teachers from ordinary teachers.

The Teacher Evaluation Inventory

What makes a good teacher? This question has been raised for thousands of years. The ancient Chinese recognized two kinds of teachers: knowledge teachers and mentors. A knowledge teacher focuses on transferring knowledge to students, whereas a mentor focuses on cultivating personality. A famous ancient Chinese philosopher defined a teacher as one who delivers "Dao" (i.e., values, meaning of life, etc.), transfers knowledge, and answers questions. To sum up, it is the "spirit" that distinguishes an outstanding teacher from an ordinary one. This is consistent with more recent findings, such as those of Ahern (1973), Carrol and Tyson (1981), Kowalski and Weaver (1988), Norris and Richburg (1997), and Zahorski (1996).

The inventory was developed on the basis of literature reviews and interviews with the faculty in the college. The final version contains ten five-point Likert items, shown in Figure 1. The inventory focuses mainly on teacher-student interaction and on being a role model for students and other teachers. To assess the inventory empirically and to set the cutoff score, 306 college students were surveyed. Altogether, these students gave 1111 ratings to 60 teachers. Half of the teachers were rated by fewer than 10 students and were removed from further analysis. Consequently, the data set contains 981 ratings, which were given to 30 teachers by 293 students. Table 1 shows the teacher IDs and the numbers of ratings for all 60 teachers. For the 30 retained teachers, the numbers of ratings are between 11 and 72, with a mean of 32.70 (981 ratings over 30 teachers).

Table 1

Teacher ID and number of ratings

ID    Frequency    ID    Frequency    ID    Frequency
B01   3            F01   15           M04   3
B03   4            F02   14           M05   13
B04   1            F03   51           M06   2
B07   4            F04   48           M07   20
B08   2            F05   29           M08   4
B12   34           F06   61           M09   11
B13   5            F07   27           M10   6
B14   26           F08   8            M11   22
B15   11           F09   29           M12   5
B16   18           F10   2            M13   12
B17   5            F11   72           M14   46
B18   33           F12   59           M15   5
B19   41           F13   1            M16   6
B20   9            F14   8            M17   27
B22   1            F15   56           M18   50
B23   4            F16   6            M19   50
B25   36           H03   14           M20   7
B27   5            M01   25           M21   31
B28   1            M02   5            P02   4
B29   5            M03   3            P04   6
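The screening of teachers by number of ratings can be checked directly from Table 1. The following is a minimal sketch, assuming the frequencies are transcribed as shown; the names `ratings`, `MIN_RATINGS`, and `retained` are ours and are not part of the original analysis.

```python
# Number of ratings per teacher, transcribed from Table 1.
ratings = {
    "B01": 3, "B03": 4, "B04": 1, "B07": 4, "B08": 2, "B12": 34, "B13": 5,
    "B14": 26, "B15": 11, "B16": 18, "B17": 5, "B18": 33, "B19": 41, "B20": 9,
    "B22": 1, "B23": 4, "B25": 36, "B27": 5, "B28": 1, "B29": 5,
    "F01": 15, "F02": 14, "F03": 51, "F04": 48, "F05": 29, "F06": 61, "F07": 27,
    "F08": 8, "F09": 29, "F10": 2, "F11": 72, "F12": 59, "F13": 1, "F14": 8,
    "F15": 56, "F16": 6, "H03": 14, "M01": 25, "M02": 5, "M03": 3,
    "M04": 3, "M05": 13, "M06": 2, "M07": 20, "M08": 4, "M09": 11, "M10": 6,
    "M11": 22, "M12": 5, "M13": 12, "M14": 46, "M15": 5, "M16": 6, "M17": 27,
    "M18": 50, "M19": 50, "M20": 7, "M21": 31, "P02": 4, "P04": 6,
}

# Teachers rated by fewer than 10 students were removed from the analysis.
MIN_RATINGS = 10
retained = {tid: n for tid, n in ratings.items() if n >= MIN_RATINGS}

total_all = sum(ratings.values())
total_retained = sum(retained.values())
mean_retained = total_retained / len(retained)

print(len(ratings), total_all)            # teachers and ratings before screening
print(len(retained), total_retained)      # teachers and ratings after screening
print(min(retained.values()), max(retained.values()), mean_retained)
```

Because no teacher has exactly 10 ratings, the choice between "fewer than 10" and "10 or fewer" does not affect which teachers are retained.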

 

 

  1. Strongly Disagree
  2. Disagree
  3. In Between
  4. Agree
  5. Strongly Agree
  6. Not Applicable

The teacher 1 2 3 4 5 6

1 Emphasizes teacher-student interaction and
stimulates students' reaction. □ □ □ □ □ □

2 Is good at establishing dynamic learning climate
and promoting learning interests. □ □ □ □ □ □

3 Is able to combine theories and practices to help
students apply knowledge. □ □ □ □ □ □

4 Actively identifies those students with learning
difficulty and provides remedial instruction. □ □ □ □ □ □

5 Provides those students with special learning interests
with additional materials to increase their learning
effects. □ □ □ □ □ □

6 Respects students' ideas and treats them equally. □ □ □ □ □ □

7 Is actively concerned with students and establishes
warm teacher-student relationship. □ □ □ □ □ □

8 Not only transfers knowledge to students but also
helps them cultivate values. □ □ □ □ □ □

9 Behaves himself/herself as to be a role model of
behavior and thought for students. □ □ □ □ □ □

10 Treats teaching as profession, respects and loves
the career, and is a role model for other teachers. □ □ □ □ □ □

Figure 1

The teacher evaluation inventory

The Rasch Model

In recent years, the Rasch technique has been widely used to analyze test data. If data fit the Rasch model or one of its extensions (e.g., the rating scale model, Andrich, 1978; the partial credit model, Masters, 1982), the estimates of person ability and item difficulty are mutually independent (i.e., specific objectivity holds). The derived scale is interval, and parametric statistics become applicable. If test data do not fit the Rasch model, the items in the test do not construct a quantitative variable.

Let $p_1$ denote the probability of scoring 1 and $p_0$ that of scoring 0. Let $\theta_n$ denote the ability of person $n$ and $\delta_i$ the difficulty of item $i$. According to the Rasch model,

$$\log(p_1 / p_0) = \theta_n - \delta_i. \qquad (1)$$

If the items are scored polytomously in ordered categories (e.g., 0, 1, 2, 3, 4), the Rasch model can be extended to

$$\log(p_j / p_{j-1}) = \theta_n - \delta_{ij}, \qquad (2)$$

where $p_j$ and $p_{j-1}$ are the probabilities of scoring $j$ and $j - 1$, respectively, and $\delta_{ij}$ is the $j$th step difficulty of item $i$. This is the partial credit model.

To extend the model further, the item parameters can be linearly decomposed into several facets, such as a rater facet (e.g., essay questions rated by judges) and a criterion facet (e.g., each essay question rated on several criteria):

$$\log(p_j / p_{j-1}) = \theta_n - (A_k + B_l + \delta_{ij}), \qquad (3)$$

where A and B are two facets, such as rater and criterion. The linear decomposition has been proposed by several researchers, including Adams and Wilson (1996), Adams, Wilson, and Wang (1997), Fischer (1973, 1983), Fischer and Ponocny (1994), and Linacre (1989). In this study, the students responded to the items based on the performances of specific teachers. In addition to the usual person (student) and item facets, a teacher facet was formed. Each teacher was modeled with a parameter depicting his/her performance; a larger value indicates a more favorable evaluation. The following data analyses were done with the computer software ConQuest (Wu, Adams, & Wilson, 1998).
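To illustrate how Equation (3) turns into category probabilities, the following minimal sketch computes the probabilities of the five rating categories for one student, one teacher, and one item. It is not the ConQuest implementation. It assumes that the teacher parameter enters the step logit with a positive sign (student + teacher - step difficulty), so that a larger teacher value raises the expected rating; the function name and all numerical values are hypothetical.

```python
import math

def category_probabilities(student, teacher, step_difficulties):
    """Probabilities of scores 0..m on one item.

    Assumed adjacent-category logit for step j (an assumption of this sketch):
        log(p_j / p_{j-1}) = student + teacher - step_difficulties[j - 1]
    """
    # Cumulative sums of the step logits; score 0 corresponds to an empty sum.
    cumulative = [0.0]
    for d in step_difficulties:
        cumulative.append(cumulative[-1] + (student + teacher - d))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values for illustration only (not estimates from this study).
steps = [-2.5, -1.0, 0.5, 2.0]   # four steps of a five-category item
probs = category_probabilities(student=1.0, teacher=0.5, step_difficulties=steps)
print([round(p, 3) for p in probs])
```

Setting `teacher` to zero recovers the partial credit model of Equation (2), and a single step recovers the dichotomous model of Equation (1).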

 

Results

The Data-Model Fit

Figure 2 displays the fit statistics for the ten items. According to the unweighted fit Z (Wu, Adams, & Wilson, 1998), all items have a good data-model fit except item 10, which has a relatively poor fit.

Figure 2

Fit statistics for the ten items

Item 10 is:

The teacher treats teaching as profession, respects and loves the career, and is a model for other teachers.

This item might be too abstract for college students to respond to. Some interviewed students complained that they had little opportunity to observe teachers in this regard. This item needs revision. However, we did not discard it in the further analysis.

The Linear Relationship

Figure 3 displays the linear relationship of the student facet, the teacher facet, and the item facet. The student distribution was assumed to be normal, with an estimated mean of 1.06 logits. As the mean of the item difficulties is constrained to zero for model identification and the mean of the students is far above zero, these items are relatively easy for the students. In other words, taking these items as criteria for assessing teachers' performances, the students gave quite generous ratings, which means that the teachers' performances are satisfactory.

The estimates of teacher performance range from -1.90 logits to 1.67 logits. The separation reliability (Wright & Masters, 1982) is .98, so the ten items differentiate the teachers very well. The reliability will be even higher when the inventory is administered to all the students in the college (about 1000 students). In that case, most teachers (except those who offer only graduate courses) will be rated by more than 50 students.
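For readers unfamiliar with the separation reliability, the following minimal sketch shows one common way to compute it: the proportion of the observed variance of the estimates that is not attributable to estimation error. The estimates and standard errors below are hypothetical, not those of the 30 teachers, and the use of the sample variance is our choice for this sketch.

```python
from statistics import mean, variance

def separation_reliability(estimates, standard_errors):
    """(observed variance - mean squared standard error) / observed variance."""
    observed_var = variance(estimates)                      # sample variance of the estimates
    error_var = mean([se ** 2 for se in standard_errors])   # average error variance
    return (observed_var - error_var) / observed_var

# Hypothetical teacher estimates (logits) and standard errors, for illustration only.
estimates = [-1.9, -1.0, -0.4, 0.0, 0.6, 1.0, 1.7]
standard_errors = [0.15] * len(estimates)

reliability = separation_reliability(estimates, standard_errors)
print(round(reliability, 3))
```

The formula makes clear why the reliability is expected to rise with more ratings per teacher: the standard errors shrink while the spread of the teacher estimates stays roughly the same.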

       Student|Teacher     |Overall difficulty|Threshold difficulty
             X|            |                  |
              |            |                  |
  4           |            |                  |
             X|            |                  |
              |            |                  |
             X|            |                  |
              |            |                  |
            XX|            |                  |
             X|            |                  |
  3          X|            |                  |
             X|            |                  |
            XX|            |                  |
           XXX|            |                  |
            XX|            |                  |
           XXX|            |                  |
         XXXXX|            |                  |9.3
  2       XXXX|            |                  |
         XXXXX|            |                  |
         XXXXX|28          |                  |8.3 10.3
         XXXXX|            |                  |2.3 4.3 7.3
        XXXXXX|            |                  |1.3 5.3 6.3
      XXXXXXXX|5           |                  |3.3
      XXXXXXXX|            |                  |
  1     XXXXXX|6 11 19     |                  |
       XXXXXXX|            |                  |
       XXXXXXX|14 16 27    |4                 |
     XXXXXXXXX|1 24 26 30  |                  |
    XXXXXXXXXX|8 9 20      |                  |
     XXXXXXXXX|23          |5                 |
     XXXXXXXXX|            |2                 |
  0       XXXX|13          |6 9               |
         XXXXX|            |1 7               |
          XXXX|21 22       |3 8 10            |
            XX|15          |                  |6.2
            XX|2 18 29     |                  |1.2 3.2 10.2
             X|            |                  |
             X|10          |                  |2.2 7.2 8.2 9.2
 -1          X|12          |                  |
              |25          |                  |4.2 5.2
              |            |                  |
              |4           |                  |
              |17          |                  |
              |7           |                  |
              |            |                  |
 -2           |3           |                  |
              |            |                  |
              |            |                  |
              |            |                  |
              |            |                  |6.1 10.1
              |            |                  |2.1 7.1 9.1
              |            |                  |3.1 4.1 5.1
 -3           |            |                  |1.1 8.1
              |            |                  |
====================================================================

Figure 3

Linear relationship of the student facet, the teacher facet, and the item facet

We expected item 10 to be the most difficult because it represents the highest standard for successful teachers. According to Figure 3, item 4 is the most difficult and item 10 is much easier than expected. Item 4 is:

The teacher actively identifies students with learning difficulty and provides remedial instruction.

The interviewed students considered item 4 very difficult to achieve because it is laborious, especially in large classes.

The Cutoff Score

Since the college intended to adopt the teacher evaluation inventory to screen teachers for the outstanding teacher award, a cutoff score had to be set. Only teachers who pass the cutoff score can apply for the award. Since the award is very competitive, the requirement should be demanding. A committee was convened to set the minimum requirement. After assessing the items in the inventory, the committee decided that teachers should reach the "Agree" level on at least five items to apply for the award.

To make the above requirement applicable, it has to be translated to the logit scale. Suppose a teacher is rated by average students (i.e., students at 1.06 logits). What performance is needed to reach the "Agree" level on five items? Figure 4 shows the expected scores for various levels of teacher performance on the ten items. A teacher should be above .28 logits to reach the "Agree" level on the easiest item (item 10) and above 1.58 logits to reach it on the most difficult item (item 4). To reach the "Agree" level on the easiest five items, a teacher should be above .63 logits. Figure 5 displays the logits needed to reach the "Agree" level on the ten items.
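The translation of the committee's requirement to the logit scale can be sketched as follows. This is an illustration under stated assumptions, not the procedure run in ConQuest: we assume the five categories are scored 0 to 4 so that "Agree" corresponds to an expected score of 3, that the teacher parameter enters the step logit with a positive sign, and that the step difficulties are the hypothetical values given in the code. Only the student mean of 1.06 logits is taken from the text.

```python
import math

def expected_score(student, teacher, steps):
    """Expected rating (0..m) on one item for a given student and teacher."""
    cumulative = [0.0]
    for d in steps:
        cumulative.append(cumulative[-1] + (student + teacher - d))
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)

def teacher_level_for(target, student, steps, lo=-10.0, hi=10.0, tol=1e-8):
    """Bisection search; the expected score increases with the teacher value."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(student, mid, steps) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

AVERAGE_STUDENT = 1.06   # mean student location reported in the text
AGREE = 3                # "Agree" on a 0-4 scoring (assumption)

# Hypothetical step difficulties and item shifts, for illustration only.
base_steps = [-2.8, -0.9, 1.5, 3.0]
item_shifts = [-0.3, 0.0, -0.4, 0.7, 0.3, 0.0, -0.2, -0.4, 0.0, -0.4]

needed = [
    teacher_level_for(AGREE, AVERAGE_STUDENT, [s + shift for s in base_steps])
    for shift in item_shifts
]

# Reaching "Agree" on at least five items means clearing the fifth-lowest level.
cutoff_before_error = sorted(needed)[4]
print(round(cutoff_before_error, 2))
```

With the estimated step difficulties in place of the hypothetical ones, the fifth-lowest level plays the role of the .63 logits reported above.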

Estimation errors should be taken into account in setting the cutoff score. Figure 6 shows the standard errors of the estimates of teacher performance for various numbers of ratings. When the number of ratings is 50, the standard errors are around .08. Accordingly, .50 logits was set as the cutoff score, which is about 1.5 standard errors below .63 logits. Figure 7 shows the performances in logits for the 30 teachers. Twelve of them (40%) pass the cutoff score of .50 logits. This proportion is satisfactory in that it is reasonable and workable in practice.
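The arithmetic behind the cutoff can be written out as a worked check, assuming a standard error of .08 (the value read from Figure 6 for about 50 ratings):

```latex
\begin{align*}
.63 - 1.5 \times .08 &= .63 - .12 = .51 \approx .50, \\
\frac{.63 - .50}{.08} &= 1.625 .
\end{align*}
```

That is, a margin of 1.5 standard errors gives .51 logits, which rounds to the adopted cutoff of .50; measured exactly, the adopted cutoff lies a little more than 1.6 standard errors below .63 logits.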

Figure 4

Expected rating for various levels of teacher performance for the ten items

Figure 5

Teacher performances in logits needed to reach "Agree" level for the ten items

Figure 6

Relationship between standard errors of the estimates for teacher performances and numbers of ratings

Figure 7

Teacher performances in logits

Summary

In this study, we developed a teacher evaluation inventory to differentiate outstanding teachers from ordinary teachers. To assess the inventory and to set a cutoff score for applications for the outstanding teacher award, 306 students were surveyed. A multi-facet Rasch technique was used to analyze the data set. It was found that the items in the inventory construct a quantitative variable. The ten five-point items differentiate the teachers very well, with a separation reliability as high as .98.

To set the cutoff score in logits, the expected score for each item was calculated. To reach the "Agree" level on at least half of the items, teachers' performances should be above .63 logits, provided they are rated by average students. Taking the estimation errors into account, the cutoff score was set at .50 logits. Only those teachers with performances above .50 logits can apply for the award. Of the 30 teachers rated, 40% are qualified.

In this study, each teacher is given a distinct parameter to depict his/her performance across all the courses he/she taught. In doing so, we assumed that the teachers performed consistently across courses. This assumption had to be made because of the small sample size. Once the inventory is administered to all the students in the college, we will be able to investigate the variation across courses. Provided that the number of ratings is large enough, we will be able to give each teacher/course combination a distinct parameter. This extension can be easily carried out with ConQuest.

References

Adams, R. J., & Wilson, M. R. (1996). Formulating the Rasch model as a mixed coefficients multinomial logit. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 143-166). Norwood, NJ: Ablex.

Adams, R. J., Wilson, M. R., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-23.

Ahern, J. (1973). Identifying outstanding teachers in New England. Improving College and University Teaching, 21, 46-47.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.

Carrol, M. A., & Tyson, J. C. (1981). Good teachers can become better. Improving College and University Teaching, 29, 82-84.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48, 3-26.

Fischer, G. H., & Ponocny, I. (1994). An extension of the partial credit model with an application to the measurement of change. Psychometrika, 59, 177-192.

Kowalski, T. J., & Weaver, R. A. (1988). Characteristics of outstanding teachers: An academic and social involvement profile. Action in Teacher Education, 10, 93-99.

Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago: MESA Press.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

McLaughlin, M. W., & Pfeifer, R. S. (1988). Teacher evaluation: Improvement, accountability, and effective learning. New York: Teachers College Press.

Norris, G., & Richburg, R. W. (1997). Hiring the best. American School Board Journal, 184, 46, 48, 55.

Peterson, K. O. (1995). Teacher evaluation: A comprehensive guide to new directions and practices. CA: Corwin Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Institute of Educational Research. (Expanded edition, 1980. Chicago: The University of Chicago Press.)

Shinkfield, A. J., & Stufflebeam, D. L. (1995). Teacher evaluation: Guide to effective practice. Boston: Kluwer Academic Publishers.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.

Wu, M., Adams, R. J., & Wilson, M. R. (1998). ConQuest. Camberwell, Victoria: Australian Council for Educational Research.

Zahorski, K. J. (1996). Honoring exemplary teaching in the liberal arts institution. New Directions for Teaching and Learning, 65, 85-92.