A series of studies across countries and disciplines in higher education confirms that student evaluations of teaching (SET) are significantly correlated with instructor gender, with students regularly rating female instructors lower than their male peers. Anne Boring, Kellie Ottoboni and Philip B. Stark argue the findings warrant serious attention in light of increasing pressure on universities to measure teaching effectiveness. Given the unreliability of the metric and the harmful impact these evaluations can have, universities should think carefully about the role of such evaluations in decision-making.
Many universities rely heavily or exclusively on student evaluations of teaching (SET) for hiring, promoting and firing instructors. After all, who experiences teaching more directly than students? But to what extent do SET measure what universities expect them to measure—teaching effectiveness?
To answer this question, we apply nonparametric permutation tests to data from a natural experiment at a French university (the original study by Anne Boring is here), and a randomized, controlled, blind experiment in the US (the original study by Lillian MacNell, Adam Driscoll and Andrea N. Hunt is here). We confirm and extend the studies’ main conclusion: Student evaluations of teaching (SET) are strongly associated with the gender of the instructor. Female instructors receive lower scores than male instructors. SET are also significantly correlated with students’ grade expectations: students who expect to get higher grades give higher SET, on average. But SET are not strongly associated with learning outcomes.
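To give a flavour of the method (a minimal sketch with made-up numbers, not the analysis code from either study): a permutation test compares the observed difference in mean SET between two groups of instructors to the differences obtained when the same scores are randomly reshuffled between the groups, so it makes no assumption that scores follow any particular distribution.

```python
import numpy as np

def permutation_test(scores_a, scores_b, n_perm=10_000, seed=42):
    """Two-sample permutation test for a difference in mean SET.

    Returns the observed difference in means (a minus b) and a
    two-sided p-value estimated by randomly reassigning the pooled
    scores to the two groups n_perm times.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([scores_a, scores_b])
    n_a = len(scores_a)
    observed = scores_a.mean() - scores_b.mean()
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reshuffle scores between the two groups
        diff = pooled[:n_a].mean() - pooled[n_a:].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    return observed, hits / n_perm

# Hypothetical SET scores on a 1-5 scale, purely for illustration
male_set = np.array([4.2, 3.9, 4.5, 4.1, 3.8, 4.4])
female_set = np.array([3.7, 4.0, 3.5, 3.9, 3.6, 3.8])
diff, p = permutation_test(male_set, female_set)
print(f"observed difference: {diff:.2f}, permutation p-value: {p:.3f}")
```

Because SET are ordinal and typically far from normally distributed, this distribution-free approach avoids the parametric assumptions behind the usual t-test.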
Image credit: Daniel R. Blume CC BY-SA (Flickr)
Some studies have found little difference between average SET for male and female instructors, but the designs of those studies have serious flaws. Not only are they observational studies rather than experiments, they ask the wrong question, namely, “do male and female instructors get similar SET?” A better question is, “would female instructors get higher SET but for the mere fact that they are women?” Using these unique data sets, we can answer that question: “yes.”
The French Data: Since effective teaching should promote student learning, students of more effective instructors should have better learning outcomes on average. Students in different sections of each course, taught by different instructors, take the same final exam, allowing us to compare learning outcomes. We find that SET are at best weakly associated with student performance (Figure 1).
Figure 1. Average correlation between SET and final exam score, by subject
Note: p-values are one-sided, since, if SET measured teaching effectiveness, mean SET should be positively associated with mean final exam scores. Correlations are computed for course-level averages of SET and final exam score within years, then averaged across years. *** p<0.01, * p<0.1
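As a rough illustration of the procedure the note describes (hypothetical column names, not the authors’ code), the statistic is computed by collapsing to course-level averages within each year, correlating those averages, and then averaging the correlations across years:

```python
import pandas as pd

def avg_yearly_correlation(df):
    """Average, across years, of the within-year correlation between
    course-level mean SET and course-level mean final exam score.

    Assumes one row per student with columns:
    'year', 'course', 'set_score', 'exam_score'.
    """
    yearly = []
    for _, grp in df.groupby("year"):
        # Collapse to course-level averages within the year
        course_means = grp.groupby("course")[["set_score", "exam_score"]].mean()
        yearly.append(course_means["set_score"].corr(course_means["exam_score"]))
    return sum(yearly) / len(yearly)
```

A one-sided permutation p-value can then be estimated as in the sketch above, counting only reshuffled statistics at least as large as the observed one.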
On the other hand, SET are significantly correlated with instructor gender (male students gave higher SET to male instructors, Figure 2) and with students’ expected grades. This adds evidence to the hypothesis that instead of promoting better teaching, SET contribute to grade inflation. We find no evidence that male teachers are more effective than female teachers. If anything, students of male instructors perform worse on the final exam.
Figure 2. Average correlation between SET and gender concordance
Note: p-values are two-sided. *** p<0.01, ** p<0.05, * p<0.1
The US Data: MacNell et al. (2014) collected data from four online sections of a course, two taught by a male instructor and two by a female instructor. Students were assigned randomly to the four sections. Each instructor taught one section under his or her own identity and the other section under the other instructor’s identity. This lets us see how believing that an instructor is male or female affects SET for the very same instructor. We confirm the original authors’ main finding that students generally rate perceived female instructors lower on several dimensions of teaching (Figure 3).
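To sketch how such a design can be analysed without distributional assumptions (hypothetical data layout and function names, not the authors’ code): because students were randomized to sections, one can reshuffle the perceived-gender labels among the students of each actual instructor and count how often the reshuffled difference in mean ratings is as large as the one observed.

```python
import numpy as np

def stratified_permutation_test(ratings, perceived_male, actual_instructor,
                                n_perm=10_000, seed=0):
    """Permutation test stratified by actual instructor.

    Shuffles perceived-gender labels only among students taught by the
    same actual instructor, respecting the randomization in the
    identity-swap design. Returns the observed mean difference
    (perceived male minus perceived female) and a two-sided p-value.
    """
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    perceived_male = np.asarray(perceived_male, dtype=bool)
    actual_instructor = np.asarray(actual_instructor)

    def mean_diff(labels):
        return ratings[labels].mean() - ratings[~labels].mean()

    observed = mean_diff(perceived_male)
    hits = 0
    for _ in range(n_perm):
        shuffled = perceived_male.copy()
        for instr in np.unique(actual_instructor):
            # Shuffle labels only within this instructor's students
            idx = np.flatnonzero(actual_instructor == instr)
            shuffled[idx] = rng.permutation(shuffled[idx])
        if abs(mean_diff(shuffled)) >= abs(observed):
            hits += 1
    return observed, hits / n_perm
```

Stratifying by actual instructor isolates the effect of perceived gender: any real difference between the two instructors’ teaching is held fixed, and only the labels vary.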
Even on measures one would expect to be objective, ratings were lower for perceived female instructors. For instance, graded assignments were returned simultaneously in all four sections, but students reported that the perceived female instructor was less prompt in returning assignments. Since SET were on a scale of 1 to 5, the observed difference in means, 0.80, is 20% of the full range.
Figure 3. Difference in mean ratings and reported instructor gender (male minus female)
Note: The scale is 1-5 points, so a difference of 0.8 is 20% of the full range. p-values are two-sided. *** p<0.01, * p<0.1
In both the French and US data, male instructors got higher SET, but in the US data, female students tended to give higher scores to perceived male instructors (Figure 4), whereas in the French data, male students tended to give higher scores to male instructors.
Figure 4. Difference in mean SET by student gender, for perceived and actual instructor gender (male minus female)
Note: The p-values are not reported but can be found in the article (p.26-27).
In another study, conducted in the Netherlands, researchers found that female instructors receive lower scores because male students give lower scores to female instructors. Differences among these studies might be cultural or related to topic, class size, mode of instruction (online versus face-to-face), ethnicity, race, physical attractiveness, or other confounding variables that have been found to affect SET. Clearly, there can be no simple adjustment for the bias.
The French data show that bias varies by course subject, further complicating any attempt to correct for these biases. The only field in which male students do not rate male instructors significantly higher is Sociology (Figure 2). This is especially interesting because Sociology is the only field in which there was near gender balance among instructors (46.4% female instructors), which might suggest that gender balance in a field affects gender stereotypes and may reduce bias against female instructors.
Why don’t universities use better methods? SET are the familiar devil. Habits are hard to change. Alternatives (reviewing teaching materials, peer observation, surveying past students, and others) are more expensive and time-consuming, and this cost falls on faculty and administrators rather than on students. The mere fact that SET are numerical gives them an unearned air of scientific precision and reliability. And reducing the complexity of teaching to a single (albeit meaningless) number makes it possible to compare teachers. This might seem useful to administrators, but it is a gross over-simplification of teaching quality.
The sign of any connection between SET and teaching effectiveness is murky, whereas the associations between SET and grade expectations and between SET and instructor gender are clear and significant. Because SET are evidently biased against women (and likely against other underrepresented and protected groups)—and worse, do not reliably measure teaching effectiveness—the onus should be on universities either to abandon SET for employment decisions or to prove that their reliance on SET does not have disparate impact.
This blog post is based on a ScienceOpen preprint: Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness, DOI: 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1
Featured image credit: The university and TAA negotiating an end to the 1970 strike (Wikimedia)
Note: This article gives the views of the author(s), and not the position of the LSE Impact blog, nor of the London School of Economics. Please review our Comments Policy if you have any concerns on posting a comment below.
Anne Boring is a research fellow at Sciences Po (OFCE-Presage) and a research affiliate at the University Paris Dauphine (LEDa-DIAL). Her main research interests include the study of gender biases and stereotypes in higher education. She also conducts research on interest groups, trade and development.
Kellie Ottoboni is a PhD student in the Statistics Department at the University of California, Berkeley and a fellow at the Berkeley Institute for Data Science. Her research interests include nonparametric statistics, causal inference, reproducibility, and applications in the health and social sciences.
Philip B. Stark is Professor of Statistics and Associate Dean of Mathematical and Physical Sciences at the University of California, Berkeley. He works primarily in uncertainty quantification, with applications to physical science, risk, natural disasters, elections, health, food security, litigation, and legislation.
This is indeed an important finding. Although as faculty we get a gut feeling of how a class is going, such feedback mechanisms seldom help. Great informative article and I hope this gets noticed.
Great article and nice to feel validated. Thank you!
Very informative and timely article! However, I cannot believe there isn’t a better way of finding a solution to the male/female teacher problem.
Well the bible teaches that only males are authoritative–hence we have a wee problem, Houston.
We don’t live in the Iron Age anymore. Typical example of prejudice here that needs to be eradicated. I wonder what the evaluation would have been if one (male or female) was African, Muslim or a born-again Christian!
It’s ironic, but the instructor I disliked the most in my freshman year of college (and who got very low “approval ratings” from students) turned out, in retrospect, to be one of my favourite teachers of all time. Teacher evaluations were an entirely new concept back then (1966-67). To be of use they really need to be done later. So often the evaluation process is more a popularity contest, which has little to do with teaching effectiveness.
I definitely wouldn’t call a correlation coefficient of < 0.18 "significantly correlated".
Agree. These correlation coefficients are extraordinarily weak, far weaker than I would have expected.
Akshay, you appear to be confusing significance and strength. It is a moderate correlation (not weak given the complexity of social variables–weak would be less than 0.10), but it is statistically significant.
The findings are very important and need to be followed up by qualitative inquiry to ascertain reasons for bias (sample surveys could help). But to suggest doing away with SET is akin to throwing the baby out with the bathwater. Teachers’ accountability to students cannot be diluted when students are being charged through the roof for getting higher education. In India where I live, such accountability does not exist. So I feel rather strongly about it.
These findings are very important, especially in the American context, where as much as 65% of all post-secondary instruction is now taught by non-tenure-track/adjunct faculty. These faculty are often hired, rehired, or let go based ONLY on student evaluation results – without a single classroom observation by other members of the department, or any other means of evaluating teaching effectiveness, entering into the decision. This means that student evaluations now control who actually conducts the lion’s share of higher education throughout the United States.
This is true in the Australian context. Good teachers have often had to “explain” or defend themselves in relation to offensive and even racist comments made on student evaluations. This has happened to an Aboriginal colleague of mine who was described as “lazy”. She happens to be one of the most hard-working academics I know. I perish at the thought of my teen at home evaluating me in a pique.
Student evaluations are worse than unreliable: they are significantly biased against good teachers (those who help their students progress):
https://hbr.org/2014/09/better-teachers-receive-worse-student-evaluations/
It is only natural: most students (like anybody) prefer not to work too much. And a good teacher demands some kind of sustained effort to help her students achieve their potential.
A good teacher isn’t one who makes a student work hard… a good teacher makes the student want to work hard.
Simply wanting to work hard is not enough – you also have to feel that you *need* to work hard, otherwise despite good intentions the hard working will not materialise.
The data from the online class randomized experiment is a bit strange – it uses the “perceived” gender of the instructor. I realize some online classes are completely asynchronous, but that is far from the only model (and far from the best in my view). I might go so far as to say that such courses are designed to appeal to a particular type of student – one that is interested in superficial learning. If so, then it is not surprising that the name of the instructor – and implied gender – has a significant impact on the ratings. I am not at all sure this result can be generalized, however.
Very interesting result. When I taught overseas, some nonwhite instructors felt that SET were also biased toward Caucasians, which is a reasonable hypothesis arising from some theories of racism. Not that we need more evidence that SET are a poor measure of teacher effectiveness, but the same test could be run with “perceived race/ethnicity”, and test that prediction. I would be interested to learn the results.
same results
So what’s new? The average male doesn’t change and hasn’t changed in any field, despite 200 years of feminism.
I think women psychologists and women social psychologists should apply themselves to the problem and look into male psychology. Has enough research been done on male psychology, especially by women researchers? I think not.
Did you notice how female students in certain countries were actually more likely than male students to score female instructors lower? Don’t be so quick to put blame entirely on men once again.
Too few instructors to make this an accurate finding. There are too many possible type II errors.
I see, so students can’t tell when they have a bad teacher. Parents can’t tell when they have a bad teacher either. Administrators who aren’t in class can’t tell when they have a bad teacher either. Thus there is no way to determine who is a bad teacher, and bad teachers need never be fired for poor teaching, right?
Why not instead reach the conclusion that for lack of reliable data no teacher should get lifetime tenure since they can’t demonstrate that they are any good? Just asking.
I think you missed the point of this – or the multiple studies cited, which showed that student evaluations of teachers do NOT correlate to student achievement, which would be a strong indicator of teacher effectiveness, when adjusted for student ability and achievement when entering the course versus when exiting the course.
Appropriate measures of teacher effectiveness would be student scores on a common exam, compared across sections, with students assigned to sections in a way that doesn’t stack better or weaker students in particular sections.
Another highly useful measure would be the performance of students who pass one course in subsequent courses that build on the content and skills of the first course. I have many students who come back to talk to me, a year or two after taking my classes, to thank me for being so demanding – because the knowledge and skills they gained in my class helped them succeed in later courses. At the time they were in my class, many of them didn’t appreciate me nearly as much.
They are likely able to tell a bad teacher, but from experience I know they can’t reliably tell who is a good teacher – especially not immediately, when evaluations start three weeks before the course is over, and always before they know their final grades for the course.
Students who are taking college courses for the first time, many in subjects they have never studied before, and sometimes in courses that are required of them, are unlikely to be able to evaluate the reasons behind the pedagogical decisions faculty make – and many of us, though not all, actually read and workshop the best teaching strategies in our fields. Often, however, they come to us years later to acknowledge what they learned and how our teaching contributed to their success in the later years of their studies.
All of this is then compounded by the stereotypes that students and others hold, which filter our daily judgments without us even being aware of it. This is why these studies are so important: they can steer us towards better ways to evaluate whether we are good teachers, and if our teaching strategies are shown to be inadequate, we will know how to improve them.
These SET questionnaires are not even designed by social scientists, who know how to ask the kinds of questions that would let us deduce which pedagogies, or just simple practices, are affecting an individual faculty member’s teaching in the long run. If designed differently, for instance, I would be able to look at the last five years and find trends in student feedback that indicate what is working and what is not. That would give each faculty member some basis for deciding what to do differently. So SETs are not even good for that purpose.
As is, the incentive is to care less, to inflate grades throughout the entire semester, and to fear trying new and targeted pedagogies that may actually have a positive impact.
As a bias researcher I find this article ironically amusing.
Really informative, but results should differ from culture to culture, because our societies have different norms and values demographically.
I stumbled upon an interesting test case which, while anecdotal, was provocative. I met a professor who transitioned from female to male; he said he did not change his teaching methodology at all, but students rated him higher than they had when he was perceived to be female.
The list of variables that determine good teaching is long and broad. Perhaps in this case, a sense of increased confidence and happiness was exuded following the transition which may also result in higher ratings.
I’m not a bias denier, but I was not convinced by the MacNell et al. (2014) study. It shares the experimental design flaws of numerous previous studies. In this case, the experimental unit should be the instructor, not the students, because the focus of inference is on differences in performance between faculty members of different genders. Consequently, the study is not replicated (n = 1 for each gender). In addition, while the design was clever and eliminated some confounding variables by putting the courses online, other potentially confounding variables were not accounted for, as acknowledged in this article.
The larger issue is that no study has found a relationship between student knowledge retention and SET, which should be what educators and administrators are most concerned about.
Oh, that is b.s. Try researching Ben Barres to find out why.
Welcome to the “corporate university …going forward” “we might not like the way it is, but the university sector has changed.” “we just have to tick the box and then get on with the real work” ….etc etc. And so the farce will play out in a spiral of desperately ambitious self-obsessed control-freak imagination-free managerial-spin diktat committee-led decline as the corporate university becomes a self-fulfilling prophecy – i.e. universities become pseudo-businesses rather than the essential places of learning, academic freedom and discovery they are meant to be. Student debt is now the master of what universities have become.
When Big Data fuels the New Big University online – and millions of online degrees from one top university are bought for £300 – what will the committees do then? They will form to decide how to sell off their buildings and make themselves redundant. Everyone else will be doing what they should be doing at a university – only it will all be online.
I once had a student who told me he did not evaluate me as highly because some of the students who sat around him complained about their grades – but that in fact, my course had been his favorite course thus far …
How does a professor defend herself against these types of rationalizations? Would he have rationalized it the same way if the professor had been a man? Would he have placed the benefit of the doubt on the faculty member rather than the complaining students, when he had no idea what the quality of their work was?
We are talking about SET that have a direct impact on the economic stability of my family, whether in terms of merit raises or keeping one’s job.
SETs do not have a problem at all. Students are free to give whatever feedback they have. The main issue is how the feedback is ultimately understood and used. At Botho University (Botswana), we use a Teaching Performance Index (TPI), a composite of multiple measures of teaching effectiveness. This helps us reduce over-reliance on SETs.
Can you please share the format of your evaluation? Thanks.
I have taught at several universities in many parts of the USA for 40 years. I have proof that my teaching has been highly rated by both males and females (of all nationalities), supported by numerous strongly positive comments.
Based on the above, if males and females of all nationalities and walks of life are objective in their SET of a male teacher, then why in the world should they be biased in their SET of female teachers?!
My personal experience – from my teaching and from observing female teachers’ SET over the past 40 years – tells me that there is no significant proof that students from all walks of life rate female teachers lower just because they are female. If you are a good and caring effective teacher, students will always give you – in the long run – high SET, regardless of your gender, race or any other factor.
There should be additional studies involving appropriate factors in appropriate experimental designs.
How are positive comments on evals proof? Are other measures being taken to assess your teaching effectiveness? Is it possible that students are rating you highly because of likeability and charisma, and you also happen to be a good teacher, while someone with the same likeability who isn’t as effective a teacher gets the same or higher evals? As for why women might not be evaluated fairly… are you really that confused about this? I’ve had a student flat out say in the first few weeks that he didn’t trust me on academic matters because he doesn’t trust women… what exactly am I supposed to do with that? I also find it hard to believe that you don’t know at least a handful of instructors who brag about how little time they spend prepping, or who neglect office hours and student emails, or who grade easily, and who always get really high evals.
I know this will be frowned upon, but… do we ask a dog to evaluate its “dog trainer”? Well? No doubt students want to become rich and famous, while instructors flatter themselves into believing they have some kind of truth to transmit. Total incompatibility, then.
Does anyone have a view about the use of the mean and standard deviation in this context, which for many institutions underlies their analysis of such ordinal data?
“We find no evidence that male teachers are more effective than female teachers.” – my intuition agrees, though student evaluations of teachers don’t rely that much on effectiveness. Many people I know, including myself, would have rated highly effective teachers quite lowly at the time I took certain classes, despite having performed well at the final examination.
The conclusion here is that students rating teachers are biased on their evaluation based on gender. I wouldn’t argue with that though I’ll say that these ratings are subjective by definition. Saying that students should rate teachers based solely on perceived effectiveness would mean that the final grade should be the only thing that matters. You say “the teacher was effective because you got a good grade, therefore you learned the subject well enough under his/her tutelage”.
But if you ask a human being to evaluate, they will look at how well they interacted, the interpersonal relationship built, the teacher’s own bias perceived by the student, etc.
In my experience, France has a severe gender imbalance in primary and secondary education, with a very small proportion of male teachers throughout, more so in primary than secondary – the data validate this: https://data.worldbank.org/indicator/SE.PRM.TCHR.FE.ZS?end=2013&locations=FR&start=1975&view=chart
http://ec.europa.eu/eurostat/documents/2995521/7672738/3-04102016-BP-EN.pdf/9f0d2d04-211a-487d-87c3-0a5f7d6b22ce
So does most of Europe / World according to that data.
It seems very likely that male teachers are perceived better due to their rarity, or graded better, whether actively or subconsciously, as a way of promoting the rebalancing of this gender imbalance.
The biases by the time the students reach tertiary education are probably ingrained even if the tertiary education sector is either more balanced or more male dominated.
To caricature the situation, the gender imbalance tells children that female teachers are good at teaching the basics – i.e. to read and count – but that if you want to explore higher levels of knowledge in a field, you need to ask a male teacher. This is absolutely wrong – the systemic imbalance needs to be fixed to stop such biases being set.
At the end of the day, as many of the above comments have highlighted, SET data shouldn’t be used in isolation; they must be combined with student performance / teaching effectiveness data, as it is well known that effective learning can correlate with a negative student experience and would therefore lead to lower SET scores.
I was nodding my head until the very end (author bios). Is the research really reliable if all 3 of its designers/authors would benefit from the cancellation or invalidation of SET?
The article proposes three alternatives, of which two are peer reviews and one is feedback from former students. Is this again academics seeing themselves as oh so special and above everyone else, so that only they can evaluate themselves?
I’m sure peer-reviewing teaching quality would end up being colleagues covering for each other because they all want to spend more time on research. It’s like police internal investigations of police brutality – never their fault.
The curse of academia is exactly this: we’re the ones who are supposed to research everything, including ourselves…