Throughout its history, psychology has faced fundamental crises that all revolve around its disciplinary rigour. Current debates – conducted in Nature, Science and high-ranking psychology journals – focus on the frequent lack of replicability of many psychological findings. New research led by Jana Uher highlights methodological limitations of the widely used questionnaire methods. These limitations may be a major source of the lack of replicability in psychology.
In previous debates, the causes of psychology’s replication problem were sought in the technical details of the statistical methods used for data analysis, the samples studied, the lack of transparency in research practices, or publication biases. But in my recent research, I have highlighted a more fundamental cause that has not yet been widely considered: the methods used in psychological research to generate the data in the first place.
In psychology and the social sciences more generally, standardised assessment methods such as rating scales are widely used for generating quantitative data. But despite their popularity, these methods contain serious methodological limitations that inevitably compromise the replicability of findings.
In recent research, funded by the German Science Foundation (DFG) and published in the April issue of the Journal of Research in Personality, I pinpointed essential methodological differences between assessments and observations by exploring the elementary question: What demands do the different methods place on the persons who generate the data?
Behaviours are public phenomena; the same act can therefore be perceived by multiple observers. This allows observers to be trained to categorise behaviours in the same ways and standardised rules to be established for encoding the observed events into data. Challenges arise because behaviours can be observed only in the moments in which they occur—in other words, in real time. But once observers have recorded what they have just observed, their job in generating the data is complete.
Assessors, by contrast, must accomplish an entirely different task. To judge someone, raters must draw on experiences they have gathered in the past and must compare individuals with other individuals and over time. Ratings are therefore inherently memory-based and retrospective. But memory recall is known to be fallible and biased in many ways.
Further differences lie in the ways in which the data are encoded and the schemes used for doing so. Observational categories are explicitly defined, and observers are trained to understand and use them in the same ways. By contrast, the questions and answer categories of questionnaires are worded in everyday language, and commonly no training occurs. Instead, raters must use their common-sense knowledge to interpret the meaning of the categories used for generating the data. But everyday concepts are fuzzy and context-sensitive. What people specifically have in mind when filling in a given questionnaire, and how they assign their ideas to the categories marked on the scale, is never ascertained.
But the thought processes involved in assessments are highly complex. For example, to assess how “often” somebody tends to do something, raters must consider that some behaviours generally occur more often than others. Thus, to be judged as occurring “often”, some behaviours must occur far more frequently than others. This means that the same answer category encodes different information! Given this, what do the quantitative data derived from questionnaire markings actually signify?
I find it astonishing and alarming that even after a century of research in which questionnaire methods have been used intensely, we still do not know much about what is going on in people’s minds when they are generating assessments on a rating scale. Scientific quantification depends on the exact specification of what is to be measured and of how this is encoded into data. But questionnaire assessments do not fulfil these basic requirements. With such methods, how could replicability ever be achieved?
To illustrate these methodological limitations, I have conducted a five-method study in collaboration with Elisabetta Visalberghi from the Institute of Cognitive Sciences and Technologies, National Research Council of Italy (ISTC-CNR) in Rome. The study showed that, compared to observations, assessments contained numerous biases derived from raters’ stereotypical beliefs about individuals.
It also showed that raters’ interpretations of the standardised survey questions varied considerably because they considered diverse contexts that cast the same question in a different light. As a consequence, raters’ interpretations overlapped with those intended by the researchers by only 54–77%! Thus, contrary to beliefs widespread in psychometrics and the social sciences, standardised questionnaire items do not reflect standardised meanings, and standardised rating items therefore do not enable repeated measurements of the same idea.
These profound methodological differences between assessments and observations have not previously been well considered. They offer a novel approach to explaining the frequent lack of replicability of findings in psychology.
This piece is based on a recently published journal article: Uher, J., & Visalberghi, E. (2016). Observations versus assessments of personality: A five-method multi-species study reveals numerous biases in ratings and methodological limitations of standardised assessments. Journal of Research in Personality, 61, 61-79. http://www.sciencedirect.com/science/article/pii/S0092656616300058
Note: This article gives the views of the authors, and not the position of the LSE Impact blog, nor of the London School of Economics. Please review our Comments Policy if you have any concerns on posting a comment below.
Jana Uher is a Senior Research Fellow and Marie Curie Fellow at the London School of Economics and Political Science. She has developed the Transdisciplinary Philosophy-of-Science Paradigm for Research on Individuals (TPS-Paradigm), a novel paradigm aimed at making explicit and scrutinising the philosophical, metatheoretical and methodological foundations of psychological, behavioural and social-science research on individuals. Jana publishes widely on topics in theoretical psychology, philosophy of science, methodology, behavioural research, biology, primatology, evolutionary research, personality psychology, social psychology and cognitive sciences. @ResIndividuals
Dear Dr. Uher,
I am firmly in your camp when it comes to the assessment of the questionable validity of questionnaires (although I’d add that there are some uniquely human phenomena that can probably ONLY be studied through self-report — the goals that people set and pursue would be an example). And I have read your JRP paper with a sense of constant inner cheering. This is truly great and meticulous work!
But I think you underestimate the role of questionnaire research for “replicability”. Research based on self-report is wonderful for replicability, albeit only of the type that Feynman called cargo-cult science in his famous commencement speech. Entire bodies of literature have been erected in personality psychology, but also in social psychology, based on the high replicability of findings obtained by correlating questionnaires with other questionnaires. Or, as McClelland once put it, “questionnaires predict questionnaires — so what else is new?” That’s the perverse raison d’être of this type of research: anyone with a semi-intact sense of semantics can guess what kinds of questionnaire items are likely to converge with what kinds of other questionnaire items. Extraversion with positive affect? Check! Openness to experience with self-reported creativity and intellectual interests? Check! Grit and perseverance? Almost tautological. Identification with one’s goals and life satisfaction? Check! Never mind the criticisms of this approach that Freud, Gazzaniga, Nisbett & Wilson, Kagan, and many others have offered. The business model of entire journals (like Personality and Individual Differences) thrives on the questionnaires-predict-questionnaires principle.
So what the camp of researchers tied to this principle will immediately reply is: “What replication crisis? We can replicate our core findings really well!” And they would, of course, be right. Beliefs about behavior, and the relationships between such beliefs, can be replicated rather easily. But only in the parallel, cargo-cult world of self-report psychology where all measures have an incestuous relationship with each other, tied to each other by whatever semantic overlap exists between ideas and beliefs in the heads of research respondents. They tap what Gazzaniga has termed the left-hemispheric interpreter, a language-based brain system dedicated to constructing and maintaining a coherent sense of meaning, but mostly cut off from any direct access to the actual generators of behavior, be it in ourselves or others.
In the world of observational and experimental research (i.e., research that bypasses what people think they’re doing and that goes for the jugular of actual behavior), replication is much more difficult for a number of reasons, both inherent in the research itself and in the manifold ways that researchers can fool themselves and others (see the debate raging for the past 5 or so years).
Self-report research is in a sense insulated against these difficulties and debates. But not because it somehow aspires to and attains higher standards of measurement and replication; rather, because in most instances it plays a different game altogether, one that Feynman characterized pretty well in his commencement speech.
Best regards,
Oliver Schultheiss