For many, Google Scholar has become a critical piece of research infrastructure. Yet revelations about the manipulability of its metrics and its inclusion of AI-generated papers have led some to ask whether it is still functional. Kirsten Elliott argues that, rather than showing the platform to be broken, these issues reflect the limitations of any academic search tool, but for those done with the platform there are alternatives.
For those who haven’t ventured into the ‘other place’, this post builds on some threads posted on BlueSky, in response to discussions on that platform about Google Scholar’s “brokenness.”
The question of whether Google Scholar is broken has the obvious answer of “It depends”: on what it’s being used for, how it’s being used, and what alternatives are available.
Google Scholar has advantages over traditional academic databases like Scopus and Web of Science: it’s free to use, requires no log in for searching, and has more comprehensive coverage, especially of non-journal sources such as books and theses. These benefits are particularly important for unaffiliated scholars without institutional access to resources, and those in the humanities.
Google Scholar is used for many different kinds of academic information-seeking: finding the full text of an article, exploratory searches on a broad topic, forwards citation chasing (i.e. looking at where a publication has been cited), finding citation metrics to demonstrate research impact, and even systematic review searching. For each of these purposes there are different criteria for whether it is the best tool, or even appropriate to use at all.
However, there are downsides to Google Scholar. Where most other academic databases have inclusion criteria for what will and will not be indexed, typically at journal level, Google Scholar relies on web scraping. Publications excluded elsewhere on the grounds of poor quality or integrity concerns are likely to be picked up by Google Scholar. Even when there is clear evidence of citation manipulation, papers are not removed, as shown by the case of Larry the Cat and his impressive h-index. As AI-generated publications proliferate, Google Scholar is particularly vulnerable to being swamped by fake research.
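For readers unfamiliar with the metric, the h-index is the largest number h such that an author has h publications with at least h citations each. The short Python sketch below uses invented citation counts (they are not taken from any real profile) to show why a small batch of fabricated citing papers can inflate it so easily.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical author: five papers with modest genuine citation counts.
genuine = [3, 2, 1, 0, 0]

# The same five papers after ten fabricated papers each cite all of them.
manipulated = [count + 10 for count in genuine]

print(h_index(genuine))      # 2
print(h_index(manipulated))  # 5
```

Because an index built by web scraping counts any citing document it finds, ten low-effort uploads are enough in this toy case to more than double the score.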
Another key difference from most academic databases is that Google Scholar, like Google, ranks results. The algorithm for doing so is not transparent – studies have attempted to reverse engineer it, but they become dated very quickly. The ranking is probably based on a combination of the number of citations, the number of times the searched words appear in the title and full text, and date, with more recent research appearing higher. Many users of Google Scholar look only at the first few pages of results, as there are diminishing returns in looking beyond that. Doing so may exacerbate the Matthew Effect, in which highly cited works are more likely to accrue future citations, and may reinforce the bias towards English-language publications.
The ranking algorithm can produce unexpected results, like a foundational work in a discipline disappearing from the first page, or a dissertation appearing surprisingly high. Google Scholar searches are not consistently reproducible – anecdotally, this undermines trust in its results and creates a perception of brokenness. There is no perfect version of the algorithm that presents the “best” results for all possible searches, because what “best” means varies by purpose and discipline. Finding the most recent publications is far more important in medicine than in the humanities, for example.
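To make the point about “best” concrete, here is a toy Python sketch. It is emphatically not Google Scholar's algorithm, which is not public: the scoring formula, the weights and the three example records are all invented for illustration. It simply combines the three signals mentioned above (citations, search-term matches and recency) and shows that changing the weights reorders the same set of results.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    citations: int
    year: int
    term_hits: int  # times the search terms appear in title and full text


def score(paper, w_cites, w_hits, w_recency, current_year=2025):
    """Hypothetical relevance score: a weighted sum of three signals."""
    age = max(current_year - paper.year, 0)
    recency = 1 / (1 + age)  # 1.0 for this year's papers, shrinking with age
    return (w_cites * paper.citations
            + w_hits * paper.term_hits
            + w_recency * recency)


def rank(papers, **weights):
    """Return titles ordered from highest to lowest score."""
    ordered = sorted(papers, key=lambda p: score(p, **weights), reverse=True)
    return [p.title for p in ordered]


# Invented records, not real publications.
papers = [
    Paper("Foundational monograph", citations=900, year=1985, term_hits=4),
    Paper("Recent clinical trial", citations=40, year=2024, term_hits=6),
    Paper("Unpublished dissertation", citations=2, year=2023, term_hits=30),
]

citation_heavy = rank(papers, w_cites=1.0, w_hits=1.0, w_recency=10.0)
recency_heavy = rank(papers, w_cites=0.01, w_hits=0.1, w_recency=100.0)

print(citation_heavy)
print(recency_heavy)
```

With citation-heavy weights the monograph comes first; with recency-heavy weights it drops to last place and the dissertation overtakes it. Neither ordering is wrong: a humanities scholar might prefer the first and a clinician the second.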
One way to mitigate some of the problems with Google Scholar is to search it through the Publish or Perish software rather than directly. Doing so allows the exact search terms to be saved, so searches can be accurately repeated at a later date. There are also options to sort by alternatives to the default Google ranking, including number of citations and date.
On a more philosophical level, there are objections to the lack of transparency in the data used and presented by Google Scholar. Key stakeholders in research such as universities and funders are increasingly advocating for open research. Whilst efforts so far have focussed primarily on openness of publications, the recent Barcelona Declaration applies the principles of openness to information about research, with signatories committing to “work with services and systems that support and enable open research information.” Google Scholar cannot meaningfully be said to do so given the opacity of the processes for inclusion and ranking of research outputs. The closed proprietary systems run by large profit-making companies like Elsevier and Clarivate clearly do not meet this criterion either.
Another point in the Barcelona Declaration is the “sustainability of infrastructures,” and that is a key concern for Google Scholar. It’s unclear what the long-term funding model is, and whether the service will be maintained in the future, meaning it might be a wise choice to explore other options.
There are alternatives to Google Scholar which operate from an open research ethos and are free to use. Three prominent alternatives are The Lens, Matilda and OpenAlex.
The one I’ve used most is OpenAlex. One study has found it to have comparable coverage to Web of Science and Scopus, and my own limited testing found significantly more publications indexed from social sciences and humanities subjects. Their code is fully open, and the data is reusable. The system has thorough documentation and, in my experience, the OpenAlex team are responsive to feedback. The levels of transparency and engagement with the academic community are significant advantages over Google Scholar. OpenAlex is still relatively new, and the data is not perfect. The author disambiguation process, for example, struggles with authors like myself who have published across disciplines.
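One practical consequence of that openness is that a search can be written down and repeated exactly. The Python sketch below builds a query URL for the OpenAlex works endpoint with an explicit sort order. The parameter names (search, sort, per-page, mailto) reflect my understanding of the public OpenAlex API documentation and should be checked against the current version; the helper function names and the example search terms are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_query_url(terms, sort="cited_by_count:desc", per_page=25, mailto=None):
    """Build a repeatable OpenAlex works query with an explicit sort order."""
    params = {"search": terms, "sort": sort, "per-page": per_page}
    if mailto:
        # Optional contact address, as I understand the API docs to suggest.
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':')}"


def fetch_titles(url):
    """Fetch one page of results and return the work titles (needs network)."""
    with urlopen(url) as response:
        data = json.load(response)
    return [work.get("display_name") for work in data.get("results", [])]


# Saving this URL records the exact terms and sort order used.
url = build_query_url("critical information literacy",
                      sort="publication_date:desc")
print(url)
```

Calling fetch_titles(url) would retrieve the first page of matching titles. Keeping the URL alongside the date of the search gives a record that someone else can rerun, which is harder to achieve with a ranked Google Scholar results page.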
What does all this mean moving forward? For researchers, there is value in reflecting on current searching practices and whether Google Scholar is still the best option for their purposes, given the caveats above, and bearing in mind the limitations and biases of the other systems available. I encourage academic librarians like myself to explore open research information systems and to support the development of critical information literacy in our library users. That means incorporating into teaching about search tools an understanding of how the systems we use to find and access information are created and funded, and how that shapes the results.
The content generated on this blog is for information purposes only. This Article gives the views and opinions of the authors and does not reflect the views and opinions of the Impact of Social Science blog (the blog), nor of the London School of Economics and Political Science. Please review our comments policy if you have any concerns on posting a comment below.
Image Credit: sirtravelalot on Shutterstock.
Excellent review – decidedly useful. I’m pleased to have encountered your article. Thank you for giving us your views on this matter.
And what about ScholarGPS ?
https://scholargps.com/
It’s not a database I’ve come across before so I don’t know much about it, but from a quick look through the website I’m not sure its approach is aligned with the principle of open research information e.g. the mentions of proprietary algorithms.
Thanks a lot for this insightful blog post. I fully agree with all arguments except one: the funding model is not a problem. G Scholar is the only Google service with no data tracking (hence no GDPR page before using it), so it probably brings no revenue for Google and costs “a lot.” G Scholar is a typical example of a prestige economy service: Google sees a political/reputation value in running it, independently of its cost.
That means that a crucial research infrastructure depends solely on the will of a for-profit company over which research communities have no control. And in the past, we have seen Google shutting off entire services in just months.
This is why we should build and maintain our own infrastructures for such crucial services.
Best
Didier Torny
Matilda scientific director
I think we are in fact in agreement about this! The uncertainty over long-term funding of Google Scholar is one of my concerns about academia’s reliance on it.
You should also be looking at ScienceOpen – with over 100 million records including articles, books, chapters, conference papers, datasets and more – and it is FREE to use.
Thanks, this is another one I’ve not used before! I’d be interested to know what the coverage is like for the humanities and social sciences.
Removal of services does happen — recall Microsoft Academic (search)? Though at least OpenAlex took on some of their data, I believe.
In my view, Google Scholar isn’t broken, but the latest tools like Undermind.ai show that a lot more innovation is possible with the latest generation of ML/AI, for example by improving relevancy.