Many bibliometricians and university administrators remain wary of Google Scholar citation data, preferring “the gold standard” of Web of Science instead. Anne-Wil Harzing, who developed the Publish or Perish software that uses Google Scholar data, here sets out to challenge some of the misconceptions about this data source and explain why it offers a serious alternative to Web of Science. Not only have Google Scholar’s flaws been overstated, but its coverage of high-quality publications is more comprehensive in many areas, including the social sciences and humanities, books and book chapters, conference proceedings, and non-English-language publications.
Publish or Perish uses Google Scholar as one of its data sources (the other being Microsoft Academic). Many bibliometricians and university administrators are fairly conservative in their approach to citation analysis. It is not unusual to see them prefer the Web of Science (ISI for short) as “the gold standard” and dismiss Google Scholar out of hand, simply because they have heard some wild-west stories about its “overly generous” coverage. These stories are typically based on one or more of the following misconceptions, which I will dispute below.
- First, the impression that everything “on the web” citing an academic’s work counts as a citation.
- Second, the assumption that any publication that is not listed in the Web of Science is not worth considering at all.
- Third, a general impression that citation counts in Google Scholar are completely unreliable.
Not everything published on the internet counts in Google Scholar
Some academics are under the misplaced impression that anything posted on the internet that includes references will be counted in Google Scholar. This might also be the source of the misconception that one can simply put phantom papers online to improve one’s citation count. However, Google Scholar only indexes scholarly publications. As its website indicates: “we work with publishers of scholarly information to index peer-reviewed papers, theses, preprints, abstracts, and technical reports from all disciplines of research.”
Some non-scholarly citations, such as those in student handbooks, library guides or editorial notes, do slip through. However, incidental problems of this kind are unlikely to distort citation metrics, especially robust ones such as the h-index. Hence, although Google Scholar might somewhat overestimate citation counts by including some non-scholarly citations, for many disciplines this is preferable to the very significant and systematic underestimation of scholarly citations in ISI or Scopus. Moreover, as long as one compares like with like, i.e. compares citation records drawn from the same data source, this should not be a problem at all.
Non-ISI listed publications can be high-quality publications
There is also a frequent assumption amongst research administrators that ISI listing is a stamp of quality and that, hence, one should ignore non-ISI listed publications and citations. There are two problems with this assumption. First, ISI has a bias towards science, English-language, and North American journals. Second, ISI completely ignores the vast majority of publications in the social sciences and humanities.
- ISI journal listing is very incomplete in the social sciences and humanities: ISI’s listing of journals is much more comprehensive in the sciences than in the social sciences and humanities. Butler (2006) analysed the distribution of publication output by field for Australian universities between 1999 and 2001. She found that whereas between 69.3% and 84.6% of publications in the chemical, biological, physical and medical/health sciences appeared in ISI-listed journals, this was the case for only 4.4%-18.7% of publications in social sciences such as management, history, education and the arts. Many high-quality journals in economics and business are not ISI listed. Only 30%-40% of the journals in accounting, marketing, and general management and strategy listed on my Journal Quality List (already a pretty selective list) are ISI listed. There is no doubt that, on average, journals that are ISI listed are perceived to be of higher quality. However, there is a very substantial number of non-ISI indexed journals that have a higher-than-average h-index.
- ISI has very limited coverage of non-journal publications: even in its cited reference search, ISI only includes citations made in ISI-listed journals, and in its general search function it completely ignores any publications that did not appear in ISI-listed journals. As a result, the vast majority of publications and citations in the social sciences and humanities, as well as in engineering and computer science, are ignored. In the social sciences and humanities this is mainly caused by a complete neglect of books, book chapters, publications in languages other than English, and publications in non-ISI listed journals. In engineering and computer science, it is mostly caused by a neglect of conference proceedings. ISI has recently introduced conference proceedings into its database. However, it does not provide any details of which conferences are covered beyond listing some of the disciplines involved, and I was unable to find any of my own publications in conference proceedings. As a result, ISI very seriously underestimates both the number of publications and the number of citations for academics in the social sciences and humanities and in engineering and computer science.
Google Scholar’s flaws have been played up far too much
Peter Jacsó, a prominent academic in information and library science, has published several rather critical articles about Google Scholar (e.g. Jacsó, 2006a and 2006b). When confronted with titles such as “Dubious hit counts and cuckoo’s eggs” and “Deflated, inflated and phantom citation counts”, deans, academic administrators and tenure/promotion committees could be excused for assuming Google Scholar provides unreliable data.
However, the bulk of Jacsó’s critique is levelled at Google Scholar’s inconsistent number of results for keyword searches, which are not at all relevant for the author and journal impact searches that most academics use Publish or Perish for. For these types of searches, the following caveats are important.
- Citation metrics are robust and insensitive to occasional errors: most of the metrics used in Publish or Perish are fairly robust, because occasional errors will not generally change the h-index or g-index and will have only a minor impact on the number of citations per paper (see the sketch after this list). There is no doubt that Google Scholar’s automatic parsing occasionally provides us with nonsensical results. However, these errors do not appear to be as frequent or as important as Jacsó’s articles imply, and they do not generally affect the results of author or journal queries much, if at all.
- Google Scholar parsing has improved significantly: Google Scholar has also significantly improved its parsing since these errors were pointed out. However, many academics still refer to Jacsó’s 2006 articles as convincing arguments against any use of Google Scholar. I would argue this is inappropriate. As academics, we are all too well aware that all of our research results include a certain error margin. We cannot expect citation data to be any different.
- Google Scholar errors are random rather than systematic: what is most important is that errors are random rather than systematic. I have no reason to believe that the Google Scholar errors identified in Jacsó’s articles are anything other than random. Hence, they will not normally advantage or disadvantage individual academics or journals.
- ISI and Scopus have systematic errors of coverage: in contrast, commercial databases such as ISI and Scopus have systematic errors, as they do not include many journals in the social sciences and humanities, nor do they have good coverage of conference proceedings, books or book chapters. Therefore, although it is always a good idea to use multiple data sources, rejecting Google Scholar out of hand because of presumed parsing errors is not rational. Nor is presuming ISI is error-free simply because it charges high subscription fees.
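To make the robustness argument concrete, here is a minimal Python sketch (the citation counts are invented for illustration; this is not Publish or Perish code) showing that a single spurious citation, of the kind a parsing error might produce, typically leaves both the h-index and the g-index unchanged.

```python
# A minimal sketch (not Publish or Perish code): computing the h-index and
# g-index from per-paper citation counts, to illustrate why one stray
# "cuckoo's egg" citation rarely changes either metric.

def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def g_index(citations):
    """Largest g such that the top g papers have at least g^2 citations in total."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

# Hypothetical citation record for ten papers; the counts are invented.
record = [48, 33, 21, 14, 9, 7, 4, 2, 1, 0]

# Inject one spurious citation on the fifth paper, mimicking an occasional
# Google Scholar parsing error.
noisy = record.copy()
noisy[4] += 1

print(h_index(record), h_index(noisy))  # 6 6  -> h-index unchanged
print(g_index(record), g_index(noisy))  # 10 10 -> g-index unchanged
```

Because both indices depend on where ranked citation counts cross a threshold rather than on raw totals, a miscounted citation changes them only in the rare case where a paper sits exactly at that threshold.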
Conclusion
As I have argued in the past, Google Scholar and Publish or Perish have democratised citation analysis. Rather than leaving it in the hands of those with access to commercial databases with high subscription fees, anyone with a computer and internet access can now run their own analyses. If you’d like to know more about this, have a look at this presentation.
This blog post originally appeared on the author’s personal website and is republished here with permission. Copyright © 2017 Anne-Wil Harzing.
Featured image credit: Viele bunte Bälle by Maret Hosemann (licensed under a CC BY 2.0 license).
Note: This article gives the views of the author, and not the position of the LSE Impact Blog, nor of the London School of Economics. Please review our comments policy if you have any concerns on posting a comment below.
I think that Google is merely showing the Academy what can be done, and they have not even monetised it. Since universities and colleges constitute by far the vast majority of scholarly publishing content contributors and consumers, they should develop their own peer-reviewed platform. What is stopping the Academy from launching such a platform? Personally, I think that they are capable of this and even more.
Google is not accessible in China. Any standard based on this will exclude a large fraction of non-English speaking scientists from using the tool.
Very few international sites are accessible in China. You cannot take China as a reference. Also, a lot of Chinese researchers publish in English, if they want international readers to access their work …
Now really, Anon. Google is blocked in China. It cannot be used to measure any performance, no matter what language papers are written in, where they are published, or however many citations they have collected. That’s the point. I am surprised this is so difficult to understand.
Besides, your claim that ‘very few international sites are accessible in China’ is plain nonsense. How, for example, do you think I found this site?
The point about the ISI is certainly true. Some stellar work in the social sciences is not covered. It is, therefore, a poor measure. Pretty much everybody I know uses Google Scholar – particularly in the hope of finding a non-firewalled copy of the paper or chapter they need. Because the social sciences lag so far behind in many aspects of journal publication (selecting ethical outlets; prioritizing open access; avoiding APCs when they do publish OA), I have produced a listing of cheap or free and reputable journals in 6 fields of social science, with a more generic social science list as well. A lot is at stake if we do not act to reduce pressures on library budgets, by ‘taking back’ publishing. https://simonbatterbury.wordpress.com/2015/10/25/list-of-open-access-journals/
As a software engineer, I find it quite odd to label Google’s errors as “random”. Any parsing or other defect will lead to errors just as systematic as the omission of data sources in ISI.
Web of Science (referred to here also as ISI), is a core offering of Clarivate Analytics. On behalf of Clarivate Analytics, here are the facts.
Claim: ISI has a bias towards science, English-language and North American journals.
– FACT: Web of Science covers journals, conference proceedings, books, datasets and patents across sciences, social sciences, arts, and humanities, with coverage dating back to 1900.
– FACT: The Web of Science is a global collection of over 33,000 journals, of which 25% are North American based. Core global coverage is enhanced by regional citation indexes within the Web of Science, developed in partnership with leading research bodies such as the Chinese Academy of Sciences, which provide a complete global picture of new insights from research in China, Russia/CIS, South Korea, Latin America, Spain, Portugal, the Caribbean and South Africa.
– FACT: Web of Science indexes journals originally published in over 53 different languages. Approximately 25% of items in the Web of Science Core Collection are from journals that are either dual-language or not originally published in English, a figure that increases to over 40% in the complete Web of Science.
Claim: ISI completely ignores a vast majority of publications in the social sciences and humanities
– FACT: Web of Science Core Collection comprises the Science Citation Index Expanded (SCIE), the Social Sciences Citation Index (SSCI), the Arts & Humanities Citation Index (AHCI), and the Emerging Sources Citation Index (ESCI) including a comprehensive back-file and cited reference data from 1900 to the present across 55 disciplines. SCIE is a carefully selected and evaluated collection of 8,892 high impact journals, SSCI is a collection of 3,250 high impact journals, AHCI indexes 1,780 arts and humanities journals, and ESCI adds an additional 5,721 journals, with 53% of these Emerging Sources journals in the Social Sciences, Arts & Humanities (all counts as of March 2017).
– FACT: The Web of Science includes over 190,000 conference proceedings and 80,000 editorially selected books fully covering subjects such as computer science as well as the social sciences, arts and humanities (as of March 2017).
Claim: ISI has systematic errors of coverage
– FACT: The Web of Science Core Collection maintains clearly established journal subject categorizations; fairly assigns “document types” to all publications; and has a clearly defined scope for inclusion of any journal in the index. These standards are derived from rigorous editorial policies that have been accepted by the library community, the research community, the publishing industry, and bibliometricians worldwide.
– FACT: These characteristics make it possible to use the Web of Science Core Collection as a source for normalizing citation performance, provide reliable benchmarks for establishing the citation performance of journals and individual articles, and meet the criteria for an accepted source for citation analysis as described in the Leiden Manifesto for Research Metrics: http://www.nature.com/news/bibliometrics-the-leiden-manifesto-for-research-metrics-1.17351
To ensure that researchers have access to current information about the Web of Science, anyone who would like to understand more about its content, or to verify facts about our coverage, editorial policies or analytics, should contact Marian Hollingsworth, Director of Publisher Relations, at marian.hollingsworth@clarivate.com
All great, but Scopus almost always produces more citations to an individual article than WoS, and Google Scholar finds even more, even if some may be a bit sketchy at the margins. This is the case for my own articles – there is a huge discrepancy for one published in Science. Why is that? Because WoS tracks citations from the “carefully selected and evaluated collection”, which – unless I am mistaken – naturally leaves out many journals that have not made it into the WoS (and ones like ACME and Spatial Justice / Justice Spatiale that refuse indexing). This is the nature of exclusivity, unfortunately, but it is a major reason to distrust the WoS as a definitive citation guide. A lot of the cutting-edge, seat-of-the-pants journals that are really travelling well (many listed on my webpage above) just don’t count in the WoS – they don’t exist there. The ESCI was presumably an effort to rectify this, but I am unsure how that helps those journals, since they don’t get an Impact Factor. I run a journal in the ESCI, and Scopus already gave it a CiteScore of 1.65; the CiteScore page is available free of charge as well. The WoS one is not.
It’s a pity that both the article and the most detailed criticism of it come from people with vested interests, namely in services that rely on Google Scholar and on WoS respectively. My own experience, as a bibliometrics researcher, is that GS is much less reliable than WoS, but I last undertook research in the area some years ago, so things might have changed since.
I can understand why there is an expectation for scholars to publish in good journals, but I do not understand why our citations have to come from a “carefully selected source”. If we are to be measured by our research impact, why is a journalist citing my paper to spread my findings to those who may find them useful inferior to another academic who may not even read my article but cites it simply because my paper appears in the same journal and he wants to please the editor?