As the COVID-19 pandemic unfolds, researchers from all disciplines are coming together and contributing their expertise. CORD-19, a dataset of COVID-19 and coronavirus publications, has been made available alongside calls to help mine the information it contains and to create tools to search it more effectively. We analyse the delineation of the publications included in CORD-19 from a scientometric perspective. Based on a comparison to the Web of Science database, we find that CORD-19 provides an almost complete coverage of research on COVID-19 and coronaviruses. CORD-19 contains not only research that deals directly with COVID-19 and coronaviruses, but also research on viruses in general. Publications from CORD-19 focus mostly on a few well-defined research areas, in particular: coronaviruses (primarily SARS-CoV, MERS-CoV and SARS-CoV-2); public health and viral epidemics; molecular biology of viruses; influenza and other families of viruses; immunology and antivirals; clinical medicine. CORD-19 publications that appeared in 2020, especially editorials and letters, are disproportionately popular on social media. While we fully endorse the CORD-19 initiative, it is important to be aware that CORD-19 extends beyond research on COVID-19 and coronaviruses.
Modern natural language processing techniques have given rise to embedding techniques that can represent documents based on their content or context, and several papers have operationalized these to perform bibliometric tasks. The relationship between these embeddings and conventional citation based or title and abstract based mappings remains unclear. Contrary to citation-based or term-based relatedness, embedding-based relatedness is not immediately interpretable. We consider four embedding-derived publication relatedness measures, based on: 1) word2vec embeddings of citation labels, sentence embeddings using 2) BERT and 3) SciBERT, and 4) title and abstract embeddings using SPECTER, and compare them with conventional bibliometric publication relatedness measures derived from citation relations and title and abstract noun phrases. We show that there is stronger overlap between these embedding-derived relatedness measures and citation-based relatedness than with title and abstract noun phrase-based relatedness, and that embedding-derived relatedness measures outperform conventional techniques when used to cluster publications cited with the same citation intent.
Waltman, L.; Boyack, K.W.; Colavizza, G.; Eck, N.J. van 2020
There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an important consistency property. The empirical analyses that we present are based on publications in the fields of cell biology, condensed matter physics, and economics. Using the BM25 text-based relatedness measure as the evaluation criterion, we find that bibliographic coupling relations yield more accurate clustering solutions than direct citation relations and cocitation relations. The so-called extended direct citation approach performs similarly to or slightly better than bibliographic coupling in terms of the accuracy of the resulting clustering solutions. The other way around, using a citation-based relatedness measure as evaluation criterion, BM25 turns out to yield more accurate clustering solutions than other text-based relatedness measures.
Colavizza, G.; Franssen, T.; Leeuwen, T.N. van 2018
Citation networks among journal articles are perhaps the most common object of investigation in bibliometrics. For example, citation networks are widely used for science mapping as a way to explore the cognitive structure of scientific fields. Within this framework, the disciplines traditionally part of the humanities fare differently, their main trait being the interplay of a broader array of publication typologies – monographs, edited volumes, journal articles – with a richer set of cited objects, including primary evidence. Consequently, when considered from a science mapping perspective, a community, field or specialism in the humanities might be represented as a multilayer network. We consider here a specialism in history, the history of Venice, and represent it using a set of publications including both books (edited and monographs) and journal articles. This set of publications is interconnected using three similarity measures: bibliographic coupling over references to books, bibliographic coupling over references to primary sources and textual similarity. The result is a multi-relation network with three distinct dimensions (that we will call layers), one per similarity measure, connecting the same publications. Given this representation, we proceed to analyse the different communities emerging from the three layers, to qualify them and consider to what extent they overlap or instead provide for orthogonal conceptual spaces.
A long tradition of sociological research aims to understand the differences in the organizational and cognitive structure of scientific fields. This sociological tradition was in its earlier years intimately connected with the emerging field of bibliometric methods and applications, which originated in the 1960s with the work of Storer and Price. However, the sociology of science and scientometrics have since the early 1980s drifted apart and attempts to reconcile them, or to reconcile the more theoretically inclined field of science and technology studies with scientometrics, have not had the desired effect. Recently, scholars have again argued for the need for interdisciplinary work bridging the sociology of science or science and technology studies with scientometrics. We take up these calls and explore ways to bridge the sociology of science with scientometrics by offering an operationalisation and empirical assessment of the rural and urban sociological framework by Becher and Trowler (2001). We compare ten specialisms from five disciplines: history, computer science, astrophysics, literature and biology, and study the connectivity properties of the bibliographic coupling networks of each. Our results show that the specialisms in the humanities possess a much lower connectivity, organising in many, smaller topics of research. They also show a lower reliance on shared core sources, contrary to the framework's predictions, suggesting that more theoretical and empirical work is required in order to fully characterise different specialisms of research.
Colavizza, G.; Boyack, K.W.; Eck, N.J. van; Waltman, L. 2017
We investigated the similarities of pairs of articles that are cocited at the different cocitation levels of the journal, article, section, paragraph, sentence, and bracket. Our results indicate that textual similarity, intellectual overlap (shared references), author overlap (shared authors), and proximity in publication time all rise monotonically as the cocitation level gets lower (from journal to bracket). While the main gain in similarity happens when moving from journal to article cocitation, all level changes entail an increase in similarity, especially section to paragraph and paragraph to sentence/bracket levels. We compared the results from four journals over the years 2010–2015: Cell, the European Journal of Operational Research, Physics Letters B, and Research Policy, with consistent general outcomes and some interesting differences. Our findings motivate the use of granular cocitation information as defined by meaningful units of text, with implications for, among others, the elaboration of maps of science and the retrieval of scholarly literature.
Boyack, K.W.; Eck, N.J. van; Colavizza, G.; Waltman, L. 2017
We report characteristics of in-text citations in over five million full text articles from two large databases – the PubMed Central Open Access subset and Elsevier journals – as functions of time, textual progression, and scientific field. The purpose of this study is to understand the characteristics of in-text citations in a detailed way prior to pursuing other studies focused on answering more substantive research questions. As such, we have analyzed in-text citations in several ways and report many findings here. Perhaps most significantly, we find that there are large field-level differences that are reflected in position within the text, citation interval (or reference age), and citation counts of references. In general, the fields of Biomedical and Health Sciences, Life and Earth Sciences, and Physical Sciences and Engineering have similar reference distributions, although they vary in their specifics. The two remaining fields, Mathematics and Computer Science and Social Science and Humanities, have different reference distributions from the other three fields and between themselves. We also show that in all fields the numbers of sentences, references, and in-text mentions per article have increased over time, and that there are field-level and temporal differences in the numbers of in-text mentions per reference. A final finding is that references mentioned only once tend to be much more highly cited than those mentioned multiple times.