We know a great deal about how to identify the topics that researchers are working on. One can use citations and/or text to identify about one hundred thousand document clusters from either the Scopus database or the WoS database. For purposes of discussion, we refer to these document clusters as topics. In our models there are about a thousand very large topics and tens of thousands of small topics. But why doesn’t topic size follow an expected linear Zipfian distribution? Is it possible that there is a different organizing principle for large vs. small topics? In this study, we explore the possibility that the organizing principle for large topics is the continued use of very expensive tools (such as specialized equipment, specialized databases and specialized software). For our initial exploration, we use grant size (NSF or NIH grants in excess of $5 million annually) as a proxy for preferential investment in specialized tools. By using links between 52,097 grants and tens of thousands of topics, we will test whether large topics receive more funding than expected from large grants and, by inference, whether expensive tools are an organizing principle for large topics.
Recently, citation links have been shown to produce accurate delineations of tens of millions of scientific documents into a large number (~100,000) of clusters (Sjögårde & Ahlgren, 2018). Such clusters, which we refer to as topics, can be used for research evaluation and planning (Klavans & Boyack, 2017a) as well as to identify hot and/or emerging topics (Small, Boyack, & Klavans, 2014). While direct citation links have been shown to produce more accurate topics using large citation databases than co-citation or bibliographic coupling links (Klavans & Boyack, 2017b), no such comparison has been done at a similar scale using topics based on textual relatedness due to the extreme computational requirements of calculating an enormous number of document-document similarities using text. Thus, we simply do not know if topics identified from a large database using textual characteristics are as accurate as those that are identified using direct citation. This paper aims to fill that gap. In this work we cluster over 23 million documents from the PubMed database (1975-2017) using a text-based similarity and compare the accuracy of the resulting topics to those from existing citation-based topics using three different measures.