The Research Cluster Dataset is not publicly available. This documentation is provided as context for the ETO products that use the dataset.
The Research Cluster Dataset (“RC Dataset”) organizes articles from the Merged Academic Corpus into over 100,000 clusters, which are groups of articles that cite each other unusually often. The dataset records the articles associated with each cluster along with unique metadata about each cluster, such as common terms and topics, growth rate, and degree of connectedness (through citations) to other clusters. You can use ETO’s Map of Science to explore these clusters and related trends in global science and technology.
The RC Dataset is generated from the Merged Academic Corpus, which includes data on over 270 million scholarly articles.
Because this dataset incorporates licensed data from commercial providers, it is not publicly available. However, you can view and filter most of the data in it indirectly using the Map of Science.
Because the dataset is not publicly available, you should cite the Map of Science or this documentation page instead.
The basic unit of the RC Dataset is the cluster. Each cluster has the following metadata fields, generated from the MAC article-level data for the articles in that cluster. You can view this metadata for individual clusters using the Map of Science interface.
Each cluster has a unique numerical ID. ID numbers are arbitrarily selected and are purely for identification.
Everything in the RC Dataset is based on the Merged Academic Corpus, including the clusters in the dataset and the metadata for each cluster.
A research cluster is a group of articles that are highly interconnected through direct citation.
By "direct citation," we mean one article explicitly citing another. As of late 2022, the articles in the MAC contained about 1.6 billion of these citations.
When we say that a group of articles is "highly interconnected" through citations, we mean that the articles in that group directly cite each other a lot, compared to how often they cite articles outside the group or articles outside the group cite them. The technical term for this property is modularity of citation.
In less technical terms, you can compare relationships between articles in a cluster to relationships between people in a family. Everybody is connected to people inside and outside their family, but in many contexts, you probably have stronger and more frequent connections to people within your family than people outside it. (For example, you may spend more time with family members than others, especially if you live with them.) Likewise, an article in a particular cluster may cite, and be cited by, other articles inside and outside that cluster - but it will generally have more citation links to articles inside the cluster than articles in other clusters.
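To make the "highly interconnected" idea concrete, here is a toy sketch in Python. The article IDs and citation pairs are invented for illustration; the point is simply to compare citations inside a candidate group against citations crossing its boundary.

```python
# Hypothetical citation pairs (citing, cited) and a candidate group of articles.
citations = [(0, 1), (1, 2), (0, 2), (2, 5), (3, 4), (4, 5), (3, 5)]
group = {0, 1, 2}

# Citations where both articles are inside the group...
intra = sum(1 for u, v in citations if u in group and v in group)
# ...versus citations that cross the group boundary in either direction.
cross = sum(1 for u, v in citations if (u in group) != (v in group))
# Here intra == 3 and cross == 1: the group cites itself three times as
# often as it exchanges citations with the rest of the corpus.
```

A group like this one, with many more internal than boundary-crossing citations, is the kind of structure the clustering process rewards.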
Research clusters are created from citation links only: the articles in a cluster don't necessarily have to be related in any other way. But in practice, research articles usually cite each other because they have other things in common. For example, researchers writing in Chinese are likely to cite other research written in Chinese, and researchers who study tree frogs are likely to cite prior work on tree frogs.
This means that citation-based research clusters often represent topics, shared language, or other real-world concepts indirectly - that is, even though those concepts aren’t used to create the clusters in the first place. These emergent concepts can be very precise and nuanced, helping users explore trends in the research literature in great detail and without having to specify the concepts in advance. For example, our clustering process can automatically identify and isolate research related to concepts like automatic speech recognition of multiple voices in noisy environments, organic vapor sensors for food safety monitoring, or nature-inspired robots that grow like vines.
In other cases, the citation-based clustering method can give results that are messy or unintuitive. Articles may be clustered together just because they happened to cite each other (and not cite other articles) in such a way that our algorithm recognized them as distinct groups; thematically, they may not seem to have much in common. However, we usually find that studying a cluster using the cluster detail view in the Map of Science interface will reveal real-world reasons why the articles in it tend to cite each other.
We organize articles into clusters using the Leiden algorithm. Essentially, this algorithm generates many random clusterings of articles, then finds the one whose clusters have the highest modularity of citation (as described above).
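The select-the-best-of-many-runs idea can be sketched as follows. This is a toy illustration only: a stub random partitioner stands in for an actual Leiden run, and the modularity function is the standard undirected Newman formula, a simplification of the directed citation case.

```python
import random
from collections import defaultdict

def modularity(edges, membership):
    # Newman modularity of a partition (undirected simplification).
    m = len(edges)
    intra, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            intra[membership[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def random_partition(nodes, k, rng):
    # Stub standing in for one randomized clustering run.
    return {n: rng.randrange(k) for n in nodes}

# Two tight triangles joined by a single bridge citation:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
rng = random.Random(0)
candidates = [random_partition(range(6), 2, rng) for _ in range(50)]
# Keep the candidate partition whose clusters have the highest modularity.
best = max(candidates, key=lambda p: modularity(edges, p))
```

In the real pipeline the candidates come from the Leiden algorithm rather than a random stub, but the selection principle — score each candidate clustering by modularity and keep the best — is the same.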
However, because the clustering algorithm involves a degree of randomness, relying on the algorithm alone to generate the clusters can produce unstable results - that is, different runs of the algorithm will often assign the same article to different clusters. We address this by screening out articles and citations that are less likely to convey meaningful information, which tend to “throw off” the clustering algorithm. Specifically, we screen out articles unless:
We also disregard “unstable” citation links when clustering articles - that is, citations that cause the clustering structure to vary significantly over different runs of the clustering algorithm.1
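The unstable-link screen (co-clustering in fewer than three of five runs, per the footnote below) can be sketched like this, with invented article IDs and run results:

```python
# Five hypothetical clustering runs: article id -> cluster label in that run.
runs = [
    {"a": 1, "b": 1, "c": 2},
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 2, "c": 2},
    {"a": 1, "b": 1, "c": 2},
    {"a": 1, "b": 1, "c": 2},
]
citations = [("a", "b"), ("b", "c")]  # (citing, cited) pairs

def is_unstable(citing, cited, runs, threshold=3):
    # A link is "unstable" if the two articles land in the same cluster
    # in fewer than `threshold` of the runs.
    together = sum(1 for r in runs if r[citing] == r[cited])
    return together < threshold

# Only stable links are kept for the final clustering pass.
stable = [c for c in citations if not is_unstable(*c, runs)]
```

Here "a" and "b" co-cluster in four of five runs, so their link survives; "b" and "c" co-cluster in only two, so theirs is disregarded.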
After screening articles and citations, we run the Leiden algorithm again using only the articles and citations that remain. After that, we delete clusters with fewer than 50 articles, and assign each of the articles in those clusters to the remaining cluster that the article most often cites. (Articles published between clustering updates are assigned to clusters using a similar method.)
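The dissolve-and-reassign step can be sketched as follows, using invented toy data and a threshold of 2 in place of the real 50-article minimum:

```python
from collections import Counter

# Hypothetical article -> cluster assignments and (citing, cited) pairs.
cluster_of = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
citations = [("e", "a"), ("e", "b"), ("e", "c")]
MIN_SIZE = 2  # the real threshold is 50 articles

sizes = Counter(cluster_of.values())
small = {c for c, n in sizes.items() if n < MIN_SIZE}  # clusters to dissolve

for art, clu in list(cluster_of.items()):
    if clu in small:
        # Count this article's citations into each surviving cluster...
        cited = Counter(
            cluster_of[t] for s, t in citations
            if s == art and cluster_of[t] not in small
        )
        # ...and reassign it to the cluster it cites most often.
        cluster_of[art] = cited.most_common(1)[0][0]
```

In this example, cluster 3 has only one article ("e"), which cites cluster 1 twice and cluster 2 once, so "e" moves to cluster 1 and cluster 3 disappears.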
This process of data screening, algorithmic sorting, and selective reassignment produces clusters that are (1) highly modular; (2) large enough to be analytically meaningful - generally in the hundreds of articles; and (3) stable across different runs.2
For each cluster, the RC Dataset includes a variety of metadata generated using the MAC metadata of the articles in the cluster.
Some of the cluster-level metadata is created through relatively simple operations over the article-level metadata - for example, counting the number of articles in a cluster with a certain characteristic, or calculating the average or most prevalent value of a metadata field within the cluster. In straightforward cases like these, the process used to create the metadata is described above. In other cases, more complex operations are needed.
Each cluster is assigned coordinates in a 2D space containing every other cluster in the dataset, with clusters that cite each other more often positioned closer together. These coordinates are approximations. Technically, the “location” of a cluster has thousands of dimensions - one for each other cluster in the map, representing the degree of citation connection between those two clusters. That means the distances between clusters can’t be perfectly represented in two-dimensional space; 2D visualizations like the Map of Science can only approximate these distances.
To generate 2D coordinates for the clusters, we create a graph from the intercluster citation counts for every cluster in the dataset, then process the graph with the DRL algorithm (implemented in igraph), generating several candidate 2D layouts. Across these layouts, clusters that share more citations are generally closer together, but the layouts differ due to random variation in the algorithm. We identify the candidate layout with the least overall difference from all the others. Finally, we extract coordinates for each individual cluster from the layout.
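The "least overall difference" selection among candidate layouts can be sketched as below. The layouts here are invented toy coordinates rather than real DRL output, and disagreement between two layouts is measured by comparing their pairwise-distance matrices (one reasonable, rotation- and translation-invariant choice; the production comparison may differ).

```python
import itertools
import math

def dist_matrix(layout):
    # Pairwise inter-cluster distances, in a fixed key order.
    keys = sorted(layout)
    return [math.dist(layout[a], layout[b])
            for a, b in itertools.combinations(keys, 2)]

def layout_disagreement(l1, l2):
    return sum(abs(x - y) for x, y in zip(dist_matrix(l1), dist_matrix(l2)))

# Three hypothetical candidate layouts for four clusters; the third run
# drifted badly from the others.
candidates = [
    {1: (0, 0), 2: (1, 0), 3: (0, 1), 4: (1, 1)},
    {1: (0, 0), 2: (1.1, 0), 3: (0, 0.9), 4: (1, 1)},
    {1: (0, 0), 2: (3, 0), 3: (0, 3), 4: (3, 3)},
]
# Keep the candidate with the least total disagreement with all the others.
best = min(
    candidates,
    key=lambda c: sum(layout_disagreement(c, o) for o in candidates if o is not c),
)
```

The outlier run is penalized by its large disagreement with both of the other candidates, so one of the two similar layouts is selected.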
To identify cluster subjects, we take each general and specific subject assigned to at least one article in the cluster, then calculate a cluster-level relevance score for that subject, defined as (the sum of the subject's article-level scores across the cluster) divided by (the number of articles in the cluster). We then select the highest-scoring general and specific subjects. In each of these steps, we use articles from the past 10 years only.
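The relevance-score calculation is a simple average; a minimal sketch with invented subject names and article-level scores (articles not tagged with a subject contribute 0):

```python
# Hypothetical article-level subject scores for a five-article cluster.
article_scores = {
    "robotics": [0.9, 0.8, 0.7, 0.0, 0.0],
    "materials science": [0.0, 0.0, 0.2, 0.1, 0.0],
}
n_articles = 5

# Cluster-level relevance: sum of article scores / number of articles.
relevance = {subj: sum(scores) / n_articles
             for subj, scores in article_scores.items()}
top_subject = max(relevance, key=relevance.get)
```

Here "robotics" scores 0.48 and "materials science" 0.06, so "robotics" would be selected for the cluster.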
To identify key concepts in clusters, we use the yake algorithm to extract the top twenty keyphrases for each cluster from the titles and abstracts of its articles, aiming for keyphrases that are relatively distinct from each other so that we can identify discrete concepts. We then remove any phrases that are common English words (identified using the wordfreq Python library), as well as single words that are part of other phrases surfaced for the same cluster, leaving a set of phrases that is more useful and varied.
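The post-extraction filtering step can be sketched in pure Python. The phrases below stand in for yake output, and the hard-coded common-word set stands in for the wordfreq lookup; neither reflects the actual pipeline's data.

```python
# Hypothetical keyphrases surfaced for one cluster (stand-in for yake output).
phrases = ["speech recognition", "noisy environments", "speech", "method"]
COMMON_WORDS = {"method", "results", "analysis"}  # stand-in for wordfreq

def keep(phrase, all_phrases):
    if phrase in COMMON_WORDS:
        # Drop phrases that are just common English words.
        return False
    if " " not in phrase:
        # Drop single words that appear inside another surfaced phrase.
        return not any(phrase in other.split()
                       for other in all_phrases if other != phrase)
    return True

concepts = [p for p in phrases if keep(p, phrases)]
```

Here "method" is dropped as a common word and "speech" is dropped because it is subsumed by "speech recognition", leaving the two multi-word concepts.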
Our model for predicting extreme cluster growth is described in detail in Rahkovsky et al. (2021). In brief, the model makes binary (yes/no) predictions based on the overall recency of articles in the cluster, the recency of citations to those articles, the prevalence of articles in top journals and conferences, and the past growth of the cluster compared to the entire global literature. We define “extreme growth” as 8% or more annual growth over the next 3-year period. Predictions are only available for clusters with at least 20 articles from the past 12 months. For more details on the model and how it was evaluated, see Rahkovsky et al. (2021), section 2.3.
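The 8% threshold can be made concrete with a small check. This sketch assumes "annual growth over the next 3-year period" means a compound annual growth rate, which is one reasonable reading; see Rahkovsky et al. (2021) for the authoritative definition.

```python
def extreme_growth(start_count, end_count, years=3, threshold=0.08):
    # Annualized (compound) growth rate over the window, compared to
    # the 8% "extreme growth" threshold.
    annual = (end_count / start_count) ** (1 / years) - 1
    return annual >= threshold
```

For example, a cluster growing from 100 to 130 articles over three years (about 9.1% annually) clears the bar, while one growing from 100 to 120 (about 6.3% annually) does not.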
There are two main types of updates to the RC Dataset.
Clustering updates recalculate the clustering structure of the entire dataset “from scratch” using the most recent MAC data and the method described above. The last clustering update was in August 2021.
Article and metadata updates assign articles that were added to the MAC after the last clustering update to existing clusters, and then recalculate cluster-level metadata with the newly assigned articles taken into account. Each new article is assigned to the cluster it cites the most. Article and metadata updates use an automated process that we run roughly quarterly.
Virtually all CSET data team members have contributed to the RC Dataset in some form, whether by providing feedback, reviewing code, or helping generate ideas. In alphabetical order, some particular contributions follow:
Dewey Murdick and SciTech Strategies helped develop the clustering concept and methodology. Student research assistant Jerod Sun contributed data characterization. Zach Arnold and Katherine Quinn helped create this documentation.
| Date | Notes |
| --- | --- |
| 10/13/22 | Initial release (ETO/CSET internal) |
Our clustering algorithm involves a random seed, so the clusters it generates may vary somewhat between runs. To improve stability, we ran the algorithm five times with the same input parameters and different random seeds, resulting in five different cluster structures. For each citation link in the MAC, we then identified how often the cited and citing papers appeared in the same cluster across the five runs. We defined a citation link as “unstable” if the cited and citing paper were in the same cluster in fewer than three of the five runs. ↩
We measure consistency by tracking how often articles switch clusters across successive runs of the clustering algorithm. In our current process, fewer than 1% of articles switch clusters from run to run (on average). ↩