The Research Cluster Dataset ("RC Dataset") organizes articles from the Merged Academic Corpus into nearly 92,000 clusters, which are groups of articles that cite each other unusually often and feature similar text in their titles and abstracts. The dataset includes publications in each cluster along with unique metadata about each cluster, such as common subjects and topics, growth rate, and connections to other clusters. You can use ETO’s Map of Science to explore these clusters and related trends in global science and technology.
The RC Dataset is generated from the Merged Academic Corpus, which includes data on over 287 million scholarly articles. Read more >>
Because this dataset currently incorporates licensed data from commercial providers, it is not publicly available. However, you can view and filter most of the data in it indirectly using the Map of Science.
Because the dataset is not publicly available, you should cite the Map of Science or this documentation page instead.
The basic unit of the RC Dataset is the cluster. Each cluster has the following metadata fields, generated from the MAC article-level data for the articles in the cluster. You can view this metadata for individual clusters using the Map of Science interface.
The RC Dataset is built using articles and article-level metadata from the Merged Academic Corpus.
A research cluster is a group of articles that are highly interconnected through direct citation or similarity in the text of their titles and abstracts.
By "direct citation," we mean one article explicitly citing another. When we say that a cluster of articles is "highly interconnected" through citations, we mean that the articles in that cluster directly cite each other a lot, compared to how often they cite articles outside the cluster, or how often articles outside the cluster cite them. The technical term for this property is modularity of citation.
Articles are also connected to each other via similarity in their titles and abstracts. Similarity measures are determined using pre-trained embeddings from a multilingual SentenceTransformer model. Articles that have similar text in their titles and abstracts may relate to common topics, methods, approaches, or research areas that would lead to them naturally grouping together, even if there is no direct citation between them.
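To illustrate the general idea, here is a minimal sketch of embedding-based text similarity using the sentence-transformers library. The model name, example titles, and similarity threshold are placeholders for illustration, not the configuration used to build the RC Dataset.

```python
# Sketch: embedding-based text similarity between articles.
# The model name and similarity threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

articles = [
    "Patient-derived organoids for high-throughput drug screening",
    "Machine learning models for urban air quality prediction",
    "Organoid-based platforms to evaluate anticancer drug response",
]

# Encode each article's title (in practice, title plus abstract text).
embeddings = model.encode(articles, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between every pair of articles.
similarity = util.cos_sim(embeddings, embeddings)

# Keep only pairs above an (illustrative) similarity threshold as text links.
THRESHOLD = 0.5
for i in range(len(articles)):
    for j in range(i + 1, len(articles)):
        score = float(similarity[i][j])
        if score >= THRESHOLD:
            print(f"text link: article {i} <-> article {j} ({score:.2f})")
```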
Research clusters are created from citation links and text similarity: the articles in a cluster don't necessarily have to be related in any other way. But in practice, research articles often cite each other because they have other things in common. For example, researchers who study tree frogs are likely to cite prior work on tree frogs, and researchers may be more likely to cite other researchers in their professional network.
This means that research clusters based on citations and text similarity of the articles often represent topics, shared language, collegial connections, or other real-world concepts. These emergent concepts can be very precise and nuanced, helping users explore trends in the research literature in great detail and without having to specify the concepts in advance. For example, our clustering process can automatically identify and isolate research related to concepts like the application of patient-derived organoids for drug screening, machine learning models for predicting air quality, or methods for enhancing power grid resilience during natural disasters.
We first generate a large network connecting all scientific articles in our Merged Academic Corpus based on the similarity of their embeddings, computed with a pre-trained, multilingual SentenceTransformer model. We combine this with a network based on between-article citations to generate a hybrid network that includes both the relevant citation information and the text similarities between articles.
We then organize articles into clusters using the Leiden algorithm. Essentially, this algorithm generates many random clusterings of articles, then finds the one whose clusters have the highest modularity.
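The sketch below illustrates the two steps under simplifying assumptions: citation edges and text-similarity edges are merged into one weighted graph, which is then partitioned with python-igraph's Leiden implementation. The toy edges and edge weights are placeholders, not the dataset's actual parameters.

```python
# Sketch: combine citation and text-similarity edges into a hybrid graph,
# then cluster it with the Leiden algorithm. The edges and weights are
# illustrative assumptions.
import igraph as ig

n_articles = 6
citation_edges = [(0, 1), (1, 2), (3, 4)]   # article i cites article j
text_edges = [(0, 2), (4, 5)]               # pairs with similar titles/abstracts

g = ig.Graph(n=n_articles, edges=citation_edges + text_edges, directed=False)
# Citation links and text-similarity links can carry different weights.
g.es["weight"] = [1.0] * len(citation_edges) + [0.5] * len(text_edges)

# Partition the hybrid network with Leiden, optimizing modularity.
partition = g.community_leiden(objective_function="modularity", weights="weight")

for cluster_id, members in enumerate(partition):
    print(f"cluster {cluster_id}: articles {members}")
```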
However, because the clustering algorithm involves a degree of randomness, relying on the algorithm alone to generate the clusters can produce unstable results - that is, different runs of the algorithm will often assign the same article to different clusters. We address this by screening out articles and citations that are less likely to convey meaningful information, which tend to "throw off" the clustering algorithm. Specifically, we screen out articles unless:
We also disregard "unstable" article links when clustering articles - that is, network connections that cause the clustering structure to vary significantly over different runs of the clustering algorithm.1
After screening articles and connections, we run the Leiden algorithm again using only the stable articles and network links that remain. After that, we delete clusters with fewer than 50 articles, and assign each of the articles in those clusters to the remaining cluster that is most strongly connected to that article. (Articles published between clustering updates are assigned to clusters using a similar method.)
Finally, each article that was screened out is added to the cluster to which it has the strongest connection.
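As a rough illustration of this reassignment step, the following sketch assigns an article to whichever surviving cluster it has the most citation links to. Treating "connection strength" as a simple link count, and the data structures used here, are simplifying assumptions.

```python
# Sketch: reassign a screened-out article (or an article from a deleted
# small cluster) to the remaining cluster it is most strongly connected to.
# "Connection strength" is simplified here to a count of citation links.
from collections import Counter

def strongest_cluster(article_links, cluster_of):
    """article_links: ids of articles this article cites or is cited by.
    cluster_of: mapping from article id to its (surviving) cluster id."""
    counts = Counter(cluster_of[a] for a in article_links if a in cluster_of)
    if not counts:
        return None  # no links to any surviving cluster
    return counts.most_common(1)[0][0]

cluster_of = {"a1": 7, "a2": 7, "a3": 12}
print(strongest_cluster(["a1", "a2", "a3", "a9"], cluster_of))  # -> 7
```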
This process of data screening, algorithmic sorting, and selective reassignment produces clusters that are (1) highly modular; (2) large enough to be analytically meaningful - generally in the hundreds of articles; and (3) stable across different runs.2
For each cluster, the RC Dataset includes a variety of metadata generated using the MAC metadata of the articles in the cluster.
Some of the cluster-level metadata is created through relatively simple operations over the article-level metadata - for example, counting the number of articles in a cluster with authors from a certain organization, or calculating the average age of articles within a cluster using each article's publication date. In straightforward cases like these, the process used to create the metadata is described above. In other cases, more complex operations are needed.
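For instance, here is a sketch of the simpler aggregations, assuming a pandas DataFrame of article-level MAC metadata; the column names and values are hypothetical.

```python
# Sketch: simple cluster-level aggregations over article-level metadata.
# Column names ("cluster_id", "org", "pub_year") and values are hypothetical.
import pandas as pd

articles = pd.DataFrame({
    "cluster_id": [1, 1, 1, 2, 2],
    "org": ["MIT", "MIT", "Tsinghua", "Oxford", "MIT"],
    "pub_year": [2019, 2021, 2022, 2018, 2023],
})

CURRENT_YEAR = 2025

# Number of articles per cluster with authors from a given organization.
mit_counts = (
    articles[articles["org"] == "MIT"]
    .groupby("cluster_id").size().rename("mit_articles")
)

# Average article age per cluster, derived from publication year.
avg_age = (
    (CURRENT_YEAR - articles["pub_year"])
    .groupby(articles["cluster_id"]).mean().rename("avg_article_age")
)

print(pd.concat([mit_counts, avg_age], axis=1))
```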
Each cluster is assigned coordinates in a 2D space containing every other cluster in the dataset, with clusters that cite each other more often positioned closer together. These coordinates are approximations. Technically, the "location" of a cluster has thousands of dimensions - one for each cluster in the map, where each new dimension is mathematically needed to accurately represent a cluster’s "distance" (or degree of citation connection) from the other clusters. That means the distances between clusters can’t be perfectly represented in two-dimensional space; 2D visualizations like the Map of Science can only approximate these distances.
To generate 2D coordinates for the clusters, we create a graph from the intercluster connections for every cluster in the dataset, then process the graph with the DRL algorithm (implemented in igraph), generating several candidate 2D layouts. Across these layouts, clusters that share more citations are generally closer together, but the layouts differ due to random variation in the algorithm. We identify the candidate layout with the least overall difference from all the others. Finally, we extract coordinates for each individual cluster from the layout.
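A simplified sketch of this layout-selection idea appears below, using python-igraph's DRL layout on a toy graph. Comparing layouts by their matrices of pairwise inter-cluster distances is an illustrative assumption, not necessarily the exact criterion used for the dataset.

```python
# Sketch: generate several candidate DRL layouts for the cluster-level graph
# and keep the one most similar to the others. The toy graph and the
# layout-comparison measure are illustrative assumptions.
import igraph as ig
import numpy as np
from scipy.spatial.distance import pdist

# Toy cluster-level graph: nodes stand in for clusters, edges for
# intercluster citation connections.
g = ig.Graph.Famous("Zachary")

# Several runs of the (randomized) DRL layout algorithm.
candidates = [np.array(g.layout_drl().coords) for _ in range(5)]

# Represent each layout by its pairwise distances between nodes, which is
# invariant to rotating or shifting the layout.
dist_profiles = [pdist(coords) for coords in candidates]

# Pick the layout whose distance profile differs least from all the others.
total_diff = [sum(np.abs(p - q).sum() for q in dist_profiles) for p in dist_profiles]
best = candidates[int(np.argmin(total_diff))]
print("chosen layout has", best.shape[0], "node coordinates")
```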
To identify a cluster's research subjects (disciplines, fields, subfields, and topics), we first take each discipline and field assigned to at least one article in the cluster, then calculate a cluster-level relevance score for each, defined as the number of articles in the cluster for which that subject appears among the article's top three subjects (based on the article-level relevance scores). We then select the three top-scoring disciplines and the three top-scoring fields as the overall disciplines and fields for that cluster. In each of these steps, we use only articles from the past 10 years.
Because subfields and topics are defined hierarchically (each is nested under a particular field), we then consider each subfield and topic assigned to at least one article in the cluster, provided its parent field is among that cluster's top-scoring fields. We again calculate a cluster-level relevance score, defined as the number of articles in the cluster with a significant relevance score for that subfield or topic, and select the three top-scoring subfields and topics. However, because only a small number of articles in a cluster may be connected to a particular subfield or topic, we use these assignments only for search (i.e., to see where these subjects are distributed across clusters), not to describe the clusters themselves.
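A simplified sketch of this cluster-level scoring follows, assuming each article carries a ranked list of subjects; the subject names, data structure, and top-three cutoff are illustrative.

```python
# Sketch: score a cluster's candidate subjects by counting articles that have
# the subject among their top three, then keep the three top-scoring subjects.
# The subject names and data structure are illustrative assumptions.
from collections import Counter

# Each article: a list of its subjects, ordered by article-level relevance.
cluster_articles = [
    ["machine learning", "computer vision", "robotics"],
    ["machine learning", "statistics", "optimization"],
    ["computer vision", "machine learning", "graphics"],
    ["neuroscience", "machine learning", "statistics"],
]

scores = Counter()
for subjects in cluster_articles:
    for subject in subjects[:3]:          # only count top-three subjects
        scores[subject] += 1

top_subjects = [subject for subject, _ in scores.most_common(3)]
print(top_subjects)  # e.g. ['machine learning', 'computer vision', 'statistics']
```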
To identify key concepts in clusters, we use the yake algorithm to extract the top twenty keyphrases for each cluster from the titles and abstracts of its articles, aiming for keyphrases that are relatively distinct from each other so that we can identify discrete concepts. We then remove any phrases that are common English words (identified using the wordfreq Python library) or that are single words contained in other phrases surfaced for the same cluster, leaving phrases that are more useful and varied. The five top-scoring keyphrases that remain are then assigned to the cluster.
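The sketch below shows the general shape of this pipeline with the yake and wordfreq libraries. The example text, n-gram size, deduplication limit, and frequency cutoff are illustrative assumptions rather than the dataset's actual parameters.

```python
# Sketch: keyphrase extraction for a cluster with yake, then filtering with
# wordfreq. Parameter values and the example text are illustrative assumptions.
import yake
from wordfreq import zipf_frequency

text = (
    "Patient-derived organoids are increasingly used for drug screening. "
    "Organoid models recapitulate tumor heterogeneity and drug response."
)

# Extract up to twenty relatively distinct keyphrases (lower score = better).
extractor = yake.KeywordExtractor(lan="en", n=3, dedupLim=0.7, top=20)
candidates = extractor.extract_keywords(text)
phrases = [phrase for phrase, _ in candidates]

kept = []
for phrase in phrases:
    is_single_word = " " not in phrase
    # Drop very common single English words (high Zipf frequency).
    if is_single_word and zipf_frequency(phrase.lower(), "en") > 5.0:
        continue
    # Drop single words already contained in a longer surfaced phrase.
    if is_single_word and any(
        phrase.lower() in other.lower() and phrase != other for other in phrases
    ):
        continue
    kept.append(phrase)

top_keyphrases = kept[:5]
print(top_keyphrases)
```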
Our model for predicting extreme cluster growth is described in detail in Rahkovsky et al. (2021). In brief, the model makes binary (yes/no) predictions based on the overall recency of articles in the cluster, the recency of citations to those articles, the prevalence of articles in top journals and conferences, and the past growth of the cluster compared to the entire global literature. We define "extreme growth" as 8% or more annual growth over the next 3-year period. Predictions are only available for clusters with at least 20 articles from the past 12 months. For more details on the model and how it was evaluated, see Rahkovsky et al. (2021), section 2.3.
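To make the growth definition concrete, here is a small sketch of the binary label; the article counts are made up, and treating the 8% threshold as compound annual growth is an assumption for illustration.

```python
# Sketch: label a cluster as "extreme growth" if its article count grew by
# 8% or more per year over a three-year period. The counts are made up and
# the compound-growth formulation is an illustrative assumption.
def extreme_growth(count_start, count_end, years=3, threshold=0.08):
    annual_growth = (count_end / count_start) ** (1.0 / years) - 1.0
    return annual_growth >= threshold

print(extreme_growth(400, 520))  # ~9.1% annual growth over 3 years -> True
print(extreme_growth(400, 430))  # ~2.4% annual growth -> False
```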
There are two main types of updates to the RC Dataset.
Clustering updates recalculate the clustering structure of the entire dataset "from scratch" using the most recent MAC data and the method described above. The last clustering update was in fall 2025.
Article and metadata updates assign articles that were added to the MAC after the last clustering update to existing clusters, and then recalculate cluster-level metadata with the newly assigned articles taken into account. Each new article is assigned to the cluster it cites the most. Article and metadata updates use an automated process that we run roughly monthly.
Virtually all CSET data team members have contributed to the RC Dataset in some form, whether by providing feedback, reviewing code, or helping generate ideas. In alphabetical order, some particular contributions follow:
Dewey Murdick and SciTech Strategies helped develop the clustering concept and methodology. Katherine Quinn implemented structural changes to the clustering methodology to incorporate both citation and text similarity. Zach Arnold wrote the original documentation, and this updated version was written by Katherine Quinn, Adrian Thinnyun, and Catherine Aiken.
Emerging technology topic classifications are based on work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.
| Date | Change |
| --- | --- |
| 10/13/22 | Initial release (ETO/CSET internal) |
| 11/22/23 | 2023 update following changes to MAC (ETO/CSET internal) |
| 12/15/25 | Changes to clustering method to include both citation and text similarity |
Our clustering algorithm involves a random seed, so the clusters it generates may vary somewhat between runs. To improve stability, we ran the algorithm five times with the same input parameters and different random seeds, resulting in five different cluster structures. For each citation link in the MAC, we then identified how often the cited and citing papers appeared in the same cluster across the five runs. We defined a citation link as "unstable" if the cited and citing paper were in the same cluster in fewer than three of the five runs. ↩
We measure consistency by tracking how often articles switch clusters across successive runs of the clustering algorithm. In our current process, fewer than 1% of articles switch clusters from run to run on average. ↩