Documentation: Research Cluster Dataset

Overview

The Research Cluster Dataset is not publicly available. This documentation is provided as context for the ETO products that use the dataset.

What is this dataset?

The Research Cluster Dataset ("RC Dataset") organizes articles from the Merged Academic Corpus into nearly 92,000 clusters, which are groups of articles that cite each other unusually often and feature similar text in their titles and abstracts. The dataset includes publications in each cluster along with unique metadata about each cluster, such as common subjects and topics, growth rate, and connections to other clusters. You can use ETO’s Map of Science to explore these clusters and related trends in global science and technology.

Which ETO products use it?

ETO's Map of Science is built on the RC Dataset; you can use it to explore the clusters, their metadata, and related trends in global science and technology.
What are its sources?

The RC Dataset is generated from the Merged Academic Corpus, which includes data on over 287 million scholarly articles. Read more >>

What are its main limitations?

  • Some clusters don’t have a single, clear topic. The clusters in the RC Dataset are generated algorithmically, using the citation links between articles and the similarity of their titles and abstracts. In some cases, the resulting clusters might not neatly map to a single research topic. Read more >>
  • The clusters are derived from other data and reflect the limitations of that data. Known limitations in the Merged Academic Corpus and its underlying data sources, including lags, will exist in the cluster-level data.
  • The dataset isn’t suitable for all types of analysis. We regularly add new articles to the clusters, but we don’t regularly add clusters, delete clusters, or (with limited exceptions) change the cluster assignment of already assigned articles. Read more >>
  • Cluster subjects (disciplines, fields, subfields, topics, and concepts) may be imprecise. We rely on algorithms and numerical thresholds to identify relevant research subjects (disciplines, fields, subfields, topics, and concepts) for each cluster. Read more >>

What are the terms of use?

Because this dataset currently incorporates licensed data from commercial providers, it is not publicly available. However, you can view and filter most of the data in it indirectly using the Map of Science.

How do I cite it?

Because the dataset is not publicly available, you should cite the Map of Science or this documentation page instead.

Structure and content

The basic unit of the RC Dataset is the cluster. Each cluster has the following metadata fields, each generated from the MAC's article-level data for the articles in the cluster. You can view this metadata for individual clusters using the Map of Science interface.

Cluster ID

  • Each cluster has a unique numerical ID. ID numbers are arbitrarily selected and are purely for identification.

Research disciplines and fields

  • Each cluster is associated with three high-level disciplines (such as medicine or computer science) and three more specific fields (such as cardiology or soil science), determined using the article-level disciplines and fields in the cluster and their relevance scores.
    • To learn how fields and scores are assigned at the article level, refer to the MAC documentation.
    • To learn how article-level scores are used to determine cluster disciplines and fields, see below.

Size

  • Cluster size is defined as the number of articles in the cluster published in the past 5 years.

Average article age

  • The average age of articles in the cluster (age = current year - publication year).

Citation rating

  • A cluster's citation rating is equal to the average citation percentile of articles in the cluster published in the past 5 years.

Growth rating

  • A cluster’s growth rating is a percentile-based measure reflecting how fast it grew in the past three years relative to other clusters in the dataset. For example, a cluster with a 90th-percentile growth rating grew faster than 90% of other clusters over the past three years.
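The size, average article age, citation rating, and growth rating above are straightforward aggregations over article-level metadata. A minimal sketch, assuming a pandas DataFrame with hypothetical columns (`cluster_id`, `year`, `citation_percentile`) and an illustrative current year:

```python
import pandas as pd

CURRENT_YEAR = 2025  # illustrative assumption

def simple_cluster_metrics(articles: pd.DataFrame) -> pd.DataFrame:
    """articles: one row per article, with hypothetical columns
    cluster_id, year, citation_percentile."""
    recent = articles[articles["year"] >= CURRENT_YEAR - 5]
    grouped = recent.groupby("cluster_id")
    metrics = pd.DataFrame({
        # Size: number of articles published in the past 5 years.
        "size": grouped.size(),
        # Citation rating: average citation percentile of recent articles.
        "citation_rating": grouped["citation_percentile"].mean(),
    })
    # Average article age (age = current year - publication year), over all articles.
    metrics["avg_article_age"] = (
        (CURRENT_YEAR - articles["year"]).groupby(articles["cluster_id"]).mean()
    )
    return metrics

def growth_rating(three_year_growth: pd.Series) -> pd.Series:
    """Percentile rank of each cluster's 3-year growth rate, relative to
    other clusters (the growth calculation itself is omitted here)."""
    return three_year_growth.rank(pct=True) * 100
```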

Extreme growth forecast

  • Each cluster has a binary value (true/false) indicating whether the cluster is expected to grow unusually rapidly in the next three years.
  • About 5% of clusters in the dataset are currently forecasted to have extreme growth.

Key concepts

Emerging technology subjects

  • In addition to the general research fields and subfields assigned to each cluster, each cluster includes the share of its articles that are relevant to various emerging subjects, such as AI, robotics, cybersecurity, LLM development, and chip fabrication and design. Read more >>
  • Using these article-level classifications, we calculate the percentage of articles in each cluster that are relevant to each emerging subject. We use only articles published in the past 5 years in these calculations.

Country affiliation

  • Using the author countries associated with each article in the cluster from the past 5 years, we calculate:
    • US affiliation: The percentage of articles in the cluster with at least one author from a US organization.
    • Chinese affiliation: The percentage of articles in the cluster with at least one author from a Chinese organization.
    • Leading countries: The three countries linked to the most articles in the cluster.
  • Here (and generally in ETO resources) we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities. Read more >>
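A minimal sketch of these three statistics, assuming each past-5-year article is represented by the set of countries its authors are affiliated with (the representation is an assumption for illustration):

```python
from collections import Counter

def country_affiliation(article_countries: list[set[str]]):
    """article_countries: one set of author countries per past-5-year article."""
    n = len(article_countries)
    # US / Chinese affiliation: share of articles with at least one author
    # from an organization in that country.
    us_share = 100 * sum("United States" in c for c in article_countries) / n
    china_share = 100 * sum("China" in c for c in article_countries) / n
    # Leading countries: the three countries linked to the most articles.
    counts = Counter(country for c in article_countries for country in c)
    leading = [country for country, _ in counts.most_common(3)]
    return us_share, china_share, leading
```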

Patent citation statistics

  • For each cluster, we calculate:
    • The number of patents (from the last 10 years only) citing one or more articles in the cluster.
    • How often patents cite articles in the cluster, compared to other clusters in the Map: for example, in a 90th percentile cluster, the number of articles that are cited by patents is higher than in 90% of other clusters.
    • The percentage of articles in the cluster that are cited by at least one patent.
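The second and third statistics can be sketched as follows, assuming a per-article count of citing patents from the last 10 years (column names are hypothetical); counting distinct citing patents would additionally require patent identifiers:

```python
import pandas as pd

def patent_citation_stats(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per article, with hypothetical columns
    cluster_id and n_citing_patents (patents from the last 10 years)."""
    grouped = df.groupby("cluster_id")["n_citing_patents"]
    stats = pd.DataFrame({
        # Share of articles cited by at least one patent.
        "pct_patent_cited": grouped.apply(lambda s: 100 * (s > 0).mean()),
        # Number of patent-cited articles, used for cross-cluster comparison.
        "n_patent_cited": grouped.apply(lambda s: int((s > 0).sum())),
    })
    # Percentile rank relative to other clusters in the dataset.
    stats["patent_citation_percentile"] = stats["n_patent_cited"].rank(pct=True) * 100
    return stats
```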

Industry affiliation percentage

  • For each cluster, we calculate the percentage of articles with at least one author from an industry organization. We use only articles from the past 5 years in these calculations.

Key articles

  • Each cluster has a list of top articles from the past 5 years, including each article’s title, source, year, number of citations, and reason for inclusion.
  • Articles are included in the list if they qualify as core articles, review articles, or top-cited articles.
    • Core articles: Articles that are especially highly connected to other articles in the cluster.
      • To identify these articles, we calculate a "core statistic" for each article in the cluster, incorporating the article’s age, total number of citations, and how often it cites articles within the cluster.
    • Review articles: Articles that describe and systematize other research. We assume that any article citing between 100 and 1,000 other articles, implying engagement with an unusually wide range of related scholarship, is a review article.
    • Top-cited articles: The articles in the cluster with the most citations.
  • We also calculate the field percentile for each article, defined as its citation percentile relative to all other articles in its research field and publication year.
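The review-article heuristic, top-cited selection, and field percentile can be sketched directly from these definitions (the exact form of the "core statistic" is not published, so it is omitted; field names are hypothetical):

```python
def is_review_article(n_references: int) -> bool:
    # Review heuristic: cites between 100 and 1,000 other articles.
    return 100 <= n_references <= 1000

def top_cited(articles: list[dict], k: int = 5) -> list[dict]:
    """Return the k most-cited articles in a cluster."""
    return sorted(articles, key=lambda a: a["citations"], reverse=True)[:k]

def field_percentile(article: dict, peers: list[dict]) -> float:
    """Citation percentile relative to articles sharing the same
    research field and publication year."""
    worse = sum(p["citations"] < article["citations"] for p in peers)
    return 100 * worse / len(peers)
```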

Top sources

  • Each cluster has a set of top sources (such as journals or preprint servers), including each source’s name and the number of associated articles in the cluster.
  • Top sources are defined as the sources of the most articles in the cluster published in the past 5 years.

Top authors

  • Each cluster has a set of top authors, including each author’s name, affiliated organization(s), number of associated articles in the cluster, and average yearly citations per article in the cluster.
  • Top authors are defined as the authors with the most articles in the cluster published in the past 5 years.

Top author organizations

  • Each cluster has a set of top author organizations, including each organization’s name, country, number of associated articles in the cluster, average yearly citations for each associated article, and organization type (commercial, education, nonprofit, government).
  • Top author organizations are defined as the organizations associated (through their authors) with the most articles in the cluster published in the past 5 years.

Collaboration statistics

  • Each cluster has a set of collaboration statistics, structured as a list of country pairs with corresponding article counts. Each count represents the number of articles in the cluster published over the last 10 years that had at least one author affiliated with an organization in each country in the pair.
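A sketch of the pair counting, assuming each last-10-year article is represented by the set of countries its authors' organizations are located in:

```python
from collections import Counter
from itertools import combinations

def collaboration_pairs(article_countries: list[set[str]]) -> Counter:
    """Count, for each unordered country pair, the number of articles with
    at least one author affiliated in each country of the pair."""
    pair_counts = Counter()
    for countries in article_countries:
        # Count each unordered country pair at most once per article.
        for pair in combinations(sorted(countries), 2):
            pair_counts[pair] += 1
    return pair_counts

# Example: two US-China papers, one of which is also a US-UK and China-UK paper.
print(collaboration_pairs([
    {"United States", "China"},
    {"United States", "China", "United Kingdom"},
]))
```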

Intercluster connections

  • We measure how strongly each cluster is connected to each other cluster in the dataset using the underlying network between articles. The strength of the connection between two clusters is determined by the number of connections between their respective articles, which are in turn based on citations and the similarity of their titles and abstracts.

Map coordinates

  • Each cluster has coordinates representing its "location" in a two-dimensional space along with all the other clusters. We use intercluster connections to generate these coordinates; clusters with more citation connections are typically located closer together. Read more >>

Top citing and cited clusters

  • For each cluster, we use intercluster citation counts to identify the other clusters that most often cite and are cited by the articles in the cluster.

Sources and methodology

Data sources

The RC Dataset is built using articles and article-level metadata from the Merged Academic Corpus.

What is a research cluster?

A research cluster is a group of articles that are highly interconnected through direct citation or similarity in the text of their titles and abstracts.

By "direct citation," we mean one article explicitly citing another. When we say that a cluster of articles is "highly interconnected" through citations, we mean that the articles in that cluster directly cite each other a lot, compared to how often they cite articles outside the cluster, or how often articles outside the cluster cite them. The technical term for this property is modularity of citation.

Articles are also connected to each other via similarity in their titles and abstracts. Similarity measures are determined using pre-trained embeddings from a multilingual SentenceTransformer model. Articles that have similar text in their titles and abstracts may relate to common topics, methods, approaches, or research areas that would lead to them naturally grouping together, even if there is no direct citation between them.
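As an illustration of this kind of similarity measurement using the sentence-transformers library (the specific model name below is an assumption for the example, not necessarily the one used in the pipeline):

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual SentenceTransformer model (illustrative choice).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

texts = [
    "Deep learning for air quality prediction. We propose a neural model...",
    "Forecasting urban PM2.5 with machine learning. Abstract: ...",
]
embeddings = model.encode(texts, convert_to_tensor=True)

# Cosine similarity between title+abstract embeddings; article pairs above
# some threshold could be linked in the hybrid network.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(float(similarity))
```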

What do clusters mean?

Research clusters are created from citation links and text similarity: the articles in a cluster don't necessarily have to be related in any other way. But in practice, research articles often cite each other because they have other things in common. For example, researchers who study tree frogs are likely to cite prior work on tree frogs, and researchers may be more likely to cite other researchers in their professional network.

This means that research clusters based on citations and text similarity of the articles often represent topics, shared language, collegial connections, or other real-world concepts. These emergent concepts can be very precise and nuanced, helping users explore trends in the research literature in great detail and without having to specify the concepts in advance. For example, our clustering process can automatically identify and isolate research related to concepts like the application of patient-derived organoids for drug screening, machine learning models for predicting air quality, or methods for enhancing power grid resilience during natural disasters.

How clusters are generated

We first generate a large network connecting all scientific articles in our Merged Academic Corpus, using the similarity between article embeddings produced by a pre-trained, multilingual SentenceTransformer model. We combine this with a network based on between-article citations to generate a hybrid network that includes both the relevant citation information and the text similarities between articles.

We then organize articles into clusters using the Leiden algorithm. Essentially, this algorithm starts from randomized cluster assignments and iteratively refines them, searching for the clustering whose clusters have the highest modularity.
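A toy illustration of Leiden clustering using python-igraph (the graph and edge weights here are invented for the example):

```python
import igraph as ig

# Toy hybrid network: vertices are articles, edges combine citation and
# text-similarity links, with weights reflecting connection strength.
g = ig.Graph(edges=[(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
g.es["weight"] = [1.0, 1.0, 0.8, 1.0, 0.9, 1.0, 0.1]

# Run Leiden with the modularity objective until convergence.
clusters = g.community_leiden(objective_function="modularity",
                              weights="weight", n_iterations=-1)
print(clusters.membership)  # e.g. [0, 0, 0, 1, 1, 1]
```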

However, because the clustering algorithm involves a degree of randomness, relying on the algorithm alone to generate the clusters can produce unstable results - that is, different runs of the algorithm will often assign the same article to different clusters. We address this by screening out articles and citations that are less likely to convey meaningful information, which tend to "throw off" the clustering algorithm. Specifically, we screen out articles unless:

  • they cite at least 10 other articles and were published more than 4 years ago, OR
  • they cite at least one other article AND are cited by at least one article, AND were published more than 1,000 days ago.
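Expressed as a predicate (with hypothetical field names and a simple day-count approximation of article age), these screening rules look like:

```python
from datetime import date

def keep_article(n_refs: int, n_cited_by: int, published: date,
                 today: date = date(2025, 1, 1)) -> bool:
    """Return True if the article passes screening (reference date is
    an illustrative assumption)."""
    age_days = (today - published).days
    # Rule 1: cites at least 10 articles AND published more than 4 years ago.
    if n_refs >= 10 and age_days > 4 * 365:
        return True
    # Rule 2: cites >= 1 article AND cited by >= 1 article AND published
    # more than 1,000 days ago.
    return n_refs >= 1 and n_cited_by >= 1 and age_days > 1000
```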

We also disregard "unstable" article links when clustering articles - that is, network connections that cause the clustering structure to vary significantly over different runs of the clustering algorithm.[1]

After screening articles and connections, we run the Leiden algorithm again using only the stable articles and network links that remain. After that, we delete clusters with fewer than 50 articles, and assign each of the articles in those clusters to the remaining cluster that is most strongly connected to that article. (Articles published between clustering updates are assigned to clusters using a similar method.)

Finally, each article that was screened out is added to the cluster to which it has the strongest connection.

This process of data screening, algorithmic sorting, and selective reassignment produces clusters that are (1) highly modular; (2) large enough to be analytically meaningful - generally in the hundreds of articles; and (3) stable across different runs.[2]

Cluster metadata: creating information about clusters

For each cluster, the RC Dataset includes a variety of metadata generated using the MAC metadata of the articles in the cluster.

Some of the cluster-level metadata is created through relatively simple operations over the article-level metadata - for example, counting the number of articles in a cluster with authors from a certain organization, or calculating the average age of articles within a cluster using each article's publication date. In straightforward cases like these, the process used to create the metadata is described above. In other cases, more complex operations are needed.

Map coordinates

Each cluster is assigned coordinates in a 2D space containing every other cluster in the dataset, with clusters that cite each other more often positioned closer together. These coordinates are approximations. Technically, the "location" of a cluster has thousands of dimensions - one for each other cluster in the map - since that many dimensions can be mathematically necessary to accurately represent a cluster’s "distance" (or degree of citation connection) from all the others. That means the distances between clusters can’t be perfectly represented in two-dimensional space; 2D visualizations like the Map of Science can only approximate these distances.

To generate 2D coordinates for the clusters, we create a graph from the intercluster connections for every cluster in the dataset, then process the graph with the DRL algorithm (implemented in igraph), generating several candidate 2D layouts. Across these layouts, clusters that share more citations are generally closer together, but the layouts differ due to random variation in the algorithm. We identify the candidate layout with the least overall difference from all the others. Finally, we extract coordinates for each individual cluster from the layout.
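A sketch of the layout step with python-igraph's DRL implementation, using a built-in toy graph in place of the real intercluster network:

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")  # stand-in for the intercluster graph

# DRL is stochastic, so generate several candidate 2D layouts.
layouts = [g.layout_drl() for _ in range(5)]

# In the real pipeline, the candidate layout with the least overall
# difference from all the others is chosen; here we simply take the
# first candidate and read off its 2D coordinates.
coords = layouts[0].coords
print(coords[:3])  # [[x0, y0], [x1, y1], [x2, y2]]
```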

Assigning research subjects

To identify a cluster's research disciplines and fields, we take each discipline and field assigned to at least one article in the cluster, then calculate a cluster-level relevance score for each: the number of articles in the cluster that include that discipline or field among their top three article-level subjects (as ranked by the article-level relevance scores). We then select the three top-scoring disciplines and the three top-scoring fields as the cluster's overall disciplines and fields. In each of these steps, we use articles from the past 10 years only.

Subfields and topics are defined hierarchically, meaning that each is nested under a particular field. We therefore consider each subfield and topic assigned to at least one article in the cluster, provided its parent field is among the cluster's top-scoring fields. We again calculate a cluster-level relevance score, defined as the number of articles in the cluster with a significant relevance score for that subfield or topic, and select the three top-scoring subfields and topics. However, because only a small number of articles in a cluster may be connected to a particular subfield or topic, we use subfields and topics only as a means of searching (i.e., to see how these subjects are distributed across clusters), not as descriptions of individual clusters.
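A sketch of the top-three counting used for disciplines and fields, assuming each article carries a hypothetical `top_subjects` list (its top three article-level subjects, ranked by relevance score):

```python
from collections import Counter

def top_cluster_subjects(articles: list[dict], k: int = 3) -> list[str]:
    """Count how many articles have each subject among their top three,
    then return the k highest-scoring subjects for the cluster."""
    scores = Counter(s for a in articles for s in a["top_subjects"][:3])
    return [subject for subject, _ in scores.most_common(k)]
```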

Key concepts

To identify key concepts in clusters, we use the YAKE algorithm to extract the top twenty keyphrases for each cluster from its articles' titles and abstracts, favoring keyphrases that are relatively distinct from each other so that they identify discrete concepts. We then remove any phrases that are common English words (identified using the wordfreq Python library) or that are single words contained in other phrases surfaced for the same cluster, making the remaining phrases more useful and varied. The five top-scoring keyphrases that remain are assigned to the cluster.
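A sketch of this process using the yake and wordfreq libraries (the word-frequency threshold below is an illustrative assumption):

```python
import yake
from wordfreq import zipf_frequency

# Concatenated titles and abstracts for one cluster (toy example).
text = ("Patient-derived organoids for drug screening. "
        "Organoid models enable high-throughput drug screening...")

# Extract the top 20 keyphrases; yake scores are lower-is-better,
# and its output is already sorted by score.
extractor = yake.KeywordExtractor(lan="en", n=3, top=20)
candidates = extractor.extract_keywords(text)  # list of (phrase, score)
phrases = [p for p, _ in candidates]

filtered = [
    p for p in phrases
    # Drop common single English words (Zipf threshold is illustrative)...
    if not (len(p.split()) == 1 and zipf_frequency(p, "en") > 4.0)
    # ...and single words contained in another surfaced phrase.
    and not (len(p.split()) == 1 and any(p in q and p != q for q in phrases))
]
key_concepts = filtered[:5]
```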

Extreme growth forecasting

Our model for predicting extreme cluster growth is described in detail in Rahkovsky et al. (2021). In brief, the model makes binary (yes/no) predictions based on the overall recency of articles in the cluster, the recency of citations to those articles, the prevalence of articles in top journals and conferences, and the past growth of the cluster compared to the entire global literature. We define "extreme growth" as 8% or more annual growth over the next 3-year period. Predictions are only available for clusters with at least 20 articles from the past 12 months. For more details on the model and how it was evaluated, see Rahkovsky et al. (2021), section 2.3.

Known limitations

  • Some clusters don’t have a single, clear topic. The clusters in the RC Dataset are generated algorithmically, using the citation links between articles and text similarity of their titles and abstracts. Often this results in clusters that have other things in common, like language, topic, or method, but in some cases the clustering method may group articles in ways that don’t have an obvious theme.
  • The dataset isn’t suitable for all types of analysis. The clusters in the RC Dataset were last defined in 2025. We continuously add new articles to the clusters based on their connections, but we don’t continuously add clusters, delete clusters, or change the cluster assignment of already assigned articles.
  • Some articles may switch clusters when the dataset is updated. Articles published after the data was initially clustered are put in the clusters they cite the most. As articles gain more citations, their clusters may change, which could lead to changes in cluster metadata.
  • Cluster disciplines, fields, subfields, topics, and concepts may be imprecise. We rely on algorithms and numerical thresholds to identify relevant subjects (including disciplines, fields, and emerging technology subjects) and key concepts for each cluster. In practice, we find that their results generally give a good sense of the subjects and concepts relevant to different clusters. However, they may not capture every relevant subject or concept, and some of the subjects and concepts they do identify may be questionable. Use caution, especially when analyzing clusters individually rather than in the aggregate.

Maintenance

How are the data updated?

There are two main types of updates to the RC Dataset.

Clustering updates recalculate the clustering structure of the entire dataset "from scratch" using the most recent MAC data and the method described above. The last clustering update was in fall 2025.

Article and metadata updates assign articles that were added to the MAC after the last clustering update to existing clusters, and then recalculate cluster-level metadata with the newly assigned articles taken into account. Each new article is assigned to the cluster it cites the most. Article and metadata updates use an automated process that we run roughly monthly.
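A sketch of this assignment rule, with hypothetical inputs:

```python
from collections import Counter

def assign_cluster(cited_article_ids: list[str],
                   article_to_cluster: dict[str, int]) -> int | None:
    """Assign a new article to the cluster it cites the most.

    cited_article_ids: IDs of the articles the new paper cites;
    article_to_cluster: mapping from article ID to cluster ID."""
    votes = Counter(article_to_cluster[a]
                    for a in cited_article_ids if a in article_to_cluster)
    return votes.most_common(1)[0][0] if votes else None
```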

Credits

Virtually all CSET data team members have contributed to the RC Dataset in some form, whether by providing feedback, reviewing code, or helping generate ideas. In alphabetical order, some particular contributions follow:

  • Rebecca Gelles: Phrase extraction and patent metrics development
  • Jennifer Melot: Clustering pipeline implementation and automation support; metadata aggregation and development support
  • Katherine Quinn: Clustering methodology and experimentation, metadata aggregation, ongoing metric development
  • Ilya Rahkovsky: Clustering methodology and experimentation, metadata aggregation, and primary metrics development
  • Autumn Toney: Metrics development and review

Dewey Murdick and SciTech Strategies helped develop the clustering concept and methodology. Katherine Quinn implemented structural changes to the clustering methodology to incorporate both citation and text similarity. Zach Arnold wrote the original documentation; this updated version was written by Katherine Quinn, Adrian Thinnyun, and Catherine Aiken.

Emerging technology topic classifications are based on work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.

Major change log

10/13/22: Initial release (ETO/CSET internal)
11/22/23: 2023 update following changes to the MAC (ETO/CSET internal)
12/15/25: Clustering method updated to include both citation and text similarity

Footnotes

  1. Our clustering algorithm involves a random seed, so the clusters it generates may vary somewhat between runs. To improve stability, we ran the algorithm five times with the same input parameters and different random seeds, resulting in five different cluster structures. For each citation link in the MAC, we then identified how often the cited and citing papers appeared in the same cluster across the five runs. We defined a citation link as "unstable" if the cited and citing paper were in the same cluster in fewer than three of the five runs.

  2. We can measure consistency by measuring how often articles switch clusters across successive runs of the clustering algorithm. In our current process, fewer than 1% of articles switch clusters from run to run (on average).
