Documentation: Research Cluster Dataset

Overview

The Research Cluster Dataset is not publicly available. This documentation is provided as context for the ETO products that use the dataset.

What is this dataset?

The Research Cluster Dataset ("RC Dataset") organizes articles from the Merged Academic Corpus into over 85,000 clusters, which are groups of articles that cite each other unusually often. The dataset records the articles associated with each cluster along with unique metadata about each cluster, such as common terms and topics, growth rate, and degree of connectedness (through citations) to other clusters. You can use ETO’s Map of Science to explore these clusters and related trends in global science and technology.

Which ETO products use it?

  • ETO's Map of Science is built on the RC Dataset; you can use it to explore the clusters and related trends.

What are its sources?

The RC Dataset is generated from the Merged Academic Corpus, which includes data on over 260 million scholarly articles. Read more >>

What are its main limitations?

  • Some clusters don’t have a clear theme or topic. The clusters in the RC Dataset are generated algorithmically, using the citation links between articles and no other information. In some cases, the resulting clusters might not intuitively "mean" anything when characterized by research topic. Read more >>
  • The clusters are derived from other data and reflect the limitations of that data. Known limitations in the Merged Academic Corpus and its underlying data sources will exist in the clusters themselves.
  • The dataset isn’t suitable for all types of trend analysis. We regularly add new articles to the clusters, but we don’t currently add clusters, delete clusters, or (with limited exceptions) change the cluster assignment of already assigned articles. Read more >>
  • Cluster subjects and concepts may be imprecise. We rely on algorithms and numerical thresholds to identify relevant subjects and concepts for each cluster. These methods are inherently imprecise. Read more >>

What are the terms of use?

Because this dataset incorporates licensed data from commercial providers, it is not publicly available. However, you can view and filter most of the data in it indirectly using the Map of Science.

How do I cite it?

Because the dataset is not publicly available, you should cite the Map of Science or this documentation page instead.

Structure and content

The basic unit of the RC Dataset is the cluster. Each cluster has the following metadata fields, each generated from the MAC article-level data for each article in the cluster. You can view this metadata for individual clusters using the Map of Science interface. (For ease of navigation, the fields listed below roughly follow the order in which the relevant metadata appear in the interface’s cluster view; for example, cluster size is listed first because that is the first item listed in cluster view.)

Cluster ID

Each cluster has a unique numerical ID. ID numbers are arbitrarily selected and are purely for identification.

Academic subjects

  • Each cluster is associated with three general academic subjects and three specific academic subjects, determined using the article-level subjects in the cluster and their relevance scores.
    • To learn how subjects and scores are assigned at the article level, refer to the MAC documentation.
    • To learn how article-level subjects and scores are used to determine cluster subjects, see below.

Size

  • Cluster size is defined as the number of articles in the cluster published in the past 5 years.

Average article age

  • The average age of articles in the cluster (age = current year - year of article).

Citation rating

  • A cluster's citation rating is equal to the average citation percentile of articles in the cluster published in the past 5 years.

Growth rating

  • A cluster’s growth rating is a percentile-based measure reflecting how fast it grew in the past three years relative to other clusters in the dataset. For example, a cluster with a 90th-percentile growth rating grew faster than 90% of other clusters over the past three years.
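The percentile definition above can be sketched as follows; the input is a hypothetical mapping from cluster ID to its three-year growth figure, not the actual RC Dataset schema.

```python
def growth_ratings(growth_by_cluster: dict[int, float]) -> dict[int, float]:
    """For each cluster, the share (as a percentile) of *other* clusters
    that grew more slowly over the same period."""
    ratings = {}
    for cid, growth in growth_by_cluster.items():
        others = [g for c, g in growth_by_cluster.items() if c != cid]
        slower = sum(1 for g in others if g < growth)
        ratings[cid] = 100.0 * slower / len(others)
    return ratings
```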

Extreme growth forecast

  • Each cluster has a binary value (true/false) indicating whether the cluster is expected to grow unusually rapidly in the next three years.
  • About 5% of clusters in the dataset are currently forecasted to have extreme growth.

Key concepts

  • Each cluster has a set of key concepts: distinctive keyphrases drawn from the titles and abstracts of its articles. See below for how key concepts are identified.

Emerging technology subjects

  • In addition to the academic subjects assigned to each cluster, each cluster includes the share of its articles that are relevant to artificial intelligence and certain related subjects.
  • Currently, each MAC article with an English title or abstract is classified as relevant to artificial intelligence and four different subfields: computer vision, natural language processing, robotics, and AI safety. Read more >>
  • Using these article-level classifications, we calculate the percentage of articles in each cluster that are relevant to AI and each of the four AI subfields. We use only articles published in the past 5 years in these calculations.
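As a rough sketch of this share calculation (field names like `year` and `ai` are hypothetical; the real pipeline works from MAC article records):

```python
def subject_shares(articles, fields=("ai", "cv", "nlp", "robotics", "ai_safety"),
                   current_year=2024, window=5):
    """Percentage of recent articles flagged as relevant to each field,
    using only articles published in the past `window` years."""
    recent = [a for a in articles if a["year"] > current_year - window]
    if not recent:
        return {f: 0.0 for f in fields}
    return {f: 100.0 * sum(bool(a[f]) for a in recent) / len(recent)
            for f in fields}
```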

Country affiliation

  • Using the author countries associated with each article in the cluster from the past 5 years, we calculate:
    • US affiliation: The percentage of articles in the cluster with at least one author from a US organization.
    • Chinese affiliation: The percentage of articles in the cluster with at least one author from a Chinese organization.
    • Leading countries: The three countries linked to the most articles in the cluster.
  • Here (and generally in ETO resources) we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities. Read more >>
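The three statistics above can be sketched together; the input shape (one set of author countries per recent article) is an assumption for illustration.

```python
from collections import Counter

def country_stats(article_countries):
    """article_countries: one set of author countries per article in the
    cluster (from the past 5 years). Returns US share, Chinese share, and
    the three leading countries by article count."""
    n = len(article_countries)
    us_pct = 100.0 * sum("United States" in c for c in article_countries) / n
    cn_pct = 100.0 * sum("China" in c for c in article_countries) / n
    counts = Counter(country for c in article_countries for country in c)
    leading = [country for country, _ in counts.most_common(3)]
    return us_pct, cn_pct, leading
```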

Patent citation statistics

  • For each cluster, we calculate:
    • The number of patents (from the last 10 years only) citing one or more articles in the cluster.
    • How often patents cite articles in the cluster, compared to other clusters in the Map: for example, in a 90th percentile cluster, the number of articles that are cited by patents is higher than in 90% of other clusters.
    • The percentage of articles in the cluster that are cited by at least one patent.

Industry affiliation percentage

  • For each cluster, we calculate the percentage of articles with at least one author from an industry organization. We use only articles from the past 5 years in these calculations.

Key articles

  • Each cluster has a list of top articles from the past 5 years, including each article’s title, source, year, number of citations, and reason for inclusion.
  • Articles are included in the list if they qualify as core articles, review articles, or top-cited articles.
    • Core articles: Articles that are especially highly connected to other articles in the cluster.
      • To identify these articles, we calculate a "core statistic" for each article in the cluster, incorporating the article’s age, total number of citations, and how often it cites articles within the cluster.
    • Review articles: Articles that describe and systematize other research. We assume any article with between 100 and 1000 citations to other articles, implying engagement with an unusually wide range of related scholarship, is a review article.
    • Top-cited articles: The articles in the cluster with the most citations.
  • We also calculate the field percentile for each article, defined as its citation percentile relative to all other articles in its general subject area and publication year.
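The review and top-cited rules can be sketched directly from the definitions above (the core statistic is omitted because its exact formula isn't given here; `refs` and `cites` are hypothetical field names):

```python
def key_articles(articles, top_n=3):
    """Tag review articles (100-1,000 outbound citations) and the most-cited
    articles in a cluster."""
    reviews = [a for a in articles if 100 <= a["refs"] <= 1000]
    top_cited = sorted(articles, key=lambda a: a["cites"], reverse=True)[:top_n]
    return reviews, top_cited
```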

Top sources

  • Each cluster has a set of top sources (such as journals or preprint servers), including each source’s name and the number of associated articles in the cluster.
  • Top sources are defined as the sources of the most articles in the cluster published in the past 5 years.

Top authors

  • Each cluster has a set of top authors, including each author’s name, affiliated organization(s), number of associated articles in the cluster, and average yearly citations per article in the cluster.
  • Top authors are defined as the authors with the most articles in the cluster published in the past 5 years.

Top author organizations

  • Each cluster has a set of top author organizations, including each organization’s name, country, number of associated articles in the cluster, average yearly citations for each associated article, and organization type (commercial, education, nonprofit, government).
  • Top author organizations are defined as the organizations associated (through their authors) with the most articles in the cluster published in the past 5 years.

Top funding organizations

  • Each cluster has a set of top funding organizations, including each organization’s name, country, number of associated articles in the cluster, average yearly citations for each associated article, and organization type (commercial, education, nonprofit, government).
  • Top funding organizations are defined as the organizations associated with the most articles in the cluster published in the past 5 years.

Collaboration statistics

  • Each cluster has a set of collaboration statistics, structured as a list of country pairs with corresponding article counts. Each count represents the number of articles in the cluster published over the last 10 years that had at least one author from both countries in the pair.
    • An author’s country is inferred from the location of their affiliated organization(s).
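The pair counts can be sketched with a few lines of Python; as above, the input shape (one set of author countries per article) is an assumption.

```python
from collections import Counter
from itertools import combinations

def collaboration_counts(article_countries):
    """Count, for each unordered country pair, the articles with at least
    one author from both countries. article_countries: one set of author
    countries per article in the cluster."""
    counts = Counter()
    for countries in article_countries:
        # Sorting makes each pair canonical, so (A, B) and (B, A) merge.
        counts.update(combinations(sorted(countries), 2))
    return counts
```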

Intercluster citation counts

  • For each pair of clusters, we count the citations between articles in one cluster and articles in the other. These intercluster citation counts are used to generate map coordinates and to identify top citing and cited clusters.

Map coordinates

  • Each cluster has coordinates representing its "location" in a two-dimensional space along with all the other clusters. We use intercluster citation counts to generate these coordinates; clusters with more citation connections are typically located closer together. Read more >>

Top citing and cited clusters

  • For each cluster, we use intercluster citation counts to identify the other clusters that most often cite and are cited by the articles in the cluster.

Relevant GitHub repositories

  • For each cluster, we display the GitHub repositories that most frequently appear in the titles, abstracts, or full text (where available from Semantic Scholar or arXiv as of October 2021) of the papers in the cluster; that are linked to a paper via Papers with Code; or that we were able to link to a paper using a DOI present in the repository’s README.

Sources and methodology

Data sources

Everything in the RC Dataset is based on the Merged Academic Corpus, including the clusters in the dataset and the metadata for each cluster.

What is a research cluster?

A research cluster is a group of articles that are highly interconnected through direct citation.

By "direct citation," we mean one article explicitly citing another. When we say that a group of articles is "highly interconnected" through citations, we mean that the articles in that group directly cite each other a lot, compared to how often they cite articles outside the group or articles outside the group cite them. The technical term for this property is modularity of citation.

In less technical terms, you can compare relationships between articles in a cluster to relationships between people in a family. Everybody is connected to people inside and outside their family, but in many contexts, you probably have stronger and more frequent connections to people within your family than people outside it. (For example, you may spend more time with family members than others, especially if you live with them.) Likewise, an article in a particular cluster may cite, and be cited by, other articles inside and outside that cluster - but it will generally have more citation links to articles inside the cluster than articles in other clusters.

What do clusters mean?

Research clusters are created from citation links only: the articles in a cluster don't necessarily have to be related in any other way. But in practice, research articles often cite each other because they have other things in common. For example, researchers who study tree frogs are likely to cite prior work on tree frogs, and Moroccan researchers may be more likely to cite other Moroccan researchers.

This means that citation-based research clusters often represent topics, shared language, collegial connections, or other real-world concepts indirectly - that is, even though those concepts aren’t used to create the clusters in the first place. These emergent concepts can be very precise and nuanced, helping users explore trends in the research literature in great detail and without having to specify the concepts in advance. For example, our clustering process can automatically identify and isolate research related to concepts like aqueous zinc-ion batteries, reconfigurable intelligent reflecting surfaces for wireless communication, or using neural networks to predict water quality.

In other cases, the citation-based clustering method can give results that are messy or unintuitive. Articles may be clustered together just because they happened to cite each other (and not cite other articles) in such a way that our algorithm recognized them as distinct groups; thematically, they may not seem to have much in common. However, we usually find that studying a cluster using the cluster detail view in the Map of Science interface will reveal real-world reasons why the articles in it tend to cite each other.

How clusters are generated

We organize articles into clusters using the Leiden algorithm. Essentially, this algorithm generates many random clusterings of articles, then finds the one whose clusters have the highest modularity of citation (as described above).

However, because the clustering algorithm involves a degree of randomness, relying on the algorithm alone to generate the clusters can produce unstable results - that is, different runs of the algorithm will often assign the same article to different clusters. We address this by screening out articles and citations that are less likely to convey meaningful information, which tend to "throw off" the clustering algorithm. Specifically, we screen out articles unless:

  • they cite at least 10 other articles and were published more than 4 years ago, OR
  • they cite at least one other article AND are cited by at least one article, AND were published more than 1,000 days ago.
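The two screening rules can be expressed as a single predicate; this is a minimal sketch, and approximating "more than 4 years ago" as 4 × 365 days is an assumption.

```python
def passes_screen(cites_out: int, cited_by: int, days_old: int) -> bool:
    """Keep an article for clustering if it satisfies either screening rule:
    (a) cites >= 10 articles and is more than ~4 years old, or
    (b) cites and is cited at least once and is more than 1,000 days old."""
    rule_a = cites_out >= 10 and days_old > 4 * 365
    rule_b = cites_out >= 1 and cited_by >= 1 and days_old > 1000
    return rule_a or rule_b
```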

We also disregard "unstable" citation links when clustering articles - that is, citations that cause the clustering structure to vary significantly over different runs of the clustering algorithm.1

After screening articles and citations, we run the Leiden algorithm again using only the stable articles and citations that remain. After that, we delete clusters with fewer than 50 articles, and assign each of the articles in those clusters to the remaining cluster that the article most often cites. (Articles published between clustering updates are assigned to clusters using a similar method.)

Finally, each article that was screened out is added to the cluster it most often cites.
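The "assign to the cluster it most often cites" step, used both for screened-out articles and for articles from deleted small clusters, reduces to a vote count. A minimal sketch:

```python
from collections import Counter

def reassign(cited_ids, cluster_of):
    """Assign an article to the surviving cluster its references most often
    belong to. cited_ids: IDs of the articles this article cites;
    cluster_of: article ID -> cluster ID for already-clustered articles."""
    votes = Counter(cluster_of[i] for i in cited_ids if i in cluster_of)
    return votes.most_common(1)[0][0] if votes else None
```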

This process of data screening, algorithmic sorting, and selective reassignment produces clusters that are (1) highly modular; (2) large enough to be analytically meaningful - generally in the hundreds of articles; and (3) stable across different runs.2

Cluster metadata: creating information about clusters

For each cluster, the RC Dataset includes a variety of metadata generated using the MAC metadata of the articles in the cluster.

Some of the cluster-level metadata is created through relatively simple operations over the article-level metadata - for example, counting the number of articles in a cluster with authors from a certain organization, or calculating the average age of articles within a cluster using each article's publication date. In straightforward cases like these, the process used to create the metadata is described above. In other cases, more complex operations are needed.

Map coordinates

Each cluster is assigned coordinates in a 2D space containing every other cluster in the dataset, with clusters that cite each other more often positioned closer together. These coordinates are approximations. Technically, the "location" of a cluster has thousands of dimensions - one for each other cluster in the map - because that many dimensions are mathematically needed to exactly represent each cluster’s "distance" (or degree of citation connection) from all the others. That means the distances between clusters can’t be perfectly represented in two-dimensional space; 2D visualizations like the Map of Science can only approximate these distances.

To generate 2D coordinates for the clusters, we create a graph from the intercluster citation counts for every cluster in the dataset, then process the graph with the DRL algorithm (implemented in igraph), generating several candidate 2D layouts. Across these layouts, clusters that share more citations are generally closer together, but the layouts differ due to random variation in the algorithm. We identify the candidate layout with the least overall difference from all the others. Finally, we extract coordinates for each individual cluster from the layout.
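One way to sketch the "least overall difference" selection: compare candidate layouts through their internal pairwise-distance matrices (which are invariant to rotation and translation) and keep the layout closest to all the others. The comparison metric used here is an assumption; the text doesn't specify one, and DRL itself isn't reproduced.

```python
import numpy as np

def pick_layout(layouts):
    """layouts: list of (n, 2) coordinate arrays from separate layout runs.
    Returns the index of the layout with the least total difference from
    the others, measured on pairwise-distance matrices."""
    dmats = []
    for xy in layouts:
        diff = xy[:, None, :] - xy[None, :, :]
        dmats.append(np.sqrt((diff ** 2).sum(axis=-1)))
    totals = [sum(np.abs(d - other).sum() for other in dmats)
              for d in dmats]
    return int(np.argmin(totals))
```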

Assigning academic subjects

To identify a cluster's academic subjects, we take each subject assigned to at least one article in the cluster, then calculate a cluster-level relevance score, defined as (the sum of its relevance scores for each article in the cluster) divided by (the number of articles in the cluster). We then select the three top-scoring general subjects and three top-scoring specific subjects as the overall subjects for that cluster. In each of these steps, we use articles from the past 10 years only.
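The cluster-level score can be sketched directly from that definition; the input shape (one subject-to-score mapping per article, where an article simply omits subjects it wasn't assigned) is an assumption.

```python
from collections import defaultdict

def cluster_subject_scores(article_subjects):
    """article_subjects: one {subject: relevance score} mapping per article
    in the cluster. Returns each subject's cluster-level score: the sum of
    its article scores divided by the number of articles in the cluster."""
    totals = defaultdict(float)
    for scores in article_subjects:
        for subject, score in scores.items():
            totals[subject] += score
    n = len(article_subjects)
    return {subject: total / n for subject, total in totals.items()}
```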

Key concepts

To identify key concepts in clusters, we used the yake algorithm to extract the top twenty keyphrases for each cluster from its articles' titles and abstracts, favoring keyphrases that were relatively distinct from each other so that we would be able to identify discrete concepts. We then removed any phrases that were common English words (identified using the wordfreq Python library), as well as single words that were part of other phrases surfaced for the same cluster, to keep the remaining phrases useful and varied. The five top-scoring keyphrases among those remaining were then assigned to the cluster.
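The filtering step can be sketched as below. yake itself isn't reproduced here; scores follow yake's convention (lower is better), and the common-word check uses a stand-in stopword set rather than the wordfreq library.

```python
def filter_keyphrases(scored_phrases, common_words, top_n=5):
    """scored_phrases: (phrase, score) pairs, lower score = better.
    Drops phrases that are common English words and single words already
    contained in another surfaced phrase, then keeps the top_n
    best-scoring phrases."""
    phrases = [p for p, _ in scored_phrases]
    kept = []
    for phrase, score in scored_phrases:
        if phrase in common_words:
            continue
        if " " not in phrase and any(phrase in other.split()
                                     for other in phrases if other != phrase):
            continue
        kept.append((phrase, score))
    kept.sort(key=lambda pair: pair[1])
    return [p for p, _ in kept[:top_n]]
```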

Extreme growth forecasting

Our model for predicting extreme cluster growth is described in detail in Rahkovsky et al. (2021). In brief, the model makes binary (yes/no) predictions based on the overall recency of articles in the cluster, the recency of citations to those articles, the prevalence of articles in top journals and conferences, and the past growth of the cluster compared to the entire global literature. We define "extreme growth" as 8% or more annual growth over the next 3-year period. Predictions are only available for clusters with at least 20 articles from the past 12 months. For more details on the model and how it was evaluated, see Rahkovsky et al. (2021), section 2.3.
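The 8% threshold can be expressed as a simple check; treating the 8% annual growth as compounded over the three years is an assumption about how the threshold is applied.

```python
def is_extreme_growth(size_now: float, size_in_3_years: float) -> bool:
    """True if the cluster grew at >= 8% per year (compounded) over a
    3-year period."""
    return size_in_3_years >= size_now * 1.08 ** 3
```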

Known limitations

  • Some clusters don’t have a clear theme or topic. The clusters in the RC Dataset are generated algorithmically, using the citation links between articles and no other information. Usually, articles cite each other because they have other things in common, like language, topic, or method. In some cases, though, the citation-based clustering method may group articles that don’t have much in common thematically, and the resulting clusters might not intuitively "mean" anything.
  • The dataset isn’t generally useful for trend analysis. The clusters in the RC Dataset were last defined in 2023. We continuously add new articles to the clusters that best suit them, but we don’t add clusters, delete clusters, or change the cluster assignment of already assigned articles. In most cases, this makes it hard to use the RC Dataset to study changes over time in the overall research literature or in specific fields.
  • Some articles may switch clusters when the dataset is updated. Articles published after the data was initially clustered are put in the clusters they cite the most. As articles gain more citations, their clusters may change, which could lead to changes in cluster metadata.
  • Cluster subjects and concepts may be imprecise. We rely on algorithms and numerical thresholds to identify relevant subjects (including academic subjects and emerging technology subjects) and concepts for each cluster. These methods are inherently imprecise. In practice, we find that their results generally give a good sense of subjects and concepts relevant to different clusters. However, they may not capture every relevant subject or concept, and some of the subjects and concepts they do identify may be questionable. Use caution, especially when analyzing clusters individually rather than in the aggregate.

Maintenance

How are the data updated?

There are two main types of updates to the RC Dataset.

Clustering updates recalculate the clustering structure of the entire dataset "from scratch" using the most recent MAC data and the method described above. The last clustering update was in fall 2023.

Article and metadata updates assign articles that were added to the MAC after the last clustering update to existing clusters, and then recalculate cluster-level metadata with the newly assigned articles taken into account. Each new article is assigned to the cluster it cites the most. Article and metadata updates use an automated process that we run roughly monthly.

Credits

Virtually all CSET data team members have contributed to the RC Dataset in some form, whether by providing feedback, reviewing code, or helping generate ideas. In alphabetical order, some particular contributions follow:

  • Rebecca Gelles: Phrase extraction and patent metrics development
  • Jennifer Melot: Clustering pipeline implementation and automation support; metadata aggregation and development support
  • Katherine Quinn: Clustering methodology and experimentation
  • Ilya Rahkovsky: Clustering methodology and experimentation, metadata aggregation, and primary metrics development
  • Autumn Toney: Metrics development and review

Dewey Murdick and SciTech Strategies helped develop the clustering concept and methodology. Student research assistant Jerod Sun contributed data characterization. Zach Arnold helped create this documentation.

Major change log

10/13/22: Initial release (ETO/CSET internal)
11/22/23: 2023 update following changes to MAC (ETO/CSET internal)

Footnotes

  1. Our clustering algorithm involves a random seed, so the clusters it generates may vary somewhat between runs. To improve stability, we ran the algorithm five times with the same input parameters and different random seeds, resulting in five different cluster structures. For each citation link in the MAC, we then identified how often the cited and citing papers appeared in the same cluster across the five runs. We defined a citation link as "unstable" if the cited and citing paper were in the same cluster in fewer than three of the five runs.

  2. We measure consistency by tracking how often articles switch clusters across successive runs of the clustering algorithm. In our current process, fewer than 1% of articles switch clusters from run to run (on average).