Documentation: Research Cluster Dataset

Overview

The Research Cluster Dataset is not publicly available. This documentation is provided as context for the ETO products that use the dataset.

What is this dataset?

The Research Cluster Dataset (“RC Dataset”) organizes articles from the Merged Academic Corpus into over 100,000 clusters, which are groups of articles that cite each other unusually often. The dataset records the articles associated with each cluster along with unique metadata about each cluster, such as common terms and topics, growth rate, and degree of connectedness (through citations) to other clusters. You can use ETO’s Map of Science to explore these clusters and related trends in global science and technology.

Which ETO products use it?

What are its sources?

The RC Dataset is generated from the Merged Academic Corpus, which includes data on over 270 million scholarly articles.

What are its main limitations?

  • Some clusters don’t have a clear theme or topic. The clusters in the RC Dataset are generated algorithmically, using the citation links between articles and no other information. In some cases, the resulting clusters might not intuitively “mean” anything.
  • The dataset isn’t suitable for all types of trend analysis. We continuously add new articles to the clusters, but we don’t currently add clusters, delete clusters, or (with limited exceptions) change the cluster assignment of already assigned articles. In many cases, this makes it hard to use the RC Dataset to study changes over time in the overall research literature or in specific fields.

What are the terms of use?

Because this dataset incorporates licensed data from commercial providers, it is not publicly available. However, you can view and filter most of the data in it indirectly using the Map of Science.

How do I cite it?

Because the dataset is not publicly available, you should cite the Map of Science or this documentation page instead.

Structure and content

The basic unit of the RC Dataset is the cluster. Each cluster has the following metadata fields, each generated from the MAC article-level data for each article in the cluster. You can view this metadata for individual clusters using the Map of Science interface.

Cluster ID

Each cluster has a unique numerical ID. ID numbers are arbitrarily selected and are purely for identification.

Subjects

  • Each cluster is associated with a list of subjects, determined using the article-level subjects in the cluster and their relevance scores.
    • To learn how subjects and scores are assigned at the article level, refer to the MAC documentation.
  • The list includes 3 general subjects and 6 specific subjects, along with corresponding cluster-level relevance scores.

Size

  • Cluster size is defined as the number of articles in the cluster published in the past 5 years.

Average article age

  • The average age of articles in the cluster (age = current year - year of article).

Average citation percentile

Growth percentile

  • A cluster’s growth percentile reflects how fast it grew relative to other clusters in the dataset. For example, a 90th percentile cluster grew faster than 90% of other clusters.
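The percentile idea above can be sketched in a few lines of Python. This is only an illustration: the source does not say how growth is measured or how ties are handled, so the function name, the input growth values, and the strict "grew faster than" comparison are all assumptions.

```python
def growth_percentiles(growth_by_cluster):
    """Map each cluster ID to the percentage of *other* clusters it outgrew.

    `growth_by_cluster` is a hypothetical dict of cluster ID -> growth value.
    Ties are not counted as "grew faster" (an assumption).
    """
    percentiles = {}
    for cluster_id, growth in growth_by_cluster.items():
        others = [g for cid, g in growth_by_cluster.items() if cid != cluster_id]
        slower = sum(1 for g in others if g < growth)
        percentiles[cluster_id] = 100.0 * slower / len(others) if others else 0.0
    return percentiles


# Made-up growth values for five made-up cluster IDs.
example_growth = {101: 0.02, 102: 0.10, 103: 0.05, 104: -0.01, 105: 0.30}
example_percentiles = growth_percentiles(example_growth)
# Cluster 103 grew faster than 2 of the 4 other clusters -> 50.0
```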

Extreme growth forecast

  • Each cluster has a binary value (true/false) indicating whether the cluster is expected to grow unusually rapidly in the next three years.
  • 4.51% of clusters in the dataset are currently forecasted to have extreme growth.

Key concepts

AI relevance and AI subfield relevance

  • Each MAC article is classified as relevant to artificial intelligence or not, and each English-language MAC article is also classified as relevant or not to each of three different AI subfields: computer vision, natural language processing, and robotics.
  • Using these article-level classifications, we calculate the percentage of articles in each cluster relevant to AI, and the percentage of articles in each cluster relevant to the three AI subfields. We use only articles from the past 5 years in these calculations.

Country affiliation

  • Using the author countries associated with each article in the cluster from the past 5 years, we calculate:
    • US affiliation: The percentage of articles in the cluster with at least one author from a US institution.
    • Chinese affiliation: The percentage of articles in the cluster with at least one author from a Chinese institution.
    • Leading country: The country linked to the most articles in the cluster.
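As a rough sketch of the three country calculations above, assuming each article is a dict with a hypothetical "year" field and a "countries" list of author country codes. The record layout, the country codes, and the reading of "past 5 years" as the current year plus the four before it are assumptions, not details from the dataset.

```python
from collections import Counter


def country_affiliation(articles, current_year, window=5):
    """Compute US share, Chinese share, and leading country for one cluster."""
    # Assumption: "past 5 years" means current_year - 4 through current_year.
    recent = [a for a in articles if a["year"] > current_year - window]
    if not recent:
        return {"us_pct": 0.0, "cn_pct": 0.0, "leading_country": None}
    counts = Counter()
    for article in recent:
        # Count each country at most once per article.
        counts.update(set(article["countries"]))
    n = len(recent)
    return {
        "us_pct": 100.0 * counts["US"] / n,
        "cn_pct": 100.0 * counts["CN"] / n,
        "leading_country": counts.most_common(1)[0][0] if counts else None,
    }


# Made-up articles; the 2010 article falls outside the 5-year window.
example_articles = [
    {"year": 2023, "countries": ["US", "CN"]},
    {"year": 2022, "countries": ["US"]},
    {"year": 2021, "countries": ["DE"]},
    {"year": 2020, "countries": ["US", "DE"]},
    {"year": 2010, "countries": ["CN"]},
]
example_affiliation = country_affiliation(example_articles, current_year=2024)
```

The same pattern (filter to recent articles, then take the share with a given property) would apply to the AI relevance, language, and industry affiliation percentages.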

Languages

  • Using the MAC’s article-level language tags for each article in the cluster from the past 5 years, we calculate:
    • English-only articles: The percentage of articles in the cluster with an English-language title or abstract and no non-English title or abstract.
    • Articles in Chinese: The percentage of articles in the cluster with a Chinese-language title or abstract, or where the first 1000 characters of the full text of the article are in Chinese (for articles where the full text is available in the MAC).

Patent citation statistics

  • For each cluster, we calculate:
    • The number of patents (from the last 10 years only) citing one or more articles in the cluster.
    • How often patents cite articles in the cluster, compared to other clusters in the Map: for example, in a 90th percentile cluster, the number of articles that are cited by patents is higher than in 90% of other clusters.
    • The percentage of articles in the cluster that are cited by at least one patent.

Industry affiliation percentage

  • For each cluster, we calculate the percentage of articles with at least one author from an industry organization. We use only articles from the past 5 years in these calculations.

Top articles

  • Each cluster has a list of top articles from the past 5 years, including each article’s title, source, year, number of citations, and reason for inclusion.
  • Articles are included in the list if they qualify as core articles, review articles, or top-cited articles.
    • Core articles: Articles that are especially highly connected to other articles in the cluster.
      • To identify these articles, we calculate a “core statistic” for each article in the cluster, incorporating the article’s age, total number of citations, and how often it cites articles within the cluster.
    • Review articles: Articles that describe and systematize other research. We assume any article with between 100 and 1000 citations to other articles, implying engagement with a wide range of related scholarship, is a review article.
    • Top-cited articles: The articles in the cluster with the most citations.
  • We also calculate the field percentile for each article, defined as its citation percentile relative to all other articles in its general subject area and publication year.

Top sources

  • Each cluster has a set of top sources, including each source’s name, number of associated articles in the cluster, and impact factor.
    • Impact factor is equal to (the number of citations in the last full year to articles published in the two preceding years) divided by (the number of articles published in those two years) - for example, (citations in 2020 to articles from 2018-2019)/(number of articles from 2018-2019). This calculation uses all of the source’s articles in the MAC, not just the articles in the cluster.
  • Top sources are defined as the sources of the most articles in the cluster published in the past 5 years.
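The impact factor definition above can be worked through with a small Python sketch. The function name and the two input dictionaries are hypothetical, and the numbers are made up for illustration.

```python
def impact_factor(citations_received, articles_published, last_full_year):
    """Impact factor for one source.

    citations_received: publication year -> citations received during
        `last_full_year` by the source's articles from that publication year.
    articles_published: publication year -> number of articles published.
    Returns None when the source published nothing in the two prior years.
    """
    prior_years = (last_full_year - 2, last_full_year - 1)
    n_articles = sum(articles_published.get(y, 0) for y in prior_years)
    if n_articles == 0:
        return None
    n_citations = sum(citations_received.get(y, 0) for y in prior_years)
    return n_citations / n_articles


# Made-up source: 100 articles in 2018-2019, cited 250 times during 2020.
example_if = impact_factor(
    citations_received={2018: 150, 2019: 100},
    articles_published={2018: 40, 2019: 60},
    last_full_year=2020,
)
# 250 / 100 -> 2.5
```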

Top authors

  • Each cluster has a set of top authors, including each author’s name, affiliated institution(s), number of associated articles in the cluster, and average yearly citations per article in the cluster.
  • Top authors are defined as the authors with the most articles in the cluster published in the past 5 years.

Top author institutions

  • Each cluster has a set of top author institutions, including each institution’s name, country, number of associated articles in the cluster, average yearly citations for each associated article, and organization type (commercial, education, nonprofit, government).
  • Top author institutions are defined as the institutions associated (through their authors) with the most articles in the cluster published in the past 5 years.

Top funding institutions

  • Each cluster has a set of top funding institutions, including each institution’s name, country, number of associated articles in the cluster, average yearly citations for each associated article, and organization type (commercial, education, nonprofit, government).
    • Note that many articles in the Map lack funder information, and funder organization type is not always available.
  • Top funding institutions are defined as the institutions associated with the most articles in the cluster published in the past 5 years.

Collaboration statistics

  • Each cluster has a set of collaboration statistics, structured as a list of country pairs with corresponding article counts. Each count represents the number of articles in the cluster published over the last 10 years that had at least one author from both countries in the pair.
    • Author nationality is inferred from the location of their affiliated institution.
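A minimal sketch of the country-pair counting described above, assuming each article carries a hypothetical "year" field and a "countries" list already inferred from author institutions. Treating "the last 10 years" as the current year plus the nine before it is an assumption.

```python
from collections import Counter
from itertools import combinations


def collaboration_counts(articles, current_year, window=10):
    """Count articles per unordered country pair within the time window."""
    counts = Counter()
    for article in articles:
        if article["year"] <= current_year - window:
            continue  # outside the assumed 10-year window
        # Deduplicate and sort so ("CN", "US") and ("US", "CN") are one pair.
        countries = sorted(set(article["countries"]))
        for pair in combinations(countries, 2):
            counts[pair] += 1
    return counts


# Made-up articles; the 2005 article is too old to count.
example_collab = collaboration_counts(
    [
        {"year": 2023, "countries": ["US", "CN", "DE"]},
        {"year": 2020, "countries": ["CN", "US"]},
        {"year": 2015, "countries": ["US"]},
        {"year": 2005, "countries": ["US", "CN"]},
    ],
    current_year=2024,
)
# example_collab[("CN", "US")] -> 2
```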

Intercluster citation counts

Map coordinates

  • Each cluster has coordinates representing its “location” in a two-dimensional space along with all the other clusters. We use intercluster citation counts to generate these coordinates; clusters with more citation connections are typically located closer together.

Top citing and cited clusters

  • For each cluster, we use intercluster citation counts to identify the other clusters that most often cite and are cited by the articles in the cluster.

Relevant GitHub repositories

  • For each cluster, we display the GitHub repositories that most frequently appear in the titles, abstracts, or full text (where available from CNKI or arXiv as of October 2021) of the papers in the cluster, or that are linked to those papers via Papers with Code.

Sources and methodology

Data sources

Everything in the RC Dataset is based on the Merged Academic Corpus, including the clusters in the dataset and the metadata for each cluster.

Collection, processing, and enrichment

The clustering method

What is a research cluster?

A research cluster is a group of articles that are highly interconnected through direct citation.

By "direct citation," we mean one article explicitly citing another. As of late 2022, the articles in the MAC contained about 1.6 billion of these citations.

When we say that a group of articles is "highly interconnected" through citations, we mean that the articles in that group directly cite each other a lot, compared to how often they cite articles outside the group or articles outside the group cite them. The technical term for this property is modularity of citation.

In less technical terms, you can compare relationships between articles in a cluster to relationships between people in a family. Everybody is connected to people inside and outside their family, but in many contexts, you probably have stronger and more frequent connections to people within your family than people outside it. (For example, you may spend more time with family members than others, especially if you live with them.) Likewise, an article in a particular cluster may cite, and be cited by, other articles inside and outside that cluster - but it will generally have more citation links to articles inside the cluster than articles in other clusters.
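To make "modularity" concrete, here is a small Python sketch of the standard Newman-Girvan modularity score on a toy citation graph. The source does not state the exact quality function used for the RC Dataset, so treating citations as undirected, unweighted links and using this particular formula are assumptions.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Newman-Girvan modularity of a clustering, with undirected edges.

    edges: list of (article, article) citation links.
    cluster_of: article -> cluster label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = defaultdict(int)
    internal = defaultdict(int)  # links with both ends in the same cluster
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if cluster_of[u] == cluster_of[v]:
            internal[cluster_of[u]] += 1
    degree_sum = defaultdict(int)
    for node, d in degree.items():
        degree_sum[cluster_of[node]] += d
    # Share of internal links minus the share expected by chance, per cluster.
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Two tightly linked triangles joined by a single link (c-d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
two_clusters = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_cluster = {node: 1 for node in two_clusters}

q_two = modularity(toy_edges, two_clusters)  # 5/14, about 0.357
q_one = modularity(toy_edges, one_cluster)   # 0.0
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which matches the intuition that articles in a cluster cite each other more than they cite outsiders.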

What do clusters mean?

Research clusters are created from citation links only: the articles in a cluster don't necessarily have to be related in any other way. But in practice, research articles usually cite each other because they have other things in common. For example, researchers writing in Chinese are likely to cite other research written in Chinese, and researchers who study tree frogs are likely to cite prior work on tree frogs.

This means that citation-based research clusters often represent topics, shared language, or other real-world concepts indirectly - that is, even though those concepts aren’t used to create the clusters in the first place. These emergent concepts can be very precise and nuanced, helping users explore trends in the research literature in great detail and without having to specify the concepts in advance. For example, our clustering process can automatically identify and isolate research related to concepts like automatic speech recognition of multiple voices in noisy environments, organic vapor sensors for food safety monitoring, or nature-inspired robots that grow like vines.

In other cases, the citation-based clustering method can give results that are messy or unintuitive. Articles may be clustered together just because they happened to cite each other (and not cite other articles) in such a way that our algorithm recognized them as distinct groups; thematically, they may not seem to have much in common. However, we usually find that studying a cluster using the cluster detail view in the Map of Science interface will reveal real-world reasons why the articles in it tend to cite each other.

How clusters are generated

We organize articles into clusters using the Leiden algorithm. Essentially, this algorithm starts from an initial clustering of articles and repeatedly moves articles between clusters, with some randomness in the process, searching for a clustering whose clusters have high modularity of citation (as described above).

However, because the clustering algorithm involves a degree of randomness, relying on the algorithm alone to generate the clusters can produce unstable results - that is, different runs of the algorithm will often assign the same article to different clusters. We address this by screening out articles and citations that are less likely to convey meaningful information, which tend to “throw off” the clustering algorithm. Specifically, we screen out articles unless:

  • they cite at least 10 other articles, OR
  • they cite at least one other article AND are cited by at least one article.
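The screening rule above amounts to a simple predicate. A sketch in Python, with hypothetical function and parameter names:

```python
def keep_article(n_references, n_citations_received):
    """Return True if an article passes the screening rule.

    n_references: number of other articles this article cites.
    n_citations_received: number of articles that cite this article.
    """
    return n_references >= 10 or (
        n_references >= 1 and n_citations_received >= 1
    )


# Made-up articles as (references, citations received) pairs.
example_counts = {"p1": (10, 0), "p2": (9, 0), "p3": (1, 1), "p4": (0, 5)}
example_kept = [a for a, (refs, cites) in example_counts.items()
                if keep_article(refs, cites)]
# -> ["p1", "p3"]
```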

We also disregard “unstable” citation links when clustering articles - that is, citations that cause the clustering structure to vary significantly over different runs of the clustering algorithm (see footnote 1).

After screening articles and citations, we run the Leiden algorithm again using only the articles and citations that remain. After that, we delete clusters with fewer than 50 articles, and assign each of the articles in those clusters to the remaining cluster that the article most often cites. (Articles published between clustering updates are assigned to clusters using a similar method.)
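The reassignment step can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: the function name and data layout are hypothetical, ties go to whichever cluster is encountered first, and articles that cite no remaining cluster are left unassigned (None). The source does not say how either case is handled.

```python
from collections import Counter


def dissolve_small_clusters(cluster_of, references, min_size=50):
    """Delete clusters below min_size and reassign their articles.

    cluster_of: article -> cluster label from the clustering run.
    references: article -> list of articles it cites.
    """
    sizes = Counter(cluster_of.values())
    kept = {c for c, n in sizes.items() if n >= min_size}
    result = {}
    for article, cluster in cluster_of.items():
        if cluster in kept:
            result[article] = cluster
            continue
        # Count this article's citations into each remaining cluster.
        votes = Counter(
            cluster_of[ref]
            for ref in references.get(article, [])
            if cluster_of.get(ref) in kept
        )
        result[article] = votes.most_common(1)[0][0] if votes else None
    return result


# Toy data with min_size=3 so the example stays small.
toy_clusters = {"a1": "A", "a2": "A", "a3": "A",
                "b1": "B", "b2": "B", "b3": "B",
                "c1": "C", "d1": "D"}
toy_references = {"c1": ["a1", "a2", "b1"], "d1": []}
toy_result = dissolve_small_clusters(toy_clusters, toy_references, min_size=3)
# c1 cites cluster A twice and B once, so it moves to A.
```

Assigning a newly published article to the cluster it cites the most would use the same citation count over existing clusters.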

This process of data screening, algorithmic sorting, and selective reassignment produces clusters that are (1) highly modular; (2) large enough to be analytically meaningful - generally in the hundreds of articles; and (3) stable across different runs (see footnote 2).

Cluster metadata: creating information about clusters

For each cluster, the RC Dataset includes a variety of metadata generated using the MAC metadata of the articles in the cluster.

Some of the cluster-level metadata is created through relatively simple operations over the article-level metadata - for example, counting the number of articles in a cluster with a certain characteristic, or calculating the average or most prevalent value of a metadata field within the cluster. In straightforward cases like these, the process used to create the metadata is described above. In other cases, more complex operations are needed.

More complex cluster-level metadata

Map coordinates

Each cluster is assigned coordinates in a 2D space containing every other cluster in the dataset, with clusters that cite each other more often positioned closer together. These coordinates are approximations. Technically, the “location” of a cluster has thousands of dimensions - one for each other cluster in the map, representing the degree of citation connection between those two clusters. That means the distances between clusters can’t be perfectly represented in two-dimensional space; 2D visualizations like the Map of Science can only approximate these distances.

To generate 2D coordinates for the clusters, we create a graph from the intercluster citation counts for every cluster in the dataset, then process the graph with the DRL algorithm (implemented in igraph), generating several candidate 2D layouts. Across these layouts, clusters that share more citations are generally closer together, but the layouts differ due to random variation in the algorithm. We identify the candidate layout with the least overall difference from all the others. Finally, we extract coordinates for each individual cluster from the layout.
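The final selection step, picking the candidate layout that differs least from all the others, can be sketched in plain Python. The source does not say how "difference" between layouts is measured, so the metric here (sum of Euclidean distances between each cluster's positions in two layouts, with no alignment for rotation or reflection) is an assumption, as are the function names. Generating the candidate layouts themselves is out of scope for this sketch.

```python
import math


def layout_difference(layout_a, layout_b):
    """Sum of per-cluster distances between two layouts (assumed metric)."""
    return sum(math.dist(layout_a[c], layout_b[c]) for c in layout_a)


def most_representative_layout(layouts):
    """Return the index of the layout with the least total difference."""
    totals = [
        sum(layout_difference(a, b) for j, b in enumerate(layouts) if j != i)
        for i, a in enumerate(layouts)
    ]
    return min(range(len(layouts)), key=totals.__getitem__)


# Three made-up candidate layouts for two clusters (IDs 1 and 2).
candidates = [
    {1: (0.0, 0.0), 2: (1.0, 0.0)},
    {1: (0.0, 0.0), 2: (2.0, 0.0)},
    {1: (0.0, 0.0), 2: (5.0, 0.0)},
]
best_index = most_representative_layout(candidates)
# Total differences are 5, 4, and 7, so the middle layout (index 1) wins.
```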

Subjects

To identify cluster subjects, we take each general and specific subject assigned to at least one article in the cluster, then calculate a cluster-level relevance score, defined as (the sum of its scores for each article in the cluster) divided by (the number of articles in the cluster). We then select the highest-scoring general and specific subjects. In each of these steps, we use articles from the past 10 years only.
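The cluster-level relevance score above is an average in which articles without a given subject contribute zero. A minimal sketch, assuming each article is represented by a hypothetical dict of subject -> article-level score (the 10-year filter is assumed to have been applied already):

```python
from collections import defaultdict


def cluster_subject_scores(article_subject_scores):
    """Sum each subject's article-level scores, divided by cluster size."""
    n_articles = len(article_subject_scores)
    if n_articles == 0:
        return {}
    totals = defaultdict(float)
    for scores in article_subject_scores:
        for subject, score in scores.items():
            totals[subject] += score
    return {subject: total / n_articles for subject, total in totals.items()}


def top_subjects(scores, k):
    """Return the k highest-scoring subjects."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up scores for a three-article cluster.
example_scores = cluster_subject_scores([
    {"Biology": 0.75, "Chemistry": 0.25},
    {"Biology": 0.75},
    {"Chemistry": 0.5, "Physics": 0.125},
])
example_top = top_subjects(example_scores, k=2)
# Biology: 1.5 / 3 = 0.5; Chemistry: 0.75 / 3 = 0.25
```

In the dataset, this selection would be run separately for general subjects (keeping 3) and specific subjects (keeping 6).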

Key concepts

To identify key concepts in clusters, we used the yake algorithm to extract the top twenty keyphrases for each cluster from the titles and abstracts of its articles, aiming for keyphrases that were relatively distinct from each other so that we could identify discrete concepts. We then removed any phrases that were common English words (identified using the wordfreq Python library) or that were single words contained in other phrases surfaced for the same cluster, so that the remaining phrases would be more useful and varied.
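The post-extraction filtering step can be sketched without any external libraries. In this illustration the keyphrases are assumed to have been extracted already, and a small hand-written set of common words stands in for the wordfreq lookup. The function name, the stand-in set, and the lowercasing are assumptions.

```python
def filter_keyphrases(phrases, common_words):
    """Drop common single words and single words contained in other phrases."""
    lowered = [p.lower() for p in phrases]
    kept = []
    for phrase in lowered:
        if len(phrase.split()) == 1:
            if phrase in common_words:
                continue  # a common English word on its own
            if any(phrase in other.split()
                   for other in lowered if other != phrase):
                continue  # already part of a longer surfaced phrase
        kept.append(phrase)
    return kept


# Made-up keyphrases and a stand-in for a word-frequency lookup.
example_phrases = ["speech recognition", "speech", "noisy environments",
                   "method", "beamforming"]
example_concepts = filter_keyphrases(example_phrases, {"method", "results"})
# -> ["speech recognition", "noisy environments", "beamforming"]
```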

Extreme growth forecasting

Our model for predicting extreme cluster growth is described in detail in Rahkovsky et al. (2021). In brief, the model makes binary (yes/no) predictions based on the overall recency of articles in the cluster, the recency of citations to those articles, the prevalence of articles in top journals and conferences, and the past growth of the cluster compared to the entire global literature. We define “extreme growth” as 8% or more annual growth over the next 3-year period. Predictions are only available for clusters with at least 20 articles from the past 12 months. For more details on the model and how it was evaluated, see Rahkovsky et al. (2021), section 2.3.

Known limitations

  • Some clusters don’t have a clear theme or topic. The clusters in the RC Dataset are generated algorithmically, using the citation links between articles and no other information. Usually, articles cite each other because they have other things in common, like language, topic, or method. In some cases, though, the citation-based clustering method may group articles that don’t have much in common thematically, and the resulting clusters might not intuitively “mean” anything.
  • The dataset isn’t generally useful for trend analysis. The clusters in the RC Dataset were last defined in 2021. We continuously add new articles to the clusters that best suit them, but we don’t add clusters, delete clusters, or (with limited exceptions) change the cluster assignment of already assigned articles. In most cases, this makes it hard to use the RC Dataset to study changes over time in the overall research literature or in specific fields.
  • Some articles may switch clusters when the dataset is updated. Articles published after the data was initially clustered are put in the clusters they cite the most. As articles gain more citations, their clusters may change, which could lead to changes in cluster metadata.

Maintenance

How are the data updated?

There are two main types of updates to the RC Dataset.

Clustering updates recalculate the clustering structure of the entire dataset “from scratch” using the most recent MAC data and the method described above. The last clustering update was in August 2021.

Article and metadata updates assign articles that were added to the MAC after the last clustering update to existing clusters, and then recalculate cluster-level metadata with the newly assigned articles taken into account. Each new article is assigned to the cluster it cites the most. Article and metadata updates use an automated process that we run roughly quarterly.

Credits

Virtually all CSET data team members have contributed to the RC Dataset in some form, whether by providing feedback, reviewing code, or helping generate ideas. In alphabetical order, some particular contributions follow:

  • Rebecca Gelles: Phrase extraction and patent metrics development
  • Jennifer Melot: Clustering pipeline implementation and automation support; metadata aggregation and development support
  • Ilya Rahkovsky: Clustering methodology and experimentation, metadata aggregation, and primary metrics development
  • Autumn Toney: Metrics development and review

Dewey Murdick and SciTech Strategies helped develop the clustering concept and methodology. Student research assistant Jerod Sun contributed data characterization. Zach Arnold and Katherine Quinn helped create this documentation.

Major change log

10/13/22: Initial release (ETO/CSET internal)

Footnotes

  1. Our clustering algorithm involves a random seed, so the clusters it generates may vary somewhat between runs. To improve stability, we ran the algorithm five times with the same input parameters and different random seeds, resulting in five different cluster structures. For each citation link in the MAC, we then identified how often the cited and citing papers appeared in the same cluster across the five runs. We defined a citation link as “unstable” if the cited and citing paper were in the same cluster in fewer than three of the five runs.

  2. We can measure consistency by measuring how often articles switch clusters across successive runs of the clustering algorithm. In our current process, fewer than 1% of articles switch clusters from run to run (on average).
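The stability screen in footnote 1 can be sketched as follows, assuming each of the five runs is available as a hypothetical dict of article -> cluster label. Cluster labels only need to be comparable within a run, not across runs.

```python
def unstable_links(citations, runs, min_agreement=3):
    """Return citation links whose two papers co-cluster in too few runs.

    citations: list of (citing, cited) pairs.
    runs: list of dicts, one per clustering run, article -> cluster label.
    """
    unstable = set()
    for citing, cited in citations:
        together = sum(1 for run in runs if run[citing] == run[cited])
        if together < min_agreement:
            unstable.add((citing, cited))
    return unstable


# Five made-up runs over three papers. "p" and "q" always share a cluster;
# "r" joins "q" in only two of the five runs.
example_runs = [
    {"p": 0, "q": 0, "r": 0},
    {"p": 1, "q": 1, "r": 1},
    {"p": 0, "q": 0, "r": 1},
    {"p": 2, "q": 2, "r": 0},
    {"p": 0, "q": 0, "r": 3},
]
example_unstable = unstable_links([("p", "q"), ("r", "q")], example_runs)
# -> {("r", "q")}
```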