Documentation: Merged Academic Corpus

Overview

The Merged Academic Corpus is not publicly available. This documentation is provided as context for the ETO products that use the dataset.

What is this dataset?

The Merged Academic Corpus (MAC) contains detailed information on over 270 million scholarly articles. It combines data from public and private sources to achieve an unmatched view of the global literature, covering every country in the world and the large majority of known articles. The MAC is maintained by CSET and ETO and is not publicly available due to licensing restrictions.

Which ETO products use it?

MAC data underpins several ETO tools, including the Map of Science and the Research Almanac.

What are its sources?

The MAC currently includes data from six commercial and open-access platforms, plus additional metadata derived from those platforms using CSET algorithms. Read more >>

What are its main limitations?

  • The MAC doesn’t cover non-public research. It only includes research that has been publicly released and is included in one of our data sources. Read more >>

  • The MAC’s sources may introduce problems. Errors, gaps, and judgment calls in the MAC’s underlying datasets are likely to make it into the MAC itself. Read more >>

  • The MAC’s merging and enrichment processes may introduce errors. The MAC uses automated processes to combine articles from different sources and link them to useful metadata. These processes usually work well, but there may be errors in some cases. Read more >>

What are the terms of use?

Because this dataset contains licensed data from commercial providers, it is not publicly available. However, you can interact with some of the data using ETO tools.

How do I cite it?

Because the MAC is not publicly available, you should cite the relevant ETO tool or this documentation page instead.

Structure and content

The basic unit of the MAC is the article. For our purposes, “articles” include peer-reviewed publications, working papers, and other works appearing in journals, preprint servers, or similar venues. After deduplicating the articles, we compile data about each one from the MAC’s data sources, then structure the compiled data as a series of standard metadata fields.

Title

  • Each article can have an English-language title, a foreign-language title, both, or neither.
    • 67.05% of articles have an English title.
    • 40.86% have a foreign-language title.
    • Fewer than 0.01% have no available title.
  • When our sources include multiple titles in the same language for the same article, we use the title that is last in dictionary order.

Abstract

  • Each article can have an English-language abstract, one or (infrequently) more foreign-language abstracts, both, or neither.
    • 41.76% of articles have an English abstract.
    • 21.67% have a foreign-language abstract.
    • 44.00% have no available abstract.
  • When our sources include multiple abstracts in the same language for the same article, we currently break the tie semi-arbitrarily, using the abstract that (taken as a whole) is last in dictionary order.
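
When multiple candidate strings exist, this tie-break amounts to taking the maximum under ordinary string comparison. Below is a minimal sketch of the rule as applied to titles and abstracts; the helper name and example records are illustrative, not the MAC's actual code.

```python
def pick_last_in_dictionary_order(candidates):
    """Break ties among same-language titles (or abstracts) by keeping
    the string that sorts last in dictionary order."""
    non_null = [c for c in candidates if c]
    return max(non_null) if non_null else None

# Example: three sources report slightly different English titles.
titles = ["Deep learning for X", "Deep Learning for X", None]
print(pick_last_in_dictionary_order(titles))  # -> "Deep learning for X"
```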

Title and abstract languages

  • We try to automatically detect the language of each article’s title(s) and abstract(s) using pycld2, a standard language identification library.
  • 93.82% of articles with non-null titles have a confident language label from pycld2, and 98.80% of articles with non-null abstracts have a confident language label from pycld2.
  • English is the most common language for titles and abstracts. 67.05% of articles have an English title, and 41.76% of articles have an English abstract.
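
As a point of reference, here is a minimal sketch of how pycld2 can be applied to a title or abstract. Treating pycld2's reliability flag as the test for a "confident" label, and the handling of missing text, are illustrative assumptions rather than the MAC's exact code.

```python
import pycld2 as cld2

def detect_language(text):
    """Return a language code for `text`, or None when pycld2's detection
    is not flagged as reliable (our stand-in for "confident" here)."""
    if not text:
        return None
    is_reliable, _bytes_found, details = cld2.detect(text)
    if not is_reliable:
        return None
    _lang_name, lang_code, _percent, _score = details[0]
    return lang_code

print(detect_language("Machine translation of scholarly abstracts"))  # -> "en"
```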

Year

  • Each article has an article year.
  • When our sources include multiple years for the same article, we use the earliest year.

Venue

  • 69.45% of MAC articles have an associated publication venue, such as a journal (e.g., Nature) or open repository (e.g., arXiv). For the remaining 30.55%, there was no venue data in our sources.
  • When our sources list more than one venue for the same article, we currently break the tie semi-arbitrarily, using the venue with the name that is last in dictionary order.

Authors

  • For each article, our data sources include each author’s name and affiliated institution (e.g., “Jane Doe, Georgetown University”). In some cases, the sources also tag the authors or their institutions with a unique identifier code.
  • For each article, we compile all of the authors associated with that article in the MAC’s data sources, then identify and remove duplicate authors using their names, affiliated institutions, and identifiers (as available), resulting in a definitive list of authors for the article. (For deduplication purposes only, we normalize author names by removing some special characters, reversing strings separated by exactly one comma, and standardizing whitespace and formatting for author initials; a rough sketch of this normalization appears after this list.)
  • 90.05% of MAC articles have at least one associated author. For the remaining 9.95%, there was no author data in our sources.
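
A rough sketch of the author-name normalization described above, used for deduplication only. The exact character set, initial-formatting, and casing rules in the MAC may differ; this illustrates the general approach.

```python
import re

def normalize_author_name(name):
    """Normalize an author name for deduplication purposes:
    drop some special characters, flip "Last, First" into "First Last"
    when exactly one comma is present, space out run-together initials,
    and standardize whitespace."""
    name = re.sub(r"[^\w\s,.\-']", "", name)            # remove special characters
    if name.count(",") == 1:                             # "Doe, Jane" -> "Jane Doe"
        last, first = [part.strip() for part in name.split(",")]
        name = f"{first} {last}"
    name = re.sub(r"\.(?=\S)", ". ", name)               # "J.K." -> "J. K."
    return re.sub(r"\s+", " ", name).strip().lower()

print(normalize_author_name("Doe, J.K."))   # -> "j. k. doe"
print(normalize_author_name("Jane   Doe"))  # -> "jane doe"
```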

Author institutions

  • As noted above, our data sources usually include institutional affiliations for each author of an article.
  • We use this affiliation data to identify one or more institutions for each author of a given article in the MAC. (In some cases, an author is listed on an article as affiliated with multiple institutions.)
  • The MAC’s author institution data is article-specific. For example, if Professor Doe moves from Georgetown to Oxford, she will be counted as affiliated with Georgetown for the articles she published while at Georgetown, and with Oxford for the articles she published after moving there.
  • In general, you can think of the MAC’s affiliation data as a table with three columns: one listing the specific article, one listing the name of an author of that article, and one listing the institution that author is associated with in that article.
    • If no source lists an institutional affiliation for an author, we assign that author a null institution.
  • When different sources specify different institutions for the same author of a given article, we use a set of heuristics to pick the source most likely to be accurate, and use the institution information from that source. We preferentially use data from sources with more complete links between authors and institutions and more complete data about institutions (e.g., mappings to countries or unique identifiers).

Author institution countries

  • We try to assign a country to each author institution (a minimal sketch of this logic appears after this list):
    • If the institution has a GRID identifier, we use the country specified by GRID.
    • If the institution doesn’t have a GRID identifier, we use the country specified for that institution in the data source that we used to identify the institution in the first place, if available.
    • We use an internal mapping table to normalize country names (for example, assigning “USA” and “U.S.” to the United States).
    • If that source doesn’t specify a country, we assign a null value for this field. 75.07% of distinct institutions have no country affiliation.
  • Here (and generally in ETO resources) we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities. Specifically, any entity that has an ISO 3166-1 code is termed a "country" in ETO resources.
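
The country assignment described above reduces to a short fallback chain plus name normalization. A minimal sketch, assuming simple lookups for the GRID country, the source-provided country, and the alias table (all names below are illustrative):

```python
# Illustrative alias table; the MAC uses an internal mapping.
COUNTRY_ALIASES = {"USA": "United States", "U.S.": "United States", "UK": "United Kingdom"}

def assign_institution_country(grid_country=None, source_country=None):
    """Fallback chain: GRID country, then the country given by the source
    that identified the institution, then null. Names are normalized
    against an alias table."""
    raw = grid_country or source_country
    return COUNTRY_ALIASES.get(raw, raw) if raw else None

print(assign_institution_country(grid_country="U.S."))       # -> "United States"
print(assign_institution_country(source_country="Germany"))  # -> "Germany"
print(assign_institution_country())                          # -> None
```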

Author institution types

  • Each author-affiliated institution is assigned one of four types (Education, Company, Nonprofit, or Government) or a null value (unknown).
  • We use sectoral data from the GRID database to assign types. We assign institutions without GRID identifiers a null value for this field.

Funding institutions

  • Some articles mention the organizations that funded the research in the article, and some of the MAC’s data sources extract these mentions from article text.
  • We compile the extracted names from these sources for each article, then normalize and deduplicate the names, resulting in a clean list of funding institutions for the article.
    • We normalize institution names for deduplication by deaccenting, stripping punctuation, normalizing whitespace, replacing "&" with "and", machine-translating non-English names, replacing acronyms appearing in parentheses with their corresponding expansions, and removing parenthesized expressions at the end of names. If multiple names match after being normalized in this way, we use the name that occurs most often in the underlying data sources (or its translation, if the name is non-English) as the “official” name in the MAC. A simplified sketch of these normalization steps appears after this list.
  • 11.55% of articles in the MAC have at least some metadata about their funding institutions.
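
A simplified sketch of the funder-name normalization. It omits machine translation and acronym expansion, which require external resources, and the exact regular expressions are illustrative rather than the MAC's actual code.

```python
import re
import unicodedata

def normalize_funder_name(name):
    """Simplified funder-name normalization for deduplication:
    deaccent, replace "&" with "and", drop a trailing parenthesized
    expression, strip punctuation, and collapse whitespace."""
    name = "".join(ch for ch in unicodedata.normalize("NFKD", name)
                   if not unicodedata.combining(ch))      # deaccent
    name = name.replace("&", "and")
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name)           # drop trailing "(...)"
    name = re.sub(r"[^\w\s]", " ", name)                   # strip punctuation
    return re.sub(r"\s+", " ", name).strip().lower()

print(normalize_funder_name("Deutsche Forschungsgemeinschaft (DFG)"))
# -> "deutsche forschungsgemeinschaft"
```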

Funding institution countries

  • Each funding institution is associated with a country.
  • If the institution has a GRID identifier, we use the country specified by GRID.
  • If the institution doesn’t have a GRID identifier, but was extracted from a Chinese-language article, we assign the institution to China.
  • Otherwise, we assign a null value (unknown).

Funding institution types

  • Each funding institution is assigned one of four types (Education, Company, Nonprofit, or Government) or a null value (unknown).
  • We use sectoral data from the GRID database to assign types. We assign institutions without GRID identifiers a null value for this field.

Citations and citation percentile

Article citations

  • Most of the MAC’s data sources include extracted article citation metadata. For each article in the source dataset, these sources provide a list of citations from that article to other articles in the dataset.
  • We map each cited and citing article in these lists to deduplicated MAC articles using their source-specific unique identifiers.
  • Using this mapping, we create two consolidated lists for each deduplicated MAC article: one listing all of the other articles the original article cites (out-citations), and one listing all of the other articles that cite the original article (in-citations).
  • Finally, we calculate a citation percentile for each article. This value compares the article’s total number of in-citations with those of all other articles published in the same year. For example, a 90th percentile article has more in-citations than about 90% of all articles published in the same year.
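
A minimal sketch of the percentile calculation, comparing an article's in-citation count against the other articles from its publication year. The exact percentile convention in the MAC (for example, how ties are handled) may differ.

```python
def citation_percentile(in_citations, same_year_counts):
    """Percent of same-year articles with strictly fewer in-citations."""
    if not same_year_counts:
        return None
    below = sum(1 for c in same_year_counts if c < in_citations)
    return 100.0 * below / len(same_year_counts)

# Example: an article with 42 in-citations in a cohort of five same-year articles.
print(citation_percentile(42, [0, 3, 10, 42, 120]))  # -> 60.0
```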

Patent citations

  • One of the MAC’s sources, Digital Science Dimensions, includes extracted patent citation metadata in addition to extracted article citation metadata. This metadata works the same way as the article citation metadata, but tracks citations to and from patents in Digital Science’s patent dataset (the patents themselves are not part of the MAC).
  • We use this metadata to count patent in-citations (citations by patents) and out-citations (citations to patents) and to calculate patent in-citation percentiles for each MAC article available in Dimensions, using the same basic process as for citations between articles.

Subjects

  • Each MAC article is assigned a list of academic subjects at different levels of generality. For example, an article’s subjects might include medicine (general) or cancer research (specific).
  • Each subject in the list has a score indicating how relevant the article is to that subject.
  • We use a set of models to generate each article's relevance score for every subject in the list.
  • We designate the three top-scoring general subjects and six top-scoring specific subjects as the article's overall subjects. (This designation is used to identify cluster subjects in the RC Cluster Dataset.)
  • Finally, we designate each article’s highest-scoring general subject as that article’s overall subject. This overall subject is used (for example) to generate cluster colors in the Map of Science.
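
A minimal sketch of how the top-scoring subjects might be selected from the per-subject relevance scores (the data structure, field names, and scores are illustrative):

```python
def top_subjects(scores, level, top_n):
    """Return the `top_n` highest-scoring subjects at the given level.
    `scores` maps subject -> (level, relevance score)."""
    at_level = [(name, s) for name, (lvl, s) in scores.items() if lvl == level]
    return [name for name, _ in sorted(at_level, key=lambda x: -x[1])[:top_n]]

scores = {
    "medicine": ("general", 0.81),
    "biology": ("general", 0.65),
    "computer science": ("general", 0.12),
    "cancer research": ("specific", 0.77),
    "oncology": ("specific", 0.70),
}
general = top_subjects(scores, "general", 3)    # overall general subjects
specific = top_subjects(scores, "specific", 6)  # overall specific subjects
overall = general[0] if general else None       # the article's single overall subject
print(general, specific, overall)
```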

Emerging topics

  • Each MAC article is classified as relevant or not to various emerging topics, including AI and certain AI subfields.
  • We use different methods to apply these tags depending on the topic, including machine learning models and keyword searches. Read more >>

Unique identifiers

  • The MAC’s data sources sometimes include unique identifier codes for articles, authors, or institutions. These may be proprietary (i.e., codes a vendor uses to identify articles within its own datasets only) or public.
  • We aggregate the unique identifiers associated with each article across all sources and include them in the MAC. In some cases, we also use these identifiers for deduplication.
  • Major identifiers in the MAC include DOI (public), ROR (public), and GRID (proprietary).

Sources and methodology

Parts of this section are adapted from Rahkovsky et al., “AI Research Funding Portfolios and Extreme Growth” (Frontiers in Research Metrics and Analytics, 2021).

Data sources

The MAC currently includes data from:

  • Dimensions, an inter-linked research information system provided by Digital Science (commercial/closed access).
  • Clarivate’s Web of Science platform (commercial/closed access). The MAC includes Web of Science articles published after 2005.
  • Microsoft Academic Graph (open access). This resource was discontinued in 2021. The MAC includes all of its data up to that point, with the exception of patents and academic datasets.
  • China National Knowledge Infrastructure (CNKI), the leading Chinese-language academic dataset (commercial/closed access). The MAC includes CNKI articles published after 2005.
  • The arXiv platform for open-access scientific articles and preprints (open access).
  • Papers with Code, a free platform for machine learning articles and related resources (open access).
  • OpenAlex, a large-scale open-access dataset viewed as the replacement for MAG. We only use OpenAlex to help deduplicate articles that occur in our other sources - the MAC doesn’t currently include any OpenAlex article metadata.

Some of the data in the MAC is taken directly from these sources. Other data is derived from them algorithmically, as discussed below.

Collection, processing, and enrichment

Consolidating raw data from different sources

We automatically incorporate raw data into the MAC from each source weekly. (Most of the underlying sources are updated at least weekly. Exceptions are the Chinese-language data, which are updated quarterly, and the MAG data, which are no longer updated.) We use a set of Airflow pipelines to retrieve the raw data and send it through the MAC’s merging and enrichment processes, described below.
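
A highly simplified sketch of what one such weekly pipeline might look like in Airflow. The DAG name, task functions, and schedule below are hypothetical; CSET's actual pipelines are more involved.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def retrieve_raw_data():
    ...  # hypothetical: pull the latest dump from one of the MAC's sources

def merge_and_enrich():
    ...  # hypothetical: run deduplication and metadata enrichment

with DAG(
    dag_id="mac_weekly_update",        # hypothetical DAG name
    schedule="@weekly",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    retrieve = PythonOperator(task_id="retrieve_raw_data", python_callable=retrieve_raw_data)
    enrich = PythonOperator(task_id="merge_and_enrich", python_callable=merge_and_enrich)
    retrieve >> enrich
```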

Deduplicating articles

There is duplication between and within the six sources feeding into the MAC - articles often appear in multiple sources, or multiple times within the same source. We resolve these duplicates with an automated process:

  1. Normalize every article’s title and abstract.
    • For matching purposes, we apply the Unicode Normalization Form Compatibility Composition (NFKC) standard: Unicode characters are decomposed by compatibility, then recomposed by canonical equivalence; letters are de-accented and HTML tags, copyright signs, punctuation, numbers, and non-alphanumeric character strings are stripped; and all white space is removed from the strings.
  2. Identify and remove duplicates. Articles are presumed to be duplicates if one of the following is true:
    • They match on at least three of the following characteristics, excluding shared null and empty values:
      • Title (normalized)
      • Abstract (normalized)
      • Article year
      • Author last names (normalized)
      • References within each article (checked only if the articles are from the same underlying data source, because different sources each record references with their own system)
      • DOI
    • Their normalized, concatenated titles and abstracts have simhash values that differ in at most two places (using a rolling window of three characters) AND they have the same article year.

For more details on this process and public code, visit our GitHub repo.
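
A rough sketch of the two matching checks is below. It uses Python's built-in Unicode handling and the open-source simhash package, which may differ from the implementation in CSET's repo; the three-character rolling window and the distance and field-count thresholds follow the description above.

```python
import re
import unicodedata

from simhash import Simhash  # open-source package; CSET's implementation may differ

def normalize(text):
    """NFKC-normalize, de-accent, and strip non-alphanumeric characters and whitespace."""
    text = unicodedata.normalize("NFKC", text or "")
    text = "".join(ch for ch in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]", "", text.lower())

def shingles(text, n=3):
    """Rolling window of three characters."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def simhash_match(a, b, max_distance=2):
    """True if the normalized title+abstract simhashes differ in at most
    two bit positions AND the article years match."""
    if a["year"] != b["year"]:
        return False
    ha = Simhash(shingles(normalize(a["title"] + a["abstract"])))
    hb = Simhash(shingles(normalize(b["title"] + b["abstract"])))
    return ha.distance(hb) <= max_distance

def field_match(a, b, min_matches=3):
    """True if the articles agree on at least three non-null fields.
    (References are omitted here; they are checked only within a single source.)"""
    fields = ["title_norm", "abstract_norm", "year", "last_names_norm", "doi"]
    shared = sum(1 for f in fields if a.get(f) and a.get(f) == b.get(f))
    return shared >= min_matches
```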

Compiling article metadata

After we deduplicate articles, we link each one to the metadata associated with it or any of its duplicates in the MAC’s underlying data sources. For most metadata fields, this is a relatively straightforward process of aggregation; see above for details pertaining to each field.

However, the MAC also includes some article-level metadata fields not present in any of the underlying sources. These include article subjects and emerging topic tags.

Identifying academic subjects

We assign each article a set of general and specific academic subjects using a method originally developed by Microsoft Academic. In short, Microsoft researchers built a taxonomy of research subjects by extracting scientific concepts from Wikipedia, then generated text embeddings for each subject from the relevant Wikipedia articles. They then generated a similar embedding for each article in their own academic corpus. Finally, they calculated the similarity between the article embedding and each field embedding, and assigned the article a corresponding relevance score for each subject.
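
For intuition, the scoring step can be thought of as an embedding similarity calculation. Below is a minimal sketch using cosine similarity; the specific similarity measure and embedding models belong to Microsoft's method, so this code is an illustration under that assumption, not their implementation.

```python
import numpy as np

def relevance_scores(article_embedding, subject_embeddings):
    """Score an article against each subject as the cosine similarity
    between the article embedding and that subject's embedding."""
    a = article_embedding / np.linalg.norm(article_embedding)
    return {subject: float(a @ (emb / np.linalg.norm(emb)))
            for subject, emb in subject_embeddings.items()}
```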

For the current version of the MAC, we used Microsoft’s method and models to calculate similarity scores for every English-language MAC article. Microsoft’s models only work on English text, but Chinese is by far the most common non-English language in the MAC, and the vendor that provides the MAC's Chinese-language data includes CLC topic codes for each article. We mapped CLC codes to the subjects in Microsoft's taxonomy, then used this mapping to "convert" each article's CLC assignments into subjects comparable to the English-language articles in our corpus. (This is necessary only for articles available solely in Chinese; articles with both English and Chinese text, which are common, are processed with Microsoft's models as usual.)

Microsoft stopped updating its subject models in 2021. We are currently developing our own subject-assignment models (using the same basic method described above) and will integrate them into the MAC in the near future.

Identifying relevance to emerging technology topics

In addition to academic subjects, the MAC tags each article as relevant or not relevant to various emerging topics. These tags are currently used in the Research Almanac and for the AI-related filters in the Map of Science (certain tags only).

Emerging technology topics have fuzzy boundaries; there's no objectively correct answer to whether a particular article is "AI safety" research (for example). For each emerging topic, we try to capture articles in the MAC that subject matter experts would consider highly relevant to the topic in question. We use different methods to identify these articles depending on the topic, and we evaluate our results against "ground truth" corpora that also vary by topic. Still, it's important to note that this process inevitably involves some judgment calls. In addition, we rely on statistical models to apply the topic tags. For both reasons, analytic results derived from the MAC's emerging technology topic tags are necessarily imprecise and should be interpreted as estimates.

AI, computer vision, NLP, robotics

We tag each Chinese- or English-language article as relevant to artificial intelligence or not, and classify each English-language article as relevant or not to computer vision, natural language processing, and robotics.

To classify English-language articles, we use a set of machine learning models trained on arXiv data. Each article in arXiv includes author-provided subject tags, including tags for artificial intelligence, computer vision, natural language processing, and robotics. We trained a separate SciBERT model on the titles and abstracts of the arXiv articles tagged with each subject. Then, we ran each model over all of the other English-language articles in the MAC, assigning each one its own set of tags. For more details about how we developed, evaluated, and deployed the models, see Dunham et al. (2020).

SciBERT models only work on English text. To classify articles without English-language abstracts, we searched each article for a list of English and Chinese keywords related to artificial intelligence. Any article with at least one of the keywords in its title or abstract was classified as AI-relevant.
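
A minimal sketch of this keyword-based fallback follows. The handful of terms below is illustrative only; the actual keyword list is much longer and was curated separately.

```python
# Illustrative terms only; the real keyword list is far longer.
AI_KEYWORDS = ["artificial intelligence", "machine learning", "deep learning",
               "neural network", "人工智能", "机器学习", "深度学习"]

def keyword_ai_relevant(title, abstract):
    """Tag an article as AI-relevant if any keyword appears in its title
    or abstract (used for articles without English-language abstracts)."""
    text = f"{title or ''} {abstract or ''}".lower()
    return any(kw in text for kw in AI_KEYWORDS)

print(keyword_ai_relevant("基于深度学习的图像识别方法", None))  # -> True
```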

AI safety

The field of "AI safety" research is young, quickly evolving, and poorly defined (even compared to other emerging topics), with no authoritative and comprehensive compilations of "AI safety articles" to refer to. We use a machine learning classifier to systematically identify AI safety articles in the MAC, but caution that the results are inherently imprecise; different methods could produce different results.

In short, we consider an article an "AI safety article" if it (a) is categorized as relevant to AI and (b) is categorized by our AI safety classifier as relevant to AI safety.

To develop the AI safety classifier, we began by creating our own definition of AI safety research encompassing safety-related concepts such as robustness, misspecification, unwanted bias, explainability, and value alignment. In parallel, we compiled a set of articles potentially related to AI safety, such as MAC articles categorized as relevant to AI in general and recent articles from safety-relevant AI conferences, workshops, and open-source repositories. ETO staff read the titles and abstracts of 2,806 of these articles, then manually annotated each one as relevant or not to AI safety according to our definition. (To measure the stability of the definition in practice, 256 articles were independently double-annotated by other CSET researchers; intercoder agreement was 75%.)

We then used Snorkel Flow, a platform for developing models under programmatic weak supervision, to train the model. (For more information on this approach, see Snorkel: rapid training data creation with weak supervision and A Survey on Programmatic Weak Supervision.) We ingested metadata for the 2,806 manually annotated articles, plus 15,000 unlabeled English-language articles marked relevant to AI by our AI classifier, into the Snorkel Flow platform. We split this data into 15% development, 15% validation, and 70% training sets. We then developed 83 labeling functions using the development set. Of these, 69 were keyword, regular expression, or time interval matches based on the values of the article title, abstract, publication venue, or publication years. The other 14 labeling functions were based on articles' membership in clusters derived from a custom support vector machine trained over word embeddings of the article titles and abstracts. These labeling functions had coverage of 99% of the development set, conflict of 10.5%, and label density of 5.329.
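
For readers unfamiliar with programmatic weak supervision, the sketch below shows what keyword- and metadata-style labeling functions look like. It is written against the open-source snorkel package rather than the Snorkel Flow platform we actually used, and the keywords and year threshold are illustrative, not our real labeling functions.

```python
from snorkel.labeling import labeling_function

# Label values (illustrative).
SAFETY, NOT_SAFETY, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_safety_keywords(x):
    """Vote 'AI safety' if the title or abstract mentions a safety-related term."""
    text = f"{x.title} {x.abstract}".lower()
    terms = ["value alignment", "adversarial robustness", "reward hacking"]
    return SAFETY if any(t in text for t in terms) else ABSTAIN

@labeling_function()
def lf_early_publication_year(x):
    """Vote 'not AI safety' for articles published well before the field emerged."""
    return NOT_SAFETY if x.year < 2010 else ABSTAIN
```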

Through the Snorkel Flow platform, we used these labeling functions to create "weak" AI safety relevance labels for the data we ingested. We then trained an AutoML-tuned logistic regression model over the weak labels. This resulted in a model that achieved a macro-averaged F1 of 82.5% on the validation set, with precision of 73.1%, recall of 79.8%, and F1 of 76.3% over articles manually labeled relevant to AI safety.

AI + [pharmacology, genetics, ...]

We tag an article as an "AI + [subject]" article (for example, AI + pharmacology or AI + genetics, as displayed in the Research Almanac) if it is classified as an AI article and is assigned the specific academic subject in question.

Known limitations

  • The MAC doesn’t cover non-public research. The MAC only includes research that has been publicly released and is included in one of our data sources. We believe these sources cover the large majority of publicly released research worldwide, but a great deal of research is never made public. In particular, military research and commercial research may never be written up in open sources, or even written up at all. We have no way of knowing how much of this “hidden” research exists, or how different the MAC would look if it were somehow included.
  • The MAC’s sources may introduce problems. The MAC incorporates articles from many different datasets, making it a uniquely comprehensive dataset on worldwide research. No dataset is perfect, though. Errors, gaps, and judgment calls in the MAC’s underlying datasets are likely to make it into the MAC itself.
  • The MAC’s merging and enrichment processes may introduce errors. The MAC uses automated processes to combine articles from different datasets and create metadata about them. These processes usually work well, but they do have limitations, especially when there are issues with the raw article data. These issues include:
    • We have not done complete metadata translation. So far we have completely translated only the names of funding institutions. Other fields have been translated in part. However, translating all article titles and abstracts is cost-prohibitive.
    • Deduplication is less effective when articles have limited metadata. Our method of deduplicating articles across corpora relies on the presence of six metadata fields. If some of these fields are absent, we are less likely to successfully merge the articles. Additionally, if an article’s title or abstract appears in multiple languages across datasets, we will have to rely on that article’s other metadata to perform a successful match.
    • Named entities may not be fully resolved. Named entities, such as institutions, authors, and articles, are often given different names across or within our data sources. For example, Google may be called “Google” in one article, “Google Cloud Services” in another, “Google UK” in a third, and “AI Team - Google Research” in a fourth. For the MAC, we use different processes to “resolve” these different names where appropriate, as described above, but we can’t catch all of them. This could affect some calculations using the MAC. For example, an author’s articles could be split across multiple versions of the author’s name, making it seem like that author has written fewer articles than she really has.
    • Sub-organizations are not linked to their parents. Some organizations in the MAC are parts of other organizations. For example, an article might have some authors associated with “Google Research” and others associated with “Google Cloud Services.” In some contexts, users might want to group these organizations (and their authors) together under “Google.” The MAC doesn’t group organizations like this.

Maintenance

How are the data updated?

We update the MAC through a sequence of automated pipelines that retrieve data from our sources, merge it together, and enrich it. These pipelines normally run weekly, with occasional pauses to resolve issues due to vendor data changes or failures of automated checks in the pipelines.

The underlying data sources are updated on their own schedules - weekly or daily in most cases. A few of the sources have longer lags, up to quarterly for Chinese-language papers provided through CNKI.

Credits

Virtually all CSET data team members have contributed to the MAC in some form, whether by providing feedback, reviewing code, or helping generate ideas. In alphabetical order, some particular contributions follow:

  • Daniel Chou: Chinese-language data parsing and normalization
  • James Dunham: Article classifier development, field of study modeling, citation percentile calculation
  • Rebecca Gelles: Organizational entity resolution
  • Jennifer Melot: Article linkage, data pipeline automation and scaling
  • Ilya Rahkovsky: Article linkage, metadata merge

Student research assistants Chenxi Liu, Luwei Lei, and Jerod Sun contributed data characterization and normalization. Zach Arnold helped create this documentation.

Major change log

10/13/22: Initial release (ETO/CSET internal)
5/19/23: New emerging topics added as part of Research Almanac launch