Documentation: Merged Academic Corpus

Overview

The Merged Academic Corpus is not publicly available. This documentation is provided as context for the ETO products that use the dataset.

What is this dataset?

The Merged Academic Corpus (MAC) contains detailed information on over 260 million scholarly articles, combining data from public and private sources to achieve an unmatched view of the global literature. The MAC is maintained by CSET and ETO and is not publicly available in raw form due to licensing restrictions.

Which ETO products use it?

What are its sources?

The MAC currently includes data from six commercial and open-access platforms, plus additional metadata derived from those platforms using CSET algorithms. Read more >>

What are its main limitations?

  • The MAC doesn’t cover non-public research. It only includes research that has been publicly released and is included in one of our data sources. Read more >>

  • The MAC’s sources may introduce problems. Errors, gaps, and judgment calls in the MAC’s underlying datasets are likely to make it into the MAC itself. Read more >>

  • The MAC’s merging and enrichment processes may introduce errors. The MAC uses automated processes to combine articles from different sources and link them to useful metadata. These processes usually work well, but there may be errors in some cases. Read more >>

  • The MAC’s coverage of Chinese publications is incomplete. Although the MAC includes many Chinese publications, many others are only available in China-based journals that are not included in our data sources. Use particular caution when interpreting MAC-derived data related to Chinese research organizations, funders, or authors. Read more >>

What are the terms of use?

Because this dataset contains licensed data from commercial providers, it is not publicly available in raw form. However, you can interact with some of the data using ETO tools.

How do I cite it?

Because the MAC is not publicly available, you should cite the relevant ETO tool or this documentation page instead.

Structure and content

The basic unit of the MAC is the article. For our purposes, "articles" include peer-reviewed publications, working papers, and other works appearing in journals, preprint servers, or similar venues. After deduplicating the articles, we compile data about each one from the MAC’s data sources, then structure the compiled data as a series of standard metadata fields.

Title

  • Each article can have an English-language title, a foreign-language title, both, or neither.
    • 68.00% of articles have an English title.
    • 29.76% have a foreign-language title.
    • 3.21% have no available title.
  • When our sources include multiple titles in the same language for the same article, we use the title from the most recently published article.

Abstract

  • Each article can have an English-language abstract, one or (infrequently) more foreign-language abstracts, both, or neither.
    • 33.04% of articles have an English abstract.
    • 7.80% have a foreign-language abstract.
    • 59.94% have no available abstract.
  • When our sources include multiple abstracts in the same language for the same article, we use the abstract from the most recently published article.

Title and abstract languages

  • We try to automatically detect the language of each article’s title(s) and abstract(s) using pycld2, a standard language identification library.
  • 92.08% of articles with non-null titles have a confident language label from pycld2, and 98.65% of articles with non-null abstracts have a confident language label from pycld2.
  • English is the most common language for titles and abstracts. 59.93% of articles with a title have an English title, and 41.29% of articles with an abstract have an English abstract.
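
The snippet below is a minimal sketch of this kind of language detection with the open-source pycld2 package. It is for illustration only; the thresholds and error handling used in ETO's pipeline are not published here.

    import pycld2 as cld2

    def detect_language(text):
        """Return (language_code, is_reliable) for a title or abstract, or (None, False)."""
        if not text:
            return None, False
        try:
            is_reliable, _, details = cld2.detect(text)
        except cld2.error:  # e.g., invalid UTF-8 input
            return None, False
        language_name, language_code, percent, score = details[0]
        return language_code, is_reliable

    print(detect_language("Deep learning for protein structure prediction"))  # ('en', True)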

Year

  • Each article has an article year.
  • When our sources include multiple years for the same article, we use the earliest year.

Venue

  • 75.89% of MAC articles have an associated publication venue, such as a journal (e.g., Nature) or open repository (e.g., arXiv). For the remaining 24.11%, there was no venue data in our sources.
  • When our sources list more than one venue for the same article, we currently break the tie semi-arbitrarily, using the venue with the name that is last in dictionary order.

Authors

  • For each article, our data sources include each author’s name and affiliated organization (e.g., "Jane Doe, Georgetown University"). In some cases, the sources also tag the authors or their organizations with a unique identifier code, such as a ROR code or ORCID code.
  • For each article, we compile all of the authors associated with that article in the MAC’s data sources, then identify and remove duplicate authors using their names, affiliated organizations, and identifiers (as available), resulting in a final list of authors for the article. (For deduplication purposes only, we normalize author names by removing some special characters, reversing strings separated by exactly one comma, and standardizing whitespace and formatting for author initials; a rough sketch of this normalization appears after this list.)
  • 82.61% of MAC articles have at least one associated author. For the remaining 17.39%, there was no author data in our sources.
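
The sketch below illustrates the kind of author-name normalization described above. The exact character set, initial-handling rules, and case treatment used in the MAC are not published; the choices here are assumptions.

    import re

    def normalize_author(name):
        """Rough author-name normalization for deduplication only (illustrative)."""
        name = re.sub(r"[.\u2019'\-]", " ", name)         # drop some special characters (assumed set)
        if name.count(",") == 1:                          # "Doe, Jane" -> "Jane Doe"
            last, first = name.split(",")
            name = f"{first} {last}"
        name = re.sub(r"\s+", " ", name).strip().lower()  # standardize whitespace (case-folding is an assumption)
        return name

    assert normalize_author("Doe, Jane") == normalize_author("Jane Doe")
    assert normalize_author("J. R. Doe") == normalize_author("J R Doe")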

Author organizations

  • As noted above, our data sources usually include organizational affiliations for each author of an article.
  • We use this affiliation data to identify one or more organizations for each author of a given article in the MAC. (In some cases, an author is listed on an article as affiliated with multiple organizations.)
  • The MAC’s author organization data is article-specific. For example, if Professor Doe moves from Georgetown to Oxford, she will be counted as affiliated with Georgetown for the articles she published while at Georgetown, and with Oxford for the articles she published after moving there.
  • In general, you can think of the MAC’s affiliation data as a table with three columns: one listing the specific article, one listing the name of an author of that article, and one listing the organization that author is associated with in that article. (A toy example of this structure appears after this list.)
    • If no source lists an organizational affiliation for an author, we assign that author a null organization.
  • When different sources specify different organizations for the same author of a given article, we use a set of heuristics to pick the source most likely to be accurate, and use the organization information from that source. We preferentially use data from sources with more complete links between authors and organizations and more complete data about organizations (e.g., mappings to countries or unique identifiers).
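
As an illustration of that three-column structure, a few hypothetical rows might look like the following (invented records, not real MAC data):

    affiliations = [
        {"article_id": "A1", "author": "Jane Doe",   "org": "Georgetown University"},
        {"article_id": "A1", "author": "John Smith", "org": "University of Oxford"},
        {"article_id": "A2", "author": "Jane Doe",   "org": "University of Oxford"},  # same author, later article
        {"article_id": "A2", "author": "Wei Zhang",  "org": None},  # no affiliation listed in any source
    ]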

Author organization countries

  • We try to assign a country to each author organization:
    • If the organization has a ROR identifier, we use the country specified by ROR.
    • If the organization doesn’t have a ROR identifier, we use the country specified for that organization in the data source that we used to identify the organization in the first place, if available.
    • In cases where the country provided by our data source appears to be incorrect based on an analyst’s review, we selectively apply corrections.
    • We use an internal mapping table to normalize country names (for example, assigning "USA" and "U.S." to the United States). A simplified example of this kind of mapping appears after this list.
    • If that source doesn’t specify a country, we assign a null value for this field. 79.86% of distinct organizations have no country affiliation.
  • Here (and generally in ETO resources) we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities. Specifically, any entity that has an ISO 3166-1 code is termed a "country" in ETO resources. Entities such as territories and special administrative regions may or may not appear in MAC data separately from the sovereign countries with which they are associated (if any), but if they do, they are described as "countries."
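
A toy version of the normalization mapping mentioned in the list above might look like this; the MAC's internal table is larger and is not published.

    # Toy country-name normalization (illustrative aliases only).
    COUNTRY_ALIASES = {
        "usa": "United States",
        "u.s.": "United States",
        "united states of america": "United States",
        "prc": "China",
    }

    def normalize_country(raw_name):
        if not raw_name:
            return None  # null country when the source specifies none
        cleaned = raw_name.strip().lower()
        return COUNTRY_ALIASES.get(cleaned, raw_name.strip())

    print(normalize_country("U.S."))  # United States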

Author organization types

  • Each author-affiliated organization is assigned to one of five types: Education, Company, Nonprofit, Government, or null (unknown).
  • We use sectoral data from the ROR database to assign types. We assign organizations without ROR identifiers a null value for this field.

Funding organizations

  • Some articles mention the organizations that funded the research in the article, and some of the MAC’s data sources extract these mentions from article text.
  • We compile the extracted names from these sources for each article, then normalize and deduplicate the names, resulting in a clean list of funding organizations for the article.
    • We manually review a subset of the funder names that appear most often in the corpus and manually resolve variants to single, "official" names.
    • We normalize the remaining organization names for deduplication by deaccenting, stripping punctuation, normalizing whitespace, replacing "&" with "and", machine-translating non-English names, replacing acronyms appearing in parentheses with their corresponding expansions, replacing certain shorthand terms (for example, "Univ" with "University"), and removing parenthesized expressions at the end of names. (A rough sketch of these steps appears after this list.)
    • If the normalized versions of multiple names are the same after these steps, we use the original name that occurs most often in the underlying data sources (or its translation, if the name is non-English) as the "official" name in the MAC.
  • 7.55% of articles in the MAC have at least some metadata related to funding organizations.
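
The sketch below walks through most of these normalization steps. Machine translation and acronym expansion are only indicated as comments, and the shorthand list is invented for illustration; this is not the production code.

    import re
    import unicodedata

    SHORTHAND = {"univ": "university", "natl": "national", "inst": "institute"}  # illustrative only

    def normalize_funder(name):
        """Rough funder-name normalization for deduplication (not the production rules)."""
        # (The real pipeline also machine-translates non-English names and expands
        # parenthesized acronyms before steps like these.)
        name = unicodedata.normalize("NFKD", name)
        name = "".join(c for c in name if not unicodedata.combining(c))  # deaccent
        name = name.replace("&", " and ")
        name = re.sub(r"\([^)]*\)\s*$", "", name)                        # drop trailing parenthetical
        name = re.sub(r"[^\w\s]", " ", name)                             # strip punctuation
        words = [SHORTHAND.get(w.lower(), w.lower()) for w in name.split()]
        return " ".join(words)

    assert normalize_funder("Natl. Science Foundation (NSF)") == normalize_funder("National Science Foundation")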

Funding organization countries

  • We try to associate each funding organization with a country.
  • If the organization has a ROR identifier, we use the country specified by ROR.
  • Otherwise, we assign a null value (unknown).
  • For more details on how we define "countries," see above.

Funding organization types

  • Each funding organization is assigned to one of five types: Education, Company, Nonprofit, Government, or null (unknown).
  • We use sectoral data from the ROR database to assign types. We assign organizations without ROR identifiers a null value for this field.

Citations and citation percentile

Article citations

  • Most of the MAC’s data sources include extracted article citation metadata. For each article in the source dataset, these sources provide a list of citations from that article to other articles in the dataset.
  • We map each cited and citing article in these lists to deduplicated MAC articles using their source-specific unique identifiers.
  • Using this mapping, we create two consolidated lists for each deduplicated MAC article: one listing all of the other articles the original article cites (out-citations), and one listing all of the other articles that cite the original article (in-citations).
  • Finally, we calculate a citation percentile for each article. This value compares the article’s total number of in-citations with all other articles published in the same year. For example, a 90th percentile article has more in-citations than about 90% of all articles published in the same year.
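
A minimal sketch of a within-year percentile computation follows; the MAC's exact percentile definition and tie handling may differ.

    from bisect import bisect_left

    def citation_percentiles(articles):
        """articles: dicts with 'id', 'year', and 'in_citations' (a count). Returns id -> percentile."""
        counts_by_year = {}
        for art in articles:
            counts_by_year.setdefault(art["year"], []).append(art["in_citations"])
        for counts in counts_by_year.values():
            counts.sort()
        percentiles = {}
        for art in articles:
            counts = counts_by_year[art["year"]]
            below = bisect_left(counts, art["in_citations"])  # same-year articles with strictly fewer in-citations
            percentiles[art["id"]] = 100.0 * below / len(counts)
        return percentiles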

Patent citations

  • One of the MAC’s sources, The Lens, tracks citations from patents in The Lens’s patent dataset to publications in the MAC (the patents themselves are not part of the MAC).
  • We use this metadata to count patent in-citations (i.e., citations by patents) and compute in-citation percentiles for each MAC article available in The Lens, using the same basic process as for citations between articles.

Academic subjects

  • Each MAC article is assigned general and specific academic subjects. For example, an article’s subjects might include medicine (general) or cancer research (specific).
  • Each subject has a relevance score indicating how relevant the article is to that subject. We use a set of models to generate each article's relevance score for every subject we monitor.
  • We designate the three top-scoring general subjects and three top-scoring specific subjects as the article's subjects. These article-level subject identifications are used across many ETO tools - for example, to identify cluster subjects in the RC Cluster Dataset or to generate subject-wide article counts in the Research Almanac.
  • Finally, we designate each article’s highest-scoring general subject as that article’s overall subject. This overall subject is used (for example) to generate cluster colors in the Map of Science.

Emerging subjects

  • In addition to academic subjects, each MAC article is classified as relevant or not to certain emerging technology topics, including AI and certain AI subfields.
  • We use a set of subject-specific models to apply these tags to each article. Read more >>

Unique identifiers

  • The MAC’s data sources sometimes include unique identifier codes for articles, authors, or organizations. These may be proprietary (i.e., codes used by vendors to identify articles within that vendor’s datasets only) or public.
  • We aggregate the unique identifiers associated with each article across all sources and include them in the MAC. In some cases, we also use these identifiers for deduplication.
  • Major identifiers in the MAC include DOI and ROR.

Fulltext

  • Full article text is available from two of the MAC's data sources (Semantic Scholar and arXiv). When available, this data is incorporated into the MAC.

Sources and methodology

Parts of this section are adapted from Rahkovsky et al., "AI Research Funding Portfolios and Extreme Growth" (Frontiers in Research Metrics and Analytics, 2021).

Data sources

The MAC currently includes data from:

  • Clarivate’s Web of Science platform (commercial/closed access). The MAC includes Web of Science articles published after 2005.
  • The Lens (commercial/closed access).
  • The arXiv platform for open-access scientific articles and preprints (open access). arXiv is one of our two sources of fulltext.
  • Papers with Code, a free platform for machine learning articles and related resources (open access).
  • Semantic Scholar, a large-scale open-access dataset and one of our two sources of fulltext (open access).
  • OpenAlex, a large-scale dataset widely viewed as the successor to Microsoft Academic Graph (MAG) (open access).

Some of the data in the MAC is taken directly from these sources. Other data is derived from them algorithmically, as discussed below.

Consolidating raw data from different sources

We automatically incorporate raw data into the MAC from each source weekly. (The underlying data sources are updated between daily and monthly.) We use a set of Airflow pipelines to retrieve the raw data and send it through the MAC’s merging and enrichment processes, described below.
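
As a rough illustration of this orchestration pattern, a weekly Airflow DAG could be structured as below. The DAG name, task names, and functions are hypothetical; they are not ETO's actual pipelines.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def retrieve_raw_data():
        ...  # pull the latest records from each data source

    def merge_and_enrich():
        ...  # deduplicate articles, compile metadata, run subject models

    with DAG(
        dag_id="mac_refresh_sketch",      # hypothetical name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@weekly",
        catchup=False,
    ) as dag:
        retrieve = PythonOperator(task_id="retrieve_raw_data", python_callable=retrieve_raw_data)
        enrich = PythonOperator(task_id="merge_and_enrich", python_callable=merge_and_enrich)
        retrieve >> enrich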

Deduplicating articles

There is duplication between and within the sources feeding into the MAC - articles often appear in multiple sources, or multiple times within the same source. We resolve these duplicates with an automated process.

First, we normalize every article’s title and abstract.

  • For matching purposes, we apply the Unicode Normalization Form Compatibility Composition (NFKC) standard: Unicode characters are decomposed by compatibility, then recomposed by canonical equivalence; letters are de-accented, and HTML tags, copyright signs, punctuation, numbers, and other non-alphanumeric character strings are stripped; and all white space is removed from the strings.
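
A sketch of these normalization steps in Python is below; the regular expressions are assumptions rather than the exact production rules.

    import re
    import unicodedata

    def normalize_for_matching(text):
        """Normalize a title or abstract for duplicate matching (illustrative)."""
        text = unicodedata.normalize("NFKC", text)                  # NFKC: decompose by compatibility, recompose canonically
        text = "".join(c for c in unicodedata.normalize("NFKD", text)
                       if not unicodedata.combining(c))             # de-accent letters
        text = re.sub(r"<[^>]+>", "", text)                         # strip HTML tags
        text = re.sub(r"[^A-Za-z]+", "", text)                      # drop punctuation, numbers, symbols, and whitespace
        return text.lower()                                         # case-folding is an assumption

    print(normalize_for_matching("Caf\u00e9s & <i>Machines</i> \u00a9 2021"))  # cafesmachines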

Next, we identify and remove duplicates. For deduplication purposes, we first filter out any article with a title, abstract, or DOI occurring more than 10 times in the MAC; from our evaluation, these tend to be short or generic records that cause a significant number of false matches if included. Across the articles that remain, we presume two articles are duplicates if one of the following is true (a simplified sketch of these rules appears after the list):

  • They match on at least two of the following metadata fields, excluding shared null and empty values:
    • Title (normalized)
    • Abstract (normalized)
    • References within each article
    • DOI
  • OR they match on one of the metadata fields listed above, plus one of the following metadata fields, again excluding shared null and empty values:
    • Article year
    • Author last names (normalized)
  • OR their normalized, concatenated titles and abstracts have simhash values that differ in at most two places (using a rolling window of three characters) AND they have the same article year.
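
The simplified sketch below implements the matching rules above as a pairwise test. Simhash computation is treated as a precomputed field, and the pre-filtering and blocking steps of the real pipeline are omitted; this is not the production implementation.

    STRONG_FIELDS = ["norm_title", "norm_abstract", "references", "doi"]
    WEAK_FIELDS = ["year", "norm_author_last_names"]

    def hamming(x, y):
        return bin(x ^ y).count("1")

    def field_match(a, b, field):
        va, vb = a.get(field), b.get(field)
        return bool(va) and bool(vb) and va == vb       # shared null/empty values don't count

    def is_duplicate(a, b):
        strong = sum(field_match(a, b, f) for f in STRONG_FIELDS)
        weak = sum(field_match(a, b, f) for f in WEAK_FIELDS)
        if strong >= 2:
            return True
        if strong >= 1 and weak >= 1:
            return True
        # simhash of normalized title + abstract differs in at most two bit positions,
        # and the article years match
        if field_match(a, b, "year") and hamming(a["simhash"], b["simhash"]) <= 2:
            return True
        return False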

We chose this method after iteratively testing different variants against a "ground truth" dataset that included true duplicates (pairs of documents confirmed by human annotators to be different instances of the same article) and false duplicates (pairs of apparently similar documents confirmed by human annotators not to be different instances of the same article). For more details on the deduplication process, visit our GitHub repo.

Compiling article metadata

After we deduplicate articles, we link each one to the metadata associated with it or any of its duplicates in the MAC’s underlying data sources. For most metadata fields, this is a relatively straightforward process of aggregation; see above for details pertaining to each field.

However, the MAC also includes some article-level metadata fields not present in any of the underlying sources. These include academic subjects as well as emerging subjects such as AI.

Identifying academic subjects

We score each article for relevance to general and specific academic subjects using the method described in Toney-Wails and Dunham (2022).

In short:

  • We begin with a taxonomy of academic subjects, including 19 general subjects (such as biology or computer science) and 294 more specific subjects (such as radiochemistry, media studies, or nanotechnology). This taxonomy was originally developed by Microsoft researchers by extracting scientific concepts from Wikipedia.
  • We generate text embeddings for each subject from the Wikipedia articles on each subject.
  • We then generate a similar embedding for each article in the MAC.
  • Finally, we calculate the similarity between the article embedding and each field embedding, and assign the article a corresponding relevance score for each subject.
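
Conceptually, the scoring step can be sketched as cosine similarity between an article embedding and precomputed subject embeddings. This is an illustration of the idea, not the exact model from Toney-Wails and Dunham (2022).

    import numpy as np

    def subject_scores(article_vec, subject_vecs):
        """subject_vecs: dict mapping subject name -> embedding vector (precomputed)."""
        scores = {}
        for subject, vec in subject_vecs.items():
            cosine = np.dot(article_vec, vec) / (np.linalg.norm(article_vec) * np.linalg.norm(vec))
            scores[subject] = float(cosine)
        return scores

    def top_subjects(scores, k=3):
        """The k highest-scoring subjects (e.g., an article's three general subjects)."""
        return sorted(scores, key=scores.get, reverse=True)[:k]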

Note that although this model-driven approach is reliable in the aggregate (see Toney-Wails and Dunham (2022) for details), it occasionally produces questionable results for individual articles - that is, it may assign an article a high relevance score for a field that is not actually highly relevant to the article (as judged by a human), or a low score for a field that is relevant.

Identifying relevance to AI and other emerging subjects

We take a different approach to identify articles relevant to artificial intelligence, computer vision, cybersecurity, natural language processing, robotics, and AI safety. Tags for these subjects are currently used in the Research Almanac and the subjects filter in the Map of Science.

Emerging technology topics have fuzzy boundaries; there's no objectively correct answer to whether a particular article is "AI safety" research (for example). For each emerging topic, we try to capture articles in the MAC that subject matter experts would consider highly relevant to the topic in question. We use different methods to identify these articles depending on the topic, and we evaluate our results against "ground truth" corpora that also vary by topic. Still, it's important to note that this process inevitably involves some judgment calls. In addition, we rely on statistical models to apply the topic tags.

For both reasons, analytic results derived from the MAC's emerging technology topic tags are necessarily imprecise and should be interpreted as estimates.

To classify articles as relevant or not to artificial intelligence, computer vision, natural language processing, and robotics, we use a set of machine learning models trained on arXiv data. Articles in arXiv include subject tags that are initially provided by arXiv authors and revised by arXiv editors as appropriate. These include tags for artificial intelligence, computer vision, natural language processing, and robotics. For each of these categories, we trained a separate SPECTER model on the titles and abstracts of the tagged arXiv articles. Then, we ran each model over all of the other articles in the MAC with English titles or abstracts, assigning each one its own set of tags. (For more details about how we developed, evaluated, and deployed earlier, conceptually similar versions of these models, see Dunham et al. (2020).)
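
One simplified way to approximate this setup is to embed titles and abstracts with a pretrained SPECTER encoder and train a lightweight classifier on the arXiv-tagged examples, as sketched below. CSET's production models are fine-tuned SPECTER classifiers; the embedding-plus-logistic-regression approach here is only an approximation for illustration.

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    encoder = SentenceTransformer("allenai-specter")  # pretrained SPECTER encoder

    def embed(title, abstract):
        # SPECTER expects the title and abstract joined by a [SEP] token
        return encoder.encode(f"{title}[SEP]{abstract or ''}")

    def train_tag_classifier(train_pairs, labels):
        """train_pairs: (title, abstract) tuples from arXiv; labels: 1 if the relevant arXiv tag is present."""
        X = [embed(t, a) for t, a in train_pairs]
        return LogisticRegression(max_iter=1000).fit(X, labels)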

(In addition, for purposes of the Research Almanac only, we tag an article as an "AI + [academic subject]" article (for example, AI + pharmacology or AI + genetics) if it is classified as an AI article and is assigned the academic subject in question.)

Finally, we tag articles as relevant or not to AI safety. We consider an article an "AI safety article" if it (a) is categorized as relevant to AI and (b) is categorized by our AI safety classifier as relevant to AI safety.

The field of AI safety research is young, quickly evolving, and poorly defined (even compared to other emerging topics), with no authoritative and comprehensive compilations of "AI safety articles" to refer to. Our classifier systematically identifies AI safety articles in the MAC, but we caution that the results are inherently imprecise; different methods could produce different results.

To develop the AI safety classifier, we began by creating our own definition of AI safety research encompassing safety-related concepts such as robustness, misspecification, unwanted bias, explainability, and value alignment. In parallel, we compiled a set of articles potentially related to AI safety, such as MAC articles categorized as relevant to AI in general and recent articles from safety-relevant AI conferences, workshops, and open-source repositories. ETO staff read the titles and abstracts of 2806 of these articles, then manually annotated each one as relevant or not to AI safety according to our definition. (To measure the stability of the definition in practice, 256 articles were independently double-annotated by other CSET researchers; intercoder agreement was 75%.)

We then used Snorkel Flow, a platform for developing models under programmatic weak supervision, to train the model. (For more information on this approach, see Snorkel: rapid training data creation with weak supervision and A Survey on Programmatic Weak Supervision). We ingested metadata for the 2806 manually annotated articles, plus 15,000 unlabeled articles marked relevant to AI by our AI classifier, into the Snorkel Flow platform. We split this data into 15% development, 15% validation, and 70% training sets. We then developed 83 labeling functions using the development set. 69 of these functions were keyword, regular expression, or time interval matches based on the values of the article title, abstract, publication venue, or publication years. 14 labeling functions were based on articles' membership in clusters derived from a custom support vector machine trained over word embeddings of the article titles and abstracts. These labeling functions had coverage of 99% of the development set, conflict of 10.5%, and label density of 5.329.
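
For readers unfamiliar with this approach, the snippet below shows what keyword- and interval-style labeling functions look like using the open-source snorkel library. ETO used the commercial Snorkel Flow platform, and the keywords, labels, and cutoff here are invented examples, not any of the actual 83 functions.

    from snorkel.labeling import labeling_function

    RELEVANT, NOT_RELEVANT, ABSTAIN = 1, 0, -1

    @labeling_function()
    def lf_safety_keywords(article):
        # invented keywords for illustration
        text = f"{article.title} {article.abstract}".lower()
        return RELEVANT if "value alignment" in text or "reward hacking" in text else ABSTAIN

    @labeling_function()
    def lf_early_publication(article):
        # time-interval heuristic (hypothetical cutoff year)
        return NOT_RELEVANT if article.year and article.year < 2010 else ABSTAIN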

Through the Snorkel Flow platform, we used these labeling functions to create "weak" AI safety relevance labels for the data we ingested. We then trained an AutoML-tuned logistic regression model over the weak labels. This resulted in a model that achieved a macro-averaged F1 of 82.5% on the validation set, with precision of 73.1%, recall of 79.8%, and F1 of 76.3% over articles manually labeled relevant to AI safety.

Known limitations

  • The MAC doesn’t cover non-public research. The MAC only includes research that has been publicly released and is included in one of our data sources. We believe these sources cover a significant fraction of publicly released research worldwide (and in particular, the large majority of published research in English), but a great deal of research is never made public. In particular, military research and commercial research may never be written up in open sources, or even written up at all. We have no way of knowing how much of this "hidden" research exists, or how different the MAC would look if it were somehow included.
  • The MAC’s sources may introduce problems. The MAC incorporates articles from many different datasets, making it a uniquely comprehensive dataset on worldwide research. No dataset is perfect, though. Errors, gaps, and judgment calls in the MAC’s underlying datasets are likely to make it into the MAC itself.
  • The MAC’s merging and enrichment processes may introduce errors. The MAC uses automated processes to combine articles from different datasets and create metadata about them. These processes usually work well, but they do have limitations, especially when there are issues with the raw article data. These limitations include:
    • Metadata translation is incomplete. So far, we have fully translated only the names of funding organizations; other fields have been translated only in part. In particular, translating all article titles and abstracts is cost-prohibitive.
    • Deduplication is less effective when articles have limited metadata. Our method of deduplicating articles across corpora relies on the presence of six metadata fields. If some of these fields are absent, we are less likely to successfully merge the articles. Additionally, if an article’s title or abstract appears in multiple languages across datasets, we will have to rely on that article’s other metadata to perform a successful match.
    • Named entities may not be fully resolved. Named entities, such as organizations, authors, and articles, are often given different names across or within our data sources. For example, Google may be called "Google" in one article, "Google Cloud Services" in another, "Google UK" in a third, and "AI Team - Google Research" in a fourth. For the MAC, we use different processes to "resolve" these different names where appropriate, as described above, but we can’t catch all of them. This could affect some calculations using the MAC. For example, an author’s articles could be split across multiple versions of the author’s name, making it seem like that author has written fewer articles than she really has.
    • Sub-organizations are not linked to their parents. Some organizations in the MAC are parts of other organizations. For example, an article might have some authors associated with "Google Research" and others associated with "Google Cloud Services." In some contexts, users might want to group these organizations (and their authors) together under "Google." The MAC doesn’t group organizations like this.
  • The MAC’s coverage of Chinese publications is incomplete. Although the MAC includes many Chinese publications, many others are only available in China-based journals that are not included in our data sources. (Earlier versions of the MAC had better coverage of these sources, but unfortunately, ETO and many other organizations outside China are no longer able to access them.) Use particular caution when interpreting MAC-derived data related to Chinese research organizations, funders, or authors.
  • The MAC’s article funding data is incomplete. Articles often list their funders (for example: "This research was supported by the National Science Foundation under grant XYZ…"). In many, but not all cases, the MAC’s data sources extract these mentions as metadata linking funders to the relevant articles. The MAC’s funder data is derived from this metadata. But because not all funder mentions are extracted as metadata, the funder data in the MAC is not complete.

Maintenance

How are the data updated?

We update the MAC through a sequence of automated pipelines that retrieve data from our sources, merge it together, and enrich it. These pipelines normally run weekly, with occasional pauses to resolve issues due to vendor data changes or failures of automated checks in the pipelines.

The underlying data sources are updated on their own schedules, between daily and monthly.

Credits

Virtually all CSET data team members have contributed to the MAC in some form, whether by providing feedback, reviewing code, or helping generate ideas. In alphabetical order, some particular contributions follow:

  • Daniel Chou: Chinese-language data parsing and normalization
  • James Dunham: Article classifier development, field of study modeling, citation percentile calculation, organizational entity resolution, metadata merge
  • Rebecca Gelles: Organizational entity resolution, metadata merge
  • Jennifer Melot: Article linkage, metadata merge, data orchestration
  • Ilya Rahkovsky: Article linkage, metadata merge
  • Christian Schoeberl: Article classifier development
  • Autumn Toney-Wails: Subject modeling

Student research assistants Chenxi Liu, Luwei Lei, and Jerod Sun contributed data characterization and normalization. Zach Arnold helped create this documentation.

Major change log

11/22/23: Updates related to new underlying data sources and subject classifiers
5/19/23: New emerging topics added as part of Research Almanac launch
10/13/22: Initial release (ETO/CSET internal)