Documentation: Merged Academic Corpus

Overview

The Merged Academic Corpus is not publicly available. This documentation is provided as context for the ETO products that use the dataset.

What is this dataset?

The Merged Academic Corpus (MAC) contains detailed information on over 270 million scholarly articles. It combines data from public and private sources to achieve an unmatched view of the global literature, covering every country in the world and the large majority of known articles. The MAC is maintained by CSET and ETO and is not publicly available due to licensing restrictions.

Which ETO products use it?

MAC data underpins several ETO tools, including the Map of Science and the Research Almanac.

What are its sources?

The MAC currently includes data from six commercial and open-access platforms, plus additional metadata derived from those platforms using CSET algorithms. Read more >>

What are its main limitations?

  • The MAC doesn’t cover non-public research. It only includes research that has been publicly released and is included in one of our data sources. Read more >>

  • The MAC’s sources may introduce problems. Errors, gaps, and judgment calls in the MAC’s underlying datasets are likely to make it into the MAC itself. Read more >>

  • The MAC’s merging and enrichment processes may introduce errors. The MAC uses automated processes to combine articles from different sources and link them to useful metadata. These processes usually work well, but there may be errors in some cases. Read more >>

What are the terms of use?

Because this dataset contains licensed data from commercial providers, it is not publicly available. However, you can interact with some of the data using ETO tools.

How do I cite it?

Because the MAC is not publicly available, you should cite the relevant ETO tool or this documentation page instead.

Structure and content

The basic unit of the MAC is the article. For our purposes, “articles” include peer-reviewed publications, working papers, and other works appearing in journals, preprint servers, or similar venues. After deduplicating the articles, we compile data about each one from the MAC’s data sources, then structure the compiled data as a series of standard metadata fields.

Title

  • Each article can have an English-language title, a foreign-language title, both, or neither.
    • 67.05% of articles have an English title.
    • 40.86% have a foreign-language title.
    • Fewer than 0.01% have no available title.
  • When our sources include multiple titles in the same language for the same article, we use the title that is last in dictionary order.

Abstract

  • Each article can have an English-language abstract, one or (infrequently) more foreign-language abstracts, both, or neither.
    • 41.76% of articles have an English abstract.
    • 21.67% have a foreign-language abstract.
    • 44.00% have no available abstract.
  • When our sources include multiple abstracts in the same language for the same article, we currently break the tie semi-arbitrarily, using the abstract that (taken as a whole) is last in dictionary order.
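
When multiple candidate strings exist, this tie-break amounts to taking the maximum under ordinary string comparison. Below is a minimal sketch of the rule as applied to titles and abstracts; the helper name and example records are illustrative, not the MAC's actual code.

```python
def pick_last_in_dictionary_order(candidates):
    """Break ties among same-language titles (or abstracts) by keeping
    the string that sorts last in dictionary order."""
    non_null = [c for c in candidates if c]
    return max(non_null) if non_null else None

# Example: three sources report slightly different English titles.
titles = ["Deep learning for X", "Deep Learning for X", None]
print(pick_last_in_dictionary_order(titles))  # -> "Deep learning for X"
```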

Title and abstract languages

  • We try to automatically detect the language of each article’s title(s) and abstract(s) using pycld2, a standard language identification library.
  • 93.82% of articles with non-null titles have a confident language label from pycld2, and 98.80% of articles with non-null abstracts have a confident language label from pycld2.
  • English is the most common language for titles and abstracts. 67.05% of articles have an English title, and 41.76% of articles have an English abstract.
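
As a point of reference, here is a minimal sketch of how pycld2 can be applied to a title or abstract. Treating pycld2's reliability flag as the test for a "confident" label, and the handling of missing text, are illustrative assumptions rather than the MAC's exact code.

```python
import pycld2 as cld2

def detect_language(text):
    """Return a language code for `text`, or None when pycld2's detection
    is not flagged as reliable (our stand-in for "confident" here)."""
    if not text:
        return None
    is_reliable, _bytes_found, details = cld2.detect(text)
    if not is_reliable:
        return None
    _lang_name, lang_code, _percent, _score = details[0]
    return lang_code

print(detect_language("Machine translation of scholarly abstracts"))  # -> "en"
```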

Year

  • Each article has an article year.
  • When our sources include multiple years for the same article, we use the earliest year.

Venue

  • 69.45% of MAC articles have an associated publication venue, such as a journal (e.g., Nature) or open repository (e.g., arXiv). For the remaining 30.55%, there was no venue data in our sources.
  • When our sources list more than one venue for the same article, we currently break the tie semi-arbitrarily, using the venue with the name that is last in dictionary order.

Authors

  • For each article, our data sources include each author’s name and affiliated institution (e.g., “Jane Doe, Georgetown University”). In some cases, the sources also tag the authors or their institutions with a unique identifier code.
  • For each article, we compile all of the authors associated with that article in the MAC’s data sources, then identify and remove duplicate authors using their names, affiliated institutions, and identifiers (as available), resulting in a definitive list of authors for the article. (For deduplication purposes only, we normalize author names by removing some special characters, reversing strings separated by exactly one comma, and standardizing whitespace and formatting for author initials; a rough sketch of this normalization appears after this list.)
  • 90.05% of MAC articles have at least one associated author. For the remaining 9.95%, there was no author data in our sources.
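
A rough sketch of the author-name normalization described above, used for deduplication only. The exact character set, initial-formatting, and casing rules in the MAC may differ; this illustrates the general approach.

```python
import re

def normalize_author_name(name):
    """Normalize an author name for deduplication purposes:
    drop some special characters, flip "Last, First" into "First Last"
    when exactly one comma is present, space out run-together initials,
    and standardize whitespace."""
    name = re.sub(r"[^\w\s,.\-']", "", name)            # remove special characters
    if name.count(",") == 1:                             # "Doe, Jane" -> "Jane Doe"
        last, first = [part.strip() for part in name.split(",")]
        name = f"{first} {last}"
    name = re.sub(r"\.(?=\S)", ". ", name)               # "J.K." -> "J. K."
    return re.sub(r"\s+", " ", name).strip().lower()

print(normalize_author_name("Doe, J.K."))   # -> "j. k. doe"
print(normalize_author_name("Jane   Doe"))  # -> "jane doe"
```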

Author institutions

  • As noted above, our data sources usually include institutional affiliations for each author of an article.
  • We use this affiliation data to identify one or more institutions for each author of a given article in the MAC. (In some cases, an author is listed on an article as affiliated with multiple institutions.)
  • The MAC’s author institution data is article-specific. For example, if Professor Doe moves from Georgetown to Oxford, she will be counted as affiliated with Georgetown for the articles she published while at Georgetown, and with Oxford for the articles she published after moving there.
  • In general, you can think of the MAC’s affiliation data as a table with three columns: one listing the specific article, one listing the name of an author of that article, and one listing the institution that author is associated with in that article.
    • If no source lists an institutional affiliation for an author, we assign that author a null institution.
  • When different sources specify different institutions for the same author of a given article, we use a set of heuristics to pick the source most likely to be accurate, and use the institution information from that source. We preferentially use data from sources with more complete links between authors and institutions and more complete data about institutions (e.g., mappings to countries or unique identifiers).

Author institution countries

  • We try to assign a country to each author institution (a minimal sketch of this logic appears after this list):
    • If the institution has a GRID identifier, we use the country specified by GRID.
    • If the institution doesn’t have a GRID identifier, we use the country specified for that institution in the data source that we used to identify the institution in the first place, if available.
    • We use an internal mapping table to normalize country names (for example, assigning “USA” and “U.S.” to the United States).
    • If that source doesn’t specify a country, we assign a null value for this field. 75.07% of distinct institutions have no country affiliation.
  • Here (and generally in ETO resources) we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities. Specifically, any entity that has an ISO 3166-1 code is termed a "country" in ETO resources.
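
The country assignment described above reduces to a short fallback chain plus name normalization. A minimal sketch, assuming simple lookups for the GRID country, the source-provided country, and the alias table (all names below are illustrative):

```python
# Illustrative alias table; the MAC uses an internal mapping.
COUNTRY_ALIASES = {"USA": "United States", "U.S.": "United States", "UK": "United Kingdom"}

def assign_institution_country(grid_country=None, source_country=None):
    """Fallback chain: GRID country, then the country given by the source
    that identified the institution, then null. Names are normalized
    against an alias table."""
    raw = grid_country or source_country
    return COUNTRY_ALIASES.get(raw, raw) if raw else None

print(assign_institution_country(grid_country="U.S."))       # -> "United States"
print(assign_institution_country(source_country="Germany"))  # -> "Germany"
print(assign_institution_country())                          # -> None
```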

Author institution types

  • Each author-affiliated institution is assigned one of four types (Education, Company, Nonprofit, or Government) or a null value (unknown).
  • We use sectoral data from the GRID database to assign types. We assign institutions without GRID identifiers a null value for this field.

Funding institutions

  • Some articles mention the organizations that funded the research in the article, and some of the MAC’s data sources extract these mentions from article text.
  • We compile the extracted names from these sources for each article, then normalize and deduplicate the names, resulting in a clean list of funding institutions for the article.
    • We normalize institution names for deduplication by deaccenting, stripping punctuation, normalizing whitespace, replacing "&" with "and", machine-translating non-English names, replacing acronyms appearing in parentheses with their corresponding expansions, and removing parenthesized expressions at the end of names. If multiple names match after being normalized in this way, we use the name that occurs most often in the underlying data sources (or its translation, if the name is non-English) as the “official” name in the MAC. A simplified sketch of these normalization steps appears after this list.
  • 11.55% of articles in the MAC have at least some metadata about their funding institutions.
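
A simplified sketch of the funder-name normalization. It omits machine translation and acronym expansion, which require external resources, and the exact regular expressions are illustrative rather than the MAC's actual code.

```python
import re
import unicodedata

def normalize_funder_name(name):
    """Simplified funder-name normalization for deduplication:
    deaccent, replace "&" with "and", drop a trailing parenthesized
    expression, strip punctuation, and collapse whitespace."""
    name = "".join(ch for ch in unicodedata.normalize("NFKD", name)
                   if not unicodedata.combining(ch))      # deaccent
    name = name.replace("&", "and")
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name)           # drop trailing "(...)"
    name = re.sub(r"[^\w\s]", " ", name)                   # strip punctuation
    return re.sub(r"\s+", " ", name).strip().lower()

print(normalize_funder_name("Deutsche Forschungsgemeinschaft (DFG)"))
# -> "deutsche forschungsgemeinschaft"
```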

Funding institution countries

  • Each funding institution is associated with a country.
  • If the institution has a GRID identifier, we use the country specified by GRID.
  • If the institution doesn’t have a GRID identifier, but was extracted from a Chinese-language article, we assign the institution to China.
  • Otherwise, we assign a null value (unknown).

Funding institution types

  • Each funding institution is assigned one of four types (Education, Company, Nonprofit, or Government) or a null value (unknown).
  • We use sectoral data from the GRID database to assign types. We assign institutions without GRID identifiers a null value for this field.

Citations and citation percentile

Article citations

  • Most of the MAC’s data sources include extracted article citation metadata. For each article in the source dataset, these sources provide a list of citations from that article to other articles in the dataset.
  • We map each cited and citing article in these lists to deduplicated MAC articles using their source-specific unique identifiers.
  • Using this mapping, we create two consolidated lists for each deduplicated MAC article: one listing all of the other articles the original article cites (out-citations), and one listing all of the other articles that cite the original article (in-citations).
  • Finally, we calculate a citation percentile for each article. This value compares the article’s total number of in-citations with those of all other articles published in the same year. For example, a 90th percentile article has more in-citations than about 90% of all articles published in the same year.
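
A minimal sketch of the percentile calculation, comparing an article's in-citation count against the other articles from its publication year. The exact percentile convention in the MAC (for example, how ties are handled) may differ.

```python
def citation_percentile(in_citations, same_year_counts):
    """Percent of same-year articles with strictly fewer in-citations."""
    if not same_year_counts:
        return None
    below = sum(1 for c in same_year_counts if c < in_citations)
    return 100.0 * below / len(same_year_counts)

# Example: an article with 42 in-citations in a cohort of five same-year articles.
print(citation_percentile(42, [0, 3, 10, 42, 120]))  # -> 60.0
```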

Patent citations

  • One of the MAC’s sources, Digital Science Dimensions, includes extracted patent citation metadata in addition to extracted article citation metadata. This metadata works the same way as the article citation metadata, but tracks citations to and from patents in Digital Science’s patent dataset (the patents themselves are not part of the MAC).
  • We use this metadata to count patent in-citations (citations by patents) and out-citations (citations to patents) and to calculate patent in-citation percentiles for each MAC article available in Dimensions, using the same basic process as for citations between articles.

Subjects

  • Each MAC article is assigned a list of academic subjects at different levels of generality. For example, an article’s subjects might include medicine (general) or cancer research (specific).
  • Each subject in the list has a score indicating how relevant the article is to that subject.
  • We use a set of models to generate each article's relevance score for every subject in the list.
  • We designate the three top-scoring general subjects and six top-scoring specific subjects as the article's overall subjects. (This designation is used to identify cluster subjects in the RC Cluster Dataset.)
  • Finally, we designate each article’s highest-scoring general subject as that article’s overall subject. This overall subject is used (for example) to generate cluster colors in the Map of Science.
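
A minimal sketch of how the top-scoring subjects might be selected from the per-subject relevance scores (the data structure, field names, and scores are illustrative):

```python
def top_subjects(scores, level, top_n):
    """Return the `top_n` highest-scoring subjects at the given level.
    `scores` maps subject -> (level, relevance score)."""
    at_level = [(name, s) for name, (lvl, s) in scores.items() if lvl == level]
    return [name for name, _ in sorted(at_level, key=lambda x: -x[1])[:top_n]]

scores = {
    "medicine": ("general", 0.81),
    "biology": ("general", 0.65),
    "computer science": ("general", 0.12),
    "cancer research": ("specific", 0.77),
    "oncology": ("specific", 0.70),
}
general = top_subjects(scores, "general", 3)    # overall general subjects
specific = top_subjects(scores, "specific", 6)  # overall specific subjects
overall = general[0] if general else None       # the article's single overall subject
print(general, specific, overall)
```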

Emerging topics

  • Each MAC article is classified as relevant or not to various emerging topics, including AI and certain AI subfields.
  • We use different methods to apply these tags depending on the topic, including machine learning models and keyword searches. Read more >>

Unique identifiers

  • The MAC’s data sources sometimes include unique identifier codes for articles, authors, or institutions. These may be proprietary (i.e., codes a vendor uses to identify articles within its own datasets only) or public.
  • We aggregate the unique identifiers associated with each article across all sources and include them in the MAC. In some cases, we also use these identifiers for deduplication.
  • Major identifiers in the MAC include DOI (public), ROR (public), and GRID (proprietary).

Sources and methodology

Parts of this section are adapted from Rahkovsky et al., “AI Research Funding Portfolios and Extreme Growth” (Frontiers in Research Metrics and Analytics, 2021).

Data sources

The MAC currently includes data from:

  • Dimensions, an inter-linked research information system provided by Digital Science (commercial/closed access).
  • Clarivate’s Web of Science platform (commercial/closed access). The MAC includes Web of Science articles published after 2005.
  • Microsoft Academic Graph (open access). This resource was discontinued in 2021. The MAC includes all of its data up to that point, with the exception of patents and academic datasets.
  • China National Knowledge Infrastructure (CNKI), the leading Chinese-language academic dataset (commercial/closed access). The MAC includes CNKI articles published after 2005.
  • The arXiv platform for open-access scientific articles and preprints (open access).
  • Papers with Code, a free platform for machine learning articles and related resources (open access).
  • OpenAlex, a large-scale open-access dataset viewed as the replacement for MAG. We only use OpenAlex to help deduplicate articles that occur in our other sources - the MAC doesn’t currently include any OpenAlex article metadata.

Some of the data in the MAC is taken directly from these sources. Other data is derived from them algorithmically, as discussed below.

Collection, processing, and enrichment

Consolidating raw data from different sources

We automatically incorporate raw data into the MAC from each source weekly. (Most of the underlying sources are updated at least weekly. Exceptions are the Chinese-language data, which are updated quarterly, and the MAG data, which are no longer updated.) We use a set of Airflow pipelines to retrieve the raw data and send it through the MAC’s merging and enrichment processes, described below.
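
A highly simplified sketch of what one such weekly pipeline might look like in Airflow. The DAG name, task functions, and schedule below are hypothetical; CSET's actual pipelines are more involved.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def retrieve_raw_data():
    ...  # hypothetical: pull the latest dump from one of the MAC's sources

def merge_and_enrich():
    ...  # hypothetical: run deduplication and metadata enrichment

with DAG(
    dag_id="mac_weekly_update",        # hypothetical DAG name
    schedule="@weekly",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    retrieve = PythonOperator(task_id="retrieve_raw_data", python_callable=retrieve_raw_data)
    enrich = PythonOperator(task_id="merge_and_enrich", python_callable=merge_and_enrich)
    retrieve >> enrich
```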

Deduplicating articles

There is duplication between and within the six sources feeding into the MAC - articles often appear in multiple sources, or multiple times within the same source. We resolve these duplicates with an automated process:

  1. Normalize every article’s title and abstract.
    • For matching purposes, we apply the Unicode Normalization Form Compatibility Composition (NFKC) standard: Unicode characters are decomposed by compatibility, then recomposed by canonical equivalence; letters are de-accented and HTML tags, copyright signs, punctuation, numbers, and non-alphanumeric character strings are stripped; and all white space is removed from the strings.
  2. Identify and remove duplicates. Articles are presumed to be duplicates if one of the following is true:
    • They match on at least three of the following characteristics, excluding shared null and empty values:
      • Title (normalized)
      • Abstract (normalized)
      • Article year
      • Author last names (normalized)
      • References within each article (checked only if the articles are from the same underlying data source, because different sources each record references with their own system)
      • DOI
    • Their normalized, concatenated titles and abstracts have simhash values that differ in at most two places (using a rolling window of three characters) AND they have the same article year.

For more details on this process and public code, visit our GitHub repo.
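
A rough sketch of the two matching checks is below. It uses Python's built-in Unicode handling and the open-source simhash package, which may differ from the implementation in CSET's repo; the three-character rolling window and the distance and field-count thresholds follow the description above.

```python
import re
import unicodedata

from simhash import Simhash  # open-source package; CSET's implementation may differ

def normalize(text):
    """NFKC-normalize, de-accent, and strip non-alphanumeric characters and whitespace."""
    text = unicodedata.normalize("NFKC", text or "")
    text = "".join(ch for ch in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]", "", text.lower())

def shingles(text, n=3):
    """Rolling window of three characters."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def simhash_match(a, b, max_distance=2):
    """True if the normalized title+abstract simhashes differ in at most
    two bit positions AND the article years match."""
    if a["year"] != b["year"]:
        return False
    ha = Simhash(shingles(normalize(a["title"] + a["abstract"])))
    hb = Simhash(shingles(normalize(b["title"] + b["abstract"])))
    return ha.distance(hb) <= max_distance

def field_match(a, b, min_matches=3):
    """True if the articles agree on at least three non-null fields.
    (References are omitted here; they are checked only within a single source.)"""
    fields = ["title_norm", "abstract_norm", "year", "last_names_norm", "doi"]
    shared = sum(1 for f in fields if a.get(f) and a.get(f) == b.get(f))
    return shared >= min_matches
```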

Compiling article metadata

After we deduplicate articles, we link each one to the metadata associated with it or any of its duplicates in the MAC’s underlying data sources. For most metadata fields, this is a relatively straightforward process of aggregation; see above for details pertaining to each field.

However, the MAC also includes some article-level metadata fields not present in any of the underlying sources. These include article subjects and emerging topic tags.

Identifying academic subjects

We assign each article a set of general and specific academic subjects using a method originally developed by Microsoft Academic. In short, Microsoft researchers built a taxonomy of research subjects by extracting scientific concepts from Wikipedia, then generated text embeddings for each subject from the relevant Wikipedia articles. They then generated a similar embedding for each article in their own academic corpus. Finally, they calculated the similarity between the article embedding and each field embedding, and assigned the article a corresponding relevance score for each subject.
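
For intuition, the scoring step can be thought of as an embedding similarity calculation. Below is a minimal sketch using cosine similarity; the specific similarity measure and embedding models belong to Microsoft's method, so this code is an illustration under that assumption, not their implementation.

```python
import numpy as np

def relevance_scores(article_embedding, subject_embeddings):
    """Score an article against each subject as the cosine similarity
    between the article embedding and that subject's embedding."""
    a = article_embedding / np.linalg.norm(article_embedding)
    return {subject: float(a @ (emb / np.linalg.norm(emb)))
            for subject, emb in subject_embeddings.items()}
```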

For the current version of the MAC, we used Microsoft’s method and models to calculate similarity scores for every English-language MAC article. Microsoft’s models only work on English text, but Chinese is by far the most common non-English language in the MAC, and the vendor that provides the MAC's Chinese-language data includes CLC topic codes for each article. We mapped CLC codes to the subjects in Microsoft's taxonomy, then used this mapping to "convert" each article's CLC assignments into subjects comparable to the English-language articles in our corpus. (This is necessary only for articles available solely in Chinese; articles with both English and Chinese text, which are common, are processed with Microsoft's models as usual.)

Microsoft stopped updating its subject models in 2021. We are currently developing our own subject-assignment models (using the same basic method described above) and will integrate them into the MAC in the near future.

Identifying relevance to emerging technology topics

In addition to academic subjects, the MAC tags each article as relevant or not relevant to various emerging topics. These tags are currently used in the Research Almanac and for the AI-related filters in the Map of Science (certain tags only).

Emerging technology topics have fuzzy boundaries; there's no objectively correct answer to whether a particular article is "AI safety" research (for example). For each emerging topic, we try to capture articles in the MAC that subject matter experts would consider highly relevant to the topic in question. We use different methods to identify these articles depending on the topic, and we evaluate our results against "ground truth" corpora that also vary by topic. Still, it's important to note that this process inevitably involves some judgment calls. In addition, we rely on statistical models to apply the topic tags. For both reasons, analytic results derived from the MAC's emerging technology topic tags are necessarily imprecise and should be interpreted as estimates.

AI, computer vision, NLP, robotics

We tag each Chinese- or English-language article as relevant to artificial intelligence or not, and classify each English-language article as relevant or not to computer vision, natural language processing, and robotics.

To classify English-language articles, we use a set of machine learning models trained on arXiv data. Each article in arXiv includes author-provided subject tags, including tags for artificial intelligence, computer vision, natural language processing, and robotics. We trained a separate SciBERT model on the titles and abstracts of the arXiv articles tagged with each subject. Then, we ran each model over all of the other English-language articles in the MAC, assigning each one its own set of tags. For more details about how we developed, evaluated, and deployed the models, see Dunham et al. (2020).

SciBERT models only work on English text. To classify articles without English-language abstracts, we searched each article for a list of English and Chinese keywords related to artificial intelligence. Any article with at least one of the keywords in its title or abstract was classified as AI-relevant.
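
A minimal sketch of this keyword-based fallback follows. The handful of terms below is illustrative only; the actual keyword list is much longer and was curated separately.

```python
# Illustrative terms only; the real keyword list is far longer.
AI_KEYWORDS = ["artificial intelligence", "machine learning", "deep learning",
               "neural network", "人工智能", "机器学习", "深度学习"]

def keyword_ai_relevant(title, abstract):
    """Tag an article as AI-relevant if any keyword appears in its title
    or abstract (used for articles without English-language abstracts)."""
    text = f"{title or ''} {abstract or ''}".lower()
    return any(kw in text for kw in AI_KEYWORDS)

print(keyword_ai_relevant("基于深度学习的图像识别方法", None))  # -> True
```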

AI safety

The field of "AI safety" research is young, quickly evolving, and poorly defined (even compared to other emerging topics), with no authoritative and comprehensive compilations of "AI safety articles" to refer to. We use a machine learning classifier to systematically identify AI safety articles in the MAC, but caution that the results are inherently imprecise; different methods could produce different results.

In short, we consider an article an "AI safety article" if it (a) is categorized as relevant to AI and (b) is categorized by our AI safety classifier as relevant to AI safety.

To develop the AI safety classifier, we began by creating our own definition of AI safety research encompassing safety-related concepts such as robustness, misspecification, unwanted bias, explainability, and value alignment. In parallel, we compiled a set of articles potentially related to AI safety, such as MAC articles categorized as relevant to AI in general and recent articles from safety-relevant AI conferences, workshops, and open-source repositories. ETO staff read the titles and abstracts of 2,806 of these articles, then manually annotated each one as relevant or not to AI safety according to our definition. (To measure the stability of the definition in practice, 256 articles were independently double-annotated by other CSET researchers; intercoder agreement was 75%.)

We then used Snorkel Flow, a platform for developing models under programmatic weak supervision, to train the model. (For more information on this approach, see Snorkel: rapid training data creation with weak supervision and A Survey on Programmatic Weak Supervision.) We ingested metadata for the 2,806 manually annotated articles, plus 15,000 unlabeled English-language articles marked relevant to AI by our AI classifier, into the Snorkel Flow platform. We split this data into 15% development, 15% validation, and 70% training sets. We then developed 83 labeling functions using the development set. Of these, 69 were keyword, regular expression, or time interval matches based on the values of the article title, abstract, publication venue, or publication years. The other 14 labeling functions were based on articles' membership in clusters derived from a custom support vector machine trained over word embeddings of the article titles and abstracts. These labeling functions had coverage of 99% of the development set, conflict of 10.5%, and label density of 5.329.
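
For readers unfamiliar with programmatic weak supervision, the sketch below shows what keyword- and metadata-style labeling functions look like. It is written against the open-source snorkel package rather than the Snorkel Flow platform we actually used, and the keywords and year threshold are illustrative, not our real labeling functions.

```python
from snorkel.labeling import labeling_function

# Label values (illustrative).
SAFETY, NOT_SAFETY, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_safety_keywords(x):
    """Vote 'AI safety' if the title or abstract mentions a safety-related term."""
    text = f"{x.title} {x.abstract}".lower()
    terms = ["value alignment", "adversarial robustness", "reward hacking"]
    return SAFETY if any(t in text for t in terms) else ABSTAIN

@labeling_function()
def lf_early_publication_year(x):
    """Vote 'not AI safety' for articles published well before the field emerged."""
    return NOT_SAFETY if x.year < 2010 else ABSTAIN
```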

Through the Snorkel Flow platform, we used these labeling functions to create "weak" AI safety relevance labels for the data we ingested. We then trained an AutoML-tuned logistic regression model over the weak labels. This resulted in a model that achieved a macro-averaged F1 of 82.5% on the validation set, with precision of 73.1%, recall of 79.8%, and F1 of 76.3% over articles manually labeled relevant to AI safety.

AI + [pharmacology, genetics, ...]

We tag an article as an "AI + [subject]" article (for example, AI + pharmacology or AI + genetics, as displayed in the Research Almanac) if it is classified as an AI article and is assigned the specific academic subject in question.

Known limitations

  • The MAC doesn’t cover non-public research. The MAC only includes research that has been publicly released and is included in one of our data sources. We believe these sources cover the large majority of publicly released research worldwide, but a great deal of research is never made public. In particular, military research and commercial research may never be written up in open sources, or even written up at all. We have no way of knowing how much of this “hidden” research exists, or how different the MAC would look if it were somehow included.
  • The MAC’s sources may introduce problems. The MAC incorporates articles from many different datasets, making it a uniquely comprehensive dataset on worldwide research. No dataset is perfect, though. Errors, gaps, and judgment calls in the MAC’s underlying datasets are likely to make it into the MAC itself.
  • The MAC’s merging and enrichment processes may introduce errors. The MAC uses automated processes to combine articles from different datasets and create metadata about them. These processes usually work well, but they do have limitations, especially when there are issues with the raw article data. These issues include:
    • We have not done complete metadata translation. So far we have completely translated only the names of funding institutions. Other fields have been translated in part. However, translating all article titles and abstracts is cost-prohibitive.
    • Deduplication is less effective when articles have limited metadata. Our method of deduplicating articles across corpora relies on the presence of six metadata fields. If some of these fields are absent, we are less likely to successfully merge the articles. Additionally, if an article’s title or abstract appears in multiple languages across datasets, we will have to rely on that article’s other metadata to perform a successful match.
    • Named entities may not be fully resolved. Named entities, such as institutions, authors, and articles, are often given different names across or within our data sources. For example, Google may be called “Google” in one article, “Google Cloud Services” in another, “Google UK” in a third, and “AI Team - Google Research” in a fourth. For the MAC, we use different processes to “resolve” these different names where appropriate, as described above, but we can’t catch all of them. This could affect some calculations using the MAC. For example, an author’s articles could be split across multiple versions of the author’s name, making it seem like that author has written fewer articles than she really has.
    • Sub-organizations are not linked to their parents. Some organizations in the MAC are parts of other organizations. For example, an article might have some authors associated with “Google Research” and others associated with “Google Cloud Services.” In some contexts, users might want to group these organizations (and their authors) together under “Google.” The MAC doesn’t group organizations like this.

Maintenance

How are the data updated?

We update the MAC through a sequence of automated pipelines that retrieve data from our sources, merge it together, and enrich it. These pipelines normally run weekly, with occasional pauses to resolve issues due to vendor data changes or failures of automated checks in the pipelines.

The underlying data sources are updated on their own schedules - weekly or daily in most cases. A few of the sources have longer lags, up to quarterly for Chinese-language papers provided through CNKI.

Credits

Virtually all CSET data team members have contributed to the MAC in some form, whether by providing feedback, reviewing code, or helping generate ideas. In alphabetical order, some particular contributions follow:

  • Daniel Chou: Chinese-language data parsing and normalization
  • James Dunham: Article classifier development, field of study modeling, citation percentile calculation
  • Rebecca Gelles: Organizational entity resolution
  • Jennifer Melot: Article linkage, data pipeline automation and scaling
  • Ilya Rahkovsky: Article linkage, metadata merge

Student research assistants Chenxi Liu, Luwei Lei, and Jerod Sun contributed data characterization and normalization. Zach Arnold helped create this documentation.

Major change log

10/13/22: Initial release (ETO/CSET internal)
5/19/23: New emerging topics added as part of Research Almanac launch