Documentation: Emerging Technology Overlay for OpenAlex

Overview

What is this dataset?

The Emerging Technology Overlay for OpenAlex (the "Overlay") provides original metadata developed by CSET for "works" (publications) in the OpenAlex open bibliometric dataset. Currently, this original metadata includes "work"-level language identifications and relevance identifications for various emerging technology subjects.

How do I get it?

The Overlay is hosted on Zenodo, where the most recent version is available. Underlying code is available in the Overlay GitHub repo.

Which ETO products use it?

The Overlay isn't directly used in any ETO products. However, it incorporates data from the Merged Academic Corpus, which is in turn used in several ETO products, such as the Map of Science, Research Almanac, and Country Activity Tracker.

What are its sources?

The Overlay is original to ETO. Language identifications are produced using pycld2. Subject identifications are produced using CSET classifiers trained on a variety of data, such as arXiv publications.

What are its main limitations?

The Overlay only works in conjunction with OpenAlex. Other publication datasets are not currently supported.
Subject identifications are imprecise. Emerging technology subjects have fuzzy boundaries. For each subject, we try to identify articles that subject matter experts would consider highly relevant to the subject in question. However, the outputs are necessarily imprecise.
Data are missing for some OpenAlex works, including non-English works.

What are the terms of use?

This dataset is subject to ETO's general terms of use. If you use it, please cite us. (If you use the dataset in conjunction with OpenAlex data, refer also to the OpenAlex terms of service.)

How do I cite it?

Please cite the "Emerging Technology Observatory ETO Emerging Technology Overlay for OpenAlex," including the link.

Structure and content

The Overlay consists of a series of identically structured tables, each using the following schema. The tables collectively cover every OpenAlex work.

Column name	Type	Description
id	text	The ID of an OpenAlex work.
title_language	text	The language of the work's title, as detected by pycld2.
abstract_language	text	The language of the work's abstract, as detected by pycld2.
is_ai	boolean	true if ETO's models identified the work as relevant to AI; false if the models identified the work as not relevant to AI; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract.
is_nlp	boolean	true if ETO's models identified the work as relevant to natural language processing; false if the models identified the work as not relevant to natural language processing; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract.
is_cv	boolean	true if ETO's models identified the work as relevant to computer vision; false if the models identified the work as not relevant to computer vision; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract.
is_robotics	boolean	true if ETO's models identified the work as relevant to robotics; false if the models identified the work as not relevant to robotics; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract.
is_cyber	boolean	true if ETO's models identified the work as relevant to cybersecurity; false if the models identified the work as not relevant to cybersecurity; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract.
is_ai_safety	boolean	true if ETO's models identified the work as relevant to AI safety; false if the models identified the work as not relevant to AI safety; null if the models were not run on the work, that is, if the work was not classified as relevant to AI.
is_llm	boolean	true if ETO's models identified the work as relevant to large language models; false if the models identified the work as not relevant to large language models; null if the models were not run on the work, that is, if the work was not classified as relevant to AI.
is_chip_design_fabrication	boolean	true if ETO's models identified the work as relevant to chip design and fabrication; false if the models identified the work as not relevant to chip design and fabrication; null if the models were not run on the work (read more).
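
Each subject column in the schema above takes one of three values (true, false, or null), so a reader should keep "not relevant" separate from "not classified". The following is a minimal sketch of that handling, not ETO's own code. It assumes a CSV export whose header matches the column names, with booleans written as "true"/"false" and nulls as empty fields; the file format, the serialization, the sample IDs, and the sample language values are all assumptions made for illustration.

```python
import csv
import io
from typing import Optional

# Subject columns from the schema table.
FLAG_COLUMNS = [
    "is_ai", "is_nlp", "is_cv", "is_robotics", "is_cyber",
    "is_ai_safety", "is_llm", "is_chip_design_fabrication",
]


def parse_flag(raw: str) -> Optional[bool]:
    """Map 'true'/'false' to bool and anything else (e.g. '') to None."""
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    return None


def read_overlay(handle):
    """Read Overlay rows from a CSV file handle (assumed format)."""
    rows = []
    for record in csv.DictReader(handle):
        row = {
            "id": record["id"],
            # Empty language fields are treated as null.
            "title_language": record["title_language"] or None,
            "abstract_language": record["abstract_language"] or None,
        }
        for column in FLAG_COLUMNS:
            row[column] = parse_flag(record[column])
        rows.append(row)
    return rows


# Invented sample data: W1 is AI and LLM relevant, W2 is classified but
# not AI (so its AI safety and LLM flags are null), W3 was not classified.
SAMPLE = "\n".join([
    ",".join(["id", "title_language", "abstract_language"] + FLAG_COLUMNS),
    "W1,English,English,true,true,false,false,false,false,true,false",
    "W2,English,,false,false,false,false,false,,,false",
    "W3" + "," * 10,
])

rows = read_overlay(io.StringIO(SAMPLE))
ai_ids = [r["id"] for r in rows if r["is_ai"] is True]
unclassified_ids = [r["id"] for r in rows if r["is_ai"] is None]
```

Comparing with `is True` and `is None` rather than relying on truthiness keeps null rows from being counted as negative results.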

Sources and methodology

The Overlay is original to ETO. Language identifications are produced using pycld2. Subject identifications are produced with classifiers trained on a variety of data.

Language identification

We generate language identifications using pycld2. Identifications are included only for works where pycld2 successfully output a language and marked the output as reliable.
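
The reliability rule described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the helper name `reliable_language` is hypothetical, the detector is passed in as a parameter so the example runs without pycld2 installed, and `fake_detect` is a stand-in that only mimics the return shape of `pycld2.detect` (a reliability flag, a byte count, and a tuple of per-language details). Whether the Overlay stores the language name or the language code is not stated here, so returning the name is also an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

# Return shape of pycld2.detect(text):
# (is_reliable, text_bytes_found, details), where each details entry is
# (language_name, language_code, percent, score).
Detection = Tuple[bool, int, Sequence[Tuple[str, str, int, float]]]


def reliable_language(
    text: Optional[str],
    detect: Callable[[str], Detection],
) -> Optional[str]:
    """Return the top detected language, or None if there is no reliable result."""
    if not text:
        return None
    try:
        is_reliable, _bytes_found, details = detect(text)
    except Exception:
        # Treat any detector failure as "no language identified" (assumption).
        return None
    if not is_reliable or not details:
        return None
    name, code, _percent, _score = details[0]
    if code == "un":  # "un" marks an unknown language in CLD2 output
        return None
    return name


def fake_detect(text: str) -> Detection:
    """Hypothetical stand-in for pycld2.detect, used only for this demo."""
    if text.isascii():
        return (True, len(text), (("ENGLISH", "en", 99, 1000.0),))
    return (False, len(text), (("Unknown", "un", 0, 0.0),))


title_language = reliable_language("Deep learning for vision", fake_detect)
missing_language = reliable_language(None, fake_detect)
```

With pycld2 installed, `pycld2.detect` could be passed in place of `fake_detect`.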

Emerging technology subject identification

We use a set of original classifiers to determine each work's likely relevance to various emerging technology subjects: AI, natural language processing, computer vision, robotics, cybersecurity, AI safety, large language models, and chip design and fabrication. The classifiers and implementation process are the same as for ETO's Merged Academic Corpus; refer to the MAC documentation for full details.

Known limitations

The Overlay only works in conjunction with OpenAlex. Other publication datasets are not currently supported.
Subject identifications are imprecise. Emerging technology subjects have fuzzy boundaries; there's no objectively correct answer to whether a particular article is "AI safety" research (for example). For each emerging technology subject included in the Overlay, we try to capture articles that subject matter experts would consider highly relevant to the subject in question. We use different methods to identify these articles depending on the subject, and we evaluate our results against "ground truth" corpora that also vary by subject. Still, it's important to note that this process inevitably involves some judgment calls. In addition, we rely on statistical models (classifiers) to generate the subject identifications. For both reasons, the identifications are necessarily imprecise and should be used with care.
Data are missing for some OpenAlex works. Language identifications are only available for works where pycld2 successfully output a language and marked the output as reliable. Subject identifications are only available for works with non-null, English titles or abstracts.
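
Because the Overlay only carries its own metadata, using it means joining it to OpenAlex work records on the work ID and then checking how many of the joined works have null values. The sketch below illustrates that under stated assumptions: both sides are assumed to use the same ID string form, the work records and Overlay rows are invented in-memory dictionaries, and the helper names `attach_overlay` and `flag_coverage` are hypothetical.

```python
from collections import Counter

# A subset of Overlay columns to copy onto each work record.
OVERLAY_FIELDS = ["title_language", "abstract_language", "is_ai", "is_ai_safety"]


def attach_overlay(works, overlay_rows):
    """Left-join Overlay fields onto OpenAlex work records by ID."""
    by_id = {row["id"]: row for row in overlay_rows}
    merged = []
    for work in works:
        overlay = by_id.get(work["id"], {})
        enriched = dict(work)
        for field in OVERLAY_FIELDS:
            # Fields with no Overlay value stay None (null).
            enriched[field] = overlay.get(field)
        merged.append(enriched)
    return merged


def flag_coverage(records, field):
    """Count true, false, and null values for one subject column."""
    labels = {True: "true", False: "false", None: "null"}
    return Counter(labels[record[field]] for record in records)


# Invented records for illustration only.
works = [
    {"id": "https://openalex.org/W0000000001", "title": "A survey of reward hacking"},
    {"id": "https://openalex.org/W0000000002", "title": "Soil chemistry notes"},
    {"id": "https://openalex.org/W0000000003", "title": None},
]
overlay_rows = [
    {"id": "https://openalex.org/W0000000001", "title_language": "English",
     "abstract_language": None, "is_ai": True, "is_ai_safety": True},
    {"id": "https://openalex.org/W0000000002", "title_language": "English",
     "abstract_language": None, "is_ai": False, "is_ai_safety": None},
    {"id": "https://openalex.org/W0000000003", "title_language": None,
     "abstract_language": None, "is_ai": None, "is_ai_safety": None},
]

merged = attach_overlay(works, overlay_rows)
ai_coverage = flag_coverage(merged, "is_ai")
```

Reporting the null count alongside the true and false counts makes it clear how much of a sample was never classified.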

Maintenance

How are the data updated?

The Overlay is updated monthly through an automated pipeline that extracts ETO's original metadata for every OpenAlex work, then updates the Overlay's Zenodo record with the latest metadata. For details and code, visit the Overlay GitHub repo.

Credits

Concept and design: Jennifer Melot
Engineering: Jennifer Melot
Documentation: Zach Arnold, Jennifer Melot

The Overlay integrates outputs from classifiers developed by CSET's data science team, including James Dunham, Rebecca Gelles, Ilya Rahkovsky, and Christian Schoeberl. For full credits, visit the Merged Academic Corpus documentation.

Emerging technology topic classifications are based on work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.

Major change log

4/22/24

Initial release