The Emerging Technology Overlay for OpenAlex (the "Overlay") provides original metadata developed by CSET for "works" (publications) in the OpenAlex open bibliometric dataset. Currently, this original metadata includes "work"-level language identifications and relevance identifications for various emerging technology subjects.
The Overlay is hosted on Zenodo, where the most recent version is available. Underlying code is available in the Overlay GitHub repo.
The Overlay isn't directly used in any ETO products. However, it incorporates data from the Merged Academic Corpus, which is in turn used in several ETO products, such as the Map of Science, Research Almanac, and Country Activity Tracker.
The Overlay is original to ETO. Language identifications are produced using pycld2. Subject identifications are produced using CSET classifiers trained on a variety of data, such as arXiv publications.
This dataset is subject to ETO's general terms of use. If you use it, please cite us. (If you use the dataset in conjunction with OpenAlex data, refer also to the OpenAlex terms of service.)
Please cite the "Emerging Technology Observatory ETO Emerging Technology Overlay for OpenAlex," including the link.
The Overlay consists of a series of identically structured tables, each using the following schema. The tables collectively cover every OpenAlex work.
Column name | Type | Description |
---|---|---|
id | text | The ID of an OpenAlex work. |
title_language | text | The language of the work's title, as detected by pycld2. |
abstract_language | text | The language of the work's abstract, as detected by pycld2. |
is_ai | boolean | true if ETO's models identified the work as relevant to AI; false if the models identified the work as not relevant to AI; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract. |
is_nlp | boolean | true if ETO's models identified the work as relevant to natural language processing; false if the models identified the work as not relevant to natural language processing; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract. |
is_cv | boolean | true if ETO's models identified the work as relevant to computer vision; false if the models identified the work as not relevant to computer vision; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract. |
is_robotics | boolean | true if ETO's models identified the work as relevant to robotics; false if the models identified the work as not relevant to robotics; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract. |
is_cyber | boolean | true if ETO's models identified the work as relevant to cybersecurity; false if the models identified the work as not relevant to cybersecurity; null if the models were not run on the work, i.e., if the work had a non-English or null title and abstract. |
is_ai_safety | boolean | true if ETO's models identified the work as relevant to AI safety; false if the models identified the work as not relevant to AI safety; null if the models were not run on the work, that is, if the work was not classified as relevant to AI. |
is_llm | boolean | true if ETO's models identified the work as relevant to large language models; false if the models identified the work as not relevant to large language models; null if the models were not run on the work, that is, if the work was not classified as relevant to AI. |
is_chip_design_fabrication | boolean | true if ETO's models identified the work as relevant to chip design and fabrication; false if the models identified the work as not relevant to chip design and fabrication; null if the models were not run on the work. |
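
Because each subject column can be true, false, or null, filters that treat null as false will silently mix "not relevant" with "not classified." The sketch below illustrates one way to keep the three cases apart. It uses only the Python standard library and a few made-up rows that follow a subset of the schema above; the IDs, the values, and the helper names `summarize_flag` and `relevant_ids` are hypothetical and not part of the Overlay or its code.

```python
from collections import Counter

# Made-up rows for illustration only; real Overlay rows carry every schema column.
rows = [
    {"id": "W1", "is_ai": True, "is_llm": True, "is_ai_safety": False},
    {"id": "W2", "is_ai": False, "is_llm": None, "is_ai_safety": None},
    {"id": "W3", "is_ai": None, "is_llm": None, "is_ai_safety": None},
]


def summarize_flag(rows, column):
    """Count works flagged relevant, not relevant, and not classified (null)."""
    labels = {True: "relevant", False: "not_relevant", None: "not_classified"}
    return Counter(labels[row[column]] for row in rows)


def relevant_ids(rows, column):
    """Return IDs of works explicitly flagged true; false and null are excluded."""
    return [row["id"] for row in rows if row[column] is True]


print(summarize_flag(rows, "is_ai"))
print(relevant_ids(rows, "is_llm"))
```

Comparing with `is True` rather than relying on truthiness keeps null values out of the "relevant" set without also counting them as "not relevant."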
We generate language identifications using pycld2. Identifications are included only for works where pycld2 successfully output a language and marked the output as reliable.
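
The reliability rule described above can be sketched roughly as follows. This is an illustration, not the Overlay's actual pipeline code: the helper names `reliable_language` and `detect_language` are hypothetical, and returning the language name (rather than the language code) is an assumption. The sketch relies on `pycld2.detect(text)` returning a tuple of `(is_reliable, text_bytes_found, details)`, where each entry of `details` is `(language_name, language_code, percent, score)`.

```python
def reliable_language(detect_result):
    """Return the top language name from a pycld2.detect() result, or None.

    The result is kept only when pycld2 marked it reliable and the top
    guess is an actual language rather than the "unknown" placeholder.
    """
    is_reliable, _bytes_found, details = detect_result
    if not is_reliable or not details:
        return None
    name, code = details[0][0], details[0][1]
    # pycld2 reports undetermined text with the code "un"; treat it as no output.
    if code == "un":
        return None
    return name


def detect_language(text):
    """Run pycld2 on a title or abstract (requires the pycld2 package)."""
    if not text:
        return None
    import pycld2  # imported lazily so reliable_language works without the package

    try:
        return reliable_language(pycld2.detect(text))
    except pycld2.error:
        # pycld2 raises on some malformed input, such as invalid UTF-8.
        return None
```

With pycld2 installed, `detect_language(title)` would return a language name or `None`; under this sketch, a `None` result corresponds to an empty language field in the Overlay.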
We use a set of original classifiers to determine each work's likely relevance to various emerging technology subjects: AI, natural language processing, computer vision, robotics, cybersecurity, AI safety, large language models, and chip design and fabrication. The classifiers and implementation process are the same as for ETO's Merged Academic Corpus; refer to the MAC documentation for full details.
The Overlay is updated monthly through an automated pipeline that extracts ETO's original metadata for every OpenAlex work, then updates the Overlay's Zenodo record with the latest metadata. For details and code, visit the Overlay GitHub repo.
The Overlay integrates outputs from classifiers developed by CSET's data science team, including James Dunham, Rebecca Gelles, Ilya Rahkovsky, and Christian Schoeberl. For full credits, visit the Merged Academic Corpus documentation.
Emerging technology topic classifications are based on work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.
Date | Change |
---|---|
4/22/24 | Initial release |