Documentation: Cross-Border Tech Research Metrics

Overview

What is this dataset?

ETO's Cross-Border Tech Research Metrics dataset includes metrics for cross-border research in emerging technology domains, such as AI, robotics, and cybersecurity.

How do I get it?

The dataset's CSV files are available on Zenodo.

Which ETO products use it?

This dataset powers ETO's Country Activity Tracker.

What are its sources?

The metrics in this dataset are original to ETO and CSET. We use our own methods, models and metadata to derive values for these metrics from ETO's Merged Academic Corpus. Read more >>

What are its main limitations?

  • The dataset focuses on countries, not organizations or individuals. No data is included on individual people and organizations within those countries.
  • The data do not give a complete picture of tech-related activity. There are many ways to assess these activities. This dataset includes only one type of metric: counts of research publications.
  • The data have a lag, making counts incomplete for recent years.
  • There are some errors and gaps in our process for assigning publications to countries. We use metadata from the sources that feed our Merged Academic Corpus to associate publications with countries. There are sometimes errors or gaps in this metadata. We use various methods to fix these problems, but some errors remain. As a result, some publications remain unlinked to countries and others may be linked to the wrong countries.
  • The metrics are based on primarily English-language sources that miss many Chinese-language publications. These metrics are ultimately derived from ETO’s Merged Academic Corpus, which omits many Chinese-language publications. Because of this, metrics related to Chinese articles should be interpreted with caution.
  • The metrics omit small per-year counts. If a country pair has fewer than 25 joint publications in a year, we omit that year of data for that country pair. Read more >>

What are the terms of use?

This dataset is subject to ETO's general terms of use. If you use it, please cite us.

How do I cite it?

Please cite the "Emerging Technology Observatory Cross-Border Tech Research Metrics dataset," including the link.

If you use our Country Activity Tracker to access the data, you can cite that tool instead.

Structure and content

The Cross-Border Tech Research Metrics dataset consists of topic-specific CSV tables, each organized as follows:

Name | Type | Description
country1 | text | A country's name.
country2 | text | Another country's name.
field | text | A research field.
year | number | The year of publication.
num_articles | number | The number of articles related to the specified field that were jointly published by researchers associated with country1 and country2 in the specified year. Read more about our methodology >>
complete | boolean | Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution.
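
As a rough illustration of working with this layout, the sketch below parses rows with the columns listed above and keeps only those flagged as complete. The sample rows are made up for illustration and are not real ETO values. The presence of a header row, the "True"/"False" text encoding of the complete column, and the helper name load_rows are all assumptions; check them against the files you download.

```python
import csv
import io

# Made-up sample rows in the documented column layout; not real ETO values.
SAMPLE_CSV = """country1,country2,field,year,num_articles,complete
China,United States,Robotics,2021,120,True
China,United States,Robotics,2024,40,False
"""


def load_rows(handle):
    """Parse rows from a file-like object, converting numeric and boolean columns."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append({
            "country1": raw["country1"],
            "country2": raw["country2"],
            "field": raw["field"],
            "year": int(raw["year"]),
            "num_articles": int(raw["num_articles"]),
            # Assumption: booleans are serialized as "True"/"False" (any case).
            "complete": raw["complete"].strip().lower() == "true",
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))

# Keep only years the documentation treats as materially complete.
complete_rows = [r for r in rows if r["complete"]]
```

To read a downloaded file instead of the sample string, pass an open file handle (opened with newline="") to load_rows.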

Topics covered

The dataset currently covers the following fields of research. We plan to add more over time.

  • AI (all)
  • AI safety
  • Computer vision
  • Cybersecurity
  • Robotics
  • Natural language processing
  • Large language models
  • Chip design and fabrication

Sources and methodology

The Cross-Border Tech Research Metrics dataset relies on the same sources and methods as the research metrics in the Country AI Activity Metrics dataset. Refer to that dataset's documentation for details on:

  • Our underlying publication data source: the ETO Merged Academic Corpus
  • How we attribute publications to countries
  • How we link publications to particular fields (note that the Cross-Border Tech Research Metrics dataset includes emerging tech fields unrelated to AI; these fields are linked using the same approach)
  • Nuances of counting and deduplicating publications
Open the Country AI Activity Metrics documentation.

Identifying cross-border research publications

This dataset covers research publications whose authors are affiliated with institutions in more than one country. Each such publication counts as one joint paper for every pair of countries represented among its authors' institutions. So, for example:

  • A publication with authors from New York University only would be omitted from the metrics in this dataset.
  • A publication with authors from New York University and Oxford University would be counted as one publication for the United States-United Kingdom country pair.
  • A publication with authors from New York University, Oxford University, and Peking University would be counted as one publication for the United States-United Kingdom country pair, one publication for the United States-China country pair, and one publication for the China-United Kingdom country pair.
  • A publication with authors from New York University, Harvard University, and Oxford University would be counted as one publication for the United States-United Kingdom country pair. (There's no "double counting" when multiple universities from the same country are involved.)
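
The counting rule in these examples can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not ETO's actual pipeline. The function name country_pairs is hypothetical, and the alphabetical ordering within each pair is an assumption made to keep pairs consistent; the dataset's own ordering of country1 and country2 may differ.

```python
from itertools import combinations


def country_pairs(affiliation_countries):
    """Return the country pairs that one publication counts toward.

    affiliation_countries holds one country name per authoring institution,
    so the same country may appear more than once.
    """
    # Deduplicate so several institutions in one country are not double counted.
    unique = sorted(set(affiliation_countries))
    # Every unordered pair of distinct countries gets one joint publication.
    return list(combinations(unique, 2))


# NYU only: no cross-border pair, so the publication is omitted.
single = country_pairs(["United States"])

# NYU, Oxford, and Peking: three pairs, one publication each.
triple = country_pairs(["United States", "United Kingdom", "China"])

# NYU, Harvard, and Oxford: still a single United States-United Kingdom pair.
repeated = country_pairs(["United States", "United States", "United Kingdom"])
```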

Omitted data

If a country pair has fewer than 25 joint publications in a year, we omit that year of data for that country pair. We consider numbers this small potentially unreliable due to unavoidable "background noise" in our underlying data sources (e.g., errors in linking authors to institutions or institutions to countries).
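
A minimal sketch of this omission rule follows. The counts are made up for illustration, the function name apply_threshold is hypothetical, and applying the threshold separately for each field is an assumption based on the dataset's field-specific tables.

```python
MIN_JOINT_PUBLICATIONS = 25  # threshold stated in the documentation


def apply_threshold(counts, minimum=MIN_JOINT_PUBLICATIONS):
    """Drop entries with fewer than `minimum` joint publications.

    counts maps (country1, country2, field, year) to a publication count.
    """
    return {key: n for key, n in counts.items() if n >= minimum}


# Made-up counts for illustration only.
example_counts = {
    ("China", "United States", "Robotics", 2020): 24,  # fewer than 25: omitted
    ("China", "United States", "Robotics", 2021): 25,  # exactly 25: kept
    ("China", "United States", "Robotics", 2022): 310,  # kept
}

kept = apply_threshold(example_counts)
```

One practical consequence: a missing year for a country pair means "fewer than 25 joint publications," not necessarily zero.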

Credits

  • Engineering: Jennifer Melot
  • Documentation: Zach Arnold

Emerging technology topic classifications are based on work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.

This dataset builds on the Country AI Activity Metrics dataset, featuring contributions from many different analysts, annotators and technical collaborators. View credits for the Country AI Activity Metrics dataset.

Major change log

12/5/24 | Initial release