Documentation: Country AI Activity Metrics

Overview

What is this dataset?

ETO's Country AI Activity Metrics dataset includes national-level metrics for AI-related research, patents, and private-market investment. To browse the data visually, visit ETO's Country Activity Tracker.

How do I get it?

The dataset CSV files are available on Zenodo.

Which ETO products use it?

This dataset powers ETO's Country Activity Tracker.

What are its sources?

The metrics in this dataset are original to ETO and CSET. We use our own methods, models and metadata to derive values for these metrics from a variety of underlying data sources. These sources include ETO's Merged Academic Corpus for research data; The Lens, PATSTAT, and 1790 Analytics data for patents; and Crunchbase for company and investment data. Read more >>

What are its main limitations?

The Country AI Activity Metrics dataset focuses on countries, not organizations or individuals. No data is included on individual people and organizations within those countries.
The dataset focuses on AI and its subfields. Other technologies aren't covered for now.
The data may not give a complete picture of national AI activity. There are many ways to assess countries' AI activities. This dataset includes three types of metrics - research, patents, and private-sector activity (companies and investments) - that we think are meaningful, but they aren't exhaustive.
The data have a lag, making counts incomplete for recent years. The lag is especially significant for patent data, and affects publication and investment metrics to a lesser degree.
There are some errors and gaps in our process for assigning publications to countries. We use metadata from the sources that feed our Merged Academic Corpus to associate publications with countries. There are sometimes errors or gaps in this metadata. We use various methods to fix these problems, but some errors remain. As a result, some publications remain unlinked to countries and others may be linked to the wrong countries.
The company and investment metrics capture a particular type of investment activity, and don't represent everything happening with AI in the private sector. This dataset only includes data on equity investment transactions involving privately held, AI-related companies listed in the Crunchbase dataset. Read more >>
Publication metrics are based on primarily English-language sources that miss many Chinese-language publications. These metrics are ultimately derived from ETO’s Merged Academic Corpus, which omits many Chinese-language publications. Because of this, metrics related to Chinese articles should be interpreted with caution.

What are the terms of use?

This dataset is subject to ETO's general terms of use. If you use it, please cite us.

How do I cite it?

Please cite the "Emerging Technology Observatory Country AI Activity Metrics dataset," including the link.

If you use our Country Activity Tracker to access the data, you can cite that tool instead.

Structure and content

The Country AI Activity Metrics dataset consists of three groups of CSV tables containing data on publications, patents, and companies.

publications

These tables include national research publication metrics by country, year, and field.

publications_yearly_articles

Name	Type	Description
country	text	The country's name.
field	text	A research field.
year	number	The year of publication.
num_articles	number	The number of articles related to the specified field that were published by researchers associated with country in the specified year.
complete	boolean	Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution.
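As a rough illustration of how the complete flag might be used, the sketch below reads rows shaped like publications_yearly_articles and keeps only those from materially complete years. The sample values are invented, and the way booleans are serialized in the CSV files (assumed here to be text such as "True" or "False") is an assumption that should be checked against the actual files.

```python
import csv
import io

# Hypothetical sample rows in the shape of publications_yearly_articles.
# The numbers are made up for illustration only.
SAMPLE_CSV = """country,field,year,num_articles,complete
Japan,Robotics,2021,120,True
Japan,Robotics,2023,45,False
France,Robotics,2021,80,True
"""


def parse_bool(value):
    # Assumption: booleans are stored as text such as "True"/"False".
    return value.strip().lower() in {"true", "1", "t", "yes"}


def load_complete_rows(csv_text):
    """Return only the rows from years flagged as materially complete."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        if parse_bool(row["complete"]):
            rows.append({
                "country": row["country"],
                "field": row["field"],
                "year": int(row["year"]),
                "num_articles": int(row["num_articles"]),
            })
    return rows


complete_rows = load_complete_rows(SAMPLE_CSV)
```

The same filter should apply to any other table in this dataset that carries a complete column.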

publications_yearly_citations

Name	Type	Description
country	text	The country's name.
field	text	A research field.
year	number	The year of publication.
num_citations	number	The total number of citations to all articles related to the specified field that were published by researchers associated with country in the specified year.

Citation data are likely incomplete in recent years. We are currently evaluating how incomplete they are. Pending the results of this analysis, this table doesn't include the complete field found in the other yearly tables, but we expect to add that field once the analysis is done. In the meantime, treat citation counts for recent years with caution, as with other types of data in this dataset.

publications_summary

Name	Type	Description
country	text	The country's name.
field	text	A research field.
num_articles	number	The number of articles related to the specified field that were published by researchers associated with country over all years covered in the publications_yearly_articles and publications_yearly_citations tables.
num_citations	number	The total number of citations to all articles in the specified field that were published by researchers associated with country over all years covered in the publications_yearly_articles and publications_yearly_citations tables.

patents

These tables include national AI patent metrics by country of filing, year, and field.

Patent data are laggy

Some countries' patent offices take years after filing to release data on some patents. Unfortunately, this means that the last several years of patent data in this dataset are incomplete. Use data from incomplete years with caution.

patents_yearly_applications

Name	Type	Description
country	text	The country's name.
field	text	A thematic field, such as "Business," "Energy Management," or "Transportation."
year	number	The year.
num_patent_applications	number	The number of AI-related patent applications filed in country's patent office in the specified year that were related to the specified field.
complete	boolean	Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution.

patents_yearly_granted

Name	Type	Description
country	text	The country's name.
field	text	A thematic field, such as "Business," "Energy Management," or "Transportation."
year	number	The year.
num_patent_granted	number	The number of AI-related patents filed in country's patent office in the specified year that were related to the specified field and were subsequently granted.
complete	boolean	Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution.

patents_summary

Name	Type	Description
country	text	The country's name.
field	text	A thematic field, such as "Business," "Energy Management," or "Transportation."
num_patent_applications	number	The number of AI-related patent applications filed in country's patent office that were related to the specified field over all years covered in the patents_yearly_applications and patents_yearly_granted tables.
num_patent_granted	number	The number of AI-related patents filed in country's patent office that were related to the specified field and were subsequently granted, over all years covered in the patents_yearly_applications and patents_yearly_granted tables.

companies

These tables include national AI investment metrics by country, year, and field.

companies_yearly_disclosed

Name	Type	Description
country	text	The country's name.
field	text	A field of business activity, such as "Professional services," "Finance," or "Education."
year	number	The year.
disclosed_investment	number	The amount of disclosed investment in the given year into AI companies in country that are active in the given field, in millions USD.
complete	boolean	Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution.

companies_yearly_estimated

Name	Type	Description
country	text	The country's name.
field	text	A field of business activity, such as "Professional services," "Finance," or "Education."
year	number	The year.
estimated_investment	number	Estimated total investment (including disclosed investments and estimated amounts of investments with undisclosed values) in the given year into AI companies in country that are active in the given field, in millions USD.
complete	boolean	Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution.

companies_summary

Name	Type	Description
country	text	The country's name.
field	text	A field of business activity, such as "Professional services," "Finance," or "Education."
disclosed_investment	number	Disclosed investment into AI companies in country that are active in the given field over all years covered in the companies_yearly_disclosed and companies_yearly_estimated tables, in millions USD.
estimated_investment	number	Estimated total investment (including disclosed investments and estimated amounts of investments with undisclosed values) into AI companies in country that are active in the given field over all years covered in the companies_yearly_disclosed and companies_yearly_estimated tables, in millions USD.

Sources and methodology

The Country AI Activity Metrics dataset relies on different source datasets for its research metrics, patent metrics, and investment metrics.

AI research metrics

Research metrics in this dataset are derived from ETO's Merged Academic Corpus (MAC), which contains detailed information on over 260 million scholarly articles from around the world. Every article in the MAC is tagged as AI-related or not using an automated, classifier-based process; this dataset currently uses the AI-related articles only, though we plan to add more topics over time. For more details, see the MAC documentation.

Linking articles to countries

We attribute articles to countries based on the author organizations listed in each article, as recorded in MAC metadata. (Here, and generally in ETO resources, we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities. Read more >>) An article "counts for" a given country if it lists at least one author affiliated with an organization in that country. The MAC relies on the article to determine the author's organization; for instance, an article listing "Jane Smith, University of Texas" as its author would be attributed to the United States even if Professor Smith later moved to the University of Tokyo. By the same token, authors are associated with the country of their listed organization even if they're not "from" that country: once she moved to the University of Tokyo, Professor Smith's articles would count for Japan, even if she was born and raised in Chicago.

If an article lists authors from organizations in more than one country, the article will "count toward" multiple countries. However, if a single article has multiple authors from the same country, it will only be counted once for that country.

So, for example:

an article with a single author, from the University of Texas, will count as one article for the United States.
an article with one author from the University of Texas and one author from the University of Tokyo will count as one article for the United States and one article for Japan.
an article with two authors from the University of Texas and one author from the University of Tokyo will also count as one article for the United States and one article for Japan.
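The counting rule in the examples above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code; the input structure (a list of author and country pairs per article) is hypothetical.

```python
from collections import Counter

# Hypothetical articles: each one is a list of
# (author, country of the author's listed organization) pairs.
articles = [
    [("Author A", "United States")],
    [("Author A", "United States"), ("Author B", "Japan")],
    [("Author A", "United States"), ("Author C", "United States"),
     ("Author B", "Japan")],
]


def count_articles_by_country(articles):
    """Count each article once per distinct country among its authors."""
    counts = Counter()
    for affiliations in articles:
        # Using a set means several authors from the same country
        # still contribute only one count for that article.
        countries = {country for _author, country in affiliations}
        counts.update(countries)
    return counts


counts = count_articles_by_country(articles)
```

Because a multi-country article is counted once for each country, per-country counts computed this way can add up to more than the number of distinct articles.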

Assigning articles to AI subfields

We classify articles into AI subfields using Merged Academic Corpus subject assignments. The dataset currently includes the following subfields, in addition to AI as a whole:

AI safety
Computer vision
Natural language processing
Robotics
Large language models

AI patent metrics

This dataset uses patent data from 1790 Analytics, PATSTAT, and The Lens, and applies methods developed jointly by CSET and 1790 to screen and structure the data.

Counting patents, patent families, and patent documents

The dataset includes over 360,000 AI-related patent families, which are groups of patent documents related to the same invention. These documents may include patent applications, which are requests pending at a country's patent office for the grant of a patent, and granted patents, which are approved requests awarding a property right for that invention. (We exclude other types of patent documents, such as amendments or other administrative documents.)

In this dataset, each patent family is counted as a single "patent." If the family includes at least one granted patent, the family is counted as a "granted patent." If the family only includes patent applications, it's treated as a "patent application."

Inventors often file patents for the same invention in multiple jurisdictions, since each jurisdiction's patent office can only enforce patent protections in its own jurisdiction. For example, a company with U.S.-patented products might seek patents in France if it plans to start manufacturing or selling the same products there. Our patent dataset includes data from 52 different patent offices around the world, including national offices (such as the U.S. Patent and Trademark Office) and international offices (such as the European Patent Office). When an inventor seeks a patent for an invention in more than one of these jurisdictions, all of the documents from every jurisdiction are counted as part of the same patent. We assign each patent the year it was first applied for anywhere, and we attribute each patent to the first jurisdiction where the inventor filed; this is the "priority country." (We use the dates of patent documents to determine where the first patent related to an invention was filed.)

To make this more concrete, suppose:

Jane Smith invents new AI software and uses it to build a new type of robot.
She applies to the European Patent Office for a patent for the new software, plus another patent for the robot.
The EPO grants the robot patent.
While she waits on the patent for the software, Jane goes ahead and applies for a patent for the robot at the Chinese patent office (in case she decides to manufacture the robot in China).

At this point, the dataset would count two patents for Jane: one EPO granted patent (for the robot) and one EPO patent application (for the software). Note that there are at least four patent documents involved: an EPO application, EPO patent grant, and Chinese application for the robot, and an EPO application for the software. But the first three documents all relate to the same invention, so the dataset counts them together as a single patent.
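The family-level counting described in this section can be sketched as follows, using Jane's example. The record layout, family labels, and dates are hypothetical, and this is a simplified illustration under those assumptions rather than the method actually used to build the dataset.

```python
from datetime import date

# Hypothetical patent documents for Jane's two inventions.
# Dates are invented for illustration.
documents = [
    {"family": "robot", "office": "EPO", "kind": "application",
     "date": date(2020, 1, 10)},
    {"family": "robot", "office": "EPO", "kind": "grant",
     "date": date(2021, 6, 1)},
    {"family": "robot", "office": "CN", "kind": "application",
     "date": date(2020, 9, 15)},
    {"family": "software", "office": "EPO", "kind": "application",
     "date": date(2020, 1, 10)},
]


def summarize_families(documents):
    """Collapse patent documents into one record per patent family."""
    families = {}
    for doc in documents:
        families.setdefault(doc["family"], []).append(doc)

    summary = {}
    for family_id, docs in families.items():
        # The earliest document determines the priority office and year.
        first = min(docs, key=lambda d: d["date"])
        # One granted patent anywhere in the family makes it "granted".
        granted = any(d["kind"] == "grant" for d in docs)
        summary[family_id] = {
            "priority_office": first["office"],
            "year": first["date"].year,
            "status": "granted" if granted else "application",
        }
    return summary


summary = summarize_families(documents)
```

In this sketch, four documents collapse into two counted patents, both attributed to the EPO: one granted (the robot) and one application (the software).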

Inventor country vs. filing country

Our patent metrics describe where patents are being filed, not which country has the most patents. In other words, we can't tell you how many AI patents are owned by Americans, but we can tell you how many patents were filed in the U.S. patent office. There may be overlap between these two categories, but it's not a perfect match: for example, about half of patent applications filed in the U.S. are from overseas.

We are working to build inventor nationality metrics into future versions. In the meantime, you can use the existing, filing location-based metrics to understand where AI innovators are most interested in protecting their inventions - and in turn, where they may be conducting R&D, manufacturing, marketing, expanding operations, or competing with foreign companies.

Identifying and classifying AI patents

This dataset includes only AI-related patents. CSET and 1790 Analytics developed a method to identify these patents from broader 1790 and The Lens patent data holdings using a combination of keywords and patent classification codes, which are categories applied to individual patents by some patent offices. We also used keywords and classification codes to link each patent to different industries (e.g., life sciences, transportation). For more information, you can read CSET's paper on AI patents or visit the GitHub repo for the CSET/1790 project.

Deduplicating patents

Combining different patent data sources into the dataset creates duplicate patents. We use patent IDs and patent family IDs, which are unique identifiers assigned by national patent offices, to detect and resolve these duplicates. When different sources give different information for the same patent, we use the data from the CSET/1790 project.
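One way to picture this kind of deduplication is the sketch below, which keeps a single record per patent ID and prefers the CSET/1790 record when sources disagree. The source labels, record fields, and tie-breaking details are hypothetical; the actual resolution logic may differ.

```python
# Hypothetical merged records from several sources.
records = [
    {"patent_id": "EP-1", "source": "lens", "year": 2019},
    {"patent_id": "EP-1", "source": "cset_1790", "year": 2018},
    {"patent_id": "CN-2", "source": "patstat", "year": 2020},
]

# Assumed label for records coming from the CSET/1790 project.
PREFERRED_SOURCE = "cset_1790"


def deduplicate(records):
    """Keep one record per patent ID, preferring the preferred source."""
    chosen = {}
    for record in records:
        patent_id = record["patent_id"]
        current = chosen.get(patent_id)
        is_preferred = record["source"] == PREFERRED_SOURCE
        if current is None:
            chosen[patent_id] = record
        elif is_preferred and current["source"] != PREFERRED_SOURCE:
            # Conflicting information: the preferred source wins.
            chosen[patent_id] = record
    return list(chosen.values())


deduplicated = deduplicate(records)
```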

AI company and investment metrics

The investment and company metrics in the Country AI Activity Metrics dataset are derived from data from Crunchbase, a commercial dataset. Specifically, we use Crunchbase data related to equity investment into privately held, AI-related companies.

Equity investment: We count venture capital transactions, private equity transactions, and mergers and acquisitions, but exclude non-equity investment such as debt finance, grants, and crowdfunding.
Privately held: We exclude any company traded on a public stock exchange. This includes most of the "Big Tech" firms and other well-established companies.
Our method for identifying AI-related companies is described below.

Crunchbase data has gaps, especially for companies with a lower public profile, but we believe it is a relatively comprehensive and accurate source for the sort of data used in this dataset. For more details, please refer to the methodology section and appendices in this CSET report; the process we used to evaluate and extract data from Crunchbase is generally similar to the process described there.

The cross-border investment metrics in this dataset should be interpreted with care. Most private-market AI investment transactions, such as venture capital deals, combine contributions from multiple investors, and the exact amount of each investor's contribution is rarely disclosed. This makes it impossible to add up the total investment from investors in a specified country. Instead, the cross-border investment metrics reflect the total value or count of investment transactions with at least one participating investor from that country.
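To make the "at least one participating investor" rule concrete, here is a minimal sketch with invented transactions. The field names are hypothetical, and the code only illustrates the counting rule described above.

```python
from collections import defaultdict

# Hypothetical transactions: total deal value in millions USD and the
# countries of the participating investors.
transactions = [
    {"value_musd": 10.0,
     "investor_countries": ["United States", "United States", "Japan"]},
    {"value_musd": 5.0, "investor_countries": ["Japan"]},
]


def cross_border_totals(transactions):
    """Credit each transaction's full value to every participating country."""
    totals = defaultdict(lambda: {"count": 0, "value_musd": 0.0})
    for tx in transactions:
        # A set avoids double-counting a country with several investors
        # in the same transaction.
        for country in set(tx["investor_countries"]):
            totals[country]["count"] += 1
            totals[country]["value_musd"] += tx["value_musd"]
    return dict(totals)


totals = cross_border_totals(transactions)
```

Here the United States is credited with one transaction worth 10.0 and Japan with two worth 15.0 in total, even though only 15.0 was invested overall. Values computed this way should not be summed across investor countries.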

There is no single objective definition of an "AI company." We take a deliberately broad approach, using two different criteria to identify AI-related companies in Crunchbase. Any privately held company that meets at least one of the criteria is counted as an AI-related company, and investments involving that company will be included in the investment metrics.

Keyword match: The company's business description in Crunchbase matches a regular expression-based query for AI keywords. The query is reproduced in Appendix 2 of this CSET report.
Tag match: Crunchbase assigns the company an industry tag in the "artificial intelligence" industry group.

These criteria are designed to capture a wide range of companies with AI-related activities across the globe, even for smaller countries or territories. Because of this broad approach, they may capture some companies and investments in Crunchbase that others might not consider AI-related. At the same time, they may leave out some companies and investments others would describe as AI-related.
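The "at least one criterion" logic can be sketched as below. The regular expression is a short, hypothetical stand-in and is not the query from the CSET report; the function and parameter names are likewise invented for illustration.

```python
import re

# Hypothetical stand-in pattern, NOT the actual CSET keyword query.
AI_KEYWORDS = re.compile(
    r"\b(artificial intelligence|machine learning|neural network)\b",
    re.IGNORECASE,
)
AI_GROUP = "artificial intelligence"


def is_ai_company(description, industry_groups, is_public):
    """Return True if a privately held company meets either criterion."""
    if is_public:
        # Publicly traded companies are excluded regardless of AI activity.
        return False
    keyword_match = bool(AI_KEYWORDS.search(description))
    tag_match = any(group.lower() == AI_GROUP for group in industry_groups)
    return keyword_match or tag_match


examples = [
    is_ai_company("We build machine learning tools", [], False),
    is_ai_company("Neighborhood bakery", ["Artificial Intelligence"], False),
    is_ai_company("Machine learning platform", [], True),
    is_ai_company("Neighborhood bakery", ["Food and Beverage"], False),
]
```

Combining the criteria with "or" is what makes the approach deliberately broad: a company is included if either the description or the industry group signals AI.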

Identifying and assigning subfields

We assign each AI-related company, and all the investments into that company, one or more application fields based on the company's industry tags and groups in Crunchbase. This table maps the tags and groups to application fields in our dataset.

Other metadata

We adapt other Crunchbase data to generate the investment and company metrics:

Investment year: We assign each investment the year it was announced.
Investment count (total and per-year): We count private equity transactions and venture capital rounds with multiple investors as a single transaction, and not as multiple investments.
Estimated investment value: Equity investment transaction values are often kept confidential. The "disclosed value" metrics exclude all investments without a publicly announced dollar amount. The "estimated value" metrics include these investments, assigning each one a dollar value equal to the median value of Crunchbase investments with the same investment stage, target country, and year.
Company nationality: We assign each company the nationality of the country where it is headquartered according to Crunchbase.
Investor nationality: We assign each organizational investor the nationality of the country where it (for corporate investors) or its managing entity (for VC and PE funds) is headquartered. For example, an investment firm based in San Francisco will be classified as American even if most of its investors are Chinese.
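The difference between the "disclosed value" and "estimated value" metrics can be illustrated with a small sketch of the median-based estimate. The records and field names are hypothetical, and the handling of groups with no disclosed values (left missing here) is an assumption, not something the documentation specifies.

```python
from collections import defaultdict
from statistics import median

# Hypothetical investments; value_musd is None when undisclosed.
investments = [
    {"stage": "seed", "country": "France", "year": 2021, "value_musd": 1.0},
    {"stage": "seed", "country": "France", "year": 2021, "value_musd": 3.0},
    {"stage": "seed", "country": "France", "year": 2021, "value_musd": 8.0},
    {"stage": "seed", "country": "France", "year": 2021, "value_musd": None},
]


def estimate_values(investments):
    """Fill undisclosed values with the median of comparable investments."""
    groups = defaultdict(list)
    for inv in investments:
        if inv["value_musd"] is not None:
            key = (inv["stage"], inv["country"], inv["year"])
            groups[key].append(inv["value_musd"])

    estimated = []
    for inv in investments:
        value = inv["value_musd"]
        if value is None:
            known = groups.get((inv["stage"], inv["country"], inv["year"]))
            # Assumption: stay missing if no comparable value is disclosed.
            value = median(known) if known else None
        estimated.append({**inv, "estimated_musd": value})
    return estimated


estimated = estimate_values(investments)
disclosed_total = sum(
    inv["value_musd"] for inv in investments if inv["value_musd"] is not None
)
estimated_total = sum(
    inv["estimated_musd"] for inv in estimated
    if inv["estimated_musd"] is not None
)
```

In this example the disclosed total is 12.0, while the estimated total is 15.0 because the undisclosed investment is assigned the group median of 3.0.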

Maintenance

How are the data updated?

We update the Country AI Activity Metrics data through an automated pipeline that retrieves data from CSET's merged corpora of scholarly literature, patents, and investment data and calculates our metrics. This pipeline normally runs monthly, with occasional pauses to resolve issues due to vendor data changes or failures of automated data quality checks.

The underlying data sources are updated on their own schedules, at intervals ranging from daily to every six months.

Credits

Concept and design: Rebecca Gelles, Ngor Luong, Zach Arnold
Engineering: Jennifer Melot, Rebecca Gelles
Data collection and enrichment: Zach Arnold, Rebecca Gelles, Ngor Luong, Eileen Chen, Alina Thai, Kieran Lee, Lindsay Jenkins, Ari Filler, Maya Snyder, Julio Wang, Sonali Pednekar, Nikita Pardeshi, Tianyi Xu, Yudi Chen, Vinayak Kannan
Documentation: Zach Arnold, Rebecca Gelles, Jennifer Melot

Emerging technology topic classifications are based on work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.

Major change log

12/5/24

Initial release