ETO's Country AI Activity Metrics dataset includes national-level metrics for AI-related research, patents, and private-market investment. To browse the data visually, visit ETO's Country Activity Tracker.
The dataset csv files are available on Zenodo.
This dataset powers ETO's Country Activity Tracker.
The metrics in this dataset are original to ETO and CSET. We use our own methods, models and metadata to derive values for these metrics from a variety of underlying data sources. These sources include ETO's Merged Academic Corpus for research data; The Lens, PATSTAT, and 1790 Analytics data for patents; and Crunchbase for company and investment data.
This dataset is subject to ETO's general terms of use. If you use it, please cite us.
Please cite the "Emerging Technology Observatory Country AI Activity Metrics dataset," including the link.
If you use our Country Activity Tracker to access the data, you can cite that tool instead.
The Country AI Activity Metrics dataset consists of three groups of csv tables containing data on publications, patents, and companies.
These tables include national research publication metrics by country, year, and field.
Name | Type | Description |
---|---|---|
country | text | The country's name. |
field | text | A research field. |
year | number | The year of publication. |
num_articles | number | The number of articles related to the specified field that were published by researchers associated with country in the specified year. |
complete | boolean | Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution. |
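
As a minimal sketch of how the complete flag might be used, the Python snippet below reads rows shaped like the table above and keeps only those from materially complete years. The sample rows and the field name are invented for illustration, and the snippet assumes booleans are serialized as the strings "True" and "False"; check the actual csv files on Zenodo before relying on that.

```python
import csv
import io

# Invented sample rows in the shape of the yearly articles table (not real data).
SAMPLE_CSV = """country,field,year,num_articles,complete
Japan,Computer vision,2021,120,True
Japan,Computer vision,2023,45,False
"""

def read_complete_rows(csv_text):
    """Return only rows whose `complete` flag is true, with numeric columns parsed."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: booleans are written as "True"/"False" (compared case-insensitively).
        if row["complete"].strip().lower() == "true":
            row["year"] = int(row["year"])
            row["num_articles"] = int(row["num_articles"])
            rows.append(row)
    return rows

complete_rows = read_complete_rows(SAMPLE_CSV)
print(complete_rows)
```

Dropping incomplete rows is only one option; depending on the analysis, it may be preferable to keep them and mark them clearly in any chart or table.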
Name | Type | Description |
---|---|---|
country | text | The country's name. |
field | text | A research field. |
year | number | The year of publication. |
num_citations | number | The total number of citations to all articles related to the specified field that were published by researchers associated with country in the specified year. |
Citation data are likely incomplete for recent years, consistent with other types of data in this dataset. We are currently evaluating how incomplete they are. Pending the results of that analysis, this table does not include the complete field found in other tables; we expect to add it once the analysis is done. In the meantime, use citation data from recent years with caution.
Name | Type | Description |
---|---|---|
country | text | The country's name. |
field | text | A research field. |
num_articles | number | The number of articles related to the specified field that were published by researchers associated with country over all years covered in the publications_yearly_articles and publications_yearly_citations tables. |
num_citations | number | The total number of citations to all articles in the specified field that were published by researchers associated with country over all years covered in the publications_yearly_articles and publications_yearly_citations tables. |
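
The summary columns are described as totals over all years covered by the yearly tables. The sketch below shows that kind of aggregation on invented rows; it is an illustration of the relationship, not ETO's actual pipeline, and the tuple layout is a simplification chosen for this example.

```python
from collections import defaultdict

# Invented yearly rows (country, field, year, num_articles); not real data.
yearly_articles = [
    ("Japan", "AI", 2020, 100),
    ("Japan", "AI", 2021, 150),
    ("France", "AI", 2021, 80),
]

def summarize(rows):
    """Sum a yearly count over all years for each (country, field) pair."""
    totals = defaultdict(int)
    for country, field, _year, count in rows:
        totals[(country, field)] += count
    return dict(totals)

summary = summarize(yearly_articles)
print(summary[("Japan", "AI")])  # 250
```

Totals like these should not be added across countries to get a worldwide figure, because an article with authors in several countries counts toward each of them.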
These tables include national AI patent metrics by country of filing, year, and field.
Some countries' patent offices take years after filing to release data on some patents. Unfortunately, this means that the last several years of patent data in this dataset are incomplete. Use data from incomplete years with caution.
Name | Type | Description |
---|---|---|
country | text | The country's name. |
field | text | A thematic field, such as "Business," "Energy Management," or "Transportation." |
year | number | The year. |
num_patent_applications | number | The number of AI-related patent applications filed in country's patent office in the specified year that were related to the specified field. |
complete | boolean | Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution. |
Name | Type | Description |
---|---|---|
country | text | The country's name. |
field | text | A thematic field, such as "Business," "Energy Management," or "Transportation." |
year | number | The year. |
num_patent_granted | number | The number of AI-related patents filed in country's patent office in the specified year that were related to the specified field and were subsequently granted. |
complete | boolean | Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution. |
Name | Type | Description |
---|---|---|
country | text | The country's name. |
field | text | A thematic field, such as "Business," "Energy Management," or "Transportation." |
num_patent_applications | number | The number of AI-related patent applications filed in country's patent office that were related to the specified field over all years covered in the patents_yearly_applications and patents_yearly_granted tables. |
num_patent_granted | number | The number of AI-related patents filed in country's patent office that were related to the specified field and were subsequently granted, over all years covered in the patents_yearly_applications and patents_yearly_granted tables. |
These tables include national AI investment metrics by country, year, and field.
Name | Type | Description |
---|---|---|
country | text | The country's name. |
field | text | A field of business activity, such as "Professional services," "Finance," or "Education." |
year | number | The year. |
disclosed_investment | number | The amount of disclosed investment in the given year into AI companies in country that are active in the given field, in millions USD. |
complete | boolean | Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution. |
Name | Type | Description |
---|---|---|
country | text | The country's name. |
field | text | A field of business activity, such as "Professional services," "Finance," or "Education." |
year | number | The year. |
estimated_investment | number | Estimated total investment (including disclosed investments and estimated amounts of investments with undisclosed values) in the given year into AI companies in country that are active in the given field, in millions USD. |
complete | boolean | Indicates whether the row is from a year where we consider our data materially complete. If complete is false, we consider the data from that year materially incomplete and you should use it with caution. |
Name | Type | Description |
---|---|---|
country | text | The country's name. |
field | text | A field of business activity, such as "Professional services," "Finance," or "Education." |
disclosed_investment | number | Disclosed investment into AI companies in country that are active in the given field over all years covered in the companies_yearly_disclosed and companies_yearly_estimated tables, in millions USD. |
estimated_investment | number | Estimated total investment (including disclosed investments and estimated amounts of investments with undisclosed values) into AI companies in country that are active in the given field over all years covered in the companies_yearly_disclosed and companies_yearly_estimated tables, in millions USD. |
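
Because estimated_investment is described as including disclosed investment, the difference between the two columns can be read as the estimated value of investments with undisclosed amounts. The sketch below computes that difference and the disclosed share on invented rows; the helper names are hypothetical.

```python
# Invented summary rows; values are in millions USD, as in the tables above.
rows = [
    {"country": "Japan", "field": "Finance",
     "disclosed_investment": 400.0, "estimated_investment": 500.0},
    {"country": "France", "field": "Education",
     "disclosed_investment": 0.0, "estimated_investment": 0.0},
]

def undisclosed_estimate(row):
    """Estimated value of investments with undisclosed amounts, in millions USD."""
    return row["estimated_investment"] - row["disclosed_investment"]

def disclosed_share(row):
    """Fraction of estimated investment that was disclosed, or None if the estimate is zero."""
    total = row["estimated_investment"]
    return row["disclosed_investment"] / total if total else None

for row in rows:
    print(row["country"], row["field"], undisclosed_estimate(row), disclosed_share(row))
```

A low disclosed share suggests that the estimate for that country and field leans heavily on imputed values, which may be worth noting when comparing countries.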
The Country AI Activity Metrics dataset relies on different source datasets for its research metrics, patent metrics, and investment metrics.
Research metrics in this dataset are derived from ETO's Merged Academic Corpus (MAC), which contains detailed information on over 260 million scholarly articles from around the world. Every article in the MAC is tagged as AI-related or not using an automated, classifier-based process; this dataset currently uses the AI-related articles only, though we plan to add more topics over time. For more details, see the MAC documentation.
We attribute articles to countries based on the author organizations listed in each article, as recorded in MAC metadata. (Here, and generally in ETO resources, we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities.) An article "counts for" a given country if it lists at least one author affiliated with an organization in that country. The MAC relies on the article to determine the author's organization; for instance, an article listing "Jane Smith, University of Texas" as its author would be attributed to the United States even if Professor Smith later moved to the University of Tokyo. By the same token, authors are associated with the country of their listed organization even if they're not "from" that country: once she moved to the University of Tokyo, Professor Smith's articles would count for Japan, even if she was born and raised in Chicago.
If an article lists authors from organizations in more than one country, the article will "count toward" multiple countries. However, if a single article has multiple authors from the same country, it will only be counted once for that country.
So, for example:
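The counting rule described above can be sketched in Python. The articles and author names below are invented, and the data layout (a list of author and country pairs per article) is an assumption made for this illustration rather than the MAC's actual schema.

```python
from collections import Counter

# Invented articles: each is a list of (author, country of listed organization).
articles = [
    [("A. Author", "United States"), ("B. Author", "United States")],
    [("C. Author", "United States"), ("D. Author", "Japan")],
    [("E. Author", "Japan")],
]

def count_articles_by_country(articles):
    """Count each article once for every distinct country among its authors' organizations."""
    counts = Counter()
    for authors in articles:
        # A set collapses multiple authors from the same country into one count.
        for country in {country for _name, country in authors}:
            counts[country] += 1
    return counts

counts = count_articles_by_country(articles)
print(counts["United States"], counts["Japan"])  # 2 2
```

The three articles produce four country-level counts in total, since the second article counts toward both the United States and Japan.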
We classify articles into AI subfields using Merged Academic Corpus subject assignments. The dataset currently includes the following subfields, in addition to AI as a whole:
This dataset uses patent data from 1790 Analytics, PATSTAT, and The Lens, and applies methods developed jointly by CSET and 1790 to screen and structure the data.
Some countries' patent offices take years after filing to release data on some patents. Unfortunately, this means that the last several years of patent data in this dataset are incomplete. Use data from incomplete years with caution.
The dataset includes over 360,000 AI-related patent families, which are groups of patent documents related to the same invention. These documents may include patent applications, which are requests pending at a country's patent office for the grant of a patent, and granted patents, which are approved requests awarding a property right for that invention. (We exclude other types of patent documents, such as amendments or other administrative documents.)
In this dataset, each patent family is counted as a single "patent." If the family includes at least one granted patent, the family is counted as a "granted patent." If the family only includes patent applications, it's treated as a "patent application."
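A minimal sketch of this rule, assuming each document in a family is labeled either "application" or "grant" (hypothetical labels, not the encoding used in the source data):

```python
def classify_family(document_kinds):
    """Classify a patent family from the kinds of documents it contains.

    `document_kinds` is an iterable of "application" or "grant" strings
    (hypothetical labels chosen for this illustration).
    """
    if "grant" in set(document_kinds):
        return "granted patent"
    return "patent application"

print(classify_family(["application", "grant", "application"]))  # granted patent
print(classify_family(["application"]))  # patent application
```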
Inventors often file patents for the same invention in multiple jurisdictions, since each jurisdiction's patent office can only enforce patent protections in its own jurisdiction. For example, a company with U.S.-patented products might seek patents in France if it plans to start manufacturing or selling the same products there. Our patent dataset includes data from 52 different patent offices around the world, including national offices (such as the U.S. Patent and Trademark Office) and international offices (such as the European Patent Office). When an inventor seeks a patent for an invention in more than one of these jurisdictions, all of the documents from every jurisdiction are counted as part of the same patent. We assign each patent the year it was first applied for anywhere, and we attribute each patent to the first jurisdiction where the inventor filed; this is the "priority country." (We use the dates of patent documents to determine where the first patent related to an invention was filed.)
To make this more concrete, suppose:
At this point, the dataset would count two patents for Jane: one EPO granted patent (for the robot) and one EPO patent application (for the software). Note that there are at least four patent documents involved: an EPO application, EPO patent grant, and Chinese application for the robot, and an EPO application for the software. But the first three documents all relate to the same invention, so CAT counts them together.
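The grouping and attribution logic can be sketched with the four documents listed in the paragraph above. All dates are invented for illustration, and the dictionary layout is an assumption, not the dataset's internal format.

```python
from datetime import date

# The four documents from the paragraph above; dates are invented.
documents = [
    {"family": "robot", "office": "EPO", "kind": "application", "date": date(2018, 3, 1)},
    {"family": "robot", "office": "EPO", "kind": "grant", "date": date(2020, 6, 1)},
    {"family": "robot", "office": "China", "kind": "application", "date": date(2019, 1, 15)},
    {"family": "software", "office": "EPO", "kind": "application", "date": date(2021, 9, 1)},
]

def attribute_families(documents):
    """Collapse documents into one record per family, using the earliest document
    to set the priority office and the year."""
    families = {}
    for doc in documents:
        families.setdefault(doc["family"], []).append(doc)
    result = {}
    for name, docs in families.items():
        first = min(docs, key=lambda d: d["date"])  # earliest filing anywhere
        result[name] = {"priority_office": first["office"], "year": first["date"].year}
    return result

patents = attribute_families(documents)
print(len(patents))  # 2
print(patents["robot"])  # {'priority_office': 'EPO', 'year': 2018}
```

The Chinese application does not create a third patent or a count for China; it is folded into the robot family, which is attributed to the EPO because that is where the earliest document was filed.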
Our patent metrics describe where patents are being filed, not which country has the most patents. In other words, we can't tell you how many AI patents are owned by Americans, but we can tell you how many patents were filed in the U.S. patent office. There may be overlap between these two categories, but it's not a perfect match: for example, about half of patent applications filed in the U.S. are from overseas.
We are working to build inventor nationality metrics into future versions. In the meantime, you can use the existing, filing location-based metrics to understand where AI innovators are most interested in protecting their inventions - and in turn, where they may be conducting R&D, manufacturing, marketing, expanding operations, or competing with foreign companies.
This dataset includes only AI-related patents. CSET and 1790 Analytics developed a method to identify these patents from broader 1790 and The Lens patent data holdings using a combination of keywords and patent classification codes, which are categories applied to individual patents by some patent offices. We also used keywords and classification codes to link each patent to different industries (e.g., life sciences, transportation). For more information, you can read CSET's paper on AI patents or visit the GitHub repo for the CSET/1790 project.
In the process of combining different patent data sources into the dataset, duplicate patents are created. We use patent IDs and patent family IDs, which are unique identifiers assigned by national patent offices, to detect and resolve these duplicates. We use data from the CSET/1790 project when different sources give different information for the same patent.
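One way to picture this resolution step is a keyed merge that prefers the CSET/1790 record when sources disagree. The sketch below is an assumption-laden illustration: the source labels, field names, and the treatment of all other sources as equal in priority are hypothetical, and only the preference for CSET/1790 data comes from the text above.

```python
# Invented records; "source" labels and field names are hypothetical.
records = [
    {"patent_id": "P1", "source": "lens", "field": "Transportation"},
    {"patent_id": "P1", "source": "cset_1790", "field": "Business"},
    {"patent_id": "P2", "source": "patstat", "field": "Energy Management"},
]

# Lower rank wins. Sources not listed here share the default rank of 1.
SOURCE_PRIORITY = {"cset_1790": 0}

def deduplicate(records):
    """Keep one record per patent_id, preferring the CSET/1790 source on conflict."""
    best = {}
    for rec in records:
        current = best.get(rec["patent_id"])
        rank = SOURCE_PRIORITY.get(rec["source"], 1)
        if current is None or rank < SOURCE_PRIORITY.get(current["source"], 1):
            best[rec["patent_id"]] = rec
    return best

deduped = deduplicate(records)
print(deduped["P1"]["field"])  # Business
```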
The investment and company metrics in the Country AI Activity Metrics dataset are derived from data from Crunchbase, a commercial dataset. Specifically, we use Crunchbase data related to equity investment into privately held, AI-related companies.
Crunchbase data has gaps, especially for companies with a lower public profile, but we believe it is a relatively comprehensive and accurate source for the sort of data used in this dataset. For more details, please refer to the methodology section and appendices in this CSET report; the process we used to evaluate and extract data from Crunchbase is generally similar to the process described there.
There is no single objective definition of an "AI company." We take a deliberately broad approach, using two different criteria to identify AI-related companies in Crunchbase. Any privately held company that meets at least one of the criteria is counted as an AI-related company, and investments involving that company will be included in the investment metrics.
These criteria are designed to capture a wide range of companies with AI-related activities across the globe, even for smaller countries or territories. Because of this broad approach, they may capture some companies and investments in Crunchbase that others might not consider AI-related. At the same time, they may leave out some companies and investments others would describe as AI-related.
We assign each AI-related company, and all the investments into that company, one or more application fields based on the company's industry tags and groups in Crunchbase. This table maps the tags and groups to application fields in our dataset.
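The sketch below illustrates how a tag-to-field mapping could assign a company to more than one application field. The mapping entries, company names, and amounts are invented, and crediting a company's full investment to every one of its fields is an assumption made for this example; the source only states that a company and its investments receive one or more fields.

```python
# Hypothetical tag-to-field mapping; the real mapping is in the table referenced above.
TAG_TO_FIELD = {
    "FinTech": "Finance",
    "EdTech": "Education",
    "Consulting": "Professional services",
}

def fields_for_company(tags):
    """Return the sorted, de-duplicated application fields implied by a company's tags."""
    return sorted({TAG_TO_FIELD[tag] for tag in tags if tag in TAG_TO_FIELD})

def investment_by_field(companies):
    """Credit each company's full investment to every field it is assigned (an assumption)."""
    totals = {}
    for company in companies:
        for field in fields_for_company(company["tags"]):
            totals[field] = totals.get(field, 0.0) + company["investment_musd"]
    return totals

companies = [
    {"name": "ExampleCo", "tags": ["FinTech", "EdTech"], "investment_musd": 10.0},
    {"name": "SampleCorp", "tags": ["FinTech"], "investment_musd": 5.0},
]
totals = investment_by_field(companies)
print(totals)  # {'Education': 10.0, 'Finance': 15.0}
```

Under this assumption the per-field totals add up to 25.0 even though only 15.0 was invested, so per-field figures should not be summed to obtain a country total.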
We adapt other Crunchbase data to generate the investment and company metrics:
We update the Indicators data through an automated pipeline that retrieves data from CSET's merged corpora of scholarly literature, patents, and investment data and calculates our metrics. This pipeline normally runs monthly, with occasional pauses to resolve issues due to vendor data changes or failures of automated data quality checks.
The underlying data sources are updated on their own schedules, at intervals ranging from daily to every six months.
Emerging technology topic classifications are based on work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.
12/5/24 | Initial release |