ETO Logo
Documentation: Country Activity Tracker: Artificial Intelligence

Overview

What is this tool?

The Country Activity Tracker (CAT) is a dashboard of AI activity at the national level. It includes metrics on research, patents, and private-market investment, providing insight into AI competition and cooperation around the world. Users can compare AI activity statistics for any country or group of countries worldwide, explore data on specific AI subfields and applications, and track AI-related collaborations and exchanges between nations.

What can I use it for?

You can use CAT to:

  • Measure a country's AI activity using CAT's metric groups: research, patents, and private-sector activity (companies and investments). Each group includes detailed metrics and data on trends over time.
  • Measure activity within different AI subfields and application areas using any of CAT's metric groups.
  • Compare countries across different metrics of AI activity. Users can build customized lists of countries, regions, and political groupings to compare across many of CAT's metrics.
  • Track trends in transnational AI activity, such as cross-border investment and co-authorship.
  • View leading AI organizations and companies within a country or group of countries using CAT's "top ten" lists.

View details and examples >>

What are its most important limitations?

  • CAT is AI-specific. CAT only covers AI-related activity. (See below for details about how we identified "AI-related" publications, patents, and companies.) If you need data on other types of emerging technology, ETO may have another tool that can help.
  • CAT focuses on countries, not organizations or individuals. CAT is basically a tool for tracking and comparing countries, so it includes mostly country-level data and metrics. It has some limited information about individual people and organizations within those countries, but isn't designed to support analysis at that level of specificity.
  • CAT's metrics may not give a complete picture. There are many ways to assess countries' AI activities. CAT includes three metric groups - research, patents, and private-sector activity (companies and investments) - that we think are meaningful, but they aren't exhaustive.
  • CAT's company and investment metrics capture a particular type of investment activity, and don't represent everything happening with AI in the private sector. CAT only includes data on equity investment transactions involving privately held, AI-related companies listed in the Crunchbase dataset. Read more >>
  • Patent and publication data have a lag, making counts incomplete for recent years. The lag is especially significant for patent data.
  • CAT’s publication metrics omit many Chinese-language publications. CAT’s publication data comes from ETO’s Merged Academic Corpus, which omits many Chinese-language publications. Because of this, CAT metrics related to Chinese AI articles should be interpreted with caution.
  • There are some errors in CAT's sub-country data. Data on individual authors, articles, and inventors are taken from other datasets using automated processes. These datasets and processes may introduce errors, especially for countries and subtopics with less activity.

What are its sources?

CAT uses ETO's Merged Academic Corpus for research data; The Lens, PATSTAT, and 1790 Analytics data for patents; and Crunchbase for company and investment data. Read more >>

Does it contain sensitive information, such as personally identifiable information?

No, other than the names of certain authors and patent assignees (all taken from public documents).

What are the terms of use?

The CAT tool and metrics are subject to ETO's general terms of use. If you use the tool, please cite us. The underlying datasets are not publicly available due to licensing restrictions.

How do I cite it?

If you use data from CAT in your work, please cite the "Emerging Technology Observatory Country Activity Tracker: Artificial Intelligence" and include the link to the tool.

Using CAT

How do I use it?

These instructions focus on the desktop version of the tool. Some features may be missing or act differently on mobile devices.

CAT includes three basic views, one for each "dataset" or metric group - research, patents, and private-sector activity (companies and investments). Each view includes a different set of customizable tables and visuals.

Getting started

Start with the selection bar at the top of the tool:

The CAT selection bar

Use the "Dataset" dropdown selector to choose a group of metrics to display - the tool will update automatically to match your selection. Then, use the other selectors to specify:

  • The country or countries you'd like to analyze with those metrics. The list includes individual countries (including some notable territories), regional groupings (Africa, North America, etc.), and certain political groupings (NATO, EU, etc.). You can choose more than one country or group of countries to analyze; type into the dropdown to navigate the list more quickly. If you select one or more groups of countries, you can check the "Aggregate metrics for country groups" box to make CAT display metrics at the group level, rather than displaying metrics for each individual country in the group.
  • The AI subfields or application areas to include in the analysis. The default is "all AI." The options in this dropdown will change depending on whether research, patents, or private-sector activity are selected.

You can restore the defaults at any time with the "Clear" button.

Reading and modifying individual tables and graphs

Look for dropdown menus, sorting buttons, and similar elements in the CAT tables to customize your analysis. The data in each table will update in real time as you make your selections.

Animation of the CAT tool in use

Hover over the "?" icons to learn more about different sections and data points.

Animation: hovering over the question mark icon

Coming back to a view

As you work with CAT, your browser's address bar will update to reflect the applied filters and selections. Copy the URL in order to return to the same view later.

What can I use it for?

Measuring a country's AI activity using CAT's three metric groups: research, patents, and private-sector activity (companies and investments). Each group includes detailed metrics and data on trends over time.

Examples

Comparing countries across different metrics of AI activity. Users can build customized lists of countries, regions, and political groupings to compare across many of CAT's metrics.

Tracking trends in transnational AI activity, such as cross-border investment and co-publication.

Identifying leading AI organizations and companies within a country or group of countries using CAT's "top ten" features.

How have others used it?

Here are some public examples:

We’ll add new examples here as we learn about them.

CAT is not well suited to the following uses:

  • Researching individual authors, inventors, or publishing organizations. CAT is basically a tool for tracking and comparing countries, so it includes mostly country-level data and metrics.
  • Measuring investment in large (publicly traded) AI companies. CAT's investment metrics only cover private-market investment flows, such as venture capital. It excludes all investment in publicly traded companies, such as the "Big Tech" firms. There is no data specific to these larger companies in CAT, and the country-level investment numbers don't include any investments into them.
  • Measuring AI patenting by inventors from a specific country or countries. CAT's patent metrics describe where patents are being filed, not which country has the most patents. In other words, CAT can't tell you how many AI patents are owned by Americans, but it can tell you how many patents were filed in the U.S. patent office. Read more >>

Sources and methodology

CAT uses different datasets for its research metrics, patent metrics, and investment metrics.

Research data

Research data in CAT comes from ETO's Merged Academic Corpus (MAC), which contains detailed information on over 260 million scholarly articles from around the world. Every article in the MAC is tagged as AI-related or not using an automated, classifier-based process; CAT uses the AI-related articles only. For more details, see the MAC documentation.

We updated our models for identifying AI-related research in November 2023 to improve accuracy and compatibility with new data sources. This caused changes to AI-related totals in our Merged Academic Corpus and in ETO tools that use MAC data, including CAT.

Linking articles to countries

CAT attributes articles to countries based on the author organizations listed in each article, as recorded in MAC metadata. (Here, and generally in ETO resources, we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities. Read more >>) In CAT, an article "counts for" a given country if it lists at least one author affiliated with an organization in that country. The MAC relies on the article to determine the author's organization; for instance, an article listing "Jane Smith, University of Texas" as its author would be attributed to the United States even if Professor Smith later moved to the University of Tokyo. By the same token, authors are associated with the country of their listed organization even if they're not "from" that country: once she moved to the University of Tokyo, Professor Smith's articles would count for Japan, even if she was born and raised in Chicago.

If an article lists authors from organizations in more than one country, the article will "count toward" multiple countries in CAT. However, if a single article has multiple authors from the same country, it will only be counted once for that country.

So, for example:

  • an article with a single author, from the University of Texas, will count as one article for the United States.
  • an article with one author from the University of Texas and one author from the University of Tokyo will count as one article for the United States and one article for Japan.
  • an article with two authors from the University of Texas and one author from the University of Tokyo will also count as one article for the United States and one article for Japan.
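The counting rule in these examples can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not CAT's actual pipeline; the function name and the input layout (one list of author affiliation countries per article) are assumptions made for the example.

```python
from collections import Counter


def count_articles_by_country(articles):
    """Count each article once per distinct country among its author affiliations.

    `articles` is a hypothetical input: one list of affiliation countries per article.
    """
    counts = Counter()
    for author_countries in articles:
        # A set collapses several authors from the same country into one entry.
        counts.update(set(author_countries))
    return counts


# The three example articles from the list above.
articles = [
    ["United States"],                            # one author, University of Texas
    ["United States", "Japan"],                   # one Texas author, one Tokyo author
    ["United States", "United States", "Japan"],  # two Texas authors, one Tokyo author
]
counts = count_articles_by_country(articles)
print(counts["United States"], counts["Japan"])
```

Because an article can count toward several countries, country totals computed this way will generally add up to more than the number of distinct articles.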

Identifying top authors

The top ten authors for each country are the ten authors with the most citations to articles they released while affiliated with organizations in that country. (We exclude authors with fewer than five articles published in the last 10 years.) Authors may appear in the top ten for multiple countries if their output in each country qualifies them for each country's list. For example, if Professor Smith published highly cited articles in Texas and in Tokyo, she might make the top ten list for the United States (based on her Texas articles) and for Japan (based on her Tokyo articles).

Note that the "Affiliation" column for each author in the "Top Ten Authors" table is populated using the organization where the author received the most citations. For example, if Professor Smith worked at Georgia Tech before moving to Texas, but the articles she published in Georgia have fewer total citations than the articles she published in Texas, she would be listed in the table as affiliated with the University of Texas, not Georgia Tech.
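As a rough sketch of the ranking logic, the Python below totals citations per author within one country and applies the minimum-article filter. The record layout, the function name, and the choice to count the five-article minimum within the selected country are assumptions for illustration; the documentation does not specify these details.

```python
from collections import defaultdict


def top_authors(records, country, current_year, top_n=10, min_articles=5, window_years=10):
    """Rank authors by citations to articles published while affiliated with `country`.

    `records` is a hypothetical input: (author, country, year, citations) tuples,
    one per article-author pair.
    """
    citations = defaultdict(int)
    recent_articles = defaultdict(int)
    for author, rec_country, year, cites in records:
        if rec_country != country:
            continue
        citations[author] += cites
        # Assumption: "the last 10 years" is measured back from `current_year`.
        if year > current_year - window_years:
            recent_articles[author] += 1
    eligible = [a for a in citations if recent_articles[a] >= min_articles]
    return sorted(eligible, key=lambda a: citations[a], reverse=True)[:top_n]


# Invented sample data: Smith published in both countries, Lee has too few articles.
records = (
    [("Smith", "United States", 2020, 10)] * 5
    + [("Smith", "Japan", 2023, 50)] * 5
    + [("Lee", "United States", 2021, 100)] * 2
)
us_top = top_authors(records, "United States", current_year=2024)
japan_top = top_authors(records, "Japan", current_year=2024)
print(us_top, japan_top)
```

In this sample, Smith qualifies for both countries' lists, while Lee is filtered out despite having more total citations.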

Assigning articles to subfields

We classify articles into AI subfields using subject assignment scores in the Merged Academic Corpus, which are generated algorithmically. (The MAC's subject scoring models only work on English-language articles; for each non-English article, we impute scores based on the average scores of the articles it cites or is cited by.)
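The imputation step for non-English articles might look something like the sketch below. It is a simplified illustration, not the MAC's implementation: the function name is hypothetical, and treating a subfield missing from a linked article as a score of 0.0 is an assumption.

```python
from statistics import mean


def impute_scores(neighbor_scores):
    """Average each subfield's score over an article's citing and cited articles.

    `neighbor_scores` is a hypothetical input: one {subfield: score} dict per linked article.
    """
    if not neighbor_scores:
        return None  # no citation data, so no scores can be imputed
    subfields = set().union(*neighbor_scores)
    # Assumption: a subfield absent from a linked article counts as 0.0.
    return {s: mean(n.get(s, 0.0) for n in neighbor_scores) for s in subfields}


scores = impute_scores([
    {"Machine learning": 0.8},
    {"Machine learning": 0.4, "Simulation": 0.2},
])
print(scores)
```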

CAT includes the following subfields:

  • Computer vision
  • Control theory and engineering
  • Data mining
  • Graphics and multimedia
  • Human–computer interaction
  • Information management and retrieval
  • Machine learning
  • Mathematical optimization
  • Natural language processing and linguistics
  • Real-time computing
  • Simulation
  • Speech recognition
  • Theoretical computer science

We calculate each article's scores for each of these subfields and assign it to up to three of its highest-scoring subfields. Depending on their scores, some articles may not be assigned to subfields (for example, articles on niche subjects or whose topical focus is uncertain). Articles that lack subject assignment scores altogether are also left out of the subfield categorizations. Generally, these are non-English articles with insufficient citation data to impute scores, as described above.
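A minimal sketch of this assignment rule follows. The `min_score` cutoff is a placeholder: the documentation says some articles go unassigned depending on their scores but does not state a threshold, so the value 0.5 and the function name are assumptions.

```python
def assign_subfields(scores, max_subfields=3, min_score=0.5):
    """Return up to `max_subfields` of the highest-scoring subfields.

    `min_score` is a hypothetical cutoff, not a documented CAT parameter.
    """
    if not scores:
        return []  # articles without scores are left out of subfield categories
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked[:max_subfields] if score >= min_score]


example = assign_subfields({
    "Computer vision": 0.9,
    "Machine learning": 0.8,
    "Simulation": 0.7,
    "Data mining": 0.6,
})
print(example)
```

With these invented scores, the article lands in three subfields and "Data mining" is dropped because it ranks fourth.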

Patent data

CAT uses patent data from 1790 Analytics, PATSTAT, and The Lens, and applies methods developed jointly by CSET and 1790 to screen and structure the data.

Patent data are laggy

Some countries' patent offices take years after filing to release data on some patents. Unfortunately, this means that the last several years of patent data in this dataset are incomplete. Use data from incomplete years with caution.

Counting patents, patent families, and patent documents

The CAT dataset includes over 360,000 AI-related patent families, which are groups of patent documents related to the same invention. These documents may include patent applications, which are requests pending at a country's patent office for the grant of a patent, and granted patents, which are approved requests awarding a property right for that invention. (We exclude other types of patent documents, such as amendments or other administrative documents.)

In CAT, each patent family is counted as a single "patent." If the family includes at least one granted patent, the family is counted as a "granted patent." If the family only includes patent applications, it's treated as a "patent application."

Inventors often file patents for the same invention in multiple jurisdictions, since each jurisdiction's patent office can only enforce patent protections in its own jurisdiction. For example, a company with U.S.-patented products might seek patents in France if it plans to start manufacturing or selling the same products there. CAT's patent dataset includes data from 52 different patent offices around the world, including national offices (such as the U.S. Patent and Trademark Office) and international offices (such as the European Patent Office). When an inventor seeks a patent for an invention in more than one of these jurisdictions, all of the documents from every jurisdiction are counted as part of the same patent. We attribute that patent to the first jurisdiction where the inventor filed (the "priority country").

To make this more concrete, suppose:

  • Jane Smith invents new AI software and uses it to build a new type of robot.
  • She applies to the European Patent Office for a patent for the new software, plus another patent for the robot.
  • The EPO grants the robot patent.
  • While she waits on the patent for the software, Jane goes ahead and applies for a patent for the robot at the Chinese patent office (in case she decides to manufacture the robot in China).

At this point, CAT would count two patents for Jane: one EPO granted patent (for the robot) and one EPO patent application (for the software). Note that there are at least four patent documents involved: an EPO application, EPO patent grant, and Chinese application for the robot, and an EPO application for the software. But the first three documents all relate to the same invention, so CAT counts them together as a single patent attributed to the EPO, where Jane filed first.
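The family-level counting described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the CSET/1790 method: the `PatentDocument` fields, the function name, and the dates in the sample data are all invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PatentDocument:
    family_id: str   # hypothetical identifier grouping documents for one invention
    office: str      # patent office that issued or received the document
    kind: str        # "application" or "grant"
    doc_date: date   # date of the document


def summarize_families(documents):
    """Collapse patent documents into one counted patent per family."""
    families = {}
    for doc in documents:
        families.setdefault(doc.family_id, []).append(doc)
    summary = {}
    for family_id, docs in families.items():
        earliest = min(docs, key=lambda d: d.doc_date)
        granted = any(d.kind == "grant" for d in docs)
        summary[family_id] = {
            "status": "granted patent" if granted else "patent application",
            "priority_office": earliest.office,  # first jurisdiction filed in
            "year": earliest.doc_date.year,      # year first applied for anywhere
        }
    return summary


# Jane's four documents, with invented dates.
docs = [
    PatentDocument("robot", "EPO", "application", date(2020, 1, 10)),
    PatentDocument("robot", "EPO", "grant", date(2021, 6, 1)),
    PatentDocument("robot", "CN", "application", date(2020, 9, 1)),
    PatentDocument("software", "EPO", "application", date(2020, 1, 10)),
]
summary = summarize_families(docs)
print(summary)
```

The four documents reduce to two counted patents, both with the EPO as the priority office: a granted patent for the robot and a patent application for the software.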

Inventor country vs. filing country

CAT's patent metrics describe where patents are being filed, not which country has the most patents. In other words, CAT can't tell you how many AI patents are owned by Americans, but it can tell you how many patents were filed in the U.S. patent office. There may be overlap between these two categories, but it's not a perfect match: for example, about half of patent applications filed in the U.S. are from overseas.

We are working to build inventor nationality metrics into future versions of CAT. In the meantime, you can use the existing, filing location-based metrics to understand where AI innovators are most interested in protecting their inventions - and in turn, where they may be conducting R&D, manufacturing, marketing, expanding operations, or competing with foreign companies.

Identifying and classifying AI patents

CAT includes only AI-related patents. CSET and 1790 Analytics developed a method to identify these patents from broader 1790 and The Lens patent data holdings using a combination of keywords and patent classification codes, which are categories applied to individual patents by some patent offices. We also used keywords and classification codes to link each patent to different AI techniques (e.g., machine learning, logic models), applications (e.g., speech processing, computer vision), and industries (e.g., life sciences, transportation). A patent can have more than one of any of these labels: for example, a patent for a robot that recognizes and responds to spoken commands might be assigned to the robotics and speech processing applications. For more information on this method, you can read CSET's paper on AI patents or visit the GitHub repo for the CSET/1790 project.

Other metadata

CAT adapts other data from the patent dataset to generate metrics:

  • Filing order: We use the dates of patent documents to determine where the first patent related to an invention was filed (the "priority country").
  • Assignee: A patent's assignee is the person or organization that owns the property right to the patent. CAT includes a list of the ten assignees with the most patents granted by each patent office. A single granted patent may be associated with more than one assignee if multiple inventors filed it jointly or if its inventor sells or otherwise transfers it to someone else.
  • Patent year: CAT assigns each patent the year it was first applied for in any jurisdiction.

Deduplicating patents

In the process of combining different patent data sources into the CAT database, duplicate patents are created. We use patent IDs and patent family IDs, which are unique identifiers assigned by national patent offices, to detect and resolve these duplicates. We use data from the CSET/1790 project when different sources give different information for the same patent.
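One simple way to express this preference rule is sketched below. The record fields, the source labels, and the function name are hypothetical; the actual deduplication process may reconcile individual fields rather than whole records.

```python
def deduplicate(records, preferred_source="cset_1790"):
    """Keep one record per patent ID, preferring the CSET/1790 version on conflict.

    `records` is a hypothetical input: dicts with "patent_id" and "source" keys
    plus any other metadata fields.
    """
    best = {}
    for rec in records:
        current = best.get(rec["patent_id"])
        replace = (
            current is None
            or (rec["source"] == preferred_source and current["source"] != preferred_source)
        )
        if replace:
            best[rec["patent_id"]] = rec
    return list(best.values())


records = [
    {"patent_id": "P1", "source": "lens", "year": 2019},
    {"patent_id": "P1", "source": "cset_1790", "year": 2018},
    {"patent_id": "P2", "source": "patstat", "year": 2020},
]
unique = deduplicate(records)
print(unique)
```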

Company and investment data

The investment and company metrics in CAT rely on data from Crunchbase, a commercial dataset. Specifically, CAT uses Crunchbase data related to equity investment into privately held, AI-related companies.

  • Equity investment: We count venture capital transactions, private equity transactions, and mergers and acquisitions, but exclude non-equity investment such as debt finance, grants, and crowdfunding.
  • Privately held: We exclude any company traded on a public stock exchange. As a result, most of the "Big Tech" firms and other well-established companies are excluded.
  • Our method for identifying AI-related companies is described below.

Crunchbase data has gaps, especially for companies with a lower public profile, but we believe it is a relatively comprehensive and accurate source for the sort of data CAT uses. For more details, please refer to the methodology section and appendices in this CSET report; the process we used to evaluate and extract data from Crunchbase is generally similar to the process described there.

There is no single objective definition of an "AI company." We take a deliberately broad approach, using two different criteria to identify AI-related companies in Crunchbase. Any privately held company that meets at least one of the criteria is counted as an AI-related company in CAT, and investments involving that company will be included in CAT's investment metrics.

  • Keyword match: The company's business description in Crunchbase matches a regular expression-based query for AI keywords. The query is reproduced in Appendix 2 of this CSET report.
  • Tag match: Crunchbase assigns the company an industry tag in the "artificial intelligence" industry group.

These criteria are designed to capture a wide range of companies with AI-related activities across the globe, even for smaller countries or territories. Because of this broad approach, they may capture some companies and investments in Crunchbase that others might not consider AI-related. At the same time, they may leave out some companies and investments others would describe as AI-related.
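The "at least one criterion" logic can be sketched as below. The keyword pattern here is a deliberately tiny stand-in, not the query from the CSET report, and the tag set is likewise a placeholder for the tags in Crunchbase's "artificial intelligence" industry group; the function name and arguments are also assumptions.

```python
import re

# Placeholder pattern for illustration only; the real query is much longer.
AI_KEYWORDS = re.compile(
    r"\b(artificial intelligence|machine learning|neural network)\b",
    re.IGNORECASE,
)
# Placeholder tag set standing in for the "artificial intelligence" industry group.
AI_TAGS = {"artificial intelligence", "machine learning"}


def is_ai_company(description, industry_tags, is_public):
    """Return True if a privately held company meets at least one AI criterion."""
    if is_public:
        return False  # publicly traded companies are excluded
    keyword_match = bool(AI_KEYWORDS.search(description or ""))
    tag_match = any(tag.lower() in AI_TAGS for tag in industry_tags)
    return keyword_match or tag_match


print(is_ai_company("We build machine learning tools for farms", [], False))
print(is_ai_company("Logistics software", ["Artificial Intelligence"], False))
print(is_ai_company("Machine learning at scale", ["Artificial Intelligence"], True))
```

Combining the two criteria with "or" is what makes the approach broad: a company is included if either its description or its tags point to AI.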

Identifying and assigning subfields

CAT assigns each AI-related company, and all the investments into that company, one or more application fields based on the company's industry tags and groups in Crunchbase. This table maps the tags and groups to CAT application fields.

Other metadata

CAT adapts other Crunchbase data to generate investment and company metrics:

  • Investment year: CAT assigns each investment the year it was announced.
  • Investment count (total and per-year): CAT counts private equity transactions and venture capital rounds with multiple investors as a single transaction, and not as multiple investments.
  • Estimated investment value: Equity investment transaction values are often kept confidential. CAT's "disclosed value" metrics exclude all investments without a publicly announced dollar amount. CAT's "estimated value" metrics include these investments, assigning each one a dollar value equal to the median value of Crunchbase investments with the same investment stage, target country, and year.
  • Company nationality: CAT assigns each company the nationality of the country where it is headquartered according to Crunchbase.
  • Investor nationality: CAT assigns each organizational investor the nationality of the country where it (for corporate investors) or its managing entity (for VC and PE funds) is headquartered. For example, an investment firm based in San Francisco will be classified as American even if most of its investors are Chinese.
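The "estimated value" rule above lends itself to a short sketch. The dictionary keys and function name are assumptions, as is the choice to leave an investment unestimated (None) when no disclosed peer exists; the documentation does not say how that case is handled.

```python
from collections import defaultdict
from statistics import median


def estimate_values(investments):
    """Fill undisclosed values with the median of disclosed peer investments.

    `investments` is a hypothetical input: dicts with "stage", "target_country",
    "year", and "value" keys, where "value" is None when undisclosed.
    Peers share the same stage, target country, and year.
    """
    disclosed = defaultdict(list)
    for inv in investments:
        if inv["value"] is not None:
            key = (inv["stage"], inv["target_country"], inv["year"])
            disclosed[key].append(inv["value"])
    estimates = []
    for inv in investments:
        if inv["value"] is not None:
            estimates.append(inv["value"])
            continue
        peers = disclosed.get((inv["stage"], inv["target_country"], inv["year"]))
        # Assumption: no disclosed peers means no estimate.
        estimates.append(median(peers) if peers else None)
    return estimates


investments = [
    {"stage": "seed", "target_country": "US", "year": 2022, "value": 10},
    {"stage": "seed", "target_country": "US", "year": 2022, "value": 30},
    {"stage": "seed", "target_country": "US", "year": 2022, "value": None},
    {"stage": "series_a", "target_country": "JP", "year": 2022, "value": None},
]
estimates = estimate_values(investments)
print(estimates)
```

Summing only the non-None original values corresponds to a "disclosed value" total, while summing the filled-in list corresponds to an "estimated value" total.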

Interpreting CAT's cross-border investment data

CAT's cross-border investment tables should be interpreted with care. Most private-market AI investment transactions, such as venture capital deals, combine contributions from multiple investors, and the exact amount of each investor's contribution is rarely disclosed. This makes it impossible to add up the total investment from investors in a specified country. Instead, the numbers in CAT's cross-border investment tables reflect the total value or count of investment transactions with at least one participating investor from that country.

For example, in this investment comparison, $5,936 million ($5.9 billion) is the value of transactions with a target company in the United States and at least one participating investor from Canada - not the amount Canadians invested into U.S. AI companies.

A cross-border investment table showing the cited figure
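To see why these figures cannot be read as amounts invested by a country, consider the sketch below. The transaction layout, function name, and dollar values are invented for illustration, and skipping investors from the target country itself is an assumption about how a cross-border table would be built.

```python
def cross_border_totals(transactions, target_country):
    """Total value and count of transactions with at least one investor per country.

    `transactions` is a hypothetical input: dicts with "target_country",
    "investor_countries", and "value" keys.
    """
    totals = {}
    for tx in transactions:
        if tx["target_country"] != target_country:
            continue
        # A set ensures a transaction counts once per country, however many
        # investors from that country took part.
        for country in set(tx["investor_countries"]):
            if country == target_country:
                continue  # domestic investors are not cross-border
            entry = totals.setdefault(country, {"value": 0, "count": 0})
            entry["value"] += tx["value"]
            entry["count"] += 1
    return totals


transactions = [
    {"target_country": "United States", "investor_countries": ["Canada", "Canada", "United Kingdom"], "value": 100},
    {"target_country": "United States", "investor_countries": ["Canada", "United States"], "value": 50},
]
totals = cross_border_totals(transactions, "United States")
print(totals)
```

In this sample the two transactions are worth 150 in total, yet the per-country figures sum to 250, because the first transaction is credited in full to both Canada and the United Kingdom.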

Maintenance

How is it updated?

The CAT user interface is updated as new features are developed. The underlying data is currently updated at least once a quarter, although we plan to automate more frequent updates within the next year.

How can I report an issue?

Use our general issue reporting form, or click on the "Submit feedback" icons embedded in the tool to report issues related to specific data points.

Animation: hovering over the feedback button

Credits

  • Concept: Husanjot Chahal, Sara Abdulla, Jennifer Melot, Ilya Rahkovsky
  • Design and analysis: Husanjot Chahal, Sara Abdulla, Jennifer Melot, Zach Arnold
  • Engineering: Jennifer Melot
  • Review and testing: Aaron Melville, Matt Sheehan, Andreas Kuehn, Kaishap Raibagi, Katherine Koleski, Michael Selitto, Ari Filler, Sriya Guduru, Alina Thai, Tisyaketu Sirkar, Patrick Wu, and anonymous testers
  • Documentation: Zach Arnold, Husanjot Chahal, Sara Abdulla, Jennifer Melot, Katherine Quinn
  • Maintenance: Jennifer Melot, Neha Singh, Brian Love

Major change log

11/22/23: Major data update following MAC changes
10/19/22: Updated version, including changes to available AI subfields
8/18/22: Initial release on CSET's website