Documentation: Private-Sector AI Indicators

Overview

What is this dataset?

The Private-Sector AI Indicators dataset includes a diverse range of indicators of AI-related activity for hundreds of companies worldwide, from startups to multinationals. The dataset uses original metadata, models, and methods developed by ETO and CSET to transform information from a wide variety of data sources into AI-related research, patenting, and hiring metrics for every company covered in PARAT.

How do I get it?

The dataset csv files are available on Zenodo.

Which ETO products use it?

The Indicators dataset powers ETO's PARAT tool.

What are its sources?

The AI activity metrics in the Indicators dataset are original to ETO and CSET. We use our own methods, models and metadata to derive values for these metrics from a variety of underlying data sources. These sources include:

ETO's Merged Academic Corpus for publications data
1790 Analytics, PATSTAT, and The Lens for patent data
Revelio Labs for workforce data

In addition to company descriptive metadata manually collected by ETO annotators, the Indicators dataset includes metadata sourced from the Crunchbase Open Data Map and PermID.

For more details on how we transform data from underlying sources into the Indicators AI activity metrics, see below.

What are its main limitations?

The dataset doesn't cover all AI-related companies, and the sample it does cover isn't necessarily representative of all global activity in AI. The companies currently included in the dataset represent a wide range of industries, sizes, countries and levels of AI activity. Some important groups of companies are covered in full (such as the S&P 500 and those companies typically considered part of "Big Tech"). Overall, though, the companies covered in PARAT represent only a slice of the global AI landscape. Read more >>
The dataset doesn't cover every important indicator of AI-related activity. The Indicators use publications, patents, and AI-related jobs to capture different dimensions of AI-related activity. There are many other ways to capture this activity that are not included in the dataset. Read more >>
The dataset is derived in part from publicly available data sources, which may not fully capture relevant activity. Metrics on companies' publications, patents, and workforce are imperfect proxies for AI-related research, invention, and hiring. Read more >>
Workforce metrics are incomplete. In particular, workforce data involving companies outside the U.S. should be used with extreme caution. Read more >>
Patent and publication data have a lag. Read more >>

🔔

Workforce data are not currently updating as we implement changes to the underlying data feeds. The workforce data included in the current version of the Indicators were last updated in late 2024. As we work to resume updates, feel free to contact us with any questions.

What are the terms of use?

This dataset is subject to ETO's general terms of use. If you use it, please cite us.

Some descriptive metadata in the core and id tables are sourced from the Crunchbase Open Data Map and PermID, so additional terms from those sources may apply to some uses of the metadata. If you are using descriptive metadata from those tables, consult the sources for additional terms.

How do I cite it?

Please cite the "Emerging Technology Observatory Private-Sector AI Indicators dataset," including the link.

If you use PARAT to access the data, you can cite that tool instead.

Structure and content

The Indicators dataset consists of the csv tables core, ticker, alias, id, and yearly_publication_counts.

core

This table includes company descriptive metadata and metrics related to AI publications, patents, and workers. All metrics are aggregated over the specified time periods. The yearly_publication_counts table includes corresponding disaggregated values for patents and publications (yearly counts for workforce metrics are unavailable).

Name	Type	Description
Name	text	The canonical name of the company.
ID	text	A unique numeric ID for the company.
PARAT link	URL	A link to the company's detail page in the PARAT tool.
Country	text	The country where the company's headquarters is located.
City	text	The city where the company's headquarters is located.
State/province	text	The state or province where the company's headquarters is located.
Website	URL	The company's homepage.
Groups	array	Thematic groups that include the company. Read more >>
Aggregated subsidiaries	text	Specific subsidiaries whose AI activity is "counted toward" the company in PARAT (in cases where PARAT's data sources report the subsidiary's data separately from the parent). This list is not necessarily exhaustive: PARAT's data sources may report data from other subsidiaries together with the parent's data. In other words, the fact that a specific subsidiary is not listed in this column does not mean that that subsidiary's data is not included in the parent's metrics. Read more >>
Parent company ID	number	If Parent company name is populated, the unique numeric ID for the parent company in this dataset (corresponds to the ID field in core).
Region	text	The geographic region in which the company's headquarters is located (e.g., Asia, North America).
Stage	text	The company's stage of development. Read more >>
Sector	text	The company's "Primary Business Sector" (corresponding to "Business Sector" in the Thomson Reuters Business Classification) according to permid.org.
Description	text	A short description of the company.
Description source	text	The source of Description. Sources include Wikipedia pages, company websites, and the Crunchbase Open Data Map.
Description link	URL	The specific page where Description was located.
Description date	date	The date Description was retrieved at Description link.
Publications: AI publications	number	The total number of AI research publications that the company released publicly over the prior 10 years. Read more
Publications: Recent AI publication growth	number	Recent growth in the company's AI research publications, defined as the average percentage increase per year over the past three years of complete data. Read more
Publications: AI publication percentage	number	The percentage of the company's total research publications over the past 10 years (Publications: Total publications) that were AI publications (Publications: AI publications).
Publications: Highly cited AI publications	number	The total number of highly cited AI research publications that the company released publicly over the prior 10 years. Read more
Publications: AI publications in top conferences	number	The number of research publications by authors from the company that were accepted to top AI conferences over the past ten years. Read more >>
Publications: Citations to AI research	number	The number of citations to AI research publications released by the company over the prior 10 years. Read more
Publications: CV publications	number	The total number of computer vision research publications that the company released publicly over the prior 10 years. Read more
Publications: NLP publications	number	The total number of natural language processing research publications that the company released publicly over the prior 10 years. Read more
Publications: Robotics publications	number	The total number of robotics research publications that the company released publicly over the prior 10 years. Read more
Publications: AI safety publications	number	The total number of AI safety research publications that the company released publicly over the prior 10 years. Read more
Publications: Large language model publications	number	The total number of LLM research publications that the company released publicly over the prior 10 years. Read more
Publications: Total publications	number	The total number of research publications (on any subject) the company released publicly over the prior 10 years. Read more
Patents: AI patents	number	The number of AI-related patents the company filed over the past 10 years. Read more
Patents: AI patents: recent growth	number	Recent growth in the company's AI patent filings, defined as the average percentage increase per year over the past three years of complete data. (Patent data has a significant lag; the past three years of complete patent data typically equates to between three and six years behind the present.) Read more
Patents: AI patent percentage	number	The percentage of the company's total filed patents over the past 10 years (Patents: Total patents) that were AI-related (Patents: AI patents). Read more
Patents: Granted AI patents	number	The number of AI-related patents granted to the company over the past 10 years. Read more
Patents: Total patents	number	The number of patents (on any topic) the company filed over the past 10 years. Read more
Patents: AI use cases: Agriculture	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Banking and finance	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Business	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Computing in government	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Document management and publishing	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Education	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Energy	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Entertainment	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Industry and manufacturing	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Life sciences	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Military	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Nanotechnology	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Networking	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Personal devices and computing	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Physical sciences and engineering	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Security	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Semiconductors	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Telecommunications	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI use cases: Transportation	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more
Patents: AI applications and techniques: Analytics and algorithms	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Patents: AI applications and techniques: Computer vision	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Patents: AI applications and techniques: Control	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Patents: AI applications and techniques: Distributed AI	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Patents: AI applications and techniques: Knowledge representation	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Patents: AI applications and techniques: Language processing	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Patents: AI applications and techniques: Measuring and testing	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Patents: AI applications and techniques: Planning and scheduling	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Patents: AI applications and techniques: Robotics	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Patents: AI applications and techniques: Speech processing	number	The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more
Workforce: AI workers	number	The number of known AI workers employed by the company as of the most recent data update.
Workforce: Tech Team 1 workers	number	The number of known Tech Team 1 workers employed by the company as of the most recent data update.
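
As a rough illustration of how the percentage fields relate to the count fields, the sketch below reads a core-style csv and recomputes Publications: AI publication percentage from the two underlying counts. The sample rows and company names are invented, and the handling of companies with zero total publications is an assumption, not documented ETO behavior.

```python
import csv
import io

# Hypothetical sample in the shape of the core table (all values are invented).
SAMPLE_CORE = """Name,ID,Publications: AI publications,Publications: Total publications
Example Corp,1,50,200
Sample Labs,2,0,0
"""


def ai_publication_percentage(row):
    """Return AI publications as a percentage of total publications."""
    ai = int(row["Publications: AI publications"])
    total = int(row["Publications: Total publications"])
    if total == 0:
        # Assumption: treat the percentage as undefined when there are no publications.
        return None
    return 100.0 * ai / total


rows = list(csv.DictReader(io.StringIO(SAMPLE_CORE)))
percentages = {row["Name"]: ai_publication_percentage(row) for row in rows}
print(percentages)
```

The same pattern should apply to Patents: AI patent percentage, using Patents: AI patents and Patents: Total patents.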

ticker

This table includes companies' ticker symbols on major exchanges, if any.

Name	Type	Description
Name	text	The name of the company. Maps to the same field in core.
ID	text	A unique numeric ID for the company. Maps to the same field in core.
Ticker	text	The ticker symbol for the company's shares on Exchange.
Exchange	text	The exchange on which the company's shares are traded.

alias

This table includes company aliases.

Name	Type	Description
Name	text	The name of the company. Maps to the same field in core.
ID	text	A unique numeric ID for the company. Maps to the same field in core.
Alias	text	An alternate name used for the company.
Language	text	The language of Alias (two-letter ISO code).

id

This table includes unique company identifiers in various external datasets. We currently compile identifiers for LinkedIn, Crunchbase, ROR, and PermID, as well as general-purpose regular expressions for each company. Read more

Name	Type	Description
Name	text	The name of the company. Maps to the same field in core.
ID	text	A unique numeric ID for the company. Maps to the same field in core.
Identifier	text	A unique identifier for the company (typically in the form of a string or URL) within the dataset indicated in Type.
Type	text	The dataset that uses Identifier to refer to the company. Read more >>
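
Because each company can have several rows in the id table, it can help to group identifiers by company ID and Type before joining them to core. The following is a minimal sketch with invented rows; the identifier strings are placeholders, and the exact Type labels used in the published csv are assumed here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the shape of the id table (placeholder identifiers).
SAMPLE_ID = """Name,ID,Identifier,Type
Example Corp,1,ror-placeholder,ROR
Example Corp,1,permid-placeholder,PermID
Sample Labs,2,linkedin-placeholder,LinkedIn URL
"""


def identifiers_by_company(id_rows):
    """Group identifiers as {company ID: {Type: [Identifier, ...]}}."""
    lookup = defaultdict(lambda: defaultdict(list))
    for row in id_rows:
        lookup[row["ID"]][row["Type"]].append(row["Identifier"])
    return lookup


id_rows = list(csv.DictReader(io.StringIO(SAMPLE_ID)))
lookup = identifiers_by_company(id_rows)
print(dict(lookup["1"]))
```

The ticker and alias tables share the same Name and ID fields, so they can be grouped and joined to core in the same way.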

yearly_publication_counts

This table includes disaggregated yearly counts of AI-related research publications and patents over the prior 10 years. The core table includes corresponding aggregate values. Not all metrics provided in core have disaggregated values available.

Name	Type	Description
Name	text	The name of the company. Maps to the same field in core.
ID	text	A unique numeric ID for the company. Maps to the same field in core.
PARAT link	URL	A link to the company's detail page in the PARAT tool.
Dataset	text	Either "Publications" or "Patents."
Category	text	The relevant publication or patent metric. Together with the Dataset field, this field corresponds to the similarly named field in core. For example, a row in yearly_publication_counts with a Dataset value of "Patents" and a Category value of "AI use cases: Business" contains a yearly value aggregated into the "Patents: AI use cases: Business" field in core.
Year	number	The year for which data is included in the row.
Value	number	The yearly count of research publications or patents, as applicable, for the row's company (Name field) and the specific type of publication or patent counted by the row (Dataset and Category fields).
Complete	boolean	Indicates whether the row is from a year where we consider our data materially complete. If Complete is false, we consider the data from that year materially incomplete and you should use it with caution.
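
The sketch below shows one way to work with this table: it rebuilds the matching core field name from Dataset and Category, and computes an average year-over-year percentage change over the last three complete years. The rows are invented, the serialization of Complete as "True"/"False" is assumed, and the growth formula is only one possible reading of the growth definition given for core, not a confirmed reproduction of ETO's calculation.

```python
import csv
import io

# Hypothetical rows in the shape of yearly_publication_counts (values invented).
SAMPLE_YEARLY = """Name,ID,Dataset,Category,Year,Value,Complete
Example Corp,1,Patents,AI use cases: Business,2019,10,True
Example Corp,1,Patents,AI use cases: Business,2020,20,True
Example Corp,1,Patents,AI use cases: Business,2021,30,True
Example Corp,1,Patents,AI use cases: Business,2022,5,False
"""


def core_field_name(dataset, category):
    """Build the core column name that a yearly row aggregates into."""
    return f"{dataset}: {category}"


def average_yearly_growth(rows, n_years=3):
    """Mean year-over-year percentage change across the last n complete years."""
    complete = sorted(
        (int(r["Year"]), int(r["Value"]))
        for r in rows
        if r["Complete"].strip().lower() == "true"  # assumed serialization
    )
    recent = complete[-n_years:]
    changes = []
    for (_, prev), (_, curr) in zip(recent, recent[1:]):
        if prev == 0:
            return None  # assumption: growth is undefined from a zero base
        changes.append(100.0 * (curr - prev) / prev)
    return sum(changes) / len(changes) if changes else None


rows = list(csv.DictReader(io.StringIO(SAMPLE_YEARLY)))
field = core_field_name(rows[0]["Dataset"], rows[0]["Category"])
growth = average_yearly_growth(rows)
print(field, growth)
```

Filtering on Complete before computing trends keeps partially reported years, such as 2022 in the sample, from showing up as an artificial decline.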

Sources and methodology

Selecting and aggregating companies

We select companies for inclusion in the Indicators dataset based on ETO and CSET’s research priorities, as well as feedback from users. Not all AI-related companies are covered. We plan to expand coverage over time; if there's a specific company or group of companies you'd like to see in the data, please let us know.

As described below, we collect common aliases and identifiers for each company in order to reliably identify data associated with the company in diverse source datasets.

The dataset also includes parent-subsidiary mappings for many companies (see "Aggregated subsidiaries" in core). When a subsidiary is mapped to a parent, the subsidiary's publications, patents, etc. are attributed to the parent company. In other cases, underlying data sources already consolidate a subsidiary's data with its parent, or the queries we run on those sources collect parent and subsidiary data together by design. We believe the data of most significant subsidiary companies are ultimately "rolled up" to the parent level in one of these ways.
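
To make the roll-up idea concrete, here is a minimal sketch that attributes a subsidiary's documents to its parent. The mapping, entity names, and document IDs are hypothetical, and deduplicating documents shared by parent and subsidiary is an assumption of this sketch rather than a documented step in ETO's pipeline.

```python
# Hypothetical subsidiary -> parent mapping (names invented for illustration).
PARENT_OF = {"Example Cloud": "Example Corp"}


def roll_up(docs_by_entity, parent_of):
    """Attribute each entity's documents to its parent and count unique documents."""
    rolled = {}
    for entity, docs in docs_by_entity.items():
        owner = parent_of.get(entity, entity)
        rolled.setdefault(owner, set()).update(docs)
    return {owner: len(docs) for owner, docs in rolled.items()}


raw = {
    "Example Corp": {"p1", "p2"},
    "Example Cloud": {"p2", "p3"},  # p2 is shared with the parent
    "Sample Labs": {"p4"},
}
rolled = roll_up(raw, PARENT_OF)
print(rolled)
```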

Compiling company metadata

ETO annotators map each company to the unique identifiers included in the id table, craft regular expressions that capture the company's name and known aliases (also included in the id table), and manually collect further metadata for the core, ticker, and alias tables based on web searches. Additional metadata in the core and ticker tables comes from the Crunchbase Open Data Map and PermID.

Stages

Company stages are generated using Crunchbase Pro data and a simple algorithm. "Mature" companies must meet at least one of the following conditions: be publicly traded; have more than 1,000 employees; or have last raised funding in late-stage venture capital or private equity rounds (e.g., Series D or post-IPO rounds). "Growth" and "Startup" companies are distinguished only by their last raised funding rounds: companies that last raised funding in mid-stage rounds (i.e., Series A-C) are labeled "Growth"; those that last raised funding in early-stage rounds (e.g., seed or angel) are labeled "Startup." Any privately held company with no information on funding rounds and either fewer than 1,000 employees or no employee data is designated "Unknown."
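
The rules above can be sketched as a small function. The round labels and the function signature are hypothetical (Crunchbase's actual funding-round vocabulary may differ), and a company with exactly 1,000 employees and no funding data falls through to "Unknown" here, which is an assumption about an edge case the description leaves open.

```python
# Hypothetical round labels; the real Crunchbase values may be spelled differently.
LATE_ROUNDS = {"series d", "private equity", "post-ipo"}
MID_ROUNDS = {"series a", "series b", "series c"}
EARLY_ROUNDS = {"seed", "angel"}


def classify_stage(publicly_traded, employees, last_round):
    """Assign a stage from listing status, headcount, and last funding round."""
    round_name = (last_round or "").strip().lower()
    large = employees is not None and employees > 1000
    if publicly_traded or large or round_name in LATE_ROUNDS:
        return "Mature"
    if round_name in MID_ROUNDS:
        return "Growth"
    if round_name in EARLY_ROUNDS:
        return "Startup"
    # No usable funding information and not large or public.
    return "Unknown"


print(classify_stage(False, 50, "Series B"))
```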

Identifiers

The id table includes the following types of unique identifiers for each company, where available:

ROR: Research Organization Registry URLs corresponding to the company.
Crunchbase URL and Crunchbase UUID: Crunchbase URLs and UUIDs corresponding to the company.
LinkedIn URL: URLs for the company's LinkedIn profile(s).
PermID: The PermID associated with the company.
Regex: A regular expression devised by ETO to capture the company's name and known aliases. This is used to identify relevant data for companies that lack other unique identifiers in the source datasets.
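
As an illustration of the Regex identifier type, the sketch below applies a made-up pattern to a few affiliation strings. The pattern and the strings are hypothetical; the actual expressions in the id table are written by ETO annotators and will look different for each company.

```python
import re

# Hypothetical pattern covering a company name and one alias.
EXAMPLE_REGEX = re.compile(r"\bexample\s+(corp(oration)?|labs)\b", re.IGNORECASE)

affiliations = [
    "Example Corporation, Austin, TX",
    "Example Labs",
    "Counterexample Corp",    # should not match: no word boundary before "example"
    "University of Examples",  # should not match: different organization
]

matches = [a for a in affiliations if EXAMPLE_REGEX.search(a)]
print(matches)
```

Word boundaries (`\b`) are one way to reduce false positives from longer names that happen to contain the company name.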

Groups

Companies in the dataset may be assigned to one or more groups. These groups include:

S&P 500: S&P 500 companies as of May 2024.
Global Big Tech: Major multinational technology companies selected by ETO analysts.
GenAI Contenders: Companies especially active in generative AI and large language model technologies, as identified by ETO analysts.

Publications

Publications data in the Indicators dataset comes from ETO's Merged Academic Corpus (MAC), which contains detailed information on over 280 million scholarly articles from around the world. MAC publications are linked to companies using authors' stated institutional affiliations. For more information, refer to the MAC documentation.

If a publication is linked to multiple companies (or other organizations), the Indicators dataset gives "full credit" to each organization. For example, a publication with two authors from Microsoft, one author from the University of Texas, and one author from Pfizer would be counted in the Indicators dataset as one paper for Microsoft and one paper for Pfizer.
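
The "full credit" rule can be sketched as follows, using the example above. The function name and the data layout (one list of author affiliations per paper) are assumptions made for illustration.

```python
from collections import Counter

# Organizations tracked as companies in this illustration.
COMPANIES_IN_DATASET = {"Microsoft", "Pfizer"}


def count_full_credit(publications, tracked):
    """Give each tracked organization one full count per paper it appears on."""
    counts = Counter()
    for author_affiliations in publications:
        # A set ensures two authors from the same company count only once.
        for org in set(author_affiliations) & tracked:
            counts[org] += 1
    return counts


paper = ["Microsoft", "Microsoft", "University of Texas", "Pfizer"]
counts = count_full_credit([paper], COMPANIES_IN_DATASET)
print(counts)
```

One consequence of full counting is that summing publication counts across companies can exceed the number of distinct papers.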

The MAC uses machine learning models to identify publications related to AI and several AI subfields, such as computer vision, AI safety, and large language models. We use those publications to compute the publication-related fields in the core table.

Identifying highly cited AI publications

For PARAT purposes, "highly cited" AI publications are those that have a citation count in the 90th percentile or greater of AI-related publications published in the same year.
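
A minimal sketch of a per-year percentile cutoff is shown below. It uses a nearest-rank percentile, which is an assumption: the exact percentile method used for PARAT is not specified here, and different methods can give slightly different thresholds. The paper records are invented.

```python
import math
from collections import defaultdict


def highly_cited(papers, percentile=90):
    """Return IDs of papers at or above the given citation percentile for their year."""
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["year"]].append(paper["citations"])

    thresholds = {}
    for year, cites in by_year.items():
        ordered = sorted(cites)
        # Nearest-rank percentile (assumed method).
        rank = math.ceil(percentile * len(ordered) / 100)
        thresholds[year] = ordered[rank - 1]

    return [p["id"] for p in papers if p["citations"] >= thresholds[p["year"]]]


# Invented data: ten 2020 papers with 0..9 citations, two 2021 papers.
papers = [{"id": f"a{i}", "year": 2020, "citations": i} for i in range(10)]
papers += [
    {"id": "b0", "year": 2021, "citations": 100},
    {"id": "b1", "year": 2021, "citations": 1},
]
top = highly_cited(papers)
print(top)
```

Comparing each paper only with others from the same year avoids penalizing recent papers that have had less time to accumulate citations.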

Identifying publications in top conferences

The Publications: AI publications in top conferences field in the core table is computed using a subset of the MAC publications dataset: papers from top global AI conferences. For purposes of the Indicators dataset, these conferences are the AAAI Conference on Artificial Intelligence (AAAI), International Joint Conference on Artificial Intelligence (IJCAI), IEEE Conference on Computer Vision and Pattern Recognition (CVPR), European Conference on Computer Vision (ECCV), IEEE International Conference on Computer Vision (ICCV), International Conference on Machine Learning (ICML), International Conference on Knowledge Discovery and Data Mining (SIGKDD), Conference on Neural Information Processing Systems (NeurIPS), Annual Meeting of the Association for Computational Linguistics (ACL), North American Chapter of the Association for Computational Linguistics (NAACL), Conference on Empirical Methods in Natural Language Processing (EMNLP), International Conference on Research and Development in Information Retrieval (SIGIR), and the Web Conference (WWW).

Patents

Patent metrics are derived from 1790 Analytics, PATSTAT, and The Lens. These sources include hundreds of thousands of AI-related patent families, which are groups of patent documents filed in one or more jurisdictions that are related to the same invention. (The documents may include patent applications, which are requests pending at a country's patent office for the grant of a patent, and granted patents, which are approved requests awarding a property right for that invention. We exclude other types of patent documents, such as amendments or other administrative documents.) In the Indicators dataset, each patent family is counted as a single "patent." A patent family with at least one granted patent in it is counted as a "granted patent."
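
The family-level counting described above can be sketched like this. The record layout and the "kind" labels are hypothetical; real patent sources encode document types differently.

```python
# Document types kept in this illustration; others (e.g. amendments) are dropped.
COUNTED_KINDS = {"application", "grant"}


def count_patents(documents):
    """Count each patent family once; a family is granted if any document is a grant."""
    families = {}
    for doc in documents:
        if doc["kind"] not in COUNTED_KINDS:
            continue
        already_granted = families.get(doc["family_id"], False)
        families[doc["family_id"]] = already_granted or doc["kind"] == "grant"
    return {
        "patents": len(families),
        "granted_patents": sum(families.values()),
    }


docs = [
    {"family_id": "F1", "kind": "application"},
    {"family_id": "F1", "kind": "grant"},
    {"family_id": "F1", "kind": "application"},  # same invention, another jurisdiction
    {"family_id": "F2", "kind": "application"},
    {"family_id": "F3", "kind": "amendment"},    # excluded document type
]
summary = count_patents(docs)
print(summary)
```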

Most of the patent fields in the core table count AI-related patents. These patents are identified using a combination of keywords and patent classification codes, which are categories applied to individual patents by some patent offices.

We also use keywords and classification codes to link each patent to different AI applications (e.g., speech processing, computer vision) and use cases (e.g., life sciences, transportation). These categorizations are used to generate the relevant fields in the core table. (A patent can have more than one categorization: for example, a patent for a robot that recognizes and responds to spoken commands might be assigned to the robotics and speech processing applications.)

For more details, refer to CSET's paper on AI patents or visit CSET's Github repo for the method, which includes lists of the specific keywords and classification codes used. (The specific numbers and dates mentioned in the repo are derived from an earlier analysis and are now out of date, but the method and keywords described in the repo are the same ones used for this dataset.)

Workforce

The workforce-related values in the core table are derived from LinkedIn profile data compiled and processed by Revelio Labs. Using information from public LinkedIn profiles, Revelio identifies individuals employed at different organizations, including the organizations in the Indicators dataset (which we locate in the Revelio dataset using the company LinkedIn IDs compiled in id), and maps their specific jobs to one of 1,000 different roles defined in a standardized taxonomy.

The "Tech Team 1 workers" figures in core are meant to cover anyone with technical skills and a reasonable probability of working with AI. To identify Tech Team 1 roles, CSET researchers manually reviewed Revelio's taxonomy and identified 111 highly technical roles and responsibilities associated with the design, development, and deployment of AI, such as Data Center Operator, IT Project Manager, and Software Engineer. They also identified 26 additional roles that were less clearly technical, but still potentially AI-relevant. For purposes of the Indicators dataset, Tech Team 1 is defined to include individuals who have any of the 111 highly technical roles the CSET researchers identified, or any of the 26 less clearly technical roles plus a technical degree (in each case, according to the Revelio dataset).

👀

For more details on how CSET researchers identified Tech Team 1 roles, refer to Appendix A in Gehlhaus et al., The Race for U.S. Technical Talent (August 2023).

The "AI workers" figures in core cover a subset of Tech Team 1 meant to include anyone with a high probability of working with AI. Specifically, these figures include every worker with a Tech Team 1 role directly related to data science or technical research, plus (a) an undergraduate degree in computer science, computer engineering, or electrical engineering, if their specific position required an undergraduate degree, or (b) a PhD related to artificial intelligence, if their specific position required a PhD.
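
The two definitions can be read as nested filters, sketched below. The role sets are tiny stand-ins for the real lists of 111 and 26 roles, the "Hypothetical ..." role names are invented, and the treatment of positions that require neither an undergraduate degree nor a PhD (excluded here) is an assumption, since the definition does not address that case.

```python
# Stand-ins for the real role lists; only "Software Engineer" comes from the text.
HIGHLY_TECHNICAL_ROLES = {"Software Engineer", "Hypothetical Data Science Role"}
LESS_TECHNICAL_ROLES = {"Hypothetical Less Technical Role"}
DATA_SCIENCE_OR_RESEARCH_ROLES = {"Hypothetical Data Science Role"}

UNDERGRAD_FIELDS = {"computer science", "computer engineering", "electrical engineering"}


def is_tech_team_1(role, has_technical_degree):
    """Highly technical role, or a less clearly technical role plus a technical degree."""
    if role in HIGHLY_TECHNICAL_ROLES:
        return True
    return role in LESS_TECHNICAL_ROLES and has_technical_degree


def is_ai_worker(role, has_technical_degree, required_degree, degree_field):
    """Tech Team 1 member in a data science or research role with a matching degree."""
    if not is_tech_team_1(role, has_technical_degree):
        return False
    if role not in DATA_SCIENCE_OR_RESEARCH_ROLES:
        return False
    if required_degree == "undergraduate":
        return degree_field in UNDERGRAD_FIELDS
    if required_degree == "phd":
        # Simplification: a single label stands in for "a PhD related to AI".
        return degree_field == "artificial intelligence"
    return False  # assumption: other cases are not counted


print(is_ai_worker("Hypothetical Data Science Role", True, "undergraduate", "computer science"))
```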

CSET researchers manually designated roles in the Revelio taxonomy relevant to data science and technical research; the designations are available here. Other keywords and queries used in this process are available here.

Known limitations

The dataset doesn't cover all AI-related companies, and the sample it does cover isn't necessarily representative of all global activity in AI. We select companies for inclusion in the Indicators dataset based on ETO and CSET’s research priorities, as well as feedback from users. The companies currently included in the dataset represent a wide range of industries, sizes, countries and levels of AI activity. Some important groups of companies are covered in full (such as the S&P 500 and those companies typically considered part of "Big Tech"). Overall, though, the companies covered in PARAT represent only a slice of the global AI landscape. We plan to expand coverage over time; if there's a specific company or group of companies you'd like to see in the data, please let us know.
The dataset doesn't cover every important indicator of AI-related activity. The Indicators use publications, patents, and AI-related jobs to capture different dimensions of AI-related activity. There are many other potential ways to capture this activity: for example, revenue from AI-related products and services, use of AI-related software or computing capacity, survey data, or qualitative data from news reports and industry analyses. Users should be mindful of the limitations of the specific indicators in the dataset, and ideally combine them with other sources of data for a more holistic understanding.
The dataset is derived in part from publicly available data sources, which may not fully capture relevant activity. The Indicators can only capture what companies and their workers choose to make public, but companies don't always publish their AI research or patent their AI-related inventions, and not all of their workers have public profiles. That makes our publication, patent, and workforce metrics imperfect proxies for AI-related research, invention, and hiring.
Workforce metrics are incomplete. The workforce metrics in the dataset are derived from LinkedIn data. LinkedIn is less popular in many countries than it is in the United States - and is blocked in other countries, including China and Russia. (A local version of LinkedIn was available in China from 2021 to 2023; data from that period is included in the dataset.) The Indicators workforce data involving companies outside the U.S. should be used with extreme caution. More generally, LinkedIn data are self-reported by LinkedIn members, so there may be inaccuracies or inconsistencies in the data that affect the Indicators workforce metrics.
Patent and publication data have a lag. The lag is especially notable for patents. National patent offices often release information about companies' AI patent filings months or even years after the patents are filed. This makes our patent metrics incomplete, particularly for newer companies or companies that have become more active in AI in recent years. Similarly, our publication data sources run months or years behind in some cases. The lags in patent and publication data especially affect companies that are new or that have recently become more active in AI publishing or patenting; indicator values for these companies should be used with caution. (The LinkedIn data used to generate the workforce metrics in the dataset does not lag in the same way, so the workforce metrics can be considered roughly current.)

Maintenance

How are the data updated?

We update the Indicators data through an automated pipeline that retrieves data from CSET's merged corpora of scholarly literature, patents, and Revelio Labs workforce data and calculates our metrics. This pipeline normally runs monthly, with occasional pauses to resolve issues due to vendor data changes or failures of automated data quality checks.

The underlying data sources are updated on their own schedules, at intervals ranging from daily to every six months.

🔔

Workforce data are not currently updating as we implement changes to the underlying data feeds. The workforce data included in the current version of the Indicators dataset were last updated in late 2024. We expect to resume updates later this year; please feel free to contact us with any questions in the meantime.

Credits

Concept and design: Rebecca Gelles, Ngor Luong, Zach Arnold
Engineering: Jennifer Melot, Rebecca Gelles
Data collection and enrichment: Zach Arnold, Rebecca Gelles, Ngor Luong, Eileen Chen, Alina Thai, Kieran Lee, Lindsay Jenkins, Ari Filler, Maya Snyder, Julio Wang, Sonali Pednekar, Nikita Pardeshi, Tianyi Xu, Yudi Chen, Vinayak Kannan
Documentation: Zach Arnold, Rebecca Gelles, Jennifer Melot

Emerging technology topic classifications are based on work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.

Major change log

6/26/24

Initial release