ETO Logo
Documentation: Private-Sector AI Indicators

Overview

What is this dataset?

The Private-Sector AI Indicators dataset includes a diverse range of indicators of AI-related activity for hundreds of companies worldwide, from startups to multinationals. The dataset uses original metadata, models, and methods developed by ETO and CSET to transform information from a wide variety of data sources into AI-related research, patenting, and hiring metrics for every company covered in PARAT.

How do I get it?

The dataset's CSV files are available on Zenodo.

Which ETO products use it?

The Indicators dataset powers ETO's PARAT tool.

What are its sources?

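The AI activity metrics in the Indicators dataset are original to ETO and CSET. We use our own methods, models, and metadata to derive values for these metrics from a variety of underlying data sources. These sources include:

  • Publications: ETO's Merged Academic Corpus (MAC).
  • Patents: 1790 Analytics, PATSTAT, and The Lens.
  • Workforce: LinkedIn profile data compiled and processed by Revelio Labs.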

In addition to company descriptive metadata manually collected by ETO annotators, the Indicators dataset includes metadata sourced from the Crunchbase Open Data Map and PermID.

For more details on how we transform data from underlying sources into the Indicators AI activity metrics, see below.

What are its main limitations?

  • The dataset doesn't cover all AI-related companies, and the sample it does cover isn't necessarily representative of all global activity in AI. The companies currently included in the dataset represent a wide range of industries, sizes, countries and levels of AI activity, but still represent only a slice of the global AI landscape. Read more >>

  • The dataset doesn't cover every important indicator of AI-related activity. The Indicators use publications, patents, and AI-related jobs to capture different dimensions of AI-related activity. There are many other ways to capture this activity that are not included in the dataset. Read more >>

  • The dataset is derived in part from publicly available data sources, which may not fully capture relevant activity. Metrics on companies' publications, patents, and workers are imperfect proxies for AI-related research, invention, and hiring. Read more >>

  • Workforce metrics are incomplete. In particular, workforce data involving companies outside the U.S. should be used with extreme caution. Read more >>

  • Patent and publication data have a lag. Read more >>

What are the terms of use?

This dataset is subject to ETO's general terms of use. If you use it, please cite us.

Some descriptive metadata in the core and id tables is sourced from the Crunchbase Open Data Map and PermID, so additional terms applicable to those sources may apply to some uses of the metadata. If you are using descriptive metadata from those tables, consult the sources for additional terms.

How do I cite it?

Please cite the "Emerging Technology Observatory Private-Sector AI Indicators dataset," including the link.

If you use PARAT to access the data, you can cite that tool instead.

Structure and content

The Indicators dataset consists of four CSV tables: core, ticker, alias, and id.

core

This table includes company descriptive metadata and metrics related to AI publications, patents, and workers.

Column name | Type | Description
Name | text | The canonical name of the company.
ID | text | A unique numeric ID for the company.
PARAT link | URL | A link to the company's detail page in the PARAT tool.
Country | text | The country where the company's headquarters is located.
City | text | The city where the company's headquarters is located.
State/province | text | The state or province where the company's headquarters is located.
Website | URL | The company's homepage.
Groups | array | Thematic groups that include the company. Read more >>
Aggregated subsidiaries | text | Specific subsidiaries whose AI activity is "counted toward" the company in PARAT (in cases where PARAT's data sources report the subsidiary's data separately from the parent). This list is not necessarily exhaustive: PARAT's data sources may report data from other subsidiaries together with the parent's data. In other words, the fact that a specific subsidiary is not listed in this column does not mean that the subsidiary's data is excluded from the parent's metrics. Read more >>
Parent company ID | number | If Parent company name is populated, the unique numeric ID for the parent company in this dataset (corresponds to ID).
Region | text | The geographic region in which the company's headquarters is located (e.g., Asia, North America).
Stage | text | The company's stage of development. Read more >>
Sector | text | The company's "Primary Business Sector" (corresponding to "Business Sector" in the Thomson Reuters Business Classification) according to permid.org.
Description | text | A short description of the company.
Description source | text | The source of Description. Sources include Wikipedia pages, company websites, and the Crunchbase Open Data Map.
Description link | URL | The specific page where Description was located.
Description date | date | The date Description was retrieved at Description link.
Publications: AI publications | number | The total number of AI research publications that the company released publicly over the prior 10 years. Read more >>
Publications: Recent AI publication growth | number | Recent growth in the company's AI research publications, defined as the average percentage increase per year over the past three years of complete data. Read more >>
Publications: AI publication percentage | number | The percentage of the company's total research publications over the past 10 years (Publications: Total publications) that were AI publications (Publications: AI publications).
Publications: AI publications in top conferences | number | The number of research publications by authors from the company that were accepted to top AI conferences over the past 10 years. Read more >>
Publications: Citations to AI research | number | The number of citations to AI research publications released by the company over the prior 10 years. Read more >>
Publications: CV publications | number | The total number of computer vision research publications that the company released publicly over the prior 10 years. Read more >>
Publications: NLP publications | number | The total number of natural language processing research publications that the company released publicly over the prior 10 years. Read more >>
Publications: Robotics publications | number | The total number of robotics research publications that the company released publicly over the prior 10 years. Read more >>
Publications: Total publications | number | The total number of research publications (on any subject) the company released publicly over the prior 10 years. Read more >>
Patents: AI patents | number | The number of AI-related patents the company filed over the past 10 years. Read more >>
Patents: AI patents: recent growth | number | Recent growth in the company's AI patent filings, defined as the average percentage increase per year over the past three years of complete data. (Patent data has a significant lag; the past three years of complete patent data typically equates to between three and six years behind the present.) Read more >>
Patents: AI patent percentage | number | The percentage of the company's total filed patents over the past 10 years (Patents: Total patents) that were AI-related (Patents: AI patents). Read more >>
Patents: Granted AI patents | number | The number of AI-related patents granted to the company over the past 10 years. Read more >>
Patents: Total patents | number | The number of patents (on any topic) the company filed over the past 10 years. Read more >>
Patents: AI use cases: Agriculture | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Banking and finance | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Business | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Computing in government | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Document management and publishing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Education | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Energy | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Entertainment | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Industry and manufacturing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Life sciences | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Military | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Nanotechnology | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Networking | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Personal devices and computing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Physical sciences and engineering | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Security | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Semiconductors | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Telecommunications | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI use cases: Transportation | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more >>
Patents: AI applications and techniques: Analytics and algorithms | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Patents: AI applications and techniques: Computer vision | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Patents: AI applications and techniques: Control | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Patents: AI applications and techniques: Distributed AI | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Patents: AI applications and techniques: Knowledge representation | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Patents: AI applications and techniques: Language processing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Patents: AI applications and techniques: Measuring and testing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Patents: AI applications and techniques: Planning and scheduling | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Patents: AI applications and techniques: Robotics | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Patents: AI applications and techniques: Speech processing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more >>
Workforce: AI workers | number | The number of known AI workers employed by the company as of the most recent data update.
Workforce: Tech Tier 1 workers | number | The number of known Tech Tier 1 workers employed by the company as of the most recent data update.
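
The percentage and growth fields described above can be illustrated with a small sketch. This is one plausible reading of the definitions, not ETO's actual pipeline code: the function names are hypothetical, the growth calculation assumes "average percentage increase per year" means the mean of year-over-year percentage changes, and skipping years with a zero baseline is an assumption made here to avoid division by zero.

```python
def ai_percentage(ai_count, total_count):
    """Share of a company's output that is AI-related, as a percentage."""
    if total_count == 0:
        return None  # assumption: undefined when there is no output at all
    return 100.0 * ai_count / total_count


def average_annual_growth(yearly_counts):
    """Mean of year-over-year percentage changes.

    yearly_counts is ordered oldest to newest; four counts yield
    three year-over-year changes.
    """
    changes = []
    for previous, current in zip(yearly_counts, yearly_counts[1:]):
        if previous == 0:
            continue  # assumption: skip years with a zero baseline
        changes.append(100.0 * (current - previous) / previous)
    if not changes:
        return None
    return sum(changes) / len(changes)
```

For example, a company with 25 AI publications out of 200 total would have an AI publication percentage of 12.5 under this reading.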

ticker

This table includes companies' ticker symbols on major exchanges, if any.

Column name | Type | Description
Name | text | The name of the company. Maps to the same field in core.
ID | text | A unique numeric ID for the company. Maps to the same field in core.
Ticker | text | The ticker symbol for the company's shares on Exchange.
Exchange | text | The exchange on which the company's shares are traded.

alias

This table includes company aliases.

Column name | Type | Description
Name | text | The name of the company. Maps to the same field in core.
ID | text | A unique numeric ID for the company. Maps to the same field in core.
Alias | text | An alternate name used for the company.
Language | text | The language of Alias (two-letter ISO code).

id

This table includes unique company identifiers in various external datasets. We currently compile identifiers for LinkedIn, Crunchbase, ROR, and PermID, as well as general-purpose regular expressions for each company. Read more

Column name | Type | Description
Name | text | The name of the company. Maps to the same field in core.
ID | text | A unique numeric ID for the company. Maps to the same field in core.
Identifier | text | A unique identifier for the company (typically in the form of a string or URL) within the dataset indicated in Type.
Type | text | The dataset that uses Identifier to refer to the company. Read more >>
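
Because Name and ID in the ticker, alias, and id tables map to the same fields in core, the tables can be joined on ID. The sketch below shows one way to do that with Python's standard csv module. The rows are invented placeholders rather than real dataset values, and in practice the text would come from the CSV files downloaded from Zenodo.

```python
import csv
import io

# Invented sample rows for illustration only; real data comes from the
# core and ticker CSV files.
CORE_CSV = """Name,ID,Country
Example Corp,1,United States
Sample Labs,2,Canada
"""
TICKER_CSV = """Name,ID,Ticker,Exchange
Example Corp,1,EXMP,NYSE
Example Corp,1,EXM,LSE
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def tickers_by_company(core_rows, ticker_rows):
    """Map each company ID in core to its (Ticker, Exchange) pairs."""
    result = {row["ID"]: [] for row in core_rows}
    for row in ticker_rows:
        # Skip ticker rows whose ID does not appear in core.
        if row["ID"] in result:
            result[row["ID"]].append((row["Ticker"], row["Exchange"]))
    return result


joined = tickers_by_company(read_rows(CORE_CSV), read_rows(TICKER_CSV))
```

Companies without a listing (here, the invented "Sample Labs") end up with an empty list, which mirrors the "if any" wording in the ticker table description.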

Sources and methodology

Selecting and aggregating companies

We select companies for inclusion in the Indicators dataset based on ETO and CSET’s research priorities, as well as feedback from users. Not all AI-related companies are covered. We plan to expand coverage over time; if there's a specific company or group of companies you'd like to see in the data, please let us know.

As described below, we collect common aliases and identifiers for each company in order to reliably identify data associated with the company in diverse source datasets.

The dataset also includes parent-subsidiary mappings for many companies (see "Aggregated subsidiaries" in core). When a subsidiary is mapped to a parent, the subsidiary's publications, patents, etc. are attributed to the parent company. In other cases, underlying data sources already consolidate a subsidiary's data with its parent, or the queries we run on those sources collect parent and subsidiary data together by design. We believe the data of most significant subsidiary companies are ultimately "rolled up" to the parent level in one of these ways.
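
The roll-up of mapped subsidiaries can be sketched as follows. This is an illustration under stated assumptions, not the actual aggregation code: the company names, publication IDs, and the `parent_of` mapping are hypothetical, and counting each shared item once per parent (by taking a set union) is an assumption made here.

```python
def roll_up(items_by_company, parent_of):
    """Attribute each subsidiary's items to its parent and count them.

    items_by_company maps a company name to a set of item IDs
    (publications, patents, etc.). parent_of maps a subsidiary name to
    its parent name; companies absent from parent_of are treated as
    top-level.
    """
    rolled = {}
    for company, item_ids in items_by_company.items():
        top_level = parent_of.get(company, company)
        # Set union: an item shared by parent and subsidiary counts once.
        rolled.setdefault(top_level, set()).update(item_ids)
    return {company: len(ids) for company, ids in rolled.items()}


# Hypothetical data for illustration.
example_counts = roll_up(
    {"Parent Co": {"p1", "p2"}, "Subsidiary Co": {"p2", "p3"}},
    {"Subsidiary Co": "Parent Co"},
)
```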

Compiling company metadata

ETO annotators map each company to the unique identifiers included in the id table, craft regular expressions that capture the company's name and known aliases (also included in the id table), and manually collect further metadata for the core, ticker, and alias tables based on web searches. Additional metadata in the core and ticker tables comes from the Crunchbase Open Data Map and PermID.

Stages

Company stages are generated using Crunchbase Pro data and a simple algorithm. "Mature" companies must meet at least one of the following conditions: be publicly traded; have more than 1,000 employees; or have last raised funding in a late-stage venture capital or private equity round (e.g., Series D or post-IPO rounds). "Growth" and "Startup" companies are distinguished only by their most recent funding round: companies that last raised funding in a mid-stage round (i.e., Series A-C) are labeled "Growth"; those that last raised funding in an early-stage round (e.g., seed or angel) are labeled "Startup." Any privately held company with no funding round information and either fewer than 1,000 employees or no employee data is designated "Unknown."
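
The stage rules above can be expressed as a short decision function. This is a minimal sketch, not the production algorithm: the function name and the round labels are hypothetical, and the actual Crunchbase round categories may be named and grouped differently.

```python
from typing import Optional

# Hypothetical round labels; the labels in the source data may differ.
LATE_ROUNDS = {"series_d", "private_equity", "post_ipo"}
MID_ROUNDS = {"series_a", "series_b", "series_c"}
EARLY_ROUNDS = {"seed", "angel"}


def assign_stage(publicly_traded: bool,
                 employees: Optional[int],
                 last_round: Optional[str]) -> str:
    """Return "Mature", "Growth", "Startup", or "Unknown"."""
    # Any one of these three conditions is enough for "Mature".
    if publicly_traded:
        return "Mature"
    if employees is not None and employees > 1000:
        return "Mature"
    if last_round in LATE_ROUNDS:
        return "Mature"
    # Otherwise the most recent funding round decides.
    if last_round in MID_ROUNDS:
        return "Growth"
    if last_round in EARLY_ROUNDS:
        return "Startup"
    # Privately held, no usable funding data, and not a large employer.
    return "Unknown"
```

Note that the "Mature" checks run first, so a large or publicly traded company is labeled "Mature" regardless of its most recent funding round.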

Identifiers

The id table includes the following types of unique identifiers for each company, where available:

  • ROR: Research Organization Registry URLs corresponding to the company.
  • Crunchbase URL and Crunchbase UUID: Crunchbase URLs and UUIDs corresponding to the company.
  • LinkedIn URL: URLs for the company's LinkedIn profile(s).
  • PermID: The PermID associated with the company.
  • Regex: A regular expression devised by ETO to capture the company's name and known aliases. This is used to identify relevant data for companies that lack other unique identifiers in the source datasets.
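
To illustrate how a Regex identifier might be applied, the sketch below matches free-text organization names against a pattern. The company ("Acme") and the pattern are invented for this example; the real expressions are the ones published in the id table.

```python
import re

# Hypothetical pattern for a fictional company and two of its aliases.
ACME_PATTERN = re.compile(r"\bacme\s+(?:robotics|ai)\b", re.IGNORECASE)


def matches_company(organization_name):
    """Return True if the free-text name matches the company's pattern."""
    return ACME_PATTERN.search(organization_name) is not None
```

The word boundaries keep the pattern from matching unrelated names that merely start with the same letters, such as "Acme Airlines".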

Groups

Companies in the dataset may be assigned to one or more groups. These groups include:

  • S&P 500: S&P 500 companies as of May 2024.
  • Global Big Tech: Major multinational technology companies selected by ETO analysts.
  • GenAI Contenders: Companies especially active in generative AI and large language model technologies, as identified by ETO analysts.

Publications

Publications data in the Indicators dataset comes from ETO's Merged Academic Corpus (MAC), which contains detailed information on over 260 million scholarly articles from around the world. MAC publications are linked to companies using authors' stated institutional affiliations. For more information, refer to the MAC documentation.

If a publication is linked to multiple companies (or other organizations), the Indicators dataset gives "full credit" to each organization. For example, a publication with two authors from Microsoft, one author from the University of Texas, and one author from Pfizer would be counted in the Indicators dataset as one paper for Microsoft and one paper for Pfizer.
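
The "full credit" rule can be sketched as follows, using the example above. The function name and input layout (one list of author affiliations per paper) are assumptions for illustration.

```python
from collections import Counter


def count_publications(papers):
    """Count papers per organization, giving each organization full credit.

    Each paper is a list with one affiliation per author. An organization
    with several authors on the same paper is still credited once.
    """
    counts = Counter()
    for author_affiliations in papers:
        for organization in set(author_affiliations):
            counts[organization] += 1
    return counts


# Two Microsoft authors, one University of Texas author, one Pfizer author.
paper = ["Microsoft", "Microsoft", "University of Texas", "Pfizer"]
counts = count_publications([paper])
```

Here both Microsoft and Pfizer are credited with exactly one paper, even though Microsoft contributed two authors.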

The MAC uses machine learning models to identify publications related to AI, natural language processing, robotics, and computer vision. We use those publications to compute the publication-related fields in the core table.

Identifying publications in top conferences

The Publications: AI publications in top conferences field in the core table is computed using a subset of the MAC publications dataset: papers from top global AI conferences. For purposes of the Indicators dataset, these conferences are the AAAI Conference on Artificial Intelligence (AAAI), International Joint Conference on Artificial Intelligence (IJCAI), IEEE Conference on Computer Vision and Pattern Recognition (CVPR), European Conference on Computer Vision (ECCV), IEEE International Conference on Computer Vision (ICCV), International Conference on Machine Learning (ICML), International Conference on Knowledge Discovery and Data Mining (SIGKDD), Conference on Neural Information Processing Systems (NeurIPS), Annual Meeting of the Association for Computational Linguistics (ACL), North American Chapter of the Association for Computational Linguistics (NAACL), Conference on Empirical Methods in Natural Language Processing (EMNLP), International Conference on Research and Development in Information Retrieval (SIGIR), and the Web Conference (WWW).

Patents

Patent metrics are derived from 1790 Analytics, PATSTAT, and The Lens. These sources include hundreds of thousands of AI-related patent families, which are groups of patent documents filed in one or more jurisdictions that are related to the same invention. (The documents may include patent applications, which are requests pending at a country's patent office for the grant of a patent, and granted patents, which are approved requests awarding a property right for that invention. We exclude other types of patent documents, such as amendments or other administrative documents.) In the Indicators dataset, each patent family is counted as a single "patent." A patent family with at least one granted patent in it is counted as a "granted patent."
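
The family-level counting described above can be sketched like this. The input layout (pairs of family ID and document type) and the type labels "application" and "grant" are assumptions for illustration; the source datasets use their own schemas.

```python
from collections import defaultdict


def count_families(documents):
    """Return (number of patents, number of granted patents).

    documents is an iterable of (family_id, document_type) pairs. Each
    family counts as one patent; a family with at least one granted
    document counts as one granted patent.
    """
    types_by_family = defaultdict(set)
    for family_id, document_type in documents:
        types_by_family[family_id].add(document_type)
    total = len(types_by_family)
    granted = sum(1 for types in types_by_family.values() if "grant" in types)
    return total, granted


# Hypothetical documents: family F1 has a grant, family F2 does not.
example_documents = [
    ("F1", "application"),
    ("F1", "grant"),
    ("F1", "application"),
    ("F2", "application"),
]
family_counts = count_families(example_documents)
```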

Most of the patent fields in the core table count AI-related patents. These patents are identified using a combination of keywords and patent classification codes, which are categories applied to individual patents by some patent offices.

We also use keywords and classification codes to link each patent to different AI applications (e.g., speech processing, computer vision) and use cases (e.g., life sciences, transportation). These categorizations are used to generate the relevant fields in the core table. (A patent can have more than one categorization: for example, a patent for a robot that recognizes and responds to spoken commands might be assigned to the robotics and speech processing applications.)

For more details, refer to CSET's paper on AI patents or visit CSET's Github repo for the method, which includes lists of the specific keywords and classification codes used. (The specific numbers and dates mentioned in the repo are derived from an earlier analysis and are now out of date, but the method and keywords described in the repo are the same ones used for this dataset.)

Workforce

The workforce-related values in the core table are derived from LinkedIn profile data compiled and processed by Revelio Labs. Using information from public LinkedIn profiles, Revelio identifies individuals employed at different organizations, including the organizations in the Indicators dataset (which we locate in the Revelio dataset using the company LinkedIn IDs compiled in id), and maps their specific jobs to one of 1,000 different roles defined in a standardized taxonomy.

The "Tech Team 1 workers" figures in core are meant to cover anyone with technical skills and a reasonable probability of working with AI. To identify Tech Team 1 roles, CSET researchers manually reviewed Revelio's taxonomy and identified 111 highly technical roles and responsibilities associated with the design, development, and deployment of AI, such as Data Center Operator, IT Project Manager, and Software Engineer. They also identified 26 additional roles that were less clearly technical, but still potentially AI-relevant. For purposes of the Indicators dataset, Tech Team 1 is defined to include individuals who have any of the 111 highly technical roles the CSET researchers identified, or any of the 26 less clearly technical roles plus a technical degree (in each case, according to the Revelio dataset).

For more details on how CSET researchers identified Tech Team 1 roles, refer to Appendix A in Gehlhaus et al., The Race for U.S. Technical Talent (August 2023).
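
The Tech Team 1 membership rule can be sketched as a simple predicate. The role sets below are stand-ins: only three of the 111 highly technical roles are named in this documentation, and the 26 less clearly technical roles are not listed here, so a placeholder is used. The function name is hypothetical.

```python
# Three of the 111 highly technical roles named in this documentation.
HIGHLY_TECHNICAL_ROLES = {
    "Data Center Operator",
    "IT Project Manager",
    "Software Engineer",
}
# Placeholder for the 26 less clearly technical roles (not listed here).
LESS_TECHNICAL_ROLES = {"Hypothetical Less-Technical Role"}


def is_tech_team_1(role, has_technical_degree):
    """Apply the two-part Tech Team 1 rule to one worker."""
    if role in HIGHLY_TECHNICAL_ROLES:
        return True
    # Less clearly technical roles qualify only with a technical degree.
    return role in LESS_TECHNICAL_ROLES and has_technical_degree
```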

The "AI workers" figures in core cover a subset of Tech Team 1 meant to include anyone with a high probability of working with AI. Specifically, these figures include every worker with a Tech Team 1 role directly related to data science or technical research, plus (a) an undergraduate degree in computer science, computer engineering, or electrical engineering, if their specific position required an undergraduate degree, or (b) a PhD related to artificial intelligence, if their specific position required a PhD.

CSET researchers manually designated roles in the Revelio taxonomy relevant to data science and technical research; the designations are available here. Other keywords and queries used in this process are available here.

Known limitations

  • The dataset doesn't cover all AI-related companies, and the sample it does cover isn't necessarily representative of all global activity in AI. We select companies for inclusion in the Indicators dataset based on ETO and CSET’s research priorities, as well as feedback from users. The companies currently included in the dataset represent a wide range of industries, sizes, countries and levels of AI activity, but still represent only a slice of the global AI landscape. We plan to expand coverage over time; if there's a specific company or group of companies you'd like to see in the data, please let us know.

  • The dataset doesn't cover every important indicator of AI-related activity. The Indicators use publications, patents, and AI-related jobs to capture different dimensions of AI-related activity. There are many other potential ways to capture this activity - for example, revenue from AI-related products and services, use of AI-related software or computing capacity, survey data, or qualitative data from news reports and industry analyses. Users should be mindful of the limitations of the specific indicators in the dataset, and ideally combine them with other sources of data for a more holistic understanding.

  • The dataset is derived in part from publicly available data sources, which may not fully capture relevant activity. The Indicators can only capture what companies choose to make public - but companies don't always publish their AI research or patent their AI-related inventions, and not all of their workers have public profiles. That makes our publication, patent, and workforce metrics imperfect proxies for AI-related research, invention, and hiring.

  • Workforce metrics are incomplete. The workforce metrics in the dataset are derived from LinkedIn data. LinkedIn is less popular in many countries than it is in the United States - and is blocked in other countries, including China and Russia. (A local version of LinkedIn was available in China from 2021 to 2023; data from that period is included in the dataset.) The Indicators workforce data involving companies outside the U.S. should be used with extreme caution. More generally, LinkedIn data are self-reported by LinkedIn members, so there may be inaccuracies or inconsistencies in the data that affect the Indicators workforce metrics.

  • Patent and publication data have a lag. The lag is especially notable for patents: national patent offices often release information about companies' AI patent filings months or even years after the patents are filed. Similarly, our publication data sources run months or years behind in some cases. These lags make the metrics incomplete, especially for companies that are new or that have recently become more active in AI publishing or patenting; indicator values for these companies should be used with caution.

Maintenance

How are the data updated?

We update the Indicators data through an automated pipeline that retrieves data from CSET's merged corpora of scholarly literature, patents, and Revelio Labs workforce data and calculates our metrics. This pipeline normally runs monthly, with occasional pauses to resolve issues due to vendor data changes or failures of automated data quality checks.

The underlying data sources are updated on their own schedules, at intervals ranging from daily to every six months.

Credits

  • Concept and design: Rebecca Gelles, Ngor Luong, Zach Arnold
  • Engineering: Jennifer Melot, Rebecca Gelles
  • Data collection and enrichment: Zach Arnold, Rebecca Gelles, Ngor Luong, Eileen Chen, Alina Thai, Kieran Lee, Lindsay Jenkins, Ari Filler, Maya Snyder, Julio Wang, Sonali Pednekar, Nikita Pardeshi, Tianyi Xu, Yudi Chen, Vinayak Kannan
  • Documentation: Zach Arnold, Rebecca Gelles, Jennifer Melot

Major change log

6/26/24: Initial release