The Private-Sector AI Indicators dataset includes a diverse range of indicators of AI-related activity for hundreds of companies worldwide, from startups to multinationals. The dataset uses original metadata, models, and methods developed by ETO and CSET to transform information from a wide variety of data sources into AI-related research, patenting, and hiring metrics for every company covered in PARAT.
The dataset's CSV files are available on Zenodo.
The Indicators dataset powers ETO's PARAT tool.
The AI activity metrics in the Indicators dataset are original to ETO and CSET. We use our own methods, models and metadata to derive values for these metrics from a variety of underlying data sources. These sources include:
In addition to company descriptive metadata manually collected by ETO annotators, the Indicators dataset includes metadata sourced from the Crunchbase Open Data Map and PermID.
For more details on how we transform data from underlying sources into the Indicators AI activity metrics, see below.
The dataset doesn't cover all AI-related companies, and the sample it does cover isn't necessarily representative of all global activity in AI. The companies currently included in the dataset represent a wide range of industries, sizes, countries and levels of AI activity. Some important groups of companies are covered in full (such as the S&P 500 and those companies typically considered part of "Big Tech"). Overall, though, the companies covered in PARAT represent only a slice of the global AI landscape. Read more >>
The dataset doesn't cover every important indicator of AI-related activity. The Indicators use publications, patents, and AI-related jobs to capture different dimensions of AI-related activity. There are many other ways to capture this activity that are not included in the dataset. Read more >>
The dataset is derived in part from publicly available data sources, which may not fully capture relevant activity. Metrics on companies' publications, patents, and workforce are imperfect proxies for AI-related research, invention, and hiring. Read more >>
Workforce metrics are incomplete. In particular, workforce data involving companies outside the U.S. should be used with extreme caution. Read more >>
Patent and publication data have a lag. Read more >>
This dataset is subject to ETO's general terms of use. If you use it, please cite us.
Some descriptive metadata in the core and id table are sourced from Crunchbase Open Data Map and PermID, meaning that additional terms applicable to those sources may apply to some uses of the metadata. If you are using descriptive metadata from those tables, consult the sources for additional terms.
Please cite the "Emerging Technology Observatory Private-Sector AI Indicators dataset," including the link.
If you use PARAT to access the data, you can cite that tool instead.
The Indicators dataset consists of the CSV tables core, ticker, alias, id, and yearly_publication_counts.
This table includes company descriptive metadata and metrics related to AI publications, patents, and workers. All metrics are aggregated over the specified time periods. The yearly_publication_counts table includes corresponding disaggregated values for patents and publications (yearly counts for workforce metrics are unavailable).
Name | Type | Description |
---|---|---|
Name | text | The canonical name of the company. |
ID | text | A unique numeric ID for the company. |
PARAT link | URL | A link to the company's detail page in the PARAT tool. |
Country | text | The country where the company's headquarters is located. |
City | text | The city where the company's headquarters is located. |
State/province | text | The state or province where the company's headquarters is located. |
Website | URL | The company's homepage. |
Groups | array | Thematic groups that include the company. Read more >> |
Aggregated subsidiaries | text | Specific subsidiaries whose AI activity is "counted toward" the company in PARAT (in cases where PARAT's data sources report the subsidiary's data separately from the parent). This list is not necessarily exhaustive: PARAT's data sources may report data from other subsidiaries together with the parent's data. In other words, the fact that a specific subsidiary is not listed in this column does not mean that that subsidiary's data is not included in the parent's metrics. Read more >> |
Parent company ID | number | If Parent company name is populated, the unique numeric ID for the parent company in this dataset (corresponds to id). |
Region | text | The geographic region in which the company's headquarters is located (e.g., Asia, North America). |
Stage | text | The company's stage of development. Read more >> |
Sector | text | The company's "Primary Business Sector" (corresponding to "Business Sector" in the Thomson Reuters Business Classification) according to permid.org. |
Description | text | A short description of the company. |
Description source | text | The source of Description. Sources include Wikipedia pages, company websites, and the Crunchbase Open Data Map. |
Description link | URL | The specific page where Description was located. |
Description date | date | The date Description was retrieved at Description link. |
Publications: AI publications | number | The total number of AI research publications that the company released publicly over the prior 10 years. Read more |
Publications: Recent AI publication growth | number | Recent growth in the company's AI research publications, defined as the average percentage increase per year over the past three years of complete data. Read more |
Publications: AI publication percentage | number | The percentage of the company's total research publications over the past ten years (Publications: 10-year total publications) that were AI publications (Publications: AI publications). |
Publications: Highly cited AI publications | number | The total number of highly cited AI research publications that the company released publicly over the prior 10 years. Read more |
Publications: AI publications in top conferences | number | The number of research publications by authors from the company that were accepted to top AI conferences over the past ten years. Read more >> |
Publications: Citations to AI research | number | The number of citations to AI research publications released by the company over the prior 10 years. Read more |
Publications: CV publications | number | The total number of computer vision research publications that the company released publicly over the prior 10 years. Read more |
Publications: NLP publications | number | The total number of natural language processing research publications that the company released publicly over the prior 10 years. Read more |
Publications: Robotics publications | number | The total number of robotics research publications that the company released publicly over the prior 10 years. Read more |
Publications: Total publications | number | The total number of research publications (on any subject) the company released publicly over the prior 10 years. Read more |
Patents: AI patents | number | The number of AI-related patents the company filed over the past 10 years. Read more |
Patents: AI patents: recent growth | number | Recent growth in the company's AI patent filings, defined as the average percentage increase per year over the past three years of complete data. (Patent data has a significant lag; the past three years of complete patent data typically equates to between three and six years behind the present.) Read more |
Patents: AI patent percentage | number | The percentage of the company's total filed patents over the past ten years (Patents: 10-year total patents) that were AI-related (Patents: AI patents). Read more |
Patents: Granted AI patents | number | The number of AI-related patents granted to the company over the past 10 years. Read more |
Patents: Total patents | number | The number of patents (on any topic) the company filed over the past 10 years. Read more |
Patents: AI use cases: Agriculture | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Banking and finance | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Business | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Computing in government | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Document management and publishing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Education | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Energy | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Entertainment | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Industry and manufacturing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Life sciences | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Military | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Nanotechnology | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Networking | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Personal devices and computing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Physical sciences and engineering | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Security | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Semiconductors | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Telecommunications | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI use cases: Transportation | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed use case. Read more |
Patents: AI applications and techniques: Analytics and algorithms | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Patents: AI applications and techniques: Computer vision | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Patents: AI applications and techniques: Control | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Patents: AI applications and techniques: Distributed AI | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Patents: AI applications and techniques: Knowledge representation | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Patents: AI applications and techniques: Language processing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Patents: AI applications and techniques: Measuring and testing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Patents: AI applications and techniques: Planning and scheduling | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Patents: AI applications and techniques: Robotics | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Patents: AI applications and techniques: Speech processing | number | The number of AI-related patents the company filed over the past 10 years that were relevant to the listed application. Read more |
Workforce: AI workers | number | The number of known AI workers employed by the company as of the most recent data update. |
Workforce: Tech Team 1 workers | number | The number of known Tech Team 1 workers employed by the company as of the most recent data update. |
This table includes companies' ticker symbols on major exchanges, if any.
Name | Type | Description |
---|---|---|
Name | text | The name of the company. Maps to the same field in core. |
ID | text | A unique numeric ID for the company. Maps to the same field in core. |
Ticker | text | The ticker symbol for the company's shares on Exchange. |
Exchange | text | The exchange on which the company's shares are traded. |
This table includes company aliases.
Name | Type | Description |
---|---|---|
Name | text | The name of the company. Maps to the same field in core. |
ID | text | A unique numeric ID for the company. Maps to the same field in core. |
Alias | text | An alternate name used for the company. |
Language | text | The language of Alias (two-letter ISO code). |
This table includes unique company identifiers in various external datasets. We currently compile identifiers for LinkedIn, Crunchbase, ROR, and PermID, as well as general-purpose regular expressions for each company. Read more
Name | Type | Description |
---|---|---|
Name | text | The name of the company. Maps to the same field in core. |
ID | text | A unique numeric ID for the company. Maps to the same field in core. |
Identifier | text | A unique identifier for the company (typically in the form of a string or URL) within the dataset indicated in type. |
Type | text | The dataset that uses identifier to refer to the company. Read more >> |
This table includes disaggregated yearly counts of AI-related research publications and patents over the prior 10 years. The core table includes corresponding aggregate values. Not all metrics provided in core have disaggregated values available.
Name | Type | Description |
---|---|---|
Name | text | The name of the company. Maps to the same field in core. |
ID | text | A unique numeric ID for the company. Maps to the same field in core. |
PARAT link | URL | A link to the company's detail page in the PARAT tool. |
Dataset | text | Either "Publications" or "Patents." |
Category | text | The specific publication or patent metric counted in the row. Together with the Dataset field, this field corresponds to the similarly named field in core. For example, a row in yearly_publication_counts with a Dataset value of "Patents" and a Category value of "AI use cases: Business" contains a yearly value aggregated into the "Patents: AI use cases: Business" field in core. |
Year | number | The year for which data is included in the row. |
Value | number | The yearly count of research publications or patents, as applicable, for the row's company (Name field) and the specific type of publication or patent counted by the row (Dataset and Category fields). |
Complete | boolean | Indicates whether the row is from a year where we consider our data materially complete. If Complete is false, we consider the data from that year materially incomplete and you should use it with caution. |
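As a rough illustration of how the Dataset and Category fields map onto core field names, the sketch below sums yearly rows into a core-style total. The sample rows and values are made up, the helper names are hypothetical, and whether core totals include years flagged as incomplete is not stated here, so the `complete_only` switch is an assumption offered for convenience.

```python
from collections import defaultdict

# Hypothetical rows shaped like yearly_publication_counts (values are invented).
rows = [
    {"ID": "1", "Dataset": "Patents", "Category": "AI use cases: Business",
     "Year": 2021, "Value": 4, "Complete": True},
    {"ID": "1", "Dataset": "Patents", "Category": "AI use cases: Business",
     "Year": 2022, "Value": 6, "Complete": True},
    {"ID": "1", "Dataset": "Patents", "Category": "AI use cases: Business",
     "Year": 2023, "Value": 1, "Complete": False},
    {"ID": "1", "Dataset": "Publications", "Category": "AI publications",
     "Year": 2022, "Value": 10, "Complete": True},
]


def core_field_name(dataset, category):
    """Build the core column name that a yearly row feeds into."""
    return f"{dataset}: {category}"


def aggregate(rows, complete_only=False):
    """Sum yearly values per (company ID, core field name).

    complete_only=True skips rows whose Complete flag is false.
    """
    totals = defaultdict(int)
    for row in rows:
        if complete_only and not row["Complete"]:
            continue
        key = (row["ID"], core_field_name(row["Dataset"], row["Category"]))
        totals[key] += row["Value"]
    return dict(totals)


all_years = aggregate(rows)
complete_years = aggregate(rows, complete_only=True)
```

Filtering on Complete before comparing companies or computing trends is one way to follow the caution noted for incomplete years.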
We select companies for inclusion in the Indicators dataset based on ETO and CSET’s research priorities, as well as feedback from users. Not all AI-related companies are covered. We plan to expand coverage over time; if there's a specific company or group of companies you'd like to see in the data, please let us know.
As described below, we collect common aliases and identifiers for each company in order to reliably identify data associated with the company in diverse source datasets.
The dataset also includes parent-subsidiary mappings for many companies (see "Aggregated subsidiaries" in core). When a subsidiary is mapped to a parent, the subsidiary's publications, patents, etc. are attributed to the parent company. In other cases, underlying data sources already consolidate a subsidiary's data with its parent, or the queries we run on those sources collect parent and subsidiary data together by design. We believe the data of most significant subsidiary companies are ultimately "rolled up" to the parent level in one of these ways.
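A minimal sketch of the roll-up idea, assuming a single-level subsidiary-to-parent mapping. The company names, counts, and the `roll_up` helper are hypothetical and are not taken from the actual pipeline.

```python
from collections import Counter


def roll_up(counts, parent_of):
    """Attribute each subsidiary's count to its mapped parent.

    counts: {company: count}; parent_of: {subsidiary: parent}.
    Companies without a mapping keep their own counts.
    Assumes one level of nesting (no subsidiaries of subsidiaries).
    """
    rolled = Counter()
    for company, n in counts.items():
        rolled[parent_of.get(company, company)] += n
    return dict(rolled)


# Invented example data.
publication_counts = {"ParentCo": 10, "SubCo": 3, "OtherCo": 5}
parent_of = {"SubCo": "ParentCo"}
rolled = roll_up(publication_counts, parent_of)
```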
ETO annotators map each company to the unique identifiers included in the id table, craft regular expressions that capture the company's name and known aliases (also included in the id table), and manually collect further metadata for the core, ticker, and alias tables based on web searches. Additional metadata in the core and ticker tables comes from the Crunchbase Open Data Map and PermID.
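To give a sense of how a name-matching regular expression might be applied to free-text strings such as author affiliations, here is a small sketch. The company ID, the pattern, and the `match_company` helper are all invented for illustration; the real expressions are the ones published in the id table.

```python
import re

# Hypothetical company ID mapped to a hypothetical name pattern.
COMPANY_PATTERNS = {
    "1001": re.compile(r"\bexample\s*labs?\b", re.IGNORECASE),
}


def match_company(text):
    """Return the IDs of all companies whose pattern appears in the text."""
    return [cid for cid, pattern in COMPANY_PATTERNS.items() if pattern.search(text)]


hit = match_company("Example Labs, Research Division")
variant = match_company("ExampleLab Inc.")
miss = match_company("Counterexample Labs")
```

The word-boundary anchors (`\b`) keep the pattern from matching inside a longer, unrelated name.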
Company stages are generated using Crunchbase Pro data and a simple algorithm. "Mature" companies must meet any of the following conditions: be publicly traded; have more than 1,000 employees; or have last raised funding in late-stage venture capital or private equity rounds (e.g., Series D or post-IPO rounds). "Growth" and "Startup" companies are distinguished only by their last raised funding rounds: companies that last raised funding in mid-stage rounds (i.e., Series A-C) are labeled "Growth"; those that last raised funding in early-stage rounds (e.g., seed or angel) are labeled "Startup." Any privately held company with no information on funding rounds and either fewer than 1,000 employees or no employee data is designated "Unknown."
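The rules above can be sketched as a short function. This is an illustration only: the funding-round labels are hypothetical stand-ins for however Crunchbase encodes rounds, and cases the text does not address (for example, a private company with exactly 1,000 employees and no funding data) simply fall through to "Unknown" here.

```python
from typing import Optional

# Hypothetical round labels; the real Crunchbase encoding may differ.
LATE_STAGE = {"series_d", "private_equity", "post_ipo"}
MID_STAGE = {"series_a", "series_b", "series_c"}
EARLY_STAGE = {"seed", "angel"}


def classify_stage(publicly_traded: bool,
                   employees: Optional[int],
                   last_round: Optional[str]) -> str:
    """Assign a stage label following the rules described above."""
    # "Mature": any one of the three conditions is enough.
    if (publicly_traded
            or (employees is not None and employees > 1000)
            or last_round in LATE_STAGE):
        return "Mature"
    # "Growth" vs. "Startup" depends only on the last funding round.
    if last_round in MID_STAGE:
        return "Growth"
    if last_round in EARLY_STAGE:
        return "Startup"
    # No usable funding information and not large enough to be "Mature".
    return "Unknown"
```

Note that the "Mature" check runs first, so a company with more than 1,000 employees is labeled "Mature" even if its last round was a seed round.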
The id table includes the following types of unique identifiers for each company, where available:
Companies in the dataset may be assigned to one or more groups. These groups include:
Publications data in the Indicators dataset comes from ETO's Merged Academic Corpus (MAC), which contains detailed information on over 260 million scholarly articles from around the world. MAC publications are linked to companies using authors' stated institutional affiliations. For more information, refer to the MAC documentation.
If a publication is linked to multiple companies (or other organizations), the Indicators dataset gives "full credit" to each organization. For example, a publication with two authors from Microsoft, one author from the University of Texas, and one author from Pfizer would be counted in the Indicators dataset as one paper for Microsoft and one paper for Pfizer.
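The "full credit" rule can be sketched as follows, assuming each publication is represented as a list of its authors' affiliated organizations (a simplification of the MAC data; the function name is hypothetical).

```python
from collections import Counter


def credit_publications(publications):
    """Count each publication once per distinct affiliated organization.

    publications: list of lists, one organization name per author.
    """
    counts = Counter()
    for affiliations in publications:
        # set() ensures two authors from the same organization
        # still yield only one credited paper for that organization.
        for org in set(affiliations):
            counts[org] += 1
    return counts


# The example from the text: two Microsoft authors, one from the
# University of Texas, one from Pfizer.
paper = ["Microsoft", "Microsoft", "University of Texas", "Pfizer"]
counts = credit_publications([paper])
```

Here Microsoft and Pfizer each receive one paper. The university is also counted by this helper, but only companies appear in the Indicators dataset.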
The MAC uses machine learning models to identify publications related to AI, natural language processing, robotics, and computer vision. We use those publications to compute the publication-related fields in the core table.
For PARAT purposes, "highly cited" AI publications are those that have a citation count in the 90th percentile or greater of AI-related publications published in the same year.
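One way to picture the "highly cited" rule is the sketch below. It assumes a nearest-rank percentile computed within each publication year; the exact percentile method used in the pipeline is not specified here, and the sample papers and citation counts are invented.

```python
from collections import defaultdict


def highly_cited(pubs):
    """Return IDs of publications at or above the 90th percentile of
    citation counts among publications from the same year.

    Uses the nearest-rank percentile definition (an assumption).
    """
    by_year = defaultdict(list)
    for p in pubs:
        by_year[p["year"]].append(p["citations"])

    thresholds = {}
    for year, cites in by_year.items():
        ordered = sorted(cites)
        rank = -(-9 * len(ordered) // 10)  # ceil(0.9 * n) in integer math
        thresholds[year] = ordered[rank - 1]

    return [p["id"] for p in pubs if p["citations"] >= thresholds[p["year"]]]


# Invented AI publications: ten from 2020, two from 2021.
pubs = [{"id": f"p{i}", "year": 2020, "citations": c}
        for i, c in enumerate([0, 1, 2, 3, 4, 5, 6, 7, 8, 100])]
pubs += [{"id": "q0", "year": 2021, "citations": 1},
         {"id": "q1", "year": 2021, "citations": 50}]

top = highly_cited(pubs)
```

Because the threshold is computed per year, a recent paper with relatively few citations can still qualify if it outperforms its same-year peers.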
The Publications: AI publications in top conferences field in the core table is computed using a subset of the MAC publications dataset: papers from top global AI conferences. For purposes of the Indicators dataset, these conferences are the AAAI Conference on Artificial Intelligence (AAAI), International Joint Conference on Artificial Intelligence (IJCAI), IEEE Conference on Computer Vision and Pattern Recognition (CVPR), European Conference on Computer Vision (ECCV), IEEE International Conference on Computer Vision (ICCV), International Conference on Machine Learning (ICML), International Conference on Knowledge Discovery and Data Mining (SIGKDD), Conference on Neural Information Processing Systems (NeurIPS), Annual Meeting of the Association for Computational Linguistics (ACL), North American Chapter of the Association for Computational Linguistics (NAACL), Conference on Empirical Methods in Natural Language Processing (EMNLP), International Conference on Research and Development in Information Retrieval (SIGIR), and the Web Conference (WWW).
Patent metrics are derived from 1790 Analytics, PATSTAT, and The Lens. These sources include hundreds of thousands of AI-related patent families, which are groups of patent documents filed in one or more jurisdictions that are related to the same invention. (The documents may include patent applications, which are requests pending at a country's patent office for the grant of a patent, and granted patents, which are approved requests awarding a property right for that invention. We exclude other types of patent documents, such as amendments or other administrative documents.) In the Indicators dataset, each patent family is counted as a single "patent." A patent family with at least one granted patent in it is counted as a "granted patent."
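The family-level counting described above can be sketched like this. The document records, field names, and kind labels are hypothetical; real patent data uses more detailed document-kind codes.

```python
def count_patent_families(documents):
    """Return (patents, granted_patents) counted at the family level.

    Each family counts as one "patent"; a family with at least one
    granted document counts as one "granted patent". Documents that
    are neither applications nor grants are ignored.
    """
    families = {}  # family_id -> True if any document in the family is granted
    for doc in documents:
        if doc["kind"] not in {"application", "granted"}:
            continue
        already_granted = families.get(doc["family_id"], False)
        families[doc["family_id"]] = already_granted or doc["kind"] == "granted"
    return len(families), sum(families.values())


# Invented example: three families across several jurisdictions.
documents = [
    {"family_id": "F1", "jurisdiction": "US", "kind": "application"},
    {"family_id": "F1", "jurisdiction": "EP", "kind": "granted"},
    {"family_id": "F2", "jurisdiction": "US", "kind": "application"},
    {"family_id": "F3", "jurisdiction": "US", "kind": "amendment"},
]
patents, granted = count_patent_families(documents)
```

Family F1 is counted once even though it has documents in two jurisdictions, and F3 is excluded because it contains only an administrative document.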
Most of the patent fields in the core table count AI-related patents. These patents are identified using a combination of keywords and patent classification codes, which are categories applied to individual patents by some patent offices.
We also use keywords and classification codes to link each patent to different AI applications (e.g., speech processing, computer vision) and use cases (e.g., life sciences, transportation). These categorizations are used to generate the relevant fields in the core table. (A patent can have more than one categorization: for example, a patent for a robot that recognizes and responds to spoken commands might be assigned to the robotics and speech processing applications.)
For more details, refer to CSET's paper on AI patents or visit CSET's Github repo for the method, which includes lists of the specific keywords and classification codes used. (The specific numbers and dates mentioned in the repo are derived from an earlier analysis and are now out of date, but the method and keywords described in the repo are the same ones used for this dataset.)
The workforce-related values in the core table are derived from LinkedIn profile data compiled and processed by Revelio Labs. Using information from public LinkedIn profiles, Revelio identifies individuals employed at different organizations, including the organizations in the Indicators dataset (which we locate in the Revelio dataset using the company LinkedIn IDs compiled in id), and maps their specific jobs to one of 1,000 different roles defined in a standardized taxonomy.
The "Tech Team 1 workers" figures in core are meant to cover anyone with technical skills and a reasonable probability of working with AI. To identify Tech Team 1 roles, CSET researchers manually reviewed Revelio's taxonomy and identified 111 highly technical roles and responsibilities associated with the design, development, and deployment of AI, such as Data Center Operator, IT Project Manager, and Software Engineer. They also identified 26 additional roles that were less clearly technical, but still potentially AI-relevant. For purposes of the Indicators dataset, Tech Team 1 is defined to include individuals who have any of the 111 highly technical roles the CSET researchers identified, or any of the 26 less clearly technical roles plus a technical degree (in each case, according to the Revelio dataset).
For more details on how CSET researchers identified Tech Team 1 roles, refer to Appendix A in Gehlhaus et al., The Race for U.S. Technical Talent (August 2023).
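The Tech Team 1 membership rule reduces to a simple predicate, sketched below. Only the three highly technical roles named above come from the text; the second set uses a placeholder because the 26 less clearly technical roles are not listed here, and how a "technical degree" is determined is left as a boolean input.

```python
# Three of the 111 highly technical roles (the ones named in the text).
HIGHLY_TECHNICAL_ROLES = {
    "Data Center Operator",
    "IT Project Manager",
    "Software Engineer",
}

# Placeholder standing in for the 26 less clearly technical roles.
POTENTIALLY_RELEVANT_ROLES = {"Hypothetical Role A"}


def is_tech_team_1(role, has_technical_degree):
    """Apply the Tech Team 1 rule: a highly technical role qualifies on
    its own; a less clearly technical role also needs a technical degree."""
    if role in HIGHLY_TECHNICAL_ROLES:
        return True
    return role in POTENTIALLY_RELEVANT_ROLES and has_technical_degree
```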
The "AI workers" figures in core cover a subset of Tech Team 1 meant to include anyone with a high probability of working with AI. Specifically, these figures include every worker with a Tech Team 1 role directly related to data science or technical research, plus (a) an undergraduate degree in computer science, computer engineering, or electrical engineering, if their specific position required an undergraduate degree, or (b) a PhD related to artificial intelligence, if their specific position required a PhD.
CSET researchers manually designated roles in the Revelio taxonomy relevant to data science and technical research; the designations are available here. Other keywords and queries used in this process are available here.
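A loose sketch of the "AI workers" criteria follows. The record layout, field names, and the encoding of a position's required degree are all hypothetical, and the sketch treats positions with any other degree requirement as not qualifying, which is an assumption rather than something stated above.

```python
from dataclasses import dataclass
from typing import Optional

QUALIFYING_UNDERGRAD_FIELDS = {
    "computer science",
    "computer engineering",
    "electrical engineering",
}


@dataclass
class Worker:
    in_tech_team_1: bool
    role_is_ds_or_research: bool      # role designated as data science / technical research
    required_degree: Optional[str]    # "undergraduate" or "phd" (hypothetical encoding)
    undergrad_field: Optional[str]
    phd_is_ai_related: bool


def is_ai_worker(w: Worker) -> bool:
    """Apply the AI worker criteria to one (hypothetical) worker record."""
    if not (w.in_tech_team_1 and w.role_is_ds_or_research):
        return False
    if w.required_degree == "undergraduate":
        return w.undergrad_field in QUALIFYING_UNDERGRAD_FIELDS
    if w.required_degree == "phd":
        return w.phd_is_ai_related
    return False  # assumption: other cases do not qualify


data_scientist = Worker(True, True, "undergraduate", "computer science", False)
researcher = Worker(True, True, "phd", "physics", True)
engineer = Worker(True, False, "undergraduate", "computer science", False)
```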
The dataset doesn't cover all AI-related companies, and the sample it does cover isn't necessarily representative of all global activity in AI. We select companies for inclusion in the Indicators dataset based on ETO and CSET’s research priorities, as well as feedback from users. The companies currently included in the dataset represent a wide range of industries, sizes, countries and levels of AI activity. Some important groups of companies are covered in full (such as the S&P 500 and those companies typically considered part of "Big Tech"). Overall, though, the companies covered in PARAT represent only a slice of the global AI landscape. We plan to expand coverage over time; if there's a specific company or group of companies you'd like to see in the data, please let us know.
The dataset doesn't cover every important indicator of AI-related activity. The Indicators use publications, patents, and AI-related workforce data to capture different dimensions of AI-related activity. There are many other potential ways to capture this activity - for example, revenue from AI-related products and services, use of AI-related software or computing capacity, survey data, or qualitative data from news reports and industry analyses. Users should be mindful of the limitations of the specific indicators in the dataset, and ideally combine them with other sources of data for a more holistic understanding.
The dataset is derived in part from publicly available data sources, which may not fully capture relevant activity. The Indicators can only capture what companies choose to make public - but companies don't always publish their AI research or patent their AI-related inventions, and not all of their workers have public profiles. That makes our publication, patent, and workforce metrics imperfect proxies for AI-related research, invention, and hiring.
Workforce metrics are incomplete. The workforce metrics in the dataset are derived from LinkedIn data. LinkedIn is less popular in many countries than it is in the United States - and is blocked in other countries, including China and Russia. (A local version of LinkedIn was available in China from 2021 to 2023; data from that period is included in the dataset.) The Indicators workforce data involving companies outside the U.S. should be used with extreme caution. More generally, LinkedIn data are self-reported by LinkedIn members, so there may be inaccuracies or inconsistencies in the data that affect the Indicators workforce metrics.
Patent and publication data have a lag. The lag is especially notable for patents. National patent offices often release information about companies' AI patent filings months or even years after the patents are filed. This makes our patent metrics incomplete, particularly for newer companies or companies that have become more active in AI in recent years. Similarly, our publication data sources run months or years behind in some cases. The lags in patent and publication data especially affect companies that are new or that have recently become more active in AI publishing or patenting; indicator values for these companies should be used with caution. (The LinkedIn data used to generate the workforce metrics in the dataset does not lag in the same way, so the workforce metrics can be considered roughly current.)
We update the Indicators data through an automated pipeline that retrieves data from CSET's merged corpora of scholarly literature, patents, and Revelio Labs workforce data and calculates our metrics. This pipeline normally runs monthly, with occasional pauses to resolve issues due to vendor data changes or failures of automated data quality checks.
The underlying data sources are updated on their own schedules, at intervals ranging from daily to every six months.
6/26/24 | Initial release |