Documentation: ETO Research Almanac

Overview

What is this tool?

ETO's Research Almanac provides high-level data on trends in English-language emerging technology research, including overall research output, growth, and trends among countries, research organizations, and companies active in R&D. The initial version of the Almanac focuses on topics and applications in artificial intelligence.

What can I use it for?

Use the Almanac to:

  • Track trends in research across different fields and subfields.
  • Understand how different countries contribute to research in each field and how their contributions are changing over time.
  • Explore the impact of research on patents and the broader scientific literature.
  • View the most active research organizations and companies conducting research in different fields.
  • Analyze trends within highly cited research only, and see how they compare to trends among all research in a given emerging tech field.

View details and examples >>

What are its most important limitations?

  • Our methods for linking articles to topics involve judgment calls. The data on each topic in the Almanac is based on the articles we determine are relevant to that topic using different classification methods, including machine learning models and regular expression-based queries (depending on the topic; see the Merged Academic Corpus documentation for details). These methods involve some judgment calls. We try to apply intuitive, tested criteria when making these calls, but others might disagree with us about what counts as "AI safety research" or "robotics research" (for example). In addition, we rely on statistical models to apply the topic tags. For both reasons, the numbers in the Almanac are necessarily imprecise and should be interpreted as estimates.
  • We err on the side of inclusion when assigning articles to academic research subjects, meaning subject-specific counts in the Almanac (e.g., the total number of biology or chemistry articles) may be higher than expected. In most cases, we "count" a research article toward a particular Almanac subject if that subject is among the article's three highest-scoring subjects, according to the Merged Academic Corpus subject models. (Totals for AI and its subfields use a different method based on different models.) This is a liberal standard. It also means that each article counts toward three different Almanac subjects (in other words, different Almanac subjects can include the same article). Both of these factors tend to increase the Almanac's subject-specific totals.
  • The Almanac only covers English-language research. Other than English, Chinese is the most prevalent language in academic publishing. Unfortunately, due to Chinese policies, it has become difficult for us to access Chinese-language publication data. As a result, the Almanac's data source, ETO's Merged Academic Corpus, contains relatively few Chinese-language publications, so we decided to limit the Almanac to English-language articles, defined as articles with English titles or abstracts. This excludes a significant amount of global research from the Almanac, and means that Almanac facts and figures related to China should be interpreted especially carefully.
  • The Almanac doesn't cover non-public research. The Almanac only covers research that has been publicly released and is included in the Merged Academic Corpus. We believe the MAC covers a significant fraction of publicly released research worldwide (and in particular, the large majority of published research in English), but a great deal of research is never made public. In particular, military research and commercial research may never be written up in open sources, or even written up at all. We have no way of knowing how much of this "hidden" research exists, or how the facts and figures in the Almanac would change if the "hidden" research were somehow included.
  • Some articles lack information about author nationality and/or organizational affiliation, affecting the estimates in the Almanac. Articles that lack this information are counted in the Almanac's overall estimates of research output in each field, but may not be counted for purposes of the sections related to countries and organizations.
  • The Almanac's data sources may introduce errors. ETO's Merged Academic Corpus incorporates articles from many different datasets, making it a uniquely comprehensive dataset on worldwide research. No dataset is perfect, though. Errors, gaps, and judgment calls in the MAC's underlying datasets are likely to make it into the MAC itself, and the MAC's merging and enrichment processes may introduce errors. Any such errors affect the numbers in the Almanac. Read more >>
  • Research is only one indicator of emerging technology activity. The Almanac focuses on this single, incomplete indicator, so it shouldn't be used on its own to draw general conclusions about which countries or organizations are "ahead" or "behind" in different emerging technology areas. Read more >>

What are its sources?

The Almanac relies on ETO's Merged Academic Corpus. Read more >>

Does it contain sensitive information, such as personally identifiable information?

No.

What are the terms of use?

The Almanac tool and metrics are subject to ETO's general terms of use. If you use the tool, please cite us. The underlying datasets are not publicly available due to licensing restrictions.

How do I cite it?

If you use data from the Almanac in your work, please cite the "Emerging Technology Observatory Research Almanac" and include the link to the tool.

Using the Almanac

How do I use it?

These instructions focus on the desktop version of the tool. Some features may be missing or act differently on mobile devices.

The Almanac is a simple dashboard-style tool. Start by choosing a research topic:

Selecting a research topic from the Almanac main menu.

Each topic page has several sections. Scroll down to browse through them, or click a link in the left-hand navigation menu to jump to a specific section.

Scrolling through an Almanac topic page and using the navigation menu.

When viewing line charts, click on an item in the chart legend to hide (or redisplay) the corresponding line.

Hiding and redisplaying a line in an Almanac line chart.

The "top-cited" toggle

Some sections have a "top-cited" toggle. Click the toggle to display data on the 10% of articles in each year with the most citations, rather than all articles. (Articles are assigned to years based on their date of publication.)

Triggering the "top-cited" toggle.
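
The selection behind the toggle can be illustrated with a short sketch. This is not ETO's implementation: the field names (`year`, `citations`), the rounding rule, and the handling of ties are assumptions made only for illustration, since this documentation does not specify them.

```python
from collections import defaultdict


def top_cited_by_year(articles, percent=10):
    """Return, for each publication year, the most-cited `percent`% of articles.

    `articles` is a list of dicts with hypothetical "year" and "citations" keys.
    """
    by_year = defaultdict(list)
    for article in articles:
        by_year[article["year"]].append(article)

    top = {}
    for year, group in by_year.items():
        # Assumption: round the cutoff up (integer ceiling division), so every
        # year with at least one article keeps at least one article.
        keep = (len(group) * percent + 99) // 100
        # sorted() is stable, so articles with equal citation counts keep
        # their input order (another assumption about tie-breaking).
        ranked = sorted(group, key=lambda a: a["citations"], reverse=True)
        top[year] = ranked[:keep]
    return top
```

Because the cutoff is applied within each publication year, an older article is compared only with other articles from the same year, not with the full five-year window.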

Coming back to a view

When you click on a link in the left-hand navigation menu, your browser's address bar will update to reflect the section you're viewing. Copy the URL in order to return to the same view later:

Returning to a view in the Almanac using the address bar.

What can I use it for?

Track trends in research across different emerging technology fields and subfields.


Understand how different countries contribute to English-language research in each field and how their contributions are changing over time.

Explore the impact of research on patents and the broader scientific literature.

View the top research organizations and companies conducting English-language research in different fields.

Analyze trends within highly cited research only, and see how they compare to trends among all research in a given emerging tech field.

The Almanac is less suited to the following uses:

Analyzing specific subtopics, research organizations, or companies in detail. The Almanac provides high-level information on a limited number of emerging tech fields and actors. For more detailed analysis, we recommend using the Map of Science or Country Activity Tracker.

Determining which countries or organizations are "ahead" or "behind" overall in emerging tech. Emerging technology activity is multifaceted. Public research output is one of these facets, but there are many others, from commercializing and deploying new technologies to building organizations that can effectively use and govern them. Understanding how countries, research organizations, and companies compare and compete in emerging tech requires understanding all of these facets. Because the Almanac focuses on a single indicator, it shouldn't be used on its own to draw conclusions about the overall state of play.

Drawing definitive conclusions about China's overall research output. The Almanac only covers English-language articles (defined as articles with English titles or abstracts), but Chinese authors and organizations often publish in Chinese. A full accounting of China's research output would need to consider these Chinese-language articles as well as English-language articles.

Sources and methodology

Linking articles to subjects

The Almanac relies on article-level metadata from ETO's Merged Academic Corpus (MAC). When you visit a topic's Almanac page, you'll see facts and figures derived from English-language articles from a recent five-year period that are tagged with that topic. To learn more about how we apply topic tags to MAC articles, visit the MAC documentation.

We tag an article as an "AI + [academic subject]" article (for example, AI + pharmacology or AI + genetics) if it is classified as an AI article and is also assigned the academic subject in question.

It's important to note that emerging research topics have fuzzy boundaries; there's no objectively correct answer to whether a particular article is "AI safety" research (for example). For each emerging topic, we try to capture articles in the MAC that subject matter experts would consider highly relevant to the topic in question. This inevitably involves some judgment calls. In addition, we rely on statistical models to apply the topic tags. For both reasons, the numbers in the Almanac are necessarily imprecise and should be interpreted as estimates. Read more >>

Another caveat: we err on the side of inclusion when assigning articles to research subjects, meaning subject-specific counts in the Almanac (e.g., the total number of biology or chemistry articles) may be higher than expected. In most cases, we "count" a research article toward a particular Almanac subject if that subject is among the article's three highest-scoring subjects, according to the Merged Academic Corpus subject models. (Totals for AI and its subfields use a different method based on different models.) This is a liberal standard. It also means that each article counts toward three different Almanac subjects (in other words, different Almanac subjects can include the same article). Both of these factors tend to increase the Almanac's subject-specific totals.
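
To see why this standard inflates subject totals, consider the following minimal sketch. It assumes each article carries a hypothetical `subject_scores` mapping from subject name to model score; the real MAC schema and the exceptions implied by "in most cases" (including the separate method for AI topics) are not modeled here.

```python
from collections import Counter


def count_by_subject(articles, top_n=3):
    """Count each article toward its `top_n` highest-scoring subjects."""
    counts = Counter()
    for article in articles:
        scores = article["subject_scores"]  # hypothetical field name
        top_subjects = sorted(scores, key=scores.get, reverse=True)[:top_n]
        counts.update(top_subjects)
    return counts


# Two invented articles with invented scores, for illustration only.
articles = [
    {"subject_scores": {"biology": 0.9, "chemistry": 0.7, "medicine": 0.4, "physics": 0.1}},
    {"subject_scores": {"chemistry": 0.8, "physics": 0.6, "biology": 0.5, "medicine": 0.2}},
]
subject_counts = count_by_subject(articles)
```

In this example there are only two articles, but the subject counts add up to six, because each article is counted once for each of its three top subjects.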

Linking articles to countries and institutions

The Almanac attributes articles to countries based on the author organizations listed in each article, as recorded in MAC metadata. (Here, and generally in ETO resources, we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities. Read more >>) In the Almanac, an article "counts for" a given country if it lists at least one author affiliated with an organization in that country. The MAC relies on the article to determine the author's organization; for instance, an article listing "Jane Smith, University of Texas" as its author would be attributed to the United States even if Professor Smith later moved to the University of Tokyo. By the same token, authors are associated with the country of their listed organization even if they're not "from" that country: once she moved to the University of Tokyo, Professor Smith's articles would count for Japan, even if she was born and raised in Chicago.

If an article lists authors from organizations in more than one country, the article will "count toward" multiple countries in the Almanac. However, if a single article has multiple authors from the same country, it will only be counted once for that country.

So, for example:

  • an article with a single author, from the University of Texas, will count as one article for the United States.
  • an article with one author from the University of Texas and one author from the University of Tokyo will count as one article for the United States and one article for Japan.
  • an article with two authors from the University of Texas and one author from the University of Tokyo will also count as one article for the United States and one article for Japan.

Finally, articles are assigned to organizations in the same way as countries, e.g., an article with one author from the University of Texas and one author from the University of Tokyo counts as one article for each university.
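
The counting rules above can be sketched as follows. The record layout (an `affiliations` list whose entries have `organization` and `country` keys) is an assumption for illustration and is not the actual MAC metadata format.

```python
from collections import Counter


def count_articles_by(articles, field):
    """Count each article once per distinct value of `field` among its affiliations."""
    counts = Counter()
    for article in articles:
        # A set collapses repeated values, so several co-authors from the same
        # country (or organization) still add only one article to its total.
        values = {aff[field] for aff in article["affiliations"] if aff.get(field)}
        counts.update(values)
    return counts


TEXAS = {"organization": "University of Texas", "country": "United States"}
TOKYO = {"organization": "University of Tokyo", "country": "Japan"}

articles = [
    {"affiliations": [TEXAS]},                # one author, University of Texas
    {"affiliations": [TEXAS, TOKYO]},         # one author from each university
    {"affiliations": [TEXAS, TEXAS, TOKYO]},  # two Texas authors, one Tokyo author
    {"affiliations": []},                     # no affiliation data: counted nowhere
]

by_country = count_articles_by(articles, "country")
by_organization = count_articles_by(articles, "organization")
```

Here the United States is credited with three articles and Japan with two, matching the three examples above. The fourth article, which lacks affiliation data, adds nothing to any country or organization total.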

Isolating English-language articles

For purposes of the Almanac, we isolate English-language articles by filtering out every MAC article without an English-language title or abstract.
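
As a rough sketch of this filter, assuming each article record has hypothetical `title_language` and `abstract_language` fields holding a language code (how the MAC actually identifies languages is not described here):

```python
def is_english_language(article):
    """True if the article has an English title or an English abstract."""
    # Hypothetical field names; a missing field is treated as "not English".
    return "en" in (article.get("title_language"), article.get("abstract_language"))


def isolate_english(articles):
    """Drop every article with neither an English title nor an English abstract."""
    return [article for article in articles if is_english_language(article)]
```

Under this definition, an article with a Chinese-language title but an English abstract is still kept.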

Maintenance

How is it updated?

The Almanac updates as the Merged Academic Corpus is updated. Read more >>

How can I report an issue?

Use our general issue reporting form.

Credits

  • Concept: Zach Arnold, Jennifer Melot, Dewey Murdick
  • Design and analysis: Zach Arnold, Brian Love, Jennifer Melot
  • Engineering: Brian Love, Jennifer Melot
  • Review and testing: Catherine Aiken, Liam Alexander, Zach Arnold, Shaun Ee, Brian Love, Jennifer Melot, Neha Singh, and anonymous testers
  • Documentation: Zach Arnold
  • Maintenance: Brian Love, Jennifer Melot

Topic classifications used in the Research Almanac are based upon work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.

Major change log

11/22/23: Major data update following MAC changes
5/19/23: Initial release