
Our roadmap for the Map of Science and Merged Academic Corpus

2022-09-29

Some planned updates to the Map of Science and Merged Academic Corpus.

ETO’s engineering team is busy upgrading the Map of Science and its main underlying dataset, the Merged Academic Corpus. Here are our priorities for the next several months:

Map of Science

  • Summary metrics view for groups of research clusters: We’re building new features that will let users view summary information for groups of clusters, in addition to the details already available for individual clusters. Users will be able to see that information for the full set of clusters matching the filters they select, just as they can currently see it for a single cluster through the Map’s detail view.

Merged Academic Corpus

  • Improve article deduplication: We’re planning additional improvements to how we deduplicate and combine metadata from duplicate articles in our data sources. Specifically, we’re working to further reduce “false negatives” (duplicate articles that currently aren’t being merged together) and “false positives” (articles that aren’t actually duplicates but are being merged together) by refining how we use article metadata to identify potential duplicates. (Our existing processes have already reduced these sorts of errors to low levels, but we’re aiming to reduce them even further.)
  • Add new article metadata: We’re also planning to integrate more article-level metadata from OpenAlex.
  • Replace and update subject models: We’ll integrate CSET’s own models for identifying article subjects, replacing the recently discontinued Microsoft Academic models currently used in the corpus. We’re also retraining our AI and AI subfield relevance models with updated data.
  • Automate collection of article full text from arXiv to help us better identify GitHub repos relevant to research clusters.
  • Incorporate more machine-translated and manually cleaned metadata, such as more consistent translations of non-English text and more consistently “clean” organization names.
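
To make the false negative / false positive tradeoff in deduplication concrete, here is a minimal Python sketch of metadata-based duplicate detection. It is an illustration only, not the pipeline used for the Merged Academic Corpus: the record fields (`title`, `year`, `doi`), the helper names, and the similarity threshold are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(title.split())


def title_similarity(a: dict, b: dict) -> float:
    """Similarity in [0, 1] between the normalized titles of two records."""
    return SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()


def is_probable_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Hypothetical rule: same DOI, or same year plus near-identical title."""
    # Assumption: identical DOIs are treated as decisive evidence.
    if a.get("doi") and b.get("doi") and a["doi"].lower() == b["doi"].lower():
        return True
    # Assumption: records from different years are never merged.
    if a.get("year") != b.get("year"):
        return False
    return title_similarity(a, b) >= threshold


# Hypothetical records, invented for this example.
rec_a = {"title": "Deep Learning for Protein Folding", "year": 2021}
rec_b = {"title": "Deep learning for protein folding.", "year": 2021}
rec_c = {"title": "Deep Learning for Protein Design", "year": 2021}

# Same article with different casing and punctuation: merged.
print(is_probable_duplicate(rec_a, rec_b))

# Different articles with similar titles: a loose threshold merges them
# (a false positive), while a strict one keeps them apart.
print(is_probable_duplicate(rec_a, rec_c, threshold=0.5))
print(is_probable_duplicate(rec_a, rec_c, threshold=0.99))
```

Raising the threshold reduces false positives but risks missing true duplicates whose metadata differs slightly between sources. Lowering it does the opposite. Refining which metadata fields are compared, and how, is one way to improve both error rates at once.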

As always, we welcome feedback on our plans. Please feel free to contact us directly with comments, critiques, or suggestions.