ETO's Open-source software Research and Community Activity (ORCA) tool compiles data on open-source software (OSS) used in science and technology research. Drawing on GitHub Archive, ETO's Merged Academic Corpus, and several other data sources, ORCA tracks OSS usage, health, development activity, and community engagement across a wide range of software projects and research subjects.
Use ORCA to:

- Compare OSS projects in a particular research area according to different metrics of project activity, interest, and health.
- Track activity, usage, and community engagement trends over time for specific repos or for all repos in a particular field.
- Sort and filter projects by research field, programming language, license, and various activity metrics.
ORCA uses data from GitHub Archive, the GitHub API, PyPI, The Stack, OpenSSF Scorecards, ETO's Merged Academic Corpus, arXiv, Semantic Scholar, and Papers With Code. Read more >>
If you use data from ORCA in your work, please cite the "Emerging Technology Observatory ORCA tracker" and include the link to the tool.
These instructions focus on the desktop version of the tool. Some features may be missing or act differently on mobile devices.
ORCA is organized by research field. Start by choosing a field from the dropdown menu:
You can browse OSS trends in the selected field with two different views. ORCA defaults to summary view, a condensed view that presents key facts and figures for the top OSS projects associated with the research field you selected. The three boxes list the top projects according to different metrics:
(For more information about these metrics and how we produced them, see our methodology description below.)
Scroll down to view trends over time for the top five projects. You can use the dropdown menu to change the metric used to identify the top projects in these graphs:
Click the view toggle in the toolbar to switch from summary view to list view. This view includes information on every OSS project associated with the research field you selected.
Each project's entry in the list includes a graph displaying a metric over time. Change the metric with the "show graphs for" dropdown.
You can also filter the list by license and/or programming language - click on "Show filters" to display the menus for these filters:
Each repository card in list view includes a "full profile" button at the bottom. Click it to view all of the data in ORCA about that project. The accordion at the bottom of the profile page includes graphs of different health and activity metrics over time (see below for how these metrics are produced), as well as the top-cited articles associated with the project.
When you modify ORCA's settings, your browser's address bar will update to reflect what you're viewing. Copy the URL in order to return to the same view later:
Tread carefully when:

- Drawing broad conclusions about all OSS used for a particular topic or purpose. ORCA is incomplete. Before drawing conclusions based on ORCA data, consider how its limitations might affect the analysis.
- Evaluating a particular OSS project's health based only on one or two ORCA metrics. For example, some projects have very few contributors yet successfully close out issues from the community and maintain a regular release schedule. Project health is complex and depends on multiple factors, some of which are not currently visible in ORCA.
ORCA tracks OSS projects that are used in different fields of research. This involves two subtasks: identifying projects associated with articles from different research fields and describing those projects according to different metrics.
In most cases, we use a largely automated process to associate OSS projects with research fields. This process involves four steps: gathering research articles, grouping the articles by research field, detecting mentions of OSS projects in articles from each field, and screening the fields for final display. A few research fields are instead associated with projects manually.
ORCA's research literature dataset includes data from five sources:
We deduplicate and structure articles from these sources using processes similar to those described in the Merged Academic Corpus documentation.
We search the gathered data for mentions of GitHub repositories using the regular expression /(?i)github.com/([A-Za-z0-9-_.]+/[A-Za-z0-9-_.]*[A-Za-z0-9-_])/. We also add mentions of articles found in repository README files, as compiled by The Stack. From this data we extract article-repo pairs - that is, pairs in which either the article cites the repository or the repository's README cites the article. We deduplicate the pairs, then use them to construct the set of OSS projects tracked in ORCA, along with the set of articles associated with each project.
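As a minimal sketch of this extraction step (the function and the lowercasing used for deduplication are illustrative, not ORCA's production code):

```python
import re

# The repo-mention pattern quoted above, with the dot escaped for strictness;
# (?i) makes it case-insensitive, and the capture group is the owner/name slug.
REPO_PATTERN = re.compile(
    r"(?i)github\.com/([A-Za-z0-9-_.]+/[A-Za-z0-9-_.]*[A-Za-z0-9-_])"
)

def extract_repo_mentions(article_id: str, text: str) -> set[tuple[str, str]]:
    """Return deduplicated (article, repo) pairs found in one article's text."""
    # GitHub slugs are case-insensitive, so lowercase before deduplicating.
    return {(article_id, slug.lower()) for slug in REPO_PATTERN.findall(text)}

pairs = extract_repo_mentions(
    "arxiv:2101.00001",
    "Code at https://github.com/scikit-learn/scikit-learn "
    "(mirror: GITHUB.COM/scikit-learn/scikit-learn).",
)
print(pairs)  # {('arxiv:2101.00001', 'scikit-learn/scikit-learn')}
```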
Each project-citing article that has an English title and abstract, a publication date after 2010, and an abstract of over 500 characters is automatically associated with three research fields using the process described in the Merged Academic Corpus documentation. With these groupings and the citation lists compiled in the prior step, we can count how often a given OSS project is linked to articles from different fields.
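Expressed as a filter, the screening criteria look roughly like the following sketch; the record keys are hypothetical, since ORCA's article records are not necessarily structured this way:

```python
from datetime import date

MIN_ABSTRACT_CHARS = 500

def eligible_for_field_assignment(article: dict) -> bool:
    """Apply the screening criteria described above (hypothetical key names)."""
    return (
        article.get("language") == "en"  # English title and abstract
        and article.get("pub_date", date.min) > date(2010, 12, 31)  # after 2010
        and len(article.get("abstract", "")) > MIN_ABSTRACT_CHARS  # over 500 chars
    )
```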
Finally, we determine which of these fields to display in the ORCA interface (that is, as options in the "Research Field" dropdown selector). As a general rule, the interface includes fields associated with at least ten different OSS projects, each linked to at least three different articles from that field. We then perform a "gut check" manual review of the most relevant projects for each field, screening out fields where these projects seem obviously unrelated or otherwise uninformative to ORCA's users. This review accounts for the inherent variability in our automated process for associating articles with fields, which sometimes links articles and fields that are only weakly or ambiguously connected.
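The automated threshold rule (before the manual review) amounts to a filter like this sketch, assuming per-field counts of linked articles for each project:

```python
MIN_PROJECTS_PER_FIELD = 10
MIN_ARTICLES_PER_PROJECT = 3

def displayable_fields(field_links: dict[str, dict[str, int]]) -> list[str]:
    """field_links maps field -> {repo slug: linked-article count}; the shape
    is assumed for illustration. Returns fields that pass the automated
    thresholds and go on to the manual "gut check" review."""
    keep = []
    for field, projects in field_links.items():
        # Count projects with at least three linked articles in this field.
        strong = sum(1 for n in projects.values() if n >= MIN_ARTICLES_PER_PROJECT)
        if strong >= MIN_PROJECTS_PER_FIELD:
            keep.append(field)
    return keep
```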
In some cases, subject matter experts at CSET or elsewhere have manually compiled lists of GitHub repositories relevant to specific research fields. Rather than using our automated process for associating these fields with projects, we use expert-compiled lists in the following cases:
Our methods for identifying research-relevant OSS projects have important limitations.
Several metrics involve counting new and returning contributors in a given year. ORCA defines new contributors as contributors who first committed to the project repo in the year in question. Returning contributors are those who first committed to the project repo in a prior year. We identify contributors based on the names reported in GitHub Archive's PushEvents, meaning that contributions may be undercounted for individuals whose contributor names change over time.
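The cohort logic can be sketched as follows, assuming a stream of (contributor name, year) pairs taken from PushEvents:

```python
from collections import defaultdict

def contributor_cohorts(push_events):
    """push_events: iterable of (contributor_name, year) pairs for one repo
    (an assumed shape). Returns {year: (new, returning)} counts."""
    first_year: dict[str, int] = {}       # earliest commit year per name
    active = defaultdict(set)             # names active in each year
    for name, year in push_events:
        active[year].add(name)
        first_year[name] = min(year, first_year.get(name, year))
    cohorts = {}
    for year, names in sorted(active.items()):
        new = sum(1 for n in names if first_year[n] == year)
        cohorts[year] = (new, len(names) - new)
    return cohorts

print(contributor_cohorts([("ada", 2020), ("ada", 2021), ("bo", 2021)]))
# {2020: (1, 0), 2021: (1, 1)}  <- "bo" is new in 2021, "ada" returning
```

Because identity rests on the reported contributor name, someone who pushes under two different names is counted as two contributors, which is the undercounting caveat noted above.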
The number of open issues and pull requests is retrieved from the GitHub API (the open_issues_count key). We report them together rather than separately because the API's open_issues_count field combines open issues and open pull requests into a single count.
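For example, the combined count can be fetched from the public REST API like this (a minimal sketch):

```python
import requests

def open_issues_and_prs(owner: str, repo: str) -> int:
    """Fetch a repo's combined open issue + pull request count. GitHub's
    REST API reports both in the single open_issues_count field."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["open_issues_count"]

print(open_issues_and_prs("numpy", "numpy"))
```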
OSS criticality scores are maintained by the Open Source Security Foundation, which explains: "A project's criticality score defines the influence and importance of a project. It is a number between 0 (least-critical) and 1 (most-critical)." We retrieve these scores from the criticality-score-v0-latest BigQuery dataset. Note that criticality scores assess criticality in the overall OSS ecosystem, not criticality to a particular field of research. We do not currently adjust this metric for importance to a particular research field.
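A sketch of pulling these scores with the BigQuery client library; the table and column names below are hypothetical, so check the dataset's actual schema before running:

```python
from google.cloud import bigquery

client = bigquery.Client()  # requires Google Cloud credentials

# Hypothetical table and column names within the criticality-score-v0-latest
# dataset; the real schema may differ.
query = """
    SELECT repo_url, default_score
    FROM `openssf.criticality_score_v0_latest.scores`
    WHERE default_score >= 0.8          -- scores range from 0 to 1
    ORDER BY default_score DESC
"""
for row in client.query(query).result():
    print(row.repo_url, row.default_score)
```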
This data comes directly from PyPI. It is only available for projects released on PyPI.
Each project is assigned a relevance score for each research field included in ORCA, other than the manually compiled fields. These are TF-IDF-based scores, calculated in each case using the project repository URL as the "term" and the corpus of OSS-linked articles associated with each research field (compiled as discussed above) as the "documents."
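One plausible reading of this scoring is sketched below; the exact term-frequency and IDF normalizations are illustrative rather than ORCA's precise formulas:

```python
import math
from collections import defaultdict

def relevance_scores(field_mentions: dict[str, dict[str, int]]) -> dict[str, dict[str, float]]:
    """field_mentions maps field -> {repo slug: count of linked articles}
    (an assumed shape). Each field's article corpus acts as one "document"
    and each repo URL as a "term"."""
    n_fields = len(field_mentions)
    doc_freq = defaultdict(int)  # number of fields in which each repo appears
    for projects in field_mentions.values():
        for repo in projects:
            doc_freq[repo] += 1
    scores = {}
    for field, projects in field_mentions.items():
        total = sum(projects.values())
        scores[field] = {
            repo: (n / total) * math.log(n_fields / doc_freq[repo])  # TF * IDF
            for repo, n in projects.items()
        }
    return scores
```

Under this weighting, a repo that appears in only a few fields earns a higher IDF weight, so heavy use concentrated in one field yields a higher relevance score there than diffuse use across many fields.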
This metric counts new stars added (WatchEvents in GitHub Archive). It may not equal the total number of stars displayed on GitHub, because we do not track star removals.
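As an illustration, star additions can be counted from GH Archive's public BigQuery tables (this sketch assumes the githubarchive.year tables):

```python
from google.cloud import bigquery

client = bigquery.Client()  # requires Google Cloud credentials

# WatchEvents are star additions; removals never appear in the event
# stream, so these counts can exceed the live star count on GitHub.
query = """
    SELECT repo.name AS repo, COUNT(*) AS new_stars
    FROM `githubarchive.year.2022`
    WHERE type = 'WatchEvent'
    GROUP BY repo
    ORDER BY new_stars DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.repo, row.new_stars)
```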
Currently, the data is updated on an ad-hoc basis. We plan to automate monthly updates in the second half of 2023.
Use our general issue reporting form.