How we define "AI safety research" in our tools

Screenshot of the Research Almanac tool


Drawing boundaries around a fast-moving topic

What counts as AI safety research? For those interested in understanding this critical field, it's a question as challenging as it is essential to answer. The domain of "AI safety" research is young, quickly evolving, and poorly defined, even compared to other emerging topics. Right now, at least, there's no authoritative Journal of AI Safety to consult, no single AI safety category in the leading research repositories, and in general, no widely accepted standards for what's in and out.

ETO's Research Almanac tracks trends and key players in a variety of AI research domains and applications, including AI safety. Because there's no authoritative definition of AI safety research, we had to develop one ourselves to get the job done. It had to be intuitive enough to be useful for the Almanac's users, rigorous enough to satisfy users and subject-matter experts, and objective and well-specified enough to be applied consistently.

Our definition, developed in collaboration with subject-matter experts on CSET's analytic staff, is copied below. We refined and tested it over multiple iterations by having CSET staff use it to annotate research articles as AI safety-focused or not. Then, we developed a classifier based on the staff annotations. When you visit the AI safety page in the Almanac, the facts and figures you see are derived from the aggregated metadata of the articles the classifier determined to be AI safety articles. In other words, the Almanac's AI safety data is based on the output of a model that emulates humans applying our definition of AI safety research.

Developing a model to implement our AI safety definition

For detailed information on the model we used to implement our AI safety definition at scale, and how we developed it, see the documentation.

In brief: we split the records our staff annotated, plus additional unlabeled records, into development, training, and validation sets, then used a programmatic weak supervision approach. We created labeling functions that collectively predicted AI safety relevance for the development set, based on keywords related to our AI safety definition or on other heuristics we expected to help separate AI safety work from unrelated work, such as publication year. We also used some labeling functions automatically suggested by the Snorkel Flow platform. We evaluated each labeling function against the development set and kept those with at least 74% precision.
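To make the workflow concrete, here is a minimal, self-contained sketch of what keyword- and heuristic-based labeling functions and a development-set precision check can look like. This is not ETO's actual code: the keywords, records, function names, and the tiny development set below are all hypothetical illustrations; only the idea (labeling functions voting on AI safety relevance, filtered by a precision threshold like the 74% cutoff mentioned above) comes from the text.

```python
# Illustrative sketch only (not ETO's pipeline). Labeling functions vote
# SAFETY, NOT_SAFETY, or ABSTAIN on each record; each function's precision
# is then measured against a small hand-labeled development set.
ABSTAIN, NOT_SAFETY, SAFETY = -1, 0, 1

def lf_safety_keywords(text):
    # Vote SAFETY if the text mentions a safety-related keyword
    # (hypothetical keyword list); otherwise abstain.
    keywords = ("adversarial", "robustness", "alignment", "interpretability")
    return SAFETY if any(k in text.lower() for k in keywords) else ABSTAIN

def lf_publication_year(record):
    # Heuristic of the kind mentioned above: vote NOT_SAFETY on older
    # papers, abstain on recent ones (cutoff year is hypothetical).
    return ABSTAIN if record["year"] >= 2015 else NOT_SAFETY

def precision(votes, gold, target=SAFETY):
    # Precision of a function's non-abstain votes for the target class.
    pairs = [(v, g) for v, g in zip(votes, gold) if v == target]
    return sum(v == g for v, g in pairs) / len(pairs) if pairs else 0.0

# Tiny hypothetical development set with human ("gold") labels.
dev = [
    {"text": "Adversarial robustness of image classifiers", "year": 2021, "label": SAFETY},
    {"text": "A faster optimizer for transformer training", "year": 2022, "label": NOT_SAFETY},
    {"text": "Alignment of language models with human intent", "year": 2023, "label": SAFETY},
]
votes = [lf_safety_keywords(r["text"]) for r in dev]
gold = [r["label"] for r in dev]

# Keep the labeling function only if it clears the precision threshold.
keep = precision(votes, gold) >= 0.74
```

In a real pipeline this filtering would be repeated for every candidate labeling function, with the development set large enough for the precision estimates to be meaningful.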

We used the labeling functions to produce "weak" labels for the entire training set, then trained a classifier on the weak labels and evaluated its performance on the validation set. Finally, we ran the classifier on all of the English-language AI articles in the Merged Academic Corpus, the data source for the Research Almanac.
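The aggregation step can be sketched as follows. This is a simplified, hypothetical illustration: it combines labeling-function votes into one "weak" label per record by simple majority vote, whereas a trained label model (as in Snorkel-style weak supervision) would instead weight each function by its estimated accuracy. All names and the vote matrix below are made up for illustration.

```python
# Illustrative sketch only (not ETO's pipeline): collapse each record's
# labeling-function votes into a single weak label by majority vote,
# ignoring abstentions and abstaining on ties.
from collections import Counter

ABSTAIN, NOT_SAFETY, SAFETY = -1, 0, 1

def weak_label(votes):
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN  # every function abstained
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return ABSTAIN  # tie between classes
    return top

# Each row: one training record's votes from three hypothetical
# labeling functions.
vote_matrix = [
    [SAFETY, SAFETY, ABSTAIN],          # clear majority -> SAFETY
    [NOT_SAFETY, ABSTAIN, ABSTAIN],     # single vote -> NOT_SAFETY
    [SAFETY, NOT_SAFETY, ABSTAIN],      # tie -> abstain
    [ABSTAIN, ABSTAIN, ABSTAIN],        # no votes -> abstain
]
weak_labels = [weak_label(v) for v in vote_matrix]
```

Records receiving a non-abstain weak label would then serve as training data for a supervised text classifier, which is evaluated on the held-out, human-labeled validation set before being run over the full corpus.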

As you look over the Almanac's facts and figures, keep two things in mind. First, because there's no objective definition of AI safety research, any measurement of the field is inherently imprecise. We think our definition lines up reasonably well with the boundaries other informed observers might draw around the field, but there's room for reasonable disagreement. For this reason, the AI safety numbers in the Almanac should be read as estimates.

Second, and relatedly, our definition is broad by design. When we developed it, we wanted to capture a wide variety of plausible definitions of AI safety. If your understanding of AI safety is more specific (for example, you're most concerned about a particular type of bias, or about longer-term risks as AI systems continue to develop), our metrics may describe a different body of research than the one you have in mind.

We think our definition of AI safety research is a good start, but it's certainly not the last word. We'd welcome your feedback on the definition, on how we've implemented it, or on other AI safety-related topics we could integrate into our tools.

Our definition of "AI safety research"

This is the same definition used in the development of our AI safety research classifier. For more details on that process, see the documentation.


AI safety research is research focused on ensuring that AI systems effectively perform their intended functions with minimal adverse side effects when deployed in real-world conditions.

This definition is centered around the idea that the data or task an AI system was trained on, and its performance characteristics on evaluation data in an R&D environment, may not always reflect its performance in a particular real-world application. In a real-world application, data or performance requirements may differ from what the model encountered in training, and the model may even be exposed to malicious inputs.

Embedded concepts


  • "Research" includes commentary of an analytic or synthetic nature, e.g., papers commenting on the safety-related strengths and weaknesses of different AI techniques, review papers, etc.

"AI systems"

An AI system is a machine-based system that is capable of influencing the environment by producing an output (predictions, recommendations or decisions) for a given set of objectives. It uses machine and/or human-based data and inputs to (i) perceive real and/or virtual environments; (ii) abstract these perceptions into models through analysis in an automated manner (e.g., with machine learning), or manually; and (iii) use model inference to formulate options for outcomes.

"Focused on"

  • The main subject matter of the research must qualify as AI safety research. For example, a short "impact assessment" in an otherwise irrelevant article isn't enough on its own to make the article an AI safety article under our definition.
  • The subject matter must be directly and proximately related to ensuring that AI systems effectively perform their intended functions with minimal adverse side effects when deployed in real-world conditions.
    • For example, an article arguing that utilitarian ethics are theoretically superior to deontological ethics, and should therefore be favored by AI developers, probably wouldn't count as AI safety research under our definition, even though it may have some indirect relevance (e.g., it might inform a specific conceptualization of AI bias or fairness that could in turn be programmed into a deployed AI).
      • On the other hand, an article describing specific principles or methods for ensuring that AI systems adhere to utilitarian values, or specific ways AI systems might violate those values, probably would count, because it is more directly related to the safe deployment of AI systems.
    • For example, an article about improving the accuracy or efficiency of AI systems in general (e.g., by implementing a new variant of gradient descent or an architecture that requires less training data) probably wouldn't count as AI safety research under our definition, even though it may have some indirect relevance (e.g., accuracy is generally relevant to robustness).

"Intended function"

  • The intent of an AI's creators or deployers often isn't stated explicitly. We draw reasonable inferences in these cases. That is, absent evidence to the contrary, we assume AI systems aren't intended to do things that a reasonable third party would condemn, e.g., because they are illegal or widely recognized as unacceptable.
    • For example, we can assume that autonomous car designers and users don't intend for the cars to kill people, and that credit-scoring algorithm designers don't intend for their systems to illegally discriminate against applicants of a certain race.

"Minimal adverse side effects"

  • What consequences are "side effects," and among those, which are "minimal" and "adverse," depends on the intent of the AI's creator or deployer. When direct evidence of that intent is not available, we draw reasonable inferences as described above.
  • "Side effects" are those that result directly and proximately from the actions of a deployed AI. Research into less direct effects does not count under this definition.
    • For example, an article discussing how large-scale AI deployment might eventually destabilize certain types of governments would not count.

Rules of thumb

  • This definition ordinarily includes these research topics (without limitation):
    • AI robustness, including research into adversarial inputs and examples, generalization, distribution shift, and noise or corruption issues with input data.
    • AI explainability.
    • Reward specification, including research into ways to ensure AI systems implement goals that may be unspecified or poorly specified.
    • AI alignment with human values and intentions.
    • Undesired bias in AI systems or the data used to develop them.
    • Monitoring and control of AI systems.
    • Test, evaluation, validation, and verification of AI systems in deployment environments.
    • Contextual interventions that may affect the performance or adverse side effects of AI systems, including interventions related to AI governance; user training and education; AI developer training, education, and best practices; and human-computer interaction.
    • Implementation of specific, defined conceptions of "fairness," "ethics," or similar concepts by AI systems.
    • Research into AI systems that can effectively interact and/or coordinate with humans or other non-AI agents.
  • This definition ordinarily does not include these research topics (without limitation):
    • Theoretical inquiry into, or normative arguments about, the content of AI fairness, ethics, or similar concepts.
    • Responsible production of inputs to AI systems, such as annotated data, except to the extent it directly affects the performance or adverse side effects of those systems in deployment.
      • Note that most research into privacy-preserving machine learning does not qualify as AI safety research under this definition.
    • Studies of AI research or AI safety research (e.g., polls of researchers, bibliometric analyses).
    • Research agendas.
    • Research into using AI systems to mitigate safety or fairness problems existing independently in other domains (for example, research into AI-based crime prevention or cyber intrusion detection).
    • Research articles that stress performance gains against benchmarks as critical evidence of the value of the research are less likely to meet this definition.
    • Research articles that describe their main contribution as reframing a problem, defining questions for future research, or similarly abstract contributions are less likely to meet this definition.
    • Research in the area of public policy is less likely to meet this definition, even if it discusses AI safety. Such research is generally oriented toward shaping incentives for other actors to develop and deploy safer AI systems, rather than focusing directly on how such systems can be developed and deployed.
  • If the argument or subject matter of an article, or its relevance to AI, is unclear, assume it does not meet this definition.