Moral Learning

Project overview

Team Lead: Anna Leshinskaya (anna @ objective.is)

Latest Update: Can LLMs help with value-guided decision making?

It is notoriously difficult to instruct an AI agent to act in line with our intentions, because communicating intentions with enough specificity and situational flexibility is intractable (Russell, 2019; Soares & Fallenstein, 2017; Wiener, 1960). AI must instead learn how to act and reason about arbitrary situations in the way we would. In our view, this learning must result in a process for generating decisions in arbitrary, realistic, morally ambiguous action scenarios.

The goal of the moral learning project is to evaluate and improve the extent to which AI agents are aligned with an individual person in deciding how to act. Aligned agents should make action choices similar to those of the human they are aligned with. By training to make choices in morally complex scenarios, such agents would learn to capture that person’s moral principles and values and apply them in context-specific ways. We treat this problem as composed of two parts.
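As a concrete illustration of the evaluation target, the minimal sketch below (in Python) computes an agreement rate between an agent’s and a person’s chosen actions over a shared set of scenarios; the parallel-list data format and the bare match-rate metric are illustrative assumptions, not the project’s actual evaluation protocol.

    # Hypothetical sketch: quantifying alignment as the rate at which an agent's
    # action choices match those of the person it is aligned with.
    def choice_agreement(human_choices: list[str], agent_choices: list[str]) -> float:
        """Fraction of scenarios in which the agent picks the same action as the human."""
        assert len(human_choices) == len(agent_choices) and human_choices
        matches = sum(h == a for h, a in zip(human_choices, agent_choices))
        return matches / len(human_choices)

    # Example: three morally complex scenarios, agreement on two of them.
    print(choice_agreement(
        human_choices=["donate", "refuse_bribe", "report"],
        agent_choices=["donate", "refuse_bribe", "stay_silent"],
    ))  # prints 0.666...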

Value learning

How can an AI represent an individual human’s values? In real-world scenarios, actions are guided both by specific goals (e.g., find a meal) and more general concerns (e.g., eat ethically). This forces us to bridge traditionally distinct notions of ‘values’:

  1. long-run cumulative rewards for actions, and

  2. the importance of certain abstract action-guiding concepts (like ethical food consumption).

In our recent work, we discuss representational systems that can jointly capture both kinds of ‘values’, and the benefits of treating each as a continuous semantic attribute. Our current work asks what kinds of training data and model architectures could allow systems to learn and retrieve an individual human’s values of both kinds for the purpose of guiding action.
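To make the idea of a joint representation concrete, here is a minimal sketch, assuming both kinds of ‘values’ can be scored as continuous attributes of a candidate action: an estimate of its long-run cumulative reward plus weights on abstract value concepts. The names, data structures, and additive combination are hypothetical illustrations, not the architectures under study.

    # Hypothetical sketch of a joint value representation: a person's weights over
    # abstract value concepts combined with an action's long-run reward estimate.
    from dataclasses import dataclass, field

    @dataclass
    class PersonValues:
        # Importance of abstract, action-guiding concepts for this person (continuous).
        concept_weights: dict[str, float] = field(default_factory=dict)

    def action_value(
        long_run_reward: float,            # estimated cumulative reward of the action
        concept_scores: dict[str, float],  # how strongly the action expresses each concept
        values: PersonValues,
        reward_weight: float = 1.0,
    ) -> float:
        """Combine goal-directed reward and value-concept alignment into one score."""
        concept_term = sum(
            values.concept_weights.get(c, 0.0) * s for c, s in concept_scores.items()
        )
        return reward_weight * long_run_reward + concept_term

    # Example: choosing a meal while caring about eating ethically.
    values = PersonValues(concept_weights={"ethical_food_consumption": 0.8})
    score = action_value(
        long_run_reward=0.6,
        concept_scores={"ethical_food_consumption": 0.9},
        values=values,
    )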

Program learning for morality-guided action decision making

Knowing someone’s values is not enough to explain how they decide to act; at least two more things are required. First, because choices in complex scenarios often pit different values and utilities against each other, we need to understand how a person weighs these considerations against one another. Second, psychological research has long shown that decisions are made not with simple rules or principles that apply to all scenarios in the same way, but rather by flexibly applying those principles to arbitrary, specific contexts. When making such decisions, humans consider the causal structure among events, the mental states (such as intentions and beliefs) of the participants in the scenario, and what they think will happen if they intervene (Cushman, 2023; Kleiman-Weiner & Levine, 2015; Knobe, 2010). In short, to make human-like, morally guided action choices, AI agents must learn to employ a particular set of reasoning processes, or algorithms, that combine information about values with the specifics of the scenario. We seek to both characterize and amend these decision-making algorithms in modern AI systems. In particular, using program learning, we believe we can discover human-aligned, morally guided decision-making algorithms in a bottom-up way while maintaining transparency and interpretability.
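As a rough illustration of what a transparent, program-like decision procedure could look like, the sketch below scores candidate actions by combining a values-fit term with penalties for foreseen harm and for harm that is intended rather than a side effect. The features and the weighting rule are assumptions chosen for illustration, not an algorithm actually recovered by program learning.

    # Hypothetical sketch of an interpretable decision program of the kind that
    # program learning might discover; features and weights are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class CandidateAction:
        name: str
        value_score: float      # fit with the person's values (e.g., an action_value score)
        foreseen_harm: float    # harm the actor expects the action to cause
        harm_is_intended: bool  # harm as a means/goal rather than a side effect

    def choose_action(
        candidates: list[CandidateAction],
        harm_weight: float = 1.0,
        intent_penalty: float = 2.0,
    ) -> CandidateAction:
        """Pick the candidate whose values fit best survives harm and intent penalties."""
        def score(a: CandidateAction) -> float:
            penalty = harm_weight * a.foreseen_harm
            if a.harm_is_intended:
                penalty *= intent_penalty  # intended harm counts more than side-effect harm
            return a.value_score - penalty
        return max(candidates, key=score)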

References

Cushman, F. (2023). Computational social psychology. Annual Review of Psychology, 75(1), 1–28.

Kleiman-Weiner, M., & Levine, S. (2015). Inference of intention and permissibility in moral decision making. In Proceedings of the 37th Annual Conference of the Cognitive Science Society.

Knobe, J. (2010). Action trees and moral judgment. Topics in Cognitive Science, 2(3), 555–578. https://doi.org/10.1111/j.1756-8765.2010.01093.x

Russell, S. (2019). Human compatible: Artificial intelligence and the problem of control. Penguin.

Soares, N., & Fallenstein, B. (2017). Agent foundations for aligning machine intelligence with human interests: a technical research agenda. In V. Callaghan, J. Miller, R. Yampolskiy, & S. Armstrong (Eds.), The Technological Singularity (pp. 103–125). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-54033-6_5

Wiener, N. (1960). Some moral and technical consequences of automation. Science, 131(3410), 1355–1358.