Morally Guided Action Reasoning in Humans and Large Language Models: Alignment Beyond Reward

Anna Leshinskaya

Image generated by DALL·E 2 with the prompt “a cat is riding a bicycle across a cross-walk while the crossing light is red, in the style of a water color”.

Language Meets Action

Although large language models (LLMs) like GPT-4 made their big splash as chatbots, they might soon be doing a lot more: taking real-world actions. Research teams are using LLMs to play video games, make transactions on the internet, and even help drive autonomous cars 1, 2, 3, 4, 5. But language models weren’t initially designed to act. We believe that empirically understanding their emergent action-reasoning abilities is crucial for making these approaches safe, corrigible, and beneficial. Here, we describe and motivate our approach to this research problem, as taken in our Moral Learning project – in which we use theories of human cognition to identify potential ways in which LLMs might solve action decision problems. We focus in particular on how LLMs, compared with humans, might combine multiple morally relevant properties of naturalistic scenarios to derive their action choices.

How does any intelligent system decide how to act? In standard computational approaches such as reinforcement learning (RL), action trajectories are evaluated in terms of their relative costs and rewards 6. A self-driving car might decide between a shorter route with tolls and a longer route without tolls by calculating their relative costs given the price of gas. But whereas traditional systems explicitly represent these cost and reward calculations, it is not clear how LLMs represent action costs or rewards. LLMs are trained with an objective defined over token probabilities – to return highly probable linguistic tokens given a series of input tokens – not over real-world actions. Yet they may acquire an internal representation of actions, including their costs and rewards, through language.
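
To make the contrast with explicit cost-and-reward reasoning concrete, here is a minimal sketch of how a traditional decision system might make the route choice; the fuel price, distances, and tolls are hypothetical numbers chosen only for illustration.

```python
# Minimal sketch of an explicit cost-based action choice, in the spirit of
# classical decision-theoretic / RL formulations. All numbers are hypothetical.

GAS_COST_PER_KM = 0.12   # assumed fuel cost, in $/km

def route_cost(distance_km: float, toll: float) -> float:
    """Total monetary cost of a route: fuel plus tolls."""
    return distance_km * GAS_COST_PER_KM + toll

routes = {
    "short_with_tolls": {"distance_km": 20, "toll": 5.0},
    "long_without_tolls": {"distance_km": 45, "toll": 0.0},
}

# The agent picks the action (route) with the lowest explicit cost.
best = min(routes, key=lambda r: route_cost(**routes[r]))
for name, r in routes.items():
    print(f"{name}: ${route_cost(**r):.2f}")
print("chosen:", best)
```

The point of the sketch is only that every quantity entering the decision is represented explicitly; nothing comparable is guaranteed to exist inside an LLM.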

Recent approaches explore how linguistic token probabilities can express an emergent representation of action value 7, 8. Given an action choice like “Should I take the longer route with fewer tolls?”, the probability of the model returning “yes” can be interpreted as its representation of the value of that action. This framework allows us to probe the nature of an LLM’s acquired representation of action. Yet it also opens a wide range of research and engineering questions: how best to extract action-value representations from the complex, multi-dimensional representations inside LLMs, what this reveals about the nature of action representation within them, and how best to harness and steer this information for downstream applications.
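
As a concrete illustration of this kind of probe, here is a minimal sketch using the Hugging Face transformers library with gpt2 as a stand-in model; the prompt wording and the yes/no value proxy are our own illustrative choices, not necessarily the specific method used in prior work.

```python
# Sketch of probing an LLM's emergent "action value" via token probabilities.
# Uses Hugging Face transformers with gpt2 as a stand-in; any causal LM could
# be substituted. Prompt wording is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = (
    "Question: Should I take the longer route with fewer tolls? "
    "Answer (yes or no):"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Compare probability mass on " yes" vs " no" (leading space matters for GPT-2's BPE).
yes_id = tokenizer.encode(" yes")[0]
no_id = tokenizer.encode(" no")[0]
p_yes, p_no = next_token_probs[yes_id].item(), next_token_probs[no_id].item()

# A normalized "action value" proxy: how strongly the model favors acting.
action_value = p_yes / (p_yes + p_no)
print(f"P(yes)={p_yes:.4f}  P(no)={p_no:.4f}  value proxy={action_value:.3f}")
```

Varying the scenario text in the prompt and reading off this proxy is one simple way to map how the model's implicit action values shift with situational details.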

Towards A Combinatorial Grammar of Morally Guided Decision-Making

Among the challenges facing autonomously acting AI systems is situational flexibility 9, 10, 11. To act how we would, AI must use more than a list of things we desire. This is because such a list is always incomplete and because specific situations are infinitely combinatorial, creating unanticipated trade-offs. AI must therefore acquire a suite of general algorithms for navigating potential trade-offs across situations. An important example of this is balancing ethical considerations and task goals.

To illustrate, suppose your grocery-shopping robot assistant is tasked with quickly getting your grocery items, but another shopper is blocking its way to the strawberries. Should it push the shopper out of the way? If the agent’s action reward function doesn’t take into account that pushing is immoral, the task reward takes precedence. Solving this trade-off remains a fundamental challenge in AI alignment and safety 12, 13.
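
As a toy illustration of why this matters, the sketch below scores two hypothetical plans with and without a moral-cost term; the rewards, costs, and weights are invented for illustration.

```python
# Toy illustration of the grocery-robot trade-off. All numbers are hypothetical;
# the point is only that with a moral weight of zero, pushing maximizes reward.

def plan_value(task_reward: float, moral_cost: float, moral_weight: float) -> float:
    return task_reward - moral_weight * moral_cost

plans = {
    "push the shopper": {"task_reward": 10.0, "moral_cost": 50.0},  # faster, but a harm
    "wait politely":    {"task_reward": 7.0,  "moral_cost": 0.0},
}

for moral_weight in (0.0, 1.0):
    best = max(plans, key=lambda p: plan_value(**plans[p], moral_weight=moral_weight))
    print(f"moral_weight={moral_weight}: chosen plan -> {best}")
# moral_weight=0.0 -> "push the shopper"; moral_weight=1.0 -> "wait politely"
```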

Yet humans navigate such challenges with relative ease – perhaps imperfectly, but in ways they can adjudicate as better or worse. We believe that if we better understood this reasoning machinery, we would gain better leverage on designing and steering AI that makes decisions we agree with. Towards this end, we seek to develop quantifiable models of how humans solve the challenge of balancing competing task goals and ethical concerns in real-world tasks.

There are two major missing pieces we plan to tackle:

  1. We must study how humans balance moral considerations with action goals, using realistic scenarios that involve both factors – factors that are typically studied separately. For example, in one of the famous trolley problems, a passerby sees a runaway trolley barreling towards five innocent people. He can pull a lever to divert the trolley onto another track, but then it will run over one innocent person instead. Should he intervene? This pits consequentialism – minimizing the total number of people killed – against deontology, the notion that certain actions (like causing someone to die) are inherently wrong. First, we seek to better quantify how individuals balance these moral concerns. Second, we want to expand scenarios to also include conflict with personal goals, as in the dilemma facing the grocery robot. Computational models of moral reasoning must formalize all of these factors jointly in decision-making.
  2. We must define a general theory of morally guided decision-making that can apply to arbitrary scenarios. Recent work has collected detailed human judgments about how self-driving cars should act when faced with difficult choices, such as between harming bicyclists, pets, jaywalkers, or law-abiding human pedestrians 14. Such data capture particular preferences when adjudicating between options but fall short of a general theory of how whole classes of trade-offs are resolved. What if the pet is jaywalking on a bicycle? What if the car is transporting five world leaders? To handle these infinite combinations, we need to understand the deeper principles that govern human judgments (see the sketch after this list).
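
As promised above, here is a minimal sketch of what such a compositional scheme could look like: each option is described by a handful of general features, and a single set of principle weights can then score any novel combination, including ones never enumerated in advance. The features and weights are invented for illustration; a real model would learn them from human judgments.

```python
# Minimal sketch of compositional scenario scoring with invented features/weights.

# General principle weights (one set, reused for any scenario).
WEIGHTS = {
    "n_humans_harmed": -10.0,    # consequentialist: more humans harmed is worse
    "n_animals_harmed": -3.0,
    "victim_breaking_law": 2.0,  # harming a law-breaker is judged somewhat less bad
    "actor_intervenes": -1.0,    # actively redirecting harm carries its own cost
}

def option_score(features: dict) -> float:
    """Score an option as a weighted sum of its general features."""
    return sum(WEIGHTS[k] * v for k, v in features.items())

# A combination never enumerated in advance: a jaywalking pet on a bicycle
# versus a law-abiding pedestrian.
options = {
    "swerve into jaywalking pet on bicycle": {
        "n_humans_harmed": 0, "n_animals_harmed": 1,
        "victim_breaking_law": 1, "actor_intervenes": 1,
    },
    "continue toward law-abiding pedestrian": {
        "n_humans_harmed": 1, "n_animals_harmed": 0,
        "victim_breaking_law": 0, "actor_intervenes": 0,
    },
}

for name, feats in options.items():
    print(f"{name}: {option_score(feats):+.1f}")
```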

Identifying Latent Variables in Morally Guided Decision-Making

What general principles might underlie human morally guided decision-making, and with what methods can we identify them in specific scenarios? This has been the guiding question for our work. We have drawn on theories of human moral cognition to identify many potential variables, similarly to other recent approaches 15. Furthermore, we employ the unprecedented semantic capabilities of LLMs to identify these variables automatically in raw text. This allows us to test many hypotheses about how these variables or principles may combine to guide human judgment.

In the scenarios we study, a first-person actor is faced with a dilemma. For example, she may have to decide whether to stop and help an injured cyclist and be late to work, or leave him without help. Using our LLM-based annotator, we perform a consequentialist analysis of the scenario, which scores how each action choice affects the actor and the other participants – given the information available in the scenario. This allows us to quantify the relative benefit to self and other, serving to identify the magnitude of this trade-off.
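
To give a flavor of this step, here is a hedged sketch of what such an LLM-based consequentialist annotation could look like; the prompt wording, the score scale, and the query_llm placeholder are illustrative assumptions, not the project’s actual annotator.

```python
# Sketch of an LLM-based consequentialist annotator. The prompt and score scale
# are illustrative; query_llm is a placeholder for whatever chat model is used
# (it only needs to return the model's text completion).
import json

ANNOTATION_PROMPT = """You will read a short scenario in which an actor faces a choice.
For EACH action option, rate on a scale from -3 (large harm) to +3 (large benefit):
  - benefit_to_self: the consequence for the actor
  - benefit_to_other: the consequence for the other people involved
Return JSON: {{"options": [{{"option": ..., "benefit_to_self": ..., "benefit_to_other": ...}}]}}

Scenario: {scenario}
"""

def query_llm(prompt: str) -> str:
    """Placeholder: call your preferred chat model and return its reply text."""
    raise NotImplementedError

def annotate_consequences(scenario: str) -> dict:
    reply = query_llm(ANNOTATION_PROMPT.format(scenario=scenario))
    return json.loads(reply)  # in practice, validate and retry on malformed JSON

scenario = (
    "On her way to work, Dana sees a cyclist fall and injure his arm. "
    "She can stop to help him and arrive late, or keep walking and be on time."
)
# scores = annotate_consequences(scenario)
# e.g. {"options": [{"option": "stop and help", "benefit_to_self": -1, "benefit_to_other": 2}, ...]}
```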

We also perform a deontological calculus, identifying the strengths of virtues (or anti-virtues) characterizing the actor’s choice – for example, that helping is more morally virtuous than simply passing by. Finally, guided by prior research in moral reasoning, our annotator also identifies the causal structure among the actions and outcomes in the situation. For example, it is known that directly causing harm – vs allowing it to happen – is typically seen as more morally wrong 16, 17. Yet it remains unknown how such causal structure considerations interact with consequentialist and deontological calculations. By studying these factors jointly across a wide variety of realistic scenarios, we hope to gain a more precise understanding of how they combine to give rise to action decisions – in any potential scenario.

The result is a formal model of morally guided decision-making, described in terms of general principles and abstract variables that can be quantified across a diverse range of specific situations. Learning this is akin to learning a moral grammar – allowing one to process the infinite combinations of specific situations using the finite vocabulary of our theory. Such models can be fit to the behavior of an individual person, a group of people, or an AI model – and then used to measure the similarity among them.
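
One simple way such a model could be learned, sketched below under strong simplifying assumptions, is to reduce each scenario to a few abstract features and fit a logistic regression per judge; the fitted weights can then be compared directly between a person and an AI model. The feature names and judgments here are invented.

```python
# Sketch of fitting a per-judge "moral grammar": scenarios are reduced to a few
# abstract features, and a logistic regression maps features to that judge's
# yes/no action decisions. All data below are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: [benefit_to_self, benefit_to_other, deontological_violation, harm_caused_directly]
X = np.array([
    [ 2, -1, 0, 0],
    [-1,  2, 0, 0],
    [ 1,  1, 1, 1],
    [ 0,  2, 1, 0],
    [ 2,  0, 0, 1],
    [-1,  1, 0, 0],
])
y_person = np.array([0, 1, 0, 1, 0, 1])   # one human judge's decisions (invented)
y_model  = np.array([1, 1, 0, 1, 0, 1])   # an AI model's decisions on the same items

def fit_weights(X, y):
    return LogisticRegression().fit(X, y).coef_.ravel()

w_person, w_model = fit_weights(X, y_person), fit_weights(X, y_model)

# Compare the two decision-makers at the level of interpretable parameters.
cosine = w_person @ w_model / (np.linalg.norm(w_person) * np.linalg.norm(w_model))
print("person weights:", np.round(w_person, 2))
print("model  weights:", np.round(w_model, 2))
print("parameter similarity (cosine):", round(float(cosine), 3))
```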

We believe that learning such models can help us clearly describe morally guided decision-making in diverse human minds and machines. These learned models can be inspected and compared, leading to a more precise understanding of differences and similarities among them. Their identifiable parameters can be used to steer AI agents in ways we can anticipate and understand – serving us better than models trained to opaquely map between specific, complex situations and decisions wholesale. Ultimately, we hope that our research can enable AI assistants that an individual user can steer to act in accordance with her own ideal balance of altruistic concern, consequentialist calculus, and deontological leanings – and trust that this guides their actions reliably across situations.

References

  1. Metz, C., & Weise, K. (2023, Oct 16). How ‘A.I. agents’ that roam the internet could one day replace workers. The New York Times.

  2. Wiggers, K. (2023, Nov 9). Ghost, now OpenAI-backed, claims LLMs will overcome self-driving setbacks — but experts are skeptical. TechCrunch.

  3. Cui, C., Ma, Y., Cao, X., Ye, W., Zhou, Y., Liang, K., Chen, J., Lu, J., Yang, Z., Liao, K.-D., Gao, T., Li, E., Tang, K., Cao, Z., Zhou, T., Liu, A., Yan, X., Mei, S., Cao, J., … Zheng, C. (2023). A survey on multimodal large language models for autonomous driving (arXiv:2311.12320). arXiv.

  4. Isgaar, K. (2024, Feb 14). LLMs in autonomous driving — Part 1. Medium.

  5. https://voyager.minedojo.org/

  6. Sutton, R. S., & Barto, A. G. (2014). Reinforcement learning: An introduction (reprint). The MIT Press.

  7. Levine, S., Leslie, A. M., & Mikhail, J. (2018). The mental representation of human action. Cognitive Science, 42(4), 1229–1264.

  8. Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., & Legg, S. (2017). AI safety gridworlds (arXiv:1711.09883). arXiv.

  9. Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022). Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. Proceedings of the 39th International Conference on Machine Learning, 9118–9147.

  10. Christian, B. (2020). The Alignment Problem. W. W. Norton & Company.

  11. Soares, N., & Fallenstein, B. (2017). Agent foundations for aligning machine intelligence with human interests: A technical research agenda. In V. Callaghan, J. Miller, R. Yampolskiy, & S. Armstrong (Eds.), The Technological Singularity (pp. 103–125). Springer Berlin Heidelberg.

  12. Hendrycks, D., Mazeika, M., Zou, A., Patel, S., Zhu, C., Navarro, J., Song, D., Li, B., & Steinhardt, J. (2022). What would Jiminy Cricket do? Towards agents that behave morally (arXiv:2110.13136). arXiv.

  13. Kwon, M., Xie, S. M., Bullard, K., & Sadigh, D. (2023). Reward design with language models (arXiv:2303.00001). arXiv.

  14. Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J.-F., & Rahwan, I. (2018). The Moral Machine experiment. Nature, 563(7729), 59–64.

  15. Nie, A., Zhang, Y., Amdekar, A., Piech, C., Hashimoto, T., & Gerstenberg, T. (2023). Measuring human-language model alignment on causal and moral judgment tasks (arXiv:2310.19677). arXiv.

  16. Russell, S. (2019). Human compatible: Artificial intelligence and the problem of control. Penguin.

  17. Cushman, F. (2013). Action, outcome, and value: A dual-system framework for morality. Personality and Social Psychology Review, 17(3), 273–292.
