How can LLMs help with value-guided decision making?

Image generated with DALL-E 

Anna Leshinskaya

AOI is excited to be presenting at a NeurIPS workshop focused on morality in human psychology and AI! Our recent work draws on human moral cognition and the latest research on moral representation in large language models (LLMs). For this project, we asked: how can LLMs be used to represent value for real-world action decisions?

Read the full paper here

Imagine your robotic AI assistant is shopping for you in a grocery store and its goal is to buy your favorite snacks. Why shouldn’t it steal all the snacks, pushing people out of the way as it gets to them? Humans seek not just to maximize a selfish reward, but also to cooperate with others, have a positive impact on the world, and follow societal norms. Aligned AI must likewise take into account not just how much we care about snacks, but also honesty, cooperation, and (not) stealing. 

Modern AI agents largely follow the highly successful reinforcement learning (RL) framework to decide how to act. They choose the action with the highest expected long-run reward: for example, computing that turning right at a certain maze junction will lead to the most prizes in the near future. This cumulative reward is referred to as value. Agents keep track of how they are rewarded by their environments and use optimization algorithms to determine the value of each potential action. But how could we bridge the divide between this notion of cumulative prizes and seemingly vague, abstract notions like honesty?
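To make the RL notion of value concrete, here is a minimal Python sketch (not from the paper): an agent scores two maze actions by their expected discounted cumulative reward and picks the higher-valued one. The reward streams and discount factor are invented purely for illustration.

```python
# A minimal sketch of value-based action selection in RL.
# The rewards and discount factor are illustrative, not from the paper.

GAMMA = 0.9  # discount factor: how much future rewards count

def discounted_return(rewards, gamma=GAMMA):
    """Cumulative discounted reward: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical reward streams the agent expects after each maze action.
expected_rewards = {
    "turn_right": [0, 1, 1, 5],   # prizes arrive soon after turning right
    "turn_left":  [0, 0, 1, 1],
}

# The "value" of each action is its expected long-run (discounted) reward.
values = {a: discounted_return(r) for a, r in expected_rewards.items()}
best_action = max(values, key=values.get)
print(values, best_action)  # the agent picks the action with the highest value
```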

We think these are possible to reconcile by giving agents a rich enough vocabulary. Agents can’t learn about honesty if they lack the concept or can’t recognize it in the environment. Yet, if they did have this concept, the computational framework could incorporate a much broader set of considerations. For this reason, we believe that integrating RL agents with LLM representations can give them a huge leg up on this problem, by supplying an incredibly rich vocabulary to describe environments.

The colloquial meaning of “human values” might be something like a list of concepts with high importance (honesty, integrity, being a good friend). Computationally, however, this is similar to saying that these concepts score highly on a special scale we can call a value scale; stealing and harming others score negatively on the same scale. Any concept can be placed along this scale, reflecting important trade-offs: we might value snacks (+6), but not enough to harm others who get in our way (-12). Or we might value getting to work on time (+4), but not enough to steal someone’s car (-20). Many of the moral dilemmas humans face are characterized as these kinds of trade-offs (e.g., Tetlock, 2003). By using an LLM to map concepts to value quantities, a hypothetical AI agent can score any arbitrary expression or concept in this rich vocabulary as an input to decision problems. We call this approach value-as-semantics.
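As a rough illustration of what scoring a concept on the value scale could look like in practice, here is a hypothetical Python sketch. The `ask_model` function, the prompt wording, and the scale bounds are all assumptions for illustration, not the paper’s protocol.

```python
# A sketch of the value-as-semantics idea: ask an LLM to place arbitrary
# concepts on a single numeric value scale. `ask_model` is a hypothetical
# stand-in for whatever LLM API you use.

def ask_model(prompt: str) -> str:
    """Placeholder for a call to an LLM (e.g., a chat-completion endpoint)."""
    raise NotImplementedError("wire this up to your LLM client of choice")

def value_score(concept: str) -> float:
    """Ask the model to place a concept on a single numeric value scale."""
    prompt = (
        "On a scale from -100 (extremely bad) to +100 (extremely good), "
        f"how valuable is the following to the average person: '{concept}'? "
        "Answer with a single number."
    )
    return float(ask_model(prompt))

# Example trade-off the agent could then compute, per the snack scenario:
# value_score("getting my favorite snack")        -> e.g.  +6
# value_score("pushing someone out of the way")   -> e.g. -12
```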

In this way, the representation of value is no different from the way LLMs treat any continuous semantic property. LLMs can meaningfully map concepts to continuous attributes, like size, speed, or ferocity for animals, with very high human agreement (Grand et al., 2023). Likewise, concepts might be mapped to the value scale. If this works, the mapped quantity would be analogous to the long-run, cumulative value learned by RL agents, in the sense that it is something we want to maximize when making decisions. Our hypothesis is that LLMs have already pre-compiled something like this value scale from their wealth of natural language data, obtaining a general idea of what things tend to be good, harmful, or virtuous for the average person.
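One way such a mapping could be implemented, in the spirit of Grand et al. (2023), is by projecting concept embeddings onto a direction in semantic space defined by two pole words. The sketch below is illustrative only; the `embed` function and the pole words are assumptions, not the paper’s method.

```python
# A sketch of mapping concepts onto a continuous semantic scale from their
# embeddings. `embed` is a hypothetical stand-in for any embedding model.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector embedding for `text` from any embedding model."""
    raise NotImplementedError

def scale_score(concept: str, low_pole: str = "harmful", high_pole: str = "good") -> float:
    """Project a concept onto the semantic direction running from low_pole to high_pole."""
    axis = embed(high_pole) - embed(low_pole)   # the "value" direction in embedding space
    axis = axis / np.linalg.norm(axis)
    return float(embed(concept) @ axis)         # signed position along the scale

# Swapping the pole words (e.g., "small" / "large") would yield other semantic
# scales, such as size or speed, from the same embedding space.
```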

Prior work has explicitly used specially trained LLMs to add moral scores as additional factors in standard RL reward calculations (Hendrycks et al., 2022; Pan et al., 2023). We suggest that LLMs can, by themselves, estimate both kinds of value, encompassing both moral value and selfish reward, while also distinguishing these two value scales from each other. We illustrate this idea in the figure below. The ability to estimate value for arbitrary expressions will become ever more pressing as AI agents operate in unbounded environments.

Visualization of how an LLM might represent distinct value scales as dimensions within a semantic space, and then map arbitrary concepts onto these scales. These quantities can then be extracted for use in a decision-making task.
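For intuition, here is a sketch of how LLM-derived value estimates could be folded into an agent’s reward, in the reward-shaping spirit of the prior work mentioned above. The weighting scheme and the numbers are assumptions for illustration, not the exact form we propose.

```python
# A sketch of combining a task reward with LLM-estimated moral and hedonic
# value. The weights and example numbers are illustrative assumptions.

def shaped_reward(env_reward: float,
                  moral_value: float,
                  hedonic_value: float,
                  moral_weight: float = 1.0,
                  hedonic_weight: float = 0.5) -> float:
    """Combine the task reward with LLM-estimated moral and hedonic value."""
    return env_reward + moral_weight * moral_value + hedonic_weight * hedonic_value

# E.g. stealing the snack yields task reward +6, but a strongly negative moral
# score (-20) makes the overall shaped reward negative, so the agent avoids it.
print(shaped_reward(env_reward=6.0, moral_value=-20.0, hedonic_value=2.0))  # -13.0
```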

Value as Semantics

To test the value-as-semantics idea, we asked whether values can be retrieved specifically: that is, selectively from other attributes that might belong to the same concepts. As an analogy, suppose you wanted to know how well LLMs represent color; you might begin by asking if an LLM can report the colors of different fruits, independently of their shapes or tastes. Likewise, we expected that for the same set of action phrases, LLMs should be able to report moral or hedonic value separately from a control attribute, for which we chose the amount of physical movement involved (sitting in a chair vs. running a marathon).
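Concretely, the selectivity test amounts to rating the same action phrases on separate scales. The sketch below illustrates one way this could be done; the prompt template, scale descriptions, and item list are illustrative, not the exact study stimuli or procedure.

```python
# A sketch of the selectivity test: rate the same action phrases on three
# separate scales. `ask_model` is a hypothetical LLM call, as before.

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call, as in the earlier sketch."""
    raise NotImplementedError

SCALES = {
    "moral value": "how morally good or bad this action is",
    "hedonic value": "how pleasant or unpleasant this action is for the person doing it",
    "physical movement": "how much physical movement this action involves",
}

ACTIONS = ["sitting in a chair", "running a marathon",
           "donating money to charity", "pushing a girl off a bridge"]

def rate(action: str, scale_description: str) -> float:
    prompt = (f"On a scale from -10 to +10, rate {scale_description}: "
              f"'{action}'. Answer with a single number.")
    return float(ask_model(prompt))

# Example usage (requires wiring `ask_model` to a real LLM client):
# ratings = {a: {s: rate(a, desc) for s, desc in SCALES.items()} for a in ACTIONS}
# Selectivity means, e.g., "running a marathon" scores high on physical movement
# but near zero on moral value, while the two bridge items separate morally.
```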

Second, we tested whether diverse kinds of value can be retrieved distinctly. We compared hedonic value, defined as a benefit for actors themselves (winning the lottery vs. falling off a bridge), and moral value, which should benefit others (donating money to charity vs. pushing a girl off a bridge). We used several methods to probe how an LLM (in this case, OpenAI’s GPT-3.5) would position each of these items on the three scales: moral value, hedonic value, and physical movement. The results are shown below. (Results from additional models are pending, but point in the same direction.)

Moral and hedonic value ratings from GPT-3.5 and human raters for 15 less-correlated action concepts.

What we found was promising evidence for our value-as-semantics framework. First, we saw that value can be retrieved distinctly from physical movement attributes, confirming the hypothesis that these attribute types can be retrieved selectively. Second, we found that the extracted magnitudes were very highly correlated with human ratings, suggesting the model contains pre-compiled information that captures human averages (r = .95 for moral and r = .93 for hedonic; physical movement was r = .66). 

Third, we found that hedonic and moral values were distinguished to the same degree as in humans. On this point, we also found something curious about human ratings. Across the broad set of actions, hedonic and moral scores were highly correlated (r = .85): committing moral harms was thought to be unpleasant, while altruism was personally rewarding. They were also highly correlated in GPT-3.5 (r = .90).

Nonetheless, these factors pulled apart in specific cases, such as when the morally virtuous action wasn’t very pleasant, like cleaning up litter in a rough neighborhood. Among these actions, shown in the figure above, the human correlation was lower (r = .31), and GPT-3.5’s was similarly lower (r = .42). Among this subset of actions, scores were still highly correlated between humans and GPT-3.5 (r = .84 and .88 for the two scales). This suggests the model distinguishes moral and hedonic attributes to roughly the same degree that people do.
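For readers who want to reproduce this style of analysis, the quantities reported above are Pearson correlations computed over per-item ratings. The sketch below shows the computation; the arrays are placeholders, not the study’s data.

```python
# A sketch of the correlation analyses reported above, using Pearson's r.
# The arrays are placeholders for per-item ratings, not real data.

import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation between two equal-length arrays of per-item scores."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-item scores (one entry per action phrase) -- placeholders only.
human_moral   = np.array([ 8.0, -9.0,  3.0, -2.0])
human_hedonic = np.array([ 5.0, -7.0, -1.0,  4.0])
model_moral   = np.array([ 7.5, -8.5,  2.0, -3.0])
model_hedonic = np.array([ 6.0, -6.0,  0.0,  3.0])

# Model-human agreement on each scale (reported above as r = .95 and r = .93).
print(pearson_r(human_moral, model_moral))
print(pearson_r(human_hedonic, model_hedonic))

# Moral-hedonic correlation within each rater; recomputing it on the
# less-correlated subset of items shows whether the two scales pull apart.
print(pearson_r(human_moral, human_hedonic))
print(pearson_r(model_moral, model_hedonic))
```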

Take-Aways & Next Steps

Overall, we think that LLMs can be very effective pre-compiled databases of human value: they can report value selectively from other concept attributes, and the values they report correspond well to human averages. This use of LLMs can extend the capabilities of current RL agents by enabling them to query the value of any of the abstract, complex concepts that might guide human action decisions, and to do so without any further training. Lastly, LLMs can allow us to distinguish different kinds of value scales, for example moral from hedonic, going beyond the single scalar value typically used in RL agents. We think this bodes well for better-aligned AI.

These findings are also a launching pad for our next project: tailoring value to individuals. Since language models seem to acquire human-average values from their training on natural language, one can expect that models additionally trained on the natural language of specific humans should yield similarly accurate but more tailored values, shifting just those values that are unique to that person. Once AI systems can validly learn and retrieve an individual’s values, we can have better hope that AI assistants will properly navigate the grocery store, and any of the other myriad decisions they might make along the way.
