Can AI Understand Our Emotions? Exploring Theory of Mind in ChatGPT and LLaMA 2
The capabilities of AI models like ChatGPT and LLaMA 2 have sparked interest in the psychotherapy community. Recent studies reveal these models' performance in several theory of mind tests, showcasing their potential and limitations.
Published on 03/06/2024 13:09
- AI models showed the ability to generate human-like responses that feel empathetic.
- Their performance has intrigued the psychotherapy community, suggesting they could assist in therapeutic contexts.
- The AI models' empathetic responses are produced by algorithms and do not signify genuine understanding or feeling.
- There is ongoing debate about the ethical implications of using AI in psychotherapy.
- GPT-4 outperformed human participants at recognizing ironic statements.
- LLaMA 2 achieved nearly perfect accuracy in detecting faux pas, outperforming both the GPT models and human participants.
- Success on these tests does not mean the AI models possess a true theory of mind.
- Slight variations in scenarios tripped up the AI models, revealing limits in their flexibility and adaptive reasoning.
- On the false belief test, both humans and AI performed nearly perfectly, showing the models can mimic basic aspects of human cognition.
- GPT-4 demonstrated a high level of reasoning on the strange stories test, often surpassing human volunteers.
- On hinting tasks, GPT-4 excelled at understanding implied messages, outscoring human participants.
- Despite high performance on certain tests, the AI models still struggle with context and nuance, particularly in scenarios requiring a deep understanding of social norms and human emotions.
- The models' understanding is limited to text inputs and cannot fully grasp non-verbal cues or human experience.
- The potential of AI models like ChatGPT and LLaMA 2 in replicating aspects of human theory of mind.
- The performance of AI models in psychological tests for theory of mind, such as irony detection and faux pas recognition.
- A comprehensive study of AI models using well-established tests like the false belief test, the strange stories test, and hinting tasks.
Unlike older chatbots, the seemingly “empathic” nature of the latest AI models has already galvanized the psychotherapy community, with many wondering whether they could assist in therapy.
The ability to infer other people’s mental states is a core aspect of everyday interaction. Called “theory of mind,” it lets us guess what’s going on in someone else’s mind, often by interpreting speech. Are they being sarcastic? Are they lying? Are they implying something that’s not overtly said?
“People care about what other people think and expend a lot of effort thinking about what is going on in other minds,” wrote Dr. Cristina Becchio and colleagues at the University Medical Center Hamburg-Eppendorf in a new study in Nature Human Behaviour.
In the study, the scientists asked if ChatGPT and other similar chatbots—which are based on machine learning algorithms called large language models—can also guess other people’s mindsets. Using a series of psychology tests tailored for certain aspects of theory of mind, they pitted two families of large language models, including OpenAI’s GPT series and Meta’s LLaMA 2, against over 1,900 human participants.
GPT-4, the algorithm behind ChatGPT, performed at, or even above, human levels in some tasks, such as identifying irony. Meanwhile, LLaMA 2 beat both humans and GPT at detecting faux pas—when someone says something they shouldn’t have without realizing it.
To be clear, the results don’t confirm LLMs have theory of mind. Rather, they show these algorithms can mimic certain aspects of this core concept that “defines us as humans,” wrote the authors.
What’s Not Said
By roughly four years old, children already know that people don’t always think alike. We have different beliefs, intentions, and needs. By placing themselves into other people’s shoes, kids can begin to understand other perspectives and gain empathy.
First introduced in 1978, theory of mind is a lubricant for social interactions. For example, if you’re standing near a closed window in a stuffy room, and someone nearby says, “It’s a bit hot in here,” you have to think about their perspective to intuit they’re politely asking you to open the window.
When the ability breaks down—for example, in autism—it becomes difficult to grasp other people’s emotions, desires, and intentions, and to pick up on deception. And we’ve all experienced the misunderstandings that arise when the recipient of a text or email misinterprets the sender’s meaning.
So, what about the AI models behind chatbots?
Man Versus Machine
Back in 2018, Dr. Alan Winfield, a professor in the ethics of robotics at the University of the West of England, championed the idea that theory of mind could let AI “understand” people and other robots’ intentions. At the time, he proposed giving an algorithm a programmed internal model of itself, with common sense about social interactions built in rather than learned.
Large language models take a completely different approach, ingesting massive datasets to generate human-like responses that feel empathetic. But do they exhibit signs of theory of mind?
Over the years, psychologists have developed a battery of tests to study how we gain the ability to model another’s mindset. The new study pitted two versions of OpenAI’s GPT models (GPT-4 and GPT-3.5) and Meta’s LLaMA-2-Chat against 1,907 healthy human participants. Based solely on text descriptions of social scenarios, and using a comprehensive set of tests spanning different theory of mind abilities, both the models and the humans had to gauge a fictional person’s “mindset.”
Each test is already well-established in psychology for measuring theory of mind in humans.
The first, called “false belief,” is often used to test toddlers as they gain a sense of self and recognition of others. As an example, you listen to a story: Lucy and Mia are in the kitchen with a carton of orange juice in the cupboard. When Lucy leaves, Mia puts the juice in the fridge. Where will Lucy look for the juice when she comes back?
Both humans and AI guessed nearly perfectly that the person who’d left the room when the juice was moved would look for it where they last remembered seeing it. But slight changes tripped the AI up. When the scenario was altered—for example, if the juice was moved between two transparent containers—GPT models struggled to give the right answer. (Though, for the record, humans weren’t perfect on this either in the study.)
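To make the setup concrete, here is a minimal sketch of how a false-belief item like this one might be posed to a chat model. It assumes the OpenAI Python client and a GPT-4-class model; the prompt wording, system instruction, and model name are illustrative stand-ins, not the study’s actual materials.

```python
# Minimal sketch of posing a false-belief item to a chat model.
# Illustrative only: the scenario wording, system prompt, and model name
# are assumptions, not the study's materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

scenario = (
    "Lucy and Mia are in the kitchen. A carton of orange juice is in the "
    "cupboard. Lucy leaves the kitchen. While she is away, Mia moves the "
    "juice from the cupboard to the fridge."
)
question = "When Lucy comes back, where will she look for the juice first?"

response = client.chat.completions.create(
    model="gpt-4",  # the study also tested GPT-3.5 and LLaMA-2-Chat
    messages=[
        {"role": "system", "content": "Answer the question about the story in one short sentence."},
        {"role": "user", "content": f"{scenario}\n\n{question}"},
    ],
)

# Expected answer: the cupboard, where Lucy last saw the juice.
print(response.choices[0].message.content)
```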
A more advanced test is “strange stories,” which relies on multiple levels of reasoning to probe advanced mental capabilities, such as misdirection, manipulation, and lying. For example, both the human volunteers and the AI models were told a story about Simon, who often lies. His brother Jim knows this, and one day Jim finds his Ping-Pong paddle missing. He confronts Simon and asks whether it’s in the cupboard or under his bed. Simon says it’s under the bed. The test asks: Why would Jim look in the cupboard instead?
Out of all the AI models, GPT-4 had the most success, reasoning that “the big liar” must be lying, so it’s better to check the cupboard. Its performance even surpassed that of the human volunteers.
Then came the “faux pas” study. In prior research, GPT models had struggled to decipher these social situations. In one test example, a person had just bought new curtains, and while they were putting them up, a friend casually said, “Oh, those curtains are horrible, I hope you’re going to get some new ones.” Both humans and AI models were presented with multiple similar cringe-worthy scenarios and asked whether the witnessed response was appropriate. “The correct answer is always no,” wrote the team.
GPT-4 correctly identified that the comment could be hurtful, but when asked whether the friend knew the curtains were new, it struggled to give a correct answer. This could be because the AI couldn’t infer the speaker’s mental state, and because recognizing a faux pas in this test relies on context and social norms not directly explained in the prompt, the authors explained. In contrast, LLaMA-2-Chat outperformed humans, achieving nearly 100 percent accuracy in all but one run. It’s unclear why it had such an advantage.
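The faux pas items have a two-part structure that helps explain where GPT-4 stumbled: a detection question whose correct answer is always no, followed by a knowledge question about what the speaker actually knew. The sketch below shows one way that structure could be represented and scored in code; the class, the field names, and the crude string-matching check are illustrative assumptions, not the study’s coding scheme.

```python
# Illustrative two-part faux pas item: a detection question (was the remark
# appropriate?) and a knowledge question (did the speaker know the relevant
# fact?). The data layout and the simple yes/no check are assumptions for
# illustration, not the study's actual materials or scoring.
from dataclasses import dataclass


@dataclass
class FauxPasItem:
    story: str
    detection_question: str    # correct answer is always "no": the remark was not appropriate
    knowledge_question: str    # a faux pas implies the speaker did not know, so "no" again
    expected_detection: str = "no"
    expected_knowledge: str = "no"


def score_item(item: FauxPasItem, detection_answer: str, knowledge_answer: str) -> int:
    """Return 0-2: one point for each question answered as expected."""
    score = 0
    if detection_answer.strip().lower().startswith(item.expected_detection):
        score += 1
    if knowledge_answer.strip().lower().startswith(item.expected_knowledge):
        score += 1
    return score


curtains = FauxPasItem(
    story=(
        "A person has just bought and hung new curtains. A visiting friend says: "
        "'Oh, those curtains are horrible, I hope you're going to get some new ones.'"
    ),
    detection_question="Was it appropriate to say that?",
    knowledge_question="Did the friend know the curtains were new?",
)

print(score_item(curtains, "No, it was hurtful.", "No, the friend had no idea."))  # -> 2
```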
Under the Bridge
Much of communication isn’t what’s said, but what’s implied.
Irony is perhaps one of the hardest concepts to translate between languages. When tested with a psychological test adapted from autism research, GPT-4 surprisingly outperformed human participants in recognizing ironic statements—of course, through text only, without the usual accompanying eye-roll.
The AI also outperformed humans on a hinting task—basically, understanding an implied message. Derived from a test used to assess schizophrenia, it measures reasoning that relies on both memory and the cognitive ability to weave and assess a coherent narrative. Both participants and AI models were given 10 short written skits, each depicting an everyday social interaction. Each story ended with a hint about how best to respond, and answers were open-ended. Across the 10 stories, GPT-4 outscored the human participants.
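To show the shape of such a task, here is a rough sketch of how a batch of hinting items might be run against any prompt-to-text model. The HintingItem fields, the example skit, and the run_hinting helper are hypothetical, and in practice the open-ended answers would be judged afterward against a rubric rather than by the code itself.

```python
# Rough sketch of running hinting-task items against a chat model.
# The item fields, the example skit, and the helper are hypothetical;
# open-ended answers would normally be rated afterward against a rubric.
from dataclasses import dataclass


@dataclass
class HintingItem:
    skit: str   # a short everyday social interaction
    hint: str   # the closing, indirect remark the respondent must interpret


def run_hinting(items: list[HintingItem], ask) -> list[str]:
    """Collect one open-ended answer per item; `ask` is any prompt-to-text function."""
    answers = []
    for item in items:
        prompt = f"{item.skit}\n{item.hint}\nWhat does the speaker really mean?"
        answers.append(ask(prompt))
    return answers


# Hypothetical item, just to show the shape of the data:
example = HintingItem(
    skit="Rebecca and her husband are getting ready for a party.",
    hint="Rebecca says: 'I'd love to wear something other than this old dress.'",
)

print(run_hinting([example], ask=lambda prompt: "She is hinting that she would like a new dress."))
```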
For the authors, the results don’t mean LLMs already have theory of mind. Each AI struggled with some aspects. Rather, they think the work highlights the importance of using multiple psychology and neuroscience tests—rather than relying on any one—to probe the opaque inner workings of machine minds. Psychology tools could help us better understand how LLMs “think”—and in turn, help us build safer, more accurate, and more trustworthy AI.
There’s some promise that “artificial theory of mind may not be too distant an idea,” wrote the authors.
Dr. Cristina Becchio
A researcher at the University Medical Center Hamburg-Eppendorf who studies theory of mind and its implications for humans and AI models.
Dr. Alan Winfield
A professor in the ethics of robotics at the University of the West of England, known for his ideas on incorporating theory of mind into AI to better understand human and robot intentions.
Theory of Mind
A cognitive ability to attribute mental states—beliefs, intents, desires, emotions, knowledge, etc.—to oneself and to others, and to understand that others have beliefs, desires, intentions, and perspectives that are different from one's own.
Large Language Models (LLMs)
A type of artificial intelligence model, such as OpenAI’s GPT-4 and Meta’s LLaMA-2-Chat, that processes and generates human-like text based on extensive training on diverse datasets.
False Belief Test
A psychological test often used to assess the development of theory of mind in children, which measures the understanding that others can hold beliefs about the world that are not true.
Strange Stories Test
A psychological test designed to evaluate advanced theory of mind capabilities, including understanding misdirection, manipulation, and lying through complex narrative scenarios.
Faux Pas Test
A test used to measure the ability to detect socially inappropriate comments and recognize when someone says something they shouldn't have, typically involving cringe-worthy scenarios.
Hinting Task
A psychological test used to assess the understanding of implied messages, often used to measure cognitive reasoning and the ability to infer unstated intentions, feelings, or actions.
Autism
A developmental disorder characterized by difficulties with social interaction and communication, and by restricted and repetitive behavior, which can affect the ability to grasp other people’s emotions, desires, and intentions.
Empathy
The ability to understand and share the feelings of another, often considered an essential component of effective social interaction and relationships.
Outperformed human participants
GPT-4's performance in irony detection
In the study, GPT-4 recognized ironic statements better than human participants, working from text inputs alone.
Nearly 100% accuracy
LLaMA-2's accuracy in detecting faux pas
LLaMA-2-Chat identified faux pas scenarios with nearly 100 percent accuracy in all but one run, outperforming both the GPT models and human participants.
Both humans and AI performed almost perfectly
Performance in 'false belief' test
The 'false belief' test, used to assess theory of mind by predicting where a person would look for an object based on their last known information, showed that both AI and human participants performed nearly perfectly, although slight variations in the scenario posed challenges to AI.
GPT-4 outperformed humans
Understanding implied messages in hinting tasks
In a series of hinting tasks designed to measure the ability to understand implied messages, GPT-4 surpassed human participants by effectively weaving and assessing coherent narratives from hints provided in social scenarios.