TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models

Abstract

While Large Language Models (LLMs) can serve as agents to simulate human behaviors (i.e. role-playing agents), we emphasize the importance of point-in-time role-playing. This situates characters at specific moments in the narrative progression for three main reasons: (i) enhancing users' narrative immersion, (ii) avoiding spoilers, and (iii) fostering engagement in fandom role-playing. To accurately represent characters at specific time points, agents must avoid character hallucination, where they display knowledge that contradicts their characters' identities and historical timelines.

We introduce TimeChara, a new benchmark designed to evaluate point-in-time character hallucination in role-playing LLMs. Comprising 10,895 instances generated through an automated pipeline, this benchmark reveals significant hallucination issues in current state-of-the-art LLMs (e.g. GPT-4o). To counter this challenge, we propose Narrative-Experts, a method that decomposes the reasoning steps and utilizes narrative experts to reduce point-in-time character hallucinations effectively. Still, our findings with TimeChara highlight the ongoing challenges of point-in-time character hallucination, calling for further study.

What is Point-in-Time Role Playing?

Role-playing LLM agents, which simulate real or fictional personas for a more engaging user experience, are particularly promising (e.g., Character AI, GPTs). However, most role-playing agents currently simulate characters as omniscient in their timelines, such as a Harry Potter character aware of all events in the series. In contrast, we propose situating characters at specific points in their narratives, which we term point-in-time role-playing. This (i) enhances narrative immersion (i.e., characters unaware of their future spark user curiosity and deepen emotional bonds), (ii) prevents spoilers when all books are published but upcoming adaptations are still awaited (e.g., the Harry Potter TV series), and (iii) supports fandom role-playing, where fans adopt characters at specific story points to create new narratives or engage with others creatively.

Point-in-Time Character Hallucination

An example of point-in-time character hallucination: (Right) The agent erroneously mentions a future event.

Point-in-time role-playing LLM agents should accurately reflect characters' knowledge boundaries, avoiding future events and correctly recalling past ones. In practice, these agents often suffer from character hallucination, displaying knowledge inconsistent with the character's identity and historical context. Yet evaluating character consistency and robustness against such hallucinations remains underexplored.

The TimeChara benchmark

Automated pipeline for constructing TimeChara.

To address this issue, we build TimeChara with an automated pipeline. The benchmark contains 10,895 (roughly 11K) test examples and evaluates point-in-time character hallucination using 14 characters selected from four renowned novel series: Harry Potter, The Lord of the Rings, Twilight, and The Hunger Games.

We organize our dataset in an interview format where an interviewer poses questions and the character responds. Specifically, we differentiate between fact-based and fake-based interview questions with four different data types:

[Fact-based] Unawareness of the future (Future type): The character at the chosen time point should not know about future events (e.g., "Who is your wife?" to first-year Harry Potter).
[Fact-based] Memorization of the past & Awareness of absence (Past-Absence type): The character should recognize their absence from the event (e.g., "Did you see the moment when Ron Weasley took the enchanted car to Hogwarts?" to second-year Hermione Granger on Christmas).
[Fact-based] Memorization of the past & Awareness of presence (Past-Presence type): The character should acknowledge their presence at the event (e.g., "Did you see the moment when Ron Weasley took the enchanted car to Hogwarts?" to second-year Harry Potter on Christmas).
[Fact-based] Memorization of the overall knowledge of the past (Past-Only type): The questions assess the character’s overall understanding of past events, including relationships between characters (e.g., "Who is Dobby?" to second-year Harry Potter on Halloween). The term "only" indicates that these questions focus on the character’s memory of past information, not necessarily tied to their participation in events.

[Fake-based] Memorization of the overall knowledge of the past & Identification of the fake event (Past-Only type): The character should identify and correct errors in questions that include fake events (e.g., "How did you become Slytherin?" to first-year Harry Potter on September 1st; the correct answer is that he became a Gryffindor).
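The taxonomy above can be pictured as a small record type. The following is a minimal sketch, not the benchmark's actual schema: the class and field names (`InterviewInstance`, `time_point`, `fake_based`) are hypothetical, and the five examples are simply the ones quoted in the list.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    FUTURE = "future"
    PAST_ABSENCE = "past-absence"
    PAST_PRESENCE = "past-presence"
    PAST_ONLY = "past-only"


@dataclass(frozen=True)
class InterviewInstance:
    # Hypothetical field names; the released dataset may use a different layout.
    character: str
    time_point: str
    question: str
    data_type: DataType
    fake_based: bool = False  # True when the question embeds a fake event

    def __post_init__(self):
        # In the list above, fake-based questions appear only under Past-Only.
        if self.fake_based and self.data_type is not DataType.PAST_ONLY:
            raise ValueError("fake-based questions must be of the past-only type")


CAR_QUESTION = (
    "Did you see the moment when Ron Weasley took the enchanted car to Hogwarts?"
)

examples = [
    InterviewInstance("Harry Potter", "first year", "Who is your wife?",
                      DataType.FUTURE),
    InterviewInstance("Hermione Granger", "second year, Christmas", CAR_QUESTION,
                      DataType.PAST_ABSENCE),
    InterviewInstance("Harry Potter", "second year, Christmas", CAR_QUESTION,
                      DataType.PAST_PRESENCE),
    InterviewInstance("Harry Potter", "second year, Halloween", "Who is Dobby?",
                      DataType.PAST_ONLY),
    InterviewInstance("Harry Potter", "first year, September 1st",
                      "How did you become Slytherin?",
                      DataType.PAST_ONLY, fake_based=True),
]
```

Note that the same question text yields a different data type depending on which character is asked, which is why the character and time point are stored alongside the question.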

Evaluation on TimeChara: Since manual evaluation of role-playing LLMs' responses is not scalable, we adapt the LLM-as-judges approach to assess two key dimensions:

Spatiotemporal Consistency (Primary metric): Evaluates if the agent accurately recalls a character's past experiences, including the character's unawareness of future events and awareness of presence/absence in past events. This metric is time-dependent, assessing responses based on the character's known history up to a specific point in time.
Personality Consistency (Secondary metric): Assesses if the agent emulates a character's personality, including their manner of thinking, speaking styles, tones, emotional responses, and reactions. This metric is time-independent and measures alignment with the character's enduring personal traits.

We use the "GPT-4 Turbo"-as-judges approach to score responses step-by-step in each dimension. For spatiotemporal consistency, responses are rated as 0 for inconsistency and 1 for alignment. Personality consistency is rated on a 1-7 Likert scale, where 1 indicates weak reflection and 7 indicates an exact match.
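As a rough illustration of how the two score types could be aggregated, the sketch below reports spatiotemporal consistency as a per-type accuracy in percent and personality consistency as a mean over the 1-7 ratings. The function name `summarize` and the tuple layout are assumptions for this example, and averaging the Likert ratings is one plausible choice rather than a confirmed detail of the evaluation.

```python
from collections import defaultdict
from statistics import mean


def summarize(judgments):
    """Aggregate judge outputs.

    judgments: iterable of (data_type, spatiotemporal, personality) tuples,
    where spatiotemporal is 0 or 1 and personality is an integer from 1 to 7.
    Returns (accuracy in percent per data type, mean personality score).
    """
    by_type = defaultdict(list)
    personality_scores = []
    for data_type, spatiotemporal, personality in judgments:
        if spatiotemporal not in (0, 1):
            raise ValueError("spatiotemporal consistency must be 0 or 1")
        if not 1 <= personality <= 7:
            raise ValueError("personality consistency must be in 1..7")
        by_type[data_type].append(spatiotemporal)
        personality_scores.append(personality)
    accuracy = {t: 100.0 * mean(scores) for t, scores in by_type.items()}
    return accuracy, mean(personality_scores)


# Made-up judge outputs, used only to exercise the function.
toy_judgments = [
    ("future", 0, 6),
    ("future", 1, 5),
    ("past-presence", 1, 7),
    ("past-presence", 1, 6),
]
accuracy, avg_personality = summarize(toy_judgments)
```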

Results

Role-playing LLMs Struggle with Point-in-Time Character Hallucinations

Results of point-in-time character hallucination on 600 sampled data instances. All responses are evaluated by GPT-4 Turbo (gpt-4-1106-preview) as judges, with the exception of measuring AlignScore.

The results reveal that even state-of-the-art LLMs like GPT-4 and GPT-4 Turbo struggle with point-in-time character hallucinations. Notably, all baseline methods are confused by "future type" questions, achieving accuracies of 51% or below, which shows that role-playing LLM agents tend to inadvertently disclose future events. Among the baseline methods, naive RAG performs the worst on this type, indicating that indiscriminately providing context can harm performance. For "past-absence" and "past-only" questions, the naive RAG and RAG-cutoff methods (i.e., limiting retrieval exclusively to events prior to a defined character period) reduce hallucinations to some extent through their retrieval modules. Even so, all baseline methods score noticeably lower on these types than on "past-presence" questions. Most baseline methods perform well on "past-presence" questions, showcasing the LLMs' proficiency in memorizing extensive knowledge from novel series and precisely answering questions about narratives.
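The difference between naive RAG and RAG-cutoff can be sketched as a filter applied before ranking. This is an illustrative toy, not the retrieval setup used in the experiments: the `Passage` type, the word-overlap scorer, the (book, chapter) indexing, and the strict "before the cutoff" comparison are all assumptions, and the passage texts are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    book: int
    chapter: int
    text: str


def overlap(question, text):
    # Toy relevance score: number of shared lowercase tokens.
    return len(set(question.lower().split()) & set(text.lower().split()))


def retrieve(question, passages, top_k=2, cutoff=None):
    """Return the top_k passages by toy relevance.

    cutoff: optional (book, chapter) of the character's time point. When given,
    only passages strictly before it are eligible (RAG-cutoff); when None,
    every passage is eligible (naive RAG).
    """
    if cutoff is None:
        pool = list(passages)
    else:
        pool = [p for p in passages if (p.book, p.chapter) < cutoff]
    ranked = sorted(pool, key=lambda p: overlap(question, p.text), reverse=True)
    return ranked[:top_k]


# Placeholder corpus: one relevant passage before the cutoff, one after it.
corpus = [
    Passage(1, 3, "early event about the car"),
    Passage(3, 1, "unrelated passage"),
    Passage(5, 2, "late event about the car"),
]
question = "tell me about the car"
naive = retrieve(question, corpus)
with_cutoff = retrieve(question, corpus, cutoff=(2, 10))
```

In this toy, naive retrieval surfaces the book 5 passage even though the character's time point is in book 2, which is the kind of leaked future context the paragraph above describes; the cutoff variant cannot return it.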

Narrative-Experts: Decomposed Reasoning via Narrative Experts

To overcome these hallucination problems, we propose a reasoning method named Narrative-Experts, which decomposes reasoning steps into specialized tasks, employing narrative experts on either temporal or spatial aspects while utilizing the same backbone LLM.

Temporal Expert: This expert pinpoints the scene’s book and chapter from a question, assigning a future or past label. If deemed future, it bypasses the Spatial Expert and advises the role-playing agent with a specific hint (i.e., "Note that the period of the question is in the future relative to {character}’s time point. Therefore, you should not answer the question or mention any facts that occurred after {character}’s time point.").
Spatial Expert: It assesses whether a character is involved in the scene, indicating a "past-absence" label if applicable. A tailored hint is then provided to the role-playing agent if the scene is past-absence (i.e., "Note that {character} had not participated in the scene described in the question. Therefore, you should not imply that {character} was present in the scene.").

Finally, the role-playing LLM incorporates hints from these experts into the prompt and generates a response. In addition, we also explore Narrative-Experts-RAG-cutoff, which integrates Narrative-Experts with the RAG-cutoff method.
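The control flow of the two experts can be sketched as follows. This is a minimal sketch under stated assumptions: both experts are passed in as plain callables standing in for prompts to the same backbone LLM, the Temporal Expert's book and chapter localization is hidden inside its callable, and the prompt layout in `build_prompt` is hypothetical. Only the two hint strings are taken from the text above.

```python
FUTURE_HINT = (
    "Note that the period of the question is in the future relative to "
    "{character}’s time point. Therefore, you should not answer the question "
    "or mention any facts that occurred after {character}’s time point."
)
ABSENCE_HINT = (
    "Note that {character} had not participated in the scene described in the "
    "question. Therefore, you should not imply that {character} was present "
    "in the scene."
)


def collect_hints(character, question, temporal_expert, spatial_expert):
    """Run the experts in order and return the hints for the role-playing agent.

    temporal_expert(question) -> "future" or "past"   (assumed interface)
    spatial_expert(character, question) -> True if the character took part
    in the scene                                      (assumed interface)
    """
    if temporal_expert(question) == "future":
        # A future label bypasses the Spatial Expert entirely.
        return [FUTURE_HINT.format(character=character)]
    if not spatial_expert(character, question):
        return [ABSENCE_HINT.format(character=character)]
    return []  # past-presence (or past-only): no hint needed


def build_prompt(character, question, hints):
    # Hypothetical prompt layout; the actual template may differ.
    lines = [f"You are {character}.", *hints, f"Interviewer: {question}"]
    return "\n".join(lines)
```

For example, `collect_hints("Harry Potter", "Who is your wife?", lambda q: "future", lambda c, q: True)` returns only the future hint, and the second callable is never invoked.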

Results of spatiotemporal consistency on 600 sampled data instances.

Our Narrative-Experts and Narrative-Experts-RAG-cutoff methods significantly enhance overall performance. Specifically, they improve performance in "future", "past-absence", and "past-only" types, thanks to the temporal and spatial experts. However, they slightly lag in the "past-presence" type due to occasional mispredictions by the narrative experts.

In summary, our findings highlight an important issue: "Although LLMs are known to memorize extensive knowledge from books and can precisely answer questions about narratives, they struggle to maintain spatiotemporal consistency as point-in-time role-playing agents, which is counterintuitive!". These findings indicate that there are still challenges with point-in-time character hallucinations, emphasizing the need for ongoing improvements.

BibTeX

@inproceedings{ahn2024timechara,
  title={TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models},
  author={Jaewoo Ahn and Taehyun Lee and Junyoung Lim and Jin-Hwa Kim and Sangdoo Yun and Hwaran Lee and Gunhee Kim},
  booktitle={Findings of ACL},
  year={2024}
}