FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

¹Seoul National University  ²KRAFTON  ³Georgia Institute of Technology
* Equal contribution
EMNLP 2025 Main

"Can Vision-Language Models play adventure games like Humans?"

(room escape, mystery, visual novel, simulation, etc.)

Abstract

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions.

Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion, and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information.

We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework that leverages long-term clue memory to better plan and solve sequential tasks. Experiments show that current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked gap between humans and the best-performing agents warrants continued research to narrow this divide.

Figure 1: FlashAdventure consists of 34 Flash-based classic adventure games and supports automatic evaluation of the GUI agent using CUA-as-a-Judge.

Game Collection

Our benchmark includes 34 carefully selected Flash-based adventure games across 5 subgenres:

  • Sherlock Holmes: The Tea Shop Murder Mystery
  • Sherlock Holmes 2
  • Vortex Point 1
  • Vortex Point 2
  • Vortex Point 3
  • Pierre Hotel
  • Small Town Detective
  • Dakota Winchester's Adventures
  • Saucy Devil Gordon
  • Ray and Cooper 2
  • Nick Bounty: A Case of the Crabs
  • Grim Tales: The Bride
  • Grim Tales: The Legacy Collector's Edition
  • Computer Office Escape
  • Crimson Room
  • Camping Room Escape
  • Chemical Room Escape
  • Space Museum Escape
  • Vending Machine Escape
  • Wood Workshop Escape
  • Geometric Room Escape
  • Game Cafe Escape
  • Machine Room Escape
  • Video Studio Escape
  • Design House Escape
  • Paint Room Escape
  • Mirror Room Escape
  • Elevator Room Escape
  • Pico Sim Date
  • Festival Days Sim Date
  • Kingdom Days
  • Idol Days Sim Date
  • Community College Sim
  • Sort the Court

Key Challenge: Observation-Behavior Gap

A critical challenge in FlashAdventure is the observation-behavior gap, which refers to the time lag between when an agent observes information and when it can act upon it. Unlike prior benchmarks that focus on short-term objectives or include short story arcs, FlashAdventure emphasizes completion of full story arcs involving long-term objectives.

Observation-Behavior Gap comparison
Figure 2: Comparison of gameplay progression across benchmarks. FlashAdventure requires agents to manage long-term time lags, such as interrogating a suspect and later discovering their innocence, demonstrating the importance of bridging the observation-behavior gap.

Adventure games require agents to manage long-term dependencies that are crucial for solving full story arcs. Tolman's theory of latent learning suggests that humans can retrieve and apply clues after a long delay; FlashAdventure makes it possible to test whether agents exhibit similar behavior.
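To make the gap concrete, it can be measured as the number of steps between an agent's first observation of a clue and its first use of that clue. The sketch below is illustrative only (the event-log format is our assumption, not part of the benchmark):

# Illustrative only: measure the observation-behavior gap per clue from a
# gameplay event log of (step, kind, clue_id) triples, kind in {"observe", "use"}.
def observation_behavior_gap(events):
    first_seen, gaps = {}, {}
    for step, kind, clue in events:
        if kind == "observe":
            first_seen.setdefault(clue, step)      # remember first sighting
        elif kind == "use" and clue in first_seen and clue not in gaps:
            gaps[clue] = step - first_seen[clue]   # delay until first use
    return gaps

# e.g., a safe code seen at step 12 but only entered at step 250 -> gap of 238
print(observation_behavior_gap([(12, "observe", "safe_code"), (250, "use", "safe_code")]))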

CUA-as-a-Judge

Automatic Evaluation Framework

CUA-as-a-Judge acts as an oracle with access to predefined success milestones for each game. It actively interacts with the game environment to verify whether milestones have been achieved.

After a game agent finishes gameplay, CUA-as-a-Judge resumes from the game's final state and executes actions to check milestone completion, simulating a human judging process.
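A minimal sketch of this judging loop, assuming a hypothetical Milestone abstraction with scripted verification actions (not the released evaluator):

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Milestone:
    name: str
    verify: Callable[[], bool]  # replays checking actions against the final game state

def judge_game(milestones: List[Milestone]) -> float:
    """Resume from the game's final state and check each predefined milestone."""
    completed = sum(1 for m in milestones if m.verify())
    return 100.0 * completed / len(milestones)  # milestone completion rate (%)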

Validation Results (vs. Human)

Metric | Value
Accuracy | 94.00%
Spearman Correlation | 0.9912
Pearson Correlation | 0.9999
Total Samples | 300
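For reference, agreement numbers of this kind can be computed with scipy; the scores below are placeholders, not the paper's data:

from scipy.stats import pearsonr, spearmanr

human = [1.00, 0.75, 0.50, 0.00, 1.00]  # placeholder human-judged milestone scores
cua   = [1.00, 0.75, 0.50, 0.25, 1.00]  # placeholder CUA-as-a-Judge scores

accuracy = sum(h == c for h, c in zip(human, cua)) / len(human)
print(accuracy, spearmanr(human, cua).statistic, pearsonr(human, cua).statistic)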

COAST Framework

COAST (Clue-Oriented Agent for Sequential Tasks) addresses the observation-behavior gap through a Seek-Map-Solve cycle, sketched in code after Figure 3 below:

1. Clue Seeking

Explores the environment for N_seek steps to collect potential clues

2. Clue-Observation Mapping

Analyzes memory to identify promising clue-observation pairs

3. Problem Solving

Executes the proposed subtasks for N_solve steps

COAST Algorithm
Figure 3: COAST Framework with Seek-Map-Solve Cycle
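The cycle can be summarized in a short sketch; the env and agent interfaces here are assumptions for illustration, not the released implementation:

def coast(env, agent, n_seek, n_solve, max_cycles=10):
    clue_memory = []                                   # long-term clue memory
    for _ in range(max_cycles):
        # 1. Clue Seeking: explore for N_seek steps and log potential clues
        clue_memory += agent.explore(env, steps=n_seek)
        # 2. Clue-Observation Mapping: pair stored clues with current observations
        subtasks = agent.map_clues(clue_memory, env.observe())
        # 3. Problem Solving: execute the proposed subtasks within N_solve steps
        for subtask in subtasks:
            agent.execute(env, subtask, steps=n_solve)
        if env.story_arc_complete():                   # all milestones reached
            break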

Experimental Results

Performance Comparison across all games

Model | GUI Grounding / Action Execution | Agentic Framework | Success Rate (%) | Milestone Completion (%) | # Steps
GPT-4o | UGround-V1-7B / pyautogui | Cradle | 0.00 | 4.56 | 1000.0
Claude-3.7-Sonnet | UGround-V1-7B / pyautogui | Cradle | 0.00 | 6.59 | 1000.0
Claude-3.7-Sonnet | Claude-3.7-Sonnet / pyautogui | Cradle | 0.00 | 10.60 | 1000.0
Agent S2 | – | – | 0.00 | 1.20 | 1000.0
UI-TARS-1.5-7B | – | – | 0.00 | 6.93 | 1000.0
OpenAI CUA | – | – | 5.88 | 15.39 | 954.1
Claude-3.7-Sonnet | Computer-Use | – | 0.00 | 17.11 | 992.4
Claude-3.7-Sonnet | Computer-Use | COAST (Ours) | 5.88 | 19.89 | 966.8
Human (max 1,000 steps) | – | – | 50.98 | 78.98 | 815.5
Human (unlimited steps) | – | – | 97.06 | 100.00 | 1142.0

Table 1: Comparison of different GUI agents across all 34 games.

Performance Analysis by Game Subgenre

Our analysis reveals significant performance variation across game subgenres. COAST shows improvements over the baselines, particularly in games requiring long-term memory and planning. Mystery/detective and room escape games benefit from clue-based reasoning, while visual novels show inconsistent trends, as the observation-behavior gap is less pronounced in that subgenre.

Performance analysis by subgenre
Figure 4: Comparison of average milestone completion rates (MCR) across different game subgenres

Key Findings

  • Current GUI agents struggle with full story arc completion (best: 5.88% success rate).
  • COAST improves goal / milestone completion by 5.88 / 2.78 percentage points over the baseline.
  • Still, a significant gap remains between the best agents and human performance (5.88% vs. 97.06% success rate).
  • Agents exhibit weak planning, poor visual perception, and deficient lateral thinking.

BibTeX

@inproceedings{ahn2025flashadventure,
  title={FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games},
  author={Ahn, Jaewoo and Kim, Junseo and Yun, Heeseung and Son, Jaehyeon and Park, Dongmin and Cho, Jaewoong and Kim, Gunhee},
  booktitle={EMNLP},
  year={2025}
}