FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

¹Seoul National University  ²KRAFTON  ³Georgia Institute of Technology
* Equal contribution
EMNLP 2025 Main

"Can Vision-Language Models play adventure games like Humans?"

(room escape, mystery, visual novel, simulation, etc.)

Abstract

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions.

Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion, and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information.

We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework that leverages long-term clue memory to better plan and solve sequential tasks. Experiments show that current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked gap between humans and the best-performing agents warrants continued research to narrow this divide.

Figure 1: FlashAdventure consists of 34 Flash-based classic adventure games and supports automatic evaluation of the GUI agent using CUA-as-a-Judge.

Game Collection

Our benchmark includes 34 carefully selected Flash-based adventure games across 5 subgenres:

  • Sherlock Holmes: The Tea Shop Murder Mystery
  • Sherlock Holmes 2
  • Vortex Point 1
  • Vortex Point 2
  • Vortex Point 3
  • Pierre Hotel
  • Small Town Detective
  • Dakota Winchester's Adventures
  • Saucy Devil Gordon
  • Ray and Cooper 2
  • Nick Bounty: A Case of the Crabs
  • Grim Tales: The Bride
  • Grim Tales: The Legacy Collector's Edition
  • Computer Office Escape
  • Crimson Room
  • Camping Room Escape
  • Chemical Room Escape
  • Space Museum Escape
  • Vending Machine Escape
  • Wood Workshop Escape
  • Geometric Room Escape
  • Game Cafe Escape
  • Machine Room Escape
  • Video Studio Escape
  • Design House Escape
  • Paint Room Escape
  • Mirror Room Escape
  • Elevator Room Escape
  • Pico Sim Date
  • Festival Days Sim Date
  • Kingdom Days
  • Idol Days Sim Date
  • Community College Sim
  • Sort the Court

Key Challenge: Observation-Behavior Gap

A critical challenge in FlashAdventure is the observation-behavior gap, which refers to the time lag between when an agent observes information and when it can act upon it. Unlike prior benchmarks that focus on short-term objectives or include short story arcs, FlashAdventure emphasizes completion of full story arcs involving long-term objectives.

Observation-Behavior Gap comparison
Figure 2: Comparison of gameplay progression across benchmarks. FlashAdventure requires agents to manage long-term time lags, such as interrogating a suspect and later discovering their innocence, demonstrating the importance of bridging the observation-behavior gap.

Adventure games require agents to manage long-term dependencies that are crucial for solving full story arcs. Tolman's theory of latent learning suggests that humans can retrieve and apply clues after a long delay; FlashAdventure makes it possible to test whether agents exhibit similar behavior.
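To make the gap concrete, it can be measured as the number of steps between an agent's first observation of a clue and its first use of that clue. The sketch below is illustrative only (the event-log format is our assumption, not part of the benchmark):

# Illustrative only: measure the observation-behavior gap per clue from a
# gameplay event log of (step, kind, clue_id) triples, kind in {"observe", "use"}.
def observation_behavior_gap(events):
    first_seen, gaps = {}, {}
    for step, kind, clue in events:
        if kind == "observe":
            first_seen.setdefault(clue, step)      # remember first sighting
        elif kind == "use" and clue in first_seen and clue not in gaps:
            gaps[clue] = step - first_seen[clue]   # delay until first use
    return gaps

# e.g., a safe code seen at step 12 but only entered at step 250 -> gap of 238
print(observation_behavior_gap([(12, "observe", "safe_code"), (250, "use", "safe_code")]))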

CUA-as-a-Judge

Automatic Evaluation Framework

CUA-as-a-Judge acts as an oracle with access to predefined success milestones for each game. It actively interacts with the game environment to verify whether milestones have been achieved.

After a game agent finishes gameplay, CUA-as-a-Judge resumes from the game's final state and executes actions to check milestone completion, simulating a human judging process.
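A minimal sketch of this judging loop, assuming a hypothetical Milestone abstraction with scripted verification actions (not the released evaluator):

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Milestone:
    name: str
    verify: Callable[[], bool]  # replays checking actions against the final game state

def judge_game(milestones: List[Milestone]) -> float:
    """Resume from the game's final state and check each predefined milestone."""
    completed = sum(1 for m in milestones if m.verify())
    return 100.0 * completed / len(milestones)  # milestone completion rate (%)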

Validation Results (vs. Human)

Metric | Value
Accuracy | 94.00%
Spearman Correlation | 0.9912
Pearson Correlation | 0.9999
Total Samples | 300
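For reference, agreement numbers of this kind can be computed with scipy; the scores below are placeholders, not the paper's data:

from scipy.stats import pearsonr, spearmanr

human = [1.00, 0.75, 0.50, 0.00, 1.00]  # placeholder human-judged milestone scores
cua   = [1.00, 0.75, 0.50, 0.25, 1.00]  # placeholder CUA-as-a-Judge scores

accuracy = sum(h == c for h, c in zip(human, cua)) / len(human)
print(accuracy, spearmanr(human, cua).statistic, pearsonr(human, cua).statistic)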

COAST Framework

COAST (Clue-Oriented Agent for Sequential Tasks) addresses the observation-behavior gap through a Seek-Map-Solve cycle, sketched in code after Figure 3 below:

1. Clue Seeking

Explores the environment for N_seek steps to collect potential clues

2. Clue-Observation Mapping

Analyzes memory to identify promising clue-observation pairs

3. Problem Solving

Executes the proposed subtasks for N_solve steps

COAST Algorithm
Figure 3: COAST Framework with Seek-Map-Solve Cycle
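The cycle can be summarized in a short sketch; the env and agent interfaces here are assumptions for illustration, not the released implementation:

def coast(env, agent, n_seek, n_solve, max_cycles=10):
    clue_memory = []                                   # long-term clue memory
    for _ in range(max_cycles):
        # 1. Clue Seeking: explore for N_seek steps and log potential clues
        clue_memory += agent.explore(env, steps=n_seek)
        # 2. Clue-Observation Mapping: pair stored clues with current observations
        subtasks = agent.map_clues(clue_memory, env.observe())
        # 3. Problem Solving: execute the proposed subtasks within N_solve steps
        for subtask in subtasks:
            agent.execute(env, subtask, steps=n_solve)
        if env.story_arc_complete():                   # all milestones reached
            break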

Experimental Results

Performance Comparison across all games

Model | GUI Grounding / Action Execution | Agentic Framework | Success Rate (%) | Milestone Completion (%) | # Steps
GPT-4o | UGround-V1-7B / pyautogui | Cradle | 0.00 | 4.56 | 1000.0
Claude-3.7-Sonnet | UGround-V1-7B / pyautogui | Cradle | 0.00 | 6.59 | 1000.0
Claude-3.7-Sonnet | Claude-3.7-Sonnet / pyautogui | Cradle | 0.00 | 10.60 | 1000.0
Agent S2 | – | – | 0.00 | 1.20 | 1000.0
UI-TARS-1.5-7B | – | – | 0.00 | 6.93 | 1000.0
OpenAI CUA | – | – | 5.88 | 15.39 | 954.1
Claude-3.7-Sonnet | Computer-Use | – | 0.00 | 17.11 | 992.4
Claude-3.7-Sonnet | Computer-Use | COAST (Ours) | 5.88 | 19.89 | 966.8
Human (max 1,000 steps) | – | – | 50.98 | 78.98 | 815.5
Human (unlimited steps) | – | – | 97.06 | 100.00 | 1142.0

Table 1: Comparison of different GUI agents across all 34 games.

Performance Analysis by Game Subgenre

Our analysis reveals significant performance variation across game subgenres. COAST shows improvements over the baselines, particularly in games requiring long-term memory and planning. Mystery/detective and room escape games benefit from clue-based reasoning, while visual novels show inconsistent trends, as the observation-behavior gap is less pronounced in that subgenre.

Performance analysis by subgenre
Figure 4: Comparison of average milestone completion rates (MCR) across different game subgenres

Key Findings

  • Current GUI agents struggle with full story arc completion (best: 5.88% success rate).
  • COAST improves goal / milestone completion by 5.88 / 2.78 percentage points over the baseline.
  • Still, a significant gap remains between the best agents and human performance (5.88% vs. 97.06% success rate).
  • Agents exhibit weak planning, poor visual perception, and deficient lateral thinking.

BibTeX

@inproceedings{ahn2025flashadventure,
  title={FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games},
  author={Ahn, Jaewoo and Kim, Junseo and Yun, Heeseung and Son, Jaehyeon and Park, Dongmin and Cho, Jaewoong and Kim, Gunhee},
  booktitle={EMNLP},
  year={2025}
}