TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents
Abstract
A new tower defense-based environment called TowerMind is introduced for evaluating large language models' planning and decision-making capabilities with low computational requirements and multimodal observations.
Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub(https://github.com/tb6147877/TowerMind).
Community
Some of the observations are :-
i. TowerMind is a lightweight RTS-style benchmark for LLM agents
It introduces a tower defense based environment that preserves long term planning and decision making challenges of RTS games, while requiring very low computational resources compared to StarCraft II based benchmarks.
ii. Multimodal observations enable broader LLM evaluation
TowerMind supports pixel-based, textual (JSON), and structured state observations, making it suitable for evaluating language-only and vision-language models under the same environment.
iii. Hallucination is explicitly measured via action validity
Beyond performance score, the benchmark introduces valid action rate to quantify hallucinations. i.e. actions that violate game rules or state constraints allowing simultaneous evaluation of capability and reliability.
iv. LLMs significantly underperform human experts
Even the best-performing models (e.g. GPT-4.1, Claude 3.7 Sonnet) show a large gap from human experts, especially on harder levels, revealing weaknesses in planning validation, multifinality, and efficient action use.
v. TowerMind is challenging for both LLMs and RL agents
Classic RL algorithms (Ape-X DQN, PPO) also fail to reach human level performance, confirming TowerMind as a non-trivial benchmark that complements existing LLM and RL evaluation environments.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Large Language Models as Pok'emon Battle Agents: Strategic Play and Content Generation (2025)
- Vox Deorum: A Hybrid LLM Architecture for 4X / Grand Strategy Game AI -- Lessons from Civilization V (2025)
- LLM-Cave: A benchmark and light environment for large language models reasoning and decision-making system (2025)
- LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess (2025)
- VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents (2025)
- Reflecting with Two Voices: A Co-Adaptive Dual-Strategy Framework for LLM-Based Agent Decision Making (2025)
- Benchmarking the Generality of Vision-Language-Action Models (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper