A procedurally generated roguelike dungeon game with a Gymnasium RL environment. Train agents with DQN/PPO or use a frozen VLM (Qwen3.5-4B) with a trainable action head. Play it yourself too.
You're @, navigating procedural dungeon floors to reach the exit >. Fight zombies z and vampires v, pick up health potions +, and descend deeper. Rooms connect via tunnels, enemies get harder with depth, and shops appear every 2 floors (if enabled).
Controls: four discrete actions (right, left, down, up).
Text render (LLMs). Image render (VLMs / RGB RL agents). Numeric observation (RL agents, 16x18 int grid).
src/                  # Game engine
  engine.py           # Core loop, FOV, bump system, level transitions
  world/level.py      # Level generation (rooms, tunnels, spawning)
  entities/           # Entity hierarchy (fighters, items, shops)
  utilities/          # Actions, A* pathfinding
env/                  # Gymnasium environment
  __init__.py         # EasyRogue env (3 render modes, configurable observability)
conf/                 # Reward configs (YAML)
scripts/              # Training, evaluation, benchmarking
from env import EasyRogue
env = EasyRogue(
render_mode="numeric", # "numeric" (16x18 int grid), "text" (ASCII), "image" (280x180 RGB)
perfect_info=True, # Full map vs 7x7 FOV
reward_config="conf/rewards_default.yaml",
)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(action) # action in {0,1,2,3}

The info dict includes: map (ASCII), player health/attack/defense, gold, enemies, potions, depth, exits taken.
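The reset/step calls above can be wrapped into a small evaluation helper. The sketch below runs a uniform-random policy and reports the fraction of episodes that reach the exit. `StubEnv` is a hypothetical stand-in so the snippet runs without the game installed, and the `exits_taken` key name is an assumption based on the info fields listed above; check the real key in `env/__init__.py` before relying on it.

```python
import random


class StubEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures.

    It is NOT the real EasyRogue: every episode ends after five steps and
    reports an exit with 50% probability.
    """

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._steps = 0

    def reset(self):
        self._steps = 0
        return None, {"exits_taken": 0}

    def step(self, action):
        assert action in (0, 1, 2, 3)
        self._steps += 1
        terminated = self._steps >= 5
        exits = int(terminated and self._rng.random() < 0.5)
        return None, 0.0, terminated, False, {"exits_taken": exits}


def random_exit_rate(env, n_episodes, seed=0):
    """Run a uniform-random policy and return the exit rate in [0, 1]."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_episodes):
        _, info = env.reset()
        terminated = truncated = False
        while not (terminated or truncated):
            action = rng.randrange(4)  # four discrete actions
            _, _, terminated, truncated, info = env.step(action)
        # Assumed key name; the README only says the info dict tracks "exits taken".
        successes += int(info.get("exits_taken", 0) > 0)
    return successes / n_episodes


if __name__ == "__main__":
    rate = random_exit_rate(StubEnv(), n_episodes=100)
    print(f"exit rate: {rate:.1%}")
```

Passing a real `EasyRogue` instance in place of `StubEnv()` should work unchanged, provided the info key matches.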
Uses Stable Baselines 3 with a custom GridCNN feature extractor that treats the 16x18 integer grid as a single-channel image (3 conv layers, 32->64->64 channels).
# Train PPO (best performer)
python scripts/train_ppo.py --perfect-info --timesteps 5000000
# Train DQN
python scripts/train_dqn.py --perfect-info --timesteps 5000000
# Evaluate and compare models
python scripts/evaluate_rl.py --models models/ppo_*.zip --algo ppo --n-episodes 500

Things that mattered:
- CNN is required. MLP on flattened grids doesn't learn spatial navigation.
- Don't use VecNormalize with discrete integer observations. It corrupts the replay buffer for DQN especially.
- PPO needs `ent_coef=0.02` to avoid collapsing to wall-bumping on procedural levels.
- DQN needs long exploration (`exploration_fraction=0.5`, `final_eps=0.05`) or it collapses too.
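To make the DQN exploration setting concrete: Stable Baselines 3 anneals epsilon linearly over the first `exploration_fraction` of training and then holds it at the final value. The sketch below reimplements that schedule in plain Python for illustration only. `linear_epsilon` is a hypothetical helper, the starting epsilon of 1.0 is the SB3 default rather than something stated here, and `final_eps` is assumed to correspond to SB3's `exploration_final_eps`.

```python
def linear_epsilon(step, total_timesteps, exploration_fraction=0.5,
                   initial_eps=1.0, final_eps=0.05):
    """Epsilon at a given step under a linear decay schedule."""
    progress = step / total_timesteps  # fraction of training completed
    if progress >= exploration_fraction:
        return final_eps
    # Interpolate between initial_eps and final_eps over the decay window.
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


if __name__ == "__main__":
    total = 5_000_000
    for step in (0, 1_250_000, 2_500_000, 5_000_000):
        print(step, round(linear_epsilon(step, total), 3))
```

Under these assumptions, a 5M-step run still acts randomly about half the time at step 1.25M and only reaches the 5% floor at step 2.5M.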
Zero-shot evaluation of Qwen3.5-4B on the game, using either ASCII text or rendered screenshots.
# Text mode (local model, no server needed)
python scripts/benchmark_vlm_text.py --num-episodes 50
# Image mode
python scripts/benchmark_vlm_image.py --num-episodes 20
# With vLLM server instead of local inference
python scripts/benchmark_vlm_text.py --api-url http://localhost:8000

Freezes Qwen3.5-4B as a feature extractor and trains a small MLP action/value head with PPO.
python scripts/finetune_vlm_actionhead.py \
--render-mode text --no-perfect-info \
--total-timesteps 50000 --device cuda

Architecture: game state (text or image) -> frozen VLM -> last-token features (2560-dim) -> 3-layer MLP -> 4 action logits + value estimate.
Training uses cosine LR decay, return normalization, value function clipping, and saves the best checkpoint by eval exit rate. The VLM runs once per step during rollout collection; cached features are reused across PPO epochs.
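As a reference for two of the training tricks named above, here is a minimal sketch of cosine LR decay and PPO-style value function clipping in plain Python. These are the common textbook formulations, not necessarily what `finetune_vlm_actionhead.py` does; the base learning rate of 3e-4, the floor of 0, and the clip range of 0.2 are placeholder assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Cosine-annealed learning rate: base_lr at step 0, min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clipped_value_loss(value, old_value, ret, clip_range=0.2):
    """Clipped squared-error value loss for a single sample."""
    # Limit how far the new prediction may move from the rollout-time one.
    delta = max(-clip_range, min(clip_range, value - old_value))
    clipped_value = old_value + delta
    # Keep the larger (more pessimistic) of the two squared errors.
    return max((value - ret) ** 2, (clipped_value - ret) ** 2)


if __name__ == "__main__":
    for step in (0, 25_000, 50_000):
        print(step, cosine_lr(step, total_steps=50_000))
    print(clipped_value_loss(value=1.0, old_value=0.0, ret=2.0))
```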
All numbers are exit rate (% of episodes where the agent reaches the exit). Random baseline is 17%.
| Model | Perfect Info | Imperfect Info (7x7 FOV) |
|---|---|---|
| PPO (5M steps) | 62.6% (+98.0 reward) | 29.2% (-23.6 reward) |
| DQN (5-10M steps) | 10.0% (-20.6 reward) | 34.4% (-39.2 reward) |
PPO with full observability is the strongest RL agent: aggressive play, fast exits (85 avg steps), high kill rate (1.12/ep). DQN struggles with perfect info (exploration collapse) but is the best RL agent under imperfect info, playing patiently (179 avg steps, 66.6% survival).
| Mode | Perfect Info | Imperfect Info |
|---|---|---|
| Text | 4% | 4% |
| Image | 5% | 5% |
Zero-shot, the VLM does worse than the 17% random baseline. The model moves directionally (right/down bias) instead of exploring, so it rarely finds the exit.
| Mode | Perfect Info | Imperfect Info |
|---|---|---|
| Text | 45% | 70% |
| Image | 50% | 40% |
With the trained action head, text+imperfect is the best agent in the entire project at 70% exit rate. Direction hints in the text prompt ("exit is 3 tiles RIGHT and 2 DOWN") let the VLM's language understanding handle spatial navigation better than learned CNN features with full map visibility. Under perfect info, image mode edges out text (50% vs 45%) since the full map screenshot contains direct spatial information, but text with direction hints dominates under imperfect info (70% vs 40%).
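A direction hint like the one quoted above could be generated along these lines. This is a hypothetical sketch: `direction_hint` is not a function from the repository, the wording is modeled on the single quoted example, and the `(x, y)` convention with `y` growing downward is an assumption.

```python
def direction_hint(player, exit_pos):
    """Describe the exit's offset from the player in words.

    Positions are (x, y) tuples with x growing rightward and y growing
    downward (an assumed grid convention).
    """
    dx = exit_pos[0] - player[0]
    dy = exit_pos[1] - player[1]
    moves = []
    if dx:
        moves.append((abs(dx), "RIGHT" if dx > 0 else "LEFT"))
    if dy:
        moves.append((abs(dy), "DOWN" if dy > 0 else "UP"))
    if not moves:
        return "you are standing on the exit"
    count, direction = moves[0]
    text = f"exit is {count} tile{'s' if count != 1 else ''} {direction}"
    for count, direction in moves[1:]:
        text += f" and {count} {direction}"
    return text


if __name__ == "__main__":
    print(direction_hint((2, 3), (5, 5)))
```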
python scripts/gameonly.py

pip install -r requirements.txt

