SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Can one-off task experience become reusable instructions that future agents can follow?

Yingtie Lei¹,*, Zhongwei Wan¹,*, Jiankun Zhang², Samiul Alam¹, Zixuan Zhong³, Peizhou Huang⁴, Xin Wang¹, Jingxuan Zhang¹, Donghao Zhou⁵, Yunta Hsieh⁴, Zhihao Dou⁶, Hui Shen⁴, Yan Xu⁷, Dimitrios Dimitriadis⁷, Tuo Zhang⁷, Mi Zhang¹

¹The Ohio State University, ²The University of Chicago, ³University College London, ⁴University of Michigan, ⁵The Chinese University of Hong Kong, ⁶Case Western Reserve University, ⁷Amazon

*Equal contribution

SkillEvolBench Team

Correspondence: Tuo Zhang (tuozhang@amazon.com), Mi Zhang (mizhang.1@osu.edu)

📝 Abstract

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition.

By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment.

Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

🗂️ GitHub Repo Layout

Path	Purpose
`benchmark/tasks/`	180 benchmark tasks with task metadata, instructions, environments, and validation assets.
`benchmark/skills/`	Curated seed skills used by curated-skill baselines.
`configs/baselines/`	Baseline YAMLs. Select with `--baseline-name <name>`.
`configs/models/`	Model/provider presets for Azure OpenAI, AWS Bedrock Claude, Gemini, and Kimi-style endpoints. Select with `--model-yaml`.
`configs/strategies/`	Runtime strategies, mainly `chain` and `chain_tier3`. Most baselines choose this automatically.
`configs/env_orders.yaml`	Deterministic environment orders for seeds `A`, `B`, and `C`.
`skillevolbench/`	Core benchmark engine: scheduler, runtime, stores, retrieval, prompting, metrics, schemas, and Harbor hooks.
`agents_port/`	Preinstalled-agent adapter layer for `codex`, `claude-code`, `gemini-cli`, and `kimi-cli`.
`scripts/run.py`	Main single-run launcher. Use this for individual reproductions.
`scripts/launch_main_experiment.py`	Batch launcher for canonical baselines.
`scripts/launch_multi_model.py`	Batch launcher for baseline × model sweeps.
`scripts/validate_configs.py`	Validates YAML configs against schemas.
`scripts/validate_assets.py`	Validates benchmark task and skill assets.
`scripts/preflight.py`	Checks runtime readiness before expensive model calls.
`scripts/summarize.py`	Summarizes a completed run.
`docker/agent-build/`	Builds the Harbor task image `agent-runtime:latest`.
`workspace/runs/`	Local generated run outputs. Do not commit.
`asset/`	README/project visual assets.

🧩 Benchmark Structure

SkillEvolBench uses a fixed stratified design:

6 environments × 5 latent skill families × 6 tasks = 180 tasks

For each latent skill family, T1-T3 are learning/acquisition tasks and T4-T6 are frozen evaluation tasks:

Task	Role	Purpose
`T1`	learning	canonical first encounter
`T2`	learning	enriched follow-up
`T3`	learning	variant acquisition case
`T4`	evaluation	context shift
`T5`	evaluation	adversarial variant
`T6`	evaluation	skill composition
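
The stratified design above can be sketched as a simple enumeration. This is only an illustration of the 6 × 5 × 6 grid and the learning/evaluation split. The integer environment and family indices are placeholders (an assumption), not the identifiers used under `benchmark/tasks/`.

```python
from itertools import product

# Role and purpose of each task slot, copied from the table above.
TASK_SLOTS = {
    "T1": ("learning", "canonical first encounter"),
    "T2": ("learning", "enriched follow-up"),
    "T3": ("learning", "variant acquisition case"),
    "T4": ("evaluation", "context shift"),
    "T5": ("evaluation", "adversarial variant"),
    "T6": ("evaluation", "skill composition"),
}

N_ENVIRONMENTS = 6
N_FAMILIES_PER_ENV = 5


def enumerate_tasks():
    """Return one record per cell of the 6 x 5 x 6 grid.

    The env/family indices are hypothetical placeholders; the real task
    identifiers in the repository may be named differently.
    """
    tasks = []
    for env, family, (slot, (role, purpose)) in product(
        range(N_ENVIRONMENTS), range(N_FAMILIES_PER_ENV), TASK_SLOTS.items()
    ):
        tasks.append(
            {"env": env, "family": family, "slot": slot, "role": role, "purpose": purpose}
        )
    return tasks


tasks = enumerate_tasks()
learning = [t for t in tasks if t["role"] == "learning"]
evaluation = [t for t in tasks if t["role"] == "evaluation"]
print(len(tasks), len(learning), len(evaluation))  # 180 90 90
```

Under this layout, half of the 180 tasks (T1-T3 in each family) are available for learning and the other half (T4-T6) are held out for frozen evaluation.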

⚙️ Installation

Python 3.12 is recommended.

python3.12 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e '.[dev]'
python -m pip install litellm boto3 jinja2 pandas matplotlib

Install Harbor and build the task runtime image:

python -m pip install git+https://github.com/harbor-framework/harbor.git
bash docker/agent-build/build.sh
docker image inspect agent-runtime:latest

🔐 API Keys and Provider Setup

Create a local credential file. It is ignored by Git.

touch .harbor-agents.env
chmod 600 .harbor-agents.env

Fill in only the providers you plan to run:

# Azure OpenAI / Codex presets
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/openai/v1
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
# Optional if your endpoint requires it:
# AZURE_OPENAI_API_VERSION=<api-version>

# AWS Bedrock Claude presets
AWS_BEARER_TOKEN_BEDROCK=<your-bedrock-bearer-token>
AWS_REGION=us-east-1
CLAUDE_CODE_USE_BEDROCK=1

# Gemini direct presets
GEMINI_API_KEY=<your-gemini-api-key>

# Kimi / OpenAI-compatible Mantle endpoint
KIMI_BEDROCK_BASE_URL=<your-openai-compatible-base-url>
KIMI_BEDROCK_API_KEY=<your-kimi-key>

Load credentials in the same shell before launching runs:

set -a
source .harbor-agents.env
set +a

Check without printing secret values:

python -c "import os; [print(k, bool(os.getenv(k))) for k in ['AZURE_OPENAI_ENDPOINT','AZURE_OPENAI_API_KEY','AWS_BEARER_TOKEN_BEDROCK','AWS_REGION','GEMINI_API_KEY','KIMI_BEDROCK_BASE_URL','KIMI_BEDROCK_API_KEY']]"

Provider notes

Provider family	Presets	Required env vars	Notes
Azure OpenAI Codex	`gpt-5.2-codex`, `gpt-5.3-codex`, `gpt-5.4`	`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`	The preset `agent_model_name` must match your Azure deployment name. Edit the YAML if your deployment differs.
AWS Bedrock Claude	`claude-sonnet-*`, `claude-opus-*`	`AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION`, `CLAUDE_CODE_USE_BEDROCK=1`	Uses Bedrock inference-profile IDs such as `us.anthropic...`. Your AWS account must have model access.
Gemini direct	`gemini-*`	`GEMINI_API_KEY`	Included for convenience; adapt the YAML if your Gemini CLI setup differs.
Kimi/Mantle	`kimi-*`	`KIMI_BEDROCK_BASE_URL`, `KIMI_BEDROCK_API_KEY`	Requires an OpenAI-compatible endpoint.
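
As a complement to the one-line check earlier, the per-provider requirements in the table can be expressed as a small lookup. This is a minimal sketch and not part of the repository: the provider keys (`azure_openai`, `bedrock_claude`, `gemini`, `kimi`) and the `missing_env` helper are hypothetical names chosen for illustration, while the variable names come from the table.

```python
import os

# Required variables per provider family, copied from the table above.
# The dictionary keys are hypothetical labels, not names used by the repo.
REQUIRED_ENV = {
    "azure_openai": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "bedrock_claude": ["AWS_BEARER_TOKEN_BEDROCK", "AWS_REGION", "CLAUDE_CODE_USE_BEDROCK"],
    "gemini": ["GEMINI_API_KEY"],
    "kimi": ["KIMI_BEDROCK_BASE_URL", "KIMI_BEDROCK_API_KEY"],
}


def missing_env(provider, env=None):
    """Return the required variable names that are unset or empty.

    Only names are returned, so secret values are never printed.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


for provider in REQUIRED_ENV:
    missing = missing_env(provider)
    status = "ready" if not missing else "missing " + ", ".join(missing)
    print(f"{provider}: {status}")
```

Passing an explicit mapping as `env` makes the helper easy to try without touching the real environment, for example `missing_env("gemini", {})`.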

The paper configuration uses Azure OpenAI for Codex-style models and AWS Bedrock for Claude models. To use direct OpenAI or Anthropic API keys instead, add a new YAML under `configs/models/` and set the `provider`, `harbor_agent_name`, `agent_model_name`, `agent_env`, and `host_litellm` fields for your endpoint.

✅ Validate Before Running

python -m scripts.validate_configs
python -m scripts.validate_assets
python -m scripts.preflight
python -m scripts.run --baseline-name no_skill --order-seed A --dry-run

A healthy dry run schedules 180 tasks and does not call any model API.

For a stricter check including Harbor and Docker image readiness:

python -m scripts.preflight --strict

🚀 Run One Benchmark

General pattern:

python -m scripts.run \
  --baseline-name <baseline_name> \
  --model-yaml configs/models/<model_preset>.yaml \
  --order-seed A

Azure OpenAI example:

python -m scripts.run \
  --baseline-name no_skill \
  --model-yaml configs/models/gpt-5.4.yaml \
  --order-seed A

AWS Bedrock Claude example:

python -m scripts.run \
  --baseline-name selfgen_experience_always \
  --model-yaml configs/models/claude-sonnet-4.6.yaml \
  --order-seed A

Use `--run-id` for a stable output directory:

python -m scripts.run \
  --baseline-name raw_trajectory_rag \
  --model-yaml configs/models/gpt-5.4.yaml \
  --order-seed A \
  --run-id rawrag_gpt54_seedA

📊 Paper Baselines

Canonical baselines are defined in `skillevolbench/baselines/policy.py`.

Baseline	Config name	What it tests
No Skill	`no_skill`	Bare agent with no skills, memory, or trajectory retrieval.
Raw Trajectory RAG	`raw_trajectory_rag`	Retrieves same-family raw learning trajectories instead of distilled skills.
Self-Generated Zero-Shot	`selfgen_zero_shot`	Creates skills before experience from task/family context.
Self-Generated Experience	`selfgen_experience_always`	Induces skills from experience and revises on every learning trial.
Curated Static	`curated_static`	Uses curated seed skills without revision.
Curated + Revision	`curated_with_revision_always`	Starts from curated skills and revises on every learning trial.

Run the full canonical ladder for one model:

MODEL=configs/models/gpt-5.4.yaml
BASELINES=(
  no_skill
  raw_trajectory_rag
  selfgen_zero_shot
  selfgen_experience_always
  curated_static
  curated_with_revision_always
)

for BASELINE in "${BASELINES[@]}"; do
  python -m scripts.run \
    --baseline-name "$BASELINE" \
    --model-yaml "$MODEL" \
    --order-seed A
done

Or use the multi-run launcher after inspecting the plan:

python -m scripts.launch_multi_model \
  --baselines no_skill,raw_trajectory_rag,selfgen_zero_shot,selfgen_experience_always,curated_static,curated_with_revision_always \
  --models gpt-5.4 \
  --order-seed A \
  --max-workers 1 \
  --dry-run

Remove `--dry-run` to execute.
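
To review a sweep before spending any model calls, the baseline × model grid can also be expanded by hand. The sketch below only builds and prints `scripts.run` command lines using the flags shown earlier; it does not execute anything and makes no claim about how `scripts/launch_multi_model.py` works internally. The `build_commands` helper is a hypothetical name.

```python
from itertools import product
import shlex


def build_commands(baselines, model_yamls, order_seed="A"):
    """Return one scripts.run argument list per (baseline, model preset) pair."""
    commands = []
    for baseline, model_yaml in product(baselines, model_yamls):
        commands.append([
            "python", "-m", "scripts.run",
            "--baseline-name", baseline,
            "--model-yaml", model_yaml,
            "--order-seed", order_seed,
        ])
    return commands


plan = build_commands(
    ["no_skill", "raw_trajectory_rag"],
    ["configs/models/gpt-5.4.yaml", "configs/models/claude-sonnet-4.6.yaml"],
)
for cmd in plan:
    # shlex.join quotes arguments so each line can be pasted into a shell.
    print(shlex.join(cmd))
```

Each printed line is a standalone single-run command, so a failed pair can be rerun on its own.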

🧪 Ablation Baselines

Ablation	Config name
Failure-only self-generated revision	`selfgen_experience`
Failure-only curated revision	`curated_with_revision`
Required Tier-3 skill files, self-generated	`selfgen_experience_always_with_tier3`
Required Tier-3 skill files, curated	`curated_with_revision_always_with_tier3`

Run an ablation like any other baseline:

python -m scripts.run \
  --baseline-name selfgen_experience \
  --model-yaml configs/models/gpt-5.4.yaml \
  --order-seed A

🤖 Model Presets

Model preset	Provider route	Credentials
`gpt-5.2-codex.yaml`	Azure OpenAI Codex CLI	`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`
`gpt-5.3-codex.yaml`	Azure OpenAI Codex CLI	`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`
`gpt-5.4.yaml`	Azure OpenAI Codex CLI	`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`
`claude-sonnet-4.5.yaml`	AWS Bedrock Claude Code	`AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION`
`claude-sonnet-4.6.yaml`	AWS Bedrock Claude Code	`AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION`
`claude-opus-4.5.yaml`	AWS Bedrock Claude Code	`AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION`
`claude-opus-4.6.yaml`	AWS Bedrock Claude Code	`AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION`
`gemini-2.5-pro.yaml`	Gemini CLI direct	`GEMINI_API_KEY`
`gemini-3-flash.yaml`	Gemini CLI direct	`GEMINI_API_KEY`
`gemini-3.1-pro.yaml`	Gemini CLI direct	`GEMINI_API_KEY`
`kimi-2-thinking.yaml`	OpenAI-compatible Mantle/Kimi	`KIMI_BEDROCK_BASE_URL`, `KIMI_BEDROCK_API_KEY`
`kimi-2.5.yaml`	OpenAI-compatible Mantle/Kimi	`KIMI_BEDROCK_BASE_URL`, `KIMI_BEDROCK_API_KEY`

Inspect a preset before editing:

sed -n '1,120p' configs/models/gpt-5.4.yaml

📦 Outputs

Runs are written to:

workspace/runs/<run_id>/

Key files:

Path	Meaning
`config.json`	Frozen run configuration.
`reports/full_report.json`	Main metrics report.
`stores/replay/`	Per-task replay records.
`stores/events/`	Runtime event logs.
`stores/retrieval/`	Retrieval traces when enabled.
`runtime/`	Per-task execution workdirs.
`harbor-job/`	Harbor job artifacts.

Summarize a run:

python -m scripts.summarize workspace/runs/<run_id>

Find recent runs:

ls -lt workspace/runs | head
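
For scripted inspection, the run layout above can be read with the standard library. This is a minimal sketch that assumes only the documented paths (`workspace/runs/<run_id>/reports/full_report.json`). The report's internal schema is not described here, so the example prints top-level keys rather than guessing at field names. `load_report` and `latest_run` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_report(run_dir):
    """Load reports/full_report.json from a run directory, or return None if absent."""
    path = Path(run_dir) / "reports" / "full_report.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def latest_run(runs_root="workspace/runs"):
    """Return the most recently modified run directory, or None if there is none."""
    root = Path(runs_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


run = latest_run()
report = load_report(run) if run is not None else None
if isinstance(report, dict):
    # Show only the top-level structure; field names depend on the benchmark version.
    print(run.name, sorted(report))
```

If no run exists yet, or the run did not finish writing its report, the script prints nothing instead of raising.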

🛠️ Troubleshooting

Symptom	Likely cause	Fix
`Please set an Auth method... GEMINI_API_KEY`	`.harbor-agents.env` was not sourced or `GEMINI_API_KEY` is missing.	Source `.harbor-agents.env` in the same shell.
`No module named 'harbor'`	Harbor SDK is missing.	`python -m pip install git+https://github.com/harbor-framework/harbor.git`
`agent-runtime:latest` missing	Docker image has not been built.	`bash docker/agent-build/build.sh`
Azure 401/403	Endpoint/key/deployment mismatch.	Check `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `agent_model_name`.
Bedrock model identifier error	Region/profile mismatch.	Use the `us.anthropic...` model IDs in `configs/models/claude-*.yaml` and set `AWS_REGION`.

📚 Citation

@article{lei2026skillevolbench,
  title={SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills},
  author={Lei, Yingtie and Wan, Zhongwei and Zhang, Jiankun and Alam, Samiul and Zhong, Zixuan and Huang, Peizhou and Wang, Xin and Zhang, Jingxuan and Zhou, Donghao and Hsieh, Yunta and others},
  journal={arXiv preprint arXiv:2605.24117},
  year={2026}
}
