AIoT-MLSys-Lab/SkillEvolBench

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Can one-off task experience become reusable instructions that future agents can follow?

Yingtie Lei1,*, Zhongwei Wan1,*, Jiankun Zhang2, Samiul Alam1, Zixuan Zhong3, Peizhou Huang4, Xin Wang1
Jingxuan Zhang1, Donghao Zhou5, Yunta Hsieh4, Zhihao Dou6, Hui Shen4, Yan Xu7, Dimitrios Dimitriadis7, Tuo Zhang7, Mi Zhang1

1The Ohio State University, 2The University of Chicago, 3University College London, 4University of Michigan,
5The Chinese University of Hong Kong, 6Case Western Reserve University, 7Amazon
*Equal contribution

SkillEvolBench Team

Correspondence: Tuo Zhang tuozhang@amazon.com, Mi Zhang mizhang.1@osu.edu

📝 Abstract

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition.

By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment.

Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

🗂️ GitHub Repo Layout

| Path | Purpose |
| --- | --- |
| benchmark/tasks/ | 180 benchmark tasks with task metadata, instructions, environments, and validation assets. |
| benchmark/skills/ | Curated seed skills used by curated-skill baselines. |
| configs/baselines/ | Baseline YAMLs. Select with --baseline-name <name>. |
| configs/models/ | Model/provider presets for Azure OpenAI, AWS Bedrock Claude, Gemini, and Kimi-style endpoints. Select with --model-yaml. |
| configs/strategies/ | Runtime strategies, mainly chain and chain_tier3. Most baselines choose this automatically. |
| configs/env_orders.yaml | Deterministic environment orders for seeds A, B, and C. |
| skillevolbench/ | Core benchmark engine: scheduler, runtime, stores, retrieval, prompting, metrics, schemas, and Harbor hooks. |
| agents_port/ | Preinstalled-agent adapter layer for codex, claude-code, gemini-cli, and kimi-cli. |
| scripts/run.py | Main single-run launcher. Use this for individual reproductions. |
| scripts/launch_main_experiment.py | Batch launcher for canonical baselines. |
| scripts/launch_multi_model.py | Batch launcher for baseline × model sweeps. |
| scripts/validate_configs.py | Validates YAML configs against schemas. |
| scripts/validate_assets.py | Validates benchmark task and skill assets. |
| scripts/preflight.py | Checks runtime readiness before expensive model calls. |
| scripts/summarize.py | Summarizes a completed run. |
| docker/agent-build/ | Builds the Harbor task image agent-runtime:latest. |
| workspace/runs/ | Locally generated run outputs. Do not commit. |
| asset/ | README/project visual assets. |

🧩 Benchmark Structure

SkillEvolBench uses a fixed stratified design:

6 environments × 5 latent skill families × 6 tasks = 180 tasks

For each latent skill family, T1-T3 are learning/acquisition tasks and T4-T6 are frozen evaluation tasks:

| Task | Role | Purpose |
| --- | --- | --- |
| T1 | learning | canonical first encounter |
| T2 | learning | enriched follow-up |
| T3 | learning | variant acquisition case |
| T4 | evaluation | context shift |
| T5 | evaluation | adversarial variant |
| T6 | evaluation | skill composition |
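
The stratified design above can be checked with a short enumeration. This is only an illustrative sketch: the environment and family identifiers (`env_1`, `family_1`, and so on) are hypothetical placeholders, since this section does not list the real names, and the dictionary layout is not the benchmark's actual task schema.

```python
from itertools import product

# Hypothetical placeholder identifiers; the real environment and family
# names live under benchmark/tasks/ and are not listed in this README section.
ENVIRONMENTS = [f"env_{i}" for i in range(1, 7)]   # 6 environments
FAMILIES = [f"family_{j}" for j in range(1, 6)]    # 5 latent skill families per environment

# Slot roles and purposes, copied from the table above.
SLOTS = {
    "T1": ("learning", "canonical first encounter"),
    "T2": ("learning", "enriched follow-up"),
    "T3": ("learning", "variant acquisition case"),
    "T4": ("evaluation", "context shift"),
    "T5": ("evaluation", "adversarial variant"),
    "T6": ("evaluation", "skill composition"),
}


def build_grid():
    """Enumerate every (environment, family, slot) cell of the design."""
    return [
        {
            "env": env,
            "family": family,
            "slot": slot,
            "role": SLOTS[slot][0],
            "purpose": SLOTS[slot][1],
        }
        for env, family, slot in product(ENVIRONMENTS, FAMILIES, SLOTS)
    ]


grid = build_grid()
learning = [task for task in grid if task["role"] == "learning"]
evaluation = [task for task in grid if task["role"] == "evaluation"]
print(len(grid), len(learning), len(evaluation))  # 180 90 90
```

Half of the 180 tasks are therefore learning tasks and half are frozen evaluation tasks, split evenly across the three evaluation axes.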

⚙️ Installation

Python 3.12 is recommended.

python3.12 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e '.[dev]'
python -m pip install litellm boto3 jinja2 pandas matplotlib

Install Harbor and build the task runtime image:

python -m pip install git+https://github.com/harbor-framework/harbor.git
bash docker/agent-build/build.sh
docker image inspect agent-runtime:latest

🔐 API Keys and Provider Setup

Create a local credential file. It is ignored by Git.

touch .harbor-agents.env
chmod 600 .harbor-agents.env

Fill in only the providers you plan to run:

# Azure OpenAI / Codex presets
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/openai/v1
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
# Optional if your endpoint requires it:
# AZURE_OPENAI_API_VERSION=<api-version>

# AWS Bedrock Claude presets
AWS_BEARER_TOKEN_BEDROCK=<your-bedrock-bearer-token>
AWS_REGION=us-east-1
CLAUDE_CODE_USE_BEDROCK=1

# Gemini direct presets
GEMINI_API_KEY=<your-gemini-api-key>

# Kimi / OpenAI-compatible Mantle endpoint
KIMI_BEDROCK_BASE_URL=<your-openai-compatible-base-url>
KIMI_BEDROCK_API_KEY=<your-kimi-key>

Load credentials in the same shell before launching runs:

set -a
source .harbor-agents.env
set +a

Check without printing secret values:

python -c "import os; [print(k, bool(os.getenv(k))) for k in ['AZURE_OPENAI_ENDPOINT','AZURE_OPENAI_API_KEY','AWS_BEARER_TOKEN_BEDROCK','AWS_REGION','GEMINI_API_KEY','KIMI_BEDROCK_BASE_URL','KIMI_BEDROCK_API_KEY']]"

Provider notes

| Provider family | Presets | Required env vars | Notes |
| --- | --- | --- | --- |
| Azure OpenAI Codex | gpt-5.2-codex, gpt-5.3-codex, gpt-5.4 | AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY | The preset agent_model_name must match your Azure deployment name. Edit the YAML if your deployment differs. |
| AWS Bedrock Claude | claude-sonnet-*, claude-opus-* | AWS_BEARER_TOKEN_BEDROCK, AWS_REGION, CLAUDE_CODE_USE_BEDROCK=1 | Uses Bedrock inference-profile IDs such as us.anthropic.... Your AWS account must have model access. |
| Gemini direct | gemini-* | GEMINI_API_KEY | Included for convenience; adapt the YAML if your Gemini CLI setup differs. |
| Kimi/Mantle | kimi-* | KIMI_BEDROCK_BASE_URL, KIMI_BEDROCK_API_KEY | Requires an OpenAI-compatible endpoint. |
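
As a rough per-provider variant of the credential check, the sketch below groups the required variables from the table above and reports which ones are unset, without printing any secret values. The provider keys (`azure_openai`, `bedrock_claude`, `gemini`, `kimi`) and the `missing_vars` helper are hypothetical names chosen for this example; they are not part of the repository.

```python
import os

# Required variables per provider family, taken from the table above.
# The dictionary keys are illustrative labels, not identifiers used by the repo.
REQUIRED_ENV = {
    "azure_openai": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "bedrock_claude": [
        "AWS_BEARER_TOKEN_BEDROCK",
        "AWS_REGION",
        "CLAUDE_CODE_USE_BEDROCK",
    ],
    "gemini": ["GEMINI_API_KEY"],
    "kimi": ["KIMI_BEDROCK_BASE_URL", "KIMI_BEDROCK_API_KEY"],
}


def missing_vars(provider, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


if __name__ == "__main__":
    # Only variable names are printed, never their values.
    for provider in REQUIRED_ENV:
        missing = missing_vars(provider)
        status = "ready" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```

Run it in the same shell where `.harbor-agents.env` was sourced. A provider reported as ready only means the variables are set; it does not confirm that the key, endpoint, or deployment name is valid.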

The paper configuration here uses Azure OpenAI for Codex-style models and AWS Bedrock Claude for Claude. If you use direct OpenAI/Anthropic API keys instead, add a new YAML under configs/models/ and set the right provider, harbor_agent_name, agent_model_name, agent_env, and host_litellm fields for your endpoint.

✅ Validate Before Running

python -m scripts.validate_configs
python -m scripts.validate_assets
python -m scripts.preflight
python -m scripts.run --baseline-name no_skill --order-seed A --dry-run

A healthy dry run schedules 180 tasks and does not call any model API.

For a stricter check including Harbor and Docker image readiness:

python -m scripts.preflight --strict

🚀 Run One Benchmark

General pattern:

python -m scripts.run \
  --baseline-name <baseline_name> \
  --model-yaml configs/models/<model_preset>.yaml \
  --order-seed A

Azure OpenAI example:

python -m scripts.run \
  --baseline-name no_skill \
  --model-yaml configs/models/gpt-5.4.yaml \
  --order-seed A

AWS Bedrock Claude example:

python -m scripts.run \
  --baseline-name selfgen_experience_always \
  --model-yaml configs/models/claude-sonnet-4.6.yaml \
  --order-seed A

Use --run-id for a stable output directory:

python -m scripts.run \
  --baseline-name raw_trajectory_rag \
  --model-yaml configs/models/gpt-5.4.yaml \
  --order-seed A \
  --run-id rawrag_gpt54_seedA

📊 Paper Baselines

Canonical baselines are defined in skillevolbench/baselines/policy.py.

| Baseline | Config name | What it tests |
| --- | --- | --- |
| No Skill | no_skill | Bare agent with no skills, memory, or trajectory retrieval. |
| Raw Trajectory RAG | raw_trajectory_rag | Retrieves same-family raw learning trajectories instead of distilled skills. |
| Self-Generated Zero-Shot | selfgen_zero_shot | Creates skills before experience from task/family context. |
| Self-Generated Experience | selfgen_experience_always | Induces skills from experience and revises on every learning trial. |
| Curated Static | curated_static | Uses curated seed skills without revision. |
| Curated + Revision | curated_with_revision_always | Starts from curated skills and revises on every learning trial. |

Run the full canonical ladder for one model:

MODEL=configs/models/gpt-5.4.yaml
BASELINES=(
  no_skill
  raw_trajectory_rag
  selfgen_zero_shot
  selfgen_experience_always
  curated_static
  curated_with_revision_always
)

for BASELINE in "${BASELINES[@]}"; do
  python -m scripts.run \
    --baseline-name "$BASELINE" \
    --model-yaml "$MODEL" \
    --order-seed A
done
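
For readers who prefer to drive the ladder from Python, the loop above might be sketched as follows. It uses only flags that appear in this README (`--baseline-name`, `--model-yaml`, `--order-seed`, `--run-id`, `--dry-run`); the helper names `build_run_command` and `run_ladder` are hypothetical, and combining `--dry-run` with `--model-yaml` is an assumption, since the README only shows the dry run without a model preset.

```python
import subprocess
import sys

# Canonical baseline config names, as listed in the table above.
CANONICAL_BASELINES = [
    "no_skill",
    "raw_trajectory_rag",
    "selfgen_zero_shot",
    "selfgen_experience_always",
    "curated_static",
    "curated_with_revision_always",
]


def build_run_command(baseline, model_yaml, order_seed="A", run_id=None, dry_run=False):
    """Build the argv list for one `python -m scripts.run` invocation."""
    cmd = [
        sys.executable, "-m", "scripts.run",
        "--baseline-name", baseline,
        "--model-yaml", model_yaml,
        "--order-seed", order_seed,
    ]
    if run_id:
        cmd += ["--run-id", run_id]
    if dry_run:
        # Assumption: --dry-run can be combined with --model-yaml.
        cmd.append("--dry-run")
    return cmd


def run_ladder(model_yaml, order_seed="A", dry_run=True):
    """Run every canonical baseline in order, stopping at the first failure."""
    for baseline in CANONICAL_BASELINES:
        cmd = build_run_command(baseline, model_yaml, order_seed, dry_run=dry_run)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on non-zero exit
```

Calling `run_ladder("configs/models/gpt-5.4.yaml")` from the repository root would schedule the six baselines as dry runs; pass `dry_run=False` to execute them. Unlike the shell loop, `check=True` stops at the first failing baseline instead of continuing to the next one.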

Or use the multi-run launcher after inspecting the plan:

python -m scripts.launch_multi_model \
  --baselines no_skill,raw_trajectory_rag,selfgen_zero_shot,selfgen_experience_always,curated_static,curated_with_revision_always \
  --models gpt-5.4 \
  --order-seed A \
  --max-workers 1 \
  --dry-run

Remove --dry-run to execute.

🧪 Ablation Baselines

| Ablation | Config name |
| --- | --- |
| Failure-only self-generated revision | selfgen_experience |
| Failure-only curated revision | curated_with_revision |
| Required Tier-3 skill files, self-generated | selfgen_experience_always_with_tier3 |
| Required Tier-3 skill files, curated | curated_with_revision_always_with_tier3 |

Run an ablation like any other baseline:

python -m scripts.run \
  --baseline-name selfgen_experience \
  --model-yaml configs/models/gpt-5.4.yaml \
  --order-seed A

🤖 Model Presets

| Model preset | Provider route | Credentials |
| --- | --- | --- |
| gpt-5.2-codex.yaml | Azure OpenAI Codex CLI | AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY |
| gpt-5.3-codex.yaml | Azure OpenAI Codex CLI | AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY |
| gpt-5.4.yaml | Azure OpenAI Codex CLI | AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY |
| claude-sonnet-4.5.yaml | AWS Bedrock Claude Code | AWS_BEARER_TOKEN_BEDROCK, AWS_REGION |
| claude-sonnet-4.6.yaml | AWS Bedrock Claude Code | AWS_BEARER_TOKEN_BEDROCK, AWS_REGION |
| claude-opus-4.5.yaml | AWS Bedrock Claude Code | AWS_BEARER_TOKEN_BEDROCK, AWS_REGION |
| claude-opus-4.6.yaml | AWS Bedrock Claude Code | AWS_BEARER_TOKEN_BEDROCK, AWS_REGION |
| gemini-2.5-pro.yaml | Gemini CLI direct | GEMINI_API_KEY |
| gemini-3-flash.yaml | Gemini CLI direct | GEMINI_API_KEY |
| gemini-3.1-pro.yaml | Gemini CLI direct | GEMINI_API_KEY |
| kimi-2-thinking.yaml | OpenAI-compatible Mantle/Kimi | KIMI_BEDROCK_BASE_URL, KIMI_BEDROCK_API_KEY |
| kimi-2.5.yaml | OpenAI-compatible Mantle/Kimi | KIMI_BEDROCK_BASE_URL, KIMI_BEDROCK_API_KEY |

Inspect a preset before editing:

sed -n '1,120p' configs/models/gpt-5.4.yaml

📦 Outputs

Runs are written to:

workspace/runs/<run_id>/

Key files:

| Path | Meaning |
| --- | --- |
| config.json | Frozen run configuration. |
| reports/full_report.json | Main metrics report. |
| stores/replay/ | Per-task replay records. |
| stores/events/ | Runtime event logs. |
| stores/retrieval/ | Retrieval traces when enabled. |
| runtime/ | Per-task execution workdirs. |
| harbor-job/ | Harbor job artifacts. |

Summarize a run:

python -m scripts.summarize workspace/runs/<run_id>

Find recent runs:

ls -lt workspace/runs | head
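
To inspect a report programmatically rather than through `scripts.summarize`, something like the sketch below may be a starting point. It relies only on the paths listed above (`workspace/runs/<run_id>/reports/full_report.json`); the helper names `latest_run` and `load_report` are hypothetical, and nothing is assumed about the report's internal schema beyond it being valid JSON.

```python
import json
from pathlib import Path


def latest_run(runs_root="workspace/runs"):
    """Return the most recently modified run directory, or None if there is none."""
    root = Path(runs_root)
    if not root.is_dir():
        return None
    runs = [path for path in root.iterdir() if path.is_dir()]
    return max(runs, key=lambda path: path.stat().st_mtime, default=None)


def load_report(run_dir):
    """Load reports/full_report.json from a run directory."""
    report_path = Path(run_dir) / "reports" / "full_report.json"
    with report_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_report(latest_run())` returns the parsed report of the newest run, after which `sorted(report)` lists its top-level keys if the report is a JSON object. An incomplete run may not have a report yet, in which case `load_report` raises `FileNotFoundError`.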

🛠️ Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Please set an Auth method... GEMINI_API_KEY | .harbor-agents.env was not sourced or GEMINI_API_KEY is missing. | Source .harbor-agents.env in the same shell. |
| No module named 'harbor' | Harbor SDK is missing. | python -m pip install git+https://github.com/harbor-framework/harbor.git |
| agent-runtime:latest missing | Docker image has not been built. | bash docker/agent-build/build.sh |
| Azure 401/403 | Endpoint/key/deployment mismatch. | Check AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and agent_model_name. |
| Bedrock model identifier error | Region/profile mismatch. | Use the us.anthropic... model IDs in configs/models/claude-*.yaml and set AWS_REGION. |

📚 Citation

@article{lei2026skillevolbench,
  title={SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills},
  author={Lei, Yingtie and Wan, Zhongwei and Zhang, Jiankun and Alam, Samiul and Zhong, Zixuan and Huang, Peizhou and Wang, Xin and Zhang, Jingxuan and Zhou, Donghao and Hsieh, Yunta and others},
  journal={arXiv preprint arXiv:2605.24117},
  year={2026}
}
