Can one-off task experience become reusable instructions that future agents can follow?
Yingtie Lei1,*, Zhongwei Wan1,*, Jiankun Zhang2, Samiul Alam1, Zixuan Zhong3, Peizhou Huang4, Xin Wang1,
Jingxuan Zhang1, Donghao Zhou5, Yunta Hsieh4, Zhihao Dou6, Hui Shen4, Yan Xu7, Dimitrios Dimitriadis7, Tuo Zhang7, Mi Zhang1
1The Ohio State University,
2The University of Chicago,
3University College London,
4University of Michigan,
5The Chinese University of Hong Kong,
6Case Western Reserve University,
7Amazon
*Equal contribution
SkillEvolBench Team
Correspondence: Tuo Zhang tuozhang@amazon.com, Mi Zhang mizhang.1@osu.edu
Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition.
By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment.
Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.
| Path | Purpose |
|---|---|
| `benchmark/tasks/` | 180 benchmark tasks with task metadata, instructions, environments, and validation assets. |
| `benchmark/skills/` | Curated seed skills used by curated-skill baselines. |
| `configs/baselines/` | Baseline YAMLs. Select with `--baseline-name <name>`. |
| `configs/models/` | Model/provider presets for Azure OpenAI, AWS Bedrock Claude, Gemini, and Kimi-style endpoints. Select with `--model-yaml`. |
| `configs/strategies/` | Runtime strategies, mainly `chain` and `chain_tier3`. Most baselines choose this automatically. |
| `configs/env_orders.yaml` | Deterministic environment orders for seeds A, B, and C. |
| `skillevolbench/` | Core benchmark engine: scheduler, runtime, stores, retrieval, prompting, metrics, schemas, and Harbor hooks. |
| `agents_port/` | Preinstalled-agent adapter layer for `codex`, `claude-code`, `gemini-cli`, and `kimi-cli`. |
| `scripts/run.py` | Main single-run launcher. Use this for individual reproductions. |
| `scripts/launch_main_experiment.py` | Batch launcher for canonical baselines. |
| `scripts/launch_multi_model.py` | Batch launcher for baseline × model sweeps. |
| `scripts/validate_configs.py` | Validates YAML configs against schemas. |
| `scripts/validate_assets.py` | Validates benchmark task and skill assets. |
| `scripts/preflight.py` | Checks runtime readiness before expensive model calls. |
| `scripts/summarize.py` | Summarizes a completed run. |
| `docker/agent-build/` | Builds the Harbor task image `agent-runtime:latest`. |
| `workspace/runs/` | Locally generated run outputs. Do not commit. |
| `asset/` | README/project visual assets. |
SkillEvolBench uses a fixed stratified design:
6 environments × 5 latent skill families × 6 tasks = 180 tasks
For each latent skill family, T1-T3 are learning/acquisition tasks and T4-T6 are frozen evaluation tasks:
| Task | Role | Purpose |
|---|---|---|
| `T1` | learning | canonical first encounter |
| `T2` | learning | enriched follow-up |
| `T3` | learning | variant acquisition case |
| `T4` | evaluation | context shift |
| `T5` | evaluation | adversarial variant |
| `T6` | evaluation | skill composition |
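
The stratified design can be checked with a few lines of Python. This is a minimal sketch, not the benchmark's scheduler: the environment and family identifiers are hypothetical placeholders (the real names come from the assets under benchmark/tasks/), and only the counts and the T1-T6 roles are taken from the description above.

```python
from itertools import product

# Hypothetical placeholder identifiers: the real environment and family
# names are defined by the task assets under benchmark/tasks/.
ENVIRONMENTS = [f"env_{i}" for i in range(1, 7)]   # 6 environments
FAMILIES = [f"family_{i}" for i in range(1, 6)]    # 5 latent skill families per environment

# Task slots, roles, and purposes as listed in the task table.
TASK_SLOTS = {
    "T1": ("learning", "canonical first encounter"),
    "T2": ("learning", "enriched follow-up"),
    "T3": ("learning", "variant acquisition case"),
    "T4": ("evaluation", "context shift"),
    "T5": ("evaluation", "adversarial variant"),
    "T6": ("evaluation", "skill composition"),
}


def build_grid():
    """Return one record per (environment, family, task slot) combination."""
    return [
        {"environment": env, "family": fam, "task": slot, "role": role, "purpose": purpose}
        for env, fam, (slot, (role, purpose)) in product(
            ENVIRONMENTS, FAMILIES, TASK_SLOTS.items()
        )
    ]


grid = build_grid()
learning = [t for t in grid if t["role"] == "learning"]
evaluation = [t for t in grid if t["role"] == "evaluation"]
print(len(grid), len(learning), len(evaluation))  # 180 90 90
```

Half of the grid (90 tasks) is available for skill acquisition, and the other half is held out as frozen evaluation.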
Python 3.12 is recommended.

```bash
python3.12 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e '.[dev]'
python -m pip install litellm boto3 jinja2 pandas matplotlib
```

Install Harbor and build the task runtime image:

```bash
python -m pip install git+https://github.com/harbor-framework/harbor.git
bash docker/agent-build/build.sh
docker image inspect agent-runtime:latest
```

Create a local credential file. It is ignored by Git.

```bash
touch .harbor-agents.env
chmod 600 .harbor-agents.env
```

Fill only the providers you plan to run:
```bash
# Azure OpenAI / Codex presets
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/openai/v1
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
# Optional if your endpoint requires it:
# AZURE_OPENAI_API_VERSION=<api-version>

# AWS Bedrock Claude presets
AWS_BEARER_TOKEN_BEDROCK=<your-bedrock-bearer-token>
AWS_REGION=us-east-1
CLAUDE_CODE_USE_BEDROCK=1

# Gemini direct presets
GEMINI_API_KEY=<your-gemini-api-key>

# Kimi / OpenAI-compatible Mantle endpoint
KIMI_BEDROCK_BASE_URL=<your-openai-compatible-base-url>
KIMI_BEDROCK_API_KEY=<your-kimi-key>
```

Load credentials in the same shell before launching runs:
```bash
set -a
source .harbor-agents.env
set +a
```

Check without printing secret values:

```bash
python -c "import os; [print(k, bool(os.getenv(k))) for k in ['AZURE_OPENAI_ENDPOINT','AZURE_OPENAI_API_KEY','AWS_BEARER_TOKEN_BEDROCK','AWS_REGION','GEMINI_API_KEY','KIMI_BEDROCK_BASE_URL','KIMI_BEDROCK_API_KEY']]"
```

| Provider family | Presets | Required env vars | Notes |
|---|---|---|---|
| Azure OpenAI Codex | `gpt-5.2-codex`, `gpt-5.3-codex`, `gpt-5.4` | `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY` | The preset `agent_model_name` must match your Azure deployment name. Edit the YAML if your deployment differs. |
| AWS Bedrock Claude | `claude-sonnet-*`, `claude-opus-*` | `AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION`, `CLAUDE_CODE_USE_BEDROCK=1` | Uses Bedrock inference-profile IDs such as `us.anthropic...`. Your AWS account must have model access. |
| Gemini direct | `gemini-*` | `GEMINI_API_KEY` | Included for convenience; adapt the YAML if your Gemini CLI setup differs. |
| Kimi/Mantle | `kimi-*` | `KIMI_BEDROCK_BASE_URL`, `KIMI_BEDROCK_API_KEY` | Requires an OpenAI-compatible endpoint. |
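
To check only the provider family you plan to run, the required variables can be grouped per family. The sketch below is illustrative: the family keys and the `missing_vars` helper are hypothetical names, and only the variable names are taken from the provider table. It reports which variables are unset without printing any secret values.

```python
import os

# Required variables per provider family, copied from the provider table.
# The dictionary keys are hypothetical labels, not preset names.
REQUIRED_ENV = {
    "azure_openai": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "bedrock_claude": ["AWS_BEARER_TOKEN_BEDROCK", "AWS_REGION", "CLAUDE_CODE_USE_BEDROCK"],
    "gemini": ["GEMINI_API_KEY"],
    "kimi": ["KIMI_BEDROCK_BASE_URL", "KIMI_BEDROCK_API_KEY"],
}


def missing_vars(family, env=None):
    """Return the required variable names that are unset or empty for one family."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[family] if not env.get(name)]


for family in REQUIRED_ENV:
    missing = missing_vars(family)
    status = "ready" if not missing else "missing: " + ", ".join(missing)
    print(f"{family}: {status}")
```

Run it in the same shell where `.harbor-agents.env` was sourced; a family you do not plan to use can safely report missing variables.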
The paper configuration here uses Azure OpenAI for Codex-style models and AWS Bedrock Claude for Claude. If you use direct OpenAI/Anthropic API keys instead, add a new YAML under configs/models/ and set the right provider, harbor_agent_name, agent_model_name, agent_env, and host_litellm fields for your endpoint.
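
Before writing a new preset, it can help to confirm that a draft covers every field named above. This is a rough sketch under stated assumptions: it treats all five fields as required top-level keys, which the repository's schema may not do, and `missing_preset_fields` is a hypothetical helper. `python -m scripts.validate_configs` remains the authoritative check.

```python
# Field names taken from the paragraph above. Treating them all as required
# top-level keys is an assumption; the real schema may nest or relax them.
EXPECTED_FIELDS = (
    "provider",
    "harbor_agent_name",
    "agent_model_name",
    "agent_env",
    "host_litellm",
)


def missing_preset_fields(preset):
    """Return the expected field names that are absent from a preset mapping."""
    return [name for name in EXPECTED_FIELDS if name not in preset]


# A deliberately incomplete draft with placeholder values.
draft = {"provider": "<provider>", "agent_model_name": "<model-or-deployment-name>"}
print(missing_preset_fields(draft))  # ['harbor_agent_name', 'agent_env', 'host_litellm']
```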
```bash
python -m scripts.validate_configs
python -m scripts.validate_assets
python -m scripts.preflight
python -m scripts.run --baseline-name no_skill --order-seed A --dry-run
```

A healthy dry run schedules 180 tasks and does not call any model API.

For a stricter check including Harbor and Docker image readiness:

```bash
python -m scripts.preflight --strict
```

General pattern:
```bash
python -m scripts.run \
  --baseline-name <baseline_name> \
  --model-yaml configs/models/<model_preset>.yaml \
  --order-seed A
```

Azure OpenAI example:

```bash
python -m scripts.run \
  --baseline-name no_skill \
  --model-yaml configs/models/gpt-5.4.yaml \
  --order-seed A
```

AWS Bedrock Claude example:

```bash
python -m scripts.run \
  --baseline-name selfgen_experience_always \
  --model-yaml configs/models/claude-sonnet-4.6.yaml \
  --order-seed A
```

Use `--run-id` for a stable output directory:

```bash
python -m scripts.run \
  --baseline-name raw_trajectory_rag \
  --model-yaml configs/models/gpt-5.4.yaml \
  --order-seed A \
  --run-id rawrag_gpt54_seedA
```

Canonical baselines are defined in `skillevolbench/baselines/policy.py`.
| Baseline | Config name | What it tests |
|---|---|---|
| No Skill | `no_skill` | Bare agent with no skills, memory, or trajectory retrieval. |
| Raw Trajectory RAG | `raw_trajectory_rag` | Retrieves same-family raw learning trajectories instead of distilled skills. |
| Self-Generated Zero-Shot | `selfgen_zero_shot` | Creates skills before experience from task/family context. |
| Self-Generated Experience | `selfgen_experience_always` | Induces skills from experience and revises on every learning trial. |
| Curated Static | `curated_static` | Uses curated seed skills without revision. |
| Curated + Revision | `curated_with_revision_always` | Starts from curated skills and revises on every learning trial. |
Run the full canonical ladder for one model:

```bash
MODEL=configs/models/gpt-5.4.yaml
BASELINES=(
  no_skill
  raw_trajectory_rag
  selfgen_zero_shot
  selfgen_experience_always
  curated_static
  curated_with_revision_always
)
for BASELINE in "${BASELINES[@]}"; do
  python -m scripts.run \
    --baseline-name "$BASELINE" \
    --model-yaml "$MODEL" \
    --order-seed A
done
```

Or use the multi-run launcher after inspecting the plan:

```bash
python -m scripts.launch_multi_model \
  --baselines no_skill,raw_trajectory_rag,selfgen_zero_shot,selfgen_experience_always,curated_static,curated_with_revision_always \
  --models gpt-5.4 \
  --order-seed A \
  --max-workers 1 \
  --dry-run
```

Remove `--dry-run` to execute.
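
If you prefer to review the individual `scripts.run` invocations of a sweep before launching anything, the baseline × model grid can be expanded into command lines. This is a minimal sketch that only prints commands and never executes them. The `build_commands` helper and the run-ID naming scheme are hypothetical, loosely modeled on the `rawrag_gpt54_seedA` example; the flags are the ones documented above.

```python
import shlex
from itertools import product

# Canonical baseline config names from the baseline table.
BASELINES = [
    "no_skill",
    "raw_trajectory_rag",
    "selfgen_zero_shot",
    "selfgen_experience_always",
    "curated_static",
    "curated_with_revision_always",
]


def build_commands(models, baselines=BASELINES, order_seed="A"):
    """Return one scripts.run argv list per (model, baseline) pair."""
    commands = []
    for model, baseline in product(models, baselines):
        # Hypothetical run-ID scheme: strip '-' and '.' from the preset name.
        model_tag = model.replace("-", "").replace(".", "")
        commands.append([
            "python", "-m", "scripts.run",
            "--baseline-name", baseline,
            "--model-yaml", f"configs/models/{model}.yaml",
            "--order-seed", order_seed,
            "--run-id", f"{baseline}_{model_tag}_seed{order_seed}",
        ])
    return commands


# Print the plan as shell-quoted command lines; nothing is executed here.
for argv in build_commands(["gpt-5.4"]):
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or passed to `subprocess.run` once the plan looks right.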
| Ablation | Config name |
|---|---|
| Failure-only self-generated revision | selfgen_experience |
| Failure-only curated revision | curated_with_revision |
| Required Tier-3 skill files, self-generated | selfgen_experience_always_with_tier3 |
| Required Tier-3 skill files, curated | curated_with_revision_always_with_tier3 |
Run an ablation like any other baseline:

```bash
python -m scripts.run \
  --baseline-name selfgen_experience \
  --model-yaml configs/models/gpt-5.4.yaml \
  --order-seed A
```

| Model preset | Provider route | Credentials |
|---|---|---|
| `gpt-5.2-codex.yaml` | Azure OpenAI Codex CLI | `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY` |
| `gpt-5.3-codex.yaml` | Azure OpenAI Codex CLI | `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY` |
| `gpt-5.4.yaml` | Azure OpenAI Codex CLI | `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY` |
| `claude-sonnet-4.5.yaml` | AWS Bedrock Claude Code | `AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION` |
| `claude-sonnet-4.6.yaml` | AWS Bedrock Claude Code | `AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION` |
| `claude-opus-4.5.yaml` | AWS Bedrock Claude Code | `AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION` |
| `claude-opus-4.6.yaml` | AWS Bedrock Claude Code | `AWS_BEARER_TOKEN_BEDROCK`, `AWS_REGION` |
| `gemini-2.5-pro.yaml` | Gemini CLI direct | `GEMINI_API_KEY` |
| `gemini-3-flash.yaml` | Gemini CLI direct | `GEMINI_API_KEY` |
| `gemini-3.1-pro.yaml` | Gemini CLI direct | `GEMINI_API_KEY` |
| `kimi-2-thinking.yaml` | OpenAI-compatible Mantle/Kimi | `KIMI_BEDROCK_BASE_URL`, `KIMI_BEDROCK_API_KEY` |
| `kimi-2.5.yaml` | OpenAI-compatible Mantle/Kimi | `KIMI_BEDROCK_BASE_URL`, `KIMI_BEDROCK_API_KEY` |
Inspect a preset before editing:

```bash
sed -n '1,120p' configs/models/gpt-5.4.yaml
```

Runs are written to:

```text
workspace/runs/<run_id>/
```
Key files:
| Path | Meaning |
|---|---|
| `config.json` | Frozen run configuration. |
| `reports/full_report.json` | Main metrics report. |
| `stores/replay/` | Per-task replay records. |
| `stores/events/` | Runtime event logs. |
| `stores/retrieval/` | Retrieval traces when enabled. |
| `runtime/` | Per-task execution workdirs. |
| `harbor-job/` | Harbor job artifacts. |
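
For ad hoc inspection, the main report can be loaded directly from a run directory. The sketch below assumes only the layout listed in the table; the schema of `full_report.json` is not documented here, so it prints just the top-level keys. `latest_run` and `load_report` are hypothetical helper names, not part of the repository.

```python
import json
from pathlib import Path


def latest_run(runs_root):
    """Return the most recently modified run directory, or None if there is none."""
    root = Path(runs_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime) if runs else None


def load_report(run_dir):
    """Load reports/full_report.json from a run directory."""
    report_path = Path(run_dir) / "reports" / "full_report.json"
    with report_path.open(encoding="utf-8") as f:
        return json.load(f)


run = latest_run("workspace/runs")
if run is None:
    print("no runs found under workspace/runs")
elif not (run / "reports" / "full_report.json").is_file():
    print(f"{run.name}: no full_report.json yet")
else:
    report = load_report(run)
    # The report schema is not assumed; show only its top-level shape.
    keys = sorted(report) if isinstance(report, dict) else type(report).__name__
    print(run.name, keys)
```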
Summarize a run:

```bash
python -m scripts.summarize workspace/runs/<run_id>
```

Find recent runs:

```bash
ls -lt workspace/runs | head
```

| Symptom | Likely cause | Fix |
|---|---|---|
| `Please set an Auth method... GEMINI_API_KEY` | `.harbor-agents.env` was not sourced or `GEMINI_API_KEY` is missing. | Source `.harbor-agents.env` in the same shell. |
| `No module named 'harbor'` | Harbor SDK is missing. | `python -m pip install git+https://github.com/harbor-framework/harbor.git` |
| `agent-runtime:latest` missing | Docker image has not been built. | `bash docker/agent-build/build.sh` |
| Azure 401/403 | Endpoint/key/deployment mismatch. | Check `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `agent_model_name`. |
| Bedrock model identifier error | Region/profile mismatch. | Use the `us.anthropic...` model IDs in `configs/models/claude-*.yaml` and set `AWS_REGION`. |
```bibtex
@article{lei2026skillevolbench,
  title={SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills},
  author={Lei, Yingtie and Wan, Zhongwei and Zhang, Jiankun and Alam, Samiul and Zhong, Zixuan and Huang, Peizhou and Wang, Xin and Zhang, Jingxuan and Zhou, Donghao and Hsieh, Yunta and others},
  journal={arXiv preprint arXiv:2605.24117},
  year={2026}
}
```