feat: add xtable-converter Claude Code skill#829
Conversation
Adds a natural-language XTable conversion skill that extracts slots from user requests, generates datasetConfig YAML, runs the XTable bundled jar, and produces a verified per-table SUCCESS/FAILED/UNVERIFIED verdict. Includes no-op sync detection, jol-core workaround docs, and team-wide permission rules to suppress prompts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| @@ -0,0 +1,11 @@ | |||
| { | |||
There was a problem hiding this comment.
Scope/governance question before anything else: this commits Claude-Code-specific tooling and a team-wide permission allowlist into a vendor-neutral ASF incubator repo. Everyone who clones and runs Claude Code here inherits these auto-approve rules. Has this been raised on dev@ / with the PPMC? At minimum, consider whether the skill belongs in-repo vs. a separate tools repo, and whether committed permissions are appropriate for an Apache project (vs. each contributor opting in via the git-ignored settings.local.json).
There was a problem hiding this comment.
@vinishjail97 I am thinking to get rid of this and have settlings.local.json or adding a permission section in SKILLS.md and taking the permissions form user on prompts.Let me know what do you think?
| "allow": [ | ||
| "Bash(mkdir -p /tmp/xtable_run*)", | ||
| "Bash(python3 .claude/skills/xtable-converter/scripts/*)", | ||
| "Bash(java -cp *)", |
There was a problem hiding this comment.
Bash(java -cp *) auto-approves running any Java main class with any classpath and args, with no prompt, for every contributor who clones the repo — that is effectively arbitrary-code-execution approval. Scope this down to the exact invocation the jol-core workaround needs (a specific main class / jar pattern), or drop it and let it prompt.
| for tgt in targets: | ||
| status, message = "unknown", None | ||
| # error lines mentioning the table, or table mentions inside a stack-trace window | ||
| for i, ln in mentions: |
There was a problem hiding this comment.
Error detection isn't target-scoped. mentions is every line that names the table (or its basePath), so this error branch fires for all targets of a table as soon as any one of them logs an error. The success branch below correctly gates on tgt.lower() in ln.lower(), but this one doesn't. Combined with verify.py — which short-circuits log_status == "error" straight to FAILED and never runs the storage check — a single failed target produces false FAILED verdicts for sibling targets that actually succeeded. Scope the error match to the target the same way the success match is scoped.
| return "missing", f"no matching metadata at {url}" | ||
| import datetime | ||
| try: | ||
| ts = datetime.datetime.strptime(newest[0], "%Y-%m-%d %H:%M:%S").timestamp() |
There was a problem hiding this comment.
datetime.strptime(...).timestamp() interprets the naive datetime in the host's local timezone, but aws s3 ls prints timestamps in UTC. On any non-UTC machine the freshness comparison is wrong by the tz offset, yielding spurious stale/fresh. Use calendar.timegm(struct_time) (or attach tzinfo=timezone.utc) so the epoch is computed as UTC.
| ### Step 3 — Run the conversion | ||
|
|
||
| ```bash | ||
| bash scripts/run_xtable.sh /tmp/xtable_run/config.yaml \ |
There was a problem hiding this comment.
This invokes bash scripts/run_xtable.sh (relative to the skill dir), but the committed allow rule is Bash(bash .claude/skills/xtable-converter/scripts/*). The two won't match, so this will still prompt. Make the documented command and the permission pattern use the same path.
| Write it to disk immediately — do not wait for confirmation: | ||
|
|
||
| ``` | ||
| RUN_DIR=/tmp/xtable_run |
There was a problem hiding this comment.
RUN_DIR=/tmp/xtable_run is a fixed path reused on every invocation, so run_meta.json and the run logs are overwritten each run and two concurrent conversions race on the same files. Use a per-run dir (mktemp -d /tmp/xtable_run.XXXXXX) and pass it through.
| METADATA_CHECKS = { | ||
| "DELTA": {"subdir": "_delta_log", "pattern": r"\.json$"}, | ||
| "ICEBERG": {"subdir": "metadata", "pattern": r"\.metadata\.json$"}, | ||
| "HUDI": {"subdir": ".hoodie", "pattern": r"(\.commit|\.deltacommit|\.replacecommit|hoodie\.properties)$"}, |
There was a problem hiding this comment.
The HUDI freshness pattern includes hoodie.properties, which is written once at table creation and not rewritten per sync. Since freshness is max(mtime) over matches, a present-but-old timeline plus a hoodie.properties from creation could still read as the newest file and skew the check. Restrict the freshness match to timeline instants (*.commit/*.deltacommit/*.replacecommit) and keep hoodie.properties only for the existence check if needed.
- Scope error detection per-target in parse_result.py (false FAILED verdicts on sibling targets when one target fails) - Fix UTC timezone bug in verify.py check_s3 (aws s3 ls timestamps are UTC, not local time) - Remove hoodie.properties from HUDI freshness pattern (written once at creation, skews max(mtime) check) - Replace fixed /tmp/xtable_run path with mktemp UTC-timestamped dir to prevent concurrent run collisions - Fix script path mismatch between SKILL.md commands and settings.json allow rules (relative vs absolute paths) - Scope java invocation to PATH prepend so bare 'java' allow rule matches - Update mktemp allow rule in settings.json (replaces stale mkdir rule)
|
@vinishjail97 I have addressed all your comment in my latest commit.Please check. |
Adds a natural-language XTable conversion skill that extracts slots from user requests, generates datasetConfig YAML, runs the XTable bundled jar, and produces a verified per-table SUCCESS/FAILED/UNVERIFIED verdict. Includes no-op sync detection, jol-core workaround docs, and team-wide permission rules to suppress prompts.
Usage -
Convert my hudi table at some_path to iceberg table