Skip to content

feat: add xtable-converter Claude Code skill#829

Open
vaibhavk1992 wants to merge 2 commits into
apache:mainfrom
vaibhavk1992:feat/xtable-converter-skill-upstream
Open

feat: add xtable-converter Claude Code skill#829
vaibhavk1992 wants to merge 2 commits into
apache:mainfrom
vaibhavk1992:feat/xtable-converter-skill-upstream

Conversation

@vaibhavk1992

Copy link
Copy Markdown
Contributor

Adds a natural-language XTable conversion skill that extracts slots from user requests, generates datasetConfig YAML, runs the XTable bundled jar, and produces a verified per-table SUCCESS/FAILED/UNVERIFIED verdict. Includes no-op sync detection, jol-core workaround docs, and team-wide permission rules to suppress prompts.

Usage - Convert my hudi table at some_path to iceberg table

Adds a natural-language XTable conversion skill that extracts slots
from user requests, generates datasetConfig YAML, runs the XTable
bundled jar, and produces a verified per-table SUCCESS/FAILED/UNVERIFIED
verdict. Includes no-op sync detection, jol-core workaround docs,
and team-wide permission rules to suppress prompts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread .claude/settings.json
@@ -0,0 +1,11 @@
{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope/governance question before anything else: this commits Claude-Code-specific tooling and a team-wide permission allowlist into a vendor-neutral ASF incubator repo. Everyone who clones and runs Claude Code here inherits these auto-approve rules. Has this been raised on dev@ / with the PPMC? At minimum, consider whether the skill belongs in-repo vs. a separate tools repo, and whether committed permissions are appropriate for an Apache project (vs. each contributor opting in via the git-ignored settings.local.json).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinishjail97 I am thinking to get rid of this and have settlings.local.json or adding a permission section in SKILLS.md and taking the permissions form user on prompts.Let me know what do you think?

Comment thread .claude/settings.json
"allow": [
"Bash(mkdir -p /tmp/xtable_run*)",
"Bash(python3 .claude/skills/xtable-converter/scripts/*)",
"Bash(java -cp *)",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bash(java -cp *) auto-approves running any Java main class with any classpath and args, with no prompt, for every contributor who clones the repo — that is effectively arbitrary-code-execution approval. Scope this down to the exact invocation the jol-core workaround needs (a specific main class / jar pattern), or drop it and let it prompt.

for tgt in targets:
status, message = "unknown", None
# error lines mentioning the table, or table mentions inside a stack-trace window
for i, ln in mentions:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error detection isn't target-scoped. mentions is every line that names the table (or its basePath), so this error branch fires for all targets of a table as soon as any one of them logs an error. The success branch below correctly gates on tgt.lower() in ln.lower(), but this one doesn't. Combined with verify.py — which short-circuits log_status == "error" straight to FAILED and never runs the storage check — a single failed target produces false FAILED verdicts for sibling targets that actually succeeded. Scope the error match to the target the same way the success match is scoped.

return "missing", f"no matching metadata at {url}"
import datetime
try:
ts = datetime.datetime.strptime(newest[0], "%Y-%m-%d %H:%M:%S").timestamp()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datetime.strptime(...).timestamp() interprets the naive datetime in the host's local timezone, but aws s3 ls prints timestamps in UTC. On any non-UTC machine the freshness comparison is wrong by the tz offset, yielding spurious stale/fresh. Use calendar.timegm(struct_time) (or attach tzinfo=timezone.utc) so the epoch is computed as UTC.

### Step 3 — Run the conversion

```bash
bash scripts/run_xtable.sh /tmp/xtable_run/config.yaml \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This invokes bash scripts/run_xtable.sh (relative to the skill dir), but the committed allow rule is Bash(bash .claude/skills/xtable-converter/scripts/*). The two won't match, so this will still prompt. Make the documented command and the permission pattern use the same path.

Write it to disk immediately — do not wait for confirmation:

```
RUN_DIR=/tmp/xtable_run

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RUN_DIR=/tmp/xtable_run is a fixed path reused on every invocation, so run_meta.json and the run logs are overwritten each run and two concurrent conversions race on the same files. Use a per-run dir (mktemp -d /tmp/xtable_run.XXXXXX) and pass it through.

METADATA_CHECKS = {
"DELTA": {"subdir": "_delta_log", "pattern": r"\.json$"},
"ICEBERG": {"subdir": "metadata", "pattern": r"\.metadata\.json$"},
"HUDI": {"subdir": ".hoodie", "pattern": r"(\.commit|\.deltacommit|\.replacecommit|hoodie\.properties)$"},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HUDI freshness pattern includes hoodie.properties, which is written once at table creation and not rewritten per sync. Since freshness is max(mtime) over matches, a present-but-old timeline plus a hoodie.properties from creation could still read as the newest file and skew the check. Restrict the freshness match to timeline instants (*.commit/*.deltacommit/*.replacecommit) and keep hoodie.properties only for the existence check if needed.

- Scope error detection per-target in parse_result.py (false FAILED verdicts
  on sibling targets when one target fails)
- Fix UTC timezone bug in verify.py check_s3 (aws s3 ls timestamps are UTC,
  not local time)
- Remove hoodie.properties from HUDI freshness pattern (written once at
  creation, skews max(mtime) check)
- Replace fixed /tmp/xtable_run path with mktemp UTC-timestamped dir to
  prevent concurrent run collisions
- Fix script path mismatch between SKILL.md commands and settings.json allow
  rules (relative vs absolute paths)
- Scope java invocation to PATH prepend so bare 'java' allow rule matches
- Update mktemp allow rule in settings.json (replaces stale mkdir rule)
@vaibhavk1992

Copy link
Copy Markdown
Contributor Author

@vinishjail97 I have addressed all your comment in my latest commit.Please check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants