
[CI] Surface upstream pytorch CI job link in parity summary#3264

Open
pablo-garay wants to merge 4 commits into develop from parity-summary-job-id-column

Conversation


@pablo-garay pablo-garay commented May 29, 2026

Summary

  • Adds a clickable Job ID column at the end of both the FAILED TESTS and LOG-BASED FAILURES (not in XML) tables in the parity summary markdown. Each cell renders as [<job_id>](https://github.com/pytorch/pytorch/actions/runs/<wf>/job/<job_id>), dropping the reviewer one click away from the stacktrace.
  • Threads the upstream pytorch/pytorch CI job URL through the existing pipeline: download_testlogs was already fetching that information, but it wasn't being preserved. No new API calls; no schema migrations; just persistence through download_testlogs → summarize_xml_testreports.py / detect_log_failures.py → generate_summary.py.
  • Backwards-compatible: every consumer reads the new fields via .get(..., '') / os.path.isfile, so older artifacts and CSVs render the column as empty cells instead of breaking.
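
The backwards-compatibility bullet above can be illustrated with a minimal sketch. It assumes the consumers read CSV rows as dicts and that the companion files are plain text. The helper names `job_urls_from_csv` and `read_optional_file` are hypothetical and do not appear in this PR.

```python
import csv
import io
import os
import tempfile


def job_urls_from_csv(csv_text):
    """Return job_url per row, or '' for CSVs written before the column existed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row.get("job_url", "") for row in reader]


def read_optional_file(path):
    """Return the stripped file contents, or '' when the file is absent."""
    if not os.path.isfile(path):
        return ""
    with open(path) as f:
        return f.read().strip()


# Hypothetical CSV contents: one written before the column existed, one after.
old_csv = "test_file,test_name\ntest_foo,test_baz\n"
new_csv = "test_file,test_name,job_url\ntest_foo,test_baz,https://example.invalid/job/1\n"
print(job_urls_from_csv(old_csv))
print(job_urls_from_csv(new_csv))

# A historical artifact has no companion file, so the read falls back to ''.
with tempfile.TemporaryDirectory() as tmp:
    missing = read_optional_file(os.path.join(tmp, "rocm1.txt.job_url"))
print(repr(missing))
```

Because both fallbacks yield an empty string, a renderer downstream can treat "no URL" and "old artifact" identically.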

Example resulting row (FAILED TESTS, set2-disabled case)

| Arch | Test Config | Test File | Test Class | Test Name | Job-Level Shard (rocm) | Test-Level Shard (rocm) | Status (rocm) | Also Failing In | Job ID (rocm) |
|---|---|---|---|---|---|---|---|---|---|
| mi300 | default | test_foo | TestBar | test_baz | 3/6 | 5/15 | FAILED | mi355 | [76905282313](https://github.com/pytorch/pytorch/actions/runs/26146653222/job/76905282313) |
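
As a rough sketch of how the URL in the example row could be assembled from a shard directory name and a workflow run id: the regex, the function name `build_job_url`, and the directory name `test-default-3-6_76905282313` are assumptions for illustration, not code from this PR.

```python
import re

# Assumption: after shortening, a shard dir name ends in "_<digits>" (the job id).
JOB_ID_SUFFIX = re.compile(r"_(\d+)$")


def build_job_url(shard_dir_name, wf_run_id, repo="pytorch/pytorch"):
    """Build the job page URL, or return '' if either piece is missing."""
    job_id_match = JOB_ID_SUFFIX.search(shard_dir_name)
    if not job_id_match or not wf_run_id:
        return ""
    job_id = job_id_match.group(1)
    return f"https://github.com/{repo}/actions/runs/{wf_run_id}/job/{job_id}"


url = build_job_url("test-default-3-6_76905282313", "26146653222")
print(url)

# A dir name without the suffix (e.g. from an older artifact) yields ''.
print(repr(build_job_url("test-default-3-6", "26146653222")))
```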

Data flow

  • FAILED TESTS (XML-based): _shorten_unzipped_dirs keeps the trailing _<jobid> of the artifact name on each test-<cfg>-N-N/ dir → download_xml_files writes one _wf_run_id file at the parent → parse_xml_reports_as_dict builds the URL and stamps it on each test case → per-arch CSV carries job_url_{set_name} → collect_failed_tests propagates → markdown renders.
  • LOG-BASED FAILURES: write_test_log_to_file writes a companion <filename>.job_url file (full url from the job's html_url) → scan_logs reads it and stamps job_url on every failure / flaky row → log_failures_<arch>.csv / flaky_tests_<arch>.csv carry it → load_log_failures / load_flaky_tests_as_log_failures propagate → markdown renders.
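
The log-based path above can be sketched end to end as follows. This is a simplified illustration, not the PR's code: the function names are hypothetical, and treating any line containing "FAILED" as a failure is a placeholder for whatever detect_log_failures.py actually matches.

```python
import csv
import io
import os
import tempfile


def write_log_with_job_url(log_path, log_text, html_url):
    """Write the log plus a companion '<log>.job_url' file holding the job page URL."""
    with open(log_path, "w") as f:
        f.write(log_text)
    with open(log_path + ".job_url", "w") as f:
        f.write(html_url)


def scan_log(log_path):
    """Return one row per 'FAILED' line (placeholder rule), stamped with the job URL."""
    url_path = log_path + ".job_url"
    job_url = ""
    if os.path.isfile(url_path):
        with open(url_path) as f:
            job_url = f.read().strip()
    with open(log_path) as f:
        return [{"line": line.strip(), "job_url": job_url} for line in f if "FAILED" in line]


def rows_to_csv(rows):
    """Serialize the rows with job_url as the last column."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["line", "job_url"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


JOB_URL = "https://github.com/pytorch/pytorch/actions/runs/26146653222/job/76905282313"

with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "rocm1.txt")
    write_log_with_job_url(log, "test_a PASSED\ntest_b FAILED\n", JOB_URL)
    rows = scan_log(log)
    csv_text = rows_to_csv(rows)

print(csv_text)
```

Keeping the URL in a sidecar file means the log text itself stays byte-for-byte what was downloaded.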

Test plan

  • Trigger a parity.yml run and confirm:
    • Per-arch test-report shard dirs are named test-<cfg>-N-N_<jobid> after _shorten_unzipped_dirs.
    • _wf_run_id file exists alongside the shard dirs in rocm_xml/ and cuda_xml/.
    • <filename>.job_url companion files exist next to each rocm*.txt / cuda*.txt log file.
  • Inspect the per-arch CSV emitted by summarize_xml_testreports.py and confirm job_url_<set1_name> / job_url_<set2_name> columns are populated for failing rows.
  • Inspect log_failures_<arch>.csv / flaky_tests_<arch>.csv and confirm job_url column is populated.
  • Inspect the parity summary markdown artifact and click a Job ID cell in both tables → lands on the failing pytorch/pytorch job page with the stacktrace.
  • Re-run against a historical commit whose artifacts predate this change and confirm cells render as empty (no crash, no broken table).

pablo-garay force-pushed the parity-summary-job-id-column branch from 509bb1f to f10baac on May 29, 2026 22:43

The parity summary's FAILED TESTS and LOG-BASED FAILURES tables list the
failing test tuples but stop short of pointing the reviewer at the
upstream pytorch/pytorch CI job that actually ran the test - making it
several extra clicks to land on the stacktrace.

download_testlogs already knows the job id of every artifact and log file
it pulls. Persist it through the pipeline and surface it as a clickable
"Job ID" column at the end of both tables:

- download_testlogs: keep the trailing "_<jobid>" segment of the original
  artifact name when shortening unzipped XML dirs, and write a single
  "_wf_run_id" file at the parent rocm_xml/cuda_xml level. For per-log
  artifacts, write a companion "<filename>.job_url" file with the
  canonical html_url from the GitHub API job object.
- summarize_xml_testreports.py: read _wf_run_id once, parse "_<jobid>"
  off each test-<cfg>-N-N dir, stamp a job_url on every test case, and
  emit job_url_{set1_name}/job_url_{set2_name} columns in the per-arch
  CSV.
- detect_log_failures.py: read the per-log .job_url file and stamp
  job_url on every emitted failure/flaky row; add job_url to both CSV
  writers.
- generate_summary.py: propagate job_url_* through collect_failed_tests
  and through the flaky-as-log-failure loader, and add a "Job ID" column
  at the end of both markdown tables rendered as [<jobid>](<url>).

Every read uses .get(..., '') / os.path.isfile, so existing artifacts and
CSVs without the new fields render as empty cells and the pipeline keeps
working unchanged.

Signed-off-by: Garay-Fernandez <pgarayfe@amd.com>
pablo-garay force-pushed the parity-summary-job-id-column branch from f10baac to 11af07f on May 29, 2026 22:45

rocm-repo-management-api Bot commented May 29, 2026

Jenkins build for 842bb2995390e4be256dbfd15dfa0e65b7da4b1f commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

Drop comment bloat and a try/except layer that didn't match the
surrounding style:

- _shorten_unzipped_dirs: keep the original `if m:` structure and add the
  job id suffix inside it, instead of restructuring with `if not m:
  continue` + extra locals.
- write_test_log_to_file / download_xml_files / scan_logs /
  parse_xml_reports_as_dict: drop the try/except around the small
  per-file reads and writes; other file IO in these functions doesn't
  guard either.
- parse_xml_reports_as_dict: always set case["job_url"] (empty string
  when absent), mirroring how case["shard"] is set unconditionally one
  line above.
- generate_summary.py: rename _job_url_cell -> _job_id_link to reflect
  what the markdown cell actually shows; drop its docstring.

No behavior change.

Signed-off-by: Garay-Fernandez <pgarayfe@amd.com>

rocm-repo-management-api Bot commented May 29, 2026

Jenkins build for cb9a83d5226e30b77a3cae86ad61db5ad4783c24 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

Explain why these reads/writes exist: how download_testlogs hands the
upstream pytorch CI job id to summarize_xml_testreports.py (via the
"_<jobid>" suffix on each shard dir + a "_wf_run_id" file at the parent),
and the job page URL to detect_log_failures.py (via a "<log>.job_url"
file next to each log). Also rename the local "jid" match to
"job_id_match" so the code reads on first pass without context.

No behavior change.

Signed-off-by: Garay-Fernandez <pgarayfe@amd.com>

rocm-repo-management-api Bot commented May 29, 2026

Jenkins build for d0120b1e943d8ff4f834c80084b1bc2a35c26bff commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

Split the trailing ternary that was deciding both the label and the
whole f-string into a "get label, then build link" flow. Drop the
[job](url) fallback: both URL writers (summarize_xml_testreports.py and
write_test_log_to_file) produce URLs containing "/job/<digits>", so the
fallback never fired in practice and a cell labeled "job" wouldn't tell
a reviewer anything. If the URL is malformed, render an empty cell -
same as when no URL is available.

No behavior change for the URLs we actually emit.

Signed-off-by: Garay-Fernandez <pgarayfe@amd.com>
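
The "get label, then build link" flow described in the commit message above might look roughly like this. It is a sketch under assumptions: the name `job_id_link_sketch` and the regex are illustrative, and the actual `_job_id_link` in generate_summary.py may differ.

```python
import re

# Assumption: emitted URLs contain "/job/<digits>", as the commit message states.
JOB_ID_IN_URL = re.compile(r"/job/(\d+)")


def job_id_link_sketch(job_url):
    """Return '[<job_id>](<url>)', or '' when the URL is missing or malformed."""
    match = JOB_ID_IN_URL.search(job_url or "")
    if not match:
        return ""  # same empty cell as when no URL is available
    label = match.group(1)
    return f"[{label}]({job_url})"


good = "https://github.com/pytorch/pytorch/actions/runs/26146653222/job/76905282313"
print(job_id_link_sketch(good))
print(repr(job_id_link_sketch("")))
print(repr(job_id_link_sketch("https://example.invalid/no-job-segment")))
```

Separating the label lookup from the link construction leaves a single early return for every "no usable URL" case.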

rocm-repo-management-api Bot commented May 29, 2026

Jenkins build for d0120b1e943d8ff4f834c80084b1bc2a35c26bff commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
