39 commits
4fa7069
min API for external invocation
alexandraBara Apr 1, 2026
0731d59
entry point for connection
alexandraBara Apr 1, 2026
b964f54
dynamically load connection
alexandraBara Apr 1, 2026
380e98a
updaes
alexandraBara Apr 10, 2026
faea344
added cpu/gpu_count to system_info
alexandraBara Apr 10, 2026
baa4724
supported sku updates
alexandraBara Apr 10, 2026
b52db2f
Merge branch 'development' into alex_invocation
alexandraBara Apr 13, 2026
4116424
docs: Update plugin documentation [automated]
github-actions[bot] Apr 14, 2026
cbde771
Merge pull request #176 from amd/alex_invocation
alexandraBara Apr 14, 2026
d32588a
Merge branch 'development' into automated-plugin-docs-update
alexandraBara Apr 14, 2026
9b87768
Merge pull request #187 from amd/automated-plugin-docs-update
alexandraBara Apr 14, 2026
7d42526
added new API
alexandraBara Apr 16, 2026
4117ead
adding --plugin-config= opt
alexandraBara Apr 16, 2026
476ec2e
updates
alexandraBara Apr 20, 2026
e042dd4
undo commit
alexandraBara Apr 20, 2026
a3b791e
Created a new input to the dmesg analyzer allowing for a list of rule…
niratner Apr 20, 2026
f91b146
Added 'match_all' flag to dmesg analyzer priority_override_rules to a…
niratner Apr 20, 2026
89727a4
updates
alexandraBara Apr 21, 2026
3787fe0
updated syntax of optional return value in dmesg plugin analyzer func…
niratner Apr 21, 2026
66faa69
Updated README with usage example, updated resolve_priority docstring…
niratner Apr 21, 2026
de3df44
nodescraper/plugins/inband/rdma/rdma_collector.py
Apr 21, 2026
1939808
rdma fix
Apr 21, 2026
5b73934
Moved the functionality to update ErrorRegex priorities based on rule…
niratner Apr 23, 2026
29aece1
Updated comments
niratner Apr 23, 2026
69ba66a
updated unit tests with new priority override logic
niratner Apr 23, 2026
18ae2c7
updated test comment
niratner Apr 23, 2026
4e49aca
Merge pull request #188 from amd/niratner_dmesgplugin_custom_priority…
alexandraBara Apr 23, 2026
a0a5449
event category fix
jaspals3123 Apr 23, 2026
8e1fbee
Merge branch 'development' into jaspal_rdmafix
jaspals3123 Apr 23, 2026
68d6894
sys info print when not none
alexandraBara Apr 24, 2026
92b4344
Merge branch 'development' into alex_sku
alexandraBara Apr 24, 2026
887656e
Merge pull request #189 from amd/jaspal_rdmafix
alexandraBara Apr 24, 2026
00eabad
Merge branch 'development' into alex_sku
alexandraBara Apr 24, 2026
22519c0
Merge branch 'development' into alex_subcommands
alexandraBara Apr 24, 2026
806eb8e
Merge pull request #191 from amd/alex_subcommands
alexandraBara Apr 24, 2026
a1b1032
docs: Update plugin documentation [automated]
github-actions[bot] Apr 25, 2026
55018a6
Merge pull request #192 from amd/automated-plugin-docs-update
alexandraBara Apr 27, 2026
28232b5
Merge branch 'development' into alex_sku
alexandraBara Apr 27, 2026
fba3bc8
Merge pull request #190 from amd/alex_sku
alexandraBara Apr 27, 2026
34 changes: 23 additions & 11 deletions README.md
@@ -78,7 +78,7 @@ usage: cli.py [-h] [--version] [--sys-name STRING]
               [--sys-location {LOCAL,REMOTE}]
               [--sys-interaction-level {PASSIVE,INTERACTIVE,DISRUPTIVE}]
               [--sys-sku STRING] [--sys-platform STRING]
-              [--plugin-configs [STRING ...]] [--system-config STRING]
+              [--plugin-configs LIST] [--system-config STRING]
               [--connection-config STRING] [--log-path STRING]
               [--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}]
               [--no-console-log] [--gen-reference-config] [--skip-sudo]
@@ -112,10 +112,11 @@ options:
   --sys-sku STRING      Manually specify SKU of system (default: None)
   --sys-platform STRING
                         Specify system platform (default: None)
-  --plugin-configs [STRING ...]
-                        built-in config names or paths to plugin config JSONs.
-                        Available built-in configs: NodeStatus, AllPlugins
-                        (default: None)
+  --plugin-configs LIST
+                        Comma-separated built-in names and/or plugin config
+                        JSON paths (e.g. --plugin-
+                        configs=NodeStatus,/path/c.json). Built-ins:
+                        NodeStatus, AllPlugins (default: None)
   --system-config STRING
                         Path to system config json (default: None)
   --connection-config STRING
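
The help text above changes `--plugin-configs` from a space-separated list (`[STRING ...]`) to a single comma-separated `LIST` value. A minimal sketch of how such a flag can be parsed with `argparse` follows. The helper `comma_separated` and the function `build_parser` are hypothetical names for illustration and are not taken from the project's `cli.py`; only the flag name and metavar mirror the diff.

```python
import argparse


def comma_separated(value: str) -> list[str]:
    """Split 'a,b,c' into ['a', 'b', 'c'], dropping empty items and stray spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the --plugin-configs flag mirrors the help text.
    parser = argparse.ArgumentParser(prog="cli.py")
    parser.add_argument(
        "--plugin-configs",
        type=comma_separated,  # one token in, list of names/paths out
        metavar="LIST",
        default=None,
        help="Comma-separated built-in names and/or plugin config JSON paths",
    )
    return parser


args = build_parser().parse_args(["--plugin-configs=NodeStatus,/path/c.json"])
print(args.plugin_configs)  # ['NodeStatus', '/path/c.json']
```

One plausible reason for the change: an option declared with `nargs="*"` greedily consumes every following token that does not look like a flag, so a subcommand such as `run-plugins` placed after it can be swallowed as a config name. A single comma-separated value does not have that ambiguity.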
@@ -337,6 +338,16 @@ You can extend the built-in error detection with custom regex patterns. Create a
             "event_category": "SW_DRIVER",
             "event_priority": 4
           }
+        ],
+        "priority_override_rules": [
+          {
+            "message": "Application Crash",
+            "new_priority": "ERROR"
+          },
+          {
+            "event_category": "SW_DRIVER",
+            "new_priority": "WARNING"
+          }
         ]
       }
     }
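
The `priority_override_rules` entries added above pair a match criterion (`message` or `event_category`) with a `new_priority`. The sketch below shows one way such rules could be applied to a detected event. It is an illustration under stated assumptions, not the analyzer's actual logic: the `Event` class and `apply_overrides` function are hypothetical, first-match-wins ordering is assumed, and the `match_all` flag mentioned in the commit list is not modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Event:
    # Hypothetical stand-in for a matched dmesg error; not the project's model.
    message: str
    event_category: str
    priority: str


def apply_overrides(event: Event, rules: list[dict]) -> Event:
    """Return the event with the first matching rule's new_priority applied."""
    for rule in rules:
        # Every key other than new_priority is treated as a field that must match.
        criteria = {k: v for k, v in rule.items() if k != "new_priority"}
        if criteria and all(getattr(event, k, None) == v for k, v in criteria.items()):
            return replace(event, priority=rule["new_priority"])
    return event  # no rule matched: priority is left unchanged


rules = [
    {"message": "Application Crash", "new_priority": "ERROR"},
    {"event_category": "SW_DRIVER", "new_priority": "WARNING"},
]
crash = Event("Application Crash", "SW_DRIVER", "CRITICAL")
other = Event("Driver timeout", "SW_DRIVER", "ERROR")
print(apply_overrides(crash, rules).priority)  # ERROR
print(apply_overrides(other, rules).priority)  # WARNING
```

Under the first-match assumption, rule order matters: the crash event matches both rules, and the earlier, more specific `message` rule decides its priority.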
@@ -348,7 +359,7 @@ You can extend the built-in error detection with custom regex patterns. Create a
 Save this to `dmesg_custom_config.json` and run:

 ```sh
-node-scraper --plugin-configs dmesg_custom_config.json run-plugins DmesgPlugin
+node-scraper --plugin-configs=dmesg_custom_config.json run-plugins DmesgPlugin
 ```

 #### **'compare-runs' subcommand**
@@ -539,8 +550,9 @@ Built-in configs include **NodeStatus** (a subset of plugins) and **AllPlugins**
 registered plugin with default arguments—useful for generating a reference config from the full system).

 **NodeStatus plus additional plugins** — built-in configs merge with plugins named after `run-plugins`.
-Use **`--plugin-configs=<name>`** (equals form): with a space
-after `--plugin-configs`. See below for examples:
+Values are comma-separated; pass as **`--plugin-configs=…`** or **`--plugin-configs` …** (same as other
+optional flags), e.g. `--plugin-configs=NodeStatus,/path/extra.json`.
+Examples:
 ```sh
 node-scraper --plugin-configs=NodeStatus run-plugins PciePlugin
 ```
@@ -551,7 +563,7 @@ node-scraper --log-path ./logs --plugin-configs=NodeStatus run-plugins PciePlugi

 Using a JSON file:
 ```sh
-node-scraper --plugin-configs plugin_config.json
+node-scraper --plugin-configs=plugin_config.json
 ```
 Here is an example of a comprehensive plugin config that specifies analyzer args for each plugin:
 ```json
@@ -613,7 +625,7 @@ data.

 **Run all registered plugins (AllPlugins config):**
 ```sh
-node-scraper --plugin-config AllPlugins
+node-scraper --plugin-configs=AllPlugins

 ```

@@ -647,7 +659,7 @@ This will generate the following config:
 ```
 This config can later be used on a different platform for comparison, using the steps at #2:
 ```sh
-node-scraper --plugin-configs reference_config.json
+node-scraper --plugin-configs=reference_config.json

 ```

96 changes: 3 additions & 93 deletions docs/PLUGIN_DOC.md
@@ -4,7 +4,7 @@

| Plugin | Collection | Analyzer Args | Collection Args | DataModel | Collector | Analyzer |
| --- | --- | --- | --- | --- | --- | --- |
-| AmdSmiPlugin | bad-pages<br>firmware --json<br>list --json<br>metric -g all<br>partition --json<br>process --json<br>ras --cper --folder={folder}<br>ras --afid --cper-file {cper_file}<br>static -g all --json<br>static -g {gpu_id} --json<br>topology<br>version --json<br>xgmi -l<br>xgmi -m | **Analyzer Args:**<br>- `check_static_data`: bool — If True, run static data checks (e.g. driver version, partition mode).<br>- `expected_gpu_processes`: Optional[int] — Expected number of GPU processes.<br>- `expected_max_power`: Optional[int] — Expected maximum power value (e.g. watts).<br>- `expected_driver_version`: Optional[str] — Expected AMD driver version string.<br>- `expected_memory_partition_mode`: Optional[str] — Expected memory partition mode (e.g. sp3, dp).<br>- `expected_compute_partition_mode`: Optional[str] — Expected compute partition mode.<br>- `expected_firmware_versions`: Optional[dict[str, str]] — Expected firmware versions keyed by amd-smi fw_id (e.g. PLDM_BUNDLE).<br>- `l0_to_recovery_count_error_threshold`: Optional[int] — L0-to-recovery count above which an error is raised.<br>- `l0_to_recovery_count_warning_threshold`: Optional[int] — L0-to-recovery count above which a warning is raised.<br>- `vendorid_ep`: Optional[str] — Expected endpoint vendor ID (e.g. for PCIe).<br>- `vendorid_ep_vf`: Optional[str] — Expected endpoint VF vendor ID.<br>- `devid_ep`: Optional[str] — Expected endpoint device ID.<br>- `devid_ep_vf`: Optional[str] — Expected endpoint VF device ID.<br>- `sku_name`: Optional[str] — Expected SKU name string for GPU.<br>- `expected_xgmi_speed`: Optional[list[float]] — Expected xGMI speed value(s) (e.g. link rate).<br>- `analysis_range_start`: Optional[datetime.datetime] — Start of time range for time-windowed analysis.<br>- `analysis_range_end`: Optional[datetime.datetime] — End of time range for time-windowed analysis. | **Collection Args:**<br>- `cper_file_path`: Optional[str] — Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file). | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) |
+| AmdSmiPlugin | bad-pages<br>firmware --json<br>list --json<br>metric -g all<br>partition --json<br>process --json<br>ras --cper --folder={folder}<br>ras --afid --cper-file {cper_file}<br>static -g all --json<br>static -g {gpu_id} --json<br>topology<br>version --json<br>xgmi -l<br>xgmi -m | **Analyzer Args:**<br>- `check_static_data`: bool — If True, run static data checks (e.g. driver version, partition mode).<br>- `expected_gpu_processes`: Optional[int] — Expected number of GPU processes.<br>- `expected_max_power`: Optional[int] — Expected maximum power value (e.g. watts).<br>- `expected_driver_version`: Optional[str] — Expected AMD driver version string.<br>- `expected_memory_partition_mode`: Optional[str] — Expected memory partition mode (e.g. sp3, dp).<br>- `expected_compute_partition_mode`: Optional[str] — Expected compute partition mode.<br>- `expected_firmware_versions`: Optional[dict[str, str]] — Expected firmware versions keyed by amd-smi fw_id (e.g. PLDM_BUNDLE).<br>- `l0_to_recovery_count_error_threshold`: Optional[int] — L0-to-recovery count above which an error is raised.<br>- `l0_to_recovery_count_warning_threshold`: Optional[int] — L0-to-recovery count above which a warning is raised.<br>- `vendorid_ep`: Optional[str] — Expected endpoint vendor ID (e.g. for PCIe).<br>- `vendorid_ep_vf`: Optional[str] — Expected endpoint VF vendor ID.<br>- `devid_ep`: Optional[str] — Expected endpoint device ID.<br>- `devid_ep_vf`: Optional[str] — Expected endpoint VF device ID.<br>- `sku_name`: Optional[str] — Expected SKU name string for GPU.<br>- `expected_xgmi_speed`: Optional[list[float]] — Expected xGMI speed value(s) (e.g. link rate).<br>- `analysis_range_start`: Optional[datetime.datetime] — Start of time range for time-windowed analysis.<br>- `analysis_range_end`: Optional[datetime.datetime] — End of time range for time-windowed analysis. | **Collection Args:**<br>- `analysis_firmware_ids`: Optional[list[str]] — amd-smi fw_id values to record in analysis_ref.firmware_versions<br>- `cper_file_path`: Optional[str] — Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file). | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) |
| BiosPlugin | sh -c 'cat /sys/devices/virtual/dmi/id/bios_version'<br>wmic bios get SMBIOSBIOSVersion /Value | **Analyzer Args:**<br>- `exp_bios_version`: list[str] — Expected BIOS version(s) to match against collected value (str or list).<br>- `regex_match`: bool — If True, match exp_bios_version as regex; otherwise exact match. | - | [BiosDataModel](#BiosDataModel-Model) | [BiosCollector](#Collector-Class-BiosCollector) | [BiosAnalyzer](#Data-Analyzer-Class-BiosAnalyzer) |
| CmdlinePlugin | cat /proc/cmdline | **Analyzer Args:**<br>- `required_cmdline`: Union[str, List] — Command-line parameters that must be present (e.g. 'pci=bfsort').<br>- `banned_cmdline`: Union[str, List] — Command-line parameters that must not be present.<br>- `os_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-OS overrides for required_cmdline and banned_cmdline (keyed by OS identifier).<br>- `platform_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-platform overrides for required_cmdline and banned_cmdline (keyed by platform). | - | [CmdlineDataModel](#CmdlineDataModel-Model) | [CmdlineCollector](#Collector-Class-CmdlineCollector) | [CmdlineAnalyzer](#Data-Analyzer-Class-CmdlineAnalyzer) |
| DeviceEnumerationPlugin | powershell -Command "(Get-WmiObject -Class Win32_Processor &#124; Measure-Object).Count"<br>lspci -d {vendorid_ep}: &#124; grep -i 'VGA\&#124;Display\&#124;3D' &#124; wc -l<br>powershell -Command "(wmic path win32_VideoController get name &#124; findstr AMD &#124; Measure-Object).Count"<br>lscpu<br>lshw<br>lspci -d {vendorid_ep}: &#124; grep -i 'Virtual Function' &#124; wc -l<br>powershell -Command "(Get-VMHostPartitionableGpu &#124; Measure-Object).Count" | **Analyzer Args:**<br>- `cpu_count`: Optional[list[int]] — Expected CPU count(s); pass as int or list of ints. Analysis passes if actual is in list.<br>- `gpu_count`: Optional[list[int]] — Expected GPU count(s); pass as int or list of ints. Analysis passes if actual is in list.<br>- `vf_count`: Optional[list[int]] — Expected virtual function count(s); pass as int or list of ints. Analysis passes if actual is in list. | - | [DeviceEnumerationDataModel](#DeviceEnumerationDataModel-Model) | [DeviceEnumerationCollector](#Collector-Class-DeviceEnumerationCollector) | [DeviceEnumerationAnalyzer](#Data-Analyzer-Class-DeviceEnumerationAnalyzer) |
@@ -970,6 +970,8 @@ Data model for amd-smi data.
 - **xgmi_link**: `Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.XgmiLinks]]`
 - **cper_data**: `Optional[list[nodescraper.models.datamodel.FileModel]]`
 - **cper_afids**: `dict[str, int]`
+- **analysis_firmware_ids**: `Optional[list[str]]`
+- **analysis_ref**: `Optional[nodescraper.plugins.inband.amdsmi.amdsmidata.AmdSmiAnalysisRef]`

 ## BiosDataModel Model

@@ -1691,98 +1693,6 @@ Check RDMA statistics for errors (RoCE and other RDMA error counters).

 **Link to code**: [rdma_analyzer.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/rdma/rdma_analyzer.py)

-### Class Variables
-
-- **ERROR_FIELDS**: `[
-  recoverable_errors,
-  tx_roce_errors,
-  tx_roce_discards,
-  rx_roce_errors,
-  rx_roce_discards,
-  local_ack_timeout_err,
-  packet_seq_err,
-  max_retry_exceeded,
-  rnr_nak_retry_err,
-  implied_nak_seq_err,
-  unrecoverable_err,
-  bad_resp_err,
-  local_qp_op_err,
-  local_protection_err,
-  mem_mgmt_op_err,
-  req_remote_invalid_request,
-  req_remote_access_errors,
-  remote_op_err,
-  duplicate_request,
-  res_exceed_max,
-  resp_local_length_error,
-  res_exceeds_wqe,
-  res_opcode_err,
-  res_rx_invalid_rkey,
-  res_rx_domain_err,
-  res_rx_no_perm,
-  res_rx_range_err,
-  res_tx_invalid_rkey,
-  res_tx_domain_err,
-  res_tx_no_perm,
-  res_tx_range_err,
-  res_irrq_oflow,
-  res_unsup_opcode,
-  res_unaligned_atomic,
-  res_rem_inv_err,
-  res_mem_err,
-  res_srq_err,
-  res_cmp_err,
-  res_invalid_dup_rkey,
-  res_wqe_format_err,
-  res_cq_load_err,
-  res_srq_load_err,
-  res_tx_pci_err,
-  res_rx_pci_err,
-  out_of_buffer,
-  out_of_sequence,
-  req_cqe_error,
-  req_cqe_flush_error,
-  resp_cqe_error,
-  resp_cqe_flush_error,
-  resp_remote_access_errors,
-  req_rx_pkt_seq_err,
-  req_rx_rnr_retry_err,
-  req_rx_rmt_acc_err,
-  req_rx_rmt_req_err,
-  req_rx_oper_err,
-  req_rx_impl_nak_seq_err,
-  req_rx_cqe_err,
-  req_rx_cqe_flush,
-  req_rx_dup_response,
-  req_rx_inval_pkts,
-  req_tx_loc_acc_err,
-  req_tx_loc_oper_err,
-  req_tx_mem_mgmt_err,
-  req_tx_retry_excd_err,
-  req_tx_loc_sgl_inv_err,
-  resp_rx_dup_request,
-  resp_rx_outof_buf,
-  resp_rx_outouf_seq,
-  resp_rx_cqe_err,
-  resp_rx_cqe_flush,
-  resp_rx_loc_len_err,
-  resp_rx_inval_request,
-  resp_rx_loc_oper_err,
-  resp_rx_outof_atomic,
-  resp_tx_pkt_seq_err,
-  resp_tx_rmt_inval_req_err,
-  resp_tx_rmt_acc_err,
-  resp_tx_rmt_oper_err,
-  resp_tx_rnr_retry_err,
-  resp_tx_loc_sgl_inv_err,
-  resp_rx_s0_table_err,
-  resp_rx_ccl_cts_outouf_seq,
-  tx_rdma_ack_timeout,
-  tx_rdma_ccl_cts_ack_timeout,
-  rx_rdma_mtu_discard_pkts
-]`
-- **CRITICAL_ERROR_FIELDS**: `['unrecoverable_err', 'res_tx_pci_err', 'res_rx_pci_err', 'res_mem_err']`
-
 ## Data Analyzer Class RocmAnalyzer

 ### Description
22 changes: 20 additions & 2 deletions nodescraper/cli/__init__.py
@@ -2,7 +2,7 @@
 #
 # MIT License
 #
-# Copyright (c) 2025 Advanced Micro Devices, Inc.
+# Copyright (C) 2026 Advanced Micro Devices, Inc.
 #
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
@@ -24,6 +24,24 @@
 #
 ###############################################################################

+from .cli import get_cli_top_level_subcommands
 from .cli import main as cli_entry
+from .embed import CLI_TOP_LEVEL_SUBCOMMANDS, run_cli_return_code, run_main_return_code
+from .invocation import (
+    PluginRunInvocation,
+    get_plugin_run_invocation,
+    plugin_run_invocation_scope,
+    run_plugin_queue_with_invocation,
+)

-__all__ = ["cli_entry"]
+__all__ = [
+    "CLI_TOP_LEVEL_SUBCOMMANDS",
+    "cli_entry",
+    "get_cli_top_level_subcommands",
+    "run_cli_return_code",
+    "run_main_return_code",
+    "PluginRunInvocation",
+    "get_plugin_run_invocation",
+    "plugin_run_invocation_scope",
+    "run_plugin_queue_with_invocation",
+]
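
The hunk above widens `__all__` from one name to nine, which changes what the package advertises as its public surface. As a generic illustration of the mechanism (not of `nodescraper.cli` itself), the sketch below builds a throwaway in-memory module and shows that a star-import copies only the names listed in `__all__`. The module name `demo_pkg` and its function bodies are hypothetical stand-ins.

```python
import sys
import types

# Build a throwaway in-memory module; "demo_pkg" and its functions are
# hypothetical stand-ins, not the real nodescraper.cli package.
demo = types.ModuleType("demo_pkg")
exec(
    "def cli_entry():\n"
    "    return 0\n"
    "def run_cli_return_code(argv):\n"
    "    return 0\n"
    "def internal_helper():\n"
    "    return 1\n"
    "__all__ = ['cli_entry', 'run_cli_return_code']\n",
    demo.__dict__,
)
sys.modules["demo_pkg"] = demo

# A star-import copies only the names listed in __all__.
namespace = {}
exec("from demo_pkg import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)  # ['cli_entry', 'run_cli_return_code']
```

Names left out of `__all__`, such as `internal_helper` here, remain importable by explicit name; the list only governs star-imports and signals intent to readers and tooling.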