Context
During the attribute rename work in #56, we added neighbor_list to the regex pattern in DataMixin.__inc__() (nvalchemi/data/data.py:155):
return self.num_nodes if bool(re.search("(index|face|neighbor_list)", key)) else 0
This method determines which fields receive cumulative node-index offsets when batching. The regex approach is implicit and fragile — it matches any field whose name contains the substring, so a field like neighbor_list_shifts would incorrectly match (flagged by @zubatyuk in PR #56 review).
A similar pattern exists in the Zarr backend's _get_cat_dim() (nvalchemi/data/datapipes/backends/zarr.py:231):
if bool(re.search("(index|face)", key)):
return -1
In contrast, _get_field_level() in the same file already does this correctly — it checks membership in AtomicData._default_node_keys, _default_edge_keys, and _default_system_keys.
Proposed fix
Replace both regex sites with explicit set membership, as @zubatyuk suggested:
- Add an
_index_keys ClassVar on AtomicData (or DataMixin) containing the canonical set of fields that require node-index offsets (currently {"neighbor_list"}).
- Update
__inc__ to use key in self._index_keys instead of regex.
- Update
_get_cat_dim to use AtomicData._index_keys (or an equivalent edge-key set) instead of regex.
This makes the behavior explicit, self-documenting, and safe against substring collisions.
Affected files
nvalchemi/data/data.py — DataMixin.__inc__()
nvalchemi/data/datapipes/backends/zarr.py — _get_cat_dim()
Origin
Identified during review of PR #56 (attribute rename). Deferred as out of scope for that PR.
Context
During the attribute rename work in #56, we added
neighbor_listto the regex pattern inDataMixin.__inc__()(nvalchemi/data/data.py:155):This method determines which fields receive cumulative node-index offsets when batching. The regex approach is implicit and fragile — it matches any field whose name contains the substring, so a field like
neighbor_list_shiftswould incorrectly match (flagged by @zubatyuk in PR #56 review).A similar pattern exists in the Zarr backend's
_get_cat_dim()(nvalchemi/data/datapipes/backends/zarr.py:231):In contrast,
_get_field_level()in the same file already does this correctly — it checks membership inAtomicData._default_node_keys,_default_edge_keys, and_default_system_keys.Proposed fix
Replace both regex sites with explicit set membership, as @zubatyuk suggested:
_index_keysClassVar onAtomicData(orDataMixin) containing the canonical set of fields that require node-index offsets (currently{"neighbor_list"}).__inc__to usekey in self._index_keysinstead of regex._get_cat_dimto useAtomicData._index_keys(or an equivalent edge-key set) instead of regex.This makes the behavior explicit, self-documenting, and safe against substring collisions.
Affected files
nvalchemi/data/data.py—DataMixin.__inc__()nvalchemi/data/datapipes/backends/zarr.py—_get_cat_dim()Origin
Identified during review of PR #56 (attribute rename). Deferred as out of scope for that PR.