diff --git a/docs/plans/2026-02-16-rust-port-analysis.md b/docs/plans/2026-02-16-rust-port-analysis.md new file mode 100644 index 0000000000..e611de22cc --- /dev/null +++ b/docs/plans/2026-02-16-rust-port-analysis.md @@ -0,0 +1,431 @@ +# Rust Port Analysis: onadata Performance Hotspots + +**Date**: 2026-02-16 +**Branch**: `rusty` +**Status**: Analysis / Proposal + +## Executive Summary + +onadata is a Django-based ODK (Open Data Kit) data collection platform. After thorough +analysis, we identified **5 high-impact areas** where porting CPU/memory-intensive Python +code to Rust (via PyO3 native extensions) would yield significant performance gains. + +The biggest wins come from **export generation**, **XML parsing/submission processing**, +and **data transformation pipelines** -- all of which involve tight loops over large +datasets with recursive data structures, string manipulation, and format conversion. + +--- + +## Architecture Overview + +``` + ┌─────────────────────────────────────────────┐ + │ Django REST API │ + └──────────┬──────────────┬───────────────────┘ + │ │ + ┌──────────▼──┐ ┌──────▼──────────────┐ + │ Submission │ │ Export / Query │ + │ Pipeline │ │ Pipeline │ + └──────┬──────┘ └──────┬───────────────┘ + │ │ + ┌──────▼──────┐ ┌──────▼───────────────┐ + │ XML Parse → │ │ DB Query → Transform │ + │ JSON → Save │ │ → Format → File I/O │ + └─────────────┘ └──────────────────────┘ + │ │ + ┌──────▼──────────────────▼───────────────┐ + │ Celery Async Task Queue │ + └──────────────────────────────────────────┘ +``` + +--- + +## 1. EXPORT GENERATION (Priority: CRITICAL) + +**Files**: `export_builder.py`, `csv_builder.py`, `export_tools.py` +**Impact**: Largest single performance bottleneck + +### What it does + +Converts form submissions (100k+ rows) into XLSX, CSV, SAV (SPSS), KML, GeoJSON formats. +Each export runs as a Celery task and involves: + +1. Querying all submissions from PostgreSQL +2. Flattening nested repeat groups into tabular format +3. 
Processing select-multiple fields (splitting into binary columns) +4. Type conversion, GPS parsing, label lookups +5. Writing to output format (openpyxl for XLSX, csv module for CSV, etc.) + +### Why it's slow in Python + +| Bottleneck | Location | Complexity | Description | +|---|---|---|---| +| `dict_to_joined_export()` | export_builder.py:112-180 | O(r * f * d) per row | Recursive dict creation for every submission. Creates intermediate dicts at each nesting level. | +| `split_select_multiples()` | export_builder.py:746-796 | O(s * c) per row | Dict comprehension per select-multiple field. 50 fields * 100 choices = 5000 dict updates/row. | +| `pre_process_row()` | export_builder.py:835-909 | O(v) per row | Regex compiled per string value per row. Dynamic value replacement with `re.findall()` on every cell. | +| CSV column discovery | csv_builder.py:803-818 | O(2N) | Iterates ALL data twice: once to discover repeat columns, once to write. | +| Nested repeat writes | export_builder.py:1137-1143 | O(r * n * d) | For nested repeats: 100k submissions * 100 repeats * 50 sub-repeats = 500M write operations. 
| + +### Rust opportunity + +A Rust export engine exposed via PyO3 could: + +- **Stream-process rows** without intermediate dict allocation (zero-copy where possible) +- **Pre-compile regex** once, reuse across all rows +- **Flatten nested structures** iteratively with stack-based traversal instead of Python recursion +- **Write output formats** directly using Rust crates (`rust_xlsxwriter` for XLSX, the `csv` crate for CSV) +- **Parallelize section writes** across threads (Python's GIL prevents this) + +**Estimated speedup**: 10-50x for large exports (100k+ rows with repeat groups) + +### Proposed Rust module: `onadata_export` + +``` +onadata_export/ +├── src/ +│   ├── lib.rs              # PyO3 module entry +│   ├── flatten.rs          # dict_to_joined_export replacement +│   ├── select_multiples.rs # split_select_multiples replacement +│   ├── preprocess.rs       # pre_process_row with compiled regex +│   ├── writers/ +│   │   ├── xlsx.rs         # XLSX writer (rust_xlsxwriter) +│   │   ├── csv.rs          # CSV writer +│   │   └── sav.rs          # SAV/SPSS writer +│   └── schema.rs           # Form schema representation +└── Cargo.toml +``` + +--- + +## 2. XML PARSING & SUBMISSION PROCESSING (Priority: HIGH) + +**Files**: `xform_instance_parser.py`, `instance.py`, `logger_tools.py` +**Impact**: Every single submission goes through this path + +### What it does + +When a form submission arrives (XML from a mobile device): + +1. Read entire XML into memory +2. Parse with minidom (full DOM tree, 2-3x memory of raw XML) +3. Recursively convert DOM to Python dict (`_xml_node_to_dict`) +4. Flatten nested dict into key-value pairs +5. Extract geolocation, media references, UUIDs +6. Convert numeric strings to numbers (recursive traversal) +7. Compute SHA256 hash +8. 
Save JSON representation + +### Why it's slow in Python + +| Bottleneck | Location | Issue | +|---|---|---| +| `clean_and_parse_xml()` | xform_instance_parser.py:174-183 | Regex on full XML + minidom DOM tree (2-3x memory) | +| `_xml_node_to_dict()` | xform_instance_parser.py:187-240 | Recursive DOM traversal with `xpath_from_xml_node()` called per node (walks parent chain each time) | +| `_flatten_dict_nest_repeats()` | xform_instance_parser.py:243-273 | Recursive generator with `list(new_prefix)` copy on every iteration | +| `numeric_converter()` | instance.py:398-414 | Recursive dict traversal with try/except int/float per value | +| `get_values_matching_key()` | dict_tools.py:8-33 | Full recursive document search for geolocation/media extraction | +| **XML parsed 6+ times** | Multiple locations | `get_dict()` called separately by `save()`, `_set_geom()`, `get_expected_media()`, `get_full_dict()` | + +### Rust opportunity + +A Rust XML processor could: + +- **Parse XML once** with `quick-xml` (SAX-style, no DOM tree) and extract all needed data in a single pass +- **Build the flat dict, geolocation, media list, UUID, and numeric conversions** all in one traversal +- **Avoid recursive Python calls** -- use iterative stack-based traversal +- **Return a Python dict** via PyO3 with all data ready, eliminating 5 of 6 redundant parses +- **Compute SHA256** natively (10x+ faster than Python's hashlib for in-process hashing) + +**Estimated speedup**: 5-20x per submission (more for large submissions with many repeat groups) + +### Proposed Rust module: `onadata_xml` + +``` +onadata_xml/ +├── src/ +│ ├── lib.rs # PyO3 module entry +│ ├── parser.rs # Single-pass XML → structured data +│ ├── flatten.rs # Iterative flattening (replaces recursive Python) +│ ├── numeric.rs # Fast numeric conversion +│ ├── geom.rs # Geolocation extraction +│ └── media.rs # Media reference extraction +└── Cargo.toml +``` + +--- + +## 3. 
DATA AGGREGATION & CHART BUILDING (Priority: MEDIUM-HIGH) + +**Files**: `query.py`, `chart_tools.py`, `parsed_instance.py` +**Impact**: Every chart render and data view query + +### What it does + +Aggregates submission data for charts/dashboards: + +1. Execute raw PostgreSQL queries with JSON operators +2. Fetch results into Python dicts +3. Group, sort, and label-map results in Python +4. Build chart-ready data structures + +### Why it's slow in Python + +| Bottleneck | Location | Issue | +|---|---|---| +| `_flatten_multiple_dict_into_one()` | chart_tools.py:151-170 | **O(N^2)** nested loop: iterates results * unique values to group data | +| `_use_labels_from_field_name()` | chart_tools.py:173-197 | Double iteration over data (once for labels, once for key rename) | +| `_use_labels_from_group_by_name()` | chart_tools.py:212-219 | Nested loop: items * sub-items for label replacement | +| Post-query sorting | chart_tools.py:329-341 | Python re-sort + timezone regex on every row | +| `_dictfetchall()` | query.py:18-22 | All rows materialized as dicts in memory | +| `get_field_records()` | query.py:244-247 | Python float conversion instead of SQL CAST | +| JSON parsing per row | parsed_instance.py:136-142 | `json.loads()` called per row in result iterator | + +### Rust opportunity + +- **Replace O(N^2) grouping** with HashMap-based O(N) grouping +- **Batch label lookups** with pre-built HashMap instead of linear scan +- **Parse JSON in bulk** using `serde_json` (much faster than Python's `json` module) +- **Handle timezone conversion** with compiled regex + chrono crate + +**Estimated speedup**: 3-10x for aggregation queries on large datasets + +### Proposed Rust module: `onadata_agg` + +``` +onadata_agg/ +├── src/ +│ ├── lib.rs # PyO3 module entry +│ ├── grouping.rs # HashMap-based grouping (replaces O(N²) loop) +│ ├── labels.rs # Pre-indexed label lookups +│ ├── json_parse.rs # Bulk JSON parsing +│ └── datetime.rs # Timezone handling +└── Cargo.toml +``` + +--- + 
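The HashMap-based grouping proposed for `grouping.rs` amounts to a single pass that buckets rows by key. A minimal sketch of that idea, written in Python for brevity; the function name `group_rows`, the `count` key, and the row shape are assumptions for illustration, not the actual `chart_tools.py` code:

```python
def group_rows(rows, field_name, group_by_name):
    """Group query rows by field_name in one pass (hypothetical row shape)."""
    groups = {}  # field value -> list of items; dicts preserve insertion order
    for row in rows:
        # setdefault creates the bucket on first sight of a field value,
        # so no second loop over the distinct values is needed.
        items = groups.setdefault(row.get(field_name), [])
        items.append(
            {group_by_name: row.get(group_by_name), "count": row.get("count")}
        )
    return [{field_name: value, "items": items} for value, items in groups.items()]


rows = [
    {"gender": "f", "region": "north", "count": 3},
    {"gender": "m", "region": "north", "count": 2},
    {"gender": "f", "region": "south", "count": 1},
]
grouped = group_rows(rows, "gender", "region")
```

Each row is touched once, so the cost grows with the number of rows rather than with rows times distinct values.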
+## 4. ENCRYPTION / DECRYPTION (Priority: MEDIUM) + +**Files**: `libs/kms/tools.py`, `logger/tasks.py` +**Impact**: Every encrypted submission + +### What it does + +For encrypted form submissions: + +1. Load all encrypted attachments into memory (`BytesIO(file.read())`) +2. Call external KMS for key material (network-bound) +3. Decrypt submission XML and media files +4. Compute SHA256 of decrypted content +5. Save decrypted attachments individually + +### Why it's slow in Python + +| Bottleneck | Location | Issue | +|---|---|---| +| Attachment loading | tools.py:487-491 | All attachments loaded into memory simultaneously | +| SHA256 hashing | tools.py:560 | Python hashlib for potentially large files | +| Per-attachment DB writes | tools.py:570 | Individual `instance.attachments.create()` per file, no `bulk_create()` | + +### Rust opportunity + +- **Stream-decrypt** attachments without loading all into memory +- **Native SHA256** via `ring` or `sha2` crate (2-5x faster for large files) +- **Prepare bulk insert data** for batch DB writes +- Note: The KMS network call is the dominant bottleneck here and Rust won't help with that + +**Estimated speedup**: 2-5x for the crypto/hashing portions (network I/O dominates overall) + +### Proposed Rust module: `onadata_crypto` + +``` +onadata_crypto/ +├── src/ +│ ├── lib.rs # PyO3 module entry +│ ├── decrypt.rs # Streaming decryption +│ └── hash.rs # Fast SHA256 +└── Cargo.toml +``` + +--- + +## 5. BULK CSV IMPORT (Priority: MEDIUM) + +**Files**: `csv_import.py`, `entities_utils.py` +**Impact**: Large CSV uploads (100k+ rows) + +### What it does + +Imports CSV data as form submissions or entity updates: + +1. Count total rows (full file scan) +2. Parse each row, validate types +3. Transform flat CSV dict to nested dict +4. Generate XML submission per row +5. 
Process through full submission pipeline + +### Why it's slow in Python + +| Bottleneck | Location | Issue | +|---|---|---| +| Upfront row count | csv_import.py:341 | `sum(1 for row in csv_file)` scans entire file before processing | +| Per-row dict transformation | csv_import.py:424-432 | 3 nested function calls: `csv_dict_to_nested_dict()`, `flatten_split_select_multiples()`, `dict_merge()` | +| Per-row XML generation | csv_import.py:462 | `dict2xmlsubmission()` string manipulation per row | +| Per-row entity persistence | entities_utils.py:355 | Individual `serializer.save()`, no `bulk_create()` | + +### Rust opportunity + +- **Single-pass CSV parsing** with row count + processing combined (using `csv` crate) +- **Batch dict-to-XML conversion** with pre-compiled templates +- **Prepare bulk inserts** instead of per-row saves +- **Validate types** at parse time using Rust's type system + +**Estimated speedup**: 3-8x for CSV parsing and transformation (DB writes still dominate) + +--- + +## Prioritized Implementation Roadmap + +### Phase 1: Export Engine (Highest ROI) +``` +Effort: ████████░░ (8/10) +Impact: ██████████ (10/10) +Speedup: 10-50x for large exports +``` +- Replace `ExportBuilder` core with Rust +- Stream-process rows, write XLSX/CSV directly +- Eliminate intermediate dict allocations +- Parallelize section writes across threads + +### Phase 2: XML Submission Parser (High ROI) +``` +Effort: ██████░░░░ (6/10) +Impact: ████████░░ (8/10) +Speedup: 5-20x per submission +``` +- Single-pass XML parser replacing 6+ redundant parses +- Returns complete Python dict with all extracted data +- Eliminates recursive traversals + +### Phase 3: Aggregation Engine (Medium ROI) +``` +Effort: ████░░░░░░ (4/10) +Impact: ██████░░░░ (6/10) +Speedup: 3-10x for chart queries +``` +- HashMap-based grouping replacing O(N^2) loops +- Bulk JSON parsing +- Pre-indexed label lookups + +### Phase 4: Crypto Helpers (Lower ROI) +``` +Effort: ███░░░░░░░ (3/10) +Impact: ████░░░░░░ 
(4/10) +Speedup: 2-5x for hashing (network I/O dominates) +``` +- Streaming decryption +- Native SHA256 + +### Phase 5: CSV Import Parser (Lower ROI) +``` +Effort: ████░░░░░░ (4/10) +Impact: ████░░░░░░ (4/10) +Speedup: 3-8x for parsing (DB writes dominate) +``` +- Combined count + parse pass +- Batch transformation + +--- + +## Integration Strategy: PyO3 Native Extensions + +### Why PyO3 + +- Mature Rust-Python bridge with zero-copy where possible +- Compiles to native `.so`/`.dylib` that imports like any Python module +- Supports Python dicts, lists, strings natively +- Can release GIL for true parallelism + +### Integration pattern + +```python +# Before (Python) +from onadata.libs.utils.export_builder import ExportBuilder + +builder = ExportBuilder() +builder.set_survey(survey) +builder.to_xlsx_export(path, data, username, xform) + +# After (Rust via PyO3, drop-in replacement) +from onadata_export import RustExportBuilder + +builder = RustExportBuilder() +builder.set_survey(survey) # accepts Python survey object +builder.to_xlsx_export(path, data, username, xform) +``` + +### Build integration + +```toml +# pyproject.toml addition +[build-system] +requires = ["maturin>=1.0,<2.0"] + +[tool.maturin] +features = ["pyo3/extension-module"] +``` + +### Rollout strategy + +1. Feature-flag each Rust module: `USE_RUST_EXPORTS=true` +2. Run both Python and Rust paths in parallel, compare outputs +3. Benchmark with production-scale data +4. Gradually shift traffic to Rust path +5. 
Remove Python implementation after validation + +--- + +## Risk Assessment + +| Risk | Mitigation | +|---|---| +| Rust introduces subtle behavior differences | Parallel execution + output comparison in staging | +| Build complexity (Rust toolchain in CI/CD) | maturin handles cross-compilation; pre-built wheels | +| Team unfamiliar with Rust | Start with Phase 1 (export) as learning project; well-defined interface | +| PyO3 overhead for small operations | Only port hot paths; keep Django/ORM in Python | +| Maintenance burden of two languages | Clear module boundaries; Rust modules are self-contained | + +--- + +## Estimated Impact Summary + +| Component | Current (100k rows) | With Rust | Speedup | +|---|---|---|---| +| XLSX Export | ~45-90 min | ~2-5 min | 10-50x | +| XML Submission Parse | ~15ms/submission | ~1-3ms | 5-20x | +| Chart Aggregation | ~5-15s | ~1-3s | 3-10x | +| Decryption (crypto only) | ~200ms/submission | ~50-100ms | 2-5x | +| CSV Import (parse only) | ~8ms/row | ~1-2ms/row | 3-8x | + +**Note**: These are estimates based on typical Python-to-Rust speedups for similar workloads. +Actual numbers depend on data shape, hardware, and I/O patterns. The DB and network I/O +portions remain unchanged regardless of language. + +--- + +## Conclusion + +Your intuition is correct -- the form processing and export pipelines are the prime +candidates for Rust porting. The export engine (Phase 1) offers the highest ROI because: + +1. It's the most CPU/memory-intensive code path +2. It processes the largest data volumes +3. It has well-defined inputs/outputs (easy to wrap with PyO3) +4. Python's GIL prevents parallelizing section writes +5. The recursive dict manipulation and regex-per-row patterns are exactly where Rust excels + +Phase 2 (XML parsing) is the second priority because it affects every single submission +and currently parses the same XML 6+ times due to lack of caching between method calls. 
+ +The Django ORM, REST API, authentication, and routing should stay in Python -- Rust +offers no meaningful advantage for I/O-bound web framework code. diff --git a/docs/plans/2026-02-16-rust-xml-parser-design.md b/docs/plans/2026-02-16-rust-xml-parser-design.md new file mode 100644 index 0000000000..cda52ef5f8 --- /dev/null +++ b/docs/plans/2026-02-16-rust-xml-parser-design.md @@ -0,0 +1,282 @@ +# Design: Rust XML Submission Parser (`onadata_xml`) + +**Date**: 2026-02-16 +**Branch**: `rusty` +**Approach**: C -- Rust parser + Python cached wrapper (drop-in replacement) + +## Problem + +The `XFormInstanceParser` parses submission XML using Python's minidom (full DOM tree, +2-3x memory of raw XML). The same XML is parsed 6+ times per submission because +`get_dict()` is called independently by `save()`, `_set_geom()`, `get_expected_media()`, +and `get_full_dict()`. Each parse involves recursive DOM traversal, xpath computation +per node, recursive dict flattening, and recursive numeric conversion. + +## Solution + +A Rust native extension (`onadata_xml`) that parses XML in a single pass using +`quick-xml` (SAX-style, no DOM), returning all extracted data at once. A Python wrapper +class (`RustXFormInstanceParser`) provides the same interface as the existing parser, +so callers don't change. + +## Rust Module: `onadata_xml` + +### Crate Structure + +``` +rust/onadata_xml/ +├── Cargo.toml +├── pyproject.toml +└── src/ +    ├── lib.rs          # PyO3 module entry, parse_submission() +    ├── parser.rs       # Single-pass XML -> structured data +    ├── flatten.rs      # Iterative dict flattening (stack-based) +    ├── numeric.rs      # String -> int/float conversion +    └── geom.rs         # Geopoint extraction from parsed data +``` + +### Core Function + +```rust +#[pyfunction] +fn parse_submission( +    xml_str: &str, +    repeat_xpaths: Vec<String>, +    encrypted: bool, +    numeric_fields: HashSet<String>, +    geo_xpaths: Vec<String>, +) -> PyResult<SubmissionResult> { ... 
} +``` + +### SubmissionResult Fields + +| Field | Type | Replaces | +|---|---|---| +| `dict` | `dict` | `_xml_node_to_dict()` output | +| `flat_dict` | `dict` | `_flatten_dict_nest_repeats()` + `numeric_converter()` | +| `attributes` | `dict[str, str]` | `_get_all_attributes()` + `_set_attributes()` | +| `root_node_name` | `str` | `_root_node.nodeName` | +| `uuid` | `Optional[str]` | `get_uuid_from_xml()` | +| `deprecated_uuid` | `Optional[str]` | `get_deprecated_uuid_from_xml()` | +| `submission_date` | `Optional[str]` | `get_submission_date_from_xml()` | +| `geom_points` | `list[tuple[float, float]]` | `_set_geom()` point extraction | +| `checksum` | `str` | `sha256(xml).hexdigest()` | + +### Rust Crates + +- `quick-xml` -- SAX-style streaming parser (no DOM, ~10x faster than minidom) +- `sha2` -- native SHA256 +- `pyo3` -- Python bindings + +### Parser Algorithm + +Single pass over XML using `quick-xml::Reader`. Maintains a stack of node names for +xpath computation. As it encounters relevant nodes, it accumulates: + +1. The nested dict structure (handling repeats via the `repeat_xpaths` set) +2. Attributes (skipping `entity` node attributes) +3. UUID from `meta/instanceID` or root `instanceID` attribute +4. Deprecated UUID from `meta/deprecatedID` +5. Submission date from root `submissionDate` attribute +6. Text values with numeric conversion applied inline + +After the parse pass, a second in-Rust step: + +1. Flattens the dict iteratively (stack-based, not recursive) +2. Extracts geopoints from fields matching `geo_xpaths` +3. Computes SHA256 checksum + +All returned to Python as a single `SubmissionResult` object. + +## Python Wrapper: `RustXFormInstanceParser` + +Lives in `onadata/apps/logger/xform_instance_parser.py`, same file as the original. 
+ +```python +class RustXFormInstanceParser: + def __init__(self, xml_str, data_dictionary): + self.data_dicionary = data_dictionary + repeat_xpaths = [ + get_abbreviated_xpath(e.get_xpath()) + for e in data_dictionary.get_survey_elements_of_type("repeat") + ] + numeric_fields = get_numeric_fields(data_dictionary) + geo_xpaths = data_dictionary.geopoint_xpaths() + + from onadata_xml import parse_submission + self._result = parse_submission( + xml_str, repeat_xpaths, data_dictionary.encrypted, + numeric_fields, geo_xpaths, + ) + + def to_dict(self): + return self._result.dict + + def to_flat_dict(self): + return self._result.flat_dict + + def get_root_node(self): + return None # DOM node not available; callers only use root_node_name + + def get_root_node_name(self): + return self._result.root_node_name + + def get_attributes(self): + return self._result.attributes + + def get_xform_id_string(self): + return self._result.attributes["id"] + + def get_version(self): + return self._result.attributes.get("version") + + def get_flat_dict_with_attributes(self): + result = self.to_flat_dict().copy() + result[XFORM_ID_STRING] = self.get_xform_id_string() + version = self.get_version() + if version: + result[VERSION] = version + return result +``` + +## Integration Points + +### `Instance._set_parser()` (instance.py:516) + +```python +def _set_parser(self): + if not hasattr(self, "_parser"): + if settings.USE_RUST_XML_PARSER: + self._parser = RustXFormInstanceParser(self.xml, self.xform) + else: + self._parser = XFormInstanceParser(self.xml, self.xform) +``` + +### `Instance._set_geom()` (instance.py:416) + +Reads from cached result instead of re-parsing: + +```python +def _set_geom(self): + self._set_parser() + if settings.USE_RUST_XML_PARSER and hasattr(self._parser, '_result'): + points = [Point(lng, lat) for lat, lng in self._parser._result.geom_points] + else: + # existing code path + ... 
+``` + +### `Instance._set_uuid()` (instance.py:528) + +Reads from cached result: + +```python +def _set_uuid(self): + if self.xml and not self.uuid: + if settings.USE_RUST_XML_PARSER and hasattr(self, '_parser'): + uuid = self._parser._result.uuid + else: + uuid = get_uuid_from_xml(self.xml) + if uuid is not None: + self.uuid = uuid + set_uuid(self) +``` + +### `create_instance()` in `logger_tools.py` + +The `sha256(xml).hexdigest()` call (line 637) can use `self._parser._result.checksum` +when the Rust parser is active, avoiding a redundant hash computation. + +## Feature Flag & Rollout + +### Settings + +```python +# onadata/settings/common.py +USE_RUST_XML_PARSER = False +RUST_XML_PARSER_SHADOW_MODE = False +``` + +### Shadow Mode + +Run both parsers, compare outputs, log differences: + +```python +def _set_parser(self): + if not hasattr(self, "_parser"): + self._parser = XFormInstanceParser(self.xml, self.xform) + if settings.RUST_XML_PARSER_SHADOW_MODE: + rust_parser = RustXFormInstanceParser(self.xml, self.xform) + _compare_parser_outputs(self._parser, rust_parser, self.pk) +``` + +### Rollout Sequence + +1. Shadow mode in staging -- validate output parity +2. Feature flag on in production +3. Remove Python parser + shadow mode after validation + +## Error Handling + +| Condition | Exception | +|---|---| +| Empty XML / no children | `InstanceEmptyError` | +| Malformed XML | `InstanceParseError` | +| No survey element | `ValueError` | +| Missing `id` attribute | `KeyError` | + +Rust module imports and raises existing Python exception classes via PyO3. 
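The shadow-mode snippet above calls `_compare_parser_outputs`, which this design does not define. A minimal sketch of what such a helper could look like, assuming it only needs to compare the accessors named in the testing strategy and log differences; the accessor list, the log messages, and the `StubParser` stand-in are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

# Accessors whose outputs are expected to match between the two parsers.
COMPARED_ACCESSORS = ("to_dict", "to_flat_dict", "get_root_node_name", "get_attributes")


def compare_parser_outputs(python_parser, rust_parser, instance_pk):
    """Return the accessor names whose outputs differ, logging each mismatch."""
    mismatches = []
    for name in COMPARED_ACCESSORS:
        expected = getattr(python_parser, name)()
        try:
            actual = getattr(rust_parser, name)()
        except Exception:  # shadow mode should never break a submission
            logger.exception("rust parser failed: pk=%s accessor=%s", instance_pk, name)
            mismatches.append(name)
            continue
        if expected != actual:
            logger.warning("parser mismatch: pk=%s accessor=%s", instance_pk, name)
            mismatches.append(name)
    return mismatches


class StubParser:
    """Stand-in exposing the same accessors, for demonstration only."""

    def __init__(self, flat):
        self._flat = flat

    def to_dict(self):
        return {"form": self._flat}

    def to_flat_dict(self):
        return self._flat

    def get_root_node_name(self):
        return "form"

    def get_attributes(self):
        return {"id": "form"}


same = compare_parser_outputs(StubParser({"a": 1}), StubParser({"a": 1}), 1)
different = compare_parser_outputs(StubParser({"a": 1}), StubParser({"a": 2}), 1)
```

Returning the list of differing accessors, rather than only logging, keeps the helper easy to assert on in the Layer 2 integration tests.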
+ +## Testing Strategy + +### Layer 1: Rust Unit Tests (`cargo test`) + +- Simple flat forms +- Repeat groups (single and nested) +- CDATA sections +- Encrypted submissions with `<media>` nodes +- Entity metadata (entity node attributes skipped) +- Missing/empty nodes +- Geopoint extraction (valid, malformed, multiple) +- Numeric conversion edge cases (int, float, NaN, empty string) +- UUID extraction from `<instanceID>` and from root attribute +- SHA256 checksum correctness + +### Layer 2: Python Integration Tests + +Run existing `XFormInstanceParser` test fixtures against `RustXFormInstanceParser`, +assert identical outputs for `to_dict()`, `to_flat_dict()`, +`get_flat_dict_with_attributes()`, `get_root_node_name()`, `get_attributes()`. + +### Layer 3: Shadow Mode + +Comparison logging in staging against real-world submissions. + +## Build & CI + +### pyproject.toml (rust/onadata_xml/) + +```toml +[build-system] +requires = ["maturin>=1.0,<2.0"] +build-backend = "maturin" + +[project] +name = "onadata-xml" +requires-python = ">=3.10" + +[tool.maturin] +features = ["pyo3/extension-module"] +``` + +### CI Additions + +- Install Rust toolchain (`rustup`) in CI +- `cd rust/onadata_xml && maturin develop` before running Python tests +- `cargo test` as separate CI step for Rust unit tests + +### Dev Workflow + +- `maturin develop` builds and installs into active virtualenv +- Rust code changes require re-running `maturin develop` +- Python wrapper changes reload normally diff --git a/docs/plans/2026-02-16-rust-xml-parser-plan.md b/docs/plans/2026-02-16-rust-xml-parser-plan.md new file mode 100644 index 0000000000..86f0479880 --- /dev/null +++ b/docs/plans/2026-02-16-rust-xml-parser-plan.md @@ -0,0 +1,989 @@ +# Rust XML Submission Parser Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. 
+ +**Goal:** Replace Python's minidom-based XML submission parser with a single-pass Rust native extension that eliminates 6+ redundant parses per submission. + +**Architecture:** A PyO3 Rust crate (`onadata_xml`) exposes a `parse_submission()` function. A Python wrapper class (`RustXFormInstanceParser`) provides an identical interface to the existing `XFormInstanceParser`. Feature-flagged via `USE_RUST_XML_PARSER` setting. + +**Tech Stack:** Rust, PyO3, maturin, quick-xml, sha2 + +--- + +### Task 1: Scaffold the Rust Crate + +**Files:** +- Create: `rust/onadata_xml/Cargo.toml` +- Create: `rust/onadata_xml/pyproject.toml` +- Create: `rust/onadata_xml/src/lib.rs` + +**Step 1: Create directory structure** + +Run: `mkdir -p rust/onadata_xml/src` + +**Step 2: Create Cargo.toml** + +Create `rust/onadata_xml/Cargo.toml`: + +```toml +[package] +name = "onadata_xml" +version = "0.1.0" +edition = "2021" + +[lib] +name = "onadata_xml" +crate-type = ["cdylib"] + +[dependencies] +pyo3 = { version = "0.23", features = ["extension-module"] } +quick-xml = "0.37" +sha2 = "0.10" +``` + +**Step 3: Create pyproject.toml** + +Create `rust/onadata_xml/pyproject.toml`: + +```toml +[build-system] +requires = ["maturin>=1.0,<2.0"] +build-backend = "maturin" + +[project] +name = "onadata-xml" +requires-python = ">=3.9" + +[tool.maturin] +features = ["pyo3/extension-module"] +``` + +**Step 4: Create minimal lib.rs** + +Create `rust/onadata_xml/src/lib.rs`: + +```rust +use pyo3::prelude::*; + +#[pyfunction] +fn parse_submission(xml_str: &str) -> PyResult<String> { +    Ok(format!("received {} bytes", xml_str.len())) +} + +#[pymodule] +fn onadata_xml(m: &Bound<'_, PyModule>) -> PyResult<()> { +    m.add_function(wrap_pyfunction!(parse_submission, m)?)?; +    Ok(()) +} +``` + +**Step 5: Build and verify** + +Run: `cd rust/onadata_xml && maturin develop` +Expected: Builds successfully, installs into virtualenv + +Run: `python -c "from onadata_xml import parse_submission; print(parse_submission(''))"` +Expected: 
`received 7 bytes` + +**Step 6: Commit** + +```bash +git add rust/ +git commit -m "feat: scaffold onadata_xml Rust crate with PyO3 + maturin" +``` + +--- + +### Task 2: Implement the Core XML-to-Dict Parser in Rust + +**Files:** +- Create: `rust/onadata_xml/src/parser.rs` +- Modify: `rust/onadata_xml/src/lib.rs` + +**Step 1: Write Rust unit tests for XML-to-dict conversion** + +Add to `rust/onadata_xml/src/parser.rs` the parser module with tests. The parser +must handle these cases (matching Python's `_xml_node_to_dict` behavior): + +- Leaf text nodes → `{"nodeName": "textValue"}` +- Empty nodes → skipped (None) +- CDATA sections → `{"parentNodeName": "cdataValue"}` +- Repeat groups (xpaths in `repeat_xpaths`) → values collected into lists +- Encrypted forms with `<media>` nodes → treated as repeats +- Nested repeats → lists of dicts inside lists +- Duplicate node names not in repeats → aggregated into lists + +Test cases from existing fixtures: + +```rust +#[cfg(test)] +mod tests { +    use super::*; + +    #[test] +    fn test_simple_flat_form() { +        let xml = r#"<tutorial><name>Larry</name><age>23</age></tutorial>"#; +        let result = xml_to_dict(xml, &[], false); +        // result["tutorial"]["name"] == "Larry" +        // result["tutorial"]["age"] == "23" +    } + +    #[test] +    fn test_repeat_nodes() { +        // From repeated_nodes.xml fixture +        let xml = r#"12"#; +        let result = xml_to_dict(xml, &["S2A"], false); +        // result["RW"]["S2A"] is a list of 2 dicts +    } + +    #[test] +    fn test_encrypted_media_nodes() { +        let xml = r#"<data><media><file>a.enc</file></media><media><file>b.enc</file></media></data>"#; +        let result = xml_to_dict(xml, &[], true); +        // result["data"]["media"] is a list of 2 dicts +    } + +    #[test] +    fn test_empty_nodes_skipped() { +        let xml = r#"
x"#; +        let result = xml_to_dict(xml, &[], false); +        // result["form"] has only "val", no "note" +    } +} +``` + +**Step 2: Run Rust tests to verify they fail** + +Run: `cd rust/onadata_xml && cargo test` +Expected: Compilation errors (functions don't exist yet) + +**Step 3: Implement `xml_to_dict` using quick-xml** + +In `rust/onadata_xml/src/parser.rs`, implement a stack-based SAX parser using +`quick-xml::Reader`. The algorithm: + +1. Create a stack of `(node_name, HashMap)` entries +2. On `Event::Start(tag)` → push new frame onto stack, compute xpath from stack +3. On `Event::Text(text)` → set text value on current frame +4. On `Event::CData(text)` → set CDATA value on parent frame +5. On `Event::End(tag)` → pop frame, merge into parent: +   - If xpath is in `repeat_xpaths` or (encrypted && name == "media") → append to list +   - Else if key already exists → convert to list and append +   - Else → insert as dict value +6. On `Event::Empty(tag)` → skip (empty node, matches Python's `return None`) + +The function returns a `PyObject` (Python dict) via PyO3. + +Key types: + +```rust +use pyo3::prelude::*; +use pyo3::types::{PyDict, PyList, PyString}; +use quick_xml::events::Event; +use quick_xml::Reader; +use std::collections::HashSet; + +pub fn xml_to_dict( +    py: Python<'_>, +    xml_str: &str, +    repeat_xpaths: &HashSet<String>, +    encrypted: bool, +) -> PyResult<PyObject> { ... 
} +``` + +**Step 4: Run Rust tests to verify they pass** + +Run: `cd rust/onadata_xml && cargo test` +Expected: All tests pass + +**Step 5: Commit** + +```bash +git add rust/onadata_xml/src/parser.rs rust/onadata_xml/src/lib.rs +git commit -m "feat: implement xml_to_dict parser using quick-xml" +``` + +--- + +### Task 3: Implement Dict Flattening + +**Files:** +- Create: `rust/onadata_xml/src/flatten.rs` +- Modify: `rust/onadata_xml/src/lib.rs` + +**Step 1: Write Rust unit tests for flattening** + +Must match Python's `_flatten_dict_nest_repeats` behavior: + +```rust +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_flatten_simple() { + // {"form": {"info": {"name": "Adam", "age": "80"}}} + // → {"info/name": "Adam", "info/age": "80"} + } + + #[test] + fn test_flatten_repeats() { + // {"form": {"kids": [{"name": "Abel"}, {"name": "Cain"}]}} + // → {"kids": [{"kids/name": "Abel"}, {"kids/name": "Cain"}]} + } + + #[test] + fn test_flatten_nested_repeats() { + // Nested repeats produce nested lists of flattened dicts + } +} +``` + +**Step 2: Run tests to verify they fail** + +Run: `cd rust/onadata_xml && cargo test` +Expected: FAIL + +**Step 3: Implement `flatten_dict` using iterative stack-based approach** + +In `rust/onadata_xml/src/flatten.rs`: + +```rust +pub fn flatten_dict_nest_repeats( + py: Python<'_>, + data_dict: &Bound<'_, PyDict>, +) -> PyResult { ... } +``` + +Uses a `Vec` as an explicit stack instead of recursion. Walks the nested dict, +building xpath keys by joining path components with "/". Lists produce sub-dicts +with flattened keys (matching Python behavior where repeats become +`[{"kids/kids_details/kids_name": "Abel"}]`). 
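The stack-based walk described above can be sketched as follows. This is a Python illustration under assumptions: the name `flatten_nest_repeats` and the choice to pass the root element's value (rather than the whole `{"form": ...}` dict) are hypothetical, and the Rust version would apply the same traversal to PyO3 types.

```python
def flatten_nest_repeats(root_value):
    """Flatten nested dicts into "/"-joined keys; lists become lists of flat dicts."""
    result = {}
    # Each frame: (path components so far, dict to walk, dict receiving flat keys).
    stack = [((), root_value, result)]
    while stack:
        prefix, node, target = stack.pop()
        for key, value in node.items():
            path = prefix + (key,)
            if isinstance(value, dict):
                # Plain group: keep writing into the same target dict.
                stack.append((path, value, target))
            elif isinstance(value, list):
                items = []
                target["/".join(path)] = items
                for item in value:
                    if isinstance(item, dict):
                        flat_item = {}
                        items.append(flat_item)
                        # Keys inside a repeat keep the full path from the root.
                        stack.append((path, item, flat_item))
                    else:
                        items.append(item)
            else:
                target["/".join(path)] = value
    return result


simple = flatten_nest_repeats({"info": {"name": "Adam", "age": "80"}})
repeats = flatten_nest_repeats({"kids": [{"name": "Abel"}, {"name": "Cain"}]})
nested = flatten_nest_repeats({"kids": [{"kids_details": {"kids_name": "Abel"}}]})
```

Carrying the target dict in each frame is what lets one loop handle both top-level keys and keys inside repeat entries without recursion.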
+ +**Step 4: Run tests to verify they pass** + +Run: `cd rust/onadata_xml && cargo test` +Expected: PASS + +**Step 5: Commit** + +```bash +git add rust/onadata_xml/src/flatten.rs rust/onadata_xml/src/lib.rs +git commit -m "feat: implement iterative dict flattening for repeat groups" +``` + +--- + +### Task 4: Implement Numeric Conversion and Geom Extraction + +**Files:** +- Create: `rust/onadata_xml/src/numeric.rs` +- Create: `rust/onadata_xml/src/geom.rs` +- Modify: `rust/onadata_xml/src/lib.rs` + +**Step 1: Write Rust tests for numeric conversion** + +Must match Python's `numeric_checker` (instance.py:152-166): +- `"42"` → `42` (int) +- `"3.14"` → `3.14` (float) +- `"NaN"` → `0` (matches Python: `0 if math.isnan(value) else value`) +- `"hello"` → `"hello"` (unchanged) +- `""` → `""` (unchanged) +- `"-7"` → `-7` (negative int) + +```rust +#[test] +fn test_numeric_checker() { + assert_eq!(numeric_check("42"), Value::Int(42)); + assert_eq!(numeric_check("3.14"), Value::Float(3.14)); + assert_eq!(numeric_check("NaN"), Value::Int(0)); + assert_eq!(numeric_check("hello"), Value::Str("hello".into())); +} +``` + +**Step 2: Write Rust tests for geopoint extraction** + +Must match `_set_geom` behavior (instance.py:416-441): +- Input: flat dict + geo_xpaths list +- Searches flat dict values for matching keys +- Splits GPS string `"-1.2627 36.7926 0.0 30.0"` into `(lat, lng)` tuple +- Returns `Vec<(f64, f64)>` + +```rust +#[test] +fn test_extract_geopoints() { + // flat_dict with "gps" = "-1.2627 36.7926 0.0 30.0" + // geo_xpaths = ["gps"] + // → [(−1.2627, 36.7926)] +} +``` + +**Step 3: Run tests to verify they fail** + +Run: `cd rust/onadata_xml && cargo test` +Expected: FAIL + +**Step 4: Implement numeric conversion** + +In `rust/onadata_xml/src/numeric.rs`: + +```rust +use pyo3::prelude::*; + +/// Applies numeric conversion inline during dict construction. +/// Called on leaf text values when the xpath is in numeric_fields. 
+pub fn numeric_check(py: Python<'_>, value: &str) -> PyObject { + if let Ok(i) = value.parse::<i64>() { + return i.into_pyobject(py).unwrap().into_any().unbind(); + } + if let Ok(f) = value.parse::<f64>() { + if f.is_nan() { + return 0i64.into_pyobject(py).unwrap().into_any().unbind(); + } + return f.into_pyobject(py).unwrap().into_any().unbind(); + } + PyString::new(py, value).into_any().unbind() +} +``` + +**Step 5: Implement geopoint extraction** + +In `rust/onadata_xml/src/geom.rs`: + +```rust +use pyo3::prelude::*; +use pyo3::types::PyDict; +use std::collections::HashSet; + +/// Extracts (lat, lng) tuples from the flat dict for matching geo_xpaths. +/// Searches recursively through nested dicts/lists (matching get_values_matching_key). +pub fn extract_geopoints( + py: Python<'_>, + flat_dict: &Bound<'_, PyDict>, + geo_xpaths: &HashSet<String>, +) -> PyResult<Vec<(f64, f64)>> { ... } +``` + +**Step 6: Run tests to verify they pass** + +Run: `cd rust/onadata_xml && cargo test` +Expected: PASS + +**Step 7: Commit** + +```bash +git add rust/onadata_xml/src/numeric.rs rust/onadata_xml/src/geom.rs rust/onadata_xml/src/lib.rs +git commit -m "feat: add numeric conversion and geopoint extraction" +``` + +--- + +### Task 5: Wire Up the Full `parse_submission` Function + +**Files:** +- Modify: `rust/onadata_xml/src/lib.rs` + +**Step 1: Write Rust integration test for full parse_submission** + +```rust +#[test] +fn test_parse_submission_full() { + let xml = r#"<tutorial id="tutorial"> + <name>Larry</name><age>23</age> + <gps>-1.2836198 36.8795437 0.0 1044.0</gps> + <meta><instanceID>uuid:729f173c688e482486a48661700455ff</instanceID></meta> + </tutorial>"#; + + Python::with_gil(|py| { + let result = parse_submission( + py, xml, + vec![], // repeat_xpaths + false, // encrypted + HashSet::new(), // numeric_fields + vec!["gps".into()], // geo_xpaths + ).unwrap(); + + // Verify all fields of SubmissionResult + // result.root_node_name == "tutorial" + // result.uuid == Some("729f173c688e482486a48661700455ff") + // result.geom_points == [(-1.2836198, 36.8795437)] + // result.checksum == sha256 of xml + // result.attributes["id"] == "tutorial" 
+ // result.dict["tutorial"]["name"] == "Larry" + // result.flat_dict["name"] == "Larry" + // result.flat_dict["age"] == "23" (not in numeric_fields) + }); +} +``` + +**Step 2: Run test to verify it fails** + +Run: `cd rust/onadata_xml && cargo test` +Expected: FAIL + +**Step 3: Implement `parse_submission` and `SubmissionResult`** + +In `rust/onadata_xml/src/lib.rs`: + +```rust +use pyo3::prelude::*; +use pyo3::types::PyDict; +use sha2::{Sha256, Digest}; +use std::collections::HashSet; + +mod parser; +mod flatten; +mod numeric; +mod geom; + +#[pyclass] +#[derive(Clone)] +pub struct SubmissionResult { + #[pyo3(get)] + pub dict: PyObject, + #[pyo3(get)] + pub flat_dict: PyObject, + #[pyo3(get)] + pub attributes: PyObject, + #[pyo3(get)] + pub root_node_name: String, + #[pyo3(get)] + pub uuid: Option<String>, + #[pyo3(get)] + pub deprecated_uuid: Option<String>, + #[pyo3(get)] + pub submission_date: Option<String>, + #[pyo3(get)] + pub geom_points: Vec<(f64, f64)>, + #[pyo3(get)] + pub checksum: String, +} + +#[pyfunction] +fn parse_submission( + py: Python<'_>, + xml_str: &str, + repeat_xpaths: Vec<String>, + encrypted: bool, + numeric_fields: HashSet<String>, + geo_xpaths: Vec<String>, +) -> PyResult<SubmissionResult> { + // 1. Clean XML (strip whitespace between tags) + let clean_xml = clean_xml(xml_str); + + // 2. Parse XML to dict + extract attributes, uuid, etc. + let parsed = parser::parse_full(py, &clean_xml, &repeat_xpaths.into_iter().collect(), + encrypted, &numeric_fields)?; + + // 3. Flatten dict + let flat_dict = flatten::flatten_dict_nest_repeats(py, &parsed.dict)?; + + // 4. Extract geopoints from the parsed dict + let geo_set: HashSet<String> = geo_xpaths.into_iter().collect(); + let geom_points = geom::extract_geopoints(py, &parsed.dict, &geo_set)?; + + // 5. 
Compute SHA256 + let mut hasher = Sha256::new(); + hasher.update(xml_str.as_bytes()); + let checksum = format!("{:x}", hasher.finalize()); + + Ok(SubmissionResult { + dict: parsed.dict, + flat_dict, + attributes: parsed.attributes, + root_node_name: parsed.root_node_name, + uuid: parsed.uuid, + deprecated_uuid: parsed.deprecated_uuid, + submission_date: parsed.submission_date, + geom_points, + checksum, + }) +} + +fn clean_xml(xml_str: &str) -> String { + // Equivalent to: re.sub(r">\s+<", "><", xml_string.strip()) + let trimmed = xml_str.trim(); + let mut result = String::with_capacity(trimmed.len()); + let mut after_close = false; + let mut whitespace_buf = String::new(); + for ch in trimmed.chars() { + if after_close { + if ch.is_whitespace() { + whitespace_buf.push(ch); + continue; + } else if ch == '<' { + // drop whitespace between > and < + after_close = false; + result.push('<'); + whitespace_buf.clear(); + continue; + } else { + // not followed by <, flush whitespace + result.push_str(&whitespace_buf); + whitespace_buf.clear(); + after_close = false; + } + } + if ch == '>' { + after_close = true; + whitespace_buf.clear(); + } + result.push(ch); + } + result.push_str(&whitespace_buf); + result +} + +#[pymodule] +fn onadata_xml(m: &Bound<'_, PyModule>) -> PyResult<()> { + m.add_function(wrap_pyfunction!(parse_submission, m)?)?; + m.add_class::<SubmissionResult>()?; + Ok(()) +} +``` + +**Step 4: Run tests to verify they pass** + +Run: `cd rust/onadata_xml && cargo test` +Expected: PASS + +**Step 5: Build and smoke test from Python** + +Run: `cd rust/onadata_xml && maturin develop` + +Run: +```python +python -c " +from onadata_xml import parse_submission +r = parse_submission( + '<tutorial id=\"tutorial\"><name>Larry</name><age>23</age><meta><instanceID>uuid:abc123</instanceID></meta></tutorial>', + [], False, set(), [] +) +print('root:', r.root_node_name) +print('uuid:', r.uuid) +print('dict:', r.dict) +print('flat:', r.flat_dict) +print('attrs:', r.attributes) +print('sha:', r.checksum[:16]) +" +``` + +Expected: Prints correct parsed values. 
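
To judge whether the printed values are correct, a pure-Python reference for the two non-parsing steps may help. The sketch below assumes, as the Rust code in Step 3 does, that cleaning is equivalent to `re.sub(r">\s+<", "><", xml_string.strip())` and that the checksum is the SHA-256 hex digest of the raw, unstripped input string. The helper names `clean_xml_reference` and `checksum_reference` are hypothetical and not part of onadata.

```python
import hashlib
import re


def clean_xml_reference(xml_string):
    """Drop whitespace between adjacent tags, as the Rust clean_xml is meant to."""
    return re.sub(r">\s+<", "><", xml_string.strip())


def checksum_reference(xml_string):
    """SHA-256 hex digest of the raw XML string, assuming UTF-8 encoding."""
    return hashlib.sha256(xml_string.encode("utf-8")).hexdigest()
```

Comparing `checksum_reference(xml)[:16]` with the `sha:` line of the smoke test is a quick way to confirm that both sides hash the same bytes. Whitespace inside text nodes is left alone, since only runs between `>` and `<` are removed.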
+ +**Step 6: Commit** + +```bash +git add rust/onadata_xml/src/ +git commit -m "feat: wire up full parse_submission with SubmissionResult" +``` + +--- + +### Task 6: Add the Python Wrapper Class + +**Files:** +- Modify: `onadata/apps/logger/xform_instance_parser.py` + +**Step 1: Write Python test for RustXFormInstanceParser** + +Create test in `onadata/apps/logger/tests/test_rust_parsing.py`: + +```python +"""Tests that RustXFormInstanceParser produces identical output to XFormInstanceParser.""" +import os + +from django.test import TestCase, override_settings + +from onadata.apps.main.tests.test_base import TestBase +from onadata.apps.logger.xform_instance_parser import ( + RustXFormInstanceParser, + XFormInstanceParser, +) +from onadata.libs.utils.common_tags import XFORM_ID_STRING, VERSION + + +class TestRustXFormInstanceParser(TestBase): + """Compare Rust parser output against Python parser for identical inputs.""" + + def _publish_and_get_xml(self, fixture_dir, xls_name, xml_name): + self._create_user_and_login() + xls_path = os.path.join( + os.path.dirname(os.path.abspath(__file__)), + f"../fixtures/{fixture_dir}/{xls_name}", + ) + self._publish_xls_file_and_set_xform(xls_path) + xml_path = os.path.join( + os.path.dirname(os.path.abspath(__file__)), + f"../fixtures/{fixture_dir}/instances/{xml_name}", + ) + with open(xml_path) as f: + return f.read() + + @override_settings(USE_RUST_XML_PARSER=True) + def test_nested_repeats_match(self): + xml = self._publish_and_get_xml( + "new_repeats", "new_repeats.xlsx", + "new_repeats_2012-07-05-14-33-53.xml", + ) + py_parser = XFormInstanceParser(xml, self.xform) + rust_parser = RustXFormInstanceParser(xml, self.xform) + + self.assertEqual(py_parser.to_dict(), rust_parser.to_dict()) + self.assertEqual(py_parser.to_flat_dict(), rust_parser.to_flat_dict()) + self.assertEqual( + py_parser.get_flat_dict_with_attributes(), + rust_parser.get_flat_dict_with_attributes(), + ) + self.assertEqual(py_parser.get_root_node_name(), 
rust_parser.get_root_node_name()) + self.assertEqual(py_parser.get_xform_id_string(), rust_parser.get_xform_id_string()) + + @override_settings(USE_RUST_XML_PARSER=True) + def test_encrypted_form_match(self): + xml = self._publish_and_get_xml( + "tutorial_encrypted", "tutorial_encrypted.xlsx", + "tutorial_encrypted.xml", + ) + py_parser = XFormInstanceParser(xml, self.xform) + rust_parser = RustXFormInstanceParser(xml, self.xform) + + self.assertEqual(py_parser.to_dict(), rust_parser.to_dict()) + self.assertEqual(py_parser.to_flat_dict(), rust_parser.to_flat_dict()) +``` + +**Step 2: Run test to verify it fails** + +Run: `python manage.py test onadata.apps.logger.tests.test_rust_parsing -v2 --settings=onadata.settings.github_actions_test` +Expected: FAIL (RustXFormInstanceParser doesn't exist yet) + +**Step 3: Add RustXFormInstanceParser to xform_instance_parser.py** + +Add at the end of `onadata/apps/logger/xform_instance_parser.py`: + +```python +class RustXFormInstanceParser: + """Drop-in replacement for XFormInstanceParser using Rust native parser.""" + + def __init__(self, xml_str, data_dictionary): + self.data_dictionary = data_dictionary + repeat_xpaths = [ + get_abbreviated_xpath(e.get_xpath()) + for e in data_dictionary.get_survey_elements_of_type("repeat") + ] + + from onadata.libs.data.query import get_numeric_fields + + numeric_fields = set(get_numeric_fields(data_dictionary)) + geo_xpaths = ( + data_dictionary.geopoint_xpaths() + if hasattr(data_dictionary, "geopoint_xpaths") + else [] + ) + + from onadata_xml import parse_submission + + self._result = parse_submission( + smart_str(xml_str.strip()) if isinstance(xml_str, str) else xml_str, + repeat_xpaths, + data_dictionary.encrypted, + numeric_fields, + geo_xpaths, + ) + + def get_root_node(self): + return None + + def get_root_node_name(self): + return self._result.root_node_name + + def to_dict(self): + return self._result.dict + + def to_flat_dict(self): + return self._result.flat_dict + + def 
get_attributes(self): + return self._result.attributes + + def get_xform_id_string(self): + return self._result.attributes["id"] + + def get_version(self): + return self._result.attributes.get("version") + + def get_flat_dict_with_attributes(self): + result = self.to_flat_dict().copy() + result[XFORM_ID_STRING] = self.get_xform_id_string() + version = self.get_version() + if version: + result[VERSION] = version + return result +``` + +**Step 4: Run test to verify it passes** + +Run: `python manage.py test onadata.apps.logger.tests.test_rust_parsing -v2 --settings=onadata.settings.github_actions_test` +Expected: PASS + +**Step 5: Commit** + +```bash +git add onadata/apps/logger/xform_instance_parser.py onadata/apps/logger/tests/test_rust_parsing.py +git commit -m "feat: add RustXFormInstanceParser wrapper class with parity tests" +``` + +--- + +### Task 7: Integrate Feature Flag and Wire Into Instance Model + +**Files:** +- Modify: `onadata/settings/common.py` +- Modify: `onadata/apps/logger/models/instance.py` + +**Step 1: Add feature flag settings** + +Add to end of `onadata/settings/common.py`: + +```python +# Rust XML parser feature flags +USE_RUST_XML_PARSER = False +RUST_XML_PARSER_SHADOW_MODE = False +``` + +**Step 2: Modify Instance._set_parser()** + +In `onadata/apps/logger/models/instance.py`, line 516-520, change: + +```python +def _set_parser(self): + if not hasattr(self, "_parser"): + self._parser = XFormInstanceParser(self.xml, self.xform) +``` + +To: + +```python +def _set_parser(self): + if not hasattr(self, "_parser"): + if getattr(settings, "USE_RUST_XML_PARSER", False): + from onadata.apps.logger.xform_instance_parser import ( + RustXFormInstanceParser, + ) + self._parser = RustXFormInstanceParser(self.xml, self.xform) + else: + self._parser = XFormInstanceParser(self.xml, self.xform) +``` + +**Step 3: Modify Instance._set_geom() to use cached geom_points** + +In `onadata/apps/logger/models/instance.py`, line 416-441, change `_set_geom` +to check 
for Rust parser result first: + +```python +def _set_geom(self): + xform = self.xform + self._set_parser() + + if ( + getattr(settings, "USE_RUST_XML_PARSER", False) + and hasattr(self._parser, "_result") + ): + points = [ + Point(lng, lat) for lat, lng in self._parser._result.geom_points + ] + else: + geo_xpaths = xform.geopoint_xpaths() + doc = self.get_dict() + points = [] + if geo_xpaths: + for xpath in geo_xpaths: + for gps in get_values_matching_key(doc, xpath): + try: + geometry = [float(s) for s in gps.split()] + lat, lng = geometry[0:2] + points.append(Point(lng, lat)) + except ValueError: + return + + if not xform.instances_with_geopoints and points: + xform.instances_with_geopoints = True + xform.save() + + if points: + self.geom = GeometryCollection(points) + else: + self.geom = None +``` + +**Step 4: Modify Instance._set_uuid() to use cached uuid** + +In `onadata/apps/logger/models/instance.py`, lines 528-536, change `_set_uuid` to: + +```python +def _set_uuid(self): + if self.xml and not self.uuid: + if ( + getattr(settings, "USE_RUST_XML_PARSER", False) + and hasattr(self, "_parser") + and hasattr(self._parser, "_result") + ): + uuid = self._parser._result.uuid + else: + uuid = get_uuid_from_xml(self.xml) + if uuid is not None: + self.uuid = uuid + set_uuid(self) +``` + +**Step 5: Run existing tests to verify no regression** + +Run: `python manage.py test onadata.apps.logger.tests.test_parsing -v2 --settings=onadata.settings.github_actions_test` +Expected: PASS (feature flag is off, so existing Python path is used) + +**Step 6: Run with feature flag on** + +Run: `USE_RUST_XML_PARSER=True python manage.py test onadata.apps.logger.tests.test_parsing -v2 --settings=onadata.settings.github_actions_test` +Expected: PASS + +Note: the flag added in Step 1 is a plain constant, so the environment variable only takes effect if the settings module reads it from the environment. Otherwise, enable the flag with `override_settings(USE_RUST_XML_PARSER=True)` or in a test settings module. + +**Step 7: Commit** + +```bash +git add onadata/settings/common.py onadata/apps/logger/models/instance.py +git commit -m "feat: integrate Rust XML parser with feature flag in Instance model" +``` + +--- + +### Task 8: Add Shadow Mode for Safe Rollout + 
+**Files:** +- Modify: `onadata/apps/logger/models/instance.py` + +**Step 1: Implement shadow mode comparison** + +Add a helper function to `onadata/apps/logger/models/instance.py`: + +```python +def _compare_parser_outputs(py_parser, rust_parser, instance_pk): + """Log differences between Python and Rust parser outputs.""" + logger = logging.getLogger("onadata.rust_parser_shadow") + try: + if py_parser.to_dict() != rust_parser.to_dict(): + logger.warning("dict mismatch for instance pk=%s", instance_pk) + if py_parser.to_flat_dict() != rust_parser.to_flat_dict(): + logger.warning("flat_dict mismatch for instance pk=%s", instance_pk) + if py_parser.get_root_node_name() != rust_parser.get_root_node_name(): + logger.warning("root_node_name mismatch for instance pk=%s", instance_pk) + except Exception: + logger.exception("Shadow mode comparison failed for instance pk=%s", instance_pk) +``` + +**Step 2: Wire shadow mode into _set_parser()** + +Update `_set_parser`: + +```python +def _set_parser(self): + if not hasattr(self, "_parser"): + if getattr(settings, "USE_RUST_XML_PARSER", False): + from onadata.apps.logger.xform_instance_parser import ( + RustXFormInstanceParser, + ) + self._parser = RustXFormInstanceParser(self.xml, self.xform) + else: + self._parser = XFormInstanceParser(self.xml, self.xform) + + if getattr(settings, "RUST_XML_PARSER_SHADOW_MODE", False): + try: + from onadata.apps.logger.xform_instance_parser import ( + RustXFormInstanceParser, + ) + rust_parser = RustXFormInstanceParser(self.xml, self.xform) + _compare_parser_outputs(self._parser, rust_parser, self.pk) + except Exception: + logger = logging.getLogger("onadata.rust_parser_shadow") + logger.exception("Shadow mode Rust parser failed for pk=%s", self.pk) +``` + +**Step 3: Run tests** + +Run: `python manage.py test onadata.apps.logger.tests.test_parsing -v2 --settings=onadata.settings.github_actions_test` +Expected: PASS + +**Step 4: Commit** + +```bash +git add 
onadata/apps/logger/models/instance.py +git commit -m "feat: add shadow mode for Rust XML parser comparison logging" +``` + +--- + +### Task 9: Final Integration Test and Push + +**Files:** +- Modify: `onadata/apps/logger/tests/test_rust_parsing.py` + +**Step 1: Add end-to-end submission test with Rust parser** + +Add to `test_rust_parsing.py`: + +```python +@override_settings(USE_RUST_XML_PARSER=True) +def test_full_submission_with_rust_parser(self): + """Test that a full submission round-trip works with the Rust parser.""" + self._create_user_and_login() + xls_path = os.path.join( + os.path.dirname(os.path.abspath(__file__)), + "../fixtures/tutorial/tutorial.xlsx", + ) + self._publish_xls_file_and_set_xform(xls_path) + xml_path = os.path.join( + os.path.dirname(os.path.abspath(__file__)), + "../fixtures/tutorial/instances/tutorial_2012-06-27_11-27-53_w_uuid.xml", + ) + self._make_submission(xml_path) + self.assertEqual(self.response.status_code, 201) + + # Verify instance was saved correctly + instance = self.xform.instances.first() + self.assertIsNotNone(instance) + self.assertEqual(instance.uuid, "729f173c688e482486a48661700455ff") + + # Verify get_dict works + data = instance.get_dict() + self.assertEqual(data.get("name"), "Larry\n Again") + self.assertEqual(data.get("age"), "23") +``` + +**Step 2: Run full test suite** + +Run: `python manage.py test onadata.apps.logger -v2 --settings=onadata.settings.github_actions_test --parallel=4` +Expected: All existing tests PASS + +**Step 3: Run with Rust parser enabled** + +Run: `USE_RUST_XML_PARSER=True python manage.py test onadata.apps.logger.tests.test_rust_parsing -v2 --settings=onadata.settings.github_actions_test` +Expected: PASS + +**Step 4: Commit and push** + +```bash +git add onadata/apps/logger/tests/test_rust_parsing.py +git commit -m "test: add end-to-end submission test with Rust XML parser" +git push origin rusty +``` + +--- + +## Summary of Tasks + +| Task | Description | Key Files | 
+|------|-------------|-----------| +| 1 | Scaffold Rust crate | `rust/onadata_xml/` | +| 2 | Core XML-to-dict parser | `src/parser.rs` | +| 3 | Dict flattening | `src/flatten.rs` | +| 4 | Numeric conversion + geom extraction | `src/numeric.rs`, `src/geom.rs` | +| 5 | Wire up `parse_submission` + `SubmissionResult` | `src/lib.rs` | +| 6 | Python wrapper class | `xform_instance_parser.py` | +| 7 | Feature flag + Instance model integration | `instance.py`, `common.py` | +| 8 | Shadow mode | `instance.py` | +| 9 | Final integration test + push | `test_rust_parsing.py` | diff --git a/onadata/apps/logger/models/instance.py b/onadata/apps/logger/models/instance.py index 7295004e98..c06a16d9b4 100644 --- a/onadata/apps/logger/models/instance.py +++ b/onadata/apps/logger/models/instance.py @@ -383,6 +383,24 @@ def convert_to_serializable_date(date): return date +def _compare_parser_outputs(py_parser, rust_parser, instance_pk): + """Log differences between Python and Rust parser outputs for shadow mode.""" + _logger = logging.getLogger("onadata.rust_parser_shadow") + try: + if py_parser.to_dict() != rust_parser.to_dict(): + _logger.warning("dict mismatch for instance pk=%s", instance_pk) + if py_parser.to_flat_dict() != rust_parser.to_flat_dict(): + _logger.warning("flat_dict mismatch for instance pk=%s", instance_pk) + if py_parser.get_root_node_name() != rust_parser.get_root_node_name(): + _logger.warning( + "root_node_name mismatch for instance pk=%s", instance_pk + ) + except Exception: + _logger.exception( + "Shadow mode comparison failed for instance pk=%s", instance_pk + ) + + class InstanceBaseClass: """Interface of functions for Instance and InstanceHistory model""" @@ -416,29 +434,40 @@ def numeric_converter(self, json_dict, numeric_fields=None): def _set_geom(self): # pylint: disable=no-member xform = self.xform - geo_xpaths = xform.geopoint_xpaths() - doc = self.get_dict() - points = [] - - if geo_xpaths: - for xpath in geo_xpaths: - for gps in 
get_values_matching_key(doc, xpath): - try: - geometry = [float(s) for s in gps.split()] - lat, lng = geometry[0:2] - points.append(Point(lng, lat)) - except ValueError: - return + self._set_parser() - if not xform.instances_with_geopoints and points: - xform.instances_with_geopoints = True - xform.save() + if ( + getattr(settings, "USE_RUST_XML_PARSER", False) + and hasattr(self._parser, "_result") + ): + points = [ + Point(lng, lat) + for lat, lng in self._parser._result.geom_points + ] + else: + geo_xpaths = xform.geopoint_xpaths() + doc = self.get_dict() + points = [] + + if geo_xpaths: + for xpath in geo_xpaths: + for gps in get_values_matching_key(doc, xpath): + try: + geometry = [float(s) for s in gps.split()] + lat, lng = geometry[0:2] + points.append(Point(lng, lat)) + except ValueError: + return + + if not xform.instances_with_geopoints and points: + xform.instances_with_geopoints = True + xform.save() - # pylint: disable=attribute-defined-outside-init - if points: - self.geom = GeometryCollection(points) - else: - self.geom = None + # pylint: disable=attribute-defined-outside-init + if points: + self.geom = GeometryCollection(points) + else: + self.geom = None def get_full_dict(self, include_related=True): """Returns the submission XML as a python dictionary object @@ -517,7 +546,34 @@ def _set_parser(self): if not hasattr(self, "_parser"): # pylint: disable=no-member # pylint: disable=attribute-defined-outside-init - self._parser = XFormInstanceParser(self.xml, self.xform) + if getattr(settings, "USE_RUST_XML_PARSER", False): + from onadata.apps.logger.xform_instance_parser import ( + RustXFormInstanceParser, + ) + + self._parser = RustXFormInstanceParser(self.xml, self.xform) + else: + self._parser = XFormInstanceParser(self.xml, self.xform) + + if getattr(settings, "RUST_XML_PARSER_SHADOW_MODE", False): + try: + from onadata.apps.logger.xform_instance_parser import ( + RustXFormInstanceParser, + ) + + rust_parser = RustXFormInstanceParser( + self.xml, 
self.xform + ) + _compare_parser_outputs( + self._parser, rust_parser, self.pk + ) + except Exception: + _shadow_logger = logging.getLogger( + "onadata.rust_parser_shadow" + ) + _shadow_logger.exception( + "Shadow mode Rust parser failed for pk=%s", self.pk + ) def _set_survey_type(self): # pylint: disable=attribute-defined-outside-init @@ -529,8 +585,15 @@ def _set_uuid(self): # pylint: disable=no-member,attribute-defined-outside-init # pylint: disable=access-member-before-definition if self.xml and not self.uuid: - # pylint: disable=no-member - uuid = get_uuid_from_xml(self.xml) + if ( + getattr(settings, "USE_RUST_XML_PARSER", False) + and hasattr(self, "_parser") + and hasattr(self._parser, "_result") + ): + uuid = self._parser._result.uuid + else: + # pylint: disable=no-member + uuid = get_uuid_from_xml(self.xml) if uuid is not None: self.uuid = uuid set_uuid(self) diff --git a/onadata/apps/logger/tests/test_rust_parsing.py b/onadata/apps/logger/tests/test_rust_parsing.py new file mode 100644 index 0000000000..ebd634f9ef --- /dev/null +++ b/onadata/apps/logger/tests/test_rust_parsing.py @@ -0,0 +1,124 @@ +"""Tests that RustXFormInstanceParser produces identical output to XFormInstanceParser.""" + +import os + +from django.test import override_settings + +from onadata.apps.logger.xform_instance_parser import ( + RustXFormInstanceParser, + XFormInstanceParser, +) +from onadata.apps.main.tests.test_base import TestBase + + +class TestRustXFormInstanceParser(TestBase): + """Compare Rust parser output against Python parser for identical inputs.""" + + def _get_fixture_path(self, *parts): + return os.path.join( + os.path.dirname(os.path.abspath(__file__)), + "..", + "fixtures", + *parts, + ) + + def _publish_and_get_xml(self, fixture_dir, xls_name, xml_rel_path): + self._create_user_and_login() + xls_path = self._get_fixture_path(fixture_dir, xls_name) + self._publish_xls_file_and_set_xform(xls_path) + xml_path = self._get_fixture_path(fixture_dir, xml_rel_path) + 
with open(xml_path) as f: + return f.read() + + def _assert_parsers_match(self, xml): + """Assert that Python and Rust parsers produce identical output.""" + py_parser = XFormInstanceParser(xml, self.xform) + rust_parser = RustXFormInstanceParser(xml, self.xform) + + self.assertEqual( + py_parser.to_dict(), + rust_parser.to_dict(), + "to_dict() mismatch", + ) + self.assertEqual( + py_parser.to_flat_dict(), + rust_parser.to_flat_dict(), + "to_flat_dict() mismatch", + ) + self.assertEqual( + py_parser.get_flat_dict_with_attributes(), + rust_parser.get_flat_dict_with_attributes(), + "get_flat_dict_with_attributes() mismatch", + ) + self.assertEqual( + py_parser.get_root_node_name(), + rust_parser.get_root_node_name(), + "get_root_node_name() mismatch", + ) + self.assertEqual( + py_parser.get_xform_id_string(), + rust_parser.get_xform_id_string(), + "get_xform_id_string() mismatch", + ) + + def test_nested_repeats_parity(self): + """Test that nested repeats produce identical output.""" + xml = self._publish_and_get_xml( + "new_repeats", + "new_repeats.xlsx", + os.path.join("instances", "new_repeats_2012-07-05-14-33-53.xml"), + ) + self._assert_parsers_match(xml) + + def test_encrypted_form_parity(self): + """Test that encrypted form parsing produces identical output.""" + xml = self._publish_and_get_xml( + "tutorial_encrypted", + "tutorial_encrypted.xlsx", + os.path.join("instances", "tutorial_encrypted.xml"), + ) + self._assert_parsers_match(xml) + + def test_rust_parser_uuid_extraction(self): + """Test UUID extraction from Rust parser.""" + xml = self._publish_and_get_xml( + "new_repeats", + "new_repeats.xlsx", + os.path.join("instances", "new_repeats_2012-07-05-14-33-53.xml"), + ) + rust_parser = RustXFormInstanceParser(xml, self.xform) + # new_repeats fixture doesn't have a UUID in meta + # Just verify the attribute is accessible + self.assertIsNotNone(rust_parser._result) + self.assertEqual(rust_parser.get_root_node_name(), "new_repeats") + + def 
test_rust_parser_geom_extraction(self): + """Test geopoint extraction from Rust parser.""" + xml = self._publish_and_get_xml( + "new_repeats", + "new_repeats.xlsx", + os.path.join("instances", "new_repeats_2012-07-05-14-33-53.xml"), + ) + rust_parser = RustXFormInstanceParser(xml, self.xform) + # The new_repeats form has a gps field + geom_points = rust_parser._result.geom_points + self.assertIsInstance(geom_points, list) + + @override_settings(USE_RUST_XML_PARSER=True) + def test_full_submission_with_rust_parser(self): + """Test that a full submission round-trip works with the Rust parser.""" + self._create_user_and_login() + xls_path = self._get_fixture_path("tutorial", "tutorial.xlsx") + self._publish_xls_file_and_set_xform(xls_path) + xml_path = self._get_fixture_path( + "tutorial", + "instances", + "tutorial_2012-06-27_11-27-53_w_uuid.xml", + ) + self._make_submission(xml_path) + self.assertEqual(self.response.status_code, 201) + + # Verify instance was saved correctly + instance = self.xform.instances.first() + self.assertIsNotNone(instance) + self.assertEqual(instance.uuid, "729f173c688e482486a48661700455ff") diff --git a/onadata/apps/logger/xform_instance_parser.py b/onadata/apps/logger/xform_instance_parser.py index 4927b82eaa..41e00f6b8a 100644 --- a/onadata/apps/logger/xform_instance_parser.py +++ b/onadata/apps/logger/xform_instance_parser.py @@ -463,3 +463,62 @@ def get_entity_uuid_from_xml(xml): """Returns the uuid for the XML submission's entity""" entity_node = get_meta_from_xml(xml, "entity") return entity_node.getAttribute("id") + + +class RustXFormInstanceParser: + """Drop-in replacement for XFormInstanceParser using Rust native parser.""" + + def __init__(self, xml_str, data_dictionary): + self.data_dictionary = data_dictionary + repeat_xpaths = [ + get_abbreviated_xpath(e.get_xpath()) + for e in data_dictionary.get_survey_elements_of_type("repeat") + ] + + from onadata.libs.data.query import get_numeric_fields + + numeric_fields = 
set(get_numeric_fields(data_dictionary)) + geo_xpaths = ( + data_dictionary.geopoint_xpaths() + if hasattr(data_dictionary, "geopoint_xpaths") + else [] + ) + + from onadata_xml import parse_submission + + self._result = parse_submission( + smart_str(xml_str.strip()) if isinstance(xml_str, str) else xml_str, + repeat_xpaths, + data_dictionary.encrypted, + numeric_fields, + geo_xpaths, + ) + + def get_root_node(self): + return None + + def get_root_node_name(self): + return self._result.root_node_name + + def to_dict(self): + return self._result.dict + + def to_flat_dict(self): + return self._result.flat_dict + + def get_attributes(self): + return self._result.attributes + + def get_xform_id_string(self): + return self._result.attributes["id"] + + def get_version(self): + return self._result.attributes.get("version") + + def get_flat_dict_with_attributes(self): + result = self.to_flat_dict().copy() + result[XFORM_ID_STRING] = self.get_xform_id_string() + version = self.get_version() + if version: + result[VERSION] = version + return result diff --git a/onadata/settings/common.py b/onadata/settings/common.py index 18906bea76..545a612d55 100644 --- a/onadata/settings/common.py +++ b/onadata/settings/common.py @@ -722,3 +722,7 @@ def configure_logging(logger, **kwargs): CSP_INCLUDE_NONCE_IN = ["script-src", "style-src"] ENABLE_TABLE_PARTITIONING = False + +# Rust XML parser feature flags +USE_RUST_XML_PARSER = False +RUST_XML_PARSER_SHADOW_MODE = False diff --git a/rust/onadata_xml/.gitignore b/rust/onadata_xml/.gitignore new file mode 100644 index 0000000000..ea8c4bf7f3 --- /dev/null +++ b/rust/onadata_xml/.gitignore @@ -0,0 +1 @@ +/target diff --git a/rust/onadata_xml/Cargo.lock b/rust/onadata_xml/Cargo.lock new file mode 100644 index 0000000000..f09faf281d --- /dev/null +++ b/rust/onadata_xml/Cargo.lock @@ -0,0 +1,268 @@ +# This file is automatically @generated by Cargo. +# It is not intended for manual editing. 
+version = 4 + +[[package]] +name = "autocfg" +version = "1.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8" + +[[package]] +name = "block-buffer" +version = "0.10.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3078c7629b62d3f0439517fa394996acacc5cbc91c5a20d8c658e77abd503a71" +dependencies = [ + "generic-array", +] + +[[package]] +name = "cfg-if" +version = "1.0.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801" + +[[package]] +name = "cpufeatures" +version = "0.2.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "59ed5838eebb26a2bb2e58f6d5b5316989ae9d08bab10e0e6d103e656d1b0280" +dependencies = [ + "libc", +] + +[[package]] +name = "crypto-common" +version = "0.1.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "78c8292055d1c1df0cce5d180393dc8cce0abec0a7102adb6c7b1eef6016d60a" +dependencies = [ + "generic-array", + "typenum", +] + +[[package]] +name = "digest" +version = "0.10.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292" +dependencies = [ + "block-buffer", + "crypto-common", +] + +[[package]] +name = "generic-array" +version = "0.14.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "85649ca51fd72272d7821adaf274ad91c288277713d9c18820d8499a7ff69e9a" +dependencies = [ + "typenum", + "version_check", +] + +[[package]] +name = "heck" +version = "0.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea" + +[[package]] +name = "indoc" +version = "2.0.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"79cf5c93f93228cf8efb3ba362535fb11199ac548a09ce117c9b1adc3030d706" +dependencies = [ + "rustversion", +] + +[[package]] +name = "libc" +version = "0.2.182" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6800badb6cb2082ffd7b6a67e6125bb39f18782f793520caee8cb8846be06112" + +[[package]] +name = "memchr" +version = "2.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79" + +[[package]] +name = "memoffset" +version = "0.9.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "488016bfae457b036d996092f6cb448677611ce4449e970ceaf42695203f218a" +dependencies = [ + "autocfg", +] + +[[package]] +name = "onadata_xml" +version = "0.1.0" +dependencies = [ + "pyo3", + "quick-xml", + "sha2", +] + +[[package]] +name = "once_cell" +version = "1.21.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "42f5e15c9953c5e4ccceeb2e7382a716482c34515315f7b03532b8b4e8393d2d" + +[[package]] +name = "portable-atomic" +version = "1.13.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c33a9471896f1c69cecef8d20cbe2f7accd12527ce60845ff44c153bb2a21b49" + +[[package]] +name = "proc-macro2" +version = "1.0.106" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934" +dependencies = [ + "unicode-ident", +] + +[[package]] +name = "pyo3" +version = "0.23.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7778bffd85cf38175ac1f545509665d0b9b92a198ca7941f131f85f7a4f9a872" +dependencies = [ + "cfg-if", + "indoc", + "libc", + "memoffset", + "once_cell", + "portable-atomic", + "pyo3-build-config", + "pyo3-ffi", + "pyo3-macros", + "unindent", +] + +[[package]] +name = "pyo3-build-config" +version = "0.23.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"94f6cbe86ef3bf18998d9df6e0f3fc1050a8c5efa409bf712e661a4366e010fb" +dependencies = [ + "once_cell", + "target-lexicon", +] + +[[package]] +name = "pyo3-ffi" +version = "0.23.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e9f1b4c431c0bb1c8fb0a338709859eed0d030ff6daa34368d3b152a63dfdd8d" +dependencies = [ + "libc", + "pyo3-build-config", +] + +[[package]] +name = "pyo3-macros" +version = "0.23.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fbc2201328f63c4710f68abdf653c89d8dbc2858b88c5d88b0ff38a75288a9da" +dependencies = [ + "proc-macro2", + "pyo3-macros-backend", + "quote", + "syn", +] + +[[package]] +name = "pyo3-macros-backend" +version = "0.23.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fca6726ad0f3da9c9de093d6f116a93c1a38e417ed73bf138472cf4064f72028" +dependencies = [ + "heck", + "proc-macro2", + "pyo3-build-config", + "quote", + "syn", +] + +[[package]] +name = "quick-xml" +version = "0.37.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "331e97a1af0bf59823e6eadffe373d7b27f485be8748f71471c662c1f269b7fb" +dependencies = [ + "memchr", +] + +[[package]] +name = "quote" +version = "1.0.44" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "21b2ebcf727b7760c461f091f9f0f539b77b8e87f2fd88131e7f1b433b3cece4" +dependencies = [ + "proc-macro2", +] + +[[package]] +name = "rustversion" +version = "1.0.22" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b39cdef0fa800fc44525c84ccb54a029961a8215f9619753635a9c0d2538d46d" + +[[package]] +name = "sha2" +version = "0.10.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a7507d819769d01a365ab707794a4084392c824f54a7a6a7862f8c3d0892b283" +dependencies = [ + "cfg-if", + "cpufeatures", + "digest", +] + +[[package]] +name = "syn" +version = "2.0.116" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "3df424c70518695237746f84cede799c9c58fcb37450d7b23716568cc8bc69cb" +dependencies = [ + "proc-macro2", + "quote", + "unicode-ident", +] + +[[package]] +name = "target-lexicon" +version = "0.12.16" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "61c41af27dd6d1e27b1b16b489db798443478cef1f06a660c96db617ba5de3b1" + +[[package]] +name = "typenum" +version = "1.19.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "562d481066bde0658276a35467c4af00bdc6ee726305698a55b86e61d7ad82bb" + +[[package]] +name = "unicode-ident" +version = "1.0.24" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75" + +[[package]] +name = "unindent" +version = "0.2.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7264e107f553ccae879d21fbea1d6724ac785e8c3bfc762137959b5802826ef3" + +[[package]] +name = "version_check" +version = "0.9.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a" diff --git a/rust/onadata_xml/Cargo.toml b/rust/onadata_xml/Cargo.toml new file mode 100644 index 0000000000..b50272e694 --- /dev/null +++ b/rust/onadata_xml/Cargo.toml @@ -0,0 +1,13 @@ +[package] +name = "onadata_xml" +version = "0.1.0" +edition = "2021" + +[lib] +name = "onadata_xml" +crate-type = ["cdylib"] + +[dependencies] +pyo3 = { version = "0.23", features = ["extension-module"] } +quick-xml = "0.37" +sha2 = "0.10" diff --git a/rust/onadata_xml/pyproject.toml b/rust/onadata_xml/pyproject.toml new file mode 100644 index 0000000000..f349263cc3 --- /dev/null +++ b/rust/onadata_xml/pyproject.toml @@ -0,0 +1,11 @@ +[build-system] +requires = ["maturin>=1.0,<2.0"] +build-backend = "maturin" + +[project] +name = "onadata-xml" +version = "0.1.0" +requires-python = ">=3.9" + 
+[tool.maturin] +features = ["pyo3/extension-module"] diff --git a/rust/onadata_xml/src/flatten.rs b/rust/onadata_xml/src/flatten.rs new file mode 100644 index 0000000000..af7cda2d69 --- /dev/null +++ b/rust/onadata_xml/src/flatten.rs @@ -0,0 +1,278 @@ +/// Dict flattening module. +/// +/// Replicates Python's `_flatten_dict_nest_repeats` which: +/// - For regular values: yields (path, value) where path is list of keys +/// - For dicts: recurses deeper +/// - For lists (repeat groups): creates a list of flattened sub-dicts, +/// each with full xpath keys joined by "/", stripped of root node prefix +/// - Final flat_dict is built by: {"/".join(path[1:]): value} +use crate::parser::Value; + +/// A flattened entry: (path_segments, value). +/// The value can be a simple Value::Str or a Value::List of flattened dicts. +type FlatEntry = (Vec<String>, Value); + +/// Flatten a nested dict with repeat nesting. +/// +/// Replicates `_flatten_dict_nest_repeats(data_dict, prefix)`. +/// +/// `data_dict` must be a Value::Dict (list of (key, value) pairs). +/// `prefix` is the current path prefix. 
+fn flatten_dict_nest_repeats_inner(data_dict: &[(String, Value)], prefix: &[String]) -> Vec<FlatEntry> { + let mut entries = Vec::new(); + + for (key, value) in data_dict { + let mut new_prefix = prefix.to_vec(); + new_prefix.push(key.clone()); + + match value { + Value::Dict(inner_pairs) => { + // Recurse into dict + let sub = flatten_dict_nest_repeats_inner(inner_pairs, &new_prefix); + entries.extend(sub); + } + Value::List(items) => { + // Create a list of flattened sub-dicts + let mut repeats: Vec<Value> = Vec::new(); + + for item in items { + let item_prefix = new_prefix.clone(); + + match item { + Value::Dict(item_pairs) => { + // Flatten each dict item into a flat dict + let sub_entries = + flatten_dict_nest_repeats_inner(item_pairs, &item_prefix); + let mut repeat_dict: Vec<(String, Value)> = Vec::new(); + + for (path, r_value) in sub_entries { + // Join path[1:] with "/" + let flat_key = path[1..].join("/"); + repeat_dict.push((flat_key, r_value)); + } + repeats.push(Value::Dict(repeat_dict)); + } + _ => { + // Non-dict item in list (e.g. a string) + let flat_key = item_prefix[1..].join("/"); + let mut repeat_dict: Vec<(String, Value)> = Vec::new(); + repeat_dict.push((flat_key, item.clone())); + repeats.push(Value::Dict(repeat_dict)); + } + } + } + + entries.push((new_prefix, Value::List(repeats))); + } + Value::Str(_) => { + entries.push((new_prefix, value.clone())); + } + } + } + + entries +} + +/// Flatten a parsed XML dict into a flat dict. +/// +/// Takes the top-level dict (e.g. {"tutorial": {...}}) and returns +/// a flat dict where keys are xpath segments joined by "/", with the +/// root node name stripped. 
+/// +/// This matches the Python code: +/// ```python +/// for path, value in _flatten_dict_nest_repeats(self._dict, []): +/// self._flat_dict["/".join(path[1:])] = value +/// ``` +pub fn flatten_dict(dict: &Value) -> Vec<(String, Value)> { + match dict { + Value::Dict(pairs) => { + let entries = flatten_dict_nest_repeats_inner(pairs, &[]); + let mut flat = Vec::new(); + for (path, value) in entries { + let key = path[1..].join("/"); + flat.push((key, value)); + } + flat + } + _ => Vec::new(), + } +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::parser::{parse_xml, Value}; + + fn find_flat<'a>(flat: &'a [(String, Value)], key: &str) -> Option<&'a Value> { + flat.iter().find(|(k, _)| k == key).map(|(_, v)| v) + } + + #[test] + fn test_simple_form_flatten() { + let xml = r#" + Larry + Again + + 23 + 1333604907194.jpg + 0 + -1.2836198 36.8795437 0.0 1044.0 + firefox chrome safari + + uuid:729f173c688e482486a48661700455ff + +"#; + + let result = parse_xml(xml, &[], false).unwrap(); + let dict = result.dict.unwrap(); + let flat = flatten_dict(&dict); + + assert_eq!( + find_flat(&flat, "name"), + Some(&Value::Str("Larry\n Again\n ".to_string())) + ); + assert_eq!( + find_flat(&flat, "age"), + Some(&Value::Str("23".to_string())) + ); + assert_eq!( + find_flat(&flat, "picture"), + Some(&Value::Str("1333604907194.jpg".to_string())) + ); + assert_eq!( + find_flat(&flat, "has_children"), + Some(&Value::Str("0".to_string())) + ); + assert_eq!( + find_flat(&flat, "gps"), + Some(&Value::Str( + "-1.2836198 36.8795437 0.0 1044.0".to_string() + )) + ); + assert_eq!( + find_flat(&flat, "web_browsers"), + Some(&Value::Str("firefox chrome safari".to_string())) + ); + assert_eq!( + find_flat(&flat, "meta/instanceID"), + Some(&Value::Str( + "uuid:729f173c688e482486a48661700455ff".to_string() + )) + ); + } + + #[test] + fn test_nested_repeats_flatten() { + let xml = r#" + 80Adam + 50Abel1 + chrome ie + -1.2627557 36.7926442 0.0 30.0 +"#; + + let repeats = 
vec!["kids/kids_details".to_string()]; + let result = parse_xml(xml, &repeats, false).unwrap(); + let dict = result.dict.unwrap(); + let flat = flatten_dict(&dict); + + assert_eq!( + find_flat(&flat, "gps"), + Some(&Value::Str( + "-1.2627557 36.7926442 0.0 30.0".to_string() + )) + ); + assert_eq!( + find_flat(&flat, "kids/has_kids"), + Some(&Value::Str("1".to_string())) + ); + assert_eq!( + find_flat(&flat, "info/age"), + Some(&Value::Str("80".to_string())) + ); + assert_eq!( + find_flat(&flat, "info/name"), + Some(&Value::Str("Adam".to_string())) + ); + assert_eq!( + find_flat(&flat, "web_browsers"), + Some(&Value::Str("chrome ie".to_string())) + ); + + // kids/kids_details should be a list of flattened dicts + let kids_details = find_flat(&flat, "kids/kids_details").unwrap(); + match kids_details { + Value::List(list) => { + assert_eq!(list.len(), 1); + match &list[0] { + Value::Dict(d) => { + // Check for kids/kids_details/kids_age and kids/kids_details/kids_name + assert!(d.iter().any(|(k, v)| k == "kids/kids_details/kids_age" + && *v == Value::Str("50".to_string()))); + assert!(d.iter().any(|(k, v)| k == "kids/kids_details/kids_name" + && *v == Value::Str("Abel".to_string()))); + } + _ => panic!("Expected Dict in list"), + } + } + _ => panic!("Expected List for kids/kids_details"), + } + } + + #[test] + fn test_encrypted_media_flatten() { + let xml = r#"ZJTcuuid:f8971231-f3b8-4b2b-8c35-d95fa207d937 +1483528430996.jpg.enc +1483528445767.jpg.enc +submission.xml.encUUR8"#; + + let result = parse_xml(xml, &[], true).unwrap(); + let dict = result.dict.unwrap(); + let flat = flatten_dict(&dict); + + let media = find_flat(&flat, "media").unwrap(); + match media { + Value::List(list) => { + assert_eq!(list.len(), 2); + match &list[0] { + Value::Dict(d) => { + assert!(d.iter().any(|(k, v)| k == "media/file" + && *v == Value::Str("1483528430996.jpg.enc".to_string()))); + } + _ => panic!("Expected Dict"), + } + match &list[1] { + Value::Dict(d) => { + 
assert!(d.iter().any(|(k, v)| k == "media/file" + && *v == Value::Str("1483528445767.jpg.enc".to_string()))); + } + _ => panic!("Expected Dict"), + } + } + _ => panic!("Expected List for media"), + } + } + + #[test] + fn test_auto_repeated_flatten() { + // S2A repeated 3 times without being in repeat_xpaths + let xml = r#" +11.25 +11.25 +12test8test25test +"#; + + let result = parse_xml(xml, &[], false).unwrap(); + let dict = result.dict.unwrap(); + let flat = flatten_dict(&dict); + + // S2A should be a list in the flat dict + let s2a = find_flat(&flat, "S2A").unwrap(); + match s2a { + Value::List(list) => { + assert_eq!(list.len(), 3); + } + _ => panic!("Expected List for S2A, got {:?}", s2a), + } + } +} diff --git a/rust/onadata_xml/src/geom.rs b/rust/onadata_xml/src/geom.rs new file mode 100644 index 0000000000..bdd7c0f96c --- /dev/null +++ b/rust/onadata_xml/src/geom.rs @@ -0,0 +1,200 @@ +/// Geopoint extraction module. +/// +/// Replicates Python's `_set_geom` which: +/// 1. For each xpath in geo_xpaths, searches the NESTED dict recursively +/// for matching keys (using `get_values_matching_key` from dict_tools.py) +/// 2. Splits GPS string by whitespace, takes first 2 as (lat, lng) floats +/// +/// The search function `get_values_matching_key` does a recursive traversal +/// of the entire dict structure, including into lists. +use crate::parser::Value; + +/// Recursively search a Value tree for all values with a matching key. +/// +/// Replicates Python's `get_values_matching_key(doc, key)` from dict_tools.py. 
+/// +/// The Python code: +/// - If key in doc: yield doc[key] +/// - For each (k, v) in doc.items(): +/// - If v is dict: recurse +/// - If v is list: for each item, if dict/list: recurse; elif item == key: yield item +fn get_values_matching_key<'a>(value: &'a Value, key: &str) -> Vec<&'a Value> { + let mut results = Vec::new(); + + match value { + Value::Dict(pairs) => { + // First check if this dict directly contains the key + if let Some(v) = pairs.iter().find(|(k, _)| k == key) { + results.push(&v.1); + } + + // Then recurse into all values + for (_k, v) in pairs { + match v { + Value::Dict(_) => { + results.extend(get_values_matching_key(v, key)); + } + Value::List(items) => { + for item in items { + match item { + Value::Dict(_) | Value::List(_) => { + results.extend(get_values_matching_key(item, key)); + } + Value::Str(s) if s == key => { + results.push(item); + } + _ => {} + } + } + } + _ => {} + } + } + } + Value::List(items) => { + for item in items { + match item { + Value::Dict(_) | Value::List(_) => { + results.extend(get_values_matching_key(item, key)); + } + Value::Str(s) if s == key => { + results.push(item); + } + _ => {} + } + } + } + _ => {} + } + + results +} + +/// Extract geopoints from the nested dict. +/// +/// For each geo_xpath, search the nested dict recursively for matching keys. +/// For each matching value (GPS string), split by whitespace and take first 2 as (lat, lng). +/// +/// Returns a list of (lat, lng) tuples. On any parse error for a geopoint, +/// returns early with the points collected so far (matching Python's `return` on ValueError). +pub fn extract_geopoints(dict: &Value, geo_xpaths: &[String]) -> Vec<(f64, f64)> { + let mut points = Vec::new(); + + for xpath in geo_xpaths { + // Search the nested dict recursively for matching keys. + // geo_xpaths contains abbreviated xpaths used as search keys. 
+ let values = get_values_matching_key(dict, xpath); + for gps_val in values { + if let Value::Str(gps_str) = gps_val { + let parts: Vec<&str> = gps_str.split_whitespace().collect(); + if parts.len() >= 2 { + match (parts[0].parse::<f64>(), parts[1].parse::<f64>()) { + (Ok(lat), Ok(lng)) => { + points.push((lat, lng)); + } + _ => { + // Python returns on ValueError, stopping all processing + return points; + } + } + } + } + } + } + + points +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::parser::parse_xml; + + #[test] + fn test_extract_gps_simple() { + let xml = r#" + Larry + -1.2836198 36.8795437 0.0 1044.0 + uuid:abc +"#; + + let result = parse_xml(xml, &[], false).unwrap(); + let dict = result.dict.unwrap(); + let points = extract_geopoints(&dict, &["gps".to_string()]); + + assert_eq!(points.len(), 1); + assert!((points[0].0 - (-1.2836198)).abs() < 1e-10); + assert!((points[0].1 - 36.8795437).abs() < 1e-10); + } + + #[test] + fn test_extract_gps_nested() { + let xml = r#" + 80 + -1.2627557 36.7926442 0.0 30.0 +"#; + + let result = parse_xml(xml, &[], false).unwrap(); + let dict = result.dict.unwrap(); + let points = extract_geopoints(&dict, &["gps".to_string()]); + + assert_eq!(points.len(), 1); + assert!((points[0].0 - (-1.2627557)).abs() < 1e-10); + assert!((points[0].1 - 36.7926442).abs() < 1e-10); + } + + #[test] + fn test_no_gps() { + let xml = "test"; + let result = parse_xml(xml, &[], false).unwrap(); + let dict = result.dict.unwrap(); + let points = extract_geopoints(&dict, &["gps".to_string()]); + assert!(points.is_empty()); + } + + #[test] + fn test_empty_geo_xpaths() { + let xml = "-1.0 36.0 0.0 0.0"; + let result = parse_xml(xml, &[], false).unwrap(); + let dict = result.dict.unwrap(); + let points = extract_geopoints(&dict, &[]); + assert!(points.is_empty()); + } + + #[test] + fn test_get_values_matching_key_nested() { + // Simulate a nested dict structure + let dict = Value::Dict(vec![ + ( + "root".to_string(), + Value::Dict(vec![ + ( + 
"group".to_string(), + Value::Dict(vec![("gps".to_string(), Value::Str("-1.0 36.0".to_string()))]), + ), + ]), + ), + ]); + + let values = get_values_matching_key(&dict, "gps"); + assert_eq!(values.len(), 1); + assert_eq!(*values[0], Value::Str("-1.0 36.0".to_string())); + } + + #[test] + fn test_get_values_matching_key_in_list() { + // Value with a list of dicts (repeat group) + let dict = Value::Dict(vec![ + ( + "locations".to_string(), + Value::List(vec![ + Value::Dict(vec![("gps".to_string(), Value::Str("-1.0 36.0".to_string()))]), + Value::Dict(vec![("gps".to_string(), Value::Str("-2.0 37.0".to_string()))]), + ]), + ), + ]); + + let values = get_values_matching_key(&dict, "gps"); + assert_eq!(values.len(), 2); + } +} diff --git a/rust/onadata_xml/src/lib.rs b/rust/onadata_xml/src/lib.rs new file mode 100644 index 0000000000..96a4078f05 --- /dev/null +++ b/rust/onadata_xml/src/lib.rs @@ -0,0 +1,184 @@ +use std::collections::HashSet; + +use pyo3::prelude::*; +use pyo3::types::{PyDict, PyList, PyNone, PyString}; +use sha2::{Digest, Sha256}; + +mod flatten; +mod geom; +mod numeric; +mod parser; + +use flatten::flatten_dict; +use geom::extract_geopoints; +use numeric::{numeric_checker, NumericValue}; +use parser::{parse_xml, Value}; + +// --------------------------------------------------------------------------- +// SubmissionResult PyO3 class +// --------------------------------------------------------------------------- + +#[pyclass] +pub struct SubmissionResult { + #[pyo3(get)] + pub dict: PyObject, + #[pyo3(get)] + pub flat_dict: PyObject, + #[pyo3(get)] + pub attributes: PyObject, + #[pyo3(get)] + pub root_node_name: String, + #[pyo3(get)] + pub uuid: Option<String>, + #[pyo3(get)] + pub deprecated_uuid: Option<String>, + #[pyo3(get)] + pub submission_date: Option<String>, + #[pyo3(get)] + pub geom_points: Vec<(f64, f64)>, + #[pyo3(get)] + pub checksum: String, +} + +// --------------------------------------------------------------------------- +// Value -> Python object conversion 
+// --------------------------------------------------------------------------- + +/// Convert a parser::Value to a Python object. +/// +/// - Value::Str -> Python str (or int/float if in numeric_fields) +/// - Value::Dict -> Python dict +/// - Value::List -> Python list +fn value_to_py(py: Python<'_>, value: &Value, numeric_fields: &HashSet<String>, current_key: &str) -> PyResult<PyObject> { + match value { + Value::Str(s) => { + if numeric_fields.contains(current_key) { + match numeric_checker(s) { + NumericValue::Int(i) => Ok(i.into_pyobject(py)?.into_any().unbind()), + NumericValue::Float(f) => Ok(f.into_pyobject(py)?.into_any().unbind()), + NumericValue::Str(s) => Ok(PyString::new(py, &s).into_any().unbind()), + } + } else { + Ok(PyString::new(py, s).into_any().unbind()) + } + } + Value::Dict(pairs) => { + let dict = PyDict::new(py); + for (key, val) in pairs { + let py_val = value_to_py(py, val, numeric_fields, key)?; + dict.set_item(key, py_val)?; + } + Ok(dict.into_any().unbind()) + } + Value::List(items) => { + let list = PyList::empty(py); + for item in items { + let py_item = value_to_py(py, item, numeric_fields, current_key)?; + list.append(py_item)?; + } + Ok(list.into_any().unbind()) + } + } +} + +/// Convert a flat dict (Vec of (String, Value)) to a Python dict. +/// +/// The numeric_fields set contains abbreviated xpaths that should be converted +/// to numeric values. This matches Python's `numeric_converter` which walks the +/// flat dict recursively. 
+fn flat_dict_to_py( + py: Python<'_>, + flat: &[(String, Value)], + numeric_fields: &HashSet<String>, +) -> PyResult<PyObject> { + let dict = PyDict::new(py); + for (key, value) in flat { + let py_val = value_to_py(py, value, numeric_fields, key)?; + dict.set_item(key, py_val)?; + } + Ok(dict.into_any().unbind()) +} + +// --------------------------------------------------------------------------- +// SHA256 checksum +// --------------------------------------------------------------------------- + +fn sha256_hex(data: &str) -> String { + let mut hasher = Sha256::new(); + hasher.update(data.as_bytes()); + format!("{:x}", hasher.finalize()) +} + +// --------------------------------------------------------------------------- +// parse_submission pyfunction +// --------------------------------------------------------------------------- + +/// Parse an XML submission and return a SubmissionResult. +/// +/// Arguments: +/// - xml_str: The raw XML string +/// - repeat_xpaths: List of xpaths that should be treated as repeating groups +/// - encrypted: Whether the form is encrypted (forces "media" to be list-type) +/// - numeric_fields: Set of abbreviated xpaths for numeric conversion +/// - geo_xpaths: List of field names for geopoint extraction +#[pyfunction] +#[pyo3(signature = (xml_str, repeat_xpaths, encrypted, numeric_fields, geo_xpaths))] +fn parse_submission( + py: Python<'_>, + xml_str: &str, + repeat_xpaths: Vec<String>, + encrypted: bool, + numeric_fields: HashSet<String>, + geo_xpaths: Vec<String>, +) -> PyResult<SubmissionResult> { + // Parse XML + let parse_result = parse_xml(xml_str, &repeat_xpaths, encrypted) + .map_err(|e| pyo3::exceptions::PyValueError::new_err(e))?; + + // Build Python dict from parsed Value tree + let py_dict = match &parse_result.dict { + Some(dict) => value_to_py(py, dict, &numeric_fields, "")?, + None => PyNone::get(py).to_owned().into_any().unbind(), + }; + + // Build flat dict + let flat = match &parse_result.dict { + Some(dict) => flatten_dict(dict), + None => Vec::new(), + }; + let 
py_flat_dict = flat_dict_to_py(py, &flat, &numeric_fields)?; + + // Build attributes dict + let attrs_dict = PyDict::new(py); + for (key, value) in &parse_result.attributes { + attrs_dict.set_item(key, value)?; + } + + // Extract geopoints from the nested dict + let geom_points = match &parse_result.dict { + Some(dict) => extract_geopoints(dict, &geo_xpaths), + None => Vec::new(), + }; + + // SHA256 of original XML string + let checksum = sha256_hex(xml_str); + + Ok(SubmissionResult { + dict: py_dict, + flat_dict: py_flat_dict, + attributes: attrs_dict.into_any().unbind(), + root_node_name: parse_result.root_node_name, + uuid: parse_result.uuid, + deprecated_uuid: parse_result.deprecated_uuid, + submission_date: parse_result.submission_date, + geom_points, + checksum, + }) +} + +#[pymodule] +fn onadata_xml(m: &Bound<'_, PyModule>) -> PyResult<()> { + m.add_class::<SubmissionResult>()?; + m.add_function(wrap_pyfunction!(parse_submission, m)?)?; + Ok(()) +} diff --git a/rust/onadata_xml/src/numeric.rs b/rust/onadata_xml/src/numeric.rs new file mode 100644 index 0000000000..2c789f72f4 --- /dev/null +++ b/rust/onadata_xml/src/numeric.rs @@ -0,0 +1,102 @@ +/// Numeric conversion utilities matching Python's `numeric_checker`. +/// +/// Tries int, then float (NaN -> 0), otherwise returns original string. + +/// Result of numeric checking - either a parsed number or the original string. 
+#[derive(Debug, Clone, PartialEq)] +pub enum NumericValue { + Int(i64), + Float(f64), + Str(String), +} + +/// Replicates Python's `numeric_checker(string_value)`: +/// - Try int(string_value) -> return int +/// - Try float(string_value) -> if NaN return 0, else return float +/// - Otherwise return string unchanged +pub fn numeric_checker(string_value: &str) -> NumericValue { + // Try parsing as integer first + if let Ok(i) = string_value.parse::<i64>() { + return NumericValue::Int(i); + } + + // Try parsing as float + if let Ok(f) = string_value.parse::<f64>() { + if f.is_nan() { + return NumericValue::Int(0); + } + return NumericValue::Float(f); + } + + NumericValue::Str(string_value.to_string()) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_integer() { + assert_eq!(numeric_checker("23"), NumericValue::Int(23)); + } + + #[test] + fn test_negative_integer() { + assert_eq!(numeric_checker("-5"), NumericValue::Int(-5)); + } + + #[test] + fn test_zero() { + assert_eq!(numeric_checker("0"), NumericValue::Int(0)); + } + + #[test] + fn test_float() { + assert_eq!(numeric_checker("1.25"), NumericValue::Float(1.25)); + } + + #[test] + fn test_negative_float() { + assert_eq!(numeric_checker("-1.2836198"), NumericValue::Float(-1.2836198)); + } + + #[test] + fn test_nan() { + assert_eq!(numeric_checker("NaN"), NumericValue::Int(0)); + } + + #[test] + fn test_nan_lowercase() { + // Python's float("nan") works, Rust's parse also handles various NaN forms + assert_eq!(numeric_checker("nan"), NumericValue::Int(0)); + } + + #[test] + fn test_string() { + assert_eq!( + numeric_checker("hello"), + NumericValue::Str("hello".to_string()) + ); + } + + #[test] + fn test_empty_string() { + assert_eq!(numeric_checker(""), NumericValue::Str("".to_string())); + } + + #[test] + fn test_gps_string() { + assert_eq!( + numeric_checker("-1.2836198 36.8795437 0.0 1044.0"), + NumericValue::Str("-1.2836198 36.8795437 0.0 1044.0".to_string()) + ); + } + + #[test] + fn 
test_uuid_string() { + assert_eq!( + numeric_checker("uuid:729f173c688e482486a48661700455ff"), + NumericValue::Str("uuid:729f173c688e482486a48661700455ff".to_string()) + ); + } +} diff --git a/rust/onadata_xml/src/parser.rs b/rust/onadata_xml/src/parser.rs new file mode 100644 index 0000000000..c2fdeedd64 --- /dev/null +++ b/rust/onadata_xml/src/parser.rs @@ -0,0 +1,1041 @@ +/// Core XML-to-dict parser. +/// +/// Replicates Python's `clean_and_parse_xml`, `_xml_node_to_dict`, +/// `xpath_from_xml_node`, `_get_all_attributes`, UUID/deprecatedID extraction, +/// and submissionDate extraction. +use std::collections::HashSet; + +use quick_xml::events::{BytesCData, BytesStart, BytesText, Event}; +use quick_xml::Reader; + +// --------------------------------------------------------------------------- +// Value enum -- our Rust-side representation of the nested Python dict +// --------------------------------------------------------------------------- + +/// A value in the parsed XML dict tree. +/// Mirrors what the Python code produces: +/// - `Str` for leaf text nodes +/// - `Dict` for element nodes with children (preserves insertion order) +/// - `List` for repeated elements / encrypted media +#[derive(Debug, Clone, PartialEq)] +pub enum Value { + Str(String), + Dict(Vec<(String, Value)>), + List(Vec<Value>), +} + +impl Value { + /// Lookup a key in a Dict value. Returns None for non-Dict variants. 
+ #[allow(dead_code)] + pub fn get(&self, key: &str) -> Option<&Value> { + match self { + Value::Dict(pairs) => pairs.iter().find(|(k, _)| k == key).map(|(_, v)| v), + _ => None, + } + } +} + +// --------------------------------------------------------------------------- +// Attribute triple +// --------------------------------------------------------------------------- + +/// (attr_key, attr_value, element_name) +#[derive(Debug, Clone)] +pub struct XmlAttribute { + pub key: String, + pub value: String, + pub node_name: String, +} + +// --------------------------------------------------------------------------- +// ParseResult -- everything extracted from a single parse pass +// --------------------------------------------------------------------------- + +#[derive(Debug)] +pub struct ParseResult { + /// The nested dict, e.g. {"tutorial": {"name": "Larry", ...}} + pub dict: Option<Value>, + /// Root element name (e.g. "tutorial") + pub root_node_name: String, + /// All XML attributes (respecting entity-skip and first-wins rules) + pub attributes: Vec<(String, String)>, + /// UUID extracted from meta/instanceID (uuid: prefix stripped) + pub uuid: Option<String>, + /// Deprecated UUID from meta/deprecatedID (uuid: prefix stripped) + pub deprecated_uuid: Option<String>, + /// submissionDate attribute from root element + pub submission_date: Option<String>, +} + +// --------------------------------------------------------------------------- +// Internal DOM tree built from quick-xml events +// --------------------------------------------------------------------------- + +/// Minimal DOM node built during SAX-style parsing. +#[derive(Debug, Clone)] +enum DomNode { + Element { + /// Local name (namespace prefix stripped for matching, but kept for + /// nodeName output to match Python minidom behaviour). + name: String, + attrs: Vec<(String, String)>, + children: Vec<DomNode>, + }, + Text(String), + CData(String), +} + +/// Build a minimal DOM tree from cleaned XML bytes. 
+fn build_dom(xml_bytes: &[u8]) -> Result<DomNode, String> { + let mut reader = Reader::from_reader(xml_bytes); + reader.config_mut().trim_text_start = false; + reader.config_mut().trim_text_end = false; + + let mut stack: Vec<DomNode> = Vec::new(); + // Sentinel root + stack.push(DomNode::Element { + name: "#document".to_string(), + attrs: vec![], + children: vec![], + }); + + let mut buf = Vec::new(); + loop { + match reader.read_event_into(&mut buf) { + Ok(Event::Start(ref e)) => { + let name = elem_name(e); + let attrs = elem_attrs(e); + stack.push(DomNode::Element { + name, + attrs, + children: vec![], + }); + } + Ok(Event::Empty(ref e)) => { + // Self-closing element like + let name = elem_name(e); + let attrs = elem_attrs(e); + let node = DomNode::Element { + name, + attrs, + children: vec![], + }; + // Push onto current top + if let Some(DomNode::Element { children, .. }) = stack.last_mut() { + children.push(node); + } + } + Ok(Event::End(ref _e)) => { + let node = stack.pop().ok_or("Unexpected end tag")?; + if let Some(DomNode::Element { children, .. }) = stack.last_mut() { + children.push(node); + } else { + return Err("No parent for end tag".to_string()); + } + } + Ok(Event::Text(ref e)) => { + let text = decode_text(e); + if let Some(DomNode::Element { children, .. }) = stack.last_mut() { + children.push(DomNode::Text(text)); + } + } + Ok(Event::CData(ref e)) => { + let text = decode_cdata(e); + if let Some(DomNode::Element { children, .. 
}) = stack.last_mut() { + children.push(DomNode::CData(text)); + } + } + Ok(Event::Decl(_)) | Ok(Event::PI(_)) | Ok(Event::Comment(_)) => {} + Ok(Event::DocType(_)) => {} + Ok(Event::Eof) => break, + Err(e) => return Err(format!("XML parse error: {e}")), + } + buf.clear(); + } + + // stack should have only the sentinel #document + if stack.len() != 1 { + return Err("Malformed XML: unclosed elements".to_string()); + } + Ok(stack.pop().unwrap()) +} + +fn elem_name(e: &BytesStart) -> String { + String::from_utf8_lossy(e.name().as_ref()).to_string() +} + +fn elem_attrs(e: &BytesStart) -> Vec<(String, String)> { + e.attributes() + .filter_map(|a| a.ok()) + .map(|a| { + let key = String::from_utf8_lossy(a.key.as_ref()).to_string(); + let val = String::from_utf8_lossy(&a.value).to_string(); + (key, val) + }) + .collect() +} + +fn decode_text(e: &BytesText) -> String { + // Unescape XML entities + match e.unescape() { + Ok(s) => s.to_string(), + Err(_) => String::from_utf8_lossy(e.as_ref()).to_string(), + } +} + +fn decode_cdata(e: &BytesCData) -> String { + String::from_utf8_lossy(e.as_ref()).to_string() +} + +// --------------------------------------------------------------------------- +// Clean XML (matching Python's clean_and_parse_xml) +// --------------------------------------------------------------------------- + +/// Strips whitespace, removes whitespace between XML tags. 
+/// Matches: `re.sub(r">\s+<", "><", smart_str(xml_string.strip()))` +pub fn clean_xml(xml_str: &str) -> String { + let trimmed = xml_str.trim(); + // Remove whitespace between tags + let mut result = String::with_capacity(trimmed.len()); + let mut chars = trimmed.chars().peekable(); + while let Some(c) = chars.next() { + if c == '>' { + result.push(c); + // Consume any whitespace that is immediately followed by '<' + let mut ws_buf = String::new(); + while let Some(&next) = chars.peek() { + if next.is_whitespace() { + ws_buf.push(next); + chars.next(); + } else { + break; + } + } + // If next char is '<', drop the whitespace; otherwise keep it + if let Some(&'<') = chars.peek() { + // drop ws_buf + } else { + result.push_str(&ws_buf); + } + } else { + result.push(c); + } + } + result +} + +// --------------------------------------------------------------------------- +// xpath computation (matching Python's xpath_from_xml_node) +// --------------------------------------------------------------------------- + +/// Compute the xpath for a node given the path of ancestor names. +/// Python's `xpath_from_xml_node` walks parent chain, collects names, +/// then returns "/".join(names[1:]) -- skipping the document node AND +/// the root element node (since _gather_parent_node_list skips when +/// parentNode.parentNode is None, i.e. the document's child = root element). 
+/// +/// Actually, re-reading the Python code more carefully: +/// ```python +/// def _gather_parent_node_list(node): +/// node_names = [] +/// if node.parentNode and node.parentNode.parentNode: +/// node_names.extend(_gather_parent_node_list(node.parentNode)) +/// node_names.extend([node.nodeName]) +/// return node_names +/// ``` +/// +/// For a node at path document -> root -> child -> grandchild: +/// - grandchild: parent=child, parent.parent=root (exists) -> recurse to child +/// - child: parent=root, parent.parent=document (exists) -> recurse to root +/// - root: parent=document, parent.parent=None -> STOP, return ["root"] +/// - child returns ["root", "child"] +/// - grandchild returns ["root", "child", "grandchild"] +/// Then xpath_from_xml_node returns "/".join(names[1:]) = "child/grandchild" +/// +/// So the xpath skips the root element name and gives the path from root's children down. +/// +/// We pass `ancestor_names` which is the list of element names from root down (not including +/// the document node). For a child at depth 2 under root: +/// ancestor_names = ["root", "child"] and current node name is the node itself. +/// The xpath = "/".join(ancestor_names[1:] + [node_name])... wait, let me re-check. +/// +/// Actually, ancestor_names in our traversal doesn't include the current node. +/// So for grandchild: ancestor_names = ["root", "child"], node_name = "grandchild" +/// full path = ["root", "child", "grandchild"], xpath = "child/grandchild" +/// +/// This matches: skip first element (root), join the rest. +pub fn compute_xpath(ancestor_names: &[String], node_name: &str) -> String { + // ancestor_names[0] is the root element name. + // We want: ancestor_names[1..] 
joined with "/" then "/" then node_name + let mut parts: Vec<&str> = ancestor_names.iter().skip(1).map(|s| s.as_str()).collect(); + parts.push(node_name); + parts.join("/") +} + +// --------------------------------------------------------------------------- +// _xml_node_to_dict equivalent +// --------------------------------------------------------------------------- + +/// Convert a DomNode (element) into our Value tree. +/// `repeats` is the set of xpaths that should be treated as list-type. +/// `encrypted` when true forces "media" child elements to be list-type. +/// `ancestor_names` tracks the path for xpath computation. +fn node_to_dict( + node: &DomNode, + repeats: &HashSet<String>, + encrypted: bool, + ancestor_names: &[String], +) -> Option<Value> { + match node { + DomNode::Text(_) | DomNode::CData(_) => { + // Leaf nodes handled by parent + None + } + DomNode::Element { + name, + children, + .. + } => { + // If node has 0 children -> None + if children.is_empty() { + return None; + } + + // If node has 1 child that is Text -> {nodeName: textValue} + if children.len() == 1 { + match &children[0] { + DomNode::Text(text) => { + return Some(Value::Dict(vec![(name.clone(), Value::Str(text.clone()))])); + } + DomNode::CData(text) => { + // CDATA section -> {parentNodeName: cdataValue} + return Some(Value::Dict(vec![(name.clone(), Value::Str(text.clone()))])); + } + _ => {} + } + } + + // Check for CDATA among children (Python checks this in the loop) + for child in children { + if let DomNode::CData(text) = child { + return Some(Value::Dict(vec![(name.clone(), Value::Str(text.clone()))])); + } + } + + // This is an internal node - iterate children + let mut value: Vec<(String, Value)> = Vec::new(); + let mut current_path = ancestor_names.to_vec(); + current_path.push(name.clone()); + + for child in children { + match child { + DomNode::Text(_) => { + // Text nodes among element siblings are ignored + // (Python: the loop only processes element children + // via 
_xml_node_to_dict which returns None for text) + continue; + } + DomNode::CData(text) => { + // CDATA found during iteration (Python line 200-201) + return Some(Value::Dict(vec![(name.clone(), Value::Str(text.clone()))])); + } + DomNode::Element { + name: child_name, .. + } => { + let child_dict = + node_to_dict(child, repeats, encrypted, &current_path); + + if child_dict.is_none() { + continue; + } + + let child_dict = child_dict.unwrap(); + + // Extract the child's value from the wrapper dict + let child_value = match &child_dict { + Value::Dict(pairs) => { + if pairs.len() == 1 && pairs[0].0 == *child_name { + pairs[0].1.clone() + } else { + // This shouldn't happen per Python assertion + child_dict.clone() + } + } + _ => child_dict.clone(), + }; + + let child_xpath = compute_xpath(&current_path, child_name); + + let is_list_type = repeats.contains(&child_xpath) + || (encrypted && child_name == "media"); + + // Find if child_name already exists in value + let existing_idx = + value.iter().position(|(k, _)| k == child_name); + + if is_list_type { + // List type: always append to list + if let Some(idx) = existing_idx { + match &mut value[idx].1 { + Value::List(list) => { + list.push(child_value); + } + _ => { + // Shouldn't happen since we always init as list + } + } + } else { + value.push(( + child_name.clone(), + Value::List(vec![child_value]), + )); + } + } else { + // Dict type + if let Some(idx) = existing_idx { + // Node is repeated, aggregate + let existing = &mut value[idx].1; + match existing { + Value::List(list) => { + // Already a list, just append + list.push(child_value); + } + _ => { + // Convert to list + let prev = existing.clone(); + *existing = Value::List(vec![prev, child_value]); + } + } + } else { + value.push((child_name.clone(), child_value)); + } + } + } + } + } + + if value.is_empty() { + return None; + } + + Some(Value::Dict(vec![(name.clone(), Value::Dict(value))])) + } + } +} + +// 
--------------------------------------------------------------------------- +// Attribute collection (matching Python's _get_all_attributes + _set_attributes) +// --------------------------------------------------------------------------- + +/// Recursively collect all attributes from an element tree. +fn collect_attributes(node: &DomNode, out: &mut Vec<XmlAttribute>) { + if let DomNode::Element { + name, + attrs, + children, + } = node + { + for (key, val) in attrs { + out.push(XmlAttribute { + key: key.clone(), + value: val.clone(), + node_name: name.clone(), + }); + } + for child in children { + collect_attributes(child, out); + } + } +} + +/// Apply Python's _set_attributes logic: skip entity nodes, first-wins for duplicates. +fn build_attributes(raw: &[XmlAttribute]) -> Vec<(String, String)> { + let mut result: Vec<(String, String)> = Vec::new(); + let mut seen: HashSet<String> = HashSet::new(); + for attr in raw { + if attr.node_name == "entity" { + continue; + } + if seen.contains(&attr.key) { + // Duplicate - skip (first wins) + continue; + } + seen.insert(attr.key.clone()); + result.push((attr.key.clone(), attr.value.clone())); + } + result +} + +// --------------------------------------------------------------------------- +// UUID extraction +// --------------------------------------------------------------------------- + +/// Extract UUID from meta/instanceID or orx:meta/orx:instanceID. +/// Also checks root element's instanceID attribute. +fn extract_uuid(root: &DomNode, attributes: &[(String, String)]) -> Option<String> { + // First try meta/instanceID in the XML tree + if let Some(uuid) = extract_meta_value(root, "instanceID") { + return strip_uuid_prefix(&uuid); + } + + // Then check root element's instanceID attribute + for (key, value) in attributes { + if key == "instanceID" { + return strip_uuid_prefix(value); + } + } + + None +} + +/// Extract deprecated UUID from meta/deprecatedID or orx:meta/orx:deprecatedID. 
+fn extract_deprecated_uuid(root: &DomNode) -> Option<String> { + if let Some(uuid) = extract_meta_value(root, "deprecatedID") { + return strip_uuid_prefix(&uuid); + } + None +} + +/// Extract a value from meta/<tag_name> or orx:meta/orx:<tag_name>. +fn extract_meta_value(root: &DomNode, tag_name: &str) -> Option<String> { + if let DomNode::Element { children, .. } = root { + for child in children { + if let DomNode::Element { + name, children: meta_children, .. + } = child + { + let name_lower = name.to_lowercase(); + if name_lower == "meta" || name_lower == "orx:meta" { + for meta_child in meta_children { + if let DomNode::Element { + name: child_name, + children: value_children, + .. + } = meta_child + { + let child_name_lower = child_name.to_lowercase(); + if child_name_lower == tag_name.to_lowercase() + || child_name_lower + == format!("orx:{}", tag_name.to_lowercase()) + { + // Get text content + if let Some(text) = get_text_content(value_children) { + return Some(text.trim().to_string()); + } + } + } + } + } + } + } + } + None +} + +/// Get the text content of a node's children. +fn get_text_content(children: &[DomNode]) -> Option<String> { + for child in children { + match child { + DomNode::Text(text) => return Some(text.clone()), + DomNode::CData(text) => return Some(text.clone()), + _ => {} + } + } + None +} + +/// Strip "uuid:" prefix from a UUID string. +fn strip_uuid_prefix(s: &str) -> Option<String> { + if let Some(rest) = s.strip_prefix("uuid:") { + if rest.is_empty() { + None + } else { + Some(rest.to_string()) + } + } else if !s.is_empty() { + // Return as-is if no uuid: prefix but non-empty + Some(s.to_string()) + } else { + None + } +} + +/// Extract submissionDate from root element's attributes. 
+fn extract_submission_date(attributes: &[(String, String)]) -> Option<String> { + for (key, value) in attributes { + if key == "submissionDate" { + if !value.is_empty() { + return Some(value.clone()); + } + } + } + None +} + +// --------------------------------------------------------------------------- +// Public parse entry point +// --------------------------------------------------------------------------- + +/// Parse an XML submission string. +/// +/// This is the main entry point that performs: +/// 1. Clean XML (strip whitespace between tags) +/// 2. Build DOM tree +/// 3. Convert root element to nested dict (skipping #document wrapper) +/// 4. Collect attributes (entity-skip, first-wins) +/// 5. Extract UUID, deprecated UUID, submission date +pub fn parse_xml( + xml_str: &str, + repeat_xpaths: &[String], + encrypted: bool, +) -> Result<ParseResult, String> { + let cleaned = clean_xml(xml_str); + let dom = build_dom(cleaned.as_bytes())?; + + // Get the root element (first element child of #document) + let root_element = match &dom { + DomNode::Element { children, .. } => children + .iter() + .find(|c| matches!(c, DomNode::Element { .. })) + .ok_or("No root element found")?, + _ => return Err("Expected document node".to_string()), + }; + + let root_name = match root_element { + DomNode::Element { name, .. 
} => name.clone(), + _ => unreachable!(), + }; + + // Build repeat xpath set + let repeats: HashSet<String> = repeat_xpaths.iter().cloned().collect(); + + // Convert root element to dict + // ancestor_names is empty because we start at root (no ancestors above it) + let dict = node_to_dict(root_element, &repeats, encrypted, &[]); + + // Collect attributes from root element (not #document) + let mut raw_attrs = Vec::new(); + collect_attributes(root_element, &mut raw_attrs); + let attributes = build_attributes(&raw_attrs); + + // Extract UUID and deprecated UUID + let uuid = extract_uuid(root_element, &attributes); + let deprecated_uuid = extract_deprecated_uuid(root_element); + + // Extract submission date from root attributes + let submission_date = extract_submission_date(&attributes); + + Ok(ParseResult { + dict, + root_node_name: root_name, + attributes, + uuid, + deprecated_uuid, + submission_date, + }) +} + +// --------------------------------------------------------------------------- +// Tests +// --------------------------------------------------------------------------- + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_clean_xml() { + let input = " \n text \n "; + let cleaned = clean_xml(input); + assert_eq!( + cleaned, + "text" + ); + } + + #[test] + fn test_clean_xml_preserves_inner_text() { + let input = "Larry\n Again\n "; + let cleaned = clean_xml(input); + // Text inside a single element should be preserved + assert_eq!(cleaned, "Larry\n Again\n "); + } + + #[test] + fn test_simple_form() { + let xml = r#" + Larry + Again + + 23 + 1333604907194.jpg + 0 + -1.2836198 36.8795437 0.0 1044.0 + firefox chrome safari + + uuid:729f173c688e482486a48661700455ff + +"#; + + let result = parse_xml(xml, &[], false).unwrap(); + + assert_eq!(result.root_node_name, "tutorial"); + assert_eq!( + result.uuid, + Some("729f173c688e482486a48661700455ff".to_string()) + ); + assert_eq!(result.deprecated_uuid, None); + assert_eq!(result.submission_date, None); + 
+ // 
Check attributes + assert_eq!(result.attributes, vec![("id".to_string(), "tutorial".to_string())]); + + // Check dict structure + let dict = result.dict.unwrap(); + match &dict { + Value::Dict(pairs) => { + assert_eq!(pairs.len(), 1); + assert_eq!(pairs[0].0, "tutorial"); + match &pairs[0].1 { + Value::Dict(inner) => { + // Check name preserves whitespace + let name_val = inner.iter().find(|(k, _)| k == "name").unwrap(); + match &name_val.1 { + Value::Str(s) => { + assert_eq!(s, "Larry\n Again\n "); + } + _ => panic!("Expected Str for name"), + } + + // Check age + let age_val = inner.iter().find(|(k, _)| k == "age").unwrap(); + assert_eq!(age_val.1, Value::Str("23".to_string())); + + // Check meta/instanceID + let meta_val = inner.iter().find(|(k, _)| k == "meta").unwrap(); + match &meta_val.1 { + Value::Dict(meta_inner) => { + assert_eq!(meta_inner.len(), 1); + assert_eq!(meta_inner[0].0, "instanceID"); + assert_eq!( + meta_inner[0].1, + Value::Str( + "uuid:729f173c688e482486a48661700455ff".to_string() + ) + ); + } + _ => panic!("Expected Dict for meta"), + } + } + _ => panic!("Expected Dict for tutorial"), + } + } + _ => panic!("Expected Dict"), + } + } + + #[test] + fn test_nested_repeats() { + let xml = r#" + 80Adam + 50Abel1 + chrome ie + -1.2627557 36.7926442 0.0 30.0 +"#; + + let repeats = vec!["kids/kids_details".to_string()]; + let result = parse_xml(xml, &repeats, false).unwrap(); + + let dict = result.dict.unwrap(); + // dict = {"new_repeats": {"info": ..., "kids": ..., ...}} + match &dict { + Value::Dict(pairs) => { + assert_eq!(pairs[0].0, "new_repeats"); + let inner = match &pairs[0].1 { + Value::Dict(d) => d, + _ => panic!("Expected Dict"), + }; + + // Check kids/kids_details is a list + let kids = inner.iter().find(|(k, _)| k == "kids").unwrap(); + match &kids.1 { + Value::Dict(kids_inner) => { + let kids_details = + kids_inner.iter().find(|(k, _)| k == "kids_details").unwrap(); + match &kids_details.1 { + Value::List(list) => { + 
assert_eq!(list.len(), 1); + // The single item should be a dict with kids_age and kids_name + match &list[0] { + Value::Dict(d) => { + assert!(d.iter().any(|(k, _)| k == "kids_age")); + assert!(d.iter().any(|(k, _)| k == "kids_name")); + } + _ => panic!("Expected Dict in list"), + } + } + _ => panic!("Expected List for kids_details"), + } + } + _ => panic!("Expected Dict for kids"), + } + } + _ => panic!("Expected Dict"), + } + } + + #[test] + fn test_encrypted_media() { + let xml = r#"ZJTcuuid:f8971231-f3b8-4b2b-8c35-d95fa207d937 +1483528430996.jpg.enc +1483528445767.jpg.enc +submission.xml.encUUR8"#; + + let result = parse_xml(xml, &[], true).unwrap(); + + assert_eq!( + result.uuid, + Some("f8971231-f3b8-4b2b-8c35-d95fa207d937".to_string()) + ); + + let dict = result.dict.unwrap(); + match &dict { + Value::Dict(pairs) => { + assert_eq!(pairs[0].0, "data"); + let inner = match &pairs[0].1 { + Value::Dict(d) => d, + _ => panic!("Expected Dict"), + }; + + // media should be a list with 2 items + let media = inner.iter().find(|(k, _)| k == "media").unwrap(); + match &media.1 { + Value::List(list) => { + assert_eq!(list.len(), 2); + // First item + match &list[0] { + Value::Dict(d) => { + assert_eq!(d[0].0, "file"); + assert_eq!( + d[0].1, + Value::Str("1483528430996.jpg.enc".to_string()) + ); + } + _ => panic!("Expected Dict in media list"), + } + // Second item + match &list[1] { + Value::Dict(d) => { + assert_eq!(d[0].0, "file"); + assert_eq!( + d[0].1, + Value::Str("1483528445767.jpg.enc".to_string()) + ); + } + _ => panic!("Expected Dict in media list"), + } + } + _ => panic!("Expected List for media"), + } + } + _ => panic!("Expected Dict"), + } + } + + #[test] + fn test_repeated_nodes_auto_list() { + // S2A appears 3 times without being in repeat_xpaths. + // Python auto-converts to list on second occurrence. 
+ let xml = r#" +11.25 +11.25 +12test8test25test +"#; + + let result = parse_xml(xml, &[], false).unwrap(); + let dict = result.dict.unwrap(); + + match &dict { + Value::Dict(pairs) => { + assert_eq!(pairs[0].0, "RW_OUNIS_2016"); + let inner = match &pairs[0].1 { + Value::Dict(d) => d, + _ => panic!("Expected Dict"), + }; + + // S2A should be a list of 3 dicts + let s2a = inner.iter().find(|(k, _)| k == "S2A").unwrap(); + match &s2a.1 { + Value::List(list) => { + assert_eq!(list.len(), 3); + + // First S2A: {S2_1_3_2_2: "1", S2_1_3_2_3: "1.25"} + // (S2A_note is empty/self-closing, so skipped) + match &list[0] { + Value::Dict(d) => { + assert!(d.iter().any(|(k, v)| k == "S2_1_3_2_2" + && *v == Value::Str("1".to_string()))); + assert!(d.iter().any(|(k, v)| k == "S2_1_3_2_3" + && *v == Value::Str("1.25".to_string()))); + } + _ => panic!("Expected Dict in S2A list"), + } + + // Third S2A has nested S2_1_3_5_3 with S3B repeats + match &list[2] { + Value::Dict(d) => { + let s2_1_3_5_3 = + d.iter().find(|(k, _)| k == "S2_1_3_5_3").unwrap(); + match &s2_1_3_5_3.1 { + Value::Dict(inner_d) => { + let s3b = + inner_d.iter().find(|(k, _)| k == "S3B").unwrap(); + match &s3b.1 { + Value::List(s3b_list) => { + assert_eq!(s3b_list.len(), 3); + // First S3B has S3_1_3_4 appearing twice -> list ["2", "test"] + match &s3b_list[0] { + Value::Dict(d) => { + let field = d + .iter() + .find(|(k, _)| k == "S3_1_3_4") + .unwrap(); + match &field.1 { + Value::List(vals) => { + assert_eq!(vals.len(), 2); + assert_eq!( + vals[0], + Value::Str("2".to_string()) + ); + assert_eq!( + vals[1], + Value::Str( + "test".to_string() + ) + ); + } + _ => panic!( + "Expected List for S3_1_3_4" + ), + } + } + _ => panic!("Expected Dict in S3B list"), + } + } + _ => panic!("Expected List for S3B"), + } + } + _ => panic!("Expected Dict for S2_1_3_5_3"), + } + } + _ => panic!("Expected Dict in S2A list"), + } + } + _ => panic!("Expected List for S2A, got {:?}", s2a.1), + } + } + _ => panic!("Expected Dict"), 
+ } + } + + #[test] + fn test_self_closing_tag_skipped() { + let xml = "test"; + let result = parse_xml(xml, &[], false).unwrap(); + let dict = result.dict.unwrap(); + match &dict { + Value::Dict(pairs) => { + let inner = match &pairs[0].1 { + Value::Dict(d) => d, + _ => panic!("Expected Dict"), + }; + // note should be skipped + assert!(!inner.iter().any(|(k, _)| k == "note")); + // name should be present + assert!(inner.iter().any(|(k, _)| k == "name")); + } + _ => panic!("Expected Dict"), + } + } + + #[test] + fn test_entity_attributes_skipped() { + let xml = r#"test"#; + let result = parse_xml(xml, &[], false).unwrap(); + // "id" from data should be present, but "id" and "dataset" from entity should be skipped + assert_eq!( + result.attributes, + vec![("id".to_string(), "form1".to_string())] + ); + } + + #[test] + fn test_submission_date_extraction() { + let xml = r#"test"#; + let result = parse_xml(xml, &[], false).unwrap(); + assert_eq!( + result.submission_date, + Some("2023-01-15T10:30:00.000Z".to_string()) + ); + } + + #[test] + fn test_deprecated_uuid() { + let xml = r#"uuid:new-uuiduuid:old-uuidtest"#; + let result = parse_xml(xml, &[], false).unwrap(); + assert_eq!(result.uuid, Some("new-uuid".to_string())); + assert_eq!(result.deprecated_uuid, Some("old-uuid".to_string())); + } + + #[test] + fn test_orx_namespace_uuid() { + let xml = r#"uuid:f8971231-f3b8-4b2b-8c35-d95fa207d937test"#; + let result = parse_xml(xml, &[], false).unwrap(); + assert_eq!( + result.uuid, + Some("f8971231-f3b8-4b2b-8c35-d95fa207d937".to_string()) + ); + } + + #[test] + fn test_empty_root() { + let xml = ""; + let result = parse_xml(xml, &[], false).unwrap(); + assert!(result.dict.is_none()); + assert_eq!(result.root_node_name, "root"); + } + + #[test] + fn test_xpath_computation() { + // For a child "age" under root "tutorial", xpath should be "age" + assert_eq!(compute_xpath(&["tutorial".to_string()], "age"), "age"); + + // For grandchild "instanceID" under root "tutorial" > 
"meta" + assert_eq!( + compute_xpath( + &["tutorial".to_string(), "meta".to_string()], + "instanceID" + ), + "meta/instanceID" + ); + + // For deeply nested + assert_eq!( + compute_xpath( + &["root".to_string(), "a".to_string(), "b".to_string()], + "c" + ), + "a/b/c" + ); + } + + #[test] + fn test_xmlns_attributes_included() { + // xmlns attributes should be included (they are regular attributes to quick-xml) + let xml = r#"v"#; + let result = parse_xml(xml, &[], false).unwrap(); + // Should have both 'id' and 'xmlns' + assert!(result.attributes.iter().any(|(k, _)| k == "id")); + assert!(result.attributes.iter().any(|(k, _)| k == "xmlns")); + } +}