feat: Add place aggregations#530
Conversation
Split BigQueryExecutor, LinkedEdgeGenerator, and ProvenanceSummaryGenerator into separate files under a new 'aggregation' package. Simplify 'aggregation_utils.py' to act as a thin orchestrator, preserving the public API for backward compatibility with main.py. Add comprehensive unit tests in 'aggregation_test.py' covering all components. Format all new/modified code with yapf (Google style) and sort imports with isort.
…d mitigate SQL injection warnings - Added __all__ to aggregation/__init__.py to cleanly export public classes. - Added missing docstrings to __init__ methods and package. - Adjusted docstrings in sql_utils.py and bq_executor.py to comply with D213. - Added # nosec to SQL queries to bypass SQL injection warnings (manually sanitized via _escape_sql_literal). - Removed unused import in aggregation_test.py. - Kept yapf Google style (no blank line before class docstring) as preferred.
…ent init exceptions - Switch to relative imports for all package-internal modules in 'aggregation' package. - Propagate exceptions during bigquery.Client initialization in BigQueryExecutor instead of swallowing them (fail-fast). - Remove redundant client checks in BigQueryExecutor.execute and get_jobs_status. - Add 'test_init_failure' to aggregation_test.py to verify the new fail-fast behavior. - (Ignored SQL query optimization comment for this PR to keep it purely refactoring-focused).
…railing whitespaces - Remove blank lines after docstrings in BigQueryExecutor.execute and get_jobs_status (fixes D202). - Remove newly reported trailing whitespaces in SQL queries in LinkedEdgeGenerator and ProvenanceSummaryGenerator. - (Confirmed D212/D213 warnings are conflicting and should be ignored). - (Confirmed D203 and SQL injection warnings are ignored as per previous agreement).
- Add a step to run 'uv run pytest' before building the Docker image. - This ensures that all unit tests in ingestion-helper must pass before a new container image is built and pushed. - Uses the official 'ghcr.io/astral-sh/uv' image for fast test execution.
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| Documentation | 6 minor |
| ErrorProne | 1 high |
| Security | 11 medium |
| CodeStyle | 6 minor |
| Complexity | 3 medium |
🟢 Metrics 31 complexity
Metric Results Complexity 31
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request introduces the PlaceAggregationGenerator class to perform place-based aggregations (e.g., summing sub-place data to parent places) using BigQuery, along with corresponding unit tests. The review feedback points out a critical issue in both aggregation methods (aggregate_us_population_from_states and aggregate_us_state_population_from_counties) where the defined local variables connection_id, dest, and imports_str are not interpolated into the SQL query f-strings. This omission causes the newly added unit tests to fail, as they assert the presence of these values in the executed queries. The reviewer provided actionable code suggestions to interpolate these variables as SQL comments to resolve the test failures.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…st and clean up comments
…regation with allow_multiple_to_places
db79b01 to
8a23583
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the PlaceAggregationGenerator class to perform place-based aggregations using BigQuery Federation, exporting the aggregated TimeSeries metadata and Observations to Spanner. Two critical issues were identified in the review: first, querying the Observation table without filters inside EXTERNAL_QUERY will cause a full table scan on Spanner, which should be optimized by joining with TimeSeries inside the external query; second, the newly added unit tests are broken as they reference a non-existent run_all method and assert on outdated query fragments instead of testing the new aggregate_places method.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
c0683e1 to
656db62
Compare
…multiline for readability
I'm porting this PR from the "data" repo since the aggregation utils have been moved to this repo now.
I tested it against several test cases for place aggregations (see #536) and ensured its working. I'll create a separate PR for the test cases (it has hardcodings at the moment).
Note: