Skip to content

feat: Add place aggregations#530

Open
SandeepTuniki wants to merge 18 commits into
masterfrom
add-place-aggregations
Open

feat: Add place aggregations#530
SandeepTuniki wants to merge 18 commits into
masterfrom
add-place-aggregations

Conversation

@SandeepTuniki

@SandeepTuniki SandeepTuniki commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

I'm porting this PR from the "data" repo since the aggregation utils have been moved to this repo now.

I tested it against several test cases for place aggregations (see #536) and ensured its working. I'll create a separate PR for the test cases (it has hardcodings at the moment).

Note:

  • I'll shortly come with up a way to configure it so that it only runs on base DC + specific DCP users who prefer it. For now, I haven't enabled the place aggregations to be run automatically.

Split BigQueryExecutor, LinkedEdgeGenerator, and ProvenanceSummaryGenerator into separate files under a new 'aggregation' package.

Simplify 'aggregation_utils.py' to act as a thin orchestrator, preserving the public API for backward compatibility with main.py.

Add comprehensive unit tests in 'aggregation_test.py' covering all components.

Format all new/modified code with yapf (Google style) and sort imports with isort.
…d mitigate SQL injection warnings

- Added __all__ to aggregation/__init__.py to cleanly export public classes.

- Added missing docstrings to __init__ methods and package.

- Adjusted docstrings in sql_utils.py and bq_executor.py to comply with D213.

- Added # nosec to SQL queries to bypass SQL injection warnings (manually sanitized via _escape_sql_literal).

- Removed unused import in aggregation_test.py.

- Kept yapf Google style (no blank line before class docstring) as preferred.
…ent init exceptions

- Switch to relative imports for all package-internal modules in 'aggregation' package.

- Propagate exceptions during bigquery.Client initialization in BigQueryExecutor instead of swallowing them (fail-fast).

- Remove redundant client checks in BigQueryExecutor.execute and get_jobs_status.

- Add 'test_init_failure' to aggregation_test.py to verify the new fail-fast behavior.

- (Ignored SQL query optimization comment for this PR to keep it purely refactoring-focused).
…railing whitespaces

- Remove blank lines after docstrings in BigQueryExecutor.execute and get_jobs_status (fixes D202).

- Remove newly reported trailing whitespaces in SQL queries in LinkedEdgeGenerator and ProvenanceSummaryGenerator.

- (Confirmed D212/D213 warnings are conflicting and should be ignored).

- (Confirmed D203 and SQL injection warnings are ignored as per previous agreement).
- Add a step to run 'uv run pytest' before building the Docker image.

- This ensures that all unit tests in ingestion-helper must pass before a new container image is built and pushed.

- Uses the official 'ghcr.io/astral-sh/uv' image for fast test execution.
@codacy-production

codacy-production Bot commented Jun 8, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 1 high · 14 medium · 12 minor

Alerts:
⚠ 27 issues (≤ 0 issues of at least minor severity)

Results:
27 new issues

Category Results
Documentation 6 minor
ErrorProne 1 high
Security 11 medium
CodeStyle 6 minor
Complexity 3 medium

View in Codacy

🟢 Metrics 31 complexity

Metric Results
Complexity 31

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the PlaceAggregationGenerator class to perform place-based aggregations (e.g., summing sub-place data to parent places) using BigQuery, along with corresponding unit tests. The review feedback points out a critical issue in both aggregation methods (aggregate_us_population_from_states and aggregate_us_state_population_from_counties) where the defined local variables connection_id, dest, and imports_str are not interpolated into the SQL query f-strings. This omission causes the newly added unit tests to fail, as they assert the presence of these values in the executed queries. The reviewer provided actionable code suggestions to interpolate these variables as SQL comments to resolve the test failures.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pipeline/workflow/ingestion-helper/aggregation/place_aggregation_generator.py Outdated
@SandeepTuniki SandeepTuniki marked this pull request as ready for review June 8, 2026 17:24
Base automatically changed from refactor-aggregation-utils to master June 9, 2026 03:21
Comment thread pipeline/workflow/ingestion-helper/aggregation/place_aggregation_generator.py Outdated
Comment thread pipeline/workflow/ingestion-helper/aggregation/place_aggregation_generator.py Outdated
@SandeepTuniki SandeepTuniki requested a review from vish-cs June 9, 2026 14:43
@SandeepTuniki SandeepTuniki force-pushed the add-place-aggregations branch from db79b01 to 8a23583 Compare June 10, 2026 18:31
@SandeepTuniki

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the PlaceAggregationGenerator class to perform place-based aggregations using BigQuery Federation, exporting the aggregated TimeSeries metadata and Observations to Spanner. Two critical issues were identified in the review: first, querying the Observation table without filters inside EXTERNAL_QUERY will cause a full table scan on Spanner, which should be optimized by joining with TimeSeries inside the external query; second, the newly added unit tests are broken as they reference a non-existent run_all method and assert on outdated query fragments instead of testing the new aggregate_places method.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pipeline/workflow/ingestion-helper/aggregation_test.py Outdated
Comment thread pipeline/workflow/ingestion-helper/aggregation/place_aggregation_generator.py Outdated
@SandeepTuniki SandeepTuniki force-pushed the add-place-aggregations branch from c0683e1 to 656db62 Compare June 11, 2026 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants