feat(ingestion): expose Dataflow IP configuration for restricted VPC environments #121
Code Review
This pull request adds configuration options for Dataflow worker network settings (dataflow_ip_configuration and dataflow_subnetwork) to the ingestion workflow module, allowing support for private IP setups. The review feedback recommends adding validation to ensure the subnetwork format is correct and enforcing that a subnetwork is specified when private IPs are enabled.
variable "dataflow_subnetwork" {
  type        = string
  description = <<-EOT
    Subnetwork for Dataflow workers. Required when dataflow_ip_configuration
    is WORKER_IP_PRIVATE. Format: regions/{region}/subnetworks/{subnetwork}.
  EOT
  default     = ""
}
To prevent runtime failures during Dataflow job execution, we should validate the format of dataflow_subnetwork when provided, and ensure it is not left empty when dataflow_ip_configuration is set to WORKER_IP_PRIVATE.
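Using a validation block ensures the subnetwork matches the expected GCP format (either a short path or a full self-link), and a check block flags the cross-variable dependency during terraform plan/apply. Note that check blocks require Terraform 1.5 or later and report a failed assertion as a warning rather than halting the operation.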
variable "dataflow_subnetwork" {
  type        = string
  description = <<-EOT
    Subnetwork for Dataflow workers. Required when dataflow_ip_configuration
    is WORKER_IP_PRIVATE. Format: regions/{region}/subnetworks/{subnetwork}.
  EOT
  default     = ""

  validation {
    condition     = var.dataflow_subnetwork == "" || can(regex("regions/[a-zA-Z0-9-]+/subnetworks/[a-zA-Z0-9-]+$", var.dataflow_subnetwork))
    error_message = "The dataflow_subnetwork must be in the format 'regions/{region}/subnetworks/{subnetwork}' or a full self-link ending with that format."
  }
}

check "dataflow_subnetwork_presence" {
  assert {
    condition     = var.dataflow_ip_configuration != "WORKER_IP_PRIVATE" || var.dataflow_subnetwork != ""
    error_message = "dataflow_subnetwork must be specified when dataflow_ip_configuration is WORKER_IP_PRIVATE."
  }
}
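If a missing subnetwork should stop the plan outright instead of being reported by the check block, one alternative is to move the cross-variable condition into a validation block. This is only a sketch and assumes Terraform 1.9 or later, where validation conditions may reference other input variables. The variable names come from the suggestion above; the trimmed declarations are for illustration only.

```hcl
variable "dataflow_ip_configuration" {
  type    = string
  default = "WORKER_IP_UNSPECIFIED"
}

variable "dataflow_subnetwork" {
  type    = string
  default = ""

  # Assumption: Terraform >= 1.9, which allows a validation condition
  # to reference another input variable.
  validation {
    condition     = var.dataflow_ip_configuration != "WORKER_IP_PRIVATE" || var.dataflow_subnetwork != ""
    error_message = "dataflow_subnetwork must be specified when dataflow_ip_configuration is WORKER_IP_PRIVATE."
  }
}
```

On older Terraform versions this condition is rejected, so the check block remains the portable option.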
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
…environments
Add dataflow_ip_configuration and dataflow_subnetwork variables to the
ingestion/workflow module, threaded up through the stack module via the
ingestion_config object.
Both variables default to their current implicit values (WORKER_IP_UNSPECIFIED
and empty string), so existing deployments are unaffected.
Operators in environments where compute.vmExternalIpAccess org policy blocks
VMs from obtaining external IPs can now set:
dataflow_ip_configuration = "WORKER_IP_PRIVATE"
dataflow_subnetwork = "regions/{region}/subnetworks/{subnetwork}"
Prerequisites on the operator side: Private Google Access on the target
subnet and a Cloud NAT gateway for outbound connectivity.
Validated on a GCP project with a deny-all vmExternalIpAccess org policy.
Ingestion of 66 SDMX/CSV files into Spanner completed successfully.
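The commit message above says the two settings travel through the ingestion_config object and keep their implicit defaults. A minimal sketch of how such an object type could be declared follows. It assumes Terraform 1.3 or later for optional attributes with defaults, and it omits the existing fields, which this thread does not show.

```hcl
# Hypothetical declaration in the stack module. Only the two new
# attributes are listed; a real declaration must also list every
# existing ingestion_config field, since unlisted attributes are dropped.
variable "ingestion_config" {
  type = object({
    dataflow_ip_configuration = optional(string, "WORKER_IP_UNSPECIFIED")
    dataflow_subnetwork       = optional(string, "")
  })
  default = {}
}
```

With this shape, callers that do not set either attribute get the same values as before the change.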
25a8b1c to 305eed5
Hi @carlosinfantes, thank you for the submission! Expect initial feedback on our review in one business day.
  description = "Path where pre-processed files are placed for the next stage"
}

variable "dataflow_ip_configuration" {
Thanks for the PR!
You may want to declare this at the top level in infra/dcp/variables.tf and infra/dcp/main.tf (this change is a no-op as is).
I tested this change against our last stable Data Commons Platform release. With the top-level changes, this config is reflected.
Step 1: Append the following to infra/dcp/variables.tf:
variable "ingestion_dataflow_ip_configuration" {
  type        = string
  description = "IP configuration for Dataflow workers (WORKER_IP_UNSPECIFIED, WORKER_IP_PUBLIC, WORKER_IP_PRIVATE)"
  default     = "WORKER_IP_UNSPECIFIED"
}

variable "ingestion_dataflow_subnetwork" {
  type        = string
  description = "Subnetwork for Dataflow workers. Required if using WORKER_IP_PRIVATE. Format: regions/{region}/subnetworks/{subnetwork}"
  default     = ""
}
Step 2: In infra/dcp/main.tf, locate local.ingestion_config and add the new fields:
locals {
  ingestion_config = {
    # ... existing fields ...
    dataflow_ip_configuration = var.ingestion_dataflow_ip_configuration
    dataflow_subnetwork       = var.ingestion_dataflow_subnetwork
  }
}
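As an optional hardening of Step 1, the top-level variable could also reject values outside the three listed in its description. The validation block below is a suggestion that is not part of the review. It uses only the built-in `contains` function.

```hcl
variable "ingestion_dataflow_ip_configuration" {
  type        = string
  description = "IP configuration for Dataflow workers (WORKER_IP_UNSPECIFIED, WORKER_IP_PUBLIC, WORKER_IP_PRIVATE)"
  default     = "WORKER_IP_UNSPECIFIED"

  # Suggested addition (not in the original review): fail early on typos.
  validation {
    condition = contains(
      ["WORKER_IP_UNSPECIFIED", "WORKER_IP_PUBLIC", "WORKER_IP_PRIVATE"],
      var.ingestion_dataflow_ip_configuration
    )
    error_message = "ingestion_dataflow_ip_configuration must be WORKER_IP_UNSPECIFIED, WORKER_IP_PUBLIC, or WORKER_IP_PRIVATE."
  }
}
```

A misspelled value then fails at plan time rather than when the Dataflow job is launched.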
Add ingestion_dataflow_ip_configuration and ingestion_dataflow_subnetwork variables to infra/dcp/variables.tf and wire them into local.ingestion_config in infra/dcp/main.tf, making the feature reachable from the top-level module. Also add a format validation block on dataflow_subnetwork and a check block that enforces subnetwork is set when WORKER_IP_PRIVATE is used. Addresses review feedback from @yiyche and @gemini-code-assist.
Summary
Adds two optional variables to the ingestion workflow module that allow operators
to configure Dataflow worker IP addressing:
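`dataflow_ip_configuration` and `dataflow_subnetwork`, the latter being required when the former is `WORKER_IP_PRIVATE`.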
Both default to their current implicit values, so existing deployments are unaffected.
Motivation
Enterprise and public-sector GCP organizations frequently enforce the
`compute.vmExternalIpAccess` org policy, which prevents any VM (including Dataflow
workers) from obtaining an external IP address. Without this change, the Dataflow Flex
Template launch fails immediately with an org policy violation and the job never leaves
`JOB_STATE_PENDING`.
The fix is to pass `ipConfiguration: WORKER_IP_PRIVATE` in the Dataflow
`FlexTemplateRuntimeEnvironment`. The module currently hardcodes this block inside the
Cloud Workflow `source_contents` YAML string with no way to inject it at the module level.
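One way the two variables could be injected into the workflow definition is sketched below. The resource label, workflow name, job name, and template path are placeholders. The use of the Workflows Dataflow connector call is an assumption about how the module launches the job. Only `ipConfiguration` and `subnetwork` are taken from the description above.

```hcl
variable "project_id" { type = string }
variable "region" { type = string }

variable "dataflow_ip_configuration" {
  type    = string
  default = "WORKER_IP_UNSPECIFIED"
}

variable "dataflow_subnetwork" {
  type    = string
  default = ""
}

# Hypothetical resource: names and paths are illustrative only.
resource "google_workflows_workflow" "ingestion" {
  name   = "ingestion-workflow"
  region = var.region

  # Terraform interpolates the variables into the YAML before upload.
  source_contents = <<-EOT
    main:
      steps:
        - launch_dataflow:
            call: googleapis.dataflow.v1b3.projects.locations.flexTemplates.launch
            args:
              projectId: "${var.project_id}"
              location: "${var.region}"
              body:
                launchParameter:
                  jobName: "ingestion-job"
                  containerSpecGcsPath: "gs://example-bucket/templates/ingestion.json"
                  environment:
                    ipConfiguration: "${var.dataflow_ip_configuration}"
                    subnetwork: "${var.dataflow_subnetwork}"
  EOT
}
```

If the real workflow source also contains Workflows expressions, those need to be escaped as `$${...}` so that Terraform does not try to interpolate them.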
Changes
Usage example
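A minimal usage sketch, assuming the top-level variable names proposed in the review thread. The region and subnet name are hypothetical.

```hcl
# terraform.tfvars (illustrative values only)
ingestion_dataflow_ip_configuration = "WORKER_IP_PRIVATE"
ingestion_dataflow_subnetwork       = "regions/europe-west1/subnetworks/dataflow-private"
```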
Prerequisites on the operator side: Private Google Access on the target subnet and a Cloud NAT gateway for outbound connectivity.
Validation
Tested on a GCP project with an enforced `compute.vmExternalIpAccess: deny` org policy.
Ingestion of 66 SDMX/CSV observation files into Spanner completed successfully
(`JOB_STATE_DONE`). Without `WORKER_IP_PRIVATE` the job never left the `JOB_STATE_PENDING`
state.
References
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.flexTemplates/launch#FlexTemplateRuntimeEnvironment
https://cloud.google.com/resource-manager/docs/organization-policy/restricting-resources