feat(ingestion): expose Dataflow IP configuration for restricted VPC environments#121

Merged
yiyche merged 2 commits into datacommonsorg:main from carlosinfantes:feat/dataflow-ip-configuration on Jun 18, 2026

@carlosinfantes commented Jun 16, 2026 (Contributor)

Summary

Adds two optional variables to the ingestion workflow module that allow operators
to configure Dataflow worker IP addressing:

  • `dataflow_ip_configuration` — enum: `WORKER_IP_UNSPECIFIED` (default), `WORKER_IP_PUBLIC`,
    `WORKER_IP_PRIVATE`
  • `dataflow_subnetwork` — subnetwork self-link, required when using `WORKER_IP_PRIVATE`

Both default to their current implicit values, so existing deployments are unaffected.
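
As a rough illustration of the enum described above, the module-level variable could be declared along these lines. This is a sketch under assumptions: the description text and the `validation` block are illustrative, and the merged code may word or structure them differently.

```hcl
# Sketch only: an enum-style string variable restricted to the three
# values named in the summary. The default keeps existing behavior.
variable "dataflow_ip_configuration" {
  type        = string
  description = "IP configuration for Dataflow workers."
  default     = "WORKER_IP_UNSPECIFIED"

  validation {
    # contains() returns true only when the value is in the allowed list.
    condition = contains(
      ["WORKER_IP_UNSPECIFIED", "WORKER_IP_PUBLIC", "WORKER_IP_PRIVATE"],
      var.dataflow_ip_configuration
    )
    error_message = "The dataflow_ip_configuration value must be WORKER_IP_UNSPECIFIED, WORKER_IP_PUBLIC, or WORKER_IP_PRIVATE."
  }
}
```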

Motivation

Enterprise and public-sector GCP organizations frequently enforce the
`compute.vmExternalIpAccess` org policy, which prevents any VM (including Dataflow
workers) from obtaining an external IP address. Without this change, the Dataflow Flex
Template launch fails immediately with an org policy violation and the job never leaves
`JOB_STATE_PENDING`.

The fix is to pass `ipConfiguration: WORKER_IP_PRIVATE` in the Dataflow
`FlexTemplateRuntimeEnvironment`. The module currently hardcodes this block inside the
Cloud Workflow `source_contents` YAML string with no way to inject it at the module level.

Changes

| File | Change |
| --- | --- |
| `infra/dcp/variables.tf` | Add top-level `ingestion_dataflow_ip_configuration` + `ingestion_dataflow_subnetwork` variables |
| `infra/dcp/main.tf` | Wire both variables into `local.ingestion_config` |
| `infra/dcp/modules/ingestion/workflow/variables.tf` | Add `dataflow_ip_configuration` + `dataflow_subnetwork` variables, format validation, and cross-variable `check` block |
| `infra/dcp/modules/ingestion/workflow/main.tf` | Conditionally emit `ipConfiguration` and `subnetwork` in the Dataflow environment block |
| `infra/dcp/modules/stack/variables.tf` | Add both fields as optional to the `ingestion_config` object |
| `infra/dcp/modules/stack/main.tf` | Pass both variables through to `module.ingestion_workflow` |
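
One way to read "conditionally emit" is that a key appears in the Dataflow environment only when the operator set a non-default value. The sketch below shows that idea in isolation. The local names `dataflow_network_candidates` and `dataflow_network_env` and the output are hypothetical, and the real module embeds these fields in the workflow's `source_contents` YAML rather than in a standalone output.

```hcl
variable "dataflow_ip_configuration" {
  type    = string
  default = "WORKER_IP_UNSPECIFIED"
}

variable "dataflow_subnetwork" {
  type    = string
  default = ""
}

locals {
  # Candidate fields; an empty string means "leave this key out".
  # WORKER_IP_UNSPECIFIED is treated as "not set" so the default
  # produces the same environment block as before the change.
  dataflow_network_candidates = {
    ipConfiguration = var.dataflow_ip_configuration == "WORKER_IP_UNSPECIFIED" ? "" : var.dataflow_ip_configuration
    subnetwork      = var.dataflow_subnetwork
  }

  # Keep only the keys that carry a value.
  dataflow_network_env = {
    for key, value in local.dataflow_network_candidates : key => value if value != ""
  }
}

# Hypothetical output, used here only to inspect the rendered YAML.
output "dataflow_network_env_yaml" {
  value = yamlencode(local.dataflow_network_env)
}
```

Filtering a single map with a `for` expression keeps both fields in one place, which may be easier to extend than a separate conditional per key.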

Usage example

# terraform.tfvars
ingestion_dataflow_ip_configuration = "WORKER_IP_PRIVATE"
ingestion_dataflow_subnetwork       = "regions/us-central1/subnetworks/default"

Prerequisites on the operator side:

  • Private Google Access enabled on the target subnet
  • Cloud NAT gateway attached to the subnet's Cloud Router
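
For operators who manage networking with Terraform as well, the two prerequisites could look roughly like the following. Everything here is an assumption for illustration: the resource names, the region, the `default` network, and the CIDR range are placeholders and are not part of this PR.

```hcl
# Subnet for Dataflow workers with Private Google Access enabled,
# so workers without external IPs can still reach Google APIs.
resource "google_compute_subnetwork" "dataflow" {
  name                     = "dataflow-workers" # placeholder name
  region                   = "us-central1"      # placeholder region
  network                  = "default"          # placeholder network
  ip_cidr_range            = "10.10.0.0/24"     # placeholder range
  private_ip_google_access = true
}

# Cloud Router that hosts the NAT gateway.
resource "google_compute_router" "dataflow" {
  name    = "dataflow-router"
  region  = google_compute_subnetwork.dataflow.region
  network = "default"
}

# Cloud NAT scoped to the worker subnet for outbound connectivity.
resource "google_compute_router_nat" "dataflow" {
  name                               = "dataflow-nat"
  router                             = google_compute_router.dataflow.name
  region                             = google_compute_router.dataflow.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"

  subnetwork {
    name                    = google_compute_subnetwork.dataflow.id
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
}
```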

Validation

Tested on a GCP project with an enforced `compute.vmExternalIpAccess: deny` org policy.
Ingestion of 66 SDMX/CSV observation files into Spanner completed successfully
(`JOB_STATE_DONE`). Without `WORKER_IP_PRIVATE` the job never left the `JOB_STATE_PENDING`
state.

@gemini-code-assist Bot left a comment

Code Review

This pull request adds configuration options for Dataflow worker network settings (dataflow_ip_configuration and dataflow_subnetwork) to the ingestion workflow module, allowing support for private IP setups. The review feedback recommends adding validation to ensure the subnetwork format is correct and enforcing that a subnetwork is specified when private IPs are enabled.

Comment on lines +90 to +97
variable "dataflow_subnetwork" {
  type        = string
  description = <<-EOT
    Subnetwork for Dataflow workers. Required when dataflow_ip_configuration
    is WORKER_IP_PRIVATE. Format: regions/{region}/subnetworks/{subnetwork}.
  EOT
  default     = ""
}

medium

To prevent runtime failures during Dataflow job execution, we should validate the format of `dataflow_subnetwork` when provided, and ensure it is not left empty when `dataflow_ip_configuration` is set to `WORKER_IP_PRIVATE`.

Using a `validation` block ensures the subnetwork matches the expected GCP format (either a short path or a full self-link), and a `check` block enforces the cross-variable dependency during `terraform plan` and `terraform apply`.

variable "dataflow_subnetwork" {
  type        = string
  description = <<-EOT
    Subnetwork for Dataflow workers. Required when dataflow_ip_configuration
    is WORKER_IP_PRIVATE. Format: regions/{region}/subnetworks/{subnetwork}.
  EOT
  default     = ""

  validation {
    condition     = var.dataflow_subnetwork == "" || can(regex("regions/[a-zA-Z0-9-]+/subnetworks/[a-zA-Z0-9-]+$", var.dataflow_subnetwork))
    error_message = "The dataflow_subnetwork must be in the format 'regions/{region}/subnetworks/{subnetwork}' or a full self-link ending with that format."
  }
}

check "dataflow_subnetwork_presence" {
  assert {
    condition     = var.dataflow_ip_configuration != "WORKER_IP_PRIVATE" || var.dataflow_subnetwork != ""
    error_message = "dataflow_subnetwork must be specified when dataflow_ip_configuration is WORKER_IP_PRIVATE."
  }
}

@google-cla Bot commented Jun 16, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

…environments

Add dataflow_ip_configuration and dataflow_subnetwork variables to the
ingestion/workflow module, threaded up through the stack module via the
ingestion_config object.

Both variables default to their current implicit values (WORKER_IP_UNSPECIFIED
and empty string), so existing deployments are unaffected.

Operators in environments where compute.vmExternalIpAccess org policy blocks
VMs from obtaining external IPs can now set:

  dataflow_ip_configuration = "WORKER_IP_PRIVATE"
  dataflow_subnetwork       = "regions/{region}/subnetworks/{subnetwork}"

Prerequisites on the operator side: Private Google Access on the target
subnet and a Cloud NAT gateway for outbound connectivity.

Validated on a GCP project with a deny-all vmExternalIpAccess org policy.
Ingestion of 66 SDMX/CSV files into Spanner completed successfully.
@carlosinfantes carlosinfantes force-pushed the feat/dataflow-ip-configuration branch from 25a8b1c to 305eed5 Compare June 16, 2026 15:18
@dwnoble dwnoble self-requested a review June 17, 2026 14:08
@dwnoble commented Jun 17, 2026 (Contributor)

Hi @carlosinfantes, thank you for the submission! Expect initial feedback from our review within one business day.

@dwnoble dwnoble requested review from gmechali, vish-cs and yiyche June 17, 2026 14:20
description = "Path where pre-processed files are placed for the next stage"
}

variable "dataflow_ip_configuration" {

Thanks for the PR!
You may want to declare these variables at the top level as well, in infra/dcp/variables.tf and infra/dcp/main.tf. As written, the change is a no-op because the feature is not reachable from the top-level module.
I tested this change against our last stable Data Commons Platform release. With the top-level changes, the configuration takes effect.

Step 1: Append the following to infra/dcp/variables.tf:

variable "ingestion_dataflow_ip_configuration" {
  type        = string
  description = "IP configuration for Dataflow workers (WORKER_IP_UNSPECIFIED, WORKER_IP_PUBLIC, WORKER_IP_PRIVATE)"
  default     = "WORKER_IP_UNSPECIFIED"
}
variable "ingestion_dataflow_subnetwork" {
  type        = string
  description = "Subnetwork for Dataflow workers. Required if using WORKER_IP_PRIVATE. Format: regions/{region}/subnetworks/{subnetwork}"
  default     = ""
}

Step 2: In infra/dcp/main.tf, locate local.ingestion_config and add the new fields:

locals {
  ingestion_config = {
    # ... existing fields ...
    dataflow_ip_configuration = var.ingestion_dataflow_ip_configuration
    dataflow_subnetwork       = var.ingestion_dataflow_subnetwork
  }
}
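
Between the top level and the workflow module, the stack module also has to accept and forward the two fields. The fragment below is a sketch of that pass-through, assuming Terraform 1.3 or later for `optional()` defaults. The relative `source` path is a guess based on the directory layout, and the elided attributes and arguments are left as comments because they are not shown in this PR.

```hcl
# infra/dcp/modules/stack/variables.tf (sketch)
variable "ingestion_config" {
  type = object({
    # ... existing attributes elided ...
    # optional() lets callers omit the field and still get a default.
    dataflow_ip_configuration = optional(string, "WORKER_IP_UNSPECIFIED")
    dataflow_subnetwork       = optional(string, "")
  })
}

# infra/dcp/modules/stack/main.tf (sketch)
module "ingestion_workflow" {
  source = "../ingestion/workflow" # assumed relative path

  # ... existing arguments elided ...
  dataflow_ip_configuration = var.ingestion_config.dataflow_ip_configuration
  dataflow_subnetwork       = var.ingestion_config.dataflow_subnetwork
}
```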

Add ingestion_dataflow_ip_configuration and ingestion_dataflow_subnetwork
variables to infra/dcp/variables.tf and wire them into local.ingestion_config
in infra/dcp/main.tf, making the feature reachable from the top-level module.

Also add a format validation block on dataflow_subnetwork and a check block
that enforces subnetwork is set when WORKER_IP_PRIVATE is used.

Addresses review feedback from @yiyche and @gemini-code-assist.
@yiyche yiyche added this pull request to the merge queue Jun 18, 2026
Merged via the queue into datacommonsorg:main with commit 58e37c4 Jun 18, 2026
7 checks passed
3 participants