[SPARK-52669][PYSPARK] Fix Python executable selection for YARN client mode #51357#55310

Open
gwdgithubnom wants to merge 1 commit into apache:master from agodomen:master

Conversation

@gwdgithubnom

What changes were proposed in this pull request?

This PR improves the Python executable selection logic in SparkContext to resolve version mismatch issues, particularly in YARN client mode.

Previously, the driver could fail to locate the correct Python interpreter when PYSPARK_PYTHON was not explicitly set in the shell environment, even if it was defined in SparkConf. This led to a RuntimeError caused by minor-version discrepancies between the driver and executors (e.g., the driver using system Python 3.10 while the executors use an archived Python 3.6).

Key changes:

  1. Refined Priority Logic: Implemented a robust _get_python_exec_from_conf method that follows a 7-level priority sequence to ensure consistency: PYSPARK_DRIVER_PYTHON (Env) > PYSPARK_PYTHON (Env) > spark.pyspark.driver.python (Conf) > spark.pyspark.python (Conf) > spark.executorEnv.PYSPARK_DRIVER_PYTHON > spark.executorEnv.PYSPARK_PYTHON > Default (python3).
  2. Improved Client Mode Support: Ensures the driver can correctly resolve the Python path from the archived environment (e.g., ./environment/bin/python) via Spark configuration without requiring manual environment variable exports for every script execution.
  3. Code Quality: Standardized the documentation to follow NumPy/Sphinx style and fixed formatting issues to comply with Spark's linting tools (ruff, mypy).

Note: This PR is a revival and optimization of #51357.
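
The 7-level priority sequence above can be sketched roughly as follows. This is an illustrative sketch, not the actual patch: the function name `resolve_python_exec` and its signature are hypothetical stand-ins for the `_get_python_exec_from_conf` method described above.

```python
import os

DEFAULT_PYTHON = "python3"


def resolve_python_exec(conf_get, environ=None):
    """Return the first Python executable found in priority order.

    ``conf_get`` is any callable mapping a SparkConf key to its value
    (or None); ``environ`` defaults to ``os.environ``.
    """
    environ = os.environ if environ is None else environ
    candidates = [
        environ.get("PYSPARK_DRIVER_PYTHON"),                 # 1. driver env var
        environ.get("PYSPARK_PYTHON"),                        # 2. generic env var
        conf_get("spark.pyspark.driver.python"),              # 3. driver SparkConf key
        conf_get("spark.pyspark.python"),                     # 4. generic SparkConf key
        conf_get("spark.executorEnv.PYSPARK_DRIVER_PYTHON"),  # 5. executorEnv driver key
        conf_get("spark.executorEnv.PYSPARK_PYTHON"),         # 6. executorEnv generic key
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return DEFAULT_PYTHON                                     # 7. fallback
```

An environment variable always wins over a SparkConf value, so existing deployments that export PYSPARK_PYTHON keep their current behavior.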

Why are the changes needed?

PySpark requires the driver and executors to use consistent Python minor versions. In many production environments (especially when using conda-pack or virtualenvs), PYSPARK_PYTHON is passed via SparkConf rather than system-wide environment variables.

Without this fix, the driver falls back to the system default Python when scripts are launched directly, causing a mismatch with the executor's archived Python environment. This change automates the resolution, making the deployment more robust and user-friendly by eliminating the need to manually export environment variables for each session.
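
For context, a typical client-mode setup with a packed conda environment might look like the following (a configuration sketch; the archive name and paths are illustrative):

```python
# Illustrative: the archived environment is shipped via spark.yarn.dist.archives
# and the interpreter inside it is selected via spark.pyspark.python, with no
# PYSPARK_PYTHON export needed in the launching shell.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.yarn.dist.archives", "pyspark_env.tar.gz#environment")
    .config("spark.pyspark.python", "./environment/bin/python")
    .getOrCreate()
)
```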

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  1. Manual Verification: Ran a PySpark job in YARN client mode with a specific Python archive. Verified that sys.version and sys.executable match between the driver and executors using:

```python
import sys
spark.range(1).rdd.map(lambda x: (x, sys.version, sys.executable)).collect()
```

  2. Unit Tests: Added/Updated mock tests in pyspark/tests/test_context.py to verify the 7-level priority logic and ensure correct overrides between environment variables and Spark configurations.
  3. Linting: Passed dev/lint-python checks.

Was this patch authored or co-authored using generative AI tooling?

No.

@gwdgithubnom gwdgithubnom changed the title [SPARK-52669][PYSPARK] Improve Python executable selection for YARN client mode #51357 [SPARK-52669][PYSPARK] Fix Python executable selection for YARN client mode #51357 Apr 11, 2026
