[SPARK-52669][PYSPARK] Fix Python executable selection for YARN client mode #51357#55310

Open
gwdgithubnom wants to merge 1 commit into apache:master from agodomen:master

Conversation

@gwdgithubnom

What changes were proposed in this pull request?

This PR improves the Python executable selection logic in SparkContext to resolve version mismatch issues, particularly in YARN client mode.

Previously, the driver could fail to locate the correct Python interpreter when PYSPARK_PYTHON was not explicitly set in the shell environment, even if it was defined in SparkConf. This led to a RuntimeError caused by minor-version discrepancies between the driver and executors (e.g., the driver using system Python 3.10 while the executors use an archived Python 3.6).

Key changes:

  1. Refined Priority Logic: Implemented a robust _get_python_exec_from_conf method that follows a 7-level priority sequence to ensure consistency: PYSPARK_DRIVER_PYTHON (Env) > PYSPARK_PYTHON (Env) > spark.pyspark.driver.python (Conf) > spark.pyspark.python (Conf) > spark.executorEnv.PYSPARK_DRIVER_PYTHON > spark.executorEnv.PYSPARK_PYTHON > Default (python3).
  2. Improved Client Mode Support: Ensures the driver can correctly resolve the Python path from the archived environment (e.g., ./environment/bin/python) via Spark configuration without requiring manual environment variable exports for every script execution.
  3. Code Quality: Standardized the documentation to follow NumPy/Sphinx style and fixed formatting issues to comply with Spark's linting tools (ruff, mypy).

Note: This PR is a revival and optimization of #51357.
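
The 7-level priority sequence above can be sketched roughly as follows. This is an illustrative sketch, not the actual patch: the function name `resolve_python_exec` and its signature are hypothetical stand-ins for the `_get_python_exec_from_conf` method described above.

```python
import os

DEFAULT_PYTHON = "python3"


def resolve_python_exec(conf_get, environ=None):
    """Return the first Python executable found in priority order.

    ``conf_get`` is any callable mapping a SparkConf key to its value
    (or None); ``environ`` defaults to ``os.environ``.
    """
    environ = os.environ if environ is None else environ
    candidates = [
        environ.get("PYSPARK_DRIVER_PYTHON"),                 # 1. driver env var
        environ.get("PYSPARK_PYTHON"),                        # 2. generic env var
        conf_get("spark.pyspark.driver.python"),              # 3. driver SparkConf key
        conf_get("spark.pyspark.python"),                     # 4. generic SparkConf key
        conf_get("spark.executorEnv.PYSPARK_DRIVER_PYTHON"),  # 5. executorEnv driver key
        conf_get("spark.executorEnv.PYSPARK_PYTHON"),         # 6. executorEnv generic key
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return DEFAULT_PYTHON                                     # 7. fallback
```

An environment variable always wins over a SparkConf value, so existing deployments that export PYSPARK_PYTHON keep their current behavior.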

Why are the changes needed?

PySpark requires the driver and executors to use consistent Python minor versions. In many production environments (especially when using conda-pack or virtualenvs), PYSPARK_PYTHON is passed via SparkConf rather than system-wide environment variables.

Without this fix, the driver falls back to the system default Python when scripts are launched directly, causing a mismatch with the executor's archived Python environment. This change automates the resolution, making the deployment more robust and user-friendly by eliminating the need to manually export environment variables for each session.
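
For context, a typical client-mode setup with a packed conda environment might look like the following (a configuration sketch; the archive name and paths are illustrative):

```python
# Illustrative: the archived environment is shipped via spark.yarn.dist.archives
# and the interpreter inside it is selected via spark.pyspark.python, with no
# PYSPARK_PYTHON export needed in the launching shell.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.yarn.dist.archives", "pyspark_env.tar.gz#environment")
    .config("spark.pyspark.python", "./environment/bin/python")
    .getOrCreate()
)
```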

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  1. Manual Verification: Ran a PySpark job in YARN client mode with a specific Python archive. Verified that sys.version and sys.executable match between the driver and executors using:

```python
import sys
spark.range(1).rdd.map(lambda x: (x, sys.version, sys.executable)).collect()
```

  2. Unit Tests: Added/Updated mock tests in pyspark/tests/test_context.py to verify the 7-level priority logic and ensure correct overrides between environment variables and Spark configurations.
  3. Linting: Passed dev/lint-python checks.

Was this patch authored or co-authored using generative AI tooling?

No.

@gwdgithubnom gwdgithubnom changed the title [SPARK-52669][PYSPARK] Improve Python executable selection for YARN client mode #51357 [SPARK-52669][PYSPARK] Fix Python executable selection for YARN client mode #51357 Apr 11, 2026
