DEV Community

Rachit Avasthi
Fixing PySpark “Cannot run program python3” Error on Windows

When running PySpark on Windows, many beginners (and even experienced developers) encounter the following error:

java.io.IOException: Cannot run program "python3":
CreateProcess error=2, The system cannot find the file specified


This article explains why this error happens, why one solution works and another doesn’t, and the correct, professional way to fix it permanently.


Understanding the Problem

Apache Spark is written in Java/Scala, but PySpark allows us to write Spark applications in Python.

When Spark executes Python code, it:

  1. Starts the JVM (Java Virtual Machine)
  2. Spawns a Python worker process
  3. Communicates between Java and Python using Py4J

By default, Spark tries to launch a Python executable named:

python3


This works on Linux and macOS, but on Windows, the Python executable is:

python.exe


Since python3 does not exist on Windows, Spark fails to start the Python worker and the job crashes.
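The failure can be reproduced outside Spark. Like Spark's worker launcher, Python's standard-library `shutil.which` resolves an executable name against `PATH`; a name that cannot be resolved (here a deliberately fake one, standing in for `python3` on a default Windows install) yields nothing, which is exactly the point at which Spark raises its `IOException`:

```python
import shutil

# Spark resolves the worker executable name against PATH, much like shutil.which.
# A name not on PATH (as "python3" is not, on a default Windows install)
# resolves to None -- the condition under which Spark's launch fails.
missing = shutil.which("python3-definitely-not-installed")
print(missing)  # None
```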


Why Setting Python Inside the Code Works

A common workaround is setting the Python executable directly in the script:

import os

os.environ["PYSPARK_PYTHON"] = r"C:\Users\User\Desktop\Training\Week5\venv\Scripts\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\User\Desktop\Training\Week5\venv\Scripts\python.exe"

from pyspark.sql import SparkSession


This works because:

  • The environment variables are set before SparkSession is created
  • Spark reads these variables immediately
  • The correct Python interpreter is used

However, this approach is not ideal:

  • The same code must be repeated in every PySpark file
  • Scripts become cluttered
  • It is not how Spark is configured in real-world projects
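If you do keep the in-script workaround temporarily, a slightly more portable sketch (my own variation, not part of the original fix) points Spark at whichever interpreter is running the script, so no venv path needs to be hard-coded:

```python
import os
import sys

# sys.executable is the interpreter running this script (e.g. the activated
# venv's python.exe), so no path is hard-coded. These assignments must run
# before the SparkSession is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```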

Why the PowerShell Method Often Fails

You may try setting environment variables in PowerShell:

$env:PYSPARK_PYTHON="C:\path\to\python.exe"
$env:PYSPARK_DRIVER_PYTHON="C:\path\to\python.exe"
python Lab1.py


This sometimes fails because:

  • PowerShell $env: variables are scoped to the current session only
  • Closing the terminal clears them
  • Spark launched from another terminal or an editor's built-in runner never sees them
  • Spark must see these variables before the JVM starts, so setting them after a session is already running has no effect

This makes the approach unreliable for long-term use.


The Correct and Permanent Solution (Best Practice)

The recommended and professional solution is to set these variables at the Windows system level.

This ensures:

  • Spark always knows which Python to use
  • No code changes are required
  • Works across all projects and terminals

Step-by-Step: Setting Environment Variables on Windows

Step 1: Open Environment Variables

  1. Press Windows + R
  2. Type:

    sysdm.cpl
    
    
  3. Go to the Advanced tab

  4. Click Environment Variables


Step 2: Add User Environment Variables

Under User variables, click New and add the following:

Variable 1

  • Name: PYSPARK_PYTHON
  • Value:

    C:\Users\User\Desktop\Training\Week5\venv\Scripts\python.exe
    
    

Variable 2

  • Name: PYSPARK_DRIVER_PYTHON
  • Value:

    C:\Users\User\Desktop\Training\Week5\venv\Scripts\python.exe
    
    

Click OK on each open dialog to save the changes.


Step 3: Restart Your Terminal (Very Important)

A terminal reads system environment variables only once, at startup, so terminals that are already open never see your changes.

  • Close all PowerShell / CMD / VS Code terminals
  • Open a new PowerShell
  • Activate your virtual environment:

    .\venv\Scripts\activate
    
    

Step 4: Verify the Setup

Run:

echo $env:PYSPARK_PYTHON


If you see your Python path, Spark will see it too.
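You can also confirm from inside Python itself that the variable reached the process; `os.environ.get` returns the configured path if it was inherited, and None if it was not:

```python
import os

# Prints the configured interpreter path if the variable was inherited,
# otherwise prints None (meaning the terminal was not restarted).
print(os.environ.get("PYSPARK_PYTHON"))
```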


Clean PySpark Code (After Fix)

Once the environment is set, your PySpark script stays clean:

from pyspark.sql import SparkSession

# No environment setup needed here -- Spark reads PYSPARK_PYTHON
# and PYSPARK_DRIVER_PYTHON from the Windows environment variables
spark = SparkSession.builder.getOrCreate()

data = [
    (1, "Laptop", "Electronics", 50000),
    (2, "Mobile", "Electronics", 20000),
    (3, "Tablet", "Electronics", 15000),
    (4, "Headphones", "Accessories", 3000),
    (5, "Keyboard", "Accessories", 2500)
]

df = spark.createDataFrame(
    data, ["id", "product", "category", "amount"]
)

df.show()
spark.stop()


No environment setup code is required anymore.


Key Takeaways

  • Spark defaults to python3, which breaks on Windows
  • Setting Python inside the script works but is not scalable
  • PowerShell environment variables are temporary
  • Windows Environment Variables are the correct solution
  • Always configure Spark outside your code

Final Thought

If you are learning Spark on Windows, this configuration step is mandatory.

Once set correctly, PySpark becomes stable, predictable, and production-ready.

Happy Spark learning 🚀
