DEV Community

Rachit Avasthi
Fixing PySpark “Cannot run program python3” Error on Windows

When running PySpark on Windows, many beginners (and even experienced developers) encounter the following error:

java.io.IOException: Cannot run program "python3":
CreateProcess error=2, The system cannot find the file specified


This article explains why this error happens, why one solution works and another doesn’t, and the correct, professional way to fix it permanently.


Understanding the Problem

Apache Spark is written in Java/Scala, but PySpark allows us to write Spark applications in Python.

When Spark executes Python code, it:

  1. Starts the JVM (Java Virtual Machine)
  2. Spawns a Python worker process
  3. Communicates between Java and Python using Py4J

By default, Spark tries to launch a Python executable named:

python3


This works on Linux and macOS, but on Windows, the Python executable is:

python.exe


Since python3 does not exist on Windows, Spark fails to start the Python worker and the job crashes.
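The failure can be reproduced outside Spark. Like Spark's worker launcher, Python's standard-library `shutil.which` resolves an executable name against `PATH`; a name that cannot be resolved (here a deliberately fake one, standing in for `python3` on a default Windows install) yields nothing, which is exactly the point at which Spark raises its `IOException`:

```python
import shutil

# Spark resolves the worker executable name against PATH, much like shutil.which.
# A name not on PATH (as "python3" is not, on a default Windows install)
# resolves to None -- the condition under which Spark's launch fails.
missing = shutil.which("python3-definitely-not-installed")
print(missing)  # None
```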


Why Setting Python Inside the Code Works

A common workaround is setting the Python executable directly in the script:

import os

os.environ["PYSPARK_PYTHON"] = r"C:\Users\User\Desktop\Training\Week5\venv\Scripts\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\User\Desktop\Training\Week5\venv\Scripts\python.exe"

from pyspark.sql import SparkSession


This works because:

  • The environment variables are set before SparkSession is created
  • Spark reads these variables immediately
  • The correct Python interpreter is used

However, this approach is not ideal:

  • The same code must be repeated in every PySpark file
  • Scripts become cluttered
  • It is not how Spark is configured in real-world projects
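If you do keep the in-script workaround temporarily, a slightly more portable sketch (my own variation, not part of the original fix) points Spark at whichever interpreter is running the script, so no venv path needs to be hard-coded:

```python
import os
import sys

# sys.executable is the interpreter running this script (e.g. the activated
# venv's python.exe), so no path is hard-coded. These assignments must run
# before the SparkSession is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```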

Why the PowerShell Method Often Fails

You may try setting environment variables in PowerShell:

$env:PYSPARK_PYTHON="C:\path\to\python.exe"
$env:PYSPARK_DRIVER_PYTHON="C:\path\to\python.exe"
python Lab1.py


This sometimes fails because:

  • PowerShell $env: variables are scoped to the current session only
  • Closing the terminal clears them
  • Spark launched from another terminal or an editor's built-in runner never sees them
  • Spark must see these variables before the JVM starts, so setting them after a session is already running has no effect

This makes the approach unreliable for long-term use.


The Correct and Permanent Solution (Best Practice)

The recommended and professional solution is to set these variables at the Windows system level.

This ensures:

  • Spark always knows which Python to use
  • No code changes are required
  • Works across all projects and terminals

Step-by-Step: Setting Environment Variables on Windows

Step 1: Open Environment Variables

  1. Press Windows + R
  2. Type:

    sysdm.cpl
    
    
  3. Go to the Advanced tab

  4. Click Environment Variables


Step 2: Add User Environment Variables

Under User variables, click New and add the following:

Variable 1

  • Name: PYSPARK_PYTHON
  • Value:

    C:\Users\User\Desktop\Training\Week5\venv\Scripts\python.exe
    
    

Variable 2

  • Name: PYSPARK_DRIVER_PYTHON
  • Value:

    C:\Users\User\Desktop\Training\Week5\venv\Scripts\python.exe
    
    

Click OK on each open dialog to save the changes.


Step 3: Restart Your Terminal (Very Important)

A terminal reads system environment variables only once, at startup, so terminals that are already open never see your changes.

  • Close all PowerShell / CMD / VS Code terminals
  • Open a new PowerShell
  • Activate your virtual environment:

    .\venv\Scripts\activate
    
    

Step 4: Verify the Setup

Run:

echo $env:PYSPARK_PYTHON


If you see your Python path, Spark will see it too.
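You can also confirm from inside Python itself that the variable reached the process; `os.environ.get` returns the configured path if it was inherited, and None if it was not:

```python
import os

# Prints the configured interpreter path if the variable was inherited,
# otherwise prints None (meaning the terminal was not restarted).
print(os.environ.get("PYSPARK_PYTHON"))
```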


Clean PySpark Code (After Fix)

Once the environment is set, your PySpark script stays clean:

from pyspark.sql import SparkSession

# No environment setup needed here -- Spark reads PYSPARK_PYTHON
# and PYSPARK_DRIVER_PYTHON from the Windows environment variables
spark = SparkSession.builder.getOrCreate()

data = [
    (1, "Laptop", "Electronics", 50000),
    (2, "Mobile", "Electronics", 20000),
    (3, "Tablet", "Electronics", 15000),
    (4, "Headphones", "Accessories", 3000),
    (5, "Keyboard", "Accessories", 2500)
]

df = spark.createDataFrame(
    data, ["id", "product", "category", "amount"]
)

df.show()
spark.stop()


No environment setup code is required anymore.


Key Takeaways

  • Spark defaults to python3, which breaks on Windows
  • Setting Python inside the script works but is not scalable
  • PowerShell environment variables are temporary
  • Windows Environment Variables are the correct solution
  • Always configure Spark outside your code

Final Thought

If you are learning Spark on Windows, this configuration step is mandatory.

Once set correctly, PySpark becomes stable, predictable, and production-ready.

Happy Spark learning 🚀
