ChelseaLiu0822

PySpark: missing value

Drop

df.na.drop() vs. df.dropna()

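DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other, so they run the same underlying code and there is no performance difference between them.

Both forms accept the same how, thresh, and subset parameters, so either df.na.drop() or df.dropna() can restrict the null check to a subset of columns.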

Examples

# Code to drop any row that contains missing data
df.na.drop().show()

# Keep rows with at least 2 non-null values; drop rows with fewer than 2
df.na.drop(thresh=2).show()

# Drop only the rows that have a null in the Sales column
df.dropna(how='any', subset=['Sales']).show()

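# how='any' (the default): drop a row if any of its values is null
df.na.drop(how='any').show()
# how='all': drop a row only if all of its values are null
df.na.drop(how='all').show()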

Fill

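We can also fill the missing values with new values. If you have nulls across columns of different data types, Spark is smart enough to match the fill value to the column types: a string value only fills nulls in string columns and leaves nulls in numeric columns untouched. For example: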

df.na.fill('NEW VALUE').show()

If you have multiple columns to fill, you can pass a dictionary that maps each column name to its replacement value.
