After a long break from data analysis, I’m #back-at-It! Currently, I’m working on a SAS-to-Python migration project, and I’m documenting what I learn — especially the unexpected things.
This post is all about one such surprise: missing values.
Funny how much time data analysts spend handling... nothing. That’s a rhetorical question I found myself thinking the other day.
I’m talking about missing values, which, if handled improperly, can break logic, skew results, or silently wipe out rows during analysis.
As I’ve been working through the SAS-to-Python conversion over the past couple of months, I’ve started noticing differences in how each handles missing data. Here’s a summary of what I found, plus a few practical examples at the end.
Different languages, different "nothings"
Missing values may look the same across tools, but their behavior varies depending on the language, data type, and operation (comparison, filtering, grouping, etc.).
Below is an overview comparing how SAS, Python, and pandas represent and work with missing values:
Missing values representations in most common data types
Data Type | SAS Missing Value | Python Core | pandas Missing Value |
---|---|---|---|
Numeric |
. (dot), .A–.Z (specials with sort order) |
float('nan') , None (but None breaks math) |
np.nan , pd.NA (nullable Int64 , Float64 ) |
Character |
'' (empty string) |
None |
None , np.nan , pd.NA (used with string dtype) |
Dates/Times |
. (still numeric, formatted as date) |
None (not valid in datetime ops) |
pd.NaT , None , pd.NA (used with datetime64[ns] ) |
Boolean | N/A (SAS has no native boolean type) | None |
pd.NA , None (nullable Boolean dtype in pandas) |
References
SAS
The dot (.
) is the universal marker for missing numeric values—including dates and booleans, which are stored as numerics under the hood. SAS also supports 27 special missing values (.A–.Z
, ._
) for data situations when you need to categorize the missing values. I have never used or met them in my work though!
pandas
pandas uses different missing value markers depending on the data type:
-
np.nan
for floating-point numbers -
pd.NaT
for datetime types -
pd.NA
for the newer, more consistent nullable types (such asInt64
,string
, and nullable Boolean) - For object or string columns,
None
andnp.nan
can also appear as missing values.
Sorting behavior: SAS vs Python
In SAS, while sorting numeric values, the missing values will always appear first.
Interestingly, SAS supports special missing values that even have their own rank when sorting:
The standard numeric missing value (.
) is sorted before .A
, and both are sorted before .Z
. The ._
is the smallest.
Sorting numerical values in SAS
Sort Order | Symbol | Description |
---|---|---|
smallest | ._ |
underscore |
. |
period | |
.A–.Z |
special missing values A (smallest) through Z | |
-n |
negative numbers | |
0 |
zero | |
largest | +n |
positive numbers |
Reference
In pandas, on the other hand, missing values go to the end by default—but you can change this behavior using a parameter:
df.sort_values(by='x', na_position='first')
Detecting missing values
One of the first steps in working with missing values is detecting them.
In SAS, you typically check for missing values using:
if missing(x) /* universal */
if x = . /* numeric */
if x = '' /* character */
In Python, you can use:
pd.isna(x)
pd.isnull(x) # (These are aliases)
Filling missing values in Python: doing something with nothing
Once you've detected missing values, the next step is to decide how to fill them or you need to do it at all. There are several ways to handle missing values in pandas. Here's an overview of the most common methods:
-
Fill with a constant value
You can replace all missing values with a specific value, like
0
or'unknown'
:
df['col'].fillna(0) df.fillna('unknown')
-
Fill using another column
You can use the values from another column to fill missing entries.
df['col_a'].fillna(df['col_b'])
Alternatively, you can use the
combine_first()
method, which behaves like a SQLCOALESCE
operation:
df['col_a'].combine_first(df['col_b'])
-
Forward and backward fill
You can use forward fill or backward fill to propagate non-missing values forward or backward to fill gaps:
df.fillna(method='ffill') df.fillna(method='bfill')
You can also limit how far to propagate with the
limit
parameter:
df.fillna(method='ffill', limit=1)
-
Interpolation
For numeric columns, you can use interpolation. By default, pandas uses linear interpolation:
df['col'].interpolate()
-
Group-based filling
If you want to fill missing values within specific groups, you can use
groupby()
in combination with a fill method. For example, filling missing values within each group using forward fill:
df.groupby('id')['value'].transform(lambda x: x.fillna(method='ffill'))
-
Statistical filling
Another approach is to fill missing values based on statistical measures like the mean, median, or mode of the column:
df['col'].fillna(df['col'].mean()) df['col'].fillna(df['col'].median()) df['col'].fillna(df['col'].mode()[0])
To perform statistical filling within groups:
df['col'] = df.groupby('group')['col'].transform(lambda x: x.fillna(x.mean()))
Inplace updates
If you want to modify the original DataFrame without creating a copy, you can use the inplace=True
parameter:
df['col'].fillna(0, inplace=True)
How did Python get me?
And how I made it act Like SAS
I've heard a joke online that stuck with me:
Why does pandas treat NaN == NaN as False?
Because even missing values have trust issues.
Turns out it’s funny because it’s true. While converting SAS code to Python/pandas, I ran into several moments where things didn’t behave quite the way I expected.
Here are a few practical examples where translating from SAS to Python didn’t go as expected — and how to make pandas behave more like SAS.
1. x < 0
: SAS includes missing, pandas excludes
In SAS:
data filtered;
set test;
where x < 0;
run;
This will include standard missing values (.
), because SAS treats .
as less than any number.
In pandas:
df[df['x'] < 0]
This excludes missing values, because comparisons with NaN
, pd.NA
, or NaT
always return False.
To mimic SAS:
df[(df['x'] < 0) | (df['x'].isna())]
2. NaN == NaN
is False, but . == .
is True
In SAS:
if . = . then put "Equal";
/* Outputs: Equal */
In Python:
import numpy as np
print(np.nan == np.nan)
The above returns False
.
How to match SAS logic:
import pandas as pd
def equal_or_missing(a, b):
if pd.isna(a) and pd.isna(b):
return True
return a == b
equal_or_missing(np.nan, np.nan)
The above returns True
now.
3. Grouping with missing values: pandas excludes them
In SAS, PROC SQL
treats .
as a valid group and includes it in GROUP BY
.
In pandas, groupby()
excludes missing values in the grouping column by default.
To include them:
df.groupby('x', dropna=False).mean()
Final thoughts
It’s ironic — some of the trickiest bugs come from values that don’t even exist.
In SAS, missing values are consistent and predictable.
In Python/pandas, you get flexibility — but also more room for surprises.
If you're also transitioning from SAS to Python or have your own gotchas around missing values, please share!
#Back-at-It
#Buggy-but-getting-there
Top comments (0)