DEV Community

Cover image for Dealing with Nothing: What SAS Taught Me About Missing Values in Python (as a Beginner)
Brigita Jon
Brigita Jon

Posted on • Edited on

Dealing with Nothing: What SAS Taught Me About Missing Values in Python (as a Beginner)

After a long break from data analysis, I’m #back-at-It! Currently, I’m working on a SAS-to-Python migration project, and I’m documenting what I learn — especially the unexpected things.

This post is all about one such surprise: missing values.


Funny how much time data analysts spend handling... nothing. That’s a rhetorical question I found myself thinking the other day.

I’m talking about missing values, which, if handled improperly, can break logic, skew results, or silently wipe out rows during analysis.

As I’ve been working through the SAS-to-Python conversion over the past couple of months, I’ve started noticing differences in how each handles missing data. Here’s a summary of what I found, plus a few practical examples at the end.


Different languages, different "nothings"

Missing values may look the same across tools, but their behavior varies depending on the language, data type, and operation (comparison, filtering, grouping, etc.).

Below is an overview comparing how SAS, Python, and pandas represent and work with missing values:

Missing values representations in most common data types

Data Type SAS Missing Value Python Core pandas Missing Value
Numeric . (dot), .A–.Z (specials with sort order) float('nan'), None (but None breaks math) np.nan, pd.NA (nullable Int64, Float64)
Character '' (empty string) None None, np.nan, pd.NA (used with string dtype)
Dates/Times . (still numeric, formatted as date) None (not valid in datetime ops) pd.NaT, None, pd.NA (used with datetime64[ns])
Boolean N/A (SAS has no native boolean type) None pd.NA, None (nullable Boolean dtype in pandas)

References

SAS

The dot (.) is the universal marker for missing numeric values—including dates and booleans, which are stored as numerics under the hood. SAS also supports 27 special missing values (.A–.Z, ._) for data situations when you need to categorize the missing values. I have never used or met them in my work though!

pandas

pandas uses different missing value markers depending on the data type:

  • np.nan for floating-point numbers
  • pd.NaT for datetime types
  • pd.NA for the newer, more consistent nullable types (such as Int64, string, and nullable Boolean)
  • For object or string columns, None and np.nan can also appear as missing values.

Sorting behavior: SAS vs Python

In SAS, while sorting numeric values, the missing values will always appear first.

Interestingly, SAS supports special missing values that even have their own rank when sorting:

The standard numeric missing value (.) is sorted before .A, and both are sorted before .Z. The ._ is the smallest.

Sorting numerical values in SAS

Sort Order Symbol Description
smallest ._ underscore
. period
.A–.Z special missing values A (smallest) through Z
-n negative numbers
0 zero
largest +n positive numbers

Reference

In pandas, on the other hand, missing values go to the end by default—but you can change this behavior using a parameter:

df.sort_values(by='x', na_position='first')
Enter fullscreen mode Exit fullscreen mode

Detecting missing values

One of the first steps in working with missing values is detecting them.

In SAS, you typically check for missing values using:

if missing(x)      /* universal */
if x = .           /* numeric */
if x = ''          /* character */
Enter fullscreen mode Exit fullscreen mode

In Python, you can use:

pd.isna(x)
pd.isnull(x)   # (These are aliases)
Enter fullscreen mode Exit fullscreen mode

Filling missing values in Python: doing something with nothing

Once you've detected missing values, the next step is to decide how to fill them or you need to do it at all. There are several ways to handle missing values in pandas. Here's an overview of the most common methods:

  1. Fill with a constant value

    You can replace all missing values with a specific value, like 0 or 'unknown':

    df['col'].fillna(0)
    df.fillna('unknown')
    
  2. Fill using another column

    You can use the values from another column to fill missing entries.

    df['col_a'].fillna(df['col_b'])
    

    Alternatively, you can use the combine_first() method, which behaves like a SQL COALESCE operation:

    df['col_a'].combine_first(df['col_b'])
    
  3. Forward and backward fill

    You can use forward fill or backward fill to propagate non-missing values forward or backward to fill gaps:

    df.fillna(method='ffill')
    df.fillna(method='bfill')
    

    You can also limit how far to propagate with the limit parameter:

    df.fillna(method='ffill', limit=1)
    
  4. Interpolation

    For numeric columns, you can use interpolation. By default, pandas uses linear interpolation:

    df['col'].interpolate()
    
  5. Group-based filling

    If you want to fill missing values within specific groups, you can use groupby() in combination with a fill method. For example, filling missing values within each group using forward fill:

    df.groupby('id')['value'].transform(lambda x: x.fillna(method='ffill'))
    
  6. Statistical filling

    Another approach is to fill missing values based on statistical measures like the mean, median, or mode of the column:

    df['col'].fillna(df['col'].mean())
    df['col'].fillna(df['col'].median())
    df['col'].fillna(df['col'].mode()[0])
    

    To perform statistical filling within groups:

    df['col'] = df.groupby('group')['col'].transform(lambda x: x.fillna(x.mean()))
    

Inplace updates

If you want to modify the original DataFrame without creating a copy, you can use the inplace=True parameter:

df['col'].fillna(0, inplace=True)
Enter fullscreen mode Exit fullscreen mode

How did Python get me?

And how I made it act Like SAS

I've heard a joke online that stuck with me:

Why does pandas treat NaN == NaN as False?

Because even missing values have trust issues.

Turns out it’s funny because it’s true. While converting SAS code to Python/pandas, I ran into several moments where things didn’t behave quite the way I expected.

Here are a few practical examples where translating from SAS to Python didn’t go as expected — and how to make pandas behave more like SAS.


1. x < 0: SAS includes missing, pandas excludes

In SAS:

data filtered;
  set test;
  where x < 0;
run;
Enter fullscreen mode Exit fullscreen mode

This will include standard missing values (.), because SAS treats . as less than any number.

In pandas:

df[df['x'] < 0]
Enter fullscreen mode Exit fullscreen mode

This excludes missing values, because comparisons with NaN, pd.NA, or NaT always return False.

To mimic SAS:

df[(df['x'] < 0) | (df['x'].isna())]
Enter fullscreen mode Exit fullscreen mode

2. NaN == NaN is False, but . == . is True

In SAS:

if . = . then put "Equal";
/* Outputs: Equal */
Enter fullscreen mode Exit fullscreen mode

In Python:

import numpy as np
print(np.nan == np.nan)
Enter fullscreen mode Exit fullscreen mode

The above returns False.

How to match SAS logic:

import pandas as pd

def equal_or_missing(a, b):
    if pd.isna(a) and pd.isna(b):
        return True
    return a == b

equal_or_missing(np.nan, np.nan)
Enter fullscreen mode Exit fullscreen mode

The above returns True now.


3. Grouping with missing values: pandas excludes them

In SAS, PROC SQL treats . as a valid group and includes it in GROUP BY.

In pandas, groupby() excludes missing values in the grouping column by default.

To include them:

df.groupby('x', dropna=False).mean()
Enter fullscreen mode Exit fullscreen mode

Final thoughts

It’s ironic — some of the trickiest bugs come from values that don’t even exist.
In SAS, missing values are consistent and predictable.
In Python/pandas, you get flexibility — but also more room for surprises.

If you're also transitioning from SAS to Python or have your own gotchas around missing values, please share!

#Back-at-It #Buggy-but-getting-there

Top comments (0)