Camille He

Posted on Jun 4

Mask Sensitive Data using Python Built-in Logging Module

#python #logging

Logging is an essential and common topic in software development, which play a vital role in monitoring, debugging and troubleshooting applications. If logs are not properly secured or managed, they can become a target for hackers and other malicious actors who may attempt to gain access to this sensitive data. By keeping sensitive data out of logs, you can help protect users' privacy and reduce the risk of data breaches or other security incidents.

There are lots of best practices for logging for different programming languages, and this article only focus on masking sensitive data in python using python built-in module - logging.

Here is the demo source code in GitHub.

Let's get started.

Initialize Logging Configuration

First of all, we create a file named log.py with a log configuration as below. The outputs are formatted according to the formatter argument. The default is "console".

# import modules
import re
import time
import logging
import logging.config

def init_logging(
    log_level: str = "DEBUG", formatter: str = "console"
) -> logging.Logger:
 
    LOG_CONFIG = {
        "version": 1,
        "handlers": {
            "stdout": {
                "class": "logging.StreamHandler",
                "stream": "ext://sys.stdout",
                "formatter": formatter,
            }
        },        "formatters": {
            "json": {
                "format": (
                    '{"msg":"%(message)s","level":"%(levelname)s",'
                    '"file":"%(filename)s","line":%(lineno)d,'
                    '"module":"%(module)s","func":"%(funcName)s"}'
                ),
                "datefmt": "%Y-%m-%dT%H:%M:%SZ",
            },
            "console": {
                "format": "%(asctime)s %(levelname)s : %(message)s",
                "datefmt": "%Y-%m-%dT%H:%M:%SZ",
            },
        },
        "root": {"handlers": ["stdout"], "level": log_level},
    }
    logging.Formatter.converter = time.gmtime
    logging.config.dictConfig(LOG_CONFIG)
    logger = logging.getLogger(__name__)
 
    return logger

Then, create a file named main.py with below code.

from log import init_logging

LOG = init_logging()
 
if __name__ == "__main__":
    test_case = (
        "John [513-84-7329] made a payment with credit card 1234-5678-9012-3456."
    )
    LOG.info(test_case)

Outputs:

2024-06-03T06:45:38Z INFO : John [513-84-7329] made a payment with credit card 1234-5678-9012-3456.

Obviously, there are sensitive data in logs. 513-84-7329 is U.S. social security number, 1234-5678-9012-3456 is a credit card number. These data are highly confidential and should be masked or redacted with a series of asterisks or replace them with a hash value.

Although filters are used primarily to filter records based on more sophisticated criteria than levels, they get to see every record which is processed by the handler or logger they’re attached to: this can be useful if you want to do things like counting how many records were processed by a particular logger or handler, or adding, changing or removing attributes in the LogRecord being processed.

From logging.Filter https://docs.python.org/3/library/logging.html#filter-objects

According to the above description, I use logging.Filter to mask sensitive data in logs.

Setup logging.Filter in Logger Configuration

Create a filter and configure it in log configuration. Add below code in log.py file.


# 1. Define the regex patterns of sensitive data
regex_patterns = [
    # U.S. Social Security numbers
    r"\d{3}-\d{2}-\d{4}",
    # Credit card numbers
    r"\d{4}-\d{4}-\d{4}-\d{4}",
]
 
# 2. Define a filter class to mask sensitive data that match the pre-defined regex patterns
class SensitiveDataFilter(logging.Filter):
    patterns = regex_patterns
    def filter(self, record):
        record.msg = self.mask_sensitive_data(record.msg)
        return True
    def mask_sensitive_data(self, message):
        for pattern in self.patterns:
            message = re.sub(pattern, "******", message)
        return message

# 3. Add filter configuration in LOG_CONFIG
LOG_CONFIG = {
        "version": 1,
        "handlers": {
            "stdout": {
                ...
                # setup filters
                "filters": ["sensitive_data_filter"],
            }
        },
        # add filters in configuration
        "filters": {
            "sensitive_data_filter": {
                "()": SensitiveDataFilter,
            }
        },
        "formatters": {...},
        "root": {...},
    }

Replace main.py with below code. We print message to console using three string formatting methods for comparison. All work.

Read Python String Formatting Best Practices if you are interested in string formatting in python.

from log import init_logging

LOG = init_logging()

if __name__ == "__main__":
    test_case = (
        "John [513-84-7329] made a payment with credit card 1234-5678-9012-3456."
    )
    LOG.debug(f"use f-string: {test_case}")
    LOG.debug("use str.format: {}".format(test_case))
    LOG.debug("use string modulo method: %s" % (test_case))

Outputs:

2024-06-03T07:05:26Z DEBUG : use f-string: John [******] made a payment with credit card ******.
2024-06-03T07:05:26Z DEBUG : use str.format: John [******] made a payment with credit card ******.
2024-06-03T07:05:26Z DEBUG : use string modulo method: John [******] made a payment with credit card ******.

Updates:

Define the regex patterns of sensitive data
Define a filter class SensitiveDataFilter to mask sensitive data that match the regex patterns
Add and setup filter configuration in LOG_CONFIG

The masking process is clear. We define the regex patterns for sensitive data, and replace them with asterisks.

Mask sensitive data in the values of dictionary

In above examples, the sensitive data have regex patterns that can be defined easily. However, sometimes our sensitive data is a random string, for example, we need to print test_case as below and mask the password only. We cannot define a regex pattern for password string 'fFwpUd!CJT4'.

test_case = {'username': 'John Doe', 'password': 'fFwpUd!CJT4'}

After diving into logging module, I found two solutions to mask sensitive data in a dictionary. They may not be the best, but they did solve my problem.

Option 1. Mask sensitive data in record.msg matching regex pattern

The first solution use regex pattern as well, but this time the sensitive data is located by it's key in dictionary.

Let's update class SensitiveDataFilter. Add below code in log.py file.


# Define a list of keys that values are sensitive data
sensitive_keys = (
    "headers",
    "credentials",
    "Authorization",
    "token",
    "password",
)


# mask sensitive data in record.msg
class SensitiveDataFilter2(logging.Filter):
    patterns = regex_patterns
    sensitive_keys = sensitive_keys

    def filter(self, record):
        record.msg = self.mask_sensitive_msg(record.msg)
        return True

    def mask_sensitive_msg(self, message):
        for pattern in self.patterns:
            message = re.sub(pattern, "******", message)
        # replace sensitive data with asterisks
        for key in self.sensitive_keys:
            pattern_str = rf"'{key}': '[^']+'"
            replace = f"'{key}': '******'"
            message = re.sub(pattern_str, replace, message)
        return message

Updates:

Define a list of keys that values are sensitive data. These keys are used to locate the sensitive data in dictionary.
Mask sensitive data in dictionary.values() by processing record.msg. In function mask_sensitive_msg, we define the pattern string '{key}': '[^']+'. And replace the match string with '{key}': ******.

Outputs:

2024-06-03T07:33:15Z DEBUG : use f-string: {'username': 'John Doe', 'password': '******'}
2024-06-03T07:33:15Z DEBUG : use str.format: {'username': 'John Doe', 'password': '******'}
2024-06-03T07:33:15Z DEBUG : use string modulo method: {'username': 'John Doe', 'password': '******'}

In the solution, the dictionary object is treated as a part of log message essentially. The only difference is the regex patterns. However, it's not stable from my perspective, causes errors for special scenarios. So I tried another solution.

Option 2. Mask sensitive data in record.args

When going through the logging module documentation, I found that, for a LogRecord, the primary information is passed in msg and args, which are combined using msg % args to create the message attribute of the record. With args, we can process the sensitive data before being merged into msg. Here is the description of msg and args fields in a LogRecord object.

msg (Any) – The event description message, which can be a %-format string with placeholders for variable data, or an arbitrary object (see Using arbitrary objects as messages).
args (tuple | dict[str, Any]) – Variable data to merge into the msg argument to obtain the event description.

Interesting, right? Let's update class SensitiveDataFilter as below to make it work.

class SensitiveDataFilter(logging.Filter):
    patterns = regex_patterns
    sensitive_keys = sensitive_keys
    def filter(self, record):
        try:
            record.args = self.mask_sensitive_args(record.args)
            record.msg = self.mask_sensitive_msg(record.msg)
            return True
        except Exception as e:
            return True
    def mask_sensitive_args(self, args):
        if isinstance(args, dict):
            new_args = args.copy()
            for key in args.keys():
                if key in sensitive_keys:
                    new_args[key] = "******"
                else:
                    # mask sensitive data in dict values
                    new_args[key] = self.mask_sensitive_msg(args[key])
            return new_args
        # when there are multi arg in record.args
        return tuple([self.mask_sensitive_msg(arg) for arg in args])
    def mask_sensitive_msg(self, message):
        # mask sensitive data in multi record.args
        if isinstance(message, dict):
            return self.mask_sensitive_args(message)
        if isinstance(message, str):
            for pattern in self.patterns:
                message = re.sub(pattern, "******", message)
            for key in self.sensitive_keys:
                pattern_str = rf"'{key}': '[^']+'"
                replace = f"'{key}': '******'"
                message = re.sub(pattern_str, replace, message)
        return message

Replace main.py with below code.

from log import init_logging

LOG = init_logging()

if __name__ == "__main__":
    test_case = {"username": "John Doe", "password": "xyz"}
    LOG.debug("use args: %s", test_case)
    LOG.debug("use multi args: %s %s", test_case, test_case)

Outputs:

2024-06-03T09:41:02Z DEBUG : use args: {'username': 'John Doe', 'password': '******'}
2024-06-03T09:41:02Z DEBUG : use multi args: {'username': 'John Doe', 'password': '******'} {'username': 'John Doe', 'password': '******'}

Let's go through the updates.

I created a new function named mask_sensitive_args to process variable data in record.arg:
When args is dictionary: Iterate all key, value pair of dict. replace the sensitive value if the key in the sensitive_keys.
When args is a tuple: Iterate tuple. The item of each tuple may be string or dict.
- If item is a string, mask sensitive data using regex patterns
- If item is a dict, use mask_sensitive_args

Modify existing function mask_sensitive_msg in order to support multiple args scenarios.
Wrap the filter function with a try-catch block. We catch all exception when masking data, and print the data without masking if there is any error occur during masking.

Summary

When masking sensitive data in logs, choose regex pattern if the data match common pattern, for example SSN, or credit card.
When masking sensitive data (random string) in a dictionary, utilize sensitive key name to locate sensitive value or leverage record.args.
Always put an eye on your logs.

References:
Keep sensitive information out of logs

Thanks for reading. Looking forward to your comments and ideas.

Top comments (1)

thitkhotau • Jun 5 • Edited

It's not directly related to the main topic you're discussing, but I noticed an issue with how you're logging; it doesn't categorize logs by module, making debugging difficult. Instead of using init_logging, why not try this approach?

# _logging.py
import logging
from functools import lru_cache
from logging import Logger


@lru_cache(maxsize=None)
def init_logging() -> None:
    logging.basicConfig(
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        datefmt="%m/%d/%Y %I:%M:%S %p",
    )


def get_logger(name: str) -> Logger:
    init_logging()
    return logging.getLogger(name)

from ._logging import get_logger

logger = get_logger(__name__)


def main():
    logger.warning("***")


if __name__ == "__main__":
    main()

DEV Community

Mask Sensitive Data using Python Built-in Logging Module

Initialize Logging Configuration

Setup logging.Filter in Logger Configuration

Mask sensitive data in the values of dictionary

Option 1. Mask sensitive data in record.msg matching regex pattern

Option 2. Mask sensitive data in record.args

Summary

Top comments (1)

Read next

FastAPI Beyond CRUD Part 2 - Build a Simple Web Server (Path & Query Params, Request Body, Headers)

Django 9: Scheduled tasks

Java Certified Foundations Associate 1Z0-811 Preparation Study Plan

Ruby on Rails está muerto, fin.