DEV Community

Camille He
Camille He

Posted on

Mask Sensitive Data using Python Built-in Logging Module

Logging is an essential and common topic in software development, which play a vital role in monitoring, debugging and troubleshooting applications. If logs are not properly secured or managed, they can become a target for hackers and other malicious actors who may attempt to gain access to this sensitive data. By keeping sensitive data out of logs, you can help protect users' privacy and reduce the risk of data breaches or other security incidents.

There are lots of best practices for logging for different programming languages, and this article only focus on masking sensitive data in python using python built-in module - logging.

Here is the demo source code in GitHub.

Let's get started.
 

Initialize Logging Configuration

First of all, we create a file named log.py with a log configuration as below. The outputs are formatted according to the formatter argument. The default is "console".



# import modules
import re
import time
import logging
import logging.config

def init_logging(
    log_level: str = "DEBUG", formatter: str = "console"
) -> logging.Logger:
 
    LOG_CONFIG = {
        "version": 1,
        "handlers": {
            "stdout": {
                "class": "logging.StreamHandler",
                "stream": "ext://sys.stdout",
                "formatter": formatter,
            }
        },        "formatters": {
            "json": {
                "format": (
                    '{"msg":"%(message)s","level":"%(levelname)s",'
                    '"file":"%(filename)s","line":%(lineno)d,'
                    '"module":"%(module)s","func":"%(funcName)s"}'
                ),
                "datefmt": "%Y-%m-%dT%H:%M:%SZ",
            },
            "console": {
                "format": "%(asctime)s %(levelname)s : %(message)s",
                "datefmt": "%Y-%m-%dT%H:%M:%SZ",
            },
        },
        "root": {"handlers": ["stdout"], "level": log_level},
    }
    logging.Formatter.converter = time.gmtime
    logging.config.dictConfig(LOG_CONFIG)
    logger = logging.getLogger(__name__)
 
    return logger


Enter fullscreen mode Exit fullscreen mode

Then, create a file named main.py with below code.



from log import init_logging

LOG = init_logging()
 
if __name__ == "__main__":
    test_case = (
        "John [513-84-7329] made a payment with credit card 1234-5678-9012-3456."
    )
    LOG.info(test_case)


Enter fullscreen mode Exit fullscreen mode

Outputs:



2024-06-03T06:45:38Z INFO : John [513-84-7329] made a payment with credit card 1234-5678-9012-3456.


Enter fullscreen mode Exit fullscreen mode

Obviously, there are sensitive data in logs. 513-84-7329 is U.S. social security number, 1234-5678-9012-3456 is a credit card number. These data are highly confidential and should be masked or redacted with a series of asterisks or replace them with a hash value.

Although filters are used primarily to filter records based on more sophisticated criteria than levels, they get to see every record which is processed by the handler or logger they’re attached to: this can be useful if you want to do things like counting how many records were processed by a particular logger or handler, or adding, changing or removing attributes in the LogRecord being processed.

 
From logging.Filter https://docs.python.org/3/library/logging.html#filter-objects
 
According to the above description, I use logging.Filter to mask sensitive data in logs.
 

Setup logging.Filter in Logger Configuration

Create a filter and configure it in log configuration. Add below code in log.py file.




# 1. Define the regex patterns of sensitive data
regex_patterns = [
    # U.S. Social Security numbers
    r"\d{3}-\d{2}-\d{4}",
    # Credit card numbers
    r"\d{4}-\d{4}-\d{4}-\d{4}",
]
 
# 2. Define a filter class to mask sensitive data that match the pre-defined regex patterns
class SensitiveDataFilter(logging.Filter):
    patterns = regex_patterns
    def filter(self, record):
        record.msg = self.mask_sensitive_data(record.msg)
        return True
    def mask_sensitive_data(self, message):
        for pattern in self.patterns:
            message = re.sub(pattern, "******", message)
        return message

# 3. Add filter configuration in LOG_CONFIG
LOG_CONFIG = {
        "version": 1,
        "handlers": {
            "stdout": {
                ...
                # setup filters
                "filters": ["sensitive_data_filter"],
            }
        },
        # add filters in configuration
        "filters": {
            "sensitive_data_filter": {
                "()": SensitiveDataFilter,
            }
        },
        "formatters": {...},
        "root": {...},
    }


Enter fullscreen mode Exit fullscreen mode

Replace main.py with below code. We print message to console using three string formatting methods for comparison. All work.

Read Python String Formatting Best Practices if you are interested in string formatting in python.



from log import init_logging

LOG = init_logging()

if __name__ == "__main__":
    test_case = (
        "John [513-84-7329] made a payment with credit card 1234-5678-9012-3456."
    )
    LOG.debug(f"use f-string: {test_case}")
    LOG.debug("use str.format: {}".format(test_case))
    LOG.debug("use string modulo method: %s" % (test_case))


Enter fullscreen mode Exit fullscreen mode

Outputs:



2024-06-03T07:05:26Z DEBUG : use f-string: John [******] made a payment with credit card ******.
2024-06-03T07:05:26Z DEBUG : use str.format: John [******] made a payment with credit card ******.
2024-06-03T07:05:26Z DEBUG : use string modulo method: John [******] made a payment with credit card ******.


Enter fullscreen mode Exit fullscreen mode

 
Updates:

  1. Define the regex patterns of sensitive data
  2. Define a filter class SensitiveDataFilter to mask sensitive data that match the regex patterns
  3. Add and setup filter configuration in LOG_CONFIG

The masking process is clear. We define the regex patterns for sensitive data, and replace them with asterisks.

Mask sensitive data in the values of dictionary

In above examples, the sensitive data have regex patterns that can be defined easily. However, sometimes our sensitive data is a random string, for example, we need to print test_case as below and mask the password only. We cannot define a regex pattern for password string 'fFwpUd!CJT4'.



test_case = {'username': 'John Doe', 'password': 'fFwpUd!CJT4'}


Enter fullscreen mode Exit fullscreen mode

After diving into logging module, I found two solutions to mask sensitive data in a dictionary. They may not be the best, but they did solve my problem.

Option 1. Mask sensitive data in record.msg matching regex pattern

The first solution use regex pattern as well, but this time the sensitive data is located by it's key in dictionary.

Let's update class SensitiveDataFilter. Add below code in log.py file.




# Define a list of keys that values are sensitive data
sensitive_keys = (
    "headers",
    "credentials",
    "Authorization",
    "token",
    "password",
)


# mask sensitive data in record.msg
class SensitiveDataFilter2(logging.Filter):
    patterns = regex_patterns
    sensitive_keys = sensitive_keys

    def filter(self, record):
        record.msg = self.mask_sensitive_msg(record.msg)
        return True

    def mask_sensitive_msg(self, message):
        for pattern in self.patterns:
            message = re.sub(pattern, "******", message)
        # replace sensitive data with asterisks
        for key in self.sensitive_keys:
            pattern_str = rf"'{key}': '[^']+'"
            replace = f"'{key}': '******'"
            message = re.sub(pattern_str, replace, message)
        return message


Enter fullscreen mode Exit fullscreen mode

Updates:

  1. Define a list of keys that values are sensitive data. These keys are used to locate the sensitive data in dictionary.
  2. Mask sensitive data in dictionary.values() by processing record.msg. In function mask_sensitive_msg, we define the pattern string '{key}': '[^']+'. And replace the match string with '{key}': ******.

Outputs:



2024-06-03T07:33:15Z DEBUG : use f-string: {'username': 'John Doe', 'password': '******'}
2024-06-03T07:33:15Z DEBUG : use str.format: {'username': 'John Doe', 'password': '******'}
2024-06-03T07:33:15Z DEBUG : use string modulo method: {'username': 'John Doe', 'password': '******'}


Enter fullscreen mode Exit fullscreen mode

In the solution, the dictionary object is treated as a part of log message essentially. The only difference is the regex patterns. However, it's not stable from my perspective, causes errors for special scenarios. So I tried another solution.
 

Option 2. Mask sensitive data in record.args

When going through the logging module documentation, I found that, for a LogRecord, the primary information is passed in msg and args, which are combined using msg % args to create the message attribute of the record. With args, we can process the sensitive data before being merged into msg. Here is the description of msg and args fields in a LogRecord object.

  • msg (Any) – The event description message, which can be a %-format string with placeholders for variable data, or an arbitrary object (see Using arbitrary objects as messages).
  • args (tuple | dict[str, Any]) – Variable data to merge into the msg argument to obtain the event description.

Interesting, right? Let's update class SensitiveDataFilter as below to make it work.



class SensitiveDataFilter(logging.Filter):
    patterns = regex_patterns
    sensitive_keys = sensitive_keys
    def filter(self, record):
        try:
            record.args = self.mask_sensitive_args(record.args)
            record.msg = self.mask_sensitive_msg(record.msg)
            return True
        except Exception as e:
            return True
    def mask_sensitive_args(self, args):
        if isinstance(args, dict):
            new_args = args.copy()
            for key in args.keys():
                if key in sensitive_keys:
                    new_args[key] = "******"
                else:
                    # mask sensitive data in dict values
                    new_args[key] = self.mask_sensitive_msg(args[key])
            return new_args
        # when there are multi arg in record.args
        return tuple([self.mask_sensitive_msg(arg) for arg in args])
    def mask_sensitive_msg(self, message):
        # mask sensitive data in multi record.args
        if isinstance(message, dict):
            return self.mask_sensitive_args(message)
        if isinstance(message, str):
            for pattern in self.patterns:
                message = re.sub(pattern, "******", message)
            for key in self.sensitive_keys:
                pattern_str = rf"'{key}': '[^']+'"
                replace = f"'{key}': '******'"
                message = re.sub(pattern_str, replace, message)
        return message


Enter fullscreen mode Exit fullscreen mode

Replace main.py with below code.



from log import init_logging

LOG = init_logging()

if __name__ == "__main__":
    test_case = {"username": "John Doe", "password": "xyz"}
    LOG.debug("use args: %s", test_case)
    LOG.debug("use multi args: %s %s", test_case, test_case)


Enter fullscreen mode Exit fullscreen mode

Outputs:



2024-06-03T09:41:02Z DEBUG : use args: {'username': 'John Doe', 'password': '******'}
2024-06-03T09:41:02Z DEBUG : use multi args: {'username': 'John Doe', 'password': '******'} {'username': 'John Doe', 'password': '******'}


Enter fullscreen mode Exit fullscreen mode

Let's go through the updates.

  1. I created a new function named mask_sensitive_args to process variable data in record.arg:
  2. When args is dictionary: Iterate all key, value pair of dict. replace the sensitive value if the key in the sensitive_keys.
  3. When args is a tuple: Iterate tuple. The item of each tuple may be string or dict.
    • If item is a string, mask sensitive data using regex patterns
    • If item is a dict, use mask_sensitive_args

Process Logic Flow

  1. Modify existing function mask_sensitive_msg in order to support multiple args scenarios.
  2. Wrap the filter function with a try-catch block. We catch all exception when masking data, and print the data without masking if there is any error occur during masking.

Summary

  1. When masking sensitive data in logs, choose regex pattern if the data match common pattern, for example SSN, or credit card.
  2. When masking sensitive data (random string) in a dictionary, utilize sensitive key name to locate sensitive value or leverage record.args.
  3. Always put an eye on your logs.

References:
Keep sensitive information out of logs

Thanks for reading. Looking forward to your comments and ideas.

Top comments (1)

Collapse
 
manhdt profile image
thitkhotau • Edited

It's not directly related to the main topic you're discussing, but I noticed an issue with how you're logging; it doesn't categorize logs by module, making debugging difficult. Instead of using init_logging, why not try this approach?

# _logging.py
import logging
from functools import lru_cache
from logging import Logger


@lru_cache(maxsize=None)
def init_logging() -> None:
    logging.basicConfig(
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        datefmt="%m/%d/%Y %I:%M:%S %p",
    )


def get_logger(name: str) -> Logger:
    init_logging()
    return logging.getLogger(name)

Enter fullscreen mode Exit fullscreen mode
from ._logging import get_logger

logger = get_logger(__name__)


def main():
    logger.warning("***")


if __name__ == "__main__":
    main()

Enter fullscreen mode Exit fullscreen mode