Kate Vu

Posted on • Originally published at Medium

Centralised Logging for AWS Glue Jobs with Python

When I was working with AWS Glue jobs, I ran into a frustrating problem. I had a few Glue jobs and a shared utils module, and I had set up logging in every module, so the same code appeared in multiple places. Things got worse whenever I needed to update the logger, because I had to change the same logic everywhere, and before long I was grudgingly copying and pasting the same boilerplate into yet another file.
That's when I realized I needed a centralized logging setup I could reuse across all my Glue jobs. Centralized logging keeps things consistent and secure, makes maintenance easier, and makes debugging a lot simpler.


Why Centralize Logging?
Centralized logging brings several benefits:

  • Consistency: All jobs share the same log format, levels, and filtering — no surprises when you check different logs.

  • Maintainability: Update once, and every job gets the change automatically.

  • Structured Context: You can define a clear structure for your log messages and build a reusable log function to capture the details you need. In my case, I needed specific info for S3 operations — like bucket name, key, and action.

  • Flexibility: With dictionary-based configuration, updating formats or handlers later is super simple.

  • Traceability: Module-level loggers give you hierarchical control while still sharing the same handlers and formatters.

  • Security: You can set log levels differently for each environment. For example, log only INFO in production but allow DEBUG in dev. That way, sensitive data stays out of your production logs (a quick sketch of this follows the list).
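
Here's a tiny sketch of that per-environment switch. The environment names and level choices are just illustrative, and env_name is assumed to come from your Glue job arguments, but the idea is to feed the result into the setup_logging() function shown below:

# Illustrative mapping from environment name to default log level
ENV_LOG_LEVELS = {"dev": "DEBUG", "staging": "INFO", "prod": "INFO"}

# env_name is assumed to come from the Glue job arguments (see the usage section below)
setup_logging(level=ENV_LOG_LEVELS.get(env_name, "INFO"), environment=env_name)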


Setting Up the Centralized Logging Configuration
Here’s how I structured my logging_config module:

  • Define formatters, handlers, and loggers in setup_logging()
# logging_config module: imports used by setup_logging() and the helper functions below
import json
import logging
import logging.config
from typing import Any, Dict, Optional


def setup_logging(
    level: str = "INFO",
    log_format: Optional[str] = None,
    log_file: Optional[str] = None,
    environment: str = "dev",
) -> logging.Logger:
    """
    Set up centralized logging configuration.

    Args:
        level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
        log_format: Custom log format string
        log_file: Path to log file (optional)
        environment: Environment name (dev, staging, prod)

    Returns:
        Configured logger instance
    """

    # Default format with structured information
    if log_format is None:
        log_format = (
            "%(asctime)s | %(levelname)-8s | %(name)-20s | "
            "%(funcName)-15s:%(lineno)-4d | %(message)s"
        )

    # Create handlers
    handlers = {
        "stdout": {
            "class": "logging.StreamHandler",
            "level": "INFO",
            "formatter": "simple",
            "stream": "ext://sys.stdout",
        },
        "stderr": {
            "class": "logging.StreamHandler",
            "level": "ERROR",
            "formatter": "simple",
            "stream": "ext://sys.stderr",
        },
    }

    # Add file handler if specified
    if log_file:
        handlers["file"] = {
            "class": "logging.handlers.RotatingFileHandler",
            "level": level,
            "formatter": "detailed",
            "filename": log_file,
            "maxBytes": 10485760,  # 10MB
            "backupCount": 5,
            "encoding": "utf8",
        }

    # Logging configuration dictionary
    logging_config = {
        "version": 1,
        "disable_existing_loggers": False,  # Important to keep this False
        "formatters": {
            "detailed": {"format": log_format, "datefmt": "%Y-%m-%d %H:%M:%S"},
            "simple": {"format": "%(name)-20s - %(levelname)s - %(message)s"},
        },
        "handlers": handlers,
        "loggers": {
            # Root logger - Parent of all loggers
            "": {
                "level": level,
                "handlers": list(handlers.keys()),
            },
            # AWS SDK loggers (reduce noise)
            "boto3": {"level": "WARNING"},
            "botocore": {"level": "WARNING"},
            "urllib3": {"level": "WARNING"},
            # PySpark loggers
            "pyspark": {"level": "WARNING"},
            "py4j": {"level": "WARNING"},
        },
    }

    # Apply configuration
    logging.config.dictConfig(logging_config)

    # Get logger for the calling module
    logger = logging.getLogger(__name__)

    # Log configuration info
    logger.info(f"Logging configured - Level: {level}, Environment: {environment}")
    if log_file:
        logger.info(f"Log file: {log_file}")

    return logger
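If you also want a local rotating log file, you can exercise the file-handler branch with a call like the one below. The path is just an illustration; on Glue, the console output is usually what ends up in CloudWatch Logs, so the file handler is optional:

# Hypothetical call: console logging plus a rotating local file (path is illustrative)
setup_logging(level="DEBUG", log_file="/tmp/glue_ingestion.log", environment="dev")
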
  • Define purpose-specific logging functions
def log_function_call(logger: logging.Logger, func_name: str, **kwargs):
    """
    Log function entry with parameters.

    Args:
        logger: Logger instance
        func_name: Function name
        **kwargs: Function parameters to log
    """
    params = {
        k: v for k, v in kwargs.items() if k not in ["password", "secret", "token"]
    }
    logger.debug(f"Entering {func_name} with params: {params}")


def log_performance(logger: logging.Logger, operation: str, duration: float, **metrics):
    """
    Log performance metrics.

    Args:
        logger: Logger instance
        operation: Operation name
        duration: Duration in seconds
        **metrics: Additional metrics to log
    """
    logger.info(f"Performance - {operation}: {duration:.3f}s")
    if metrics:
        for key, value in metrics.items():
            logger.info(f"  {key}: {value}")


def log_dataframe_info(logger: logging.Logger, df_name: str, df):
    """
    Log DataFrame information for debugging.

    Args:
        logger: Logger instance
        df_name: DataFrame name/description
        df: PySpark DataFrame
    """
    try:
        row_count = df.count()
        col_count = len(df.columns)
        logger.info(
            f"DataFrame '{df_name}' - Rows: {row_count:,}, Columns: {col_count}"
        )
        logger.debug(f"DataFrame '{df_name}' columns: {df.columns}")
    except Exception as e:
        logger.warning(f"Could not get DataFrame info for '{df_name}': {e}")


def log_s3_operation(
    logger: logging.Logger, operation: str, bucket: str, key: str, **kwargs
):
    """
    Log S3 operations with consistent format.

    Args:
        logger: Logger instance
        operation: S3 operation (read, write, delete, etc.)
        bucket: S3 bucket name
        key: S3 object key
        **kwargs: Additional operation details
    """
    details = f" | {kwargs}" if kwargs else ""
    logger.info(f"S3 {operation.upper()} - s3://{bucket}/{key}{details}")


def log_error_with_context(
    logger: logging.Logger, error: Exception, context: Dict[str, Any]
):
    """
    Log errors with additional context information.

    Args:
        logger: Logger instance
        error: Exception that occurred
        context: Additional context information
    """
    logger.error(f"Error: {str(error)}")
    logger.error(f"Context: {json.dumps(context, indent=2, default=str)}")
    logger.exception("Full traceback:")
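The usage below also relies on a get_logger helper from the same logging_config module. Its body isn't shown in this post; a minimal sketch, assuming it's just a thin wrapper around logging.getLogger so jobs never import logging directly:

def get_logger(name: str) -> logging.Logger:
    """Return a named logger that inherits the handlers set up by setup_logging()."""
    return logging.getLogger(name)
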
  • Using the logger in AWS Glue Jobs
# Initialize root logger
setup_logging(level="INFO", environment=args["env_name"])

# Get module level logger
logger = get_logger("glue.ingestion")

# Or
logger = get_logger("glue.utils")

# Example:
logger.info(f"Writing processed data to: {output_s3_path}")
log_s3_operation(
    logger,
    "write",
    output_bucket,
    f"{env_name}/staging_{file_name.split('.')[0]}/",
)
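The other helpers slot in the same way. Here's a rough sketch of how they might be used inside a single job step; the function name, paths, and the spark session are assumptions for illustration:

import time

def ingest_file(bucket: str, key: str):
    """Hypothetical job step showing the logging helpers together."""
    log_function_call(logger, "ingest_file", bucket=bucket, key=key)
    start = time.time()
    try:
        # Assumes an existing SparkSession named `spark`
        df = spark.read.option("header", "true").csv(f"s3://{bucket}/{key}")
        log_dataframe_info(logger, "raw_input", df)
        log_performance(logger, "ingest_file", time.time() - start)
    except Exception as e:
        log_error_with_context(logger, e, {"bucket": bucket, "key": key})
        raise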

With this setup, each log line clearly shows which module and function it came from.
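
For example, with the simple console formatter defined earlier, an INFO message from the ingestion job comes out roughly like this (the bucket and path are placeholders):

logger.info("Writing processed data to: s3://my-output-bucket/dev/staging_orders/")
# Console output (simple formatter), roughly:
# glue.ingestion       - INFO - Writing processed data to: s3://my-output-bucket/dev/staging_orders/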


Logger vs print()
print() is easy to use and requires no setup. For quick checks, it works fine. But once your pipelines grow, run in parallel, or are deployed across environments, print output becomes much harder to track and manage.
Here’s why I chose logger over print() for my case:

  • Control: Use log levels (DEBUG, INFO, ERROR, etc.) to filter output without touching code everywhere (see the snippet after this list).
  • Consistency: All jobs follow the same format with timestamps, module names, and job context, making logs readable and traceable.
  • Security: By setting log levels per environment, you can debug freely in development without risking sensitive information appearing in production logs, something print() can't do without extra effort.
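
A small illustration of that level-based control, reusing the setup from this post (the messages are placeholders):

setup_logging(level="INFO", environment="prod")
logger = get_logger("glue.ingestion")

logger.debug("Raw record contents ...")        # filtered out at INFO level, never reaches production logs
logger.info("Finished writing staging data")   # emitted through the configured handlers
print("Raw record contents ...")               # always printed, regardless of environment or level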

References
https://docs.python.org/3/howto/logging.html
https://docs.python.org/3/howto/logging-cookbook.html
