
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Databricks Audit Toolkit: Scheduling Automated Audits


This guide explains how to run the Databricks Audit Toolkit on a schedule using
common orchestration tools: cron, Azure Data Factory, Databricks Workflows, and
GitHub Actions.


Prerequisites

Before scheduling, ensure:

  1. Service principal or long-lived token for authentication
  2. Environment variables configured in the execution environment
  3. Python 3.8+ with requests, jinja2, tabulate installed
  4. Network access from the runner to your Databricks workspace

Recommended environment variables

export DATABRICKS_HOST='https://your-workspace.cloud.databricks.com'
export DATABRICKS_TOKEN='dapi...'
export AUDIT_OUTPUT_DIR='/var/audit/output'
export AUDIT_LOG_LEVEL='INFO'
export AUDIT_COMPANY_NAME='Acme Corp'

Security tip: Use a service principal token instead of a personal access
token for automated runs. Service principals can be scoped to read-only
permissions and are not tied to individual user accounts.
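Before wiring any scheduler up, it helps to fail fast when credentials are missing. Here is a minimal pre-flight check, a sketch using the variable names from the list above; the function names are illustrative, not part of the toolkit.

```python
import os

# Required variables: the audit cannot run without these.
REQUIRED = ["DATABRICKS_HOST", "DATABRICKS_TOKEN"]

# Optional variables with sensible fallbacks.
OPTIONAL_DEFAULTS = {
    "AUDIT_OUTPUT_DIR": "./output",
    "AUDIT_LOG_LEVEL": "INFO",
}

def check_env() -> list[str]:
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED if not os.environ.get(name)]

def apply_defaults() -> None:
    """Fill in optional variables so downstream code can rely on them."""
    for name, default in OPTIONAL_DEFAULTS.items():
        os.environ.setdefault(name, default)
```

Call `check_env()` at the top of your wrapper and abort with a clear message if it returns anything — a scheduler that silently runs with a missing token produces confusing 403 errors instead of an obvious failure.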


Option 1: Cron (Linux / macOS)

The simplest approach for a dedicated VM or on-premises server.

Setup

# Install dependencies
pip install -r /opt/databricks-audit-toolkit/requirements.txt

# Create a wrapper script
cat > /opt/databricks-audit-toolkit/run_scheduled.sh << 'EOF'
#!/bin/bash
set -euo pipefail

export DATABRICKS_HOST='https://your-workspace.cloud.databricks.com'
export DATABRICKS_TOKEN='dapi...'
export AUDIT_OUTPUT_DIR="/var/audit/output/$(date +%Y-%m-%d)"
export AUDIT_COMPANY_NAME='Acme Corp'

# Create the dated output directory before tee writes into it
mkdir -p "${AUDIT_OUTPUT_DIR}"

cd /opt/databricks-audit-toolkit
./run_audit.sh --all --format both 2>&1 | tee "${AUDIT_OUTPUT_DIR}/run.log"

# Optional: upload results to cloud storage
# az storage blob upload-batch \
#   --source "${AUDIT_OUTPUT_DIR}" \
#   --destination audit-reports \
#   --account-name yourstorageaccount
EOF

chmod +x /opt/databricks-audit-toolkit/run_scheduled.sh

Cron entry

# Edit crontab
crontab -e

# Run every Monday at 6:00 AM UTC
0 6 * * 1 /opt/databricks-audit-toolkit/run_scheduled.sh

# Run on the 1st and 15th of each month at 3:00 AM UTC
0 3 1,15 * * /opt/databricks-audit-toolkit/run_scheduled.sh

# Run daily at midnight
0 0 * * * /opt/databricks-audit-toolkit/run_scheduled.sh

Option 2: Azure Data Factory

Use ADF to orchestrate the audit as part of your data platform pipelines.

Architecture

ADF Pipeline
├── Web Activity: Get token from Key Vault
├── Custom Activity (or Azure Batch):
│   ├── Install dependencies
│   ├── Run audit toolkit
│   └── Upload results to ADLS Gen2
└── Web Activity: Send notification (Teams/Slack/Email)

Steps

  1. Store credentials in Azure Key Vault
   - Secret: databricks-audit-host → https://your-workspace.cloud.databricks.com
   - Secret: databricks-audit-token → dapi...
  2. Create a Custom Activity (using an Azure Batch pool)
   {
     "name": "RunDatabricksAudit",
     "type": "Custom",
     "linkedServiceName": {
       "referenceName": "AzureBatchLinkedService",
       "type": "LinkedServiceReference"
     },
     "typeProperties": {
       "command": "bash run_scheduled.sh",
       "resourceLinkedService": {
         "referenceName": "AzureStorageLinkedService",
         "type": "LinkedServiceReference"
       },
       "folderPath": "audit-toolkit",
       "referenceObjects": {
         "linkedServices": [],
         "datasets": []
       }
     }
   }
  3. Add a Schedule Trigger

    • Type: Schedule
    • Recurrence: Weekly on Mondays at 06:00 UTC
  4. Add a notification step (Logic App or direct webhook)
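The Schedule Trigger from step 3 can be built programmatically — for instance, to deploy it via the ADF REST API or an ARM template. A sketch of the trigger body as a Python dict; the trigger and pipeline names here are placeholders:

```python
def weekly_trigger(pipeline_name: str, start_time: str) -> dict:
    """ADF Schedule Trigger body: weekly, Mondays at 06:00 UTC."""
    return {
        "name": "WeeklyAuditTrigger",
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {
                "recurrence": {
                    "frequency": "Week",
                    "interval": 1,
                    "startTime": start_time,  # ISO 8601, e.g. "2026-01-05T00:00:00Z"
                    "timeZone": "UTC",
                    # Fire only on Mondays at 06:00
                    "schedule": {"weekDays": ["Monday"], "hours": [6], "minutes": [0]},
                }
            },
            "pipelines": [
                {"pipelineReference": {
                    "referenceName": pipeline_name,
                    "type": "PipelineReference",
                }}
            ],
        },
    }
```

Serialise it with `json.dumps` and check it into source control alongside the pipeline definition so the schedule is versioned too.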


Option 3: Databricks Workflows

Run the audit from within Databricks itself using a scheduled job.

Setup

  1. Upload the toolkit to Databricks Repos or as workspace files:
   /Repos/audit-team/databricks-audit-toolkit/
  2. Create a notebook run_audit_notebook.py:
   # Databricks notebook source

   # Install dependencies (first run only). Magic commands cannot
   # interpolate Python variables, so the path is written out in full.
   %pip install -r /Workspace/Repos/audit-team/databricks-audit-toolkit/requirements.txt

   # COMMAND ----------

   import datetime
   import os
   import sys

   # The toolkit directory
   toolkit_dir = "/Workspace/Repos/audit-team/databricks-audit-toolkit"

   # Use Databricks secrets for credentials
   host = dbutils.secrets.get(scope="audit-config", key="databricks-host")
   token = dbutils.secrets.get(scope="audit-config", key="databricks-token")

   os.environ["DATABRICKS_HOST"] = host
   os.environ["DATABRICKS_TOKEN"] = token
   os.environ["AUDIT_OUTPUT_DIR"] = "/tmp/audit-output"
   os.environ["AUDIT_COMPANY_NAME"] = "Acme Corp"

   # Add toolkit to path and run
   sys.path.insert(0, toolkit_dir)

   from utils import DatabricksApiClient, save_json, ensure_output_dir
   from audits import ALL_AUDITS

   ensure_output_dir(os.environ["AUDIT_OUTPUT_DIR"])
   client = DatabricksApiClient()
   results = []
   for audit_cls in ALL_AUDITS:
       audit = audit_cls(client=client)
       result = audit.execute()
       results.append(result)

   # Persist the combined results with the toolkit helper
   save_json(results, os.path.join(os.environ["AUDIT_OUTPUT_DIR"], "audit_report.json"))

   # Copy results to DBFS or cloud storage, keyed by run date
   run_date = datetime.date.today().isoformat()
   dbutils.fs.cp(
       "file:/tmp/audit-output/",
       f"dbfs:/audit-reports/{run_date}/",
       recurse=True,
   )
  3. Create a scheduled job:
    • Task: Notebook → run_audit_notebook
    • Cluster: Single-node, smallest instance (no Spark needed)
    • Schedule: Weekly, Mondays 06:00 UTC
    • Alerts: Email on failure
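The job configuration above can also be created through the Jobs API (`POST /api/2.1/jobs/create`) instead of the UI. A sketch of the request payload — the Spark version, node type, and notebook path are assumptions for an Azure workspace, so adjust them for your cloud:

```python
def audit_job_payload(notebook_path: str, alert_email: str) -> dict:
    """Jobs API 2.1 payload: single-node cluster, weekly schedule, failure alert."""
    return {
        "name": "weekly-databricks-audit",
        "schedule": {
            # Quartz cron: seconds minutes hours day-of-month month day-of-week
            "quartz_cron_expression": "0 0 6 ? * MON",  # Mondays 06:00
            "timezone_id": "UTC",
            "pause_status": "UNPAUSED",
        },
        "tasks": [
            {
                "task_key": "run_audit",
                "notebook_task": {"notebook_path": notebook_path},
                "new_cluster": {
                    "spark_version": "14.3.x-scala2.12",   # assumption
                    "node_type_id": "Standard_DS3_v2",     # assumption: Azure node type
                    "num_workers": 0,                      # single node: driver only
                    "spark_conf": {
                        "spark.databricks.cluster.profile": "singleNode",
                        "spark.master": "local[*]",
                    },
                    "custom_tags": {"ResourceClass": "SingleNode"},
                },
            }
        ],
        "email_notifications": {"on_failure": [alert_email]},
    }
```

POST this dict as JSON to your workspace with the service principal token; the returned `job_id` is what you reference in alerts and run history.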

Option 4: GitHub Actions

Run the audit from a GitHub Actions workflow — useful if your team already uses
GitHub for CI/CD.

Workflow file: .github/workflows/databricks-audit.yml

name: Databricks Workspace Audit

on:
  schedule:
    # Every Monday at 06:00 UTC
    - cron: '0 6 * * 1'
  workflow_dispatch:  # Allow manual trigger

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
  AUDIT_COMPANY_NAME: 'Acme Corp'

jobs:
  audit:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout toolkit
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run audit
        run: |
          export AUDIT_OUTPUT_DIR="${GITHUB_WORKSPACE}/output"
          chmod +x run_audit.sh
          ./run_audit.sh --all --format both

      - name: Upload report artifacts
        uses: actions/upload-artifact@v4
        with:
          name: audit-report-${{ github.run_number }}
          path: output/
          retention-days: 90

      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1.24.0
        with:
          payload: |
            {
              "text": "Databricks audit failed! Run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

Required GitHub secrets

Secret               Description
DATABRICKS_HOST      Workspace URL
DATABRICKS_TOKEN     Service principal token
SLACK_WEBHOOK_URL    (Optional) Slack webhook for failure notifications

Notification Integration

Slack webhook (post-audit)

Add this to your wrapper script:

# SLACK_WEBHOOK_URL and REPORT_URL must be set in the environment
REPORT_FILE="${AUDIT_OUTPUT_DIR}/security_summary.json"
if [[ -f "$REPORT_FILE" ]]; then
  RISK_SCORE=$(python3 -c "import json; print(json.load(open('${REPORT_FILE}'))['risk_score'])")
  TOTAL=$(python3 -c "import json; print(json.load(open('${REPORT_FILE}'))['total_findings'])")

  curl -s -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{
      \"text\": \"Databricks Audit Complete\nRisk Score: ${RISK_SCORE}/100\nFindings: ${TOTAL}\nReport: <${REPORT_URL}|View Report>\"
    }"
fi

Email via sendmail

# RISK_SCORE and TOTAL come from the Slack snippet above
SUBJECT="Databricks Audit Report - $(date +%Y-%m-%d)"
BODY="Risk Score: ${RISK_SCORE}/100\nTotal Findings: ${TOTAL}\n\nSee attached report."

echo -e "Subject: ${SUBJECT}\n\n${BODY}" | sendmail audit-team@acme.com

Best Practices

  1. Use service principals — never schedule with personal tokens
  2. Rotate tokens regularly — set a 90-day maximum lifetime
  3. Store credentials in a vault — Azure Key Vault, GitHub Secrets, or Databricks secret scopes
  4. Archive reports — store historical reports in ADLS Gen2 or S3 for trend analysis
  5. Set up alerting — notify on CRITICAL findings or risk score increases
  6. Run at low-traffic times — API calls are rate-limited; run during off-peak
  7. Version your toolkit — pin to a release tag so scheduled runs are predictable
  8. Monitor the runner — alert if the audit job itself fails to execute
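On practice 6: when a run does hit rate limits, retrying with exponential backoff is usually enough. The toolkit's own client may already handle this; the sketch below just shows the general shape, with a pluggable `sleep` so it can be tested without waiting:

```python
import time

def with_backoff(call, retry_on=Exception, max_attempts: int = 5,
                 base_delay: float = 1.0, sleep=time.sleep):
    """Retry `call` on `retry_on`, doubling the delay each attempt.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

Wrap only the API call itself, not the whole audit, so a transient 429 does not re-run checks that already succeeded.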

Report Archival Strategy

/audit-reports/
├── 2026-01-06/
│   ├── audit_report.html
│   ├── audit_report.json
│   └── security_summary.json
├── 2026-01-13/
│   ├── ...
├── 2026-01-20/
│   ├── ...
└── latest/ → symlink to most recent

Use the JSON reports for programmatic trend analysis:

import json
import glob

reports = sorted(glob.glob("/audit-reports/*/security_summary.json"))
for path in reports[-12:]:  # Last 12 runs
    data = json.load(open(path))
    date = path.split("/")[-2]
    print(f"{date}: Risk={data['risk_score']}, Findings={data['total_findings']}")
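The same summaries support the "alert on risk score increases" practice: compare the two most recent runs and flag regressions. A small sketch (the key name matches `security_summary.json`; the function name is illustrative):

```python
def risk_delta(previous: dict, current: dict, threshold: int = 0) -> tuple[int, bool]:
    """Return (score change, alert?) between two security_summary.json payloads.

    Alerts when the risk score rises by more than `threshold` points.
    """
    delta = current["risk_score"] - previous["risk_score"]
    return delta, delta > threshold
```

Feed it the last two entries of the sorted `reports` list from the loop above and route a positive alert into whichever notification channel you configured earlier.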

This is 1 of 6 resources in the DataStack Pro toolkit. Get the complete [Databricks Audit Toolkit] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire DataStack Pro bundle (6 products) for $164 — save 30%.

Get the Complete Bundle →

