Scheduling Automated Audits
This guide explains how to run the Databricks Audit Toolkit on a schedule using
common orchestration tools: cron, Azure Data Factory, Databricks Workflows, and
GitHub Actions.
Prerequisites
Before scheduling, ensure:
- Service principal or long-lived token for authentication
- Environment variables configured in the execution environment
- Python 3.8+ with requests, jinja2, and tabulate installed
- Network access from the runner to your Databricks workspace
Recommended environment variables
export DATABRICKS_HOST='https://your-workspace.cloud.databricks.com'
export DATABRICKS_TOKEN='dapi...'
export AUDIT_OUTPUT_DIR='/var/audit/output'
export AUDIT_LOG_LEVEL='INFO'
export AUDIT_COMPANY_NAME='Acme Corp'
Security tip: Use a service principal token instead of a personal access
token for automated runs. Service principals can be scoped to read-only
permissions and are not tied to individual user accounts.
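A scheduled run that starts with a missing variable wastes a whole cycle, so it pays to fail fast. A minimal sketch (the variable names match the list above; `check_env` and `fail_fast` are illustrative helpers, not part of the toolkit):

```python
import os
import sys

# Required by every scheduled run; the AUDIT_* variables have sensible defaults
REQUIRED_VARS = ["DATABRICKS_HOST", "DATABRICKS_TOKEN"]

def check_env(required=REQUIRED_VARS):
    """Return the names of required variables that are missing or empty."""
    return [name for name in required if not os.environ.get(name)]

def fail_fast():
    """Exit with a clear message before any API call is attempted."""
    missing = check_env()
    if missing:
        sys.exit(f"Missing required environment variables: {', '.join(missing)}")

# Call fail_fast() at the top of any scheduled entry point
```

Running this first turns a cryptic mid-audit authentication error into a one-line failure message in the scheduler's log.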
Option 1: Cron (Linux / macOS)
The simplest approach for a dedicated VM or on-premises server.
Setup
# Install dependencies
pip install -r /opt/databricks-audit-toolkit/requirements.txt
# Create a wrapper script
cat > /opt/databricks-audit-toolkit/run_scheduled.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export DATABRICKS_HOST='https://your-workspace.cloud.databricks.com'
export DATABRICKS_TOKEN='dapi...'
export AUDIT_OUTPUT_DIR="/var/audit/output/$(date +%Y-%m-%d)"
export AUDIT_COMPANY_NAME='Acme Corp'
mkdir -p "${AUDIT_OUTPUT_DIR}"
cd /opt/databricks-audit-toolkit
./run_audit.sh --all --format both 2>&1 | tee "${AUDIT_OUTPUT_DIR}/run.log"
# Optional: upload results to cloud storage
# az storage blob upload-batch \
# --source "${AUDIT_OUTPUT_DIR}" \
# --destination audit-reports \
# --account-name yourstorageaccount
EOF
chmod +x /opt/databricks-audit-toolkit/run_scheduled.sh
Cron entry
# Edit crontab
crontab -e
# Run every Monday at 6:00 AM (cron uses the server's local clock; keep the host on UTC)
0 6 * * 1 /opt/databricks-audit-toolkit/run_scheduled.sh
# Run on the 1st and 15th of each month at 3:00 AM UTC
0 3 1,15 * * /opt/databricks-audit-toolkit/run_scheduled.sh
# Run daily at midnight
0 0 * * * /opt/databricks-audit-toolkit/run_scheduled.sh
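If you maintain several entries like these, a tiny parser keeps schedules self-documenting. A sketch (not part of the toolkit) that splits a standard 5-field crontab line into named fields plus the command:

```python
CRON_FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def parse_cron_entry(entry):
    """Split a crontab line into (schedule dict, command string)."""
    parts = entry.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected 5 schedule fields plus a command")
    return dict(zip(CRON_FIELDS, parts[:5])), parts[5]

schedule, command = parse_cron_entry(
    "0 6 * * 1 /opt/databricks-audit-toolkit/run_scheduled.sh"
)
# schedule["day_of_week"] is "1", i.e. Monday
```

This is handy for linting a crontab in CI before deploying it to the runner.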
Option 2: Azure Data Factory
Use ADF to orchestrate the audit as part of your data platform pipelines.
Architecture
ADF Pipeline
├── Web Activity: Get token from Key Vault
├── Custom Activity (or Azure Batch):
│ ├── Install dependencies
│ ├── Run audit toolkit
│ └── Upload results to ADLS Gen2
└── Web Activity: Send notification (Teams/Slack/Email)
Steps
- Store credentials in Azure Key Vault
- Secret: databricks-audit-host → https://your-workspace.cloud.databricks.com
- Secret: databricks-audit-token → dapi...
- Create a Custom Activity (using Azure Batch pool)
{
  "name": "RunDatabricksAudit",
  "type": "Custom",
  "linkedServiceName": {
    "referenceName": "AzureBatchLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "command": "bash run_scheduled.sh",
    "resourceLinkedService": {
      "referenceName": "AzureStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "folderPath": "audit-toolkit",
    "referenceObjects": {
      "linkedServices": [],
      "datasets": []
    }
  }
}
- Add a Schedule Trigger
  - Type: Schedule
  - Recurrence: Weekly on Mondays at 06:00 UTC
- Add a notification step (Logic App or direct webhook)
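For the notification step, the webhook body can be assembled ahead of the Web Activity. A hedged sketch of a minimal Teams MessageCard payload (field values are illustrative; a Slack webhook needs a different JSON shape):

```python
import json

def build_teams_payload(risk_score, total_findings, report_url):
    """Build a minimal MessageCard body for a Teams incoming webhook."""
    card = {
        "@type": "MessageCard",
        "@context": "https://schema.org/extensions",
        "summary": "Databricks audit complete",
        "title": "Databricks Audit Complete",
        "text": (
            f"Risk score: {risk_score}/100, findings: {total_findings}. "
            f"[View report]({report_url})"
        ),
    }
    return json.dumps(card)
```

POST the returned string with `Content-Type: application/json`, either from a Logic App or a plain `curl` step.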
Option 3: Databricks Workflows
Run the audit from within Databricks itself using a scheduled job.
Setup
- Upload the toolkit to Databricks Repos or as workspace files:
/Repos/audit-team/databricks-audit-toolkit/
- Create a notebook run_audit_notebook.py:
# Databricks notebook source
import subprocess
import os
import sys
# The toolkit directory
toolkit_dir = "/Workspace/Repos/audit-team/databricks-audit-toolkit"
# Install dependencies (first run only). %pip magics must sit alone in a
# notebook cell, so install programmatically instead.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "-r", f"{toolkit_dir}/requirements.txt"],
    check=True,
)
# Set environment variables
# Use Databricks secrets for credentials
host = dbutils.secrets.get(scope="audit-config", key="databricks-host")
token = dbutils.secrets.get(scope="audit-config", key="databricks-token")
os.environ["DATABRICKS_HOST"] = host
os.environ["DATABRICKS_TOKEN"] = token
os.environ["AUDIT_OUTPUT_DIR"] = "/tmp/audit-output"
os.environ["AUDIT_COMPANY_NAME"] = "Acme Corp"
# Add toolkit to path and run
sys.path.insert(0, toolkit_dir)
from utils import DatabricksApiClient, save_json, ensure_output_dir
from audits import ALL_AUDITS
client = DatabricksApiClient()
results = []
for audit_cls in ALL_AUDITS:
    audit = audit_cls(client=client)
    result = audit.execute()
    results.append(result)

# Persist results so the copy step below has something to upload
# (assumes execute() returns JSON-serializable data; save_json and
# ensure_output_dir come from the toolkit's utils module)
ensure_output_dir(os.environ["AUDIT_OUTPUT_DIR"])
save_json(results, os.path.join(os.environ["AUDIT_OUTPUT_DIR"], "audit_report.json"))

# Copy results to DBFS or cloud storage.
# dbutils.widgets.get() does not accept a default value, so register
# the widget with one first.
dbutils.widgets.text("run_date", "latest")
dbutils.fs.cp(
    "file:/tmp/audit-output/",
    f"dbfs:/audit-reports/{dbutils.widgets.get('run_date')}/",
    recurse=True,
)
- Create a scheduled job:
  - Task: Notebook → run_audit_notebook
  - Cluster: Single-node, smallest instance (no Spark needed)
  - Schedule: Weekly, Mondays 06:00 UTC
  - Alerts: Email on failure
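The job can also be created through the Jobs 2.1 API instead of the UI. A sketch of the create-job request body as a Python dict (the node type, Spark version, and notebook path are assumptions; adjust them for your cloud and repo layout):

```python
def build_audit_job_spec():
    """Assemble a Jobs 2.1 create-job payload for the weekly audit run."""
    return {
        "name": "weekly-workspace-audit",
        "schedule": {
            # Quartz syntax, not classic cron: Mondays at 06:00
            "quartz_cron_expression": "0 0 6 ? * MON",
            "timezone_id": "UTC",
        },
        "email_notifications": {"on_failure": ["audit-team@acme.com"]},
        "tasks": [
            {
                "task_key": "run_audit",
                "notebook_task": {
                    "notebook_path": "/Repos/audit-team/databricks-audit-toolkit/run_audit_notebook"
                },
                "new_cluster": {
                    # Single-node: zero workers, Spark running in local mode
                    "num_workers": 0,
                    "spark_version": "14.3.x-scala2.12",
                    "node_type_id": "m5.large",
                    "spark_conf": {
                        "spark.databricks.cluster.profile": "singleNode",
                        "spark.master": "local[*]",
                    },
                    "custom_tags": {"ResourceClass": "SingleNode"},
                },
            }
        ],
    }
```

POSTing this dict to `/api/2.1/jobs/create` keeps the schedule in version control alongside the toolkit.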
Option 4: GitHub Actions
Run the audit from a GitHub Actions workflow — useful if your team already uses
GitHub for CI/CD.
Workflow file: .github/workflows/databricks-audit.yml
name: Databricks Workspace Audit
on:
  schedule:
    # Every Monday at 06:00 UTC
    - cron: '0 6 * * 1'
  workflow_dispatch: # Allow manual trigger

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
  AUDIT_COMPANY_NAME: 'Acme Corp'

jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout toolkit
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run audit
        run: |
          export AUDIT_OUTPUT_DIR="${GITHUB_WORKSPACE}/output"
          chmod +x run_audit.sh
          ./run_audit.sh --all --format both

      - name: Upload report artifacts
        uses: actions/upload-artifact@v4
        with:
          name: audit-report-${{ github.run_number }}
          path: output/
          retention-days: 90

      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1.24.0
        with:
          payload: |
            {
              "text": "Databricks audit failed! Run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
Required GitHub secrets
| Secret | Description |
|---|---|
| DATABRICKS_HOST | Workspace URL |
| DATABRICKS_TOKEN | Service principal token |
| SLACK_WEBHOOK_URL | (Optional) Slack webhook for failure notifications |
Notification Integration
Slack webhook (post-audit)
Add this to your wrapper script:
# REPORT_URL must be set beforehand to wherever you publish the report
REPORT_FILE="${AUDIT_OUTPUT_DIR}/security_summary.json"
if [[ -f "$REPORT_FILE" ]]; then
  RISK_SCORE=$(python3 -c "import json; print(json.load(open('${REPORT_FILE}'))['risk_score'])")
  TOTAL=$(python3 -c "import json; print(json.load(open('${REPORT_FILE}'))['total_findings'])")
  curl -s -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{
      \"text\": \"Databricks Audit Complete\nRisk Score: ${RISK_SCORE}/100\nFindings: ${TOTAL}\nReport: <${REPORT_URL}|View Report>\"
    }"
fi
Email via sendmail
# RISK_SCORE and TOTAL are set by the Slack snippet above
SUBJECT="Databricks Audit Report - $(date +%Y-%m-%d)"
BODY="Risk Score: ${RISK_SCORE}/100\nTotal Findings: ${TOTAL}\n\nSee attached report."
echo -e "Subject: ${SUBJECT}\n\n${BODY}" | sendmail audit-team@acme.com
Best Practices
- Use service principals — never schedule with personal tokens
- Rotate tokens regularly — set a 90-day maximum lifetime
- Store credentials in a vault — Azure Key Vault, GitHub Secrets, or Databricks secret scopes
- Archive reports — store historical reports in ADLS Gen2 or S3 for trend analysis
- Set up alerting — notify on CRITICAL findings or risk score increases
- Run at low-traffic times — API calls are rate-limited; run during off-peak
- Version your toolkit — pin to a release tag so scheduled runs are predictable
- Monitor the runner — alert if the audit job itself fails to execute
Report Archival Strategy
/audit-reports/
├── 2026-01-06/
│ ├── audit_report.html
│ ├── audit_report.json
│ └── security_summary.json
├── 2026-01-13/
│ ├── ...
├── 2026-01-20/
│ ├── ...
└── latest/ → symlink to most recent
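Keeping this layout consistent is easy to script. A sketch (assuming a POSIX filesystem, since `latest` is a symlink) that creates today's directory and repoints the link:

```python
import os
from datetime import date

def archive_dir(base="/audit-reports"):
    """Create today's report directory and repoint the 'latest' symlink."""
    target = os.path.join(base, date.today().isoformat())
    os.makedirs(target, exist_ok=True)
    link = os.path.join(base, "latest")
    if os.path.islink(link):
        os.remove(link)  # replace the old pointer
    os.symlink(target, link)
    return target
```

Call it at the start of the wrapper script's upload step so every run lands in a dated folder and `latest/` always resolves.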
Use the JSON reports for programmatic trend analysis:
import glob
import json

reports = sorted(glob.glob("/audit-reports/*/security_summary.json"))
for path in reports[-12:]:  # Last 12 runs
    with open(path) as f:
        data = json.load(f)
    date = path.split("/")[-2]
    print(f"{date}: Risk={data['risk_score']}, Findings={data['total_findings']}")
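The same JSON files can also drive the "alert on risk score increases" best practice. A sketch, assuming each summary contains a `risk_score` field and the paths sort oldest to newest:

```python
import json

def detect_risk_regression(paths, threshold=5):
    """Return (previous, current) risk scores when the newest report rose
    more than `threshold` points over the one before it, else None."""
    if len(paths) < 2:
        return None
    scores = []
    for path in paths[-2:]:
        with open(path) as f:
            scores.append(json.load(f)["risk_score"])
    previous, current = scores
    if current - previous > threshold:
        return previous, current
    return None
```

Wire the non-None result into the Slack or email snippet above to page the team only when the trend actually worsens.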
This is 1 of 6 resources in the DataStack Pro toolkit. Get the complete [Databricks Audit Toolkit] with all files, templates, and documentation for $49.
Or grab the entire DataStack Pro bundle (6 products) for $164 — save 30%.