
Dabhi Abhishek

Posted on • Originally published at linkedin.com

Databricks in Production: 8 Architectural Decisions That Actually Matter

Recently I passed the Databricks Certified Professional Data Engineer exam and the Databricks Certified Associate Developer for Apache Spark. The exams were genuinely useful: they sharpened my vocabulary, cleared up the fog around a lot of theoretical concepts, and forced me to revisit corners of the platform I might have ignored under delivery pressure. What they did not fully prepare me for were the architectural decisions that keep surfacing once a platform is live.

While I was studying, I started keeping notes on the questions that were consuming my real engineering time: where to draw the isolation boundary in Unity Catalog, how to keep Bronze durable without turning it into a junk drawer, when fine-grained governance becomes a query-planning problem, and where Lakeflow Spark Declarative Pipelines stops being the right owner of a workflow. This article is the cleaned-up version of those notes. Anything labeled "Note" comes from my research, production practice, and personal experience rather than a Databricks rulebook.

1. Make the catalog your isolation unit, not your layer label

Databricks now describes catalogs as the primary unit of data isolation in the typical Unity Catalog governance model. Schemas organize within that boundary. Managed storage guidance also prefers catalog-level storage as the primary unit of data isolation, and workspace bindings apply directly at the catalog level. Put those three facts together and the implication is hard to ignore: if you need a durable control boundary, the catalog is the strongest first-class object available to you.

That is why I no longer like a layout where the top-level catalogs are named bronze, silver, and gold, with domains hidden inside schemas. It mirrors medallion terminology, but it makes the isolation boundary weaker than it needs to be. I would rather express ownership and isolation like this:

sales_prod
  - bronze
  - silver
  - gold

finance_prod
  - bronze
  - silver
  - gold

That structure aligns the domain boundary, the privilege boundary, and the storage boundary to the same top-level object. It also fits Databricks guidance that external locations should align with catalog or schema boundaries and that catalogs are the main isolation unit.

Note: the fastest way I know to spot a Unity Catalog design that will become painful later is this: if someone can ask for “read access to all Silver data” and the answer is a single broad grant on a single shared object, the team has probably centralized ownership more than it realizes.
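The difference between the two layouts is easiest to see as a naming exercise. A toy sketch in plain Python (the functions and names are illustrative, not a Databricks API):

```python
# Toy sketch of the two Unity Catalog layouts; names are illustrative.

def layer_first(layer, domain, table):
    """Catalogs named bronze/silver/gold; the domain is demoted to a schema."""
    return f"{layer}.{domain}.{table}"

def domain_first(domain, env, layer, table):
    """Catalog carries domain + environment; the medallion layer is a schema."""
    return f"{domain}_{env}.{layer}.{table}"

# Under domain_first, a catalog-level grant or workspace binding scopes to
# exactly one domain; under layer_first, it scopes to every domain at once.
print(domain_first("sales", "prod", "bronze", "events"))  # sales_prod.bronze.events
print(layer_first("silver", "sales", "orders"))           # silver.sales.orders
```

The point of the sketch is the strongest object each layout puts at the top of the three-level namespace: in the domain-first layout, the catalog boundary and the ownership boundary are the same thing.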

2. Keep Bronze durable and boring

The medallion architecture page is more opinionated than many teams remember. Bronze is described as raw, append-oriented, historically retained, minimally validated, and intended for workloads that enrich data into Silver - not for analyst and data scientist access. Databricks also explicitly recommends storing most Bronze fields as STRING, VARIANT, or BINARY to protect against unexpected schema changes.

That is the right mental model for Bronze: not elegant, not analyst-friendly, not semantically curated - just durable. That, I think, is the best way to understand it.

VARIANT matters here. Databricks documents support for using VARIANT to ingest semi-structured data in Databricks Runtime 15.3 and above. It is a strong fit for payloads that drift without warning. The important catch I found was in the ingestion docs: when you ingest an entire record as a single VARIANT column, Databricks does not do schema evolution for that record, and rescuedDataColumn is not supported in that pattern. That trade can be worth it, but only if you choose it deliberately.
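The drift trade-off is easy to demonstrate outside Spark. A toy illustration in plain Python, using the standard json module (the payloads and functions are hypothetical):

```python
import json

def ingest_typed(raw):
    """Early typing: assumes the payload shape, so a drifted field breaks ingestion."""
    record = json.loads(raw)
    return {"order_id": record["order_id"], "amount": float(record["amount"])}

def ingest_raw(raw):
    """Bronze-style: keep the whole record opaque; drift lands safely in storage."""
    return {"payload": raw}

old = '{"order_id": "A1", "amount": "19.99"}'
new = '{"order": {"id": "A2"}, "amount_cents": 1999}'  # upstream drifted

print(ingest_typed(old))   # works while the shape holds
print(ingest_raw(new))     # still works: nothing is assumed about the shape
# ingest_typed(new)        # KeyError: the typed model broke on drift
```

The same logic is what a whole-record VARIANT column buys you: ingestion keeps working while typing decisions are deferred downstream, at the documented cost of losing schema evolution and rescuedDataColumn for that record.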

A Bronze table I trust tends to look more like this than like a partially typed model:

CREATE TABLE sales_prod.bronze.events (
  _ingest_ts    TIMESTAMP NOT NULL,
  _batch_id     STRING,
  _source_file  STRING,
  payload       VARIANT NOT NULL
)
USING DELTA; -- optional: Delta is the default table format

The fields that matter most are the ones that let me identify when a shape changed and which load introduced it. Databricks now explicitly recommends metadata columns in Bronze, and one subtle but useful reminder appears in the runtime release notes: input_file_name() is no longer supported in Databricks Runtime 17.3 LTS and above because it is unreliable. Databricks recommends _metadata.file_name instead. That is exactly the kind of small runtime detail that changes how I design Bronze metadata. There are a few more changes worth noting in the latest Databricks Runtime, but those are probably for another article; I still need to do some research around them.

Note: the most expensive Bronze design I inherit is usually not obviously broken. It is the one that started typing payload fields too early, lost the raw record, and turned every upstream schema change into a historical repair job.

3. Treat the Silver/Gold boundary as a reprocessing boundary

The medallion documentation gives a useful direction here. Silver is where Databricks places validation, deduplication, normalization, schema enforcement, handling of late and out-of-order data, and at least one validated non-aggregated representation of each record. Gold is where Databricks places business-facing modeling, aggregation, and datasets aligned to domain questions and reporting needs.

That distinction sounds obvious until deadlines blur it. The practical way I learned is: if a transformation must stay stable while business definitions change above it, it belongs in Silver; if the transformation embodies a business definition that will change when policy, finance, operations, or stakeholder logic changes, it belongs in Gold. Silver is my reprocessing boundary. Gold is where I let business meaning evolve, which makes more sense if you think about it.

Note: the heuristic I use with myself or new engineers is not “is this complicated?” It is “would I want to rebuild historical Silver every time this rule changes?” If the answer is no, it is not Silver logic.

When teams already have business rules leaking downward, I do not start with a rewrite. I start with containment:

Current:
Bronze --> Silver (contains business definitions) --> Gold

Recovery:
Bronze --> Silver (freeze)
       --> Gold_v2 (moves evolving business logic here)
       --> Gold_v1 (sunset and retire)

That approach restores a stable engineering layer without forcing a historical rebuild on day one.
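The containment pattern can be sketched as a toy in plain Python (the functions and rule thresholds are hypothetical placeholders, not anyone's real pipeline):

```python
# Toy sketch of the containment pattern: Silver frozen, Gold definitions versioned.

def silver_normalize(record):
    """Frozen Silver logic: stable typing and shape, no business meaning."""
    return {**record, "amount": float(record["amount"])}

# Business definitions live in Gold and are versioned, so they can evolve
# without forcing a historical Silver rebuild.
GOLD_RULES = {
    "v1": lambda r: r["amount"] >= 100.0,   # legacy definition, being sunset
    "v2": lambda r: r["amount"] >= 250.0,   # current definition, free to change
}

row = silver_normalize({"order_id": "A1", "amount": "300"})
print(GOLD_RULES["v1"](row), GOLD_RULES["v2"](row))  # True True
```

Both Gold versions read the same frozen Silver output, which is exactly what lets v1 be retired on its own schedule.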

4. Use ABAC for scale, but design UDFs as if the query planner is your adversary

Databricks now documents three distinct ways to control data visibility at query time: dynamic views, row filters / column masks, and ABAC. The differences matter. Dynamic views are read-only SQL abstractions over one or more tables. Row filters and column masks apply logic directly to tables. ABAC is the tag-driven policy framework that Databricks recommends for centralized, scalable governance. It is currently in Public Preview, and ABAC-secured tables require compatible compute - including Databricks Runtime 16.4 or above on supported standard or dedicated compute.

That last point changes architecture. Once governance becomes tag-driven and inherited, it stops being a table-by-table administration problem and becomes part of the platform model. That is exactly where ABAC belongs.

The piece many teams miss is that governance UDFs are not just security artifacts; they are also optimizer-facing code. Databricks’ ABAC UDF guidance is direct: keep UDFs simple, deterministic, free of external calls, and built from built-in functions rather than nested UDF chains. The row-filter and mask guidance adds an important performance-and-security detail: use deterministic expressions that cannot throw, and prefer functions like try_divide when a normal ANSI expression might raise an error. The reason is subtle and important. If an expression can fail before masking or filtering has finished protecting the row, the error path itself can reveal information.
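The try_divide point is worth internalizing. Its contract is simple: return NULL instead of raising when the divisor is zero, so a masking or filtering expression can never fail mid-evaluation. In plain Python terms (a sketch of the semantics, not the Databricks implementation):

```python
def try_divide(numerator, denominator):
    """Mimic SQL try_divide: NULL (None) instead of an error on a zero divisor."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

print(try_divide(10, 4))   # 2.5
print(try_divide(10, 0))   # None -> the policy expression cannot throw
```

A plain `/` would raise on the second call, and in a governance UDF that error path fires before the row has been fully protected.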

One nuance is worth calling out because it tripped me up for a while. Databricks’ manual masking examples do show identity-aware logic such as is_account_group_member() inside the masking function. Databricks’ ABAC UDF best-practice page, however, pushes in the other direction: keep access checks in the policy design and keep the UDF as simple and reusable as possible. Those two pages are not contradictory once you separate manual table-level masks from ABAC policy UDFs. They are different governance surfaces.

A minimal manual mask can look like this:

CREATE FUNCTION ssn_mask(ssn STRING)
RETURN CASE
  WHEN is_account_group_member('HumanResourceDept') THEN ssn
  ELSE '***-**-****'
END;

ALTER TABLE hr_prod.gold.users
  ALTER COLUMN ssn SET MASK ssn_mask;

That syntax is Databricks documented. The more important hidden trap shows up later: Databricks documents that MERGE does not support tables whose row-filter or column-mask policies contain nesting, aggregations, windows, limits, or non-deterministic functions. That restriction is easy to miss because the breakage appears at write time, not when the policy is created. If your Gold maintenance path relies on MERGE, governance UDF complexity is now part of your write-path design.

5. Use Lakeflow Spark Declarative Pipelines for table DAGs, and Jobs for the orchestration around them

Databricks has renamed Delta Live Tables to Lakeflow Spark Declarative Pipelines. The old dlt naming still appears in places such as event logs and classic SKU names, but Databricks explicitly recommends moving Python code to the new names. That rename is not cosmetic. It helps clarify what the product is good at: declarative dataflow over datasets.

The load guidance is also clearer than older DLT-era material. Databricks states that pipelines can define datasets from any query that returns a Spark DataFrame, that they can mix SQL and Python, and that streaming tables are the recommended default for most ingestion work. That is a strong signal about where the product fits best: dataset-centric flows, especially ingestion and transformation flows that are easiest to reason about as tables and views.

Note: the cleanest division of labor I have found is simple. Let Lakeflow own the table DAG. Let Jobs own the control flow that surrounds it.

Job
  - pre-flight checks
  - Lakeflow pipeline task
  - post-process task
  - publish / notify task

That split keeps declarative logic declarative. It also stops the outer workflow from being distorted to fit a table materialization engine.

6. Treat production bundle mode as an enforcement layer, not a label

The current Databricks docs now refer to the old Databricks Asset Bundles as Declarative Automation Bundles. The rename matters less than the behavior of production mode. In practice, too many teams still treat mode: production as naming rather than policy. The docs show otherwise.

Production mode enforces behavior. Databricks documents that deploying a production target validates that related Lakeflow pipelines are marked development: false, validates the Git branch if one is specified in the target, and - when you are not using service principals - validates run_as and permissions mappings. Databricks also recommends service principals for production deployments. Just as important, production mode does not allow overriding existing compute definitions with --compute-id or the compute_id mapping. That is not a nuisance. It is the platform refusing to let an emergency shortcut silently become the production model.

A minimal production-oriented pattern looks like this:

targets:
  prod:
    mode: production
    workspace:
      host: https://my-host.cloud.databricks.com
    run_as:
      service_principal_name: '68ed9cd5-8923-4851-x0c1-c7536c67ff99'

resources:
  jobs:
    my_job:
      permissions:
        - group_name: platform-admins
          level: CAN_MANAGE
        - group_name: platform-observers
          level: CAN_VIEW

Every key in that snippet is documented: mode, workspace.host, run_as.service_principal_name, and resource-level permissions using group_name with levels such as CAN_MANAGE and CAN_VIEW.

Note: if a production bundle only deploys successfully when the deployment identity is massively over-privileged, that is not a CI/CD success. It is a permissions design failure that just has not failed publicly yet.

7. Bound streaming state on purpose

The most common long-running streaming failure I see is not a crash. It is slow decay: state grows, trigger times drift upward, and the pipeline quietly becomes more expensive and less predictable every day.

Databricks’ watermarking guidance is strong enough now that this should be an architectural review item, not an implementation detail. For stream-stream joins, Databricks recommends watermarks on each input stream so old state can be discarded. For outer joins, watermarks are mandatory. When there are multiple input streams, Structured Streaming computes a single global watermark. By default, Databricks chooses the minimum of the input watermarks so slower streams do not cause data to be dropped as late. If you switch spark.sql.streaming.multipleWatermarkPolicy to max, the global watermark advances with the fastest stream, but Databricks explicitly warns that this drops data from slower streams.

That trade-off is not academic. It is a direct choice between lower state/lower latency and higher completeness.
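The policy choice is small enough to sketch in plain Python (epoch-second timestamps; a toy model of the global-watermark rule, not Spark internals):

```python
# Plain-Python sketch of the global-watermark policy choice.

def global_watermark(stream_watermarks, policy="min"):
    """'min' (the default) follows the slowest stream; 'max' follows the fastest."""
    return min(stream_watermarks) if policy == "min" else max(stream_watermarks)

watermarks = [1_000, 4_000]  # one slow input stream, one fast input stream

wm_min = global_watermark(watermarks, "min")  # 1000: the slow stream's events survive
wm_max = global_watermark(watermarks, "max")  # 4000: the slow stream's events older
                                              # than this are treated as late

late_event_ts = 2_500  # an event from the slow stream
print(late_event_ts < wm_min)  # False -> kept under the default min policy
print(late_event_ts < wm_max)  # True  -> dropped under the max policy
```

The max policy keeps state smaller because the watermark advances faster, and that is precisely why it drops the slow stream's data.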

The same page also tightens the deduplication guidance. Databricks documents that distinct() is stateful and requires a watermark to avoid unbounded state growth. It also documents dropDuplicatesWithinWatermark() for identifier-based deduplication within an event-time threshold, including the important detail that duplicates arriving outside the specified threshold are not guaranteed to be dropped, so the watermark delay must be longer than the maximum timestamp difference across duplicate events.
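Why the delay must exceed the maximum duplicate gap is easiest to see in a toy model. This is plain Python, not Spark's implementation: it only captures the rule that dedup state for an id is guaranteed to survive while its first timestamp is within the watermark delay.

```python
# Toy model of identifier-based dedup within an event-time threshold.

def dedupe_within_watermark(events, delay):
    """events: (event_id, event_time) pairs in arrival order."""
    seen = {}      # event_id -> first event_time still held in state
    kept = []
    max_ts = 0
    for event_id, ts in events:
        max_ts = max(max_ts, ts)
        cutoff = max_ts - delay
        # Expire state that has fallen behind the watermark.
        seen = {k: v for k, v in seen.items() if v >= cutoff}
        if event_id not in seen:
            seen[event_id] = ts
            kept.append((event_id, ts))
    return kept

# delay of 10: the duplicate at t=25 arrives after state for "a" expired,
# so it is emitted again. A delay longer than the 24-unit gap dedupes it.
print(dedupe_within_watermark([("a", 1), ("a", 5), ("a", 25)], delay=10))
# [('a', 1), ('a', 25)]
print(dedupe_within_watermark([("a", 1), ("a", 5), ("a", 25)], delay=30))
# [('a', 1)]
```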

This is the pattern I trust when the business key is not itself the event timestamp:

(events
  .withWatermark("event_time", "2 hours")
  .dropDuplicatesWithinWatermark(["event_id"]))

Note: for joins where result correctness matters more than raw latency, I stay on the default min policy unless I have a deliberate reason not to. The Databricks docs explain exactly why the faster option is also the riskier one.

8. Use native functions first. Use pandas iterator UDFs when you truly need Python.

Databricks is unusually explicit on this point now. The UDF guidance says built-in functions and SQL UDFs are the most efficient options. The performance-efficiency guidance goes further: do not use Python or Scala UDFs if a native function exists, because serialization between Spark and Python significantly slows queries. If native functionality is missing and you genuinely need Python, Databricks recommends Pandas UDFs. The optimizations guidance separately notes that higher-order functions provide a performance benefit over user-defined functions for many operations.

That stack of guidance changes how I should review PySpark code. Before I look at the Python, I should ask whether the job should still be in Python at all. Arrays, structs, JSON fields, and many semi-structured transformations already have native operators and higher-order functions. Databricks has been giving us more of those building blocks, not fewer.

When Python is still the right answer, the iterator-based Pandas UDF pattern is the one I reach for first. Databricks documents Iterator[pandas.Series] -> Iterator[pandas.Series] as the right pattern when the UDF needs initialized state, such as loading a model once and applying it batch by batch. The same page documents two details worth committing to muscle memory: the total output length across the iterator must match the total input length, and cleanup should be wrapped in try/finally or a context manager.

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def score(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()
    try:
        for batch in batch_iter:
            yield pd.Series(model.predict(batch))
    finally:
        model.close()


One more runtime detail belongs in the same mental model: Databricks documents that in Databricks Runtime 15.3 and above, Python UDFs, UDAFs, and UDTFs that use VARIANT as an argument or return type throw an exception. That is the kind of release-note detail that becomes a production outage if you treat semi-structured design and UDF design as separate topics. They are not separate topics anymore.

Closing

What changed for me after these exams was not that I suddenly knew the platform. It was that I started writing down the decisions the platform kept forcing me to make. The certifications helped with terminology. The harder learning came from architecture review, failure analysis, and reading the parts of the documentation that only become interesting after something goes wrong.

That is the level I care about now. Not “what is Bronze.” Not “what is Unity Catalog.” The more important questions are narrower and more expensive: which object should carry the isolation boundary, which layer should absorb semantic drift, which governance mechanism still works when the table count reaches the hundreds, which runtime behavior quietly changes a previously safe design, and which convenience today becomes the repair bill six months from now.

Those are the decisions I wish I had studied before the exams. They are also the decisions I now want to document first, so this is just the beginning!
