DEV Community

Cover image for ADF-Mapping Data Flows Debug Mode

Posted on • Updated on

ADF-Mapping Data Flows Debug Mode

Azure Data Factory mapping data flow's debug mode allows you to interactively watch the data shape transform while you build and debug your data flows.

1.The debug session can be used both in Data Flow design sessions as well as during pipeline debug execution of data flows.

2.When Debug mode is on, you'll interactively build your data flow with an active Spark cluster. The session will close once you turn debug off in Azure Data Factory. You should be aware of the hourly charges incurred by Azure DataBricks during the time that you have the debug session turned on.

3.If you have parameters in your Data Flow or any of its referenced datasets, you can specify what values to use during debugging by selecting the Parameters tab in Debug Settings.

4.File sources only limit the rows that you see, not the rows being read. For very large datasets, it is recommended that you take a small portion of that file and use it for your testing. You can select a temporary file in Debug Settings for each source that is a file dataset type.

5.When running in Debug Mode in Data Flow, your data will not be written to the Sink transform.A Debug session is intended to serve as a test harness for your transformations.

6.When unit testing Joins, Exists, or Lookup transformations, make sure that you use a small set of known data for your test. You can use the Debug Settings option above to set a temporary file to use for your testing. This is needed because when limiting or sampling rows from a large dataset, you cannot predict which rows and which keys will be read into the flow for testing. The result is non-deterministic, meaning that your join conditions may fail.

7.When executing a debug pipeline run with a data flow, you have two options on which compute to use. You can either use an existing debug cluster or spin up a new just-in-time cluster for your data flows.
Using an existing debug session will greatly reduce the data flow start up time as the cluster is already running, but is not recommended for complex or parallel workloads as it may fail when multiple jobs are run at once.

Discussion (0)