Discussion on: Simplify Data Prep with AWS Glue DataBrew

Nice. The catch here is that profiling is done only on the first 20k rows. What if my dataset has more rows than that? There are many open-source profiling libraries available, so what is the specific advantage here?

sunilbhola • Feb 7 '21

The best part is its easy-to-use GUI. You profile a specific/custom sample, transform the data, and then apply the result to the entire dataset. Earlier it was tough to do this kind of cleanup through Spark or Hive, because there was no handy option to work on a sample and then push the changes to the entire dataset.

Thanks @Pradeep for such a nice presentation of the still not widely known tools under the Glue umbrella (Glue/Athena/Glue DataBrew/Glue Studio).

AWS is picking the best products of the Hadoop framework and wrapping them nicely in a GUI, with lots of enhancements, to design an optimal solution.

pradeep-misra • Feb 2 '21

As I understand it, DataBrew jobs work on the entire dataset. We can create either profile jobs or recipe jobs, and both run on the whole dataset. We select sample data for building recipes so that we can visualize the changes to the data as we add steps, but once the recipe is done we need to create a job that uses it. DataBrew combines visual GUI data prep, a profiler, and lineage in one tool, and it is well integrated with other AWS services.
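
For anyone who wants to script this instead of clicking through the console, here is a minimal sketch of the two job types using the boto3 `databrew` client. It assumes a recent boto3 version in which `create_profile_job` accepts a `JobSample` setting (`FULL_DATASET` or `CUSTOM_SAMPLE`). The helper names, the `cleaned/` output prefix, and every job, dataset, bucket, and role name are hypothetical placeholders and not something taken from the post.

```python
from typing import Any, Dict, Optional


def profile_job_request(name: str, dataset_name: str, bucket: str,
                        role_arn: str,
                        sample_rows: Optional[int] = None) -> Dict[str, Any]:
    """Build keyword arguments for the DataBrew create_profile_job call.

    With sample_rows=None the job is asked to profile the whole dataset;
    otherwise it profiles a custom sample of that many rows.
    """
    if sample_rows is None:
        job_sample: Dict[str, Any] = {"Mode": "FULL_DATASET"}
    else:
        job_sample = {"Mode": "CUSTOM_SAMPLE", "Size": sample_rows}
    return {
        "Name": name,
        "DatasetName": dataset_name,
        "OutputLocation": {"Bucket": bucket},
        "RoleArn": role_arn,
        "JobSample": job_sample,
    }


def recipe_job_request(name: str, dataset_name: str, recipe_name: str,
                       recipe_version: str, bucket: str,
                       role_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for the DataBrew create_recipe_job call.

    The recipe is built interactively on a sample, but the job applies
    its steps to the full dataset and writes the output to S3.
    """
    return {
        "Name": name,
        "DatasetName": dataset_name,
        "RecipeReference": {"Name": recipe_name,
                            "RecipeVersion": recipe_version},
        # "cleaned/" is a hypothetical output prefix.
        "Outputs": [{"Location": {"Bucket": bucket, "Key": "cleaned/"}}],
        "RoleArn": role_arn,
    }


def submit_jobs(profile_request: Dict[str, Any],
                recipe_request: Dict[str, Any]) -> Dict[str, str]:
    """Create both jobs and start one run of each (needs AWS credentials)."""
    import boto3  # imported here so the request builders work without it

    client = boto3.client("databrew")
    client.create_profile_job(**profile_request)
    client.create_recipe_job(**recipe_request)
    return {
        "profile_run": client.start_job_run(Name=profile_request["Name"])["RunId"],
        "recipe_run": client.start_job_run(Name=recipe_request["Name"])["RunId"],
    }
```

Calling `submit_jobs(profile_job_request(...), recipe_job_request(...))` with a published recipe version would create and start both jobs. Keeping the request builders separate from the AWS call makes it easy to inspect or log the exact parameters before anything is submitted.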