Cameron Lepper

Posted on Mar 11, 2019

In CRISP-DM We Trust

#beginners #blog #data

I started out in my tech career developing BI (and occasionally AI) applications, before moving more into Cloud Software Development. The early involvement in Data Science tasks was really useful, but one thing sticks with me particularly:

CRISP-DM

CRISP-DM is an acronym which stands for 'Cross-Industry Standard Process for Data Mining'. This post will explain what CRISP-DM represents, and why it's one of the most valuable things I was taught during my short dabble with Data Science.

I'll preface this post by stating that it is intended for beginners, or those who (like me) had not come across this term.

Definition

The Cross-Industry Standard Process for Data Mining (CRISP-DM) defines a partially-iterative methodology, comprising six fundamental components which contribute towards a streamlined standard process for performing data-related tasks.

Image: Forbes

I genuinely think that the process is mostly common sense, but having it defined and labelled as a standard is demonstrably useful.

Business Understanding

When planning any task involving operations on data, it is important to understand the business or context to which the data belongs. This not only helps ensure alignment with the overall motivation for the task, but also helps in terms of understanding the best way to approach the problem.

Data Understanding

This should be fairly evident, but once you understand the wider business context, analysing and understanding what the data (should) represent is critical. Without knowing the data, it's nearly impossible to process it properly, and making sense of it becomes a painful battle. Trying to understand the data may prompt you to go back and gather further business understanding.
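As a rough illustration of what "knowing the data" can look like in practice, here is a minimal sketch that profiles a handful of records. The `profile` function, the `orders` sample and its column names are all made up for this example; CRISP-DM itself doesn't prescribe any of this.

```python
from collections import Counter
from typing import Any


def profile(records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Summarise each column: missing values, distinct values, and value types."""
    columns = sorted({key for record in records for key in record})
    summary = {}
    for column in columns:
        values = [record.get(column) for record in records]
        present = [value for value in values if value is not None]
        summary[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "types": dict(Counter(type(value).__name__ for value in present)),
        }
    return summary


# Hypothetical sample data, invented for illustration only.
orders = [
    {"customer": "Ada", "spend": 120},
    {"customer": "Grace", "spend": None},
    {"customer": "Linus", "spend": "75"},  # a number stored as text
]

report = profile(orders)
for column, stats in report.items():
    print(column, stats)
```

Even a summary this small surfaces questions to take back to the business, such as why `spend` is sometimes missing or stored as text.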

Data Preparation

Sigh... You've got all your data, and you understand its purpose and wider context. Now you have to go through and put it into a format that you can do something with. Data Cleansing! Removing duplicates, nulls, and all the other usual clean-up tasks, which are fairly mundane but equally crucial.
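The two clean-up tasks named above, removing nulls and removing duplicates, can be sketched in a few lines of plain Python. This is only one way to do it, under the assumption that the records are small dictionaries with hashable values; the `cleanse` function and the sample rows are hypothetical.

```python
from typing import Any

Record = dict[str, Any]


def cleanse(records: list[Record], required: tuple[str, ...]) -> list[Record]:
    """Drop records with missing required fields, then drop exact duplicates."""
    seen: set[tuple] = set()
    cleaned: list[Record] = []
    for record in records:
        # Treat None and empty strings as nulls for the required fields.
        if any(record.get(field) in (None, "") for field in required):
            continue
        # A sorted tuple of items gives a hashable fingerprint of the record.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


# Hypothetical sample data, invented for illustration only.
raw = [
    {"customer": "Ada", "spend": 120},
    {"customer": "Ada", "spend": 120},     # exact duplicate
    {"customer": "Grace", "spend": None},  # null value
    {"customer": "Linus", "spend": 75},
]

clean = cleanse(raw, required=("customer", "spend"))
print(clean)
```

In a real project you would more likely reach for a data library, but the logic stays the same: decide what counts as a null, decide what counts as a duplicate, and keep the first good copy of each record.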

Image: PlanSpace

Data Modelling

Phew, the cleansing is over. Now it's time to do some fun stuff: modelling! This may be the cool part where you do your fancy AI, or design your complex dashboard with the appropriate visuals. Really, this is where the fun happens. However, I'll add the disclaimer that you're likely to find a few data anomalies here that require you to painfully revisit the Data Prep stage.

Evaluation

Now that we have modelled our data, and we're looking at the results of whatever cool thing we've done, we can start to assess whether what we're seeing in front of us is actually what we wanted, and whether it's of any use. If it's not, it's best to run another iteration of the full process, starting from capturing more understanding of the business context.

Deployment

Once we've evaluated our results, and concluded that 'this is it chief', we can finally make it available. Publish away to whatever platform you use, and people can start to access it.

This whole process is obviously dynamic and continuous, as live data will need to be cleansed and remodelled regularly, depending on the task and data usage.
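To make the "partially-iterative" part concrete, the six phases and the backward steps described above can be sketched as a tiny state machine. The names `Phase`, `STEP_BACK` and `next_phase` are my own, and restarting from Business Understanding after Deployment is an assumption about how the continuous cycle wraps around.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = "Business Understanding"
    DATA_UNDERSTANDING = "Data Understanding"
    DATA_PREPARATION = "Data Preparation"
    MODELLING = "Data Modelling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


ORDER = list(Phase)

# Backward steps mentioned in the phase descriptions above.
STEP_BACK = {
    Phase.DATA_UNDERSTANDING: Phase.BUSINESS_UNDERSTANDING,
    Phase.MODELLING: Phase.DATA_PREPARATION,
    Phase.EVALUATION: Phase.BUSINESS_UNDERSTANDING,
}


def next_phase(current: Phase, satisfied: bool) -> Phase:
    """Return the phase to move to once work on `current` has been reviewed."""
    if not satisfied:
        # Step back where the process allows it, otherwise stay put.
        return STEP_BACK.get(current, current)
    if current is Phase.DEPLOYMENT:
        # Assumption: live data starts a fresh cycle from the first phase.
        return Phase.BUSINESS_UNDERSTANDING
    return ORDER[ORDER.index(current) + 1]


# Hypothetical run: modelling uncovers a data anomaly once, then all goes well.
outcomes = [True, True, True, False, True, True, True]
path = [Phase.BUSINESS_UNDERSTANDING]
for satisfied in outcomes:
    path.append(next_phase(path[-1], satisfied))

print(" -> ".join(phase.value for phase in path))
```

The printed path shows the detour from Data Modelling back to Data Preparation before the run finally reaches Deployment.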

Why am I telling you all this?

Unlike Software Development, which has numerous methodologies that have changed quite dramatically over the years (Waterfall, Agile, DevOps), the CRISP-DM methodology has stuck around, standing the test of time.

A lot of what I do as a developer, building and deploying applications in the cloud, relies on a significant amount of work handling data. I've adhered to the principles above since discovering CRISP-DM, and found them absolutely vital for performing the data tasks I need quickly, effectively and correctly.

I realise I'm still fairly early on in my career in tech, which might explain why I hadn't come across CRISP-DM sooner, but for those who hadn't previously discovered it, I do hope that this offers a nice structured approach for the next time you need to do some cool data stuff!

DEV Community