A decision tree is a model that helps classify observations into recurring categories. Based on similarities among the attributes of the observations, it builds a set of sequential rules that indicate which category each observation belongs to.
Let's walk through one example: classifying drugs based on transactional data of patients from a pharmacy (note: the data was preprocessed beforehand, following the CRISP-DM methodology).
Strong tree with a solid foundation
Like a real tree that soars upward with its various parts, a decision tree grows downward and has several important components to understand first.
- Root node: the primary variable at the top of the tree, where the classification starts.
- Splitting: the process of dividing the data at a node into subsets based on a particular attribute or threshold value.
- Decision node: a node that results from a split and is itself split further into more nodes.
- Leaf node: a terminal node that assigns the final class to an observation.
- Branch / sub-tree: a sub-part of the tree consisting of several connected nodes.
After understanding the most important components of a decision tree, the next step is to apply an algorithm to build a suitable model. The algorithm used here is C4.5, which repeats three main steps (a sketch of the loop follows the list):
- Root node creation process
- Leaf node formation process
- Return to step one and repeat until the maximum depth (max_depth) of the tree is reached.
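To make the three steps concrete, here is a minimal sketch of that loop in Python. It assumes a pandas DataFrame of categorical attributes (such as the preprocessed pharmacy data) with a target column like "Drug"; all names are illustrative, not taken from the original notebook.

```python
# Minimal sketch of the C4.5 loop: pick a root/decision node by gain ratio,
# split, and recurse until a leaf (or max_depth) is reached.
import math
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Entropy of a set of class labels."""
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs)

def gain_ratio(data: pd.DataFrame, attr: str, target: str) -> float:
    """Information gain of `attr` divided by its split info."""
    total_entropy = entropy(data[target])
    weighted_entropy, split_info = 0.0, 0.0
    for _, subset in data.groupby(attr):
        weight = len(subset) / len(data)
        weighted_entropy += weight * entropy(subset[target])
        split_info -= weight * math.log2(weight)
    gain = total_entropy - weighted_entropy
    return gain / split_info if split_info > 0 else 0.0

def build_tree(data: pd.DataFrame, target: str, depth: int = 0, max_depth: int = 5):
    """Recursively create root/decision nodes until a leaf node is formed."""
    labels = data[target]
    attrs = [c for c in data.columns if c != target]
    # Leaf node: pure subset, no attributes left, or max depth reached.
    if labels.nunique() == 1 or not attrs or depth >= max_depth:
        return labels.mode().iloc[0]
    # Root / decision node: the attribute with the largest gain ratio.
    best = max(attrs, key=lambda a: gain_ratio(data, a, target))
    return {
        best: {
            value: build_tree(subset.drop(columns=[best]), target, depth + 1, max_depth)
            for value, subset in data.groupby(best)
        }
    }
```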
Root node, where are you?
The root node is created from the variable that best separates the data into classes. To determine which variable that is, we use a metric called entropy, calculated with the following formula.
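Since the original formula image is not reproduced here, for reference the standard entropy formula used by C4.5 is:

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $p_i$ is the proportion of class $i$ among the $c$ classes in the set $S$.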
We calculate the entropy for each value of every attribute with a nominal categorical data type, weighting each value by its proportion of the data, so that every nominal value has its own entropy.
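As an illustration (these numbers are hypothetical, not taken from the original dataset), a subset of four patients containing three drugY observations and one drugX observation would have:

$$Entropy = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811$$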
Once every nominal value has its entropy, we look for the best gain value. The gain shows how much entropy is removed by splitting on each attribute, and it is calculated with the following formula.
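For reference, the standard information gain formula is:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\,Entropy(S_v)$$

where $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$.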
One example is the gain value of the Na_to_K_binned attribute.
Next, we account for how many branches a split creates, that is, the number of decision nodes it produces. This is captured by the split info (also called intrinsic information) formula.
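The standard split info formula is:

$$SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$$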
Again using the Na_to_K_binned attribute as an example, we obtain its split info value.
Then comes the decisive stage: finding the largest gain ratio among all the attributes. We obtain this value by dividing each attribute's gain by its split info. Here, the BP attribute has the largest gain ratio.
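In formula form, the gain ratio is simply:

$$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}$$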
With that, the first level of the decision tree can be drawn, with BP as the root node.
Foliage, grow!
Repeat the same process on each branch until the data at a node can no longer be split. In the end, we obtain the complete decision tree.
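If you prefer to reproduce a comparable tree in code rather than by hand, a common option is scikit-learn. Note that its DecisionTreeClassifier implements CART with an entropy criterion rather than C4.5 exactly, so the resulting tree may differ slightly; the file and column names below are assumptions about the preprocessed data, not taken from the article.

```python
# Sketch only: train a decision tree on the (assumed) preprocessed drug dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("drug_preprocessed.csv")       # assumed file name
X = pd.get_dummies(df.drop(columns=["Drug"]))   # one-hot encode categorical attributes
y = df["Drug"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=list(X.columns)))  # textual view of the tree
print("Accuracy:", tree.score(X_test, y_test))
```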
Bonus: RapidMiner with all its conveniences
After walking through the process by hand, let's use a tool that makes building the model easy: RapidMiner, with the same dataset arranged on its working canvas.
This produces a model with an accuracy of 67.26% when its classifications are compared against the sample data.