<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: daudfernando</title>
    <description>The latest articles on DEV Community by daudfernando (@daudfernando).</description>
    <link>https://dev.to/daudfernando</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F857930%2Fd97a60af-76f8-44ee-bf9b-a77041aaa6d3.jpeg</url>
      <title>DEV Community: daudfernando</title>
      <link>https://dev.to/daudfernando</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/daudfernando"/>
    <language>en</language>
    <item>
      <title>I wonder whether you are a proper stakeholder.</title>
      <dc:creator>daudfernando</dc:creator>
      <pubDate>Tue, 28 Jun 2022 23:11:50 +0000</pubDate>
      <link>https://dev.to/daudfernando/i-wonder-whether-you-are-a-proper-stakeholder-2fhj</link>
      <guid>https://dev.to/daudfernando/i-wonder-whether-you-are-a-proper-stakeholder-2fhj</guid>
      <description>&lt;p&gt;Working on a project requires a lot of resources. The difficulty of gathering resources is secondary to developing a product concept. This problem makes the basis that a project requires three or more stakeholders who are relevant to the objectives of a product, but they are ready in terms of resources such as money and infrastructure.&lt;/p&gt;

&lt;p&gt;According to the Project Management Institute, project failure is often caused by the lack of a project sponsor. The project sponsor is the main stakeholder who provides resources and advocacy from the very beginning of the project until it is completed. However, a project rarely has only a sponsor; the other stakeholders can be identified by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ask your project sponsor and business analyst for a preliminary list of stakeholders.&lt;/li&gt;
&lt;li&gt;Interview specific people on your team to gain knowledge about stakeholder groups.&lt;/li&gt;
&lt;li&gt;If you want to engage a larger audience, hold workshops.&lt;/li&gt;
&lt;li&gt;Also, send a survey to stakeholder teams that work remotely.&lt;/li&gt;
&lt;li&gt;The most important thing, of course, is to review the project documentation. From here, you will know a lot of information, such as the business processes and organizational structure of the project being worked on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't let the list of stakeholder names spread across multiple documents; consolidate it into a single primary copy. This document, called the Stakeholder Register, takes the form of a table like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WkgMRDEW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a1pe4s0gecmp063fjssn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WkgMRDEW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a1pe4s0gecmp063fjssn.png" alt="Stakeholder Register" width="880" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The list of stakeholders is then mapped onto a grid according to each stakeholder's power and interest in the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yRTbZdla--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mxcv2psn1dhiyd8rtsxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yRTbZdla--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mxcv2psn1dhiyd8rtsxr.png" alt="Matrix" width="381" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After mapping all existing stakeholders, it's time to make an engagement plan between existing stakeholders using the Stakeholder Engagement Assessment Matrix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DCBABSCK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2mp83e0czze21zfoh4uu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DCBABSCK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2mp83e0czze21zfoh4uu.png" alt="Assessment Engagement" width="880" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This matrix maps each stakeholder's current engagement level alongside the level we want them to reach. To move a stakeholder up from below the Neutral level, there are two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the reason for the resistance. It could be a concern about losing control over the process or a fear of learning new systems.&lt;/li&gt;
&lt;li&gt;Find ways to minimize concerns and maximize opportunities.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Well, that's how to manage stakeholders in a project. If you have any questions, let me know in the comments section below!&lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
    <item>
      <title>Wait, I changed my mind.</title>
      <dc:creator>daudfernando</dc:creator>
      <pubDate>Sat, 25 Jun 2022 02:03:12 +0000</pubDate>
      <link>https://dev.to/daudfernando/wait-i-changed-my-mind-2h0d</link>
      <guid>https://dev.to/daudfernando/wait-i-changed-my-mind-2h0d</guid>
      <description>&lt;p&gt;We don't always have the right decision all the time. Sometimes we change our minds because of a sudden change in conditions. This is undoubtedly an excellent opportunity to make decisions that are genuinely by current needs. This can also be done by database technology and has even been applied several times to real cases worldwide. One example is that we enter information that can only be accessed by the admin. However, it turns out that if we do not have that access, the database will immediately cancel the process. This process can be achieved by using SQL Transaction.&lt;/p&gt;

&lt;h2&gt;SQL Transactions&lt;/h2&gt;

&lt;p&gt;A SQL transaction is a procedural feature of SQL that groups a series of statements into a single unit of work, which either takes effect as a whole or not at all; this is exactly what our earlier scenario needs. At least two outcomes are possible when running a SQL transaction. First, the changes are committed when the executed queries satisfy the conditions. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ddzupq5rhgflwy7c25r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ddzupq5rhgflwy7c25r.png" alt="Transaction Success"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Second, it will be aborted when the executed query does not meet the requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4sk742h7jywub5hyawp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4sk742h7jywub5hyawp.png" alt="Transaction Failed"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;Anatomy of SQL Transaction&lt;/h2&gt;

&lt;p&gt;Like PL/SQL, a SQL transaction block uses identifiers such as DECLARE, BEGIN, and END. The parts unique to transactions are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;START TRANSACTION =&amp;gt; begins the transaction&lt;/li&gt;
&lt;li&gt;COMMIT =&amp;gt; makes all changes in the transaction permanent&lt;/li&gt;
&lt;li&gt;ROLLBACK =&amp;gt; cancels all changes made since the transaction started&lt;/li&gt;
&lt;/ol&gt;
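To see these three parts in action outside the MySQL client, here is a minimal Python sketch using the standard library's sqlite3 module (the table name and amounts are made up for illustration); BEGIN, COMMIT, and ROLLBACK play the same roles as above.

```python
import sqlite3

# In-memory database; isolation_level=None gives us manual transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 400000)")

# 1. START TRANSACTION => begin the unit of work
conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 100000 WHERE id = 1")
# 2. COMMIT => make the change permanent
conn.execute("COMMIT")

# A second attempt that we abort instead:
conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 999999 WHERE id = 1")
# 3. ROLLBACK => cancel the change
conn.execute("ROLLBACK")

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(balance)  # 300000: the committed withdrawal stuck, the rolled-back one did not
```

Only the committed update survives; the rolled-back one leaves no trace, which is the whole point of wrapping the statements in a transaction.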

&lt;h2&gt;Case study&lt;/h2&gt;

&lt;p&gt;To implement it, let's go to the banking sector. Suppose we have a cash withdrawal machine or ATM. This machine has special provisions if a customer wants to withdraw money, including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The balance must be sufficient&lt;/li&gt;
&lt;li&gt;The minimum total balance in the account is 50,000 IDR&lt;/li&gt;
&lt;li&gt;Withdrawals are dispensed only in 50,000 IDR notes&lt;/li&gt;
&lt;li&gt;The withdrawal amount must be in the range of 50,000 IDR – 1,000,000 IDR&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Suppose a user has a balance of 400,000 IDR and plans to make 4 withdrawals of 100,000 IDR each. What happens to the funds in his account?&lt;/p&gt;

&lt;h2&gt;Analysis is the key!&lt;/h2&gt;

&lt;p&gt;Disclaimer! We assume that the database we create serves only ATMs that handle cash withdrawals. To simplify troubleshooting, let's create a table structure for this case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt9o19jbgppvox7495eo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt9o19jbgppvox7495eo.png" alt="Table Structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we know which crucial data needs to be stored, we seed the table with one row of dummy personal data whose remaining balance is exactly 400,000 IDR. With a registered user in place, it's time to write a SQL transaction that satisfies the four conditions above. Let's discuss, one by one, each condition our PL/SQL-style procedure must handle.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The balance must be sufficient: if the withdrawal amount is greater than the remaining balance, the transaction is canceled.&lt;/li&gt;
&lt;li&gt;The remaining balance must stay at least 50,000 IDR: a withdrawal is rejected if it would leave less than that in the account.&lt;/li&gt;
&lt;li&gt;This ATM only dispenses 50,000 IDR notes, so each withdrawal must be a multiple of 50,000 IDR.&lt;/li&gt;
&lt;li&gt;Users may not withdraw less than 50,000 IDR or more than 1,000,000 IDR in one transaction.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that everything is clear, let's execute this plan on the ATM database.&lt;/p&gt;
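The four conditions above can be prototyped as plain logic before writing the stored procedure. Here is a minimal Python sketch; the function name and rejection messages are my own, not part of the article.

```python
def validate_withdrawal(balance: int, amount: int) -> str:
    """Check the four ATM rules; return 'OK' or the reason for rejection."""
    if amount < 50_000 or amount > 1_000_000:   # rule 4: allowed range
        return "Withdrawal range is 50,000 - 1,000,000 IDR"
    if amount % 50_000 != 0:                    # rule 3: 50,000 IDR notes only
        return "Amount must be a multiple of 50,000 IDR"
    if balance - amount < 0:                    # rule 1: sufficient balance
        return "Insufficient balance"
    if balance - amount < 50_000:               # rule 2: minimum remaining balance
        return "Minimum remaining balance is 50,000 IDR"
    return "OK"

print(validate_withdrawal(400_000, 100_000))  # OK
print(validate_withdrawal(400_000, 75_000))   # rejected: not a multiple of 50,000
print(validate_withdrawal(120_000, 100_000))  # rejected: would leave under 50,000
```

Having the rules as one small function makes it easy to see that the checks are independent and can be ordered freely; the stored procedure below nests them as IF/ELSE blocks instead.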

&lt;h2&gt;ATM, here we come!&lt;/h2&gt;

&lt;p&gt;Start by disabling auto-commit, which by default is enabled by the database we use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz473e669tga2imfqhh97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz473e669tga2imfqhh97.png" alt="Commit set"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we need to create a database and tables that fit the structure. We will use the MySQL client connected to a MariaDB server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonzqrzbprltzy06a5m4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonzqrzbprltzy06a5m4u.png" alt="Customer Table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then let's register one user in the customers table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcz9j3puog8xlimlhmenx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcz9j3puog8xlimlhmenx.png" alt="Customer value"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The long-awaited moment has arrived. Let's write a transaction that enforces all four conditions. We will build it in several stages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First of all, let's change the query delimiter to // (this allows the multiple statements inside the SQL transaction to be entered as one block). Don't forget that once the procedure has been created, the delimiter must be changed back to ;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf0w3lmjexp5kcvqnitp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf0w3lmjexp5kcvqnitp.png" alt="Delimiter"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next, we define the name of the procedure that will contain our SQL transaction, and initialize its parameters, both input and output, for when the procedure is called. Let's call this procedure withdrawal, with input parameters for the account number, pin, and total withdrawal, plus an output notification reporting the transaction's result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr90rspjf6jqjn36qwsy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr90rspjf6jqjn36qwsy.png" alt="Procedure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After that, let's define what happens when an anomaly occurs, so that no manual intervention is needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtyfpb783lwu44rryh4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtyfpb783lwu44rryh4a.png" alt="Parameter"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Error handling takes care of troubleshooting when the database is not running correctly. In addition, several local variables are created for use in the stored procedure's calculations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We first ensure that the customer entering the system is registered and has supplied the correct account number and pin. We also define a local variable that supports the ATM's business process: the remaining balance available minus the withdrawal being made.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2kmh9ldsef5mvzh5fwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2kmh9ldsef5mvzh5fwh.png" alt="Business Process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Condition number four can be done in this way. This ensures that withdrawals are made within the defined range.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvpbpef5f3v44c0zzd81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvpbpef5f3v44c0zzd81.png" alt="First Condition"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The multiple-of-50,000 IDR rule can be checked by taking the remainder of dividing the amount by 50,000.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ede99a345muwz1e0cah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ede99a345muwz1e0cah.png" alt="Second condition"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The following condition is that there must be at least 50,000 IDR in the balance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva6wusdhyh56netu6il9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva6wusdhyh56netu6il9.png" alt="Third Condition"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When all conditions are met, the remaining balance value of the user will be deducted and updated automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9seoh0sxiwg0vgb80fx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9seoh0sxiwg0vgb80fx4.png" alt="Business Process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the whole SQL transaction procedure looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;MariaDB&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;atm&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;PROCEDURE&lt;/span&gt; &lt;span class="n"&gt;withdrawal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;in_account&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;in_pin&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;in_amount&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;OUT&lt;/span&gt; &lt;span class="n"&gt;notif&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;exist&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;balance_cust&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;remain_balance&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;EXIT&lt;/span&gt; &lt;span class="k"&gt;HANDLER&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;SQLEXCEPTION&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;          &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;notif&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'System Error'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;          &lt;span class="k"&gt;ROLLBACK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;START&lt;/span&gt; &lt;span class="n"&gt;TRANSACTION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;          &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;exist&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;account_number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in_account&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in_pin&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;          &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="n"&gt;exist&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;notif&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Incorrect account number or pin entered'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                  &lt;span class="k"&gt;ROLLBACK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;          &lt;span class="k"&gt;ELSE&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;remaining_balance&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;balance_cust&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;account_number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in_account&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in_pin&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;remain_balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;balance_cust&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;in_amount&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;          &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;          &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="n"&gt;in_amount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;in_amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;notif&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Withdrawal range is between 50.000 IDR AND 1.000.000 IDR'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                  &lt;span class="k"&gt;ROLLBACK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;          &lt;span class="k"&gt;ELSE&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                  &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="n"&gt;in_amount&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                          &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;notif&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'The multiple of withdrawal is 50.000 IDR'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                          &lt;span class="k"&gt;ROLLBACK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                  &lt;span class="k"&gt;ELSE&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                          &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="n"&gt;remain_balance&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                                  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;notif&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Minimum remaining balance is 50.000 IDR'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                                  &lt;span class="k"&gt;ROLLBACK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                          &lt;span class="k"&gt;ELSE&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                                  &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;remaining_balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;remain_balance&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;account_number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in_account&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in_pin&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                                  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;notif&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Withdrawal of balance successfull'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                                  &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                          &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;                  &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;          &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's time to validate our query by calling this procedure four times according to the case study. Each withdrawal is 100,000 IDR.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r1hbn99ttok1zudyebs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r1hbn99ttok1zudyebs.png" alt="Validation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yup, the database for the ATM machine that we made has worked! Thank you for reading!&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>sql</category>
    </item>
    <item>
      <title>Dealing with the date data type.</title>
      <dc:creator>daudfernando</dc:creator>
      <pubDate>Thu, 23 Jun 2022 08:52:57 +0000</pubDate>
      <link>https://dev.to/daudfernando/dealing-with-the-date-data-type-59f8</link>
      <guid>https://dev.to/daudfernando/dealing-with-the-date-data-type-59f8</guid>
      <description>&lt;p&gt;Time is a valuable measure to obtain in an online transaction. With time we can see a trend in each of the available deals. However, it is undeniable that when a database stores the time of a successful transaction, there is segmentation between each time unit.&lt;/p&gt;

&lt;p&gt;For example, in the image above, several units of time and date are separated into their own columns. Whether this is a problem depends on the case study at hand. In our case, we need the date and time combined in ISO 8601 format.&lt;/p&gt;

&lt;p&gt;This can be achieved with a SQL query using the built-in CONCAT function, but Tableau can also do it with its own built-in functions by creating a new calculated column.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the problem?
&lt;/h2&gt;

&lt;p&gt;In this case, we will compare transactions returned by men and women per quarter from 2018 to 2022, then conclude which gender accounts for the majority of transaction returns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create transaction return date and time
&lt;/h2&gt;

&lt;p&gt;Recall that our data set stores the time units in several separate columns. We must combine them into single date and time columns. For the date, we will use the MAKEDATE function, which takes three arguments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y-tEU33y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4tdaqvi792yxxo8nzwof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y-tEU33y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4tdaqvi792yxxo8nzwof.png" alt="MAKEDATE Function" width="488" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ensure the arguments are passed in the correct order and that each is an integer. The return date column will then look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7bp6y-lD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6sc5p6ens7zfrri4w5f5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7bp6y-lD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6sc5p6ens7zfrri4w5f5.png" alt="Return date" width="602" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As for the time, in Tableau we will use the MAKETIME function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zor56cIA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fx003408pgwt8psxk18n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zor56cIA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fx003408pgwt8psxk18n.png" alt="MAKETIME Function" width="465" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus the return time column will be created as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qgIbLoEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7hrph5cedszkra162exx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qgIbLoEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7hrph5cedszkra162exx.png" alt="Return time" width="443" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you've created the date and time columns, it's time to combine them into a single timestamp. For this we will use the MAKEDATETIME function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_37Y8t0m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1s5wzm2xg8ood8g4hczy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_37Y8t0m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1s5wzm2xg8ood8g4hczy.png" alt="MAKEDATETIME Function" width="481" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The resulting column looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3rJ2Fl4x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7ptm94ffl9qt9cfohgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3rJ2Fl4x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7ptm94ffl9qt9cfohgl.png" alt="Timestamp field" width="534" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Transactions and returns, which is more?
&lt;/h2&gt;

&lt;p&gt;Let's create a line chart showing how these two metrics change from quarter to quarter. The steps are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In a new sheet, add a row with the COUNT DISTINCT of returned transactions and the TOTAL quantity of transactions received.&lt;/li&gt;
&lt;li&gt;Compare the two values over time using continuous quarters.&lt;/li&gt;
&lt;li&gt;Don't forget to set the two measures in the row to a dual axis, so they share the same Cartesian chart.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G9dFlwZ3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qsr0qwapkktyxgfglm0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G9dFlwZ3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qsr0qwapkktyxgfglm0r.png" alt="Return Ratio" width="880" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Do not be fooled
&lt;/h2&gt;

&lt;p&gt;At first glance, the line chart suggests that returned transactions outnumber all transactions. This can mislead our audience, so the chart must be adjusted to show the percentage relating the two lines. The method:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new column named % of return, the result of dividing the total returned transactions by the total transactions.&lt;/li&gt;
&lt;li&gt;Change the data type format to a percentage with one digit after the decimal separator.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So that the trend of the returned transactions can be seen on this line chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mhm6nl8O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qlx8e7arcocvtgqhtgve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mhm6nl8O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qlx8e7arcocvtgqhtgve.png" alt="Return Ratio" width="880" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  So, is it a woman or a man?
&lt;/h2&gt;

&lt;p&gt;The last step is to compare how often men and women return their transactions. We can achieve this by placing the Gender column (Male and Female) on the Color section of the Marks card, producing one line per gender.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1s2TUBiD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/adk1fr8yv3528iqkekwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1s2TUBiD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/adk1fr8yv3528iqkekwk.png" alt="Marks Card gender" width="411" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look, it turns out that women make up the majority of customers who return transactions. This finding certainly warrants further analysis!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o47a_jWd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d99rcx2oons40cuuft6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o47a_jWd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d99rcx2oons40cuuft6i.png" alt="Gender Return Ratio" width="602" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Analyze the business first</title>
      <dc:creator>daudfernando</dc:creator>
      <pubDate>Wed, 22 Jun 2022 08:04:07 +0000</pubDate>
      <link>https://dev.to/daudfernando/analyze-the-business-first-170f</link>
      <guid>https://dev.to/daudfernando/analyze-the-business-first-170f</guid>
<description>&lt;p&gt;We all know that business analysis is crucial in evaluating performance to achieve work efficiency. There are two ways to reach efficiency: the first reduces the cost (resources) of production, and the second increases income.&lt;/p&gt;

&lt;p&gt;Because the goal is the same, the analysis process needs a mature framework, so that each analysis can be interpreted quickly and reused across different case studies; in other words, modularity. Here I introduce a framework called business analytics, which consists of three main stages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IYQ9H_HJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jil3vzpclp4s7xxr8iwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IYQ9H_HJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jil3vzpclp4s7xxr8iwf.png" alt="Process" width="880" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Descriptive Analytics
&lt;/h2&gt;

&lt;p&gt;Descriptive analytics is the first stage, in which an analyst reviews the entire data set from a bird's-eye view. Once we know the problem we want to solve, this viewpoint provides an overview of the existing data distribution.&lt;/p&gt;

&lt;p&gt;This stage requires an analyst to understand the problem comprehensively and create a rough technical framework for solving it. It will also surface findings that cannot be read directly from the data; these insights then become the basis for the problem-solving steps that follow.&lt;/p&gt;

&lt;p&gt;For example, this stage may present a dashboard that summarises, at a glance, what has happened in the distribution of the available data. Such a dashboard invites questions that can be answered directly from the dashboard or report available at this stage (after adjusting to the problem to be solved, of course).&lt;/p&gt;

&lt;h2&gt;
  
  
  Predictive Analytics
&lt;/h2&gt;

&lt;p&gt;Predictive analytics is the second stage, which predicts a trend in the data. It generally deals with timestamp columns: date and time values can anchor a KPI to a particular deadline, which is then compared with other deadlines in a line chart.&lt;/p&gt;

&lt;p&gt;For example, profit is shown in a line chart. From the trend formed, the yield is then projected over several future deadlines to create a profit forecast. Ensure the prediction model has a good MAE or MAPE evaluation score so the predictions do not stray too far from the original data. It is also necessary that the trend is repetitive and not too varied, so the predictions come out optimally.&lt;/p&gt;
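&lt;p&gt;MAE and MAPE themselves take only a few lines to compute; the actual and predicted values below are illustrative, not a real forecast.&lt;/p&gt;

```python
actual = [100.0, 120.0, 130.0, 110.0]
predicted = [90.0, 125.0, 120.0, 115.0]

n = len(actual)
# Mean Absolute Error: average size of the miss, in the metric's own units.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
# Mean Absolute Percentage Error: average miss relative to the actual value.
mape = sum(abs(a - p) / a for a, p in zip(actual, predicted)) / n
```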

&lt;h2&gt;
  
  
  Prescriptive Analytics
&lt;/h2&gt;

&lt;p&gt;The final stage is the gold of them all. Here, all the analysis results are expected to produce a structured series of strategies for solving the problem. The analysis an analyst delivers must, undeniably, help achieve the two goals defined previously. The output of this stage can therefore take various forms, generally a model serving one of five data mining roles. The five functions are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Estimation (estimating the value of a data based on the similarity of the previously generated attributes)&lt;/li&gt;
&lt;li&gt;Forecasting (predicting a value based on the trend of previous data on a specific date)&lt;/li&gt;
&lt;li&gt;Classification (grouping a data based on similarities in specific categories)&lt;/li&gt;
&lt;li&gt;Clustering (grouping data based on proximity to a centroid in each cluster)&lt;/li&gt;
&lt;li&gt;Association (shows the relationship between available data)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Across the five roles, the developed model will have various functionalities. In general, however, it will produce new data that can reduce production costs.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Is the decision understandable?</title>
      <dc:creator>daudfernando</dc:creator>
      <pubDate>Tue, 21 Jun 2022 14:15:20 +0000</pubDate>
      <link>https://dev.to/daudfernando/is-the-decision-understandable-5f05</link>
      <guid>https://dev.to/daudfernando/is-the-decision-understandable-5f05</guid>
<description>&lt;p&gt;The decision tree is a model that helps classify recurring categories. From the similarity of attributes across data observations, it builds a set of sequential rules that indicate which class each observation belongs to.&lt;/p&gt;

&lt;p&gt;Let's look at one example: classifying drugs based on transactional data of patients from a pharmacy (note: the data was preprocessed earlier, still following the CRISP-DM method; click &lt;a href="https://dev.to/daudfernando/your-car-is-costly-3ep7"&gt;here&lt;/a&gt; to read more).&lt;/p&gt;

&lt;h2&gt;
  
  
  Strong tree with a solid foundation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PrAcvu7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jsw3pwzm6nwypkl7cxq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PrAcvu7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jsw3pwzm6nwypkl7cxq8.png" alt="Tree" width="880" height="370"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Like a tree that soars into the sky, a decision tree (albeit growing downward) is built from several components worth understanding first.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Root node, meaning a primary variable that determines a data category.&lt;/li&gt;
&lt;li&gt;Splitting, meaning an advanced process of classifying data into several other determining variables based on a particular threshold value.&lt;/li&gt;
&lt;li&gt;Decision node, when a node that has been split is divided back into several more nodes, it is called a decision node.&lt;/li&gt;
&lt;li&gt;The leaf node is the last node that determines a data observation.&lt;/li&gt;
&lt;li&gt;Branch / Sub Tree, meaning a sub-part of several nodes in one tree.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After knowing the most critical components in the decision tree, the next step is to apply the algorithm to make a suitable model. The algorithm used is the C4.5 algorithm with three main repetitive steps as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Root node creation process&lt;/li&gt;
&lt;li&gt;Leaf node formation process&lt;/li&gt;
&lt;li&gt;Return to step one until the algorithm's max_depth is reached.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Root node, where are you?
&lt;/h2&gt;

&lt;p&gt;The root node is created from the variable that best determines the classification. To find it, we use a metric called entropy, with the following formula.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cp3jskQf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vbozc08gtzkjijwhnrho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cp3jskQf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vbozc08gtzkjijwhnrho.png" alt="Entropy" width="305" height="70"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;We will compute the entropy of each value within every attribute of nominal categorical type. For one of the nominal values, the entropy obtained is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MCK0B0DW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhpnej9g1vsx6pa6e51k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MCK0B0DW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhpnej9g1vsx6pa6e51k.png" alt="Sex (M) Entropy" width="761" height="215"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;After every nominal value has its entropy, we proceed to find the best gain value. The gain shows how much entropy is removed by splitting on each attribute. This number is computed with the following formula.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A6sXgess--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8yktrekgidmp835jjs3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A6sXgess--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8yktrekgidmp835jjs3a.png" alt="Gain Formula" width="637" height="83"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of them is the gain value in the Na_to_K_binned attribute. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W_wBAIeD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8pp0s37pcuzglu00ll42.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W_wBAIeD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8pp0s37pcuzglu00ll42.png" alt="Gain Value Na_to_K_binned" width="847" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, let's determine how many decision nodes the split produces. We can obtain this with the split info (also called intrinsic info) formula.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ofpFhhie--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e4yfjzzlxoe2ci1lal51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ofpFhhie--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e4yfjzzlxoe2ci1lal51.png" alt="Split Info Formula" width="588" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One example is the Na_to_K_binned attribute again, this time computing its split info value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Lhzqx1GO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/77cnpxyadb0sysoc7p8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Lhzqx1GO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/77cnpxyadb0sysoc7p8r.png" alt="Split Info" width="740" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then comes the determination stage: finding the largest gain ratio among all the features. We obtain this value by dividing each attribute's gain by its split info. Notice that the BP attribute has the largest gain ratio.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K-v0JPvJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8j0f6qarg49ohyb1volg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K-v0JPvJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8j0f6qarg49ohyb1volg.png" alt="Gain Ratio" width="448" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So that the first decision tree framework will be created as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oLCbSbe7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z88tpf5ga9yg9daf2jxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oLCbSbe7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z88tpf5ga9yg9daf2jxc.png" alt="First Decision Tree" width="524" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Foliage, grow!
&lt;/h2&gt;

&lt;p&gt;Do the same thing until no attribute can be divided into further nodes. In the end, we will have a complete decision tree like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iT8vSP4M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7rvpddequ1zwiwvpzxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iT8vSP4M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7rvpddequ1zwiwvpzxy.png" alt="Decision Tree" width="880" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus: RapidMiner with all its conveniences.
&lt;/h2&gt;

&lt;p&gt;After this series of manual steps, let's use one of the tools that makes model building easy, namely RapidMiner, with the same data set and a working canvas like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GWNP-7Tg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hji0s1415tyexoky97t4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GWNP-7Tg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hji0s1415tyexoky97t4.png" alt="Rapid Miner" width="880" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This produces a model with an accuracy rate of 67.26% in classifying the population data against the sample data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k7RlQ0me--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iht7n29c6wptcdh17uno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k7RlQ0me--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iht7n29c6wptcdh17uno.png" alt="Accuracy of Decion Tree" width="880" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>decisiontree</category>
    </item>
    <item>
      <title>SAT in the NYC borough, was the performance good enough?</title>
      <dc:creator>daudfernando</dc:creator>
      <pubDate>Mon, 20 Jun 2022 06:57:18 +0000</pubDate>
      <link>https://dev.to/daudfernando/sat-in-the-nyc-borough-was-the-performance-good-enough-3f8f</link>
      <guid>https://dev.to/daudfernando/sat-in-the-nyc-borough-was-the-performance-good-enough-3f8f</guid>
<description>&lt;p&gt;The SAT (Scholastic Aptitude Test) exam is an essential data point for every student who wants to continue to tertiary study. A person's SAT score measures literacy, numeracy, and writing competencies, which universities weigh in deciding whether to accept them. Each part of the exam is scored from 200 to 800.&lt;/p&gt;

&lt;p&gt;Because many universities consider this score, many students take the test to go through the admissions selection process. One example is New York City. With the city divided into five boroughs, does each area post the same SAT scores? Let's analyze it using the data available in the database.&lt;/p&gt;

&lt;p&gt;Given that our data still lives in a database, let's pull it with the SQL language from a notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. We have to connect right
&lt;/h2&gt;

&lt;p&gt;To create a database connector, use the connect function available in mysql.connector, then adjust its arguments to the target database. In this case, the database is named school, accessed with the root account on the local host.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EzRloBej--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/apube18i0xf996c81cn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EzRloBej--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/apube18i0xf996c81cn3.png" alt="Connector" width="880" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, make the first query to display the available table characteristics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---6-r7tJu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/im608vdxxzqsn9aetowz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---6-r7tJu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/im608vdxxzqsn9aetowz.png" alt="Table Describe" width="451" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Viewing the top 10 data
&lt;/h2&gt;

&lt;p&gt;Our school table contains several columns holding the average SAT score for each school in each NYC borough. The top ten rows are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qEvt0U9s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ouiotszp5uqfqqsj5k9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qEvt0U9s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ouiotszp5uqfqqsj5k9u.png" alt="Top 10 data" width="880" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's also check whether the table contains any empty values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SL08bk0j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nnm48suy8stitzclzq4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SL08bk0j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nnm48suy8stitzclzq4q.png" alt="Check Missing Value" width="689" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fortunately, every column in the table is fully populated.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Math, which school is the best?
&lt;/h2&gt;

&lt;p&gt;Numerical ability is a crucial benchmark for whether someone is accepted into a university. Let's see which school has the highest average math score, ranked into quartiles using paging in SQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uURKH8X_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qm6hcy5p9mzefpyeufpq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uURKH8X_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qm6hcy5p9mzefpyeufpq.png" alt="Best Math score" width="592" height="666"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stuyvesant High School is the best, with an average math score of 754, a full 40 points ahead of second place!&lt;/p&gt;
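&lt;p&gt;The quartile query from the screenshot can be reproduced against toy data with SQLite's window functions (available in SQLite 3.25 and newer); the school names and scores below are invented for illustration.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE school (name TEXT, avg_math INTEGER)")
conn.executemany(
    "INSERT INTO school VALUES (?, ?)",
    [("A", 754), ("B", 714), ("C", 680), ("D", 650),
     ("E", 620), ("F", 600), ("G", 580), ("H", 560)],
)
# NTILE(4) buckets the schools into quartiles by descending math score.
rows = conn.execute(
    "SELECT name, avg_math, NTILE(4) OVER (ORDER BY avg_math DESC) "
    "FROM school ORDER BY avg_math DESC"
).fetchall()
```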

&lt;h2&gt;
  
  
  4. Don't forget about literacy and writing!
&lt;/h2&gt;

&lt;p&gt;Let's first see how low NYC students score on the reading section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w2DfFIES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dly2ynwrxgog90qh2mx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w2DfFIES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dly2ynwrxgog90qh2mx8.png" alt="Minimum reading score" width="424" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A score of 302 is a reasonably low average for a state like New York. Then, what about the writing score?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--36WuiOIO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9flqpk9ri2kx1afrwde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--36WuiOIO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9flqpk9ri2kx1afrwde.png" alt="Best writing score" width="515" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same school again! Next, we should examine the overall average SAT score at each NYC school.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Let's give it a ranking!
&lt;/h2&gt;

&lt;p&gt;We also need to check whether other schools have average scores on par with Stuyvesant High School.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OKBHnNRv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y0qum1lhf33ys44tpsey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OKBHnNRv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y0qum1lhf33ys44tpsey.png" alt="Rank school" width="612" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It turns out that only four schools have an overall average SAT score above 2000. Now we can determine which New York City borough has the best average SAT score.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Which NYC Borough is the best?
&lt;/h2&gt;

&lt;p&gt;Across the five available boroughs, the distribution of schools and average SAT scores is inevitably uneven. Let's take a look!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EN_gUkg7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j3zxl0nvuodrmsy49dye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EN_gUkg7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j3zxl0nvuodrmsy49dye.png" alt="Borough Ranking" width="813" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that Staten Island has a reasonably high average overall SAT score compared to the other boroughs. However, there are only ten schools in that area. To close this analysis, we will look at Brooklyn, which has 109 schools, and find the one with the best average score in mathematics.&lt;/p&gt;
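&lt;p&gt;The borough comparison boils down to a group-by over schools. A rough pandas equivalent of the SQL, with made-up rows standing in for the real table:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical rows; borough names are real, every figure is an assumption.
schools = pd.DataFrame({
    "borough": ["Staten Island", "Staten Island", "Brooklyn", "Brooklyn", "Brooklyn"],
    "average_sat": [1950, 1850, 1700, 1600, 2050],
})

# Count schools and average the SAT score per borough, best borough first.
boroughs = (
    schools.groupby("borough")
    .agg(num_schools=("average_sat", "size"), avg_sat=("average_sat", "mean"))
    .sort_values("avg_sat", ascending=False)
)
print(boroughs)
```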

&lt;h2&gt;
  
  
  7. Best school for mathematicians in Brooklyn
&lt;/h2&gt;

&lt;p&gt;We can use this method to see which Brooklyn school has the highest average SAT math score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yrX7PXZ0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ub6yob46zinlxjg6nmzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yrX7PXZ0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ub6yob46zinlxjg6nmzg.png" alt="Mathematicians in Brooklyn" width="491" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OK, Brooklyn Technical High School will be an excellent place for someone aspiring to be a mathematician.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>python</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Your car is costly!</title>
      <dc:creator>daudfernando</dc:creator>
      <pubDate>Sat, 18 Jun 2022 04:13:56 +0000</pubDate>
      <link>https://dev.to/daudfernando/your-car-is-costly-3ep7</link>
      <guid>https://dev.to/daudfernando/your-car-is-costly-3ep7</guid>
<description>&lt;p&gt;Today I met Michael, who was puzzled by the car he had just serviced. The car kept breaking down, over and over. I don't know how many times he has had to service his vehicle for various problems. All told, the service fees he has spent are equivalent to five times the purchase price of his car. Wow, fantastic; that car really had to be replaced with a new one.&lt;/p&gt;

&lt;p&gt;Besides telling me how bad his car was, Michael also asked for advice on estimating how much a car with specific criteria would cost, so he can save diligently toward an accurate goal. Without further ado, I was of course happy to help an old friend I had run into by chance at a vehicle repair shop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maps to Help Michael
&lt;/h2&gt;

&lt;p&gt;To help Michael, I need a roadmap showing the end-to-end process of a data mining estimation task. For that I use a framework called CRISP-DM, short for the Cross-Industry Standard Process for Data Mining. This framework makes it much easier to help poor Michael. Here is the series of processes used throughout the project.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YtbDDAEN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1bvvy88er6o9f741bn9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YtbDDAEN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1bvvy88er6o9f741bn9u.png" alt="CRISP-DM" width="880" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset Initiation
&lt;/h2&gt;

&lt;p&gt;Before I get started, I need a price list of cars. Luckily, I can get one directly at this &lt;a href="https://drive.google.com/file/d/1sawciwpff2k6phNl-sq688FoIXzcpa4L/view?usp=sharing"&gt;link&lt;/a&gt;. The dataset contains 6019 car listings with 14 columns describing each car's characteristics, complete with its price. This dataset is large enough to represent overall vehicle prices and will help Michael set his savings goals later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Business Understanding
&lt;/h2&gt;

&lt;p&gt;This stage is where the root of the problem surfaces. Luckily, Michael has already outlined it: he needs help estimating the price of the car he wants in the future. I need a metric threshold to prove that the estimates do not stray too far from the original data, and I can use Root Mean Squared Error (RMSE) for that. Since the &lt;a href="http://www.eumetrain.org/data/4/451/english/msg/ver_cont_var/uos3/uos3_ko1.htm#:~:text=Since%20the%20errors%20are%20squared,large%20errors%20are%20particularly%20undesirable.&amp;amp;text=Both%20the%20MAE%20and%20RMSE,scores%3A%20Lower%20values%20are%20better."&gt;RMSE range&lt;/a&gt; is unbounded, I should aim for a relatively small value. That way, the estimated car prices will not deviate too far, on average, from the original prices. OK, now to the Data Understanding stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Understanding
&lt;/h2&gt;

&lt;p&gt;This stage is a series of processes for exploring the dataset we have obtained. Using cloud computing provided by Google (Google Colab), we will use Python to get a first look at our data.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Import equipment
&lt;/h3&gt;

&lt;p&gt;We need some libraries and packages that are already available.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We need Pandas for Data Frame analysis&lt;/li&gt;
&lt;li&gt;Numpy for array calculations and other operations&lt;/li&gt;
&lt;li&gt;Matplotlib and Seaborn for visualization of our analysis results
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ediCsTAf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dqzpilqx1ibb95t4e15k.png" alt="Importing Package and Library" width="319" height="104"&gt;
&lt;/li&gt;
&lt;/ul&gt;
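&lt;p&gt;In text form, that import cell might look like this, using the conventional aliases (a sketch; exact versions may differ from the notebook's):&lt;/p&gt;

```python
import pandas as pd              # DataFrame analysis
import numpy as np               # array calculations and other operations
import matplotlib.pyplot as plt  # visualization of analysis results
import seaborn as sns            # statistical plots on top of matplotlib

print(pd.__version__, np.__version__)
```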

&lt;h3&gt;
  
  
  2. Upload dataset
&lt;/h3&gt;

&lt;p&gt;Let's upload our dataset into an object called data and look at some examples of the data available there.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4siYsmbB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8zmhsfoxn5de1llm2i65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4siYsmbB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8zmhsfoxn5de1llm2i65.png" alt="Data Info" width="880" height="370"&gt;&lt;/a&gt;&lt;br&gt;
We can see that our dataset has 14 columns and 6019 rows of data. We need to look at each column's data distribution now.&lt;/p&gt;
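&lt;p&gt;Loading the CSV comes down to &lt;code&gt;read_csv&lt;/code&gt; plus &lt;code&gt;info&lt;/code&gt;. Below is a tiny in-memory stand-in for the real file; the column names match the dataset, but the rows are assumptions:&lt;/p&gt;

```python
import io
import pandas as pd

# Two made-up rows standing in for the downloaded CSV;
# the real file has 6019 rows and 14 columns.
csv = io.StringIO(
    "Name,Year,Kilometers_Driven,Fuel_Type,Transmission,Price\n"
    "Maruti Wagon R,2010,72000,CNG,Manual,1.75\n"
    "Hyundai Creta,2015,41000,Diesel,Manual,12.5\n"
)
data = pd.read_csv(csv)  # with the real file: pd.read_csv("car_data.csv")
data.info()              # column names, dtypes, and non-null counts
print(data.shape)
```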

&lt;h3&gt;
  
  
  3. Describing the dataset
&lt;/h3&gt;

&lt;p&gt;We can describe the dataset by first separating the numeric columns from the categorical ones. This way, we can see the broad distribution of the data.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G_q4KHXW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2fxg6cczulosncmm26st.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G_q4KHXW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2fxg6cczulosncmm26st.png" alt="Data Describe" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
It seems that some of our numeric columns have skewed distributions. In addition, some columns that should be numeric have ended up as categorical data types.&lt;/p&gt;
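&lt;p&gt;With &lt;code&gt;describe&lt;/code&gt;, the numeric and categorical summaries can be produced separately; a minimal sketch on a stand-in frame (the values are assumptions):&lt;/p&gt;

```python
import pandas as pd

# Small stand-in frame; the point is separating numeric and categorical summaries.
data = pd.DataFrame({
    "Year": [2010, 2015, 2017],
    "Price": [1.75, 12.5, 4.5],
    "Fuel_Type": ["CNG", "Diesel", "Petrol"],
})
print(data.describe())                     # numeric columns only
print(data.describe(include=["object"]))   # categorical columns only
```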

&lt;h3&gt;
  
  
  4. Something's amiss
&lt;/h3&gt;

&lt;p&gt;We need to look at the distribution of our data: histograms for the numeric columns and bar charts for the categorical ones. It turns out the columns vary significantly!&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v39arQYP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9evdxbf8ia6qv0mxzknh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v39arQYP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9evdxbf8ia6qv0mxzknh.png" alt="Histogram" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
The Year column is negatively skewed, while the Price column is positively skewed. The other two columns have either one value with far too high a frequency or two such values (bimodal). We will need to clean up these oddities later!&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qze11qBt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nv78u5qlcqpfbnynybxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qze11qBt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nv78u5qlcqpfbnynybxd.png" alt="Bar chart" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
And yes, that's right: some of the categorical columns contain varying units, which is exactly what makes them categorical rather than numeric. The other columns look fine so far.&lt;/p&gt;
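&lt;p&gt;The skew visible in the histograms can also be checked numerically with &lt;code&gt;skew&lt;/code&gt;; the toy numbers below are assumptions chosen only to mimic the shapes described:&lt;/p&gt;

```python
import pandas as pd

# Toy numbers mimicking the shapes above (assumptions, not the real data):
# Year bunched at recent values, Price with a long right tail.
data = pd.DataFrame({
    "Year": [2002, 2010, 2013, 2015, 2016, 2017, 2017, 2018],
    "Price": [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 20.0, 60.0],
})
s = data.skew()  # negative means left-skewed, positive means right-skewed
print(s)
```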

&lt;h2&gt;
  
  
  Data Preparation
&lt;/h2&gt;

&lt;p&gt;Data preprocessing means adjusting the available columns so they can produce the best model. Based on the previous stages, the dataset will be adjusted through several steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Handling missing values
&lt;/h3&gt;

&lt;p&gt;The data information we get shows an imbalance in the number of rows of data between several columns. This circumstance indicates missing observational data. Let's use &lt;code&gt;data.isna().sum()&lt;/code&gt; to accumulate the amount of missing data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ECL19Nd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/py1mzr4qg7793foei5m1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ECL19Nd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/py1mzr4qg7793foei5m1.png" alt="Missing values" width="237" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, it turns out that several columns have missing values. Let's delete the rows that contain them; we can afford this because the missing data amounts to less than 10% of the total. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sGq5zR28--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9wvhg8ci2fuaw9ali8jw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sGq5zR28--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9wvhg8ci2fuaw9ali8jw.png" alt="Drop missing values" width="432" height="92"&gt;&lt;/a&gt;&lt;br&gt;
However, for columns like Unnamed: 0 and also New_Price, it's best to delete the entire column.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--41aHi6QJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kbr0rqaef10m900e3tfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--41aHi6QJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kbr0rqaef10m900e3tfs.png" alt="Drop unnecessary columns" width="562" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Remove duplicate values
&lt;/h3&gt;

&lt;p&gt;Duplicate data will undoubtedly bias a model through its repeated appearance. We can check for exact duplicate rows this way.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P9CcGhSP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pbqvm7s791qftffjy3ou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P9CcGhSP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pbqvm7s791qftffjy3ou.png" alt="Handling duplicate values" width="204" height="77"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Phew, luckily, our data has no duplicate values.&lt;/p&gt;
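&lt;p&gt;The duplicate check is one call to &lt;code&gt;duplicated&lt;/code&gt;; a sketch on made-up rows that do contain a duplicate, together with the usual remedy:&lt;/p&gt;

```python
import pandas as pd

# Made-up rows; the third is an exact duplicate of the second.
data = pd.DataFrame({"Name": ["A", "B", "B"], "Price": [1.0, 2.0, 2.0]})
print(data.duplicated().sum())   # count of repeated rows
data = data.drop_duplicates()    # keep only the first occurrence of each row
print(len(data))
```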

&lt;h3&gt;
  
  
  3. We must set outliers aside
&lt;/h3&gt;

&lt;p&gt;Do you remember the skewed distributions of our numeric data? That data should be cleaned immediately. This time, the z-score will be very helpful: keeping roughly 95% of the values means the numeric columns lose their most extreme outliers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cBLgZbX7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i6qs5qgwdnf53kbhthhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cBLgZbX7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i6qs5qgwdnf53kbhthhc.png" alt="Handling outlier values" width="878" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thankfully, we only deleted 348 rows. This keeps our data representative of the population.&lt;/p&gt;
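&lt;p&gt;A z-score filter like the one above can be sketched this way; the synthetic prices and the planted outlier are assumptions for illustration:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic prices with one planted extreme value (assumption for illustration).
rng = np.random.default_rng(0)
data = pd.DataFrame({"Price": rng.normal(10, 2, 500)})
data.loc[0, "Price"] = 500.0  # the outlier

# Keep rows within about 2 standard deviations of the mean
# (roughly 95% of a normal distribution).
z = (data["Price"] - data["Price"].mean()) / data["Price"].std()
clean = data[z.abs().lt(2)]
print(len(data) - len(clean), "rows removed")
```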

&lt;h3&gt;
  
  
  4. Heatmap shows your correlation!
&lt;/h3&gt;

&lt;p&gt;A heatmap is very useful for seeing the Pearson correlation between columns. The correlation heatmap available in seaborn crosses the numeric columns against each other along the x- and y-axes. This time we have to focus on our target column: the value the estimation model will predict later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vHhw_j_x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4nn1m0vrf8kvyg17of2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vHhw_j_x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4nn1m0vrf8kvyg17of2q.png" alt="Heatmap p-correlation" width="599" height="635"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this, we can see that only the Year column correlates with the Price target at all, and even then the correlation is just 0.23. Our data needs further preprocessing: encoding the categorical columns into numeric ones.&lt;/p&gt;
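&lt;p&gt;The numbers behind that heatmap come from &lt;code&gt;DataFrame.corr&lt;/code&gt;; a sketch on a toy frame (the figures are assumptions):&lt;/p&gt;

```python
import pandas as pd

# Toy frame; the real heatmap crosses every numeric column against the others.
data = pd.DataFrame({
    "Year": [2010, 2012, 2014, 2016, 2018],
    "Kilometers_Driven": [90000, 70000, 60000, 40000, 20000],
    "Price": [2.0, 3.5, 3.0, 6.0, 8.0],
})
corr = data.corr()  # Pearson correlation matrix
print(corr["Price"].sort_values(ascending=False))
# seaborn's sns.heatmap(corr, annot=True) would render this matrix as a heatmap
```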

&lt;h3&gt;
  
  
  5. Regex in action
&lt;/h3&gt;

&lt;p&gt;Regular expressions (regex) are crucial for cleaning up the Mileage, Engine, and Power fields. These three columns are numeric in nature, but because of the differing units attached to their values, they were identified as categorical. Let's look again at the value variations in the three columns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RtjCzNAw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/41ojhzaasngdsofqhnpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RtjCzNAw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/41ojhzaasngdsofqhnpr.png" alt="Variation value of categorical field" width="319" height="717"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OK, that means we will strip each unit and, assuming all values in a column share the same unit, keep only the numeric part.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rO0z76d3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9wdewxipqa74kd3qii1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rO0z76d3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9wdewxipqa74kd3qii1a.png" alt="Regex" width="811" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note this: don't forget to change each column's data type to float. And because the Power column contains several "null" strings, it's time to delete the empty values those nulls were transformed into.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BWW9dYbw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4rll8try29g0dbr95df6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BWW9dYbw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4rll8try29g0dbr95df6.png" alt="Handling missing value in Power fields" width="500" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Encode the rest!
&lt;/h3&gt;

&lt;p&gt;Look again at the available value variations in the Transmission and Fuel_Type columns!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vvix1AqI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ip1k5kt1w1r4gf1dt3i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vvix1AqI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ip1k5kt1w1r4gf1dt3i4.png" alt="Value variation" width="314" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The variation in values in these two columns is small, so this is the right time to use One Hot Encoding. This technique creates a new column for each distinct value of the column being encoded. The result looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7AjV9vBM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ospyz8hjbsr05ljr05ak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7AjV9vBM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ospyz8hjbsr05ljr05ak.png" alt="Encoding result" width="878" height="254"&gt;&lt;/a&gt;&lt;/p&gt;
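&lt;p&gt;One Hot Encoding is pandas' &lt;code&gt;get_dummies&lt;/code&gt;; a minimal sketch on made-up rows that produces columns such as Transmission_Automatic and Fuel_Type_Diesel:&lt;/p&gt;

```python
import pandas as pd

# Made-up rows covering the value variations of the two columns.
data = pd.DataFrame({
    "Transmission": ["Manual", "Automatic", "Manual"],
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
})
# One new indicator column per distinct value of each encoded column.
encoded = pd.get_dummies(data, columns=["Transmission", "Fuel_Type"])
print(encoded.columns.tolist())
```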

&lt;p&gt;When all the columns have become one DataFrame, it's time to look at the heatmap correlation between attribute variables and classes from the preprocessed dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FZAOnIV9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkr99j7df7b2yegkc0nz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FZAOnIV9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkr99j7df7b2yegkc0nz.png" alt="Heatmap correlation final" width="727" height="744"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this, it can be concluded that several variables determine the price of a car, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Diesel type fuel (Fuel_Type_Diesel)&lt;/li&gt;
&lt;li&gt;Strength of the engine (Engine)&lt;/li&gt;
&lt;li&gt;Available car power (Power)&lt;/li&gt;
&lt;li&gt;Year of the car (Year)&lt;/li&gt;
&lt;li&gt;Transmission type automatic (Transmission_Automatic)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrap it up into the model
&lt;/h2&gt;

&lt;p&gt;It's time to build a suitable estimation model. For modeling, I will choose one machine learning algorithm, Random Forest, available in the scikit-learn library. The modeling stage is divided into three processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Train and test with data
&lt;/h3&gt;

&lt;p&gt;Before a model can learn from data, the data must first be divided into two parts: training data and test data. Also make sure to separate the dependent variable (the target) from the independent variables!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7loDwwNQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzhq76w06n7y0dcvkc6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7loDwwNQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzhq76w06n7y0dcvkc6k.png" alt="Splitting data" width="649" height="83"&gt;&lt;/a&gt;&lt;/p&gt;
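&lt;p&gt;With scikit-learn, the split is one call to &lt;code&gt;train_test_split&lt;/code&gt;; the toy frame and the 80/20 ratio below are assumptions:&lt;/p&gt;

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the prepared dataset (values are assumptions).
data = pd.DataFrame({
    "Year": range(2000, 2020),
    "Power": [50 + 5 * i for i in range(20)],
    "Price": [1.0 + 0.5 * i for i in range(20)],
})
X = data.drop(columns=["Price"])  # independent variables
y = data["Price"]                 # dependent variable (the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```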

&lt;h3&gt;
  
  
  2. This car should be at this price.
&lt;/h3&gt;

&lt;p&gt;Let's teach this model using training data in the following way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NeFqpMJO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bnhyt5mujyju92tuqrvy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NeFqpMJO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bnhyt5mujyju92tuqrvy.png" alt="RF Model" width="400" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the model has learned from the training data, it is time to test it on data it has never seen before (the test set). Compare the estimation results with the previously split y_test data; the accuracy of the model is then displayed as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sm_1Nlyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zw7m6yd1wf245w09ks26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sm_1Nlyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zw7m6yd1wf245w09ks26.png" alt="Accuracy" width="259" height="85"&gt;&lt;/a&gt;&lt;/p&gt;
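&lt;p&gt;Fitting and scoring the Random Forest follows the usual scikit-learn pattern. The synthetic data below, where the price depends mainly on the power, is an assumption for illustration:&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data where Price depends mostly on Power (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform([2000, 50], [2019, 600], size=(300, 2))  # Year, Power
y = 0.02 * X[:, 1] + 0.1 * (X[:, 0] - 2000) + rng.normal(0, 0.3, 300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)                     # learn from the training data
score = model.score(X_test, y_test)             # R^2 on unseen data
print("R^2 on unseen data:", round(score, 3))
```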

&lt;p&gt;Well, that value is good enough as a benchmark for the estimated price of the new car that Michael will buy.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. We need validation
&lt;/h3&gt;

&lt;p&gt;A freshly built model is not necessarily ready to use. We must always validate whether it estimates values close to the original data. We can do this by manually picking a row of test data, then calling the model with argument values corresponding to that row.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Sy1fMVOh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t4r3oxuww3ijth9qky1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Sy1fMVOh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t4r3oxuww3ijth9qky1t.png" alt="Validate the model" width="410" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes! The estimated price is not far off from the actual one. Let's see whether our initial target has been met by evaluating some metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finally, Evaluation!
&lt;/h2&gt;

&lt;p&gt;Evaluate, evaluate, evaluate! First, I'll show you what variables are the main determinants of how expensive a car can be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_RiMyoKQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xq2unl3ff1df1y5g7r3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_RiMyoKQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xq2unl3ff1df1y5g7r3m.png" alt="Feature Importance" width="865" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It turns out that the Power variable is the culprit! I have to tell Michael that the more Power a car has, the higher its price will be. That will be a helpful piece of information for him. Of course, I first measured how successful this model was via the Root Mean Squared Error (RMSE) metric. RMSE is the square root of the Mean Squared Error (MSE) and shows how much, on average, the model's estimates deviate from the true values. I will look at it this way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Qks3aNA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yim6l82jmttg85vov4rn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Qks3aNA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yim6l82jmttg85vov4rn.png" alt="Evaluation" width="658" height="174"&gt;&lt;/a&gt;&lt;/p&gt;
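&lt;p&gt;The RMSE is just the square root of scikit-learn's &lt;code&gt;mean_squared_error&lt;/code&gt;; a sketch with hypothetical actual vs. estimated prices:&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual vs. estimated prices for a handful of test cars.
y_test = np.array([4.5, 12.5, 1.75, 6.0])
y_pred = np.array([5.0, 11.8, 2.0, 6.3])
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root of the MSE
print(round(rmse, 3))
```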

&lt;p&gt;Hooray! The RMSE obtained by this model is reasonably low and lines up with the accuracy value we saw earlier. It's time for me to meet Michael and share the insights I've found.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How do I know the distribution of my data?</title>
      <dc:creator>daudfernando</dc:creator>
      <pubDate>Fri, 17 Jun 2022 01:26:38 +0000</pubDate>
      <link>https://dev.to/daudfernando/how-do-i-know-the-distribution-of-my-data-1cb8</link>
      <guid>https://dev.to/daudfernando/how-do-i-know-the-distribution-of-my-data-1cb8</guid>
<description>&lt;p&gt;G'day, mate! Imagine you are the leader of an entertainment services company who wants to know how the anime films available on its website are distributed. Let me use Tableau to show the distribution of anime data up to 2018!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Critical thinking is also called structured thinking&lt;br&gt;
 ~ Pearl Zhu.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Therefore, we must follow a sequential process to explore the available data. Here are the steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--niHKMzfW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bcq7h4303dngdbn82gl7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--niHKMzfW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bcq7h4303dngdbn82gl7.png" alt="End-to-End Process" width="880" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's wrong with my data?
&lt;/h2&gt;

&lt;p&gt;Data checking is crucial once you know which business problem you want to solve. At this stage, we can correct the data types that Tableau set automatically but got wrong. For example, a classification field might be detected as a continuous numeric measure even though it is a discrete dimension. So far our data is safe, though, so let's move on to the next stage.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pewFmWrv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hy6pdsmjmrqmr7cbuvdz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pewFmWrv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hy6pdsmjmrqmr7cbuvdz.png" alt="Data Checking" width="880" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  There must be gold behind the unknown.
&lt;/h2&gt;

&lt;p&gt;Now is a good time to interrogate our data: ask various questions about it and let a visualization answer them. First, we determine the Key Performance Indicators (KPIs), namely the average rating of a movie, the total reviewers (audiences who rate a movie), and the total number of available films.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BL8My7k7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ebiyxnyfqmsurlgi4pts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BL8My7k7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ebiyxnyfqmsurlgi4pts.png" alt="Top 10 Genre Movie" width="880" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's separate each genre and get our Top #N Genre as a Treemap (make sure to use parameters for interactive visualization). The treemap will make it easier for us to see the hierarchy of each available genre based on its quantity. This also makes it easier for stakeholders to rank available film genres automatically.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wuJNx2TS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qj2utpii8isvrvayllf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wuJNx2TS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qj2utpii8isvrvayllf1.png" alt="Treemap of Genre" width="880" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next up are the film ratings. Let's see in a pie chart how many films fall under each rating, and in a donut chart how the film types are distributed in general. Neither field has many categories, so circular charts suit them well.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CJ3MVhjC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o81edm9my015whkkzw2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CJ3MVhjC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o81edm9my015whkkzw2m.png" alt="Explore Data" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyze and Visualize Data
&lt;/h2&gt;

&lt;p&gt;Univariate charts have already surfaced plenty of gold from the unknown, but there are still diamonds to mine in this data. Let's use a bivariate view to see how the average number of reviewers, average favorites, and total episodes relate to the average number of members per year. We'll combine a bar chart with a line chart to compare these metrics against the yearly average membership.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NnJJkx3b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9nnhc6ticpliqgddra2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NnJJkx3b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9nnhc6ticpliqgddra2.png" alt="Yearly Member" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That should be enough for the overall data distribution. We can already answer the questions raised by curiosity, or by the exploratory data analysis itself. It's time to unite everything into a single dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Put them all together
&lt;/h2&gt;

&lt;p&gt;It's time to combine all the visualizations we have built into one place. A dashboard is an excellent medium for viewing a data set from several angles at once, and in Tableau a chart can even act as a filter for the other charts. Amazing! The first step is to lay out the dashboard using the available objects.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QwZH-aHe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p9jygd44glg0h17kr11n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QwZH-aHe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p9jygd44glg0h17kr11n.png" alt="Adjust the layout." width="393" height="755"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Aim for a dashboard that is neither overloaded with information nor missing the connections between charts. The purpose of a dashboard is to show relationships and comparisons within the data, so I combined all the charts built earlier into a single dashboard that answers the various questions about the data's distribution.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rkThYyjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qhpro6zt6zljb3x0oirb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rkThYyjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qhpro6zt6zljb3x0oirb.png" alt="The Dashboard" width="880" height="723"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After adjusting the dashboard to the browser size and arranging the visualizations it will display, the dashboard is done. Yay!! You can access it at &lt;a href="https://public.tableau.com/app/profile/daud.fernando/viz/progress_recommender_movie/Movie?publish=yes"&gt;this link&lt;/a&gt;. One more thing: make sure every filter and parameter applies to all visualizations on the dashboard. For parameters, you can do it like this.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VYYStlng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b8mg49ldqn8wz1hfojqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VYYStlng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b8mg49ldqn8wz1hfojqj.png" alt="Parameter" width="828" height="220"&gt;&lt;/a&gt;&lt;br&gt;
And for a single data visualization, click the funnel-shaped icon and, &lt;em&gt;voilà&lt;/em&gt;, the dashboard is done.&lt;br&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cr5pIVGu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2zat5zbjm6i3y6lsaqrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cr5pIVGu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2zat5zbjm6i3y6lsaqrx.png" alt="Filter" width="859" height="477"&gt;&lt;/a&gt;&lt;br&gt;
Thank you for reading this article! See you, mate!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Which one customer do you mean?</title>
      <dc:creator>daudfernando</dc:creator>
      <pubDate>Wed, 15 Jun 2022 23:47:06 +0000</pubDate>
      <link>https://dev.to/daudfernando/which-one-customer-do-you-mean-5bmo</link>
      <guid>https://dev.to/daudfernando/which-one-customer-do-you-mean-5bmo</guid>
      <description>&lt;p&gt;The company has various type's customers that depend on their behavior. It is tricky tho. I mean, if we have to create one campaign and need to choose the proper customer because of the budget. We must decide the similarity between the customers or a customer segmentation properly. So it can operate well.&lt;/p&gt;

&lt;p&gt;I'm curious, too, about how customers can be segmented, so let's explore it using &lt;a href="https://archive.ics.uci.edu/ml/datasets/online+retail"&gt;this data set&lt;/a&gt;. But first of all, what does customer segmentation mean? Is it like breaking customers into pieces so we can arrange them like a puzzle?&lt;/p&gt;

&lt;p&gt;Well, you're almost right about the puzzle, but we don't mean to break anyone into parts :(. We are simply grouping customers into several clusters that share one or two behavioral similarities, to fit our campaign goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  RFM Analysis to boost the accuracy
&lt;/h2&gt;

&lt;p&gt;Thanks to Jan Roelf Bult and Tom Wansbeek, who in 1995 developed Recency, Frequency, and Monetary value (RFM) analysis, a company can optimally select which customers fit a campaign's characteristics. Here is the definition of each value:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Recency value measures how recently a customer has transacted with the company.&lt;/li&gt;
&lt;li&gt;Frequency value quantifies how frequently a customer has engaged with a brand.&lt;/li&gt;
&lt;li&gt;Monetary value calculates how much money a customer has spent on a company's products and services.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Theoretically, the customers will be segmented as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low Value: customers who are less active than others, buy or visit infrequently, and generate very low, zero, or perhaps negative revenue. Here, this is cluster 1.&lt;/li&gt;
&lt;li&gt;Mid Value: in the middle of everything. These customers use our platform fairly often (though not as much as the high-value group) and generate average revenue. Here, these are clusters 2, 3, and 4.&lt;/li&gt;
&lt;li&gt;High Value: the group we don't want to lose. High revenue, high frequency, and low inactivity. Here, this is cluster 5.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OK, we're ready to dive into the dataset and start segmenting our customers. First, we have to build three calculated fields, one for each RFM metric. Since every metric is computed per customer, the Level of Detail (LOD) of each calculated field must be set to FIXED on the customer ID, so the metrics are grouped per customer without the customer ID being placed explicitly on the canvas.&lt;/p&gt;
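&lt;p&gt;As a rough analogue, a FIXED LOD expression computes an aggregate at its own level of detail and attaches it to every row regardless of what is on the view; a pure-Python sketch with hypothetical columns:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical transaction rows: (customer_id, revenue).
rows = [("C1", 10.0), ("C2", 5.0), ("C1", 2.5)]

# Step 1: aggregate per customer, like {FIXED [Customer ID]: SUM([Revenue])}.
per_customer = defaultdict(float)
for customer_id, revenue in rows:
    per_customer[customer_id] += revenue

# Step 2: broadcast the fixed aggregate back onto every row.
rows_with_lod = [
    (customer_id, revenue, per_customer[customer_id])
    for customer_id, revenue in rows
]
print(rows_with_lod)
```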

&lt;h2&gt;
  
  
  Creating RFM Calculated Field
&lt;/h2&gt;

&lt;p&gt;A company can obtain the Recency value by taking the difference between the date of a customer's most recent transaction and the date of their first transaction. The calculation of the Recency column is as follows.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NXjmZeCH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkicrjbhuwfhtjo2xzug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NXjmZeCH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkicrjbhuwfhtjo2xzug.png" alt="Recency Value" width="880" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, it's time to measure each customer's frequency at this company by counting the distinct invoice numbers per customer ID, so the column is created like this.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y7H8v8Ln--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g4cck4wr0upqooa69jsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y7H8v8Ln--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g4cck4wr0upqooa69jsb.png" alt="Frequency Value" width="565" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last is the amount of money each customer spends across all transactions with the company, i.e., the Monetary value. This column is formed by multiplying the quantity of goods purchased by the unit price, shown here in the Revenue column.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KU6-Rkzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f6chlc3j7zideaex5rtp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KU6-Rkzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f6chlc3j7zideaex5rtp.png" alt="Monetary Value" width="504" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Percentile on Duty
&lt;/h2&gt;

&lt;p&gt;Now it's time to call in percentiles to segment customers on each RFM value. Recall that percentiles divide your data into hundredths, which makes clustering the customers straightforward. Make sure the resulting field is a dimension whose values for all customers range from cluster 1 to cluster 5. The resulting calculation for the RFM values, Monetary in particular, is as follows.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hUL9EJek--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f9ob2l7seliov2vvsr79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hUL9EJek--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f9ob2l7seliov2vvsr79.png" alt="Clustering Monetary Dimension" width="853" height="249"&gt;&lt;/a&gt; &lt;br&gt;
Do the same for the Recency and Frequency values. Then it's time to visualize.&lt;/p&gt;
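&lt;p&gt;The percentile-based bucketing into clusters 1 through 5 can be sketched with the standard library; the Monetary values below are hypothetical:&lt;/p&gt;

```python
import bisect
import statistics

# Hypothetical per-customer Monetary values.
monetary = {"C1": 10.0, "C2": 250.0, "C3": 40.0, "C4": 5.0, "C5": 90.0}

# Quintile boundaries (20th, 40th, 60th, 80th percentiles) over all customers.
cuts = statistics.quantiles(monetary.values(), n=5)

# Cluster 1..5: position of each value among the percentile boundaries.
clusters = {
    cid: bisect.bisect_left(cuts, value) + 1 for cid, value in monetary.items()
}
print(clusters)
```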

&lt;h2&gt;
  
  
  Appealing Visualization
&lt;/h2&gt;

&lt;p&gt;Charts are all our friends, so let's visualize the behavior-based segmentation with a bar chart that counts how many customers fall into each cluster. Remember: before we can build the bar chart, the three measures, plus a distinct count of customer IDs, must be mapped across the clusters. The five resulting clusters can then be seen on one page in the image below. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WuTF9j94--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w08e2ys2scwe42pyn54r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WuTF9j94--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w08e2ys2scwe42pyn54r.png" alt="Clustering Customer" width="880" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Access All in One Dashboard
&lt;/h2&gt;

&lt;p&gt;Clustering becomes truly useful once we know the other characteristics of each cluster. You can open &lt;a href="https://public.tableau.com/app/profile/daud.fernando/viz/ProgresRFM/RFMAnalysis"&gt;this link&lt;/a&gt; to decide which segment to select for a particular campaign. Ciao!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
