How I built my first machine learning app

The diamonds dataset

I searched my Kaggle repository for CSV files to analyze in my notebook because I wanted to put my data science and programming skills into practice. I looked at the size and dimensionality of the datasets and asked myself, "Is this one worth working on?" I want to take on projects that can meet real-world needs, not just anything I find. My preference is for datasets with over 10,000 entries and any number of columns, as long as the features are relevant, there are few null values, and there is a significant degree of cardinality.
I came across an exciting dataset consisting of information about 54,000 diamonds. I looked through it on a spreadsheet before giving it a shot on Google Colab. A quick inspection sketch follows the feature list below.
It had the following features:

  • Price: The price of a diamond in USD.
  • Carat: The weight of a diamond in kt.
  • Cut: The quality of the cut. In increasing order, the values are fair, good, very good, premium, and ideal.
  • Clarity: The clarity of the diamond. From worst to best, the values are SI2, SI1, VS2, VS1, VVS2, VVS1, and IF.
  • Color: The quality of the colour. From worst to best, the values are J to D.
  • x, y, z: The dimensions of a diamond, in mm.
  • Volume: The product of the three dimensions (x × y × z), in mm³.
  • Depth: The height of a diamond, from the culet to the table.
  • Table: The width of the diamond's table.
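
For anyone following along, here is a minimal sketch of that first inspection, assuming the CSV is saved as diamonds.csv (the filename is illustrative) and loaded into a DataFrame named Df, as in the rest of this post:

import pandas as pd

# Load the Kaggle CSV (the filename here is an assumption)
Df = pd.read_csv('diamonds.csv')

# Size and dimensionality
print(Df.shape)

# Null values per column
print(Df.isnull().sum())

# Cardinality (number of distinct values) per column
print(Df.nunique())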

Preprocessing and exploratory analysis

I believe that price is the most important feature because all the other features significantly influence its value. The monetary worth of a diamond is the first thing that comes to mind for most people who possess it. It thus makes sense that the goal of our project is to analyze the inherent factors that influence the price and to build a machine-learning application that can predict the price of a diamond based on specific characteristics and dimensions.

The distribution of price

The histogram distribution of the diamonds according to their prices
The histogram above shows that price is exponentially distributed. This implies that the majority of the diamonds are very cheap, and that the number of diamonds decreases as the price increases. This makes sense because only a few diamonds in existence are expensive.
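
For reference, a histogram like the one above only needs a line or two; this is a sketch assuming matplotlib, with an arbitrary bin count:

import matplotlib.pyplot as plt

# Histogram of diamond prices; the bin count is an arbitrary choice
plt.hist(Df['price'], bins=50)
plt.xlabel('Price (USD)')
plt.ylabel('Number of diamonds')
plt.title('Distribution of diamond prices')
plt.show()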

Exploring categorical features

The dataset has three categorical features: cut, colour, and clarity. The question is: how are the values of these qualities distributed?
Stacked horizontal chart showing the distribution of categorical values
The chart above shows that in the cut category most of the diamonds have the best quality value ('ideal'), while the colour and clarity features have an almost even distribution, although they tend toward average quality.
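
A chart like this can be produced by normalising the value counts of each categorical feature; here is a sketch assuming pandas' built-in plotting (backed by matplotlib):

import matplotlib.pyplot as plt
import pandas as pd

# Share of each value within cut, colour, and clarity
cats = ['cut', 'color', 'clarity']
shares = pd.DataFrame({c: Df[c].value_counts(normalize=True) for c in cats}).T.fillna(0)

# One stacked horizontal bar per categorical feature
shares.plot(kind='barh', stacked=True, figsize=(8, 3))
plt.xlabel('Share of diamonds')
plt.show()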

The influence of the categorical values on price

The influence of clarity, colour, and cut on price can be described with multiple bar charts that show the relationship between each category's values and their mean and median prices. Below is the price relationship among the clarity values.
The distribution of price among the clarity values
As the quality of the clarity increases, there is a sharp decrease in the number of diamonds at each price level.
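
Each of these charts can be sketched by grouping prices by category and plotting the mean and median; here is a minimal sketch for clarity, assuming matplotlib (the same approach works for colour and cut):

import matplotlib.pyplot as plt

# Mean and median price for each clarity value
clarity_stats = Df.groupby('clarity')['price'].agg(['mean', 'median'])

# Side-by-side bars for the mean and median price of each clarity value
clarity_stats.plot(kind='bar', figsize=(8, 4))
plt.ylabel('Price (USD)')
plt.title('Mean and median price by clarity')
plt.show()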
The distribution of price among the color values
The number of diamonds begins to drop from G toward the worst diamond colour quality. However, it is noticeable that the mean and median increase from best to worst.
Finally, I looked at the cut category.
The distribution of price among the cut values
Compared with the other categorical features, cut does not show much noticeable impact. This prompted me to perform a Kruskal-Wallis test on the categorical values in relation to price.

from scipy.stats import kruskal

#group the prices into one list per category value
cut_group = Df.groupby('cut')['price'].apply(lambda x: x.values.flatten().tolist())
color_group = Df.groupby('color')['price'].apply(lambda x: x.values.flatten().tolist())
clarity_group = Df.groupby('clarity')['price'].apply(lambda x: x.values.flatten().tolist())

#apply the tests
kruskal_cut = kruskal(*cut_group)
print("Kruskal-Wallis Test for 'cut' groups:")
print("Statistic:", kruskal_cut.statistic)
print("P-value:", kruskal_cut.pvalue)
print()

kruskal_color = kruskal(*color_group)
print("Kruskal-Wallis Test for 'color' groups:")
print("Statistic:", kruskal_color.statistic)
print("P-value:", kruskal_color.pvalue)
print()

kruskal_clarity = kruskal(*clarity_group)
print("Kruskal-Wallis Test for 'cut' groups:")
print("Statistic:", kruskal_clarity.statistic)
print("P-value:", kruskal_clarity.pvalue)

The statistical test on the categorical values
It turned out that the null hypothesis was rejected for all three categorical variables. This means there are significant differences among the median prices.

Exploring the numeric features

The numeric features include the dimensions and the weight. The most important numeric feature is the weight (carat), because a heavier diamond implies a bigger diamond, which increases the price. Excluding the other known factors, carat is expected to be directly correlated with price.
The distribution of carats
The above chart shows that the number of diamonds decreases as the carat weight increases. This means that many of the diamonds are very small in size and weight.

The next thing that comes to mind is how the diamonds are distributed among prices and carats. Below is a histogram heatmap showing how more of the available diamonds are concentrated in the darker areas. The chart on the right indicates what the price looks like after log transformation is applied to it. I called the transformed variable "log-price".
The distribution of diamonds among carats and price
The chart is consistent with the number of diamonds getting fewer as weight and price increase. That is why the far right side of the chart, at four or five carats, has extremely few diamonds, and the darker parts represent the cheapest and lightest diamonds.
The log transformation compresses the price data and transforms its distribution from exponential to log-normal. The log transformation will be needed later in the machine-learning process. I concluded that most of the diamonds are very small and cheap.
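
Here is a rough sketch of the log transformation and the two-panel histogram heatmap, assuming the log is the natural log and matplotlib is used for plotting:

import numpy as np
import matplotlib.pyplot as plt

# Natural log of price; this is the "log-price" variable used later on
Df['log_price'] = np.log(Df['price'])

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# 2D histogram of carat against the raw price
axes[0].hist2d(Df['carat'], Df['price'], bins=50)
axes[0].set_xlabel('Carat')
axes[0].set_ylabel('Price (USD)')

# 2D histogram of carat against the log-transformed price
axes[1].hist2d(Df['carat'], Df['log_price'], bins=50)
axes[1].set_xlabel('Carat')
axes[1].set_ylabel('log(price)')

plt.show()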

Exploring the dimensions

The effect of depth and table on price and carat
The chart above gives no clear impression of how table and depth affect price. Even the outliers are as cheap and small as most of the diamonds in the cluster.
However, let's take a look at the x, y, and z dimensions to see if they have an effect. Rather than working on each axis separately, it was better to multiply the three lengths together and store the result in a new variable called volume.

Df['volume'] = Df.x * Df.y * Df.z

Volume is a proxy for size. We can therefore assume that a bigger diamond is a heavier diamond, which means a more expensive diamond. Below is a heatmap showing the strong positive correlation between the price, log-price, volume, and carat features.

Correlation heatmap showing the effect of size and weight on price
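A heatmap like this can be sketched from the pairwise correlations; seaborn is an assumption here:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the price- and size-related features
corr = Df[['price', 'log_price', 'carat', 'volume']].corr()

sns.heatmap(corr, annot=True, fmt='.2f', cmap='Blues')
plt.title('Correlation between price, log-price, carat, and volume')
plt.show()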

Summary of the analysis

  • Price was identified to be the most important feature for the project.
  • The price of the diamonds is exponentially distributed. The diamonds become more abundant as the price decreases.
  • The categorical features: colour, cut, and clarity affected price in order of their quality.
  • The Kruskal-Wallis test rejected the null hypothesis for each of them, confirming significant differences in median price.
  • Log transformation was used to rescale the price distribution.
  • The weights (in carats) were exponentially distributed. The diamonds become more abundant as the carat weight decreases.
  • Table and depth did not display any remarkable effect on the price.
  • By multiplying the lengths of all the axes, x, y, and z, we get the volume of the diamond.
  • The price, the log of price, the volume, and the carat are strongly positively correlated.

Machine Learning process

During the analysis, I was able to identify the relevant features: the cut, colour, clarity, carat, and price variables. I also added new features: log-price, and volume. I thought of two ML algorithms to apply: random forests and polynomial linear regression.

Polynomial linear regression

Because our price data is exponentially distributed, I used log-price as my predicted variable, with carat and volume as my predictor variables. When I split my data into training and testing sets, this is what it looked like:
Training and testing scatterplot
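A scatterplot like the one above can be sketched by colouring the training and testing points differently; this assumes the Df_reg DataFrame used in the function below:

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Split carat and log-price into training and testing sets
X_tr, X_te, y_tr, y_te = train_test_split(
    Df_reg['carat'], Df_reg['log_price'], test_size=0.15, random_state=42)

# Training points in one colour, testing points in another
plt.scatter(X_tr, y_tr, s=2, alpha=0.3, label='train')
plt.scatter(X_te, y_te, s=2, alpha=0.3, label='test')
plt.xlabel('Carat')
plt.ylabel('log(price)')
plt.legend()
plt.show()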
I wrote a function to predict the price of a diamond:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import SplineTransformer

def predict_diamond_price(col, deg=2, k=2, dim=1):

  X_Train, X_Test, y_train, y_test = train_test_split(Df_reg[col], Df_reg['log_price'], test_size=0.15, random_state=42)

  #reshape the data
  X_train = X_Train.values.reshape(-1, dim)
  X_test = X_Test.values.reshape(-1, dim)

  #create spline object
  spline = SplineTransformer(degree=deg, n_knots=k)

  #Fit the transformer on the training data and transform both the training and test sets
  X_train_spline = spline.fit_transform(X_train)
  X_test_spline = spline.transform(X_test)

  #Train a linear regression model on the spline-transformed features
  spline_reg = LinearRegression()
  spline_reg.fit(X_train_spline, y_train)

  #implement cross validation
  scores = cross_val_score(spline_reg, X_train_spline, y_train, cv=5, scoring='r2')

  #predict the prices using the transformed test data
  y_pred = spline_reg.predict(X_test_spline)

  #get the loss functions
  r_sq = spline_reg.score(X_test_spline, y_test)
  mse = mean_squared_error(y_test, y_pred)
  rmse = np.sqrt(mse)

  #print the resulting metrics
  print('Cross-validated R-squared score:', scores.mean())
  print("Mean Squared Error:", mse)
  print("R-squared:", r_sq)
  print("Root Mean Squared Error: ", rmse)
  print("Intercept:", spline_reg.intercept_)
  print("Slope:", spline_reg.coef_)

  return X_Train, X_Test, y_train, y_test, spline, spline_reg, y_pred

The steps in the code to build the model are as follows:

  • Reshape the independent variables.
  • Create an instance of a spline transformer.
  • Use the spline transformer's instance to fit and transform the variables into polynomial form.
  • Create an instance of the linear regression class.
  • Fit the linear regression object on the transformed variables.
  • Create a predictor variable that will be returned.
  • Get the loss functions.

Let's evaluate the model's performance with volume and carat:

#predict diamond prices based on carat
print("Evaluation for price against carat")
price_carat = predict_diamond_price('carat')
c_diamond = predict_diamond_price_distribution(price_carat, list(Df_reg['carat']))

#predict diamond prices based on volume
print("\n\nEvaluation of price against volume")
price_volume = predict_diamond_price('volume')
v_diamond = predict_diamond_price_distribution(price_volume, list(Df_reg['volume']))

The results
We got an R-squared of almost 93%, which is quite OK.
After that, I wrote some code to make predictions on the entire price distribution and made comparisons.
Actual prices vs predicted prices
One reason for the roughness is that the model fails to capture sparse data points after the transformation is reversed.
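
Reversing the log transformation gives predictions back in USD for a comparison like the one above. A rough sketch, assuming log-price is the natural log of price (the predict_diamond_price_distribution helper used earlier is not shown in this post):

import numpy as np
import matplotlib.pyplot as plt

# Unpack the values returned for the carat-based model
X_Train, X_Test, y_train, y_test, spline, spline_reg, y_pred = price_carat

# Undo the log transform to get prices back in USD
actual_usd = np.exp(y_test)
predicted_usd = np.exp(y_pred)

# Compare the actual and predicted price distributions
plt.hist(actual_usd, bins=50, alpha=0.5, label='actual')
plt.hist(predicted_usd, bins=50, alpha=0.5, label='predicted')
plt.xlabel('Price (USD)')
plt.ylabel('Number of diamonds')
plt.legend()
plt.show()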

Random forest

After implementing hyperparameter optimizations, I built a random forest model with the following steps:

  • Select the variables.
  • Use an ordinal encoder to encode the categorical variables.
  • Instantiate a random forest regressor.
  • Deploy KFold cross-validation.
  • Create a predictor variable.
  • Get the loss function, timer, and feature importances.

import time

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Split the data into training and testing sets
X = Df_rt.drop(columns=['price', 'log_price', 'volume'])
y = Df_rt['log_price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

encoder = OrdinalEncoder(categories='auto')
X_train[['cut', 'color', 'clarity']] = encoder.fit_transform(X_train[['cut', 'color', 'clarity']])
X_test[['cut', 'color', 'clarity']] = encoder.transform(X_test[['cut', 'color', 'clarity']])

# Initialize the random forest regressor
rf = RandomForestRegressor(n_estimators=200, oob_score=True, max_depth=15, random_state=42)

start_time = time.time()
rf.fit(X_train, y_train)

kfold = KFold(n_splits=5)
# Cross validate
scores = cross_validate(
  estimator=rf, X=X_train, y=y_train, cv=kfold, scoring='neg_mean_squared_error', n_jobs=-1, verbose=2)

end_time = time.time()
# Make predictions on the testing data
y_pred = rf.predict(X_test)

# Evaluate the performance of the random forest regressor using mean squared error
mse = mean_squared_error(y_test, y_pred)
rmse = mse**0.5

# Print the results
print(f"Training took {end_time - start_time:.2f} seconds to complete.")
print("Random forest score", rf.score(X_train, y_train))
print("OOB Score", rf.oob_score_)
print("OOB error", 1 - rf.oob_score_)
print("Random forest mean squared error:", mse)
print("Random forest root mean squared Error", rmse)
print("Cross validation mean squared error:", -1*scores['test_score'])
print("Cross validation root mean squared error:", (-1*scores['test_score'])**0.5)

# Get feature importances
importances = pd.DataFrame({'Feature': X_train.columns, 'Importance': rf.feature_importances_})
importances = importances.sort_values('Importance', ascending=False)

print('\nFeature Importances:') 
print(importances)

The results
The results show that the random forest model performed far better than the polynomial regression model. So I used the model to predict a price distribution.
Actual distribution vs predicted distribution
It looks smoother and fits much better compared to the linear regression model. The next thing to look at is the model's feature importance.
Feature importance
The result above shows that carat remains the most powerful determinant of price.

Summary of the machine learning process

  • Two machine learning algorithms were involved: polynomial linear regression and random forest.
  • The independent variables were carat, volume, cut, colour, and clarity.
  • The dependent variable was log-price for both algorithms.
  • Carat and volume were separately used as independent variables for the linear regression model.
  • A spline transformer with degree 2 was used to transform the numeric variables in the linear regression model.
  • The linear regression model was evaluated with 5-fold cross-validation.
  • The predicted price distribution for both volume and carats was not very smooth.
  • The model achieved an R-squared of 0.93 on the test set.
  • An ordinal encoder was used to encode the values in the categorical variables.
  • The random forest model was trained along with KFold cross-validation.
  • The random forest model achieved an MSE and an OOB error of about 0.01 each.
  • The feature importance bar chart showed that carat had the greatest influence by a wide margin.
  • The predicted price distribution for our random forest model was smoother and fitted better.

Deployment

After this, I built a web app with Streamlit to deploy the random forest model. I created a deployment file that trains and caches the model, then built the main source file, main.py, to launch the app from.
The user selects the feature characteristics from drop-down menus and the carat weight with a slider, and the diamond's price is predicted using the cached model. I then hosted the app on Streamlit.
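
Here is a minimal sketch of what such a main.py could look like; the deploy module, its train_model function, and the exact widget layout are illustrative assumptions rather than the app's actual code:

import numpy as np
import streamlit as st

from deploy import train_model  # hypothetical deployment module that trains the random forest and its encoder

@st.cache_resource
def get_model():
    # Train once; Streamlit reuses the cached model and encoder on every rerun
    return train_model()

model, encoder = get_model()

st.title('Diamond price predictor')

# Feature characteristics chosen from drop-downs
cut = st.selectbox('Cut', ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'])
color = st.selectbox('Colour', ['J', 'I', 'H', 'G', 'F', 'E', 'D'])
clarity = st.selectbox('Clarity', ['SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'])

# Carat weight chosen with a slider
carat = st.slider('Carat', min_value=0.2, max_value=5.0, value=1.0, step=0.01)

if st.button('Predict price'):
    # Encode the categorical values as in training; the feature order below is
    # illustrative and must match the columns the model was trained on
    encoded = encoder.transform([[cut, color, clarity]])[0]
    log_price = model.predict([[carat, *encoded]])[0]
    st.write(f'Estimated price: ${np.exp(log_price):,.2f}')

Locally, the app is launched with streamlit run main.py.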

Conclusion

There were some steps, such as removing outliers, that I left out of this write-up for relevance and space. There are different and perhaps better ways to build the model. The app is a single web page and does not have many features because the program is not yet scaled for commercial purposes. However, if there were a need to make it commercially useful, say for a jewellery enterprise, the whole app would be built differently.
