Introduction:
Whenever I hear stories about stars far, far away from Earth (I'm talking millions of light years away, too far to be made out through any telescope), I always wonder how anyone can tell what kind of star it is.
To satisfy this curiosity, I did a little Google search and found out that every star has its own life cycle, ranging from a few million to trillions of years, and that its properties change as it ages. By measuring these properties, we can deduce the type of star.
But I'm not satisfied until I do it myself. So in this notebook I'll be writing a model that classifies a star's type based on its features. (It also serves as good practice for PyTorch.)
Dataset
The dataset consists of 6 feature columns, plus a label column that holds the star type:
- The Temperature of the star
- The Luminosity of the star
- Its Radius
- Its Absolute Magnitude
- Its General Color of Spectrum
- The Spectral Class
Everything should be obvious, however the Spectral Class might be news.
A star's spectral class is assigned based on its spectrum, that is, the pattern of absorption lines in its light. The common classes are lettered O, B, A, F, G, K and M, running from the hottest stars to the coolest.
import os
from pathlib import Path

# Use the Kaggle input directory when running on Kaggle, a local folder otherwise
iskaggle = os.environ.get("KAGGLE_KERNEL_RUN_TYPE", '')
if iskaggle:
    path = Path('../input/star-type-classification')
else:
    path = Path('data')
Here's a quick look at the data:
import pandas as pd
data = pd.read_csv(path/'Stars.csv')
data
output:
| | temperature | luminosity | radius | magnitude | color | class | label |
|---|---|---|---|---|---|---|---|
| 0 | 3068 | 0.002400 | 0.1700 | 16.12 | Red | M | 0 |
| 1 | 3042 | 0.000500 | 0.1542 | 16.60 | Red | M | 0 |
| 2 | 2600 | 0.000300 | 0.1020 | 18.70 | Red | M | 0 |
| 3 | 2800 | 0.000200 | 0.1600 | 16.65 | Red | M | 0 |
| 4 | 1939 | 0.000138 | 0.1030 | 20.06 | Red | M | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 38940 | 374830.000000 | 1356.0000 | -9.93 | Blue | O | 5 |
| 236 | 30839 | 834042.000000 | 1194.0000 | -10.63 | Blue | O | 5 |
| 237 | 8829 | 537493.000000 | 1423.0000 | -10.73 | White | A | 5 |
| 238 | 9235 | 404940.000000 | 1112.0000 | -11.23 | White | A | 5 |
| 239 | 37882 | 294903.000000 | 1783.0000 | -7.80 | Blue | O | 5 |
240 rows × 7 columns
Sampling the dataset is cool and all, but we wanna see some more important information
data.describe()
output:
| | temperature | luminosity | radius | magnitude | label |
|---|---|---|---|---|---|
| count | 240.000000 | 240.000000 | 240.000000 | 240.000000 | 240.000000 |
| mean | 10497.462500 | 107188.361635 | 237.157781 | 4.382396 | 2.500000 |
| std | 9552.425037 | 179432.244940 | 517.155763 | 10.532512 | 1.711394 |
| min | 1939.000000 | 0.000080 | 0.008400 | -11.920000 | 0.000000 |
| 25% | 3344.250000 | 0.000865 | 0.102750 | -6.232500 | 1.000000 |
| 50% | 5776.000000 | 0.070500 | 0.762500 | 8.313000 | 2.500000 |
| 75% | 15055.500000 | 198050.000000 | 42.750000 | 13.697500 | 4.000000 |
| max | 40000.000000 | 849420.000000 | 1948.500000 | 20.060000 | 5.000000 |
Quick Overview
Some values differ a lot, and these outliers are not very good for writing a model that can generalize to new data.
The smallest value for luminosity is as low as 0.00008 (that's 4 zeros!!) whereas the largest gets up to ~850,000.
Now comes a dilemma: are these overly small/huge values truly outliers that must be normalized?
The best way to know is to check.
Almost half of the dataset has its luminosity in the decimals!! With the rest being in the hundreds of thousands.
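One way to run that kind of check is to count how many values fall below 1. The sketch below uses only the ten luminosity values visible in the table preview above, not the full file, so the variable names and the resulting share are purely illustrative:

```python
import pandas as pd

# Only the ten luminosity values shown in the preview table (first and last five rows)
luminosity = pd.Series([
    0.0024, 0.0005, 0.0003, 0.0002, 0.000138,
    374830.0, 834042.0, 537493.0, 404940.0, 294903.0,
])

# Fraction of values that are "in the decimals" (below 1)
share_below_one = (luminosity < 1).mean()
print(share_below_one)
```

On the real dataframe the same idea would be (data['luminosity'] < 1).mean().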
Alright, we'll have to normalize this. I'll use the Z-score, which is very simple: you subtract the mean of the column from the original value, and divide the result by the column's standard deviation.
z = (x - μ) / σ
import torch
import matplotlib.pyplot as plt
def z_score(x): return (x - torch.mean(x)) / torch.std(x)
normalized_luminosity = z_score(torch.tensor(data.luminosity))
plt.subplot(2, 1, 1)
plt.scatter(data['luminosity'].to_numpy(), range(240))
plt.title('Before')
plt.subplot(2, 1, 2)
plt.scatter(normalized_luminosity.numpy(), range(240))
plt.title('After')
plt.tight_layout()
plt.show()
From hundreds of thousands to single digits, while still preserving the data's structure! Perfect!
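A quick sanity check on the Z-score: after the transformation, a column should have a mean of roughly 0 and a standard deviation of roughly 1. Here is a minimal sketch on a handful of made-up values (the numbers are illustrative, not taken from the dataset):

```python
import torch

def z_score(x):
    # Subtract the mean, then divide by the standard deviation
    return (x - torch.mean(x)) / torch.std(x)

# Made-up values spanning a huge range, stored as float64 to limit rounding error
values = torch.tensor([0.0001, 0.05, 3.0, 120000.0, 850000.0], dtype=torch.float64)
normalized = z_score(values)

print(normalized)
print(normalized.mean().item(), normalized.std().item())
```

The ordering and relative spacing of the values stay the same; only the scale changes.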
Preparation
Now I'll convert the dataframe into a tensor and prepare it to be chucked into a neural network.
This dataset is already very well put together. There are no null (missing) values, however there is a need for dummy columns.
So let's make this quick, and get it over with!
First up is defining the dependent and independent variables as tensors:
import numpy as np
t_dep = torch.tensor(data['label'])
t_indep = torch.tensor(data.drop(columns=['label']).astype(np.float32).values, dtype=torch.float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[22], line 4
1 import numpy as np
3 t_dep = torch.tensor(data['label'])
----> 4 t_indep = torch.tensor(data.drop(columns=['label']).astype(np.float32).values, dtype=torch.float)
File c:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
6637 results = [
6638 ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
6639 ]
6641 else:
6642 # else, only a single dtype is given
-> 6643 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
6644 res = self._constructor_from_mgr(new_data, axes=new_data.axes)
6645 return res.__finalize__(self, method="astype")
File c:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\internals\managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
427 elif using_copy_on_write():
428 copy = False
--> 430 return self.apply(
431 "astype",
432 dtype=dtype,
433 copy=copy,
434 errors=errors,
435 using_cow=using_copy_on_write(),
436 )
File c:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\internals\managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
361 applied = b.apply(f, **kwargs)
362 else:
--> 363 applied = getattr(b, f)(**kwargs)
364 result_blocks = extend_blocks(applied, result_blocks)
366 out = type(self).from_blocks(result_blocks, self.axes)
File c:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\internals\blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
755 raise ValueError("Can not squeeze with more than one column.")
756 values = values[0, :] # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
760 new_values = maybe_coerce_values(new_values)
762 refs = None
File c:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\dtypes\astype.py:237, in astype_array_safe(values, dtype, copy, errors)
234 dtype = dtype.numpy_dtype
236 try:
--> 237 new_values = astype_array(values, dtype, copy=copy)
238 except (ValueError, TypeError):
239 # e.g. _astype_nansafe can fail on object-dtype of strings
240 # trying to convert to float
241 if errors == "ignore":
File c:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\dtypes\astype.py:182, in astype_array(values, dtype, copy)
179 values = values.astype(dtype, copy=copy)
181 else:
--> 182 values = _astype_nansafe(values, dtype, copy=copy)
184 # in pandas we don't store numpy str dtypes, so convert to object
185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):
File c:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\dtypes\astype.py:133, in _astype_nansafe(arr, dtype, copy, skipna)
129 raise ValueError(msg)
131 if copy or arr.dtype == object or dtype == object:
132 # Explicit copy, or required since NumPy can't view from / to object.
--> 133 return arr.astype(dtype, copy=True)
135 return arr.astype(dtype, copy=copy)
ValueError: could not convert string to float: 'Red'
Oh golly jargon!! Damn, what went wrong?!
Let's see here..
ValueError: could not convert string to float: 'Red'
We tried to convert a word into a number!
Ah.. almost forgot math needs numbers..
Dummy Columns
data['color'].unique()
output:
array(['Red', 'Blue White', 'White', 'Yellowish White', 'Blue white',
'Pale yellow orange', 'Blue', 'Blue-white', 'Whitish',
'yellow-white', 'Orange', 'White-Yellow', 'white', 'yellowish',
'Yellowish', 'Orange-Red', 'Blue-White'], dtype=object)
As we can see, the color column has 17 distinct values, although several of them are the same color written with different capitalization or hyphenation (for example 'Blue White', 'Blue white', 'Blue-white' and 'Blue-White').
However, since we can only accept numbers, an easy fix is using dummy columns.
I'll give you an example: say we had data for a group of people. One of the columns is "Gender", where each person could have the value "Female" or "Male". To turn these into numbers, we'll create 2 new columns:
- "Female?"
- "Male?"
Say you have Bob, a professional male specializing in being a man. His "Female?" column would be "0" for false, and his "Male?" column would be "1" for true.
Boom!! Problem solved.
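The gender example can be sketched in a few lines with pd.get_dummies. The dataframe below is made up for illustration (the name "Alice" and the column names are assumptions), and pandas names the new columns gender_Female and gender_Male rather than "Female?" and "Male?":

```python
import pandas as pd

# Hypothetical toy data, just to show what dummy columns look like
people = pd.DataFrame({
    "name": ["Bob", "Alice"],
    "gender": ["Male", "Female"],
})

# Replace the text column with one true/false column per distinct value
dummies = pd.get_dummies(people, columns=["gender"])
print(dummies)
```

Bob's row ends up false in gender_Female and true in gender_Male, which is exactly the 0/1 idea described above.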
So to apply this here, we'll just make one column for each of the 17 color values (and do the same for the spectral class).
data = pd.get_dummies(data, columns=['color', 'class'])
data[["color_Red", "color_Blue White", "color_White", "color_Yellowish White", "color_Blue white", "color_Pale yellow orange", "color_Blue", "color_Blue-white", "color_Whitish", "color_yellow-white", "color_Orange", "color_White-Yellow", "color_white", "color_yellowish", "color_Yellowish", "color_Orange-Red", "color_Blue-White"]].sample(5)
output:
| | color_Red | color_Blue White | color_White | color_Yellowish White | color_Blue white | color_Pale yellow orange | color_Blue | color_Blue-white | color_Whitish | color_yellow-white | color_Orange | color_White-Yellow | color_white | color_yellowish | color_Yellowish | color_Orange-Red | color_Blue-White |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 202 | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False |
| 56 | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 103 | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False |
| 176 | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False |
| 8 | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
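A side note on the color list printed earlier: several entries differ only in capitalization or in using a hyphen instead of a space, so each spelling gets its own dummy column. The notebook keeps them as they are. As an optional cleanup that is not part of the original notebook, the labels could be normalized before calling get_dummies. A minimal sketch, using the 17 strings exactly as printed above:

```python
import pandas as pd

# The 17 labels exactly as they appear in the unique() output
colors = pd.Series([
    'Red', 'Blue White', 'White', 'Yellowish White', 'Blue white',
    'Pale yellow orange', 'Blue', 'Blue-white', 'Whitish',
    'yellow-white', 'Orange', 'White-Yellow', 'white', 'yellowish',
    'Yellowish', 'Orange-Red', 'Blue-White',
])

# Lowercase, treat hyphens as spaces, and trim stray whitespace
cleaned = colors.str.lower().str.replace('-', ' ', regex=False).str.strip()

print(sorted(cleaned.unique()))
print(cleaned.nunique())
```

Whether near-synonyms such as "whitish" and "white" should also be merged is a judgment call this sketch leaves alone.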
Phew, let's try this again now.
import numpy as np
t_dep = torch.tensor(data['label'])
t_indep = torch.tensor(data.drop(columns=['label']).astype(np.float32).values, dtype=torch.float)
t_indep = (t_indep - t_indep.mean()) / t_indep.std()
No errors! Eureka!
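One thing worth noticing in the cell above: calling mean() and std() on a tensor with no dim argument reduces over every entry, so all columns share a single mean and standard deviation instead of each column getting its own, which is what the Z-score description earlier talks about. A minimal sketch of the difference on a made-up two-column tensor (the values and the names global_z and col_z are illustrative, not from the notebook):

```python
import torch

# Toy stand-in for t_indep: 4 rows, 2 feature columns on very different scales
x = torch.tensor([
    [3000.0, 0.1],
    [6000.0, 0.2],
    [9000.0, 0.3],
    [12000.0, 0.4],
])

# One mean and one std for the whole tensor (what the cell above does)
global_z = (x - x.mean()) / x.std()

# One mean and one std per column (dim=0 reduces over the rows).
# Note: a constant column would have std 0 and need special handling.
col_z = (x - x.mean(dim=0)) / x.std(dim=0)

print(global_z)
print(col_z)
```

With the global version the small-scale column barely varies after scaling, while the per-column version puts both columns on the same footing. Whether this would change the results reported further down is untested here.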
Defining the Neural Network
For the math-focused nerds, you can imagine a neural network simply as one big ol' composite function. It has multiple layers, each layer contains multiple units, and each layer takes in a matrix as input and multiplies it by a matrix of coefficients.
The main important thing to remember about the coefficients is their matrix dimensions. Let's say the matrix of coefficients has the shape [x, y]:
- x: Number of units of previous layer
- y: Number of units of next layer
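The shape rule above can be sketched with a single layer. The sizes below are arbitrary and chosen only for illustration:

```python
import torch

n_rows, n_inputs, n_hidden = 5, 4, 10  # illustrative sizes only

x = torch.randn(n_rows, n_inputs)      # 5 samples with 4 features each
w = torch.randn(n_inputs, n_hidden)    # [units in previous layer, units in next layer]
b = torch.randn(1, n_hidden)           # one constant per unit, broadcast over the rows

out = x @ w + b
print(out.shape)
```

The inner dimensions (4 and 4) have to match for the matrix multiplication to work, and the output has one column per unit of the next layer.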
The first function we must define is one to initialize the coeffs.
We'll set the number of hidden layers and the units in each layer. Then, we'll loop over the layers to create randomized matrices of parameters and constants in the correct shape.
Quick note:
The values in the matrices of parameters MUST be randomized. If every parameter started at the same value (such as 0), every unit in a layer would compute the same output and receive the same gradient during backpropagation, so the units could never learn different things and training would effectively be cancelled out.
Initializing the coeffs
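To see why identical starting values are a problem, here is a small sketch, assuming a made-up two-layer network where every hidden unit starts with the same weights. All sizes, values and names are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(8, 3)                    # positive inputs, so the ReLU stays active
y = torch.ones(8, dtype=torch.long)     # every sample labelled as class 1

# Four hidden units that all start with exactly the same weights
w1 = torch.full((3, 4), 0.5, requires_grad=True)
# Output weights: the same row for every hidden unit
w2 = torch.tensor([[0.5, -0.5]] * 4, requires_grad=True)

loss = F.cross_entropy(F.relu(x @ w1) @ w2, y)
loss.backward()

# Each column holds the gradient of one hidden unit
print(w1.grad)
```

Every column of w1.grad comes out the same, so after an update step the four hidden units would still be identical copies of each other. Random initialization breaks that symmetry.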
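n_coeffs = t_indep.shape[1]

def init_coeffs():
    hiddens = [10, 10]  # Units per hidden layer; add or remove entries to change depth
    sizes = [n_coeffs] + hiddens + [6]  # Inputs, hidden layers, 6 output classes (labels 0-5)
    n = len(sizes)
    # Small random weights, one [inputs, outputs] matrix per layer
    layers = [(torch.randn(sizes[i], sizes[i+1])) * 0.1 for i in range(n-1)]
    # One row of constants per layer
    consts = [torch.randn(1, sizes[i+1]) for i in range(n-1)]
    for layer in layers + consts: layer.requires_grad_()
    return layers, consts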
If this was NumPy, I would be forced to calculate the derivatives for all parameters in the neural network myself. Yes, I'd automate it in a loop. But since I really only have an intuitive understanding of derivatives, I'd never fully understand what's going on.
Luckily for us, in PyTorch, the derivatives are automatically calculated as long as we define the cost function!! While initializing the coeffs, simply adding the line layer.requires_grad_() told PyTorch to start tracking each layer so it can calculate its gradients.
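The smallest possible demonstration of that gradient tracking, on a toy function rather than the network itself:

```python
import torch

# A single number that PyTorch should track
x = torch.tensor(3.0, requires_grad=True)

# Some function of x; here y = x squared
y = x ** 2

# backward() fills in x.grad with dy/dx = 2 * x
y.backward()
print(x.grad)  # tensor(6.)
```

The same mechanism works for whole matrices of coefficients: call backward() on the loss, then read .grad on each tracked tensor.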
import torch.nn.functional as F

def calc_preds(coeffs, indeps):
    layers, consts = coeffs
    n = len(layers)
    y_pred = indeps
    for i, layer in enumerate(layers):
        y_pred = y_pred @ layer + consts[i]
        # ReLU on every layer except the last one
        if i != n-1: y_pred = F.relu(y_pred)
    logits = y_pred  # Raw scores, used by the loss function
    preds = F.softmax(y_pred, dim=1)  # Probabilities for each class
    return preds, logits
So what exactly did we do here? To answer, not much.
We simply iterated over each layer (containing the randomized coeffs) and matrix-multiplied the input independents by the coeffs.
The output of one layer becomes the input to the next, and the ReLU (Rectified Linear Unit) is applied to the output of every layer except the last.
The ReLU is simply a linear function that's cut off at zero, meaning that any number less than 0 will be turned into a zero.
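A quick look at the ReLU on a few made-up numbers:

```python
import torch
import torch.nn.functional as F

values = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

# Negative entries become 0, everything else passes through unchanged
activated = F.relu(values)
print(activated)
```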
The ReLU and Tanh activation functions are the most common for neural network hidden layers. I picked ReLU for this example because it's the one I've used the most. We then finish it off with a sigmoid activation to get our binary prediction.
Ah ah ah!! Stop right there. If we were doing binary classification, we'd use a Sigmoid function. But since we have multiple outputs (multiple types of stars) this is a multiclass classification problem! We must use the Softmax.
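What the Softmax does, sketched on two made-up rows of raw scores for three classes:

```python
import torch
import torch.nn.functional as F

# Raw scores (logits): 2 samples, 3 classes
logits = torch.tensor([
    [2.0, 1.0, 0.1],
    [0.5, 0.5, 0.5],
])

# dim=1 normalizes across the classes of each sample
probs = F.softmax(logits, dim=1)

print(probs)
print(probs.sum(dim=1))  # each row sums to 1
```

The largest score gets the largest probability, and equal scores get equal probabilities. Taking the argmax of each row gives the predicted class.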
Updating The Parameters
PyTorch automatically tracks gradients, so subtracting the gradients is as easy as calling sub_() with the gradient stored in .grad :D
def update_coeffs(coeffs, lr):
    layers, consts = coeffs
    for layer in layers + consts:
        if layer.grad is not None:
            layer.sub_(layer.grad * lr)  # Step against the gradient
            layer.grad.zero_()  # Reset so gradients don't pile up across epochs
    return layers, consts
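The same update, sketched on a single made-up parameter so the arithmetic can be followed by hand:

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
lr = 0.1

# Toy loss with its minimum at w = 3
loss = (w - 3.0) ** 2
loss.backward()              # w.grad = 2 * (0 - 3) = -6

# In-place updates on tracked tensors must happen outside gradient tracking
with torch.no_grad():
    w.sub_(w.grad * lr)      # 0 - (-6 * 0.1) = 0.6
    w.grad.zero_()           # clear the gradient before the next step

print(w)
```

One step moves w from 0 toward 3. Repeating the loop keeps nudging it closer, which is all that training does for every coefficient at once.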
Crucial Step Needed
We've talked about PyTorch automatically tracking the gradients.. But it can't do any of that without the loss function!
So, let's refresh our memory: this is a multiclass classification problem.
Thus, an appropriate loss function would be the Categorical Cross-Entropy. However, CCE uses one-hot encoded labels. Frankly, I feel too lazy to add in one-hot encoding. So instead, we'll use Sparse CCE, which works with basic integer labels.
Sparse Categorical Cross Entropy
Here's something about me, I HAAAATE jargon.
The loss function's name is "Sparse Categorical Cross-Entropy" and I think that's the stupidest thing ever.
All this is, is fancy-worded jargon meant to boost the egos of those who use it.
The downside is that this deters so many people from machine learning because of how complicated it sounds.
In reality, "Sparse Categorical Cross-Entropy" for a single sample is defined as follows, where P(y) is the probability the model gave to the correct class y:
L = -log(P(y))
wow. how complicated.
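To show it really is that simple, here is a sketch that computes the formula by hand for one made-up prediction and compares it with PyTorch's built-in cross-entropy. The probabilities are illustrative, and the logarithm is the natural log:

```python
import torch
import torch.nn.functional as F

# Made-up predicted probabilities for one star across 3 classes
probs = torch.tensor([[0.7, 0.2, 0.1]])
label = torch.tensor([0])  # the correct class, as a plain integer

# L = -log(P(y)): look up the probability of the correct class
manual = -torch.log(probs[0, label[0]])

# cross_entropy expects raw scores; log(probs) works because the probabilities sum to 1
builtin = F.cross_entropy(torch.log(probs), label)

print(manual.item(), builtin.item())
```

Both give about 0.357. A confident correct prediction (P(y) close to 1) gives a loss near 0, and a confident wrong one gives a large loss.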
Remainder of the code used:
def accuracy(coeffs, t_dep, t_indep):
    preds, logits = calc_preds(coeffs, t_indep)
    predicted_classes = torch.argmax(preds, dim=1)  # Most likely class per star
    correct = (predicted_classes == t_dep).float()
    return correct.mean().item()

def one_epoch(t_dep, t_indep, coeffs, lr):
    preds, logits = calc_preds(coeffs, t_indep)
    # CrossEntropyLoss takes the raw logits and integer labels
    loss = torch.nn.CrossEntropyLoss()(logits, t_dep)
    loss.backward()
    with torch.no_grad(): return update_coeffs(coeffs, lr)

def train_model(t_dep, t_indep, epochs=300000, lr=0.00055, loss_arr=None, acc_arr=None):
    # Create fresh lists per call; mutable default arguments are shared between calls
    loss_arr = [] if loss_arr is None else loss_arr
    acc_arr = [] if acc_arr is None else acc_arr
    torch.manual_seed(777)
    coeffs = init_coeffs()
    for i in range(epochs):
        coeffs = one_epoch(t_dep, t_indep, coeffs, lr)
        # Log the loss and accuracy every 1000 epochs
        if i % 1000 == 0:
            preds, logits = calc_preds(coeffs, t_indep)
            loss = torch.nn.CrossEntropyLoss()(logits, t_dep)
            acc = accuracy(coeffs, t_dep, t_indep) * 100
            print(f"Iteration: {i:03d} | Loss: {loss:.4f} | Accuracy: {acc:.2f}%")
            loss_arr.append(loss.item())
            acc_arr.append(acc)
    return coeffs, loss_arr, acc_arr

coeffs, loss_arr, acc_arr = train_model(t_dep, t_indep)
Iteration: 299000 | Loss: 0.6482 | Accuracy: 65.83%
Showing the results:
import matplotlib.pyplot as plt

# One logged point per 1000 epochs
plt.plot(range(len(loss_arr)), loss_arr)
plt.xlabel("Iterations (in thousands)")
plt.ylabel("Loss")
plt.title("Variation of Loss During Training")
plt.show()

plt.plot(range(len(acc_arr)), acc_arr)
plt.xlabel("Iterations (in thousands)")
plt.ylabel("Accuracy (%)")
plt.title("Variation of Accuracy During Training")
plt.show()
Output:
This Was PyTorch



