<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Narendra kumar A</title>
    <description>The latest articles on DEV Community by Narendra kumar A (@narendraanupoju).</description>
    <link>https://dev.to/narendraanupoju</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F286204%2Fe59829bc-fa2b-4f12-99a8-1a9cab3fb1f4.jpg</url>
      <title>DEV Community: Narendra kumar A</title>
      <link>https://dev.to/narendraanupoju</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/narendraanupoju"/>
    <language>en</language>
    <item>
      <title>Python &amp; data science in banking sectors</title>
      <dc:creator>Narendra kumar A</dc:creator>
      <pubDate>Thu, 03 Mar 2022 18:53:58 +0000</pubDate>
      <link>https://dev.to/narendraanupoju/python-data-science-in-banking-sectors-h8g</link>
      <guid>https://dev.to/narendraanupoju/python-data-science-in-banking-sectors-h8g</guid>
      <description>&lt;p&gt;Data science in banking plays a major role nowadays. Banks all over the world analyze data to provide better experiences to their customers and also to reduce risks.&lt;br&gt;
In this post, You can get to know the importance and role of data science in the banking sector and how it leverages the earnings potential by reducing the risks of a firm.&lt;/p&gt;

&lt;p&gt;Here are a few applications across banking:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Credit Decisions&lt;/li&gt;
&lt;li&gt;Risk Assessment&lt;/li&gt;
&lt;li&gt;Fraud prevention&lt;/li&gt;
&lt;li&gt;Process Automation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The above use cases have been applied by JP Morgan Chase &amp;amp; Co in their business operations and management, as described in a case study. You can find more information in the case study at the following link: &lt;a href="https://www.superiordatascience.com/jpmcasestudy.html" rel="noopener noreferrer"&gt;https://www.superiordatascience.com/jpmcasestudy.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;A PayPal use case&lt;/u&gt;: Here is another example, from PayPal. PayPal uses an AI-based model to decide which of your payment options (linked bank accounts and credit/debit cards) are listed when money is deducted during a transaction. With PayPal, you may observe that your &lt;strong&gt;bank account occasionally doesn't show up&lt;/strong&gt; as a payment option when you are performing a transaction. This mostly happens when you transfer funds in the &lt;em&gt;friends and family&lt;/em&gt; or &lt;em&gt;goods and services&lt;/em&gt; category. The reason is that an AI-based model ranks your linked payment options based on a risk assessment of your previous transactions.&lt;br&gt;
&lt;strong&gt;Note:&lt;/strong&gt; Across Europe, bank transactions are performed using a payment scheme called the &lt;strong&gt;&lt;em&gt;SEPA Direct Debit mandate&lt;/em&gt;&lt;/strong&gt;. When PayPal performs these transactions, it has no access to check whether the user's bank account holds enough funds. PayPal only learns whether a payment succeeded or was declined after it registers the transaction with the user's bank, which usually takes a couple of working days. For this reason, PayPal has to extend credit to the user when a bank account is used, which can be high risk depending on the transaction amount. So, during risk assessment, your listed bank account is more likely to get a high risk score than credit or debit card transactions.&lt;/p&gt;

&lt;p&gt;You can look into this thread, started by a few PayPal users, for more information on this issue: &lt;a href="https://www.paypal-community.com/t5/Transactions/Linked-bank-account-not-showing-as-payment-method/td-p/1808787" rel="noopener noreferrer"&gt;https://www.paypal-community.com/t5/Transactions/Linked-bank-account-not-showing-as-payment-method/td-p/1808787&lt;/a&gt;&lt;br&gt;
You can also find more information on how PayPal uses AI in their business from the following link &lt;a href="https://www.paypal.com/us/brc/article/enterprise-solutions-paypal-machine-learning-stop-fraud" rel="noopener noreferrer"&gt;https://www.paypal.com/us/brc/article/enterprise-solutions-paypal-machine-learning-stop-fraud&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A practical example with python:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a practical understanding, let's look at an example of loan-eligibility prediction using a publicly available dataset from Kaggle. We start with data processing, which includes handling missing data and data analysis, followed by training and testing machine learning models.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Loan eligibility identification&lt;/strong&gt;&lt;/em&gt; is one of the most challenging problems in the banking sector. An applicant's eligibility for a loan depends on several factors, such as credit history, salary, requested loan amount, repayment tenure, and a few others. To solve this problem, we use machine learning: we train models on sample records and predict future outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Steps Involved:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dataset Information&lt;/li&gt;
&lt;li&gt;Loading data&lt;/li&gt;
&lt;li&gt;Dealing with Missing Values&lt;/li&gt;
&lt;li&gt;Adding extra features&lt;/li&gt;
&lt;li&gt;Exploratory Data Analysis&lt;/li&gt;
&lt;li&gt;Correlation matrix and Outliers Detection&lt;/li&gt;
&lt;li&gt;Encoding Categorical to numerical data&lt;/li&gt;
&lt;li&gt;Model training and evaluation&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here we use the pandas, matplotlib, and scikit-learn libraries for data processing and model development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset Information:&lt;/strong&gt; &lt;br&gt;
The dataset contains 614 records with the following 13 columns:&lt;br&gt;
'Loan_ID'&lt;br&gt;
'Gender'&lt;br&gt;
'Married'&lt;br&gt;
'Dependents'&lt;br&gt;
'Education'&lt;br&gt;
'Self_Employed'&lt;br&gt;
'ApplicantIncome'&lt;br&gt;
'CoapplicantIncome'&lt;br&gt;
'LoanAmount'&lt;br&gt;
'Loan_Amount_Term'&lt;br&gt;
'Credit_History'&lt;br&gt;
'Property_Area'&lt;br&gt;
'Loan_Status'&lt;/p&gt;

&lt;p&gt;You can load the data using the following code snippet:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading data:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv('train.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9s9ecxzwsedtw73wf7n.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9s9ecxzwsedtw73wf7n.PNG" alt="Dataset_loaded" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dealing with Missing Values:&lt;/strong&gt;&lt;br&gt;
After loading the dataset, we check whether any missing values exist. We can deal with them in different ways: remove the rows containing missing values, if the dataset is large enough for model training, or fill them using a statistical method (mean, median, mode), clustering, or a machine learning model that predicts the missing values. The choice depends on the type of variable and the amount of data available.&lt;/p&gt;
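To make the two strategies concrete, here is a minimal sketch contrasting dropping rows against imputing with a statistic. The small frame and its values are hypothetical, not the Kaggle data:

```python
import numpy as np
import pandas as pd

# small hypothetical frame with gaps in LoanAmount
df = pd.DataFrame({'LoanAmount': [120.0, np.nan, 200.0, np.nan, 150.0]})

# option 1: drop rows with missing values (viable when plenty of data remains)
dropped = df.dropna()

# option 2: impute with a statistic, e.g. the column median
filled = df['LoanAmount'].fillna(df['LoanAmount'].median())

print(len(dropped))     # 3 rows survive the drop
print(filled.tolist())  # [120.0, 150.0, 200.0, 150.0, 150.0]
```

Median imputation is the safer default for skewed numerical columns such as loan amounts, since the median is not pulled by extreme values.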

&lt;p&gt;&lt;code&gt;print(data.isnull().sum())&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Will give us the count of missing values in each column of our dataframe.&lt;br&gt;
Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our dataset, Credit_History, Self_Employed, Dependents, Loan_Amount_Term, Gender, and Married are categorical columns, so we fill their missing values with the mode, while LoanAmount is a numerical column, so we fill its missing values with the column median.&lt;br&gt;
We can perform this operation with the following code snippets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['Gender'] = data['Gender'].fillna(data['Gender'].dropna().mode().values[0])
data['Married'] = data['Married'].fillna(data['Married'].dropna().mode().values[0])
data['Dependents'] = data['Dependents'].fillna(data['Dependents'].dropna().mode().values[0])
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].dropna().mode().values[0])
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].dropna().median())
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].fillna(data['Loan_Amount_Term'].dropna().mode().values[0])
data['Credit_History'] = data['Credit_History'].fillna(data['Credit_History'].dropna().mode().values[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Adding extra features:&lt;/strong&gt; &lt;br&gt;
Adding extra features based on data-analysis insights can improve model accuracy.&lt;br&gt;
Based on our dataset and goal, we add two extra columns: Total_Income and avg_income_met.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total_Income is calculated by adding ApplicantIncome and CoapplicantIncome for each applicant/record.&lt;/li&gt;
&lt;li&gt;avg_income_met is a 1/0 flag indicating whether an applicant's Total_Income is greater than the average income of all applicants with Loan_Status = Y (Yes).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dataframe with new columns:&lt;/p&gt;
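The two derived columns described above can be sketched as follows. The mini-frame and its values are hypothetical; the column names follow the dataset:

```python
import pandas as pd

# hypothetical mini-frame with the dataset's income and status columns
data = pd.DataFrame({
    'ApplicantIncome':   [5000, 3000, 8000, 2500],
    'CoapplicantIncome': [1500,    0, 2000,  500],
    'Loan_Status':       ['Y',  'N',  'Y',  'N'],
})

# Total_Income: applicant plus co-applicant income
data['Total_Income'] = data['ApplicantIncome'] + data['CoapplicantIncome']

# avg_income_met: 1 when Total_Income exceeds the mean Total_Income of approved loans
avg_approved = data.loc[data['Loan_Status'] == 'Y', 'Total_Income'].mean()
data['avg_income_met'] = (data['Total_Income'] > avg_approved).astype(int)
```

Here the approved-loan average is (6500 + 10000) / 2 = 8250, so only the third applicant gets the flag set to 1.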

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghwtnr1sin7xje108j83.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghwtnr1sin7xje108j83.JPG" alt="extra_features" width="800" height="218"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploratory Data Analysis:&lt;/strong&gt;&lt;br&gt;
A few exploratory analyses were performed on the dataset to draw insights from the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpduir6r2ca0x7i7apoto.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpduir6r2ca0x7i7apoto.JPG" alt="Exploratory data analysis" width="800" height="414"&gt;&lt;/a&gt;&lt;br&gt;
From the above graphs, the following observations can be noted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applicants with fewer dependents are more likely to get a loan.&lt;/li&gt;
&lt;li&gt;More applications come from applicants in semi-urban areas.&lt;/li&gt;
&lt;li&gt;Graduates are more likely to get loans.&lt;/li&gt;
&lt;li&gt;Applicants with average income mostly have a positive credit history.&lt;/li&gt;
&lt;li&gt;Married people are more likely to apply for loans.&lt;/li&gt;
&lt;li&gt;Male applicants outnumber female applicants.&lt;/li&gt;
&lt;li&gt;Self-employed applicants are less likely to get a loan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The EDA graphs can be generated individually using the following code snippets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['Gender'].value_counts(normalize=True).plot.bar(title='Gender')
plt.show()

data['Married'].value_counts(normalize=True).plot.bar(title='Married')
plt.show()

data['Self_Employed'].value_counts(normalize=True).plot.bar(title='Self_Employed')
plt.show()

data['Credit_History'].value_counts(normalize=True).plot.bar(title='Credit_History')
plt.show()

# Independent Variable (Ordinal)
data['Dependents'].value_counts(normalize=True).plot.bar(title='Dependents', color= 'cyan',edgecolor='black')
plt.show()

data['Education'].value_counts(normalize=True).plot.bar(title='Education', color= 'cyan',edgecolor='black')
plt.show()

data['Property_Area'].value_counts(normalize=True).plot.bar(title='Property_Area', color= 'cyan',edgecolor='black')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Correlation matrix and Outliers Detection:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A correlation matrix is a table showing correlation coefficients between variables. We can remove variables that have little correlation with the target, as a dimensionality-reduction step.&lt;/li&gt;
&lt;li&gt;An outlier is a value in a data set that is very different from the other values, i.e., unusually far from the middle. In most cases, outliers influence the mean, but not the median or the mode.&lt;/li&gt;
&lt;/ul&gt;
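A quick numeric illustration of that last point (the values are made up):

```python
import numpy as np

# four typical values plus one extreme outlier
vals = np.array([10, 12, 11, 13, 500])

print(vals.mean())      # 109.2 -- dragged up by the outlier
print(np.median(vals))  # 12.0  -- barely affected
```

This is why the boxplot below, which is built around the median and quartiles, makes outliers easy to spot.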

&lt;p&gt;A correlation matrix and a boxplot can be generated using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;corr = data.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

data.boxplot(column = 'Total_Income', by = 'Loan_Status')
plt.suptitle("")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspmn1cpdbm496u9t97a6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspmn1cpdbm496u9t97a6.png" alt="correlation_matrix" width="436" height="339"&gt;&lt;/a&gt;&lt;br&gt;
From the above heatmap, we can observe that the Married and Gender columns are negatively correlated and that Loan_Status correlates well with Credit_History.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wy2saa424egmqdp8yu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wy2saa424egmqdp8yu6.png" alt="boxPlot" width="398" height="278"&gt;&lt;/a&gt;&lt;br&gt;
The above code also generates a box plot showing the distribution of Total_Income by Loan_Status, as shown above.&lt;br&gt;
From the boxplot, we can observe outliers with Total_Income greater than 60000. These outliers should be handled, as they distort the variance of the variables and can weaken correlations. We treat records with Total_Income greater than 30000 as outliers and remove them with the following snippet.&lt;br&gt;
&lt;code&gt;data = data.drop(data[data.Total_Income &amp;gt; 30000].index)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encoding Categorical to numerical data:&lt;/strong&gt;&lt;br&gt;
Columns Gender, Married, Education, Property_Area, Self_Employed, and Loan_Status contain categorical values, so we map them to numerical values using a dictionary, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat2num = {'Male': 1, 'Female': 2,
           'Yes': 1, 'No': 0,
            'Graduate': 1, 'Not Graduate': 0,
            'Rural': 1, 'Semiurban': 2,'Urban': 3,
            'Y': 1, 'N': 0,
            '3+': 3}
data = data.applymap(lambda item: cat2num.get(item) if item in cat2num else item)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before training machine learning algorithms, we have to evaluate whether all the features should be used for training. In our data, Loan_ID has a unique value for each applicant, so we drop this column, as it is not useful for model training and predictions.&lt;br&gt;
&lt;code&gt;data.drop('Loan_ID', axis = 1, inplace = True)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In our dataframe, the &lt;strong&gt;Dependents&lt;/strong&gt; column has the values 0, 1, 2, 3, and 3+. Here 3+ is not a numerical datatype; it is a string. So we convert the column to a numerical datatype.&lt;br&gt;
&lt;code&gt;data['Dependents'] = pd.to_numeric(data['Dependents'])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now we are ready to train a few machine learning models, test them with predictions, and calculate accuracy with f1_score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model training and evaluation:&lt;/strong&gt;&lt;br&gt;
All the models trained and tested here are imported from the scikit-learn module.&lt;br&gt;
The .fit() method is used to train a model:&lt;br&gt;
     model.fit(X_train, y_train)&lt;br&gt;
X_train contains all the features after splitting the data.&lt;br&gt;
y_train contains the target variable Loan_Status.&lt;br&gt;
The .predict() method is used for prediction.&lt;br&gt;
Here we use f1_score as the evaluation metric for testing model accuracy.&lt;/p&gt;

&lt;p&gt;To achieve the goal, we have selected five classifier models to train and test on our data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support Vector Machine Classifier&lt;/li&gt;
&lt;li&gt;Decision Tree Classifier&lt;/li&gt;
&lt;li&gt;Random Forest Classifier&lt;/li&gt;
&lt;li&gt;KNearestNeighbors Classifier&lt;/li&gt;
&lt;li&gt;Naive Bayes Classifier&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

#splitting data (80% - for training, 20% - for validation)
X_train, X_test, y_train, y_test = train_test_split(data.drop('Loan_Status', axis = 1), 
                                                    data['Loan_Status'], test_size=0.20, random_state=0)

# SVM classifier
from sklearn import svm
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train,y_train)
svm_prediction = classifier.predict(X_test)
evaluation_svm = f1_score(y_test, svm_prediction)
print('SVM classifier f1_score : ', evaluation_svm)

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
d_tree = DecisionTreeClassifier()
d_tree.fit(X_train, y_train)
d_pred = d_tree.predict(X_test)
evaluation_DT = f1_score(y_test, d_pred)
print('Decision Tree f1_score : ', evaluation_DT)

#Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
forest_prediction = forest.predict(X_test)
evaluation_forest = f1_score(y_test, forest_prediction)
print('Random Forest Classifier f1_score : ', evaluation_forest)

#KNN classifier
from sklearn.neighbors import KNeighborsClassifier
KNN_model = KNeighborsClassifier(n_neighbors=3)
KNN_model.fit(X_train, y_train)
KNN_predicted= KNN_model.predict(X_test)
evaluation_KNN = f1_score(y_test, KNN_predicted)
print('KNN f1_score : ', evaluation_KNN)

# Naive Bayes classification
from sklearn.naive_bayes import GaussianNB
NB_model = GaussianNB()
NB_model.fit(X_train, y_train)
NB_predicted= NB_model.predict(X_test)
evaluation_NB = f1_score(y_test, NB_predicted)
print('NB f1_score : ', evaluation_NB)

scores_N = ['svm', 'DT', 'forest', 'KNN', 'NB']
scores = [evaluation_svm, evaluation_DT, evaluation_forest, evaluation_KNN, evaluation_NB]
x = np.array([0, 1, 2, 3, 4])
plt.xticks(x, scores_N)
plt.plot(scores)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For small datasets, you might need the k-fold technique to train and evaluate the models.&lt;br&gt;
&lt;strong&gt;K-Fold&lt;/strong&gt; is a validation technique in which we split the data into k subsets and repeat the holdout method k times, with each of the k subsets used once as the test set and the other k-1 subsets used for training.&lt;/p&gt;
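A minimal sketch of k-fold evaluation with scikit-learn. The synthetic features and the choice of GaussianNB here are illustrative assumptions, not the post's exact setup:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# synthetic stand-in for the preprocessed loan features and binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 5 folds: each subset serves once as the test set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring='f1')
print(scores.mean())
```

Averaging the per-fold F1 scores gives a more stable estimate than a single 80/20 split when data is scarce.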

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
Model evaluation is performed on 123 samples (20% of the dataset, the validation set).&lt;br&gt;
From the scores plot below, we can observe that the Gaussian Naive Bayes classifier achieved the highest accuracy of 89.23%.&lt;br&gt;
Scores plotted using matplotlib:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1gv1udaj2jx64gmfw3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1gv1udaj2jx64gmfw3e.png" alt="modelScores" width="372" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Future posts:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How data science techniques can be used for malware detection on Android systems.&lt;/li&gt;
&lt;li&gt;A programmer's instinct towards data science (in this post, I would like to discuss how programming tricks can be used to perform data-processing tasks, model development, and training with minimal knowledge of data science concepts).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are interested in how data science can be used in a particular field, please comment with your area of interest. I will try to do a post with a practical example.&lt;/p&gt;

&lt;p&gt;Please comment below if you have any questions.&lt;br&gt;
Thank you for your time; I hope you enjoyed reading. Happy learning!&lt;/p&gt;

</description>
      <category>financialanalysis</category>
      <category>paypal</category>
      <category>datavisualization</category>
      <category>python</category>
    </item>
    <item>
      <title>Denoising MRI data with Tensorflow-Python</title>
      <dc:creator>Narendra kumar A</dc:creator>
      <pubDate>Fri, 18 Feb 2022 17:17:31 +0000</pubDate>
      <link>https://dev.to/narendraanupoju/denoising-mri-data-with-tensorflow-python-4c10</link>
      <guid>https://dev.to/narendraanupoju/denoising-mri-data-with-tensorflow-python-4c10</guid>
      <description>&lt;p&gt;You may find a lot of information on the internet about why deep learning is useful in medical imaging analysis and reconstruction. A simple answer is - ‘Medical data contains numerous amount of data points and where deep learning comes into play.’&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Data always follows a pattern, that can be analysed using different AI based models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So what is next in this post?&lt;/p&gt;

&lt;p&gt;This post shows a basic example of how deep learning can be used in the medical domain to support healthcare professionals by denoising medical imaging data. Deep learning can also be used for data analysis, disease detection, artifact removal, reconstructing undersampled data, and a lot more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Prerequisites&lt;/u&gt;&lt;/strong&gt;: basic knowledge of Python, NumPy, and the basics of (image) data processing&lt;/p&gt;

&lt;p&gt;For our example, let us take a few medical imaging data samples, then create and train a basic deep learning model (a convolutional autoencoder) that reduces the noise in our medical images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps involved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading data&lt;/li&gt;
&lt;li&gt;Preprocessing data&lt;/li&gt;
&lt;li&gt;Generating noisy data using the scikit-image random noise generator&lt;/li&gt;
&lt;li&gt;Build a model and train&lt;/li&gt;
&lt;li&gt;Prediction and model evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Read data:&lt;/strong&gt; Here I am using a total of 20 MRI NIfTI volumes (T1-weighted MRI scans). We use nibabel, a Python module, to load the data. Each volume contains a sequence of 128 brain slices, which I discussed in my previous blog @&lt;a href="https://dev.to/narendraanupoju/mri-data-processing-with-python-1jgg"&gt;https://dev.to/narendraanupoju/mri-data-processing-with-python-1jgg&lt;/a&gt;&lt;br&gt;
The following function loads the data from .nii or .nii.gz files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nibabel as nib
def FileRead(file_path):
    data = nib.load(file_path).get_fdata()
    return data[:, :, 14:114]  # keep the 100 central slices
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dataset structure&lt;/strong&gt; - Each sample has a shape of (256, 256, 128), represented as a 3-dimensional NumPy array: 128 sequential brain slices. We also need to normalize each sample because of the wide range of the data distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preprocess the data:&lt;/strong&gt; - Preprocessing is one of the most important steps before training any machine learning or deep learning model. Preprocessing can include data cleaning, data transformation, outlier detection, handling missing data, and more, depending on the input data and the end result you need. We use a data transformation technique called normalization to reduce the variance in our data.&lt;/p&gt;

&lt;p&gt;Now we stack all the volumes into a single NumPy array before training. I keep only the central slices of each volume, as the initial and final slices don't contain much useful information for model training, so we neglect them; 100 center slices are kept per sample.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Note: we are training the model with 2D slices rather than with 3D volumes.&lt;/u&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import glob
import numpy as np
def normalize(x):
    x = (x-np.min(x)) / (np.max(x)-np.min(x))
    return x
listFiles = glob.glob('/*.nii.gz')
totalData = np.dstack([normalize(FileRead(file)) for file in listFiles])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, after the processing is done, you will get a NumPy array something like the below shape:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Final shape of data : (256, 256, 2000)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;From the above shape, we can see 2000 MRI slices stacked along the third axis. We work only with 2D slices.&lt;/p&gt;
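The stacking behaviour can be checked on toy volumes (the arrays here are placeholders; only the shapes matter):

```python
import numpy as np

# two toy "volumes" of 100 slices each, as produced by FileRead
vol_a = np.zeros((256, 256, 100))
vol_b = np.ones((256, 256, 100))

# np.dstack concatenates along the third axis, just like the real pipeline
stacked = np.dstack([vol_a, vol_b])
print(stacked.shape)  # (256, 256, 200)
```

With 20 volumes of 100 slices each, the same concatenation yields the (256, 256, 2000) array above.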

&lt;p&gt;&lt;strong&gt;Generating Noisy Data:&lt;/strong&gt; - Now we have noise-free images, and we need noisy images to train our model against the ground truths. For this, I am using the scikit-image random_noise generator to add noise to each image. The noise added to each slice is Gaussian, based on the local variance of each data sample.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from skimage.util import random_noise

print('____Generating noisy data______')
noisyData = np.zeros(totalData.shape)
samplesCount = totalData.shape[2]  # number of 2D slices
for img in range(samplesCount):
    noisy = random_noise(totalData[:, :, img], mode='localvar', seed=None, clip=True)
    noisyData[:, :, img] = noisy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagrpp5y9jpb64s1oxyhy.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagrpp5y9jpb64s1oxyhy.PNG" alt="generated_noise_01" width="594" height="310"&gt;&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvb2293rrzf9lpl4x6pc.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvb2293rrzf9lpl4x6pc.PNG" alt="generated_noise_02" width="569" height="308"&gt;&lt;/a&gt;&lt;br&gt;
After generating noisy data, we have noisy samples paired with their ground truths. We can observe in the noisy images above that most of the brain's internal structures are distorted. Now we use a simple convolutional autoencoder built on top of TensorFlow to remove the noise from our data and reconstruct the images with better internal structure.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Noisy samples data shape : (256, 256, 2000) #containing noise&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Ground truth data shape : (256, 256, 2000)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;One last step before model training: we have to structure our data in the standard representation &lt;code&gt;(batch_size, height, width, channels)&lt;/code&gt;. For that, we roll the third axis of &lt;strong&gt;totalData&lt;/strong&gt; and &lt;strong&gt;noisyData&lt;/strong&gt; to the first position and expand them with a 4th dimension, as shown below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;totalData = np.rollaxis(totalData, -1)
totalData = np.expand_dims(totalData, axis=-1)
noisyData = np.rollaxis(noisyData, -1)
noisyData= np.expand_dims(noisyData, axis=-1)

X_train, X_test, y_train, y_test = train_test_split(noisyData, totalData , test_size=0.2, random_state=42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Note&lt;/u&gt;: You can also consider converting the NumPy arrays to tensors to take advantage of TensorFlow's tf.data.Dataset API.&lt;br&gt;
For simplicity, I am only using NumPy arrays here.&lt;/p&gt;

&lt;p&gt;After splitting, the data shapes are as follows:&lt;br&gt;
X_train : (1600, 256, 256, 1)&lt;br&gt;
y_train : (1600, 256, 256, 1)&lt;br&gt;
X_test  : (400 , 256, 256, 1)&lt;br&gt;
y_test  : (400 , 256, 256, 1)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model building and training&lt;/strong&gt;&lt;br&gt;
Now we build a simple convolutional autoencoder model, as shown below, to train and predict our results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf
import keras
import keras.layers as layers
import keras.models as models
from keras.initializers import orthogonal
from tensorflow.keras.optimizers import Adam

def ConvolutionLayer(x, filters, kernel, strides, padding, block_id, kernel_init=orthogonal()):
    prefix = f'block_{block_id}_'

    x = layers.Conv2D(filters, kernel_size=kernel, strides=strides, padding=padding,
                      kernel_initializer=kernel_init, name=prefix+'conv')(x)

    x = layers.LeakyReLU(name=prefix+'lrelu')(x)
    x = layers.Dropout(0.2, name=prefix+'drop')((x))
    x = layers.BatchNormalization(name=prefix+'conv_bn')(x)
    return x

def DeconvolutionLayer(x, filters, kernel, strides, padding, block_id, kernel_init=orthogonal()):
    prefix = f'block_{block_id}_'

    x = layers.Conv2DTranspose(filters, kernel_size=kernel, strides=strides, padding=padding,
                               kernel_initializer=kernel_init, name=prefix+'de-conv')(x)

    x = layers.LeakyReLU(name=prefix+'lrelu')(x)
    x = layers.Dropout(0.2, name=prefix+'drop')((x))
    x = layers.BatchNormalization(name=prefix+'conv_bn')(x)
    return x


# Convolutional Autoencoder with skip connections
def noiseModel(input_shape):
    inputs = layers.Input(shape=input_shape)

    # 256 x 256
    conv1 = ConvolutionLayer(inputs, 64, 3, strides=2, padding='same', block_id=1)

    conv2 = ConvolutionLayer(conv1, 128, 5, strides=2, padding='same', block_id=2)

    # 64 x 64
    conv3 = ConvolutionLayer(conv2, 256, 3, strides=2, padding='same', block_id=3)

    # 16 x 16
    deconv1 = DeconvolutionLayer(conv3, 128, 3, strides=2, padding='same', block_id=4)

    # 64 x 64
    deconv2 = DeconvolutionLayer(deconv1, 64, 3, strides=2, padding='same', block_id=5)

    # 256 x 256
    deconv3 = DeconvolutionLayer(deconv2, 1, 3, strides=2, padding='same', block_id=6)

    return models.Model(inputs=inputs, outputs=deconv3)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, each &lt;strong&gt;ConvolutionLayer&lt;/strong&gt; and &lt;strong&gt;DeconvolutionLayer&lt;/strong&gt; block is a sequential operation of convolution / transposed convolution, followed by LeakyReLU activation, dropout, and batch-normalization layers.&lt;br&gt;
Our convolutional autoencoder consists of six blocks. The first three blocks form the &lt;strong&gt;encoder network&lt;/strong&gt;, which performs convolution operations to generate a latent feature map of the noisy data; the latent feature map is then passed to the &lt;strong&gt;decoder&lt;/strong&gt; to generate a noise-free image.&lt;/p&gt;

&lt;p&gt;More detailed information on model development and parameter optimization will be covered in the next post.&lt;/p&gt;

&lt;p&gt;We now have our building blocks &lt;strong&gt;ConvolutionLayer&lt;/strong&gt; and &lt;strong&gt;DeconvolutionLayer&lt;/strong&gt;, which &lt;strong&gt;&lt;u&gt;noiseModel&lt;/u&gt;&lt;/strong&gt; uses to assemble the full model.&lt;/p&gt;

&lt;p&gt;Next, we define the optimizer, metric, and loss function required during training to update the model's weights and biases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input_shape = (256, 256)    # shape of the sample
model = noiseModel((*input_shape, 1))
model_optimizer = Adam(learning_rate=0.002)

def SSIMMeasure(y_true, y_pred):
    return tf.reduce_mean(tf.image.ssim(y_true, y_pred, 1.0))

def SSIMLoss(y_true, y_pred):
    return 1 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, 1.0))

model.compile(optimizer=model_optimizer, loss=SSIMLoss, metrics=[SSIMMeasure])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
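&lt;p&gt;As a quick sanity check on this loss (a standalone sketch using a random stand-in image rather than our MRI data), SSIMLoss should be close to 0 for identical images and grow as noise is added:&lt;/p&gt;

```python
import numpy as np
import tensorflow as tf

def SSIMLoss(y_true, y_pred):
    return 1 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, 1.0))

# Random stand-in image in [0, 1] with the model's input layout
img = tf.constant(np.random.rand(1, 256, 256, 1), dtype=tf.float32)
noisy = tf.clip_by_value(img + tf.random.normal(img.shape, stddev=0.2), 0.0, 1.0)

loss_same = float(SSIMLoss(img, img))     # close to 0: identical images
loss_noisy = float(SSIMLoss(img, noisy))  # larger: noise lowers SSIM
print(loss_same, loss_noisy)
```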



&lt;p&gt;Now we are ready to call the training method to train our model.&lt;br&gt;
First, let us view the summary of our simple convolutional autoencoder model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(model.summary())
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 256, 256, 1)]     0         

 block_1_conv (Conv2D)       (None, 128, 128, 64)      640       

 block_1_lrelu (LeakyReLU)   (None, 128, 128, 64)      0         

 block_1_drop (Dropout)      (None, 128, 128, 64)      0         

 block_1_conv_bn (BatchNorma  (None, 128, 128, 64)     256       
 lization)                                                       

 block_2_conv (Conv2D)       (None, 64, 64, 128)       204928    

 block_2_lrelu (LeakyReLU)   (None, 64, 64, 128)       0         

 block_2_drop (Dropout)      (None, 64, 64, 128)       0         

 block_2_conv_bn (BatchNorma  (None, 64, 64, 128)      512       
 lization)                                                       

 block_3_conv (Conv2D)       (None, 32, 32, 256)       295168    

 block_3_lrelu (LeakyReLU)   (None, 32, 32, 256)       0         

 block_3_drop (Dropout)      (None, 32, 32, 256)       0         

 block_3_conv_bn (BatchNorma  (None, 32, 32, 256)      1024      
 lization)                                                       

 block_4_de-conv (Conv2DTran  (None, 64, 64, 128)      295040    
 spose)                                                          

 block_4_lrelu (LeakyReLU)   (None, 64, 64, 128)       0         

 block_4_drop (Dropout)      (None, 64, 64, 128)       0         

 block_4_conv_bn (BatchNorma  (None, 64, 64, 128)      512       
 lization)                                                       

 block_5_de-conv (Conv2DTran  (None, 128, 128, 64)     73792     
 spose)                                                          

 block_5_lrelu (LeakyReLU)   (None, 128, 128, 64)      0         

 block_5_drop (Dropout)      (None, 128, 128, 64)      0         

 block_5_conv_bn (BatchNorma  (None, 128, 128, 64)     256       
 lization)                                                       

 block_6_de-conv (Conv2DTran  (None, 256, 256, 1)      577       
 spose)                                                          

 block_6_lrelu (LeakyReLU)   (None, 256, 256, 1)       0         

 block_6_drop (Dropout)      (None, 256, 256, 1)       0         

 block_6_conv_bn (BatchNorma  (None, 256, 256, 1)      4         
 lization)                                                       

=================================================================
Total params: 872,709
Trainable params: 871,427
Non-trainable params: 1,282
_________________________________________________________________
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the model summary above, you can observe that we have around 0.9 million trainable parameters and very few non-trainable parameters. Adding more layers increases the number of trainable parameters. &lt;/p&gt;
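&lt;p&gt;As a quick arithmetic check on those numbers: a Conv2D layer holds (kernel * kernel * input_channels + 1) * filters parameters, and each BatchNormalization layer holds 4 parameters per channel, of which the two moving statistics (mean and variance) are non-trainable:&lt;/p&gt;

```python
def conv2d_params(kernel, in_ch, filters):
    # kernel weights plus one bias per filter
    return (kernel * kernel * in_ch + 1) * filters

# Values from the model summary above
assert conv2d_params(3, 1, 64) == 640        # block_1_conv
assert conv2d_params(5, 64, 128) == 204928   # block_2_conv
assert conv2d_params(3, 128, 256) == 295168  # block_3_conv

# Each BatchNormalization layer: 2 trainable (gamma, beta) and
# 2 non-trainable (moving mean, moving variance) per channel
channels = [64, 128, 256, 128, 64, 1]
print(sum(2 * c for c in channels))  # prints 1282, the non-trainable total
```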

&lt;p&gt;For training we also have to define the batch size and the number of epochs. In addition, we log the training information to TensorBoard and a CSV logger so that it can be analysed later on. For this I use the following training setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Path_MODELSAVE = 'saved_models'
Path_LOGS = 'logs'
for i in [Path_MODELSAVE, Path_LOGS]:
    if i not in os.listdir(os.getcwd()):
        os.mkdir(i)

epochs = 100
batch_size = 10
saved_weight = os.path.join(Path_MODELSAVE, 'dataweights.{epoch:02d}-{SSIMMeasure:.2f}.hdf5')

model_checkpoint = keras.callbacks.ModelCheckpoint(saved_weight,
                                        monitor = 'val_SSIMMeasure',
                                        verbose=1,
                                        save_best_only=False,
                                        save_weights_only=True,
                                        mode='auto', save_freq = 'epoch', period = 10)

tensorboard = keras.callbacks.TensorBoard(log_dir=Path_LOGS,
                                            histogram_freq=0,
                                            write_graph=True,
                                            write_images=True)

csv_logger = keras.callbacks.CSVLogger(f'{Path_LOGS}/keras_log.csv' ,append=True)

model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, 
            validation_data=(X_test, y_test), 
            callbacks=[model_checkpoint, tensorboard, csv_logger])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the snippet above, &lt;strong&gt;model_checkpoint&lt;/strong&gt; saves the model weights to our local system every 10 epochs.&lt;br&gt;
&lt;strong&gt;tensorboard&lt;/strong&gt; writes TensorBoard events to the Path_LOGS directory, and &lt;strong&gt;csv_logger&lt;/strong&gt; logs the &lt;strong&gt;loss&lt;/strong&gt; and &lt;strong&gt;SSIMMeasure&lt;/strong&gt; for both training and validation to a .csv file after each epoch.&lt;/p&gt;

&lt;p&gt;The model.fit() call then runs the entire training process using the batch_size and callbacks defined above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prediction and model evaluation&lt;/strong&gt;&lt;br&gt;
Once the model is trained, you can run predictions on other samples to evaluate it.&lt;br&gt;
Also note that you have to apply the same preprocessing steps to any sample passed for prediction. In our case, we have to normalize each sample before passing it to the model.&lt;/p&gt;
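&lt;p&gt;As a sketch of that preprocessing step (assuming min-max normalization to [0, 1] was used for the training data, and using a random array as a hypothetical raw sample), a sample would be normalized and reshaped to the model's input layout before prediction:&lt;/p&gt;

```python
import numpy as np

def normalize(volume):
    # Min-max scaling to [0, 1] -- assuming this matches the
    # normalization applied to the training data
    vmin, vmax = volume.min(), volume.max()
    return (volume - vmin) / (vmax - vmin)

# Hypothetical raw slice standing in for a real MRI sample
raw_slice = np.random.rand(256, 256) * 4096.0

sample = normalize(raw_slice)
sample = sample.reshape(1, 256, 256, 1)  # (batch, height, width, channels)

# With the trained model from above, the denoised output would be:
# denoised = model.predict(sample)
print(sample.shape, sample.min(), sample.max())
```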

&lt;p&gt;Here are a few predictions from the model trained for 130 epochs.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakf30hi1rongzvoakrdr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakf30hi1rongzvoakrdr.PNG" alt="model_prediction_1" width="681" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d27s4oqqgmimwbyjhpq.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d27s4oqqgmimwbyjhpq.PNG" alt="model_prediction_2" width="684" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the results above we can observe that our model is able to remove the noise and even recovers some structures that are barely visible in the noisy data. This was achieved by training for 130 epochs with a batch_size of 10 and a model containing only 6 blocks.&lt;/p&gt;

&lt;p&gt;For better results, I suggest training your model for more epochs with a larger batch_size, while inspecting whether your model is overfitting or underfitting.&lt;/p&gt;

&lt;p&gt;Below you can observe the model's SSIM measure and loss over training and validation.&lt;/p&gt;

&lt;p&gt;Training results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgudcro0v02tfnnkavker.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgudcro0v02tfnnkavker.PNG" alt=" " width="800" height="326"&gt;&lt;/a&gt;&lt;br&gt;
Validation results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s2stlk1iyyx34okqobj.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s2stlk1iyyx34okqobj.PNG" alt=" " width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have any questions regarding the post, please leave a comment. I will try to answer as soon as possible.&lt;/p&gt;

&lt;p&gt;If this article helped you, you can support my work here ☕&lt;/p&gt;

&lt;p&gt;👉 Buy me a coffee: &lt;a href="https://buymeacoffee.com/anupoju" rel="noopener noreferrer"&gt;https://buymeacoffee.com/anupoju&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>medicalimaging</category>
      <category>imageprocessing</category>
      <category>neuralnetworks</category>
    </item>
    <item>
      <title>MRI Data Processing with Python</title>
      <dc:creator>Narendra kumar A</dc:creator>
      <pubDate>Mon, 03 Aug 2020 17:04:26 +0000</pubDate>
      <link>https://dev.to/narendraanupoju/mri-data-processing-with-python-1jgg</link>
      <guid>https://dev.to/narendraanupoju/mri-data-processing-with-python-1jgg</guid>
      <description>&lt;p&gt;Understanding and processing MRI data can be tricky and confusing. In this blog, I will provide a basic introduction on how to load and process MRI data using the most important Python libraries.&lt;/p&gt;

&lt;p&gt;MRI data mainly consists of three pieces of information.&lt;br&gt;
-&amp;gt; Header (metadata)&lt;br&gt;
-&amp;gt; Affine (Represents the affine transformation)&lt;br&gt;
-&amp;gt; Image data (N-D Array)&lt;/p&gt;

&lt;p&gt;Most medical imaging data comes in the NIfTI (.nii) and DICOM (.dcm) formats. We will discuss the NIfTI format here; the DICOM format will be covered in further posts.&lt;/p&gt;

&lt;p&gt;Here we use the NumPy and NiBabel libraries to load and process the data, and Matplotlib for visualization.&lt;/p&gt;

&lt;p&gt;If you don’t have these libraries installed, you can install them with the pip package manager by entering the following commands in your Python environment:&lt;br&gt;
&lt;br&gt;
pip install numpy&lt;br&gt;
pip install nibabel&lt;br&gt;
pip install matplotlib&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Importing modules&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nibabel&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Loading data&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path to Nifti format data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;#(.nii or .nii.gz)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv8831ys5uw16zys7vnbw.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv8831ys5uw16zys7vnbw.PNG" alt="Alt Text" width="664" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here &lt;b&gt;data&lt;/b&gt; is a data object containing header, affine, and image data with all required attributes for further processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;image_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_fdata&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;b&gt;get_fdata()&lt;/b&gt; function returns a floating-point NumPy array containing the image pixel data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(256, 256, 128)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output shape shows that &lt;b&gt;image_data&lt;/b&gt; is a 3D volume; the z-axis (128) gives the number of slices in the MRI data.&lt;/p&gt;
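&lt;p&gt;To make the axis layout concrete, here is a small NumPy-only sketch using a random stand-in volume (note that which anatomical plane each axis corresponds to depends on the scan's orientation and affine):&lt;/p&gt;

```python
import numpy as np

# Stand-in volume with the same shape as image_data
volume = np.random.rand(256, 256, 128)

z_slice = volume[:, :, 64]   # one of the 128 slices along the z-axis
y_slice = volume[:, 120, :]  # slice along the y-axis
x_slice = volume[120, :, :]  # slice along the x-axis

print(z_slice.shape, y_slice.shape, x_slice.shape)
```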

&lt;p&gt;To check the &lt;b&gt;affine coordinates&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;affine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This outputs a matrix relating voxel coordinates in the image data array to coordinates in an RAS+ world coordinate system, where RAS stands for (Right, Anterior, Superior).&lt;br&gt;
For a detailed understanding of neuroimaging coordinate systems, go through the following link:&lt;br&gt;
&lt;a href="https://nipy.org/nibabel/coordinate_systems.html" rel="noopener noreferrer"&gt;https://nipy.org/nibabel/coordinate_systems.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To check the &lt;b&gt;header&lt;/b&gt; info of the data object&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A detailed discussion of the metadata and affine transformations will follow in a future post.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Visualizing&lt;/b&gt; MRI slices&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;[:,:,&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcp8c8guzaqwturp0reu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcp8c8guzaqwturp0reu5.png" alt="Alt Text" width="262" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can change the integer index in the code snippet above within the range 0 to 127 to visualize different slices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;[:,:,&lt;/span&gt;&lt;span class="mi"&gt;116&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visualizing the 116th slice&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwcccwj1aybcrvuosvt99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwcccwj1aybcrvuosvt99.png" alt="Alt Text" width="262" height="252"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,:,:],&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visualizing the 120th slice along the x-axis&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fli3r50xfjl1ov2r51i3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fli3r50xfjl1ov2r51i3i.png" alt="Alt Text" width="149" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;matplotlib.image.AxesImage at 0x1f6cb300148&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For a detailed understanding of indexing with NumPy, go through the following link:&lt;br&gt;
&lt;a href="https://numpy.org/doc/stable/user/basics.indexing.html" rel="noopener noreferrer"&gt;https://numpy.org/doc/stable/user/basics.indexing.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In further posts, I will be discussing how to prepare MRI data to apply machine learning algorithms, noise removal techniques, and different pre-processing and post-processing techniques.&lt;/p&gt;




&lt;p&gt;If this article helped you, you can support my work here ☕&lt;br&gt;&lt;br&gt;
👉 Buy me a coffee: &lt;a href="https://buymeacoffee.com/anupoju" rel="noopener noreferrer"&gt;https://buymeacoffee.com/anupoju&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>medicalimaging</category>
      <category>imageprocessing</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
