<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Timothy Cummins</title>
    <description>The latest articles on DEV Community by Timothy Cummins (@trossii).</description>
    <link>https://dev.to/trossii</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F326694%2F557f5d14-fbef-4eb8-92bd-716f7c740406.png</url>
      <title>DEV Community: Timothy Cummins</title>
      <link>https://dev.to/trossii</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/trossii"/>
    <language>en</language>
    <item>
      <title>Help: Reverse Dict</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Wed, 30 Dec 2020 04:47:52 +0000</pubDate>
      <link>https://dev.to/trossii/help-reverse-dict-1cj6</link>
      <guid>https://dev.to/trossii/help-reverse-dict-1cj6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to the Problem
&lt;/h2&gt;

&lt;p&gt;In my second week of holiday spirit I decided to continue helping my friend with his adventures in Python. The most recent problem he was working on gave him a dictionary of English words and their Spanish translations and asked him to write a program to invert it. The requirements given to him were:&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Raise an Exception if something other than a dictionary is passed to the function
&lt;/h3&gt;

&lt;h3&gt;
  
  
  2) Return the reversed dictionary using the zip function if a dictionary is passed to the function
&lt;/h3&gt;

&lt;p&gt;So here is a step-by-step walkthrough of how I went through the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Sure a Dictionary is Passed Through
&lt;/h2&gt;

&lt;p&gt;So when I read this problem, the first thing I thought to myself is that we are of course going to need to define a function, but also that we should use an 'if' statement to give our function two options. So initially I typed up the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def invdict(D):
    if D is dict():
        print("item is dictionary")
    else:
        print("item is not a dictionary")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Though when I tried to run the function on the dictionary I created below, I ended up receiving a response that my dictionary was not a dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;thisdict = {
    "dog": "perro",
    "cat": "gato",
    "chicken": "pollo",
    "chicken": "pollo"
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was because what I had typed above was comparing my dictionary to dict(), a brand-new empty dictionary, using 'is', which checks whether two objects are the same object in memory; they were not, and in turn my statement was false. Noticing my mistake and thinking back on what I had learned in the past, I remembered there is a built-in function, isinstance(), that returns True or False depending on whether an object is of the specified type. Which worked!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def invdict(D):
    if isinstance(D,dict):
        print("item is dictionary")
    else:
        print("item is not a dictionary")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
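&lt;p&gt;To make the difference concrete, here is a quick sketch (not part of the original exercise) contrasting 'is' with isinstance():&lt;/p&gt;

```python
# 'is' tests object identity, so two separate empty dicts are not "the same"
print({} is dict())          # False: two distinct objects in memory

# isinstance() tests the type, which is what we actually want here
print(isinstance({}, dict))  # True
```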



&lt;p&gt;Now we just needed to reverse our dictionary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reversing the Dictionary
&lt;/h2&gt;

&lt;p&gt;I want to start out by saying there are a couple of ways to reverse a dictionary. Personally my first thought would be a comprehension, something like &lt;code&gt;dict((v, k) for k, v in D.items())&lt;/code&gt;, but as I said in the beginning this problem specified using zip. Doing that is actually pretty simple. Since a dictionary lets us pull out its keys and values separately, all we need to do is tell Python that we want the result to be a dictionary using dict(), pair the values with the keys using zip(), and pass it the two views D.values() and D.keys(). All together we have &lt;code&gt;dict(zip(D.values(), D.keys()))&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;So putting everything together our output will look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def invdict(D):
    if isinstance(D,dict):
        return dict(zip(D.values(), D.keys()))
    else:
        raise Exception('This should be a dictionary')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
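&lt;p&gt;As a quick sanity check, here is the finished function called on the example dictionary from earlier (a sketch; the list argument is only there to show the exception path):&lt;/p&gt;

```python
def invdict(D):
    if isinstance(D, dict):
        return dict(zip(D.values(), D.keys()))
    else:
        raise Exception('This should be a dictionary')

thisdict = {"dog": "perro", "cat": "gato", "chicken": "pollo"}
print(invdict(thisdict))  # {'perro': 'dog', 'gato': 'cat', 'pollo': 'chicken'}

try:
    invdict(["not", "a", "dict"])
except Exception as e:
    print(e)  # This should be a dictionary
```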



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;First of all, if someone is using this for school work, I hope you at least read through it to understand what each piece is doing. Secondly, I would like to announce that now that I have finished the pleasantries of the holidays, I look forward to tackling some more complex Machine Learning concepts in the upcoming weeks.&lt;/p&gt;

&lt;p&gt;Happy Holidays!!! &lt;/p&gt;

</description>
    </item>
    <item>
      <title>`break` Python</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 24 Dec 2020 05:28:51 +0000</pubDate>
      <link>https://dev.to/trossii/break-python-2dmk</link>
      <guid>https://dev.to/trossii/break-python-2dmk</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the spirit of giving I thought I would write a blog to help one of my friends out with his understanding of the "break" statement in Python. Break, if you are not familiar with it, is used in Python, as well as some other programming languages, to end a loop when a condition is met. I personally find this exit most useful in nested loops (one loop inside of another) because it allows you to exit the inner loop once a condition is met and return to the outer loop, which means fewer iterations and therefore saves time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;for&lt;/code&gt; Loop
&lt;/h2&gt;

&lt;p&gt;If you are not very familiar with loops, a &lt;code&gt;for&lt;/code&gt; loop is used to iterate over a sequence, whether it be a list, dictionary, tuple or even just a string. To use &lt;code&gt;break&lt;/code&gt; in a &lt;code&gt;for&lt;/code&gt; loop, the only thing you need to do is create a conditional with an "if" statement that defines what you want the program to recognize, and then tell it to break if that condition is met. So as a silly example let's create a simple loop that goes through the letters of my name, a string, and prints them one by one but stops when it finds the "o".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for letter in "timothy":
    if letter == "o":
        break
    print(letter)
print("loop ended")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H2jt6ZY_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8pl9ohywl4q54b2w0zyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H2jt6ZY_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8pl9ohywl4q54b2w0zyq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So above on the first line I initiated my for loop, telling it to go through each letter of the string "timothy"; then below, inside of that loop and one tab further, I created a statement letting the computer know that if it found an "o" I wanted it to do something. Then it was as simple as placing the statement 'break' one tab in on the line below, inside the if statement, letting the program know to end the loop once the condition was met. You can see that this worked because the loop ended before it hit the line with &lt;code&gt;print(letter)&lt;/code&gt; again and instead moved to the print statement outside of the loop. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;while&lt;/code&gt; Loop
&lt;/h2&gt;

&lt;p&gt;Executing a &lt;code&gt;break&lt;/code&gt; statement in a &lt;code&gt;while&lt;/code&gt; loop is done the exact same way and does the same thing. In my own practice I have rarely needed &lt;code&gt;break&lt;/code&gt; in a &lt;code&gt;while&lt;/code&gt; loop, since the whole purpose of a &lt;code&gt;while&lt;/code&gt; loop is to repeat as long as the given condition is true. Though I could see using a break statement as an emergency exit: paired with a counter for iterations or elapsed time, it can end the loop early and prevent an indefinite loop. For example if we ran the following code, the computer would keep printing the next number until we stopped the kernel or the program crashed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;n=0
while n &amp;gt; -1:
    n+=1
    print(n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0_wyxj0G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w7jowalyd30is37rgf2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0_wyxj0G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w7jowalyd30is37rgf2o.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Though if we wanted to make sure this didn't happen (imagine more complex code where we weren't sure the condition would ever be met), we could add a simple counter with a break statement to make sure we only ran a certain number of iterations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;n=0
m=0
while n &amp;gt; -1:
    n+=1
    m+=1
    if m == 10:
        break
    print(n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kMqhy_Ba--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ex4lbuvdj4mfylbh9ja2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kMqhy_Ba--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ex4lbuvdj4mfylbh9ja2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Nested Loops
&lt;/h2&gt;

&lt;p&gt;As I mentioned earlier, using &lt;code&gt;break&lt;/code&gt; in nested loops can lead to running fewer iterations, so as an example I recreated a small version of my buddy's problem to show how.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def reducedict(D, keywords):
    result={}
    for (key,value) in D.items():
        for text in keywords:
            if text.lower() in key.lower():
                result.update({key:value})
                break
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we take a look at what this code is doing, we can see that the first &lt;code&gt;for&lt;/code&gt; loop is iterating through a dictionary, and the second loop below it is looping through a list of keywords to see if any of them match the keys of the dictionary, adding each matching key with its value to a new, reduced dictionary. So let's create a dictionary and list to see what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dictionary ={"one":"happy","two":"sad","Three":"mad"}
numbers=("one","three","five")
reducedict(dictionary,numbers)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5YybshOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qw319wwaav24d7rjd3ff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5YybshOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qw319wwaav24d7rjd3ff.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that it was successful as it pulled out only the keys labeled "one" and "three", but what did the break do? If we did not have &lt;code&gt;break&lt;/code&gt;, it would check each key in our dictionary once against every keyword. With our &lt;code&gt;break&lt;/code&gt; statement, as soon as it finds a matching keyword it returns to the outer loop, skipping the remaining iterations through the keywords, and begins searching the next key in the dictionary. When timed I received 2.16 µs ± 63.1 ns per loop without a &lt;code&gt;break&lt;/code&gt; statement and 1.71 µs ± 22.5 ns per loop with it included. This may seem like a very small amount of time, and it is, but imagine searching through all two hundred seventy-three thousand words in the Oxford dictionary with their full definitions.&lt;/p&gt;
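&lt;p&gt;For anyone curious, those numbers came from the notebook's %%timeit magic; a rough stand-alone version using the standard library's timeit module might look like this (the use_break flag is my own addition for the comparison, and your numbers will differ):&lt;/p&gt;

```python
import timeit

def reducedict(D, keywords, use_break=True):
    # Same logic as the example above; use_break is a flag added for timing only
    result = {}
    for key, value in D.items():
        for text in keywords:
            if text.lower() in key.lower():
                result.update({key: value})
                if use_break:
                    break
    return result

dictionary = {"one": "happy", "two": "sad", "Three": "mad"}
numbers = ("one", "three", "five")

with_break = timeit.timeit(lambda: reducedict(dictionary, numbers, True), number=10_000)
without_break = timeit.timeit(lambda: reducedict(dictionary, numbers, False), number=10_000)
print(f"with break:    {with_break:.4f} s per 10,000 calls")
print(f"without break: {without_break:.4f} s per 10,000 calls")
```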

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I know learning about coding can be tough, and I am constantly learning how small changes in your code can make huge differences. I know this blog will help at least one person, me, as I write it, and I hope it helps others understand more about some complex topics. On this topic I will add that this is written to the best of my knowledge, and if anyone has a correction I would love to hear it! &lt;/p&gt;

&lt;p&gt;Merry Christmas!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Processing Text in Python: split, get and strip</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 17 Dec 2020 01:06:44 +0000</pubDate>
      <link>https://dev.to/trossii/processing-text-in-python-split-get-and-strip-58nl</link>
      <guid>https://dev.to/trossii/processing-text-in-python-split-get-and-strip-58nl</guid>
      <description>&lt;p&gt;During this last week I have been helping a friend of mine understand some Python concepts while he was trying to pull data out of a .txt file. So I thought I would share what we covered and maybe it will end up helping someone else struggling with the same sort of material. These methods are split, get and strip.&lt;/p&gt;

&lt;h2&gt;
  
  
  .split()
&lt;/h2&gt;

&lt;p&gt;The split method is a very important tool to make use of in Python. This method allows you to take a string of text, break it apart on a character you choose, and return the parts as a list. By default, split separates your string on whitespace, in other words empty spaces. For example let's create a simple sentence: &lt;code&gt;sentence = "I fought the law but the law won"&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;words = sentence.split()
print(words)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y0uhlxS---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7ae6f4t85rsllnksk0wn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y0uhlxS---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7ae6f4t85rsllnksk0wn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Though you can also give it a parameter for the character you want it to split on, or even how many splits you want it to do.&lt;/p&gt;

&lt;p&gt;Splitting on "the":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;words2 = sentence.split("the")
print(words2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d-r78X3c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lqotmrgyr2h9vnvwr7e1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d-r78X3c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lqotmrgyr2h9vnvwr7e1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Limiting the amount of splits to 4:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;words3 = sentence.split(" ",4)
print(words3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5UmcKNWE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3hrjti1h3ti65g8trzf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5UmcKNWE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3hrjti1h3ti65g8trzf6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  .get()
&lt;/h2&gt;

&lt;p&gt;The next method I would like to talk about is get(). Normally this method is just used to return the value of a specified key, but where it becomes very useful in working with text is that instead of raising an error when the key doesn't exist, you can have it return a specified default. For example, the get method makes it easy to build the dictionary for a word counter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_counts = {}
for word in words:
    word_counts[word] = word_counts.get(word,0)+1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---BUD-nyP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p1ydffi23io59t0qxg6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---BUD-nyP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p1ydffi23io59t0qxg6e.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So in the above example we are starting with an empty dictionary and creating a for loop to cycle through our list of words. We are taking advantage of the get method's ability to return a specified default value of 0 even if it has never seen that key before. Then lastly the +1 at the end adds 1 to the count each time the word appears.&lt;/p&gt;
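&lt;p&gt;As an aside, the standard library's collections.Counter wraps this exact pattern, so the loop above can be checked against it:&lt;/p&gt;

```python
from collections import Counter

sentence = "I fought the law but the law won"
words = sentence.split()

# Counter does the get(word, 0) + 1 bookkeeping for us
word_counts = Counter(words)
print(dict(word_counts))
```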

&lt;h2&gt;
  
  
  .strip()
&lt;/h2&gt;

&lt;p&gt;The last method I find necessary for working with text is the strip method. What this allows you to do by default is get rid of whitespace before and after a string. Though if you specify a character, or a set of characters, it will remove those characters from the beginning and end of the string. So to show you why this is useful, let us say that we were trying to create a word count but there are some commas in our string, so when we use our split method our list looks like this: ['I', 'fought', 'the', 'law,', 'but', 'the,', 'law', 'won']. So let's try using our get method to create a word count of this new list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for word in words2:
    word_counts[word] = word_counts.get(word,0)+1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JdPGXgA1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/clvo6r5so5z7nujxwcud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JdPGXgA1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/clvo6r5so5z7nujxwcud.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, it now recognizes 'law' and 'law,' as two separate words; to fix this we can use a for loop and our strip method to remove the unwanted commas from our words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_words=[]
for w in words2:
    new_words.append(w.strip(','))
print(new_words)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vv_x1747--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/e0hyzindb7qxl11bay7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vv_x1747--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/e0hyzindb7qxl11bay7t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now once again we have our words without extra punctuation and can get a correct count. This kind of cleanup is very common in Natural Language Processing, not only for removing punctuation but also for normalizing variations of words so that you can compare similar tokens such as "you" and "you're".&lt;/p&gt;
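&lt;p&gt;Putting all three methods together, a compact word counter for messy text might look like this (a sketch using the comma-riddled sentence from above):&lt;/p&gt;

```python
sentence = "I fought the law, but the, law won"

word_counts = {}
for word in sentence.split():          # split on whitespace
    cleaned = word.strip(',')          # drop leading/trailing commas
    word_counts[cleaned] = word_counts.get(cleaned, 0) + 1

print(word_counts)  # {'I': 1, 'fought': 1, 'the': 2, 'law': 2, 'but': 1, 'won': 1}
```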

&lt;p&gt;I hope this helps anyone learning how to process text with Python and that you will have fun continuing your journey.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>CatBoost: What's the Hype?</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 10 Dec 2020 02:49:28 +0000</pubDate>
      <link>https://dev.to/trossii/catboost-what-s-the-hype-46l9</link>
      <guid>https://dev.to/trossii/catboost-what-s-the-hype-46l9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Lately, while browsing the internet for new Data Science tools (since the field is always changing), I keep coming across a new boosting algorithm that has supposedly taken over as king: CatBoost. So I thought I would give it a test by classifying some Credit Card Fraud &lt;a href="https://www.kaggle.com/mlg-ulb/creditcardfraud"&gt;data&lt;/a&gt; that I found on Kaggle, to see what all the hype is about. If you would like to follow along with my experiment and are downloading the data, you may get a message saying the website is a possible fraud website (very ironic), though I did not have any issues after bypassing the warning and downloading the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updating Python to include CatBoost
&lt;/h2&gt;

&lt;p&gt;First thing we need to do is update our toolbox with CatBoost. To do this through a conda install, we will need to add conda-forge to our channels with &lt;code&gt;conda config --add channels conda-forge&lt;/code&gt; and then install by simply typing &lt;code&gt;conda install catboost&lt;/code&gt; into our Terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  CatBoost
&lt;/h2&gt;

&lt;p&gt;So like I said above CatBoost is another boosting algorithm and if you have no clue what that is I would recommend checking out my &lt;a href="https://dev.to/trossii/ensemble-methods-gradient-boosting-2kb1"&gt;previous blog&lt;/a&gt; where I talk about what boosting is and how it works. For those of you who are familiar with boosting, here are some of the reasons that stuck out to me about why CatBoost is supposed to be superior to other boosting algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implements Symmetric Trees: this reportedly reduces the prediction time for the model, with the default max depth of the trees equal to 6.&lt;/li&gt;
&lt;li&gt;Random Permutations: the algorithm automatically splits the dataset into 4 permutations, which we have seen in some other boosting algorithms, but it is important because it reduces overfitting.&lt;/li&gt;
&lt;li&gt;Automatic Categorical Feature Combinations: finds additional connections within your features without you having to create them manually, producing better scores.&lt;/li&gt;
&lt;li&gt;Automatic One-Hot Encoding of Categorical Features: CatBoost will automatically one-hot encode features with 2 categories, with the option of having it encode more diverse features as well, saving you the time and energy of doing it yourself before plugging your data into the algorithm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also a ton of parameters you can tune for special occasions such as data changing over time, weighted datasets and small/large datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Comparison
&lt;/h2&gt;

&lt;p&gt;So to try out this new algorithm myself I decided to compare it with a plain Random Forest Classifier and a Gradient Boosting Classifier to see if it lived up to its hype. Also, since what I had read said it works very well without any hyper-parameter tuning, I thought I would leave all parameters set to default; based on my previous experiences I mostly hoped for the best with my Gradient Boosting Classifier, since I had needed to do a ton of tuning with those in the past.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importing Tools
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv('../../Downloads/creditcard.csv')
class_names = {0:'Not Fraud', 1:'Fraud'}
print(df.Class.value_counts().rename(index = class_names))
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6qslckIh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3lgvk3zz1dqlxok94znh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6qslckIh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3lgvk3zz1dqlxok94znh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that our dataset has quite a large imbalance, though since we are doing a first time run-through and not tuning our models, I am going to leave the data how it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Train Test Split
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X=df.drop('Class',axis=1)
y=df['Class']
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=1)
print("Length of X_train is: " + str(len(X_train)))
print("Length of X_test is: "+str(len(X_test)))
print("Length of y_train is: "+str(len(y_train)))
print("Length of y_test is: "+str(len(y_test)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qg1iANmO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/heprv8n54suborntu8oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qg1iANmO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/heprv8n54suborntu8oh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Fitting our Models
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%time
modelrf = RandomForestClassifier()
modelrf.fit(X_train,y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2 minutes 34 seconds&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%time
modelgb=GradientBoostingClassifier()
modelgb.fit(X_train,y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3 minutes 58 seconds&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%time
modelcb=CatBoostClassifier(verbose=False)
modelcb.fit(X_train,y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;26 seconds&lt;/p&gt;

&lt;p&gt;With my computer really needing to be reset and running very slowly, the CatBoost Classifier still fit the data more than 2 minutes faster than the Random Forest and more than 3 minutes faster than the Gradient Boosting Classifier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predictions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predrf = modelrf.predict(X_test)
print("Mean Accuracy Random Forest:",modelrf.score(X_test,y_test))
print("F1 Random Forest:",metrics.f1_score(y_test, predrf))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W0h96bj6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pv6c6j8ftc2xi5eb2m9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W0h96bj6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pv6c6j8ftc2xi5eb2m9p.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predgb = modelgb.predict(X_test)
print("Mean Accuracy Gradient:",modelgb.score(X_test,y_test))
print("F1 Gradient:",metrics.f1_score(y_test, predgb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eOnoxqb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9kn55rq05qen0iw8zv9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eOnoxqb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9kn55rq05qen0iw8zv9y.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predcb = modelcb.predict(X_test)
print("Mean Accuracy Cat:",modelcb.score(X_test,y_test))
print("F1 Cat:",metrics.f1_score(y_test, predcb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MK-J5IN1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kz2zaokkd7646urxccp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MK-J5IN1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kz2zaokkd7646urxccp0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For running base models the results were quite stunning! The CatBoost algorithm not only was significantly faster at fitting the data, it also performed better on the test data in both accuracy and F1 score. The plain Random Forest Classifier, apart from the amount of time it took to fit, surprised me by finding a good fit to the data and performing quite well for such a basic technique. The biggest disappointment, though I was ready for it since I decided not to tune any hyperparameters, was the Gradient Boosting Classifier, which did not come out ready to play. From this trial I am excited to use the CatBoost algorithm in future projects, and I agree with all of the Data Scientists out there so excited about it! &lt;/p&gt;

&lt;p&gt;If anyone reading this has had any experience using this algorithm and would like to share your projects, I would love to see them; it is exciting to watch these technologies progress!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Pivot: With Pandas and SQLite</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 03 Dec 2020 04:39:24 +0000</pubDate>
      <link>https://dev.to/trossii/pivot-with-pandas-and-sqlite-1h0c</link>
      <guid>https://dev.to/trossii/pivot-with-pandas-and-sqlite-1h0c</guid>
      <description>&lt;p&gt;SQL has become one of the most popular querying languages, if not the most popular. So with my experiences in the language I thought I would write a mixed blog to create a pivot table with the use of SQLite and Pandas. Just for fun at the end I will show you how to use the Pivot operator in SQL as well. &lt;/p&gt;

&lt;h1&gt;
  
  
  Pivot
&lt;/h1&gt;

&lt;p&gt;What Pivot allows you to do is convert rows to columns to create a sort of "wide view". Right away you might wonder why that would be useful, but it comes in handy when you want to view or present the results of your query in an easy-to-read fashion. With Pandas it is as simple as using pivot_table, and in SQL there is a slick Pivot operator, as I mentioned above.&lt;/p&gt;
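
&lt;p&gt;As a quick illustration (with made-up team names and goal counts), pivoting a "long" table of per-match rows into a wide one with Pandas might look like this:&lt;/p&gt;

```python
import pandas as pd
import numpy as np

# A small "long" table: one row per match (made-up data for illustration)
df = pd.DataFrame({
    'team': ['Ajax', 'Ajax', 'PSV', 'PSV', 'PSV'],
    'goals': [2, 1, 3, 0, 2],
})

# Pivot to a "wide" view: one column per team, goals summed
wide = pd.pivot_table(df, columns=['team'], values=['goals'], aggfunc=np.sum)
print(wide)
```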

&lt;h1&gt;
  
  
  Grabbing Some Data
&lt;/h1&gt;

&lt;p&gt;I will be taking some soccer data for today's post from &lt;a href="https://www.kaggle.com/benhamner/sql-playground"&gt;Kaggle&lt;/a&gt; if you would like to follow along. Once that is downloaded, we will want to import our packages and pull our data out of our wonderful Downloads folder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3
import pandas as pd
import numpy as np
conn = sqlite3.connect('../../Downloads/database.sqlite')
cur = conn.cursor()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are not familiar with sqlite3, above we are just opening a connection to our database and creating a cursor (cur) so we can run queries.&lt;/p&gt;
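
&lt;p&gt;If this pattern is new to you, here is a minimal self-contained sketch of the same connect/cursor/query cycle, using an in-memory database and a made-up table:&lt;/p&gt;

```python
import sqlite3

# An in-memory database behaves the same as a file on disk
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Create a toy table and query it through the cursor
cur.execute("CREATE TABLE Team (team_api_id INTEGER, team_long_name TEXT)")
cur.execute("INSERT INTO Team VALUES (1, 'FC Barcelona'), (2, 'Celtic')")
cur.execute("SELECT team_long_name FROM Team ORDER BY team_api_id")
rows = cur.fetchall()
print(rows)  # [('FC Barcelona',), ('Celtic',)]
conn.close()
```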

&lt;h1&gt;
  
  
  Querying Our Data
&lt;/h1&gt;

&lt;p&gt;Now that we have our data and SQLite set up we can begin our queries. For our example I thought it would be fun to find out which teams scored the most goals in home games, so we could see who appreciates those hometown fans! To do this we are going to have to join our Match table and Team table together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT team_long_name, home_team_goal goals
FROM Match 
JOIN Team
ON Team.team_api_id = Match.home_team_api_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since above we have both of those tables joined on the team_api_id, we can pull the full team names and how many goals they scored in each match when they were the home team. Now to display our results we need to tell our cursor to execute the query, and then we can fetch the information into a Pandas DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cur.execute("""
SELECT team_long_name,home_team_goal
FROM Match 
JOIN Team
ON Team.team_api_id = Match.home_team_api_id;
""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lsIlsKmW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/no38rucm8d28wyfadcl0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lsIlsKmW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/no38rucm8d28wyfadcl0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You may have noticed that I did not sum the values for the total amount of goals yet, but just hold on: Pandas' pivot_table has an option for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Changing to Pivot
&lt;/h2&gt;

&lt;p&gt;So now that we have our columns we can go over what this whole thing is about: pivots. With Pandas, creating a pivot table is very simple; all you need is the pivot_table function and the following arguments:&lt;/p&gt;

&lt;p&gt;Data: The DataFrame you want to use&lt;br&gt;
Values: The information you want for your rows&lt;br&gt;
Columns: What you want your new columns to be&lt;br&gt;
Aggfunc: Any aggregate functions you want to pass on your values&lt;/p&gt;

&lt;p&gt;It is as simple as that, so let's plug it in and see what happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pd.pivot_table(df, columns = ['team'], values = ['goals','games'], aggfunc = np.sum)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OT9D46Al--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/675ir79g4q3mooyho3mu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OT9D46Al--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/675ir79g4q3mooyho3mu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our pivot table is complete, but we have so many teams. To get the top 5 and reorder them we could use a combination of reindexing and iloc, but why do that when we have SQL syntax and can easily reorder with an ORDER BY and LIMIT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT team_long_name team,SUM(home_team_goal) goals
FROM Match 
JOIN Team
ON Team.team_api_id = Match.home_team_api_id
GROUP BY team_long_name
ORDER BY goals DESC
LIMIT 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0o8e_iAm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xpsevo1v55g9dfehwg5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0o8e_iAm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xpsevo1v55g9dfehwg5b.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Pivot
&lt;/h2&gt;

&lt;p&gt;Some SQL dialects (such as SQL Server) also offer a Pivot operator, as I mentioned earlier, though sadly you cannot use it with SQLite. To use this operator we use the exact same query but with three differences.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We need to make our query a subquery so that we can call upon it&lt;/li&gt;
&lt;li&gt;We need to call the subquery with our Pivot operator&lt;/li&gt;
&lt;li&gt;We need to select which values we want as our columns&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So your final query should end up as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM  
(SELECT team_long_name team,SUM(home_team_goal) goals
FROM Match 
JOIN Team
ON Team.team_api_id = Match.home_team_api_id
GROUP BY team_long_name
ORDER BY goals DESC
LIMIT 5
)
AS Home_Team_Goals
PIVOT(
    goals
    FOR team IN ([Real Madrid CF],[FC Barcelona],[Celtic],[FC Bayern Munich],[PSV])
) AS TopHomePivotTable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It is as simple as that to create a pivot table out of your data. I apologize for not having an image for the Pivot using SQL syntax, but I am having trouble with my root password for MySQL and with resetting it through the terminal. If anyone has any tips or tricks for the reset beyond the MySQL webpage, I would love to hear them.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Using a Neural Network Pt.3</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 26 Nov 2020 00:37:14 +0000</pubDate>
      <link>https://dev.to/trossii/using-a-neural-network-pt-3-4gc5</link>
      <guid>https://dev.to/trossii/using-a-neural-network-pt-3-4gc5</guid>
      <description>&lt;p&gt;After last weeks blog I continued to train the Neural Network that we had created, when I realized my model was running an accuracy of 1 on the training data but only a .8 or so on the validation data. This told me that my model had begun overfitting my data so I decided to add a few more Dropout layers to my network, along with some extra Batch Normalization to help speed it up as seen below. Also to continue training after your model has finished running all of your epochs rerun the cell with "history" in it and it will continue another set after the last epoch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.add(Conv2D(16, kernel_size = (3, 3), activation='relu', input_shape=(image_size, image_size, 3)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(BatchNormalization())

model.add(Conv2D(32, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(BatchNormalization())

model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Conv2D(168, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.6))

model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.4))

model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Dense(1, activation = 'sigmoid'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  New Hyperparameters
&lt;/h2&gt;

&lt;p&gt;Now that we have a saved model I want to show you a couple of awesome hyperparameters that can save you a lot of time in training your model: the callbacks EarlyStopping and ModelCheckpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from keras.callbacks import EarlyStopping, ModelCheckpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;EarlyStopping allows your model to stop itself when it begins overfitting or simply stops improving. If you have restore_best_weights set to True, it will return the model to the weights where it was performing best before the epochs ended. Lastly, you can set a tolerance for how many epochs your model should keep looking for better weights before stopping, via the patience setting. So overall, EarlyStopping lets you set your model to run a large number of epochs without having to worry about it running forever or overfitting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;es = EarlyStopping(monitor='val_accuracy', patience=10,
restore_best_weights=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we have ModelCheckpoint; this feature is used for saving the progress of training. It is very similar to the model.save that we used in the last blog, except that instead of only saving at the end it saves your model while it is running. Also, similar to EarlyStopping's restore_best_weights, ModelCheckpoint has a save_best_only option that makes sure you do not overwrite a model that was performing better.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;checkpoint_cb = ModelCheckpoint("pneu_model2.h5",
save_best_only=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So between these two features you can run your model with a large amount of epochs and save yourself the fear of saving over your best model, overfitting or even worse having your computer crash while you do not have a saved model.&lt;/p&gt;

&lt;p&gt;*With a saved model you can load it in with &lt;code&gt;model = load_model('pneu_model2.h5')&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;epochs = 50
steps_per_epoch = train_generator.n // batch_size
validation_steps = test_generator.n // batch_size
es = EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)
checkpoint_cb = ModelCheckpoint("pneu_model2.h5",
                                                    save_best_only=True)
cb_list = [es,checkpoint_cb]
history = model.fit(train_generator,
                              steps_per_epoch=steps_per_epoch,
                              epochs=epochs,
                              validation_data=test_generator,
                              validation_steps=validation_steps,
                              class_weight=class_weight,
                              callbacks=cb_list)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PxaOuu_8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kfc7bc7gptlarchrzscv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PxaOuu_8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kfc7bc7gptlarchrzscv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating our model
&lt;/h2&gt;

&lt;p&gt;Testing your metrics on your test data is very simple with Keras; all you need to do is choose which metrics you would like to access and call them like so: &lt;code&gt;loss, acc, rec = model.evaluate(test_generator)&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;After that I usually enjoy creating a small confusion matrix so I can have a visual of how my model is performing as well. I find the easiest way to do this is to create a list of all of the predictions, round them to 0 or 1 (because the neural network actually outputs a sort of confidence score), and then use confusion_matrix from Sklearn's metrics package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predictions = []
for x in model.predict(test_generator):
    for z in x:
        predictions.append(np.round(z))

metrics.confusion_matrix(test_generator.classes,predictions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iwNEEfUu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7dq0ebak9hgsz2whvz9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iwNEEfUu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7dq0ebak9hgsz2whvz9g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With our goal of a high recall score, it seems we have done very well with a score of .98, meaning that we will have a very low chance of diagnosing someone who has pneumonia as healthy. Though we did well there, our overall accuracy of .86 could use some improvement, and I believe the best way to get this done would be to collect more images in general, especially more images of healthy people. That way our neural network would have more balanced data to work with. Though I am happy with this model for this tutorial, I would urge you to try changing and adding some of the layers and see what you get.&lt;/p&gt;

&lt;p&gt;Note that for these tutorials I used the testing data as validation data; normally this is not what you should do. I did this because the point of these blogs was showing how neural networks are set up, but if you are creating your own it is best to create a validation set out of your training data with a train-test split.&lt;/p&gt;
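
&lt;p&gt;A minimal sketch of that kind of split with scikit-learn's train_test_split (toy arrays standing in for real image data):&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and labels standing in for real image data
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the *training* data as a validation set,
# stratified so both classes appear in each split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_val.shape)  # (8, 2) (2, 2)
```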

</description>
    </item>
    <item>
      <title>Using a Neural Network Pt.2</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 19 Nov 2020 01:47:38 +0000</pubDate>
      <link>https://dev.to/trossii/using-a-neural-network-pt-2-4e8o</link>
      <guid>https://dev.to/trossii/using-a-neural-network-pt-2-4e8o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In last week's blog we downloaded our dataset, built an understanding of our data so that we could determine which metrics we wanted to use, and set up our images so that they could be used in our Neural Net. So today I will be continuing by going over the setup of the Neural Net itself, trying my hardest to prepare a model that is both accurate and efficient enough that if you try this yourself it won't take forever to run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Batches and Epochs
&lt;/h2&gt;

&lt;p&gt;Back in the last blog, in the section "Prepping the Images", I talked about all of the conversions we were doing to the images, but I left out another important value I had assigned: batch_size. Batch size controls how many images (in our case) we are fitting to our model at one time. Every time a batch is fit, the neural net adjusts the weights on each node and then runs the next batch through. The completion of all of the batches is called an epoch, and when the model completes an epoch it reshuffles the images into new batches and begins the next one. Why I bring this up now is because previously I accidentally had my batch size set to 8, which, while it would get us better results in fewer epochs, would take a lot of time to run each one. So we are going to adjust batch_size to 64.&lt;/p&gt;
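
&lt;p&gt;The effect on epoch length is just integer division of the dataset size by the batch size; a tiny sketch (the image count of 5216 matches this Kaggle training set, but treat the exact number as an assumption):&lt;/p&gt;

```python
# Number of full batches per epoch for each batch size
n_images = 5216  # assumed size of the training set
steps = {batch_size: n_images // batch_size for batch_size in (8, 64)}
print(steps)  # {8: 652, 64: 81}
```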

&lt;h2&gt;
  
  
  Building the Network
&lt;/h2&gt;

&lt;p&gt;For our model we will be using a couple of different layer types such as dense, convolutional, batch normalization and dropout layers.&lt;/p&gt;

&lt;p&gt;Dense layers are the most common layer in neural networks; as the Keras description says, "Just your regular densely-connected NN layer". Dense layers find associations between features by taking the dot product of the input tensor and the layer's weight kernel.&lt;/p&gt;

&lt;p&gt;Convolutional layers go through the pixels in the image and compare them to the surrounding pixels to find patterns in the image.&lt;/p&gt;

&lt;p&gt;Batch normalization is one of the layers I understand least, but I do know that it speeds up the neural network by re-centering and re-scaling its input.&lt;/p&gt;
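
&lt;p&gt;At its core, that re-centering and re-scaling is just standardization of each batch (ignoring the extra scale and shift parameters the layer learns); a quick numpy sketch:&lt;/p&gt;

```python
import numpy as np

batch = np.array([2.0, 4.0, 6.0, 8.0])

# Re-center to mean 0 and re-scale to standard deviation 1
normalized = (batch - batch.mean()) / batch.std()
print(normalized.mean(), normalized.std())  # mean ~0, std ~1
```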

&lt;p&gt;Then finally we have dropout layers which do exactly what they sound like, they drop out a percentage of the weights to prevent overfitting.&lt;/p&gt;

&lt;p&gt;So lets define our model and add our layers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = Sequential()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.add(Conv2D(32, kernel_size = (3, 3), activation='relu', input_shape=(image_size, image_size, 3)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(BatchNormalization())

model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(BatchNormalization())

model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(BatchNormalization())

model.add(Conv2D(96, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(BatchNormalization())
model.add(Dropout(.3))

model.add(Conv2D(32, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(BatchNormalization())
model.add(Dropout(0.4))

model.add(Flatten())
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.3))
model.add(Dense(1, activation = 'sigmoid'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So as you can see, first we use our convolutional layers of different sizes to go through the image and find what patterns it sees, then we flatten the image down, which allows us to throw in some dense layers, and eventually our last layer has a single output, so we can get our diagnosis.&lt;/p&gt;

&lt;p&gt;Then finally we can compile our model together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.compile(optimizer = 'adam',
              loss='binary_crossentropy',
              metrics=['accuracy',keras.metrics.Recall(name='recall')])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Select our number of epochs and run our model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;epochs = 25
steps_per_epoch = train_generator.n // batch_size
validation_steps = test_generator.n // batch_size
history = model.fit(train_generator,
                              steps_per_epoch=steps_per_epoch,
                              epochs=epochs,
                              validation_data=test_generator,
                              validation_steps=validation_steps,
                              class_weight=class_weight)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z46IybvQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8zr4ezzuzou3dcxp4f6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z46IybvQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8zr4ezzuzou3dcxp4f6g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And like that we have a fitted neural network!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So now that we have our Neural Network fitted we can save it by using &lt;code&gt;model.save('pneu_model.h5')&lt;/code&gt;, and then we can continue making changes to our model without losing the one we have just fitted. Next week I will be going over adding some hyperparameters to help get us the accuracy and recall we are looking for, as well as finishing up this series.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Using a Neural Network Pt.1</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 12 Nov 2020 01:23:56 +0000</pubDate>
      <link>https://dev.to/trossii/using-a-neural-network-pt-1-2m8e</link>
      <guid>https://dev.to/trossii/using-a-neural-network-pt-1-2m8e</guid>
      <description>&lt;p&gt;During my time interviewing I have been asked to describe some of my projects in detail. So to help myself go through my process and the decisions I made I decided to write a blog about one of my favorites, using a Neural Network to detect Pneumonia in Xray images. It has been awhile since I visited this project so I decided to try and add some upgrades to my previous model. The dataset I used for my actual project is 7.9 GB I will be referring to the smaller dataset from Kaggle found &lt;a href="https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Since I am still on the newer side of working with Neural Networks I usually don't come up with the most efficient models, so my training takes awhile to run. Therefore, today I will be showing you how I organize my data and set it up for the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting a first impression of the Data
&lt;/h2&gt;

&lt;p&gt;Before even looking at the data it was important for me to get a business understanding of it so that I could plan which metrics I wanted to use to evaluate my model. In this situation of using a Neural Network for detection, the process could be used in rural areas that lack expertise, or just to lessen a doctor's workload in a busy hospital when someone comes in for a recovery checkup. With this in mind, I determined that if the machine did get a diagnosis wrong, I would much rather it diagnose a healthy person as having Pneumonia (since a doctor would most likely double-check any image flagged as having Pneumonia) than tell the doctor that someone who has Pneumonia is healthy. This means I needed to really focus on Recall as a metric, while keeping the others in mind as well.&lt;/p&gt;
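
&lt;p&gt;To make that concrete: recall only penalizes the dangerous error here, the missed Pneumonia cases. A tiny sketch with made-up counts:&lt;/p&gt;

```python
# Made-up confusion-matrix counts for illustration
true_positives = 95   # pneumonia correctly flagged
false_negatives = 5   # pneumonia missed -- the dangerous error here

# Recall: the fraction of actual pneumonia cases the model catches
recall = true_positives / (true_positives + false_negatives)
print(recall)  # 0.95
```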

&lt;p&gt;Now that I had an understanding of the problem it was time to take a look at our downloaded data. The author of the Kaggle post was nice enough to have already separated the data into training, test and validation folders. So import it and take a look at what we have.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;base_dir, _ = os.path.splitext("../../Downloads/chest_xray")
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'val')

train_normal = os.path.join(train_dir, 'NORMAL')
print ('Total training normal images:', len(os.listdir(train_normal)))
train_pneu = os.path.join(train_dir, 'PNEUMONIA')
print ('Total training pneu images:', len(os.listdir(train_pneu)))

val_normal = os.path.join(validation_dir, 'NORMAL')
print ('Total validation normal images:', len(os.listdir(val_normal)))
val_pneu = os.path.join(validation_dir, 'PNEUMONIA')
print ('Total validation pneu images:', len(os.listdir(val_pneu)))

test_dir = os.path.join(base_dir,'test')
test_normal = os.path.join(test_dir, 'NORMAL')
print ('Total test normal images:', len(os.listdir(test_normal)))
test_pneu = os.path.join(test_dir, 'PNEUMONIA')
print ('Total test pneu images:', len(os.listdir(test_pneu)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1Vfaxfsl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9jkkr24j09360y2h6fim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1Vfaxfsl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9jkkr24j09360y2h6fim.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, the training data has quite a big imbalance of "Pneumonia" X-rays over "Normal" X-rays, but the testing and validation folders are pretty equal. We can correct this bias in our Neural Network before we run it, but we might as well calculate the weights now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;weight_norm = (1 / len(os.listdir(train_normal)))*(len(os.listdir(train_normal))+len(os.listdir(train_pneu)))/2.0 
weight_pneu = (1 / len(os.listdir(train_pneu)))*(len(os.listdir(train_normal))+len(os.listdir(train_pneu)))/2.0
class_weight = {0: weight_norm, 1: weight_pneu}
class_weight
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  {0: 1.9448173005219984, 1: 0.6730322580645162}
&lt;/h4&gt;

&lt;h2&gt;
  
  
  Prepping the images
&lt;/h2&gt;

&lt;p&gt;For this part we have to get a general understanding of how a computer "sees" pictures. Images are made up of pixels arranged in rows and columns. To you these are just very tiny dots of color; to give you a sense of how small, a 1920 x 1080 HD TV is 1920 pixels wide and 1080 pixels high. But the computer only understands numbers, so to convert colors to numbers we use various color models, the most well known being RGB, the Red Green Blue model. This gives the computer a value ranging from 0-255 for how much of each color is in each pixel.&lt;/p&gt;
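
&lt;p&gt;A quick numpy sketch of that idea: a tiny made-up "image" of 8-bit RGB pixels, plus the 0-1 rescaling we will apply next:&lt;/p&gt;

```python
import numpy as np

# A 2x2 "image" with 3 color channels, values 0-255 (made up)
image = np.array([[[255, 0, 0], [0, 255, 0]],
                  [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
print(image.shape)  # (2, 2, 3): rows, columns, RGB channels

# Rescale from 0-255 down to 0-1, as the ImageDataGenerator below does
scaled = image / 255.0
print(scaled.min(), scaled.max())  # 0.0 1.0
```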

&lt;p&gt;Now that we understand that, we need to make it so our model can easily digest our images and find the similarities and differences between them. I am going to use the ImageDataGenerator, which is a data augmenter, to resize each image to 224x224 pixels for comparison. This is not enough in itself, though: the pixel values still range from 0-255, which is hard for our model to process, so to make things easier for the computer we are also going to scale them to between 0 and 1 with a factor of 1/255.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;image_size = 224 # All images will be resized to 224x224
batch_size = 8

# Rescale all images by 1./255 and apply image augmentation
train_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

validation_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255)
test_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

# Flow training images in batches of batch_size using the train_datagen generator
train_generator = train_datagen.flow_from_directory(
                train_dir,  # Source directory for the training images
                target_size=(image_size, image_size),
                batch_size=batch_size,
                # Since we use binary_crossentropy loss, we need binary labels
                class_mode='binary')

# Flow validation images in batches of batch_size using the validation_datagen generator
validation_generator = validation_datagen.flow_from_directory(
                validation_dir, # Source directory for the validation images
                target_size=(image_size, image_size),
                batch_size=batch_size,
                class_mode='binary')

# Flow test images in batches of batch_size using the test_datagen generator
test_generator = test_datagen.flow_from_directory(
                test_dir, # Source directory for the test images
                target_size=(image_size, image_size),
                batch_size=batch_size,
                class_mode='binary',
                shuffle=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now our data is set and in a state that will be easily understood by the computer. Next we can create our model to process our data through, but that is a whole different mess and so I will be continuing with that next week. So stay tuned for Using a Neural Network Pt.2.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My Intro into Survival Analysis</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 05 Nov 2020 00:05:56 +0000</pubDate>
      <link>https://dev.to/trossii/my-intro-into-survival-analysis-16hk</link>
      <guid>https://dev.to/trossii/my-intro-into-survival-analysis-16hk</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Today's blog is going to be different from the blogs I normally post. Instead of talking about a subject I know well, I am going to take you along on my journey of learning about Survival Analysis. This topic is huge and there is a ton I still don't know, but I am going to tackle covering examples of the Kaplan-Meier Estimator and the Cox Proportional Hazards Model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Survival Analysis
&lt;/h2&gt;

&lt;p&gt;Survival Analysis, also known as Time-to-Event Analysis, is used to estimate when someone or something will experience an event. For example, in Engineering it is used for reliability analysis, and in Economics for duration modeling, though it was originally developed for medical research, hence the "survival".&lt;/p&gt;

&lt;h2&gt;
  
  
  Prep
&lt;/h2&gt;

&lt;p&gt;If you would like to follow along, I downloaded my data from &lt;a href="https://www.kaggle.com/gilsousa/habermans-survival-data-set"&gt;Haberman's Survival Data Set&lt;/a&gt; on Kaggle. The libraries and tools I am using are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd 
import numpy as np
import random
from lifelines.fitters.kaplan_meier_fitter import KaplanMeierFitter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally the steps to import it into a Pandas data frame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_csv('../Downloads/haberman.csv',names=['age', 'year_of_treatment', 'positive_lymph_nodes', 'survival_5_years'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Kaplan Meier
&lt;/h2&gt;

&lt;p&gt;The Kaplan-Meier estimator is used to estimate the survival function. It works by finding the fraction of subjects who survive past each point in time and plots those estimates as the Kaplan-Meier curve. So in our case the visual will show us the probability of surviving 5 years after the surgery, with age as the time axis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;km = KaplanMeierFitter()
km.fit(data['age'],data['survival_5_years'])
km.plot()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zy-gfQfr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/82zpgcgpx1c1arif3qs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zy-gfQfr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/82zpgcgpx1c1arif3qs0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another nifty metric you can get from this estimator is the median survival time, the point at which half of the population has experienced the event of interest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;km.median_survival_time_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;52.0&lt;/p&gt;
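&lt;p&gt;To make the estimator less of a black box, here is a minimal from-scratch sketch of the product-limit calculation that lifelines is doing for us, run on a tiny made-up cohort (the durations and events below are hypothetical, not the Haberman data):&lt;/p&gt;

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return (event times, survival probabilities) for the Kaplan-Meier estimator."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    times = np.unique(durations[events == 1])  # times at which an event occurred
    survival = []
    s = 1.0
    for t in times:
        at_risk = np.sum(durations >= t)  # subjects still being followed at time t
        died = np.sum(np.logical_and(durations == t, events == 1))  # events at time t
        s = s * (1 - died / at_risk)  # multiply in this interval's survival fraction
        survival.append(s)
    return times, np.array(survival)

# durations in years, with 1 = event observed and 0 = censored
times, surv = kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 1])
print(dict(zip(times, np.round(surv, 3))))  # the censored subject only shrinks the risk set
```

&lt;p&gt;Each step multiplies the running survival probability by the fraction of the remaining at-risk subjects who made it past that time, which is exactly the curve km.plot() draws.&lt;/p&gt;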

&lt;p&gt;From what I understand, this estimator can also be very useful for comparing how other features affect your population, by splitting the data into groups based on those features and fitting a curve for each group. Though I haven't gone deep enough into the subject yet to understand how to validate these divisions and comparisons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cox Proportional Hazards Model
&lt;/h2&gt;

&lt;p&gt;The Cox Proportional Hazards model is another huge tool in the Survival Analysis toolbox. This model takes several variables into account at the same time and examines their relationship to the survival distribution. As I understand it, it works much like a multiple regression model: it partitions the data into small intervals of time, each containing at least one event of interest, and estimates weights for the different variables to build an accurate estimator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chm=CoxPHFitter()
chm.fit(data,'age','survival_5_years')
chm.plot()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--anzhJvH7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/og6yrlzb5obt6qv4nueb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--anzhJvH7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/og6yrlzb5obt6qv4nueb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So now with the Cox model plotted we can see the coefficients showing how the model weighted the effect that positive_lymph_nodes and year_of_treatment had on the survival of these people, as well as the confidence intervals on those estimates. With such large intervals it looks as though we may need more data, but we can also get more information about the coefficients by printing the summary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chm.print_summary()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g0VD7RRp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hbnsmr44s9q682h2ktkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g0VD7RRp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hbnsmr44s9q682h2ktkm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Note
&lt;/h2&gt;

&lt;p&gt;As I said above, this is a new subject for me and my knowledge of it is just beginning, so if anyone reading this would like to help me expand on it, I would love to get in contact with you!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Pipelines: Clean Your Notebook</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Wed, 28 Oct 2020 22:46:03 +0000</pubDate>
      <link>https://dev.to/trossii/pipelines-clean-your-notebook-4gpi</link>
      <guid>https://dev.to/trossii/pipelines-clean-your-notebook-4gpi</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Normally when I am creating a machine learning model I run through a bunch of models to get an idea of how the defaults will perform on my data. Every time I do this, though, I end up creating a mess in my notebook and have to keep scrolling up and down to find out what I named each model and how it performed so I can compare it to its neighbor. So I have been using an awesome tool from Scikit-Learn called Pipeline to help clean up my mess and create a neater notebook. While this is not the only use, or even debatably the best use, for this tool, it is something that has come in handy for me and so I would like to share it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Pipeline?
&lt;/h1&gt;

&lt;p&gt;Before I get too far ahead of myself and show you how I use Pipeline, I should give a little more detail about what it does. According to the Scikit Learn website "The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters". What this is saying is that we can have a preprocessor and a model together in the same pipeline, so that we can quickly assess different models. So let's try it out!&lt;/p&gt;

&lt;h1&gt;
  
  
  Example
&lt;/h1&gt;

&lt;p&gt;To get started we need to import all our libraries and tools, and we have a ton for trying out different classifiers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can load in our data. As in previous posts, I really enjoy using Pandas for the layout and ease of access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wine = load_wine()
data = pd.DataFrame(data= np.c_[wine['data'], wine['target']],
                     columns= wine['feature_names'] + ['target'])
data.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YRTv_sBw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7r4ucpkx2t09lt4larg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YRTv_sBw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7r4ucpkx2t09lt4larg6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With all of that beautiful data loaded in let's do our split and then we can run our pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X=data.drop('target',axis=1)
y=data['target']
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting up the Pipeline is amazingly easy: all you need to do is call the Pipeline tool and pass it the preprocessing step or steps you want to run, followed by the model. Here, since I am running multiple classifiers, I just add them to a list and loop over it, building a pipeline for each one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier()
    ]

for c in classifiers:
    pl = Pipeline([('standard scaler', StandardScaler()), ('models', c)])
    pl.fit(Xtrain,ytrain)
    print(f"{c.__class__.__name__} has a score of {round(pl.score(Xtest,ytest),2)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OSi2bxaA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ywxkrzibff5vp8y6nw2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OSi2bxaA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ywxkrzibff5vp8y6nw2v.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And there we go, we have all of our models running in the same space for an easy early comparison. Another great use I have found for a pipeline like this is quickly running the same model with a few different hyperparameters to see the effect they have on it, though once you have settled on a model I would recommend a grid search for hyperparameter optimization in the long run. &lt;/p&gt;
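&lt;p&gt;As a quick sketch of that long-run recommendation, here is how a grid search can wrap the very same kind of pipeline (the step name 'model' and the parameter values are just illustrative choices, not the only sensible ones):&lt;/p&gt;

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
Xtrain, Xtest, ytrain, ytest = train_test_split(wine.data, wine.target, random_state=0)

pl = Pipeline([('scaler', StandardScaler()),
               ('model', RandomForestClassifier(random_state=0))])

# pipeline hyperparameters are addressed as stepname__parameter
grid = GridSearchCV(pl, param_grid={'model__n_estimators': [25, 100],
                                    'model__max_depth': [2, None]}, cv=3)
grid.fit(Xtrain, ytrain)
print(grid.best_params_, round(grid.score(Xtest, ytest), 2))
```

&lt;p&gt;Because the scaler lives inside the pipeline, each cross-validation fold is scaled using only its own training portion, so there is no leakage from the held-out fold.&lt;/p&gt;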

</description>
    </item>
    <item>
      <title>Overfit vs Underfit: The Modeling War</title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 22 Oct 2020 02:37:48 +0000</pubDate>
      <link>https://dev.to/trossii/overfit-vs-underfit-the-modeling-war-1flc</link>
      <guid>https://dev.to/trossii/overfit-vs-underfit-the-modeling-war-1flc</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When creating machine learning models, overfitting and underfitting are key concepts. An underfit model will not pick up on key features in your data and will perform poorly on both your training and testing sets, while an overfit model will look great on your training data but will not fit your testing data very well. To create a model that returns good predictions you will need to find a balance between the two, so let's go over these concepts.&lt;/p&gt;

&lt;p&gt;To provide some examples I will be coding some visuals, so if you want to follow along here are the libraries and data that I am using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create our data I am going to generate random points from a quadratic function and then add some noise to help it resemble real data (to a point).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;np.random.seed(14)
x = np.random.uniform(0, 11, 20)
x = np.sort(x)
y = (-x+4) * (x-9) + np.random.normal(0, 3,20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Underfitting
&lt;/h2&gt;

&lt;p&gt;Now let's get started by creating a visual of underfitting so that we can take a look at what it looks like. To do this we are going to split our data into training and testing sets, then fit a linear model to the training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.40, random_state=14)
reg = LinearRegression().fit(X_train.reshape(-1, 1), y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, color='blue',label='Train Data')
plt.plot(X_train.reshape(-1, 1), reg.predict(X_train.reshape(-1, 1)),label='Underfit Model')
plt.legend(loc=[0.17, 0.1])

plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, color='green',label='Test Data')
plt.plot(X_train.reshape(-1, 1), reg.predict(X_train.reshape(-1, 1)),label='Underfit Model')
plt.legend(loc=[0.17, 0.1]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6memCDD_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/894fezlrdo93fg2c5yzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6memCDD_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/894fezlrdo93fg2c5yzx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the visual above we can see that the model is not really picking up the relation between x and y, or in a normal case the relations between the features in your data. The nice thing about underfitting is that it can usually be spotted from poor performance on your training data. It could be caused by our model not being complex enough for our data (as in our case here) or by having too few features. Another term commonly used to describe underfitting is that the model has a high bias, which I like to think of as people ignoring the data that does not match their point of view.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overfitting
&lt;/h2&gt;

&lt;p&gt;So now to show overfitting let's try fitting an 8th degree polynomial to our quadratic dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;poly = PolynomialFeatures(8)
x_fin = poly.fit_transform(X_train.reshape(-1, 1))
reg_poly = LinearRegression().fit(x_fin, y_train)

X_linspace = np.linspace(0, 11, 30)
X_linspace_fin = poly.fit_transform(X_linspace.reshape(-1,1))
y_poly_pred = reg_poly.predict(X_linspace_fin)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, color='blue',label='Train Data')
plt.plot(X_linspace, y_poly_pred,label='Overfit Model')
plt.ylim(bottom=-50,top=20)
plt.legend(loc=[0.17, 0.1])

plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, color='green',label='Test Data')
plt.plot(X_linspace, y_poly_pred,label='Overfit Model')
plt.ylim(bottom=-50,top=20)
plt.legend(loc=[0.17, 0.1]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OMVhoByT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/v20iwhr4quqa6a2cu4f2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OMVhoByT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/v20iwhr4quqa6a2cu4f2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Overfitting is a little trickier. As you can see from the graphic above, while the model fits the training set (on the left) very well, it totally misses the points in the testing set. This is because the algorithm is modeling the noise in the training set rather than the intended outputs, which is known as having high variance. The cause of overfitting can normally be traced to a model that is too complex for the dataset, in our case an 8th degree polynomial on data generated from a 2nd degree one, to outliers/errors in your data, or to simply not having enough data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For a final image let's take a look at how the model should look, by plugging in a 2nd degree polynomial.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;poly = PolynomialFeatures(2)
x_fin = poly.fit_transform(x.reshape(-1, 1))
reg_poly = LinearRegression().fit(x_fin, y)
X_linspace = np.linspace(0, 11, 30)
X_linspace_fin = poly.fit_transform(X_linspace.reshape(-1,1))
y_poly_pred = reg_poly.predict(X_linspace_fin)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, color='blue',label='Train Data')
plt.plot(X_linspace, y_poly_pred,label='Ideal Model')
plt.legend(loc=[0.17, 0.1])

plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, color='green',label='Test Data')
plt.plot(X_linspace, y_poly_pred,label='Ideal Model')
plt.legend(loc=[0.17, 0.1]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uQRfzLVA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nt09h9cxieyeh72uljb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uQRfzLVA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nt09h9cxieyeh72uljb9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can see that while our model does not fit the data perfectly on either the training or testing set, due to the random noise, it fits both sets very well. If you look back at the causes of the two opposing issues, you might have noticed that they were opposites: increasing the complexity of your model gives it higher variance, while lowering it gives higher bias. This is called the Bias-Variance Tradeoff, and the key to a well-built machine learning model lies somewhere in between.&lt;/p&gt;
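&lt;p&gt;To put rough numbers on that tradeoff, here is a small sketch that refits the same synthetic data with degree 1, 2 and 8 polynomials and compares train and test mean squared error (the exact values depend on the random seed):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# same synthetic quadratic data as above
np.random.seed(14)
x = np.sort(np.random.uniform(0, 11, 20))
y = (-x + 4) * (x - 9) + np.random.normal(0, 3, 20)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.40, random_state=14)

errors = {}
for degree in (1, 2, 8):
    poly = PolynomialFeatures(degree)
    reg = LinearRegression().fit(poly.fit_transform(X_train.reshape(-1, 1)), y_train)
    train_mse = mean_squared_error(y_train, reg.predict(poly.transform(X_train.reshape(-1, 1))))
    test_mse = mean_squared_error(y_test, reg.predict(poly.transform(X_test.reshape(-1, 1))))
    errors[degree] = (round(train_mse, 1), round(test_mse, 1))

# degree 1 should look bad everywhere, degree 8 should look good only on the train set
print(errors)
```

&lt;p&gt;The underfit model is the one with high error on both sets, while the overfit model is the one whose test error is far above its train error.&lt;/p&gt;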

</description>
    </item>
    <item>
      <title>Central Limit Theorem: In Data Science? </title>
      <dc:creator>Timothy Cummins</dc:creator>
      <pubDate>Thu, 15 Oct 2020 02:22:22 +0000</pubDate>
      <link>https://dev.to/trossii/central-limit-theorem-in-data-science-8c9</link>
      <guid>https://dev.to/trossii/central-limit-theorem-in-data-science-8c9</guid>
      <description>&lt;p&gt;The other day while talking to a friend about all the skills required for a Data Science, we started to go more in depth about an important one that is usually overlooked in conversation because it is not as "exciting", statistics. So today I want to go over a statistical concept that is very important when working with data, the Central Limit Theorem. &lt;/p&gt;

&lt;p&gt;What the Central Limit Theorem (CLT) tells us is that, when independent random variables are added together, their normalized sum will converge toward a normal distribution. So let's quickly go over these terms and then why this is useful for Data Scientists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Independent Random Variables
&lt;/h2&gt;

&lt;p&gt;Independent random variables here refers to subsets of variables chosen randomly from the larger group, with independence meaning that the outcome of one variable has no effect on the outcome of another. For example, if you flipped a coin 100 times and then randomly selected the results of 10 of the tosses, you would have 10 independent variables, because the outcome of one toss, heads or tails, does not depend on any other toss. &lt;/p&gt;

&lt;h2&gt;
  
  
  Normal Distribution
&lt;/h2&gt;

&lt;p&gt;If a dataset is normally distributed it will have most of its data at or around the mean, and then be spread out equally on both sides as the probability decreases, creating a Bell Curve like the one below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fglgocwlnii6ha3km71fy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fglgocwlnii6ha3km71fy.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having our data distributed in this way gives us a lot of insight, such as a probability distribution for finding outliers and for hypothesis testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Data Science
&lt;/h2&gt;

&lt;p&gt;In DS, hypothesis testing is a very important part of your job. You constantly have to ask yourself whether the data you have supports your idea or just looks that way due to chance. The way this is done is by checking how likely our data would be if our idea were wrong (the null hypothesis). As long as we have a sample size of roughly 30 or more, we can treat the distribution of sample means as approximately normal and see where our null hypothesis sits, along with the probability that we can cast it aside in favor of our hypothesis. I apologize if this is confusing and will try to continue on hypothesis testing in another blog, but for now we will move on to some visuals with Python to see the Central Limit Theorem in action.&lt;/p&gt;
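&lt;p&gt;The arithmetic of that check can be sketched with a one-sample z-test; the numbers below are made up purely for illustration:&lt;/p&gt;

```python
import math

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Two-sided z-test: how surprising is sample_mean if the true mean is mu0?"""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # two-sided tail probability of the standard normal via the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# hypothetical: 36 observations averaging 31.5 against a null mean of 30, known sigma of 4
z, p = one_sample_z_test(31.5, mu0=30, sigma=4.0, n=36)
print(round(z, 2), round(p, 4))  # a small p lets us cast the null hypothesis aside
```

&lt;p&gt;Thanks to the CLT, the mean of a sample of 30 or more is close to normally distributed even when the underlying data is not, which is what justifies treating z this way.&lt;/p&gt;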

&lt;h2&gt;
  
  
  Coding
&lt;/h2&gt;

&lt;p&gt;For this example I reached out and grabbed the train.csv from the Titanic Dataset on Kaggle: &lt;a href="https://www.kaggle.com/c/titanic/data" rel="noopener noreferrer"&gt;https://www.kaggle.com/c/titanic/data&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('../../Downloads/train.csv')
df.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So here we can see that we have quite a bit of data in our 'Age' column with 714 rows but that it does not match the 891 rows in 'PassengerId', so we will have to get rid of those pesky little 'nan's.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data=df.dropna()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's grab our mean and take a look at the distribution of our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pop_mean=data.mean()
print(pop_mean)
plt.hist(data,bins=100);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb0f5yhnnpu9yd0e2ow80.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb0f5yhnnpu9yd0e2ow80.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we know that we have a mean age of just under 30 years old, and we can see that our data is skewed to the right. So let's create a function to apply the Central Limit Theorem and see what happens when we take 100 samples of 30 people each from our dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def CLT(num_of_samples,sample_size):
    sample_means=[]
    for i in range(num_of_samples):
        sample=np.random.choice(data,size=sample_size,replace=True)
        sample_means.append(sample.mean())
    sns.distplot(sample_means, bins=80, kde=True)
    print(f'mean:{sum(sample_means)/num_of_samples}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLT(100,30)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvq1zmnianyiln7fb1ha2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvq1zmnianyiln7fb1ha2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wow, our mean is actually pretty close and our data is starting to take the shape of a normal distribution. But we have the power of the computer, so let's add another 9,900 samples and bump the sample size up to 50.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLT(10000,50)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkbfr4ahotgjpoyhki6xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkbfr4ahotgjpoyhki6xn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There it is, look at that distribution: almost a mirror image of a textbook normal distribution. We also got a mean of 29.68, only .01 away from the mean of our total population. Maybe these mathematicians actually know what they are talking about!&lt;/p&gt;
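&lt;p&gt;If you want to see the same convergence without downloading anything, here is a purely synthetic sketch using a heavily skewed exponential population (the scale of 10 and the sizes are arbitrary choices):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10.0, size=100_000)  # heavily right-skewed

# means of 10,000 samples of size 50, mirroring the CLT(10000, 50) call above
sample_means = rng.choice(population, size=(10_000, 50)).mean(axis=1)

# the mean of the sample means lands on the population mean,
# and their spread shrinks by roughly the square root of the sample size
print(round(population.mean(), 2), round(sample_means.mean(), 2), round(sample_means.std(), 2))
```

&lt;p&gt;Even though the population itself looks nothing like a Bell Curve, the distribution of the sample means does, which is the whole point of the theorem.&lt;/p&gt;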

</description>
    </item>
  </channel>
</rss>
