<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: hoganbyun</title>
    <description>The latest articles on DEV Community by hoganbyun (@hoganbyun).</description>
    <link>https://dev.to/hoganbyun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F536741%2F3823c6d7-8919-4a01-9053-b8275a7263cf.jpg</url>
      <title>DEV Community: hoganbyun</title>
      <link>https://dev.to/hoganbyun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hoganbyun"/>
    <language>en</language>
    <item>
      <title>Classification Models in Scikit-learn</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Fri, 13 Aug 2021 16:40:49 +0000</pubDate>
      <link>https://dev.to/hoganbyun/classification-models-in-scikit-learn-1i33</link>
      <guid>https://dev.to/hoganbyun/classification-models-in-scikit-learn-1i33</guid>
      <description>&lt;p&gt;This post will walk you through some of the different classification models available to use in scikit-learn. &lt;/p&gt;

&lt;p&gt;First, it's important to go over what a classification algorithm is and how it is used. A classification algorithm takes in a training set of data to build a model that predicts and classifies new data into pre-determined categories. For example, a phone company may take in customer data pertaining to sales, location, etc. to determine whether certain customers are likely to stick with the company for the next calendar year. &lt;/p&gt;

&lt;h2&gt;
  
  
  K-Nearest Neighbors
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it Does&lt;/strong&gt;&lt;br&gt;
K-Nearest Neighbors is a classification algorithm based on distances between points. Given a new point, KNN finds the &lt;em&gt;k&lt;/em&gt; nearest points in the training set, looks at their labels, and classifies the new point by the majority label among those neighbors. Look at the example below:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cx485nrf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n7yrjbje4yl8adtxfuqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cx485nrf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n7yrjbje4yl8adtxfuqd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the green point is our starting point and we see the surrounding blue and red points. If &lt;em&gt;k&lt;/em&gt; = 3, we see the three nearest points are two reds and one blue. Thus, the algorithm will classify the green point as a red triangle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import KNeighborsClassifier
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate KNeighborsClassifier
&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Fit the classifier
&lt;/span&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scaled_data_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Predict on the test set
&lt;/span&gt;&lt;span class="n"&gt;test_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scaled_data_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
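&lt;p&gt;The snippet above assumes data that has already been split and scaled (&lt;code&gt;scaled_data_train&lt;/code&gt;, etc.). As a self-contained sketch, using the bundled iris dataset purely for illustration, the full flow with scaling might look like this:&lt;/p&gt;

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features so no single feature dominates the distance metric
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
preds = knn.predict(X_test_scaled)
print(preds[:5])
```

&lt;p&gt;Scaling matters for KNN in particular: since the algorithm is purely distance-based, an unscaled feature with a large range would dominate the neighbor search.&lt;/p&gt;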



&lt;h2&gt;
  
  
  Decision Trees
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it Does&lt;/strong&gt;&lt;br&gt;
A decision tree, quite simply, takes a starting point and makes multiple decisions that branch out to ultimately make a classification. See the example below:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QoUhL_XP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ncq0yxpw97b77q7ekua5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QoUhL_XP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ncq0yxpw97b77q7ekua5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This particular example tries to determine what kind of contact lens a person should wear depending on different characteristics. These trees are built with a greedy search, which at each split chooses whichever test best separates the training data according to a chosen criterion (e.g. entropy/information gain or Gini impurity).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.tree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt; 
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'entropy'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_ohe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
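&lt;p&gt;&lt;code&gt;X_train_ohe&lt;/code&gt; above is assumed to be a one-hot encoded training set from earlier preprocessing. As a minimal self-contained sketch on the iris dataset, which also prints the learned split rules so you can see the greedy criterion at work:&lt;/p&gt;

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Keep the tree shallow so the printed rules stay readable
clf = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
clf.fit(X, y)

# Print the learned decision rules as indented text
print(export_text(clf))
```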



&lt;h2&gt;
  
  
  Random Forest
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it Does&lt;/strong&gt;&lt;br&gt;
A random forest classifier is, quite simply, an ensemble of decision trees. Usually, bootstrapping is involved: subsets of the training data are sampled with replacement, so each decision tree is trained on slightly different data. Each tree classifies a point, and the random forest takes the majority vote as its final prediction. While this is better than a single decision tree because it is less likely to overfit, it takes more memory and is more computationally expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# n_estimators is how many trees in the forest
&lt;/span&gt;&lt;span class="n"&gt;forest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;forest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
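&lt;p&gt;The overfitting trade-off can be checked directly by cross-validating a single tree against a forest on the same data (a sketch on synthetic data; exact scores will vary with the dataset):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Average accuracy over 5 folds for each model; the forest
# typically generalizes better than the single tree
print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```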



&lt;h2&gt;
  
  
  How to Evaluate a Model
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Testing Accuracy for Decision Tree Classifier: {:.4}%"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What These Mean&lt;/strong&gt;&lt;br&gt;
The code above will give you something like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mFxAUpxf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nmqlvsejf8h4inw8sk0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mFxAUpxf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nmqlvsejf8h4inw8sk0i.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's go over how to read this. At the top is a 2x2 confusion matrix. Reading left to right, top to bottom, the numbers represent true negatives, false positives, false negatives, and true positives. The other metrics measure the following: &lt;strong&gt;accuracy&lt;/strong&gt; is the proportion of all classifications that are correct, &lt;strong&gt;precision&lt;/strong&gt; is the proportion of predicted positives that are actually positive, &lt;strong&gt;recall&lt;/strong&gt; is the proportion of actual positives that were correctly classified, and &lt;strong&gt;f1-score&lt;/strong&gt; is the harmonic mean of precision and recall, balancing the two. All of these metrics can be used to judge how good your classification model is.&lt;/p&gt;
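&lt;p&gt;These definitions are easy to verify by hand on a tiny made-up example:&lt;/p&gt;

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 3 actual negatives, 5 actual positives
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 1]

# ravel() flattens the 2x2 matrix in tn, fp, fn, tp order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 2 1 1 4
print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 4/5 = 0.8
print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 4/5 = 0.8
```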

</description>
    </item>
    <item>
      <title>Continuous vs. Categorical: How to Treat These Variables in Multiple Linear Regression</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Fri, 06 Aug 2021 20:24:17 +0000</pubDate>
      <link>https://dev.to/hoganbyun/continuous-vs-categorical-how-to-treat-these-variables-in-multiple-linear-regression-1gh1</link>
      <guid>https://dev.to/hoganbyun/continuous-vs-categorical-how-to-treat-these-variables-in-multiple-linear-regression-1gh1</guid>
<description>&lt;p&gt;When attempting to make predictions using multiple linear regression, there are a few steps one must take before diving in; in particular, continuous and categorical variables each need to be prepped accordingly. In this blog post, I will show you some techniques to make your data valid and usable in multiple linear regression. &lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous Variables
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What They Are&lt;/strong&gt;&lt;br&gt;
Continuous variables measure quantities such as height or time that vary along a continuous scale and would not make sense to classify into discrete categories. One way to identify potential continuous variables is to look at a scatter plot of the data points. Usually, the data will be distributed in a cloud-like shape, unlike that of a categorical variable, which will be shown later. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X1hcabmV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9l0649e17dnrgtzh31yp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X1hcabmV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9l0649e17dnrgtzh31yp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to prep them&lt;/strong&gt;&lt;br&gt;
Continuous variables are a lot easier to deal with than categorical variables because adjustments are not always needed (besides the initial data cleaning). However, some transformations, such as standardization and log transformation, may improve the model. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Standardization&lt;/em&gt;&lt;br&gt;
For each data point, you subtract the column mean and divide by the column standard deviation, which rescales the column to mean 0 and standard deviation 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# function to standardize values
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;standardize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
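&lt;p&gt;Applied to a pandas Series, the result should have mean 0 and standard deviation 1; a quick check on some made-up heights:&lt;/p&gt;

```python
import pandas as pd

def standardize(col):
    return (col - col.mean()) / col.std()

heights = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])
z = standardize(heights)
print(round(z.mean(), 10))  # 0.0
print(round(z.std(), 10))   # 1.0
```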



&lt;p&gt;&lt;em&gt;Log Transform&lt;/em&gt;&lt;br&gt;
This method takes standardization one step further. Before standardizing, you first take the log of each value, which can make skewed data more normally distributed. Afterwards, standardize each data point as before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Code excerpt where 'cont_df' is assumed to be instantiated
&lt;/span&gt;&lt;span class="n"&gt;cont_log_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cont_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cont_log_std_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cont_log_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;standardize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Categorical Variables
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What They Are&lt;/strong&gt;&lt;br&gt;
Categorical variables, as the name suggests, represent things that can be divided into groups or categories. For example, color or grade level could be considered categorical. One way to identify potential categorical variables is to look at the scatter plot of the data points. Usually, the data will be distributed in rod-like shapes, unlike the clouds of continuous variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BUbXmxHy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wqm7429r4bnkx40efmdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BUbXmxHy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wqm7429r4bnkx40efmdl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Prep Them&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;One-Hot Encoding&lt;/em&gt;&lt;br&gt;
This is required when feeding categorical variables into a linear regression. The idea is to create dummy variables that each represent a group. For example, if you had a variable of cities in California, you would need one dummy for each unique city in that column. Then, if a data point is associated with that city, its dummy gets a 1; if not, a 0. One thing to watch out for is the dummy variable trap: because of how dummy variables are created, any one dummy can be "predicted" by combining all the others (perfect multicollinearity), which is a problem for multiple linear regression. To combat this, drop the first dummy column, which eliminates the perfect multicollinearity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'LA'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SAC'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;'LA'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SAC'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'LA'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# convert into series
&lt;/span&gt;&lt;span class="n"&gt;city_series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# convert into categories
&lt;/span&gt;&lt;span class="n"&gt;city_cat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;city_series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# get dummy variables
# remember to drop first column
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city_cat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop_first&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
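&lt;p&gt;Running the snippet above on the hypothetical city list, &lt;code&gt;drop_first=True&lt;/code&gt; drops the first category (LA, alphabetically), leaving one indicator column per remaining city:&lt;/p&gt;

```python
import pandas as pd

city = ['LA', 'SD', 'SAC', 'SD', 'LA', 'SAC', 'SD', 'LA']
city_cat = pd.Series(city).astype('category')

# 'LA' is dropped, so a row of all zeros means LA
dummies = pd.get_dummies(city_cat, drop_first=True)
print(list(dummies.columns))  # ['SAC', 'SD']
print(dummies.shape)          # (8, 2)
```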



&lt;p&gt;&lt;em&gt;Binning&lt;/em&gt;&lt;br&gt;
Binning is a technique to cut down on the number of categorical dummy variables. Ideally, dummies should be kept to a minimum, and if possible, you should have fewer dummies than continuous variables. The idea is to create new, broader categories based on a criterion you choose. An example would be converting months into seasons (which cuts 12 dummies down to 4).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is a change unique to months and seasons. This
# method forms bins with pandas.cut(), which cuts a list
# of consecutive numbers at the given points. In order for
# winter to be represented by Dec-Feb (12, 1, 2), we need
# a way for 12 to smoothly connect to 1, that is, convert
# all 12's into 0's.
&lt;/span&gt;&lt;span class="n"&gt;bin_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bin_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# The cuts are made left-exclusive and right-inclusive.
# Eg. The first bin does not include -1, but includes 2.
&lt;/span&gt;&lt;span class="n"&gt;month_bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Apply bin labels
&lt;/span&gt;&lt;span class="n"&gt;season_bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bin_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;month_bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Winter'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Spring'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Summer'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Fall'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
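&lt;p&gt;As a self-contained version of the snippet above (&lt;code&gt;bin_df&lt;/code&gt; here is a made-up month column):&lt;/p&gt;

```python
import pandas as pd

bin_df = pd.DataFrame({'month': [1, 3, 6, 9, 12, 7, 11]})

# Map December (12) to 0 so winter months (12, 1, 2) share one bin
bin_df.loc[bin_df['month'] == 12, 'month'] = 0

# Left-exclusive, right-inclusive cuts: (-1, 2], (2, 5], (5, 8], (8, 11]
month_bins = [-1, 2, 5, 8, 11]
seasons = pd.cut(bin_df['month'], month_bins,
                 labels=['Winter', 'Spring', 'Summer', 'Fall'])
print(list(seasons))
```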



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To summarize: continuous variables can be standardized or log transformed. These steps may help the model but are not required; in fact, if they do not improve the fit, using them is not recommended. Categorical variables, on the other hand, require some adjustment before you can run a multiple linear regression: one-hot encoding them into dummy variables, and optionally reducing the number of dummies by binning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thank You
&lt;/h2&gt;

&lt;p&gt;Hopefully this rundown on some of the common steps to take when preparing data for multiple linear regression has been helpful. Thank you for reading!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Popular Data Science Plots and When to Use Them</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sun, 01 Aug 2021 00:14:43 +0000</pubDate>
      <link>https://dev.to/hoganbyun/popular-data-science-plots-and-when-to-use-them-1dbo</link>
      <guid>https://dev.to/hoganbyun/popular-data-science-plots-and-when-to-use-them-1dbo</guid>
<description>&lt;p&gt;When working in Data Science, being able to investigate and answer questions is only half of your responsibilities. No matter how well you can manipulate data and code difficult techniques, your findings are no good if you cannot communicate them clearly. Along the way, you will probably rely on &lt;strong&gt;matplotlib&lt;/strong&gt; for a lot of your plotting. In this blog post, I will show you some of the most common types of plots and the situations to use them in, along with some do's and don'ts that will keep your plots easy to understand. &lt;/p&gt;

&lt;h2&gt;
  
  
  Line Plot
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When to Use One&lt;/em&gt;&lt;br&gt;
A line plot is probably the simplest plot out there. It plots points with x and y values on the chart and connects them with line segments. One situation where you might use a line plot is when visualizing time-series data, that is, displaying changes of some variable over time. Take a look at this example: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bg8j63ph--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jc3vxuvafqb9kz14uy1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bg8j63ph--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jc3vxuvafqb9kz14uy1a.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can clearly see that this plot is measuring the speed (mph) of some object, say a car, over time (sec). From the information that the plot conveys, we can see that the car accelerated early and eventually started to decelerate later on. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Line Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Week"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Pounds Lost"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Client Pounds Lost During Training"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example will yield the following graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KMsP05Xk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/f9nmz3yviyt439ii2s8e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KMsP05Xk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/f9nmz3yviyt439ii2s8e.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bar Plot
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When to Use One&lt;/em&gt;&lt;br&gt;
Bar plots, like line plots, may also be used to track changes over time. Yet, another use for bar plots is to visualize differences between groups. For example, here is a plot from my recent project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wtzxJ6GI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/go7dujgusek8f94ix9t9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wtzxJ6GI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/go7dujgusek8f94ix9t9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that the x-axis represents budget tiers in increments of $1.5 million. The height of each bar is the average ROI of all movies in that budget tier. In this case, a bar plot is especially useful because it clearly shows that the $6 million budget tier yields the highest ROI, on average. Bar plots are also useful when comparing metrics across groups that aren't defined by numbers. An example would be comparing the number of award-winning movies from each movie studio. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Bar Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'ATL'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'BOS'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'DAL'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'MEM'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SAC'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'WAS'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;free_agents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;free_agents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Team"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Free Agents Signed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Free Agents Signed in 2020"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example will yield the following graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---_XRU2b9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/shsolaumv9lyr93g1ly3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---_XRU2b9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/shsolaumv9lyr93g1ly3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Box (and Whisker) Plot
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When to Use One&lt;/em&gt;&lt;br&gt;
The Box and Whisker plot is an ideal choice when you want to convey a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), maximum). Here, the median is the middle value of a sorted sample. For example, in the list [1,2,3,4,5], the median would be 3. If there is an even number of values, the median is the average of the two middle values. The first and third quartiles are the 25th and 75th percentiles, respectively. The Interquartile Range (IQR) is calculated as &lt;em&gt;Q3 - Q1&lt;/em&gt;, while the minimum and maximum whisker bounds are calculated as &lt;em&gt;Q1 - 1.5*IQR&lt;/em&gt; and &lt;em&gt;Q3 + 1.5*IQR&lt;/em&gt;, respectively; points beyond these bounds are drawn as outliers. These plots are especially useful for displaying how skewed a sample is and for highlighting outliers. Refer to the following example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eF5eUXh1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/obst6cty0le3chmgd2m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eF5eUXh1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/obst6cty0le3chmgd2m9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The box encodes three values. The middle line is the median; here, it falls between 20 and 30. The right border of the box is Q3, while the left is Q1. The min and max are marked by the ends of the "whiskers" connected to the box. We also see one outlier: the 55-point game where the player shot extremely well. The code for this example is below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Box and Whisker Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Points"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Player A: Points Scored Per Game"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
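&lt;p&gt;The five-number summary behind a box plot can also be computed directly. The sketch below (not part of the original example) uses numpy on the same points data; note that numpy's default quartile interpolation may differ slightly from matplotlib's internals.&lt;/p&gt;

```python
# Compute the five-number summary that the box plot above visualizes.
import numpy as np

x = [22, 25, 15, 33, 31, 27, 18, 19, 22, 37, 55, 16, 24, 25, 26, 25]

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1                     # Interquartile Range: Q3 - Q1
lower_fence = q1 - 1.5 * iqr      # points below this count as outliers
upper_fence = q3 + 1.5 * iqr      # points above this count as outliers

outliers = [v for v in x if v < lower_fence or v > upper_fence]
print(median, iqr, outliers)      # the 55-point game is the lone outlier
```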



&lt;h2&gt;
  
  
  Scatter Plot
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When to Use One&lt;/em&gt;&lt;br&gt;
A scatter plot is used when you have numerical data that comes in pairs (e.g., Age vs. Running Speed). A scatter plot places each data point on an x-y plane, giving the viewer a good picture of how the data is distributed. Scatter plots are particularly useful when trying to discern whether two variables are related. Take a look at this example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BMwzsTu---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gxzfjke9tkxt13dy6mop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BMwzsTu---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gxzfjke9tkxt13dy6mop.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, the scatter plot clearly shows that as age increases, max speed tends to decrease. Each point represents a different person who was timed. The code for this is shown below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Scatter Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;66&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;62&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;max_speed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_speed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Max Speed (mph)"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age vs. Max Speed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  BONUS: Regression Plot
&lt;/h2&gt;

&lt;p&gt;Lastly, the regression plot is an extension of the scatter plot. It takes the data points and calculates the line that best "fits" the sample, displaying a line cutting through the data that indicates the approximate slope, or "trend," of the sample. Regression lines are also associated with an r-value (between -1 and 1) that indicates how correlated two variables are: the closer the r-value is to -1 or 1, the stronger the correlation. For example,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DQ3Dvpzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/uekapnzpqsq3ovwe4hhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DQ3Dvpzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/uekapnzpqsq3ovwe4hhc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we used the same scatter plot as an example. You can see that there is now a line crossing through the data. This line gives us a good estimate of what speed to expect at a certain age. For example, judging from the line, we can approximate that a 40-year-old will reach a max speed of just under 15 mph. The code is written below. In this case, we used &lt;strong&gt;Seaborn&lt;/strong&gt; (a library built on top of &lt;strong&gt;matplotlib&lt;/strong&gt;) for its regression plot functionality.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Regression Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;66&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;62&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;max_speed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_speed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Max Speed (mph)"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age vs. Max Speed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
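&lt;p&gt;Seaborn's &lt;code&gt;regplot&lt;/code&gt; draws the fitted line but does not report the r-value itself. As a sketch (not part of the original post), you can compute r for the same age and max-speed data with numpy:&lt;/p&gt;

```python
# Pearson correlation coefficient for the age vs. max speed data above.
import numpy as np

age = [18,20,20,24,25,26,29,33,31,32,36,44,44,46,48,55,57,63,64,67,66,62]
max_speed = [19,18,22,16,19,21,17,16,19,16,14,16,13,13,11,12,9,10,8,7,7,8]

r = np.corrcoef(age, max_speed)[0, 1]   # off-diagonal entry of the 2x2 matrix
print(round(r, 2))                      # strongly negative: speed falls with age
```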



&lt;h2&gt;
  
  
  Tips When Plotting
&lt;/h2&gt;

&lt;p&gt;Now that we have covered some commonly used plots in data science, let's go over a few tips to keep in mind. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid making your visualizations too "busy"; instead, highlight exactly the information you want the audience to see. In the case below, I've colored bars green, blue, or red depending on what information I want to convey, as opposed to showing the graph with every bar the same color.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kkJOacNk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3obtmffw137lf2s5i5pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kkJOacNk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3obtmffw137lf2s5i5pm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid pie charts, as they are often hard to read when the slices are close in size. In these cases, bar charts are preferable. Below, the same data is shown in pie and bar format; note how much easier it is to tell which categories are larger and smaller in the bar graph.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dsYWQ0Jo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m08pjzrk7jms8tthz7or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dsYWQ0Jo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m08pjzrk7jms8tthz7or.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rupisG8W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ririrk7yoiayv5k6j84q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rupisG8W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ririrk7yoiayv5k6j84q.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure that your graphs are scaled properly, avoiding unnecessary white space where possible. Take a look at the two examples below and the difference that proper axis scaling makes. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BlkluPCR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x937hg03qe234pld40ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BlkluPCR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x937hg03qe234pld40ih.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x-PzeEiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/exoysh0yu3fejgao34ue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x-PzeEiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/exoysh0yu3fejgao34ue.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Now that you have had a rundown of some of the most commonly used plots in data science, along with some tips to make your graphs more digestible, you are ready to go out and turn your data into effective charts that show your findings!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Commands in SQL</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sun, 25 Jul 2021 02:03:53 +0000</pubDate>
      <link>https://dev.to/hoganbyun/commands-in-sql-570k</link>
      <guid>https://dev.to/hoganbyun/commands-in-sql-570k</guid>
      <description>&lt;p&gt;If you are planning to one day work with and manage data, chances are, you will eventually have to work with SQL (Structured Query Language). SQL is a language used to communicate with databases, most often used to update data or retrieve specific parts or groups in the data. If any task involves manipulating or creating a database, SQL will work well.&lt;/p&gt;

&lt;p&gt;In a previous post, I went over common SQL clauses, most specifically, those used to pull data from databases. In this post, I will go over some commands that are more relevant when creating or modifying a database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Commands
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CREATE DATABASE&lt;/strong&gt; - makes a new database&lt;br&gt;
&lt;strong&gt;CREATE TABLE&lt;/strong&gt; - makes a new table; databases are, in essence, collections of tables&lt;br&gt;
&lt;strong&gt;UPDATE&lt;/strong&gt; - modifies existing data&lt;br&gt;
&lt;strong&gt;INSERT INTO&lt;/strong&gt; - inserts new data&lt;br&gt;
&lt;strong&gt;DELETE&lt;/strong&gt; - deletes selected rows&lt;br&gt;
&lt;strong&gt;DROP&lt;/strong&gt; - deletes entire tables or databases&lt;br&gt;
&lt;strong&gt;SELECT&lt;/strong&gt; - indicates what you want to pull from the data (also covered in a previous post)&lt;/p&gt;

&lt;p&gt;Here are some important distinctions to note. Databases and tables are both collections of data, but a database is a collection of tables. DROP and DELETE differ in that DELETE removes specific rows, while DROP removes whole tables or databases. SELECT lets you pull specific features of the data but, unlike the other commands listed, does not modify it.&lt;/p&gt;
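&lt;p&gt;If you want to try these commands end to end, below is a quick sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module; the table and column names are invented for illustration.&lt;/p&gt;

```python
# Walk through CREATE TABLE, INSERT INTO, UPDATE, DELETE, and DROP
# on a throwaway in-memory SQLite database (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, orders INTEGER)")
cur.execute("INSERT INTO customers (name, orders) VALUES ('Ann', 3), ('Ben', 7)")
cur.execute("UPDATE customers SET orders = 8 WHERE name = 'Ben'")   # modify a row
cur.execute("DELETE FROM customers WHERE name = 'Ann'")             # remove one row

rows = cur.execute("SELECT name, orders FROM customers").fetchall()
print(rows)                           # [('Ben', 8)]

cur.execute("DROP TABLE customers")   # remove the whole table
conn.close()
```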

</description>
    </item>
    <item>
      <title>Convolutional Neural Networks for Image Classification</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sat, 17 Jul 2021 22:36:27 +0000</pubDate>
      <link>https://dev.to/hoganbyun/convolutional-neural-networks-for-image-classification-48hd</link>
      <guid>https://dev.to/hoganbyun/convolutional-neural-networks-for-image-classification-48hd</guid>
<description>&lt;p&gt;Convolutional Neural Networks (CNNs) are a type of neural network commonly used for image classification tasks because, unlike fully connected neural networks, they can efficiently handle the large number of pixels found in images. &lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes a Neural Network Convolutional?
&lt;/h2&gt;

&lt;p&gt;Convolutional networks are much better at identifying spatial patterns. The key difference is the convolution operation, which slides a filter (usually 3x3 or 5x5) across the original image, multiplying the filter against each 3x3 (or 5x5) block of pixels and summing the result. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--51BbFGJa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l6nn3zghghrapgkvc5xv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--51BbFGJa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l6nn3zghghrapgkvc5xv.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above image, we see a filter is applied to each possible block to create a new matrix. &lt;/p&gt;
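&lt;p&gt;The sliding-window arithmetic can be sketched in a few lines of numpy. This toy example is not from the original post, and real CNN filters are learned during training rather than fixed:&lt;/p&gt;

```python
# Slide a 3x3 filter over a 5x5 image: multiply elementwise and sum at each stop.
import numpy as np

image = np.arange(25).reshape(5, 5)   # toy 5x5 "image"
kernel = np.ones((3, 3))              # toy 3x3 filter

out = np.zeros((3, 3))                # output side: 5 - 3 + 1 = 3
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out.shape)                      # (3, 3) -- smaller than the input
```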

&lt;h3&gt;
  
  
  Using Padding to Maintain Dimensions
&lt;/h3&gt;

&lt;p&gt;One thing to note is that applying the filter produces a smaller image. In the above example, you can see that a 5x5 image with a 3x3 filter results in a 3x3 output. In addition, pixels along the edges of the original image are covered by the filter fewer times than those close to the center. &lt;/p&gt;

&lt;p&gt;One way to solve both of these issues is padding: adding a border of extra, zero-valued pixels around the edges of the image. This returns an output of the same size as the original image and allows the edge pixels to be covered by the filter more often. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FBlyt1eK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ctepmcxqowgkskankjqe.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FBlyt1eK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ctepmcxqowgkskankjqe.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Stride
&lt;/h3&gt;

&lt;p&gt;Striding also affects the output image. The stride controls how many pixels the filter moves at each step, which affects the output size (a larger stride yields a smaller output). For example, both examples above use a stride of 1, while the example below uses a stride of 2.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mCKPCwT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4m9q0jcudfm0esqgqkth.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mCKPCwT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4m9q0jcudfm0esqgqkth.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pooling layer
&lt;/h3&gt;

&lt;p&gt;Everything above happens before the fully connected layers. The pooling layer comes last, downsizing the convolutional layers' output while "pooling" together the patterns those layers found. Max pooling, which keeps the largest value in each window, is the most common choice. &lt;/p&gt;
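&lt;p&gt;As a sketch (not from the original post), 2x2 max pooling keeps only the largest value in each 2x2 block of a feature map, halving each dimension:&lt;/p&gt;

```python
# Reduce each 2x2 block of a 4x4 feature map to its maximum value.
import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [0, 2, 8, 5],
                        [1, 1, 3, 4]])

# Reshape into 2x2 blocks, then take the max within each block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled.shape)                   # (2, 2) -- half of each dimension
```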

&lt;h2&gt;
  
  
  When to Bring in Fully Connected Layers
&lt;/h2&gt;

&lt;p&gt;After pooling, the remaining steps are essentially the same as building a regular fully connected neural network. Think of the convolutional layers as preliminary prep before the fully connected layers are applied.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Common SQL Clauses</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sat, 10 Jul 2021 21:06:59 +0000</pubDate>
      <link>https://dev.to/hoganbyun/common-sql-clauses-5a3e</link>
      <guid>https://dev.to/hoganbyun/common-sql-clauses-5a3e</guid>
<description>&lt;p&gt;When attempting to access and manipulate a database, SQL (Structured Query Language) is one of the top language options. SQL code revolves around queries: requests or commands directed at some part of the database. Once you get the hang of the query syntax, it can be very intuitive to use. Here are some of the more common SQL clauses. &lt;/p&gt;

&lt;h3&gt;
  
  
  SELECT and FROM
&lt;/h3&gt;

&lt;p&gt;Simply put, SELECT indicates what exactly you want to pull from the database. FROM points to the specific database that you are pulling from. For example, if you wanted to retrieve a customer ID from the "customers" table, it may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Something to keep in mind is the asterisk (*). This character represents all columns in the data. In this case, if you wanted everything, and not just the customer ID, you would replace "customer_id" with "*" in the above example.&lt;/p&gt;

&lt;h3&gt;
  
  
  WHERE
&lt;/h3&gt;

&lt;p&gt;WHERE is a clause that lets the user filter the data on a specific condition. One thing to note is that WHERE can only be used on ungrouped, unaggregated data. In the next example, let's say we want the customer IDs of all customers who ordered more than 30 units.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_ID&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Order_Quantity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GROUP BY
&lt;/h3&gt;

&lt;p&gt;GROUP BY does exactly what its name indicates: it groups rows that share a value in a certain column. It is almost always paired with an aggregate function such as COUNT or SUM. For example, a user who wants to know how many customers live in each state would use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  HAVING
&lt;/h3&gt;

&lt;p&gt;HAVING works similarly to WHERE, the difference being that HAVING is used on aggregated data (most commonly after a GROUP BY). Below is an example of when HAVING would be used. Here, the user is searching for all states with more than 250 total orders. To do this, we GROUP BY State and sum the orders across every customer in that state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
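&lt;p&gt;Here is a runnable sketch of that HAVING query using Python's built-in sqlite3 module (the table contents are invented):&lt;/p&gt;

```python
import sqlite3

# Invented Customer rows: one (State, Orders) pair per customer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (State TEXT, Orders INTEGER)")
conn.executemany("INSERT INTO Customer VALUES (?, ?)",
                 [("NY", 200), ("NY", 120), ("CA", 90), ("CA", 100)])

# States whose customers placed more than 250 orders in total.
rows = conn.execute("""
    SELECT State, SUM(Orders) AS total
    FROM Customer
    GROUP BY State
    HAVING total > 250
""").fetchall()
print(rows)  # [('NY', 320)]
```

&lt;p&gt;SQLite accepts the alias total inside HAVING; some other databases make you repeat SUM(Orders) there instead.&lt;/p&gt;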



&lt;h3&gt;
  
  
  ORDER BY
&lt;/h3&gt;

&lt;p&gt;ORDER BY sorts the data on a given column. One thing to note is that the default sorting is in ascending order, but the optional ASC/DESC keywords let the user control the direction. Here's an example of getting customer IDs after sorting customer last names in descending order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above clauses are some of the most commonly used ones and are crucial to know when using SQL. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Semi-Supervised Learning</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sat, 03 Jul 2021 17:11:27 +0000</pubDate>
      <link>https://dev.to/hoganbyun/semi-supervised-learning-2k5e</link>
      <guid>https://dev.to/hoganbyun/semi-supervised-learning-2k5e</guid>
      <description>&lt;p&gt;In machine learning, you will sometimes run into situations where semi-supervised learning is necessary. For example, say you want to use supervised learning to run a classification model on your data, but you have no labels. For the model to be built, you need labeled data to properly train it. Semi-supervised learning provides a pathway to do that, even when you start out with unlabeled data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Semi-Supervised Learning (vs. Supervised/Unsupervised)
&lt;/h2&gt;

&lt;p&gt;The main difference between supervised and unsupervised learning is whether we know the output labels. Supervised learning, as the name suggests, needs labels in order to train a model, whereas unsupervised learning does not. For example, a classification model requires supervised learning because, before training, each data point must indicate which class it belongs to. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--imSkK6ii--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6uxjex42pwv5lp78fv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--imSkK6ii--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6uxjex42pwv5lp78fv8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Here, we need each data point to tell us whether it belongs to red (disease) or blue (healthy) in order for us to accurately produce a model to separate the two groups. &lt;/p&gt;

&lt;p&gt;For unsupervised learning, we can use the example of clustering.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kWFLdoVs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lo5m13zsowi8c7uwyprd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kWFLdoVs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lo5m13zsowi8c7uwyprd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike the previous example, we don't know which group each data point belongs to before we run the model. For example, in k-means clustering, the model will identify centroids (centers for each group) and assign other data points to the closest centroid. &lt;/p&gt;

&lt;p&gt;Semi-supervised learning can entail different methods. For example, I recently created a model to classify NBA players into specific playstyles. Obviously, a player's playstyle isn't objective; determining it is a matter of personal opinion, as is deciding how many playstyles even exist. &lt;/p&gt;

&lt;p&gt;One method that I used was to predetermine which playstyles "existed" in this model and, for a small subset of the players, I would determine a playstyle for them. What this gave me was labeled data that I could use in supervised learning. Once that model was created, I ran the rest of the unlabeled data through it to give me playstyle predictions for everyone else. Then, I used supervised learning again on the fully-labeled data. &lt;/p&gt;
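&lt;p&gt;That first workflow (hand-label a few points, train, predict the rest, retrain) can be sketched with a toy nearest-mean classifier; the numbers and playstyle labels below are invented:&lt;/p&gt;

```python
# Toy self-training: a 1-D nearest-mean classifier (invented data and labels).
labeled = {1.0: "guard", 2.0: "guard", 9.0: "center", 10.0: "center"}
unlabeled = [1.5, 8.5, 9.5]

def class_means(points_by_label):
    """Mean position of each class: the 'training' step of this toy model."""
    means = {}
    for label in set(points_by_label.values()):
        pts = [x for x, lab in points_by_label.items() if lab == label]
        means[label] = sum(pts) / len(pts)
    return means

def predict(x, means):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))

# Step 1: "train" on the hand-labeled subset.
means = class_means(labeled)
# Step 2: predict labels for everyone else.
pseudo = {x: predict(x, means) for x in unlabeled}
# Step 3: retrain on the now fully-labeled data.
means = class_means({**labeled, **pseudo})
print(pseudo)  # {1.5: 'guard', 8.5: 'center', 9.5: 'center'}
```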

&lt;p&gt;Another method, which took less manual work, was to first use unsupervised learning to separate players into clusters. Then, using the average stats of each cluster, I could get a sense of what type of playstyle each cluster represented. Using the labels from the clusters, I was able to use supervised learning on the newly-labeled data.&lt;/p&gt;
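&lt;p&gt;The second, cluster-then-label workflow might look like this minimal 1-D k-means sketch (invented per-game scoring numbers; the playstyle names are just placeholders):&lt;/p&gt;

```python
# Toy cluster-then-label: 1-D k-means with k=2 (invented points-per-game stats).
data = [5.0, 6.0, 7.0, 24.0, 25.0, 26.0]
centroids = [5.0, 24.0]  # deterministic initialisation for the sketch

for _ in range(10):
    clusters = [[], []]
    for x in data:
        # Assign each point to its nearest centroid...
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # ...then move each centroid to the mean of its assigned points.
    centroids = [sum(c) / len(c) for c in clusters]

# Inspect each cluster's average to decide what label it deserves.
labels = ["scorer" if c > 15 else "role player" for c in centroids]
print(centroids, labels)  # [6.0, 25.0] ['role player', 'scorer']
```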

&lt;p&gt;So when you are faced with unlabeled data, but want to do a supervised learning task, don't be discouraged as there are available methods to work around it, such as semi-supervised learning.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SQL: WHERE vs. HAVING</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Thu, 24 Jun 2021 23:15:13 +0000</pubDate>
      <link>https://dev.to/hoganbyun/sql-where-vs-having-4ohp</link>
      <guid>https://dev.to/hoganbyun/sql-where-vs-having-4ohp</guid>
      <description>&lt;p&gt;In SQL, the &lt;strong&gt;HAVING&lt;/strong&gt; and &lt;strong&gt;WHERE&lt;/strong&gt; clauses have a similar function with a key difference. Both clauses allow a user to filter data with respect to a certain condition. The difference between the two has to do with when each is used. Basically, &lt;strong&gt;WHERE&lt;/strong&gt; can only be used on non-aggregated data (i.e. data that has not been aggregated by &lt;strong&gt;GROUP BY&lt;/strong&gt;). On the other hand, &lt;strong&gt;HAVING&lt;/strong&gt; is used after a &lt;strong&gt;GROUP BY&lt;/strong&gt;. Let's walk through some examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  WHERE
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, &lt;strong&gt;WHERE&lt;/strong&gt; is used on data before any aggregation is done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;Employee_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Employee_ID&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Employee&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Employee_Age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above selects the names and ID's of all employees over the age of 30. Note that the &lt;strong&gt;WHERE&lt;/strong&gt; clause is used on non-aggregated data.&lt;/p&gt;

&lt;h2&gt;
  
  
  HAVING (and GROUP BY)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;HAVING&lt;/strong&gt; is often used following a &lt;strong&gt;GROUP BY&lt;/strong&gt;, which aggregates data by a certain feature. In the following example, let's say you have game-by-game data for each player for the first five games of the season and, as a coach, you want to see which players on your team made more than 10 3-pointers over that stretch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;Player_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;PM_Made&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Team&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;Player_Name&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we want each player's name and total number of 3-pointers made. The &lt;strong&gt;GROUP BY&lt;/strong&gt; clause is essential here because if we had used a simple &lt;strong&gt;WHERE&lt;/strong&gt; with no &lt;strong&gt;GROUP BY&lt;/strong&gt; or &lt;strong&gt;HAVING&lt;/strong&gt;, we would get 5 separate numbers for each player, representing how many 3-pointers that player made in each game. We want the total each player made, hence the &lt;strong&gt;GROUP BY&lt;/strong&gt;. The &lt;strong&gt;HAVING&lt;/strong&gt; then acts as a post-aggregation filter to get the desired range of total 3-pointers.&lt;/p&gt;
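&lt;p&gt;To see the difference concretely, here is a runnable version of the idea using Python's built-in sqlite3 module; the schema and stat lines are invented (the column is named Threes_Made here, since an identifier can't start with a digit without quoting):&lt;/p&gt;

```python
import sqlite3

# Invented game-by-game log: one row per player per game.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Team (Player_Name TEXT, Threes_Made INTEGER)")
conn.executemany("INSERT INTO Team VALUES (?, ?)", [
    ("Kim", 4), ("Kim", 3), ("Kim", 5),
    ("Lee", 1), ("Lee", 2), ("Lee", 0),
])

# WHERE here would filter single games; HAVING filters each player's total.
rows = conn.execute("""
    SELECT Player_Name, SUM(Threes_Made) AS total
    FROM Team
    GROUP BY Player_Name
    HAVING total > 10
""").fetchall()
print(rows)  # [('Kim', 12)]
```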

&lt;p&gt;While there can always be exceptions, a good rule of thumb is to treat &lt;strong&gt;WHERE&lt;/strong&gt; as the clause used on non-aggregated data, while treating &lt;strong&gt;HAVING&lt;/strong&gt; as the clause used on aggregated data (often in conjunction with &lt;strong&gt;GROUP BY&lt;/strong&gt;).&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Supervised Vs. Unsupervised Learning</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Tue, 15 Jun 2021 01:25:06 +0000</pubDate>
      <link>https://dev.to/hoganbyun/supervised-vs-unsupervised-learning-5efl</link>
      <guid>https://dev.to/hoganbyun/supervised-vs-unsupervised-learning-5efl</guid>
      <description>&lt;p&gt;When creating machine learning models, there are typically two paths to choose from: &lt;strong&gt;supervised&lt;/strong&gt; and &lt;strong&gt;unsupervised learning&lt;/strong&gt;. Simply, the difference between these two methods is whether we know the output labels. &lt;/p&gt;

&lt;p&gt;For example, let's say that we want to build a model that can identify pneumonia from chest x-rays. In this case, for each photo we feed into the model, we know beforehand whether the x-ray is of a pneumonia-positive person. Because we know the output labels of each input beforehand, we would use &lt;strong&gt;supervised learning&lt;/strong&gt;, which aims to measure a relationship between inputs and known outputs.&lt;/p&gt;

&lt;p&gt;Now, in a different example, let's say that we have data (average speed, total accidents, total tickets, etc.) on many drivers and we want to put these drivers into groups where they are most similar to each other. Here, we don't have initial output labels (e.g. good driver, bad driver) and have to infer what kinds of groups exist after they are made. In this case, we would use &lt;strong&gt;unsupervised learning&lt;/strong&gt;.&lt;br&gt;
Let's go into more detail for each approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supervised Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Supervised learning&lt;/strong&gt; uses labeled data to train a model to classify inputs or predict outcomes more accurately. Because we are feeding labeled data into the model, we are able to test and improve the model validity by verifying how accurate a model is over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supervised learning&lt;/strong&gt; is usually applied to two types of problems: &lt;strong&gt;classification&lt;/strong&gt; and &lt;strong&gt;regression&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NngRF5bh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyd2ukwmr2sl9nbs3wct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NngRF5bh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyd2ukwmr2sl9nbs3wct.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Classification
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;classification&lt;/strong&gt; problem entails separating the data into pre-determined groups. For example, one could classify whether an animal is a cat or a dog based on size, weight, etc. Some common classification algorithms include support vector machines, random forest, and gradient boost.&lt;/p&gt;

&lt;h4&gt;
  
  
  Regression
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;regression&lt;/strong&gt; problem aims to identify the relationship between independent and dependent variables. For example, a project that predicts a store's ice cream sales from the number of flavors, hours open, etc. would use regression, which could be linear, nonlinear, or logistic, to name a few.&lt;/p&gt;
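&lt;p&gt;As a minimal illustration (with made-up numbers), a one-feature linear regression can even be fit in closed form:&lt;/p&gt;

```python
# Closed-form least squares for y = a*x + b (made-up data: x = number of
# flavors on offer, y = ice cream sales that day).
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]  # exactly y = 2x + 1, so the fit should recover a=2, b=1

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
# Slope: covariance of x and y divided by variance of x.
a = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
b = mean_y - a * mean_x
print(a, b)  # 2.0 1.0
```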

&lt;h2&gt;
  
  
  Unsupervised Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Unsupervised learning&lt;/strong&gt; finds groups or patterns using unlabeled data. Because there aren't any labels, there is not a specific way to verify model validity like with &lt;strong&gt;supervised learning&lt;/strong&gt;. Common problems include &lt;strong&gt;clustering&lt;/strong&gt; and &lt;strong&gt;dimensionality reduction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6D96-drN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0cnrksbzm5nyzkdif356.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6D96-drN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0cnrksbzm5nyzkdif356.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Clustering
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;clustering&lt;/strong&gt; problem aims to separate the data into distinct groups by identifying patterns or similarities between data points. For example, an online retail store may want to separate its customers into different demographics. A common clustering algorithm is &lt;strong&gt;k-means clustering&lt;/strong&gt;, which splits the data into &lt;em&gt;k&lt;/em&gt; groups by assigning each point to its nearest cluster centroid and recomputing each centroid as the mean of its assigned points. &lt;/p&gt;
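&lt;p&gt;A bare-bones version of the k-means loop, on invented two-feature customer data (age, spend) with a fixed starting centroid pair, might look like:&lt;/p&gt;

```python
# Minimal k-means on 2-D points (hypothetical customer features: age, spend).
points = [(25, 10), (27, 12), (60, 80), (62, 85)]
centroids = [(25, 10), (60, 80)]  # fixed start so the sketch is deterministic

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

for _ in range(5):
    groups = [[], []]
    for p in points:
        # Assign each point to its nearest centroid...
        groups[min((0, 1), key=lambda i: dist2(p, centroids[i]))].append(p)
    # ...then recompute each centroid as the mean of its group.
    centroids = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]

print(centroids)  # [(26.0, 11.0), (61.0, 82.5)]
```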

&lt;h4&gt;
  
  
  Dimensionality Reduction
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Dimensionality reduction&lt;/strong&gt; is used when there are too many features in a given dataset. It will reduce the number of features in a dataset while keeping its integrity and is done before the modeling stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which One to Choose?
&lt;/h2&gt;

&lt;p&gt;Deciding whether to use supervised or unsupervised learning comes down to a few factors: whether your data is labeled, and what kind of modeling you are trying to accomplish.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Tuning Neural Networks</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Tue, 08 Jun 2021 23:47:27 +0000</pubDate>
      <link>https://dev.to/hoganbyun/tuning-neural-networks-33l</link>
      <guid>https://dev.to/hoganbyun/tuning-neural-networks-33l</guid>
      <description>&lt;p&gt;When modeling a neural network, you most likely won’t get satisfactory results immediately. Whether the model is underfitting or overfitting, there are always small tuning changes that can improve upon the initial model. For overfit models, the main techniques are regularization, normalization, and optimization. &lt;/p&gt;

&lt;h3&gt;
  
  
  Dealing with Overfitting
&lt;/h3&gt;

&lt;p&gt;Here is an example of what an overfit model would look like:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ncOpG-i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xqel864mo44hlw0h87j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ncOpG-i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xqel864mo44hlw0h87j0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see that as the training accuracy increases, at a certain point the validation accuracy stagnates. This means that the model is getting so good at recognizing purely the training data that it fails to recognize general patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regularization&lt;/strong&gt;&lt;br&gt;
Regularization is often used when the initial model is overfit. In general, you have three types to choose from: l1, l2, and dropout.&lt;/p&gt;

&lt;p&gt;L1 and l2 regularization penalize weight matrices that grow too large by adding a penalty term to the loss used during the back propagation phase. An example of it being used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'relu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;kernel_regularizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;regularizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.005&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
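&lt;p&gt;Under the hood, the l2 term simply adds the squared weights, scaled by the chosen factor, onto the loss (l1 uses absolute values instead). A plain-Python illustration with made-up weights:&lt;/p&gt;

```python
# What regularizers.l2(0.005) contributes to the loss: 0.005 * sum of squared
# weights. The weight values here are made up, just to show the arithmetic.
weights = [0.5, -1.0, 2.0]
factor = 0.005
l2_penalty = factor * sum(w ** 2 for w in weights)       # 0.005 * 5.25
l1_penalty = factor * sum(abs(w) for w in weights)       # the l1 variant
print(l2_penalty, l1_penalty)  # 0.02625 0.0175
```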



&lt;p&gt;Dropout, on the other hand, randomly sets nodes in the network to 0 at a given rate. This is also an effective countermeasure to overfitting. The number within the Dropout function represents the rate at which nodes will be dropped. An example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
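&lt;p&gt;Conceptually, what that layer does at training time can be sketched in plain Python (made-up activation values; real frameworks also rescale the surviving activations so the expected sum is unchanged):&lt;/p&gt;

```python
import random

# What Dropout(.2) does conceptually at training time: each activation is
# zeroed with probability 0.2.
random.seed(1)  # seeded so the sketch is reproducible
activations = [0.3, 0.7, 1.2, 0.9, 0.5]
rate = 0.2
dropped = [a if random.random() >= rate else 0.0 for a in activations]
print(dropped)
```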



&lt;p&gt;&lt;strong&gt;Normalization&lt;/strong&gt;&lt;br&gt;
Another countermeasure to overfitting is to normalize the input data. The easiest approach is to scale the data to lie between 0 and 1, which can cut down training time and stabilize convergence. You can also normalize within layers, such as with a random normal weight initializer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'relu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;kernel_initializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;initializers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomNormal&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
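&lt;p&gt;The 0-to-1 scaling mentioned above is just min-max normalization, which can be sketched as (toy values):&lt;/p&gt;

```python
# Min-max scaling: squeeze raw inputs into [0, 1] before training.
raw = [10.0, 20.0, 50.0, 30.0]  # made-up feature values
lo, hi = min(raw), max(raw)
scaled = [(x - lo) / (hi - lo) for x in raw]
print(scaled)  # [0.0, 0.25, 1.0, 0.5]
```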



&lt;p&gt;&lt;strong&gt;Optimization&lt;/strong&gt;&lt;br&gt;
Lastly, you could try different optimization functions. The three most used are probably Adam, SGD, and RMSprop. &lt;br&gt;
Adam (“Adaptive Moment Estimation”) is a popular default and tends to work well out of the box. &lt;/p&gt;

&lt;h3&gt;
  
  
  Dealing with Underfitting
&lt;/h3&gt;

&lt;p&gt;Underfit models would look like the opposite of the above graph, where training accuracy/loss fails to improve. There are a few ways to deal with this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add Complexity&lt;/strong&gt;&lt;br&gt;
A likely reason that a model is underfit is that it is not complex enough; that is, it isn't able to identify abstract patterns. A way to fix this is to add complexity to the model by 1) adding more layers or 2) increasing the number of neurons per layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Time&lt;/strong&gt;&lt;br&gt;
Another reason that a model may be underfit is the training time. By giving a model more time and iterations to train, you give it more chances to converge to a more ideal solution. &lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;To summarize, overfit models require regularization, normalization, and optimization while underfit models require more complexity and training time. Neural networks are all about making small, incremental changes until you reach a good balance. These tips will ensure that you are moving in the correct direction when you inevitably find the need to tune a model.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
