<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dylan Lisk</title>
    <description>The latest articles on DEV Community by Dylan Lisk (@dlisk92).</description>
    <link>https://dev.to/dlisk92</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F284336%2F744d7238-604e-45ed-9b00-46c008ac43bd.png</url>
      <title>DEV Community: Dylan Lisk</title>
      <link>https://dev.to/dlisk92</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dlisk92"/>
    <language>en</language>
    <item>
      <title>Reinforcement learning</title>
      <dc:creator>Dylan Lisk</dc:creator>
      <pubDate>Sun, 21 Jun 2020 04:26:38 +0000</pubDate>
      <link>https://dev.to/dlisk92/reinforcement-learning-dnj</link>
      <guid>https://dev.to/dlisk92/reinforcement-learning-dnj</guid>
      <description>&lt;h1&gt;
  
  
  What is it?
&lt;/h1&gt;

&lt;p&gt;Like a lot of the ideas in machine learning, it is an abstraction of real life. How do we learn in real life? By doing and seeing the results. Put your hand in a fire and you won't do it again. You just learned something: fire is hot.&lt;/p&gt;

&lt;p&gt;That lesson is taught by negative reinforcement: having your hand burnt is negative because it doesn't feel good. Alternatively, you can reward positive actions, like a baby trying ice cream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting the scene
&lt;/h2&gt;

&lt;p&gt;The first step in reinforcement learning is establishing an environment. This is distinct from a programming environment. It can be something as simple as an empty array that represents a game board, or you could have a robot learning in the real world.&lt;/p&gt;
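&lt;p&gt;To make that concrete, here is a minimal sketch (not from any library; the LineWorld class and its reset/step methods are illustrative names) of a toy environment: an agent on a line of five cells gets a reward only when it reaches the rightmost cell.&lt;/p&gt;

```python
import random

class LineWorld:
    """A tiny environment: an agent on a line of 5 cells tries to reach the rightmost cell."""
    def __init__(self, size=5):
        self.size = size
        self.reset()

    def reset(self):
        # Start back at the leftmost cell and return the initial state.
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (move left) or +1 (move right); position is kept on the board.
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = (self.pos == self.size - 1)
        reward = 1 if done else 0  # positive reinforcement only at the goal
        return self.pos, reward, done

# A random agent wandering until it stumbles onto the goal.
env = LineWorld()
state = env.reset()
done = False
steps = 0
while not done:
    state, reward, done = env.step(random.choice([-1, 1]))
    steps += 1
```

A learning algorithm would replace the random choice with a policy that is updated from the rewards it observes.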

</description>
    </item>
    <item>
      <title>Making EDA easy</title>
      <dc:creator>Dylan Lisk</dc:creator>
      <pubDate>Sat, 13 Jun 2020 18:57:01 +0000</pubDate>
      <link>https://dev.to/dlisk92/making-eda-easy-426g</link>
      <guid>https://dev.to/dlisk92/making-eda-easy-426g</guid>
      <description>&lt;p&gt;A package has recently come to my attention that makes performing a blanket exploratory data analysis (EDA) of a new dataset easier. That package is pandas_profiling.&lt;br&gt;
To install the package, the command is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pandas-profiling[notebook]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So what does it do?&lt;br&gt;
These points were taken from the documentation &lt;a href="https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type inference: detect the types of columns in a dataframe&lt;/li&gt;
&lt;li&gt;Essentials: type, unique values, missing values&lt;/li&gt;
&lt;li&gt;Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range&lt;/li&gt;
&lt;li&gt;Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness&lt;/li&gt;
&lt;li&gt;Most frequent values&lt;/li&gt;
&lt;li&gt;Histograms&lt;/li&gt;
&lt;li&gt;Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices&lt;/li&gt;
&lt;li&gt;Missing values: matrix, count, heatmap and dendrogram of missing values&lt;/li&gt;
&lt;li&gt;Duplicate rows: lists the most frequently occurring duplicate rows&lt;/li&gt;
&lt;li&gt;Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All this is done with one command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas_profiling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProfileReport&lt;/span&gt;
&lt;span class="n"&gt;ProfileReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Pandas Profiling Report"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;df in the above code is a pandas DataFrame object.&lt;/p&gt;

&lt;p&gt;If you have a large dataset (say, of shape (1000000, 20)) you may be better served by passing the kwarg&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ProfileReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;large_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minimal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So while deep knowledge of your data is ideal, I think pandas_profiling is a good first step to building that knowledge.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GPT2 fine tuned with kaggle winning posts</title>
      <dc:creator>Dylan Lisk</dc:creator>
      <pubDate>Fri, 14 Feb 2020 20:20:37 +0000</pubDate>
      <link>https://dev.to/dlisk92/gpt2-fine-tuned-with-kaggle-winning-posts-29i7</link>
      <guid>https://dev.to/dlisk92/gpt2-fine-tuned-with-kaggle-winning-posts-29i7</guid>
      <description>&lt;p&gt;My name is Dylan Lisk and I am a data science student using gpt-2 to write my blog. I also participate in Kaggle competitions and have done so for many years. I am currently in the process of getting my bachelor s in natural resources management and management. I like to think of myself as a data scientist but more likely a statistics whiz. My thesis was on predictive modeling in the forestry industry. I have done statistical analysis in forestry in the past but never before. I also have a strong background in natural resources. I have worked in the oil and gas industry refining oil and natural gas. I have worked in projects dealing with oil and gas and related industries. I have participated in many R packages dealing with data analysis. R is my default choice for data analysis. I use SAS for all my data manipulation. I use R for trying to figure out how to combine and linear regression. I use SAS for setting up my computer for data and for generating the reports that I send to the governor. My old computer is getting old but I have gotten to the point where I can still be productive from time spent on R. I also have a small library of R packages that I have tried but not used in years. One that I have tried using R mostly for data analysis. It is a random forest package. It has nice performance characteristics. I like the idea of an efficient wrapper for R. I used a http randomforest package in my previous job. It is also possible to use R for data analysis. However one has to be careful about the precision of the estimates. R can give estimates up to for some data. I have tried using variance bias and insensitive estimates. I have also tried using nominal and median estimates. These estimates can be used to generate a rough baseline for the model. However I have not tried them all. I have tried using the mean or standard deviation of each of the estimates. 
This gives a rough estimate for the test statistic for the selected group of the selected data. However I have found that using these estimates as a baseline gives poor results. So I have tried using a weighted average of the mean and the standard deviation. This gave about as good a result as using the mean and the mean and standard deviation. I have tried using a weighted average of the mean and the mean. However I have found that using the mean gives much better results. So I have tried using the median. However I have found that using the median as a baseline gives poor results. I have tried using the median as a baseline and the test statistic. I have tried using the median as a baseline. I have found that using the median as a baseline gives poor results. I have tried using the median as a baseline and the test statistic. I have tried using the standard deviation of the estimates. This gives a rough estimate for the test statistic for the selected group of the selected data. However I have found that using this estimate as a baseline gives poor results. So I have tried using a weighted average of the standard deviation and the mean. This gave a good result when compared to using the mean and the mean. I have tried using a weighted average of the mean and the mean. However I have found that using the mean gives much better results. So I have tried using the median. However I have found that using the median as a baseline gives poor results. I have tried using a weighted average of the mean and the standard deviation. This gave a good result when compared to using the mean and the mean. However I have found that using the median as a baseline gives poor results. I have tried using the median as a baseline. However I have found that using the median as a baseline gives poor results. Group I have tried using the mean difference between the test and the median as a baseline. However I have found that using this as a baseline gives poor results. 
So I have tried using the mean difference between the test and the median as a baseline. However I have found that using this as a baseline gives poor results. Group I have tried using the median difference as a baseline. However I have found that using this as a baseline gives poor results. So I have tried using the median difference as a baseline. However I have found that using this as a baseline gives poor results. Group I have tried using the mean difference as a baseline. However I have found that using this as a baseline gives poor results. So I have tried using the mean difference as a baseline. However I have found that using this as a baseline gives poor results. Group I have tried using the median difference as a baseline. However I have found that using this as a baseline gives poor results. So I have tried using the median difference as a baseline. However I have found that using this as a baseline gives poor results. Group I have tried using the standard deviation of the estimates. This gives a rough estimate for the test statistic for the selected group of the selected data. However I have found that using this as a baseline gives poor results. Group I have tried using the&lt;/p&gt;

&lt;h1&gt;
  
  
  I added another
&lt;/h1&gt;

&lt;p&gt;My name is Dylan Lisk and I am a data science student using gpt-2 to write my blog. I also participate in a lot of Kaggle competitions and have done quite well on the leaderboard. I am currently in the last weeks of my senior project in the Bioacoustic Research Institute at the University of Pittsburgh where I am working on some machine learning problems. I have been doing quite well on the leaderboard so far but I would like to improve my score so I joined the strong What made you decide to enter strong Dylan I was looking for a team and a lot of variety in challenges. The length was perfect for this competition. I like to have room to experiment and try things. And the variety of challenges was great. The fact that the problems were all related was very strong Sergei I was looking for a team that was strong Dylan I wanted to create a dataset that was strong Sergei I wanted to create a dataset that was strong a http img aligncenter center auto auto http strong What preprocessing and supervised learning methods did you use strong Dylan I used a lot of feature engineering in preprocessing and feature selection. Feature selection was done by going through the full time series and finding the most important features for each individual month. Then I would select the most important features and polynomial P P was used as feature of all time series for months that had at least features. span color strong a http span color month feature engineering was very powerful in this strong Sergei I used a lot of feature selection and feature transformations. Feature transformation was done by going through the full time series and finding the most important features for each individual month. Then I would select the most important features and polynomial P was used as feature of all time series for months that had at least features. 
span color strong a http span color month feature transformation was very powerful in this strong a https span color pipeline is a diagram that visually shows the general idea of how I did feature selection and transformation work together. strong a https span color The diagram below summarizes the general idea of how I did feature selection and feature transformation work together. strong a https span color The following figure illustrates an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data by dividing the training data into bins and averaging the results. strong a https span color Figure: An example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data by dividing the training data into bins and averaging the results. caption aligncenter a http img http Figure: An example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data by dividing the training data into bins and averaging the results. strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. strong a https span color strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. 
strong a https span color strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. strong a https span color strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. strong a https span color strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. strong a https span color strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. strong a https span color strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. strong a https span color strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. strong a https span color strong a https span color Figure shows an example of how feature engineering can be used to generate features for a more complex model. Notice that the colour gradient is introduced by adding colour to the training data. strong a&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A novice's explanation of Data Science</title>
      <dc:creator>Dylan Lisk</dc:creator>
      <pubDate>Fri, 20 Dec 2019 16:24:03 +0000</pubDate>
      <link>https://dev.to/dlisk92/a-novices-explanation-of-data-science-160i</link>
      <guid>https://dev.to/dlisk92/a-novices-explanation-of-data-science-160i</guid>
      <description>&lt;p&gt;The first step of this process is data acquisition, which can be accomplished by SQL queries, API calls, web scraping, or simply downloading a file such as a .csv. For this brief overview I won't touch on data acquisition.&lt;/p&gt;

&lt;h1&gt;
  
  
  So you have your data in a .csv file so what now?
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
   &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path/data.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What that does is import pandas and name it pd. I then call the pandas read_csv() function to create a pandas.DataFrame object from the data.&lt;br&gt;
With a DataFrame in hand we are able to address missing values and data trimming much more easily.&lt;/p&gt;
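&lt;p&gt;Before cutting anything, it helps to see how much is actually missing. A quick sketch (the toy data here is purely illustrative):&lt;/p&gt;

```python
import pandas as pd

# Toy data: one missing value per column.
df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})

# Count missing values per column before deciding how to handle them.
missing_counts = df.isna().sum()
```

The per-column counts tell you whether dropping rows would throw away a little data or a lot.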

&lt;p&gt;As a novice, my go-to strategy for dealing with missing values is to simply cut out the offending observations.&lt;br&gt;
&lt;a href="https://i.giphy.com/media/RIe6nVbsB3yz6/source.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/RIe6nVbsB3yz6/source.gif"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataFrame.dropna()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also impute values into the data, such as the mean or median, which preserves the rest of the data in the row. Be careful if the feature you are imputing is strongly related to the target: it is better to tamper as little as possible with highly correlated features. &lt;/p&gt;
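&lt;p&gt;As a sketch of median imputation (toy data, illustrative only):&lt;/p&gt;

```python
import pandas as pd

# Toy data with one missing value per column.
df = pd.DataFrame({"age": [25, None, 31, 40], "income": [50, 60, None, 80]})

# Fill each column's missing values with that column's median.
df_filled = df.fillna(df.median())
```

Unlike dropping rows, this keeps all four observations; the trade-off is that the imputed values slightly flatten the column's true distribution.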

&lt;p&gt;Now we deal with outliers. How do we do that?&lt;br&gt;
&lt;a href="https://i.giphy.com/media/L7p2UMMY2Yk3S/source.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/L7p2UMMY2Yk3S/source.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You guessed it! I cut that shit out. Now, there are many statistical tests and methods to determine what counts as an outlier. I don't really know them at the moment, so I arbitrarily cut the data. To be a little more intelligent, if you have separate test data, you can find its min() and max() values and chop your train data to reflect those bounds.&lt;/p&gt;
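&lt;p&gt;One common rule of thumb (not something the original post uses, just a sketch) is the IQR rule: clip values that fall more than 1.5 interquartile ranges outside the middle of the data.&lt;/p&gt;

```python
import pandas as pd

# Toy data: 100 is an obvious outlier.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Clip everything to the [q1 - 1.5*IQR, q3 + 1.5*IQR] band instead of deleting rows.
trimmed = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```

Clipping keeps the row count intact, which matters if other columns in those rows are fine.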
&lt;h1&gt;
  
  
  Ok so let's look at our data.
&lt;/h1&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
  &lt;span class="n"&gt;smp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plotting&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;smp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://i.giphy.com/media/zhJR6HbK4fthC/source.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/zhJR6HbK4fthC/source.gif"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F37ujqjvkxvzz35yvwpqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F37ujqjvkxvzz35yvwpqx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The first thing to look at is the diagonal plot of histograms. This shows us the distribution of the values of each variable. For a lot of modeling techniques we want our features to have a normal distribution. &lt;/p&gt;
&lt;h1&gt;
  
  
  What is normal?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjz7ep0v45dc3x6hcgw5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjz7ep0v45dc3x6hcgw5i.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
This is a normal distribution. If your distribution isn't normal what can you do besides panic? You can try to transform your data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log1p&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a new column with log1p-transformed values (the log of one plus the value) and see how that works. Data science is iterative, so if that doesn't work, try something else.&lt;/p&gt;

&lt;p&gt;So we have our data where we want it. Now we bust out the models. The go-to Python package for modeling is scikit-learn; I recommend visiting &lt;a href="https://scikit-learn.org/dev/index.html" rel="noopener noreferrer"&gt;https://scikit-learn.org/dev/index.html&lt;/a&gt;. It is a one-stop shop for all things machine learning. &lt;/p&gt;

&lt;p&gt;Before you throw your data into a model you need to do a couple of things. &lt;br&gt;
Split your data into a train and test set. The model is trained on the train set and evaluated on the test set. Pretty self-explanatory. Next, scale the data. Many models behave better when features are on comparable scales, so that coefficients stay manageable. Some of sklearn's scalers can help with outliers as well.&lt;/p&gt;
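&lt;p&gt;sklearn's train_test_split and StandardScaler handle both steps for you; the numpy sketch below (toy random data, illustrative only) shows the underlying idea. The key detail is that the scaling statistics come from the train split only, so no information leaks from the test set.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))  # toy feature matrix

# Split into train and test sets (train_test_split also shuffles).
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]

# Standardize using statistics from the train set only (what StandardScaler does
# when you fit on train and transform both sets).
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

After this, each training feature has zero mean and unit variance, and the test set is transformed with the same numbers rather than its own.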

&lt;h1&gt;
  
  
  What time is it?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/9u514UZd57mRhnBCEk/source.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/9u514UZd57mRhnBCEk/source.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  It's modeling time baby
&lt;/h1&gt;

&lt;p&gt;You would think this is where the real work begins, and you would be wrong. You just did most of the work doing the data cleaning, exploration and feature engineering.&lt;/p&gt;

&lt;p&gt;I touch on model selection in a future post that I will link here&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why data science???</title>
      <dc:creator>Dylan Lisk</dc:creator>
      <pubDate>Tue, 03 Dec 2019 17:43:38 +0000</pubDate>
      <link>https://dev.to/dlisk92/why-data-science-3nl3</link>
      <guid>https://dev.to/dlisk92/why-data-science-3nl3</guid>
      <description>&lt;p&gt;&lt;a href="https://i.giphy.com/media/xL7PDV9frcudO/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/xL7PDV9frcudO/giphy.gif" alt=""&gt;&lt;/a&gt;&lt;br&gt;
To answer the question of why data science, the question of what is data science has to be answered first. So what is data science?&lt;br&gt;
Well, if you break down the words literally, data science would be the study of data. That doesn’t really say anything by itself, because what is data? &lt;br&gt;
Data is the plural form of the Latin word datum. Datum means “something given”. So from that we can piece together that data science means: given something, do science to it. This sounds very broad because it is. Data science is impossibly broad because most problems can be seen as data science problems. As long as you have observations, i.e. data, you can use the data science process to extract meaning.&lt;br&gt;
&lt;a href="https://i.giphy.com/media/1guRIRW8QdSte01T6Du/source.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/1guRIRW8QdSte01T6Du/source.gif" alt=""&gt;&lt;/a&gt;&lt;br&gt;
Data science is still a new field as it stands today. Looking at Google Trends for “Data Science” shows that these terms were not commonly put together until around 2013-2014, and search popularity has been steadily increasing since then. While this does not mean anything definitively, it does seem to indicate that the field is growing. For someone looking for a job, a new, growing industry is often a good place to be. One of the things that comes with a growing industry, particularly a tech industry, is constant change. In data science you will have to be constantly learning. As new modeling techniques become available, a good data scientist will be expected to rapidly adopt and incorporate these new processes. This to me is one of the major draws of the data science field. &lt;/p&gt;

&lt;p&gt;Another facet of data science is the translation of data. Not every person who needs to use data will be able to digest it in a raw form. So to help people make decisions, data science and analysis step in to present the data in an easy-to-absorb medium. This is called data visualization, and it is an easy-to-overlook part of data science. Data visualization is basically marketing for your analysis of the data, and in large part it determines the impact of your analysis. Just look at infographics: there is no denying that a nice infographic will have more impact on the masses than a paper.&lt;br&gt;
&lt;a href="https://i.giphy.com/media/per3uJGPsapva/source.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/per3uJGPsapva/source.gif" alt=""&gt;&lt;/a&gt;&lt;br&gt;
These points are all good general reasons that data science is a good career path, but for me they aren’t the main draws of data science. For me, the challenge of continuing to learn new things is very rewarding.&lt;/p&gt;

&lt;p&gt;I am just starting my journey into data science, so I’m sure my ideas and perception of data will continue to change, but I feel like this novice's take on data science will be fun to look back on and compare to my future understanding.&lt;br&gt;
&lt;a href="https://i.giphy.com/media/lMeXjMBiORUI/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/lMeXjMBiORUI/giphy.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
