<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Moscatena</title>
    <description>The latest articles on DEV Community by Moscatena (@moscatena).</description>
    <link>https://dev.to/moscatena</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F607958%2F34c154a5-745b-4980-9d88-d76a46ed2ea5.jpg</url>
      <title>DEV Community: Moscatena</title>
      <link>https://dev.to/moscatena</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/moscatena"/>
    <language>en</language>
    <item>
      <title>On Data Quality</title>
      <dc:creator>Moscatena</dc:creator>
      <pubDate>Tue, 24 May 2022 14:00:22 +0000</pubDate>
      <link>https://dev.to/moscatena/on-data-quality-3pak</link>
      <guid>https://dev.to/moscatena/on-data-quality-3pak</guid>
<description>&lt;p&gt;Gathering and cleaning data are two steps present in every Data Analysis and Data Science process there is. Depending on the source you look at, they're broken down slightly differently, but they're always there. &lt;a href="https://www.analyticssteps.com/blogs/5-steps-data-analysis"&gt;Analytic Steps&lt;/a&gt; categorizes them as:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UkQIsLPi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/18tm2k71j0rk38nm0adc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UkQIsLPi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/18tm2k71j0rk38nm0adc.jpg" alt="process1" width="748" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Determining the Objective&lt;/li&gt;
&lt;li&gt;Gathering the Data&lt;/li&gt;
&lt;li&gt;Cleaning the Data&lt;/li&gt;
&lt;li&gt;Interpreting the Data&lt;/li&gt;
&lt;li&gt;Sharing the Results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://www.datapine.com/blog/data-analysis-methods-and-techniques/"&gt;Data Pine&lt;/a&gt; says:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7I1JCydJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhf9guhqkork28jl38gd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7I1JCydJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhf9guhqkork28jl38gd.jpg" alt="process2" width="361" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify&lt;/li&gt;
&lt;li&gt;Collect&lt;/li&gt;
&lt;li&gt;Clean&lt;/li&gt;
&lt;li&gt;Analyse&lt;/li&gt;
&lt;li&gt;Interpret&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And according to &lt;a href="https://www.linkedin.com/pulse/six-data-analysis-phases-gert-l%C3%B5hmus/"&gt;Google&lt;/a&gt;, there are six steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ask&lt;/li&gt;
&lt;li&gt;Prepare&lt;/li&gt;
&lt;li&gt;Process&lt;/li&gt;
&lt;li&gt;Analyse&lt;/li&gt;
&lt;li&gt;Share&lt;/li&gt;
&lt;li&gt;Act&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Though all of these use slightly different words, they preach the same concept:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find out what you really want to do or discover&lt;/li&gt;
&lt;li&gt;Find the data that will help you get there&lt;/li&gt;
&lt;li&gt;Make sure the data is in shape to be modelled and visualised&lt;/li&gt;
&lt;li&gt;Model and visualise it&lt;/li&gt;
&lt;li&gt;Share your findings and take action based on them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Data Science process is very similar to the Data Analysis one: we make sure we understand the problem, collect and clean the data, explore it, then build and deploy models. And what's the most important part of this? Ask a data scientist and you're likely to get 'modelling' as the answer, because building and deploying models is usually the most exciting part of the job. We love tweaking and running different models, squeezing out a small percentage increase in our accuracy or precision, or whichever other metric we chose to use. I am definitely guilty of spending much more time in the model building phase than in any other. It feels more exciting and challenging. We're doing what really matters in the process, right?&lt;br&gt;
Well, yes. But that's not all.&lt;/p&gt;

&lt;p&gt;Albert Einstein once said, “If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute resolving it.” Applied to the data science process, that maps to the first step: make sure you understand the problem. I could go on about how this is another undervalued part of the process, but I want to focus on what I think is given even less thought: the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Garbage In, Garbage Out
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hTL0GaGy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/olynhkuan6ily5mk0fzu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hTL0GaGy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/olynhkuan6ily5mk0fzu.png" alt="Garbage In, Garbage Out" width="600" height="404"&gt;&lt;/a&gt;&lt;br&gt;
This popular expression from the early days of computing still holds true. It doesn't matter how advanced your machine learning model is, or how much data it can incorporate: if that data is 'bad', your results will not represent what you think they do. This is overlooked surprisingly often in data science, from schools to actual jobs (and a few competition websites out there). Oftentimes we're simply handed a dataset and told to work with it, with no critical examination of where the data came from or how it got there.&lt;br&gt;
How can we solve this issue? First, we should all understand what 'bad' data is.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Bad Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Improperly Labelled Data&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Labelling data can be a tedious and time-consuming job, but if you want your models to make correct predictions, it needs to be done properly. If your model was fed incorrectly labelled data and uses that information to make its predictions, those predictions will most likely be wrong.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not Enough Data&lt;/strong&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Models need data. A lot of it. And I'm not talking about a few hundred, or even a few thousand data points. To build advanced and accurate models, you may need millions of examples. Imagine if your 'Is it a cat or a dog' model only had a few hundred images covering a few cat breeds. Challenged with a breed it has never seen, it might struggle to make accurate predictions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Untrustworthy Data&lt;/strong&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;For our capstone project at Flatiron School we had to not only pitch the project we wanted to do, but also find a dataset that would allow us to accomplish it. I chose to build a &lt;a href="https://github.com/moscatena/Fake-News-Classification"&gt;Fake News Classifier&lt;/a&gt;, and pitched a dataset I found on Kaggle. My instructor was quick to turn it down: it had no information on how the data was acquired or labelled, and I had no means of verifying it. With some more research I found the &lt;a href="https://github.com/tfs4/liar_dataset"&gt;Liar&lt;/a&gt; dataset, which contains thousands of data points hand-labelled by editors from politifact.com on a truthfulness scale, along with extensive metadata on each instance, making it verifiable. Once I settled on my final model, I trained a version of it on the rejected dataset, out of curiosity. Its accuracy on that dataset's test data was far higher than that of the model trained on the Liar dataset. Why? The model wasn't actually making better predictions; it was just better at reproducing that dataset's (unverified) labels.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dirty Data&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This should go without saying, but your data should be properly pre-processed. This step is the most time-consuming for a data scientist, who spends on average &lt;a href="https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says"&gt;60% of their time cleaning data&lt;/a&gt;. Considering that 76% of data scientists find this the 'least enjoyable part of their work' according to the research linked above, one understands why it's sometimes not done properly. That's no excuse, though: dirty data will lead you to an unreliable model.&lt;/p&gt;
&lt;/blockquote&gt;
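&lt;p&gt;As a rough illustration of what 'properly pre-processed' means, here is a minimal pandas sketch over a made-up toy table with some typical problems: exact duplicates, inconsistent text, an implausible outlier and a missing value. All data, column names and thresholds here are purely hypothetical:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset exhibiting common "dirty data" problems
df = pd.DataFrame({
    "age": [25, 25, np.nan, 230, 41],
    "city": ["London", "London", "london ", "Paris", "Paris"],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["city"] = df["city"].str.strip().str.title()   # fix inconsistent text labels
df["age"] = df["age"].clip(upper=120)             # cap an implausible outlier
df["age"] = df["age"].fillna(df["age"].median())  # impute the missing value
print(df)
```

The right fix always depends on the domain: sometimes an outlier should be dropped rather than capped, and sometimes a missing value is itself informative.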

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Biased Data&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Unbiased data is much harder to achieve without the proper tools. When using data we must understand not only where it comes from, but how it was collected, by whom, and how the sample was selected, among many other variables. Context here is key. If my Fake News Classification project had used data labelled only by members of a certain political party, it could end up biased towards classifying that party's news or statements as true, while other parties' statements would have a much higher chance of being labelled false. Collecting data with an unbiased technique, from an unbiased sample, and labelling it without bias is a hard task, but essential for a trustworthy model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Good Practices
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LFLJ3j7l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/994s3mduoac5932xizlo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LFLJ3j7l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/994s3mduoac5932xizlo.jpeg" alt="Cleaning Data" width="341" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To ensure the data you're using is reliable, you have to incorporate some practices in your model development pipeline, straight from the beginning. Ask yourself these questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Is the data Reliable?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This can be hard to assess, but can be achieved by comparing the results to other sources. Unreliable data can lead to incorrect decisions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Is the data Original?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Can you validate the data with the original source? How does it compare to other datasets?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Is the data Comprehensive?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Is this a complete dataset? Is it meaningful to what we want to accomplish? Is there enough data? Gaps in the dataset, like lots of people not answering 'age' in a questionnaire for instance, can make you unable to fully understand your data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Is the data Current?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;When was this data acquired? Can the lack of recent data lead to biased models? Outdated data can lead to models that don't reflect the current reality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Is the data Cited?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Is it and its authors formally recorded and acknowledged?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you can confidently answer all those questions positively, your data has a much higher chance of being of good quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where to find data?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WhhL-4F8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ubfl77nqzfu5qvca1qm2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WhhL-4F8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ubfl77nqzfu5qvca1qm2.jpg" alt="Looking for data" width="370" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many places where you can find good datasets. I'll list a few here, in no particular order, but be sure to always perform the necessary checks to evaluate their reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/awesomedata/awesome-public-datasets"&gt;Awesome Public Datasets&lt;/a&gt;; &lt;a href="https://archive.ics.uci.edu/ml/datasets.php"&gt;UCI Machine Learning Repository&lt;/a&gt;; &lt;a href="https://cseweb.ucsd.edu/~jmcauley/datasets.html"&gt;Recommender Systems and Personalization Datasets&lt;/a&gt;; &lt;a href="https://openpolicing.stanford.edu/"&gt;The Stanford Open Policing Project&lt;/a&gt;; &lt;a href="https://www.bls.gov/cps/tables.htm"&gt;Labor Force Statistics from the Current Population Survey&lt;/a&gt;; &lt;a href="https://data.unicef.org/resources/dataset/"&gt;Unicef Data&lt;/a&gt;; &lt;a href="https://www.climate.gov/maps-data/all?listingMain=datasetgallery"&gt;Climate Data&lt;/a&gt;; &lt;a href="https://www.ncei.noaa.gov/products"&gt;National Centers for Environment Information&lt;/a&gt;; &lt;a href="https://cloud.google.com/healthcare-api/docs/resources/public-datasets"&gt;Google Cloud Healthcare API public datasets&lt;/a&gt;; &lt;a href="https://www.who.int/data/collections"&gt;WHO Data Collections&lt;/a&gt;; &lt;a href="https://www.census.gov/data.html"&gt;USA Census Bureau&lt;/a&gt;; &lt;a href="https://data.gov/"&gt;US Government Open Data&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;I can't find the data I need. What to do?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What can you do if the data you're looking for simply doesn't exist? You either find a proxy for the data you don't have, or you create your own dataset: run a survey, study or questionnaire, and find a representative, unbiased sample of the population you're targeting. If the study is too big, or you have difficulty setting its parameters, there are research services like &lt;a href="https://www.prolific.co/"&gt;Prolific&lt;/a&gt; that recruit niche or representative samples on demand and provide flexible tools for online research. They do great work connecting researchers with good-quality data, which is a terribly important job.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Don't skip steps. It's as simple as that. Incorporate the Data Analysis and Data Science processes into your project pipelines and actually go through them. Make sure you properly verify your data. Trust me, you'll be much more satisfied with your high accuracy score when you're confident the score is actually correct.&lt;/p&gt;

</description>
      <category>datascience</category>
    </item>
    <item>
      <title>Youtube Recommendation System</title>
      <dc:creator>Moscatena</dc:creator>
      <pubDate>Wed, 09 Feb 2022 22:33:13 +0000</pubDate>
      <link>https://dev.to/moscatena/youtube-recommendation-system-5cdf</link>
      <guid>https://dev.to/moscatena/youtube-recommendation-system-5cdf</guid>
<description>&lt;p&gt;YouTube is the &lt;a href="https://en.wikipedia.org/wiki/List_of_most_visited_websites"&gt;second most visited website&lt;/a&gt; on the internet today. Almost five billion videos are watched on the platform every day, and three hundred hours of new content are uploaded every single minute! At that scale, merely maintaining the website is a very difficult task; building models that use all that information, like the video recommendation system, is a much bigger challenge, and that's what I'll walk through a little today. I'll follow the structure of their &lt;a href="https://dl.acm.org/doi/10.1145/2959100.2959190"&gt;paper&lt;/a&gt; on the subject.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;These are just some of the challenges the system has to take into consideration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: As mentioned before, YouTube has a huge user base and a massive corpus to handle. Systems that offer good solutions for smaller-scale projects may fail to operate on a platform this large.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Freshness&lt;/strong&gt;: This means balancing recommendations of older videos against the thousands of new videos being uploaded to the platform every minute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise&lt;/strong&gt;: With only a like button and a comment section that most users never touch, there is little signal beyond viewed or not viewed, which makes user satisfaction hard to determine and adds a lot of noise to the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;YouTube's solution to these problems was to use deep learning, deploying a neural network recommendation system implemented in TensorFlow. The model has roughly one billion parameters and is trained on hundreds of billions of examples!&lt;/p&gt;

&lt;h2&gt;
  
  
  Neural Networks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_Ro2yTzc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r8lyu4ez8siu2abry1ti.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_Ro2yTzc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r8lyu4ez8siu2abry1ti.jpeg" alt="Neural Network Depiction" width="880" height="630"&gt;&lt;/a&gt;&lt;br&gt;
Just to touch on this briefly, a neural network is a computer system inspired by how the human brain operates. The image above shows the input layer, where data enters the model, followed by a couple of hidden layers. There can be many of those, and they try to identify underlying relationships in the data. Models can have a single output or, as in YouTube's case, several, which represent the recommended videos.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does it work?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J37_tCyh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jg3bbaxpwcbxa4sddzps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J37_tCyh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jg3bbaxpwcbxa4sddzps.png" alt="Recommendation System graph" width="880" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flow chart above shows the structure of the recommendation system. There are two neural networks here: &lt;em&gt;candidate generation&lt;/em&gt; and &lt;em&gt;ranking&lt;/em&gt;. The first phase selects a few hundred videos from YouTube's corpus based on user activity. The ranking phase then scores and orders those candidates.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Candidate Generation&lt;/em&gt;&lt;br&gt;
Though this post won't go into the technical details of how these systems work, I'll point out that the candidate generation network uses a &lt;a href="https://en.wikipedia.org/wiki/Softmax_function"&gt;softmax&lt;/a&gt; classifier, which is used for multi-class classification problems, and that it relies on implicit feedback only: instead of explicit signals like the thumbs up/down, its algorithm relies on watch time. To address some of the challenges posed earlier, these were some of the decisions made in the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A video's age is made into a feature and set very close to zero in the training data. Age still carries weight in the recommendation, but this keeps the model from disproportionately rewarding older videos, which have accumulated more clicks and views.&lt;/li&gt;
&lt;li&gt;Feature engineering was used to increase the model's precision. Experiments with a vocabulary of a million videos and a million search tokens, embedded in bags capped at 50 recent watches and 50 recent searches, substantially increased the model's mean average precision compared to one without these features.&lt;/li&gt;
&lt;li&gt;Fewer inputs from each user are fed into the model. Withholding signals helps prevent the model from overfitting and increases precision.&lt;/li&gt;
&lt;/ul&gt;
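&lt;p&gt;For intuition, the softmax function mentioned above turns a vector of arbitrary scores into a probability distribution over candidates. A minimal NumPy sketch with made-up scores (this is just the maths of softmax, not YouTube's actual model):&lt;/p&gt;

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating, for numerical stability
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Toy example: raw scores for three candidate videos
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)           # a probability distribution summing to 1
print(probs.argmax())  # index of the highest-scoring candidate
```

In the real system the score for each candidate video comes from the network's learned embeddings; the softmax layer is only the final step that converts those scores into probabilities.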

&lt;p&gt;&lt;em&gt;Ranking&lt;/em&gt;&lt;br&gt;
In this phase the system has to evaluate which videos users are watching and how much they are enjoying them.&lt;/p&gt;

&lt;p&gt;Historical data, meaning which videos the user has watched in the past, carries a lot of weight here. If users watch a specific kind of content, it's likely they'll watch more of it. If suggested videos of that kind are not being clicked on, though, they start to lose importance. You can see this change by searching for 'How to fix bicycle brakes', for instance. After you've watched one or two of those videos, you'll be suggested more of that content, but as you keep interacting with YouTube and don't click on those recommendations, they lose weight and are suggested less and less.&lt;/p&gt;

&lt;p&gt;The actual ranking of a video is a measure of how long the user is expected to engage with it: the more of the video the user watches, the higher its ranking gets. Interestingly enough, this method does not take the length of the video into account.&lt;br&gt;
For this prediction, the authors decided to use weighted logistic regression. The weighting applies to the positive impressions, i.e. the clicked videos, and is given by the watch time observed for those impressions.&lt;br&gt;
The resulting output is a list of videos, selected according to the user's preferences and then ordered by how likely the user is to watch each one. The whole process takes tens of milliseconds, which is an incredible feat.&lt;/p&gt;
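&lt;p&gt;The weighting idea can be sketched with scikit-learn, whose &lt;code&gt;LogisticRegression.fit&lt;/code&gt; accepts per-example weights. Everything below is made up for illustration: random 'impression' features, random clicks, and a toy watch-time array used to weight the positive examples. It shows the mechanism, not YouTube's implementation:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # toy per-impression features
y = rng.integers(0, 2, size=100)          # 1 = clicked, 0 = not clicked
watch_time = rng.uniform(10, 600, size=100)  # toy watch time in seconds

# Positive examples are weighted by watch time; negatives get unit weight,
# so long, engaged watches count for more than brief ones
weights = np.where(y == 1, watch_time, 1.0)

model = LogisticRegression()
model.fit(X, y, sample_weight=weights)
print(model.predict_proba(X[:3]))  # click probabilities for 3 impressions
```

With this weighting, an example watched for ten minutes pulls the decision boundary far more than one abandoned after ten seconds, which is the intuition behind using watch time rather than raw clicks.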

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The model briefly described here splits the recommendation problem in two: candidate generation and ranking. It manages to assimilate a constant stream of incoming signals and model them using several layers of depth in the neural network.&lt;br&gt;
Some of the unique feats of this system are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Withholding signals from the classifier to prevent overfitting.&lt;/li&gt;
&lt;li&gt;Using age as an input feature to remove bias towards the past.&lt;/li&gt;
&lt;li&gt;Layers of depth were shown to effectively model non-linear interactions between hundreds of features.&lt;/li&gt;
&lt;li&gt;Modifying Logistic Regression to weight watch time for positive examples.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Steve Jobs put it clearly when he said, "People don't know what they want until you show it to them." Recommendations of videos on YouTube, pages on Google, products on Amazon or music on Spotify all try to figure out what you are interested in and maximize your engagement with their platforms. They can lead people to discover things they never thought to search for, or direct people like myself, a beginner in the field of data science, to more, and more precise, information on what I'm trying to learn. I'm looking forward to assimilating the mathematical intricacies behind models like this and finding out how they can be tweaked and modelled.&lt;/p&gt;

&lt;h2&gt;
  
  
  References:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dl.acm.org/doi/10.1145/2959100.2959190"&gt;Scientific Paper&lt;/a&gt; by Paul Covington, Jay Adams and Emre Sargin&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=WK_Nr4tUtl8&amp;amp;ab_channel=ACMRecSys"&gt;Youtube presentation&lt;/a&gt; by Paul Covington&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.youtube/inside-youtube/on-youtubes-recommendation-system/"&gt;On YouTube’s recommendation system&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://theiconic.tech/implementing-the-youtube-recommendations-paper-in-tensorflow-part-1-d1e1299d5622"&gt;Implementing the YouTube Recommendations Paper in TensorFlow — Part 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/using-deep-neural-networks-to-make-youtube-recommendations-dfc0a1a13d1e"&gt;Using Deep Neural Networks to make YouTube Recommendations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>What's the Norm? What's the Standard?</title>
      <dc:creator>Moscatena</dc:creator>
      <pubDate>Tue, 18 Jan 2022 13:48:09 +0000</pubDate>
      <link>https://dev.to/moscatena/whats-the-norm-whats-the-standard-45il</link>
      <guid>https://dev.to/moscatena/whats-the-norm-whats-the-standard-45il</guid>
<description>&lt;p&gt;When working with datasets and machine learning processes, it’s usual to normalize our data before using it. Any machine learning algorithm that involves Euclidean distance needs scaled data; the same goes for KNN, clustering, linear regression, and all deep learning and artificial neural network algorithms. There are, though, many ways this can be done. So which scaling method should you use?&lt;br&gt;
In this post I’ll explain why scaling is important, then go through Scikit-Learn’s different scalers and talk about their differences.&lt;/p&gt;
&lt;h1&gt;
  
  
  Normalization and Standardization
&lt;/h1&gt;

&lt;p&gt;Both of these are common techniques used to prepare data before any machine learning procedure. They change the values of numeric columns in a dataset so they all share a common scale, without distorting their relative ranges or losing information. For instance, if one of your columns has values ranging from 0 to 1 while another ranges from 10,000 to 100,000, that difference may cause problems when combining those values during modelling.&lt;/p&gt;

&lt;p&gt;The big difference between them is that normalization usually means rescaling values into a range between 0 and 1, while standardization usually means rescaling the data to have a mean of 0 and a standard deviation of 1 (in other words, replacing each value with its z-score).&lt;/p&gt;
&lt;h3&gt;
  
  
  Standardization
&lt;/h3&gt;

&lt;p&gt;Here’s the mathematical formula for standardization:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LWrUmZRQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gt5fabo8e1irdo4fzqwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LWrUmZRQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gt5fabo8e1irdo4fzqwl.png" alt="Standardization Formula" width="454" height="111"&gt;&lt;/a&gt;&lt;br&gt;
The mean here is the mean of the feature values, sd is the standard deviation of the feature values. In standardization, your values are not restricted to a particular range.&lt;/p&gt;
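&lt;p&gt;As a quick sanity check, the formula can be applied directly with NumPy on a few toy values (made up for this example):&lt;/p&gt;

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])
# The standardization formula applied directly: (x - mean) / sd
z = (x - x.mean()) / x.std()
print(z.mean())  # 0 (exact here, since the deviations cancel)
print(z.std())   # 1 (up to floating-point error)
```

This is exactly what Scikit-Learn's StandardScaler, used later in this post, computes column by column.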
&lt;h3&gt;
  
  
  Normalization
&lt;/h3&gt;

&lt;p&gt;This can take a few different forms. One of the most common ones is called the Min-Max scaling. Here’s the formula for MinMax:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E2sPUXOA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9ckeo8u4ncm7wepc8fu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E2sPUXOA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9ckeo8u4ncm7wepc8fu1.png" alt="MinMax Formula" width="461" height="109"&gt;&lt;/a&gt;&lt;br&gt;
X-max and X-min are, respectively, the maximum and minimum values of the feature.&lt;br&gt;
You can see that if X equals X-min, X’ will be 0, while if X equals X-max, X’ will be 1. Every other value falls between the two.&lt;/p&gt;
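&lt;p&gt;Applying the Min-Max formula directly to a few of the distance values used later in this post (0, 10, 105 and 230) shows the endpoints mapping to 0 and 1:&lt;/p&gt;

```python
import numpy as np

x = np.array([0.0, 10.0, 105.0, 230.0])
# The Min-Max formula applied directly: (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # the minimum maps to 0.0, the maximum to 1.0
```

This is what Scikit-Learn's MinMaxScaler computes per column, with the default feature range of (0, 1).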
&lt;h2&gt;
  
  
  Shaping Data with Python
&lt;/h2&gt;

&lt;p&gt;I’ll use a couple of models here to give you a better understanding of what these methods are doing.&lt;/p&gt;

&lt;p&gt;First, I’ll use distance from London to create a quick model to compare house prices. Here’s the information:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1_cD54_E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/juqjb5ctl0lwmvn0r8ru.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1_cD54_E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/juqjb5ctl0lwmvn0r8ru.JPG" alt="Price of square meter of land per distance from london" width="377" height="260"&gt;&lt;/a&gt;&lt;br&gt;
Now let’s get them in numpy array form, and shape them as columns, so we can transform them later on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
X_property_dist = np.array([0, 10, 105, 120, 180, 200, 210, 220, 230]).reshape(-1, 1)
y_price = np.array([6000, 4100, 2200, 2600, 2100, 2000, 2100, 1800, 2200]).reshape(-1, 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s take a look at what this data looks like:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MjVl1xWi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i55odlv0wxur02owkr9d.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MjVl1xWi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i55odlv0wxur02owkr9d.JPG" alt="prices per distance" width="880" height="440"&gt;&lt;/a&gt;&lt;br&gt;
As you can see, the values range from 0 to 230 on the x-axis, while they vary from 1800 to 6000 on the y-axis. Our task is to transform this data so everything is on the same scale. With Sklearn you can easily do that with some preprocessing methods. We’ll start with StandardScaler, which standardizes the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler

ss_scaler = StandardScaler() # Instantiate the scaler

X_dist_scl_ss = ss_scaler.fit_transform(X_property_dist)  # fit on the distances, then transform them
y_price_scl_ss = ss_scaler.fit_transform(y_price)  # note: this call refits the scaler on the prices
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These calls transform the data we previously had so that each column has a mean of 0 and a standard deviation of 1. Here are the results:&lt;br&gt;
X_dist_scl_ss:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dT1rVjUU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j7cp5uxtz8jpk6xmcnvi.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dT1rVjUU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j7cp5uxtz8jpk6xmcnvi.JPG" alt="X_dist_scl_ss" width="176" height="169"&gt;&lt;/a&gt;&lt;br&gt;
y_price_scl_ss:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e1hXphu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3jcdhxwd74d9jtf99bo.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e1hXphu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3jcdhxwd74d9jtf99bo.JPG" alt="y_price_scl_ss" width="185" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1FH0j4Qu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rovgjzo79evhjq9grh9a.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1FH0j4Qu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rovgjzo79evhjq9grh9a.JPG" alt="prices per distance after Standard Scaling" width="880" height="437"&gt;&lt;/a&gt;&lt;br&gt;
The data now has a very different scale, and it is now easier to compare unit changes between the two axes. It’s important to note, though, that no values were distorted, which means the graph we’re seeing looks exactly the same as the previous one, apart from the scale.&lt;/p&gt;
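&lt;p&gt;If you want to double-check that nothing was distorted, you can reproduce the scaler’s output by hand with the standardization formula z = (x - mean) / std. Here’s a quick sketch of that check, reusing the X_property_dist array from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.preprocessing import StandardScaler

X_property_dist = np.array([0, 10, 105, 120, 180, 200, 210, 220, 230]).reshape(-1, 1)

# Standardize by hand: subtract the mean, divide by the standard deviation
z_manual = (X_property_dist - X_property_dist.mean()) / X_property_dist.std()

# StandardScaler applies the same formula, column by column
z_sklearn = StandardScaler().fit_transform(X_property_dist)

print(np.allclose(z_manual, z_sklearn))  # True
print(np.allclose(z_sklearn.mean(), 0))  # True
print(np.allclose(z_sklearn.std(), 1))   # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;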

&lt;p&gt;To further illustrate this, this set of graphs shows the effects of a Standard Scaler on 3 normally distributed arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(2022)
df = pd.DataFrame({
    'x1': np.random.normal(0, 2, 10000),
    'x2': np.random.normal(5, 3, 10000),
    'x3': np.random.normal(-5, 5, 10000)
})

scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['x1', 'x2', 'x3'])

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 10))

ax1.set_title('Before Scaling')
for i in range(1, 4):
    sns.kdeplot(df[f'x{i}'], ax=ax1)
ax2.set_title('After Standard Scaler')
for i in range(1, 4):
    sns.kdeplot(scaled_df[f'x{i}'], ax=ax2)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GtVhuZr8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l7js03gqr2efdqpg67d7.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GtVhuZr8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l7js03gqr2efdqpg67d7.JPG" alt="Standard Scaling effects" width="880" height="678"&gt;&lt;/a&gt;&lt;br&gt;
Here you can clearly see how all the means were shifted to 0 and the standard deviations set to 1.&lt;/p&gt;

&lt;p&gt;Now we’ll take a look at what MinMax, one of the most common normalizers, does to these two sets of data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import MinMaxScaler

mm_scaler = MinMaxScaler() # Instantiate the scaler

X_dist_scl_mm = mm_scaler.fit_transform(X_property_dist)
y_price_scl_mm = mm_scaler.fit_transform(y_price)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These calls rescale the data so that the minimum value becomes 0 and the maximum becomes 1. Let’s take a look at the results:&lt;br&gt;
X_dist_scl_mm:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qjnXpL6g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/avzwnhp5k88u3s123ec7.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qjnXpL6g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/avzwnhp5k88u3s123ec7.JPG" alt="X_dist_scl_mm" width="166" height="165"&gt;&lt;/a&gt;&lt;br&gt;
y_price_scl_mm:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cJsCZ_2G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/he6ymxgdb4rs8k4k2ns5.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cJsCZ_2G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/he6ymxgdb4rs8k4k2ns5.JPG" alt="y_price_scl_mm" width="167" height="172"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dkn5n3JD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x5ofu1x98o89vj5f91jk.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dkn5n3JD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x5ofu1x98o89vj5f91jk.JPG" alt="prices per distance after MinMax" width="880" height="447"&gt;&lt;/a&gt;&lt;br&gt;
This makes it even easier to compare the two. All values are now in a scale between 0 and 1, and since, again, there was no distortion in the values, the last graph looks the same as both previous ones.&lt;/p&gt;
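&lt;p&gt;The MinMax transformation is just as easy to verify by hand: each column is mapped through (x - min) / (max - min). A small sketch, again reusing the distances array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_property_dist = np.array([0, 10, 105, 120, 180, 200, 210, 220, 230]).reshape(-1, 1)

# MinMax by hand: (x - min) / (max - min)
x_min, x_max = X_property_dist.min(), X_property_dist.max()
mm_manual = (X_property_dist - x_min) / (x_max - x_min)

mm_sklearn = MinMaxScaler().fit_transform(X_property_dist)

print(np.allclose(mm_manual, mm_sklearn))  # True
print(mm_sklearn.min(), mm_sklearn.max())  # 0.0 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;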

&lt;p&gt;Let’s take a quick look at the effect the MinMax scaler had on our previous example as well:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U53xtFG3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/icnft7ldggknzr7kkbkq.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U53xtFG3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/icnft7ldggknzr7kkbkq.JPG" alt="MinMax scaling effects" width="880" height="680"&gt;&lt;/a&gt;&lt;br&gt;
Now the means of the curves are not exactly the same, but you can see that all the x-axis values fall within 0 and 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Scalers
&lt;/h2&gt;

&lt;p&gt;Scikit Learn provides a good number of different scalers, whose usefulness depends heavily on the type of data we're looking at, mainly how it is distributed and whether it contains outliers. Let's go through a few of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler"&gt;Robust Scaler&lt;/a&gt;: This scaler centres the data on the median and scales it according to the quantile range. This comes in handy when your data has outliers that influence the mean too much, making them interfere less with the results. Any method that is resistant to outliers is considered robust.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler"&gt;Max Absolute Scaler&lt;/a&gt;: Instead of fitting the data within a 0 to 1 range like the MinMax, the MaxAbs scaler sets the maximum absolute value to 1 but doesn't define a minimum. The data here is not shifted or centred around any point, maintaining sparsity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html#sklearn.preprocessing.PowerTransformer"&gt;Power Transformer&lt;/a&gt;: Makes the data more Gaussian-like. It's useful for situations where approximation to normality is needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer"&gt;Normalizer&lt;/a&gt;: Each sample is rescaled according to its norm: either l1, l2 or max. This is applied row-wise, so each rescaling is independent from the others. Here, max does to a row what MaxAbsScaler does to features, l1 uses the sum of the absolute values as the norm, giving equal penalty to parameters, and l2 uses the square root of the sum of the squared values, increasing smoothness.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
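&lt;p&gt;To see why the Robust Scaler earns its name, here's a small sketch comparing it with the Standard Scaler on hypothetical data containing one large outlier (the numbers are made up purely for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary values plus one large outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]).reshape(-1, 1)

standard = StandardScaler().fit_transform(x)
robust = RobustScaler().fit_transform(x)

# The outlier inflates the mean and standard deviation, squashing the
# standard-scaled bulk of the data together; the median/IQR-based robust
# version keeps the ordinary values well spread out
print(np.round(standard[:5].ravel(), 2))
print(np.round(robust[:5].ravel(), 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;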

&lt;p&gt;All other preprocessing and normalization methods from Scikit Learn can be found &lt;a href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to choose?
&lt;/h2&gt;

&lt;p&gt;Usually when you get to the conclusion stage of a post like this, you find clear, defined rules for what to do in different situations. Unfortunately, that is not the case here. There isn’t a simple way to be 100% sure about which scaling technique to use all of the time. Sometimes one can yield better results, but depending on the data, another one might be more appropriate. Above I described briefly how to choose among a few of these methods. Below is a quick overview comparing just the main topic of this post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardization does not bound values to a specific range, so outliers remain visible in the result. It is useful, though, if your data follows a &lt;a href="https://en.wikipedia.org/wiki/Normal_distribution"&gt;Gaussian Distribution&lt;/a&gt; (also called Normal Distribution).&lt;/li&gt;
&lt;li&gt;Normalization is less useful if your data already follows a Normal Distribution. It's helpful when using a prediction algorithm that relies on weighted relationships or distances between data points.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Predictive modeling problems can be complex, and knowing how best to scale your data may not be simple. There is no single best technique for every scenario; sometimes even multiple transformations are ideal. What we should aim for, then, is to shape the data in different ways, run our models on each version and analyse their outcomes.&lt;/p&gt;
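&lt;p&gt;That "try several and compare" approach is easy to automate with a scikit-learn Pipeline: swap the scaler step, refit, and score each variant. A rough sketch, using the built-in diabetes dataset and a k-nearest-neighbours regressor as an arbitrary distance-based model (training-set R^2 only, just to compare the variants):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.datasets import load_diabetes
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = load_diabetes(return_X_y=True)

scores = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # Same model, different scaling step
    model = Pipeline([("scale", scaler), ("knn", KNeighborsRegressor(n_neighbors=5))])
    model.fit(X, y)
    scores[type(scaler).__name__] = round(model.score(X, y), 3)

print(scores)  # one training R^2 per scaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real project you'd score on held-out data (for example with cross-validation) rather than on the training set, but the pattern of swapping the scaler step stays the same.&lt;/p&gt;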

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/preprocessing-with-sklearn-a-complete-and-comprehensive-guide-670cb98fcfb9"&gt;Preprocessing with Sklearn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/mikalaichaly/compare-different-scalers-on-data-with-outliers"&gt;Compare Scalers on data with Outliers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/"&gt;StandardScaler and MinMaxScaler in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/"&gt;Rescaling Data for Machine Learning in Python with Scikit-Learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://michael-fuchs-python.netlify.app/2019/08/31/feature-scaling-with-scikit-learn/"&gt;Feature Scaling with Scikit-Learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff"&gt;How, When, and Why Should You Normalize / Standardize / Rescale Your Data?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/"&gt;Understanding the Difference Between Normalization vs. Standardization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=mnKm3YP56PY&amp;amp;t=615s&amp;amp;ab_channel=KrishNaik"&gt;Standardization Vs Normalization- Feature Scaling&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Introduction to SciPy</title>
      <dc:creator>Moscatena</dc:creator>
      <pubDate>Mon, 20 Dec 2021 12:03:23 +0000</pubDate>
      <link>https://dev.to/moscatena/introduction-to-scipy-15n4</link>
      <guid>https://dev.to/moscatena/introduction-to-scipy-15n4</guid>
      <description>&lt;p&gt;Data scientists come across a multitude of problems that present themselves in various different fields. We have to deal with design issues from generating plots to either try and see a relationship for ourselves, or present them to a third party. We have to perform data cleaning, organising, interpolation and analysis, sometimes even engineering. We work with APIs to get acquire needed information. And amongst many other things, we use statistics to analyse and interpret our data. Although those are all things that could be done in a long and windy way, there are people that create tools and libraries to facilitate the work that has to be done. Today I’ll talk about one in particular: SciPy.&lt;/p&gt;

&lt;h2&gt;
  
  
  SciPy
&lt;/h2&gt;

&lt;p&gt;SciPy, pronounced &lt;em&gt;Sigh Pie&lt;/em&gt;, is an open source Python library with a collection of mathematical algorithms designed to make our lives easier. Basically, instead of you writing big and complicated scientific formulas, SciPy has you covered. The library, as of now, contains fifteen sub-packages that can be imported independently and have different utilities. They are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/cluster.html#module-scipy.cluster"&gt;cluster&lt;/a&gt;: clustering algorithms, useful in information theory, target detection, communications, compression, and other areas&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/constants.html#module-scipy.constants"&gt;constants&lt;/a&gt;: offers a number of mathematical and physical constants and conversions for units of measurement&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/fft.html#module-scipy.fft"&gt;fft&lt;/a&gt;: Fast Fourier Transforms, Discrete Sin and Cosine Transforms, Fast Hankel Transform, along with helper functions and backend control&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/integrate.html#module-scipy.integrate"&gt;integrate&lt;/a&gt;: Integration and ordinary differential equation solvers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/interpolate.html#module-scipy.interpolate"&gt;interpolate&lt;/a&gt;: Sub-package for objects used in interpolation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/io.html#module-scipy.io"&gt;io&lt;/a&gt;: modules to read and write on different types of files&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/linalg.html#module-scipy.linalg"&gt;linalg&lt;/a&gt;: Linear algebra modules&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/ndimage.html#module-scipy.ndimage"&gt;ndimage&lt;/a&gt;: functions for multidimensional image processing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/odr.html#module-scipy.odr"&gt;odr&lt;/a&gt;: Orthogonal distance regression&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/optimize.html#module-scipy.optimize"&gt;optimize&lt;/a&gt;: functions for minimizing objective functions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/signal.html#module-scipy.signal"&gt;signal&lt;/a&gt;: Signal processing functions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/sparse.html#module-scipy.sparse"&gt;sparse&lt;/a&gt;: 2-D sparse matrix package for numeric data&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/spatial.html#module-scipy.spatial"&gt;spatial&lt;/a&gt;: Spatial algorithms and data structures&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/tutorial/special.html"&gt;special&lt;/a&gt;: offers a substantial number of special mathematical functions, including Airy, Bessel, beta, hypergeometric, Mathieu and Kelvin functions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/stats.html#module-scipy.stats"&gt;stats&lt;/a&gt;: Statistical functions to work with frequency statistics, correlation functions and statistical tests, masked statistics and many others.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best way to use these sub-packages is to import them separately, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from scipy import stats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
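&lt;p&gt;As a quick taste of what that buys you, here's scipy.stats summarizing a small (made-up) sample in a single call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scipy import stats

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = stats.describe(sample)

print(summary.nobs)      # 8
print(summary.mean)      # 5.0
print(summary.variance)  # sample variance, with the usual n - 1 denominator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;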



&lt;h3&gt;
  
  
  History
&lt;/h3&gt;

&lt;p&gt;Version 0.1 was first written back in 2001, with version 1.0.0 only being released in 2017. They’re now at version 1.7.3, and make periodic updates. The code is written by scientists, for scientists, giving us a set of easy-to-use tools. One of its creators, Travis Oliphant, was also the creator of NumPy, which merged the earlier Numeric and Numarray projects. With a growing number of extension modules, and the rising need for a more complete environment for scientific and technical computing, in 2001 Travis joined efforts with Eric Jones and Pearu Peterson to create version 0.1. As a result, SciPy runs on top of the numeric array data structure provided by NumPy. From there, the project only grew. Version 1.0.0 had a total of 121 contributors. Currently it is distributed under the &lt;a href="https://en.wikipedia.org/wiki/BSD_licenses"&gt;BSD license&lt;/a&gt; and has its development supported by an open community. Their &lt;a href="https://github.com/scipy/scipy"&gt;GitHub repository&lt;/a&gt; has information on how to contribute to SciPy and what their plans for the future are. The applications of the library are remarkably varied, from high school education to field-changing research, like that of the &lt;a href="https://www.nobelprize.org/prizes/physics/2017/press-release/"&gt;2017 Physics Nobel Prize winners&lt;/a&gt; &lt;em&gt;“for decisive contributions to the LIGO detector and the observation of gravitational waves”&lt;/em&gt;. Part of that work can be seen &lt;a href="https://www.gw-openscience.org/tutorials/"&gt;here&lt;/a&gt;, and &lt;a href="https://github.com/losc-tutorial/Data_Guide"&gt;this&lt;/a&gt; is a GitHub repository where you can find a Jupyter Notebook and see some of the code in action.&lt;/p&gt;

&lt;h3&gt;
  
  
  Competition
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;NumPy&lt;/strong&gt; - Though SciPy is built on top of NumPy and possesses all of its features, NumPy can be a better choice when dealing only with basic array concepts. Python is a powerful and flexible language, but it might not be the fastest in some cases. NumPy's core is written in C, which makes its execution faster.&lt;br&gt;
&lt;strong&gt;MATLAB&lt;/strong&gt; - This is a different programming language altogether that, instead of being object-oriented like Python, is array-oriented. This makes it an easy and productive environment for scientists performing mathematical and technical computing, and prime for matrix manipulation. It is not, though, a general-purpose programming language, which makes it clunky when dealing with problems that demand more flexibility.&lt;br&gt;
&lt;strong&gt;TensorFlow&lt;/strong&gt; - Another open source library for numerical computing, which delves into Machine Learning and Artificial Intelligence. TensorFlow is really fast, since its core is written in a combination of C++, Python and CUDA. A trade-off here is that TensorFlow is considered harder to use and to debug.&lt;/p&gt;

&lt;p&gt;These were just a few of the dozens of libraries, tools and languages that share some of SciPy's capabilities. It's hard to make an overall comparison between them because they are all designed to do different things that sometimes overlap. &lt;a href="https://en.wikipedia.org/wiki/Comparison_of_numerical-analysis_software"&gt;Here&lt;/a&gt;, for instance, you can find a comparison among several numerical-analysis programs (including SciPy), whereas &lt;a href="https://en.wikipedia.org/wiki/Comparison_of_statistical_packages"&gt;here&lt;/a&gt; you can find a comparison of different statistical packages (which also include SciPy). Most of them have something they can do better or faster, or have better compatibility with a certain program, but they also all have a downside compared to using Python and SciPy. You have to decide for yourself what is better for the projects you want to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;One of the major advantages of using Python and its tools and libraries is that Python is consistently amongst the most commonly used languages. Its simple syntax and versatility make it a common entry point for beginners, so the user base grows more every year. Since Python is open source, like many of its packages (SciPy included), there's a constantly increasing number of programmers working on its improvement. That makes for more and better tools, more usage and, let's not forget, better documentation. On &lt;a href="https://docs.scipy.org/doc/scipy/tutorial/general.html"&gt;SciPy's documentation page&lt;/a&gt; you can find extensive information on how to use its tools, separated by sub-packages.&lt;br&gt;
With Python being one of the most asked-about programming languages on Stack Overflow, there's also a lot of content being created by the general public. There are tutorials made for it by schools, individual people, and paid websites. I'll list a few free ones here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.w3schools.com/python/scipy/index.php"&gt;W3 School SciPy Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tutorialspoint.com/scipy/index.htm"&gt;TutorialsPoint SciPy Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.javatpoint.com/python-scipy"&gt;Javat Point SciPy Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of my favourites, though, would have to be the Real Python website. Just regarding SciPy, they have in-depth information with very descriptive and didactic posts on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://realpython.com/python-scipy-fft/"&gt;Fourier Transforms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://realpython.com/python-scipy-linalg/"&gt;Linear Algebra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://realpython.com/python-scipy-cluster-optimize/"&gt;Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://realpython.com/python-statistics/"&gt;Statistics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I'm trying to say is, if you're trying to learn SciPy, you're probably not gonna run out of resources. And if you do, just remember, you can always ask for .help().&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>statistics</category>
    </item>
    <item>
      <title>2 Weeks  Down. 13 To Go</title>
      <dc:creator>Moscatena</dc:creator>
      <pubDate>Sun, 05 Dec 2021 19:25:45 +0000</pubDate>
      <link>https://dev.to/moscatena/week-2-down-13-to-go-3ei</link>
      <guid>https://dev.to/moscatena/week-2-down-13-to-go-3ei</guid>
      <description>&lt;h2&gt;
  
  
  The Story So Far
&lt;/h2&gt;

&lt;p&gt;Nine months ago I decided to actively pursue my long-time objective of changing careers and studying something in the Computer Science field.&lt;br&gt;
Eight months ago I learned a little about HTML and CSS.&lt;br&gt;
Seven months ago I learned a little about how JavaScript works.&lt;br&gt;
Six months ago I started learning Python, and really enjoyed it.&lt;br&gt;
Between then and 2 months ago I learned all I could about data types, functions, loops, object-oriented programming, databases, Big-O notation, SQL, and a few other things.&lt;br&gt;
All of this was being done while I was still working a full-time job in hospitality, something that took a huge chunk of my time and energy. Also, this whole time I've been studying by myself. It took a lot of time to filter through a whole bunch of amazing resources such as edX, Codecademy, Coursera, freeCodeCamp, Codewars, LeetCode, and many others. That might sound great, but I can't overstate the amount of effort it takes to sift through all that information, try to come up with a plan that works for you, and keep yourself motivated even when you have no idea what you're doing. I seriously applaud people who can go on like that and continue to improve by themselves. In the long run though, that wasn't me. I felt stuck.&lt;/p&gt;

&lt;h4&gt;
  
  
  Focus
&lt;/h4&gt;

&lt;p&gt;Back when my parents went to university in Brazil, there were fewer options. One could go for the medicine path, the exact science path, the human science path, and apart from a few outliers, that was pretty much it. And those paths didn't have nearly as many branches as they have nowadays. &lt;br&gt;
That may seem like a bad thing, since as an 18-something-year-old thinking that what you do at university is gonna dictate what you do for the rest of your life (which is complete nonsense by the way), you are not given options for everything you could be doing with your life. Nowadays, the number of subjects you can choose for university is much larger, and those options have further options that didn't exist before - If you go into Computer Science, you can become a Full-Stack Developer, a Software Engineer, a Mobile Application Developer, a System Architect, a Machine Learning Engineer, a Data Scientist, and many other careers that I'm sure I have no idea exist, much less what they do.&lt;br&gt;
There are also options other than going to university. Self-taught professionals are not rare in a variety of fields. The data available online for one to self-educate in Data Science, for instance, is extremely rich. It is, and I found this out for myself after registering on all those websites I mentioned, 'too rich' for me. I'm not saying abundance of resources is bad, not at all. I'm saying that, for someone trying to get into a field by themselves, being bombarded with information from a multitude of sources can be extremely overwhelming, and can lead to a state of unhappiness, uncertainty or paralysis with your choice that, had you not had all those choices in the first place, you'd never feel. That thought process is an extreme oversimplification of Barry Schwartz's book &lt;em&gt;The Paradox of Choice&lt;/em&gt;, which is not really the topic of this post, but it's a factor one should be aware of when thinking about how to pursue their education. I'll leave his &lt;a href="https://www.ted.com/talks/barry_schwartz_the_paradox_of_choice?language=en#t-32400"&gt;Ted Talk&lt;/a&gt; from 2005 there if anyone is interested. &lt;br&gt;
Getting back on topic, what can you do if you don't want to spend 4+ years gaining a (or a second) university degree, but you feel paralyzed by the amount of choices the self-taught way gives?&lt;/p&gt;

&lt;h4&gt;
  
  
  In comes Bootcamp
&lt;/h4&gt;

&lt;p&gt;If you google what a data science bootcamp is, you'll learn that they are &lt;em&gt;"short-term, intensive training programs that equip students with in-demand industry knowledge via project-based learning."&lt;/em&gt; That sounds exactly like what I was looking for, so I went and started to look for one. Surprise surprise, there are a &lt;strong&gt;ton&lt;/strong&gt; of them! After considering some of the ones here in the UK, a friend pointed me to one given by Flatiron School in the USA (which is really funny to me, since my last job here was working for a Flat Iron restaurant), and their Data Science syllabus looked much more in line with what I wanted to learn. A few phone calls, some tests and some financing later, I was in. A sneeze later, and we finished our second week! &lt;br&gt;
The course is fast-paced. It is for beginners, but if you don't grasp things quickly, there really isn't a lot of time to keep reviewing a topic. In the first two weeks, we set up our environment using the Bash shell and Git, started interacting with GitHub, configured Conda and VS Code, learned about Jupyter Notebooks, learned about CSV and JSON and how to interact with those schemas using Python, learned about data types, loops and functions, were introduced to Pandas and how to import and access data, applied statistical methods with NumPy, visualized data with Pandas, Matplotlib and Seaborn, grouped and cleaned data, and started on SQL. All that while taking a test a day, and a big one at the end of the second week.&lt;br&gt;
That might not seem like much for someone who already knows how to do these things, but for someone who comes to this with little experience, it's been hard. Hard, but also extremely satisfying. The bootcamp provided me with something I found lacking during my self-study path: structure. Right now, I have to focus on these things. I have to do them. Not only that, but this was my choice, so instead of going on about how I have to do these things, my mindset is: &lt;strong&gt;I chose to do this&lt;/strong&gt;. And I'm happy with this choice. Confident that, even if this is not the single best path in the whole world for me (or the best jeans I could get, if you've seen that Barry Schwartz Ted Talk), I am learning a lot and focused on what I chose to study. I am meeting a lot of interesting new people who are adding to my experience, and I'm positive that this is one of the best decisions I've made in the last few years regarding my career.&lt;br&gt;
2 weeks down, 13 to go.&lt;/p&gt;

</description>
      <category>codenewbie</category>
      <category>devjournal</category>
      <category>datascience</category>
      <category>motivation</category>
    </item>
    <item>
      <title>One Foot in Front of the Other</title>
      <dc:creator>Moscatena</dc:creator>
      <pubDate>Mon, 29 Nov 2021 18:26:52 +0000</pubDate>
      <link>https://dev.to/moscatena/one-foot-in-front-of-the-other-1p29</link>
      <guid>https://dev.to/moscatena/one-foot-in-front-of-the-other-1p29</guid>
      <description>&lt;h2&gt;
  
  
  From Self Improvement to Data Science
&lt;/h2&gt;

&lt;p&gt;It seems to me that everyone's life is a constant struggle between trying to be happy with how they are, and wanting to improve themselves. This conundrum is something I only came to fully grasp in my thirties, and though I could be upset that I haven't figured it out sooner, because that would've been very useful, I'm just happy that I eventually did.&lt;/p&gt;

&lt;p&gt;My whole life I aimed to change things that I thought made me unhappy. I seemed fat, so I needed to lose weight. I was unhappy at home, so I needed to move out. I didn't like studying Architecture, so I needed to leave the course. I didn't like someone, so I needed to block them from my life. And for a long time this made sense. It worked. The results those changes generated were usually good.&lt;br&gt;
The thing is, though, deep down they never felt quite right. Why? Well, I was doing those things for the wrong reasons.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Wrong Reasons
&lt;/h4&gt;

&lt;p&gt;You see, the reason I wanted to lose weight when the doctor jokingly said the word ‘obese’ during an appointment shouldn’t have been ‘because I don’t want to be fat’, but rather ‘I want to be a healthier version of myself’. The reason I wanted to move out of my parents’ house shouldn’t have been ‘because I don’t like living with them’, but rather ‘I want to be somewhere I can be myself’. I shouldn’t have just left Architecture school because I didn’t like the course; rather, I should have found out why I didn’t like it, which areas of it I did like, and whether there were opportunities to learn and work in them. And lastly, just because you don’t like someone, for whatever reason, that shouldn’t lead you to completely cut them out of your life. Rather, try to understand what it is you don’t like, whether your view of something is somehow skewed or biased, or whether you’re actually correct and that is the right choice to make.&lt;/p&gt;

&lt;p&gt;All these scenarios have a couple of things in common: one, the more information you have on why you make these decisions, the better equipped you’ll be to actually make them; and two, at the core of every single one of them, the biggest factor I should have considered is: ‘Will I be a better version of myself if I do this?’&lt;/p&gt;

&lt;h4&gt;
  
  
  Self Improvement
&lt;/h4&gt;

&lt;p&gt;This is where we’ll get to Data Science. I promise. It may seem like a random pivot to a different topic, but hear me out.&lt;br&gt;
This year has been one of the most intense years I’ve ever had. Setting the pandemic aside, I have not been away anywhere for over a year, I have ended a relationship, decided to spend less time with some friends, and quit my job. ‘That’s terrible,’ one may think, but for me, even though some of those things were extremely hard to do, I realised I was doing them for the right reason: to become a better version of myself. And that is the goal I want to pursue from now on. Every day of every year, just being a little bit better than the day before. If that is what I strive for, instead of trying to be happy, trying not to be uncomfortable, trying to be rich or trying to be healthy, those things lose their grip over me, and funnily enough, become easier to achieve. I’m not saying I’m rich and a hundred percent comfortable with my situation (actually quite the opposite), but these things are a process. The more you put in, the more you take out. You reap what you sow, as the proverb says.&lt;/p&gt;

&lt;p&gt;Now what the hell does all this have to do with Data Science?&lt;br&gt;
Everything I described to you today was a process. A process of understanding what is relevant and what the desired outcome is. Of gathering information, filtering it and analyzing it. Of creating a model, making predictions and visualizing outcomes. For those who don’t already know it, that is exactly the Data Science process I’m starting to learn.&lt;br&gt;
Let’s take the losing-weight situation, for instance. The doctor that day, and I remember it clearly, said ‘It’s not like you’re obese or anything, just a bit overweight’. I misread the whole situation and tunnel-visioned: I acted on the little, biased information I had, blocking out any other opinions, even after I was much healthier. If I were to rethink the situation with a Data Scientist’s mindset, I’d have gathered more information, taking care to filter it properly (your grandma saying you look fine and giving you chocolate cookies is not the most reliable source for this). I’d have created a model of how I wanted to look; which clothes I wanted to fit into again, for instance. I’d have made a proper diet and exercise program, and a prediction of how I wanted to be in a certain number of months. Then, having visualized all that, I’d know exactly what I’d have to do to reach those goals.&lt;br&gt;
This may seem like a silly example, but the more I thought about it, the more it made sense, and I managed to find more and more areas where I could apply that train of thought. I now see Data Science as a tool for improvement that can be applied at many levels. Country-wise: what does a country want to look like in ten years, and what relevant information can it gather to create a model that could lead it there? Enterprise-wise: what do I want my company to look like in ten years? Or even on a personal level: what do I want to look like in ten years? (As a last sidenote, I’ve been in therapy for the last seven months, and the process is very similar. Understand where you want to be as a person in the future, gather information on yourself and your relationships, filter out the noise, analyze it, understand yourself, make predictions and make visualizations. Fantastic thing, isn’t it?)&lt;br&gt;
What I love about it is that its applications are endless, and if used right, it can impact people’s lives in a variety of (un)foreseeable ways.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ending Notes
&lt;/h4&gt;

&lt;p&gt;I hope I managed to convey clearly why I’m pursuing this change. It’s not about the maths, the code, the status or the money. It’s about improving myself. Being the best version of myself I can be. And hopefully giving that power to others: to whatever person, enterprise or entity seeks to improve themselves but lacks the tools to do it properly right now.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>devjournal</category>
      <category>motivation</category>
    </item>
  </channel>
</rss>
