DEV Community: Jason Mix

Confidence, Collaboration, and Coding

Jason Mix — Tue, 19 Dec 2023 05:55:39 +0000

We've just wrapped up our first projects for the Data Science bootcamp at Flatiron School. I'm not sure which was more daunting: the prospect of doing my first data science project ever or the fact that this was to be done in a group. But, knowing what I know now, I wouldn't have had it any other way.

As the project was assigned and explained, my stomach did a flip. Sure, we were given a business question to answer and a data set to use, and, sure, we were given some instructions for how to approach the project. However, this was quite different from anything I had done before. This was not a lab or a guided project in a Jupyter notebook that led me through someone else's thought process for how to answer the business question. There was no solution branch in a GitHub repository to check our approach or our code against. There were no step-by-step instructions for cleaning the data. We were not assigned a statistic to find or a visual to create--we had to decide for ourselves what statistics and visualizations would be useful for answering the business question.

I did not doubt my ability to write code that would remove null values or impute a central value. I knew I could create visualizations in Pandas, Matplotlib, Seaborn, or Tableau alike. I could calculate statistics and create new columns in a Pandas DataFrame with the best of 'em. I even felt like I had a good understanding of the need to create normalized statistics in order to compare differently-sized data subsets. Yet, with all my skills and confidence, I found myself having a small panic attack when I was faced with the prospect of an open-ended assignment such as this.

I soon found my anxiety at the novelty of this assignment was compounded by the fact that we were working in a group. We did not have a project manager assigning tasks and roles and imposing deadlines. We were on our own to manage ourselves. I was used to working on my own--if I encountered a difficulty I knew I could work through it or find helpful resources. Working in a group, however, meant that we encountered these difficulties as a group. Troubleshooting an issue became much more complicated because we had to communicate the issue amongst ourselves, discuss and make sure everyone was understanding the issue the same way, and then assign roles in addressing it. I found that I could not rely solely on the brainpower and intuition that had gotten me to this point--I needed to communicate, collaborate, and occasionally acquiesce to the majority opinion of my group.

The goal of the project was to provide actionable insights to a businessperson looking to invest in airplanes. We were to use a dataset of aircraft accidents since 1962 to make recommendations that would minimize the investor's risk.

We hit a snag almost right away as a group. I felt that our first task was to clean and filter the data to create a master data set that we would all work from. Other groupmates wanted to throw the dataset into Excel in order to come up with some preliminary findings, get a sense of where we were going, and pick a direction for our project. We would try to make decisions about certain details of the data cleaning that would lead to questions about where our project was going to go, which would lead back to, "we don't know where our project is going to go because we don't have a master data set yet".

Eventually, to my delight, this all came together. I was able to convince my groupmates of the importance of cleaning the data before determining what the data was going to tell us. At the same time, being able to quickly look at subsets of the data in Excel allowed us to make better decisions about the data cleaning process. For instance, as our dataset included many data points for aircraft other than airplanes (e.g., blimps, hot air balloons, etc.), we needed to filter our dataset to only include airplanes. However, there were many data points that did not specify the type of aircraft. It would have been easy enough to just drop all the rows that had a missing value in the 'Aircraft.Category' column. However, in Excel, it was easier to see a way around losing so much data (thousands of data points, in fact). We noticed that there were many data points where the type of aircraft was missing, but the make and model were present. If we could find another row that specified that make and model as an airplane, then we could deduce that this aircraft was an airplane as well.

This was easy enough to do with Python and Pandas. We made lists of all the makes and models where the 'Aircraft.Category' column specified 'Airplane'. Then we went row by row, checking if the 'Aircraft.Category' column was empty and if the make and model were in the aforementioned lists. If all of the above were true, then we would impute 'Airplane' for the 'Aircraft.Category'. See the code below:

import pandas as pd

# df is the accident DataFrame we had already loaded from the dataset

# First we filtered the DataFrame to only include rows where the
# Aircraft.Category column specified Airplane
airplane_df = df[df['Aircraft.Category'] == 'Airplane']

# Then we made respective lists of the makes and models of those
# airplanes
airplane_make_list = airplane_df['Make'].tolist()
airplane_model_list = airplane_df['Model'].tolist()

# Here we defined a function that checks for airplanes with the
# above makes and models that have a missing value for
# 'Aircraft.Category' and imputes 'Airplane' where appropriate
def replace_airplane(row):
    if (pd.isnull(row['Aircraft.Category'])
            and row['Make'] in airplane_make_list
            and row['Model'] in airplane_model_list):
        return 'Airplane'
    else:
        return row['Aircraft.Category']

# Then we applied that function to the original DataFrame
df['Aircraft.Category'] = df.apply(replace_airplane, axis=1)

# Finally, we were able to filter for just airplanes without losing
# thousands of data points needlessly
airplane_df2 = df[df['Aircraft.Category'] == 'Airplane']
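One caveat with the row-by-row version: it checks the make and the model against two separate lists, so a row could be imputed as an airplane even if that exact make and model combination never appears together as an airplane. Below is a minimal sketch of a variant that matches on the (make, model) pair instead. It assumes the same column names as above; the function name `impute_airplane_category` and the toy data are made up for illustration and are not from our actual project.

```python
import pandas as pd


def impute_airplane_category(df: pd.DataFrame) -> pd.DataFrame:
    """Fill a missing 'Aircraft.Category' with 'Airplane' when the exact
    (Make, Model) pair is labeled 'Airplane' somewhere else in the data."""
    out = df.copy()  # leave the caller's DataFrame untouched

    # Collect every (Make, Model) pair that is explicitly labeled as an airplane
    known = out.loc[out["Aircraft.Category"] == "Airplane", ["Make", "Model"]]
    airplane_pairs = set(zip(known["Make"], known["Model"]))

    # True for each row whose (Make, Model) pair is a known airplane pair
    pair_is_airplane = pd.Series(
        [pair in airplane_pairs for pair in zip(out["Make"], out["Model"])],
        index=out.index,
        dtype=bool,
    )

    # Only fill rows where the category is actually missing
    mask = out["Aircraft.Category"].isna() & pair_is_airplane
    out.loc[mask, "Aircraft.Category"] = "Airplane"
    return out


# Hypothetical toy data, not rows from the real dataset
toy = pd.DataFrame({
    "Make": ["MakeA", "MakeB", "MakeA", "MakeA"],
    "Model": ["X1", "Y2", "X1", "Y2"],
    "Aircraft.Category": ["Airplane", "Airplane", None, None],
})
result = impute_airplane_category(toy)
print(result)
```

In the toy data, the third row (MakeA, X1) is filled in because that pair is a known airplane, while the fourth row (MakeA, Y2) stays missing because that combination is never labeled, even though both values appear separately among the airplanes.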

This is such a powerful example of the benefit of working in a group. Although we clashed at first, everyone brought a different set of skills and ideas. This allowed us to create a more thorough and accurate dataset and, therefore, more thorough and accurate insights.

I am so grateful for the experience of our first project. Even though I initially felt overwhelmed by the open-ended nature of the project, now I feel like if I got a similar project I would intuitively know what to do. This is a byproduct of being thrown into the deep end, so to speak. I've had this experience of getting a project, not quite knowing how to proceed, yet persevering and getting the project done one step at a time. I feel that this was an initiation that I can continue to build on in future projects.

Despite the fact that our group clashed at times, I know that I learned and grew from the experience. Our differences of opinions and intuitions about how to proceed ended up being a benefit. Furthermore, we learned how to communicate and assign roles/tasks. I very much look forward to our next project--I can't wait to continue to build my confidence and skills in collaboration as well as coding.

To Infinite Loops, Beyond, and Now Data Science

Jason Mix — Mon, 27 Nov 2023 23:12:20 +0000

1, 2, 4, 8, 16, 32, 64, 128,...256,...512...

I once again paced around the room trying to see how many "doubles" I could do in a row. I was in second grade and I did not yet know what "powers of 2" were, but I loved playing with numbers in my head, seeing how far I could go.

Fast forward to today: I am starting my second week of Flatiron School's Data Science Bootcamp and I am loving the challenge. I will give a short(ish) explanation of how I got here.

College

In my college days, I was a mathematics major at Wesleyan University (go Cards!). At first, I loved evaluating integrals--it was like solving little puzzles, but in some ways more of an art than a science. Then things started to feel unfamiliar: math was no longer about "doing problems" in order to get "the answer." Rather, it was about understanding ideas. My homework solutions were now a series of paragraphs, sometimes quite lengthy, proving some claim or theorem to be true (or false). Algebra and Topology were fascinating, but in some ways I was more interested in the logic and methods used to prove or know things.

When I took Introduction to Programming in college (taught in Python!), I was delighted to find out that these sorts of meta-mathematical ideas were very applicable to computer science. Understanding Boolean operators and symbolic logic, as well as how to combine operators and create truth tables, was an essential skill. In math, proof by induction, one of my favorites, involves declaring a "base case", proving a statement P to be true for the base case, and then proving that if P is true for an integer n, then it is true for n+1. These two pieces combine to show that P is true for all integers n greater than or equal to the base case. Any programmers reading this will know that what I have just described is an infinite loop (for non-programmers, think about taking "wash, rinse, and repeat" literally), but the concept was very helpful for understanding while loops. This course was a preview of the way in which my mathematical interests could be applied in the tech world.
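To make the analogy a bit more concrete, here is a minimal sketch of the "base case, then n to n+1" pattern written as a while loop, using the doubling game from the start of this post. The names `doubles` and `limit` are made up for illustration. The `limit` is what keeps this loop from running forever, unlike the induction argument, which covers every integer.

```python
def doubles(limit: int) -> list[int]:
    """Return the powers of 2 that do not exceed limit, built one step at a time."""
    values = []
    current = 1              # base case: 2**0
    while current <= limit:  # without this condition the loop would never stop
        values.append(current)
        current *= 2         # inductive step: from 2**n to 2**(n+1)
    return values


print(doubles(512))  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
```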

After College

The first several years of my post-college career have led me on a winding path to Flatiron and Data Science. Initially I was interested in the actuarial field. I took the first three exams for the Society of Actuaries' certification: I really enjoyed diving deep into probability and annuities as well as learning the math behind calculating prices for stock options. Years later, I worked as an actuarial intern--this was my first introduction to data and data science. I was no longer learning theory--instead I was dealing with huge Excel tables of insurance policy premiums and digging into VBA and SQL code to determine how things were being calculated. This was interesting, but I was a bit awestruck by interns and actuaries who had a background in data science. They seemed like wizards to me. I would be fascinated to delve into the applications of data science to the actuarial field.

At a certain point in this journey I became interested in education. I worked in after-school centers and charter schools and took some postsecondary coursework in mathematics education. I was fascinated to learn, in addition to the statistics and programming content being taught in schools, that data has become essential to the field. Whether in academic studies, policy-making, or charting individual student performance and in-the-classroom decision-making, data is ubiquitous. It is used to create an in-depth picture of a student's or class's progress in many areas. Statistics regarding the performance of individual schools or districts are paramount for politicians and policy-makers. Quantitative studies rely on data and statistics to determine the best teaching practices as well as the effects of environmental factors on academic performance. While I found that classroom teaching was not my forte, the importance of data and statistics to the world of education has stuck with me.

Onward...to Data Science!

Life has no roadmap. No one can tell you with certainty what you should do or where you should go. At times I've felt like I'm walking along a Möbius strip, traversing a winding and looping path and ending up in a place I've been before, yet changed and reoriented by the journey. Recently I found myself at a transition point with no obvious next step. Yet, everything I mentioned above, all the signs and glimpses of my future, has gently pushed me into a new field: data science. It is up to me to trust my instinct, believe in myself, and rise to the challenge.