<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jesusbaquiax</title>
    <description>The latest articles on DEV Community by jesusbaquiax (@jesusbaquiax).</description>
    <link>https://dev.to/jesusbaquiax</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F656609%2F22a4e462-9bd4-40d9-84a1-e4838e451465.png</url>
      <title>DEV Community: jesusbaquiax</title>
      <link>https://dev.to/jesusbaquiax</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jesusbaquiax"/>
    <language>en</language>
    <item>
      <title>Updating your GitHub Personal Access Token for Windows</title>
      <dc:creator>jesusbaquiax</dc:creator>
      <pubDate>Wed, 08 Dec 2021 19:23:07 +0000</pubDate>
      <link>https://dev.to/jesusbaquiax/updating-your-github-personal-access-token-for-windows-38p7</link>
      <guid>https://dev.to/jesusbaquiax/updating-your-github-personal-access-token-for-windows-38p7</guid>
      <description>&lt;p&gt;Happy December everyone! My Personal Access Token (PAT) for GitHub recently expired and I completely forgot how to update it. After spending about thirty minutes reading different online articles that had a variety of methods to update your PAT. I want to provide a condensed set of instructions that will hopefully be helpful to anyone whose PAT has recently expired. These instructions are for Windows and allows you to cache your PAT so that you don't have to input it every time you git clone/ git push a new repo.&lt;/p&gt;

&lt;h2&gt;Step 1) Create Personal Access Token&lt;/h2&gt;

&lt;p&gt;GitHub provides a clear and succinct set of instructions for creating a PAT on their website. After creating your new PAT, do not close the window that contains it until you have finished these steps. The instructions are available &lt;a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Step 2) Add Credentials to Windows Credential Manager&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Go to Credential Manager&lt;/li&gt;
&lt;li&gt;Click on "Windows Credential"&lt;/li&gt;
&lt;li&gt;Click on "Add a generic credential"&lt;/li&gt;
&lt;li&gt;For "Internet or network address" Type in: 'git:&lt;a href="https://github.com"&gt;https://github.com&lt;/a&gt;'&lt;/li&gt;
&lt;li&gt;For "User name" type in 'GitHub User Name'&lt;/li&gt;
&lt;li&gt;For "Password" paste in 'Personal Access Token'&lt;/li&gt;
&lt;li&gt;Click OK&lt;/li&gt;
&lt;li&gt;Rejoice! You Are Done!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the previous GitHub credential is still listed under "Generic Credentials", you can update the PAT by clicking on the GitHub credential, clicking the "Edit" button, and pasting the new PAT into the password field.&lt;/p&gt;

&lt;p&gt;To ensure that everything was done successfully, try a git push of a local repo to GitHub. If you get no error messages, the update was a success. Let me know if this was helpful or not. Thank you!&lt;/p&gt;

</description>
      <category>github</category>
      <category>beginners</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>Acknowledging Societal Bias as a Data Scientist</title>
      <dc:creator>jesusbaquiax</dc:creator>
      <pubDate>Tue, 14 Sep 2021 00:03:30 +0000</pubDate>
      <link>https://dev.to/jesusbaquiax/acknowledging-societal-bias-as-a-data-scientist-2of6</link>
      <guid>https://dev.to/jesusbaquiax/acknowledging-societal-bias-as-a-data-scientist-2of6</guid>
      <description>&lt;p&gt;One of the many aspects that a data scientists must deal with on a day to day basis is addressing statistical bias within the models and datasets they use. Another aspect of the work that is less often mentioned is addressing societal bias on statistical and machine learning predictions. Mitchell et al's article: &lt;a href="https://arxiv.org/abs/1811.07867"&gt;"Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions"&lt;/a&gt; attempts to offer a concise reference for thinking through the choices, assumptions and fairness considerations of predictions-based decision systems. &lt;/p&gt;

&lt;p&gt;As prediction-based and decision-making machine learning models continue to grow and interweave themselves into the societal fabric of day-to-day life, it has become all the more important to investigate whether any bias, statistical or societal, is present. Mitchell et al. use the two real-world examples of pretrial risk assessment and lending models to summarize the definitions and results that have been formalized to date. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wHt4Ux5F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aij7vsh8zpj7zgj3j3ym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wHt4Ux5F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aij7vsh8zpj7zgj3j3ym.png" alt="types of Bias"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The article highlights three examples of societal bias within predictive modelling. The first is the data: even if the data were representative and accurate, they could perpetuate social inequalities that run counter to the decision maker's goals. In the case of pretrial risk assessments, using arrests as a measure of crime can introduce statistical bias from measurement error that is differential by race because of a racist policing system.&lt;/p&gt;

&lt;p&gt;This bias is compounded because statistical and machine learning models are designed to identify patterns in the data used to train them and, to an extent, reproduce unfair patterns. Bias also enters through the choices a data analyst makes in modelling, such as the class of model, which metrics to focus on for interpretability, and which covariates to include. &lt;/p&gt;

&lt;p&gt;A final choice in mathematical formulations of fairness is the axes along which fairness is measured. In most cases, deciding how attributes map individuals to these groups is important and highly context specific. In regulated domains such as employment, credit, and housing, these so-called “protected characteristics” are specified in the relevant discrimination laws. Even in the absence of formal regulation, though, certain attributes might be viewed as sensitive, given specific histories of oppression, the task at hand, and the context of a model’s use.&lt;/p&gt;

&lt;p&gt;For a junior data scientist, this paper can be informative both in the job application process and in understanding a company's work culture. One can present oneself as a stronger candidate by acknowledging, during the interview, the implicit societal bias in the data and models one will work with. Even if one doesn't fully comprehend all the factors at play, a recruiter will appreciate a fresh approach, or the fact that you, as a candidate, have domain knowledge about the industry. Knowing how a company addresses societal bias can also be a helpful indicator of the work culture at a company one is interested in applying to. &lt;/p&gt;

&lt;p&gt;As the two main examples in the article show, there are many important real-life applications where the policies and decisions made could have adverse effects on certain demographics but not others. As American society becomes more cognizant of the societal inequalities endemic in its systems and structures, it is becoming more important to also be aware of the bias in machine learning models that play decision or prediction roles.&lt;/p&gt;

&lt;h3&gt;Citation&lt;/h3&gt;

&lt;h5&gt;Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions&lt;/h5&gt;

&lt;p&gt;Shira Mitchell, Eric Potash, Solon Barocas, Alexander D'Amour, Kristian Lum&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Logistic Regression Workflow Guide</title>
      <dc:creator>jesusbaquiax</dc:creator>
      <pubDate>Tue, 10 Aug 2021 12:19:16 +0000</pubDate>
      <link>https://dev.to/jesusbaquiax/logistic-regression-workflow-guide-ni1</link>
      <guid>https://dev.to/jesusbaquiax/logistic-regression-workflow-guide-ni1</guid>
      <description>&lt;p&gt;If you are one of the many whose first time working on a logistic regression project is through a boot camp program, you might feel overwhelmed about where to start to begin when working on your project. This guide will hopefully serve as a beacon of light in order for you to better approach your project and spend less time trying to figure out your approach and more time practicing and improving on your understanding of logistic regression modelling.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;I'll be using the project I worked on during my intensive bootcamp program: the Terry Stop dataset. This tutorial highlights the steps one should take when building a predictive logistic regression model through multiple iterations.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;Dataset&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;I used the &lt;a href="https://catalog.data.gov/dataset/terry-stops"&gt;Terry Stop dataset&lt;/a&gt; that is publicly available on the seattle.gov website. This data represents records of police-reported stops under Terry v. Ohio, 392 U.S. 1 (1968). Each row represents a unique stop. Each record contains perceived demographics of the subject, as reported by the officer making the stop, and officer demographics as reported to the Seattle Police Department for employment purposes. Where available, data elements from the associated Computer Aided Dispatch (CAD) event (e.g. Call Type, Initial Call Type, Final Call Type) are included. I chose the 'arrest flag' column as my target and included the demographic, frisk, police officer YOB, and date columns as my features.&lt;/p&gt;
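&lt;p&gt;As a sketch, the feature/target split described above might look like the following, using a tiny hypothetical frame in place of the real seattle.gov export (the column names and rows here are illustrative, not the dataset's exact headers):&lt;/p&gt;

```python
import pandas as pd

# Tiny hypothetical stand-in for the Terry Stop CSV export;
# in practice this frame would come from pd.read_csv on the download.
df = pd.DataFrame({
    "Subject Perceived Race": ["White", "Black", "Asian", "White"],
    "Frisk Flag": ["Y", "N", "N", "Y"],
    "Officer YOB": [1985, 1990, 1978, 1982],
    "Arrest Flag": ["N", "Y", "N", "N"],
})

# 'Arrest Flag' is the target; everything else is a feature.
y = df["Arrest Flag"]
X = df.drop(columns=["Arrest Flag"])
print(X.shape, y.shape)
```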

&lt;p&gt;First Decision - Deciding which columns you will use in your modelling. You can mix and match and see which combinations work better than others. For my project, I decided to stick to one set of features; you could instead create one data frame with just demographics and precinct and another with demographics and date reported, and compare the two.&lt;/p&gt;

&lt;p&gt;Second Decision - Deciding how you will clean your data. This will change depending on which columns you use. Basic data cleaning includes removing NaN values and consolidating similar values into one value.&lt;/p&gt;
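&lt;p&gt;A minimal cleaning sketch of those two steps, on made-up rows (the placeholder spellings "-" and "Unknown" are illustrative, not the dataset's exact values):&lt;/p&gt;

```python
import pandas as pd

# Made-up rows for illustration: one NaN and two placeholder spellings.
df = pd.DataFrame({
    "Subject Perceived Race": ["White", None, "Unknown", "Black"],
    "Frisk Flag": ["Y", "N", "-", "N"],
})

# Remove rows containing NaN values.
df = df.dropna()

# Consolidate similar placeholder values into a single label.
df = df.replace({"-": "Unknown"})
print(df)
```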

&lt;p&gt;Third Decision - Deciding which predictive models you will use. I used LogisticRegression, KNeighborsClassifier, RandomForestClassifier, and DecisionTreeClassifier models. I used pipelines to expedite the process and cut back on code. &lt;/p&gt;
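&lt;p&gt;A sketch of the pipeline approach on toy data, using scikit-learn's Pipeline and ColumnTransformer with two of the models for brevity (my project used imblearn's Pipeline so SMOTE could be a step; this simplified version omits it, and the feature names are made up):&lt;/p&gt;

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data: one categorical feature, one numeric, binary target.
X = pd.DataFrame({
    "race": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "officer_yob": [1980, 1990, 1985, 1975, 1992, 1988, 1979, 1983],
})
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Shared preprocessing: one-hot encode categoricals, scale numerics.
prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["race"]),
    ("num", StandardScaler(), ["officer_yob"]),
])

# One pipeline per candidate model keeps the comparison loop short.
results = {}
for name, clf in [("logreg", LogisticRegression()),
                  ("tree", DecisionTreeClassifier(random_state=0))]:
    pipe = Pipeline([("prep", prep), ("clf", clf)])
    pipe.fit(X, y)
    results[name] = pipe.score(X, y)
print(results)
```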

&lt;p&gt;Fourth Decision - If you use a grid search, deciding which parameters to tune and how many options to provide for each parameter of each model. This might be limited by your hardware capabilities. &lt;/p&gt;
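&lt;p&gt;A small grid search sketch on synthetic data; the grid values here are illustrative, and the point is that total fits equal grid combinations times cross-validation folds, so each extra option multiplies runtime:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Keep the grid small: total fits = (grid combinations) x (cv folds).
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```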

&lt;p&gt;Fifth Decision - Deciding which model will be your final one. This depends on many factors about your dataset. In this project, there was a class imbalance in my target column, so I focused on precision, recall, and F1 score.&lt;/p&gt;
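&lt;p&gt;A quick illustration of why those metrics matter under class imbalance: a baseline that always predicts the majority class gets high accuracy while catching no positives at all (the 90/10 split below is made up):&lt;/p&gt;

```python
from sklearn.metrics import f1_score, recall_score

# Imbalanced ground truth: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
# A naive baseline that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                   # 0.9, yet useless
print(recall_score(y_true, y_pred))               # 0.0: no positives caught
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0
```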

&lt;p&gt;Sixth Decision - Deciding which visualizations you will use. Ideally, you want visualizations that are easy to read, add to your story, and are catered to the stakeholder.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;Libraries&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Many of the libraries imported below should be familiar to you by now. If you are interested, you can look at the documentation to learn more about each. I'll briefly go over the lesser-known ones that I used. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;I used &lt;em&gt;datetime, date&lt;/em&gt; to transform the 'date reported' column into a datetime64[ns] dtype and then created two new columns with the month and the year respectively.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I used matplotlib's mdates, cbook, and mtick modules, along with calendar, to transform month numbers into names when creating the visualizations of Terry stops by month and year.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
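&lt;p&gt;The datetime transformation described above can be sketched like this (the column name and dates are illustrative):&lt;/p&gt;

```python
import pandas as pd

# Illustrative stand-in for the dataset's date-reported string column.
df = pd.DataFrame({"Reported Date": ["2020-01-15", "2020-06-03"]})

# Convert to datetime64[ns], then derive month and year columns.
df["Reported Date"] = pd.to_datetime(df["Reported Date"])
df["month"] = df["Reported Date"].dt.month
df["year"] = df["Reported Date"].dt.year
print(df[["month", "year"]])
```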

&lt;pre&gt;&lt;code&gt;import pandas as pd
import numpy as np
from datetime import datetime, date
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, plot_confusion_matrix, \
    accuracy_score, recall_score, precision_score, f1_score, log_loss, roc_auc_score, roc_curve
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import matplotlib.ticker as mtick
import calendar
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Below is the link to the project repository on GitHub. You'll find markdown notes for almost every cell of code and an explanatory narrative of my process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/jesusbaquiax/dsc_phase3_project"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Hopefully this guide serves as a basic outline as you begin your logistic regression project. Please feel free to reach out if you have any questions or comments about anything!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Jesus Baquiax: Using Data Analytics to Inform Educational Approaches</title>
      <dc:creator>jesusbaquiax</dc:creator>
      <pubDate>Tue, 29 Jun 2021 23:20:26 +0000</pubDate>
      <link>https://dev.to/jesusbaquiax/jesus-baquiax-using-data-analytics-to-inform-educational-approaches-2mg1</link>
      <guid>https://dev.to/jesusbaquiax/jesus-baquiax-using-data-analytics-to-inform-educational-approaches-2mg1</guid>
      <description>&lt;p&gt;Over the past 6 years, I have worked in an Operations role at two different charter schools in Brooklyn. I had a variety of roles and responsibilities, but the favorite part of my job was always the data analysis of student investment, academic and behavior data, and teacher performance using google spreadsheets and excel. I enjoyed the google search for a method that someone else used that would work with the issue at hand and I would get a rush of excitement whenever I would figure out a long formula I was working on. &lt;/p&gt;

&lt;p&gt;I first learned about data science from a co-worker who was always impressed with what I was doing with Google Sheets and strongly encouraged me to look into data analytics bootcamp programs. My co-worker also mentioned that someone they knew had taken such a program, enjoyed it, and was able to get a job afterwards. As I began doing my research, I realized that data science was similar to what I was already doing, but in a different and more robust "language" and at a much higher level of difficulty and intensity.&lt;/p&gt;

&lt;p&gt;I want to explore how data science can help schools and the education technology industry improve in order to better support the students they serve, especially in lower-income communities. &lt;/p&gt;

&lt;p&gt;One specific question I want to explore is how New York City (NYC) charter schools have performed over time on the New York State Exam (NYSE). The NYSE is a required exam that all NYC Department of Education (DOE) public schools have to administer, and it is used to determine whether a student will be promoted to the next grade. I don't believe an analysis exists that answers this question, and I think it is important 1) to see whether charter schools are a viable option for parents who want their children to be well educated and 2) to see which charter schools have performed more exceptionally than others. &lt;/p&gt;

&lt;p&gt;Flatiron School was the program my co-worker recommended to me and when I began looking deeper into the program, the two main factors that appealed most to me were its job application network and full-time program. Flatiron School has multiple locations across the U.S. and access to their respective job market and having the option to apply to various locations with the support and network Flatiron school offers will provide me an advantage when I begin applying for jobs. I was also intrigued by the schedule and curriculum that their full time program offers. I have discovered that the best way for me to learn a new skill is to fully dedicate myself to the absorption of the material and not get distracted by other commitments.&lt;/p&gt;

&lt;p&gt;In the future, I want to apply the data analytics skills I learn from this program as a data analyst, or in some similar capacity, for an education technology company, a large charter school network, or a state department of education.&lt;/p&gt;

&lt;p&gt;Hopefully I'll have more exciting news and updates to offer in the coming weeks and months. Stay tuned!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jesús Baquiax&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
