<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sarima Chiorlu </title>
    <description>The latest articles on DEV Community by Sarima Chiorlu  (@joy_chiorlu).</description>
    <link>https://dev.to/joy_chiorlu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F588480%2Feaf48b4f-9d58-4634-9114-6ea34dceffb9.png</url>
      <title>DEV Community: Sarima Chiorlu </title>
      <link>https://dev.to/joy_chiorlu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joy_chiorlu"/>
    <language>en</language>
    <item>
      <title>Everyone Struggles</title>
      <dc:creator>Sarima Chiorlu </dc:creator>
      <pubDate>Tue, 16 Jan 2024 12:49:51 +0000</pubDate>
      <link>https://dev.to/joy_chiorlu/everyone-struggles-4ggm</link>
      <guid>https://dev.to/joy_chiorlu/everyone-struggles-4ggm</guid>
      <description>&lt;p&gt;Everyone struggles!! The earlier we normalize this, the quicker we get over things and the fact that we are perfect beings because we are &lt;strong&gt;not&lt;/strong&gt;. We struggle over the simplest details. For example, starting this article was a big struggle for me. Why? I wanted a good and perfect intro, but the moment I let go and just began writing, it all started coming together.&lt;/p&gt;

&lt;p&gt;Like learning how to ride a bicycle or how to swim: when we first pick it up, it seems impossible, but as we progress, it turns out to be a whole lot easier than we had imagined. This is very similar to my internship experience. I recall that when I was given my first task, I struggled with a lot of things during the first few days, trying to figure them out myself and failing to do so. Imposter syndrome kicks in: do I deserve to be here? Am I undeserving of this? But what we do not understand is that everyone goes through this; it isn't just about you. You will realize that a little guidance from a mentor or a friend sheds more light on the concept, and it becomes easier to do. This describes my first week at Ersilia. I had a lot of information to look through to successfully incorporate a model: going through the publication, looking at source code, studying the packages used to build the model, and understanding Ersilia's code base. I put an expectation on myself to deliver, and deliver perfectly, within a short time frame, so I tried to absorb all the information I needed quickly. The moment I let go and took each task as it came, reaching out when stuck, it all became easier. Like learning to ride a bicycle, after my first model incorporation, my second and later tasks became a whole lot easier.&lt;/p&gt;

&lt;p&gt;We all get the urge to deliver exceptionally well, or the nerves on the first day at a job, and both make us struggle. However, the moment we accept that we are not the first to feel this way and take each stage in bits, it becomes a whole lot easier. As usual, catch me later as I share bits of my progress during this internship. Till then, adieu.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>learning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Outreachy: My Internship Journey</title>
      <dc:creator>Sarima Chiorlu </dc:creator>
      <pubDate>Sat, 09 Dec 2023 16:45:15 +0000</pubDate>
      <link>https://dev.to/joy_chiorlu/outreachy-my-internship-journey-2l1</link>
      <guid>https://dev.to/joy_chiorlu/outreachy-my-internship-journey-2l1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I guess you've found me! I am Sarima Chiorlu, a software engineer based in Lagos, Nigeria, and currently an Outreachy intern at Ersilia. I mostly write Python, with a focus on machine learning. I enjoy writing code, debugging, and working with data, which has led me to try various careers along the data path. When I am not writing code, you can catch me listening to music, reading novels, or watching shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  My core values
&lt;/h2&gt;

&lt;p&gt;My core values are honesty, curiosity, and empathy. I would say these describe me as a person, and I'll explain why below:&lt;/p&gt;

&lt;p&gt;As a software engineer, and in general, I have found that honesty is the foundation upon which trust is built, and I thrive on creating an environment where communication is open. It encourages people to be forthright about the challenges they face and the solutions they offer, so that we can work together to build something that makes an impact on society.&lt;/p&gt;

&lt;p&gt;I have the notion that curiosity is the bedrock of innovation. A curious mind constantly pursues knowledge, explores new ways of doing things, embraces unconventional solutions, etc. This refines your skills and helps you grow in new areas that you haven't yet explored. I try to promote curiosity and speak most about this when talking to people.&lt;/p&gt;

&lt;p&gt;The world is difficult as it is; trying to understand a person and provide solutions to their problems makes it a little more tolerable. This is what inspires me when I create software: building things that improve the livelihood of society.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I applied for Outreachy
&lt;/h2&gt;

&lt;p&gt;Since I began software development, I have had the opportunity to be a research assistant and have dealt mainly with research. However, I was looking for a chance to apply my skills to a real-world project and to work with mentors and people who have been in the programming field for a while, gaining knowledge and also being a source of knowledge to others.&lt;br&gt;&lt;br&gt;
Outreachy provided me with the opportunity to connect with people, learn about more cultures and languages, and improve my professional network.&lt;/p&gt;

&lt;p&gt;I hope to connect with you more and tell you about my experience in the next 12 weeks. Until then, have a good day.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Regression in machine learning</title>
      <dc:creator>Sarima Chiorlu </dc:creator>
      <pubDate>Mon, 08 Aug 2022 16:12:22 +0000</pubDate>
      <link>https://dev.to/joy_chiorlu/regression-in-machine-learning-3pcp</link>
      <guid>https://dev.to/joy_chiorlu/regression-in-machine-learning-3pcp</guid>
      <description>&lt;p&gt;One of the most popular uses of machine learning models, particularly in supervised machine learning, is to solve regression problems. The relationship between an outcome or dependent variable and independent variables is something that algorithms are trained to grasp. In laymans terms, it means fitting a function from a specified family of functions to the sampled data under some error function. Prediction, forecasting, time series modeling, and establishing the causal connection between variables are its key uses.&lt;/p&gt;

&lt;p&gt;This fitting of a function serves two purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estimating missing data within your data range&lt;/li&gt;
&lt;li&gt;Estimating future data outside your data range&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most common application, though, is predicting future data outside your data range after the model has been trained. Machine learning regression is similar to the line of best fit from linear algebra. Let's take a trip down memory lane to elementary mathematics: we were given X and y points and asked to plot a linear graph, and then, in exercises, asked to find the value of y for which x is 6. I believe we all went ahead and plotted our graph, then looked up the corresponding value of y. This is very similar to our regression problem, except now we want the machine to do it for us. X here stands for the variables we use to predict y, while y is what we want to find out. Let's keep this basic understanding in mind before we move forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terminologies used in Regression
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependent Variable (Y):&lt;/strong&gt; The dependent variable is the main factor in regression analysis that we wish to predict or understand. It is also known as the target value or label.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent Variable (X):&lt;/strong&gt; The independent variables are the elements that influence the dependent variable, or that are used to predict its values. They are usually referred to as our features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outliers:&lt;/strong&gt; An outlier is an observation with either a very low or a very high value in comparison to the other observed values. Outliers should be handled carefully, as they can distort the outcome.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multicollinearity:&lt;/strong&gt; If the independent variables are highly correlated with one another, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking which variable affects the outcome the most.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Underfitting and Overfitting:&lt;/strong&gt; If our algorithm works well on the training dataset but not on the test dataset, the problem is called &lt;strong&gt;overfitting&lt;/strong&gt;. If our algorithm does not perform well even on the training dataset, the problem is called &lt;strong&gt;underfitting&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
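&lt;p&gt;To make the last point concrete, here is a minimal sketch (on synthetic data, with invented numbers) of how overfitting shows up in practice: an unconstrained decision tree scores almost perfectly on its training data but noticeably worse on held-out data.&lt;/p&gt;

```python
# Illustrative sketch of overfitting: a very deep tree memorizes noisy
# training data but does worse on held-out data (synthetic numbers).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 300)  # smooth signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit: the tree can grow one leaf per training point
deep = DecisionTreeRegressor(max_depth=None).fit(X_train, y_train)
print(deep.score(X_train, y_train))  # near-perfect on training data
print(deep.score(X_test, y_test))    # noticeably lower on test data
```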

&lt;h2&gt;
  
  
  Uses of &lt;strong&gt;regression models&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Common uses for machine learning regression models include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forecasting continuous outcomes like house prices, stock prices, or sales.&lt;/li&gt;
&lt;li&gt;Predicting the success of future retail sales or marketing campaigns to ensure resources are used effectively.&lt;/li&gt;
&lt;li&gt;Predicting customer or user trends, such as on streaming services or e-commerce websites.&lt;/li&gt;
&lt;li&gt;Analyzing datasets to establish the relationships between variables and output.&lt;/li&gt;
&lt;li&gt;Predicting interest rates or stock values based on a multitude of factors.&lt;/li&gt;
&lt;li&gt;Creating time series visualizations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Types of regression analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's now discuss the various methods through which we can perform regression.&lt;/p&gt;

&lt;p&gt;Regression can be carried out using a variety of well-known techniques in machine learning. The various methods may use different numbers of independent variables or handle different kinds of data. Different kinds of regression models may also assume a different relationship between the independent and dependent variables. Linear regression techniques, for example, presume that the relationship is linear and would be ineffective on nonlinear datasets.&lt;/p&gt;

&lt;p&gt;Types of regression models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple Linear Regression&lt;/li&gt;
&lt;li&gt;Multiple Linear Regression&lt;/li&gt;
&lt;li&gt;Logistic Regression&lt;/li&gt;
&lt;li&gt;Support Vector Regression&lt;/li&gt;
&lt;li&gt;Decision Tree Regression&lt;/li&gt;
&lt;li&gt;Random Forest Regression&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Simple linear regression
&lt;/h3&gt;

&lt;p&gt;Simple linear regression is an approach that fits a straight line through the data points so as to minimize the error between the line and the points. In this scenario, the connection between the independent and dependent variables is assumed to be linear. The method is straightforward because it investigates the relationship between the dependent variable and a single independent variable. Because the fit is a single straight line, simple linear regression is sensitive to outliers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear regression is a statistical regression technique used for predictive analysis.&lt;/li&gt;
&lt;li&gt;It is one of the most basic and straightforward algorithms that use regression to illustrate the relationship between continuous variables.&lt;/li&gt;
&lt;li&gt;In machine learning, it is used to solve regression problems.&lt;/li&gt;
&lt;li&gt;The term "linear regression" refers to a statistical method that displays a linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis).&lt;/li&gt;
&lt;li&gt;Such linear regression is known as "simple linear regression" if there is only one input variable (x). Additionally, this type of linear regression is known as "multiple linear regression" if there are many input variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Y = aX + b&lt;/p&gt;

&lt;p&gt;where Y = what we are trying to predict&lt;/p&gt;

&lt;p&gt;X = the features or variables we use to predict the value of Y&lt;/p&gt;

&lt;p&gt;a = slope of the line&lt;/p&gt;

&lt;p&gt;b = intercept at the Y-axis (similar to the linear algebra we did in maths, right?)&lt;/p&gt;
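&lt;p&gt;As a quick sketch of the equation above, here is how the slope a and intercept b can be recovered with scikit-learn; the data points are made up purely for illustration:&lt;/p&gt;

```python
# A minimal sketch of fitting Y = aX + b with scikit-learn
# (the data here is illustrative, not from any real dataset).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one feature
Y = np.array([3.1, 5.0, 6.9, 9.1, 11.0])           # roughly Y = 2X + 1

model = LinearRegression()
model.fit(X, Y)

a = model.coef_[0]    # slope of the line
b = model.intercept_  # intercept at the Y-axis
print(a, b)           # close to 2 and 1
print(model.predict(np.array([[6.0]])))  # close to 13
```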

&lt;h3&gt;
  
  
  Multiple linear regression
&lt;/h3&gt;

&lt;p&gt;Multiple linear regression is used when more than one independent variable is involved; with several independent variables included, it achieves a better fit than simple linear regression. Polynomial regression is an example of a multivariate regression technique: when plotted in two dimensions, the outcome is a curved line fitted to the data points. Logistic regression, in turn, is employed when the dependent variable can take one of two values, such as true or false, or success or failure; logistic regression models forecast the likelihood of occurrence of the dependent variable, the output values are typically binary, and a sigmoid curve depicts the relationship between the dependent and independent variables.&lt;/p&gt;
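&lt;p&gt;Since logistic regression came up, here is a hedged sketch of fitting a binary outcome with scikit-learn; the hours-studied numbers are invented purely for illustration:&lt;/p&gt;

```python
# Illustrative sketch: logistic regression for a binary outcome
# (pass/fail based on hours studied; the numbers are made up).
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # binary labels: fail/pass

clf = LogisticRegression()
clf.fit(hours, passed)

# The model outputs class probabilities via the sigmoid curve
proba = clf.predict_proba(np.array([[1.0], [4.5]]))[:, 1]
print(proba)  # low probability at 1 hour, high at 4.5 hours
```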

&lt;h3&gt;
  
  
  Polynomial Regression:
&lt;/h3&gt;

&lt;p&gt;Polynomial regression is a type of regression that uses a linear model to represent a &lt;strong&gt;non-linear dataset&lt;/strong&gt;. It is comparable to multiple linear regression, but it fits a non-linear curve between the value of x and the corresponding conditional values of y. Suppose a dataset consists of sample points distributed in a non-linear fashion; in this situation, a straight-line linear regression will not best match those points, and polynomial regression is required to cover them. In polynomial regression, the original features are transformed into polynomial features of a specific degree and then modeled using a linear model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note: This differs from multiple linear regression in that in polynomial regression, a single feature is raised to different degrees, rather than multiple variables all having the same degree.&lt;/strong&gt;&lt;/p&gt;
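&lt;p&gt;A minimal sketch of that idea, assuming scikit-learn's PolynomialFeatures transformer and made-up data that roughly follows y = x squared:&lt;/p&gt;

```python
# Sketch of polynomial regression: transform one feature into polynomial
# features of degree 2, then fit an ordinary linear model (made-up data).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 4.1, 8.9, 16.2, 24.8])  # roughly y = x squared

# degree=2 expands each x into [1, x, x**2] before the linear fit
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

pred = model.predict(np.array([[6.0]]))
print(pred)  # close to 36
```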

&lt;h3&gt;
  
  
  Support Vector Regression:
&lt;/h3&gt;

&lt;p&gt;Support Vector Machine is a supervised learning algorithm that can be used for regression as well as classification problems. So if we use it for regression problems, then it is termed Support Vector Regression.&lt;/p&gt;

&lt;p&gt;Support Vector Regression is a regression algorithm that works for continuous variables. Below are some key terms used in &lt;strong&gt;Support Vector Regression&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kernel:&lt;/strong&gt; A function that maps lower-dimensional data into a higher-dimensional space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperplane:&lt;/strong&gt; In SVM, this is the line that divides two classes; in SVR, it is the line that helps forecast the continuous variable and covers the majority of the data points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundary lines:&lt;/strong&gt; The two lines drawn at a distance from the hyperplane that create a margin for the data points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support vectors:&lt;/strong&gt; The data points closest to the hyperplane and to the opposing class are known as support vectors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum number of data points fall within that margin. The basic purpose of SVR is to cover as many data points as possible between the boundary lines, with the hyperplane (the best-fit line) passing through as many of them as possible.&lt;/p&gt;
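&lt;p&gt;The margin-and-boundary-line idea can be sketched with scikit-learn's SVR on synthetic data; the kernel, C, and epsilon values here are illustrative choices, not recommendations:&lt;/p&gt;

```python
# A small sketch of Support Vector Regression with an RBF kernel
# on synthetic one-dimensional data (values are illustrative).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = np.sin(X).ravel()  # a smooth target curve

# epsilon sets the width of the margin between the boundary lines;
# points inside it incur no loss, points outside become support vectors.
svr = SVR(kernel="rbf", C=100, epsilon=0.1)
svr.fit(X, y)

print(svr.support_vectors_.shape)  # the points that define the fit
pred = svr.predict(np.array([[1.5]]))
print(pred)  # close to sin(1.5)
```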

&lt;h3&gt;
  
  
  Decision Tree Regression
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Decision Trees are a supervised learning method for solving classification and regression issues.&lt;/li&gt;
&lt;li&gt;It is capable of resolving problems involving both categorical and numerical data.&lt;/li&gt;
&lt;li&gt;Decision Tree regression constructs a tree-like structure in which each internal node represents a "test" for an attribute, each branch indicates the test's result, and each leaf node provides the ultimate decision or result.&lt;/li&gt;
&lt;li&gt;Starting with the root (parent) node, which holds the full dataset, a decision tree divides into left and right child nodes covering subsets of the dataset. These child nodes are divided further into their own children, in turn becoming parent nodes of those nodes.&lt;/li&gt;
&lt;li&gt;Random forest is a powerful supervised learning algorithm capable of handling both regression and classification problems.&lt;/li&gt;
&lt;li&gt;The Random Forest regression is an ensemble learning method that combines multiple decision trees and predicts the final output based on the average of each tree output. The combined decision trees are called base models, and they can be represented more formally as:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;g(x) = f0(x) + f1(x) + f2(x) + ...&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random forest uses the &lt;strong&gt;bagging, or bootstrap aggregation,&lt;/strong&gt; technique of ensemble learning, in which the aggregated decision trees run in parallel and do not interact with each other.&lt;/li&gt;
&lt;li&gt;With the help of Random Forest regression, we can reduce overfitting in the model by creating random subsets of the dataset.&lt;/li&gt;
&lt;/ul&gt;
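&lt;p&gt;A small sketch of a decision tree regressor at work, on a made-up dataset with an obvious step in it: with max_depth=1 the tree makes a single split and each leaf predicts the mean of its subset:&lt;/p&gt;

```python
# Sketch of a decision tree regressor splitting a toy dataset
# (hypothetical numbers, just to show the tree-building API).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([5.0, 5.1, 4.9, 5.0, 9.0, 9.2, 8.8, 9.1])  # a clear step

# max_depth=1 gives a single split: one root node, two leaf nodes
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, y)

left = tree.predict(np.array([[2.5]]))   # mean of the left leaf, about 5.0
right = tree.predict(np.array([[7.5]]))  # mean of the right leaf, about 9.0
print(left, right)
```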

&lt;h3&gt;
  
  
  Random Forest
&lt;/h3&gt;

&lt;p&gt;A random forest is a meta-estimator that fits several decision trees on different sub-samples of the dataset and averages their predictions to increase predictive accuracy and control over-fitting. Some of the important parameters are highlighted below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;n_estimators&lt;/strong&gt; — the number of decision trees you will be running in the model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;criterion&lt;/strong&gt; — This variable lets you choose the criterion (loss function) that will be used to decide model outcomes. We can choose between loss functions like mean squared error (MSE) and mean absolute error (MAE). MSE is the default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_depth&lt;/strong&gt; — this sets the maximum possible depth of each tree&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_features&lt;/strong&gt; — the maximum number of features the model will consider when determining a split&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bootstrap&lt;/strong&gt; — the default value for this is True, meaning the model follows bootstrapping principles (defined earlier)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_samples&lt;/strong&gt; — This parameter is only effective if bootstrapping is set to True; otherwise, it has no effect. When True, this variable specifies the largest size of each sample for each tree&lt;/li&gt;
&lt;li&gt;Other important parameters are &lt;strong&gt;min_samples_split, min_samples_leaf, n_jobs&lt;/strong&gt;, and others, which can be read about in sklearn's RandomForestRegressor documentation.&lt;/li&gt;
&lt;/ul&gt;
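&lt;p&gt;Here is a hedged sketch wiring several of the parameters above into a RandomForestRegressor; the dataset is synthetic, not the real-estate data used below:&lt;/p&gt;

```python
# Sketch: a RandomForestRegressor configured with the parameters
# discussed above, fitted on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 1, 200)

forest = RandomForestRegressor(
    n_estimators=100,  # number of decision trees in the ensemble
    max_depth=8,       # cap on each tree's depth
    max_features=2,    # features considered when determining a split
    bootstrap=True,    # sample the data with replacement per tree
    random_state=0,
)
forest.fit(X, y)

# The prediction is the average over all 100 trees
print(forest.predict(X[:1]))
print(forest.score(X, y))  # R^2 on the training data, close to 1
```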

&lt;p&gt;We will focus on the linear regression model in this article. Subsequent articles illustrating each of the other models will be released at a later date.&lt;/p&gt;

&lt;p&gt;A practical illustration of linear regression is shown in the code below. The code depicts a multiple linear regression, but the same code can be run for a simple linear regression model. As a bonus, we first perform a quick analysis of the data before training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#importing our libraries
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="c1"&gt;#importing our dataset
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/content/Real estate.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#brief overview of what our data looks like
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;#Getting descriptive information from our data
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;#Carrying out our data analysis to see correlations between our data
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jointplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"X1 transaction date"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Y house price of unit area"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jointplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"X2 house age"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Y house price of unit area"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jointplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"X3 distance to the nearest MRT station"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Y house price of unit area"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jointplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"X4 number of convenience stores"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Y house price of unit area"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jointplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"X5 latitude"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Y house price of unit area"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jointplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"X6 longitude"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Y house price of unit area"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pairplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lmplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'X5 latitude'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Y house price of unit area'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lmplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'X6 longitude'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Y house price of unit area'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Splitting our data into a training set and testing set
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Y house price of unit area"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;"X1 transaction date"&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"X2 house age"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"X3 distance to the nearest MRT station"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"X4 number of convenience stores"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"X5 latitude"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"X6 longitude"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Training the simple linear model on the training set
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="n"&gt;regressor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;regressor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Predicting the test results
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;regressor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Visualising the test set results
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Years of Experience'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Salary'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Calculating the mean absolute error, mean sqaured error and the root mean squared error
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;

&lt;span class="c1"&gt;#Evailuating our model
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'MAE:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'MSE:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'RMSE:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="c1"&gt;#We move ahead to exploring the residuals to ensure everything is alright with our code
&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check out the full code &lt;a href="https://colab.research.google.com/drive/1NRVI_kRT8gJ9_i6QRyOjwI7X6rXAgTCJ?usp=sharing"&gt;here&lt;/a&gt;&lt;/p&gt;
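&lt;p&gt;As a quick recap of the evaluation steps above, here is a minimal, self-contained sketch. The experience/salary numbers are made up for illustration and are not the dataset used in this article:&lt;/p&gt;

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary (in thousands)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([30.0, 36.0, 41.0, 47.0, 52.0, 58.0])

regressor = LinearRegression().fit(X, y)
y_pred = regressor.predict(X)

# RMSE is simply the square root of the MSE
mae = metrics.mean_absolute_error(y, y_pred)
mse = metrics.mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse)
```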

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In summary, regression models help us predict a continuous value, such as the price of a house, from a set of predetermined independent variables. The machine learns the pattern in the data and can then make predictions for new, unseen cases.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The importance of confusion matrix in machine learning</title>
      <dc:creator>Sarima Chiorlu </dc:creator>
      <pubDate>Wed, 18 May 2022 13:45:50 +0000</pubDate>
      <link>https://dev.to/joy_chiorlu/the-importance-of-confusion-matrix-in-machine-learning-nca</link>
      <guid>https://dev.to/joy_chiorlu/the-importance-of-confusion-matrix-in-machine-learning-nca</guid>
      <description>&lt;p&gt;As a machine learning engineer, it is important to know how well our model performs on our predictions. This aids us in finding out if we have an overfitting problem, and correcting it early on while building our model. One of the ways in which machine learning engineers test the accuracy of their model is through a technique known as confusion matrix.&lt;/p&gt;

&lt;p&gt;What is a confusion matrix?&lt;br&gt;
A confusion matrix is a technique for measuring the performance of a machine learning classification model. It is a table that lets you compare the model's predictions on a set of test data against the known true values.&lt;br&gt;
It is extremely useful for computing Recall, Precision, Specificity, Accuracy, and, most importantly, AUC-ROC curves.&lt;/p&gt;

&lt;p&gt;In a confusion matrix, there are four types of possible outcomes we can have. These are:&lt;br&gt;
True Positive&lt;br&gt;
False Positive&lt;br&gt;
True Negative&lt;br&gt;
False Negative&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PhTaWPrA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8aimqv6auitjl3dtndfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PhTaWPrA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8aimqv6auitjl3dtndfo.png" alt="Image description" width="241" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;True positive: Our predicted value matches the real value. Our predicted value was positive and the real value was positive.&lt;/p&gt;

&lt;p&gt;False positive: The predicted value did not match the real value. Our predicted value was positive but the actual value was negative.&lt;/p&gt;

&lt;p&gt;True negative: The predicted value matches the real value. Our predicted value was negative and the actual value was negative.&lt;/p&gt;

&lt;p&gt;False negative: The predicted value did not match the real value. Our predicted value was negative but the real value was positive.&lt;/p&gt;
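&lt;p&gt;With scikit-learn, these four counts can be read directly out of the matrix. A minimal sketch, using made-up binary labels (1 = positive):&lt;/p&gt;

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test labels and model predictions
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

# For binary labels, ravel() flattens the 2x2 matrix into the four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)  # TN: 4 FP: 2 FN: 1 TP: 3
```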

&lt;p&gt;For instance:&lt;br&gt;
Suppose we are trying to predict which patients are cancer-prone. We have cleaned our dataset, trained the model, and run a preliminary test on our validation set, and the model reports an accuracy of 88.8%. That number alone can be misleading: a model can score a high accuracy while still missing most of the actual positive cases. Considering the high death rate we have had due to cancer over the years, we should instead focus on correctly identifying the positive cases, so that such individuals can go see a doctor while it is all still in the early stage. Now, how do we measure that? We do that via:&lt;/p&gt;

&lt;p&gt;Recall: Recall is the proportion of all genuinely positive examples that a predictive model correctly identifies.&lt;br&gt;
Mathematically: &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JOtNsY03--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gmyb9gxdx4v8pafl5o50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JOtNsY03--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gmyb9gxdx4v8pafl5o50.png" alt="Image description" width="179" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Precision is similar to recall in that it is concerned with the positive examples predicted by your model. Precision, on the other hand, measures something different.&lt;/p&gt;

&lt;p&gt;Precision is concerned with the number of genuinely positive examples identified by your model in comparison to all positive examples labeled by it.&lt;br&gt;
Mathematically: &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2eHnAWoL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lis63acke22zdu2n8pi9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2eHnAWoL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lis63acke22zdu2n8pi9.png" alt="Image description" width="212" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the distinction between recall and precision is still unclear to you, consider this:&lt;/p&gt;

&lt;p&gt;Precision provides an answer to the following question: What percentage of all selected positive examples is genuinely positive?&lt;/p&gt;

&lt;p&gt;This question is answered by recall: What percentage of all total positive examples in your dataset did your model identify?&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gWiGvJQt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pux1uke3dfwpd2ioeaqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gWiGvJQt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pux1uke3dfwpd2ioeaqv.png" alt="Image description" width="414" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's get practical.&lt;br&gt;
From the image shown above, let's calculate our accuracy.&lt;/p&gt;

&lt;p&gt;Accuracy = &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ts2czt5q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7cykx59epsct1gqtutv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ts2czt5q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7cykx59epsct1gqtutv8.png" alt="Image description" width="291" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have an accuracy of 88.8%,&lt;br&gt;
a precision of 81.4%,&lt;br&gt;
and a recall of 95.5%.&lt;/p&gt;

&lt;p&gt;It should be noted that as we try to increase the precision of our model, the recall tends to go down; there is a trade-off between the two. We use what is known as the F1-score, the harmonic mean of precision and recall, to summarize both in a single number. This aids in evaluating our performance.&lt;br&gt;
Let's see how we built that confusion matrix. Let's code it out:&lt;/p&gt;

&lt;p&gt;I am going to assume that you have already trained and tested your model. We are just checking how well it has performed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.metrics import confusion_matrix
import itertools
import matplotlib.pyplot as plt
from random import randint
from sklearn.utils import shuffle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We import the libraries we will need (numpy is used by the plotting helper below).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cm = confusion_matrix(y_true=test_labels, y_pred=rounded_predictions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;y_true: the correct values we check our predictions against; these are your test labels&lt;br&gt;
y_pred: the values our model has predicted&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def plot_confusion_matrix(cm, classes, normalize = False, title='Confusion matrix',cmap=plt.cm.Blues):
  plt.imshow(cm, interpolation='nearest',cmap=cmap)
  plt.title = title
  plt.colorbar()
  tick_marks = np.arange(len(classes))
  plt.colorbar()
  tick_marks = np.arange(len(classes))
  plt.xticks(tick_marks, classes, rotation=45)
  plt.yticks(tick_marks, classes)
  if normalize:
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    print('Normalized confusion matrix')
  else:
    print('Confusion matrix, without normalization')

  print(cm)

  thresh = cm.max() / 2.
  for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, cm[i, j],
             horizontalalignment='center',
             color="white" if cm[i, j] &amp;gt; thresh else "black")

  plt.tight_layout()
  plt.ylabel('True label')
  plt.xlabel('Predicted label')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function above is for when you wish to plot the confusion matrix as a graph. If you just want the raw matrix printed, you can use the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0] #Assuming this is the data we have in our test data set
y_pred = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0] #This is the result our model has predicted for us
result = confusion_matrix(y_true, y_pred)
print(result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conclusion&lt;br&gt;
Through the use of the confusion matrix, we can verify that our model is performing well. Thanks to this invention by Karl Pearson, originally known as the contingency table, machine learning engineers have long been able to measure the performance of their models, which has helped us train them better.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding Project Management</title>
      <dc:creator>Sarima Chiorlu </dc:creator>
      <pubDate>Fri, 14 Jan 2022 11:28:56 +0000</pubDate>
      <link>https://dev.to/joy_chiorlu/understanding-project-management-4eo2</link>
      <guid>https://dev.to/joy_chiorlu/understanding-project-management-4eo2</guid>
      <description>&lt;p&gt;Project Management is the application of processes, methods and skills to achieve specific project objectives according to the project acceptance criteria within agreed parameters. Project management has final deliverables that are constrained to a finite timescale and budget. It is an essential part of software engineering because professional software engineering is always subject to organizational budget and schedule constraints. A project manager's job is to ensure that the software meets and overcomes these constraints as well as delivering high-quality results. &lt;/p&gt;

&lt;p&gt;To understand the activities of a project manager, we shall look at two important areas.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Risk management&lt;/li&gt;
&lt;li&gt;People management&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Risk management&lt;/strong&gt;&lt;br&gt;
This is one of the most important jobs of a project manager. It involves anticipating risks that might disrupt the project development cycle and taking action to avoid them.&lt;br&gt;
Below are the steps project managers can take to manage risks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V82MbtAv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ozqes6ldu73pg01lds46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V82MbtAv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ozqes6ldu73pg01lds46.png" alt="Image description" width="620" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Identify the risk: You should identify possible project, product, and business risks. The different types of risk are a useful starting point for this identification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analyze the risk: You should assess the likelihood and consequences of these risks on your project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prioritize the risk: This involves ranking the risks so that those with the most potential impact on the project are identified and treated first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Treat the risk: You would report the issue to the appropriate people in the team to have it fixed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitor the risk: You should regularly reassess each risk and your plans for risk control, and revise these as you learn more about the risk.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It should be noted that risk management never stops; it is an iterative process that continues for as long as the project is used by clients. You continue to reanalyze the risks and decide whether their priorities have changed. At each stage, findings should be recorded and used as data to help control any further risks your project may face.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;People management&lt;/strong&gt;&lt;br&gt;
Managing people is one of the most essential skills a project manager needs to have.&lt;br&gt;
For a project manager to have a successful career, they must recognize that the people working in their team, organization, or project are their greatest assets. It costs a lot to recruit and retain good people, so it falls on the project manager to ensure that the organization gets the best possible return on its investment. Also, when recruiting talent to join teams or forming teams to work on certain features, project managers should take note of the personality types of the people making up the group. We have: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Task-oriented people: They are motivated by the work they do and by the intellectual challenge of any task given to them.&lt;/li&gt;
&lt;li&gt;Self-oriented people: They are motivated by their personal success. They are interested in the task as a means of achieving their own goals.&lt;/li&gt;
&lt;li&gt;Interaction-oriented people: They are motivated by the presence and actions of co-workers. This is often mistaken for competitiveness, but it simply refers to individuals who are more people-centered. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When choosing team members for your team, it is advisable to pick more task-oriented people, with the rest spread across interaction-oriented and self-oriented people (although this depends on the project being managed).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teamwork&lt;/strong&gt;&lt;br&gt;
For a project to be finished effectively and efficiently within a given period, tasks are often shared among teams, and team building is the job of the project manager. Forming teams is one of the most important parts of a project manager's job. There are certain criteria that should be taken into account: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The personality types of the people making up the teams.&lt;/li&gt;
&lt;li&gt;The size of the team. This is necessary to ensure easy communication among team members. In large organizations that work on large projects, a single feature can be divided among teams, and the feature further divided into smaller bits. The project gets handled effectively, communication is easier, and you also get to know your team members on an individual level. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Group communications&lt;/strong&gt;&lt;br&gt;
There are lots of software tools now that help project managers organize projects and tasks among their members. These tools were created to solve the problem of effectively passing information to members of a team. When tasks are shared among groups, some tasks often depend on other tasks, so there arises a need for members to communicate the progress of their various tasks to one another. This is one of the problems that project management tools (Asana, etc.) have helped to fix. There may still be situations where teams have to meet physically, which is why it is important to understand the personality types of team members to help balance the composition of the team.&lt;/p&gt;

&lt;p&gt;Now, when managing projects, tasks should be divided into different time ranges, and members asked to deliver based on certain criteria.&lt;/p&gt;

&lt;p&gt;Before we conclude, let's look at a case study.&lt;/p&gt;

&lt;p&gt;On the first day of March 2021, the executives of a public hospital decided to improve productivity at their hospital by developing an application through which patients book an appointment with a doctor and are assigned a time and date for their visit. This would make doctors more organized, leaving them to attend in person only to emergencies. The app is to be deployed by July 2021.&lt;/p&gt;

&lt;p&gt;To begin this project, they appointed Martha as their project manager because, over the years, she has proven able to get the job done. &lt;/p&gt;

&lt;p&gt;Martha, realizing that she is in charge of a large-scale project that would be used by tens of millions of people across regions, starts planning the project (first-stage planning).&lt;/p&gt;

&lt;p&gt;She meets the board and requests the key features of the app and their goals for building it. Martha also has to work with the sales team to discuss how to build the app so that it also returns a profit to the hospital.&lt;/p&gt;

&lt;p&gt;After identifying the goal and requesting the board's budget for the app, she went on to form her team, organized around the features of the app. Below is a tree showing how her team was structured:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1vqZSsfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fdr2ag8gvbkuxmbkndh0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1vqZSsfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fdr2ag8gvbkuxmbkndh0.jpg" alt="Image description" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;She appointed a technical team lead who reported directly to her and went on to form teams responsible for each feature. She created a separate team responsible for testing (unit testing, component testing, etc.); watch out for a separate article on testing. Let's look at an example of how she selected the people that made up a team. For the team responsible for security, she had 4 task-oriented people, 3 self-oriented, and 2 interaction-oriented. For the team responsible for the app's UI/UX, she picked 3 interaction-oriented, 2 task-oriented, and 2 self-oriented. In this illustration, she picked more task-oriented people for security because it involved more technical prowess, but more interaction-oriented people for the UX team because their knack for discovering users' pain points came in handy.&lt;/p&gt;

&lt;p&gt;After setting up teams, she went on to allocate timeframes to the different tasks, bearing in mind that some tasks were dependent on others. For tasks like that, she ensured that the prerequisite task was handled urgently within its timeframe. Let's look at this table to get a better understanding:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k-9pSksV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c76h920ykezl73juhd42.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k-9pSksV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c76h920ykezl73juhd42.jpg" alt="Image description" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These tasks would be further divided into smaller units; the doctor's profile, for example, can have sub-teams taking care of its smaller features. It is the job of the team leads of these smaller units to set their own timeframes within the timeframe of the main team. Also, where tasks depend on other tasks, such as team 2's work on the doctor's profile depending on team 1's work on authentication, the prerequisite tasks have to come first. &lt;/p&gt;

&lt;p&gt;Using efficient software and working within the budget, the project manager also has to communicate the team's progress effectively to the directors and executives and work on any adjustments that need to be made. &lt;/p&gt;

&lt;p&gt;Risk control is also very important; you would have to decide on your own approach to mitigating and resolving risks.&lt;/p&gt;

&lt;p&gt;Some project management tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Zepel&lt;/li&gt;
&lt;li&gt;Jira&lt;/li&gt;
&lt;li&gt;Asana&lt;/li&gt;
&lt;li&gt;Trello&lt;/li&gt;
&lt;li&gt;Teamweek&lt;/li&gt;
&lt;li&gt;Wrike&lt;/li&gt;
&lt;li&gt;Zoho Sprints&lt;/li&gt;
&lt;li&gt;Airtable&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I hope we now understand the basic project management pattern. Depending on the project you are asked to manage, certain criteria may change, but if you understand the basics, you will be able to oversee a project until it becomes a success (that is, when the client or customer is satisfied and has given positive feedback). Overall, project managers have to ensure that both the internal stakeholders (those directly involved with the project, such as staff) and the external stakeholders (those indirectly involved, such as clients and government) are satisfied at the end of a project.&lt;/p&gt;

&lt;p&gt;For beginners who wish to start their career in project management, here are some courses that have helped me thus far:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.udemy.com/course/agile-crash-course/"&gt;https://www.udemy.com/course/agile-crash-course/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learning.edx.org/course/course-v1:AdelaideX+Project101x+1T2017/home"&gt;https://learning.edx.org/course/course-v1:AdelaideX+Project101x+1T2017/home&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
  </channel>
</rss>
