DEV Community: Karen Ngala

Git for Data Science

Karen Ngala — Wed, 05 Apr 2023 12:54:19 +0000

As data science continues to gain momentum as a field, managing and versioning data and code has become increasingly important. Git, a powerful version control system, is a popular tool among software developers for managing source code changes. However, Git is not just limited to software development and can also be used effectively for managing data science projects.

In this article, we will explore how Git can be leveraged by data scientists to efficiently manage and version data, track changes, collaborate with team members, and reproduce experiments. Whether you are new to Git or an experienced user, this article aims to provide a comprehensive guide on using Git for data science projects.

What is Git and How does it work?

Git is a distributed version control system used for tracking changes in source code during software development. It allows multiple people to collaborate on the same project by tracking changes to code. Git does this by taking snapshots of the files at various points in time, creating a complete history of changes made to those files. Each snapshot is called a "commit" and contains a reference to the previous commit, forming a "commit chain" or a "commit history".
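The commit chain described above can be pictured with a toy model. The sketch below is not how Git stores objects internally; the `Commit` class and the `history` helper are hypothetical names used only to show that each commit is a snapshot that points back to its parent.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """A toy commit: a snapshot of files plus a link to the previous commit."""
    message: str
    snapshot: tuple  # (filename, content) pairs standing in for tracked files
    parent: Optional["Commit"] = None

    @property
    def commit_id(self) -> str:
        # Hash the message, the snapshot and the parent id, so changing
        # an earlier commit changes every id that comes after it.
        parent_id = self.parent.commit_id if self.parent else ""
        payload = repr((self.message, self.snapshot, parent_id)).encode()
        return hashlib.sha1(payload).hexdigest()


def history(head: Commit) -> list[str]:
    """Walk parent links from the newest commit back to the first one."""
    messages = []
    current: Optional[Commit] = head
    while current is not None:
        messages.append(current.message)
        current = current.parent
    return messages


first = Commit("Add raw data loader", (("load.py", "v1"),))
second = Commit("Clean missing values", (("load.py", "v2"),), parent=first)
print(history(second))
```

Walking the parent links is roughly what `git log` shows you: the newest commit first, then each ancestor in turn.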

Git uses a distributed model, which means that each user has a local copy of the entire repository, including the commit history. This allows users to work offline and makes collaboration easier. When users are ready to share their changes, they can push their commits to a remote repository, from which other users can then pull to incorporate those changes into their local copies.

Git also offers tools for merging changes made by different people and reverting to earlier versions if necessary. It also provides tools for branching, enabling developers to work on different parts of a project simultaneously without disrupting each other's work.

Git vs GitHub

Git is a command-line tool that allows developers to track source code history over time while also allowing them to collaborate on the same project with minimal conflict.

GitHub is a web platform built on Git where remote repositories of Git projects are hosted. It also offers features such as bug tracking, project management, and automation. Alternatives to GitHub include GitLab and Bitbucket, among others.

Terminologies & Commands

Repository: A repository is a central location where Git stores all the files and folders of a project, along with their revision history.

# Create a new repository on your local computer
git init

Commit: A commit is a snapshot of a repository at a specific point in time. It represents a set of changes that have been made to the repository. You must first stage the edited files using the git add command. This marks the files to go into the commit.

# stage all edited files
git add . 

# stage a specific file
git add <file_name.ext>

git commit -m "commit message goes here"

Branch: A branch is a separate version of the repository that allows developers to work on different features or fixes simultaneously without interfering with each other's work.

# create then checkout to branch
git branch <branch_name>
git checkout <branch_name>

# create and checkout into new branch
git checkout -b <branch_name>

# list all branches in the repository 
git branch

Push: Push is the process of sending changes from a local repository to a remote repository, such as on GitHub.

git push origin <branch_name>
# origin -> the default remote repository that Git tracks for a local repository or points to the original repository in case of cloning.

Pull: Pull is the process of fetching and merging changes from a remote repository into a local repository.

git pull origin <branch_name>

Merge: A merge is the process of combining changes from one branch into another branch.

git merge <feature_branch_name>

Pull Request: A pull request is a request made by a developer to merge their changes from a branch into the main branch of the repository.
Fork: A fork is a copy of a repository that allows a developer to make changes to the code without affecting the original repository.
Clone: A clone is a local copy of a remote repository that a developer can work on without affecting the original repository.

git clone <link to remote repository>

HEAD: A reference to the commit your local repository is currently on.

Git Best Practices

1. Don't push secrets

Whether you are working on a private or public repository, never commit any secrets. These include usernames, passwords, API keys, TLS certificates, and any other sensitive information. Keep in mind that private repositories can be accessed and cloned by multiple accounts and can also be made public at some point.
To protect such sensitive information, make use of the .env file. This file's purpose is to hold environment variables. The .env file is in turn kept out of the repository by listing it in the .gitignore file.
To make collaboration easy, you should also create a .env.example or .env.template file. This file tells other collaborators which environment variables the system expects. From this file, they can create a .env file with their own usernames, passwords and secret keys.

# .env file:
API_KEY=97467282TTa89sdaf7659025f7sda22245

# .env.example file:
API_KEY=your_key

# gitignore file:
.env

# app.py
import os

from dotenv import load_dotenv

load_dotenv()  # read the variables defined in .env into the environment

api_key = os.getenv('API_KEY')
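One thing the snippet above leaves open is what happens when a collaborator forgets to create their .env file. A minimal sketch, assuming you prefer to fail early with a clear message; `require_env` is a hypothetical helper, not part of python-dotenv:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name!r}; "
            "copy .env.example to .env and fill it in."
        )
    return value


# Simulate a loaded .env file for this demonstration only.
os.environ["API_KEY"] = "your_key"
print(require_env("API_KEY"))
```

Failing at startup is usually easier to debug than an API call that is later rejected because the key was `None`.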

If you happen to commit a secret, you cannot fix it by simply deleting it in a later commit. Because Git is designed to maintain a persistent history of the code, the secret stays in the earlier commits, and removing it requires rewriting history. This can prove difficult in situations where other people already have the secret in their local repositories. The simplest solution is to change the passwords and disable the exposed secret keys.

2. Don't push datasets

The main purpose of Git is to track changes in text files, not large binary files such as datasets. You may work with extremely large datasets, which you can accidentally commit if you are not careful. There are several approaches you can take:
a) If your dataset does not change, you can upload it to a server and access it via its URL.
b) Use a .gitignore file. Add your dataset files or folders to the .gitignore file to avoid accidentally staging and committing them.

# ignore archives
*.zip
*.tar
*.tar.gz
*.rar

# ignore dataset folder and subfolders
datasets/

3. Don't push notebook outputs

Cell outputs on notebooks are a great feature. However, when using version control systems such as Git, a change to a code cell will most likely change its output. Keeping track of the changes made in output cells distracts from the more important changes in the code cells. This can prove tedious when multiple people are working on the same notebook.
You should, therefore, strip all outputs from a notebook before committing to Git by:

Manually clearing all output cells from the main menu: Cell -> All Output -> Clear
Setting up a pre-commit hook to clear outputs automatically.
Using a .gitattributes file to apply a filter that strips outputs when notebooks are staged.
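To make the idea concrete: a notebook file is JSON, and in notebook format 4 each code cell keeps its results under `outputs` and `execution_count`. The following is a minimal sketch of what any of the options above has to do; `strip_outputs` is a hypothetical helper and the notebook here is a small in-memory stand-in for a real .ipynb file.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny in-memory notebook standing in for a real .ipynb file.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"], "metadata": {}},
        {
            "cell_type": "code",
            "source": ["print(1 + 1)"],
            "metadata": {},
            "execution_count": 3,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_outputs(example)
print(json.dumps(cleaned["cells"][1], indent=2))
```

In practice a dedicated tool such as nbstripout handles more edge cases, so treat this as an illustration rather than a replacement.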

4. Refrain from using `--force` or `-f`

At times, you may encounter an error when pushing to remote that asks you to use the --force or -f flag. There are situations that require using this flag. However, make it a habit to read the error message first, try to identify the origin of the error, and fix the underlying issue. If this proves challenging, try asking for help.
Using --force habitually will prove detrimental in the long run.

5. Make frequent and clear commits

As a general rule of thumb, a single commit should do one thing: fix one bug, not five; solve a single issue, not ten.
For example, a commit that fixes ten bugs will most likely have multiple changed files. Further, if the commit message is unclear like "Model now working", it becomes difficult for someone else to understand what happened in the commit. This provides zero value. The commit message "Fix special tokens not correctly tokenized" is short, but clear. You know what changed, and why.
Thankfully, you can fix your commit history if you haven't pushed to remote. Learning to rewrite history can prove very useful in real world projects.

6. Utilize branching and pull requests

If your project is constantly being worked on by many people or is in production, pull requests can prove very helpful. By default, a Git repository starts with a single branch, usually named main or master, which is treated as the central source of truth.
When you branch, you create a separate line of development that diverges from the main branch. You and other collaborators can work on different features simultaneously through branching. This allows you to work on new features or fix old ones without affecting the main branch.

When you are done working on your feature, you create a pull request to merge (include) the changes of your branch into the main branch. Pull requests are a GitHub concept and have features that allow other people to review, comment, suggest changes, approve, or apply the changes in the pull request.

Conclusion

In this article, we've covered Git, how it works and the best practices when working with Git. To further help you in this journey, I have linked articles I found useful below:

I hope you found this post useful!

Getting started with Sentiment Analysis

Karen Ngala — Wed, 22 Mar 2023 19:00:06 +0000

Pre-reading:

Basic understanding of EDA

What is Sentiment Analysis?

Humans communicate with each other using natural language, which is often complicated. Humans tend to use subtle variations in their speech, such as sarcasm, which is easy for us to interpret but difficult for machines. To make computers understand natural language, we use a process known as Natural Language Processing (NLP).

Sentiment analysis, also known as opinion mining, is an approach to natural language processing that seeks to identify the emotion behind a text such as a movie or product review. Businesses around the world use sentiment analysis to understand the public opinion on their products or services left on online platforms.

Sentiment analysis identifies, classifies, and quantifies the sentiment expressed in a text. For example, the text "I loved the movie" carries a positive sentiment while "I found it rather slow and boring" carries a negative sentiment. The strength of the sentiment can also be quantified: for example, the text "I really enjoyed the movie" can be scored as 'relatively more positive'. The amount of positivity or negativity in text is known as polarity.
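To get a feel for polarity as a number, here is a toy lexicon-based scorer. It is only an illustration of the concept, not the approach used later in this article: the word weights in `LEXICON` and `INTENSIFIERS` are invented for the example, and real systems learn or curate far larger vocabularies.

```python
# Hypothetical mini-lexicon: word -> polarity weight in [-1, 1].
LEXICON = {"loved": 0.8, "enjoyed": 0.6, "good": 0.5, "slow": -0.4, "boring": -0.7}
# Words assumed to strengthen the sentiment word that follows them.
INTENSIFIERS = {"really": 1.5, "very": 1.3}


def polarity(text: str) -> float:
    """Average the weights of known sentiment words, clipped to [-1, 1]."""
    scores = []
    boost = 1.0
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in INTENSIFIERS:
            boost = INTENSIFIERS[word]
            continue
        if word in LEXICON:
            scores.append(LEXICON[word] * boost)
        boost = 1.0  # an intensifier only affects the very next word
    if not scores:
        return 0.0
    return max(-1.0, min(1.0, sum(scores) / len(scores)))


for sentence in [
    "I enjoyed the movie",
    "I really enjoyed the movie",
    "I found it rather slow and boring",
]:
    print(f"{polarity(sentence):+.2f}  {sentence}")
```

The second sentence scores higher than the first because of the intensifier, which is the 'relatively more positive' effect described above.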

When a large amount of data is involved, it becomes more effective to use an algorithm to determine customer satisfaction as opposed to humans.

Sentiment Analysis Process

1. Import relevant libraries

There are a number of libraries we can use in sentiment analysis depending on your goals.

Pandas - for data analysis and manipulation: `import pandas as pd`
Matplotlib - for data visualization: `import matplotlib.pyplot as plt`
Seaborn - for high-level data visualization: `import seaborn as sns`
WordCloud - to visualize text data. The more often a word appears in the text, the larger its font: `from wordcloud import WordCloud`
re - for string pre-processing with regular expressions: `import re`
nltk - the Natural Language Toolkit, a collection of libraries used in Natural Language Processing: `import nltk`
stopwords - a collection of words that carry little sentiment in a sentence, such as "the" and "and": `from nltk.corpus import stopwords`

Evaluation Libraries:

from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report, confusion_matrix

Once we have trained our model, we need to evaluate the correctness of the model using the testing dataset i.e: is the result what we expect it to be?

Accuracy Score - ratio of correctly classified instances to the total number of instances.
Precision Score - ratio of correctly predicted positive instances to all instances predicted as positive.
Recall Score - ratio of correctly predicted positive instances to all instances that are actually positive.
Classification Report - a report of the precision, recall, and F1 scores for each class, along with the overall accuracy.
ROC Curve - a graph of the True Positive Rate, or sensitivity (y-axis), against the False Positive Rate, or 1 - specificity (x-axis), at various threshold values. An ROC (Receiver Operating Characteristic) curve summarizes the performance of a binary classification model.

A binary classification model is one that classifies an instance as either one thing or the other, i.e: The output can only be this value or the other. 'Sick' or 'Not Sick', 'Cat' or 'Dog', 'Tree' or 'Not Tree'

2. Load the dataset

A sample sentiment analysis dataset will contain a text column and its corresponding sentiment/target value.

To read the dataset, we need to load it using pandas:

df = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

3. Exploratory Data Analysis

Understand the data you are working with. Check various aspects of the dataset to familiarize yourself with it. This will help you know how you can manipulate the dataset.

df.shape

df.head()

df.dtypes

# Check for null values: number of rows with at least one missing value
df.isnull().any(axis=1).sum()

Distribution of target variables:

The next step is to check the various target sentiments in the dataset.

df['label'].value_counts()
# or
sns.countplot(x=df['label'])

In cases where the labels are of more than two types, we can merge them to create two simple sentiments, positive and negative represented in a numerical form: '1' and '0'
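As a minimal sketch of that merging step, assuming a dataset with five label names (the names in `LABEL_MAP` are hypothetical, so replace them with the labels in your own data):

```python
# Hypothetical label names; replace them with the labels in your dataset.
LABEL_MAP = {
    "very positive": 1,
    "positive": 1,
    "negative": 0,
    "very negative": 0,
}


def to_binary(rows):
    """Keep rows whose label maps to 1/0 and drop the rest (e.g. 'neutral')."""
    return [(text, LABEL_MAP[label]) for text, label in rows if label in LABEL_MAP]


rows = [
    ("Loved it", "very positive"),
    ("It was fine", "neutral"),
    ("Not for me", "negative"),
]
print(to_binary(rows))
```

With a pandas DataFrame the same mapping can be applied with `df['label'].map(LABEL_MAP)`, which leaves unmapped labels as missing values for you to drop.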

4. Data Preparation

Dealing with alphanumeric text requires pre-processing to remove any odd characters and prepare the text for the model.

Convert the text to lowercase. Because of case sensitivity, the word "Hello" is different from "hello"
```
df['text']=df['text'].str.lower()
```

Remove any stopwords. Words such as "the", "and" do not offer much value in sentiment analysis

from nltk.corpus import stopwords

# requires a one-time nltk.download('stopwords')
stopwords_list = stopwords.words('english')
", ".join(stopwords_list)

# Get rid of any stopwords
STOPWORDS = set(stopwords.words('english'))
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
df['text'] = df['text'].apply(lambda text: cleaning_stopwords(text))
df['text'].head()

Remove non-alphabetic characters.

# remove special characters, numbers and punctuations
df['text'] = df['text'].str.replace("[^a-zA-Z#]", " ", regex=True)
df.head()

# remove short words
df['text'] = df['text'].apply(lambda x: " ".join([w for w in x.split() if len(w)>2]))
df.head()

Depending on the data you are dealing with, you may need to remove different characters and character combinations. For example, when handling Twitter data, you will need to remove user handles, i.e: "@username"

# function to remove every match of a regex pattern from the input text
def remove_pattern(input_txt, pattern):
    return re.sub(pattern, "", input_txt)

# remove twitter handles (@user); run this before stripping
# non-alphabetic characters, otherwise the "@" is already gone
df['text'] = df['text'].apply(lambda text: remove_pattern(text, r"@\w*"))
df.head()

4. Tokenization

This is used in natural language processing to split text into smaller units that can be more easily assigned meaning. For example, take the string "Loved the ambiance and drinks". Tokenization breaks the string into individual parts that the program can understand better: 'Loved', 'the', 'ambiance', 'and', 'drinks'

This step also lays the groundwork for stemming or lemmatization. Learn more on this topic here.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
df['text'] = df['text'].apply(tokenizer.tokenize)
df['text'].head()
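If you want to see what the `\w+` pattern does without installing nltk, the standard library gives the same kind of result. This is a small sketch using `re.findall`, with `tokenize` as a hypothetical helper name:

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into runs of letters, digits and underscores."""
    return re.findall(r"\w+", text)


tokens = tokenize("Loved the ambiance and drinks!")
print(tokens)
```

Punctuation such as the trailing "!" matches nothing in `\w+`, so it simply drops out of the token list.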

5. Lemmatization

This is the process of deriving the root word from the different forms of a word. For example, the words eats and eating are part of the same lexeme, with eat as the lemma.

Lemmatization is computationally expensive since it involves look-up tables.
Unlike stemming, which simply trims words down, lemmatization considers a language's vocabulary to derive the base word. Base words produced by stemming don't always make sense. For example, a stemmer may reduce the word 'having' to 'hav', while lemmatization returns 'have'.

# requires a one-time nltk.download('wordnet')
lm = nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
    # lemmatize every token and return the lemmatized list
    return [lm.lemmatize(word) for word in data]

df['text'] = df['text'].apply(lambda x: lemmatizer_on_text(x))

df['text'].head()

5. Prepare for training

The next step is to separate the dataset into training data and testing data. Sentiment analysis is a classification problem. As such, a classification model is trained using the training dataset and evaluated using the testing dataset. The ratio of training data to testing data is often 4:1 (80:20), which leaves most of the data for training while holding back enough for a meaningful evaluation.

The purpose of this step is to ensure the data you use to evaluate your model's accuracy is unseen/new data. Evaluating a model on its own training data gives an overly optimistic score, because it cannot reveal a model that performs well only on the data it has memorized. That failure is known as overfitting; the opposite, where the model performs poorly even on the training data, is known as underfitting.

Accuracy score allows us to evaluate the model's performance. We compare the training accuracy to the testing accuracy to identify underfitting and overfitting.
If the training accuracy is extremely high while the testing accuracy is poor, the model is probably overfitted.

In cases where we need to choose between multiple models, we need to create an extra dataset known as the validation dataset. This allows us to evaluate the models to pick which performs better.

There are many ways to split your dataset. The following is one method that utilizes sklearn. Read more about how to split a dataset.

from sklearn.model_selection import train_test_split

# This splits data into an 80:20 ratio
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)
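To see what such a split does under the hood, here is a pure-Python sketch. It is not sklearn's implementation, and `split_rows` is a hypothetical helper; it only shows the core idea of shuffling with a fixed seed and holding back a fraction of the rows.

```python
import random


def split_rows(rows, test_size=0.2, seed=25):
    """Shuffle rows reproducibly, then hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up (text, label) rows for illustration.
rows = [(f"review {i}", i % 2) for i in range(10)]
train, test = split_rows(rows)
print(len(train), len(test))
```

Fixing the seed matters for reproducibility: running the split again with the same seed gives the same training and testing rows, so results can be compared between runs.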

6. Build Model

The model you choose to use here is not set in stone. A popular choice for sentiment analysis is Logistic regression. This is because it trains quickly even on large datasets and provides very robust results. Other model choices include Random Forests, and Naive Bayes.

Q: What if we do not have labelled data? How can we know the sentiment in a text?
A: Using Pre-Trained Models — TextBlob

TextBlob is a library that returns the sentiment of a text as a named tuple: "(polarity, subjectivity)"

Polarity is a float between -1.0 and 1.0. It shows whether a text is negative or positive.
Subjectivity is a float between 0.0 and 1.0, where 0.0 is very objective and 1.0 is very subjective.

7. Model Evaluation

After training the model, we evaluate the performance of the model. Assessing the model's efficiency answers the question, Is the model working well with unseen data?

Before going into the evaluation metrics we can use, let's define the results we can get from these metrics.
For these definitions, let's use the example of a model classifying patients as "Sick" or "Not Sick"

True Positive(TP) - the number of Sick people that were correctly classified as Sick.
True Negative(TN) - the number of Not Sick people that were correctly classified as Not Sick.
False Positive(FP) - the number of Not Sick people that were wrongly classified as Sick.
False Negative(FN) - the number of Sick people that were wrongly classified as Not Sick.
N - total number of patients
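These four counts are all you need to compute the most common scores by hand. A minimal sketch, with patient counts made up for illustration and `classification_metrics` as a hypothetical helper:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    n = tp + tn + fp + fn
    return {
        # share of all patients that were classified correctly
        "accuracy": (tp + tn) / n,
        # of the patients predicted Sick, how many really are Sick
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # of the patients who really are Sick, how many were found
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts for 100 patients.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note how precision and recall use different denominators: precision divides by everything predicted Sick, recall by everything actually Sick.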

There are many evaluation metrics. However, we will look at 3 popular metrics used for classification models:

Accuracy — How often does the model make correct predictions? i.e: The actual sentiment and the predicted sentiment are the same.
```
# Testing accuracy
# accr1 is assumed to hold [loss, accuracy] from an earlier evaluation step
print('Test set\n  Accuracy: {:0.2f}'.format(accr1[1]))
```
Confusion Matrix — a table used to visualize the performance of a classification model on a dataset for which the true (target) values are known. A confusion matrix highlights two errors:
- Type 1 Error - The number of instances that were negative but were wrongly classified as positive. Also called, False Positive(FP)
- Type 2 Error - The number of instances that were positive but were wrongly classified as negative. Also called, False Negative(FN)
```
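# plot_confusion_matrix with these arguments comes from mlxtend (assumed here)
from mlxtend.plotting import plot_confusion_matrix

# Y_test holds the true labels and y_pred the model's predictions
CR = confusion_matrix(Y_test, y_pred)
print("confusion matrix")
print(CR)

fig, ax = plot_confusion_matrix(conf_mat=CR, figsize=(10, 10),
                                show_absolute=True,
                                show_normed=True,
                                colorbar=True)
plt.show()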
```
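AUC (Area Under the ROC Curve) - the area under the curve obtained by plotting the true positive rate against the false positive rate at different classification thresholds. A value of 1.0 means the model ranks every positive above every negative, while 0.5 is no better than random guessing.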
- True Positive Rate (sensitivity) - proportion of positive samples that are correctly identified as positive
- False positive rate (1-specificity) - is the proportion of negative samples that are incorrectly classified as positive.
- True Negative Rate (Specificity) - proportion of negative samples that are correctly identified as negative
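The rates above can be turned into an ROC curve and an AUC value with a few lines of plain Python. This is a sketch for intuition rather than a replacement for `sklearn.metrics`; `roc_points` and `area_under_curve` are hypothetical helpers, the labels and scores are made up, and the code assumes both classes appear in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold downwards."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # everything scoring at or above the threshold is predicted positive
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Integrate the curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(round(area_under_curve(roc_points(y_true, scores)), 3))
```

Lowering the threshold can only add predicted positives, which is why the curve always moves up or to the right until it reaches the point (1, 1).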

Conclusion

In this article, we talked about the steps you can take to solve a sentiment analysis problem.
Practical guide:

Kaggle Notebook on twitter sentiment analysis

I hope you found this article helpful. Leave a comment if you have any questions or would like to discuss this topic further.

Essential SQL Commands for Data Science

Karen Ngala — Wed, 15 Mar 2023 13:25:17 +0000

Pre-requisites: This article assumes basic SQL knowledge and CRUD commands such as: CREATE, INSERT, UPDATE, ALTER, DELETE, and DROP

SQL, Structured Query Language, is a language used for manipulating and managing data in a relational database. Data scientists use it to extract insights from data. A large amount of the data used by data scientists lives in relational databases, and this data can be extracted using SQL commands. Relational database systems such as MySQL and PostgreSQL use SQL.

This article covers the essential SQL commands that data scientists rely on to effectively clean and filter data:

Data Retrieval
- Conditions for Data Retrieval
Data Aggregation
- Changing Data Types
Joining Data From Different Tables
Complex Conditions

The Basics: Data Retrieval

SELECT FROM
This is the simplest method of data retrieval in a relational database.
It can be combined with clauses such as WHERE, ORDER BY, and GROUP BY to filter, sort, and group data.

-- To select specific columns in a table:
SELECT column1, column2, column3
FROM table_name;

-- To select everything in a table:
SELECT * 
FROM table_name;

DISTINCT
DISTINCT is used with SELECT to view unique values in a column.
For example, to know all the departments appearing in the column department, we use DISTINCT. It returns a table of the departments appearing in that table.

SELECT DISTINCT department
FROM employees;

Conditions for Data Retrieval

WHERE
This is a conditional statement used to filter data according to a specific condition.

SELECT column1, column2, column3
FROM table_name
WHERE condition;

-- for example:
SELECT *
FROM employees
WHERE age >= 45;

-- We can also filter data with more than one condition:
SELECT employee_name, department, salary
FROM employees
WHERE department = 'Sales' AND salary >= 50000;

SELECT *
FROM employees
WHERE department IN ('Finance', 'IT', 'HR');

GROUP BY
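This statement is used to group data based on one or more columns. In standard SQL, every column in the SELECT list must either appear in the GROUP BY clause or be wrapped in an aggregate function.

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;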

ORDER BY
This is used to sort the results of a query either alphabetically or numerically.
The default sorting order in SQL is ascending (ASC). Therefore, you do not have to specify ASC in your query.

SELECT employee, salary
FROM employees 
ORDER BY salary;

However, to sort the results in a descending order, use the keyword DESC

SELECT employee, salary
FROM employees 
ORDER BY salary DESC;

LIMIT
When a table has many records, we may want to limit the number of records returned. For example, to view only the top 10 earners in the Finance department:

SELECT employee_name, department, salary
FROM employees
WHERE department = 'Finance'
ORDER BY salary DESC
LIMIT 10;
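You can try a "top earners" query like this without a database server by using Python's built-in sqlite3 module. A minimal sketch, assuming an in-memory `employees` table with invented rows and a limit of 2 instead of 10 to keep the output short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_name TEXT, department TEXT, salary REAL)"
)
# Invented rows for illustration only.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Amina", "Finance", 72000),
        ("Brian", "Finance", 58000),
        ("Chege", "Finance", 91000),
        ("Dalia", "IT", 99000),
    ],
)

top_two = conn.execute(
    """
    SELECT employee_name, salary
    FROM employees
    WHERE department = 'Finance'
    ORDER BY salary DESC
    LIMIT 2
    """
).fetchall()
print(top_two)
```

The order of the clauses matters here: the rows are filtered first, then sorted from highest to lowest salary, and only then cut off by LIMIT.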

Data Aggregation

Aggregations are summaries of data used to gain insights on a dataset. They are often used with the GROUP BY clause.
COUNT()
COUNT returns the number of rows (COUNT(column) counts only the rows where that column is not NULL). In the example below, we display the number of employees in each department.

SELECT department, COUNT(employee_id) AS employee_count
FROM employees
GROUP BY department;

SUM() & AVG()
Sum returns the sum of all the values. In the example below, we use the GROUP BY statement to group the employees by department and calculate the total salary for each department:

SELECT department, SUM(salary) as total_salary
FROM employees
GROUP BY department;

Avg returns the average value. In the example below, we use the GROUP BY statement to group the employees by department and calculate the average salary for each department:

SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department;

HAVING
Having is used to add additional conditions after calculating a grouped aggregation.
For example, the above query can be conditioned further to only show departments with an average salary above 50000.

SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
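The difference between grouping with and without HAVING is easy to see on a small table. The sketch below uses Python's built-in sqlite3 module with invented employee rows; the `{having}` placeholder is just a convenience for running the same query twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_name TEXT, department TEXT, salary REAL)"
)
# Invented rows for illustration only.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ann", "Sales", 40000),
        ("Bo", "Sales", 50000),
        ("Cy", "IT", 60000),
        ("Di", "IT", 80000),
        ("Eve", "HR", 52000),
    ],
)

query = """
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
    {having}
    ORDER BY department
"""
all_departments = conn.execute(query.format(having="")).fetchall()
high_paying = conn.execute(
    query.format(having="HAVING AVG(salary) > 50000")
).fetchall()
print(all_departments)
print(high_paying)
```

WHERE could not do this job, because it filters individual rows before grouping, whereas HAVING filters the groups after the averages have been calculated.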

MIN() & MAX()
To know the lowest or highest values in a column, we can use MIN and MAX.

SELECT MIN(salary) AS lowest_salary
FROM employees;

SELECT MAX(salary) AS highest_salary
FROM employees;

Changing Data Types

CAST( )
SQL treats salary values as plain numbers even when they represent money. In databases that provide a money type, such as PostgreSQL, we can display them as currency amounts using the CAST function:

SELECT department, CAST(SUM(salary) as money)
FROM employees
GROUP BY department
ORDER BY SUM(salary) DESC;

We can also change numbers into floats, text, or date and time.
ROUND()
When aggregations cause many decimal points, we can round off the decimal points:

SELECT department, ROUND(AVG(salary), 2) as avg_salary
FROM employees
GROUP BY department;

JOINS

Working with a single table limits the number of manipulations we can do with data. This is where JOINs come in. We are able to join data from multiple tables.
Before we go any further, we need to distinguish a primary key from a foreign key. A primary key is a column used to uniquely identify records in a table. For example, the primary key in the employees table is employee_id. On the other hand, a foreign key is used to relate two tables.
A foreign key is usually a primary key in the other table. A separate table having information about when employees take vacation days (employee_vacation table) will have a column employee_id to relate to the employees table. Therefore, employee_id is a primary key in the employees table but a foreign key in the employee_vacation table.
There are different types of SQL joins, which are best illustrated using Venn diagrams.

The following examples will feature a customer database with customers table and orders table.

INNER JOIN
An inner join is used to view only the records from two tables that match on a shared column. The example below shows the order_id and customer_name for rows where the customer_id on the orders table and the customer_id on the customers table are the same.

SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id;

INNER JOIN is the default join type, so the keyword INNER can be omitted and the code above can be written as:

SELECT orders.order_id, customers.customer_name
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id;

-- We can filter the data to not show a specific customer:
SELECT orders.order_id, customers.customer_name
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id
WHERE customers.customer_name != 'Lucy Lucy';

You can also work with more than two tables:

SELECT orders.order_id, customers.customer_name, shippers.shipper_name
FROM orders
JOIN customers 
    ON orders.customer_id = customers.customer_id
JOIN shippers 
    ON orders.shipper_id = shippers.shipper_id;

LEFT JOIN

The table before the statement LEFT JOIN is the left table while the one after is the right table.

A LEFT JOIN will return all the records in the left table and the matching records in the right table. If there are no matching records, the result will contain NULL values.

-- customers = left table
SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id;
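To see how INNER JOIN and LEFT JOIN differ on the same data, here is a small sketch using Python's built-in sqlite3 module. The customers and orders rows are invented, and the third customer deliberately has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    INSERT INTO customers VALUES (1, 'Lucy'), (2, 'Omar'), (3, 'Wanjiru');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
    """
)

inner_rows = conn.execute(
    """
    SELECT customers.customer_name, orders.order_id
    FROM customers
    JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY orders.order_id
    """
).fetchall()

left_rows = conn.execute(
    """
    SELECT customers.customer_name, orders.order_id
    FROM customers
    LEFT JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY customers.customer_id, orders.order_id
    """
).fetchall()

print(inner_rows)
print(left_rows)
```

The customer without orders disappears from the inner join but stays in the left join, where the missing order_id comes back as NULL (shown as `None` in Python).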

RIGHT JOIN

The table before the statement RIGHT JOIN is still the left table, while the one after it is the right table.

A RIGHT JOIN will return all the records in the right table and the matching records in the left table. If there are no matching records, the result will contain NULL values.

-- orders = right table
SELECT customers.customer_name, orders.order_id
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id;

Complex Queries

Subqueries
This is a query within another query, also known as a Nested query. It is usually embedded within the WHERE clause.

-- showing the highest paid employees
SELECT * 
FROM employees 
WHERE salary = (SELECT MAX(salary) 
                FROM employees);

CASE statement
This can be used when you need to add a column whose values are determined by if...else logic (a CASE statement).

SELECT order_id, order_total,
CASE 
    WHEN order_total < 20 THEN 'Order total is less than $20'
    ELSE 'Order total is $20 or more' 
END AS sales_threshold 
FROM orders;
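A quick way to check how a CASE expression treats boundary values is to run it on a few rows. The sketch below uses Python's built-in sqlite3 module with invented order totals, including one that is exactly 20, and shortened labels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_total REAL)")
# Invented totals; order 2 sits exactly on the threshold.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 12.5), (2, 20.0), (3, 45.0)],
)

rows = conn.execute(
    """
    SELECT order_id,
           CASE
               WHEN order_total < 20 THEN 'under $20'
               ELSE '$20 or more'
           END AS sales_threshold
    FROM orders
    ORDER BY order_id
    """
).fetchall()
print(rows)
```

An order of exactly 20 falls into the ELSE branch, so it is worth wording the labels so that they cover the boundary value.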

Common Table Expressions (CTEs)
CTEs are used to create temporary named result sets that exist only for the duration of a query and are then used to extract the information we need.

-- weekly_orders is the temporary result set
WITH weekly_orders AS (
    SELECT
        customer_id,
        DATE_PART('week', order_date) AS week,
        COUNT(order_id) AS order_numbers
    FROM orders
    GROUP BY customer_id, week)

SELECT customer_id, AVG(order_numbers) AS avg_weekly_orders
FROM weekly_orders
GROUP BY customer_id;
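Here is a runnable sketch of the same weekly-average idea using Python's built-in sqlite3 module. SQLite has no DATE_PART function, so this version assumes `strftime('%W', order_date)` as the week number instead; the order rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)"
)
# Invented orders: customer 1 orders twice in one week and once in the next.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 1, "2023-03-06"),
        (2, 1, "2023-03-07"),
        (3, 1, "2023-03-14"),
        (4, 2, "2023-03-08"),
    ],
)

rows = conn.execute(
    """
    WITH weekly_orders AS (
        SELECT
            customer_id,
            strftime('%W', order_date) AS week,
            COUNT(order_id) AS order_numbers
        FROM orders
        GROUP BY customer_id, strftime('%W', order_date)
    )
    SELECT customer_id, AVG(order_numbers) AS avg_weekly_orders
    FROM weekly_orders
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()
print(rows)
```

The CTE first collapses the orders into one row per customer per week, and the outer query then averages those weekly counts for each customer.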

Conclusion

Exploratory Data Analysis: Ultimate Guide

Karen Ngala — Mon, 27 Feb 2023 10:04:05 +0000

Note: Some terms can be confusing for beginners when used interchangeably in articles (even when they shouldn't be). I thought it'd be neat to define them before we jump in.

Variable vs Value
- In a dataset, a variable is a characteristic or attribute that is being measured or observed for each individual or unit in the dataset. For example, in a dataset of student grades, variables could include the student's name, class, subject, and test scores.
- On the other hand, a value is a specific measurement or observation of that variable for a particular individual or unit in the dataset. For example: if there were 20 students in the dataset, there would be 20 values for each variable.
Column vs Feature
- A column in a dataset can also be referred to as a feature. The variables we talked about, appear as columns in a dataset. These columns are considered features. Therefore, the terms "column" and "feature" can be used interchangeably to refer to a variable or attribute in a dataset that is used to build a model.

What is covered in this guide:

What is Exploratory Data Analysis?
Why is it important?
Common EDA techniques
Types of EDA

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is a technique used by data professionals to examine and understand datasets before modelling them. Simply put, the goal of EDA is to discover underlying patterns, trends, relationships, structures, and anomalies in the data.

EDA plays two main roles: cleaning data as well as understanding variables and the relationships between them.

Analyzing data enables analysts to derive meaningful insights that help identify data cleaning issues, inform the choice of modelling technique, and guide hypothesis testing. EDA is an iterative process consisting of activities such as data cleaning, manipulation and visualization. The EDA process can be revisited at any stage of the data analysis process if need be.

Importance of EDA

EDA allows data analysts to understand the data better by:

identifying important variables.
understanding the relationships between variables.
identifying issues in data that can affect the accuracy of your models, such as missing values and outliers.
uncovering hidden patterns in a dataset that were not obvious to the naked eye.
drawing new insights that affect associated hypotheses. These hypotheses are tested and explored to gain a better understanding of the dataset.

Components & Techniques in EDA

The techniques or steps you choose to employ are determined by the task you are performing and the dataset you are working with. You may not need to follow all the steps below.

1. Understand the Data

It is important to understand the nature of data you are working with. In this step, you need to:

1. Import the libraries you will need for analysis

#Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The next natural step is to load your data into your working environment:

data = pd.read_csv("file.csv")

2. Conduct preliminary analyses on the data. This involves answering the following questions:

a. What is the size of my dataset and what are the variable data types?

data.shape # returns the number of rows by the number of columns in the dataset

data.columns

data.dtypes

b. What does my data look like?

data.head() # view first few records of data

data.describe() # summarizes the count, mean, standard deviation, min, quartiles, and max for numeric variables

c. Are there any missing values?

data.isnull().sum() #check for missing values

data.info() # show the data type and non-null count of each column

# Checking for wrong entries (symbols such as -, ?, #, *)
for col in data.columns:
    print('{} : {}'.format(col, data[col].unique()))

data.<column_name>.unique() # applied to a column of data to return a list of unique values in that column.

There can be many reasons for missing values, such as:

There was no response recorded
Error while recording the data
Error in reading the data

Categorize your values:

After finding the missing values in your data, you need to determine what category the values fall in. This will help you determine the best method of handling the missing values as well as help you determine the statistical and visualization methods that can work with your dataset.

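Categorical variables take one of a limited set of labels or groups, such as fuel type or car make.
Continuous variables are numeric and can take any value within a range, such as price or height.
Discrete variables are numeric and can only take countable values, such as the number of doors.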

How we handle missing values depends on the situation itself and the relations these variables have with other variables. We can:

Delete all the missing value rows from the dataset before training the model.
Imputation: various methods of filling the missing values.

Ways of imputing missing values:
For continuous data, you can:

Replace the missing value with the mean, median or mode value
Train a linear model to predict the missing value

For categorical data, you can:

Replace the missing value with the mode value
Train a classification model to predict the missing value
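
To make the simplest of these options concrete, here is a minimal sketch using a small made-up DataFrame. The column names age and city are assumptions for illustration only, and the choice of median and mode is just one reasonable default, not the only valid one.

```python
import pandas as pd

# A small, made-up dataset with one missing value in each column
df = pd.DataFrame({
    "age": [25.0, None, 35.0, 40.0],                   # continuous
    "city": ["Nairobi", "Mombasa", None, "Nairobi"],   # categorical
})

# Continuous column: fill the gap with the median of the observed values
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill the gap with the most frequent value (the mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

The median is often preferred over the mean when a column contains outliers, because extreme values do not pull it as strongly.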

2. Clean the Data

The above steps are some of the many ways through which you can understand the data you are working with. The insights gained will be used in this step to help you correct some of the issues in your dataset, so as to make it more usable.
a. Remove redundant variables

cleaned_data = data.copy().drop(['variableA','variableB','variableC'], axis=1) # start from a copy of the loaded data

b. Remove rows with null values

# Using dropna(axis=0) to drop rows with null values
cleaned_data = cleaned_data.dropna(axis=0)
cleaned_data.shape # to see the change in dataset size

c. Remove outliers
You can identify outliers by visualization (discussed later in the article), z-score method, interquartile range method, and machine learning-based methods.

Outliers are data points that are noticeably different from the rest. They may represent errors in measurement, bad data collection, or variables not considered when collecting the data.
Under the interquartile range method, X is an outlier if it satisfies the criteria:

X > (Q3 + 1.5*IQR) OR X < (Q1 - 1.5*IQR)
# where:
# Q1: first quartile (25th percentile), the median of the lower half of the sorted observations
# Q2: second quartile, the median of all observations
# Q3: third quartile (75th percentile), the median of the upper half of the sorted observations
# IQR: interquartile range = Q3 - Q1
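
As a rough sketch of how this rule can be applied, the snippet below flags outliers in a made-up list of values using only Python's standard library. The sample values and variable names are assumptions for illustration; with pandas you would typically get the same quartiles from a column's quantile() method.

```python
import statistics

# Made-up sample with one suspicious value
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

# quantiles(n=4) returns the three quartile cut points: Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Anything outside these fences is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in values if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers)
```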

So, what do you do when you have skewed data and outliers?

Replace outlier values with more suitable values using quartile or interquartile range (IQR) methods.
Use a different machine learning model that is less sensitive to outliers, e.g. a Naive Bayes Classifier or Decision Tree Regressor.
Use a lot of training data to improve the signal-to-noise ratio. Outliers will have less impact on the statistical average if you are working with a lot of data.
Remove outliers by excluding them from further processing.
Use transformation methods to reduce skewness and bring your data closer to a normal distribution.

Normalization:
Transformation methods reduce the effect of outliers and skewness, which helps normalize the dataset. Some methods of variable transformation include log, square root, and Box-Cox. For example, each value of x can be replaced by its log value. Missing values, in turn, can be replaced by the column mean.

# Replacing missing values with mean:
num_col = ['columnA', 'columnB',  'columnC']
for col in num_col:
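    data[col] = pd.to_numeric(data[col], errors='coerce')  # entries that are not numbers (e.g. '?') become NaN
    data[col] = data[col].fillna(data[col].mean())  # assign the result back rather than using inplace=True on a column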

Normalization is important to ensure all features are on a similar scale so as to improve the accuracy and integrity of your data. If a dataset has features that are bigger in scale than others, they become dominating leading to inaccurate results. Using un-normalized inputs can cause your model to get stuck at very flat regions which can stop the model from learning.
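
The article does not prescribe a specific scaling method, so the following is only a sketch of two common options, min-max scaling and standardization, on a made-up DataFrame. The column names horsepower and price are assumptions for illustration.

```python
import pandas as pd

# Made-up features on very different scales
df = pd.DataFrame({
    "horsepower": [100.0, 150.0, 200.0],
    "price": [10000.0, 20000.0, 40000.0],
})

# Min-max scaling: every column ends up between 0 and 1
min_max_scaled = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): every column gets mean 0 and standard deviation 1
standardized = (df - df.mean()) / df.std()

print(min_max_scaled)
print(standardized)
```

After either step, price no longer dominates horsepower simply because it is measured in larger numbers.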

3. Analyze variable relationships

Correlation Matrix:
A correlation matrix is a table that shows how strongly different pairs of variables in a dataset are related to each other. Two variables have a:

Positive correlation when one goes up and the other goes up.
Negative correlation when one goes up and the other goes down.
No correlation when there is no relationship between them.

This is the fastest way to get a general understanding of all your variables. It helps us identify which variables are important for predicting or explaining a particular outcome of interest.

# calculate the correlation matrix for the numeric columns
corr = cleaned_data.select_dtypes(include='number').corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, cmap=sns.diverging_palette(220, 20, as_cmap=True))
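
If you want to see what the numbers behind such a heatmap look like, here is a small sketch on made-up data. The column names are hypothetical: one column rises with hours_studied and the other falls.

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "test_score": [52, 60, 71, 80, 88],   # goes up as hours_studied goes up
    "hours_of_tv": [9, 7, 6, 4, 2],       # goes down as hours_studied goes up
})

# Pairwise correlation between every pair of numeric columns
corr = df.corr()
print(corr.round(2))
```

Values close to 1 indicate a strong positive correlation, values close to -1 a strong negative correlation, and values near 0 little or no linear relationship.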

Visualization:
By drawing visual representations of your data, such as histograms, scatter plots and pie charts, you can get a better understanding of the distribution of your data. Further, visualization helps in identifying patterns and detecting outliers in a dataset.

How do I know which charts to generate?
Visualizations are all about asking analytical questions. Once you have understood your data - such as the columns(also known as features), you can ask questions to understand their relationships.

For example, if you have a dataset containing different car features such as horsepower, engine quality and price, we can ask: "How does engine quality affect price?" From this question, we can generate a scatter plot or box plot to show their relationship.
1. Histogram - shows the distribution of a numeric variable by counting how many values fall into each bin.

cleaned_data['columnX'].plot(kind='hist', bins=50, figsize=(12,6), facecolor='grey',edgecolor='black')
cleaned_data['columnY'].plot(kind='hist', bins=20, figsize=(12,6), facecolor='grey',edgecolor='black')

2. Pie Chart - commonly used to display the distribution of a single categorical variable as a percentage of a whole.

data['columnA'].value_counts().iloc[:5].plot.pie(autopct="%1.2f%%",fontsize=13,startangle=90,labels=['']*5, cmap='Set2',explode=[0.05] * 5,pctdistance=1.2)

3. Box Plot - visualize the distribution of a variable.

cleaned_data.boxplot('columnA')

A box plot can also be used to compare two variables. From the box plot below, the median price of a vehicle with two doors is 10000, and the median price of a vehicle with four doors is 12000.

sns.boxplot(x='price', y='num_of_doors', data=auto) # here, auto is the DataFrame holding the car dataset

4. Scatter plots - plot the values of two variables along two axes. Like a correlation matrix, a scatter plot shows the relationship between variables and helps identify outliers.

cleaned_data.plot(kind='scatter', x='columnA', y='columnB')

sns.pairplot(cleaned_data) # creates scatter plots between all of your variables.

Types of EDA

There are a few types of EDA techniques:

Univariate analysis: This involves examining the distribution of a single variable. The goal is to understand the central tendency (mean, median, mode), variability (range, interquartile range, standard deviation), and shape (skewness, kurtosis) of the variable.
When exploring a single variable, we can use the following methods:
a. For continuous data:
- Tabular method: describe central tendency, dispersion, and missing values.
- Graphical method: histograms for the distribution and box plots for detecting outliers.
b. For categorical variables:
- Tabular method: the .value_counts() operation in pandas gives a table of frequencies.
- Graphical method: a bar plot is usually the best graph for a categorical variable.
Bivariate analysis: This involves analyzing the relationship between two variables. The goal is to understand how changes in one variable relate to changes in another variable. Common bivariate analysis techniques include scatter plots, line charts, and correlation analysis.
When exploring two variables, we can use the following methods:
a. For continuous-continuous types: scatter plots and correlation analysis.
b. For categorical-continuous types: bar plots and t-tests.
c. For categorical-categorical types: two-way tables and the chi-square test.
Multivariate analysis: This involves analyzing the relationship between multiple variables. The goal is to understand how multiple variables interact with each other and to identify any patterns or relationships that may exist. Common multivariate analysis techniques include principal component analysis (PCA) and factor analysis.
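
The tabular methods described above can be sketched on a tiny made-up car dataset as follows. The column names and values are assumptions for illustration, and only pandas calls are used, so the statistical tests themselves (t-test, chi-square) are left out.

```python
import pandas as pd

df = pd.DataFrame({
    "num_of_doors": ["two", "four", "four", "two", "four"],
    "fuel_type": ["gas", "gas", "diesel", "gas", "diesel"],
    "price": [10000, 12000, 15000, 9000, 14000],
})

# Univariate, categorical: frequency table
door_counts = df["num_of_doors"].value_counts()

# Univariate, continuous: central tendency and dispersion
price_summary = df["price"].describe()

# Bivariate, categorical-continuous: mean price within each group
mean_price_by_doors = df.groupby("num_of_doors")["price"].mean()

# Bivariate, categorical-categorical: two-way table of counts
two_way = pd.crosstab(df["num_of_doors"], df["fuel_type"])

print(door_counts)
print(price_summary)
print(mean_price_by_doors)
print(two_way)
```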

Conclusion

I hope this article gave you a better understanding of Exploratory Data Analysis and how to apply EDA techniques to your dataset.

Feedback is very welcome and highly appreciated.

Python 101: Introduction to Python for Data Science

Karen Ngala — Fri, 17 Feb 2023 11:23:50 +0000

A big dilemma many techies face when picking up a new skill is "what language or tool should I use, and why?". This dilemma of choice is popularly known as "analysis paralysis" or "choice overload." You will feel overwhelmed by the options available to you, which can lead to indecision and a feeling of being stuck. I've been there.

Going into data science, you have the option of learning many languages ranging from Python, R, Java, and Julia, just to name a few. The choice you make should be individual to you, your specific goals, background, and preferences. Not because of peer influence. So, why Python?

It has a simple and intuitive syntax.
Python has developed a deep ecosystem around Data Science. It has a large and active community of volunteers that create and contribute to the wealth of data science libraries such as matplotlib, sklearn, pandas, and numpy.
Python is widely used beyond Data Science, in areas such as web development.

Setting up a Python environment

Before jumping into the deep-end, you need to set up your computer in a way that allows you to write and run code. First, check that you have python installed using the following command:

python --version

If you have Python installed, the output should be the version you have, e.g.:

Python 3.8.5

If you do not, you will get an error. You can download the latest Python version from the official Python website.

A good place to start for beginners is using Anaconda as the environment for your Data Science workflow. Package conflicts in a Python environment can be a nightmare to deal with. Anaconda helps you navigate this and houses required tools, such as Jupyter Notebook. You can later move on to using virtual environments.

TOOLS:

Jupyter Notebook
is an open-source web application that allows data scientists, like yourself, to create and share live code and visualizations. Each notebook contains executable cells and text descriptions. This makes it easy for people to interact with and understand the code from start to end. You can share your code with others using Jupyter Notebook.

Google Colab
Also known as Colaboratory, it is a Jupyter notebook environment that runs entirely in the cloud and requires no setup. It allows users to load notebooks from public GitHub repos as well as save them to GitHub. A copy of each notebook will be saved on your Google Drive.

Python Basics

Learning the language entails first understanding the syntax and rules of Python as a programming language. I will summarize some of the fundamentals of working with Python. For absolute beginners, it would be beneficial to seek further resources and materials. The following are great places to start:

Getting Started with Python on Programiz
Python For Beginners on python.org
How to Use Python: Your First Steps by Leodanis Pozo Ramos on Real Python

1. Variables & Data types

A variable is a named reference to a value that can be changed during program execution. Assigning a value to a variable is done using the assignment operator (=).
A data type is the nature of value assigned to variables. Python supports the following data types:

integer (an integer value with no decimal value)
string (alphanumeric text)
float (a number with a decimal value)
boolean (value can only be True or False)

Data Structures:

lists - collection of values that are ordered and changeable. Syntax wise, it uses square brackets: my_list = [1, 2, 3, 4]
tuple - similar to a list, but its values cannot be changed once created. Syntax wise, it uses parentheses: my_tuple = (1, 2, 3, 4)
dictionary - a changeable collection of key-value pairs. Since Python 3.7, a dictionary keeps its items in insertion order. Syntax wise, it uses curly braces: my_dict = {'name': 'John', 'age': 30}
sets - an unordered collection of unique values. Example: my_set = {1, 2, 3, 4}. Values in a set will never repeat
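
The differences between these structures are easiest to see by trying to change them. The short sketch below reuses the example values from the list above; the extra keys and values are made up.

```python
my_list = [1, 2, 3, 4]
my_list.append(5)              # lists can grow and change

my_tuple = (1, 2, 3, 4)
try:
    my_tuple[0] = 99           # tuples cannot be changed after creation
except TypeError as error:
    tuple_error = str(error)

my_dict = {'name': 'John', 'age': 30}
my_dict['age'] = 31            # update an existing key
my_dict['city'] = 'Nairobi'    # add a new key-value pair

my_set = {1, 2, 2, 3, 3, 3}    # duplicates are dropped automatically

print(my_list)
print(tuple_error)
print(my_dict)
print(my_set)
```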

2. Operators

The symbols used for mathematical and logical operations are pretty straightforward in Python: + for addition, - for subtraction, * for multiplication, and / for division; == for checking value equality, != for not equal, < for less than, and > for greater than.
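
A quick sketch of these operators in action, using two made-up numbers:

```python
a = 7
b = 2

total = a + b          # addition
difference = a - b     # subtraction
product = a * b        # multiplication
quotient = a / b       # division always returns a float

is_equal = (a == b)    # comparisons return a boolean
is_not_equal = (a != b)
is_less = (a < b)
is_greater = (a > b)

print(total, difference, product, quotient)
print(is_equal, is_not_equal, is_less, is_greater)
```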

3. Logic & Process Flow

The first thing to note here is indentation. Python follows a strict indentation rule when it comes to blocks of code. While other languages use markers such as curly braces, Python relies on the indentation level to decide which statements belong to a block.
Conditions
They are used to execute a block of code based on whether a certain condition is true or false. For example, if... else is a conditional statement that executes the first block of statements if the condition is true and the statements after else if the condition is false. For multiple conditions, the if... elif... else statement can be used.
Loops
They are used to repeat a certain block of code multiple times until a specific condition is met. Python has the for loop and the while loop. For loops are used to iterate over a sequence, while while loops are used to repeat a block of code until a specific condition is met.
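
As a small illustration of conditions and both kinds of loop working together, here is a sketch with made-up scores and thresholds:

```python
scores = [82, 55, 31]
labels = []

# for loop: visit every item in a sequence
for score in scores:
    # if / elif / else: exactly one branch runs for each score
    if score >= 70:
        labels.append("distinction")
    elif score >= 50:
        labels.append("pass")
    else:
        labels.append("fail")

# while loop: repeat until the condition stops being true
countdown = 3
steps = 0
while countdown > 0:
    countdown -= 1
    steps += 1

print(labels)
print(steps)
```

Note how the indentation alone tells Python which lines belong to the loop and which belong to each branch.
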
Functions
They are used to group together a set of instructions that can be called multiple times elsewhere in a program. Functions are defined using the def keyword, followed by the function name and the input parameters. They can also return a value or simply perform an action.
For example:

def greet_user(name):
    if name == "Alice":
        print("Hello, Alice!")
    else:
        print("Hello, stranger!")
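
A function can also hand a result back to the caller with return instead of printing it. The function name rectangle_area below is made up for illustration:

```python
def rectangle_area(length, width):
    """Return the area so that the caller can reuse the result."""
    return length * width

area = rectangle_area(4, 5)   # the returned value is stored in a variable
doubled = area * 2            # and can be used in further calculations

print(area, doubled)
```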

Classes and objects
Python is an object-oriented programming language. This is a programming paradigm that organizes code into reusable and modular components.
A class is a blueprint for creating objects that share the same attributes and behaviours.
Objects are instances of a class that are created using the class constructor. They can have attributes, which are variables that store data, and methods, which are functions that can be called on the object.
In the following example,

Class: Rectangle
Object: my_rectangle

class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def area(self):
        return self.length * self.width

my_rectangle = Rectangle(4, 5)
print(my_rectangle.area())  # Output: 20

Understanding OOP will be important when interacting with the libraries used in data science.

4. File Handling

This is an important part of data science. Reading from and writing to files is a common task of data science and data analysis.
Reading a File
The open() function is used to open a file (file.txt in this case) in 'r' mode. This mode specifies that the file should be opened in read-only mode. The read() method reads the contents of file.txt into the contents variable.

The with keyword is used to ensure that the file is closed once it is read.

with open('file.txt', 'r') as f:
    contents = f.read()

Writing to a File
The 'w' denotes write mode while the write() function is used to write "Hello, world!" to the file.

with open('file.txt', 'w') as f:
    f.write('Hello, world!')

Other modes include the 'a' mode which specifies that the file should be opened in append mode. This allows additional text to be written at the end of the file without erasing what is already there.
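
To see the difference between write mode and append mode, here is a sketch that works in a temporary folder so it does not touch your own files. The file name notes.txt is an assumption for illustration.

```python
import os
import tempfile

# Create a throwaway folder and build a path inside it
folder = tempfile.mkdtemp()
path = os.path.join(folder, "notes.txt")

with open(path, "w") as f:     # 'w' creates the file, or overwrites it if it exists
    f.write("first line\n")

with open(path, "a") as f:     # 'a' keeps the existing content and adds to the end
    f.write("second line\n")

with open(path, "r") as f:
    contents = f.read()

print(contents)
```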

Loading and manipulating data in Python

Data Science often requires working with large amounts of data. Therefore, you need to load the data. There are several ways to load data in Data Science with the most common method being the Pandas library.

Pandas

It is an open-source data analysis and manipulation library for Python. It offers fast and flexible data structures for working with structured and time series data.

Install the pandas library by running the following command in your terminal or command prompt:

pip install pandas

Pandas offers two primary data structures: Series and DataFrame. A Series is a one-dimensional labelled array.
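
A minimal sketch of a Series, using made-up names and ages:

```python
import pandas as pd

# A Series holds one column of values, each with a label (the index)
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="age")

bob_age = ages["Bob"]     # look a value up by its label
mean_age = ages.mean()    # a Series comes with built-in statistics

print(ages)
print(bob_age, mean_age)
```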

A DataFrame is a 2D table-like data structure in Pandas. It is similar to a spreadsheet or SQL table in that it consists of rows and columns. You access data in a DataFrame by its row and column labels. Rows are labelled with an index, and the columns are labelled with column names. You can then load data into a pandas DataFrame as follows:

import pandas as pd

# Replace 'data.csv' with the name of your file
data = pd.read_csv("data.csv")

There are many methods that you can apply to manipulate your data using Pandas. Pandas offers an array of data manipulation tools such as filtering, grouping, merging, reshaping, pivoting data, as well as time series analysis. It is worth reading further on these. Below are a few examples:

# Print the first few rows of the DataFrame
print(data.head())

# Output:
      name  age gender
0    Alice   25      F
1      Bob   30      M
2  Charlie   35      M

# Filter the DataFrame to only include rows where the 'age' column is greater than 30
filtered_df = data[data['age'] > 30]

# Group the DataFrame by the 'gender' column and compute the mean of the 'salary' column for each group
# (this assumes your data also has a 'salary' column)
grouped_df = filtered_df.groupby('gender')['salary'].mean()
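
If you do not have a CSV file at hand, the same filtering and grouping steps can be tried on a DataFrame built in memory. The rows and the salary figures below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 41],
    "gender": ["F", "M", "M", "F"],
    "salary": [50000, 60000, 70000, 90000],
})

# Keep only the rows where age is greater than 30
older = df[df["age"] > 30]

# Mean salary per gender among the remaining rows
mean_salary = older.groupby("gender")["salary"].mean()

print(older)
print(mean_salary)
```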

Numpy

Numpy is also a data analysis and manipulation library. However, it differs from pandas in that a numpy array holds values of a single (homogeneous) data type, while a pandas DataFrame can hold a different data type in each column (heterogeneous). Read about Homogeneous vs Heterogeneous data types
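
The difference shows up as soon as you mix types. In this small sketch with made-up values, numpy converts everything in the array to one type, while each pandas column keeps its own:

```python
import numpy as np
import pandas as pd

# Integers and a float in one array: numpy stores them all as floats
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)

# A DataFrame keeps a separate data type for each column
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
print(df.dtypes)
```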

Install the numpy library by running the following command in your terminal or command prompt:

pip install numpy

Numpy is the foundation for many other scientific computing and data science libraries in Python, such as Pandas.

Numpy is a great library for statistical and mathematical operations. For example, generating mean, median and standard deviation:

import numpy as np

# Create a dataset
data = [1, 2, 3, 4, 5]

# Calculate the mean, median, and standard deviation
mean = np.mean(data)
median = np.median(data)
std = np.std(data)

print("Mean:", mean)
print("Median:", median)
print("Standard deviation:", std)
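
One detail worth knowing: by default np.std() computes the population standard deviation (it divides by n), whereas the std() method in pandas defaults to the sample standard deviation (it divides by n - 1). The sketch below shows how to get both from numpy using the ddof parameter.

```python
import numpy as np

data = [1, 2, 3, 4, 5]

population_std = np.std(data)        # default ddof=0: divides by n
sample_std = np.std(data, ddof=1)    # ddof=1: divides by n - 1

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```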

Data Visualizations using Matplotlib

Data visualization is a critical part of data science. It allows you to understand and communicate the insights derived from your data. Matplotlib provides a wide range of tools for creating different types of charts and plots, including line charts, bar charts, histograms, scatter plots, and more. It also offers customization through styles, shapes, and colors.

Install the matplotlib by running the following command in your terminal or command prompt:

pip install matplotlib

To demonstrate the different capabilities of Matplotlib, let's create a simple line plot.

# Import the library
import matplotlib.pyplot as plt

# Some sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Plot the data to create a line chart
plt.plot(x, y)

# Add labels and title
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Line Plot')

# Display the chart
plt.show()

To represent the relationship between the variables, you can create a scatter plot. The only difference in the above code will be in plotting (and the title, of course). Replace plt.plot(x, y) with:

plt.scatter(x, y)

Numpy could be used in the above example to generate the data instead of typing it in by hand.

import numpy as np

x = np.linspace(0, 10, 100) # This generates 100 data points for the x-axis 
y = np.sin(x) # This calculates the corresponding y-axis values

For a bar graph on the other hand, you would need labels and their corresponding values

import matplotlib.pyplot as plt

# Data to be used
labels = ['A', 'B', 'C', 'D', 'E']
values = [5, 3, 7, 2, 8]

# Create a bar chart
plt.bar(labels, values)

# Add labels and title
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Chart')

# Display the chart
plt.show()

There are many other charts that you can create using matplotlib such as histograms, scatter plots, pie charts, and more. It is worth exploring the matplotlib documentation to familiarize yourself with the different charts.

Optimisation with SciPy

SciPy is a scientific computing library built on top of NumPy. It provides additional functionality for optimization, integration, interpolation, linear algebra, and more.

The example below uses SciPy to perform a simple optimization problem:

from scipy.optimize import minimize_scalar

# Define the objective function (a quadratic function)
def objective(x):
    return x**2 + 3*x + 4

# Find the minimum of the objective function 
result = minimize_scalar(objective)

# Print the minimum value and the corresponding value of x
print("Minimum value:", result.fun)
print("Value of x at minimum:", result.x)

The minimize_scalar() function runs an optimization algorithm to find the minimum of the function. The result object holds the minimum value of the function (result.fun) and the value of x at which that minimum occurs (result.x).

This concept can be applied to more complex optimization problems, including those with multiple variables and constraints. Scipy is a powerful and versatile library with many scientific and engineering applications.
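
For a simple quadratic like the one above, you can check the optimizer's answer by hand: a function of the form a*x**2 + b*x + c with a > 0 has its minimum at x = -b / (2*a). The sketch below does that check in plain Python, without SciPy; the variable names are made up for illustration.

```python
# Coefficients of the objective x**2 + 3*x + 4
a, b, c = 1, 3, 4

def objective(x):
    return a * x**2 + b * x + c

# Analytical minimum of a quadratic with a > 0
x_at_minimum = -b / (2 * a)
minimum_value = objective(x_at_minimum)

print("Value of x at minimum:", x_at_minimum)
print("Minimum value:", minimum_value)
```

The values printed here are what you should expect minimize_scalar() to report, up to a small numerical tolerance.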

Statistical analysis in Python

This involves interpreting, analyzing, and presenting the collected data. There are several libraries that support statistical analysis in Python. These libraries can perform various statistical analysis tasks such as:

Hypothesis testing - testing claims about the population based on a sample of data. This can be done using libraries such as SciPy.
Regression analysis - modelling the relationship between two or more variables. For example, Statsmodels can be used to perform a linear regression on a dataset.
Descriptive statistics - a simple and quick summary of a dataset. Numpy is used for summaries such as calculating the mean, median, and standard deviation of a dataset.
Time series analysis - modelling and forecasting time-dependent data. This can be done using libraries such as Statsmodels.
Predictive modelling - libraries such as Scikit-learn provide a range of machine learning algorithms, including linear and logistic regression, decision trees, random forests, support vector machines, and neural networks.
Probability distribution - modelling the uncertainty in a dataset using common probability distributions such as the normal, binomial, and Poisson distributions. This can be done using SciPy.
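
As one example from this list, here is a minimal sketch of a hypothesis test with SciPy: an independent two-sample t-test on two small made-up groups. The data and the 0.05 significance threshold are assumptions for illustration.

```python
from scipy import stats

# Made-up measurements from two groups
group_a = [5.1, 4.9, 5.0, 5.2, 5.1]
group_b = [6.0, 6.2, 5.9, 6.1, 6.3]

# Test the claim that both groups have the same mean
result = stats.ttest_ind(group_a, group_b)

print("t statistic:", result.statistic)
print("p-value:", result.pvalue)

# A small p-value suggests the difference in means is unlikely to be due to chance
is_significant = result.pvalue < 0.05
print("Significant at 0.05:", is_significant)
```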

Conclusion

In this article, we have covered some of the key features and concepts of Python, including data types, operators, control flow, functions, and file reading/writing. We have also introduced some of the most commonly used libraries in Python for data analysis, such as NumPy, Pandas, Matplotlib, and SciPy.

Python is a powerful language for Data Science. Its readability and its popularity within the data science community make it easy for beginners to dive into Data Science. There are numerous resources available for learning and development.

As an aspiring data scientist, learning Python is only the beginning of building your skillset. This article is a great starting point for beginners looking to learn Python and its applications in data analysis. Keep practising and exploring the wonderful world of Data Science. The possibilities are endless.

Starting a new Django Project with PostgreSQL database

Karen Ngala — Wed, 28 Sep 2022 19:37:52 +0000

Pre-reading: Tutorials you may need

This article assumes:

Basic understanding of Django.
Basic knowledge of how to use CLI.
Basic understanding of Git.

Let's jump right into it!

First, head to your terminal and create a new folder using the mkdir command. This is the folder that will host all the work for the project you are working on.
Then cd into this folder to create a virtual environment.

1. Create a virtual environment

There are many virtual environment tools available.

Working within a virtual environment ensures you isolate Python installs and associated pip packages, allowing you to install and manage your own set of packages that are independent of those provided by the system or used by other projects. Depending on the virtual environment you chose to install on your machine, the command to create a virtual environment will vary.

For this use case, we will be using virtualenv.

virtual is the name of my virtual environment.

$ virtualenv virtual

Activate the virtual environment so as to work within the virtual environment:

$ source virtual/bin/activate

2. Install Django

You can now install Django into this dedicated workspace.

# This command will install the most recent version of django.
(virtual) $ pip install django

To install a specific version of django, specify it as follows(replace the number after the == sign with the version you wish to install):

(virtual) $ pip install django==2.2.11

To make collaboration easier and keep track of all packages (and their versions) you currently have in your virtual environment, pin your dependencies using the following command. This will create the file requirements.txt. You can run this command again whenever you install more external packages to update the list of dependencies.

(virtual) $ pip freeze > requirements.txt

3. Create django project & app

Django is organized in two major parts: project and app.

Project - the package that represents the entire website. The project directory contains settings for the whole website. A project can have many apps. Create a project using the following command:

(virtual) $ django-admin startproject <project-name>

Your folder structure will look like this:

example/
│
├── project/
│   ├── __init__.py
│   ├── asgi.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
│
└── manage.py

App - a sub-module of a project that implements a specific functionality. For example, a website can have an app for posts and another app for payment. Create a django app using the following command:

(virtual) $ python manage.py startapp <app-name>

A new folder will be added. Your folder structure will look like this:

example/
│
├── app/
│   │
│   ├── migrations/
│   │   └── __init__.py
│   │
│   ├── __init__.py
│   ├── admin.py
│   ├── apps.py
│   ├── models.py
│   ├── tests.py
│   └── views.py
│
├── project/
│   ├── __init__.py
│   ├── asgi.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
│
└── manage.py

4. Create gitignore & .env files

Before adding git to your project, or before you can commit the changes you've made so far, there are some files you don't want tracked.
The .gitignore file tells git to not track these files or any changes you make to them.

example/
│
├── app/
│   │
│   ├── migrations/
│   │   └── __init__.py
│   │
│   ├── __init__.py
│   ├── admin.py
│   ├── apps.py
│   ├── models.py
│   ├── tests.py
│   └── views.py
│
├── .gitignore
├── .env
├── .env.example
│
├── project/
│   ├── __init__.py
│   ├── asgi.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
│
└── manage.py

These are some of the files you can add to .gitignore. You can add or omit anything. For example, I usually have a .txt file that I use for 'rough work' which I add to .gitignore.

virtual/
.env
*.pyc
db.sqlite3
migrations/
media/*

The reason for adding migrations folder in gitignore is to minimize merge conflicts and errors in production.

Your project also contains sensitive data that you do not want tracked, such as your Django secret key or your database password. This information is stored in a .env file, which is then listed in the .gitignore file.

When collaborating with others, create a .env.example file that contains example data that other collaborators can replace with their own values to run your project locally. This way, no one commits their environment credentials and you don't have to change the values each time you pull the project.

Contents of .env may look like this:

SECRET_KEY=generate-a-key
DEBUG=True
DB_NAME=db-name
DB_USER=username
DB_PASSWORD=your-password
DB_HOST=127.0.0.1
MODE=dev
ALLOWED_HOSTS=*
DISABLE_COLLECTSTATIC=1

You can then reference these credentials in project/settings.py as follows:

from decouple import config, Csv  #add this to the top



MODE=config("MODE")

SECRET_KEY = config('SECRET_KEY')

DEBUG = config('DEBUG', cast=bool)

ALLOWED_HOSTS = config('ALLOWED_HOSTS', cast=Csv())
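
The cast arguments matter because every value in a .env file is read as plain text. The sketch below illustrates the idea in plain Python. It is not how python-decouple is implemented, and the helper names parse_env, to_bool and to_list are made up for illustration.

```python
def parse_env(text):
    """Turn KEY=value lines into a dict of strings, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

def to_bool(value):
    """Interpret common 'truthy' spellings as True, everything else as False."""
    return value.strip().lower() in ("true", "1", "yes", "on")

def to_list(value):
    """Split a comma-separated string into a list of non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]

env_text = """
SECRET_KEY=generate-a-key
DEBUG=True
ALLOWED_HOSTS=127.0.0.1,localhost
"""

raw = parse_env(env_text)
DEBUG = to_bool(raw["DEBUG"])                    # the string "True" becomes the boolean True
ALLOWED_HOSTS = to_list(raw["ALLOWED_HOSTS"])    # one string becomes a list of hosts

print(raw)
print(DEBUG, ALLOWED_HOSTS)
```

Without a cast, DEBUG would be the non-empty string "False" even when you meant to turn it off, and a non-empty string counts as true in Python.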

5. Database and settings.py

The default database used by Django out of the box is SQLite. For more complex projects, you will require a more powerful database like PostgreSQL.

Some operating systems come with Postgres pre-installed; on others, you will need to install it.

To check if you have PostgreSQL installed, run the which psql command.

If Postgres is not installed, there appears to be no output. You just get the terminal prompt ready to accept another command:

> which psql
>

If Postgres is installed, you'll get a response with the path to the location of the Postgres install:

> which psql
/usr/bin/psql

To support a Postgres database, you need to install psycopg2 and two other libraries. psycopg2 is the database adapter that lets Python talk to PostgreSQL.

pip install psycopg2
pip install dj-database-url
pip install python-decouple

Make the following changes to project/settings.py

import dj_database_url


INSTALLED_APPS = [
    'application',  #new
    'django.contrib.admin',
    ...
]



# Database
# https://docs.djangoproject.com/en/3.1/ref/settings/#databases
if config('MODE')=="dev":
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql', #changed database from sqlite to postgresql (the old postgresql_psycopg2 name was removed in Django 3.0)
            'NAME': config('DB_NAME'),
            'USER': config('DB_USER'),
            'PASSWORD': config('DB_PASSWORD'),
            'HOST': config('DB_HOST'),
            'PORT': '',
        }
    }
else:
   DATABASES = {
       'default': dj_database_url.config(
           default=config('DATABASE_URL')
       )
   }

db_from_env = dj_database_url.config(conn_max_age=500)
DATABASES['default'].update(db_from_env)
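To show what a DATABASE_URL carries, here is a rough standard-library sketch that splits such a URL into the same keys used in the dev settings above. It is an illustration under simplifying assumptions (PostgreSQL URLs only, no query options), not how dj-database-url is implemented; parse_database_url and the sample URL are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def parse_database_url(url):
    """Split a postgres:// URL into a Django-style settings dict (sketch only)."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"Unsupported scheme: {parts.scheme!r}")
    return {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": unquote(parts.path.lstrip("/")),
        "USER": unquote(parts.username or ""),
        "PASSWORD": unquote(parts.password or ""),
        "HOST": parts.hostname or "",
        "PORT": str(parts.port) if parts.port else "",
    }


# Sample URL built from the placeholder values in the .env example.
settings = parse_database_url(
    "postgres://username:your-password@127.0.0.1:5432/db-name"
)
for key, value in settings.items():
    print(f"{key}: {value}")
```

A single URL is convenient on hosting platforms that inject DATABASE_URL for you, while separate DB_NAME, DB_USER and DB_PASSWORD values are easier to edit by hand during local development.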

6. Version tracking using git

Initialize version control using the git init command. Then add and commit your changes.

7. Test

Check that your setup worked by running this command:

(virtual) $ python manage.py runserver

You will use this command any time you need to test your code in the browser. By default, the development server listens on 127.0.0.1:8000 (port 8000).

You should see Django's default welcome page in your browser.

At this point, you’ve finished setting up the scaffolding for your Django website, and you can start implementing your ideas by adding models, views and templates.

Summary of Commands

Commands in order of execution:

Command	Description
$ virtualenv virtual	set up a virtual environment
$ source virtual/bin/activate	activate the virtual environment
(virtual) $ pip install django	install Django inside the virtual environment
(virtual) $ django-admin startproject <projectname>	set up a Django project
(virtual) $ python manage.py startapp <appname>	set up a Django app
(virtual) $ pip install psycopg2	connect the database to Python
(virtual) $ pip install dj-database-url	build database settings from DATABASE_URL
(virtual) $ pip install python-decouple	read settings from the .env file
(virtual) $ pip freeze > requirements.txt	pin dependencies and versions
(virtual) $ git init	initialize version control, then add and commit your changes
(virtual) $ python manage.py runserver	view the website on 127.0.0.1:8000

Conclusion

In this article, we went through the steps of starting a new Django project with a PostgreSQL database, as well as the common terminal commands used for Django web development.

I hope you found this article helpful!

Developer's guide to remote collaboration

Karen Ngala — Sun, 08 Nov 2020 18:18:39 +0000

Pre-requisites

This article assumes a basic understanding of Git and GitHub, and some experience using them.

So, collaboration...

The first step of effective collaboration is identifying a software development methodology, or Software Development Life Cycle (SDLC), to use.

As a software developer, you have encountered or inevitably will encounter the Agile methodology. A good place to start is by reading the Agile manifesto and the principles behind it. It is brief, yet complete. It was written by 17 software developers who met to uncover better ways of developing software. Agile development is beyond the scope of this article, but it may interest you because it focuses on a developer's mindset and values, not tools or processes.

Tools

Because we still value the items on the right of the Agile manifesto, such as processes and tools, we need to consider various technologies that will make remote software development and collaboration seamless and effective.

Definition of scrum terms

Scrum is an agile project management framework that describes a set of meetings, tools, and roles in team work.

Scrum master - ideally dedicated to just one team, to avoid context switching. The scrum master (SM) is in charge of leading the daily standup, addressing blockers, merging approved pull requests, and coaching the team on best practices.
- The SM role assumes servant-leadership, a way of leading people without having formal authority over them. The SM leads by setting a shared vision, involving everyone in decisions, and coaching the group.
Scrum team - the team of developers and designers working on the project. A scrum team is ideally self-managing and cross-functional.
Backlog - the master list of work that needs to get done to complete a project.
Blocker - an obstacle faced while tackling an assigned task.
Standup - a short meeting held daily, usually in the morning, by a team of developers working on a project. The term comes from the fact that during the meeting, each developer literally stands up and briefly answers:
- What did I do yesterday? (achievements)
- What do I plan to do today? (tasks)
- Am I facing any challenges? (blockers)
Note: Standups can be run as often as suits the team.
Sprint - a time period, usually between 1 week and 1 month but typically 2 weeks, in which a team works to complete a set number of user stories. A project is generally divided into sprints, and each sprint should produce a usable end-product (an increment).
Kanban board - a basic board divides a sprint's work into columns such as 'To do', 'In progress' and 'Complete'. It can be altered to suit the needs of the team with columns like 'In review' or 'Resources'.

Breakdown of tools and processes

Processes:

Organize product backlog from stakeholder feedback and requirements
Plan sprints, set timelines and allocate tasks
Run sprints with daily standups
Production

Tools for:

Communication
- Slack is a great tool for general communication and integrates with numerous developer tools.
- Google Meet is great for running standups, even in large teams. It allows for screen sharing and has no time limit for a large number of attendees. Google Calendar also comes in handy when scheduling recurring meetings.
Project management

Kanban boards come in here.

Each GitHub repository has a Projects section for managing tasks within the repo. Some external resources include:
- Trello
- Atlassian Jira
Design and prototyping

Design is an important part of a collaborative project. It keeps front-end developers from straying from the agreed-upon design, saving hours of back and forth.
- Figma - web-based, collaborate as you would on a google doc

Coding

When it comes to coding in a collaborative environment, coding best practices play a major role in its effectiveness.

A good place to start
A good article - The perfect code review process. Its author, Robert Cooper, takes you through a fictional scenario of Jimmy and his team.
Gitflow is a branching model in which every developer on the team works on independent features. A feature should not be dependent on another feature, so programmers can work on features simultaneously without waiting on another developer's work.
- Naming convention - each developer works on a feature branch, e.g. ft-header, ft-authentication, ft-models
- Pull requests are made to a development branch, not main/master. At any given time, the master branch should contain only deployable work (no bugs, no incomplete work). An incomplete branch should never be merged.
- If modifications are requested after your branch has been reviewed, use interactive rebasing to fold the fixes into your existing commits instead of adding extra commits.

Extra remarks

Take some time to listen to some of Uncle Bob's talks on YouTube
Read: Best practices when it comes to git commits
Always keep in mind:
- Code is read more often than it is written.
- It is your duty as a programmer to write readable and maintainable code.

Happy hacking!
