DEV Community: Armand Sauzay

How to properly setup your Python project

Armand Sauzay — Tue, 16 May 2023 21:22:03 +0000

Industry best practices to kickstart your python project.

Photo by David Clode on Unsplash

As you start working on your python project, you will likely need to set it up
in a consistent and collaboration-friendly way. In this article, I'll describe a
setup that works great for my projects. It includes many industry best practices
and aims at explaining how to install python, run automated checks in a GitOps
fashion and structure your project.

This article covers quite a few topics. To ease the reading, I've split each topic into two parts:

📚 indicates the theory part
🛠️ indicates the practical part (i.e. the commands you need to run).

Sometimes you will also see a 💡 section, indicating a tip or a trick.

Install a Python version manager
Choose an environment manager (poetry)
Alternatively, use Docker as a Dev Environment instead
Add Some Code
Write Some Tests
Lint your code
Automate checks on local with pre-commit
Automate checks on remote with GitHub Actions
Automate your release with GitHub Actions
Enjoy the benefits of your new code practices

1. Install a Python version manager

📚 The first thing you will need to do is to install a python version manager. A
python version manager will allow you to install multiple versions of python on
your machine and switch between them easily.

To illustrate this, let's say you have a project that requires python 3.6 and
another project that requires python 3.7. If you only have python 3.7 installed
on your machine, you will have to uninstall it and install python 3.6 every time
you want to work on the first project... This is where a python version manager
comes in handy.

You can use pyenv for this and follow the
installation instructions.

🛠️ At the time of writing this document, you can install pyenv by running the
following:

curl https://pyenv.run | bash

If you want more control over the installation, on mac for instance you can run:

Use brew to install pyenv
```
brew install pyenv
```

Depending on the shell you use, add pyenv to your PATH.

💡 If you're not sure which shell you're using, you can run the following command:

echo $SHELL

For bash:

echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc

For zsh:

echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(pyenv init -)"' >> ~/.zshrc

2. Choose an environment manager (poetry)

📚 Now that you have a python version manager, you will need to install an
environment manager. An environment manager will allow you to create isolated
environments for your projects. This is useful when you work on multiple
projects that require different versions of the same package. And it is a great
practice overall to make sure anyone can run your project.

To illustrate this, let's say you have a project that requires pandas and
matplotlib to do some data analysis and another project that requires
tensorflow to do some machine learning. It is considered a good practice to
create a separate environment for each project. This way, you can install
only the packages you need for each project and avoid any conflicts between
packages.

There are many environment managers (pipenv, conda, virtualenv, python
built-in venv, etc...). I personally use poetry.

In general, your environment manager will be installed with your python version.
For instance, if you use pyenv, you can install poetry by running the following command:

pyenv install 3.10.6
pyenv global 3.10.6
pip install poetry

In this case, poetry will live in your python environment. This means that
you can have a different version of poetry for each python version you have
installed on your machine.

🛠️ Installing poetry with pip inside your python environment is one way to do it, but you can also follow the steps below:

Use the curl command to install poetry

curl -sSL https://install.python-poetry.org | python3 -

Add poetry to your PATH. On MacOS, poetry is added at ~/Library/Application Support/pypoetry/venv/bin/poetry. So you can add it to your PATH by adding the
following line to your ~/.bashrc or ~/.zshrc file:
```
export PATH="$HOME/Library/Application Support/pypoetry/venv/bin:$PATH"
```
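If you are on a different operating system, poetry is installed elsewhere (on Linux, the installer typically places the poetry executable in ~/.local/bin), so add that directory to your PATH instead. Once poetry is available, you can check where it stores the virtual environments it creates by running the following command:
```
poetry config --list --local | grep virtualenvs.path
```
Note that this shows where your project environments live, not where the poetry executable itself is installed.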

3. Alternatively, use Docker as a Dev Environment instead

If you don't want to install a python version manager and an environment
manager, or want to abstract this for your team, you can use Docker instead. I
won't go into too many details here but if you're interested, VSCode has a great
integration with Docker.

4. Add Some Code

📚 Now that you have a python version manager and an environment manager, you
can start working on your project. Let's write some simple functions to get
started. For illustration purposes we'll create a simple CLI that prints "Hello
<name>" when you run it and reports how many repositories <name> has on GitHub.

💡 By default, poetry wants you to create a package with a folder named as your
git repository (with the only difference that it wants underscores instead of dashes). So if your git
repository is called python-project-template, poetry will create a folder
called python_project_template.

💡 You can find the code for this section at armand-sauzay/hello-world-cli

In this case let's call our project hello-world-cli. So we'll create a folder
called hello_world_cli and add a file called __init__.py to it.

🛠️ Use gh to create a new repository called hello-world-cli and clone it to your
machine:

gh repo create <your-github-username>/hello-world-cli --template armand-sauzay/python-template --public --clone

To initialize your project, run the following command (it comes from the template above):

./script/bootstrap

Add the requests package to your project:

poetry add requests

So far, we ran the following commands:

pyenv install 3.10.6 # install python 3.10.6
pyenv global 3.10.6 # (optional) set python 3.10.6 as the default python version
pip install poetry # install poetry as a python package in your python 3.10.6 environment
poetry add requests # add the requests package to your project

In the hello_world_cli folder, create a file called cli.py and add the following code to it:

import argparse
import requests


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--name", help="The name to greet and count repos for", default="world"
    )
    args = parser.parse_args(argv)
    if args.name == "":
        print("Username cannot be empty")
        return 1

    print(f"Hello {args.name}!")

    repos = requests.get(f"https://api.github.com/users/{args.name}/repos")
    if repos.status_code != 200:
        print(f"Failed to fetch repos for {args.name}")
        return 1
    repos = repos.json()
    print(f"You have {len(repos)} repos.")  # type: ignore
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

There are a few tricks here that are worth mentioning:

We use argparse to parse the arguments passed to our CLI. It is part of the Python standard library, so there is nothing extra to install.
We pass argv to our main function. This is useful for testing purposes: we can call main with a list of arguments and test it without having to run the CLI from a shell.
We raise SystemExit to exit our program. This is a built-in python exception that allows you to exit your program with a specific exit code (0 for success, any non-zero value such as 1 for failure).
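
To make the argv trick concrete, here is a minimal, self-contained sketch showing that the same parser can be driven from a plain Python list instead of the real command line. The helper name build_parser is hypothetical and not part of the template; it only mirrors the single --name option used above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: mirrors the single --name option used in cli.py.
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", default="world")
    return parser


# Passing a list is equivalent to typing `--name Ada` on the command line.
args = build_parser().parse_args(["--name", "Ada"])
print(args.name)

# Passing an empty list means "no arguments", so the default applies.
# (Passing None instead would make argparse read sys.argv[1:].)
default_args = build_parser().parse_args([])
print(default_args.name)
```

This is why main(argv=None) behaves like a normal CLI when run from a shell, yet can be called as main(['--name', 'test']) from a test.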

5. Write Some Tests

📚 Now that you have a few basic methods, you can start writing some tests. Tests make sure your code behaves as expected. There
are many types of tests (unit tests, integration tests, end-to-end tests, etc...)
but in this article, we will focus on unit tests. Unit tests are tests that
check the smallest unit of your code (i.e. a function or a method).

🛠️ Create a file called test_main.py and add the following code to it:

from hello_world_cli.cli import main

def test_main(capsys):
    assert main(['--name', 'test']) == 0
    out, err = capsys.readouterr()
    # test is a user which does not have any contributions since 2010. 
    # Hopefully this will not change in the future and the test will not break.
    assert out == 'Hello test!\nYou have 5 repos.\n'
    assert err == ''

def test_main_empty_name(capsys):
    assert main(['--name', '']) == 1
    out, err = capsys.readouterr()
    assert out == 'Username cannot be empty\n'
    assert err == ''

We use capsys, a built-in pytest fixture, to capture what our CLI writes to stdout and stderr.
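
Because the first test above calls the real GitHub API, it needs network access and breaks if the test user's repository count changes. One possible way around this, sketched below, is to pass the HTTP function in as a parameter so that a test can substitute a fake. The names report_repos, FakeResponse and fake_get are hypothetical and not part of the template; the sketch uses only the standard library.

```python
import contextlib
import io
from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    # Hypothetical stand-in exposing the two things cli.py uses on a response.
    status_code: int = 200
    payload: list = field(default_factory=list)

    def json(self):
        return self.payload


def report_repos(name, http_get) -> int:
    # Same logic as main(), but the HTTP call is injected (an assumption
    # made for this sketch) instead of being hard-wired to requests.get.
    response = http_get(f"https://api.github.com/users/{name}/repos")
    if response.status_code != 200:
        print(f"Failed to fetch repos for {name}")
        return 1
    print(f"You have {len(response.json())} repos.")
    return 0


def fake_get(url):
    # Always answers with two repositories, no network involved.
    return FakeResponse(payload=[{"name": "a"}, {"name": "b"}])


# Capture stdout with the standard library so the sketch runs without pytest.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exit_code = report_repos("test", fake_get)
output = buffer.getvalue()
```

In a real pytest test you would keep using capsys instead of redirect_stdout; the point is only that the assertion no longer depends on live data.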

6. Lint your code

📚 Now that you have some tests, you can start linting your code. Linting is the
process of checking your code for potential errors. There are many linters out
there (pylint, flake8, black, etc...). I personally use flake8, black, isort and
mypy. Ruff is also becoming quite popular and replaces flake8 and isort.

🛠️ To install flake8, black, isort and mypy, you can run the following command:

poetry add --dev flake8 black isort mypy

Then you can use mypy to check your code:

poetry run mypy hello_world_cli/cli.py

You can use flake8 to check your code:

poetry run flake8 hello_world_cli/cli.py

You can use black to format your code:

poetry run black hello_world_cli/cli.py

But you would not really want to run these commands manually every time you
change your code. This is where pre-commit comes in handy, to both make sure you
don't commit wrongly formatted code and to not have to run these commands
manually.

Also, do note that some of these linters have small incompatible differences.
For example, black will complain if you have a line longer than 88 characters,
but flake8 will complain if you have a line longer than 79 characters. You need
to configure them to work correctly together:

you should create a .flake8 file with the following content:

[flake8]
max-line-length = 88
extend-ignore = E203

you should create a .isort.cfg file with the following content:

[settings]
profile = black

you can read more on the black documentation

7. Automate checks on local with pre-commit

📚 Now that you have some tests and some linters, you can start automating your
checks. The goal of automating your checks is to make sure your code is always
in a good state. There are many tools out there that can help you automate your
checks (pre-commit, tox, etc...). I personally use pre-commit.

🛠️ To install pre-commit, you can run the following command:

brew install pre-commit

Then you can run pre-commit

pre-commit run --all-files

You can see on the image above that pre-commit is running the checks that we defined in the .pre-commit-config.yaml file.
Note that this file is coming from the template above in step 4.

8. Automate checks on remote with GitHub Actions

📚 Now that you have some tests and some linters, you can start automating your
checks on the server side. The goal of automating your checks on the server side
is to make sure your code is always in a good state. There are many tools out
there that can help you automate your checks (GitHub Actions, Travis CI, Circle
CI, etc...). I personally use GitHub Actions.

🛠️ To set up GitHub Actions, you can create a file called .github/workflows/ci.yaml and add the following code to it:

name: CI

on:
  workflow_dispatch:
  push:
jobs:
  lint:
    name: Lint
    runs-on: ubuntu-latest
    steps:
      - uses: armand-sauzay/actions-python/lint@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

  test:
    name: Test
    runs-on: ubuntu-latest
    steps:
      - uses: armand-sauzay/actions-python/test@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          test-flags: --version

Note that this file should already be created if you used the template mentioned above (see the gh command in step 4).

Explanation:

If you're not familiar with the syntax of GitHub Actions, you can read the documentation.
Here we call two jobs: lint and test. Each job runs on ubuntu-latest (a Linux machine).
The lint job uses the armand-sauzay/actions-python/lint action to run pre-commit.
The test job uses the armand-sauzay/actions-python/test action to run pytest. Here we pass a flag to pytest to just have a dummy test. You can remove this flag to run your actual tests.

9. Automate your release with GitHub Actions

📚 We have briefly talked about linting - using black, flake8, isort and mypy - and testing - using pytest. Releasing is an equally important concept: it allows you to give versions to your code.

🛠️ To set up the release workflow, you can create a file called .github/workflows/release.yaml and add the following code to it:

name: Release

on:
  push:
    branches:
      - main
  workflow_dispatch: # enable manual release

jobs:
  lint:
    name: Lint
    runs-on: ubuntu-latest
    steps:
      - uses: armand-sauzay/actions-python/lint@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

  test:
    name: Test
    runs-on: ubuntu-latest
    steps:
      - uses: armand-sauzay/actions-python/test@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          test-flags: --version

  release:
    name: Release
    needs: [lint, test]
    runs-on: ubuntu-latest
    outputs:
      new-release-published: ${{ steps.release.outputs.new-release-published }}
      new-release-version: ${{ steps.release.outputs.new-release-version }}
    steps:
      - uses: armand-sauzay/actions-python/release@v1
        id: release
        with:
          github-token: ${{ secrets.ADMIN_TOKEN || secrets.GITHUB_TOKEN }}

A few things worth noting:

this workflow will trigger on push on the main branch and on manual trigger (workflow_dispatch)
it will run the lint and test jobs
it will then run the release job which is semantically versioning your code and creating a release on GitHub. In case you don't know what semantic versioning is, you can read the documentation.
In order to follow semantic versioning, you can follow conventional commits (this is why the commits show up as fix: ... or feat: ... in the release notes). You can read more on conventional commits.
Sometimes you might have a protected branch, for which the default GITHUB_TOKEN wouldn't have enough permissions to create a release. In that case, you can create a personal access token called ADMIN_TOKEN and give it the right permissions. You can then use this token in the workflow.
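
To illustrate how conventional commits relate to semantic versioning, here is a simplified sketch. It is not the logic the release action actually runs; the function name next_version is hypothetical, and it assumes the common mapping of fix to a patch bump, feat to a minor bump, and a "!" marker or a BREAKING CHANGE note to a major bump.

```python
import re
from typing import List

# Matches "type(scope)!: subject"; the scope and the "!" are optional.
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<breaking>!)?: ")


def next_version(current: str, commit_messages: List[str]) -> str:
    major, minor, patch = (int(part) for part in current.split("."))
    bump = None  # one of None, "patch", "minor", "major"
    for message in commit_messages:
        match = HEADER.match(message)
        if match is None:
            continue  # not a conventional commit, ignored in this sketch
        if match["breaking"] or "BREAKING CHANGE" in message:
            bump = "major"
        elif match["type"] == "feat" and bump != "major":
            bump = "minor"
        elif match["type"] == "fix" and bump is None:
            bump = "patch"
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current  # nothing release-worthy in these commits


print(next_version("1.2.3", ["fix: handle empty name"]))
print(next_version("1.2.3", ["fix: typo", "feat: add --name flag"]))
print(next_version("1.2.3", ["feat!: drop old flag"]))
```

The highest-ranked change since the last release decides the bump, so a single feat: commit among several fix: commits still produces a minor release.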

10. Enjoy the benefits of your new code practices

Now you know how to:

use pyenv to install different python versions
use poetry to manage your dependencies (poetry add/remove)
use pytest to test your python code
lint your code with flake8, black, isort and mypy
use pre-commit to automate your checks on your local machine
use GitHub Actions to automate your checks on the server side

Enjoy the benefits of your new code practices! 🎉

As always, if you have any questions, feel free to leave a comment below or reach out to me.

How Fair Are Your Machine Learning Models?

Armand Sauzay — Tue, 07 Mar 2023 18:22:01 +0000

A quick introduction to the topic of fairness with hands on coding. Evaluate your machine learning model fairness in just a few lines of code.

Photo by Wesley Tingey on Unsplash

Are Machine Learning models "fair"? When increasingly more decisions are backed by ML algorithms, it becomes important to understand the biases they can create.

But what does "fairness" mean? This is where it gets a little political (and mathematical)… To illustrate our thoughts, we'll take the example of a machine learning model which predicts whether a salary should be higher than 50K/year based on a number of features including age and gender.

And maybe you've already guessed, by looking at these two features, that fairness can have different definitions. Fair for gender might mean that we want a prediction which is independent of gender (i.e. paying the same salary to people who differ only by their gender). Fair for age might mean something else. We'd probably want to allow a certain correlation between the prediction and the age, as it seems fair to pay older individuals (who are usually more experienced) better.

One key thing to understand is that what is judged "fair" is sometimes not even respected in the data itself.

How would the model learn that men and women should be paid the same at equal levels if it does not observe this in the data itself?

Figure 1: data biases vs model biases

Now that we have a bit of context on the problem, let's get into the math (Section 1) and the code (Sections 2 and 3) to be able to evaluate and address unfairness issues:

A few fairness concepts
Evaluating Data Biases
Evaluating and Correcting Model Biases with Fairlearn

a. Evaluating bias

b. Correcting bias

All the code for this tutorial can be found on Kaggle here. Feel free to run the notebook yourself or create a copy!

1. A few fairness concepts

1.1. Mathematical definition of fairness

In order to simplify things, we'll restrict the scope to binary classification (predict whether someone should be paid more than 50K/year).
Usually, we'll call:

X: the feature matrix
Y: the target
A: Sensitive feature, usually one of the columns of X

For binary classification, two main definitions of fairness exist:

Demographic parity (also known as statistical parity): A classifier h satisfies demographic parity under a distribution over (X,A,Y) if its prediction h(X) is statistically independent of the sensitive feature A. This is equivalent to: E[h(X)|A=a]=E[h(X)]
Equalized odds: A classifier h satisfies equalized odds under a distribution over (X,A,Y) if its prediction h(X) is conditionally independent of the sensitive feature A given the label Y. This is equivalent to: E[h(X)|A=a,Y=y]=E[h(X)|Y=y]

NOTE: a third one exists but is more rarely used: equal opportunity is a relaxed version of equalized odds that only considers conditional expectations with respect to positive labels.
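
To make these two definitions concrete, here is a small pure-Python sketch on made-up toy data (the helper names and the numbers are illustrative assumptions, not taken from the Adult dataset). It estimates E[h(X)|A=a] per group for demographic parity, and E[h(X)|A=a,Y=y] per group and label for equalized odds.

```python
def rate(predictions):
    # Share of positive predictions; this estimates E[h(X)] on a sample.
    return sum(predictions) / len(predictions)


def selection_rate_by_group(y_pred, groups):
    # Estimates E[h(X) | A = a] for every group a.
    return {
        a: rate([p for p, g in zip(y_pred, groups) if g == a])
        for a in sorted(set(groups))
    }


def rate_by_group_and_label(y_true, y_pred, groups):
    # Estimates E[h(X) | A = a, Y = y]; y = 1 gives the true positive rate,
    # y = 0 gives the false positive rate.
    return {
        (a, y): rate(
            [p for t, p, g in zip(y_true, y_pred, groups) if g == a and t == y]
        )
        for a in sorted(set(groups))
        for y in (0, 1)
    }


# Toy data: 4 people in group "F", 4 in group "M" (made up for illustration).
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
y_true = [1, 0, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]  # a classifier that is always right

by_group = selection_rate_by_group(y_pred, groups)
by_group_and_label = rate_by_group_and_label(y_true, y_pred, groups)
print(by_group)
print(by_group_and_label)
```

In this toy example the classifier satisfies equalized odds (for each label, both groups get the same rate) but not demographic parity (selection rates of 0.25 vs 0.5), because the share of positive labels differs between the two groups.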

1.2. Fairness in words

In "simpler words":

Demographic parity: the prediction should be independent from the sensitive features (for instance independent from gender). It states that all categories from the protected feature should receive the positive outcome at the same rate (it plays on selection rate)
Equalized odds: the prediction can be correlated to the sensitive feature, to the extent it is explained by the data we see

1.3. Why does it matter?

OK, that's interesting, but why does it matter? And how can I use those mathematical concepts?

→ Let's take two examples of features and then explain what type of fairness we want to have for this feature. Going back to the previous example of salary prediction, let's say you are the CEO of a very big company and want to build an algorithm which would give you the salary you should give to your employees based on performance indicators. Ideally you would look for something like:

Demographic Parity for gender: the salary prediction should be independent from the gender
Equalized Odds for Age: the salary prediction should not be independent from Age (you still want to pay employees with more experience more), but you want to keep it under control so that the predictions do not end up too skewed → you don't want the algorithm to exacerbate the inequalities even more (paying the youth even less and the elders even more)

Without further ado, let's get into the implementation details on how we can evaluate fairness and "retrain" our Machine Learning models against their biases. For this we're going to use the UCI Adult Dataset.

2. Evaluating Data Biases

NOTE: once again, you can find all the associated code here.

Biases can exist in the data itself. Let's just load the data and plot the percentage of Male/Female having a salary above 50K.

Figure 2: gender and age impact on salary

We see that the percentage of males having a salary above 50K is almost 3x the percentage of females. (!!)

If the algorithm learns on this data it will definitely be biased. To counter this bias we can either:

cherry-pick the data so that the percentage of males and females having a salary above 50K is the same
use fairlearn to correct the bias after the model is trained on this unfair data

In section 3, we'll focus on the second approach.

3. Evaluating and Correcting Model Biases with Fairlearn

3.1. Evaluating bias

One of the most interesting features here is probably selection rate. It is the rate of predicting positive outcomes (in this case, whether salary is above 50K)

Figure 3: Selection Rate Definition

Let's use MetricFrame from fairlearn to calculate the selection rates split by Sex.

from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score,precision_score,recall_score
from sklearn.ensemble import GradientBoostingClassifier
from fairlearn.metrics import selection_rate
from fairlearn.reductions import ExponentiatedGradient, DemographicParity
classifier = GradientBoostingClassifier()
classifier.fit(X, y)
y_pred = classifier.predict(X)
metrics = {
    'accuracy': accuracy_score,
    'precision': precision_score,
    'recall': recall_score,
    'selection_rate': selection_rate
}
metric_frame = MetricFrame(metrics=metrics,
                           y_true=y,
                           y_pred=y_pred,
                           sensitive_features=sex)
metric_frame.by_group['selection_rate'].plot.bar(color= 'r', title='Selection Rate split by Sex')

Figure 4: Selection Rate Split by Sex

In the data, the percentage of males having a salary above 50K was almost 3x the percentage of females.
Once the model is trained, we see that the ratio between the selection rates is now 5x (!!). The model is exacerbating the bias we see in the data.

3.2. Correcting bias

Let's now correct the bias we observe by applying Demographic Parity on our classifier (we use ExponentiatedGradient from fairlearn for this). More context on how it works behind the scene can be found on the official fairlearn documentation here.

import numpy as np

np.random.seed(0)  # set seed for consistent results with ExponentiatedGradient

constraint = DemographicParity()
classifier = GradientBoostingClassifier()
mitigator = ExponentiatedGradient(classifier, constraint)
mitigator.fit(X, y, sensitive_features=sex)
y_pred_mitigated = mitigator.predict(X)

sr_mitigated = MetricFrame(metrics=selection_rate, y_true=y, y_pred=y_pred_mitigated, sensitive_features=sex)
print(sr_mitigated.overall)
print(sr_mitigated.by_group)
metric_frame_mitigated = MetricFrame(metrics=metrics,
                           y_true=y,
                           y_pred=y_pred_mitigated,
                           sensitive_features=sex)
metric_frame_mitigated.by_group.plot.bar(
    subplots=True,
    layout=[3, 3],
    legend=False,
    figsize=[12, 8],
    title="Show all metrics",
)

Figure 5: Selection rate for original model vs mitigated one

By mitigating the model we enforced demographic parity (and thus nearly equal selection rates) for our new model. Our model is now fair in the demographic parity sense!

Woohoo! You now know the basics of how fairness works and how you can start using it right away in your machine learning projects!

I hope you liked this article! Let me know if you have any questions or suggestions. Also feel free to contact me on LinkedIn, GitHub or Twitter, or check out some other articles I wrote on DS/ML best practices. Happy learning!

Sources:

https://fairlearn.org/

About me

Hey! 👋 I'm Armand Sauzay (armandsauzay). You can find, follow or contact me on:

Introduction to Cryptography: Understanding Hashing and Public-key Encryption with Code Examples

Armand Sauzay — Wed, 15 Feb 2023 22:15:59 +0000

What's cryptography? What does hashing and public-key encryption mean? And which tool can you use to start writing cryptography code?

Photo by Markus Spiske on Unsplash

You probably heard the name cryptography a lot recently, especially during the cryptocurrency bull market of 2021, or with the recent FTX saga. But what does it really mean? Is it useful? And how can you start using some cryptography yourself?

As always, we'll go through a little bit of theory and the concepts behind before diving into some code to give a concrete example. As Steve Jobs once said:

The doers are the major thinkers. The people that really create the things that change this industry are both the thinker-doer in one person. And if we really go back and we examine, you know, did Leonardo have a guy off to the side that was thinking 5 years out in the future what he would paint or the technology he would use to paint it. Of course not. Leonardo was the artist, but he also mixed all his own paint. He also was a fairly good chemist, knew about pigments, knew about human anatomy. And combining all of those skills together, the art and the science, the thinking and the doing, was what resulted in the exceptional results.

In this article, we'll cover the following:

Cryptography
- 1.1. Real world use cases
- 1.2. Web2 vs Web3
- 1.3. Symmetric vs asymmetric cryptography
Hashing
- 2.1. Core concept behind hashing
- 2.2. Slight difference in inputs can lead to completely different outputs
- 2.3. Why is this useful?
Public key cryptography (and ssh)
- 3.1. Public key vs Private key and their respective roles
- 3.2. Digital Signature
- 3.3. Bonus: how does ssh work?
Using gpg to create a key pair and encrypting/decrypting documents
- 4.1. Having a secret message/document
- 4.2. Generating a key pair
- 4.3. Encrypting your message with the public key
- 4.4. Decrypting your message with your private key

1. Cryptography

1.1. Real world use cases

Cryptography, in short, is a set of techniques to enable secure communication through encoding and decoding messages. More generally, cryptography is about constructing and analyzing protocols that prevent third parties or the public from reading private messages. It is mainly used for the following:

securing personal information: most web applications use cryptography to encrypt messages and the data they store, and hide their services and APIs from the general public.
as the base layer of crypto/blockchain. And you don't have to go further than the second page of Satoshi Nakamoto's original paper on bitcoin (read more here) to understand that cryptography is central in the blockchain/crypto field.

1.2. Web2 vs Web3

So, most of the "tech" we know relies on cryptography: if you're a Web2 aficionado, you trust that companies will correctly protect your data by encrypting the "username" and "password" required to access their databases (these are usually not a user and password per se but can be access keys). And if you're a Web3 aficionado you trust yourself with keeping your "username" and "password" (or public and private keys) secret to yourself.

And how you keep things secret is through cryptography:

Have an input.
Run a function on it to encode it.
Potentially run another function to then decode the encoded message.

Roughly, cryptography = encryption + decryption

1.3. Symmetric and asymmetric cryptography

Well-known simple encoding schemes include the Caesar cipher (shift all the letters by one or more characters, i.e. a becomes b, b becomes c, etc.), or Pig Latin (adding a suffix to each syllable) which you might know. It is also useful to distinguish between symmetric and asymmetric cryptography:

Symmetric cryptography: the same key is used to encrypt and decrypt the message. This is the most common type of cryptography and is used in most applications. The most popular symmetric cryptography algorithm is AES (Advanced Encryption Standard). In symmetric cryptography: encryption key = decryption key
Asymmetric cryptography: the key used to encrypt the message is different from the key used to decrypt the message. This is the type of cryptography most used in the blockchain/crypto field. The most popular asymmetric cryptography algorithm is RSA (Rivest–Shamir–Adleman). In asymmetric cryptography: encryption key ≠ decryption key

In this article we'll cover two main ideas that are the foundation of most applications: (1) hashing and (2) public-key (asymmetric) cryptography.

2. Hashing

2.1. Core concept behind hashing

Hashing is the simplest cryptographic process: you take an input (an image, text, any data basically) and you run it through a hash function to produce a fixed-length output called a hash (or digest). One of the most used algorithms is SHA-256 (which was developed by the NSA): it produces a 256-bit value, usually written as a "random"-looking string of 64 hexadecimal characters, from any input.

illustration of hashing (image by author)

2.2. A slight difference in inputs can lead to completely different outputs

Below, you can see how the two strings "hello world" and "Hello world" are hashed with the SHA-256 algorithm. Notice the very slight difference in the inputs (the first letter being capitalized or not) but the huge difference in the resulting hashes… You can try it out for yourself here.

hello world → b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
Hello world → 64ec88ca00b268e5ba1a35678a1b5316d212f4f366b2477232534a8aeca37f3c
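
If you would rather reproduce this locally, a minimal sketch using Python's standard hashlib module could look like the following (the helper name sha256_hex is just an illustrative choice):

```python
import hashlib


def sha256_hex(text: str) -> str:
    # Hash the UTF-8 bytes of the text and return the digest as hex.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


lower = sha256_hex("hello world")
upper = sha256_hex("Hello world")
print(lower)
print(upper)
print(len(lower))  # 64 hex characters = 256 bits
```

Changing a single character of the input changes the whole digest, while the length of the output always stays the same.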

2.3. Why is this useful?

Hashing is a one-way function. It is easy to go from a message to its hash but almost impossible to do the opposite.

Well, good. But why is this useful?

→ Let's say you have a website and you are storing the passwords of all your users. You wouldn't want to store them in plain text in a data file (assuming you're a little concerned about those passwords getting leaked…), right? So you could hash them with a specific algorithm. When someone signs up on your website, you hash their password and store the hash. When that same person later tries to log in, you hash the password they typed and compare it to the hash stored when they signed up. If the two hashes match, you let the user in. And if someone gains access to the database with all the hashed passwords, they cannot simply read the passwords back out of it, which makes it much harder for them to log in with that information!
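
The sign-up and login flow described above can be sketched as follows. Note that this sketch swaps plain SHA-256 for a salted, deliberately slow key-derivation function (PBKDF2, available in Python's standard library), which is the usual recommendation for passwords. The in-memory dict, the function names and the iteration count are assumptions made for illustration only.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumption for this sketch; tune for your hardware
users = {}  # hypothetical in-memory "database": username -> (salt, hash)


def hash_password(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def sign_up(username: str, password: str) -> None:
    salt = os.urandom(16)  # random per-user salt
    users[username] = (salt, hash_password(password, salt))


def log_in(username: str, password: str) -> bool:
    if username not in users:
        return False
    salt, stored_hash = users[username]
    # Constant-time comparison of the stored hash and the login attempt's hash.
    return hmac.compare_digest(stored_hash, hash_password(password, salt))


sign_up("alice", "correct horse battery staple")
print(log_in("alice", "correct horse battery staple"))
print(log_in("alice", "wrong password"))
```

The password itself is never stored: only the salt and the derived hash are kept, and a login attempt is checked by recomputing the hash.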
But sometimes it would be nice to also be able to decode, right? Let's say you're sending a hashed love letter to your lover (with SHA-256 once again):

I love you → c33084feaa65adbbbebd0c9bf292a26ffc6dea97b170d88e501ab4865591aafd

Your lover wouldn't really know what to do with c33084feaa65adbbbebd0c9bf292a26ffc6dea97b170d88e501ab4865591aafd would they?

This is where public-key cryptography can be useful.

3. Public-key cryptography

3.1. Public key vs Private key and their respective roles

In public-key cryptography, keys come in pairs: a public key and a private key. The two keys are mathematically linked but not interchangeable: what is encrypted with the public key can only be decrypted with the private key, and it is practically impossible to derive the private key from the public one. The security of the process is guaranteed as long as the private key is kept secret.

illustration of public-key encryption (image by author)

This is a powerful concept: only those with the private key would be able to decrypt messages encoded with the public key. This opens up the door for many possible applications. I can create a key pair, send the public key to everyone and keep the private key to myself. People could encrypt the messages they want to send me with the public key and publish it somewhere (in a journal, on social media, whatever). And I will be the only one to be able to decrypt the encoded message.

3.2. Digital Signature

Another very central concept in cryptography is the digital signature, which is basically proving that you own the private key associated with a public key without sharing the private key. For example, let's say that you have shared a public key (with which anyone can encrypt a message). Someone encrypts a message with your public key and sends the encrypted message to you. If you can decrypt the encrypted message they sent, you have proved that you own the private key associated with the public one. Digital signatures rely on the same key pair used in the other direction: you sign a message (in practice, its hash) with your private key, and anyone can verify the signature with your public key. You now know the base layer of Bitcoin for instance, since showing that you own the private key associated with your public Bitcoin address is a process very similar to what we just described.
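
To see both roles of a key pair in code, here is a toy RSA sketch with tiny textbook primes. It is for illustration only and is not secure: real implementations use very large primes and padding schemes, and Bitcoin itself uses a different signature algorithm. The variable and function names are illustrative assumptions.

```python
import hashlib

# Toy key generation with tiny textbook primes (insecure, illustration only).
p, q = 61, 53
n = p * q                      # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent
d = pow(e, -1, phi)            # private exponent (modular inverse, Python 3.8+)

# Encryption: anyone encrypts with the public key (e, n) ...
message = 65                   # a message encoded as an integer smaller than n
ciphertext = pow(message, e, n)
# ... and only the private key (d, n) recovers it.
decrypted = pow(ciphertext, d, n)


def digest_as_int(text: str) -> int:
    # Reduce a SHA-256 digest to an integer below n (toy-sized on purpose).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n


# Signature: the owner signs the document's hash with the private key ...
document = "I owe Bob 1 BTC"
signature = pow(digest_as_int(document), d, n)
# ... and anyone can verify it with the public key.
is_valid = pow(signature, e, n) == digest_as_int(document)

print(ciphertext, decrypted, is_valid)
```

Encryption and signing use the same pair of keys in opposite directions: the public key encrypts and verifies, the private key decrypts and signs.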

3.3. Bonus: ssh

SSH (Secure Shell) keys are a way to authenticate a user's access to a remote server. They allow users to log in to a remote server without the need to enter a password.

SSH keys consist of a public key and a private key. The private key stays on the client and is used to sign data, and the public key, stored on the server, is used to verify that signature.
To use an SSH key for authentication, the user generates a key pair on their local machine and then adds the public key to the authorized_keys file on the remote server. When the user attempts to log in to the server, the client signs data tied to the current session with the private key and sends the signature to the server. If the server can verify the signature using the public key, the user is authenticated and granted access to the server.

SSH keys are a secure and convenient way to authenticate access to a remote server, as they do not rely on the user remembering a password and they can be revoked or replaced easily if necessary.

The concept is nice, right? But how can you apply the theory and start using it in code?

4. Using gpg to create a key pair and encrypting/decrypting documents

One of the nice tools for playing around with public and private keys is gnupg with the gpg CLI (you can read more here).

The goal is not to give an extensive overview of gpg, but let's walk through a simple example of the typical steps (creating a key pair, then encrypting and decrypting).

As always you can check out the associated code on GitHub and follow along if you want. This basically consists of three steps:

creating a key pair.
encrypting the document with the public key.
decrypting the encrypted document with the private key.

4.1. Having a secret message/document

Let's say you have a private message in a document like:

#secret_message.txt
The Navy UFO videos weren't supposed to be released.

4.2. Generating a key pair

You could begin by generating a key pair and storing the public and private key as follows:

read -r -p "What is your key name? " KEY_NAME
gpg --full-generate-key # then follow the prompts
gpg --export -a "$KEY_NAME" > "${KEY_NAME}.public.key"
gpg --export-secret-key -a "$KEY_NAME" > "${KEY_NAME}.private.key"

4.3. Encrypting your message with the public key

Then, using gpg --encrypt you could encrypt your message using the following commands:

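read -r -p "What is your key name? " KEY_NAME
gpg --import "${KEY_NAME}.public.key"
# --recipient selects the public key to encrypt for; the output is secret_message.txt.gpg
gpg --encrypt --recipient "$KEY_NAME" secret_message.txt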

4.4. Decrypting your message with your private key

Finally, using gpg --decrypt you can use the private key to decrypt the message that you just encrypted, to verify that the encryption/decryption worked well.

gpg --decrypt secret_message.txt.gpg

Tadaa! You've created a key pair, encrypted and then decrypted your message. You're now a crypto expert!

I hope you liked this article! Feel free to contact me on LinkedIn, GitHub or Twitter if you have any questions or suggestions or just want to chat, or checkout other articles I wrote on medium and leave feedback. Happy learning!

About me

Hey! 👋 I'm Armand Sauzay (armandsauzay). You can find, follow or contact me on:

Machine Learning interpretability and feature selection made easy with SHAP.

Armand Sauzay — Tue, 14 Feb 2023 18:22:15 +0000

Machine learning interpretability with hands on code with SHAP.

Photo by Edu Grande on Unsplash

Machine Learning interpretability is becoming increasingly important, especially as ML algorithms are getting more complex.

How good is your Machine Learning algorithm if it can't be explained? Less performant but explainable models (like linear regression) are sometimes preferred over more performant but black box models (like XGBoost or Neural Networks). This is why research around machine learning explainability (aka eXplainable AI or XAI) has recently been a growing field with amazing projects like SHAP emerging.

Would you feel confident using a machine learning model if you can't explain what it does?

This is where SHAP can be of great help: it can explain any ML model by giving the influence of each of the features on the target. But this is not all that SHAP can do.

Build a simple model (sklearn/xgboost/keras) and use SHAP: by looking at the features that have the biggest impact on the prediction, you also get a feature selection process.

But how does SHAP work under the hood? And how can you start using it?

In this article we'll first get our hands on some python code to see how you can start using SHAP and how it can help you both for explainability and feature selection.

Then, for those of you who want to get into the details of SHAP, we'll go through the theory behind popular XAI tools like SHAP and LIME.

All the code for this tutorial can be found on Kaggle here. Feel free to run the notebook yourself or create a copy!

1. How can you start using SHAP?

Here, we'll go through a simple example with SHAP values, using the Kaggle competition "House Prices - Advanced Regression Techniques" to illustrate SHAP. If you are interested or you have never been on Kaggle before, feel free to read more about the data and the competition itself here.

The process to use shap is quite straightforward: we need to build a model and then use the shap library to explain it.

Here, our machine learning model tries to predict house prices from the data that is given (number of square feet, quality, number of floors, etc.).

The usual workflow in terms of code looks like this:

Create an estimator. For instance GradientBoostingRegressor from sklearn.ensemble:
```
estimator = GradientBoostingRegressor(random_state = random_state)
```
Train your estimator:
```
estimator.fit(X_train, y_train)
```

Use the shap library to calculate the SHAP values. For instance, using the following code:

# X100 is assumed to be a small background sample of the data (about 100 rows)
explainer = shap.Explainer(estimator.predict, X100)
shap_values = explainer(X100)

See the impact of each feature using shap.summary_plot:

shap.summary_plot(shap_values, max_display=15, show=False)

For instance, you can see here that OverallQual is the feature that has the most impact on the model output. High values (colored in red on the graph above) of OverallQual can increase a property's price by ~60,000 and low values can decrease a price by ~20,000. Interesting to know if you're in real estate, isn't it?

But this is not all of what SHAP can do! SHAP can also explain a single prediction.

For example, using shap.plots.waterfall for a single element in the dataset, you can have the following:

shap.plots.waterfall(shap_values[sample_index], max_display=14)

For this specific example, the predicted price was 166k (vs 174k on average). And we can understand why the algorithm predicted this value: for instance, OverallQual, which is high (7), drives the value up, but YearBuilt (1925) drives the value down.

You can now understand the dynamics behind your model, both overall and on specific datapoints. With SHAP, you can more easily see if something is wrong (or does not make sense for your sharpened data science mind) so you can correct it! This is what observability is about.

And since SHAP allows you to understand the feature importance of your model, you can also use this for feature selection. For instance:

shap.summary_plot(shap_values, max_display=15, show=False, plot_type='bar')

Then you can see which features do not have a lot of impact on the output of the model. These features are usually noise for your machine learning model and do not bring a lot of predictive value. Removing them from your training set can therefore improve performance, and lets you tune the hyperparameters without overfitting on noisy data.

How SHAP can be used for feature selection
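As a minimal sketch of that selection step, the snippet below ranks features by their mean absolute SHAP value (the quantity the bar plot displays) and keeps those above a cut-off. The SHAP matrix, the feature list and the threshold are made-up values for illustration, not results from the Kaggle notebook.

```python
# Hypothetical SHAP values: one row per sample, one column per feature.
feature_names = ["OverallQual", "GrLivArea", "YearBuilt", "PoolArea"]
shap_matrix = [
    [12000.0, -3000.0, -1500.0, 10.0],
    [-8000.0, 5000.0, 2500.0, -20.0],
    [15000.0, 1000.0, -500.0, 0.0],
]


def mean_abs_importance(matrix):
    """Mean absolute SHAP value per feature (column-wise)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(abs(row[j]) for row in matrix) / n_rows for j in range(n_cols)]


importance = dict(zip(feature_names, mean_abs_importance(shap_matrix)))
threshold = 100.0  # arbitrary cut-off, to be tuned for your own data
selected = [name for name, score in importance.items() if score >= threshold]
print(selected)  # ['OverallQual', 'GrLivArea', 'YearBuilt']
```

With the shap library, the matrix would typically come from the `values` attribute of the object returned by the explainer. It is worth retraining on the selected features and checking the validation score before dropping anything for good.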

2. Quick overview: How does SHAP work under the hood?

If you have a bit of time, feel free to read the original paper that describes the different approaches for model explainability and goes through the advantages of SHAP.

But let's try to explain in short what SHAP is doing and the concepts behind without getting too deep into the mathematical equations.

Explainable Machine Learning (aka eXplainable AI or XAI) aims at understanding why the output of a machine learning model is such. To do so, you could theoretically take the definition of your model, for example a tree based model like Random Forest, and then see why your output is such. But this is not so straightforward, and of course, it gets even more complex for Deep Learning models…

Instead of going down the winding path of understanding what happens inside your model (forward and backward propagation for deep learning models, which splits are used most in your Random Forest, etc.), once you have a trained model, could you not simply use it to see how it reacts when you change a feature?

This is the core concept behind popular XAI algorithms (SHAP, LIME, etc.): use your existing model, approximate it with an explainable model, and you now have an explanation for its predictions. The complexity now lies in how to approximate an ML model around a given prediction, and then around most predictions.
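To give a feel for "seeing how the model reacts when you change a feature", here is a small sketch that computes classic Shapley values by brute force for a toy model. The `model` function, the input `x` and the `baseline` are hypothetical, and enumerating every feature ordering only works for a handful of features; the shap library relies on much faster approximations.

```python
from itertools import permutations


def model(features):
    """Hypothetical toy model: a price from size, quality and age."""
    size, quality, age = features
    return 100 * size + 20 * quality - 5 * age + 2 * size * quality


def shapley_values(f, x, baseline):
    """Average each feature's marginal contribution over all orderings."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for ordering in orderings:
        current = list(baseline)
        previous_output = f(current)
        for i in ordering:
            current[i] = x[i]              # switch feature i from baseline to x
            output = f(current)
            contributions[i] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]


x = (2, 7, 10)          # the prediction we want to explain
baseline = (1, 5, 20)   # a reference input, e.g. an "average" house
values = shapley_values(model, x, baseline)
print(values)                                # [112.0, 46.0, 50.0]
print(sum(values), model(x) - model(baseline))
```

The contributions add up exactly to the gap between the prediction and the baseline output, which is the property that makes waterfall-style plots possible.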

If you are interested in this and want to learn more, let me know and I'll write a follow up article on the mathematical concepts behind SHAP, how it is related to the classic Shapley values, how you can compute SHAP values and how we are able to approximate it for specific use cases, which makes the computation easier.

Woohoo! You now know the basics on how SHAP works and how you can start using it right away in your machine learning projects!

I hope you liked this article! Let me know if you have any questions or suggestions. Also feel free to contact me on LinkedIn, GitHub or Twitter, or checkout some other tutorials I wrote on DS/ML best practices. Happy learning!


5 tools I wish I knew when I started writing Machine Learning code

Armand Sauzay — Tue, 31 Jan 2023 17:24:25 +0000

A few tools that will get you on the right track for your Machine Learning projects using python.

Photo by ray rui on Unsplash

A few years back, I first learnt how to write machine learning code as I took my first ML class while pursuing my graduate studies in Applied Mathematics.

I learnt about the math behind classic ML algorithms, got my hands on popular Python libraries like scikit-learn, PyTorch and TensorFlow, and participated in my first Kaggle competition on the Titanic dataset.

Fast forward a few years: after pursuing my graduate studies in Machine Learning at UC Berkeley, I now work on developing machine learning models that end up being used in production and at scale.

And, of course, the code I now write is very different from the Python Notebooks I used to write when I originally started.

Even though I still use notebooks for the exploration part, my final code is structured using only python files, often containerized, and integrated using CI/CD and triggered by workflow management tools.

Environment management tool like conda or poetry

The first thing to consider when you write code is to create an environment for your code to live in.

For Data Science / Machine learning, the best tool around for environments is probably Conda at first. Then for production it's probably more poetry. But let's keep this discussion for another article and talk a bit about conda.

Conda has the benefit of being both a version manager and a package manager. It allows you to easily define environments, makes sure the package versions are compatible with each other, and has some cool perks like being able to define environment variables. If you want to learn more about this tool or if you want a quick refresher, you can go through the article I wrote on that subject:

Using Conda environments for Python, all you need to know

Configuration tools like hydra

Another thing is that Data Science / Machine Learning code often comes with a lot of parameters (hyperparameters, where to log artifacts, preprocessing steps, etc.).

For configuration, one of the nicest tools around is Hydra. From the hydra documentation itself:

The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. The name Hydra comes from its ability to run multiple similar jobs - much like a Hydra with multiple heads.

Hydra makes it easy to set up your config parameters, create config groups, and even instantiate objects from your config. If you want to learn more about this tool, and want to have a code tutorial to get up to speed on it, feel free to check out the article I wrote about it: Hydra, the most efficient config handling library for your Python/ML code

Using the terminal (best way to navigate through files)

Something that can be a bit of a pain when starting to write code but that will make your life 10x easier once you get used to it is to use the terminal / command line interface to navigate through files and perform actions.

If you want a quick overview of the basic actions you can perform with the terminal, feel free to check out the article I wrote: Command Line 101: a Basic Guide to Using the Terminal

MLFlow (ML artifact handling tool)

A lot of tools exist to help with the machine learning lifecycle: for ML experiments, for Model Registry, for packaging your ML code or putting your models in production. MLFlow has all of these capabilities.

But you might agree with me, production machine learning is kind of a niche use case and unless you work for a big marketplace company or FAANG you might not need to deploy your models in production…

However, in general, you might want to have a place to store all your ML experiments results and know which validation score was obtained with which hyper parameters etc. This is exactly what MLFlow Tracking has to offer.
In short, in seconds, you can log your artifacts and, by typing mlflow ui in your terminal, start an MLflow server on http://127.0.0.1:5000/ that reads from your mlruns folder. You can then easily navigate through your experiments and see all the metrics, parameters and artifacts you logged.

Github Actions (or another CI/CD tool)

Once you work in a context where your ML models need to be retrained or integrated into some architecture, or where you write daily ETLs, you will quickly realize that a lot of actions can be automated with a CI/CD tool like Jenkins or GitHub Actions (for ETLs, Airflow is considered a better pick, but that's another subject). From the GitHub documentation:

GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD

Examples of CI/CD include build/push-ing docker images, pre-commit checks, automated testing and much more. If you want more details, let me know and I'll write a quick tutorial on how GitHub Actions can be used for ML code.

Of course there are many more tools that are absolutely amazing and that I use on a daily basis, such as cloud platforms like AWS or GCP, Docker or Visual Studio Code, to mention but a few. If you'd like more details about the industry standards for these tools, feel free to reach out!

Hope you liked this article! Don't hesitate if you have any question, or suggestions, in comments, or feel free to connect on LinkedIn, GitHub or Twitter.


Hydra, the most efficient config handling library for your Python/ML code

Armand Sauzay — Tue, 10 Jan 2023 18:08:56 +0000

Improve your python code by using hydra for configuration.

Photo by Ferenc Almasi on Unsplash

In this tutorial, we’ll go through some available options that you might encounter for config handling, then explain why hydra is my favorite pick, and finally go through some code examples to highlight the key functionalities of Hydra.

Context and available options

As one works on a Python project, especially for machine learning, the number of parameters rapidly increases. Soon comes the question: what is the ideal way to store my parameters?

Let’s go through a few options you might have encountered.

hardcoding: Should I hardcode them in some random places in my code? Probably not.
yaml/json: Should I create a simple YAML file? Or json file? And import with json.load? Doesn’t seem very pythonic, does it?
config.py: Should I create a config.py file where I put all my parameters? That’s nice, but let’s say I want to run 10 experiments: would I want to go back to my config.py file and change the values 1 by 1 before re-running? Once again, probably not.
dotenv: Should I use something like dotenv? Same as config.py, it does not seem ideal if we play with a lot of parameters.
other tools: I also tried other tools like dynaconf, or docopt or configparser (the python native configuration parser). But nothing comes close to hydra, which is an amazing tool for configuring python code.

Hydra notably allows for a clear yaml-format configuration, the ability to instantiate objects, running multiple tasks and many other features that you’ll probably love or found missing in other libraries. As described in the hydra official documentation:

The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. The name Hydra comes from its ability to run multiple similar jobs — much like a Hydra with multiple heads.

Without further ado, let’s get right into the code and understand why hydra is the clear winner for configuration file handling.

All of the code for this tutorial can be found here. You can clone the repo to easily navigate through the different sections in this tutorial.

Requirements:

having Conda (miniconda) installed and a basic understanding of Conda environment. If you’re not familiar with this, check out the article I wrote on that subject here
having a basic understanding of the command line / terminal. If you’re not familiar with this, check out the article I wrote on that subject here

Code

1. Basic example (folder 1_basic_example on the GitHub repo)

In this example, we’re going to go through the basics of hydra. There are two files:

#config.yaml
db:
  driver: mysql
  table: bar
  user: bar
  password: foo

# my_app.py 
import hydra
from omegaconf import DictConfig, OmegaConf
@hydra.main(config_path=".", config_name="config")
def my_app(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))
    print(f"reading data with username {cfg.db.user}")
if __name__ == "__main__":
    my_app()

From the terminal, navigate to 1_basic_example and run python my_app.py.

You see that in the file my_app.py, the function my_app has been decorated with @hydra.main() with 2 parameters config_path and config_name:

config_path="." → looks for the config file in the same folder as the script
config_name="config" → looks for a config file named config (.yaml is implicit) in the config_path, which here means the current folder.

This decorator loads the config file into cfg, and you can then access it in your code. If we want to access the value of user from the db section of the config file, we just have to write cfg.db.user. As easy as this.

Now, let’s try a nice added functionality of hydra, the multirun. Let’s say we want to run this job with 2 different parameters (for example db.table). For this you can simply run the following:

python my_app.py -m db.table=bar,foo

→ this will launch 2 jobs (-m stands for multirun): one with db.table=bar and one with db.table=foo
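Conceptually, a multirun expands each comma-separated override into its choices and launches one job per combination. The function below is only a sketch of that expansion and not Hydra's own code; `expand_sweep` and the `db.user` values are hypothetical.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand 'key=a,b' style overrides into one list of overrides per job."""
    keys, choices = [], []
    for override in overrides:
        key, values = override.split("=", 1)
        keys.append(key)
        choices.append(values.split(","))
    return [
        [f"{key}={value}" for key, value in zip(keys, combination)]
        for combination in product(*choices)
    ]


jobs = expand_sweep(["db.table=bar,foo"])
print(jobs)  # [['db.table=bar'], ['db.table=foo']]

# Sweeping over two parameters gives the cartesian product: 2 x 2 = 4 jobs.
grid = expand_sweep(["db.table=bar,foo", "db.user=alice,bob"])
print(len(grid))  # 4
```

This also shows why sweeps can grow quickly: every additional swept parameter multiplies the number of jobs.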

2. Config groups (folder 2_config_groups on the GitHub repo)

#my_app.py
import logging
import hydra
from omegaconf import DictConfig
log = logging.getLogger(__name__)
@hydra.main(config_path="configs", config_name="config")
def my_app(cfg: DictConfig) -> None:
    log.info("Info level message")
    log.debug("Debug level message")
    print(f"driver={cfg.dataloader.type}, timeout={cfg.dataloader.timeout}")
if __name__ == "__main__":
    my_app()

#config.yaml
defaults:
  - dataloader: local
  - _self_
dataloader:
  type: foo

The idea is to simplify the main config file and be able to create groups in the yaml to make it even more configurable. Here, the decorator indicates that we are in a folder called configs and that the main config file is named config (.yaml is implicit).

Then, in configs/config.yaml, the defaults: argument indicates that there exists a subfolder called dataloader, in which there are multiple configurations for dataloader.

Finally, in configs/config.yaml, the _self_ entry sets the precedence. In this case, _self_ is the last entry, so the values coming from the defaults are overwritten by the values written directly in config.yaml (here, dataloader.type becomes foo). If _self_ is placed first, the values from the defaults win.
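The precedence rule can be pictured as merging dictionaries in the order of the defaults list, with later entries winning. The sketch below only models the dataloader node; the content of dataloader/local.yaml (`type: local`, `timeout: 10`) is a hypothetical example, and `compose` is not a Hydra function.

```python
def compose(order, sources):
    """Merge config sources in the given order; later sources win on conflicts."""
    merged = {}
    for name in order:
        merged.update(sources[name])
    return merged


sources = {
    # hypothetical content of configs/dataloader/local.yaml
    "dataloader/local": {"type": "local", "timeout": 10},
    # the dataloader block written directly in config.yaml
    "_self_": {"type": "foo"},
}

self_last = compose(["dataloader/local", "_self_"], sources)
self_first = compose(["_self_", "dataloader/local"], sources)
print(self_last)   # {'type': 'foo', 'timeout': 10}
print(self_first)  # {'type': 'local', 'timeout': 10}
```

With _self_ last, the value from config.yaml overrides the one from the config group, while keys that only exist in the group (such as timeout) are kept.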

Now if you want to try the multi-run approach of hydra you can run the following:

python my_app.py -m dataloader=local,redshift

→ this will launch 2 jobs (-m stands for multirun): one with dataloader=local and one with dataloader=redshift. Of course, this is very useful for hyper parameter tuning.

3. Instantiation (folder 3_instantiation on the GitHub repo)

What happens when you want to pass python objects as part of your config? For instance, let’s say you want to test different ML algorithms in a simple Sklearn project and you want to try an XGBoost model and a Logistic Regression. Hydra allows you to do that!

Let’s go through this example which has 2 files

#my_app.py
import hydra
import pandas
import sklearn.ensemble
from hydra.utils import instantiate
from omegaconf import DictConfig, OmegaConf
@hydra.main(config_path=".", config_name="config")
def my_app(cfg):
    print(OmegaConf.to_yaml(instantiate(cfg)))
    model = instantiate(cfg.model.feature_extractor)
    print(model)
if __name__ == "__main__":
    my_app()

#config.yaml
model:
  feature_extractor:
    _target_: sklearn.ensemble.GradientBoostingClassifier
    random_state: 0
    n_estimators: 500
    learning_rate: 0.01
    max_depth: 2
bar:
  a: 1
  b: 2
foo:
  a: ${bar.a}

In the yaml, you need to pass a _target_ (note that this key name is a Hydra convention, so don’t modify it or you won’t be able to instantiate your object) for the object you want to instantiate. The other keys are the parameters passed to that object. For instance, here, instantiate(cfg.model.feature_extractor) will lead to sklearn.ensemble.GradientBoostingClassifier(random_state=0, n_estimators=500, learning_rate=0.01, max_depth=2). Pretty cool, right?
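To demystify the _target_ key, here is a rough sketch of the underlying idea using only the standard library: import the dotted path, then call it with the remaining keys as keyword arguments. `instantiate_sketch` is a hypothetical helper and leaves out most of what hydra.utils.instantiate does (nested targets, positional arguments and so on); datetime.timedelta simply stands in for the sklearn class.

```python
import importlib


def instantiate_sketch(node):
    """Build the object named by `_target_`, passing the other keys as kwargs."""
    params = dict(node)                       # copy so the input is not modified
    module_path, _, attribute = params.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attribute)
    return target(**params)


config = {"_target_": "datetime.timedelta", "days": 1, "hours": 12}
obj = instantiate_sketch(config)
print(obj)  # 1 day, 12:00:00
```

Swapping the model then becomes a config change: only the _target_ path and its parameters need to be edited, not the Python code.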

Woohoo! You now know how to use hydra for config, creating config groups and instantiating objects. Feel free to check out the 4th section of the repo, on the popular Optuna plugin for Bayesian optimization.

Hope you liked this article! Don’t hesitate if you have any question, or suggestions, in comments, or feel free to contact me on LinkedIn, GitHub or Twitter, or checkout some other tutorials I wrote on DS/ML best practices.


Use pre-commit checks to format your files and commit messages

Armand Sauzay — Fri, 06 Jan 2023 18:05:30 +0000

Stop committing wrongly formatted code and start using pre-commit checks.

Photo by Roman Synkevych 🇺🇦 on Unsplash

Introduction

How many times have you seen a commit message like 'test' or 'modif' or 'reran notebook'?

Commit messages can be very useful, and their format can help you get the relevant information at a glance. This is what conventional commits is trying to achieve: standardize the commit format to be able to navigate through commits easily and understand what code update they were associated with. But how could we make sure all commits follow this format?

File formatting can also be quite challenging when more than one person works on a GitHub repository. How could we make sure that all the files committed to a repository have the same format? Or that you do have the extra empty line at the end of your file that some systems require to successfully run?

This is where standardization makes our life easier, especially when enforced before the commits: black formatting for python, getting rid of trailing whitespaces, etc.

Long story short, all the formatting issues that you might have had to deal with, or that created useless commits, could be avoided by using pre-commit checks.

Without further ado, let's see how we can apply these concepts to an actual GitHub repository (if you need more context on conventional commits feel free to skip to the appendix and come back here after).

All of the code for this tutorial can be found [here](https://github.com/armand-sauzay/blog-posts/pre-commit-checks-to-format-your-files-and-commit-messages)

Get started with pre-commit checks

Step 1: Install pre-commit

To install pre-commit, simply run

brew install pre-commit

This will install pre-commit on your machine (brew is macOS-specific; pip install pre-commit works on other platforms).

Step 2: Add pre-commit checks to your repo

To add pre-commit checks that make sure your files and commit messages are correctly formatted, you simply need to add 2 files at the root of your repo:

.pre-commit-config.yaml: defines the checks you want to run
.commitlintrc.yaml: defines the commitlint configuration (here, a shared config published as an npm package) used to check commit messages.

# .commitlintrc.yaml
extends:
  - "@open-turo/commitlint-config-conventional"

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
    -   id: trailing-whitespace
  - repo: https://github.com/alessandrojcm/commitlint-pre-commit-hook
    rev: v8.0.0
    hooks:
      - id: commitlint
        stages: [commit-msg]
        additional_dependencies: ["@open-turo/commitlint-config-conventional"]

Once you have added those files, register the hooks with git by running pre-commit install --hook-type pre-commit --hook-type commit-msg. From then on, every time you commit, the pre-commit checks will:

check that yaml files are valid
make sure files end with an empty line (this is considered best practice as some systems fail when this condition is not met)
get rid of trailing whitespaces
make sure that the commits follow the conventional commits format.

To see if this worked, try to commit those files to your repo with some commit message like
git commit -m 'feat: enabled pre-commit checks'

Woohoo! You now know how to add pre-commit checks to your own repository to make sure your file formats and your commit messages are consistent!

Appendix

Appendix 1: A word on Conventional Commits
With conventional commits, the commit message should be structured as follows:

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

For example (from the Conventional Commits docs):

feat: allow provided config object to extend other configs

BREAKING CHANGE: `extends` key in config file is now used for extending other config files
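To see what a commit-message check boils down to, here is a simplified sketch that validates only the header line against the structure above. The regular expression and the `is_conventional` helper are illustrative assumptions; commitlint applies a richer, configurable set of rules (allowed types, length limits and so on).

```python
import re

# Simplified header pattern: <type>[optional scope][!]: <description>
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^()]+)\))?(?P<breaking>!)?: (?P<description>.+)$"
)


def is_conventional(message: str) -> bool:
    """Return True when the first line looks like a conventional commit header."""
    lines = message.splitlines()
    header = lines[0] if lines else ""
    return HEADER.match(header) is not None


print(is_conventional("feat: allow provided config object to extend other configs"))  # True
print(is_conventional("fix(parser): handle empty input"))                             # True
print(is_conventional("reran notebook"))                                              # False
```

Messages like 'test' or 'modif' are rejected because they lack the type prefix and the colon separator.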

Appendix 2: adding a python code formatter to pre-commit checks
Additional content: if you want to add a python code formatter, like black, you can append the following to the end of .pre-commit-config.yaml (indented to line up with the other repo entries):

  - repo: https://github.com/psf/black
    rev: 21.12b0
    hooks:
      - id: black


Command Line 101: a Basic Guide to Using the Terminal

Armand Sauzay — Mon, 19 Dec 2022 12:52:23 +0000

Command Line 101

Get started using your terminal and get one step closer to being an experienced developer.

Photo by Max Duzij from Unsplash

All of the code for this tutorial can be found here.

Developers use the command line to navigate through files and perform operations.

Once you get used to it, it is definitely the most efficient and reproducible way to access files and perform operations. Also, when you start virtual machines on the cloud, it becomes the only way of easily communicating with your instance.

So, without further ado, let's learn about the terminal and basic commands to get started!

In this article, we'll cover what the command line is, which commands we can use, and we'll go through a simple tutorial to put those commands into practice.

First, what is the command line?

It is a plain and simple text interface for your computer. It takes commands which are then passed to the OS to run.

And which commands can we use?

The usual commands are given in the table below.

Let's now go through a small example on how to use those commands. We'll first explain commands 1 by 1 and then put them all together in a shell script that you can run.

Step-by-step commands

Let's see where our terminal currently is:
```
pwd
```
Navigate to the folder where you usually put your code (I usually have mine in a folder called code in my home folder)
```
cd path/to/your/folder/of/code
```
In my case for instance, I have got a code folder under /Users/myusername so I can do the following
```
cd
cd code
```
Explanation:
- cd: with no argument, brings me back to my home folder
- cd code: changes my working directory to code
- An equivalent of this would be cd ~/code

Go to/create a folder named command_line_tutorial and enter it

echo "Create a folder named command_line_tutorial and go to it"
if [[ -d "command_line_tutorial" ]]; then
    echo "command_line_tutorial folder exists, entering it"
    cd "command_line_tutorial"
else
    echo "command_line_tutorial folder does not exist, creating it"
    mkdir "command_line_tutorial"
    cd "command_line_tutorial"
fi

Explanation:

echo: prints the string in the terminal
if [[ -d "command_line_tutorial" ]]; then: if the folder command_line_tutorial exists, then enter it
else: otherwise create it and enter it

Create a file called myfile.txt
```
touch myfile.txt
```
Explanation:
- touch: creates an empty file (or updates its timestamp if it already exists)

Write 'Hello World!' in the above created file
```
echo 'Hello World!' >> myfile.txt
```
Explanation:
- >>: appends the string to the file

Add a line to the above created file

echo 'This is added to the file because of >>, otherwise > overwrites' >> myfile.txt

Explanation:

>>: appends the string to the file

Create 100 files named myfile1.txt, myfile2.txt, myfile3.txt, etc. and write a line in each of them
```
for i in {1..100}; do echo "This is file number $i" > myfile$i.txt; done
```
Explanation:
- for i in {1..100}; do: for each number between 1 and 100, do the following
- echo "This is file number $i" > myfile$i.txt;: write a line into myfile1.txt, myfile2.txt, myfile3.txt, etc., creating each file (> overwrites any existing content)

Count the number of files in the folder
```
ls | wc -l
```
Explanation:
- ls: list all files in the current folder
- |: pipe the output of the previous command to the next command
- wc -l: count the number of lines in the output of the previous command

Grep all files that are in the 90-99 range
```
ls | grep "myfile9[0-9]\.txt"
```
Explanation:
- ls: list all files in the current folder
- |: pipe the output of the previous command to the next command
- grep "myfile9[0-9]\.txt": keep only the names made of myfile, a 9, one more digit and .txt, i.e. myfile90.txt to myfile99.txt (myfile100.txt does not match)

Copy myfile.txt into a file named myfilecopy.txt
```
cp myfile.txt myfilecopy.txt
```
Explanation:
- cp: copy the file myfile.txt into myfilecopy.txt

Remove all files that start with myfile
```
rm myfile*
```
Explanation:
- rm myfile*: remove every file whose name starts with myfile (this includes myfilecopy.txt)

Go to the parent folder
```
cd ..
```
Explanation:
- cd ..: go to the parent folder

Remove the folder created for this tutorial
```
rmdir command_line_tutorial
```
Explanation:
- rmdir: remove the folder command_line_tutorial
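Coming back to the grep step, a quick way to convince yourself of what the character class matches is to replay it in Python. This is only a sketch: the file list is rebuilt in memory rather than read from disk, and Python's re module stands in for grep (the two agree on a pattern this simple).

```python
import re

# Same names as the files created in the tutorial folder.
filenames = ["myfile.txt"] + [f"myfile{i}.txt" for i in range(1, 101)]

# myfile, then a 9, then exactly one more digit, then a literal ".txt"
pattern = re.compile(r"myfile9[0-9]\.txt")
matches = [name for name in filenames if pattern.search(name)]

print(matches[0], matches[-1], len(matches))  # myfile90.txt myfile99.txt 10
```

Neither myfile9.txt nor myfile100.txt is selected, so the pattern covers the range 90 to 99.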

Let's put all these commands in a shell script

You can delete command_line_tutorial and rerun the following in order:

#print working directory
echo "Print working directory"
pwd

#go to root folder
echo "Go to root folder"
cd

#list files in folder
echo "List files in folder"
ls

#create a folder named code and go to it
echo "Create a folder named code and go to it"
if [[ -d "code" ]]; then
    echo "code folder exists, entering it"
    cd "code"
else
    mkdir "code"
    cd "code"
fi

#create a folder named command_line_tutorial and go to it
echo "Create a folder named command_line_tutorial and go to it"
if [[ -d "command_line_tutorial" ]]; then
    echo "command_line_tutorial folder exists, entering it"
    cd "command_line_tutorial"
else
    echo "command_line_tutorial folder does not exist, creating it"
    mkdir "command_line_tutorial"
    cd "command_line_tutorial"
fi

#create a file called myfile.txt
echo "Create a file called myfile.txt"
touch "myfile.txt"

# create 100 files named myfile1.txt, myfile2.txt, myfile3.txt, etc. and write a line in each of them
echo "Create 100 files named myfile1.txt, myfile2.txt, myfile3.txt, etc. and write a line in each of them"
for i in {1..100}; do echo "This is file number $i" > myfile$i.txt; done

#write 'Hello World!' in myfile.txt
echo "Write 'Hello World!' in myfile.txt"
echo 'Hello World!' >> myfile.txt


#Add a line to the above created file
echo "Add a line to the above created file"
echo 'This is added to the file because of >>, otherwise > overwrites' >> myfile.txt

#count the number of files in the folder
echo "Count the number of files in the folder"
ls | wc -l

# grep all files that are in the 90-99 range
echo "Grep all files that are in the 90-99 range"
ls | grep "myfile[9][0-9].txt"

#copy myfile.txt into a file named myfilecopy.txt
echo "Copy myfile.txt into a file named myfilecopy.txt"
cp myfile.txt myfilecopy.txt

#remove all files that start with myfile
echo "Remove all files that start with myfile"
rm myfile*

#go to parent folder
echo "Go to parent folder"
cd ..

#remove folder created for this tutorial
echo "Remove folder created for this tutorial"
rmdir command_line_tutorial
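
The script repeats the same if/else block whenever it enters a folder. One possible shortcut, not part of the original script, is mkdir -p, which creates the folder only when it is missing and does not complain when it already exists. The enter_dir helper and the temporary base_dir below are hypothetical names used for illustration:

```shell
# Hypothetical helper: create the folder if needed, then enter it.
enter_dir() {
    mkdir -p "$1" && cd "$1"
}

start_dir="$PWD"
base_dir="$(mktemp -d)"    # scratch location standing in for your home folder

cd "$base_dir"
enter_dir code
enter_dir command_line_tutorial
first_pass="$PWD"

# Second run: both folders already exist, and the helper still works.
cd "$base_dir"
enter_dir code
enter_dir command_line_tutorial
second_pass="$PWD"

echo "ended up in: $second_pass"
cd "$start_dir"
rm -r "$base_dir"
```

Both passes end in the same command_line_tutorial folder, which is what makes the script safe to rerun.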

I hope this was helpful. You are now a bash expert and can start using your terminal to automate your tasks going forward!

About me

Hey! 👋 I'm Armand Sauzay (armandsauzay). You can find, follow or contact me on:

Using Conda environments for Python, all you need to know

Armand Sauzay — Wed, 07 Dec 2022 03:34:29 +0000

Conda environments and environment variables made simple for your python projects.

All of the code for this tutorial can be found here.

Python code is great, but being able to reproduce the code is even better! This is why all python projects should come with something that defines the packages and versions to be able to run that exact same code: an environment.

Then comes the question: how can I create my code environment?

There are many possibilities.

First, I am sure all of you have used the following: pip install <package>

But how do we then keep track of the packages we installed and their version?

This is where virtualenv (a Python environment manager) could come in handy. But what happens when you have projects in different Python versions? You'd need to use pyenv (a Python version manager) to make sure you use the right version of Python, and then virtualenv to make sure you use the right package versions…

It would be great if a tool could do all of this, wouldn't it? This is why conda exists.

In short, in most cases (except for work that will end in production), conda covers all of what you would need, does some magic behind the scenes so you don't have to worry about environments, and has some great added benefits, like allowing you to define environment variables.

Without further ado, let's learn more about conda!

In this tutorial, we'll cover the following:

Basic conda commands: create, list, activate and deactivate
environment.yaml and its use
Defining environment variables through conda

1. Basic conda commands: create, list, activate and deactivate

If you haven't installed conda, you can start by installing miniconda. If you are not sure whether it is already installed, open a terminal and run conda list.

Once the installation is complete, run conda list to make sure you have conda installed.

The following commands are bash commands

a. create your environment:

conda create --name conda-tutorial python=3.8
→ here we create an environment called conda-tutorial with python 3.8 installed.
→ NOTE: if it asks whether to proceed, type y and press enter

b. list existing environments

conda env list
→ you should be able to see the newly created environment

c. activate your environment

conda activate conda-tutorial
→ you're now in your conda virtual environment. So the python code you execute should find its python version and the packages currently installed through this environment.
→ to check this, you can type in terminal: which python.
→ you can also install a package by typing conda install <package>
→ we'll see later how we can use an environment.yaml to not have manual installs
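
Here is a slightly more detailed version of the which python check, assuming a Linux or macOS shell. CONDA_DEFAULT_ENV and CONDA_PREFIX are variables that conda sets on activation; the other variable names are just for this sketch:

```shell
# CONDA_DEFAULT_ENV and CONDA_PREFIX are set by conda activate; both are empty outside conda.
active_env="${CONDA_DEFAULT_ENV:-none}"
python_path="$(command -v python || true)"

# Compare the location of python with the folder of the active environment.
case "$python_path" in
    "") origin="no python found on PATH" ;;
    "${CONDA_PREFIX:-/nonexistent}"/*) origin="python comes from the active conda environment" ;;
    *) origin="python comes from outside the active conda environment" ;;
esac

echo "active environment: $active_env"
echo "python: ${python_path:-not found} ($origin)"
```

After conda activate conda-tutorial, the first line should show conda-tutorial and the python path should sit inside that environment's folder.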

d. deactivate your environment

conda deactivate

e. Lastly, if you want, remove your environment

conda env remove --name <your_environment_name>
→ note that you need to first deactivate your current environment if you wish to remove it

Nice! You can now create an environment in which to run your code, which greatly helps reproducibility! You can also bookmark the conda cheat sheet for future use.
But, let's be honest, running a lot of conda install commands by hand is not that great for reproducibility. So let's see how we can use environment.yaml files.

2. environment.yaml and its use

Let's create an environment file with pandas installed and test it. For this we need two files

1. A simple python file named `conda_tutorial_pandas.py` which imports pandas and prints its version.

### conda_tutorial_pandas.py file
import pandas as pd
if __name__ == "__main__":    
    print(pd.__version__)

2. A file called environment_pandas.yaml, in which you paste the following:

### environment_pandas.yaml file
name: conda-tutorial
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - pandas=1.4.2

→ This file will create an environment named conda-tutorial with python 3.8 and pandas 1.4.2.
If you don't have an environment named conda-tutorial, you can run
conda env create --file environment_pandas.yaml
If you do have an environment named conda-tutorial, you can run
conda env update --file environment_pandas.yaml
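
The create-or-update choice can also be scripted. This is only a sketch: it assumes conda env list prints the environment name at the start of a line, and it prints the command instead of running it. The variable names are illustrative:

```shell
env_name="conda-tutorial"
env_file="environment_pandas.yaml"

if ! command -v conda > /dev/null 2>&1; then
    action="unavailable"      # conda is not installed or not on PATH
elif conda env list | grep -q "^${env_name}[[:space:]]"; then
    action="update"           # the environment already exists
else
    action="create"
fi

if [ "$action" = "unavailable" ]; then
    echo "conda not found on PATH"
else
    # Printed rather than executed, so nothing changes until you run it yourself.
    echo "conda env $action --file $env_file"
fi
```

Once the printed command looks right, you can copy it into your terminal and run it.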

Now, you can activate the environment conda-tutorial and from terminal, run python conda_tutorial_pandas.py. It should print out: 1.4.2.

Let's now go a step further and create environment variables so you don't have to worry about credentials being leaked, for instance.

3. Environment variables

a. Create the shell scripts that will activate and deactivate your environment variables.

cd "$CONDA_PREFIX"
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh

b. add the following to export your environment variables when you activate your environment

nano $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
then paste the following:

export MY_KEY='secret-key-value' 
export MY_FILE=/path/to/my/file/

then press Ctrl+O followed by Enter to save, and Ctrl+X to exit. If you don't want to use nano, you can also manually create this file with its content.

c. to unset your environment variables when you deactivate your environment

nano $CONDA_PREFIX/etc/conda/deactivate.d/env_vars.sh
Then paste the following

unset MY_KEY 
unset MY_FILE

then press Ctrl+O followed by Enter to save, and Ctrl+X to exit. If you don't want to use nano, you can manually create this file with its content.
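
To see what these two scripts do without touching a real environment, you can source them by hand, which is roughly what conda does on activate and deactivate. In this sketch hooks_dir is a temporary stand-in for $CONDA_PREFIX/etc/conda, and the files are written with printf instead of nano:

```shell
hooks_dir="$(mktemp -d)"    # stand-in for $CONDA_PREFIX/etc/conda
mkdir -p "$hooks_dir/activate.d" "$hooks_dir/deactivate.d"

# Same content as the two env_vars.sh files described above.
printf '%s\n' "export MY_KEY='secret-key-value'" 'export MY_FILE=/path/to/my/file/' \
    > "$hooks_dir/activate.d/env_vars.sh"
printf '%s\n' 'unset MY_KEY' 'unset MY_FILE' \
    > "$hooks_dir/deactivate.d/env_vars.sh"

. "$hooks_dir/activate.d/env_vars.sh"      # roughly what happens on conda activate
during="${MY_KEY:-unset}"
. "$hooks_dir/deactivate.d/env_vars.sh"    # roughly what happens on conda deactivate
after="${MY_KEY:-unset}"

echo "during: $during, after: $after"
rm -r "$hooks_dir"
```

MY_KEY holds the secret while the environment is "active" and is gone again afterwards, which is the behaviour you want for credentials.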

Now let's create a python file named conda_tutorial_environment_variables.py which will print out these environment variables

### conda_tutorial_environment_variables.py file
import os
if __name__ == "__main__": 
    print(os.getenv("MY_KEY")) 
    print(os.getenv("MY_FILE"))

Now, activate your conda environment by running conda activate <my_env> first and then:
python conda_tutorial_environment_variables.py
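
If the script prints None, the variables were not exported in the current shell. Here is a small check you could run first. missing_env_vars is a hypothetical helper, and the export/unset lines only simulate an environment where MY_KEY is set and MY_FILE is not:

```shell
# Hypothetical helper: print the names that are not exported in the current shell.
missing_env_vars() {
    for name in "$@"; do
        printenv "$name" > /dev/null 2>&1 || printf '%s\n' "$name"
    done
}

export MY_KEY='secret-key-value'    # simulated: set by the activate script
unset MY_FILE                       # simulated: forgotten in the activate script

missing="$(missing_env_vars MY_KEY MY_FILE)"
echo "missing: ${missing:-none}"
```

With your real environment activated, drop the export and unset lines and just call missing_env_vars MY_KEY MY_FILE.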

Woohoo! You now know how to create and manage a conda environment and create environment variables in your environment so you don't have to worry about hardcoding credentials in your code!!

Hope you liked this article! If you have any questions or suggestions, don't hesitate to leave a comment or contact me on LinkedIn, GitHub or Twitter, or check out some other tutorials I wrote on DS/ML best practices.

About me

You can follow me, contact me, or see what I do on the following platforms:

GitHub: armand-sauzay
Twitter: @armandsauzay
LinkedIn
Medium: armand-sauzay
