<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elias Elikem Ifeanyi Dzobo</title>
    <description>The latest articles on DEV Community by Elias Elikem Ifeanyi Dzobo (@eliasdzobo).</description>
    <link>https://dev.to/eliasdzobo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F642301%2Faf45605c-522f-43b0-989a-052d4eadccca.jpeg</url>
      <title>DEV Community: Elias Elikem Ifeanyi Dzobo</title>
      <link>https://dev.to/eliasdzobo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eliasdzobo"/>
    <language>en</language>
    <item>
      <title>The Experiment of the ML Scientist</title>
      <dc:creator>Elias Elikem Ifeanyi Dzobo</dc:creator>
      <pubDate>Mon, 26 Aug 2024 07:37:21 +0000</pubDate>
      <link>https://dev.to/eliasdzobo/the-experiment-of-the-ml-scientist-5eoj</link>
      <guid>https://dev.to/eliasdzobo/the-experiment-of-the-ml-scientist-5eoj</guid>
      <description>&lt;p&gt;In the world of machine learning (ML), the path from data to a functioning model is not a straight line. It is an iterative, experiment-driven journey where hypotheses are tested, algorithms are refined, and results are evaluated, often multiple times, before arriving at a solution that meets the desired goals. This process is akin to the scientific method, where experiments play a crucial role in understanding the problem space, optimizing models, and ensuring robustness and accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Typical ML Model Development Process
&lt;/h2&gt;

&lt;p&gt;Let’s begin by outlining the typical stages involved in developing an ML model. This will set the stage for understanding where experiments come into play:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Definition:&lt;/strong&gt;&lt;br&gt;
The journey starts with a clear definition of the problem you're trying to solve. This includes understanding the business context, the objectives, and the specific task at hand—whether it's classification, regression, clustering, or another type of ML problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Collection and Preparation:&lt;/strong&gt;&lt;br&gt;
Once the problem is defined, the next step is to gather relevant data. This data often comes from various sources and requires significant preprocessing—cleaning, transforming, and organizing—before it can be used to train models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploratory Data Analysis (EDA):&lt;/strong&gt;&lt;br&gt;
With clean data in hand, an ML scientist conducts EDA to uncover patterns, correlations, and insights. This step involves visualizations and statistical analysis to understand the data's characteristics and to identify any potential issues like imbalance or outliers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Engineering:&lt;/strong&gt;&lt;br&gt;
After understanding the data, the focus shifts to creating features that can help the model learn. This might involve selecting relevant variables, creating new ones, or transforming existing features to better represent the underlying patterns in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Selection and Training:&lt;/strong&gt;&lt;br&gt;
Here’s where the experimentation begins in earnest. Multiple algorithms may be tested, hyperparameters tuned, and various approaches compared. This step involves iterating through different models to find the best performing one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Evaluation:&lt;/strong&gt;&lt;br&gt;
Once a model is trained, it is evaluated using various metrics appropriate for the task (e.g., accuracy, precision, recall, F1 score for classification problems). This evaluation helps in understanding how well the model generalizes to unseen data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment and Monitoring:&lt;/strong&gt;&lt;br&gt;
After selecting the best model, it’s deployed to production. However, the process doesn’t end here. Continuous monitoring is essential to ensure that the model performs well over time and to detect any drift in data patterns.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Are Experiments in Machine Learning?
&lt;/h2&gt;

&lt;p&gt;Experiments in machine learning are systematic procedures carried out to test hypotheses about model performance. In the context of ML, an experiment might involve testing different model architectures, varying hyperparameters, or using different subsets of data to determine their impact on the model’s accuracy, robustness, and generalization.&lt;/p&gt;

&lt;p&gt;The goal of running these experiments is to optimize the model. Just as a scientist in a lab runs multiple trials to refine a hypothesis, an ML scientist runs numerous experiments to refine the model, iterating until the model meets the performance criteria.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Are Experiments Important?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Optimization&lt;/strong&gt;: Experiments help in finding the best model architecture and hyperparameters, which are crucial for achieving optimal performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Robustness&lt;/strong&gt;: By systematically testing different variables, experiments can help ensure that the model is not just fitting the training data but can generalize well to new, unseen data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Innovation&lt;/strong&gt;: Experimentation allows ML scientists to explore new approaches and techniques, potentially leading to breakthrough improvements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;: Keeping track of experiments is vital for reproducibility. Knowing what was tried, what worked, and what didn’t is essential for both improving models and for collaboration within teams.&lt;/p&gt;
&lt;h2&gt;
  
  
  Running an ML Model Experiment with MLflow
&lt;/h2&gt;

&lt;p&gt;To illustrate how experiments are managed in practice, let’s walk through the process of setting up and running an ML experiment using a tool like MLflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Setting Up MLflow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MLflow is an open-source platform that helps manage the entire ML lifecycle, including experimentation, reproducibility, and deployment. It provides an interface to track experiments and log data, models, and results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install MLflow&lt;/strong&gt;:&lt;br&gt;
&lt;code&gt;pip install mlflow&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialize an Experiment&lt;/strong&gt;: Start by initializing a new experiment in MLflow. This can be done programmatically or through the MLflow UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mlflow

mlflow.set_experiment("My First Experiment")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Logging Parameters and Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During your model training process, you can log parameters (like hyperparameters), metrics (like accuracy), and artifacts (like model files) to MLflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a sample dataset and hold out a test set (swap in your own data)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    # Log parameters
    n_estimators = 100
    mlflow.log_param("n_estimators", n_estimators)

    # Train model
    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(X_train, y_train)

    # Log metrics
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    mlflow.log_metric("accuracy", accuracy)

    # Log the model
    mlflow.sklearn.log_model(model, "model")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Comparing Experiments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you’ve logged multiple experiments, MLflow allows you to compare them easily. You can view all the experiments in the MLflow UI, where you can compare different runs based on metrics, parameters, and artifacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Reproducibility and Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the powerful features of MLflow is its ability to ensure reproducibility. Since all the details of each experiment are logged, it’s easy to reproduce any experiment, down to the specific environment and dependencies.&lt;/p&gt;

&lt;p&gt;MLflow also supports model deployment, allowing you to transition from experimentation to production smoothly. You can deploy models to different platforms directly from MLflow, ensuring consistency between your development and production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Experimentation is at the heart of successful machine learning. By systematically testing and refining models through experiments, ML scientists can optimize performance, ensure robustness, and push the boundaries of what their models can achieve. Tools like MLflow provide a structured and efficient way to manage these experiments, making it easier to track, compare, and deploy ML models. As machine learning continues to evolve, the ability to run effective experiments will remain a key skill for any ML practitioner.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>mlflow</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Piping Hot Features: How Feature Stores Keep Your ML Sizzling</title>
      <dc:creator>Elias Elikem Ifeanyi Dzobo</dc:creator>
      <pubDate>Wed, 14 Aug 2024 09:12:22 +0000</pubDate>
      <link>https://dev.to/eliasdzobo/piping-hot-features-how-feature-stores-keep-your-ml-sizzling-4e47</link>
      <guid>https://dev.to/eliasdzobo/piping-hot-features-how-feature-stores-keep-your-ml-sizzling-4e47</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As machine learning (ML) systems evolve from experimental projects to production-grade applications, the need for robust infrastructure becomes paramount. One crucial component in the MLOps (Machine Learning Operations) lifecycle is the feature store. In this article, we'll explore what feature stores are, where they fit in the MLOps lifecycle, why they are important, and provide a hands-on tutorial on how to create and use a feature store in your ML projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Feature Store?
&lt;/h2&gt;

&lt;p&gt;A feature store is a centralized repository for storing, managing, and serving features—individual measurable properties or characteristics used as inputs in machine learning models. It ensures consistency and reusability of features across different models and teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Feature Stores Fit in the MLOps Lifecycle
&lt;/h2&gt;

&lt;p&gt;Feature stores sit at the heart of the data engineering and model training phases of the MLOps lifecycle:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt;: Raw data is collected from various sources (databases, APIs, sensors) and ingested into the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Engineering&lt;/strong&gt;: This is where feature stores come into play. Engineers transform raw data into meaningful features and store them in the feature store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Training&lt;/strong&gt;: Models are trained using features stored in the feature store, ensuring consistency across training and serving environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Serving&lt;/strong&gt;: During inference, the same features are retrieved from the feature store, ensuring the model receives consistent inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring and Management&lt;/strong&gt;: Feature stores also support monitoring feature usage and quality, aiding in model performance management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Feature Stores are Important
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Ensures that the same features are used during training and inference, reducing discrepancies.&lt;br&gt;
&lt;strong&gt;Reusability&lt;/strong&gt;: Facilitates reuse of features across different models and teams, speeding up the development process.&lt;br&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Handles large volumes of feature data efficiently.&lt;br&gt;
&lt;strong&gt;Versioning&lt;/strong&gt;: Maintains versioning of features, allowing reproducibility of models.&lt;br&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Enables tracking and monitoring of feature performance and quality.&lt;/p&gt;
&lt;h2&gt;
  
  
  Tutorial: Creating and Using a Feature Store
&lt;/h2&gt;

&lt;p&gt;We'll use Feast, an open-source feature store, for this tutorial. Let's walk through the steps of creating and using a feature store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Install Feast&lt;/strong&gt;&lt;br&gt;
First, install feast:&lt;br&gt;
&lt;code&gt;pip install feast&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Define Your Feature Store&lt;/strong&gt;&lt;br&gt;
Create a new directory for your feature store project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir my_feature_store
cd my_feature_store

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Initialize a Feast repository&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feast init my_feature_repo
cd my_feature_repo

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Define Your Feature Definitions&lt;/strong&gt;&lt;br&gt;
Edit the feature_store.yaml to configure your feature store. In the my_feature_repo directory, create a new file driver_stats.py and define your feature definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Uses the modern Feast API (Entity / Field / FeatureView, Feast 0.30+)
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Define an entity for the driver. An entity is the primary key used to fetch features.
driver = Entity(name="driver", join_keys=["driver_id"], description="driver id")

# Read data from a Parquet file. Parquet is convenient for local development
# because it supports nested data.
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Define a feature view: a logical group of features served to models together.
driver_stats_view = FeatureView(
    name="driver_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_stats_source,
    tags={"team": "driver_performance"},
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Register the Feature Definitions&lt;/strong&gt;&lt;br&gt;
Run the following command to apply the feature definitions to the feature store:&lt;br&gt;
&lt;code&gt;feast apply&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Ingest Data into the Feature Store&lt;/strong&gt;&lt;br&gt;
Create a data directory and add a sample driver_stats.parquet file with your driver statistics data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingest the data&lt;/em&gt;:&lt;br&gt;
&lt;code&gt;feast materialize-incremental $(date +%Y-%m-%d)&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Retrieve Features for Training&lt;/strong&gt;&lt;br&gt;
To retrieve features for training, create a new script retrieve_training_data.py:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from feast import FeatureStore
import pandas as pd

# Initialize the feature store
fs = FeatureStore(repo_path=".")

# Define the entities we want to retrieve features for
entity_df = pd.DataFrame(
    {"driver_id": [1001, 1002, 1003], "event_timestamp": pd.to_datetime(["2024-08-05", "2024-08-05", "2024-08-05"])}
)

# Retrieve features from the feature store
training_df = fs.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:conv_rate", "driver_stats:acc_rate", "driver_stats:avg_daily_trips"],
).to_df()

print(training_df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the script to retrieve the features:&lt;br&gt;
&lt;code&gt;python retrieve_training_data.py&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Use Features for Model Serving&lt;/strong&gt;&lt;br&gt;
To retrieve features for online serving, create a new script retrieve_online_features.py:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from feast import FeatureStore

# Initialize the feature store
fs = FeatureStore(repo_path=".")

# Define the entities we want to retrieve features for
entity_rows = [{"driver_id": 1001}, {"driver_id": 1002}, {"driver_id": 1003}]

# Retrieve features from the feature store
online_features = fs.get_online_features(
    features=["driver_stats:conv_rate", "driver_stats:acc_rate", "driver_stats:avg_daily_trips"],
    entity_rows=entity_rows,
).to_dict()

print(online_features)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the script to retrieve the features:&lt;br&gt;
&lt;code&gt;python retrieve_online_features.py&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Feature stores are a vital component in the MLOps lifecycle, providing consistency, reusability, scalability, versioning, and monitoring of features. By centralizing feature management, feature stores like Feast enable efficient and reliable machine learning workflows from data ingestion to model serving. With the hands-on tutorial above, you now have a basic understanding of how to create and use a feature store in your ML projects. Happy experimenting!&lt;/p&gt;

</description>
      <category>featurestores</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Pandas to Pipelines</title>
      <dc:creator>Elias Elikem Ifeanyi Dzobo</dc:creator>
      <pubDate>Mon, 05 Aug 2024 09:06:00 +0000</pubDate>
      <link>https://dev.to/eliasdzobo/pandas-to-pipelines-1nl5</link>
      <guid>https://dev.to/eliasdzobo/pandas-to-pipelines-1nl5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you've ever wrangled data using pandas, you know it's a powerful tool. But as your projects grow, so do the challenges. Enter: pipelines. Think of moving from pandas to pipelines as upgrading from riding a bicycle to driving a car on a highway. It's all about efficiency, scalability, and getting to your destination faster. Let's explore how data preparation evolves from pandas to pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pandas Approach: Manual Labor
&lt;/h2&gt;

&lt;p&gt;Using pandas for data preparation is like cooking a meal from scratch every single time. You chop the veggies, sauté the onions, and season to taste. It's a hands-on process that works well for small, one-off tasks. With pandas, you load your data, clean it, transform it, and merge it all within a few lines of code. For a single data scientist working on a small project, this might seem just fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Drawbacks
&lt;/h2&gt;

&lt;p&gt;However, as your project scales, the drawbacks of this approach become apparent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Repetition: Every time you run your analysis, you have to manually execute the same steps. This repetition is not only time-consuming but also error-prone.&lt;/li&gt;
&lt;li&gt;Lack of Modularity: Your code can become a tangled mess of transformations, making it hard to maintain and debug.&lt;/li&gt;
&lt;li&gt;Scalability Issues: Pandas operates in-memory, which can be a bottleneck when dealing with large datasets.&lt;/li&gt;
&lt;li&gt;Collaboration Challenges: Sharing your work with others or deploying it in a production environment can be cumbersome without a structured workflow.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Enter Pipelines: Automation and Efficiency
&lt;/h2&gt;

&lt;p&gt;Pipelines are like having a professional kitchen with a team of chefs, each responsible for a specific task, working in harmony to prepare a gourmet meal. In the context of ML, a pipeline automates the sequence of data processing steps, ensuring each step is executed correctly and efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Pipelines Transform Your Workflow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Automation: Pipelines automate repetitive tasks, reducing manual intervention and minimizing errors.&lt;/li&gt;
&lt;li&gt;Modularity: Each step in a pipeline is a separate component, making your code more organized and easier to debug.&lt;/li&gt;
&lt;li&gt;Scalability: Pipelines can handle large datasets by processing data in chunks or leveraging distributed computing.&lt;/li&gt;
&lt;li&gt;Collaboration and Deployment: Pipelines provide a clear structure, making it easier for teams to collaborate and deploy models in production environments.&lt;/li&gt;
&lt;/ol&gt;
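
&lt;p&gt;The scalability point can be illustrated with pandas itself: reading a file in fixed-size chunks keeps memory bounded no matter how large the input grows. A minimal sketch with a throwaway file:&lt;/p&gt;

```python
import pandas as pd

# Write a small CSV for the demo; imagine a file too large to load at once
pd.DataFrame({"value": range(10)}).to_csv("big.csv", index=False)

# Process the file in fixed-size chunks so memory stays bounded
total = 0
for chunk in pd.read_csv("big.csv", chunksize=4):
    total = total + int(chunk["value"].sum())

print(total)  # 45, the same answer a single read_csv would give
```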

&lt;p&gt;Let's transform a data preparation script into a beautiful pipeline using Prefect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transforming a Data Preparation Script into a Pipeline Script Using Prefect
&lt;/h2&gt;

&lt;p&gt;Here's a simple script that reads a CSV file, cleans the data, and saves the cleaned data to a new CSV file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Original Pandas Script
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Load data
data = pd.read_csv('data.csv')

# Clean data
data.dropna(inplace=True)  # Drop missing values
data['date'] = pd.to_datetime(data['date'])  # Convert date column to datetime
data = data[data['value'] &amp;gt; 0]  # Filter out non-positive values

# Save cleaned data
data.to_csv('cleaned_data.csv', index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Install Prefect
&lt;/h2&gt;

&lt;p&gt;First, install Prefect if you haven't already:&lt;br&gt;
&lt;code&gt;pip install prefect&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: Transform the Script into a Prefect Flow
&lt;/h2&gt;

&lt;p&gt;We'll break down the script into individual tasks and then combine them into a Prefect flow.&lt;/p&gt;
&lt;h2&gt;
  
  
  Import Prefect and Define Tasks
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from prefect import task, flow

# Define tasks
@task
def load_data(filepath):
    data = pd.read_csv(filepath)
    return data

@task
def clean_data(data):
    data.dropna(inplace=True)
    data['date'] = pd.to_datetime(data['date'])
    data = data[data['value'] &amp;gt; 0]
    return data

@task
def save_data(data, output_filepath):
    data.to_csv(output_filepath, index=False)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Create a Prefect Flow
&lt;/h2&gt;

&lt;p&gt;Now, we'll create a Prefect flow that chains these tasks together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a Prefect flow
@flow
def run():
    filepath = 'data.csv'
    output_filepath = 'cleaned_data.csv'

    data = load_data(filepath)
    cleaned_data = clean_data(data)
    save_data(cleaned_data, output_filepath)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Run the Flow
&lt;/h2&gt;

&lt;p&gt;Finally, run your flow. You can run it locally or use Prefect Cloud for additional features like monitoring and logging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run Locally
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == "__main__":
    run()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then use Prefect's UI to monitor and manage your flow.&lt;br&gt;
By transforming your pandas script into a Prefect flow, you gain automation, scalability, and improved error handling. Prefect makes it easy to manage your data workflows and integrate them into production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: From Pandas to Pipelines
&lt;/h2&gt;

&lt;p&gt;Transitioning from pandas to pipelines is like moving from a bicycle to a car—it's about embracing efficiency, scalability, and the ability to tackle bigger challenges. By investing in pipeline tools like Airflow, Mage, and Prefect, you're setting yourself up for success in the world of machine learning production.&lt;/p&gt;

&lt;p&gt;So, the next time you're prepping your data, remember: you can chop those veggies yourself, or you can let a team of chefs handle it for you. Happy coding, and stay tuned for more insights on bringing your ML projects to life in production!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>aiops</category>
      <category>mlops</category>
      <category>mlpipelines</category>
    </item>
    <item>
      <title>zkML: Evolving the Intelligence of Smart Contracts Through Zero-Knowledge Cryptography</title>
      <dc:creator>Elias Elikem Ifeanyi Dzobo</dc:creator>
      <pubDate>Sun, 11 Jun 2023 20:14:39 +0000</pubDate>
      <link>https://dev.to/eliasdzobo/zkml-evolving-the-intelligence-of-smart-contracts-through-zero-knowledge-cryptography-h17</link>
      <guid>https://dev.to/eliasdzobo/zkml-evolving-the-intelligence-of-smart-contracts-through-zero-knowledge-cryptography-h17</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;Machine learning (ML) has become a powerful tool in various domains, but its integration with smart contracts has been limited due to computational challenges. However, the emergence of zkML (zero-knowledge machine learning) is set to change this landscape. By leveraging zero-knowledge cryptography, zkML enables the verification of ML model inferences on the blockchain, opening up new possibilities for autonomous and intelligent smart contracts. In this article, we will explore the potential applications, challenges, and emerging projects in the field of zkML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enhancing Smart Contracts with ML:
&lt;/h2&gt;

&lt;p&gt;Smart contracts, which are self-executing agreements with predefined rules, have revolutionized decentralized applications (dApps) and blockchain-based systems. However, they often rely on static rules and lack the ability to adapt to real-time data. By integrating ML capabilities, smart contracts can become more autonomous and dynamic, making decisions based on real-time on-chain data. This evolution enables increased automation, accuracy, efficiency, and flexibility in smart contract execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenges of On-Chain ML:
&lt;/h2&gt;

&lt;p&gt;One of the main obstacles to incorporating ML models into smart contracts is the high computational cost of running these models on-chain. The resource-intensive nature of ML computations, such as training and inference, makes it infeasible to directly execute them on the Ethereum Virtual Machine (EVM) or similar blockchain platforms. However, the focus of zkML is primarily on the inference phase of ML models, which is more amenable to verification using zero-knowledge proofs.&lt;/p&gt;

&lt;h2&gt;
  
  
  zkSNARKs: Enabling zkML:
&lt;/h2&gt;

&lt;p&gt;Zero-knowledge Succinct Non-Interactive Arguments of Knowledge (zkSNARKs) provide a solution to the computational complexity of running ML models on-chain. With zkSNARKs, anyone can run an ML model off-chain, generate a verifiable proof of the model's inference, and publish it on-chain. This approach allows smart contracts to leverage the intelligence of ML models without directly executing them on the blockchain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applications and Opportunities:
&lt;/h2&gt;

&lt;p&gt;zkML opens up a wide range of applications and opportunities across various domains. In the decentralized finance (DeFi) space, verifiable off-chain ML oracles can be used to settle real-world prediction markets, insurance protocols, and more. ML-parameterized DeFi applications can automate lending protocols and update parameters in real-time. zkML also offers solutions for fraud monitoring, decentralized prompt marketplaces for generative AI, identity management, web3 social media filtering, and personalized advertising, among others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Emerging Projects and Infrastructure:
&lt;/h2&gt;

&lt;p&gt;The zkML ecosystem is rapidly evolving, with several projects and infrastructure components emerging to support its development. Model-to-proof compilers, such as EZKL, circomlib-ml, LinearA's Tachikoma and Uchikoma, and zkml, enable the translation of ML models into verifiable computational circuits. Generalized proving systems like Halo2 and Plonky2 provide the necessary tools to handle non-linearities in ML models through lookup tables and custom gates. These advancements pave the way for the integration of zkML into various applications and use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;zkML represents a significant advancement in the integration of ML with smart contracts and blockchain-based systems. By leveraging zero-knowledge cryptography and zkSNARKs, zkML enables the verification of ML model inferences on-chain, enhancing the intelligence and flexibility of smart contracts. The potential applications and opportunities for zkML span across DeFi, security, traditional ML, identity, web3 social, and the creator economy. As the zkML ecosystem continues to grow and mature, we can expect further innovations and real-world implementations that harness the power of ML in decentralized systems.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Analytics: Blockchain Edition</title>
      <dc:creator>Elias Elikem Ifeanyi Dzobo</dc:creator>
      <pubDate>Thu, 27 Apr 2023 21:12:34 +0000</pubDate>
      <link>https://dev.to/eliasdzobo/data-analytics-blockchain-edition-3gl6</link>
      <guid>https://dev.to/eliasdzobo/data-analytics-blockchain-edition-3gl6</guid>
      <description>&lt;p&gt;Blockchain technology has been gaining popularity in recent years due to its unique features such as transparency, immutability, and decentralization. One of the main reasons behind the growth of blockchain technology is its potential to create new and innovative applications that can transform various industries.&lt;/p&gt;

&lt;p&gt;However, with the growth of blockchain technology, new challenges have emerged in analyzing the data on blockchain networks. This article provides an overview of the unique challenges and opportunities in analyzing data on blockchain networks, including the use of distributed ledger technology and smart contracts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blockchain Analytics – An Overview
&lt;/h2&gt;

&lt;p&gt;Blockchain is a distributed ledger technology that records transactions in a secure, transparent, and immutable way. Each transaction is added to a block, which is then added to the blockchain. This creates a chain of blocks that cannot be altered without the consensus of the network.&lt;/p&gt;

&lt;p&gt;Blockchain analytics refers to the process of analyzing data on blockchain networks. This includes analyzing transaction data, network activity, and the behavior of network participants. Blockchain analytics can provide valuable insights into how blockchain networks are being used, and help identify potential risks and opportunities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges in Analyzing Blockchain Data
&lt;/h2&gt;

&lt;p&gt;One of the main challenges in analyzing blockchain data is the complexity of the data. Blockchain data is structured differently from traditional data sources, making it difficult to analyze using traditional data analytics tools and techniques.&lt;/p&gt;

&lt;p&gt;Another challenge is the size of the data. Blockchain networks generate a large amount of data, and storing, processing, and analyzing this data can be expensive and time-consuming.&lt;/p&gt;

&lt;p&gt;Furthermore, blockchain networks are decentralized, which means that data is distributed across multiple nodes. This can make it challenging to access and analyze data, as each node has its own copy of the blockchain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Opportunities in Analyzing Blockchain Data
&lt;/h2&gt;

&lt;p&gt;Despite the challenges, analyzing blockchain data presents unique opportunities. For example, blockchain data is inherently transparent, which means that it can be used to provide a comprehensive view of network activity.&lt;/p&gt;

&lt;p&gt;Blockchain analytics can also be used to identify patterns and trends in network activity. This can help to identify potential risks and opportunities, such as identifying fraudulent transactions or predicting market trends.&lt;/p&gt;

&lt;p&gt;Another opportunity is the use of smart contracts. Smart contracts are self-executing contracts that are programmed to automatically execute when certain conditions are met. These contracts are stored on the blockchain, which means that they are transparent and immutable. Smart contract analytics can be used to identify patterns in smart contract usage, and identify potential risks and opportunities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Blockchain analytics presents both unique challenges and opportunities. While analyzing blockchain data can be complex and expensive, it provides valuable insight into how blockchain networks are being used and helps identify potential risks and opportunities. As blockchain technology continues to evolve, the importance of blockchain analytics is likely to grow, and new tools and techniques will need to be developed to address the unique challenges and opportunities presented by blockchain data.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Developing a Program with Test Driven Development</title>
      <dc:creator>Elias Elikem Ifeanyi Dzobo</dc:creator>
      <pubDate>Mon, 24 Apr 2023 20:40:23 +0000</pubDate>
      <link>https://dev.to/eliasdzobo/developing-a-program-withtest-driven-development-dd9</link>
      <guid>https://dev.to/eliasdzobo/developing-a-program-withtest-driven-development-dd9</guid>
      <description>&lt;p&gt;Test-driven development (TDD) is a software development process that emphasizes writing automated tests before writing code. In this approach, the developer writes a failing test case that specifies the behavior they want to implement, and then writes the minimum amount of code needed to make the test pass. The process is then repeated, with new test cases being added and the code being updated to pass them.&lt;/p&gt;

&lt;p&gt;One of the key benefits of TDD is that it helps ensure that code is reliable and meets the intended specifications. By writing tests first, the developer is forced to think carefully about the expected behavior of their code and to consider edge cases and potential errors. This can help catch bugs early in the development process, when they are easier and cheaper to fix.&lt;/p&gt;

&lt;p&gt;Using a test-driven development approach, we will implement a program that collates news feeds on particular topics of interest. Considering the fast pace at which technology improves, staying up to date with the latest news and updates is a must; however, checking for new articles daily can be quite a chore. In this article, we'll develop a program to collate news articles using NewsAPI and a test-driven development approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up an Environment
&lt;/h3&gt;

&lt;p&gt;The first step is setting up a development environment. We'll be using Python to develop this program, so we'll set up a virtual environment to work in. Setting up a Python virtual environment is a best practice for developing Python applications. It helps to isolate dependencies, avoid conflicts between different versions of libraries, and keep the system Python environment clean. Here are the steps to set up a Python virtual environment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a terminal or command prompt on your computer.&lt;/li&gt;
&lt;li&gt;Install the virtualenv package by typing the following command and pressing Enter:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install virtualenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Create a new folder where you want to keep your virtual environment, and navigate to that folder using the &lt;strong&gt;&lt;code&gt;cd&lt;/code&gt;&lt;/strong&gt; command.&lt;/li&gt;
&lt;li&gt;Create a virtual environment by typing the following command and pressing Enter:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;virtualenv venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a new folder called &lt;strong&gt;&lt;code&gt;venv&lt;/code&gt;&lt;/strong&gt; in the current directory, containing a fresh Python installation and a &lt;strong&gt;&lt;code&gt;pip&lt;/code&gt;&lt;/strong&gt; executable for installing packages.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Activate the virtual environment by typing the following command and pressing Enter:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;On Windows, the command is slightly different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;venv\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activating the virtual environment will modify your system's &lt;strong&gt;&lt;code&gt;PATH&lt;/code&gt;&lt;/strong&gt; environment variable to use the Python installation and packages in the virtual environment.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;You can now install packages using &lt;strong&gt;&lt;code&gt;pip&lt;/code&gt;&lt;/strong&gt; as usual, and they will be installed only in the virtual environment, not globally on your system. For example, to install the &lt;strong&gt;&lt;code&gt;newsapi-python&lt;/code&gt;&lt;/strong&gt; package, type the following command and press Enter:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install newsapi-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Writing our Test cases
&lt;/h3&gt;

&lt;p&gt;The next step is to write the test cases that will define the functionality of our program. Test-driven development (TDD) is a software development approach that emphasizes writing tests before writing the actual code. This approach helps ensure that the code meets the intended specifications and catches any issues early in the development process.&lt;/p&gt;

&lt;p&gt;For our program, we'll define a set of test cases that cover the following functionality:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieving news articles on a specific topic&lt;/li&gt;
&lt;li&gt;Sorting news articles by date&lt;/li&gt;
&lt;li&gt;Filtering news articles by source&lt;/li&gt;
&lt;li&gt;Limiting the number of articles returned&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's an example of a test case that retrieves news articles on a specific topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import unittest
from newsapi import NewsApiClient

class NewsAPITestCase(unittest.TestCase):

    def setUp(self):
        self.api_key = 'your_api_key_here'
        self.newsapi = NewsApiClient(api_key=self.api_key)

    def test_retrieve_news_articles(self):
        topic = 'artificial intelligence'
        response = self.newsapi.get_everything(q=topic)
        articles = response['articles']
        self.assertTrue(len(articles) &amp;gt; 0, f"No news articles found on {topic}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test case checks that the API returns at least one news article when searching for articles related to 'artificial intelligence'.&lt;/p&gt;
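&lt;p&gt;The other three behaviours (sorting, source filtering, limiting) can be tested the same way. To keep such tests runnable without an API key, the client can be replaced with a stub via &lt;code&gt;unittest.mock&lt;/code&gt;; the sketch below uses an invented two-article payload:&lt;/p&gt;

```python
import unittest
from unittest.mock import MagicMock

class NewsAPILimitTestCase(unittest.TestCase):

    def setUp(self):
        # Stub the NewsApiClient so the test runs offline.
        self.newsapi = MagicMock()
        self.newsapi.get_everything.return_value = {
            'articles': [{'title': 'a'}, {'title': 'b'}]
        }

    def test_limit_articles(self):
        # The stub returns a fixed payload; this only checks the plumbing,
        # not NewsAPI's real page_size behaviour.
        response = self.newsapi.get_everything(q='ai', page_size=2)
        self.assertEqual(len(response['articles']), 2)
```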

&lt;h3&gt;
  
  
  Writing the Code
&lt;/h3&gt;

&lt;p&gt;With our test cases defined, we can now write the code to implement the desired functionality. For each test case, we'll write the minimum amount of code necessary to make the test pass. This iterative process of writing tests and then implementing the code is the foundation of TDD.&lt;/p&gt;

&lt;p&gt;Here's an example implementation of the test case we defined earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from newsapi import NewsApiClient

newsapi = NewsApiClient(api_key='your_api_key_here')  # reuse the configured client

def get_news_articles(topic, sources=None, sort_by='publishedAt', limit=None):
    # Query the /v2/everything endpoint with an optional source filter,
    # sort order, and page size (limit).
    response = newsapi.get_everything(q=topic, sources=sources, sort_by=sort_by, page_size=limit)
    return response['articles']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function takes a topic as input and returns a list of news articles that match the topic. If sources are specified, it filters the results to only include articles from those sources. The articles are sorted by the published date, with the most recent articles appearing first. Finally, if a limit is specified, it only returns that number of articles.&lt;/p&gt;
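&lt;p&gt;The sorting and limiting behaviour can also be checked client-side on the returned article dictionaries. The helper below is a hypothetical addition (not part of &lt;code&gt;newsapi-python&lt;/code&gt;) that re-sorts articles newest-first and applies the limit locally:&lt;/p&gt;

```python
from datetime import datetime

def sort_and_limit(articles, limit=None):
    # Sort newest-first by the ISO-8601 'publishedAt' stamp.
    ordered = sorted(
        articles,
        key=lambda a: datetime.fromisoformat(a['publishedAt'].replace('Z', '+00:00')),
        reverse=True,
    )
    # None means no limit; otherwise keep the first `limit` entries.
    return ordered if limit is None else ordered[:limit]

sample = [
    {'title': 'old', 'publishedAt': '2023-04-01T09:00:00Z'},
    {'title': 'new', 'publishedAt': '2023-04-20T09:00:00Z'},
]
print(sort_and_limit(sample, limit=1)[0]['title'])  # prints "new"
```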

&lt;h3&gt;
  
  
  Running the Tests
&lt;/h3&gt;

&lt;p&gt;With our code implemented, we can now run the test suite to ensure that everything is working correctly. To do this, we'll use a testing framework such as unittest to run our tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
    unittest.main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the test suite should produce output similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;----------------------------------------------------------------------
Ran 1 test in 0.058s

OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this article, we explored using Test Driven Development to create a program that collates news articles about a particular topic.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Ethics of Artificial Intelligence</title>
      <dc:creator>Elias Elikem Ifeanyi Dzobo</dc:creator>
      <pubDate>Sun, 23 Apr 2023 20:58:27 +0000</pubDate>
      <link>https://dev.to/eliasdzobo/the-ethics-of-artificial-intelligence-5dh</link>
      <guid>https://dev.to/eliasdzobo/the-ethics-of-artificial-intelligence-5dh</guid>
      <description>&lt;p&gt;Artificial intelligence (AI) is changing the world as we know it. From self-driving cars to automated medical diagnoses, AI has the potential to revolutionize nearly every aspect of our lives. However, as with any powerful technology, there are ethical implications to consider. In this article, we will discuss the ethics of artificial intelligence, including topics such as algorithmic bias and the impact of AI on job displacement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Algorithmic Bias
&lt;/h2&gt;

&lt;p&gt;One of the most pressing ethical concerns regarding AI is algorithmic bias. AI algorithms are designed to make decisions based on patterns found in large datasets. However, if these datasets are biased, the algorithms can produce biased results. For example, an AI algorithm used in hiring decisions may learn to favour male candidates over female candidates if the training data is biased towards men.&lt;/p&gt;

&lt;p&gt;In 2018, Amazon scrapped an AI recruiting tool after it was found to be biased against women. The algorithm was trained on resumes submitted to the company over a 10-year period, and it learned to associate certain words and phrases with male candidates. As a result, the algorithm penalized resumes that contained words like "women's" or "female."&lt;/p&gt;

&lt;p&gt;Another example of algorithmic bias is facial recognition technology. Studies have shown that facial recognition algorithms are less accurate when identifying people with darker skin tones. This can lead to harmful outcomes, such as wrongful arrests or police violence against people of colour.&lt;/p&gt;

&lt;h2&gt;
  
  
  Job Displacement
&lt;/h2&gt;

&lt;p&gt;Another ethical concern related to AI is the impact it may have on job displacement. As AI technology becomes more advanced, many jobs that were previously performed by humans may become automated. This could lead to significant job losses, particularly in industries such as manufacturing and transportation.&lt;/p&gt;

&lt;p&gt;For example, the rise of self-driving trucks could lead to the displacement of millions of truck drivers worldwide. Similarly, the growth of AI-powered chatbots and virtual assistants could lead to job losses for customer service representatives.&lt;/p&gt;

&lt;p&gt;A 2019 report from the Brookings Institution found that up to 36 million Americans could face job displacement due to AI and automation in the coming years. This could exacerbate income inequality and lead to economic instability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI has the potential to transform our world in countless ways, but it also raises important ethical questions that must be addressed. Algorithmic bias and job displacement are just two of the many ethical concerns surrounding AI. As AI technology continues to advance, it is important that we consider the potential impacts on society and take steps to mitigate any harmful effects. Only then can we ensure that AI is used for the benefit of all.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Analysis with Blockchain Data</title>
      <dc:creator>Elias Elikem Ifeanyi Dzobo</dc:creator>
      <pubDate>Sun, 23 Apr 2023 20:55:02 +0000</pubDate>
      <link>https://dev.to/eliasdzobo/data-analysis-with-blockchain-data-1mn4</link>
      <guid>https://dev.to/eliasdzobo/data-analysis-with-blockchain-data-1mn4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;On-chain analysis has been popping up everywhere, especially in the DeFi space. We’ve seen tons of threads on how to make money by following “smart money”, which basically means copying the trades of whales and venture capitalists, the folks with the big money.&lt;/p&gt;

&lt;p&gt;A huge aspect of the blockchain is its transparency: every trade anyone takes is visible to the general public, at least on public blockchains. That means if you can find the wallets of the whales who “magically”😵👀 know when to buy and sell, you can copy their trades and make money.&lt;/p&gt;

&lt;p&gt;The first step is to find a token that has had a massive increase in value in the last 7 days and CoinGecko provides us with that information. We can then find the date the token started pumping and get all the transactions before the date on the blockchain explorer. Find the token, get the token chain, get the token address and we’ve got work to do!&lt;/p&gt;

&lt;p&gt;We found RocketX Exchange, ticker symbol RVF, which went from $0.05 on April 4th to $0.15 currently, around a 200% increase in a week, with token address 0xdc8af07a7861bedd104b8093ae3e9376fc8596d2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Collection
&lt;/h2&gt;

&lt;p&gt;By searching the token address on the blockchain explorer, we can find every transaction of that token that has ever occurred. Etherscan makes it easy to filter the transactions down to just the buy and sell trades, and you can then download the resulting data as a CSV file. You can even narrow the date range before downloading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Preprocessing
&lt;/h2&gt;

&lt;p&gt;Load the data into a DataFrame using pandas and do some preprocessing. Convert the date-time column to a datetime object so we can select rows within a specific timeframe.&lt;/p&gt;

&lt;p&gt;Drop columns that add little value to the insights we're after; in this case, the Unix timestamp and the block number of the transaction.&lt;/p&gt;

&lt;p&gt;For RVF, we notice that April 5th was when the token started gaining momentum, so we want to find high-volume buy transactions before that date.&lt;/p&gt;

&lt;p&gt;We can create a new dataframe containing only the buy transactions using pandas: &lt;code&gt;data[data['action'] == 'Buy']&lt;/code&gt;, and then look only at transactions over a specific amount; in this case, I chose buy orders over 1 ETH.&lt;/p&gt;
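&lt;p&gt;Put together, the preprocessing and filtering steps look roughly like this (the column names are assumptions standing in for the Etherscan CSV headers):&lt;/p&gt;

```python
import pandas as pd

# Toy rows mirroring an Etherscan token-transfer export (assumed column names).
data = pd.DataFrame({
    'datetime': ['2023-03-28 10:00', '2023-04-02 14:30', '2023-04-06 09:15'],
    'action': ['Buy', 'Buy', 'Sell'],
    'amount_eth': [0.5, 2.3, 1.1],
})
data['datetime'] = pd.to_datetime(data['datetime'])

# Buy transactions only, placed before the April 5th pump.
buys = data[data['action'] == 'Buy']
early_buys = buys[buys['datetime'].lt('2023-04-05')]

# Keep only orders over 1 ETH.
large_early_buys = early_buys[early_buys['amount_eth'] > 1]
print(len(large_early_buys))  # prints 1
```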

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After filtering down to 36 valid buy transactions over 1 ETH, I discovered three particular wallet addresses, each with over $100,000 in their portfolio, that bought a large amount of RVF tokens in the weeks leading up to April 4th. One wallet, which had been dollar-cost averaging into RVF that week, sold a portion of its holding, netting a profit of over $4,500.&lt;/p&gt;

&lt;h2&gt;
  
  
  PS: Disclaimer
&lt;/h2&gt;

&lt;p&gt;This is not an endorsement of the token RVF, just a demonstration of applying data analytics to on-chain data and drawing conclusions from the resulting insights.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
