DEV Community: Moses-Morris

DATABASE REPLICATION.

Moses-Morris — Fri, 20 Mar 2026 10:38:02 +0000

Your database will crash. It's not a matter of if it will crash, it's when. So here's my question: is your system designed to survive it?

*One database. One crash. Total downtime.*

That's the risk you take without replication.
Here's how serious systems avoid it: a single Primary absorbs all writes while multiple Replicas serve reads in parallel. The result is faster reads, fault tolerance, and no single point of failure. When the primary goes down, a replica is promoted to take its place, ideally automatically, so users never feel a thing.
This is the backbone of every 24/7 application you've ever used. If backend or system designs are in your future, get comfortable with this concept now.

The tricky part is replication lag. Since syncing happens asynchronously, a replica might be milliseconds (or more) behind the primary database. If a user writes something and immediately reads it back from a replica, they might not see their own write yet.
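
One common way to soften this is to send a user's reads to the primary for a short window after they write. The sketch below simulates the idea with in-memory dictionaries. The class name `ReplicatedStore`, the `pin_seconds` parameter, and the manual `sync()` step are all illustrative assumptions, not a real database API.

```python
import time


class ReplicatedStore:
    """Toy primary/replica pair with manual (asynchronous) sync."""

    def __init__(self, pin_seconds=1.0):
        self.primary = {}
        self.replica = {}
        self.pin_seconds = pin_seconds
        self._last_write = {}  # user -> time of that user's last write

    def write(self, user, key, value):
        # All writes go to the primary only.
        self.primary[key] = value
        self._last_write[user] = time.monotonic()

    def sync(self):
        # Stand-in for asynchronous replication catching up.
        self.replica = dict(self.primary)

    def read(self, user, key):
        # Users who wrote recently read from the primary so they always
        # see their own writes; everyone else reads from the replica.
        last = self._last_write.get(user)
        recently_wrote = last is not None and time.monotonic() - last < self.pin_seconds
        source = self.primary if recently_wrote else self.replica
        return source.get(key)


store = ReplicatedStore(pin_seconds=5.0)
store.write("alice", "bio", "hello")
print(store.read("alice", "bio"))  # hello (alice is pinned to the primary)
print(store.read("bob", "bio"))    # None  (bob reads the lagging replica)
store.sync()
print(store.read("bob", "bio"))    # hello (the replica has caught up)
```

The trade-off is that recently active writers add read load to the primary, so the pin window should stay short.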

Synchronous replication guarantees zero data loss but slows down every write. Asynchronous replication is fast but risks losing data on failover. Your startup is scaling fast and can't afford either problem. How do you architect your way out of this? I would love to hear from you.

TYPES OF AUTHENTICATION

Moses-Morris — Wed, 11 Mar 2026 07:17:17 +0000

How do users prove their identity, earn trust, and get managed on various platforms and APIs?

Protecting your system or platform is everything. Imagine building a family house without a door, or with a back door that anyone who wants to break in can walk through. That puts you and your family at risk.
Controlling access and verifying that someone has permission to use your platform is important. That is where authentication (identity verification) comes in.

Authentication - the security process of verifying the identity of a user, device, or system to ensure they are who they claim to be before granting access to resources.

Let us look at the way to do this when designing and implementing the logic of your system or application.

Here are some types of authentication and architecture patterns you use to verify if someone has access to your system.

Basic Authentication - The client sends a username and password joined as username:password and encoded with Base64, which turns binary data into a safe, printable ASCII string for transmission. Base64 is an encoding, not encryption, so anyone who intercepts the header can decode it. It is mainly used for internal tools, testing, and simple APIs. It requires HTTPS so that the credentials are protected in transit.
Session-Based Authentication - After login, the server creates a session and sends its session ID to the browser in a secure session cookie. The session data is kept on the server: in memory, in Redis, or in a database. The browser sends the cookie with every request and the server validates it each time. It is mainly used for traditional applications, admin dashboards, and server-rendered applications. It is harder to scale because every server instance needs access to the same session store.
Token-based authentication - When a user logs in, the server returns a token to the client. On each request, the client must include it as a bearer token in the Authorization header; otherwise the server rejects the request, typically with a 401 Unauthorized response. Mostly used with REST APIs, mobile apps, and microservices architectures (the client can move between services without logging in to each one separately). The server does not store any client context or session data across requests.
JSON Web Token (JWT) - This is a self-contained token made of a header, a payload, and a signature created by the server/issuer. The payload carries user data (claims) in one string. It is encoded but not encrypted, so anyone who gets the token can read it; that makes it risky to put passwords or other secrets inside. It is also considered fast since the server does not need to query a session store on every request. Mainly used for APIs, microservices, and server-to-server connections. Tip: Setting expiration times limits how long a stolen token is useful while keeping authentication stateless.
OAuth - A third-party provider acts as the gateway to your application, like hired security personnel. When a user accesses your platform or app, they are redirected to the provider, authenticated there, and then sent back to your app with permission and an access token. Example: Login with Google, Login with Apple, Login with Facebook, Auth2.0, AuthHero, Firebase, Supabase. Some companies provide this as a service, and you just need to integrate it with your service, API, or system. Mainly used for external developers, quick login and authentication setups, and third-party servers.
API keys - These are static keys assigned to each client application. The client needs that key to access the application or server, in the same spirit as servers that authenticate each other with SSH keys. API keys are good for internal services, server-to-server communication, and applications with usage limits (with rate limiting). A key identifies the calling application, not a user, and carries no user data or credentials, so anyone who steals it can use the API as that client.
Multi-Factor Authentication (MFA) - This is an additional step on top of a login. It is very important because it blocks unauthorized login attempts even when a password has been phished or leaked. It uses SMS or email codes, authenticator apps, and hardware security keys. Social media platforms rely on it as a second factor to confirm that your account has not been hijacked. You can also let a user log in normally but require MFA before they perform a sensitive action or access a protected resource.
Biometric Authentication - It uses the client's physical traits, which are hard to change or copy and are stored as digital templates. It relies on Fingerprint ID, Face ID, Retina/Iris Scan ID, Voice ID, Palm ID, and sometimes even behavioral patterns. A client can be locked out if their physical traits change, although newer ML-based matching tolerates more variation in scans. It is mainly used for device protection, service access, financial app verification, and server access. It often serves as an additional authentication step.
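
Two points from the list above are easy to see in code: Basic Authentication is only Base64, and a JWT payload is readable without the secret even though its signature cannot be forged without it. This is a minimal sketch using only the standard library. The helper names and the demo secret are assumptions for illustration, and a real system should use a maintained JWT library and check the expiration claim.

```python
import base64
import hashlib
import hmac
import json


def basic_auth_header(username, password):
    # Basic auth is base64("username:password"): encoding, not encryption.
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def b64url(data):
    # JWTs use URL-safe Base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text):
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def sign_token(payload, secret):
    # Build header.payload and append an HMAC-SHA256 signature over it.
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def verify_token(token, secret):
    # Recompute the signature and compare in constant time.
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), signature)


def read_payload(token):
    # No secret needed: anyone holding the token can read the claims.
    return json.loads(b64url_decode(token.split(".")[1]))


SECRET = b"demo-secret"  # hypothetical key, for illustration only
token = sign_token({"sub": "user-42", "role": "admin"}, SECRET)

print(basic_auth_header("demo", "s3cret"))
print(read_payload(token))                   # claims are visible to anyone
print(verify_token(token, SECRET))           # True
print(verify_token(token, b"wrong-secret"))  # False
```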

Bonus

Single sign-on (SSO) - Used to access multiple services. It lets you authenticate once and access all of them. Technologies include OpenID Connect, SAML (Security Assertion Markup Language), Kerberos (used in enterprises), and mTLS (Mutual TLS).
Passwordless Authentication - Used for secure access with cryptographic key pairs. It is very secure because the private key never leaves the user's device and there is no password to steal. It follows a standard called FIDO2/WebAuthn.

Summary

Combining more than one authentication type or method makes a system more secure and scalable and gives clients a great experience. Building a secure system requires a well-designed authentication architecture, and that architecture can use more than one method.

🔗 Follow Me on Socials and Let us link Up:
GitHub: @mosesmorrisdev.
LinkedIn: Moses-Morris.
Twitter: @Moses_Morrisdev.
Facebook: Moses Dev.
portfolio : mosesmorrisdev

Rate Limiting vs Throttling.

Moses-Morris — Thu, 26 Feb 2026 07:55:09 +0000

Ever wondered why some apps block you instantly while others just slow down?

Both rate limiting and throttling control traffic, but they solve different problems.

Why are these two important for your API implementation?
Which is the best concept to use while designing your API?
What are the limitations?

Let Us Find Out.

Traffic control is not just about blocking requests. It is about protecting system health and user experience.
Rate limiting and throttling are often confused, but they serve different architectural goals.

Rate limiting defines a hard cap. Once a client exceeds the allowed number of requests within a time window, further requests are rejected. This is critical for preventing brute force attacks, scraping, and API abuse.
Throttling, on the other hand, slows down traffic instead of cutting it off completely. It allows systems to remain responsive during peak loads while discouraging aggressive request patterns.
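
The difference is easiest to see side by side. In this minimal sketch the rate limiter answers "yes or no", while the throttler answers "how long should you wait". The class names, the fixed-window strategy, and passing `now` in explicitly (to keep the example deterministic) are assumptions made for illustration; production systems often use token buckets or sliding windows instead.

```python
class FixedWindowRateLimiter:
    """Hard cap: at most `limit` requests per `window` seconds per client."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._counts = {}  # client -> (window_start, count)

    def allow(self, client, now):
        start, count = self._counts.get(client, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # a new window begins
        if count >= self.limit:
            self._counts[client] = (start, count)
            return False  # rejected, e.g. with HTTP 429
        self._counts[client] = (start, count + 1)
        return True


class Throttler:
    """Slow down: space requests at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._next_free = {}  # client -> earliest time the next request may run

    def delay_for(self, client, now):
        next_free = self._next_free.get(client, now)
        start_at = max(now, next_free)
        self._next_free[client] = start_at + self.min_interval
        return start_at - now  # seconds the caller should wait


limiter = FixedWindowRateLimiter(limit=3, window=60)
print([limiter.allow("client-a", now=t) for t in (0, 1, 2, 3)])
# [True, True, True, False]

throttler = Throttler(min_interval=0.5)
print([throttler.delay_for("client-a", now=0.0) for _ in range(3)])
# [0.0, 0.5, 1.0]
```

The fourth request is rejected outright by the limiter, while the throttler accepts every request and simply pushes each one further back.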

In practice, high performing backend systems use both:
• Throttling to handle bursts and peak traffic.
• Rate limiting to block malicious behavior.

In system design, a strong implementation is not just about building both, but about understanding when and why to combine them.

Rate limiting enforces fairness.
Throttling ensures stability.


API DOCUMENTATION BEST PRACTICES

Moses-Morris — Wed, 28 Jan 2026 15:36:33 +0000

API Documentation Is Not Optional. It Is the Product.

Building an API is one thing; writing documentation is another (the real thing). They go hand in hand.

A tool that its users cannot benefit from is like a gadget shipped without a technical manual explaining how to use it or how it works.

You can have the best-engineered car, but if the driver cannot drive it, it is just a waste.
In this AI era, products with different use cases are released daily. Well-structured, well-designed API documentation is the best thing you can add to your product, service, or API.

What is an API?

An API (Application Programming Interface) is an interface through which other software uses a product. The product can be a SaaS, software, or a web product. It defines how software components interact.

Why is documentation needed as part of API design?

Documentation helps you provide services to end users without having to intervene. It helps your users find their way into using your products or services. It is more of a self-help service.
This documentation can be for end users or for developers themselves.

Best API Documentation practices.

Use clear words and not jargon - Well-documented APIs use clear and concise language. This lets users and developers focus on the task at hand and not on vocabulary and grammar. Use simple, non-technical words, since not everyone who visits the documentation is technical. Also, keep a professional tone.
Note: If technical terms are needed, clarify and explain them well.
Follow a clear, consistent guide - Poorly structured documentation is like a book without a preface or table of contents. Putting a clear guide in place saves users and developers time, because they can go straight to the right place for their needs. It also gives the documentation a good flow for learning the API.
Use media to communicate - Visual communication helps users see what they are doing. Use images, flowcharts, graphs, and videos to explain further what the API does. It also saves users a lot of time.
Good UI - Before documenting an API, settle on a good design with good contrast. Let colors communicate: for example, green is associated with success while red is associated with errors. This shows the user what is important and what needs to be read or done first while using the API. It also allows “Please Note” callouts that make the documentation more useful. A good UI also has accessibility options, such as for deaf or visually impaired users.
Use code samples - When documenting an API, show your users what you are describing whenever words alone are not clear enough. This aligns what they expect with what the API does. Give practical examples with requests and responses for the paths that developers and users will call. This also calls for test cases that use the dummy data provided in the documentation.
Document potential errors and issues as samples too.
Create interactive elements in your docs - Allow users to interact with your documentation. Let them use buttons and links to open what they need and close what they don’t. This gives them control and makes the documentation lively.
Unbloated content - Use straight-to-the-point language. This helps readers review how the API is used, build momentum, and find a summary that fits their basic needs. Also, review the API documentation regularly and keep it up to date. Technology keeps changing.
Security and quality - Good API documentation is a great opportunity to make your API succeed, but no matter how good it is, it needs to be secure. Maintain the documentation well to prevent security exposure and preserve the quality of the API, which can only be ensured through correct usage. Where appropriate, require authentication for users who use or edit the documentation.
Search functionality - This seems like a lightweight practice, but it is one of the most important. It helps users find things much faster without having to scroll endlessly.

A well-documented API has security, consistency, and quality. All these 3 can only be achieved by having good documentation for your API.

These practices prevent:

API exploits while referencing the API.
Standard mistakes through reviewing and having a consistent guideline.
Frustrated end users who do not like working with unclear documentation.

Here are some companies to look at: OpenAI, Twilio, Stripe, Google.

Conclusion

Remember to design first how your documentation will look. It gives you an open strategy for working out what your readers need the most. This helps especially if you have multiple services, because it tells you what to document first.

Unlocking IIFE in Python - Write It, Run It, Forget It.

Moses-Morris — Wed, 01 Oct 2025 11:31:19 +0000

From JavaScript to Python: Why IIFE Still Matters.

What is IIFE?

An Immediately Invoked Function Expression (IIFE) is a function that is created and executed immediately after its declaration. It’s essentially a function you invoke right away without calling it later.

When it comes to double nesting, Python may start complaining about unexpected indentations or create warnings.

Why IIFE?

It is a create-once, use-once function. The result can be reused in the whole script without changing (it can be treated as a variable in your code since the result value remains constant even when accessed later).

“I first saw the IIFE concept in JavaScript. I thought it was outdated until I found it in my Python code. That’s when I knew I needed to study it. Invoked curiosity.”

Why is this concept an effective choice and important?

When you want to avoid repeatedly calling a function later, you can implement this concept as a solution. Nesting multiple functions or classes can complicate your code. This might cause chaos and create bottlenecks in the future.

With IIFE, you are able to create a function that you only need once. It also improves clarity, especially in asynchronous (async/await) code.

Is it found in other languages?

Yes, it is. If Python is not your first language, don’t worry: the concept is also widely used in JavaScript (to prevent polluting the global scope) and PHP.
You can also check your preferred language to see how it supports this concept.

Top Python IIFE Approaches.

In Python, you can use different approaches to implement IIFE:

1. Use of a decorator.
A decorator receives the function at definition time, so it can call it right away.

- @lambda _:_()
- @invoke(1, 2, 3)

Tip: You can also create your own custom decorator.
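
Here is a minimal sketch of such a custom decorator. The `invoke` helper below is a hypothetical name written for this example and is not claimed to match the API of any published library. The lambda form relies on Python 3.9 or newer, where any expression is allowed as a decorator.

```python
def invoke(*args, **kwargs):
    """Decorator factory: call the decorated function once and keep its result."""
    def decorator(func):
        return func(*args, **kwargs)
    return decorator


@invoke(1, 2, 3)
def total(a, b, c):
    return a + b + c


# Lambda form (Python 3.9+): the function is called with no arguments.
@lambda f: f()
def settings():
    return {"debug": False, "retries": 3}


# Both names now hold results, not functions.
print(total)     # 6
print(settings)  # {'debug': False, 'retries': 3}
```

After decoration, `total` and `settings` are plain values, which is exactly the "use the result as a variable" behaviour described earlier.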

2. Use of a lambda function.

(lambda x: x * 2)(5)
# Output: 10

3. Define and immediately call a function.

#example 1
(lambda: print("IIFE in action"))()

#example 2
def add(x,y):
    print(x+y)

add(3, 4)

#example 3
result = (lambda x,y: x+y)(2,5)
print(result)  # result already holds the value 7, so we do not call result(2, 5)



#example 4
(lambda x,y: print (x+y))(8,5)

#example 5
numbers = [4, 2, 6, 8]
squared_numbers = list(map(lambda x: x * x, numbers))
print(squared_numbers)

Very effective when logging real-time data.

4. Use a library.
pip install invoke-iife
Note: This library runs on Python 3.0 and higher.

When is this concept most effective?

This concept is most effective when you want to execute a function just once and not reuse the code again. Sometimes we need to trick the code into thinking it is a function, but it is merely an empty definition with executions.

For example:

Server scripts → IIFE helps in logging responses and actions.
Security logging → Logs without compromising sensitive data.
High-performance sectors (banking, healthcare, critical business apps) → Helps avoid code noise, reduces vulnerabilities, and ensures reliability.

They help you log and maintain a history of crashes or performance.

Why Python IIFE?

Reduce naming conflicts → Maintains variable scope.
Clarity → Separates main functions from one-off helpers.
Encapsulation → Keeps logic scoped to a single function.
Security → Protects global variables and supports anonymous variables.
Break redundancy → Use once, get the result, and move on.

⚠️ Note: Code can become complex if the concept is not well-defined.

How to implement the concept?

Define the values.
Decide if you will use a lambda, a library, a decorator, or a direct function call.
Place the function and invoke it immediately.
Use the result as a variable.

Advanced use of the concept.

We can define a class or function and immediately run it to test if it is functional.

It’s not meant to work as a controller, middleware, or utility. Instead, it is more of a “write, test, and see for yourself” technique.
Test this code.

# What the code does, step by step (pseudocode):
# 1. Check the files in your directory.
# 2. Make a list.
# 3. Count the number of lines.
# 4. Give a summary of what you found after checking the current directory.

import os
from datetime import datetime

class fileChecker:
    def __init__(self, path="."):
        self.path = path
        self.filedata = {}
        self.run()

    def listFiles(self):
        return [f for f in os.listdir(self.path) if f.endswith(".py") and os.path.isfile(os.path.join(self.path, f))]

    def count_lines(self, filename):
        full_path = os.path.join(self.path, filename)
        with open(full_path, 'r', encoding='utf-8') as file:
            return len(file.readlines())

    def summarize(self):
        print("\n🧾 File Summary:")
        for file, lines in self.filedata.items():
            print(f"- {file}: {lines} lines")
        print(f"\n✅ Done at {datetime.now()}\n")

    def run(self):
        print(f"📁 Inspecting: {os.path.abspath(self.path)}")
        files = self.listFiles()
        for file in files:
            self.filedata[file] = self.count_lines(file)
        self.summarize()

## IIFE-style: define & immediately run
fileChecker()

Conclusion

IIFE is a powerful but sometimes overlooked concept in Python. It helps improve scope handling, reduce redundancy, and maintain clean, secure code. Used wisely, it can simplify your work and prevent future issues in complex systems.


Oops... I Locked Myself Out with UFW - Here's How I Fixed It

Moses-Morris — Wed, 03 Sep 2025 10:43:28 +0000

Does “F” in ufw stand for Fired?
Let us find out:

WHAT IS UFW?

UFW is a firewall utility used to set up rules and configurations for a server firewall. It uses iptables under the hood. People primarily use it on Linux distros.
UFW stands for Uncomplicated Firewall.
It is a common front end for configuring the firewall on a server. The server could be a web server, a network server, etc.

The server network can be a home network, corporate, e-commerce or business, service provision network, or any type of dedicated server. This helps you configure certain services to specific ports, regulating access and also controlling how users/clients interact with your server resources.

You create the rules, and others follow them.

Why a UFW firewall?

This firewall helps you with your security.
Anyone can easily set up and manage the UFW firewall because it is simple. It supports both IPv4 and IPv6 (helping with access control and traffic control).
Prevents intruders and limits breaches on your server.
Helps in logging allowed and blocked operations.

How to set up the UFW firewall.

Some Linux distributions, like Ubuntu, come with UFW pre-installed. On others, such as CentOS, you may need to install it yourself.
Install UFW if it is missing using these commands:

sudo apt install ufw

sudo yum install ufw   # for CentOS

You can also preview the basic UFW settings:
vi /etc/default/ufw or cat /etc/default/ufw

Get access to more details about the UFW utility.

man ufw

Locked out ???

Check if you have any setup rules before running the firewall to prevent being locked out.
Check the status rules.

sudo ufw status verbose

You can also show the reports and the active listening ports:

sudo ufw show raw
sudo ufw show listening

I recently got locked out because I overlooked that my network provider gives me a dynamic IP. I had set up a rule that only allowed access from my current IP, which I checked at whatismyip.com.
After a change of network and a release of the IP configuration on my local machine, I tried to connect via SSH and, well, I was in total disbelief. The system didn’t let my newly assigned IP in. I was now an intruder.
Here is how I had set up my SSH connection rule:

sudo ufw limit from 192.168.1.1 to any port 22

LIMIT - It is used to protect against brute-force attacks by rate-limiting repeated connections. By default, UFW denies further connections from an IP address that has attempted 6 or more connections within the last 30 seconds.
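
The lockout described above comes down to a source-address match: a rule pinned to one IP stops matching as soon as the provider hands out a new one. The sketch below reproduces that check with Python's `ipaddress` module. The addresses come from a documentation range, and the idea that your provider assigns addresses from a single /24 is an assumption you would need to confirm for your own network.

```python
import ipaddress


def is_allowed(client_ip, allowed_sources):
    """Return True if client_ip falls inside any allowed source range."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(source) for source in allowed_sources)


# A rule pinned to one address stops matching after the IP changes.
single_address_rule = ["203.0.113.10/32"]
print(is_allowed("203.0.113.10", single_address_rule))  # True
print(is_allowed("203.0.113.57", single_address_rule))  # False

# Allowing the provider's range (assumed here to be a /24) survives the change.
range_rule = ["203.0.113.0/24"]
print(is_allowed("203.0.113.57", range_rule))  # True
```

A wider range is less strict, so it is worth pairing it with key-based SSH login and the LIMIT rule.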

UFW Default Policies

Control incoming and outgoing access requests to the server.

sudo ufw default deny incoming
sudo ufw default allow outgoing

List the application profiles UFW knows about and allow the ones you need, so that services are reachable according to your rules and configurations.

sudo ufw app list
sudo ufw allow 'OpenSSH'   # allows incoming connections to the OpenSSH service, by name

These applications or services are stored in “/etc/ufw/applications.d”

Here are some advanced UFW firewall rules and tips.

It is advisable to set up from scratch by resetting the rules. This disables the firewall and deletes all existing rules. When done, set up your rules manually.

sudo ufw reset

If you do not want the firewall to start on boot, disable it before leaving your machine. You can also stop the firewall service:

sudo ufw disable
sudo systemctl stop ufw   # stops the ufw service

Set the rules first, then activate.

The most common example rules and commands used for UFW firewall.

The system stores the default configurations at /etc/default/ufw.
Here are sample rules:

1. Allow services like HTTP, SSH, FTP, and HTTPS.

sudo ufw allow ssh

You can also use the port number for this configuration, like 22 for SSH.

sudo ufw allow 22

2. Allow IP access or Block access.

sudo ufw allow from 203.0.113.1 to any port 443

You can block a certain user's access to certain services.

sudo ufw deny from 203.0.113.1 to any port 22

Allow using subnets.

sudo ufw allow from 203.0.113.0/24

Allow an IP using a certain protocol

sudo ufw allow from 192.168.0.4 to any port 22 proto tcp

3. Delete rules.
First, list the rules.

sudo ufw status numbered

Then delete the rule by its number:

sudo ufw delete 8   # 8 is the number of the rule

You can also delete using this approach.

sudo ufw delete allow 22

4. Permit Logging.
Enable logging:

sudo ufw logging on

Disable Logging:

sudo ufw logging off

5. Apply the rules.
When done setting up, start the ufw service.

sudo systemctl start ufw

Then enable the firewall so the rules take effect.

sudo ufw enable

Alternatively, if the firewall is already running, you can reload the rules without restarting it.

sudo ufw reload

How to fix being locked out of UFW firewall.

This involves resetting the firewall. How do you reset a firewall if you have no access to it?
Let us look at this first from the server's point of view.
A server is usually reached over SSH, which listens on port 22 by default. When the firewall no longer permits your IP on that port, you can no longer open a session to the server.
Here is how to regain access depending on the state of your server and services.

1. Console access. - Cloud service providers have a serial console, which is mostly web-based. It helps you log in to your server without SSH. They even offer server management tools. Look for direct access to the server.
When logged in, update your SSH rule.

sudo ufw allow ssh

You can also list the rules and then delete the rule number.

2. Rescue mode. - Some cloud service providers and VPS service providers have a rescue mode that helps you boot into your OS.
In technical terms, there is a “safe mode” in some OS platforms that helps you gain root access to the minimal setup of your OS.
The service providers offer you safe credentials for you to SSH into your secure rescue OS.
You then mount your real disk and access your server system as root.

mkdir /mnt/server
mount /dev/sda1 /mnt/server 
chroot /mnt/server

When this fails or breaks, contact your service provider with your server details. Different companies have different ways of letting you access the server as root.

3. Reboot. - This applies to local servers where one can gain physical access.
Reboot, then use an attached keyboard to log in. When logged in, access the terminal and change the rules.

When all the processes above are done, don’t forget to reload.
Reload UFW :

sudo ufw reload

NB: Before setting up and proceeding to activate a firewall, review your settings.

sudo ufw show added

UFW TIP

If you like using graphical user interfaces, you can use the GUFW (Graphical Uncomplicated Firewall). Especially if you are a beginner.

sudo apt install gufw

Conclusion

UFW is an uncomplicated firewall interface for managing iptables rules. Serving as a gatekeeper, it is a crucial security feature in Linux. It works alongside the services provided by our servers. It serves as an alternative to firewalld (a dynamic firewall management tool), nftables (supports IPv4 and IPv6 filtering), and working with iptables directly (mostly for IPv4 network traffic).


Mastering SQL for Data Engineering: Advanced Queries, Optimization, and Data Modeling Best Practices

Moses-Morris — Wed, 16 Apr 2025 09:50:03 +0000

Advanced SQL for Data Engineering

Querying a database should be at your fingertips. It helps you perform ETL and EDA processes as a data engineer, analyst, or scientist. Data comes in various shapes, structures, and features, but it has to be transformed before it becomes meaningful and useful. Various concepts come in handy during the data modelling phase. A data engineer can perform CRUD operations and tailor the data to the needs of a certain domain. These concepts include joins, SQL functions, formulas, and many more.

We can not ignore how we query because...

All these remain a starting point for best practices when performing SQL queries. However, operations that are not optimized cause lag (slow/poor performance), complexity, failures, and inaccuracy, and carelessly built queries are vulnerable to injection attacks. Data engineers support data analysts and automate ETL workflows for them.

Why do we need to master SQL for Data Engineering?

How we query the database is very important. It shapes visualization and information delivery. We often rely on data-building tools to perform DML, DDL, TCL, and DCL operations. This puts at risk our understanding of what is happening and what data we need for the data engineering process. Sometimes we even rely on ORM (Object-Relational Mapping) tools and dbt (data build tool) to query.

These are the advantages of being an SQL master.

1. Avoiding redundancy. - Helps avoid extracting or exploring similar data with repetitive tasks. We can query solid data and get results that are not redundant. We often rely on indexes and keys (PRIMARY and FOREIGN KEYS) that give each row a unique identity.

Indexing also improves lookup speed. Reusable expressions can be written once, which prevents repeating SQL that would complicate your code.
Normalization (normal forms) also minimizes redundancy. You can achieve this by moving a column that appears in many tables into its own table and referencing it by key.


-- Create Index (on the column used to look up a customer's orders)
CREATE INDEX idx_orders_customerID
ON orders(customerID);
-- Create Primary Key
CREATE TABLE customer (
    customerID INT PRIMARY KEY,
    name VARCHAR(100)
);

-- Normalization and Foreign Keys
CREATE TABLE cities (
    cityID INT PRIMARY KEY,
    city_name VARCHAR(100)
);

CREATE TABLE customers (
    customerID INT PRIMARY KEY,
    name VARCHAR(100),
    cityID INT,
    FOREIGN KEY (cityID) REFERENCES cities(cityID)
);

2. Security Enhancement. - This helps prevent tampering with the results or actual data, reducing compromised responses. VIEWS come in handy here, because they expose only the columns and rows a consumer needs. Views and sub-queries abstract away data that is private to the pipeline; injection itself is prevented by parameterized queries rather than by views.
We can also create temporary result sets using the WITH clause. These common table expressions (CTEs) keep SQL modular, which is most useful in ETL pipelines.

Some cloud platforms double-check nested queries and store logs and the history of queries. They even encourage the data engineer to partition data.

-- Create Views for Abstraction
CREATE VIEW active_users AS
SELECT customerID, name
FROM users
WHERE status = 'Active';
-- You can CREATE, modify, and DROP views

-- Use of WITH
WITH Totals AS (
    SELECT customerID, SUM(amount) AS total_spent
    FROM sales
    GROUP BY customerID
)
SELECT c.name, s.total_spent
FROM customers c
JOIN Totals s ON c.customerID = s.customerID
WHERE s.total_spent > 10000;
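
To see a view and a parameterized query working together, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table contents, the extra `email` column, and the `find_active_user` helper are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (customerID INTEGER PRIMARY KEY, name TEXT, status TEXT, email TEXT);
    INSERT INTO users VALUES (1, 'Amina', 'Active', 'amina@example.com'),
                             (2, 'Brian', 'Inactive', 'brian@example.com');
    -- The view exposes only non-sensitive columns of active users.
    CREATE VIEW active_users AS
        SELECT customerID, name FROM users WHERE status = 'Active';
""")


def find_active_user(name):
    # The ? placeholder sends the value separately from the SQL text,
    # so the input is never parsed as SQL.
    cur = conn.execute("SELECT customerID, name FROM active_users WHERE name = ?", (name,))
    return cur.fetchall()


print(find_active_user("Amina"))         # [(1, 'Amina')]
print(find_active_user("x' OR '1'='1"))  # []
```

The injection attempt is treated as an ordinary (non-matching) name, and the `email` column is never reachable through the view.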

3. Optimization of Queries. - We can optimize the SQL queries we are making, thus reducing lag. We can also make the queries efficient and accurate with the needed results. We also use the EXPLAIN statement to understand and improve the query’s execution.

Being specific in a query is also very important. Do not use "SELECT * "; select only the columns you need, such as "SELECT name, age". You can also narrow the rows using the WHERE or HAVING clause in your query.
Please note that you can often avoid sub-queries by using window (ranking) functions when data modeling: RANK(), DENSE_RANK(), and ROW_NUMBER(). These can also help in some pivoting and un-pivoting operations.
You can also use LEAD and LAG to read values from the following or preceding row without a self-join.

-- Use of EXPLAIN
EXPLAIN
SELECT customerID, name, status
FROM customers
WHERE city = 'UK';

-- Use of HAVING
SELECT age, date,
       COUNT(*) AS customerAges
FROM customers
GROUP BY age, date
HAVING COUNT(*) > 3;


-- Use of RANK()
WITH ranking AS (  
SELECT     
DENSE_RANK() OVER(ORDER BY amount DESC,date) AS rank,     
orderID,     amount   
FROM orders 
) 
SELECT * FROM ranking WHERE rank < 4;
//We want top 3 orders 
//We would like them to be ranked according to the amount -the higher the amount, the bigger the order. Then, check the date. - The oldest order is ranked higher than the recent order.


-- Use ROW_NUMBER() to rank each customer's orders by amount
SELECT c.customerID, c.name, o.orderID, o.amount,
       ROW_NUMBER() OVER (PARTITION BY c.customerID ORDER BY o.amount DESC) AS row_num
FROM customers c
JOIN orders o ON o.customerID = c.customerID;
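LEAD and LAG were mentioned above without an example. The sketch below shows them on a small, made-up orders table in SQLite (window functions need SQLite 3.25 or newer, which is an assumption about your Python build). Each row can see the previous and the next order amount without joining the table to itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (orderID INTEGER PRIMARY KEY, date TEXT, amount REAL);
INSERT INTO orders VALUES
    (1, '2025-01-05', 100),
    (2, '2025-01-09', 250),
    (3, '2025-01-12', 180);
""")

# LAG looks one row back, LEAD one row ahead, in date order.
rows = conn.execute("""
SELECT orderID, amount,
       LAG(amount)  OVER (ORDER BY date) AS previous_amount,
       LEAD(amount) OVER (ORDER BY date) AS next_amount
FROM orders
ORDER BY date
""").fetchall()

for row in rows:
    print(row)
```

The first order has no previous row and the last has no next row, so those positions come back as NULL (None in Python).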

Indexing the columns you filter and join on also removes many performance bottlenecks.
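A quick way to check whether an index is actually used is to compare the query plan before and after creating it. This sketch uses SQLite's EXPLAIN QUERY PLAN from Python; the index name idx_customers_city and the city value are hypothetical, and other databases print their plans in a different format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customerID INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT customerID, name FROM customers WHERE city = 'Nairobi'"

plan_before = plan(query)   # full table scan
conn.execute("CREATE INDEX idx_customers_city ON customers(city)")
plan_after = plan(query)    # search using the new index

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists the plan describes a scan of the whole table; afterwards it names the index, which tells you the filter on city no longer reads every row.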

4. Data Accuracy. - We have precise data, but we also need meaningful data. Basic SQL queries sometimes return more rows than the application can use, or the same row several times. This is a high risk when a crucial app that must be fast and reliable with real-time information makes many queries. LIMIT and DISTINCT help here: LIMIT caps how many rows come back, and DISTINCT removes duplicate rows, which together reduce latency within large-scale apps.

When working with APIs or unending datasets, you look at this as pagination or memory saving. The smaller the limit on each query, the less data there is to transfer and hold in memory at once.

-- Use of LIMIT
SELECT name FROM customers_large_table
ORDER BY date DESC
LIMIT 50 OFFSET 100;
-- This also boosts performance

-- Use of DISTINCT
SELECT DISTINCT * FROM customers_large_table
ORDER BY date DESC
LIMIT 50 OFFSET 100;
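The LIMIT and OFFSET pattern above is what pagination usually looks like from application code. Here is a small sketch with a hypothetical fetch_page helper and ten made-up rows; the table name follows the example above, everything else is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_large_table (name TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO customers_large_table VALUES (?, ?)",
    [(f"customer{i:03d}", f"2025-01-{i:02d}") for i in range(1, 11)],
)

def fetch_page(page, page_size):
    """Return one page of names, newest first. Pages are numbered from 1."""
    offset = (page - 1) * page_size
    rows = conn.execute(
        "SELECT name FROM customers_large_table "
        "ORDER BY date DESC LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [name for (name,) in rows]

first_page = fetch_page(1, 3)
second_page = fetch_page(2, 3)
print(first_page, second_page)
```

Passing the limit and offset as parameters keeps the query text constant, so only the page requested is ever loaded into memory.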

Some applications also use the WHERE clause with keywords like "IN", "NOT IN", and "BETWEEN" to return only specific data.

5. Integrity of the Data. - This is about keeping the data behind our SQL queries consistent and trustworthy. Logging and tracking all SQL operations helps preserve data integrity. We usually normalize data before querying it, which also removes unnecessary data from the query results.
Clauses like ORDER BY and GROUP BY keep the presentation of the data consistent.

-- Use of ORDER BY
SELECT orderID, customerID, amount, date
FROM orders
ORDER BY date ASC;

-- Use of GROUP BY
SELECT customerID, SUM(amount) AS total_amount
FROM orders
GROUP BY customerID;

Another way of ensuring integrity in SQL is the use of the ADD CONSTRAINT clause.

-- Use of ADD CONSTRAINT

ALTER TABLE orders
ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customerID) REFERENCES customers(customerID);
-- Every order in our database must reference an existing customer through this foreign key.
-- If the customer ID does not exist in customers, the order cannot be created.
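To watch a foreign key reject bad data, the sketch below recreates the two tables in SQLite and tries to insert an order for a customer that does not exist. SQLite declares the constraint inside CREATE TABLE rather than with ALTER TABLE, and it only enforces foreign keys after PRAGMA foreign_keys = ON, so treat this as an illustration of the behavior rather than of the exact syntax above. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.executescript("""
CREATE TABLE customers (customerID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    orderID INTEGER PRIMARY KEY,
    customerID INTEGER NOT NULL,
    amount REAL,
    FOREIGN KEY (customerID) REFERENCES customers(customerID)
);
INSERT INTO customers VALUES (1, 'Amina');
""")

# An order for an existing customer is accepted.
conn.execute("INSERT INTO orders VALUES (10, 1, 99.0)")

# An order for customer 42, who does not exist, is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (11, 42, 10.0)")
    rejected = False
except sqlite3.IntegrityError as error:
    rejected = True
    print("rejected:", error)

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(order_count)
```

The database itself refuses the orphaned order, so the rule holds no matter which application or script writes to the table.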

When working with SQL, you can GRANT and REVOKE permissions that control which operations a data engineer may perform on the data. This authorizes operations on the database.

6. Performance of Your Queries. - Performance increases when a query returns only the data needed, especially when working with large sets of data. A data engineer who has mastered this knows the methods, set operations, and functions (aggregate functions and window functions) used for limiting, de-duplicating, and partitioning SQL query responses.

Built-in SQL functions also help avoid slow performance.
You can also look at the WITH clause examples in Optimization of Queries.
We can use numeric functions like SUM, AVG, and COUNT rather than writing the equivalent expressions and operations by hand in the SQL query.
String, date, and time functions work on queried data: functions like CONCAT, LENGTH, SUBSTRING, REPLACE, UPPER, LOWER, DATE, TIMESTAMP, DATEADD, DATEPART, and TIME.
We have looked at some of these clauses in the queries above.

-- Use of SUM
SELECT SUM(amount) AS total
FROM orders;

-- Use of AVG
SELECT AVG(amount)
FROM orders
WHERE customerID=1;
-- Get the average order amount for a given customer


-- Use of COUNT
SELECT COUNT(*)
FROM orders
WHERE customerID=1;
-- Get the number of orders made by a customer

Part of performance tuning best practice is avoiding unnecessary subqueries. Also, filtering with WHERE is usually faster than filtering with HAVING, because WHERE discards rows before they are grouped.
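The difference between WHERE and HAVING is about when the filter runs. In this sketch, on made-up data in SQLite, both queries return the same answer, but the first one drops unwanted rows before grouping while the second groups every customer and only then throws groups away.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (orderID INTEGER PRIMARY KEY, customerID INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 1, 50), (2, 1, 70), (3, 2, 20), (4, 3, 90);
""")

# WHERE: rows for other customers are discarded before GROUP BY runs.
filtered_early = conn.execute("""
SELECT customerID, SUM(amount) FROM orders
WHERE customerID = 1
GROUP BY customerID
""").fetchall()

# HAVING: every customer is grouped and summed, then groups are filtered.
filtered_late = conn.execute("""
SELECT customerID, SUM(amount) FROM orders
GROUP BY customerID
HAVING customerID = 1
""").fetchall()

print(filtered_early, filtered_late)
```

HAVING is still the right tool when the condition depends on an aggregate, such as COUNT(*) > 3, because that value does not exist until after grouping.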

7. Relationships within your Queries. - When you are mastering SQL queries, it becomes easy to define the entities and reference points for your data project, a practice often referred to as Entity Relationship Modelling. The relationships you create while data modeling help you write well-informed SQL queries that improve not only performance but also security.

The use of JOINS helps structure relevant SQL query responses. They allow data to move in a flowchart kind of manner.

-- Use of JOINS
SELECT *
FROM customers c
JOIN orders o ON c.customerID = o.customerID;
-- Select the details of customers that have orders

Remember that there are several types of JOINS, namely LEFT JOIN, RIGHT JOIN, INNER JOIN, etc.
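The join types differ in which unmatched rows they keep. This sketch, on two made-up customers where only one has an order, compares INNER JOIN with LEFT JOIN in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customerID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (orderID INTEGER PRIMARY KEY, customerID INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Amina'), (2, 'Brian');
INSERT INTO orders VALUES (10, 1, 99.0);
""")

# INNER JOIN keeps only customers that have at least one order.
inner_rows = conn.execute("""
SELECT c.name, o.orderID
FROM customers c
INNER JOIN orders o ON c.customerID = o.customerID
ORDER BY c.customerID
""").fetchall()

# LEFT JOIN keeps every customer; missing orders show up as NULL.
left_rows = conn.execute("""
SELECT c.name, o.orderID
FROM customers c
LEFT JOIN orders o ON c.customerID = o.customerID
ORDER BY c.customerID
""").fetchall()

print(inner_rows)
print(left_rows)
```

Choosing LEFT JOIN is the usual way to find customers with no orders at all: filter the result on o.orderID IS NULL.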

8. Scalability in SQL queries implementation. - Working with advanced SQL querying techniques saves you a lot of time. Good SQL queries create an excellent platform for growth. Large-scale apps need a good mastery of SQL.

Scalable solutions help a data engineer to focus on the most important things. They leave room for expandable SQL queries: you do not have to write new code when you build another unit in the same system.
We design reusable SQL pipelines by modularizing logic. For efficient processing, many engines use parallelism under the hood. Additionally, we can use PARTITION BY in window functions to logically group data and compute metrics within each group. Some databases can also partition the stored table itself by a range of values.

-- Partition orders by order year (MySQL syntax)
CREATE TABLE orders (
    orderID INT,
    date DATE,
    amount DECIMAL(8, 2)
)
PARTITION BY RANGE (YEAR(date)) (
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION p2025 VALUES LESS THAN (2026)
);
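Table partitioning, shown above, is a storage feature. PARTITION BY inside a window function is a different thing: it groups rows logically at query time. The sketch below shows the window version on made-up data in SQLite (version 3.25 or newer is assumed), attaching each customer's total to every one of their orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (orderID INTEGER PRIMARY KEY, customerID INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 1, 50), (2, 1, 70), (3, 2, 20);
""")

# The window is restarted for each customerID, so the sum is per customer,
# yet every individual order row is still returned.
rows = conn.execute("""
SELECT orderID, customerID, amount,
       SUM(amount) OVER (PARTITION BY customerID) AS customer_total
FROM orders
ORDER BY orderID
""").fetchall()

for row in rows:
    print(row)
```

Unlike GROUP BY, the window keeps the detail rows, which is useful when a report needs both the single order and the group metric side by side.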

Bonus:
Sometimes, when working with enormous sets of data, a data engineer needs to focus on writing reusable code. Why is this? This improves the flow of data. The engineer understands this by brainstorming which data should frequently be used within the data pipeline. This also helps maintain a good SQL query flow.
A good schema is created to prevent bottlenecks in the data pipelines created alongside various processes (EDA - Exploratory Data Analysis, ETL - Extract, Transform and Load). This schema can also have logging processes to help troubleshoot SQL failures within the application.

What is a data model? And why do we use our SQL mastery to achieve the best results when data modeling?

A data model is a visual representation of how data interacts with the system. Data modelling is the process of creating that representation. It involves defining and organizing data in a way that supports the system's functionality.
It plays a significant role for a data engineer since it ensures accuracy, consistency, and efficiency.
A model is developed to show relationships and how data moves across the system. UML diagrams and ER diagrams provide a visual representation of this.
Here is an example.

A UML diagram to show how Data interacts within a data model.

A data model helps you write the schema for your SQL database. It shows which data interacts with which, and how data flows through the system without compromising integrity.
Data governance is maintained at the highest level. The model also gives control, regulation, authorization, and access methods that preserve data in the right state.
Data is compromised if its source is altered, and it can then no longer be trusted. Compromised data provides unreliable insights.
Tools like Microsoft Power BI, Tableau, Qlik Sense, and Looker are used for visualization, reports, and interactive dashboards.

Valuable concepts and practices for data engineers that help in advanced querying, optimizing and data modeling in SQL

Here are some concepts that help you build, implement, maintain, and deploy pipelines more skillfully:
These are advanced SQL practices with examples.

1. Creating user-defined functions (UDF) - these are functions added in complex pipelines. They are created by the user and depend on the user's domain of expertise.
The data engineer documents the function and then uses it in queries. UDFs define the business logic and break complex queries into understandable tasks.

-- Create reusable functions (MySQL syntax)
CREATE FUNCTION calculateTax(amount DECIMAL(10, 2))
RETURNS DECIMAL(10, 2)
DETERMINISTIC
BEGIN
  RETURN amount * 0.16;
END;
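Some databases let you register a UDF from application code instead of SQL. As a rough sketch, Python's sqlite3 module can expose a Python function to SQL through create_function. The 16% rate comes from the example above; the table and its rows are made up.

```python
import sqlite3

def calculate_tax(amount):
    # 16% rate taken from the calculateTax example above.
    return round(amount * 0.16, 2)

conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name calculateTax (1 argument).
conn.create_function("calculateTax", 1, calculate_tax)

conn.executescript("""
CREATE TABLE orders (orderID INTEGER PRIMARY KEY, amount REAL);
INSERT INTO orders VALUES (1, 100), (2, 250);
""")

taxes = conn.execute(
    "SELECT orderID, calculateTax(amount) FROM orders ORDER BY orderID"
).fetchall()
print(taxes)
```

Once registered, the function can be called from any query on that connection, so the business rule lives in one documented place.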

2. Creating Logging and Auditing for Queries - some queries make permanent changes to the database. Having a way to log or store the action history is important.
This helps the data engineer trace mistakes or progress made while querying. It also helps with compliance in different domains.

-- Create a logging table
CREATE TABLE customerLog (
    logID INT AUTO_INCREMENT PRIMARY KEY,
    customerID INT,
    action VARCHAR(50),
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (customerID) REFERENCES customers(customerID)
);
-- On the database end, you can also create event triggers on the database used.
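The trigger idea mentioned above can be sketched in SQLite, where a trigger writes to the log table whenever a customer row is updated. The trigger name log_customer_update and the sample data are hypothetical, and SQLite spells the auto-incrementing key differently from the MySQL-style table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customerID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customerLog (
    logID INTEGER PRIMARY KEY AUTOINCREMENT,
    customerID INTEGER,
    action TEXT,
    timestamp TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Every UPDATE on customers leaves one row in customerLog.
CREATE TRIGGER log_customer_update
AFTER UPDATE ON customers
BEGIN
    INSERT INTO customerLog (customerID, action)
    VALUES (NEW.customerID, 'UPDATE');
END;

INSERT INTO customers VALUES (1, 'Amina');
UPDATE customers SET name = 'Amina W.' WHERE customerID = 1;
""")

log_rows = conn.execute("SELECT customerID, action FROM customerLog").fetchall()
print(log_rows)
```

Because the database writes the log entry itself, the audit trail does not depend on every application remembering to record its own changes.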

3. Transaction Handling - A transaction is a group of SQL operations that must all complete before their changes take effect. These are the transaction control statements:
A COMMIT marks a successful end of a transaction, saving all changes made during the transaction permanently to the database.

A ROLLBACK cancels the transaction and undoes all changes made after the transaction began or up to a specified SAVEPOINT.

A BEGIN (or START TRANSACTION) indicates the beginning of a transaction block, allowing you to group multiple SQL statements as one atomic operation.

A SAVEPOINT creates a labeled point within a transaction to which you can later roll back without affecting the entire transaction. Together these statements ensure atomicity and are used for critical data operations.

This can be used to allow only a complete transaction to make changes. If an error occurs while a query is running, the transaction is considered incomplete and is rolled back (ROLLBACK). If no error occurs, the transaction completes and is committed (COMMIT).

-- Create a Process that ensures ATOMICITY and Rollback safety.
BEGIN;
UPDATE orders SET amount = amount - 100 WHERE customerID = 1;
SAVEPOINT service_amount;
UPDATE orders SET  amount = amount + 100 WHERE customerID = 7;
-- Suppose something goes wrong with the second update
ROLLBACK TO service_amount;
-- Only the first update remains
COMMIT;
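From application code, the same commit-or-rollback decision is usually wrapped in error handling. The sketch below uses a hypothetical accounts table with a CHECK constraint so that an overdraft raises an error; the transfer helper and the balances are assumptions for illustration, not part of the example above.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so
# BEGIN / COMMIT / ROLLBACK are issued explicitly by our code.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE accounts ("
    "accountID INTEGER PRIMARY KEY, "
    "balance REAL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 150), (7, 20)")

def transfer(amount, source, target):
    """Move amount between accounts; either both updates apply or neither."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE accountID = ?",
            (amount, target),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE accountID = ?",
            (amount, source),
        )
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so undo the credit as well.
        conn.execute("ROLLBACK")
        return False

first = transfer(100, source=1, target=7)    # enough funds
second = transfer(100, source=1, target=7)   # would overdraw account 1

balances = conn.execute(
    "SELECT accountID, balance FROM accounts ORDER BY accountID"
).fetchall()
print(first, second, balances)
```

The second transfer credits account 7 before the debit fails, and the rollback removes that credit again, which is exactly the atomicity the statements above are meant to guarantee.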

These concepts help the data engineer do data wrangling, reporting, and analyzing dashboards effectively and efficiently while maintaining quality data pipelines.

Conclusion

Mastering SQL is important to a data engineer. These are just a few concepts that will get you running as a master in SQL. In this data-driven world, this is to make sure you harness the superpower of being an SQL guru when working with data in your day-to-day operations. It makes your work more efficient, secure, and impactful. You can solve complex querying problems by having the above as a cheat sheet.

Link Up For More:
twitter/x.
linkedIn.

😊I would be glad if you Dropped a comment!!!

What is a Dry Run Test?

Moses-Morris — Thu, 14 Mar 2024 07:41:07 +0000

What is a Dry Run Test in Software Engineering?

A dry run test helps us know whether the software developed fulfills its intended purpose and whether each line of code performs its intended task before the code is deployed.

It belongs to the testing phase of the SDLC (Software Development Life Cycle, a framework or process that helps to develop high-quality software).

Dry run testing is one of a series of tests. Other tests done while programming and developing software include walkthrough testing, white box testing, integration testing, alpha testing, black box testing, beta testing, stub testing, resources testing, unit testing, smoke testing, and acceptance testing.
These tests reduce the risk of the software malfunctioning before, during, and after deployment.

A dry run test is like reading your code aloud to spot the mistakes and bugs introduced while writing it. In fields other than software engineering, it is referred to as a practice run.

Why is it called a dry run test?
According to reports, the term dry run originated among US fire departments. The firefighters would carry out fire brigade dispatches without pumping water for practice. A wet run was referred to as one that had real fire and water. The term dry run has since spread to other areas including military, aviation, and other fields. It is used to describe a rehearsal or practice without the real consequences of the event. Dry runs are an important part of preparing for the real thing.
The dry run test helps developers verify features and updates before running them for real. This procedure helps us trace and follow up on values and variables in our code. It reduces repetition and helps us maintain a good control flow in our code and software process.

Why do we need it? Benefits of conducting the Dry run Test.

To avoid repetition. - Walking through the code exposes duplicated logic, which supports the DRY principle (Don't Repeat Yourself), a separate idea that happens to share the name.
Avoid potential bugs during production. - Unexpected code behaviors.
Review code quality and performance. - This helps increase the quality of the software.
Pave way for more product testing. - It is important to initially direct the coder/programmer to productive testing.
Remove unnecessary code breakages. - The test helps counter bottlenecks that may arise during code execution.
Ensure Functionality - This test helps in discovering underlying issues before selling or deploying the software. This is crucial with critically important systems. This acts as a rehearsal.
Save time - This test mitigates potential errors and modifications in the future.
Uncover potential errors - Errors that may be imminent - logical errors, syntax errors, conditional errors, loop errors, typographical errors, etc.

Sometimes the developer can print values while the code runs to check for logic and execution errors that may arise during execution or deployment. This helps validate logic and syntax.

Dry Run Test Examples in Software Engineering

Rsync - This is a utility for transferring and synchronizing data between networked computers or storage drives. Its dry-run option (--dry-run) lists what would be transferred without changing anything.
Algorithm Scan - this is a mental walk-through of code and algorithm to confirm that it is logically valid.

How to perform a dry run test.

Here are ways to perform a dry run test, along with some tips to make it successful.

Read the code aloud - go line by line commenting and write the pseudo-code.
Use trace tables - check the value of a variable from line to line and record instances when it changes.
Use friends or teams to debug and dry run.
Logging out all the values and issues. You can print out the errors.
Use comments and descriptions on expressions and functions.
Use a test case to simulate possible outcomes by using dummy data.

Let's use a Trace Table:

Here is a sample code.

Here is the Trace table tracking a variable and showcasing the right expected outcome.

We can verify and validate the variables. If there were an error, you could easily spot where it occurred by printing out the variable after each modification.
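As an illustration of the trace table idea, here is a small sketch in Python. The function sum_to and its trace format are hypothetical stand-ins, not the sample code originally shown in the article. It records the tracked variable after every change so the table can be compared against what you expected on paper.

```python
def sum_to(n):
    """Add the numbers 1..n and record a trace row after every change to total."""
    trace = []                      # each row: (step, i, total)
    total = 0
    trace.append(("init", None, total))
    for i in range(1, n + 1):
        total += i
        trace.append(("loop", i, total))
    return total, trace

result, trace = sum_to(4)

# Print the trace table: one line per modification of the variable.
print(f"{'step':<5} {'i':<5} total")
for step, i, total in trace:
    print(f"{step:<5} {i!s:<5} {total}")
```

If a row in the printed table differs from the value worked out by hand, the step and loop counter on that row point straight at the line where the logic went wrong.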

We can easily track any modifications and changes the variable undergoes. This means that our code has passed the dry run test.
In large projects, developers develop a framework and system that helps them test the project as a whole.
Programming languages have their testing frameworks and systems for efficiency.

Where and when you should consider implementing a dry run testing:

When performing version control operations - you can always use git diff before merging to highlight any conflicts that may arise while merging.
When using configuration management and build tools - This helps you avoid software crashes and malfunctions. Some configuration management tools offer testing or simulation spaces to preview what the deployment will look like. Most build tools offer a staging step for dry running the application before production. They have a dry run mode or plugins that help you do the dry run.
When doing database migrations - This helps you see what will change before making any crucial database changes. NB: in Django, the makemigrations command has a --dry-run option that shows the migrations that would be created without writing them, and the sqlmigrate command prints the SQL a migration would run against the database.

python manage.py makemigrations --dry-run
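You can build the same kind of dry run mode into your own scripts. The sketch below is a hypothetical clean-up helper (delete_temp_files is not from any library) that, by default, only reports what it would delete and touches nothing until dry_run is switched off.

```python
import os
import tempfile

def delete_temp_files(folder, dry_run=True):
    """Delete *.tmp files in folder. With dry_run=True, only report the targets."""
    targets = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(".tmp")
    )
    if not dry_run:
        for path in targets:
            os.remove(path)
    return targets

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.tmp", "b.tmp", "keep.txt"):
        open(os.path.join(folder, name), "w").close()

    # Dry run: see the plan, nothing is removed.
    planned = [os.path.basename(p) for p in delete_temp_files(folder, dry_run=True)]
    still_there = sorted(os.listdir(folder))

    # Real run: the same targets are now deleted.
    delete_temp_files(folder, dry_run=False)
    remaining = sorted(os.listdir(folder))

print(planned, still_there, remaining)
```

Making the dry run the default is a deliberate choice here: the destructive path has to be requested explicitly, so a forgotten argument cannot delete anything.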

Dry run testing Limitations.

It is not easy to implement for large-scale projects.
It is time-consuming since the tester has to review line-by-line tracking of a variable.

Conclusion
The dry run testing methodology is very efficient. It helps developers understand the logic behind the product. It helps answer the question, "Is the product doing the intended task or solving the intended purpose?"

It also reduces the chance of shipping flaws that later surface as zero-day vulnerabilities in deployed software products. Every developer should consider using a dry run test to increase productivity and prepare for the next phases of testing.


Unveiling the Power of CodiumAI's PR-Agent: A Comprehensive Comparison with GitHub Copilot.

Moses-Morris — Wed, 20 Dec 2023 15:00:31 +0000

The AI Revolution

AI now looks like a movement and is rapidly changing how we work every day. The wave is increasing productivity in every field it reaches. What AI can do is boosting both skills and productivity.

Software development has evolved a lot, and we can now integrate both basic and advanced AI features into our development environments. Life couldn't get any easier. AI helps software developers write code more efficiently.

Looking at the benefits of AI, I would like to introduce you to my best AI Detective(“Detective”: from the phrase, Being the detective in a crime movie where you are also the murderer.)
Codium AI is a great companion when it comes to coding. It has immense features that improve software quality and production.

The Features of Codium AI.

Codium Chat.
PR bot.(PR-Agent)
The testing suite.
Integrity agents.
Codium Autocomplete.
Alpha Coding - future release.
Codium AI API - future release.
Review other products here: CodiumAi Products. The company provides many features and products. The PR-Agent of CodiumAI increases code quality and streamlines the development process.

Introduction to Codium AI PR-Agent.

Among the features of Codium AI, focusing on productivity and pull requests, we review the tool released for Pull Request assistance.

The tool is CodiumAI PR-Agent. It helps with analysis, reviews, commit messages, descriptions, and much more when handling pull requests.
It increases productivity and helps developers write more efficient code without leaving their Git platforms and IDEs.
One amazing aspect of CodiumAI PR-Agent is that it is open source. It has a developer community that reviews and updates its features as they advance.

Why use a PR Agent for a Pull Request?

It helps you generate Pull Request Descriptions - these are commit messages and titles.
It helps Generate Pull Request Reviews and gives Some Suggestions - it helps by suggesting changes and refining git diff for maximum performance.
It helps generate changelogs and update documents. - It simplifies updating and writing descriptions.
It helps enhance security - by suggesting security measures in the code while bug fixing.
Helps increase performance - helps the developer create efficient code by suggesting best practices.

Codium AI PR-Agent helps increase productivity, code quality, and performance overall. The tool is quite a time saver when going through pull request approval processes. This tool hugely benefits developers.

Platforms that can be Used with CodiumAI PR-Agent.

CodiumAI PR-Agent supports over 70 programming languages.
Codium has extensions that can be used with multiple IDEs like VS Code, JetBrains, etc.
CodiumAI PR-Agent can be used on many platforms and IDEs. This is because it is open source thus giving access to contributions from a wide range of developers using various tools and programming languages.
Some of the platforms are Github, GitLab, Beanstalk, Bitbucket, Mercurial, etc.
Review your platform or IDE here: Integrations

Features of the CodiumAI PR-Agent.

/describe - the command scans the Pull Request changes and automatically generates the description, which includes the type, summary, labels, title, and walkthrough. More features of the tool can be viewed here: More about /describe. It can be invoked manually by commenting @CodiumAI-Agent /describe.
/review - the tool scans the Pull Request code changes and automatically generates a Pull Request review. More features like "/review -i" can be viewed here: More about /review
/improve - the tool scans the Pull Request changes and automatically generates committable suggestions for improving the PR code. More features like "/improve --extended" can be viewed here: Extended /improve
/ask - this tool allows developers to ask questions about Pull Request code changes. It is triggered by typing /ask "...put your question here...".
/similar_issue - this tool checks the most similar issues of the Pull Requests and matches them to the current issue. It scans for earlier or previous issues. Here is a more detailed overview of the tool: More about /Similar_issue
/update_changelog - this tool automatically updates the CHANGELOG.md file with the Pull Request changes. It automatically detects and makes the changes. The tool can also be configured with various options. Here is a detailed overview. - How to configure
/add_docs - this tool scans the code changes in the Pull Requests and automatically suggests documentation for undocumented changes in the code. It checks code components. It is an additional documentation tool.
/generate_labels - the tool scans for Pull Request code changes and it automatically suggests labels that match Pull Request changes. The tool is configured before use. You can configure it via the CLI (command line interface), repository configuration file (pr_agent.toml), and manual handling on the repository page.

These features provide feedback to the developer. They review PR, suggest code, answer questions, and describe pull requests. Most of the features can be automatically triggered using GitHub actions. Some tools require configuration for custom benefits and results.

Major achievements when using CodiumAI PR-Agent Features and tools.

Helps write secure code.
Helps write efficient and quality code.
Increase productivity and consistency.
Helps write better code.
Save time by avoiding manual work with PRs.
It fixes problems and bugs in your code.
Helps you get effective feedback.

Comparison between Github Copilot for Pull requests and Codium PR Agent.

What makes Codium the best choice when it comes to Pull Request Assistance in comparison to Tabnine, Replit, GitHub Copilot, and many more...

GitHub Copilot for Pull Requests is an AI tool for pull request assistance. It was developed by GitHub. It has a number of limitations, which Codium AI's PR-Agent aims to address. Let us compare the benefits of using Codium AI PR-Agent over GitHub's Copilot.

Price.
Github Copilot is not a free AI tool. Subscription fees are required to continue using it. It does not have a free mode.
An individual developer can use the free mode of Codium AI PR Agent. One can upgrade when working with teams. They also provide a free trial when working with teams which downgrades when not added to a team for use.
Codium AI PR-Agent currently imposes no restrictions on the number of calls/tokens or repositories accessible by their users.
IDE support.
Github's Copilot version has limited support for IDEs. This limits developers from accessing the AI tool without specified IDEs.
Codium AI PR Agent has more than 10 supported IDE’s. You can review supported extensions and integrations here - Supported Integrations
Functionality.
In addition to unlimited and multiline commands, CodiumAI PR-Agent supports more than 8 commands. This is quite safe for developers because they can describe the issue and task at hand without being limited. Developers actively chat and search on the go without limitations.
When it comes to chat support, Github Copilot has limited capabilities. It relies on only a single command which limits flexibility.

They both perform single and multiline code generation.
Platform support.
Codium AI PR-Agent is open source and built for many version control platforms/systems. This allows developers to easily navigate and comprehensively learn how to use the PR Agent without worrying about the platform of use.
GitHub Copilot for pull requests is built only for GitHub. This limits developers' freedom to use the tool on other platforms.
Supported Languages.
Github's Copilot only supports a few languages according to the last release.
Codium AI PR-Agent has support for almost all programming languages relevant today. The number of languages supported by Codium's AI PR-Agent is more than 70 programming languages.

This is helped by it being open source. Developers can customize it to their specific needs.
Suggestions and latency.
Developers who have been using CodiumAI PR-Agent have been applauding the responses given by the PR-agent models. The company relies on public models for training its AI. This has improved the responses and help given by the PR Agent when reviewing Pull requests.
This showcases the latency to be an average of 9/10. This is superb 😄 .
Markers and features
GitHub Copilot relies on 4 markers to provide Pull Request assistance: copilot:all, copilot:summary, copilot:walkthrough, and copilot:poem.
CodiumAI PR-Agent has more than 4 tools/features. The features list is documented above in the article. Refer to: Features of the CodiumAI PR-Agent.

To advance and learn more about the benefits I would urge you to join the Discord community for more benefits and learning. Join Discord channel

How to use it and increase your productivity?

I will use GitHub for the demo on how to increase productivity.

Installation guides:

You can find installation guides on several ways to install the Codium AI PR-Agent here: Install CodiumAI PR-Agent.

Choose the most preferred way according to your Project Needs

I preferred installing through GitHub Actions so that I could run it from my workflows: Install and Run Via Github Actions.
You can also install the Git Plugin from here: Git Plugin

When working with pull requests, you might not have the time to keep reviewing all the code. This is where CodiumAI PR-Agent comes in as a practical AI assistant. It will help you write and review Pull Requests. Some features extend their functionality towards productivity and streamline the development process.

Let us preview some commands in use.

I am working on a team project. There are multiple pull requests, and each has to be reviewed.
Let us review each pull request.

/review

We do so by commenting on the pull request:

@CodiumAI-Agent /review

This reviews the changes in the pull request. We see the added files and can give feedback on them. Here is an example.

The agent responds to the review request with a complete analysis of the pull request. It contains descriptions, summaries, and suggested changes to the code.

A lot of information is given back to the developer.
Reviewing code with this tool shows how much a Pull Request assistant can do and accomplish.

Conclusion

Having an assistant who is always there to engage with your daily software development jobs is a great win. CodiumAI PR-Agent serves as the best AI Pull Request Assistant for increasing development productivity. There are no limits when it comes to CodiumAI's PR Agent.
GitHub Copilot is planning to provide more tools in its beta and is asking people to nominate organizations and enterprises through the GitHub Copilot Enterprise waitlist form.

Leveraging the power of AI Pull Request Assistance brings productivity and streamlines the software development process.

You can learn more about CodiumAI PR-Agent here: Learn More.
You can also view the website here: CodiumAI PR-Agent

Exploratory Data Analysis using Data Visualization Techniques.

Moses-Morris — Mon, 09 Oct 2023 14:48:04 +0000

What is EDA?

Exploratory Data Analysis is the process of analyzing and investigating a data set to discover patterns, characteristics, trends, anomalies, and relationships. This critical process relies on data visualization methods to accomplish its roles.
The process involves data cleaning, data exploration, feature engineering, and data visualization.

Why do We need Exploratory Data Analysis?

A Variable - a characteristic that can be measured and that can assume different values. Height, age, income, province, etc.

Missing values treatment - This is a method of analysis that involves identifying and treating missing values and null values in a dataset. The approach involves deleting some rows and columns and implementing filling techniques to insert data.
Outlier Treatment - Treatment of outliers involves handling extreme values, that is, values far above or below the rest of the data. Outliers can distort results. Most outliers are removed because they could be the result of an error.
Variable Transformation - Variables are transformed to improve their normality, linearity, and stability. Functions are applied to change the state or form of the data variables and make them usable. The data variables are either numerical or categorical.
Feature Engineering - This is a method of analysis that involves creating new features based on existing ones. It involves identifying and extracting features from a dataset.
Correlation Analysis - This method of analysis involves discovering relationships between data variables and measuring their strength and direction.
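Two of the treatments above can be sketched with nothing but Python's standard library. The height values are made up, the median is only one of several possible filling techniques, and the 1.5 x IQR rule is a common convention for flagging outliers rather than the only one.

```python
import statistics

# Hypothetical heights in cm; None marks a missing value, 240 is suspicious.
heights = [160, 165, None, 170, 168, 162, 240, None, 166]

# Missing values treatment: fill the gaps with the median of the observed values.
observed = [h for h in heights if h is not None]
median = statistics.median(observed)
filled = [median if h is None else h for h in heights]

# Outlier treatment: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [h for h in filled if h < low or h > high]
cleaned = [h for h in filled if low <= h <= high]

print("median used for filling:", median)
print("outliers:", outliers)
print("cleaned:", cleaned)
```

The median is used for filling because, unlike the mean, it is barely moved by the extreme value that the outlier step removes afterwards.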

Types of EDA:

Univariate EDA - Involves looking at a single variable at a time.
Bivariate EDA - involves looking at two variables at a time.
Multivariate EDA - Involves looking at three or more variables at a time.
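The difference between univariate and bivariate analysis can be shown in a few lines. In this sketch, on made-up age and income values, the mean and standard deviation describe one variable at a time, while a hand-written Pearson correlation (the pearson helper is defined here, not taken from a library) describes how two variables move together.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: age in years, income in thousands.
age = [22, 30, 38, 45, 53]
income = [20, 31, 39, 52, 58]

# Univariate: summarize one variable on its own.
age_mean = statistics.mean(age)
age_spread = statistics.stdev(age)

# Bivariate: measure the relationship between two variables.
r = pearson(age, income)

print(f"age: mean={age_mean}, stdev={age_spread:.2f}")
print(f"correlation(age, income)={r:.3f}")
```

A value of r close to 1 means the two variables rise together, close to -1 means one falls as the other rises, and close to 0 means no linear relationship.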

What is Data Visualization?

This is the representation of data using a graphical interface. This involves the use of charts, graphs, plots, infographics, animations, and many other visual techniques.

Why do we need data visualization?

Data visualization helps us discover trends, features, data point patterns, and outlying business parameters.

Data Visualization Techniques:

Charts - line charts, Pie charts, Column charts, Bar charts, Fusion charts, high charts, pictogram charts, histogram charts, waterfall charts, etc.
Plots - Line plots, Bar plots, Box and whisker plots, scatter plots, bubble plots, violin plots, distribution plots, cartograms, etc.
Maps - Heat maps, Treemaps, Choropleth Map, etc.
Diagrams and Matrices - correlation matrix, network diagram, word cloud, bullet graphs, highlight table, timeline, etc.

These techniques are implemented with various tools and technologies. The tools depend on the domain and have different uses and purposes, e.g. Tableau.

How to explore data using visualization techniques.

Let's now explore our data. We mostly use ...

Charts - for

Comparison - comparing variables and values in a dataset.
Distributions - checking the distribution of variables in a dataset.
Proportions - checking the proportionality of the distribution of variables in a dataset.

Plots - for

Trends - Viewing upcoming behaviors in the variables in a dataset.
Relationships - View the correlations between different variables in a dataset.
Outliers - checking for values that fall outside the expected range.

Maps - for

Patterns - used to identify special and regular patterns in the dataset variables.
Structures - they identify the hierarchy of data and the composition of different variables in a dataset.
Intensity - Helps identify the extremeness of variables in a dataset.
Density - helps identify the amount of concentration of values and variables in a dataset.

Diagrams and Matrices - for

Connections - diagrams show entity relations between variables in a dataset.
Summaries - they showcase summaries of data in a dataset. Help identify key performance indicators and quick insights into the data.
Comparison - using keys to identify differences and compare variables in a dataset.

Steps to explore data using visualization techniques.

Understand the Data - know if your data is numerical, categorical, or time-based. This prepares you for the transformation of the data into the appropriate data type and range of data values.
Identify the problem or question - Know the purpose and expectations of your data and the idea and hypothesis of the EDA.
Choose the most appropriate visualization techniques to implement - Having known and understood the data, you can identify the best techniques to use for visualization. You will understand if the data is numerical, categorical, time-based, or geographical.
Visualize the Data - Use appropriate tools to visualize your data, such as Matplotlib, Tableau, seaborn, Plotly, etc.
Interpret the data - look for patterns, features, trends, outliers, correlations, and relationships. At this point, you can iterate and refine the data if expectations are unclear or errors are spotted. The feedback generated determines whether the process needs to be refined and repeated.
Communication of findings - present and describe insights gained. Use visuals and reports to communicate findings.
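The first and third steps above (understand the data, then choose a technique) can be sketched as a small helper. Everything here is a simplified assumption: the names `infer_kind` and `suggest_visual` are hypothetical, and the mapping from data kind to visual is one reasonable default rather than a rule.

```python
from datetime import date, datetime
from numbers import Number

# Assumed default visual for each kind of data.
SUGGESTED = {
    "numerical": "histogram or box plot",
    "categorical": "bar chart",
    "time-based": "line plot",
}


def infer_kind(values):
    """Classify a column as numerical, categorical, or time-based."""
    if not values:
        raise ValueError("cannot infer the kind of an empty column")
    if all(isinstance(v, (date, datetime)) for v in values):
        return "time-based"
    # bool is a subclass of int, so it is excluded explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"


def suggest_visual(values):
    """Return the inferred kind together with a suggested visual."""
    kind = infer_kind(values)
    return kind, SUGGESTED[kind]


print(suggest_visual([1.5, 2, 3]))
print(suggest_visual(["red", "blue", "red"]))
print(suggest_visual([date(2023, 1, 1), date(2023, 2, 1)]))
```

The suggestion is only a starting point; the question being asked and the domain still decide the final choice.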

Different exploratory data analysis methods require different Data Visualization techniques. There needs to be consideration of the domain and purpose.

Conclusion

EDA involves various processes to prepare and craft the datasets used by models. If EDA fails or is not well crafted, the data visualization techniques used also fail to discover patterns and trends in the datasets. These two processes are dependent on each other. Expertise in this field comes from knowing which tools to use for a given domain.

Data Science for Beginners: 2023 - 2024 Complete Roadmap

Moses-Morris — Sun, 01 Oct 2023 19:51:56 +0000

What is Data Science?

Data Science is the art of extracting meaningful information from data to gain insights. The process consists of gathering, storing, analyzing, and plotting data.

Who are Data Scientists? These are data experts who perform and apply statistics, machine learning, and analytical approaches to answer critical business questions. Data scientists utilize various techniques, such as visualization, to interpret and present their findings and results. They help forecast the future based on the patterns and findings that have been discovered.

Other Different Roles in Data.

Data Analysis - This is a method of querying, processing, providing reports, summarizing, and visualizing data to derive information to influence decision-making.
Data analysts understand cleaning, visualizing, and exploratory data analysis which helps companies or organizations make informed and better decisions.

Data Engineering - This is a discipline that involves designing and building systems used for collecting, storing, and analyzing data.
To collect and organize data, data engineers are responsible for constructing and operating data pipelines. It is their responsibility to make sure that quality data is accessible and available.

Data Science Pillars:

Statistics - This is a type of math that teaches how to collect and analyze data to answer critical questions to influence decisions.
Domain knowledge - This is expertise in the business problem. It helps you collaborate and navigate the field of research and the industry.
Computer Science - This entails knowledge of how computers work. As a result, it is also necessary to understand programming.
Communicating and visualizing - The delivery of messages is essential in this process. How findings are delivered shapes how the data is interpreted, so clear communication matters.
Collaboration - Data science relies on other departments for extraction, transformation, and loading of data. This requires effective teamwork in the field.

Tools of a Data Scientist.

Programming tools - These are languages and tools used for programming - Python and its libraries (NumPy, pandas, PyTorch, SciPy), R, Scala, Java, Jupyter, MongoDB, SQL, Julia, D3.js, Apache Spark.
Machine Learning Tools - These are software tools used for Machine Learning. They are chosen according to the task at hand - scikit-learn, Accord.NET, Apache Mahout, TensorFlow, Weka, KNIME, Colab, Shogun, Keras, RapidMiner, DataRobot, NLTK (Natural Language Toolkit).
Visualization tools - These tools help data scientists present data in a format that is easy for humans to understand. They rely on graphs, tables, dashboards, graphics, and many more - Seaborn, Matplotlib, ggplot2, Lattice, Bokeh, Shiny, Power BI, Tableau, Infogram, Plotly, Matlab, MS Excel, Sisense, FusionCharts, Qlik Sense, DOMO, LookerVi, board, Datawrapper (CSV).
Cloud-Based Tools - These are tools available for easy access and real-time collection and usage of data. - BigML, Google Analytics, AWS, Terraform.

These are just to mention a few. You can look out for more according to the niche of your project. Some tools are more effective in various fields of use.

Important skills for a data scientist.

Technical Skills are:

Statistics and Mathematics - Probability, Linear Algebra, Calculus.
Machine Learning and Deep Learning - Able to train models, evaluate, and deploy them.
Data Wrangling - The ability to convert raw data into usable and meaningful form.
Programming - A data scientist writes code to query and process data efficiently. They can learn Java, Python, R, or Scala, and choose the most effective language for the project.
Visualization - A proficient data scientist knows how to present insights found.
Data Management and Governance - Implement security, availability, usability, and integrity.
Web Scraping - This involves extracting data from websites.
Database management and querying - Querying and managing databases in use. SQL, MongoDB, Couch, file storage, Excel file storage.
DSA (Data Structures and Algorithms) - These help you approach a problem efficiently.
Version control - Git, GitLab, Bitbucket.
Cloud computing - The access to resources from anywhere by authorized users.
DevOps - The demand for real-time data is rising. The use of the CI/CD cycle is important to deliver real-time live results.
Operating Systems - Linux, Windows, server OS, and other platforms of use.
Data Extraction, Transformation, cleaning, and preparation for loading.
Automation - using scripts to perform regular and repetitive tasks.
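To make the data wrangling and cleaning skills in this list concrete, here is a minimal sketch that turns messy CSV text into clean records. The raw text, the column names, and the `wrangle` function are made up for this example; real projects often use pandas for the same job.

```python
import csv
import io

# Hypothetical raw export: stray spaces, a missing age, and a duplicate row.
RAW = """name, age ,city
 Alice ,30,Nairobi
Bob,,Mombasa
 Alice ,30,Nairobi
"""


def wrangle(raw_text):
    """Trim whitespace, drop duplicate rows, and convert age to int or None."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [name.strip() for name in next(reader)]
    seen, rows = set(), []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        cleaned = tuple(field.strip() for field in fields)
        if cleaned in seen:  # skip exact duplicates
            continue
        seen.add(cleaned)
        record = dict(zip(header, cleaned))
        # A missing age becomes None instead of an empty string.
        record["age"] = int(record["age"]) if record["age"] else None
        rows.append(record)
    return rows


for row in wrangle(RAW):
    print(row)
```

Marking the missing value as None keeps the decision about how to fill it (mean, median, or dropping the row) separate from the cleaning step.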

Soft skills are:

Communication.
Problem solving.
Critical thinking.
Decision making.
Creative thinking.
Business intelligence.
Storytelling.
Attention to detail.

Data Science Methodology

This is a lifecycle that involves the approach of a Data Science project.

Business problem understanding - understand the owners' needs and how the business works internally. This identifies expectations.
Data collection and storage - Data acquisition plays a crucial role in helping understand what datasets are important.
Data Preparation and Understanding - this involves understanding the dataset you are working on and the structure of the data (structured or unstructured). It also involves removing duplicates, transforming data, and handling missing values. The data variables are identified here.
Data Modeling and evaluation - Trends and insights are evaluated in this phase. The tools used in this phase include R, Python, Matlab, and SAS.
Diagnostics and mining of data are executed here to produce a quality evaluation outcome. Prediction and description help us know the hits and misses of the models.
Deployment - feedback is derived from this phase to test the capabilities of the models. Maintenance and monitoring help in recommending the way forward using reports, summaries, and experience.
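The modeling and evaluation phase can be illustrated with a tiny example: fit a straight line on training data, then measure the error on data the model has not seen. This is a sketch under stated assumptions. The helper names and the numbers are hypothetical, and a real project would normally use a library such as scikit-learn.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data, split into a training set and a held-out test set.
train_x, train_y = [1, 2, 3, 4], [3, 5, 7, 9]
test_x, test_y = [5, 6], [11, 13.5]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
mae = mean_absolute_error(test_y, predictions)
print(slope, intercept, mae)
```

Evaluating on held-out data is what reveals the "hits and misses" of a model: an error measured on the training data alone tends to look better than it really is.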

Data science applications.

Machine learning - teaching machines to interpret the right data for use.
Internet searching - provides better and more accurate results for queries.
Voice assistance - training in dialects and sounds.
Health care - prioritization of surgery and effective treatment.
Robotics and IoT - manufacturing and prediction of outcomes and responses.
Marketing and E-commerce - increasing purchases and client conversion rates, recommending products, competitively advancing business.
Education - providing insights into students' performance and study behaviors.
Weather prediction and calamity prediction like earthquakes and fires.
Finance - data science provides insights into what is expected when it comes to the economy and expenditure. It helps analyze income and losses and manage expenditure.
Technology - Data science has driven steep growth in technology. Technology and big data now work in parallel to provide a better experience.
Travel - helps with recommendations for shorter routes.
Crime - helps analyze crime rates, sources, and areas of crime for easy detection and prediction.

Benefits of data science.

A Data Scientist is an asset to the company.

A Data Scientist...

Empowers the management to make better-informed decisions.
Provides insights into KPIs (key performance indicators).
Helps identify the underlying opportunities.
Helps identify loopholes and areas of improvement in the business.
Helps refine the target audience and maintain the audience.
Enables a drive for better results.

Trends in Data Science.

Cognitive computing - Artificial intelligence in cybersecurity relies on ML algorithms.
Augmented reality - the experience is enhanced by the use of Big Data.
Automation - Machine learning is helping automate very crucial activities. Data collected is being used to accelerate automation.
Cloud data ecosystems - many companies are now migrating to cloud warehouses for faster clustering and access to data.

Conclusion:

The world we are in now is already data-driven. It relies heavily on data to predict, describe, diagnose, and prescribe the best solutions to the problem at hand. The demand for data scientists will not decline anytime soon.

The impact of data science is clear and the demand for knowledge is skyrocketing. Looking at the future, data will fuel everyday lives in how we eat, socialize, learn, and live. It is part of the existing environment.
