YASHWANTH CHIKKI HD

pre_machine learning

Machine Learning Libraries - Quick Reference

Pandas

Reading Dataset

import pandas as pd

df = pd.read_csv("address.csv")    # Read a CSV file into a DataFrame
df = pd.read_excel("address.xlsx", sheet_name="sheet", usecols=['column1'])    # Read selected columns from one Excel sheet

Viewing Dataset

print(df.head())       # First 5 rows
print(df.tail())       # Last 5 rows
print(df.sample(5))    # Random 5 rows

DataFrame Information

df.info()        # Column dtypes, non-null counts, memory usage
df.describe()    # Summary statistics for numeric columns
df.shape         # Returns (rows, columns)

Selecting Columns

column = df["column_name"]
subset = df[["column1", "column2"]]

Add Two Compatible Columns

df["new"] = df["col1"] + df["col2"]

Filter Rows

filtered = df[df['column'] > 10]

Merge Two DataFrames

newdf = pd.merge(df1, df2, on="common_column")    # Inner join by default

Missing Values

Check

values = df.isnull().sum()    # Missing-value count per column

Fill With Values

df['col'] = df['col'].fillna(0)    # Assign back instead of inplace=True on a column slice (avoids chained-assignment warnings)

Drop Missing Values

df = df.dropna()    # Drop rows containing any missing value

Duplicate Values

duplicates = df.duplicated().sum()    # Number of fully duplicated rows

Add New Column

import numpy as np

y = np.array([1, 2, 3, 4])    # Length must match the number of rows in df
df['col_new'] = y

Drop Column

df.drop(columns=['col_name'], inplace=True)

iloc Function

x = df.iloc[:, :-1]   # All but last column
y = df.iloc[:, -1]    # Only last column

NumPy Basics

import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])   # sample data

mean = np.mean(data)
st_dev = np.std(data)
median = np.median(data)
var = np.var(data)



Machine Learning Overview

1. Artificial Intelligence (AI)

Definition:
Artificial Intelligence is the branch of computer science that aims to create systems capable of performing tasks that typically require human intelligence.
These tasks include reasoning, problem-solving, perception, language understanding, and decision-making.

AI is the broader concept that encompasses Machine Learning (ML), Deep Learning (DL), and other intelligent decision-making techniques.


2. Machine Learning (ML)

Definition:
Machine Learning is a subset of Artificial Intelligence that enables systems to learn patterns and make decisions from data without being explicitly programmed.

Instead of using fixed rules, ML models improve their performance over time as they are exposed to more data.


3. Deep Learning (DL)

Definition:
Deep Learning is a specialized branch of Machine Learning that uses Artificial Neural Networks (ANNs) with multiple layers (hence “deep”) to learn complex patterns.
It is particularly effective for high-dimensional data such as images, speech, and natural language.


4. Difference Between AI, ML, and DL

| Concept | Description | Example |
| --- | --- | --- |
| AI | Broad concept of machines simulating intelligence | Voice assistants like Alexa |
| ML | AI subset that learns from data | Spam email detection |
| DL | ML subset using neural networks | Image recognition in photos |

5. Types of Machine Learning

Machine Learning can be categorized based on external supervision, training procedure, and type of learning.


5.1 Based on External Supervision

Tree Structure:

Machine Learning
│
├── Supervised Learning
│   ├── Regression
│   └── Classification
│
├── Unsupervised Learning
│   ├── Clustering
│   ├── Dimensionality Reduction
│   ├── Anomaly Detection
│   └── Association Rule Learning
│
├── Semi-Supervised Learning
│
└── Reinforcement Learning

Explanation:

  • Supervised Learning:
    Models learn from labeled datasets, meaning each input has a known output.
    Examples:

    • Regression → Predicting continuous values (e.g., house price prediction)
    • Classification → Predicting categorical labels (e.g., spam or not spam)
  • Unsupervised Learning:
    Models learn from unlabeled data by finding hidden structures or patterns.
    Examples:

    • Clustering → Grouping similar data points (e.g., customer segmentation)
    • Dimensionality Reduction → Reducing number of features (e.g., PCA)
    • Anomaly Detection → Detecting outliers or rare events (e.g., fraud detection)
    • Association Rule Learning → Discovering relationships (e.g., market basket analysis)
  • Semi-Supervised Learning:

    A hybrid approach that combines a small amount of labeled data with a large amount of unlabeled data to improve learning accuracy.

  • Reinforcement Learning:

    Learning through interaction with an environment, where the model receives rewards or penalties based on actions.

    Common in robotics, gaming, and autonomous systems.
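
To make the supervised case concrete, here is a minimal scikit-learn sketch (the tiny dataset is made up for illustration):

from sklearn.linear_model import LinearRegression

# Labeled data: every input X has a known output y (supervised learning)
X = [[1], [2], [3], [4]]     # feature, e.g., house size
y = [100, 200, 300, 400]     # label, e.g., house price

model = LinearRegression()
model.fit(X, y)                  # learn the pattern from labeled examples
print(model.predict([[5]]))      # predict for an unseen input, approx. [500.]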


5.2 Based on Training Procedure

Tree Structure:

Machine Learning
│
├── Batch Learning (Offline Learning)
└── Online Learning

Explanation:

  • Batch Learning (Offline Learning):
    The model is trained using the entire dataset at once.
    Suitable for static data that doesn’t change over time.

  • Online Learning:
    The model learns incrementally as data arrives in sequence.
    Used in real-time systems or data streams (e.g., financial trading, live recommendation).

Key Concepts:

  • Learning Rate:
    Controls how much model weights are adjusted after each iteration.
    A small learning rate leads to slow convergence, while a large rate may cause instability.

  • Out-of-Core Learning:
    Technique used when the dataset is too large to fit into memory.
    The model processes data in small chunks sequentially.

  • River:
    A modern Python library designed for online machine learning and data stream processing.

  • Vowpal Wabbit:
    A highly efficient, open-source system for online learning, out-of-core learning, and reinforcement learning.
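
A minimal sketch of online learning with scikit-learn's SGDRegressor, which also shows the learning-rate parameter in action (the streamed chunks here are synthetic):

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01)   # eta0 is the learning rate

# Data arrives chunk by chunk, as in a stream; the model updates incrementally
rng = np.random.default_rng(42)
for _ in range(100):
    X_chunk = rng.random((20, 3))
    y_chunk = X_chunk @ np.array([1.0, 2.0, 3.0])
    model.partial_fit(X_chunk, y_chunk)     # one incremental update per chunk

print(model.coef_)   # should move toward [1. 2. 3.]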


5.3 Based on Type of Learning

Tree Structure:

Machine Learning
│
├── Instance-Based Learning
└── Model-Based Learning

Explanation:

  • Instance-Based Learning:
    The algorithm memorizes training examples and uses them directly to make predictions for new data.
    Example: K-Nearest Neighbors (KNN)
    Focuses on similarity measures (distance-based learning).

  • Model-Based Learning:
    The algorithm builds an abstract model from training data and uses it for predictions.
    Example: Linear Regression, Neural Networks
    Focuses on generalizing patterns rather than memorizing examples.
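
A short sketch contrasting the two approaches with scikit-learn (toy data made up for illustration):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]
y = [2.0, 4.0, 6.0, 8.0]

# Instance-based: stores the training points and predicts from the nearest ones
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# Model-based: fits an abstract model (here a line) instead of memorizing points
lin = LinearRegression().fit(X, y)

print(knn.predict([[2.5]]))   # average of the two nearest targets -> [5.]
print(lin.predict([[2.5]]))   # point on the fitted line, approx. [5.]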


6. Summary

| Basis | Categories | Example Algorithms |
| --- | --- | --- |
| External Supervision | Supervised, Unsupervised, Semi-Supervised, Reinforcement | Linear Regression, K-Means, Autoencoders, Q-Learning |
| Training Procedure | Batch, Online | SGD, River, Vowpal Wabbit |
| Type of Learning | Instance-Based, Model-Based | KNN, Decision Tree, Neural Network |

7. Final Note

Machine Learning is a continuously evolving field that bridges mathematics, data, and computation.
Understanding the types, learning paradigms, and tools forms the foundation for building intelligent systems.


Data Analysis

Data Analysis is the process of inspecting, cleaning, transforming, and modeling data to uncover useful insights and support decision-making.
It helps identify trends, patterns, and relationships within data that drive business and research outcomes.


1. Statistics

Statistics forms the foundation of data analysis. It involves collecting, summarizing, interpreting, and presenting data in a meaningful way.

Key Concepts

  • Mean:
    The average value of all data points.
    Usage: Measures central tendency but is sensitive to outliers.

  • Median:
    The middle value when data is sorted.
    Usage: More robust to outliers; represents true center of skewed data.

  • Mode:
    The most frequently occurring value.
    Usage: Useful for categorical or discrete data.

  • Variance:
    Measures how far data points are spread out from the mean.
    Usage: Indicates the degree of data dispersion.

  • Standard Deviation (SD):
    The square root of variance.
    Usage: Shows how much data deviates from the mean in original units.
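
A quick check of these five measures on a small made-up sample, using NumPy plus Python's statistics module for the mode:

import numpy as np
from statistics import mode

data = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up sample

print("mean:", np.mean(data))        # 5.0, pulled toward larger values
print("median:", np.median(data))    # 4.5, middle of the sorted data
print("mode:", mode(data))           # 4, most frequent value
print("variance:", np.var(data))     # 4.0, average squared deviation
print("std dev:", np.std(data))      # 2.0, square root of the variance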


2. Univariate Analysis

Univariate analysis deals with the examination of one variable at a time to understand its distribution, central tendency, and spread.

2.1 Numerical Data

1. Histogram

  • Purpose: Shows the frequency distribution of numerical data.
  • What to See:

    • Shape of distribution (normal, skewed, bimodal).
    • Outliers or unusual gaps.
  • Next Steps:

    • Apply normalization or log transformation if data is highly skewed.
    • Handle outliers if they affect model performance.

2. KDE Plot (Kernel Density Estimate)

  • Purpose: Smoothed version of a histogram showing probability density.
  • What to See:

    • Continuous estimation of data distribution.
    • Peaks and spread of data.
  • Next Steps:

    • Compare multiple KDEs to identify overlap or separation between features.
    • Check if data follows a normal distribution.

3. Box Plot

  • Purpose: Displays five-number summary — minimum, Q1, median, Q3, and maximum — and detects outliers.
  • What to See:

    • Outliers as points beyond whiskers.
    • Interquartile range (spread of middle 50% of data).
  • Next Steps:

    • Investigate or treat outliers.
    • Compare distributions across different categories.

4. Q-Q Plot (Quantile-Quantile Plot)

  • Purpose: Checks whether data follows a specific theoretical distribution (commonly normal).
  • What to See:

    • Points lying on a straight diagonal line → normal distribution.
    • Deviations → skewness or heavy tails.
  • Next Steps:

    • Apply transformation (log, square root) if data deviates strongly from normality.
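
The four plots above can be produced with matplotlib, seaborn, and SciPy; a minimal sketch on synthetic data:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

data = np.random.normal(loc=50, scale=10, size=500)    # synthetic numeric column

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].hist(data, bins=30)                         # histogram: frequency distribution
axes[0, 0].set_title("Histogram")
sns.kdeplot(data, ax=axes[0, 1])                       # KDE: smoothed density estimate
axes[0, 1].set_title("KDE Plot")
axes[1, 0].boxplot(data)                               # box plot: five-number summary
axes[1, 0].set_title("Box Plot")
stats.probplot(data, dist="norm", plot=axes[1, 1])     # Q-Q plot against the normal
plt.tight_layout()
plt.show()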

2.2 Categorical Data

1. Bar Plot

  • Purpose: Displays frequency or proportion of categories.
  • What to See:

    • Most/least frequent categories.
    • Distribution imbalance.
  • Next Steps:

    • Merge low-frequency categories if necessary.
    • Encode categorical variables for modeling.

2. Pie Chart

  • Purpose: Represents percentage contribution of each category to the total.
  • What to See:

    • Dominant categories.
    • Proportional representation.
  • Next Steps:

    • Verify if class imbalance exists.
    • Consider resampling or reweighting for balanced learning.
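
Both categorical plots in one pandas/matplotlib sketch (the category values are made up):

import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(["A", "A", "A", "B", "B", "C"])     # made-up categorical column
counts = s.value_counts()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=axes[0], title="Bar Plot")                      # frequencies
counts.plot(kind="pie", ax=axes[1], autopct="%1.0f%%", title="Pie Chart")  # proportions
plt.tight_layout()
plt.show()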

3. Bivariate Analysis

Bivariate analysis explores the relationship between two variables — either both numerical, both categorical, or one of each.


3.1 Numerical vs Numerical

1. Scatter Plot

  • Purpose: Visualizes the correlation between two continuous variables.
  • What to See:

    • Direction (positive, negative, none).
    • Strength (tight clustering = strong relation).
    • Outliers or nonlinear patterns.
  • Next Steps:

    • Calculate correlation coefficient (Pearson/Spearman).
    • Apply polynomial regression if relation is nonlinear.

2. Heatmap

  • Purpose: Shows correlation matrix using colors to indicate strength and direction of relationships.
  • What to See:

    • High positive correlation close to +1.
    • High negative correlation close to -1.
  • Next Steps:

    • Remove highly correlated variables (multicollinearity).
    • Select top correlated features for modeling.

3. Pair Plot

  • Purpose: Displays scatter plots for all numerical variable pairs with histograms on the diagonal.
  • What to See:

    • Trends, clusters, and correlations.
    • Distribution overlap between features.
  • Next Steps:

    • Identify feature relationships for model input selection.
    • Detect multicollinearity visually.
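
A seaborn sketch covering all three visualizations on synthetic correlated data:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=200)    # strongly related to x
df["z"] = rng.normal(size=200)                             # unrelated noise

sns.scatterplot(data=df, x="x", y="y")                 # direction and strength of relation
plt.show()
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")    # correlation matrix
plt.show()
sns.pairplot(df)                                       # all pairwise scatter plots
plt.show()
print(df["x"].corr(df["y"]))                           # Pearson correlation coefficient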

3.2 Categorical vs Categorical

1. Cross Tabulation (Contingency Table)

  • Purpose: Summarizes frequencies for combinations of two categorical variables.
  • What to See:

    • Co-occurrence patterns between categories.
  • Next Steps:

    • Perform Chi-Square test for independence.
    • Identify significant associations.

2. Stacked Bar Chart

  • Purpose: Displays total count split by subcategories within each category.
  • What to See:

    • Composition differences across groups.
  • Next Steps:

    • Analyze class imbalance across subcategories.
    • Visualize categorical interactions.

3. Grouped Bar Chart

  • Purpose: Compares multiple categorical groups side-by-side.
  • What to See:

    • Relative differences between categories across groups.
  • Next Steps:

    • Assess if certain combinations dominate.
    • Use statistical tests to confirm relationships.
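
A sketch of the cross tabulation and the Chi-Square test with pandas and SciPy (the toy survey data is made up):

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "bought": ["yes", "yes", "no", "yes", "no", "yes", "no", "no"],
})

table = pd.crosstab(df["gender"], df["bought"])    # contingency table
print(table)

chi2, p, dof, expected = chi2_contingency(table)   # Chi-Square test for independence
print("p-value:", p)                               # small p suggests an association

table.plot(kind="bar", stacked=True)               # stacked bar chart of the same table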

3.3 Numerical vs Categorical

Common Visualizations

  • Box Plot: Compare distribution of numerical values across categorical classes.
  • Violin Plot: Combines box plot and KDE for deeper distribution insight.
  • Bar Plot (with mean/median values): Shows aggregated numerical statistics by category.

What to See

  • Differences in spread and median across categories.
  • Overlapping distributions or distinct separation between classes.

Next Steps

  • If strong separation exists → good predictive potential.
  • If overlap is high → consider feature transformation or combining categories.
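
A seaborn sketch of the three plots, using its bundled tips demo dataset (downloaded on first use):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")    # demo dataset bundled with seaborn

sns.boxplot(data=tips, x="day", y="total_bill")      # spread and median per category
plt.show()
sns.violinplot(data=tips, x="day", y="total_bill")   # box plot plus KDE
plt.show()
sns.barplot(data=tips, x="day", y="total_bill")      # mean per category with error bars
plt.show()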

4. Summary

| Analysis Type | Variable Type | Common Plots | Purpose |
| --- | --- | --- | --- |
| Univariate (Numerical) | Single numeric variable | Histogram, KDE, Box, Q-Q | Understand distribution and outliers |
| Univariate (Categorical) | Single categorical variable | Bar, Pie | Examine category frequency |
| Bivariate (Num vs Num) | Two numeric variables | Scatter, Heatmap, Pairplot | Explore correlations |
| Bivariate (Cat vs Cat) | Two categorical variables | Cross Tab, Stacked Bar, Grouped Bar | Check associations |
| Bivariate (Num vs Cat) | One numeric, one categorical | Box, Violin, Mean Bar | Compare group-wise distributions |

Tensors

Definition (short): A tensor is a multi-dimensional array — a generalization of scalars, vectors, and matrices.

  • Rank (ndim): number of dimensions (axes).
  • Shape: tuple listing length along each dimension.
  • Size (or number of elements): product of the shape elements.

0-D tensor (Scalar)

  • Rank: 0
  • Shape: ()
  • Size: 1
  • Example (NumPy):
import numpy as np
t0 = np.array(5)            # scalar
print(t0, "rank:", t0.ndim, "shape:", t0.shape, "size:", t0.size)
# Output: 5 rank: 0 shape: () size: 1

1-D tensor (Vector)

  • Rank: 1
  • Shape: (n,) where n is length
  • Size: n
  • Example (NumPy):
t1 = np.array([1.0, 2.0, 3.0])   # vector of length 3
print(t1, "rank:", t1.ndim, "shape:", t1.shape, "size:", t1.size)
# Output: [1. 2. 3.] rank: 1 shape: (3,) size: 3

2-D tensor (Matrix)

  • Rank: 2
  • Shape: (rows, cols)
  • Size: rows * cols
  • Example (NumPy):
t2 = np.array([[1, 2, 3],
               [4, 5, 6]])    # 2x3 matrix
print(t2, "rank:", t2.ndim, "shape:", t2.shape, "size:", t2.size)
# Output: [[1 2 3]
#          [4 5 6]] rank: 2 shape: (2, 3) size: 6

3-D tensor

  • Rank: 3
  • Shape: (d0, d1, d2) — e.g., (depth, rows, cols) or (samples, timesteps, features)
  • Size: d0 * d1 * d2
  • Example (NumPy):
t3 = np.array([
    [[ 1,  2,  3,  4],
     [ 5,  6,  7,  8],
     [ 9, 10, 11, 12]],
    [[13, 14, 15, 16],
     [17, 18, 19, 20],
     [21, 22, 23, 24]]
])  # shape (2, 3, 4)
print("rank:", t3.ndim, "shape:", t3.shape, "size:", t3.size)
# Output: rank: 3 shape: (2, 3, 4) size: 24

Notes on indexing and axes

  • Access: t2[row, col], t3[a, b, c].
  • Common axis semantics: axis=0 often denotes samples/batch, axis=1 features/time, etc. Always confirm library convention (NumPy, TensorFlow, PyTorch).
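
Continuing with t2 and t3 from the examples above:

print(t2[0, 2])           # row 0, column 2 -> 3
print(t3[1, 2, 3])        # block 1, row 2, column 3 -> 24
print(t2.sum(axis=0))     # collapse rows, one sum per column -> [5 7 9]
print(t2.sum(axis=1))     # collapse columns, one sum per row -> [ 6 15]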

Challenges in ML Data Collection (short list)

  • APIs: rate limits, inconsistent formats, authentication, pagination. Mitigation: robust retry logic, caching, standardized parsers.
  • Web scraping: HTML structure changes, legal/robots rules, CAPTCHAs. Mitigation: use scraping frameworks, respect robots.txt, rotate user agents responsibly.
  • Insufficient data: small sample sizes that lead to overfitting. Mitigation: data augmentation, synthetic data, transfer learning.
  • Noisy / low-quality labels: human errors, label inconsistency. Mitigation: labeling guidelines, consensus labeling, label cleaning.
  • Class imbalance: skewed class distribution harming minority class performance. Mitigation: resampling, class weights, focal loss (see the sketch after this list).
  • Missing values: gaps in the dataset. Mitigation: imputation, model choices robust to missingness, collect more data.
  • Data drift / distribution shift: training vs production mismatch. Mitigation: monitoring, periodic retraining, domain adaptation.
  • Privacy & compliance: PII and legal constraints. Mitigation: anonymization, differential privacy, lawful data sourcing.
  • Storage & scale: very large data requiring special infrastructure. Mitigation: out-of-core pipelines, cloud storage, streaming ingestion.
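
To make one of these mitigations concrete, here is a minimal scikit-learn sketch of class weighting for imbalanced data (the labels and features are synthetic):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels: 90 negatives, 10 positives
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)     # the minority class receives the larger weight

# class_weight="balanced" applies the same reweighting during training
model = LogisticRegression(class_weight="balanced").fit(X, y)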

Machine Learning Development Lifecycle

  1. Problem definition: clearly state the objective and success metrics.
  2. Data collection: gather raw data from APIs, logs, sensors, scraping, databases.
  3. Data ingestion & storage: reliably store and version raw data for reproducibility.
  4. Data cleaning / preprocessing: handle missing values, noise, and format inconsistencies.
  5. Exploratory Data Analysis (EDA): inspect distributions, relationships, and anomalies.
  6. Feature engineering: transform, create, and select features that improve signal.
  7. Model selection: choose algorithms suitable for the problem and data.
  8. Training: fit models on training data with appropriate validation.
  9. Validation & hyperparameter tuning: evaluate generalization and tune parameters.
  10. Evaluation: measure performance on hold-out/test sets using defined metrics.
  11. Deployment: package and serve the model into production environments.
  12. Monitoring & maintenance: track performance, detect drift, and retrain as needed.
  13. Documentation & governance: record data lineage, decisions, and compliance artifacts.
