Machine Learning Libraries - Quick Reference
Pandas
Reading Dataset
import pandas as pd
df = pd.read_csv("address.csv")
df = pd.read_excel("address.xlsx", sheet_name="sheet", usecols=['column1'])
Viewing Dataset
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.sample(5)) # Random 5 rows
DataFrame Information
df.info() # Column dtypes, non-null counts, and memory usage
df.describe() # Summary statistics for numeric columns
df.shape # Returns (rows, columns)
Selecting Columns
column = df["column_name"]
subset = df[["column1", "column2"]]
Add Two Compatible Columns
df["new"] = df["col1"] + df["col2"]
Filter Rows
filtered = df[df['column'] > 10]
Merge Two DataFrames
newdf = pd.merge(df1, df2, on="common_column")
Missing Values
Check
values = df.isnull().sum()
Fill With Values
df['col'] = df['col'].fillna(0)  # Assign back; chained inplace fillna is discouraged in recent pandas
Drop Missing Values
df = df.dropna()
Duplicate Values
duplicates = df.duplicated().sum()
Add New Column
import numpy as np
y = np.array([1, 2, 3, 4])
df['col_new'] = y # Length of y must match the number of rows in df
Drop Column
df.drop(columns=['col_name'], inplace=True)
iloc Function
x = df.iloc[:, :-1] # All but last column
y = df.iloc[:, -1] # Only last column
NumPy Basics
import numpy as np
mean = np.mean(data)
st_dev = np.std(data)
median = np.median(data)
var = np.var(data)
Machine Learning Overview
1. Artificial Intelligence (AI)
Definition:
Artificial Intelligence is the branch of computer science that aims to create systems capable of performing tasks that typically require human intelligence.
These tasks include reasoning, problem-solving, perception, language understanding, and decision-making.
AI is the broader concept that encompasses Machine Learning (ML), Deep Learning (DL), and other intelligent decision-making techniques.
2. Machine Learning (ML)
Definition:
Machine Learning is a subset of Artificial Intelligence that enables systems to learn patterns and make decisions from data without being explicitly programmed.
Instead of using fixed rules, ML models improve their performance over time as they are exposed to more data.
3. Deep Learning (DL)
Definition:
Deep Learning is a specialized branch of Machine Learning that uses Artificial Neural Networks (ANNs) with multiple layers (hence “deep”) to learn complex patterns.
It is particularly effective for high-dimensional data such as images, speech, and natural language.
4. Difference Between AI, ML, and DL
| Concept | Description | Example |
|---|---|---|
| AI | Broad concept of machines simulating intelligence | Voice Assistants like Alexa |
| ML | AI subset that learns from data | Spam email detection |
| DL | ML subset using neural networks | Image recognition in photos |
5. Types of Machine Learning
Machine Learning can be categorized based on external supervision, training procedure, and type of learning.
5.1 Based on External Supervision
Tree Structure:
Machine Learning
│
├── Supervised Learning
│ ├── Regression
│ └── Classification
│
├── Unsupervised Learning
│ ├── Clustering
│ ├── Dimensionality Reduction
│ ├── Anomaly Detection
│ └── Association Rule Learning
│
├── Semi-Supervised Learning
│
└── Reinforcement Learning
Explanation:
- Supervised Learning:
  Models learn from labeled datasets, meaning each input has a known output.
  Examples:
  - Regression → Predicting continuous values (e.g., house price prediction)
  - Classification → Predicting categorical labels (e.g., spam or not spam)
- Unsupervised Learning:
  Models learn from unlabeled data by finding hidden structures or patterns.
  Examples:
  - Clustering → Grouping similar data points (e.g., customer segmentation)
  - Dimensionality Reduction → Reducing the number of features (e.g., PCA)
  - Anomaly Detection → Detecting outliers or rare events (e.g., fraud detection)
  - Association Rule Learning → Discovering relationships (e.g., market basket analysis)
- Semi-Supervised Learning:
  A hybrid approach that combines a small amount of labeled data with a large amount of unlabeled data to improve learning accuracy.
- Reinforcement Learning:
  Learning through interaction with an environment, where the model receives rewards or penalties based on its actions.
  Common in robotics, gaming, and autonomous systems.
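A minimal sketch of the contrast between supervised and unsupervised learning, assuming scikit-learn is available; the tiny arrays are made-up toy data:
import numpy as np
from sklearn.linear_model import LogisticRegression   # supervised: classification
from sklearn.cluster import KMeans                     # unsupervised: clustering

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])                       # labels exist -> supervised

clf = LogisticRegression().fit(X, y)                   # learns a mapping from X to y
print(clf.predict([[2.5], [10.5]]))                    # -> [0 1]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # no labels -> unsupervised
print(km.labels_)                                      # cluster assignments found from X alone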
5.2 Based on Training Procedure
Tree Structure:
Machine Learning
│
├── Batch Learning (Offline Learning)
└── Online Learning
Explanation:
- Batch Learning (Offline Learning):
  The model is trained on the entire dataset at once.
  Suitable for static data that doesn't change over time.
- Online Learning:
  The model learns incrementally as data arrives in sequence.
  Used in real-time systems or data streams (e.g., financial trading, live recommendations).
Key Concepts:
- Learning Rate:
  Controls how much the model weights are adjusted after each iteration.
  A small learning rate leads to slow convergence, while a large one may cause instability.
- Out-of-Core Learning:
  Technique used when the dataset is too large to fit into memory.
  The model processes the data sequentially in small chunks.
- River:
  A modern Python library designed for online machine learning and data stream processing.
- Vowpal Wabbit:
  A highly efficient, open-source system for online learning, out-of-core learning, and reinforcement learning.
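A minimal out-of-core / online-learning sketch using scikit-learn's partial_fit API (River and Vowpal Wabbit have their own interfaces). The file name "large_data.csv", the "target" column, and the chunk size are placeholders for illustration:
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)  # eta0 is the learning rate
classes = np.array([0, 1])                        # all class labels must be declared up front

# Read the file in chunks so the full dataset never sits in memory at once
for chunk in pd.read_csv("large_data.csv", chunksize=10_000):
    X = chunk.drop(columns=["target"]).to_numpy()
    y = chunk["target"].to_numpy()
    model.partial_fit(X, y, classes=classes)      # incremental update, one chunk at a time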
5.3 Based on Type of Learning
Tree Structure:
Machine Learning
│
├── Instance-Based Learning
└── Model-Based Learning
Explanation:
- Instance-Based Learning:
  The algorithm memorizes the training examples and uses them directly to make predictions for new data.
  Example: K-Nearest Neighbors (KNN)
  Focuses on similarity measures (distance-based learning).
- Model-Based Learning:
  The algorithm builds an abstract model from the training data and uses it for predictions.
  Examples: Linear Regression, Neural Networks
  Focuses on generalizing patterns rather than memorizing examples.
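A minimal sketch contrasting the two styles with scikit-learn; the toy data below is made up:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)   # instance-based: keeps the training points
lin = LinearRegression().fit(X, y)                   # model-based: compresses data into coefficients

print(knn.predict([[2.5]]))                          # prediction from the nearest stored examples
print(lin.predict([[2.5]]))                          # prediction from the fitted line
print(lin.coef_, lin.intercept_)                     # the learned "model" (slope and intercept)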
6. Summary
| Basis | Categories | Example Algorithms |
|---|---|---|
| External Supervision | Supervised, Unsupervised, Semi-Supervised, Reinforcement | Linear Regression, K-Means, Autoencoders, Q-Learning |
| Training Procedure | Batch, Online | SGD, River, Vowpal Wabbit |
| Type of Learning | Instance-Based, Model-Based | KNN, Decision Tree, Neural Network |
7. Final Note
Machine Learning is a continuously evolving field that bridges mathematics, data, and computation.
Understanding the types, learning paradigms, and tools forms the foundation for building intelligent systems.
Data Analysis
Data Analysis is the process of inspecting, cleaning, transforming, and modeling data to uncover useful insights and support decision-making.
It helps identify trends, patterns, and relationships within data that drive business and research outcomes.
1. Statistics
Statistics forms the foundation of data analysis. It involves collecting, summarizing, interpreting, and presenting data in a meaningful way.
Key Concepts
- Mean:
  The average value of all data points.
  Usage: Measures central tendency but is sensitive to outliers.
- Median:
  The middle value when the data is sorted.
  Usage: More robust to outliers; represents the true center of skewed data.
- Mode:
  The most frequently occurring value.
  Usage: Useful for categorical or discrete data.
- Variance:
  Measures how far data points are spread out from the mean.
  Usage: Indicates the degree of data dispersion.
- Standard Deviation (SD):
  The square root of the variance.
  Usage: Shows how much the data deviates from the mean in the original units.
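A minimal sketch computing the five statistics above; the sample list is made-up data:
import numpy as np
from statistics import mode            # standard-library helper for the mode

data = [4, 8, 6, 5, 3, 8, 9, 7, 8, 6]

print("mean:  ", np.mean(data))        # average, pulled toward outliers
print("median:", np.median(data))      # middle value, robust to outliers
print("mode:  ", mode(data))           # most frequent value -> 8
print("var:   ", np.var(data))         # population variance (ddof=0 by default)
print("std:   ", np.std(data))         # square root of the variance, in original units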
2. Univariate Analysis
Univariate analysis deals with the examination of one variable at a time to understand its distribution, central tendency, and spread.
2.1 Numerical Data
1. Histogram
- Purpose: Shows the frequency distribution of numerical data.
- What to See:
  - Shape of the distribution (normal, skewed, bimodal).
  - Outliers or unusual gaps.
- Next Steps:
  - Apply normalization or a log transformation if the data is highly skewed.
  - Handle outliers if they affect model performance.
2. KDE Plot (Kernel Density Estimate)
- Purpose: Smoothed version of a histogram showing probability density.
- What to See:
  - Continuous estimate of the data distribution.
  - Peaks and spread of the data.
- Next Steps:
  - Compare multiple KDEs to identify overlap or separation between features.
  - Check whether the data follows a normal distribution.
3. Box Plot
- Purpose: Displays the five-number summary (minimum, Q1, median, Q3, maximum) and detects outliers.
- What to See:
  - Outliers as points beyond the whiskers.
  - Interquartile range (spread of the middle 50% of the data).
- Next Steps:
  - Investigate or treat outliers.
  - Compare distributions across different categories.
4. Q-Q Plot (Quantile-Quantile Plot)
- Purpose: Checks whether data follows a specific theoretical distribution (commonly normal).
- What to See:
  - Points lying on a straight diagonal line → normal distribution.
  - Deviations → skewness or heavy tails.
- Next Steps:
  - Apply a transformation (log, square root) if the data deviates strongly from normality.
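A minimal sketch of these four plots, assuming matplotlib, seaborn, and scipy are installed; the data is randomly generated in place of a real column:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

data = np.random.normal(loc=50, scale=10, size=500)   # placeholder numeric feature

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(data, ax=axes[0, 0]).set_title("Histogram")
sns.kdeplot(data, ax=axes[0, 1]).set_title("KDE")
sns.boxplot(x=data, ax=axes[1, 0]).set_title("Box Plot")
stats.probplot(data, dist="norm", plot=axes[1, 1])    # Q-Q plot against a normal distribution
plt.tight_layout()
plt.show()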
2.2 Categorical Data
1. Bar Plot
- Purpose: Displays the frequency or proportion of categories.
- What to See:
  - Most/least frequent categories.
  - Distribution imbalance.
- Next Steps:
  - Merge low-frequency categories if necessary.
  - Encode categorical variables for modeling.
2. Pie Chart
- Purpose: Represents the percentage contribution of each category to the total.
- What to See:
  - Dominant categories.
  - Proportional representation.
- Next Steps:
  - Verify whether class imbalance exists.
  - Consider resampling or reweighting for balanced learning.
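A minimal sketch of a bar plot and a pie chart for one categorical column; the DataFrame and the column name "city" are placeholders:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Pune", "Delhi", "Mumbai"]})
counts = df["city"].value_counts()                     # frequency of each category

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=axes[0], title="Bar Plot")  # absolute frequencies
counts.plot(kind="pie", ax=axes[1], autopct="%1.1f%%", title="Pie Chart")  # proportions
plt.tight_layout()
plt.show()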
3. Bivariate Analysis
Bivariate analysis explores the relationship between two variables — either both numerical, both categorical, or one of each.
3.1 Numerical vs Numerical
1. Scatter Plot
- Purpose: Visualizes the correlation between two continuous variables.
- What to See:
  - Direction (positive, negative, none).
  - Strength (tight clustering = strong relationship).
  - Outliers or nonlinear patterns.
- Next Steps:
  - Calculate a correlation coefficient (Pearson/Spearman).
  - Apply polynomial regression if the relationship is nonlinear.
2. Heatmap
- Purpose: Shows the correlation matrix, using color to indicate the strength and direction of relationships.
- What to See:
  - High positive correlation close to +1.
  - High negative correlation close to -1.
- Next Steps:
  - Remove highly correlated variables (multicollinearity).
  - Select the top correlated features for modeling.
3. Pair Plot
- Purpose: Displays scatter plots for all pairs of numerical variables, with histograms on the diagonal.
- What to See:
  - Trends, clusters, and correlations.
  - Distribution overlap between features.
- Next Steps:
  - Identify feature relationships for model input selection.
  - Detect multicollinearity visually.
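A minimal seaborn sketch of these three plots; the DataFrame and its numeric columns ("age", "income", "score") are placeholders built from random data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
age = rng.integers(20, 60, size=100)
df = pd.DataFrame({"age": age,
                   "income": age * 1000 + rng.normal(0, 5000, 100),
                   "score": rng.normal(50, 10, 100)})

sns.scatterplot(data=df, x="age", y="income")              # direction and strength of the relationship
plt.show()
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")        # pairwise correlation matrix
plt.show()
sns.pairplot(df)                                           # all variable pairs at once
plt.show()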
3.2 Categorical vs Categorical
1. Cross Tabulation (Contingency Table)
- Purpose: Summarizes frequencies for combinations of two categorical variables.
- What to See:
  - Co-occurrence patterns between categories.
- Next Steps:
  - Perform a Chi-Square test of independence.
  - Identify significant associations.
2. Stacked Bar Chart
- Purpose: Displays the total count split by subcategories within each category.
- What to See:
  - Composition differences across groups.
- Next Steps:
  - Analyze class imbalance across subcategories.
  - Visualize categorical interactions.
3. Grouped Bar Chart
- Purpose: Compares multiple categorical groups side by side.
- What to See:
  - Relative differences between categories across groups.
- Next Steps:
  - Assess whether certain combinations dominate.
  - Use statistical tests to confirm relationships.
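A minimal sketch of a contingency table, a Chi-Square test, and stacked/grouped bars; the DataFrame and the columns "gender" and "purchased" are placeholders:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

df = pd.DataFrame({"gender":    ["M", "F", "F", "M", "F", "M", "F", "M"],
                   "purchased": ["yes", "yes", "no", "no", "yes", "yes", "no", "no"]})

ct = pd.crosstab(df["gender"], df["purchased"])      # contingency table
print(ct)

chi2, p, dof, expected = chi2_contingency(ct)        # test of independence
print("p-value:", p)                                 # small p suggests the variables are associated

ct.plot(kind="bar", stacked=True, title="Stacked Bar Chart")
plt.show()
ct.plot(kind="bar", stacked=False, title="Grouped Bar Chart")
plt.show()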
3.3 Numerical vs Categorical
Common Visualizations
- Box Plot: Compares the distribution of numerical values across categorical classes.
- Violin Plot: Combines box plot and KDE for deeper distribution insight.
- Bar Plot (with mean/median values): Shows aggregated numerical statistics by category.
What to See
- Differences in spread and median across categories.
- Overlapping distributions or distinct separation between classes.
Next Steps
- If strong separation exists → good predictive potential.
- If overlap is high → consider feature transformation or combining categories.
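A minimal seaborn sketch of these comparisons; the DataFrame with a categorical "department" column and a numeric "salary" column is a made-up placeholder:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame({"department": np.repeat(["HR", "IT", "Sales"], 50),
                   "salary": np.concatenate([rng.normal(40, 5, 50),
                                             rng.normal(60, 8, 50),
                                             rng.normal(50, 10, 50)])})

sns.boxplot(data=df, x="department", y="salary")       # spread and median per category
plt.show()
sns.violinplot(data=df, x="department", y="salary")    # box plot plus KDE in one view
plt.show()
sns.barplot(data=df, x="department", y="salary", estimator=np.mean)  # mean salary per category
plt.show()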
4. Summary
| Analysis Type | Variable Type | Common Plots | Purpose |
|---|---|---|---|
| Univariate (Numerical) | Single numeric variable | Histogram, KDE, Box, Q-Q | Understand distribution and outliers |
| Univariate (Categorical) | Single categorical variable | Bar, Pie | Examine category frequency |
| Bivariate (Num vs Num) | Two numeric variables | Scatter, Heatmap, Pairplot | Explore correlations |
| Bivariate (Cat vs Cat) | Two categorical variables | Cross Tab, Stacked Bar, Grouped Bar | Check associations |
| Bivariate (Num vs Cat) | One numeric, one categorical | Box, Violin, Mean Bar | Compare group-wise distributions |
Tensors
Definition (short): A tensor is a multi-dimensional array — a generalization of scalars, vectors, and matrices.
- Rank (or ndim): number of dimensions (axes).
- Shape: tuple listing length along each dimension.
- Size (or number of elements): product of the shape elements.
0-D tensor (Scalar)
- Rank: 0
- Shape: ()
- Size: 1
- Example (NumPy):
import numpy as np
t0 = np.array(5) # scalar
print(t0, "rank:", t0.ndim, "shape:", t0.shape, "size:", t0.size)
# Output: 5 rank: 0 shape: () size: 1
1-D tensor (Vector)
- Rank: 1
- Shape: (n,) where n is the length
- Size: n
- Example (NumPy):
t1 = np.array([1.0, 2.0, 3.0]) # vector of length 3
print(t1, "rank:", t1.ndim, "shape:", t1.shape, "size:", t1.size)
# Output: [1. 2. 3.] rank: 1 shape: (3,) size: 3
2-D tensor (Matrix)
- Rank: 2
- Shape: (rows, cols)
- Size: rows * cols
- Example (NumPy):
t2 = np.array([[1, 2, 3],
[4, 5, 6]]) # 2x3 matrix
print(t2, "rank:", t2.ndim, "shape:", t2.shape, "size:", t2.size)
# Output: [[1 2 3]
# [4 5 6]] rank: 2 shape: (2, 3) size: 6
3-D tensor
- Rank: 3
- Shape: (d0, d1, d2), e.g., (depth, rows, cols) or (samples, timesteps, features)
- Size: d0 * d1 * d2
- Example (NumPy):
t3 = np.array([
[[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]],
[[13, 14, 15, 16],
[17, 18, 19, 20],
[21, 22, 23, 24]]
]) # shape (2, 3, 4)
print("rank:", t3.ndim, "shape:", t3.shape, "size:", t3.size)
# Output: rank: 3 shape: (2, 3, 4) size: 24
Notes on indexing and axes
- Access: t2[row, col], t3[a, b, c].
- Common axis semantics: axis=0 often denotes samples/batch, axis=1 features/time, etc. Always confirm the library's convention (NumPy, TensorFlow, PyTorch).
Challenges in ML Data Collection (short list)
- APIs: rate limits, inconsistent formats, authentication, pagination. Mitigation: robust retry logic, caching, standardized parsers.
- Web scraping: HTML structure changes, legal/robots rules, CAPTCHAs. Mitigation: use scraping frameworks, respect robots.txt, rotate user agents responsibly.
- Insufficient data: small sample sizes that lead to overfitting. Mitigation: data augmentation, synthetic data, transfer learning.
- Noisy / low-quality labels: human errors, label inconsistency. Mitigation: labeling guidelines, consensus labeling, label cleaning.
- Class imbalance: skewed class distribution harming minority class performance. Mitigation: resampling, class weights, focal loss.
- Missing values: gaps in the dataset. Mitigation: imputation, model choices robust to missingness, collect more data.
- Data drift / distribution shift: training vs production mismatch. Mitigation: monitoring, periodic retraining, domain adaptation.
- Privacy & compliance: PII and legal constraints. Mitigation: anonymization, differential privacy, lawful data sourcing.
- Storage & scale: very large data requiring special infrastructure. Mitigation: out-of-core pipelines, cloud storage, streaming ingestion.
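As a small illustration of two of the mitigations above (missing values and class imbalance), here is a hedged scikit-learn sketch; the toy arrays are made up:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan], [8.0, 9.0], [9.0, 8.0], [7.5, 9.5]])
y = np.array([0, 0, 0, 0, 0, 1])                             # heavily imbalanced labels

X_filled = SimpleImputer(strategy="mean").fit_transform(X)   # fill gaps with column means

clf = LogisticRegression(class_weight="balanced")            # up-weight the minority class
clf.fit(X_filled, y)
print(clf.predict(X_filled))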
Machine Learning Development Lifecycle
- Problem definition: clearly state the objective and success metrics.
- Data collection: gather raw data from APIs, logs, sensors, scraping, databases.
- Data ingestion & storage: reliably store and version raw data for reproducibility.
- Data cleaning / preprocessing: handle missing values, noise, and format inconsistencies.
- Exploratory Data Analysis (EDA): inspect distributions, relationships, and anomalies.
- Feature engineering: transform, create, and select features that improve signal.
- Model selection: choose algorithms suitable for the problem and data.
- Training: fit models on training data with appropriate validation.
- Validation & hyperparameter tuning: evaluate generalization and tune parameters.
- Evaluation: measure performance on hold-out/test sets using defined metrics.
- Deployment: package and serve the model into production environments.
- Monitoring & maintenance: track performance, detect drift, and retrain as needed.
- Documentation & governance: record data lineage, decisions, and compliance artifacts.
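A minimal end-to-end sketch of the core lifecycle steps (split, preprocess, train, tune, evaluate) using scikit-learn; a bundled toy dataset stands in for real collected data:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)                       # data collection (toy stand-in)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([("scale", StandardScaler()),                    # preprocessing
                 ("model", LogisticRegression(max_iter=1000))])  # model selection

grid = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=5)  # validation & hyperparameter tuning
grid.fit(X_train, y_train)                                       # training

y_pred = grid.predict(X_test)                                    # evaluation on the hold-out set
print("test accuracy:", accuracy_score(y_test, y_pred))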