The goal is for learners to be able to join a top-tier lab (OpenAI, Google, MIT) or publish cutting-edge open-source research, code, and models.
Related tutorials:
- Python for Machine Learning: From Simple to Advanced
- Code a Neural Network from Scratch in NumPy
- Attention Mechanism Tutorial: From Simple to Advanced
Machine Learning & Deep Learning Course Syllabus
Module 1: Foundations
Building blocks of machine learning and computation
1.1 Programming Fundamentals
1.1.1 Python Basics
- Variables and data types
- Lists, dictionaries, functions
- Control flow and loops
- File I/O and string manipulation
1.1.2 NumPy Essentials (code sketch after this list)
- Arrays and vectorization
- Broadcasting and indexing
- Mathematical operations
- Random number generation
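
To give a flavour of lesson 1.1.2, here is a minimal sketch of vectorization, broadcasting, and random number generation in NumPy. The data and variable names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)                     # reproducible random numbers
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))   # 1000 samples, 3 features

# Vectorized standardization: the (3,) mean/std broadcast across all 1000 rows.
mean = data.mean(axis=0)                                # shape (3,)
std = data.std(axis=0)                                  # shape (3,)
standardized = (data - mean) / std                      # broadcasting: (1000, 3) op (3,)

# Boolean indexing: keep rows whose first feature is above its mean.
mask = data[:, 0] > mean[0]
subset = data[mask]

print(standardized.mean(axis=0).round(3))               # ~[0, 0, 0]
print(subset.shape)
```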
1.1.3 Data Manipulation
- Pandas fundamentals
- Data cleaning and preprocessing
- Handling missing values
- Basic statistics and aggregations
1.2 Mathematical Foundations
1.2.1 Linear Algebra
- Vectors and matrices
- Matrix multiplication
- Eigenvalues and eigenvectors
- Norms and distances
1.2.2 Calculus for ML (code sketch after this list)
- Derivatives and gradients
- Chain rule
- Partial derivatives
- Optimization basics
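
A small taste of lesson 1.2.2: the sketch below (toy function, illustrative names) compares an analytic gradient derived with the chain rule against a central finite-difference estimate, a check that also comes in handy when debugging backpropagation later in the course:

```python
import numpy as np

def f(w):
    # A toy scalar loss: f(w) = sum((w^2 - 1)^2), chosen only for illustration.
    return np.sum((w**2 - 1.0) ** 2)

def analytic_grad(w):
    # Chain rule: d/dw (w^2 - 1)^2 = 2*(w^2 - 1) * 2w
    return 2.0 * (w**2 - 1.0) * 2.0 * w

def numerical_grad(f, w, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return g

w = np.array([0.5, -1.5, 2.0])
print(analytic_grad(w))
print(numerical_grad(f, w))   # should agree with the analytic gradient to ~1e-5
```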
1.2.3 Probability and Statistics
- Probability distributions
- Bayes' theorem
- Expected value and variance
- Sampling and estimation
1.3 Data Types and Representation
1.3.1 Numerical Data
- Integers and floating point
- Precision and overflow
- Binary representation
- Normalization techniques
1.3.2 Text Data
- ASCII and Unicode
- UTF-8 encoding
- String processing
- Regular expressions
1.3.3 Tensor Operations
- Tensor shapes and dimensions
- Views and strides
- Contiguous memory layout
- Broadcasting rules
Module 2: Core Machine Learning
Traditional ML algorithms and concepts
2.1 Supervised Learning Fundamentals
2.1.1 Linear Models (code sketch after this list)
- Linear regression
- Logistic regression
- Regularization (L1, L2)
- Feature engineering
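
As a preview of lesson 2.1.1, here is a minimal ridge-regression sketch on synthetic data, showing L2 regularization via the closed-form normal equation. Everything here is illustrative, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)       # noisy linear data

lam = 1.0                                        # L2 (ridge) regularization strength
# Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_hat)                                     # close to [2, -1, 0.5]
```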
2.1.2 Tree-Based Methods
- Decision trees
- Random forests
- Gradient boosting
- Feature importance
2.1.3 Instance-Based Learning
- k-Nearest Neighbors
- Distance metrics
- Curse of dimensionality
- Locality sensitive hashing
2.2 Unsupervised Learning
2.2.1 Clustering
- K-means clustering
- Hierarchical clustering
- DBSCAN
- Cluster evaluation
2.2.2 Dimensionality Reduction
- Principal Component Analysis (PCA)
- t-SNE
- UMAP
- Manifold learning
2.2.3 Association Rules
- Market basket analysis
- Apriori algorithm
- FP-growth
- Lift and confidence
2.3 Model Evaluation and Selection
2.3.1 Performance Metrics
- Accuracy, precision, recall
- F1-score and ROC curves
- Regression metrics (MSE, MAE, R²)
- Cross-validation
2.3.2 Bias-Variance Tradeoff
- Overfitting and underfitting
- Model complexity
- Learning curves
- Regularization strategies
2.3.3 Hyperparameter Tuning
- Grid search
- Random search
- Bayesian optimization
- Automated ML (AutoML)
Module 3: Language Modeling Fundamentals
Building language models from scratch
3.1 Statistical Language Models
3.1.1 Bigram Language Models (code sketch after this list)
- Text preprocessing and tokenization
- Bigram counting and probabilities
- Smoothing techniques
- Text generation and sampling
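
Lesson 3.1.1 in miniature: count character bigrams on a toy corpus, apply add-one smoothing, and sample new text. The corpus and variable names are made up for illustration:

```python
import numpy as np

corpus = ["hello", "help", "hold", "hero"]          # toy corpus, purely illustrative
chars = sorted(set("".join(corpus)) | {"."})        # "." marks the start/end of a word
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}
V = len(chars)

# Count bigrams, including start/end markers.
counts = np.zeros((V, V), dtype=np.float64)
for w in corpus:
    padded = "." + w + "."
    for a, b in zip(padded, padded[1:]):
        counts[stoi[a], stoi[b]] += 1

probs = counts + 1.0                                # add-one (Laplace) smoothing
probs /= probs.sum(axis=1, keepdims=True)           # each row is P(next | current)

# Sample a new "word" by walking the bigram chain from the start marker.
rng = np.random.default_rng(0)
idx, out = stoi["."], []
while True:
    idx = rng.choice(V, p=probs[idx])
    if itos[idx] == ".":
        break
    out.append(itos[idx])
print("".join(out))
```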
3.1.2 N-gram Models
- Extending to trigrams and beyond
- Backoff and interpolation
- Perplexity evaluation
- Limitations of n-grams
3.1.3 Language Model Evaluation
- Perplexity and cross-entropy
- Held-out evaluation
- Human evaluation
- Downstream task evaluation
3.2 Neural Language Models
3.2.1 Feed-Forward Networks
- Multi-layer perceptrons
- Activation functions
- Universal approximation theorem
- Gradient descent basics
3.2.2 Backpropagation (code sketch after this list)
- Micrograd implementation
- Computational graphs
- Automatic differentiation
- Chain rule in practice
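
For lesson 3.2.2, a stripped-down scalar autograd class in the spirit of micrograd. Only `+` and `*` are implemented and the class layout is my own simplification, not the official micrograd code; it shows how a computational graph is built and how the chain rule is applied in reverse:

```python
class Value:
    """A scalar node in a computational graph with reverse-mode autodiff."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # upstream Value nodes
        self._local_grads = local_grads  # d(out)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, local in zip(v._parents, v._local_grads):
                parent.grad += local * v.grad   # chain rule, gradients accumulate

# d(loss)/da and d(loss)/db where loss = a*b + a
a, b = Value(2.0), Value(-3.0)
loss = a * b + a
loss.backward()
print(loss.data, a.grad, b.grad)   # -4.0, -2.0 (= b + 1), 2.0 (= a)
```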
3.2.3 Word Embeddings
- Distributed representations
- Word2Vec and GloVe
- Embedding spaces
- Analogies and relationships
Module 4: Deep Learning Foundations
Core neural network concepts and architectures
4.1 Neural Network Fundamentals
4.1.1 Perceptron and Multi-layer Networks
- Single perceptron
- Multi-layer perceptron (MLP)
- Matrix multiplication implementation
- Activation functions (ReLU, GELU, Sigmoid)
4.1.2 Training Neural Networks
- Loss functions
- Gradient descent variants
- Learning rate scheduling
- Batch processing
4.1.3 Regularization Techniques
- Dropout
- Batch normalization
- Weight decay
- Early stopping
4.2 Sequence Models
4.2.1 Recurrent Neural Networks
- Vanilla RNNs
- Backpropagation through time
- Vanishing gradient problem
- Bidirectional RNNs
4.2.2 LSTM and GRU
- Long Short-Term Memory
- Gated Recurrent Units
- Forget gates and memory cells
- Sequence-to-sequence models
4.2.3 Convolutional Networks
- 1D convolutions for text
- Pooling operations
- CNN architectures
- Feature maps and filters
4.3 Attention Mechanisms
4.3.1 Attention Fundamentals (code sketch after this list)
- Attention intuition
- Query, Key, Value matrices
- Attention scores and weights
- Scaled dot-product attention
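
Lesson 4.3.1 boils down to a few lines of NumPy. The sketch below (single head, no mask, random toy projections) computes scaled dot-product attention from query, key, and value matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (T_q, T_k): query-key similarities
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V, weights                  # (T_q, d_v), attention map

rng = np.random.default_rng(0)
T, d = 4, 8                                      # sequence length, model dimension
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape, attn.sum(axis=-1))              # (4, 8), rows sum to ~1.0
```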
4.3.2 Self-Attention
- Self-attention mechanism
- Multi-head attention
- Positional encoding
- Attention visualization
4.3.3 Advanced Attention
- Sparse attention patterns
- Local attention
- Cross-attention
- Attention variants
Module 5: Transformers and Modern Architectures
State-of-the-art language models
5.1 Transformer Architecture
5.1.1 Transformer Components
- Encoder-decoder structure
- Multi-head self-attention
- Feed-forward networks
- Residual connections
5.1.2 Layer Normalization
- LayerNorm vs BatchNorm
- Pre-norm vs post-norm
- RMSNorm variants
- Normalization placement
5.1.3 Positional Encoding (code sketch after this list)
- Sinusoidal encoding
- Learned positional embeddings
- Rotary Position Embedding (RoPE)
- Relative position encoding
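
A minimal sketch for lesson 5.1.3, implementing the sinusoidal encoding from the original Transformer paper; the dimensions are chosen arbitrarily for illustration:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # broadcast to (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)          # (50, 16); each row uniquely encodes one position
```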
5.2 GPT Architecture
5.2.1 GPT-1 and GPT-2
- Decoder-only architecture
- Causal self-attention
- Autoregressive generation
- Scaling laws
5.2.2 GPT-3 and Beyond
- Few-shot learning
- In-context learning
- Emergent abilities
- Instruction following
5.2.3 Modern Variants
- Llama architecture
- Grouped Query Attention (GQA)
- Mixture of Experts (MoE)
- Retrieval-augmented generation
5.3 Training Large Models
5.3.1 Optimization Techniques
- AdamW optimizer
- Learning rate scheduling
- Gradient clipping
- Weight initialization
5.3.2 Scaling Considerations
- Parameter scaling
- Compute scaling
- Data scaling
- Optimal compute allocation
5.3.3 Stability and Convergence
- Training dynamics
- Loss spikes and recovery
- Gradient monitoring
- Debugging techniques
Module 6: Tokenization and Data Processing
Text preprocessing and representation
6.1 Tokenization Fundamentals
6.1.1 Basic Tokenization
- Whitespace tokenization
- Punctuation handling
- Case normalization
- Unicode considerations
6.1.2 Subword Tokenization
- Byte Pair Encoding (BPE)
- WordPiece tokenization
- SentencePiece
- Unigram tokenization
6.1.3 MinBPE Implementation (code sketch after this list)
- BPE algorithm from scratch
- Merge operations
- Vocabulary building
- Encoding and decoding
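
The heart of lesson 6.1.3 is a short loop: count adjacent token pairs, merge the most frequent pair into a new token, repeat. A minimal sketch on a toy string (illustrative only, not the full minbpe tokenizer):

```python
from collections import Counter

def get_pair_counts(ids):
    return Counter(zip(ids, ids[1:]))           # counts of adjacent id pairs

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)                  # replace the pair with its new token id
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "banana bandana"                          # toy training text
ids = list(text.encode("utf-8"))                 # start from raw bytes (0..255)
merges = {}
num_merges = 5                                   # vocabulary grows to 256 + 5
for step in range(num_merges):
    counts = get_pair_counts(ids)
    if not counts:
        break
    pair = max(counts, key=counts.get)           # most frequent adjacent pair
    new_id = 256 + step
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
    print(f"merge {pair} -> {new_id}, sequence length now {len(ids)}")
```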
6.2 Advanced Tokenization
6.2.1 Handling Special Cases
- Out-of-vocabulary words
- Numbers and dates
- Code and structured text
- Multilingual tokenization
6.2.2 Tokenizer Training
- Corpus preparation
- Vocabulary size selection
- Special tokens
- Tokenizer evaluation
6.2.3 Tokenization Trade-offs
- Vocabulary size vs. sequence length
- Compression efficiency
- Downstream task performance
- Computational considerations
6.3 Data Pipeline
6.3.1 Data Loading
- Efficient data loading
- Streaming large datasets
- Data sharding
- Memory management
6.3.2 Data Preprocessing
- Text cleaning
- Deduplication
- Quality filtering
- Format standardization
6.3.3 Synthetic Data Generation
- Data augmentation
- Synthetic data creation
- Curriculum learning
- Domain adaptation
Module 7: Optimization and Training
Advanced training techniques and optimization
7.1 Optimization Algorithms
7.1.1 Gradient Descent Variants
- SGD with momentum
- AdaGrad and AdaDelta
- RMSprop
- Adam and AdamW
7.1.2 Advanced Optimizers
- LAMB optimizer
- Lookahead optimizer
- Shampoo optimizer
- Second-order methods
7.1.3 Learning Rate Scheduling (code sketch after this list)
- Step decay
- Cosine annealing
- Warm restarts
- Cyclical learning rates
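
A common pattern from lesson 7.1.3, sketched with invented hyperparameters: linear warmup followed by cosine decay to a floor learning rate:

```python
import math

def lr_schedule(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup followed by cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps            # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))      # decays from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine

for s in [0, 50, 100, 500, 999]:
    print(s, round(lr_schedule(s), 6))
```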
7.2 Training Strategies
7.2.1 Initialization Techniques
- Xavier/Glorot initialization
- He initialization
- Layer-wise initialization
- Transfer learning initialization
7.2.2 Regularization Methods
- Dropout variants
- DropConnect
- Stochastic depth
- Mixup and CutMix
7.2.3 Curriculum Learning
- Easy-to-hard scheduling
- Data ordering strategies
- Multi-task learning
- Progressive training
7.3 Training Monitoring
7.3.1 Metrics and Logging
- Loss monitoring
- Gradient norms
- Learning rate tracking
- Model checkpointing
7.3.2 Debugging Training
- Gradient flow analysis
- Activation monitoring
- Weight distribution analysis
- Convergence diagnostics
7.3.3 Hyperparameter Tuning
- Grid and random search
- Bayesian optimization
- Population-based training
- Multi-objective optimization
Module 8: Efficient Computing
Hardware acceleration and optimization
8.1 Device Optimization
8.1.1 CPU vs GPU Computing
- CPU architecture
- GPU parallelism
- Memory hierarchies
- Compute vs memory bound
8.1.2 GPU Programming
- CUDA basics
- Tensor cores
- Memory coalescing
- Kernel optimization
8.1.3 Specialized Hardware
- TPUs and tensor processing
- FPGAs for inference
- Neuromorphic computing
- Edge computing devices
8.2 Precision and Quantization
8.2.1 Numerical Precision (code sketch after this list)
- FP32, FP16, BF16
- Mixed precision training
- Automatic mixed precision
- Precision-accuracy trade-offs
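
For lesson 8.2.1, a tiny demonstration of why precision matters (and why mixed-precision training typically keeps a higher-precision master copy of the weights): small updates can vanish entirely in FP16. BF16 is not shown because stock NumPy does not ship a bfloat16 dtype:

```python
import numpy as np

print("fp16 machine epsilon:", np.finfo(np.float16).eps)   # ~1e-3: about 3 decimal digits
print("fp32 machine epsilon:", np.finfo(np.float32).eps)   # ~1.2e-7: about 7 decimal digits

# Rounding error when a small update meets a large accumulator:
acc16 = np.float16(2048.0) + np.float16(1.0)
acc32 = np.float32(2048.0) + np.float32(1.0)
print(acc16)   # 2048.0 -- the +1 update is lost in fp16
print(acc32)   # 2049.0
```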
8.2.2 Quantization Techniques
- Post-training quantization
- Quantization-aware training
- Dynamic quantization
- Pruning and sparsity
8.2.3 Low-Precision Inference
- INT8 quantization
- Binary neural networks
- FP8 formats
- Hardware-specific optimizations
8.3 Distributed Training
8.3.1 Data Parallelism
- Distributed Data Parallel (DDP)
- Gradient synchronization
- Communication optimization
- Fault tolerance
8.3.2 Model Parallelism
- Pipeline parallelism
- Tensor parallelism
- ZeRO optimizer states
- Gradient checkpointing
8.3.3 Large-Scale Training
- Multi-node training
- Communication backends
- Load balancing
- Scalability analysis
Module 9: Inference and Deployment
Model serving and optimization
9.1 Inference Optimization
9.1.1 KV-Cache (code sketch after this list)
- Key-value caching
- Memory management
- Cache optimization
- Batched inference
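
A toy sketch of the idea behind lesson 9.1.1: during autoregressive decoding, keys and values for past tokens are cached and reused, so each step only projects the new token. The projections and inputs here are random placeholders, not a real model:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []          # grows by one entry per generated token

def decode_step(x_t):
    """One autoregressive step: attend over all cached keys/values plus the new one."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)       # only the NEW key/value are computed this step
    v_cache.append(x_t @ Wv)
    K = np.stack(k_cache)          # (t, d) -- earlier entries are reused, not recomputed
    V = np.stack(v_cache)
    weights = softmax(q @ K.T / np.sqrt(d))
    return weights @ V

for t in range(5):
    x_t = rng.normal(size=d)       # stand-in for the current token's embedding
    out = decode_step(x_t)
print("cache length:", len(k_cache), "output shape:", out.shape)
```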
9.1.2 Model Compression
- Pruning techniques
- Knowledge distillation
- Model quantization
- Neural architecture search
9.1.3 Serving Optimization
- Batch processing
- Dynamic batching
- Continuous batching
- Speculative decoding
9.2 Production Deployment
9.2.1 Model Serving
- REST API design
- gRPC services
- WebSocket connections
- Load balancing
9.2.2 Containerization
- Docker containers
- Kubernetes orchestration
- Serverless deployment
- Edge deployment
9.2.3 Monitoring and Observability
- Performance metrics
- Model monitoring
- A/B testing
- Failure detection
9.3 Scalability and Reliability
9.3.1 Horizontal Scaling
- Load balancing
- Auto-scaling
- Resource management
- Cost optimization
9.3.2 Fault Tolerance
- Error handling
- Fallback mechanisms
- Circuit breakers
- Graceful degradation
9.3.3 Security and Privacy
- Input validation
- Rate limiting
- Data privacy
- Model security
Module 10: Fine-tuning and Adaptation
Customizing models for specific tasks
10.1 Supervised Fine-tuning
10.1.1 Full Fine-tuning
- Transfer learning
- Layer freezing
- Learning rate scheduling
- Catastrophic forgetting
10.1.2 Parameter-Efficient Fine-tuning (code sketch after this list)
- Low-Rank Adaptation (LoRA)
- Adapters
- Prompt tuning
- Prefix tuning
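
Lesson 10.1.2 in one picture: LoRA freezes the pretrained weight and learns a low-rank update on top of it. A NumPy sketch with illustrative shapes; the scaling follows the common alpha/r convention, and real implementations differ in details:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8                    # r << d: the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01         # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init => update starts at 0
alpha = 16.0

def lora_forward(x):
    # Base path plus scaled low-rank update: (W + (alpha/r) * B @ A) @ x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(lora_forward(x).shape)                                        # (64,)
print("trainable params:", A.size + B.size, "vs full:", W.size)     # 1024 vs 4096
```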
10.1.3 Task-Specific Adaptation
- Classification fine-tuning
- Generation fine-tuning
- Multi-task learning
- Domain adaptation
10.2 Reinforcement Learning
10.2.1 RL Fundamentals
- Markov Decision Processes
- Policy gradient methods
- Value-based methods
- Actor-critic algorithms
10.2.2 RLHF (Reinforcement Learning from Human Feedback)
- Reward modeling
- Proximal Policy Optimization (PPO)
- Direct Preference Optimization (DPO)
- Constitutional AI
10.2.3 Advanced RL Techniques
- Multi-agent RL
- Hierarchical RL
- Meta-learning
- Offline RL
10.3 Alignment and Safety
10.3.1 AI Alignment
- Value alignment
- Reward hacking
- Goodhart's law
- Interpretability
10.3.2 Safety Measures
- Content filtering
- Bias detection
- Adversarial robustness
- Failure modes
10.3.3 Evaluation and Testing
- Benchmarking
- Human evaluation
- Stress testing
- Red teaming
Module 11: Multimodal Learning
Beyond text: images, audio, and video
11.1 Vision-Language Models
11.1.1 Image Processing
- Convolutional neural networks
- Vision transformers
- Image tokenization
- Patch embeddings
11.1.2 Vision-Language Architectures
- CLIP model
- Cross-modal attention
- Unified encoders
- Multimodal fusion
11.1.3 Applications
- Image captioning
- Visual question answering
- Text-to-image generation
- Multimodal reasoning
11.2 Generative Models
11.2.1 Variational Autoencoders
- VAE fundamentals
- VQVAE and VQGAN
- Discrete representation learning
- Reconstruction quality
11.2.2 Diffusion Models (code sketch after this list)
- Denoising diffusion models
- Stable diffusion
- Diffusion transformers
- Classifier-free guidance
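
A sketch of the forward (noising) process behind lesson 11.2.2, using the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise with a linear beta schedule; the "image" here is just random data for illustration:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)             # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)                # cumulative product over timesteps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(32, 32))                 # stand-in for a "clean image"

def q_sample(x0, t):
    # Sample x_t directly from x_0 without simulating every intermediate step.
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

for t in [0, 250, 999]:
    xt = q_sample(x0, t)
    print(f"t={t}: remaining signal weight {np.sqrt(alpha_bars[t]):.3f}")
```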
11.2.3 Autoregressive Models
- PixelCNN and PixelRNN
- Autoregressive transformers
- Masked language modeling
- Bidirectional generation
11.3 Audio and Video
11.3.1 Audio Processing
- Speech recognition
- Text-to-speech
- Audio generation
- Music modeling
11.3.2 Video Understanding
- Video transformers
- Temporal modeling
- Action recognition
- Video generation
11.3.3 Multimodal Integration
- Audio-visual learning
- Cross-modal retrieval
- Multimodal reasoning
- Unified architectures
Module 12: Research and Advanced Topics
Cutting-edge developments and research directions
12.1 Emerging Architectures
12.1.1 Alternative Architectures
- State space models
- Retrieval-augmented generation
- Memory networks
- Neural Turing machines
12.1.2 Efficiency Improvements
- Linear attention
- Sparse transformers
- Efficient architectures
- Mobile-friendly models
12.1.3 Scaling Laws
- Compute scaling
- Data scaling
- Parameter scaling
- Emergence and phase transitions
12.2 Advanced Training Techniques
12.2.1 Self-Supervised Learning (code sketch after this list)
- Contrastive learning
- Masked language modeling
- Next sentence prediction
- Pretext tasks
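
To preview lesson 12.2.1, here is a small InfoNCE-style contrastive loss on toy embeddings (temperature and sizes are chosen arbitrarily): matching pairs sit on the diagonal of the similarity matrix and should receive high probability:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive loss: each z1[i] should match z2[i] against all other rows of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)    # work in cosine-similarity space
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature                      # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                     # positives are on the diagonal

rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 32))                     # embeddings of 8 samples
positive = anchor + 0.05 * rng.normal(size=(8, 32))   # slightly augmented "views"
print(info_nce_loss(anchor, positive))                # low: views align with their anchors
random_pairs = rng.normal(size=(8, 32))
print(info_nce_loss(anchor, random_pairs))            # higher: no alignment
```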
12.2.2 Few-Shot Learning
- Meta-learning
- In-context learning
- Prompt engineering
- Chain-of-thought reasoning
12.2.3 Continual Learning
- Lifelong learning
- Catastrophic forgetting
- Elastic weight consolidation
- Progressive neural networks
12.3 Evaluation and Benchmarking
12.3.1 Evaluation Frameworks
- Standardized benchmarks
- Evaluation metrics
- Human evaluation
- Automated evaluation
12.3.2 Bias and Fairness
- Bias detection
- Fairness metrics
- Debiasing techniques
- Ethical considerations
12.3.3 Interpretability
- Attention visualization
- Gradient-based methods
- Concept activation vectors
- Mechanistic interpretability
Each subtopic at the lowest level (e.g., 1.1.1, 1.1.2) represents a complete lesson that can be developed into a step-by-step tutorial with code examples, intuitive explanations, and practical exercises.
I may move a number of these topics into an optional track.