<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fais Azis Wibowo</title>
    <description>The latest articles on DEV Community by Fais Azis Wibowo (@faith_b6e08f3b8f05a77bb5f).</description>
    <link>https://dev.to/faith_b6e08f3b8f05a77bb5f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838124%2F70f71d92-4c28-4811-ae93-6a8a54cb928d.jpg</url>
      <title>DEV Community: Fais Azis Wibowo</title>
      <link>https://dev.to/faith_b6e08f3b8f05a77bb5f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/faith_b6e08f3b8f05a77bb5f"/>
    <language>en</language>
    <item>
      <title>Deep Learning for Image Classification: Ensemble CNN Architectures with Test Time Augmentation</title>
      <dc:creator>Fais Azis Wibowo</dc:creator>
      <pubDate>Sun, 29 Mar 2026 08:46:33 +0000</pubDate>
      <link>https://dev.to/faith_b6e08f3b8f05a77bb5f/ensemble-cnn-with-test-time-augmentation-for-mnist-digit-recognition-a-top-6-kaggle-solution-2a1d</link>
      <guid>https://dev.to/faith_b6e08f3b8f05a77bb5f/ensemble-cnn-with-test-time-augmentation-for-mnist-digit-recognition-a-top-6-kaggle-solution-2a1d</guid>
      <description>&lt;p&gt;&lt;em&gt;Accuracy: 0.99628 · Rank: 72 / 1,181 · Kaggle Digit Recognizer Competition&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuy1mj9xxrhilm18ua3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuy1mj9xxrhilm18ua3s.png" alt=" " width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Framing
&lt;/h2&gt;

&lt;p&gt;MNIST is a solved problem in the academic sense — state-of-the-art models have exceeded human-level performance on it for years. The challenge in a competitive context is not whether a CNN can classify handwritten digits, but how much variance you can squeeze out of an already high-performing system when the ceiling is 1.0, and the marginal gains are measured in the fourth decimal place.&lt;/p&gt;

&lt;p&gt;At 99.6%+ accuracy, a single misclassified digit per 200 samples is the difference between medal territory and the middle of the leaderboard. The solution presented here addresses this precision problem through two compounding mechanisms: ensemble diversity to reduce variance across model architectures, and Test Time Augmentation (TTA) to reduce variance across the inference distribution. The combination pushed a single-model baseline of 0.995035 to a final leaderboard score of 0.99628.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Preprocessing Pipeline
&lt;/h2&gt;

&lt;p&gt;The preprocessing pipeline is intentionally minimal — MNIST's controlled acquisition conditions mean aggressive preprocessing adds noise rather than signal.&lt;/p&gt;

&lt;p&gt;Normalisation scales pixel intensities from the raw [0, 255] integer range to [0.0, 1.0] float32. This is not optional — unnormalised inputs cause gradient instability during early training epochs, particularly with BatchNormalisation layers in the network.&lt;/p&gt;

&lt;p&gt;Reshaping transforms the flat 784-dimensional pixel vectors into (28, 28, 1) tensors. The channel dimension is explicit even though MNIST is grayscale — omitting it causes shape mismatches in Conv2D layers expecting a channel axis.&lt;/p&gt;

&lt;p&gt;Label encoding uses one-hot encoding across 10 classes, compatible with categorical cross-entropy loss.&lt;/p&gt;

&lt;p&gt;No augmentation is applied at the preprocessing stage — augmentation is handled dynamically during training via Keras &lt;em&gt;ImageDataGenerator&lt;/em&gt;, keeping the validation set clean and unperturbed for honest accuracy measurement.&lt;/p&gt;
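&lt;p&gt;A minimal sketch of this preprocessing pipeline, assuming the standard Kaggle &lt;em&gt;train.csv&lt;/em&gt; layout (a label column followed by 784 pixel columns):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Preprocessing sketch: normalise, reshape, one-hot encode.
import pandas as pd
from tensorflow.keras.utils import to_categorical

train = pd.read_csv("train.csv")
y = to_categorical(train["label"].values, num_classes=10)   # one-hot labels
X = train.drop(columns=["label"]).values.astype("float32") / 255.0  # [0, 255] to [0.0, 1.0]
X = X.reshape(-1, 28, 28, 1)  # explicit channel axis for Conv2D
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;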

&lt;h2&gt;
  
  
  Ensemble Architecture Design
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla54tnbo59j3h76keysh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla54tnbo59j3h76keysh.png" alt=" " width="800" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core hypothesis driving the ensemble design is that architecturally diverse models make uncorrelated errors. When models fail on different samples, their averaged softmax outputs smooth over individual weaknesses. Five CNN architectures were selected to span a range of representational depths and widths:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CNN-A and CNN-C (2-Block Shallow)&lt;/strong&gt;&lt;br&gt;
Conv2D(32) × 2 → MaxPool → Conv2D(64) × 2 → MaxPool → Dense(256)&lt;br&gt;
Shallow networks generalise quickly and act as low-variance baselines. Their representational capacity is sufficient for MNIST's relatively simple feature hierarchy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CNN-B and CNN-E (3-Block Deep)&lt;/strong&gt;&lt;br&gt;
Conv2D(32) × 2 → MaxPool → Conv2D(64) × 2 → MaxPool → Conv2D(128) × 2 → MaxPool → Dense(512)&lt;br&gt;
The additional convolutional block captures higher-order spatial relationships — stroke intersections, curve terminations — that shallower networks approximate less precisely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CNN-D (Wide 2-Block)&lt;/strong&gt;&lt;br&gt;
Conv2D(64) × 2 → MaxPool → Conv2D(128) × 2 → MaxPool → Dense(512)&lt;br&gt;
Wider early filters increase the number of low-level feature detectors without adding depth. This design is distinct from both A/C and B/E, contributing a different error profile to the ensemble.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regularisation applied uniformly across all five architectures:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BatchNormalisation after each convolutional block — stabilises activations and reduces sensitivity to weight initialisation&lt;/li&gt;
&lt;li&gt;Dropout(0.25) after convolutional blocks and Dropout(0.5) before the final dense layer — independently drops units during training to prevent co-adaptation&lt;/li&gt;
&lt;li&gt;&lt;em&gt;MaxPooling2D&lt;/em&gt; for spatial downsampling and translation invariance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The choice to duplicate architectures (A/C, B/E) rather than use five entirely distinct designs is intentional — identical architectures trained from different random initialisations with different augmentation sequences produce meaningfully different weight configurations and therefore different error patterns.&lt;/p&gt;
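&lt;p&gt;As a concrete sketch, the CNN-A/C shallow design with the regularisation above might look like this in Keras. Layer widths follow the description; kernel sizes and activations are assumptions rather than the exact competition code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the CNN-A/C 2-block architecture described above.
# Kernel size (3x3) and ReLU activations are assumptions.
from tensorflow.keras import layers, models

def build_cnn_shallow():
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;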

&lt;h2&gt;
  
  
  Training Strategy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Optimiser:&lt;/strong&gt; Adam with default parameters. Adaptive learning rate methods consistently outperform SGD with momentum on vision tasks at this scale without requiring manual schedule tuning.&lt;br&gt;
&lt;strong&gt;2. Real-time Data Augmentation via &lt;em&gt;ImageDataGenerator&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rotation range (15°): Simulates natural variations in handwritten digits&lt;/li&gt;
&lt;li&gt;Zoom range (15%): Accounts for differences in digit size/scale&lt;/li&gt;
&lt;li&gt;Width shift range (15%): Handles horizontal misalignment or centering variation&lt;/li&gt;
&lt;li&gt;Height shift range (15%): Handles vertical misalignment or centering variation&lt;/li&gt;
&lt;li&gt;Shear range (0.1): Simulates slight perspective distortions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Augmentation is applied on-the-fly during training — each epoch sees a stochastically transformed version of the training set, effectively expanding the training distribution without increasing dataset size. This is the primary mechanism preventing overfitting across 50 training epochs.&lt;/p&gt;
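&lt;p&gt;These ranges map directly onto the Keras generator. A sketch of the configuration as described:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Augmentation config matching the ranges above (batch size is an assumption).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,       # degrees
    zoom_range=0.15,
    width_shift_range=0.15,
    height_shift_range=0.15,
    shear_range=0.1,
)
# model.fit(datagen.flow(X_train, y_train, batch_size=64),
#           validation_data=(X_val, y_val), epochs=50)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;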

&lt;p&gt;&lt;strong&gt;3. Callbacks:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;ReduceLROnPlateau&lt;/em&gt; monitors validation accuracy and halves the learning rate when no improvement is observed over a patience window. This allows the optimiser to take larger steps during early training and finer steps as it approaches the loss minimum — recovering accuracy that flat learning rate schedules leave on the table.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;EarlyStopping&lt;/em&gt; with &lt;em&gt;restore_best_weights=True&lt;/em&gt; terminates training when validation accuracy plateaus and restores the checkpoint from the optimal epoch. This is critical in an ensemble context — each of the five models must contribute its best possible weights, not its final weights.&lt;/p&gt;
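&lt;p&gt;A sketch of both callbacks as described. The halving factor and weight restoration come from the text above; the patience windows are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Callback setup; patience values are assumptions, not the competition settings.
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

callbacks = [
    ReduceLROnPlateau(monitor="val_accuracy", factor=0.5,   # halve the LR
                      patience=3, verbose=1),
    EarlyStopping(monitor="val_accuracy", patience=10,
                  restore_best_weights=True),               # keep best checkpoint
]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;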

&lt;p&gt;&lt;strong&gt;4. Inference: Test Time Augmentation&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou3mt7na3bwemn1tklml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou3mt7na3bwemn1tklml.png" alt=" " width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
A trained model's prediction on a single test image is a point estimate — one forward pass through a stochastic function approximator. TTA converts this point estimate into a distributional estimate by averaging predictions across multiple augmented versions of the same image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. TTA protocol:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For each test image, generate 15 augmented variants using the same ImageDataGenerator configuration used during training&lt;/li&gt;
&lt;li&gt;Run each variant through each of the 5 trained models&lt;/li&gt;
&lt;li&gt;Collect softmax probability vectors from all 80 forward passes (5 models × 16 passes: 1 original + 15 augmented)&lt;/li&gt;
&lt;li&gt;Average the 80 probability vectors element-wise&lt;/li&gt;
&lt;li&gt;Select the class index with the highest averaged probability as the final prediction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mathematical intuition: if a model misclassifies an augmented variant of a digit, the correct class still accumulates probability mass across the other 79 passes. The averaging operation suppresses low-confidence noise and amplifies the consistent signal.&lt;/p&gt;
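&lt;p&gt;A minimal sketch of that averaging loop, where &lt;em&gt;models&lt;/em&gt;, &lt;em&gt;X_test&lt;/em&gt;, and &lt;em&gt;datagen&lt;/em&gt; are assumed to come from the earlier steps:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# TTA sketch: 1 original + 15 augmented passes per model, probabilities summed.
# Summing and averaging give the same argmax, so the division is skipped.
def tta_predict(models, X_test, datagen, n_aug=15):
    probs = sum(m.predict(X_test, verbose=0) for m in models)  # original images
    for _ in range(n_aug):
        # one stochastically augmented copy of the full test set
        X_aug = next(datagen.flow(X_test, batch_size=len(X_test), shuffle=False))
        probs += sum(m.predict(X_aug, verbose=0) for m in models)
    return probs.argmax(axis=1)  # class with highest accumulated probability
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;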

&lt;p&gt;&lt;strong&gt;6. Quantified impact of each component:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfp3vobgai0rg1sn6lx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfp3vobgai0rg1sn6lx8.png" alt=" " width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single CNN baseline: ~99.54% validation accuracy, &lt;strong&gt;0.99503&lt;/strong&gt; leaderboard score&lt;/li&gt;
&lt;li&gt;5-model ensemble + TTA: ~99.57% average validation accuracy, &lt;strong&gt;0.99628&lt;/strong&gt; leaderboard score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The leaderboard gain of &lt;strong&gt;+0.00125&lt;/strong&gt; over the single-model baseline — while appearing marginal — represents a reduction of roughly 1 misclassification per 800 test samples. At this accuracy regime, that is the practical limit of what ensemble diversity and distributional inference averaging can recover.&lt;/p&gt;

&lt;h2&gt;
  
  
  Execution
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash# 
Download competition data
kaggle competitions download &lt;span class="nt"&gt;-c&lt;/span&gt; digit-recognizer

&lt;span class="c"&gt;# Train all five CNN architectures&lt;/span&gt;
python train_ensemble.py

&lt;span class="c"&gt;# Run TTA inference and generate submission&lt;/span&gt;
python predict_ensemble.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Analysis: Where the Remaining Errors Live
&lt;/h2&gt;

&lt;p&gt;At 0.99628, the residual errors are concentrated in a small set of structurally ambiguous digit pairs — primarily (4, 9), (3, 5), and (7, 1) — where stroke topology is genuinely similar and the distinguishing feature is a single curve or termination point. These are the cases where even human annotators disagree at non-trivial rates.&lt;/p&gt;

&lt;p&gt;Pushing beyond this threshold would require either capsule networks (which preserve spatial hierarchies that MaxPooling discards), or a larger training set sourced from distributions beyond MNIST's controlled acquisition environment. Within the constraints of this competition, 80-pass TTA across a 5-model ensemble represents a practical ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;p&gt;Three methodological conclusions generalise beyond MNIST:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architectural diversity outperforms depth uniformity in ensembles. Five architecturally varied models with moderate depth outperform five deep identical models — the variance-reduction mechanism requires uncorrelated errors, which requires architectural differences.&lt;/li&gt;
&lt;li&gt;TTA is a near-free accuracy gain at inference time. Once models are trained, TTA costs only additional forward passes. On MNIST-scale images this is computationally trivial; on larger datasets the compute cost scales with image size and TTA count — budget accordingly.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;EarlyStopping&lt;/em&gt; with weight restoration is non-negotiable in multi-model ensembles. Final-epoch weights frequently underperform best-epoch weights by 0.1–0.3% validation accuracy. Across five models this compounds — the ensemble ceiling is only as high as the best checkpoint of each contributor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full implementation available at: &lt;a href="https://github.com/faissssss/kaggle-digit-recognizer" rel="noopener noreferrer"&gt;github.com/faissssss/kaggle-digit-recognizer&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://www.kaggle.com/c/digit-recognizer" rel="noopener noreferrer"&gt;kaggle.com/c/digit-recognizer&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>deeplearning</category>
      <category>cnn</category>
    </item>
    <item>
      <title>Tabular Machine Learning for Predictive Modeling: A Ridge-XGBoost N-gram Pipeline for Customer Churn</title>
      <dc:creator>Fais Azis Wibowo</dc:creator>
      <pubDate>Sun, 29 Mar 2026 07:37:33 +0000</pubDate>
      <link>https://dev.to/faith_b6e08f3b8f05a77bb5f/how-i-reached-top-8-on-kaggle-with-a-ridge-xgboost-n-gram-pipeline-32pa</link>
      <guid>https://dev.to/faith_b6e08f3b8f05a77bb5f/how-i-reached-top-8-on-kaggle-with-a-ridge-xgboost-n-gram-pipeline-32pa</guid>
      <description>&lt;p&gt;&lt;em&gt;Kaggle Playground Series S6E3 — Predict Customer Churn | ROC-AUC 0.91685 | Rank 286 / 3,718&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cnyo230xu9k0ohbiijm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cnyo230xu9k0ohbiijm.png" alt=" " width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Customer churn prediction sounds straightforward — given a telecom customer's usage history and contract details, predict whether they'll leave. But the Kaggle Playground S6E3 dataset had &lt;strong&gt;594,000 rows&lt;/strong&gt; of heavily categorical data where the signal was buried inside combinations of features, not individual columns. Standard approaches plateau quickly here.&lt;br&gt;
My starting point was a single LightGBM model. It was decent — but decent doesn't crack the top 10%. Getting there required entirely rethinking how the model saw the categorical features.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Core Insight: Treat Categories Like Text
&lt;/h2&gt;

&lt;p&gt;The breakthrough came from an unconventional direction — NLP. In text classification, n-grams capture phrase-level patterns that individual words miss. The same logic applies to categorical feature combinations.&lt;/p&gt;

&lt;p&gt;A customer with Contract: Month-to-month is one signal. A customer with Contract: Month-to-month + InternetService: Fiber optic + PaymentMethod: Electronic check is a completely different risk profile — and that combination is what predicts churn.&lt;/p&gt;

&lt;p&gt;So I treated categorical columns like tokens and generated bigrams and trigrams across high-impact features: Contract, InternetService, and PaymentMethod. Each unique combination became a new feature, capturing interaction patterns that a standard feature matrix would miss entirely.&lt;/p&gt;
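&lt;p&gt;A minimal sketch of the idea. The column names come from the article; the separator and helper function are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# N-gram interaction features over high-signal categorical columns (a sketch).
from itertools import combinations

high_signal = ["Contract", "InternetService", "PaymentMethod"]

def add_ngram_features(df, cols, orders=(2, 3)):
    out = df.copy()
    for k in orders:
        for combo in combinations(cols, k):
            name = "_x_".join(combo)
            # concatenate the category values into one composite token
            out[name] = out[list(combo)].astype(str).agg(" | ".join, axis=1)
    return out

train = add_ngram_features(train, high_signal)  # train: the competition DataFrame
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;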
&lt;h2&gt;
  
  
  Feature Engineering Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig1c5wlsxcbzhs3pq451.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig1c5wlsxcbzhs3pq451.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — N-gram Categorical Interactions&lt;/strong&gt;&lt;br&gt;
For each high-signal categorical column, I generated pairwise (bigram) and three-way (trigram) combinations across the feature space. This produced a set of composite interaction features that encoded relationship patterns directly into the model input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Nested Target Encoding&lt;/strong&gt;&lt;br&gt;
Raw target encoding leaks — if you encode a categorical feature using the target mean, the model sees information from the row it's predicting. The fix is nested k-fold encoding: encode each fold using only the other folds' target statistics. I used a nested 5-fold stratified scheme applied to both the original categorical features and the new n-gram features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Service Stack Analysis&lt;/strong&gt;&lt;br&gt;
Beyond the n-grams, I engineered explicit service combination counts — how many internet services, how many phone add-ons — and their intersections. Customers with more bundled services behave differently at churn time, and these counts captured that pattern numerically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 — Digit Features&lt;/strong&gt;&lt;br&gt;
For continuous columns like tenure and MonthlyCharges, I extracted distributional digit features — essentially encoding the numerical range and pattern of each value. This gave the model a richer representation of where each customer sat within the tenure and charge distributions.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Two-Stage Ensemble
&lt;/h2&gt;

&lt;p&gt;With the feature matrix built, I used a two-stage stacking approach rather than a single model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjec38twuilgih89es2cx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjec38twuilgih89es2cx.png" alt=" " width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 — Ridge Regression&lt;/strong&gt;&lt;br&gt;
A heavily regularised Ridge classifier served as the first-stage learner. Ridge is simple and interpretable — it captures broad linear trends across the feature space and generalises cleanly. Critically, I ran this with 10-fold stratified cross-validation and collected the out-of-fold (OOF) predictions. These OOF predictions became meta-features for Stage 2.&lt;br&gt;
The reason for Ridge first: it acts as a stabilising baseline. Its predictions encode a smooth, low-variance signal that helps XGBoost in Stage 2 avoid overfitting to noisy feature interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2 — XGBoost on Original + OOF Features&lt;/strong&gt;&lt;br&gt;
The second-stage XGBoost classifier was trained on the full engineered feature matrix plus the Ridge OOF predictions as an additional input. This gave XGBoost a pre-computed linear summary of the data to work with alongside the raw features — effectively letting it model residuals and non-linear interactions on top of the Ridge baseline.&lt;br&gt;
Cross-validation remained 10-fold stratified with a fixed seed throughout, ensuring consistent and reproducible evaluation across both stages.&lt;/p&gt;
&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Public Leaderboard AUC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.91685&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global Rank&lt;/td&gt;
&lt;td&gt;286 / 3,718&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Percentile&lt;/td&gt;
&lt;td&gt;Top 8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The top 20 features by importance were dominated by the n-gram interaction features and nested-encoded categorical combinations — validating the core hypothesis that combination patterns outperform individual categorical signals on this dataset.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;The LightGBM baseline I started with was actually competitive on its own — the jump came almost entirely from the n-gram feature engineering, not from model complexity. In hindsight, I would have invested more time earlier in feature interaction design and less time tuning hyperparameters on simpler models.&lt;/p&gt;

&lt;p&gt;A second improvement would be experimenting with higher-order n-grams (four-way combinations) on the service stack features — the signal was clearly present in three-way combinations, and there may have been further lift available.&lt;/p&gt;
&lt;h2&gt;
  
  
  Code &amp;amp; Reproducibility
&lt;/h2&gt;

&lt;p&gt;The full pipeline is open source. The main winning script is src/train_ridge_xgb_ngram.py — run with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python src/train_ridge_xgb_ngram.py --folds 10 --inner-folds 5 --seed 42 --output-prefix ridge_xgb_ngram10
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Full repository: &lt;a href="https://github.com/faissssss/predict-customer-churn" rel="noopener noreferrer"&gt;github.com/faissssss/predict-customer-churn&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Three things that actually moved the needle on this competition:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;N-gram thinking for categorical data. If your features are categorical and interactions matter, treat them like text tokens. The combination is the signal.&lt;/li&gt;
&lt;li&gt;Nested target encoding, not naive encoding. Leaky encoding hurts generalisation silently — you won't see it in training metrics until the leaderboard disagrees with your CV score.&lt;/li&gt;
&lt;li&gt;Stack for stability, not complexity. Ridge + XGBoost worked not because XGBoost needed help, but because Ridge's OOF predictions gave it a cleaner starting point. Stacking should reduce variance, not add layers for its own sake.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sources:&lt;br&gt;
&lt;a href="//github.com/faissssss/predict-customer-churn"&gt;github.com/faissssss/predict-customer-churn&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.kaggle.com/competitions/playground-series-s6e3" rel="noopener noreferrer"&gt;kaggle.com/competitions/playground-series-s6e3&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Hello, this is my first time joining this platform. I want to share my experience building my first SaaS, Skoowl AI. I hope you find it helpful. Cheers!</title>
      <dc:creator>Fais Azis Wibowo</dc:creator>
      <pubDate>Mon, 23 Mar 2026 17:00:42 +0000</pubDate>
      <link>https://dev.to/faith_b6e08f3b8f05a77bb5f/hello-this-is-my-first-time-joining-this-platform-i-wanna-share-my-experience-building-my-first-3ao8</link>
      <guid>https://dev.to/faith_b6e08f3b8f05a77bb5f/hello-this-is-my-first-time-joining-this-platform-i-wanna-share-my-experience-building-my-first-3ao8</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/faith_b6e08f3b8f05a77bb5f/heres-how-i-built-my-first-saas-and-the-stack-behind-it-49h5" class="crayons-story__hidden-navigation-link"&gt;Here's How I Built My First SaaS and the Stack Behind It&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/faith_b6e08f3b8f05a77bb5f" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838124%2F70f71d92-4c28-4811-ae93-6a8a54cb928d.jpg" alt="faith_b6e08f3b8f05a77bb5f profile" class="crayons-avatar__image" width="96" height="96"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/faith_b6e08f3b8f05a77bb5f" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Fais Azis Wibowo
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Fais Azis Wibowo
                
              
              &lt;div id="story-author-preview-content-3390375" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/faith_b6e08f3b8f05a77bb5f" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838124%2F70f71d92-4c28-4811-ae93-6a8a54cb928d.jpg" class="crayons-avatar__image" alt="" width="96" height="96"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Fais Azis Wibowo&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/faith_b6e08f3b8f05a77bb5f/heres-how-i-built-my-first-saas-and-the-stack-behind-it-49h5" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Mar 23&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/faith_b6e08f3b8f05a77bb5f/heres-how-i-built-my-first-saas-and-the-stack-behind-it-49h5" id="article-link-3390375"&gt;
          Here's How I Built My First SaaS and the Stack Behind It
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/webdev"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;webdev&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/nextjs"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;nextjs&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/tutorial"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;tutorial&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/faith_b6e08f3b8f05a77bb5f/heres-how-i-built-my-first-saas-and-the-stack-behind-it-49h5#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            7 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>nextjs</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Designing and Deploying an AI-Powered EdTech SaaS with LLM Integration</title>
      <dc:creator>Fais Azis Wibowo</dc:creator>
      <pubDate>Mon, 23 Mar 2026 16:58:02 +0000</pubDate>
      <link>https://dev.to/faith_b6e08f3b8f05a77bb5f/heres-how-i-built-my-first-saas-and-the-stack-behind-it-49h5</link>
      <guid>https://dev.to/faith_b6e08f3b8f05a77bb5f/heres-how-i-built-my-first-saas-and-the-stack-behind-it-49h5</guid>
      <description>&lt;h2&gt;
  
  
  It Started With My Brother
&lt;/h2&gt;

&lt;p&gt;My brother has a habit. Whenever he's studying from a YouTube video, he doesn't just watch it — he wants to be tested on it. Every time, he'd come to me: &lt;em&gt;"Can you make me a quiz from this video?"&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Every. Single. Time.&lt;/p&gt;

&lt;p&gt;The first few times, I did it manually. Watched the video, wrote some questions, and formatted them. It took longer to make the quiz than it took him to watch the video. There had to be a better way.&lt;/p&gt;

&lt;p&gt;That frustration became Skoowl AI.&lt;/p&gt;

&lt;p&gt;I'm Fais — a 20-year-old CS enthusiast from Indonesia. I had never shipped a full SaaS product before. I had the technical background (Next.js, TypeScript, some AI/ML work), but building something real that is deployed and actually used was new territory.&lt;/p&gt;

&lt;p&gt;This is the full story of how I built &lt;a href="https://www.skoowlai.com/" rel="noopener noreferrer"&gt;Skoowl AI&lt;/a&gt;: what it does, every major tech decision, the AI pipeline that powers it, and what's coming next.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Skoowl AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gnoyjsd67lpa8l849n4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gnoyjsd67lpa8l849n4.png" alt="Turn Your Materials into&amp;lt;br&amp;gt;
Study Decks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.skoowlai.com/" rel="noopener noreferrer"&gt;Skoowl AI&lt;/a&gt; is a study platform that takes raw educational content — PDFs, audio recordings, YouTube videos — and transforms it into structured, study-ready materials using large language models.&lt;/p&gt;

&lt;p&gt;Here's what it can do:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;📝 Smart Notes&lt;/td&gt;
&lt;td&gt;Auto-generate formatted study notes from any uploaded file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;⚡ Flashcards&lt;/td&gt;
&lt;td&gt;Create spaced-repetition flashcards for key terms and concepts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;❓ Adaptive Quizzes&lt;/td&gt;
&lt;td&gt;Generate MCQ, true/false, or fill-in-the-blank quizzes with AI hints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🧠 Mind Maps&lt;/td&gt;
&lt;td&gt;Visualize topics with interactive Radial, Tree, Fishbone, and other layouts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🎙️ Live Transcription&lt;/td&gt;
&lt;td&gt;Record lectures in real-time or upload audio for instant transcription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📺 YouTube Learning&lt;/td&gt;
&lt;td&gt;Paste a video URL, extract and process the knowledge directly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🗣️ Chat Assistant&lt;/td&gt;
&lt;td&gt;Ask questions and get answers scoped to your own study notes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The idea is simple: you bring the content, Skoowl handles the processing. You spend your time learning, not formatting.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Stack — And Why I Chose It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkiyjp64sv67v6vmovj0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkiyjp64sv67v6vmovj0l.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every decision in this stack was deliberate. Here's the breakdown.&lt;/p&gt;
&lt;h2&gt;
  
  
  ⚡ Frontend
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Next.js 15 (App Router) + React 19 + TypeScript&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next.js with the App Router gives me server components, streaming, and a clean file-based routing model in one package. React 19's concurrency improvements matter specifically for this app — the UI needs to stay responsive while AI is generating content in the background. TypeScript across the entire codebase keeps everything honest; when you're wiring together AI responses, database models, and API handlers, loose types cause real bugs.&lt;/p&gt;

&lt;p&gt;For styling, Tailwind CSS v4 handles the utility-first layout and design system. Framer Motion covers complex, fluid UI animations — page transitions, entrance/exit effects, and loading states. Radix UI provides the headless, accessible primitives (dialogs, dropdowns, tabs, accordions) so I didn't have to reinvent accessibility from scratch, and Lucide React keeps the icon system consistent throughout.&lt;/p&gt;
&lt;h2&gt;
  
  
  🎨 Styling &amp;amp; UI Enhancements
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tiptap + React Markdown + KaTeX + React Flow + Three.js + GSAP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI-generated notes need to be editable, so I built a full rich text editing experience using Tiptap — a headless editor that supports highlights, text alignment, color, and more. Users can edit, format, and reorganize their generated notes directly in the app.&lt;/p&gt;

&lt;p&gt;For rendering AI output that includes markdown, tables, and mathematical formulas, react-markdown with remark/rehype plugins and KaTeX handles LaTeX math rendering cleanly. This matters for STEM content — physics equations and calculus notation render correctly, not as broken text.&lt;/p&gt;

&lt;p&gt;React Flow powers the interactive mind maps — a node-based graph UI where users can drag, expand, and explore topic hierarchies.&lt;/p&gt;

&lt;p&gt;For the heavier visual layer: Three.js and React Three Fiber handle interactive 3D elements in the UI, ShaderGradient creates the smooth WebGL-powered gradient backgrounds, and GSAP handles complex timeline-based animation sequences that go beyond what Framer Motion covers.&lt;/p&gt;
&lt;h2&gt;
  
  
  🗄️ Backend &amp;amp; Database
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL (Neon) + Prisma + Upstash Redis + Clerk + Dodo Payments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL via Neon Serverless is the production database — relational data fits this product well since users, documents, and generated content all have clear structure. Locally, I use SQLite for fast, zero-config development. Prisma v5 sits on top as the ORM, giving type-safe queries and clean migrations that talk directly to TypeScript.&lt;/p&gt;

&lt;p&gt;Upstash Redis handles two things: caching expensive AI results so the same document doesn't get reprocessed unnecessarily, and rate limiting API routes to protect endpoints from abuse. Both are the kind of infrastructure you don't think about until you need them — I added Upstash early and it paid off.&lt;/p&gt;

&lt;p&gt;Clerk handles user authentication — sessions, OAuth, user management. It took roughly a day to integrate and has needed zero maintenance since. Rolling your own auth is a classic beginner mistake; I skipped it.&lt;/p&gt;

&lt;p&gt;Dodo Payments handles the billing and subscription infrastructure. Svix sits alongside it for webhook signature verification — ensuring incoming webhooks from Clerk and Dodo are legitimate before acting on them.&lt;/p&gt;
&lt;h2&gt;
  
  
  🤖 AI Layer
&lt;/h2&gt;

&lt;p&gt;Full pipeline breakdown in the next section, but the core libraries:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Vercel AI SDK — the backbone of all AI integration, handles streaming responses seamlessly to the React frontend&lt;/li&gt;
&lt;li&gt;Google Gemini + OpenAI (via Vercel AI SDK) — fast, efficient reasoning for notes, quizzes, and all text generation&lt;/li&gt;
&lt;li&gt;Deepgram — high-quality real-time audio and speech transcription, covering both uploaded files and live recordings&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  📄 File Processing
&lt;/h2&gt;

&lt;p&gt;The content ingestion layer handles more formats than most people expect:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Documents: pdf-parse for PDFs, mammoth for Word files, officeparser for PowerPoint and other Office formats&lt;/li&gt;
&lt;li&gt;YouTube: youtube-transcript for caption extraction, @distube/ytdl-core and yt-dlp-exec for videos requiring deeper audio/video processing&lt;/li&gt;
&lt;li&gt;Audio: Deepgram SDK for both batch and real-time speech-to-text&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.skoowlai.com/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fskoowlai.com%2Fopengraph-image.png%3F597a1dcee2783d84" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.skoowlai.com/" rel="noopener noreferrer" class="c-link"&gt;
            Skoowl AI - AI-Powered Study Assistant
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Turn your lectures into structured notes, flashcards, quizzes, and mind maps instantly.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.skoowlai.com%2Fskoowl-logo.png%3Fv%3D2"&gt;
          skoowlai.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  The AI Pipeline — How It Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Document Pipeline (PDF / Word / Text)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The file is parsed server-side to extract raw text, then sent to Gemini via the Vercel AI SDK with a structured prompt. The SDK handles streaming, so users see notes generating word-by-word rather than waiting for a full response.&lt;/p&gt;

&lt;p&gt;For structured outputs like flashcard arrays or quiz sets, the model returns JSON validated by Zod — if the response doesn't match the expected schema, it retries before surfacing an error to the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Audio Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deepgram handles both modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Uploaded audio: Batch transcription — the file is sent and a full transcript comes back, then feeds into the same Gemini pipeline.&lt;/li&gt;
&lt;li&gt;Live recording: Deepgram's streaming API returns incremental transcripts as the user speaks, with low enough latency that the text appears almost in real-time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;3. YouTube Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one is the reason Skoowl exists — my brother's quiz requests, remember?&lt;/p&gt;

&lt;p&gt;The pipeline uses the youtube-transcript library to extract auto-generated or creator-provided captions from a YouTube URL, cleans them up, and sends them through Gemini like any other text input. For videos without captions, yt-dlp-exec pulls the audio, which then goes through Deepgram for transcription first.&lt;/p&gt;

&lt;p&gt;Paste a link, get a quiz in seconds. My brother now generates his own quizzes. Problem solved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Mind Maps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model is prompted to return a structured JSON graph representing the topic hierarchy. React Flow renders it as an interactive visual — users can drag, expand, and explore nodes. Zod validates the JSON schema before it ever reaches the renderer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment &amp;amp; Getting to 100+ Users
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zcohdidxyzeio31zdhk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zcohdidxyzeio31zdhk.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Skoowl AI is deployed on &lt;strong&gt;Vercel&lt;/strong&gt;, which was the obvious choice given the Next.js stack. Edge functions, automatic preview deployments per branch, and zero-config CI/CD — it removed an entire category of DevOps problems.&lt;/p&gt;

&lt;p&gt;Getting the first users was uncomfortable in a productive way. I posted in a few Reddit communities, framed it as "I built this, honest feedback welcome." The early adopters came from that. Real usage exposed things that no amount of local testing would have found — different file encodings, unexpected internet speeds affecting streaming, and mobile layouts that needed work.&lt;/p&gt;

&lt;p&gt;The 100+ international users milestone came within the first few weeks, which validated the core idea more than any internal testing could.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next for Skoowl AI
&lt;/h2&gt;

&lt;p&gt;The current feature set covers the core study workflow, but there's a lot more planned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Discover&lt;/strong&gt; — An AI-powered research assistant that automatically finds relevant materials for your topic — web searches, books, papers — so you don't have to start from scratch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📊 Slides&lt;/strong&gt; — Auto-generate presentation slides directly from your uploaded content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🖼️ Infographics&lt;/strong&gt; — Turn dense information into shareable visual summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🎙️ AI Podcast&lt;/strong&gt; — Convert your study materials into a conversational audio format you can listen to on the go&lt;/li&gt;
&lt;li&gt;And many more new features in the future...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Discover feature is the one I'm most excited about. Right now, users bring their own content to Skoowl AI. Discover flips that — Skoowl AI helps you find the content in the first place.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Building Skoowl AI taught me more in a few months than I could have learned any other way. Shipping something real, getting it in front of real users, and watching it actually solve a problem — even a small one like my brother's quiz requests — is a different kind of education.&lt;/p&gt;

&lt;p&gt;If you've built something similar, have thoughts on the stack, or want to explore a collaboration, drop a comment. I read all of them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔗 &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://skoowlai.com" rel="noopener noreferrer"&gt;skoowlai.com&lt;/a&gt;&lt;br&gt;
🔗 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/faissssss/skoowlai" rel="noopener noreferrer"&gt;github.com/faissssss/skoowlai&lt;/a&gt;&lt;br&gt;
🔗 &lt;strong&gt;Instagram:&lt;/strong&gt; &lt;a href="https://www.instagram.com/skoowlai/" rel="noopener noreferrer"&gt;instagram.com/skoowlai/&lt;/a&gt;&lt;br&gt;
🔗 &lt;strong&gt;X:&lt;/strong&gt; &lt;a href="https://x.com/skoowlai" rel="noopener noreferrer"&gt;x.com/skoowlai&lt;/a&gt;&lt;br&gt;
🔗 &lt;strong&gt;Linkedin:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/company/skoowl-ai/" rel="noopener noreferrer"&gt;linkedin.com/company/skoowl-ai/&lt;/a&gt;&lt;br&gt;
🔗 &lt;strong&gt;Tiktok:&lt;/strong&gt; &lt;a href="https://www.tiktok.com/@skoowlai" rel="noopener noreferrer"&gt;tiktok.com/@skoowlai&lt;/a&gt;&lt;br&gt;
🔗 &lt;strong&gt;Discord:&lt;/strong&gt; &lt;a href="https://discord.com/invite/QWJXV9k8" rel="noopener noreferrer"&gt;discord.com/invite/QWJXV9k8&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This was my first SaaS. It won't be my last.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>nextjs</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
