No Dataset? No Problem. How I Curated a Custom AI Dataset From Instagram & Pinterest to Build a Pose Suggester

#ai #python #architecture #programming

When you start a new Machine Learning project, you pray there’s a clean, ready-to-use dataset on Kaggle or Hugging Face.

But when I decided to build an AI-powered Pose Suggester—a system that analyzes a user's background (like a cafe or a park) and overlays a suggested 2D stick-figure pose skeleton on their camera screen—I hit a massive wall: No such dataset existed.

As far as I could find, nobody had published an open-source dataset mapping lifestyle locations to "good" aesthetic poses.

If I wanted this project to exist, I had to stop looking for a dataset and start building one. Here is exactly how I built, annotated, and augmented a custom computer vision dataset from scratch.

Step 1: Defining the Taxonomy
Before scrolling mindlessly for images, I needed a strict data structure. If your categories are messy, your neural network will learn absolute nonsense. I broke my target universe down into 3 core environments, each with 2 framing subcategories:

Cafe Indoor (Full Body / Waist Up)

Nature/Parks (Full Body / Waist Up)

Urban Street (Full Body / Waist Up)

My target was to source 50 high-quality, distinct anchor images per subcategory, giving me a baseline of 300 reference images.
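To make that structure concrete, here is a minimal sketch of the taxonomy as code. The identifiers (`ENVIRONMENTS`, `FRAMINGS`, `build_taxonomy`, `missing_images`) and the `environment/framing` key format are my own illustrative choices, not names from the actual project; only the three environments, two framings, and the 50-image target come from the plan above.

```python
from itertools import product

# Names below are illustrative assumptions; only the counts come from the plan.
ENVIRONMENTS = ("cafe_indoor", "nature_parks", "urban_street")
FRAMINGS = ("full_body", "waist_up")
IMAGES_PER_SUBCATEGORY = 50


def build_taxonomy():
    """Map every environment/framing subcategory to its target image count."""
    return {
        f"{env}/{framing}": IMAGES_PER_SUBCATEGORY
        for env, framing in product(ENVIRONMENTS, FRAMINGS)
    }


def missing_images(collected):
    """Return how many images each subcategory still needs.

    `collected` maps a subcategory key to the number of images sourced so far.
    Subcategories that already met the target are left out of the result.
    """
    targets = build_taxonomy()
    return {
        key: target - collected.get(key, 0)
        for key, target in targets.items()
        if collected.get(key, 0) < target
    }


if __name__ == "__main__":
    taxonomy = build_taxonomy()
    print(len(taxonomy), "subcategories,", sum(taxonomy.values()), "images total")
```

A helper like `missing_images` is handy while sourcing, because it shows at a glance which subcategory is still short of its quota.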

Step 2: Sourcing via Pinterest & Instagram
Where do you find the best examples of people posing naturally in everyday environments? Instagram and Pinterest.

I spent hours reverse-engineering what makes a photo "good" for these platforms, looking for distinct body compositions. However, building a dataset this way comes with strict engineering rules:

Avoiding Bias: I couldn't just download photos of the same 5 influencers. The model needed to learn diverse body types, heights, and clothing to ensure the pose estimation wouldn't fail in the real world.

Background Variety: "Nature" couldn't just mean a green lawn; it needed to include forests, beaches, and hiking trails so the Scene Classifier wouldn't overfit to a single shade of green.

Clarity: Every image needed a clear, unobstructed view of the primary subject so the pose landmarker wouldn't get confused by crowded backgrounds.

Step 3: Automated Labeling (The MediaPipe Hack)
Manually labeling coordinates for 17 (COCO-style) to 33 (MediaPipe-style) skeletal joints across hundreds of images sounds like a nightmare. To save my sanity, I engineered an automated pipeline.

Instead of hand-labeling, I ran MediaPipe’s Pose Landmarker over my curated directory.

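```python
# The core logic behind the automation pipeline
import mediapipe as mp

# `landmarker` is a MediaPipe PoseLandmarker instance and `mp_image` is an
# mp.Image loaded from the curated directory; both are set up elsewhere in
# the script (not shown here).

# MediaPipe processes the curated image and auto-extracts joint coordinates
detection_result = landmarker.detect(mp_image)

# serialize_to_json is the pipeline's own helper (not shown here)
keypoints = serialize_to_json(detection_result.pose_landmarks)
```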
The script automatically detected the human subjects, extracted their coordinates, normalized the vectors relative to the torso scale, and saved them into a neat pose_library/ folder as JSON annotations. Boom. Zero manual coordinate labeling required.
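The torso-scale normalization is the part that makes poses comparable across photos, so here is a minimal sketch of one way to do it. Treat the details as assumptions: it centers the skeleton on the hip midpoint and divides by the shoulder-midpoint-to-hip-midpoint distance, takes keypoints as `(x, y)` pixel coordinates, and uses a JSON layout I made up for illustration. The landmark indices (11/12 for shoulders, 23/24 for hips) follow MediaPipe's 33-point pose topology.

```python
import json
import math

# Indices from MediaPipe's 33-landmark pose topology.
LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12
LEFT_HIP, RIGHT_HIP = 23, 24


def _midpoint(a, b):
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def normalize_by_torso(keypoints):
    """Center keypoints on the hip midpoint and scale by torso length.

    `keypoints` is a list of (x, y) pixel coordinates, one per landmark.
    This definition of "torso scale" is an assumption for this sketch.
    """
    mid_shoulder = _midpoint(keypoints[LEFT_SHOULDER], keypoints[RIGHT_SHOULDER])
    mid_hip = _midpoint(keypoints[LEFT_HIP], keypoints[RIGHT_HIP])
    torso = math.hypot(mid_shoulder[0] - mid_hip[0], mid_shoulder[1] - mid_hip[1])
    if torso == 0:
        raise ValueError("degenerate pose: shoulders and hips coincide")
    return [((x - mid_hip[0]) / torso, (y - mid_hip[1]) / torso) for x, y in keypoints]


def to_annotation_json(image_name, keypoints):
    """Serialize normalized keypoints; the JSON layout here is hypothetical."""
    normalized = normalize_by_torso(keypoints)
    return json.dumps({
        "image": image_name,
        "keypoints": [
            {"index": i, "x": x, "y": y} for i, (x, y) in enumerate(normalized)
        ],
    })
```

Because every coordinate is divided by the torso length, the same pose shot from near or far ends up with the same numbers, which is what you want before comparing poses to each other.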

Step 4: Multiplying Data with Smart Augmentation
300 images is plenty for a prototype, but a deep CNN like MobileNetV2 is likely to overfit on a dataset that small, even when fine-tuned from pretrained weights. I needed to scale my 300 images up to roughly 1,800 training samples.

Using the albumentations library, I applied strategic augmentations to simulate real-world conditions without breaking the underlying pose logic:

Brightness & Contrast Jitter: To simulate poor lighting inside dim cafes or harsh sunlight outdoors.

Blur & Noise: To prepare the model for shaky, low-quality phone cameras.

Horizontal Flips: To double the dataset instantly.
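One way to get from 300 images to roughly 1,800 samples is to cross two geometric options (original, flipped) with three photometric options (untouched, brightness/contrast, blur/noise), which gives 6 variants per image. That exact combination is my assumption for this sketch, not a record of the real script, and the plan below only decides which recipe applies to which image; the pixel work itself would still be done by albumentations.

```python
from itertools import product

# Assumed recipe grid: 2 geometric x 3 photometric = 6 variants per image.
GEOMETRIC = ("none", "hflip")
PHOTOMETRIC = ("none", "brightness_contrast", "blur_noise")


def build_augmentation_plan(image_ids):
    """Return one plan entry per (image, geometric, photometric) combination.

    Photometric changes leave keypoints untouched; only a horizontal flip
    requires the pose annotation to be remapped.
    """
    plan = []
    for image_id in image_ids:
        for geometric, photometric in product(GEOMETRIC, PHOTOMETRIC):
            plan.append({
                "image_id": image_id,
                "geometric": geometric,
                "photometric": photometric,
                "remap_keypoints": geometric == "hflip",
            })
    return plan


if __name__ == "__main__":
    plan = build_augmentation_plan(range(300))
    print(len(plan), "training samples planned")
```

Keeping the plan separate from the image processing makes it easy to see which samples need their keypoints adjusted and which can reuse the original annotation as-is.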

The Tricky Part: When you flip an image horizontally, your labels break! A person's left hand becomes their right hand. My augmentation script had to explicitly intercept the MediaPipe keypoints, mirror each x-coordinate, and swap the left/right joint indices so the labels stayed consistent with the flipped image.
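Here is a minimal sketch of that fix, assuming keypoints are stored as 33 `(x, y)` pairs normalized to the image (x in [0, 1]), in MediaPipe's landmark order. The function name is hypothetical. Note that the flip has two parts: mirroring x, and then swapping each left landmark with its right counterpart.

```python
# Left/right landmark pairs in MediaPipe's 33-point pose topology
# (eyes, ears, mouth corners, shoulders, elbows, wrists, fingers,
# hips, knees, ankles, heels, foot indices). Index 0 (nose) has no pair.
LEFT_RIGHT_PAIRS = [
    (1, 4), (2, 5), (3, 6), (7, 8), (9, 10),
    (11, 12), (13, 14), (15, 16), (17, 18), (19, 20), (21, 22),
    (23, 24), (25, 26), (27, 28), (29, 30), (31, 32),
]


def flip_keypoints_horizontally(keypoints):
    """Return keypoints matching a horizontally flipped image.

    `keypoints` is a list of 33 (x, y) pairs with x normalized to [0, 1].
    """
    if len(keypoints) != 33:
        raise ValueError("expected 33 MediaPipe pose landmarks")
    # Step 1: mirror every point across the vertical center line.
    mirrored = [(1.0 - x, y) for x, y in keypoints]
    # Step 2: swap left/right labels so "left wrist" is still the
    # subject's anatomical left wrist as it appears in the flipped image.
    for left, right in LEFT_RIGHT_PAIRS:
        mirrored[left], mirrored[right] = mirrored[right], mirrored[left]
    return mirrored
```

A quick sanity check is to flip twice: applying the function two times in a row should give back the original keypoints. If the stored coordinates are centered on the body instead of normalized to the image, the mirror step becomes `-x` instead of `1.0 - x`.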

Key Takeaways from Building Data
Building your own dataset teaches you things a clean Kaggle download never can:

Data Quality > Model Complexity: A lightweight model like MobileNetV2 trained on tightly curated, high-quality data will often outperform a much larger model trained on garbage data.

Think Like a Product Manager: Sourcing data forces you to think about how your users will actually use the app. Defining "Egocentric" (first-person/glasses) vs "Exocentric" (third-person/laptop) views early completely changed how I filtered my images.

Now that the dataset is verified and locked, Phase 2 is officially underway: fine-tuning the scene classifier and building the K-Nearest Neighbors (KNN) engine that matches a user's background with a fitting pose.

Have you ever had to build a dataset from scratch for a passion project? What did your pipeline look like? Let me know in the comments!