Olamide Olaniyan
I Tried to Normalize 8 Social Platforms Into One JSON Schema. Here's What Broke

I thought this was going to be easy.

Take data from TikTok, Instagram, YouTube, Reddit, LinkedIn, Threads, X, and Facebook. Map it into one clean shape. Ship it. Move on.

That lasted about a day.

By day two, I had already learned the first ugly truth of social data engineering:

the hard part is not collecting the data. The hard part is deciding what the same thing even means across platforms.

"Likes" are not always likes. "Views" are not always views. Some platforms expose shares publicly, some do not. One platform gives you createTime, another gives you ISO timestamps, another gives you nested objects with three different possible IDs depending on the endpoint.

If you're building a social dashboard, creator analytics tool, moderation workflow, or competitor monitoring system, you hit this wall fast.

So this post is the version I wish I had read earlier: what broke, what schema I kept, what I stopped trying to normalize, and how I now build a social media JSON schema without lying to myself.

The Mistake I Made First

My first schema looked clean.

{
  "id": "123",
  "platform": "tiktok",
  "author": "creator_handle",
  "text": "post text",
  "likes": 100,
  "comments": 12,
  "views": 5000,
  "shares": 4,
  "published_at": "2026-04-24T12:00:00Z"
}

That shape is attractive because it feels universal.

It is also incomplete in exactly the ways that cause expensive bugs later.

Here is what broke almost immediately:

  • TikTok and Instagram expose author and stats differently.
  • YouTube comments do not behave like short-form video posts.
  • Reddit has score, comments, and post metadata that do not map cleanly to "likes/views/shares."
  • LinkedIn often has less public engagement detail than people expect.
  • Some endpoints return counts as nested stats objects. Others flatten them.
  • URLs and canonical IDs vary wildly by platform.
  • Missing values are common, and missing does not mean zero.

That last point matters a lot.

If a platform does not expose a public field, setting it to 0 is false precision. You are making downstream analysis worse while pretending you are making it simpler.
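One way to see the damage: a quick sketch (my own numbers, not from any platform) comparing an average when missing counts are coerced to 0 versus kept as None.

```python
# Hypothetical share counts across platforms; None means "not publicly exposed".
shares = [41, None, 12, None, 7]

# Wrong: coercing missing to 0 drags the average down with fake data points.
naive_avg = sum(s or 0 for s in shares) / len(shares)

# Better: average only over the values that actually exist.
known = [s for s in shares if s is not None]
honest_avg = sum(known) / len(known)

print(naive_avg, honest_avg)  # 12.0 vs 20.0
```

Same input, two very different "average shares". Only one of them is telling the truth.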

The Rule That Saved Me

I stopped trying to force every platform into the exact same semantic model.

Instead, I switched to this rule:

Normalize the envelope. Preserve the raw meaning.

That means every object should share a predictable wrapper, but platform-specific nuance should stay visible.

The schema I ended up liking looks more like this:

{
  "platform": "tiktok",
  "entity_type": "post",
  "external_id": "735902002991",
  "canonical_url": "https://www.tiktok.com/@creator/video/735902002991",
  "author": {
    "handle": "creator",
    "display_name": "Creator Name"
  },
  "content": {
    "text": "caption text",
    "media_type": "video"
  },
  "metrics": {
    "likes": 1200,
    "comments": 87,
    "shares": 41,
    "views": 54000
  },
  "published_at": "2026-04-24T12:00:00Z",
  "availability": {
    "views": "public",
    "shares": "public",
    "saves": "not_available"
  },
  "raw": {}
}

Notice what changed:

  • metrics is grouped instead of flattened.
  • availability tells downstream code whether a field is real, missing, or unavailable.
  • raw is preserved for debugging, reprocessing, and future schema changes.

This is much less elegant than the first version.

It is also much more survivable.
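In practice the availability map earns its keep in tiny helpers like this one. The field names match the schema above; the helper itself is my own sketch, not part of any API.

```python
def get_metric(post, name):
    """Return (value, status) so callers can tell "really zero"
    apart from "not available on this platform"."""
    status = post.get("availability", {}).get(name, "unknown")
    if status in ("public", "public_proxy"):
        return post.get("metrics", {}).get(name), status
    return None, status

post = {
    "metrics": {"likes": 1200, "shares": None},
    "availability": {"likes": "public", "shares": "not_available"},
}

print(get_metric(post, "likes"))   # (1200, 'public')
print(get_metric(post, "shares"))  # (None, 'not_available')
```

Downstream code that receives the status alongside the value can decide for itself whether to chart, skip, or flag the field.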

What I Normalize Now

These are the fields I consider worth normalizing across platforms:

  • platform
  • entity_type
  • external_id
  • canonical_url
  • author.handle
  • author.display_name
  • content.text
  • content.media_type
  • metrics.likes
  • metrics.comments
  • metrics.views
  • metrics.shares
  • published_at

That is enough to make cross-platform dashboards, alerts, ranking jobs, and exports practical.

But I do not pretend every field exists everywhere.
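Here is what that looks like in a ranking job: a sketch (with invented example posts) that ranks by engagement rate and simply skips posts whose platform does not expose views, instead of pretending views are 0.

```python
# Sketch: rank normalized posts by likes-per-view, skipping posts whose
# platform does not expose views rather than treating missing as 0.
posts = [
    {"platform": "tiktok", "external_id": "a",
     "metrics": {"likes": 1200, "views": 54000}},
    {"platform": "reddit", "external_id": "b",
     "metrics": {"likes": 980, "views": None}},
    {"platform": "instagram", "external_id": "c",
     "metrics": {"likes": 300, "views": 5000}},
]

rankable = [p for p in posts if p["metrics"]["views"]]
ranked = sorted(rankable,
                key=lambda p: p["metrics"]["likes"] / p["metrics"]["views"],
                reverse=True)

print([p["external_id"] for p in ranked])  # ['c', 'a']
```

The Reddit post drops out of this particular ranking, which is correct: it was never comparable on this metric in the first place.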

JavaScript Version: A Normalizer That Stays Honest

This example is intentionally boring. That is a good sign.

If your normalization layer is too clever, it becomes unmaintainable fast.

function pickFirst(...values) {
  return values.find(value => value !== undefined && value !== null && value !== '');
}

function toNumber(value) {
  const number = Number(value);
  return Number.isFinite(number) ? number : null;
}

function toIsoDate(value) {
  if (!value) return null;

  if (typeof value === 'number') {
    return new Date(value * 1000).toISOString();
  }

  const parsed = new Date(value);
  return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString();
}

function normalizePost(platform, raw) {
  const base = {
    platform,
    entity_type: 'post',
    external_id: null,
    canonical_url: null,
    author: {
      handle: null,
      display_name: null,
    },
    content: {
      text: null,
      media_type: null,
    },
    metrics: {
      likes: null,
      comments: null,
      views: null,
      shares: null,
    },
    published_at: null,
    availability: {
      likes: 'unknown',
      comments: 'unknown',
      views: 'unknown',
      shares: 'unknown',
    },
    raw,
  };

  if (platform === 'tiktok') {
    return {
      ...base,
      external_id: pickFirst(raw.id, raw.aweme_id),
      canonical_url: pickFirst(raw.share_url, raw.url),
      author: {
        handle: pickFirst(raw.author?.uniqueId, raw.author?.handle),
        display_name: pickFirst(raw.author?.nickname, raw.author?.displayName),
      },
      content: {
        text: pickFirst(raw.desc, raw.caption),
        media_type: 'video',
      },
      metrics: {
        likes: toNumber(pickFirst(raw.stats?.diggCount, raw.like_count)),
        comments: toNumber(pickFirst(raw.stats?.commentCount, raw.comment_count)),
        views: toNumber(pickFirst(raw.stats?.playCount, raw.view_count)),
        shares: toNumber(pickFirst(raw.stats?.shareCount, raw.share_count)),
      },
      published_at: toIsoDate(pickFirst(raw.createTime, raw.created_at)),
      availability: {
        likes: 'public',
        comments: 'public',
        views: 'public',
        shares: 'public',
      },
    };
  }

  if (platform === 'youtube_comment') {
    return {
      ...base,
      entity_type: 'comment',
      external_id: pickFirst(raw.commentId, raw.id),
      canonical_url: pickFirst(raw.url),
      author: {
        handle: pickFirst(raw.author?.channelHandle, raw.author?.name),
        display_name: pickFirst(raw.author?.name, raw.authorDisplayName),
      },
      content: {
        text: pickFirst(raw.content, raw.text),
        media_type: 'text',
      },
      metrics: {
        likes: toNumber(pickFirst(raw.likes, raw.likeCount)),
        comments: toNumber(pickFirst(raw.replyCount, raw.repliesCount)),
        views: null,
        shares: null,
      },
      published_at: toIsoDate(pickFirst(raw.publishedAt, raw.timestamp)),
      availability: {
        likes: 'public',
        comments: 'public',
        views: 'not_available',
        shares: 'not_available',
      },
    };
  }

  if (platform === 'reddit') {
    return {
      ...base,
      external_id: pickFirst(raw.id, raw.post_id),
      canonical_url: pickFirst(raw.permalink, raw.url),
      author: {
        handle: pickFirst(raw.author, raw.author_name),
        display_name: pickFirst(raw.author, raw.author_name),
      },
      content: {
        text: pickFirst(raw.selftext, raw.title),
        media_type: raw.is_video ? 'video' : 'text',
      },
      metrics: {
        likes: toNumber(pickFirst(raw.score, raw.upvotes)),
        comments: toNumber(pickFirst(raw.num_comments, raw.comment_count)),
        views: null,
        shares: null,
      },
      published_at: toIsoDate(pickFirst(raw.created_utc, raw.created_at)),
      availability: {
        likes: 'public_proxy',
        comments: 'public',
        views: 'not_available',
        shares: 'not_available',
      },
    };
  }

  throw new Error(`Unsupported platform: ${platform}`);
}

const tiktokExample = {
  id: '735902002991',
  desc: 'How I batch content research in 15 minutes',
  createTime: 1713936000,
  author: { uniqueId: 'creator_handle', nickname: 'Creator Name' },
  stats: { diggCount: 1200, commentCount: 87, playCount: 54000, shareCount: 41 },
  share_url: 'https://www.tiktok.com/@creator_handle/video/735902002991',
};

console.log(normalizePost('tiktok', tiktokExample));

The important part is not the exact shape above.

The important part is that unsupported or unavailable fields stay explicit.

That makes your downstream analytics much easier to trust.

Python Version: Same Strategy, Different Runtime

If your data jobs live in Python, the same idea applies.

from datetime import datetime, timezone


def pick_first(*values):
    for value in values:
        if value not in (None, '', []):
            return value
    return None


def to_number(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def to_iso_date(value):
    if value is None:
        return None

    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()

    try:
        return datetime.fromisoformat(str(value).replace('Z', '+00:00')).isoformat()
    except ValueError:
        return None


def normalize_post(platform, raw):
    base = {
        'platform': platform,
        'entity_type': 'post',
        'external_id': None,
        'canonical_url': None,
        'author': {
            'handle': None,
            'display_name': None,
        },
        'content': {
            'text': None,
            'media_type': None,
        },
        'metrics': {
            'likes': None,
            'comments': None,
            'views': None,
            'shares': None,
        },
        'published_at': None,
        'availability': {
            'likes': 'unknown',
            'comments': 'unknown',
            'views': 'unknown',
            'shares': 'unknown',
        },
        'raw': raw,
    }

    if platform == 'tiktok':
        return {
            **base,
            'external_id': pick_first(raw.get('id'), raw.get('aweme_id')),
            'canonical_url': pick_first(raw.get('share_url'), raw.get('url')),
            'author': {
                'handle': pick_first(raw.get('author', {}).get('uniqueId'), raw.get('author', {}).get('handle')),
                'display_name': pick_first(raw.get('author', {}).get('nickname'), raw.get('author', {}).get('displayName')),
            },
            'content': {
                'text': pick_first(raw.get('desc'), raw.get('caption')),
                'media_type': 'video',
            },
            'metrics': {
                'likes': to_number(pick_first(raw.get('stats', {}).get('diggCount'), raw.get('like_count'))),
                'comments': to_number(pick_first(raw.get('stats', {}).get('commentCount'), raw.get('comment_count'))),
                'views': to_number(pick_first(raw.get('stats', {}).get('playCount'), raw.get('view_count'))),
                'shares': to_number(pick_first(raw.get('stats', {}).get('shareCount'), raw.get('share_count'))),
            },
            'published_at': to_iso_date(pick_first(raw.get('createTime'), raw.get('created_at'))),
            'availability': {
                'likes': 'public',
                'comments': 'public',
                'views': 'public',
                'shares': 'public',
            },
        }

    if platform == 'youtube_comment':
        return {
            **base,
            'entity_type': 'comment',
            'external_id': pick_first(raw.get('commentId'), raw.get('id')),
            'canonical_url': raw.get('url'),
            'author': {
                'handle': pick_first(raw.get('author', {}).get('channelHandle'), raw.get('author', {}).get('name')),
                'display_name': pick_first(raw.get('author', {}).get('name'), raw.get('authorDisplayName')),
            },
            'content': {
                'text': pick_first(raw.get('content'), raw.get('text')),
                'media_type': 'text',
            },
            'metrics': {
                'likes': to_number(pick_first(raw.get('likes'), raw.get('likeCount'))),
                'comments': to_number(pick_first(raw.get('replyCount'), raw.get('repliesCount'))),
                'views': None,
                'shares': None,
            },
            'published_at': to_iso_date(pick_first(raw.get('publishedAt'), raw.get('timestamp'))),
            'availability': {
                'likes': 'public',
                'comments': 'public',
                'views': 'not_available',
                'shares': 'not_available',
            },
        }

    if platform == 'reddit':
        return {
            **base,
            'external_id': pick_first(raw.get('id'), raw.get('post_id')),
            'canonical_url': pick_first(raw.get('permalink'), raw.get('url')),
            'author': {
                'handle': pick_first(raw.get('author'), raw.get('author_name')),
                'display_name': pick_first(raw.get('author'), raw.get('author_name')),
            },
            'content': {
                'text': pick_first(raw.get('selftext'), raw.get('title')),
                'media_type': 'video' if raw.get('is_video') else 'text',
            },
            'metrics': {
                'likes': to_number(pick_first(raw.get('score'), raw.get('upvotes'))),
                'comments': to_number(pick_first(raw.get('num_comments'), raw.get('comment_count'))),
                'views': None,
                'shares': None,
            },
            'published_at': to_iso_date(pick_first(raw.get('created_utc'), raw.get('created_at'))),
            'availability': {
                'likes': 'public_proxy',
                'comments': 'public',
                'views': 'not_available',
                'shares': 'not_available',
            },
        }

    raise ValueError(f'Unsupported platform: {platform}')


tiktok_example = {
    'id': '735902002991',
    'desc': 'How I batch content research in 15 minutes',
    'createTime': 1713936000,
    'author': {'uniqueId': 'creator_handle', 'nickname': 'Creator Name'},
    'stats': {'diggCount': 1200, 'commentCount': 87, 'playCount': 54000, 'shareCount': 41},
    'share_url': 'https://www.tiktok.com/@creator_handle/video/735902002991',
}

print(normalize_post('tiktok', tiktok_example))
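If you want a guardrail on the envelope itself, even a lightweight required-keys check catches drift early. This is a stdlib-only sketch; a fuller setup would use a proper JSON Schema validator.

```python
# Minimal envelope check: every normalized object must carry these keys.
REQUIRED_TOP_LEVEL = {
    "platform", "entity_type", "external_id", "canonical_url",
    "author", "content", "metrics", "published_at", "availability", "raw",
}

def check_envelope(post):
    """Return the set of missing top-level keys (empty set means OK)."""
    return REQUIRED_TOP_LEVEL - set(post)

good = {key: None for key in REQUIRED_TOP_LEVEL}
bad = {"platform": "tiktok", "metrics": {}}

print(check_envelope(good))  # set()
print(sorted(check_envelope(bad)))
```

Running this against every normalizer's output in CI is cheap insurance against a platform branch quietly forgetting a field.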

What I Stopped Trying to Normalize

This part saved me more time than any helper function.

There are fields I no longer force into a universal schema unless the use case truly needs them:

  • saves/bookmarks
  • watch time
  • audience demographics
  • thread depth nuances
  • platform-specific moderation states
  • estimated reach proxies

Why?

Because those fields mean different things, appear inconsistently, or are missing often enough that the abstraction becomes misleading.

If a product feature really depends on one of those platform-specific concepts, I expose that field under a platform namespace instead.

That keeps the shared schema clean and the specialized data honest.
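Concretely, the namespacing I mean looks like this. The field names under the namespace are illustrative.

```python
# Platform-specific fields live under a namespaced key instead of being
# forced into the shared metrics object. Names here are illustrative.
post = {
    "platform": "tiktok",
    "metrics": {"likes": 1200, "comments": 87},
    "platform_data": {
        "tiktok": {
            "saves": 310,          # only meaningful where the platform exposes it
            "duet_enabled": True,  # has no cross-platform equivalent
        }
    },
}

# Shared code reads metrics; feature-specific code opts into the namespace.
saves = post.get("platform_data", {}).get("tiktok", {}).get("saves")
print(saves)  # 310
```

Generic dashboards never have to know `saves` exists, and the TikTok-specific feature that needs it gets the real value instead of a lossy cross-platform approximation.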

Honest Alternatives

There are three valid approaches here.

1. Normalize everything into one schema

Good for dashboards and cross-platform ranking.

Bad if you start pretending unavailable fields are universal.

2. Keep completely separate per-platform models

Good for correctness.

Bad for product teams that want a single report, feed, or analytics layer.

3. Keep a thin shared schema plus raw JSON

This is the one I use most often.

It gives you enough consistency for application logic without throwing away nuance.

If you are early, start there.
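The reason option 3 ages well: when the shared schema grows a field, you can backfill it from the stored raw payload instead of re-collecting everything. A sketch, where `music_title` is a hypothetical new field:

```python
# Sketch: backfilling a newly added schema field from the raw payload
# kept beside each normalized object. "music_title" is hypothetical.
stored = {
    "platform": "tiktok",
    "content": {"text": "caption text", "media_type": "video"},
    "raw": {"desc": "caption text", "music": {"title": "original sound"}},
}

def backfill_music_title(record):
    if record["platform"] == "tiktok":
        record["content"]["music_title"] = (
            record["raw"].get("music", {}).get("title")
        )
    return record

print(backfill_music_title(stored)["content"]["music_title"])
```

Without the raw payload sitting next to the normalized object, that migration is a re-scrape. With it, it is a batch job over data you already have.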

Where SociaVault Fits

This is exactly where a data layer like SociaVault helps.

I do not want to spend my week maintaining eight different scraping stacks just to get to the actual product problem, which is normalization, ranking, alerting, and reporting.

So my usual split is:

  • use SociaVault for the public social data collection layer
  • normalize into my own schema
  • keep the raw payload beside the normalized object

That keeps the engineering work focused on product logic instead of collection plumbing.

Final Take

Trying to build a universal social media JSON schema taught me something useful: the goal is not to erase platform differences.

The goal is to make those differences manageable.

Normalize the envelope. Preserve the raw payload. Mark unavailable fields honestly. Avoid fake precision.

If you are building social dashboards, creator analytics, competitor tracking, or moderation tools, that approach will hold up much better than the "one perfect schema" fantasy.

And if you want to skip the collection layer and spend your time on the normalization layer instead, start with SociaVault.

#webdev #api #json #datascience #javascript
