I thought this was going to be easy.
Take data from TikTok, Instagram, YouTube, Reddit, LinkedIn, Threads, X, and Facebook. Map it into one clean shape. Ship it. Move on.
That lasted about a day.
By day two, I had already learned the first ugly truth of social data engineering:
the hard part is not collecting the data. The hard part is deciding what the same thing even means across platforms.
"Likes" are not always likes. "Views" are not always views. Some platforms expose shares publicly, some do not. One platform gives you createTime, another gives you ISO timestamps, another gives you nested objects with three different possible IDs depending on the endpoint.
If you're building a social dashboard, creator analytics tool, moderation workflow, or competitor monitoring system, you hit this wall fast.
So this post is the version I wish I had read earlier: what broke, what schema I kept, what I stopped trying to normalize, and how I now build a social media JSON schema without lying to myself.
The Mistake I Made First
My first schema looked clean.
{
  "id": "123",
  "platform": "tiktok",
  "author": "creator_handle",
  "text": "post text",
  "likes": 100,
  "comments": 12,
  "views": 5000,
  "shares": 4,
  "published_at": "2026-04-24T12:00:00Z"
}
That shape is attractive because it feels universal.
It is also incomplete in exactly the ways that cause expensive bugs later.
Here is what broke almost immediately:
- TikTok and Instagram expose author and stats differently.
- YouTube comments do not behave like short-form video posts.
- Reddit has score, comments, and post metadata that do not map cleanly to "likes/views/shares."
- LinkedIn often has less public engagement detail than people expect.
- Some endpoints return counts as nested stats objects. Others flatten them.
- URLs and canonical IDs vary wildly by platform.
- Missing values are common, and missing does not mean zero.
That last point matters a lot.
If a platform does not expose a public field, setting it to 0 is false precision. You are making downstream analysis worse while pretending you are making it simpler.
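A tiny example makes the damage concrete. Suppose three posts, and one platform hides view counts (a sketch; the numbers are made up):

const views = [100, null, 200]; // null = the platform does not expose views

// Coercing missing to 0 drags the average down: (100 + 0 + 200) / 3 = 100
const withZeros = views.map(v => v ?? 0).reduce((a, b) => a + b, 0) / views.length;

// Skipping missing values reports only what you actually know: (100 + 200) / 2 = 150
const known = views.filter(v => v !== null);
const honest = known.reduce((a, b) => a + b, 0) / known.length;

console.log(withZeros, honest); // 100 150

Neither number is "the truth," but only the second one admits what it does not know.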
The Rule That Saved Me
I stopped trying to force every platform into the exact same semantic model.
Instead, I switched to this rule:
Normalize the envelope. Preserve the raw meaning.
That means every object should share a predictable wrapper, but platform-specific nuance should stay visible.
The schema I ended up liking looks more like this:
{
  "platform": "tiktok",
  "entity_type": "post",
  "external_id": "735902002991",
  "canonical_url": "https://www.tiktok.com/@creator/video/735902002991",
  "author": {
    "handle": "creator",
    "display_name": "Creator Name"
  },
  "content": {
    "text": "caption text",
    "media_type": "video"
  },
  "metrics": {
    "likes": 1200,
    "comments": 87,
    "shares": 41,
    "views": 54000
  },
  "published_at": "2026-04-24T12:00:00Z",
  "availability": {
    "views": "public",
    "shares": "public",
    "saves": "not_available"
  },
  "raw": {}
}
Notice what changed:
- metrics is grouped instead of flattened.
- availability tells downstream code whether a field is real, missing, or unavailable.
- raw is preserved for debugging, reprocessing, and future schema changes.
This is much less elegant than the first version.
It is also much more survivable.
What I Normalize Now
These are the fields I consider worth normalizing across platforms:
- platform
- entity_type
- external_id
- canonical_url
- author.handle
- author.display_name
- content.text
- content.media_type
- metrics.likes
- metrics.comments
- metrics.views
- metrics.shares
- published_at
That is enough to make cross-platform dashboards, alerts, ranking jobs, and exports practical.
But I do not pretend every field exists everywhere.
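If you want to enforce that envelope mechanically, a JSON Schema is a natural fit. Here is a minimal sketch based on the field list above; which fields are required, and which are nullable, is my own call, not a standard:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "social_post_envelope",
  "type": "object",
  "required": ["platform", "entity_type", "external_id", "metrics", "availability", "raw"],
  "properties": {
    "platform": { "type": "string" },
    "entity_type": { "type": "string", "enum": ["post", "comment"] },
    "external_id": { "type": ["string", "null"] },
    "canonical_url": { "type": ["string", "null"] },
    "author": {
      "type": "object",
      "properties": {
        "handle": { "type": ["string", "null"] },
        "display_name": { "type": ["string", "null"] }
      }
    },
    "content": {
      "type": "object",
      "properties": {
        "text": { "type": ["string", "null"] },
        "media_type": { "type": ["string", "null"] }
      }
    },
    "metrics": {
      "type": "object",
      "properties": {
        "likes": { "type": ["integer", "null"] },
        "comments": { "type": ["integer", "null"] },
        "views": { "type": ["integer", "null"] },
        "shares": { "type": ["integer", "null"] }
      }
    },
    "published_at": { "type": ["string", "null"], "format": "date-time" },
    "availability": {
      "type": "object",
      "additionalProperties": {
        "enum": ["public", "public_proxy", "not_available", "unknown"]
      }
    },
    "raw": { "type": "object" }
  }
}

Any standard JSON Schema validator can then reject malformed envelopes at ingestion time, before bad rows reach your warehouse.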
JavaScript Version: A Normalizer That Stays Honest
This example is intentionally boring. That is a good sign.
If your normalization layer is too clever, it becomes unmaintainable fast.
// Return the first usable value; null (never undefined) keeps the envelope stable
// when these results are spread over the base object.
function pickFirst(...values) {
  return values.find(value => value !== undefined && value !== null && value !== '') ?? null;
}

// Coerce to a finite number; null means "the platform did not expose this".
function toNumber(value) {
  if (value === null || value === undefined) return null;
  const number = Number(value);
  return Number.isFinite(number) ? number : null;
}

// Accept epoch seconds (TikTok-style createTime) or ISO 8601 strings.
function toIsoDate(value) {
  if (!value) return null;
  if (typeof value === 'number') {
    return new Date(value * 1000).toISOString();
  }
  const parsed = new Date(value);
  return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString();
}

function normalizePost(platform, raw) {
  // Every platform shares this envelope; nulls mark genuinely missing data.
  const base = {
    platform,
    entity_type: 'post',
    external_id: null,
    canonical_url: null,
    author: {
      handle: null,
      display_name: null,
    },
    content: {
      text: null,
      media_type: null,
    },
    metrics: {
      likes: null,
      comments: null,
      views: null,
      shares: null,
    },
    published_at: null,
    availability: {
      likes: 'unknown',
      comments: 'unknown',
      views: 'unknown',
      shares: 'unknown',
    },
    raw,
  };

  if (platform === 'tiktok') {
    return {
      ...base,
      external_id: pickFirst(raw.id, raw.aweme_id),
      canonical_url: pickFirst(raw.share_url, raw.url),
      author: {
        handle: pickFirst(raw.author?.uniqueId, raw.author?.handle),
        display_name: pickFirst(raw.author?.nickname, raw.author?.displayName),
      },
      content: {
        text: pickFirst(raw.desc, raw.caption),
        media_type: 'video',
      },
      metrics: {
        likes: toNumber(pickFirst(raw.stats?.diggCount, raw.like_count)),
        comments: toNumber(pickFirst(raw.stats?.commentCount, raw.comment_count)),
        views: toNumber(pickFirst(raw.stats?.playCount, raw.view_count)),
        shares: toNumber(pickFirst(raw.stats?.shareCount, raw.share_count)),
      },
      published_at: toIsoDate(pickFirst(raw.createTime, raw.created_at)),
      availability: {
        likes: 'public',
        comments: 'public',
        views: 'public',
        shares: 'public',
      },
    };
  }

  if (platform === 'youtube_comment') {
    return {
      ...base,
      entity_type: 'comment',
      external_id: pickFirst(raw.commentId, raw.id),
      canonical_url: pickFirst(raw.url),
      author: {
        handle: pickFirst(raw.author?.channelHandle, raw.author?.name),
        display_name: pickFirst(raw.author?.name, raw.authorDisplayName),
      },
      content: {
        text: pickFirst(raw.content, raw.text),
        media_type: 'text',
      },
      metrics: {
        likes: toNumber(pickFirst(raw.likes, raw.likeCount)),
        comments: toNumber(pickFirst(raw.replyCount, raw.repliesCount)),
        views: null,
        shares: null,
      },
      published_at: toIsoDate(pickFirst(raw.publishedAt, raw.timestamp)),
      availability: {
        likes: 'public',
        comments: 'public',
        views: 'not_available',
        shares: 'not_available',
      },
    };
  }

  if (platform === 'reddit') {
    return {
      ...base,
      external_id: pickFirst(raw.id, raw.post_id),
      // Note: Reddit permalinks are site-relative paths, not absolute URLs.
      canonical_url: pickFirst(raw.permalink, raw.url),
      author: {
        handle: pickFirst(raw.author, raw.author_name),
        display_name: pickFirst(raw.author, raw.author_name),
      },
      content: {
        text: pickFirst(raw.selftext, raw.title),
        media_type: raw.is_video ? 'video' : 'text',
      },
      metrics: {
        // Reddit exposes a vote score, not raw likes; 'public_proxy' flags that below.
        likes: toNumber(pickFirst(raw.score, raw.upvotes)),
        comments: toNumber(pickFirst(raw.num_comments, raw.comment_count)),
        views: null,
        shares: null,
      },
      published_at: toIsoDate(pickFirst(raw.created_utc, raw.created_at)),
      availability: {
        likes: 'public_proxy',
        comments: 'public',
        views: 'not_available',
        shares: 'not_available',
      },
    };
  }

  throw new Error(`Unsupported platform: ${platform}`);
}

const tiktokExample = {
  id: '735902002991',
  desc: 'How I batch content research in 15 minutes',
  createTime: 1713936000,
  author: { uniqueId: 'creator_handle', nickname: 'Creator Name' },
  stats: { diggCount: 1200, commentCount: 87, playCount: 54000, shareCount: 41 },
  share_url: 'https://www.tiktok.com/@creator_handle/video/735902002991',
};

console.log(normalizePost('tiktok', tiktokExample));
The important part is not the exact shape above.
The important part is that unsupported or unavailable fields stay explicit.
That makes your downstream analytics much easier to trust.
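For example, a ranking job can check availability before it compares posts, instead of silently ranking hidden counts as zero. A sketch against the envelope above (rankByViews is a made-up name):

function rankByViews(posts) {
  return posts
    // Only rank posts whose view count is actually public and present.
    .filter(post => post.availability.views === 'public' && post.metrics.views !== null)
    .sort((a, b) => b.metrics.views - a.metrics.views);
}

// Posts with views: 'not_available' are excluded from the ranking
// rather than pinned to the bottom as fake zeros.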
Python Version: Same Strategy, Different Runtime
If your data jobs live in Python, the same idea applies.
from datetime import datetime, timezone


def pick_first(*values):
    # Return the first value that is actually present; None means "missing".
    for value in values:
        if value not in (None, '', []):
            return value
    return None


def to_number(value):
    # Coerce to int; None (never 0) marks counts the platform did not expose.
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def to_iso_date(value):
    # Accept epoch seconds (TikTok-style createTime) or ISO 8601 strings.
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()
    try:
        return datetime.fromisoformat(str(value).replace('Z', '+00:00')).isoformat()
    except ValueError:
        return None


def normalize_post(platform, raw):
    # Every platform shares this envelope; None marks genuinely missing data.
    base = {
        'platform': platform,
        'entity_type': 'post',
        'external_id': None,
        'canonical_url': None,
        'author': {
            'handle': None,
            'display_name': None,
        },
        'content': {
            'text': None,
            'media_type': None,
        },
        'metrics': {
            'likes': None,
            'comments': None,
            'views': None,
            'shares': None,
        },
        'published_at': None,
        'availability': {
            'likes': 'unknown',
            'comments': 'unknown',
            'views': 'unknown',
            'shares': 'unknown',
        },
        'raw': raw,
    }

    if platform == 'tiktok':
        author = raw.get('author', {})
        stats = raw.get('stats', {})
        return {
            **base,
            'external_id': pick_first(raw.get('id'), raw.get('aweme_id')),
            'canonical_url': pick_first(raw.get('share_url'), raw.get('url')),
            'author': {
                'handle': pick_first(author.get('uniqueId'), author.get('handle')),
                'display_name': pick_first(author.get('nickname'), author.get('displayName')),
            },
            'content': {
                'text': pick_first(raw.get('desc'), raw.get('caption')),
                'media_type': 'video',
            },
            'metrics': {
                'likes': to_number(pick_first(stats.get('diggCount'), raw.get('like_count'))),
                'comments': to_number(pick_first(stats.get('commentCount'), raw.get('comment_count'))),
                'views': to_number(pick_first(stats.get('playCount'), raw.get('view_count'))),
                'shares': to_number(pick_first(stats.get('shareCount'), raw.get('share_count'))),
            },
            'published_at': to_iso_date(pick_first(raw.get('createTime'), raw.get('created_at'))),
            'availability': {
                'likes': 'public',
                'comments': 'public',
                'views': 'public',
                'shares': 'public',
            },
        }

    if platform == 'youtube_comment':
        author = raw.get('author', {})
        return {
            **base,
            'entity_type': 'comment',
            'external_id': pick_first(raw.get('commentId'), raw.get('id')),
            'canonical_url': raw.get('url'),
            'author': {
                'handle': pick_first(author.get('channelHandle'), author.get('name')),
                'display_name': pick_first(author.get('name'), raw.get('authorDisplayName')),
            },
            'content': {
                'text': pick_first(raw.get('content'), raw.get('text')),
                'media_type': 'text',
            },
            'metrics': {
                'likes': to_number(pick_first(raw.get('likes'), raw.get('likeCount'))),
                'comments': to_number(pick_first(raw.get('replyCount'), raw.get('repliesCount'))),
                'views': None,
                'shares': None,
            },
            'published_at': to_iso_date(pick_first(raw.get('publishedAt'), raw.get('timestamp'))),
            'availability': {
                'likes': 'public',
                'comments': 'public',
                'views': 'not_available',
                'shares': 'not_available',
            },
        }

    if platform == 'reddit':
        return {
            **base,
            'external_id': pick_first(raw.get('id'), raw.get('post_id')),
            # Note: Reddit permalinks are site-relative paths, not absolute URLs.
            'canonical_url': pick_first(raw.get('permalink'), raw.get('url')),
            'author': {
                'handle': pick_first(raw.get('author'), raw.get('author_name')),
                'display_name': pick_first(raw.get('author'), raw.get('author_name')),
            },
            'content': {
                'text': pick_first(raw.get('selftext'), raw.get('title')),
                'media_type': 'video' if raw.get('is_video') else 'text',
            },
            'metrics': {
                # Reddit exposes a vote score, not raw likes; 'public_proxy' flags that.
                'likes': to_number(pick_first(raw.get('score'), raw.get('upvotes'))),
                'comments': to_number(pick_first(raw.get('num_comments'), raw.get('comment_count'))),
                'views': None,
                'shares': None,
            },
            'published_at': to_iso_date(pick_first(raw.get('created_utc'), raw.get('created_at'))),
            'availability': {
                'likes': 'public_proxy',
                'comments': 'public',
                'views': 'not_available',
                'shares': 'not_available',
            },
        }

    raise ValueError(f'Unsupported platform: {platform}')


tiktok_example = {
    'id': '735902002991',
    'desc': 'How I batch content research in 15 minutes',
    'createTime': 1713936000,
    'author': {'uniqueId': 'creator_handle', 'nickname': 'Creator Name'},
    'stats': {'diggCount': 1200, 'commentCount': 87, 'playCount': 54000, 'shareCount': 41},
    'share_url': 'https://www.tiktok.com/@creator_handle/video/735902002991',
}

print(normalize_post('tiktok', tiktok_example))
What I Stopped Trying to Normalize
This part saved me more time than any helper function.
There are fields I no longer force into a universal schema unless the use case truly needs them:
- saves/bookmarks
- watch time
- audience demographics
- thread depth nuances
- platform-specific moderation states
- estimated reach proxies
Why?
Because those fields mean different things, appear inconsistently, or are missing often enough that the abstraction becomes misleading.
If a product feature really depends on one of those platform-specific concepts, I expose that field under a platform namespace instead.
That keeps the shared schema clean and the specialized data honest.
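Concretely, that can look like this. A sketch, with names I made up: platform_specific is my namespace convention, and collect_count stands in for a hypothetical TikTok-only save metric:

{
  "platform": "tiktok",
  "metrics": {
    "likes": 1200,
    "comments": 87,
    "views": 54000,
    "shares": 41
  },
  "platform_specific": {
    "tiktok": {
      "collect_count": 96
    }
  }
}

Shared dashboards read metrics; the one feature that genuinely needs the TikTok-only field reaches into platform_specific.tiktok and knows exactly what it is getting.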
Honest Alternatives
There are three valid approaches here.
1. Normalize everything into one schema
Good for dashboards and cross-platform ranking.
Bad if you start pretending unavailable fields are universal.
2. Keep completely separate per-platform models
Good for correctness.
Bad for product teams that want a single report, feed, or analytics layer.
3. Keep a thin shared schema plus raw JSON
This is the one I use most often.
It gives you enough consistency for application logic without throwing away nuance.
If you are early, start there.
Where SociaVault Fits
This is exactly where a data layer like SociaVault helps.
I do not want to spend my week maintaining eight different scraping stacks just to get to the actual product problem, which is normalization, ranking, alerting, and reporting.
So my usual split is:
- use SociaVault for the public social data collection layer
- normalize into my own schema
- keep the raw payload beside the normalized object
That keeps the engineering work focused on product logic instead of collection plumbing.
Final Take
Trying to build a universal social media JSON schema taught me something useful: the goal is not to erase platform differences.
The goal is to make those differences manageable.
Normalize the envelope. Preserve the raw payload. Mark unavailable fields honestly. Avoid fake precision.
If you are building social dashboards, creator analytics, competitor tracking, or moderation tools, that approach will hold up much better than the "one perfect schema" fantasy.
And if you want to skip the collection layer and spend your time on the normalization layer instead, start with SociaVault.