BridgeComm AI

Posted on • Originally published at https://www.bridgecomm.ai/data-schema-interoperability-lesson/

Your Product Data Schema Has the Same Problem TCP/IP Solved in 1974


I. A Problem That Has Already Been Solved
In 1974, Vint Cerf and Bob Kahn published a paper describing a new approach to moving data across computer networks. The core insight was simple: instead of designing a separate protocol for every combination of computer and network type, design one universal set of rules for how data should be packaged and addressed — and let adapters handle the differences at each end.

That insight became TCP/IP, the foundation of the modern internet. Today, an application running on your phone doesn’t need to know whether it’s talking to a server over WiFi, cellular LTE, a fiber connection, or satellite. The layers are separated. Each layer solves its own problem and exposes a clean interface to the layers above and below it.

The retail product data problem is structurally identical. A brand’s product information doesn’t need to care what format Target requires, what Walmart’s validation rules look like, or how Amazon structures its category taxonomy. Those are concerns for the edges of the system. The center — the place where product data lives and where AI operates on it — should be clean, stable, and retailer-agnostic.

At BridgeCommAI, we designed our product data schema by borrowing directly from network protocol engineering. This post describes three principles we applied and the design decisions they led to.

“The history of computing is largely a history of solving the same integration problems with better abstractions. Canonical data models, protocol layers, and universal intermediate formats are the same insight applied to different domains.”

II. The Canonical Model Pattern: A Proven Idea
The concept of routing data through a common intermediate format is older than TCP/IP. It appears across domains wherever integration complexity threatens to scale quadratically:

Unicode: Before Unicode, every software application managed its own character encoding. Unicode created one universal encoding that everything maps to and from. Adding a new language now requires one mapping to Unicode, not one mapping to every other encoding in existence.

MIDI: The Musical Instrument Digital Interface standard (1983) created a common language for electronic instruments to communicate. MIDI didn’t standardize what instruments sound like — it standardized how they describe what they’re doing, leaving the interpretation to each device.

HL7 FHIR: Healthcare faced the same N×M problem with medical records: every hospital system, insurance platform, and pharmacy had different data formats. HL7 FHIR is the universal standard that lets disparate systems exchange patient information without building custom integrations between every pair.

FIX Protocol: Financial trading systems use the Financial Information eXchange protocol as a universal message format for trade orders and confirmations. A fund manager’s system doesn’t need a custom integration to each exchange — it speaks FIX, and every exchange speaks FIX.

Retail product data is the same problem in a new domain. The solution is the same pattern. What changes are the design decisions that determine how well the schema serves the specific constraints of product data — and this is where network protocol engineering has a lot to teach.

III. Principle 1: Separation of Concerns
The OSI model — the seven-layer framework that describes how network communication is organized — is built on one central rule: each layer solves exactly one problem, exposes a clean interface upward, and doesn’t need to know anything about how the layers below it work.

When you send an email, the email application doesn’t know or care whether the network it’s using is Ethernet or WiFi or cellular. That’s not its job. Its job is to format a message. The transport layer’s job is to guarantee delivery. The network layer’s job is to handle addressing and routing. Each layer has a defined responsibility and a clean boundary.

This is exactly the principle we applied to the product data pipeline, which is organized into three layers:

Layer 1 (source adapters): ingests each brand's data in whatever form it arrives.

Layer 2 (universal schema): a stable, retailer-agnostic representation of the product.

Layer 3 (target adapters): translates universal records into each retailer's required format.

The critical design decision is what goes in each layer and what does not. The universal schema — our middle layer — is explicitly retailer-agnostic. Nothing in it reflects how Target or Walmart structures their submissions. Those details live entirely in the target adapters at Layer 3.
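To make those boundaries concrete, here is a minimal Python sketch of the three-layer split. The field names, the normalize helper, and the TargetDotComAdapter mapping are illustrative assumptions for this post, not BridgeCommAI's production code:

```python
from typing import Any, Protocol


def normalize(raw: dict[str, Any]) -> dict[str, Any]:
    """Layer 1: map one brand's source data into the universal schema."""
    return {
        "product_name": raw.get("name", "").strip(),
        "description": raw.get("description", ""),
        "gtin": raw.get("gtin", ""),
    }


class TargetAdapter(Protocol):
    """Layer 3 interface: knows one retailer's format, nothing about brands."""

    def export(self, record: dict[str, Any]) -> dict[str, Any]: ...


class TargetDotComAdapter:
    """Maps universal fields to hypothetical Target submission fields."""

    def export(self, record: dict[str, Any]) -> dict[str, Any]:
        return {
            "item_description": record["product_name"],  # retailer naming lives here,
            "long_copy": record["description"],          # never in the universal schema
            "upc": record["gtin"][-12:],                 # e.g. trim a GTIN-14 to a 12-digit UPC
        }


def run_pipeline(raw: dict[str, Any], adapter: TargetAdapter) -> dict[str, Any]:
    record = normalize(raw)        # Layer 1 -> Layer 2 (universal, retailer-agnostic)
    return adapter.export(record)  # Layer 2 -> Layer 3 (retailer-specific)
```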

This might seem like unnecessary abstraction until a retailer changes their requirements. When Target updated its product submission workflow in 2024 and 2025, the change required updating exactly one configuration: the Target adapter. Every brand’s data that had already been processed in universal form could be re-exported to the new format in minutes. Under a direct-mapping architecture, the same change would have required updating every brand–Target pipeline individually.

Anti-Pattern: Retailer-Specific Fields in the Universal Schema
The most common mistake we see in custom-built product data pipelines is contaminating the middle layer with retailer-specific fields. Teams add a “target_item_description” field alongside the standard product name field because it’s convenient. Then they add a “walmart_short_description.” The schema becomes a collection of retailer-specific exceptions, and the separation of concerns breaks down.

Once that boundary blurs, every change to any retailer’s requirements becomes a change to the central schema. The coupling that the architecture was designed to avoid has crept back in.
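The fix is mechanical: keep one neutral field in the universal record and push retailer naming into adapter-side configuration. A toy contrast, with hypothetical field names:

```python
# Anti-pattern: retailer names leaking into the universal layer.
bad_universal_record = {
    "product_name": "Single-Origin Coffee, 12 oz",
    "target_item_description": "Single-Origin Coffee, 12 oz",  # belongs in the Target adapter
    "walmart_short_description": "Single-origin coffee",       # belongs in the Walmart adapter
}

# Instead: one neutral field, with per-retailer renaming kept at the edge.
universal_record = {"product_name": "Single-Origin Coffee, 12 oz"}

ADAPTER_FIELD_MAPS = {  # hypothetical adapter-side configuration
    "target": {"product_name": "item_description"},
    "walmart": {"product_name": "short_description"},
}


def export_for(retailer: str, record: dict) -> dict:
    return {dst: record[src] for src, dst in ADAPTER_FIELD_MAPS[retailer].items()}
```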

IV. Principle 2: Extensibility Without Breaking Changes
Network protocols are built to last decades. IPv4, published in 1981, is still the dominant version of the Internet Protocol in use today. One reason is that IPv4's designers built extensibility into the protocol from the start: the header includes a version field, an options field, and a clear distinction between required and optional components. New capabilities could be added without invalidating existing implementations.

The design principle is simple: new fields are always optional. Existing implementations that don’t know about a new field ignore it. Nothing breaks. You can add capabilities to a protocol without forcing every participant to upgrade simultaneously.

We applied the same principle to the universal schema’s version strategy. The schema carries a version number. Changes follow semantic versioning — minor versions add optional fields (fully backward compatible), major versions indicate a breaking change requiring migration (rare and deliberate).
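As a sketch, assuming each record carries a schema_version string in MAJOR.MINOR form (an illustrative convention, not our documented format), an adapter's compatibility gate stays tiny:

```python
SUPPORTED_MAJOR = 1


def can_process(record: dict) -> bool:
    """Accept any minor version within the supported major version."""
    major, _minor = (int(part) for part in record["schema_version"].split("."))
    # Minor bumps only add optional fields, and unknown fields are ignored
    # downstream, so anything within the same major version is safe.
    return major == SUPPORTED_MAJOR


assert can_process({"schema_version": "1.5"})      # new optional fields: fine
assert not can_process({"schema_version": "2.0"})  # breaking change: needs migration
```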

The practical consequence: we can add an entirely new category of product attributes — the semantic attributes that AI commerce platforms will need for product discovery — without touching anything that’s already working. Existing retailer adapters continue to generate correct output. The new fields are simply ignored by adapters that don’t know about them.

The Forward Compatibility Problem
The schema is designed to accommodate attributes that aren’t required today but will be critical within 12 to 24 months as AI-mediated shopping becomes mainstream. These aren’t speculative features — they’re the data fields that make the difference between a product that appears in an AI shopping agent’s recommendation and one that doesn’t.

When a consumer asks an AI agent to recommend an organic, fair-trade coffee brand that would make a good gift for a friend who loves single-origin beans, the AI needs structured, machine-readable data to answer the question accurately. That contextual meaning — what a product means in use, not just what it physically is — belongs in the schema now, accumulated quietly while the physical requirements are handled.

V. Principle 3: Typed Structure + Flexible Payload
One of the most elegant design decisions in the original IP specification is the structure of a packet: a fixed, strongly-typed header containing the routing information every network node needs, followed by a payload that can contain anything.

The IP header specifies source address, destination address, time-to-live, and protocol type as fixed, typed fields. Every router on the internet knows exactly where to find these fields and how to parse them. This rigid structure is what makes high-speed packet routing possible.

The payload, however, is completely unspecified at the IP level. It could be a TCP segment, a UDP datagram, or something else entirely. The flexibility of the payload allows the same protocol to carry email, video, file transfers, and applications that weren’t invented when IP was designed.

The universal schema uses the same pattern. In our database implementation, performance-critical fields — those that appear frequently in operational queries, dashboard filters, and compliance checks — are stored as typed, indexed columns. The complete product record, including every field in the schema, lives in a flexible document field alongside them.

The typed columns give us IP-header-style performance: fast indexed queries on the data that operations need constantly. The flexible document field gives us IP-payload-style flexibility: the full product record can be updated, extended, and versioned without a database schema migration.
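Here is a rough sketch of what that hybrid layout can look like in PostgreSQL, driven from Python. The table and column names are illustrative, not the production schema:

```python
import psycopg2  # assumes psycopg2-binary is installed; any PostgreSQL driver works

DDL = """
CREATE TABLE IF NOT EXISTS products (
    id           BIGSERIAL PRIMARY KEY,
    gtin         TEXT NOT NULL UNIQUE,      -- typed "header" fields: fast,
    brand_id     BIGINT NOT NULL,           -- indexed, always present
    product_name TEXT NOT NULL,
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    record       JSONB NOT NULL             -- flexible "payload": full universal record
);
CREATE INDEX IF NOT EXISTS idx_products_brand ON products (brand_id);
"""

with psycopg2.connect("dbname=catalog") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```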

Why Not Pure Relational? Why Not Pure NoSQL?
A fully normalized relational schema would enforce strong typing everywhere but would require migrations whenever a field is added. In a product data context where schemas evolve constantly — retailer requirements change, new product categories appear, new attribute types are needed — this creates operational friction that compounds over time.

A document database would give schema flexibility without migrations, but at the cost of ACID transactions (critical for data integrity in a service where errors cause failed retailer submissions), type enforcement, and query performance.

The hybrid approach gets the benefits of both: typed fields for the data that needs performance and integrity guarantees, a flexible document field for everything that needs to evolve freely. PostgreSQL handles both within the same system, which simplifies the architecture considerably at early scale.

VI. Schema Field Groups: What Goes Where and Why
The universal schema organizes fields into logical groups, each with a clear purpose. The boundaries between groups are enforced by design — fields belong to the group that owns their meaning, not the group that’s most convenient.

A key design constraint: the universal schema is the single source of truth. Typed columns in the database are derived from the flexible document field, not maintained separately. When the full product record changes, derived fields update automatically. There is never a question of which representation is authoritative.

VII. Extending for AI Commerce: The Version 1.5 Pattern
AI shopping agents are already mediating product discovery on platforms like ChatGPT and Perplexity. The question isn’t whether structured semantic attributes will matter for product data; it’s when they will.

When that moment comes, a brand with a well-structured universal schema adds a new optional field group. A brand with a flat, retailer-specific spreadsheet system has to rethink its entire data infrastructure.

The extensibility principle makes this transition a schema increment rather than a migration event. Existing records continue to work. Retailer adapters that don’t use the new field group ignore it. An AI commerce adapter, when it’s built, reads exactly those fields. The same schema supports both simultaneously, without conflict.
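Here is a sketch of what that increment might look like on a single record; the semantic_attributes group and its keys are hypothetical examples, not the actual field list:

```python
# A v1.5-style increment adds an optional field group. Older adapters never read it.
record_v1_5 = {
    "schema_version": "1.5",
    "product_name": "Single-Origin Coffee, 12 oz",
    "gtin": "00012345678905",
    "semantic_attributes": {  # new optional group for AI-mediated discovery
        "use_contexts": ["gifting", "morning ritual"],
        "certifications": ["organic", "fair-trade"],
    },
}

# An existing retailer adapter touches only the fields it knows:
target_export = {"item_description": record_v1_5["product_name"]}

# A future AI-commerce adapter reads exactly the new group:
ai_export = record_v1_5.get("semantic_attributes", {})
```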

This is exactly what IPv4’s options field was designed to do: allow new capabilities to be added to the protocol without requiring every existing implementation to change.

VIII. Lessons Learned and Anti-Patterns
Operating this schema across real CPG brand data surfaced several non-obvious issues worth sharing.

Anti-Pattern: Over-Normalization

The instinct in relational database design is to normalize aggressively. For product data, this creates more problems than it solves. Product records benefit from co-location: all the fields for a product together, retrievable in one read. Excessive normalization fragments the data, increases query complexity, and creates maintenance overhead that isn’t justified by the actual data access patterns. Reference data (brands, categories, retailers) belongs in separate tables. The product record itself should be as cohesive as possible.

Anti-Pattern: Premature Optimization

It’s tempting to index every field “just in case.” Resist it: start with primary keys, foreign keys, and the handful of fields that appear most often in operational queries, then add indexes as actual query patterns emerge. The cost of maintaining unnecessary indexes compounds, and it’s always cheaper to add an index later than to carry one you don’t need from the start.

Lesson: Triggers for Auto-Extraction

Maintaining typed columns that mirror data in the flexible document field manually creates consistency bugs. The solution is database triggers that automatically sync typed columns whenever the full record is updated. This keeps a single source of truth while providing performance benefits for the queries that need it.
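A minimal sketch of such a trigger in PostgreSQL, reusing the illustrative products table from earlier (column and key names remain hypothetical):

```python
# Executed the same way as the DDL above, e.g. via cur.execute(TRIGGER_DDL).
TRIGGER_DDL = """
CREATE OR REPLACE FUNCTION sync_typed_columns() RETURNS trigger AS $$
BEGIN
    -- Re-derive typed columns from the JSONB record on every write, so the
    -- document field remains the single source of truth.
    NEW.product_name := NEW.record ->> 'product_name';
    NEW.gtin         := NEW.record ->> 'gtin';
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS products_sync ON products;
CREATE TRIGGER products_sync
BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION sync_typed_columns();
"""
```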

Lesson: Rich Search Without Full Denormalization

PostgreSQL’s generalized inverted index (GIN) support on flexible document fields enables fast searches within the schema — finding all products with a specific certification, filtering by keywords, or locating products by attribute combinations — without requiring those fields to be promoted to typed columns. This significantly expands operational capability without increasing schema rigidity.
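For instance, under the same illustrative table, one GIN index on the document column supports containment queries like this sketch:

```python
GIN_DDL = "CREATE INDEX IF NOT EXISTS idx_products_record ON products USING GIN (record);"

# All products carrying a given certification, read straight from the JSONB payload:
CERT_QUERY = """
SELECT gtin, product_name
FROM products
WHERE record @> '{"semantic_attributes": {"certifications": ["organic"]}}';
"""
```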

IX. The Pattern Beyond Retail
The canonical data model problem — and the network protocol principles that solve it — appears wherever integration complexity threatens to scale quadratically. Retail product data is a particularly acute case given the number of stakeholders, the pace at which requirements change, and the cost of errors.

But the same architecture applies to any domain with multiple sources, multiple destinations, and a need for a stable, extensible common format in between. Healthcare records. Financial trade messages. IoT device telemetry. The pattern is the same. The design principles are the same.

What changes is the depth of domain knowledge required to design the schema well — which fields matter, how they should be typed, what the validation constraints look like, and how they need to evolve. That’s where operational experience compounds in a way that no architectural decision alone can substitute for.

X. Open-Sourcing the Reference Schema
We are releasing a sanitized version of the BridgeCommAI universal schema as a reference implementation on GitHub. This includes a schema definition document with field documentation, a database DDL showing the structure and key design decisions, and example records illustrating how brand data maps to the universal format.

What we’re not releasing: retailer-specific adapter configurations, validation rule libraries, and the AI enrichment patterns we’ve developed through production operations. Those represent the accumulated operational expertise that makes the architecture work reliably at scale, and they’re where the practical advantage lives. The schema structure is the foundation — and, in our view, foundations should be shared.

GitHub: https://github.com/bridgecommai/canonical-product-schema

Conclusion
The three principles from network protocol engineering — separation of concerns, extensibility without breaking changes, and typed structure with flexible payload — aren’t abstractions borrowed loosely from an adjacent domain. They’re solutions to the same underlying problem: how do you design a data structure that works reliably for the use cases you have today while remaining extensible for the use cases you can’t fully anticipate?

For product data, those future use cases are increasingly concrete. AI commerce is already reshaping product discovery on major consumer platforms. The brands that build their product data infrastructure on a well-designed universal schema are building toward a future where their data speaks fluently to whatever comes next — not scrambling to retrofit it.

BridgeCommAI builds and operates this architecture for CPG brands going into Target, Walmart, and Amazon. If you’re a technical lead or product manager working through a product data infrastructure decision, visit bridgecomm.ai or reach out directly.
