Intellectual property offices worldwide face a growing challenge: how do you ensure that every new trademark, logo, or jingle is genuinely distinct from the millions already registered? For one of our public-sector prospects, this problem spans two fundamentally different media types: images (product logos, packaging designs) and audio (brand jingles, musical trademarks). Each new submission must be compared against the entire existing registry before it can be approved.
Manual review simply does not scale. An examiner can visually compare a handful of logos per hour, but the backlog grows faster than human reviewers can process it. Audio comparison is even harder: listening to thousands of jingles to catch a subtle similarity is impractical. Subjectivity compounds the problem, as two examiners may disagree on whether a logo is "too similar" to an existing registration. And consistency matters: the same pair of assets should always produce the same similarity verdict, regardless of which examiner handles the case or what time of day it is reviewed.
The traditional remedies, hiring more examiners and building manual review queues, do not solve the problem: they add cost, latency, and inconsistency. What if you could reduce the comparison to a mathematical operation? Convert each IP asset into a numerical fingerprint (a vector embedding), store those fingerprints in a database, and compute similarity scores automatically.
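The core idea fits in a few lines of Python. This is purely illustrative: the vectors below are toy 4-dimensional "fingerprints", not real model embeddings, but the arithmetic is exactly what the production system computes.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means identical direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy fingerprints; real embeddings are 768-dim (images) or 2048-dim (audio).
registered = [0.9, 0.1, 0.3, 0.5]
near_copy  = [0.88, 0.12, 0.31, 0.49]
unrelated  = [0.1, 0.9, 0.7, 0.0]

print(round(cosine_similarity(registered, near_copy), 3))   # close to 1.0
print(round(cosine_similarity(registered, unrelated), 3))   # much lower
```

A near-duplicate asset scores close to 1.0 against its original, while an unrelated asset scores much lower, which is the property the whole registration workflow is built on.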
This article walks through how we built exactly that: an AI-powered IP similarity detection system that runs entirely within Snowflake, combining GPU-accelerated deep learning models with native vector search and a Streamlit interface for IP examiners.
Solution Architecture
The system is built on four Snowflake capabilities working together:
- Snowpark Container Services (SPCS) for GPU-backed model inference
- The native VECTOR data type for embedding storage
- VECTOR_COSINE_SIMILARITY for similarity search
- Streamlit in Snowflake for the examiner-facing UI
The data flow is straightforward: raw files sit on a Snowflake internal stage. SPCS containers mount that stage directly as a volume, so the Flask-based inference services read files from /stage without any data transfer. Service functions, SQL UDFs that wrap HTTP calls to the containers, make embedding generation callable from any SQL query. The resulting vectors are stored in registry tables and compared via VECTOR_COSINE_SIMILARITY.
A critical design choice was keeping everything within Snowflake's governance boundary. For a public-sector organization, data sovereignty is non-negotiable: no IP assets leave the platform. The models run inside SPCS, the files live on Snowflake stages, and the embeddings are stored in Snowflake tables. Access control is handled through Snowflake RBAC, and every query is captured in Snowflake's access history for audit purposes.
This architecture also means there is no external API dependency during inference. The models are loaded into containers at startup and serve requests entirely within the Snowflake network. If internet access is revoked after deployment, the system continues to function normally. This is an important consideration for government networks with strict egress policies.
Model Selection: Why the Right Embedding Matters
Choosing the right embedding model is the most consequential architectural decision in a similarity search system. The model determines what "similar" means. For an IP registration system, getting this wrong renders the entire system useless.
Image Pipeline: Vision Transformer (ViT)
For image similarity, we use Google's vit-base-patch16-224-in21k, a Vision Transformer pretrained on ImageNet-21k. It produces 768-dimensional embeddings extracted from the CLS token of the final hidden layer. ViT is a strong general-purpose choice: it captures both low-level visual features (color, shape, texture) and higher-level semantic patterns (object composition, layout). No fine-tuning was needed for our use case since the pretrained model already distinguishes product packaging designs effectively.
```python
from transformers import AutoFeatureExtractor, AutoModel
from PIL import Image
import torch

model_id = "google/vit-base-patch16-224-in21k"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open("logo.png").convert("RGB")  # any candidate image
inputs = extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0]  # 768-dim CLS token
```
Audio Pipeline: PANNs CNN14
Audio is where the model selection gets interesting. The natural first choice might be CLAP (laion/larger_clap_music), a contrastive language-audio model that has become popular for audio understanding tasks. However, during prototyping we discovered a fundamental problem: CLAP produces semantic embeddings that cluster audio by genre and type. When we tested it against our jingle corpus, all files within a similar genre scored 0.93-0.99 cosine similarity. A pop jingle and a completely different pop jingle were nearly indistinguishable. This makes CLAP excellent for genre classification but useless for IP distinctiveness detection.
PANNs CNN14 (Pretrained Audio Neural Networks) solves this problem. Trained on AudioSet's 527 sound classes across roughly 5,000 hours of audio, CNN14 produces 2048-dimensional acoustic embeddings that capture the unique spectral characteristics of each file including timbre, rhythm, frequency patterns, and temporal structure. Two pop jingles that sound superficially similar will produce meaningfully different embeddings because PANNs encodes their actual acoustic fingerprint rather than their semantic category.
```python
from panns_inference import AudioTagging
import librosa
import numpy as np

at = AudioTagging(checkpoint_path=None, device="cpu")  # downloads CNN14 checkpoint on first run
audio_array, _ = librosa.load("jingle.mp3", sr=32000, mono=True)
audio_array = audio_array[np.newaxis, :]  # add batch dimension
_, embedding = at.inference(audio_array)  # 2048-dim acoustic embedding
```
This distinction between acoustic and semantic embeddings is critical. For IP registration, you need to answer "does this specific audio file sound like that specific audio file?" and not "are these two files in the same genre?"
Implementation Highlights
Deploying Models as SPCS Services
Each model runs in its own Docker container, deployed as an SPCS service with GPU resources. The Docker images are built locally, pushed to Snowflake's built-in image registry, and referenced in the service specification. The containers mount the Snowflake stage as a volume at /stage, enabling direct file access. The Flask app simply reads from the filesystem without any data serialization or network transfer overhead.
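A minimal sketch of what such a Flask service might look like is below. This is an illustration, not the production code: the route names (/embed-file, /health) follow the article, the model call is replaced with a stub, and the request/response shape follows Snowflake's row-oriented JSON convention for service functions ({"data": [[row_index, arg], ...]}).

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub standing in for the real model (ViT or PANNs), loaded once at startup.
MODEL_READY = True

def embed_file(path):
    # The real service reads the file from the mounted stage at /stage
    # and runs it through the model; here we return a fixed toy vector.
    return [0.0] * 8

@app.route("/health")
def health():
    # Snowflake's readiness probe hits this endpoint; return 200 only
    # once the model weights have finished loading.
    return ("ok", 200) if MODEL_READY else ("loading", 503)

@app.route("/embed-file", methods=["POST"])
def embed():
    # Service functions send rows as {"data": [[row_idx, arg1], ...]}
    # and expect the same shape back, with results paired to row indices.
    rows = request.get_json()["data"]
    out = [[idx, embed_file(f"/stage/{path}")] for idx, path in rows]
    return jsonify({"data": out})
```

Because the stage is mounted at /stage, the handler receives only a relative path and reads the bytes locally; no file content ever travels through the SQL layer.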
Here is the abbreviated service creation for the image pipeline:
```sql
CREATE SERVICE IMAGE_EMBEDDING_SERVICE
  IN COMPUTE POOL POOL_GPU_S
  FROM SPECIFICATION $$
spec:
  containers:
  - name: image-embedding
    image: /img_demo/raw/img_repo/image-embedding-service:latest
    resources:
      requests:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: stage-volume
      mountPath: /stage
  endpoints:
  - name: embed
    port: 8080
    public: false
  volumes:
  - name: stage-volume
    source: "@IMG_DEMO.RAW.IP_STAGE"
$$
  MIN_INSTANCES = 1
  MAX_INSTANCES = 1;
```
The readinessProbe on the /health endpoint ensures Snowflake only routes traffic to containers that have fully loaded their models. ViT takes a few seconds, while PANNs CNN14 needs to download its ~300 MB checkpoint on first run.
The key pattern here is the service function. This is an SQL UDF that routes calls to the container's HTTP endpoint. This abstraction lets any SQL query generate embeddings without knowing anything about Docker, Flask, or PyTorch:
```sql
CREATE FUNCTION GENERATE_IMAGE_EMBEDDING(FILE_PATH VARCHAR)
  RETURNS VARIANT
  SERVICE = IMAGE_EMBEDDING_SERVICE
  ENDPOINT = embed
  AS '/embed-file';
```
Similarity Search in SQL
With embeddings stored as native VECTOR columns, similarity search becomes a single SQL query. The pattern uses a CTE to generate the candidate embedding, then cross-joins against the registry with VECTOR_COSINE_SIMILARITY:
```sql
WITH uploaded_embedding AS (
  SELECT GENERATE_IMAGE_EMBEDDING('uploads/new_logo.png')::VECTOR(FLOAT, 768) AS emb
)
SELECT
  r.IP_ID,
  r.FILE_PATH,
  VECTOR_COSINE_SIMILARITY(r.IMAGE_EMBEDDING, u.emb) AS SIMILARITY
FROM IMAGE_IP_REGISTRY r
CROSS JOIN uploaded_embedding u
WHERE r.STATUS = 'ACTIVE'
ORDER BY SIMILARITY DESC;
```
This query generates the embedding for the new submission, compares it against every active registration, and returns results ranked by similarity. All of this can be done in a single SQL statement. The ::VECTOR(FLOAT, 768) cast converts the VARIANT output from the service function into Snowflake's native VECTOR type, which is required for VECTOR_COSINE_SIMILARITY to operate.
The same pattern applies to audio, substituting GENERATE_AUDIO_EMBEDDING and VECTOR(FLOAT, 2048) for the 2048-dimensional PANNs embeddings.
Threshold-Based Classification
Similarity scores are classified into four tiers that drive the examiner's workflow:
| Similarity Score | Category | Action |
|---|---|---|
| > 0.95 | DUPLICATE | Reject registration |
| 0.85 -- 0.95 | HIGHLY SIMILAR | Manual review required |
| 0.70 -- 0.85 | MODERATE | Flag for review |
| < 0.70 | DISTINCT | Safe to register |
These thresholds were calibrated using sample data from the prospect's existing registry and should be adjusted based on real-world feedback during production use. The boundaries are not hard-coded in SQL; they live in the application layer, making them easy to tune without redeploying any infrastructure.
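A minimal sketch of how the application layer might encode these tiers (the boundary values come from the table above; the function and constant names are illustrative):

```python
# Tier lower bounds matching the table above; tune from examiner feedback.
THRESHOLDS = [
    (0.95, "DUPLICATE", "Reject registration"),
    (0.85, "HIGHLY SIMILAR", "Manual review required"),
    (0.70, "MODERATE", "Flag for review"),
]

def classify(similarity: float) -> tuple[str, str]:
    """Map a cosine similarity score to a verdict and recommended action."""
    for lower_bound, category, action in THRESHOLDS:
        if similarity > lower_bound:
            return category, action
    return "DISTINCT", "Safe to register"

print(classify(0.97))  # ('DUPLICATE', 'Reject registration')
print(classify(0.60))  # ('DISTINCT', 'Safe to register')
```

Keeping the tiers in a single ordered list means retuning a boundary is a one-line change, with no SQL or container redeployment.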
Streamlit UI: The Examiner's Workflow
The Streamlit in Snowflake application provides a tabbed interface for image and audio similarity detection. The examiner's workflow is:
- Upload - drag and drop an image or audio file into the uploader
- Compare - the system uploads to stage, generates an embedding via the service function, and runs VECTOR_COSINE_SIMILARITY against all active registry entries
- Review - a color-coded verdict badge (red/orange/yellow/green) appears alongside a table of all comparison scores and a preview of the closest match
- Register - if the verdict is DISTINCT, a registration form appears where the examiner enters metadata (brand, product, title, genre) and submits to insert the new record
The app also includes a gallery view showing all currently registered IP assets with presigned URLs for previewing images and playing audio directly in the browser. Service readiness is checked on page load, with a warning banner displayed if either embedding service is still warming up.
Operational Considerations
Cost Optimization
GPU compute pools are not required to run 24/7. Since public-sector offices operate during standard business hours, we implemented scheduled Snowflake Tasks to resume services at 07:45 and suspend them at 17:30 (Jakarta time), giving 15 minutes for GPU nodes to provision and containers to pass readiness probes before the 8 AM workday begins:
```sql
CREATE TASK RESUME_SERVICES_MORNING
  SCHEDULE = 'USING CRON 45 7 * * MON-FRI Asia/Bangkok'
AS
BEGIN
  ALTER COMPUTE POOL POOL_GPU_S RESUME;
  ALTER SERVICE IMAGE_EMBEDDING_SERVICE RESUME;
  ALTER SERVICE AUDIO_EMBEDDING_SERVICE RESUME;
END;
```
The Streamlit app's compute pool (POOL_CPU_S) uses AUTO_RESUME and AUTO_SUSPEND_SECS instead, waking automatically when an examiner visits the URL and suspending after 5 minutes of inactivity. This two-tier approach, explicit scheduling for GPU pools and automatic management for lightweight CPU pools, balances cost control with user experience.
Health Monitoring
A stored procedure (CHECK_HEALTH) runs 10 minutes after the morning resume, checking compute pool and service status and logging results to a SPCS_HEALTH_LOG table. This provides operational visibility without requiring manual checks.
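The article does not show the internals of CHECK_HEALTH, but the classification step it performs can be sketched as a small pure function. The status strings and field names below are assumptions for illustration; a real implementation would feed it the output of SHOW SERVICES and SHOW COMPUTE POOLS and insert the rows into SPCS_HEALTH_LOG.

```python
from datetime import datetime, timezone

def evaluate_health(statuses):
    """Given (name, status) pairs, produce log rows flagging anything
    that is not in a healthy state."""
    healthy_states = {"RUNNING", "ACTIVE", "IDLE"}  # assumed healthy statuses
    checked_at = datetime.now(timezone.utc).isoformat()
    return [
        {"name": name, "status": status,
         "healthy": status in healthy_states, "checked_at": checked_at}
        for name, status in statuses
    ]

rows = evaluate_health([
    ("IMAGE_EMBEDDING_SERVICE", "RUNNING"),
    ("AUDIO_EMBEDDING_SERVICE", "PENDING"),
])
print([r["healthy"] for r in rows])  # [True, False]
```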
Governance and Access Control
The system uses a minimal-privilege EXAMINER_ROLE for examiners, granting only USAGE on the database, schema, warehouse, and Streamlit app. The Streamlit app executes with owner's rights, so examiners never need direct access to the registry tables, service functions, or compute pools. Combined with Snowflake's built-in access history logging and encryption at rest and in transit, this satisfies the audit and compliance requirements typical of public-sector deployments.
Scaling for Large Registries
For initial data migration from legacy systems, the batch embedding generation processes files in chunks using LIMIT/OFFSET patterns to avoid query timeouts. A seed manifest table tracks which files have been processed, allowing the migration to be resumed if interrupted. The SPCS compute pool supports auto-scaling (MIN_NODES to MAX_NODES) to handle concurrent users during peak registration periods.
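The resumable chunking logic can be sketched as follows. This is a simplified illustration: the function and parameter names are hypothetical, the manifest is an in-memory set standing in for the seed manifest table, and process_chunk stands in for the SQL that calls the embedding service per file.

```python
def migrate_registry(all_files, already_done, process_chunk, chunk_size=100):
    """Process legacy files in fixed-size chunks, skipping anything the
    manifest says is already done, so an interrupted run can resume."""
    pending = [f for f in all_files if f not in already_done]
    processed = []
    for start in range(0, len(pending), chunk_size):
        chunk = pending[start:start + chunk_size]
        process_chunk(chunk)          # e.g. generate and store embeddings
        already_done.update(chunk)    # in practice, persist to the manifest table
        processed.extend(chunk)
    return processed

done = {"logo_001.png"}  # recorded by a previous, interrupted run
out = migrate_registry(
    ["logo_001.png", "logo_002.png", "logo_003.png"],
    done,
    process_chunk=lambda chunk: None,
    chunk_size=2,
)
print(out)  # ['logo_002.png', 'logo_003.png']
```

Because the manifest is updated after each chunk rather than at the end, a crash mid-migration loses at most one chunk of work.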
For very large registries (100,000+ assets), the current brute-force VECTOR_COSINE_SIMILARITY scan across all rows will eventually need to be supplemented with approximate nearest neighbor (ANN) indexing. Snowflake's vector search capabilities continue to evolve in this area, and the registry table schema is already structured to take advantage of future optimizations without any migration.
Results and Takeaways
Building this system validated several key points:
Pretrained models eliminate the cold-start problem.
Using HuggingFace models (ViT and PANNs CNN14) that are already trained on large-scale datasets meant we could go from concept to working prototype without any model training. The total model weight is roughly 650 MB combined, which is a trivial size for a GPU container.
Model selection is the critical decision.
The CLAP vs. PANNs comparison demonstrated that a seemingly reasonable model choice (CLAP is newer, more popular, and supports text-audio queries) can be fundamentally wrong for the use case. Acoustic embeddings and semantic embeddings answer different questions, and IP similarity detection requires acoustic fidelity.
Snowflake SPCS makes model deployment operationally simple.
The pattern of Docker container + SPCS service + service function turns a PyTorch model into a SQL-callable function. There is no separate inference infrastructure to manage, no API gateway to configure, and no data movement between systems.
The solution extends naturally.
The same architecture supports internal use via Streamlit or external access via API endpoints. Additional media types (video, 3D models) can be added by deploying new embedding services and registry tables following the same pattern.
For public-sector organizations managing intellectual property at scale, this approach offers a path to automated, consistent, and auditable similarity detection. This can all be done within a single governed platform using Snowflake.