Intellectual property offices worldwide face a growing challenge: how do you ensure that every new trademark, logo, or jingle is genuinely distinct from the millions already registered? For one of our public-sector prospects, this problem spans two fundamentally different media types: images (product logos, packaging designs) and audio (brand jingles, musical trademarks). Each new submission must be compared against the entire existing registry before it can be approved.
Manual review simply does not scale. An examiner can visually compare a handful of logos per hour, but the backlog grows faster than human reviewers can process it. Audio comparison is even harder: listening to thousands of jingles to catch a subtle similarity is impractical. Subjectivity compounds the problem, as two examiners may disagree on whether a logo is "too similar" to an existing registration. And consistency matters: the same pair of assets should always produce the same similarity verdict, regardless of which examiner handles the case or what time of day it is reviewed.
The traditional remedies, hiring more examiners and building manual review queues, do not solve the problem: they add cost, latency, and inconsistency. What if you could reduce the comparison to a mathematical operation? Convert each IP asset into a numerical fingerprint (a vector embedding), store those fingerprints in a database, and compute similarity scores automatically.
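The core idea fits in a few lines of Python. This is purely illustrative: the vectors below are toy 4-dimensional "fingerprints", not real model embeddings, but the arithmetic is exactly what the production system computes.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means identical direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy fingerprints; real embeddings are 768-dim (images) or 2048-dim (audio).
registered = [0.9, 0.1, 0.3, 0.5]
near_copy  = [0.88, 0.12, 0.31, 0.49]
unrelated  = [0.1, 0.9, 0.7, 0.0]

print(round(cosine_similarity(registered, near_copy), 3))   # close to 1.0
print(round(cosine_similarity(registered, unrelated), 3))   # much lower
```

A near-duplicate asset scores close to 1.0 against its original, while an unrelated asset scores much lower, which is the property the whole registration workflow is built on.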
This article walks through how we built exactly that: an AI-powered IP similarity detection system that runs entirely within Snowflake, combining GPU-accelerated deep learning models with native vector search and a Streamlit interface for IP examiners.
Solution Architecture
The system is built on four Snowflake capabilities working together:
- Snowpark Container Services (SPCS) for GPU-backed model inference
- The native VECTOR data type for embedding storage
- VECTOR_COSINE_SIMILARITY for similarity search
- Streamlit in Snowflake for the examiner-facing UI
The data flow is straightforward: raw files sit on a Snowflake internal stage. SPCS containers mount that stage directly as a volume, so the Flask-based inference services read files from /stage without any data transfer. Service functions, SQL UDFs that wrap HTTP calls to the containers, make embedding generation callable from any SQL query. The resulting vectors are stored in registry tables and compared via VECTOR_COSINE_SIMILARITY.
A critical design choice was keeping everything within Snowflake's governance boundary. For a public-sector organization, data sovereignty is non-negotiable: no IP assets leave the platform. The models run inside SPCS, the files live on Snowflake stages, and the embeddings are stored in Snowflake tables. Access control is handled through Snowflake RBAC, and every query is captured in Snowflake's access history for audit purposes.
This architecture also means there is no external API dependency during inference. The models are loaded into containers at startup and serve requests entirely within the Snowflake network. If internet access is revoked after deployment, the system continues to function normally. This is an important consideration for government networks with strict egress policies.
Model Selection: Why the Right Embedding Matters
Choosing the right embedding model is the most consequential architectural decision in a similarity search system. The model determines what "similar" means. For an IP registration system, getting this wrong renders the entire system useless.
Image Pipeline: Vision Transformer (ViT)
For image similarity, we use Google's vit-base-patch16-224-in21k, a Vision Transformer pretrained on ImageNet-21k. It produces 768-dimensional embeddings extracted from the CLS token of the final hidden layer. ViT is a strong general-purpose choice: it captures both low-level visual features (color, shape, texture) and higher-level semantic patterns (object composition, layout). No fine-tuning was needed for our use case since the pretrained model already distinguishes product packaging designs effectively.
```python
from transformers import AutoFeatureExtractor, AutoModel
from PIL import Image
import torch

model_id = "google/vit-base-patch16-224-in21k"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open("logo.png").convert("RGB")  # any candidate image
inputs = extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0]  # 768-dim CLS token
```
Audio Pipeline: PANNs CNN14
Audio is where the model selection gets interesting. The natural first choice might be CLAP (laion/larger_clap_music), a contrastive language-audio model that has become popular for audio understanding tasks. However, during prototyping we discovered a fundamental problem: CLAP produces semantic embeddings that cluster audio by genre and type. When we tested it against our jingle corpus, all files within a similar genre scored 0.93-0.99 cosine similarity. A pop jingle and a completely different pop jingle were nearly indistinguishable. This makes CLAP excellent for genre classification but useless for IP distinctiveness detection.
PANNs CNN14 (Pretrained Audio Neural Networks) solves this problem. Trained on AudioSet's 527 sound classes across roughly 5,000 hours of audio, CNN14 produces 2048-dimensional acoustic embeddings that capture the unique spectral characteristics of each file including timbre, rhythm, frequency patterns, and temporal structure. Two pop jingles that sound superficially similar will produce meaningfully different embeddings because PANNs encodes their actual acoustic fingerprint rather than their semantic category.
```python
from panns_inference import AudioTagging
import librosa
import numpy as np

at = AudioTagging(checkpoint_path=None, device="cpu")  # downloads CNN14 checkpoint on first run
audio_array, _ = librosa.load("jingle.mp3", sr=32000, mono=True)
audio_array = audio_array[np.newaxis, :]  # add batch dimension
_, embedding = at.inference(audio_array)  # 2048-dim acoustic embedding
```
This distinction between acoustic and semantic embeddings is critical. For IP registration, you need to answer "does this specific audio file sound like that specific audio file?" and not "are these two files in the same genre?"
Implementation Highlights
Deploying Models as SPCS Services
Each model runs in its own Docker container, deployed as an SPCS service with GPU resources. The Docker images are built locally, pushed to Snowflake's built-in image registry, and referenced in the service specification. The containers mount the Snowflake stage as a volume at /stage, enabling direct file access. The Flask app simply reads from the filesystem without any data serialization or network transfer overhead.
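A minimal sketch of what such a Flask service might look like is below. This is an illustration, not the production code: the route names (/embed-file, /health) follow the article, the model call is replaced with a stub, and the request/response shape follows Snowflake's row-oriented JSON convention for service functions ({"data": [[row_index, arg], ...]}).

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub standing in for the real model (ViT or PANNs), loaded once at startup.
MODEL_READY = True

def embed_file(path):
    # The real service reads the file from the mounted stage at /stage
    # and runs it through the model; here we return a fixed toy vector.
    return [0.0] * 8

@app.route("/health")
def health():
    # Snowflake's readiness probe hits this endpoint; return 200 only
    # once the model weights have finished loading.
    return ("ok", 200) if MODEL_READY else ("loading", 503)

@app.route("/embed-file", methods=["POST"])
def embed():
    # Service functions send rows as {"data": [[row_idx, arg1], ...]}
    # and expect the same shape back, with results paired to row indices.
    rows = request.get_json()["data"]
    out = [[idx, embed_file(f"/stage/{path}")] for idx, path in rows]
    return jsonify({"data": out})
```

Because the stage is mounted at /stage, the handler receives only a relative path and reads the bytes locally; no file content ever travels through the SQL layer.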
Here is the abbreviated service creation for the image pipeline:
```sql
CREATE SERVICE IMAGE_EMBEDDING_SERVICE
  IN COMPUTE POOL POOL_GPU_S
  FROM SPECIFICATION $$
spec:
  containers:
  - name: image-embedding
    image: /img_demo/raw/img_repo/image-embedding-service:latest
    resources:
      requests:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: stage-volume
      mountPath: /stage
  endpoints:
  - name: embed
    port: 8080
    public: false
  volumes:
  - name: stage-volume
    source: "@IMG_DEMO.RAW.IP_STAGE"
$$
  MIN_INSTANCES = 1
  MAX_INSTANCES = 1;
```
The readinessProbe on the /health endpoint ensures Snowflake only routes traffic to containers that have fully loaded their models. ViT takes a few seconds, while PANNs CNN14 needs to download its ~300 MB checkpoint on first run.
The key pattern here is the service function. This is an SQL UDF that routes calls to the container's HTTP endpoint. This abstraction lets any SQL query generate embeddings without knowing anything about Docker, Flask, or PyTorch:
```sql
CREATE FUNCTION GENERATE_IMAGE_EMBEDDING(FILE_PATH VARCHAR)
  RETURNS VARIANT
  SERVICE = IMAGE_EMBEDDING_SERVICE
  ENDPOINT = embed
  AS '/embed-file';
```
Similarity Search in SQL
With embeddings stored as native VECTOR columns, similarity search becomes a single SQL query. The pattern uses a CTE to generate the candidate embedding, then cross-joins against the registry with VECTOR_COSINE_SIMILARITY:
```sql
WITH uploaded_embedding AS (
  SELECT GENERATE_IMAGE_EMBEDDING('uploads/new_logo.png')::VECTOR(FLOAT, 768) AS emb
)
SELECT
  r.IP_ID,
  r.FILE_PATH,
  VECTOR_COSINE_SIMILARITY(r.IMAGE_EMBEDDING, u.emb) AS SIMILARITY
FROM IMAGE_IP_REGISTRY r
CROSS JOIN uploaded_embedding u
WHERE r.STATUS = 'ACTIVE'
ORDER BY SIMILARITY DESC;
```
This query generates the embedding for the new submission, compares it against every active registration, and returns results ranked by similarity. All of this can be done in a single SQL statement. The ::VECTOR(FLOAT, 768) cast converts the VARIANT output from the service function into Snowflake's native VECTOR type, which is required for VECTOR_COSINE_SIMILARITY to operate.
The same pattern applies to audio, substituting GENERATE_AUDIO_EMBEDDING and VECTOR(FLOAT, 2048) for the 2048-dimensional PANNs embeddings.
Threshold-Based Classification
Similarity scores are classified into four tiers that drive the examiner's workflow:
| Similarity Score | Category | Action |
|---|---|---|
| > 0.95 | DUPLICATE | Reject registration |
| 0.85 -- 0.95 | HIGHLY SIMILAR | Manual review required |
| 0.70 -- 0.85 | MODERATE | Flag for review |
| < 0.70 | DISTINCT | Safe to register |
These thresholds were calibrated using sample data from the prospect's existing registry and should be adjusted based on real-world feedback during production use. The boundaries are not hard-coded in SQL; they live in the application layer, making them easy to tune without redeploying any infrastructure.
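A minimal sketch of how the application layer might encode these tiers (the boundary values come from the table above; the function and constant names are illustrative):

```python
# Tier lower bounds matching the table above; tune from examiner feedback.
THRESHOLDS = [
    (0.95, "DUPLICATE", "Reject registration"),
    (0.85, "HIGHLY SIMILAR", "Manual review required"),
    (0.70, "MODERATE", "Flag for review"),
]

def classify(similarity: float) -> tuple[str, str]:
    """Map a cosine similarity score to a verdict and recommended action."""
    for lower_bound, category, action in THRESHOLDS:
        if similarity > lower_bound:
            return category, action
    return "DISTINCT", "Safe to register"

print(classify(0.97))  # ('DUPLICATE', 'Reject registration')
print(classify(0.60))  # ('DISTINCT', 'Safe to register')
```

Keeping the tiers in a single ordered list means retuning a boundary is a one-line change, with no SQL or container redeployment.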
Streamlit UI: The Examiner's Workflow
The Streamlit in Snowflake application provides a tabbed interface for image and audio similarity detection. The examiner's workflow is:
- Upload - drag and drop an image or audio file into the uploader
- Compare - the system uploads to stage, generates an embedding via the service function, and runs VECTOR_COSINE_SIMILARITY against all active registry entries
- Review - a color-coded verdict badge (red/orange/yellow/green) appears alongside a table of all comparison scores and a preview of the closest match
- Register - if the verdict is DISTINCT, a registration form appears where the examiner enters metadata (brand, product, title, genre) and submits to insert the new record
The app also includes a gallery view showing all currently registered IP assets with presigned URLs for previewing images and playing audio directly in the browser. Service readiness is checked on page load, with a warning banner displayed if either embedding service is still warming up.
Operational Considerations
Cost Optimization
GPU compute pools are not required to run 24/7. Since public-sector offices operate during standard business hours, we implemented scheduled Snowflake Tasks to resume services at 07:45 and suspend them at 17:30 (Jakarta time), giving 15 minutes for GPU nodes to provision and containers to pass readiness probes before the 8 AM workday begins:
```sql
CREATE TASK RESUME_SERVICES_MORNING
  SCHEDULE = 'USING CRON 45 7 * * MON-FRI Asia/Bangkok'
AS
BEGIN
  ALTER COMPUTE POOL POOL_GPU_S RESUME;
  ALTER SERVICE IMAGE_EMBEDDING_SERVICE RESUME;
  ALTER SERVICE AUDIO_EMBEDDING_SERVICE RESUME;
END;
```
The Streamlit app's compute pool (POOL_CPU_S) uses AUTO_RESUME and AUTO_SUSPEND_SECS instead, waking automatically when an examiner visits the URL and suspending after 5 minutes of inactivity. This two-tier approach, explicit scheduling for GPU pools and automatic management for lightweight CPU pools, balances cost control with user experience.
Health Monitoring
A stored procedure (CHECK_HEALTH) runs 10 minutes after the morning resume, checking compute pool and service status and logging results to a SPCS_HEALTH_LOG table. This provides operational visibility without requiring manual checks.
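The article does not show the internals of CHECK_HEALTH, but the classification step it performs can be sketched as a small pure function. The status strings and field names below are assumptions for illustration; a real implementation would feed it the output of SHOW SERVICES and SHOW COMPUTE POOLS and insert the rows into SPCS_HEALTH_LOG.

```python
from datetime import datetime, timezone

def evaluate_health(statuses):
    """Given (name, status) pairs, produce log rows flagging anything
    that is not in a healthy state."""
    healthy_states = {"RUNNING", "ACTIVE", "IDLE"}  # assumed healthy statuses
    checked_at = datetime.now(timezone.utc).isoformat()
    return [
        {"name": name, "status": status,
         "healthy": status in healthy_states, "checked_at": checked_at}
        for name, status in statuses
    ]

rows = evaluate_health([
    ("IMAGE_EMBEDDING_SERVICE", "RUNNING"),
    ("AUDIO_EMBEDDING_SERVICE", "PENDING"),
])
print([r["healthy"] for r in rows])  # [True, False]
```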
Governance and Access Control
The system uses a minimal-privilege EXAMINER_ROLE for examiners, granting only USAGE on the database, schema, warehouse, and Streamlit app. The Streamlit app executes with owner's rights, so examiners never need direct access to the registry tables, service functions, or compute pools. Combined with Snowflake's built-in access history logging and encryption at rest and in transit, this satisfies the audit and compliance requirements typical of public-sector deployments.
Scaling for Large Registries
For initial data migration from legacy systems, the batch embedding generation processes files in chunks using LIMIT/OFFSET patterns to avoid query timeouts. A seed manifest table tracks which files have been processed, allowing the migration to be resumed if interrupted. The SPCS compute pool supports auto-scaling (MIN_NODES to MAX_NODES) to handle concurrent users during peak registration periods.
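The resumable chunking logic can be sketched as follows. This is a simplified illustration: the function and parameter names are hypothetical, the manifest is an in-memory set standing in for the seed manifest table, and process_chunk stands in for the SQL that calls the embedding service per file.

```python
def migrate_registry(all_files, already_done, process_chunk, chunk_size=100):
    """Process legacy files in fixed-size chunks, skipping anything the
    manifest says is already done, so an interrupted run can resume."""
    pending = [f for f in all_files if f not in already_done]
    processed = []
    for start in range(0, len(pending), chunk_size):
        chunk = pending[start:start + chunk_size]
        process_chunk(chunk)          # e.g. generate and store embeddings
        already_done.update(chunk)    # in practice, persist to the manifest table
        processed.extend(chunk)
    return processed

done = {"logo_001.png"}  # recorded by a previous, interrupted run
out = migrate_registry(
    ["logo_001.png", "logo_002.png", "logo_003.png"],
    done,
    process_chunk=lambda chunk: None,
    chunk_size=2,
)
print(out)  # ['logo_002.png', 'logo_003.png']
```

Because the manifest is updated after each chunk rather than at the end, a crash mid-migration loses at most one chunk of work.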
For very large registries (100,000+ assets), the current brute-force VECTOR_COSINE_SIMILARITY scan across all rows will eventually need to be supplemented with approximate nearest neighbor (ANN) indexing. Snowflake's vector search capabilities continue to evolve in this area, and the registry table schema is already structured to take advantage of future optimizations without any migration.
Results and Takeaways
Building this system validated several key points:
Pretrained models eliminate the cold-start problem.
Using HuggingFace models (ViT and PANNs CNN14) that are already trained on large-scale datasets meant we could go from concept to working prototype without any model training. The total model weight is roughly 650 MB combined, which is a trivial size for a GPU container.
Model selection is the critical decision.
The CLAP vs. PANNs comparison demonstrated that a seemingly reasonable model choice (CLAP is newer, more popular, and supports text-audio queries) can be fundamentally wrong for the use case. Acoustic embeddings and semantic embeddings answer different questions, and IP similarity detection requires acoustic fidelity.
Snowflake SPCS makes model deployment operationally simple.
The pattern of Docker container + SPCS service + service function turns a PyTorch model into a SQL-callable function. There is no separate inference infrastructure to manage, no API gateway to configure, and no data movement between systems.
The solution extends naturally.
The same architecture supports internal use via Streamlit or external access via API endpoints. Additional media types (video, 3D models) can be added by deploying new embedding services and registry tables following the same pattern.
For public-sector organizations managing intellectual property at scale, this approach offers a path to automated, consistent, and auditable similarity detection. This can all be done within a single governed platform using Snowflake.