Glórund crossing heterogeneous terrain in search of Túrin felt like an apt opener for an article about searching across heterogeneous assets!
Is this too much of a stretch? Well, anyway…
If you search for "digital asset management software," you'll find many mature solutions. Adobe Experience Manager — probably the most recognizable name in enterprise marketing infrastructure — handles digital assets as part of a broader content management platform. Cloudinary and Bynder represent the more focused end of the spectrum: purpose-built DAMs with polished interfaces, rich metadata management, and integrations designed for marketing teams. These are mature, well-funded products with years of iteration behind them.
So why build one from scratch?
The honest answer: I didn't build this because the market had a gap. I built it because I had some questions:
- How do you model metadata for creative assets that are structurally heterogeneous: a PNG, an HTML email and a push notification living in the same library?
- How do you integrate an LLM into an indexing pipeline without making uploads feel slow?
- How do you expose a single search endpoint that handles both rigid filter-based queries and natural language, without the interface becoming a mess?
These are questions that appear the moment you try to build anything resembling a searchable content repository. Whether you're integrating with an off-the-shelf DAM via API, building a lightweight internal tool or extending an existing platform, the underlying mechanics are the same. Understanding them gives you leverage regardless of which path you choose.
The fictional system I built — Orqestra Assets — is a DAM focused on marketing creative pieces: app banners (PNG), email templates (HTML), and SMS/push payloads (JSON). It's not a production system; it's a deliberate architecture built to answer those questions, with real code, real tradeoffs, and a stack that maps directly to what you'd use in an AWS environment.
It's also part of a larger platform I've been working on, so there may be more parts to come. Here, I'll walk through the architecture for Assets: how assets are ingested, how they're indexed asynchronously with LLM-generated descriptions and how search works across both structured filters and natural language queries.
The code is available on GitHub.
The solution draft
"Orqestra Assets" is built around three distinct flows that happen in sequence but are deliberately decoupled from each other:
- Upload: a client uploads an asset or submits a text payload; the API stores the file in S3, registers a row in PostgreSQL, and publishes a message to an SQS queue.
- Describe: a worker consumes the queue, generates a description if needed, creates an embedding and upserts a document into OpenSearch.
- Discover: a client queries the library, either through structured filters resolved in SQL, or through natural language resolved via hybrid search in OpenSearch, enriched with data from Postgres.
The stack maps directly to AWS primitives you'd use in production: S3 for object storage, SQS for async decoupling, OpenSearch for vector and full-text search, PostgreSQL as the source of truth for structured metadata, and the OpenAI API for both description generation and embeddings.
1. Upload
The upload layer's job is to accept an asset, persist it reliably, and hand it off to the indexing pipeline without blocking the client.
"Without blocking" is the key constraint. A multimodal LLM call for a PNG can take several seconds, so if the upload endpoint waited for indexing to complete before responding, the client experience would be unacceptable. Because of this, the API does the minimum necessary synchronously, and delegates everything else to a queue.
The API receives the file, generates an S3 key, stores the object, writes a row to PostgreSQL and publishes a message to SQS. The response returns immediately with the asset ID and indexing_status: pending.
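A minimal sketch of that synchronous path, assuming FastAPI and boto3 (the bucket name, queue URL and save_asset_row helper are hypothetical stand-ins, not the repo's exact code):

import json
import uuid

import boto3
from fastapi import FastAPI, UploadFile

app = FastAPI()
s3 = boto3.client("s3")
sqs = boto3.client("sqs")
BUCKET = "orqestra-assets"  # hypothetical bucket name
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/asset-indexing"  # hypothetical

@app.post("/assets/upload-app")
async def upload_app(file: UploadFile, campaign_id: str):
    content = await file.read()
    s3_key = f"campaigns/{campaign_id}/App/default/{uuid.uuid4()}.png"

    # The synchronous minimum: store the object, register the row, publish the event.
    s3.put_object(Bucket=BUCKET, Key=s3_key, Body=content)
    asset_id = save_asset_row(s3_key, channel="App", fmt="png")  # hypothetical DB helper
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"asset_id": asset_id, "s3_key": s3_key}),
    )

    # Everything slow (LLM description, embedding, OpenSearch) happens later.
    return {"id": asset_id, "indexing_status": "pending"}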
The S3 key encodes the asset's channel and type in the path — a PNG uploaded to a campaign might land at campaigns/{id}/App/{space}/{uuid}.png — and uses a UUID as the filename. Separately, the API computes a SHA-256 of the content and stores it in the asset's metadata, giving you a foundation for deduplication logic if you need it later.
The OpenSearch document ID is derived from a hash of the S3 key. This means that if the same object triggers multiple indexing attempts — a duplicate queue message, an S3 notification racing with an explicit publish — the upsert always lands on the same document. Re-indexing is safe; OpenSearch doesn't accumulate duplicates.
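Both identifiers are cheap to derive. A sketch (SHA-256 for the document ID is an assumption; any stable hash gives the same idempotency property):

import hashlib

def content_sha256(content: bytes) -> str:
    # Stored in asset metadata; a foundation for later dedup logic.
    return hashlib.sha256(content).hexdigest()

def opensearch_doc_id(s3_key: str) -> str:
    # Deterministic: the same object always upserts the same document,
    # so duplicate queue messages can't create duplicate documents.
    return hashlib.sha256(s3_key.encode("utf-8")).hexdigest()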
PNG and HTML come as file uploads — POST /assets/upload-app and POST /assets/upload-email respectively. The API validates the format, reads the bytes, and writes the object to S3. SMS and push work differently: the client submits the message text as a JSON body to POST /assets/text, and the API itself serialises it into a .json file before writing it to the bucket. There is no file to upload; the file is constructed server-side.
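Continuing the upload sketch above, the text path might look like this (the Pydantic schema and its field names are hypothetical):

from pydantic import BaseModel

class TextAssetIn(BaseModel):  # hypothetical request schema
    campaign_id: str
    channel: str  # "SMS" or "Push"
    text: str

@app.post("/assets/text")
async def create_text_asset(payload: TextAssetIn):
    # No file arrives from the client: the .json object is built server-side.
    body = json.dumps({"text": payload.text}).encode("utf-8")
    s3_key = f"campaigns/{payload.campaign_id}/{payload.channel}/default/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=s3_key, Body=body, ContentType="application/json")

    # From here the flow is identical: assets row, SQS publish, pending status.
    asset_id = save_asset_row(s3_key, channel=payload.channel, fmt="text")
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"asset_id": asset_id, "s3_key": s3_key}),
    )
    return {"id": asset_id, "indexing_status": "pending"}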
All three paths write a row to the assets table with a channel and format field — App/png, E-mail/html, SMS/text, Push/text — and then publish to the same queue. By the time the worker picks up the message, it knows what it's dealing with: the combination of channel and format is enough to decide whether to call the vision model, the text completion model, or neither and go straight to embedding.
2. Describe
Once a message lands in the queue, a background worker takes over. Its job is to do everything the upload endpoint deliberately skipped: fetch the asset from S3, generate a description if the asset type requires one, create an embedding, and push the result to OpenSearch.
So, "the worker" is a long-polling SQS consumer. It receives batches of up to ten messages, processes each one concurrently using a thread pool and deletes a message from the queue only after its asset has been successfully indexed. If processing fails, the message is not deleted, SQS makes it visible again after the visibility timeout, and the worker will retry on the next poll. Failures that exhaust all retries land in the DLQ (dead-letter queue).
For each message, the worker reads the s3_key from the payload, downloads the object from S3 and decides what to do based on channel and format. The decision tree from that point is straightforward. For PNG app banners, the worker encodes the image in base64 and sends it to a multimodal model with a prompt asking for a concise marketing description: dominant colours, visible text, campaign theme, appropriate channel. For HTML email templates, it decodes the file and sends it to the same model with a different prompt focused on the email's call to action, tone and campaign fit. For SMS and push payloads there is no LLM call: the text is extracted directly from the JSON stored in S3 and used as-is.
In our case, minor overruns are acceptable, so a 500-character prompt instruction is sufficient to keep descriptions within a reasonable size without needing hard truncation or other techniques in code.
_SYSTEM_PROMPT = (
"You are a digital marketing expert specialized in creative asset cataloguing. "
"Generate concise, retrieval-optimized descriptions of marketing assets. "
"Reply only with the description text, in English, in at most 500 characters."
"No introduction, no title, no 'here is', no numbered lists, no meta-commentary."
)
_PNG_PROMPT = (
"Describe this creative asset for search retrieval. "
"Include: dominant colors, main visual elements, visible text, campaign theme, and suitable channel."
)
_HTML_PROMPT = (
"Describe this HTML email template for search retrieval. "
"Include: main theme, call to action, message tone, and suitable campaign type."
)
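Wiring the decision tree to these prompts might look like the sketch below, assuming the OpenAI Python SDK; the model name and the JSON payload shape for text assets are assumptions:

import base64
import json

from openai import OpenAI

llm = OpenAI()

def describe(channel: str, fmt: str, raw: bytes) -> str:
    if (channel, fmt) == ("App", "png"):
        b64 = base64.b64encode(raw).decode()
        user_content = [
            {"type": "text", "text": _PNG_PROMPT},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]
    elif (channel, fmt) == ("E-mail", "html"):
        user_content = f"{_HTML_PROMPT}\n\n{raw.decode('utf-8')}"
    else:
        # SMS/Push: no LLM call, the stored text is the description.
        return json.loads(raw)["text"]  # assumed payload shape

    resp = llm.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any multimodal chat model works here
        messages=[
            {"role": "system", "content": _SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
    )
    return resp.choices[0].message.content.strip()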
This is the asymmetry the queueing design was built to absorb. A push notification costs one fast JSON parse. An app banner costs a vision model call that might take several seconds. From the upload client's perspective, both are the same: post the asset, get a response, check back later.
Once a description exists, the worker prepends the asset's display title if one was set, and passes the combined text to OpenAI's text-embedding-3-small to generate a vector. That vector, along with the description and the asset's structured fields (channel, format, locale, lifecycle status, campaign id) is upserted into OpenSearch under the document ID derived from the S3 key.
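A sketch of that step with the OpenAI and opensearch-py clients, reusing opensearch_doc_id from earlier (the endpoint, index name and field names are assumptions based on the fields listed above):

from openai import OpenAI
from opensearchpy import OpenSearch

client = OpenAI()
os_client = OpenSearch(hosts=["http://localhost:9200"])  # hypothetical endpoint

def index_document(asset, description: str) -> None:
    # Prepend the display title, if any, so it contributes to the embedding.
    text = f"{asset.title}. {description}" if asset.title else description
    vector = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding

    # Upsert under the deterministic ID derived from the S3 key:
    # re-indexing overwrites instead of duplicating.
    os_client.index(
        index="assets",  # assumed index name
        id=opensearch_doc_id(asset.s3_key),
        body={
            "description": description,
            "embedding": vector,
            "channel": asset.channel,
            "format": asset.format,
            "locale": asset.locale,
            "status": asset.status,
            "campaign_id": asset.campaign_id,
        },
    )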
The final step is a write back to Postgres: the description and embedding_id columns are updated and indexing_status is set to indexed. If anything fails before that point, the status is set to error instead, and the message stays in the queue for retry.
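The write-back itself is a single UPDATE; a sketch with a psycopg2-style connection, using the column names described above:

def mark_indexed(conn, asset_id, description, doc_id):
    # Runs only after a successful upsert; on failure the status is set
    # to 'error' instead and the SQS message stays visible for retry.
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE assets
               SET description = %s,
                   embedding_id = %s,
                   indexing_status = 'indexed'
             WHERE id = %s
            """,
            (description, doc_id, asset_id),
        )
    conn.commit()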
One thing this pipeline doesn't do is validate description quality before indexing. A description that's technically successful but semantically weak lands in OpenSearch indistinguishably from a good one. The practical consequence is that recall degrades silently: users searching for "bold red promotional banner" may not surface an asset that matches visually, if the model described it as "a marketing creative with promotional messaging." Validating description quality without a reference set is hard. The most honest mitigation at this stage is observability: log every description, monitor length distributions across batches and treat significant anomalies as a signal to inspect manually.
⚠️ In this project, embedding generation sits outside OpenSearch entirely. The trade-off is that you own the orchestration: every indexing job and every search request carries an outbound API call to a model provider, with the associated latency and failure modes.
One alternative, available in production on Amazon OpenSearch Service, is to register a model connector either pointing to Amazon Bedrock or to an external provider, and delegate embedding generation to OpenSearch itself via an ingest pipeline processor at index time and a neural query at search time. In that setup, the worker would send plain text and OpenSearch would handle the vector internally, removing the custom embedding code from the application entirely.
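For illustration, that managed setup might look roughly like this (the text_embedding ingest processor and the neural query are real OpenSearch neural-search features; the model ID is a placeholder for a registered connector):

# Ingest side: OpenSearch embeds the description itself at index time,
# via a text_embedding processor bound to a registered model connector.
os_client.ingest.put_pipeline(
    id="asset-embedding",
    body={
        "processors": [
            {
                "text_embedding": {
                    "model_id": "<connector-model-id>",  # placeholder
                    "field_map": {"description": "embedding"},
                }
            }
        ]
    },
)

# Search side: a neural query embeds the query text with the same model,
# so the application never touches vectors at all.
neural_query = {
    "query": {
        "neural": {
            "embedding": {
                "query_text": "summer promotion with red background",
                "model_id": "<connector-model-id>",
                "k": 10,
            }
        }
    }
}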
3. Discover
On the library page, assets appear as a card grid: PNG banners render with a thumbnail, while HTML and text assets show an icon and a type badge. Clicking any card opens a detail sheet with the full metadata, the generated description, the indexing status and a download link.
Above the grid sits a search bar and a row of filters: channel, format, locale, lifecycle status, tags and campaign partition. They all coexist on the same view and feed the same request. A user can narrow to all active push notifications in Brazilian Portuguese or type a natural language query like "summer promotion with red background" and let the ranking handle the rest. The user can also do both at once, combining structured filters with semantic search in a single call. Typing triggers a debounced query, so the grid updates as the user types without hammering the API on every keystroke.
When no query text is provided, the request goes entirely through Postgres. Assets are filtered by the supplied fields, ordered by creation date and paginated. It's a straightforward SQL query and returns quickly. When a query string is present, the path is different.
How hybrid search works
To understand why hybrid search matters here, it helps to understand what each component does on its own.
BM25 is the algorithm behind traditional keyword search. It ranks documents by how often the query terms appear in them, adjusted for document length and term frequency across the corpus. It's fast, interpretable and works well when the user knows the right words. But it's brittle: a query for "urgent promotional tone" returns nothing if none of those exact words appear in the indexed descriptions, even if a perfectly relevant asset exists.
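For reference, the standard BM25 score in TeX notation, where f(q_i, D) is how often query term q_i appears in document D, |D| is the document length, avgdl the average document length in the corpus, and k_1 and b are tuning constants:

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}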
kNN (k-nearest neighbors) operates on embeddings — vector representations that encode semantic meaning rather than surface text. When you embed the query and search for the nearest vectors in the index, you're finding assets that are conceptually similar, regardless of whether they share any words with the query. This is what makes "something warm and summery for a mobile audience" a valid search. kNN is indifferent to exact matches, though, so a query for a specific campaign name or a precise tag will often return semantically adjacent but wrong results.
Hybrid search combines both. In this project, when a natural language query arrives, the API embeds it in real time using text-embedding-3-small, the same model used during indexing, and sends both the query string and the embedding to OpenSearch as a hybrid query. OpenSearch runs the BM25 and kNN sub-queries in parallel, normalizes each score set independently using min-max normalization, and combines them into a single ranking via weighted arithmetic mean. The weights favor the vector component slightly, on the assumption that semantic similarity is more useful than keyword overlap for creative asset retrieval.
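A sketch of both sides in OpenSearch terms, using a search pipeline with a normalization-processor. The pipeline and index names are hypothetical; the 0.35/0.65 weights match the configuration described here:

# One-off setup: a search pipeline that min-max normalizes each score set
# and combines them with a weighted arithmetic mean (0.35 lexical, 0.65 vector).
os_client.transport.perform_request(
    "PUT",
    "/_search/pipeline/hybrid-rank",
    body={
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": [0.35, 0.65]},
                    },
                }
            }
        ]
    },
)

# Per request: embed the query text, then run both sub-queries as one hybrid query.
q = "summer promotion with red background"
vector = client.embeddings.create(model="text-embedding-3-small", input=q).data[0].embedding
hits = os_client.search(
    index="assets",
    params={"search_pipeline": "hybrid-rank"},
    body={
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {"description": q}},  # BM25 sub-query (weight 0.35)
                    {"knn": {"embedding": {"vector": vector, "k": 50}}},  # kNN sub-query (weight 0.65)
                ]
            }
        }
    },
)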
Those asset IDs come back from OpenSearch without the full metadata. The API then fetches the corresponding rows from Postgres — joined to whatever SQL-only filters remain, such as tags — and re-orders them to match the ranking OpenSearch produced. What the client receives is a page of fully hydrated asset objects, ordered by relevance, with pagination driven by the original limit and offset parameters.
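The hydration step is mostly a lookup plus a re-sort; a sketch with illustrative column names:

def hydrate_in_rank_order(conn, ranked_ids):
    # Fetch full rows for the IDs OpenSearch returned; SQL-only filters
    # such as tags would join in here.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, title, channel, format, s3_key FROM assets WHERE id = ANY(%s)",
            (ranked_ids,),
        )
        by_id = {row[0]: row for row in cur.fetchall()}
    # SQL gives no ordering guarantee, so re-impose the relevance ranking.
    return [by_id[i] for i in ranked_ids if i in by_id]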
The separation between the two stores is intentional: OpenSearch owns relevance ranking, while Postgres remains the source of truth for the structured metadata.
Does it work?
Hybrid search returns results, but that doesn't mean it returns the right results. Without a way to measure retrieval quality, tuning the pipeline is guesswork: you don't know whether changing parameters helped or whether a prompt revision improved description usefulness. Evaluation doesn't need to be elaborate to be useful, but it needs to be systematic.
What I built in this project is a lightweight evaluation pipeline that tests natural-language retrieval quality only. That means no deterministic UI filters (channel, format, locale, etc.) are allowed to influence the score. Each test query is sent as plain language, and the system must rank relevant assets using the same hybrid search path that real queries take.
What was built
The evaluation flow is split into three scripts:
- evals/generate_eval_dataset.py
- evals/upload_assets.py
- evals/run_eval.py
Together, they form a reproducible loop from synthetic asset generation to scored retrieval results.
1) Dataset generation (generate_eval_dataset.py)
This script creates a controlled benchmark corpus:
- synthetic creative assets (PNG, HTML, SMS, Push);
- a manifest describing each asset;
- a query specification file (query_specs.json) containing query, expected_ids and type.
The query types are organized around search intent:
- exact_intent
- paraphrase_intent
- cross_channel_intent
- ambiguous_intent
This makes reporting easier to interpret: you can see whether the engine performs differently on literal requests, paraphrased requests, cross-channel intents or harder ambiguous intents.
2) Upload + dataset resolution (upload_assets.py)
This script uploads generated assets to the DAM API, waits for indexing, and resolves logical IDs into real API asset IDs. It then builds eval_dataset.json, which is what the runner consumes.
3) Evaluation runner (run_eval.py)
The runner reads eval_dataset.json, sends each query to POST /assets/search and compares ranked results against expected relevant assets.
What it measures
The evaluation reports quality per query, per intent category and globally.
The metrics are:
- Success@1 — Did the first result match any expected relevant asset?
- Success@3 — Did at least one relevant asset appear in the top 3?
- Recall@3 — How much of the relevant set appears in the top 3?
- MRR — How early does the first relevant result appear?
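These definitions reduce to a few lines of code. A sketch of the metric functions (not the repo's exact implementation):

def success_at_k(ranked, relevant, k):
    # 1.0 if any relevant asset appears in the top k, else 0.0
    return float(any(a in relevant for a in ranked[:k]))

def recall_at_k(ranked, relevant, k):
    # fraction of the relevant set present in the top k
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    # 1/rank of the first relevant hit; 0.0 if none appears
    for i, asset_id in enumerate(ranked, start=1):
        if asset_id in relevant:
            return 1.0 / i
    return 0.0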
How to run it
From the repository root:
# 1) Start the stack
docker compose up -d --build
# 2) Build eval dataset from existing uploaded mapping
docker compose --profile eval run --rm eval-upload --build-dataset-only
# 3) Run evaluation
docker compose --profile eval run --rm eval-run
Outputs:
- evals/output/eval_results.json for per-query details
- evals/output/eval_summary.json for aggregate metrics
Interpreting results
With K=3, the benchmark produced:
Success@1: 0.7143
Success@3: 1.0000
Recall@3: 0.8988
MRR: 0.8393
The big picture: the right asset is always somewhere in the top 3, but it only shows up in first place about 71% of the time. The system is good at finding the right assets, but not always at ranking them first.
1. Paraphrase intent — perfect
S@1 = 1.00, S@3 = 1.00, Recall@3 = 1.00, MRR = 1.00
When describing what is wanted in natural language ("pink floral banner for mother's day", "abandoned cart recovery email"), the system got it right every time. All 12 queries in this category landed the correct asset at rank 1.
This is the category the system is built for: the LLM-generated descriptions and the embeddings are doing exactly what they should, bridging the gap between how a user phrases a request and how the asset was originally described. With the hybrid weighting set at 0.35 lexical / 0.65 vector, this is also the category that benefits most from the current configuration.
2. Cross-channel intent — mostly limited by the K=3 cap
S@1 = 1.00, S@3 = 1.00, MRR = 1.00, Recall@3 = 0.7083
When a query expects 4 assets (one per channel) and we only look at the top 3, we can never get full recall. That's a limitation of how we chose to measure, not of the system itself. Three of the four queries hit this ceiling cleanly.
The exception is "reactivation of inactive customers with offer": the email shows up first, but the SMS and Push versions don't make the top 3, even though they exist and the system finds them on other queries. This one query is dragging the average down.
It's also worth noting that K=3 is a deliberate choice, not the only sensible one. It reflects what users actually see in the first row of results, but for cross-channel queries it under-rewards the system. A small refinement worth considering would be reporting Recall@N (where N matches the number of expected assets).
3. Exact intent — the weak spot
S@1 = 0.125, S@3 = 1.00, Recall@3 = 1.00, MRR = 0.5208
For keyword-style queries ("free shipping", "order tracking", "black friday urgency countdown") the right asset is always in the top 3, but almost never at rank 1. It usually lands at rank 2 or 3, behind thematically similar assets.
The cause is fairly direct: the hybrid search currently weights lexical matches at 0.35 and vector matches at 0.65. That bias works beautifully for paraphrased queries, but for short, literal queries it lets thematically related assets outrank the one that matches the exact words. Given that the right asset still always lands in the top 3, the trade-off is acceptable for now.
4. Ambiguous intent — better than expected
S@1 = 0.75, S@3 = 1.00, Recall@3 = 0.5833, MRR = 0.8333
Vague queries actually do better at S@1 than the exact ones above, which reinforces the idea that the system favours semantic matches. For example, "creative with warm tone and soft visual elements" correctly surfaces the Mother's Day pink floral banner at rank 1, even though nothing in the query mentions Mother's Day or florals.
Recall@3 is the lowest of all categories, but that's expected: when a query is broad, more assets could plausibly be relevant, and not all of them fit in the 3 slots.
Wrapping up
This project forced a series of decisions that documentation tends to skip over: where exactly to place filters, why description quality matters, how the failure model of a queue-based pipeline is fundamentally different from a synchronous one. Those things only become visible when you have to make them yourself.
One thing I'd revisit is embedding ownership. Generating embeddings in the application layer works fine at this scale, but it's something that Amazon OpenSearch can absorb in production through model connectors and neural queries. Whether that tradeoff is worth it depends on how much you want to own.
Evaluation showed that the search reliably finds the right things. The main area that could be improved is ranking, especially for short keyword queries. One option worth exploring could be to give literal word matches a bit more weight in the hybrid search.
If you've built something similar or made different trade-offs around indexing, search or evaluation, I'd be curious to hear how you approached it.