Jim Liu
Measuring How LLMs Recommend Brands & Sites: Entity-Conditioned Probing & Resampling

TL;DR: We open-sourced a method and dataset to evaluate how LLMs surface brands/sites across queries. It uses entity-conditioned probing with multi-sampling + half-split consensus to check reliability. You can reproduce everything with the repo and datasets below.


Why this matters

LLMs increasingly act as recommenders in everyday queries (“best running shoes”, “top B2B CRMs”, etc.). If you’re shipping AI products—or your brand cares about LLM visibility—you probably want to know:

  • Which brands/sites are shown most often?
  • How stable are the results across samples/locales/models?
  • How reliable is a “top-k” list you derive from an LLM?

Our goal: make this measurable, reproducible, and honest about limitations.


Method in 90 seconds

Entity-conditioned probing (ECP):

We design prompts per category (e.g., “best XXX tools in DE”) and collect multiple independent samples per (category, locale) on each model. Each response is parsed into a list of entities (brands/sites).
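Parsing free-form model answers into entity lists is the fiddly step. Here is a minimal sketch of what such a parser might look like; `parse_entities` is a hypothetical helper (not the repo's actual parser) that assumes responses come back as numbered or bulleted lists:

```python
import re

def parse_entities(response_text):
    """Extract entity names from a numbered/bulleted list answer.
    Assumes the entity name is the leading token on each list line,
    before any ':' '-' '(' or en dash elaboration."""
    entities = []
    for line in response_text.splitlines():
        # Match "1. Nike – light" / "2) Adidas: classic" / "- Hoka"
        m = re.match(r"^\s*(?:\d+[.)]|[-*])\s*([^:(\-–]+)", line)
        if m:
            entities.append(m.group(1).strip().lower())
    return entities

print(parse_entities("1. Nike – light\n2. Adidas: classic\n- Hoka"))
# → ['nike', 'adidas', 'hoka']
```

Lowercasing and stripping here mirrors the normalization used in the Quickstart below; real responses also need deduplication and alias handling (e.g., "HubSpot" vs "hubspot.com").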

Resampling for reliability:

We randomly split the collected lists into two halves, compute a consensus top-k list for each half, and measure overlap@k between the two halves.

  • If overlap@k is high → the “top-k” ranking is stable for that setup.
  • If low → treat any single top-k as noisy.
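The half-split check can be sketched in a few lines. This is an illustration of the idea, not the repo's implementation; `consensus_top_k` here uses plain mention frequency, whereas a real consensus might also weight by rank position:

```python
import random
from collections import Counter

def consensus_top_k(lists, k):
    """Rank entities by how many lists mention them (simple frequency consensus)."""
    counts = Counter(e for lst in lists for e in set(lst))
    return [e for e, _ in counts.most_common(k)]

def half_split_overlap(lists, k, seed=0):
    """Randomly split the lists in half and measure overlap@k between
    the consensus top-k of each half (1.0 = perfectly stable ranking)."""
    rng = random.Random(seed)
    shuffled = lists[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    top_a = set(consensus_top_k(shuffled[:mid], k))
    top_b = set(consensus_top_k(shuffled[mid:], k))
    return len(top_a & top_b) / k
```

Repeating `half_split_overlap` over several seeds and averaging gives a more robust stability estimate than a single split.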

Figure 1: Diagram of ECP sampling + half-split consensus flow

We ran 15,600 samples across 52 categories/locales to check stability patterns and surface interesting divergences.


What’s in the repo & data

  • /pl_top/*.csv — per-prompt list outputs and parsed entities
  • results.*.jsonl — structured results + metadata for analysis
  • Scripts to:
    • aggregate list outputs,
    • compute consensus tops,
    • evaluate overlap@k reliability,
    • export tables/figures.

Quickstart (Python)


```python
# pip install pandas numpy
import pandas as pd

# Example: load top lists and compute simple per-entity frequency
pl = pd.read_csv("pl_top/example_category_en-US.csv")  # swap for your file
pl["entity"] = pl["entity"].str.strip().str.lower()    # normalize entity names
freq = pl["entity"].value_counts().head(20)
print(freq)
```
