<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jim Liu</title>
    <description>The latest articles on DEV Community by Jim Liu (@digitalpeaksv).</description>
    <link>https://dev.to/digitalpeaksv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3590472%2Fffb66af1-3fd6-4492-a6c1-4daf394173be.jpg</url>
      <title>DEV Community: Jim Liu</title>
      <link>https://dev.to/digitalpeaksv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/digitalpeaksv"/>
    <language>en</language>
    <item>
      <title>Measuring How LLMs Recommend Brands &amp; Sites: Entity-Conditioned Probing &amp; Resampling</title>
      <dc:creator>Jim Liu</dc:creator>
      <pubDate>Fri, 31 Oct 2025 03:16:10 +0000</pubDate>
      <link>https://dev.to/digitalpeaksv/measuring-how-llms-recommend-brands-sites-entity-conditioned-probing-resampling-21fl</link>
      <guid>https://dev.to/digitalpeaksv/measuring-how-llms-recommend-brands-sites-entity-conditioned-probing-resampling-21fl</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; We open-sourced a method and dataset to evaluate how LLMs surface brands/sites across queries. It uses &lt;strong&gt;entity-conditioned probing&lt;/strong&gt; with &lt;strong&gt;multi-sampling + half-split consensus&lt;/strong&gt; to check reliability. You can reproduce everything with the repo and datasets below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Paper (preprint):&lt;/strong&gt; &lt;a href="https://zenodo.org/records/17489350" rel="noopener noreferrer"&gt;https://zenodo.org/records/17489350&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/jim-seovendor/entity-probe" rel="noopener noreferrer"&gt;https://github.com/jim-seovendor/entity-probe&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data (HF):&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/seovendorco/entity-probe" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/seovendorco/entity-probe&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;LLMs increasingly act as &lt;em&gt;recommenders&lt;/em&gt; in everyday queries (“best running shoes”, “top B2B CRMs”, etc.). If you’re shipping AI products—or your brand cares about LLM visibility—you probably want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Which brands/sites are shown most often?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How stable are the results across samples/locales/models?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How reliable is a “top-k” list you derive from an LLM?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our goal: make this measurable, reproducible, and honest about limitations.&lt;/p&gt;




&lt;h2&gt;Method in 90 seconds&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Entity-conditioned probing (ECP):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We design prompts per category (e.g., “best XXX tools in DE”) and collect &lt;strong&gt;multiple independent samples&lt;/strong&gt; per (category, locale) on each model. Each response is parsed into a list of entities (brands/sites).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resampling for reliability:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We split the set of lists into two halves, compute a &lt;strong&gt;consensus top-k&lt;/strong&gt; list for each half, and measure &lt;strong&gt;overlap@k&lt;/strong&gt; between the halves.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If overlap@k is high → the “top-k” ranking is &lt;strong&gt;stable&lt;/strong&gt; for that setup.&lt;/li&gt;
&lt;li&gt;If low → treat any single top-k as noisy.&lt;/li&gt;
&lt;/ul&gt;
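&lt;p&gt;The half-split check above can be sketched in a few lines. This is a minimal illustration with toy entity lists; the helper names are ours, not from the repo:&lt;/p&gt;

```python
import random
from collections import Counter

def consensus_top_k(lists, k):
    """Rank entities by how many sampled lists mention them; return the top k."""
    counts = Counter(entity for lst in lists for entity in lst)
    return [entity for entity, _ in counts.most_common(k)]

def overlap_at_k(lists, k, seed=0):
    """Shuffle the sampled lists, split into halves, and measure top-k agreement."""
    shuffled = lists[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    top_a = set(consensus_top_k(shuffled[:half], k))
    top_b = set(consensus_top_k(shuffled[half:], k))
    return len(top_a.intersection(top_b)) / k

# Four toy samples for one (category, locale) cell
samples = [
    ["hubspot", "salesforce", "pipedrive"],
    ["salesforce", "hubspot", "zoho"],
    ["hubspot", "pipedrive", "salesforce"],
    ["salesforce", "zoho", "hubspot"],
]
print(overlap_at_k(samples, k=3))
```

&lt;p&gt;In practice you would repeat the split over several random seeds and average the overlap, since a single split can be lucky or unlucky.&lt;/p&gt;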

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkltag2fvlg6z9sbqmmit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkltag2fvlg6z9sbqmmit.png" alt="Figure 1: Diagram of ECP sampling + half-split consensus flow" width="800" height="487"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Diagram of ECP sampling + half-split consensus flow&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We ran &lt;strong&gt;15,600 samples&lt;/strong&gt; across &lt;strong&gt;52 category/locale combinations&lt;/strong&gt; to check stability patterns and surface interesting divergences.&lt;/p&gt;




&lt;h2&gt;What’s in the repo &amp;amp; data&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/pl_top/*.csv&lt;/code&gt; — per-prompt list outputs and parsed entities
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;results.*.jsonl&lt;/code&gt; — structured results + metadata for analysis
&lt;/li&gt;
&lt;li&gt;Scripts to:

&lt;ul&gt;
&lt;li&gt;aggregate list outputs,&lt;/li&gt;
&lt;li&gt;compute consensus tops,&lt;/li&gt;
&lt;li&gt;evaluate &lt;strong&gt;overlap@k&lt;/strong&gt; reliability,&lt;/li&gt;
&lt;li&gt;export tables/figures.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;Quickstart (Python)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
# pip install pandas numpy
import pandas as pd
import json

# Example: load top lists and compute simple frequency
pl = pd.read_csv("pl_top/example_category_en-US.csv")  # swap for your file
pl["entity"] = pl["entity"].str.strip().str.lower()
freq = pl["entity"].value_counts().head(20)
print(freq)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
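&lt;p&gt;From the same CSV layout you can run the overlap@k check yourself. A hedged sketch on a toy frame follows; the &lt;code&gt;sample_id&lt;/code&gt; column name is hypothetical, standing in for whatever field identifies each model response in your export:&lt;/p&gt;

```python
import pandas as pd
from collections import Counter

# Toy frame mirroring the pl_top CSV layout; "sample_id" is a
# hypothetical column name for whatever identifies each response.
pl = pd.DataFrame({
    "sample_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "entity": ["HubSpot", "Zoho", "HubSpot", "Pipedrive",
               "Zoho", "HubSpot", "HubSpot", "Zoho"],
})
pl["entity"] = pl["entity"].str.strip().str.lower()

# One entity list per sampled response
lists = pl.groupby("sample_id")["entity"].apply(list)

# Split the responses into two halves; consensus top-2 per half
ids = list(lists.index)
half_a, half_b = ids[: len(ids) // 2], ids[len(ids) // 2 :]

def top_k(sample_ids, k=2):
    counts = Counter(e for sid in sample_ids for e in lists[sid])
    return set(e for e, _ in counts.most_common(k))

overlap = len(top_k(half_a).intersection(top_k(half_b))) / 2
print(overlap)  # 1.0 here: both halves agree on hubspot + zoho
```

&lt;p&gt;An overlap near 1.0 means the derived top-k is stable for that setup; values well below 1.0 mean any single top-k list should be treated as noisy.&lt;/p&gt;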

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>datascience</category>
      <category>research</category>
    </item>
  </channel>
</rss>
