<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daniel Griffin</title>
    <description>The latest articles on DEV Community by Daniel Griffin (@danielsgriffin).</description>
    <link>https://dev.to/danielsgriffin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1966396%2Ff13b908c-c4fe-4b56-b6ce-1e7f10cd0eb1.png</url>
      <title>DEV Community: Daniel Griffin</title>
      <link>https://dev.to/danielsgriffin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danielsgriffin"/>
    <language>en</language>
    <item>
      <title>Examining HN Discovery Quality Using Existing Complaints</title>
      <dc:creator>Daniel Griffin</dc:creator>
      <pubDate>Tue, 10 Sep 2024 19:07:43 +0000</pubDate>
      <link>https://dev.to/danielsgriffin/examining-hn-discovery-quality-using-existing-complaints-13pa</link>
      <guid>https://dev.to/danielsgriffin/examining-hn-discovery-quality-using-existing-complaints-13pa</guid>
      <description>&lt;p&gt;We've &lt;a href="https://trieve.ai/launching-trieve-hn-discovery/" rel="noopener noreferrer"&gt;launched a new Hacker News search experience&lt;/a&gt;, focused on discovery: &lt;a href="https://hn.trieve.ai/" rel="noopener noreferrer"&gt;hn.trieve.ai&lt;/a&gt; (GitHub: &lt;a href="https://github.com/devflowinc/trieve" rel="noopener noreferrer"&gt;Trieve API backend&lt;/a&gt;, &lt;a href="https://github.com/devflowinc/trieve-hn-discovery/" rel="noopener noreferrer"&gt;frontend search interface&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Hacker News has long been a playground for search innovation—with the community often leaning in to explore new possibilities in search. Over the past six months, Nick has been looking back at the various search experiences and detailed his findings in a post: &lt;a href="https://trieve.ai/history-of-hnsearch/" rel="noopener noreferrer"&gt;History of HackerNews Search: From 2007 to 2024&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We combed through HN (and &lt;a href="https://github.com/algolia/hn-search/issues" rel="noopener noreferrer"&gt;user issues posted to Algolia's repo for HN search&lt;/a&gt;) in search of search complaints. Over the years there have been some complaints about indexing issues, and we’re not covering those in this post. Instead, we looked for examples where people shared actual search queries. For each query, we looked for what they said or implied about their search intent and the search results they found. What have people said about search quality? What searches are not possible or not easy in the current HN search? When are folks resorting to running a &lt;code&gt;site:&lt;/code&gt; search on a search engine like Google? Where can Trieve help make search better?&lt;/p&gt;

&lt;h2&gt;
  
  
  Discover well-beyond exact matches in titles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Searching for "postgres clustering"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search intent:&lt;/strong&gt; Finding information about PostgreSQL clustering solutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourced from:&lt;/strong&gt; &lt;a href="https://news.ycombinator.com/item?id=40240237" rel="noopener noreferrer"&gt;@avereveard comment on Hacker News&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Algolia for "postgres clustering"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14nkej58n44snftft2bq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14nkej58n44snftft2bq.png" alt="Algolia for postgres clustering" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve for "postgres clustering"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym7h4x1i63123hunnedm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym7h4x1i63123hunnedm.png" alt="Trieve for postgres clustering" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Searching for "AT&amp;amp;T says criminals stole phone records of 'nearly all' customers in new data breach"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search intent:&lt;/strong&gt; Exact phrase match with long query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourced from:&lt;/strong&gt; &lt;a href="https://github.com/algolia/hn-search/issues/243" rel="noopener noreferrer"&gt;@Pranoy1c on github/algolia/hn-search/issues&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Algolia for "AT&amp;amp;T says criminals stole phone records of nearly all customers in new data breach"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs72m4foi7m33a7cyfxfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs72m4foi7m33a7cyfxfo.png" alt="Algolia for AT&amp;amp;T says criminals stole phone records of nearly all customers in new data breach" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve for "AT&amp;amp;T says criminals stole phone records of nearly all customers in new data breach"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8qlon9oljvfti5a7sst.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8qlon9oljvfti5a7sst.png" alt="Trieve for AT&amp;amp;T says criminals stole phone records of nearly all customers in new data breach" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Searching with special characters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Searching for "[video]"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search intent:&lt;/strong&gt; Exact string match when string contains non-alphanumeric characters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourced from:&lt;/strong&gt; &lt;a href="https://github.com/algolia/hn-search/issues/197" rel="noopener noreferrer"&gt;@some1else on github/algolia/hn-search/issues&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Algolia for "[video]"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7kd13w5elywlv91ejwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7kd13w5elywlv91ejwn.png" alt="Algolia for [video]" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve (semantic) for "[video]"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7qouhe9f5ldpvxtp181.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7qouhe9f5ldpvxtp181.png" alt="Trieve (semantic) for [video]" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Algolia (quoted) for "[video]"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltumh3pud7n5gz695fuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltumh3pud7n5gz695fuo.png" alt="Algolia (quoted) for [video]" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve (semantic, quoted) for "[video]"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz87bysam4fdqswessvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz87bysam4fdqswessvq.png" alt="Trieve (semantic, quoted) for [video]" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Searching for "AT&amp;amp;T"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search intent:&lt;/strong&gt; Exact match for acronym (with ampersand)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourced from:&lt;/strong&gt; &lt;a href="https://github.com/algolia/hn-search/issues/244" rel="noopener noreferrer"&gt;@Pranoy1c on github/algolia/hn-search/issues&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Algolia (prefix=true) for "AT&amp;amp;T"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vuy6s6npr9t7m4gb9no.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vuy6s6npr9t7m4gb9no.png" alt="Algolia (prefix=true) for AT&amp;amp;T" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Algolia (prefix=false) for "AT&amp;amp;T"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5disnwycoz4uedi2wgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5disnwycoz4uedi2wgg.png" alt="Algolia (prefix=false) for AT&amp;amp;T" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Algolia (quoted) for "AT&amp;amp;T"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eurcg8wrup7jjnajf2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eurcg8wrup7jjnajf2t.png" alt="Algolia (quoted) for AT&amp;amp;T" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve for "AT&amp;amp;T"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwk4m1go6sau0b1s948u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwk4m1go6sau0b1s948u.png" alt="Trieve for AT&amp;amp;T" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Out-of-domain strings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Searching for "lootitooti"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search intent:&lt;/strong&gt; Match for out-of-domain string&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourced from:&lt;/strong&gt; &lt;a href="https://news.ycombinator.com/item?id=40869786" rel="noopener noreferrer"&gt;@douglaskayama comment on Hacker News&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Algolia for "lootitooti"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsyij2e0f5skm9kecnr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsyij2e0f5skm9kecnr0.png" alt="Algolia for lootitooti" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve for "lootitooti"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvxe3lmd6mwa67t4xcpk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvxe3lmd6mwa67t4xcpk.png" alt="Trieve for lootitooti" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Presque vue searches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Searching for "deterministic Docker builds"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search intent:&lt;/strong&gt; Trying to remember "the name of an SaaS to pin/cache/ back up my apt/apk/pip dependencies"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourced from:&lt;/strong&gt; &lt;a href="https://news.ycombinator.com/item?id=40241070" rel="noopener noreferrer"&gt;@codethief comment on Hacker News&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Algolia (type: Story) for "deterministic Docker builds"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxnc62zp6iuptno48750.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxnc62zp6iuptno48750.png" alt="Algolia (type: Story) for deterministic Docker builds" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve (type: Story) for "deterministic Docker builds"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaa04t19l3q4d1p9ngyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaa04t19l3q4d1p9ngyl.png" alt="Trieve (type: Story) for deterministic Docker builds" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Searching for "tip of your tongue phenomenon"
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Bonus!&lt;/strong&gt; Again, precision focused approach of requiring "your" has downsides.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Algolia for "tip of your tongue phenomenon"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74qkg0xwmvqfk7uwp1tv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74qkg0xwmvqfk7uwp1tv.png" alt="Algolia for tip of your tongue phenomenon" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve for "tip of your tongue phenomenon"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyx0mtyryjvmjulfhqpq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyx0mtyryjvmjulfhqpq.png" alt="Trieve for tip of your tongue phenomenon" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Filter on author with a hyphenated username
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Searching for ""It Won't Fail Because of Me" by:1970-01-01"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search intent:&lt;/strong&gt; Find posts by author with a hyphenated username&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourced from:&lt;/strong&gt; &lt;a href="https://news.ycombinator.com/item?id=38866518" rel="noopener noreferrer"&gt;@SushiHippie Ask HN on Hacker News&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Algolia for "It Won't Fail Because of Me" by:1970-01-01&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6art968wi25yh0kaibp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6art968wi25yh0kaibp.png" alt="Algolia for " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve for "It Won't Fail Because of Me"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftows16ahuoowlcjwo8hs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftows16ahuoowlcjwo8hs.png" alt="Trieve for " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Default sorting by relevance v. popularity metrics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Searching for "Excel"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search intent:&lt;/strong&gt; Find popular Excel-related stories on Hacker News&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourced from:&lt;/strong&gt; &lt;a href="https://news.ycombinator.com/item?id=41394126" rel="noopener noreferrer"&gt;@airstrike on Hacker News&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is from a comparison that @airstrike shared after we launched our discovery search. He preferred the results from Algolia. Algolia defaults to a popularity-based sort. Algolia also has sort-by-date, but does not have a specific relevance-focused sorting option.&lt;/p&gt;

&lt;p&gt;Trieve offers multiple sorting options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;default: relevance (not tuned to extrinsic popularity metrics)&lt;/li&gt;
&lt;li&gt;number of points (similar to Algolia's "popularity")&lt;/li&gt;
&lt;li&gt;date (reverse chronological)&lt;/li&gt;
&lt;li&gt;descendants (number of comments)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=41394412" rel="noopener noreferrer"&gt;Nick (@skeptrune) responded&lt;/a&gt; with some of our internal deliberations:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We went back and forth on making points sorting default and ended up deciding against it, but maybe we should have. Our thinking was that since it's focused on "discovery" it was worth prioritizing relevance, but I can see how it can feel the result quality isn't as great.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If someone is looking for more of the popularity-focused results, they can start their Trieve HN Discovery searches with the &lt;code&gt;sortby=&lt;/code&gt; parameter set to &lt;code&gt;num_value&lt;/code&gt; (try &lt;a href="https://hn.trieve.ai/?score_threshold=5&amp;amp;page_size=30&amp;amp;prefetch_amount=30&amp;amp;rerank_type=none&amp;amp;highlight_delimiters=+%2C-%2C_%2C.%2C%2C&amp;amp;highlight_threshold=0.85&amp;amp;highlight_max_length=50&amp;amp;highlight_max_num=50&amp;amp;highlight_window=0&amp;amp;recency_bias=0&amp;amp;highlight_results=true&amp;amp;use_quote_negated_terms=true&amp;amp;storyType=story&amp;amp;matchAnyAuthorNames=&amp;amp;matchNoneAuthorNames=&amp;amp;popularityFilters=%7B%7D&amp;amp;sortby=num_value&amp;amp;dateRange=all&amp;amp;searchType=fulltext&amp;amp;page=1&amp;amp;getAISummary=false" rel="noopener noreferrer"&gt;this link&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Algolia (sorted by popularity) for "Excel"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbtqb5w3cw9fbkj8oqqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbtqb5w3cw9fbkj8oqqi.png" alt="Algolia (sorted by popularity) for Excel" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve (sorted by relevance) for "Excel"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bet8kc51ogykz3pz49a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bet8kc51ogykz3pz49a.png" alt="Trieve (sorted by relevance) for Excel" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve (sorted by points) for "Excel"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab9saz9mot7wf8e37ywy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab9saz9mot7wf8e37ywy.png" alt="Trieve (sorted by points) for Excel" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trieve (sorted by descendants) for "Excel"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed6904svd8t3kdkkrgew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed6904svd8t3kdkkrgew.png" alt="Trieve (sorted by descendants) for Excel" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you want to explore comparisons between the Algolia HN Search and our Trieve HN Discovery, it can help to use our "Try it with Trieve!" button via our open-source unpacked Chrome extension: &lt;a href="https://github.com/devflowinc/try-it-with-trieve" rel="noopener noreferrer"&gt;github.com/devflowinc/try-it-with-trieve&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A "Try it with Trieve!" button in action&lt;/em&gt;&lt;br&gt;
&lt;a href="https://github.com/devflowinc/try-it-with-trieve" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf9gemb5jtoderfujafv.png" alt="A Try it with Trieve button in action." width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Learn something from this post? If you'd like to support our project, we'd be grateful if you'd explore and star &lt;a href="https://github.com/devflowinc/trieve" rel="noopener noreferrer"&gt;our GitHub repository&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>search</category>
      <category>complaints</category>
      <category>learning</category>
    </item>
    <item>
      <title>Keeping up with Mintlify's AI Chat</title>
      <dc:creator>Daniel Griffin</dc:creator>
      <pubDate>Tue, 03 Sep 2024 23:29:26 +0000</pubDate>
      <link>https://dev.to/danielsgriffin/keeping-up-with-mintlifys-ai-chat-5a0m</link>
      <guid>https://dev.to/danielsgriffin/keeping-up-with-mintlifys-ai-chat-5a0m</guid>
      <description>&lt;p&gt;I am pretty new to the Trieve codebase (&lt;a href="https://github.com/devflowinc/trieve" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://docs.trieve.ai" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;) and so am sometimes hesitant to ask about things that initially seem to me to be missing, confusing, or not working. This is common for many—even when, like in my case, those around you strongly encourage any and all questions. For &lt;a href="https://danielsgriffin.com/diss/" rel="noopener noreferrer"&gt;my dissertation research at UC Berkeley&lt;/a&gt;, I studied how data engineers search the web for their work and &lt;a href="https://danielsgriffin.com/diss/repairing_searching/" rel="noopener noreferrer"&gt;wrote about how they try to keep learning&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A key observation of data engineering work is that they are operating at the edge, working with the new, or new to them, tools, systems, or other ways of handling data. Navigating around these edges, or learning around the edge, necessitates constantly working to “keep up” &lt;a href="https://doi.org/10.1080/13691180110117631" rel="noopener noreferrer"&gt;(Kotamraju, 2002)&lt;/a&gt; and engage in “intensive self-learning” &lt;a href="https://doi.org/10.1177/0950017020977306" rel="noopener noreferrer"&gt;(Avnoon, 2021)&lt;/a&gt; in a fluid, and loosely defined, field. They use web search to keep up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my research, I also looked at what people did when they were struggling despite searching the web. Often, they—eventually—ask others. And no matter how encouraging your teammates, you may still wonder if your question is going to waste their time or make you appear a fool. Many people fear, or have experienced, being &lt;a href="https://en.wiktionary.org/wiki/rag_on" rel="noopener noreferrer"&gt;ragged on&lt;/a&gt; for asking a question that was perhaps searchable. Maybe they were harshly told to &lt;a href="https://en.wikipedia.org/wiki/RTFM" rel="noopener noreferrer"&gt;read the manual&lt;/a&gt; or sent a &lt;a href="https://www.urbandictionary.com/define.php?term=lmgtfy" rel="noopener noreferrer"&gt;LMGTFY&lt;/a&gt; link (&lt;a href="https://meta.stackoverflow.com/questions/255397/lmgtfy-link-cant-be-added" rel="noopener noreferrer"&gt;a practice long-ago banned on Stack Overflow&lt;/a&gt;). Some of my research participants told me how they sometimes said or sent such things to their colleagues to reduce the demand on their own time. Others confessed how they found themselves silenced when a boss or colleague said something similar to them. So, folks often took the time to carefully package questions, looking for ways to demonstrate due diligence.&lt;/p&gt;

&lt;p&gt;Thanks to Trieve's integration with &lt;a href="https://mintlify.com/" rel="noopener noreferrer"&gt;Mintlify&lt;/a&gt;, my due diligence process looked roughly like this, advancing to the next step as necessary:&lt;/p&gt;

&lt;p&gt;1) Identify a doubt or question.&lt;br&gt;
2) Search &lt;a href="https://docs.trieve.ai/introduction" rel="noopener noreferrer"&gt;the docs&lt;/a&gt; (sometimes through the integrated search bar and sometimes just on the page with Ctrl/⌘+F).&lt;br&gt;
3) Write (and even iteratively reformulate) a question in &lt;a href="https://mintlify.com/blog/chat" rel="noopener noreferrer"&gt;Mintlify's AI Chat&lt;/a&gt; interface.&lt;br&gt;
4) Share a screenshot of the question &amp;amp; answer with the rest of the Trieve team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6jeKzpyi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.trieve.ai/blog/keeping-up-with-mintlify/search.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6jeKzpyi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.trieve.ai/blog/keeping-up-with-mintlify/search.png" alt="screenshot of the Mintlify search bar" width="800" height="97"&gt;&lt;/a&gt;&lt;br&gt;You can jump to the Mintlify search bar at the top of the page with Ctrl/⌘+K
    &lt;/p&gt;

&lt;p&gt;Some questions I only had to recognize to be able to resolve (on my own with the code and computer). Many just required a quick search. For others I could refine my understanding through the act of simply writing out fully formulated questions (like in &lt;a href="https://rubberduckdebugging.com/" rel="noopener noreferrer"&gt;rubber duck debugging&lt;/a&gt; or how even &lt;a href="https://danielsgriffin.com/diss/repairing_searching/#renewed-impetus" rel="noopener noreferrer"&gt;"starting to ask"&lt;/a&gt; a colleague a question helps clarify a path forward). Sometimes the value was in reading the docs as repackaged by the LLM chat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z1J8OGQu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.trieve.ai/blog/keeping-up-with-mintlify/searching.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z1J8OGQu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.trieve.ai/blog/keeping-up-with-mintlify/searching.png" alt="screenshot of Mintlify's search interface" width="800" height="302"&gt;&lt;/a&gt;&lt;br&gt;The Mintlify interface lets you jump straight to the RAG response or the retrieved search results.
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_ZbQrsVg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.trieve.ai/blog/keeping-up-with-mintlify/sources.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_ZbQrsVg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.trieve.ai/blog/keeping-up-with-mintlify/sources.png" alt="screenshot of Mintlify's RAG source links" width="800" height="280"&gt;&lt;/a&gt;&lt;br&gt;Links to the sources used by the LLM to generate the response are displayed at the bottom.
  &lt;/p&gt;

&lt;p&gt;Then there was that last type of question—often dealing with API parameters or some special term in the docs—where the RAG response helped me really recognize my misunderstandings or reframe my confusion. The question-and-answer was displayed in a way that let me quickly relay to the team my doubt or surprise packaged alongside the RAG response. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ClNZGBZU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.trieve.ai/blog/keeping-up-with-mintlify/full.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ClNZGBZU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.trieve.ai/blog/keeping-up-with-mintlify/full.png" alt="screenshot of Mintlify's RAG response to 'What is the difference between group_ids and group_tracking_ids?'" width="800" height="907"&gt;&lt;/a&gt;&lt;br&gt;Screenshot of the Mintlify AI Chat interface showing both the question and the RAG response.
  &lt;/p&gt;

&lt;p&gt;Sharing these screenshots helped make clear how the confusion or uncertainty was connected with the existing documentation, or not. These questions then helped the team identify where I really needed help keeping up and where the documentation might be improved to support a better developer experience in the docs, in the search, and with the RAG.&lt;/p&gt;




&lt;p&gt;Learn something from this post? If you'd like to support our project, we'd be grateful if you'd explore and star &lt;a href="https://github.com/devflowinc/trieve" rel="noopener noreferrer"&gt;our GitHub repository&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>rag</category>
      <category>learning</category>
    </item>
    <item>
      <title>Build Search and RAG for Any Website with Firecrawl and Trieve</title>
      <dc:creator>Daniel Griffin</dc:creator>
      <pubDate>Thu, 22 Aug 2024 19:12:03 +0000</pubDate>
      <link>https://dev.to/danielsgriffin/build-search-and-rag-for-any-website-with-firecrawl-and-trieve-4cgp</link>
      <guid>https://dev.to/danielsgriffin/build-search-and-rag-for-any-website-with-firecrawl-and-trieve-4cgp</guid>
      <description>&lt;p&gt;In this guide, we will show how to use &lt;a href="https://www.firecrawl.dev/" rel="noopener noreferrer"&gt;Firecrawl&lt;/a&gt; and Trieve to build search and RAG for &lt;a href="https://signoz.io/docs/" rel="noopener noreferrer"&gt;SigNoz's documentation&lt;/a&gt; in both Python and JS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.firecrawl.dev/" rel="noopener noreferrer"&gt;Firecrawl&lt;/a&gt;'s REST API can be queried to convert every page available at a URL into vector search and RAG-ready markdown.&lt;/p&gt;

&lt;p&gt;Trieve's API can then receive chunks of the markdown docs, embed and place them into a search index, and finally be called to perform AI Search and RAG on all of the site's content.&lt;/p&gt;

&lt;p&gt;All the code used (both node.js and Python) is also in this GitHub repo (MIT license): &lt;a href="https://github.com/devflowinc/firecrawl-to-trieve" rel="noopener noreferrer"&gt;firecrawl-to-trieve&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements - free API keys:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Firecrawl API key: Signup at &lt;a href="https://www.firecrawl.dev/signin/signup" rel="noopener noreferrer"&gt;firecrawl.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Trieve API key and dataset id: Register and setup at &lt;a href="https://dashboard.trieve.ai/" rel="noopener noreferrer"&gt;dashboard.trieve.ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Setup a .env file with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FIRECRAWL_API_KEY=
TRIEVE_DATASET_ID=
TRIEVE_API_KEY=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  1. Converting Signoz docs to Markdown Chunks with Firecrawl
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.firecrawl.dev/features/crawl" rel="noopener noreferrer"&gt;Firecrawl Docs&lt;/a&gt; &amp;amp; &lt;a href="https://github.com/mendableai/firecrawl/" rel="noopener noreferrer"&gt;Github&lt;/a&gt; (see also their &lt;a href="https://github.com/mendableai/firecrawl/tree/main/apps/js-sdk" rel="noopener noreferrer"&gt;js-sdk&lt;/a&gt; and &lt;a href="https://github.com/mendableai/firecrawl/tree/main/apps/python-sdk" rel="noopener noreferrer"&gt;python-sdk&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;We use Firecrawl to convert the pages on &lt;code&gt;https://signoz.io/docs/&lt;/code&gt; into markdown then save them to a json file (with a timestamp). You can adjust the crawler limit (there are only 303 pages so it falls within the free credits plan).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;FirecrawlApp&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@mendable/firecrawl-js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;dotenv&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dotenv&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;fileURLToPath&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;__filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fileURLToPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;__dirname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__filename&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;dotenv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../.env&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize the FirecrawlApp with your API key&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FirecrawlApp&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FIRECRAWL_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Define crawl parameters&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawlUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://signoz.io/docs/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;maxDepth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;docs/*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;pageOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;onlyMainContent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Crawl the website&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawlResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crawlUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;crawlUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// wait_until_done&lt;/span&gt;
    &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="c1"&gt;// poll_interval&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Save the crawl result to a file (with a timestamp)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/:/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`crawl_results_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.json`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cleaning the Markdown
&lt;/h3&gt;

&lt;p&gt;Each crawl result item has a &lt;code&gt;markdown&lt;/code&gt; field that contains the full markdown text of the page. We did some initial data cleaning as we took our first look at the data and worked out the chunking. You can always add more cleaning later. We cut boilerplate from the end and transformed how links were presented in the metadata. This was done through tested functions that we can easily add to as we start searching.&lt;/p&gt;

&lt;p&gt;Initial cleaning functions are in &lt;code&gt;cleaners.py&lt;/code&gt; / &lt;code&gt;cleaners.js&lt;/code&gt;, imported into the transform scripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chunking the Markdown
&lt;/h3&gt;

&lt;p&gt;For a baseline chunking approach we just want to find ways to split pages into smaller chunks, while maintaining some semantic cohesion within the chunks. Just as with cleaning, we can always refine our chunking strategy as we gain more familarity with the data and our use cases.&lt;/p&gt;

&lt;p&gt;Learn more about chunking techniques in &lt;a href="https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/a4570f3c4883eb9b835b0ee18990e62298f518ef/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb" rel="noopener noreferrer"&gt;Greg Kamradt's "5 Levels Of Text Splitting" Notebook&lt;/a&gt; and &lt;a href="https://unstructured.io/blog/chunking-for-rag-best-practices" rel="noopener noreferrer"&gt;Maria Khalusova's "Chunking for RAG: best practices"&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our transform script first makes chunks of any page that is less than 500 words (though this could easily be switched for a token-based limit based on the LLM of your RAG system). Then it splits longer content into chunks by top-level anchor links (e.g. &lt;code&gt;[](#overview)\nOverview&lt;/code&gt; ) and then splitting by h3, h4, etc (e.g. &lt;code&gt;### [](#step-1-setup-otel-collector)\n&lt;/code&gt;). (Explicit h1-h2 (i.e. &lt;code&gt;# this-is-h1&lt;/code&gt; and &lt;code&gt;## this-is-h2&lt;/code&gt;) headings for this dataset are not indicated in the markdown. The html itself has h2 headings for both the page title and for what is rendered here as the top-level anchor link headings but it is not represented with &lt;code&gt;#&lt;/code&gt; here.)&lt;/p&gt;

&lt;p&gt;This transform script uses regex to recursively split longer markdown content whenever the output was greater than &lt;code&gt;max_words&lt;/code&gt; (we used 500, determined via a crude split on whitespace) or until the split is at the &lt;code&gt;max_depth&lt;/code&gt; (we used &lt;code&gt;h4&lt;/code&gt;). So some chunks are longer than 500 words (25 of 972). We can later investigate lower heading-level splits, alternative split points, or manual edits for those chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;processContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageMarkdown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageTitle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageTagsSet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="nx"&gt;pageDescription&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxWords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;CONFIGS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxWords&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxDepth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;CONFIGS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDepth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="c1"&gt;// Splits content into sections based on headings&lt;/span&gt;
  &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;splitContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;g&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[,&lt;/span&gt; &lt;span class="nx"&gt;headingLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;headingLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Creates chunks from sections&lt;/span&gt;
  &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;createChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;currentTitle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;localChunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;lastChunkHeadingOnly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nx"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(([&lt;/span&gt;&lt;span class="nx"&gt;headingLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sectionContent&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;lastChunkHeadingOnly&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;headingText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;lastChunkHeadingOnly&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; - &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nx"&gt;lastChunkHeadingOnly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunkHtml&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getChunkHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sectionContent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageTitle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunkHtml&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HEADING_ONLY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;lastChunkHeadingOnly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fullTitle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;currentTitle&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^:&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunkWordCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;chunkHtml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isWithinChunkingConstrains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;chunkWordCount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;maxWords&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;maxDepth&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="c1"&gt;// true if the chunk is within the word limit or we've reached max depth.&lt;/span&gt;
      &lt;span class="c1"&gt;// we don't split further at max depth, even if over word limit&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isWithinChunkingConstrains&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;localChunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;createChunkObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunkHtml&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                           &lt;span class="nx"&gt;pageTagsSet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageTitle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageDescription&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fullTitle&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;subsections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;splitContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sectionContent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(\\&lt;/span&gt;&lt;span class="sr"&gt;n###+ &lt;/span&gt;&lt;span class="se"&gt;\\[\\]\\((&lt;/span&gt;&lt;span class="sr"&gt;#.*&lt;/span&gt;&lt;span class="se"&gt;?)\\))\\&lt;/span&gt;&lt;span class="sr"&gt;n&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;?)\\&lt;/span&gt;&lt;span class="sr"&gt;n/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;// regex to find subsections of the current section&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;subsections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;localChunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nf"&gt;createChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;subsections&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fullTitle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="c1"&gt;// if no subsections, create a chunk for the current section&lt;/span&gt;
          &lt;span class="nx"&gt;localChunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;createChunkObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunkHtml&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                             &lt;span class="nx"&gt;pageTagsSet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageTitle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageDescription&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fullTitle&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;localChunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;topSections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;splitContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageMarkdown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(\\&lt;/span&gt;&lt;span class="sr"&gt;n&lt;/span&gt;&lt;span class="se"&gt;\\[\\]\\((&lt;/span&gt;&lt;span class="sr"&gt;#.*&lt;/span&gt;&lt;span class="se"&gt;?)\\))\\&lt;/span&gt;&lt;span class="sr"&gt;n&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;?)\\&lt;/span&gt;&lt;span class="sr"&gt;n/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;createChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;topSections&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also take the heading itself and add it at the top of the chunk along with the original page title.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example: The chunk for the "Enable a Prometheus Receiver" heading on the "Send Metrics to SigNoz Cloud" page is titled: "Send Metrics to SigNoz Cloud: Enable a Prometheus Receiver"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the split results in a chunk consists of only a heading (&lt;code&gt;HEADING_ONLY&lt;/code&gt;), that chunk is skipped and the heading is added to the top of the subsequent chunk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preparing the Chunks for Trieve
&lt;/h3&gt;

&lt;p&gt;The chunk will include various other fields, metadata and labels specific for Trieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tracking_id&lt;/code&gt;: parsed from the URL + the id of the heading for splits&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;group_tracking_ids&lt;/code&gt;: parsed from the URL (this will let us link the page chunks together)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tag_set&lt;/code&gt;: parsed from the URL segments (ex. &lt;code&gt;https://signoz.io/docs/install/kubernetes/&lt;/code&gt; -&amp;gt; &lt;code&gt;['install', 'kubernetes']&lt;/code&gt;; optimized for filtering)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;image_urls&lt;/code&gt;: parsed from the markdown (the docs have both &lt;code&gt;png&lt;/code&gt; and &lt;code&gt;webp&lt;/code&gt; images; this will show within the Trieve search playground)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;metadata&lt;/code&gt;: parsed from the crawl result metadata (we save the page title and description (if it exists), in case useful later; can be filtered)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;createChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunkHtml&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageTagsSet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                     &lt;span class="nx"&gt;pageTitle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageDescription&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;chunk_html&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;chunkHtml&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;link&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageLink&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;headingLink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;tags_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageTagsSet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;image_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getImages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunkHtml&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;tracking_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getTrackingId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageLink&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;headingLink&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;group_tracking_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;getTrackingId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageLink&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageTitle&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;: &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;headingText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;page_title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageTitle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;page_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageDescription&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The groups are automatically created in Trieve for each tracking ID in the &lt;code&gt;group_tracking_ids&lt;/code&gt; array if a group with that tracking ID does not yet exist. Groups can be created (and edited) in the &lt;code&gt;chunk_group&lt;/code&gt; route. This lets you provide a name, description, tag_set and metadata for the group. See the docs: &lt;a href="https://docs.trieve.ai/api-reference/chunk-group/create-or-upsert-group-or-groups" rel="noopener noreferrer"&gt;Create or Upsert Group or Groups&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Storing the Chunks in Trieve for Search and RAG
&lt;/h2&gt;

&lt;p&gt;We use the Trieve &lt;code&gt;api/chunk&lt;/code&gt; route to create the chunks in Trieve.&lt;/p&gt;

&lt;p&gt;See the docs: &lt;a href="https://docs.trieve.ai/api-reference/chunk/create-or-upsert-chunk-or-chunks" rel="noopener noreferrer"&gt;Create or Upsert Chunk or Chunks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chunks here are just parsed JSON in batches of 120.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;loadChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;upsert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;basePath&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/api/chunk`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TR-Dataset&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;datasetId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;upsert_by_tracking_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Successfully &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;upsert&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;upserted&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;created&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; batch of `&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; chunks to &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;datasetId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Failed to &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;upsert&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;upsert&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;create&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; batch. `&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="s2"&gt;`Status: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, `&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="s2"&gt;`Data: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;, `&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="s2"&gt;`Message: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing Search and RAG quality using Trieve Playgrounds
&lt;/h3&gt;

&lt;p&gt;Trying out some searches and chatting with the chunks is one way to get a look at your data and the quality of your initial cleaning and chunking steps.&lt;/p&gt;

&lt;p&gt;After loading the chunks you can head to &lt;a href="https://dashboard.trieve.ai/" rel="noopener noreferrer"&gt;dashboard.trieve.ai&lt;/a&gt;. Here you'll find a datasets table with your ingested dataset, listing the number of chunks along with quicklinks to settings and the search, RAG, and analytics playgrounds. Both the Search and RAG Playgrounds show the chunks with their &lt;code&gt;chunk_html&lt;/code&gt;, the links to their source pages, tracking IDs, and metadata.&lt;/p&gt;

&lt;h4&gt;
  
  
  Search Playground: &lt;a href="https://search.trieve.ai/" rel="noopener noreferrer"&gt;search.trieve.ai&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;This is a search interface optimized for fully exploring the search options in Trieve. You'll see configurations for adjusting filters, choosing between search types (ex. hybrid, semantic, fulltext (SPLADE), BM25, etc.), sorting and reranking. You can also do multi-query searches and group search (returning the corresponding group for the retrieved chunks). There are additional options to set the score threshold for retrieved results, page size, stop word removal, and various settings for managing Trieve's sub-sentence highlighting. Once a search has been conducted you will also see a button to "Rate This Search" which opens a modal with a 0-10 slider and notes field.&lt;/p&gt;

&lt;p&gt;Example searches to showcase the search types:&lt;/p&gt;

&lt;p&gt;Query: &lt;em&gt;visualizations&lt;/em&gt; (Semantic)&lt;br&gt;
This is a dense vector search using cosine distance vectors on &lt;a href="https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai" rel="noopener noreferrer"&gt;Jina english embeddings&lt;/a&gt;), &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fvisualizations_query_signoz_semantic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fvisualizations_query_signoz_semantic.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Query: &lt;em&gt;visualizations&lt;/em&gt; (Fulltext)&lt;br&gt;
Notice how the retrieval model for our fulltext search type, &lt;a href="https://github.com/naver/splade" rel="noopener noreferrer"&gt;SPLADE&lt;/a&gt;, is returning chunks containing words closer to the query rather than the broader semantics.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fvisualizations_query_signoz_fulltext.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fvisualizations_query_signoz_fulltext.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Query: &lt;em&gt;visualizations&lt;/em&gt; (Hybrid)&lt;br&gt;
This takes in results from both fulltext and semantic and re-ranks with a cross-encoder model (here the &lt;a href="https://huggingface.co/BAAI/bge-reranker-large" rel="noopener noreferrer"&gt;BAAI/bge-reranker-large&lt;/a&gt;).&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fvisualizations_query_signoz_hybrid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fvisualizations_query_signoz_hybrid.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Group Search:&lt;/strong&gt;&lt;br&gt;
Internally this uses the &lt;a href="https://docs.trieve.ai/api-reference/chunk-group/search-over-groups" rel="noopener noreferrer"&gt;chunk_group/group_oriented_search&lt;/a&gt; route.&lt;/p&gt;

&lt;p&gt;Query: &lt;em&gt;OpenTelemetry FastAPI example&lt;/em&gt; (Hybrid)&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Ffastapi_example_query_signoz_semantic_group.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Ffastapi_example_query_signoz_semantic_group.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  RAG Playground: &lt;a href="https://chat.trieve.ai/" rel="noopener noreferrer"&gt;chat.trieve.ai&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;This is a question-and-answer chat interface to showcase and explore RAG in your dataset with Trieve. It has everything you are looking for: input at bottom, streaming response from the LLM, and documents (the retrieved chunks) used by the LLM in its response on the right-hand side. You can regenerate a response, ask follow-on questions, start new conversations, and open any of the retrieved chunks. The various RAG-relevant parameters can be adjusted in the dataset settings in the dashboard: LLM, HyDE prompt (optional), System Prompt, RAG prompt, and more. We support models available on OpenRouter: &lt;a href="https://openrouter.ai/docs/models" rel="noopener noreferrer"&gt;openrouter.ai/docs/models&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is HyDE?&lt;/strong&gt;&lt;br&gt;
Hypothetical Document Embeddings (HyDE) is a retrieval method where an LLM first generates a hypothetical document that might address a user's query, and then that document is used in a semantic search (instead of the original query) to retrieve documents from the dataset. Read more in &lt;a href="https://arxiv.org/abs/2212.10496" rel="noopener noreferrer"&gt;Luyu Gao et al.'s "Precise Zero-Shot Dense Retrieval without Relevance Labels"&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Example interactions on the RAG interface:&lt;/p&gt;

&lt;p&gt;Question: &lt;em&gt;What is SigNoz?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fwhat_signoz_chat_signoz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fwhat_signoz_chat_signoz.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Question: &lt;em&gt;What are the benefits of OpenTelemetry?&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fopen_telemetry_benefits_chat_signoz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.trieve.ai%2Fblog%2Ffirecrawl-and-trieve%2Fopen_telemetry_benefits_chat_signoz.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Firecrawl and Trieve are a powerful combination for quickly building search and RAG systems. &lt;a href="https://dashboard.trieve.ai/" rel="noopener noreferrer"&gt;Get started today!&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firecrawl: &lt;a href="https://firecrawl.dev/" rel="noopener noreferrer"&gt;firecrawl.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Trieve: &lt;a href="https://trieve.ai/" rel="noopener noreferrer"&gt;trieve.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Github repo: &lt;a href="https://github.com/devflowinc/firecrawl-to-trieve" rel="noopener noreferrer"&gt;firecrawl-to-trieve&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>search</category>
      <category>rag</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
