<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adam Tauber</title>
    <description>The latest articles on DEV Community by Adam Tauber (@asciimoo).</description>
    <link>https://dev.to/asciimoo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F41076%2Fe2223504-77f9-4b0f-bb2f-8d224f38c44f.jpeg</url>
      <title>DEV Community: Adam Tauber</title>
      <link>https://dev.to/asciimoo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/asciimoo"/>
    <language>en</language>
    <item>
      <title>Data Indexing in Golang</title>
      <dc:creator>Adam Tauber</dc:creator>
      <pubDate>Thu, 16 Apr 2026 18:57:14 +0000</pubDate>
      <link>https://dev.to/hister/data-indexing-in-golang-368l</link>
      <guid>https://dev.to/hister/data-indexing-in-golang-368l</guid>
      <description>&lt;p&gt;If you need fast, content-based retrieval over a large number of documents, your best option is to use a full-text indexer. Popular solutions like Elasticsearch and Meilisearch are more than capable of getting the job done. But what if you don't want to depend on an external service, or if you need a higher level of control over how your data is stored and searched?&lt;/p&gt;

&lt;p&gt;Luckily, Go has an excellent library for exactly this purpose: &lt;a href="https://blevesearch.com/" rel="noopener noreferrer"&gt;Bleve&lt;/a&gt;. Bleve lets you quickly index any Go struct with sensible defaults and a built-in Google-like query language. Or you can go further and build your own query language and customize every single detail of the indexer.&lt;/p&gt;

&lt;p&gt;Bleve is a file-based indexer that can handle millions of records. It supports concurrent reads and writes, hot-swapping of indexes, match highlighting, and much more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/asciimoo/hister" rel="noopener noreferrer"&gt;Hister&lt;/a&gt; is built on top of Bleve and uses a wide range of its features: custom field mappings with language-specific analyzers, a hand-crafted query language with per-field boosting, cursor-based pagination, multi-language index aliases, and fine-grained Scorch tuning. The examples throughout this post are inspired by our codebase and the knowledge we gathered during development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Simple Indexer
&lt;/h2&gt;

&lt;p&gt;Getting started with Bleve only takes a few lines of code. The two core operations are &lt;strong&gt;indexing&lt;/strong&gt; (storing a document so it can be searched later) and &lt;strong&gt;querying&lt;/strong&gt; (retrieving ranked documents that match a search expression).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"log"&lt;/span&gt;

    &lt;span class="n"&gt;bleve&lt;/span&gt; &lt;span class="s"&gt;"github.com/blevesearch/bleve/v2"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Document represents the data we want to index and search.&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Title&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;URL&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Text&lt;/span&gt;  &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Create a new index on disk. If one already exists at that path, open it.&lt;/span&gt;
    &lt;span class="n"&gt;mapping&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewIndexMapping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"example.bleve"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"example.bleve"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Index a handful of documents. The first argument is a unique ID;&lt;/span&gt;
    &lt;span class="c"&gt;// the second is any Go value, which Bleve inspects via reflection.&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Go Programming"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"https://go.dev"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Go is an open source programming language that makes it easy to build reliable software."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="s"&gt;"2"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Bleve Search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"https://blevesearch.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Bleve is a full-text search and indexing library for Go."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="s"&gt;"3"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Hister - Your own search engine"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"https://hister.org/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Full-text search across your files, browsing history and beyond."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to index %s: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Query the index. NewMatchQuery performs a full-text search across&lt;/span&gt;
    &lt;span class="c"&gt;// all indexed fields and ranks results by relevance score.&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMatchQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hister search engine"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSearchRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"URL"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c"&gt;// which stored fields to return&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;                         &lt;span class="c"&gt;// maximum number of hits&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Found %d result(s):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hits&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"  [%.4f] %s  %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Title"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"URL"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bleve.New&lt;/code&gt; vs &lt;code&gt;bleve.Open&lt;/code&gt;:&lt;/strong&gt; &lt;code&gt;New&lt;/code&gt; creates a fresh index at the given path; &lt;code&gt;Open&lt;/code&gt; opens an existing one. The pattern shown above (try &lt;code&gt;New&lt;/code&gt;, fall back to &lt;code&gt;Open&lt;/code&gt; on error) is the idiomatic way to handle both the first run and subsequent runs with the same index directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bleve.NewIndexMapping()&lt;/code&gt;:&lt;/strong&gt; Returns a default mapping that works well out of the box: text fields are tokenized, lowercased, and stop-word filtered by the standard analyzer, which uses an English stop-word list. You can replace this with a custom mapping when you need more control (see the &lt;em&gt;Mappings&lt;/em&gt; section below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic field discovery:&lt;/strong&gt; Bleve uses reflection to inspect your struct. Every exported field is automatically tokenized and made searchable with no extra configuration. Unexported fields are silently skipped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unique document IDs:&lt;/strong&gt; The string ID you pass to &lt;code&gt;Index&lt;/code&gt; is how you identify documents for updates and deletes. Calling &lt;code&gt;Index&lt;/code&gt; with an ID that already exists replaces the previous document, making it safe to re-index pages that have changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SearchRequest.Fields&lt;/code&gt;:&lt;/strong&gt; By default Bleve returns only document IDs and relevance scores to keep responses lean. Specify the field names you want returned in &lt;code&gt;Fields&lt;/code&gt;, or pass &lt;code&gt;[]string{"*"}&lt;/code&gt; to get every stored field.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;hit.Score&lt;/code&gt;:&lt;/strong&gt; Each result carries a floating-point relevance score computed by Bleve's scoring model (TF-IDF by default; BM25 can be enabled in recent releases). Higher scores indicate a stronger match. You can influence scores with boost values (covered in the &lt;em&gt;Querying&lt;/em&gt; section).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Mappings
&lt;/h2&gt;

&lt;p&gt;The default mapping works well for a quick start, but real applications usually need more control over how Bleve analyzes and stores each field. A &lt;strong&gt;mapping&lt;/strong&gt; tells Bleve what type each field is, which analyzer to use when tokenizing it, whether to store the original value, and whether to include it in the index at all.&lt;/p&gt;

&lt;p&gt;Mappings can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control tokenization:&lt;/strong&gt; split text into terms using whitespace, language rules, edge n-grams, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter input data:&lt;/strong&gt; lowercase terms, strip HTML, apply stop-word lists, or run a stemmer so that "running" and "runs" match the same root token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclude fields from the index:&lt;/strong&gt; omit sensitive or irrelevant fields to save disk space and keep the index lean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define custom analyzers:&lt;/strong&gt; combine any tokenizer with any chain of token filters to get exactly the behavior you need&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a concrete example that applies language-based stemming to the &lt;code&gt;Text&lt;/code&gt; and &lt;code&gt;Title&lt;/code&gt; fields, and excludes a raw HTML field from being indexed at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
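    &lt;span class="n"&gt;bleve&lt;/span&gt; &lt;span class="s"&gt;"github.com/blevesearch/bleve/v2"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/blevesearch/bleve/v2/analysis/analyzer/en"&lt;/span&gt;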
    &lt;span class="s"&gt;"github.com/blevesearch/bleve/v2/mapping"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;buildIndexMapping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IndexMappingImpl&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// English analyzer: tokenizes, lowercases, removes stop words, and stems.&lt;/span&gt;
    &lt;span class="c"&gt;// "running" and "runs" will both match a search for "run".&lt;/span&gt;
    &lt;span class="n"&gt;englishField&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTextFieldMapping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;englishField&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;en&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AnalyzerName&lt;/span&gt;

    &lt;span class="c"&gt;// A keyword analyzer treats the entire field value as a single token,&lt;/span&gt;
    &lt;span class="c"&gt;// which is useful for exact-match fields like URLs or tags.&lt;/span&gt;
    &lt;span class="n"&gt;keywordField&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTextFieldMapping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;keywordField&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"keyword"&lt;/span&gt;

    &lt;span class="c"&gt;// Disable indexing for a field we only want to store, not search.&lt;/span&gt;
    &lt;span class="n"&gt;storedOnlyField&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTextFieldMapping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;storedOnlyField&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;

    &lt;span class="n"&gt;docMapping&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewDocumentMapping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;docMapping&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddFieldMappingsAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;englishField&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;docMapping&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddFieldMappingsAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;englishField&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;docMapping&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddFieldMappingsAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywordField&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;docMapping&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddFieldMappingsAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"raw_html"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storedOnlyField&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// stored but not indexed&lt;/span&gt;

    &lt;span class="n"&gt;indexMapping&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewIndexMapping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;indexMapping&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddDocumentMapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"document"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docMapping&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;indexMapping&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultAnalyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;en&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AnalyzerName&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;indexMapping&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pass the result of &lt;code&gt;buildIndexMapping()&lt;/code&gt; to &lt;code&gt;bleve.New&lt;/code&gt; or &lt;code&gt;bleve.NewUsing&lt;/code&gt; when creating the index. Mappings are baked into the index at creation time and cannot be changed afterwards. To apply a new mapping you need to create a fresh index and re-index all documents.&lt;/p&gt;
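&lt;p&gt;One detail the mapping above relies on: Bleve names fields after the struct's &lt;code&gt;json&lt;/code&gt; tags when they are present, and it selects a document mapping by the document's type, which it can read from a &lt;code&gt;Type() string&lt;/code&gt; method. The stdlib-only sketch below shows a struct shaped to line up with the lowercase field names and the &lt;code&gt;"document"&lt;/code&gt; type used above. The &lt;code&gt;RawHTML&lt;/code&gt; field, the tag names, and the &lt;code&gt;jsonView&lt;/code&gt; helper are assumptions chosen for this example, not taken from Hister; the JSON output is printed only to make the resulting field names visible.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Document mirrors the field names used in the mapping: the json tags give
// the lowercase names "title", "url", "text" and "raw_html".
// RawHTML is a hypothetical field added for this illustration.
type Document struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Text    string `json:"text"`
	RawHTML string `json:"raw_html"`
}

// Type reports the document type. Returning "document" matches the name
// passed to AddDocumentMapping in the mapping example.
func (d Document) Type() string {
	return "document"
}

// jsonView returns the JSON encoding of the document, which exposes the
// tag-derived field names.
func jsonView(d Document) (string, error) {
	raw, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	doc := Document{Title: "Bleve Search", URL: "https://blevesearch.com"}
	view, err := jsonView(doc)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(doc.Type())
	fmt.Println(view)
}
```

&lt;p&gt;With untagged fields, as in the first example, the field names would be &lt;code&gt;Title&lt;/code&gt;, &lt;code&gt;URL&lt;/code&gt;, and &lt;code&gt;Text&lt;/code&gt;, so the lowercase entries in the mapping would not line up with them.&lt;/p&gt;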

&lt;h2&gt;
  
  
  Querying
&lt;/h2&gt;

&lt;p&gt;Bleve provides a powerful built-in text query processor called &lt;code&gt;QueryStringQuery&lt;/code&gt;. It supports field filters (&lt;code&gt;title:golang&lt;/code&gt;), quoted phrases (&lt;code&gt;"error handling"&lt;/code&gt;), term exclusion (&lt;code&gt;go -python&lt;/code&gt;), wildcard patterns (&lt;code&gt;auth*&lt;/code&gt;), and required terms (&lt;code&gt;+go +concurrency&lt;/code&gt;). Its syntax closely mirrors Google's search syntax. You can read more about it &lt;a href="https://blevesearch.com/docs/Query-String-Query/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But where Bleve really shines is in providing composable building blocks for constructing your own domain-specific query language. The &lt;a href="https://pkg.go.dev/github.com/blevesearch/bleve/v2@v2.5.7/search/query" rel="noopener noreferrer"&gt;query package&lt;/a&gt; exposes a wide variety of primitives (match queries, wildcard queries, range queries, boolean combinators, and more) that you can wire together however you like.&lt;/p&gt;

&lt;p&gt;Here's a simplified example from our app to demonstrate how powerful this can be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;negatedQueries&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queryString&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;negated&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
    &lt;span class="c"&gt;// negate the term if it starts with "-"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cut&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CutPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cut&lt;/span&gt;
        &lt;span class="n"&gt;negated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// WildcardQuery matches the keyword anywhere inside the URL string.&lt;/span&gt;
    &lt;span class="c"&gt;// The 10x boost means a URL match raises the document's score&lt;/span&gt;
    &lt;span class="c"&gt;// significantly compared to a plain text match.&lt;/span&gt;
    &lt;span class="c"&gt;//&lt;/span&gt;
    &lt;span class="c"&gt;// The boost number 10 is arbitrary, adjust it to your needs&lt;/span&gt;
    &lt;span class="n"&gt;urlq&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewWildcardQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"*"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;urlq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;urlq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetBoost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// MatchQuery tokenizes the keyword with the field's analyzer before&lt;/span&gt;
    &lt;span class="c"&gt;// matching, so stemming and stop-word removal apply automatically.&lt;/span&gt;
    &lt;span class="n"&gt;textq&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMatchQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;textq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Title matches are given 50x weight. A keyword found in the title&lt;/span&gt;
    &lt;span class="c"&gt;// is a very strong signal of relevance.&lt;/span&gt;
    &lt;span class="c"&gt;//&lt;/span&gt;
    &lt;span class="c"&gt;// The boost number 50 is arbitrary, adjust it to your needs&lt;/span&gt;
    &lt;span class="n"&gt;titleq&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMatchQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;titleq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;titleq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetBoost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// DisjunctionQuery is an OR combinator: the document counts as a match&lt;/span&gt;
    &lt;span class="c"&gt;// if it satisfies *any* of the sub-queries. The scores of the matching&lt;/span&gt;
    &lt;span class="c"&gt;// sub-queries are combined, so matching more fields ranks higher.&lt;/span&gt;
    &lt;span class="n"&gt;disjq&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewDisjunctionQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;urlq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;textq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;titleq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;negated&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;negatedQueries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;negatedQueries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;disjq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;disjq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// BooleanQuery is an AND/OR/NOT combinator at the keyword level:&lt;/span&gt;
&lt;span class="c"&gt;//   - must    (first arg):  document must satisfy every query in this list&lt;/span&gt;
&lt;span class="c"&gt;//   - should  (second arg): optional queries that boost score when matched&lt;/span&gt;
&lt;span class="c"&gt;//   - mustNot (third arg):  document must satisfy none of these queries&lt;/span&gt;
&lt;span class="c"&gt;//&lt;/span&gt;
&lt;span class="c"&gt;// The result: every non-negated keyword must appear somewhere in the&lt;/span&gt;
&lt;span class="c"&gt;// document, while negated keywords disqualify a document entirely.&lt;/span&gt;
&lt;span class="n"&gt;fullQuery&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBooleanQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;negatedQueries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each keyword in the input string becomes its own &lt;code&gt;DisjunctionQuery&lt;/code&gt; that spans all three fields. The &lt;code&gt;BooleanQuery&lt;/code&gt; then requires that &lt;em&gt;all&lt;/em&gt; keyword disjunctions are satisfied, giving us an implicit AND between keywords and a per-field OR within each keyword. Negated keywords (prefixed with &lt;code&gt;-&lt;/code&gt;) are placed in the &lt;code&gt;mustNot&lt;/code&gt; list and disqualify any document that matches them.&lt;/p&gt;
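&lt;p&gt;The snippet above assumes the input has already been split into keywords and negation flags. As a rough sketch of that step (the &lt;code&gt;term&lt;/code&gt; type and the &lt;code&gt;parseTerms&lt;/code&gt; function are hypothetical names, not part of Bleve or of our query builder), splitting on whitespace and stripping a leading &lt;code&gt;-&lt;/code&gt; could look like this:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// term is a hypothetical helper type: one keyword plus its negation flag.
type term struct {
	keyword string
	negated bool
}

// parseTerms splits the raw input on whitespace and strips a single leading
// "-" from negated keywords. A lone "-" carries no keyword and is skipped.
func parseTerms(input string) []term {
	var terms []term
	for _, field := range strings.Fields(input) {
		negated := strings.HasPrefix(field, "-")
		keyword := strings.TrimPrefix(field, "-")
		if keyword == "" {
			continue
		}
		terms = append(terms, term{keyword: keyword, negated: negated})
	}
	return terms
}

func main() {
	for _, t := range parseTerms("golang indexing -java") {
		fmt.Printf("%q negated=%v\n", t.keyword, t.negated)
	}
}
```

&lt;p&gt;Each resulting term would then feed one iteration of the loop that builds the per-keyword disjunction. Quoted phrases and field prefixes are deliberately left out of this sketch.&lt;/p&gt;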

&lt;p&gt;This structure is easy to extend: you could add date-range filters, weight fields dynamically based on user preferences, or introduce special syntax for field-scoped searches.&lt;/p&gt;

&lt;p&gt;Take a look at our &lt;a href="https://github.com/asciimoo/hister/blob/master/server/indexer/querybuilder/builder.go" rel="noopener noreferrer"&gt;query builder&lt;/a&gt; for a more complete real-world example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Paging
&lt;/h2&gt;

&lt;p&gt;Bleve's &lt;a href="https://pkg.go.dev/github.com/blevesearch/bleve/v2#SearchRequest" rel="noopener noreferrer"&gt;SearchRequest&lt;/a&gt; controls both the page size (&lt;code&gt;Size&lt;/code&gt;) and the starting offset of results (&lt;code&gt;From&lt;/code&gt;). A natural first instinct is to page with &lt;code&gt;From&lt;/code&gt;: with a page size of 20, set it to &lt;code&gt;0&lt;/code&gt; for the first page, &lt;code&gt;20&lt;/code&gt; for the second, and so on. This works, but it has two serious problems. First, Bleve must score and sort &lt;em&gt;all&lt;/em&gt; matching documents up to &lt;code&gt;From + Size&lt;/code&gt; on every request, making deep pages increasingly expensive in both memory and CPU. Second, if new documents are indexed between two page requests, the offsets shift and users see duplicate or missing results.&lt;/p&gt;

&lt;p&gt;The correct approach is to use cursor-based pagination via &lt;a href="https://pkg.go.dev/github.com/blevesearch/bleve/v2#SearchRequest.SetSearchAfter" rel="noopener noreferrer"&gt;SearchAfter&lt;/a&gt; and &lt;a href="https://pkg.go.dev/github.com/blevesearch/bleve/v2#SearchRequest.SetSearchBefore" rel="noopener noreferrer"&gt;SearchBefore&lt;/a&gt;. These methods resume the result stream from a known position instead of re-scanning from the beginning, which is both more accurate and more efficient. We learned to prefer them the &lt;a href="https://github.com/asciimoo/hister/issues/173" rel="noopener noreferrer"&gt;hard way&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;pageSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;

&lt;span class="c"&gt;// First page, no cursor needed.&lt;/span&gt;
&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSearchRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;myQuery&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pageSize&lt;/span&gt;
&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SortBy&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"-_score"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="c"&gt;// best match first; _id breaks ties so the order is stable&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Subsequent pages: pass the sort key of the last hit as the cursor.&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pageSize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;lastHit&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;lastHit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="c"&gt;// []string, one element per sort field&lt;/span&gt;

    &lt;span class="n"&gt;nextReq&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSearchRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;myQuery&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;nextReq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pageSize&lt;/span&gt;
    &lt;span class="n"&gt;nextReq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SortBy&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"-_score"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;nextReq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetSearchAfter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;nextResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nextReq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
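&lt;strong&gt;Stable sort is required.&lt;/strong&gt; &lt;code&gt;SearchAfter&lt;/code&gt; uses the sort key of the last result as its cursor. If the sort order changes between requests, or if two hits can share the same sort key, the cursor no longer points to a well-defined position. Including a unique field such as &lt;code&gt;_id&lt;/code&gt; as the final sort criterion avoids ties.&lt;/li&gt;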
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Sort&lt;/code&gt; is always a &lt;code&gt;[]string&lt;/code&gt;.&lt;/strong&gt; Even when sorting by a numeric field, Bleve serializes the sort key as a string. The slice holds one element per sort field, so pass the whole &lt;code&gt;hit.Sort&lt;/code&gt; slice to &lt;code&gt;SetSearchAfter&lt;/code&gt; rather than a single element.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SearchBefore&lt;/code&gt; works the same way&lt;/strong&gt; but moves in the opposite direction, which is useful for implementing a "previous page" button.&lt;/li&gt;
&lt;/ul&gt;
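&lt;p&gt;In a web application the cursor has to survive a round trip through the client, typically as a query parameter. One possible way to do that, sketched here with the standard library only, is to serialize the &lt;code&gt;[]string&lt;/code&gt; sort key into an opaque URL-safe token. The function names and the sample values are made up for illustration, and the encoding (JSON wrapped in base64) is an assumption, not something Bleve prescribes:&lt;/p&gt;

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// encodeCursor turns the sort key of the last hit into an opaque, URL-safe
// token. Each element is stored as raw bytes so nothing is lost even if a
// sort key is not valid UTF-8.
func encodeCursor(sortKey []string) (string, error) {
	parts := make([][]byte, len(sortKey))
	for i, s := range sortKey {
		parts[i] = []byte(s)
	}
	raw, err := json.Marshal(parts)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

// decodeCursor reverses encodeCursor. The token comes from the client, so
// every decoding error is returned instead of being ignored.
func decodeCursor(token string) ([]string, error) {
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return nil, err
	}
	parts := new([][]byte)
	if err := json.Unmarshal(raw, parts); err != nil {
		return nil, err
	}
	sortKey := make([]string, len(*parts))
	for i, p := range *parts {
		sortKey[i] = string(p)
	}
	return sortKey, nil
}

func main() {
	// Hypothetical sort key: one element per sort field.
	token, err := encodeCursor([]string{"0.75", "doc-42"})
	if err != nil {
		panic(err)
	}
	sortKey, err := decodeCursor(token)
	if err != nil {
		panic(err)
	}
	fmt.Println(token, sortKey)
}
```

&lt;p&gt;The decoded slice can be handed to &lt;code&gt;SetSearchAfter&lt;/code&gt; (or &lt;code&gt;SetSearchBefore&lt;/code&gt;) on the next request. Keeping the token opaque also leaves room to change the sort fields later without breaking client code.&lt;/p&gt;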

&lt;h2&gt;
  
  
  Handling Multiple Indexes
&lt;/h2&gt;

&lt;p&gt;Bleve can transparently manage multiple indexes at the same time through &lt;a href="https://pkg.go.dev/github.com/blevesearch/bleve/v2#IndexAlias" rel="noopener noreferrer"&gt;IndexAlias&lt;/a&gt;. An alias is a virtual index that fans a query out to several real indexes and merges their results back into a single ranked list.&lt;/p&gt;

&lt;p&gt;This is particularly useful when you want to maintain separate indexes for different languages. Each language gets its own index with a tailored analyzer (English stemming, French stop-words, custom tokenization, etc.), but a single alias lets you search all of them at once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;enIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"index_en.bleve"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;frIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"index_fr.bleve"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"index_de.bleve"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Combine all language indexes behind a single alias.&lt;/span&gt;
&lt;span class="n"&gt;alias&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewIndexAlias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deIndex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Query the alias exactly as you would a regular index.&lt;/span&gt;
&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSearchRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMatchQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hister search engine"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aliases also make hot-swapping painless. When you need to rebuild an index (for example, to apply a new mapping), you can build the new index in the background, then atomically swap it into the alias with &lt;code&gt;alias.Swap(newIndexes, oldIndexes)&lt;/code&gt;. In-flight queries complete against the old index while new queries immediately use the new one, with zero downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-tuning
&lt;/h2&gt;

&lt;p&gt;Bleve's performance knobs are not prominently documented, but they make a real difference under load. Configuration is passed as a &lt;code&gt;map[string]any&lt;/code&gt; to &lt;a href="https://pkg.go.dev/github.com/blevesearch/bleve/v2#NewUsing" rel="noopener noreferrer"&gt;NewUsing&lt;/a&gt; or &lt;a href="https://pkg.go.dev/github.com/blevesearch/bleve/v2#OpenUsing" rel="noopener noreferrer"&gt;OpenUsing&lt;/a&gt; instead of the regular &lt;code&gt;New&lt;/code&gt;/&lt;code&gt;Open&lt;/code&gt; functions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// How long the BoltDB storage layer will wait for a write lock&lt;/span&gt;
    &lt;span class="c"&gt;// before returning an error. Increase this if you see timeout&lt;/span&gt;
    &lt;span class="c"&gt;// errors under concurrent write load.&lt;/span&gt;
    &lt;span class="s"&gt;"bolt_timeout"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"2s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="s"&gt;"scorchPersisterOptions"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Number of goroutines that flush in-memory segments to disk&lt;/span&gt;
        &lt;span class="c"&gt;// in parallel. More workers help throughput on multi-core machines&lt;/span&gt;
        &lt;span class="c"&gt;// at the cost of higher memory usage during flushing.&lt;/span&gt;
        &lt;span class="s"&gt;"NumPersisterWorkers"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c"&gt;// Maximum bytes each persister worker holds in memory before&lt;/span&gt;
        &lt;span class="c"&gt;// flushing. Larger values reduce I/O by writing bigger segments,&lt;/span&gt;
        &lt;span class="c"&gt;// but increase peak memory consumption.&lt;/span&gt;
        &lt;span class="s"&gt;"MaxSizeInMemoryMergePerWorker"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// 80 MB&lt;/span&gt;

        &lt;span class="c"&gt;// While fewer segment files than this are on disk, the persister may&lt;/span&gt;
        &lt;span class="c"&gt;// nap between flushes (see PersisterNapTimeMSec), which gives in-memory&lt;/span&gt;
        &lt;span class="c"&gt;// merging time to build larger segments before they are written.&lt;/span&gt;
        &lt;span class="s"&gt;"PersisterNapUnderNumFiles"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="s"&gt;"scorchMergePlanOptions"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Segments smaller than this size are candidates for merging.&lt;/span&gt;
        &lt;span class="c"&gt;// Raising this value reduces the total number of segments (and&lt;/span&gt;
        &lt;span class="c"&gt;// therefore read latency) at the cost of more merge I/O.&lt;/span&gt;
        &lt;span class="s"&gt;"FloorSegmentFileSize"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// 20 MB&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bleve&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenUsing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my.bleve"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These settings live in the Scorch storage backend, which is Bleve's default. Consult the &lt;a href="https://github.com/blevesearch/bleve/blob/master/index/scorch/persister.go#L67" rel="noopener noreferrer"&gt;persister source&lt;/a&gt; for the full list of available options and their default values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Bleve is one of Go's hidden gems and deserves more attention. It lets you add full-text search to your application without complex infrastructure. The default configuration gets you up and running in minutes, while the custom mapping system, composable query primitives, performance, debugging options, and deep customization provide a great toolset for solving specific problems optimally.&lt;/p&gt;

&lt;p&gt;The official documentation has gaps, but the GitHub issues and real-life open-source projects fill them in well. Check out our &lt;a href="https://github.com/asciimoo/hister/tree/master/server/indexer" rel="noopener noreferrer"&gt;indexer package&lt;/a&gt; to see all of the above concepts working together in a production codebase.&lt;/p&gt;

&lt;p&gt;Happy indexing.&lt;/p&gt;

</description>
      <category>go</category>
      <category>indexer</category>
    </item>
    <item>
      <title>Firefox Extension IDs: The Bad and the Ugly</title>
      <dc:creator>Adam Tauber</dc:creator>
      <pubDate>Thu, 09 Apr 2026 08:26:20 +0000</pubDate>
      <link>https://dev.to/hister/firefox-extension-ids-the-bad-and-the-ugly-2dgi</link>
      <guid>https://dev.to/hister/firefox-extension-ids-the-bad-and-the-ugly-2dgi</guid>
      <description>&lt;p&gt;If you've ever developed a web application that communicates with a browser extension, you've probably encountered the subtle but significant differences between how Chrome and Firefox handle extension identifiers. While both browsers allow developers to specify static extension IDs, their implementation approaches diverge in ways that create real problems for security, privacy, user and developer experience.&lt;/p&gt;

&lt;p&gt;This post explores an issue I discovered while building &lt;a href="https://github.com/asciimoo/hister" rel="noopener noreferrer"&gt;Hister&lt;/a&gt;. What started as a straightforward CSRF protection implementation turned into a deep dive into Firefox's extension architecture decisions.&lt;/p&gt;




&lt;p&gt;Both Chrome and Firefox allow extension developers to have a static extension ID in their manifest. This ID serves as a persistent identifier for the extension across different installations and updates.&lt;/p&gt;

&lt;p&gt;In Chrome (and Chromium-based browsers), extension ID handling works exactly as you'd expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You specify a public key in your manifest, which guarantees a static extension ID&lt;/li&gt;
&lt;li&gt;The browser uses this ID consistently&lt;/li&gt;
&lt;li&gt;All network requests from the extension include this ID in the &lt;code&gt;Origin&lt;/code&gt; HTTP header&lt;/li&gt;
&lt;li&gt;Servers can identify which extension is making requests&lt;/li&gt;
&lt;li&gt;The ID remains the same across all installations of the extension&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your extension ID is &lt;code&gt;cciilamhchpmbdnniabclekddabkifhb&lt;/code&gt;, every installation of your extension will use that ID, and every HTTP request will identify itself with that origin.&lt;/p&gt;

&lt;p&gt;Firefox's approach... is different:&lt;/p&gt;

&lt;p&gt;Firefox also lets you specify a static extension ID in the manifest. However, at the moment of installation, Firefox generates a unique "internal UUID" for each installation. This UUID is what actually appears in the &lt;code&gt;Origin&lt;/code&gt; header of HTTP requests, &lt;strong&gt;not&lt;/strong&gt; the static ID you specified.&lt;/p&gt;

&lt;p&gt;On the surface, this might seem like a minor implementation detail. In practice, it creates significant problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bad: Breaking CSRF Protection
&lt;/h2&gt;

&lt;p&gt;Cross-Site Request Forgery (CSRF) protection is a fundamental security concern for any web application. The basic problem: how do you ensure that a request to your server came from your legitimate client application and not from a malicious site?&lt;/p&gt;

&lt;p&gt;For traditional web applications, there are well-established patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSRF tokens embedded in forms&lt;/li&gt;
&lt;li&gt;Origin HTTP header checks&lt;/li&gt;
&lt;li&gt;SameSite cookie attributes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But browser extensions present a unique challenge. Extension code runs independently from web pages. It's not subject to the same-origin policy in the same way. This means traditional CSRF protection mechanisms don't work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Origin Header: The Natural Solution
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Origin&lt;/code&gt; HTTP header was designed exactly for this purpose. When a browser makes a cross-origin request, it includes an &lt;code&gt;Origin&lt;/code&gt; header identifying where the request came from. For extensions, this header contains the extension ID.&lt;/p&gt;

&lt;p&gt;In Chrome, CSRF protection for extension-to-server communication is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/add&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allowedOrigin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chrome-extension://cciilamhchpmbdnniabclekddabkifhb&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;origin&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;allowedOrigin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Invalid origin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Process the request...&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is secure, simple, and requires no user interaction. The extension can make "authenticated" requests to your server, and you can verify they're coming from your legitimate extension, not from a malicious website or a rogue extension.&lt;/p&gt;

&lt;p&gt;With Firefox's unique internal UUID per installation, this pattern becomes impossible: you cannot allowlist a specific origin because you don't know what the UUID will be. Each user who installs your extension gets a different UUID.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Workaround: Manual Configuration
&lt;/h3&gt;

&lt;p&gt;The only reliable solution is to require users to manually configure a shared secret:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User installs your extension&lt;/li&gt;
&lt;li&gt;Server generates a secret token&lt;/li&gt;
&lt;li&gt;User manually copies this token into the extension's settings&lt;/li&gt;
&lt;li&gt;Extension includes the token in all requests&lt;/li&gt;
&lt;li&gt;Server validates the token instead of the &lt;code&gt;Origin&lt;/code&gt; header&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works, but it's terrible UX:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extra setup steps discourage users&lt;/li&gt;
&lt;li&gt;High potential for user error&lt;/li&gt;
&lt;li&gt;Token management becomes the user's problem&lt;/li&gt;
&lt;li&gt;Can't automatically validate origin at the HTTP layer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Ugly: Privacy Implications
&lt;/h2&gt;

&lt;p&gt;While breaking CSRF protection is bad for developers, Firefox's internal UUID approach has even more troubling implications for user privacy.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Built-in Tracking Mechanism
&lt;/h3&gt;

&lt;p&gt;The internal UUID is unique per browser installation, persistent across websites, and &lt;strong&gt;completely unavoidable&lt;/strong&gt;. This way of tracking is even worse than cookies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracking cookies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be blocked by browser settings&lt;/li&gt;
&lt;li&gt;Can be cleared by the user&lt;/li&gt;
&lt;li&gt;Subject to SameSite policies&lt;/li&gt;
&lt;li&gt;Users are increasingly aware of them&lt;/li&gt;
&lt;li&gt;Privacy tools can block them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Firefox extension internal UUIDs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Cannot be disabled&lt;/li&gt;
&lt;li&gt;❌ Cannot be cleared (except by reinstalling)&lt;/li&gt;
&lt;li&gt;❌ Persist across all websites&lt;/li&gt;
&lt;li&gt;❌ Invisible to users (not shown in extension details)&lt;/li&gt;
&lt;li&gt;❌ Not affected by privacy tools or private browsing&lt;/li&gt;
&lt;li&gt;❌ Unique to each browser installation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Did Firefox Do This?
&lt;/h2&gt;

&lt;p&gt;I don't have a clear answer to that. Mozilla mentions "sandboxing and security" reasons, but for me neither argument justifies using the internal UUID in the &lt;code&gt;Origin&lt;/code&gt; HTTP header.&lt;/p&gt;

&lt;p&gt;I can speculate on why Firefox implemented internal UUIDs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Possible reason 1: Security isolation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Perhaps the intent was to provide better security isolation between different extension installations. If each installation has a unique ID at the browser level, it's theoretically harder for one malicious extension to impersonate another.&lt;/p&gt;

&lt;p&gt;However, this benefit is questionable. Extension IDs are already validated by the browser. A malicious extension can't fake someone else's ID because the browser controls the &lt;code&gt;Origin&lt;/code&gt; header generation and the extension installation process as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Possible reason 2: Migration from legacy extension system&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Firefox underwent a major transition from legacy XUL extensions to WebExtensions. The internal UUID system might be a holdover from the legacy architecture that was never fully reconsidered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Possible reason 3: Accidental consequence&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
It's possible this wasn't a deliberate design decision at all, but rather an accidental consequence of how Firefox's extension system was architected.&lt;/p&gt;

&lt;p&gt;Whatever the reason is, the current behavior has serious flaws.&lt;/p&gt;

&lt;p&gt;You know the issue is serious when even Chrome has a more privacy-respecting solution to the problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  UPDATE (2026.02.16)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1372288" rel="noopener noreferrer"&gt;Seems like&lt;/a&gt; their goal was to prevent &lt;strong&gt;extension fingerprinting&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Developer Perspective
&lt;/h2&gt;

&lt;p&gt;As someone building a free software project that prioritizes privacy and local-first architecture, I find Firefox's behavior frustrating:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For users:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firefox users get a worse experience (manual configuration)&lt;/li&gt;
&lt;li&gt;The browser marketed for privacy actually creates privacy issues&lt;/li&gt;
&lt;li&gt;No transparency about the internal UUID system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For developers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can't implement proper CSRF protection via &lt;code&gt;Origin&lt;/code&gt; header&lt;/li&gt;
&lt;li&gt;Must implement workarounds that harm UX&lt;/li&gt;
&lt;li&gt;Documentation becomes more complex&lt;/li&gt;
&lt;li&gt;Testing is harder (can't easily simulate multiple Firefox installations)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Should Firefox Do?
&lt;/h2&gt;

&lt;p&gt;The solution is straightforward: &lt;strong&gt;use a static extension ID in the &lt;code&gt;Origin&lt;/code&gt; HTTP header&lt;/strong&gt;, just like Chrome does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;While I've spent a significant amount of time researching these issues and trying to find ways to resolve them, I may well have missed something, and there may be a solution to either or both of the problems mentioned. If so, please contact me at &lt;a href="https://chaos.social/@asciimoo" rel="noopener noreferrer"&gt;@asciimoo@chaos.social&lt;/a&gt; on Mastodon.&lt;/p&gt;

</description>
      <category>firefox</category>
      <category>extension</category>
      <category>bug</category>
    </item>
    <item>
      <title>How I Cut My Google Search Dependence in Half</title>
      <dc:creator>Adam Tauber</dc:creator>
      <pubDate>Thu, 02 Apr 2026 15:05:43 +0000</pubDate>
      <link>https://dev.to/hister/how-i-cut-my-google-search-dependence-in-half-4mi1</link>
      <guid>https://dev.to/hister/how-i-cut-my-google-search-dependence-in-half-4mi1</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I built &lt;a href="https://github.com/asciimoo/hister" rel="noopener noreferrer"&gt;Hister&lt;/a&gt;, a self-hosted web history search tool that indexes visited pages locally. In just 1.5 months, I reduced my reliance on Google Search by 50%.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Online Search Isn't What It Used to Be
&lt;/h2&gt;

&lt;p&gt;Like many developers and knowledge workers, I found myself constantly reaching for Google Search throughout my workday. It had become such an ingrained habit that I barely noticed how often I was context-switching away from my actual work to perform searches. But over time, something had changed about the experience. The search results that once felt reliable and helpful were increasingly problematic in several ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Too Many Advertisements
&lt;/h3&gt;

&lt;p&gt;What used to be a clean list of relevant links now requires scrolling past multiple sponsored results, shopping suggestions, and promoted content just to reach the organic results. Often, the actual information I'm looking for doesn't appear until halfway down the page, after I've mentally filtered out all the commercial noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manipulative SEO Tactics
&lt;/h3&gt;

&lt;p&gt;Organic results themselves have been manipulated by SEO tactics rather than truly reflecting the most relevant and helpful content. Websites optimized for search engines rather than humans dominate the rankings, while genuinely useful resources from smaller sites or personal blogs get buried on page two or three. The signal-to-noise ratio has degraded significantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Suggestions
&lt;/h3&gt;

&lt;p&gt;Google has recently added AI-generated summaries at the top of many search results. While sometimes helpful, these summaries often miss crucial nuance, provide oversimplified or occasionally incorrect information, and add yet another layer between me and the actual source material I'm trying to find. For technical queries where precision matters, these AI answers can be misleading or incomplete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lack of Privacy
&lt;/h3&gt;

&lt;p&gt;Google tracks every query I make, building a detailed profile of my interests, work patterns, and information needs. This data is used for ad targeting and who knows what else. The convenience of search comes at the cost of giving away intimate details about my work and life.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insight
&lt;/h2&gt;

&lt;p&gt;But the realization that pushed me to build a solution was that I was often searching for pages &lt;strong&gt;I'd already visited&lt;/strong&gt;. That documentation page I read last week but forgot to bookmark. That GitHub issue I commented on yesterday but couldn't remember the exact project name. Those internal wiki pages with crucial information about our infrastructure. I was using Google as a personal memory aid, outsourcing my recall to an external service that was tracking my every query. And for content behind authentication (internal tools, documentation, private repositories) Google couldn't help at all, since it can't index pages it can't access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two Types of Search
&lt;/h3&gt;

&lt;p&gt;Thinking about how to replace Google led me to a crucial realization about the nature of search itself. When we type queries into a search box, we're actually doing one of two fundamentally different things, even though the interface is identical:&lt;/p&gt;

&lt;h4&gt;
  
  
  Discovery Search: Finding New Information
&lt;/h4&gt;

&lt;p&gt;Discovery search is what we typically think of when we imagine "searching the internet". It's about finding information we've never encountered before. This is true exploration: we're venturing into unknown territory, discovering new resources, learning about topics we're unfamiliar with, and finding answers to questions we've never asked before. For this type of search, we genuinely need the vast index of the internet that services like Google provide. We need to cast a wide net and see what the collective knowledge of the web has to offer.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recall Search: Refinding Known Information
&lt;/h4&gt;

&lt;p&gt;But then there's the other type of search, which I call "recall search". This is when we're trying to find information we've already encountered. We're not discovering something new; we're trying to remember where we saw something. Examples include searches like "That authentication bug I fixed last month..." when you remember solving a problem but can't recall the exact solution, or "The Bleve docs page about result highlighters..." when you know you've read the documentation before but can't remember the specific URL or section title. Another common example is "That Stack Overflow answer about async/await..." when you remember reading a particularly clear explanation but didn't save the link.&lt;/p&gt;

&lt;p&gt;The revelation that changed everything for me was this: A significant portion of my daily searches - probably more than half - were recall searches, not discovery searches. I was constantly using Google to search my own browsing history, to refind pages I'd already visited and information I'd already read. But Google's interface treats both types of search identically, and it has no special optimization for helping you refind your own content. Worse, for pages behind authentication or on private networks, Google can't help at all because it can't index content it can't access.&lt;/p&gt;

&lt;p&gt;This insight suggested a solution: What if I had a dedicated tool optimized specifically for recall search, for refinding pages from my own browsing history, and only fell back to Google for true discovery search?&lt;/p&gt;

&lt;p&gt;The potential benefits were enormous:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;faster results&lt;/li&gt;
&lt;li&gt;better privacy&lt;/li&gt;
&lt;li&gt;access to authenticated content&lt;/li&gt;
&lt;li&gt;results tailored specifically to my interests and work&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: Index Everything Locally
&lt;/h2&gt;

&lt;p&gt;The solution seemed obvious once I'd articulated the problem: what if I could search my entire browsing history - including the full page content, not just URLs and titles - locally and privately? This would give me a personal search engine optimized specifically for recall search, while still allowing me to fall back to Google for discovery search when needed.&lt;/p&gt;

&lt;p&gt;I started looking for existing solutions. Surely someone had built this before? Browser history exists, but it only stores URLs and page titles, making it nearly useless for finding pages based on their content. Some note-taking apps like Evernote or Notion offer web clippers, but they require manual action for each page you want to save. Personal knowledge management tools like &lt;a href="https://github.com/asciimoo/omnom" rel="noopener noreferrer"&gt;Omnom&lt;/a&gt; exist, but they're focused on curated notes rather than comprehensive browsing history, and they require conscious decisions about what to save.&lt;/p&gt;

&lt;p&gt;None of the existing tools I found met all my requirements. I needed something that combined the comprehensive automatic capture of browser history, the full-text search capabilities of a search engine, the performance of local software, and the privacy of self-hosted solutions. Since nothing existed that checked all these boxes, I decided to build it myself.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Needed
&lt;/h3&gt;

&lt;p&gt;The requirements for my ideal solution were clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast lookup&lt;/strong&gt; If searching my local index took longer than just Googling, I'd never use it. I needed instant, sub-second search response times and keyboard shortcuts to make it faster to search locally than to context-switch to Google.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic indexing&lt;/strong&gt; I didn't want to manually save pages or make conscious decisions about what to index. It needed to capture pages as I browse with zero manual work on my part. The tool should disappear into the background and just work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication-aware indexing&lt;/strong&gt; So much of the content I reference daily is behind authentication: internal wikis, private documents, authenticated API documentation, internal dashboards. Any solution that couldn't handle authenticated content would miss a huge portion of my actual browsing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full-text search&lt;/strong&gt; The tool had to search the actual page content, not just URLs and titles. Browser history is useless when you remember reading something about "microservice authentication patterns" but can't remember which blog or doc site it was on. I needed to be able to search the words within the pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Powerful query capabilities&lt;/strong&gt; Boolean operators (AND, OR, NOT), field-specific searches (search only URLs, or only titles), and wildcard matching would make it possible to narrow down results quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero cognitive overhead&lt;/strong&gt; The tool needed to work seamlessly in my workflow. It should integrate naturally with how I already browse and search.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transparent fallback to online search engines&lt;/strong&gt; If I searched locally and didn't find what I wanted, I should be able to immediately fall back to Google with the same query, making adoption gradual rather than requiring a complete workflow change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning capabilities&lt;/strong&gt; The tool should let me customize the experience over time. I wanted to be able to blacklist irrelevant sites I never want to see again, prioritize important sources, and create keyword aliases for common searches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offline preview of saved content&lt;/strong&gt; I wanted to be able to read indexed pages even if the original site went down or the page was deleted, a nice bonus that would occasionally save me from link rot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Import existing history&lt;/strong&gt; I wanted to start with years of browsing data already indexed, rather than building up an index from scratch over months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free software&lt;/strong&gt; The tool had to be self-hosted, with no recurring costs or vendor lock-in. My browsing history is my personal data; it should not be owned by companies.&lt;/p&gt;

&lt;p&gt;No existing tool checked all these boxes. So I decided to build &lt;a href="https://github.com/asciimoo/hister" rel="noopener noreferrer"&gt;Hister&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Hister
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/asciimoo/hister" rel="noopener noreferrer"&gt;Hister&lt;/a&gt; is a self-hosted web history management tool that treats your browsing history as a personal search engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results: 50% Reduction in 1.5 Months
&lt;/h2&gt;

&lt;p&gt;After using Hister for six weeks, I analyzed my search patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;~50% of my Google searches now answered locally&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Found content Google couldn't&lt;/strong&gt; (authenticated pages, deleted content)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero privacy concerns&lt;/strong&gt; (no tracking, no profiling)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better results&lt;/strong&gt; for my specific needs (because it's MY history)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more I use it, the better it gets. My local index is now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More relevant than Google for my common queries&lt;/li&gt;
&lt;li&gt;As fast as opening a new browser tab&lt;/li&gt;
&lt;li&gt;Comprehensive across authenticated services&lt;/li&gt;
&lt;li&gt;A personal knowledge base of everything I've read&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unexpected Benefits
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rediscovery:&lt;/strong&gt; I'm finding valuable content I'd forgotten about. That article I bookmarked 2 years ago but never revisited? Now it shows up in relevant searches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning patterns:&lt;/strong&gt; Seeing what I search for reveals my knowledge gaps and interests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offline access:&lt;/strong&gt; When documentation sites go down or pages get deleted, I still have the content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We've accepted that search means "go to Google" for so long that we've forgotten there are alternatives. But for a huge portion of my daily searches - probably more than half - I don't need the entire internet. I need MY internet: the pages I've read, the docs I've opened, the internal tools I use daily.&lt;/p&gt;

&lt;p&gt;Hister isn't trying to replace Google for discovery. It's trying to replace Google for recall. And in that domain, it's already better than Google could ever be, because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It knows about authenticated pages Google will never see&lt;/li&gt;
&lt;li&gt;It searches YOUR history, not the entire web&lt;/li&gt;
&lt;li&gt;It's instant, private, and ad-free&lt;/li&gt;
&lt;li&gt;It gets better the more you use it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After 1.5 months, I've cut my Google dependence in half. I expect this number will increase as my index grows.&lt;/p&gt;

&lt;p&gt;If you're a developer, researcher, or knowledge worker who constantly re-searches for information you've already found, give Hister a try. It might just change how you find information on the internet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before Hister:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open Google&lt;/li&gt;
&lt;li&gt;Search: "bleve query"&lt;/li&gt;
&lt;li&gt;Click first result (probably wrong)&lt;/li&gt;
&lt;li&gt;Click second result (looks familiar…)&lt;/li&gt;
&lt;li&gt;Realize I've been here before&lt;/li&gt;
&lt;li&gt;Finally find the specific page I wanted&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Time: ~1-2 minutes, 5-10 clicks&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  With Hister:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open Hister&lt;/li&gt;
&lt;li&gt;Type: "bleve query", press enter&lt;/li&gt;
&lt;li&gt;The first result opens the EXACT page I visited last month&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Time: ~5 seconds, a few keystrokes&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Take Back Your Search
&lt;/h2&gt;

&lt;p&gt;To get started with Hister check out the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/asciimoo/hister/releases" rel="noopener noreferrer"&gt;Download Hister&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://addons.mozilla.org/en-US/firefox/addon/hister/" rel="noopener noreferrer"&gt;Download Firefox Extension&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chromewebstore.google.com/detail/hister/cciilamhchpmbdnniabclekddabkifhb" rel="noopener noreferrer"&gt;Download Chrome Extension&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Future Development
&lt;/h3&gt;

&lt;p&gt;I'm actively developing Hister with these goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve usability&lt;/li&gt;
&lt;li&gt;Add automatic indexing capabilities based on the index and opened results&lt;/li&gt;
&lt;li&gt;Find a secure and privacy-respecting way to connect local Hister instances to a distributed search engine&lt;/li&gt;
&lt;li&gt;Export search results&lt;/li&gt;
&lt;li&gt;Advanced analytics and search insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hister is open source (AGPLv3) and welcomes contributions!&lt;/p&gt;

&lt;h3&gt;
  
  
  Ways to Contribute
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🐛 Report bugs and suggest features on &lt;a href="https://github.com/asciimoo/hister/issues" rel="noopener noreferrer"&gt;GitHub Issues&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 Submit pull requests (check out &lt;a href="https://github.com/asciimoo/hister/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22" rel="noopener noreferrer"&gt;good first issues&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;📖 Improve documentation&lt;/li&gt;
&lt;li&gt;🎨 Design better UI/UX&lt;/li&gt;
&lt;li&gt;🌍 Translate to other languages&lt;/li&gt;
&lt;li&gt;⭐ Star the repo and spread the word!&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Have questions or feedback? Open an issue on &lt;a href="https://github.com/asciimoo/hister" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or reach out to &lt;a href="https://github.com/asciimoo" rel="noopener noreferrer"&gt;@asciimoo&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>searchengine</category>
      <category>indexer</category>
      <category>search</category>
    </item>
    <item>
      <title>How to Scrape Instagram Profiles</title>
      <dc:creator>Adam Tauber</dc:creator>
      <pubDate>Mon, 13 Nov 2017 00:00:00 +0000</pubDate>
      <link>https://dev.to/asciimoo/how-to-scrape-instagram-profiles-4gm</link>
      <guid>https://dev.to/asciimoo/how-to-scrape-instagram-profiles-4gm</guid>
      <description>

&lt;p&gt;Scraping can be tedious work, especially if the target site isn't just a standard static HTML page. Plenty of modern sites have JavaScript-only UIs where extracting content is not always trivial. Instagram is one of these websites, so I would like to show you how to write a scraper relatively quickly to get images from Instagram. I'm using &lt;a href="http://go-colly.org/"&gt;Colly&lt;/a&gt;, a scraping framework for Golang. The full working example can be found &lt;a href="http://go-colly.org/docs/examples/instagram/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Information gathering&lt;/h2&gt;

&lt;p&gt;First, if we view the source code of a profile page (e.g. &lt;a href="https://instagram.com/instagram"&gt;https://instagram.com/instagram&lt;/a&gt;), we can see a bunch of JavaScript code inside the &lt;code&gt;body&lt;/code&gt; tag instead of static HTML tags. Let's take a closer look at it. We can see that the first &lt;code&gt;script&lt;/code&gt; is just a variable declaration where a huge JSON is assigned to a single variable (&lt;code&gt;window._sharedData&lt;/code&gt;). This JSON can be easily extracted from the &lt;code&gt;script&lt;/code&gt; tag by finding the first &lt;code&gt;{&lt;/code&gt; character and getting the whole content after it:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;jsonData&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;scriptContent&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scriptContent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;"{"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scriptContent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="x"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note that because it is a JavaScript variable declaration, it has a trailing semicolon that we have to cut off to get valid JSON. That's why the example above ends with &lt;code&gt;len(scriptContent)-1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The formatted view of the extracted JSON reveals all the information we are looking for. The JSON contains information about a user's images and some metadata of the profile (e.g. the profile ID is &lt;code&gt;25025320&lt;/code&gt;). There is an interesting part of the metadata called &lt;code&gt;page_info&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"page_info"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"has_next_page"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"end_cursor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AQBiQhGRC6c6f-YOxdU0ApaAvotN4zI601ymkAtQ8SutdWz2n-bKFCkv51PMAoi9im3tNDTFLyhV969z8a6JnAkQMzHbYVwNI4Ke7jbk99nvFA"&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The value of &lt;code&gt;end_cursor&lt;/code&gt; is probably the URL parameter used to get the next page when &lt;code&gt;has_next_page&lt;/code&gt; is &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Tip: you can format JSON with the handy &lt;a href="https://github.com/stedolan/jq"&gt;jq&lt;/a&gt; command line tool.&lt;/p&gt;

&lt;h3&gt;Paging&lt;/h3&gt;

&lt;p&gt;The next page of the user profile is retrieved by an AJAX call, so we have to use the browser's Network Inspector to find out what is required to fetch it. Network Inspector shows a long and cryptic URL which has two GET parameters &lt;code&gt;query_id&lt;/code&gt; and &lt;code&gt;variables&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.instagram.com/graphql/query/?query_id=17888483320059182&amp;amp;variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A%22AQBiQhGRC6c6f-YOxdU0ApaAvotN4zI601ymkAtQ8SutdWz2n-bKFCkv51PMAoi9im3tNDTFLyhV969z8a6JnAkQMzHbYVwNI4Ke7jbk99nvFA%22%7D
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It seems like Instagram uses a &lt;a href="https://en.wikipedia.org/wiki/GraphQL"&gt;GraphQL&lt;/a&gt; API and the value of the &lt;code&gt;variables&lt;/code&gt; GET parameter is a URL-encoded value. We can decode it with a single line of Python code:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 -c 'from urllib.parse import parse_qs; print(parse_qs("variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A%22AQBiQhGRC6c6f-YOxdU0ApaAvotN4zI601ymkAtQ8SutdWz2n-bKFCkv51PMAoi9im3tNDTFLyhV969z8a6JnAkQMzHbYVwNI4Ke7jbk99nvFA%22%7D")["variables"][0])'
{"id":"25025320","first":12,"after":"AQBiQhGRC6c6f-YOxdU0ApaAvotN4zI601ymkAtQ8SutdWz2n-bKFCkv51PMAoi9im3tNDTFLyhV969z8a6JnAkQMzHbYVwNI4Ke7jbk99nvFA"}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As you can see it is a JSON object and the value of the &lt;code&gt;after&lt;/code&gt; attribute is the same as the value of the &lt;code&gt;end_cursor&lt;/code&gt; and &lt;code&gt;id&lt;/code&gt; is the ID of the profile.&lt;/p&gt;

&lt;p&gt;The only unknown information in the next page URL is the &lt;code&gt;query_id&lt;/code&gt; GET parameter. The HTML source code does not contain it, and neither do the cookies or the response headers. After a little bit of digging, it can be found in a static JS file included in the main page, and it seems to be a constant value.&lt;/p&gt;
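
&lt;p&gt;Putting these pieces together, the next page URL could be assembled roughly as follows. This is only a sketch under the assumptions above: the &lt;code&gt;pageVars&lt;/code&gt; type and the &lt;code&gt;nextPageURL&lt;/code&gt; helper are hypothetical names, the page size of &lt;code&gt;12&lt;/code&gt; is copied from the captured request, and the cursor in &lt;code&gt;main&lt;/code&gt; is a placeholder:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// pageVars mirrors the JSON object seen in the "variables" GET parameter.
type pageVars struct {
	ID    string `json:"id"`
	First int    `json:"first"`
	After string `json:"after"`
}

// nextPageURL builds the paging URL from the query ID, the profile ID and
// the end_cursor value of the previous page. Hypothetical helper.
func nextPageURL(queryID, profileID, endCursor string) (string, error) {
	vars, err := json.Marshal(pageVars{ID: profileID, First: 12, After: endCursor})
	if err != nil {
		return "", err
	}
	q := url.Values{}
	q.Set("query_id", queryID)
	q.Set("variables", string(vars))
	// Encode sorts the keys and URL-encodes the JSON value.
	return "https://www.instagram.com/graphql/query/?" + q.Encode(), nil
}

func main() {
	// The cursor is a placeholder; a real one comes from page_info.
	u, err := nextPageURL("17888483320059182", "25025320", "PLACEHOLDER_CURSOR")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

&lt;p&gt;Marshalling a struct instead of concatenating strings keeps the field order and quoting identical to the request captured in the Network Inspector.&lt;/p&gt;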

&lt;p&gt;The format of the response is also JSON but the structure is different from what we've found on the main page. This JSON contains the same information as the previous one, however we cannot use the same method to extract data due to structural differences.&lt;/p&gt;

&lt;h2&gt;Building the scraper&lt;/h2&gt;

&lt;p&gt;The information gathering phase clearly shows that we need four building blocks to be able to fetch all images found on an Instagram profile. Let's do it using Colly.&lt;/p&gt;

&lt;h3&gt;Extract and parse JSON from the main page&lt;/h3&gt;

&lt;p&gt;To extract content from HTML we need a new &lt;code&gt;Collector&lt;/code&gt; with an HTML callback that extracts the JSON data from the &lt;code&gt;script&lt;/code&gt; element. The callback, and the selector that triggers it, are registered with the &lt;code&gt;OnHTML&lt;/code&gt; function of &lt;code&gt;Collector&lt;/code&gt;.&lt;br&gt;
The JSON can be easily converted to a native Go structure using &lt;code&gt;json.Unmarshal&lt;/code&gt; from the standard library.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;colly&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCollector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="x"&gt;

&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnHTML&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"body &amp;gt; script:first-of-type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;colly&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTMLElement&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="c"&gt;// find JSON string&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;jsonData&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;"{"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="x"&gt;

    &lt;/span&gt;&lt;span class="c"&gt;// parse JSON&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
       &lt;/span&gt;&lt;span class="n"&gt;EntryData&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
           &lt;/span&gt;&lt;span class="n"&gt;ProfilePage&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
               &lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
                   &lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="x"&gt;    &lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"id"`&lt;/span&gt;&lt;span class="x"&gt;
                   &lt;/span&gt;&lt;span class="n"&gt;Media&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
                       &lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
                           &lt;/span&gt;&lt;span class="n"&gt;ImageURL&lt;/span&gt;&lt;span class="x"&gt;     &lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"display_src"`&lt;/span&gt;&lt;span class="x"&gt;
                           &lt;/span&gt;&lt;span class="n"&gt;ThumbnailURL&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"thumbnail_src"`&lt;/span&gt;&lt;span class="x"&gt;
                           &lt;/span&gt;&lt;span class="n"&gt;IsVideo&lt;/span&gt;&lt;span class="x"&gt;      &lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="x"&gt;   &lt;/span&gt;&lt;span class="s"&gt;`json:"is_video"`&lt;/span&gt;&lt;span class="x"&gt;
                           &lt;/span&gt;&lt;span class="n"&gt;Date&lt;/span&gt;&lt;span class="x"&gt;         &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="x"&gt;    &lt;/span&gt;&lt;span class="s"&gt;`json:"date"`&lt;/span&gt;&lt;span class="x"&gt;
                           &lt;/span&gt;&lt;span class="n"&gt;Dimensions&lt;/span&gt;&lt;span class="x"&gt;   &lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
                               &lt;/span&gt;&lt;span class="n"&gt;Width&lt;/span&gt;&lt;span class="x"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"width"`&lt;/span&gt;&lt;span class="x"&gt;
                               &lt;/span&gt;&lt;span class="n"&gt;Height&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"height"`&lt;/span&gt;&lt;span class="x"&gt;
                           &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt;
                       &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt;
                       &lt;/span&gt;&lt;span class="n"&gt;PageInfo&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;pageInfo&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"page_info"`&lt;/span&gt;&lt;span class="x"&gt;
                   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"media"`&lt;/span&gt;&lt;span class="x"&gt;
               &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"user"`&lt;/span&gt;&lt;span class="x"&gt;
           &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"ProfilePage"`&lt;/span&gt;&lt;span class="x"&gt;
       &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`json:"entry_data"`&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}{}&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unmarshal&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonData&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt;

    &lt;/span&gt;&lt;span class="c"&gt;// enumerate images&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EntryData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProfilePage&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;actualUserId&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="k"&gt;range&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Media&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
        &lt;/span&gt;&lt;span class="c"&gt;// skip videos&lt;/span&gt;&lt;span class="x"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsVideo&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="x"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Visit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ImageURL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="c"&gt;// ...&lt;/span&gt;&lt;span class="x"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;Create and visit next page URLs&lt;/h3&gt;

&lt;p&gt;The format of the next page URL is fixed, so a format string can be declared which accepts the changing &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;after&lt;/code&gt; parameters.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;nextPageURLTemplate&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;`https://www.instagram.com/graphql/query/?query_id=17888483320059182&amp;amp;variables={"id":"%s","first":12,"after":"%s"}`&lt;/span&gt;&lt;span class="x"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
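
&lt;p&gt;As a variation on the format string above, the same URL can be assembled with &lt;code&gt;net/url&lt;/code&gt; and &lt;code&gt;encoding/json&lt;/code&gt;, which percent-encodes the &lt;code&gt;variables&lt;/code&gt; object. This is only a sketch: &lt;code&gt;buildNextPageURL&lt;/code&gt; and &lt;code&gt;nextPageVars&lt;/code&gt; are hypothetical names, the sample id and cursor are made up, and it assumes the endpoint accepts a percent-encoded query string.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// nextPageVars mirrors the variables object of the URL template above.
// The type name is hypothetical and used for illustration only.
type nextPageVars struct {
	ID    string `json:"id"`
	First int    `json:"first"`
	After string `json:"after"`
}

// buildNextPageURL is a hypothetical helper that fills in the user id and
// the pagination cursor, then percent-encodes the whole query string.
func buildNextPageURL(userID, endCursor string) (string, error) {
	vars, err := json.Marshal(nextPageVars{ID: userID, First: 12, After: endCursor})
	if err != nil {
		return "", err
	}
	q := url.Values{}
	q.Set("query_id", "17888483320059182")
	q.Set("variables", string(vars))
	u := url.URL{
		Scheme:   "https",
		Host:     "www.instagram.com",
		Path:     "/graphql/query/",
		RawQuery: q.Encode(),
	}
	return u.String(), nil
}

func main() {
	// "12345" and "abcdef" are placeholder values, not real identifiers.
	next, err := buildNextPageURL("12345", "abcdef")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(next)
}
```

&lt;p&gt;The resulting string can be passed to &lt;code&gt;c.Visit&lt;/code&gt; in the same way as the image URLs.&lt;/p&gt;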



&lt;h3&gt;Parse next page JSONs&lt;/h3&gt;

&lt;p&gt;This is pretty much the same as the conversion of the main page's JSON, except that these responses use some different attribute names (e.g. the image URL is &lt;code&gt;display_url&lt;/code&gt; instead of &lt;code&gt;display_src&lt;/code&gt;).&lt;/p&gt;
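
&lt;p&gt;A minimal sketch of that conversion follows. Only the &lt;code&gt;display_url&lt;/code&gt; and &lt;code&gt;is_video&lt;/code&gt; attribute names come from this article; the &lt;code&gt;nextPage&lt;/code&gt; type, the flat &lt;code&gt;nodes&lt;/code&gt; nesting, the &lt;code&gt;imageURLs&lt;/code&gt; helper and the sample payload are assumptions made for illustration, so the struct has to be adapted to the real response.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nextPage is a hypothetical, trimmed-down shape for a next page response.
// Only the display_url and is_video names come from the article; the
// nesting is assumed and will differ from the real response.
type nextPage struct {
	Nodes []struct {
		ImageURL string `json:"display_url"`
		IsVideo  bool   `json:"is_video"`
	} `json:"nodes"`
}

// imageURLs decodes a payload and returns the URLs of all non-video nodes.
func imageURLs(payload []byte) ([]string, error) {
	page := new(nextPage)
	if err := json.Unmarshal(payload, page); err != nil {
		return nil, err
	}
	var urls []string
	for _, node := range page.Nodes {
		if node.IsVideo {
			continue // skip videos, as on the main page
		}
		urls = append(urls, node.ImageURL)
	}
	return urls, nil
}

func main() {
	// Made-up sample payload for illustration only.
	sample := []byte(`{"nodes":[
		{"display_url":"https://example.com/a.jpg","is_video":false},
		{"display_url":"https://example.com/b.mp4","is_video":true}
	]}`)
	urls, err := imageURLs(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(urls)
}
```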

&lt;h3&gt;Download and save images extracted from JSONs&lt;/h3&gt;

&lt;p&gt;After requesting images from Instagram with the &lt;code&gt;Visit&lt;/code&gt; function, the responses can be handled in &lt;code&gt;OnResponse&lt;/code&gt;. It takes a callback as a parameter, which is called once a response has arrived. To select the responses that contain images, filter on the &lt;code&gt;Content-Type&lt;/code&gt; HTTP header: if it indicates an image, the response body is saved to disk.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;colly&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="s"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="x"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputDir&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="x"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FileName&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="x"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="x"&gt;
    &lt;/span&gt;&lt;span class="c"&gt;// handle further response types...&lt;/span&gt;&lt;span class="x"&gt;
&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="x"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
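
&lt;p&gt;The substring check above also matches any header value that merely contains the word "image". A slightly stricter test, sketched here with the standard &lt;code&gt;mime&lt;/code&gt; package, parses the header and compares the media type prefix. The &lt;code&gt;isImage&lt;/code&gt; helper is a hypothetical name and not part of Colly.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// isImage reports whether a Content-Type header value names an image type.
// It is a hypothetical helper, not part of Colly.
func isImage(contentType string) bool {
	// ParseMediaType drops parameters such as "; charset=utf-8" and
	// lowercases the media type.
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false // missing or malformed header
	}
	return strings.HasPrefix(mediaType, "image/")
}

func main() {
	fmt.Println(isImage("image/jpeg"))                      // true
	fmt.Println(isImage("application/json; charset=utf-8")) // false
}
```

&lt;p&gt;Inside the callback, the condition would then read &lt;code&gt;isImage(r.Headers.Get("Content-Type"))&lt;/code&gt;.&lt;/p&gt;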



&lt;h2&gt;Epilogue&lt;/h2&gt;

&lt;p&gt;Scraping JS-only sites isn't always trivial, but it can be done without headless browsers or client-side code execution, which keeps performance high. This example scraper downloads approximately 1000 images a minute on a single thread over a regular home Internet connection.&lt;/p&gt;

&lt;p&gt;It can be tweaked further to handle videos and extract meta information.&lt;/p&gt;


</description>
      <category>scraping</category>
      <category>tutorial</category>
      <category>go</category>
      <category>instagram</category>
    </item>
  </channel>
</rss>
