Last year, I searched Google for "competitor pricing analysis tools." Within 24 hours, pricing software ads flooded my LinkedIn. My inbox filled with cold outreach. A sales rep called my business line, quoting my exact query back to me.
I build automation tools for a living. I know how this machinery works. Still, the precision of that targeting made me realize something: the modern search engine is not a tool. It is a surveillance device with a search bar attached.
So I spent six months understanding exactly how search data gets harvested, sold, and weaponized. Then I built a different architecture. This article is what I learned.
How Search Data Actually Flows
Most developers know Google collects data. Few understand the full pipeline. Here is how a single query moves through the ecosystem.
Your device sends the query to your ISP. Your ISP logs the DNS request. In the US, ISPs can legally sell that log. In the EU, GDPR applies, but DNS is still resolved and logged somewhere.
Google receives the query and records: your IP address, device fingerprint, browser version, screen resolution, installed fonts, timezone, language, search history, click patterns, dwell time on results, and every subsequent search in that session. This is all correlated with your YouTube history, Gmail content, Android app usage, and any site using Google Analytics or AdSense.
Data brokers like Acxiom, Experian, and Oracle Data Cloud buy aggregated search behavior by category. They know you searched for CRM pricing not because they see your query, but because Google told them someone in your demographic bracket showed commercial intent in business software in the last 48 hours.
Competitor intelligence platforms buy these reports. They know which companies are researching which tools. They know when a startup is evaluating a new tech stack. They know when an enterprise is unhappy with its current vendor.
Your competitors then receive alerts: "A company in the EU matching your target profile is evaluating alternatives to your product."
This is not theoretical. This is the standard data supply chain for B2B sales intelligence.
The Architecture Problem
The issue is architectural, not ethical. Google's business model requires data extraction to fund the index. Every "free" search is subsidized by ad targeting.
The trade-off looks like this:
| Feature | Google | DuckDuckGo | Self-Hosted |
|---|---|---|---|
| Index quality | Excellent | Good (Bing) | Requires setup |
| Privacy | None | Partial (Microsoft ads) | Full |
| Personalization | Extreme | None | Configurable |
| Speed | Fast | Fast | Depends on infra |
| Cost to user | $0 | $0 | Infra cost |
| Cost to privacy | Total | Reduced | None |
The middle column is the trap. DuckDuckGo does not build a profile, but it still serves Microsoft ads, uses Bing's index, and cannot guarantee what happens upstream. Startpage proxies Google results but is owned by System1, an adtech company. The privacy is conditional.
A real solution requires a different architecture entirely: no query storage, no user profiles, no upstream correlation, and a business model that does not depend on surveillance.
Designing a Zero-Knowledge Search Stack
When I started building, I set five constraints:
- No query logging. The server processes the query, returns results, and forgets it.
- No user profiles. No accounts, no cookies for tracking, no "personalization."
- Federated sources. Do not rely on a single index. Query multiple sources simultaneously.
- Client-side execution. Where possible, run the search logic in the user's browser, not on the server.
- Sustainable economics. Charge for the service, not the data.
The architecture that emerged is not revolutionary, but it is rare because it violates the default business model of search.
Query Processing
```text
User Browser
  → TLS 1.3 encrypted query
  → Ephemeral session created (60-second TTL)
  → Query dispatched to multiple sources in parallel
  → Results aggregated server-side
  → Session data purged
  → Response returned
  → No log entry written
```
The server never stores the query. It cannot. The session is in-memory only, with a hard TTL. If the process crashes, the data is gone. This is by design.
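As a minimal sketch of what "in-memory only, with a hard TTL" can look like: the store below is a plain Python dict keyed by a random session id, with a background sweep for anything that outlives its 60-second window. The class and method names are mine, for illustration only.

```python
import asyncio
import secrets
import time


class EphemeralSessionStore:
    """Holds in-flight search sessions in process memory only. Nothing is written to disk."""

    def __init__(self, ttl_seconds: float = 60.0):
        self._ttl = ttl_seconds
        self._sessions: dict[str, tuple[float, dict]] = {}

    def create(self, query: str) -> str:
        """Register a query under a random session id with an expiry timestamp."""
        session_id = secrets.token_urlsafe(16)
        self._sessions[session_id] = (time.monotonic() + self._ttl, {"query": query})
        return session_id

    def purge(self, session_id: str) -> None:
        """Drop the session as soon as the response has been returned."""
        self._sessions.pop(session_id, None)

    async def sweep(self) -> None:
        """Background task: evict anything that outlived its TTL (e.g. dropped connections)."""
        while True:
            now = time.monotonic()
            for sid in [s for s, (deadline, _) in self._sessions.items() if deadline < now]:
                self._sessions.pop(sid, None)
            await asyncio.sleep(1)
```

Because the store lives only in process memory, a crash leaves nothing behind, which is exactly the property described above.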
Federated Search
Instead of crawling and indexing the web ourselves — a multi-billion-dollar problem — we query existing sources simultaneously:
- Open search APIs (where available)
- Specialized vertical engines (academic, legal, technical)
- Curated datasets (government data, open research)
- User-defined sources (via custom search agents)
The trade-off: results are slightly slower (200-500ms vs. 50ms for Google) because we are querying multiple APIs in parallel. The gain: no single party sees your full query history.
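A rough sketch of that fan-out, assuming each source is wrapped in an async callable with a common signature; the per-source timeout and the empty-list fallback on failure are illustrative choices, not production values:

```python
import asyncio
from typing import Any, Awaitable, Callable

# Assumed interface: each source takes a query string and returns a list of result dicts.
Source = Callable[[str], Awaitable[list[dict[str, Any]]]]


async def federated_search(query: str, sources: dict[str, Source],
                           timeout: float = 0.5) -> list[dict[str, Any]]:
    """Fan the query out to all sources in parallel and merge whatever comes back."""

    async def guarded(name: str, source: Source) -> list[dict[str, Any]]:
        # A slow or broken source degrades the result set; it never breaks the request.
        try:
            results = await asyncio.wait_for(source(query), timeout)
            return [{**item, "source": name} for item in results]
        except Exception:
            return []

    batches = await asyncio.gather(*(guarded(name, src) for name, src in sources.items()))
    return [item for batch in batches for item in batch]
```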
Client-Side Agents
The feature that surprised me most in early testing: technical users do not want better search. They want programmable search.
A search agent is a JSON definition that specifies sources, ranking logic, filters, and output format. The agent definition is sent to the server, but the interpretation happens in the browser where possible. This means the server sees "execute agent X" but not the specific parameters or results.
Example agent for competitor monitoring:
```json
{
  "name": "Competitor Monitor",
  "sources": ["news_api", "crunchbase", "linkedin_posts"],
  "query_template": "{company_name} funding OR acquisition OR product_launch",
  "filters": {"language": "en", "region": "EU", "date_range": "7d"},
  "output": "structured_json",
  "schedule": "daily_0600"
}
```
The user creates this once, and the agent runs autonomously. The server knows an agent exists, but not what it searches for or what it finds.
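To make the "interpretation happens in the browser" point concrete, here is a rough sketch of how a definition like the one above can be expanded into per-source queries. Only the JSON field names come from the example; the function and parameter handling are hypothetical, and the real expansion runs client-side rather than in server-side Python.

```python
import json


def expand_agent(agent_json: str, params: dict[str, str]) -> list[dict]:
    """Expand an agent definition plus user-supplied parameters into per-source query specs.

    In the architecture above this step runs client-side, so the server only ever
    sees "execute agent X", never the filled-in parameters or the results.
    """
    agent = json.loads(agent_json)
    # e.g. params = {"company_name": "ExampleCorp"} fills the {company_name} placeholder
    query = agent["query_template"].format(**params)
    return [
        {
            "source": source,
            "query": query,
            "filters": agent.get("filters", {}),
            "output": agent.get("output", "structured_json"),
        }
        for source in agent["sources"]
    ]
```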
The EU Compliance Angle
For businesses in the European Union, this is not optional. Article 32 of GDPR requires "technical and organizational measures" to ensure data confidentiality. Logging every search query your employees make is a ticking compliance bomb.
Consider three scenarios:
Trade secret exposure. Your R&D team searches for "solid-state battery electrolyte 2024." That query reveals strategic direction. If logged by Google, it becomes part of your data profile. If subpoenaed or breached, it becomes evidence.
Competitive intelligence leak. Your search patterns create a behavioral fingerprint. Data brokers sell "technology stack shift signals" to investors and competitors. A sudden cluster of queries around a specific vendor category flags strategic intent.
Legal discovery risk. In litigation, search histories are discoverable. A pattern of queries about competitors' patents, pricing, or partnerships can support claims of bad faith or anticompetitive behavior.
The people I have spoken with who take this seriously are not paranoid. They are lawyers, compliance officers, and security engineers who have read the case law.
What We Learned from 200 Beta Users
I launched a private beta with a simple thesis: technical professionals in the EU need privacy-first search for competitive research.
The usage patterns were unexpected:
- Startup founders (35% of users) used it for investor and competitor research. Not because they distrust Google, but because they distrust being profiled while evaluating vendors.
- Consultants (28%) used it for client due diligence. They cannot let their search history reveal which clients they are pitching.
- Security researchers (22%) used it for vulnerability and threat intel. They literally cannot use tracked search for their job.
- Journalists (15%) used it for source protection.
The common thread: these are not privacy extremists. They are professionals for whom search history creates liability.
The Hard Parts (What Did Not Work)
Building this revealed problems I did not anticipate:
Source reliability. Federated search is only as good as its weakest source. Some APIs throttle aggressively. Some return stale results. Some change schemas without notice. We now maintain a source health dashboard and automatic failover.
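One simple way to implement such a reliability score is an exponentially decayed success rate with a threshold-based failover, sketched below; the decay factor and threshold are arbitrary illustrative values, not what runs in production.

```python
class SourceHealth:
    """Rolling reliability score per source, with failover to the healthiest candidate."""

    def __init__(self, decay: float = 0.9):
        self._decay = decay                  # weight given to history vs. the latest call
        self._scores: dict[str, float] = {}  # source name -> score in [0, 1]

    def record(self, source: str, success: bool) -> None:
        """Blend the outcome of the latest call into the source's rolling score."""
        previous = self._scores.get(source, 1.0)
        observation = 1.0 if success else 0.0
        self._scores[source] = self._decay * previous + (1.0 - self._decay) * observation

    def pick(self, candidates: list[str], minimum: float = 0.5) -> str | None:
        """Return the healthiest candidate above the threshold, or None to skip this vertical."""
        if not candidates:
            return None
        best_score, best = max((self._scores.get(c, 1.0), c) for c in candidates)
        return best if best_score >= minimum else None
```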
Speed vs. privacy trade-off. Querying multiple sources in parallel is inherently slower than a monolithic index. Users notice 300ms vs. 50ms. We mitigated with aggressive caching of non-personalized results and pre-fetching for scheduled agents.
Search syntax divergence. Every source uses different query syntax. DuckDuckGo uses !bang commands. Academic APIs use Boolean operators. News APIs use natural language. Normalizing queries across sources is a problem nobody has solved well.
Business model skepticism. Charging for search feels foreign. Users expect search to be free. We had to reframe it: you are not paying for search. You are paying for the absence of surveillance.
If You Want to Build Something Similar
For engineers who want to build their own privacy-first search tools, here is the minimal viable architecture:
- Use a memory-only session store (Redis with TTL, not a database). No persistence. See the sketch after this list.
- Query multiple sources in parallel using `asyncio.gather()` or equivalent. Handle failures gracefully.
- Implement client-side agent execution where possible. Pjax or WASM work well for this.
- Use residential proxies with rotation if you need to scrape sources without APIs. Respect robots.txt and rate limits.
- Build a source health layer. Each source gets a reliability score. Fail over automatically.
- For billing, use a SaaS model, not ads. Stripe or Paddle work well for EU compliance.
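As a starting point for the first item, a sketch of the Redis-with-TTL session store using redis-py's asyncio client; it assumes a Redis instance configured without persistence (no RDB snapshots, no AOF), and the host, key prefix, and TTL are illustrative.

```python
import secrets

import redis.asyncio as redis

# Redis must be configured with persistence disabled so queries never touch disk.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)


async def open_session(query: str, ttl_seconds: int = 60) -> str:
    """Store the in-flight query under a random key that Redis expires on its own."""
    session_id = secrets.token_urlsafe(16)
    await r.set(f"session:{session_id}", query, ex=ttl_seconds)
    return session_id


async def close_session(session_id: str) -> None:
    """Delete the key as soon as the response is sent; the TTL is only the safety net."""
    await r.delete(f"session:{session_id}")
```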
I am the founder of Graham Miranda UG, a Berlin-based company building privacy-first web intelligence tools. The architecture described above is what we implemented in asearchz.online — if you are evaluating tools in this space, it is one data point among many.