<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ZoktrFall</title>
    <description>The latest articles on DEV Community by ZoktrFall (@zoktr).</description>
    <link>https://dev.to/zoktr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3925486%2F2ce2e16e-644f-4e9e-92ac-7a5b585ad811.jpg</url>
      <title>DEV Community: ZoktrFall</title>
      <link>https://dev.to/zoktr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zoktr"/>
    <language>en</language>
    <item>
      <title>Building a Website Contact Scraper API in .NET 10: Architecture, Crawling, and Fighting Cloudflare</title>
      <dc:creator>ZoktrFall</dc:creator>
      <pubDate>Mon, 11 May 2026 17:43:10 +0000</pubDate>
      <link>https://dev.to/zoktr/building-a-website-contact-scraper-api-in-net-10-architecture-crawling-and-fighting-cloudflare-16n2</link>
      <guid>https://dev.to/zoktr/building-a-website-contact-scraper-api-in-net-10-architecture-crawling-and-fighting-cloudflare-16n2</guid>
      <description>&lt;h2&gt;
  
  
  Building a Website Contact Scraper API in .NET 10: Crawling, Extraction, and a Cloudflare Problem I Can't Fully Solve
&lt;/h2&gt;

&lt;p&gt;I built an API that takes a domain and returns email addresses, phone numbers, social profiles, and company info, all from one call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;GET /api/v1/website/contacts?domain&lt;span class="o"&gt;=&lt;/span&gt;stripe.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response contains verified emails with confidence scores, phone numbers, LinkedIn/Twitter/GitHub links, and crawl metadata. Here's how the interesting parts work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The service uses a clean layered architecture: Api → Application → Domain, with Infrastructure implementing the Application interfaces. The controller is 12 lines of plumbing. Everything real happens in the crawler and extractor.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two-Phase Crawler
&lt;/h2&gt;

&lt;p&gt;The crawler uses a priority queue and runs in two phases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast path&lt;/strong&gt;: the first 18 pages, restricted to high-value routes such as &lt;code&gt;/contact&lt;/code&gt;, &lt;code&gt;/about&lt;/code&gt;, &lt;code&gt;/privacy&lt;/code&gt;, and &lt;code&gt;/legal&lt;/code&gt;. This returns real contacts in under 2 seconds for most sites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage two&lt;/strong&gt;: deferred URLs are promoted once the fast path finishes. This handles sites where contacts are buried under a deep path like &lt;code&gt;/company/offices/regional/emea/contact&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every URL gets a priority score before entering the queue:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Segment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;)[]&lt;/span&gt; &lt;span class="n"&gt;PriorityPathSegments&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/contact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="m"&gt;120&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/contact-us"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;118&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/support"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="m"&gt;115&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/privacy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="m"&gt;110&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/about"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="m"&gt;95&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
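
&lt;p&gt;A minimal sketch of how a score table like this could feed a queue, assuming the built-in &lt;code&gt;PriorityQueue&amp;lt;TElement, TPriority&amp;gt;&lt;/code&gt; from .NET 6 and later. The &lt;code&gt;ScoreUrl&lt;/code&gt; helper, the three-entry table, and the example.com URLs are illustrative assumptions, not the real crawler code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;

// Hypothetical subset of the score table; the real list is longer.
(string Segment, int Score)[] segments =
[
    ("/contact", 120),
    ("/support", 115),
    ("/about", 95)
];

// Highest matching segment wins; paths with no match score 0.
int ScoreUrl(Uri url)
{
    var best = 0;
    foreach (var (segment, score) in segments)
        if (url.AbsolutePath.Contains(segment, StringComparison.OrdinalIgnoreCase))
            best = Math.Max(best, score);
    return best;
}

// PriorityQueue dequeues the lowest priority value first, so enqueue the negated score.
var queue = new PriorityQueue&amp;lt;Uri, int&amp;gt;();
foreach (var raw in new[] { "https://example.com/blog", "https://example.com/about", "https://example.com/contact" })
{
    var url = new Uri(raw);
    queue.Enqueue(url, -ScoreUrl(url));
}

// Dequeues /contact, then /about, then /blog.
while (queue.TryDequeue(out var next, out var priority))
    Console.WriteLine($"{-priority} {next.AbsolutePath}");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Negating the score is the simplest way to get max-first ordering out of a min-first queue. A custom comparer would work too.&lt;/p&gt;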



&lt;p&gt;Route family deduplication strips locale prefixes so &lt;code&gt;/en/contact&lt;/code&gt;, &lt;code&gt;/fr/contact&lt;/code&gt;, and &lt;code&gt;/de/contact&lt;/code&gt; are treated as one family and fetched once. This was the highest-leverage optimization: it cut unnecessary fetches dramatically on international sites, where the same page otherwise repeats once per locale.&lt;/p&gt;
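
&lt;p&gt;Here is one way the family key could be computed. This is a sketch under assumptions: the &lt;code&gt;RouteFamily&lt;/code&gt; helper and the locale regex (two letters, optionally followed by a two-letter region) are hypothetical, and the real rules may differ:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Assumed locale shapes: "en", "fr", "pt-br", "en_US". Real sites vary.
var localeSegment = new Regex("^[a-z]{2}([-_][a-z]{2})?$", RegexOptions.IgnoreCase);

string RouteFamily(string path)
{
    var parts = path.Split('/', StringSplitOptions.RemoveEmptyEntries);
    // Strip a leading locale only when something follows it,
    // so a bare "/de" landing page keeps its own family.
    if (parts.Length &amp;gt; 1 &amp;amp;&amp;amp; localeSegment.IsMatch(parts[0]))
        parts = parts[1..];
    return "/" + string.Join('/', parts).ToLowerInvariant();
}

// HashSet.Add returns false for a family that was already queued.
var seenFamilies = new HashSet&amp;lt;string&amp;gt;();
foreach (var path in new[] { "/en/contact", "/fr/contact", "/de/contact", "/about" })
{
    var family = RouteFamily(path);
    Console.WriteLine($"{path} family={family} fetch={seenFamilies.Add(family)}");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A pattern this loose also strips two-letter segments that are not locales, such as &lt;code&gt;/go/contact&lt;/code&gt;, so an allow-list of known locale codes is probably safer in practice.&lt;/p&gt;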




&lt;h2&gt;
  
  
  Email Extraction
&lt;/h2&gt;

&lt;p&gt;Five passes over each page's DOM: text nodes, &lt;code&gt;mailto:&lt;/code&gt; anchors, &lt;code&gt;data-cfemail&lt;/code&gt; attributes, element attributes, and JSON-LD blocks.&lt;/p&gt;

&lt;p&gt;The Cloudflare email decoder was satisfying to build. Cloudflare hex-encodes the address, uses the first byte as the key, and XORs every following byte with it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;DecodeCloudflareProtectedEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IElement&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;encoded&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data-cfemail"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Need the key byte plus at least one payload byte; non-hex input still throws FormatException below&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;%&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Convert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToByte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;[..&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;characters&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;Convert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToByte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;^&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each email gets a confidence score built from multiple signals: domain match, role-based address, &lt;code&gt;mailto:&lt;/code&gt; source, page context, footer placement, and surrounding phrase ("email us at", "send resumes to"). Scoring beats hard accept/reject rules because real-world emails are messy.&lt;/p&gt;
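
&lt;p&gt;To make the scoring idea concrete, here is a small additive sketch. The &lt;code&gt;EmailSignals&lt;/code&gt; record, the weights, and the choice to count a role-based address as a positive signal are all assumptions made for illustration. The real weights are not published here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

// Illustrative weights only; each signal adds to a base score, capped at 100.
int Score(EmailSignals s)
{
    var score = 20;                       // base score for any syntactically valid address
    if (s.DomainMatches) score += 30;     // address domain equals the crawled domain
    if (s.FromMailto) score += 20;        // found in a mailto: anchor
    if (s.RoleBased) score += 10;         // info@, sales@, support@ and similar
    if (s.OnContactPage) score += 10;     // page context
    if (s.InFooter) score += 5;           // footer placement
    if (s.HasContactPhrase) score += 5;   // "email us at", "send resumes to"
    return Math.Min(score, 100);
}

var strong = new EmailSignals(true, true, true, true, false, true);
var weak = new EmailSignals(false, false, false, false, true, false);
Console.WriteLine($"strong={Score(strong)} weak={Score(weak)}"); // strong=95 weak=25

record EmailSignals(bool DomainMatches, bool FromMailto, bool RoleBased,
                    bool OnContactPage, bool InFooter, bool HasContactPhrase);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a score instead of a yes/no decision, the caller picks the cutoff: an address found only in a footer still comes back, just with a low number attached.&lt;/p&gt;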




&lt;h2&gt;
  
  
  Social Extraction
&lt;/h2&gt;

&lt;p&gt;JSON-LD &lt;code&gt;sameAs&lt;/code&gt; fields are the most reliable source. Sites that care about SEO publish their structured data carefully. Footer anchor tags are noisier: share buttons, partner links, and embedded widgets all look like profiles. Weighting &lt;code&gt;sameAs&lt;/code&gt; much higher than anchors halved the false-positive rate.&lt;/p&gt;
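
&lt;p&gt;Reading &lt;code&gt;sameAs&lt;/code&gt; takes very little code. The sketch below uses &lt;code&gt;System.Text.Json&lt;/code&gt; and assumes the JSON-LD body has already been pulled out of its script tag. The &lt;code&gt;SameAsLinks&lt;/code&gt; helper and the sample data are made up, and it only handles a top-level object, not a top-level array or an &lt;code&gt;@graph&lt;/code&gt; wrapper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Text.Json;

// Sample JSON-LD body; the organization and URLs are placeholders.
var jsonLd = """
{
  "@type": "Organization",
  "sameAs": ["https://www.linkedin.com/company/example", "https://github.com/example"]
}
""";

List&amp;lt;string&amp;gt; SameAsLinks(string json)
{
    var links = new List&amp;lt;string&amp;gt;();
    using var doc = JsonDocument.Parse(json); // throws JsonException on malformed input
    var root = doc.RootElement;
    if (root.ValueKind != JsonValueKind.Object || !root.TryGetProperty("sameAs", out var sameAs))
        return links;

    // schema.org allows sameAs to be a single string or an array of strings.
    if (sameAs.ValueKind == JsonValueKind.String)
        links.Add(sameAs.GetString()!);
    else if (sameAs.ValueKind == JsonValueKind.Array)
        foreach (var item in sameAs.EnumerateArray())
            if (item.ValueKind == JsonValueKind.String)
                links.Add(item.GetString()!);
    return links;
}

foreach (var link in SameAsLinks(jsonLd))
    Console.WriteLine(link);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;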




&lt;h2&gt;
  
  
  The Cloudflare Problem I Haven't Fully Solved
&lt;/h2&gt;

&lt;p&gt;This is where I'm stuck and genuinely want input from anyone who's dealt with this.&lt;/p&gt;

&lt;p&gt;Locally, the crawler handles Cloudflare-protected sites reasonably well — persistent cookie jar, correct &lt;code&gt;Sec-Fetch-*&lt;/code&gt; headers, headless Chrome fallback with a spoofed user agent. Works fine on my machine.&lt;/p&gt;

&lt;p&gt;In production on Railway (datacenter IP), the same code gets blocked on a significant percentage of Cloudflare-protected sites: challenge pages, 403s, and silent blocks. The headless fallback helps but doesn't fully solve it.&lt;/p&gt;

&lt;p&gt;My current setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Persistent cookie jar across requests&lt;/span&gt;
&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UseCookies&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CookieContainer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CookieContainer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Full Chrome header fingerprint&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultRequestHeaders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryAddWithoutValidation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sec-Fetch-Dest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"document"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultRequestHeaders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryAddWithoutValidation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sec-Fetch-Mode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"navigate"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultRequestHeaders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryAddWithoutValidation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sec-ch-ua"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"\"Google Chrome\";v=\"135\", \"Not-A.Brand\";v=\"8\""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
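
&lt;p&gt;For deciding when a response should trigger the headless fallback, one option is a small classifier like the sketch below. It relies on the &lt;code&gt;cf-mitigated: challenge&lt;/code&gt; response header that Cloudflare documents for challenge pages. The &lt;code&gt;LooksBlocked&lt;/code&gt; helper and the status-code list are assumptions, and a silent block that returns a 200 with stripped content would slip past it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Net;
using System.Net.Http;

bool LooksBlocked(HttpResponseMessage response)
{
    // Challenge pages carry "cf-mitigated: challenge" regardless of challenge type.
    if (response.Headers.TryGetValues("cf-mitigated", out var values))
        foreach (var value in values)
            if (value.Equals("challenge", StringComparison.OrdinalIgnoreCase))
                return true;

    // Assumption: treat these status codes as blocks worth a headless retry.
    return response.StatusCode is HttpStatusCode.Forbidden
        or HttpStatusCode.TooManyRequests
        or HttpStatusCode.ServiceUnavailable;
}

// Hand-built responses stand in for real crawler traffic.
using var challenged = new HttpResponseMessage(HttpStatusCode.Forbidden);
challenged.Headers.TryAddWithoutValidation("cf-mitigated", "challenge");
using var ok = new HttpResponseMessage(HttpStatusCode.OK);

Console.WriteLine($"challenged={LooksBlocked(challenged)} ok={LooksBlocked(ok)}");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;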



&lt;p&gt;I understand the core issue: Cloudflare pre-scores datacenter IPs as high-risk regardless of headers. Residential proxies are the obvious answer but add cost and complexity I haven't wired up yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'm wondering:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has anyone solved this cleanly in .NET without proxies?&lt;/li&gt;
&lt;li&gt;Is there a proxy provider that works well for this use case without breaking the bank?&lt;/li&gt;
&lt;li&gt;Any other signals I'm missing that would help on datacenter IPs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can test the API yourself and see where it succeeds and fails — free tier, no credit card:&lt;br&gt;
👉 &lt;a href="https://rapidapi.com/zoktrapi-zoktrapi-default/api/website-contacts-finder" rel="noopener noreferrer"&gt;https://rapidapi.com/zoktrapi-zoktrapi-default/api/website-contacts-finder&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you find a domain where results are wrong or missing, drop it in the comments. Genuinely useful for debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;p&gt;.NET 10 · ASP.NET Core · HtmlAgilityPack · AngleSharp · Redis · headless Chrome · Railway&lt;/p&gt;

&lt;p&gt;Happy to answer questions — and really hoping someone has cracked the datacenter IP problem.&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>dotnet</category>
      <category>api</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
