DEV Community

Cover image for Introducing domain_extractor: A High-Performance Ruby Gem for URL Parsing and Domain Extraction
Jordan Hudgens
Jordan Hudgens

Posted on

Introducing domain_extractor: A High-Performance Ruby Gem for URL Parsing and Domain Extraction

TL;DR

OpenSite AI is excited to announce the release of domain_extractor, a lightweight Ruby gem that delivers precise URL parsing, domain extraction, and multi-part TLD support. Perfect for web scraping, analytics, and any workflow requiring accurate domain handling.

πŸ”— Install it: gem install domain_extractor

πŸ“¦ RubyGems: https://rubygems.org/gems/domain_extractor

πŸ’» GitHub: https://github.com/opensite-ai/domain_extractor


The Problem

If you've ever worked with URLs in Ruby, you know the pain: extracting clean domain components from messy URLs isn't as straightforward as it should be. Standard libraries like URI don't handle multi-part top-level domains (TLDs) like .co.uk or .com.au. You end up with brittle regex solutions or pulling in heavy dependencies.

We needed something lightweight, accurate, and production-ready for our analytics and web scraping workflows at OpenSite AI. When we couldn't find exactly what we needed, we built itβ€”and now we're open-sourcing it for the community.


What is domain_extractor?

domain_extractor is a Ruby gem engineered to parse URLs and extract domain components with surgical precision. It's built on Ruby's standard URI library and the battle-tested public_suffix gem, giving you reliable parsing for even the trickiest domains.

Key Features

βœ… Multi-part TLD Support – Accurately handles complex TLDs like co.uk, com.au, gov.br using the Public Suffix List

βœ… Nested Subdomain Parsing – Correctly extracts multi-level subdomains (api.staging.example.com)

βœ… Smart URL Normalization – Handles URLs with or without schemes automatically

βœ… Query Parameter Extraction – Parse query strings into structured hashes

βœ… Zero Configuration – Works out of the box with sensible defaults

βœ… Well-Tested – Comprehensive test suite covering edge cases


Installation

Add it to your Gemfile:

gem 'domain_extractor'
Enter fullscreen mode Exit fullscreen mode

Or install directly:

gem install domain_extractor
Enter fullscreen mode Exit fullscreen mode

Requirements: Ruby 3.2+ and public_suffix ~> 6.0


Usage Examples

Basic Domain Parsing

require 'domain_extractor'

result = DomainExtractor.parse('https://www.example.co.uk/path?query=value')

result[:subdomain]    # => 'www'
result[:domain]       # => 'example'
result[:tld]          # => 'co.uk'
result[:root_domain]  # => 'example.co.uk'
result[:host]         # => 'www.example.co.uk'
result[:path]         # => '/path'
Enter fullscreen mode Exit fullscreen mode

Handling Complex TLDs

This is where domain_extractor really shines:

# UK domain
DomainExtractor.parse('shop.bbc.co.uk')
# => { subdomain: 'shop', domain: 'bbc', tld: 'co.uk', root_domain: 'bbc.co.uk' }

# Australian domain
DomainExtractor.parse('api.example.com.au')
# => { subdomain: 'api', domain: 'example', tld: 'com.au', root_domain: 'example.com.au' }

# Brazilian government domain
DomainExtractor.parse('portal.gov.br')
# => { subdomain: 'portal', domain: 'gov', tld: 'br', root_domain: 'gov.br' }
Enter fullscreen mode Exit fullscreen mode

Nested Subdomains

DomainExtractor.parse('api.staging.prod.example.com')
# => { 
#   subdomain: 'api.staging.prod', 
#   domain: 'example', 
#   tld: 'com',
#   root_domain: 'example.com'
# }
Enter fullscreen mode Exit fullscreen mode

Query Parameter Parsing

params = DomainExtractor.parse('https://example.com?utm_source=google&utm_medium=cpc&page=1').query_params
# => { 'utm_source' => 'google', 'utm_medium' => 'cpc', 'page' => '1' }
Enter fullscreen mode Exit fullscreen mode

Real-World Use Cases

1. Web Scraping – Extract Root Domains

scraped_links = page.css('a').map { |a| a['href'] }
unique_domains = scraped_links
  .map { |url| DomainExtractor.parse(url)&.dig(:root_domain) }
  .compact
  .uniq

# Result: ['example.com', 'github.com', 'stackoverflow.com']
Enter fullscreen mode Exit fullscreen mode

2. Analytics – Categorize Traffic by Domain

referrer = request.referrer
parsed = DomainExtractor.parse(referrer)

if parsed
  Analytics.track_event('page_view', {
    source_domain: parsed[:root_domain],
    source_subdomain: parsed[:subdomain]
  })
end
Enter fullscreen mode Exit fullscreen mode

3. Domain Validation – Check Internal Links

def internal_link?(url, base_domain)
  parsed = DomainExtractor.parse(url)
  parsed && parsed[:root_domain] == base_domain
end

internal_link?('https://blog.example.com/post', 'example.com')  # => true
internal_link?('https://external.com/page', 'example.com')       # => false
Enter fullscreen mode Exit fullscreen mode

4. SEO Audits – Extract & Analyze Backlink Domains

backlinks = fetch_backlinks_from_tool()
domain_distribution = backlinks
  .map { |link| DomainExtractor.parse(link)&.dig(:root_domain) }
  .compact
  .tally
  .sort_by { |_, count| -count }

# Result: { 'example.com' => 45, 'github.com' => 23, ... }
Enter fullscreen mode Exit fullscreen mode

Performance

domain_extractor is optimized for speed:

  • Single URL parsing: ~0.0001s per URL
  • Batch processing: ~0.01s for 100 URLs
  • Memory efficient: Minimal object allocation
  • Thread-safe: Can be used in concurrent environments

Comparison with Alternatives

Feature domain_extractor Addressable URI (stdlib)
Multi-part TLD support βœ… ❌ ❌
Subdomain extraction βœ… ❌ ❌
Domain component separation βœ… ❌ ❌
Query parameter parsing βœ… βœ… βœ…
URL normalization βœ… βœ… βœ…
PublicSuffix validation βœ… ❌ ❌
Lightweight βœ… ❌ βœ…

What's Next

We're actively maintaining domain_extractor and have several features on our roadmap:

  • Domain validation – Check if domains are valid/registered
  • Punycode support – Better handling of internationalized domains
  • Performance improvements – Even faster parsing for high-volume use cases
  • CLI tool – Command-line interface for quick domain analysis

Contributing

domain_extractor is open source under the MIT license. We welcome contributions!


About OpenSite AI

At OpenSite AI, we're committed to building practical tools that solve real-world problems. domain_extractor is our first open-source gem, with more to come. Follow our journey:


Get Started

Install domain_extractor today and simplify your URL parsing:

gem install domain_extractor
Enter fullscreen mode Exit fullscreen mode

Or add to your Gemfile:

gem 'domain_extractor'
Enter fullscreen mode Exit fullscreen mode

Happy parsing! πŸš€

Top comments (1)

Collapse
 
onlineproxy profile image
OnlineProxy

Ruby’s URI knows the shape of a URL but not effective TLDs, so it kinda borks on multi-part TLDs like co.uk, and regexes can’t keep up with the ever-changing PSL and its wildcards/exceptions. Call DomainExtractor.parse(url) to pull :subdomain, :domain, :tld, :root_domain, :host, and :path out of messy links. Schemeless inputs like example.com/path are normalized automatically and the tiny drop-in to replace brittle regexes is DomainExtractor.parse(input)&.dig(:root_domain)-boom.