TL;DR
OpenSite AI is excited to announce the release of domain_extractor, a lightweight Ruby gem that delivers precise URL parsing, domain extraction, and multi-part TLD support. Perfect for web scraping, analytics, and any workflow requiring accurate domain handling.
💎 Install it: gem install domain_extractor
📦 RubyGems: https://rubygems.org/gems/domain_extractor
💻 GitHub: https://github.com/opensite-ai/domain_extractor
The Problem
If you've ever worked with URLs in Ruby, you know the pain: extracting clean domain components from messy URLs isn't as straightforward as it should be. Standard libraries like URI don't handle multi-part top-level domains (TLDs) like .co.uk or .com.au. You end up with brittle regex solutions or pulling in heavy dependencies.
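To see the gap concretely, here's what the stdlib gives you (plain URI, no gem required):

```ruby
require 'uri'

host = URI.parse('https://www.example.co.uk/path').host
host                 # => "www.example.co.uk"
# Naive splitting guesses the TLD wrong -- the effective TLD is "co.uk":
host.split('.').last # => "uk"
```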
We needed something lightweight, accurate, and production-ready for our analytics and web scraping workflows at OpenSite AI. When we couldn't find exactly what we needed, we built it, and now we're open-sourcing it for the community.
What is domain_extractor?
domain_extractor is a Ruby gem engineered to parse URLs and extract domain components with surgical precision. It's built on Ruby's standard URI library and the battle-tested public_suffix gem, giving you reliable parsing for even the trickiest domains.
Key Features
- ✅ Multi-part TLD Support: Accurately handles complex TLDs like co.uk, com.au, and gov.br using the Public Suffix List
- ✅ Nested Subdomain Parsing: Correctly extracts multi-level subdomains (api.staging.example.com)
- ✅ Smart URL Normalization: Handles URLs with or without schemes automatically
- ✅ Query Parameter Extraction: Parses query strings into structured hashes
- ✅ Zero Configuration: Works out of the box with sensible defaults
- ✅ Well-Tested: Comprehensive test suite covering edge cases
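The scheme normalization is simple in concept; here's a hypothetical sketch of the idea (`normalize` is an illustrative helper, not the gem's actual API, and the default scheme is an assumption):

```ruby
# Illustrative helper -- not part of domain_extractor's public API.
def normalize(url)
  url.match?(%r{\A[a-z][a-z0-9+.-]*://}i) ? url : "https://#{url}"
end

normalize('example.com/path')   # => "https://example.com/path"
normalize('http://example.com') # => "http://example.com"
```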
Installation
Add it to your Gemfile:
gem 'domain_extractor'
Or install directly:
gem install domain_extractor
Requirements: Ruby 3.2+ and public_suffix ~> 6.0
Usage Examples
Basic Domain Parsing
require 'domain_extractor'
result = DomainExtractor.parse('https://www.example.co.uk/path?query=value')
result[:subdomain] # => 'www'
result[:domain] # => 'example'
result[:tld] # => 'co.uk'
result[:root_domain] # => 'example.co.uk'
result[:host] # => 'www.example.co.uk'
result[:path] # => '/path'
Handling Complex TLDs
This is where domain_extractor really shines:
# UK domain
DomainExtractor.parse('shop.bbc.co.uk')
# => { subdomain: 'shop', domain: 'bbc', tld: 'co.uk', root_domain: 'bbc.co.uk' }
# Australian domain
DomainExtractor.parse('api.example.com.au')
# => { subdomain: 'api', domain: 'example', tld: 'com.au', root_domain: 'example.com.au' }
# Brazilian government domain (gov.br is itself a public suffix)
DomainExtractor.parse('portal.gov.br')
# => { subdomain: nil, domain: 'portal', tld: 'gov.br', root_domain: 'portal.gov.br' }
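Under the hood, multi-part TLD handling boils down to longest-suffix matching against the Public Suffix List. Here's a simplified stdlib-only sketch of the idea with a tiny hard-coded suffix set (the real PSL also has wildcard and exception rules, which the public_suffix gem handles for you):

```ruby
# Tiny stand-in for the Public Suffix List.
SUFFIXES = %w[com uk co.uk au com.au br gov.br].freeze

def split_host(host)
  labels = host.split('.')
  # Scan left to right so the longest matching suffix wins.
  (1...labels.length).each do |i|
    tld = labels[i..].join('.')
    next unless SUFFIXES.include?(tld)
    domain = labels[i - 1]
    subdomain = i > 1 ? labels[0..i - 2].join('.') : nil
    return { subdomain: subdomain, domain: domain, tld: tld,
             root_domain: "#{domain}.#{tld}" }
  end
  nil
end

split_host('shop.bbc.co.uk')
# => { subdomain: "shop", domain: "bbc", tld: "co.uk", root_domain: "bbc.co.uk" }
```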
Nested Subdomains
DomainExtractor.parse('api.staging.prod.example.com')
# => {
# subdomain: 'api.staging.prod',
# domain: 'example',
# tld: 'com',
# root_domain: 'example.com'
# }
Query Parameter Parsing
url = 'https://example.com?utm_source=google&utm_medium=cpc&page=1'
params = DomainExtractor.parse(url).query_params
# => { 'utm_source' => 'google', 'utm_medium' => 'cpc', 'page' => '1' }
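For comparison, the equivalent with the standard library takes an extra decode-and-convert step:

```ruby
require 'uri'

uri = URI.parse('https://example.com?utm_source=google&utm_medium=cpc&page=1')
# decode_www_form returns [key, value] pairs; to_h makes the hash.
params = URI.decode_www_form(uri.query).to_h
# => { "utm_source" => "google", "utm_medium" => "cpc", "page" => "1" }
```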
Real-World Use Cases
1. Web Scraping β Extract Root Domains
scraped_links = page.css('a').map { |a| a['href'] }
unique_domains = scraped_links
.map { |url| DomainExtractor.parse(url)&.dig(:root_domain) }
.compact
.uniq
# Result: ['example.com', 'github.com', 'stackoverflow.com']
2. Analytics β Categorize Traffic by Domain
referrer = request.referrer
parsed = DomainExtractor.parse(referrer)
if parsed
Analytics.track_event('page_view', {
source_domain: parsed[:root_domain],
source_subdomain: parsed[:subdomain]
})
end
3. Domain Validation β Check Internal Links
def internal_link?(url, base_domain)
parsed = DomainExtractor.parse(url)
parsed && parsed[:root_domain] == base_domain
end
internal_link?('https://blog.example.com/post', 'example.com') # => true
internal_link?('https://external.com/page', 'example.com') # => false
4. SEO Audits β Extract & Analyze Backlink Domains
backlinks = fetch_backlinks_from_tool()
domain_distribution = backlinks
.map { |link| DomainExtractor.parse(link)&.dig(:root_domain) }
.compact
.tally
.sort_by { |_, count| -count }
# Result: { 'example.com' => 45, 'github.com' => 23, ... }
Performance
domain_extractor is optimized for speed:
- Single URL parsing: ~0.0001s per URL
- Batch processing: ~0.01s for 100 URLs
- Memory efficient: Minimal object allocation
- Thread-safe: Can be used in concurrent environments
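Numbers like these vary by hardware, so it's worth measuring locally. Here's a sketch using Ruby's Benchmark stdlib, with URI.parse as a stand-in so it runs without the gem (swap in DomainExtractor.parse once it's installed):

```ruby
require 'benchmark'
require 'uri'

urls = Array.new(100) { |i| "https://sub#{i}.example.co.uk/path?page=#{i}" }

# Wall-clock time for a batch of 100 parses.
elapsed = Benchmark.realtime do
  urls.each { |u| URI.parse(u) }
end

puts format('%d URLs in %.4fs (%.6fs per URL)', urls.size, elapsed, elapsed / urls.size)
```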
Comparison with Alternatives
| Feature | domain_extractor | Addressable | URI (stdlib) |
|---|---|---|---|
| Multi-part TLD support | ✅ | ❌ | ❌ |
| Subdomain extraction | ✅ | ❌ | ❌ |
| Domain component separation | ✅ | ❌ | ❌ |
| Query parameter parsing | ✅ | ✅ | ❌ |
| URL normalization | ✅ | ✅ | ❌ |
| PublicSuffix validation | ✅ | ❌ | ❌ |
| Lightweight | ✅ | ❌ | ✅ |
What's Next
We're actively maintaining domain_extractor and have several features on our roadmap:
- Domain validation: Check whether domains are valid/registered
- Punycode support: Better handling of internationalized domains
- Performance improvements: Even faster parsing for high-volume use cases
- CLI tool: A command-line interface for quick domain analysis
Contributing
domain_extractor is open source under the MIT license. We welcome contributions!
- 🐛 Report bugs: https://github.com/opensite-ai/domain_extractor/issues
- 💡 Feature requests: Open an issue with your idea
- 🔧 Pull requests: Fork, branch, commit, and submit!
About OpenSite AI
At OpenSite AI, we're committed to building practical tools that solve real-world problems. domain_extractor is our first open-source gem, with more to come. Follow our journey:
- 🌐 Website: https://opensite.ai/developers
- 💻 GitHub: https://github.com/opensite-ai
Get Started
Install domain_extractor today and simplify your URL parsing:
gem install domain_extractor
Or add to your Gemfile:
gem 'domain_extractor'
Happy parsing! 🎉