TL;DR
OpenSite AI is excited to announce the release of domain_extractor, a lightweight Ruby gem that delivers precise URL parsing, domain extraction, and multi-part TLD support. Perfect for web scraping, analytics, and any workflow requiring accurate domain handling.
💎 Install it: gem install domain_extractor
📦 RubyGems: https://rubygems.org/gems/domain_extractor
💻 GitHub: https://github.com/opensite-ai/domain_extractor
The Problem
If you've ever worked with URLs in Ruby, you know the pain: extracting clean domain components from messy URLs isn't as straightforward as it should be. Standard libraries like URI don't handle multi-part top-level domains (TLDs) like .co.uk or .com.au. You end up with brittle regex solutions or pulling in heavy dependencies.
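To see the gap concretely, here's what the stdlib gives you (plain URI, no gem required):

```ruby
require 'uri'

host = URI.parse('https://www.example.co.uk/path').host
host                 # => "www.example.co.uk"
# Naive splitting guesses the TLD wrong -- the effective TLD is "co.uk":
host.split('.').last # => "uk"
```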
We needed something lightweight, accurate, and production-ready for our analytics and web scraping workflows at OpenSite AI. When we couldn't find exactly what we needed, we built it, and now we're open-sourcing it for the community.
What is domain_extractor?
domain_extractor is a Ruby gem engineered to parse URLs and extract domain components with surgical precision. It's built on Ruby's standard URI library and the battle-tested public_suffix gem, giving you reliable parsing for even the trickiest domains.
Key Features
- ✅ Multi-part TLD Support: Accurately handles complex TLDs like co.uk, com.au, and gov.br using the Public Suffix List
- ✅ Nested Subdomain Parsing: Correctly extracts multi-level subdomains (api.staging.example.com)
- ✅ Smart URL Normalization: Handles URLs with or without schemes automatically
- ✅ Query Parameter Extraction: Parses query strings into structured hashes
- ✅ Zero Configuration: Works out of the box with sensible defaults
- ✅ Well-Tested: Comprehensive test suite covering edge cases
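The scheme normalization is simple in concept; here's a hypothetical sketch of the idea (`normalize` is an illustrative helper, not the gem's actual API, and the default scheme is an assumption):

```ruby
# Illustrative helper -- not part of domain_extractor's public API.
def normalize(url)
  url.match?(%r{\A[a-z][a-z0-9+.-]*://}i) ? url : "https://#{url}"
end

normalize('example.com/path')   # => "https://example.com/path"
normalize('http://example.com') # => "http://example.com"
```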
Installation
Add it to your Gemfile:
gem 'domain_extractor'
Or install directly:
gem install domain_extractor
Requirements: Ruby 3.2+ and public_suffix ~> 6.0
Usage Examples
Basic Domain Parsing
require 'domain_extractor'
result = DomainExtractor.parse('https://www.example.co.uk/path?query=value')
result[:subdomain] # => 'www'
result[:domain] # => 'example'
result[:tld] # => 'co.uk'
result[:root_domain] # => 'example.co.uk'
result[:host] # => 'www.example.co.uk'
result[:path] # => '/path'
Handling Complex TLDs
This is where domain_extractor really shines:
# UK domain
DomainExtractor.parse('shop.bbc.co.uk')
# => { subdomain: 'shop', domain: 'bbc', tld: 'co.uk', root_domain: 'bbc.co.uk' }
# Australian domain
DomainExtractor.parse('api.example.com.au')
# => { subdomain: 'api', domain: 'example', tld: 'com.au', root_domain: 'example.com.au' }
# Brazilian government domain (gov.br is itself a public suffix)
DomainExtractor.parse('portal.gov.br')
# => { subdomain: nil, domain: 'portal', tld: 'gov.br', root_domain: 'portal.gov.br' }
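Under the hood, multi-part TLD handling boils down to longest-suffix matching against the Public Suffix List. Here's a simplified stdlib-only sketch of the idea with a tiny hard-coded suffix set (the real PSL also has wildcard and exception rules, which the public_suffix gem handles for you):

```ruby
# Tiny stand-in for the Public Suffix List.
SUFFIXES = %w[com uk co.uk au com.au br gov.br].freeze

def split_host(host)
  labels = host.split('.')
  # Scan left to right so the longest matching suffix wins.
  (1...labels.length).each do |i|
    tld = labels[i..].join('.')
    next unless SUFFIXES.include?(tld)
    domain = labels[i - 1]
    subdomain = i > 1 ? labels[0..i - 2].join('.') : nil
    return { subdomain: subdomain, domain: domain, tld: tld,
             root_domain: "#{domain}.#{tld}" }
  end
  nil
end

split_host('shop.bbc.co.uk')
# => { subdomain: "shop", domain: "bbc", tld: "co.uk", root_domain: "bbc.co.uk" }
```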
Nested Subdomains
DomainExtractor.parse('api.staging.prod.example.com')
# => {
# subdomain: 'api.staging.prod',
# domain: 'example',
# tld: 'com',
# root_domain: 'example.com'
# }
Query Parameter Parsing
url = 'https://example.com?utm_source=google&utm_medium=cpc&page=1'
params = DomainExtractor.parse(url).query_params
# => { 'utm_source' => 'google', 'utm_medium' => 'cpc', 'page' => '1' }
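For comparison, the equivalent with the standard library takes an extra decode-and-convert step:

```ruby
require 'uri'

uri = URI.parse('https://example.com?utm_source=google&utm_medium=cpc&page=1')
# decode_www_form returns [key, value] pairs; to_h makes the hash.
params = URI.decode_www_form(uri.query).to_h
# => { "utm_source" => "google", "utm_medium" => "cpc", "page" => "1" }
```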
Real-World Use Cases
1. Web Scraping β Extract Root Domains
scraped_links = page.css('a').map { |a| a['href'] }
unique_domains = scraped_links
.map { |url| DomainExtractor.parse(url)&.dig(:root_domain) }
.compact
.uniq
# Result: ['example.com', 'github.com', 'stackoverflow.com']
2. Analytics β Categorize Traffic by Domain
referrer = request.referrer
parsed = DomainExtractor.parse(referrer)
if parsed
Analytics.track_event('page_view', {
source_domain: parsed[:root_domain],
source_subdomain: parsed[:subdomain]
})
end
3. Domain Validation β Check Internal Links
def internal_link?(url, base_domain)
parsed = DomainExtractor.parse(url)
parsed && parsed[:root_domain] == base_domain
end
internal_link?('https://blog.example.com/post', 'example.com') # => true
internal_link?('https://external.com/page', 'example.com') # => false
4. SEO Audits β Extract & Analyze Backlink Domains
backlinks = fetch_backlinks_from_tool()
domain_distribution = backlinks
.map { |link| DomainExtractor.parse(link)&.dig(:root_domain) }
.compact
.tally
.sort_by { |_, count| -count }
# Result: { 'example.com' => 45, 'github.com' => 23, ... }
Performance
domain_extractor is optimized for speed:
- Single URL parsing: ~0.0001s per URL
- Batch processing: ~0.01s for 100 URLs
- Memory efficient: Minimal object allocation
- Thread-safe: Can be used in concurrent environments
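Numbers like these vary by hardware, so it's worth measuring locally. Here's a sketch using Ruby's Benchmark stdlib, with URI.parse as a stand-in so it runs without the gem (swap in DomainExtractor.parse once it's installed):

```ruby
require 'benchmark'
require 'uri'

urls = Array.new(100) { |i| "https://sub#{i}.example.co.uk/path?page=#{i}" }

# Wall-clock time for a batch of 100 parses.
elapsed = Benchmark.realtime do
  urls.each { |u| URI.parse(u) }
end

puts format('%d URLs in %.4fs (%.6fs per URL)', urls.size, elapsed, elapsed / urls.size)
```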
Comparison with Alternatives
| Feature | domain_extractor | Addressable | URI (stdlib) |
|---|---|---|---|
| Multi-part TLD support | ✅ | ❌ | ❌ |
| Subdomain extraction | ✅ | ❌ | ❌ |
| Domain component separation | ✅ | ❌ | ❌ |
| Query parameter parsing | ✅ | ✅ | ❌ |
| URL normalization | ✅ | ✅ | ❌ |
| PublicSuffix validation | ✅ | ❌ | ❌ |
| Lightweight | ✅ | ❌ | ✅ |
What's Next
We're actively maintaining domain_extractor and have several features on our roadmap:
- Domain validation: Check whether domains are valid/registered
- Punycode support: Better handling of internationalized domains
- Performance improvements: Even faster parsing for high-volume use cases
- CLI tool: A command-line interface for quick domain analysis
Contributing
domain_extractor is open source under the MIT license. We welcome contributions!
- 🐛 Report bugs: https://github.com/opensite-ai/domain_extractor/issues
- 💡 Feature requests: Open an issue with your idea
- 🔧 Pull requests: Fork, branch, commit, and submit!
About OpenSite AI
At OpenSite AI, we're committed to building practical tools that solve real-world problems. domain_extractor is our first open-source gem, with more to come. Follow our journey:
- 🌐 Website: https://opensite.ai/developers
- 💻 GitHub: https://github.com/opensite-ai
Get Started
Install domain_extractor today and simplify your URL parsing:
gem install domain_extractor
Or add to your Gemfile:
gem 'domain_extractor'
Happy parsing! 🎉