DEV Community

Michael Kohl
Michael Kohl

Posted on • Updated on • Originally published at

Exploring foreign data wrappers in Postgres

For many years now Postgres has been my database of choice, but I still regularly find new and interesting features that I wasn't yet aware of. The latest entry in this long list are "foreign data wrappers", an implementation of the SQL/MED ("SQL Management of External Data") specification which was added in Postgres 9.1.

This mechanism allows for integrating our database with data stored outside of it. The most common use case for this is probably accessing a remote RDBMS (Postgres or others), but in this example we will use the file remote data wrapper to access a CSV file from disk.

Integrating external data

We will use the basic world cities dataset from Simplemaps in this example, so go and download this file now if you want to follow along.

To be able to define a "foreign table", we first need to install the file_fdw extension and define a server that uses it.

Enter fullscreen mode Exit fullscreen mode

Once we have our server in place, we can define a foreign table to access data from it:

  city text,
  city_ascii text,
  lat real,
  long real,
  population real,
  country text,
  iso2 text,
  iso3 text,
  province text
) SERVER cities
  filename '/tmp/simplemaps-worldcities-basic.csv', 
  format 'csv', 
  header 'true'
Enter fullscreen mode Exit fullscreen mode

The above creates the schema of our table and also specifies the location of our data file, the file format (file_fdw doesn't only support CSV) and instructs Postgres to ignore the first row of the document since it's a header.

We can now query this CSV file just like any regular old table:

test=# select city, population::int from cities order by population desc limit 3;
    city     | population
 Tokyo       |   22006300
 Mumbai      |   15834918
 Mexico City |   14919501
(3 rows)
Enter fullscreen mode Exit fullscreen mode

I don't know about you, but I find the ability to access CSV data without having to first write an importer pretty exciting and performance is rather nice too:

test=# explain analyze select avg(population) from cities;
                                                     QUERY PLAN
 Aggregate  (cost=1896.14..1896.15 rows=1 width=8) (actual time=12.437..12.437 rows=1 loops=1)
   ->  Foreign Scan on cities  (cost=0.00..1851.60 rows=17816 width=4) (actual time=0.064..10.648 rows=7322 loops=1)
         Foreign File: /tmp/simplemaps-worldcities-basic.csv
         Foreign File Size: 570124
 Planning time: 0.605 ms
 Execution time: 13.517 ms
(6 rows)
Enter fullscreen mode Exit fullscreen mode

Enter Sequel

Now it's time to access our foreign table from Ruby. For this purpose we'll be using Jeremy Evans' excellent Sequel gem. It comes with its own console, so the command sequel postgres://localhost/test will drop us into an IRB session with the gem already required and the Sequel::Database object stored in the constant DB:

→ sequel postgres://localhost/test
Your database is stored in DB...
>> DB
#=> #<Sequel::Postgres::Database: "postgres://localhost/test" {:test=>true, :loggers=>[]}>
Enter fullscreen mode Exit fullscreen mode

As a first step, we assign our cities table to a variable. This returns a Sequel::Dataset which represents the query that will be executed when we want to retrieve the data.

cities = DB[:cities]
#=> #<Sequel::Postgres::Dataset: "SELECT * FROM \"cities\"">
Enter fullscreen mode Exit fullscreen mode

Now we can define a dataset called top3, which very closely resembles the SQL query shown above.

top3 =, :population).reverse_order(:population).limit(3)
#=> #<Sequel::Postgres::Dataset: "SELECT \"city\", \"population\" FROM \"cities\" ORDER BY \"population\" DESC LIMIT 3">
Enter fullscreen mode Exit fullscreen mode

NB: if we had enabled Sequel's core extensions the order clause could have been written as order(:population.desc), but whether or not one wants to use this style is left to personal preference.

All Sequel::Datasets include the Enumerable module, so we can use its #to_a method to materialize our dataset:

#=> [{:city=>"Tokyo", :population=>22006300.0}, {:city=>"Mumbai", :population=>15834900.0}, {:city=>"Mexico City", :population=>14919500.0}]
Enter fullscreen mode Exit fullscreen mode

Ruby and Sequel are blissfully ignorant of the fact that they are dealing with a foreign table and we're executing the same query we issued from the Postgres console.

But I want objects!

While the above query worked, it returned an array of plain hashes. If you prefer objects it's better to use Sequel::Model instead. To do this, we need to define a class like so:

class City < Sequel::Model ; end
Enter fullscreen mode Exit fullscreen mode

This exposes the underlying dataset via a class method:

=> #<Sequel::Postgres::Dataset: "SELECT * FROM \"cities\"">
Enter fullscreen mode Exit fullscreen mode

However, it's not considered good style to directly chain methods onto that. Instead the documentation recommends using a single method per tasks. If you're wondering about dataset_module here, it's a convenience method that let's us create named dataset methods that return cached datasets:

class City < Sequel::Model
  dataset_module do
    def top3
      select(:city, :population).
Enter fullscreen mode Exit fullscreen mode

Now let's use our newly defined class:

top3 = City.top3.to_a
#=> [#<City @values={:city=>"Tokyo", :population=>22006300.0}>, #<City @values={:city=>"Mumbai", :population=>15834900.0}>, #<City @values={:city=>"Mexico City", :population=>14919500.0}>]
#=> "Tokyo"
Enter fullscreen mode Exit fullscreen mode

Same result as before, though with object access instead of plain hashes.

Wrap up

Postgres' foreign data wrappers are an interesting concept and I'm definitely going to spend some more time exploring them, especially the more exotic ones like LDAP, IMAP and RSS provided by the Multicorn project.

Sequel is a wonderful gem and a good reminder that ActiveRecord is far from being the only game in town when it comes to database access and ORM in Ruby.

Discussion (2)

dmfay profile image
Dian Fay

For my money foreign data wrappers are the single most important feature Postgres has over its competition -- everything else is great, but being able to federate almost any other data source is revolutionary. And nobody else really has an answer; there's SQL Server Integration Services and Oracle can access external data sources through ODBC but those are much more limited in scope.

citizen428 profile image
Michael Kohl Author

Thanks for sharing your article Dian, interesting read and use case!