DEV Community


Posted on

Some of the Reasons Why I Prefer PostgreSQL Over MS SQL Server

There are plenty of articles all over the internet on various PostgreSQL vs MS SQL Server topics, everybody can look for themselves.

Here is one of my favorites Top 10 Reasons I like Postgres Over SQL Server that favors PostgreSQL heavily.

However, I'd like to focus on an aspect often overlooked and one that matters to me as a software developer - the programmability aspect.

Here are some random reasons why I prefer PostgreSQL in no particular order from the programmability aspect.

1. Array types

Arrays are the core part of programming.

Deal with it.

And yet, somehow, MS SQL Server does not have a built-in array type support nor does it support the array concept at all in any shape or form.

I imagine that reasoning behind this decision is somewhat pragmatic - why would you want to have an array type - when you can have foreign keys and foreign tables. It is a relational database after all - and for complicated processing - you have the table variables that can act as arrays for you.

That may be fine - but array types (and related functions) are incredibly useful in everyday database programming. I'll give you a couple of examples...

Imagine that we have a table or a query with the following results:

| id  | value |
|  1  |   A   |
|  1  |   B   |
|  1  |   C   |
|  2  |   B   |
|  2  |   C   |
|  2  |   D   |
Enter fullscreen mode Exit fullscreen mode

And your application requires you to return the unique id numbers, and with it - all of the values for each unique id.

If we return the result as is - first of all, we don't have unique values, and secondly - there is redundant data ( we need each number once, not three times) sent over the wire.

It doesn't look much more redundancy, but in real life, we could be talking thousands of records with dozens of redundant fields and that burdens the network significantly.

With PostgreSQL, we can use the array aggregate function and write something like this:

select id, array_agg(value) as array from example group by id
Enter fullscreen mode Exit fullscreen mode

This will transform the second field into an array of the same type as the value field (in this case array of strings, nut it could be anything):

| id  | array   |
|  2  | {B,C,D} |
|  1  | {A,B,C} |
Enter fullscreen mode Exit fullscreen mode

Database clients that will receive these results will output a second column as a normal array or a list - regardless of what language for your client you are using.

Now, since MS SQL Server does not support array types at all - there is no easy way to achieve this same operation with the same results. Not without hacks anyway.

Usually, developers don't even bother and they usually return raw, ungrouped data to a client and perform grouping operations there.

First of all, you're sending much more data down the wire than you need to, and what is even worse, your query is split into two parts in two different languages. And there's quite an unnecessary processing work to be done on the client too. Your database has performed already one scan over the result when you asked to select the data, it might as well build an aggregate while it is at it. That's just efficiency.

To be fair there is a way to achieve this with SQL Server.

Prior to MS SQL Server 2017, you would have to use the following query horror (which uses an undocumented function of XML path and it will break on special XML characters):

    id, array = stuff(
            select ',' + sub.value 
            from example sub where = 
            for xml path(''), type
        ).value('.', 'varchar(max)'), 1, 1, ''
from example e
group by id;
Enter fullscreen mode Exit fullscreen mode

With versions after the MS SQL Server 2017 this query is much simpler and cleaner:

select id, string_agg(value, ',') as array
from example
group by id;
Enter fullscreen mode Exit fullscreen mode

Both of them will return the same result:

| id  | array |
|  2  | B,C,D |
|  1  | A,B,C |
Enter fullscreen mode Exit fullscreen mode

And both of them are hacks (the first one being unspeakable horror that is hard to even to look at).

Both of them don't return an array at all, since array types don't even exist - they return comma-separated strings.

Naturally, there are a lot of problems with this:

  • It only works with string types: varchar or nvarchar to be exact. For any other data type, you would have to add manually explicit cast or convert them to your varchar (or a nvarchar).

  • You need to compensate for the value separator. Meaning, you need to somehow escape comma character in your values if it is a comma that you use for separation.

  • You still need to convert to an actual array on the client.

And that's a lot of work. I much rather prefer the clean and safe PostgreSQL approach.

Ok, fine, now how about arrays in the opposite direction, for example, you want to pass an array argument to your function or procedure?

With PostgreSQL, you simply declare your parameter as an array, like for example int[] or text[] and that is it. You just send your parameter as a normal array as you would expect.

With MS SQL Server it is a different story:

Prior to the 2008 version, you would still have to use CSV strings as a parameter and deal with all of the related issues. After the 2008 version, you can create a special, custom table type:

create type dbo.IntTable as table (Value int);
Enter fullscreen mode Exit fullscreen mode

And use parameters of that newly created type as:

-- param
@MyArray as dbo.IntTable
Enter fullscreen mode Exit fullscreen mode

And on the client-side in csharp something like this:

var values = new DataTable();
values.Columns.Add(new DataColumn("Value", typeof(int)));
var p = cmd.Parameters.AddWithValue("@MyArray ", values);
p.SqlDbType = SqlDbType.Structured;
p.TypeName = "dbo.IntTable";
Enter fullscreen mode Exit fullscreen mode

I suppose that we would have to create a custom type for each of the types we are planning to use...

What's wrong with the arrays?

2. Finding rows from a group with DISTINCT ON

Imagine that we have a table or a query with the following results:

|   id  | value |
|   1   |   A   |
|   2   |   A   |
|   3   |   A   |
|   4   |   B   |
|   5   |   B   |
|   6   |   C   |
Enter fullscreen mode Exit fullscreen mode

Now, normally we can group by value:

select value from example group by value
Enter fullscreen mode Exit fullscreen mode

And it will give us all unique values A, B, and C as expected.

But, what if we wanted to include the id field also in this query?

Well, that would be impossible, since each group contains multiple id numbers.

But, what if we know exactly which id number we need?

Let's say it is always the highest one for each group. That should be possible because there is always exactly one id for each value group with the highest value.

In PostgreSQL that query would look like this:

select distinct on (value) id, value 
from example 
order by value, id desc;
Enter fullscreen mode Exit fullscreen mode

This will return our expected result. Each unique group with and also the highest id number for each group:

|   id  | value |
|   3   |   A   |
|   5   |   B   |
|   6   |   C   |
Enter fullscreen mode Exit fullscreen mode

How do we achieve the same result with MS SQL Server? Well, it's a bit complicated as you would imagine. This is the simplest version:

select id, value
from (
        row_number() over(partition by value order by id desc) rn
    from example
) sub
where sub.rn = 1;
Enter fullscreen mode Exit fullscreen mode

MS SQL Server doesn't support DISTINCT ON, which is a PostgreSQL extension to the SQL language. Instead, we must improvise.

We start first by adding a window function to our example result that will add another calculated column to the results that contain a specific row number inside each value group starting with the highest id.

Then we wrap it up into subquery and filter only where row number from a window function is 1 and that should match records with highest id and we simply discard that row number on our output.

The result is the same:

|   id  | value |
|   3   |   A   |
|   5   |   B   |
|   6   |   C   |
Enter fullscreen mode Exit fullscreen mode

Now, it may not look like much of an improvement in this very simple example, but when you have many records and you already have a couple of subqueries it adds up to the complexity.

And the number one complaint I hear about SQL is that queries tend to get large with too many subselects and it's hard to understand.

3. MERGE - record merging operation

Merge operation or something also called UPSERT (insert with update) is a fairly common and standardized database operation. It will create a new record if a record with the same key doesn't exist or update a record with the same key if it does.

Both systems have implementations and they are quite different.

In our previous example, assuming we have an example table with field id as unique key and textual value field, PostgreSQL implementation would look like this:

insert into example 
values (1, 'X')
on conflict (id) 
do update set value = 'X';
Enter fullscreen mode Exit fullscreen mode

MS SQL Server implementation is very much different:

merge example as t
using (values (1, 'X')) as v (id, value) 
on =
when not matched then insert (id, value) values (, v.value)
when matched then update set t.value = v.value;
Enter fullscreen mode Exit fullscreen mode

I find that PostgreSQL implementation is shorter and much more cleaner, readable, and understandable.

But that might be a matter of taste, some may prefer merge syntax, especially because it is much closer to the SQL standard.

That is not the real problem here.

Real problem is that it seems that MS SQL implementation with MERGE syntax suffers from race condition problems.

So that means that statement is dangerous and it may produce a primary key violation.

Please refer to the following articles about this problem:

As Micheal J Stewart notes, if you doing this, you have to take care of doing UPSERT correctly under high concurrency.

Basically, this means, that MERGE is a sitting bomb in your system, it occurs in a highly concurrent environment, but that also means it may or may not occur in a low concurrent environment.

So you either need a different implementation (with a couple of statements instead of just one) and with very special and somewhat cryptic lock hints (but then you must be careful to avoid deadlocks) - or - you can wrap it up in an exception handler like this:

begin try
    insert into example values (1, 'X')
end try
begin catch
    if error_number() in (2601, 2627)
    update example set value = 'X' where id = 1
end catch 
Enter fullscreen mode Exit fullscreen mode

This appears to be the safest way.

It is ugly, it is not the SQL declarative way and you also have to include very special error codes to avoid duplicate key errors.

Now, some may rush to judgment that this might be due to the very different underlying architecture of both systems (pessimistic concurrency in MSQQL and optimistic concurrency with MVCC in PostgreSQL - for the real nerds), but I don't think it really is.

Concurrent transactions in both systems have their own version of records and one may finish before the other and cause a duplicate key error.

In other words, if you implement something like this in PostgreSQL: check does record exists, insert if not, update if exists - you may end up with race condition and duplicate key error all the same.

And if you look at the PostgreSQL on conflict implementation carefully you may notice that it basically does the error handling, same as if it was wrapped up in exception handler in the example above.

But only in a much lower and deeper level with much more convenient syntax. And it is also much faster since exception handling has a performance price of its own.

Now, this is allegedly, pretty much confirmed by the PostgreSQL developer on Hackernews discussion: MERGE simply doesn't provide the guarantees that most users want -- not in theory and not in practice.

It's a shame that it doesn't come up with a warning for the developers. I mean, if they can mark all methods as thread-unsafe in the documentation, why they can't do it with the SQL statements? Like concurrency-unsafe perhaps.

Anyway, the PostgreSQL approach is much better, safer, and cleaner in my opinion. Although not entirely according to SQL standard.

Let's wrap it up.

There couple of other honorable mentions. For example:

UTF-8 support

PostgreSQL has always supported UTF-8 encoding, and so it doesn't have or need nvarchar type, it only has varchar and you don't have to deal with those conversions. At all.

MS SQL Server supports UTF-8 just only from the recent version 2019.

It would not be fair to write something that is supported, but the fact that is only supported only from recent versions means it is not available for the large majority of developers.

JSON support

PostgreSQL supports JSON and JSONB (binary JSON) built-in types for a long time and it has excellent JSON support.

MS SQL Server does not. Instead, JSON data is just plain text. However, it uses various built-in functions and keywords to let you work with JSON. You can also build indexes over JSON values by using calculated values.

In my opinion, PostgreSQL JSON support is much more advanced. PostgreSQL has certainly supported JSON for a much longer time and it is natural that is better.

However, this is a very large topic and a side-by-side comparison would take much more than one article to do properly.

Meanwhile, you can always check out the full documentation for both systems and see it for yourself:

Top comments (1)

slavius profile image
Slavius • Edited

Ah, for xml path('') my favorite undocumented CSV export feature of MSSQL. Such nostalgy...

On the other hand to do PIVOT - quite significant reporting function in Postgresql you need to install an tablefunc.crosstab() extension which is not always possible, especially in DBaaS scenarios (true story of mine) while in MSSQL PIVOT comes by default.

Also, in PostgreSQL using local variables is not possible without running in pgsql mode or wrapping your code into function and then calling it. Good luck if you have only SELECT permissions. Similarly temp tables and table variables in MSSQL are much more easy to use within a regular SQL command.
It's no problem to run this raw SQL query on MSSQL but not possible to do for PostgreSQL e.g. from EntityFramework:

DECLARE @table1 ( integer Id, nvarchar(16) first, nvarchar(16) last);

SELECT id, first, last
FROM users
WHERE last = 'Smith';

LEFT JOIN @table1 t1 ON users.supervisor_id = t1.Id;
Enter fullscreen mode Exit fullscreen mode