Primary Keys in Redshift
Primary keys (PKs) in Redshift have an unusual implementation, or at least one that will seem unusual to those familiar with traditional transactional, row-oriented databases.
But AWS Redshift is columnar, optimized for data warehousing, for high-volume data loads, and for the fast scan workloads required when aggregating granular data.
Two other defining aspects of Redshift: it is massively parallel processing (MPP), and it uses a distributed, shared-nothing architecture, meaning each node manages its own memory (in contrast with a shared-memory design), although the nodes are aware of each other and of the leader node.
The overhead of maintaining fully enforced primary keys would be burdensome for this architecture. But primary keys do indeed exist; it's just that they don't enforce uniqueness. They are intended for reference, as a flag to say, "Data in this/these column(s) should be unique". Note the use of the word "should". Here's where it gets tricky: you can go ahead and put duplicate data into PK columns, thus violating the principle of a PK, and Redshift is fine with that.
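You can see this for yourself with a minimal sketch (the table and column names here are hypothetical): declare a primary key, insert two rows with the same key, and Redshift accepts both without complaint.

```sql
-- Hypothetical table: the PRIMARY KEY is declared but not enforced by Redshift.
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer VARCHAR(50),
    amount   DECIMAL(10,2)
);

-- Both inserts succeed, even though they share the same order_id.
INSERT INTO orders VALUES (1, 'Acme', 100.00);
INSERT INTO orders VALUES (1, 'Acme', 100.00);

-- The duplicate now sits in the "unique" column.
SELECT order_id, COUNT(*)
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```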
It is understood (by Redshift, if not by you) that the data you are loading has already been “cleaned”, or you have an ETL (or ELT) process that includes a de-duplication step.
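If you do need that de-duplication step, one common pattern is to keep a single row per key using a window function. This is only a sketch; the staging table, the target table, and the load_ts column used to pick a "winner" are all hypothetical.

```sql
-- Keep one row per order_id from a hypothetical staging table,
-- preferring the most recently loaded row (load_ts is assumed to exist).
CREATE TABLE orders_deduped AS
SELECT order_id, customer, amount
FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_ts DESC) AS rn
    FROM orders_staging o
) t
WHERE rn = 1;
```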
So what's the utility of this implementation, beyond reduced overhead? The PKs can be read by ERD tools like Erwin, ER/Studio, and PowerDesigner and incorporated into a reverse-engineered database diagram. These entity relationship diagrams (ERDs) are a map of the relationships and join columns in the database. The query planner also uses the declared constraints as hints about the data, which is where the next part comes in.
I first became aware of Redshift's PK implementation when SELECT COUNT(DISTINCT) didn't return the expected number of rows; Redshift assumed all values in the PK column were unique and returned a theoretical rowcount based on that assumption. I was expecting a different number. Upon reading the docs, it started to become clear.
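A quick sanity check (again with hypothetical table and column names) is to compare the raw row count against the distinct count of the supposed key. If the two numbers differ, the declared PK column contains duplicates and any planner assumptions about uniqueness are off.

```sql
-- If total_rows and distinct_keys differ, the "primary key" column holds duplicates.
SELECT COUNT(*)                 AS total_rows,
       COUNT(DISTINCT order_id) AS distinct_keys
FROM orders;
```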
The Docs:
https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html