Welcome to the R.A.G., a guide about Amazon's Redshift database written for the Analysts out there in the world who use it.
"The inner table is scanned first, and appears nearer the bottom of the query plan." Gold information, thanks. The only place I could find in the docs that even vaguely hints that the lower-down scan is the inner table is here: docs.amazonaws.cn/en_us/redshift/l...
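To make that concrete, here's a hedged sketch of what Redshift's `EXPLAIN` output looks like for a hash join (table names, costs, and row counts are all invented for illustration). The inner table, the one that gets hashed, is the lower of the two scans:

```sql
EXPLAIN
SELECT *
FROM big_table b
JOIN small_table s ON b.id = s.id;

-- Illustrative plan (numbers made up):
-- XN Hash Join DS_DIST_NONE  (cost=0.05..4.10 rows=100 width=8)
--   Hash Cond: ("outer".id = "inner".id)
--   ->  XN Seq Scan on big_table b    (...)      <- outer table
--   ->  XN Hash  (...)
--         ->  XN Seq Scan on small_table s (...) <- inner table, nearer the bottom
```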
In terms of join operations (nested loop, hash, merge), it's not so much that a nested loop is always bad. Sure, the algorithm is quadratic, but finding the optimal query plan is (probably) NP-hard. There's a threshold where creating a plan can take too long, perhaps longer than just running a sub-optimal plan. If the estimated row count of a table is small (statistics are king! run VACUUM and ANALYZE on your tables often!), and/or too much time has already been spent exploring the space of possible plans, it's better for the planner to just get on with it and do the nested loop.
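Since those row-count estimates come from statistics, keeping them fresh is the easy win. A quick sketch (the table name is a placeholder):

```sql
-- Keep planner statistics fresh so row-count estimates are accurate.
VACUUM my_table;   -- reclaim deleted space and restore sort order
ANALYZE my_table;  -- refresh the statistics the planner uses

-- svv_table_info shows how stale things are:
-- stats_off = % by which stats are off, unsorted = % of unsorted rows.
SELECT "table", stats_off, unsorted
FROM svv_table_info
WHERE "table" = 'my_table';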
Also, it's not good to just assume merge joins are always the best. It totally depends; a lot of the time a hash join is just fine. For example, take joining two massive tables with a merge join on Redshift, with colocated slices because the distkeys and sortkeys align perfectly. Now compare that with filtering both tables so heavily that the intermediate results are no longer guaranteed to be sorted. In that case, achieving a merge join implies doing an explicit sort operation, which is one of the most costly operations you can do (apart from a network broadcast, AKA shuffling between nodes).
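The "distkeys and sortkeys align perfectly" case looks something like this (table and column names are made up for illustration). Matching DISTKEY and SORTKEY on the join column is what makes the merge join possible: rows for the same key live on the same slice, already in sorted order, so no redistribution or sort is needed:

```sql
CREATE TABLE orders (
    customer_id BIGINT,
    order_total DECIMAL(10,2)
)
DISTKEY (customer_id)
SORTKEY (customer_id);

CREATE TABLE customers (
    customer_id BIGINT,
    name VARCHAR(100)
)
DISTKEY (customer_id)
SORTKEY (customer_id);

-- With fresh statistics, the planner can pick a merge join here:
SELECT c.name, SUM(o.order_total)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.name;
```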
If you have a star schema data warehouse, a hash join of a small(ish) dimension table to a large fact table is going to perform just fine. Once you join a second dimension, it's going to be on a different column, and so the happy rainbow unicorn land of merge joins goes out the window.
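A sketch of that star-schema situation (all table and column names are hypothetical). The fact table can only be sorted one way, so at most one of these joins can be a merge join; the second dimension join will typically fall back to a hash join, and that's usually fine:

```sql
SELECT d.calendar_date, p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_date d    ON f.date_key    = d.date_key     -- could be a merge join
JOIN dim_product p ON f.product_key = p.product_key  -- different column: hash join
GROUP BY d.calendar_date, p.category;
```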
I'm just saying don't get fixated on merge joins being ideal; it's only sometimes true.
Great points!!!