Jen Chan

Posted on Apr 9, 2018 • Edited on Apr 10, 2018

What I learned from only using Select in PostgreSQL

#beginners #debugging #sql #productivity

I was tasked to gather detailed metrics for a project at work and advised to only use select (as a newbie to Postgres and also querying on production) so I would not break the whole system.

I can't go into details about the cool data I found but feel compelled to write a dummy post as a whole new world was opened up to me 😂
Markup nerds please forgive me—This is written on an iOS keyboard without backticks.

Relational databases are tables with cells that have relationships with cells in other tables

This was actually the hardest thing for me to wrap my head around coming from MongoDB and I kept having to ‘\d’ every table name to find the information I needed. I have an active mind, so without asking my coworkers, I was prone to thinking certain tables nested other tables.

Ben Sinclair gave perhaps a more comprehensible clarification of what I tried to say above:

"The "relational" in a relational database is just talking about the relations inside a table, i.e. the fields which all belong to a table form a relationship. What you're talking about with regards to other tables are foreign keys."

Joining is great for finding granular (edit: NOT granola) trends and commonalities between tables, but sometimes it’s overkill

Join with caution. The more tables I joined the more aliasing I had to manage. Finally if there isn’t really a reliable common key between two tables I wasn't sure in what sense the results gathered from the join were valid. Was it happenstance overlap, or exact correlation? And all I know is the left inner join, not the right inner join, outer joins, etc. Learn one well before you learn two because your results will vary.

Labels can be deceiving

I had this quandary about general statistics like population and household numbers for example. Was it worth finding the usage rate of a feature according to household if you can’t determine what constitutes a household, or not everyone uses it? What if the population count is just from Wikipedia or the census in 2011? A lot of things started to become neurotic inaccuracies. Data would have to be normalized due to differences in population density and terrain, weather, whether it’s a city or a town...etc. I had to assess whether it was worth the time to do the work of normalizing or finding usage per city data.

Don’t try to accomplish too much in one query or you’ll drive yourself crazy.

It is easy to get mired in numbers and tabled presentation. I wanted to use everything new and shiny that I learned in terms of from, where conditions, group by and order by. Sometimes it takes a few queries to drill down to the few pieces of info I really needed to produce useful results and come to concrete conclusions. Because otherwise, I won’t have a presentation.

Build up your query line by line using ‘\e’ in Vim

Or you lose it by pressing up the console or otherwise waste time using arrow keys to go back to your typo... only to accidentally press enter, submit a wrong query and have to rollback.
...or even better, save it as a .sql file and run it in bash!

Export your query as a csv

So you can marvel at your findings and prove you did work lol.
‘\copy（SELECT...FROM...WHERE...) ‘path/to/file.csv’ CSV HEADER;’

You can slice and dice data in 50 million ways but you need to come up with some hard and true answers

Trends over time matter as much as all time. Set limited constraints for your first queries or it will get confusing quick. Start at per week or month with large volumes of data.

Don’t get too obsessed with accuracy unless that really matters.

The point of metrics in my case was less so monitoring but seeing where I could make an improvement in our users experience and reduce pain points of using the internal software. There was a margin of error we’d have to accept but there were significant enough numbers to show we could definitely reduce reply rates here or complaints there. At the end of the day it was about creating an impact in our coworkers’ lives at work.

Every column you enter in select becomes an aggregate and should appear in “group by”

I found this particularly annoying because sometimes the information is redundant, and group by only organizes your results into columns so you don’t end up with a long list of records

You don’t necessarily need to create a view

Unless you’re going to use it all the time. This will just take up space and processing power.

Dian Fay corrected me a bit on my assumption(s) above:

Views are just stored queries that get run when you invoke the view. The space and power consumption is negligible unless you create a materialized view which actually stores its results (and therefore has to be manually refreshed).

GROUP BY is only necessary if you're actually doing aggregate calculations. If you're just pulling individual records without counting, summing, or otherwise deriving a value from multiple rows, you don't even need to specify it. If you just have duplicate rows you want to filter out, you can SELECT DISTINCT instead.

If you’re a fan of PostgreSQL or sql what do you think it’s best at? What are use cases best suited for it? What types of projects have you done implementing a pg database? Let me know!

Top comments (6)

Dian Fay • Apr 9 '18 • Edited

Couple tips :)

Views are just stored queries that get run when you invoke the view. The space and power consumption is negligible unless you create a materialized view which actually stores its results (and therefore has to be manually refreshed).
GROUP BY is only necessary if you're actually doing aggregate calculations. If you're just pulling individual records without counting, summing, or otherwise deriving a value from multiple rows, you don't even need to specify it. If you just have duplicate rows you want to filter out, you can SELECT DISTINCT instead.

More generally: relational databases are the Swiss Army knife of data storage. The real question is which use cases they aren't suited for -- and that's where you'll find some other model being used. RDBMSs are overkill for key:value storage, which is why Redis exists; they can't handle planet-scale flat data, which is why you have HBase, Cassandra, and so on; and they're traditionally terrible with hierarchical data, which is how we got MongoDB (although Postgres' JSON support is extremely powerful & lets you blend relational and document models quite effectively these days).

I've used RDBMSs in general for web and desktop applications across my career, and Postgres in particular mostly for web. I maintain a moderately popular data mapper for Postgres and Node, so I get some framework-level development in for fun too.

Jen Chan • Apr 9 '18

re: Materialized view. Maybe this is what my manager meant.
I could see if you had a huuuggggee database maybe it's not worth the time searching through everything to create a view.
They had said it wasn't worth creating a view unless I were to use it all the time or it would slow everything down.

I suppose I was compelled to use group by as I'm a visual person and seeing all similarities in a table is more satisfying and easy to comprehend than individual records.

Thank you for your comprehensive response by the way. I've not heard v much about Hbase or Cassandra.

Dian Fay • Apr 9 '18 • Edited

Think of views as a way to cut down on copy+paste. There's no point in creating them for one-off queries, but if you have a few:

SELECT c.name, o.id, p.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_products op ON op.order_id = o.id
JOIN products p ON p.id = op.product_id;

SELECT p.id, p.name, COUNT(DISTINCT o.country) AS countries_shipped_to
FROM products p
JOIN order_products op ON op.product_id = p.id
JOIN orders o ON o.id = op.order_id
GROUP BY p.id, p.name
ORDER BY countries_shipped_to DESC;

Both of these queries use the same orders-order_products-products information. They do it in a different order since the first is starting from customers and the second from products itself, but it's the same data. Instead of joining those three tables, you could create a view with the query:

SELECT o.*, p.*
FROM orders o
JOIN order_products op ON op.order_id = o.id
JOIN products p ON p.id = op.product_id;

Then you can use the view in your queries and suddenly they're a good bit simpler -- the complexity of the order-to-product relationship is encapsulated in the view, so your queries don't have to manage it themselves.

Ben Sinclair • Apr 9 '18 • Edited

The "relational" in a relational database is just talking about the relations inside a table, i.e. the fields which all belong to a table form a relationship. What you're talking about with regards to other tables are foreign keys.

I think that's something which is quite confusing to people coming to them for the first time!

The words used to describe Mongo always confuse me. I have to figure out what a Document is and what it would be like in a different system and then remind myself again that they're not really the same.

Jen Chan • Apr 9 '18

YES. You say that more accurately than I did. I would like to update my article with this definition. I was prone to mixing up foreign key names with primary keys. In my head the former exists as the label in a table; the latter is the name of something unique in a table, locally... and this confusion happens when the names are different. (i.e bread_id as a foreign key of a oven table vs bread.id in the original table... otherwise just id local to the bread table.)

Starting from mongo with as a scatterbrain: my logic was just shove something in something. And I had only used it on small projects so I think the grievance comes in having to type long queries to find something instead of syntactic SELECT ___ FROM ___ commands, which make much more sense now that I've used it.

Lance Alligood • Apr 11 '18 • Edited

Regarding joins, you will primarily use 2: left outer join & inner join. A distant third is outer join. (BTW, the only difference between left outer & right outer are more in the head of the query's author than in any physical sense. IMHO, anyone who writes from left-to-right will probably only ever use left outer join.)

And advice with aliases: Keep them short yet obvious. Too many people use single letters for aliases, which are OK, but can quickly lose meaning--especially between queries. Aiming for 3-5 letter aliases are still reasonable to type yet are still descriptive enough for you when you come back to a query after a long time or hand over the query to someone else. And most importantly, be consistent with your aliases! So for your employee table, don't use e one time & emp on another query. (Yes, it's allowed but it creates confusion later.)