Geshan Manandhar

Posted on Dec 14, 2018 • Originally published at geshan.com.np

You Can Do it in SQL, Stop Writing Extra Code for it

#sql #webdev #productivity

“SQL, Lisp, and Haskell are the only programming languages that I’ve seen where one spends more time thinking than typing.“ — Philip Greenspun

Even with thinking more than typing SQL (Structured Query Language) we software engineers use it as a way to pull data only.

We usually don’t leverage SQL’s power of data manipulation and do the needed changes in code.

This is quite prevalent in software engineers who work in web applications. Another thing we miss is, if we do the manipulation in SQL directly the pulled data will be the same format for any programming language. This post aims to enlighten you about the powers of SQL you might know but generally don’t use.

Tea Lights image from Pixabay

TLDR;

Use SQL to do math like sum, average etc. Utilize it for grouping one to many relational values like getting categories of product. Leverage SQL for string manipulation like using CONCAT_WS for concating first name and last name. Exploit SQL to sort by a custom priority formula. Examples below…

The Example

It will be easier to explain the superpowers of SQL putting it in action on an example. Below is a basic schema with 2 tables in MYSQL for a refunds microservice:

There are 2 refunds and 7 related payments as example data.

Some assumptions

For the refunds microservice example schema and applications following assumptions are made:

Refunds microservice and data structure store the fk_item (the id of the ordered/delivered item), but it is not a hard foreign key.
Item can be refunded in either cash or credit for the amount paid for the same.
Items can be refunded many times as long as remaining balance can cover requested refund amount for each cash and credit. For example, item was paid 50 in cash and 50 in credit. 2 refunds of 20 cash and 20 credit can be done. So after these transactions balance will be 10 cash and 10 credit for that item (50–20–20).
Each refund can have multiple items payment. Each payment can be of type either cash or credit.
All amounts are stored in cents so they are integers.

Now let’s use some SQL powers. You can find the example with related queries running on SQL Fiddle.

Do the math in SQL

As software engineers, let’s say if we need to find the total cash and credit amount refunded for an item what would we do? We would run something like:

SELECT fk_item, fk_refund, amount, is_cash 
FROM payment WHERE fk_item=2001;

With current data, it will give 3 rows like below:

With these 3 rows, we would loop over them. If it is cash accumulate it to cashBalance variable, if not sum it up to creditBalace variable. Rather than that it would be a lot easier (probably faster) to do in SQL like:

SELECT fk_item, SUM(amount) AS total_paid, IF(is_cash = 1, 'cash', 'credit') as type 
FROM payment WHERE fk_item = 2001 GROUP BY fk_item, is_cash;

Resulting in:

The result is easy now if you need the total refund for the item just change the GROUP BY to be on fk_item and it’s done. For 2 and 3 records it won’t feel significant. If there were say 20 refunds for that item, the first solution with a loop is writing more code with no gain. Like sum, other SQL functions can be used too. Simple math operations like sum, multiply, average etc can be easy with SQL. This means no more loops.

Use GROUP_CONCAT to fetch related 1:m relation values

Group concat is a robust operation in SQL databases. It is instrumental when you need to get data from one to many relationship. For instance, you want to get all tags for a blog post or you want to get all categories of a product. Concerning this refunds example, one item can be refunded multiple times. So we will get all the refunds associated with the item id. To get this we will run only 1 query and get it without any loops in the code like below:

SELECT fk_item, GROUP_CONCAT(DISTINCT fk_refund) refund_ids 
FROM payment WHERE fk_item = 2001;

This results in:

Now we know that item 2001 has been refunded twice for 2 refunds. It will be easy to explode the refund Ids with , and proceed with any related operation.

String manipulation

Many string manipulation tasks like substring, concatenation, change case, and string compare can be done in SQL. With this example, I am going to show the usage of CONCAT_WS. It is concat with a separator. It can also be used to select for instance first_name and last_name with space in between.

In case of having an optional middle name COALESCE can be used with CONCAT_WS. That is something for you to explore :).

In this example, I will select refund_nr with its related reason:

SELECT CONCAT_WS("-", refund_nr, reason) AS refund_nr_with_reason 
FROM refund;

Resulting in:

If this needs to be shown on the credit note document, for example, no extra code is needed to join the values again. SQL makes it one step easier again.

Sorting with a custom formula

All software engineers know you can sort based on a column. But if you are given a custom priority formula to sort, what would you do? Probably again resort back to code and loop to sort. So lets set the priority formula rules for above example:

Premium customer refunds get the highest priority (we hack it with a priority of 9999999999)
Other than premium customers cash refunds get a priority of amount * 25 for credit it is amount * 20.

As per above rules it is decided that premium customers and priority above 50000 (in cents) will be processed first. Then other refunds will be processes. Let’s get the priority refunds as below:

SELECT r.refund_nr, r.reason, p.fk_item, p.amount, p.is_cash,
IF(p.premium_customer = 1, 9999999999, p.amount * (IF(is_cash = 1, 25, 20))) AS priority 
FROM refund AS r INNER JOIN payment AS p ON r.id = p.fk_refund 
HAVING priority > 50000 
ORDER BY priority DESC

The results are below:

With proper use of IF in SQL sorting by a custom priority formula is a lot easier than trying to do it with loops in code. Notice that even smaller amounts like 7.5 (750 cents) and 9.0 (900 cents) came to highest priority as these refund payment amounts were associated with premium customers.

Use the superpowers of SQL to make your life easier as a software engineer.

You can play with the example and run your queries on SQL fiddle.

Conclusion

There are other tricks of SQL that can help you as a software engineer. Like, UPDATE with INSERT using ON DUPLICATE KEY UPDATE. Whenever you have an itch of doing some manipulation for data pulled in from database in code with loops, think again. The main takeaway from this story is:

Exploit the power of SQL to write less code because “the best code is the code that was never written”. If it is not written there is no need to maintain it.

Originally published at geshan.com.np.

Oldest comments (100)

Ben Halpern • Dec 14 '18

I needed this

Geshan Manandhar • Dec 14 '18

Thanks!

Alexey Melezhik • Dec 14 '18 • Edited

Putting application logic into DB scope might end up with hard to maintain and troubleshoot code, though I can agree sometimes knowledge of SQL help you getting things done, but I'd leave this approach for custom reports based on SQL queries rather than for web application or something when you are free to process data in controller level.

Evaldas Buinauskas • Dec 14 '18 • Edited

Company I work for has exactly this issue. Literally all the logic (sometimes even presentational logic) is done in the database.

This is nightmare to maintain and now really difficult to refactor.

Also it's difficult to test SQL code due to lack of tools.

Geshan Manandhar • Dec 14 '18

Probably as most of you agree it is a bad practice due to:

cannot track code in DVCS like git
cannot test DB code
a lot harder to scale

It is better to convenience the devs and mgmt to slowly move it out if possible.

Evaldas Buinauskas • Dec 14 '18 • Edited

First one is not true.

We successfully version database code using Git, have working pull requests, continuous integration and delivery.

Second is not true as well. For instance, T-SQL code can be tested using tSQLt framework. Yet again. This is more difficult and not a widely adopted practice, but it's possible and companies do it.

But in any case. Even though things are possible and can be done in database, it doesn't mean it should be.

David • Dec 14 '18

Database testing is a nightmare I hope to never repeat. Just because a thing can be done do not mean that it should be done.

Divakar Rajashekaran • Dec 16 '18

It's not true in anyway that DB code is hard to test. As long as the person is aware of DB based code units it's just the same.

Geshan Manandhar • Dec 16 '18

Maybe you are right!

Geshan Manandhar • Dec 14 '18

Thanks for the comment. I am not advocating putting any logic in DB. The code still stays in your git repo. I am just trying to demonstrate the power of SQL for simple tasks can be a lifesaver.

Cristian Baluta • Dec 18 '18

Not only that, but imagine wanting to move to a different database that doesn't have this features. Anyway, is sql a programming language?

Paweł Świątkowski • Dec 18 '18

I have never seen a company moving to another database. This is something that just does not happen.

jdelgit • Dec 18 '18

I hope you're being sarcastic.
In case you aren't:
Why Uber Engineering Switched from Postgres to MySQL
Why we Moved From NoSQL MongoDB to PostgreSQL

Cristian Baluta • Dec 19 '18

From my mobile experience there is Parse who burned its developers 2 years ago. My personal app is still not working because i could not replicate in Firebase the query based on location coordinates. Since then i learned to use clean code and the database is switchable whenever i wish. I do have one right now where an iOS, Mac and command line app use the same codebase but different databases.
Of course, if you have something highly important no need to worry about this and better to implement how is more efficient.

Jon Bristow • Dec 14 '18

This is all good, but it leaves out a major point: Pulling directly from a database (or a filesystem) and exposing that to the consumer is not really scalable.

Sure, it works fine in a single application with its own database, but imagine 50 different webpages and services working off that same data. If one service makes a change to the database, then everyone needs to make changes to their code.

But, if everyone depends on a data accessor library (or service, or whatever), only the accessor needs to change! Everyone else's contracts will remain the same.

True, you're probably not going to swap out the entire underlying store all at once, (It happens, though... Especially early on in the lifecycle of the datastore.)

That being said, everything above is still applicable! Databases often are highly optimized for pretty much everything you pointed out (though sometimes full-string operations slow things down a LOT... I'm looking at you, Bob. I saw you put that LIKE statement on a multi-million row join. I warned you. I warned you about the stairs.)

I'm looking at a bunch of BigQuery stuff right now at my current job, and I keep seeing these queries that over-use with(). It's a simple join, Karen! It's not even a NOT IN or fancy grouping! But BQ is usually fast enough that this doesn't even matter...

Geshan Manandhar • Dec 14 '18

Appreciate your views. As the example mentions it is a "microservice" so it is assumed that the DB is accessed by only one application. If you have 50 applications accessing one DB there is a bigger problem to solve IMHO. Not using SQL well is a different problem like you pointed to Bob for the like query. That could even run kinda OK if the indexes are placed logically. Thanks!

Jesse Phillips • Dec 15 '18

I would think that a database would be the one common thing when doing Mico services. Do you really scatter the data around like you do service endpoints. (my experience using and mocking micro services may not be representative.

Geshan Manandhar • Dec 15 '18

Better to keep data central.

Mihail Malo • Dec 18 '18

Keeping data central causes all the problems @jbristow mentioned.
And a "data accessor library" isn't a solution, it's just a bottleneck, in terms of flexibility if not even performance for your whole system.

Microservices should own their data, and communicate through APIs they expose, ideally through commands and events.

Kyle travelet • Dec 14 '18

Thank you! My most recent project I implemented a materialized view that has some pretty nasty SQL in it but makes the data accessing so much easier for everyone.

Geshan Manandhar • Dec 14 '18

Generally, I am not a fan of views but yes they have their place too. Thanks!

Vincent Milum Jr • Dec 14 '18

Some things to keep in mind.

Depending on the SQL logic, this could add additional CPU resource pressure on the database server, a service that is hard to scale. Web servers are significantly easier to scale to handle the processing workload.

At least one of the examples looks like it may break the SQL query optimizer in MySQL or MariaDB, causes a full table scan, significantly harming performance. If tables are small, this is no issue, but even for small business workloads, I'm generally dealing with tables on the scale of 1mil to 100mil rows.

Others have noted that this could make parts of the database harder to maintain. Personally, I use some of these methods described above, but they're handled through a SQL query generator, the logic still entirely exists in the application layer, it simply generates and then passes the required SQL query string to the database to handle the operations.

I simply make these notes as someone who has crafted web sites with data processing times in the sub-100ms range, with the fastest now averaging 6ms.

Scott Tadman • Dec 14 '18

You'll always want to keep in mind how your application is deployed and where you have the most resources.

One application I worked on was deployed on a provider with a pricing model where you paid for web server instances but the database itself was bundled in as part of the package. This lead to a design that off-loaded as much as possible onto the database as it was basically "free" compute. The database instance was far, far faster than any of the individual web servers.

In modern configurations you're going to have a lot of web instances and a few databases that you want to keep lightly loaded so they're responsive. This is especially true if leaning hard on your database means buying more ultra-expensive database server licenses.

Geshan Manandhar • Dec 14 '18

It is a balance of trade-offs as it is usually in distributed systems. Choosing the right trade-off is one of the keys to success in our field.

Geshan Manandhar • Dec 14 '18

I like your point of view, databases are not easily scalable that's where managed services with load balancers come into play I suppose. About the full table scan, this is an example and setting indexes correctly would help tackle the problem. One can always run an EXPLAIN and plan next steps.

About maintenance, I believe none of the code should be in the DB itself it should be in git managed repos. I am also not promoting writing triggers or unnecessary views. 6ms is impressive, at that point there should be a good amount of caching involved. Thanks!

Vincent Milum Jr • Dec 14 '18

6ms is with zero caching, that's all live database queries and html template processing.

The query in question about the SQL optimizer is the one with the HAVING statement. I'd have to check, but limiting the number of rows in a HAVING statement based on an IF statement in the column selector I believe requires a full table scan. The GROUP BY statements should all be good though, especially considering the advances in optimizing these in the past two years.

Geshan Manandhar • Dec 14 '18

Sounds interesting for 6 ms.

For the having query it was just an elaborated example, if it was a real DB query I would have surely run an explain to see if that is the way to go. Thanks!

Connor McDonald 🍸☕️ • Dec 17 '18

If the concern is scalability of the database, as counter intuitive as it might sound, doing more work outside the database can make the database work harder then if that work was done inside the database.

This is not just an idle rant - here's a full exploration of that precise topic

youtube.com/watch?v=8jiJDflpw4Y

Sophie The Lionhart • Dec 14 '18

I'm always torn on how to actually implement this advice though. Sure its faster and in some ways easier to use the built in SQL stuff, especially if you're trying to reduce down to a value like with sum, but how can you deal with that in an ORM-esque way so you still have a type safe representation of that row for manipulating down the way? I can see the desire to have some custom calculated row in the return from sql but then what happens to the representation when you pass it around? I'm not sure how to balance these two things I guess, the pure SQL and the classed OOP type safe current approach.

seimic • Dec 14 '18

one way is to compose the orm representation of the data from the raw table and lazy loaded views mapped readonly. In case of JPA this means @SecondaryTable pointing to a View and columns with insertable=false, updatable=false. by this they never appear in any insert or update statement and the additional aggregation logic is cleanly separated from the raw data model. works fine.

Geshan Manandhar • Dec 14 '18

This is the million dollar question :) Yes as soon as you use an ORM your ability to run raw SQL queries becomes difficult. This depends on you and your team to set a guideline to come around this problem. For that, it is better to talk milliseconds and performance choosing a solution that is fast in execution and relatively easier to write and maintain would be my suggestion. Thanks!

rhymes • Dec 15 '18

The balance is isolation.

You isolate the code that uses the custom SQL, you document it and test it well and like everything else, you try to have all the code that needs it use that single function that returns the DB value.

Your app can still use the ORM for 99%, but that doesn't mean we shouldn't take advantage of the DB if after measuring we conclude that the tradeoff is worth it.

dandy • Dec 14 '18

True, SQL is indeed fast and powerful in data aggregation/manipulation.

We can also pre-compute these data aggregations/manipulations if they are queried frequently by users. I think running these aggregations queries, on the fly, on table with millions of records, by multiple users, maybe a slight overhead on CPU/RAM? No/Yes? Don't know.

Geshan Manandhar • Dec 14 '18

That is where you need to profile the queries, run some explains, check the resource usage and decide where is it better to run it on the DB or code. Thanks!

Mihail Malo • Dec 18 '18

True, event-updated projections are a much easier to conceptualize and scale than DB views.

Thomas Los • Dec 14 '18

Refreshing to see in an age were ORM's and SQL-handicaped dev's are the rule and no longer the exception.

Geshan Manandhar • Dec 14 '18

You can refresh as many times as you want, but I guess nothing much is going to change :)

Geshan Manandhar • Dec 14 '18

BTW this post has caught some fire on hacker news too - news.ycombinator.com/item?id=18679957 :)

Fernando Costa • Dec 15 '18

I'm trying hard, but I don't understand the concept of this post. From the purpose of databases to infrastructure management, I don't think I'd ever apply this method, even if it could cost me 10x in performance(in which case I would simply scale).

Use the superpowers of SQL to make your life easier as a software engineer.

Nope.

Geshan Manandhar • Dec 15 '18

I can agree to disagree :)

rhymes • Dec 15 '18

Hi @fjo_costa ,

it's the not the first time I hear this argument, and I'm not picking on you but I would honestly like to understand why you're against the idea of achieving 10x speed by using a well tested query or two.

I know it's harder to maintain if the specs change, but you can still swap the query for something else in the future, especially because ORMs tend to catch up to SQL features at some point.

Also, lots of languages have safe SQL generators, which mean you don't have to actually write the entire query, you can use tools like Arel.

I've built a query engine for composable conditions to select targets to send push notifications to once on it and I probably wrote two or three of pure SQL in total. I was still using the ORM, just augmented it through relational algebra, PostgreSQL json and basic geo support. It would have been immensely slower to do "in the app" (especially with hundres of thousands of devices). The engine was super fast and I had no caching.

I assure you I didn't anything special with tests, just tested that by selecting a condition, I had this or that device back, all inside the ORM.

Geshan Manandhar • Dec 15 '18

its test what you write not the library you are using and SQL has been around longer than many languages I know. SQL is one of the most battle-tested things. Thanks for your views @rhymes !

View full discussion (100 comments)