DEV Community

Cover image for Database constraints considered harmful?
Jon Lauridsen
Jon Lauridsen

Posted on • Updated on

Database constraints considered harmful?

In relational databases it's common practice to create constraints to lock down the data and its relationship to other tables. So, a Users table might have an email column that’s marked as unique because everyone has their own email, right? And a Subscription table gets a non-nullable User column because every Subscription must be associated with a User. Yeah, that also sounds reasonable.

But I'm here to argue against constraints! To some that might sound heretical (or at least very stupid), but hear me out:

Almost all constraints exist for two reasons: For performance (it's a lot quicker to search a uniquely-indexed column), and to guarantee data integrity (so we physically can’t enter invalid data, e.g. the business requires a subscription to be tied to a user).

Here's the problem: We programmers often optimize too early, and constraints don't actually express business requirements!

Let me break that down a bit:

  • On the topic of performance, there are of course valid cases where constraints speed up queries by several magnitudes, but unless we see that performance problem and its subsequent remediation with a constraint we run the risk of prematurely optimizing. Fewer constraints would make the code more malleable, and create less friction for making subsequent changes to the database structure.

    Instead defer database-optimizations until they are needed, to counteract our tendency for premature optimizations.

  • Constraints due to data integrity are more directly harmful, but it's also the hardest point to convey to programmers: No amount of unique indices, non-nullable fields, or even stored procedures can or should model the full set of business requirements. There's a reason we don't program in database-languages; we program in general-purpose languages such as C#, Python, Javascript, etc.

    Ultimately business requirements must be expressed in the programming-layer, but they are not required in the storage-layer. And so we must ask ourselves: Why partially replicate those requirements in the DB?

At this point we need to discuss the drawbacks of constraints, because they don't come for free. In fact the more constraints the more difficult a system becomes to test and the less malleable it becomes:

  • Constraints couple concepts together, making it necessary to arrange multiple sub-entities in order to create the one valid entity we want to test. I.e. we can't just create a Subscription, we must also now create a User. And if our test doesn't deal with users that's just a waste of complexity. Then we might start hiding that complexity in entity factories, but those are just abstractions on top of self-inflicted complexity. Why not get rid of the complexity entirely?
  • And those constraints make it a lot harder to adjust to new requirements: What if we actually want subscriptions without a user? What if an email column doesn't have to be unique? If the data-storage layer imposes artificial constraints it dramatically slows down experiments: Instead of just changing some code, that experiment now requires much more complex and dangerous migrations.

For some, what I'm writing here is heresy or deeply stupid or both. But think about it: Constraints are business requirements that almost-kinda-sorta duplicates what the code already does. Instead, imagine if we have perfect trust in the code, what advantage do constraints actually add? What does it really give us to mark a Subscription's User column as non-nullable, when the code already ensures a Subscription is only ever created in relation to a User? It actually amounts to no help at all because constraints are simply not the source of truth.

Instead, try not adding constraints, and focus on ensuring great clarity and trust in the code. I know this can sound extreme, but it really isn't that bad.

For me personally it's even an open question whether foreign keys are worth it, unless we're clearly entering into performance problem territory. There's absolutely nothing wrong with code making multiple database calls and merging datasets in memory, as long as performance isn't impacted.

I think many of us create constraints because it feels neat that way, it feels… optimal? And it's how we're trained to do it. But none of us are realistically implementing 100% of business rules in the database, so, where do we draw the line? Why do we only ever add the easy constraints, but not fully map out the actual requirements?

Instead let the code express the business requirements, and let the database focus on storing and retrieving data.

Photo by Sam Moqadam on Unsplash

Top comments (20)

Collapse
 
tsvetanganev profile image
Tsvetan Ganev

Interesting take. I've heard the same arguments about foreign keys being evil because they degraded performance. If we follow this logic, we don't need RDBS at all and all applications should store their data in plain binary files. I agree not all software systems require strong data consistency/integrity but we can't scrap 50 years of DB development so easily. Critical systems that work with money in some form must be modeled strictly on the data layer as well as on the application layer.

Collapse
 
jonlauridsen profile image
Jon Lauridsen • Edited

Yeah fully agreeing. All I really want is to challenge the dogma I’ve sensed at times of always creating constraints. FWIW if we sat down and you argued against foreign keys I’d be fine with that, I trust the code to handle business requirements and see no issue joining manually. I’ve done that before and found no issues.

I see where you’re coming from with your financial comment and that might be true: The data storage constraints can act as a kind of extra bookkeeping check, further enhancing correctness on top of test-automation. But I suspect I’d still want to see how far we can guarantee correctness with layers of test automation… it’d at least be an interesting topic to explore :)

Collapse
 
mauricebrg profile image
Maurice Borgmeier

In defense of constraints for data integrity: If you need to absolutely ensure that a given value is unique in a column (e.g., your email) and you're operating in an environment where multiple entities change the data (e.g., multiple threads/servers) at the same time, a database constraint is the way to go to avoid race conditions and other funky distributed systems behavior.

Constraints can make testing more complex and impact performance at scale. However, not any workload needs to be scalable to the moon (premature optimization), and often constraints can be a cheap (as in economical) way to solve a problem.

Collapse
 
fjones profile image
FJones

The database is not the tool to prevent race conditions. It's an easier solution, and works as a stopgap, absolutely, but vulnerability to race conditions on the data layer are an architectural flaw that should be remedied elsewhere.

I often notice this with asynchronous microservices - where the solution is an execution engine or message broker, not relying on the database to resolve race conditions.

Collapse
 
bradleyross profile image
Bradley Ross

When it comes to software, I am a belt, suspenders, and two pairs of pants type of person. Everybody makes mistakes sometime, and assuming that the other person was supposed to handle it is a recipe for disaster.

I had a case where a large number of records with information on cases contained NULL in the case identifier. The person responsible for that code didn't believe me and told me to prove it by providing ten examples. I gave him a list of four hundred cases. I had the report on my desk and the reason the number was so low was that the SQL query was limited to 400 cases. The response of the database administrator was that he had only asked for ten cases, and he thought that I was trying to embarrass him by listing so manycases. What you need to do is insert the constraints and review all the errors it picks up.

In another case, there was a web page on LinkedIn that didn't work. Many people were unable reorder interest groups on the profile page. People were complaining loudly on several social media sites. Requests to the help desk received the following message: "We are giving top priority and will notify you when a solution is found." I examined the code for the page and determined that the servlet was using two SQL statements: one had an inner join and the other had an outer join. They were joining the two result sets by row number. I notified the help desk and had them validate it by fixing the problem with my profile using the help desk's interface with the database. I wrote this up and sent it to the help desk who repeated the previous request. I asked them to reread my email, and they said that they were sorry for not reading it. The person also said that they had been told to give the above response and ignore it.

If you really want to do a good job, you have to be willing to look for things that make you look bad. Most people don't want to do that, and the managers tend to be worse.

  • You have to assume that other portions of the system may contain errors.
  • If you check for errors in other programs, the ability of other programs to totally corrupt the database is limited.
  • The optimizing routines for the DBMS are designed to use the constraints to run faster. More constraints mean that more information about the data is available, and better optimizations are possible.
Collapse
 
jonlauridsen profile image
Jon Lauridsen • Edited

Hi Bradley,

Thank you for your insights, and I think I recognize your angle. Your examples do highlight the benefits of using a database's powerful tools, but I also feel they implicitly tell a story where the code is not doing its job. If an empowered team owns its solution maybe there isn't a database administrator at all. Maybe a great domain needs to be modeled, which can encapsulate that complexity.

I'm reminded of this scene from the movie Contact, where sometimes letting go of the seatbelt results in a smoother ride. It can seem scary, and of course I don't know your context so ultimately only you get to decide what's the best way forward. I'm happy to simply delight in the wonderful differences there are between our contexts, but if it matters then I do think there's more to what you describe between the lines that's worth exploring deeper.

But I enjoyed reading your angle because it does illuminate different contexts we all find ourselves in. And thanks for reading the article.

Collapse
 
aarone4 profile image
Aaron Reese

WOW!
Where to start....
in a properly structured RDBMS where business entities are modelled correctly, constraints are never going to slow down your application. There are plenty of other reasons (missing indexes, partitioning, non SARGable queries) but check constraints and foreign keys are not going to be it.
SOMEONE needs to be responsible for the integrity of the business data. I agree that overly strict RDBMS constraints can APPEAR to slow down development but in reality they do not; they are like the canary in the mine and act as an early warning system that the business entities are not correctly modeled. If you don't do it at the database level you have to do it somewhere else. Most business solutions these days will have multiple entry points: Desktop, mobile and external applications. It is now possible to move the integrity checks to the API layer if it is common to all applications but DO NOT leave it up to the front end devs to encapsulate the business rules.
Strict RDBMS integrity rules FORCE changes in business requirements to be properly considered and investigated. You can't create a Subscription without the User, but without referential integrity built into the database model who is responsible for checking that you can delete the customer and not leave the Subscription orphaned.
I am going to assume that your example of the unique constraint for a User being the email address was contrived, but I fear it may not have been. There is a difference between the entity key and a unique constraint although there are often interchangable. the entity key should be unique AND idempotent (i.e. cannot change) AND ubiquitous (i.e. every entity must have one). For many reasons this will often become a NONCE key; either a UUID or an IDENTITY INT; it has no business significance outside the database. This fundamental failure of understanding is why you MUST engage a database architect and not leave entity definitions to the developers (sorry devs, but strings aren't strings. They are CHARS, VARCHARS, NVARCHARS, TEXT and have a length and a collation - otherwise you waste space and performance in my database). As a Database guy, I would not expect to know all the subtelties of your programming domain so why should you expect to know all the GOTACHAs in mine.
At the end of the day it will be the database people that are tasked with sorting out any crap data that creeps in through the front end and API layers.

NOONE puts data directly into my tables, all APIs call stored procedures for C_UD and will use a stored procedure or view for the R(ead). This is the only way to really validate that the data being entered is fully kosher. You say that we don't do turing complete code in SQL - Sorry. I do (MSSQL) and in the Oracle world PL/SQL is specifically designed to do this.

Collapse
 
fjones profile image
FJones

A few points:

  • Performance is indeed a touchy subject, and I do agree that database constraints should (at least theoretically) have similar or even better performance than application layer constraints - because the database is in fact optimized specifically for that.
  • Modelling business entities is not the job of the data layer. I know, people will disagree on this, but I firmly believe that the data layer should not be opinionated about business logic. It is storage, and it exists to model data structures, not business structures.
  • As I said in my other comment, the constraints the data layer should be responsible for are relationships (in a relational DBMS), and internal integrity (that is, valid types, limits, and so on).
  • While I agree that a lot of people eschew an API layer, that's bad practice as well. There should absolutely be an API core that models the actual business logic.
Collapse
 
jonlauridsen profile image
Jon Lauridsen • Edited

It is now possible to move the integrity checks to the API layer if it is common to all applications but DO NOT leave it up to the front end devs to encapsulate the business rules.

Fully agree here. I didn't even realize I was implying anything but keeping the API layer responsible for business requirements.

As for the rest, I really do like your clear angle and rebuttal, even if I disagree. I oppose the sentence "my database", because I want to break down silos and optimize for shared ownership of the product. I suppose you could just as well argue I'm an API-centric chauvinist who wants the world to revolve around the API and you wouldn't be entirely wrong, I consider the DB "just storage" and ultimately not worth super-specializing into (unless you're the vendor building one!) because it doesn't provide customer value directly.

So I think we're fundamentally pitted against each other because I want developers to not care about CHARS, VARCHARS, etc. Those are implementation details to be changed at a whim to cater to whatever business requirements provide value.

Thankfully it's fine we're "pitted against each other" because I love hearing your angle, it's so different from mine and we can only learn from our differences. It might be best we don't work on the same project haha πŸ˜…, but thank you for taking the time to elucidate your angle.

Collapse
 
aarone4 profile image
Aaron Reese

Its not a war, at the end of the day we are all on the same team trying to provide value to the business. My point is that constraints in the RDBMs database are there for a reason; it prevents rogue application code from destroying fundamental business data. If you are not going to use them you might as well use a document database or flat text files. Just like your C# / java / python classes are trying to model the real world, so is the relational database. As it is the last line of defense, it HAS to be correct. If front end and middleware developers are creating workarounds then this is an indication that the database model is out of date.
I once did a lignting talk at our local geek-meet about how code-first ORMs (specifically Entity Framework, but the same applies to Eloquent) can help you to rapidly prototype the application and supporting database, but the data types it uses are not optimised (unless you use Fluent, in which case you might as well use DDL in Management Studio) Using a repository layer and Active Record pattern can help but it is still no substitue for a properly configured database.

Some of the crap I have seen:
Dates stored in datetime fields.
Date and time stored in SEPARATE datetime fields.
dates stored as strings
Y/N held in VARCHAR(1000)
Phone number 1, Phone number 2, Phone number 3 ....

Collapse
 
moopet profile image
Ben Sinclair

I see your points, and even agree in principle, but disagree with most of them in the real world.

A lot of assumptions people have about anything are wrong. For instance, a lot of services use a unique email as an identifier, but there's no one-to-one mapping between people and email addresses. I have lots (ben@my.work.address, ben@my.play.address...), some people share one (support@my.work.address, thesmithfamily@example.com) and enforcing uniqueness then becomes a business problem instead or you'll get a buttload of users who're breaking your Ts&Cs.

In the real world, code gets deployed which may or may not have been tested well, and tests may or may not cover every eventuality. The QA user buys negative PI beers.

Without constraints, and especially without transactions, data will get corrupted. It's like using C to manage strings as character arrays. You'll screw it up at least once while you're writing your program and see some sort of delicious segfault. If you have the chance to use a string type which handles all that for you, why not do it?

The difference between my string example and the RDBMS of the question is that the database persists longer than your new code. Errors in it go unnoticed, code crashes part-way through processing a "DROP user" and you have dangling subscriptions. Then you need to write a new function called removeOrphanSubscriptions which you call on cron every night at the software witching hour.

And the problem with optimisations is that choosing to not use a constraint because it migh degrade performance is the optimisation, not the other way around.

Collapse
 
jonlauridsen profile image
Jon Lauridsen

I hadn't considered transactions as part of my argument, we're definitely in agreement that transactions form a backbone of reliable writes. There should never be a dangling Subscription because a service crashed halfway through, and if what I wrote implies I'm against transactions then please point out the offending text so I can fix it :)

I think we both agree that the correct place to prevent buying negative beer should be in the code, right? Maybe we'd argue exactly where and how to do it, but from what you write I think we'd both argue the code should be responsible. The code is the responsible system. You describe the case where a mistake in the code would let an error through that'd get caught in the database, and yes you're right constraints might catch such errors, but it's kind of based on luck: Because business requirements can't be (or certainly shouldn't be) modeled entirely in the database, there are no guarantees that errors in the code layer will be caught in the database. It is possible to get lucky and catch an error, but in my opinion & experience it can just as easily be a regex-constraint that's gone out of sync with the code and now produces hard-to-verify errors because of some edge-case where a certain combination of characters result in a database error. Hope is not a strategy, and if the code should be responsible I'm all for focusing all efforts on that layer instead of adding partial safety nets where they don't really belong.

Perhaps constraints as you describe it made a lot more sense when code was managed by different people than the database, because those constraints allow the database-context to establish rules to make sense of its own data. But if one team manages both code and DB? I'm really suspicious of the value a constraint adds in that context, other than taking up even more minutes to writing and verifying a migration.

That said I'm not here to tell anyone how to work, you make good points and as long as we both can work on systems we enjoy I'm all the happier :)

Collapse
 
fjones profile image
FJones

I definitely agree with keeping business constraints out of the database. One common issue we face with business constraints is testability. "This should only run once per day" translates to "let's add the date to our unique key". As a result, we can't run the same process more than once per day, because it clashes with the business constraint in the database, while the code path for testing bypasses those checks intentionally.

Database constraints should verify internal integrity. Even not-null checks are, at times, a hindrance both in operation and in changes of logic. Ensure that relationships are valid, ensure that types are valid - don't go beyond that.

Collapse
 
player8472 profile image
player8472

Whenever I read about something like this I ask myself if the person who write it actually ever worked on a production system that doesn't have constraints on the database side.

I have (twice in two jobs by now) and can only tell you:
It is not pretty.

Yeah, sure the programmers love it for the most part as they get less "weird" exceptions.
But fact is: Programmers forget stuff, database architects forget stuff.
You don't want to know how many hours I've wasted on finding out why a statistical analysis doesn't deliver the desired result on Production when it does on the DEV and Integration Environments.

And it is almost always because the user was able to create multiple identical entries...
Or entries which shouldn't be able to exists in the first place as there is no entity for the entered foreign key.
Or because some idiot didn't properly implement data deletion and there are invalid Foreign Key references.

And yes, I know the argument: "But we do it codeside" - Truth is: 90% of you don't!
You write Hibernate, Rails or whatever you are using to communicate with the database.
Database Architects and Developers don't implement constraints to make your life hard, but to prevent you from making mistakes.

Not using constraints (in classic RDBMS - some modern db-systems have mechanisms to take care of crap like this) is like just empty catching exceptions.
Makes your life easier until it doesn't...

Collapse
 
jonlauridsen profile image
Jon Lauridsen

Thanks for your reply! I too have been in places that weren't pretty, with database and code constraints diverging in various subtle ways in various legacy systems. And yeah it's hard to get on top of, and I happen to have also debugged my way into cases that could've been solved by adding more constraints. But the complexity of navigating tightly coupled entities where one has to traverse numerous foreign-key relations just to raise the entity I actually need is incredibly energy-sapping. And it's brittle, because fundamentally the DB cannot enforce all the business rules anyway, so there's no guarantee I'm spawning a truly representative entity in my test-code unless I go through the code to raise those related entities.

And if the code is easy enough to use and we trust it to do the right thing then we might as well only express those constraints in code.

End of the day it's fine we disagree, the Internet would be boring if we didn't :) I hope you see much success with your decisions.

Collapse
 
jonlauridsen profile image
Jon Lauridsen

I hear you state "With constraints you can trust the data".

But I don't think that can be true outside trivial cases, because business requirements aren't enforced in the database. You can constrain some business requirements (e.g. Subscription requires a User), but such constraints are simultaneously non-exhaustive (so the code still needs to validate if you don't trust what it's in the database) and aren't authoritative (because the code is where business requirements are supposed to be modeled).

If instead we say the database's role is only to store data (that is: a single service reads and writes from it), then there is plenty trust in the data because you know exactly what the service writes. A lack of trust there implies someone is messing with production data manually, in which case you either need to change that culture or add validation to all read-actions to ensure business requirement correctness (because database constraints would at best give a shallow appearance of correctness but might stil fail catastrophically inside the code because constraints don't model all business requirements).

I'd heavily suggest changing the culture though.

Collapse
 
joelbonetr profile image
JoelBonetR πŸ₯‡ • Edited

Didn't read all the comments (sorry if I duplicate a concern here).

A priority in software is robustness.
You can have a bug executing a query (so you'll need to rollback).
You can have a bug in code that alters the data wrongly.

Having a layer to ensure robustness in each step is preferrable, specially when working with users that pay you money.

You can define a model that can or cannot fullfill the formalization (I would expect, at least 1, 2, 3 and BC), but those normal forms are here for a reason plus they imply constraints.

If you really need to do something fast you can just use an ORM (Code First) without specifying any constraint and migrate in a single command instead doing the create tables in SQL commands.

I would anyway encourage you to analyze the business needs for each feature before coding a single line or executing a single query.

Analyzing the business needs you can find inconsistences and you can define and get to know the big picture and the forecast for the roadmap so you can take early decisions. Let's use the email column as example.

You can state that every user will have an email and that emails are unique and it will be a good assumption, but have you asked and defined with client if they are about to let (at any point in the future) the users to change their email from their account?

If it's a yes, simply add an auto increment ID and let the email as required but not as PK.
If it's a no, then you can use email as PK or ignore the client statement and do the thing defined in the line above, it will work anyway and you protected your future yourself from a possible bad day throwing alter tables and sync-ing the code with the new model.

Collapse
 
jonlauridsen profile image
Jon Lauridsen

If I may push back a bit, what does it matter if the client may consider letting users change their email some time in the future? If they say yes, would you design for functionality that isn't needed now? What if they say yes, but never get to it? What if they say yes, but actually what they need next year is something else? What if they say no, but actually meant yes?

We can both agree it's crucial to talk to our customers :), but I'll assert that in the majority of cases a discussion about "any point in the future" is too vague and too theoretical to be of use. I'd rather suggest we build what is needed right now: A User table with an email column. That's it. In data-storage terms I don't care if it's unique or not as I'll handle that in the more flexible code, all I know in data-storage terms is I need to store an email and right now the need is for it to be associated with a User. No constraints to mimic business logic, no concessions for theoretical futures, just what's needed to make the app work right now.

Collapse
 
aarone4 profile image
Aaron Reese

I don't care if it's unique or not as I'll handle that in the more flexible code,

Now you have to write it several times if there are multiple entry paths to the database

all I know in data-storage terms is I need to store an email and right now the need is for it to be associated with a User

So let's model the database with a UserContactMethod table:
ID (Primary Key)
UserID (Foreign key to the User table)
Contact method (enumerated value of phone/email/Twitter handle)
ContactMethodValue

Not a lot of effort to design up front and now caters for 0..N email address against a single User, the same email against multiple users and automatic extendability for other contact methods.. NOT modelling this according to the real world entities will likely result in significant remodeling in the future.