
A junior, a mid and a senior dev walk into a bar

anabella on December 29, 2018

Gather 'round and let me tell you about that time my teammates and I took down one of the platform's most critical services. For the sake of narr...
 
Stone Lasley • Edited

Good article, but I disagree with your bullet "migrations should be decoupled..". This decoupling is what caused the situation: the schema version and the code version moved independently. The desire to have "control" over your migrations sounds like you don't trust your delivery pipeline. Most migration tools have an up/down or roll-forward/revert path for this very reason. If you need control, you aren't really doing migrations; you're just calling manual schema updates "migrations". And yes, don't use *; your DBA should have yelled at your team about that long ago.
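
For readers who haven't worked with them: below is a minimal sketch of the roll-forward/revert pair Stone describes, written as plain Python functions against a DB-API connection. The `users.language` column is a made-up example, not the schema from the article; real tools (Flyway, Liquibase, Alembic, Rails) wrap this same idea.

```python
# Minimal sketch of a paired roll-forward / revert migration.
# Table and column names are illustrative only.

def up(conn):
    # Roll forward: add the column the next code release expects.
    conn.execute("ALTER TABLE users ADD COLUMN language TEXT DEFAULT 'en'")

def down(conn):
    # Revert: restore the previous schema if the release is rolled back.
    conn.execute("ALTER TABLE users DROP COLUMN language")
```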

TannerMoore

That is what I was going to touch on too! I have worked on software that tightly couples the code to the db migrations and it has worked very well so far. The application just waits for all migrations to finish before trying to start the main execution path, so long-running schema changes are pretty painless.
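
A rough sketch of that startup order, with placeholder function names rather than any specific framework's API: apply whatever migrations are pending, and only then open the main execution path.

```python
# Sketch of "migrate, then serve": the app refuses to take traffic until
# every pending migration has been applied. Names are placeholders.

def apply_pending_migrations(conn, pending_migrations):
    for migration in pending_migrations:   # ordered list of callables like up(conn)
        migration(conn)
    conn.commit()

def start_app(conn, pending_migrations, serve):
    apply_pending_migrations(conn, pending_migrations)  # blocks until the schema is current
    serve()                                             # main execution path starts last
```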

anabella

Hey! Yeah, it's not that we didn't trust the pipeline; it's that adding fields with constraints and default values on a huge... HUGE db used by a system with heavy traffic would have been a mess. In a migration we don't usually care about optimizing for performance, we just wanna get the correct schema for the correct code. But if you have a situation like this, where you need to minimize locking operations as much as you can, then I feel it's safer to run it manually.

The thing is that by trying to avoid downtime, we ended up causing even more downtime. But I will modify that bullet point to be more specific about our case.

Mikael Henriksson

The migration aspect (new columns with a default value) is one I keep having to remind my fellow developers is a bad idea.

There are ways around that, such as creating a new table with the default column and backfilling the data from the renamed old table. It is also possible to first update the schema and then update the table in batches of, say, 100 to avoid locking the entire table, finishing up by setting the default value once all records have been updated. A third option is to just update each row as it is fetched from the database, if the value is nil.
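
To make the second option concrete, here is a rough sketch of an add-then-backfill-in-batches script. It uses sqlite3 from the standard library only so that it runs anywhere; on MySQL/Postgres (where the locking behaviour that motivates this actually matters) the ALTER/DEFAULT syntax differs, and the table and column names are invented.

```python
import sqlite3

BATCH = 100

def add_language_column_in_batches(conn: sqlite3.Connection) -> None:
    # 1. Add the column with no default, so the ALTER itself is cheap.
    conn.execute("ALTER TABLE users ADD COLUMN language TEXT")
    conn.commit()

    # 2. Backfill in small batches so no single statement holds a lock
    #    on the whole table.
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM users WHERE language IS NULL LIMIT ?", (BATCH,)
        )]
        if not ids:
            break
        placeholders = ",".join("?" * len(ids))
        conn.execute(
            f"UPDATE users SET language = 'en' WHERE id IN ({placeholders})", ids
        )
        conn.commit()

    # 3. Only once every row is populated, attach the DEFAULT / NOT NULL
    #    constraint (engine-specific, so not shown here).
```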

I am perplexed that seemingly every company I work for has this problem. I've resorted to every possible way of solving it. I also took down our entire stack once, at the worst possible time. We lost a lot of money that day and never forgot about default values again. I suspect you will keep an eye on this in the future :)

Stone Lasley

One item I should add that I've seen overlooked on teams: the DB schema can progress without the code, as long as the code handles each schema state. E.g. you check in code that adds a single column (in this case, if you're selecting * you may need to update a DTO or something), then deploy. That deployment only does a portion of the migration you ultimately want, so a large, disruptive DB change may take 5 deployments, or 10, whatever makes sense. This approach still captures every single state of the schema in source control.

Similarly, this is how I recommend handling destructive changes to the schema: push a deployment that stops relying on a column, then drop that column in a subsequent deployment. This way, WHEN a migration doesn't run or does something weird, you're less likely to explode.

Anabella, in your case we're all speaking in generalities and it's easy to critique without knowing the details. It's possible that, given all the information, I wouldn't have a better solution. I liked your article and you have guts putting yourself out there.
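
Sketched out with invented version numbers and column names, that sequence looks something like this: each deployment ships at most one small, backward-compatible schema step.

```python
import sqlite3

# Each deployment ships at most one small, backward-compatible schema step,
# so the schema never gets ahead of code that can't handle it.
# Versions, table and column names are made up for illustration.

DEPLOYMENTS = [
    ("1.4.0", "ALTER TABLE users ADD COLUMN language TEXT"),     # code starts tolerating the column
    ("1.4.1", None),                                             # code starts writing it; backfill runs separately
    ("1.4.2", None),                                             # code stops reading the legacy column
    ("1.4.3", "ALTER TABLE users DROP COLUMN legacy_language"),  # only now is the old column dropped
]

def deploy(conn: sqlite3.Connection, version: str) -> None:
    # Apply only the schema step that belongs to this release, if any.
    for release, ddl in DEPLOYMENTS:
        if release == version and ddl is not None:
            conn.execute(ddl)
            conn.commit()
```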

Scott Hannen • Edited

Hopefully yours is not a culture of blame.

In order for this migration to work, you would have had to be extremely careful to get it exactly right. If it blows up, it's easy to say that you should have been more careful. And of course, you say that you'll be more careful next time.

That sounds good, but an environment in which everyone just has to be extremely careful or everything blows up is going to face catastrophic failures often. I've been there.

Developers make mistakes. That's reality. It doesn't have to break production systems if we understand it and work with it, not against it.

That's one reason why we write unit tests for our code. We catch little mistakes. If we didn't make small mistakes, then why write unit tests?

We also test the behavior of our applications, including QA tests. If we never introduced bugs, why test our applications?

Then we deploy our code. That can be a complex process, as you experienced. Do we test that deployment process? We test everything else, so why would we not test the part that has the greatest potential to crash everything?

That means having a staging environment which perfectly mirrors production, where we can test our deployments. Containerization also helps.
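
As a small illustration of testing the deployment itself: apply every migration to a scratch database, then run the same queries the released code will run. An in-memory SQLite database stands in here for a real staging copy; the migration scripts and queries would be your own.

```python
import sqlite3

def test_deployment(migration_scripts, app_queries):
    """Apply every migration in order to a scratch database, then run the
    same queries the released code will run. A crash here is a crash you
    did not have in production."""
    conn = sqlite3.connect(":memory:")      # stand-in for a staging database
    for script in migration_scripts:        # same order as the production rollout
        conn.executescript(script)
    for query in app_queries:               # e.g. the exact SELECTs the code issues
        conn.execute(query).fetchall()
    conn.close()
```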

Was it your personal idea (or S's or M's) not to have a staging environment for such tests? I bet it wasn't. The risk of catastrophic failure was built into your deployment process. It was a matter of time.

"Let's all be more careful" is not an answer. I doubt that you or anyone else were careless. Humans simply aren't capable of mentally calculating every variable and visualizing how complex systems will behave. (We do okay sometimes, but relying on that is a horrible idea.) If we could do that then we wouldn't need computers. We could just do everything in our heads.

Testing everything, which includes having the environment for such tests, is the answer.

anabella

Wise words! Loved your comment, it's almost like a mini-article on its own 👍

We did introduce a lot of monitoring and testing after this happened, but we still don't have staging as a part of the deployment process. We do have lower environments, but they're used in more of an "ad hoc" manner. That means they were only used for testing whenever someone acknowledged it was necessary to test something there first... And this might have been the biggest problem.

Iheb KHEMISSI • Edited

A very nice article Anabella.
Unexpected things happen, so one must try as much as possible to guard against them with reviews and tests, but unfortunately we can never guarantee bug-free code (or an unhackable system).

Well, next time you have a migration or a big change to do, I would suggest:

  • make sure to run integration tests in your integration environment (on your code). Ideally this step should be automated, but in the worst case they can be run manually. Of course, one should look for errors in test reports or in monitoring tools to confirm that everything is okay.
  • ask your ops team to set up a staging/QA environment: from what you wrote, it seems you have "production" and your local dev machines, but no other environment where you can test what needs to go to prod (with a setup similar to prod).
  • don't deploy changes to your prod near the end of the day (it may have been a super urgent task but unless you have enough tooling and/or manpower to monitor, get feedback, fix issues, and eventually rollback changes, then waiting half a day may be well worth it).

All the best, Anabella, and make sure to discuss with your team (dev, ops, qa, ...) what happened and how to prevent it in the future.

anabella

Thank you for your advice and your kind words :) Nice to find some honest and nice feedback when talking about f* ups.

I agree with your points and we did talk extensively about what happened and introduced a lot of testing and monitoring to prevent it in the future. Luckily neither the company nor the team are guilt-oriented and we just focused on learning and prevention <3

Cliff

I wouldn't really consider this a screwup on your part. Sure, you probably didn't do everything perfectly, and the senior dev probably should have caught the potential problem. What went wrong, though, was a failure of the development process and the accumulation of technical debt. You should never have been able to introduce a problem on that scale to begin with. You might have still managed to introduce a bug, but it should have been minor at worst.

Phil Ashby

Ouch! I hope recovery was possible without excessive downtime...

In case people haven't met it already, the Refactoring Databases book by Pramod Sadalage has a number of strategies for this and other similar situations. The website has pretty much all you need to know, unless you like paper copies, and/or would like to support Pramod :)

databaserefactoring.com/

I personally like the use of migration scripts, tested & deployed with the code through the standard pipeline (perhaps as part of a management application). Decoupling when they are applied and leaving control of that to the appropriate ops team keeps the humans happy too.

Adam Kaplan

Unfortunately the biggest "mistake" is architectural - your systems seem to require an outage to ensure code and migrations are in place before bringing the system back online. This may not be the fault of anyone - legacy systems are what they are and may never support in-place updates.

Ideally your migration code sits in the same repo as your code, so the same version will deploy the right scripts alongside the correct code. With that in place your team can start building automation around your deployments.

˙˙˙ɹǝʌo ǝɯ dılɟ ʎǝɥ (davru)

You really should have a complete dev environment with a DB schema matching production to test out your migration on before deploying to production. And you should have automated unit tests to smoke test the code after the migration completed; the select * is just one of many things that could have gone wrong.

Darya

THIS! This has never happened to me when the dev env matches prod. And if it does? There's a SQL statement/proc etc. to ensure the tables match (if that's your db, of course).

This comment really struck a chord: you need to match your environments.

Fernando Calatayud • Edited

If you ask your Senior, he/she will probably tell you about the pain that not using * implies. Add a new field to the table, and you have to search for every single query that might need it. Call a method on your object, only to discover that this method depends (directly or indirectly) on fields you didn't retrieve. And it goes on and on... it's permanent maintenance hell, don't do it.

The good solution for this case is to prepare the code to work with or without this field; the good-enough solution for many cases (depending on how critical the affected requests are) is to be ready to trigger the deploy as soon as the migration is completed, accepting that some errors will be triggered in the meantime.
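
"Prepare the code to work with or without this field" can be as simple as a defensive read with a fallback. A tiny sketch (the column name and the default are invented):

```python
# Works whether or not the migration adding "language" has run yet:
# if the key is missing from the row, fall back to a default instead of crashing.

def language_of(row: dict) -> str:
    return row.get("language", "en")
```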

Take also into consideration that explicit field lists may save you problems when you add a new field (at a huge cost, as explained), but then you'll find yourself renaming or removing fields... and in those cases, your field lists not only won't save you the pain of the deploy, but will also add extra pain to the code changes.

Joshua Evans

Love the article! I don't have much experience with database migrations, so I will have to keep this stuff in mind

One thought I had, and this is really just me thinking out loud: with the ease of access to cloud database solutions, would simply creating a brand new database from scratch be effective? Add the changes and migrate the data, then when you are ready all you have to do is change the database endpoint and shut down the old db.

Thinking about this as I write, I don't think this would be viable with extremely large datasets - but there are managed solutions from AWS and others to help out with migrations, and some services don't charge for internal data exchanges (for instance, if both databases are on the same server).

Evaldas Buinauskas • Edited

Sounds good on paper, but just that.

Creating a new database and moving all the data can take hours, days, even weeks. Also, if the database is write-heavy, the data has to be synchronized right before the switch. This creates even more complexity.

I've been dealing with this for the past 4 years, and sometimes even a new column with a default value can become a headache.

Not to mention that databases usually have more than just data. Replication, subscription to data, CDC solutions, permissions and more. Stuff that's hard to automate or difficult to version.

DavidWhit

You never go full *

Ben Sinclair • Edited

You say "itโ€™s a series of SQL scripts thatโ€Šโ€”โ€Šwhen executed sequentiallyโ€Šโ€”โ€Šcreate the whole database schema from scratch (tables, columns, default values, constraints, etc.)."

This is almost exactly what database migrations are there to replace - they serve to move it from one state to another. For example, if the state with field A but not field B works with code release 1.2.3, and the state with field B but not field A works with code release 1.2.4, then there will be a migration 1.2.3 -> 1.2.4. If you have a proper migration library you'll also be able to migrate backwards from 1.2.4 -> 1.2.3, but this is sadly missing from a lot of implementations.

Migrations can work from scratch. On a new deployment, they will take you through each change incrementally until you're at the right one for the code. This could take a while and even involve adding and then removing fields more than once, depending on how crazy things got.

Nice post, though!

Luc Prรฉfontaine

Nope, the real problem is using a language/DAL/design that doesn't allow you to add stuff w/o rebuilding the whole castle.

I routinely add fields to databases w/o impacting running code. I remember having to prove to a customer that the latest database schema update could run with the current code, and that no data migration was required to run the latest release afterward. It's called good design, which should account for these types of potential problems well before your hair is on fire.

If you commit yourself to tools that are rigid, guess what? You end up with rigid apps. 😬 ORMs are a good example of a no-go for me after a single experiment. Any slight schema change could screw up your running app.

Any process control after that bad architecture decision is essentially trying to correct a problem that should not exist, i.e. a waste of time and money.

A bit of forward thinking can save a lot of Advil intake... 😬

Thorsten Hirsch

See, all of our retrieval queries were using the * (known as an asterisk or star), which means results will include every field on the table. This was the day I learned this is a very very very bad idea.

Yep. We introduced the rule to never use "select *" in our queries ~12 years ago. 😊

Tim Matteson

This was a good article and I know the topic well, having made the same mistakes myself. Database changes can be tricky, particularly if you don't have the right testing strategy and assets in place. It sounds like this mistake will be a good opportunity for you to propose some changes to mitigate this risk in the future. This is the process I always push for when the database has matured:

  • You NEED a dedicated DEV/TEST/QA environment (at least one, preferably all 3).
  • The DEV/TEST/QA environments are not on your desktop and are built entirely from your CI/build automation tool(s).
  • The DB for those environments is also refreshed from PROD (or a backup at the right version). This can get a little tricky since PROD may be some revisions behind your DEV/TEST/QA. Your build utility will apply the change scripts associated with your revisions to get the environment up to date when a refresh is performed (e.g. PROD is at 100, the refresh script applies changes for 101, 102, 103 post refresh, and now you have a DEV database with PROD data at version 103; see the runner sketch after this list).
  • If privacy/size constraints exist, your refresh script should handle purging that data and/or creating test data. Your goal is to have an environment matching production as closely as legally/realistically possible.
  • Your deployment includes a code change AND a db change script. These two are tightly coupled and must always be deployed together.
  • I'm assuming you always have a rollback strategy for code changes (if not, you need one) built into the deployment. Your database deploy script should also include a rollback strategy. In this example, that means you would be removing the column you just added and putting things back to where they were. Deployments should also have a pre-deployment backup performed as part of the script. This is your rollback of last resort should things go terribly wrong.
  • Deployments are done by a tool/automation or handed off to a third party with the right production access. No developer should be directly changing PROD. It sounds like you have this already.
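
A bare-bones sketch of that "PROD is at 100, apply 101, 102, 103" step. It assumes a schema_version table and numbered change scripts; the names and layout are assumptions, and real build/CI tools handle this bookkeeping for you.

```python
import sqlite3

def current_version(conn: sqlite3.Connection) -> int:
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0

def upgrade(conn: sqlite3.Connection, change_scripts: dict) -> None:
    """Apply every numbered change script above the database's current version.
    change_scripts maps a revision number to its SQL, e.g. {101: "...", 102: "..."}."""
    for version in sorted(change_scripts):
        if version > current_version(conn):
            conn.executescript(change_scripts[version])
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            conn.commit()
```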

With the above in place, your change would have been committed to source control, picked up in the DEV build and applied to the DEV database then tested by you. After that, it would move to QA (depending on that schedule, sometimes these would be a selection of revisions merged into QA from DEV targeted at the next release to PROD) and be tested there. Finally, it would be put into the PROD release and at that point, it would be fully vetted and tested from both the build automation perspective and functionality perspective. Can there still be issues? Of course. But from my experience, a setup like this will catch all but the most obscure and unpredictable bugs. It's a stricter schedule and some might argue it's not CI but it's still possible to do CI with this. Emergency changes may take a different route and be manually done but would still be tested in a dedicated environment.

Krupeshdhruv

Nice article Anabella! I learned the above points in my latest project. From that experience I would like to add one more point:

There should be a DB access layer, and all DB updates should go through its insert, update, or delete queries. This gives you control during a mixed-mode setup: you can update the insert and update stored procs to supply a default value for the new column during the migration.

Hannele • Edited

Usually when I need to do a DB thing that is too risky to perform as a migration (e.g. it needs to be timed for off-peak hours, or done in batches), we make it a script that you can run as a CLI tool, and then run it through a deployment tool like Rundeck.

Then, the infra team can look at it like any other code review, and you don't need to set up any special permissions to run it (the deployment tool already has them). Also, the developers who wrote the DB script are the ones who are fully responsible for running it and don't need any other ops involvement. Also, nobody is running bare SQL queries against your production database with the potential for copy/paste errors.
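
A stripped-down sketch of what such a CLI script might look like. SQLite and a users.language backfill stand in for the real database and the real risky change; the flags are just illustrative.

```python
import argparse
import sqlite3

def main() -> None:
    parser = argparse.ArgumentParser(description="Backfill users.language in batches")
    parser.add_argument("--db", default="app.db")
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--dry-run", action="store_true", help="report the work, change nothing")
    args = parser.parse_args()

    conn = sqlite3.connect(args.db)
    pending = conn.execute("SELECT COUNT(*) FROM users WHERE language IS NULL").fetchone()[0]
    print(f"{pending} rows to backfill")
    if args.dry_run:
        return

    while True:  # batched, so reviewers can see exactly how much is touched at once
        cur = conn.execute(
            "UPDATE users SET language = 'en' WHERE id IN "
            "(SELECT id FROM users WHERE language IS NULL LIMIT ?)",
            (args.batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break

if __name__ == "__main__":
    main()
```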

On the plus side of this incident, it sounds like there has been time to reflect on why things went wrong and how to address it! I hope you get the time to do so 👍 Following up on incidents is almost as important as being able to recover from them.

Robert Guico

Here I thought the actual problem at the break was deploying at 5:00p and risking staying overnight if something broke. 😀

But really, you do get more eyes on a production problem at 10:00a instead of 7p.

Diego Ramรญrez

Dear, let me tell you, this was delightful to read. Thank you for sharing this.

squidbe

We added the column in the database, but we never merged the code that handled this new data.
...
Not merging the code for the new fields wasn't the problem though.

I understand your point about not using "select *", but saying the problem wasn't a result of not merging the code is definitely debatable. Try using something like Flyway. It greatly reduces the risks associated with migrations. There are other benefits, too, which you can read about on their site.

Michael Kochell • Edited

ORMs use * to get all fields, I think it's a common practice. The branch with migration changes should be the same branch with code changes in my opinion, so there is no chance there will be one without the other when you deploy. I enjoyed reading your piece.

leob • Edited

What sort of baffles me is deploying a database change, but not the application code that's linked to those changes. Why would you want to do that? Can't remember having ever done something like that, app + DB = system, always.

Second thing is, not having a script in place to quickly roll back those changes. Mistakes are human but you should always have a "plan B" or a recovery plan in place.

"SELECT *" might not be really recommended as a 'best practice' but I don't see it as the root cause of the fiasco. The root cause is more "tunnel vision" (focusing on one aspect only and overlooking other ones).

Paul Isaris

Nice post and awesome writing style ;)

anabella

Thank you! 💞

Rafael Sales

I disagree with never using *. There are data access design patterns that play really well with *. So, one can easily blame the code that uses the result of the * query.

Anwar • Edited

Star is bad for 2 reasons:

  • You do not know what to expect from the query, and your code becomes harder to maintain because the fog of a star query makes changes (e.g. refactoring) more difficult.
  • You will fetch fields you do not care about whenever a migration adds new columns to the table, which is wasted bandwidth for sure.

Bruno Louzada

Really nice article!

David J Eddy

Great article anabella! Everyone messes up; it is the learning and knowledge we take away that makes you better and stronger. In this case it looks like you learned quite a bit.

GANESH

Awesome.

Damnjan Jovanovic

Excellent article, I appreciate the real-life example 👍🏻

CardosoCharmoso

Sounds like Go. Is there any other language that complains about mismatched columns-structs?

fluffy

This is a problem in any database binding that retrieves stuff as an array of values rather than as key-value pairs. This is the default behavior for many, many things, although it's generally a sign of micro-optimization when it's being used. (There are definitely systems where it makes a real performance difference though!)

Because of stories like these, I feel that positional row access is a mistake and you should always use key-value retrieval instead. Even if everything goes right in a migration, you don't really want to be relying on the order of columns in a table. (Which is another reason to avoid SELECT * as well.)
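
For anyone curious what key-value retrieval looks like in practice, here is a tiny self-contained example using sqlite3's row factory; most drivers and ORMs have an equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row               # rows now support access by column name
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, language TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ana', 'en')")

row = conn.execute("SELECT id, name, language FROM users").fetchone()
print(row["name"], row["language"])          # a column added later can't shift these values
```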

Maxim Nikolaev • Edited

I was lured in by the bar story about programmers here, but the story didn't turn out to be fun :-D Personally, I fear migrations. I've got no clue how OPS teams have the guts to run the scripts they receive.

Matthew O. Persico

Yes, never use *.

If your columns are nullable, you can add them before you install the corresponding code.