DEV Community

Suraj Sangani

Admitting your mistakes and taking ownership as a Software Engineer

As a Software Engineer who has been in the industry for 10+ years, it would be foolish to say that I have not made any mistakes. I thought it would be worthwhile to write down some of my experiences of making those mistakes and how I rose above them to become a better software developer. While those mistakes look avoidable in hindsight, I do not intend to dive into why they were made; I would like to dive into how I grew from them.

During my time at Amazon, I was under immense time and performance pressure. We had a deliverable that was running 3 months behind schedule, and management had put a lot of pressure on everyone to get the project back on track. My manager was doing a great job of shielding the software engineers from those time pressures, but we were still aware of the mounting pressure from senior leadership. This stress put me in a funk, and I shipped two changes in consecutive months that caused outages.

To ship code changes, Amazon has its own build pipelines that enable engineers to stagger their code changes from their local development environments to the production systems serving millions of customers. These pipelines could be fully automated, semi-automated, or manually triggered. For the manually triggered pipelines, engineers were required to write a change release document describing the changes going out, the metrics to be monitored during the release, the testing performed for those changes, and the rollback criteria. Despite writing those documents and having them peer reviewed by other engineers, I somehow managed to ship back-to-back changes that caused outages. These outages led to an all-hands-on-deck situation with multiple senior engineers involved.

During these outages, I had two options: a) sit back and let the on-call engineer take responsibility, or b) join the outage calls and help out wherever possible. It's easy to take the first option. As a matter of fact, I have seen many engineers take it. It can also happen that the change you deployed does not break immediately and only breaks after a while. The on-call engineer debugs the issue, mitigates it by rolling back the change or rolling forward with a fix, and everyone moves on.

I opted for the latter option despite being backlogged with other tasks. It meant that I spent far more time working outside my normal hours, because I could not forgo my responsibility towards the tasks I had already committed to. Many people might balk at the idea of working outside hours and taking on additional responsibility. However, my thought process throughout my career has always been as follows: “When I am shipping a piece of software, no matter the size of it, it has my name attached to the software and therefore I am responsible for how it performs.”

There are a few advantages to taking the latter option. Incident calls provide a great opportunity to learn how changes perform in production and how other systems react to your changes. Taking that responsibility also gets noticed by everyone who joins the incident calls, which can sometimes include senior leadership, earning you more brownie points. Learning opportunities are plentiful in any software development organization, but picking the right ones is critical, and incident calls are one of them.

I wouldn’t say I did everything correctly though. When the outage initially hit, even though I noticed that an incident had occurred, I did not bother to check what caused it. This was despite knowing that the outage happened in the exact same system I was working on. The on-call engineer who was assigned the incident started looking into it and looped in my manager, who in turn pinged me about the incident. After seeing those messages, I had a sinking feeling in my stomach, but I decided to take responsibility and join the incident call.

Later in my career, at Galileo, I recently had a change that unfortunately broke in production. This happened because of my lack of experience working in C code and not understanding how the code change could introduce memory leaks. In this particular situation, it was very apparent that it was my code that broke. I had learned a lot from what I had experienced previously in my career. I immediately notified both my manager and the on-call engineer in a group chat that there was a high probability that the change I had deployed was the reason for the outage. The on-call engineer acknowledged this, started looking into the recent deployment, and caught the bug that had been deployed. I helped out wherever needed to share the responsibility, and we were able to do a quick rollback to ensure there was no client impact.

In both situations, I took responsibility for my changes; however, my managers' responses were slightly different. Earlier in my career, my manager would probably have held the mistake against me had it been repeated, but in the second situation, my manager acknowledged that bugs are part and parcel of deploying code changes and would not hold the mistake against me.

I will leave you with the following advice, which I received very early on in my career and which has helped me a lot: “People are programmed to remember how you handle mistakes, not your successes.”
