James Won

Why you should create new GraphQL queries: How I caused an incident expanding an existing query and the lessons learned

I started a new job about a month ago, and my new company uses GraphQL.

I'm fairly new to GraphQL. I had dabbled with it in a personal project years ago, but I hadn't used it in a professional context.

My impression over the past few weeks is that it is an amazing technology, and I especially love developer-friendly features like introspection, which make it quick to spot backend API changes. I have never liked global state management systems like Redux, so the caching that GraphQL client libraries provide really is a breath of fresh air.

I didn't have any major issues using it over the past few weeks, so I thought I was in the clear.

Then I caused an incident.

The issue

The funny thing was that the incident was caused in production by feature-flagged work.

In a nutshell:

I expanded an existing GraphQL query to fetch additional data. The query was used in only one other place, but that existing usage caused CPU utilisation in the backend to skyrocket, which led to (1) queries placing a huge load on the backend and (2) query response times increasing by a factor of up to 10-20x.

Dissecting the issue a bit more, three front-end factors made the problematic code so potent:

  1. Potentially large data: The attributes that I added to the query returned nested arrays of data. Depending on the complexity of the query subject, this meant huge responses coming back for the existing usage of the query.

  2. Polling: I didn't know it at the time, but the query in question was being aggressively polled to refresh stale data.

  3. Existing usage of the query: The page where this query was already used is one of our most popular pages, and one where any given user can have several copies open in tabs at once. This amplified the issues in (1) and (2) many times over.
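To make the amplification concrete, here is a back-of-the-envelope sketch of how the three factors multiply. Every number in it (user count, tabs, poll interval, response sizes) is made up for illustration and is not taken from the actual incident.

```typescript
// All values are hypothetical; they only illustrate how the factors multiply.
interface PollingLoad {
  activeUsers: number;     // users with the page open
  tabsPerUser: number;     // tabs each user keeps open
  pollIntervalSec: number; // seconds between polls of the query
  responseKb: number;      // size of one response, in kilobytes
}

// Each open tab fires one request per poll interval.
function requestsPerMinute(load: PollingLoad): number {
  return load.activeUsers * load.tabsPerUser * (60 / load.pollIntervalSec);
}

// Total data the backend must assemble and send each minute.
function kilobytesPerMinute(load: PollingLoad): number {
  return requestsPerMinute(load) * load.responseKb;
}

// Before: the original, small query.
const before: PollingLoad = {
  activeUsers: 500,
  tabsPerUser: 3,
  pollIntervalSec: 10,
  responseKb: 5,
};

// After: same polling, but each response now carries nested arrays.
const after: PollingLoad = { ...before, responseKb: 100 };

console.log(requestsPerMinute(before));  // 9000
console.log(kilobytesPerMinute(before)); // 45000
console.log(kilobytesPerMinute(after));  // 900000
```

The request count does not change at all between the two cases. Only the cost of each request does, which is why the change looked harmless from the front end.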

Triaging

By the time this reached my attention, the problem was well dissected and my PR had been identified as the cause. Lucky for me, our company has a great incident management system, which allowed the problem to be identified quickly and incisively.

While the end-user impact of the issue was a slightly degraded experience, this was a potentially critical issue that needed to be dealt with before higher traffic hit our application.

When I jumped on a call with a staff engineer and discussed the data, I realised almost instantly what had happened: every poll of my expanded query was making the backend assemble large, deeply nested responses, and that load was eating up its CPU.

Solution

We quickly made a plan to revert my change to the existing query and to split off the new behaviour to a new query that would be fully cordoned off from production. The new query would not be polled.
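As a rough sketch of what that split can look like, the snippet below keeps the original query untouched and puts the nested data in a separate, unpolled query. The schema (`project`, `tasks`, `assignees`, `comments`), the query names, the poll interval, and the `QueryPlan` shape are all hypothetical stand-ins, not our real code or any particular client library's API.

```typescript
// Hypothetical schema and names, used for illustration only.

// Existing query: left exactly as it was, so the polled page is unaffected.
const PROJECT_SUMMARY_QUERY = `
  query ProjectSummary($id: ID!) {
    project(id: $id) {
      id
      name
      status
    }
  }
`;

// New query: carries the nested data that only the feature-flagged view needs.
const PROJECT_TASK_DETAILS_QUERY = `
  query ProjectTaskDetails($id: ID!) {
    project(id: $id) {
      id
      tasks {
        id
        assignees { id name }
        comments { id body }
      }
    }
  }
`;

interface QueryPlan {
  name: string;
  document: string;
  pollIntervalMs: number | null; // null means fetch once and never poll
}

const plans: QueryPlan[] = [
  { name: "ProjectSummary", document: PROJECT_SUMMARY_QUERY, pollIntervalMs: 10_000 },
  { name: "ProjectTaskDetails", document: PROJECT_TASK_DETAILS_QUERY, pollIntervalMs: null },
];

// Crude text search (not a real GraphQL parser): which polled queries mention a field?
function polledQueriesSelecting(field: string, all: QueryPlan[]): string[] {
  const pattern = new RegExp("\\b" + field + "\\b");
  return all
    .filter((plan) => plan.pollIntervalMs !== null && pattern.test(plan.document))
    .map((plan) => plan.name);
}

console.log(polledQueriesSelecting("tasks", plans));  // []
console.log(polledQueriesSelecting("status", plans)); // ["ProjectSummary"]
```

The point of the small helper at the end is that the expensive nested fields never appear in anything that is polled, so the new feature cannot change the load profile of the existing page.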

I actioned this and we immediately noticed the CPU utilisation drop back to normal.

Incorrect assumptions

In retrospect, the problem seems so obvious. However, at the time I naively thought I was making the right decision by expanding the existing query:

  • The new data was of a type consistent with what the existing query already fetched, so I mistakenly thought the query could simply be enhanced with the new attributes. I had a rough idea of how GraphQL caching worked, so I thought it would be great to fetch this data earlier, in the existing usage, so that a subsequent query for it would not be required.

  • I didn't realise we polled so aggressively. In most products I have worked on in the past, polling was used rarely and only where absolutely necessary, and I assumed that was the case here. I neglected to check this assumption before checking in the code.

Lessons

A couple of lessons that I took away from this experience:

  1. Make new GraphQL queries by default. Where there is a new usage, creating a new query by default makes sure that unintended consequences for existing usages, like the one here, do not happen. This is especially important for feature-flagged features: no feature behind a feature flag should ever affect existing production features or behaviour. Fight the temptation to tack attributes on to existing queries.

  2. Test new code thoroughly in staging and inspect the GraphQL query requests and responses. I did check, but only with subjects with small data, and definitely not for long enough to see the impact of polling. If I had (1) tested a subject with a large amount of data and (2) waited for the polling to kick in, I would've realised pretty quickly that the combination of frequent polling and large response sizes would cause issues.

  3. Beware of polling. Don't use it unless required, and be especially wary of altering existing queries that are polled.
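For the staging check in the second lesson, even a very rough comparison of response sizes between a small subject and a large one would have flagged the problem. The sketch below is one way to do that. The `makeSubject` fixture and its field names are hypothetical, and counting characters of the JSON form is only a proxy for the real size on the wire.

```typescript
// Rough size of a response: number of characters in its JSON form.
// This is only a proxy for bytes on the wire.
function payloadSize(response: unknown): number {
  return JSON.stringify(response).length;
}

// Hypothetical fixture: a subject with `taskCount` tasks, each carrying a nested array.
function makeSubject(taskCount: number) {
  return {
    id: "subject-1",
    tasks: Array.from({ length: taskCount }, (_, i) => ({
      id: `task-${i}`,
      comments: Array.from({ length: 10 }, (_, j) => ({
        id: `comment-${i}-${j}`,
        body: "example comment",
      })),
    })),
  };
}

// Compare the kind of subject I tested with against the kind I should have tested with.
const smallSize = payloadSize(makeSubject(2));
const largeSize = payloadSize(makeSubject(500));

console.log({ smallSize, largeSize, ratio: Math.round(largeSize / smallSize) });
```

In a real staging check, the two values would come from actual responses in the browser's network tab rather than from a fixture, and the large one is the number to multiply by the polling frequency.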

Power of GraphQL: Buffet of queries

A huge lesson for me was to be precise with GraphQL queries. It is vital to make sure that your query is targeted and gets only what you need.

One of our leading front-end engineers put it perfectly by likening using GraphQL to going to a buffet.

You shouldn't try to get all the food (a.k.a. attributes) you want on one plate (a.k.a. query). There is absolutely no need, as the GraphQL client handles the caching of the data received. Instead, be incisive: ask only for the fresh data you need for the specific purpose of that query, at the moment you need it.

Conclusion

So, long story short, I caused an incident by tacking new data on to an existing query. Lucky for me, I only caused a slightly degraded experience for users for a short period of time, and it was a huge learning experience that I will take to heart.

I learned massive lessons about using GraphQL, the biggest being to make a new query each time you have a new usage that consumes data from GraphQL.
