Connection creation correlation with slow response times - in Azure WebApp

We were experiencing a strange performance issue with one of our API's running in an Azure Web App. None of the usual metrics were spiking in correlation with the response time spikes, CPU, memory, even request count didn't really correlate.

Then I noticed a distinct correlation with, Connections.

Red line = Response Time, blue line = Connections

I had never seen this link before, so first I had to figure out exactly what "Connections" meant. Essentially it means Connections out, in our case a connection to an SQL Server.

This leads to the question, how are we creating the connections...and are we doing something inefficient? Looking in the code, this is how we get a new connection:

new System.Data.SqlClient.SqlConnection("valid-sql-connection-string")

Nothing tricky there, at first I thought maybe we were creating a new connection every time and we could help the issue by re-using connections. That connection re-use actually happens automagically under the hood in the SqlClient library via pooling.

One key point from the docs for me was:

The connection pooler removes a connection from the pool after it has been idle for approximately 4-8 minutes

Our spikes were happening every 15 min (it's an api, there must be an automated process running every 15 min), so (as you can see in the metrics graph) after a short period of time all of those connections that were created, get cleaned up and the pool drops to the minimum.

Which leads to the next question, "Can I increase that minimum connection count?"

Turns out that's really easy by adding Min Pool Size to the connection string.

So set Min Pool Size=70 in the connection string. Result, it still needs to create more connections sometimes but those spikes were significantly flattened. In fact, looking in App Insights, our average response times for one endpoint went from >10s (yes, seconds) to <800ms!

Creating connections is expensive (I found this SO useful), but having never seen this correlation before, why did it impact this service so significantly and not others? I believe it is due to the fact that the DB is in a different region to the service (most of our other services are in the same region). That slight increase in latency (guessing ~100ms) blows out the connection creation time on a service as busy as this one.