I’m running a cron script that checks the database for work and executes anything that needs to be done. It does this across ~500 customers per minute. We’re on AWS RDS with a 16 vCPU instance, which until recently has been more than enough to keep things happy (CPU normally sits under 20%).
This weekend we updated customers to the latest version of the code and implemented some tooling, and since then we’ve started seeing these huge waits:
Further, I’m seeing that about half of our busiest queries are EXPLAIN statements, as illustrated here:
Nowhere in our code base is an "EXPLAIN" performed (though we are using AWS RDS Performance Insights, ProxySQL, and New Relic for monitoring). I also noticed that in the past week our DB connection count, previously baselined around 10, has climbed to roughly 90.
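For reference, this is roughly how I’ve been trying to confirm how much of the load the EXPLAINs account for. Since Performance Insights aggregates by statement digest, filtering the server’s digest summary for statements starting with EXPLAIN should surface them (a sketch; the sample rows below are made up, not real output):

```python
# Sketch: rank EXPLAIN digests by execution count. Server-side, the
# equivalent query would be something like:
#   SELECT DIGEST_TEXT, COUNT_STAR
#   FROM performance_schema.events_statements_summary_by_digest
#   WHERE DIGEST_TEXT LIKE 'EXPLAIN%'
#   ORDER BY COUNT_STAR DESC;
# Here I just filter already-fetched digest rows (hypothetical examples).
def explain_digests(digest_rows):
    """Return EXPLAIN digests sorted by how often they ran."""
    rows = [r for r in digest_rows
            if r["digest_text"].upper().startswith("EXPLAIN")]
    return sorted(rows, key=lambda r: r["count_star"], reverse=True)

# Hypothetical sample of what the digest table might contain:
sample = [
    {"digest_text": "SELECT * FROM jobs WHERE ...", "count_star": 1200},
    {"digest_text": "EXPLAIN SELECT * FROM jobs WHERE ...", "count_star": 1100},
]
print(explain_digests(sample))
```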
Any ideas on where I should be digging to find the cause of these waits and EXPLAIN statements? And could whatever is issuing them also account for the large number of open connections?
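For what it’s worth, my next step is to attribute the ~90 connections by user and host, to see whether they belong to ProxySQL’s backend pool, a monitoring agent, or the cron workers themselves. A minimal sketch of that tally (the processlist rows below are invented placeholders, not real data):

```python
# Sketch: group open connections by (user, host) to see who owns them.
# The rows would come from something like:
#   SELECT user, host FROM information_schema.processlist;
# run via the mysql client or any driver.
from collections import Counter

def connections_by_origin(processlist_rows):
    """Tally connections by (user, host-without-port)."""
    return Counter(
        (row["user"], row["host"].split(":")[0])
        for row in processlist_rows
    )

# Hypothetical sample of what the processlist might return:
sample = [
    {"user": "app", "host": "10.0.0.5:4431"},
    {"user": "app", "host": "10.0.0.5:4432"},
    {"user": "monitor", "host": "10.0.0.9:5100"},
]
print(connections_by_origin(sample))
```

If most of the count lands on one user/host pair, that should narrow down which of the new tools opened the extra connections.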