On 30 August, we encountered an issue that prevented users from reaching the platform.
Duration: 9 am to 3 pm
Detection: first alert or first ticket
Status page: update time
Impact: the whole website was slow or unreachable; users could not connect to the application
The issue was quickly traced to the database: the number of locks held in ACCESS SHARE mode kept growing.
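Lock growth like this can be observed directly in PostgreSQL's `pg_locks` view. A diagnostic query along these lines (illustrative, not the exact one used during the incident) shows the count per lock mode:

```sql
-- Count locks by mode and granted state; a steadily rising
-- AccessShareLock count is the symptom described above.
SELECT mode, granted, count(*) AS n
FROM pg_locks
GROUP BY mode, granted
ORDER BY n DESC;
```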
It turned out that some locks on the database were never released, blocking part of the other queries. Queries piled up behind those locks, and the backend started to slow down. Autoscaling began starting new containers, but since the bottleneck was the database, this did not help.
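The queue of stacked queries can be inspected with `pg_stat_activity` and `pg_blocking_pids` (available from PostgreSQL 9.6). A query of this shape, shown here as an illustrative sketch, lists each waiting backend with the sessions blocking it:

```sql
-- Waiting backends and the PIDs holding the locks they need.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - query_start   AS waiting_for,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```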
The last production deployment was 10 days earlier, so the issue did not come from a new deployment or modification.
A full rolling restart of the containers brought us back to a normal situation, but after several minutes the issue reappeared.
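On Kubernetes, a rolling restart of this kind is typically done per deployment; the commands below are a generic sketch, with the namespace and deployment names as placeholders rather than the actual ones used:

```shell
# Hypothetical names: replace namespace and deployment with your own.
kubectl -n production rollout restart deployment/backend
kubectl -n production rollout status deployment/backend   # wait for completion
```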
The unreleased locks came from transactions that were never closed. After further investigation, we found that the transactions stayed open because of network issues: connections between the containers on one specific node were failing, leaving clients and server waiting for each other. This network instability had started after a GKE cluster upgrade and was getting worse day after day.
Draining the node so that the containers restarted on a new node fixed the network issues.
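A drain of this kind usually takes the shape below; the node name is a placeholder, not the actual node from the incident:

```shell
# Replace <node-name> with the affected GKE node.
kubectl cordon <node-name>                     # stop new pods landing on the node
kubectl drain <node-name> --ignore-daemonsets  # evict pods so they reschedule elsewhere
```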