Issue on Production

Incident Report for Kili

Postmortem

Production issue 30/08/2022

Summary

On 30 August, we encountered an issue that made the platform slow or unreachable for users.

Incident Timing (UTC+2)

9 am to 3 pm

Incident Timeline (UTC+2)

30/08/2022

First alert or first ticket
- Internally at 9 am
First Announcement
- Status page incident creation at 9:01 am
Status page update time
- 11:19 am Investigation in progress: We are currently investigating this issue.
- 1:35 pm Situation under control: The issue has been identified and a fix is being implemented.
- 2 pm Incident set to stable: A fix has been implemented and we are monitoring the results
- 6 pm Incident resolved: This incident has been resolved.

Actions:
- 9:46 am, 9:55 am and 10 am: the service was restored, but only for a few minutes each time
- 10:36 am to 10:56 am: the service was slow but available
- 11:23 am to 1:13 pm: the service response time was improving
- 1:13 pm to 1:42 pm: the service was slow or unresponsive
- 1:42 pm: the service returned to normal response time
- 2 pm: the incident was set to "fix implemented and monitoring"

End-User Impact

The entire website was slow or unreachable.

Users could not connect to the application.

What caused the incident?

The issue was quickly traced to the database, where the number of ACCESS SHARE locks was growing.

Some locks on the database were not being released, which blocked some of the other queries. The blocked queries piled up, and the backend began to slow down. Autoscaling started new containers, but since the problem was in the database, this did not help.
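
One way to watch for this pattern is to sample the lock count per mode at regular intervals and flag a count that keeps rising. The sketch below is an illustration only, not the tooling used during the incident. The query relies on the standard PostgreSQL `pg_locks` view, where ACCESS SHARE locks appear with the mode name `AccessShareLock`; the function `is_growing`, the sample format and both thresholds are assumptions.

```python
from typing import Sequence

# pg_locks is a standard PostgreSQL system view. Run this query with any
# PostgreSQL client and keep the count for the mode of interest
# (for example 'AccessShareLock') at each sampling interval.
LOCK_COUNT_QUERY = """
SELECT mode, count(*) AS lock_count
FROM pg_locks
GROUP BY mode
"""


def is_growing(
    counts: Sequence[int],
    min_samples: int = 3,    # hypothetical: how many recent samples to check
    min_increase: int = 50,  # hypothetical: minimum rise over those samples
) -> bool:
    """Return True when the last `min_samples` counts rise strictly
    and the total rise is at least `min_increase`."""
    if len(counts) < min_samples:
        return False
    recent = counts[-min_samples:]
    strictly_rising = all(a < b for a, b in zip(recent, recent[1:]))
    return strictly_rising and (recent[-1] - recent[0]) >= min_increase
```

Requiring several consecutive rising samples avoids alerting on a single short burst of queries; the thresholds would need tuning against normal traffic.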

The last production update was 10 days earlier, so the issue did not come from a new deployment or modification.

A full rolling restart of the containers restored normal service, but the issue returned after several minutes.

The unreleased locks came from transactions that were never closed. After further investigation, we found that the transactions were left open because of network issues: connections between the containers on one specific node were unstable, leaving clients and the server waiting for each other. This network instability started after a GKE cluster update and worsened day after day.

Draining the node, so that its containers started on a new node, fixed the network issues.

Corrective elements put in place to ensure that this does not happen again

- Improve monitoring to detect unreleased locks sooner
- Improve vacuum monitoring on PostgreSQL
- Add a new alert on the oldest transaction for each database
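
As an illustration of the oldest-transaction alert, the sketch below computes the age of the oldest open transaction per database and reports the databases that exceed a threshold. It is not the alert actually deployed. The `pg_stat_activity` view and its `datname` and `xact_start` columns are standard PostgreSQL; the function name, the row format and the 5-minute default threshold are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Tuple

# Sessions with an open transaction have a non-null xact_start.
OPEN_TRANSACTIONS_QUERY = """
SELECT datname, xact_start
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND datname IS NOT NULL
"""

Row = Tuple[str, datetime]  # (datname, xact_start), as returned by the query


def oldest_transaction_alerts(
    rows: Iterable[Row],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # hypothetical threshold
) -> Dict[str, timedelta]:
    """Map each database whose oldest open transaction is older than
    `max_age` to the age of that transaction."""
    oldest: Dict[str, datetime] = {}
    for datname, xact_start in rows:
        # Keep the earliest transaction start seen for each database.
        if datname not in oldest or xact_start < oldest[datname]:
            oldest[datname] = xact_start
    return {
        db: now - start
        for db, start in oldest.items()
        if now - start > max_age
    }
```

A transaction that stays open far longer than normal queries take is a sign that a client has stopped responding while still holding its locks, which is the situation described in this postmortem.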

Posted Sep 02, 2022 - 10:56 UTC

Resolved

This incident has been resolved.

Posted Aug 31, 2022 - 07:44 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Aug 30, 2022 - 13:04 UTC

Update

We are continuing to work on a fix for this issue.

Posted Aug 30, 2022 - 11:38 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Aug 30, 2022 - 11:35 UTC

Update

We are continuing to investigate this issue.

Posted Aug 30, 2022 - 10:25 UTC

Investigating

We are currently investigating this issue.

Posted Aug 30, 2022 - 09:19 UTC

This incident affected: Europe (Europe - Kili API).