Issue on Production
Incident Report for Kili
Postmortem

Post Mortem

Production issue 05/12/2022 - 12/12/2022

Summary

During the week of 5 December 2022, we encountered a slowness issue on the platform.

Incident Timing (UTC+1)

05/12 11:24 am to 9:24 pm => 12/12 3:20 pm to 5:57 pm

Incident Timeline (UTC+1)

05/12/2022

  • First alert or first ticket

    • Internally at 11:21 am
  • First Announcement 

    • Status page incident creation at 11:24 am
  • Status page update time

    • 12:00 pm Investigation in progress: We are currently investigating this issue.
    • 12:23 pm Situation under control: The issue has been identified and a fix is being implemented.
    • 04:18 pm Incident set to stable: We are continuing to monitor for any further issues.
    • 09:24 pm Incident resolved: This incident has been resolved.

06/12/2022

  • First alert or first ticket

    • Internally at 03:17 pm
  • First Announcement 

    • Status page incident creation at 03:20 pm
  • Status page update time

    • 03:25 pm Situation under control: The issue has been identified and a fix is being implemented.
    • 05:00 to 05:10 pm Scheduled maintenance intervention announcement: The platform will be unavailable for 10 min.
    • 06:06 pm Incident resolved: This incident has been resolved.

08/12/2022

  • First alert or first ticket

    • Internally at 1:10 pm
  • First Announcement 

    • Status page incident creation at 01:16 pm
  • Status page update time

    • 09/12 08:49 am We are continuing to investigate this issue.
    • 09/12 01:37 pm We are continuing to investigate this issue.
    • 12/12 11:50 am A fix has been implemented and we are monitoring the results.
    • 12/12 12:00 pm We are continuing to monitor for any further issues.
    • 12/12 05:57 pm Incident resolved: This incident has been resolved.

Actions

05/12/2022

Index optimisations

06/12/2022

Maintenance operation on our database (increasing memory and creating indexes)

11/12/2022

Vacuumed the asset table to improve performance
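
As a rough illustration of what a dead-row check could look like: assuming the database is PostgreSQL (the report does not name the engine, but vacuuming and dead rows are PostgreSQL concepts), per-table live and dead row estimates are exposed in the `pg_stat_user_tables` view. The Python sketch below flags tables whose dead-row ratio exceeds a threshold. The 20% threshold, the function name, and the sample figures are illustrative assumptions, not values from the incident.

```python
from typing import Iterable, List, Tuple

# Query that would feed the check on PostgreSQL (assumption: the report
# does not name the database engine).
DEAD_ROWS_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables;
"""


def tables_needing_vacuum(
    stats: Iterable[Tuple[str, int, int]],
    max_dead_ratio: float = 0.2,  # hypothetical threshold
) -> List[Tuple[str, float]]:
    """Return (table, dead_ratio) for tables above the threshold, worst first."""
    flagged = []
    for table, live, dead in stats:
        total = live + dead
        if total == 0:
            continue  # empty table, nothing to report
        ratio = dead / total
        if ratio > max_dead_ratio:
            flagged.append((table, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up sample rows in the shape returned by DEAD_ROWS_QUERY.
sample = [("asset", 6_000_000, 4_000_000), ("project", 1_000, 10)]
flagged_tables = tables_needing_vacuum(sample)
```

With the sample rows, only the `asset` table is flagged. In practice the rows would come from running the query through whatever database client is already in use.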

12/12/2022

Released a hotfix to fix the TCP memory leak.

End-User Impact

Global slowness on the platform could cause delays on the replica, leading to update issues and very slow access to the platform.

Users could not access the platform because of an authentication issue.

What caused the incident?

A script created several million assets at the same time.

The load was multiplied by 10.

The platform handled the creation, but slowness was felt while it scaled up.

Heavy SQL queries left behind zombie connections, which triggered a memory leak in the kernel's TCP stack and impacted all the k8s nodes.
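
To make the TCP memory symptom more concrete, here is a minimal Python sketch of how it could be observed on a Linux node. It parses the contents of `/proc/net/sockstat`, where the `mem` field of the `TCP` line is counted in pages, and compares it with the pressure threshold from `/proc/sys/net/ipv4/tcp_mem` (three page counts: low, pressure, high). The function names, the choice of the pressure threshold as the alarm level, and the sample values are assumptions for illustration, not the check that was actually deployed.

```python
from typing import Dict


def parse_sockstat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse /proc/net/sockstat content into {protocol: {field: value}}."""
    stats: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        proto, _, rest = line.partition(":")
        fields = rest.split()
        # Fields are alternating name/value pairs, e.g. "inuse 45 mem 10".
        stats[proto.strip()] = {
            name: int(value) for name, value in zip(fields[0::2], fields[1::2])
        }
    return stats


def tcp_memory_alarm(sockstat_text: str, tcp_mem_text: str) -> bool:
    """True when TCP memory (in pages) is at or above the pressure threshold."""
    used_pages = parse_sockstat(sockstat_text)["TCP"]["mem"]
    # tcp_mem holds three page counts: low, pressure, high.
    _low, pressure, _high = (int(v) for v in tcp_mem_text.split())
    return used_pages >= pressure


# Made-up sample contents; on a node these would be read from the two files.
SAMPLE_SOCKSTAT = """sockets: used 1520
TCP: inuse 310 orphan 42 tw 95 alloc 1200 mem 98000
UDP: inuse 4 mem 2
"""
SAMPLE_TCP_MEM = "46000 61000 92000"
alarm = tcp_memory_alarm(SAMPLE_SOCKSTAT, SAMPLE_TCP_MEM)
```

The `orphan` field on the same line counts TCP sockets no longer attached to any process, so it may also be worth watching alongside `mem`.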

Corrective elements put in place to ensure that this does not happen again

  • Create an alarm on excessive dead rows
  • Rate limiting
  • Create an alarm on TCP kernel memory (/proc/net/sockstat)
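
The report does not say how rate limiting is implemented. As one common approach, the sketch below shows a token bucket in Python: it allows a short burst up to `capacity` requests, then admits requests at a steady `rate` per second. The class name, parameters, and the injectable `clock` (used here so the example runs deterministically) are all hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, then refill at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return whether the request passes."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill according to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Deterministic demo with a fake clock (illustrative values only).
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=5.0, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(7)]       # first 5 pass, last 2 are rejected
fake_now[0] += 1.0                               # one second later: 2 tokens refilled
after_wait = [bucket.allow() for _ in range(3)]  # 2 pass, 1 is rejected
```

A limiter like this would typically be keyed per API key or per user, so that one bulk script cannot consume the capacity shared by everyone else.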
Posted Dec 14, 2022 - 15:20 UTC

Resolved
This incident has been resolved.
Posted Dec 12, 2022 - 16:57 UTC
Update
A fix has been implemented and we are monitoring the results.
Posted Dec 12, 2022 - 13:08 UTC
Update
We are continuing to monitor for any further issues.
Posted Dec 12, 2022 - 11:01 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 12, 2022 - 10:50 UTC
Update
We are continuing to investigate this issue.
Posted Dec 12, 2022 - 09:14 UTC
Update
We are continuing to investigate this issue.
Posted Dec 09, 2022 - 14:37 UTC
Update
Dear all,

We deeply apologize for the inconvenience caused by this incident. Our teams are mobilized to resolve all the issues as soon as possible.

We are sorry that it has affected your work on Kili.

Kind regards,

Kili support team
Posted Dec 09, 2022 - 11:10 UTC
Update
We are continuing to investigate this issue.
Posted Dec 09, 2022 - 08:49 UTC
Update
We are continuing to investigate this issue.
Posted Dec 08, 2022 - 18:07 UTC
Investigating
We are currently investigating this issue.
Posted Dec 08, 2022 - 14:16 UTC
This incident affected: Europe (Kili API).