EUROPE - Production Issue

Incident Report for Kili

Postmortem

Production issue February 12th, 2025 

Summary: 

On February 12, 2025, at approximately 1:25 PM UTC, the platform became unresponsive after an `updatePropertiesInAssets` operation on a large number of assets generated an excessive number of transactions. This transaction load overloaded the backend and made the application unresponsive for users.

The incident was declared at 1:25 PM UTC and stability was restored at 2:16 PM UTC, resulting in 51 minutes of customer impact. Full resolution was completed at 7:03 PM UTC, with a total resolution time of 5 hours and 38 minutes.

Incident Timeline: 

Incident Detection

Time: 1:23 PM UTC

Description: Platform unavailability was detected internally.

Incident Declaration

Time: 1:25 PM UTC

Description: An application unresponsiveness incident was opened on the status page.

Customer Impact Start

Time: 1:25 PM UTC

Description: The application became unresponsive for users.

Stable State Achieved

Time: 2:16 PM UTC

Description: Mitigation efforts restored application responsiveness, and the system was declared stable. The time to mitigate was 51 minutes.

Customer Impact End

Time: 2:16 PM UTC

Description: The application became responsive again for users. Users were informed that services were back to normal.

Issue Identified

Time: 4:22 PM UTC

Description: The root cause of the issue was identified, and the team started implementing a fix.

Resolution

Time: 7:03 PM UTC

Description: The underlying issue was fully resolved, and all systems were confirmed to be fully operational. The time to resolution was 5 hours and 38 minutes.

End-User Impact: 

The application was unresponsive for users for 51 minutes, from February 12, 2025, 1:25 PM UTC to February 12, 2025, 2:16 PM UTC.

What caused the incident?

An `updatePropertiesInAssets` operation with a large number of assets resulted in an excessive number of blocked transactions. This overloaded the backend, making it unresponsive.

Corrective measures put in place to ensure that this does not happen again

We optimized transaction handling to prevent these transactions from being blocked and, in turn, from overloading the backend.
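For users who run bulk updates through client scripts, one general way to keep the load of a single operation bounded is to split a large list of assets into smaller batches. The sketch below is only an illustration of that idea and does not describe the fix deployed on the platform. The names `chunked`, `update_in_batches`, `apply_update`, and the default batch size of 500 are hypothetical and chosen for the example; `apply_update` stands in for whatever call performs the actual update.

```python
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def update_in_batches(
    asset_ids: Sequence[str],
    apply_update: Callable[[Sequence[str]], None],  # hypothetical update call
    batch_size: int = 500,  # illustrative value, not a recommended limit
) -> int:
    """Call `apply_update` once per batch and return the number of batches sent."""
    batches = 0
    for batch in chunked(asset_ids, batch_size):
        apply_update(batch)
        batches += 1
    return batches
```

With this approach, each call touches a bounded number of assets, so a single oversized request is replaced by several smaller ones that can be sent sequentially.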

We sincerely apologize for the inconvenience caused by this incident. Our team is committed to improving system reliability.

Thank you for your patience and understanding,

Kili Team

Posted Feb 17, 2025 - 13:43 UTC

Resolved

This incident has been resolved.
Posted Feb 12, 2025 - 19:07 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Feb 12, 2025 - 19:06 UTC

Update

We are continuing to work on a fix for this issue.
Posted Feb 12, 2025 - 16:33 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Feb 12, 2025 - 14:29 UTC

Investigating

We are currently investigating this issue.
Posted Feb 12, 2025 - 13:40 UTC

Update

We are continuing to monitor for any further issues.
Posted Feb 12, 2025 - 13:31 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Feb 12, 2025 - 13:31 UTC

Investigating

Dear all,

We would like to inform you that we are currently experiencing a production incident that is impacting our services.
We apologize for any inconvenience this may cause.

We are working diligently to resolve this issue and restore our services as soon as possible. We will continue to provide updates on our status page https://status.kili-technology.com/

Thank you for your understanding and patience during this time.

Sincerely,

Kili Support Team
Posted Feb 12, 2025 - 13:25 UTC
This incident affected: Europe (Europe - Kili Frontend).