Production Issue

Incident Report for Kili

Postmortem

Production issue February 10th, 2025 

Summary

On February 10, 2025, at approximately 10:30 AM UTC, the platform experienced disruptions caused by the merging of our Asset Distribution System with our backend. The incident made the platform unavailable for several users.

The incident was declared on our status page at 11:15 AM UTC, and the team's mitigation efforts restored stability by 11:47 AM UTC. The total disruption lasted 1 hour and 17 minutes, from February 10, 2025, 10:30 AM UTC to February 10, 2025, 11:47 AM UTC.

In addition, some projects experienced issues with asset distribution, which persisted for 4 hours and 19 minutes, from February 10, 2025, 10:41 AM UTC to February 10, 2025, 3:00 PM UTC.

Incident Timeline 

Incident Detection

Time: 10:30 UTC

Description: A production incident was detected internally, impacting the Kili API and Frontend services in the Europe environment. The team began investigating the issue.

Incident Declaration 

Time: 11:15 UTC

Description: The incident was declared on the status page.

Issue Identified

Time: 11:39 UTC

Description: The root cause of the issue was identified, and the team started implementing a fix.

Fix Implemented

Time: 11:48 UTC 

Description: A fix was implemented, and the team began monitoring the results to ensure the issue was resolved.

Status page update

Time: 13:35 UTC 

Description: Kili API and Kili Frontend are operational.

Fix Implemented

Time: 14:29 UTC 

Description: A fix was implemented, and the team began monitoring the results to ensure the issue was resolved.

Monitoring

Time: 16:17 UTC 

Description: The team continued to monitor the system for any further issues, ensuring stability.

Resolution

Time: 17:18 UTC 

Description: The issue was successfully resolved, and all systems were confirmed to be fully operational. Users were informed that the services were back to normal.

End-User Impact

The application was unavailable for users for 1 hour and 17 minutes, from February 10, 2025, 10:30 AM UTC to February 10, 2025, 11:47 AM UTC. 

Additionally, several projects faced issues with asset distribution for 4 hours and 19 minutes, from February 10, 2025, 10:41 AM UTC to February 10, 2025, 3:00 PM UTC.
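The two durations above can be checked from the reported start and end times. The following is a minimal sketch; the helper name `outage_duration` and the timestamp format are assumptions made for illustration and are not part of any Kili tooling.

```python
from datetime import datetime, timezone


def outage_duration(start, end):
    """Return (hours, minutes) between two UTC timestamps.

    Hypothetical helper: timestamps are given as 'YYYY-MM-DD HH:MM' in UTC.
    """
    fmt = "%Y-%m-%d %H:%M"
    start_dt = datetime.strptime(start, fmt).replace(tzinfo=timezone.utc)
    end_dt = datetime.strptime(end, fmt).replace(tzinfo=timezone.utc)
    total_minutes = int((end_dt - start_dt).total_seconds() // 60)
    # divmod splits the total into whole hours and remaining minutes.
    return divmod(total_minutes, 60)


# Platform unavailability: 10:30 UTC to 11:47 UTC
print(outage_duration("2025-02-10 10:30", "2025-02-10 11:47"))  # (1, 17)
# Asset distribution issues: 10:41 UTC to 15:00 UTC
print(outage_duration("2025-02-10 10:41", "2025-02-10 15:00"))  # (4, 19)
```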

What caused the incident?

The incident was caused by the merging of the Asset Distribution System with the backend, which required all project queues to be rebuilt. Two main issues were identified and resolved:

  • Excessive parallel rebuilds for the same project, triggered by an update of the rebuild status.

  • Rebuild failures for some projects when an asset that had been sent back was corrected by a different user than the one who created the original label.
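To illustrate the first issue, the sketch below shows one common way to prevent several rebuilds of the same project from running at once. It is not the fix that was actually deployed; the names `RebuildCoordinator`, `try_start`, `finish`, and `rebuild_queue` are hypothetical, and the guard only works inside a single process (a multi-instance backend would need a shared lock, for example in a database).

```python
import threading


class RebuildCoordinator:
    """Hypothetical guard: at most one queue rebuild per project at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_progress = set()

    def try_start(self, project_id):
        """Claim the rebuild for a project; return False if one is already running."""
        with self._lock:
            if project_id in self._in_progress:
                return False
            self._in_progress.add(project_id)
            return True

    def finish(self, project_id):
        """Release the claim so a later rebuild can run."""
        with self._lock:
            self._in_progress.discard(project_id)


def rebuild_queue(coordinator, project_id, rebuild_fn):
    """Run rebuild_fn(project_id) unless a rebuild for that project is in progress."""
    if not coordinator.try_start(project_id):
        return False  # duplicate request is skipped instead of run in parallel
    try:
        rebuild_fn(project_id)
        return True
    finally:
        # Always release the claim, even if the rebuild raises.
        coordinator.finish(project_id)
```

With a guard of this kind, a status update that re-triggers a rebuild while one is still running is dropped rather than adding another parallel rebuild for the same project.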

Corrective measures put in place to ensure that this does not happen again

To prevent this incident from happening again, the following measures were taken: 

  • Fixed the rebuilding process for projects
Posted Feb 11, 2025 - 14:29 UTC

Resolved

Dear users,

We would like to inform you that the issue affecting our Europe environment has been successfully resolved. Our team has implemented the necessary fixes, and all systems are now fully operational.

We appreciate your patience and understanding while we worked to resolve this matter. If you continue to experience any issues, please don't hesitate to reach out to our support team.

Best regards,
Kili Team
Posted Feb 10, 2025 - 17:18 UTC

Update

We are continuing to monitor for any further issues.
Posted Feb 10, 2025 - 16:17 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Feb 10, 2025 - 13:35 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Feb 10, 2025 - 11:39 UTC

Investigating

Dear all,

We would like to inform you that we are currently experiencing a production incident that is impacting our services.
We apologize for any inconvenience this may cause.

We are working diligently to resolve this issue and restore our services as soon as possible. We will continue to provide updates on our status page: https://status.kili-technology.com/

Thank you for your understanding and patience during this time.

Sincerely,

Kili Support Team
Posted Feb 10, 2025 - 11:15 UTC
This incident affected: Europe (Europe - Kili API, Europe - Kili Frontend).