Post Mortem
Summary
On February 10, 2025, at approximately 10:30 AM UTC, the platform experienced disruptions caused by the merging of our asset distribution system with our backend. The incident made the platform unavailable for several users.
The incident was declared on our status page at 11:15 AM UTC, and the team's mitigation efforts restored stability by 11:47 AM UTC. The total disruption lasted 1 hour and 17 minutes, from February 10, 2025, 10:30 AM UTC to February 10, 2025, 11:47 AM UTC.
In addition, some projects experienced issues with asset distribution, which persisted for 4 hours and 19 minutes, from February 10, 2025, 10:41 AM UTC to February 10, 2025, 3:00 PM UTC.
Incident Detection
Time: 10:30 UTC
Description: A production incident was detected internally, impacting the Kili API and Frontend services in the Europe environment. The team began investigating the issue.
Incident Declaration
Time: 11:15 UTC
Description: The incident was created on the status page.
Issue Identified
Time: 11:39 UTC
Description: The root cause of the issue was identified, and the team started implementing a fix.
Fix Implemented
Time: 11:48 UTC
Description: A fix was implemented, and the team began monitoring the results to ensure the issue was resolved.
Status Page Update
Time: 13:35 UTC
Description: Kili API and Kili Frontend are operational.
Fix Implemented
Time: 14:29 UTC
Description: A fix was implemented, and the team began monitoring the results to ensure the issue was resolved.
Monitoring
Time: 16:17 UTC
Description: The team continued to monitor the system for any further issues, ensuring stability.
Resolution
Time: 17:18 UTC
Description: The issue was successfully resolved, and all systems were confirmed to be fully operational. Users were informed that the services were back to normal.
The application was unavailable for users for 1 hour and 17 minutes, from February 10, 2025, 10:30 AM UTC to February 10, 2025, 11:47 AM UTC.
Additionally, several projects faced issues with asset distribution for 4 hours and 19 minutes, from February 10, 2025, 10:41 AM UTC to February 10, 2025, 3:00 PM UTC.
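The two durations quoted above can be rechecked from the stated start and end times. The short sketch below does only that arithmetic; the helper names `window` and `describe` are illustrative and not part of any Kili tooling.

```python
from datetime import datetime, timezone

def window(start, end):
    """Return the elapsed time between two 'YYYY-MM-DD HH:MM' UTC timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    s = datetime.strptime(start, fmt).replace(tzinfo=timezone.utc)
    e = datetime.strptime(end, fmt).replace(tzinfo=timezone.utc)
    return e - s

def describe(delta):
    """Format a timedelta as whole hours and minutes."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60} h {minutes % 60} min"

# Start and end times are taken from the impact statement above.
unavailability = window("2025-02-10 10:30", "2025-02-10 11:47")
asset_distribution = window("2025-02-10 10:41", "2025-02-10 15:00")

print(describe(unavailability))      # 1 h 17 min
print(describe(asset_distribution))  # 4 h 19 min
```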
The incident was caused by the merging of the asset distribution system with the backend, which required all project queues to be rebuilt. Two main issues were resolved:
Excessive parallel rebuilds of the same project, triggered by an update of the rebuild status.
Rebuild failures for some projects in which assets that had been sent back were fixed by a user other than the one who created the original label.
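As an illustration of the first issue, the sketch below shows one common way to keep a single project from being rebuilt several times in parallel: track which projects have a rebuild in flight and skip duplicate requests. This is not the fix that was actually deployed, which this report does not describe. The class `RebuildCoordinator`, the method `request_rebuild`, and the `rebuild_fn` callable are all hypothetical names introduced for the example.

```python
import threading

class RebuildCoordinator:
    """Allows at most one in-flight queue rebuild per project (hypothetical sketch)."""

    def __init__(self, rebuild_fn):
        self._rebuild_fn = rebuild_fn   # hypothetical callable taking a project id
        self._in_progress = set()       # ids of projects currently being rebuilt
        self._lock = threading.Lock()   # guards access to _in_progress

    def request_rebuild(self, project_id):
        """Run a rebuild unless one is already running; return True if it ran."""
        with self._lock:
            if project_id in self._in_progress:
                return False            # duplicate request for this project: skip
            self._in_progress.add(project_id)
        try:
            # The lock is released here, so other projects can rebuild in parallel.
            self._rebuild_fn(project_id)
            return True
        finally:
            with self._lock:
                self._in_progress.discard(project_id)
```

Skipping a duplicate request is the simplest policy. If a request can arrive with newer data while a rebuild is running, a real system would more likely record that the project is stale and schedule one follow-up rebuild instead of dropping the request.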
To prevent this incident from happening again, the following measures were taken: