Slowness issue

Incident Report for Kili

Postmortem

US Cloud Production Issue – October 15-16, 2025

Summary:

On October 15, 2025, at approximately 4:20 PM UTC, the US cloud platform became almost unavailable because the Redis service was overloaded by large batch deletions of assets that were poorly handled on our side. The incident was declared at 6:01 PM UTC, and customer impact ended at 6:19 AM UTC on October 16. The incident was fully resolved and closed on October 17 at 8:08 AM UTC.

Incident Timeline:

Incident Detection & Customer Impact Start

  • Time: October 15, 4:20 PM UTC
  • Description: US cloud became almost unavailable due to Redis overload

Incident Declaration

  • Time: October 15, 6:01 PM UTC
  • Description: Incident published on the Status Page

Stable State Achieved

  • Time: October 16, 6:19 AM UTC
  • Description: Customer impact ended and stability was restored. The time to mitigate was 12 hours and 23 minutes.

Incident Resolution

  • Time: October 17, 8:08 AM UTC
  • Description: The incident was officially marked as resolved. Time to resolution: 38 hours and 12 minutes.

End-User Impact:

The US cloud was almost unavailable to users for a total of 13 hours and 59 minutes, from October 15, 4:20 PM UTC to October 16, 6:19 AM UTC. The impact fell primarily on workflow V1 projects.

What caused the incident?

The US cloud was almost unavailable because our Redis service was overloaded. Large batch deletions of assets were poorly handled on our side and sent too many commands to Redis. Our PostgreSQL database was also affected, because these large deletions ran as very long transactions.

Corrective actions put in place to ensure that this does not happen again:

Immediate mitigation: Services were stabilized to restore normal operations

Long-term mitigations:

  • Better cleanup of the project queue lists, with assets filtered before the queues are cleaned, to limit the number of commands sent to Redis
  • Splitting large batch deletions into small chunks, each handled in its own separate transaction
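
As an illustration only, the two long-term mitigations above could be sketched as follows. This is not our production code: the `assets` table, the function names, and the chunk size of 500 are assumptions made for the example, and the standard-library `sqlite3` module stands in for PostgreSQL so that the sketch runs on its own. The first helper filters a deletion request down to the assets that are actually queued, so that no queue-cleanup command is sent for the others. The second deletes in small chunks and commits after each one, so that no single transaction stays open for the whole batch.

```python
import sqlite3
from typing import Iterable, Iterator, List, Set


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` ids."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def ids_to_dequeue(asset_ids: Iterable[str], queued_ids: Set[str]) -> List[str]:
    """Keep only the assets that are actually present in a queue.

    Hypothetical helper: assets that were never queued need no
    queue-cleanup command at all, which reduces the command count.
    """
    return [asset_id for asset_id in asset_ids if asset_id in queued_ids]


def delete_assets(
    conn: sqlite3.Connection,
    asset_ids: List[str],
    chunk_size: int = 500,  # assumed value, to be tuned for the real workload
) -> int:
    """Delete assets chunk by chunk, one short transaction per chunk."""
    deleted = 0
    for chunk in chunked(asset_ids, chunk_size):
        placeholders = ",".join("?" for _ in chunk)
        # `with conn` commits when the block succeeds and rolls back on
        # an exception, so each chunk is its own transaction.
        with conn:
            cursor = conn.execute(
                f"DELETE FROM assets WHERE id IN ({placeholders})",
                chunk,
            )
            deleted += cursor.rowcount
    return deleted
```

With this shape, a failure partway through leaves the already committed chunks deleted and only the current chunk rolled back, so a retry can resume with the remaining IDs.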

We sincerely apologize for the inconvenience caused by this incident and its impact.

Thank you for your patience and continued trust.

The Kili Team

Posted Oct 20, 2025 - 15:08 UTC

Resolved

Dear users,

The issue affecting our services has been successfully resolved. All systems are now back to normal operation.

We apologize for any inconvenience caused and appreciate your patience during this time.

Our team will share a comprehensive post-mortem report in the coming days to provide more details on the incident and our preventive actions.

If you experience any further issues, please contact our support team at support@kili-technology.com

Sincerely,

Kili Team
Posted Oct 17, 2025 - 08:01 UTC

Monitoring

Dear users,

We have implemented a first fix for the issue affecting our services and are currently monitoring the situation to ensure full recovery.
A second fix will be implemented this afternoon to resolve the problem fully.

We will provide a final update once we confirm that everything is stable.

Thank you for your patience and understanding.

Sincerely,
The Kili Team
Posted Oct 16, 2025 - 10:35 UTC

Update

Dear users,

We are currently experiencing an issue that is impacting our platform's performance. Our team is actively investigating the root cause and working to resolve it as quickly as possible.

We will provide further updates on our status page https://status.kili-technology.com/.

Thank you for your patience and understanding.

Sincerely,

Kili Team
Posted Oct 16, 2025 - 07:45 UTC

Investigating

We are currently investigating this issue.
Posted Oct 15, 2025 - 18:01 UTC
This incident affected: US (US - Kili API, US - Kili Frontend).