The Canva outage: another tale of saturation and resilience

Canva's outage began when a new version of the editor was deployed and millions of clients started downloading its JavaScript assets through Cloudflare. A stale traffic rule that had gone unnoticed forced IPv6 traffic between Northern Virginia and Singapore over the public internet, causing heavy packet loss and high latency for users in Asia. That latency turned the CDN cache into a barrier: more than 270,000 requests queued while waiting for the origin fetch to complete. When the fetch finally succeeded, all of the queued requests were released at once, producing a thundering herd of 1.5 million API calls per second. At the same time, a performance regression in Canva's Netty-based API gateway, introduced by a lock in a telemetry library, reduced throughput further. Tasks exhausted their memory and triggered the Linux OOM killer, which terminated containers faster than the autoscaler could add capacity. The result was a full-scale collapse.

Engineers first scaled tasks up manually, then blocked all CDN traffic with a temporary Cloudflare firewall rule to stop the flood, and finally restored traffic in increments: first to Australian users under rate limits, then globally. The incident is a classic case of decompensation: a latent configuration error combined with a code-level performance bug to create a positive feedback loop that outpaced automated scaling, so recovery required manual load shedding and careful restoration of capacity.
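
To make the load-shedding idea concrete, here is a minimal sketch of a bounded in-flight limiter: it rejects excess requests immediately instead of buffering them until memory runs out. This is not Canva's actual mitigation (the summary above describes a Cloudflare firewall rule and rate limits). The names `InFlightLimiter` and `release_burst` and the capacity of 1,000 are hypothetical, chosen only for illustration. The burst size reuses the 270,000 queued requests mentioned above.

```python
from dataclasses import dataclass


@dataclass
class InFlightLimiter:
    """Admit a request only while fewer than max_in_flight are being served."""

    max_in_flight: int
    in_flight: int = 0
    shed: int = 0

    def try_admit(self) -> bool:
        if self.in_flight >= self.max_in_flight:
            self.shed += 1  # reject fast instead of buffering in memory
            return False
        self.in_flight += 1
        return True

    def done(self) -> None:
        # Call when an admitted request finishes, freeing one slot.
        if self.in_flight > 0:
            self.in_flight -= 1


def release_burst(limiter: InFlightLimiter, burst_size: int) -> int:
    """Offer burst_size simultaneous requests; return how many were admitted."""
    return sum(1 for _ in range(burst_size) if limiter.try_admit())


# Hypothetical capacity; the burst size echoes the queued-request count above.
limiter = InFlightLimiter(max_in_flight=1_000)
admitted = release_burst(limiter, burst_size=270_000)
print(admitted, limiter.shed)
```

The design point is that memory use stays proportional to `max_in_flight` rather than to the size of the burst, so a sudden release of queued requests yields fast rejections rather than OOM kills. Raising the limit step by step would loosely mirror the incremental ramp described above.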
