Blog Post

The Canva outage: another tale of saturation and resilience

A stale Cloudflare routing rule and a performance-degrading lock in a telemetry library combined to produce a latency surge, a thundering herd released from the CDN cache, and an OOM-killer cascade that overwhelmed Canva's API. The incident illustrates how saturation and decompensation can cripple even well-architected services.

Canva's outage began when a new editor version was deployed and millions of clients started downloading its JavaScript assets from Cloudflare. A stale traffic rule that had gone unnoticed forced IPv6 traffic between Northern Virginia and Singapore over the public internet, causing severe packet loss and latency for users in Asia. That latency turned the CDN cache into a barrier: more than 270,000 requests queued while the origin fetch was in flight. When the fetch finally succeeded, all of the queued requests were released at once, and the resulting thundering herd hit Canva's API with 1.5 million calls per second.

At the same time, a performance regression in Canva's Netty-based API gateway, introduced by a lock in a telemetry library, reduced throughput. Tasks exhausted their memory and triggered the Linux OOM killer, which terminated containers faster than the autoscaler could add capacity, leading to a full-scale collapse.

Engineers responded by manually scaling up tasks, then blocking all CDN traffic with a temporary Cloudflare firewall rule to stop the flood, and finally ramping traffic back up incrementally, first to Australian users under rate limits and then globally. The incident is a classic case of decompensation: a latent configuration error combined with a code-level performance bug to create a positive feedback loop that outpaced automated scaling, forcing manual load shedding and careful capacity restoration.
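The feedback loop between OOM kills and the autoscaler can be sketched with a toy discrete-time model. This is a rough illustration under stated assumptions, not a reconstruction of Canva's systems: only the 1.5 million requests per second comes from the incident summary, while the per-task throughput, starting task count, autoscaler rate, and kill fraction are invented numbers, and `simulate` is a hypothetical helper.

```python
import math


def simulate(offered_rps, per_task_rps, start_tasks, added_per_tick,
             admit_schedule, kill_fraction=0.5):
    """Return the task count after each tick of a toy overload model.

    admit_schedule holds, per tick, the fraction of offered traffic that is
    let through (1.0 = no load shedding, 0.0 = all traffic blocked).
    All parameters are illustrative assumptions, not measured values.
    """
    tasks = start_tasks
    history = [tasks]
    for admit_fraction in admit_schedule:
        admitted = offered_rps * admit_fraction
        if admitted > tasks * per_task_rps:
            # Saturated: assume a fixed share of tasks exhaust memory
            # and are terminated by the OOM killer during this tick.
            tasks -= math.ceil(tasks * kill_fraction)
        # The autoscaler adds a fixed number of tasks per tick, regardless of load.
        tasks += added_per_tick
        history.append(tasks)
    return history


# Assumed fleet: 100 tasks at 10,000 rps each (1.0M rps capacity),
# facing 1.5M rps, with the autoscaler adding 10 tasks per tick.
no_shedding = simulate(1_500_000, 10_000, 100, 10, [1.0] * 10)

# Block all traffic for 5 ticks, then ramp back up in stages.
block_then_ramp = simulate(1_500_000, 10_000, 100, 10,
                           [0.0] * 5 + [0.5, 1.0, 1.0, 1.0, 1.0])

print(no_shedding)      # [100, 60, 40, 30, 25, 22, 21, 20, 20, 20, 20]
print(block_then_ramp)  # [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
```

In this model the unshed fleet settles at 20 tasks, far short of the 150 needed, because every tick kills more capacity than the autoscaler restores. Blocking traffic first lets capacity grow past the offered load before traffic is readmitted, which mirrors the shape of the recovery described above.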

Source: surfingcomplexity.blog
#leadership #engineering management #incident management #resilience #SaaS #technical leadership

Problems this helps solve:

Communication, Process inefficiencies, Scaling
