2025-03-08 Post Mortem
At 12:53 UTC, Floofy.tech displayed a Cloudflare SSL error, and subsequently displayed a 404 not found page. Recovery occurred at 13:35 UTC, after troubleshooting and uncovering some technical debt in our Kubernetes state configuration for our Mastodon ingress and ingress-nginx Helm chart. We cleaned up some of the technical debt and updated our Helm values to allow us to configure nginx response headers in ingress annotations.
What happened?
During a routine and seemingly low-impact upgrade to ingress-nginx, which runs the nginx instances that route traffic to the right Kubernetes services, the nginx pods were unable to properly serve traffic directed to the Mastodon web pods. Cloudflare initially served an invalid certificate error, as nginx had stopped serving the domain entirely. As part of troubleshooting, the SSL policy from Cloudflare to origin was changed from "Full (strict)" to "Flexible", which resulted in nginx returning a 404 as it was unable to serve the Mastodon web traffic. The root cause was that we utilise nginx.ingress.kubernetes.io/server-snippet in our Mastodon ingress to set a CSP header, along with configuring some now unneeded Matrix-related routing. When we upgraded the ingress-nginx Helm chart, it changed the default annotations-risk-level for the controller from "Critical" to "High", preventing the Mastodon ingress from being loaded entirely (pull request).
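To make the failure mode concrete, here is a minimal sketch of an Ingress carrying the kind of annotation described above. It is not our real manifest: the resource name, host, service, port, and the CSP value are all placeholders chosen for illustration.

```yaml
# Hypothetical Ingress illustrating the annotation involved in this incident.
# All names, the host, the port, and the CSP value are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mastodon-web              # placeholder name
  annotations:
    # server-snippet injects raw nginx configuration into the server block.
    # A controller running at annotations-risk-level "High" will not load
    # an Ingress that uses it, because it requires the "Critical" level.
    nginx.ingress.kubernetes.io/server-snippet: |
      add_header Content-Security-Policy "default-src 'self'" always;
spec:
  ingressClassName: nginx
  rules:
    - host: mastodon.example      # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mastodon-web    # placeholder service name
                port:
                  number: 3000        # placeholder port
```

With the Ingress refused, the controller has no server block for the host at all, which matches the symptoms: first a certificate that Cloudflare could not validate, then a 404 once strict validation was relaxed.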
How was it resolved?
By manually setting annotations-risk-level back to "Critical", the ingress was allowed to be loaded by ingress-nginx, restoring service. We subsequently removed the offending annotation and moved back to the "High" annotation risk level.
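The temporary fix could look roughly like the following Helm values fragment. This is a sketch under assumptions rather than our exact values: it assumes the chart passes entries under controller.config through to the controller ConfigMap, and that snippet annotations were already enabled, since server-snippet worked before the upgrade.

```yaml
# values.yaml fragment for the ingress-nginx chart (illustrative sketch).
controller:
  # Assumption: this was already set in our values, as snippet
  # annotations were in use before the upgrade.
  allowSnippetAnnotations: true
  config:
    # Entries here end up in the controller's ConfigMap.
    # "Critical" restores the pre-upgrade behaviour; the new default is "High".
    annotations-risk-level: Critical
```

Once the server-snippet annotation is removed from the ingress, the annotations-risk-level entry can be dropped again so the controller falls back to the "High" default.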
How will we avoid this?
Reading release notes more carefully and staying informed of the various annotation risk levels will prevent this issue going forward. We will also rely on Cloudflare to perform safe header manipulation rather than ingress annotations, which are deemed unsafe by upstream.