Okay, all redundancy has been restored across all Fediverse.Games services.
I promise to be transparent about what goes on with this instance, so I thought I'd take you through the cause.
A part of our storage infrastructure is built on ceph, which is a clustered, self-healing storage solution.
One of our nodes is (now was) a reasonably old Dell Micro PC. I've been gradually working through a technology upgrade to remove these ultra-compact, low-power devices. While they're really cool little devices, they have limited use in a production environment (don't worry though, I've still got plenty of uses for them).
Unfortunately, while I was purging and removing one of the ceph-managed disks on the mini PC, something went wrong and left a number of volumes with lost or out-of-sync objects, which completely blocked I/O across the entire cluster.
All of our data is backed up to the cloud, so there was never any risk of significant permanent data loss, but it was easier and quicker to repair than restore, so that's what we went for.
Part of the delay (aside from needing to get some sleep before my first day back at work after the holidays; yes, great timing, I know!) was diagnosing the actual issue and taking the steps to set up ceph to do the self-healing work on its own.
I've learnt some extra steps I could have taken to make sure this doesn't happen in future, and also how to diagnose the issue and set the ball rolling on fixing it.
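For anyone running their own cluster, here's a minimal sketch of the usual cautious sequence for retiring a ceph disk (an OSD, in ceph terms). It's a generic outline based on the standard ceph CLI, not a record of exactly what was run on this instance, and the helper name `osd_removal_plan` is made up for illustration. It only builds the list of commands and doesn't run anything.

```python
from typing import List


def osd_removal_plan(osd_id: int) -> List[List[str]]:
    """Return the ceph CLI commands, in order, for retiring one OSD.

    Hypothetical helper for illustration only: nothing is executed here,
    the caller decides how and when to run each step.
    """
    if osd_id < 0:
        raise ValueError("osd_id must be non-negative")
    osd = str(osd_id)
    return [
        # Check the cluster is healthy before touching anything.
        ["ceph", "health", "detail"],
        # Mark the OSD out so its data is migrated to the remaining OSDs.
        ["ceph", "osd", "out", osd],
        # Ask ceph whether the OSD can be destroyed without risking data;
        # this should be repeated until it succeeds.
        ["ceph", "osd", "safe-to-destroy", f"osd.{osd}"],
        # Stopping the OSD daemon goes here; how depends on the deployment,
        # so it is deliberately left out of this sketch.
        # Only then remove the OSD from the cluster maps.
        ["ceph", "osd", "purge", osd, "--yes-i-really-mean-it"],
    ]


if __name__ == "__main__":
    for cmd in osd_removal_plan(3):
        print(" ".join(cmd))
```

The point of the ordering is that the purge is the last step, and only happens once ceph itself has reported the disk as safe to destroy.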