On March 26 at 10:15 UTC one of our hypervisors crashed, bringing several
VMs down with it. When the hypervisor rebooted, a few VMs — most
notably the ones serving Bugzilla, AskBot, and the download archive —
refused to boot due to filesystem corruption. After some time spent
trying to repair them, we discovered that the self-healing daemon of
GlusterFS (the distributed filesystem we're currently using) hadn't
triggered, and some disk images were stuck in a split-brain state
without being reported as such, let alone auto-repaired.
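For reference, split-brain entries like the ones we hit can be listed and
resolved manually with the Gluster CLI. The volume name and file path below
are hypothetical; this is a sketch of the general procedure, not the exact
commands we ran:

```
# List files the self-heal daemon considers to be in split-brain
# ("vmstore" is a placeholder volume name):
gluster volume heal vmstore info split-brain

# Resolve one file by keeping the replica with the newest mtime
# (path is relative to the volume root; example path is hypothetical):
gluster volume heal vmstore split-brain latest-mtime /images/example.qcow2
```

Other resolution policies (e.g. picking a specific source brick) exist, but
they all require an operator to decide which copy wins — which is exactly the
manual intervention our monitoring failed to flag as necessary.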
AskBot was brought up again at 18:45 UTC, to the best of our knowledge
without data loss.
Unfortunately, for Bugzilla we weren't able to get the filesystem back
into a consistent enough state, and had to restore a snapshot from
backups. Our backups are not continuous (they have roughly 24-hour
granularity), so changes made after March 25 ~23:00 UTC (about 80 of
them) were missing when the service was brought up again at 22:00 UTC.
Those changes were later replayed from the notification mails sent to
the libreoffice-bugs mailing list.
We lowered the priority of restoring the list archives, both because
that service receives few requests and because verifying the integrity
of its ~500k files is a slow process. (As with the other services we
want to make sure the data we serve isn't corrupted, but the list
archive is much larger than our other data stores and hence takes much
longer to check.) A partial archive covering releases ≥5.4 was restored
on April 03; we then moved on to older releases, and the entire archive
was available again on the evening of April 04.
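A bulk integrity check like the one described is typically done against a
checksum manifest taken from a known-good copy. The sketch below shows the
idea on a throwaway directory (all paths and file contents are stand-ins,
not our actual archive layout):

```shell
# Minimal sketch of a manifest-based integrity check (paths hypothetical).
set -e
cd "$(mktemp -d)"
mkdir -p srv/archive
echo "demo payload" > srv/archive/file.txt   # stand-in for an archive file

# Build a checksum manifest from the known-good copy:
find srv/archive -type f -exec sha256sum {} + > manifest.sha256

# After a restore, verify every file against the manifest;
# --quiet prints only files that fail the check:
sha256sum --check --quiet manifest.sha256 && echo "archive OK"
```

With ~500k files the run time is dominated by reading every byte back from
disk, which is why this step is slow no matter how it's scripted.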
We apologize for the inconvenience. There are several things we can do
to ensure this won't happen again, and we'll discuss some of them during
our next infra call:
* Storage: rebalance the Gluster volumes; evaluate alternative backend
  solutions, including on- and off-site (aka “cloud”) options.
* Backup: for performance reasons, writes are not flushed to physical
  disks immediately. We could reduce the interval between journal
  commits, but that won't fully eliminate the uncertainty window unless
  we also install a battery-backed cache. Similarly, we can't achieve
  fully continuous backups, but we can improve their granularity. A
  recurring topic in our infra calls is replacing dump-based database
  backups with continuous archiving and Point-in-Time Recovery (PITR).
  Unfortunately this has not been implemented yet; it would have
  prevented the data loss in the Bugzilla database (or at least reduced
  the 24h granularity to a sub-minute one), while also providing
  referential integrity guarantees.
* Communication (notifying the community): while infra team members are
  busy putting the pieces back together, we're not always in a position
  to respond to questions from users and the community. Sophie, Italo,
  Mike and Florian have been discussing how best to support infra in
  communicating the status quo on the different channels (IRC, Telegram,
  email, Planet, Twitter, etc.), so that progress and resolution become
  more visible to all.
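On the PITR point above: assuming the database runs on PostgreSQL, continuous
archiving comes down to a few configuration lines plus a periodic base backup.
The fragment below is a sketch of the approach with hypothetical paths, not
our actual configuration:

```
# postgresql.conf — hypothetical continuous-archiving settings for PITR
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

# Recovery then starts from a base backup (e.g. taken with pg_basebackup)
# and replays archived WAL up to recovery_target_time set to a moment
# just before the crash — the sub-minute granularity mentioned above.
```

Because WAL replay restores whole transactions in commit order, this also
gives the referential-integrity guarantees that replaying changes from
notification mails cannot.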