Minutes from the Thu Apr 18 infra call

Guilhem Moulin

[Rescheduled from Tue Apr 16.]


1. guilhem
2. cloph
3. Brett


 * Post mortem March 26 incident
   + Distributed FS
     . cloph: not opposing other solution, cloud solution probably not
       feasible monetary wise (Brett and guilhem agree, AI guilhem to double
     . would require rebalancing (hence changing isolation level) as ordering
       50-60 VPS would be very costly
     . cloph: a single big issue with Gluster in ~6 (?) years, convenient
       scapegoat but not all issues should be attributed to that setup
     . ceph was considered at the time, abandoned since ≥6 nodes are (were?)
   + guilhem: would like some SSD-based cache for faster reads; cloph: Gluster
     supports tier-based setups, but we're not using that currently, not sure if
     there are any slots left in the hypervisors (guilhem to check)
     . Another problem is random IO from the crashtest VM (vm138), the disk
       image of which resides on charly's /data/fast like gluster's delta
       arbiter.  The two race and trigger heals
     . If there is room for one or more SSDs on each hyper we should move the
       arbiter (and swap) there
   + cloph: could also compare our current machines with the newer generation
     from manitu (125€/month, Intel Xeon E3-1240 v5)
   + rebalance gluster volumes: too many VMs on delta, heals drain IO from too
     many services
     . delta 2 x (2+1), alpha, gamma, kappa 1 x 3
     . cloph: straightforward to change the layout on an existing volume
     . can rebalance, can also tune the healing settings so it doesn't starve
       all guests
   + Continuous Archiving and Point-in-Time Recovery (PITR)
     . replace dump-based database backups with continuous logs
     . https://www.postgresql.org/docs/9.6/continuous-archiving.html
     . can be dry-run without disruption to the current setup
     . Brett: can try out on the [Matrix] host (vm222), push the WAL logs to
       itself for now
 * OCS-Webserver (new extension site) deployment
   + somewhat delayed due to infra issues
   + successfully bridged to SSO; some hard-coded third-party requests still
     need to be chased and removed before opening the site for contributions
 * tb31
   + connects to the outside from (not but
     seems to block incoming TCP SYN, ICMP
   + is it sitting behind a firewall/NAT box?
   + cloph: we can move lcov to another host (part of LODE)
   + AI guilhem: check with Sophie/thb for contact info, fix incoming
     connections, perform full backup and ask for an installation of CentOS7 (or
     just ask for KVM access if possible)
 * Build bots:
   + tb66 will be brought offline permanently in early May
   + large build logs
     . 4GiB logs are not sustainable (a build issue caused a series of build
       logs to grow by a factor of >20 last month) and DoS the CI
     . gerrit_linux_clang_dbgutil (normally up to ~250MiB then they grew to
       4GiB) cf. gerrit_linux_clang_dbgutil/builds/28522/log.gz (288MiB
       compressed, 4GiB uncompressed)
     . there is a jenkins plugin to limit log size but that only works with
       pipeline-based setups
 * Preserve old download links (redirect to downloadarchive), cf. wget's mail
   + cloph: not keen to manually maintain a map for a single distro
   + switching the links to use downloadarchive instead of download might
     take time but that's just a one-off overhead
 * Chance to speed up latam conference site deployment? → done
 * Next call: Tuesday May 21 2019 at 18:30 Berlin time (16:30 UTC).
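
On the PITR item above: a minimal sketch, in Python, of the kind of helper an
archive_command could call to push WAL segments to a local directory, as
proposed for the dry run on vm222. The function name, script path and archive
directory are assumptions for illustration, not part of the current setup. The
only constraints taken from the PostgreSQL documentation are that the command
must report success only once the segment is safely archived and must not
overwrite an existing archive file.

```python
import os
import shutil


def archive_wal(segment_path, archive_dir):
    """Copy one WAL segment into archive_dir.

    Returns 0 on success and 1 on failure, mirroring the exit status that
    PostgreSQL expects from archive_command. An already archived segment
    is never overwritten.
    """
    target = os.path.join(archive_dir, os.path.basename(segment_path))
    if os.path.exists(target):
        return 1
    tmp = target + ".tmp"
    try:
        # Copy under a temporary name first so that a partially written
        # file is never mistaken for a complete segment.
        shutil.copyfile(segment_path, tmp)
        os.rename(tmp, target)
    except OSError:
        return 1
    return 0


# Hypothetical wiring in postgresql.conf (paths are assumptions), with a
# small command-line wrapper passing %p (path of the segment) to archive_wal:
#   wal_level = replica
#   archive_mode = on
#   archive_command = '/usr/local/bin/archive_wal.py %p /srv/wal-archive'
```

Since the archive sits next to the live database in this dry run, it only
exercises the mechanism; it is not a backup until the segments are shipped to
another host.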

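
On the large build logs item: since the Jenkins log-size plugin only covers
pipeline-based setups, one conceivable stopgap is to cap the output before it
reaches the log. The following is a minimal sketch in Python, assuming a
hypothetical wrapper around the build's output stream; the function name and
any concrete limit are illustrative only and were not discussed on the call.

```python
def copy_capped(src, dst, limit, chunk_size=64 * 1024):
    """Copy bytes from src to dst, writing at most `limit` bytes.

    src keeps being drained after the limit is reached so that the
    producer is not blocked on a full pipe.
    Returns a tuple (bytes_written, bytes_dropped).
    """
    written = 0
    dropped = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        room = max(limit - written, 0)
        kept = min(len(chunk), room)
        if kept:
            dst.write(chunk[:kept])
            written += kept
        dropped += len(chunk) - kept
    if dropped:
        # Leave a marker so readers can tell the log is incomplete.
        dst.write(b"\n[log truncated: %d bytes dropped]\n" % dropped)
    return written, dropped
```

A build step could be piped through such a wrapper using sys.stdin.buffer and
sys.stdout.buffer as src and dst; the regular ~250MiB logs of
gerrit_linux_clang_dbgutil would pass through untouched under any limit above
that size.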
