How to upgrade
1000+ servers in 1 hour
Evolix – Grégory Colpart & Jérémy Lecour – MiniDebConf 2024
20 years in the game
1000+ Debian servers
just a few shell scripts
minor updates
vs. major upgrades
Package sources
Main internal mirror
1082 mirror.evolix.org
Evolix public repository
1014 pub.evolix.org
30 pub.evolix.net
Official repositories
824 security.debian.org
566 archive.debian.org
ELTS repositories
425 elts.evolix.org
10 deb.freexian.com
Third-party repositories
211 artifacts.elastic.co
22 packages.elastic.co
152 deb.nodesource.com
64 dl.yarnpkg.com
138 hwraid.le-vert.net
29 downloads.linux.hpe.com
77 download.docker.com
47 packages.fluentbit.io
37 repo.percona.com
33 packages.wazuh.com
17 apt.postgresql.org
16 apt.grafana.com
15 packages.sury.org
12 download.nfs-ganesha.org
12 download.gluster.org
11 packages.tideways.com
How to upgrade them all?
Some possible ways to do this
- manually, over SSH, 1 server at a time
- unattended-upgrade
- automated apt upgrade with Ansible…
- apt-dater
- our own mix
We need to automate
- updating state and downloading packages
- safety checks (disk space…)
- LXC containers
… but keep a finger on the pulse
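The disk space safety check mentioned above could look like the following minimal sketch. The function names and the threshold are hypothetical; the slides do not show how the real scripts do it.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the Evolix scripts.

# Print the free space (in KiB) of the filesystem holding the given path.
# "df -P" guarantees one line per filesystem, so line 2 is the one we want.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if the path has at least the requested number of KiB free.
has_free_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Example (commented out): refuse to go on with less than ~500 MiB on /var
# has_free_space /var 512000 || { echo "not enough space" >&2; exit 1; }
```

A check like this is cheap enough to run both before downloading packages and again right before the upgrade.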
In the real world
- flaky services
- manual actions by clients during updates
- specific order of operations
Inform clients in advance
- list of upgradable packages
- list of services that will restart
- possibility to hold if necessary
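One possible way to build the list of upgradable packages for the notification is to parse a simulated upgrade. This is only a sketch under that assumption; `upgradable_packages` is a hypothetical name and the slides do not say how the real list is produced.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of "apt-get --simulate upgrade"
# on stdin and prints one package name per line.
# In simulation mode, apt-get prints an "Inst <package> ..." line for each
# package it would install or upgrade.
upgradable_packages() {
    awk '$1 == "Inst" { print $2 }'
}

# Example (commented out):
# apt-get --simulate upgrade | upgradable_packages
```

Packages a client asks to keep back can then be put on hold with `apt-mark hold <package>` before the upgrade slot.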
A consistent workflow
- Tuesday morning: preparation and notifications
- Tuesday afternoon: upgrade slot 1 (non-critical servers, like pre-production)
- Thursday evening: upgrade slot 2 (most critical servers)
- Friday morning: upgrade slot 3 (critical servers for clients who prefer the morning)
Tuesday morning (prep)
- update APT
- download packages
- inform clients
Tuesday afternoon (upgrade 1/3)
- upgrade non-critical servers
- 200-250 servers
Thursday evening (upgrade 2/3)
- upgrade majority of critical servers
- 600-700 servers
Friday morning (upgrade 3/3)
- upgrade "working hours" servers
- upgrade USA/CAN servers
- 50-100 servers
How often?
- 1-3 times a month
- avoid holidays
Preparation
📄 listupgrade.sh on every server
- GET config from URL
- disabled if no human action
- allow/block list of releases
- allow/block list of packages
- abort and disable on any error
Freeze until upgrade
apt update -o Dir::State::Lists=/var/lib/listupgrade
apt upgrade --download-only
Timing matters
- every Tuesday at 09:42
🔨 hammer effect on mirrors
- randomize between 06:00 and 10:00
⚖️ spread the load, reduce errors
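The slides do not say how the start time is randomized. As one illustration, the sketch below derives a stable per-host delay from the hostname, assuming cron starts the job at 06:00 and the window is 4 hours (14400 seconds). `delay_for_host` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print a delay in seconds, in [0, window),
# derived from the host name so that each server keeps the same offset.
delay_for_host() {
    host="$1"
    window="${2:-14400}"    # 4 hours: 06:00 -> 10:00
    # cksum prints "<crc> <byte count>"; keep only the CRC
    crc=$(printf '%s' "$host" | cksum | cut -d ' ' -f 1)
    echo $(( crc % window ))
}

# Example (commented out): wait before hitting the mirrors
# sleep "$(delay_for_host "$(hostname)")"
```

A hash-based delay keeps each server at the same offset every week, which makes mirror load predictable; a purely random delay would spread the load just as well.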
Upgrade warmup
📄 parse-listupgrade.sh on workstation
- sync emails from dedicated INBOX
- build lists of servers
- prepare batches
Divide and conquer
- use a terminal multiplexer
- run actions in parallel
- focus on a single server if needed
ClusterSSH
xpanes, the power of Tmux
- Single window with multiple panes
- More robust and feature-rich than CSSH, but more complex
Upgrade time
- 8 batches of 18 servers => 144 servers in a group
- Pause after each group
- check monitoring, fix issues…
- optional longer pause if needed
- consider interrupting the session if needed
🔁 rinse and repeat 4-5 times
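The arithmetic above (8 batches of 18 servers, so 144 per group, repeated 4-5 times for 576 to 720 servers) can be sketched as a small helper that assigns each server to a group and a batch. `assign_batches` is a hypothetical name; the real parse-listupgrade.sh is not shown in the slides.

```shell
#!/bin/sh
# Hypothetical helper: reads one host name per line on stdin and prints
# "<group> <batch> <host>", both numbered from 1.
# Defaults follow the slides: 18 servers per batch, 8 batches per group.
assign_batches() {
    batch_size="${1:-18}"
    batches_per_group="${2:-8}"
    awk -v bs="$batch_size" -v bpg="$batches_per_group" '
        NF {
            batch = int(n / bs)    # global batch index, from 0
            printf "%d %d %s\n", int(batch / bpg) + 1, batch % bpg + 1, $1
            n++
        }'
}

# Example (commented out), with a hypothetical servers.txt:
# assign_batches < servers.txt
```

Each batch can then be opened in its own terminal multiplexer window, so a pause between groups falls naturally where the group number changes.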
Special monitoring
- check frequency is increased
- alerts are sent directly to the upgrade person/team
- issues are fixed as soon as possible
Let's zoom in on a server
📄 maj.sh on every server
- a single script to upgrade the server
- run with DEBIAN_FRONTEND=noninteractive
- keep a lot of logs (outputs, APT term/history…)
- limited verbosity + TERM colors for clarity
Kernel cleanup
- check for available space
- keep running and latest kernels
- based on bullseye /etc/kernel/postinst.d/apt-auto-removal
apt-get upgrade
- upgrade packages on host system
- then loop over LXC containers if any
- use the custom state files
- run with --no-download --no-remove
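The upgrade step described above might be sketched as follows. Several things here are assumptions: the `run` and `upgrade_all` helpers, the `--yes` flag, passing container names as arguments, and the idea that containers use the same state directory as the host. Only `--no-download`, `--no-remove`, `DEBIAN_FRONTEND=noninteractive` and `/var/lib/listupgrade` come from the slides.

```shell
#!/bin/sh
# Hypothetical sketch, not the real maj.sh.
LISTS_DIR="${LISTS_DIR:-/var/lib/listupgrade}"

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Upgrade the host first, then each LXC container named in the arguments.
# --no-download: only use packages fetched during the preparation step.
# --no-remove:   abort rather than remove a package.
upgrade_all() {
    run env DEBIAN_FRONTEND=noninteractive apt-get upgrade --yes \
        --no-download --no-remove -o "Dir::State::Lists=$LISTS_DIR"
    for container in "$@"; do
        run lxc-attach -n "$container" -- \
            env DEBIAN_FRONTEND=noninteractive apt-get upgrade --yes \
            --no-download --no-remove -o "Dir::State::Lists=$LISTS_DIR"
    done
}

# Example (commented out): preview the commands for two containers
# DRY_RUN=1 upgrade_all web db
```

Because the state files and downloaded packages were frozen on Tuesday morning, the upgrade applies exactly what clients were told about, even if newer packages have reached the mirrors since.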
need restart?
- when upgrading glibc, libssl…
- needrestart -b tells us what needs a restart
- show systemctl status after service restarts
# needrestart -b
NEEDRESTART-VER: 3.4
NEEDRESTART-KCUR: 4.19.0-26-cloud-amd64
NEEDRESTART-KEXP: 4.19.0-27-cloud-amd64
NEEDRESTART-KSTA: 3
NEEDRESTART-SVC: auditd.service
NEEDRESTART-SVC: dbus.service
NEEDRESTART-SVC: getty@tty1.service
NEEDRESTART-SVC: munin-node.service
NEEDRESTART-SVC: ntp.service
NEEDRESTART-SVC: rsyslog.service
NEEDRESTART-SVC: squid.service
NEEDRESTART-SVC: systemd-logind.service
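The batch output above is easy to parse from a script. The helpers below are a minimal sketch with hypothetical names; they only rely on the `NEEDRESTART-SVC`, `NEEDRESTART-KCUR` and `NEEDRESTART-KEXP` lines shown in the sample.

```shell
#!/bin/sh
# Hypothetical helpers: both read "needrestart -b" output on stdin.

# Print one service name per line.
services_to_restart() {
    sed -n 's/^NEEDRESTART-SVC: *//p'
}

# Succeed if the running kernel differs from the expected one.
kernel_reboot_pending() {
    awk -F': ' '
        /^NEEDRESTART-KCUR:/ { current = $2 }
        /^NEEDRESTART-KEXP:/ { expected = $2 }
        END { exit !(current != "" && expected != "" && current != expected) }'
}

# Example (commented out): restart each service, then show its status
# needrestart -b | services_to_restart | while read -r svc; do
#     systemctl restart "$svc"; systemctl status --no-pager "$svc"
# done
```

In practice some services in such a list (for example dbus.service or systemd-logind.service) are risky to restart blindly, so a real script would likely filter the list first.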
bookkeeping
- notifications to clients and sysadmins
- commit changes in /etc (equivalent of etckeeper)
- traceability tasks
are we done yet?
- only green lines at the end => all good!
- [PageUp]-[PageDown] to read history
- focus on servers with issues
- [Ctrl-L] or [Enter] to check for a responsive shell
The good
- 1 hour for 1000 servers
- a lot of automation
- without losing control
Possible improvements
- no strategy for reboots
- insufficient information about LXC containers
- mostly a single-person workflow
- a batch is as slow as the slowest server
- hiding APT logs is nice, except when we need them