How to upgrade

1000+ servers in 1 hour

Evolix – Grégory Colpart & Jérémy Lecour – MiniDebConf 2024

20 years in the game

1000+ Debian servers

just a few shell scripts

minor updates

vs. major upgrades

Package sources

Main internal mirror

    1082 mirror.evolix.org

Evolix public repository

    1014 pub.evolix.org
      30 pub.evolix.net

Official repositories

     824 security.debian.org
     566 archive.debian.org

ELTS repositories

     425 elts.evolix.org
      10 deb.freexian.com

Third-party repositories

     211 artifacts.elastic.co   
      22 packages.elastic.co

     152 deb.nodesource.com
      64 dl.yarnpkg.com

     138 hwraid.le-vert.net
      29 downloads.linux.hpe.com

      77 download.docker.com
      47 packages.fluentbit.io
      37 repo.percona.com
      33 packages.wazuh.com
      17 apt.postgresql.org
      16 apt.grafana.com
      15 packages.sury.org
      12 download.nfs-ganesha.org
      12 download.gluster.org
      11 packages.tideways.com

How to upgrade them all?

Some possible ways to do this

manually, over SSH, 1 server at a time
unattended-upgrade
automated apt upgrade with Ansible…
apt-dater
our own mix

We need to automate

updating state and downloading packages
safety checks (disk space…)
LXC containers
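
A disk space safety check could look like the minimal sketch below. The function name `has_free_space`, the path and the threshold are illustrative assumptions, not taken from the actual scripts.

```shell
# Hypothetical helper: succeed only if the filesystem holding $1
# has at least $2 KiB available.
has_free_space() {
    path=$1
    min_kb=$2
    # df -P prints one POSIX-format line per filesystem; column 4 is "Available"
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$min_kb" ]
}

# Example: has_free_space /var/cache/apt/archives 512000 || echo "not enough space" >&2
```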

… but keep a finger on the pulse

In the real world

flaky services
manual actions by clients during updates
specific order of operations

Inform clients in advance

list of upgradable packages
list of services that will restart
possibility to hold if necessary
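
One possible way to build the list of upgradable packages is to parse a simulated upgrade. This sketch assumes the usual `Inst package [old] (new …)` lines printed by `apt-get --simulate upgrade`; the function name `list_upgradable` is hypothetical.

```shell
# Read `apt-get --simulate upgrade` output on stdin and print
# "package old-version new-version" for every package that would be upgraded.
list_upgradable() {
    awk '$1 == "Inst" && $3 ~ /^\[/ {
        old = $3; new = $4
        gsub(/^\[|\]$/, "", old)   # strip the [ ] around the installed version
        sub(/^\(/, "", new)         # strip the ( before the candidate version
        print $2, old, new
    }'
}

# Example: apt-get --simulate upgrade | list_upgradable
```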

A consistent workflow

Tuesday morning: preparation and notifications
Tuesday afternoon: upgrade slot 1 (non-critical servers, like pre-production)
Thursday evening: upgrade slot 2 (most critical servers)
Friday morning: upgrade slot 3 (critical servers for clients who prefer the morning)

Tuesday morning (prep)

update APT
download packages
inform clients

Tuesday afternoon (upgrade 1/3)

upgrade non-critical servers
200-250 servers

Thursday evening (upgrade 2/3)

upgrade majority of critical servers
600-700 servers

Friday morning (upgrade 3/3)

upgrade "working hours" servers
upgrade USA/CAN servers
50-100 servers

How often?

1-3 times a month
avoid holidays

Preparation

📄 listupgrade.sh on every server

GET config from URL
disabled if no human action
allow/block list of releases
allow/block list of packages
abort and disable on any error
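
The allow/block logic could be sketched as follows. The function name `is_allowed` and the rule "the block list always wins, an empty allow list allows everything" are assumptions; the real script may decide differently.

```shell
# Hypothetical helper: succeed if item $1 may be processed.
# $2 = space-separated allow list (empty means "allow everything")
# $3 = space-separated block list (always wins)
is_allowed() {
    item=$1
    allow=$2
    block=$3
    case " $block " in *" $item "*) return 1 ;; esac
    [ -z "$allow" ] && return 0
    case " $allow " in *" $item "*) return 0 ;; esac
    return 1
}

# Example: is_allowed "bookworm" "bullseye bookworm" "" && echo "release allowed"
```

The same helper works for releases and for package names.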

Freeze until upgrade

apt update -o Dir::State::Lists=/var/lib/listupgrade

apt upgrade --download-only
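
A minimal sketch of the two steps above as a function with a dry-run mode. It uses `apt-get` instead of `apt` (a scripting assumption) and also passes the custom lists directory to the download step (another assumption, so that both steps see the same package indexes). `prepare_upgrade` is a hypothetical name.

```shell
LISTS_DIR=/var/lib/listupgrade   # path taken from the command above

# Pass "echo" as first argument for a dry run that only prints the commands.
prepare_upgrade() {
    run=${1:-}
    $run apt-get update -o "Dir::State::Lists=$LISTS_DIR" &&
    $run apt-get upgrade --download-only --assume-yes \
        -o "Dir::State::Lists=$LISTS_DIR"
}

# Example (dry run): prepare_upgrade echo
```

Keeping the lists in a separate directory means a later `apt update` on the default lists does not change what was announced to clients.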

Timing matters

every Tuesday at 09:42
🔨 hammer effect on mirrors
randomize between 06:00 and 10:00
⚖️ spread the load, reduce errors
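
One way to spread the load is sketched below. Instead of a true random delay, it derives a stable per-host offset from a checksum of the hostname, which is a swapped-in technique and not necessarily what the real scripts do. `start_time_for` is a hypothetical name.

```shell
# Hypothetical helper: map a hostname to a stable start time in the
# 06:00-10:00 window (240 minutes), so each server keeps its own slot.
start_time_for() {
    # cksum prints "<crc> <bytes>"; keep the CRC as a pseudo-random number
    crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    offset=$((crc % 240))
    printf '%02d:%02d\n' $((6 + offset / 60)) $((offset % 60))
}

# Example: start_time_for "$(hostname)"
```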

Upgrade warmup

📄 parse-listupgrade.sh on workstation

sync emails from dedicated INBOX
build lists of servers
prepare batches
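
Preparing batches from a list of servers can be sketched in a few lines. `make_batches` is a hypothetical name; the input is assumed to be one hostname per line.

```shell
# Read one hostname per line on stdin, print one batch per line
# with at most $1 hostnames each.
make_batches() {
    awk -v size="$1" '
        { line = (NR % size == 1 || size == 1) ? $0 : line " " $0 }
        NR % size == 0 { print line; line = "" }
        END { if (line != "") print line }
    '
}

# Example: make_batches 18 < servers.txt
```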

Divide and conquer

use a terminal multiplexer
run actions in parallel
focus on a single server if needed

ClusterSSH

discrete terminal per server + common text input
very simple and efficient
issues with copy/paste and special chars

fine-tuning to prevent windows from overlapping

# ~/.clusterssh/config

terminal_reserve_bottom=0
terminal_reserve_top=32

xpanes, the power of Tmux

Single window with multiple panes
More robust and rich, but more complex than CSSH
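
A minimal sketch of opening one batch with xpanes, one SSH pane per server. `open_batch` is a hypothetical wrapper, and the plain `ssh {}` command is an assumption about how sessions are started.

```shell
# $1 = space-separated hostnames of one batch
# $2 = optional "echo" for a dry run that only prints the command
open_batch() {
    batch=$1
    run=${2:-}
    # -c sets the command run in each pane; {} is replaced by each argument.
    # $batch is left unquoted on purpose: one argument per hostname.
    $run xpanes -c 'ssh {}' $batch
}

# Example (dry run): open_batch "web1 web2 web3" echo
```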

Upgrade time

8 batches of 18 servers => 144 servers in a group

Pause after each group
check monitoring, fix issues…
optional longer pause if needed
consider interrupting the session if needed

🔁 rinse and repeat 4-5 times
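
The arithmetic behind these numbers, as a quick check (the variable names are illustrative):

```shell
batch_size=18
batches_per_group=8
group_size=$((batch_size * batches_per_group))   # 144 servers per group

for groups in 4 5; do
    echo "$groups groups: $((groups * group_size)) servers"
done
```

Four to five groups give 576 to 720 servers, roughly in line with the 600-700 servers of the Thursday evening slot.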

Special monitoring

check frequency is increased
alerts are sent directly to the upgrade person/team
issues are fixed as soon as possible

Let's zoom in on a server

📄 maj.sh on every server

a single script to upgrade the server
run with DEBIAN_FRONTEND=noninteractive
keep a lot of logs (outputs, APT term/history…)
limited verbosity + TERM colors for clarity
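
The "quiet on screen, complete in the logs" idea could be sketched like this. `run_logged` is a hypothetical helper, not the actual content of maj.sh.

```shell
# Hypothetical helper: run a command non-interactively, keep its full
# output in a log file, and print only a one-line colored status.
# $1 = log file, remaining arguments = command to run
run_logged() {
    log=$1
    shift
    if DEBIAN_FRONTEND=noninteractive "$@" >>"$log" 2>&1; then
        printf '\033[32mOK\033[0m   %s\n' "$*"      # green
    else
        printf '\033[31mFAIL\033[0m %s (see %s)\n' "$*" "$log"   # red
        return 1
    fi
}

# Example: run_logged /var/log/upgrade.log apt-get upgrade --assume-yes
```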

Kernel cleanup

check for available space
keep running and latest kernels
based on bullseye /etc/kernel/postinst.d/apt-auto-removal
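
A simplified sketch of the "keep running and latest" rule. It is not the logic of the bullseye script mentioned above; `removable_kernels` is a hypothetical name, and `sort -V` (GNU coreutils) only approximates Debian version ordering.

```shell
# Read installed kernel versions (one per line) on stdin and print those
# that are neither the running kernel ($1) nor the latest one.
removable_kernels() {
    running=$1
    sort -V | awk -v running="$running" '
        { v[NR] = $0 }
        # the last entry after sorting is the latest kernel: never printed
        END { for (i = 1; i < NR; i++) if (v[i] != running) print v[i] }
    '
}

# Example: ls /boot/vmlinuz-* | sed 's|.*/vmlinuz-||' | removable_kernels "$(uname -r)"
```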

apt-get upgrade

upgrade packages on host system
then loop over LXC containers if any
use the custom state files
run with --no-download --no-remove
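
The host-then-containers loop could be sketched as below. The function names are hypothetical, and reusing the same custom lists directory inside containers is an assumption.

```shell
LISTS_DIR=/var/lib/listupgrade   # assumed: same directory as in the preparation step

# Optional arguments are a command prefix: "echo" for a dry run,
# or "lxc-attach -n NAME --" to run inside a container.
upgrade_cmd() {
    "$@" apt-get upgrade --assume-yes --no-download --no-remove \
        -o "Dir::State::Lists=$LISTS_DIR" </dev/null
}

# Upgrade the host, then every container named on stdin (one per line).
upgrade_all() {
    run=${1:-}
    upgrade_cmd $run || return 1
    while read -r name; do
        upgrade_cmd $run lxc-attach -n "$name" -- || return 1
    done
}

# Example (dry run): lxc-ls -1 --running | upgrade_all echo
```

With `--no-download`, only the packages fetched during preparation are installed; `--no-remove` refuses any upgrade that would remove a package.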

need restart?

when upgrading glibc, libssl…
needrestart -b tells us what needs a restart
show systemctl status after service restarts

                        
    # needrestart -b
    NEEDRESTART-VER: 3.4
    NEEDRESTART-KCUR: 4.19.0-26-cloud-amd64
    NEEDRESTART-KEXP: 4.19.0-27-cloud-amd64
    NEEDRESTART-KSTA: 3
    NEEDRESTART-SVC: auditd.service
    NEEDRESTART-SVC: dbus.service
    NEEDRESTART-SVC: getty@tty1.service
    NEEDRESTART-SVC: munin-node.service
    NEEDRESTART-SVC: ntp.service
    NEEDRESTART-SVC: rsyslog.service
    NEEDRESTART-SVC: squid.service
    NEEDRESTART-SVC: systemd-logind.service
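
The batch output above is easy to parse. A minimal sketch, with hypothetical function names:

```shell
# Read `needrestart -b` output on stdin; print the services needing a restart.
services_to_restart() {
    sed -n 's/^NEEDRESTART-SVC: //p'
}

# Succeed if the running kernel (KCUR) differs from the expected one (KEXP).
kernel_outdated() {
    awk '$1 == "NEEDRESTART-KCUR:" { cur = $2 }
         $1 == "NEEDRESTART-KEXP:" { want = $2 }
         END { exit !(cur != "" && want != "" && cur != want) }'
}

# Example: needrestart -b | services_to_restart
```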

bookkeeping

notifications to clients and sysadmins
commit changes in /etc (equivalent of etckeeper)
traceability tasks

are we done yet?

only green lines at the end => all good!
[PageUp]-[PageDown] to read history
focus on servers with issues
[Ctrl-L] or [Enter] to check for a responsive shell

Review

The good

1 hour for 1000 servers
a lot of automation
without losing control

Possible improvements

no strategy for reboots
insufficient information about LXC containers
mostly a single-person workflow
a batch is as slow as the slowest server
hiding APT logs is nice, except when we need them

How do you do this?

Thank you