How to upgrade

1000+ servers in 1 hour

Evolix – Grégory Colpart & Jérémy Lecour – MiniDebConf 2024

20 years in the game

1000+ Debian servers

just a few shell scripts

minor updates

vs. major upgrades

Package sources

Main internal mirror
    1082 mirror.evolix.org
Evolix public repository
    1014 pub.evolix.org
      30 pub.evolix.net
Official repositories
     824 security.debian.org
     566 archive.debian.org
ELTS repositories
     425 elts.evolix.org
      10 deb.freexian.com
Third-party repositories
     211 artifacts.elastic.co   
      22 packages.elastic.co
     152 deb.nodesource.com
      64 dl.yarnpkg.com
     138 hwraid.le-vert.net
      29 downloads.linux.hpe.com
      77 download.docker.com
      47 packages.fluentbit.io
      37 repo.percona.com
      33 packages.wazuh.com
      17 apt.postgresql.org
      16 apt.grafana.com
      15 packages.sury.org
      12 download.nfs-ganesha.org
      12 download.gluster.org
      11 packages.tideways.com

How to upgrade them all?

Some possible ways to do this

  • manually, over SSH, 1 server at a time
  • unattended-upgrade
  • automated apt upgrade with Ansible…
  • apt-dater
  • our own mix

We need to automate

  • updating state and downloading packages
  • safety checks (disk space…)
  • LXC containers

… but keep a finger on the pulse

In the real world

  • flaky services
  • manual actions by clients during updates
  • specific order of operations

Inform clients in advance

  • list of upgradable packages
  • list of services that will restart
  • possibility to hold if necessary

A consistent workflow

  • Tuesday morning: preparation and notifications
  • Tuesday afternoon: upgrade slot 1 (non-critical servers, such as pre-production)
  • Thursday evening: upgrade slot 2 (the majority of critical servers)
  • Friday morning: upgrade slot 3 (critical servers for clients who prefer the morning)

Tuesday morning (prep)

  • update APT
  • download packages
  • inform clients

Tuesday afternoon (upgrade 1/3)

  • upgrade non-critical servers
  • 200-250 servers

Thursday evening (upgrade 2/3)

  • upgrade majority of critical servers
  • 600-700 servers

Friday morning (upgrade 3/3)

  • upgrade "working hours" servers
  • upgrade USA/CAN servers
  • 50-100 servers

How often?

  • 1-3 times a month
  • avoid holidays

Preparation

📄 listupgrade.sh on every server

  • GET config from URL
  • disabled if no human action
  • allow/block list of releases
  • allow/block list of packages
  • abort and disable on any error
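
As a rough sketch of the package-list step (not the actual listupgrade.sh), the function below extracts package names from the `Inst` lines of a simulated upgrade and drops the ones found on a block list. The function name, the block list path in the usage comment and the one-name-per-line file format are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not taken from listupgrade.sh.
# Reads `apt-get -s upgrade` output on stdin and prints the names of
# upgradable packages, minus those listed (one per line) in the file $1.
upgradable_packages() {
    blocklist="$1"
    # Simulated upgrades print one "Inst <package> ..." line per package.
    # grep exits with 1 when every package is blocked, hence "|| true".
    awk '$1 == "Inst" { print $2 }' | grep -v -x -F -f "$blocklist" || true
}

# Possible usage (the path is made up):
#   apt-get -s upgrade | upgradable_packages /etc/listupgrade.blocklist
```

The resulting list is what would go into the notification sent to clients; an empty list means there is nothing to announce for that server.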

Freeze until upgrade

apt update -o Dir::State::Lists=/var/lib/listupgrade

apt upgrade --download-only

Timing matters

  • every Tuesday at 09:42
    🔨 hammer effect on mirrors
  • randomize between 06:00 and 10:00
    ⚖️ spread the load, reduce errors
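
One way to spread the runs over the 06:00 to 10:00 window is to start every server at 06:00 and sleep for a random offset first. The sketch below only illustrates that idea: the function names are invented, and reading random bytes from /dev/urandom is an assumption about what is available on the servers.

```shell
#!/bin/sh
# Sketch only: the 06:00 start and the 4-hour window come from the slide,
# the function names are made up for illustration.
WINDOW_START_HOUR=6
WINDOW_SECONDS=$((4 * 3600))

# Print a random offset in [0, WINDOW_SECONDS), from 4 bytes of /dev/urandom.
random_offset() {
    od -An -N4 -tu4 /dev/urandom |
        awk -v max="$WINDOW_SECONDS" '{ print $1 % max }'
}

# Convert an offset in seconds into the HH:MM time at which the run starts.
slot_time() {
    printf '%02d:%02d\n' $(( WINDOW_START_HOUR + $1 / 3600 )) $(( ($1 % 3600) / 60 ))
}

# Possible usage from a job scheduled at 06:00 on Tuesdays:
#   sleep "$(random_offset)" && listupgrade.sh
```

With around 1000 servers and 14400 possible offsets, only a handful of servers hit the mirrors in any given minute.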

Upgrade warmup

📄 parse-listupgrade.sh on workstation

  • sync emails from dedicated INBOX
  • build lists of servers
  • prepare batches
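
Batch preparation can be sketched as a small filter that turns a list of hostnames into lines of at most N hosts, one line per terminal window. This is an illustration, not an excerpt from parse-listupgrade.sh; the function name and the input format (one hostname per line) are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: reads one hostname per line on stdin and prints
# one batch per line, each holding at most $1 hostnames.
make_batches() {
    awk -v size="$1" '
        NF == 0 { next }    # ignore empty lines
        {
            # Start a new batch every "size" hosts, otherwise append.
            line = (n % size == 0) ? $1 : line " " $1
            n++
        }
        n % size == 0 { print line }
        END { if (n % size != 0) print line }   # last, partial batch
    '
}

# Possible usage (servers.txt is a made-up file name):
#   make_batches 18 < servers.txt
```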

Divide and conquer

  • use a terminal multiplexer
  • run actions in parallel
  • focus on a single server if needed

ClusterSSH

  • discrete terminal per server + common text input
  • very simple and efficient
  • issues with copy/paste and special chars
  • fine-tuning to prevent windows from overlapping
                                    

    # ~/.clusterssh/config

    terminal_reserve_bottom=0
    terminal_reserve_top=32

xpanes, the power of tmux

  • Single window with multiple panes
  • More robust and rich, but more complex than CSSH

Upgrade time

  • 8 batches of 18 servers => 144 servers in a group
  • pause after each group
  • check monitoring, fix issues…
  • optional longer pause if needed
  • consider interrupting the session if needed

🔁 rinse and repeat 4-5 times
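
The arithmetic behind "4-5 times" can be checked with a short sketch: 8 batches of 18 servers make a group of 144, so 4 groups cover 576 servers and 5 groups cover 720, which brackets the 600-700 servers of the Thursday slot. The variable and function names below are illustrative only.

```shell
#!/bin/sh
# Values from the slide; names are made up for this sketch.
BATCHES_PER_GROUP=8
SERVERS_PER_BATCH=18

# Number of servers handled in one group.
group_size() {
    echo $(( BATCHES_PER_GROUP * SERVERS_PER_BATCH ))
}

# Number of groups needed for $1 servers (ceiling division).
group_count() {
    size=$(group_size)
    echo $(( ($1 + size - 1) / size ))
}

# Possible usage:
#   group_count 650
```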

Special monitoring

  • check frequency is increased
  • alerts are sent directly to the upgrade person/team
  • issues are fixed as soon as possible

Let's zoom in on a server

📄 maj.sh on every server

  • a single script to upgrade the server
  • run with DEBIAN_FRONTEND=noninteractive
  • keep a lot of logs (outputs, APT term/history…)
  • limited verbosity + TERM colors for clarity
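
The combination of full logs and limited on-screen output could look like the following sketch: each step runs with DEBIAN_FRONTEND=noninteractive, its output goes to a log file, and the terminal only shows one green or red line. This is an assumption about the general shape, not code from maj.sh; `run_step` and the log path in the usage comment are invented.

```shell
#!/bin/sh
# Hypothetical wrapper: run_step <label> <logfile> <command> [args...]
# Appends the command output to the log and prints a one-line coloured status.
run_step() {
    label="$1"
    log="$2"
    shift 2
    status=0
    DEBIAN_FRONTEND=noninteractive "$@" >>"$log" 2>&1 || status=$?
    if [ "$status" -eq 0 ]; then
        printf '\033[32m[OK]\033[0m %s\n' "$label"
    else
        printf '\033[31m[FAIL]\033[0m %s (exit %s, see %s)\n' "$label" "$status" "$log"
    fi
    return "$status"
}

# Possible usage (the log path is made up):
#   run_step "apt-get upgrade" /var/log/upgrade-demo.log apt-get upgrade --yes
```

Keeping the detailed output in a file rather than discarding it means the APT logs are still there when a red line needs investigating.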

Kernel cleanup

  • check for available space
  • keep running and latest kernels
  • based on bullseye /etc/kernel/postinst.d/apt-auto-removal
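
The "keep running and latest" rule can be illustrated with a much-simplified sketch. It is not the logic of the bullseye hook, which handles more cases; the function names are invented, and `sort -V` assumes GNU coreutils, as found on Debian.

```shell
#!/bin/sh
# Free space, in kilobytes, on the filesystem holding path $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Reads installed kernel versions (one per line) on stdin and prints the
# ones that are neither the running kernel ($1) nor the latest version.
removable_kernels() {
    running="$1"
    sorted=$(sort -V)
    latest=$(printf '%s\n' "$sorted" | tail -n 1)
    # grep exits with 1 when nothing is left to remove, hence "|| true".
    printf '%s\n' "$sorted" | grep -v -x -F -e "$running" -e "$latest" || true
}

# Possible usage:
#   free_kb /boot
#   removable_kernels "$(uname -r)" < installed-kernels.txt   # made-up file
```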

apt-get upgrade

  • upgrade packages on host system
  • then loop over LXC containers if any
  • use the custom state files
  • run with --no-download --no-remove
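
The order of operations above can be sketched as a dry run that prints the commands instead of executing them: the host first, then each container. The lists directory matches the one used during preparation; reusing the same path inside containers, the `--yes` flag and the function names are assumptions.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands, runs nothing.
LISTS_DIR=/var/lib/listupgrade

# The upgrade command: packages were already downloaded during preparation,
# and the frozen package lists are reused.
upgrade_cmd() {
    echo "apt-get upgrade --yes --no-download --no-remove -o Dir::State::Lists=$LISTS_DIR"
}

# Arguments: LXC container names, if any. The host always comes first.
plan_upgrade() {
    upgrade_cmd
    for ct in "$@"; do
        echo "lxc-attach -n $ct -- $(upgrade_cmd)"
    done
}

# Possible usage, with container names taken from lxc-ls:
#   plan_upgrade $(lxc-ls -1 --active)
```

Because of `--no-download`, a package that was not fetched on Tuesday is not silently pulled in at upgrade time, and `--no-remove` makes APT refuse any upgrade that would remove a package.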

need restart?

  • when upgrading glibc, libssl
  • needrestart -b tells us what needs a restart
  • show systemctl status after service restarts
                        
    # needrestart -b
    NEEDRESTART-VER: 3.4
    NEEDRESTART-KCUR: 4.19.0-26-cloud-amd64
    NEEDRESTART-KEXP: 4.19.0-27-cloud-amd64
    NEEDRESTART-KSTA: 3
    NEEDRESTART-SVC: auditd.service
    NEEDRESTART-SVC: dbus.service
    NEEDRESTART-SVC: getty@tty1.service
    NEEDRESTART-SVC: munin-node.service
    NEEDRESTART-SVC: ntp.service
    NEEDRESTART-SVC: rsyslog.service
    NEEDRESTART-SVC: squid.service
    NEEDRESTART-SVC: systemd-logind.service
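
The batch output above is easy to consume from a script. As a minimal sketch (the function names are invented, and this is not taken from maj.sh), the helpers below list the services to restart and report whether the running kernel differs from the expected one.

```shell
#!/bin/sh
# Print the service names found in `needrestart -b` output read from stdin.
services_to_restart() {
    sed -n '/^NEEDRESTART-SVC:/{
        s/^NEEDRESTART-SVC: *//
        s/ *$//
        p
    }'
}

# Succeed (exit 0) when the running kernel (KCUR) and the expected
# kernel (KEXP) are both reported and differ.
kernel_differs() {
    awk -F': *' '
        $1 == "NEEDRESTART-KCUR" { cur = $2 }
        $1 == "NEEDRESTART-KEXP" { want = $2 }
        END { exit !(cur != "" && want != "" && cur != want) }
    '
}

# Possible usage:
#   needrestart -b | services_to_restart
```

In the output shown above, KCUR and KEXP differ, so a reboot would be needed to run the newer kernel; restarting the listed services does not cover that.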
                        
                    

bookkeeping

  • notifications to clients and sysadmins
  • commit changes in /etc (equivalent of etckeeper)
  • traceability tasks

are we done yet?

  • only green lines at the end => all good!
  • [PageUp]-[PageDown] to read history
  • focus on servers with issues
  • [Ctrl-L] or [Enter] to check for a responsive shell

Review

The good

  • 1 hour for 1000 servers
  • a lot of automation
  • without losing control

Possible improvements

  • no strategy for reboots
  • insufficient information about LXC containers
  • mostly a single-person workflow
  • a batch is as slow as the slowest server
  • hiding APT logs is nice, except when we need them

How do you do this?

Thank you