Oh, that’s easy. We’ll shut down the instance, snapshot its root volume, and then create a larger volume from the snapshot! Simple!
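The plan, sketched in AWS CLI terms (the instance, volume, and AZ identifiers here are placeholders, and the size is made up; this assumes working AWS credentials):

```shell
# Stop the instance so the volume is consistent before snapshotting.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Snapshot the root volume.
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "pre-resize snapshot"

# Create a larger volume from that snapshot, in the same AZ as the instance.
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
    --size 100 --availability-zone us-east-1a --volume-type gp3
```

Then detach the old root volume, attach the new one in its place, and boot.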
Somebody added an /etc/fstab entry for a network file system that isn’t mounting. The instance drops into single-user recovery mode because the nofail option isn’t specified. Well, we’ll attach the volume to another instance and fix the fstab there.
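The whole failure mode comes down to one missing mount option. A hypothetical NFS entry (the server and paths are made up), without and with nofail:

```
# /etc/fstab — without nofail, a mount failure drops the boot into
# single-user recovery mode:
fileserver:/export/data  /mnt/data  nfs  defaults                 0 0

# With nofail (and _netdev, so it waits for the network), the boot
# continues even when the server is unreachable:
fileserver:/export/data  /mnt/data  nfs  defaults,nofail,_netdev  0 0
```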
The original image was a CentOS 7 image from the AWS Marketplace, and EC2 won’t let me attach it to a running instance. Awesome. Let’s spin up another instance that we can easily attach it to. I still have the Terraform script I used to launch it, so I’ll just change the Terraform resource name.
Instance boots; shut it down, attach the other volume to fix the fstab, and… the instance won’t boot. Somehow, /dev/xvdf is causing that remote filesystem to mount. How can that be? Detach the volume and the recovery instance boots. Re-attach the volume to recovery as /dev/xvdg and… nope, the instance hangs for the same reason. How can this be?
Theory: for all the things cloud-init changes, it doesn’t change the filesystem UUID, and the UUID is what the bootloader references. The newly attached volume has the same UUID as the root volume and overrides it, making the newly attached /dev/xvd[fg] the root volume, processing its /etc/fstab, and hanging the instance in single-user mode.
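If that theory is right, it should be directly visible: both filesystems carry the same UUID, and the GRUB config selects the root by it. A way to check, assuming the CentOS 7 default device names and grub2 path (not verified here):

```shell
# Compare filesystem UUIDs on the root volume and the newly attached one;
# if the theory holds, these print the same UUID:
blkid /dev/xvda1 /dev/xvdf1

# See whether the bootloader picks the root filesystem by UUID:
grep -o 'root=UUID=[0-9a-f-]*' /boot/grub2/grub.cfg | sort -u
```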
Alright, how about I change the UUID of the root volume? I can’t: xfs_admin won’t change the UUID of a mounted filesystem, and I am so not doing direct editing of the block device. Terminate the instance, launch from a different AMI that will have a different UUID on its root device, shut that down, attach the volume to recovery, edit /etc/fstab, terminate the instance, and finally we’re far enough along to re-snapshot the broken volume. This should be fast, as only a few boot logs and the /etc/fstab blocks have really changed since this whole thing started. OK, it’s a little slower than I’d like, but it does get done.
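For the record, the UUID change itself is trivial once the filesystem is unmounted, e.g. with the broken volume attached to a recovery instance but not mounted (the device name here is an assumption):

```shell
# Only works while the XFS filesystem is NOT mounted:
xfs_admin -U generate /dev/xvdf1   # assign a fresh random UUID
blkid /dev/xvdf1                   # confirm the new UUID took
```

That would have broken the duplicate-UUID collision without the whole terminate-and-relaunch dance.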
That… was not fun. Marketplace codes attached to zero-pay open source images are the devil. They do nothing to serve their goal of protecting the image, but they cause all the pain.
PS: All of this was avoidable if I’d just had the machine’s Salt configuration role defined. I grabbed the locally created .sls file I’d written as I went along when this instance was originally launched, headed into my repo to add it as a new role… and it’s already there. I could have launched a new instance with the appropriate Salt configuration the whole time. THE WHOLE TIME!