Monday, 2023-05-01

<opendevreview> sean mooney proposed openstack/nova master: [WIP] use alpine instead of cirros  https://review.opendev.org/c/openstack/nova/+/881912  [12:54]
<opendevreview> sean mooney proposed openstack/nova master: [WIP] use alpine instead of cirros  https://review.opendev.org/c/openstack/nova/+/881912  [13:01]
<opendevreview> sean mooney proposed openstack/nova master: [WIP] use alpine instead of cirros  https://review.opendev.org/c/openstack/nova/+/881912  [13:04]
<dansmith> sean-k-mooney: so check this out: https://264a86dd518c5142b8a6-3b04393c7506ab488b4fde073fc22e36.ssl.cf1.rackcdn.com/881764/1/check/cinder-tempest-plugin-lvm-multiattach/d28afe7/testr_results.html  [13:31]
<dansmith> maybe kashyap too ^  [13:32]
<dansmith> that almost kinda looks like we attached the disk to the guest and the kernel crashed while probing partitions/filesystems, right?  [13:32]
<dansmith> in sysfs_add_file_mode_ns()  [13:33]
<dansmith> or perhaps while creating sysfs entries for the disk?  [13:33]
<dansmith> eharney: you around by chance?  [16:30]
<eharney> dansmith: yes  [16:45]
<dansmith> eharney: so, I seem to have gotten the nova ceph job down to a single repeatable failure  [16:45]
<dansmith> it is in the volume extend test, and it fails during cleanup  [16:45]
<dansmith> it's trying to, I guess, detach the volume from the server before deleting the server, and then before deleting the volume  [16:46]
<dansmith> it does *not* fail locally, so I don't think it's something fundamentally broken with new ceph or anything like that  [16:46]
<dansmith> https://955f32f8268e5d475e65-6c8f4c6e546a0854b4c11cc7c78829ca.ssl.cf5.rackcdn.com/881585/6/check/nova-ceph-multistore/1ecb09a/testr_results.html  [16:46]
<eharney> dansmith: let me take a look through the logs  [16:47]
<eharney> detach failing isn't something i'm familiar with (other than it being mentioned here the other day)  [16:47]
<dansmith> I've just been tracing through the code looking for how this works and it seems to me like everything is working, but perhaps it's just legitimately that the guest doesn't let go of the volume when we detach after the resize happened  [16:48]
<dansmith> okay, volume detach is without a doubt our most common failure  [16:48]
<dansmith> eharney: yeah, appreciate if you could see if you spot anything in the logs  [16:48]
<dansmith> I see the guest saw the size change on vdb and also mentions that it's resizing the filesystem on it... that looks like more than just the block device size change,  [17:00]
<dansmith> so I wonder if it is literally doing a resize2fs on it and that is still happening when we try to issue the detach and that gets us stuck  [17:00]
<dansmith> because we start the detach less than half a second after the resize happens  [17:01]
<eharney> dansmith: yeah, i was also just looking down the path of whether the libvirt block resize call is synchronous or not (it looks like nova assumes it is?)  [17:03]
<dansmith> eharney: synchronous with what? It's synchronous to libvirt from compute, but I don't know that it waits to return until it's delivered to the guest, but definitely stuff like resizing filesystems would happen after that returns  [17:04]
<dansmith> even still, the test polls for completion of the operation as far as nova is concerned, but the guest stuff would all be async  [17:05]
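(A rough sketch, not nova's actual code, of the libvirt-python call being discussed; the connection URI, domain name, device name and size are illustrative. blockResize() returns once the resize has been issued to QEMU, while anything the guest does in response, such as an automatic resize2fs of a mounted filesystem, happens asynchronously afterwards, which is the window described above.)

    # Illustrative only: roughly how an online volume extend reaches libvirt.
    import libvirt

    new_size_bytes = 2 * 1024 * 1024 * 1024       # hypothetical new size (2 GiB)

    conn = libvirt.open('qemu:///system')          # assumed local qemu connection
    dom = conn.lookupByName('instance-00000001')   # made-up domain name
    # Returns once the resize has been issued; in-guest follow-up work
    # (partition probing, automatic resize2fs, etc.) is not waited for.
    dom.blockResize('vdb', new_size_bytes,
                    flags=libvirt.VIR_DOMAIN_BLOCK_RESIZE_BYTES)
    conn.close()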
<eharney> i see  [17:05]
<dansmith> I think maybe I should make the test ssh to the guest and see if it's mounted, and perhaps try to unmount it before the test ends or something, to avoid racing with the detach  [17:05]
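(That idea might look roughly like the sketch below, assuming tempest's RemoteClient; the helper name, device path and credential handling are made up, not an actual tempest change.)

    # Hypothetical test-side helper: ssh into the guest and make sure the
    # attached device is not mounted before the detach is issued.
    from tempest.lib.common.utils.linux import remote_client

    def ensure_unmounted(ip, username, pkey, device='/dev/vdb'):
        ssh = remote_client.RemoteClient(ip, username, pkey=pkey)
        if device in ssh.exec_command('mount'):
            # best effort: let go of the filesystem so the detach doesn't
            # race with in-guest activity on it
            ssh.exec_command('sudo umount %s' % device)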
<dansmith> the resize happens before cirros is even done with its startup stuff,  [17:11]
<dansmith> so I wonder if, on a slow emulated guest, we mount all the filesystems we find, which means it's mounted in the guest when we resize, so it does the resize2fs activity automatically  [17:11]
<dansmith> but on a fast non-nested local run, it finishes startup before we do the attach and thus doesn't end up with it mounted during the resize  [17:11]
<eharney> is resize2fs etc triggered by qemu-guest-agent?  [17:12]
<dansmith> could be.. does cirros have the guest agent in it? I assumed not  [17:13]
<eharney> i don't think so  [17:13]
<dansmith> on regular systems I've never had resize2fs triggered automatically for me, so I'm not really sure why that would happen  [17:13]
<dansmith> but it seems like it is here  [17:13]
<eharney> i would have guessed that resize2fs doesn't happen in these test jobs  [17:14]
<dansmith> me too  [17:14]
<dansmith> you see it in the output though right?  [17:14]
<eharney> no, where is that?  [17:14]
<dansmith> [   48.156649] EXT4-fs (vda1): resizing filesystem from 25600 to 259835 blocks  [17:14]
<dansmith> [   48.255755] EXT4-fs (vda1): resized filesystem to 259835  [17:14]
<dansmith> in the guest console dump  [17:14]
<dansmith> oh damn  [17:14]
<dansmith> that is vda, nevermind!  [17:14]
<dansmith> that's cirros resizing its root disk on startup, not the attached volume  [17:15]
<eharney> ah, right  [17:15]
<dansmith> mah bad  [17:15]
<dansmith> so, without a doubt the most common failure in nova jobs is failing to detach volumes  [17:17]
<dansmith> we've been trying to get a handle on it for a long time,  [17:17]
<eharney> does it show up on non-rbd volumes?  [17:17]
<dansmith> yeah  [17:17]
<dansmith> there is some assertion that if we attach to an instance before it is far enough along during boot, then it might prevent it from being detached later  [17:18]
<dansmith> I'm slightly skeptical of that, but we've been adding "wait for sshable" checks everywhere  [17:18]
<eharney> i guess i'm not sure what kind of conditions in the libvirt area would prevent detach from completing  [17:19]
<dansmith> I just added that for this test recently (merged on Friday) which passes normally (and passes locally), but with this rbd job it seems to fail.. it passed the gate on the focal-based rbd, but not on new ceph and jammy  [17:19]
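(For reference, the "wait for sshable" idea boils down to something like the sketch below; this is illustrative, not the merged tempest change, and the function name and parameters are assumed.)

    # Block until the guest accepts SSH before attaching/detaching volumes,
    # so the instance is known to be far enough through boot.
    from tempest.lib.common.utils.linux import remote_client

    def wait_until_sshable(ip, username, pkey):
        client = remote_client.RemoteClient(ip, username, pkey=pkey)
        # raises if the guest can't be reached/authenticated within the
        # configured timeout, failing the test early instead of racing later
        client.validate_authentication()
        return client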
