Thursday, 2023-08-10

01:01 <opendevreview> melanie witt proposed openstack/nova master: WIP libvirt: Introduce support for rbd with LUKS  https://review.opendev.org/c/openstack/nova/+/889912
13:24 <opendevreview> Elod Illes proposed openstack/nova master: Fix tox docs target  https://review.opendev.org/c/openstack/nova/+/891033
14:00 <opendevreview> Jimmy McCrory proposed openstack/nova master: Fix typo in placement audit command argument  https://review.opendev.org/c/openstack/nova/+/891035
14:27 <opendevreview> Dan Smith proposed openstack/nova master: DNM: Test glance new location API  https://review.opendev.org/c/openstack/nova/+/891036
14:56 <kashyap> dansmith: Thank you for doing the recheck work for me!  https://review.opendev.org/c/openstack/nova/+/887255  </embarrassed>
15:28 <ozzzo_work> I have a couple of VMs that are half-created: https://paste.openstack.org/show/b7wqqzgumoH8tdtvmEh1/
15:28 <ozzzo_work> how can I clean this up? I looked in the db and they don't exist in "instances"
15:30 <dansmith> ozzzo_work: check the api database's instance_build_requests table
15:31 <dansmith> I'm guessing you're seeing that because those request-to-build-an-instance entries are there and we never got beyond that
15:31 <dansmith> I'd check the logs to see if you're seeing any fireworks there, because that *should* be resolvable, AFAIK
15:31 <dansmith> any manual DB surgery that has been done?
15:49 <ozzzo_work> oic, it was in the nova_cell0 database
15:50 <ozzzo_work> we're not using multiple cells but I guess everything must go in cell0 by default now
16:12 <dansmith> ozzzo_work: dead instances that fail to schedule always go into the cell0 DB
16:12 <dansmith> everyone is using cells, you might just only have one cell of computes :)
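
[Editor's note: for readers following along, a minimal sketch of the lookup being described above, assuming a devstack-style MySQL setup. The credentials, the instance UUID, and the exact table name are assumptions (current nova schemas call the API-database table build_requests); verify against your own deployment before touching anything.]

    # Hedged sketch: find where a half-created instance ended up. Database
    # names (nova_api, nova_cell0), credentials, and the UUID are placeholders.
    import pymysql

    UUID = "00000000-0000-0000-0000-000000000000"  # placeholder instance uuid
    CREDS = dict(host="localhost", user="root", password="secret")  # assumed

    # Build requests that never made it past scheduling linger in the API DB.
    api = pymysql.connect(database="nova_api", **CREDS)
    cur = api.cursor()
    cur.execute("SELECT instance_uuid, created_at FROM build_requests"
                " WHERE instance_uuid = %s", (UUID,))
    print("build_requests:", cur.fetchall())

    # Instances that failed to schedule are recorded in cell0, not in the
    # main cell's "instances" table.
    cell0 = pymysql.connect(database="nova_cell0", **CREDS)
    cur = cell0.cursor()
    cur.execute("SELECT uuid, vm_state, task_state FROM instances"
                " WHERE uuid = %s", (UUID,))
    print("cell0 instances:", cur.fetchall())
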
17:02 <melwitt> dansmith: have you noticed "no valid host" fails in nova-ceph-multistore and "Seems service nova-compute on host np0034917240 is down. Last heartbeat was 2023-08-10 15:59:29. Elapsed time is 847.603029" messages in the n-sch log?
17:03 <dansmith> I saw one example of that and it looked like rabbit crashed or something because I saw a bunch of amq errors in multiple places
17:04 <melwitt> hm, I hadn't noticed the rabbit correlation
17:06 <melwitt> in the couple of examples I saw so far, the n-cpu log just stops ... early. and the "returned 0 hosts" messages in n-sch start after the last timestamp seen in n-cpu
17:09 <melwitt> last log message Aug 10 16:00:00.843521 https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/screen-n-cpu.txt and then first "ComputeFilter returned 0 hosts" at Aug 10 16:13:36.604315 https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/screen-n-sch.txt#3881
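
[Editor's note: for context, the "Seems service nova-compute on host ... is down" message is emitted when the scheduler's servicegroup check finds the compute service's last heartbeat too old; nova's default threshold is service_down_time = 60 seconds. A simplified sketch of that check, not nova's actual code:]

    # Simplified sketch of the heartbeat staleness check behind the scheduler
    # message quoted above. service_down_time=60 is nova's documented default;
    # everything else here is a stand-in, not the real servicegroup driver.
    from datetime import datetime, timezone

    SERVICE_DOWN_TIME = 60  # seconds ([DEFAULT]/service_down_time)

    def service_is_up(last_heartbeat: datetime) -> bool:
        # last_heartbeat is assumed to be a timezone-aware UTC datetime.
        elapsed = (datetime.now(timezone.utc) - last_heartbeat).total_seconds()
        # With an elapsed time of ~847s, as in the log message above, the
        # ComputeFilter drops the host and "returned 0 hosts" follows.
        return elapsed <= SERVICE_DOWN_TIME
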
17:10 <dansmith> melwitt: you've seen multiple cases of n-cpu just freezing or just the one?
17:11 <melwitt> my first thought was, is it possible n-cpu is being OOM killed or something, but I don't see any OOM messages in syslog.txt (assuming that's where they would be)
17:13 <dansmith> yeah I saw no OOM in my case either (n-cpu or otherwise)
17:13 <dansmith> yep, syslog
17:15 <melwitt> I saw two so far, the other one being https://zuul.openstack.org/build/4884721dbf764b33995d71c990f65dfe/logs
17:15 <melwitt> (there's a third but it's in a DNM ceph job I'm using to test ephemeral encryption)
17:16 <dansmith> oh that one was mine, yeah
17:19 <dansmith> melwitt: so in that last one, the last thing nova logs is running a ceph command ... :)
17:19 <dansmith> oh but it returned
17:22 <melwitt> :)
17:22 <melwitt> it does seem to stop at the same place though, "Creating image(s)"
17:27 <dansmith> weird, I dunno what would have changed in the last few days related to this, and I haven't seen this before
17:28 <melwitt> same :(
17:33 <sean-k-mooney> f****
17:34 <sean-k-mooney> DEBUG ovsdbapp.backend.ovs_idl.vlog [-] 4999-ms timeout {{(pid=1435862) __log_wakeup /opt/stack/data/venv/lib/python3.10/site-packages/ovs/poller.py:248}}
17:34 * melwitt imagines sean flipping a table
17:34 <sean-k-mooney> are we seeing that in ci?
17:35 <sean-k-mooney> it looks like the ovsdbapp thing might have regressed
17:35 <sean-k-mooney> I'm going to check my devstack/nova logs
17:35 <sean-k-mooney> that's the issue where ovs blocks nova for 5 seconds at a time
17:36 <melwitt> actually I found one in there https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/screen-n-cpu.txt#39523
17:36 <sean-k-mooney> this could be a regression from the devstack-in-a-venv change
17:36 <sean-k-mooney> but my code in devstack to work around this does not seem to have put the setting in nova.conf
17:37 <melwitt> it's logged 15 times on that run
17:38 <sean-k-mooney> really..... https://github.com/openstack/devstack/commit/db3eff7dd27acdc973e8d189bda80d642be92f03
17:39 <sean-k-mooney> I need to revert that
17:39 <sean-k-mooney> this is a fix only if you use a new enough ovs
17:39 <sean-k-mooney> which apparently is not in Ubuntu 22.04
17:42 <melwitt> the other example run has DEBUG ovsdbapp.backend.ovs_idl.vlog [-] 999-ms timeout {{(pid=98090) __log_wakeup /usr/local/lib/python3.10/dist-packages/ovs/poller.py:248}} not sure if that's bad also
17:42 <melwitt> it doesn't have the 4999
17:42 <sean-k-mooney> yep, the exact amount isn't really important
17:43 <sean-k-mooney> the timeout is around 5 seconds
17:43 <sean-k-mooney> but you can get reconnects at shorter intervals depending on what else nova is doing
17:46 <sean-k-mooney> https://review.opendev.org/c/openstack/devstack/+/891101 will revert it
17:46 <melwitt> wonder why I only see it with ceph. a coincidence?
17:46 <sean-k-mooney> probably; I'm seeing it locally
17:47 <sean-k-mooney> I noticed it while I was looking at something for dansmith
17:47 <melwitt> oh ok
17:52 <sean-k-mooney> so I have a 50 second span
17:52 <sean-k-mooney> where the compute agent does not log anything else, implying it was blocked for 50 seconds reconnecting to ovs
17:52 <sean-k-mooney> the problem is I have a version of ovs that apparently has this fix and a version of ovsdbapp that also has this fix
17:53 <sean-k-mooney> so either the fix does not work or it has regressed again
17:53 <sean-k-mooney> this would cause the ci jobs to fail a lot, but also cause issues for people in production obviously
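
[Editor's note: a quick way to quantify stalls like the 50-second gap described above is to scan the n-cpu log for large gaps between consecutive timestamps. A rough sketch, assuming lines begin with an "Aug 10 16:00:00.843521"-style timestamp as in the Zuul logs linked earlier; the threshold is arbitrary.]

    # Rough sketch: report gaps between consecutive log lines in a
    # screen-n-cpu.txt whose lines start with "Aug 10 16:00:00.843521"-style
    # timestamps (as in the builds linked above).
    import re
    import sys
    from datetime import datetime

    TS = re.compile(r"^([A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}\.\d+)")
    THRESHOLD = 5.0  # seconds; roughly the ovsdb reconnect stall discussed above

    prev = None
    for line in open(sys.argv[1], errors="replace"):
        m = TS.match(line)
        if not m:
            continue
        # The year is absent from the log line; strptime defaults it to 1900,
        # which is fine for deltas within a single day of output.
        ts = datetime.strptime(m.group(1), "%b %d %H:%M:%S.%f")
        if prev and (ts - prev).total_seconds() > THRESHOLD:
            print(f"{(ts - prev).total_seconds():6.1f}s gap ending at {m.group(1)}")
        prev = ts
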
17:56 <melwitt> sean-k-mooney++
17:58 <sean-k-mooney> so I have patched versions of os-vif, ovsdbapp and the ovs python bindings and it's still happening, meaning it looks like maybe it regressed again
18:00 <melwitt> 🤯
18:02 <sean-k-mooney> or this is the problem https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/ovsdb/impl_idl.py#L35
18:02 <sean-k-mooney> this is the ovsdbapp fix
18:02 <sean-k-mooney> https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L148-L150
18:02 <sean-k-mooney> in its subclass of idl.Idl
18:03 <sean-k-mooney> but we are doing "    return idl.Idl(conn, helper)" directly, using the one from ovs
18:04 <sean-k-mooney> that should be using the NeutronOvsdbIdl class below, I think
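
[Editor's note: to illustrate the distinction being drawn above: the quoted os-vif line constructs the base ovs Idl directly, so any inactivity-probe tuning done in an ovsdbapp subclass never applies. The sketch below only shows that pattern; the subclass name and probe handling are hypothetical stand-ins, not ovsdbapp's actual implementation (see the connection.py link above for the real fix).]

    # Illustrative only: the raw-Idl pattern the os-vif line above uses,
    # contrasted with a subclass that would carry probe tuning. TunedIdl is a
    # hypothetical stand-in, not ovsdbapp's real class.
    from ovs.db import idl

    def connect_raw(conn, helper):
        # What the quoted os-vif code does: no probe tuning, so a busy client
        # can be disconnected by ovsdb-server and block ~5s reconnecting.
        return idl.Idl(conn, helper)

    class TunedIdl(idl.Idl):  # hypothetical stand-in for the ovsdbapp subclass
        def __init__(self, conn, helper, probe_interval_ms=60000):
            super().__init__(conn, helper)
            # The real fix adjusts the JSON-RPC session's inactivity probe;
            # this attribute is only a placeholder for that behaviour.
            self.desired_probe_interval_ms = probe_interval_ms
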
18:35 <dansmith> melwitt: oh SNAP
18:35 <dansmith> melwitt: I wonder if the thing with the ceph job is something I broke
18:36 <dansmith> melwitt: https://review.opendev.org/c/openstack/devstack/+/890439
18:36 <dansmith> I adjusted the interval and the deadline so I think it *should* be okay
18:36 <dansmith> although that wouldn't explain nova just not logging anything after the hang
18:37 <melwitt> yeah.. 🤔
18:54 <sean-k-mooney> ok, the issue I mentioned might be fixed
18:55 <sean-k-mooney> I configured my keepalive interval in the kernel to 10 seconds I think, but ovs is expecting probes every 5 seconds
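
[Editor's note: the kernel keepalive interval mentioned above appears to refer to the TCP keepalive tunables; if so, the current values can be read from /proc/sys. That mapping is an assumption, so treat this as a checking aid only.]

    # Small sketch: print the kernel TCP keepalive tunables that the comment
    # above appears to refer to (an assumption; adjust if another knob is meant).
    from pathlib import Path

    for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
        value = (Path("/proc/sys/net/ipv4") / name).read_text().strip()
        print(f"net.ipv4.{name} = {value}")
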
18:55 <melwitt> yeah, I was just looking at it again and n-cpu hadn't reported a heartbeat in 14 minutes and never came back after the hang
18:57 <melwitt> bizarre
19:00 <melwitt> libvirt log activity also stops for the most part at the same time n-cpu stops https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/libvirt/libvirt/libvirtd_log.txt
19:00 <melwitt> not sure if anything near the end of that log is unusual or problematic
21:16 <melwitt> finally found that on those failures, at the same time that n-cpu freezes, ceph starts logging WRN (warn) health status such as "osd.0 47 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.15524.0:32 4.1e 4:7cf325a3:::rbd_header.3c6ebb6b8706:head [watch ping cookie 140317856977664] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e47)"
21:16 <melwitt> example https://zuul.openstack.org/build/4884721dbf764b33995d71c990f65dfe/log/controller/logs/ceph/ceph/7f57d687-6678-46f4-ae09-7e6faa476d1a/ceph-osd.0_log.txt#11752
21:17 <melwitt> log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'vms' : 1 ])
22:50 <opendevreview> melanie witt proposed openstack/nova master: DNM test ephemeral encryption + resize: qcow2, raw, rbd  https://review.opendev.org/c/openstack/nova/+/862416
