Wednesday, 2023-02-15

*** dasm is now known as dasm|off01:49
opendevreviewYusuke Okada proposed openstack/nova master: Fix failed count for anti-affinity check  https://review.opendev.org/c/openstack/nova/+/87321605:04
*** blarnath is now known as d34dh0r5306:47
opendevreviewTobias Urdin proposed openstack/nova master: libvirt: set remaining to 0 when no disk to migrate  https://review.opendev.org/c/openstack/nova/+/87384607:36
bauzasgood morning Nova08:14
bauzasgibi: I thought we merged most of the wait_for_ssh series for volume attachments issues https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6f9/868236/5/gate/nova-next/6f9f3d0/testr_results.html08:28
gibibauzas: good morning08:35
gibibauzas: what you see there is that the wait for ssh step we merged times out. based on the guest log, the guest is still waiting for the DHCP to finish when the tempest times out08:36
gibislow node? slow dhcp?08:36
bauzasoh so this is the udhcpc issue08:37
bauzashttps://bugs.launchpad.net/nova/+bug/200646708:37
gibino08:37
gibithis is the last message from the guest in your case08:37
gibiudhcpc: sending discover08:37
gibiin the bug you linked the discover fails08:38
bauzasyup, now I see the difference08:39
gibimaybe it would have failed the discover there as well if the test had waited long enough08:39
bauzasthe tempest test finishes sooner before the dhcp lease is attributed08:39
gibiyepp08:39
gibibumping the ssh timeout in tempest could be an option08:40
gibialso I suggest looking at a successful case and checking the guest log to see how fast the guest boots there08:40
gibiin the failed case cirros needs more than 12 sec to get to dhcp08:41
bauzashmmm08:41
bauzasthe problem here is that I don't know how to identify whether this issue is happening a lot or just unfortunate08:42
gibiwe have to check for `wait_for_ssh_or_ping` and remove the cases where the dhcp lease fails as per https://bugs.launchpad.net/nova/+bug/200646708:43
gibiprobably manual work08:43
gibithere is a work in progress to add multiple regex support to logsearch to filter for more than one independent pattern in a file but it is not on main branch yet08:44
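A rough sketch of the manual filtering gibi describes: keep the logs that mention `wait_for_ssh_or_ping` and drop those that also show the DHCP lease failure from bug 2006467. The function name, the sample logs, and the `No lease, failing` marker are assumptions for illustration only; this is not how logsearch is implemented.

```python
import re

# Pattern named in the discussion.
SSH_WAIT = re.compile(r"wait_for_ssh_or_ping")
# Hypothetical marker for the lease failure; adjust to the real console text.
LEASE_FAILED = re.compile(r"No lease, failing")


def is_ssh_timeout_without_lease_failure(log_text):
    """Return True if the log mentions the ssh wait but not the lease failure."""
    has_wait = SSH_WAIT.search(log_text) is not None
    has_lease_failure = LEASE_FAILED.search(log_text) is not None
    return has_wait and not has_lease_failure


# Made-up sample logs, keyed by a hypothetical job identifier.
sample_logs = {
    "job-a": "wait_for_ssh_or_ping timed out\nudhcpc: sending discover\n",
    "job-b": "wait_for_ssh_or_ping timed out\nNo lease, failing\n",
    "job-c": "all tests passed\n",
}

candidates = [
    job for job, text in sample_logs.items()
    if is_ssh_timeout_without_lease_failure(text)
]
print(candidates)
```

Two independent patterns are applied to the same file, which is the multiple-regex behaviour mentioned above as not yet available on the main branch.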
bauzasI'm also looking at the job output08:46
bauzastrying to identify the timings08:46
bauzasfrom what I can read, I agree with you, it only calls the dhcp lease after more than 12 secs08:47
bauzas Feb 14 18:18:56.240226 np0033093378 nova-compute[83239]: INFO nova.compute.manager [None req-053318ab-09ad-4a3a-8ddb-633cc0002c3e tempest-AttachVolumeNegativeTest-1605485622 tempest-AttachVolumeNegativeTest-1605485622-project] [instance: 6a265379-ebfd-4aea-a081-8b271f32c0ea] Took 6.36 seconds to spawn the instance on the hypervisor.08:53
bauzasso we waited for the instance to be fully spawned before ssh'ing to it 08:54
bauzas2023-02-14 18:22:39.102680 | controller | 2023-02-14 18:18:59,630 92653 INFO     [tempest.lib.common.ssh] Creating ssh connection to '172.24.5.161:22' as 'cirros' with public key authentication08:54
bauzasand it's only about 3m30s after this that we give up connecting thru ssh 08:56
bauzas2023-02-14 18:22:39.103394 | controller | 2023-02-14 18:22:31,398 92653 ERROR    [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.161 after 16 attempts. Proxy client: no proxy client08:56
bauzasso, even if the guest took 12 secs to fully boot and become sshable, we still have enough delay here08:57
gibiI only have timestamp until the guest kernel boots08:58
gibiafter that the guest logs has no timestamps08:58
gibi[   12.638156] sr 0:0:0:0: Attached scsi generic sg0 type 508:58
gibicurrently loaded modules: 8139cp 8390 9pnet 9pnet_virtio ahci drm drm_kms_helper e1000 failover fb_sys_fops hid hid_generic ip_tables isofs libahci mii ne2k_pci net_failover nls_ascii nls_iso8859_1 nls_utf8 pcnet32 qemu_fw_cfg syscopyarea sysfillrect sysimgblt ttm usbhid virtio_blk virtio_gpu virtio_input virtio_net virtio_rng virtio_scsi x_tables 08:58
gibiinfo: copying initramfs to /dev/vda108:59
gibiinfo: initramfs loading root from /dev/vda108:59
gibiinfo: /etc/init.d/rc.sysinit: up at 18.8408:59
gibiinfo: container: none08:59
gibicurrently loaded modules: 8139cp 8390 9pnet 9pnet_virtio ahci drm drm_kms_helper e1000 failover fb_sys_fops hid hid_generic ip_tables isofs libahci mii ne2k_pci net_failover nls_ascii nls_iso8859_1 nls_utf8 pcnet32 qemu_fw_cfg syscopyarea sysfillrect sysimgblt ttm usbhid virtio_blk virtio_gpu virtio_input virtio_net virtio_rng virtio_scsi x_tables 08:59
gibiInitializing random number generator... done.08:59
gibiStarting acpid: OK08:59
bauzasyup, that is what's in the console08:59
gibiStarting network: udhcpc: started, v1.29.308:59
gibiudhcpc: sending discover08:59
gibiudhcpc: sending discover08:59
gibiudhcpc: sending discover08:59
gibiso it takes at least 12.6 sec to reach this point08:59
gibibut there are things without timestamp that can take time08:59
bauzasgibi: I checked n-cpu to get when the instance was spawned08:59
gibilike re-sending the discover08:59
bauzashttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6f9/868236/5/gate/nova-next/6f9f3d0/controller/logs/screen-n-cpu.txt 09:00
bauzaswith instance uuid 6a265379-ebfd-4aea-a081-8b271f32c0ea09:00
bauzasFeb 14 18:19:02.741562 np0033093378 nova-compute[83239]: DEBUG nova.network.neutron [req-1396e9c5-b2f1-4ceb-ac85-97edccc3919a req-5a2c93ac-f84a-44aa-9f96-5cca954570de service nova] [instance: 6a265379-ebfd-4aea-a081-8b271f32c0ea] Updated VIF entry in instance network info cache for port 77cc62f3-8d3f-4a4f-bbef-ac68fdcd9c9e. {{(pid=83239) _build_network_info_model /opt/stack/nova/nova/network/neutron.py:3459}}09:00
bauzasat 18:19:02, the instance network info cache was updated09:01
bauzasso we're still in the ssh connection attempts window, which started at 18:18:59 and tried until 18:22:3109:02
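The timing argument can be checked by subtracting the timestamps quoted from the job output and n-cpu log. This is only a sketch: the variable names are made up here and the n-cpu timestamp is truncated to milliseconds. It gives an ssh retry window of about 3 min 32 s, with the network info cache update falling inside it.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the tempest lines quoted in the discussion.
first_attempt = datetime.strptime("2023-02-14 18:18:59,630", FMT)
gave_up = datetime.strptime("2023-02-14 18:22:31,398", FMT)
# n-cpu "Updated VIF entry in instance network info cache" line (truncated to ms).
cache_updated = datetime.strptime("2023-02-14 18:19:02,741", FMT)

window = gave_up - first_attempt
inside_window = first_attempt <= cache_updated <= gave_up

print(f"ssh retry window: {window.total_seconds():.3f} s")
print(f"cache update inside window: {inside_window}")
```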
gibimy point is if the 3rd discover would succeed then the test could pass if waiting more for ssh09:07
gibibut we don't know that the 3rd will succeed 09:07
gibior fail with https://bugs.launchpad.net/nova/+bug/2006467 as we don't wait enough to see09:07
bauzasgibi: the problem is that we don't know *when* that 3rd discovery was made given it was on the console09:08
bauzasif we assume that the guest starts after the spawn, then the 3rd discover was at 18:18:56+12s-ish09:08
bauzasand we were still trying to connect thru ssh09:09
* bauzas tries to correlate the tempest timings with n-api calls09:09
gibido you suggest that after the 3rd discover request udhcpc freezes and never times out?09:10
bauzasyeah, confirmed, the timings match between the job-output and the n-api log09:13
bauzasin n-api I can see Feb 14 18:18:57.185675 np0033093378 devstack@n-api.service[74363]: DEBUG nova.api.openstack.wsgi [None req-de265685-f44e-4d63-98a7-2e38964ce455 tempest-AttachVolumeNegativeTest-1605485622 tempest-AttachVolumeNegativeTest-1605485622-project] Calling method '<bound method ServersController.show of <nova.api.openstack.compute.servers.ServersController object at 0x7f3bd48f7dc0>>' {{(pid=74363) _process_stack /opt/s09:13
bauzastack/nova/nova/api/openstack/wsgi.py:513}}09:13
bauzasthat correlates with job-output : 2023-02-14 18:22:39.100791 | controller | 2023-02-14 18:18:57,782 92653 INFO     [tempest.lib.common.rest_client] Request (AttachVolumeNegativeTest:test_attach_attached_volume_to_different_server): 200 GET https://10.176.197.146/compute/v2.1/servers/6a265379-ebfd-4aea-a081-8b271f32c0ea 0.621s09:13
gibione thing we can do before bump the ssh timeout is to add "os-getConsoleOutput" nova API call to tempest when ssh timeouts to see the guest log at that point in time09:14
bauzasgibi: I dunno what happens but yeah, the discover happens way earlier than the ssh timeout AFAICU09:14
bauzasor09:15
bauzasbetween the last console timestamp that says [12...] and the 3rd discover, then 2min15s lasted09:15
* bauzas looks at udhcpc client to say how long lasts a discover 09:16
bauzashttps://bugs.launchpad.net/cirros/+bug/127315909:17
bauzas"It only sends up to 3 DHCP discover packets with a 60 second pause between." :)09:18
bauzasso yeah, that explains09:18
bauzasthat explains why we don't see the dhcp lease failing, but that doesn't explain why after 2mins, the dhcp server wasn't providing a lease09:18
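A back-of-the-envelope timeline for the three discovers, under loudly stated assumptions: the guest starts booting at the spawn-complete time (18:18:56), the first discover goes out no earlier than 12.6 s into the boot, and discovers are 60 s apart as the cirros bug describes. These inputs come from the discussion above; the real guest timings are not timestamped, so treat the result as a lower bound only.

```python
from datetime import datetime, timedelta

# Assumption: guest boot starts when nova reports the instance as spawned.
guest_start = datetime(2023, 2, 14, 18, 18, 56)
# Assumption: first discover no earlier than the last kernel timestamp (12.6 s).
first_discover = guest_start + timedelta(seconds=12.6)
# "up to 3 DHCP discover packets with a 60 second pause between"
discovers = [first_discover + timedelta(seconds=60 * i) for i in range(3)]

# Time at which tempest reported giving up on ssh (seconds precision).
ssh_gave_up = datetime(2023, 2, 14, 18, 22, 31)

for number, sent_at in enumerate(discovers, start=1):
    print(f"discover {number}: not before {sent_at.time()}")

margin = ssh_gave_up - discovers[-1]
print(f"time left after 3rd discover: {margin.total_seconds():.1f} s")
```

Under these assumptions the third discover is sent while tempest is still retrying ssh, which matches the conclusion that a longer ssh timeout alone is unlikely to help.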
* gibi_ lost the cable internet 09:24
gibi_bauzas: this was the last thing I saw09:25
gibi_10:18 < bauzas> so yeah, that explains09:25
bauzas(10:18:57) bauzas: that explains why we don't see the dhcp lease failing, but that doesn't explain why after 2mins, the dhcp server wasn't providing a lease09:25
gibi_yeah09:25
gibi_so it might be a neutron dhcp issue09:26
bauzasgibi: I'm inclined to say that increasing the ssh timeout wouldn't help09:26
gibi_bauzas: it might prove that the 3rd discover fails too and then udhcp gives up09:26
gibi_but I agree that it is unlikely that the 3rd discover will succeed09:27
gibi_(except if this is a slow worker node situation)09:27
bauzaswell09:28
bauzasI wouldn't say that a ssh connection successful after 3 mins would be a good user experience09:28
bauzaswe asked to bind the port way earlier09:28
bauzasand the network cache was updated09:29
bauzasso this doesn't look like a vif plugging issue09:29
bauzasgibi: welcome back :)09:29
bauzasanyway, I'm gonna drop my investigations09:29
bauzaswe identified the problem09:29
bauzasnow we know it can somehow be kinda-related to https://bugs.launchpad.net/nova/+bug/200646709:30
gibi_bauzas: would it make sense to ping #openstack-neutron with ^^09:32
bauzasgibi_: seems to me a good thing to do + adding neutron to the list of projects impacted by that bug09:33
* bauzas grabbing a coffee and then writing some bug comment 09:33
opendevreviewMerged openstack/nova-specs master: Amend FQDN in hostname spec to reflect implementation  https://review.opendev.org/c/openstack/nova-specs/+/87242209:43
sean-k-mooneybauzas: its not really a openstack bug is it10:15
sean-k-mooneybauzas: like its not a nova bug we have nothing to do with dhcp10:15
sean-k-mooneybauzas: perhaps we can finally bump the version of cirros in devstack10:16
sean-k-mooneyto a 6.x version10:16
sean-k-mooneythat should fix some of the kernel panics we see too10:16
stephenfinDo folks think dropping the legacy migrations and fully supporting SQLA 2.0 is viable during the A cycle? https://review.opendev.org/c/openstack/nova/+/872428/10:16
sean-k-mooneyyou mean by tomorrow right10:17
stephenfinoof, I thought we had longer for services10:17
stephenfinthen I guess that answers my question 0:)10:17
sean-k-mooneyi saw you submit that and also the use sdk series10:17
sean-k-mooneywell we might be able to but i have not looked at the patch yet10:18
stephenfinyeah, the use SDK series is a nice-to-have and highlights some SDK gaps. This one's a _little_ more urgent10:18
bauzasstephenfin: while I understand your concern, I'm afraid of merging it 2 days before FF, given the huge number of CI failures we have10:18
sean-k-mooneyi skimmed it when you first submitted it and the second patch failed zuul10:18
sean-k-mooneylooks like it passed after a recheck10:19
sean-k-mooneybauzas: why10:19
sean-k-mooneythis code is not used currently10:19
sean-k-mooneyif i recall correctly its only used if you're coming from pre train10:20
sean-k-mooneyactually pre wallaby based on the commit message10:20
sean-k-mooneyso this wont impact any ci jobs10:21
sean-k-mooneystephenfin: for what its worth we can land them early after RC1 if we dont land them now10:21
bauzassean-k-mooney: because we already have a lot of changes to be merged10:21
bauzasand I'm done with the CI failures10:22
stephenfinyeah, early in RC1 would work also. We just don't want to drag our feet on this.10:22
sean-k-mooneywell i meant after RC1 so in bobcat10:22
sean-k-mooneyi would be ok with proceeding with this in Antelope however10:23
sean-k-mooneyif ci does not object10:23
sean-k-mooneywhat is the state of ci in general currently. has the tempest issue been fixed10:25
bauzasyes10:25
bauzasbut now we also have other issues10:25
opendevreviewAlexey Stupnikov proposed openstack/nova stable/victoria: Clean up when queued live migration aborted  https://review.opendev.org/c/openstack/nova/+/84575410:29
sean-k-mooneystephenfin: im +2 on both of those if others care to review10:30
sean-k-mooneyotherwise we can try and land it again in 2 weeks or so10:30
sean-k-mooneybauzas: is https://review.opendev.org/c/openstack/nova/+/873584 just there so we can merge it before RC1 or have ye found the issue in the tests?11:35
bauzassean-k-mooney: we merged it yesterday11:35
bauzasI mean the logging patch11:35
bauzasso I'd prefer to keep it logging for more than one day11:35
sean-k-mooneyack i was just asking if the issue had been found or not11:37
opendevreviewRajesh Tailor proposed openstack/nova master: Fix case-sensitivity for metadata keys  https://review.opendev.org/c/openstack/nova/+/87390111:37
sean-k-mooneywe can leave it there for a week or so until we get close to rc111:37
opendevreviewMerged openstack/nova master: cpu: interfaces for managing state and governor  https://review.opendev.org/c/openstack/nova/+/86823612:25
opendevreviewAlexey Stupnikov proposed openstack/nova stable/victoria: Cleanup old resize instances dir before resize  https://review.opendev.org/c/openstack/nova/+/86473013:50
*** dasm|off is now known as dasm14:01
bauzasgibi: another interesting usual suspect kinda related https://ae59d1e8526fa7671728-240e4b572b6f89b26c1b0e70b1c00c17.ssl.cf1.rackcdn.com/872413/3/check/nova-multi-cell/5e89e48/testr_results.html14:09
bauzassomething looks wrong to me with the userplane14:09
bauzasthis time, we got a lease directly 14:09
bauzasbut when adding the route, it failed14:09
bauzasand when calling the metadata API, we got a failure14:10
bauzasgibi: oh, ralonsoh did an update on the udhcpc bug https://review.opendev.org/c/openstack/neutron/+/87127214:25
bauzasI guess we need to update our zuul config, lemme check14:26
gibibauzas: I don't know if the root of this is the warning in the route add, or that just a red herring and we have a failure in metadata request handling in neutron or in nova14:26
bauzasthe 'route add default gw' command failed14:26
gibiwith a WARN :D14:27
bauzasso basically there is no default route set 14:27
sean-k-mooneybauzas: that happens sometimes14:27
sean-k-mooneyits normal14:27
bauzasthat's why I guess the call to the metadata API is failing14:27
bauzassince the route to the network isn't told14:27
sean-k-mooneyadding the default route often fails because its already there14:27
bauzassean-k-mooney: I'm just checking the 'normality' in logsearch14:28
bauzasroute: SIOCADDRT: File exists WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"14:28
sean-k-mooneyi literally have been seeing that for years14:28
bauzastechnically there is indeed a default route set14:28
bauzasbecause of the 'file exists'14:28
sean-k-mooneyi dont think this is related14:29
sean-k-mooneyas i said im used to seeing that warning. its possible however i think its unlikely14:30
bauzaswell, OK14:30
bauzasI already spent my month investigating14:31
bauzasand we still have the udhcpc fix to merge, working on it14:31
opendevreviewSylvain Bauza proposed openstack/nova master: upgrade nova-next to use dhcpcd client w/ cirros-0.6.1 guests  https://review.opendev.org/c/openstack/nova/+/87393414:42
bauzasgibi: sean-k-mooney: ralonsoh: reviews appreciated ^14:42
ralonsohlet me check14:44
bauzasralonsoh: thanks a lot14:45
gibibauzas: about the cirros image version change. Have our common jobs switched to 0.6.1 already? I.e. do we know that such a cirros guest change is safe from the other tests' perspective?15:01
bauzasgibi: that's a very good point15:01
bauzasI don't really want this change to be merged *now*15:02
ralonsohwe have been using this image for 1 month now15:02
bauzasbut at least this is nova-next15:02
ralonsohbut maybe now this is not the time for nova, maybe in some weeks15:02
bauzasso we can somehow test cirros-6 in that job15:02
bauzasralonsoh: tbh, I'm a bit reluctant to do any job changes while we are so close to FF as this is already a terrible month of CI failures but we somehow need to balance the benefits vs. the risks15:04
ralonsohI agree15:04
ralonsohmaybe it should be better to avoid this change for now15:05
gibibauzas: if neutron using this image for a month now then maybe we can take the risk and merge it in nova-next too15:05
gibiit is not like we would be the first to switch then15:05
*** d34dh0r53 is now known as d34dh0r5-15:24
*** d34dh0r5- is now known as d34dh0r5315:27
*** d34dh0r53 is now known as blarnath15:28
*** blarnath is now known as d34dh0r5315:28
opendevreviewKashyap Chamarthy proposed openstack/nova stable/train: libvirt: At start-up rework compareCPU() usage with a workaround  https://review.opendev.org/c/openstack/nova/+/87372215:39
bauzassean-k-mooney: if you wanna get some status on the accepted blueprints https://etherpad.opendev.org/p/nova-antelope-blueprint-status17:17
sean-k-mooneythanks17:18
opendevreviewAlexey Stupnikov proposed openstack/nova stable/victoria: Clean up when queued live migration aborted  https://review.opendev.org/c/openstack/nova/+/84575417:18
opendevreviewAlexey Stupnikov proposed openstack/nova stable/ussuri: Test aborting queued live migration  https://review.opendev.org/c/openstack/nova/+/87357517:21
opendevreviewAlexey Stupnikov proposed openstack/nova stable/ussuri: Add functional tests to reproduce bug #1960412  https://review.opendev.org/c/openstack/nova/+/87357617:21
bauzasgmann: every morning, I'm opening https://review.opendev.org/c/openstack/nova/+/864594 to check its status, I guess we gonna defer it to Bobcat unfortunately ?17:21
sean-k-mooneybauzas: https://pypi.org/project/python-novaclient/#history so is that still broken17:23
bauzasyes17:23
bauzasthe requirements gate is mostly broken17:23
bauzasoh the novaclient17:23
bauzasno, we haven't released yet a client patch17:24
bauzaselodilles: planning to propose a releases patch for novaclient soon or want me to do it ?17:25
sean-k-mooneyok so nova-client is blocking the patch to osc17:29
sean-k-mooneyso i thought we were going to do a release last week i guess not17:29
sean-k-mooneybauzas: in relation to https://review.opendev.org/c/openstack/python-openstackclient/+/87242017:30
opendevreviewAlexey Stupnikov proposed openstack/nova stable/ussuri: Clean up when queued live migration aborted  https://review.opendev.org/c/openstack/nova/+/87357717:30
bauzasno, last week was a nonclient lib featurefreeze17:30
sean-k-mooneyyep17:30
bauzasand we just merged 2.95 support in novaclient yesterday :)17:30
sean-k-mooneybut i thought we were also going to do a release of novaclient to unblock artom patch17:30
bauzaswe did17:31
sean-k-mooneyoh ok what does that do17:31
bauzasand it was merged yesterday17:31
bauzasnow we only have OSC patch up17:31
sean-k-mooneyoh 2.95 is the evacuate one17:32
sean-k-mooneyoh right that has been blocked in ci17:32
sean-k-mooneyok that makes sense17:32
bauzasiirc, 2.94 and 2.95 were noop on the client side17:32
bauzasthis is just for saying that novaclient actually supports those microversion calls, that's it17:33
sean-k-mooneyits adding help text and a devstack functional test17:36
sean-k-mooneybut its not strictly required17:36
bauzasthe OSC patch ? ya17:36
sean-k-mooneyyep17:36
bauzasand I think sahid abandoned his OSC patch17:36
bauzassince no real support was required17:37
sean-k-mooneyi asked artom to add docs saying that as of microversion ... fqdns are supported17:37
bauzasbut that can be addressed after ff17:37
sean-k-mooneyif the osc team is fine with that yes17:37
sean-k-mooneyas i said its just help text really17:38
bauzasyeah it's out of our scope anyway17:38
bauzasfwiw, if someone tells me gate is great, I just send that https://review.opendev.org/c/openstack/nova/+/87241317:39
bauzaslike I already said, I'm no longer a core reviewer17:39
bauzasmy name should be "core rechecker"17:39
bauzashttps://zuul.openstack.org/status#openstack/nova17:40
bauzassean-k-mooney: -1d https://review.opendev.org/c/openstack/releases/+/87353717:44
bauzas(and +1d https://review.opendev.org/c/openstack/releases/+/873515 )17:45
gmannbauzas: ok, agree to move 864594 to bobcat. do not want to rush at the last moment.17:45
bauzasgmann: ack and thanks for your dedication17:46
sean-k-mooneybauzas: im ok with the feature bump for novaclient instead of bugfix17:46
sean-k-mooneyso makes sense17:46
sean-k-mooneyosc-placement also makes sense17:47
spatelsean-k-mooney quick question, I have ceph storage backed openstack today ceph went down because of power17:55
spatelnow ceph is back but when i start VMs its getting stuck in boot script no available boot disk 17:55
spatelSeaBIOS POST script getting that message. 17:56
spatellogs are clean no errors related ceph in nova logs17:56
spatelit stuck here - https://paste.opendev.org/show/bPcyI5UPi0PWoN7iUX2J/17:57
spatelI found this when searching - https://bugs.launchpad.net/kolla/+bug/168849617:59
elodillesbauzas: ok, i see you have found the generated release patches :)18:30
elodillesbauzas: btw, requirements gate fix has landed18:30
elodillesi've rechecked the os-traits. now we have to wait for the ci jobs to finish18:32
mnaserquestion: do you have to enable the new scope enforcement for the reader role to work?19:28
opendevreviewMerged openstack/nova master: Factor out a mixin class for candidate aware filters  https://review.opendev.org/c/openstack/nova/+/85492919:44
gmannmnaser: we enabled the scope/new-defaults by default which enforce reader role by default  https://review.opendev.org/c/openstack/nova/+/866218 19:47
gmannmnaser: it is on master which is not yet released. for older release yes you need to enable these two flags in nova.conf to enforce reader role . [oslo_policy] enforce_scope = True enforce_new_defaults = True19:48
mnasergmann: so it’s not possible to just have enforce new defaults only without scope?  I’m mostly worried about Horizon :)19:52
gmannmnaser: scope for all the policy rule is 'project' now so enabling the scope is to reject the system scope token at early stage with 403. existing project scope token keep working whether scope is enabled or disabled 19:54
gmannbut yes, you can keep only new defaults enable which will work fine unless system scoped token is used  19:55
gmannwith scope disabled system scope token may fail (for project level resources) later on as they do not have project id19:56
gmannI think horizon is using project scoped token (they have not changed it for srbac) so scope enable/disable should not impact there19:58
gmannkolla was the one who were using system scope for nova which is now fixed https://review.opendev.org/c/openstack/kolla-ansible/+/87087919:59
gmannother than that nobody reported if new defaults or scope enable has impacted them20:00
mnasergmann: oh I see, so as of now, as long as we don’t have custom policies, we should be able to flip it on?20:10
gmannmnaser: yeah20:13
mnasergmann: ok we’ll flip that and see how it goes!  Thank you.20:13
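For older releases, the two flags gmann mentions live in the `[oslo_policy]` section of nova.conf. The sketch below renders that fragment with Python's standard `configparser`, purely as an illustration: a real nova.conf is parsed by oslo.config, which `configparser` does not fully emulate, and the empty starting section stands in for an existing file.

```python
import configparser
import io

conf = configparser.ConfigParser()
# Stand-in for an existing nova.conf; only the relevant section is modelled.
conf.read_string("[oslo_policy]\n")

# The two options named in the discussion.
conf["oslo_policy"]["enforce_scope"] = "True"
conf["oslo_policy"]["enforce_new_defaults"] = "True"

rendered = io.StringIO()
conf.write(rendered)
print(rendered.getvalue())
```

As noted above, it is also possible to enable only `enforce_new_defaults` and leave scope enforcement off, as long as no system-scoped tokens are in use.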
*** nicolasbock_ is now known as nicolasbock20:47
*** dasm is now known as dasm|off21:56
