Wednesday, 2023-04-05

opendevreviewMiro Tomaska proposed openstack/neutron master: Fix ACL sync when default sg group is created  https://review.opendev.org/c/openstack/neutron/+/87598902:12
opendevreviewMiro Tomaska proposed openstack/neutron master: Fix intermittent failures in finding metada port in SB DB  https://review.opendev.org/c/openstack/neutron/+/87854902:51
opendevreviewLuis Tomas Bolivar proposed openstack/neutron master: Add support for localnet_learn_fdb OVN option  https://review.opendev.org/c/openstack/neutron/+/87767505:59
sahido/ - if someone is available to re-approve https://review.opendev.org/c/openstack/neutron/+/871113, I had to rebase for the fix on macvtap07:41
ralonsohsahid, there is no need to rebase the patch07:42
sahidyes right07:42
sahidactually i had, no? the patch was on neutron functional tests no?07:43
sahidhaleyb: https://review.opendev.org/c/openstack/neutron/+/875809 regarding the proposal here, I'm probably missing something, I guess you have noticed that users are mentioning 68 and 70 in their reports07:44
ralonsohthe gate queue will test the patch on top of master07:45
sahidralonsoh: during this step we are on master, i have never thought about this point, thank you07:48
opendevreviewMerged openstack/neutron stable/wallaby: Ensure redirect-type=bridged not used for geneve networks  https://review.opendev.org/c/openstack/neutron/+/87929909:00
opendevreviewMerged openstack/neutron master: rbacs: filter out model that are already owned by context  https://review.opendev.org/c/openstack/neutron/+/87111309:41
sahidif this (easy) one can receive the missing +w https://review.opendev.org/c/openstack/neutron/+/87276209:46
sahidit's just a cleanup, I wrongly rebased it09:46
haleybsahid: the bug associated with that was a user setting a small mtu, probably as a test. i now see the message on the discuss list was the same13:29
ralonsohslaweq, please check https://review.opendev.org/c/openstack/neutron/+/87863213:41
ralonsohthanks!13:41
slaweqralonsoh done14:15
slaweqI approved it but I also added one comment there14:15
ralonsohlet me check14:23
ihrachyshaleyb to confirm, you also suggest revising api-ref patch we merged with more explicit listing of protocols enabled?14:30
opendevreviewSlawek Kaplonski proposed openstack/neutron-tempest-plugin master: Get nc client's command based on the nc on correct host  https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/87956114:32
haleybihrachys: i guess it would be good to spell it all out. I have never played with the stateless SG code, so am just assuming it can't resolve a neighbor without a rule today? but maybe users always allow all ICMP so it just works?15:02
ihrachyshaleyb yes it can't receive NA/RA by default. if you enable ICMPv6, sure it does but that's not optimal :)15:03
ihrachysI also now wonder why I do it just for stateless15:03
ihrachysbecause ovs firewall does it regardless15:03
ihrachysso even if you kill default egress rules, it still works15:03
ihrachysso even in this regard, the api-ref patch is not optimal because it suggests that the protocols are guaranteed for stateless but not in general15:04
haleybright, it's just assumed these "low layer" things just work15:05
ihrachyslooks like I should just take ovs firewall add_flow logic and translate it into ovn acls15:05
ihrachyshaleyb yeah I haven't realized that it's not just a kludge for stateless but guaranteed regardless of default rules15:05
ihrachyssounds like more tempest scenarios to add too15:05
opendevreviewBence Romsics proposed openstack/neutron master: port-hints: api extension  https://review.opendev.org/c/openstack/neutron/+/87008115:05
opendevreviewBence Romsics proposed openstack/neutron master: port-hint-ovs-tx-steering: agent side  https://review.opendev.org/c/openstack/neutron/+/87290515:05
opendevreviewBence Romsics proposed openstack/neutron master: port-hint-ovs-tx-steering: shim extension  https://review.opendev.org/c/openstack/neutron/+/87311315:05
opendevreviewBence Romsics proposed openstack/neutron master: DNM debug logs and dev helper scripts  https://review.opendev.org/c/openstack/neutron/+/87290615:05
ihrachysmulticast and stateful SG without default rules behavior15:06
haleybihrachys: things work today for OVN stateful without extra work, is that in the acl code?15:06
ihrachysI *think* it's RD-RA / ND-NA session tracking? not sure15:08
haleyboh, and maybe OVN is responding to the NS with an NA based on a flow?15:10
TheJuliaout of curiosity, have any weird hangs been observed with neutron services/requests in CI as of recent? We've got some really weird failures in the grenade job for ironic (ironic-grenade) when on the 2023.1 code which seems to point in the direction of neutron with a couple different weird behaviors15:18
ralonsohTheJulia, did you do some analysis? Can you report that in launchpad? With some logs15:24
ralonsohis this job: https://zuul.opendev.org/t/openstack/builds?job_name=ironic-grenade&skip=015:24
ralonsoh?15:24
TheJuliaralonsoh: we've seen at least two different behaviors so far: in one, nova seems to hang on talking to neutron; in another, the fip doesn't work and the job fails. We can ssh into the held node and immediately ping the fip. Nothing in logs seems to be definitive, at least that we've found yet15:27
ralonsohTheJulia, but so far this is what I see in nova api15:29
ralonsohhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_25f/879498/1/check/ironic-grenade/25f471e/controller/logs/screen-n-api.txt15:29
ralonsoh(seen in two different logs)15:29
ralonsohhttps://paste.opendev.org/show/bObqZW35uqDoHkXwLBRb/15:30
ralonsohjust during the FIP ping15:30
TheJuliathat is after the fip hang15:30
TheJuliathat, I believe, is as it is trying to collect logs15:30
TheJuliaanything after 10:27:58 on that job has failed15:31
TheJuliaThat is a result of the cli trying to collect info for debugging15:32
ralonsoh10:27:58?15:32
ralonsoh2023-04-04 18:34:19.346 | + /opt/stack/new/grenade/functions:ping_check_public:65 :   timeout 60 sh -c 'while ! ping -c1 -w1 172.24.5.89; do sleep 1; done'15:32
TheJuliahttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_38b/879498/1/check/ironic-grenade/38b5b05/controller/logs/grenade.sh_log.txt15:32
TheJuliaerr, I'm looking at more than one job15:32
ralonsohok, I'll check this one15:32
TheJuliahmm, we're looking at the same job15:33
ralonsohbut different build15:33
TheJuliayup15:33
TheJuliathe fip working when I logged in is just bizarre given I don't see further updates or re-application15:34
ralonsohTheJulia, this is using non-HA and non-DVR, right?15:40
TheJuliayeah15:40
ralonsohjust one controller node 15:40
ralonsohok15:40
ralonsohwhen the FIP is added, there is no reply from the garp15:41
ralonsohhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_38b/879498/1/check/ironic-grenade/38b5b05/controller/logs/screen-q-l3.txt15:41
TheJuliameaning l3-agent might not have actually bound it yet?15:45
TheJulialooks like it ran the nat commands at 10:26:57, and then it failed15:46
ralonsohno, actually this is just to push a garp outside the l3 agent15:46
* TheJulia goes and looks to make sure the VM was up15:46
ralonsohand make this IP public15:46
TheJuliak15:46
ralonsoh?15:46
ralonsohthe FIP is correctly configured 15:46
TheJuliathe last entry in the log for the test vm was 10:26:35, so it could have possibly still been coming online, although the console log shows it was up ~6 seconds later with a dhcp address, checking dhcp agent logs15:49
ralonsohthe FIP is assigned at #15:51
ralonsoh2023-04-05 10:26:52.197 | + /opt/stack/new/grenade/projects/60_nova/resources.sh:create:130 :   openstack server add floating ip nova_server1 172.24.5.24815:51
ralonsohand configured in the L3 agent at 15:51
TheJuliaso that is presumably after the vm is already up15:51
ralonsohApr 05 10:26:57.325448 np0033659767 neutron-l3-agent[87510]: INFO neutron.agent.l3.agent [-] Finished a router update for 878c6663-6959-4c33-8245-6294f9754100, update_id 8410568d-632b-44e2-9291-ce7908656680. Time elapsed: 1.56015:51
ralonsohyes15:51
TheJuliait does look like the actual dhcp update came through at 10:27:1415:52
ralonsohdhcp is not the problem for FIPs15:53
TheJuliastill an oddity though15:53
ralonsohwhy?15:53
TheJuliavm should have already been powered up15:53
ralonsohdhcp is just receiving the IP from Neutron15:54
ralonsohjust the resource update, nothing in the network layer15:54
ralonsohwhen the FIP is created15:54
ralonsohTheJulia, the ping is done from the same host (controller), right?15:57
TheJuliaoh, heh, so dnsmasq had to be restarted and that occurred at 10:27:1415:57
TheJuliait segfaulted15:57
ralonsohlet me check15:57
TheJulia[ 2751.364065] dnsmasq[120332]: segfault at c998 ip 00007f957a66a47e sp 00007fff0ab0f440 error 4 in libc.so.6[7f957a5ed000+195000]15:57
TheJuliaralonsoh: ping, afaik, yes is done on the same host15:57
ralonsohok, I didn't check the syslogs15:58
ralonsohare we running out of memory?15:59
TheJulianope, no signs of OOM15:59
ralonsohin any case, dhcp agent found that15:59
ralonsohand restarts the process15:59
ralonsohApr 05 10:27:14.602721 np0033659767 neutron-dhcp-agent[87108]: ERROR neutron.agent.linux.external_process [-] dnsmasq for dhcp with uuid 5a0a77ad-7203-4de7-b13a-e498ef5258c7 not found. The process should not have died15:59
ralonsohApr 05 10:27:14.603388 np0033659767 neutron-dhcp-agent[87108]: WARNING neutron.agent.linux.external_process [-] Respawning dnsmasq for uuid 5a0a77ad-7203-4de7-b13a-e498ef5258c715:59
ralonsohthis is for the private IP network15:59
TheJuliaokay, so the vm didn't post its splash screen to the console until it was 88.4 seconds up, so long after it was started, so dhcp likely contributed 16:00
TheJuliathat may have put it past the window16:00
TheJuliaso dnsmasq is one thing, what were you thinking w/r/t the fip not working?16:01
ralonsohno, I have no reasons to think that16:02
ralonsoheverything seems to work in the l3 agent16:02
TheJuliaso then timing wise, if the vm is just not on the far side of it yet because dnsmasq went awol, then that should explain this specific case16:03
TheJuliawell, it should have arpinged16:04
TheJuliabecause the fip is bound to a namespace16:04
TheJuliabut I'm not a neutron expert16:04
ralonsohnot really, this is a way to make this FIP public16:04
ralonsohnothing else16:04
ralonsohthat forces an ARP that should be read by anyone in this broadcast domain16:04
TheJuliaso with the rules then, and I've not looked what exactly gets loaded, but then the kernel never actually processes on the inbound packets16:04
TheJuliait would just get re-tagged out to the vm16:05
ralonsohis it normal that the server creation takes 7 mins?16:07
TheJuliait is a baremetal test vm16:07
ralonsohwhere do you see the server logs?16:07
ralonsohjust to check the DHCP request16:07
TheJuliaso it goes through the entire lifecycle to be provisioned16:07
TheJuliawhich includes firing up a temporary ramdisk, deploying the OS image, and then rebooting to the final instance16:08
TheJuliathat final instance is where things went sideways with dhcp it seems16:08
ralonsohok, I was reviewing the DHCP request. The dnsmasq segfault could be a red herring16:08
ralonsohthe BM receives the IP 10.2.1.50 16:08
TheJuliaconfirmed16:09
opendevreviewMerged openstack/neutron master: Bump skip-level lower version to stable/zed  https://review.opendev.org/c/openstack/neutron/+/87863216:15
ralonsohTheJulia, we have segfaults even in passing jobs16:19
ralonsohthere is something broken there16:19
TheJuliaokay, I think part of this might be the cirros os image is doing a disk expansion in its boot process16:19
TheJuliaso a variable becomes io, but I'm not sure if that is before, or after networking is fully online16:20
TheJulialooks like it is after :\16:20
TheJuliabut I've definitely got 21 seconds into the kernel boot before eth0 is up16:20
TheJulia83 seconds in before the address is *actually* up it looks like16:22
TheJuliawhich would do it16:22
ralonsohTheJulia, 83 seconds since what time? sorry, I can't refer to the global time used in the test16:23
TheJulia83 seconds before networking in cirros is online16:24
TheJuliaI'm inside the actual test vm OS right now on a terminal window16:24
ralonsohlet me check other jobs16:24
ralonsohTheJulia, then it is possible that the router namespace is NATing the ping but the VM is deaf during this time16:30
ralonsohbecause the IP is not configured yet16:30
ralonsohI see the normal time is around 35 seconds16:30
ralonsohTheJulia, you can try increasing ->16:33
ralonsohhttps://github.com/openstack/grenade/blob/3f9fe2e8fc1fccf0324538274e3b07b3e90b96b9/projects/60_nova/resources.sh#L13716:33
TheJuliaack, that is a good data point to have then16:33
ralonsohand make an ironic patch dependent on the grenade one16:33
TheJuliayup16:35
TheJulia...16:36
TheJuliathis is on rax16:36
TheJuliawhich means the VM is fully emulated 16:36
ralonsohwell, at least there is a good reason 16:36
TheJuliayeah, unfortunately16:37
TheJuliaralonsoh: thanks for the assistance!16:59
opendevreviewMerged openstack/neutron-lib master: port-hints: api-ref: Add field length limitation  https://review.opendev.org/c/openstack/neutron-lib/+/87904218:53
opendevreviewMerged openstack/neutron master: Fix concurrent port binding activate  https://review.opendev.org/c/openstack/neutron/+/85328119:14