opendevreview | Miro Tomaska proposed openstack/neutron master: Fix ACL sync when default sg group is created https://review.opendev.org/c/openstack/neutron/+/875989 | 02:12 |
---|---|---|
opendevreview | Miro Tomaska proposed openstack/neutron master: Fix intermittent failures in finding metada port in SB DB https://review.opendev.org/c/openstack/neutron/+/878549 | 02:51 |
opendevreview | Luis Tomas Bolivar proposed openstack/neutron master: Add support for localnet_learn_fdb OVN option https://review.opendev.org/c/openstack/neutron/+/877675 | 05:59 |
sahid | o/ - if someone is available to re-approve https://review.opendev.org/c/openstack/neutron/+/871113 (I had to rebase for the fix on macvtap) | 07:41 |
ralonsoh | sahid, there is no need to rebase the patch | 07:42 |
sahid | yes right | 07:42 |
sahid | actually i had, no? the patch was on neutron functional tests no? | 07:43 |
sahid | haleyb: https://review.opendev.org/c/openstack/neutron/+/875809 regarding the proposal here, I'm probably missing something, I guess you have noticed that users are mentioning 68 and 70 in their reports | 07:44 |
ralonsoh | the gate queue will test the patch on top of master | 07:45 |
sahid | ralonsoh: during this step we are on master, i have never thought about this point, thank you | 07:48 |
opendevreview | Merged openstack/neutron stable/wallaby: Ensure redirect-type=bridged not used for geneve networks https://review.opendev.org/c/openstack/neutron/+/879299 | 09:00 |
opendevreview | Merged openstack/neutron master: rbacs: filter out model that are already owned by context https://review.opendev.org/c/openstack/neutron/+/871113 | 09:41 |
sahid | if this (easy) one can receive the missing +w https://review.opendev.org/c/openstack/neutron/+/872762 | 09:46 |
sahid | it's just a cleanup; I wrongly rebased it | 09:46 |
haleyb | sahid: the bug associated with that was a user setting a small mtu, probably as a test. i now see the message on the discuss list was the same | 13:29 |
ralonsoh | slaweq, please check https://review.opendev.org/c/openstack/neutron/+/878632 | 13:41 |
ralonsoh | thanks! | 13:41 |
slaweq | ralonsoh done | 14:15 |
slaweq | I approved it but I also added one comment there | 14:15 |
ralonsoh | let me check | 14:23 |
ihrachys | haleyb to confirm, you also suggest revising api-ref patch we merged with more explicit listing of protocols enabled? | 14:30 |
opendevreview | Slawek Kaplonski proposed openstack/neutron-tempest-plugin master: Get nc client's command based on the nc on correct host https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/879561 | 14:32 |
haleyb | ihrachys: i guess it would be good to spell it all out. I have never played with the stateless SG code, so am just assuming it can't resolve a neighbor without a rule today? but maybe users always allow all ICMP so it just works? | 15:02 |
ihrachys | haleyb yes it can't receive NA/RA by default. if you enable ICMPv6, sure it does but that's not optimal :) | 15:03 |
ihrachys | I also now wonder why I do it just for stateless | 15:03 |
ihrachys | because ovs firewall does it regardless | 15:03 |
ihrachys | so even if you kill default egress rules, it still works | 15:03 |
ihrachys | so even in this regard, the api-ref patch is not optimal because it suggests that the protocols are guaranteed for stateless but not in general | 15:04 |
haleyb | right, it's just assumed these "low layer" things just work | 15:05 |
ihrachys | looks like I should just take ovs firewall add_flow logic and translate it into ovn acls | 15:05 |
ihrachys | haleyb yeah I haven't realized that it's not just a kludge for stateless but guaranteed regardless of default rules | 15:05 |
ihrachys | sounds like more tempest scenarios to add too | 15:05 |
opendevreview | Bence Romsics proposed openstack/neutron master: port-hints: api extension https://review.opendev.org/c/openstack/neutron/+/870081 | 15:05 |
opendevreview | Bence Romsics proposed openstack/neutron master: port-hint-ovs-tx-steering: agent side https://review.opendev.org/c/openstack/neutron/+/872905 | 15:05 |
opendevreview | Bence Romsics proposed openstack/neutron master: port-hint-ovs-tx-steering: shim extension https://review.opendev.org/c/openstack/neutron/+/873113 | 15:05 |
opendevreview | Bence Romsics proposed openstack/neutron master: DNM debug logs and dev helper scripts https://review.opendev.org/c/openstack/neutron/+/872906 | 15:05 |
ihrachys | multicast and stateful SG without default rules behavior | 15:06 |
haleyb | ihrachys: things work today for OVN stateful without extra work, is that in the acl code? | 15:06 |
ihrachys | I *think* it's RD-RA / ND-NA session tracking? not sure | 15:08 |
haleyb | oh, and maybe OVN is responding to the NS with an NA based on a flow? | 15:10 |
TheJulia | out of curiosity, have any weird hangs been observed with neutron services/requests in CI as of recent? We've got some really weird failures in the grenade job for ironic (ironic-grenade) when on the 2023.1 code which seems to point in the direction of neutron with a couple different weird behaviors | 15:18 |
ralonsoh | TheJulia, did you do some analysis? Can you report that in launchpad? With some logs | 15:24 |
ralonsoh | is this job: https://zuul.opendev.org/t/openstack/builds?job_name=ironic-grenade&skip=0 | 15:24 |
ralonsoh | ? | 15:24 |
TheJulia | ralonsoh: we've seen at least two different behaviors so far: in one, nova seems to hang on talking to neutron; in another, the fip doesn't work and the job fails. We can ssh into the held node and immediately ping the fip. Nothing in logs seems to be definitive, at least that we've found yet | 15:27 |
ralonsoh | TheJulia, but so far this is what I see in nova api | 15:29 |
ralonsoh | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_25f/879498/1/check/ironic-grenade/25f471e/controller/logs/screen-n-api.txt | 15:29 |
ralonsoh | (seen in two different logs) | 15:29 |
ralonsoh | https://paste.opendev.org/show/bObqZW35uqDoHkXwLBRb/ | 15:30 |
ralonsoh | just during the FIP ping | 15:30 |
TheJulia | that is after the fip hang | 15:30 |
TheJulia | that, I believe, is as it is trying to collect logs | 15:30 |
TheJulia | anything after 10:27:58 on that job has failed | 15:31 |
TheJulia | That is a result of the cli trying to collect info for debugging | 15:32 |
ralonsoh | 10:27:58? | 15:32 |
ralonsoh | 2023-04-04 18:34:19.346 | + /opt/stack/new/grenade/functions:ping_check_public:65 : timeout 60 sh -c 'while ! ping -c1 -w1 172.24.5.89; do sleep 1; done' | 15:32 |
TheJulia | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_38b/879498/1/check/ironic-grenade/38b5b05/controller/logs/grenade.sh_log.txt | 15:32 |
TheJulia | err, I'm looking at more than one job | 15:32 |
ralonsoh | ok, I'll check this one | 15:32 |
TheJulia | hmm, we're looking at the same job | 15:33 |
ralonsoh | but different build | 15:33 |
TheJulia | yup | 15:33 |
TheJulia | the fip working when I logged in is just bizarre given I don't see further updates or re-application | 15:34 |
ralonsoh | TheJulia, this is using non-HA and non-DVR, right? | 15:40 |
TheJulia | yeah | 15:40 |
ralonsoh | just one controller node | 15:40 |
ralonsoh | ok | 15:40 |
ralonsoh | when the FIP is added, there is no reply from the garp | 15:41 |
ralonsoh | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_38b/879498/1/check/ironic-grenade/38b5b05/controller/logs/screen-q-l3.txt | 15:41 |
TheJulia | meaning l3-agent might not have actually bound it yet? | 15:45 |
TheJulia | looks like it ran the nat commands at 10:26:57, and then it failed | 15:46 |
ralonsoh | no, actually this is just to push a garp outside the l3 agent | 15:46 |
* TheJulia | goes and looks to make sure the VM was up | 15:46 |
ralonsoh | and make this IP public | 15:46 |
TheJulia | k | 15:46 |
ralonsoh | ? | 15:46 |
ralonsoh | the FIP is correctly configured | 15:46 |
TheJulia | the last entry in the log for the test vm was 10:26:35, so it could have possibly still been coming online, although the console log shows it was up ~6 seconds later with a dhcp address, checking dhcp agent logs | 15:49 |
ralonsoh | the FIP is assigned at # | 15:51 |
ralonsoh | 2023-04-05 10:26:52.197 | + /opt/stack/new/grenade/projects/60_nova/resources.sh:create:130 : openstack server add floating ip nova_server1 172.24.5.248 | 15:51 |
ralonsoh | and configured in the L3 agent at | 15:51 |
TheJulia | so that is presumably after the vm is already up | 15:51 |
ralonsoh | Apr 05 10:26:57.325448 np0033659767 neutron-l3-agent[87510]: INFO neutron.agent.l3.agent [-] Finished a router update for 878c6663-6959-4c33-8245-6294f9754100, update_id 8410568d-632b-44e2-9291-ce7908656680. Time elapsed: 1.560 | 15:51 |
ralonsoh | yes | 15:51 |
TheJulia | it does look like the actual dhcp update came through at 10:27:14 | 15:52 |
ralonsoh | dhcp is not the problem for FIPs | 15:53 |
TheJulia | still an oddity though | 15:53 |
ralonsoh | why? | 15:53 |
TheJulia | vm should have already been powered up | 15:53 |
ralonsoh | dhcp is just receiving the IP from Neutron | 15:54 |
ralonsoh | just the resource update, nothing in the network layer | 15:54 |
ralonsoh | when the FIP is created | 15:54 |
ralonsoh | TheJulia, the ping is done from the same host (controller), right? | 15:57 |
TheJulia | oh, heh, so dnsmasq had to be restarted and that occurred at 10:27:14 | 15:57 |
TheJulia | it segfaulted | 15:57 |
ralonsoh | let me check | 15:57 |
TheJulia | [ 2751.364065] dnsmasq[120332]: segfault at c998 ip 00007f957a66a47e sp 00007fff0ab0f440 error 4 in libc.so.6[7f957a5ed000+195000] | 15:57 |
TheJulia | ralonsoh: ping, afaik, yes is done on the same host | 15:57 |
ralonsoh | ok, I didn't check the syslogs | 15:58 |
ralonsoh | are we running out of memory? | 15:59 |
TheJulia | nope, no signs of OOM | 15:59 |
ralonsoh | in any case, dhcp agent found that | 15:59 |
ralonsoh | and restarts the process | 15:59 |
ralonsoh | Apr 05 10:27:14.602721 np0033659767 neutron-dhcp-agent[87108]: ERROR neutron.agent.linux.external_process [-] dnsmasq for dhcp with uuid 5a0a77ad-7203-4de7-b13a-e498ef5258c7 not found. The process should not have died | 15:59 |
ralonsoh | Apr 05 10:27:14.603388 np0033659767 neutron-dhcp-agent[87108]: WARNING neutron.agent.linux.external_process [-] Respawning dnsmasq for uuid 5a0a77ad-7203-4de7-b13a-e498ef5258c7 | 15:59 |
ralonsoh | this is for the private IP network | 15:59 |
TheJulia | okay, so the vm didn't post its splash screen to the console until it was 88.4 seconds up, so long after it was started, so dhcp likely contributed | 16:00 |
TheJulia | that may have put it past the window | 16:00 |
TheJulia | so dnsmasq is one thing, what were you thinking w/r/t the fip not working? | 16:01 |
ralonsoh | no, I have no reasons to think that | 16:02 |
ralonsoh | everything seems to work in the l3 agent | 16:02 |
TheJulia | so then timing wise, if the vm is just not on the far side of it yet because dnsmasq went awol, then that should explain this specific case | 16:03 |
TheJulia | well, it should have arpinged | 16:04 |
TheJulia | because the fip is bound to a namespace | 16:04 |
TheJulia | but I'm not a neutron expert | 16:04 |
ralonsoh | not really, this is a way to make this FIP public | 16:04 |
ralonsoh | nothing else | 16:04 |
ralonsoh | that forces an ARP that should be read by anyone in this broadcast domain | 16:04 |
TheJulia | so with the rules then, and I've not looked what exactly gets loaded, but then the kernel never actually processes on the inbound packets | 16:04 |
TheJulia | it would just get re-tagged out to the vm | 16:05 |
ralonsoh | is it normal that the server creation takes 7 mins? | 16:07 |
TheJulia | it is a baremetal test vm | 16:07 |
ralonsoh | where do you see the server logs? | 16:07 |
ralonsoh | just to check the DHCP request | 16:07 |
TheJulia | so it goes through the entire lifecycle to be provisioned | 16:07 |
TheJulia | which includes firing up a temporary ramdisk, deploying the OS image, and then rebooting to the final instance | 16:08 |
TheJulia | that final instance is where things went sideways with dhcp it seems | 16:08 |
ralonsoh | ok, I was reviewing the DHCP request. The dnsmasq segfault could be a red herring | 16:08 |
ralonsoh | the BM receives the IP 10.2.1.50 | 16:08 |
TheJulia | confirmed | 16:09 |
opendevreview | Merged openstack/neutron master: Bump skip-level lower version to stable/zed https://review.opendev.org/c/openstack/neutron/+/878632 | 16:15 |
ralonsoh | TheJulia, we have segfaults even in passing jobs | 16:19 |
ralonsoh | there is something broken there | 16:19 |
TheJulia | okay, I think part of this might be the cirros os image is doing a disk expansion in its boot process | 16:19 |
TheJulia | so a variable becomes io, but I'm not sure if that is before, or after networking is fully online | 16:20 |
TheJulia | looks like it is after :\ | 16:20 |
TheJulia | but I've definitely got 21 seconds into the kernel boot before eth0 is up | 16:20 |
TheJulia | 83 seconds in before the address is *actually* up it looks like | 16:22 |
TheJulia | which would do it | 16:22 |
ralonsoh | TheJulia, 83 seconds since what time? sorry, I can't refer to the global time used in the test | 16:23 |
TheJulia | 83 seconds before networking in cirros is online | 16:24 |
TheJulia | I'm inside the actual test vm OS right now on a terminal window | 16:24 |
ralonsoh | let me check other jobs | 16:24 |
ralonsoh | TheJulia, then it is possible that the router namespace is nating the ping but the VM is deaf during this time | 16:30 |
ralonsoh | because the IP is not configured yet | 16:30 |
ralonsoh | I see the normal time is around 35 seconds | 16:30 |
ralonsoh | TheJulia, you can try increasing -> | 16:33 |
ralonsoh | https://github.com/openstack/grenade/blob/3f9fe2e8fc1fccf0324538274e3b07b3e90b96b9/projects/60_nova/resources.sh#L137 | 16:33 |
TheJulia | ack, that is a good data point to have then | 16:33 |
ralonsoh | and make an ironic patch dependent on the grenade one | 16:33 |
TheJulia | yup | 16:35 |
TheJulia | ... | 16:36 |
TheJulia | this is on rax | 16:36 |
TheJulia | which means the VM is fully emulated | 16:36 |
ralonsoh | well, at least there is a good reason | 16:36 |
TheJulia | yeah, unfortunately | 16:37 |
TheJulia | ralonsoh: thanks for the assistance! | 16:59 |
opendevreview | Merged openstack/neutron-lib master: port-hints: api-ref: Add field length limitation https://review.opendev.org/c/openstack/neutron-lib/+/879042 | 18:53 |
opendevreview | Merged openstack/neutron master: Fix concurrent port binding activate https://review.opendev.org/c/openstack/neutron/+/853281 | 19:14 |
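
The FIP discussion above ends with the guess that the 60-second ping loop gives up before the guest has an address. A minimal Python sketch of that retry loop follows, to make the timing argument concrete. It is not the grenade code: `wait_for_ping` and `FakeGuest` are hypothetical names, and treating the 83-second and 35-second figures from the chat as offsets from the start of the ping loop is an assumption made purely for illustration (the log does not pin down the reference point).

```python
import subprocess
import time


def wait_for_ping(ip, timeout_s=60.0, interval_s=1.0,
                  probe=None, clock=time.monotonic, sleep=time.sleep):
    """Retry a one-packet ping until it succeeds or timeout_s elapses."""
    if probe is None:
        def probe(addr):
            # Same probe as in the log line: one packet, one-second deadline.
            result = subprocess.run(
                ["ping", "-c1", "-w1", addr],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            return result.returncode == 0
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe(ip):
            return True
        sleep(interval_s)
    return False


class FakeGuest:
    """Hypothetical stand-in: answers pings only after ready_at_s seconds."""

    def __init__(self, ready_at_s):
        self.now = 0.0
        self.ready_at_s = ready_at_s

    def clock(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds

    def probe(self, _ip):
        if self.now >= self.ready_at_s:
            return True
        self.now += 1.0  # a failed `ping -w1` costs about one second
        return False


def simulate(ready_at_s, timeout_s):
    """Run the retry loop against a simulated guest, without real pings."""
    guest = FakeGuest(ready_at_s)
    return wait_for_ping("172.24.5.248", timeout_s, probe=guest.probe,
                         clock=guest.clock, sleep=guest.sleep)


if __name__ == "__main__":
    print("ready at 35 s, 60 s window:", simulate(35, 60))
    print("ready at 83 s, 60 s window:", simulate(83, 60))
    print("ready at 83 s, 120 s window:", simulate(83, 120))
```

Under those assumptions, a guest that comes up after 83 seconds misses a 60-second window while one that comes up after 35 seconds does not, which matches the suggestion in the chat to try a larger timeout in grenade.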