Wednesday, 2025-04-02

05:26 <ralonsoh> bcafarel, slaweq hello! If you have 1 min today: https://review.opendev.org/q/project:openstack/neutron+branch:stable/2025.1+status:open
05:26 <ralonsoh> thanks!
06:45 <ralonsoh> haleyb, hello. Thanks for waiting. I've added some topics to the PTG agenda
08:16 <opendevreview> Candido Campos Rivas proposed x/whitebox-neutron-tempest-plugin master: Increase burst window size to avoid unstabilities in slow envs  https://review.opendev.org/c/x/whitebox-neutron-tempest-plugin/+/944616
09:12 <slaweq> ralonsoh I just reviewed your backports, but in https://review.opendev.org/c/openstack/neutron/+/945970 there is one comment which IMO can be valid, please check if we should maybe address it in follow up patch
09:13 <ralonsoh> slaweq, let me check now
09:18 <opendevreview> Maksim Kostrow proposed openstack/neutron master: Correction of a typo in the 'sub_ports' field  https://review.opendev.org/c/openstack/neutron/+/946124
09:19 <opendevreview> Maksim Kostrow proposed openstack/neutron master: Correction of a typo in the 'sub_ports' field  https://review.opendev.org/c/openstack/neutron/+/946124
09:20 <ralonsoh> slaweq, I added a comment
09:21 <ralonsoh> this is what I tried to explain in the upper note: we wait for the post_fork event and at this point we know the placement extension is loaded
09:33 <ralonsoh> folks, some good optimizations
09:33 <ralonsoh> * https://review.opendev.org/c/openstack/neutron/+/944756
09:34 <ralonsoh> * https://review.opendev.org/c/openstack/neutron/+/944650
12:11 <opendevreview> Lajos Katona proposed openstack/neutron stable/2025.1: If OVS Manager creation failes retry to set values  https://review.opendev.org/c/openstack/neutron/+/946142
12:22 <sahid> ralonsoh: o/
12:23 <sahid> any chance that you have a look to the ovs serie when you have a moment :-)
12:23 <sahid> It's waiting for a while now to get some reviews
12:24 <sahid> here is the full serie https://review.opendev.org/q/topic:%22bug/2087939%22
12:57 <opendevreview> Merged openstack/neutron stable/2025.1: Revert the network IDs filter in ``_gen_router_port_options``  https://review.opendev.org/c/openstack/neutron/+/945968
12:57 <ralonsoh> sahid, sorry, I'm quite busy these days, I'll try next week
12:58 <haleyb> ralonsoh: np, i haven't had much time to even look at etherpad myself
12:58 <ralonsoh> haleyb, I also added a couple of items from other people
12:59 <ralonsoh> ping me if you don't know how much time could take every item
12:59 <haleyb> ack
13:34 <cardoe> ralonsoh: Thanks for the reply on https://bugs.launchpad.net/neutron/+bug/2105855 and Thanks haleyb for giving me some time.
13:35 <cardoe> I know some folks from #-ironic will be interested and will join for it.
13:36 <cardoe> There's some efforts within Ironic to improve the networking handling. I'm primarily interested in doing as much of it with Neutron as possible.
13:47 <haleyb> frickler: so how did the py313 job get fixed? black magic?
14:01 <Uggla> Hi haleyb, so I set 1h from 13 --> 14h UTC on thursday for neutron/cinder/nova PTG cross team meeting. what room do you want to use nova one or neutron ?
14:02 <haleyb> Uggla: either the nova or neutron since i think cinder would only be for part, there was an additional nova/neutron topic
14:02 <haleyb> let me ping jon again
14:05 <Uggla> haleyb, is 1h enough for all topics ?
14:06 <otherwiseguy> ralonsoh: haleyb: if you all ever run into any of that mythical free time, I respun https://review.opendev.org/c/openstack/neutron/+/923971 which had ralonsoh's +2 to fix some nits.
14:06 <haleyb> Uggla: I *think* so, we have the QoS one which could be 30 minutes or maybe more, then https://bugs.launchpad.net/nova/+bug/2073254
14:06 <ralonsoh> otherwiseguy, let me check
14:07 <Uggla> haleyb, ok I'm gonna ping Jon too, because I have another session with him. So I can check for ours too.
14:08 <opendevreview> Merged openstack/neutron master: Duplicated table lookup in extending port resource request  https://review.opendev.org/c/openstack/neutron/+/944650
14:08 <haleyb> Uggla: ack, i reached out privately so he's aware of the cross-project ask
14:10 <Uggla> haleyb, can we use the nova room so it will be clear on the agenda.
14:10 <Uggla> austin room
14:11 <haleyb> otherwiseguy: i'll look again, it is an interesting problem and fix :) Does this type of fixup work for other objects? i'm not asking to boil the ocean (yet)
14:13 <frickler> haleyb: fancy stuff, didn't I link to it? https://review.opendev.org/c/zuul/zuul-jobs/+/940158
14:13 <otherwiseguy> haleyb: yeah. It's unfortunately a solution that involves fixing a bunch of individual commands. But the good news is that we can do in individual patches whenever we see a problem crop up in tests, bug reports instead of pre-emptively changing everything with a huge fix.
14:15 <haleyb> frickler: i didn't see the link. So https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/941246 can maybe be merged then?
14:15 <otherwiseguy> The general problem is "dependent objects being modified on different workers" whether that's a CREATE of one like an LS, then a CREATE of the dependent LSP rapidly on another, or just doing a CREATE of an object and then UPDATE immediately after with a different api call, etc.
14:15 <otherwiseguy> So it's theoretically a problem for any object that can be created and updated.
14:16 <otherwiseguy> It's just that it may not be super common to do a create/update with different API calls *really quickly*.
14:17 <otherwiseguy> haleyb: but it's definitely something I'd look at pretty closely when seeing a flaky tempest test.
14:18 <haleyb> otherwiseguy: ack. I've actually seen a case where adding an SG rule fails because the SG isn't there, but it really is there. I always wondered if they got split-up but could never prove it
14:20 <otherwiseguy> haleyb: the thing that is a little weird about the fix/feature in ovs is that it moves the responsibility for choosing a globally unique id from ovsdb-server to the client. So you definitely need to make sure your UUIDs you're choosing are really unique. ;)
14:22 <otherwiseguy> On the LS, I just used the neutron network uuid, which also makes it easy to match them up for debugging.
14:22 <haleyb> otherwiseguy: i'm assuming since it's a neutron UUID it is unique? but i just work here
14:22 <otherwiseguy> :D
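[Editor's note: the persist_uuid idea being discussed above — the client, not ovsdb-server, picks the row UUID, so Neutron can reuse the network UUID as the Logical_Switch row UUID — can be sketched roughly as below. This is a hypothetical in-memory illustration, not the real ovsdbapp API; all names are made up.]

```python
import uuid

# Hypothetical stand-in for the OVN NB Logical_Switch table, keyed by row UUID.
nb_logical_switches = {}

def create_ls_with_persisted_uuid(network_id: str):
    """Client-chosen row UUID (the persist_uuid idea): reuse the neutron
    network UUID, which must be globally unique, as the OVN row UUID.
    This also makes matching Neutron networks to LS rows trivial."""
    row_uuid = uuid.UUID(network_id)
    if row_uuid in nb_logical_switches:   # may_exist-style idempotency
        return nb_logical_switches[row_uuid]
    row = {"_uuid": row_uuid, "name": f"neutron-{network_id}"}
    nb_logical_switches[row_uuid] = row   # a real txn would go to ovsdb-server
    return row

net_id = str(uuid.uuid4())
ls = create_ls_with_persisted_uuid(net_id)
# Debugging win mentioned in the chat: the LS row UUID *is* the network UUID.
assert str(ls["_uuid"]) == net_id
```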
14:23 <frickler> haleyb: yes, just giving gmann a moment to wake up and respond ;)
14:23 <haleyb> frickler: ack, good to know it was a simple fix
14:25 <otherwiseguy> haleyb: I do need to think for a second or two on the situation of upgrades. Creating an LSP on the "new" way when the LS was old way and not using the network_uuid as its uuid.
14:26 <haleyb> otherwiseguy: i might ask for your opinion on a bug, let me add an update, maybe you'll have an a-ha moment
14:26 <otherwiseguy> Because I think the patch keys only off of "persist uuid is supported" and then passes the network_id instead. I should write a test that exercises it at the very least.
14:26 <otherwiseguy> haleyb: sounds good
14:36 <haleyb> otherwiseguy: can you look at my last comment in https://bugs.launchpad.net/neutron/+bug/2090921 ? it just sounded similar but maybe it isn't
14:37 * otherwiseguy looks
14:41 <frickler> haleyb: ah, I had written a response comment earlier, but forgotten to submit it, thx for the hint :) https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/941246/comment/2c797d55_940220c4/
14:52 <otherwiseguy> haleyb: my first instinct is that the failing case is when update_acls_for_security_group() is called (which doesn't add the pg), and the working one in that comment would be when calling create_security_group(), which creates the pg and calls add_acls_for_sg_port_group() for the default rules. And it could theoretically be that the create_security_group() and then adding rules happen on different workers.
14:54 <otherwiseguy> I could definitely see tooling that would do api calls to create security groups and add rules very quickly after each other.
15:00 <haleyb> otherwiseguy: yeah, this is that case I think, and this failure is pretty reproducible by them, just not by me without some hacking. For the first case I guess we can change the call perhaps? But I wouldn't know an easy solution if they wound-up on different workers
15:03 *** dmellado0755393737 is now known as dmellado
15:16 <haleyb> otherwiseguy: i mean, should we always put a pg_add(..., may_exist=True) call there? or restructure the code to do that? i'd think we'd want both operations in the same transaction?
15:24 <otherwiseguy> haleyb: We could. It would mildly irritate me (for no good reason) to try to create the PG every time we added a security group rule. Looking at the Commands, it *should* work.
15:28 <haleyb> otherwiseguy: it would annoy me as well, i'm open to options. This is why your uuid change seemed like it could apply here perhaps? although we see this issue as far back as yoga
15:29 <otherwiseguy> Yeah, with the mess of "persist_uuid doesn't quite work in various OVS branches" it's kind of annoying from a backport perspective.
15:29 <haleyb> or we add a try/except and only pg_add on failure?
15:30 * haleyb is between meetings and might go <poof>
15:31 <otherwiseguy> haleyb: honestly, the may_exist check would be really cheap. It's just checking local memory for the Port_Group by name, and it should be indexed.
15:35 <otherwiseguy> haleyb: I think I'd prefer that to retry (and honestly, by the time we retried the local cache would likely be updated)
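[Editor's note: the key point in the exchange above is that a may_exist check is cheap because it only consults the client's local, replicated cache (indexed by name), not ovsdb-server. A rough sketch of that semantics, with made-up names rather than the real ovsdbapp PgAddCommand:]

```python
# Stand-in for the IDL's in-memory replica of the Port_Group table.
local_cache = {}  # name -> row

def pg_add_sketch(name, may_exist=False):
    """Idempotent-ish create: the existence check is a local dict lookup
    (cheap, as otherwiseguy notes), not a server round trip."""
    if name in local_cache:
        if may_exist:
            return local_cache[name]   # already there: succeed silently
        raise RuntimeError(f"Port_Group {name} already exists")
    row = {"name": name, "acls": []}
    local_cache[name] = row            # a real command would queue a txn
    return row

pg = pg_add_sketch("pg_sg1", may_exist=True)
# Calling it again with may_exist=True is a no-op returning the same row.
assert pg_add_sketch("pg_sg1", may_exist=True) is pg
```

The caveat raised later in the log still applies: this check sees only the local cache, so a row that exists on the server but hasn't replicated yet would slip past it.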
15:39 <opendevreview> Jakub Libosvar proposed openstack/neutron master: cleanup: Remove 'migrate' mode for OVN DB sync  https://review.opendev.org/c/openstack/neutron/+/946168
15:59 <haleyb> otherwiseguy: interested in looking at that :) i pay in beer though
16:11 <otherwiseguy> Thinking about the case where it exists in ovsdb-server, but not locally, then we try to create it with may_exist. The may_exist check is local, so we will actually try to create the pg on the server, which will fail since one with the name already exists. So it definitely couldn't be in a transaction with anything else or the whole transaction would fail. So we'd have to do a single transaction with may_exist=True, check_error=False.
16:12 <otherwiseguy> haleyb: ^ But if we are doing that, then by the time we get the response from ovsdb-server that our txn has failed with the pg_add, we by definition will have already received the update from ovsdb-server with the added row. Basically, any write txn we do whether it creates the pg or writes to an external id, would ensure that we'd got the updated row in memory. :p
16:16 <otherwiseguy> But just retrying the txn isn't guaranteed to work immediately because when we fail because the pg doesn't exist, it's failing locally. So basically you're having to sit there polling for the pg unless you do a write (and if you were writing, trying to add the pg is as good a thing to write as anything).
16:16 <otherwiseguy> haleyb: which openvswitch does yoga use?
16:28 <otherwiseguy> haleyb: er, which openvswitch do you all use when using yoga I guess is what I really mean.
16:57 <haleyb> otherwiseguy: 2.17.9 i believe, would have to confirm on a running cloud
18:06 <mnaser> I think I'm running into some real weird issue here.
18:07 <mnaser> Nova's network_info (at least the cached one) seems to be lacking subnet info, when using routed provider networks
18:08 <mnaser> so when config drive is getting built, it's ending up with an empty config
18:11 <opendevreview> Candido Campos Rivas proposed x/whitebox-neutron-tempest-plugin master: Increase burst window size to avoid unstabilities in slow envs  https://review.opendev.org/c/x/whitebox-neutron-tempest-plugin/+/944616
18:12 <mnaser> i assume its because the IP assignment is deferred
18:12 <mnaser> so you end up with: https://paste.openstack.org/show/b6wcQ0hWl9JvdpwHEOks/
18:33 <otherwiseguy> haleyb: what I don't like, the pg we always really expect to be there. in no case should we really be recreating the pg (it could legitimately have been deleted for example). So, that leaves us with essentially doing what the code I'm trying to replace with ls/lsp did. Wait until the expected row exists, which will block the worker for an arbitrarily chosen amount of time. (or make an inconsequential write to the db, which will block the worker for an indeterminate amount of time, bounded by the ovsdb_timeout).
18:35 <otherwiseguy> (I'm slowly remembering all of the reasons I went with the persist_uuid idea literally years ago at this point)
19:30 <haleyb> otherwiseguy: yeah, I guess the API can only protect that the SG exists by allowing the SG rule to be created, but there's no guarantee the PG exists. I really don't know the best solution
19:34 <otherwiseguy> haleyb: For newer releases, persist_uuid-ifying PG/ACL. For older stuff...the check_for_row_by_value_and_retry('Port_Group', ...) solution is probably easiest.
19:35 <otherwiseguy> I dislike polling and blocking the worker more than I can articulate. But reminding myself that it should only be in very rare cases helps.
19:36 <otherwiseguy> (or if the ovsdb-server or neutron is just way overloaded, in which case the solution is to make them less overloaded)
19:40 <haleyb> otherwiseguy: oh, i like that check_for_row_by_value_and_retry() call, and i think in this case we have the SG id, so can calculate the PG name. Would you put that right in update_acls_for_security_group() ?
19:44 <haleyb> mnaser: sorry, guess i haven't seen that, maybe someone on the nova team has info if it's just something with the sync?
19:44 <otherwiseguy> haleyb: Yeah, and double check for any other places we might eventually call pg_acl_add() or whatever.
19:45 <otherwiseguy> And hope that there aren't any dumb assert_called_once_with() unit tests that break. :p
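[Editor's note: the check_for_row_by_value_and_retry('Port_Group', ...) approach agreed on above boils down to polling the local cache for the expected row, bounded by a timeout. A minimal self-contained sketch of that pattern, with hypothetical names (not the actual neutron helper's signature):]

```python
import time

def wait_for_row_by_value(get_rows, column, value, timeout=5.0, interval=0.01):
    """Poll get_rows() (a stand-in for reading the IDL's local cache)
    until a row with row[column] == value appears, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for row in get_rows():
            if row.get(column) == value:
                return row
        time.sleep(interval)   # blocks the worker, as otherwiseguy laments
    raise TimeoutError(f"no row with {column}={value!r} after {timeout}s")

# Simulate the race: the Port_Group row only shows up in the cache after
# ~50ms, as if another worker's create just finished replicating.
cache = []
start = time.monotonic()
def get_rows():
    if time.monotonic() - start > 0.05:
        cache[:] = [{"name": "pg_sg1"}]
    return cache

row = wait_for_row_by_value(get_rows, "name", "pg_sg1", timeout=2.0)
assert row["name"] == "pg_sg1"
```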
19:49 <haleyb> otherwiseguy: i'll try and get something out soon, might be able to test on their setup too
19:49 <otherwiseguy> haleyb: I think it should be possible to make a generalized test for this kind of issue. Have a test that uses a subclassed Connection or something that sleeps before calling idl.run() maybe.
19:51 <otherwiseguy> Or easy enough to just manually patch that in and run tempest tests which are bound to have quite a few tests that create dependent objects right after each other or create/update.
19:52 <otherwiseguy> I may play with that a bit this week.
19:57 <haleyb> otherwiseguy: where would the idl.run() code be for this? I'm trying to find the class
19:57 <otherwiseguy> It should be in ovsdbapp's Connection class. I think under run()
19:57 * otherwiseguy checks
19:58 <otherwiseguy> haleyb: yeah, ovsdbapp.backend.ovs_idl.connection.Connection.run
19:58 <haleyb> oh, i was looking in neutron, duh
20:10 * haleyb wouldn't exactly know how to tweak that code
20:12 <otherwiseguy> I think just putting a time.sleep(5) or whatever right before the idl.run() call would do it. I usually just grab devstack, checkout ovsdbapp in /opt/stack and pip install -e. from the devstack venv to play around with it.
20:12 <otherwiseguy> I'll try locally real quick and see if I can reproduce the error w/ a sg create/sg rule create.
20:23 <otherwiseguy> hmm, though I guess unilaterally adding the delay there is delaying the initial api request as well, so it doesn't let me easily create a new rule quickly.
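[Editor's note: the debugging trick described above — delaying the IDL event loop so the local cache stays stale longer — can be sketched as a monkeypatch. FakeConnection below is a stand-in, not the real ovsdbapp.backend.ovs_idl.connection.Connection; the point is only the wrap-and-sleep pattern one might apply after a pip install -e checkout:]

```python
import time

class FakeConnection:          # hypothetical stand-in for ovsdbapp's Connection
    def __init__(self):
        self.iterations = 0
    def run(self):
        self.iterations += 1   # the real run() processes IDL updates here

_orig_run = FakeConnection.run

def delayed_run(self, delay=0.01):
    """Sleep before processing updates, widening the window in which the
    local cache lags ovsdb-server (to reproduce the race in tests)."""
    time.sleep(delay)
    return _orig_run(self)

FakeConnection.run = delayed_run   # the monkeypatch

conn = FakeConnection()
conn.run()
assert conn.iterations == 1
```

As noted at 20:23, an unconditional delay here also slows the initial API request, so a real reproducer would probably want to gate the sleep on some condition.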
21:59 <opendevreview> Brian Haley proposed openstack/neutron master: Make sure OVN PG is present before adding a SG rule  https://review.opendev.org/c/openstack/neutron/+/946216
22:00 <haleyb> otherwiseguy: I will try something like that ^^^ out downstream and see what happens
22:00 <otherwiseguy> haleyb: +1
22:00 <haleyb> it passed unit tests :)

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!