Wednesday, 2025-04-02

05:26 <ralonsoh> bcafarel, slaweq hello! If you have 1 min today: https://review.opendev.org/q/project:openstack/neutron+branch:stable/2025.1+status:open
05:26 <ralonsoh> thanks!
06:45 <ralonsoh> haleyb, hello. Thanks for waiting. I've added some topics to the PTG agenda
08:16 <opendevreview> Candido Campos Rivas proposed x/whitebox-neutron-tempest-plugin master: Increase burst window size to avoid unstabilities in slow envs  https://review.opendev.org/c/x/whitebox-neutron-tempest-plugin/+/944616
09:12 <slaweq> ralonsoh I just reviewed your backports, but in https://review.opendev.org/c/openstack/neutron/+/945970 there is one comment which IMO can be valid, please check if we should maybe address it in follow up patch
09:13 <ralonsoh> slaweq, let me check now
09:18 <opendevreview> Maksim Kostrow proposed openstack/neutron master: Correction of a typo in the 'sub_ports' field  https://review.opendev.org/c/openstack/neutron/+/946124
09:19 <opendevreview> Maksim Kostrow proposed openstack/neutron master: Correction of a typo in the 'sub_ports' field  https://review.opendev.org/c/openstack/neutron/+/946124
09:20 <ralonsoh> slaweq, I added a comment
09:21 <ralonsoh> this is what I tried to explain in the upper note: we wait for the post_fork event and at this point we know the placement extension is loaded
09:33 <ralonsoh> folks, some good optimizations
09:33 <ralonsoh> * https://review.opendev.org/c/openstack/neutron/+/944756
09:34 <ralonsoh> * https://review.opendev.org/c/openstack/neutron/+/944650
12:11 <opendevreview> Lajos Katona proposed openstack/neutron stable/2025.1: If OVS Manager creation failes retry to set values  https://review.opendev.org/c/openstack/neutron/+/946142
12:22 <sahid> ralonsoh: o/
12:23 <sahid> any chance that you have a look to the ovs serie when you have a moment :-)
12:23 <sahid> It's waiting for a while now to get some reviews
12:24 <sahid> here is the full serie https://review.opendev.org/q/topic:%22bug/2087939%22
12:57 <opendevreview> Merged openstack/neutron stable/2025.1: Revert the network IDs filter in ``_gen_router_port_options``  https://review.opendev.org/c/openstack/neutron/+/945968
12:57 <ralonsoh> sahid, sorry, I'm quite busy these days, I'll try next week
12:58 <haleyb> ralonsoh: np, i haven't had much time to even look at etherpad myself
12:58 <ralonsoh> haleyb, I also added a couple of items from other people
12:59 <ralonsoh> ping me if you don't know how much time could take every item
12:59 <haleyb> ack
13:34 <cardoe> ralonsoh: Thanks for the reply on https://bugs.launchpad.net/neutron/+bug/2105855 and Thanks haleyb for giving me some time.
13:35 <cardoe> I know some folks from #-ironic will be interested and will join for it.
13:36 <cardoe> There's some efforts within Ironic to improve the networking handling. I'm primarily interested in doing as much of it with Neutron as possible.
13:47 <haleyb> frickler: so how did the py313 job get fixed? black magic?
14:01 <Uggla> Hi haleyb, so I set 1h from 13 --> 14h UTC on thursday for neutron/cinder/nova PTG cross team meeting. what room do you want to use nova one or neutron ?
14:02 <haleyb> Uggla: either the nova or neutron since i think cinder would only be for part, there was an additional nova/neutron topic
14:02 <haleyb> let me ping jon again
14:05 <Uggla> haleyb, is 1h enough for all topics ?
14:06 <otherwiseguy> ralonsoh: haleyb: if you all ever run into any of that mythical free time, I respun https://review.opendev.org/c/openstack/neutron/+/923971 which had ralonsoh's +2 to fix some nits.
14:06 <haleyb> Uggla: I *think* so, we have the QoS one which could be 30 minutes or maybe more, then https://bugs.launchpad.net/nova/+bug/2073254
14:06 <ralonsoh> otherwiseguy, let me check
14:07 <Uggla> haleyb, ok I'm gonna ping Jon too, because I have another session with him. So I can check for ours too.
14:08 <opendevreview> Merged openstack/neutron master: Duplicated table lookup in extending port resource request  https://review.opendev.org/c/openstack/neutron/+/944650
14:08 <haleyb> Uggla: ack, i reached out privately so he's aware of the cross-project ask
14:10 <Uggla> haleyb, can we use the nova room so it will be clear on the agenda.
14:10 <Uggla> austin room
14:11 <haleyb> otherwiseguy: i'll look again, it is an interesting problem and fix :) Does this type of fixup work for other objects? i'm not asking to boil the ocean (yet)
14:13 <frickler> haleyb: fancy stuff, didn't I link to it? https://review.opendev.org/c/zuul/zuul-jobs/+/940158
14:13 <otherwiseguy> haleyb: yeah. It's unfortunately a solution that involves fixing a bunch of individual commands. But the good news is that we can do in individual patches whenever we see a problem crop up in tests, bug reports instead of pre-emptively changing everything with a huge fix.
14:15 <haleyb> frickler: i didn't see the link. So https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/941246 can maybe be merged then?
14:15 <otherwiseguy> The general problem is "dependent objects being modified on different workers" whether that's a CREATE of one like an LS, then a CREATE of the dependent LSP rapidly on another, or just doing a CREATE of an object and then UPDATE immediately after with a different api call, etc.
14:15 <otherwiseguy> So it's theoretically a problem for any object that can be created and updated.
14:16 <otherwiseguy> It's just that it may not be super common to do a create/update with different API calls *really quickly*.
14:17 <otherwiseguy> haleyb: but it's definitely something I'd look at pretty closely when seeing a flaky tempest test.
14:18 <haleyb> otherwiseguy: ack. I've actually seen a case where adding an SG rule fails because the SG isn't there, but it really is there. I always wondered if they got split-up but could never prove it
14:20 <otherwiseguy> haleyb: the thing that is a little weird about the fix/feature in ovs is that it moves the responsibility for choosing a globally unique id from ovsdb-server to the client. So you definitely need to make sure your UUIDs you're choosing are really unique. ;)
14:22 <otherwiseguy> On the LS, I just used the neutron network uuid, which also makes it easy to match them up for debugging.
14:22 <haleyb> otherwiseguy: i'm assuming since it's a neutron UUID it is unique? but i just work here
14:22 <otherwiseguy> :D
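[Editor's note: the persist_uuid idea being discussed above — the client, not ovsdb-server, picks the row UUID, so Neutron can reuse the network UUID as the Logical_Switch row UUID — can be sketched roughly as below. This is a hypothetical in-memory illustration, not the real ovsdbapp API; all names are made up.]

```python
import uuid

# Hypothetical stand-in for the OVN NB Logical_Switch table, keyed by row UUID.
nb_logical_switches = {}

def create_ls_with_persisted_uuid(network_id: str):
    """Client-chosen row UUID (the persist_uuid idea): reuse the neutron
    network UUID, which must be globally unique, as the OVN row UUID.
    This also makes matching Neutron networks to LS rows trivial."""
    row_uuid = uuid.UUID(network_id)
    if row_uuid in nb_logical_switches:   # may_exist-style idempotency
        return nb_logical_switches[row_uuid]
    row = {"_uuid": row_uuid, "name": f"neutron-{network_id}"}
    nb_logical_switches[row_uuid] = row   # a real txn would go to ovsdb-server
    return row

net_id = str(uuid.uuid4())
ls = create_ls_with_persisted_uuid(net_id)
# Debugging win mentioned in the chat: the LS row UUID *is* the network UUID.
assert str(ls["_uuid"]) == net_id
```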
14:23 <frickler> haleyb: yes, just giving gmann a moment to wake up and respond ;)
14:23 <haleyb> frickler: ack, good to know it was a simple fix
14:25 <otherwiseguy> haleyb: I do need to think for a second or two on the situation of upgrades. Creating an LSP on the "new" way when the LS was old way and not using the network_uuid as its uuid.
14:26 <haleyb> otherwiseguy: i might ask for your opinion on a bug, let me add an update, maybe you'll have an a-ha moment
14:26 <otherwiseguy> Because I think the patch keys only off of "persist uuid is supported" and then passes the network_id instead. I should write a test that exercises it at the very least.
14:26 <otherwiseguy> haleyb: sounds good
14:36 <haleyb> otherwiseguy: can you look at my last comment in https://bugs.launchpad.net/neutron/+bug/2090921 ? it just sounded similar but maybe it isn't
14:37 * otherwiseguy looks
14:41 <frickler> haleyb: ah, I had written a response comment earlier, but forgotten to submit it, thx for the hint :) https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/941246/comment/2c797d55_940220c4/
14:52 <otherwiseguy> haleyb: my first instinct is that the failing case is when update_acls_for_security_group() is called (which doesn't add the pg), and the working one in that comment would be when calling create_security_group(), which creates the pg and calls add_acls_for_sg_port_group() for the default rules. And it could theoretically be that the create_security_group() and then adding rules happen on different workers.
14:54 <otherwiseguy> I could definitely see tooling that would do api calls to create security groups and add rules very quickly after each other.
15:00 <haleyb> otherwiseguy: yeah, this is that case I think, and this failure is pretty reproducible by them, just not by me without some hacking. For the first case I guess we can change the call perhaps? But I wouldn't know an easy solution if they wound-up on different workers
15:03 *** dmellado0755393737 is now known as dmellado
15:16 <haleyb> otherwiseguy: i mean, should we always put a pg_add(..., may_exist=True) call there? or restructure the code to do that? i'd think we'd want both operations in the same transaction?
15:24 <otherwiseguy> haleyb: We could. It would mildly irritate me (for no good reason) to try to create the PG every time we added a security group rule. Looking at the Commands, it *should* work.
15:28 <haleyb> otherwiseguy: it would annoy me as well, i'm open to options. This is why your uuid change seemed like it could apply here perhaps? although we see this issue as far back as yoga
15:29 <otherwiseguy> Yeah, with the mess of "persist_uuid doesn't quite work in various OVS branches" it's kind of annoying from a backport perspective.
15:29 <haleyb> or we add a try/except and only pg_add on failure?
15:30 * haleyb is between meetings and might go <poof>
15:31 <otherwiseguy> haleyb: honestly, the may_exist check would be really cheap. It's just checking local memory for the Port_Group by name, and it should be indexed.
15:35 <otherwiseguy> haleyb: I think I'd prefer that to retry (and honestly, by the time we retried the local cache would likely be updated)
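[Editor's note: the key point in the exchange above is that a may_exist check is cheap because it only consults the client's local, replicated cache (indexed by name), not ovsdb-server. A rough sketch of that semantics, with made-up names rather than the real ovsdbapp PgAddCommand:]

```python
# Stand-in for the IDL's in-memory replica of the Port_Group table.
local_cache = {}  # name -> row

def pg_add_sketch(name, may_exist=False):
    """Idempotent-ish create: the existence check is a local dict lookup
    (cheap, as otherwiseguy notes), not a server round trip."""
    if name in local_cache:
        if may_exist:
            return local_cache[name]   # already there: succeed silently
        raise RuntimeError(f"Port_Group {name} already exists")
    row = {"name": name, "acls": []}
    local_cache[name] = row            # a real command would queue a txn
    return row

pg = pg_add_sketch("pg_sg1", may_exist=True)
# Calling it again with may_exist=True is a no-op returning the same row.
assert pg_add_sketch("pg_sg1", may_exist=True) is pg
```

The caveat raised later in the log still applies: this check sees only the local cache, so a row that exists on the server but hasn't replicated yet would slip past it.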
15:39 <opendevreview> Jakub Libosvar proposed openstack/neutron master: cleanup: Remove 'migrate' mode for OVN DB sync  https://review.opendev.org/c/openstack/neutron/+/946168
15:59 <haleyb> otherwiseguy: interested in looking at that :) i pay in beer though
16:11 <otherwiseguy> Thinking about the case where it exists in ovsdb-server, but not locally, then we try to create it with may_exist. The may_exist check is local, so we will actually try to create the pg on the server, which will fail since one with the name already exists. So it definitely couldn't be in a transaction with anything else or the whole transaction would fail. So we'd have to do a single transaction with may_exist=True, check_error=False.
16:12 <otherwiseguy> haleyb: ^ But if we are doing that, then by the time we get the response from ovsdb-server that our txn has failed with the pg_add, we by definition will have already received the update from ovsdb-server with the added row. Basically, any write txn we do whether it creates the pg or writes to an external id, would ensure that we'd got the updated row in memory. :p
16:16 <otherwiseguy> But just retrying the txn isn't guaranteed to work immediately because when we fail because the pg doesn't exist, it's failing locally. So basically you're having to sit there polling for the pg unless you do a write (and if you were writing, trying to add the pg is as good a thing to write as anything).
16:16 <otherwiseguy> haleyb: which openvswitch does yoga use?
16:28 <otherwiseguy> haleyb: er, which openvswitch do you all use when using yoga I guess is what I really mean.
16:57 <haleyb> otherwiseguy: 2.17.9 i believe, would have to confirm on a running cloud
18:06 <mnaser> I think I'm running into some real weird issue here.
18:07 <mnaser> Nova's network_info (at least the cached one) seems to be lacking subnet info, when using routed provider networks
18:08 <mnaser> so when config drive is getting built, it's ending up with an empty config
18:11 <opendevreview> Candido Campos Rivas proposed x/whitebox-neutron-tempest-plugin master: Increase burst window size to avoid unstabilities in slow envs  https://review.opendev.org/c/x/whitebox-neutron-tempest-plugin/+/944616
18:12 <mnaser> i assume its because the IP assignment is deferred
18:12 <mnaser> so you end up with: https://paste.openstack.org/show/b6wcQ0hWl9JvdpwHEOks/
18:33 <otherwiseguy> haleyb: what I don't like, the pg we always really expect to be there. in no case should we really be recreating the pg (it could legitimately have been deleted for example). So, that leaves us with essentially doing what the code I'm trying to replace with ls/lsp did. Wait until the expected row exists, which will block the worker for an arbitrarily chosen amount of time. (or make an inconsequential write to the db, which will block the worker for an indeterminate amount of time, bounded by the ovsdb_timeout).
18:35 <otherwiseguy> (I'm slowly remembering all of the reasons I went with the persist_uuid idea literally years ago at this point)
19:30 <haleyb> otherwiseguy: yeah, I guess the API can only protect that the SG exists by allowing the SG rule to be created, but there's no guarantee the PG exists. I really don't know the best solution
19:34 <otherwiseguy> haleyb: For newer releases, persist_uuid-ifying PG/ACL. For older stuff...the check_for_row_by_value_and_retry('Port_Group', ...) solution is probably easiest.
19:35 <otherwiseguy> I dislike polling and blocking the worker more than I can articulate. But reminding myself that it should only be in very rare cases helps.
19:36 <otherwiseguy> (or if the ovsdb-server or neutron is just way overloaded, in which case the solution is to make them less overloaded)
19:40 <haleyb> otherwiseguy: oh, i like that check_for_row_by_value_and_retry() call, and i think in this case we have the SG id, so can calculate the PG name. Would you put that right in update_acls_for_security_group() ?
19:44 <haleyb> mnaser: sorry, guess i haven't seen that, maybe someone on the nova team has info if it's just something with the sync?
19:44 <otherwiseguy> haleyb: Yeah, and double check for any other places we might eventually call pg_acl_add() or whatever.
19:45 <otherwiseguy> And hope that there aren't any dumb assert_called_once_with() unit tests that break. :p
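[Editor's note: the check_for_row_by_value_and_retry('Port_Group', ...) approach agreed on above boils down to polling the local cache for the expected row, bounded by a timeout. A minimal self-contained sketch of that pattern, with hypothetical names (not the actual neutron helper's signature):]

```python
import time

def wait_for_row_by_value(get_rows, column, value, timeout=5.0, interval=0.01):
    """Poll get_rows() (a stand-in for reading the IDL's local cache)
    until a row with row[column] == value appears, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for row in get_rows():
            if row.get(column) == value:
                return row
        time.sleep(interval)   # blocks the worker, as otherwiseguy laments
    raise TimeoutError(f"no row with {column}={value!r} after {timeout}s")

# Simulate the race: the Port_Group row only shows up in the cache after
# ~50ms, as if another worker's create just finished replicating.
cache = []
start = time.monotonic()
def get_rows():
    if time.monotonic() - start > 0.05:
        cache[:] = [{"name": "pg_sg1"}]
    return cache

row = wait_for_row_by_value(get_rows, "name", "pg_sg1", timeout=2.0)
assert row["name"] == "pg_sg1"
```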
19:49 <haleyb> otherwiseguy: i'll try and get something out soon, might be able to test on their setup too
19:49 <otherwiseguy> haleyb: I think it should be possible to make a generalized test for this kind of issue. Have a test that uses a subclassed Connection or something that sleeps before calling idl.run() maybe.
19:51 <otherwiseguy> Or easy enough to just manually patch that in and run tempest tests which are bound to have quite a few tests that create dependent objects right after each other or create/update.
19:52 <otherwiseguy> I may play with that a bit this week.
19:57 <haleyb> otherwiseguy: where would the idl.run() code be for this? I'm trying to find the class
19:57 <otherwiseguy> It should be in ovsdbapp's Connection class. I think under run()
19:57 * otherwiseguy checks
19:58 <otherwiseguy> haleyb: yeah, ovsdbapp.backend.ovs_idl.connection.Connection.run
19:58 <haleyb> oh, i was looking in neutron, duh
20:10 * haleyb wouldn't exactly know how to tweak that code
20:12 <otherwiseguy> I think just putting a time.sleep(5) or whatever right before the idl.run() call would do it. I usually just grab devstack, checkout ovsdbapp in /opt/stack and pip install -e. from the devstack venv to play around with it.
20:12 <otherwiseguy> I'll try locally real quick and see if I can reproduce the error w/ a sg create/sg rule create.
20:23 <otherwiseguy> hmm, though I guess unilaterally adding the delay there is delaying the initial api request as well, so it doesn't let me easily create a new rule quickly.
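[Editor's note: the debugging trick described above — delaying the IDL event loop so the local cache stays stale longer — can be sketched as a monkeypatch. FakeConnection below is a stand-in, not the real ovsdbapp.backend.ovs_idl.connection.Connection; the point is only the wrap-and-sleep pattern one might apply after a pip install -e checkout:]

```python
import time

class FakeConnection:          # hypothetical stand-in for ovsdbapp's Connection
    def __init__(self):
        self.iterations = 0
    def run(self):
        self.iterations += 1   # the real run() processes IDL updates here

_orig_run = FakeConnection.run

def delayed_run(self, delay=0.01):
    """Sleep before processing updates, widening the window in which the
    local cache lags ovsdb-server (to reproduce the race in tests)."""
    time.sleep(delay)
    return _orig_run(self)

FakeConnection.run = delayed_run   # the monkeypatch

conn = FakeConnection()
conn.run()
assert conn.iterations == 1
```

As noted at 20:23, an unconditional delay here also slows the initial API request, so a real reproducer would probably want to gate the sleep on some condition.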
21:59 <opendevreview> Brian Haley proposed openstack/neutron master: Make sure OVN PG is present before adding a SG rule  https://review.opendev.org/c/openstack/neutron/+/946216
22:00 <haleyb> otherwiseguy: I will try something like that ^^^ out downstream and see what happens
22:00 <otherwiseguy> haleyb: +1
22:00 <haleyb> it passed unit tests :)

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!