ralonsoh | bcafarel, slaweq hello! If you have 1 min today: https://review.opendev.org/q/project:openstack/neutron+branch:stable/2025.1+status:open | 05:26 |
ralonsoh | thanks! | 05:26 |
ralonsoh | haleyb, hello. Thanks for waiting. I've added some topics to the PTG agenda | 06:45 |
opendevreview | Candido Campos Rivas proposed x/whitebox-neutron-tempest-plugin master: Increase burst window size to avoid unstabilities in slow envs https://review.opendev.org/c/x/whitebox-neutron-tempest-plugin/+/944616 | 08:16 |
slaweq | ralonsoh I just reviewed your backports, but in https://review.opendev.org/c/openstack/neutron/+/945970 there is one comment which IMO can be valid, please check if we should maybe address it in follow up patch | 09:12 |
ralonsoh | slaweq, let me check now | 09:13 |
opendevreview | Maksim Kostrow proposed openstack/neutron master: Correction of a typo in the 'sub_ports' field https://review.opendev.org/c/openstack/neutron/+/946124 | 09:18 |
opendevreview | Maksim Kostrow proposed openstack/neutron master: Correction of a typo in the 'sub_ports' field https://review.opendev.org/c/openstack/neutron/+/946124 | 09:19 |
ralonsoh | slaweq, I added a comment | 09:20 |
ralonsoh | this is what I tried to explain in the note above: we wait for the post_fork event, and at that point we know the placement extension is loaded | 09:21 |
ralonsoh | folks, some good optimizations | 09:33 |
ralonsoh | * https://review.opendev.org/c/openstack/neutron/+/944756 | 09:33 |
ralonsoh | * https://review.opendev.org/c/openstack/neutron/+/944650 | 09:34 |
opendevreview | Lajos Katona proposed openstack/neutron stable/2025.1: If OVS Manager creation failes retry to set values https://review.opendev.org/c/openstack/neutron/+/946142 | 12:11 |
sahid | ralonsoh: o/ | 12:22 |
sahid | any chance you could have a look at the ovs series when you have a moment :-) | 12:23 |
sahid | it's been waiting for a while now for some reviews | 12:23 |
sahid | here is the full series https://review.opendev.org/q/topic:%22bug/2087939%22 | 12:24 |
opendevreview | Merged openstack/neutron stable/2025.1: Revert the network IDs filter in ``_gen_router_port_options`` https://review.opendev.org/c/openstack/neutron/+/945968 | 12:57 |
ralonsoh | sahid, sorry, I'm quite busy these days, I'll try next week | 12:57 |
haleyb | ralonsoh: np, i haven't had much time to even look at etherpad myself | 12:58 |
ralonsoh | haleyb, I also added a couple of items from other people | 12:58 |
ralonsoh | ping me if you don't know how much time each item could take | 12:59 |
haleyb | ack | 12:59 |
cardoe | ralonsoh: Thanks for the reply on https://bugs.launchpad.net/neutron/+bug/2105855 and thanks haleyb for giving me some time. | 13:34 |
cardoe | I know some folks from #-ironic will be interested and will join for it. | 13:35 |
cardoe | There's some efforts within Ironic to improve the networking handling. I'm primarily interested in doing as much of it with Neutron as possible. | 13:36 |
haleyb | frickler: so how did the py313 job get fixed? black magic? | 13:47 |
Uggla | Hi haleyb, so I set 1h, from 13:00 to 14:00 UTC on Thursday, for the neutron/cinder/nova PTG cross-team meeting. Which room do you want to use, the nova one or neutron? | 14:01 |
haleyb | Uggla: either the nova or neutron since i think cinder would only be for part, there was an additional nova/neutron topic | 14:02 |
haleyb | let me ping jon again | 14:02 |
Uggla | haleyb, is 1h enough for all topics ? | 14:05 |
otherwiseguy | ralonsoh: haleyb: if you all ever run into any of that mythical free time, I respun https://review.opendev.org/c/openstack/neutron/+/923971 which had ralonsoh's +2 to fix some nits. | 14:06 |
haleyb | Uggla: I *think* so, we have the QoS one which could be 30 minutes or maybe more, then https://bugs.launchpad.net/nova/+bug/2073254 | 14:06 |
ralonsoh | otherwiseguy, let me check | 14:06 |
Uggla | haleyb, ok I'm gonna ping Jon too, because I have another session with him. So I can check for ours too. | 14:07 |
opendevreview | Merged openstack/neutron master: Duplicated table lookup in extending port resource request https://review.opendev.org/c/openstack/neutron/+/944650 | 14:08 |
haleyb | Uggla: ack, i reached out privately so he's aware of the cross-project ask | 14:08 |
Uggla | haleyb, can we use the nova room so it will be clear on the agenda? | 14:10 |
Uggla | austin room | 14:10 |
haleyb | otherwiseguy: i'll look again, it is an interesting problem and fix :) Does this type of fixup work for other objects? i'm not asking to boil the ocean (yet) | 14:11 |
frickler | haleyb: fancy stuff, didn't I link to it? https://review.opendev.org/c/zuul/zuul-jobs/+/940158 | 14:13 |
otherwiseguy | haleyb: yeah. It's unfortunately a solution that involves fixing a bunch of individual commands. But the good news is that we can do it in individual patches whenever we see a problem crop up in tests or bug reports, instead of pre-emptively changing everything with a huge fix. | 14:13 |
haleyb | frickler: i didn't see the link. So https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/941246 can maybe be merged then? | 14:15 |
otherwiseguy | The general problem is "dependent objects being modified on different workers" whether that's a CREATE of one like an LS, then a CREATE of the dependent LSP rapidly on another, or just doing a CREATE of an object and then UPDATE immediately after with a different api call, etc. | 14:15 |
otherwiseguy | So it's theoretically a problem for any object that can be created and updated. | 14:15 |
otherwiseguy | It's just that it may not be super common to do a create/update with different API calls *really quickly*. | 14:16 |
otherwiseguy | haleyb: but it's definitely something I'd look at pretty closely when seeing a flaky tempest test. | 14:17 |
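The race otherwiseguy describes above — a dependent create on one worker failing because that worker's local ovsdb cache hasn't yet seen the parent row created by another worker — can be sketched with a stdlib-only toy model. All class and method names here are illustrative stand-ins, not the real neutron/ovsdbapp API:

```python
import uuid

class Server:
    """Stand-in for ovsdb-server: the authoritative row store."""
    def __init__(self):
        self.rows = {}

class Worker:
    """Stand-in for an API worker with a local, asynchronously-updated cache."""
    def __init__(self, server):
        self.server = server
        self.cache = {}          # lags behind the server until sync()

    def sync(self):
        # Models the ovsdb update notification finally arriving.
        self.cache = dict(self.server.rows)

    def create_ls(self, name):
        row_id = uuid.uuid4()
        self.server.rows[name] = row_id   # txn applied server-side
        self.cache[name] = row_id         # the creating worker sees it at once

    def create_lsp(self, ls_name):
        # Dependent create: looks the parent up in the *local* cache only.
        if ls_name not in self.cache:
            raise LookupError(f"LS {ls_name!r} not in local cache yet")
        return uuid.uuid4()

server = Server()
a, b = Worker(server), Worker(server)

a.create_ls("net1")
try:
    b.create_lsp("net1")      # b has not synced yet: the race fires
except LookupError as exc:
    print(exc)

b.sync()                      # once the update arrives, the same call succeeds
print(b.create_lsp("net1") is not None)
```

The same shape covers the create-then-update case from the discussion: any dependent operation that resolves its parent via a lagging local view can lose this race.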
haleyb | otherwiseguy: ack. I've actually seen a case where adding an SG rule fails because the SG isn't there, but it really is there. I always wondered if they got split-up but could never prove it | 14:18 |
otherwiseguy | haleyb: the thing that is a little weird about the fix/feature in ovs is that it moves the responsibility for choosing a globally unique id from ovsdb-server to the client. So you definitely need to make sure your UUIDs you're choosing are really unique. ;) | 14:20 |
otherwiseguy | On the LS, I just used the neutron network uuid, which also makes it easy to match them up for debugging. | 14:22 |
haleyb | otherwiseguy: i'm assuming since it's a neutron UUID it is unique? but i just work here | 14:22 |
otherwiseguy | :D | 14:22 |
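The client-chosen-UUID idea above can be sketched briefly. Per the discussion, the patch reuses the neutron network UUID directly as the Logical_Switch row UUID; the uuid5 derivation for a related row is only an illustration of one way to keep client-chosen UUIDs deterministic and collision-free, not what the patch does:

```python
import uuid

# With OVN's persist-uuid support, the client picks the row UUID instead of
# ovsdb-server. Reusing the neutron network UUID for the LS makes matching
# an LS to its network trivial when debugging.
network_id = uuid.uuid4()            # stands in for a neutron network UUID
ls_uuid = network_id                 # Logical_Switch row uuid == network uuid

# Hypothetical: a namespaced, deterministic UUID for a related row, derived
# from the network UUID so repeated derivation yields the same value.
port_ns = uuid.uuid5(uuid.NAMESPACE_URL, "neutron-ovn-lsp")
lsp_uuid = uuid.uuid5(port_ns, str(network_id))

print(ls_uuid == network_id)
print(lsp_uuid == uuid.uuid5(port_ns, str(network_id)))
```

Since the client now owns uniqueness, deterministic derivation from an ID that is already globally unique (like a neutron UUID) is one way to honor otherwiseguy's "make sure your UUIDs are really unique" caveat.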
frickler | haleyb: yes, just giving gmann a moment to wake up and respond ;) | 14:23 |
haleyb | frickler: ack, good to know it was a simple fix | 14:23 |
otherwiseguy | haleyb: I do need to think for a second or two about upgrades: creating an LSP the "new" way when the LS was created the old way and isn't using the network_uuid as its uuid. | 14:25 |
haleyb | otherwiseguy: i might ask for your opinion on a bug, let me add an update, maybe you'll have an a-ha moment | 14:26 |
otherwiseguy | Because I think the patch keys only off of "persist uuid is supported" and then passes the network_id instead. I should write a test that exercises it at the very least. | 14:26 |
otherwiseguy | haleyb: sounds good | 14:26 |
haleyb | otherwiseguy: can you look at my last comment in https://bugs.launchpad.net/neutron/+bug/2090921 ? it just sounded similar but maybe it isn't | 14:36 |
* otherwiseguy looks | 14:37 |
frickler | haleyb: ah, I had written a response comment earlier, but forgotten to submit it, thx for the hint :) https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/941246/comment/2c797d55_940220c4/ | 14:41 |
otherwiseguy | haleyb: my first instinct is that the failing case is when update_acls_for_security_group() is called (which doesn't add the pg), and the working one in that comment would be when calling create_security_group(), which creates the pg and calls add_acls_for_sg_port_group() for the default rules. And it could theoretically be that the create_security_group() and then adding rules happen on different workers. | 14:52 |
otherwiseguy | I could definitely see tooling that would do api calls to create security groups and add rules very quickly after each other. | 14:54 |
haleyb | otherwiseguy: yeah, this is that case I think, and this failure is pretty reproducible by them, just not by me without some hacking. For the first case I guess we can change the call perhaps? But I wouldn't know an easy solution if they wound up on different workers | 15:00 |
haleyb | otherwiseguy: i mean, should we always put a pg_add(..., may_exist=True) call there? or restructure the code to do that? i'd think we'd want both operations in the same transaction? | 15:16 |
otherwiseguy | haleyb: We could. It would mildly irritate me (for no good reason) to try to create the PG every time we added a security group rule. Looking at the Commands, it *should* work. | 15:24 |
haleyb | otherwiseguy: it would annoy me as well, i'm open to options. This is why your uuid change seemed like it could apply here perhaps? although we see this issue as far back as yoga | 15:28 |
otherwiseguy | Yeah, with the mess of "persist_uuid doesn't quite work in various OVS branches" it's kind of annoying from a backport perspective. | 15:29 |
haleyb | or we add a try/except and only pg_add on failure? | 15:29 |
* haleyb is between meetings and might go <poof> | 15:30 |
otherwiseguy | haleyb: honestly, the may_exist check would be really cheap. It's just checking local memory for the Port_Group by name, and it should be indexed. | 15:31 |
otherwiseguy | haleyb: I think I'd prefer that to retry (and honestly, by the time we retried the local cache would likely be updated) | 15:35 |
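The "may_exist is really cheap" point above is that the short-circuit consults the client's in-memory, name-indexed view of the table, not the server. A toy sketch of that semantics (this is a stand-in, not the real ovsdbapp command class):

```python
class PortGroupTable:
    """Toy model of a client-side, indexed view of the Port_Group table."""
    def __init__(self):
        self.by_name = {}     # local index: name -> row

    def pg_add(self, name, may_exist=False):
        existing = self.by_name.get(name)   # O(1) local lookup, no round trip
        if existing is not None:
            if may_exist:
                return existing             # cheap no-op: answered from cache
            raise ValueError(f"Port_Group {name!r} already exists")
        row = {"name": name, "acls": []}
        self.by_name[name] = row            # would also be a server-side txn
        return row

t = PortGroupTable()
t.pg_add("pg_sg1")
print(t.pg_add("pg_sg1", may_exist=True)["name"])
```

The catch, raised a few lines later in the discussion, is the inverse case: when the row exists on the server but not yet in the local cache, the local check passes and the server-side insert is what fails.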
opendevreview | Jakub Libosvar proposed openstack/neutron master: cleanup: Remove 'migrate' mode for OVN DB sync https://review.opendev.org/c/openstack/neutron/+/946168 | 15:39 |
haleyb | otherwiseguy: interested in looking at that :) i pay in beer though | 15:59 |
otherwiseguy | Thinking about the case where it exists in ovsdb-server but not locally, and we try to create it with may_exist: the may_exist check is local, so we will actually try to create the pg on the server, which will fail since one with that name already exists. So it definitely couldn't be in a transaction with anything else or the whole transaction would fail. So we'd have to do a single transaction with may_exist=True, check_error=False. | 16:11 |
otherwiseguy | haleyb: ^ But if we are doing that, then by the time we get the response from ovsdb-server that our txn has failed with the pg_add, we by definition will have already received the update from ovsdb-server with the added row. Basically, any write txn we do whether it creates the pg or writes to an external id, would ensure that we'd got the updated row in memory. :p | 16:12 |
otherwiseguy | But just retrying the txn isn't guaranteed to work immediately, because when the txn fails due to the pg not existing, it's failing locally. So basically you're having to sit there polling for the pg unless you do a write (and if you were writing, trying to add the pg is as good a thing to write as anything). | 16:16 |
otherwiseguy | haleyb: which openvswitch does yoga use? | 16:16 |
otherwiseguy | haleyb: er, which openvswitch do you all use when using yoga I guess is what I really mean. | 16:28 |
haleyb | otherwiseguy: 2.17.9 i believe, would have to confirm on a running cloud | 16:57 |
mnaser | I think I'm running into some real weird issue here. | 18:06 |
mnaser | Nova's network_info (at least the cached one) seems to be lacking subnet info, when using routed provider networks | 18:07 |
mnaser | so when config drive is getting built, it's ending up with an empty config | 18:08 |
opendevreview | Candido Campos Rivas proposed x/whitebox-neutron-tempest-plugin master: Increase burst window size to avoid unstabilities in slow envs https://review.opendev.org/c/x/whitebox-neutron-tempest-plugin/+/944616 | 18:11 |
mnaser | i assume it's because the IP assignment is deferred | 18:12 |
mnaser | so you end up with: https://paste.openstack.org/show/b6wcQ0hWl9JvdpwHEOks/ | 18:12 |
otherwiseguy | haleyb: what I don't like: we always really expect the pg to be there, and in no case should we really be recreating it (it could legitimately have been deleted, for example). So that leaves us with essentially doing what the code I'm trying to replace with ls/lsp did: wait until the expected row exists, which will block the worker for an arbitrarily chosen amount of time (or make an inconsequential write to the db, which will block the worker for an indeterminate amount of time, bounded by the ovsdb_timeout). | 18:33 |
otherwiseguy | (I'm slowly remembering all of the reasons I went with the persist_uuid idea literally years ago at this point) | 18:35 |
haleyb | otherwiseguy: yeah, I guess the API can only protect that the SG exists by allowing the SG rule to be created, but there's no guarantee the PG exists. I really don't know the best solution | 19:30 |
otherwiseguy | haleyb: For newer releases, persist_uuid-ifying PG/ACL. For older stuff...the check_for_row_by_value_and_retry('Port_Group', ...) solution is probably easiest. | 19:34 |
otherwiseguy | I dislike polling and blocking the worker more than I can articulate. But reminding myself that it should only be in very rare cases helps. | 19:35 |
otherwiseguy | (or if the ovsdb-server or neutron is just way overloaded, in which case the solution is to make them less overloaded) | 19:36 |
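The "wait for the expected row" fallback discussed above can be sketched generically. The real helper otherwiseguy names lives in neutron's OVN driver code; the signature and callable-based lookup here are illustrative only. The sketch makes the trade-off visible: the worker blocks in a poll loop until the row appears or the timeout expires:

```python
import time

def check_for_row_by_value_and_retry(lookup, value, timeout=10.0, interval=0.1):
    """Poll lookup(value) until it returns a row or the timeout expires.

    `lookup` stands in for a local-cache query (e.g. Port_Group by name).
    """
    deadline = time.monotonic() + timeout
    while True:
        row = lookup(value)
        if row is not None:
            return row
        if time.monotonic() >= deadline:
            raise TimeoutError(f"row {value!r} never appeared")
        time.sleep(interval)

# Simulate the ovsdb update landing only after a couple of polls.
rows = {}
calls = {"n": 0}
def lookup(name):
    calls["n"] += 1
    if calls["n"] == 3:             # pretend the update arrived just now
        rows[name] = {"name": name}
    return rows.get(name)

print(check_for_row_by_value_and_retry(lookup, "pg_sg1")["name"])
```

In the common case the row is already cached and the first poll returns immediately; the worker-blocking cost only bites in the rare lagging-cache case, which matches the "should only be in very rare cases" consolation above.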
haleyb | otherwiseguy: oh, i like that check_for_row_by_value_and_retry() call, and i think in this case we have the SG id, so can calculate the PG name. Would you put that right in update_acls_for_security_group() ? | 19:40 |
haleyb | mnaser: sorry, guess i haven't seen that, maybe someone on the nova team has info if it's just something with the sync? | 19:44 |
otherwiseguy | haleyb: Yeah, and double check for any other places we might eventually call pg_acl_add() or whatever. | 19:44 |
otherwiseguy | And hope that there aren't any dumb assert_called_once_with() unit tests that break. :p | 19:45 |
haleyb | otherwiseguy: i'll try and get something out soon, might be able to test on their setup too | 19:49 |
otherwiseguy | haleyb: I think it should be possible to make a generalized test for this kind of issue. Have a test that uses a subclassed Connection or something that sleeps before calling idl.run() maybe. | 19:49 |
otherwiseguy | Or easy enough to just manually patch that in and run tempest tests which are bound to have quite a few tests that create dependent objects right after each other or create/update. | 19:51 |
otherwiseguy | I may play with that a bit this week. | 19:52 |
haleyb | otherwiseguy: where would the idl.run() code be for this? I'm trying to find the class | 19:57 |
otherwiseguy | It should be in ovsdbapp's Connection class. I think under run() | 19:57 |
* otherwiseguy checks | 19:57 |
otherwiseguy | haleyb: yeah, ovsdbapp.backend.ovs_idl.connection.Connection.run | 19:58 |
haleyb | oh, i was looking in neutron, duh | 19:58 |
* haleyb wouldn't exactly know how to tweak that code | 20:10 |
otherwiseguy | I think just putting a time.sleep(5) or whatever right before the idl.run() call would do it. I usually just grab devstack, checkout ovsdbapp in /opt/stack and pip install -e. from the devstack venv to play around with it. | 20:12 |
otherwiseguy | I'll try locally real quick and see if I can reproduce the error w/ a sg create/sg rule create. | 20:12 |
otherwiseguy | hmm, though I guess unilaterally adding the delay there is delaying the initial api request as well, so it doesn't let me easily create a new rule quickly. | 20:23 |
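The reproduction trick above (a sleep before idl.run() to widen the window where a worker's cache lags the server) can be sketched with a monkeypatch. A stand-in class is used here so the snippet stays self-contained; in a real devstack the patch target would be ovsdbapp.backend.ovs_idl.connection.Connection.run, and the delay would be seconds rather than the token value used here:

```python
import time
from unittest import mock

class Connection:                      # stand-in for the ovsdbapp class
    def run(self):
        return "processed updates"

original_run = Connection.run          # keep the unpatched behavior

def delayed_run(self):
    # Sleep before processing ovsdb updates, simulating a lagging cache.
    time.sleep(0.05)                   # would be ~5s in a real experiment
    return original_run(self)

with mock.patch.object(Connection, "run", delayed_run):
    start = time.monotonic()
    result = Connection().run()
    elapsed = time.monotonic() - start

print(result, elapsed >= 0.05)
```

As noted above, an unconditional delay also slows the initial API request, so a usable version would gate the sleep (for example, only after the first few iterations) to let the first create go through quickly and then delay the follow-up.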
opendevreview | Brian Haley proposed openstack/neutron master: Make sure OVN PG is present before adding a SG rule https://review.opendev.org/c/openstack/neutron/+/946216 | 21:59 |
haleyb | otherwiseguy: I will try something like that ^^^ out downstream and see what happens | 22:00 |
otherwiseguy | haleyb: +1 | 22:00 |
haleyb | it passed unit tests :) | 22:00 |