opendevreview | sean mooney proposed openstack/nova master: Force api to exit if exceptions are uncaught https://review.opendev.org/c/openstack/nova/+/944933 | 00:25 |
---|---|---|
opendevreview | OpenStack Proposal Bot proposed openstack/nova master: Imported Translations from Zanata https://review.opendev.org/c/openstack/nova/+/944963 | 04:01 |
opendevreview | Adam Kijak proposed openstack/nova master: Fix race condition when attaching a volume and deleting the instance https://review.opendev.org/c/openstack/nova/+/944976 | 09:51 |
opendevreview | Stephen Finucane proposed openstack/nova master: Remove tags from README https://review.opendev.org/c/openstack/nova/+/944994 | 12:30 |
opendevreview | Pierre Riteau proposed openstack/nova master: Fix missing backtick in configuration option help https://review.opendev.org/c/openstack/nova/+/931253 | 13:43 |
priteau | Hello Nova cores. This trivial doc fix has been open for a few months ^ | 13:43 |
priteau | Could you please review? | 13:43 |
opendevreview | sean mooney proposed openstack/nova master: Force api to exit if exceptions are uncaught https://review.opendev.org/c/openstack/nova/+/944933 | 14:23 |
opendevreview | Bogdan Dobrelya proposed openstack/nova master: [DNM] Hard OS exit instead of SysExit exception https://review.opendev.org/c/openstack/nova/+/945011 | 14:24 |
opendevreview | sean mooney proposed openstack/nova master: Force api to exit if exceptions are uncaught https://review.opendev.org/c/openstack/nova/+/944933 | 14:26 |
opendevreview | Bogdan Dobrelya proposed openstack/nova master: [DNM] Hard OS exit instead of SysExit exception https://review.opendev.org/c/openstack/nova/+/945011 | 14:31 |
dansmith | sean-k-mooney: so one approach is to try to exit really really hard to get apache/mod_wsgi to notice... but, | 15:10 |
dansmith | what about us making sure we never raise and just return 500s if we would have otherwise failed? | 15:11 |
sean-k-mooney | hum | 15:11 |
dansmith | even without a dedicated healthcheck that would be easy to arrange to not be healthy | 15:11 |
sean-k-mooney | we would have to kind of hack it | 15:12 |
dansmith | curl /versions or whatever, always returning 500 would leave a pod unsuitable for traffic | 15:12 |
sean-k-mooney | but we could set a module level global i guess | 15:12 |
sean-k-mooney | and check that really early | 15:12 |
dansmith | sure, we'd need some way to record it and continue logging the config error or whatever so the operator saw it in the logs | 15:12 |
dansmith | sean-k-mooney: or just make application() return a different stub object if we failed to init the real one | 15:12 |
sean-k-mooney | perhaps. the issue is it reloading that applciation into the exisitng interperter | 15:13 |
sean-k-mooney | so we keep all the module level state like that partly parsed oslo config global | 15:13 |
dansmith | do we need to? | 15:13 |
sean-k-mooney | do we need to keep that state? | 15:14 |
dansmith | we could just have a safety proxy that sits around the real one and keeps trying to load the real one, eventually proxying to it if it loads, or returning the 500s if not | 15:14 |
sean-k-mooney | i was condiring trying to reset it so that we woul parse the config and raise the excption again | 15:14 |
dansmith | I kinda don't think we need to worry about the partial state | 15:14 |
sean-k-mooney | well that why its getting to the microverison endpoint when its reloaded | 15:15 |
dansmith | I imagine everyone doing this in a modern way is going to fix the config and restart a container when they do, so just having it be a terminal state if config is broken seems fine to me | 15:15 |
sean-k-mooney | just enough fo the state is thare that we get passed where the excption woudl be raised | 15:15 |
sean-k-mooney | right im just wondering how to make it terminal | 15:15 |
dansmith | I mean if mod_wsgi reloads us and we have to try initializing again and still fail, who cares? | 15:16 |
sean-k-mooney | im totally fine with changing the guts of the decorator to do something like record if it ever raised an excption | 15:16 |
sean-k-mooney | and then keep raising that if invoked in the future | 15:16 |
sean-k-mooney | that would be pretty clean i think | 15:16 |
sean-k-mooney | dansmith: right now when we are reloaded we dont fail on the init the second time | 15:17 |
dansmith | yeah, just set a global if we failed, stop trying, just return all the 500s, wait for them to fix config and restart | 15:17 |
sean-k-mooney | becuase of the leaked state | 15:17 |
dansmith | sean-k-mooney: ahh, I see, then yeah, some global to prevent us from trying | 15:17 |
sean-k-mooney | dansmith: lets wait for the current ci job to complete | 15:17 |
sean-k-mooney | ill refactor to raise 500 if we ever raised before and see what that looks like | 15:18 |
dansmith | I just feel like trying to sys.exit() harder and harder is sort of missing the point | 15:18 |
bogdando | we also have one testing the os._exit "fork" | 15:18 |
sean-k-mooney | ya i dont like calling that ^ | 15:18 |
dansmith | we should just detect, log, and signal (which for a web application is a 500 healthcheck, IMHO) | 15:18 |
dansmith | sean-k-mooney: agree about os._exit, that's what I mean by "harder" :) | 15:18 |
sean-k-mooney | ya so i dont know if you looked at my most recent version wich follows the reload docs | 15:19 |
sean-k-mooney | i choose to use sig kill instead of sig interupt | 15:19 |
sean-k-mooney | but either i think woudl work | 15:19 |
bogdando | sys exit is trying the door, but apache won't let us out | 15:20 |
sean-k-mooney | dansmith: i think your suggestion is worth trying as it shoudl be pretty clean. | 15:20 |
bogdando | os.exit - trying a window | 15:20 |
bogdando | 500 code - sitting in the room arguing with the door guard | 15:20 |
sean-k-mooney | 500s are more polite | 15:21 |
dansmith | sean-k-mooney: to me, the kill/reload thing is what we should do if we need to retry, but for bad config or almost all of our situations, we don't | 15:21 |
dansmith | so doing kill there just means we crashloop | 15:21 |
dansmith | starting up and returning 500 will signal that we're not happy instead of constantly looking like we're restarting | 15:21 |
sean-k-mooney | yep | 15:21 |
dansmith | I much prefer the "unhealthy" marker instead of the constant "starting..." status where you're not sure if you're still supposed to wait | 15:22 |
sean-k-mooney | long term either implemeting helthchecks or hooking the system ready socket is cleaner | 15:22 |
bogdando | the problem is we need a hard stop, no half measures | 15:22 |
sean-k-mooney | i.e. for knowing if we started properly | 15:22 |
bogdando | letting the service operating half-configured even a fraction of a second could be destructive | 15:22 |
sean-k-mooney | well if my current patch works it wont | 15:23 |
bogdando | (if we got unlucky enough) | 15:23 |
sean-k-mooney | and if we do the 500 thing it wont either | 15:23 |
dansmith | it's not going to operate half-configured either way if we never return application | 15:23 |
dansmith | right | 15:23 |
bogdando | ok, makes sense | 15:23 |
bogdando | what about a tail of rpc calls? | 15:23 |
dansmith | not sure what you mean | 15:24 |
bogdando | it won't accept new API requests, but what about a pending work | 15:24 |
sean-k-mooney | it gets dropped on the floor | 15:24 |
bogdando | woudln't it heppily pick it up after crash looping? | 15:24 |
bogdando | ok | 15:24 |
sean-k-mooney | the http request wont get into the applciation logic and it will get a 500 | 15:24 |
opendevreview | ribaudr proposed openstack/nova master: FUP Remove unnecessary PCI check https://review.opendev.org/c/openstack/nova/+/944105 | 15:24 |
opendevreview | ribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers https://review.opendev.org/c/openstack/nova/+/944106 | 15:24 |
opendevreview | ribaudr proposed openstack/nova master: FUP Add a warning to make non-explicit live migration request debugging easier https://review.opendev.org/c/openstack/nova/+/944133 | 15:24 |
opendevreview | ribaudr proposed openstack/nova master: FUP Update pci-passthrough and virtual-gpu documentation https://review.opendev.org/c/openstack/nova/+/944153 | 15:24 |
bogdando | IIRC, there is a logic for recovering ongoing rpc calls (long running ops) after restarting the process. Wouldn't restarting half-configured introduce more problems that it solves, even if never taking new API requests? | 15:26 |
sean-k-mooney | no | 15:26 |
sean-k-mooney | config errors should not allow the api to ever come up | 15:27 |
dansmith | if we never return application, we can never start rpc calls | 15:27 |
sean-k-mooney | so you shoudl nto be able to start rpcs | 15:27 |
dansmith | I'm not sure what the concern is | 15:27 |
bogdando | oh ok then, sorry | 15:27 |
opendevreview | ribaudr proposed openstack/nova master: FUP Remove unnecessary PCI check https://review.opendev.org/c/openstack/nova/+/944105 | 15:27 |
opendevreview | ribaudr proposed openstack/nova master: FUP improve comment accuracy and variable naming for tag removal https://review.opendev.org/c/openstack/nova/+/943124 | 15:27 |
opendevreview | ribaudr proposed openstack/nova master: FUP Add a warning to make non-explicit live migration request debugging easier https://review.opendev.org/c/openstack/nova/+/944133 | 15:27 |
sean-k-mooney | also if you restart the api and it has done an rpc the reply is dropped on the floor | 15:27 |
opendevreview | ribaudr proposed openstack/nova master: FUP Update pci-passthrough and virtual-gpu documentation https://review.opendev.org/c/openstack/nova/+/944153 | 15:27 |
opendevreview | ribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers https://review.opendev.org/c/openstack/nova/+/944106 | 15:27 |
dansmith | also, the 500 approach specifically avoids restarting if we know we're not going to be able to do so successfull | 15:28 |
sean-k-mooney | it will be about 2 hours until the ci job finishes | 15:29 |
sean-k-mooney | ill see if i can do a poc of the 500 approch this evenign or early tomorrow | 15:29 |
sean-k-mooney | i mean i can push one now | 15:29 |
bogdando | what about conductor, it service no API, right? but we cannot afford it restarted half-configured for some moment? sorry I am fully noob in Nova services architecture | 15:29 |
sean-k-mooney | but i guess i can triger a job via a second dnm change | 15:30 |
sean-k-mooney | the conductor is not a wsig process | 15:30 |
bogdando | it serves no api* | 15:30 |
sean-k-mooney | this si only a problem because of mod_wsgi | 15:30 |
bogdando | oh yes, indeed | 15:30 |
sean-k-mooney | all the rest exit on config error | 15:30 |
dansmith | yeah I'm not sure what conductor has to do with this.. it can use the usual systemd stuff | 15:30 |
bogdando | not that much a noob to grasp that :) | 15:30 |
sean-k-mooney | dansmith: i have a poc locally but i dont think we want to keep raising the excption and returning 500 in all cases. for config yes but for DBConnectionError maybe not. | 16:28 |
sean-k-mooney | dansmith: ill push what i have shortly and we can dicuss if we want to filter ot a subset of excptions | 16:29 |
dansmith | sean-k-mooney: yep, that's fair.. I would default to 500, excluding specific things we know are transient | 16:29 |
sean-k-mooney | ack | 16:29 |
dansmith | anything unexpected should be assumed to be fatal | 16:29 |
bogdando | you see, arguing more with the door gouard? :) | 16:31 |
bogdando | fail early is always the best approach | 16:31 |
sean-k-mooney | so today we allow retrying in case teh db is not ready yet | 16:31 |
sean-k-mooney | so it would be a regression to hard fail in that case | 16:31 |
sean-k-mooney | a small one but still a chagne in behvior | 16:31 |
bogdando | well, systemd or k8s or pidone shall restart | 16:32 |
sean-k-mooney | it might | 16:32 |
sean-k-mooney | it might not | 16:32 |
bogdando | I mean it is a normal assumption for underlying OS to expect that | 16:32 |
sean-k-mooney | we can change the behavior we jsut need to conservitive and advertise it peroperly ot operators | 16:32 |
dansmith | compute tolerates certain things on startup and not others, so this mirrors that | 16:32 |
dansmith | compute also handles conductor not started, waits for it, etc | 16:33 |
sean-k-mooney | i am raising this because im modifying a test that emulatest this and im liek "i proably shoudl not be changing this test" | 16:33 |
bogdando | then it seems we only have one choice, code 500 with tricky logic around | 16:34 |
dansmith | how is it tricky logic? | 16:35 |
sean-k-mooney | no we can hard exit just not on any excption | 16:35 |
sean-k-mooney | config errors are fine | 16:35 |
sean-k-mooney | i jsut cant do excpet Excpetion | 16:35 |
sean-k-mooney | which is what my current patch does | 16:35 |
bogdando | ack | 16:35 |
sean-k-mooney | if the genal approch looks ok to folks after i push it ill refactor it to be more selective | 16:36 |
bogdando | @dansmith the trickyness that I meant is how we have started firstly with a simple return http 500, and this evolved into decisions making | 16:38 |
dansmith | this is not tricky, it's a single "try...except $allowed: pass except Exception: fail" | 16:39 |
sean-k-mooney | bogdando: its more we cant regress https://launchpad.net/bugs/1882094 while fixing this | 16:39 |
sean-k-mooney | while i can change the test case we added to prevent that regressing there are trade offs and if i can aovid changing the test i likely shoudl | 16:41 |
bogdando | what happens first, oslo config validation or DB connection during wsgi app init? | 16:41 |
sean-k-mooney | config | 16:41 |
sean-k-mooney | so that it can know what db to connect too | 16:41 |
bogdando | I recall there were corner cases with logs eating some events because errors happened before the logs configure | 16:42 |
bogdando | sean-k-mooney: while that is totally logical, that was not clear to me after reading https://bugs.launchpad.net/nova/+bug/1882094 | 16:44 |
bogdando | that "Removing and moving the DB connection establishment out of our WSGI init script" could be understood like it connects to DB even before config gets validated | 16:45 |
dansmith | of course config :) | 16:46 |
sean-k-mooney | bogdando: it could but that not what the code does | 16:46 |
bogdando | well, sometimes it firstly needs alive MQ connection before to start verifying something else in a config | 16:47 |
bogdando | not saying that's exactly a case here | 16:47 |
sean-k-mooney | the api does not | 16:47 |
sean-k-mooney | so this call to init_global_data is waht calls config.parse_args https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L128 | 16:48 |
sean-k-mooney | here https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L101 | 16:48 |
sean-k-mooney | the first time the db is acccesed is here https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L64 | 16:49 |
sean-k-mooney | which is after we return from init_global_data | 16:49 |
sean-k-mooney | dansmith: by hte way this is why we are not getting the config error the second time https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L92-L93 | 16:50 |
bogdando | at least my silly question brought some useful considerations | 16:50 |
dansmith | sean-k-mooney: ah interesting :) | 16:50 |
sean-k-mooney | i could proably tweak run_once for this edge case too but im slighly reluctant to broaden the scope of that | 16:53 |
bogdando | @sean-k-mooney so, that means we can hit that initialization again with an invalid config | 16:53 |
bogdando | a rogue database connection :) | 16:53 |
sean-k-mooney | bogdando: yes but rememebr python is ment to die if you dont cache the excption | 16:53 |
sean-k-mooney | mod_wsgi is the thing that is breaking expections :) | 16:54 |
bogdando | my point is now this try catch becomes even more nuanced | 16:54 |
sean-k-mooney | that run_once was added for the orginal bug | 16:55 |
bogdando | yea, mod_wsgi is the thing that is breaking excEPtions | 16:55 |
sean-k-mooney | it was there to make sure we dint try and re parse the config amoung other things if it succeed the first time | 16:55 |
sean-k-mooney | becuase that is an error | 16:55 |
sean-k-mooney | bogdando: extra context we have shipt the run_once approch for 5 years and backported it to all our downstream release that are still supproted. | 16:57 |
sean-k-mooney | so while we can condier changing this we have several years fo experince that this work good enough | 16:57 |
sean-k-mooney | so im trying to be tactical without createing more tech debt | 16:58 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] raise_forever on wsgi app init https://review.opendev.org/c/openstack/nova/+/945028 | 17:00 |
bogdando | puppet used to be much more restrictive on configs management | 17:00 |
bogdando | than eating free form snippets | 17:00 |
sean-k-mooney | pupet was a terribel user experince | 17:01 |
bogdando | but it was guarding us against this behavior | 17:01 |
sean-k-mooney | to be clear we are only having this converstaion because healthcheck are not merged | 17:01 |
sean-k-mooney | we have an implmemtned feature in review for over 2 years that was intned to solve this and other problems | 17:02 |
bogdando | good point | 17:02 |
sean-k-mooney | https://review.opendev.org/q/topic:%22per-process-healthchecks%22+status:open | 17:02 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] raise_forever on wsgi app init https://review.opendev.org/c/openstack/nova/+/945028 | 17:10 |
sean-k-mooney | i just rebased that to master | 17:10 |
sean-k-mooney | and here is the test patch https://github.com/openstack-k8s-operators/nova-operator/pull/943 | 17:12 |
sean-k-mooney | im going to leave it there for today but i will look at the result tomrrow | 17:12 |
opendevreview | Mina Shahsavan pour proposed openstack/nova master: Volume Affinity Filter https://review.opendev.org/c/openstack/nova/+/945031 | 17:19 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Ignore metadata tags in pci/stats _find_pool logic https://review.opendev.org/c/openstack/nova/+/944277 | 17:26 |
dansmith | bauzas: Uggla good to approve stuff for master=flamingo yes? | 17:53 |
dansmith | aight, stable/2025.1 is present so I'm just gonna do it | 18:28 |
sean-k-mooney | dansmith: do you have any idea what "reservation id" is for a nova instance | 19:10 |
sean-k-mooney | hum | 19:12 |
sean-k-mooney | "The reservation id for the server. This is an id that can be useful in tracking groups of servers created with multiple create, that will all have the same reservation_id. By default, it appears in the response for administrative users only." | 19:12 |
sean-k-mooney | ok apprently that show up even if you dont use multi create | 19:13 |
sean-k-mooney | i dont think i have seen that in horizon before but i guess i dont use the admin project most of the time | 19:14 |
dansmith | sean-k-mooney: that may be a holdover from being able to support the ec2 api | 19:18 |
sean-k-mooney | it was added in 2.3 but maybe | 19:18 |
sean-k-mooney | i was wondering if it was somehow blazar related | 19:18 |
sean-k-mooney | just off the name | 19:18 |
dansmith | orly okay | 19:19 |
opendevreview | Merged openstack/nova master: Fix missing backtick in configuration option help https://review.opendev.org/c/openstack/nova/+/931253 | 19:25 |
opendevreview | Merged openstack/nova master: Reproduce bug/2098496 https://review.opendev.org/c/openstack/nova/+/941673 | 19:26 |
opendevreview | Merged openstack/nova master: Ignore metadata tags in pci/stats _find_pool logic https://review.opendev.org/c/openstack/nova/+/944277 | 22:04 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!