Tuesday, 2024-12-17

*** gryf is now known as Guest3340  [05:36]
*** Guest3340 is now known as gryf  [05:49]
<rpittau> good morning ironic! o/  [07:52]
<opendevreview> Merged openstack/ironic master: docs: mention bug 1995078  https://review.opendev.org/c/openstack/ironic/+/937373  [09:34]
<iurygregory> good morning ironic  [10:48]
<Continuity> Hey, Ironic people. I have an odd one for you today. We have some Lenovo machines which do not seem to report "Systems": {"@odata.id": "/redfish/v1/Systems"} via redfish when you speak to them unauthenticated. So we had some machines' BMCs disconnected for a couple of mins, long enough for them to drop into maintenance mode for power failure; now, however, they are unable to recover, because when the _power_failure_recovery routine runs it hits an exception that the Systems odata.id does not exist. And they get stuck in that loop forever.  [11:28]
<Continuity> Any ideas how to get them out of this loop, or how to resolve it?  [11:28]
<Continuity> This is using redfish.  [11:28]
<Continuity> Happy to raise a bug if that's more helpful, but could do with getting these machines out of maintenance mode and reporting power again.  [11:29]
iurygregory"when you speak to them unauthenticated" I'm not even sure if we can access redfish without the user/password ...11:44
<rpittau> that sounds really weird  [13:07]
<rpittau> opening a bug would probably be worth it  [13:07]
<TheJulia> I suspect Continuity needs to file a bug since the error should... should... invalidate the session which should nuke the cached client  [14:14]
<cardoe> My question would be which version of sushy.  [15:01]
<iurygregory> TheJulia, do you think it makes sense to add a retry in https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/redfish/boot.py#L718 ? This seems to be the place where things go sideways when doing a firmware update for the BMC: https://paste.opendev.org/show/bdf8ZjY9DXtJhzYGKTDm/  [15:02]
<TheJulia> cardoe: That is a *great* question, considering sushy, itself, also (if memory serves) has invalidation logic  [15:02]
<TheJulia> iurygregory: so.. let's just walk through the order of operations real quick  [15:04]
<iurygregory> sure =)  [15:04]
<TheJulia> so you do the firmware update, it completes, I guess the node is being moved back to some sort of deployed state?  [15:04]
<TheJulia> and prepare_ramdisk is getting called as a "just make sure we can boot the node"?  [15:04]
<iurygregory> I think that is the case  [15:05]
<iurygregory> and I'm using redfish-virtual-media  [15:05]
<iurygregory> not sure if the workflow would be different when using redfish-pxe  [15:06]
<TheJulia> It would be  [15:06]
<TheJulia> so... hmm  [15:06]
<iurygregory> during the time prepare_ramdisk was being called the BMC was still unresponsive, so Managers returned 400s  [15:07]
<cardoe> This is one of the areas where I've been burned as well.  [15:07]
<TheJulia> Would it be wrong to add a tenacity wrapper to prepare_ramdisk and add a "bmc not ready" exception?  [15:07]
<TheJulia> and have it match that exception as a reason to retry?  [15:07]
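
(Editor's note: the tenacity idea being discussed would look roughly like the sketch below. BMCNotReady, _insert_vmedia and the retry timings are all hypothetical, used only to illustrate the shape of the wrapper; this is not Ironic's actual implementation.)

    # Sketch: retry prepare_ramdisk while the BMC is still recovering from a
    # firmware update. Everything named here is illustrative.
    import tenacity
    from sushy import exceptions as sushy_exc


    class BMCNotReady(Exception):
        """The BMC responded, but is not yet able to serve requests."""


    def _insert_vmedia(task, ramdisk_params):
        # Placeholder for the real virtual-media attach logic; in Ironic this
        # is where sushy talks to the BMC and where the 400s were surfacing.
        pass


    @tenacity.retry(retry=tenacity.retry_if_exception_type(BMCNotReady),
                    wait=tenacity.wait_fixed(10),
                    stop=tenacity.stop_after_attempt(12),
                    reraise=True)
    def prepare_ramdisk(task, ramdisk_params):
        try:
            _insert_vmedia(task, ramdisk_params)
        except sushy_exc.BadRequestError as exc:
            # A 400 from Managers right after a firmware update usually means
            # the BMC has not finished restarting, so treat it as "not ready".
            raise BMCNotReady(str(exc))
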
<cardoe> Someone made a bug report about "id" missing and I think dtantsur and I +2'd something to not make it required.  [15:07]
<TheJulia> Well, id-wise it doesn't really matter, as someone has changed  [15:08]
<TheJulia> or proposed a change; the fundamental issue is that it doesn't get them out of this weird unhappy case  [15:08]
<iurygregory> TheJulia, hmm I think it does, normally I would get sushy.exceptions.BadRequestError directly  [15:09]
<TheJulia> Well, I am suggesting id explicitly might not matter in this case, because even if you somehow had an ID and it was valid, or it was not required, the BMC is not in a state where the method can execute successfully  [15:10]
<TheJulia> it's sort of a case where we likely just need to fail/retry... I think  [15:10]
<cardoe> yes  [15:11]
<iurygregory> I will add some retry logic in prepare_ramdisk now, and test in my bifrost just to see how it goes  [15:11]
<cardoe> So I think it goes back to the Task Manager (or whatever it's called). We need to read the Task of the update and track that to completion before trying to move to the next step.  [15:11]
<TheJulia> cardoe: That is sort of why I was thinking tenacity, since the prepare_ramdisk method gets called, but the task doesn't have any way of pausing where it is at and suspending a lock  [15:16]
<cardoe> Well, the issue is we move to the next part of the step as soon as the power cycles. We assume that's "done" when there's really a task that we don't track.  [15:17]
<TheJulia> indeed  [15:22]
<TheJulia> We're failing early on reading out the status, so at least we can sort of hold things there and retry  [15:22]
<TheJulia> we can't proceed past the point *until* we get a good/well-behaving BMC back  [15:22]
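
(Editor's note: tracking the update Task that cardoe mentions is plain Redfish: SimpleUpdate creates a Task resource under /redfish/v1/TaskService whose TaskState can be polled until it reaches a terminal state. A rough sketch with requests follows; the BMC address, credentials, task URI and timings are assumptions, not what Ironic ships.)

    # Sketch: poll a Redfish firmware-update Task to completion rather than
    # assuming the update is done as soon as the power cycles.
    import time
    import requests

    BMC = "https://bmc.example.com"                 # assumption
    AUTH = ("admin", "password")                    # assumption
    TASK_URI = "/redfish/v1/TaskService/Tasks/42"   # hypothetical task id
    TERMINAL = {"Completed", "Cancelled", "Exception", "Killed"}

    deadline = time.time() + 1800
    while time.time() < deadline:
        try:
            task = requests.get(BMC + TASK_URI, auth=AUTH,
                                verify=False, timeout=30).json()
        except requests.exceptions.ConnectionError:
            # BMCs commonly reboot while applying firmware; keep waiting.
            time.sleep(15)
            continue
        state = task.get("TaskState", "Running")
        if state in TERMINAL:
            print("firmware task finished:", state, task.get("TaskStatus"))
            break
        time.sleep(15)
    else:
        raise TimeoutError("firmware update task did not finish in time")
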
<TheJulia> ooohhh ahhhhhh tox -epep8 and py3 both pass on my current change  [15:27]
<iurygregory> re-running things, adding the node again and will request an update  [15:37]
<iurygregory> when we have a baremetal node in "service failed" we can only delete or try running another service step, right?  [15:37]
<iurygregory> or try rescue (but I didn't have a password set XD)  [15:38]
<TheJulia> iurygregory: you should be able to re-request service  [15:43]
<TheJulia> iurygregory: if you can't, then that is a bug  [15:43]
<iurygregory> TheJulia, yeah, but I would need to specify something (like a new firmware update or changing some bios settings), right?  [15:56]
<TheJulia> iurygregory: you could ask to hold the node and then unhold it  [15:57]
<iurygregory> oh ok!  [15:58]
<iurygregory> trying firmware update now with the code for retry... fingers crossed  [15:58]
<TheJulia> I guess we do need some sort of abort verb for a failure state to knock a node out of it  [15:59]
<TheJulia> which is, like... our longest-desired feature request  [15:59]
<iurygregory> yeah, totally agree  [16:03]
<opendevreview> Verification of a change to openstack/ironic master failed: Trivial deprecation fixes.  https://review.opendev.org/c/openstack/ironic/+/936411  [16:05]
<TheJulia> and truthfully, the pattern would be good to also apply to power on  [16:18]
<TheJulia> that would... address nobodycam's report  [16:18]
<iurygregory> Failed to set node power state to power on.  [16:20]
<iurygregory> TheJulia, did you just read my mind? I just got the error like 3min ago for power on XD  [16:20]
<TheJulia> iurygregory: I'm aware of the failure pattern :)  [16:36]
<iurygregory> ok, retry added in node_wait_for_power_state (fingers crossed)  [16:39]
<rpittau> good night! o/  [17:10]
<opendevreview> Julia Kreger proposed openstack/ironic master: WIP OCI container adjacent artifact support  https://review.opendev.org/c/openstack/ironic/+/937896  [17:10]
<opendevreview> Julia Kreger proposed openstack/ironic master: WIP - A very early wip of bootc deployment on the ironic side  https://review.opendev.org/c/openstack/ironic/+/937897  [17:10]
<cardoe> Don't we have an abort verb? Maybe make it generic to multiple states?  [17:16]
<TheJulia> we do have an abort verb  [17:16]
<TheJulia> .... it does, if memory serves, work in a few different states  [17:16]
<TheJulia> I guess if enabled, it might work, but likely need a warning sign next to it  [17:17]
<TheJulia> Something like a rabbit or a duck would have about hunting seasons  [17:17]
<mnasiadka> Good evening Ironic  [17:21]
<TheJulia> Good evening  [17:22]
<mnasiadka> I stumbled across a weird thing - bifrost is setting image_checksum to sha256 (well, checking it on the deployment_image file in /httpboot) and then IPA is complaining that the md5 checksum does not match the sha256 checksum... which is extremely weird, because I thought IPA would "find out" the hash algo based on checksum length? (stable/2024.1)  [17:23]
<mnasiadka> traceback: Dec 17 11:49:28 localhost.localdomain ironic-python-agent[1108]: 2024-12-17 11:49:28.253 1108 ERROR root ironic_python_agent.errors.ImageChecksumError: Error verifying image checksum: Image failed to verify against checksum. location: deployment_image.qcow2; image ID: /tmp/deployment_image.qcow2; image checksum: 75d9201066cfcfd4dbd49b78d140c4821c3bb3592a08e4dd45c6599ec15e28f0; verification checksum: 14274b4da1630a9e15dad60a9152fa03  [17:23]
<mnasiadka> Any ideas?  [17:24]
<TheJulia> ... That just seems bizarre  [17:26]
<TheJulia> is there an os_image_algo field as well?  [17:26]
<TheJulia> That takes precedence  [17:26]
<mnasiadka> in instance_info? No  [17:27]
<TheJulia> Honestly, that is the only thing I can think of, so it sounds like there is a problem somewhere or something is not parsing out properly  [17:27]
<mnasiadka> image_source, image_checksum, image_type, config_drive and image_url  [17:27]
<mnasiadka> and based on the docs I thought md5 is disabled by default?  [17:28]
<TheJulia> and the image_checksum value is 75d9201066cfcfd4dbd49b78d140c4821c3bb3592a08e4dd45c6599ec15e28f0 ?  [17:28]
<TheJulia> no, the docs pointed to "it was going to be" and I think we kept stalling on actually killing md5 support, or maybe it finally did  [17:29]
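
(Editor's note: the "find out the hash algo based on checksum length" behaviour mnasiadka expects boils down to mapping the hex digest length to an algorithm; the expected value in the traceback is 64 hex characters (sha256-sized) while the computed one is 32 (md5-sized). A minimal sketch of that kind of detection, not ironic-python-agent's actual code:)

    # Sketch: guess the checksum algorithm from the hex digest length and
    # verify a local file against it.
    import hashlib

    _LEN_TO_ALGO = {32: 'md5', 40: 'sha1', 64: 'sha256', 128: 'sha512'}


    def detect_algo(checksum):
        try:
            return _LEN_TO_ALGO[len(checksum)]
        except KeyError:
            raise ValueError('cannot guess algorithm for %r' % checksum)


    def verify(path, checksum):
        digest = hashlib.new(detect_algo(checksum))
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b''):
                digest.update(chunk)
        if digest.hexdigest() != checksum.lower():
            raise ValueError('checksum mismatch for %s' % path)
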
<TheJulia> Honestly, it is all a blur  [17:29]
<mnasiadka> well, I tried redeploying for the third time and now it worked...  [17:29]
<TheJulia> ... okay  [17:30]
<TheJulia> that sounds like some sort of weird bug  [17:30]
<mnasiadka> now I'm even more puzzled  [17:30]
<opendevreview> Verification of a change to openstack/ironic master failed: Trivial deprecation fixes.  https://review.opendev.org/c/openstack/ironic/+/936411  [17:36]
<JayF> TheJulia: maybe a ?force=true to the existing abort set-provision-state could "free" a stuck node, similar to nova reset-state  [17:36]
<JayF> allow someone to basically coerce the node into the state they believe it needs to be in, with a very, very locked-down access list  [17:36]
<TheJulia> I'd really dread a force flag  [17:38]
<TheJulia> because someone not understanding the state machine might then use it to try and make things happen, which we would then be on the hook to support, or to point them to the "that was a mistake" documentation  [17:39]
<JayF> I mean, that's the basic argument I've heard against stuff like nova's reset-state, but I know in practice, as an admin, having that exist has prevented me from needing to touch the DB in the past  [17:42]
<JayF> I am *not* saying I like the idea  [17:42]
<JayF> only that maybe this is a sign we should consider it, like it or not  [17:42]
<TheJulia> That is a valid point  [17:46]
<TheJulia> totally valid  [17:46]
<JayF> Honestly, the thing I worry more about  [17:47]
<JayF> is that right now, certain failures we have rigged to clean up on conductor restart  [17:47]
<JayF> which saves us from side effects of those failures  [17:47]
<JayF> since we're starting back from scratch  [17:47]
<TheJulia> Indeed  [17:47]
<JayF> if we gave an escape hatch that could be used to avoid that conductor restart, I wonder if there could be second-order bugs as an impact  [17:48]
<TheJulia> A really good example is Continuity's questions  [17:48]
<JayF> actually, maybe that's the model?  [17:48]
<JayF> The API call does what would happen on a conductor restart.  [17:48]
<TheJulia> sounds like the session to the BMC is *borked*, and that only gets otherwise cleared by restarting the conductor  [17:48]
<TheJulia> ... perhaps  [17:48]
<TheJulia> An admin call to do a full reset for a node  [17:48]
<JayF> only real question is if we'd allow people to go to an arbitrary state  [17:49]
<JayF> or force it into the old state/a failed state/etc.  [17:49]
* TheJulia goes outside with gloves and adhesive remover to attempt the same task from Saturday morning  [17:49]
<Continuity> Hey TheJulia, and everyone, we did an ironic conductor restart and the magic happened... so really not sure now what the problem was  [17:51]
<Continuity> Will keep an eye on it and grab more data if it happens again  [17:51]
<Continuity> Sorry, been in bed with a chest infection so not really reading IRC :D  [17:51]
<JayF> that's basically how it's supposed to work, I'm glad to hear it  [17:51]
<JayF> now go away from computers and rest :D  [17:51]
<Continuity> Aye aye  [17:52]
<JayF> no API call can reset state on your body, I've checked ;)  [17:52]
<Continuity> This is very, very true.  [17:52]
<opendevreview> Verification of a change to openstack/ironic master failed: Handle Power On/Off for child node cases  https://review.opendev.org/c/openstack/ironic/+/896570  [18:22]
-opendevstatus- NOTICE: Gerrit will be restarted to pick up a small configuration update. You may notice a short Gerrit outage.  [21:01]
<opendevreview> Jay Faulkner proposed openstack/ironic-python-agent master: Remove dependency on ironic-lib  https://review.opendev.org/c/openstack/ironic-python-agent/+/937743  [22:00]
<JayF> ^ should be ready for review now, will tag it once it gets a CI pass  [22:00]
-opendevstatus- NOTICE: You may have noticed the Gerrit restart was a bit bumpy. We have identified an issue with Gerrit caches that we'd like to address, which we think will make this better. This requires one more restart.  [22:13]
<JayF> gross  [22:39]
<JayF> I'm migrating an ironic_lib thing to ironic  [22:39]
<JayF> and it has an opt that is deprecated_group=DEFAULT  [22:39]
<JayF> group=ironic_lib  [22:39]
<JayF> we don't support >1 deprecated group in oslo_config  [22:40]
<JayF> if it's not old enough to remove the deprecated group, I might just swap 'em around  [22:40]
<iurygregory> that would make sense to me  [22:40]
<JayF> bwahaha  [22:41]
<JayF> it's only been deprecated for ... 9 years  [22:41]
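
(Editor's note: the oslo.config situation JayF describes looks roughly like the sketch below; the option name and the target group are made up. The deprecated_group argument accepts a single group, so carrying more than one old location would have to go through deprecated_opts with cfg.DeprecatedOpt entries instead.)

    # Sketch: an option that currently lives in [ironic_lib] with an old
    # [DEFAULT] alias, and one way it could be re-registered after the
    # migration. 'fake_opt' and 'disk_utils' are illustrative names.
    from oslo_config import cfg

    # As it sits in ironic_lib today.
    cfg.CONF.register_opt(
        cfg.StrOpt('fake_opt', deprecated_group='DEFAULT'),
        group='ironic_lib')

    # After migrating into ironic: drop the nine-years-deprecated [DEFAULT]
    # alias and treat [ironic_lib] itself as the deprecated location.
    cfg.CONF.register_opt(
        cfg.StrOpt('fake_opt', deprecated_group='ironic_lib'),
        group='disk_utils')
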
<JayF> ugh, I forgot we have a bunch of docs to migrate too, we will probably have to set up redirects  [23:44]
<JayF> I think that, sans docs, I might get a patch up by EOW that will remove it from ironic completely though, I'm about to push my WIP as I go home for the day  [23:47]
<iurygregory> hmm, adding retry to node_wait_for_power_state doesn't seem to help =( going to try moving the retry to https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/redfish/power.py#L107 <fingers crossed>  [23:50]
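
(Editor's note: a retry at the driver's power-state read, roughly as iurygregory describes trying, might look like the sketch below; the exception set and timings are assumptions, and get_power_state merely stands in for the redfish power interface code linked above.)

    # Sketch: retry reading the power state while the BMC is flaky, instead of
    # retrying the generic wait loop. Timings and exceptions are illustrative.
    import tenacity
    from sushy import exceptions as sushy_exc


    @tenacity.retry(
        retry=tenacity.retry_if_exception_type(
            (sushy_exc.ConnectionError, sushy_exc.BadRequestError)),
        wait=tenacity.wait_fixed(5),
        stop=tenacity.stop_after_delay(120),
        reraise=True)
    def get_power_state(system):
        # 'system' is a sushy System object; refresh() forces a new round trip
        # to the BMC instead of returning a cached PowerState value.
        system.refresh()
        return system.power_state
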
<opendevreview> Jay Faulkner proposed openstack/ironic master: WIP: Migrate ironic_lib to ironic  https://review.opendev.org/c/openstack/ironic/+/937948  [23:50]
