*** gryf is now known as Guest3340 | 05:36 | |
*** Guest3340 is now known as gryf | 05:49 | |
rpittau | good morning ironic! o/ | 07:52 |
opendevreview | Merged openstack/ironic master: docs: mention bug 1995078 https://review.opendev.org/c/openstack/ironic/+/937373 | 09:34 |
iurygregory | good morning ironic | 10:48 |
Continuity | Hey, Ironic people. I have an odd one for you today. We have some Lenovo machines which do not seem to report "Systems": { "@odata.id": "/redfish/v1/Systems"} via redfish when you speak to them unauthenticated. So we had some machines with the BMC disconnected for a couple of mins, long enough for them to drop into maint mode for power failure; now, however, they are unable to recover because when the _power_failure_recovery routine runs it hits an | 11:28 |
Continuity | exception that systems odata.id does not exist. And they get stuck in that loop forever. | 11:28 |
Continuity | Any ideas how to get them out of this loop, or how to resolve it? | 11:28 |
Continuity | This is using REDFISH | 11:28 |
Continuity | Happy to raise a bug if that's more helpful, but could do with getting these machines out of maintenance mode and reporting power again | 11:29 |
iurygregory | "when you speak to them unauthenticated" I'm not even sure if we can access redfish without the user/password ... | 11:44 |
rpittau | that sounds really weird | 13:07 |
rpittau | opening a bug would probably be worth it | 13:07 |
TheJulia | I suspect continuity needs to file a bug since the error should... should... invalidate the session which should nuke the cached client | 14:14 |
cardoe | My question would be which version of sushy. | 15:01 |
iurygregory | TheJulia, do you think it makes sense to add a retry in https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/redfish/boot.py#L718 ? This seems to be the place where things go sideways when doing a firmware update for the BMC https://paste.opendev.org/show/bdf8ZjY9DXtJhzYGKTDm/ | 15:02 |
TheJulia | cardoe: That is a *great* question, considering sushy, itself, also (if memory serves) has invalidation logic | 15:02 |
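For anyone hitting the same symptom, one quick way to confirm whether a BMC hides the Systems collection from anonymous callers is to compare the service root with and without credentials. This is a standalone diagnostic sketch, not ironic or sushy code; the BMC address and credentials are placeholders:

```python
import requests


def service_root_keys(base_url, auth=None):
    """Return the top-level members advertised at /redfish/v1."""
    resp = requests.get(f'{base_url}/redfish/v1', auth=auth,
                        verify=False, timeout=30)
    resp.raise_for_status()
    return set(resp.json())


# Placeholder address and credentials.
anon = service_root_keys('https://bmc.example.com')
authed = service_root_keys('https://bmc.example.com',
                           auth=('admin', 'password'))
print('hidden from anonymous callers:', sorted(authed - anon))
```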
TheJulia | iurygregory: so.. let's just walk through the order of operations real quick | 15:04 |
iurygregory | sure =) | 15:04 |
TheJulia | so you do the firmware update, it completes, I guess the node is being moved back to some sort of deployed state? | 15:04 |
TheJulia | and prepare_ramdisk is getting called as a "just make sure we can boot the node" ? | 15:04 |
iurygregory | I think that is the case | 15:05 |
iurygregory | and I'm using redfish-virtual-media | 15:05 |
iurygregory | not sure if the workflow would be different when using redfish-pxe | 15:06 |
TheJulia | It would be | 15:06 |
TheJulia | so... hmm | 15:06 |
iurygregory | during the time prepare_ramdisk was being called, the BMC was still unresponsive, so managers returned 400s | 15:07 |
cardoe | This is one of the areas I've been burned as well. | 15:07 |
TheJulia | Would it be wrong to add a tenacity wrapper to prepare_ramdisk and add a "bmc not ready" exception? | 15:07 |
TheJulia | and have it match that exception as a reason to retry? | 15:07 |
cardoe | Someone made a bug report about "id" missing and I think dtantsur and I +2'd something to not make it required. | 15:07 |
TheJulia | Well, id-wise it doesn't really matter, as someone has changed | 15:08 |
TheJulia | or proposed a change; the fundamental issue is that it doesn't get them out of this weird unhappy case | 15:08 |
iurygregory | TheJulia, humm I think it does, normally I would get sushy.exceptions.BadRequestError directly | 15:09 |
TheJulia | Well, I am suggesting id explicitly might not matter in this case, because even if you somehow had an ID and it was valid, or it was not required, the BMC is not in a state where the method can execute successfully | 15:10 |
TheJulia | it's sort of a case where we likely just need to fail/retry... I think | 15:10 |
cardoe | yes | 15:11 |
iurygregory | I will add some retry logic in prepare_ramdisk now, and test in my bifrost just to see how it goes | 15:11 |
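A minimal sketch of the tenacity wrapper being discussed might look like the following. The BMCNotReady exception, the timings, and the do_virtual_media_setup placeholder are assumptions for illustration, not ironic's actual code:

```python
import sushy.exceptions
import tenacity


class BMCNotReady(Exception):
    """Hypothetical marker exception: the BMC has not finished rebooting."""


def do_virtual_media_setup(task, ramdisk_params):
    """Placeholder for the real virtual media attach + boot device logic."""


@tenacity.retry(
    retry=tenacity.retry_if_exception_type(BMCNotReady),
    wait=tenacity.wait_fixed(10),          # pause between attempts
    stop=tenacity.stop_after_attempt(18),  # give up after roughly 3 minutes
    reraise=True,
)
def prepare_ramdisk(task, ramdisk_params):
    try:
        do_virtual_media_setup(task, ramdisk_params)
    except sushy.exceptions.BadRequestError as exc:
        # The BMC is still coming back from the firmware update; signal a
        # retry instead of failing the step outright.
        raise BMCNotReady(str(exc)) from exc
```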
cardoe | So I think it goes back to the Task Manager (or whatever its called). We need to read the Task of the update and track that to completion before trying to move to the next step. | 15:11 |
TheJulia | cardoe: That is sort of why I was thinking tenacity, since the prepare_ramdisk method gets called, but the task doesn't have any way of pausing where it is at and suspending a lock | 15:16 |
cardoe | Well the issue is we move to the next part of the step as soon as the power cycles. We assume that's "done" when there's really a task that we don't track. | 15:17 |
TheJulia | indeed | 15:22 |
TheJulia | We're failing early on to read out the status, so at least we can sort of hold things there and retry | 15:22 |
TheJulia | we can't proceed past the point *until* we get a good/well behaving bmc back | 15:22 |
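What cardoe describes, tracking the Redfish update Task to completion instead of assuming the power cycle means "done", could be sketched roughly like this. The helper is hypothetical and the state names come from the Redfish TaskState schema; this is not the code ironic ships:

```python
import time

# Terminal Redfish TaskState values.
TERMINAL_STATES = {'Completed', 'Exception', 'Cancelled', 'Killed'}


def wait_for_update_task(get_task_state, interval=15, timeout=1800):
    """Poll a firmware-update task until it reaches a terminal state.

    get_task_state is a caller-supplied callable that reads the Redfish
    Task resource; it may raise while the BMC is rebooting, which is
    treated as "not done yet" rather than as a failure.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            state = get_task_state()
        except Exception:
            # BMC unreachable mid-reboot: keep waiting instead of moving
            # on to the next step with a half-ready BMC.
            state = None
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
    raise TimeoutError('firmware update task did not finish in time')
```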
TheJulia | ooohhh ahhhhhh tox -epep8 and py3 both pass on my current change | 15:27 |
iurygregory | re-running things, adding the node again and will request an update | 15:37 |
iurygregory | when we have a baremetal node in service failed, we can only delete or try running another service step, right? | 15:37 |
iurygregory | or try rescue (but I didn't have a password set XD) | 15:38 |
TheJulia | iurygregory: you should be able to re-request service | 15:43 |
TheJulia | iurygregory: if you can't, then that is a bug | 15:43 |
iurygregory | TheJulia, yeah, but I would need to specify something (like a new firmware update or change some bios settings) right? | 15:56 |
TheJulia | iurygregory: you could ask to hold the node and then unhold it | 15:57 |
iurygregory | oh ok! | 15:58 |
iurygregory | trying firmware update now with the code for retry... fingers crossed | 15:58 |
TheJulia | I guess we do need some sort of abort verb for a failure state to knock a node out of it | 15:59 |
TheJulia | which is like... our longest desired feature request | 15:59 |
iurygregory | yeah totally agree | 16:03 |
opendevreview | Verification of a change to openstack/ironic master failed: Trivial deprecation fixes. https://review.opendev.org/c/openstack/ironic/+/936411 | 16:05 |
TheJulia | and truthfully, the pattern would be good to also apply to power on | 16:18 |
TheJulia | that would... address nobodycam's report | 16:18 |
iurygregory | Failed to set node power state to power on. | 16:20 |
iurygregory | TheJulia, did you just read my mind? I just got the error like 3min ago for power on XD | 16:20 |
TheJulia | iurygregory: I'm aware of the failure pattern :) | 16:36 |
iurygregory | ok, retry added in node_wait_for_power_state (fingers crossed) | 16:39 |
rpittau | good night! o/ | 17:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP OCI container adjacent artifact support https://review.opendev.org/c/openstack/ironic/+/937896 | 17:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP - A very early wip of bootc deployment on the ironic side https://review.opendev.org/c/openstack/ironic/+/937897 | 17:10 |
cardoe | Don’t we have an abort verb? Maybe make it generic to multiple states? | 17:16 |
TheJulia | we do have an abort verb | 17:16 |
TheJulia | .... it does, if memory serves, work in a few different states | 17:16 |
TheJulia | I guess if enabled, it might work, but it would likely need a warning sign next to it | 17:17 |
TheJulia | Something like the one a rabbit or a duck would have about hunting seasons | 17:17 |
mnasiadka | Good evening Ironic | 17:21 |
TheJulia | Good evening | 17:22 |
mnasiadka | I stumbled across a weird thing - bifrost is setting image_checksum to sha256 (well, checking it against the deployment_image file in /httpboot) and then IPA is complaining that the md5 checksum does not match the sha256 checksum... which is extremely weird because I thought IPA will "find out" the hash algo based on checksum length? (stable/2024.1) | 17:23 |
mnasiadka | traceback: Dec 17 11:49:28 localhost.localdomain ironic-python-agent[1108]: 2024-12-17 11:49:28.253 1108 ERROR root ironic_python_agent.errors.ImageChecksumError: Error verifying image checksum: Image failed to verify against checksum. location: deployment_image.qcow2; image ID: /tmp/deployment_image.qcow2; image checksum: 75d9201066cfcfd4dbd49b78d140c4821c3bb3592a08e4dd45c6599ec15e28f0; verification checksum: | 17:23 |
mnasiadka | 14274b4da1630a9e15dad60a9152fa03 | 17:23 |
mnasiadka | Any ideas? | 17:24 |
TheJulia | ... That just seems bizarre | 17:26 |
TheJulia | is there an os_image_algo field as well | 17:26 |
TheJulia | That takes precedence | 17:26 |
mnasiadka | in instance_info? No | 17:27 |
TheJulia | Honestly, that is the only thing I can think of, so it sounds like there is a problem somewhere or something is not parsing out properly | 17:27 |
mnasiadka | image_source, image_checksum, image_type, config_drive and image_url | 17:27 |
mnasiadka | and based on the docs I thought md5 is disabled by default? | 17:28 |
TheJulia | and the image_checksum value is 75d9201066cfcfd4dbd49b78d140c4821c3bb3592a08e4dd45c6599ec15e28f0 | 17:28 |
TheJulia | no, the docs pointed to "it was going to be" and I think we kept stalling on actually killing md5 support, or maybe it finally did happen | 17:29 |
TheJulia | Honestly, it is all a blur | 17:29 |
mnasiadka | well, I tried redeploying for the third time and now it worked... | 17:29 |
TheJulia | ... okay | 17:30 |
TheJulia | that sounds like some sort of weird bug | 17:30 |
mnasiadka | now I'm even more puzzled | 17:30 |
opendevreview | Verification of a change to openstack/ironic master failed: Trivial deprecation fixes. https://review.opendev.org/c/openstack/ironic/+/936411 | 17:36 |
JayF | TheJulia: maybe a ?force=true to the existing abort set-provision-state could "free" a stuck node, similar to nova reset-state | 17:36 |
JayF | allow someone to basically coerce the node into the state they believe it needs to be in, with a very, very locked down access list | 17:36 |
TheJulia | I'd really dread a force flag | 17:38 |
TheJulia | because someone not understanding the state machine might then use it to try and make things happen, which we would then be on the hook to support or point them to the "that was a mistake" documentation | 17:39 |
JayF | I mean, that's the basic argument I've heard against stuff like nova's reset-state, but I know in practice as an admin, having that exist has prevented me from needing to touch the DB in the past | 17:42 |
JayF | I am *not* saying I like the idea | 17:42 |
JayF | only that maybe this is a sign we should consider it, like it or not | 17:42 |
TheJulia | That is a valid point | 17:46 |
TheJulia | totally valid | 17:46 |
JayF | Honestly the thing I worry more about | 17:47 |
JayF | is that right now, certain failures we have rigged to clean up on conductor restart | 17:47 |
JayF | which saves us from side effects of those failures | 17:47 |
JayF | since we're starting back from scratch | 17:47 |
TheJulia | Indeed | 17:47 |
JayF | if we gave an escape hatch that could be used to avoid that conductor restart, I wonder if there could be second-order bugs as an impact | 17:48 |
TheJulia | A really good example is Continuity's questions | 17:48 |
JayF | actually, maybe that's the model? | 17:48 |
JayF | The API call does what would happen on a conductor restart. | 17:48 |
TheJulia | sounds like the session to the bmc is *borked*, and that otherwise only gets cleared by restarting the conductor | 17:48 |
TheJulia | ... perhaps | 17:48 |
TheJulia | An admin call to do a full reset for a node | 17:48 |
JayF | only real question is if we'd allow people to go to an arbitrary state | 17:49 |
JayF | or force it into the old state/a failed state/etc | 17:49 |
* TheJulia goes outside with gloves and adhesive remover to attempt the same task from Saturday morning | 17:49 | |
Continuity | Hey TheJulia, and everyone, we did an ironic conductor restart and the magic happened... so really not sure now what the problem was | 17:51 |
Continuity | Will keep an eye on it and grab more data if it happens again | 17:51 |
Continuity | Sorry, been in bed with a chest infection so not really reading IRC :D | 17:51 |
JayF | that's basically how it's supposed to work, I'm glad to hear | 17:51 |
JayF | now go away from computers and rest :D | 17:51 |
Continuity | Aye aye | 17:52 |
JayF | no api call can reset state on your body, I've checked ;) | 17:52 |
Continuity | This is very very true. | 17:52 |
opendevreview | Verification of a change to openstack/ironic master failed: Handle Power On/Off for child node cases https://review.opendev.org/c/openstack/ironic/+/896570 | 18:22 |
-opendevstatus- NOTICE: Gerrit will be restarted to pick up a small configuration update. You may notice a short Gerrit outage. | 21:01 | |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: Remove dependency on ironic-lib https://review.opendev.org/c/openstack/ironic-python-agent/+/937743 | 22:00 |
JayF | ^ should be ready for review now, will tag it once it gets a ci pass | 22:00 |
-opendevstatus- NOTICE: You may have noticed the Gerrit restart was a bit bumpy. We have identified an issue with Gerrit caches that we'd like to address which we think will make this better. This requires one more restart | 22:13 | |
JayF | gross | 22:39 |
JayF | I'm migrating an ironic_lib thing to ironic | 22:39 |
JayF | and it has an opt that is deprecated_group=DEFAULT | 22:39 |
JayF | group=ironic_lib | 22:39 |
JayF | we don't support >1 deprecated group in oslo_config | 22:40 |
JayF | if it's not old enough to remove the deprecated group, I might just swap 'em around | 22:40 |
iurygregory | that would make sense to me | 22:40 |
JayF | bwahaha | 22:41 |
JayF | it's only been deprecated for ... 9 years | 22:41 |
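For context, the shape of the option being discussed is roughly the following; the option and group names here are placeholders, not the actual ironic_lib option. The deprecated_group argument accepts a single group, which is the limitation JayF mentions, so swapping the old [DEFAULT] alias out for [ironic_lib] is one way through:

```python
from oslo_config import cfg

CONF = cfg.CONF

# Illustrative only -- 'example_opt' and 'example_group' are placeholders.
opts = [
    cfg.StrOpt('example_opt',
               default='value',
               # Only one deprecated group fits here; after the migration
               # the old [ironic_lib] location takes that slot and the
               # 9-year-old [DEFAULT] alias is finally dropped.
               deprecated_group='ironic_lib',
               help='Placeholder option migrated out of ironic_lib.'),
]

CONF.register_opts(opts, group='example_group')
```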
JayF | ugh I forgot we have a bunch of docs to migrate too, we probably will have to set up redirects | 23:44 |
JayF | I think that sans-docs I might get a patch up by EOW that will remove it from ironic completely though, I'm about to push my WIP as I go home for the day | 23:47 |
iurygregory | humm adding retry to node_wait_for_power_state doesn't seem to help =( going to try moving the retry to https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/redfish/power.py#L107 <fingers crossed> | 23:50 |
opendevreview | Jay Faulkner proposed openstack/ironic master: WIP: Migrate ironic_lib to ironic https://review.opendev.org/c/openstack/ironic/+/937948 | 23:50 |