Tuesday, 2024-12-17

*** gryf is now known as Guest3340  [05:36]
*** Guest3340 is now known as gryf  [05:49]
<rpittau> good morning ironic! o/  [07:52]
<opendevreview> Merged openstack/ironic master: docs: mention bug 1995078  https://review.opendev.org/c/openstack/ironic/+/937373  [09:34]
<iurygregory> good morning ironic  [10:48]
<Continuity> Hey, Ironic people. I have an odd one for you today. We have some Lenovo machines which do not seem to report "Systems": {"@odata.id": "/redfish/v1/Systems"} via redfish when you speak to them unauthenticated. So we had some machines' BMCs disconnected for a couple of mins, long enough for them to drop into maintenance mode for power failure; now, however, they are unable to recover, because when the _power_failure_recovery routine runs it hits an exception that the Systems odata.id does not exist. And they get stuck in that loop forever.  [11:28]
<Continuity> Any ideas how to get them out of this loop, or how to resolve it?  [11:28]
<Continuity> This is using redfish.  [11:28]
<Continuity> Happy to raise a bug if that's more helpful, but could do with getting these machines out of maintenance mode and reporting power again.  [11:29]
iurygregory"when you speak to them unauthenticated" I'm not even sure if we can access redfish without the user/password ...11:44
<rpittau> that sounds really weird  [13:07]
<rpittau> opening a bug would probably be worth it  [13:07]
<TheJulia> I suspect Continuity needs to file a bug since the error should... should... invalidate the session which should nuke the cached client  [14:14]
<cardoe> My question would be which version of sushy.  [15:01]
<iurygregory> TheJulia, do you think it makes sense to add a retry in https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/redfish/boot.py#L718 ? This seems to be the place where things go sideways when doing a firmware update for the BMC: https://paste.opendev.org/show/bdf8ZjY9DXtJhzYGKTDm/  [15:02]
<TheJulia> cardoe: That is a *great* question, considering sushy, itself, also (if memory serves) has invalidation logic  [15:02]
<TheJulia> iurygregory: so.. let's just walk through the order of operations real quick  [15:04]
<iurygregory> sure =)  [15:04]
<TheJulia> so you do the firmware update, it completes, I guess the node is being moved back to some sort of deployed state?  [15:04]
<TheJulia> and prepare_ramdisk is getting called as a "just make sure we can boot the node"?  [15:04]
<iurygregory> I think that is the case  [15:05]
<iurygregory> and I'm using redfish-virtual-media  [15:05]
<iurygregory> not sure if the workflow would be different when using redfish-pxe  [15:06]
<TheJulia> It would be  [15:06]
<TheJulia> so... hmm  [15:06]
<iurygregory> during the time prepare_ramdisk was being called the BMC was still unresponsive, so Managers returned 400s  [15:07]
<cardoe> This is one of the areas where I've been burned as well.  [15:07]
<TheJulia> Would it be wrong to add a tenacity wrapper to prepare_ramdisk and add a "bmc not ready" exception?  [15:07]
<TheJulia> and have it match that exception as a reason to retry?  [15:07]
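
(Editor's note: the tenacity idea being discussed would look roughly like the sketch below. BMCNotReady, _insert_vmedia and the retry timings are all hypothetical, used only to illustrate the shape of the wrapper; this is not Ironic's actual implementation.)

    # Sketch: retry prepare_ramdisk while the BMC is still recovering from a
    # firmware update. Everything named here is illustrative.
    import tenacity
    from sushy import exceptions as sushy_exc


    class BMCNotReady(Exception):
        """The BMC responded, but is not yet able to serve requests."""


    def _insert_vmedia(task, ramdisk_params):
        # Placeholder for the real virtual-media attach logic; in Ironic this
        # is where sushy talks to the BMC and where the 400s were surfacing.
        pass


    @tenacity.retry(retry=tenacity.retry_if_exception_type(BMCNotReady),
                    wait=tenacity.wait_fixed(10),
                    stop=tenacity.stop_after_attempt(12),
                    reraise=True)
    def prepare_ramdisk(task, ramdisk_params):
        try:
            _insert_vmedia(task, ramdisk_params)
        except sushy_exc.BadRequestError as exc:
            # A 400 from Managers right after a firmware update usually means
            # the BMC has not finished restarting, so treat it as "not ready".
            raise BMCNotReady(str(exc))
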
<cardoe> Someone made a bug report about "id" missing and I think dtantsur and I +2'd something to not make it required.  [15:07]
<TheJulia> Well, id-wise it doesn't really matter, as someone has changed  [15:08]
<TheJulia> or proposed a change; the fundamental issue is that it doesn't get them out of this weird unhappy case  [15:08]
<iurygregory> TheJulia, hmm I think it does, normally I would get sushy.exceptions.BadRequestError directly  [15:09]
<TheJulia> Well, I am suggesting id explicitly might not matter in this case, because even if you somehow had an ID and it was valid, or it was not required, the BMC is not in a state where the method can execute successfully  [15:10]
<TheJulia> it's sort of a case where we likely just need to fail/retry... I think  [15:10]
<cardoe> yes  [15:11]
<iurygregory> I will add some retry logic in prepare_ramdisk now, and test in my bifrost just to see how it goes  [15:11]
<cardoe> So I think it goes back to the Task Manager (or whatever it's called). We need to read the Task of the update and track that to completion before trying to move to the next step.  [15:11]
<TheJulia> cardoe: That is sort of why I was thinking tenacity, since the prepare_ramdisk method gets called, but the task doesn't have any way of pausing where it is at and suspending a lock  [15:16]
<cardoe> Well, the issue is we move to the next part of the step as soon as the power cycles. We assume that's "done" when there's really a task that we don't track.  [15:17]
<TheJulia> indeed  [15:22]
<TheJulia> We're failing early on reading out the status, so at least we can sort of hold things there and retry  [15:22]
<TheJulia> we can't proceed past the point *until* we get a good/well-behaving BMC back  [15:22]
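
(Editor's note: tracking the update Task that cardoe mentions is plain Redfish: SimpleUpdate creates a Task resource under /redfish/v1/TaskService whose TaskState can be polled until it reaches a terminal state. A rough sketch with requests follows; the BMC address, credentials, task URI and timings are assumptions, not what Ironic ships.)

    # Sketch: poll a Redfish firmware-update Task to completion rather than
    # assuming the update is done as soon as the power cycles.
    import time
    import requests

    BMC = "https://bmc.example.com"                 # assumption
    AUTH = ("admin", "password")                    # assumption
    TASK_URI = "/redfish/v1/TaskService/Tasks/42"   # hypothetical task id
    TERMINAL = {"Completed", "Cancelled", "Exception", "Killed"}

    deadline = time.time() + 1800
    while time.time() < deadline:
        try:
            task = requests.get(BMC + TASK_URI, auth=AUTH,
                                verify=False, timeout=30).json()
        except requests.exceptions.ConnectionError:
            # BMCs commonly reboot while applying firmware; keep waiting.
            time.sleep(15)
            continue
        state = task.get("TaskState", "Running")
        if state in TERMINAL:
            print("firmware task finished:", state, task.get("TaskStatus"))
            break
        time.sleep(15)
    else:
        raise TimeoutError("firmware update task did not finish in time")
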
<TheJulia> ooohhh ahhhhhh tox -epep8 and py3 both pass on my current change  [15:27]
<iurygregory> re-running things, adding the node again and will request an update  [15:37]
<iurygregory> when we have a baremetal node in "service failed" we can only delete or try running another service step, right?  [15:37]
<iurygregory> or try rescue (but I didn't have a password set XD)  [15:38]
<TheJulia> iurygregory: you should be able to re-request service  [15:43]
<TheJulia> iurygregory: if you can't, then that is a bug  [15:43]
<iurygregory> TheJulia, yeah, but I would need to specify something (like a new firmware update or changing some bios settings), right?  [15:56]
<TheJulia> iurygregory: you could ask to hold the node and then unhold it  [15:57]
<iurygregory> oh ok!  [15:58]
<iurygregory> trying firmware update now with the code for retry... fingers crossed  [15:58]
<TheJulia> I guess we do need some sort of abort verb for a failure state to knock a node out of it  [15:59]
<TheJulia> which is, like... our longest-desired feature request  [15:59]
<iurygregory> yeah, totally agree  [16:03]
<opendevreview> Verification of a change to openstack/ironic master failed: Trivial deprecation fixes.  https://review.opendev.org/c/openstack/ironic/+/936411  [16:05]
<TheJulia> and truthfully, the pattern would be good to also apply to power on  [16:18]
<TheJulia> that would... address nobodycam's report  [16:18]
<iurygregory> Failed to set node power state to power on.  [16:20]
<iurygregory> TheJulia, did you just read my mind? I just got the error like 3min ago for power on XD  [16:20]
<TheJulia> iurygregory: I'm aware of the failure pattern :)  [16:36]
<iurygregory> ok, retry added in node_wait_for_power_state (fingers crossed)  [16:39]
<rpittau> good night! o/  [17:10]
<opendevreview> Julia Kreger proposed openstack/ironic master: WIP OCI container adjacent artifact support  https://review.opendev.org/c/openstack/ironic/+/937896  [17:10]
<opendevreview> Julia Kreger proposed openstack/ironic master: WIP - A very early wip of bootc deployment on the ironic side  https://review.opendev.org/c/openstack/ironic/+/937897  [17:10]
<cardoe> Don't we have an abort verb? Maybe make it generic to multiple states?  [17:16]
<TheJulia> we do have an abort verb  [17:16]
<TheJulia> .... it does, if memory serves, work in a few different states  [17:16]
<TheJulia> I guess if enabled, it might work, but likely need a warning sign next to it  [17:17]
<TheJulia> Something like a rabbit or a duck would have about hunting seasons  [17:17]
<mnasiadka> Good evening Ironic  [17:21]
<TheJulia> Good evening  [17:22]
<mnasiadka> I stumbled across a weird thing - bifrost is setting image_checksum to sha256 (well, checking it on the deployment_image file in /httpboot) and then IPA is complaining that the md5 checksum does not match the sha256 checksum... which is extremely weird, because I thought IPA would "find out" the hash algo based on checksum length? (stable/2024.1)  [17:23]
<mnasiadka> traceback: Dec 17 11:49:28 localhost.localdomain ironic-python-agent[1108]: 2024-12-17 11:49:28.253 1108 ERROR root ironic_python_agent.errors.ImageChecksumError: Error verifying image checksum: Image failed to verify against checksum. location: deployment_image.qcow2; image ID: /tmp/deployment_image.qcow2; image checksum: 75d9201066cfcfd4dbd49b78d140c4821c3bb3592a08e4dd45c6599ec15e28f0; verification checksum: 14274b4da1630a9e15dad60a9152fa03  [17:23]
<mnasiadka> Any ideas?  [17:24]
<TheJulia> ... That just seems bizarre  [17:26]
<TheJulia> is there an os_image_algo field as well?  [17:26]
<TheJulia> That takes precedence  [17:26]
<mnasiadka> in instance_info? No  [17:27]
<TheJulia> Honestly, that is the only thing I can think of, so it sounds like there is a problem somewhere or something is not parsing out properly  [17:27]
<mnasiadka> image_source, image_checksum, image_type, config_drive and image_url  [17:27]
<mnasiadka> and based on the docs I thought md5 is disabled by default?  [17:28]
<TheJulia> and the image_checksum value is 75d9201066cfcfd4dbd49b78d140c4821c3bb3592a08e4dd45c6599ec15e28f0 ?  [17:28]
<TheJulia> no, the docs pointed to "it was going to be" and I think we kept stalling on actually killing md5 support, or maybe it finally did  [17:29]
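
(Editor's note: the "find out the hash algo based on checksum length" behaviour mnasiadka expects boils down to mapping the hex digest length to an algorithm; the expected value in the traceback is 64 hex characters (sha256-sized) while the computed one is 32 (md5-sized). A minimal sketch of that kind of detection, not ironic-python-agent's actual code:)

    # Sketch: guess the checksum algorithm from the hex digest length and
    # verify a local file against it.
    import hashlib

    _LEN_TO_ALGO = {32: 'md5', 40: 'sha1', 64: 'sha256', 128: 'sha512'}


    def detect_algo(checksum):
        try:
            return _LEN_TO_ALGO[len(checksum)]
        except KeyError:
            raise ValueError('cannot guess algorithm for %r' % checksum)


    def verify(path, checksum):
        digest = hashlib.new(detect_algo(checksum))
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b''):
                digest.update(chunk)
        if digest.hexdigest() != checksum.lower():
            raise ValueError('checksum mismatch for %s' % path)
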
<TheJulia> Honestly, it is all a blur  [17:29]
<mnasiadka> well, I tried redeploying for the third time and now it worked...  [17:29]
<TheJulia> ... okay  [17:30]
<TheJulia> that sounds like some sort of weird bug  [17:30]
<mnasiadka> now I'm even more puzzled  [17:30]
<opendevreview> Verification of a change to openstack/ironic master failed: Trivial deprecation fixes.  https://review.opendev.org/c/openstack/ironic/+/936411  [17:36]
<JayF> TheJulia: maybe a ?force=true to the existing abort set-provision-state could "free" a stuck node, similar to nova reset-state  [17:36]
<JayF> allow someone to basically coerce the node into the state they believe it needs to be in, with a very, very locked-down access list  [17:36]
<TheJulia> I'd really dread a force flag  [17:38]
<TheJulia> because someone not understanding the state machine might then use it to try and make things happen, which we would then be on the hook to support, or to point them to the "that was a mistake" documentation  [17:39]
<JayF> I mean, that's the basic argument I've heard against stuff like nova's reset-state, but I know in practice, as an admin, having that exist has prevented me from needing to touch the DB in the past  [17:42]
<JayF> I am *not* saying I like the idea  [17:42]
<JayF> only that maybe this is a sign we should consider it, like it or not  [17:42]
<TheJulia> That is a valid point  [17:46]
<TheJulia> totally valid  [17:46]
<JayF> Honestly, the thing I worry more about  [17:47]
<JayF> is that right now, certain failures we have rigged to clean up on conductor restart  [17:47]
<JayF> which saves us from side effects of those failures  [17:47]
<JayF> since we're starting back from scratch  [17:47]
<TheJulia> Indeed  [17:47]
<JayF> if we gave an escape hatch that could be used to avoid that conductor restart, I wonder if there could be second-order bugs as an impact  [17:48]
<TheJulia> A really good example is Continuity's questions  [17:48]
<JayF> actually, maybe that's the model?  [17:48]
<JayF> The API call does what would happen on a conductor restart.  [17:48]
<TheJulia> sounds like the session to the BMC is *borked*, and that only gets otherwise cleared by restarting the conductor  [17:48]
<TheJulia> ... perhaps  [17:48]
<TheJulia> An admin call to do a full reset for a node  [17:48]
<JayF> only real question is if we'd allow people to go to an arbitrary state  [17:49]
<JayF> or force it into the old state/a failed state/etc.  [17:49]
* TheJulia goes outside with gloves and adhesive remover to attempt the same task from Saturday morning  [17:49]
<Continuity> Hey TheJulia, and everyone, we did an ironic conductor restart and the magic happened... so really not sure now what the problem was  [17:51]
<Continuity> Will keep an eye on it and grab more data if it happens again  [17:51]
<Continuity> Sorry, been in bed with a chest infection so not really reading IRC :D  [17:51]
<JayF> that's basically how it's supposed to work, I'm glad to hear it  [17:51]
<JayF> now go away from computers and rest :D  [17:51]
<Continuity> Aye aye  [17:52]
<JayF> no API call can reset state on your body, I've checked ;)  [17:52]
<Continuity> This is very, very true.  [17:52]
<opendevreview> Verification of a change to openstack/ironic master failed: Handle Power On/Off for child node cases  https://review.opendev.org/c/openstack/ironic/+/896570  [18:22]
-opendevstatus- NOTICE: Gerrit will be restarted to pick up a small configuration update. You may notice a short Gerrit outage.  [21:01]
<opendevreview> Jay Faulkner proposed openstack/ironic-python-agent master: Remove dependency on ironic-lib  https://review.opendev.org/c/openstack/ironic-python-agent/+/937743  [22:00]
<JayF> ^ should be ready for review now, will tag it once it gets a CI pass  [22:00]
-opendevstatus- NOTICE: You may have noticed the Gerrit restart was a bit bumpy. We have identified an issue with Gerrit caches that we'd like to address, which we think will make this better. This requires one more restart.  [22:13]
<JayF> gross  [22:39]
<JayF> I'm migrating an ironic_lib thing to ironic  [22:39]
<JayF> and it has an opt that is deprecated_group=DEFAULT  [22:39]
<JayF> group=ironic_lib  [22:39]
<JayF> we don't support >1 deprecated group in oslo_config  [22:40]
<JayF> if it's not old enough to remove the deprecated group, I might just swap 'em around  [22:40]
<iurygregory> that would make sense to me  [22:40]
<JayF> bwahaha  [22:41]
<JayF> it's only been deprecated for ... 9 years  [22:41]
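
(Editor's note: the oslo.config situation JayF describes looks roughly like the sketch below; the option name and the target group are made up. The deprecated_group argument accepts a single group, so carrying more than one old location would have to go through deprecated_opts with cfg.DeprecatedOpt entries instead.)

    # Sketch: an option that currently lives in [ironic_lib] with an old
    # [DEFAULT] alias, and one way it could be re-registered after the
    # migration. 'fake_opt' and 'disk_utils' are illustrative names.
    from oslo_config import cfg

    # As it sits in ironic_lib today.
    cfg.CONF.register_opt(
        cfg.StrOpt('fake_opt', deprecated_group='DEFAULT'),
        group='ironic_lib')

    # After migrating into ironic: drop the nine-years-deprecated [DEFAULT]
    # alias and treat [ironic_lib] itself as the deprecated location.
    cfg.CONF.register_opt(
        cfg.StrOpt('fake_opt', deprecated_group='ironic_lib'),
        group='disk_utils')
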
<JayF> ugh, I forgot we have a bunch of docs to migrate too, we will probably have to set up redirects  [23:44]
<JayF> I think that, sans docs, I might get a patch up by EOW that will remove it from ironic completely though, I'm about to push my WIP as I go home for the day  [23:47]
<iurygregory> hmm, adding retry to node_wait_for_power_state doesn't seem to help =( going to try moving the retry to https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/redfish/power.py#L107 <fingers crossed>  [23:50]
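
(Editor's note: a retry at the driver's power-state read, roughly as iurygregory describes trying, might look like the sketch below; the exception set and timings are assumptions, and get_power_state merely stands in for the redfish power interface code linked above.)

    # Sketch: retry reading the power state while the BMC is flaky, instead of
    # retrying the generic wait loop. Timings and exceptions are illustrative.
    import tenacity
    from sushy import exceptions as sushy_exc


    @tenacity.retry(
        retry=tenacity.retry_if_exception_type(
            (sushy_exc.ConnectionError, sushy_exc.BadRequestError)),
        wait=tenacity.wait_fixed(5),
        stop=tenacity.stop_after_delay(120),
        reraise=True)
    def get_power_state(system):
        # 'system' is a sushy System object; refresh() forces a new round trip
        # to the BMC instead of returning a cached PowerState value.
        system.refresh()
        return system.power_state
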
<opendevreview> Jay Faulkner proposed openstack/ironic master: WIP: Migrate ironic_lib to ironic  https://review.opendev.org/c/openstack/ironic/+/937948  [23:50]
