Thursday, 2023-04-13

iurygregoryTheJulia, tks for the review in the spec, I just answered some of your questions there o/01:58
iurygregorytomorrow I will be looking at the open specs we have for bobcat01:58
TheJuliaCool cool02:05
TheJuliaAs long as we don’t end up in that whole strict-tie-to-a-cycle process-control nightmare.02:08
iurygregoryyeah02:09
rpittaugood morning ironic! o/06:37
kaloyankmorning ironic o/06:57
kubajjgood morning everyone09:23
dtantsurmorning kubajj! how are your studies going?09:42
kubajjdtantsur: it's not that bad. Exams are slowly approaching though, so I'm basically spending 9-19 at the library. Slightly more than a month to go and then I'm free. 🥹09:45
dtantsurgreat, good luck!09:47
iurygregorymorning Ironic11:35
opendevreviewMaksim Malchuk proposed openstack/bifrost master: [DNM] test linters  https://review.opendev.org/c/openstack/bifrost/+/88016312:32
TheJuliagood morning12:33
opendevreviewMaksim Malchuk proposed openstack/bifrost master: [DNM] test linters  https://review.opendev.org/c/openstack/bifrost/+/88016312:33
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Fix ansible-lint  https://review.opendev.org/c/openstack/bifrost/+/88016312:34
Sandzwerg[m]TheJulia: Turns out having nodes stuck in deleting is not that hard if ironic has issues reaching the node via IPMI :) 12:57
TheJuliaSandzwerg[m]: intermittently?12:59
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Fix ansible-lint  https://review.opendev.org/c/openstack/bifrost/+/88016313:00
Sandzwerg[m]<TheJulia> "Sandzwerg: intermittently?" <- More or less. The node was put into maintenance because ironic fails the power sync even after some retries. Probably something on the node side is broken13:10
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Fix ansible-lint  https://review.opendev.org/c/openstack/bifrost/+/88016313:44
TheJuliaSandzwerg[m]: do you have another system talking to the BMC via ipmi?13:47
TheJuliaSandzwerg[m]: are you using power sync? the default is yes :)13:48
TheJuliaSandzwerg[m]: Also, have any timers been changed from the defaults?13:48
Sandzwerg[m]We have a vendor console (LXCA/OpenManage/etc.), not sure if that uses IPMI or something else. Monitoring is done via SNMP, if I'm not mistaken. I don't think the vendor console had any tasks which would block the node13:49
Sandzwerg[m]We sync the status to ironic, but no longer let ironic change the status if it does not match what it expects. Would need to check if we changed the default timers; I assume not13:50
TheJuliaso depending on the console, it *might*, and some vendors' BMCs are designed around no more than one request every so often13:50
TheJuliaI don't think we've changed the default behavior there, but it sounds like you're using the knobs13:51
TheJulias/knobs/correct knobs/13:51
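(For reference, the knobs in question are the conductor power-sync options in ironic.conf. A minimal sketch, assuming current option names; only force_power_state_during_sync reflects the non-default behavior Sandzwerg describes:)

    [conductor]
    # How often, in seconds, to sync power states with the BMCs.
    sync_power_state_interval = 60
    # How many times a power state sync may fail before the node is
    # put into maintenance.
    power_state_sync_max_retries = 3
    # When false, ironic records the power state it observes instead
    # of enforcing the state it expects (the behavior described above).
    force_power_state_during_sync = false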
Sandzwerg[m]Yeah, it's also not a big issue. I found two nodes in two days, which is a bit more, but we don't see it all the time. We know that sometimes the remote board gets unresponsive and will not respond to anything for a while; still haven't found the issue. Could be such a case. Interestingly, only (older, ~4-5 years) Lenovos seem to be affected this time13:53
TheJuliaOlder Supermicro gear in particular, I think, makes a login entry, and logging into the web console would count as a login to the BMC; it would eventually time out the fact I touched it via IPMI13:53
TheJuliabut if I hit it too many times, I'd start having errors13:53
TheJuliait might be that using SNMP on those BMCs, whatever they are, is doing the same basic thing: creating a session that eventually times out, which possibly counts as well13:53
Sandzwerg[m]It's not a big issue currently, mostly annoying. But yesterday you mentioned you couldn't imagine a case in which a node might get stuck in the "deleting" state, so I thought I'd describe how it could happen13:54
TheJuliayeah, I guess I can see that happening then :)13:55
Sandzwerg[m]Yeah, wouldn't be surprised, but I think it was too rare for that. We never saw it everywhere, but sometimes a whole block (~14 nodes in one or two racks) is affected at the same time13:55
TheJuliaoh joy13:55
TheJulia:(13:55
Sandzwerg[m]Most of our ironic infrastructure is pretty static, and support has a script to fix it. I think it tries in a loop to reboot the BMC or something, but it could take a day or so till it succeeds. I'm not sad if these Lenovos get decommissioned, hopefully this year13:57
samuelkunkel[m]Has anyone ever seen this on an HPE node using iLO5 (Redfish)? The node is being booted to clean, Redfish calls to boot the device, and directly afterwards it throws: Extended information: [{'MessageArgs': ['BootSourceOverrideTarget'], 'MessageId': 'iLO.2.15.UnableToModifyDuringSystemPOST.13:57
samuelkunkel[m]If I retry it once or twice (starting back from "manage", "provide") it works13:58
TheJuliaSandzwerg[m]: oh joy!  Sounds a lot like the HP Gen ?6? IPMI BMCs we had at HP Cloud13:58
samuelkunkel[m]Uff gen6? :D13:59
TheJuliasamuelkunkel[m]: oh my, yes! We've seen a report of that before13:59
TheJuliasamuelkunkel[m]: it was a very very long time ago13:59
samuelkunkel[m]I don't get why it happens now. And only with these ARM HPE nodes (I really start to hate them)13:59
Sandzwerg[m]We also have some 8-socket Lenovos, where they basically just strap two 4-socket systems together. These are also always "fun" with all their quirks13:59
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Remove extra symbols accidentally added  https://review.opendev.org/c/openstack/bifrost/+/87954713:59
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Remove extra symbols accidentally added  https://review.opendev.org/c/openstack/bifrost/+/87954714:00
TheJuliasamuelkunkel[m]: so, the report I believe was before any of those machines shipped, but it sounds like the window might be longer :\14:00
samuelkunkel[m]Hmm, for now I just retry it14:01
samuelkunkel[m]But this sounds pretty inconvenient14:02
TheJuliasamuelkunkel[m]: with a full error and details, like how much time it takes, I suspect a patch could be created for sushy or ironic14:04
samuelkunkel[m]The question is rather, should we handle this in sushy?14:04
TheJuliadepends on the details needed14:04
samuelkunkel[m]If I recall this correctly we already have exponential backoff and retry?14:04
samuelkunkel[m]I can provide at least the details and maybe work on a patch. But currently it only happens on iLO6 with the RL300 nodes14:05
samuelkunkel[m]So I will ask HPE first what they think of this :D14:06
TheJulialooking at the sushy code14:06
opendevreviewMerged openstack/ironic stable/zed: Always fall back from hard linking to copying files  https://review.opendev.org/c/openstack/ironic/+/87986814:07
opendevreviewMerged openstack/ironic bugfix/21.3: Always fall back from hard linking to copying files  https://review.opendev.org/c/openstack/ironic/+/87986914:08
opendevreviewMerged openstack/ironic bugfix/21.2: Always fall back from hard linking to copying files  https://review.opendev.org/c/openstack/ironic/+/88009014:08
opendevreviewMerged openstack/ironic stable/2023.1: Always fall back from hard linking to copying files  https://review.opendev.org/c/openstack/ironic/+/87986714:08
samuelkunkel[m]Shall I create a bug for that to provide the details?14:09
TheJuliait might back off.....14:09
TheJuliasamuelkunkel[m]: please14:09
TheJuliaI just got off a call and there is a wait depending on the precise error code we get back14:09
TheJuliahttp error code at that14:09
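(A hypothetical caller-side sketch of that kind of backoff against the iLO error above; this is not sushy's actual internal retry logic, and the helper name and timings are made up:)

    import time

    import sushy
    from sushy import exceptions

    def set_boot_override_with_retry(system, retries=5, delay=10):
        """Retry the boot override while the iLO reports it is in POST."""
        for attempt in range(retries):
            try:
                system.set_system_boot_options(
                    target=sushy.BOOT_SOURCE_TARGET_PXE,
                    enabled=sushy.BOOT_SOURCE_ENABLED_ONCE)
                return
            except exceptions.HTTPError as exc:
                # iLO.2.15.UnableToModifyDuringSystemPOST surfaces as an
                # HTTP error while the system is still in POST; back off
                # and try again, re-raising anything else.
                if 'UnableToModifyDuringSystemPOST' not in str(exc):
                    raise
                time.sleep(delay * (attempt + 1))
        raise RuntimeError('BMC still rejected the boot override after '
                           '%d retries' % retries)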
samuelkunkel[m]I would create it for sushy repo?14:10
samuelkunkel[m]yes, I have the logs from the conductor at least14:10
TheJuliaYes, that should hopefully have the http error code14:10
samuelkunkel[m]sushy bug also via bugs.launchpad?14:12
samuelkunkel[m]never opened one for sushy14:12
TheJuliaI believe so, had to step away for a moment14:12
TheJuliahttps://bugs.launchpad.net/sushy14:14
dtantsurTheJulia: morning! could you take a look at https://review.opendev.org/c/openstack/ironic-specs/+/878001 when you have a minute, or let me know if you're fine with us just merging it (it has 2x +2)?15:22
TheJuliaI can try and glance in a little bit; we're starting to reach the "time to be able to focus" window15:44
TheJuliadtantsur: so my only concern with it is plugin migration: the method names are being changed, and that is not highlighted. It might not really need to be highlighted, but folks with custom plugins will need to modify code, which seems reasonable given the level of work16:01
dtantsurTheJulia: yep, they'll also need to change the plugin entry point. So it's not going to be automatic either way.16:01
TheJuliayup16:01
dtantsurdo we need anything other than good docs?16:01
TheJuliaI don't think we can do anything besides that16:02
TheJuliaI do like the callout of what should and should not be done in a plugin16:02
TheJuliaalso, I made two notes on the new API additions; I would expect the ability to just entirely disable the endpoints, like lookup/heartbeat have, for operators with API surfaces pointed towards untrusted or semi-trusted users16:03
rpittaugood night! o/16:07
opendevreviewMerged openstack/ironic-specs master: Merge Inspector into Ironic  https://review.opendev.org/c/openstack/ironic-specs/+/87800116:20
dtantsurTheJulia: thanks! I'll work on a follow-up, also taking into account our discussion with hjensas 16:50
* TheJulia is unsure which or where16:50
* TheJulia goes back to paperwork16:50
opendevreviewChris Krelle proposed openstack/ironic master: Add ability to power off nodes in clean failed  https://review.opendev.org/c/openstack/ironic/+/88016518:02
NobodyCamGood Morning OpenStack Folks!18:02
TheJuliait is nearly afternoon18:08
TheJulia:)18:09
opendevreviewChris Krelle proposed openstack/ironic master: Add ability to power off nodes in clean failed  https://review.opendev.org/c/openstack/ironic/+/88016519:46
opendevreviewChris Krelle proposed openstack/ironic master: Add ability to power off nodes in clean failed  https://review.opendev.org/c/openstack/ironic/+/88016520:03
TheJuliaNobodyCam: current version of gerrit dislikes ``20:03
TheJuliahttps://review.opendev.org/c/openstack/ironic/+/880165 on the reno ``[config_group]option_name``20:03
TheJuliahttps://review.opendev.org/c/openstack/ironic/+/880165 on the reno ```[config_group]option_name```20:03
NobodyCamle sigh20:04
TheJuliaoh jeeze, also irccloud is mucking with it20:04
NobodyCamwill fix right after lunch20:04
NobodyCamheheheeh20:04
* TheJulia looks for a cane to shake and to talk about the times before we had this newfangled stuff20:04
NobodyCam😱20:04
samuelkunkel[m]TheJulia: do you remember, vaguely, the case (I think it was yesterday or two days ago) where efibootmgr was not able to access the UEFI?20:12
samuelkunkel[m]yeah, it seems like it's related to the image.20:12
samuelkunkel[m]Stream-9 IPA works properly20:12
samuelkunkel[m]Debian 12 IPA not20:12
TheJuliasamuelkunkel[m]: oh my...20:12
samuelkunkel[m]so it seems like I'll switch back to Stream-9.20:14
samuelkunkel[m]But it seems like your fix for the image size works. At least with the latest versions of diskimage-builder / ironic-python-agent-builder, the initramfs of a Stream-9 image is only around 350M (no longer ~800M)20:15
samuelkunkel[m]thanks for that :)20:15
clarkbnote that you can adjust partition sizes via dib too if necessary20:17
TheJuliawell, in a ramdisk case there are no partitions20:29
TheJulia:)20:29
clarkbah I mixed this up with the raid thing20:31
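(For anyone comparing the two ramdisks, a CentOS Stream 9 IPA image can be rebuilt along these lines; a sketch using ironic-python-agent-builder, the output basename is arbitrary:)

    # Build a CentOS Stream 9 based IPA kernel and ramdisk pair.
    ironic-python-agent-builder -o ipa-centos9 -r 9 centos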
NobodyCamTheJulia: no space between ] and the option name? `[conductor]poweroff_in_cleanfail` vs `[conductor] poweroff_in_cleanfail`20:32
NobodyCamhey hey clarkb long Time no see20:32
clarkbhello!20:33
TheJuliaNobodyCam: correct20:33
opendevreviewChris Krelle proposed openstack/ironic master: Add ability to power off nodes in clean failed  https://review.opendev.org/c/openstack/ironic/+/88016520:34
NobodyCamokay that should tackle the dreaded pep8 error and the Reno note issue20:34
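(For reference, the markup gerrit was unhappy about: reno notes are YAML with reST inside, and literals need double backticks with no space after the ]. A minimal sketch of such a note, assuming the option name from the patch above:)

    ---
    features:
      - |
        Adds a new configuration option ``[conductor]poweroff_in_cleanfail``
        which, when enabled, powers off a node that enters the
        ``clean failed`` state.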
