Wednesday, 2025-01-08

00:00 <janders> here is an example from a known-good and known-bad machine: https://paste.opendev.org/show/bqJeilPE0gOGYlUSbjED/
00:01 <janders> When I was talking to dtantsur about it we figured we should run it by TheJulia, JayF and cardoe and see if you folks have any experience using this field and if you think it's a good idea to tap into it more in Ironic (I may be wrong here but I don't think we use it a lot)
00:03 <janders> to give an example, think of listing this in "baremetal node list" output like we do with, say, Maintenance - not necessarily the best way to do this, but it could provide quick and easy insight into hardware health
00:03 <janders> I would be curious if you have any prior experience using this information and what you think about the idea in general
00:05 <janders> thanks in advance! :)
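The field being discussed is the Status object that Redfish exposes on a System resource. A minimal sketch of interpreting it (field names are from the Redfish schema; the payloads below are illustrative, not taken from the paste):

```python
def summarize_health(status):
    """Reduce a Redfish Status object to (is_ok, detail).

    HealthRollup aggregates subcomponent health (e.g. a DIMM throwing
    ECC errors); fall back to Health when the rollup is absent.
    """
    status = status or {}
    detail = status.get("HealthRollup") or status.get("Health")
    return detail == "OK", detail

# Illustrative payloads resembling a known-good and a known-bad machine
good = {"State": "Enabled", "Health": "OK", "HealthRollup": "OK"}
bad = {"State": "Enabled", "Health": "OK", "HealthRollup": "Warning"}

print(summarize_health(good))  # (True, 'OK')
print(summarize_health(bad))   # (False, 'Warning')
```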
00:07 <TheJulia> janders: I've thought of something like this in the past and I think it is a good idea in general
00:07 <TheJulia> Truth be told, I'd almost consider an unhealthy maintenance state
00:07 <TheJulia> set a maintenance reason
00:08 <TheJulia> and all that if a periodic detects an issue has appeared
00:08 <TheJulia> since human intervention is really required, and we likely shouldn't be moving stuff through the state machine then
00:08 <TheJulia> I'm sure a nova user might complain if they can't delete the node but... *shrug*
00:09 <janders> those are all good points - that would provide an easy, undisruptive way of bringing in some of this functionality
00:10 <janders> with blocking state transitions though - I wonder if we should do this for any warning state
00:10 <janders> say in this example from the paste - it's ECC errors
00:11 <janders> from my past ops experience I know machines could run with these for weeks without showing any signs of trouble (but it wouldn't always be like that)
00:11 <janders> we would typically book a tech to replace the DIMMs as soon as we saw this pop up and then run normally till the box crashed or the tech came
00:11 <janders> but that was a system where workloads were pinned to nodes (HPC-esque context)
00:12 <janders> one could make an argument it wouldn't be unreasonable to preemptively migrate the workloads off the affected node
00:13 <janders> tl;dr - would we want to automatically set maintenance each time this field changes to something other than OK, or just pass this to the operators and have them make the call?
00:15 <TheJulia> well, ECC errors are a good point
00:15 <TheJulia> likely not fatal
00:15 <TheJulia> maybe make it a knob?
00:16 <TheJulia> dunno, if we don't, then we really need a new field :)
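The knob TheJulia floats could look something like this: a hypothetical config option deciding whether a non-OK report sets maintenance or is only recorded in a (new) health field. Names and values are invented for illustration, not Ironic API:

```python
def plan_health_action(health, auto_maintenance_on_warning=False):
    """Decide what a periodic should do with a reported health value.

    Critical always implies maintenance (human intervention required);
    Warning (e.g. correctable ECC errors, often survivable for weeks)
    is controlled by the hypothetical knob.
    """
    if health in (None, "OK"):
        return ("noop", None)
    if health == "Critical" or auto_maintenance_on_warning:
        return ("set_maintenance", "hardware health reported %s" % health)
    # Leave the call to the operator; just surface the value
    return ("record_health", health)
```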
00:20 <janders> my gut feel is +1 for the new field
00:25 <janders> another (but less important IMO) question is what we would want to do with non-Redfish HW types. I used "ipmi sel elist" quite successfully in my monitoring in the past, just for the record. But having said that, it feels to me like IPMI is very, very obsolete by now and everything seems to be converging on Redfish, so I'd say focusing on Redfish only for this feature may be the go
00:25 <janders> (if it is to become a feature! :) )
00:38 <TheJulia> I'm kind of -1 to going down the path of doing it with IPMI
00:38 <TheJulia> we already see huge scaling issues with IPMI
00:38 <TheJulia> and realistically we shouldn't be encouraging future use
00:38 <TheJulia> and if other vendors want to add support, cool
00:45 <janders> I agree
01:42 <JayF> TheJulia: janders: I suggest looking at the node faults spec for some previous work in this area. We need a better way to express what's wrong, imo, before we go seeking more ways to find things that are wrong. At least unless we want a field on node for *_state :D
01:43 <JayF> This dovetails right into what iurygregory and cardoe were talking about -- there are cases when Ironic can know, either circumstantially or via direct report (the API you mention on redfish), that something is /wrong/
01:44 <JayF> whether that is "out to lunch for firmware update" or "systemhealth says your ram is bad" or "cleaning failed" :)
02:06 <opendevreview> Steve Baker proposed openstack/ironic-specs master: Graphical Console Support  https://review.opendev.org/c/openstack/ironic-specs/+/938526
02:47 <janders> JayF: thank you for your input. I had a quick look at ironic-specs and see the power fault recovery spec https://opendev.org/openstack/ironic-specs/src/branch/master/specs/approved/power-fault-recovery.rst - is this the one you mentioned?
02:49 <janders> I think on a high level (at least from what I know) those two things ("health monitoring" and "dealing with unresponsive BMCs") stem from two different requirements, but now that you said that I realise there may be more common ground between the two than I initially thought - thanks for this
02:51 <janders> no point raising alerts about the BMC not responding if we just ran a firmware upgrade - but it may also not be the best idea to perform a firmware upgrade on a known-sick node
02:51 <janders> (depends how it is sick exactly, but that's something for a more specific discussion down the track)
03:16 *** janders3 is now known as janders
03:16 <TheJulia> janders: I do agree, they are distinctly separate topics, and combining them increases the scope and the slowness at which a discussion can move forward. That's the whole problem where we see a solution and then try to engineer more. We really just need to improve the handling around some key points and interactions, like revisiting power states, or when we pull a client as the next step after a fresh firmware upgrade is perceived as completed.
03:17 <TheJulia> cardoe: to your point, I wonder if unknown authorization is resulting in BadRequest in some cases, instead of signaling the need to re-auth. BadRequest is a hard one to handle and detect because we do other case detection in the sushy library
03:28 <TheJulia> https://github.com/openstack/sushy/blob/master/sushy/connector.py#L260C1-L278C22 is the sushy side code. In the cache, we evaluate stuff here: https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/utils.py#L265-L278
03:28 <TheJulia> I *think* a possible clean path is to just look for BadRequest and clear the session
03:30 <TheJulia> That would bring a new session together, nuking the risk of the BMC being totally freaky, and restart our overall state in the interaction with it.
03:31 <TheJulia> We'd have to find the old bug again regarding power states
03:31 <TheJulia> ... I think that was much more "the BMC is buggy"
03:33 <TheJulia> but it is one of those vendors people like to buy from
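The recovery path being described, sketched outside sushy (the exception name and cache shape here are stand-ins, not sushy's actual internals): on a BadRequest-style reply, drop the cached session and retry once with a freshly authenticated one.

```python
class BadRequestError(Exception):
    """Stand-in for the HTTP 400 a BMC may return on a stale session."""


class SessionCache:
    """Minimal stand-in for a per-BMC session cache."""

    def __init__(self, authenticate):
        self._authenticate = authenticate  # factory producing a fresh session
        self._session = None

    def session(self):
        if self._session is None:
            self._session = self._authenticate()
        return self._session

    def invalidate(self):
        self._session = None


def call_with_session_reset(cache, request):
    """Run request(session); on BadRequest, clear the session and retry once."""
    try:
        return request(cache.session())
    except BadRequestError:
        cache.invalidate()          # nuke the possibly-stale session
        return request(cache.session())  # retry with a fresh one
```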
03:35 <janders> reading the last sentence made me think of the VPBF acronym as a catch-all for cheaper/scaleout kit without pointing fingers
03:36 <TheJulia> VPBF?
03:36 <janders> Vendors People Buy From
03:37 <janders> (dropped the L for Like for the sake of brevity)
03:37 <TheJulia> heh
03:37 <janders> maybe I am nerding out but it feels like there could be a term for this "class" of HW
05:53 <cardoe> I missed the convo a bit. But yes, I would love to pull in a field like that. We gather that up for our gear. There's a sliding scale of replacement, especially for hypervisor machines, cause stuff warns and works fine. There's a bunch of cases like the SD card that the BIOS uses has hit its write cycles, so general warning, stage 5 alarms!!! (Looking at you, 2 letter now 3 letter HW vendor)
05:57 <cardoe> It actually goes along with the model I'm trying to promote, that the whole DC is a lot more fluid for everyone's workloads. So I want that to be promoted up the stack, closer to the user. Today we do behind-the-scenes stuff to scramble people, cause customers have uptime guarantees.
06:07 <cardoe> Here's a recent one… Non-Ironic/Nova exposed environment. End user had downtime cause networking went out. The hardware had redundant links. Not bonded. But some kind of ifup / ifdown shenanigans. Now via Redfish we knew the down link was bad cause the PHY reported no link. They could technically see it as well.
06:08 <cardoe> But I'm just thinking, in a world where they are consuming this gear straight from nova and ironic, how could I make that apparent?
06:10 <cardoe> For the hypervisors that we provide to Zuul, I listen to an out-of-band event stream and proactively attempt to migrate VMs away from bad hosts and disable them in nova.
06:13 <cardoe> But rather than being special, how could I somehow do that? Maybe I'm running bare metal k8s on some Ironic-provided gear, so I see a hardware warning and automatically cordon the node so no new workloads land on that box.
06:18 <cardoe> As far as the BadRequest goes, I peeked and there is way too much fiddling based on vendors. It goes to what janders said and what we talked about before: a hardware "class". In some cases that reply means dump your session.
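The per-vendor fiddling cardoe mentions could hang off that hardware-"class" idea: a quirk table saying, per class, whether a 400 reply really means "session gone, re-authenticate". The class names and flag are invented for illustration:

```python
# Hypothetical quirk table keyed by hardware "class"; for some classes
# a BadRequest (HTTP 400) from the BMC means the session should be dumped.
QUIRKS = {
    "generic": {"bad_request_means_reauth": False},
    "class-a": {"bad_request_means_reauth": True},
}


def should_dump_session(hw_class, status_code):
    """Return True if this reply should trigger session invalidation."""
    quirks = QUIRKS.get(hw_class, QUIRKS["generic"])
    return status_code == 400 and quirks["bad_request_means_reauth"]
```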
07:43 <rpittau> good morning ironic! o/
08:16 <rpittau> janders: re health monitoring: just my 2 cents, I agree with TheJulia to not go down the ipmi path; for the rest I think we're on the right track :)
09:49 <rpittau> FYI we'll have releases for ironicclient https://review.opendev.org/c/openstack/releases/+/938590 and sushy https://review.opendev.org/c/openstack/releases/+/938602
09:49 <rpittau> the sushy one removes support for python 3.8
09:49 <rpittau> I blocked the release for metalsmith https://review.opendev.org/c/openstack/releases/+/938574 for the time being, pending https://review.opendev.org/c/openstack/metalsmith/+/933154
09:49 <opendevreview> Verification of a change to openstack/ironic master failed: move imports to top of file for lints  https://review.opendev.org/c/openstack/ironic/+/937271
09:49 <opendevreview> Verification of a change to openstack/ironic master failed: enable ruff in pre-commit with some initial lints  https://review.opendev.org/c/openstack/ironic/+/937272
10:28 <opendevreview> Adam Rozman proposed openstack/ironic master: disable ISO cache image format and safety checks  https://review.opendev.org/c/openstack/ironic/+/938363
10:51 <priteau> Hello Ironic team! Could another core reviewer please approve this Bifrost CI fix? Thanks! https://review.opendev.org/c/openstack/bifrost/+/938124
11:07 <dtantsur> priteau: I think rpittau should return online within a few hours
11:11 <priteau> Thank you Dmitry. Nothing urgent, but an actual bug fix is waiting for this and its cherry-pick to merge.
11:24 <iurygregory> +W
11:25 <priteau> Thanks!
11:51 <janders> cardoe: thank you for the great insights
11:51 <janders> rpittau: thank you for your feedback - I agree, let's disregard IPMI
13:00 <opendevreview> Dmitry Tantsur proposed openstack/ironic-python-agent master: [PoC] [WIP] Workaround for lshw JSON output  https://review.opendev.org/c/openstack/ironic-python-agent/+/938654
13:03 <iurygregory> <eyes>
13:12 <rpittau> dtantsur: no chance of backporting the lshw patch I guess :/
13:13 <dtantsur> rpittau: I don't know yet, just researching all possible paths
13:13 <rpittau> right, makes sense
13:13 <iurygregory> ++
13:13 <opendevreview> Merged openstack/bifrost stable/2024.2: [stable-only] CI: Remove SLURP upgrade job  https://review.opendev.org/c/openstack/bifrost/+/938124
13:15 <dtantsur> rpittau, iurygregory: https://issues.redhat.com/browse/RHEL-73141 is my request
13:16 <rpittau> dtantsur: ack, thanks, I started watching it
13:17 <iurygregory> dtantsur, tks
13:18 <opendevreview> Verification of a change to openstack/ironic master failed: enable ruff in pre-commit with some initial lints  https://review.opendev.org/c/openstack/ironic/+/937272
13:48 <opendevreview> Riccardo Pittau proposed openstack/ironic bugfix/24.0: Pin upper-constraints  https://review.opendev.org/c/openstack/ironic/+/938660
13:50 <opendevreview> Riccardo Pittau proposed openstack/ironic bugfix/25.0: [bugfix only] Pin upper-constraints  https://review.opendev.org/c/openstack/ironic/+/938661
13:51 <opendevreview> Riccardo Pittau proposed openstack/ironic bugfix/24.0: [bugfix only] Pin upper-constraints  https://review.opendev.org/c/openstack/ironic/+/938660
13:53 <opendevreview> Riccardo Pittau proposed openstack/ironic bugfix/26.0: [bugfix only] Pin upper-constraints  https://review.opendev.org/c/openstack/ironic/+/938662
13:54 <opendevreview> Riccardo Pittau proposed openstack/ironic bugfix/27.0: [bugfix only] Pin upper-constraints  https://review.opendev.org/c/openstack/ironic/+/938663
14:04 <TheJulia> Good morning
15:02 <shermanm> hey, is there a baremetal-networking wg meeting today, or have they not resumed after the holidays?
15:15 <TheJulia> I keep forgetting :(
15:15 <TheJulia> it's at one of the worst times for me to remember too
15:20 <TheJulia> shermanm: it does happen to be on my calendar for next week
15:20 <dtantsur> yeah, I see it there as well
15:21 <shermanm> gotcha, I'll add it to mine as well. last I saw from the etherpad was "? January 8th, 2025"
15:27 <TheJulia> I think the last time we managed to meet, we kept going and never reached the point of discussing future meetings
15:33 <opendevreview> Julia Kreger proposed openstack/ironic master: trivial: remove xclarity remenent  https://review.opendev.org/c/openstack/ironic/+/935973
15:37 <opendevreview> Riccardo Pittau proposed openstack/ironic bugfix/24.0: [bugfix only] Pin upper-constraints  https://review.opendev.org/c/openstack/ironic/+/938660
15:41 <opendevreview> Riccardo Pittau proposed openstack/ironic bugfix/27.0: [bugfix only] Pin upper-constraints and remove metal3 job  https://review.opendev.org/c/openstack/ironic/+/938663
15:50 <opendevreview> Riccardo Pittau proposed openstack/ironic bugfix/27.0: Calculate missing checksum for file:// based images  https://review.opendev.org/c/openstack/ironic/+/938674
16:48 <rpittau> good night! o/
17:12 <opendevreview> Verification of a change to openstack/ironic stable/2024.2 failed: Calculate missing checksum for file:// based images  https://review.opendev.org/c/openstack/ironic/+/938617
18:05 <opendevreview> Verification of a change to openstack/ironic stable/2024.1 failed: Calculate missing checksum for file:// based images  https://review.opendev.org/c/openstack/ironic/+/938618
18:44 <opendevreview> Verification of a change to openstack/ironic stable/2024.1 failed: Calculate missing checksum for file:// based images  https://review.opendev.org/c/openstack/ironic/+/938618
19:24 *** jamesdenton_ is now known as jamesdenton
20:24 <opendevreview> Merged openstack/ironic master: disable ISO cache image format and safety checks  https://review.opendev.org/c/openstack/ironic/+/938363
20:35 <cardoe> why does zuul hate me? I'm gonna hotkey "recheck"
20:52 <opendevreview> Verification of a change to openstack/ironic master failed: move imports to top of file for lints  https://review.opendev.org/c/openstack/ironic/+/937271
20:57 <cardoe> yeah, hate you too zuul
21:36 <JayF> there's clearly some flakiness
21:36 <JayF> I've not had a lot of success sussing it out, except the idea of neutron being slow
21:37 <JayF> (and we don't properly coordinate ironic<>neutron, we just wait "long enough")
21:56 <opendevreview> Merged openstack/ironic stable/2024.1: Calculate missing checksum for file:// based images  https://review.opendev.org/c/openstack/ironic/+/938618
22:03 <opendevreview> Julia Kreger proposed openstack/ironic master: WIP OCI container adjacent artifact support  https://review.opendev.org/c/openstack/ironic/+/937896
22:12 <opendevreview> Julia Kreger proposed openstack/ironic master: WIP OCI container adjacent artifact support  https://review.opendev.org/c/openstack/ironic/+/937896
22:20 <iurygregory> cardoe, most of the time it's because you forgot to give cookies to zuul :D
22:21 <iurygregory> it happens more often than you can imagine :D
23:38 <opendevreview> Merged openstack/ironic master: move imports to top of file for lints  https://review.opendev.org/c/openstack/ironic/+/937271
23:49 <opendevreview> Merged openstack/ironic master: enable ruff in pre-commit with some initial lints  https://review.opendev.org/c/openstack/ironic/+/937272
23:51 <cardoe> iurygregory: this time around mean talk worked
