janders | here is an example from a known-good and known-bad machine: https://paste.opendev.org/show/bqJeilPE0gOGYlUSbjED/ | 00:00 |
---|---|---|
janders | When I was talking to dtantsur about it we figured we should run it by TheJulia, JayF and cardoe and see if you folks have any experience using this field and if you think it's a good idea to tap into it more in Ironic (I may be wrong here but I don't think we use it a lot) | 00:01 |
janders | to give an example, think of listing this in "baremetal node list" output like we do with say Maintenance - not necessarily the best way to do this but could provide quick and easy insight into hardware health | 00:03 |
janders | I would be curious if you have any prior experience using this information and what do you think about the idea in general | 00:03 |
janders | thanks in advance! :) | 00:05 |
TheJulia | janders: I've thought of something like this in the past and I think it is a good idea in general | 00:07 |
TheJulia | Truth be told, I'd almost consider an unhealthy maintenance state | 00:07 |
TheJulia | set a maintenance reason | 00:07 |
TheJulia | and all that if a periodic detects an issue has appeared | 00:08 |
TheJulia | since human intervention is really required, and we likely shouldn't be moving stuff through the state machine then | 00:08 |
TheJulia | I'm sure a nova user might complain if they can't delete the node but... *shrug* | 00:08 |
janders | those are all good points - that would provide an easy, non-disruptive way of bringing in some of this functionality | 00:09 |
janders | with blocking state transitions though - I wonder if we should do this for any warning state | 00:10 |
janders | say in this example from paste - it's ECC errors | 00:10 |
janders | from my past ops experience I know machines could run with these for weeks without showing any signs of trouble (but it wouldn't always be like that) | 00:11 |
janders | we would typically book a tech to replace the DIMMs as soon as we saw this pop up and then run normally till the box crashes or tech comes | 00:11 |
janders | but that was a system where workloads were pinned to nodes (HPC-esque context) | 00:11 |
janders | one could make an argument it wouldn't be unreasonable to preemptively migrate the workloads off the affected node | 00:12 |
janders | tl;dr - would we want to automatically set maintenance each time this field changes to something other than OK, or just pass this to the operators and have them make the call | 00:13 |
TheJulia | well, ecc errors is a good point | 00:15 |
TheJulia | likely not fatal | 00:15 |
TheJulia | maybe make it a knob? | 00:15 |
TheJulia | dunno, if we don't, then we really need a new field :) | 00:16 |
janders | my gut feel is +1 for the new field | 00:20 |
janders | another (but less important IMO) question is what would we want to do with non-Redfish HW types. I used "ipmi sel elist" quite successfully in my monitoring in the past, just for the record. But having said that it feels to me like IPMI is very very obsolete by now and everything seems to be converging on Redfish so I'd say focusing on Redfish only | 00:25 |
janders | for this feature may be the go | 00:25 |
janders | (if it is to become a feature! :) ) | 00:25 |
TheJulia | I'm kind of -1 to going down the path of doing it with ipmi | 00:38 |
TheJulia | we already see huge scaling issues with ipmi | 00:38 |
TheJulia | and realistically we shouldn't be encouraging future use | 00:38 |
TheJulia | and if other vendors want to add support, cool | 00:38 |
janders | I agree | 00:45 |
JayF | TheJulia: janders: I suggest looking at node faults spec for some previous work in this area. We need a better way to express what's wrong, imo, before we go seeking more ways to find things that are wrong. At least unless we want a field on node for *_state :D | 01:42 |
JayF | This dovetails right into what iurygregory and cardoe were talking about -- there are cases when Ironic can know, either circumstantially or via direct report (the API you mention on redfish) that something is /wrong/ | 01:43 |
JayF | whether that is "out to lunch for firmware update" or "systemhealth says your ram is bad" or "cleaning failed" :) | 01:44 |
opendevreview | Steve Baker proposed openstack/ironic-specs master: Graphical Console Support https://review.opendev.org/c/openstack/ironic-specs/+/938526 | 02:06 |
janders | JayF thank you for your input. I had a quick look at ironic-specs and see power fault recovery spec https://opendev.org/openstack/ironic-specs/src/branch/master/specs/approved/power-fault-recovery.rst is this the one you mentioned? | 02:47 |
janders | I think on a high level (at least from what I know) those two things ("health monitoring" and "dealing with unresponsive BMCs") are stemming from two different requirements but now that you said that I realise there may be more common ground between the two than I initially thought - thanks for this | 02:49 |
janders | no point raising alerts about BMC not responding if we just ran a firmware upgrade - but it may also not be the best idea to perform a firmware upgrade on a known-sick node | 02:51 |
janders | (depends how it is sick exactly but that's something for a more specific discussion down the track) | 02:51 |
*** janders3 is now known as janders | 03:16 | |
TheJulia | janders: I do agree, they are distinctly separate topics and combining them increases the scope and slowness at which a discussion can move forward. That whole problem we see a solution and then try to engineer more. We really just need to improve the handling around some key points and interactions like revisiting power states or when we pull a client as the next step after a fresh firmware upgrade being perceived as | 03:16 |
TheJulia | completed. | 03:16 |
TheJulia | cardoe: do your point, I wonder if instead of signaling the need to re-auth, if unknown authorization is resulting in BadRequest in cases. BadRequest is a hard one to handle and detect because we do other case detection in the sushy library | 03:17 |
TheJulia | s/do/to/ | 03:18 |
TheJulia | https://github.com/openstack/sushy/blob/master/sushy/connector.py#L260C1-L278C22 is the sushy side code. In the cache, we evaluate stuffs here: https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/utils.py#L265-L278 | 03:28 |
TheJulia | I *think* a possible clean path is to just look for BadResult and clear the session | 03:28 |
TheJulia | That would bring a new session together nuking the risk of BMC being totally freaky and restart our overall state in interaction into it. | 03:30 |
TheJulia | We'd have to find the old bug again regarding power states | 03:31 |
TheJulia | ... I think that was much more the BMC is buggy | 03:31 |
TheJulia | but it is one of those vendors people like to buy from | 03:33 |
janders | reading the last sentence made me think of VPBF acronym as a catch-all for cheaper/scaleout kit without pointing fingers | 03:35 |
TheJulia | VPBF? | 03:36 |
janders | Vendors People Buy From | 03:36 |
janders | (dropped the L for Like for sake of brevity) | 03:37 |
TheJulia | heh | 03:37 |
janders | maybe I am nerding out but it feels like there could be a term for this "class" of HW | 03:37 |
cardoe | I missed the convo a bit. But yes I would love to pull in a field like that. We gather that up for our gear. There’s a sliding scale of replacement. Especially for hypervisor machines. Cause stuff warns and works fine. There’s a bunch of cases like the SD card that the BIOS uses has hit its write cycles so general warning stage 5 alarms!!! (Looking at you 2 letter now 3 letter HW vendor) | 05:53 |
cardoe | It actually goes along the model I’m trying to promote that the whole DC is a lot more fluid for everyone’s work loads. So I want that to be promoted up the stack closer to the user. Today we do behind the scenes stuff to scramble people cause customers have uptime guarantees. | 05:57 |
cardoe | Here’s a recent one… Non-Ironic/Nova exposed environment. End user had a down time cause networking went out. The hardware had redundant links. Not bonded. But some kind of ifup / ifdown shenanigans. Now via Redfish we knew the down link was bad cause the PHY reported no link. They could technically see it as well. | 06:07 |
cardoe | But I’m just thinking in a world where they are consuming this gear straight from nova and ironic. How could I make that apparent. | 06:08 |
cardoe | For the hypervisors that we provide to Zulu | 06:10 |
cardoe | Zuul. I listen to an out of band event stream and proactively attempt to migrate away VMs from bad hosts and disable them in nova. | 06:10 |
cardoe | But rather than being special how could I somehow do that. Maybe I’m running bare metal k8s on some Ironic provided gear and so I see a hardware warning and I automatically cordon it so no new workloads land on that box. | 06:13 |
cardoe | As far as the BadResult, I peeked and there is way too much fiddling based on vendors. It goes to what janders said and what we talked about before. A hardware “class”. That in some cases that reply means dump your session. | 06:18 |
rpittau | good morning ironic! o/ | 07:43 |
rpittau | janders: re health monitoring: just my 2 cents, I agree with TheJulia to not go down the ipmi path; for the rest I think we're on the right track :) | 08:16 |
rpittau | FYI we'll have releases for ironicclient https://review.opendev.org/c/openstack/releases/+/938590 and sushy https://review.opendev.org/c/openstack/releases/+/938602 | 09:49 |
rpittau | the sushy one removes support for python 3.8 | 09:49 |
rpittau | I blocked the release for metalsmith https://review.opendev.org/c/openstack/releases/+/938574 for the time being pending https://review.opendev.org/c/openstack/metalsmith/+/933154 | 09:49 |
opendevreview | Verification of a change to openstack/ironic master failed: move imports to top of file for lints https://review.opendev.org/c/openstack/ironic/+/937271 | 09:49 |
opendevreview | Verification of a change to openstack/ironic master failed: enable ruff in pre-commit with some initial lints https://review.opendev.org/c/openstack/ironic/+/937272 | 09:49 |
opendevreview | Adam Rozman proposed openstack/ironic master: disable ISO cache image format and safety checks https://review.opendev.org/c/openstack/ironic/+/938363 | 10:28 |
priteau | Hello Ironic team! Could another core reviewer please approve this Bifrost CI fix? Thanks! https://review.opendev.org/c/openstack/bifrost/+/938124 | 10:51 |
dtantsur | priteau: I think rpittau should return online within a few hours | 11:07 |
priteau | Thank you Dmitry. Nothing urgent, but the actual bug fix is waiting for this and its cherry-pick to merge. | 11:11 |
iurygregory | +W | 11:24 |
priteau | Thanks! | 11:25 |
janders | cardoe thank you for great insights | 11:51 |
janders | rpittau thank you for your feedback - I agree, let's disregard IPMI | 11:51 |
opendevreview | Dmitry Tantsur proposed openstack/ironic-python-agent master: [PoC] [WIP] Workaround for lshw JSON output https://review.opendev.org/c/openstack/ironic-python-agent/+/938654 | 13:00 |
iurygregory | <eyes> | 13:03 |
rpittau | dtantsur: no chance for backporting the lshw patch I guess :/ | 13:12 |
dtantsur | rpittau: I don't know yet, just researching all possible paths | 13:13 |
rpittau | right, makes sense | 13:13 |
iurygregory | ++ | 13:13 |
opendevreview | Merged openstack/bifrost stable/2024.2: [stable-only] CI: Remove SLURP upgrade job https://review.opendev.org/c/openstack/bifrost/+/938124 | 13:13 |
dtantsur | rpittau, iurygregory, https://issues.redhat.com/browse/RHEL-73141 is my request | 13:15 |
rpittau | dtantsur: ack thanks I started watching it | 13:16 |
iurygregory | dtantsur, tks | 13:17 |
opendevreview | Verification of a change to openstack/ironic master failed: enable ruff in pre-commit with some initial lints https://review.opendev.org/c/openstack/ironic/+/937272 | 13:18 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/24.0: Pin upper-constraints https://review.opendev.org/c/openstack/ironic/+/938660 | 13:48 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/25.0: [bugfix only] Pin upper-constraints https://review.opendev.org/c/openstack/ironic/+/938661 | 13:50 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/24.0: [bugfix only] Pin upper-constraints https://review.opendev.org/c/openstack/ironic/+/938660 | 13:51 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/26.0: [bugfix only] Pin upper-constraints https://review.opendev.org/c/openstack/ironic/+/938662 | 13:53 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/27.0: [bugfix only] Pin upper-constraints https://review.opendev.org/c/openstack/ironic/+/938663 | 13:54 |
TheJulia | Good morning | 14:04 |
shermanm | hey, is there a baremetal-networking wg meeting today, or have they not resumed after the holidays? | 15:02 |
TheJulia | I keep forgetting :( | 15:15 |
TheJulia | it's at one of the worst times for me to remember too | 15:15 |
TheJulia | shermanm: does happen to be on my calendar for next week | 15:20 |
dtantsur | yeah, I see it there as well | 15:20 |
shermanm | gotcha, I'll add it to mine as well. last I saw from the etherpad was "? January 8th, 2025" | 15:21 |
TheJulia | I think the last time we managed to meet, we kept going and never reached the point to discuss future meetings | 15:27 |
opendevreview | Julia Kreger proposed openstack/ironic master: trivial: remove xclarity remenent https://review.opendev.org/c/openstack/ironic/+/935973 | 15:33 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/24.0: [bugfix only] Pin upper-constraints https://review.opendev.org/c/openstack/ironic/+/938660 | 15:37 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/27.0: [bugfix only] Pin upper-constraints and remove metal3 job https://review.opendev.org/c/openstack/ironic/+/938663 | 15:41 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/27.0: Calculate missing checksum for file:// based images https://review.opendev.org/c/openstack/ironic/+/938674 | 15:50 |
rpittau | good night! o/ | 16:48 |
opendevreview | Verification of a change to openstack/ironic stable/2024.2 failed: Calculate missing checksum for file:// based images https://review.opendev.org/c/openstack/ironic/+/938617 | 17:12 |
opendevreview | Verification of a change to openstack/ironic stable/2024.1 failed: Calculate missing checksum for file:// based images https://review.opendev.org/c/openstack/ironic/+/938618 | 18:05 |
opendevreview | Verification of a change to openstack/ironic stable/2024.1 failed: Calculate missing checksum for file:// based images https://review.opendev.org/c/openstack/ironic/+/938618 | 18:44 |
*** jamesdenton_ is now known as jamesdenton | 19:24 | |
opendevreview | Merged openstack/ironic master: disable ISO cache image format and safety checks https://review.opendev.org/c/openstack/ironic/+/938363 | 20:24 |
cardoe | why does zuul hate me? I'm gonna hotkey "recheck" | 20:35 |
opendevreview | Verification of a change to openstack/ironic master failed: move imports to top of file for lints https://review.opendev.org/c/openstack/ironic/+/937271 | 20:52 |
cardoe | yeah hate you too zuul | 20:57 |
JayF | there's clearly some flakiness | 21:36 |
JayF | I've not had a lot of success sussing it out except the idea of neutron being slow | 21:36 |
JayF | (and we don't properly coordinate ironic<>neutron, we just wait "long enough") | 21:37 |
opendevreview | Merged openstack/ironic stable/2024.1: Calculate missing checksum for file:// based images https://review.opendev.org/c/openstack/ironic/+/938618 | 21:56 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP OCI container adjacent artifact support https://review.opendev.org/c/openstack/ironic/+/937896 | 22:03 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP OCI container adjacent artifact support https://review.opendev.org/c/openstack/ironic/+/937896 | 22:12 |
iurygregory | cardoe, most of the time it's because you forgot to give cookies to zuul :D | 22:20 |
iurygregory | it happens more often than you can imagine :D | 22:21 |
opendevreview | Merged openstack/ironic master: move imports to top of file for lints https://review.opendev.org/c/openstack/ironic/+/937271 | 23:38 |
opendevreview | Merged openstack/ironic master: enable ruff in pre-commit with some initial lints https://review.opendev.org/c/openstack/ironic/+/937272 | 23:49 |
cardoe | iurygregory: this time around mean talk worked | 23:51 |