Thursday, 2023-04-13

iurygregoryTheJulia, tks for the review in the spec, I just answered some of your questions there o/01:58
iurygregorytomorrow I will be looking at the open specs we have for bobcat01:58
TheJuliaCool cool02:05
TheJuliaAs long as we don’t end up in that whole strict-tie-to-a-cycle process-control nightmare.02:08
iurygregoryyeah02:09
rpittaugood morning ironic! o/06:37
kaloyankmorning ironic o/06:57
kubajjgood morning everyone09:23
dtantsurmorning kubajj! how are your studies going?09:42
kubajjdtantsur: it's not that bad. Exams are slowly approaching though, so I'm basically spending 9-19 at the library. Slightly more than a month to go and then I'm free. 🥹09:45
dtantsurgreat, good luck!09:47
iurygregorymorning Ironic11:35
opendevreviewMaksim Malchuk proposed openstack/bifrost master: [DNM] test linters  https://review.opendev.org/c/openstack/bifrost/+/88016312:32
TheJuliagood morning12:33
opendevreviewMaksim Malchuk proposed openstack/bifrost master: [DNM] test linters  https://review.opendev.org/c/openstack/bifrost/+/88016312:33
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Fix ansible-lint  https://review.opendev.org/c/openstack/bifrost/+/88016312:34
Sandzwerg[m]TheJulia: Turns out having nodes stuck in deleting is not that hard if ironic has issues reaching the node via IPMI :) 12:57
TheJuliaSandzwerg[m]: intermittently?12:59
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Fix ansible-lint  https://review.opendev.org/c/openstack/bifrost/+/88016313:00
Sandzwerg[m]<TheJulia> "Sandzwerg: intermittently?" <- More or less. The node was put into maintenance because ironic fails the power sync even after some retries. Probably something on the node side is broken13:10
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Fix ansible-lint  https://review.opendev.org/c/openstack/bifrost/+/88016313:44
TheJuliaSandzwerg[m]: do you have another system talking to the BMC via ipmi?13:47
TheJuliaSandzwerg[m]: are you using power sync? the default is yes :)13:48
TheJuliaSandzwerg[m]: Also, have any timers been changed from the defaults?13:48
Sandzwerg[m]We have a vendor console (LXCA/OpenManage/etc.), not sure if that uses IPMI or something else. Monitoring is done via SNMP, if I'm not mistaken. I don't think the vendor console had any tasks which would block the node13:49
Sandzwerg[m]We sync the status to ironic, but no longer let ironic change the status if it does not match what it expects. Would need to check if we changed the default timers; I assume not13:50
TheJuliaso depending on the console, it *might*, and some vendors' BMCs are designed around no more than one request every so often13:50
TheJuliaI don't think we've changed the default behavior there, but it sounds like you're using the knobs13:51
TheJulias/knobs/correct knobs/13:51
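(For reference, the knobs in question are the conductor power-sync options in ironic.conf. A minimal sketch, assuming current option names; only force_power_state_during_sync reflects the non-default behavior Sandzwerg describes:)

    [conductor]
    # How often, in seconds, to sync power states with the BMCs.
    sync_power_state_interval = 60
    # How many times a power state sync may fail before the node is
    # put into maintenance.
    power_state_sync_max_retries = 3
    # When false, ironic records the power state it observes instead
    # of enforcing the state it expects (the behavior described above).
    force_power_state_during_sync = false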
Sandzwerg[m]Yeah, it's also not a big issue. I found two nodes in two days, which is a bit more, but we don't see it all the time. We know that sometimes the remote board gets unresponsive and will not respond to anything for a while; still haven't found the issue. Could be such a case. Interestingly, only (older, ~4-5 years) Lenovos seem to be affected this time13:53
TheJuliaOlder Supermicro gear in particular, I think, makes a login entry, and logging into the web console would count as a login to the BMC; it would eventually time out the fact I touched it via IPMI13:53
TheJuliabut if I hit it too many times, I'd start having errors13:53
TheJuliait might be that using SNMP on those BMCs, whatever they are, is doing the same basic thing: creating a session that eventually times out, which possibly counts as well13:53
Sandzwerg[m]It's not a big issue currently, mostly annoying. But yesterday you mentioned you couldn't imagine a case in which a node might get stuck in the "deleting" state, so I thought I'd describe how it could happen13:54
TheJuliayeah, I guess I can see that happening then :)13:55
Sandzwerg[m]Yeah, wouldn't be surprised, but I think it was too rare for that. We never saw it everywhere, but sometimes a whole block (~14 nodes in one or two racks) is affected at the same time13:55
TheJuliaoh joy13:55
TheJulia:(13:55
Sandzwerg[m]Most of our ironic infrastructure is pretty static, and support has a script to fix it. I think it tries in a loop to reboot the BMC or something, but it could take a day or so till it succeeds. I'm not sad if these Lenovos get decommissioned, hopefully this year13:57
samuelkunkel[m]Has anyone ever seen this on an HPE node using iLO5 (Redfish)? The node is being booted to clean, Redfish calls to boot the device, and directly afterwards it throws: Extended information: [{'MessageArgs': ['BootSourceOverrideTarget'], 'MessageId': 'iLO.2.15.UnableToModifyDuringSystemPOST.13:57
samuelkunkel[m]If I retry it once or twice (starting back from "manage", "provide") it works13:58
TheJuliaSandzwerg[m]: oh joy!  Sounds a lot like the HP Gen ?6? IPMI BMCs we had at HP Cloud13:58
samuelkunkel[m]Uff gen6? :D13:59
TheJuliasamuelkunkel[m]: oh my, yes! We've seen a report of that before13:59
TheJuliasamuelkunkel[m]: it was a very very long time ago13:59
samuelkunkel[m]I don't get why it happens now. And only with these ARM HPE nodes (I really start to hate them)13:59
Sandzwerg[m]We also have some 8-socket Lenovos, where they basically just strap two 4-socket systems together. These are also always "fun" with all their quirks13:59
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Remove extra symbols accidentally added  https://review.opendev.org/c/openstack/bifrost/+/87954713:59
opendevreviewMaksim Malchuk proposed openstack/bifrost master: Remove extra symbols accidentally added  https://review.opendev.org/c/openstack/bifrost/+/87954714:00
TheJuliasamuelkunkel[m]: so, the report I believe was before any of those machines shipped, but it sounds like the window might be longer :\14:00
samuelkunkel[m]Hmm, for now I just retry it14:01
samuelkunkel[m]But this sounds pretty inconvenient14:02
TheJuliasamuelkunkel[m]: with a full error and details, like how much time it takes, I suspect a patch could be created for sushy or ironic14:04
samuelkunkel[m]The question is rather, should we handle this in sushy?14:04
TheJuliadepends on the details needed14:04
samuelkunkel[m]If I recall this correctly we already have exponential backoff and retry?14:04
samuelkunkel[m]I can provide at least the details and maybe work on a patch. But currently it only happens on iLO6 with the RL300 nodes14:05
samuelkunkel[m]So I will ask HPE first what they think of this :D14:06
TheJulialooking at the sushy code14:06
opendevreviewMerged openstack/ironic stable/zed: Always fall back from hard linking to copying files  https://review.opendev.org/c/openstack/ironic/+/87986814:07
opendevreviewMerged openstack/ironic bugfix/21.3: Always fall back from hard linking to copying files  https://review.opendev.org/c/openstack/ironic/+/87986914:08
opendevreviewMerged openstack/ironic bugfix/21.2: Always fall back from hard linking to copying files  https://review.opendev.org/c/openstack/ironic/+/88009014:08
opendevreviewMerged openstack/ironic stable/2023.1: Always fall back from hard linking to copying files  https://review.opendev.org/c/openstack/ironic/+/87986714:08
samuelkunkel[m]Shall I create a bug for that to provide the details?14:09
TheJuliait might back off.....14:09
TheJuliasamuelkunkel[m]: please14:09
TheJuliaI just got off a call and there is a wait depending on the precise error code we get back14:09
TheJuliahttp error code at that14:09
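(A hypothetical caller-side sketch of that kind of backoff against the iLO error above; this is not sushy's actual internal retry logic, and the helper name and timings are made up:)

    import time

    import sushy
    from sushy import exceptions

    def set_boot_override_with_retry(system, retries=5, delay=10):
        """Retry the boot override while the iLO reports it is in POST."""
        for attempt in range(retries):
            try:
                system.set_system_boot_options(
                    target=sushy.BOOT_SOURCE_TARGET_PXE,
                    enabled=sushy.BOOT_SOURCE_ENABLED_ONCE)
                return
            except exceptions.HTTPError as exc:
                # iLO.2.15.UnableToModifyDuringSystemPOST surfaces as an
                # HTTP error while the system is still in POST; back off
                # and try again, re-raising anything else.
                if 'UnableToModifyDuringSystemPOST' not in str(exc):
                    raise
                time.sleep(delay * (attempt + 1))
        raise RuntimeError('BMC still rejected the boot override after '
                           '%d retries' % retries)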
samuelkunkel[m]I would create it for sushy repo?14:10
samuelkunkel[m]yes, I have the logs from the conductor at least14:10
TheJuliaYes, that should hopefully have the http error code14:10
samuelkunkel[m]sushy bug also via bugs.launchpad?14:12
samuelkunkel[m]never opened one for sushy14:12
TheJuliaI believe so, had to step away for a moment14:12
TheJuliahttps://bugs.launchpad.net/sushy14:14
dtantsurTheJulia: morning! could you take a look at https://review.opendev.org/c/openstack/ironic-specs/+/878001 when you have a minute, or let me know if you're fine with us just merging it (it has 2x +2)?15:22
TheJuliaI can try and glance in a little bit; we're starting to reach the "time to be able to focus" window15:44
TheJuliadtantsur: so my only concern with it is plugin migration: the method names are being changed, and that is not highlighted. It might not really need to be highlighted, but folks with custom plugins will need to modify code, which seems reasonable given the level of work16:01
dtantsurTheJulia: yep, they'll also need to change the plugin entry point. So it's not going to be automatic either way.16:01
TheJuliayup16:01
dtantsurdo we need anything other than good docs?16:01
TheJuliaI don't think we can do anything besides that16:02
TheJuliaI do like the callout of what should and should not be done in a plugin16:02
TheJuliaalso, I made two notes on the new API additions; I would expect the ability to just entirely disable the endpoints, like lookup/heartbeat have, for operators with API surfaces pointed towards untrusted or semi-trusted users16:03
rpittaugood night! o/16:07
opendevreviewMerged openstack/ironic-specs master: Merge Inspector into Ironic  https://review.opendev.org/c/openstack/ironic-specs/+/87800116:20
dtantsurTheJulia: thanks! I'll work on a follow-up, also taking into account our discussion with hjensas 16:50
* TheJulia is unsure which or where16:50
* TheJulia goes back to paperwork16:50
opendevreviewChris Krelle proposed openstack/ironic master: Add ability to power off nodes in clean failed  https://review.opendev.org/c/openstack/ironic/+/88016518:02
NobodyCamGood Morning OpenStack Folks!18:02
TheJuliait is nearly afternoon18:08
TheJulia:)18:09
opendevreviewChris Krelle proposed openstack/ironic master: Add ability to power off nodes in clean failed  https://review.opendev.org/c/openstack/ironic/+/88016519:46
opendevreviewChris Krelle proposed openstack/ironic master: Add ability to power off nodes in clean failed  https://review.opendev.org/c/openstack/ironic/+/88016520:03
TheJuliaNobodyCam: current version of gerrit dislikes ``20:03
TheJuliahttps://review.opendev.org/c/openstack/ironic/+/880165 on the reno ``[config_group]option_name``20:03
TheJuliahttps://review.opendev.org/c/openstack/ironic/+/880165 on the reno ```[config_group]option_name```20:03
NobodyCamle sigh20:04
TheJuliaoh jeeze, also irccloud is mucking with it20:04
NobodyCamwill fix right after lunch20:04
NobodyCamheheheeh20:04
* TheJulia looks for a cane to shake and to talk about the times before we had this newfangled stuff20:04
NobodyCam😱20:04
samuelkunkel[m]TheJulia: do you remember, vaguely, the case (I think it was yesterday or two days ago) where efibootmgr was not able to access the UEFI?20:12
samuelkunkel[m]yeah, it seems like it's related to the image.20:12
samuelkunkel[m]Stream-9 IPA works properly20:12
samuelkunkel[m]Debian 12 IPA not20:12
TheJuliasamuelkunkel[m]: oh my...20:12
samuelkunkel[m]so it seems like I'll switch back to Stream-9.20:14
samuelkunkel[m]But it seems like your fix for the image size works. At least with the latest versions of diskimage-builder / ironic-python-agent-builder, the initramfs of a Stream-9 image is only around 350M (no longer ~800M)20:15
samuelkunkel[m]thanks for that :)20:15
clarkbnote that you can adjust partition sizes via dib too if necessary20:17
TheJuliawell, in a ramdisk case there are no partitions20:29
TheJulia:)20:29
clarkbah I mixed this up with the raid thing20:31
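(For anyone comparing the two ramdisks, a CentOS Stream 9 IPA image can be rebuilt along these lines; a sketch using ironic-python-agent-builder, the output basename is arbitrary:)

    # Build a CentOS Stream 9 based IPA kernel and ramdisk pair.
    ironic-python-agent-builder -o ipa-centos9 -r 9 centos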
NobodyCamTheJulia: no space between ] and the option name? `[conductor]poweroff_in_cleanfail` vs `[conductor] poweroff_in_cleanfail`20:32
NobodyCamhey hey clarkb long Time no see20:32
clarkbhello!20:33
TheJuliaNobodyCam: correct20:33
opendevreviewChris Krelle proposed openstack/ironic master: Add ability to power off nodes in clean failed  https://review.opendev.org/c/openstack/ironic/+/88016520:34
NobodyCamokay that should tackle the dreaded pep8 error and the Reno note issue20:34
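(For reference, the markup gerrit was unhappy about: reno notes are YAML with reST inside, and literals need double backticks with no space after the ]. A minimal sketch of such a note, assuming the option name from the patch above:)

    ---
    features:
      - |
        Adds a new configuration option ``[conductor]poweroff_in_cleanfail``
        which, when enabled, powers off a node that enters the
        ``clean failed`` state.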
