Friday, 2023-09-15

opendevreviewIury Gregory Melo Ferreira proposed openstack/ironic master: RedfishFirmware Interface  https://review.opendev.org/c/openstack/ironic/+/88542502:47
dtantsurJayF, I cannot do it either, I wonder if hashtags are not enabled there at all08:48
opendevreviewDmitry Tantsur proposed openstack/bifrost stable/yoga: Remove Fedora from the CI  https://review.opendev.org/c/openstack/bifrost/+/89375408:52
opendevreviewMerged openstack/ironic master: [releasenotes] Prelude for 2023.2/bobcat  https://review.opendev.org/c/openstack/ironic/+/89500708:58
opendevreviewMerged openstack/python-ironic-inspector-client stable/2023.2: Update .gitreview for stable/2023.2  https://review.opendev.org/c/openstack/python-ironic-inspector-client/+/89509309:30
opendevreviewMerged openstack/python-ironic-inspector-client stable/2023.2: Update TOX_CONSTRAINTS_FILE for stable/2023.2  https://review.opendev.org/c/openstack/python-ironic-inspector-client/+/89509409:30
opendevreviewMerged openstack/python-ironic-inspector-client master: Update master for stable/2023.2  https://review.opendev.org/c/openstack/python-ironic-inspector-client/+/89509509:30
opendevreviewAdam Rozman proposed openstack/ironic-python-agent master: implement basic-auth support for user-image download process  https://review.opendev.org/c/openstack/ironic-python-agent/+/89027210:19
opendevreviewAdam Rozman proposed openstack/ironic-python-agent master: implement basic-auth support for user-image download process  https://review.opendev.org/c/openstack/ironic-python-agent/+/89027210:30
opendevreviewVerification of a change to openstack/networking-generic-switch master failed: Do not make actual device changes in bind_port()  https://review.opendev.org/c/openstack/networking-generic-switch/+/84759210:52
iurygregorygood morning Ironic11:25
iurygregorydtantsur, JayF hashtags need to be enabled in the ACL of each project 11:34
opendevreviewDmitry Tantsur proposed openstack/ironic master: Remove most prints for unit tests  https://review.opendev.org/c/openstack/ironic/+/89526913:01
dtantsurFYI TheJulia ^^^13:01
TheJuliathat that will make debugging api stuff harder, but c'est la vie13:11
opendevreviewJulia Kreger proposed openstack/ironic stable/2023.1: DB: Streamline allocation interactions  https://review.opendev.org/c/openstack/ironic/+/89503813:17
opendevreviewJulia Kreger proposed openstack/ironic stable/2023.1: DB: Select upon delete for allocations  https://review.opendev.org/c/openstack/ironic/+/89503913:21
TheJulia^ are a result of https://github.com/openstack-k8s-operators/ironic-operator/pull/333#issuecomment-172110370513:35
dtantsurmoar database issues \o/14:04
dtantsuriurygregory, TheJulia, could we try merging https://review.opendev.org/c/openstack/ironic/+/895269 please? We're in a pretty bad place downstream with the new packages failing to build and old packages affected by the sqlite issue14:08
iurygregorytotally forgot to add the +2, I checked when you mentioned in slack the patch14:09
dtantsuralso needs a 2nd +2 https://review.opendev.org/c/openstack/bifrost/+/89375414:14
TheJuliadtantsur: well, it was months ago, really15:14
TheJulia2-ish15:14
opendevreviewMerged openstack/ironic master: Remove most prints for unit tests  https://review.opendev.org/c/openstack/ironic/+/89526915:29
* TheJulia raises at a stable job which failed on ipmi commands and a master branch job which failed on ipmi commands15:55
opendevreviewMerged openstack/bifrost stable/yoga: Remove Fedora from the CI  https://review.opendev.org/c/openstack/bifrost/+/89375416:06
JayFdtantsur: I think they set some kind of env var in upstream CI to prevent the error you saw on print()16:38
JayFdtantsur: I think the removal of prints is a fine fix; just saying if you still have pain you might wanna go looking16:38
clarkbJayF: dtantsur: many projects capture stdout and stderr and logging into fixtures. Some projects got a step further and only save that data on failure which keeps things clean on success but when you fail you get all the debugging info you want16:40
JayFTheJulia: dtantsur: iurygregory: stevebaker[m]: Unless one of you squeals loudly otherwise, I'm going to start doing work to cut stable/2023.2 of Ironic projects first thing Monday. Are there any items that need to land before we do that?16:46
JayFRelease is due in 2 weeks and even though we don't cut RCs I'd prefer not cut it close16:46
iurygregoryJayF,in two weeks you mean Sep 28?16:47
iurygregoryor something like this16:47
JayFI mean I want to cut the releases literally Monday16:47
JayFWe have to do it no later than > Final RC deadline: September 28th, 2023 (R-1 week) AIUI16:47
iurygregoryyeah, I'm asking about the due date in 2 weeks =)16:47
iurygregoryyup! can we wait till 25? :D16:47
JayFYes but I view that as a deadline for us to have everything already done16:47
JayFand I'd strongly prefer not frantically cutting a release with a day or two before it's due16:48
iurygregoryit would be on the week of 28 which can work I would say16:48
JayFiurygregory: I'd strongly prefer *not* putting it off that long, lets talk about it this way: what changes were you hoping to get in that aren't ready? What can me, or others in the community do to help you get them ready sooner?16:49
iurygregorythe Redfish Firmware is almost there (it updates the firmware, but the clean step says it failed to power off)16:49
iurygregoryso I'm trying to figure out what is happening there16:49
iurygregorysee my last comment in https://review.opendev.org/c/openstack/ironic/+/885425 there 16:50
JayFiurygregory: makes sense, I'd like to have that in as well although we put in cycle highlights we didn't have redfish firmware in this release16:50
JayFiurygregory: I thought we talked in IRC yesterday (?) about creating a separate, longer timeout for the power action in the firmware upgrade case only?16:50
JayFiurygregory: was that not enough? It still didn't get the power flipped?16:51
iurygregoryJayF, maybe I missed ?16:51
JayFiurygregory: yeah b/c I thought you were talking about extending the existing power timeout16:51
TheJuliaI think it might have been the day before16:51
JayFdtantsur: said not in all cases iirc16:51
TheJuliaI don’t remember, sinus headache atm16:51
JayFand I thought that's where it was left16:51
iurygregoryso, in the management update_firmware clean step has a wait parameter that would handle this 16:51
iurygregorybut in my case it didn't work (and it uses the same logic .-.)16:51
JayFIf I were experiencing this set of symptoms, I'd probably be out of band of Ironic querying the redfish16:52
iurygregorywell, directly to redfish it works16:52
JayFbecause it sounds like you're dealing with hardware weirdness moreso than software weirdness, even if the difference is ... minimal in our world16:52
TheJuliaI bet it says it is done, and then reboots16:52
iurygregoryTheJulia, correct16:52
TheJuliaAnd then we think it is done, but it is not really really done16:52
iurygregorybecause the iLO needs to restart etc so it takes some time16:53
JayF++ We just need to make the code expect the behavior whatever your test machine is doing, and we can tweak from there as we test against other hardware. 16:53
JayFiurygregory: so is there no timeout long enough to make it work then? I'm confused why it's an issue if we know that waiting it eventually recovers16:53
JayFiurygregory: just a matter of Ironic not respecting the timeout?16:53
iurygregorythe second, (doesn't seem like ironic is accepting the fact I've passed the timeout and just ignores)16:53
TheJuliaBMC can drop for minutes afterwards, we should likely expect a power action to fail16:53
iurygregoryI'm going to test the update via management to see16:54
JayFyeah I suspect you'll need to loop with a try/catch16:54
JayFand just expect power status calls to fail for $long_time16:54
JayF> Sep 13 14:37:20 dl380gen10-u25.dev3 ironic[86542]: 2023-09-13 14:37:20.121 86542 ERROR ironic.conductor.utils oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 79.93 seconds # this in your error logs implies there's a ~60-75 second timeout somewhere that probably needs a zero added lol16:55
JayFah that's node_wait_for_power_state16:55
JayFwill that take a timeout?16:55
* JayF is looking at this now16:56
* iurygregory brb on a customer call16:59
JayFiurygregory: okay, I have a very specific suggestion16:59
iurygregorygo ahead, I will look in a few =)16:59
JayFhttps://github.com/openstack/ironic/blob/master/ironic/drivers/modules/deploy_utils.py#L1539 should take a timeout kwarg and pass it thru to manager_utils.node_power_action16:59
JayFyou should be able to make that long enough to work, then if it does, put that in config for like a [redfish]firmware_update_power_timeout or wherever it'd fit17:00
iurygregoryinteresting, I will tesk17:02
iurygregorytest*17:02
JayFhell that timeout might even be something we want overridable per node, but that can be a follow-on if we need it17:07
JayF"this terrible hardware needs 10x as long as my normal gear" sorta stuff17:07
TheJulia++17:12
opendevreviewMerged openstack/ironic stable/2023.1: DB: Streamline allocation interactions  https://review.opendev.org/c/openstack/ironic/+/89503817:29
iurygregoryJayF, I couldn't find the config you mentioned O.o17:39
JayFiurygregory: that's because you haven't created it yet17:39
iurygregoryLOL17:39
JayFiurygregory: I'm saying, plumb that kwarg thru, create a config to set arbitrarily high timeouts17:39
JayFthe default is whatever it takes to get your test to pass :)17:39
iurygregorythis will be the second thing I will try17:40
iurygregorylet me test the clean step update_firmware in management =D17:40
iurygregoryJayF, bad news ... seems like the update_firmware clean step also doesn't work LOL19:08
* iurygregory cries19:08
iurygregorybut the firmware was updated LOL19:08
iurygregoryhttps://paste.opendev.org/show/brqU8qBgdeDPuSpfGJfs/19:08
JayFYep, just gotta get Ironic recognizing it19:08
JayFI've been in that house of pain before too :) 19:09
iurygregoryoh jesus19:09
JayFthat is without the change I suggested, yes?19:09
iurygregorycorrect19:09
iurygregoryI was testing the clean step we have via management interface19:09
iurygregorynow let me try to work on what you mentioned19:09
JayFah, yeah, I think some of what you're dealing with is the hardware being funky19:09
iurygregoryand see how ironic will react19:09
JayFso I think you've got sorta two things mixed here:19:10
JayF1) it's extremely likely your update_firmware stuff works on some hardware that exists in the world19:10
JayF2) the machine *you are on* has annoying behavior around reboots taking longer after firmware updates19:10
iurygregoryyeah19:10
iurygregoryand ironic doesn't like 219:10
iurygregorylol19:10
JayFso when you solve #2 just try to do it in a way that fixes the clean step, too?19:10
iurygregoryyeah19:10
iurygregoryfix the error we get, because it triggers the right thing19:11
iurygregoryand the bmc does what we asked19:11
JayFYeah, I think that the suggestion from earlier is still the right route19:11
JayFyou just may have found another place you need to plumb that new config value thru to for making timeouts much longer19:11
iurygregorylet me read now to understand things and make sure I'm not doing crazy things :D19:11
JayFiurygregory: I can jump on a call if you want to walk thru it with words, I think I understand the moving parts19:12
iurygregoryoh  manager_utils.node_power_action already has a timeout arg :D19:12
JayFyeah you have to hook it in a layer up19:13
iurygregoryyeah!19:13
JayFin like reboot_for_node_clean19:13
iurygregorygotcha19:13
JayFor something like that19:13
JayFjust plumb it back as far as you need for the driver to be able to dictate the timeout19:13
TheJuliaJayF: thinking about it, it would be nice to get https://review.opendev.org/c/openstack/ironic-python-agent/+/890864 and it's child patch merged before releasing19:13
iurygregorytime to hack some code in the venv here19:13
iurygregory:D19:13
JayFand in the redfish case, it should likely be configurable19:13
JayF(and per-node configurable but that can be a follow-on if it's hard)19:13
iurygregorywell, since we have the config wait we could check if there is a value and pass I think19:14
TheJuliaIw ouldn't name it reboot_for_node_clean, maybe more "reboot_for_firmware_update"19:14
JayFTheJulia: there's already a method19:14
TheJuliasince we in theory should be able to do on deploy firmware 19:14
JayFTheJulia: I'm just saying he adds a flag to it19:14
iurygregoryand we wouldn't add new configs..19:14
TheJuliaahh, okay19:14
iurygregoryit's already configurable in the clean step19:14
iurygregorywdyt?19:14
JayFiurygregory: I trust you to know how to plumb that thru; just make sure that an operator could have a machine that takes $ridiculously_long and could still get the firmware update to work19:14
iurygregorygotcha :D19:15
JayFTheJulia: fwiw we could remove hardware manager versioning support now that we don't support manage_boot=false19:20
JayFTheJulia: that was working around a bug that only occurred when an operator changed the image w/o ironic being aware, which is ~impossible nowadays19:20
JayFTheJulia: +2 on both of those19:21
TheJuliaWell, you could do it in standalone easily, but yeah19:22
TheJuliait just feels like extra code we don't really need19:22
JayFTheJulia: oooh, yeah, because we just point to an http url, there's no guarantee the payload doesn't change19:23
TheJuliayup19:23
JayFTheJulia: so yeah, we need it, thank you for the correction19:23
iurygregorygood news it worked19:51
iurygregorybut the database wasn't updated XD19:51
iurygregorynow I have a good point where to look for things I think19:51
iurygregorytks JayF 19:51
iurygregorytime to get more coffee and go to another meeting .-.19:52
JayFwoohoo19:56
JayFI always say that changing the error is a victory, now you start a new problem :D19:56
TheJuliaInteresting: https://bugs.launchpad.net/ironic/+bug/203495320:23
JayFTheJulia: I'll let you respond there; I suspect you have more of an idea of hte moving parts? If you don't wanna I will though :)20:26
TheJuliaAlready responded20:26
TheJuliaAnd put it on the etherpad. Realistically, it should be an easy win20:27
JayFare you sure? seems like a lot of work, adding two fields to get access to an entire new ecosystem of switch integrations /s20:32
JayF:D 20:32
JayFTheJulia: we are both on board, tagging that rfe-approved and I'll mention it in the next meeting20:33
iurygregoryok, I'm finally done with the meetings today20:41
iurygregorytime to code to forget about the problems they mentioned in the meeting :D 20:42
JayFiurygregory: maybe before you get started on that, take a look at Julia's IPA service steps?20:43
JayFhttps://review.opendev.org/c/openstack/ironic-python-agent/+/89086420:43
iurygregorylet me grab coffee first20:43
JayFit's the other thing we've ID'd we want to get in for release, so getting it in the gate today would be nice20:43
JayFI'm going to be stepping away in about 15 minutes or so, but will be around a little over the weekend so if we get it in the gate I can kick it thru20:44
iurygregoryack20:44
opendevreviewMerged openstack/ironic stable/2023.1: DB: Select upon delete for allocations  https://review.opendev.org/c/openstack/ironic/+/89503920:45
TheJuliaso w/r/t that bug about not being able to clean using vmedia20:58
TheJuliaafter deploying with a config drive20:58
TheJulia(dhcpless mode)20:58
TheJulia... what if we re-trigger config ourselves20:58
TheJuliawe can detect the condition, and backport that20:58
JayFI don't understand "what is we re-trigger config ourselves"20:58
TheJuliathe thing would be "hey, $network_thingiee, please start interfaces again"20:59
JayFand I'll note that anything that involves evaluating contents of a tenant's unwiped disk is a major security risk20:59
JayFso how would we get to your spot without pathing through that?20:59
TheJuliano, we can explicitly skip it20:59
TheJuliawe can detect if it is what we came in with boot wise20:59
TheJuliajust by path checking20:59
TheJuliaerrrr20:59
TheJuliano20:59
TheJuliashoot20:59
JayFYeah, and there's another point to consider here too20:59
TheJuliawell, we know the device, we can still do it20:59
JayFany path that is "the first step of cleaning is 'execute random code from configdrive'" is SCARY AF21:00
TheJuliawe just have to trigger python-glean with the right args and then re-trigger networking21:00
JayF(and yes I know that's the current condition)21:00
JayFI wonder if we could do something like ... if a config opt is set, IPA boots, if it finds a configdrive it automatically removes it and reboots, or something like that21:00
TheJuliaI'm fairly sure we have the knowledge of the disk we actually booted from21:00
JayFlike an implicit clean step of "wipe out configdrive and forbid glean/cloud-init from running"21:00
JayFif a flag is set21:01
JayFthat way we don't just recover from the bad behavior; we avoid it altogether21:01
TheJuliai guess, yeah, we could detect the condition, determine what is the local disk config drive, nuke it, and reboot21:02
TheJuliaservicing could be PAIN21:02
JayFWell, detecting the condition is hard because we intentionally support network-data21:02
JayFservicing can't be supported in this case21:03
JayFnot in a secure manner21:03
TheJuliaI'm still fairly sure we can determine what our intended is21:03
TheJuliawe've got guarding logic on even considering some other aspects when vmedia booting21:03
JayFyou sorta understand the defensive approach I think we should take though, right? like for me it's *alarms go off* at the possibility of a configdrive from a tenant being evaluated21:03
TheJuliayeah, and I think we can bypass that entirely because we should know which one we should use outright21:04
TheJuliajust glean doesn't21:04
TheJuliabut we can re-trigger glean21:04
JayFI think we should do that *unconditionally*21:04
JayFglean/cloud-init, in an IPA ramdisk, never runs against anything but configdrives it's explicitly told are coming21:04
JayFe.g. ipa_configdrive_uuid=XXXX in the kernel cli21:04
TheJuliamight actually make the experience better then, and they could drop simple-init, just have glean present21:04
JayFexactly21:05
JayFand we cut off, at the pass, an entire class of tenant-escape-to-prov-network bugs21:05
TheJuliabahahahahahaha21:06
JayFTheJulia: that's either really good or really bad haha21:08
TheJuliawe already have detection logic in ipa, it is more so just telling glean "use the supplied iso content explicitly"21:08
JayFI'm curious to see what code comes out21:10
opendevreviewJulia Kreger proposed openstack/ironic-python-agent master: WIP/DNM: Get rid of simple-init (but keep glean)  https://review.opendev.org/c/openstack/ironic-python-agent/+/89551922:04
TheJuliaan idea22:05
TheJuliaThere really doesnt' seem to support all of the legacy interface/configuration logic to make static files at this point22:08
TheJuliaand then try to init them22:08
TheJuliajust restart the networking subsystem and hopefully magic()22:08
JayFThat doesn't address the actual security concern though AFAICT22:09
JayFbecause it doesn't ever prevent us running against the wrong configdrive in the first place .... right?22:10
JayF(or is that going to happen in ipa-b later?)22:10
TheJuliaDo you remember when I added the logic to detect if the config drive came from the system itself ?22:15
TheJuliathat would have to be in ipa-b22:15
TheJuliaat least, prevent it oturight22:15
TheJuliaoutright22:15
JayFokay I thought so but when you said IPA already had the detection logic22:15
JayFI didn't grok fully what you meant22:16
JayFnow I know: you just reused the "is it vmedia" method22:16
JayFand the other stuff is all outside of ipa22:16
TheJuliabasically, yeah22:16
TheJulia*should* work22:16
TheJuliaand doesn't seem horribly awful22:16
TheJuliaLike, I don't think adam savage is going to appear and say "That is horrible!"22:17
JayFit feels sorta like what we should've done from the start for this feature tbh22:17
TheJuliayeah, we resisted for some reason22:17
TheJuliaI don't remember anymore22:17
TheJuliahind sight is also always 20-2022:18
JayFvery much22:19
TheJuliaI bet we were focused on classical interface init22:22
TheJuliaand I bet tinycore doesn't have systemd :)22:22
TheJuliaour vmedia job uses tinycore, but does dhcp by default22:25
JayFI'm sure we can work out the kinks; this will be an improvement. Not sure it's upstream-backportable though because of the coordination that'll be required in ramdisk and ipa22:27
JayF(a new IPA on a not-updated ramdisk build is likely to break; and I don't think we should assume people will bump both)22:28
JayFmight even wanna ensure that our downstream packagers know what's up for distros like debian that will build their own ramdisk22:28
TheJuliaas is, that shouldn't break existing ramdisks or users of it, and prior builders should already have glean22:29
opendevreviewMerged openstack/ironic-python-agent master: Add get_service_steps logic to the agent  https://review.opendev.org/c/openstack/ironic-python-agent/+/89086422:29
TheJuliaand if we check if it is there and only then invoke, that seems sane22:30
TheJuliathe element change for simple-init, or for the base default, that... is the conundrum I guess22:30
JayFexactly22:30
JayFIt's one of those things where I'm not sure the combo of: 1) Cases that this can happen in and 2) Risks from backporting it  really is worth it? but I could be wrong22:31
TheJuliadib itself, actually, which is where afaik simple-init comes from, is also not stable branched, if you install it, you get the latest22:43
TheJuliaso inadvertently it might just break upon any change22:43
JayFSo you'd hide the behavior behind and environment variable, defaulted to the old way, and have ipa-b set that variable22:44
TheJuliato be determiend22:50
clarkbskimming what you've said I feel like you want something other than simple-init22:50
clarkblet simple-init continue to be simple and work for existing users then have a different thing that installs glean in a "trigger glean manually" sort of way alongside simple init22:51
clarkbthen ipa or whatever can trigger the appropriate scripts when logic dictates rather than udev22:51
TheJuliathat could also work, as long as we disable it on ramdisks from auto-starting, we should be fine, and even then the basic path should also be able to recover things from the unhappy path22:54
JayF👍 the key is just to make sure that nothing is going to launch and do something with the config drive until IPA explicitly tells it to.23:14

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!